2023-12-04 10:21:22

by Ryan Roberts

Subject: [PATCH v8 00/10] Multi-size THP for anonymous memory

Hi All,

A new week, a new version, a new name... This is v8 of a series to implement
multi-size THP (mTHP) for anonymous memory (previously called "small-sized THP"
and "large anonymous folios"). Matthew objected to "small huge" so hopefully
this fares better.

The objective of this is to improve performance by allocating larger chunks of
memory during anonymous page faults:

1) Since SW (the kernel) is dealing with larger chunks of memory than base
pages, there are efficiency savings to be had: fewer page faults, batched PTE
and RMAP manipulation, shorter lru lists, etc. In short, we reduce kernel
overhead. This should benefit all architectures.
2) Since we are now mapping physically contiguous chunks of memory, we can take
advantage of HW TLB compression techniques. A reduction in TLB pressure
speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
TLB entries: "the contiguous bit" (architectural) and HPA (uarch).

This version changes the name and tidies up some of the kernel code and test
code, based on feedback against v7 (see change log for details).

By default, the existing behaviour (and performance) is maintained. The user
must explicitly enable multi-size THP to see the performance benefit. This is
done via a new sysfs interface (as recommended by David Hildenbrand - thanks to
David for the suggestion)! This interface is inspired by the existing
per-hugepage-size sysfs interface used by hugetlb, is fully backwards
compatible with the existing PMD-size THP interface, and provides a base for
future extensibility. See [8] for detailed discussion of the interface.
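
As a rough illustration (a minimal userspace sketch, not part of the series;
the 64K size below is just an example, the available sizes vary with the base
page size and PMD size of the system, and writing the file needs root),
enabling one mTHP size from a program could look like:

	#include <stdio.h>

	int main(void)
	{
		/* Example size only; list the hugepages-*kB directories on your system. */
		const char *path =
			"/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled";
		FILE *f = fopen(path, "w");

		if (!f) {
			perror(path);
			return 1;
		}
		/* One of: always, inherit, madvise, never. */
		fputs("madvise\n", f);
		return fclose(f) ? 1 : 0;
	}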

This series is based on mm-unstable (715b67adf4c8).


Prerequisites
=============

Some work items identified as being prerequisites are listed on page 3 at [9].
The summary is:

| item | status |
|:------------------------------|:------------------------|
| mlock | In mainline (v6.7) |
| madvise | In mainline (v6.6) |
| compaction | v1 posted [10] |
| numa balancing | Investigated: see below |
| user-triggered page migration | In mainline (v6.7) |
| khugepaged collapse | In mainline (NOP) |

On NUMA balancing, which currently ignores any PTE-mapped THPs it encounters,
John Hubbard has investigated and concluded that A) it is not clear at the
moment what a better policy might be for PTE-mapped THP, and B) it is
questionable whether this should really be considered a prerequisite, given
that no regression is caused for the default "multi-size THP disabled" case
and there is no correctness issue when it is enabled - it's just a potential
for non-optimal performance.

If there are no disagreements about removing numa balancing from the list (none
were raised when I first posted this comment against v7), then that just leaves
compaction, which is in review on the list at the moment.

I really would like to get this series (and its remaining compaction
prerequisite) in for v6.8. I accept that it may be a bit optimistic at this
point, but let's see where we get to with review.


Testing
=======

The series includes patches for mm selftests to enlighten the cow and khugepaged
tests to explicitly test with multi-size THP, in the same way that PMD-sized
THP is tested. The new tests all pass, and no regressions are observed in the mm
selftest suite. I've also run my usual kernel compilation and JavaScript
benchmarks without any issues.

Refer to my performance numbers posted with v6 [6]. (These are for multi-size
THP only - they do not include the arm64 contpte follow-on series).

John Hubbard at Nvidia has indicated dramatic 10x performance improvements for
some workloads at [11]. (Observed using v6 of this series as well as the arm64
contpte series).

Kefeng Wang at Huawei has also reported improvements at [12], although there
are some latency regressions as well.


Changes since v7 [7]
====================

- Renamed "small-sized THP" -> "multi-size THP" in commit logs
- Added various Reviewed-by/Tested-by tags (Barry, David, Alistair)
- Patch 3:
- Fine-tuned transhuge documentation for multi-size THP (JohnH)
- Converted hugepage_global_enabled() and hugepage_global_always() macros
to static inline functions (JohnH)
- Renamed hugepage_vma_check() to thp_vma_allowable_orders() (JohnH)
- Renamed transhuge_vma_suitable() to thp_vma_suitable_orders() (JohnH)
- Renamed "global" enabled sysfs file option to "inherit" (JohnH)
- Patch 9:
- cow selftest: Renamed param size -> thpsize (David)
- cow selftest: Changed test fail to assert() (David)
- cow selftest: Log PMD size separately from all the supported THP sizes
(David)
- Patch 10:
- cow selftest: No longer special case pmdsize; keep all THP sizes in
thpsizes[]


Changes since v6 [6]
====================

- Refactored vmf_pte_range_changed() to remove uffd special-case (suggested by
JohnH)
- Dropped accounting patch (#3 in v6) (suggested by DavidH)
- Continue to account *PMD-sized* THP only for now
- Can add more counters in future if needed
- Page cache large folios haven't needed any new counters yet
- Pivot to sysfs ABI proposed by DavidH
- per-size directories in a similar shape to that used by hugetlb
- Dropped "recommend" keyword patch (#6 in v6) (suggested by DavidH, Yu Zhao)
- For now, users need to understand implicitly which sizes are beneficial
to their HW/SW
- Dropped arch_wants_pte_order() patch (#7 in v6)
- No longer needed due to dropping the "recommend" keyword patch
- Enlightened khugepaged mm selftest to explicitly test with small-size THP
- Scrubbed commit logs to use "small-sized THP" consistently (suggested by
DavidH)


Changes since v5 [5]
====================

- Added accounting for PTE-mapped THPs (patch 3)
- Added runtime control mechanism via sysfs as extension to THP (patch 4)
- Minor refactoring of alloc_anon_folio() to integrate with runtime controls
- Stripped out hardcoded policy for allocation order; it's now all user space
controlled (although user space can request "recommend" which will configure
the HW-preferred order)


Changes since v4 [4]
====================

- Removed "arm64: mm: Override arch_wants_pte_order()" patch; arm64
now uses the default order-3 size. I have moved this patch over to
the contpte series.
- Added "mm: Allow deferred splitting of arbitrary large anon folios" back
into series. I originally removed this at v2 to add to a separate series,
but that series has transformed significantly and it no longer fits, so
bringing it back here.
- Reintroduced dependency on set_ptes(); originally dropped this at v2, but
set_ptes() is in mm-unstable now.
- Updated policy for when to allocate LAF; only fallback to order-0 if
MADV_NOHUGEPAGE is present or if THP disabled via prctl; no longer rely on
sysfs's never/madvise/always knob.
- Fallback to order-0 whenever uffd is armed for the vma, not just when
uffd-wp is set on the pte.
- alloc_anon_folio() now returns `struct folio *`, where errors are encoded
with ERR_PTR().

The last 3 changes were proposed by Yu Zhao - thanks!


Changes since v3 [3]
====================

- Renamed feature from FLEXIBLE_THP to LARGE_ANON_FOLIO.
- Removed `flexthp_unhinted_max` boot parameter. Discussion concluded that a
sysctl is preferable but we will wait until a real workload needs it.
- Fixed uninitialized `addr` on read fault path in do_anonymous_page().
- Added mm selftests for large anon folios in cow test suite.


Changes since v2 [2]
====================

- Dropped commit "Allow deferred splitting of arbitrary large anon folios"
- Huang, Ying suggested the "batch zap" work (which I dropped from this
series after v1) is a prerequisite for merging FLEXIBLE_THP, so I've
moved the deferred split patch to a separate series along with the batch
zap changes. I plan to submit this series early next week.
- Changed folio order fallback policy
- We no longer iterate from preferred to 0 looking for acceptable policy
- Instead we iterate through preferred, PAGE_ALLOC_COSTLY_ORDER and 0 only
- Removed vma parameter from arch_wants_pte_order()
- Added command line parameter `flexthp_unhinted_max`
- clamps preferred order when vma hasn't explicitly opted-in to THP
- Never allocate large folio for MADV_NOHUGEPAGE vma (or when THP is disabled
for process or system).
- Simplified implementation and integration with do_anonymous_page()
- Removed dependency on set_ptes()


Changes since v1 [1]
====================

- removed changes to arch-dependent vma_alloc_zeroed_movable_folio()
- replaced with arch-independent alloc_anon_folio()
- follows THP allocation approach
- no longer retry with intermediate orders if allocation fails
- fallback directly to order-0
- remove folio_add_new_anon_rmap_range() patch
- instead add its new functionality to folio_add_new_anon_rmap()
- remove batch-zap pte mappings optimization patch
- remove enabler folio_remove_rmap_range() patch too
- These offer real perf improvement so will submit separately
- simplify Kconfig
- single FLEXIBLE_THP option, which is independent of arch
- depends on TRANSPARENT_HUGEPAGE
- when enabled default to max anon folio size of 64K unless arch
explicitly overrides
- simplify changes to do_anonymous_page():
- no more retry loop


[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://lore.kernel.org/linux-mm/[email protected]/
[4] https://lore.kernel.org/linux-mm/[email protected]/
[5] https://lore.kernel.org/linux-mm/[email protected]/
[6] https://lore.kernel.org/linux-mm/[email protected]/
[7] https://lore.kernel.org/linux-mm/[email protected]/
[8] https://lore.kernel.org/linux-mm/[email protected]/
[9] https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA
[10] https://lore.kernel.org/linux-mm/[email protected]/
[11] https://lore.kernel.org/linux-mm/[email protected]/
[12] https://lore.kernel.org/linux-mm/[email protected]/


Thanks,
Ryan

Ryan Roberts (10):
mm: Allow deferred splitting of arbitrary anon large folios
mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
mm: thp: Introduce multi-size THP sysfs interface
mm: thp: Support allocation of anonymous multi-size THP
selftests/mm/khugepaged: Restore thp settings at exit
selftests/mm: Factor out thp settings management
selftests/mm: Support multi-size THP interface in thp_settings
selftests/mm/khugepaged: Enlighten for multi-size THP
selftests/mm/cow: Generalize do_run_with_thp() helper
selftests/mm/cow: Add tests for anonymous multi-size THP

Documentation/admin-guide/mm/transhuge.rst | 97 ++++-
Documentation/filesystems/proc.rst | 6 +-
fs/proc/task_mmu.c | 3 +-
include/linux/huge_mm.h | 116 ++++--
mm/huge_memory.c | 268 ++++++++++++--
mm/khugepaged.c | 20 +-
mm/memory.c | 114 +++++-
mm/page_vma_mapped.c | 3 +-
mm/rmap.c | 32 +-
tools/testing/selftests/mm/Makefile | 4 +-
tools/testing/selftests/mm/cow.c | 185 +++++++---
tools/testing/selftests/mm/khugepaged.c | 410 ++++-----------------
tools/testing/selftests/mm/run_vmtests.sh | 2 +
tools/testing/selftests/mm/thp_settings.c | 349 ++++++++++++++++++
tools/testing/selftests/mm/thp_settings.h | 80 ++++
15 files changed, 1177 insertions(+), 512 deletions(-)
create mode 100644 tools/testing/selftests/mm/thp_settings.c
create mode 100644 tools/testing/selftests/mm/thp_settings.h

--
2.25.1


2023-12-04 10:21:38

by Ryan Roberts

Subject: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs interface

In preparation for adding support for anonymous multi-size THP,
introduce a new sysfs structure that will be used to control the new
behaviours. A new directory is added under transparent_hugepage for each
supported THP size, and contains an `enabled` file, which can be set to
"inherit" (to inherit the global setting), "always", "madvise" or
"never". For now, the kernel still only supports PMD-sized anonymous
THP, so only 1 directory is populated.

The first half of the change converts transhuge_vma_suitable() and
hugepage_vma_check() so that they take a bitfield of orders for which
the user wants to determine support, and the functions filter out all
the orders that can't be supported, given the current sysfs
configuration and the VMA dimensions. If there is only 1 order set in
the input then the output can continue to be treated like a boolean;
this is the case for most call sites. The resulting functions are
renamed to thp_vma_suitable_orders() and thp_vma_allowable_orders()
respectively.
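
For example, the page fault path keeps treating the result as a boolean by
passing a single-bit bitfield, as in the mm/memory.c hunk later in this
patch:

	/* One requested order, so a non-zero result means "allowed". */
	if (pmd_none(*vmf.pmd) &&
	    thp_vma_allowable_orders(vma, vm_flags, false, true, true,
				     BIT(PMD_ORDER))) {
		ret = create_huge_pmd(&vmf);
		/* ... */
	}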

The second half of the change implements the new sysfs interface. It has
been done so that each supported THP size has a `struct thpsize`, which
describes the relevant metadata and is itself a kobject. This is pretty
minimal for now, but should make it easy to add new per-thpsize files to
the interface if needed in future (e.g. per-size defrag). Rather than
keep the `enabled` state directly in the struct thpsize, I've elected to
directly encode it into huge_anon_orders_[always|madvise|inherit]
bitfields since this reduces the amount of work required in
thp_vma_allowable_orders() which is called for every page fault.
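
For an anonymous VMA, the sysfs-enforcement part of that work then reduces to
a small mask computation over those bitfields. The sketch below is a
simplified restatement of the logic added in this patch (the helper name is
illustrative; the real code open-codes this inside thp_vma_allowable_orders()):

	/* Which anon orders does sysfs currently allow for these vm_flags? */
	static unsigned long anon_enabled_orders(unsigned long vm_flags)
	{
		unsigned long mask = READ_ONCE(huge_anon_orders_always);

		if (vm_flags & VM_HUGEPAGE)
			mask |= READ_ONCE(huge_anon_orders_madvise);
		if (hugepage_global_always() ||
		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
			mask |= READ_ONCE(huge_anon_orders_inherit);

		return mask;
	}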

See Documentation/admin-guide/mm/transhuge.rst, as modified by this
commit, for details of how the new sysfs interface works.

Signed-off-by: Ryan Roberts <[email protected]>
---
Documentation/admin-guide/mm/transhuge.rst | 97 ++++++--
Documentation/filesystems/proc.rst | 6 +-
fs/proc/task_mmu.c | 3 +-
include/linux/huge_mm.h | 114 ++++++---
mm/huge_memory.c | 268 +++++++++++++++++++--
mm/khugepaged.c | 20 +-
mm/memory.c | 8 +-
mm/page_vma_mapped.c | 3 +-
8 files changed, 423 insertions(+), 96 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index b0cc8243e093..04eb45a2f940 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -45,10 +45,25 @@ components:
the two is using hugepages just because of the fact the TLB miss is
going to run faster.

+Modern kernels support "multi-size THP" (mTHP), which introduces the
+ability to allocate memory in blocks that are bigger than a base page
+but smaller than traditional PMD-size (as described above), in
+increments of a power-of-2 number of pages. mTHP can back anonymous
+memory (for example 16K, 32K, 64K, etc). These THPs continue to be
+PTE-mapped, but in many cases can still provide similar benefits to
+those outlined above: Page faults are significantly reduced (by a
+factor of e.g. 4, 8, 16, etc), but latency spikes are much less
+prominent because the size of each page isn't as huge as the PMD-sized
+variant and there is less memory to clear in each page fault. Some
+architectures also employ TLB compression mechanisms to squeeze more
+entries in when a set of PTEs are virtually and physically contiguous
+and appropriately aligned. In this case, TLB misses will occur less
+often.
+
THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside task's address space. Unless THP is completely
disabled, there is ``khugepaged`` daemon that scans memory and
-collapses sequences of basic pages into huge pages.
+collapses sequences of basic pages into PMD-sized huge pages.

The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
interface and using madvise(2) and prctl(2) system calls.
@@ -95,12 +110,40 @@ Global THP controls
Transparent Hugepage Support for anonymous memory can be entirely disabled
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
regions (to avoid the risk of consuming more memory resources) or enabled
-system wide. This can be achieved with one of::
+system wide. This can be achieved per-supported-THP-size with one of::
+
+ echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
+ echo madvise >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
+ echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
+
+where <size> is the hugepage size being addressed, the available sizes
+for which vary by system.
+
+For example::
+
+ echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
+
+Alternatively it is possible to specify that a given hugepage size
+will inherit the top-level "enabled" value::
+
+ echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
+
+For example::
+
+ echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
+
+The top-level setting (for use with "inherit") can be set by issuing
+one of the following commands::

echo always >/sys/kernel/mm/transparent_hugepage/enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
echo never >/sys/kernel/mm/transparent_hugepage/enabled

+By default, PMD-sized hugepages have enabled="inherit" and all other
+hugepage sizes have enabled="never". If enabling multiple hugepage
+sizes, the kernel will select the most appropriate enabled size for a
+given allocation.
+
It's also possible to limit defrag efforts in the VM to generate
anonymous hugepages in case they're not immediately free to madvise
regions or to never try to defrag memory and simply fallback to regular
@@ -146,25 +189,34 @@ madvise
never
should be self-explanatory.

-By default kernel tries to use huge zero page on read page fault to
-anonymous mapping. It's possible to disable huge zero page by writing 0
-or enable it back by writing 1::
+By default kernel tries to use huge, PMD-mappable zero page on read
+page fault to anonymous mapping. It's possible to disable huge zero
+page by writing 0 or enable it back by writing 1::

echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page

-Some userspace (such as a test program, or an optimized memory allocation
-library) may want to know the size (in bytes) of a transparent hugepage::
+Some userspace (such as a test program, or an optimized memory
+allocation library) may want to know the size (in bytes) of a
+PMD-mappable transparent hugepage::

cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size

-khugepaged will be automatically started when
-transparent_hugepage/enabled is set to "always" or "madvise, and it'll
-be automatically shutdown if it's set to "never".
+khugepaged will be automatically started when one or more hugepage
+sizes are enabled (either by directly setting "always" or "madvise",
+or by setting "inherit" while the top-level enabled is set to "always"
+or "madvise"), and it'll be automatically shutdown when the last
+hugepage size is disabled (either by directly setting "never", or by
+setting "inherit" while the top-level enabled is set to "never").

Khugepaged controls
-------------------

+.. note::
+ khugepaged currently only searches for opportunities to collapse to
+ PMD-sized THP and no attempt is made to collapse to other THP
+ sizes.
+
khugepaged runs usually at low frequency so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged. However it's
@@ -282,19 +334,26 @@ force
Need of application restart
===========================

-The transparent_hugepage/enabled values and tmpfs mount option only affect
-future behavior. So to make them effective you need to restart any
-application that could have been using hugepages. This also applies to the
-regions registered in khugepaged.
+The transparent_hugepage/enabled and
+transparent_hugepage/hugepages-<size>kB/enabled values and tmpfs mount
+option only affect future behavior. So to make them effective you need
+to restart any application that could have been using hugepages. This
+also applies to the regions registered in khugepaged.

Monitoring usage
================

-The number of anonymous transparent huge pages currently used by the
+.. note::
+ Currently the below counters only record events relating to
+ PMD-sized THP. Events relating to other THP sizes are not included.
+
+The number of PMD-sized anonymous transparent huge pages currently used by the
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
-To identify what applications are using anonymous transparent huge pages,
-it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields
-for each mapping.
+To identify what applications are using PMD-sized anonymous transparent huge
+pages, it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages
+fields for each mapping. (Note that AnonHugePages only applies to traditional
+PMD-sized THP for historical reasons and should have been called
+AnonHugePmdMapped).

The number of file transparent huge pages mapped to userspace is available
by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
@@ -413,7 +472,7 @@ for huge pages.
Optimizing the applications
===========================

-To be guaranteed that the kernel will map a 2M page immediately in any
+To be guaranteed that the kernel will map a THP immediately in any
memory region, the mmap region has to be hugepage naturally
aligned. posix_memalign() can provide that guarantee.

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 49ef12df631b..104c6d047d9b 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -528,9 +528,9 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
does not take into account swapped out page of underlying shmem objects.
"Locked" indicates whether the mapping is locked in memory or not.

-"THPeligible" indicates whether the mapping is eligible for allocating THP
-pages as well as the THP is PMD mappable or not - 1 if true, 0 otherwise.
-It just shows the current status.
+"THPeligible" indicates whether the mapping is eligible for allocating
+naturally aligned THP pages of any currently enabled size. 1 if true, 0
+otherwise.

"VmFlags" field deserves a separate description. This member represents the
kernel flags associated with the particular virtual memory area in two letter
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index d19924bf0a39..79855e1c5b57 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -865,7 +865,8 @@ static int show_smap(struct seq_file *m, void *v)
__show_smap(m, &mss, false);

seq_printf(m, "THPeligible: %8u\n",
- hugepage_vma_check(vma, vma->vm_flags, true, false, true));
+ !!thp_vma_allowable_orders(vma, vma->vm_flags, true, false,
+ true, THP_ORDERS_ALL));

if (arch_pkeys_enabled())
seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma));
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index fa0350b0812a..bd0eadd3befb 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -67,6 +67,21 @@ extern struct kobj_attribute shmem_enabled_attr;
#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)

+/*
+ * Mask of all large folio orders supported for anonymous THP.
+ */
+#define THP_ORDERS_ALL_ANON BIT(PMD_ORDER)
+
+/*
+ * Mask of all large folio orders supported for file THP.
+ */
+#define THP_ORDERS_ALL_FILE (BIT(PMD_ORDER) | BIT(PUD_ORDER))
+
+/*
+ * Mask of all large folio orders supported for THP.
+ */
+#define THP_ORDERS_ALL (THP_ORDERS_ALL_ANON | THP_ORDERS_ALL_FILE)
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define HPAGE_PMD_SHIFT PMD_SHIFT
#define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
@@ -78,42 +93,64 @@ extern struct kobj_attribute shmem_enabled_attr;

extern unsigned long transparent_hugepage_flags;

-#define hugepage_flags_enabled() \
- (transparent_hugepage_flags & \
- ((1<<TRANSPARENT_HUGEPAGE_FLAG) | \
- (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)))
-#define hugepage_flags_always() \
- (transparent_hugepage_flags & \
- (1<<TRANSPARENT_HUGEPAGE_FLAG))
+bool hugepage_flags_enabled(void);
+
+static inline int first_order(unsigned long orders)
+{
+ return fls_long(orders) - 1;
+}
+
+static inline int next_order(unsigned long *orders, int prev)
+{
+ *orders &= ~BIT(prev);
+ return first_order(*orders);
+}

/*
- * Do the below checks:
+ * Filter the bitfield of input orders by doing the below checks:
* - For file vma, check if the linear page offset of vma is
- * HPAGE_PMD_NR aligned within the file. The hugepage is
+ * hpage_size-aligned within the file. The hugepage is
* guaranteed to be hugepage-aligned within the file, but we must
- * check that the PMD-aligned addresses in the VMA map to
- * PMD-aligned offsets within the file, else the hugepage will
- * not be PMD-mappable.
- * - For all vmas, check if the haddr is in an aligned HPAGE_PMD_SIZE
+ * check that the hpage_size-aligned addresses in the VMA map to
+ * hpage_size-aligned offsets within the file, else the hugepage will
+ * not be mappable.
+ * - For all vmas, check if the haddr is in an aligned hpage_size
* area.
+ * All orders that pass the checks are returned as a bitfield.
*/
-static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
- unsigned long addr)
-{
- unsigned long haddr;
-
- /* Don't have to check pgoff for anonymous vma */
- if (!vma_is_anonymous(vma)) {
- if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
- HPAGE_PMD_NR))
- return false;
+static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
+ unsigned long addr, unsigned long orders)
+{
+ int order;
+
+ /*
+ * Iterate over orders, highest to lowest, removing orders that don't
+ * meet alignment requirements from the set. Exit loop at first order
+ * that meets requirements, since all lower orders must also meet
+ * requirements.
+ */
+
+ order = first_order(orders);
+
+ while (orders) {
+ unsigned long hpage_size = PAGE_SIZE << order;
+ unsigned long haddr = ALIGN_DOWN(addr, hpage_size);
+
+ if (haddr >= vma->vm_start &&
+ haddr + hpage_size <= vma->vm_end) {
+ if (!vma_is_anonymous(vma)) {
+ if (IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) -
+ vma->vm_pgoff,
+ hpage_size >> PAGE_SHIFT))
+ break;
+ } else
+ break;
+ }
+
+ order = next_order(&orders, order);
}

- haddr = addr & HPAGE_PMD_MASK;
-
- if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
- return false;
- return true;
+ return orders;
}

static inline bool file_thp_enabled(struct vm_area_struct *vma)
@@ -130,8 +167,10 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma)
!inode_is_open_for_write(inode) && S_ISREG(inode->i_mode);
}

-bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
- bool smaps, bool in_pf, bool enforce_sysfs);
+unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
+ unsigned long vm_flags, bool smaps,
+ bool in_pf, bool enforce_sysfs,
+ unsigned long orders);

#define transparent_hugepage_use_zero_page() \
(transparent_hugepage_flags & \
@@ -267,17 +306,18 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)
return false;
}

-static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
- unsigned long addr)
+static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
+ unsigned long addr, unsigned long orders)
{
- return false;
+ return 0;
}

-static inline bool hugepage_vma_check(struct vm_area_struct *vma,
- unsigned long vm_flags, bool smaps,
- bool in_pf, bool enforce_sysfs)
+static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
+ unsigned long vm_flags, bool smaps,
+ bool in_pf, bool enforce_sysfs,
+ unsigned long orders)
{
- return false;
+ return 0;
}

static inline void folio_prep_large_rmappable(struct folio *folio) {}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8a65e2cb6126..da70ae13be5e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -74,12 +74,65 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
static atomic_t huge_zero_refcount;
struct page *huge_zero_page __read_mostly;
unsigned long huge_zero_pfn __read_mostly = ~0UL;
+static unsigned long huge_anon_orders_always __read_mostly;
+static unsigned long huge_anon_orders_madvise __read_mostly;
+static unsigned long huge_anon_orders_inherit __read_mostly;

-bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
- bool smaps, bool in_pf, bool enforce_sysfs)
+static inline bool hugepage_global_enabled(void)
{
+ return transparent_hugepage_flags &
+ ((1<<TRANSPARENT_HUGEPAGE_FLAG) |
+ (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG));
+}
+
+static inline bool hugepage_global_always(void)
+{
+ return transparent_hugepage_flags &
+ (1<<TRANSPARENT_HUGEPAGE_FLAG);
+}
+
+bool hugepage_flags_enabled(void)
+{
+ /*
+ * We cover both the anon and the file-backed case here; we must return
+ * true if globally enabled, even when all anon sizes are set to never.
+ * So we don't need to look at huge_anon_orders_inherit.
+ */
+ return hugepage_global_enabled() ||
+ huge_anon_orders_always ||
+ huge_anon_orders_madvise;
+}
+
+/**
+ * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
+ * @vma: the vm area to check
+ * @vm_flags: use these vm_flags instead of vma->vm_flags
+ * @smaps: whether answer will be used for smaps file
+ * @in_pf: whether answer will be used by page fault handler
+ * @enforce_sysfs: whether sysfs config should be taken into account
+ * @orders: bitfield of all orders to consider
+ *
+ * Calculates the intersection of the requested hugepage orders and the allowed
+ * hugepage orders for the provided vma. Permitted orders are encoded as a set
+ * bit at the corresponding bit position (bit-2 corresponds to order-2, bit-3
+ * corresponds to order-3, etc). Order-0 is never considered a hugepage order.
+ *
+ * Return: bitfield of orders allowed for hugepage in the vma. 0 if no hugepage
+ * orders are allowed.
+ */
+unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
+ unsigned long vm_flags, bool smaps,
+ bool in_pf, bool enforce_sysfs,
+ unsigned long orders)
+{
+ /* Check the intersection of requested and supported orders. */
+ orders &= vma_is_anonymous(vma) ?
+ THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;
+ if (!orders)
+ return 0;
+
if (!vma->vm_mm) /* vdso */
- return false;
+ return 0;

/*
* Explicitly disabled through madvise or prctl, or some
@@ -88,16 +141,16 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
* */
if ((vm_flags & VM_NOHUGEPAGE) ||
test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
- return false;
+ return 0;
/*
* If the hardware/firmware marked hugepage support disabled.
*/
if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
- return false;
+ return 0;

/* khugepaged doesn't collapse DAX vma, but page fault is fine. */
if (vma_is_dax(vma))
- return in_pf;
+ return in_pf ? orders : 0;

/*
* khugepaged special VMA and hugetlb VMA.
@@ -105,17 +158,29 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
* VM_MIXEDMAP set.
*/
if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
- return false;
+ return 0;

/*
- * Check alignment for file vma and size for both file and anon vma.
+ * Check alignment for file vma and size for both file and anon vma by
+ * filtering out the unsuitable orders.
*
* Skip the check for page fault. Huge fault does the check in fault
- * handlers. And this check is not suitable for huge PUD fault.
+ * handlers.
*/
- if (!in_pf &&
- !transhuge_vma_suitable(vma, (vma->vm_end - HPAGE_PMD_SIZE)))
- return false;
+ if (!in_pf) {
+ int order = first_order(orders);
+ unsigned long addr;
+
+ while (orders) {
+ addr = vma->vm_end - (PAGE_SIZE << order);
+ if (thp_vma_suitable_orders(vma, addr, BIT(order)))
+ break;
+ order = next_order(&orders, order);
+ }
+
+ if (!orders)
+ return 0;
+ }

/*
* Enabled via shmem mount options or sysfs settings.
@@ -124,13 +189,27 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
*/
if (!in_pf && shmem_file(vma->vm_file))
return shmem_is_huge(file_inode(vma->vm_file), vma->vm_pgoff,
- !enforce_sysfs, vma->vm_mm, vm_flags);
+ !enforce_sysfs, vma->vm_mm, vm_flags)
+ ? orders : 0;

/* Enforce sysfs THP requirements as necessary */
- if (enforce_sysfs &&
- (!hugepage_flags_enabled() || (!(vm_flags & VM_HUGEPAGE) &&
- !hugepage_flags_always())))
- return false;
+ if (enforce_sysfs) {
+ if (vma_is_anonymous(vma)) {
+ unsigned long mask = READ_ONCE(huge_anon_orders_always);
+
+ if (vm_flags & VM_HUGEPAGE)
+ mask |= READ_ONCE(huge_anon_orders_madvise);
+ if (hugepage_global_always() ||
+ ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
+ mask |= READ_ONCE(huge_anon_orders_inherit);
+
+ orders &= mask;
+ if (!orders)
+ return 0;
+ } else if (!hugepage_global_enabled() ||
+ (!(vm_flags & VM_HUGEPAGE) && !hugepage_global_always()))
+ return 0;
+ }

if (!vma_is_anonymous(vma)) {
/*
@@ -138,15 +217,15 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
* in fault path.
*/
if (((in_pf || smaps)) && vma->vm_ops->huge_fault)
- return true;
+ return orders;
/* Only regular file is valid in collapse path */
if (((!in_pf || smaps)) && file_thp_enabled(vma))
- return true;
- return false;
+ return orders;
+ return 0;
}

if (vma_is_temporary_stack(vma))
- return false;
+ return 0;

/*
* THPeligible bit of smaps should show 1 for proper VMAs even
@@ -156,9 +235,9 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
* the first page fault.
*/
if (!vma->anon_vma)
- return (smaps || in_pf);
+ return (smaps || in_pf) ? orders : 0;

- return true;
+ return orders;
}

static bool get_huge_zero_page(void)
@@ -412,9 +491,127 @@ static const struct attribute_group hugepage_attr_group = {
.attrs = hugepage_attr,
};

+static void hugepage_exit_sysfs(struct kobject *hugepage_kobj);
+static void thpsize_release(struct kobject *kobj);
+static LIST_HEAD(thpsize_list);
+
+struct thpsize {
+ struct kobject kobj;
+ struct list_head node;
+ int order;
+};
+
+#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
+
+static ssize_t thpsize_enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ int order = to_thpsize(kobj)->order;
+ const char *output;
+
+ if (test_bit(order, &huge_anon_orders_always))
+ output = "[always] inherit madvise never";
+ else if (test_bit(order, &huge_anon_orders_inherit))
+ output = "always [inherit] madvise never";
+ else if (test_bit(order, &huge_anon_orders_madvise))
+ output = "always inherit [madvise] never";
+ else
+ output = "always inherit madvise [never]";
+
+ return sysfs_emit(buf, "%s\n", output);
+}
+
+static ssize_t thpsize_enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int order = to_thpsize(kobj)->order;
+ ssize_t ret = count;
+
+ if (sysfs_streq(buf, "always")) {
+ set_bit(order, &huge_anon_orders_always);
+ clear_bit(order, &huge_anon_orders_inherit);
+ clear_bit(order, &huge_anon_orders_madvise);
+ } else if (sysfs_streq(buf, "inherit")) {
+ set_bit(order, &huge_anon_orders_inherit);
+ clear_bit(order, &huge_anon_orders_always);
+ clear_bit(order, &huge_anon_orders_madvise);
+ } else if (sysfs_streq(buf, "madvise")) {
+ set_bit(order, &huge_anon_orders_madvise);
+ clear_bit(order, &huge_anon_orders_always);
+ clear_bit(order, &huge_anon_orders_inherit);
+ } else if (sysfs_streq(buf, "never")) {
+ clear_bit(order, &huge_anon_orders_always);
+ clear_bit(order, &huge_anon_orders_inherit);
+ clear_bit(order, &huge_anon_orders_madvise);
+ } else
+ ret = -EINVAL;
+
+ return ret;
+}
+
+static struct kobj_attribute thpsize_enabled_attr =
+ __ATTR(enabled, 0644, thpsize_enabled_show, thpsize_enabled_store);
+
+static struct attribute *thpsize_attrs[] = {
+ &thpsize_enabled_attr.attr,
+ NULL,
+};
+
+static const struct attribute_group thpsize_attr_group = {
+ .attrs = thpsize_attrs,
+};
+
+static const struct kobj_type thpsize_ktype = {
+ .release = &thpsize_release,
+ .sysfs_ops = &kobj_sysfs_ops,
+};
+
+static struct thpsize *thpsize_create(int order, struct kobject *parent)
+{
+ unsigned long size = (PAGE_SIZE << order) / SZ_1K;
+ struct thpsize *thpsize;
+ int ret;
+
+ thpsize = kzalloc(sizeof(*thpsize), GFP_KERNEL);
+ if (!thpsize)
+ return ERR_PTR(-ENOMEM);
+
+ ret = kobject_init_and_add(&thpsize->kobj, &thpsize_ktype, parent,
+ "hugepages-%lukB", size);
+ if (ret) {
+ kfree(thpsize);
+ return ERR_PTR(ret);
+ }
+
+ ret = sysfs_create_group(&thpsize->kobj, &thpsize_attr_group);
+ if (ret) {
+ kobject_put(&thpsize->kobj);
+ return ERR_PTR(ret);
+ }
+
+ thpsize->order = order;
+ return thpsize;
+}
+
+static void thpsize_release(struct kobject *kobj)
+{
+ kfree(to_thpsize(kobj));
+}
+
static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
{
int err;
+ struct thpsize *thpsize;
+ unsigned long orders;
+ int order;
+
+ /*
+ * Default to setting PMD-sized THP to inherit the global setting and
+ * disable all other sizes. powerpc's PMD_ORDER isn't a compile-time
+ * constant so we have to do this here.
+ */
+ huge_anon_orders_inherit = BIT(PMD_ORDER);

*hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
if (unlikely(!*hugepage_kobj)) {
@@ -434,8 +631,24 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
goto remove_hp_group;
}

+ orders = THP_ORDERS_ALL_ANON;
+ order = first_order(orders);
+ while (orders) {
+ thpsize = thpsize_create(order, *hugepage_kobj);
+ if (IS_ERR(thpsize)) {
+ pr_err("failed to create thpsize for order %d\n", order);
+ err = PTR_ERR(thpsize);
+ goto remove_all;
+ }
+ list_add(&thpsize->node, &thpsize_list);
+ order = next_order(&orders, order);
+ }
+
return 0;

+remove_all:
+ hugepage_exit_sysfs(*hugepage_kobj);
+ return err;
remove_hp_group:
sysfs_remove_group(*hugepage_kobj, &hugepage_attr_group);
delete_obj:
@@ -445,6 +658,13 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)

static void __init hugepage_exit_sysfs(struct kobject *hugepage_kobj)
{
+ struct thpsize *thpsize, *tmp;
+
+ list_for_each_entry_safe(thpsize, tmp, &thpsize_list, node) {
+ list_del(&thpsize->node);
+ kobject_put(&thpsize->kobj);
+ }
+
sysfs_remove_group(hugepage_kobj, &khugepaged_attr_group);
sysfs_remove_group(hugepage_kobj, &hugepage_attr_group);
kobject_put(hugepage_kobj);
@@ -811,7 +1031,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
struct folio *folio;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;

- if (!transhuge_vma_suitable(vma, haddr))
+ if (!thp_vma_suitable_orders(vma, haddr, BIT(PMD_ORDER)))
return VM_FAULT_FALLBACK;
if (unlikely(anon_vma_prepare(vma)))
return VM_FAULT_OOM;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0da6937572cf..3aee6de526f8 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -446,7 +446,8 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
{
if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
hugepage_flags_enabled()) {
- if (hugepage_vma_check(vma, vm_flags, false, false, true))
+ if (thp_vma_allowable_orders(vma, vm_flags, false, false, true,
+ BIT(PMD_ORDER)))
__khugepaged_enter(vma->vm_mm);
}
}
@@ -922,16 +923,16 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
if (!vma)
return SCAN_VMA_NULL;

- if (!transhuge_vma_suitable(vma, address))
+ if (!thp_vma_suitable_orders(vma, address, BIT(PMD_ORDER)))
return SCAN_ADDRESS_RANGE;
- if (!hugepage_vma_check(vma, vma->vm_flags, false, false,
- cc->is_khugepaged))
+ if (!thp_vma_allowable_orders(vma, vma->vm_flags, false, false,
+ cc->is_khugepaged, BIT(PMD_ORDER)))
return SCAN_VMA_CHECK;
/*
* Anon VMA expected, the address may be unmapped then
* remapped to file after khugepaged reaquired the mmap_lock.
*
- * hugepage_vma_check may return true for qualified file
+ * thp_vma_allowable_orders may return true for qualified file
* vmas.
*/
if (expect_anon && (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap)))
@@ -1506,7 +1507,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
* and map it by a PMD, regardless of sysfs THP settings. As such, let's
* analogously elide sysfs THP settings here.
*/
- if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
+ if (!thp_vma_allowable_orders(vma, vma->vm_flags, false, false, false,
+ BIT(PMD_ORDER)))
return SCAN_VMA_CHECK;

/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
@@ -2371,7 +2373,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
progress++;
break;
}
- if (!hugepage_vma_check(vma, vma->vm_flags, false, false, true)) {
+ if (!thp_vma_allowable_orders(vma, vma->vm_flags, false, false,
+ true, BIT(PMD_ORDER))) {
skip:
progress++;
continue;
@@ -2708,7 +2711,8 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,

*prev = vma;

- if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
+ if (!thp_vma_allowable_orders(vma, vma->vm_flags, false, false, false,
+ BIT(PMD_ORDER)))
return -EINVAL;

cc = kmalloc(sizeof(*cc), GFP_KERNEL);
diff --git a/mm/memory.c b/mm/memory.c
index 99582b188ed2..3ceeb0f45bf5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4322,7 +4322,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
pmd_t entry;
vm_fault_t ret = VM_FAULT_FALLBACK;

- if (!transhuge_vma_suitable(vma, haddr))
+ if (!thp_vma_suitable_orders(vma, haddr, BIT(PMD_ORDER)))
return ret;

page = compound_head(page);
@@ -5116,7 +5116,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
return VM_FAULT_OOM;
retry_pud:
if (pud_none(*vmf.pud) &&
- hugepage_vma_check(vma, vm_flags, false, true, true)) {
+ thp_vma_allowable_orders(vma, vm_flags, false, true, true,
+ BIT(PUD_ORDER))) {
ret = create_huge_pud(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
@@ -5150,7 +5151,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
goto retry_pud;

if (pmd_none(*vmf.pmd) &&
- hugepage_vma_check(vma, vm_flags, false, true, true)) {
+ thp_vma_allowable_orders(vma, vm_flags, false, true, true,
+ BIT(PMD_ORDER))) {
ret = create_huge_pmd(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index e0b368e545ed..64da127cc267 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -268,7 +268,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
* cleared *pmd but not decremented compound_mapcount().
*/
if ((pvmw->flags & PVMW_SYNC) &&
- transhuge_vma_suitable(vma, pvmw->address) &&
+ thp_vma_suitable_orders(vma, pvmw->address,
+ BIT(PMD_ORDER)) &&
(pvmw->nr_pages >= HPAGE_PMD_NR)) {
spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);

--
2.25.1

2023-12-04 10:21:48

by Ryan Roberts

Subject: [PATCH v8 05/10] selftests/mm/khugepaged: Restore thp settings at exit

Previously, the saved thp settings would be restored upon a signal or at
the natural end of the test suite. But there are some tests that
directly call exit() upon failure. In this case, the thp settings were
not being restored, which could then influence other tests.

Fix this by installing an atexit() handler to do the actual restore. The
signal handler can now just call exit() and the atexit handler is
invoked.
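
The pattern is the standard atexit()/signal() combination; a minimal
standalone sketch (not the test code itself) looks like:

	#include <signal.h>
	#include <stdlib.h>

	static int exit_status;

	static void restore_at_exit(void)
	{
		/* Runs on every exit() path, including tests that exit() on failure. */
		/* ... restore the saved settings here ... */
	}

	static void on_signal(int sig)
	{
		/* Defer the real work to the atexit handler. */
		exit(sig ? EXIT_FAILURE : exit_status);
	}

	int main(void)
	{
		atexit(restore_at_exit);
		signal(SIGTERM, on_signal);
		signal(SIGINT, on_signal);
		signal(SIGHUP, on_signal);
		/* ... run tests, which may call exit() directly ... */
		return exit_status;
	}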

Reviewed-by: Alistair Popple <[email protected]>
Signed-off-by: Ryan Roberts <[email protected]>
---
tools/testing/selftests/mm/khugepaged.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
index 030667cb5533..fc47a1c4944c 100644
--- a/tools/testing/selftests/mm/khugepaged.c
+++ b/tools/testing/selftests/mm/khugepaged.c
@@ -374,18 +374,22 @@ static void pop_settings(void)
write_settings(current_settings());
}

-static void restore_settings(int sig)
+static void restore_settings_atexit(void)
{
if (skip_settings_restore)
- goto out;
+ return;

printf("Restore THP and khugepaged settings...");
write_settings(&saved_settings);
success("OK");
- if (sig)
- exit(EXIT_FAILURE);
-out:
- exit(exit_status);
+
+ skip_settings_restore = true;
+}
+
+static void restore_settings(int sig)
+{
+ /* exit() will invoke the restore_settings_atexit handler. */
+ exit(sig ? EXIT_FAILURE : exit_status);
}

static void save_settings(void)
@@ -415,6 +419,7 @@ static void save_settings(void)

success("OK");

+ atexit(restore_settings_atexit);
signal(SIGTERM, restore_settings);
signal(SIGINT, restore_settings);
signal(SIGHUP, restore_settings);
--
2.25.1

2023-12-04 10:21:51

by Ryan Roberts

Subject: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP

Introduce the logic to allow THP to be configured (through the new sysfs
interface we just added) to allocate large folios - larger than the base
page size but smaller than PMD-size - to back anonymous memory. We call
this new THP extension "multi-size THP" (mTHP).

mTHP continues to be PTE-mapped, but in many cases can still provide
similar benefits to traditional PMD-sized THP: Page faults are
significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
the configured order), but latency spikes are much less prominent
because the size of each page isn't as huge as the PMD-sized variant and
there is less memory to clear in each page fault. The number of per-page
operations (e.g. ref counting, rmap management, lru list management) is
also significantly reduced since those ops now become per-folio.

Some architectures also employ TLB compression mechanisms to squeeze
more entries in when a set of PTEs are virtually and physically
contiguous and appropriately aligned. In this case, TLB misses will
occur less often.

The new behaviour is disabled by default, but can be enabled at runtime
by writing to /sys/kernel/mm/transparent_hugepage/hugepages-XXkB/enabled
(see documentation in previous commit). The long term aim is to change
the default to include suitable lower orders, but there are some risks
around internal fragmentation that need to be better understood first.
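
In the fault path this boils down to the following flow (a sketch mirroring
alloc_anon_folio() in the hunk below; the real code also skips orders whose
PTE range is already populated and bails out early when uffd is armed):

	unsigned long orders, addr;
	struct folio *folio = NULL;
	int order;

	/* Enabled orders below PMD_ORDER, filtered against the VMA. */
	orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
					  BIT(PMD_ORDER) - 1);
	orders = thp_vma_suitable_orders(vma, vmf->address, orders);

	/* Try the largest remaining order first, falling back downwards. */
	for (order = first_order(orders); orders;
	     order = next_order(&orders, order)) {
		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
		folio = vma_alloc_folio(vma_thp_gfp_mask(vma), order, vma,
					addr, true);
		if (folio)
			break;
	}
	if (!folio)
		folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);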

Signed-off-by: Ryan Roberts <[email protected]>
---
include/linux/huge_mm.h | 6 ++-
mm/memory.c | 106 ++++++++++++++++++++++++++++++++++++----
2 files changed, 101 insertions(+), 11 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index bd0eadd3befb..91a53b9835a4 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)

/*
- * Mask of all large folio orders supported for anonymous THP.
+ * Mask of all large folio orders supported for anonymous THP; all orders up to
+ * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
+ * (which is a limitation of the THP implementation).
*/
-#define THP_ORDERS_ALL_ANON BIT(PMD_ORDER)
+#define THP_ORDERS_ALL_ANON ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))

/*
* Mask of all large folio orders supported for file THP.
diff --git a/mm/memory.c b/mm/memory.c
index 3ceeb0f45bf5..bf7e93813018 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4125,6 +4125,84 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
return ret;
}

+static bool pte_range_none(pte_t *pte, int nr_pages)
+{
+ int i;
+
+ for (i = 0; i < nr_pages; i++) {
+ if (!pte_none(ptep_get_lockless(pte + i)))
+ return false;
+ }
+
+ return true;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static struct folio *alloc_anon_folio(struct vm_fault *vmf)
+{
+ gfp_t gfp;
+ pte_t *pte;
+ unsigned long addr;
+ struct folio *folio;
+ struct vm_area_struct *vma = vmf->vma;
+ unsigned long orders;
+ int order;
+
+ /*
+ * If uffd is active for the vma we need per-page fault fidelity to
+ * maintain the uffd semantics.
+ */
+ if (userfaultfd_armed(vma))
+ goto fallback;
+
+ /*
+ * Get a list of all the (large) orders below PMD_ORDER that are enabled
+ * for this vma. Then filter out the orders that can't be allocated over
+ * the faulting address and still be fully contained in the vma.
+ */
+ orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
+ BIT(PMD_ORDER) - 1);
+ orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+
+ if (!orders)
+ goto fallback;
+
+ pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
+ if (!pte)
+ return ERR_PTR(-EAGAIN);
+
+ order = first_order(orders);
+ while (orders) {
+ addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+ vmf->pte = pte + pte_index(addr);
+ if (pte_range_none(vmf->pte, 1 << order))
+ break;
+ order = next_order(&orders, order);
+ }
+
+ vmf->pte = NULL;
+ pte_unmap(pte);
+
+ gfp = vma_thp_gfp_mask(vma);
+
+ while (orders) {
+ addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+ folio = vma_alloc_folio(gfp, order, vma, addr, true);
+ if (folio) {
+ clear_huge_page(&folio->page, addr, 1 << order);
+ return folio;
+ }
+ order = next_order(&orders, order);
+ }
+
+fallback:
+ return vma_alloc_zeroed_movable_folio(vma, vmf->address);
+}
+#else
+#define alloc_anon_folio(vmf) \
+ vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
+#endif
+
/*
* We enter with non-exclusive mmap_lock (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
@@ -4132,6 +4210,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
{
+ int i;
+ int nr_pages = 1;
+ unsigned long addr = vmf->address;
bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
struct vm_area_struct *vma = vmf->vma;
struct folio *folio;
@@ -4176,10 +4257,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
/* Allocate our own private page. */
if (unlikely(anon_vma_prepare(vma)))
goto oom;
- folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
+ folio = alloc_anon_folio(vmf);
+ if (IS_ERR(folio))
+ return 0;
if (!folio)
goto oom;

+ nr_pages = folio_nr_pages(folio);
+ addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+
if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
goto oom_free_page;
folio_throttle_swaprate(folio, GFP_KERNEL);
@@ -4196,12 +4282,13 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry), vma);

- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
if (!vmf->pte)
goto release;
- if (vmf_pte_changed(vmf)) {
- update_mmu_tlb(vma, vmf->address, vmf->pte);
+ if ((nr_pages == 1 && vmf_pte_changed(vmf)) ||
+ (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages))) {
+ for (i = 0; i < nr_pages; i++)
+ update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
goto release;
}

@@ -4216,16 +4303,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
return handle_userfault(vmf, VM_UFFD_MISSING);
}

- inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
- folio_add_new_anon_rmap(folio, vma, vmf->address);
+ folio_ref_add(folio, nr_pages - 1);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+ folio_add_new_anon_rmap(folio, vma, addr);
folio_add_lru_vma(folio, vma);
setpte:
if (uffd_wp)
entry = pte_mkuffd_wp(entry);
- set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+ set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);

/* No need to invalidate - it was non-present before */
- update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+ update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
unlock:
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
--
2.25.1

2023-12-04 10:21:54

by Ryan Roberts

Subject: [PATCH v8 06/10] selftests/mm: Factor out thp settings management

The khugepaged test has a useful framework for save/restore/pop/push of
all thp settings via the sysfs interface. This will be useful to
explicitly control multi-size THP settings in other tests, so let's
move it out of khugepaged and into its own thp_settings.[c|h] utility.
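
Other tests can then tweak a setting for the duration of a test and restore it
afterwards. A rough usage sketch (assuming the khugepaged helpers keep their
shape but gain a thp_ prefix in thp_settings.h, as the thp_restore_settings()
call used below does; exact names per the new header):

	#include "thp_settings.h"

	struct thp_settings settings = *thp_current_settings();

	/* Adjust one knob for this test, then put everything back. */
	settings.thp_enabled = THP_MADVISE;
	thp_push_settings(&settings);
	/* ... run the test ... */
	thp_pop_settings();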

Tested-by: Alistair Popple <[email protected]>
Signed-off-by: Ryan Roberts <[email protected]>
---
tools/testing/selftests/mm/Makefile | 4 +-
tools/testing/selftests/mm/khugepaged.c | 346 ++--------------------
tools/testing/selftests/mm/thp_settings.c | 296 ++++++++++++++++++
tools/testing/selftests/mm/thp_settings.h | 71 +++++
4 files changed, 391 insertions(+), 326 deletions(-)
create mode 100644 tools/testing/selftests/mm/thp_settings.c
create mode 100644 tools/testing/selftests/mm/thp_settings.h

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index dede0bcf97a3..2453add65d12 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -117,8 +117,8 @@ TEST_FILES += va_high_addr_switch.sh

include ../lib.mk

-$(TEST_GEN_PROGS): vm_util.c
-$(TEST_GEN_FILES): vm_util.c
+$(TEST_GEN_PROGS): vm_util.c thp_settings.c
+$(TEST_GEN_FILES): vm_util.c thp_settings.c

$(OUTPUT)/uffd-stress: uffd-common.c
$(OUTPUT)/uffd-unit-tests: uffd-common.c
diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
index fc47a1c4944c..b15e7fd70176 100644
--- a/tools/testing/selftests/mm/khugepaged.c
+++ b/tools/testing/selftests/mm/khugepaged.c
@@ -22,13 +22,13 @@
#include "linux/magic.h"

#include "vm_util.h"
+#include "thp_settings.h"

#define BASE_ADDR ((void *)(1UL << 30))
static unsigned long hpage_pmd_size;
static unsigned long page_size;
static int hpage_pmd_nr;

-#define THP_SYSFS "/sys/kernel/mm/transparent_hugepage/"
#define PID_SMAPS "/proc/self/smaps"
#define TEST_FILE "collapse_test_file"

@@ -71,78 +71,7 @@ struct file_info {
};

static struct file_info finfo;
-
-enum thp_enabled {
- THP_ALWAYS,
- THP_MADVISE,
- THP_NEVER,
-};
-
-static const char *thp_enabled_strings[] = {
- "always",
- "madvise",
- "never",
- NULL
-};
-
-enum thp_defrag {
- THP_DEFRAG_ALWAYS,
- THP_DEFRAG_DEFER,
- THP_DEFRAG_DEFER_MADVISE,
- THP_DEFRAG_MADVISE,
- THP_DEFRAG_NEVER,
-};
-
-static const char *thp_defrag_strings[] = {
- "always",
- "defer",
- "defer+madvise",
- "madvise",
- "never",
- NULL
-};
-
-enum shmem_enabled {
- SHMEM_ALWAYS,
- SHMEM_WITHIN_SIZE,
- SHMEM_ADVISE,
- SHMEM_NEVER,
- SHMEM_DENY,
- SHMEM_FORCE,
-};
-
-static const char *shmem_enabled_strings[] = {
- "always",
- "within_size",
- "advise",
- "never",
- "deny",
- "force",
- NULL
-};
-
-struct khugepaged_settings {
- bool defrag;
- unsigned int alloc_sleep_millisecs;
- unsigned int scan_sleep_millisecs;
- unsigned int max_ptes_none;
- unsigned int max_ptes_swap;
- unsigned int max_ptes_shared;
- unsigned long pages_to_scan;
-};
-
-struct settings {
- enum thp_enabled thp_enabled;
- enum thp_defrag thp_defrag;
- enum shmem_enabled shmem_enabled;
- bool use_zero_page;
- struct khugepaged_settings khugepaged;
- unsigned long read_ahead_kb;
-};
-
-static struct settings saved_settings;
static bool skip_settings_restore;
-
static int exit_status;

static void success(const char *msg)
@@ -161,226 +90,13 @@ static void skip(const char *msg)
printf(" \e[33m%s\e[0m\n", msg);
}

-static int read_file(const char *path, char *buf, size_t buflen)
-{
- int fd;
- ssize_t numread;
-
- fd = open(path, O_RDONLY);
- if (fd == -1)
- return 0;
-
- numread = read(fd, buf, buflen - 1);
- if (numread < 1) {
- close(fd);
- return 0;
- }
-
- buf[numread] = '\0';
- close(fd);
-
- return (unsigned int) numread;
-}
-
-static int write_file(const char *path, const char *buf, size_t buflen)
-{
- int fd;
- ssize_t numwritten;
-
- fd = open(path, O_WRONLY);
- if (fd == -1) {
- printf("open(%s)\n", path);
- exit(EXIT_FAILURE);
- return 0;
- }
-
- numwritten = write(fd, buf, buflen - 1);
- close(fd);
- if (numwritten < 1) {
- printf("write(%s)\n", buf);
- exit(EXIT_FAILURE);
- return 0;
- }
-
- return (unsigned int) numwritten;
-}
-
-static int read_string(const char *name, const char *strings[])
-{
- char path[PATH_MAX];
- char buf[256];
- char *c;
- int ret;
-
- ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name);
- if (ret >= PATH_MAX) {
- printf("%s: Pathname is too long\n", __func__);
- exit(EXIT_FAILURE);
- }
-
- if (!read_file(path, buf, sizeof(buf))) {
- perror(path);
- exit(EXIT_FAILURE);
- }
-
- c = strchr(buf, '[');
- if (!c) {
- printf("%s: Parse failure\n", __func__);
- exit(EXIT_FAILURE);
- }
-
- c++;
- memmove(buf, c, sizeof(buf) - (c - buf));
-
- c = strchr(buf, ']');
- if (!c) {
- printf("%s: Parse failure\n", __func__);
- exit(EXIT_FAILURE);
- }
- *c = '\0';
-
- ret = 0;
- while (strings[ret]) {
- if (!strcmp(strings[ret], buf))
- return ret;
- ret++;
- }
-
- printf("Failed to parse %s\n", name);
- exit(EXIT_FAILURE);
-}
-
-static void write_string(const char *name, const char *val)
-{
- char path[PATH_MAX];
- int ret;
-
- ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name);
- if (ret >= PATH_MAX) {
- printf("%s: Pathname is too long\n", __func__);
- exit(EXIT_FAILURE);
- }
-
- if (!write_file(path, val, strlen(val) + 1)) {
- perror(path);
- exit(EXIT_FAILURE);
- }
-}
-
-static const unsigned long _read_num(const char *path)
-{
- char buf[21];
-
- if (read_file(path, buf, sizeof(buf)) < 0) {
- perror("read_file(read_num)");
- exit(EXIT_FAILURE);
- }
-
- return strtoul(buf, NULL, 10);
-}
-
-static const unsigned long read_num(const char *name)
-{
- char path[PATH_MAX];
- int ret;
-
- ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name);
- if (ret >= PATH_MAX) {
- printf("%s: Pathname is too long\n", __func__);
- exit(EXIT_FAILURE);
- }
- return _read_num(path);
-}
-
-static void _write_num(const char *path, unsigned long num)
-{
- char buf[21];
-
- sprintf(buf, "%ld", num);
- if (!write_file(path, buf, strlen(buf) + 1)) {
- perror(path);
- exit(EXIT_FAILURE);
- }
-}
-
-static void write_num(const char *name, unsigned long num)
-{
- char path[PATH_MAX];
- int ret;
-
- ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name);
- if (ret >= PATH_MAX) {
- printf("%s: Pathname is too long\n", __func__);
- exit(EXIT_FAILURE);
- }
- _write_num(path, num);
-}
-
-static void write_settings(struct settings *settings)
-{
- struct khugepaged_settings *khugepaged = &settings->khugepaged;
-
- write_string("enabled", thp_enabled_strings[settings->thp_enabled]);
- write_string("defrag", thp_defrag_strings[settings->thp_defrag]);
- write_string("shmem_enabled",
- shmem_enabled_strings[settings->shmem_enabled]);
- write_num("use_zero_page", settings->use_zero_page);
-
- write_num("khugepaged/defrag", khugepaged->defrag);
- write_num("khugepaged/alloc_sleep_millisecs",
- khugepaged->alloc_sleep_millisecs);
- write_num("khugepaged/scan_sleep_millisecs",
- khugepaged->scan_sleep_millisecs);
- write_num("khugepaged/max_ptes_none", khugepaged->max_ptes_none);
- write_num("khugepaged/max_ptes_swap", khugepaged->max_ptes_swap);
- write_num("khugepaged/max_ptes_shared", khugepaged->max_ptes_shared);
- write_num("khugepaged/pages_to_scan", khugepaged->pages_to_scan);
-
- if (file_ops && finfo.type == VMA_FILE)
- _write_num(finfo.dev_queue_read_ahead_path,
- settings->read_ahead_kb);
-}
-
-#define MAX_SETTINGS_DEPTH 4
-static struct settings settings_stack[MAX_SETTINGS_DEPTH];
-static int settings_index;
-
-static struct settings *current_settings(void)
-{
- if (!settings_index) {
- printf("Fail: No settings set");
- exit(EXIT_FAILURE);
- }
- return settings_stack + settings_index - 1;
-}
-
-static void push_settings(struct settings *settings)
-{
- if (settings_index >= MAX_SETTINGS_DEPTH) {
- printf("Fail: Settings stack exceeded");
- exit(EXIT_FAILURE);
- }
- settings_stack[settings_index++] = *settings;
- write_settings(current_settings());
-}
-
-static void pop_settings(void)
-{
- if (settings_index <= 0) {
- printf("Fail: Settings stack empty");
- exit(EXIT_FAILURE);
- }
- --settings_index;
- write_settings(current_settings());
-}
-
static void restore_settings_atexit(void)
{
if (skip_settings_restore)
return;

printf("Restore THP and khugepaged settings...");
- write_settings(&saved_settings);
+ thp_restore_settings();
success("OK");

skip_settings_restore = true;
@@ -395,27 +111,9 @@ static void restore_settings(int sig)
static void save_settings(void)
{
printf("Save THP and khugepaged settings...");
- saved_settings = (struct settings) {
- .thp_enabled = read_string("enabled", thp_enabled_strings),
- .thp_defrag = read_string("defrag", thp_defrag_strings),
- .shmem_enabled =
- read_string("shmem_enabled", shmem_enabled_strings),
- .use_zero_page = read_num("use_zero_page"),
- };
- saved_settings.khugepaged = (struct khugepaged_settings) {
- .defrag = read_num("khugepaged/defrag"),
- .alloc_sleep_millisecs =
- read_num("khugepaged/alloc_sleep_millisecs"),
- .scan_sleep_millisecs =
- read_num("khugepaged/scan_sleep_millisecs"),
- .max_ptes_none = read_num("khugepaged/max_ptes_none"),
- .max_ptes_swap = read_num("khugepaged/max_ptes_swap"),
- .max_ptes_shared = read_num("khugepaged/max_ptes_shared"),
- .pages_to_scan = read_num("khugepaged/pages_to_scan"),
- };
if (file_ops && finfo.type == VMA_FILE)
- saved_settings.read_ahead_kb =
- _read_num(finfo.dev_queue_read_ahead_path);
+ thp_set_read_ahead_path(finfo.dev_queue_read_ahead_path);
+ thp_save_settings();

success("OK");

@@ -798,7 +496,7 @@ static void __madvise_collapse(const char *msg, char *p, int nr_hpages,
struct mem_ops *ops, bool expect)
{
int ret;
- struct settings settings = *current_settings();
+ struct thp_settings settings = *thp_current_settings();

printf("%s...", msg);

@@ -808,7 +506,7 @@ static void __madvise_collapse(const char *msg, char *p, int nr_hpages,
*/
settings.thp_enabled = THP_NEVER;
settings.shmem_enabled = SHMEM_NEVER;
- push_settings(&settings);
+ thp_push_settings(&settings);

/* Clear VM_NOHUGEPAGE */
madvise(p, nr_hpages * hpage_pmd_size, MADV_HUGEPAGE);
@@ -820,7 +518,7 @@ static void __madvise_collapse(const char *msg, char *p, int nr_hpages,
else
success("OK");

- pop_settings();
+ thp_pop_settings();
}

static void madvise_collapse(const char *msg, char *p, int nr_hpages,
@@ -850,13 +548,13 @@ static bool wait_for_scan(const char *msg, char *p, int nr_hpages,
madvise(p, nr_hpages * hpage_pmd_size, MADV_HUGEPAGE);

/* Wait until the second full_scan completed */
- full_scans = read_num("khugepaged/full_scans") + 2;
+ full_scans = thp_read_num("khugepaged/full_scans") + 2;

printf("%s...", msg);
while (timeout--) {
if (ops->check_huge(p, nr_hpages))
break;
- if (read_num("khugepaged/full_scans") >= full_scans)
+ if (thp_read_num("khugepaged/full_scans") >= full_scans)
break;
printf(".");
usleep(TICK);
@@ -911,11 +609,11 @@ static bool is_tmpfs(struct mem_ops *ops)

static void alloc_at_fault(void)
{
- struct settings settings = *current_settings();
+ struct thp_settings settings = *thp_current_settings();
char *p;

settings.thp_enabled = THP_ALWAYS;
- push_settings(&settings);
+ thp_push_settings(&settings);

p = alloc_mapping(1);
*p = 1;
@@ -925,7 +623,7 @@ static void alloc_at_fault(void)
else
fail("Fail");

- pop_settings();
+ thp_pop_settings();

madvise(p, page_size, MADV_DONTNEED);
printf("Split huge PMD on MADV_DONTNEED...");
@@ -973,11 +671,11 @@ static void collapse_single_pte_entry(struct collapse_context *c, struct mem_ops
static void collapse_max_ptes_none(struct collapse_context *c, struct mem_ops *ops)
{
int max_ptes_none = hpage_pmd_nr / 2;
- struct settings settings = *current_settings();
+ struct thp_settings settings = *thp_current_settings();
void *p;

settings.khugepaged.max_ptes_none = max_ptes_none;
- push_settings(&settings);
+ thp_push_settings(&settings);

p = ops->setup_area(1);

@@ -1002,7 +700,7 @@ static void collapse_max_ptes_none(struct collapse_context *c, struct mem_ops *o
}
skip:
ops->cleanup_area(p, hpage_pmd_size);
- pop_settings();
+ thp_pop_settings();
}

static void collapse_swapin_single_pte(struct collapse_context *c, struct mem_ops *ops)
@@ -1033,7 +731,7 @@ static void collapse_swapin_single_pte(struct collapse_context *c, struct mem_op

static void collapse_max_ptes_swap(struct collapse_context *c, struct mem_ops *ops)
{
- int max_ptes_swap = read_num("khugepaged/max_ptes_swap");
+ int max_ptes_swap = thp_read_num("khugepaged/max_ptes_swap");
void *p;

p = ops->setup_area(1);
@@ -1250,11 +948,11 @@ static void collapse_fork_compound(struct collapse_context *c, struct mem_ops *o
fail("Fail");
ops->fault(p, 0, page_size);

- write_num("khugepaged/max_ptes_shared", hpage_pmd_nr - 1);
+ thp_write_num("khugepaged/max_ptes_shared", hpage_pmd_nr - 1);
c->collapse("Collapse PTE table full of compound pages in child",
p, 1, ops, true);
- write_num("khugepaged/max_ptes_shared",
- current_settings()->khugepaged.max_ptes_shared);
+ thp_write_num("khugepaged/max_ptes_shared",
+ thp_current_settings()->khugepaged.max_ptes_shared);

validate_memory(p, 0, hpage_pmd_size);
ops->cleanup_area(p, hpage_pmd_size);
@@ -1275,7 +973,7 @@ static void collapse_fork_compound(struct collapse_context *c, struct mem_ops *o

static void collapse_max_ptes_shared(struct collapse_context *c, struct mem_ops *ops)
{
- int max_ptes_shared = read_num("khugepaged/max_ptes_shared");
+ int max_ptes_shared = thp_read_num("khugepaged/max_ptes_shared");
int wstatus;
void *p;

@@ -1443,7 +1141,7 @@ static void parse_test_type(int argc, const char **argv)

int main(int argc, const char **argv)
{
- struct settings default_settings = {
+ struct thp_settings default_settings = {
.thp_enabled = THP_MADVISE,
.thp_defrag = THP_DEFRAG_ALWAYS,
.shmem_enabled = SHMEM_ADVISE,
@@ -1484,7 +1182,7 @@ int main(int argc, const char **argv)
default_settings.khugepaged.pages_to_scan = hpage_pmd_nr * 8;

save_settings();
- push_settings(&default_settings);
+ thp_push_settings(&default_settings);

alloc_at_fault();

diff --git a/tools/testing/selftests/mm/thp_settings.c b/tools/testing/selftests/mm/thp_settings.c
new file mode 100644
index 000000000000..5e8ec792cac7
--- /dev/null
+++ b/tools/testing/selftests/mm/thp_settings.c
@@ -0,0 +1,296 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <fcntl.h>
+#include <limits.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include "thp_settings.h"
+
+#define THP_SYSFS "/sys/kernel/mm/transparent_hugepage/"
+#define MAX_SETTINGS_DEPTH 4
+static struct thp_settings settings_stack[MAX_SETTINGS_DEPTH];
+static int settings_index;
+static struct thp_settings saved_settings;
+static char dev_queue_read_ahead_path[PATH_MAX];
+
+static const char * const thp_enabled_strings[] = {
+ "always",
+ "madvise",
+ "never",
+ NULL
+};
+
+static const char * const thp_defrag_strings[] = {
+ "always",
+ "defer",
+ "defer+madvise",
+ "madvise",
+ "never",
+ NULL
+};
+
+static const char * const shmem_enabled_strings[] = {
+ "always",
+ "within_size",
+ "advise",
+ "never",
+ "deny",
+ "force",
+ NULL
+};
+
+int read_file(const char *path, char *buf, size_t buflen)
+{
+ int fd;
+ ssize_t numread;
+
+ fd = open(path, O_RDONLY);
+ if (fd == -1)
+ return 0;
+
+ numread = read(fd, buf, buflen - 1);
+ if (numread < 1) {
+ close(fd);
+ return 0;
+ }
+
+ buf[numread] = '\0';
+ close(fd);
+
+ return (unsigned int) numread;
+}
+
+int write_file(const char *path, const char *buf, size_t buflen)
+{
+ int fd;
+ ssize_t numwritten;
+
+ fd = open(path, O_WRONLY);
+ if (fd == -1) {
+ printf("open(%s)\n", path);
+ exit(EXIT_FAILURE);
+ return 0;
+ }
+
+ numwritten = write(fd, buf, buflen - 1);
+ close(fd);
+ if (numwritten < 1) {
+ printf("write(%s)\n", buf);
+ exit(EXIT_FAILURE);
+ return 0;
+ }
+
+ return (unsigned int) numwritten;
+}
+
+const unsigned long read_num(const char *path)
+{
+ char buf[21];
+
+ if (read_file(path, buf, sizeof(buf)) < 0) {
+ perror("read_file()");
+ exit(EXIT_FAILURE);
+ }
+
+ return strtoul(buf, NULL, 10);
+}
+
+void write_num(const char *path, unsigned long num)
+{
+ char buf[21];
+
+ sprintf(buf, "%ld", num);
+ if (!write_file(path, buf, strlen(buf) + 1)) {
+ perror(path);
+ exit(EXIT_FAILURE);
+ }
+}
+
+int thp_read_string(const char *name, const char * const strings[])
+{
+ char path[PATH_MAX];
+ char buf[256];
+ char *c;
+ int ret;
+
+ ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name);
+ if (ret >= PATH_MAX) {
+ printf("%s: Pathname is too long\n", __func__);
+ exit(EXIT_FAILURE);
+ }
+
+ if (!read_file(path, buf, sizeof(buf))) {
+ perror(path);
+ exit(EXIT_FAILURE);
+ }
+
+ c = strchr(buf, '[');
+ if (!c) {
+ printf("%s: Parse failure\n", __func__);
+ exit(EXIT_FAILURE);
+ }
+
+ c++;
+ memmove(buf, c, sizeof(buf) - (c - buf));
+
+ c = strchr(buf, ']');
+ if (!c) {
+ printf("%s: Parse failure\n", __func__);
+ exit(EXIT_FAILURE);
+ }
+ *c = '\0';
+
+ ret = 0;
+ while (strings[ret]) {
+ if (!strcmp(strings[ret], buf))
+ return ret;
+ ret++;
+ }
+
+ printf("Failed to parse %s\n", name);
+ exit(EXIT_FAILURE);
+}
+
+void thp_write_string(const char *name, const char *val)
+{
+ char path[PATH_MAX];
+ int ret;
+
+ ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name);
+ if (ret >= PATH_MAX) {
+ printf("%s: Pathname is too long\n", __func__);
+ exit(EXIT_FAILURE);
+ }
+
+ if (!write_file(path, val, strlen(val) + 1)) {
+ perror(path);
+ exit(EXIT_FAILURE);
+ }
+}
+
+const unsigned long thp_read_num(const char *name)
+{
+ char path[PATH_MAX];
+ int ret;
+
+ ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name);
+ if (ret >= PATH_MAX) {
+ printf("%s: Pathname is too long\n", __func__);
+ exit(EXIT_FAILURE);
+ }
+ return read_num(path);
+}
+
+void thp_write_num(const char *name, unsigned long num)
+{
+ char path[PATH_MAX];
+ int ret;
+
+ ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name);
+ if (ret >= PATH_MAX) {
+ printf("%s: Pathname is too long\n", __func__);
+ exit(EXIT_FAILURE);
+ }
+ write_num(path, num);
+}
+
+void thp_read_settings(struct thp_settings *settings)
+{
+ *settings = (struct thp_settings) {
+ .thp_enabled = thp_read_string("enabled", thp_enabled_strings),
+ .thp_defrag = thp_read_string("defrag", thp_defrag_strings),
+ .shmem_enabled =
+ thp_read_string("shmem_enabled", shmem_enabled_strings),
+ .use_zero_page = thp_read_num("use_zero_page"),
+ };
+ settings->khugepaged = (struct khugepaged_settings) {
+ .defrag = thp_read_num("khugepaged/defrag"),
+ .alloc_sleep_millisecs =
+ thp_read_num("khugepaged/alloc_sleep_millisecs"),
+ .scan_sleep_millisecs =
+ thp_read_num("khugepaged/scan_sleep_millisecs"),
+ .max_ptes_none = thp_read_num("khugepaged/max_ptes_none"),
+ .max_ptes_swap = thp_read_num("khugepaged/max_ptes_swap"),
+ .max_ptes_shared = thp_read_num("khugepaged/max_ptes_shared"),
+ .pages_to_scan = thp_read_num("khugepaged/pages_to_scan"),
+ };
+ if (dev_queue_read_ahead_path[0])
+ settings->read_ahead_kb = read_num(dev_queue_read_ahead_path);
+}
+
+void thp_write_settings(struct thp_settings *settings)
+{
+ struct khugepaged_settings *khugepaged = &settings->khugepaged;
+
+ thp_write_string("enabled", thp_enabled_strings[settings->thp_enabled]);
+ thp_write_string("defrag", thp_defrag_strings[settings->thp_defrag]);
+ thp_write_string("shmem_enabled",
+ shmem_enabled_strings[settings->shmem_enabled]);
+ thp_write_num("use_zero_page", settings->use_zero_page);
+
+ thp_write_num("khugepaged/defrag", khugepaged->defrag);
+ thp_write_num("khugepaged/alloc_sleep_millisecs",
+ khugepaged->alloc_sleep_millisecs);
+ thp_write_num("khugepaged/scan_sleep_millisecs",
+ khugepaged->scan_sleep_millisecs);
+ thp_write_num("khugepaged/max_ptes_none", khugepaged->max_ptes_none);
+ thp_write_num("khugepaged/max_ptes_swap", khugepaged->max_ptes_swap);
+ thp_write_num("khugepaged/max_ptes_shared", khugepaged->max_ptes_shared);
+ thp_write_num("khugepaged/pages_to_scan", khugepaged->pages_to_scan);
+
+ if (dev_queue_read_ahead_path[0])
+ write_num(dev_queue_read_ahead_path, settings->read_ahead_kb);
+}
+
+struct thp_settings *thp_current_settings(void)
+{
+ if (!settings_index) {
+ printf("Fail: No settings set");
+ exit(EXIT_FAILURE);
+ }
+ return settings_stack + settings_index - 1;
+}
+
+void thp_push_settings(struct thp_settings *settings)
+{
+ if (settings_index >= MAX_SETTINGS_DEPTH) {
+ printf("Fail: Settings stack exceeded");
+ exit(EXIT_FAILURE);
+ }
+ settings_stack[settings_index++] = *settings;
+ thp_write_settings(thp_current_settings());
+}
+
+void thp_pop_settings(void)
+{
+ if (settings_index <= 0) {
+ printf("Fail: Settings stack empty");
+ exit(EXIT_FAILURE);
+ }
+ --settings_index;
+ thp_write_settings(thp_current_settings());
+}
+
+void thp_restore_settings(void)
+{
+ thp_write_settings(&saved_settings);
+}
+
+void thp_save_settings(void)
+{
+ thp_read_settings(&saved_settings);
+}
+
+void thp_set_read_ahead_path(char *path)
+{
+ if (!path) {
+ dev_queue_read_ahead_path[0] = '\0';
+ return;
+ }
+
+ strncpy(dev_queue_read_ahead_path, path,
+ sizeof(dev_queue_read_ahead_path));
+ dev_queue_read_ahead_path[sizeof(dev_queue_read_ahead_path) - 1] = '\0';
+}
diff --git a/tools/testing/selftests/mm/thp_settings.h b/tools/testing/selftests/mm/thp_settings.h
new file mode 100644
index 000000000000..ff3d98c30617
--- /dev/null
+++ b/tools/testing/selftests/mm/thp_settings.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __THP_SETTINGS_H__
+#define __THP_SETTINGS_H__
+
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+
+enum thp_enabled {
+ THP_ALWAYS,
+ THP_MADVISE,
+ THP_NEVER,
+};
+
+enum thp_defrag {
+ THP_DEFRAG_ALWAYS,
+ THP_DEFRAG_DEFER,
+ THP_DEFRAG_DEFER_MADVISE,
+ THP_DEFRAG_MADVISE,
+ THP_DEFRAG_NEVER,
+};
+
+enum shmem_enabled {
+ SHMEM_ALWAYS,
+ SHMEM_WITHIN_SIZE,
+ SHMEM_ADVISE,
+ SHMEM_NEVER,
+ SHMEM_DENY,
+ SHMEM_FORCE,
+};
+
+struct khugepaged_settings {
+ bool defrag;
+ unsigned int alloc_sleep_millisecs;
+ unsigned int scan_sleep_millisecs;
+ unsigned int max_ptes_none;
+ unsigned int max_ptes_swap;
+ unsigned int max_ptes_shared;
+ unsigned long pages_to_scan;
+};
+
+struct thp_settings {
+ enum thp_enabled thp_enabled;
+ enum thp_defrag thp_defrag;
+ enum shmem_enabled shmem_enabled;
+ bool use_zero_page;
+ struct khugepaged_settings khugepaged;
+ unsigned long read_ahead_kb;
+};
+
+int read_file(const char *path, char *buf, size_t buflen);
+int write_file(const char *path, const char *buf, size_t buflen);
+const unsigned long read_num(const char *path);
+void write_num(const char *path, unsigned long num);
+
+int thp_read_string(const char *name, const char * const strings[]);
+void thp_write_string(const char *name, const char *val);
+const unsigned long thp_read_num(const char *name);
+void thp_write_num(const char *name, unsigned long num);
+
+void thp_write_settings(struct thp_settings *settings);
+void thp_read_settings(struct thp_settings *settings);
+struct thp_settings *thp_current_settings(void);
+void thp_push_settings(struct thp_settings *settings);
+void thp_pop_settings(void);
+void thp_restore_settings(void);
+void thp_save_settings(void);
+
+void thp_set_read_ahead_path(char *path);
+
+#endif /* __THP_SETTINGS_H__ */
--
2.25.1

2023-12-04 10:21:58

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v8 08/10] selftests/mm/khugepaged: Enlighten for multi-size THP

The `collapse_max_ptes_none` test was previously failing when a THP size
less than PMD-size had enabled="always". The root cause is that the
test faults in 1 page less than the threshold it set for collapsing.
But when THP is enabled "always", we "over allocate", so the threshold
is exceeded and collapse unexpectedly succeeds.

Solve this by enlightening the khugepaged selftest. Add a command line
option to pass in the desired THP size that should be used for all
anonymous allocations. The harness then explicitly configures the
requested THP size, and the `collapse_max_ptes_none` test is modified
to fault in the threshold minus the number of pages in the configured
THP size. If no command line option is provided, default to order 0,
as per the previous behaviour.

I chose to use an order in the command line interface, since this makes
the interface agnostic of base page size, making it easier to invoke
from run_vmtests.sh.
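
To make the arithmetic concrete, here is a small stand-alone sketch
(illustrative only; the 4K base page and 2M PMD below are example
values, the real test reads both from the running kernel):

#include <stdio.h>

int main(void)
{
	int page_size = 4096;			/* example base page size */
	int hpage_pmd_nr = 512;			/* 2M PMD / 4K pages */
	int max_ptes_none = hpage_pmd_nr / 2;	/* limit set by the test */
	int anon_order = 2;			/* as passed via "-s 2" */
	int fault_nr_pages = 1 << anon_order;	/* pages per anon mTHP fault */

	/* Fault in just under the collapse threshold, accounting for the
	 * fact that one fault now populates fault_nr_pages pages. */
	int prefault = hpage_pmd_nr - max_ptes_none - fault_nr_pages;

	printf("fault in %d pages (%d bytes) before attempting collapse\n",
	       prefault, prefault * page_size);
	return 0;
}

run_vmtests.sh invokes the test a second time with "-s 2", i.e. order 2,
which is 16K folios on a system with 4K base pages.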

Signed-off-by: Ryan Roberts <[email protected]>
---
tools/testing/selftests/mm/khugepaged.c | 48 +++++++++++++++++------
tools/testing/selftests/mm/run_vmtests.sh | 2 +
2 files changed, 39 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
index 7bd3baa9d34b..829320a519e7 100644
--- a/tools/testing/selftests/mm/khugepaged.c
+++ b/tools/testing/selftests/mm/khugepaged.c
@@ -28,6 +28,7 @@
static unsigned long hpage_pmd_size;
static unsigned long page_size;
static int hpage_pmd_nr;
+static int anon_order;

#define PID_SMAPS "/proc/self/smaps"
#define TEST_FILE "collapse_test_file"
@@ -607,6 +608,11 @@ static bool is_tmpfs(struct mem_ops *ops)
return ops == &__file_ops && finfo.type == VMA_SHMEM;
}

+static bool is_anon(struct mem_ops *ops)
+{
+ return ops == &__anon_ops;
+}
+
static void alloc_at_fault(void)
{
struct thp_settings settings = *thp_current_settings();
@@ -673,6 +679,7 @@ static void collapse_max_ptes_none(struct collapse_context *c, struct mem_ops *o
int max_ptes_none = hpage_pmd_nr / 2;
struct thp_settings settings = *thp_current_settings();
void *p;
+ int fault_nr_pages = is_anon(ops) ? 1 << anon_order : 1;

settings.khugepaged.max_ptes_none = max_ptes_none;
thp_push_settings(&settings);
@@ -686,10 +693,10 @@ static void collapse_max_ptes_none(struct collapse_context *c, struct mem_ops *o
goto skip;
}

- ops->fault(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size);
+ ops->fault(p, 0, (hpage_pmd_nr - max_ptes_none - fault_nr_pages) * page_size);
c->collapse("Maybe collapse with max_ptes_none exceeded", p, 1,
ops, !c->enforce_pte_scan_limits);
- validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size);
+ validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none - fault_nr_pages) * page_size);

if (c->enforce_pte_scan_limits) {
ops->fault(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
@@ -1076,7 +1083,7 @@ static void madvise_retracted_page_tables(struct collapse_context *c,

static void usage(void)
{
- fprintf(stderr, "\nUsage: ./khugepaged <test type> [dir]\n\n");
+ fprintf(stderr, "\nUsage: ./khugepaged [OPTIONS] <test type> [dir]\n\n");
fprintf(stderr, "\t<test type>\t: <context>:<mem_type>\n");
fprintf(stderr, "\t<context>\t: [all|khugepaged|madvise]\n");
fprintf(stderr, "\t<mem_type>\t: [all|anon|file|shmem]\n");
@@ -1085,15 +1092,34 @@ static void usage(void)
fprintf(stderr, "\tCONFIG_READ_ONLY_THP_FOR_FS=y\n");
fprintf(stderr, "\n\tif [dir] is a (sub)directory of a tmpfs mount, tmpfs must be\n");
fprintf(stderr, "\tmounted with huge=madvise option for khugepaged tests to work\n");
+ fprintf(stderr, "\n\tSupported Options:\n");
+ fprintf(stderr, "\t\t-h: This help message.\n");
+ fprintf(stderr, "\t\t-s: mTHP size, expressed as page order.\n");
+ fprintf(stderr, "\t\t Defaults to 0. Use this size for anon allocations.\n");
exit(1);
}

-static void parse_test_type(int argc, const char **argv)
+static void parse_test_type(int argc, char **argv)
{
+ int opt;
char *buf;
const char *token;

- if (argc == 1) {
+ while ((opt = getopt(argc, argv, "s:h")) != -1) {
+ switch (opt) {
+ case 's':
+ anon_order = atoi(optarg);
+ break;
+ case 'h':
+ default:
+ usage();
+ }
+ }
+
+ argv += optind;
+ argc -= optind;
+
+ if (argc == 0) {
/* Backwards compatibility */
khugepaged_context = &__khugepaged_context;
madvise_context = &__madvise_context;
@@ -1101,7 +1127,7 @@ static void parse_test_type(int argc, const char **argv)
return;
}

- buf = strdup(argv[1]);
+ buf = strdup(argv[0]);
token = strsep(&buf, ":");

if (!strcmp(token, "all")) {
@@ -1135,11 +1161,13 @@ static void parse_test_type(int argc, const char **argv)
if (!file_ops)
return;

- if (argc != 3)
+ if (argc != 2)
usage();
+
+ get_finfo(argv[1]);
}

-int main(int argc, const char **argv)
+int main(int argc, char **argv)
{
int hpage_pmd_order;
struct thp_settings default_settings = {
@@ -1164,9 +1192,6 @@ int main(int argc, const char **argv)

parse_test_type(argc, argv);

- if (file_ops)
- get_finfo(argv[2]);
-
setbuf(stdout, NULL);

page_size = getpagesize();
@@ -1183,6 +1208,7 @@ int main(int argc, const char **argv)
default_settings.khugepaged.max_ptes_shared = hpage_pmd_nr / 2;
default_settings.khugepaged.pages_to_scan = hpage_pmd_nr * 8;
default_settings.hugepages[hpage_pmd_order].enabled = THP_INHERIT;
+ default_settings.hugepages[anon_order].enabled = THP_ALWAYS;

save_settings();
thp_push_settings(&default_settings);
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index c0212258b852..87f513f5cf91 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -357,6 +357,8 @@ CATEGORY="cow" run_test ./cow

CATEGORY="thp" run_test ./khugepaged

+CATEGORY="thp" run_test ./khugepaged -s 2
+
CATEGORY="thp" run_test ./transhuge-stress -d 20

CATEGORY="thp" run_test ./split_huge_page_test
--
2.25.1

2023-12-04 10:22:03

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v8 07/10] selftests/mm: Support multi-size THP interface in thp_settings

Save and restore the new per-size hugepage enabled setting, if available
on the running kernel.

Since the number of per-size directories is not fixed, solve this as
simply as possible by catering for a maximum number in the thp_settings
struct (20). Each array index is the order. The value of THP_NEVER is
changed to 0 so that all of these new settings default to THP_NEVER and
the user only needs to fill in the ones they want to enable.
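
For reference, a quick sketch (illustrative only, not part of the
patch) of the order-to-path mapping the new code derives settings
from; note that not every order has a directory on a given kernel,
which is what the new thp_supported_orders() helper probes for:

#include <stdio.h>
#include <unistd.h>

#define NR_ORDERS 20	/* matches the array size in thp_settings.h */

int main(void)
{
	unsigned int kb_per_page = getpagesize() >> 10;
	int order;

	for (order = 0; order < NR_ORDERS; order++)
		printf("/sys/kernel/mm/transparent_hugepage/hugepages-%ukB/enabled\n",
		       kb_per_page << order);
	return 0;
}

On a 4K-page kernel, order 9 maps to the existing hugepages-2048kB
directory, i.e. PMD-size.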

Signed-off-by: Ryan Roberts <[email protected]>
---
tools/testing/selftests/mm/khugepaged.c | 3 ++
tools/testing/selftests/mm/thp_settings.c | 55 ++++++++++++++++++++++-
tools/testing/selftests/mm/thp_settings.h | 11 ++++-
3 files changed, 67 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
index b15e7fd70176..7bd3baa9d34b 100644
--- a/tools/testing/selftests/mm/khugepaged.c
+++ b/tools/testing/selftests/mm/khugepaged.c
@@ -1141,6 +1141,7 @@ static void parse_test_type(int argc, const char **argv)

int main(int argc, const char **argv)
{
+ int hpage_pmd_order;
struct thp_settings default_settings = {
.thp_enabled = THP_MADVISE,
.thp_defrag = THP_DEFRAG_ALWAYS,
@@ -1175,11 +1176,13 @@ int main(int argc, const char **argv)
exit(EXIT_FAILURE);
}
hpage_pmd_nr = hpage_pmd_size / page_size;
+ hpage_pmd_order = __builtin_ctz(hpage_pmd_nr);

default_settings.khugepaged.max_ptes_none = hpage_pmd_nr - 1;
default_settings.khugepaged.max_ptes_swap = hpage_pmd_nr / 8;
default_settings.khugepaged.max_ptes_shared = hpage_pmd_nr / 2;
default_settings.khugepaged.pages_to_scan = hpage_pmd_nr * 8;
+ default_settings.hugepages[hpage_pmd_order].enabled = THP_INHERIT;

save_settings();
thp_push_settings(&default_settings);
diff --git a/tools/testing/selftests/mm/thp_settings.c b/tools/testing/selftests/mm/thp_settings.c
index 5e8ec792cac7..a4163438108e 100644
--- a/tools/testing/selftests/mm/thp_settings.c
+++ b/tools/testing/selftests/mm/thp_settings.c
@@ -16,9 +16,10 @@ static struct thp_settings saved_settings;
static char dev_queue_read_ahead_path[PATH_MAX];

static const char * const thp_enabled_strings[] = {
+ "never",
"always",
+ "inherit",
"madvise",
- "never",
NULL
};

@@ -198,6 +199,10 @@ void thp_write_num(const char *name, unsigned long num)

void thp_read_settings(struct thp_settings *settings)
{
+ unsigned long orders = thp_supported_orders();
+ char path[PATH_MAX];
+ int i;
+
*settings = (struct thp_settings) {
.thp_enabled = thp_read_string("enabled", thp_enabled_strings),
.thp_defrag = thp_read_string("defrag", thp_defrag_strings),
@@ -218,11 +223,26 @@ void thp_read_settings(struct thp_settings *settings)
};
if (dev_queue_read_ahead_path[0])
settings->read_ahead_kb = read_num(dev_queue_read_ahead_path);
+
+ for (i = 0; i < NR_ORDERS; i++) {
+ if (!((1 << i) & orders)) {
+ settings->hugepages[i].enabled = THP_NEVER;
+ continue;
+ }
+ snprintf(path, PATH_MAX, "hugepages-%ukB/enabled",
+ (getpagesize() >> 10) << i);
+ settings->hugepages[i].enabled =
+ thp_read_string(path, thp_enabled_strings);
+ }
}

void thp_write_settings(struct thp_settings *settings)
{
struct khugepaged_settings *khugepaged = &settings->khugepaged;
+ unsigned long orders = thp_supported_orders();
+ char path[PATH_MAX];
+ int enabled;
+ int i;

thp_write_string("enabled", thp_enabled_strings[settings->thp_enabled]);
thp_write_string("defrag", thp_defrag_strings[settings->thp_defrag]);
@@ -242,6 +262,15 @@ void thp_write_settings(struct thp_settings *settings)

if (dev_queue_read_ahead_path[0])
write_num(dev_queue_read_ahead_path, settings->read_ahead_kb);
+
+ for (i = 0; i < NR_ORDERS; i++) {
+ if (!((1 << i) & orders))
+ continue;
+ snprintf(path, PATH_MAX, "hugepages-%ukB/enabled",
+ (getpagesize() >> 10) << i);
+ enabled = settings->hugepages[i].enabled;
+ thp_write_string(path, thp_enabled_strings[enabled]);
+ }
}

struct thp_settings *thp_current_settings(void)
@@ -294,3 +323,27 @@ void thp_set_read_ahead_path(char *path)
sizeof(dev_queue_read_ahead_path));
dev_queue_read_ahead_path[sizeof(dev_queue_read_ahead_path) - 1] = '\0';
}
+
+unsigned long thp_supported_orders(void)
+{
+ unsigned long orders = 0;
+ char path[PATH_MAX];
+ char buf[256];
+ int ret;
+ int i;
+
+ for (i = 0; i < NR_ORDERS; i++) {
+ ret = snprintf(path, PATH_MAX, THP_SYSFS "hugepages-%ukB/enabled",
+ (getpagesize() >> 10) << i);
+ if (ret >= PATH_MAX) {
+ printf("%s: Pathname is too long\n", __func__);
+ exit(EXIT_FAILURE);
+ }
+
+ ret = read_file(path, buf, sizeof(buf));
+ if (ret)
+ orders |= 1UL << i;
+ }
+
+ return orders;
+}
diff --git a/tools/testing/selftests/mm/thp_settings.h b/tools/testing/selftests/mm/thp_settings.h
index ff3d98c30617..71cbff05f4c7 100644
--- a/tools/testing/selftests/mm/thp_settings.h
+++ b/tools/testing/selftests/mm/thp_settings.h
@@ -7,9 +7,10 @@
#include <stdint.h>

enum thp_enabled {
+ THP_NEVER,
THP_ALWAYS,
+ THP_INHERIT,
THP_MADVISE,
- THP_NEVER,
};

enum thp_defrag {
@@ -29,6 +30,12 @@ enum shmem_enabled {
SHMEM_FORCE,
};

+#define NR_ORDERS 20
+
+struct hugepages_settings {
+ enum thp_enabled enabled;
+};
+
struct khugepaged_settings {
bool defrag;
unsigned int alloc_sleep_millisecs;
@@ -46,6 +53,7 @@ struct thp_settings {
bool use_zero_page;
struct khugepaged_settings khugepaged;
unsigned long read_ahead_kb;
+ struct hugepages_settings hugepages[NR_ORDERS];
};

int read_file(const char *path, char *buf, size_t buflen);
@@ -67,5 +75,6 @@ void thp_restore_settings(void);
void thp_save_settings(void);

void thp_set_read_ahead_path(char *path);
+unsigned long thp_supported_orders(void);

#endif /* __THP_SETTINGS_H__ */
--
2.25.1

2023-12-04 10:22:06

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v8 09/10] selftests/mm/cow: Generalize do_run_with_thp() helper

do_run_with_thp() prepares (PMD-sized) THP memory in various states
before running tests. With the introduction of multi-size THP, we would
like to reuse this logic to also test those smaller THP sizes. So let's
add a thpsize parameter which tells the function what size of THP it
should operate on.

A separate commit will utilize this change to add new tests for
multi-size THP, where available.
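
As a shape-only illustration of the new calling convention (the stubs
below are stand-ins, not the real selftest plumbing):

/*
 * Sketch only: the per-case wrappers now carry a size through to
 * do_run_with_thp(). The real function mmaps, madvise()s and faults
 * the memory, then calls fn(mem, thpsize); here it is stubbed out.
 */
#include <stdio.h>
#include <stddef.h>

enum thp_run { THP_RUN_PMD, THP_RUN_PTE };
typedef void (*test_fn)(char *mem, size_t size);

static void do_run_with_thp(test_fn fn, enum thp_run mode, size_t thpsize)
{
	printf("prepare a %zu kB THP (mode %d), then run the test\n",
	       thpsize / 1024, mode);
	(void)fn;	/* real code: fn(mem, thpsize) */
}

static void my_test(char *mem, size_t size) { (void)mem; (void)size; }

int main(void)
{
	size_t pmdsize = 2UL << 20;	/* example 2M PMD size */

	do_run_with_thp(my_test, THP_RUN_PMD, pmdsize);		/* as before */
	do_run_with_thp(my_test, THP_RUN_PTE, 64 * 1024);	/* e.g. 64K mTHP */
	return 0;
}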

Signed-off-by: Ryan Roberts <[email protected]>
---
tools/testing/selftests/mm/cow.c | 121 +++++++++++++++++--------------
1 file changed, 67 insertions(+), 54 deletions(-)

diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
index 7324ce5363c0..4d0b5a125d3c 100644
--- a/tools/testing/selftests/mm/cow.c
+++ b/tools/testing/selftests/mm/cow.c
@@ -32,7 +32,7 @@

static size_t pagesize;
static int pagemap_fd;
-static size_t thpsize;
+static size_t pmdsize;
static int nr_hugetlbsizes;
static size_t hugetlbsizes[10];
static int gup_fd;
@@ -734,7 +734,7 @@ enum thp_run {
THP_RUN_PARTIAL_SHARED,
};

-static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
+static void do_run_with_thp(test_fn fn, enum thp_run thp_run, size_t thpsize)
{
char *mem, *mmap_mem, *tmp, *mremap_mem = MAP_FAILED;
size_t size, mmap_size, mremap_size;
@@ -759,11 +759,11 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
}

/*
- * Try to populate a THP. Touch the first sub-page and test if we get
- * another sub-page populated automatically.
+ * Try to populate a THP. Touch the first sub-page and test if
+ * we get the last sub-page populated automatically.
*/
mem[0] = 0;
- if (!pagemap_is_populated(pagemap_fd, mem + pagesize)) {
+ if (!pagemap_is_populated(pagemap_fd, mem + thpsize - pagesize)) {
ksft_test_result_skip("Did not get a THP populated\n");
goto munmap;
}
@@ -773,12 +773,14 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
switch (thp_run) {
case THP_RUN_PMD:
case THP_RUN_PMD_SWAPOUT:
+ assert(thpsize == pmdsize);
break;
case THP_RUN_PTE:
case THP_RUN_PTE_SWAPOUT:
/*
* Trigger PTE-mapping the THP by temporarily mapping a single
- * subpage R/O.
+ * subpage R/O. This is a noop if the THP is not pmdsize (and
+ * therefore already PTE-mapped).
*/
ret = mprotect(mem + pagesize, pagesize, PROT_READ);
if (ret) {
@@ -875,52 +877,60 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run)
munmap(mremap_mem, mremap_size);
}

-static void run_with_thp(test_fn fn, const char *desc)
+static void run_with_thp(test_fn fn, const char *desc, size_t size)
{
- ksft_print_msg("[RUN] %s ... with THP\n", desc);
- do_run_with_thp(fn, THP_RUN_PMD);
+ ksft_print_msg("[RUN] %s ... with THP (%zu kB)\n",
+ desc, size / 1024);
+ do_run_with_thp(fn, THP_RUN_PMD, size);
}

-static void run_with_thp_swap(test_fn fn, const char *desc)
+static void run_with_thp_swap(test_fn fn, const char *desc, size_t size)
{
- ksft_print_msg("[RUN] %s ... with swapped-out THP\n", desc);
- do_run_with_thp(fn, THP_RUN_PMD_SWAPOUT);
+ ksft_print_msg("[RUN] %s ... with swapped-out THP (%zu kB)\n",
+ desc, size / 1024);
+ do_run_with_thp(fn, THP_RUN_PMD_SWAPOUT, size);
}

-static void run_with_pte_mapped_thp(test_fn fn, const char *desc)
+static void run_with_pte_mapped_thp(test_fn fn, const char *desc, size_t size)
{
- ksft_print_msg("[RUN] %s ... with PTE-mapped THP\n", desc);
- do_run_with_thp(fn, THP_RUN_PTE);
+ ksft_print_msg("[RUN] %s ... with PTE-mapped THP (%zu kB)\n",
+ desc, size / 1024);
+ do_run_with_thp(fn, THP_RUN_PTE, size);
}

-static void run_with_pte_mapped_thp_swap(test_fn fn, const char *desc)
+static void run_with_pte_mapped_thp_swap(test_fn fn, const char *desc, size_t size)
{
- ksft_print_msg("[RUN] %s ... with swapped-out, PTE-mapped THP\n", desc);
- do_run_with_thp(fn, THP_RUN_PTE_SWAPOUT);
+ ksft_print_msg("[RUN] %s ... with swapped-out, PTE-mapped THP (%zu kB)\n",
+ desc, size / 1024);
+ do_run_with_thp(fn, THP_RUN_PTE_SWAPOUT, size);
}

-static void run_with_single_pte_of_thp(test_fn fn, const char *desc)
+static void run_with_single_pte_of_thp(test_fn fn, const char *desc, size_t size)
{
- ksft_print_msg("[RUN] %s ... with single PTE of THP\n", desc);
- do_run_with_thp(fn, THP_RUN_SINGLE_PTE);
+ ksft_print_msg("[RUN] %s ... with single PTE of THP (%zu kB)\n",
+ desc, size / 1024);
+ do_run_with_thp(fn, THP_RUN_SINGLE_PTE, size);
}

-static void run_with_single_pte_of_thp_swap(test_fn fn, const char *desc)
+static void run_with_single_pte_of_thp_swap(test_fn fn, const char *desc, size_t size)
{
- ksft_print_msg("[RUN] %s ... with single PTE of swapped-out THP\n", desc);
- do_run_with_thp(fn, THP_RUN_SINGLE_PTE_SWAPOUT);
+ ksft_print_msg("[RUN] %s ... with single PTE of swapped-out THP (%zu kB)\n",
+ desc, size / 1024);
+ do_run_with_thp(fn, THP_RUN_SINGLE_PTE_SWAPOUT, size);
}

-static void run_with_partial_mremap_thp(test_fn fn, const char *desc)
+static void run_with_partial_mremap_thp(test_fn fn, const char *desc, size_t size)
{
- ksft_print_msg("[RUN] %s ... with partially mremap()'ed THP\n", desc);
- do_run_with_thp(fn, THP_RUN_PARTIAL_MREMAP);
+ ksft_print_msg("[RUN] %s ... with partially mremap()'ed THP (%zu kB)\n",
+ desc, size / 1024);
+ do_run_with_thp(fn, THP_RUN_PARTIAL_MREMAP, size);
}

-static void run_with_partial_shared_thp(test_fn fn, const char *desc)
+static void run_with_partial_shared_thp(test_fn fn, const char *desc, size_t size)
{
- ksft_print_msg("[RUN] %s ... with partially shared THP\n", desc);
- do_run_with_thp(fn, THP_RUN_PARTIAL_SHARED);
+ ksft_print_msg("[RUN] %s ... with partially shared THP (%zu kB)\n",
+ desc, size / 1024);
+ do_run_with_thp(fn, THP_RUN_PARTIAL_SHARED, size);
}

static void run_with_hugetlb(test_fn fn, const char *desc, size_t hugetlbsize)
@@ -1091,15 +1101,15 @@ static void run_anon_test_case(struct test_case const *test_case)

run_with_base_page(test_case->fn, test_case->desc);
run_with_base_page_swap(test_case->fn, test_case->desc);
- if (thpsize) {
- run_with_thp(test_case->fn, test_case->desc);
- run_with_thp_swap(test_case->fn, test_case->desc);
- run_with_pte_mapped_thp(test_case->fn, test_case->desc);
- run_with_pte_mapped_thp_swap(test_case->fn, test_case->desc);
- run_with_single_pte_of_thp(test_case->fn, test_case->desc);
- run_with_single_pte_of_thp_swap(test_case->fn, test_case->desc);
- run_with_partial_mremap_thp(test_case->fn, test_case->desc);
- run_with_partial_shared_thp(test_case->fn, test_case->desc);
+ if (pmdsize) {
+ run_with_thp(test_case->fn, test_case->desc, pmdsize);
+ run_with_thp_swap(test_case->fn, test_case->desc, pmdsize);
+ run_with_pte_mapped_thp(test_case->fn, test_case->desc, pmdsize);
+ run_with_pte_mapped_thp_swap(test_case->fn, test_case->desc, pmdsize);
+ run_with_single_pte_of_thp(test_case->fn, test_case->desc, pmdsize);
+ run_with_single_pte_of_thp_swap(test_case->fn, test_case->desc, pmdsize);
+ run_with_partial_mremap_thp(test_case->fn, test_case->desc, pmdsize);
+ run_with_partial_shared_thp(test_case->fn, test_case->desc, pmdsize);
}
for (i = 0; i < nr_hugetlbsizes; i++)
run_with_hugetlb(test_case->fn, test_case->desc,
@@ -1120,7 +1130,7 @@ static int tests_per_anon_test_case(void)
{
int tests = 2 + nr_hugetlbsizes;

- if (thpsize)
+ if (pmdsize)
tests += 8;
return tests;
}
@@ -1329,7 +1339,7 @@ static void run_anon_thp_test_cases(void)
{
int i;

- if (!thpsize)
+ if (!pmdsize)
return;

ksft_print_msg("[INFO] Anonymous THP tests\n");
@@ -1338,13 +1348,13 @@ static void run_anon_thp_test_cases(void)
struct test_case const *test_case = &anon_thp_test_cases[i];

ksft_print_msg("[RUN] %s\n", test_case->desc);
- do_run_with_thp(test_case->fn, THP_RUN_PMD);
+ do_run_with_thp(test_case->fn, THP_RUN_PMD, pmdsize);
}
}

static int tests_per_anon_thp_test_case(void)
{
- return thpsize ? 1 : 0;
+ return pmdsize ? 1 : 0;
}

typedef void (*non_anon_test_fn)(char *mem, const char *smem, size_t size);
@@ -1419,7 +1429,7 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc)
}

/* For alignment purposes, we need twice the thp size. */
- mmap_size = 2 * thpsize;
+ mmap_size = 2 * pmdsize;
mmap_mem = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (mmap_mem == MAP_FAILED) {
@@ -1434,11 +1444,11 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc)
}

/* We need a THP-aligned memory area. */
- mem = (char *)(((uintptr_t)mmap_mem + thpsize) & ~(thpsize - 1));
- smem = (char *)(((uintptr_t)mmap_smem + thpsize) & ~(thpsize - 1));
+ mem = (char *)(((uintptr_t)mmap_mem + pmdsize) & ~(pmdsize - 1));
+ smem = (char *)(((uintptr_t)mmap_smem + pmdsize) & ~(pmdsize - 1));

- ret = madvise(mem, thpsize, MADV_HUGEPAGE);
- ret |= madvise(smem, thpsize, MADV_HUGEPAGE);
+ ret = madvise(mem, pmdsize, MADV_HUGEPAGE);
+ ret |= madvise(smem, pmdsize, MADV_HUGEPAGE);
if (ret) {
ksft_test_result_fail("MADV_HUGEPAGE failed\n");
goto munmap;
@@ -1457,7 +1467,7 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc)
goto munmap;
}

- fn(mem, smem, thpsize);
+ fn(mem, smem, pmdsize);
munmap:
munmap(mmap_mem, mmap_size);
if (mmap_smem != MAP_FAILED)
@@ -1650,7 +1660,7 @@ static void run_non_anon_test_case(struct non_anon_test_case const *test_case)
run_with_zeropage(test_case->fn, test_case->desc);
run_with_memfd(test_case->fn, test_case->desc);
run_with_tmpfile(test_case->fn, test_case->desc);
- if (thpsize)
+ if (pmdsize)
run_with_huge_zeropage(test_case->fn, test_case->desc);
for (i = 0; i < nr_hugetlbsizes; i++)
run_with_memfd_hugetlb(test_case->fn, test_case->desc,
@@ -1671,7 +1681,7 @@ static int tests_per_non_anon_test_case(void)
{
int tests = 3 + nr_hugetlbsizes;

- if (thpsize)
+ if (pmdsize)
tests += 1;
return tests;
}
@@ -1681,10 +1691,13 @@ int main(int argc, char **argv)
int err;

pagesize = getpagesize();
- thpsize = read_pmd_pagesize();
- if (thpsize)
+ pmdsize = read_pmd_pagesize();
+ if (pmdsize) {
+ ksft_print_msg("[INFO] detected PMD size: %zu KiB\n",
+ pmdsize / 1024);
ksft_print_msg("[INFO] detected THP size: %zu KiB\n",
- thpsize / 1024);
+ pmdsize / 1024);
+ }
nr_hugetlbsizes = detect_hugetlb_page_sizes(hugetlbsizes,
ARRAY_SIZE(hugetlbsizes));
detect_huge_zeropage();
--
2.25.1

2023-12-04 10:22:06

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v8 10/10] selftests/mm/cow: Add tests for anonymous multi-size THP

Add tests similar to the existing PMD-sized THP tests, but which operate
on memory backed by (PTE-mapped) multi-size THP. This reuses all the
existing infrastructure. If the test suite detects that multi-size THP
is not supported by the kernel, the new tests are skipped.
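
The per-size driving pattern, condensed into a minimal sketch (it
relies on the thp_settings helpers added earlier in this series; error
handling and the PMD-only runs present in the diff below are omitted):

/*
 * Sketch only: for each detected THP size, disable PMD-size THP,
 * enable just the size under test, run the PTE-mapped test variants,
 * then restore the previous settings.
 */
#include <stddef.h>
#include "thp_settings.h"	/* from earlier in this series */

static void run_one_size(size_t pagesize, size_t pmdsize, size_t size)
{
	struct thp_settings settings = *thp_current_settings();
	int order = __builtin_ctzll(size / pagesize);		/* sz2ord() */
	int pmd_order = __builtin_ctzll(pmdsize / pagesize);

	settings.hugepages[pmd_order].enabled = THP_NEVER;
	settings.hugepages[order].enabled = THP_ALWAYS;
	thp_push_settings(&settings);

	/* run_with_pte_mapped_thp(fn, desc, size); ... etc ... */

	thp_pop_settings();
}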

Signed-off-by: Ryan Roberts <[email protected]>
---
tools/testing/selftests/mm/cow.c | 84 +++++++++++++++++++++++++++-----
1 file changed, 72 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c
index 4d0b5a125d3c..37b4d7d28ae9 100644
--- a/tools/testing/selftests/mm/cow.c
+++ b/tools/testing/selftests/mm/cow.c
@@ -29,15 +29,49 @@
#include "../../../../mm/gup_test.h"
#include "../kselftest.h"
#include "vm_util.h"
+#include "thp_settings.h"

static size_t pagesize;
static int pagemap_fd;
static size_t pmdsize;
+static int nr_thpsizes;
+static size_t thpsizes[20];
static int nr_hugetlbsizes;
static size_t hugetlbsizes[10];
static int gup_fd;
static bool has_huge_zeropage;

+static int sz2ord(size_t size)
+{
+ return __builtin_ctzll(size / pagesize);
+}
+
+static int detect_thp_sizes(size_t sizes[], int max)
+{
+ int count = 0;
+ unsigned long orders;
+ size_t kb;
+ int i;
+
+ /* thp not supported at all. */
+ if (!pmdsize)
+ return 0;
+
+ orders = 1UL << sz2ord(pmdsize);
+ orders |= thp_supported_orders();
+
+ for (i = 0; orders && count < max; i++) {
+ if (!(orders & (1UL << i)))
+ continue;
+ orders &= ~(1UL << i);
+ kb = (pagesize >> 10) << i;
+ sizes[count++] = kb * 1024;
+ ksft_print_msg("[INFO] detected THP size: %zu KiB\n", kb);
+ }
+
+ return count;
+}
+
static void detect_huge_zeropage(void)
{
int fd = open("/sys/kernel/mm/transparent_hugepage/use_zero_page",
@@ -1101,15 +1135,27 @@ static void run_anon_test_case(struct test_case const *test_case)

run_with_base_page(test_case->fn, test_case->desc);
run_with_base_page_swap(test_case->fn, test_case->desc);
- if (pmdsize) {
- run_with_thp(test_case->fn, test_case->desc, pmdsize);
- run_with_thp_swap(test_case->fn, test_case->desc, pmdsize);
- run_with_pte_mapped_thp(test_case->fn, test_case->desc, pmdsize);
- run_with_pte_mapped_thp_swap(test_case->fn, test_case->desc, pmdsize);
- run_with_single_pte_of_thp(test_case->fn, test_case->desc, pmdsize);
- run_with_single_pte_of_thp_swap(test_case->fn, test_case->desc, pmdsize);
- run_with_partial_mremap_thp(test_case->fn, test_case->desc, pmdsize);
- run_with_partial_shared_thp(test_case->fn, test_case->desc, pmdsize);
+ for (i = 0; i < nr_thpsizes; i++) {
+ size_t size = thpsizes[i];
+ struct thp_settings settings = *thp_current_settings();
+
+ settings.hugepages[sz2ord(pmdsize)].enabled = THP_NEVER;
+ settings.hugepages[sz2ord(size)].enabled = THP_ALWAYS;
+ thp_push_settings(&settings);
+
+ if (size == pmdsize) {
+ run_with_thp(test_case->fn, test_case->desc, size);
+ run_with_thp_swap(test_case->fn, test_case->desc, size);
+ }
+
+ run_with_pte_mapped_thp(test_case->fn, test_case->desc, size);
+ run_with_pte_mapped_thp_swap(test_case->fn, test_case->desc, size);
+ run_with_single_pte_of_thp(test_case->fn, test_case->desc, size);
+ run_with_single_pte_of_thp_swap(test_case->fn, test_case->desc, size);
+ run_with_partial_mremap_thp(test_case->fn, test_case->desc, size);
+ run_with_partial_shared_thp(test_case->fn, test_case->desc, size);
+
+ thp_pop_settings();
}
for (i = 0; i < nr_hugetlbsizes; i++)
run_with_hugetlb(test_case->fn, test_case->desc,
@@ -1130,8 +1176,9 @@ static int tests_per_anon_test_case(void)
{
int tests = 2 + nr_hugetlbsizes;

+ tests += 6 * nr_thpsizes;
if (pmdsize)
- tests += 8;
+ tests += 2;
return tests;
}

@@ -1689,15 +1736,23 @@ static int tests_per_non_anon_test_case(void)
int main(int argc, char **argv)
{
int err;
+ struct thp_settings default_settings;

pagesize = getpagesize();
pmdsize = read_pmd_pagesize();
if (pmdsize) {
+ /* Only if THP is supported. */
+ thp_read_settings(&default_settings);
+ default_settings.hugepages[sz2ord(pmdsize)].enabled = THP_INHERIT;
+ thp_save_settings();
+ thp_push_settings(&default_settings);
+
ksft_print_msg("[INFO] detected PMD size: %zu KiB\n",
pmdsize / 1024);
- ksft_print_msg("[INFO] detected THP size: %zu KiB\n",
- pmdsize / 1024);
+
+ nr_thpsizes = detect_thp_sizes(thpsizes, ARRAY_SIZE(thpsizes));
}
+
nr_hugetlbsizes = detect_hugetlb_page_sizes(hugetlbsizes,
ARRAY_SIZE(hugetlbsizes));
detect_huge_zeropage();
@@ -1716,6 +1771,11 @@ int main(int argc, char **argv)
run_anon_thp_test_cases();
run_non_anon_test_cases();

+ if (pmdsize) {
+ /* Only if THP is supported. */
+ thp_restore_settings();
+ }
+
err = ksft_get_fail_cnt();
if (err)
ksft_exit_fail_msg("%d out of %d tests failed\n",
--
2.25.1

2023-12-04 19:30:53

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] Multi-size THP for anonymous memory

On Mon, 4 Dec 2023 10:20:17 +0000 Ryan Roberts <[email protected]> wrote:

> Hi All,
>
>
> Prerequisites
> =============
>
> Some work items identified as being prerequisites are listed on page 3 at [9].
> The summary is:
>
> | item | status |
> |:------------------------------|:------------------------|
> | mlock | In mainline (v6.7) |
> | madvise | In mainline (v6.6) |
> | compaction | v1 posted [10] |
> | numa balancing | Investigated: see below |
> | user-triggered page migration | In mainline (v6.7) |
> | khugepaged collapse | In mainline (NOP) |

What does "prerequisites" mean here? Won't compile without? Kernel
crashes without? Nice-to-have-after? Please expand on this.

I looked at [9], but access is denied.

> [9] https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA


2023-12-05 01:16:02

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP

On Mon, Dec 4, 2023 at 6:21 PM Ryan Roberts <[email protected]> wrote:
>
> Introduce the logic to allow THP to be configured (through the new sysfs
> interface we just added) to allocate large folios to back anonymous
> memory, which are larger than the base page size but smaller than
> PMD-size. We call this new THP extension "multi-size THP" (mTHP).
>
> mTHP continues to be PTE-mapped, but in many cases can still provide
> similar benefits to traditional PMD-sized THP: Page faults are
> significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
> the configured order), but latency spikes are much less prominent
> because the size of each page isn't as huge as the PMD-sized variant and
> there is less memory to clear in each page fault. The number of per-page
> operations (e.g. ref counting, rmap management, lru list management) are
> also significantly reduced since those ops now become per-folio.
>
> Some architectures also employ TLB compression mechanisms to squeeze
> more entries in when a set of PTEs are virtually and physically
> contiguous and appropriately aligned. In this case, TLB misses will
> occur less often.
>
> The new behaviour is disabled by default, but can be enabled at runtime
> by writing to /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled
> (see documentation in previous commit). The long term aim is to change
> the default to include suitable lower orders, but there are some risks
> around internal fragmentation that need to be better understood first.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> include/linux/huge_mm.h | 6 ++-
> mm/memory.c | 106 ++++++++++++++++++++++++++++++++++++----
> 2 files changed, 101 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index bd0eadd3befb..91a53b9835a4 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
> #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>
> /*
> - * Mask of all large folio orders supported for anonymous THP.
> + * Mask of all large folio orders supported for anonymous THP; all orders up to
> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
> + * (which is a limitation of the THP implementation).
> */
> -#define THP_ORDERS_ALL_ANON BIT(PMD_ORDER)
> +#define THP_ORDERS_ALL_ANON ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
>
> /*
> * Mask of all large folio orders supported for file THP.
> diff --git a/mm/memory.c b/mm/memory.c
> index 3ceeb0f45bf5..bf7e93813018 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4125,6 +4125,84 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> return ret;
> }
>
> +static bool pte_range_none(pte_t *pte, int nr_pages)
> +{
> + int i;
> +
> + for (i = 0; i < nr_pages; i++) {
> + if (!pte_none(ptep_get_lockless(pte + i)))
> + return false;
> + }
> +
> + return true;
> +}
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> +{
> + gfp_t gfp;
> + pte_t *pte;
> + unsigned long addr;
> + struct folio *folio;
> + struct vm_area_struct *vma = vmf->vma;
> + unsigned long orders;
> + int order;
> +
> + /*
> + * If uffd is active for the vma we need per-page fault fidelity to
> + * maintain the uffd semantics.
> + */
> + if (userfaultfd_armed(vma))
> + goto fallback;
> +
> + /*
> + * Get a list of all the (large) orders below PMD_ORDER that are enabled
> + * for this vma. Then filter out the orders that can't be allocated over
> + * the faulting address and still be fully contained in the vma.
> + */
> + orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> + BIT(PMD_ORDER) - 1);
> + orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> +
> + if (!orders)
> + goto fallback;
> +
> + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> + if (!pte)
> + return ERR_PTR(-EAGAIN);
> +
> + order = first_order(orders);
> + while (orders) {
> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> + vmf->pte = pte + pte_index(addr);
> + if (pte_range_none(vmf->pte, 1 << order))
> + break;
> + order = next_order(&orders, order);
> + }
> +
> + vmf->pte = NULL;
> + pte_unmap(pte);
> +
> + gfp = vma_thp_gfp_mask(vma);
> +
> + while (orders) {
> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> + folio = vma_alloc_folio(gfp, order, vma, addr, true);
> + if (folio) {
> + clear_huge_page(&folio->page, addr, 1 << order);

Minor.

Do we have to unconditionally clear the huge page here? Is it possible
to let post_alloc_hook() finish this job by using
__GFP_ZERO/__GFP_ZEROTAGS, as vma_alloc_zeroed_movable_folio() does?

struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
unsigned long vaddr)
{
gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;

/*
* If the page is mapped with PROT_MTE, initialise the tags at the
* point of allocation and page zeroing as this is usually faster than
* separate DC ZVA and STGM.
*/
if (vma->vm_flags & VM_MTE)
flags |= __GFP_ZEROTAGS;

return vma_alloc_folio(flags, 0, vma, vaddr, false);
}

> + return folio;
> + }
> + order = next_order(&orders, order);
> + }
> +
> +fallback:
> + return vma_alloc_zeroed_movable_folio(vma, vmf->address);
> +}
> +#else
> +#define alloc_anon_folio(vmf) \
> + vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
> +#endif
> +
> /*
> * We enter with non-exclusive mmap_lock (to exclude vma changes,
> * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -4132,6 +4210,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> */
> static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> {
> + int i;
> + int nr_pages = 1;
> + unsigned long addr = vmf->address;
> bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
> struct vm_area_struct *vma = vmf->vma;
> struct folio *folio;
> @@ -4176,10 +4257,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> /* Allocate our own private page. */
> if (unlikely(anon_vma_prepare(vma)))
> goto oom;
> - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> + folio = alloc_anon_folio(vmf);
> + if (IS_ERR(folio))
> + return 0;
> if (!folio)
> goto oom;
>
> + nr_pages = folio_nr_pages(folio);
> + addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> +
> if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
> goto oom_free_page;
> folio_throttle_swaprate(folio, GFP_KERNEL);
> @@ -4196,12 +4282,13 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> if (vma->vm_flags & VM_WRITE)
> entry = pte_mkwrite(pte_mkdirty(entry), vma);
>
> - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> - &vmf->ptl);
> + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
> if (!vmf->pte)
> goto release;
> - if (vmf_pte_changed(vmf)) {
> - update_mmu_tlb(vma, vmf->address, vmf->pte);
> + if ((nr_pages == 1 && vmf_pte_changed(vmf)) ||
> + (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages))) {
> + for (i = 0; i < nr_pages; i++)
> + update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
> goto release;
> }
>
> @@ -4216,16 +4303,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> return handle_userfault(vmf, VM_UFFD_MISSING);
> }
>
> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> - folio_add_new_anon_rmap(folio, vma, vmf->address);
> + folio_ref_add(folio, nr_pages - 1);
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> + folio_add_new_anon_rmap(folio, vma, addr);
> folio_add_lru_vma(folio, vma);
> setpte:
> if (uffd_wp)
> entry = pte_mkuffd_wp(entry);
> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> + set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
>
> /* No need to invalidate - it was non-present before */
> - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> + update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
> unlock:
> if (vmf->pte)
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> --
> 2.25.1
>

2023-12-05 01:25:15

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP

On Tue, Dec 5, 2023 at 9:15 AM Barry Song <[email protected]> wrote:
>
> On Mon, Dec 4, 2023 at 6:21 PM Ryan Roberts <[email protected]> wrote:
> >
> > Introduce the logic to allow THP to be configured (through the new sysfs
> > interface we just added) to allocate large folios to back anonymous
> > memory, which are larger than the base page size but smaller than
> > PMD-size. We call this new THP extension "multi-size THP" (mTHP).
> >
> > mTHP continues to be PTE-mapped, but in many cases can still provide
> > similar benefits to traditional PMD-sized THP: Page faults are
> > significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
> > the configured order), but latency spikes are much less prominent
> > because the size of each page isn't as huge as the PMD-sized variant and
> > there is less memory to clear in each page fault. The number of per-page
> > operations (e.g. ref counting, rmap management, lru list management) are
> > also significantly reduced since those ops now become per-folio.
> >
> > Some architectures also employ TLB compression mechanisms to squeeze
> > more entries in when a set of PTEs are virtually and physically
> > contiguous and appropriately aligned. In this case, TLB misses will
> > occur less often.
> >
> > The new behaviour is disabled by default, but can be enabled at runtime
> > by writing to /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled
> > (see documentation in previous commit). The long term aim is to change
> > the default to include suitable lower orders, but there are some risks
> > around internal fragmentation that need to be better understood first.
> >
> > Signed-off-by: Ryan Roberts <[email protected]>
> > ---
> > include/linux/huge_mm.h | 6 ++-
> > mm/memory.c | 106 ++++++++++++++++++++++++++++++++++++----
> > 2 files changed, 101 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index bd0eadd3befb..91a53b9835a4 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
> > #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> >
> > /*
> > - * Mask of all large folio orders supported for anonymous THP.
> > + * Mask of all large folio orders supported for anonymous THP; all orders up to
> > + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
> > + * (which is a limitation of the THP implementation).
> > */
> > -#define THP_ORDERS_ALL_ANON BIT(PMD_ORDER)
> > +#define THP_ORDERS_ALL_ANON ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
> >
> > /*
> > * Mask of all large folio orders supported for file THP.
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 3ceeb0f45bf5..bf7e93813018 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4125,6 +4125,84 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > return ret;
> > }
> >
> > +static bool pte_range_none(pte_t *pte, int nr_pages)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < nr_pages; i++) {
> > + if (!pte_none(ptep_get_lockless(pte + i)))
> > + return false;
> > + }
> > +
> > + return true;
> > +}
> > +
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> > +{
> > + gfp_t gfp;
> > + pte_t *pte;
> > + unsigned long addr;
> > + struct folio *folio;
> > + struct vm_area_struct *vma = vmf->vma;
> > + unsigned long orders;
> > + int order;
> > +
> > + /*
> > + * If uffd is active for the vma we need per-page fault fidelity to
> > + * maintain the uffd semantics.
> > + */
> > + if (userfaultfd_armed(vma))
> > + goto fallback;
> > +
> > + /*
> > + * Get a list of all the (large) orders below PMD_ORDER that are enabled
> > + * for this vma. Then filter out the orders that can't be allocated over
> > + * the faulting address and still be fully contained in the vma.
> > + */
> > + orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> > + BIT(PMD_ORDER) - 1);
> > + orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> > +
> > + if (!orders)
> > + goto fallback;
> > +
> > + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> > + if (!pte)
> > + return ERR_PTR(-EAGAIN);
> > +
> > + order = first_order(orders);
> > + while (orders) {
> > + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > + vmf->pte = pte + pte_index(addr);
> > + if (pte_range_none(vmf->pte, 1 << order))
> > + break;
> > + order = next_order(&orders, order);
> > + }
> > +
> > + vmf->pte = NULL;
> > + pte_unmap(pte);
> > +
> > + gfp = vma_thp_gfp_mask(vma);
> > +
> > + while (orders) {
> > + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > + folio = vma_alloc_folio(gfp, order, vma, addr, true);
> > + if (folio) {
> > + clear_huge_page(&folio->page, addr, 1 << order);
>
> Minor.
>
> Do we have to constantly clear a huge page? Is it possible to let
> post_alloc_hook()
> finish this job by using __GFP_ZERO/__GFP_ZEROTAGS as
> vma_alloc_zeroed_movable_folio() is doing?
>
> struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
> 						unsigned long vaddr)
> {
> 	gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;
>
> 	/*
> 	 * If the page is mapped with PROT_MTE, initialise the tags at the
> 	 * point of allocation and page zeroing as this is usually faster than
> 	 * separate DC ZVA and STGM.
> 	 */
> 	if (vma->vm_flags & VM_MTE)
> 		flags |= __GFP_ZEROTAGS;
>
> 	return vma_alloc_folio(flags, 0, vma, vaddr, false);
> }

I am asking this because Android and some other kernels might always set
CONFIG_INIT_ON_ALLOC_DEFAULT_ON, which means the extra explicit
clear_huge_page() call here ends up doing duplicated work.

When the condition below is true, post_alloc_hook() has already cleared the
huge page before vma_alloc_folio() returns the folio:

static inline bool want_init_on_alloc(gfp_t flags)
{
	if (static_branch_maybe(CONFIG_INIT_ON_ALLOC_DEFAULT_ON,
				&init_on_alloc))
		return true;
	return flags & __GFP_ZERO;
}
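
For illustration, here is a minimal sketch of what that alternative could
look like inside the allocation loop of alloc_anon_folio() above. This is an
assumption on my part, not code from this series, and whether __GFP_ZEROTAGS
is appropriate for this path is exactly the open question:

	/*
	 * Hypothetical variant: ask the page allocator to zero the folio
	 * (and initialise MTE tags where relevant) at allocation time,
	 * so no separate clear_huge_page() call is needed afterwards.
	 */
	gfp = vma_thp_gfp_mask(vma) | __GFP_ZERO;
	if (vma->vm_flags & VM_MTE)
		gfp |= __GFP_ZEROTAGS;

	folio = vma_alloc_folio(gfp, order, vma, addr, true);
	if (folio)
		return folio;	/* already zeroed by post_alloc_hook() */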


>
> > + return folio;
> > + }
> > + order = next_order(&orders, order);
> > + }
> > +
> > +fallback:
> > + return vma_alloc_zeroed_movable_folio(vma, vmf->address);
> > +}
> > +#else
> > +#define alloc_anon_folio(vmf) \
> > + vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
> > +#endif
> > +
> > /*
> > * We enter with non-exclusive mmap_lock (to exclude vma changes,
> > * but allow concurrent faults), and pte mapped but not yet locked.
> > @@ -4132,6 +4210,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > */
> > static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> > {
> > + int i;
> > + int nr_pages = 1;
> > + unsigned long addr = vmf->address;
> > bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
> > struct vm_area_struct *vma = vmf->vma;
> > struct folio *folio;
> > @@ -4176,10 +4257,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> > /* Allocate our own private page. */
> > if (unlikely(anon_vma_prepare(vma)))
> > goto oom;
> > - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> > + folio = alloc_anon_folio(vmf);
> > + if (IS_ERR(folio))
> > + return 0;
> > if (!folio)
> > goto oom;
> >
> > + nr_pages = folio_nr_pages(folio);
> > + addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> > +
> > if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
> > goto oom_free_page;
> > folio_throttle_swaprate(folio, GFP_KERNEL);
> > @@ -4196,12 +4282,13 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> > if (vma->vm_flags & VM_WRITE)
> > entry = pte_mkwrite(pte_mkdirty(entry), vma);
> >
> > - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> > - &vmf->ptl);
> > + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
> > if (!vmf->pte)
> > goto release;
> > - if (vmf_pte_changed(vmf)) {
> > - update_mmu_tlb(vma, vmf->address, vmf->pte);
> > + if ((nr_pages == 1 && vmf_pte_changed(vmf)) ||
> > + (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages))) {
> > + for (i = 0; i < nr_pages; i++)
> > + update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
> > goto release;
> > }
> >
> > @@ -4216,16 +4303,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> > return handle_userfault(vmf, VM_UFFD_MISSING);
> > }
> >
> > - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> > - folio_add_new_anon_rmap(folio, vma, vmf->address);
> > + folio_ref_add(folio, nr_pages - 1);
> > + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > + folio_add_new_anon_rmap(folio, vma, addr);
> > folio_add_lru_vma(folio, vma);
> > setpte:
> > if (uffd_wp)
> > entry = pte_mkuffd_wp(entry);
> > - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> > + set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
> >
> > /* No need to invalidate - it was non-present before */
> > - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> > + update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
> > unlock:
> > if (vmf->pte)
> > pte_unmap_unlock(vmf->pte, vmf->ptl);
> > --
> > 2.25.1
> >

2023-12-05 03:29:25

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] Multi-size THP for anonymous memory

On Mon, Dec 4, 2023 at 11:20 PM Ryan Roberts <[email protected]> wrote:
>
> Hi All,
>
> A new week, a new version, a new name... This is v8 of a series to implement
> multi-size THP (mTHP) for anonymous memory (previously called "small-sized THP"
> and "large anonymous folios"). Matthew objected to "small huge" so hopefully
> this fares better.
>
> The objective of this is to improve performance by allocating larger chunks of
> memory during anonymous page faults:
>
> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
> pages, there are efficiency savings to be had; fewer page faults, batched PTE
> and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
> overhead. This should benefit all architectures.
> 2) Since we are now mapping physically contiguous chunks of memory, we can take
> advantage of HW TLB compression techniques. A reduction in TLB pressure
> speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
> TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>
> This version changes the name and tidies up some of the kernel code and test
> code, based on feedback against v7 (see change log for details).
>
> By default, the existing behaviour (and performance) is maintained. The user
> must explicitly enable multi-size THP to see the performance benefit. This is
> done via a new sysfs interface (as recommended by David Hildenbrand - thanks to
> David for the suggestion)! This interface is inspired by the existing
> per-hugepage-size sysfs interface used by hugetlb, provides full backwards
> compatibility with the existing PMD-size THP interface, and provides a base for
> future extensibility. See [8] for detailed discussion of the interface.
>
> This series is based on mm-unstable (715b67adf4c8).
>
>
> Prerequisites
> =============
>
> Some work items identified as being prerequisites are listed on page 3 at [9].
> The summary is:
>
> | item | status |
> |:------------------------------|:------------------------|
> | mlock | In mainline (v6.7) |
> | madvise | In mainline (v6.6) |
> | compaction | v1 posted [10] |
> | numa balancing | Investigated: see below |
> | user-triggered page migration | In mainline (v6.7) |
> | khugepaged collapse | In mainline (NOP) |
>
> On NUMA balancing, which currently ignores any PTE-mapped THPs it encounters,
> John Hubbard has investigated this and concluded that it is A) not clear at the
> moment what a better policy might be for PTE-mapped THP and B) questions whether
> this should really be considered a prerequisite given no regression is caused
> for the default "multi-size THP disabled" case, and there is no correctness
> issue when it is enabled - its just a potential for non-optimal performance.
>
> If there are no disagreements about removing numa balancing from the list (none
> were raised when I first posted this comment against v7), then that just leaves
> compaction which is in review on list at the moment.
>
> I really would like to get this series (and its remaining comapction
> prerequisite) in for v6.8. I accept that it may be a bit optimistic at this
> point, but lets see where we get to with review?
>

Hi Ryan,

A question, though I don't think it should block this series: do we have any
plan to extend /proc/meminfo, /proc/pid/smaps and /proc/vmstat to present some
information regarding the new multi-size THP?

e.g. how many folios of each size exist system-wide, how many multi-size
folios are on the LRU, how many large folios are in each VMA, etc.

In products and labs, we need some health monitors to make sure the system
status is visible and works as expected. Right now, I feel like I am blindly
exploring the system without those statistics.

>
> Testing
> =======
>
> The series includes patches for mm selftests to enlighten the cow and khugepaged
> tests to explicitly test with multi-size THP, in the same way that PMD-sized
> THP is tested. The new tests all pass, and no regressions are observed in the mm
> selftest suite. I've also run my usual kernel compilation and java script
> benchmarks without any issues.
>
> Refer to my performance numbers posted with v6 [6]. (These are for multi-size
> THP only - they do not include the arm64 contpte follow-on series).
>
> John Hubbard at Nvidia has indicated dramatic 10x performance improvements for
> some workloads at [11]. (Observed using v6 of this series as well as the arm64
> contpte series).
>
> Kefeng Wang at Huawei has also indicated he sees improvements at [12] although
> there are some latency regressions also.
>
>
> Changes since v7 [7]
> ====================
>
> - Renamed "small-sized THP" -> "multi-size THP" in commit logs
> - Added various Reviewed-by/Tested-by tags (Barry, David, Alistair)
> - Patch 3:
> - Fine-tuned transhuge documentation multi-size THP (JohnH)
> - Converted hugepage_global_enabled() and hugepage_global_always() macros
> to static inline functions (JohnH)
> - Renamed hugepage_vma_check() to thp_vma_allowable_orders() (JohnH)
> - Renamed transhuge_vma_suitable() to thp_vma_suitable_orders() (JohnH)
> - Renamed "global" enabled sysfs file option to "inherit" (JohnH)
> - Patch 9:
> - cow selftest: Renamed param size -> thpsize (David)
> - cow selftest: Changed test fail to assert() (David)
> - cow selftest: Log PMD size separately from all the supported THP sizes
> (David)
> - Patch 10:
> - cow selftest: No longer special case pmdsize; keep all THP sizes in
> thpsizes[]
>
>
> Changes since v6 [6]
> ====================
>
> - Refactored vmf_pte_range_changed() to remove uffd special-case (suggested by
> JohnH)
> - Dropped accounting patch (#3 in v6) (suggested by DavidH)
> - Continue to account *PMD-sized* THP only for now
> - Can add more counters in future if needed
> - Page cache large folios haven't needed any new counters yet
> - Pivot to sysfs ABI proposed by DavidH
> - per-size directories in a similar shape to that used by hugetlb
> - Dropped "recommend" keyword patch (#6 in v6) (suggested by DavidH, Yu Zhou)
> - For now, users need to understand implicitly which sizes are beneficial
> to their HW/SW
> - Dropped arch_wants_pte_order() patch (#7 in v6)
> - No longer needed due to dropping patch "recommend" keyword patch
> - Enlightened khugepaged mm selftest to explicitly test with small-size THP
> - Scrubbed commit logs to use "small-sized THP" consistently (suggested by
> DavidH)
>
>
> Changes since v5 [5]
> ====================
>
> - Added accounting for PTE-mapped THPs (patch 3)
> - Added runtime control mechanism via sysfs as extension to THP (patch 4)
> - Minor refactoring of alloc_anon_folio() to integrate with runtime controls
> - Stripped out hardcoded policy for allocation order; its now all user space
> controlled (although user space can request "recommend" which will configure
> the HW-preferred order)
>
>
> Changes since v4 [4]
> ====================
>
> - Removed "arm64: mm: Override arch_wants_pte_order()" patch; arm64
> now uses the default order-3 size. I have moved this patch over to
> the contpte series.
> - Added "mm: Allow deferred splitting of arbitrary large anon folios" back
> into series. I originally removed this at v2 to add to a separate series,
> but that series has transformed significantly and it no longer fits, so
> bringing it back here.
> - Reintroduced dependency on set_ptes(); Originally dropped this at v2, but
> set_ptes() is in mm-unstable now.
> - Updated policy for when to allocate LAF; only fallback to order-0 if
> MADV_NOHUGEPAGE is present or if THP disabled via prctl; no longer rely on
> sysfs's never/madvise/always knob.
> - Fallback to order-0 whenever uffd is armed for the vma, not just when
> uffd-wp is set on the pte.
> - alloc_anon_folio() now returns `struct folio *`, where errors are encoded
> with ERR_PTR().
>
> The last 3 changes were proposed by Yu Zhao - thanks!
>
>
> Changes since v3 [3]
> ====================
>
> - Renamed feature from FLEXIBLE_THP to LARGE_ANON_FOLIO.
> - Removed `flexthp_unhinted_max` boot parameter. Discussion concluded that a
> sysctl is preferable but we will wait until real workload needs it.
> - Fixed uninitialized `addr` on read fault path in do_anonymous_page().
> - Added mm selftests for large anon folios in cow test suite.
>
>
> Changes since v2 [2]
> ====================
>
> - Dropped commit "Allow deferred splitting of arbitrary large anon folios"
> - Huang, Ying suggested the "batch zap" work (which I dropped from this
> series after v1) is a prerequisite for merging FLXEIBLE_THP, so I've
> moved the deferred split patch to a separate series along with the batch
> zap changes. I plan to submit this series early next week.
> - Changed folio order fallback policy
> - We no longer iterate from preferred to 0 looking for acceptable policy
> - Instead we iterate through preferred, PAGE_ALLOC_COSTLY_ORDER and 0 only
> - Removed vma parameter from arch_wants_pte_order()
> - Added command line parameter `flexthp_unhinted_max`
> - clamps preferred order when vma hasn't explicitly opted-in to THP
> - Never allocate large folio for MADV_NOHUGEPAGE vma (or when THP is disabled
> for process or system).
> - Simplified implementation and integration with do_anonymous_page()
> - Removed dependency on set_ptes()
>
>
> Changes since v1 [1]
> ====================
>
> - removed changes to arch-dependent vma_alloc_zeroed_movable_folio()
> - replaced with arch-independent alloc_anon_folio()
> - follows THP allocation approach
> - no longer retry with intermediate orders if allocation fails
> - fallback directly to order-0
> - remove folio_add_new_anon_rmap_range() patch
> - instead add its new functionality to folio_add_new_anon_rmap()
> - remove batch-zap pte mappings optimization patch
> - remove enabler folio_remove_rmap_range() patch too
> - These offer real perf improvement so will submit separately
> - simplify Kconfig
> - single FLEXIBLE_THP option, which is independent of arch
> - depends on TRANSPARENT_HUGEPAGE
> - when enabled default to max anon folio size of 64K unless arch
> explicitly overrides
> - simplify changes to do_anonymous_page():
> - no more retry loop
>
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
> [2] https://lore.kernel.org/linux-mm/[email protected]/
> [3] https://lore.kernel.org/linux-mm/[email protected]/
> [4] https://lore.kernel.org/linux-mm/[email protected]/
> [5] https://lore.kernel.org/linux-mm/[email protected]/
> [6] https://lore.kernel.org/linux-mm/[email protected]/
> [7] https://lore.kernel.org/linux-mm/[email protected]/
> [8] https://lore.kernel.org/linux-mm/[email protected]/
> [9] https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA
> [10] https://lore.kernel.org/linux-mm/[email protected]/
> [11] https://lore.kernel.org/linux-mm/[email protected]/
> [12] https://lore.kernel.org/linux-mm/[email protected]/
>
>
> Thanks,
> Ryan
>
> Ryan Roberts (10):
> mm: Allow deferred splitting of arbitrary anon large folios
> mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
> mm: thp: Introduce multi-size THP sysfs interface
> mm: thp: Support allocation of anonymous multi-size THP
> selftests/mm/kugepaged: Restore thp settings at exit
> selftests/mm: Factor out thp settings management
> selftests/mm: Support multi-size THP interface in thp_settings
> selftests/mm/khugepaged: Enlighten for multi-size THP
> selftests/mm/cow: Generalize do_run_with_thp() helper
> selftests/mm/cow: Add tests for anonymous multi-size THP
>
> Documentation/admin-guide/mm/transhuge.rst | 97 ++++-
> Documentation/filesystems/proc.rst | 6 +-
> fs/proc/task_mmu.c | 3 +-
> include/linux/huge_mm.h | 116 ++++--
> mm/huge_memory.c | 268 ++++++++++++--
> mm/khugepaged.c | 20 +-
> mm/memory.c | 114 +++++-
> mm/page_vma_mapped.c | 3 +-
> mm/rmap.c | 32 +-
> tools/testing/selftests/mm/Makefile | 4 +-
> tools/testing/selftests/mm/cow.c | 185 +++++++---
> tools/testing/selftests/mm/khugepaged.c | 410 ++++-----------------
> tools/testing/selftests/mm/run_vmtests.sh | 2 +
> tools/testing/selftests/mm/thp_settings.c | 349 ++++++++++++++++++
> tools/testing/selftests/mm/thp_settings.h | 80 ++++
> 15 files changed, 1177 insertions(+), 512 deletions(-)
> create mode 100644 tools/testing/selftests/mm/thp_settings.c
> create mode 100644 tools/testing/selftests/mm/thp_settings.h
>
> --
> 2.25.1
>

Thanks
Barry

2023-12-05 03:38:13

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] Multi-size THP for anonymous memory

On 12/4/23 02:20, Ryan Roberts wrote:
> Hi All,
>
> A new week, a new version, a new name... This is v8 of a series to implement
> multi-size THP (mTHP) for anonymous memory (previously called "small-sized THP"
> and "large anonymous folios"). Matthew objected to "small huge" so hopefully
> this fares better.
>
> The objective of this is to improve performance by allocating larger chunks of
> memory during anonymous page faults:
>
> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
> pages, there are efficiency savings to be had; fewer page faults, batched PTE
> and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
> overhead. This should benefit all architectures.
> 2) Since we are now mapping physically contiguous chunks of memory, we can take
> advantage of HW TLB compression techniques. A reduction in TLB pressure
> speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
> TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>
> This version changes the name and tidies up some of the kernel code and test
> code, based on feedback against v7 (see change log for details).

Using a couple of Armv8 systems, I've tested this patchset. I applied it
to top of tree (Linux 6.7-rc4), on top of your latest contig pte series
[1].

With those two patchsets applied, the mm selftests look OK--or at least
as OK as they normally do. I compared test runs with THP/mTHP set to
"always" vs "never", to verify that there were no new test failures.
Specifically, I set one particular page size (2 MB) to "inherit", and
then toggled /sys/kernel/mm/transparent_hugepage/enabled between
"always" and "never", roughly as sketched below.

I also re-ran my usual compute/AI benchmark, and I'm still seeing the
same 10x performance improvement that I reported for the v6 patchset.

So for this patchset and for [1] as well, please feel free to add:

Tested-by: John Hubbard <[email protected]>


[1] https://lore.kernel.org/all/[email protected]/


thanks,
--
John Hubbard
NVIDIA

>
> By default, the existing behaviour (and performance) is maintained. The user
> must explicitly enable multi-size THP to see the performance benefit. This is
> done via a new sysfs interface (as recommended by David Hildenbrand - thanks to
> David for the suggestion)! This interface is inspired by the existing
> per-hugepage-size sysfs interface used by hugetlb, provides full backwards
> compatibility with the existing PMD-size THP interface, and provides a base for
> future extensibility. See [8] for detailed discussion of the interface.
>
> This series is based on mm-unstable (715b67adf4c8).
>
>
> Prerequisites
> =============
>
> Some work items identified as being prerequisites are listed on page 3 at [9].
> The summary is:
>
> | item | status |
> |:------------------------------|:------------------------|
> | mlock | In mainline (v6.7) |
> | madvise | In mainline (v6.6) |
> | compaction | v1 posted [10] |
> | numa balancing | Investigated: see below |
> | user-triggered page migration | In mainline (v6.7) |
> | khugepaged collapse | In mainline (NOP) |
>
> On NUMA balancing, which currently ignores any PTE-mapped THPs it encounters,
> John Hubbard has investigated this and concluded that it is A) not clear at the
> moment what a better policy might be for PTE-mapped THP and B) questions whether
> this should really be considered a prerequisite given no regression is caused
> for the default "multi-size THP disabled" case, and there is no correctness
> issue when it is enabled - its just a potential for non-optimal performance.
>
> If there are no disagreements about removing numa balancing from the list (none
> were raised when I first posted this comment against v7), then that just leaves
> compaction which is in review on list at the moment.
>
> I really would like to get this series (and its remaining comapction
> prerequisite) in for v6.8. I accept that it may be a bit optimistic at this
> point, but lets see where we get to with review?
>
>
> Testing
> =======
>
> The series includes patches for mm selftests to enlighten the cow and khugepaged
> tests to explicitly test with multi-size THP, in the same way that PMD-sized
> THP is tested. The new tests all pass, and no regressions are observed in the mm
> selftest suite. I've also run my usual kernel compilation and java script
> benchmarks without any issues.
>
> Refer to my performance numbers posted with v6 [6]. (These are for multi-size
> THP only - they do not include the arm64 contpte follow-on series).
>
> John Hubbard at Nvidia has indicated dramatic 10x performance improvements for
> some workloads at [11]. (Observed using v6 of this series as well as the arm64
> contpte series).
>
> Kefeng Wang at Huawei has also indicated he sees improvements at [12] although
> there are some latency regressions also.
>
>
> Changes since v7 [7]
> ====================
>
> - Renamed "small-sized THP" -> "multi-size THP" in commit logs
> - Added various Reviewed-by/Tested-by tags (Barry, David, Alistair)
> - Patch 3:
> - Fine-tuned transhuge documentation multi-size THP (JohnH)
> - Converted hugepage_global_enabled() and hugepage_global_always() macros
> to static inline functions (JohnH)
> - Renamed hugepage_vma_check() to thp_vma_allowable_orders() (JohnH)
> - Renamed transhuge_vma_suitable() to thp_vma_suitable_orders() (JohnH)
> - Renamed "global" enabled sysfs file option to "inherit" (JohnH)
> - Patch 9:
> - cow selftest: Renamed param size -> thpsize (David)
> - cow selftest: Changed test fail to assert() (David)
> - cow selftest: Log PMD size separately from all the supported THP sizes
> (David)
> - Patch 10:
> - cow selftest: No longer special case pmdsize; keep all THP sizes in
> thpsizes[]
>
>
> Changes since v6 [6]
> ====================
>
> - Refactored vmf_pte_range_changed() to remove uffd special-case (suggested by
> JohnH)
> - Dropped accounting patch (#3 in v6) (suggested by DavidH)
> - Continue to account *PMD-sized* THP only for now
> - Can add more counters in future if needed
> - Page cache large folios haven't needed any new counters yet
> - Pivot to sysfs ABI proposed by DavidH
> - per-size directories in a similar shape to that used by hugetlb
> - Dropped "recommend" keyword patch (#6 in v6) (suggested by DavidH, Yu Zhou)
> - For now, users need to understand implicitly which sizes are beneficial
> to their HW/SW
> - Dropped arch_wants_pte_order() patch (#7 in v6)
> - No longer needed due to dropping patch "recommend" keyword patch
> - Enlightened khugepaged mm selftest to explicitly test with small-size THP
> - Scrubbed commit logs to use "small-sized THP" consistently (suggested by
> DavidH)
>
>
> Changes since v5 [5]
> ====================
>
> - Added accounting for PTE-mapped THPs (patch 3)
> - Added runtime control mechanism via sysfs as extension to THP (patch 4)
> - Minor refactoring of alloc_anon_folio() to integrate with runtime controls
> - Stripped out hardcoded policy for allocation order; its now all user space
> controlled (although user space can request "recommend" which will configure
> the HW-preferred order)
>
>
> Changes since v4 [4]
> ====================
>
> - Removed "arm64: mm: Override arch_wants_pte_order()" patch; arm64
> now uses the default order-3 size. I have moved this patch over to
> the contpte series.
> - Added "mm: Allow deferred splitting of arbitrary large anon folios" back
> into series. I originally removed this at v2 to add to a separate series,
> but that series has transformed significantly and it no longer fits, so
> bringing it back here.
> - Reintroduced dependency on set_ptes(); Originally dropped this at v2, but
> set_ptes() is in mm-unstable now.
> - Updated policy for when to allocate LAF; only fallback to order-0 if
> MADV_NOHUGEPAGE is present or if THP disabled via prctl; no longer rely on
> sysfs's never/madvise/always knob.
> - Fallback to order-0 whenever uffd is armed for the vma, not just when
> uffd-wp is set on the pte.
> - alloc_anon_folio() now returns `struct folio *`, where errors are encoded
> with ERR_PTR().
>
> The last 3 changes were proposed by Yu Zhao - thanks!
>
>
> Changes since v3 [3]
> ====================
>
> - Renamed feature from FLEXIBLE_THP to LARGE_ANON_FOLIO.
> - Removed `flexthp_unhinted_max` boot parameter. Discussion concluded that a
> sysctl is preferable but we will wait until real workload needs it.
> - Fixed uninitialized `addr` on read fault path in do_anonymous_page().
> - Added mm selftests for large anon folios in cow test suite.
>
>
> Changes since v2 [2]
> ====================
>
> - Dropped commit "Allow deferred splitting of arbitrary large anon folios"
> - Huang, Ying suggested the "batch zap" work (which I dropped from this
> series after v1) is a prerequisite for merging FLXEIBLE_THP, so I've
> moved the deferred split patch to a separate series along with the batch
> zap changes. I plan to submit this series early next week.
> - Changed folio order fallback policy
> - We no longer iterate from preferred to 0 looking for acceptable policy
> - Instead we iterate through preferred, PAGE_ALLOC_COSTLY_ORDER and 0 only
> - Removed vma parameter from arch_wants_pte_order()
> - Added command line parameter `flexthp_unhinted_max`
> - clamps preferred order when vma hasn't explicitly opted-in to THP
> - Never allocate large folio for MADV_NOHUGEPAGE vma (or when THP is disabled
> for process or system).
> - Simplified implementation and integration with do_anonymous_page()
> - Removed dependency on set_ptes()
>
>
> Changes since v1 [1]
> ====================
>
> - removed changes to arch-dependent vma_alloc_zeroed_movable_folio()
> - replaced with arch-independent alloc_anon_folio()
> - follows THP allocation approach
> - no longer retry with intermediate orders if allocation fails
> - fallback directly to order-0
> - remove folio_add_new_anon_rmap_range() patch
> - instead add its new functionality to folio_add_new_anon_rmap()
> - remove batch-zap pte mappings optimization patch
> - remove enabler folio_remove_rmap_range() patch too
> - These offer real perf improvement so will submit separately
> - simplify Kconfig
> - single FLEXIBLE_THP option, which is independent of arch
> - depends on TRANSPARENT_HUGEPAGE
> - when enabled default to max anon folio size of 64K unless arch
> explicitly overrides
> - simplify changes to do_anonymous_page():
> - no more retry loop
>
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
> [2] https://lore.kernel.org/linux-mm/[email protected]/
> [3] https://lore.kernel.org/linux-mm/[email protected]/
> [4] https://lore.kernel.org/linux-mm/[email protected]/
> [5] https://lore.kernel.org/linux-mm/[email protected]/
> [6] https://lore.kernel.org/linux-mm/[email protected]/
> [7] https://lore.kernel.org/linux-mm/[email protected]/
> [8] https://lore.kernel.org/linux-mm/[email protected]/
> [9] https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA
> [10] https://lore.kernel.org/linux-mm/[email protected]/
> [11] https://lore.kernel.org/linux-mm/[email protected]/
> [12] https://lore.kernel.org/linux-mm/[email protected]/
>
>
> Thanks,
> Ryan
>
> Ryan Roberts (10):
> mm: Allow deferred splitting of arbitrary anon large folios
> mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
> mm: thp: Introduce multi-size THP sysfs interface
> mm: thp: Support allocation of anonymous multi-size THP
> selftests/mm/kugepaged: Restore thp settings at exit
> selftests/mm: Factor out thp settings management
> selftests/mm: Support multi-size THP interface in thp_settings
> selftests/mm/khugepaged: Enlighten for multi-size THP
> selftests/mm/cow: Generalize do_run_with_thp() helper
> selftests/mm/cow: Add tests for anonymous multi-size THP
>
> Documentation/admin-guide/mm/transhuge.rst | 97 ++++-
> Documentation/filesystems/proc.rst | 6 +-
> fs/proc/task_mmu.c | 3 +-
> include/linux/huge_mm.h | 116 ++++--
> mm/huge_memory.c | 268 ++++++++++++--
> mm/khugepaged.c | 20 +-
> mm/memory.c | 114 +++++-
> mm/page_vma_mapped.c | 3 +-
> mm/rmap.c | 32 +-
> tools/testing/selftests/mm/Makefile | 4 +-
> tools/testing/selftests/mm/cow.c | 185 +++++++---
> tools/testing/selftests/mm/khugepaged.c | 410 ++++-----------------
> tools/testing/selftests/mm/run_vmtests.sh | 2 +
> tools/testing/selftests/mm/thp_settings.c | 349 ++++++++++++++++++
> tools/testing/selftests/mm/thp_settings.h | 80 ++++
> 15 files changed, 1177 insertions(+), 512 deletions(-)
> create mode 100644 tools/testing/selftests/mm/thp_settings.c
> create mode 100644 tools/testing/selftests/mm/thp_settings.h
>
> --
> 2.25.1
>


2023-12-05 04:21:47

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs interface

On Mon, Dec 4, 2023 at 11:21 PM Ryan Roberts <[email protected]> wrote:
>
> In preparation for adding support for anonymous multi-size THP,
> introduce new sysfs structure that will be used to control the new
> behaviours. A new directory is added under transparent_hugepage for each
> supported THP size, and contains an `enabled` file, which can be set to
> "inherit" (to inherit the global setting), "always", "madvise" or
> "never". For now, the kernel still only supports PMD-sized anonymous
> THP, so only 1 directory is populated.
>
> The first half of the change converts transhuge_vma_suitable() and
> hugepage_vma_check() so that they take a bitfield of orders for which
> the user wants to determine support, and the functions filter out all
> the orders that can't be supported, given the current sysfs
> configuration and the VMA dimensions. If there is only 1 order set in
> the input then the output can continue to be treated like a boolean;
> this is the case for most call sites. The resulting functions are
> renamed to thp_vma_suitable_orders() and thp_vma_allowable_orders()
> respectively.
>
> The second half of the change implements the new sysfs interface. It has
> been done so that each supported THP size has a `struct thpsize`, which
> describes the relevant metadata and is itself a kobject. This is pretty
> minimal for now, but should make it easy to add new per-thpsize files to
> the interface if needed in future (e.g. per-size defrag). Rather than
> keep the `enabled` state directly in the struct thpsize, I've elected to
> directly encode it into huge_anon_orders_[always|madvise|inherit]
> bitfields since this reduces the amount of work required in
> thp_vma_allowable_orders() which is called for every page fault.
>
> See Documentation/admin-guide/mm/transhuge.rst, as modified by this
> commit, for details of how the new sysfs interface works.
>
> Signed-off-by: Ryan Roberts <[email protected]>

Reviewed-by: Barry Song <[email protected]>

> -khugepaged will be automatically started when
> -transparent_hugepage/enabled is set to "always" or "madvise, and it'll
> -be automatically shutdown if it's set to "never".
> +khugepaged will be automatically started when one or more hugepage
> +sizes are enabled (either by directly setting "always" or "madvise",
> +or by setting "inherit" while the top-level enabled is set to "always"
> +or "madvise"), and it'll be automatically shutdown when the last
> +hugepage size is disabled (either by directly setting "never", or by
> +setting "inherit" while the top-level enabled is set to "never").
>
> Khugepaged controls
> -------------------
>
> +.. note::
> + khugepaged currently only searches for opportunities to collapse to
> + PMD-sized THP and no attempt is made to collapse to other THP
> + sizes.

For small-size THP, collapse is probably a bad idea. We prefer a one-shot
try in Android, especially as we are using 64KB and smaller large folio
sizes. If the page fault succeeds in getting large folios, we map large
folios; otherwise we give up, as that memory can be quite unstable: it may
be swapped out, swapped back in and madvised to be DONTNEED.

Too many compactions will increase power consumption and decrease UI
responsiveness.

Thanks
Barry

2023-12-05 09:34:41

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] Multi-size THP for anonymous memory

On 04/12/2023 19:30, Andrew Morton wrote:
> On Mon, 4 Dec 2023 10:20:17 +0000 Ryan Roberts <[email protected]> wrote:
>
>> Hi All,
>>
>>
>> Prerequisites
>> =============
>>
>> Some work items identified as being prerequisites are listed on page 3 at [9].
>> The summary is:
>>
>> | item | status |
>> |:------------------------------|:------------------------|
>> | mlock | In mainline (v6.7) |
>> | madvise | In mainline (v6.6) |
>> | compaction | v1 posted [10] |
>> | numa balancing | Investigated: see below |
>> | user-triggered page migration | In mainline (v6.7) |
>> | khugepaged collapse | In mainline (NOP) |
>
> What does "prerequisites" mean here? Won't compile without? Kernel
> crashes without? Nice-to-have-after? Please expand on this.

Short answer: It's supposed to mean things that either need to be done to prevent the mm from regressing (both correctness and performance) when multi-size THP is present but disabled, or things that need to be done to make the mm robust (but not necessarily optimally performant) when multi-size THP is enabled. But in reality, all of the things on the list could really be reclassified as "nice-to-have-after", IMHO; their absence will cause neither compilation nor runtime errors.

Longer answer: When I first started looking at this, I was advised that there were likely a number of corners which made assumptions about large folios always being PMD-sized and which, if not found and fixed, could lead to stability issues. At the time I was also pursuing a strategy of multi-size THP being a compile-time feature with no runtime control, so I decided it was important for multi-size THP to not effectively disable other features (e.g. various madvise ops used to ignore PTE-mapped large folios). This list represents all the things that I could find based on code review, as well as things suggested by others, and in the end, they all fall into that last category of "PTE-mapped large folios effectively disable existing features". But given we now have runtime controls to opt in to multi-size THP, I'm not sure we need to classify these as prerequisites. But I didn't want to unilaterally make that decision, given this list has previously been discussed and agreed by others.

It's also worth noting that in the case of compaction, that's already a problem for large folios in the page cache; large folios will be skipped.

>
> I looked at [9], but access is denied.

Sorry about that; it's owned by David Rientjes so I can't fix that for you. It's a PDF of a slide with the following table:

+-------------------------------+------------------------------------------------------------------------+--------------+--------------------+
| Item | Description | Assignee | Status |
+-------------------------------+------------------------------------------------------------------------+--------------+--------------------+
| mlock | Large, pte-mapped folios are ignored when mlock is requested. | Yin, Fengwei | In mainline (v6.7) |
| | Code comment for mlock_vma_folio() says "...filter out pte mappings | | |
| | of THPs which cannot be consistently counted: a pte mapping of the | | |
| | THP head cannot be distinguished by the page alone." | | |
| madvise | MADV_COLD, MADV_PAGEOUT, MADV_FREE: For large folios, code assumes | Yin, Fengwei | In mainline (v6.6) |
| | exclusive only if mapcount==1, else skips remainder of operation. | | |
| | For large, pte-mapped folios, exclusive folios can have mapcount | | |
| | upto nr_pages and still be exclusive. Even better; don't split | | |
| | the folio if it fits entirely within the range. | | |
| compaction | Raised at LSFMM: Compaction skips non-order-0 pages. | Zi Yan | v1 posted |
| | Already problem for page-cache pages today. | | |
| numa balancing | Large, pte-mapped folios are ignored by numa-balancing code. Commit | John Hubbard | Investigated: |
| | comment (e81c480): "We're going to have THP mapped with PTEs. It | | Not prerequisite |
| | will confuse numabalancing. Let's skip them for now." | | |
| user-triggered page migration | mm/migrate.c (migrate_pages syscall) We don't want to migrate folio | Kefeng Wang | In mainline (v6.7) |
| | that is shared. | | |
| khugepaged collapse | collapse small-sized THP to PMD-sized THP in khugepaged/MADV_COLLAPSE. | Ryan Roberts | In mainline (NOP) |
| | Kirill thinks khugepage should already be able to collapse | | |
| | small large folios to PMD-sized THP; verification required. | | |
+-------------------------------+------------------------------------------------------------------------+--------------+--------------------+

Thanks,
Ryan

>
>> [9] https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA
>
>

2023-12-05 09:50:40

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs interface

On 05/12/2023 04:21, Barry Song wrote:
> On Mon, Dec 4, 2023 at 11:21 PM Ryan Roberts <[email protected]> wrote:
>>
>> In preparation for adding support for anonymous multi-size THP,
>> introduce new sysfs structure that will be used to control the new
>> behaviours. A new directory is added under transparent_hugepage for each
>> supported THP size, and contains an `enabled` file, which can be set to
>> "inherit" (to inherit the global setting), "always", "madvise" or
>> "never". For now, the kernel still only supports PMD-sized anonymous
>> THP, so only 1 directory is populated.
>>
>> The first half of the change converts transhuge_vma_suitable() and
>> hugepage_vma_check() so that they take a bitfield of orders for which
>> the user wants to determine support, and the functions filter out all
>> the orders that can't be supported, given the current sysfs
>> configuration and the VMA dimensions. If there is only 1 order set in
>> the input then the output can continue to be treated like a boolean;
>> this is the case for most call sites. The resulting functions are
>> renamed to thp_vma_suitable_orders() and thp_vma_allowable_orders()
>> respectively.
>>
>> The second half of the change implements the new sysfs interface. It has
>> been done so that each supported THP size has a `struct thpsize`, which
>> describes the relevant metadata and is itself a kobject. This is pretty
>> minimal for now, but should make it easy to add new per-thpsize files to
>> the interface if needed in future (e.g. per-size defrag). Rather than
>> keep the `enabled` state directly in the struct thpsize, I've elected to
>> directly encode it into huge_anon_orders_[always|madvise|inherit]
>> bitfields since this reduces the amount of work required in
>> thp_vma_allowable_orders() which is called for every page fault.
>>
>> See Documentation/admin-guide/mm/transhuge.rst, as modified by this
>> commit, for details of how the new sysfs interface works.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>
> Reviewed-by: Barry Song <[email protected]>

Thanks!

>
>> -khugepaged will be automatically started when
>> -transparent_hugepage/enabled is set to "always" or "madvise, and it'll
>> -be automatically shutdown if it's set to "never".
>> +khugepaged will be automatically started when one or more hugepage
>> +sizes are enabled (either by directly setting "always" or "madvise",
>> +or by setting "inherit" while the top-level enabled is set to "always"
>> +or "madvise"), and it'll be automatically shutdown when the last
>> +hugepage size is disabled (either by directly setting "never", or by
>> +setting "inherit" while the top-level enabled is set to "never").
>>
>> Khugepaged controls
>> -------------------
>>
>> +.. note::
>> + khugepaged currently only searches for opportunities to collapse to
>> + PMD-sized THP and no attempt is made to collapse to other THP
>> + sizes.
>
> For small-size THP, collapse is probably a bad idea. We prefer a one-shot
> try in Android, especially as we are using 64KB and smaller large folio
> sizes. If the page fault succeeds in getting large folios, we map large
> folios; otherwise we give up, as that memory can be quite unstable: it may
> be swapped out, swapped back in and madvised to be DONTNEED.
>
> Too many compactions will increase power consumption and decrease UI
> responsiveness.

Understood; that's very useful information for the Android context. Multiple
people have commented, though, that the server context will eventually need
khugepaged (or something similar) to asynchronously collapse to contpte size.
One suggestion was actually a user space daemon that scans and collapses with
MADV_COLLAPSE. I suspect the key will be to ensure that whatever solution we
go for is flexible and can be enabled/disabled/configured for the different
environments.
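
For the record, a minimal userspace sketch of that idea (not from this
series; MADV_COLLAPSE is available since Linux 6.1 and currently collapses
to PMD size only, and a real scanner daemon would presumably need
process_madvise() or in-process integration rather than this self-madvise):

	#include <stdio.h>
	#include <sys/mman.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE	25	/* from <asm-generic/mman-common.h> */
	#endif

	/* Ask the kernel to synchronously collapse [addr, addr + len). */
	static int try_collapse(void *addr, size_t len)
	{
		if (madvise(addr, len, MADV_COLLAPSE)) {
			perror("madvise(MADV_COLLAPSE)");
			return -1;
		}
		return 0;
	}

	int main(void)
	{
		size_t len = 2UL * 1024 * 1024;	/* assumes 2M PMD size */
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		return try_collapse(p, len) ? 1 : 0;
	}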

>
> Thanks
> Barry

2023-12-05 09:58:39

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs interface

On 05.12.23 10:50, Ryan Roberts wrote:
> On 05/12/2023 04:21, Barry Song wrote:
>> On Mon, Dec 4, 2023 at 11:21 PM Ryan Roberts <[email protected]> wrote:
>>>
>>> In preparation for adding support for anonymous multi-size THP,
>>> introduce new sysfs structure that will be used to control the new
>>> behaviours. A new directory is added under transparent_hugepage for each
>>> supported THP size, and contains an `enabled` file, which can be set to
>>> "inherit" (to inherit the global setting), "always", "madvise" or
>>> "never". For now, the kernel still only supports PMD-sized anonymous
>>> THP, so only 1 directory is populated.
>>>
>>> The first half of the change converts transhuge_vma_suitable() and
>>> hugepage_vma_check() so that they take a bitfield of orders for which
>>> the user wants to determine support, and the functions filter out all
>>> the orders that can't be supported, given the current sysfs
>>> configuration and the VMA dimensions. If there is only 1 order set in
>>> the input then the output can continue to be treated like a boolean;
>>> this is the case for most call sites. The resulting functions are
>>> renamed to thp_vma_suitable_orders() and thp_vma_allowable_orders()
>>> respectively.
>>>
>>> The second half of the change implements the new sysfs interface. It has
>>> been done so that each supported THP size has a `struct thpsize`, which
>>> describes the relevant metadata and is itself a kobject. This is pretty
>>> minimal for now, but should make it easy to add new per-thpsize files to
>>> the interface if needed in future (e.g. per-size defrag). Rather than
>>> keep the `enabled` state directly in the struct thpsize, I've elected to
>>> directly encode it into huge_anon_orders_[always|madvise|inherit]
>>> bitfields since this reduces the amount of work required in
>>> thp_vma_allowable_orders() which is called for every page fault.
>>>
>>> See Documentation/admin-guide/mm/transhuge.rst, as modified by this
>>> commit, for details of how the new sysfs interface works.
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>
>> Reviewed-by: Barry Song <[email protected]>
>
> Thanks!
>
>>
>>> -khugepaged will be automatically started when
>>> -transparent_hugepage/enabled is set to "always" or "madvise, and it'll
>>> -be automatically shutdown if it's set to "never".
>>> +khugepaged will be automatically started when one or more hugepage
>>> +sizes are enabled (either by directly setting "always" or "madvise",
>>> +or by setting "inherit" while the top-level enabled is set to "always"
>>> +or "madvise"), and it'll be automatically shutdown when the last
>>> +hugepage size is disabled (either by directly setting "never", or by
>>> +setting "inherit" while the top-level enabled is set to "never").
>>>
>>> Khugepaged controls
>>> -------------------
>>>
>>> +.. note::
>>> + khugepaged currently only searches for opportunities to collapse to
>>> + PMD-sized THP and no attempt is made to collapse to other THP
>>> + sizes.
>>
>> For small-size THP, collapse is probably a bad idea. We prefer a one-shot
>> try in Android, especially as we are using 64KB and smaller large folio
>> sizes. If the page fault succeeds in getting large folios, we map large
>> folios; otherwise we give up, as that memory can be quite unstable: it may
>> be swapped out, swapped back in and madvised to be DONTNEED.
>>
>> Too many compactions will increase power consumption and decrease UI
>> responsiveness.
>
> Understood; that's very useful information for the Android context. Multiple
> people have commented, though, that the server context will eventually need
> khugepaged (or something similar) to asynchronously collapse to contpte size.
> One suggestion was actually a user space daemon that scans and collapses with
> MADV_COLLAPSE. I suspect the key will be to ensure that whatever solution we
> go for is flexible and can be enabled/disabled/configured for the different
> environments.

There certainly is interest in 2 MiB THP on arm64 with 64k pages, where
the PMD-sized THP would otherwise be 512 MiB. In that scenario, khugepaged
makes perfect sense.

--
Cheers,

David / dhildenb

2023-12-05 10:00:50

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v8 09/10] selftests/mm/cow: Generalize do_run_with_thp() helper

On 04.12.23 11:20, Ryan Roberts wrote:
> do_run_with_thp() prepares (PMD-sized) THP memory into different states
> before running tests. With the introduction of multi-size THP, we would
> like to reuse this logic to also test those smaller THP sizes. So let's
> add a thpsize parameter which tells the function what size THP it should
> operate on.
>
> A separate commit will utilize this change to add new tests for
> multi-size THP, where available.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---

Reviewed-by: David Hildenbrand <[email protected]>

--
Cheers,

David / dhildenb

2023-12-05 10:48:52

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP

On 05/12/2023 01:24, Barry Song wrote:
> On Tue, Dec 5, 2023 at 9:15 AM Barry Song <[email protected]> wrote:
>>
>> On Mon, Dec 4, 2023 at 6:21 PM Ryan Roberts <[email protected]> wrote:
>>>
>>> Introduce the logic to allow THP to be configured (through the new sysfs
>>> interface we just added) to allocate large folios to back anonymous
>>> memory, which are larger than the base page size but smaller than
>>> PMD-size. We call this new THP extension "multi-size THP" (mTHP).
>>>
>>> mTHP continues to be PTE-mapped, but in many cases can still provide
>>> similar benefits to traditional PMD-sized THP: Page faults are
>>> significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
>>> the configured order), but latency spikes are much less prominent
>>> because the size of each page isn't as huge as the PMD-sized variant and
>>> there is less memory to clear in each page fault. The number of per-page
>>> operations (e.g. ref counting, rmap management, lru list management) are
>>> also significantly reduced since those ops now become per-folio.
>>>
>>> Some architectures also employ TLB compression mechanisms to squeeze
>>> more entries in when a set of PTEs are virtually and physically
>>> contiguous and approporiately aligned. In this case, TLB misses will
>>> occur less often.
>>>
>>> The new behaviour is disabled by default, but can be enabled at runtime
>>> by writing to /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled
>>> (see documentation in previous commit). The long term aim is to change
>>> the default to include suitable lower orders, but there are some risks
>>> around internal fragmentation that need to be better understood first.
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> ---
>>> include/linux/huge_mm.h | 6 ++-
>>> mm/memory.c | 106 ++++++++++++++++++++++++++++++++++++----
>>> 2 files changed, 101 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index bd0eadd3befb..91a53b9835a4 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
>>> #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>>>
>>> /*
>>> - * Mask of all large folio orders supported for anonymous THP.
>>> + * Mask of all large folio orders supported for anonymous THP; all orders up to
>>> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
>>> + * (which is a limitation of the THP implementation).
>>> */
>>> -#define THP_ORDERS_ALL_ANON BIT(PMD_ORDER)
>>> +#define THP_ORDERS_ALL_ANON ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
>>>
>>> /*
>>> * Mask of all large folio orders supported for file THP.
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 3ceeb0f45bf5..bf7e93813018 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4125,6 +4125,84 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> return ret;
>>> }
>>>
>>> +static bool pte_range_none(pte_t *pte, int nr_pages)
>>> +{
>>> + int i;
>>> +
>>> + for (i = 0; i < nr_pages; i++) {
>>> + if (!pte_none(ptep_get_lockless(pte + i)))
>>> + return false;
>>> + }
>>> +
>>> + return true;
>>> +}
>>> +
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>>> +{
>>> + gfp_t gfp;
>>> + pte_t *pte;
>>> + unsigned long addr;
>>> + struct folio *folio;
>>> + struct vm_area_struct *vma = vmf->vma;
>>> + unsigned long orders;
>>> + int order;
>>> +
>>> + /*
>>> + * If uffd is active for the vma we need per-page fault fidelity to
>>> + * maintain the uffd semantics.
>>> + */
>>> + if (userfaultfd_armed(vma))
>>> + goto fallback;
>>> +
>>> + /*
>>> + * Get a list of all the (large) orders below PMD_ORDER that are enabled
>>> + * for this vma. Then filter out the orders that can't be allocated over
>>> + * the faulting address and still be fully contained in the vma.
>>> + */
>>> + orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
>>> + BIT(PMD_ORDER) - 1);
>>> + orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>>> +
>>> + if (!orders)
>>> + goto fallback;
>>> +
>>> + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>> + if (!pte)
>>> + return ERR_PTR(-EAGAIN);
>>> +
>>> + order = first_order(orders);
>>> + while (orders) {
>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>> + vmf->pte = pte + pte_index(addr);
>>> + if (pte_range_none(vmf->pte, 1 << order))
>>> + break;
>>> + order = next_order(&orders, order);
>>> + }
>>> +
>>> + vmf->pte = NULL;
>>> + pte_unmap(pte);
>>> +
>>> + gfp = vma_thp_gfp_mask(vma);
>>> +
>>> + while (orders) {
>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>> + folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>> + if (folio) {
>>> + clear_huge_page(&folio->page, addr, 1 << order);
>>
>> Minor.
>>
>> Do we have to constantly clear a huge page? Is it possible to let
>> post_alloc_hook()
>> finish this job by using __GFP_ZERO/__GFP_ZEROTAGS as
>> vma_alloc_zeroed_movable_folio() is doing?

I'm currently following the same allocation pattern as is done for PMD-sized
THP. In earlier versions of this patch I was trying to be smarter and use
__GFP_ZERO/__GFP_ZEROTAGS as you suggest, but I was advised to keep it simple
and follow the existing pattern.

I have a vague recollection __GFP_ZERO is not preferred for large folios because
of some issue with virtually indexed caches? (Matthew: did I see you mention
that in some other context?)

That said, I wasn't aware that Android ships with
CONFIG_INIT_ON_ALLOC_DEFAULT_ON (I thought it was only used as a debug option),
so I can see the potential for some overhead reduction here.

Options:

1) Leave it as is and accept the duplicated clearing
2) Pass __GFP_ZERO and remove clear_huge_page()
3) Define __GFP_SKIP_ZERO even when KASAN is not enabled and pass it down so
   clear_huge_page() is the only clear
4) Make clear_huge_page() conditional on !want_init_on_alloc()

I prefer option 4. What do you think?
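
To make option 4 concrete, a rough sketch of how it could look in the
allocation loop of alloc_anon_folio() quoted above (just a sketch to aid
discussion, not tested):

	folio = vma_alloc_folio(gfp, order, vma, addr, true);
	if (folio) {
		/*
		 * Skip the explicit clear if the allocator has already
		 * zeroed the folio (init_on_alloc=1, or __GFP_ZERO set
		 * in gfp).
		 */
		if (!want_init_on_alloc(gfp))
			clear_huge_page(&folio->page, addr, 1 << order);
		return folio;
	}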

As an aside, I've also noticed that clear_huge_page() should take vmf->address
so that it clears the faulting page last to keep the cache hot. If we decide on
an option that keeps clear_huge_page(), I'll also make that change.

Thanks,
Ryan

>>
>> struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
>> 						unsigned long vaddr)
>> {
>> 	gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;
>>
>> 	/*
>> 	 * If the page is mapped with PROT_MTE, initialise the tags at the
>> 	 * point of allocation and page zeroing as this is usually faster than
>> 	 * separate DC ZVA and STGM.
>> 	 */
>> 	if (vma->vm_flags & VM_MTE)
>> 		flags |= __GFP_ZEROTAGS;
>>
>> 	return vma_alloc_folio(flags, 0, vma, vaddr, false);
>> }
>
> I am asking this because Android and some other kernels might always set
> CONFIG_INIT_ON_ALLOC_DEFAULT_ON, which means the extra explicit
> clear_huge_page() call here ends up doing duplicated work.
>
> When the condition below is true, post_alloc_hook() has already cleared the
> huge page before vma_alloc_folio() returns the folio:
>
> static inline bool want_init_on_alloc(gfp_t flags)
> {
> 	if (static_branch_maybe(CONFIG_INIT_ON_ALLOC_DEFAULT_ON,
> 				&init_on_alloc))
> 		return true;
> 	return flags & __GFP_ZERO;
> }
>
>
>>
>>> + return folio;
>>> + }
>>> + order = next_order(&orders, order);
>>> + }
>>> +
>>> +fallback:
>>> + return vma_alloc_zeroed_movable_folio(vma, vmf->address);
>>> +}
>>> +#else
>>> +#define alloc_anon_folio(vmf) \
>>> + vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
>>> +#endif
>>> +
>>> /*
>>> * We enter with non-exclusive mmap_lock (to exclude vma changes,
>>> * but allow concurrent faults), and pte mapped but not yet locked.
>>> @@ -4132,6 +4210,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> */
>>> static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>> {
>>> + int i;
>>> + int nr_pages = 1;
>>> + unsigned long addr = vmf->address;
>>> bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
>>> struct vm_area_struct *vma = vmf->vma;
>>> struct folio *folio;
>>> @@ -4176,10 +4257,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>> /* Allocate our own private page. */
>>> if (unlikely(anon_vma_prepare(vma)))
>>> goto oom;
>>> - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>>> + folio = alloc_anon_folio(vmf);
>>> + if (IS_ERR(folio))
>>> + return 0;
>>> if (!folio)
>>> goto oom;
>>>
>>> + nr_pages = folio_nr_pages(folio);
>>> + addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>>> +
>>> if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>>> goto oom_free_page;
>>> folio_throttle_swaprate(folio, GFP_KERNEL);
>>> @@ -4196,12 +4282,13 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>> if (vma->vm_flags & VM_WRITE)
>>> entry = pte_mkwrite(pte_mkdirty(entry), vma);
>>>
>>> - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>>> - &vmf->ptl);
>>> + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>>> if (!vmf->pte)
>>> goto release;
>>> - if (vmf_pte_changed(vmf)) {
>>> - update_mmu_tlb(vma, vmf->address, vmf->pte);
>>> + if ((nr_pages == 1 && vmf_pte_changed(vmf)) ||
>>> + (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages))) {
>>> + for (i = 0; i < nr_pages; i++)
>>> + update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>>> goto release;
>>> }
>>>
>>> @@ -4216,16 +4303,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>> return handle_userfault(vmf, VM_UFFD_MISSING);
>>> }
>>>
>>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>> - folio_add_new_anon_rmap(folio, vma, vmf->address);
>>> + folio_ref_add(folio, nr_pages - 1);
>>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>> + folio_add_new_anon_rmap(folio, vma, addr);
>>> folio_add_lru_vma(folio, vma);
>>> setpte:
>>> if (uffd_wp)
>>> entry = pte_mkuffd_wp(entry);
>>> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>>> + set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
>>>
>>> /* No need to invalidate - it was non-present before */
>>> - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>>> + update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
>>> unlock:
>>> if (vmf->pte)
>>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>>> --
>>> 2.25.1
>>>

2023-12-05 10:51:06

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs interface

On 05/12/2023 09:57, David Hildenbrand wrote:
> On 05.12.23 10:50, Ryan Roberts wrote:
>> On 05/12/2023 04:21, Barry Song wrote:
>>> On Mon, Dec 4, 2023 at 11:21 PM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> In preparation for adding support for anonymous multi-size THP,
>>>> introduce new sysfs structure that will be used to control the new
>>>> behaviours. A new directory is added under transparent_hugepage for each
>>>> supported THP size, and contains an `enabled` file, which can be set to
>>>> "inherit" (to inherit the global setting), "always", "madvise" or
>>>> "never". For now, the kernel still only supports PMD-sized anonymous
>>>> THP, so only 1 directory is populated.
>>>>
>>>> The first half of the change converts transhuge_vma_suitable() and
>>>> hugepage_vma_check() so that they take a bitfield of orders for which
>>>> the user wants to determine support, and the functions filter out all
>>>> the orders that can't be supported, given the current sysfs
>>>> configuration and the VMA dimensions. If there is only 1 order set in
>>>> the input then the output can continue to be treated like a boolean;
>>>> this is the case for most call sites. The resulting functions are
>>>> renamed to thp_vma_suitable_orders() and thp_vma_allowable_orders()
>>>> respectively.
>>>>
>>>> The second half of the change implements the new sysfs interface. It has
>>>> been done so that each supported THP size has a `struct thpsize`, which
>>>> describes the relevant metadata and is itself a kobject. This is pretty
>>>> minimal for now, but should make it easy to add new per-thpsize files to
>>>> the interface if needed in future (e.g. per-size defrag). Rather than
>>>> keep the `enabled` state directly in the struct thpsize, I've elected to
>>>> directly encode it into huge_anon_orders_[always|madvise|inherit]
>>>> bitfields since this reduces the amount of work required in
>>>> thp_vma_allowable_orders() which is called for every page fault.
>>>>
>>>> See Documentation/admin-guide/mm/transhuge.rst, as modified by this
>>>> commit, for details of how the new sysfs interface works.
>>>>
>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>
>>> Reviewed-by: Barry Song <[email protected]>
>>
>> Thanks!
>>
>>>
>>>> -khugepaged will be automatically started when
>>>> -transparent_hugepage/enabled is set to "always" or "madvise, and it'll
>>>> -be automatically shutdown if it's set to "never".
>>>> +khugepaged will be automatically started when one or more hugepage
>>>> +sizes are enabled (either by directly setting "always" or "madvise",
>>>> +or by setting "inherit" while the top-level enabled is set to "always"
>>>> +or "madvise"), and it'll be automatically shutdown when the last
>>>> +hugepage size is disabled (either by directly setting "never", or by
>>>> +setting "inherit" while the top-level enabled is set to "never").
>>>>
>>>>   Khugepaged controls
>>>>   -------------------
>>>>
>>>> +.. note::
>>>> +   khugepaged currently only searches for opportunities to collapse to
>>>> +   PMD-sized THP and no attempt is made to collapse to other THP
>>>> +   sizes.
>>>
>>> For small-size THP, collapse is probably a bad idea. We like a one-shot
>>> try in Android, especially as we are using 64KB and smaller large folio
>>> sizes. If the page fault succeeds in getting large folios, we map large
>>> folios; otherwise we give up, as those memories can be quite unstably
>>> swapped out, swapped in and madvised to be DONTNEED.
>>>
>>> Too many compactions will increase power consumption and decrease UI
>>> responsiveness.
>>
>> Understood; that's very useful information for the Android context. Multiple
>> people have commented, though, that the server context will eventually need
>> khugepaged (or something similar) support to asynchronously collapse to
>> contpte size.
>> Actually one suggestion was a user space daemon that scans and collapses with
>> MADV_COLLAPSE. I suspect the key will be to ensure whatever solution we go for
>> is flexible and can be enabled/disabled/configured for the different
>> environments.
>
> There certainly is interest in 2 MiB THP on arm64 with 64k base pages, where
> the THP size would normally be 512 MiB. In that scenario, khugepaged makes
> perfect sense.

Indeed

2023-12-05 11:06:39

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] Multi-size THP for anonymous memory

On 05/12/2023 03:28, Barry Song wrote:
> On Mon, Dec 4, 2023 at 11:20 PM Ryan Roberts <[email protected]> wrote:
>>
>> Hi All,
>>
>> A new week, a new version, a new name... This is v8 of a series to implement
>> multi-size THP (mTHP) for anonymous memory (previously called "small-sized THP"
>> and "large anonymous folios"). Matthew objected to "small huge" so hopefully
>> this fares better.
>>
>> The objective of this is to improve performance by allocating larger chunks of
>> memory during anonymous page faults:
>>
>> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>> pages, there are efficiency savings to be had; fewer page faults, batched PTE
>> and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
>> overhead. This should benefit all architectures.
>> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>> advantage of HW TLB compression techniques. A reduction in TLB pressure
>> speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>> TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>
>> This version changes the name and tidies up some of the kernel code and test
>> code, based on feedback against v7 (see change log for details).
>>
>> By default, the existing behaviour (and performance) is maintained. The user
>> must explicitly enable multi-size THP to see the performance benefit. This is
>> done via a new sysfs interface (as recommended by David Hildenbrand - thanks to
>> David for the suggestion)! This interface is inspired by the existing
>> per-hugepage-size sysfs interface used by hugetlb, provides full backwards
>> compatibility with the existing PMD-size THP interface, and provides a base for
>> future extensibility. See [8] for detailed discussion of the interface.
>>
>> This series is based on mm-unstable (715b67adf4c8).
>>
>>
>> Prerequisites
>> =============
>>
>> Some work items identified as being prerequisites are listed on page 3 at [9].
>> The summary is:
>>
>> | item | status |
>> |:------------------------------|:------------------------|
>> | mlock | In mainline (v6.7) |
>> | madvise | In mainline (v6.6) |
>> | compaction | v1 posted [10] |
>> | numa balancing | Investigated: see below |
>> | user-triggered page migration | In mainline (v6.7) |
>> | khugepaged collapse | In mainline (NOP) |
>>
>> On NUMA balancing, which currently ignores any PTE-mapped THPs it encounters,
>> John Hubbard has investigated this and concluded that it is A) not clear at the
>> moment what a better policy might be for PTE-mapped THP and B) questions whether
>> this should really be considered a prerequisite given no regression is caused
>> for the default "multi-size THP disabled" case, and there is no correctness
>> issue when it is enabled - its just a potential for non-optimal performance.
>>
>> If there are no disagreements about removing numa balancing from the list (none
>> were raised when I first posted this comment against v7), then that just leaves
>> compaction which is in review on list at the moment.
>>
>> I really would like to get this series (and its remaining comapction
>> prerequisite) in for v6.8. I accept that it may be a bit optimistic at this
>> point, but lets see where we get to with review?
>>
>
> Hi Ryan,
>
> A question, though I don't think it should block this series: do we have any
> plan to extend /proc/meminfo, /proc/pid/smaps, /proc/vmstat to present some
> information regarding the new multi-size THP?
>
> e.g. how many folios of each size exist in the system, how many multi-size
> folios are on the LRU, how many large folios are in each VMA, etc.
>
> In products and labs, we need some health monitors to make sure the system
> status is visible and works as expected. Right now, I feel like I am blindly
> exploring the system without those statistics.

Yes it's definitely on the list. I had a patch in v6 that added various stats.
But after discussion with David, it became clear there were a few issues with
the implementation and I ripped it out. We also decided that since the page
cache already uses large folios and we don't have counters for those, we could
probably live (initially at least) without counters for multi-size THP too. But
you are the second person to raise this in as many weeks, so clearly this should
be at the top of the list for enhancements after this initial merge.

For now, you can parse /proc/<pid>/pagemap to see how well multi-size THP is
being utilized. That's not a simple interface though. Yu Zhou shared a Python
script a while back. I wonder if there is value in tidying that up and putting
it in tools/mm in the short term?
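
To give a flavour of what such a tool could report, here is a rough, untested
C sketch (not the Python script mentioned above) that counts runs of
physically contiguous, present pages in a virtual range by reading pagemap.
It assumes 4K base pages and needs root (or CAP_SYS_ADMIN) to see real PFNs:

	#include <fcntl.h>
	#include <inttypes.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define PAGE_SHIFT	12			/* assumes 4K base pages */
	#define PM_PFN_MASK	((1ULL << 55) - 1)	/* bits 0-54: PFN */
	#define PM_PRESENT	(1ULL << 63)

	int main(int argc, char **argv)
	{
		char path[64];
		uint64_t start, *ents;
		long npages, i, run = 0, runs16 = 0;
		int fd;

		if (argc != 4) {
			fprintf(stderr, "usage: %s <pid> <start_hex> <npages>\n", argv[0]);
			return 1;
		}

		snprintf(path, sizeof(path), "/proc/%s/pagemap", argv[1]);
		fd = open(path, O_RDONLY);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		start = strtoull(argv[2], NULL, 16);
		npages = strtol(argv[3], NULL, 0);
		ents = calloc(npages, sizeof(*ents));

		/* One 64-bit pagemap entry per base page in the range. */
		if (pread(fd, ents, npages * sizeof(*ents),
			  (start >> PAGE_SHIFT) * sizeof(*ents)) < 0) {
			perror("pread");
			return 1;
		}

		/* Count runs of >= 16 contiguous present pages (64K with 4K pages). */
		for (i = 0; i < npages; i++) {
			if ((ents[i] & PM_PRESENT) &&
			    (run == 0 || (ents[i] & PM_PFN_MASK) ==
					 (ents[i - 1] & PM_PFN_MASK) + 1))
				run++;
			else
				run = (ents[i] & PM_PRESENT) ? 1 : 0;
			if (run == 16)
				runs16++;
		}

		printf("runs of >=16 physically contiguous pages: %ld\n", runs16);
		close(fd);
		free(ents);
		return 0;
	}

Physical contiguity is only a proxy of course - it doesn't prove the pages
belong to the same folio - but it's usually good enough to eyeball whether
mTHP allocations are happening.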

There were 2 main issues with the previous implementation:

1) What should the semantics of a counter be? A PTE-mapped THP can be partially
unmapped or mremapped. So should we count the number of pages from a folio of a
given size that are mapped (easy), or should we only count when the whole folio
is contiguously mapped? (I'm sure there are many other semantics we could
consider.) The latter is not easy to detect at the moment - perhaps the work
David has been doing on tidying up the rmap functions might help?

2) How should we expose the info? There has been pushback against extending
sysfs files that expose multiple pieces of data, so David has suggested that in
the long term it might be good to completely redesign the stats interface.

It's certainly something that needs a lot more discussion - input encouraged!

Thanks,
Ryan


>
>>
>> Testing
>> =======
>>
>> The series includes patches for mm selftests to enlighten the cow and khugepaged
>> tests to explicitly test with multi-size THP, in the same way that PMD-sized
>> THP is tested. The new tests all pass, and no regressions are observed in the mm
>> selftest suite. I've also run my usual kernel compilation and java script
>> benchmarks without any issues.
>>
>> Refer to my performance numbers posted with v6 [6]. (These are for multi-size
>> THP only - they do not include the arm64 contpte follow-on series).
>>
>> John Hubbard at Nvidia has indicated dramatic 10x performance improvements for
>> some workloads at [11]. (Observed using v6 of this series as well as the arm64
>> contpte series).
>>
>> Kefeng Wang at Huawei has also indicated he sees improvements at [12] although
>> there are some latency regressions also.
>>
>>
>> Changes since v7 [7]
>> ====================
>>
>> - Renamed "small-sized THP" -> "multi-size THP" in commit logs
>> - Added various Reviewed-by/Tested-by tags (Barry, David, Alistair)
>> - Patch 3:
>> - Fine-tuned transhuge documentation multi-size THP (JohnH)
>> - Converted hugepage_global_enabled() and hugepage_global_always() macros
>> to static inline functions (JohnH)
>> - Renamed hugepage_vma_check() to thp_vma_allowable_orders() (JohnH)
>> - Renamed transhuge_vma_suitable() to thp_vma_suitable_orders() (JohnH)
>> - Renamed "global" enabled sysfs file option to "inherit" (JohnH)
>> - Patch 9:
>> - cow selftest: Renamed param size -> thpsize (David)
>> - cow selftest: Changed test fail to assert() (David)
>> - cow selftest: Log PMD size separately from all the supported THP sizes
>> (David)
>> - Patch 10:
>> - cow selftest: No longer special case pmdsize; keep all THP sizes in
>> thpsizes[]
>>
>>
>> Changes since v6 [6]
>> ====================
>>
>> - Refactored vmf_pte_range_changed() to remove uffd special-case (suggested by
>> JohnH)
>> - Dropped accounting patch (#3 in v6) (suggested by DavidH)
>> - Continue to account *PMD-sized* THP only for now
>> - Can add more counters in future if needed
>> - Page cache large folios haven't needed any new counters yet
>> - Pivot to sysfs ABI proposed by DavidH
>> - per-size directories in a similar shape to that used by hugetlb
>> - Dropped "recommend" keyword patch (#6 in v6) (suggested by DavidH, Yu Zhou)
>> - For now, users need to understand implicitly which sizes are beneficial
>> to their HW/SW
>> - Dropped arch_wants_pte_order() patch (#7 in v6)
>> - No longer needed due to dropping patch "recommend" keyword patch
>> - Enlightened khugepaged mm selftest to explicitly test with small-size THP
>> - Scrubbed commit logs to use "small-sized THP" consistently (suggested by
>> DavidH)
>>
>>
>> Changes since v5 [5]
>> ====================
>>
>> - Added accounting for PTE-mapped THPs (patch 3)
>> - Added runtime control mechanism via sysfs as extension to THP (patch 4)
>> - Minor refactoring of alloc_anon_folio() to integrate with runtime controls
>> - Stripped out hardcoded policy for allocation order; its now all user space
>> controlled (although user space can request "recommend" which will configure
>> the HW-preferred order)
>>
>>
>> Changes since v4 [4]
>> ====================
>>
>> - Removed "arm64: mm: Override arch_wants_pte_order()" patch; arm64
>> now uses the default order-3 size. I have moved this patch over to
>> the contpte series.
>> - Added "mm: Allow deferred splitting of arbitrary large anon folios" back
>> into series. I originally removed this at v2 to add to a separate series,
>> but that series has transformed significantly and it no longer fits, so
>> bringing it back here.
>> - Reintroduced dependency on set_ptes(); Originally dropped this at v2, but
>> set_ptes() is in mm-unstable now.
>> - Updated policy for when to allocate LAF; only fallback to order-0 if
>> MADV_NOHUGEPAGE is present or if THP disabled via prctl; no longer rely on
>> sysfs's never/madvise/always knob.
>> - Fallback to order-0 whenever uffd is armed for the vma, not just when
>> uffd-wp is set on the pte.
>> - alloc_anon_folio() now returns `struct folio *`, where errors are encoded
>> with ERR_PTR().
>>
>> The last 3 changes were proposed by Yu Zhao - thanks!
>>
>>
>> Changes since v3 [3]
>> ====================
>>
>> - Renamed feature from FLEXIBLE_THP to LARGE_ANON_FOLIO.
>> - Removed `flexthp_unhinted_max` boot parameter. Discussion concluded that a
>> sysctl is preferable but we will wait until real workload needs it.
>> - Fixed uninitialized `addr` on read fault path in do_anonymous_page().
>> - Added mm selftests for large anon folios in cow test suite.
>>
>>
>> Changes since v2 [2]
>> ====================
>>
>> - Dropped commit "Allow deferred splitting of arbitrary large anon folios"
>> - Huang, Ying suggested the "batch zap" work (which I dropped from this
>> series after v1) is a prerequisite for merging FLXEIBLE_THP, so I've
>> moved the deferred split patch to a separate series along with the batch
>> zap changes. I plan to submit this series early next week.
>> - Changed folio order fallback policy
>> - We no longer iterate from preferred to 0 looking for acceptable policy
>> - Instead we iterate through preferred, PAGE_ALLOC_COSTLY_ORDER and 0 only
>> - Removed vma parameter from arch_wants_pte_order()
>> - Added command line parameter `flexthp_unhinted_max`
>> - clamps preferred order when vma hasn't explicitly opted-in to THP
>> - Never allocate large folio for MADV_NOHUGEPAGE vma (or when THP is disabled
>> for process or system).
>> - Simplified implementation and integration with do_anonymous_page()
>> - Removed dependency on set_ptes()
>>
>>
>> Changes since v1 [1]
>> ====================
>>
>> - removed changes to arch-dependent vma_alloc_zeroed_movable_folio()
>> - replaced with arch-independent alloc_anon_folio()
>> - follows THP allocation approach
>> - no longer retry with intermediate orders if allocation fails
>> - fallback directly to order-0
>> - remove folio_add_new_anon_rmap_range() patch
>> - instead add its new functionality to folio_add_new_anon_rmap()
>> - remove batch-zap pte mappings optimization patch
>> - remove enabler folio_remove_rmap_range() patch too
>> - These offer real perf improvement so will submit separately
>> - simplify Kconfig
>> - single FLEXIBLE_THP option, which is independent of arch
>> - depends on TRANSPARENT_HUGEPAGE
>> - when enabled default to max anon folio size of 64K unless arch
>> explicitly overrides
>> - simplify changes to do_anonymous_page():
>> - no more retry loop
>>
>>
>> [1] https://lore.kernel.org/linux-mm/[email protected]/
>> [2] https://lore.kernel.org/linux-mm/[email protected]/
>> [3] https://lore.kernel.org/linux-mm/[email protected]/
>> [4] https://lore.kernel.org/linux-mm/[email protected]/
>> [5] https://lore.kernel.org/linux-mm/[email protected]/
>> [6] https://lore.kernel.org/linux-mm/[email protected]/
>> [7] https://lore.kernel.org/linux-mm/[email protected]/
>> [8] https://lore.kernel.org/linux-mm/[email protected]/
>> [9] https://drive.google.com/file/d/1GnfYFpr7_c1kA41liRUW5YtCb8Cj18Ud/view?usp=sharing&resourcekey=0-U1Mj3-RhLD1JV6EThpyPyA
>> [10] https://lore.kernel.org/linux-mm/[email protected]/
>> [11] https://lore.kernel.org/linux-mm/[email protected]/
>> [12] https://lore.kernel.org/linux-mm/[email protected]/
>>
>>
>> Thanks,
>> Ryan
>>
>> Ryan Roberts (10):
>> mm: Allow deferred splitting of arbitrary anon large folios
>> mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap()
>> mm: thp: Introduce multi-size THP sysfs interface
>> mm: thp: Support allocation of anonymous multi-size THP
>> selftests/mm/kugepaged: Restore thp settings at exit
>> selftests/mm: Factor out thp settings management
>> selftests/mm: Support multi-size THP interface in thp_settings
>> selftests/mm/khugepaged: Enlighten for multi-size THP
>> selftests/mm/cow: Generalize do_run_with_thp() helper
>> selftests/mm/cow: Add tests for anonymous multi-size THP
>>
>> Documentation/admin-guide/mm/transhuge.rst | 97 ++++-
>> Documentation/filesystems/proc.rst | 6 +-
>> fs/proc/task_mmu.c | 3 +-
>> include/linux/huge_mm.h | 116 ++++--
>> mm/huge_memory.c | 268 ++++++++++++--
>> mm/khugepaged.c | 20 +-
>> mm/memory.c | 114 +++++-
>> mm/page_vma_mapped.c | 3 +-
>> mm/rmap.c | 32 +-
>> tools/testing/selftests/mm/Makefile | 4 +-
>> tools/testing/selftests/mm/cow.c | 185 +++++++---
>> tools/testing/selftests/mm/khugepaged.c | 410 ++++-----------------
>> tools/testing/selftests/mm/run_vmtests.sh | 2 +
>> tools/testing/selftests/mm/thp_settings.c | 349 ++++++++++++++++++
>> tools/testing/selftests/mm/thp_settings.h | 80 ++++
>> 15 files changed, 1177 insertions(+), 512 deletions(-)
>> create mode 100644 tools/testing/selftests/mm/thp_settings.c
>> create mode 100644 tools/testing/selftests/mm/thp_settings.h
>>
>> --
>> 2.25.1
>>
>
> Thanks
> Barry

2023-12-05 11:13:37

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] Multi-size THP for anonymous memory

On 05/12/2023 03:37, John Hubbard wrote:
> On 12/4/23 02:20, Ryan Roberts wrote:
>> Hi All,
>>
>> A new week, a new version, a new name... This is v8 of a series to implement
>> multi-size THP (mTHP) for anonymous memory (previously called "small-sized THP"
>> and "large anonymous folios"). Matthew objected to "small huge" so hopefully
>> this fares better.
>>
>> The objective of this is to improve performance by allocating larger chunks of
>> memory during anonymous page faults:
>>
>> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>>     pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>     and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
>>     overhead. This should benefit all architectures.
>> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>>     advantage of HW TLB compression techniques. A reduction in TLB pressure
>>     speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>     TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>
>> This version changes the name and tidies up some of the kernel code and test
>> code, based on feedback against v7 (see change log for details).
>
> Using a couple of Armv8 systems, I've tested this patchset. I applied it
> to top of tree (Linux 6.7-rc4), on top of your latest contig pte series
> [1].
>
> With those two patchsets applied, the mm selftests look OK--or at least
> as OK as they normally do. I compared test runs between THP/mTHP set to
> "always", vs "never", to verify that there were no new test failures.
> Details: specifically, I set one particular page size (2 MB) to
> "inherit", and then toggled /sys/kernel/mm/transparent_hugepage/enabled
> between "always" and "never".

Excellent - I'm guessing this was for 64K base pages?

>
> I also re-ran my usual compute/AI benchmark, and I'm still seeing the
> same 10x performance improvement that I reported for the v6 patchset.
>
> So for this patchset and for [1] as well, please feel free to add:
>
> Tested-by: John Hubbard <[email protected]>

Thanks!

>
>
> [1] https://lore.kernel.org/all/[email protected]/
>
>
> thanks,

2023-12-05 11:16:42

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP

On 05.12.23 11:48, Ryan Roberts wrote:
> On 05/12/2023 01:24, Barry Song wrote:
>> On Tue, Dec 5, 2023 at 9:15 AM Barry Song <[email protected]> wrote:
>>>
>>> On Mon, Dec 4, 2023 at 6:21 PM Ryan Roberts <[email protected]> wrote:
>>>>
>>>> Introduce the logic to allow THP to be configured (through the new sysfs
>>>> interface we just added) to allocate large folios to back anonymous
>>>> memory, which are larger than the base page size but smaller than
>>>> PMD-size. We call this new THP extension "multi-size THP" (mTHP).
>>>>
>>>> mTHP continues to be PTE-mapped, but in many cases can still provide
>>>> similar benefits to traditional PMD-sized THP: Page faults are
>>>> significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
>>>> the configured order), but latency spikes are much less prominent
>>>> because the size of each page isn't as huge as the PMD-sized variant and
>>>> there is less memory to clear in each page fault. The number of per-page
>>>> operations (e.g. ref counting, rmap management, lru list management) are
>>>> also significantly reduced since those ops now become per-folio.
>>>>
>>>> Some architectures also employ TLB compression mechanisms to squeeze
>>>> more entries in when a set of PTEs are virtually and physically
>>>> contiguous and approporiately aligned. In this case, TLB misses will
>>>> occur less often.
>>>>
>>>> The new behaviour is disabled by default, but can be enabled at runtime
>>>> by writing to /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled
>>>> (see documentation in previous commit). The long term aim is to change
>>>> the default to include suitable lower orders, but there are some risks
>>>> around internal fragmentation that need to be better understood first.
>>>>
>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>> ---
>>>> include/linux/huge_mm.h | 6 ++-
>>>> mm/memory.c | 106 ++++++++++++++++++++++++++++++++++++----
>>>> 2 files changed, 101 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index bd0eadd3befb..91a53b9835a4 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
>>>> #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>>>>
>>>> /*
>>>> - * Mask of all large folio orders supported for anonymous THP.
>>>> + * Mask of all large folio orders supported for anonymous THP; all orders up to
>>>> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
>>>> + * (which is a limitation of the THP implementation).
>>>> */
>>>> -#define THP_ORDERS_ALL_ANON BIT(PMD_ORDER)
>>>> +#define THP_ORDERS_ALL_ANON ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
>>>>
>>>> /*
>>>> * Mask of all large folio orders supported for file THP.
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 3ceeb0f45bf5..bf7e93813018 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -4125,6 +4125,84 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>> return ret;
>>>> }
>>>>
>>>> +static bool pte_range_none(pte_t *pte, int nr_pages)
>>>> +{
>>>> + int i;
>>>> +
>>>> + for (i = 0; i < nr_pages; i++) {
>>>> + if (!pte_none(ptep_get_lockless(pte + i)))
>>>> + return false;
>>>> + }
>>>> +
>>>> + return true;
>>>> +}
>>>> +
>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>>>> +{
>>>> + gfp_t gfp;
>>>> + pte_t *pte;
>>>> + unsigned long addr;
>>>> + struct folio *folio;
>>>> + struct vm_area_struct *vma = vmf->vma;
>>>> + unsigned long orders;
>>>> + int order;
>>>> +
>>>> + /*
>>>> + * If uffd is active for the vma we need per-page fault fidelity to
>>>> + * maintain the uffd semantics.
>>>> + */
>>>> + if (userfaultfd_armed(vma))
>>>> + goto fallback;
>>>> +
>>>> + /*
>>>> + * Get a list of all the (large) orders below PMD_ORDER that are enabled
>>>> + * for this vma. Then filter out the orders that can't be allocated over
>>>> + * the faulting address and still be fully contained in the vma.
>>>> + */
>>>> + orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
>>>> + BIT(PMD_ORDER) - 1);
>>>> + orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>>>> +
>>>> + if (!orders)
>>>> + goto fallback;
>>>> +
>>>> + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>>> + if (!pte)
>>>> + return ERR_PTR(-EAGAIN);
>>>> +
>>>> + order = first_order(orders);
>>>> + while (orders) {
>>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>>> + vmf->pte = pte + pte_index(addr);
>>>> + if (pte_range_none(vmf->pte, 1 << order))
>>>> + break;
>>>> + order = next_order(&orders, order);
>>>> + }
>>>> +
>>>> + vmf->pte = NULL;
>>>> + pte_unmap(pte);
>>>> +
>>>> + gfp = vma_thp_gfp_mask(vma);
>>>> +
>>>> + while (orders) {
>>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>>> + folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>>> + if (folio) {
>>>> + clear_huge_page(&folio->page, addr, 1 << order);
>>>
>>> Minor.
>>>
>>> Do we have to constantly clear a huge page? Is it possible to let
>>> post_alloc_hook()
>>> finish this job by using __GFP_ZERO/__GFP_ZEROTAGS as
>>> vma_alloc_zeroed_movable_folio() is doing?
>
> I'm currently following the same allocation pattern as is done for PMD-sized
> THP. In earlier versions of this patch I was trying to be smarter and use the
> __GFP_ZERO/__GFP_ZEROTAGS as you suggest, but I was advised to keep it simple
> and follow the existing pattern.

Yes, this should be optimized on top IMHO.

--
Cheers,

David / dhildenb

2023-12-05 16:06:29

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v8 10/10] selftests/mm/cow: Add tests for anonymous multi-size THP

On 04.12.23 11:20, Ryan Roberts wrote:
> Add tests similar to the existing PMD-sized THP tests, but which operate
> on memory backed by (PTE-mapped) multi-size THP. This reuses all the
> existing infrastructure. If the test suite detects that multi-size THP
> is not supported by the kernel, the new tests are skipped.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---

Reviewed-by: David Hildenbrand <[email protected]>

--
Cheers,

David / dhildenb

2023-12-05 16:33:34

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP

On 04.12.23 11:20, Ryan Roberts wrote:
> Introduce the logic to allow THP to be configured (through the new sysfs
> interface we just added) to allocate large folios to back anonymous
> memory, which are larger than the base page size but smaller than
> PMD-size. We call this new THP extension "multi-size THP" (mTHP).
>
> mTHP continues to be PTE-mapped, but in many cases can still provide
> similar benefits to traditional PMD-sized THP: Page faults are
> significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
> the configured order), but latency spikes are much less prominent
> because the size of each page isn't as huge as the PMD-sized variant and
> there is less memory to clear in each page fault. The number of per-page
> operations (e.g. ref counting, rmap management, lru list management) are
> also significantly reduced since those ops now become per-folio.
>
> Some architectures also employ TLB compression mechanisms to squeeze
> more entries in when a set of PTEs are virtually and physically
> contiguous and approporiately aligned. In this case, TLB misses will
> occur less often.
>
> The new behaviour is disabled by default, but can be enabled at runtime
> by writing to /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled
> (see documentation in previous commit). The long term aim is to change
> the default to include suitable lower orders, but there are some risks
> around internal fragmentation that need to be better understood first.
>
> Signed-off-by: Ryan Roberts <[email protected]>

In general, looks good to me, some comments/nits. And the usual "let's
make sure we don't degrade order-0 and keep that as fast as possible"
comment.

> ---
> include/linux/huge_mm.h | 6 ++-
> mm/memory.c | 106 ++++++++++++++++++++++++++++++++++++----
> 2 files changed, 101 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index bd0eadd3befb..91a53b9835a4 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
> #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>
> /*
> - * Mask of all large folio orders supported for anonymous THP.
> + * Mask of all large folio orders supported for anonymous THP; all orders up to
> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
> + * (which is a limitation of the THP implementation).
> */
> -#define THP_ORDERS_ALL_ANON BIT(PMD_ORDER)
> +#define THP_ORDERS_ALL_ANON ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
>
> /*
> * Mask of all large folio orders supported for file THP.
> diff --git a/mm/memory.c b/mm/memory.c
> index 3ceeb0f45bf5..bf7e93813018 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4125,6 +4125,84 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> return ret;
> }
>
> +static bool pte_range_none(pte_t *pte, int nr_pages)
> +{
> + int i;
> +
> + for (i = 0; i < nr_pages; i++) {
> + if (!pte_none(ptep_get_lockless(pte + i)))
> + return false;
> + }
> +
> + return true;
> +}
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> +{
> + gfp_t gfp;
> + pte_t *pte;
> + unsigned long addr;
> + struct folio *folio;
> + struct vm_area_struct *vma = vmf->vma;
> + unsigned long orders;
> + int order;

Nit: reverse christmas tree encouraged ;)

> +
> + /*
> + * If uffd is active for the vma we need per-page fault fidelity to
> + * maintain the uffd semantics.
> + */
> + if (userfaultfd_armed(vma))

Nit: unlikely()

> + goto fallback;
> +
> + /*
> + * Get a list of all the (large) orders below PMD_ORDER that are enabled
> + * for this vma. Then filter out the orders that can't be allocated over
> + * the faulting address and still be fully contained in the vma.
> + */
> + orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> + BIT(PMD_ORDER) - 1);
> + orders = thp_vma_suitable_orders(vma, vmf->address, orders);

Comment: Both will eventually loop over all orders, correct? Could
eventually be sped up in the future.

Nit: the orders = ... order = ... looks like this might deserve a helper
function that makes this easier to read.

Nit: Why call thp_vma_suitable_orders if the orders are already 0?
Again, some helper might be reasonable where that is handled internally.

Comment: For order-0 we'll always perform a function call to both
thp_vma_allowable_orders() / thp_vma_suitable_orders(). We should
perform some fast and efficient check if any <PMD THP are even enabled
in the system / for this VMA, and in that case just fallback before
doing more expensive checks.

> +
> + if (!orders)
> + goto fallback;
> +
> + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> + if (!pte)
> + return ERR_PTR(-EAGAIN);
> +
> + order = first_order(orders);
> + while (orders) {
> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> + vmf->pte = pte + pte_index(addr);
> + if (pte_range_none(vmf->pte, 1 << order))
> + break;

Comment: Likely it would make sense to scan only once and determine the
"largest none range" around that address, having the largest suitable
order in mind.
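
Something like the below is one shape that single pass could take (rough,
untested sketch; it reuses the patch's first_order()/next_order() helpers
and assumes ptep points at the base of the PMD's PTE table):

	static int largest_none_order(pte_t *ptep, unsigned long address,
				      unsigned long orders)
	{
		int max_order = first_order(orders);
		unsigned long base = ALIGN_DOWN(address, PAGE_SIZE << max_order);
		pte_t *first = ptep + pte_index(base);
		int nr = 1 << max_order;
		int idx = (address - base) >> PAGE_SHIFT;
		int lo = 0, hi = nr;
		int order = max_order;
		int i;

		/* One pass: find the run of pte_none() entries containing idx. */
		for (i = idx - 1; i >= 0; i--) {
			if (!pte_none(ptep_get_lockless(first + i))) {
				lo = i + 1;
				break;
			}
		}
		for (i = idx; i < nr; i++) {
			if (!pte_none(ptep_get_lockless(first + i))) {
				hi = i;
				break;
			}
		}

		/* Largest enabled order whose aligned range fits in [lo, hi). */
		while (orders) {
			unsigned long start = ALIGN_DOWN(address, PAGE_SIZE << order);
			int off = (start - base) >> PAGE_SHIFT;

			if (off >= lo && off + (1 << order) <= hi)
				return order;
			order = next_order(&orders, order);
		}

		return -1;
	}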

> + order = next_order(&orders, order);
> + }
> +
> + vmf->pte = NULL;

Nit: Can you elaborate why you are messing with vmf->pte here? A simple
helper variable will make this code look less magical. Unless I am
missing something important :)

> + pte_unmap(pte);
> +
> + gfp = vma_thp_gfp_mask(vma);
> +
> + while (orders) {
> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> + folio = vma_alloc_folio(gfp, order, vma, addr, true);
> + if (folio) {
> + clear_huge_page(&folio->page, addr, 1 << order);
> + return folio;
> + }
> + order = next_order(&orders, order);
> + }
> +

Question: would it make sense to combine both loops? I suspect memory
allocations with pte_offset_map()/kmap are problematic.

> +fallback:
> + return vma_alloc_zeroed_movable_folio(vma, vmf->address);
> +}
> +#else
> +#define alloc_anon_folio(vmf) \
> + vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
> +#endif
> +
> /*
> * We enter with non-exclusive mmap_lock (to exclude vma changes,
> * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -4132,6 +4210,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> */
> static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> {
> + int i;
> + int nr_pages = 1;
> + unsigned long addr = vmf->address;
> bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
> struct vm_area_struct *vma = vmf->vma;
> struct folio *folio;

Nit: reverse christmas tree :)

> @@ -4176,10 +4257,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> /* Allocate our own private page. */
> if (unlikely(anon_vma_prepare(vma)))
> goto oom;
> - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> + folio = alloc_anon_folio(vmf);
> + if (IS_ERR(folio))
> + return 0;
> if (!folio)
> goto oom;
>
> + nr_pages = folio_nr_pages(folio);
> + addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> +
> if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
> goto oom_free_page;
> folio_throttle_swaprate(folio, GFP_KERNEL);
> @@ -4196,12 +4282,13 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> if (vma->vm_flags & VM_WRITE)
> entry = pte_mkwrite(pte_mkdirty(entry), vma);
>
> - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> - &vmf->ptl);
> + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
> if (!vmf->pte)
> goto release;
> - if (vmf_pte_changed(vmf)) {
> - update_mmu_tlb(vma, vmf->address, vmf->pte);
> + if ((nr_pages == 1 && vmf_pte_changed(vmf)) ||
> + (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages))) {
> + for (i = 0; i < nr_pages; i++)
> + update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);

Comment: separating the order-0 case from the other case might make this
easier to read.

> goto release;
> }
>
> @@ -4216,16 +4303,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> return handle_userfault(vmf, VM_UFFD_MISSING);
> }
>
> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> - folio_add_new_anon_rmap(folio, vma, vmf->address);
> + folio_ref_add(folio, nr_pages - 1);
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> + folio_add_new_anon_rmap(folio, vma, addr);
> folio_add_lru_vma(folio, vma);
> setpte:
> if (uffd_wp)
> entry = pte_mkuffd_wp(entry);
> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> + set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
>
> /* No need to invalidate - it was non-present before */
> - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> + update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
> unlock:
> if (vmf->pte)
> pte_unmap_unlock(vmf->pte, vmf->ptl);

Benchmarking order-0 allocations might be interesting. There will be
some added checks + multiple loops/conditionals for order-0 that could
be avoided by having two separate code paths. If we can't measure a
difference, all good.

--
Cheers,

David / dhildenb

2023-12-05 16:37:31

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP

> Comment: Both will eventually loop over all orders, correct? Could
> eventually be sped up in the future.
>
> Nit: the orders = ... order = ... looks like this might deserve a helper
> function that makes this easier to read.
>
> Nit: Why call thp_vma_suitable_orders if the orders are already 0?
> Again, some helper might be reasonable where that is handled internally.
>
> Comment: For order-0 we'll always perform a function call to both
> thp_vma_allowable_orders() / thp_vma_suitable_orders(). We should
> perform some fast and efficient check if any <PMD THP are even enabled
> in the system / for this VMA, and in that case just fallback before
> doing more expensive checks.

Correction: only a call to thp_vma_allowable_orders(). I wonder if we
can move some of thp_vma_allowable_orders() into the header file, where
we'd just check as quickly and as efficiently as possible for "no THP <
PMD_THP enabled" on this system.
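
As a very rough sketch of what I mean (it assumes the huge_anon_orders_*
bitmaps - or at least their union - become visible to huge_mm.h; names are
illustrative only):

	static inline unsigned long thp_enabled_anon_orders(void)
	{
		return READ_ONCE(huge_anon_orders_always) |
		       READ_ONCE(huge_anon_orders_madvise) |
		       READ_ONCE(huge_anon_orders_inherit);
	}

	/* Cheap inline pre-check before calling the out-of-line function. */
	static inline unsigned long thp_vma_allowable_orders_fast(
			struct vm_area_struct *vma, unsigned long vm_flags,
			bool smaps, bool in_pf, bool enforce_sysfs,
			unsigned long orders)
	{
		if (enforce_sysfs && vma_is_anonymous(vma) &&
		    !(orders & thp_enabled_anon_orders()))
			return 0;

		return thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf,
						enforce_sysfs, orders);
	}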

--
Cheers,

David / dhildenb

2023-12-05 16:58:25

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs interface

On 04.12.23 11:20, Ryan Roberts wrote:
> In preparation for adding support for anonymous multi-size THP,
> introduce new sysfs structure that will be used to control the new
> behaviours. A new directory is added under transparent_hugepage for each
> supported THP size, and contains an `enabled` file, which can be set to
> "inherit" (to inherit the global setting), "always", "madvise" or
> "never". For now, the kernel still only supports PMD-sized anonymous
> THP, so only 1 directory is populated.
>
> The first half of the change converts transhuge_vma_suitable() and
> hugepage_vma_check() so that they take a bitfield of orders for which
> the user wants to determine support, and the functions filter out all
> the orders that can't be supported, given the current sysfs
> configuration and the VMA dimensions. If there is only 1 order set in
> the input then the output can continue to be treated like a boolean;
> this is the case for most call sites. The resulting functions are
> renamed to thp_vma_suitable_orders() and thp_vma_allowable_orders()
> respectively.
>
> The second half of the change implements the new sysfs interface. It has
> been done so that each supported THP size has a `struct thpsize`, which
> describes the relevant metadata and is itself a kobject. This is pretty
> minimal for now, but should make it easy to add new per-thpsize files to
> the interface if needed in future (e.g. per-size defrag). Rather than
> keep the `enabled` state directly in the struct thpsize, I've elected to
> directly encode it into huge_anon_orders_[always|madvise|inherit]
> bitfields since this reduces the amount of work required in
> thp_vma_allowable_orders() which is called for every page fault.
>
> See Documentation/admin-guide/mm/transhuge.rst, as modified by this
> commit, for details of how the new sysfs interface works.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---

Some comments, mostly regarding thp_vma_allowable_orders() and friends. In
general, LGTM. I'll have to go over the order logic once again; I got a
bit lost once we started mixing anon and file orders.

[...]

Doc updates all looked good to me, skimming over them.

> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index fa0350b0812a..bd0eadd3befb 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h

[...]

> +static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
> + unsigned long addr, unsigned long orders)
> +{
> + int order;
> +
> + /*
> + * Iterate over orders, highest to lowest, removing orders that don't
> + * meet alignment requirements from the set. Exit loop at first order
> + * that meets requirements, since all lower orders must also meet
> + * requirements.
> + */
> +
> + order = first_order(orders);

nit: "highest_order" or "largest_order" would be more expressive
regarding the actual semantics.

> +
> + while (orders) {
> + unsigned long hpage_size = PAGE_SIZE << order;
> + unsigned long haddr = ALIGN_DOWN(addr, hpage_size);
> +
> + if (haddr >= vma->vm_start &&
> + haddr + hpage_size <= vma->vm_end) {
> + if (!vma_is_anonymous(vma)) {
> + if (IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) -
> + vma->vm_pgoff,
> + hpage_size >> PAGE_SHIFT))
> + break;
> + } else
> + break;

Comment: Coding style wants you to use if () {} else {}

But I'd recommend for the conditions:

if (haddr < vma->vm_start ||
    haddr + hpage_size > vma->vm_end)
	continue;
/* Don't have to check pgoff for anonymous vma */
if (vma_is_anonymous(vma))
	break;
if (IS_ALIGNED((...
	break;

[...]


> +/**
> + * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
> + * @vma: the vm area to check
> + * @vm_flags: use these vm_flags instead of vma->vm_flags
> + * @smaps: whether answer will be used for smaps file
> + * @in_pf: whether answer will be used by page fault handler
> + * @enforce_sysfs: whether sysfs config should be taken into account
> + * @orders: bitfield of all orders to consider
> + *
> + * Calculates the intersection of the requested hugepage orders and the allowed
> + * hugepage orders for the provided vma. Permitted orders are encoded as a set
> + * bit at the corresponding bit position (bit-2 corresponds to order-2, bit-3
> + * corresponds to order-3, etc). Order-0 is never considered a hugepage order.
> + *
> + * Return: bitfield of orders allowed for hugepage in the vma. 0 if no hugepage
> + * orders are allowed.
> + */
> +unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> + unsigned long vm_flags, bool smaps,
> + bool in_pf, bool enforce_sysfs,
> + unsigned long orders)
> +{
> + /* Check the intersection of requested and supported orders. */
> + orders &= vma_is_anonymous(vma) ?
> + THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;
> + if (!orders)
> + return 0;

Comment: if this is called from some hot path, we might want to move as
much as possible into a header, so we can avoid this function call here
when e.g., THP are completely disabled etc.

> +
> if (!vma->vm_mm) /* vdso */
> - return false;
> + return 0;
>
> /*
> * Explicitly disabled through madvise or prctl, or some
> @@ -88,16 +141,16 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
> * */
> if ((vm_flags & VM_NOHUGEPAGE) ||
> test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
> - return false;
> + return 0;
> /*
> * If the hardware/firmware marked hugepage support disabled.
> */
> if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
> - return false;
> + return 0;
>
> /* khugepaged doesn't collapse DAX vma, but page fault is fine. */
> if (vma_is_dax(vma))
> - return in_pf;
> + return in_pf ? orders : 0;
>
> /*
> * khugepaged special VMA and hugetlb VMA.
> @@ -105,17 +158,29 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
> * VM_MIXEDMAP set.
> */
> if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
> - return false;
> + return 0;
>
> /*
> - * Check alignment for file vma and size for both file and anon vma.
> + * Check alignment for file vma and size for both file and anon vma by
> + * filtering out the unsuitable orders.
> *
> * Skip the check for page fault. Huge fault does the check in fault
> - * handlers. And this check is not suitable for huge PUD fault.
> + * handlers.
> */
> - if (!in_pf &&
> - !transhuge_vma_suitable(vma, (vma->vm_end - HPAGE_PMD_SIZE)))
> - return false;
> + if (!in_pf) {
> + int order = first_order(orders);
> + unsigned long addr;
> +
> + while (orders) {
> + addr = vma->vm_end - (PAGE_SIZE << order);
> + if (thp_vma_suitable_orders(vma, addr, BIT(order)))
> + break;

Comment: you'd want a "thp_vma_suitable_order" helper here. But maybe
the compiler is smart enough to optimize the loop and everything else out.

[...]

> +
> +static ssize_t thpsize_enabled_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + int order = to_thpsize(kobj)->order;
> + ssize_t ret = count;
> +
> + if (sysfs_streq(buf, "always")) {
> + set_bit(order, &huge_anon_orders_always);
> + clear_bit(order, &huge_anon_orders_inherit);
> + clear_bit(order, &huge_anon_orders_madvise);
> + } else if (sysfs_streq(buf, "inherit")) {
> + set_bit(order, &huge_anon_orders_inherit);
> + clear_bit(order, &huge_anon_orders_always);
> + clear_bit(order, &huge_anon_orders_madvise);
> + } else if (sysfs_streq(buf, "madvise")) {
> + set_bit(order, &huge_anon_orders_madvise);
> + clear_bit(order, &huge_anon_orders_always);
> + clear_bit(order, &huge_anon_orders_inherit);
> + } else if (sysfs_streq(buf, "never")) {
> + clear_bit(order, &huge_anon_orders_always);
> + clear_bit(order, &huge_anon_orders_inherit);
> + clear_bit(order, &huge_anon_orders_madvise);

Note: I was wondering for a second if some concurrent stores could lead
to an inconsistent state. I think in the worst case we'll simply end up
with "never" on races.
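
If we ever care, serializing the store path would rule that out; rough
sketch only (the lock name is made up):

	static DEFINE_SPINLOCK(huge_anon_orders_lock);

	/* 'which' selects the bitmap to set; NULL means "never". */
	static void thpsize_set_enabled(int order, unsigned long *which)
	{
		spin_lock(&huge_anon_orders_lock);
		clear_bit(order, &huge_anon_orders_always);
		clear_bit(order, &huge_anon_orders_inherit);
		clear_bit(order, &huge_anon_orders_madvise);
		if (which)
			set_bit(order, which);
		spin_unlock(&huge_anon_orders_lock);
	}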

[...]

> static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
> {
> int err;
> + struct thpsize *thpsize;
> + unsigned long orders;
> + int order;
> +
> + /*
> + * Default to setting PMD-sized THP to inherit the global setting and
> + * disable all other sizes. powerpc's PMD_ORDER isn't a compile-time
> + * constant so we have to do this here.
> + */
> + huge_anon_orders_inherit = BIT(PMD_ORDER);
>
> *hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
> if (unlikely(!*hugepage_kobj)) {
> @@ -434,8 +631,24 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
> goto remove_hp_group;
> }
>
> + orders = THP_ORDERS_ALL_ANON;
> + order = first_order(orders);
> + while (orders) {
> + thpsize = thpsize_create(order, *hugepage_kobj);
> + if (IS_ERR(thpsize)) {
> + pr_err("failed to create thpsize for order %d\n", order);
> + err = PTR_ERR(thpsize);
> + goto remove_all;
> + }
> + list_add(&thpsize->node, &thpsize_list);
> + order = next_order(&orders, order);
> + }
> +
> return 0;
>

[...]

> page = compound_head(page);
> @@ -5116,7 +5116,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> return VM_FAULT_OOM;
> retry_pud:
> if (pud_none(*vmf.pud) &&
> - hugepage_vma_check(vma, vm_flags, false, true, true)) {
> + thp_vma_allowable_orders(vma, vm_flags, false, true, true,
> + BIT(PUD_ORDER))) {
> ret = create_huge_pud(&vmf);
> if (!(ret & VM_FAULT_FALLBACK))
> return ret;
> @@ -5150,7 +5151,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> goto retry_pud;
>
> if (pmd_none(*vmf.pmd) &&
> - hugepage_vma_check(vma, vm_flags, false, true, true)) {
> + thp_vma_allowable_orders(vma, vm_flags, false, true, true,
> + BIT(PMD_ORDER))) {

Comment: A helper like "thp_vma_allowable_order(vma, PMD_ORDER)" might
make this easier to read -- and the implementation will be faster.
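
i.e. something as simple as the below would at least address the readability
part (name and form just illustrative):

	#define thp_vma_allowable_order(vma, vm_flags, smaps, in_pf, \
					enforce_sysfs, order) \
		(!!thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf, \
					    enforce_sysfs, BIT(order)))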

> ret = create_huge_pmd(&vmf);
> if (!(ret & VM_FAULT_FALLBACK))
> return ret;
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index e0b368e545ed..64da127cc267 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -268,7 +268,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> * cleared *pmd but not decremented compound_mapcount().
> */
> if ((pvmw->flags & PVMW_SYNC) &&
> - transhuge_vma_suitable(vma, pvmw->address) &&
> + thp_vma_suitable_orders(vma, pvmw->address,
> + BIT(PMD_ORDER)) &&

Comment: Similarly, a helper like "thp_vma_suitable_order(vma,
PMD_ORDER)" might make this easier to read.

> (pvmw->nr_pages >= HPAGE_PMD_NR)) {
> spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
>

--
Cheers,

David / dhildenb

2023-12-05 17:03:02

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v8 05/10] selftests/mm/kugepaged: Restore thp settings at exit

On 04.12.23 11:20, Ryan Roberts wrote:
> Previously, the saved thp settings would be restored upon a signal or at
> the natural end of the test suite. But there are some tests that
> directly call exit() upon failure. In this case, the thp settings were
> not being restored, which could then influence other tests.
>
> Fix this by installing an atexit() handler to do the actual restore. The
> signal handler can now just call exit() and the atexit handler is
> invoked.
>
> Reviewed-by: Alistair Popple <[email protected]>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> tools/testing/selftests/mm/khugepaged.c | 17 +++++++++++------
> 1 file changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
> index 030667cb5533..fc47a1c4944c 100644
> --- a/tools/testing/selftests/mm/khugepaged.c
> +++ b/tools/testing/selftests/mm/khugepaged.c
> @@ -374,18 +374,22 @@ static void pop_settings(void)
> write_settings(current_settings());
> }
>
> -static void restore_settings(int sig)
> +static void restore_settings_atexit(void)
> {
> if (skip_settings_restore)
> - goto out;
> + return;
>
> printf("Restore THP and khugepaged settings...");
> write_settings(&saved_settings);
> success("OK");
> - if (sig)
> - exit(EXIT_FAILURE);
> -out:
> - exit(exit_status);
> +
> + skip_settings_restore = true;
> +}
> +
> +static void restore_settings(int sig)
> +{
> + /* exit() will invoke the restore_settings_atexit handler. */
> + exit(sig ? EXIT_FAILURE : exit_status);
> }
>
> static void save_settings(void)
> @@ -415,6 +419,7 @@ static void save_settings(void)
>
> success("OK");
>
> + atexit(restore_settings_atexit);
> signal(SIGTERM, restore_settings);
> signal(SIGINT, restore_settings);
> signal(SIGHUP, restore_settings);

Reviewed-by: David Hildenbrand <[email protected]>

Might similarly come in handy for the cow tests. Can be done later.

--
Cheers,

David / dhildenb

2023-12-05 17:03:38

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v8 06/10] selftests/mm: Factor out thp settings management

On 04.12.23 11:20, Ryan Roberts wrote:
> The khugepaged test has a useful framework for save/restore/pop/push of
> all thp settings via the sysfs interface. This will be useful to
> explicitly control multi-size THP settings in other tests, so let's
> move it out of khugepaged and into its own thp_settings.[c|h] utility.
>
> Tested-by: Alistair Popple <[email protected]>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---

[...]

Looks handy

Acked-by: David Hildenbrand <[email protected]>

--
Cheers,

David / dhildenb

2023-12-05 17:23:55

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] Multi-size THP for anonymous memory

On 04.12.23 11:20, Ryan Roberts wrote:
> Hi All,
>
> A new week, a new version, a new name... This is v8 of a series to implement
> multi-size THP (mTHP) for anonymous memory (previously called "small-sized THP"
> and "large anonymous folios"). Matthew objected to "small huge" so hopefully
> this fares better.
>
> The objective of this is to improve performance by allocating larger chunks of
> memory during anonymous page faults:
>
> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
> pages, there are efficiency savings to be had; fewer page faults, batched PTE
> and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
> overhead. This should benefit all architectures.
> 2) Since we are now mapping physically contiguous chunks of memory, we can take
> advantage of HW TLB compression techniques. A reduction in TLB pressure
> speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
> TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>
> This version changes the name and tidies up some of the kernel code and test
> code, based on feedback against v7 (see change log for details).
>
> By default, the existing behaviour (and performance) is maintained. The user
> must explicitly enable multi-size THP to see the performance benefit. This is
> done via a new sysfs interface (as recommended by David Hildenbrand - thanks to
> David for the suggestion)! This interface is inspired by the existing
> per-hugepage-size sysfs interface used by hugetlb, provides full backwards
> compatibility with the existing PMD-size THP interface, and provides a base for
> future extensibility. See [8] for detailed discussion of the interface.
>
> This series is based on mm-unstable (715b67adf4c8).

I took a look at the core pieces. Some things might want some smaller
tweaks, but nothing that should stop this from having fun in
mm-unstable, and replacing the smaller things as we move forward.

--
Cheers,

David / dhildenb

2023-12-05 18:58:34

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] Multi-size THP for anonymous memory

On 12/5/23 3:13 AM, Ryan Roberts wrote:
> On 05/12/2023 03:37, John Hubbard wrote:
>> On 12/4/23 02:20, Ryan Roberts wrote:
...
>> With those two patchsets applied, the mm selftests look OK--or at least
>> as OK as they normally do. I compared test runs between THP/mTHP set to
>> "always", vs "never", to verify that there were no new test failures.
>> Details: specifically, I set one particular page size (2 MB) to
>> "inherit", and then toggled /sys/kernel/mm/transparent_hugepage/enabled
>> between "always" and "never".
>
> Excellent - I'm guessing this was for 64K base pages?


Yes.


thanks,

--
John Hubbard
NVIDIA

2023-12-05 20:16:57

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP

On Tue, Dec 5, 2023 at 11:48 PM Ryan Roberts <[email protected]> wrote:
>
> On 05/12/2023 01:24, Barry Song wrote:
> > On Tue, Dec 5, 2023 at 9:15 AM Barry Song <[email protected]> wrote:
> >>
> >> On Mon, Dec 4, 2023 at 6:21 PM Ryan Roberts <[email protected]> wrote:
> >>>
> >>> Introduce the logic to allow THP to be configured (through the new sysfs
> >>> interface we just added) to allocate large folios to back anonymous
> >>> memory, which are larger than the base page size but smaller than
> >>> PMD-size. We call this new THP extension "multi-size THP" (mTHP).
> >>>
> >>> mTHP continues to be PTE-mapped, but in many cases can still provide
> >>> similar benefits to traditional PMD-sized THP: Page faults are
> >>> significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
> >>> the configured order), but latency spikes are much less prominent
> >>> because the size of each page isn't as huge as the PMD-sized variant and
> >>> there is less memory to clear in each page fault. The number of per-page
> >>> operations (e.g. ref counting, rmap management, lru list management) are
> >>> also significantly reduced since those ops now become per-folio.
> >>>
> >>> Some architectures also employ TLB compression mechanisms to squeeze
> >>> more entries in when a set of PTEs are virtually and physically
> >>> contiguous and appropriately aligned. In this case, TLB misses will
> >>> occur less often.
> >>>
> >>> The new behaviour is disabled by default, but can be enabled at runtime
> >>> by writing to /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled
> >>> (see documentation in previous commit). The long term aim is to change
> >>> the default to include suitable lower orders, but there are some risks
> >>> around internal fragmentation that need to be better understood first.
> >>>
> >>> Signed-off-by: Ryan Roberts <[email protected]>
> >>> ---
> >>> include/linux/huge_mm.h | 6 ++-
> >>> mm/memory.c | 106 ++++++++++++++++++++++++++++++++++++----
> >>> 2 files changed, 101 insertions(+), 11 deletions(-)
> >>>
> >>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >>> index bd0eadd3befb..91a53b9835a4 100644
> >>> --- a/include/linux/huge_mm.h
> >>> +++ b/include/linux/huge_mm.h
> >>> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
> >>> #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> >>>
> >>> /*
> >>> - * Mask of all large folio orders supported for anonymous THP.
> >>> + * Mask of all large folio orders supported for anonymous THP; all orders up to
> >>> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
> >>> + * (which is a limitation of the THP implementation).
> >>> */
> >>> -#define THP_ORDERS_ALL_ANON BIT(PMD_ORDER)
> >>> +#define THP_ORDERS_ALL_ANON ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
> >>>
> >>> /*
> >>> * Mask of all large folio orders supported for file THP.
> >>> diff --git a/mm/memory.c b/mm/memory.c
> >>> index 3ceeb0f45bf5..bf7e93813018 100644
> >>> --- a/mm/memory.c
> >>> +++ b/mm/memory.c
> >>> @@ -4125,6 +4125,84 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>> return ret;
> >>> }
> >>>
> >>> +static bool pte_range_none(pte_t *pte, int nr_pages)
> >>> +{
> >>> + int i;
> >>> +
> >>> + for (i = 0; i < nr_pages; i++) {
> >>> + if (!pte_none(ptep_get_lockless(pte + i)))
> >>> + return false;
> >>> + }
> >>> +
> >>> + return true;
> >>> +}
> >>> +
> >>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> >>> +{
> >>> + gfp_t gfp;
> >>> + pte_t *pte;
> >>> + unsigned long addr;
> >>> + struct folio *folio;
> >>> + struct vm_area_struct *vma = vmf->vma;
> >>> + unsigned long orders;
> >>> + int order;
> >>> +
> >>> + /*
> >>> + * If uffd is active for the vma we need per-page fault fidelity to
> >>> + * maintain the uffd semantics.
> >>> + */
> >>> + if (userfaultfd_armed(vma))
> >>> + goto fallback;
> >>> +
> >>> + /*
> >>> + * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >>> + * for this vma. Then filter out the orders that can't be allocated over
> >>> + * the faulting address and still be fully contained in the vma.
> >>> + */
> >>> + orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> >>> + BIT(PMD_ORDER) - 1);
> >>> + orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> >>> +
> >>> + if (!orders)
> >>> + goto fallback;
> >>> +
> >>> + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> >>> + if (!pte)
> >>> + return ERR_PTR(-EAGAIN);
> >>> +
> >>> + order = first_order(orders);
> >>> + while (orders) {
> >>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> >>> + vmf->pte = pte + pte_index(addr);
> >>> + if (pte_range_none(vmf->pte, 1 << order))
> >>> + break;
> >>> + order = next_order(&orders, order);
> >>> + }
> >>> +
> >>> + vmf->pte = NULL;
> >>> + pte_unmap(pte);
> >>> +
> >>> + gfp = vma_thp_gfp_mask(vma);
> >>> +
> >>> + while (orders) {
> >>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> >>> + folio = vma_alloc_folio(gfp, order, vma, addr, true);
> >>> + if (folio) {
> >>> + clear_huge_page(&folio->page, addr, 1 << order);
> >>
> >> Minor.
> >>
> >> Do we have to constantly clear a huge page? Is it possible to let
> >> post_alloc_hook()
> >> finish this job by using __GFP_ZERO/__GFP_ZEROTAGS as
> >> vma_alloc_zeroed_movable_folio() is doing?
>
> I'm currently following the same allocation pattern as is done for PMD-sized
> THP. In earlier versions of this patch I was trying to be smarter and use the
> __GFP_ZERO/__GFP_ZEROTAGS as you suggest, but I was advised to keep it simple
> and follow the existing pattern.
>
> I have a vague recollection __GFP_ZERO is not preferred for large folios because
> of some issue with virtually indexed caches? (Matthew: did I see you mention
> that in some other context?)
>
> That said, I wasn't aware that Android ships with
> CONFIG_INIT_ON_ALLOC_DEFAULT_ON (I thought it was only used as a debug option),
> so I can see the potential for some overhead reduction here.
>
> Options:
>
> 1) leave it as is and accept the duplicated clearing
> 2) Pass __GFP_ZERO and remove clear_huge_page()
> 3) define __GFP_SKIP_ZERO even when kasan is not enabled and pass it down so
> clear_huge_page() is the only clear
> 4) make clear_huge_page() conditional on !want_init_on_alloc()
>
> I prefer option 4. What do you think?

Either 1 or 4 is OK with me, if we will finally remove this duplicated
clear_huge_page() on top.
4 is even better, as it at least temporarily resolves the problem.
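
As a rough illustration, option 4 in alloc_anon_folio() might look something
like the below (a sketch only, assuming want_init_on_alloc() is usable at that
point; not the posted patch):

        folio = vma_alloc_folio(gfp, order, vma, addr, true);
        if (folio) {
                /*
                 * Option 4 sketch: skip the explicit clear when
                 * init_on_alloc has already zeroed the pages in
                 * post_alloc_hook().
                 */
                if (!want_init_on_alloc(gfp))
                        clear_huge_page(&folio->page, addr, 1 << order);
        }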

In Android's gki_defconfig,
https://android.googlesource.com/kernel/common/+/refs/heads/android14-6.1-lts/arch/arm64/configs/gki_defconfig

Android always has the below:
CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y

Here is some explanation of the reason:
https://source.android.com/docs/security/test/memory-safety/zero-initialized-memory

>
> As an aside, I've also noticed that clear_huge_page() should take vmf->address
> so that it clears the faulting page last to keep the cache hot. If we decide on
> an option that keeps clear_huge_page(), I'll also make that change.
>
> Thanks,
> Ryan
>
> >>

Thanks
Barry

2023-12-06 10:08:22

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] Multi-size THP for anonymous memory

On 05/12/2023 14:19, Kefeng Wang wrote:
>
>
> On 2023/12/4 18:20, Ryan Roberts wrote:
>> Hi All,
>>
>> A new week, a new version, a new name... This is v8 of a series to implement
>> multi-size THP (mTHP) for anonymous memory (previously called "small-sized THP"
>> and "large anonymous folios"). Matthew objected to "small huge" so hopefully
>> this fares better.
>>
>> The objective of this is to improve performance by allocating larger chunks of
>> memory during anonymous page faults:
>>
>> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>>     pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>     and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
>>     overhead. This should benefit all architectures.
>> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>>     advantage of HW TLB compression techniques. A reduction in TLB pressure
>>     speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>     TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>
>> This version changes the name and tidies up some of the kernel code and test
>> code, based on feedback against v7 (see change log for details).
>>
>> By default, the existing behaviour (and performance) is maintained. The user
>> must explicitly enable multi-size THP to see the performance benefit. This is
>> done via a new sysfs interface (as recommended by David Hildenbrand - thanks to
>> David for the suggestion)! This interface is inspired by the existing
>> per-hugepage-size sysfs interface used by hugetlb, provides full backwards
>> compatibility with the existing PMD-size THP interface, and provides a base for
>> future extensibility. See [8] for detailed discussion of the interface.
>>
>> This series is based on mm-unstable (715b67adf4c8).
>>
>>
>> Prerequisites
>> =============
>>
>> Some work items identified as being prerequisites are listed on page 3 at [9].
>> The summary is:
>>
>> | item                          | status                  |
>> |:------------------------------|:------------------------|
>> | mlock                         | In mainline (v6.7)      |
>> | madvise                       | In mainline (v6.6)      |
>> | compaction                    | v1 posted [10]          |
>> | numa balancing                | Investigated: see below |
>> | user-triggered page migration | In mainline (v6.7)      |
>> | khugepaged collapse           | In mainline (NOP)       |
>>
>> On NUMA balancing, which currently ignores any PTE-mapped THPs it encounters,
>> John Hubbard has investigated this and concluded that it is A) not clear at the
>> moment what a better policy might be for PTE-mapped THP and B) questions whether
>> this should really be considered a prerequisite given no regression is caused
>> for the default "multi-size THP disabled" case, and there is no correctness
>> issue when it is enabled - its just a potential for non-optimal performance.
>>
>> If there are no disagreements about removing numa balancing from the list (none
>> were raised when I first posted this comment against v7), then that just leaves
>> compaction which is in review on list at the moment.
>>
>> I really would like to get this series (and its remaining comapction
>> prerequisite) in for v6.8. I accept that it may be a bit optimistic at this
>> point, but lets see where we get to with review?
>>
>>
>> Testing
>> =======
>>
>> The series includes patches for mm selftests to enlighten the cow and khugepaged
>> tests to explicitly test with multi-size THP, in the same way that PMD-sized
>> THP is tested. The new tests all pass, and no regressions are observed in the mm
>> selftest suite. I've also run my usual kernel compilation and java script
>> benchmarks without any issues.
>>
>> Refer to my performance numbers posted with v6 [6]. (These are for multi-size
>> THP only - they do not include the arm64 contpte follow-on series).
>>
>> John Hubbard at Nvidia has indicated dramatic 10x performance improvements for
>> some workloads at [11]. (Observed using v6 of this series as well as the arm64
>> contpte series).
>>
>> Kefeng Wang at Huawei has also indicated he sees improvements at [12] although
>> there are some latency regressions also.
>
> Hi Ryan,
>
> Here is some test results based on v6.7-rc1 +
> [PATCH v7 00/10] Small-sized THP for anonymous memory +
> [PATCH v2 00/14] Transparent Contiguous PTEs for User Mappings
>
> case1: basepage 64K
> case2: basepage 4K + thp=64k + PAGE_ALLOC_COSTLY_ORDER = 3
> case3: basepage 4K + thp=64k + PAGE_ALLOC_COSTLY_ORDER = 4

Thanks for sharing these results. With the exception of a few outliers, it looks
like the rough conclusion is that bandwidth improves, but not by as much as with
64K base pages, and latency regresses, but also not by as much as with 64K base pages?

I expect that over time, as we add more optimizations, we will get bandwidth
closer to 64K base pages; one crucial one is getting executable file-backed
memory into contpte mappings, for example.

It's probably not time to switch PAGE_ALLOC_COSTLY_ORDER quite yet; but
something to keep an eye on and consider down the road?
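
For context, PAGE_ALLOC_COSTLY_ORDER is the page allocator's threshold above
which an allocation is treated as "costly" (the allocator is more willing to
fail it than to reclaim/compact hard for it); case3 above raises it from 3 to 4
so that order-4 (64K with 4K base pages) allocations fall on the "cheap" side:

        /* include/linux/mmzone.h (mainline value) */
        #define PAGE_ALLOC_COSTLY_ORDER 3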

Thanks,
Ryan

>
> The results is compared with basepage 4K on Kunpeng920.
>
> Note,
> - The test based on ext4 filesystem and THP=2M is disabled.
> - The results were not analyzed, it is for reference only,
>   as some values of test items are not consistent.
>
> 1) Unixbench 1core
> Index_Values_1core                       case1       case2    case3
> Dhrystone_2_using_register_variables     0.28%      0.39%     0.17%
> Double-Precision_Whetstone              -0.01%      0.00%     0.00%
> Execl_Throughput                        *21.13%*    2.16%     3.01%
> File_Copy_1024_bufsize_2000_maxblocks   -0.51%     *8.33%*   *8.76%*
> File_Copy_256_bufsize_500_maxblocks      0.78%     *11.89%*  *10.85%*
> File_Copy_4096_bufsize_8000_maxblocks    7.42%      7.27%    *10.66%*
> Pipe_Throughput                         -0.24%     *6.82%*   *5.08%*
> Pipe-based_Context_Switching             1.38%     *13.49%*  *9.91%*
> Process_Creation                        *32.46%*    4.30%    *8.54%*
> Shell_Scripts_(1_concurrent)            *31.67%*    1.92%     2.60%
> Shell_Scripts_(8_concurrent)            *40.59%*    1.30%    *5.29%*
> System_Call_Overhead                     3.92%     *8.13%     2.96%
>
> System_Benchmarks_Index_Score           10.66%      5.39%     5.58%
>
> For 1core,
> - case1 wins on Execl_Throughput/Process_Creation/Shell_Scripts
>   a lot, and score higher 10.66% vs basepage 4K.
> - case2/3 wins on File_Copy/Pipe and score higher 5%+ than basepage 4K,
>   also case3 looks better on Shell_Scripts_(8_concurrent) than case2.
>
> 2) Unixbench 128core
> Index_Values_128core                    case1     case2     case3
> Dhrystone_2_using_register_variables    2.07%    -0.03%    -0.11%
> Double-Precision_Whetstone             -0.03%     0.00%    0.00%
> Execl_Throughput                       *39.28%*  -4.23%    1.93%
> File_Copy_1024_bufsize_2000_maxblocks   5.46%     1.30%    4.20%
> File_Copy_256_bufsize_500_maxblocks    -8.89%    *6.56%   *5.02%*
> File_Copy_4096_bufsize_8000_maxblocks   3.43%   *-5.46%*   0.56%
> Pipe_Throughput                         3.80%    *7.69%   *7.80%*
> Pipe-based_Context_Switching           *7.62%*    0.95%    4.69%
> Process_Creation                       *28.11%*  -2.79%    2.40%
> Shell_Scripts_(1_concurrent)           *39.68%*   1.86%   *5.30%*
> Shell_Scripts_(8_concurrent)           *41.35%*   2.49%   *7.16%*
> System_Call_Overhead                   -1.55%    -0.04%   *8.23%*
>
> System_Benchmarks_Index_Score          12.08%     0.63%    3.88%
>
> For 128core,
> - case1 wins on Execl_Throughput/Process_Creation/Shell_Scripts
>   a lot, also good at Pipe-based_Context_Switching, and score higher
>   12.08% vs basepage 4K.
> - case2/case3 wins on File_Copy_256/Pipe_Throughput, but case2 is
>   not better than basepage 4K, case3 wins 3.88%.
>
> 3) Lmbench Processor_processes
> Processor_Processes    case1      case2      case3
> null_call              1.76%      0.40%     0.65%
> null_io               -0.76%     -0.38%    -0.23%
> stat                 *-16.09%*  *-12.49%*   4.22%
> open_close            -2.69%      4.51%     3.21%
> slct_TCP              -0.56%      0.00%    -0.44%
> sig_inst              -1.54%      0.73%     0.70%
> sig_hndl              -2.85%      0.01%     1.85%
> fork_proc            *23.31%*     8.77%    -5.42%
> exec_proc            *13.22%*    -0.30%     1.09%
> sh_proc              *14.04%*    -0.10%     1.09%
>
> - case1 is much better than basepage 4K, same as Unixbench test,
>   case2 is better on fork_proc, but case3 is worse
> - note: the variance of fork/exec/sh is bigger than others
>
> 4) Lmbench Context_switching_ctxsw
> Context_switching_ctxsw  case1     case2         case3
> 2p/0K                   -12.16%    -5.29%       -1.86%
> 2p/16K                  -11.26%    -3.71%       -4.53%
> 2p/64K                  -2.60%      3.84%       -1.98%
> 8p/16K                  -7.56%     -1.21%       -0.88%
> 8p/64K                   5.10%      4.88%        1.19%
> 16p/16K                 -5.81%     -2.44%       -3.84%
> 16p/64K                  4.29%     -1.94%       -2.50%
> - case1/2/3 worse than basepage 4K and case1 is the worst.
>
> 5) Lmbench Local_latencies
> Local_latencies      case1      case2     case3
> Pipe                -9.23%      0.58%    -4.34%
> AF_UNIX             -5.34%     -1.76%     3.03%
> UDP                 -6.70%     -5.96%    -9.81%
> TCP                 -7.95%     -7.58%    -5.63%
> TCP_conn            -213.99%   -227.78%  -659.67%
> - TCP_conn is very unreliable, ignore it
> - case1/2/3 slower than basepage 4K
>
> 6) Lmbench File_&_VM_latencies
> File_&_VM_latencies    case1     case2        case3
> 10K_File_Create        2.60%    -0.52%         2.66%
> 10K_File_Delete       -2.91%    -5.20%        -2.11%
> 10K_File_Create       10.23%     1.18%         0.12%
> 10K_File_Delete      -17.76%    -2.97%        -1.49%
> Mmap_Latency         *63.05%*    2.57%        -0.96%
> Prot_Fault            10.41%    -3.21%       *-19.11%*
> Page_Fault          *-132.01%*   2.35%        -0.79%
> 100fd_selct          -1.20%      0.10%         0.31%
> - case1 is very good at Mmap_Latency and not good at Page_fault
> - case2/3 slower on Prot_Faul/10K_FILE_Delete vs basepage 4k,
>   the rest doesn't look much different.
>
> 7) Lmbench Local_bandwidths
> Local_bandwidths    case1   case2       case3
> Pipe               265.22%   15.44%     11.33%
> AF_UNIX            13.41%   -2.66%      2.63%
> TCP               -1.30%     25.90%     2.48%
> File_reread        14.79%    31.52%    -14.16%
> Mmap_reread        27.47%    49.00%    -0.11%
> Bcopy(libc)        2.58%     2.45%      2.46%
> Bcopy(hand)        25.78%    22.56%     22.68%
> Mem_read           38.26%    36.80%     36.49%
> Mem_write          10.93%    3.44%      3.12%
>
> - case1 is very good at bandwidth, case2 is better than basepage 4k
>   but lower than case1, case3 is bad at File_reread
>
> 8) Lmbench Memory_latencies
> Memory_latencies    case1     case2     case3
> L1_$                0.02%     0.00%    -0.03%
> L2_$               -1.56%    -2.65%    -1.25%
> Main_mem           50.82%     32.51%    33.47%
> Rand_mem           15.29%    -8.79%    -8.80%
>
> - case1 also good at Main/Rand mem access latencies,
> - case2/case3 is better at Main_mem, but worse at Rand_mem.
>
> Tested-by: Kefeng Wang <[email protected]>

2023-12-06 10:14:14

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] Multi-size THP for anonymous memory

On 05/12/2023 17:21, David Hildenbrand wrote:
> On 04.12.23 11:20, Ryan Roberts wrote:
>> Hi All,
>>
>> A new week, a new version, a new name... This is v8 of a series to implement
>> multi-size THP (mTHP) for anonymous memory (previously called "small-sized THP"
>> and "large anonymous folios"). Matthew objected to "small huge" so hopefully
>> this fares better.
>>
>> The objective of this is to improve performance by allocating larger chunks of
>> memory during anonymous page faults:
>>
>> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>>     pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>     and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
>>     overhead. This should benefit all architectures.
>> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>>     advantage of HW TLB compression techniques. A reduction in TLB pressure
>>     speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>     TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>
>> This version changes the name and tidies up some of the kernel code and test
>> code, based on feedback against v7 (see change log for details).
>>
>> By default, the existing behaviour (and performance) is maintained. The user
>> must explicitly enable multi-size THP to see the performance benefit. This is
>> done via a new sysfs interface (as recommended by David Hildenbrand - thanks to
>> David for the suggestion)! This interface is inspired by the existing
>> per-hugepage-size sysfs interface used by hugetlb, provides full backwards
>> compatibility with the existing PMD-size THP interface, and provides a base for
>> future extensibility. See [8] for detailed discussion of the interface.
>>
>> This series is based on mm-unstable (715b67adf4c8).
>
> I took a look at the core pieces. Some things might want some smaller tweaks,
> but nothing that should stop this from having fun in mm-unstable, and replacing
> the smaller things as we move forward.
>

Thanks! I'll address your comments and see if I can post another (final??)
version next week.

2023-12-06 10:16:19

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP

On 05/12/2023 20:16, Barry Song wrote:
> On Tue, Dec 5, 2023 at 11:48 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 05/12/2023 01:24, Barry Song wrote:
>>> On Tue, Dec 5, 2023 at 9:15 AM Barry Song <[email protected]> wrote:
>>>>
>>>> On Mon, Dec 4, 2023 at 6:21 PM Ryan Roberts <[email protected]> wrote:
>>>>>
>>>>> Introduce the logic to allow THP to be configured (through the new sysfs
>>>>> interface we just added) to allocate large folios to back anonymous
>>>>> memory, which are larger than the base page size but smaller than
>>>>> PMD-size. We call this new THP extension "multi-size THP" (mTHP).
>>>>>
>>>>> mTHP continues to be PTE-mapped, but in many cases can still provide
>>>>> similar benefits to traditional PMD-sized THP: Page faults are
>>>>> significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
>>>>> the configured order), but latency spikes are much less prominent
>>>>> because the size of each page isn't as huge as the PMD-sized variant and
>>>>> there is less memory to clear in each page fault. The number of per-page
>>>>> operations (e.g. ref counting, rmap management, lru list management) are
>>>>> also significantly reduced since those ops now become per-folio.
>>>>>
>>>>> Some architectures also employ TLB compression mechanisms to squeeze
>>>>> more entries in when a set of PTEs are virtually and physically
>>>>> contiguous and appropriately aligned. In this case, TLB misses will
>>>>> occur less often.
>>>>>
>>>>> The new behaviour is disabled by default, but can be enabled at runtime
>>>>> by writing to /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled
>>>>> (see documentation in previous commit). The long term aim is to change
>>>>> the default to include suitable lower orders, but there are some risks
>>>>> around internal fragmentation that need to be better understood first.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>> ---
>>>>> include/linux/huge_mm.h | 6 ++-
>>>>> mm/memory.c | 106 ++++++++++++++++++++++++++++++++++++----
>>>>> 2 files changed, 101 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>> index bd0eadd3befb..91a53b9835a4 100644
>>>>> --- a/include/linux/huge_mm.h
>>>>> +++ b/include/linux/huge_mm.h
>>>>> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
>>>>> #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>>>>>
>>>>> /*
>>>>> - * Mask of all large folio orders supported for anonymous THP.
>>>>> + * Mask of all large folio orders supported for anonymous THP; all orders up to
>>>>> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
>>>>> + * (which is a limitation of the THP implementation).
>>>>> */
>>>>> -#define THP_ORDERS_ALL_ANON BIT(PMD_ORDER)
>>>>> +#define THP_ORDERS_ALL_ANON ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
>>>>>
>>>>> /*
>>>>> * Mask of all large folio orders supported for file THP.
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 3ceeb0f45bf5..bf7e93813018 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -4125,6 +4125,84 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>> return ret;
>>>>> }
>>>>>
>>>>> +static bool pte_range_none(pte_t *pte, int nr_pages)
>>>>> +{
>>>>> + int i;
>>>>> +
>>>>> + for (i = 0; i < nr_pages; i++) {
>>>>> + if (!pte_none(ptep_get_lockless(pte + i)))
>>>>> + return false;
>>>>> + }
>>>>> +
>>>>> + return true;
>>>>> +}
>>>>> +
>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>>>>> +{
>>>>> + gfp_t gfp;
>>>>> + pte_t *pte;
>>>>> + unsigned long addr;
>>>>> + struct folio *folio;
>>>>> + struct vm_area_struct *vma = vmf->vma;
>>>>> + unsigned long orders;
>>>>> + int order;
>>>>> +
>>>>> + /*
>>>>> + * If uffd is active for the vma we need per-page fault fidelity to
>>>>> + * maintain the uffd semantics.
>>>>> + */
>>>>> + if (userfaultfd_armed(vma))
>>>>> + goto fallback;
>>>>> +
>>>>> + /*
>>>>> + * Get a list of all the (large) orders below PMD_ORDER that are enabled
>>>>> + * for this vma. Then filter out the orders that can't be allocated over
>>>>> + * the faulting address and still be fully contained in the vma.
>>>>> + */
>>>>> + orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
>>>>> + BIT(PMD_ORDER) - 1);
>>>>> + orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>>>>> +
>>>>> + if (!orders)
>>>>> + goto fallback;
>>>>> +
>>>>> + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>>>> + if (!pte)
>>>>> + return ERR_PTR(-EAGAIN);
>>>>> +
>>>>> + order = first_order(orders);
>>>>> + while (orders) {
>>>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>>>> + vmf->pte = pte + pte_index(addr);
>>>>> + if (pte_range_none(vmf->pte, 1 << order))
>>>>> + break;
>>>>> + order = next_order(&orders, order);
>>>>> + }
>>>>> +
>>>>> + vmf->pte = NULL;
>>>>> + pte_unmap(pte);
>>>>> +
>>>>> + gfp = vma_thp_gfp_mask(vma);
>>>>> +
>>>>> + while (orders) {
>>>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>>>> + folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>>>> + if (folio) {
>>>>> + clear_huge_page(&folio->page, addr, 1 << order);
>>>>
>>>> Minor.
>>>>
>>>> Do we have to constantly clear a huge page? Is it possible to let
>>>> post_alloc_hook()
>>>> finish this job by using __GFP_ZERO/__GFP_ZEROTAGS as
>>>> vma_alloc_zeroed_movable_folio() is doing?
>>
>> I'm currently following the same allocation pattern as is done for PMD-sized
>> THP. In earlier versions of this patch I was trying to be smarter and use the
>> __GFP_ZERO/__GFP_ZEROTAGS as you suggest, but I was advised to keep it simple
>> and follow the existing pattern.
>>
>> I have a vague recollection __GFP_ZERO is not preferred for large folios because
>> of some issue with virtually indexed caches? (Matthew: did I see you mention
>> that in some other context?)
>>
>> That said, I wasn't aware that Android ships with
>> CONFIG_INIT_ON_ALLOC_DEFAULT_ON (I thought it was only used as a debug option),
>> so I can see the potential for some overhead reduction here.
>>
>> Options:
>>
>> 1) leave it as is and accept the duplicated clearing
>> 2) Pass __GFP_ZERO and remove clear_huge_page()
>> 3) define __GFP_SKIP_ZERO even when kasan is not enabled and pass it down so
>> clear_huge_page() is the only clear
>> 4) make clear_huge_page() conditional on !want_init_on_alloc()
>>
>> I prefer option 4. What do you think?
>
> either 1 and 4 is ok to me if we will finally remove this duplicated
> clear_huge_page on top.
> 4 is even better as it can at least temporarily resolve the problem.

I'm going to stick with option 1 for this series. Then we can fix it uniformly
here and for PMD-sized THP in a separate patch (possibly with the approach
suggested in 4).

>
> in Android gki_defconfig,
> https://android.googlesource.com/kernel/common/+/refs/heads/android14-6.1-lts/arch/arm64/configs/gki_defconfig
>
> Android always has the below,
> CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y
>
> here is some explanation for the reason,
> https://source.android.com/docs/security/test/memory-safety/zero-initialized-memory
>
>>
>> As an aside, I've also noticed that clear_huge_page() should take vmf->address
>> so that it clears the faulting page last to keep the cache hot. If we decide on
>> an option that keeps clear_huge_page(), I'll also make that change.

I'll make this change for the next version.
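
A sketch of that tweak: clear_huge_page()'s second argument is a hint for which
subpage to clear last, so passing the faulting address rather than the aligned
base keeps the faulting page cache-hot (assuming the surrounding loop stays as
posted):

                folio = vma_alloc_folio(gfp, order, vma, addr, true);
                if (folio) {
                        /* Clear the faulting subpage last to keep it cache-hot. */
                        clear_huge_page(&folio->page, vmf->address, 1 << order);
                }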

>>
>> Thanks,
>> Ryan
>>
>>>>
>
> Thanks
> Barry

2023-12-06 10:23:35

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] Multi-size THP for anonymous memory

On 06.12.23 11:13, Ryan Roberts wrote:
> On 05/12/2023 17:21, David Hildenbrand wrote:
>> On 04.12.23 11:20, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> A new week, a new version, a new name... This is v8 of a series to implement
>>> multi-size THP (mTHP) for anonymous memory (previously called "small-sized THP"
>>> and "large anonymous folios"). Matthew objected to "small huge" so hopefully
>>> this fares better.
>>>
>>> The objective of this is to improve performance by allocating larger chunks of
>>> memory during anonymous page faults:
>>>
>>> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>>>     pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>>     and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
>>>     overhead. This should benefit all architectures.
>>> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>>>     advantage of HW TLB compression techniques. A reduction in TLB pressure
>>>     speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>>     TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>>
>>> This version changes the name and tidies up some of the kernel code and test
>>> code, based on feedback against v7 (see change log for details).
>>>
>>> By default, the existing behaviour (and performance) is maintained. The user
>>> must explicitly enable multi-size THP to see the performance benefit. This is
>>> done via a new sysfs interface (as recommended by David Hildenbrand - thanks to
>>> David for the suggestion)! This interface is inspired by the existing
>>> per-hugepage-size sysfs interface used by hugetlb, provides full backwards
>>> compatibility with the existing PMD-size THP interface, and provides a base for
>>> future extensibility. See [8] for detailed discussion of the interface.
>>>
>>> This series is based on mm-unstable (715b67adf4c8).
>>
>> I took a look at the core pieces. Some things might want some smaller tweaks,
>> but nothing that should stop this from having fun in mm-unstable, and replacing
>> the smaller things as we move forward.
>>
>
> Thanks! I'll address your comments and see if I can post another (final??)
> version next week.

It's always possible to do incremental changes on top that Andrew will
squash in the end. I even recall that he prefers it that way once a series
has been in mm-unstable for a bit, so one can better observe the diffs
and what effect they have.

--
Cheers,

David / dhildenb

2023-12-06 10:26:21

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP

On Wed, Dec 6, 2023 at 11:16 PM Ryan Roberts <[email protected]> wrote:
>
> On 05/12/2023 20:16, Barry Song wrote:
> > On Tue, Dec 5, 2023 at 11:48 PM Ryan Roberts <[email protected]> wrote:
> >>
> >> On 05/12/2023 01:24, Barry Song wrote:
> >>> On Tue, Dec 5, 2023 at 9:15 AM Barry Song <[email protected]> wrote:
> >>>>
> >>>> On Mon, Dec 4, 2023 at 6:21 PM Ryan Roberts <[email protected]> wrote:
> >>>>>
> >>>>> Introduce the logic to allow THP to be configured (through the new sysfs
> >>>>> interface we just added) to allocate large folios to back anonymous
> >>>>> memory, which are larger than the base page size but smaller than
> >>>>> PMD-size. We call this new THP extension "multi-size THP" (mTHP).
> >>>>>
> >>>>> mTHP continues to be PTE-mapped, but in many cases can still provide
> >>>>> similar benefits to traditional PMD-sized THP: Page faults are
> >>>>> significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
> >>>>> the configured order), but latency spikes are much less prominent
> >>>>> because the size of each page isn't as huge as the PMD-sized variant and
> >>>>> there is less memory to clear in each page fault. The number of per-page
> >>>>> operations (e.g. ref counting, rmap management, lru list management) are
> >>>>> also significantly reduced since those ops now become per-folio.
> >>>>>
> >>>>> Some architectures also employ TLB compression mechanisms to squeeze
> >>>>> more entries in when a set of PTEs are virtually and physically
> >>>>> contiguous and appropriately aligned. In this case, TLB misses will
> >>>>> occur less often.
> >>>>>
> >>>>> The new behaviour is disabled by default, but can be enabled at runtime
> >>>>> by writing to /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled
> >>>>> (see documentation in previous commit). The long term aim is to change
> >>>>> the default to include suitable lower orders, but there are some risks
> >>>>> around internal fragmentation that need to be better understood first.
> >>>>>
> >>>>> Signed-off-by: Ryan Roberts <[email protected]>
> >>>>> ---
> >>>>> include/linux/huge_mm.h | 6 ++-
> >>>>> mm/memory.c | 106 ++++++++++++++++++++++++++++++++++++----
> >>>>> 2 files changed, 101 insertions(+), 11 deletions(-)
> >>>>>
> >>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >>>>> index bd0eadd3befb..91a53b9835a4 100644
> >>>>> --- a/include/linux/huge_mm.h
> >>>>> +++ b/include/linux/huge_mm.h
> >>>>> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
> >>>>> #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> >>>>>
> >>>>> /*
> >>>>> - * Mask of all large folio orders supported for anonymous THP.
> >>>>> + * Mask of all large folio orders supported for anonymous THP; all orders up to
> >>>>> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
> >>>>> + * (which is a limitation of the THP implementation).
> >>>>> */
> >>>>> -#define THP_ORDERS_ALL_ANON BIT(PMD_ORDER)
> >>>>> +#define THP_ORDERS_ALL_ANON ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
> >>>>>
> >>>>> /*
> >>>>> * Mask of all large folio orders supported for file THP.
> >>>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>>> index 3ceeb0f45bf5..bf7e93813018 100644
> >>>>> --- a/mm/memory.c
> >>>>> +++ b/mm/memory.c
> >>>>> @@ -4125,6 +4125,84 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>>> return ret;
> >>>>> }
> >>>>>
> >>>>> +static bool pte_range_none(pte_t *pte, int nr_pages)
> >>>>> +{
> >>>>> + int i;
> >>>>> +
> >>>>> + for (i = 0; i < nr_pages; i++) {
> >>>>> + if (!pte_none(ptep_get_lockless(pte + i)))
> >>>>> + return false;
> >>>>> + }
> >>>>> +
> >>>>> + return true;
> >>>>> +}
> >>>>> +
> >>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>>>> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> >>>>> +{
> >>>>> + gfp_t gfp;
> >>>>> + pte_t *pte;
> >>>>> + unsigned long addr;
> >>>>> + struct folio *folio;
> >>>>> + struct vm_area_struct *vma = vmf->vma;
> >>>>> + unsigned long orders;
> >>>>> + int order;
> >>>>> +
> >>>>> + /*
> >>>>> + * If uffd is active for the vma we need per-page fault fidelity to
> >>>>> + * maintain the uffd semantics.
> >>>>> + */
> >>>>> + if (userfaultfd_armed(vma))
> >>>>> + goto fallback;
> >>>>> +
> >>>>> + /*
> >>>>> + * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >>>>> + * for this vma. Then filter out the orders that can't be allocated over
> >>>>> + * the faulting address and still be fully contained in the vma.
> >>>>> + */
> >>>>> + orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> >>>>> + BIT(PMD_ORDER) - 1);
> >>>>> + orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> >>>>> +
> >>>>> + if (!orders)
> >>>>> + goto fallback;
> >>>>> +
> >>>>> + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> >>>>> + if (!pte)
> >>>>> + return ERR_PTR(-EAGAIN);
> >>>>> +
> >>>>> + order = first_order(orders);
> >>>>> + while (orders) {
> >>>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> >>>>> + vmf->pte = pte + pte_index(addr);
> >>>>> + if (pte_range_none(vmf->pte, 1 << order))
> >>>>> + break;
> >>>>> + order = next_order(&orders, order);
> >>>>> + }
> >>>>> +
> >>>>> + vmf->pte = NULL;
> >>>>> + pte_unmap(pte);
> >>>>> +
> >>>>> + gfp = vma_thp_gfp_mask(vma);
> >>>>> +
> >>>>> + while (orders) {
> >>>>> + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> >>>>> + folio = vma_alloc_folio(gfp, order, vma, addr, true);
> >>>>> + if (folio) {
> >>>>> + clear_huge_page(&folio->page, addr, 1 << order);
> >>>>
> >>>> Minor.
> >>>>
> >>>> Do we have to constantly clear a huge page? Is it possible to let
> >>>> post_alloc_hook()
> >>>> finish this job by using __GFP_ZERO/__GFP_ZEROTAGS as
> >>>> vma_alloc_zeroed_movable_folio() is doing?
> >>
> >> I'm currently following the same allocation pattern as is done for PMD-sized
> >> THP. In earlier versions of this patch I was trying to be smarter and use the
> >> __GFP_ZERO/__GFP_ZEROTAGS as you suggest, but I was advised to keep it simple
> >> and follow the existing pattern.
> >>
> >> I have a vague recollection __GFP_ZERO is not preferred for large folios because
> >> of some issue with virtually indexed caches? (Matthew: did I see you mention
> >> that in some other context?)
> >>
> >> That said, I wasn't aware that Android ships with
> >> CONFIG_INIT_ON_ALLOC_DEFAULT_ON (I thought it was only used as a debug option),
> >> so I can see the potential for some overhead reduction here.
> >>
> >> Options:
> >>
> >> 1) leave it as is and accept the duplicated clearing
> >> 2) Pass __GFP_ZERO and remove clear_huge_page()
> >> 3) define __GFP_SKIP_ZERO even when kasan is not enabled and pass it down so
> >> clear_huge_page() is the only clear
> >> 4) make clear_huge_page() conditional on !want_init_on_alloc()
> >>
> >> I prefer option 4. What do you think?
> >
> > either 1 and 4 is ok to me if we will finally remove this duplicated
> > clear_huge_page on top.
> > 4 is even better as it can at least temporarily resolve the problem.
>
> I'm going to stick with option 1 for this series. Then we can fix it uniformly
> here and for PMD-sized THP in a separate patch (possibly with the approach
> suggested in 4).

OK, thanks. There is no one fixing PMD-sized THP, probably because PMD-sized
THP is shut down immediately after Android boots :-)

>
> >
> > in Android gki_defconfig,
> > https://android.googlesource.com/kernel/common/+/refs/heads/android14-6.1-lts/arch/arm64/configs/gki_defconfig
> >
> > Android always has the below,
> > CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y
> >
> > here is some explanation for the reason,
> > https://source.android.com/docs/security/test/memory-safety/zero-initialized-memory
> >
> >>
> >> As an aside, I've also noticed that clear_huge_page() should take vmf->address
> >> so that it clears the faulting page last to keep the cache hot. If we decide on
> >> an option that keeps clear_huge_page(), I'll also make that change.
>
> I'll make this change for the next version.
>
> >>
> >> Thanks,
> >> Ryan
> >>

Thanks
Barry

2023-12-06 13:18:59

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs interface

On 05/12/2023 16:57, David Hildenbrand wrote:
> On 04.12.23 11:20, Ryan Roberts wrote:
>> In preparation for adding support for anonymous multi-size THP,
>> introduce new sysfs structure that will be used to control the new
>> behaviours. A new directory is added under transparent_hugepage for each
>> supported THP size, and contains an `enabled` file, which can be set to
>> "inherit" (to inherit the global setting), "always", "madvise" or
>> "never". For now, the kernel still only supports PMD-sized anonymous
>> THP, so only 1 directory is populated.
>>
>> The first half of the change converts transhuge_vma_suitable() and
>> hugepage_vma_check() so that they take a bitfield of orders for which
>> the user wants to determine support, and the functions filter out all
>> the orders that can't be supported, given the current sysfs
>> configuration and the VMA dimensions. If there is only 1 order set in
>> the input then the output can continue to be treated like a boolean;
>> this is the case for most call sites. The resulting functions are
>> renamed to thp_vma_suitable_orders() and thp_vma_allowable_orders()
>> respectively.
>>
>> The second half of the change implements the new sysfs interface. It has
>> been done so that each supported THP size has a `struct thpsize`, which
>> describes the relevant metadata and is itself a kobject. This is pretty
>> minimal for now, but should make it easy to add new per-thpsize files to
>> the interface if needed in future (e.g. per-size defrag). Rather than
>> keep the `enabled` state directly in the struct thpsize, I've elected to
>> directly encode it into huge_anon_orders_[always|madvise|inherit]
>> bitfields since this reduces the amount of work required in
>> thp_vma_allowable_orders() which is called for every page fault.
>>
>> See Documentation/admin-guide/mm/transhuge.rst, as modified by this
>> commit, for details of how the new sysfs interface works.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>
> Some comments mostly regarding thp_vma_allowable_orders and friends. In general,
> LGTM. I'll have to go over the order logic once again, I got a bit lost once we
> started mixing anon and file orders.
>
> [...]
>
> Doc updates all looked good to me, skimming over them.
>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index fa0350b0812a..bd0eadd3befb 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>
> [...]
>
>> +static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
>> +        unsigned long addr, unsigned long orders)
>> +{
>> +    int order;
>> +
>> +    /*
>> +     * Iterate over orders, highest to lowest, removing orders that don't
>> +     * meet alignment requirements from the set. Exit loop at first order
>> +     * that meets requirements, since all lower orders must also meet
>> +     * requirements.
>> +     */
>> +
>> +    order = first_order(orders);
>
> nit: "highest_order" or "largest_order" would be more expressive regarding the
> actual semantics.

Yep, will call it "highest_order".

>
>> +
>> +    while (orders) {
>> +        unsigned long hpage_size = PAGE_SIZE << order;
>> +        unsigned long haddr = ALIGN_DOWN(addr, hpage_size);
>> +
>> +        if (haddr >= vma->vm_start &&
>> +            haddr + hpage_size <= vma->vm_end) {
>> +            if (!vma_is_anonymous(vma)) {
>> +                if (IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) -
>> +                        vma->vm_pgoff,
>> +                        hpage_size >> PAGE_SHIFT))
>> +                    break;
>> +            } else
>> +                break;
>
> Comment: Codying style wants you to use if () {} else {}
>
> But I'd recommend for the conditions:
>
> if (haddr < vma->vm_start ||
>     haddr + hpage_size > vma->vm_end)
>     continue;
> /* Don't have to check pgoff for anonymous vma */
> if (!vma_is_anonymous(vma))
>     break;
> if (IS_ALIGNED((...
>     break;

OK I'll take this structure.

>
> [...]
>
>
>> +/**
>> + * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
>> + * @vma:  the vm area to check
>> + * @vm_flags: use these vm_flags instead of vma->vm_flags
>> + * @smaps: whether answer will be used for smaps file
>> + * @in_pf: whether answer will be used by page fault handler
>> + * @enforce_sysfs: whether sysfs config should be taken into account
>> + * @orders: bitfield of all orders to consider
>> + *
>> + * Calculates the intersection of the requested hugepage orders and the allowed
>> + * hugepage orders for the provided vma. Permitted orders are encoded as a set
>> + * bit at the corresponding bit position (bit-2 corresponds to order-2, bit-3
>> + * corresponds to order-3, etc). Order-0 is never considered a hugepage order.
>> + *
>> + * Return: bitfield of orders allowed for hugepage in the vma. 0 if no hugepage
>> + * orders are allowed.
>> + */
>> +unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>> +                       unsigned long vm_flags, bool smaps,
>> +                       bool in_pf, bool enforce_sysfs,
>> +                       unsigned long orders)
>> +{
>> +    /* Check the intersection of requested and supported orders. */
>> +    orders &= vma_is_anonymous(vma) ?
>> +            THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;
>> +    if (!orders)
>> +        return 0;
>
> Comment: if this is called from some hot path, we might want to move as much as
> possible into a header, so we can avoid this function call here when e.g., THP
> are completely disabled etc.

If THP is completely disabled (compiled out) then thp_vma_allowable_orders() is
defined as a header inline that returns 0. I'm not sure there are any paths in
practice which are likely to ask for a set of orders which are never supported
(i.e. where this specific check would return 0). And the "are they run time
enabled" check is further down and fairly involved, so not sure that's ideal for
an inline.
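
For reference, the sort of stub referred to above (illustrative only; same
signature as the header hunk quoted earlier):

#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
                                                     unsigned long vm_flags,
                                                     bool smaps, bool in_pf,
                                                     bool enforce_sysfs,
                                                     unsigned long orders)
{
        return 0;
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */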

I haven't changed the pattern from how it was previously, so I don't think it
should be any more expensive. Which parts exactly do you want to move to a header?


>
>> +
>>       if (!vma->vm_mm)        /* vdso */
>> -        return false;
>> +        return 0;
>>         /*
>>        * Explicitly disabled through madvise or prctl, or some
>> @@ -88,16 +141,16 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
>> unsigned long vm_flags,
>>        * */
>>       if ((vm_flags & VM_NOHUGEPAGE) ||
>>           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>> -        return false;
>> +        return 0;
>>       /*
>>        * If the hardware/firmware marked hugepage support disabled.
>>        */
>>       if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>> -        return false;
>> +        return 0;
>>         /* khugepaged doesn't collapse DAX vma, but page fault is fine. */
>>       if (vma_is_dax(vma))
>> -        return in_pf;
>> +        return in_pf ? orders : 0;
>>         /*
>>        * khugepaged special VMA and hugetlb VMA.
>> @@ -105,17 +158,29 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
>> unsigned long vm_flags,
>>        * VM_MIXEDMAP set.
>>        */
>>       if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
>> -        return false;
>> +        return 0;
>>         /*
>> -     * Check alignment for file vma and size for both file and anon vma.
>> +     * Check alignment for file vma and size for both file and anon vma by
>> +     * filtering out the unsuitable orders.
>>        *
>>        * Skip the check for page fault. Huge fault does the check in fault
>> -     * handlers. And this check is not suitable for huge PUD fault.
>> +     * handlers.
>>        */
>> -    if (!in_pf &&
>> -        !transhuge_vma_suitable(vma, (vma->vm_end - HPAGE_PMD_SIZE)))
>> -        return false;
>> +    if (!in_pf) {
>> +        int order = first_order(orders);
>> +        unsigned long addr;
>> +
>> +        while (orders) {
>> +            addr = vma->vm_end - (PAGE_SIZE << order);
>> +            if (thp_vma_suitable_orders(vma, addr, BIT(order)))
>> +                break;
>
> Comment: you'd want a "thp_vma_suitable_order" helper here. But maybe the
> compiler is smart enough to optimize the loop and everyything else out.

I'm happy to refactor so that thp_vma_suitable_order() is the basic primitive,
then make thp_vma_suitable_orders() a loop that calls thp_vma_suitable_order()
(that's basically how it is laid out already, just all in one function). Is that
what you are requesting?
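
Roughly along these lines (a sketch of the split only, using the
highest_order()/next_order() helpers from this series and the check ordering
suggested above; not the final code):

static inline bool thp_vma_suitable_order(struct vm_area_struct *vma,
                                          unsigned long addr, int order)
{
        unsigned long hpage_size = PAGE_SIZE << order;
        unsigned long haddr = ALIGN_DOWN(addr, hpage_size);

        if (haddr < vma->vm_start || haddr + hpage_size > vma->vm_end)
                return false;
        /* Don't have to check pgoff for anonymous vma */
        if (vma_is_anonymous(vma))
                return true;
        return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
                          hpage_size >> PAGE_SHIFT);
}

static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
                                                    unsigned long addr,
                                                    unsigned long orders)
{
        int order = highest_order(orders);

        /* Remove unsuitable orders, highest to lowest; stop at the first fit. */
        while (orders) {
                if (thp_vma_suitable_order(vma, addr, order))
                        break;
                order = next_order(&orders, order);
        }

        return orders;
}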

>
> [...]
>
>> +
>> +static ssize_t thpsize_enabled_store(struct kobject *kobj,
>> +                     struct kobj_attribute *attr,
>> +                     const char *buf, size_t count)
>> +{
>> +    int order = to_thpsize(kobj)->order;
>> +    ssize_t ret = count;
>> +
>> +    if (sysfs_streq(buf, "always")) {
>> +        set_bit(order, &huge_anon_orders_always);
>> +        clear_bit(order, &huge_anon_orders_inherit);
>> +        clear_bit(order, &huge_anon_orders_madvise);
>> +    } else if (sysfs_streq(buf, "inherit")) {
>> +        set_bit(order, &huge_anon_orders_inherit);
>> +        clear_bit(order, &huge_anon_orders_always);
>> +        clear_bit(order, &huge_anon_orders_madvise);
>> +    } else if (sysfs_streq(buf, "madvise")) {
>> +        set_bit(order, &huge_anon_orders_madvise);
>> +        clear_bit(order, &huge_anon_orders_always);
>> +        clear_bit(order, &huge_anon_orders_inherit);
>> +    } else if (sysfs_streq(buf, "never")) {
>> +        clear_bit(order, &huge_anon_orders_always);
>> +        clear_bit(order, &huge_anon_orders_inherit);
>> +        clear_bit(order, &huge_anon_orders_madvise);
>
> Note: I was wondering for a second if some concurrent cames could lead to an
> inconsistent state. I think in the worst case we'll simply end up with "never"
> on races.

You mean if different threads try to write different values to this file
concurrently? Or if there is a concurrent fault that tries to read the flags
while they are being modified?

I thought about this for a long time too and wasn't sure what was best. The
existing global enabled store impl clears the bits first then sets the bit. With
this approach you can end up with multiple bits set if there is a race to set
different values, and you can end up with a faulting thread seeing "never" if it
reads the bits after they have been cleared but before setting them.

I decided to set the new bit before clearing the old bits, which is different; a
racing fault will never see "never", but as you say, a race to set the file could
result in "never" being set.

On reflection, it's probably best to set the bit *last* like the global control
does?
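
For one of the cases, the "set last" ordering would read something like this
(sketch only):

        if (sysfs_streq(buf, "always")) {
                clear_bit(order, &huge_anon_orders_inherit);
                clear_bit(order, &huge_anon_orders_madvise);
                /*
                 * Set the requested state last, like the global control: a
                 * racing store can then never leave the final state as
                 * "never", although a concurrent fault may transiently
                 * observe no bits set.
                 */
                set_bit(order, &huge_anon_orders_always);
        }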


>
> [...]
>
>>   static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
>>   {
>>       int err;
>> +    struct thpsize *thpsize;
>> +    unsigned long orders;
>> +    int order;
>> +
>> +    /*
>> +     * Default to setting PMD-sized THP to inherit the global setting and
>> +     * disable all other sizes. powerpc's PMD_ORDER isn't a compile-time
>> +     * constant so we have to do this here.
>> +     */
>> +    huge_anon_orders_inherit = BIT(PMD_ORDER);
>>         *hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
>>       if (unlikely(!*hugepage_kobj)) {
>> @@ -434,8 +631,24 @@ static int __init hugepage_init_sysfs(struct kobject
>> **hugepage_kobj)
>>           goto remove_hp_group;
>>       }
>>   +    orders = THP_ORDERS_ALL_ANON;
>> +    order = first_order(orders);
>> +    while (orders) {
>> +        thpsize = thpsize_create(order, *hugepage_kobj);
>> +        if (IS_ERR(thpsize)) {
>> +            pr_err("failed to create thpsize for order %d\n", order);
>> +            err = PTR_ERR(thpsize);
>> +            goto remove_all;
>> +        }
>> +        list_add(&thpsize->node, &thpsize_list);
>> +        order = next_order(&orders, order);
>> +    }
>> +
>>       return 0;
>>  
>
> [...]
>
>>       page = compound_head(page);
>> @@ -5116,7 +5116,8 @@ static vm_fault_t __handle_mm_fault(struct
>> vm_area_struct *vma,
>>           return VM_FAULT_OOM;
>>   retry_pud:
>>       if (pud_none(*vmf.pud) &&
>> -        hugepage_vma_check(vma, vm_flags, false, true, true)) {
>> +        thp_vma_allowable_orders(vma, vm_flags, false, true, true,
>> +                     BIT(PUD_ORDER))) {
>>           ret = create_huge_pud(&vmf);
>>           if (!(ret & VM_FAULT_FALLBACK))
>>               return ret;
>> @@ -5150,7 +5151,8 @@ static vm_fault_t __handle_mm_fault(struct
>> vm_area_struct *vma,
>>           goto retry_pud;
>>         if (pmd_none(*vmf.pmd) &&
>> -        hugepage_vma_check(vma, vm_flags, false, true, true)) {
>> +        thp_vma_allowable_orders(vma, vm_flags, false, true, true,
>> +                     BIT(PMD_ORDER))) {
>
> Comment: A helper like "thp_vma_allowable_order(vma, PMD_ORDER)" might make this
> easier to read -- and the implemenmtation will be faster.

I'm happy to do this and use it to improve readability:

#define thp_vma_allowable_order(..., order) \
thp_vma_allowable_orders(..., BIT(order))

This wouldn't make the implementation any faster though; are you suggesting a
completely separate impl? Even then, I don't think there is much scope to make
it faster for the case where there is only 1 order in the bitfield.
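
Spelled out, one plausible form of the wrapper (argument names follow the
function above; for illustration only):

#define thp_vma_allowable_order(vma, vm_flags, smaps, in_pf, enforce_sysfs, order) \
        thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf, enforce_sysfs, \
                                 BIT(order))

The PMD fault-path call site quoted earlier would then read, e.g.,
thp_vma_allowable_order(vma, vm_flags, false, true, true, PMD_ORDER).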

>
>>           ret = create_huge_pmd(&vmf);
>>           if (!(ret & VM_FAULT_FALLBACK))
>>               return ret;
>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>> index e0b368e545ed..64da127cc267 100644
>> --- a/mm/page_vma_mapped.c
>> +++ b/mm/page_vma_mapped.c
>> @@ -268,7 +268,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>                * cleared *pmd but not decremented compound_mapcount().
>>                */
>>               if ((pvmw->flags & PVMW_SYNC) &&
>> -                transhuge_vma_suitable(vma, pvmw->address) &&
>> +                thp_vma_suitable_orders(vma, pvmw->address,
>> +                            BIT(PMD_ORDER)) &&
>
> Comment: Similarly, a helper like "thp_vma_suitable_order(vma, PMD_ORDER)" might
> make this easier to read.

Yep, will do this.

>
>>                   (pvmw->nr_pages >= HPAGE_PMD_NR)) {
>>                   spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
>>  
>

2023-12-06 14:20:26

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP

On 05/12/2023 16:32, David Hildenbrand wrote:
> On 04.12.23 11:20, Ryan Roberts wrote:
>> Introduce the logic to allow THP to be configured (through the new sysfs
>> interface we just added) to allocate large folios to back anonymous
>> memory, which are larger than the base page size but smaller than
>> PMD-size. We call this new THP extension "multi-size THP" (mTHP).
>>
>> mTHP continues to be PTE-mapped, but in many cases can still provide
>> similar benefits to traditional PMD-sized THP: Page faults are
>> significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
>> the configured order), but latency spikes are much less prominent
>> because the size of each page isn't as huge as the PMD-sized variant and
>> there is less memory to clear in each page fault. The number of per-page
>> operations (e.g. ref counting, rmap management, lru list management) are
>> also significantly reduced since those ops now become per-folio.
>>
>> Some architectures also employ TLB compression mechanisms to squeeze
>> more entries in when a set of PTEs are virtually and physically
>> contiguous and appropriately aligned. In this case, TLB misses will
>> occur less often.
>>
>> The new behaviour is disabled by default, but can be enabled at runtime
>> by writing to /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled
>> (see documentation in previous commit). The long term aim is to change
>> the default to include suitable lower orders, but there are some risks
>> around internal fragmentation that need to be better understood first.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>
> In general, looks good to me, some comments/nits. And the usual "let's make sure
> we don't degrade order-0 and keep that as fast as possible" comment.
>
>> ---
>>   include/linux/huge_mm.h |   6 ++-
>>   mm/memory.c             | 106 ++++++++++++++++++++++++++++++++++++----
>>   2 files changed, 101 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index bd0eadd3befb..91a53b9835a4 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
>>   #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>>     /*
>> - * Mask of all large folio orders supported for anonymous THP.
>> + * Mask of all large folio orders supported for anonymous THP; all orders up to
>> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
>> + * (which is a limitation of the THP implementation).
>>    */
>> -#define THP_ORDERS_ALL_ANON    BIT(PMD_ORDER)
>> +#define THP_ORDERS_ALL_ANON    ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
>>     /*
>>    * Mask of all large folio orders supported for file THP.
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 3ceeb0f45bf5..bf7e93813018 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4125,6 +4125,84 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>       return ret;
>>   }
>>   +static bool pte_range_none(pte_t *pte, int nr_pages)
>> +{
>> +    int i;
>> +
>> +    for (i = 0; i < nr_pages; i++) {
>> +        if (!pte_none(ptep_get_lockless(pte + i)))
>> +            return false;
>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>> +{
>> +    gfp_t gfp;
>> +    pte_t *pte;
>> +    unsigned long addr;
>> +    struct folio *folio;
>> +    struct vm_area_struct *vma = vmf->vma;
>> +    unsigned long orders;
>> +    int order;
>
> Nit: reverse christmas tree encouraged ;)

ACK will fix.

>
>> +
>> +    /*
>> +     * If uffd is active for the vma we need per-page fault fidelity to
>> +     * maintain the uffd semantics.
>> +     */
>> +    if (userfaultfd_armed(vma))
>
> Nit: unlikely()

ACK will fix.

>
>> +        goto fallback;
>> +
>> +    /*
>> +     * Get a list of all the (large) orders below PMD_ORDER that are enabled
>> +     * for this vma. Then filter out the orders that can't be allocated over
>> +     * the faulting address and still be fully contained in the vma.
>> +     */
>> +    orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
>> +                      BIT(PMD_ORDER) - 1);
>> +    orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>
> Comment: Both will eventually loop over all orders, correct? Could eventually be
> sped up in the future.

No, only thp_vma_suitable_orders() will loop. thp_vma_allowable_orders() only
loops if in_pf=false (it's true here).

>
> Nit: the orders = ... order = ... looks like this might deserve a helper
> function that makes this easier to read.

To be honest, the existing function that I've modified is a bit of a mess.
thp_vma_allowable_orders() calls thp_vma_suitable_orders() if we are not in a
page fault, because the page fault handlers already do that check themselves. It
would be nice to refactor the whole thing so that thp_vma_allowable_orders() is
a strict superset of thp_vma_suitable_orders(). Then this can just call
thp_vma_allowable_orders(). But that's going to start touching the PMD and PUD
handlers, so I'd prefer to leave that for a separate patch set.
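
For illustration, a rough sketch of what this call site could collapse to under
that hypothetical refactoring (the real function does not currently take the
faulting address, so the extra parameter here is an assumption):

    orders = thp_vma_allowable_orders(vma, vma->vm_flags, vmf->address,
                                      false, true, true, BIT(PMD_ORDER) - 1);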

>
> Nit: Why call thp_vma_suitable_orders if the orders are already 0? Again, some
> helper might be reasonable where that is handled internally.

Because thp_vma_suitable_orders() will handle it safely and is inline, so it
should be just as efficient? This would go away with the refactoring described above.

>
> Comment: For order-0 we'll always perform a function call to both
> thp_vma_allowable_orders() / thp_vma_suitable_orders(). We should perform some
> fast and efficient check of whether any <PMD THPs are even enabled in the system /
> for this VMA, and if they aren't, just fall back before doing more expensive checks.

thp_vma_allowable_orders() is inline as you mentioned.

I was deliberately trying to keep all the decision logic in one place
(thp_vma_suitable_orders) because it's already pretty complicated. But if you
insist, how about this in the header:

static inline
unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
                                       unsigned long vm_flags, bool smaps,
                                       bool in_pf, bool enforce_sysfs,
                                       unsigned long orders)
{
    /* Optimization to check if required orders are enabled early. */
    if (enforce_sysfs && vma_is_anonymous(vma)) {
        unsigned long mask = READ_ONCE(huge_anon_orders_always);

        if (vm_flags & VM_HUGEPAGE)
            mask |= READ_ONCE(huge_anon_orders_madvise);
        if (hugepage_global_always() ||
            ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
            mask |= READ_ONCE(huge_anon_orders_inherit);

        orders &= mask;
        if (!orders)
            return 0;

        enforce_sysfs = false;
    }

    return __thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf,
                                      enforce_sysfs, orders);
}

Then the above check can be removed from __thp_vma_allowable_orders() - it will
still retain the `if (enforce_sysfs && !vma_is_anonymous(vma))` part.


>
>> +
>> +    if (!orders)
>> +        goto fallback;
>> +
>> +    pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>> +    if (!pte)
>> +        return ERR_PTR(-EAGAIN);
>> +
>> +    order = first_order(orders);
>> +    while (orders) {
>> +        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>> +        vmf->pte = pte + pte_index(addr);
>> +        if (pte_range_none(vmf->pte, 1 << order))
>> +            break;
>
> Comment: Likely it would make sense to scan only once and determine the "largest
> none range" around that address, having the largest suitable order in mind.

Yes, that's how I used to do it, but Yu Zhou requested simplifying to this,
IIRC. Perhaps this is an optimization opportunity for later?
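
For reference, a rough sketch of what that single-scan variant might look like
(not part of this series; it reuses first_order()/next_order() from this patch,
expects pte to point at the start of the PTE table for the PMD as in
alloc_anon_folio(), and assumes the faulting PTE itself is none):

static int largest_fitting_order(pte_t *pte, unsigned long address,
                                 unsigned long orders)
{
    int order = first_order(orders);
    unsigned long base = ALIGN_DOWN(address, PAGE_SIZE << order);
    int nr = 1 << order;
    int fault = (address - base) >> PAGE_SHIFT;
    int lo = fault, hi = fault + 1;

    pte += pte_index(base);

    /* Grow the pte_none() run downwards and upwards from the faulting PTE. */
    while (lo > 0 && pte_none(ptep_get_lockless(pte + lo - 1)))
        lo--;
    while (hi < nr && pte_none(ptep_get_lockless(pte + hi)))
        hi++;

    /* Return the highest enabled order whose aligned block fits in [lo, hi). */
    while (orders) {
        int start = (ALIGN_DOWN(address, PAGE_SIZE << order) - base)
                        >> PAGE_SHIFT;

        if (start >= lo && start + (1 << order) <= hi)
            return order;
        order = next_order(&orders, order);
    }

    return 0;    /* no large order fits; caller would fall back to one page */
}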

>
>> +        order = next_order(&orders, order);
>> +    }
>> +
>> +    vmf->pte = NULL;
>
> Nit: Can you elaborate why you are messing with vmf->pte here? A simple helper
> variable will make this code look less magical. Unless I am missing something
> important :)

Gahh, I used to pass the vmf to what pte_range_none() was refactored into (an
approach that was suggested by Yu Zhou IIRC). But since I did some refactoring
based on some comments from JohnH, I see I don't need that anymore. Agreed; it
will be much clearer just to use a local variable. Will fix.

>
>> +    pte_unmap(pte);
>> +
>> +    gfp = vma_thp_gfp_mask(vma);
>> +
>> +    while (orders) {
>> +        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>> +        folio = vma_alloc_folio(gfp, order, vma, addr, true);
>> +        if (folio) {
>> +            clear_huge_page(&folio->page, addr, 1 << order);
>> +            return folio;
>> +        }
>> +        order = next_order(&orders, order);
>> +    }
>> +
>
> Question: would it make sense to combine both loops? I suspect memory
> allocations with pte_offset_map()/kmap are problematic.

They are both operating on separate orders; next_order() is "consuming" an order
by removing the current one from the orders bitfield and returning the next one.

So the first loop starts at the highest order and keeps checking lower orders
until one fully fits in the VMA. And the second loop starts at the first order
that was found to fully fit and loops to lower orders until an allocation is
successful.

So I don't see a need to combine the loops.
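
For anyone following along, illustrative versions of the two helpers (the series
defines its own equivalents in huge_mm.h; the bodies below are just a sketch of
the bitfield semantics described above):

static inline int first_order(unsigned long orders)
{
    /* Highest set bit == highest remaining candidate order. */
    return fls_long(orders) - 1;
}

static inline int next_order(unsigned long *orders, int prev)
{
    /* "Consume" prev by clearing its bit, then return the next highest order. */
    *orders &= ~BIT(prev);
    return first_order(*orders);
}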

>
>> +fallback:
>> +    return vma_alloc_zeroed_movable_folio(vma, vmf->address);
>> +}
>> +#else
>> +#define alloc_anon_folio(vmf) \
>> +        vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
>> +#endif
>> +
>>   /*
>>    * We enter with non-exclusive mmap_lock (to exclude vma changes,
>>    * but allow concurrent faults), and pte mapped but not yet locked.
>> @@ -4132,6 +4210,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>    */
>>   static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>   {
>> +    int i;
>> +    int nr_pages = 1;
>> +    unsigned long addr = vmf->address;
>>       bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
>>       struct vm_area_struct *vma = vmf->vma;
>>       struct folio *folio;
>
> Nit: reverse christmas tree :)

ACK

>
>> @@ -4176,10 +4257,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>       /* Allocate our own private page. */
>>       if (unlikely(anon_vma_prepare(vma)))
>>           goto oom;
>> -    folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>> +    folio = alloc_anon_folio(vmf);
>> +    if (IS_ERR(folio))
>> +        return 0;
>>       if (!folio)
>>           goto oom;
>>   +    nr_pages = folio_nr_pages(folio);
>> +    addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>> +
>>       if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>>           goto oom_free_page;
>>       folio_throttle_swaprate(folio, GFP_KERNEL);
>> @@ -4196,12 +4282,13 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>       if (vma->vm_flags & VM_WRITE)
>>           entry = pte_mkwrite(pte_mkdirty(entry), vma);
>>   -    vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> -            &vmf->ptl);
>> +    vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>>       if (!vmf->pte)
>>           goto release;
>> -    if (vmf_pte_changed(vmf)) {
>> -        update_mmu_tlb(vma, vmf->address, vmf->pte);
>> +    if ((nr_pages == 1 && vmf_pte_changed(vmf)) ||
>> +        (nr_pages  > 1 && !pte_range_none(vmf->pte, nr_pages))) {
>> +        for (i = 0; i < nr_pages; i++)
>> +            update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>
> Comment: separating the order-0 case from the other case might make this easier
> to read.

Yeah fair enough. Will fix.
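
For example, the check could be split along these lines (a sketch of the
suggested restructuring, not the code as posted):

    if (nr_pages == 1 && vmf_pte_changed(vmf)) {
        update_mmu_tlb(vma, vmf->address, vmf->pte);
        goto release;
    } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
        for (i = 0; i < nr_pages; i++)
            update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
        goto release;
    }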

>
>>           goto release;
>>       }
>>   @@ -4216,16 +4303,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault
>> *vmf)
>>           return handle_userfault(vmf, VM_UFFD_MISSING);
>>       }
>>   -    inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>> -    folio_add_new_anon_rmap(folio, vma, vmf->address);
>> +    folio_ref_add(folio, nr_pages - 1);
>> +    add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>> +    folio_add_new_anon_rmap(folio, vma, addr);
>>       folio_add_lru_vma(folio, vma);
>>   setpte:
>>       if (uffd_wp)
>>           entry = pte_mkuffd_wp(entry);
>> -    set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>> +    set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
>>         /* No need to invalidate - it was non-present before */
>> -    update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>> +    update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
>>   unlock:
>>       if (vmf->pte)
>>           pte_unmap_unlock(vmf->pte, vmf->ptl);
>
> Benchmarking order-0 allocations might be interesting. There will be some added
> checks + multiple loops/conditionals for order-0 that could be avoided by having
> two separate code paths. If we can't measure a difference, all good.

Yep will do - will post numbers once I have them. I've been assuming that the
major cost is clearing the page, but perhaps I'm wrong.

>

2023-12-06 14:23:05

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v8 00/10] Multi-size THP for anonymous memory

On 06/12/2023 10:22, David Hildenbrand wrote:
> On 06.12.23 11:13, Ryan Roberts wrote:
>> On 05/12/2023 17:21, David Hildenbrand wrote:
>>> On 04.12.23 11:20, Ryan Roberts wrote:
>>>> Hi All,
>>>>
>>>> A new week, a new version, a new name... This is v8 of a series to implement
>>>> multi-size THP (mTHP) for anonymous memory (previously called "small-sized THP"
>>>> and "large anonymous folios"). Matthew objected to "small huge" so hopefully
>>>> this fares better.
>>>>
>>>> The objective of this is to improve performance by allocating larger chunks of
>>>> memory during anonymous page faults:
>>>>
>>>> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>>>>      pages, there are efficiency savings to be had; fewer page faults,
>>>> batched PTE
>>>>      and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
>>>>      overhead. This should benefit all architectures.
>>>> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>>>>      advantage of HW TLB compression techniques. A reduction in TLB pressure
>>>>      speeds up kernel and user space. arm64 systems have 2 mechanisms to
>>>> coalesce
>>>>      TLB entries; "the contiguous bit" (architectural) and HPA (uarch).
>>>>
>>>> This version changes the name and tidies up some of the kernel code and test
>>>> code, based on feedback against v7 (see change log for details).
>>>>
>>>> By default, the existing behaviour (and performance) is maintained. The user
>>>> must explicitly enable multi-size THP to see the performance benefit. This is
>>>> done via a new sysfs interface (as recommended by David Hildenbrand - thanks to
>>>> David for the suggestion)! This interface is inspired by the existing
>>>> per-hugepage-size sysfs interface used by hugetlb, provides full backwards
>>>> compatibility with the existing PMD-size THP interface, and provides a base for
>>>> future extensibility. See [8] for detailed discussion of the interface.
>>>>
>>>> This series is based on mm-unstable (715b67adf4c8).
>>>
>>> I took a look at the core pieces. Some things might want some smaller tweaks,
>>> but nothing that should stop this from having fun in mm-unstable, and replacing
>>> the smaller things as we move forward.
>>>
>>
>> Thanks! I'll address your comments and see if I can post another (final??)
>> version next week.
>
> It's always possible to do incremental changes on top that Andrew will squash in
> the end. I even recall that he prefers it that way once a series has been in
> mm-unstable for a bit, so one can better observe the diffs and what effect they
> have.
>

I've responded to all your comments. There are a bunch of changes that I agree
would be good to make (and some which I disagree with - would be good if you get
a chance to respond).

I think I can get all the changes done and tested by Friday. So perhaps it's
simplest to keep this out of mm-unstable until then, and put the new version in
on Friday? Then if there are any more small changes to do, I can do those as diffs?

Thanks,
Ryan


2023-12-06 15:45:44

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP

On 06/12/2023 14:19, Ryan Roberts wrote:
> On 05/12/2023 16:32, David Hildenbrand wrote:
>> On 04.12.23 11:20, Ryan Roberts wrote:
>>> Introduce the logic to allow THP to be configured (through the new sysfs
>>> interface we just added) to allocate large folios to back anonymous
>>> memory, which are larger than the base page size but smaller than
>>> PMD-size. We call this new THP extension "multi-size THP" (mTHP).
>>>
>>> mTHP continues to be PTE-mapped, but in many cases can still provide
>>> similar benefits to traditional PMD-sized THP: Page faults are
>>> significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
>>> the configured order), but latency spikes are much less prominent
>>> because the size of each page isn't as huge as the PMD-sized variant and
>>> there is less memory to clear in each page fault. The number of per-page
>>> operations (e.g. ref counting, rmap management, lru list management) are
>>> also significantly reduced since those ops now become per-folio.
>>>
>>> Some architectures also employ TLB compression mechanisms to squeeze
>>> more entries in when a set of PTEs are virtually and physically
>>> contiguous and appropriately aligned. In this case, TLB misses will
>>> occur less often.
>>>
>>> The new behaviour is disabled by default, but can be enabled at runtime
>>> by writing to /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled
>>> (see documentation in previous commit). The long term aim is to change
>>> the default to include suitable lower orders, but there are some risks
>>> around internal fragmentation that need to be better understood first.
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>
>> In general, looks good to me, some comments/nits. And the usual "let's make sure
>> we don't degrade order-0 and keep that as fast as possible" comment.
>>
>>> ---
>>>   include/linux/huge_mm.h |   6 ++-
>>>   mm/memory.c             | 106 ++++++++++++++++++++++++++++++++++++----
>>>   2 files changed, 101 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index bd0eadd3befb..91a53b9835a4 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
>>>   #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>>>     /*
>>> - * Mask of all large folio orders supported for anonymous THP.
>>> + * Mask of all large folio orders supported for anonymous THP; all orders up to
>>> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
>>> + * (which is a limitation of the THP implementation).
>>>    */
>>> -#define THP_ORDERS_ALL_ANON    BIT(PMD_ORDER)
>>> +#define THP_ORDERS_ALL_ANON    ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
>>>     /*
>>>    * Mask of all large folio orders supported for file THP.
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 3ceeb0f45bf5..bf7e93813018 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4125,6 +4125,84 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>       return ret;
>>>   }
>>>   +static bool pte_range_none(pte_t *pte, int nr_pages)
>>> +{
>>> +    int i;
>>> +
>>> +    for (i = 0; i < nr_pages; i++) {
>>> +        if (!pte_none(ptep_get_lockless(pte + i)))
>>> +            return false;
>>> +    }
>>> +
>>> +    return true;
>>> +}
>>> +
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>>> +{
>>> +    gfp_t gfp;
>>> +    pte_t *pte;
>>> +    unsigned long addr;
>>> +    struct folio *folio;
>>> +    struct vm_area_struct *vma = vmf->vma;
>>> +    unsigned long orders;
>>> +    int order;
>>
>> Nit: reverse christmas tree encouraged ;)
>
> ACK will fix.
>
>>
>>> +
>>> +    /*
>>> +     * If uffd is active for the vma we need per-page fault fidelity to
>>> +     * maintain the uffd semantics.
>>> +     */
>>> +    if (userfaultfd_armed(vma))
>>
>> Nit: unlikely()
>
> ACK will fix.
>
>>
>>> +        goto fallback;
>>> +
>>> +    /*
>>> +     * Get a list of all the (large) orders below PMD_ORDER that are enabled
>>> +     * for this vma. Then filter out the orders that can't be allocated over
>>> +     * the faulting address and still be fully contained in the vma.
>>> +     */
>>> +    orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
>>> +                      BIT(PMD_ORDER) - 1);
>>> +    orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>>
>> Comment: Both will eventually loop over all orders, correct? Could eventually be
>> sped up in the future.
>
> No, only thp_vma_suitable_orders() will loop. thp_vma_allowable_orders() only
> loops if in_pf=false (it's true here).
>
>>
>> Nit: the orders = ... order = ... looks like this might deserve a helper
>> function that makes this easier to read.
>
> To be honest, the existing function that I've modified is a bit of a mess.
> thp_vma_allowable_orders() calls thp_vma_suitable_orders() if we are not in a
> page fault, because the page fault handlers already do that check themselves. It
> would be nice to refactor the whole thing so that thp_vma_allowable_orders() is
> a strict superset of thp_vma_suitable_orders(). Then this can just call
> thp_vma_allowable_orders(). But that's going to start touching the PMD and PUD
> handlers, so I'd prefer to leave that for a separate patch set.
>
>>
>> Nit: Why call thp_vma_suitable_orders if the orders are already 0? Again, some
>> helper might be reasonable where that is handled internally.
>
> Because thp_vma_suitable_orders() will handle it safely and is inline, so it
> should be just as efficient? This would go away with the refactoring described above.
>
>>
>> Comment: For order-0 we'll always perform a function call to both
>> thp_vma_allowable_orders() / thp_vma_suitable_orders(). We should perform some
>> fast and efficient check of whether any <PMD THPs are even enabled in the system /
>> for this VMA, and if they aren't, just fall back before doing more expensive checks.
>

I just noticed I got these functions round the wrong way in my previous response:

> thp_vma_allowable_orders() is inline as you mentioned.

^ Meant thp_vma_suitable_orders() here.

>
> I was deliberately trying to keep all the decision logic in one place
> (thp_vma_suitable_orders) because it's already pretty complicated. But if you

^ Meant thp_vma_allowable_orders() here.

Sorry for the confusion.

> insist, how about this in the header:
>
> static inline
> unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>                                        unsigned long vm_flags, bool smaps,
>                                        bool in_pf, bool enforce_sysfs,
>                                        unsigned long orders)
> {
>     /* Optimization to check if required orders are enabled early. */
>     if (enforce_sysfs && vma_is_anonymous(vma)) {
>         unsigned long mask = READ_ONCE(huge_anon_orders_always);
>
>         if (vm_flags & VM_HUGEPAGE)
>             mask |= READ_ONCE(huge_anon_orders_madvise);
>         if (hugepage_global_always() ||
>             ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
>             mask |= READ_ONCE(huge_anon_orders_inherit);
>
>         orders &= mask;
>         if (!orders)
>             return 0;
>
>         enforce_sysfs = false;
>     }
>
>     return __thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf,
>                                       enforce_sysfs, orders);
> }
>
> Then the above check can be removed from __thp_vma_allowable_orders() - it will
> still retain the `if (enforce_sysfs && !vma_is_anonymous(vma))` part.
>
>
>>
>>> +
>>> +    if (!orders)
>>> +        goto fallback;
>>> +
>>> +    pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>> +    if (!pte)
>>> +        return ERR_PTR(-EAGAIN);
>>> +
>>> +    order = first_order(orders);
>>> +    while (orders) {
>>> +        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>> +        vmf->pte = pte + pte_index(addr);
>>> +        if (pte_range_none(vmf->pte, 1 << order))
>>> +            break;
>>
>> Comment: Likely it would make sense to scan only once and determine the "largest
>> none range" around that address, having the largest suitable order in mind.
>
> Yes, that's how I used to do it, but Yu Zhou requested simplifying to this,
> IIRC. Perhaps this is an optimization opportunity for later?
>
>>
>>> +        order = next_order(&orders, order);
>>> +    }
>>> +
>>> +    vmf->pte = NULL;
>>
>> Nit: Can you elaborate why you are messing with vmf->pte here? A simple helper
>> variable will make this code look less magical. Unless I am missing something
>> important :)
>
> Gahh, I used to pass the vmf to what pte_range_none() was refactored into (an
> approach that was suggested by Yu Zhou IIRC). But since I did some refactoring
> based on some comments from JohnH, I see I don't need that anymore. Agreed; it
> will be much clearer just to use a local variable. Will fix.
>
>>
>>> +    pte_unmap(pte);
>>> +
>>> +    gfp = vma_thp_gfp_mask(vma);
>>> +
>>> +    while (orders) {
>>> +        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>> +        folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>> +        if (folio) {
>>> +            clear_huge_page(&folio->page, addr, 1 << order);
>>> +            return folio;
>>> +        }
>>> +        order = next_order(&orders, order);
>>> +    }
>>> +
>>
>> Question: would it make sense to combine both loops? I suspect memory
>> allocations with pte_offset_map()/kmap are problematic.
>
> They are both operating on separate orders; next_order() is "consuming" an order
> by removing the current one from the orders bitfield and returning the next one.
>
> So the first loop starts at the highest order and keeps checking lower orders
> until one fully fits in the VMA. And the second loop starts at the first order
> that was found to fully fit and loops to lower orders until an allocation is
> successful.
>
> So I don't see a need to combine the loops.
>
>>
>>> +fallback:
>>> +    return vma_alloc_zeroed_movable_folio(vma, vmf->address);
>>> +}
>>> +#else
>>> +#define alloc_anon_folio(vmf) \
>>> +        vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
>>> +#endif
>>> +
>>>   /*
>>>    * We enter with non-exclusive mmap_lock (to exclude vma changes,
>>>    * but allow concurrent faults), and pte mapped but not yet locked.
>>> @@ -4132,6 +4210,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>    */
>>>   static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>>   {
>>> +    int i;
>>> +    int nr_pages = 1;
>>> +    unsigned long addr = vmf->address;
>>>       bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
>>>       struct vm_area_struct *vma = vmf->vma;
>>>       struct folio *folio;
>>
>> Nit: reverse christmas tree :)
>
> ACK
>
>>
>>> @@ -4176,10 +4257,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>>       /* Allocate our own private page. */
>>>       if (unlikely(anon_vma_prepare(vma)))
>>>           goto oom;
>>> -    folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>>> +    folio = alloc_anon_folio(vmf);
>>> +    if (IS_ERR(folio))
>>> +        return 0;
>>>       if (!folio)
>>>           goto oom;
>>>   +    nr_pages = folio_nr_pages(folio);
>>> +    addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>>> +
>>>       if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>>>           goto oom_free_page;
>>>       folio_throttle_swaprate(folio, GFP_KERNEL);
>>> @@ -4196,12 +4282,13 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>>       if (vma->vm_flags & VM_WRITE)
>>>           entry = pte_mkwrite(pte_mkdirty(entry), vma);
>>>   -    vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>>> -            &vmf->ptl);
>>> +    vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>>>       if (!vmf->pte)
>>>           goto release;
>>> -    if (vmf_pte_changed(vmf)) {
>>> -        update_mmu_tlb(vma, vmf->address, vmf->pte);
>>> +    if ((nr_pages == 1 && vmf_pte_changed(vmf)) ||
>>> +        (nr_pages  > 1 && !pte_range_none(vmf->pte, nr_pages))) {
>>> +        for (i = 0; i < nr_pages; i++)
>>> +            update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>>
>> Comment: separating the order-0 case from the other case might make this easier
>> to read.
>
> Yeah fair enough. Will fix.
>
>>
>>>           goto release;
>>>       }
>>>   @@ -4216,16 +4303,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault
>>> *vmf)
>>>           return handle_userfault(vmf, VM_UFFD_MISSING);
>>>       }
>>>   -    inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>> -    folio_add_new_anon_rmap(folio, vma, vmf->address);
>>> +    folio_ref_add(folio, nr_pages - 1);
>>> +    add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>> +    folio_add_new_anon_rmap(folio, vma, addr);
>>>       folio_add_lru_vma(folio, vma);
>>>   setpte:
>>>       if (uffd_wp)
>>>           entry = pte_mkuffd_wp(entry);
>>> -    set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>>> +    set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
>>>         /* No need to invalidate - it was non-present before */
>>> -    update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>>> +    update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
>>>   unlock:
>>>       if (vmf->pte)
>>>           pte_unmap_unlock(vmf->pte, vmf->ptl);
>>
>> Benchmarking order-0 allocations might be interesting. There will be some added
>> checks + multiple loops/conditionals for order-0 that could be avoided by having
>> two separate code paths. If we can't measure a difference, all good.
>
> Yep will do - will post numbers once I have them. I've been assuming that the
> major cost is clearing the page, but perhaps I'm wrong.
>
>>
>

2023-12-07 10:37:57

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP

On 06/12/2023 15:44, Ryan Roberts wrote:
> On 06/12/2023 14:19, Ryan Roberts wrote:
>> On 05/12/2023 16:32, David Hildenbrand wrote:
>>> On 04.12.23 11:20, Ryan Roberts wrote:
>>>> Introduce the logic to allow THP to be configured (through the new sysfs
>>>> interface we just added) to allocate large folios to back anonymous
>>>> memory, which are larger than the base page size but smaller than
>>>> PMD-size. We call this new THP extension "multi-size THP" (mTHP).
>>>>
>>>> mTHP continues to be PTE-mapped, but in many cases can still provide
>>>> similar benefits to traditional PMD-sized THP: Page faults are
>>>> significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
>>>> the configured order), but latency spikes are much less prominent
>>>> because the size of each page isn't as huge as the PMD-sized variant and
>>>> there is less memory to clear in each page fault. The number of per-page
>>>> operations (e.g. ref counting, rmap management, lru list management) are
>>>> also significantly reduced since those ops now become per-folio.
>>>>
>>>> Some architectures also employ TLB compression mechanisms to squeeze
>>>> more entries in when a set of PTEs are virtually and physically
>>>> contiguous and appropriately aligned. In this case, TLB misses will
>>>> occur less often.
>>>>
>>>> The new behaviour is disabled by default, but can be enabled at runtime
>>>> by writing to /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled
>>>> (see documentation in previous commit). The long term aim is to change
>>>> the default to include suitable lower orders, but there are some risks
>>>> around internal fragmentation that need to be better understood first.
>>>>
>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>
>>> In general, looks good to me, some comments/nits. And the usual "let's make sure
>>> we don't degrade order-0 and keep that as fast as possible" comment.
>>>
>>>> ---
>>>>   include/linux/huge_mm.h |   6 ++-
>>>>   mm/memory.c             | 106 ++++++++++++++++++++++++++++++++++++----
>>>>   2 files changed, 101 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index bd0eadd3befb..91a53b9835a4 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
>>>>   #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>>>>     /*
>>>> - * Mask of all large folio orders supported for anonymous THP.
>>>> + * Mask of all large folio orders supported for anonymous THP; all orders up to
>>>> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
>>>> + * (which is a limitation of the THP implementation).
>>>>    */
>>>> -#define THP_ORDERS_ALL_ANON    BIT(PMD_ORDER)
>>>> +#define THP_ORDERS_ALL_ANON    ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
>>>>     /*
>>>>    * Mask of all large folio orders supported for file THP.
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 3ceeb0f45bf5..bf7e93813018 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -4125,6 +4125,84 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>       return ret;
>>>>   }
>>>>   +static bool pte_range_none(pte_t *pte, int nr_pages)
>>>> +{
>>>> +    int i;
>>>> +
>>>> +    for (i = 0; i < nr_pages; i++) {
>>>> +        if (!pte_none(ptep_get_lockless(pte + i)))
>>>> +            return false;
>>>> +    }
>>>> +
>>>> +    return true;
>>>> +}
>>>> +
>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>>>> +{
>>>> +    gfp_t gfp;
>>>> +    pte_t *pte;
>>>> +    unsigned long addr;
>>>> +    struct folio *folio;
>>>> +    struct vm_area_struct *vma = vmf->vma;
>>>> +    unsigned long orders;
>>>> +    int order;
>>>
>>> Nit: reverse christmas tree encouraged ;)
>>
>> ACK will fix.
>>
>>>
>>>> +
>>>> +    /*
>>>> +     * If uffd is active for the vma we need per-page fault fidelity to
>>>> +     * maintain the uffd semantics.
>>>> +     */
>>>> +    if (userfaultfd_armed(vma))
>>>
>>> Nit: unlikely()
>>
>> ACK will fix.
>>
>>>
>>>> +        goto fallback;
>>>> +
>>>> +    /*
>>>> +     * Get a list of all the (large) orders below PMD_ORDER that are enabled
>>>> +     * for this vma. Then filter out the orders that can't be allocated over
>>>> +     * the faulting address and still be fully contained in the vma.
>>>> +     */
>>>> +    orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
>>>> +                      BIT(PMD_ORDER) - 1);
>>>> +    orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>>>
>>> Comment: Both will eventually loop over all orders, correct? Could eventually be
>>> sped up in the future.
>>
>> No, only thp_vma_suitable_orders() will loop. thp_vma_allowable_orders() only
>> loops if in_pf=false (it's true here).
>>
>>>
>>> Nit: the orders = ... order = ... looks like this might deserve a helper
>>> function that makes this easier to read.
>>
>> To be honest, the existing function that I've modified is a bit of a mess.
>> thp_vma_allowable_orders() calls thp_vma_suitable_orders() if we are not in a
>> page fault, because the page fault handlers already do that check themselves. It
>> would be nice to refactor the whole thing so that thp_vma_allowable_orders() is
>> a strict superset of thp_vma_suitable_orders(). Then this can just call
>> thp_vma_allowable_orders(). But that's going to start touching the PMD and PUD
>> handlers, so I'd prefer to leave that for a separate patch set.
>>
>>>
>>> Nit: Why call thp_vma_suitable_orders if the orders are already 0? Again, some
>>> helper might be reasonable where that is handled internally.
>>
>> Because thp_vma_suitable_orders() will handle it safely and is inline, so it
>> should be just as efficient? This would go away with the refactoring described above.
>>
>>>
>>> Comment: For order-0 we'll always perform a function call to both
>>> thp_vma_allowable_orders() / thp_vma_suitable_orders(). We should perform some
>>> fast and efficient check of whether any <PMD THPs are even enabled in the system /
>>> for this VMA, and if they aren't, just fall back before doing more expensive checks.
>>
>
> I just noticed I got these functions round the wrong way in my previous response:
>
>> thp_vma_allowable_orders() is inline as you mentioned.
>
> ^ Meant thp_vma_suitable_orders() here.
>
>>
>> I was deliberately trying to keep all the decision logic in one place
>> (thp_vma_suitable_orders) because it's already pretty complicated. But if you
>
> ^ Meant thp_vma_allowable_orders() here.
>
> Sorry for the confusion.
>
>> insist, how about this in the header:
>>
>> static inline
>> unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>>                                        unsigned long vm_flags, bool smaps,
>>                                        bool in_pf, bool enforce_sysfs,
>>                                        unsigned long orders)
>> {
>>     /* Optimization to check if required orders are enabled early. */
>>     if (enforce_sysfs && vma_is_anonymous(vma)) {
>>         unsigned long mask = READ_ONCE(huge_anon_orders_always);
>>
>>         if (vm_flags & VM_HUGEPAGE)
>>             mask |= READ_ONCE(huge_anon_orders_madvise);
>>         if (hugepage_global_always() ||
>>             ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
>>             mask |= READ_ONCE(huge_anon_orders_inherit);
>>
>>         orders &= mask;
>>         if (!orders)
>>             return 0;
>>
>>         enforce_sysfs = false;
>>     }
>>
>>     return __thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf,
>>                                       enforce_sysfs, orders);
>> }
>>
>> Then the above check can be removed from __thp_vma_allowable_orders() - it will
>> still retain the `if (enforce_sysfs && !vma_is_anonymous(vma))` part.
>>
>>
>>>
>>>> +
>>>> +    if (!orders)
>>>> +        goto fallback;
>>>> +
>>>> +    pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>>> +    if (!pte)
>>>> +        return ERR_PTR(-EAGAIN);
>>>> +
>>>> +    order = first_order(orders);
>>>> +    while (orders) {
>>>> +        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>>> +        vmf->pte = pte + pte_index(addr);
>>>> +        if (pte_range_none(vmf->pte, 1 << order))
>>>> +            break;
>>>
>>> Comment: Likely it would make sense to scan only once and determine the "largest
>>> none range" around that address, having the largest suitable order in mind.
>>
>> Yes, that's how I used to do it, but Yu Zhou requested simplifying to this,
>> IIRC. Perhaps this is an optimization opportunity for later?
>>
>>>
>>>> +        order = next_order(&orders, order);
>>>> +    }
>>>> +
>>>> +    vmf->pte = NULL;
>>>
>>> Nit: Can you elaborate why you are messing with vmf->pte here? A simple helper
>>> variable will make this code look less magical. Unless I am missing something
>>> important :)
>>
>> Gahh, I used to pass the vmf to what pte_range_none() was refactored into (an
>> approach that was suggested by Yu Zhou IIRC). But since I did some refactoring
>> based on some comments from JohnH, I see I don't need that anymore. Agreed; it
>> will be much clearer just to use a local variable. Will fix.
>>
>>>
>>>> +    pte_unmap(pte);
>>>> +
>>>> +    gfp = vma_thp_gfp_mask(vma);
>>>> +
>>>> +    while (orders) {
>>>> +        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>>> +        folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>>> +        if (folio) {
>>>> +            clear_huge_page(&folio->page, addr, 1 << order);
>>>> +            return folio;
>>>> +        }
>>>> +        order = next_order(&orders, order);
>>>> +    }
>>>> +
>>>
>>> Question: would it make sense to combine both loops? I suspect memory
>>> allocations with pte_offset_map()/kmap are problematic.
>>
>> They are both operating on separate orders; next_order() is "consuming" an order
>> by removing the current one from the orders bitfield and returning the next one.
>>
>> So the first loop starts at the highest order and keeps checking lower orders
>> until one fully fits in the VMA. And the second loop starts at the first order
>> that was found to fully fit and loops to lower orders until an allocation is
>> successful.
>>
>> So I don't see a need to combine the loops.
>>
>>>
>>>> +fallback:
>>>> +    return vma_alloc_zeroed_movable_folio(vma, vmf->address);
>>>> +}
>>>> +#else
>>>> +#define alloc_anon_folio(vmf) \
>>>> +        vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
>>>> +#endif
>>>> +
>>>>   /*
>>>>    * We enter with non-exclusive mmap_lock (to exclude vma changes,
>>>>    * but allow concurrent faults), and pte mapped but not yet locked.
>>>> @@ -4132,6 +4210,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>    */
>>>>   static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>>>   {
>>>> +    int i;
>>>> +    int nr_pages = 1;
>>>> +    unsigned long addr = vmf->address;
>>>>       bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
>>>>       struct vm_area_struct *vma = vmf->vma;
>>>>       struct folio *folio;
>>>
>>> Nit: reverse christmas tree :)
>>
>> ACK
>>
>>>
>>>> @@ -4176,10 +4257,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>>>       /* Allocate our own private page. */
>>>>       if (unlikely(anon_vma_prepare(vma)))
>>>>           goto oom;
>>>> -    folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>>>> +    folio = alloc_anon_folio(vmf);
>>>> +    if (IS_ERR(folio))
>>>> +        return 0;
>>>>       if (!folio)
>>>>           goto oom;
>>>>   +    nr_pages = folio_nr_pages(folio);
>>>> +    addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>>>> +
>>>>       if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>>>>           goto oom_free_page;
>>>>       folio_throttle_swaprate(folio, GFP_KERNEL);
>>>> @@ -4196,12 +4282,13 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>>>       if (vma->vm_flags & VM_WRITE)
>>>>           entry = pte_mkwrite(pte_mkdirty(entry), vma);
>>>>   -    vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>>>> -            &vmf->ptl);
>>>> +    vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>>>>       if (!vmf->pte)
>>>>           goto release;
>>>> -    if (vmf_pte_changed(vmf)) {
>>>> -        update_mmu_tlb(vma, vmf->address, vmf->pte);
>>>> +    if ((nr_pages == 1 && vmf_pte_changed(vmf)) ||
>>>> +        (nr_pages  > 1 && !pte_range_none(vmf->pte, nr_pages))) {
>>>> +        for (i = 0; i < nr_pages; i++)
>>>> +            update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>>>
>>> Comment: separating the order-0 case from the other case might make this easier
>>> to read.
>>
>> Yeah fair enough. Will fix.
>>
>>>
>>>>           goto release;
>>>>       }
>>>>   @@ -4216,16 +4303,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault
>>>> *vmf)
>>>>           return handle_userfault(vmf, VM_UFFD_MISSING);
>>>>       }
>>>>   -    inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>>> -    folio_add_new_anon_rmap(folio, vma, vmf->address);
>>>> +    folio_ref_add(folio, nr_pages - 1);
>>>> +    add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>>> +    folio_add_new_anon_rmap(folio, vma, addr);
>>>>       folio_add_lru_vma(folio, vma);
>>>>   setpte:
>>>>       if (uffd_wp)
>>>>           entry = pte_mkuffd_wp(entry);
>>>> -    set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>>>> +    set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
>>>>         /* No need to invalidate - it was non-present before */
>>>> -    update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>>>> +    update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
>>>>   unlock:
>>>>       if (vmf->pte)
>>>>           pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>
>>> Benchmarking order-0 allocations might be interesting. There will be some added
>>> checks + multiple loops/conditionals for order-0 that could be avoided by having
>>> two separate code paths. If we can't measure a difference, all good.
>>
>> Yep will do - will post numbers once I have them. I've been assuming that the
>> major cost is clearing the page, but perhaps I'm wrong.
>>

I added a "write-fault-byte" benchmark to the microbenchmark tool you gave me.
This elides the normal memset page population routine, and instead writes the
first byte of every page while the timer is running.
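
For context, a rough userspace equivalent of that measurement loop (the numbers
below come from David's microbenchmark tool, not from this sketch):

#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    size_t size = 1UL << 30;               /* 1 GiB of anonymous memory */
    long page = sysconf(_SC_PAGESIZE);
    struct timespec start, end;
    size_t off;
    char *buf;

    buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (off = 0; off < size; off += page)
        buf[off] = 1;                      /* touch the first byte of every page */
    clock_gettime(CLOCK_MONOTONIC, &end);

    printf("faulted %zu pages in %.3f s\n", size / (size_t)page,
           (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9);

    munmap(buf, size);
    return 0;
}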

I ran with 100 iterations per run, then ran the whole thing 16 times. I ran it
for a baseline kernel, as well as v8 (this series) and v9 (with changes from
your review). I repeated on Ampere Altra (bare metal) and Apple M2 (VM):

|              |        m2 vm        |        altra        |
|--------------|---------------------|--------------------:|
| kernel       | mean     | std_rel  | mean     | std_rel  |
|--------------|----------|----------|----------|---------:|
| baseline     |   0.000% |   0.341% |   0.000% |   3.581% |
| anonfolio-v8 |   0.005% |   0.272% |   5.068% |   1.128% |
| anonfolio-v9 |  -0.013% |   0.442% |   0.107% |   1.788% |

No measurable difference on M2, but Altra has a slowdown in v8 which is fixed
in v9. Looking at the changes, this is either down to the new unlikely() for the
uffd check or to moving the THP order check inline within
thp_vma_allowable_orders().

So I have all the changes done and perf numbers to show no regression for
order-0. I'm gonna do a final check and post v9 later today.

Thanks,
Ryan

2023-12-07 10:41:25

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP

On 07.12.23 11:37, Ryan Roberts wrote:
> On 06/12/2023 15:44, Ryan Roberts wrote:
>> On 06/12/2023 14:19, Ryan Roberts wrote:
>>> On 05/12/2023 16:32, David Hildenbrand wrote:
>>>> On 04.12.23 11:20, Ryan Roberts wrote:
>>>>> Introduce the logic to allow THP to be configured (through the new sysfs
>>>>> interface we just added) to allocate large folios to back anonymous
>>>>> memory, which are larger than the base page size but smaller than
>>>>> PMD-size. We call this new THP extension "multi-size THP" (mTHP).
>>>>>
>>>>> mTHP continues to be PTE-mapped, but in many cases can still provide
>>>>> similar benefits to traditional PMD-sized THP: Page faults are
>>>>> significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on
>>>>> the configured order), but latency spikes are much less prominent
>>>>> because the size of each page isn't as huge as the PMD-sized variant and
>>>>> there is less memory to clear in each page fault. The number of per-page
>>>>> operations (e.g. ref counting, rmap management, lru list management) are
>>>>> also significantly reduced since those ops now become per-folio.
>>>>>
>>>>> Some architectures also employ TLB compression mechanisms to squeeze
>>>>> more entries in when a set of PTEs are virtually and physically
>>>>> contiguous and appropriately aligned. In this case, TLB misses will
>>>>> occur less often.
>>>>>
>>>>> The new behaviour is disabled by default, but can be enabled at runtime
>>>>> by writing to /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled
>>>>> (see documentation in previous commit). The long term aim is to change
>>>>> the default to include suitable lower orders, but there are some risks
>>>>> around internal fragmentation that need to be better understood first.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>
>>>> In general, looks good to me, some comments/nits. And the usual "let's make sure
>>>> we don't degrade order-0 and keep that as fast as possible" comment.
>>>>
>>>>> ---
>>>>>   include/linux/huge_mm.h |   6 ++-
>>>>>   mm/memory.c             | 106 ++++++++++++++++++++++++++++++++++++----
>>>>>   2 files changed, 101 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>> index bd0eadd3befb..91a53b9835a4 100644
>>>>> --- a/include/linux/huge_mm.h
>>>>> +++ b/include/linux/huge_mm.h
>>>>> @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr;
>>>>>   #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>>>>>     /*
>>>>> - * Mask of all large folio orders supported for anonymous THP.
>>>>> + * Mask of all large folio orders supported for anonymous THP; all orders up to
>>>>> + * and including PMD_ORDER, except order-0 (which is not "huge") and order-1
>>>>> + * (which is a limitation of the THP implementation).
>>>>>    */
>>>>> -#define THP_ORDERS_ALL_ANON    BIT(PMD_ORDER)
>>>>> +#define THP_ORDERS_ALL_ANON    ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))
>>>>>     /*
>>>>>    * Mask of all large folio orders supported for file THP.
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 3ceeb0f45bf5..bf7e93813018 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -4125,6 +4125,84 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>       return ret;
>>>>>   }
>>>>>   +static bool pte_range_none(pte_t *pte, int nr_pages)
>>>>> +{
>>>>> +    int i;
>>>>> +
>>>>> +    for (i = 0; i < nr_pages; i++) {
>>>>> +        if (!pte_none(ptep_get_lockless(pte + i)))
>>>>> +            return false;
>>>>> +    }
>>>>> +
>>>>> +    return true;
>>>>> +}
>>>>> +
>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>> +static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>>>>> +{
>>>>> +    gfp_t gfp;
>>>>> +    pte_t *pte;
>>>>> +    unsigned long addr;
>>>>> +    struct folio *folio;
>>>>> +    struct vm_area_struct *vma = vmf->vma;
>>>>> +    unsigned long orders;
>>>>> +    int order;
>>>>
>>>> Nit: reverse christmas tree encouraged ;)
>>>
>>> ACK will fix.
>>>
>>>>
>>>>> +
>>>>> +    /*
>>>>> +     * If uffd is active for the vma we need per-page fault fidelity to
>>>>> +     * maintain the uffd semantics.
>>>>> +     */
>>>>> +    if (userfaultfd_armed(vma))
>>>>
>>>> Nit: unlikely()
>>>
>>> ACK will fix.
>>>
>>>>
>>>>> +        goto fallback;
>>>>> +
>>>>> +    /*
>>>>> +     * Get a list of all the (large) orders below PMD_ORDER that are enabled
>>>>> +     * for this vma. Then filter out the orders that can't be allocated over
>>>>> +     * the faulting address and still be fully contained in the vma.
>>>>> +     */
>>>>> +    orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
>>>>> +                      BIT(PMD_ORDER) - 1);
>>>>> +    orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>>>>
>>>> Comment: Both will eventually loop over all orders, correct? Could eventually be
>>>> sped up in the future.
>>>
>>> No, only thp_vma_suitable_orders() will loop. thp_vma_allowable_orders() only
>>> loops if in_pf=false (it's true here).
>>>
>>>>
>>>> Nit: the orders = ... order = ... looks like this might deserve a helper
>>>> function that makes this easier to read.
>>>
>>> To be honest, the existing function that I've modified is a bit of a mess.
>>> thp_vma_allowable_orders() calls thp_vma_suitable_orders() if we are not in a
>>> page fault, because the page fault handlers already do that check themselves. It
>>> would be nice to refactor the whole thing so that thp_vma_allowable_orders() is
>>> a strict superset of thp_vma_suitable_orders(). Then this can just call
>>> thp_vma_allowable_orders(). But that's going to start touching the PMD and PUD
>>> handlers, so I'd prefer to leave that for a separate patch set.
>>>
>>>>
>>>> Nit: Why call thp_vma_suitable_orders if the orders are already 0? Again, some
>>>> helper might be reasonable where that is handled internally.
>>>
>>> Because thp_vma_suitable_orders() will handle it safely and is inline, so it
>>> should be just as efficient? This would go away with the refactoring described above.
>>>
>>>>
>>>> Comment: For order-0 we'll always perform a function call to both
>>>> thp_vma_allowable_orders() / thp_vma_suitable_orders(). We should perform some
>>>> fast and efficient check of whether any <PMD THPs are even enabled in the system /
>>>> for this VMA, and if they aren't, just fall back before doing more expensive checks.
>>>
>>
>> I just noticed I got these functions round the wrong way in my previous response:
>>
>>> thp_vma_allowable_orders() is inline as you mentioned.
>>
>> ^ Meant thp_vma_suitable_orders() here.
>>
>>>
>>> I was deliberately trying to keep all the decision logic in one place
>>> (thp_vma_suitable_orders) because it's already pretty complicated. But if you
>>
>> ^ Meant thp_vma_allowable_orders() here.
>>
>> Sorry for the confusion.
>>
>>> insist, how about this in the header:
>>>
>>> static inline
>>> unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>>>                                        unsigned long vm_flags, bool smaps,
>>>                                        bool in_pf, bool enforce_sysfs,
>>>                                        unsigned long orders)
>>> {
>>>     /* Optimization to check if required orders are enabled early. */
>>>     if (enforce_sysfs && vma_is_anonymous(vma)) {
>>>         unsigned long mask = READ_ONCE(huge_anon_orders_always);
>>>
>>>         if (vm_flags & VM_HUGEPAGE)
>>>             mask |= READ_ONCE(huge_anon_orders_madvise);
>>>         if (hugepage_global_always() ||
>>>             ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
>>>             mask |= READ_ONCE(huge_anon_orders_inherit);
>>>
>>>         orders &= mask;
>>>         if (!orders)
>>>             return 0;
>>>
>>>         enforce_sysfs = false;
>>>     }
>>>
>>>     return __thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf,
>>>                                       enforce_sysfs, orders);
>>> }
>>>
>>> Then the above check can be removed from __thp_vma_allowable_orders() - it will
>>> still retain the `if (enforce_sysfs && !vma_is_anonymous(vma))` part.
>>>
>>>
>>>>
>>>>> +
>>>>> +    if (!orders)
>>>>> +        goto fallback;
>>>>> +
>>>>> +    pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>>>> +    if (!pte)
>>>>> +        return ERR_PTR(-EAGAIN);
>>>>> +
>>>>> +    order = first_order(orders);
>>>>> +    while (orders) {
>>>>> +        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>>>> +        vmf->pte = pte + pte_index(addr);
>>>>> +        if (pte_range_none(vmf->pte, 1 << order))
>>>>> +            break;
>>>>
>>>> Comment: Likely it would make sense to scan only once and determine the "largest
>>>> none range" around that address, having the largest suitable order in mind.
>>>
>>> Yes, that's how I used to do it, but Yu Zhou requested simplifying to this,
>>> IIRC. Perhaps this is an optimization opportunity for later?
>>>
>>>>
>>>>> +        order = next_order(&orders, order);
>>>>> +    }
>>>>> +
>>>>> +    vmf->pte = NULL;
>>>>
>>>> Nit: Can you elaborate why you are messing with vmf->pte here? A simple helper
>>>> variable will make this code look less magical. Unless I am missing something
>>>> important :)
>>>
>>> Gahh, I used to pass the vmf to what pte_range_none() was refactored into (an
>>> approach that was suggested by Yu Zhou IIRC). But since I did some refactoring
>>> based on some comments from JohnH, I see I don't need that anymore. Agreed; it
>>> will be much clearer just to use a local variable. Will fix.
>>>
>>>>
>>>>> +    pte_unmap(pte);
>>>>> +
>>>>> +    gfp = vma_thp_gfp_mask(vma);
>>>>> +
>>>>> +    while (orders) {
>>>>> +        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>>>> +        folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>>>> +        if (folio) {
>>>>> +            clear_huge_page(&folio->page, addr, 1 << order);
>>>>> +            return folio;
>>>>> +        }
>>>>> +        order = next_order(&orders, order);
>>>>> +    }
>>>>> +
>>>>
>>>> Question: would it make sense to combine both loops? I suspect memory
>>>> allocations with pte_offset_map()/kmap are problematic.
>>>
>>> They are both operating on separate orders; next_order() is "consuming" an order
>>> by removing the current one from the orders bitfield and returning the next one.
>>>
>>> So the first loop starts at the highest order and keeps checking lower orders
>>> until one fully fits in the VMA. And the second loop starts at the first order
>>> that was found to fully fit and loops to lower orders until an allocation is
>>> successful.
>>>
>>> So I don't see a need to combine the loops.
>>>
>>>>
>>>>> +fallback:
>>>>> +    return vma_alloc_zeroed_movable_folio(vma, vmf->address);
>>>>> +}
>>>>> +#else
>>>>> +#define alloc_anon_folio(vmf) \
>>>>> +        vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address)
>>>>> +#endif
>>>>> +
>>>>>   /*
>>>>>    * We enter with non-exclusive mmap_lock (to exclude vma changes,
>>>>>    * but allow concurrent faults), and pte mapped but not yet locked.
>>>>> @@ -4132,6 +4210,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>    */
>>>>>   static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>>>>   {
>>>>> +    int i;
>>>>> +    int nr_pages = 1;
>>>>> +    unsigned long addr = vmf->address;
>>>>>       bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
>>>>>       struct vm_area_struct *vma = vmf->vma;
>>>>>       struct folio *folio;
>>>>
>>>> Nit: reverse christmas tree :)
>>>
>>> ACK
>>>
>>>>
>>>>> @@ -4176,10 +4257,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>>>>       /* Allocate our own private page. */
>>>>>       if (unlikely(anon_vma_prepare(vma)))
>>>>>           goto oom;
>>>>> -    folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
>>>>> +    folio = alloc_anon_folio(vmf);
>>>>> +    if (IS_ERR(folio))
>>>>> +        return 0;
>>>>>       if (!folio)
>>>>>           goto oom;
>>>>>   +    nr_pages = folio_nr_pages(folio);
>>>>> +    addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>>>>> +
>>>>>       if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>>>>>           goto oom_free_page;
>>>>>       folio_throttle_swaprate(folio, GFP_KERNEL);
>>>>> @@ -4196,12 +4282,13 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>>>>>       if (vma->vm_flags & VM_WRITE)
>>>>>           entry = pte_mkwrite(pte_mkdirty(entry), vma);
>>>>>   -    vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>>>>> -            &vmf->ptl);
>>>>> +    vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>>>>>       if (!vmf->pte)
>>>>>           goto release;
>>>>> -    if (vmf_pte_changed(vmf)) {
>>>>> -        update_mmu_tlb(vma, vmf->address, vmf->pte);
>>>>> +    if ((nr_pages == 1 && vmf_pte_changed(vmf)) ||
>>>>> +        (nr_pages  > 1 && !pte_range_none(vmf->pte, nr_pages))) {
>>>>> +        for (i = 0; i < nr_pages; i++)
>>>>> +            update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i);
>>>>
>>>> Comment: separating the order-0 case from the other case might make this easier
>>>> to read.
>>>
>>> Yeah fair enough. Will fix.
>>>
>>>>
>>>>>           goto release;
>>>>>       }
>>>>>   @@ -4216,16 +4303,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault
>>>>> *vmf)
>>>>>           return handle_userfault(vmf, VM_UFFD_MISSING);
>>>>>       }
>>>>>   -    inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>>>> -    folio_add_new_anon_rmap(folio, vma, vmf->address);
>>>>> +    folio_ref_add(folio, nr_pages - 1);
>>>>> +    add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>>>> +    folio_add_new_anon_rmap(folio, vma, addr);
>>>>>       folio_add_lru_vma(folio, vma);
>>>>>   setpte:
>>>>>       if (uffd_wp)
>>>>>           entry = pte_mkuffd_wp(entry);
>>>>> -    set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>>>>> +    set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages);
>>>>>         /* No need to invalidate - it was non-present before */
>>>>> -    update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>>>>> +    update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages);
>>>>>   unlock:
>>>>>       if (vmf->pte)
>>>>>           pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>>
>>>> Benchmarking order-0 allocations might be interesting. There will be some added
>>>> checks + multiple loops/conditionals for order-0 that could be avoided by having
>>>> two separate code paths. If we can't measure a difference, all good.
>>>
>>> Yep will do - will post numbers once I have them. I've been assuming that the
>>> major cost is clearing the page, but perhaps I'm wrong.
>>>
>
> I added a "write-fault-byte" benchmark to the microbenchmark tool you gave me.
> This elides the normal memset page population routine, and instead writes the
> first byte of every page while the timer is running.
>
> I ran with 100 iterations per run, then ran the whole thing 16 times. I ran it
> for a baseline kernel, as well as v8 (this series) and v9 (with changes from
> your review). I repeated on Ampere Altra (bare metal) and Apple M2 (VM):
>
> |              |        m2 vm        |        altra        |
> |--------------|---------------------|--------------------:|
> | kernel       | mean     | std_rel  | mean     | std_rel  |
> |--------------|----------|----------|----------|---------:|
> | baseline     |   0.000% |   0.341% |   0.000% |   3.581% |
> | anonfolio-v8 |   0.005% |   0.272% |   5.068% |   1.128% |
> | anonfolio-v9 |  -0.013% |   0.442% |   0.107% |   1.788% |
>
> No measurable difference on M2, but Altra has a slowdown in v8 which is fixed
> in v9. Looking at the changes, this is either down to the new unlikely() for the
> uffd check or to moving the THP order check inline within
> thp_vma_allowable_orders().

I suspect the last one.

>
> So I have all the changes done and perf numbers to show no regression for
> order-0. I'm gonna do a final check and post v9 later today.

Good!

Let me catch up on your comments real quick.

--
Cheers,

David / dhildenb
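
For anyone wanting to reproduce the "write-fault-byte" measurement described
above, a minimal userspace sketch is below. This is not the actual
microbenchmark tool; it just assumes an anonymous mmap() buffer, writes the
first byte of every page while the clock runs, and prints the elapsed time:

#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1UL << 30;                 /* 1 GiB anonymous buffer */
    long page = sysconf(_SC_PAGESIZE);
    struct timespec t0, t1;

    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* One write fault per page: touch the first byte of every page. */
    for (size_t off = 0; off < len; off += page)
        buf[off] = 1;

    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("faulted %zu pages in %.6f s\n", len / page,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);

    munmap(buf, len);
    return 0;
}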

2023-12-07 10:58:04

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs interface

On 06/12/2023 13:18, Ryan Roberts wrote:
> On 05/12/2023 16:57, David Hildenbrand wrote:
>> On 04.12.23 11:20, Ryan Roberts wrote:
>>> In preparation for adding support for anonymous multi-size THP,
>>> introduce new sysfs structure that will be used to control the new
>>> behaviours. A new directory is added under transparent_hugepage for each
>>> supported THP size, and contains an `enabled` file, which can be set to
>>> "inherit" (to inherit the global setting), "always", "madvise" or
>>> "never". For now, the kernel still only supports PMD-sized anonymous
>>> THP, so only 1 directory is populated.
>>>
>>> The first half of the change converts transhuge_vma_suitable() and
>>> hugepage_vma_check() so that they take a bitfield of orders for which
>>> the user wants to determine support, and the functions filter out all
>>> the orders that can't be supported, given the current sysfs
>>> configuration and the VMA dimensions. If there is only 1 order set in
>>> the input then the output can continue to be treated like a boolean;
>>> this is the case for most call sites. The resulting functions are
>>> renamed to thp_vma_suitable_orders() and thp_vma_allowable_orders()
>>> respectively.
>>>
>>> The second half of the change implements the new sysfs interface. It has
>>> been done so that each supported THP size has a `struct thpsize`, which
>>> describes the relevant metadata and is itself a kobject. This is pretty
>>> minimal for now, but should make it easy to add new per-thpsize files to
>>> the interface if needed in future (e.g. per-size defrag). Rather than
>>> keep the `enabled` state directly in the struct thpsize, I've elected to
>>> directly encode it into huge_anon_orders_[always|madvise|inherit]
>>> bitfields since this reduces the amount of work required in
>>> thp_vma_allowable_orders() which is called for every page fault.
>>>
>>> See Documentation/admin-guide/mm/transhuge.rst, as modified by this
>>> commit, for details of how the new sysfs interface works.
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> ---
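
For reference, from user space the per-size control then looks just like the
existing global 'enabled' file, but under a per-size directory (named
hugepages-<size>kB, as described in the transhuge.rst change). A small
illustrative C helper, assuming a 4K base page system where 64K is one of the
supported sizes:

#include <stdio.h>

/* Write "always", "inherit", "madvise" or "never" to one per-size file. */
static int set_mthp_policy(const char *size_dir, const char *policy)
{
    char path[256];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/kernel/mm/transparent_hugepage/%s/enabled", size_dir);

    f = fopen(path, "w");
    if (!f)
        return -1;

    fprintf(f, "%s\n", policy);
    return fclose(f);
}

int main(void)
{
    /* e.g. make 64K mTHP follow the "always" policy. */
    return set_mthp_policy("hugepages-64kB", "always") ? 1 : 0;
}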
>>
>> Some comments mostly regarding thp_vma_allowable_orders and friends. In general,
>> LGTM. I'll have to go over the order logic once again, I got a bit lost once we
>> started mixing anon and file orders.
>>
>> [...]
>>
>> Doc updates all looked good to me, skimming over them.
>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index fa0350b0812a..bd0eadd3befb 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>
>> [...]
>>
>>> +static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
>>> +        unsigned long addr, unsigned long orders)
>>> +{
>>> +    int order;
>>> +
>>> +    /*
>>> +     * Iterate over orders, highest to lowest, removing orders that don't
>>> +     * meet alignment requirements from the set. Exit loop at first order
>>> +     * that meets requirements, since all lower orders must also meet
>>> +     * requirements.
>>> +     */
>>> +
>>> +    order = first_order(orders);
>>
>> nit: "highest_order" or "largest_order" would be more expressive regarding the
>> actual semantics.
>
> Yep, will call it "highest_order".
>
>>
>>> +
>>> +    while (orders) {
>>> +        unsigned long hpage_size = PAGE_SIZE << order;
>>> +        unsigned long haddr = ALIGN_DOWN(addr, hpage_size);
>>> +
>>> +        if (haddr >= vma->vm_start &&
>>> +            haddr + hpage_size <= vma->vm_end) {
>>> +            if (!vma_is_anonymous(vma)) {
>>> +                if (IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) -
>>> +                        vma->vm_pgoff,
>>> +                        hpage_size >> PAGE_SHIFT))
>>> +                    break;
>>> +            } else
>>> +                break;
>>
>> Comment: Codying style wants you to use if () {} else {}
>>
>> But I'd recommend for the conditions:
>>
>> if (haddr < vma->vm_start ||
>>     haddr + hpage_size > vma->vm_end)
>>     continue;
>> /* Don't have to check pgoff for anonymous vma */
>> if (!vma_is_anonymous(vma))
>>     break;
>> if (IS_ALIGNED((...
>>     break;
>
> OK I'll take this structure.

FYI I ended up NOT taking this, because I now have thp_vma_suitable_order() and
thp_vma_suitable_orders(). The former is essentially what was there before
(except I pass the order and derive the nr_pages, alignment, etc. from that
rather than using the HPAGE_* macros). The latter is just a loop that calls the
former for each order in the bitfield.
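
Roughly this shape (sketch only, not the exact v9 code):

static inline bool thp_vma_suitable_order(struct vm_area_struct *vma,
                                          unsigned long addr, int order)
{
    unsigned long hpage_size = PAGE_SIZE << order;
    unsigned long haddr = ALIGN_DOWN(addr, hpage_size);

    if (haddr < vma->vm_start || haddr + hpage_size > vma->vm_end)
        return false;

    /* Anonymous mappings have no pgoff alignment requirement. */
    if (vma_is_anonymous(vma))
        return true;

    return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
                      hpage_size >> PAGE_SHIFT);
}

static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
                                                    unsigned long addr,
                                                    unsigned long orders)
{
    int order = highest_order(orders);

    /* Highest to lowest; stop at the first order that fits. */
    while (orders) {
        if (thp_vma_suitable_order(vma, addr, order))
            break;
        order = next_order(&orders, order);
    }

    return orders;
}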

>
>>
>> [...]
>>
>>
>>> +/**
>>> + * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
>>> + * @vma:  the vm area to check
>>> + * @vm_flags: use these vm_flags instead of vma->vm_flags
>>> + * @smaps: whether answer will be used for smaps file
>>> + * @in_pf: whether answer will be used by page fault handler
>>> + * @enforce_sysfs: whether sysfs config should be taken into account
>>> + * @orders: bitfield of all orders to consider
>>> + *
>>> + * Calculates the intersection of the requested hugepage orders and the allowed
>>> + * hugepage orders for the provided vma. Permitted orders are encoded as a set
>>> + * bit at the corresponding bit position (bit-2 corresponds to order-2, bit-3
>>> + * corresponds to order-3, etc). Order-0 is never considered a hugepage order.
>>> + *
>>> + * Return: bitfield of orders allowed for hugepage in the vma. 0 if no hugepage
>>> + * orders are allowed.
>>> + */
>>> +unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>>> +                       unsigned long vm_flags, bool smaps,
>>> +                       bool in_pf, bool enforce_sysfs,
>>> +                       unsigned long orders)
>>> +{
>>> +    /* Check the intersection of requested and supported orders. */
>>> +    orders &= vma_is_anonymous(vma) ?
>>> +            THP_ORDERS_ALL_ANON : THP_ORDERS_ALL_FILE;
>>> +    if (!orders)
>>> +        return 0;
>>
>> Comment: if this is called from some hot path, we might want to move as much as
>> possible into a header, so we can avoid this function call here when e.g., THP
>> are completely disabled etc.
>
> If THP is completely disabled (compiled out) then thp_vma_allowable_orders() is
> defined as a header inline that returns 0. I'm not sure there are any paths in
> practice which are likely to ask for a set of orders which are never supported
> (i.e. where this specific check would return 0). And the "are they run time
> enabled" check is further down and fairly involved, so not sure that's ideal for
> an inline.
>
> I haven't changed the pattern from how it was previously, so I don't think it
> should be any more expensive. Which parts exactly do you want to move to a header?

As per my response against the next patch (#4), I have now implemented
thp_vma_allowable_orders() so that the order check is in the header with an
early exit if THP is completely disabled.
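
(For reference, THP_ORDERS_ALL_ANON used in the hunk above is just the set of
orders from 2 up to PMD_ORDER; something like the following, give or take the
exact expression:)

/* Anon THP orders: everything from order-2 up to and including PMD_ORDER. */
#define THP_ORDERS_ALL_ANON    ((BIT(PMD_ORDER + 1) - 1) & ~(BIT(0) | BIT(1)))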

>
>
>>
>>> +
>>>       if (!vma->vm_mm)        /* vdso */
>>> -        return false;
>>> +        return 0;
>>>         /*
>>>        * Explicitly disabled through madvise or prctl, or some
>>> @@ -88,16 +141,16 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
>>> unsigned long vm_flags,
>>>        * */
>>>       if ((vm_flags & VM_NOHUGEPAGE) ||
>>>           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>>> -        return false;
>>> +        return 0;
>>>       /*
>>>        * If the hardware/firmware marked hugepage support disabled.
>>>        */
>>>       if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>>> -        return false;
>>> +        return 0;
>>>         /* khugepaged doesn't collapse DAX vma, but page fault is fine. */
>>>       if (vma_is_dax(vma))
>>> -        return in_pf;
>>> +        return in_pf ? orders : 0;
>>>         /*
>>>        * khugepaged special VMA and hugetlb VMA.
>>> @@ -105,17 +158,29 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
>>> unsigned long vm_flags,
>>>        * VM_MIXEDMAP set.
>>>        */
>>>       if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
>>> -        return false;
>>> +        return 0;
>>>         /*
>>> -     * Check alignment for file vma and size for both file and anon vma.
>>> +     * Check alignment for file vma and size for both file and anon vma by
>>> +     * filtering out the unsuitable orders.
>>>        *
>>>        * Skip the check for page fault. Huge fault does the check in fault
>>> -     * handlers. And this check is not suitable for huge PUD fault.
>>> +     * handlers.
>>>        */
>>> -    if (!in_pf &&
>>> -        !transhuge_vma_suitable(vma, (vma->vm_end - HPAGE_PMD_SIZE)))
>>> -        return false;
>>> +    if (!in_pf) {
>>> +        int order = first_order(orders);
>>> +        unsigned long addr;
>>> +
>>> +        while (orders) {
>>> +            addr = vma->vm_end - (PAGE_SIZE << order);
>>> +            if (thp_vma_suitable_orders(vma, addr, BIT(order)))
>>> +                break;
>>
>> Comment: you'd want a "thp_vma_suitable_order" helper here. But maybe the
>> compiler is smart enough to optimize the loop and everything else out.
>
> I'm happy to refactor so that thp_vma_suitable_order() is the basic primitive,
> then make thp_vma_suitable_orders() a loop that calls thp_vma_suitable_order()
> (that's basically how it is laid out already, just all in one function). Is that
> what you are requesting?
>
>>
>> [...]
>>
>>> +
>>> +static ssize_t thpsize_enabled_store(struct kobject *kobj,
>>> +                     struct kobj_attribute *attr,
>>> +                     const char *buf, size_t count)
>>> +{
>>> +    int order = to_thpsize(kobj)->order;
>>> +    ssize_t ret = count;
>>> +
>>> +    if (sysfs_streq(buf, "always")) {
>>> +        set_bit(order, &huge_anon_orders_always);
>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>> +    } else if (sysfs_streq(buf, "inherit")) {
>>> +        set_bit(order, &huge_anon_orders_inherit);
>>> +        clear_bit(order, &huge_anon_orders_always);
>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>> +    } else if (sysfs_streq(buf, "madvise")) {
>>> +        set_bit(order, &huge_anon_orders_madvise);
>>> +        clear_bit(order, &huge_anon_orders_always);
>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>> +    } else if (sysfs_streq(buf, "never")) {
>>> +        clear_bit(order, &huge_anon_orders_always);
>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>
>> Note: I was wondering for a second if some concurrent calls could lead to an
>> inconsistent state. I think in the worst case we'll simply end up with "never"
>> on races.
>
> You mean if different threads try to write different values to this file
> concurrently? Or if there is a concurrent fault that tries to read the flags
> while they are being modified?
>
> I thought about this for a long time too and wasn't sure what was best. The
> existing global enabled store impl clears the bits first then sets the bit. With
> this approach you can end up with multiple bits set if there is a race to set
> different values, and you can end up with a faulting thread seeing never if it
> reads the bits after they have been cleared but before setting them.
>
> I decided to set the new bit before clearing the old bits, which is different; A
> racing fault will never see "never" but as you say, a race to set the file could
> result in "never" being set.
>
> On reflection, it's probably best to set the bit *last* like the global control
> does?
>
>>
>> [...]
>>
>>>   static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
>>>   {
>>>       int err;
>>> +    struct thpsize *thpsize;
>>> +    unsigned long orders;
>>> +    int order;
>>> +
>>> +    /*
>>> +     * Default to setting PMD-sized THP to inherit the global setting and
>>> +     * disable all other sizes. powerpc's PMD_ORDER isn't a compile-time
>>> +     * constant so we have to do this here.
>>> +     */
>>> +    huge_anon_orders_inherit = BIT(PMD_ORDER);
>>>         *hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
>>>       if (unlikely(!*hugepage_kobj)) {
>>> @@ -434,8 +631,24 @@ static int __init hugepage_init_sysfs(struct kobject
>>> **hugepage_kobj)
>>>           goto remove_hp_group;
>>>       }
>>>   +    orders = THP_ORDERS_ALL_ANON;
>>> +    order = first_order(orders);
>>> +    while (orders) {
>>> +        thpsize = thpsize_create(order, *hugepage_kobj);
>>> +        if (IS_ERR(thpsize)) {
>>> +            pr_err("failed to create thpsize for order %d\n", order);
>>> +            err = PTR_ERR(thpsize);
>>> +            goto remove_all;
>>> +        }
>>> +        list_add(&thpsize->node, &thpsize_list);
>>> +        order = next_order(&orders, order);
>>> +    }
>>> +
>>>       return 0;
>>>  
>>
>> [...]
>>
>>>       page = compound_head(page);
>>> @@ -5116,7 +5116,8 @@ static vm_fault_t __handle_mm_fault(struct
>>> vm_area_struct *vma,
>>>           return VM_FAULT_OOM;
>>>   retry_pud:
>>>       if (pud_none(*vmf.pud) &&
>>> -        hugepage_vma_check(vma, vm_flags, false, true, true)) {
>>> +        thp_vma_allowable_orders(vma, vm_flags, false, true, true,
>>> +                     BIT(PUD_ORDER))) {
>>>           ret = create_huge_pud(&vmf);
>>>           if (!(ret & VM_FAULT_FALLBACK))
>>>               return ret;
>>> @@ -5150,7 +5151,8 @@ static vm_fault_t __handle_mm_fault(struct
>>> vm_area_struct *vma,
>>>           goto retry_pud;
>>>         if (pmd_none(*vmf.pmd) &&
>>> -        hugepage_vma_check(vma, vm_flags, false, true, true)) {
>>> +        thp_vma_allowable_orders(vma, vm_flags, false, true, true,
>>> +                     BIT(PMD_ORDER))) {
>>
>> Comment: A helper like "thp_vma_allowable_order(vma, PMD_ORDER)" might make this
>> easier to read -- and the implemenmtation will be faster.
>
> I'm happy to do this and use it to improve readability:
>
> #define thp_vma_allowable_order(..., order) \
> thp_vma_allowable_orders(..., BIT(order))
>
> This wouldn't make the implementation any faster though; Are you suggesting a
> completely separate impl? Even then, I don't think there is much scope to make
> it faster for the case where there is only 1 order in the bitfield.
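
i.e. something like this (shape illustrative; it just wraps the multi-order
variant with the current parameter list):

#define thp_vma_allowable_order(vma, vm_flags, smaps, in_pf, enforce_sysfs, order) \
    (!!thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf, enforce_sysfs, \
                                BIT(order)))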
>
>>
>>>           ret = create_huge_pmd(&vmf);
>>>           if (!(ret & VM_FAULT_FALLBACK))
>>>               return ret;
>>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>>> index e0b368e545ed..64da127cc267 100644
>>> --- a/mm/page_vma_mapped.c
>>> +++ b/mm/page_vma_mapped.c
>>> @@ -268,7 +268,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>>                * cleared *pmd but not decremented compound_mapcount().
>>>                */
>>>               if ((pvmw->flags & PVMW_SYNC) &&
>>> -                transhuge_vma_suitable(vma, pvmw->address) &&
>>> +                thp_vma_suitable_orders(vma, pvmw->address,
>>> +                            BIT(PMD_ORDER)) &&
>>
>> Comment: Similarly, a helper like "thp_vma_suitable_order(vma, PMD_ORDER)" might
>> make this easier to read.
>
> Yep, will do this.
>
>>
>>>                   (pvmw->nr_pages >= HPAGE_PMD_NR)) {
>>>                   spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
>>>  
>>
>

2023-12-07 11:08:32

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP

[...]

>>
>> Nit: the orders = ... order = ... looks like this might deserve a helper
>> function that makes this easier to read.
>
> To be honest, the existing function that I've modified is a bit of a mess.

It's all an ugly mess and I hate it.

It would be cleanest if we'd just have "thp_vma_configured_orders()"
that gives us all configured orders for the given VMA+flags combination.
No passing in of orders, try handling the masking in the caller.

Then, we move that nasty "transhuge_vma_suitable" handling for !in_pf
out of there and handle that in the callers. The comment "Huge fault
does the check in fault handlers. And this check is not suitable for
huge PUD fault handlers." already makes me angry, what a mess.


Then, we'd have a thp_vma_fitting_orders() / thp_vma_is_fitting_order()
function that does the filtering only based on the given address + vma
size/alignment. That's roughly "thp_vma_suitable_orders()".


Finding a good name to combine both could be something like
"thp_vma_possible_orders()".


Would make more sense to me (but again, German guy, so it's probably all
wrong).


> thp_vma_allowable_orders() calls thp_vma_suitable_orders() if we are not in a
> page fault, because the page fault handlers already do that check themselves. It
> would be nice to refactor the whole thing so that thp_vma_allowable_orders() is
> a strict superset of thp_vma_suitable_orders(). Then this can just call
> thp_vma_allowable_orders(). But that's going to start touching the PMD and PUD
> handlers, so prefer if we leave that for a separate patch set.
>
>>
>> Nit: Why call thp_vma_suitable_orders if the orders are already 0? Again, some
>> helper might be reasonable where that is handled internally.
>
> Because thp_vma_suitable_orders() will handle it safely and is inline, so it
> should be just as efficient? This would go away with the refactoring described above.

Right. Won't win in a beauty contest. Some simple helper might make this
much easier to digest.

>
>>
>> Comment: For order-0 we'll always perform a function call to both
>> thp_vma_allowable_orders() / thp_vma_suitable_orders(). We should perform some
>> fast and efficient check if any <PMD THP are even enabled in the system / for
>> this VMA, and in that case just fallback before doing more expensive checks.
>
> thp_vma_allowable_orders() is inline as you mentioned.
>
> I was deliberately trying to keep all the decision logic in one place
> (thp_vma_suitable_orders) because it's already pretty complicated. But if you
> insist, how about this in the header:
>
> static inline
> unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>                        unsigned long vm_flags, bool smaps,
>                        bool in_pf, bool enforce_sysfs,
>                        unsigned long orders)
> {
>     /* Optimization to check if required orders are enabled early. */
>     if (enforce_sysfs && vma_is_anonymous(vma)) {
>         unsigned long mask = READ_ONCE(huge_anon_orders_always);
>
>         if (vm_flags & VM_HUGEPAGE)
>             mask |= READ_ONCE(huge_anon_orders_madvise);
>         if (hugepage_global_always() ||
>             ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
>             mask |= READ_ONCE(huge_anon_orders_inherit);
>
>         orders &= mask;
>         if (!orders)
>             return 0;
>
>         enforce_sysfs = false;
>     }
>
>     return __thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf,
>                       enforce_sysfs, orders);
> }
>
> Then the above check can be removed from __thp_vma_allowable_orders() - it will
> still retain the `if (enforce_sysfs && !vma_is_anonymous(vma))` part.
>

Better. I still kind-of hate having to pass in orders here. Such masking
is better done in the caller (see above how it might be done when moving
the transhuge_vma_suitable() check out).

>
>>
>>> +
>>> +    if (!orders)
>>> +        goto fallback;
>>> +
>>> +    pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>> +    if (!pte)
>>> +        return ERR_PTR(-EAGAIN);
>>> +
>>> +    order = first_order(orders);
>>> +    while (orders) {
>>> +        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>> +        vmf->pte = pte + pte_index(addr);
>>> +        if (pte_range_none(vmf->pte, 1 << order))
>>> +            break;
>>
>> Comment: Likely it would make sense to scan only once and determine the "largest
>> none range" around that address, having the largest suitable order in mind.
>
> Yes, that's how I used to do it, but Yu Zhou requested simplifying to this,
> IIRC. Perhaps this an optimization opportunity for later?

Yes, definitely.

>
>>
>>> +        order = next_order(&orders, order);
>>> +    }
>>> +
>>> +    vmf->pte = NULL;
>>
>> Nit: Can you elaborate why you are messing with vmf->pte here? A simple helper
>> variable will make this code look less magical. Unless I am missing something
>> important :)
>
> Gahh, I used to pass the vmf to what pte_range_none() was refactored into (an
> approach that was suggested by Yu Zhou IIRC). But since I did some refactoring
> based on some comments from JohnH, I see I don't need that anymore. Agreed; it
> will be much clearer just to use a local variable. Will fix.
>
>>
>>> +    pte_unmap(pte);
>>> +
>>> +    gfp = vma_thp_gfp_mask(vma);
>>> +
>>> +    while (orders) {
>>> +        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>> +        folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>> +        if (folio) {
>>> +            clear_huge_page(&folio->page, addr, 1 << order);
>>> +            return folio;
>>> +        }
>>> +        order = next_order(&orders, order);
>>> +    }
>>> +
>>
>> Question: would it make sense to combine both loops? I suspect memory
>> allocations with pte_offset_map()/kmap are problematic.
>
> They are both operating on separate orders; next_order() is "consuming" an order
> by removing the current one from the orders bitfield and returning the next one.
>
> So the first loop starts at the highest order and keeps checking lower orders
> until one fully fits in the VMA. And the second loop starts at the first order
> that was found to fully fit and loops to lower orders until an allocation is
> successful.

Right, but you know from the first loop which order is applicable (and
will be fed to the second loop) and could just pte_unmap(pte) +
tryalloc. If that fails, remap and try with the next orders.

That would make the code certainly easier to understand. That "orders"
magic of constructing, filtering, walking is confusing :)


I might find some time today to see if there is an easy way to cleanup
all what I spelled out above. It really is a mess. But likely that
cleanup could be deferred (but you're touching it, so ... :) ).

--
Cheers,

David / dhildenb

2023-12-07 11:13:23

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs interface

>>
>>> +
>>>       if (!vma->vm_mm)        /* vdso */
>>> -        return false;
>>> +        return 0;
>>>         /*
>>>        * Explicitly disabled through madvise or prctl, or some
>>> @@ -88,16 +141,16 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
>>> unsigned long vm_flags,
>>>        * */
>>>       if ((vm_flags & VM_NOHUGEPAGE) ||
>>>           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>>> -        return false;
>>> +        return 0;
>>>       /*
>>>        * If the hardware/firmware marked hugepage support disabled.
>>>        */
>>>       if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>>> -        return false;
>>> +        return 0;
>>>         /* khugepaged doesn't collapse DAX vma, but page fault is fine. */
>>>       if (vma_is_dax(vma))
>>> -        return in_pf;
>>> +        return in_pf ? orders : 0;
>>>         /*
>>>        * khugepaged special VMA and hugetlb VMA.
>>> @@ -105,17 +158,29 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
>>> unsigned long vm_flags,
>>>        * VM_MIXEDMAP set.
>>>        */
>>>       if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
>>> -        return false;
>>> +        return 0;
>>>         /*
>>> -     * Check alignment for file vma and size for both file and anon vma.
>>> +     * Check alignment for file vma and size for both file and anon vma by
>>> +     * filtering out the unsuitable orders.
>>>        *
>>>        * Skip the check for page fault. Huge fault does the check in fault
>>> -     * handlers. And this check is not suitable for huge PUD fault.
>>> +     * handlers.
>>>        */
>>> -    if (!in_pf &&
>>> -        !transhuge_vma_suitable(vma, (vma->vm_end - HPAGE_PMD_SIZE)))
>>> -        return false;
>>> +    if (!in_pf) {
>>> +        int order = first_order(orders);
>>> +        unsigned long addr;
>>> +
>>> +        while (orders) {
>>> +            addr = vma->vm_end - (PAGE_SIZE << order);
>>> +            if (thp_vma_suitable_orders(vma, addr, BIT(order)))
>>> +                break;
>>
>> Comment: you'd want a "thp_vma_suitable_order" helper here. But maybe the
>> compiler is smart enough to optimize the loop and everything else out.
>
> I'm happy to refactor so that thp_vma_suitable_order() is the basic primitive,
> then make thp_vma_suitable_orders() a loop that calls thp_vma_suitable_order()
> (that's basically how it is laid out already, just all in one function). Is that
> what you are requesting?

You got the spirit, yes.

>>
>> [...]
>>
>>> +
>>> +static ssize_t thpsize_enabled_store(struct kobject *kobj,
>>> +                     struct kobj_attribute *attr,
>>> +                     const char *buf, size_t count)
>>> +{
>>> +    int order = to_thpsize(kobj)->order;
>>> +    ssize_t ret = count;
>>> +
>>> +    if (sysfs_streq(buf, "always")) {
>>> +        set_bit(order, &huge_anon_orders_always);
>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>> +    } else if (sysfs_streq(buf, "inherit")) {
>>> +        set_bit(order, &huge_anon_orders_inherit);
>>> +        clear_bit(order, &huge_anon_orders_always);
>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>> +    } else if (sysfs_streq(buf, "madvise")) {
>>> +        set_bit(order, &huge_anon_orders_madvise);
>>> +        clear_bit(order, &huge_anon_orders_always);
>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>> +    } else if (sysfs_streq(buf, "never")) {
>>> +        clear_bit(order, &huge_anon_orders_always);
>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>
>> Note: I was wondering for a second if some concurrent calls could lead to an
>> inconsistent state. I think in the worst case we'll simply end up with "never"
>> on races.
>
> You mean if different threads try to write different values to this file
> concurrently? Or if there is a concurrent fault that tries to read the flags
> while they are being modified?

I thought about what you said first, but what you said last might also
apply. As long as "nothing breaks", all good.

>
> I thought about this for a long time too and wasn't sure what was best. The
> existing global enabled store impl clears the bits first then sets the bit. With
> this approach you can end up with multiple bits set if there is a race to set
> different values, and you can end up with a faulting thread seeing never if it
> reads the bits after they have been cleared but before setting them.

> Right, but user space is playing stupid games and can win stupid prizes.
As long as nothing breaks, we're good.

>
> I decided to set the new bit before clearing the old bits, which is different; A
> racing fault will never see "never" but as you say, a race to set the file could
> result in "never" being set.
>
> On reflection, it's probably best to set the bit *last* like the global control
> does?

We could probably just slap a simple spinlock in there, so at least the
writer side is completely serialized. Then you can just set the bit
last. It's unlikely that readers will actually run into issues, and if
they ever would, we could use some rcu magic to let them read a
consistent state.

--
Cheers,

David / dhildenb

2023-12-07 11:23:14

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs interface

On 07/12/2023 11:13, David Hildenbrand wrote:
>>>
>>>> +
>>>>        if (!vma->vm_mm)        /* vdso */
>>>> -        return false;
>>>> +        return 0;
>>>>          /*
>>>>         * Explicitly disabled through madvise or prctl, or some
>>>> @@ -88,16 +141,16 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
>>>> unsigned long vm_flags,
>>>>         * */
>>>>        if ((vm_flags & VM_NOHUGEPAGE) ||
>>>>            test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>>>> -        return false;
>>>> +        return 0;
>>>>        /*
>>>>         * If the hardware/firmware marked hugepage support disabled.
>>>>         */
>>>>        if (transparent_hugepage_flags & (1 <<
>>>> TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>>>> -        return false;
>>>> +        return 0;
>>>>          /* khugepaged doesn't collapse DAX vma, but page fault is fine. */
>>>>        if (vma_is_dax(vma))
>>>> -        return in_pf;
>>>> +        return in_pf ? orders : 0;
>>>>          /*
>>>>         * khugepaged special VMA and hugetlb VMA.
>>>> @@ -105,17 +158,29 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
>>>> unsigned long vm_flags,
>>>>         * VM_MIXEDMAP set.
>>>>         */
>>>>        if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
>>>> -        return false;
>>>> +        return 0;
>>>>          /*
>>>> -     * Check alignment for file vma and size for both file and anon vma.
>>>> +     * Check alignment for file vma and size for both file and anon vma by
>>>> +     * filtering out the unsuitable orders.
>>>>         *
>>>>         * Skip the check for page fault. Huge fault does the check in fault
>>>> -     * handlers. And this check is not suitable for huge PUD fault.
>>>> +     * handlers.
>>>>         */
>>>> -    if (!in_pf &&
>>>> -        !transhuge_vma_suitable(vma, (vma->vm_end - HPAGE_PMD_SIZE)))
>>>> -        return false;
>>>> +    if (!in_pf) {
>>>> +        int order = first_order(orders);
>>>> +        unsigned long addr;
>>>> +
>>>> +        while (orders) {
>>>> +            addr = vma->vm_end - (PAGE_SIZE << order);
>>>> +            if (thp_vma_suitable_orders(vma, addr, BIT(order)))
>>>> +                break;
>>>
>>> Comment: you'd want a "thp_vma_suitable_order" helper here. But maybe the
>>> compiler is smart enough to optimize the loop and everything else out.
>>
>> I'm happy to refactor so that thp_vma_suitable_order() is the basic primitive,
>> then make thp_vma_suitable_orders() a loop that calls thp_vma_suitable_order()
>> (that's basically how it is laid out already, just all in one function). Is that
>> what you are requesting?
>
> You got the spirit, yes.
>
>>>
>>> [...]
>>>
>>>> +
>>>> +static ssize_t thpsize_enabled_store(struct kobject *kobj,
>>>> +                     struct kobj_attribute *attr,
>>>> +                     const char *buf, size_t count)
>>>> +{
>>>> +    int order = to_thpsize(kobj)->order;
>>>> +    ssize_t ret = count;
>>>> +
>>>> +    if (sysfs_streq(buf, "always")) {
>>>> +        set_bit(order, &huge_anon_orders_always);
>>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>>> +    } else if (sysfs_streq(buf, "inherit")) {
>>>> +        set_bit(order, &huge_anon_orders_inherit);
>>>> +        clear_bit(order, &huge_anon_orders_always);
>>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>>> +    } else if (sysfs_streq(buf, "madvise")) {
>>>> +        set_bit(order, &huge_anon_orders_madvise);
>>>> +        clear_bit(order, &huge_anon_orders_always);
>>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>>> +    } else if (sysfs_streq(buf, "never")) {
>>>> +        clear_bit(order, &huge_anon_orders_always);
>>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>>
>>> Note: I was wondering for a second if some concurrent calls could lead to an
>>> inconsistent state. I think in the worst case we'll simply end up with "never"
>>> on races.
>>
>> You mean if different threads try to write different values to this file
>> concurrently? Or if there is a concurrent fault that tries to read the flags
>> while they are being modified?
>
> I thought about what you said first, but what you said last might also apply. As
> long as "nothing breaks", all good.
>
>>
>> I thought about this for a long time too and wasn't sure what was best. The
>> existing global enabled store impl clears the bits first then sets the bit. With
>> this approach you can end up with multiple bits set if there is a race to set
>> different values, and you can end up with a faulting thread seeing never if it
>> reads the bits after they have been cleared but before setting them.
>
> Right, but user space is playing stupid games and can win stupid prizes. As long
> as nothing breaks, we're good.
>
>>
>> I decided to set the new bit before clearing the old bits, which is different; A
>> racing fault will never see "never" but as you say, a race to set the file could
>> result in "never" being set.
>>
>> On reflection, it's probably best to set the bit *last* like the global control
>> does?
>
> We could probably just slap a simple spinlock in there, so at least the writer side
> is completely serialized. Then you can just set the bit last. It's unlikely that
> readers will actually run into issues, and if they ever would, we could use some
> rcu magic to let them read a consistent state.

I'd prefer to leave it as it is now; clear first, set last without any explicit
serialization. I've convinced myself that nothing breaks and it's the same
pattern used by the global control so it's consistent. Unless you're insisting on
the spin lock?



2023-12-07 11:25:48

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs interface

On 07.12.23 12:22, Ryan Roberts wrote:
> On 07/12/2023 11:13, David Hildenbrand wrote:
>>>>
>>>>> +
>>>>>        if (!vma->vm_mm)        /* vdso */
>>>>> -        return false;
>>>>> +        return 0;
>>>>>          /*
>>>>>         * Explicitly disabled through madvise or prctl, or some
>>>>> @@ -88,16 +141,16 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
>>>>> unsigned long vm_flags,
>>>>>         * */
>>>>>        if ((vm_flags & VM_NOHUGEPAGE) ||
>>>>>            test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>>>>> -        return false;
>>>>> +        return 0;
>>>>>        /*
>>>>>         * If the hardware/firmware marked hugepage support disabled.
>>>>>         */
>>>>>        if (transparent_hugepage_flags & (1 <<
>>>>> TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>>>>> -        return false;
>>>>> +        return 0;
>>>>>          /* khugepaged doesn't collapse DAX vma, but page fault is fine. */
>>>>>        if (vma_is_dax(vma))
>>>>> -        return in_pf;
>>>>> +        return in_pf ? orders : 0;
>>>>>          /*
>>>>>         * khugepaged special VMA and hugetlb VMA.
>>>>> @@ -105,17 +158,29 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
>>>>> unsigned long vm_flags,
>>>>>         * VM_MIXEDMAP set.
>>>>>         */
>>>>>        if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
>>>>> -        return false;
>>>>> +        return 0;
>>>>>          /*
>>>>> -     * Check alignment for file vma and size for both file and anon vma.
>>>>> +     * Check alignment for file vma and size for both file and anon vma by
>>>>> +     * filtering out the unsuitable orders.
>>>>>         *
>>>>>         * Skip the check for page fault. Huge fault does the check in fault
>>>>> -     * handlers. And this check is not suitable for huge PUD fault.
>>>>> +     * handlers.
>>>>>         */
>>>>> -    if (!in_pf &&
>>>>> -        !transhuge_vma_suitable(vma, (vma->vm_end - HPAGE_PMD_SIZE)))
>>>>> -        return false;
>>>>> +    if (!in_pf) {
>>>>> +        int order = first_order(orders);
>>>>> +        unsigned long addr;
>>>>> +
>>>>> +        while (orders) {
>>>>> +            addr = vma->vm_end - (PAGE_SIZE << order);
>>>>> +            if (thp_vma_suitable_orders(vma, addr, BIT(order)))
>>>>> +                break;
>>>>
>>>> Comment: you'd want a "thp_vma_suitable_order" helper here. But maybe the
>>>> compiler is smart enough to optimize the loop and everything else out.
>>>
>>> I'm happy to refactor so that thp_vma_suitable_order() is the basic primitive,
>>> then make thp_vma_suitable_orders() a loop that calls thp_vma_suitable_order()
>>> (that's basically how it is laid out already, just all in one function). Is that
>>> what you are requesting?
>>
>> You got the spirit, yes.
>>
>>>>
>>>> [...]
>>>>
>>>>> +
>>>>> +static ssize_t thpsize_enabled_store(struct kobject *kobj,
>>>>> +                     struct kobj_attribute *attr,
>>>>> +                     const char *buf, size_t count)
>>>>> +{
>>>>> +    int order = to_thpsize(kobj)->order;
>>>>> +    ssize_t ret = count;
>>>>> +
>>>>> +    if (sysfs_streq(buf, "always")) {
>>>>> +        set_bit(order, &huge_anon_orders_always);
>>>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>>>> +    } else if (sysfs_streq(buf, "inherit")) {
>>>>> +        set_bit(order, &huge_anon_orders_inherit);
>>>>> +        clear_bit(order, &huge_anon_orders_always);
>>>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>>>> +    } else if (sysfs_streq(buf, "madvise")) {
>>>>> +        set_bit(order, &huge_anon_orders_madvise);
>>>>> +        clear_bit(order, &huge_anon_orders_always);
>>>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>>>> +    } else if (sysfs_streq(buf, "never")) {
>>>>> +        clear_bit(order, &huge_anon_orders_always);
>>>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>>>
>>>> Note: I was wondering for a second if some concurrent calls could lead to an
>>>> inconsistent state. I think in the worst case we'll simply end up with "never"
>>>> on races.
>>>
>>> You mean if different threads try to write different values to this file
>>> concurrently? Or if there is a concurrent fault that tries to read the flags
>>> while they are being modified?
>>
>> I thought about what you said first, but what you said last might also apply. As
>> long as "nothing breaks", all good.
>>
>>>
>>> I thought about this for a long time too and wasn't sure what was best. The
>>> existing global enabled store impl clears the bits first then sets the bit. With
>>> this approach you can end up with multiple bits set if there is a race to set
>>> different values, and you can end up with a faulting thread seeing never if it
>>> reads the bits after they have been cleared but before setting them.
>>
>> Right, but user space is playing stupid games and can win stupid prizes. As long
>> as nothing breaks, we're good.
>>
>>>
>>> I decided to set the new bit before clearing the old bits, which is different; A
>>> racing fault will never see "never" but as you say, a race to set the file could
>>> result in "never" being set.
>>>
>>> On reflection, it's probably best to set the bit *last* like the global control
>>> does?
>>
>> We could probably just slap a simple spinlock in there, so at least the writer side
>> is completely serialized. Then you can just set the bit last. It's unlikely that
>> readers will actually run into issues, and if they ever would, we could use some
>> rcu magic to let them read a consistent state.
>
> I'd prefer to leave it as it is now; clear first, set last without any explicit
> serialization. I've convinced myself that nothing breaks and it's the same
> pattern used by the global control so it's consistent. Unless you're insisting on
> the spin lock?

No, not at all. But it would certainly remove any possible concerns :)

--
Cheers,

David / dhildenb

2023-12-07 11:44:37

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v8 03/10] mm: thp: Introduce multi-size THP sysfs interface

On 07/12/2023 11:25, David Hildenbrand wrote:
> On 07.12.23 12:22, Ryan Roberts wrote:
>> On 07/12/2023 11:13, David Hildenbrand wrote:
>>>>>
>>>>>> +
>>>>>>         if (!vma->vm_mm)        /* vdso */
>>>>>> -        return false;
>>>>>> +        return 0;
>>>>>>           /*
>>>>>>          * Explicitly disabled through madvise or prctl, or some
>>>>>> @@ -88,16 +141,16 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
>>>>>> unsigned long vm_flags,
>>>>>>          * */
>>>>>>         if ((vm_flags & VM_NOHUGEPAGE) ||
>>>>>>             test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>>>>>> -        return false;
>>>>>> +        return 0;
>>>>>>         /*
>>>>>>          * If the hardware/firmware marked hugepage support disabled.
>>>>>>          */
>>>>>>         if (transparent_hugepage_flags & (1 <<
>>>>>> TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>>>>>> -        return false;
>>>>>> +        return 0;
>>>>>>           /* khugepaged doesn't collapse DAX vma, but page fault is fine. */
>>>>>>         if (vma_is_dax(vma))
>>>>>> -        return in_pf;
>>>>>> +        return in_pf ? orders : 0;
>>>>>>           /*
>>>>>>          * khugepaged special VMA and hugetlb VMA.
>>>>>> @@ -105,17 +158,29 @@ bool hugepage_vma_check(struct vm_area_struct *vma,
>>>>>> unsigned long vm_flags,
>>>>>>          * VM_MIXEDMAP set.
>>>>>>          */
>>>>>>         if (!in_pf && !smaps && (vm_flags & VM_NO_KHUGEPAGED))
>>>>>> -        return false;
>>>>>> +        return 0;
>>>>>>           /*
>>>>>> -     * Check alignment for file vma and size for both file and anon vma.
>>>>>> +     * Check alignment for file vma and size for both file and anon vma by
>>>>>> +     * filtering out the unsuitable orders.
>>>>>>          *
>>>>>>          * Skip the check for page fault. Huge fault does the check in fault
>>>>>> -     * handlers. And this check is not suitable for huge PUD fault.
>>>>>> +     * handlers.
>>>>>>          */
>>>>>> -    if (!in_pf &&
>>>>>> -        !transhuge_vma_suitable(vma, (vma->vm_end - HPAGE_PMD_SIZE)))
>>>>>> -        return false;
>>>>>> +    if (!in_pf) {
>>>>>> +        int order = first_order(orders);
>>>>>> +        unsigned long addr;
>>>>>> +
>>>>>> +        while (orders) {
>>>>>> +            addr = vma->vm_end - (PAGE_SIZE << order);
>>>>>> +            if (thp_vma_suitable_orders(vma, addr, BIT(order)))
>>>>>> +                break;
>>>>>
>>>>> Comment: you'd want a "thp_vma_suitable_order" helper here. But maybe the
>>>>> compiler is smart enough to optimize the loop and everything else out.
>>>>
>>>> I'm happy to refactor so that thp_vma_suitable_order() is the basic primitive,
>>>> then make thp_vma_suitable_orders() a loop that calls thp_vma_suitable_order()
>>>> (that's basically how it is laid out already, just all in one function). Is
>>>> that
>>>> what you are requesting?
>>>
>>> You got the spirit, yes.
>>>
>>>>>
>>>>> [...]
>>>>>
>>>>>> +
>>>>>> +static ssize_t thpsize_enabled_store(struct kobject *kobj,
>>>>>> +                     struct kobj_attribute *attr,
>>>>>> +                     const char *buf, size_t count)
>>>>>> +{
>>>>>> +    int order = to_thpsize(kobj)->order;
>>>>>> +    ssize_t ret = count;
>>>>>> +
>>>>>> +    if (sysfs_streq(buf, "always")) {
>>>>>> +        set_bit(order, &huge_anon_orders_always);
>>>>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>>>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>>>>> +    } else if (sysfs_streq(buf, "inherit")) {
>>>>>> +        set_bit(order, &huge_anon_orders_inherit);
>>>>>> +        clear_bit(order, &huge_anon_orders_always);
>>>>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>>>>> +    } else if (sysfs_streq(buf, "madvise")) {
>>>>>> +        set_bit(order, &huge_anon_orders_madvise);
>>>>>> +        clear_bit(order, &huge_anon_orders_always);
>>>>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>>>>> +    } else if (sysfs_streq(buf, "never")) {
>>>>>> +        clear_bit(order, &huge_anon_orders_always);
>>>>>> +        clear_bit(order, &huge_anon_orders_inherit);
>>>>>> +        clear_bit(order, &huge_anon_orders_madvise);
>>>>>
>>>>> Note: I was wondering for a second if some concurrent calls could lead to an
>>>>> inconsistent state. I think in the worst case we'll simply end up with "never"
>>>>> on races.
>>>>
>>>> You mean if different threads try to write different values to this file
>>>> concurrently? Or if there is a concurrent fault that tries to read the flags
>>>> while they are being modified?
>>>
>>> I thought about what you said first, but what you said last might also apply. As
>>> long as "nothing breaks", all good.
>>>
>>>>
>>>> I thought about this for a long time too and wasn't sure what was best. The
>>>> existing global enabled store impl clears the bits first then sets the bit.
>>>> With
>>>> this approach you can end up with multiple bits set if there is a race to set
>>>> different values, and you can end up with a faulting thread seeing never if it
>>>> reads the bits after they have been cleared but before setting them.
>>>
>>> Right, but user space is playing stupid games and can win stupid prizes. As long
>>> as nothing breaks, we're good.
>>>
>>>>
>>>> I decided to set the new bit before clearing the old bits, which is
>>>> different; A
>>>> racing fault will never see "never" but as you say, a race to set the file
>>>> could
>>>> result in "never" being set.
>>>>
>>>> On reflection, it's probably best to set the bit *last* like the global control
>>>> does?
>>>
>>> We could probably just slap a simple spinlock in there, so at least the writer side
>>> is completely serialized. Then you can just set the bit last. It's unlikely that
>>> readers will actually run into issues, and if they ever would, we could use some
>>> rcu magic to let them read a consistent state.
>>
>> I'd prefer to leave it as it is now; clear first, set last without any explicit
>> serialization. I've convinced myself that nothing breaks and it's the same
>> pattern used by the global control so it's consistent. Unless you're insisting on
>> the spin lock?
>
> No, not at all. But it would certainly remove any possible concerns :)

OK fine, you win :). I'll add a spin lock on the writer side.
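
Something like this is what I have in mind (sketch only, not the final patch;
only the "always" and "never" arms shown, the other two follow the same
pattern):

static DEFINE_SPINLOCK(huge_anon_orders_lock);

static ssize_t thpsize_enabled_store(struct kobject *kobj,
                                     struct kobj_attribute *attr,
                                     const char *buf, size_t count)
{
    int order = to_thpsize(kobj)->order;
    ssize_t ret = count;

    if (sysfs_streq(buf, "always")) {
        spin_lock(&huge_anon_orders_lock);
        clear_bit(order, &huge_anon_orders_inherit);
        clear_bit(order, &huge_anon_orders_madvise);
        set_bit(order, &huge_anon_orders_always);     /* set last */
        spin_unlock(&huge_anon_orders_lock);
    } else if (sysfs_streq(buf, "never")) {
        spin_lock(&huge_anon_orders_lock);
        clear_bit(order, &huge_anon_orders_always);
        clear_bit(order, &huge_anon_orders_inherit);
        clear_bit(order, &huge_anon_orders_madvise);
        spin_unlock(&huge_anon_orders_lock);
    } else {
        /* "inherit"/"madvise" elided here; they mirror the "always" arm. */
        ret = -EINVAL;
    }

    return ret;
}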


2023-12-07 12:08:43

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP

On 07/12/2023 11:08, David Hildenbrand wrote:
> [...]
>
>>>
>>> Nit: the orders = ... order = ... looks like this might deserve a helper
>>> function that makes this easier to read.
>>
>> To be honest, the existing function that I've modified is a bit of a mess.
>
> It's all an ugly mess and I hate it.
>
> It would be cleanest if we'd just have "thp_vma_configured_orders()" that gives
> us all configured orders for the given VMA+flags combination. No passing in of
> orders, try handling the masking in the caller.
>
> Then, we move that nasty "transhuge_vma_suitable" handling for !in_pf out of
> there and handle that in the callers. The comment "Huge fault does the check in
> fault handlers. And this check is not suitable for huge PUD fault handlers."
> already makes me angry, what a mess.

My thp_vma_suitable_order[s]() does now at least work correctly for PUD.

>
>
> Then, we'd have a thp_vma_fitting_orders() / thp_vma_is_fitting_order() function
> that does the filtering only based on the given address + vma size/alignment.
> That's roughly "thp_vma_suitable_orders()".
>
>
> Finding a good name to combine both could be something like
> "thp_vma_possible_orders()".
>
>
> Would make more sense to me (but again, German guy, so it's probably all wrong).
>
>
>> thp_vma_allowable_orders() calls thp_vma_suitable_orders() if we are not in a
>> page fault, because the page fault handlers already do that check themselves. It
>> would be nice to refactor the whole thing so that thp_vma_allowable_orders() is
>> a strict superset of thp_vma_suitable_orders(). Then this can just call
>> thp_vma_allowable_orders(). But that's going to start touching the PMD and PUD
>> handlers, so prefer if we leave that for a separate patch set.
>>
>>>
>>> Nit: Why call thp_vma_suitable_orders if the orders are already 0? Again, some
>>> helper might be reasonable where that is handled internally.
>>
>> Because thp_vma_suitable_orders() will handle it safely and is inline, so it
>> should be just as efficient? This would go away with the refactoring described
>> above.
>
> Right. Won't win in a beauty contest. Some simple helper might make this much
> easier to digest.
>
>>
>>>
>>> Comment: For order-0 we'll always perform a function call to both
>>> thp_vma_allowable_orders() / thp_vma_suitable_orders(). We should perform some
>>> fast and efficient check if any <PMD THP are even enabled in the system / for
>>> this VMA, and in that case just fallback before doing more expensive checks.
>>
>> thp_vma_allowable_orders() is inline as you mentioned.
>>
>> I was deliberately trying to keep all the decision logic in one place
>> (thp_vma_suitable_orders) because it's already pretty complicated. But if you
>> insist, how about this in the header:
>>
>> static inline
>> unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>>                        unsigned long vm_flags, bool smaps,
>>                        bool in_pf, bool enforce_sysfs,
>>                        unsigned long orders)
>> {
>>     /* Optimization to check if required orders are enabled early. */
>>     if (enforce_sysfs && vma_is_anonymous(vma)) {
>>         unsigned long mask = READ_ONCE(huge_anon_orders_always);
>>
>>         if (vm_flags & VM_HUGEPAGE)
>>             mask |= READ_ONCE(huge_anon_orders_madvise);
>>         if (hugepage_global_always() ||
>>             ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
>>             mask |= READ_ONCE(huge_anon_orders_inherit);
>>
>>         orders &= mask;
>>         if (!orders)
>>             return 0;
>>        
>>         enforce_sysfs = false;
>>     }
>>
>>     return __thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf,
>>                       enforce_sysfs, orders);
>> }
>>
>> Then the above check can be removed from __thp_vma_allowable_orders() - it will
>> still retain the `if (enforce_sysfs && !vma_is_anonymous(vma))` part.
>>
>
> Better. I still kind-of hate having to pass in orders here. Such masking is
> better done in the caller (see above how it might be done when moving the
> transhuge_vma_suitable() check out).
>
>>
>>>
>>>> +
>>>> +    if (!orders)
>>>> +        goto fallback;
>>>> +
>>>> +    pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>>> +    if (!pte)
>>>> +        return ERR_PTR(-EAGAIN);
>>>> +
>>>> +    order = first_order(orders);
>>>> +    while (orders) {
>>>> +        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>>> +        vmf->pte = pte + pte_index(addr);
>>>> +        if (pte_range_none(vmf->pte, 1 << order))
>>>> +            break;
>>>
>>> Comment: Likely it would make sense to scan only once and determine the "largest
>>> none range" around that address, having the largest suitable order in mind.
>>
>> Yes, that's how I used to do it, but Yu Zhou requested simplifying to this,
>> IIRC. Perhaps this an optimization opportunity for later?
>
> Yes, definitely.
>
>>
>>>
>>>> +        order = next_order(&orders, order);
>>>> +    }
>>>> +
>>>> +    vmf->pte = NULL;
>>>
>>> Nit: Can you elaborate why you are messing with vmf->pte here? A simple helper
>>> variable will make this code look less magical. Unless I am missing something
>>> important :)
>>
>> Gahh, I used to pass the vmf to what pte_range_none() was refactored into (an
>> approach that was suggested by Yu Zhou IIRC). But since I did some refactoring
>> based on some comments from JohnH, I see I don't need that anymore. Agreed; it
>> will be much clearer just to use a local variable. Will fix.
>>
>>>
>>>> +    pte_unmap(pte);
>>>> +
>>>> +    gfp = vma_thp_gfp_mask(vma);
>>>> +
>>>> +    while (orders) {
>>>> +        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>>> +        folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>>> +        if (folio) {
>>>> +            clear_huge_page(&folio->page, addr, 1 << order);
>>>> +            return folio;
>>>> +        }
>>>> +        order = next_order(&orders, order);
>>>> +    }
>>>> +
>>>
>>> Question: would it make sense to combine both loops? I suspect memory
>>> allocations with pte_offset_map()/kmap are problematic.
>>
>> They are both operating on separate orders; next_order() is "consuming" an order
>> by removing the current one from the orders bitfield and returning the next one.
>>
>> So the first loop starts at the highest order and keeps checking lower orders
>> until one fully fits in the VMA. And the second loop starts at the first order
>> that was found to fully fit and loops to lower orders until an allocation is
>> successful.
>
> Right, but you know from the first loop which order is applicable (and will be
> fed to the second loop) and could just pte_unmap(pte) + tryalloc. If that fails,
> remap and try with the next orders.

You mean something like this?

    pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
    if (!pte)
        return ERR_PTR(-EAGAIN);

    order = highest_order(orders);
    while (orders) {
        addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
        if (!pte_range_none(pte + pte_index(addr), 1 << order)) {
            order = next_order(&orders, order);
            continue;
        }

        pte_unmap(pte);

        folio = vma_alloc_folio(gfp, order, vma, addr, true);
        if (folio) {
            clear_huge_page(&folio->page, vmf->address, 1 << order);
            return folio;
        }

        pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
        if (!pte)
            return ERR_PTR(-EAGAIN);

        order = next_order(&orders, order);
    }

    pte_unmap(pte);

I don't really like that because if high order folio allocations fail, then you
are calling pte_range_none() again for the next lower order; once that check has
succeeded for an order it shouldn't be required for any lower orders. In this
case you also have lots of pte map/unmap.

The original version feels more efficient to me.
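
(For reference, the two helpers doing that "consuming" are just thin bitfield
walkers; roughly:)

/* Highest set order in the bitfield. */
static inline int highest_order(unsigned long orders)
{
    return fls_long(orders) - 1;
}

/* Remove 'prev' from the set and return the next-highest remaining order. */
static inline int next_order(unsigned long *orders, int prev)
{
    *orders &= ~BIT(prev);
    return highest_order(*orders);
}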

>
> That would make the code certainly easier to understand. That "orders" magic of
> constructing, filtering, walking is confusing :)
>
>
> I might find some time today to see if there is an easy way to cleanup all what
> I spelled out above. It really is a mess. But likely that cleanup could be
> deferred (but you're touching it, so ... :) ).

I'm going to ignore the last 5 words. I heard the "that cleanup could be
deferred" part loud and clear though :)


>

2023-12-07 13:28:56

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP

>>
>> Right, but you know from the first loop which order is applicable (and will be
>> fed to the second loop) and could just pte_unmap(pte) + tryalloc. If that fails,
>> remap and try with the next orders.
>
> You mean something like this?
>
> pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> if (!pte)
> return ERR_PTR(-EAGAIN);
>
> order = highest_order(orders);
> while (orders) {
> addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> if (!pte_range_none(pte + pte_index(addr), 1 << order)) {
> order = next_order(&orders, order);
> continue;
> }
>
> pte_unmap(pte);
>
> folio = vma_alloc_folio(gfp, order, vma, addr, true);
> if (folio) {
> clear_huge_page(&folio->page, vmf->address, 1 << order);
> return folio;
> }
>
> pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> if (!pte)
> return ERR_PTR(-EAGAIN);
>
> order = next_order(&orders, order);
> }
>
> pte_unmap(pte);
>
> I don't really like that because if high order folio allocations fail, then you
> are calling pte_range_none() again for the next lower order; once that check has
> succeeded for an order it shouldn't be required for any lower orders. In this
> case you also have lots of pte map/unmap.

I see what you mean.

>
> The original version feels more efficient to me.

Yes it is. Adding in some comments might help, like

/*
 * Find the largest order where the aligned range is completely pte_none(). Note
 * that all remaining orders will be completely pte_none().
*/
...

/* Try allocating the largest of the remaining orders. */

>
>>
>> That would make the code certainly easier to understand. That "orders" magic of
>> constructing, filtering, walking is confusing :)
>>
>>
>> I might find some time today to see if there is an easy way to cleanup all what
>> I spelled out above. It really is a mess. But likely that cleanup could be
>> deferred (but you're touching it, so ... :) ).
>
> I'm going to ignore the last 5 words. I heard the "that cleanup could be
> deferred" part loud and clear though :)

:)

If we could stop passing orders into thp_vma_allowable_orders(), that would probably
be the biggest win. It's just all a confusing mess.

--
Cheers,

David / dhildenb

2023-12-07 14:45:58

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP

On 07/12/2023 13:28, David Hildenbrand wrote:
>>>
>>> Right, but you know from the first loop which order is applicable (and will be
>>> fed to the second loop) and could just pte_unmap(pte) + tryalloc. If that fails,
>>> remap and try with the next orders.
>>
>> You mean something like this?
>>
>>     pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>     if (!pte)
>>         return ERR_PTR(-EAGAIN);
>>
>>     order = highest_order(orders);
>>     while (orders) {
>>         addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>         if (!pte_range_none(pte + pte_index(addr), 1 << order)) {
>>             order = next_order(&orders, order);
>>             continue;
>>         }
>>
>>         pte_unmap(pte);
>>        
>>         folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>         if (folio) {
>>             clear_huge_page(&folio->page, vmf->address, 1 << order);
>>             return folio;
>>         }
>>
>>         pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>         if (!pte)
>>             return ERR_PTR(-EAGAIN);
>>
>>         order = next_order(&orders, order);
>>     }
>>
>>     pte_unmap(pte);
>>
>> I don't really like that because if high order folio allocations fail, then you
>> are calling pte_range_none() again for the next lower order; once that check has
>> succeeded for an order it shouldn't be required for any lower orders. In this
>> case you also have lots of pte map/unmap.
>
> I see what you mean.
>
>>
>> The original version feels more efficient to me.
> Yes it is. Adding in some comments might help, like
>
> /*
>  * Find the largest order where the aligned range is completely prot_none(). Note
>  * that all remaining orders will be completely prot_none().
>  */
> ...
>
> /* Try allocating the largest of the remaining orders. */

OK added.

>
>>
>>>
>>> That would make the code certainly easier to understand. That "orders" magic of
>>> constructing, filtering, walking is confusing :)
>>>
>>>
>>> I might find some time today to see if there is an easy way to cleanup all what
>>> I spelled out above. It really is a mess. But likely that cleanup could be
>>> deferred (but you're touching it, so ... :) ).
>>
>> I'm going to ignore the last 5 words. I heard the "that cleanup could be
>> deferred" part loud and clear though :)
>
> :)
>
> If we could stop passing orders into thp_vma_allowable_orders(), that would probably
> be the biggest win. It's just all a confusing mess.



I tried an approach like you suggested in the other thread originally, but I
struggled to define exactly what "thp_vma_configured_orders()" should mean;
Ideally, I just want "all the THP orders that are currently enabled for this
VMA+flags". But some callers want to enforce_sysfs and others don't, so you
probably have to at least pass that flag. Then you have DAX which explicitly
ignores enforce_sysfs, but only in a page fault. And shmem, which ignores
enforce_sysfs, but only outside of a page fault. So it quickly becomes pretty
complex. It is basically thp_vma_allowable_orders() as currently defined.

If this could be a simple function then it could be inline and as you say, we
can do the masking in the caller and exit early for the order-0 case. But it is
very complex (at least if you want to retain the equivalent logic to what
thp_vma_allowable_orders() has) so I'm not sure how to do the order-0 early exit
without passing in the orders bitfield. And we are unlikely to exit early
because PMD-sized THP is likely enabled and, since we didn't pass in an orders
bitfield, it wasn't filtered out.

In short, I can't see a solution that's better than the one I have. But if you
have something in mind, if you can spell it out, then I'll have a go at tidying
it up and integrating it into the series. Otherwise I really would prefer to
leave it for a separate series.
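
To make that concrete, every definition I can come up with ends up with the
same parameter list and the same logic; a hypothetical sketch (the name and
signature below are made up for illustration, loosely following the
smaps/in_pf/enforce_sysfs/orders parameters used in this series):

    static inline unsigned long
    thp_vma_configured_orders(struct vm_area_struct *vma, unsigned long vm_flags,
                              bool in_pf, bool enforce_sysfs, unsigned long orders)
    {
        /*
         * The cheap order-0 early exit only works because the caller already
         * masked in the orders it cares about; without that bitfield there is
         * no way to know whether everything has been filtered down to nothing.
         */
        if (!orders)
            return 0;

        /*
         * What remains - per-size sysfs checks plus the DAX and shmem special
         * cases around enforce_sysfs - is exactly what
         * thp_vma_allowable_orders() already does.
         */
        return thp_vma_allowable_orders(vma, vm_flags, /* smaps = */ false,
                                        in_pf, enforce_sysfs, orders);
    }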

2023-12-07 15:02:21

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP

On 07.12.23 15:45, Ryan Roberts wrote:
> On 07/12/2023 13:28, David Hildenbrand wrote:
>>>>
>>>> Right, but you know from the first loop which order is applicable (and will be
>>>> fed to the second loop) and could just pte_unmap(pte) + tryalloc. If that fails,
>>>> remap and try with the next orders.
>>>
>>> You mean something like this?
>>>
>>>     pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>>     if (!pte)
>>>         return ERR_PTR(-EAGAIN);
>>>
>>>     order = highest_order(orders);
>>>     while (orders) {
>>>         addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>>         if (!pte_range_none(pte + pte_index(addr), 1 << order)) {
>>>             order = next_order(&orders, order);
>>>             continue;
>>>         }
>>>
>>>         pte_unmap(pte);
>>>
>>>         folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>>         if (folio) {
>>>             clear_huge_page(&folio->page, vmf->address, 1 << order);
>>>             return folio;
>>>         }
>>>
>>>         pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>>         if (!pte)
>>>             return ERR_PTR(-EAGAIN);
>>>
>>>         order = next_order(&orders, order);
>>>     }
>>>
>>>     pte_unmap(pte);
>>>
>>> I don't really like that because if high order folio allocations fail, then you
>>> are calling pte_range_none() again for the next lower order; once that check has
>>> succeeded for an order it shouldn't be required for any lower orders. In this
>>> case you also have lots of pte map/unmap.
>>
>> I see what you mean.
>>
>>>
>>> The original version feels more efficient to me.
>> Yes it is. Adding in some comments might help, like
>>
>> /*
>>  * Find the largest order where the aligned range is completely prot_none(). Note
>>  * that all remaining orders will be completely prot_none().
>>  */
>> ...
>>
>> /* Try allocating the largest of the remaining orders. */
>
> OK added.
>
>>
>>>
>>>>
>>>> That would make the code certainly easier to understand. That "orders" magic of
>>>> constructing, filtering, walking is confusing :)
>>>>
>>>>
>>>> I might find some time today to see if there is an easy way to cleanup all what
>>>> I spelled out above. It really is a mess. But likely that cleanup could be
>>>> deferred (but you're touching it, so ... :) ).
>>>
>>> I'm going to ignore the last 5 words. I heard the "that cleanup could be
>>> deferred" part loud and clear though :)
>>
>> :)
>>
>> If we could stop passing orders into thp_vma_allowable_orders(), that would probably
>> be the biggest win. It's just all a confusing mess.
>
>
>
> I tried an approach like you suggested in the other thread originally, but I
> struggled to define exactly what "thp_vma_configured_orders()" should mean;
> Ideally, I just want "all the THP orders that are currently enabled for this
> VMA+flags". But some callers want to enforce_sysfs and others don't, so you
> probably have to at least pass that flag. Then you have DAX which explicitly

Yes, the flags would still be passed. It's kind of the "context".

> ignores enforce_sysfs, but only in a page fault. And shmem, which ignores
> enforce_sysfs, but only outside of a page fault. So it quickly becomes pretty
> complex. It is basically thp_vma_allowable_orders() as currently defined.

Yeah, but moving the "can we actually fit a THP in there" check out of
the picture.
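
(The "fit" part is conceptually just an address-range check against the VMA,
something like the hypothetical helper below - illustrative name and shape,
not the series' code:)

    /* Can a naturally aligned range of this order fit inside the VMA? */
    static inline bool thp_fits_in_vma(struct vm_area_struct *vma,
                                       unsigned long addr, int order)
    {
        unsigned long start = ALIGN_DOWN(addr, PAGE_SIZE << order);

        return start >= vma->vm_start &&
               start + (PAGE_SIZE << order) <= vma->vm_end;
    }
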

>
> If this could be a simple function then it could be inline and as you say, we
> can do the masking in the caller and exit early for the order-0 case. But it is
> very complex (at least if you want to retain the equivalent logic to what
> thp_vma_allowable_orders() has) so I'm not sure how to do the order-0 early exit
> without passing in the orders bitfield. And we are unlikely to exit early
> because PMD-sized THP is likely enabled and, since we didn't pass in an orders
> bitfield, it wasn't filtered out.
>
> In short, I can't see a solution that's better than the one I have. But if you
> have something in mind, if you can spell it out, then I'll have a go at tidying
> it up and integrating it into the series. Otherwise I really would prefer to
> leave it for a separate series.

I'm playing with some cleanups, but they can all be built on top if they
materialize.

--
Cheers,

David / dhildenb

2023-12-07 15:12:53

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v8 04/10] mm: thp: Support allocation of anonymous multi-size THP

On 07/12/2023 15:01, David Hildenbrand wrote:
> On 07.12.23 15:45, Ryan Roberts wrote:
>> On 07/12/2023 13:28, David Hildenbrand wrote:
>>>>>
>>>>> Right, but you know from the first loop which order is applicable (and will be
>>>>> fed to the second loop) and could just pte_unmap(pte) + tryalloc. If that fails,
>>>>> remap and try with the next orders.
>>>>
>>>> You mean something like this?
>>>>
>>>>      pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>>>      if (!pte)
>>>>          return ERR_PTR(-EAGAIN);
>>>>
>>>>      order = highest_order(orders);
>>>>      while (orders) {
>>>>          addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>>>          if (!pte_range_none(pte + pte_index(addr), 1 << order)) {
>>>>              order = next_order(&orders, order);
>>>>              continue;
>>>>          }
>>>>
>>>>          pte_unmap(pte);
>>>>
>>>>          folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>>>          if (folio) {
>>>>              clear_huge_page(&folio->page, vmf->address, 1 << order);
>>>>              return folio;
>>>>          }
>>>>
>>>>          pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>>>          if (!pte)
>>>>              return ERR_PTR(-EAGAIN);
>>>>
>>>>          order = next_order(&orders, order);
>>>>      }
>>>>
>>>>      pte_unmap(pte);
>>>>
>>>> I don't really like that because if high order folio allocations fail, then you
>>>> are calling pte_range_none() again for the next lower order; once that check has
>>>> succeeded for an order it shouldn't be required for any lower orders. In this
>>>> case you also have lots of pte map/unmap.
>>>
>>> I see what you mean.
>>>
>>>>
>>>> The original version feels more efficient to me.
>>> Yes it is. Adding in some comments might help, like
>>>
>>> /*
>>>   * Find the largest order where the aligned range is completely prot_none(). Note
>>>   * that all remaining orders will be completely prot_none().
>>>   */
>>> ...
>>>
>>> /* Try allocating the largest of the remaining orders. */
>>
>> OK added.
>>
>>>
>>>>
>>>>>
>>>>> That would make the code certainly easier to understand. That "orders" magic of
>>>>> constructing, filtering, walking is confusing :)
>>>>>
>>>>>
>>>>> I might find some time today to see if there is an easy way to cleanup all what
>>>>> I spelled out above. It really is a mess. But likely that cleanup could be
>>>>> deferred (but you're touching it, so ... :) ).
>>>>
>>>> I'm going to ignore the last 5 words. I heard the "that cleanup could be
>>>> deferred" part loud and clear though :)
>>>
>>> :)
>>>
>>> If we could stop passing orders into thp_vma_allowable_orders(), that would probably
>>> be the biggest win. It's just all a confusing mess.
>>
>>
>>
>> I tried an approach like you suggested in the other thread originally, but I
>> struggled to define exactly what "thp_vma_configured_orders()" should mean;
>> Ideally, I just want "all the THP orders that are currently enabled for this
>> VMA+flags". But some callers want to enforce_sysfs and others don't, so you
>> probably have to at least pass that flag. Then you have DAX which explicitly
>
> Yes, the flags would still be passed. It's kind of the "context".
>
>> ignores enforce_sysfs, but only in a page fault. And shmem, which ignores
>> enforce_sysfs, but only outside of a page fault. So it quickly becomes pretty
>> complex. It is basically thp_vma_allowable_orders() as currently defined.
>
> Yeah, but moving the "can we actually fit a THP in there" check out of the picture.
>
>>
>> If this could be a simple function then it could be inline and as you say, we
>> can do the masking in the caller and exit early for the order-0 case. But it is
>> very complex (at least if you want to retain the equivalent logic to what
>> thp_vma_allowable_orders() has) so I'm not sure how to do the order-0 early exit
>> without passing in the orders bitfield. And we are unlikely to exit early
>> because PMD-sized THP is likely enabled and, since we didn't pass in an orders
>> bitfield, it wasn't filtered out.
>>
>> In short, I can't see a solution that's better than the one I have. But if you
>> have something in mind, if you can spell it out, then I'll have a go at tidying
>> it up and integrating it into the series. Otherwise I really would prefer to
>> leave it for a separate series.
>
> I'm playing with some cleanups, but they can all be built on top if they
> materialize.

OK, I'm going to post a v9 then. And cross my fingers and hope that's the final
version.