2023-01-05 10:35:59

by James Houghton

Subject: [PATCH 00/46] Based on latest mm-unstable (85b44c25cd1e).

This series introduces the concept of HugeTLB high-granularity mapping
(HGM): it teaches HugeTLB how to map HugeTLB pages at high granularity,
similar to how THPs can be PTE-mapped.

Support for HGM in this series is for MAP_SHARED VMAs on x86 only. Other
architectures and (some) support for MAP_PRIVATE will come later.

Old versions:
RFC v2: https://lore.kernel.org/linux-mm/[email protected]/
RFC v1: https://lore.kernel.org/linux-mm/[email protected]/

Changelog (RFC v2 -> v1):
- Userspace API to enable HGM changed from
UFFD_FEATURE_MINOR_HUGETLBFS_HGM to MADV_SPLIT.
- Picked up Acked-bys and Reviewed-bys. Thanks Mike, Peter, and Mina!
- Rebased onto latest mm-unstable, notably picking up Peter's
HugeTLB walk synchronization fix [1].
- Changed MADV_COLLAPSE to take i_mmap_rwsem for writing to make its
synchronization the same as huge_pmd_unshare, so anywhere where
hugetlb_pte_walk() is safe, HGM walks are also safe.
- hugetlb_hgm_walk API has changed -- should reduce complexity where
callers wish to do HGM walks.
- Always round addresses properly before populating hugetlb_ptes (always
pick up first PTE in a contiguous bunch).
- Added a VMA flag for HGM: VM_HUGETLB_HGM; the hugetlb_shared_vma_data
struct has been removed.
- Make hugetlb_pte.ptl always hold the PTL to use.
- Added a requirement that overlapping contiguous and non-contiguous
PTEs must use the same PTL.
- Some things have been slightly renamed for clarity, and I've added
lots of comments that I said I would.
- Added a test for fork() + uffd-wp to cover
copy_hugetlb_page_range().

Patch breakdown:
Patches 1-4: Cleanup.
Patches 5-15: Create hugetlb_pte and implement HGM basics (PT walking,
enabling HGM).
Patches 16-30: Make existing routines compatible with HGM.
Patches 31-34: Extend userfaultfd to support high-granularity
CONTINUEs.
Patch 35: Add HugeTLB HGM support to MADV_COLLAPSE.
Patches 36-39: Cleanup, add HGM stats, and enable HGM for x86.
Patches 40-46: Documentation and selftests.

Motivation (mostly unchanged from RFC v1)
=====

Being able to map HugeTLB pages at PAGE_SIZE has important use cases in
post-copy live migration and memory poisoning.

- Live Migration (userfaultfd)
For post-copy live migration using userfaultfd, we currently have to
install an entire hugepage before we can allow a guest to access that
page. This is because, right now, either the WHOLE hugepage is mapped or
NONE of it is, so the guest can only ever access all of the hugepage or
none of it. This makes post-copy live migration for 1G HugeTLB-backed
VMs completely infeasible.

With high-granularity mapping, we can map PAGE_SIZE pieces of a
hugepage, thereby allowing the guest to access only PAGE_SIZE chunks,
and getting page faults on the rest (and triggering another
demand-fetch). This gives userspace the flexibility to install PAGE_SIZE
chunks of memory into a hugepage, making migration of 1G-backed VMs
perfectly feasible, and it vastly reduces the vCPU stall time during
post-copy for 2M-backed VMs.

At Google, for a 48 vCPU VM in post-copy, we can expect these approximate
per-page median fetch latencies:
4K: <100us
2M: >10ms
Being able to unpause a vCPU 100x quicker is helpful for guest stability,
and being able to use 1G pages at all can significantly improve
steady-state guest performance.

After fully copying a hugepage over the network, we will want to
collapse the mapping down to what it would normally be (e.g., one PUD
for a 1G page). Rather than having the kernel do this automatically,
we leave it up to userspace to tell us to collapse a range (via
MADV_COLLAPSE).

- Memory Failure
When a memory error is found within a HugeTLB page, it would be ideal
if we could unmap only the PAGE_SIZE section that contained the error.
This is what THPs are able to do. Using high-granularity mapping, we
could do this, but this isn't tackled in this patch series.

Userspace API
=====

This series introduces the first application of high-granularity
mapping: high-granularity userfaultfd post-copy for HugeTLB.

The userspace API for this consists of:
- MADV_SPLIT: to enable the following userfaultfd API changes.
1. read(uffd): addresses are rounded to PAGE_SIZE instead of the
hugepage size.
2. UFFDIO_CONTINUE for HugeTLB VMAs is now allowed in
PAGE_SIZE-aligned chunks.
- MADV_COLLAPSE is now available for MAP_SHARED HugeTLB VMAs. It is used
to collapse the page table mappings, but it does not undo the API
changes that MADV_SPLIT provides.
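
For concreteness, here is a minimal sketch of how userspace might drive
this API. It is an illustration only: error handling is omitted, the
MADV_SPLIT value is the one used by the selftests later in this series,
MADV_COLLAPSE's value comes from the existing uapi headers, and the
helper names are made up for this example.

    #include <stdint.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/ioctl.h>
    #include <linux/userfaultfd.h>

    #ifndef MADV_SPLIT
    #define MADV_SPLIT 26      /* value used by the selftests in this series */
    #endif
    #ifndef MADV_COLLAPSE
    #define MADV_COLLAPSE 25   /* existing THP MADV_COLLAPSE value */
    #endif

    /*
     * mem/len describe a MAP_SHARED HugeTLB mapping. MADV_SPLIT is issued
     * before UFFDIO_REGISTER (minor mode), matching the ordering used by
     * the selftests in this series.
     */
    static int hgm_enable(char *mem, size_t len)
    {
        return madvise(mem, len, MADV_SPLIT);
    }

    /*
     * Install a single PAGE_SIZE piece of a hugepage whose contents have
     * already been written through the hugetlbfs file.
     */
    static int hgm_continue_4k(int uffd, char *mem, size_t offset)
    {
        struct uffdio_continue cont;

        cont.range.start = (uint64_t)(unsigned long)(mem + offset);
        cont.range.len = getpagesize();   /* PAGE_SIZE, not the hugepage size */
        cont.mode = 0;
        return ioctl(uffd, UFFDIO_CONTINUE, &cont);
    }

    /* Once every piece of the hugepage has been installed: */
    static int hgm_collapse(char *mem, size_t len)
    {
        return madvise(mem, len, MADV_COLLAPSE);
    }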

HugeTLB changes
=====

- hugetlb_pte
`hugetlb_pte` is used to keep track of "HugeTLB" PTEs, which are PTEs at
any level and of any size. page_vma_mapped_walk and pagewalk have both
been changed to provide `hugetlb_pte`s to callers so that they can get
size+level information that, before, came from the hstate.
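
As a minimal sketch, a caller handed a hugetlb_pte (using the helpers
introduced in patch 11 of this series) can recover that information
directly:

    /*
     * Sketch: size, base address, and PTL of whatever maps `addr`,
     * whether that entry is a PUD, PMD, or PTE.
     */
    unsigned long sz = hugetlb_pte_size(&hpte);      /* 1UL << hpte.shift */
    unsigned long base = addr & hugetlb_pte_mask(&hpte);
    spinlock_t *ptl = hugetlb_pte_lock(&hpte);

    /* ... operate on [base, base + sz) ... */
    spin_unlock(ptl);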

- Mapcount
The mapcount for a high-granularity mapped HugeTLB page is the total
number of page table references to that page. So if we have a 2M page
that is mapped in a single VMA with 512 4K PTEs, the mapcount will be
512.

- Synchronization
Collapsing high-granularity page table mappings has the same
synchronization requirements as huge_pmd_unshare (grab both the HugeTLB
VMA lock for writing and i_mmap_rwsem for writing), so anywhere where it
is safe to do hugetlb_walk(), it is also safe to do a high-granularity
page table walk.
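
A rough sketch of that rule, using lock helpers that already exist and
the walk API added in this series (an illustration, not a literal
excerpt from the patches):

    /* Exclusive side (MADV_COLLAPSE, huge_pmd_unshare): */
    hugetlb_vma_lock_write(vma);
    i_mmap_lock_write(vma->vm_file->f_mapping);
    /* ... collapse/free high-granularity page tables ... */
    i_mmap_unlock_write(vma->vm_file->f_mapping);
    hugetlb_vma_unlock_write(vma);

    /*
     * Walker side: holding either lock for read keeps the page tables
     * stable, so hugetlb_walk() and HGM walks are equally safe here.
     */
    hugetlb_vma_lock_read(vma);
    if (!hugetlb_full_walk(&hpte, vma, addr)) {
        /* ... safe to use hpte ... */
    }
    hugetlb_vma_unlock_read(vma);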

Supporting arm64 & contiguous PTEs
=====

As implemented, HGM does not yet fully support contiguous PTEs. To do
this, the HugeTLB API that architectures implement will need to change.
For example, set_huge_pte_at merely takes a `pte_t *`; it carries no
information about the "size" of that PTE (such as whether we need to
overwrite multiple contiguous PTEs).

To handle this, in a follow-up series, set_huge_pte_at and many other
similar functions will be replaced with variants that take
`hugetlb_pte`s. See [2] for how this may be implemented, plus a full HGM
implementation for arm64.
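
To illustrate the direction (a hypothetical signature, not something
this series adds; see [2] for the actual follow-up):

    /*
     * Hypothetical sketch: a hugetlb_pte-aware setter. With hpte->shift
     * available, an arm64 implementation can tell how many contiguous
     * PTEs a single "huge PTE" spans and write all of them.
     */
    void set_hugetlb_pte_at(struct mm_struct *mm, unsigned long addr,
                            const struct hugetlb_pte *hpte, pte_t pte);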

Supporting architectures beyond arm64
=====

Each architecture must audit its HugeTLB implementation to make sure
that it supports HGM. For example, architectures that implement
arch_make_huge_pte need to ensure that a `shift` of `PAGE_SHIFT` is
acceptable.
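
For example, the generic default is taught by "hugetlb: make default
arch_make_huge_pte understand small mappings" in this series to leave
PAGE_SHIFT mappings as ordinary PTEs; roughly (a sketch of the shape,
not the literal patch):

    static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
                                           vm_flags_t flags)
    {
        /* Only mark the entry huge when it maps more than a base page. */
        return shift > PAGE_SHIFT ? pte_mkhuge(entry) : entry;
    }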

Architectures must also audit code that might depend on HugeTLB always
having large mappings (i.e., check huge_page_size(), huge_page_shift(),
vma_kernel_pagesize(), and vma_mmu_pagesize() callers). For example, the
arm64 KVM MMU implementation thinks that all hugepages are mapped at
huge_page_size(), and thus builds the second-stage page table
accordingly. In an HGM world, this isn't true; it is corrected in [2].

[1]: https://lore.kernel.org/linux-mm/[email protected]/
[2]: https://github.com/48ca/linux/tree/hgmv1-dec19-2

James Houghton (46):
hugetlb: don't set PageUptodate for UFFDIO_CONTINUE
hugetlb: remove mk_huge_pte; it is unused
hugetlb: remove redundant pte_mkhuge in migration path
hugetlb: only adjust address ranges when VMAs want PMD sharing
hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
mm: add VM_HUGETLB_HGM VMA flag
hugetlb: rename __vma_shareable_flags_pmd to
__vma_has_hugetlb_vma_lock
hugetlb: add HugeTLB HGM enablement helpers
mm: add MADV_SPLIT to enable HugeTLB HGM
hugetlb: make huge_pte_lockptr take an explicit shift argument
hugetlb: add hugetlb_pte to track HugeTLB page table entries
hugetlb: add hugetlb_alloc_pmd and hugetlb_alloc_pte
hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
hugetlb: add make_huge_pte_with_shift
hugetlb: make default arch_make_huge_pte understand small mappings
hugetlbfs: do a full walk to check if vma maps a page
hugetlb: make unmapping compatible with high-granularity mappings
hugetlb: add HGM support for hugetlb_change_protection
hugetlb: add HGM support for follow_hugetlb_page
hugetlb: add HGM support for hugetlb_follow_page_mask
hugetlb: use struct hugetlb_pte for walk_hugetlb_range
mm: rmap: provide pte_order in page_vma_mapped_walk
mm: rmap: make page_vma_mapped_walk callers use pte_order
rmap: update hugetlb lock comment for HGM
hugetlb: update page_vma_mapped to do high-granularity walks
hugetlb: add HGM support for copy_hugetlb_page_range
hugetlb: add HGM support for move_hugetlb_page_tables
hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page
rmap: in try_to_{migrate,unmap}_one, check head page for page flags
hugetlb: add high-granularity migration support
hugetlb: sort hstates in hugetlb_init_hstates
hugetlb: add for_each_hgm_shift
hugetlb: userfaultfd: add support for high-granularity UFFDIO_CONTINUE
hugetlb: userfaultfd: when using MADV_SPLIT, round addresses to
PAGE_SIZE
hugetlb: add MADV_COLLAPSE for hugetlb
hugetlb: remove huge_pte_lock and huge_pte_lockptr
hugetlb: replace make_huge_pte with make_huge_pte_with_shift
mm: smaps: add stats for HugeTLB mapping size
hugetlb: x86: enable high-granularity mapping
docs: hugetlb: update hugetlb and userfaultfd admin-guides with HGM
info
docs: proc: include information about HugeTLB HGM
selftests/vm: add HugeTLB HGM to userfaultfd selftest
selftests/kvm: add HugeTLB HGM to KVM demand paging selftest
selftests/vm: add anon and shared hugetlb to migration test
selftests/vm: add hugetlb HGM test to migration selftest
selftests/vm: add HGM UFFDIO_CONTINUE and hwpoison tests

Documentation/admin-guide/mm/hugetlbpage.rst | 4 +
Documentation/admin-guide/mm/userfaultfd.rst | 16 +-
Documentation/filesystems/proc.rst | 56 +-
arch/alpha/include/uapi/asm/mman.h | 2 +
arch/mips/include/uapi/asm/mman.h | 2 +
arch/parisc/include/uapi/asm/mman.h | 2 +
arch/powerpc/mm/pgtable.c | 6 +-
arch/s390/include/asm/hugetlb.h | 5 -
arch/s390/mm/gmap.c | 20 +-
arch/x86/Kconfig | 1 +
arch/xtensa/include/uapi/asm/mman.h | 2 +
fs/Kconfig | 7 +
fs/hugetlbfs/inode.c | 17 +-
fs/proc/task_mmu.c | 187 ++-
fs/userfaultfd.c | 14 +-
include/asm-generic/hugetlb.h | 5 -
include/asm-generic/tlb.h | 6 +-
include/linux/huge_mm.h | 12 +-
include/linux/hugetlb.h | 212 ++-
include/linux/mm.h | 7 +
include/linux/pagewalk.h | 10 +-
include/linux/rmap.h | 1 +
include/linux/swapops.h | 8 +-
include/trace/events/mmflags.h | 7 +
include/uapi/asm-generic/mman-common.h | 2 +
mm/damon/vaddr.c | 42 +-
mm/debug_vm_pgtable.c | 2 +-
mm/hmm.c | 20 +-
mm/hugetlb.c | 1265 ++++++++++++++---
mm/khugepaged.c | 4 +-
mm/madvise.c | 44 +-
mm/memory-failure.c | 17 +-
mm/mempolicy.c | 28 +-
mm/migrate.c | 20 +-
mm/mincore.c | 17 +-
mm/mprotect.c | 18 +-
mm/page_vma_mapped.c | 60 +-
mm/pagewalk.c | 20 +-
mm/rmap.c | 54 +-
mm/userfaultfd.c | 40 +-
.../selftests/kvm/demand_paging_test.c | 2 +-
.../testing/selftests/kvm/include/test_util.h | 2 +
.../selftests/kvm/include/userfaultfd_util.h | 6 +-
tools/testing/selftests/kvm/lib/kvm_util.c | 2 +-
tools/testing/selftests/kvm/lib/test_util.c | 14 +
.../selftests/kvm/lib/userfaultfd_util.c | 14 +-
tools/testing/selftests/vm/Makefile | 1 +
tools/testing/selftests/vm/hugetlb-hgm.c | 455 ++++++
tools/testing/selftests/vm/migration.c | 229 ++-
tools/testing/selftests/vm/userfaultfd.c | 84 +-
50 files changed, 2560 insertions(+), 511 deletions(-)
create mode 100644 tools/testing/selftests/vm/hugetlb-hgm.c

--
2.39.0.314.g84b9a713c41-goog


2023-01-05 10:36:28

by James Houghton

Subject: [PATCH 11/46] hugetlb: add hugetlb_pte to track HugeTLB page table entries

Once a HugeTLB page has been mapped at high granularity, its page table
entries can be of any size/type. (For example, we can have a 1G page
mapped with a mix of PMDs and PTEs.) This struct helps keep track of a
HugeTLB PTE after we have done a page table walk.

Without this, we'd have to pass around the "size" of the PTE everywhere.
We effectively did this before; it could be fetched from the hstate,
which we pass around pretty much everywhere.

hugetlb_pte_present_leaf is included here as a helper function that will
be used frequently later on.

Signed-off-by: James Houghton <[email protected]>
---
include/linux/hugetlb.h | 72 +++++++++++++++++++++++++++++++++++++++++
mm/hugetlb.c | 29 +++++++++++++++++
2 files changed, 101 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3f098363cd6e..bf441d8a1b52 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -38,6 +38,54 @@ typedef struct { unsigned long pd; } hugepd_t;
*/
#define __NR_USED_SUBPAGE 3

+enum hugetlb_level {
+ HUGETLB_LEVEL_PTE = 1,
+ /*
+ * We always include PMD, PUD, and P4D in this enum definition so that,
+ * when logged as an integer, we can easily tell which level it is.
+ */
+ HUGETLB_LEVEL_PMD,
+ HUGETLB_LEVEL_PUD,
+ HUGETLB_LEVEL_P4D,
+ HUGETLB_LEVEL_PGD,
+};
+
+struct hugetlb_pte {
+ pte_t *ptep;
+ unsigned int shift;
+ enum hugetlb_level level;
+ spinlock_t *ptl;
+};
+
+static inline
+void __hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
+ unsigned int shift, enum hugetlb_level level,
+ spinlock_t *ptl)
+{
+ /*
+ * If 'shift' indicates that this PTE is contiguous, then @ptep must
+ * be the first pte of the contiguous bunch.
+ */
+ hpte->ptl = ptl;
+ hpte->ptep = ptep;
+ hpte->shift = shift;
+ hpte->level = level;
+}
+
+static inline
+unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
+{
+ return 1UL << hpte->shift;
+}
+
+static inline
+unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
+{
+ return ~(hugetlb_pte_size(hpte) - 1);
+}
+
+bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
+
struct hugepage_subpool {
spinlock_t lock;
long count;
@@ -1232,6 +1280,30 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
return ptl;
}

+static inline
+spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
+{
+ return hpte->ptl;
+}
+
+static inline
+spinlock_t *hugetlb_pte_lock(struct hugetlb_pte *hpte)
+{
+ spinlock_t *ptl = hugetlb_pte_lockptr(hpte);
+
+ spin_lock(ptl);
+ return ptl;
+}
+
+static inline
+void hugetlb_pte_populate(struct mm_struct *mm, struct hugetlb_pte *hpte,
+ pte_t *ptep, unsigned int shift,
+ enum hugetlb_level level)
+{
+ __hugetlb_pte_populate(hpte, ptep, shift, level,
+ huge_pte_lockptr(shift, mm, ptep));
+}
+
#if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
extern void __init hugetlb_cma_reserve(int order);
#else
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4db38dc79d0e..2d83a2c359a2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1266,6 +1266,35 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
return false;
}

+bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte)
+{
+ pgd_t pgd;
+ p4d_t p4d;
+ pud_t pud;
+ pmd_t pmd;
+
+ switch (hpte->level) {
+ case HUGETLB_LEVEL_PGD:
+ pgd = __pgd(pte_val(pte));
+ return pgd_present(pgd) && pgd_leaf(pgd);
+ case HUGETLB_LEVEL_P4D:
+ p4d = __p4d(pte_val(pte));
+ return p4d_present(p4d) && p4d_leaf(p4d);
+ case HUGETLB_LEVEL_PUD:
+ pud = __pud(pte_val(pte));
+ return pud_present(pud) && pud_leaf(pud);
+ case HUGETLB_LEVEL_PMD:
+ pmd = __pmd(pte_val(pte));
+ return pmd_present(pmd) && pmd_leaf(pmd);
+ case HUGETLB_LEVEL_PTE:
+ return pte_present(pte);
+ default:
+ WARN_ON_ONCE(1);
+ return false;
+ }
+}
+
+
static void enqueue_hugetlb_folio(struct hstate *h, struct folio *folio)
{
int nid = folio_nid(folio);
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:36:35

by James Houghton

Subject: [PATCH 31/46] hugetlb: sort hstates in hugetlb_init_hstates

When using HugeTLB high-granularity mapping, we need to go through the
supported hugepage sizes in decreasing order so that we pick the largest
size that works. Consider the case where we're faulting in a 1G hugepage
for the first time: we want hugetlb_fault/hugetlb_no_page to map it with
a PUD. By going through the sizes in decreasing order, we will find that
PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too.

This commit also changes bootmem hugepages from storing hstate pointers
directly to storing the hstate sizes. The hstate pointers used for
boot-time-allocated hugepages become invalid after we sort the hstates.
`gather_bootmem_prealloc`, called after the hstates have been sorted,
now converts the size to the correct hstate.

Signed-off-by: James Houghton <[email protected]>
---
include/linux/hugetlb.h | 2 +-
mm/hugetlb.c | 49 ++++++++++++++++++++++++++++++++---------
2 files changed, 40 insertions(+), 11 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index daf993fdbc38..8a664a9dd0a8 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -789,7 +789,7 @@ struct hstate {

struct huge_bootmem_page {
struct list_head list;
- struct hstate *hstate;
+ unsigned long hstate_sz;
};

int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2fb95ecafc63..1e9e149587b3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -34,6 +34,7 @@
#include <linux/nospec.h>
#include <linux/delayacct.h>
#include <linux/memory.h>
+#include <linux/sort.h>

#include <asm/page.h>
#include <asm/pgalloc.h>
@@ -49,6 +50,10 @@

int hugetlb_max_hstate __read_mostly;
unsigned int default_hstate_idx;
+/*
+ * After hugetlb_init_hstates is called, hstates will be sorted from largest
+ * to smallest.
+ */
struct hstate hstates[HUGE_MAX_HSTATE];

#ifdef CONFIG_CMA
@@ -3347,7 +3352,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
/* Put them into a private list first because mem_map is not up yet */
INIT_LIST_HEAD(&m->list);
list_add(&m->list, &huge_boot_pages);
- m->hstate = h;
+ m->hstate_sz = huge_page_size(h);
return 1;
}

@@ -3362,7 +3367,7 @@ static void __init gather_bootmem_prealloc(void)
list_for_each_entry(m, &huge_boot_pages, list) {
struct page *page = virt_to_page(m);
struct folio *folio = page_folio(page);
- struct hstate *h = m->hstate;
+ struct hstate *h = size_to_hstate(m->hstate_sz);

VM_BUG_ON(!hstate_is_gigantic(h));
WARN_ON(folio_ref_count(folio) != 1);
@@ -3478,9 +3483,38 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
kfree(node_alloc_noretry);
}

+static int compare_hstates_decreasing(const void *a, const void *b)
+{
+ unsigned long sz_a = huge_page_size((const struct hstate *)a);
+ unsigned long sz_b = huge_page_size((const struct hstate *)b);
+
+ if (sz_a < sz_b)
+ return 1;
+ if (sz_a > sz_b)
+ return -1;
+ return 0;
+}
+
+static void sort_hstates(void)
+{
+ unsigned long default_hstate_sz = huge_page_size(&default_hstate);
+
+ /* Sort from largest to smallest. */
+ sort(hstates, hugetlb_max_hstate, sizeof(*hstates),
+ compare_hstates_decreasing, NULL);
+
+ /*
+ * We may have changed the location of the default hstate, so we need to
+ * update it.
+ */
+ default_hstate_idx = hstate_index(size_to_hstate(default_hstate_sz));
+}
+
static void __init hugetlb_init_hstates(void)
{
- struct hstate *h, *h2;
+ struct hstate *h;
+
+ sort_hstates();

for_each_hstate(h) {
/* oversize hugepages were init'ed in early boot */
@@ -3499,13 +3533,8 @@ static void __init hugetlb_init_hstates(void)
continue;
if (hugetlb_cma_size && h->order <= HUGETLB_PAGE_ORDER)
continue;
- for_each_hstate(h2) {
- if (h2 == h)
- continue;
- if (h2->order < h->order &&
- h2->order > h->demote_order)
- h->demote_order = h2->order;
- }
+ if (h - 1 >= &hstates[0])
+ h->demote_order = huge_page_order(h - 1);
}
}

--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:39:06

by James Houghton

Subject: [PATCH 23/46] mm: rmap: make page_vma_mapped_walk callers use pte_order

This also updates the callers' hugetlb mapcounting code to handle
mapcount properly for subpage-mapped hugetlb pages.

Signed-off-by: James Houghton <[email protected]>
---
mm/migrate.c | 2 +-
mm/rmap.c | 17 +++++++++++++----
2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 832f639fc49a..0062689f4878 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -244,7 +244,7 @@ static bool remove_migration_pte(struct folio *folio,

#ifdef CONFIG_HUGETLB_PAGE
if (folio_test_hugetlb(folio)) {
- unsigned int shift = huge_page_shift(hstate_vma(vma));
+ unsigned int shift = pvmw.pte_order + PAGE_SHIFT;

pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
if (folio_test_anon(folio))
diff --git a/mm/rmap.c b/mm/rmap.c
index 8a24b90d9531..ff7e6c770b0a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1608,7 +1608,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
if (PageHWPoison(subpage) && !(flags & TTU_IGNORE_HWPOISON)) {
pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
if (folio_test_hugetlb(folio)) {
- hugetlb_count_sub(folio_nr_pages(folio), mm);
+ hugetlb_count_sub(1UL << pvmw.pte_order, mm);
set_huge_pte_at(mm, address, pvmw.pte, pteval);
} else {
dec_mm_counter(mm, mm_counter(&folio->page));
@@ -1767,7 +1767,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*
* See Documentation/mm/mmu_notifier.rst
*/
- page_remove_rmap(subpage, vma, folio_test_hugetlb(folio));
+ if (folio_test_hugetlb(folio))
+ page_remove_rmap(&folio->page, vma, true);
+ else
+ page_remove_rmap(subpage, vma, false);
+
if (vma->vm_flags & VM_LOCKED)
mlock_page_drain_local();
folio_put(folio);
@@ -2030,7 +2034,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
} else if (PageHWPoison(subpage)) {
pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
if (folio_test_hugetlb(folio)) {
- hugetlb_count_sub(folio_nr_pages(folio), mm);
+ hugetlb_count_sub(1L << pvmw.pte_order, mm);
set_huge_pte_at(mm, address, pvmw.pte, pteval);
} else {
dec_mm_counter(mm, mm_counter(&folio->page));
@@ -2122,7 +2126,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
*
* See Documentation/mm/mmu_notifier.rst
*/
- page_remove_rmap(subpage, vma, folio_test_hugetlb(folio));
+ if (folio_test_hugetlb(folio))
+ page_remove_rmap(&folio->page, vma, true);
+ else
+ page_remove_rmap(subpage, vma, false);
if (vma->vm_flags & VM_LOCKED)
mlock_page_drain_local();
folio_put(folio);
@@ -2206,6 +2213,8 @@ static bool page_make_device_exclusive_one(struct folio *folio,
args->owner);
mmu_notifier_invalidate_range_start(&range);

+ VM_BUG_ON_FOLIO(folio_test_hugetlb(folio), folio);
+
while (page_vma_mapped_walk(&pvmw)) {
/* Unexpected PMD-mapped THP? */
VM_BUG_ON_FOLIO(!pvmw.pte, folio);
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:39:26

by James Houghton

Subject: [PATCH 28/46] hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page

Update the page fault handler to support high-granularity page faults.
While handling a page fault on a partially-mapped HugeTLB page, if the
PTE we find with hugetlb_pte_walk is none, then we will replace it with
a leaf-level PTE to map the page. To give some examples:
1. For a completely unmapped 1G page, it will be mapped with a 1G PUD.
2. For a 1G page that has its first 512M mapped, any faults on the
unmapped sections will result in 2M PMDs mapping each unmapped 2M
section.
3. For a 1G page that has only its first 4K mapped, a page fault on its
second 4K section will get a 4K PTE to map it.

hugetlb_fault will not create high-granularity mappings on its own;
they only come into existence via UFFDIO_CONTINUE.

This commit does not handle hugetlb_wp right now, and it doesn't handle
HugeTLB page migration and swap entries.

The BUG_ON in huge_pte_alloc is removed, as it is no longer valid when
HGM is possible. HGM can be disabled if the VMA lock cannot be allocated
after a VMA is split, yet high-granularity mappings may still exist.

Signed-off-by: James Houghton <[email protected]>
---
mm/hugetlb.c | 115 ++++++++++++++++++++++++++++++++++++---------------
1 file changed, 81 insertions(+), 34 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 582d14a206b5..8e690a22456a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -117,6 +117,18 @@ enum hugetlb_level hpage_size_to_level(unsigned long sz)
return HUGETLB_LEVEL_PGD;
}

+/*
+ * Find the subpage that corresponds to `addr` in `hpage`.
+ */
+static struct page *hugetlb_find_subpage(struct hstate *h, struct page *hpage,
+ unsigned long addr)
+{
+ size_t idx = (addr & ~huge_page_mask(h))/PAGE_SIZE;
+
+ BUG_ON(idx >= pages_per_huge_page(h));
+ return &hpage[idx];
+}
+
static inline bool subpool_is_free(struct hugepage_subpool *spool)
{
if (spool->count)
@@ -5926,14 +5938,14 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
* Recheck pte with pgtable lock. Returns true if pte didn't change, or
* false if pte changed or is changing.
*/
-static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm,
- pte_t *ptep, pte_t old_pte)
+static bool hugetlb_pte_stable(struct hstate *h, struct hugetlb_pte *hpte,
+ pte_t old_pte)
{
spinlock_t *ptl;
bool same;

- ptl = huge_pte_lock(h, mm, ptep);
- same = pte_same(huge_ptep_get(ptep), old_pte);
+ ptl = hugetlb_pte_lock(hpte);
+ same = pte_same(huge_ptep_get(hpte->ptep), old_pte);
spin_unlock(ptl);

return same;
@@ -5942,17 +5954,18 @@ static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm,
static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
struct vm_area_struct *vma,
struct address_space *mapping, pgoff_t idx,
- unsigned long address, pte_t *ptep,
+ unsigned long address, struct hugetlb_pte *hpte,
pte_t old_pte, unsigned int flags)
{
struct hstate *h = hstate_vma(vma);
vm_fault_t ret = VM_FAULT_SIGBUS;
int anon_rmap = 0;
unsigned long size;
- struct page *page;
+ struct page *page, *subpage;
pte_t new_pte;
spinlock_t *ptl;
unsigned long haddr = address & huge_page_mask(h);
+ unsigned long haddr_hgm = address & hugetlb_pte_mask(hpte);
bool new_page, new_pagecache_page = false;
u32 hash = hugetlb_fault_mutex_hash(mapping, idx);

@@ -5997,7 +6010,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
* never happen on the page after UFFDIO_COPY has
* correctly installed the page and returned.
*/
- if (!hugetlb_pte_stable(h, mm, ptep, old_pte)) {
+ if (!hugetlb_pte_stable(h, hpte, old_pte)) {
ret = 0;
goto out;
}
@@ -6021,7 +6034,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
* here. Before returning error, get ptl and make
* sure there really is no pte entry.
*/
- if (hugetlb_pte_stable(h, mm, ptep, old_pte))
+ if (hugetlb_pte_stable(h, hpte, old_pte))
ret = vmf_error(PTR_ERR(page));
else
ret = 0;
@@ -6071,7 +6084,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
unlock_page(page);
put_page(page);
/* See comment in userfaultfd_missing() block above */
- if (!hugetlb_pte_stable(h, mm, ptep, old_pte)) {
+ if (!hugetlb_pte_stable(h, hpte, old_pte)) {
ret = 0;
goto out;
}
@@ -6096,30 +6109,43 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
vma_end_reservation(h, vma, haddr);
}

- ptl = huge_pte_lock(h, mm, ptep);
+ ptl = hugetlb_pte_lock(hpte);
ret = 0;
- /* If pte changed from under us, retry */
- if (!pte_same(huge_ptep_get(ptep), old_pte))
+ /*
+ * If pte changed from under us, retry.
+ *
+ * When dealing with high-granularity-mapped PTEs, it's possible that
+ * a non-contiguous PTE within our contiguous PTE group gets populated,
+ * in which case, we need to retry here. This is NOT caught here, and
+ * will need to be addressed when HGM is supported for architectures
+ * that support contiguous PTEs.
+ */
+ if (!pte_same(huge_ptep_get(hpte->ptep), old_pte))
goto backout;

if (anon_rmap)
hugepage_add_new_anon_rmap(page, vma, haddr);
else
page_dup_file_rmap(page, true);
- new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
- && (vma->vm_flags & VM_SHARED)));
+
+ subpage = hugetlb_find_subpage(h, page, haddr_hgm);
+ new_pte = make_huge_pte_with_shift(vma, subpage,
+ ((vma->vm_flags & VM_WRITE)
+ && (vma->vm_flags & VM_SHARED)),
+ hpte->shift);
/*
* If this pte was previously wr-protected, keep it wr-protected even
* if populated.
*/
if (unlikely(pte_marker_uffd_wp(old_pte)))
new_pte = huge_pte_mkuffd_wp(new_pte);
- set_huge_pte_at(mm, haddr, ptep, new_pte);
+ set_huge_pte_at(mm, haddr_hgm, hpte->ptep, new_pte);

- hugetlb_count_add(pages_per_huge_page(h), mm);
+ hugetlb_count_add(hugetlb_pte_size(hpte) / PAGE_SIZE, mm);
if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
+ WARN_ON_ONCE(hugetlb_pte_size(hpte) != huge_page_size(h));
/* Optimization, do the COW without a second fault */
- ret = hugetlb_wp(mm, vma, address, ptep, flags, page, ptl);
+ ret = hugetlb_wp(mm, vma, address, hpte->ptep, flags, page, ptl);
}

spin_unlock(ptl);
@@ -6176,17 +6202,20 @@ u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx)
vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, unsigned int flags)
{
- pte_t *ptep, entry;
+ pte_t entry;
spinlock_t *ptl;
vm_fault_t ret;
u32 hash;
pgoff_t idx;
struct page *page = NULL;
+ struct page *subpage = NULL;
struct page *pagecache_page = NULL;
struct hstate *h = hstate_vma(vma);
struct address_space *mapping;
int need_wait_lock = 0;
unsigned long haddr = address & huge_page_mask(h);
+ unsigned long haddr_hgm;
+ struct hugetlb_pte hpte;

/*
* Serialize hugepage allocation and instantiation, so that we don't
@@ -6200,26 +6229,26 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,

/*
* Acquire vma lock before calling huge_pte_alloc and hold
- * until finished with ptep. This prevents huge_pmd_unshare from
- * being called elsewhere and making the ptep no longer valid.
+ * until finished with hpte. This prevents huge_pmd_unshare from
+ * being called elsewhere and making the hpte no longer valid.
*/
hugetlb_vma_lock_read(vma);
- ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
- if (!ptep) {
+ if (hugetlb_full_walk_alloc(&hpte, vma, address, 0)) {
hugetlb_vma_unlock_read(vma);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
return VM_FAULT_OOM;
}

- entry = huge_ptep_get(ptep);
+ entry = huge_ptep_get(hpte.ptep);
/* PTE markers should be handled the same way as none pte */
- if (huge_pte_none_mostly(entry))
+ if (huge_pte_none_mostly(entry)) {
/*
* hugetlb_no_page will drop vma lock and hugetlb fault
* mutex internally, which make us return immediately.
*/
- return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+ return hugetlb_no_page(mm, vma, mapping, idx, address, &hpte,
entry, flags);
+ }

ret = 0;

@@ -6240,7 +6269,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* be released there.
*/
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
- migration_entry_wait_huge(vma, ptep);
+ migration_entry_wait_huge(vma, hpte.ptep);
return 0;
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
ret = VM_FAULT_HWPOISON_LARGE |
@@ -6248,6 +6277,10 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
goto out_mutex;
}

+ if (!hugetlb_pte_present_leaf(&hpte, entry))
+ /* We raced with someone splitting the entry. */
+ goto out_mutex;
+
/*
* If we are going to COW/unshare the mapping later, we examine the
* pending reservations for this page now. This will ensure that any
@@ -6267,14 +6300,17 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
pagecache_page = find_lock_page(mapping, idx);
}

- ptl = huge_pte_lock(h, mm, ptep);
+ ptl = hugetlb_pte_lock(&hpte);

/* Check for a racing update before calling hugetlb_wp() */
- if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
+ if (unlikely(!pte_same(entry, huge_ptep_get(hpte.ptep))))
goto out_ptl;

+ /* haddr_hgm is the base address of the region that hpte maps. */
+ haddr_hgm = address & hugetlb_pte_mask(&hpte);
+
/* Handle userfault-wp first, before trying to lock more pages */
- if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
+ if (userfaultfd_wp(vma) && huge_pte_uffd_wp(entry) &&
(flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
struct vm_fault vmf = {
.vma = vma,
@@ -6298,7 +6334,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* pagecache_page, so here we need take the former one
* when page != pagecache_page or !pagecache_page.
*/
- page = pte_page(entry);
+ subpage = pte_page(entry);
+ page = compound_head(subpage);
if (page != pagecache_page)
if (!trylock_page(page)) {
need_wait_lock = 1;
@@ -6309,7 +6346,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,

if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
if (!huge_pte_write(entry)) {
- ret = hugetlb_wp(mm, vma, address, ptep, flags,
+ WARN_ON_ONCE(hugetlb_pte_size(&hpte) !=
+ huge_page_size(h));
+ ret = hugetlb_wp(mm, vma, address, hpte.ptep, flags,
pagecache_page, ptl);
goto out_put_page;
} else if (likely(flags & FAULT_FLAG_WRITE)) {
@@ -6317,9 +6356,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
}
entry = pte_mkyoung(entry);
- if (huge_ptep_set_access_flags(vma, haddr, ptep, entry,
+ if (huge_ptep_set_access_flags(vma, haddr_hgm, hpte.ptep, entry,
flags & FAULT_FLAG_WRITE))
- update_mmu_cache(vma, haddr, ptep);
+ update_mmu_cache(vma, haddr_hgm, hpte.ptep);
out_put_page:
if (page != pagecache_page)
unlock_page(page);
@@ -7523,6 +7562,9 @@ int hugetlb_full_walk(struct hugetlb_pte *hpte,
/*
* hugetlb_full_walk_alloc - do a high-granularity walk, potentially allocate
* new PTEs.
+ *
+ * If @target_sz is 0, then only attempt to allocate the hstate-level PTE, and
+ * walk as far as we can go.
*/
int hugetlb_full_walk_alloc(struct hugetlb_pte *hpte,
struct vm_area_struct *vma,
@@ -7541,6 +7583,12 @@ int hugetlb_full_walk_alloc(struct hugetlb_pte *hpte,
if (!ptep)
return -ENOMEM;

+ if (!target_sz) {
+ WARN_ON_ONCE(hugetlb_hgm_walk_uninit(hpte, ptep, vma, addr,
+ PAGE_SIZE, false));
+ return 0;
+ }
+
return hugetlb_hgm_walk_uninit(hpte, ptep, vma, addr, target_sz, true);
}

@@ -7569,7 +7617,6 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
pte = (pte_t *)pmd_alloc(mm, pud, addr);
}
}
- BUG_ON(pte && pte_present(*pte) && !pte_huge(*pte));

return pte;
}
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:39:35

by James Houghton

Subject: [PATCH 29/46] rmap: in try_to_{migrate,unmap}_one, check head page for page flags

The main complication here is that HugeTLB pages have their poison
status stored in the head page as the HWPoison page flag. Because
HugeTLB high-granularity mapping can create PTEs that point to subpages
instead of always the head of a hugepage, we need to check the
compound_head for page flags.

Signed-off-by: James Houghton <[email protected]>
---
mm/rmap.c | 34 ++++++++++++++++++++++++++--------
1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 076ea77010e5..a6004d6b0415 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1456,10 +1456,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
pte_t pteval;
- struct page *subpage;
+ struct page *subpage, *page_flags_page;
bool anon_exclusive, ret = true;
struct mmu_notifier_range range;
enum ttu_flags flags = (enum ttu_flags)(long)arg;
+ bool page_poisoned;

/*
* When racing against e.g. zap_pte_range() on another cpu,
@@ -1512,9 +1513,17 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,

subpage = folio_page(folio,
pte_pfn(*pvmw.pte) - folio_pfn(folio));
+ /*
+ * We check the page flags of HugeTLB pages by checking the
+ * head page.
+ */
+ page_flags_page = folio_test_hugetlb(folio)
+ ? &folio->page
+ : subpage;
+ page_poisoned = PageHWPoison(page_flags_page);
address = pvmw.address;
anon_exclusive = folio_test_anon(folio) &&
- PageAnonExclusive(subpage);
+ PageAnonExclusive(page_flags_page);

if (folio_test_hugetlb(folio)) {
bool anon = folio_test_anon(folio);
@@ -1523,7 +1532,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
* The try_to_unmap() is only passed a hugetlb page
* in the case where the hugetlb page is poisoned.
*/
- VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
+ VM_BUG_ON_FOLIO(!page_poisoned, folio);
/*
* huge_pmd_unshare may unmap an entire PMD page.
* There is no way of knowing exactly which PMDs may
@@ -1606,7 +1615,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
/* Update high watermark before we lower rss */
update_hiwater_rss(mm);

- if (PageHWPoison(subpage) && !(flags & TTU_IGNORE_HWPOISON)) {
+ if (page_poisoned && !(flags & TTU_IGNORE_HWPOISON)) {
pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
if (folio_test_hugetlb(folio)) {
hugetlb_count_sub(1UL << pvmw.pte_order, mm);
@@ -1632,7 +1641,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
mmu_notifier_invalidate_range(mm, address,
address + PAGE_SIZE);
} else if (folio_test_anon(folio)) {
- swp_entry_t entry = { .val = page_private(subpage) };
+ swp_entry_t entry = {
+ .val = page_private(page_flags_page)
+ };
pte_t swp_pte;
/*
* Store the swap location in the pte.
@@ -1831,7 +1842,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
pte_t pteval;
- struct page *subpage;
+ struct page *subpage, *page_flags_page;
bool anon_exclusive, ret = true;
struct mmu_notifier_range range;
enum ttu_flags flags = (enum ttu_flags)(long)arg;
@@ -1911,9 +1922,16 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
subpage = folio_page(folio,
pte_pfn(*pvmw.pte) - folio_pfn(folio));
}
+ /*
+ * We check the page flags of HugeTLB pages by checking the
+ * head page.
+ */
+ page_flags_page = folio_test_hugetlb(folio)
+ ? &folio->page
+ : subpage;
address = pvmw.address;
anon_exclusive = folio_test_anon(folio) &&
- PageAnonExclusive(subpage);
+ PageAnonExclusive(page_flags_page);

if (folio_test_hugetlb(folio)) {
bool anon = folio_test_anon(folio);
@@ -2032,7 +2050,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
* No need to invalidate here it will synchronize on
* against the special swap migration pte.
*/
- } else if (PageHWPoison(subpage)) {
+ } else if (PageHWPoison(page_flags_page)) {
pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
if (folio_test_hugetlb(folio)) {
hugetlb_count_sub(1L << pvmw.pte_order, mm);
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:39:39

by James Houghton

Subject: [PATCH 43/46] selftests/kvm: add HugeTLB HGM to KVM demand paging selftest

This test exercises the GUP paths for HGM. MADV_COLLAPSE is not tested.

Signed-off-by: James Houghton <[email protected]>
---
tools/testing/selftests/kvm/demand_paging_test.c | 2 +-
tools/testing/selftests/kvm/include/test_util.h | 2 ++
.../selftests/kvm/include/userfaultfd_util.h | 6 +++---
tools/testing/selftests/kvm/lib/kvm_util.c | 2 +-
tools/testing/selftests/kvm/lib/test_util.c | 14 ++++++++++++++
tools/testing/selftests/kvm/lib/userfaultfd_util.c | 14 +++++++++++---
6 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index b0e1fc4de9e2..e534f9c927bf 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -170,7 +170,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
uffd_descs[i] = uffd_setup_demand_paging(
p->uffd_mode, p->uffd_delay, vcpu_hva,
vcpu_args->pages * memstress_args.guest_page_size,
- &handle_uffd_page_request);
+ p->src_type, &handle_uffd_page_request);
}
}

diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
index 80d6416f3012..a2106c19a614 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -103,6 +103,7 @@ enum vm_mem_backing_src_type {
VM_MEM_SRC_ANONYMOUS_HUGETLB_16GB,
VM_MEM_SRC_SHMEM,
VM_MEM_SRC_SHARED_HUGETLB,
+ VM_MEM_SRC_SHARED_HUGETLB_HGM,
NUM_SRC_TYPES,
};

@@ -121,6 +122,7 @@ size_t get_def_hugetlb_pagesz(void);
const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i);
size_t get_backing_src_pagesz(uint32_t i);
bool is_backing_src_hugetlb(uint32_t i);
+bool is_backing_src_shared_hugetlb(enum vm_mem_backing_src_type src_type);
void backing_src_help(const char *flag);
enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name);
long get_run_delay(void);
diff --git a/tools/testing/selftests/kvm/include/userfaultfd_util.h b/tools/testing/selftests/kvm/include/userfaultfd_util.h
index 877449c34592..d91528a58245 100644
--- a/tools/testing/selftests/kvm/include/userfaultfd_util.h
+++ b/tools/testing/selftests/kvm/include/userfaultfd_util.h
@@ -26,9 +26,9 @@ struct uffd_desc {
pthread_t thread;
};

-struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
- void *hva, uint64_t len,
- uffd_handler_t handler);
+struct uffd_desc *uffd_setup_demand_paging(
+ int uffd_mode, useconds_t delay, void *hva, uint64_t len,
+ enum vm_mem_backing_src_type src_type, uffd_handler_t handler);

void uffd_stop_demand_paging(struct uffd_desc *uffd);

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index c88c3ace16d2..67e7223f054b 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -972,7 +972,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
region->fd = -1;
if (backing_src_is_shared(src_type))
region->fd = kvm_memfd_alloc(region->mmap_size,
- src_type == VM_MEM_SRC_SHARED_HUGETLB);
+ is_backing_src_shared_hugetlb(src_type));

region->mmap_start = mmap(NULL, region->mmap_size,
PROT_READ | PROT_WRITE,
diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/selftests/kvm/lib/test_util.c
index 5c22fa4c2825..712a0878932e 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -271,6 +271,13 @@ const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i)
*/
.flag = MAP_SHARED,
},
+ [VM_MEM_SRC_SHARED_HUGETLB_HGM] = {
+ /*
+ * Identical to shared_hugetlb except for the name.
+ */
+ .name = "shared_hugetlb_hgm",
+ .flag = MAP_SHARED,
+ },
};
_Static_assert(ARRAY_SIZE(aliases) == NUM_SRC_TYPES,
"Missing new backing src types?");
@@ -289,6 +296,7 @@ size_t get_backing_src_pagesz(uint32_t i)
switch (i) {
case VM_MEM_SRC_ANONYMOUS:
case VM_MEM_SRC_SHMEM:
+ case VM_MEM_SRC_SHARED_HUGETLB_HGM:
return getpagesize();
case VM_MEM_SRC_ANONYMOUS_THP:
return get_trans_hugepagesz();
@@ -305,6 +313,12 @@ bool is_backing_src_hugetlb(uint32_t i)
return !!(vm_mem_backing_src_alias(i)->flag & MAP_HUGETLB);
}

+bool is_backing_src_shared_hugetlb(enum vm_mem_backing_src_type src_type)
+{
+ return src_type == VM_MEM_SRC_SHARED_HUGETLB ||
+ src_type == VM_MEM_SRC_SHARED_HUGETLB_HGM;
+}
+
static void print_available_backing_src_types(const char *prefix)
{
int i;
diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
index 92cef20902f1..3c7178d6c4f4 100644
--- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
+++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
@@ -25,6 +25,10 @@

#ifdef __NR_userfaultfd

+#ifndef MADV_SPLIT
+#define MADV_SPLIT 26
+#endif
+
static void *uffd_handler_thread_fn(void *arg)
{
struct uffd_desc *uffd_desc = (struct uffd_desc *)arg;
@@ -108,9 +112,9 @@ static void *uffd_handler_thread_fn(void *arg)
return NULL;
}

-struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
- void *hva, uint64_t len,
- uffd_handler_t handler)
+struct uffd_desc *uffd_setup_demand_paging(
+ int uffd_mode, useconds_t delay, void *hva, uint64_t len,
+ enum vm_mem_backing_src_type src_type, uffd_handler_t handler)
{
struct uffd_desc *uffd_desc;
bool is_minor = (uffd_mode == UFFDIO_REGISTER_MODE_MINOR);
@@ -140,6 +144,10 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
"ioctl UFFDIO_API failed: %" PRIu64,
(uint64_t)uffdio_api.api);

+ if (src_type == VM_MEM_SRC_SHARED_HUGETLB_HGM)
+ TEST_ASSERT(!madvise(hva, len, MADV_SPLIT),
+ "Could not enable HGM");
+
uffdio_register.range.start = (uint64_t)hva;
uffdio_register.range.len = len;
uffdio_register.mode = uffd_mode;
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:39:41

by James Houghton

Subject: [PATCH 44/46] selftests/vm: add anon and shared hugetlb to migration test

Shared HugeTLB mappings are migrated best-effort. Sometimes, when the
VMA lock cannot be grabbed for writing, migration may spuriously fail.
To tolerate that, we allow retries.

Signed-off-by: James Houghton <[email protected]>
---
tools/testing/selftests/vm/migration.c | 83 ++++++++++++++++++++++++--
1 file changed, 79 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/vm/migration.c b/tools/testing/selftests/vm/migration.c
index 1cec8425e3ca..21577a84d7e4 100644
--- a/tools/testing/selftests/vm/migration.c
+++ b/tools/testing/selftests/vm/migration.c
@@ -13,6 +13,7 @@
#include <sys/types.h>
#include <signal.h>
#include <time.h>
+#include <sys/statfs.h>

#define TWOMEG (2<<20)
#define RUNTIME (60)
@@ -59,11 +60,12 @@ FIXTURE_TEARDOWN(migration)
free(self->pids);
}

-int migrate(uint64_t *ptr, int n1, int n2)
+int migrate(uint64_t *ptr, int n1, int n2, int retries)
{
int ret, tmp;
int status = 0;
struct timespec ts1, ts2;
+ int failed = 0;

if (clock_gettime(CLOCK_MONOTONIC, &ts1))
return -1;
@@ -78,6 +80,9 @@ int migrate(uint64_t *ptr, int n1, int n2)
ret = move_pages(0, 1, (void **) &ptr, &n2, &status,
MPOL_MF_MOVE_ALL);
if (ret) {
+ if (++failed < retries)
+ continue;
+
if (ret > 0)
printf("Didn't migrate %d pages\n", ret);
else
@@ -88,6 +93,7 @@ int migrate(uint64_t *ptr, int n1, int n2)
tmp = n2;
n2 = n1;
n1 = tmp;
+ failed = 0;
}

return 0;
@@ -128,7 +134,7 @@ TEST_F_TIMEOUT(migration, private_anon, 2*RUNTIME)
if (pthread_create(&self->threads[i], NULL, access_mem, ptr))
perror("Couldn't create thread");

- ASSERT_EQ(migrate(ptr, self->n1, self->n2), 0);
+ ASSERT_EQ(migrate(ptr, self->n1, self->n2, 1), 0);
for (i = 0; i < self->nthreads - 1; i++)
ASSERT_EQ(pthread_cancel(self->threads[i]), 0);
}
@@ -158,7 +164,7 @@ TEST_F_TIMEOUT(migration, shared_anon, 2*RUNTIME)
self->pids[i] = pid;
}

- ASSERT_EQ(migrate(ptr, self->n1, self->n2), 0);
+ ASSERT_EQ(migrate(ptr, self->n1, self->n2, 1), 0);
for (i = 0; i < self->nthreads - 1; i++)
ASSERT_EQ(kill(self->pids[i], SIGTERM), 0);
}
@@ -185,9 +191,78 @@ TEST_F_TIMEOUT(migration, private_anon_thp, 2*RUNTIME)
if (pthread_create(&self->threads[i], NULL, access_mem, ptr))
perror("Couldn't create thread");

- ASSERT_EQ(migrate(ptr, self->n1, self->n2), 0);
+ ASSERT_EQ(migrate(ptr, self->n1, self->n2, 1), 0);
+ for (i = 0; i < self->nthreads - 1; i++)
+ ASSERT_EQ(pthread_cancel(self->threads[i]), 0);
+}
+
+/*
+ * Tests the anon hugetlb migration entry paths.
+ */
+TEST_F_TIMEOUT(migration, private_anon_hugetlb, 2*RUNTIME)
+{
+ uint64_t *ptr;
+ int i;
+
+ if (self->nthreads < 2 || self->n1 < 0 || self->n2 < 0)
+ SKIP(return, "Not enough threads or NUMA nodes available");
+
+ ptr = mmap(NULL, TWOMEG, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
+ if (ptr == MAP_FAILED)
+ SKIP(return, "Could not allocate hugetlb pages");
+
+ memset(ptr, 0xde, TWOMEG);
+ for (i = 0; i < self->nthreads - 1; i++)
+ if (pthread_create(&self->threads[i], NULL, access_mem, ptr))
+ perror("Couldn't create thread");
+
+ ASSERT_EQ(migrate(ptr, self->n1, self->n2, 1), 0);
for (i = 0; i < self->nthreads - 1; i++)
ASSERT_EQ(pthread_cancel(self->threads[i]), 0);
}

+/*
+ * Tests the shared hugetlb migration entry paths.
+ */
+TEST_F_TIMEOUT(migration, shared_hugetlb, 2*RUNTIME)
+{
+ uint64_t *ptr;
+ int i;
+ int fd;
+ unsigned long sz;
+ struct statfs filestat;
+
+ if (self->nthreads < 2 || self->n1 < 0 || self->n2 < 0)
+ SKIP(return, "Not enough threads or NUMA nodes available");
+
+ fd = memfd_create("tmp_hugetlb", MFD_HUGETLB);
+ if (fd < 0)
+ SKIP(return, "Couldn't create hugetlb memfd");
+
+ if (fstatfs(fd, &filestat) < 0)
+ SKIP(return, "Couldn't fstatfs hugetlb file");
+
+ sz = filestat.f_bsize;
+
+ if (ftruncate(fd, sz))
+ SKIP(return, "Couldn't allocate hugetlb pages");
+ ptr = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (ptr == MAP_FAILED)
+ SKIP(return, "Could not map hugetlb pages");
+
+ memset(ptr, 0xde, sz);
+ for (i = 0; i < self->nthreads - 1; i++)
+ if (pthread_create(&self->threads[i], NULL, access_mem, ptr))
+ perror("Couldn't create thread");
+
+ ASSERT_EQ(migrate(ptr, self->n1, self->n2, 10), 0);
+ for (i = 0; i < self->nthreads - 1; i++) {
+ ASSERT_EQ(pthread_cancel(self->threads[i]), 0);
+ pthread_join(self->threads[i], NULL);
+ }
+ ftruncate(fd, 0);
+ close(fd);
+}
+
TEST_HARNESS_MAIN
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:39:45

by James Houghton

Subject: [PATCH 05/46] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING

This adds the Kconfig option to enable or disable high-granularity
mapping. Each architecture must explicitly opt in (via
ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING); once an architecture has
opted in, HGM is enabled by default whenever HUGETLB_PAGE is enabled.

Signed-off-by: James Houghton <[email protected]>
---
fs/Kconfig | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/fs/Kconfig b/fs/Kconfig
index 2685a4d0d353..ce2567946016 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -267,6 +267,13 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
enable HVO by default. It can be disabled via hugetlb_free_vmemmap=off
(boot command line) or hugetlb_optimize_vmemmap (sysctl).

+config ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING
+ bool
+
+config HUGETLB_HIGH_GRANULARITY_MAPPING
+ def_bool HUGETLB_PAGE
+ depends on ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING
+
config MEMFD_CREATE
def_bool TMPFS || HUGETLBFS

--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:39:54

by James Houghton

Subject: [PATCH 37/46] hugetlb: replace make_huge_pte with make_huge_pte_with_shift

This removes the old definition of make_huge_pte; the shift must now
always be given explicitly. All callsites are cleaned up.

Signed-off-by: James Houghton <[email protected]>
---
mm/hugetlb.c | 31 ++++++++++++-------------------
1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d71adc03138d..10a323e6bd9c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5069,9 +5069,9 @@ const struct vm_operations_struct hugetlb_vm_ops = {
.pagesize = hugetlb_vm_op_pagesize,
};

-static pte_t make_huge_pte_with_shift(struct vm_area_struct *vma,
- struct page *page, int writable,
- int shift)
+static pte_t make_huge_pte(struct vm_area_struct *vma,
+ struct page *page, int writable,
+ int shift)
{
pte_t entry;

@@ -5087,14 +5087,6 @@ static pte_t make_huge_pte_with_shift(struct vm_area_struct *vma,
return entry;
}

-static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
- int writable)
-{
- unsigned int shift = huge_page_shift(hstate_vma(vma));
-
- return make_huge_pte_with_shift(vma, page, writable, shift);
-}
-
static void set_huge_ptep_writable(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)
{
@@ -5135,10 +5127,12 @@ static void
hugetlb_install_page(struct vm_area_struct *vma, pte_t *ptep, unsigned long addr,
struct page *new_page)
{
+ struct hstate *h = hstate_vma(vma);
__SetPageUptodate(new_page);
hugepage_add_new_anon_rmap(new_page, vma, addr);
- set_huge_pte_at(vma->vm_mm, addr, ptep, make_huge_pte(vma, new_page, 1));
- hugetlb_count_add(pages_per_huge_page(hstate_vma(vma)), vma->vm_mm);
+ set_huge_pte_at(vma->vm_mm, addr, ptep, make_huge_pte(vma, new_page, 1,
+ huge_page_shift(h)));
+ hugetlb_count_add(pages_per_huge_page(h), vma->vm_mm);
SetHPageMigratable(new_page);
}

@@ -5854,7 +5848,8 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
page_remove_rmap(old_page, vma, true);
hugepage_add_new_anon_rmap(new_page, vma, haddr);
set_huge_pte_at(mm, haddr, ptep,
- make_huge_pte(vma, new_page, !unshare));
+ make_huge_pte(vma, new_page, !unshare,
+ huge_page_shift(h)));
SetHPageMigratable(new_page);
/* Make the old page be freed below */
new_page = old_page;
@@ -6163,7 +6158,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
page_dup_file_rmap(page, true);

subpage = hugetlb_find_subpage(h, page, haddr_hgm);
- new_pte = make_huge_pte_with_shift(vma, subpage,
+ new_pte = make_huge_pte(vma, subpage,
((vma->vm_flags & VM_WRITE)
&& (vma->vm_flags & VM_SHARED)),
hpte->shift);
@@ -6585,8 +6580,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,

subpage = hugetlb_find_subpage(h, page, dst_addr);

- _dst_pte = make_huge_pte_with_shift(dst_vma, subpage, writable,
- dst_hpte->shift);
+ _dst_pte = make_huge_pte(dst_vma, subpage, writable, dst_hpte->shift);
/*
* Always mark UFFDIO_COPY page dirty; note that this may not be
* extremely important for hugetlbfs for now since swapping is not
@@ -7999,8 +7993,7 @@ int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
page_dup_file_rmap(hpage, true);

subpage = hugetlb_find_subpage(h, hpage, curr);
- entry = make_huge_pte_with_shift(vma, subpage,
- writable, hpte.shift);
+ entry = make_huge_pte(vma, subpage, writable, hpte.shift);
set_huge_pte_at(mm, curr, hpte.ptep, entry);
next_hpte:
curr += hugetlb_pte_size(&hpte);
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:39:57

by James Houghton

Subject: [PATCH 27/46] hugetlb: add HGM support for move_hugetlb_page_tables

This is very similar to the support that was added to
copy_hugetlb_page_range. We simply do a high-granularity walk now, and
most of the rest of the code stays the same.

Signed-off-by: James Houghton <[email protected]>
---
mm/hugetlb.c | 47 +++++++++++++++++++++++++++--------------------
1 file changed, 27 insertions(+), 20 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 21a5116f509b..582d14a206b5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5313,16 +5313,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
return ret;
}

-static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
- unsigned long new_addr, pte_t *src_pte, pte_t *dst_pte)
+static void move_hugetlb_pte(struct vm_area_struct *vma, unsigned long old_addr,
+ unsigned long new_addr, struct hugetlb_pte *src_hpte,
+ struct hugetlb_pte *dst_hpte)
{
- struct hstate *h = hstate_vma(vma);
struct mm_struct *mm = vma->vm_mm;
spinlock_t *src_ptl, *dst_ptl;
pte_t pte;

- dst_ptl = huge_pte_lock(h, mm, dst_pte);
- src_ptl = huge_pte_lockptr(huge_page_shift(h), mm, src_pte);
+ dst_ptl = hugetlb_pte_lock(dst_hpte);
+ src_ptl = hugetlb_pte_lockptr(src_hpte);

/*
* We don't have to worry about the ordering of src and dst ptlocks
@@ -5331,8 +5331,8 @@ static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
if (src_ptl != dst_ptl)
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);

- pte = huge_ptep_get_and_clear(mm, old_addr, src_pte);
- set_huge_pte_at(mm, new_addr, dst_pte, pte);
+ pte = huge_ptep_get_and_clear(mm, old_addr, src_hpte->ptep);
+ set_huge_pte_at(mm, new_addr, dst_hpte->ptep, pte);

if (src_ptl != dst_ptl)
spin_unlock(src_ptl);
@@ -5350,9 +5350,9 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
unsigned long old_end = old_addr + len;
unsigned long last_addr_mask;
- pte_t *src_pte, *dst_pte;
struct mmu_notifier_range range;
bool shared_pmd = false;
+ struct hugetlb_pte src_hpte, dst_hpte;

mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, old_addr,
old_end);
@@ -5368,28 +5368,35 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
/* Prevent race with file truncation */
hugetlb_vma_lock_write(vma);
i_mmap_lock_write(mapping);
- for (; old_addr < old_end; old_addr += sz, new_addr += sz) {
- src_pte = hugetlb_walk(vma, old_addr, sz);
- if (!src_pte) {
- old_addr |= last_addr_mask;
- new_addr |= last_addr_mask;
+ while (old_addr < old_end) {
+ if (hugetlb_full_walk(&src_hpte, vma, old_addr)) {
+ /* The hstate-level PTE wasn't allocated. */
+ old_addr = (old_addr | last_addr_mask) + sz;
+ new_addr = (new_addr | last_addr_mask) + sz;
continue;
}
- if (huge_pte_none(huge_ptep_get(src_pte)))
+
+ if (huge_pte_none(huge_ptep_get(src_hpte.ptep))) {
+ old_addr += hugetlb_pte_size(&src_hpte);
+ new_addr += hugetlb_pte_size(&src_hpte);
continue;
+ }

- if (huge_pmd_unshare(mm, vma, old_addr, src_pte)) {
+ if (hugetlb_pte_size(&src_hpte) == sz &&
+ huge_pmd_unshare(mm, vma, old_addr, src_hpte.ptep)) {
shared_pmd = true;
- old_addr |= last_addr_mask;
- new_addr |= last_addr_mask;
+ old_addr = (old_addr | last_addr_mask) + sz;
+ new_addr = (new_addr | last_addr_mask) + sz;
continue;
}

- dst_pte = huge_pte_alloc(mm, new_vma, new_addr, sz);
- if (!dst_pte)
+ if (hugetlb_full_walk_alloc(&dst_hpte, new_vma, new_addr,
+ hugetlb_pte_size(&src_hpte)))
break;

- move_huge_pte(vma, old_addr, new_addr, src_pte, dst_pte);
+ move_hugetlb_pte(vma, old_addr, new_addr, &src_hpte, &dst_hpte);
+ old_addr += hugetlb_pte_size(&src_hpte);
+ new_addr += hugetlb_pte_size(&src_hpte);
}

if (shared_pmd)
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:39:57

by James Houghton

[permalink] [raw]
Subject: [PATCH 45/46] selftests/vm: add hugetlb HGM test to migration selftest

This is mostly the same as the shared HugeTLB case, but instead of
mapping the page with a regular page fault, we map it with lots of
UFFDIO_CONTINUE operations. We also verify that the contents haven't
changed after the migration; they would have changed if the
post-migration PTEs pointed to the wrong page.

Signed-off-by: James Houghton <[email protected]>
---
tools/testing/selftests/vm/migration.c | 146 +++++++++++++++++++++++++
1 file changed, 146 insertions(+)

diff --git a/tools/testing/selftests/vm/migration.c b/tools/testing/selftests/vm/migration.c
index 21577a84d7e4..1fb3607accab 100644
--- a/tools/testing/selftests/vm/migration.c
+++ b/tools/testing/selftests/vm/migration.c
@@ -14,12 +14,21 @@
#include <signal.h>
#include <time.h>
#include <sys/statfs.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <linux/userfaultfd.h>
+#include <sys/syscall.h>
+#include <fcntl.h>

#define TWOMEG (2<<20)
#define RUNTIME (60)

#define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1)))

+#ifndef MADV_SPLIT
+#define MADV_SPLIT 26
+#endif
+
FIXTURE(migration)
{
pthread_t *threads;
@@ -265,4 +274,141 @@ TEST_F_TIMEOUT(migration, shared_hugetlb, 2*RUNTIME)
close(fd);
}

+#ifdef __NR_userfaultfd
+static int map_at_high_granularity(char *mem, size_t length)
+{
+ int i;
+ int ret;
+ int uffd = syscall(__NR_userfaultfd, 0);
+ struct uffdio_api api;
+ struct uffdio_register reg;
+ int pagesize = getpagesize();
+
+ if (uffd < 0) {
+ perror("couldn't create uffd");
+ return uffd;
+ }
+
+ api.api = UFFD_API;
+ api.features = 0;
+
+ ret = ioctl(uffd, UFFDIO_API, &api);
+ if (ret || api.api != UFFD_API) {
+ perror("UFFDIO_API failed");
+ goto out;
+ }
+
+ if (madvise(mem, length, MADV_SPLIT) == -1) {
+ perror("MADV_SPLIT failed");
+ goto out;
+ }
+
+ reg.range.start = (unsigned long)mem;
+ reg.range.len = length;
+
+ reg.mode = UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_MINOR;
+
+ ret = ioctl(uffd, UFFDIO_REGISTER, &reg);
+ if (ret) {
+ perror("UFFDIO_REGISTER failed");
+ goto out;
+ }
+
+ /* UFFDIO_CONTINUE each 4K segment of the 2M page. */
+ for (i = 0; i < length/pagesize; ++i) {
+ struct uffdio_continue cont;
+
+ cont.range.start = (unsigned long long)mem + i * pagesize;
+ cont.range.len = pagesize;
+ cont.mode = 0;
+ ret = ioctl(uffd, UFFDIO_CONTINUE, &cont);
+ if (ret) {
+ fprintf(stderr, "UFFDIO_CONTINUE failed "
+ "for %llx -> %llx: %d\n",
+ cont.range.start,
+ cont.range.start + cont.range.len,
+ errno);
+ goto out;
+ }
+ }
+ ret = 0;
+out:
+ close(uffd);
+ return ret;
+}
+#else
+static int map_at_high_granularity(char *mem, size_t length)
+{
+ fprintf(stderr, "Userfaultfd missing\n");
+ return -1;
+}
+#endif /* __NR_userfaultfd */
+
+/*
+ * Tests the high-granularity hugetlb migration entry paths.
+ */
+TEST_F_TIMEOUT(migration, shared_hugetlb_hgm, 2*RUNTIME)
+{
+ uint64_t *ptr;
+ int i;
+ int fd;
+ unsigned long sz;
+ struct statfs filestat;
+
+ if (self->nthreads < 2 || self->n1 < 0 || self->n2 < 0)
+ SKIP(return, "Not enough threads or NUMA nodes available");
+
+ fd = memfd_create("tmp_hugetlb", MFD_HUGETLB);
+ if (fd < 0)
+ SKIP(return, "Couldn't create hugetlb memfd");
+
+ if (fstatfs(fd, &filestat) < 0)
+ SKIP(return, "Couldn't fstatfs hugetlb file");
+
+ sz = filestat.f_bsize;
+
+ if (ftruncate(fd, sz))
+ SKIP(return, "Couldn't allocate hugetlb pages");
+
+ if (fallocate(fd, 0, 0, sz) < 0) {
+ perror("fallocate failed");
+ SKIP(return, "fallocate failed");
+ }
+
+ ptr = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (ptr == MAP_FAILED)
+ SKIP(return, "Could not allocate hugetlb pages");
+
+ /*
+ * We have to map_at_high_granularity before we memset, otherwise
+ * memset will map everything at the hugepage size.
+ */
+ if (map_at_high_granularity((char *)ptr, sz) < 0)
+ SKIP(return, "Could not map HugeTLB range at high granularity");
+
+ /* Populate the page we're migrating. */
+ for (i = 0; i < sz/sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ for (i = 0; i < self->nthreads - 1; i++)
+ if (pthread_create(&self->threads[i], NULL, access_mem, ptr))
+ perror("Couldn't create thread");
+
+ ASSERT_EQ(migrate(ptr, self->n1, self->n2, 10), 0);
+ for (i = 0; i < self->nthreads - 1; i++) {
+ ASSERT_EQ(pthread_cancel(self->threads[i]), 0);
+ pthread_join(self->threads[i], NULL);
+ }
+
+ /* Check that the contents didnt' change. */
+ for (i = 0; i < sz/sizeof(*ptr); ++i) {
+ ASSERT_EQ(ptr[i], i);
+ if (ptr[i] != i)
+ break;
+ }
+
+ ftruncate(fd, 0);
+ close(fd);
+}
+
TEST_HARNESS_MAIN
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:39:58

by James Houghton

[permalink] [raw]
Subject: [PATCH 15/46] hugetlb: make default arch_make_huge_pte understand small mappings

This is a simple change: don't create a "huge" PTE if we are making a
regular, PAGE_SIZE PTE. All architectures that want to implement HGM
likely need to be changed in a similar way if they implement their own
version of arch_make_huge_pte.
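
For illustration, with this change the default helper behaves roughly as
follows (entry, h, and vma here are placeholders):

	/* hstate-sized mapping: still marked huge. */
	entry = arch_make_huge_pte(entry, huge_page_shift(h), vma->vm_flags);

	/* PAGE_SIZE-level HGM mapping: left as a regular PTE. */
	entry = arch_make_huge_pte(entry, PAGE_SHIFT, vma->vm_flags);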

Signed-off-by: James Houghton <[email protected]>
---
include/linux/hugetlb.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 2fcd8f313628..b7cf45535d64 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -912,7 +912,7 @@ static inline void arch_clear_hugepage_flags(struct page *page) { }
static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
vm_flags_t flags)
{
- return pte_mkhuge(entry);
+ return shift > PAGE_SHIFT ? pte_mkhuge(entry) : entry;
}
#endif

--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:40:17

by James Houghton

[permalink] [raw]
Subject: [PATCH 36/46] hugetlb: remove huge_pte_lock and huge_pte_lockptr

They are replaced with hugetlb_pte_lock{,ptr}. The callers that haven't
already been converted are never reached when HGM is in use, so we handle
them by populating hugetlb_ptes with the standard, hstate-sized huge PTEs.
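
For reference, the conversion pattern (slightly condensed from the
hugetlb_unshare_all_pmds() hunk below; h, mm, and ptep come from the
surrounding code) is:

	struct hugetlb_pte hpte;
	spinlock_t *ptl;

	/* Previously: ptl = huge_pte_lock(h, mm, ptep); */
	hugetlb_pte_populate(mm, &hpte, ptep, huge_page_shift(h),
			     hpage_size_to_level(huge_page_size(h)));
	ptl = hugetlb_pte_lock(&hpte);
	/* ... operate on the PTE ... */
	spin_unlock(ptl);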

Signed-off-by: James Houghton <[email protected]>
---
arch/powerpc/mm/pgtable.c | 7 +++++--
include/linux/hugetlb.h | 42 +++++++++++++++------------------------
mm/hugetlb.c | 22 +++++++++++++-------
3 files changed, 36 insertions(+), 35 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 035a0df47af0..e20d6aa9a2a6 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -258,11 +258,14 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,

#ifdef CONFIG_PPC_BOOK3S_64
struct hstate *h = hstate_vma(vma);
+ struct hugetlb_pte hpte;

psize = hstate_get_psize(h);
#ifdef CONFIG_DEBUG_VM
- assert_spin_locked(huge_pte_lockptr(huge_page_shift(h),
- vma->vm_mm, ptep));
+ /* HGM is not supported for powerpc yet. */
+ hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h),
+ hpage_size_to_level(psize));
+ assert_spin_locked(hpte.ptl);
#endif

#else
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e1baf939afb6..4d318bf2ced9 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1032,14 +1032,6 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
return modified_mask;
}

-static inline spinlock_t *huge_pte_lockptr(unsigned int shift,
- struct mm_struct *mm, pte_t *pte)
-{
- if (shift == PMD_SHIFT)
- return pmd_lockptr(mm, (pmd_t *) pte);
- return &mm->page_table_lock;
-}
-
#ifndef hugepages_supported
/*
* Some platform decide whether they support huge pages at boot
@@ -1248,12 +1240,6 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
return 0;
}

-static inline spinlock_t *huge_pte_lockptr(unsigned int shift,
- struct mm_struct *mm, pte_t *pte)
-{
- return &mm->page_table_lock;
-}
-
static inline void hugetlb_count_init(struct mm_struct *mm)
{
}
@@ -1328,16 +1314,6 @@ int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
}
#endif

-static inline spinlock_t *huge_pte_lock(struct hstate *h,
- struct mm_struct *mm, pte_t *pte)
-{
- spinlock_t *ptl;
-
- ptl = huge_pte_lockptr(huge_page_shift(h), mm, pte);
- spin_lock(ptl);
- return ptl;
-}
-
static inline
spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
{
@@ -1358,8 +1334,22 @@ void hugetlb_pte_populate(struct mm_struct *mm, struct hugetlb_pte *hpte,
pte_t *ptep, unsigned int shift,
enum hugetlb_level level)
{
- __hugetlb_pte_populate(hpte, ptep, shift, level,
- huge_pte_lockptr(shift, mm, ptep));
+ spinlock_t *ptl;
+
+ /*
+ * For contiguous HugeTLB PTEs that can contain other HugeTLB PTEs
+ * on the same level, the same PTL for both must be used.
+ *
+ * For some architectures that implement hugetlb_walk_step, this
+ * version of hugetlb_pte_populate() may not be correct to use for
+ * high-granularity PTEs. Instead, call __hugetlb_pte_populate()
+ * directly.
+ */
+ if (level == HUGETLB_LEVEL_PMD)
+ ptl = pmd_lockptr(mm, (pmd_t *) ptep);
+ else
+ ptl = &mm->page_table_lock;
+ __hugetlb_pte_populate(hpte, ptep, shift, level, ptl);
}

#if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 388c46c7e77a..d71adc03138d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5303,9 +5303,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
put_page(hpage);

/* Install the new huge page if src pte stable */
- dst_ptl = huge_pte_lock(h, dst, dst_pte);
- src_ptl = huge_pte_lockptr(huge_page_shift(h),
- src, src_pte);
+ dst_ptl = hugetlb_pte_lock(&dst_hpte);
+ src_ptl = hugetlb_pte_lockptr(&src_hpte);
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
entry = huge_ptep_get(src_pte);
if (!pte_same(src_pte_old, entry)) {
@@ -7383,7 +7382,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long saddr;
pte_t *spte = NULL;
pte_t *pte;
- spinlock_t *ptl;
+ struct hugetlb_pte hpte;
+ struct hstate *shstate;

i_mmap_lock_read(mapping);
vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
@@ -7404,7 +7404,11 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
if (!spte)
goto out;

- ptl = huge_pte_lock(hstate_vma(vma), mm, spte);
+ shstate = hstate_vma(svma);
+
+ hugetlb_pte_populate(mm, &hpte, spte, huge_page_shift(shstate),
+ hpage_size_to_level(huge_page_size(shstate)));
+ spin_lock(hpte.ptl);
if (pud_none(*pud)) {
pud_populate(mm, pud,
(pmd_t *)((unsigned long)spte & PAGE_MASK));
@@ -7412,7 +7416,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
} else {
put_page(virt_to_page(spte));
}
- spin_unlock(ptl);
+ spin_unlock(hpte.ptl);
out:
pte = (pte_t *)pmd_alloc(mm, pud, addr);
i_mmap_unlock_read(mapping);
@@ -8132,6 +8136,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
unsigned long address, start, end;
spinlock_t *ptl;
pte_t *ptep;
+ struct hugetlb_pte hpte;

if (!(vma->vm_flags & VM_MAYSHARE))
return;
@@ -8156,7 +8161,10 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
ptep = hugetlb_walk(vma, address, sz);
if (!ptep)
continue;
- ptl = huge_pte_lock(h, mm, ptep);
+
+ hugetlb_pte_populate(mm, &hpte, ptep, huge_page_shift(h),
+ hpage_size_to_level(sz));
+ ptl = hugetlb_pte_lock(&hpte);
huge_pmd_unshare(mm, vma, address, ptep);
spin_unlock(ptl);
}
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:40:30

by James Houghton

[permalink] [raw]
Subject: [PATCH 38/46] mm: smaps: add stats for HugeTLB mapping size

When the kernel is compiled with HUGETLB_HIGH_GRANULARITY_MAPPING,
smaps may provide HugetlbPudMapped, HugetlbPmdMapped, and
HugetlbPteMapped. Levels that are folded are not reported.
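
For example, a high-granularity-mapped HugeTLB VMA might show entries like
the following in /proc/<pid>/smaps (the values and field padding here are
only illustrative):

	HugetlbPudMapped:       0 kB
	HugetlbPmdMapped:    4096 kB
	HugetlbPteMapped:    2048 kB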

Signed-off-by: James Houghton <[email protected]>
---
fs/proc/task_mmu.c | 101 +++++++++++++++++++++++++++++++++------------
1 file changed, 75 insertions(+), 26 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c353cab11eee..af31c4d314d2 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -412,6 +412,15 @@ struct mem_size_stats {
unsigned long swap;
unsigned long shared_hugetlb;
unsigned long private_hugetlb;
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+#ifndef __PAGETABLE_PUD_FOLDED
+ unsigned long hugetlb_pud_mapped;
+#endif
+#ifndef __PAGETABLE_PMD_FOLDED
+ unsigned long hugetlb_pmd_mapped;
+#endif
+ unsigned long hugetlb_pte_mapped;
+#endif
u64 pss;
u64 pss_anon;
u64 pss_file;
@@ -731,6 +740,35 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
}

#ifdef CONFIG_HUGETLB_PAGE
+
+static void smaps_hugetlb_hgm_account(struct mem_size_stats *mss,
+ struct hugetlb_pte *hpte)
+{
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+ unsigned long size = hugetlb_pte_size(hpte);
+
+ switch (hpte->level) {
+#ifndef __PAGETABLE_PUD_FOLDED
+ case HUGETLB_LEVEL_PUD:
+ mss->hugetlb_pud_mapped += size;
+ break;
+#endif
+#ifndef __PAGETABLE_PMD_FOLDED
+ case HUGETLB_LEVEL_PMD:
+ mss->hugetlb_pmd_mapped += size;
+ break;
+#endif
+ case HUGETLB_LEVEL_PTE:
+ mss->hugetlb_pte_mapped += size;
+ break;
+ default:
+ break;
+ }
+#else
+ return;
+#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
+}
+
static int smaps_hugetlb_range(struct hugetlb_pte *hpte,
unsigned long addr,
struct mm_walk *walk)
@@ -764,6 +802,8 @@ static int smaps_hugetlb_range(struct hugetlb_pte *hpte,
mss->shared_hugetlb += hugetlb_pte_size(hpte);
else
mss->private_hugetlb += hugetlb_pte_size(hpte);
+
+ smaps_hugetlb_hgm_account(mss, hpte);
}
return 0;
}
@@ -833,38 +873,47 @@ static void smap_gather_stats(struct vm_area_struct *vma,
static void __show_smap(struct seq_file *m, const struct mem_size_stats *mss,
bool rollup_mode)
{
- SEQ_PUT_DEC("Rss: ", mss->resident);
- SEQ_PUT_DEC(" kB\nPss: ", mss->pss >> PSS_SHIFT);
- SEQ_PUT_DEC(" kB\nPss_Dirty: ", mss->pss_dirty >> PSS_SHIFT);
+ SEQ_PUT_DEC("Rss: ", mss->resident);
+ SEQ_PUT_DEC(" kB\nPss: ", mss->pss >> PSS_SHIFT);
+ SEQ_PUT_DEC(" kB\nPss_Dirty: ", mss->pss_dirty >> PSS_SHIFT);
if (rollup_mode) {
/*
* These are meaningful only for smaps_rollup, otherwise two of
* them are zero, and the other one is the same as Pss.
*/
- SEQ_PUT_DEC(" kB\nPss_Anon: ",
+ SEQ_PUT_DEC(" kB\nPss_Anon: ",
mss->pss_anon >> PSS_SHIFT);
- SEQ_PUT_DEC(" kB\nPss_File: ",
+ SEQ_PUT_DEC(" kB\nPss_File: ",
mss->pss_file >> PSS_SHIFT);
- SEQ_PUT_DEC(" kB\nPss_Shmem: ",
+ SEQ_PUT_DEC(" kB\nPss_Shmem: ",
mss->pss_shmem >> PSS_SHIFT);
}
- SEQ_PUT_DEC(" kB\nShared_Clean: ", mss->shared_clean);
- SEQ_PUT_DEC(" kB\nShared_Dirty: ", mss->shared_dirty);
- SEQ_PUT_DEC(" kB\nPrivate_Clean: ", mss->private_clean);
- SEQ_PUT_DEC(" kB\nPrivate_Dirty: ", mss->private_dirty);
- SEQ_PUT_DEC(" kB\nReferenced: ", mss->referenced);
- SEQ_PUT_DEC(" kB\nAnonymous: ", mss->anonymous);
- SEQ_PUT_DEC(" kB\nLazyFree: ", mss->lazyfree);
- SEQ_PUT_DEC(" kB\nAnonHugePages: ", mss->anonymous_thp);
- SEQ_PUT_DEC(" kB\nShmemPmdMapped: ", mss->shmem_thp);
- SEQ_PUT_DEC(" kB\nFilePmdMapped: ", mss->file_thp);
- SEQ_PUT_DEC(" kB\nShared_Hugetlb: ", mss->shared_hugetlb);
- seq_put_decimal_ull_width(m, " kB\nPrivate_Hugetlb: ",
+ SEQ_PUT_DEC(" kB\nShared_Clean: ", mss->shared_clean);
+ SEQ_PUT_DEC(" kB\nShared_Dirty: ", mss->shared_dirty);
+ SEQ_PUT_DEC(" kB\nPrivate_Clean: ", mss->private_clean);
+ SEQ_PUT_DEC(" kB\nPrivate_Dirty: ", mss->private_dirty);
+ SEQ_PUT_DEC(" kB\nReferenced: ", mss->referenced);
+ SEQ_PUT_DEC(" kB\nAnonymous: ", mss->anonymous);
+ SEQ_PUT_DEC(" kB\nLazyFree: ", mss->lazyfree);
+ SEQ_PUT_DEC(" kB\nAnonHugePages: ", mss->anonymous_thp);
+ SEQ_PUT_DEC(" kB\nShmemPmdMapped: ", mss->shmem_thp);
+ SEQ_PUT_DEC(" kB\nFilePmdMapped: ", mss->file_thp);
+ SEQ_PUT_DEC(" kB\nShared_Hugetlb: ", mss->shared_hugetlb);
+ seq_put_decimal_ull_width(m, " kB\nPrivate_Hugetlb: ",
mss->private_hugetlb >> 10, 7);
- SEQ_PUT_DEC(" kB\nSwap: ", mss->swap);
- SEQ_PUT_DEC(" kB\nSwapPss: ",
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+#ifndef __PAGETABLE_PUD_FOLDED
+ SEQ_PUT_DEC(" kB\nHugetlbPudMapped: ", mss->hugetlb_pud_mapped);
+#endif
+#ifndef __PAGETABLE_PMD_FOLDED
+ SEQ_PUT_DEC(" kB\nHugetlbPmdMapped: ", mss->hugetlb_pmd_mapped);
+#endif
+ SEQ_PUT_DEC(" kB\nHugetlbPteMapped: ", mss->hugetlb_pte_mapped);
+#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
+ SEQ_PUT_DEC(" kB\nSwap: ", mss->swap);
+ SEQ_PUT_DEC(" kB\nSwapPss: ",
mss->swap_pss >> PSS_SHIFT);
- SEQ_PUT_DEC(" kB\nLocked: ",
+ SEQ_PUT_DEC(" kB\nLocked: ",
mss->pss_locked >> PSS_SHIFT);
seq_puts(m, " kB\n");
}
@@ -880,18 +929,18 @@ static int show_smap(struct seq_file *m, void *v)

show_map_vma(m, vma);

- SEQ_PUT_DEC("Size: ", vma->vm_end - vma->vm_start);
- SEQ_PUT_DEC(" kB\nKernelPageSize: ", vma_kernel_pagesize(vma));
- SEQ_PUT_DEC(" kB\nMMUPageSize: ", vma_mmu_pagesize(vma));
+ SEQ_PUT_DEC("Size: ", vma->vm_end - vma->vm_start);
+ SEQ_PUT_DEC(" kB\nKernelPageSize: ", vma_kernel_pagesize(vma));
+ SEQ_PUT_DEC(" kB\nMMUPageSize: ", vma_mmu_pagesize(vma));
seq_puts(m, " kB\n");

__show_smap(m, &mss, false);

- seq_printf(m, "THPeligible: %d\n",
+ seq_printf(m, "THPeligible: %d\n",
hugepage_vma_check(vma, vma->vm_flags, true, false, true));

if (arch_pkeys_enabled())
- seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma));
+ seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma));
show_smap_vma_flags(m, vma);

return 0;
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:40:53

by James Houghton

[permalink] [raw]
Subject: [PATCH 22/46] mm: rmap: provide pte_order in page_vma_mapped_walk

page_vma_mapped_walk callers will need this information to know how
HugeTLB pages are mapped. pte_order only applies if pte is not NULL.
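
A caller with a non-NULL pte can derive the mapping size from it, e.g.
(a sketch):

	/* pte_order == 0 means a normal PAGE_SIZE PTE. */
	unsigned int map_shift = pvmw.pte_order + PAGE_SHIFT;
	unsigned long map_size = PAGE_SIZE << pvmw.pte_order;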

Signed-off-by: James Houghton <[email protected]>
---
include/linux/rmap.h | 1 +
mm/page_vma_mapped.c | 3 +++
2 files changed, 4 insertions(+)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bd3504d11b15..e0557ede2951 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -378,6 +378,7 @@ struct page_vma_mapped_walk {
pmd_t *pmd;
pte_t *pte;
spinlock_t *ptl;
+ unsigned int pte_order;
unsigned int flags;
};

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 4e448cfbc6ef..08295b122ad6 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -16,6 +16,7 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
static bool map_pte(struct page_vma_mapped_walk *pvmw)
{
pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
+ pvmw->pte_order = 0;
if (!(pvmw->flags & PVMW_SYNC)) {
if (pvmw->flags & PVMW_MIGRATION) {
if (!is_swap_pte(*pvmw->pte))
@@ -177,6 +178,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (!pvmw->pte)
return false;

+ pvmw->pte_order = huge_page_order(hstate);
pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
if (!check_pte(pvmw))
return not_found(pvmw);
@@ -272,6 +274,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
}
pte_unmap(pvmw->pte);
pvmw->pte = NULL;
+ pvmw->pte_order = 0;
goto restart;
}
pvmw->pte++;
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:40:58

by James Houghton

[permalink] [raw]
Subject: [PATCH 14/46] hugetlb: add make_huge_pte_with_shift

This allows us to make huge PTEs at shifts other than the hstate shift,
which will be necessary for high-granularity mappings.
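
A rough sketch of the intended use (subpage here is a hypothetical pointer
to a PAGE_SIZE piece of the hugepage):

	/* Unchanged behavior: an hstate-sized PTE. */
	entry = make_huge_pte_with_shift(vma, page, writable,
					 huge_page_shift(hstate_vma(vma)));

	/* New possibility: a PAGE_SIZE-level PTE for one subpage. */
	entry = make_huge_pte_with_shift(vma, subpage, writable, PAGE_SHIFT);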

Acked-by: Mike Kravetz <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---
mm/hugetlb.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index aa8e59cbca69..3a75833d7aba 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5028,11 +5028,11 @@ const struct vm_operations_struct hugetlb_vm_ops = {
.pagesize = hugetlb_vm_op_pagesize,
};

-static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
- int writable)
+static pte_t make_huge_pte_with_shift(struct vm_area_struct *vma,
+ struct page *page, int writable,
+ int shift)
{
pte_t entry;
- unsigned int shift = huge_page_shift(hstate_vma(vma));

if (writable) {
entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_pte(page,
@@ -5046,6 +5046,14 @@ static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
return entry;
}

+static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
+ int writable)
+{
+ unsigned int shift = huge_page_shift(hstate_vma(vma));
+
+ return make_huge_pte_with_shift(vma, page, writable, shift);
+}
+
static void set_huge_ptep_writable(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)
{
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:41:05

by James Houghton

[permalink] [raw]
Subject: [PATCH 32/46] hugetlb: add for_each_hgm_shift

This is a helper macro to loop through all the usable page sizes for a
high-granularity-enabled HugeTLB VMA. Given the VMA's hstate, it will
loop, in descending order, through the page sizes that HugeTLB supports
for this architecture. It always includes PAGE_SIZE.

This is done by looping through the hstates; however, there is no
hstate for PAGE_SIZE. To handle this case, the loop intentionally goes
out of bounds, and the out-of-bounds pointer is mapped to PAGE_SIZE.
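
A rough usage sketch, e.g. to pick the largest mapping size that fits a
range (addr and remaining are hypothetical):

	struct hstate *h = hstate_vma(vma), *tmp_h;
	unsigned int shift;

	for_each_hgm_shift(h, tmp_h, shift) {
		unsigned long sz = 1UL << shift;

		if (IS_ALIGNED(addr, sz) && remaining >= sz)
			break;	/* largest usable mapping size */
	}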

Signed-off-by: James Houghton <[email protected]>
---
mm/hugetlb.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1e9e149587b3..1eef6968b1fa 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7780,6 +7780,24 @@ bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
{
return vma && (vma->vm_flags & VM_HUGETLB_HGM);
}
+/* Should only be used by the for_each_hgm_shift macro. */
+static unsigned int __shift_for_hstate(struct hstate *h)
+{
+ /* If h is out of bounds, we have reached the end, so give PAGE_SIZE */
+ if (h >= &hstates[hugetlb_max_hstate])
+ return PAGE_SHIFT;
+ return huge_page_shift(h);
+}
+
+/*
+ * Intentionally go out of bounds. An out-of-bounds hstate will be converted to
+ * PAGE_SIZE.
+ */
+#define for_each_hgm_shift(hstate, tmp_h, shift) \
+ for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
+ (tmp_h) <= &hstates[hugetlb_max_hstate]; \
+ (tmp_h)++)
+
#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */

/*
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:41:05

by James Houghton

[permalink] [raw]
Subject: [PATCH 42/46] selftests/vm: add HugeTLB HGM to userfaultfd selftest

This test case behaves similarly to the regular shared HugeTLB
configuration, except that it uses 4K instead of hugepages, and that we
ignore the UFFDIO_COPY tests, as UFFDIO_CONTINUE is the only ioctl that
supports PAGE_SIZE-aligned regions.

This doesn't test MADV_COLLAPSE. Other tests are added later to exercise
MADV_COLLAPSE.
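
The new case is run like the existing ones, e.g. (the MiB and bounce
counts are only illustrative):

	./userfaultfd hugetlb_shared_hgm 256 32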

Signed-off-by: James Houghton <[email protected]>
---
tools/testing/selftests/vm/userfaultfd.c | 84 +++++++++++++++++++-----
1 file changed, 69 insertions(+), 15 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 7f22844ed704..681c5c5f863b 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -73,9 +73,10 @@ static unsigned long nr_cpus, nr_pages, nr_pages_per_cpu, page_size, hpage_size;
#define BOUNCE_POLL (1<<3)
static int bounces;

-#define TEST_ANON 1
-#define TEST_HUGETLB 2
-#define TEST_SHMEM 3
+#define TEST_ANON 1
+#define TEST_HUGETLB 2
+#define TEST_HUGETLB_HGM 3
+#define TEST_SHMEM 4
static int test_type;

#define UFFD_FLAGS (O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY)
@@ -93,6 +94,8 @@ static volatile bool test_uffdio_zeropage_eexist = true;
static bool test_uffdio_wp = true;
/* Whether to test uffd minor faults */
static bool test_uffdio_minor = false;
+static bool test_uffdio_copy = true;
+
static bool map_shared;
static int mem_fd;
static unsigned long long *count_verify;
@@ -151,7 +154,7 @@ static void usage(void)
fprintf(stderr, "\nUsage: ./userfaultfd <test type> <MiB> <bounces> "
"[hugetlbfs_file]\n\n");
fprintf(stderr, "Supported <test type>: anon, hugetlb, "
- "hugetlb_shared, shmem\n\n");
+ "hugetlb_shared, hugetlb_shared_hgm, shmem\n\n");
fprintf(stderr, "'Test mods' can be joined to the test type string with a ':'. "
"Supported mods:\n");
fprintf(stderr, "\tsyscall - Use userfaultfd(2) (default)\n");
@@ -167,6 +170,11 @@ static void usage(void)
exit(1);
}

+static bool test_is_hugetlb(void)
+{
+ return test_type == TEST_HUGETLB || test_type == TEST_HUGETLB_HGM;
+}
+
#define _err(fmt, ...) \
do { \
int ret = errno; \
@@ -381,7 +389,7 @@ static struct uffd_test_ops *uffd_test_ops;

static inline uint64_t uffd_minor_feature(void)
{
- if (test_type == TEST_HUGETLB && map_shared)
+ if (test_is_hugetlb() && map_shared)
return UFFD_FEATURE_MINOR_HUGETLBFS;
else if (test_type == TEST_SHMEM)
return UFFD_FEATURE_MINOR_SHMEM;
@@ -393,7 +401,7 @@ static uint64_t get_expected_ioctls(uint64_t mode)
{
uint64_t ioctls = UFFD_API_RANGE_IOCTLS;

- if (test_type == TEST_HUGETLB)
+ if (test_is_hugetlb())
ioctls &= ~(1 << _UFFDIO_ZEROPAGE);

if (!((mode & UFFDIO_REGISTER_MODE_WP) && test_uffdio_wp))
@@ -500,13 +508,16 @@ static void uffd_test_ctx_clear(void)
static void uffd_test_ctx_init(uint64_t features)
{
unsigned long nr, cpu;
+ uint64_t enabled_features = features;

uffd_test_ctx_clear();

uffd_test_ops->allocate_area((void **)&area_src, true);
uffd_test_ops->allocate_area((void **)&area_dst, false);

- userfaultfd_open(&features);
+ userfaultfd_open(&enabled_features);
+ if ((enabled_features & features) != features)
+ err("couldn't enable all features");

count_verify = malloc(nr_pages * sizeof(unsigned long long));
if (!count_verify)
@@ -726,13 +737,16 @@ static void uffd_handle_page_fault(struct uffd_msg *msg,
struct uffd_stats *stats)
{
unsigned long offset;
+ unsigned long address;

if (msg->event != UFFD_EVENT_PAGEFAULT)
err("unexpected msg event %u", msg->event);

+ address = msg->arg.pagefault.address;
+
if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) {
/* Write protect page faults */
- wp_range(uffd, msg->arg.pagefault.address, page_size, false);
+ wp_range(uffd, address, page_size, false);
stats->wp_faults++;
} else if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR) {
uint8_t *area;
@@ -751,11 +765,10 @@ static void uffd_handle_page_fault(struct uffd_msg *msg,
*/

area = (uint8_t *)(area_dst +
- ((char *)msg->arg.pagefault.address -
- area_dst_alias));
+ ((char *)address - area_dst_alias));
for (b = 0; b < page_size; ++b)
area[b] = ~area[b];
- continue_range(uffd, msg->arg.pagefault.address, page_size);
+ continue_range(uffd, address, page_size);
stats->minor_faults++;
} else {
/*
@@ -782,7 +795,7 @@ static void uffd_handle_page_fault(struct uffd_msg *msg,
if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE)
err("unexpected write fault");

- offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst;
+ offset = (char *)address - area_dst;
offset &= ~(page_size-1);

if (copy_page(uffd, offset))
@@ -1192,6 +1205,12 @@ static int userfaultfd_events_test(void)
char c;
struct uffd_stats stats = { 0 };

+ if (!test_uffdio_copy) {
+ printf("Skipping userfaultfd events test "
+ "(test_uffdio_copy=false)\n");
+ return 0;
+ }
+
printf("testing events (fork, remap, remove): ");
fflush(stdout);

@@ -1245,6 +1264,12 @@ static int userfaultfd_sig_test(void)
char c;
struct uffd_stats stats = { 0 };

+ if (!test_uffdio_copy) {
+ printf("Skipping userfaultfd signal test "
+ "(test_uffdio_copy=false)\n");
+ return 0;
+ }
+
printf("testing signal delivery: ");
fflush(stdout);

@@ -1329,6 +1354,11 @@ static int userfaultfd_minor_test(void)

uffd_test_ctx_init(uffd_minor_feature());

+ if (test_type == TEST_HUGETLB_HGM)
+ /* Enable high-granularity userfaultfd ioctls for HugeTLB */
+ if (madvise(area_dst_alias, nr_pages * page_size, MADV_SPLIT))
+ err("MADV_SPLIT failed");
+
uffdio_register.range.start = (unsigned long)area_dst_alias;
uffdio_register.range.len = nr_pages * page_size;
uffdio_register.mode = UFFDIO_REGISTER_MODE_MINOR;
@@ -1538,6 +1568,12 @@ static int userfaultfd_stress(void)
pthread_attr_init(&attr);
pthread_attr_setstacksize(&attr, 16*1024*1024);

+ if (!test_uffdio_copy) {
+ printf("Skipping userfaultfd stress test "
+ "(test_uffdio_copy=false)\n");
+ bounces = 0;
+ }
+
while (bounces--) {
printf("bounces: %d, mode:", bounces);
if (bounces & BOUNCE_RANDOM)
@@ -1696,6 +1732,16 @@ static void set_test_type(const char *type)
uffd_test_ops = &hugetlb_uffd_test_ops;
/* Minor faults require shared hugetlb; only enable here. */
test_uffdio_minor = true;
+ } else if (!strcmp(type, "hugetlb_shared_hgm")) {
+ map_shared = true;
+ test_type = TEST_HUGETLB_HGM;
+ uffd_test_ops = &hugetlb_uffd_test_ops;
+ /*
+ * HugeTLB HGM only changes UFFDIO_CONTINUE, so don't test
+ * UFFDIO_COPY.
+ */
+ test_uffdio_minor = true;
+ test_uffdio_copy = false;
} else if (!strcmp(type, "shmem")) {
map_shared = true;
test_type = TEST_SHMEM;
@@ -1731,6 +1777,7 @@ static void parse_test_type_arg(const char *raw_type)
err("Unsupported test: %s", raw_type);

if (test_type == TEST_HUGETLB)
+ /* TEST_HUGETLB_HGM gets small pages. */
page_size = hpage_size;
else
page_size = sysconf(_SC_PAGE_SIZE);
@@ -1813,22 +1860,29 @@ int main(int argc, char **argv)
nr_cpus = x < y ? x : y;
}
nr_pages_per_cpu = bytes / page_size / nr_cpus;
+ if (test_type == TEST_HUGETLB_HGM)
+ /*
+ * `page_size` refers to the page_size we can use in
+ * UFFDIO_CONTINUE. We still need nr_pages to be appropriately
+ * aligned, so align it here.
+ */
+ nr_pages_per_cpu -= nr_pages_per_cpu % (hpage_size / page_size);
if (!nr_pages_per_cpu) {
_err("invalid MiB");
usage();
}
+ nr_pages = nr_pages_per_cpu * nr_cpus;

bounces = atoi(argv[3]);
if (bounces <= 0) {
_err("invalid bounces");
usage();
}
- nr_pages = nr_pages_per_cpu * nr_cpus;

- if (test_type == TEST_SHMEM || test_type == TEST_HUGETLB) {
+ if (test_type == TEST_SHMEM || test_is_hugetlb()) {
unsigned int memfd_flags = 0;

- if (test_type == TEST_HUGETLB)
+ if (test_is_hugetlb())
memfd_flags = MFD_HUGETLB;
mem_fd = memfd_create(argv[0], memfd_flags);
if (mem_fd < 0)
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:41:26

by James Houghton

[permalink] [raw]
Subject: [PATCH 34/46] hugetlb: userfaultfd: when using MADV_SPLIT, round addresses to PAGE_SIZE

MADV_SPLIT enables HugeTLB HGM which allows for UFFDIO_CONTINUE in
PAGE_SIZE chunks. If a huge-page-aligned address were to be provided,
userspace would be completely unable to take advantage of HGM. That
would then require userspace to know to provide
UFFD_FEATURE_EXACT_ADDRESS.

This patch makes that mistake harder to make: instead of requiring
userspace to provide UFFD_FEATURE_EXACT_ADDRESS, always provide a usable
(PAGE_SIZE-rounded) address.
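
With this change, a minor-fault handler can pass the reported address
straight to UFFDIO_CONTINUE; a minimal userspace sketch (uffd setup and
error handling omitted):

	struct uffd_msg msg;
	struct uffdio_continue cont;

	read(uffd, &msg, sizeof(msg));	/* minor fault event */

	/* Already PAGE_SIZE-rounded once HGM has been advised. */
	cont.range.start = msg.arg.pagefault.address;
	cont.range.len = getpagesize();
	cont.mode = 0;
	ioctl(uffd, UFFDIO_CONTINUE, &cont);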

Signed-off-by: James Houghton <[email protected]>
---
mm/hugetlb.c | 31 +++++++++++++++----------------
1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5af6db52f34e..5b6215e03fe1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5936,28 +5936,27 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
unsigned long addr,
unsigned long reason)
{
+ u32 hash;
+ struct vm_fault vmf;
+
/*
* Don't use the hpage-aligned address if the user has explicitly
* enabled HGM.
*/
if (hugetlb_hgm_advised(vma) && reason == VM_UFFD_MINOR)
- haddr = address & PAGE_MASK;
-
- u32 hash;
- struct vm_fault vmf = {
- .vma = vma,
- .address = haddr,
- .real_address = addr,
- .flags = flags,
+ haddr = addr & PAGE_MASK;

- /*
- * Hard to debug if it ends up being
- * used by a callee that assumes
- * something about the other
- * uninitialized fields... same as in
- * memory.c
- */
- };
+ vmf.vma = vma;
+ vmf.address = haddr;
+ vmf.real_address = addr;
+ vmf.flags = flags;
+ /*
+ * Hard to debug if it ends up being
+ * used by a callee that assumes
+ * something about the other
+ * uninitialized fields... same as in
+ * memory.c
+ */

/*
* vma_lock and hugetlb_fault_mutex must be dropped before handling
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:42:53

by James Houghton

[permalink] [raw]
Subject: [PATCH 30/46] hugetlb: add high-granularity migration support

To prevent queueing a hugepage for migration multiple times, we use
last_page to keep track of the last page we saw in queue_pages_hugetlb,
and if the page we're looking at is last_page, then we skip it.

For the non-hugetlb cases, last_page, although unused, is still updated
so that it has a consistent meaning with the hugetlb case.

This commit adds a check in hugetlb_fault for high-granularity migration
PTEs.

Signed-off-by: James Houghton <[email protected]>
---
include/linux/swapops.h | 8 ++++++--
mm/hugetlb.c | 2 +-
mm/mempolicy.c | 24 +++++++++++++++++++-----
mm/migrate.c | 18 ++++++++++--------
4 files changed, 36 insertions(+), 16 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 3a451b7afcb3..6ef80763e629 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -68,6 +68,8 @@

static inline bool is_pfn_swap_entry(swp_entry_t entry);

+struct hugetlb_pte;
+
/* Clear all flags but only keep swp_entry_t related information */
static inline pte_t pte_swp_clear_flags(pte_t pte)
{
@@ -339,7 +341,8 @@ extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
#ifdef CONFIG_HUGETLB_PAGE
extern void __migration_entry_wait_huge(struct vm_area_struct *vma,
pte_t *ptep, spinlock_t *ptl);
-extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte);
+extern void migration_entry_wait_huge(struct vm_area_struct *vma,
+ struct hugetlb_pte *hpte);
#endif /* CONFIG_HUGETLB_PAGE */
#else /* CONFIG_MIGRATION */
static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
@@ -369,7 +372,8 @@ static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
#ifdef CONFIG_HUGETLB_PAGE
static inline void __migration_entry_wait_huge(struct vm_area_struct *vma,
pte_t *ptep, spinlock_t *ptl) { }
-static inline void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { }
+static inline void migration_entry_wait_huge(struct vm_area_struct *vma,
+ struct hugetlb_pte *hpte) { }
#endif /* CONFIG_HUGETLB_PAGE */
static inline int is_writable_migration_entry(swp_entry_t entry)
{
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8e690a22456a..2fb95ecafc63 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6269,7 +6269,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* be released there.
*/
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
- migration_entry_wait_huge(vma, hpte.ptep);
+ migration_entry_wait_huge(vma, &hpte);
return 0;
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
ret = VM_FAULT_HWPOISON_LARGE |
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e5859ed34e90..6c4c3c923fa2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -424,6 +424,7 @@ struct queue_pages {
unsigned long start;
unsigned long end;
struct vm_area_struct *first;
+ struct page *last_page;
};

/*
@@ -475,6 +476,7 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
flags = qp->flags;
/* go to thp migration */
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
+ qp->last_page = page;
if (!vma_migratable(walk->vma) ||
migrate_page_add(page, qp->pagelist, flags)) {
ret = 1;
@@ -532,6 +534,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
continue;
if (!queue_pages_required(page, qp))
continue;
+
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
/* MPOL_MF_STRICT must be specified if we get here */
if (!vma_migratable(vma)) {
@@ -539,6 +542,8 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
break;
}

+ qp->last_page = page;
+
/*
* Do not abort immediately since there may be
* temporary off LRU pages in the range. Still
@@ -570,15 +575,22 @@ static int queue_pages_hugetlb(struct hugetlb_pte *hpte,
spinlock_t *ptl;
pte_t entry;

- /* We don't migrate high-granularity HugeTLB mappings for now. */
- if (hugetlb_hgm_enabled(walk->vma))
- return -EINVAL;
-
ptl = hugetlb_pte_lock(hpte);
entry = huge_ptep_get(hpte->ptep);
if (!pte_present(entry))
goto unlock;
- page = pte_page(entry);
+
+ if (!hugetlb_pte_present_leaf(hpte, entry)) {
+ ret = -EAGAIN;
+ goto unlock;
+ }
+
+ page = compound_head(pte_page(entry));
+
+ /* We already queued this page with another high-granularity PTE. */
+ if (page == qp->last_page)
+ goto unlock;
+
if (!queue_pages_required(page, qp))
goto unlock;

@@ -605,6 +617,7 @@ static int queue_pages_hugetlb(struct hugetlb_pte *hpte,
/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
if (flags & (MPOL_MF_MOVE_ALL) ||
(flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
+ qp->last_page = page;
if (isolate_hugetlb(page, qp->pagelist) &&
(flags & MPOL_MF_STRICT))
/*
@@ -739,6 +752,7 @@ queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
.start = start,
.end = end,
.first = NULL,
+ .last_page = NULL,
};

err = walk_page_range(mm, start, end, &queue_pages_walk_ops, &qp);
diff --git a/mm/migrate.c b/mm/migrate.c
index 0062689f4878..c30647b75459 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -195,6 +195,9 @@ static bool remove_migration_pte(struct folio *folio,
/* pgoff is invalid for ksm pages, but they are never large */
if (folio_test_large(folio) && !folio_test_hugetlb(folio))
idx = linear_page_index(vma, pvmw.address) - pvmw.pgoff;
+ else if (folio_test_hugetlb(folio))
+ idx = (pvmw.address & ~huge_page_mask(hstate_vma(vma)))/
+ PAGE_SIZE;
new = folio_page(folio, idx);

#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
@@ -244,14 +247,15 @@ static bool remove_migration_pte(struct folio *folio,

#ifdef CONFIG_HUGETLB_PAGE
if (folio_test_hugetlb(folio)) {
+ struct page *hpage = folio_page(folio, 0);
unsigned int shift = pvmw.pte_order + PAGE_SHIFT;

pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
if (folio_test_anon(folio))
- hugepage_add_anon_rmap(new, vma, pvmw.address,
+ hugepage_add_anon_rmap(hpage, vma, pvmw.address,
rmap_flags);
else
- page_dup_file_rmap(new, true);
+ page_dup_file_rmap(hpage, true);
set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
} else
#endif
@@ -267,7 +271,7 @@ static bool remove_migration_pte(struct folio *folio,
mlock_page_drain_local();

trace_remove_migration_pte(pvmw.address, pte_val(pte),
- compound_order(new));
+ pvmw.pte_order);

/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, pvmw.address, pvmw.pte);
@@ -358,12 +362,10 @@ void __migration_entry_wait_huge(struct vm_area_struct *vma,
}
}

-void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte)
+void migration_entry_wait_huge(struct vm_area_struct *vma,
+ struct hugetlb_pte *hpte)
{
- spinlock_t *ptl = huge_pte_lockptr(huge_page_shift(hstate_vma(vma)),
- vma->vm_mm, pte);
-
- __migration_entry_wait_huge(vma, pte, ptl);
+ __migration_entry_wait_huge(vma, hpte->ptep, hpte->ptl);
}
#endif

--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:52:06

by James Houghton

[permalink] [raw]
Subject: [PATCH 46/46] selftests/vm: add HGM UFFDIO_CONTINUE and hwpoison tests

This tests that high-granularity CONTINUEs at all sizes work
(exercising contiguous PTE sizes for arm64, when support is added). This
also tests that collapse works and hwpoison works correctly (although we
aren't yet testing high-granularity poison).

This test uses UFFD_FEATURE_EVENT_FORK + UFFDIO_REGISTER_MODE_WP to force
the kernel to copy page tables on fork(), exercising the changes to
copy_hugetlb_page_range().

Signed-off-by: James Houghton <[email protected]>
---
tools/testing/selftests/vm/Makefile | 1 +
tools/testing/selftests/vm/hugetlb-hgm.c | 455 +++++++++++++++++++++++
2 files changed, 456 insertions(+)
create mode 100644 tools/testing/selftests/vm/hugetlb-hgm.c

diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 89c14e41bd43..4aa4ca75a471 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -32,6 +32,7 @@ TEST_GEN_FILES += compaction_test
TEST_GEN_FILES += gup_test
TEST_GEN_FILES += hmm-tests
TEST_GEN_FILES += hugetlb-madvise
+TEST_GEN_FILES += hugetlb-hgm
TEST_GEN_FILES += hugepage-mmap
TEST_GEN_FILES += hugepage-mremap
TEST_GEN_FILES += hugepage-shm
diff --git a/tools/testing/selftests/vm/hugetlb-hgm.c b/tools/testing/selftests/vm/hugetlb-hgm.c
new file mode 100644
index 000000000000..616bc40164bf
--- /dev/null
+++ b/tools/testing/selftests/vm/hugetlb-hgm.c
@@ -0,0 +1,455 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test uncommon cases in HugeTLB high-granularity mapping:
+ * 1. Test all supported high-granularity page sizes (with MADV_COLLAPSE).
+ * 2. Test MADV_HWPOISON behavior.
+ */
+
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <sys/poll.h>
+#include <stdint.h>
+#include <string.h>
+
+#include <linux/userfaultfd.h>
+#include <linux/magic.h>
+#include <sys/mman.h>
+#include <sys/statfs.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <signal.h>
+#include <pthread.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+
+#define PAGE_MASK ~(4096 - 1)
+
+#ifndef MADV_COLLAPSE
+#define MADV_COLLAPSE 25
+#endif
+
+#ifndef MADV_SPLIT
+#define MADV_SPLIT 26
+#endif
+
+#define PREFIX " ... "
+#define ERROR_PREFIX " !!! "
+
+enum test_status {
+ TEST_PASSED = 0,
+ TEST_FAILED = 1,
+ TEST_SKIPPED = 2,
+};
+
+static char *status_to_str(enum test_status status)
+{
+ switch (status) {
+ case TEST_PASSED:
+ return "TEST_PASSED";
+ case TEST_FAILED:
+ return "TEST_FAILED";
+ case TEST_SKIPPED:
+ return "TEST_SKIPPED";
+ default:
+ return "TEST_???";
+ }
+}
+
+int userfaultfd(int flags)
+{
+ return syscall(__NR_userfaultfd, flags);
+}
+
+int map_range(int uffd, char *addr, uint64_t length)
+{
+ struct uffdio_continue cont = {
+ .range = (struct uffdio_range) {
+ .start = (uint64_t)addr,
+ .len = length,
+ },
+ .mode = 0,
+ .mapped = 0,
+ };
+
+ if (ioctl(uffd, UFFDIO_CONTINUE, &cont) < 0) {
+ perror(ERROR_PREFIX "UFFDIO_CONTINUE failed");
+ return -1;
+ }
+ return 0;
+}
+
+int check_equal(char *mapping, size_t length, char value)
+{
+ size_t i;
+
+ for (i = 0; i < length; ++i)
+ if (mapping[i] != value) {
+ printf(ERROR_PREFIX "mismatch at %p (%d != %d)\n",
+ &mapping[i], mapping[i], value);
+ return -1;
+ }
+
+ return 0;
+}
+
+int test_continues(int uffd, char *primary_map, char *secondary_map, size_t len,
+ bool verify)
+{
+ size_t offset = 0;
+ unsigned char iter = 0;
+ unsigned long pagesize = getpagesize();
+ uint64_t size;
+
+ for (size = len/2; size >= pagesize;
+ offset += size, size /= 2) {
+ iter++;
+ memset(secondary_map + offset, iter, size);
+ printf(PREFIX "UFFDIO_CONTINUE: %p -> %p = %d%s\n",
+ primary_map + offset,
+ primary_map + offset + size,
+ iter,
+ verify ? " (and verify)" : "");
+ if (map_range(uffd, primary_map + offset, size))
+ return -1;
+ if (verify && check_equal(primary_map + offset, size, iter))
+ return -1;
+ }
+ return 0;
+}
+
+int verify_contents(char *map, size_t len, bool last_4k_zero)
+{
+ size_t offset = 0;
+ int i = 0;
+ uint64_t size;
+
+ for (size = len/2; size > 4096; offset += size, size /= 2)
+ if (check_equal(map + offset, size, ++i))
+ return -1;
+
+ if (last_4k_zero)
+ /* expect the last 4K to be zero. */
+ if (check_equal(map + len - 4096, 4096, 0))
+ return -1;
+
+ return 0;
+}
+
+int test_collapse(char *primary_map, size_t len, bool hwpoison)
+{
+ printf(PREFIX "collapsing %p -> %p\n", primary_map, primary_map + len);
+ if (madvise(primary_map, len, MADV_COLLAPSE) < 0) {
+ if (errno == EHWPOISON && hwpoison) {
+ /* this is expected for the hwpoison test. */
+ printf(PREFIX "could not collapse due to poison\n");
+ return 0;
+ }
+ perror(ERROR_PREFIX "collapse failed");
+ return -1;
+ }
+
+ printf(PREFIX "verifying %p -> %p\n", primary_map, primary_map + len);
+ return verify_contents(primary_map, len, true);
+}
+
+static void *sigbus_addr;
+bool was_mceerr;
+bool got_sigbus;
+
+void sigbus_handler(int signo, siginfo_t *info, void *context)
+{
+ got_sigbus = true;
+ was_mceerr = info->si_code == BUS_MCEERR_AR;
+ sigbus_addr = info->si_addr;
+
+ pthread_exit(NULL);
+}
+
+void *access_mem(void *addr)
+{
+ volatile char *ptr = addr;
+
+ *ptr;
+ return NULL;
+}
+
+int test_sigbus(char *addr, bool poison)
+{
+ int ret = 0;
+ pthread_t pthread;
+
+ sigbus_addr = (void *)0xBADBADBAD;
+ was_mceerr = false;
+ got_sigbus = false;
+ ret = pthread_create(&pthread, NULL, &access_mem, addr);
+ if (ret) {
+ printf(ERROR_PREFIX "failed to create thread: %s\n",
+ strerror(ret));
+ return ret;
+ }
+
+ pthread_join(pthread, NULL);
+ if (!got_sigbus) {
+ printf(ERROR_PREFIX "didn't get a SIGBUS\n");
+ return -1;
+ } else if (sigbus_addr != addr) {
+ printf(ERROR_PREFIX "got incorrect sigbus address: %p vs %p\n",
+ sigbus_addr, addr);
+ return -1;
+ } else if (poison && !was_mceerr) {
+ printf(ERROR_PREFIX "didn't get an MCEERR?\n");
+ return -1;
+ }
+ return 0;
+}
+
+void *read_from_uffd_thd(void *arg)
+{
+ int uffd = *(int *)arg;
+ struct uffd_msg msg;
+ /* opened without O_NONBLOCK */
+ if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
+ printf(ERROR_PREFIX "reading uffd failed\n");
+
+ return NULL;
+}
+
+int read_event_from_uffd(int *uffd, pthread_t *pthread)
+{
+ int ret = 0;
+
+ ret = pthread_create(pthread, NULL, &read_from_uffd_thd, (void *)uffd);
+ if (ret) {
+ printf(ERROR_PREFIX "failed to create thread: %s\n",
+ strerror(ret));
+ return ret;
+ }
+ return 0;
+}
+
+enum test_status test_hwpoison(char *primary_map, size_t len)
+{
+ const unsigned long pagesize = getpagesize();
+ const int num_poison_checks = 512;
+ unsigned long bytes_per_check = len/num_poison_checks;
+ int i;
+
+ printf(PREFIX "poisoning %p -> %p\n", primary_map, primary_map + len);
+ if (madvise(primary_map, len, MADV_HWPOISON) < 0) {
+ perror(ERROR_PREFIX "MADV_HWPOISON failed");
+ return TEST_SKIPPED;
+ }
+
+ printf(PREFIX "checking that it was poisoned "
+ "(%d addresses within %p -> %p)\n",
+ num_poison_checks, primary_map, primary_map + len);
+
+ if (pagesize > bytes_per_check)
+ bytes_per_check = pagesize;
+
+ for (i = 0; i < len; i += bytes_per_check)
+ if (test_sigbus(primary_map + i, true) < 0)
+ return TEST_FAILED;
+ /* check very last byte, because we left it unmapped */
+ if (test_sigbus(primary_map + len - 1, true))
+ return TEST_FAILED;
+
+ return TEST_PASSED;
+}
+
+int test_fork(int uffd, char *primary_map, size_t len)
+{
+ int status;
+ int ret = 0;
+ pid_t pid;
+ pthread_t uffd_thd;
+
+ /*
+ * UFFD_FEATURE_EVENT_FORK will put fork event on the userfaultfd,
+ * which we must read, otherwise we block fork(). Setup a thread to
+ * read that event now.
+ *
+ * Page fault events should result in a SIGBUS, so we expect only a
+ * single event from the uffd (the fork event).
+ */
+ if (read_event_from_uffd(&uffd, &uffd_thd))
+ return -1;
+
+ pid = fork();
+
+ if (!pid) {
+ /*
+ * Because we have UFFDIO_REGISTER_MODE_WP and
+ * UFFD_FEATURE_EVENT_FORK, the page tables should be copied
+ * exactly.
+ *
+ * Check that everything except that last 4K has correct
+ * contents, and then check that the last 4K gets a SIGBUS.
+ */
+ printf(PREFIX "child validating...\n");
+ ret = verify_contents(primary_map, len, false) ||
+ test_sigbus(primary_map + len - 1, false);
+ exit(ret ? 1 : 0);
+ } else {
+ /* wait for the child to finish. */
+ waitpid(pid, &status, 0);
+ ret = WEXITSTATUS(status);
+ if (!ret) {
+ printf(PREFIX "parent validating...\n");
+ /* Same check as the child. */
+ ret = verify_contents(primary_map, len, false) ||
+ test_sigbus(primary_map + len - 1, false);
+ }
+ }
+
+ pthread_join(uffd_thd, NULL);
+ return ret;
+
+}
+
+enum test_status
+test_hgm(int fd, size_t hugepagesize, size_t len, bool hwpoison)
+{
+ int uffd;
+ char *primary_map, *secondary_map;
+ struct uffdio_api api;
+ struct uffdio_register reg;
+ struct sigaction new, old;
+ enum test_status status = TEST_SKIPPED;
+
+ if (ftruncate(fd, len) < 0) {
+ perror(ERROR_PREFIX "ftruncate failed");
+ return status;
+ }
+
+ uffd = userfaultfd(O_CLOEXEC);
+ if (uffd < 0) {
+ perror(ERROR_PREFIX "uffd not created");
+ return status;
+ }
+
+ primary_map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (primary_map == MAP_FAILED) {
+ perror(ERROR_PREFIX "mmap for primary mapping failed");
+ goto close_uffd;
+ }
+ secondary_map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (secondary_map == MAP_FAILED) {
+ perror(ERROR_PREFIX "mmap for secondary mapping failed");
+ goto unmap_primary;
+ }
+
+ printf(PREFIX "primary mapping: %p\n", primary_map);
+ printf(PREFIX "secondary mapping: %p\n", secondary_map);
+
+ api.api = UFFD_API;
+ api.features = UFFD_FEATURE_SIGBUS | UFFD_FEATURE_EXACT_ADDRESS |
+ UFFD_FEATURE_EVENT_FORK;
+ if (ioctl(uffd, UFFDIO_API, &api) == -1) {
+ perror(ERROR_PREFIX "UFFDIO_API failed");
+ goto out;
+ }
+
+ if (madvise(primary_map, len, MADV_SPLIT)) {
+ perror(ERROR_PREFIX "MADV_SPLIT failed");
+ goto out;
+ }
+
+ reg.range.start = (unsigned long)primary_map;
+ reg.range.len = len;
+ /*
+ * Register with UFFDIO_REGISTER_MODE_WP to force fork() to copy page
+ * tables (also need UFFD_FEATURE_EVENT_FORK, which we have).
+ */
+ reg.mode = UFFDIO_REGISTER_MODE_MINOR | UFFDIO_REGISTER_MODE_MISSING |
+ UFFDIO_REGISTER_MODE_WP;
+ reg.ioctls = 0;
+ if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1) {
+ perror(ERROR_PREFIX "register failed");
+ goto out;
+ }
+
+ new.sa_sigaction = &sigbus_handler;
+ new.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGBUS, &new, &old) < 0) {
+ perror(ERROR_PREFIX "could not setup SIGBUS handler");
+ goto out;
+ }
+
+ status = TEST_FAILED;
+
+ if (test_continues(uffd, primary_map, secondary_map, len, !hwpoison))
+ goto done;
+ if (hwpoison) {
+ /* test_hwpoison can fail with TEST_SKIPPED. */
+ enum test_status new_status = test_hwpoison(primary_map, len);
+
+ if (new_status != TEST_PASSED) {
+ status = new_status;
+ goto done;
+ }
+ } else if (test_fork(uffd, primary_map, len))
+ goto done;
+ if (test_collapse(primary_map, len, hwpoison))
+ goto done;
+
+ status = TEST_PASSED;
+
+done:
+ if (ftruncate(fd, 0) < 0) {
+ perror(ERROR_PREFIX "ftruncate back to 0 failed");
+ status = TEST_FAILED;
+ }
+
+out:
+ munmap(secondary_map, len);
+unmap_primary:
+ munmap(primary_map, len);
+close_uffd:
+ close(uffd);
+ return status;
+}
+
+int main(void)
+{
+ int fd;
+ struct statfs file_stat;
+ size_t hugepagesize;
+ size_t len;
+
+ fd = memfd_create("hugetlb_tmp", MFD_HUGETLB);
+ if (fd < 0) {
+ perror(ERROR_PREFIX "could not open hugetlbfs file");
+ return -1;
+ }
+
+ memset(&file_stat, 0, sizeof(file_stat));
+ if (fstatfs(fd, &file_stat)) {
+ perror(ERROR_PREFIX "fstatfs failed");
+ goto close;
+ }
+ if (file_stat.f_type != HUGETLBFS_MAGIC) {
+ printf(ERROR_PREFIX "not hugetlbfs file\n");
+ goto close;
+ }
+
+ hugepagesize = file_stat.f_bsize;
+ len = 2 * hugepagesize;
+ printf("HGM regular test...\n");
+ printf("HGM regular test: %s\n",
+ status_to_str(test_hgm(fd, hugepagesize, len, false)));
+ printf("HGM hwpoison test...\n");
+ printf("HGM hwpoison test: %s\n",
+ status_to_str(test_hgm(fd, hugepagesize, len, true)));
+close:
+ close(fd);
+
+ return 0;
+}
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:52:09

by James Houghton

[permalink] [raw]
Subject: [PATCH 07/46] hugetlb: rename __vma_shareable_flags_pmd to __vma_has_hugetlb_vma_lock

Previously, if the hugetlb VMA lock was present, that meant that the VMA
was PMD-shareable. Now it is possible that the VMA lock is allocated but
the VMA is not PMD-shareable: if the VMA is a high-granularity VMA.

It is possible for a high-granularity VMA not to have a VMA lock; in
this case, MADV_COLLAPSE will not be able to collapse the mappings.

Signed-off-by: James Houghton <[email protected]>
---
include/linux/hugetlb.h | 15 ++++++++++-----
mm/hugetlb.c | 16 ++++++++--------
2 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b6b10101bea7..aa49fd8cb47c 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1235,7 +1235,8 @@ bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr);
#define flush_hugetlb_tlb_range(vma, addr, end) flush_tlb_range(vma, addr, end)
#endif

-static inline bool __vma_shareable_lock(struct vm_area_struct *vma)
+static inline bool
+__vma_has_hugetlb_vma_lock(struct vm_area_struct *vma)
{
return (vma->vm_flags & VM_MAYSHARE) && vma->vm_private_data;
}
@@ -1252,13 +1253,17 @@ hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

/*
- * If pmd sharing possible, locking needed to safely walk the
- * hugetlb pgtables. More information can be found at the comment
- * above huge_pte_offset() in the same file.
+ * If the VMA has the hugetlb vma lock (PMD sharable or HGM
+ * collapsible), locking needed to safely walk the hugetlb pgtables.
+ * More information can be found at the comment above huge_pte_offset()
+ * in the same file.
+ *
+ * This doesn't do a full high-granularity walk, so we are concerned
+ * only with PMD unsharing.
*
* NOTE: lockdep_is_held() is only defined with CONFIG_LOCKDEP.
*/
- if (__vma_shareable_lock(vma))
+ if (__vma_has_hugetlb_vma_lock(vma))
WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
!lockdep_is_held(
&vma->vm_file->f_mapping->i_mmap_rwsem));
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 99fadd7680ec..2f86fedef283 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -260,7 +260,7 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
*/
void hugetlb_vma_lock_read(struct vm_area_struct *vma)
{
- if (__vma_shareable_lock(vma)) {
+ if (__vma_has_hugetlb_vma_lock(vma)) {
struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

down_read(&vma_lock->rw_sema);
@@ -269,7 +269,7 @@ void hugetlb_vma_lock_read(struct vm_area_struct *vma)

void hugetlb_vma_unlock_read(struct vm_area_struct *vma)
{
- if (__vma_shareable_lock(vma)) {
+ if (__vma_has_hugetlb_vma_lock(vma)) {
struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

up_read(&vma_lock->rw_sema);
@@ -278,7 +278,7 @@ void hugetlb_vma_unlock_read(struct vm_area_struct *vma)

void hugetlb_vma_lock_write(struct vm_area_struct *vma)
{
- if (__vma_shareable_lock(vma)) {
+ if (__vma_has_hugetlb_vma_lock(vma)) {
struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

down_write(&vma_lock->rw_sema);
@@ -287,7 +287,7 @@ void hugetlb_vma_lock_write(struct vm_area_struct *vma)

void hugetlb_vma_unlock_write(struct vm_area_struct *vma)
{
- if (__vma_shareable_lock(vma)) {
+ if (__vma_has_hugetlb_vma_lock(vma)) {
struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

up_write(&vma_lock->rw_sema);
@@ -298,7 +298,7 @@ int hugetlb_vma_trylock_write(struct vm_area_struct *vma)
{
struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

- if (!__vma_shareable_lock(vma))
+ if (!__vma_has_hugetlb_vma_lock(vma))
return 1;

return down_write_trylock(&vma_lock->rw_sema);
@@ -306,7 +306,7 @@ int hugetlb_vma_trylock_write(struct vm_area_struct *vma)

void hugetlb_vma_assert_locked(struct vm_area_struct *vma)
{
- if (__vma_shareable_lock(vma)) {
+ if (__vma_has_hugetlb_vma_lock(vma)) {
struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

lockdep_assert_held(&vma_lock->rw_sema);
@@ -338,7 +338,7 @@ static void __hugetlb_vma_unlock_write_put(struct hugetlb_vma_lock *vma_lock)

static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma)
{
- if (__vma_shareable_lock(vma)) {
+ if (__vma_has_hugetlb_vma_lock(vma)) {
struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

__hugetlb_vma_unlock_write_put(vma_lock);
@@ -350,7 +350,7 @@ static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
/*
* Only present in sharable vmas.
*/
- if (!vma || !__vma_shareable_lock(vma))
+ if (!vma || !__vma_has_hugetlb_vma_lock(vma))
return;

if (vma->vm_private_data) {
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:52:28

by James Houghton

[permalink] [raw]
Subject: [PATCH 20/46] hugetlb: add HGM support for hugetlb_follow_page_mask

The change here is very simple: do a high-granularity walk.

Signed-off-by: James Houghton <[email protected]>
---
mm/hugetlb.c | 24 +++++++++++++++++-------
1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 30fea414d9ee..718572444a73 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6553,11 +6553,10 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
unsigned long address, unsigned int flags)
{
struct hstate *h = hstate_vma(vma);
- struct mm_struct *mm = vma->vm_mm;
- unsigned long haddr = address & huge_page_mask(h);
struct page *page = NULL;
spinlock_t *ptl;
- pte_t *pte, entry;
+ pte_t entry;
+ struct hugetlb_pte hpte;

/*
* FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
@@ -6567,13 +6566,24 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
return NULL;

hugetlb_vma_lock_read(vma);
- pte = hugetlb_walk(vma, haddr, huge_page_size(h));
- if (!pte)
+
+ if (hugetlb_full_walk(&hpte, vma, address))
goto out_unlock;

- ptl = huge_pte_lock(h, mm, pte);
- entry = huge_ptep_get(pte);
+retry:
+ ptl = hugetlb_pte_lock(&hpte);
+ entry = huge_ptep_get(hpte.ptep);
if (pte_present(entry)) {
+ if (unlikely(!hugetlb_pte_present_leaf(&hpte, entry))) {
+ /*
+ * We raced with someone splitting from under us.
+ * Keep walking to get to the real leaf.
+ */
+ spin_unlock(ptl);
+ hugetlb_full_walk_continue(&hpte, vma, address);
+ goto retry;
+ }
+
page = pte_page(entry) +
((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
/*
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:53:03

by James Houghton

[permalink] [raw]
Subject: [PATCH 24/46] rmap: update hugetlb lock comment for HGM

The VMA lock is used to prevent high-granularity HugeTLB mappings from
being collapsed while other threads are doing high-granularity page
table walks.
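
To illustrate the intended usage, here is a minimal reader-side sketch
(not part of this patch; it just combines helpers introduced elsewhere
in this series, with declarations and error handling omitted):

    /* Stabilize the mapping, then walk and lock one hugetlb_pte. */
    hugetlb_vma_lock_read(vma);
    if (!hugetlb_full_walk(&hpte, vma, addr)) {
            spinlock_t *ptl = hugetlb_pte_lock(&hpte);
            pte_t entry = huge_ptep_get(hpte.ptep);

            /*
             * entry is stable here: the mapping cannot be collapsed or
             * PMD-unshared from under us while the read lock is held.
             */
            spin_unlock(ptl);
    }
    hugetlb_vma_unlock_read(vma);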

Signed-off-by: James Houghton <[email protected]>
---
include/linux/hugetlb.h | 12 ++++++++++++
mm/rmap.c | 3 ++-
2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b7cf45535d64..daf993fdbc38 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -156,6 +156,18 @@ struct file_region {
#endif
};

+/*
+ * The HugeTLB VMA lock is used to synchronize HugeTLB page table walks.
+ * Right now, it is only used for VM_SHARED mappings.
+ * - The read lock is held when we want to stabilize mappings (prevent PMD
+ * unsharing or MADV_COLLAPSE for high-granularity mappings).
+ * - The write lock is held when we want to free mappings (PMD unsharing and
+ * MADV_COLLAPSE for high-granularity mappings).
+ *
+ * Note: For PMD unsharing and MADV_COLLAPSE, the i_mmap_rwsem is held for
+ * writing as well, so page table walkers will also be safe if they hold
+ * i_mmap_rwsem for at least reading. See hugetlb_walk() for more information.
+ */
struct hugetlb_vma_lock {
struct kref refs;
struct rw_semaphore rw_sema;
diff --git a/mm/rmap.c b/mm/rmap.c
index ff7e6c770b0a..076ea77010e5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -47,7 +47,8 @@
*
* hugetlbfs PageHuge() take locks in this order:
* hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
- * vma_lock (hugetlb specific lock for pmd_sharing)
+ * vma_lock (hugetlb specific lock for pmd_sharing and high-granularity
+ * mapping)
* mapping->i_mmap_rwsem (also used for hugetlb pmd sharing)
* page->flags PG_locked (lock_page)
*/
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:55:31

by James Houghton

[permalink] [raw]
Subject: [PATCH 09/46] mm: add MADV_SPLIT to enable HugeTLB HGM

Issuing madvise(MADV_SPLIT) on a HugeTLB address range will enable
HugeTLB HGM. The name MADV_SPLIT was chosen so that this API can be
applied to non-HugeTLB memory in the future, should such a use case
arise.

MADV_SPLIT provides several API changes for some syscalls on HugeTLB
address ranges:
1. UFFDIO_CONTINUE is allowed for MAP_SHARED VMAs at PAGE_SIZE
alignment.
2. read()ing a page fault event from a userfaultfd will yield a
PAGE_SIZE-rounded address, instead of a huge-page-size-rounded
address (unless UFFD_FEATURE_EXACT_ADDRESS is used).

There is no way to disable the API changes that come with issuing
MADV_SPLIT. MADV_COLLAPSE can be used to collapse the high-granularity
page table mappings created through the functionality that MADV_SPLIT
enables.

For post-copy live migration, the expected use-case is:
1. mmap(MAP_SHARED, some_fd) primary mapping
2. mmap(MAP_SHARED, some_fd) alias mapping
3. MADV_SPLIT the primary mapping
4. UFFDIO_REGISTER/etc. the primary mapping
5. Copy memory contents into alias mapping and UFFDIO_CONTINUE the
corresponding PAGE_SIZE sections in the primary mapping.

More API changes may be added in the future.
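
For illustration, a hypothetical userspace sketch of the expected
use-case above (includes and error handling omitted; "size" and "off"
are placeholders):

    int fd = memfd_create("guest", MFD_HUGETLB);   /* hugetlbfs-backed */
    ftruncate(fd, size);
    char *primary = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
    char *alias = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);

    madvise(primary, size, MADV_SPLIT);            /* enable HGM APIs */

    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
    struct uffdio_api api = {
            .api = UFFD_API,
            .features = UFFD_FEATURE_MINOR_HUGETLBFS,
    };
    ioctl(uffd, UFFDIO_API, &api);

    struct uffdio_register reg = {
            .range = { .start = (unsigned long)primary, .len = size },
            .mode = UFFDIO_REGISTER_MODE_MINOR,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    /* After copying 4K of data into the alias mapping at offset off: */
    struct uffdio_continue cont = {
            .range = { .start = (unsigned long)primary + off, .len = 4096 },
    };
    ioctl(uffd, UFFDIO_CONTINUE, &cont);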

Signed-off-by: James Houghton <[email protected]>
---
arch/alpha/include/uapi/asm/mman.h | 2 ++
arch/mips/include/uapi/asm/mman.h | 2 ++
arch/parisc/include/uapi/asm/mman.h | 2 ++
arch/xtensa/include/uapi/asm/mman.h | 2 ++
include/linux/hugetlb.h | 2 ++
include/uapi/asm-generic/mman-common.h | 2 ++
mm/hugetlb.c | 3 +--
mm/madvise.c | 26 ++++++++++++++++++++++++++
8 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 763929e814e9..7a26f3648b90 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -78,6 +78,8 @@

#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */

+#define MADV_SPLIT 26 /* Enable hugepage high-granularity APIs */
+
/* compatibility flags */
#define MAP_FILE 0

diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index c6e1fc77c996..f8a74a3a0928 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -105,6 +105,8 @@

#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */

+#define MADV_SPLIT 26 /* Enable hugepage high-granularity APIs */
+
/* compatibility flags */
#define MAP_FILE 0

diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 68c44f99bc93..a6dc6a56c941 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -72,6 +72,8 @@

#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */

+#define MADV_SPLIT 74 /* Enable hugepage high-granularity APIs */
+
#define MADV_HWPOISON 100 /* poison a page for testing */
#define MADV_SOFT_OFFLINE 101 /* soft offline page for testing */

diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 1ff0c858544f..f98a77c430a9 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -113,6 +113,8 @@

#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */

+#define MADV_SPLIT 26 /* Enable hugepage high-granularity APIs */
+
/* compatibility flags */
#define MAP_FILE 0

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8713d9c4f86c..16fc3e381801 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -109,6 +109,8 @@ struct hugetlb_vma_lock {
struct vm_area_struct *vma;
};

+void hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
+
extern struct resv_map *resv_map_alloc(void);
void resv_map_release(struct kref *ref);

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..996e8ded092f 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -79,6 +79,8 @@

#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */

+#define MADV_SPLIT 26 /* Enable hugepage high-granularity APIs */
+
/* compatibility flags */
#define MAP_FILE 0

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d27fe05d5ef6..5bd53ae8ca4b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -92,7 +92,6 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
/* Forward declaration */
static int hugetlb_acct_memory(struct hstate *h, long delta);
static void hugetlb_vma_lock_free(struct vm_area_struct *vma);
-static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);

static inline bool subpool_is_free(struct hugepage_subpool *spool)
@@ -361,7 +360,7 @@ static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
}
}

-static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
+void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
{
struct hugetlb_vma_lock *vma_lock;

diff --git a/mm/madvise.c b/mm/madvise.c
index 025be3517af1..04ee28992e52 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1011,6 +1011,24 @@ static long madvise_remove(struct vm_area_struct *vma,
return error;
}

+static int madvise_split(struct vm_area_struct *vma,
+ unsigned long *new_flags)
+{
+ if (!is_vm_hugetlb_page(vma) || !hugetlb_hgm_eligible(vma))
+ return -EINVAL;
+ /*
+ * Attempt to allocate the VMA lock again. If it isn't allocated,
+ * MADV_COLLAPSE won't work.
+ */
+ hugetlb_vma_lock_alloc(vma);
+
+ /* PMD sharing doesn't work with HGM. */
+ hugetlb_unshare_all_pmds(vma);
+
+ *new_flags |= VM_HUGETLB_HGM;
+ return 0;
+}
+
/*
* Apply an madvise behavior to a region of a vma. madvise_update_vma
* will handle splitting a vm area into separate areas, each area with its own
@@ -1089,6 +1107,11 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
break;
case MADV_COLLAPSE:
return madvise_collapse(vma, prev, start, end);
+ case MADV_SPLIT:
+ error = madvise_split(vma, &new_flags);
+ if (error)
+ goto out;
+ break;
}

anon_name = anon_vma_name(vma);
@@ -1183,6 +1206,9 @@ madvise_behavior_valid(int behavior)
case MADV_HUGEPAGE:
case MADV_NOHUGEPAGE:
case MADV_COLLAPSE:
+#endif
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+ case MADV_SPLIT:
#endif
case MADV_DONTDUMP:
case MADV_DODUMP:
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:57:02

by James Houghton

[permalink] [raw]
Subject: [PATCH 18/46] hugetlb: add HGM support for hugetlb_change_protection

The main changes here are to do a high-granularity walk and to pull the
shift from the walk (not from the hstate).

Signed-off-by: James Houghton <[email protected]>
---
mm/hugetlb.c | 59 +++++++++++++++++++++++++++++++++-------------------
1 file changed, 38 insertions(+), 21 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dfd6c1491ac3..73672d806172 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6798,15 +6798,15 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
{
struct mm_struct *mm = vma->vm_mm;
unsigned long start = address;
- pte_t *ptep;
pte_t pte;
struct hstate *h = hstate_vma(vma);
- unsigned long pages = 0, psize = huge_page_size(h);
+ unsigned long base_pages = 0, psize = huge_page_size(h);
bool shared_pmd = false;
struct mmu_notifier_range range;
unsigned long last_addr_mask;
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+ struct hugetlb_pte hpte;

/*
* In the case of shared PMDs, the area to flush could be beyond
@@ -6824,28 +6824,30 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
hugetlb_vma_lock_write(vma);
i_mmap_lock_write(vma->vm_file->f_mapping);
last_addr_mask = hugetlb_mask_last_page(h);
- for (; address < end; address += psize) {
+ while (address < end) {
spinlock_t *ptl;
- ptep = hugetlb_walk(vma, address, psize);
- if (!ptep) {
- address |= last_addr_mask;
+
+ if (hugetlb_full_walk(&hpte, vma, address)) {
+ address = (address | last_addr_mask) + psize;
continue;
}
- ptl = huge_pte_lock(h, mm, ptep);
- if (huge_pmd_unshare(mm, vma, address, ptep)) {
+
+ ptl = hugetlb_pte_lock(&hpte);
+ if (hugetlb_pte_size(&hpte) == psize &&
+ huge_pmd_unshare(mm, vma, address, hpte.ptep)) {
/*
* When uffd-wp is enabled on the vma, unshare
* shouldn't happen at all. Warn about it if it
* happened due to some reason.
*/
WARN_ON_ONCE(uffd_wp || uffd_wp_resolve);
- pages++;
+ base_pages += psize / PAGE_SIZE;
spin_unlock(ptl);
shared_pmd = true;
- address |= last_addr_mask;
+ address = (address | last_addr_mask) + psize;
continue;
}
- pte = huge_ptep_get(ptep);
+ pte = huge_ptep_get(hpte.ptep);
if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
/* Nothing to do. */
} else if (unlikely(is_hugetlb_entry_migration(pte))) {
@@ -6861,7 +6863,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
entry = make_readable_migration_entry(
swp_offset(entry));
newpte = swp_entry_to_pte(entry);
- pages++;
+ base_pages += hugetlb_pte_size(&hpte) / PAGE_SIZE;
}

if (uffd_wp)
@@ -6869,34 +6871,49 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
else if (uffd_wp_resolve)
newpte = pte_swp_clear_uffd_wp(newpte);
if (!pte_same(pte, newpte))
- set_huge_pte_at(mm, address, ptep, newpte);
+ set_huge_pte_at(mm, address, hpte.ptep, newpte);
} else if (unlikely(is_pte_marker(pte))) {
/* No other markers apply for now. */
WARN_ON_ONCE(!pte_marker_uffd_wp(pte));
if (uffd_wp_resolve)
/* Safe to modify directly (non-present->none). */
- huge_pte_clear(mm, address, ptep, psize);
+ huge_pte_clear(mm, address, hpte.ptep,
+ hugetlb_pte_size(&hpte));
} else if (!huge_pte_none(pte)) {
pte_t old_pte;
- unsigned int shift = huge_page_shift(hstate_vma(vma));
+ unsigned int shift = hpte.shift;

- old_pte = huge_ptep_modify_prot_start(vma, address, ptep);
+ if (unlikely(!hugetlb_pte_present_leaf(&hpte, pte))) {
+ /*
+ * Someone split the PTE from under us, so retry
+ * the walk,
+ */
+ spin_unlock(ptl);
+ continue;
+ }
+
+ old_pte = huge_ptep_modify_prot_start(
+ vma, address, hpte.ptep);
pte = huge_pte_modify(old_pte, newprot);
- pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
+ pte = arch_make_huge_pte(
+ pte, shift, vma->vm_flags);
if (uffd_wp)
pte = huge_pte_mkuffd_wp(pte);
else if (uffd_wp_resolve)
pte = huge_pte_clear_uffd_wp(pte);
- huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
- pages++;
+ huge_ptep_modify_prot_commit(
+ vma, address, hpte.ptep,
+ old_pte, pte);
+ base_pages += hugetlb_pte_size(&hpte) / PAGE_SIZE;
} else {
/* None pte */
if (unlikely(uffd_wp))
/* Safe to modify directly (none->non-present). */
- set_huge_pte_at(mm, address, ptep,
+ set_huge_pte_at(mm, address, hpte.ptep,
make_pte_marker(PTE_MARKER_UFFD_WP));
}
spin_unlock(ptl);
+ address += hugetlb_pte_size(&hpte);
}
/*
* Must flush TLB before releasing i_mmap_rwsem: x86's huge_pmd_unshare
@@ -6919,7 +6936,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
hugetlb_vma_unlock_write(vma);
mmu_notifier_invalidate_range_end(&range);

- return pages << h->order;
+ return base_pages;
}

/* Return true if reservation was successful, false otherwise. */
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:57:06

by James Houghton

[permalink] [raw]
Subject: [PATCH 06/46] mm: add VM_HUGETLB_HGM VMA flag

VM_HUGETLB_HGM indicates that a HugeTLB VMA may contain high-granularity
mappings. Its VmFlags string is "hm".
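
For example, once HGM has been enabled on a VMA (via MADV_SPLIT, added
later in this series), its VmFlags line in /proc/<pid>/smaps is expected
to include "hm" alongside the existing "ht" (VM_HUGETLB) flag.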

Signed-off-by: James Houghton <[email protected]>
---
fs/proc/task_mmu.c | 3 +++
include/linux/mm.h | 7 +++++++
include/trace/events/mmflags.h | 7 +++++++
3 files changed, 17 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e35a0398db63..41b5509bde0e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -711,6 +711,9 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
[ilog2(VM_UFFD_MINOR)] = "ui",
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+ [ilog2(VM_HUGETLB_HGM)] = "hm",
+#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
};
size_t i;

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c37f9330f14e..738b3605f80e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -372,6 +372,13 @@ extern unsigned int kobjsize(const void *objp);
# define VM_UFFD_MINOR VM_NONE
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */

+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+# define VM_HUGETLB_HGM_BIT 38
+# define VM_HUGETLB_HGM BIT(VM_HUGETLB_HGM_BIT) /* HugeTLB high-granularity mapping */
+#else /* !CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
+# define VM_HUGETLB_HGM VM_NONE
+#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
+
/* Bits set in the VMA until the stack is in its final location */
#define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ)

diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 412b5a46374c..88ce04b2ff69 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -163,6 +163,12 @@ IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison")
# define IF_HAVE_UFFD_MINOR(flag, name)
#endif

+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+# define IF_HAVE_HUGETLB_HGM(flag, name) {flag, name},
+#else
+# define IF_HAVE_HUGETLB_HGM(flag, name)
+#endif
+
#define __def_vmaflag_names \
{VM_READ, "read" }, \
{VM_WRITE, "write" }, \
@@ -187,6 +193,7 @@ IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR, "uffd_minor" ) \
{VM_ACCOUNT, "account" }, \
{VM_NORESERVE, "noreserve" }, \
{VM_HUGETLB, "hugetlb" }, \
+IF_HAVE_HUGETLB_HGM(VM_HUGETLB_HGM, "hugetlb_hgm" ) \
{VM_SYNC, "sync" }, \
__VM_ARCH_SPECIFIC_1 , \
{VM_WIPEONFORK, "wipeonfork" }, \
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:58:07

by James Houghton

[permalink] [raw]
Subject: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

The main change in this commit is to make walk_hugetlb_range support
walking HGM mappings; all walk_hugetlb_range callers must then be
updated to use the new API and take the correct action.

Listing all the changes to the callers:

For s390 changes, we simply ignore HGM PTEs (we don't support s390 yet).

For smaps, shared_hugetlb (and private_hugetlb, although private
mappings don't support HGM) may now not be divisible by the hugepage
size. The appropriate changes have been made to support analyzing HGM
PTEs.

For pagemap, we ignore non-leaf PTEs by treating that as if they were
none PTEs. We can only end up with non-leaf PTEs if they had just been
updated from a none PTE.

For show_numa_map, the challenge is that, if any part of a hugepage is
mapped, we have to count that entire page exactly once, as the results
are given in units of hugepages. To support HGM mappings, we keep track
of the last page that we looked at. If the hugepage we are currently
looking at is the same as the last one, then it must be an HGM-mapped
page that we have already accounted for.

For DAMON, we treat non-leaf PTEs as if they were blank, for the same
reason as pagemap.

For hwpoison, we proactively update the logic to support the case when
hpte is pointing to a subpage within the poisoned hugepage.

For queue_pages_hugetlb/migration, we ignore all HGM-enabled VMAs for
now.

For mincore, we ignore non-leaf PTEs for the same reason as pagemap.

For mprotect/prot_none_hugetlb_entry, we retry the walk when we get a
non-leaf PTE.
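
As a reference for the new callback shape, a hypothetical hugetlb_entry
implementation (not part of this series) might look like:

    static int example_hugetlb_entry(struct hugetlb_pte *hpte,
                                     unsigned long addr,
                                     struct mm_walk *walk)
    {
            spinlock_t *ptl = hugetlb_pte_lock(hpte);
            pte_t pte = huge_ptep_get(hpte->ptep);

            /*
             * A present non-leaf entry means we raced with the split of
             * a none PTE; treat it as a hole.
             */
            if (!pte_present(pte) ||
                !hugetlb_pte_present_leaf(hpte, pte)) {
                    spin_unlock(ptl);
                    return 0;
            }

            /* Operate on hugetlb_pte_size(hpte) bytes at addr. */
            spin_unlock(ptl);
            return 0;
    }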

Signed-off-by: James Houghton <[email protected]>
---
arch/s390/mm/gmap.c | 20 ++++++++--
fs/proc/task_mmu.c | 83 +++++++++++++++++++++++++++++-----------
include/linux/pagewalk.h | 10 +++--
mm/damon/vaddr.c | 42 +++++++++++++-------
mm/hmm.c | 20 +++++++---
mm/memory-failure.c | 17 ++++----
mm/mempolicy.c | 12 ++++--
mm/mincore.c | 17 ++++++--
mm/mprotect.c | 18 ++++++---
mm/pagewalk.c | 20 +++++-----
10 files changed, 180 insertions(+), 79 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 74e1d873dce0..284466bf4f25 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2626,13 +2626,25 @@ static int __s390_enable_skey_pmd(pmd_t *pmd, unsigned long addr,
return 0;
}

-static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
- unsigned long hmask, unsigned long next,
+static int __s390_enable_skey_hugetlb(struct hugetlb_pte *hpte,
+ unsigned long addr,
struct mm_walk *walk)
{
- pmd_t *pmd = (pmd_t *)pte;
+ struct hstate *h = hstate_vma(walk->vma);
+ pmd_t *pmd;
unsigned long start, end;
- struct page *page = pmd_page(*pmd);
+ struct page *page;
+
+ if (huge_page_size(h) != hugetlb_pte_size(hpte))
+ /* Ignore high-granularity PTEs. */
+ return 0;
+
+ if (!pte_present(huge_ptep_get(hpte->ptep)))
+ /* Ignore non-present PTEs. */
+ return 0;
+
+ pmd = (pmd_t *)hpte->ptep;
+ page = pmd_page(*pmd);

/*
* The write check makes sure we do not set a key on shared
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 41b5509bde0e..c353cab11eee 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -731,18 +731,28 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
}

#ifdef CONFIG_HUGETLB_PAGE
-static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long end,
- struct mm_walk *walk)
+static int smaps_hugetlb_range(struct hugetlb_pte *hpte,
+ unsigned long addr,
+ struct mm_walk *walk)
{
struct mem_size_stats *mss = walk->private;
struct vm_area_struct *vma = walk->vma;
struct page *page = NULL;
+ pte_t pte = huge_ptep_get(hpte->ptep);

- if (pte_present(*pte)) {
- page = vm_normal_page(vma, addr, *pte);
- } else if (is_swap_pte(*pte)) {
- swp_entry_t swpent = pte_to_swp_entry(*pte);
+ if (pte_present(pte)) {
+ /* We only care about leaf-level PTEs. */
+ if (!hugetlb_pte_present_leaf(hpte, pte))
+ /*
+ * The only case where hpte is not a leaf is that
+ * it was originally none, but it was split from
+ * under us. It was originally none, so exclude it.
+ */
+ return 0;
+
+ page = vm_normal_page(vma, addr, pte);
+ } else if (is_swap_pte(pte)) {
+ swp_entry_t swpent = pte_to_swp_entry(pte);

if (is_pfn_swap_entry(swpent))
page = pfn_swap_entry_to_page(swpent);
@@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
int mapcount = page_mapcount(page);

if (mapcount >= 2)
- mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
+ mss->shared_hugetlb += hugetlb_pte_size(hpte);
else
- mss->private_hugetlb += huge_page_size(hstate_vma(vma));
+ mss->private_hugetlb += hugetlb_pte_size(hpte);
}
return 0;
}
@@ -1572,22 +1582,31 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,

#ifdef CONFIG_HUGETLB_PAGE
/* This function walks within one hugetlb entry in the single call */
-static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
- unsigned long addr, unsigned long end,
+static int pagemap_hugetlb_range(struct hugetlb_pte *hpte,
+ unsigned long addr,
struct mm_walk *walk)
{
struct pagemapread *pm = walk->private;
struct vm_area_struct *vma = walk->vma;
u64 flags = 0, frame = 0;
int err = 0;
- pte_t pte;
+ unsigned long hmask = hugetlb_pte_mask(hpte);
+ unsigned long end = addr + hugetlb_pte_size(hpte);
+ pte_t pte = huge_ptep_get(hpte->ptep);
+ struct page *page;

if (vma->vm_flags & VM_SOFTDIRTY)
flags |= PM_SOFT_DIRTY;

- pte = huge_ptep_get(ptep);
if (pte_present(pte)) {
- struct page *page = pte_page(pte);
+ /*
+ * We raced with this PTE being split, which can only happen if
+ * it was blank before. Treat it is as if it were blank.
+ */
+ if (!hugetlb_pte_present_leaf(hpte, pte))
+ return 0;
+
+ page = pte_page(pte);

if (!PageAnon(page))
flags |= PM_FILE;
@@ -1868,10 +1887,16 @@ static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
}
#endif

+struct show_numa_map_private {
+ struct numa_maps *md;
+ struct page *last_page;
+};
+
static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)
{
- struct numa_maps *md = walk->private;
+ struct show_numa_map_private *priv = walk->private;
+ struct numa_maps *md = priv->md;
struct vm_area_struct *vma = walk->vma;
spinlock_t *ptl;
pte_t *orig_pte;
@@ -1883,6 +1908,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
struct page *page;

page = can_gather_numa_stats_pmd(*pmd, vma, addr);
+ priv->last_page = page;
if (page)
gather_stats(page, md, pmd_dirty(*pmd),
HPAGE_PMD_SIZE/PAGE_SIZE);
@@ -1896,6 +1922,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
do {
struct page *page = can_gather_numa_stats(*pte, vma, addr);
+ priv->last_page = page;
if (!page)
continue;
gather_stats(page, md, pte_dirty(*pte), 1);
@@ -1906,19 +1933,25 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
return 0;
}
#ifdef CONFIG_HUGETLB_PAGE
-static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long end, struct mm_walk *walk)
+static int gather_hugetlb_stats(struct hugetlb_pte *hpte, unsigned long addr,
+ struct mm_walk *walk)
{
- pte_t huge_pte = huge_ptep_get(pte);
+ struct show_numa_map_private *priv = walk->private;
+ pte_t huge_pte = huge_ptep_get(hpte->ptep);
struct numa_maps *md;
struct page *page;

- if (!pte_present(huge_pte))
+ if (!hugetlb_pte_present_leaf(hpte, huge_pte))
+ return 0;
+
+ page = compound_head(pte_page(huge_pte));
+ if (priv->last_page == page)
+ /* we've already accounted for this page */
return 0;

- page = pte_page(huge_pte);
+ priv->last_page = page;

- md = walk->private;
+ md = priv->md;
gather_stats(page, md, pte_dirty(huge_pte), 1);
return 0;
}
@@ -1948,9 +1981,15 @@ static int show_numa_map(struct seq_file *m, void *v)
struct file *file = vma->vm_file;
struct mm_struct *mm = vma->vm_mm;
struct mempolicy *pol;
+
char buffer[64];
int nid;

+ struct show_numa_map_private numa_map_private;
+
+ numa_map_private.md = md;
+ numa_map_private.last_page = NULL;
+
if (!mm)
return 0;

@@ -1980,7 +2019,7 @@ static int show_numa_map(struct seq_file *m, void *v)
seq_puts(m, " huge");

/* mmap_lock is held by m_start */
- walk_page_vma(vma, &show_numa_ops, md);
+ walk_page_vma(vma, &show_numa_ops, &numa_map_private);

if (!md->pages)
goto out;
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 27a6df448ee5..f4bddad615c2 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -3,6 +3,7 @@
#define _LINUX_PAGEWALK_H

#include <linux/mm.h>
+#include <linux/hugetlb.h>

struct mm_walk;

@@ -31,6 +32,10 @@ struct mm_walk;
* ptl after dropping the vma lock, or else revalidate
* those items after re-acquiring the vma lock and before
* accessing them.
+ * In the presence of high-granularity hugetlb entries,
+ * @hugetlb_entry is called only for leaf-level entries
+ * (hstate-level entries are ignored if they are not
+ * leaves).
* @test_walk: caller specific callback function to determine whether
* we walk over the current vma or not. Returning 0 means
* "do page table walk over the current vma", returning
@@ -58,9 +63,8 @@ struct mm_walk_ops {
unsigned long next, struct mm_walk *walk);
int (*pte_hole)(unsigned long addr, unsigned long next,
int depth, struct mm_walk *walk);
- int (*hugetlb_entry)(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long next,
- struct mm_walk *walk);
+ int (*hugetlb_entry)(struct hugetlb_pte *hpte,
+ unsigned long addr, struct mm_walk *walk);
int (*test_walk)(unsigned long addr, unsigned long next,
struct mm_walk *walk);
int (*pre_vma)(unsigned long start, unsigned long end,
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 9d92c5eb3a1f..2383f647f202 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -330,11 +330,12 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
}

#ifdef CONFIG_HUGETLB_PAGE
-static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
+static void damon_hugetlb_mkold(struct hugetlb_pte *hpte, pte_t entry,
+ struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long addr)
{
bool referenced = false;
- pte_t entry = huge_ptep_get(pte);
+ pte_t entry = huge_ptep_get(hpte->ptep);
struct folio *folio = pfn_folio(pte_pfn(entry));

folio_get(folio);
@@ -342,12 +343,12 @@ static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
if (pte_young(entry)) {
referenced = true;
entry = pte_mkold(entry);
- set_huge_pte_at(mm, addr, pte, entry);
+ set_huge_pte_at(mm, addr, hpte->ptep, entry);
}

#ifdef CONFIG_MMU_NOTIFIER
if (mmu_notifier_clear_young(mm, addr,
- addr + huge_page_size(hstate_vma(vma))))
+ addr + hugetlb_pte_size(hpte)))
referenced = true;
#endif /* CONFIG_MMU_NOTIFIER */

@@ -358,20 +359,26 @@ static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
folio_put(folio);
}

-static int damon_mkold_hugetlb_entry(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long end,
+static int damon_mkold_hugetlb_entry(struct hugetlb_pte *hpte,
+ unsigned long addr,
struct mm_walk *walk)
{
- struct hstate *h = hstate_vma(walk->vma);
spinlock_t *ptl;
pte_t entry;

- ptl = huge_pte_lock(h, walk->mm, pte);
- entry = huge_ptep_get(pte);
+ ptl = hugetlb_pte_lock(hpte);
+ entry = huge_ptep_get(hpte->ptep);
if (!pte_present(entry))
goto out;

- damon_hugetlb_mkold(pte, walk->mm, walk->vma, addr);
+ if (!hugetlb_pte_present_leaf(hpte, entry))
+ /*
+ * We raced with someone splitting a blank PTE. Treat this PTE
+ * as if it were blank.
+ */
+ goto out;
+
+ damon_hugetlb_mkold(hpte, entry, walk->mm, walk->vma, addr);

out:
spin_unlock(ptl);
@@ -484,8 +491,8 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
}

#ifdef CONFIG_HUGETLB_PAGE
-static int damon_young_hugetlb_entry(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long end,
+static int damon_young_hugetlb_entry(struct hugetlb_pte *hpte,
+ unsigned long addr,
struct mm_walk *walk)
{
struct damon_young_walk_private *priv = walk->private;
@@ -494,11 +501,18 @@ static int damon_young_hugetlb_entry(pte_t *pte, unsigned long hmask,
spinlock_t *ptl;
pte_t entry;

- ptl = huge_pte_lock(h, walk->mm, pte);
- entry = huge_ptep_get(pte);
+ ptl = hugetlb_pte_lock(hpte);
+ entry = huge_ptep_get(hpte->ptep);
if (!pte_present(entry))
goto out;

+ if (!hugetlb_pte_present_leaf(hpte, entry))
+ /*
+ * We raced with someone splitting a blank PTE. Treat this PTE
+ * as if it were blank.
+ */
+ goto out;
+
folio = pfn_folio(pte_pfn(entry));
folio_get(folio);

diff --git a/mm/hmm.c b/mm/hmm.c
index 6a151c09de5e..d3e40cfdd4cb 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -468,8 +468,8 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
#endif

#ifdef CONFIG_HUGETLB_PAGE
-static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
- unsigned long start, unsigned long end,
+static int hmm_vma_walk_hugetlb_entry(struct hugetlb_pte *hpte,
+ unsigned long start,
struct mm_walk *walk)
{
unsigned long addr = start, i, pfn;
@@ -479,16 +479,24 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
unsigned int required_fault;
unsigned long pfn_req_flags;
unsigned long cpu_flags;
+ unsigned long hmask = hugetlb_pte_mask(hpte);
+ unsigned int order = hpte->shift - PAGE_SHIFT;
+ unsigned long end = start + hugetlb_pte_size(hpte);
spinlock_t *ptl;
pte_t entry;

- ptl = huge_pte_lock(hstate_vma(vma), walk->mm, pte);
- entry = huge_ptep_get(pte);
+ ptl = hugetlb_pte_lock(hpte);
+ entry = huge_ptep_get(hpte->ptep);
+
+ if (!hugetlb_pte_present_leaf(hpte, entry)) {
+ spin_unlock(ptl);
+ return -EAGAIN;
+ }

i = (start - range->start) >> PAGE_SHIFT;
pfn_req_flags = range->hmm_pfns[i];
cpu_flags = pte_to_hmm_pfn_flags(range, entry) |
- hmm_pfn_flags_order(huge_page_order(hstate_vma(vma)));
+ hmm_pfn_flags_order(order);
required_fault =
hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, cpu_flags);
if (required_fault) {
@@ -605,7 +613,7 @@ int hmm_range_fault(struct hmm_range *range)
* in pfns. All entries < last in the pfn array are set to their
* output, and all >= are still at their input values.
*/
- } while (ret == -EBUSY);
+ } while (ret == -EBUSY || ret == -EAGAIN);
return ret;
}
EXPORT_SYMBOL(hmm_range_fault);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index c77a9e37e27e..e7e56298d305 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -641,6 +641,7 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
unsigned long poisoned_pfn, struct to_kill *tk)
{
unsigned long pfn = 0;
+ unsigned long base_pages_poisoned = (1UL << shift) / PAGE_SIZE;

if (pte_present(pte)) {
pfn = pte_pfn(pte);
@@ -651,7 +652,8 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
pfn = swp_offset_pfn(swp);
}

- if (!pfn || pfn != poisoned_pfn)
+ if (!pfn || pfn < poisoned_pfn ||
+ pfn >= poisoned_pfn + base_pages_poisoned)
return 0;

set_to_kill(tk, addr, shift);
@@ -717,16 +719,15 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
}

#ifdef CONFIG_HUGETLB_PAGE
-static int hwpoison_hugetlb_range(pte_t *ptep, unsigned long hmask,
- unsigned long addr, unsigned long end,
- struct mm_walk *walk)
+static int hwpoison_hugetlb_range(struct hugetlb_pte *hpte,
+ unsigned long addr,
+ struct mm_walk *walk)
{
struct hwp_walk *hwp = walk->private;
- pte_t pte = huge_ptep_get(ptep);
- struct hstate *h = hstate_vma(walk->vma);
+ pte_t pte = huge_ptep_get(hpte->ptep);

- return check_hwpoisoned_entry(pte, addr, huge_page_shift(h),
- hwp->pfn, &hwp->tk);
+ return check_hwpoisoned_entry(pte, addr & hugetlb_pte_mask(hpte),
+ hpte->shift, hwp->pfn, &hwp->tk);
}
#else
#define hwpoison_hugetlb_range NULL
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d3558248a0f0..e5859ed34e90 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -558,8 +558,8 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
return addr != end ? -EIO : 0;
}

-static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long end,
+static int queue_pages_hugetlb(struct hugetlb_pte *hpte,
+ unsigned long addr,
struct mm_walk *walk)
{
int ret = 0;
@@ -570,8 +570,12 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
spinlock_t *ptl;
pte_t entry;

- ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
- entry = huge_ptep_get(pte);
+ /* We don't migrate high-granularity HugeTLB mappings for now. */
+ if (hugetlb_hgm_enabled(walk->vma))
+ return -EINVAL;
+
+ ptl = hugetlb_pte_lock(hpte);
+ entry = huge_ptep_get(hpte->ptep);
if (!pte_present(entry))
goto unlock;
page = pte_page(entry);
diff --git a/mm/mincore.c b/mm/mincore.c
index a085a2aeabd8..0894965b3944 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -22,18 +22,29 @@
#include <linux/uaccess.h>
#include "swap.h"

-static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
- unsigned long end, struct mm_walk *walk)
+static int mincore_hugetlb(struct hugetlb_pte *hpte, unsigned long addr,
+ struct mm_walk *walk)
{
#ifdef CONFIG_HUGETLB_PAGE
unsigned char present;
+ unsigned long end = addr + hugetlb_pte_size(hpte);
unsigned char *vec = walk->private;
+ pte_t pte = huge_ptep_get(hpte->ptep);

/*
* Hugepages under user process are always in RAM and never
* swapped out, but theoretically it needs to be checked.
*/
- present = pte && !huge_pte_none(huge_ptep_get(pte));
+ present = !huge_pte_none(pte);
+
+ /*
+ * If the pte is present but not a leaf, we raced with someone
+ * splitting it. For someone to have split it, it must have been
+ * huge_pte_none before, so treat it as such.
+ */
+ if (pte_present(pte) && !hugetlb_pte_present_leaf(hpte, pte))
+ present = false;
+
for (; addr != end; vec++, addr += PAGE_SIZE)
*vec = present;
walk->private = vec;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 71358e45a742..62d8c5f7bc92 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -543,12 +543,16 @@ static int prot_none_pte_entry(pte_t *pte, unsigned long addr,
0 : -EACCES;
}

-static int prot_none_hugetlb_entry(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long next,
+static int prot_none_hugetlb_entry(struct hugetlb_pte *hpte,
+ unsigned long addr,
struct mm_walk *walk)
{
- return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
- 0 : -EACCES;
+ pte_t pte = huge_ptep_get(hpte->ptep);
+
+ if (!hugetlb_pte_present_leaf(hpte, pte))
+ return -EAGAIN;
+ return pfn_modify_allowed(pte_pfn(pte),
+ *(pgprot_t *)(walk->private)) ? 0 : -EACCES;
}

static int prot_none_test(unsigned long addr, unsigned long next,
@@ -591,8 +595,10 @@ mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
(newflags & VM_ACCESS_FLAGS) == 0) {
pgprot_t new_pgprot = vm_get_page_prot(newflags);

- error = walk_page_range(current->mm, start, end,
- &prot_none_walk_ops, &new_pgprot);
+ do {
+ error = walk_page_range(current->mm, start, end,
+ &prot_none_walk_ops, &new_pgprot);
+ } while (error == -EAGAIN);
if (error)
return error;
}
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index cb23f8a15c13..05ce242f8b7e 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -3,6 +3,7 @@
#include <linux/highmem.h>
#include <linux/sched.h>
#include <linux/hugetlb.h>
+#include <linux/minmax.h>

/*
* We want to know the real level where a entry is located ignoring any
@@ -296,20 +297,21 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
struct vm_area_struct *vma = walk->vma;
struct hstate *h = hstate_vma(vma);
unsigned long next;
- unsigned long hmask = huge_page_mask(h);
- unsigned long sz = huge_page_size(h);
- pte_t *pte;
const struct mm_walk_ops *ops = walk->ops;
int err = 0;
+ struct hugetlb_pte hpte;

hugetlb_vma_lock_read(vma);
do {
- next = hugetlb_entry_end(h, addr, end);
- pte = hugetlb_walk(vma, addr & hmask, sz);
- if (pte)
- err = ops->hugetlb_entry(pte, hmask, addr, next, walk);
- else if (ops->pte_hole)
- err = ops->pte_hole(addr, next, -1, walk);
+ if (hugetlb_full_walk(&hpte, vma, addr)) {
+ next = hugetlb_entry_end(h, addr, end);
+ if (ops->pte_hole)
+ err = ops->pte_hole(addr, next, -1, walk);
+ } else {
+ err = ops->hugetlb_entry(
+ &hpte, addr, walk);
+ next = min(addr + hugetlb_pte_size(&hpte), end);
+ }
if (err)
break;
} while (addr = next, addr != end);
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:59:25

by James Houghton

[permalink] [raw]
Subject: [PATCH 10/46] hugetlb: make huge_pte_lockptr take an explicit shift argument

This is needed to handle PTL locking with high-granularity mapping. We
won't always be using the PMD-level PTL even if we're using the 2M
hugepage hstate. It's possible that we're dealing with 4K PTEs, in
which case we need to lock the PTL for the 4K PTE.
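
For example (a usage sketch, not taken from this series), a caller that
has walked down to a 4K PTE inside a 2M page would do:

    /*
     * shift == PAGE_SHIFT here, so this returns &mm->page_table_lock
     * rather than the PMD split lock.
     */
    ptl = huge_pte_lockptr(PAGE_SHIFT, mm, ptep);
    spin_lock(ptl);
    /* ... */
    spin_unlock(ptl);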

Reviewed-by: Mina Almasry <[email protected]>
Acked-by: Mike Kravetz <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---
arch/powerpc/mm/pgtable.c | 3 ++-
include/linux/hugetlb.h | 9 ++++-----
mm/hugetlb.c | 7 ++++---
mm/migrate.c | 3 ++-
4 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index cb2dcdb18f8e..035a0df47af0 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -261,7 +261,8 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,

psize = hstate_get_psize(h);
#ifdef CONFIG_DEBUG_VM
- assert_spin_locked(huge_pte_lockptr(h, vma->vm_mm, ptep));
+ assert_spin_locked(huge_pte_lockptr(huge_page_shift(h),
+ vma->vm_mm, ptep));
#endif

#else
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 16fc3e381801..3f098363cd6e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -956,12 +956,11 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
return modified_mask;
}

-static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
+static inline spinlock_t *huge_pte_lockptr(unsigned int shift,
struct mm_struct *mm, pte_t *pte)
{
- if (huge_page_size(h) == PMD_SIZE)
+ if (shift == PMD_SHIFT)
return pmd_lockptr(mm, (pmd_t *) pte);
- VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
return &mm->page_table_lock;
}

@@ -1171,7 +1170,7 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
return 0;
}

-static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
+static inline spinlock_t *huge_pte_lockptr(unsigned int shift,
struct mm_struct *mm, pte_t *pte)
{
return &mm->page_table_lock;
@@ -1228,7 +1227,7 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
{
spinlock_t *ptl;

- ptl = huge_pte_lockptr(h, mm, pte);
+ ptl = huge_pte_lockptr(huge_page_shift(h), mm, pte);
spin_lock(ptl);
return ptl;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5bd53ae8ca4b..4db38dc79d0e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4987,7 +4987,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
}

dst_ptl = huge_pte_lock(h, dst, dst_pte);
- src_ptl = huge_pte_lockptr(h, src, src_pte);
+ src_ptl = huge_pte_lockptr(huge_page_shift(h), src, src_pte);
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
entry = huge_ptep_get(src_pte);
again:
@@ -5068,7 +5068,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,

/* Install the new huge page if src pte stable */
dst_ptl = huge_pte_lock(h, dst, dst_pte);
- src_ptl = huge_pte_lockptr(h, src, src_pte);
+ src_ptl = huge_pte_lockptr(huge_page_shift(h),
+ src, src_pte);
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
entry = huge_ptep_get(src_pte);
if (!pte_same(src_pte_old, entry)) {
@@ -5122,7 +5123,7 @@ static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
pte_t pte;

dst_ptl = huge_pte_lock(h, mm, dst_pte);
- src_ptl = huge_pte_lockptr(h, mm, src_pte);
+ src_ptl = huge_pte_lockptr(huge_page_shift(h), mm, src_pte);

/*
* We don't have to worry about the ordering of src and dst ptlocks
diff --git a/mm/migrate.c b/mm/migrate.c
index b5032c3e940a..832f639fc49a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -360,7 +360,8 @@ void __migration_entry_wait_huge(struct vm_area_struct *vma,

void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte)
{
- spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), vma->vm_mm, pte);
+ spinlock_t *ptl = huge_pte_lockptr(huge_page_shift(hstate_vma(vma)),
+ vma->vm_mm, pte);

__migration_entry_wait_huge(vma, pte, ptl);
}
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 10:59:28

by James Houghton

[permalink] [raw]
Subject: [PATCH 03/46] hugetlb: remove redundant pte_mkhuge in migration path

arch_make_huge_pte, which is called immediately following pte_mkhuge,
already makes the necessary changes to the PTE that pte_mkhuge would
have made. The generic implementation of arch_make_huge_pte simply
calls pte_mkhuge.
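
For reference, the generic implementation (roughly, in
include/linux/hugetlb.h) is just:

    #ifndef arch_make_huge_pte
    static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
                                           vm_flags_t flags)
    {
            return pte_mkhuge(entry);
    }
    #endif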

Acked-by: Peter Xu <[email protected]>
Acked-by: Mina Almasry <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: James Houghton <[email protected]>
---
mm/migrate.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 494b3753fda9..b5032c3e940a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -246,7 +246,6 @@ static bool remove_migration_pte(struct folio *folio,
if (folio_test_hugetlb(folio)) {
unsigned int shift = huge_page_shift(hstate_vma(vma));

- pte = pte_mkhuge(pte);
pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
if (folio_test_anon(folio))
hugepage_add_anon_rmap(new, vma, pvmw.address,
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 11:00:08

by James Houghton

[permalink] [raw]
Subject: [PATCH 35/46] hugetlb: add MADV_COLLAPSE for hugetlb

This is a necessary extension to the UFFDIO_CONTINUE changes. When
userspace finishes mapping an entire hugepage with UFFDIO_CONTINUE, the
kernel has no mechanism to automatically collapse the page table to map
the whole hugepage normally. We require userspace to inform us that they
would like the mapping to be collapsed; they do this with MADV_COLLAPSE.

If userspace has mapped only some of a hugepage with UFFDIO_CONTINUE,
rather than all of it, hugetlb_collapse will cause the requested range
to be mapped as if it had already been UFFDIO_CONTINUE'd. The effects
of any UFFDIO_WRITEPROTECT calls may be undone by a call to
MADV_COLLAPSE for intersecting address ranges.

This commit is co-opting the same madvise mode that has been introduced
to synchronously collapse THPs. The function that does THP collapsing
has been renamed to madvise_collapse_thp.

As with the rest of the high-granularity mapping support, MADV_COLLAPSE
is only supported for shared VMAs right now.

MADV_COLLAPSE has the same synchronization as huge_pmd_unshare.
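
For illustration, the expected userspace call after an entire hugepage
has been UFFDIO_CONTINUE'd (hypothetical sketch; "hpage_addr" and
"hpage_size" are placeholders, and error handling is simplified):

    if (madvise(hpage_addr, hpage_size, MADV_COLLAPSE))
            /* e.g. errno == EBUSY or EHWPOISON: not fully collapsed */
            perror("MADV_COLLAPSE");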

Signed-off-by: James Houghton <[email protected]>
---
include/linux/huge_mm.h | 12 +--
include/linux/hugetlb.h | 8 ++
mm/hugetlb.c | 164 ++++++++++++++++++++++++++++++++++++++++
mm/khugepaged.c | 4 +-
mm/madvise.c | 18 ++++-
5 files changed, 197 insertions(+), 9 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a1341fdcf666..5d1e3c980f74 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -218,9 +218,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,

int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
int advice);
-int madvise_collapse(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start, unsigned long end);
+int madvise_collapse_thp(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start, unsigned long end);
void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
unsigned long end, long adjust_next);
spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
@@ -367,9 +367,9 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
return -EINVAL;
}

-static inline int madvise_collapse(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start, unsigned long end)
+static inline int madvise_collapse_thp(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start, unsigned long end)
{
return -EINVAL;
}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c8524ac49b24..e1baf939afb6 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1298,6 +1298,8 @@ bool hugetlb_hgm_eligible(struct vm_area_struct *vma);
int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long start,
unsigned long end);
+int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end);
#else
static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
{
@@ -1318,6 +1320,12 @@ int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
{
return -EINVAL;
}
+static inline
+int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ return -EINVAL;
+}
#endif

static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5b6215e03fe1..388c46c7e77a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7852,6 +7852,170 @@ int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
return 0;
}

+static bool hugetlb_hgm_collapsable(struct vm_area_struct *vma)
+{
+ if (!hugetlb_hgm_eligible(vma))
+ return false;
+ if (!vma->vm_private_data) /* vma lock required for collapsing */
+ return false;
+ return true;
+}
+
+/*
+ * Collapse the address range from @start to @end to be mapped optimally.
+ *
+ * This is only valid for shared mappings. The main use case for this function
+ * is following UFFDIO_CONTINUE. If a user UFFDIO_CONTINUEs an entire hugepage
+ * by calling UFFDIO_CONTINUE once for each 4K region, the kernel doesn't know
+ * to collapse the mapping after the final UFFDIO_CONTINUE. Instead, we leave
+ * it up to userspace to tell us to do so, via MADV_COLLAPSE.
+ *
+ * Any holes in the mapping will be filled. If there is no page in the
+ * pagecache for a region we're collapsing, the PTEs will be cleared.
+ *
+ * If high-granularity PTEs are uffd-wp markers, those markers will be dropped.
+ */
+int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ struct hstate *h = hstate_vma(vma);
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ struct mmu_notifier_range range;
+ struct mmu_gather tlb;
+ unsigned long curr = start;
+ int ret = 0;
+ struct page *hpage, *subpage;
+ pgoff_t idx;
+ bool writable = vma->vm_flags & VM_WRITE;
+ bool shared = vma->vm_flags & VM_SHARED;
+ struct hugetlb_pte hpte;
+ pte_t entry;
+
+ /*
+ * This is only supported for shared VMAs, because we need to look up
+ * the page to use for any PTEs we end up creating.
+ */
+ if (!shared)
+ return -EINVAL;
+
+ /* If HGM is not enabled, there is nothing to collapse. */
+ if (!hugetlb_hgm_enabled(vma))
+ return 0;
+
+ /*
+ * We lost the VMA lock after splitting, so we can't safely collapse.
+ * We could improve this in the future (like take the mmap_lock for
+ * writing and try again), but for now just fail with ENOMEM.
+ */
+ if (unlikely(!hugetlb_hgm_collapsable(vma)))
+ return -ENOMEM;
+
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
+ start, end);
+ mmu_notifier_invalidate_range_start(&range);
+ tlb_gather_mmu(&tlb, mm);
+
+ /*
+ * Grab the VMA lock and mapping sem for writing. This will prevent
+ * concurrent high-granularity page table walks, so that we can safely
+ * collapse and free page tables.
+ *
+ * This is the same locking that huge_pmd_unshare requires.
+ */
+ hugetlb_vma_lock_write(vma);
+ i_mmap_lock_write(vma->vm_file->f_mapping);
+
+ while (curr < end) {
+ ret = hugetlb_alloc_largest_pte(&hpte, mm, vma, curr, end);
+ if (ret)
+ goto out;
+
+ entry = huge_ptep_get(hpte.ptep);
+
+ /*
+ * There is no work to do if the PTE doesn't point to page
+ * tables.
+ */
+ if (!pte_present(entry))
+ goto next_hpte;
+ if (hugetlb_pte_present_leaf(&hpte, entry))
+ goto next_hpte;
+
+ idx = vma_hugecache_offset(h, vma, curr);
+ hpage = find_get_page(mapping, idx);
+
+ if (hpage && !HPageMigratable(hpage)) {
+ /*
+ * Don't collapse a mapping to a page that is pending
+ * a migration. Migration swap entries may have placed
+ * in the page table.
+ */
+ ret = -EBUSY;
+ put_page(hpage);
+ goto out;
+ }
+
+ if (hpage && PageHWPoison(hpage)) {
+ /*
+ * Don't collapse a mapping to a page that is
+ * hwpoisoned.
+ */
+ ret = -EHWPOISON;
+ put_page(hpage);
+ /*
+ * By setting ret to -EHWPOISON, if nothing else
+ * happens, we will tell userspace that we couldn't
+ * fully collapse everything due to poison.
+ *
+ * Skip this page, and continue to collapse the rest
+ * of the mapping.
+ */
+ curr = (curr & huge_page_mask(h)) + huge_page_size(h);
+ continue;
+ }
+
+ /*
+ * Clear all the PTEs, and drop ref/mapcounts
+ * (on tlb_finish_mmu).
+ */
+ __unmap_hugepage_range(&tlb, vma, curr,
+ curr + hugetlb_pte_size(&hpte),
+ NULL,
+ ZAP_FLAG_DROP_MARKER);
+ /* Free the PTEs. */
+ hugetlb_free_pgd_range(&tlb,
+ curr, curr + hugetlb_pte_size(&hpte),
+ curr, curr + hugetlb_pte_size(&hpte));
+ if (!hpage) {
+ huge_pte_clear(mm, curr, hpte.ptep,
+ hugetlb_pte_size(&hpte));
+ goto next_hpte;
+ }
+
+ page_dup_file_rmap(hpage, true);
+
+ subpage = hugetlb_find_subpage(h, hpage, curr);
+ entry = make_huge_pte_with_shift(vma, subpage,
+ writable, hpte.shift);
+ set_huge_pte_at(mm, curr, hpte.ptep, entry);
+next_hpte:
+ curr += hugetlb_pte_size(&hpte);
+
+ if (curr < end) {
+ /* Don't hold the VMA lock for too long. */
+ hugetlb_vma_unlock_write(vma);
+ cond_resched();
+ hugetlb_vma_lock_write(vma);
+ }
+ }
+out:
+ i_mmap_unlock_write(vma->vm_file->f_mapping);
+ hugetlb_vma_unlock_write(vma);
+ tlb_finish_mmu(&tlb);
+ mmu_notifier_invalidate_range_end(&range);
+ return ret;
+}
+
#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */

/*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e1c7c1f357ef..cbeb7f00f1bf 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2718,8 +2718,8 @@ static int madvise_collapse_errno(enum scan_result r)
}
}

-int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
- unsigned long start, unsigned long end)
+int madvise_collapse_thp(struct vm_area_struct *vma, struct vm_area_struct **prev,
+ unsigned long start, unsigned long end)
{
struct collapse_control *cc;
struct mm_struct *mm = vma->vm_mm;
diff --git a/mm/madvise.c b/mm/madvise.c
index 04ee28992e52..fec47e9f845b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1029,6 +1029,18 @@ static int madvise_split(struct vm_area_struct *vma,
return 0;
}

+static int madvise_collapse(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start, unsigned long end)
+{
+ if (is_vm_hugetlb_page(vma)) {
+ *prev = vma;
+ return hugetlb_collapse(vma->vm_mm, vma, start, end);
+ }
+
+ return madvise_collapse_thp(vma, prev, start, end);
+}
+
/*
* Apply an madvise behavior to a region of a vma. madvise_update_vma
* will handle splitting a vm area into separate areas, each area with its own
@@ -1205,6 +1217,9 @@ madvise_behavior_valid(int behavior)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
case MADV_HUGEPAGE:
case MADV_NOHUGEPAGE:
+#endif
+#if defined(CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING) || \
+ defined(CONFIG_TRANSPARENT_HUGEPAGE)
case MADV_COLLAPSE:
#endif
#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
@@ -1398,7 +1413,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
* MADV_NOHUGEPAGE - mark the given range as not worth being backed by
* transparent huge pages so the existing pages will not be
* coalesced into THP and new pages will not be allocated as THP.
- * MADV_COLLAPSE - synchronously coalesce pages into new THP.
+ * MADV_COLLAPSE - synchronously coalesce pages into new THP, or, for HugeTLB
+ * pages, collapse the mapping.
* MADV_DONTDUMP - the application wants to prevent pages in the given range
* from being included in its core dump.
* MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 11:01:57

by James Houghton

[permalink] [raw]
Subject: [PATCH 33/46] hugetlb: userfaultfd: add support for high-granularity UFFDIO_CONTINUE

Changes here are similar to the changes made for hugetlb_no_page.

Pass vmf->real_address to userfaultfd_huge_must_wait because
vmf->address may be rounded down to the hugepage size, and a
high-granularity page table walk would look up the wrong PTE. Also
change the call to userfaultfd_must_wait in the same way for
consistency.

This commit introduces hugetlb_alloc_largest_pte which is used to find
the appropriate PTE size to map pages with UFFDIO_CONTINUE.

When MADV_SPLIT is provided, page fault events will report
PAGE_SIZE-aligned addresses instead of huge_page_size(h)-aligned
addresses, regardless of whether UFFD_FEATURE_EXACT_ADDRESS is used.
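
For illustration, a hypothetical fault-handling loop on the monitor
side that relies on this behavior (error handling omitted; 4K base
pages assumed):

    struct uffd_msg msg;

    read(uffd, &msg, sizeof(msg));
    if (msg.event == UFFD_EVENT_PAGEFAULT) {
            /* With MADV_SPLIT, this is PAGE_SIZE-aligned. */
            unsigned long addr = msg.arg.pagefault.address;

            /* ... copy 4K of data into the alias mapping ... */

            struct uffdio_continue cont = {
                    .range = { .start = addr, .len = 4096 },
            };
            ioctl(uffd, UFFDIO_CONTINUE, &cont);
    }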

Signed-off-by: James Houghton <[email protected]>
---
fs/userfaultfd.c | 14 +++----
include/linux/hugetlb.h | 18 ++++++++-
mm/hugetlb.c | 85 +++++++++++++++++++++++++++++++++--------
mm/userfaultfd.c | 40 +++++++++++--------
4 files changed, 119 insertions(+), 38 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 15a5bf765d43..940ff63096a9 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -252,17 +252,17 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
unsigned long flags,
unsigned long reason)
{
- pte_t *ptep, pte;
+ pte_t pte;
bool ret = true;
+ struct hugetlb_pte hpte;

mmap_assert_locked(ctx->mm);

- ptep = hugetlb_walk(vma, address, vma_mmu_pagesize(vma));
- if (!ptep)
+ if (hugetlb_full_walk(&hpte, vma, address))
goto out;

ret = false;
- pte = huge_ptep_get(ptep);
+ pte = huge_ptep_get(hpte.ptep);

/*
* Lockless access: we're in a wait_event so it's ok if it
@@ -531,11 +531,11 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
spin_unlock_irq(&ctx->fault_pending_wqh.lock);

if (!is_vm_hugetlb_page(vma))
- must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
- reason);
+ must_wait = userfaultfd_must_wait(ctx, vmf->real_address,
+ vmf->flags, reason);
else
must_wait = userfaultfd_huge_must_wait(ctx, vma,
- vmf->address,
+ vmf->real_address,
vmf->flags, reason);
if (is_vm_hugetlb_page(vma))
hugetlb_vma_unlock_read(vma);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8a664a9dd0a8..c8524ac49b24 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -224,7 +224,8 @@ unsigned long hugetlb_total_pages(void);
vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, unsigned int flags);
#ifdef CONFIG_USERFAULTFD
-int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
+int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
+ struct hugetlb_pte *dst_hpte,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
unsigned long src_addr,
@@ -1292,16 +1293,31 @@ static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)

#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
+bool hugetlb_hgm_advised(struct vm_area_struct *vma);
bool hugetlb_hgm_eligible(struct vm_area_struct *vma);
+int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long start,
+ unsigned long end);
#else
static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
{
return false;
}
+static inline bool hugetlb_hgm_advised(struct vm_area_struct *vma)
+{
+ return false;
+}
static inline bool hugetlb_hgm_eligible(struct vm_area_struct *vma)
{
return false;
}
+static inline
+int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long start,
+ unsigned long end)
+{
+ return -EINVAL;
+}
#endif

static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1eef6968b1fa..5af6db52f34e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5936,6 +5936,13 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
unsigned long addr,
unsigned long reason)
{
+ /*
+ * Don't use the hpage-aligned address if the user has explicitly
+ * enabled HGM.
+ */
+ if (hugetlb_hgm_advised(vma) && reason == VM_UFFD_MINOR)
+ haddr = address & PAGE_MASK;
+
u32 hash;
struct vm_fault vmf = {
.vma = vma,
@@ -6420,7 +6427,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* modifications for huge pages.
*/
int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
- pte_t *dst_pte,
+ struct hugetlb_pte *dst_hpte,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
unsigned long src_addr,
@@ -6431,13 +6438,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
struct hstate *h = hstate_vma(dst_vma);
struct address_space *mapping = dst_vma->vm_file->f_mapping;
- pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
+ unsigned long haddr = dst_addr & huge_page_mask(h);
+ pgoff_t idx = vma_hugecache_offset(h, dst_vma, haddr);
unsigned long size;
int vm_shared = dst_vma->vm_flags & VM_SHARED;
pte_t _dst_pte;
spinlock_t *ptl;
int ret = -ENOMEM;
- struct page *page;
+ struct page *page, *subpage;
int writable;
bool page_in_pagecache = false;

@@ -6452,12 +6460,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
* a non-missing case. Return -EEXIST.
*/
if (vm_shared &&
- hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+ hugetlbfs_pagecache_present(h, dst_vma, haddr)) {
ret = -EEXIST;
goto out;
}

- page = alloc_huge_page(dst_vma, dst_addr, 0);
+ page = alloc_huge_page(dst_vma, haddr, 0);
if (IS_ERR(page)) {
ret = -ENOMEM;
goto out;
@@ -6473,13 +6481,13 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
/* Free the allocated page which may have
* consumed a reservation.
*/
- restore_reserve_on_error(h, dst_vma, dst_addr, page);
+ restore_reserve_on_error(h, dst_vma, haddr, page);
put_page(page);

/* Allocate a temporary page to hold the copied
* contents.
*/
- page = alloc_huge_page_vma(h, dst_vma, dst_addr);
+ page = alloc_huge_page_vma(h, dst_vma, haddr);
if (!page) {
ret = -ENOMEM;
goto out;
@@ -6493,14 +6501,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
}
} else {
if (vm_shared &&
- hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+ hugetlbfs_pagecache_present(h, dst_vma, haddr)) {
put_page(*pagep);
ret = -EEXIST;
*pagep = NULL;
goto out;
}

- page = alloc_huge_page(dst_vma, dst_addr, 0);
+ page = alloc_huge_page(dst_vma, haddr, 0);
if (IS_ERR(page)) {
put_page(*pagep);
ret = -ENOMEM;
@@ -6548,7 +6556,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
page_in_pagecache = true;
}

- ptl = huge_pte_lock(h, dst_mm, dst_pte);
+ ptl = hugetlb_pte_lock(dst_hpte);

ret = -EIO;
if (PageHWPoison(page))
@@ -6560,7 +6568,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
* page backing it, then access the page.
*/
ret = -EEXIST;
- if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
+ if (!huge_pte_none_mostly(huge_ptep_get(dst_hpte->ptep)))
goto out_release_unlock;

if (page_in_pagecache)
@@ -6577,7 +6585,10 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
else
writable = dst_vma->vm_flags & VM_WRITE;

- _dst_pte = make_huge_pte(dst_vma, page, writable);
+ subpage = hugetlb_find_subpage(h, page, dst_addr);
+
+ _dst_pte = make_huge_pte_with_shift(dst_vma, subpage, writable,
+ dst_hpte->shift);
/*
* Always mark UFFDIO_COPY page dirty; note that this may not be
* extremely important for hugetlbfs for now since swapping is not
@@ -6590,12 +6601,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
if (wp_copy)
_dst_pte = huge_pte_mkuffd_wp(_dst_pte);

- set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+ set_huge_pte_at(dst_mm, dst_addr, dst_hpte->ptep, _dst_pte);

- hugetlb_count_add(pages_per_huge_page(h), dst_mm);
+ hugetlb_count_add(hugetlb_pte_size(dst_hpte) / PAGE_SIZE, dst_mm);

/* No need to invalidate - it was non-present before */
- update_mmu_cache(dst_vma, dst_addr, dst_pte);
+ update_mmu_cache(dst_vma, dst_addr, dst_hpte->ptep);

spin_unlock(ptl);
if (!is_continue)
@@ -7780,6 +7791,18 @@ bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
{
return vma && (vma->vm_flags & VM_HUGETLB_HGM);
}
+bool hugetlb_hgm_advised(struct vm_area_struct *vma)
+{
+ /*
+ * Right now, the only way for HGM to be enabled is if a user
+ * explicitly enables it via MADV_SPLIT, but in the future, there
+ * may be cases where it gets enabled automatically.
+ *
+ * Provide hugetlb_hgm_advised() now for call sites that care whether the
+ * user explicitly enabled HGM.
+ */
+ return hugetlb_hgm_enabled(vma);
+}
/* Should only be used by the for_each_hgm_shift macro. */
static unsigned int __shift_for_hstate(struct hstate *h)
{
@@ -7798,6 +7821,38 @@ static unsigned int __shift_for_hstate(struct hstate *h)
(tmp_h) <= &hstates[hugetlb_max_hstate]; \
(tmp_h)++)

+/*
+ * Find the HugeTLB PTE that maps as much of [start, end) as possible with a
+ * single page table entry. It is returned in @hpte.
+ */
+int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long start,
+ unsigned long end)
+{
+ struct hstate *h = hstate_vma(vma), *tmp_h;
+ unsigned int shift;
+ unsigned long sz;
+ int ret;
+
+ for_each_hgm_shift(h, tmp_h, shift) {
+ sz = 1UL << shift;
+
+ if (!IS_ALIGNED(start, sz) || start + sz > end)
+ continue;
+ goto found;
+ }
+ return -EINVAL;
+found:
+ ret = hugetlb_full_walk_alloc(hpte, vma, start, sz);
+ if (ret)
+ return ret;
+
+ if (hpte->shift > shift)
+ return -EEXIST;
+
+ return 0;
+}
+
#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */

/*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 65ad172add27..2b233d31be24 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -320,14 +320,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
{
int vm_shared = dst_vma->vm_flags & VM_SHARED;
ssize_t err;
- pte_t *dst_pte;
unsigned long src_addr, dst_addr;
long copied;
struct page *page;
- unsigned long vma_hpagesize;
+ unsigned long vma_hpagesize, target_pagesize;
pgoff_t idx;
u32 hash;
struct address_space *mapping;
+ bool use_hgm = hugetlb_hgm_advised(dst_vma) &&
+ mode == MCOPY_ATOMIC_CONTINUE;
+ struct hstate *h = hstate_vma(dst_vma);

/*
* There is no default zero huge page for all huge page sizes as
@@ -345,12 +347,13 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
copied = 0;
page = NULL;
vma_hpagesize = vma_kernel_pagesize(dst_vma);
+ target_pagesize = use_hgm ? PAGE_SIZE : vma_hpagesize;

/*
- * Validate alignment based on huge page size
+ * Validate alignment based on the targeted page size.
*/
err = -EINVAL;
- if (dst_start & (vma_hpagesize - 1) || len & (vma_hpagesize - 1))
+ if (dst_start & (target_pagesize - 1) || len & (target_pagesize - 1))
goto out_unlock;

retry:
@@ -381,13 +384,14 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
}

while (src_addr < src_start + len) {
+ struct hugetlb_pte hpte;
BUG_ON(dst_addr >= dst_start + len);

/*
* Serialize via vma_lock and hugetlb_fault_mutex.
- * vma_lock ensures the dst_pte remains valid even
- * in the case of shared pmds. fault mutex prevents
- * races with other faulting threads.
+ * vma_lock ensures the hpte.ptep remains valid even
+ * in the case of shared pmds and page table collapsing.
+ * fault mutex prevents races with other faulting threads.
*/
idx = linear_page_index(dst_vma, dst_addr);
mapping = dst_vma->vm_file->f_mapping;
@@ -395,23 +399,28 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
mutex_lock(&hugetlb_fault_mutex_table[hash]);
hugetlb_vma_lock_read(dst_vma);

- err = -ENOMEM;
- dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
- if (!dst_pte) {
+ if (use_hgm)
+ err = hugetlb_alloc_largest_pte(&hpte, dst_mm, dst_vma,
+ dst_addr,
+ dst_start + len);
+ else
+ err = hugetlb_full_walk_alloc(&hpte, dst_vma, dst_addr,
+ vma_hpagesize);
+ if (err) {
hugetlb_vma_unlock_read(dst_vma);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
goto out_unlock;
}

if (mode != MCOPY_ATOMIC_CONTINUE &&
- !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
+ !huge_pte_none_mostly(huge_ptep_get(hpte.ptep))) {
err = -EEXIST;
hugetlb_vma_unlock_read(dst_vma);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
goto out_unlock;
}

- err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
+ err = hugetlb_mcopy_atomic_pte(dst_mm, &hpte, dst_vma,
dst_addr, src_addr, mode, &page,
wp_copy);

@@ -423,6 +432,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
if (unlikely(err == -ENOENT)) {
mmap_read_unlock(dst_mm);
BUG_ON(!page);
+ WARN_ON_ONCE(hpte.shift != huge_page_shift(h));

err = copy_huge_page_from_user(page,
(const void __user *)src_addr,
@@ -440,9 +450,9 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
BUG_ON(page);

if (!err) {
- dst_addr += vma_hpagesize;
- src_addr += vma_hpagesize;
- copied += vma_hpagesize;
+ dst_addr += hugetlb_pte_size(&hpte);
+ src_addr += hugetlb_pte_size(&hpte);
+ copied += hugetlb_pte_size(&hpte);

if (fatal_signal_pending(current))
err = -EINTR;
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 11:02:28

by James Houghton

[permalink] [raw]
Subject: [PATCH 26/46] hugetlb: add HGM support for copy_hugetlb_page_range

This allows fork() to work with high-granularity mappings. The page
table structure is copied such that partially mapped regions will remain
partially mapped in the same way for the new process.

A page's reference count is incremented for *each* portion of it that is
mapped in the page table. For example, if you have a PMD-mapped 1G page,
the reference count and mapcount will be incremented by 512.

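For illustration only (not part of the patch), the bookkeeping for the
example above works out as follows; SZ_1G, PMD_SIZE and PAGE_SIZE are the
usual kernel constants:

	/* A 1G page that the source maps with PMD-level hugetlb_ptes. */
	unsigned long pieces    = SZ_1G / PMD_SIZE;   /* 512 mappings copied       */
	unsigned long ref_delta = pieces;             /* one get_page() per piece  */
	unsigned long rss_delta = SZ_1G / PAGE_SIZE;  /* summed hugetlb_count_add()*/
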
Signed-off-by: James Houghton <[email protected]>
---
mm/hugetlb.c | 75 ++++++++++++++++++++++++++++++++++------------------
1 file changed, 50 insertions(+), 25 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 718572444a73..21a5116f509b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5106,7 +5106,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
struct vm_area_struct *src_vma)
{
pte_t *src_pte, *dst_pte, entry;
- struct page *ptepage;
+ struct hugetlb_pte src_hpte, dst_hpte;
+ struct page *ptepage, *hpage;
unsigned long addr;
bool cow = is_cow_mapping(src_vma->vm_flags);
struct hstate *h = hstate_vma(src_vma);
@@ -5126,26 +5127,34 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
} else {
/*
* For shared mappings the vma lock must be held before
- * calling hugetlb_walk() in the src vma. Otherwise, the
- * returned ptep could go away if part of a shared pmd and
- * another thread calls huge_pmd_unshare.
+ * calling hugetlb_full_walk() in the src vma. Otherwise, the
+ * returned hpte could go away if
+ * - part of a shared pmd and another thread calls
+ *   huge_pmd_unshare, or
+ * - another thread collapses a high-granularity mapping.
*/
hugetlb_vma_lock_read(src_vma);
}

last_addr_mask = hugetlb_mask_last_page(h);
- for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
+ addr = src_vma->vm_start;
+ while (addr < src_vma->vm_end) {
spinlock_t *src_ptl, *dst_ptl;
- src_pte = hugetlb_walk(src_vma, addr, sz);
- if (!src_pte) {
- addr |= last_addr_mask;
+ unsigned long hpte_sz;
+
+ if (hugetlb_full_walk(&src_hpte, src_vma, addr)) {
+ addr = (addr | last_addr_mask) + sz;
continue;
}
- dst_pte = huge_pte_alloc(dst, dst_vma, addr, sz);
- if (!dst_pte) {
- ret = -ENOMEM;
+ ret = hugetlb_full_walk_alloc(&dst_hpte, dst_vma, addr,
+ hugetlb_pte_size(&src_hpte));
+ if (ret)
break;
- }
+
+ src_pte = src_hpte.ptep;
+ dst_pte = dst_hpte.ptep;
+
+ hpte_sz = hugetlb_pte_size(&src_hpte);

/*
* If the pagetables are shared don't copy or take references.
@@ -5155,13 +5164,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
* another vma. So page_count of ptep page is checked instead
* to reliably determine whether pte is shared.
*/
- if (page_count(virt_to_page(dst_pte)) > 1) {
- addr |= last_addr_mask;
+ if (hugetlb_pte_size(&dst_hpte) == sz &&
+ page_count(virt_to_page(dst_pte)) > 1) {
+ addr = (addr | last_addr_mask) + sz;
continue;
}

- dst_ptl = huge_pte_lock(h, dst, dst_pte);
- src_ptl = huge_pte_lockptr(huge_page_shift(h), src, src_pte);
+ dst_ptl = hugetlb_pte_lock(&dst_hpte);
+ src_ptl = hugetlb_pte_lockptr(&src_hpte);
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
entry = huge_ptep_get(src_pte);
again:
@@ -5205,10 +5215,15 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
*/
if (userfaultfd_wp(dst_vma))
set_huge_pte_at(dst, addr, dst_pte, entry);
+ } else if (!hugetlb_pte_present_leaf(&src_hpte, entry)) {
+ /* Retry the walk. */
+ spin_unlock(src_ptl);
+ spin_unlock(dst_ptl);
+ continue;
} else {
- entry = huge_ptep_get(src_pte);
ptepage = pte_page(entry);
- get_page(ptepage);
+ hpage = compound_head(ptepage);
+ get_page(hpage);

/*
* Failing to duplicate the anon rmap is a rare case
@@ -5220,25 +5235,31 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
* need to be without the pgtable locks since we could
* sleep during the process.
*/
- if (!PageAnon(ptepage)) {
- page_dup_file_rmap(ptepage, true);
- } else if (page_try_dup_anon_rmap(ptepage, true,
+ if (!PageAnon(hpage)) {
+ page_dup_file_rmap(hpage, true);
+ } else if (page_try_dup_anon_rmap(hpage, true,
src_vma)) {
pte_t src_pte_old = entry;
struct page *new;

+ if (hugetlb_pte_size(&src_hpte) != sz) {
+ put_page(hpage);
+ ret = -EINVAL;
+ break;
+ }
+
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
/* Do not use reserve as it's private owned */
new = alloc_huge_page(dst_vma, addr, 1);
if (IS_ERR(new)) {
- put_page(ptepage);
+ put_page(hpage);
ret = PTR_ERR(new);
break;
}
- copy_user_huge_page(new, ptepage, addr, dst_vma,
+ copy_user_huge_page(new, hpage, addr, dst_vma,
npages);
- put_page(ptepage);
+ put_page(hpage);

/* Install the new huge page if src pte stable */
dst_ptl = huge_pte_lock(h, dst, dst_pte);
@@ -5256,6 +5277,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
hugetlb_install_page(dst_vma, dst_pte, addr, new);
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
+ addr += hugetlb_pte_size(&src_hpte);
continue;
}

@@ -5272,10 +5294,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
}

set_huge_pte_at(dst, addr, dst_pte, entry);
- hugetlb_count_add(npages, dst);
+ hugetlb_count_add(
+ hugetlb_pte_size(&dst_hpte) / PAGE_SIZE,
+ dst);
}
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
+ addr += hugetlb_pte_size(&src_hpte);
}

if (cow) {
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 11:05:01

by James Houghton

[permalink] [raw]
Subject: [PATCH 39/46] hugetlb: x86: enable high-granularity mapping

Now that HGM is fully supported for GENERAL_HUGETLB, x86 can enable it.
The x86 KVM MMU already properly handles HugeTLB HGM pages (it does a
page table walk to determine which size to use in the second-stage page
table instead of, for example, checking vma_mmu_pagesize, like arm64
does).

We could also enable HugeTLB HGM for arm (32-bit) at this point, as it
also uses GENERAL_HUGETLB and I don't see anything else that is needed
for it. However, I haven't tested on arm at all, so I won't enable it.

Signed-off-by: James Houghton <[email protected]>
---
arch/x86/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3604074a878b..3d08cd45549c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -126,6 +126,7 @@ config X86
select ARCH_WANT_GENERAL_HUGETLB
select ARCH_WANT_HUGE_PMD_SHARE
select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP if X86_64
+ select ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING
select ARCH_WANT_LD_ORPHAN_WARN
select ARCH_WANTS_THP_SWAP if X86_64
select ARCH_HAS_PARANOID_L1D_FLUSH
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 11:14:22

by James Houghton

[permalink] [raw]
Subject: [PATCH 19/46] hugetlb: add HGM support for follow_hugetlb_page

This enables high-granularity mapping support in GUP.

In case it is confusing: pfn_offset is the offset (in PAGE_SIZE units) of
vaddr within the piece of the hugepage that hpte maps.

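A worked example, for illustration only (the sizes and offsets are made
up; pte, vaddr and hpte are the local variables used in the hunk below):

	/*
	 * Suppose hpte maps a PMD-sized (2M) piece of a 1G page and vaddr
	 * is 0x5000 bytes into that piece.
	 */
	pfn_offset = (vaddr & ~hugetlb_pte_mask(&hpte)) >> PAGE_SHIFT; /* = 5 */
	subpage = pte_page(pte);        /* first 4K page of the 2M piece */
	page = compound_head(subpage);  /* the 1G head page              */
	/* GUP records pages starting from nth_page(subpage, pfn_offset). */
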
Signed-off-by: James Houghton <[email protected]>
---
mm/hugetlb.c | 59 ++++++++++++++++++++++++++++++++--------------------
1 file changed, 37 insertions(+), 22 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 73672d806172..30fea414d9ee 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6532,11 +6532,9 @@ static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
}

static inline bool __follow_hugetlb_must_fault(struct vm_area_struct *vma,
- unsigned int flags, pte_t *pte,
+ unsigned int flags, pte_t pteval,
bool *unshare)
{
- pte_t pteval = huge_ptep_get(pte);
-
*unshare = false;
if (is_swap_pte(pteval))
return true;
@@ -6611,11 +6609,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
int err = -EFAULT, refs;

while (vaddr < vma->vm_end && remainder) {
- pte_t *pte;
+ pte_t *ptep, pte;
spinlock_t *ptl = NULL;
bool unshare = false;
int absent;
- struct page *page;
+ unsigned long pages_per_hpte;
+ struct page *page, *subpage;
+ struct hugetlb_pte hpte;

/*
* If we have a pending SIGKILL, don't keep faulting pages and
@@ -6632,13 +6632,19 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
* each hugepage. We have to make sure we get the
* first, for the page indexing below to work.
*
- * Note that page table lock is not held when pte is null.
+ * hugetlb_full_walk will mask the address appropriately.
+ *
+ * Note that page table lock is not held when ptep is null.
*/
- pte = hugetlb_walk(vma, vaddr & huge_page_mask(h),
- huge_page_size(h));
- if (pte)
- ptl = huge_pte_lock(h, mm, pte);
- absent = !pte || huge_pte_none(huge_ptep_get(pte));
+ if (hugetlb_full_walk(&hpte, vma, vaddr)) {
+ ptep = NULL;
+ absent = true;
+ } else {
+ ptl = hugetlb_pte_lock(&hpte);
+ ptep = hpte.ptep;
+ pte = huge_ptep_get(ptep);
+ absent = huge_pte_none(pte);
+ }

/*
* When coredumping, it suits get_dump_page if we just return
@@ -6649,13 +6655,20 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
if (absent && (flags & FOLL_DUMP) &&
!hugetlbfs_pagecache_present(h, vma, vaddr)) {
- if (pte)
+ if (ptep)
spin_unlock(ptl);
hugetlb_vma_unlock_read(vma);
remainder = 0;
break;
}

+ if (!absent && pte_present(pte) &&
+ !hugetlb_pte_present_leaf(&hpte, pte)) {
+ /* We raced with someone splitting the PTE, so retry. */
+ spin_unlock(ptl);
+ continue;
+ }
+
/*
* We need call hugetlb_fault for both hugepages under migration
* (in which case hugetlb_fault waits for the migration,) and
@@ -6671,7 +6684,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
vm_fault_t ret;
unsigned int fault_flags = 0;

- if (pte)
+ if (ptep)
spin_unlock(ptl);
hugetlb_vma_unlock_read(vma);

@@ -6720,8 +6733,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
continue;
}

- pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
- page = pte_page(huge_ptep_get(pte));
+ pfn_offset = (vaddr & ~hugetlb_pte_mask(&hpte)) >> PAGE_SHIFT;
+ subpage = pte_page(pte);
+ pages_per_hpte = hugetlb_pte_size(&hpte) / PAGE_SIZE;
+ page = compound_head(subpage);

VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
!PageAnonExclusive(page), page);
@@ -6731,22 +6746,22 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
* and skip the same_page loop below.
*/
if (!pages && !vmas && !pfn_offset &&
- (vaddr + huge_page_size(h) < vma->vm_end) &&
- (remainder >= pages_per_huge_page(h))) {
- vaddr += huge_page_size(h);
- remainder -= pages_per_huge_page(h);
- i += pages_per_huge_page(h);
+ (vaddr + pages_per_hpte < vma->vm_end) &&
+ (remainder >= pages_per_hpte)) {
+ vaddr += pages_per_hpte;
+ remainder -= pages_per_hpte;
+ i += pages_per_hpte;
spin_unlock(ptl);
hugetlb_vma_unlock_read(vma);
continue;
}

/* vaddr may not be aligned to PAGE_SIZE */
- refs = min3(pages_per_huge_page(h) - pfn_offset, remainder,
+ refs = min3(pages_per_hpte - pfn_offset, remainder,
(vma->vm_end - ALIGN_DOWN(vaddr, PAGE_SIZE)) >> PAGE_SHIFT);

if (pages || vmas)
- record_subpages_vmas(nth_page(page, pfn_offset),
+ record_subpages_vmas(nth_page(subpage, pfn_offset),
vma, refs,
likely(pages) ? pages + i : NULL,
vmas ? vmas + i : NULL);
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 11:17:28

by James Houghton

[permalink] [raw]
Subject: [PATCH 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step

hugetlb_hgm_walk implements high-granularity page table walks for
HugeTLB. It is safe to call on VMAs that do not have HGM enabled; it
will return immediately.

hugetlb_walk_step implements how we step forwards in the walk.
Architectures that don't use GENERAL_HUGETLB will need to provide their
own implementation.

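For illustration only (not taken from this patch), a caller that already
holds the locks hugetlb_walk() requires might use the full-walk API like
this; vma and addr are assumed to be valid:

	struct hugetlb_pte hpte;

	if (!hugetlb_full_walk(&hpte, vma, addr)) {
		spinlock_t *ptl = hugetlb_pte_lock(&hpte);
		pte_t pte = huge_ptep_get(hpte.ptep);

		if (hugetlb_pte_present_leaf(&hpte, pte)) {
			/* pte maps hugetlb_pte_size(&hpte) bytes here. */
		}
		spin_unlock(ptl);
	}
	/* else: the hstate-level PTE was not allocated; nothing is mapped. */
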
Signed-off-by: James Houghton <[email protected]>
---
include/linux/hugetlb.h | 35 +++++--
mm/hugetlb.c | 213 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 242 insertions(+), 6 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ad9d19f0d1b9..2fcd8f313628 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -239,6 +239,14 @@ u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx);
pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pud_t *pud);

+int hugetlb_full_walk(struct hugetlb_pte *hpte, struct vm_area_struct *vma,
+ unsigned long addr);
+void hugetlb_full_walk_continue(struct hugetlb_pte *hpte,
+ struct vm_area_struct *vma, unsigned long addr);
+int hugetlb_full_walk_alloc(struct hugetlb_pte *hpte,
+ struct vm_area_struct *vma, unsigned long addr,
+ unsigned long target_sz);
+
struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);

extern int sysctl_hugetlb_shm_group;
@@ -288,6 +296,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
pte_t *huge_pte_offset(struct mm_struct *mm,
unsigned long addr, unsigned long sz);
unsigned long hugetlb_mask_last_page(struct hstate *h);
+int hugetlb_walk_step(struct mm_struct *mm, struct hugetlb_pte *hpte,
+ unsigned long addr, unsigned long sz);
int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep);
void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
@@ -1067,6 +1077,8 @@ void hugetlb_register_node(struct node *node);
void hugetlb_unregister_node(struct node *node);
#endif

+enum hugetlb_level hpage_size_to_level(unsigned long sz);
+
#else /* CONFIG_HUGETLB_PAGE */
struct hstate {};

@@ -1259,6 +1271,11 @@ static inline void hugetlb_register_node(struct node *node)
static inline void hugetlb_unregister_node(struct node *node)
{
}
+
+static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
+{
+ return HUGETLB_LEVEL_PTE;
+}
#endif /* CONFIG_HUGETLB_PAGE */

#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
@@ -1333,12 +1350,8 @@ __vma_has_hugetlb_vma_lock(struct vm_area_struct *vma)
return (vma->vm_flags & VM_MAYSHARE) && vma->vm_private_data;
}

-/*
- * Safe version of huge_pte_offset() to check the locks. See comments
- * above huge_pte_offset().
- */
-static inline pte_t *
-hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
+static inline void
+hugetlb_walk_lock_check(struct vm_area_struct *vma)
{
#if defined(CONFIG_HUGETLB_PAGE) && \
defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP)
@@ -1360,6 +1373,16 @@ hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
!lockdep_is_held(
&vma->vm_file->f_mapping->i_mmap_rwsem));
#endif
+}
+
+/*
+ * Safe version of huge_pte_offset() to check the locks. See comments
+ * above huge_pte_offset().
+ */
+static inline pte_t *
+hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
+{
+ hugetlb_walk_lock_check(vma);
return huge_pte_offset(vma->vm_mm, addr, sz);
}

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2160cbaf3311..aa8e59cbca69 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -94,6 +94,29 @@ static int hugetlb_acct_memory(struct hstate *h, long delta);
static void hugetlb_vma_lock_free(struct vm_area_struct *vma);
static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);

+/*
+ * hpage_size_to_level() - convert @sz to the corresponding page table level
+ *
+ * @sz must be less than or equal to a valid hugepage size.
+ */
+enum hugetlb_level hpage_size_to_level(unsigned long sz)
+{
+ /*
+ * We order the conditionals from smallest to largest to pick the
+ * smallest level when multiple levels have the same size (i.e.,
+ * when levels are folded).
+ */
+ if (sz < PMD_SIZE)
+ return HUGETLB_LEVEL_PTE;
+ if (sz < PUD_SIZE)
+ return HUGETLB_LEVEL_PMD;
+ if (sz < P4D_SIZE)
+ return HUGETLB_LEVEL_PUD;
+ if (sz < PGDIR_SIZE)
+ return HUGETLB_LEVEL_P4D;
+ return HUGETLB_LEVEL_PGD;
+}
+
static inline bool subpool_is_free(struct hugepage_subpool *spool)
{
if (spool->count)
@@ -7276,6 +7299,153 @@ bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
}
#endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */

+/* hugetlb_hgm_walk - walks a high-granularity HugeTLB page table to resolve
+ * the page table entry for @addr. We might allocate new PTEs.
+ *
+ * @hpte must always be pointing at an hstate-level PTE or deeper.
+ *
+ * This function will never walk further if it encounters a PTE of a size
+ * less than or equal to @sz.
+ *
+ * @alloc determines what we do when we encounter an empty PTE. If false,
+ * we stop walking. If true and @sz is less than the current PTE's size,
+ * we make that PTE point to the next level down, going until @sz is the same
+ * as our current PTE.
+ *
+ * If @alloc is false and @sz is PAGE_SIZE, this function will always
+ * succeed, but that does not guarantee that hugetlb_pte_size(hpte) is @sz.
+ *
+ * Return:
+ * -ENOMEM if we couldn't allocate new PTEs.
+ * -EEXIST if the caller wanted to walk further than a migration PTE,
+ * poison PTE, or a PTE marker. The caller needs to manually deal
+ * with this scenario.
+ * -EINVAL if called with invalid arguments (@sz invalid, @hpte not
+ * initialized).
+ * 0 otherwise.
+ *
+ * Even if this function fails, @hpte is guaranteed to always remain
+ * valid.
+ */
+static int hugetlb_hgm_walk(struct mm_struct *mm, struct vm_area_struct *vma,
+ struct hugetlb_pte *hpte, unsigned long addr,
+ unsigned long sz, bool alloc)
+{
+ int ret = 0;
+ pte_t pte;
+
+ if (WARN_ON_ONCE(sz < PAGE_SIZE))
+ return -EINVAL;
+
+ if (WARN_ON_ONCE(!hpte->ptep))
+ return -EINVAL;
+
+ /* We have the same synchronization requirements as hugetlb_walk. */
+ hugetlb_walk_lock_check(vma);
+
+ while (hugetlb_pte_size(hpte) > sz && !ret) {
+ pte = huge_ptep_get(hpte->ptep);
+ if (!pte_present(pte)) {
+ if (!alloc)
+ return 0;
+ if (unlikely(!huge_pte_none(pte)))
+ return -EEXIST;
+ } else if (hugetlb_pte_present_leaf(hpte, pte))
+ return 0;
+ ret = hugetlb_walk_step(mm, hpte, addr, sz);
+ }
+
+ return ret;
+}
+
+static int hugetlb_hgm_walk_uninit(struct hugetlb_pte *hpte,
+ pte_t *ptep,
+ struct vm_area_struct *vma,
+ unsigned long addr,
+ unsigned long target_sz,
+ bool alloc)
+{
+ struct hstate *h = hstate_vma(vma);
+
+ hugetlb_pte_populate(vma->vm_mm, hpte, ptep, huge_page_shift(h),
+ hpage_size_to_level(huge_page_size(h)));
+ return hugetlb_hgm_walk(vma->vm_mm, vma, hpte, addr, target_sz,
+ alloc);
+}
+
+/*
+ * hugetlb_full_walk_continue - continue a high-granularity page-table walk.
+ *
+ * If a user has a valid @hpte but knows that @hpte is not a leaf, they can
+ * attempt to continue walking by calling this function.
+ *
+ * This function cannot fail, but @hpte might not change.
+ *
+ * If @hpte is not valid, then this function is a no-op.
+ */
+void hugetlb_full_walk_continue(struct hugetlb_pte *hpte,
+ struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ /* hugetlb_hgm_walk will never fail with these arguments. */
+ WARN_ON_ONCE(hugetlb_hgm_walk(vma->vm_mm, vma, hpte, addr,
+ PAGE_SIZE, false));
+}
+
+/*
+ * hugetlb_full_walk - do a high-granularity page-table walk; never allocate.
+ *
+ * This function can only fail if we find that the hstate-level PTE is not
+ * allocated. Callers can take advantage of this fact to skip address regions
+ * that cannot be mapped in that case.
+ *
+ * If this function succeeds, @hpte is guaranteed to be valid.
+ */
+int hugetlb_full_walk(struct hugetlb_pte *hpte,
+ struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ struct hstate *h = hstate_vma(vma);
+ unsigned long sz = huge_page_size(h);
+ /*
+ * We must mask the address appropriately so that we pick up the first
+ * PTE in a contiguous group.
+ */
+ pte_t *ptep = hugetlb_walk(vma, addr & huge_page_mask(h), sz);
+
+ if (!ptep)
+ return -ENOMEM;
+
+ /* hugetlb_hgm_walk_uninit will never fail with these arguments. */
+ WARN_ON_ONCE(hugetlb_hgm_walk_uninit(hpte, ptep, vma, addr,
+ PAGE_SIZE, false));
+ return 0;
+}
+
+/*
+ * hugetlb_full_walk_alloc - do a high-granularity walk, potentially allocate
+ * new PTEs.
+ */
+int hugetlb_full_walk_alloc(struct hugetlb_pte *hpte,
+ struct vm_area_struct *vma,
+ unsigned long addr,
+ unsigned long target_sz)
+{
+ struct hstate *h = hstate_vma(vma);
+ unsigned long sz = huge_page_size(h);
+ /*
+ * We must mask the address appropriately so that we pick up the first
+ * PTE in a contiguous group.
+ */
+ pte_t *ptep = huge_pte_alloc(vma->vm_mm, vma, addr & huge_page_mask(h),
+ sz);
+
+ if (!ptep)
+ return -ENOMEM;
+
+ return hugetlb_hgm_walk_uninit(hpte, ptep, vma, addr, target_sz, true);
+}
+
#ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB
pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, unsigned long sz)
@@ -7343,6 +7513,49 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
return (pte_t *)pmd;
}

+/*
+ * hugetlb_walk_step() - Walk the page table one step to resolve the page
+ * (hugepage or subpage) entry at address @addr.
+ *
+ * @sz is always the final target PTE size (e.g. PAGE_SIZE for the
+ * lowest-level PTE).
+ *
+ * @hpte will always remain valid, even if this function fails.
+ *
+ * Architectures that implement this function must ensure that if @hpte does
+ * not change levels, then its PTL must also stay the same.
+ */
+int hugetlb_walk_step(struct mm_struct *mm, struct hugetlb_pte *hpte,
+ unsigned long addr, unsigned long sz)
+{
+ pte_t *ptep;
+ spinlock_t *ptl;
+
+ switch (hpte->level) {
+ case HUGETLB_LEVEL_PUD:
+ ptep = (pte_t *)hugetlb_alloc_pmd(mm, hpte, addr);
+ if (IS_ERR(ptep))
+ return PTR_ERR(ptep);
+ hugetlb_pte_populate(mm, hpte, ptep, PMD_SHIFT,
+ HUGETLB_LEVEL_PMD);
+ break;
+ case HUGETLB_LEVEL_PMD:
+ ptep = hugetlb_alloc_pte(mm, hpte, addr);
+ if (IS_ERR(ptep))
+ return PTR_ERR(ptep);
+ ptl = pte_lockptr(mm, (pmd_t *)hpte->ptep);
+ __hugetlb_pte_populate(hpte, ptep, PAGE_SHIFT,
+ HUGETLB_LEVEL_PTE, ptl);
+ hpte->ptl = ptl;
+ break;
+ default:
+ WARN_ONCE(1, "%s: got invalid level: %d (shift: %d)\n",
+ __func__, hpte->level, hpte->shift);
+ return -EINVAL;
+ }
+ return 0;
+}
+
/*
* Return a mask that can be used to update an address to the last huge
* page in a page table page mapping size. Used to skip non-present
--
2.39.0.314.g84b9a713c41-goog

2023-01-05 11:37:01

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 00/46] Based on latest mm-unstable (85b44c25cd1e).

On 05.01.23 11:17, James Houghton wrote:
> This series introduces the concept of HugeTLB high-granularity mapping
> (HGM). This series teaches HugeTLB how to map HugeTLB pages at
> high-granularity, similar to how THPs can be PTE-mapped.
>
> Support for HGM in this series is for MAP_SHARED VMAs on x86 only. Other
> architectures and (some) support for MAP_PRIVATE will come later.

Why even care about the complexity of COW-sharable anon pages? TBH, I'd
just limit this to MAP_SHARED and call it a day. Sure, we can come up
with use cases for everything (snapshotting VMs using fork while also
support optimized postcopy), but I think this would need some real
justification for the added complexity and possible (likely!) issues.

--
Thanks,

David / dhildenb

2023-01-05 15:30:47

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 09/46] mm: add MADV_SPLIT to enable HugeTLB HGM

Hi James,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on next-20230105]
[cannot apply to kvm/queue shuah-kselftest/next shuah-kselftest/fixes arnd-asm-generic/master linus/master kvm/linux-next v6.2-rc2 v6.2-rc1 v6.1 v6.2-rc2]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230105-182428
patch link: https://lore.kernel.org/r/20230105101844.1893104-10-jthoughton%40google.com
patch subject: [PATCH 09/46] mm: add MADV_SPLIT to enable HugeTLB HGM
config: m68k-allmodconfig
compiler: m68k-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/33a65f9a66e72ccc2c7151dc3ff9cb1d692074d8
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230105-182428
git checkout 33a65f9a66e72ccc2c7151dc3ff9cb1d692074d8
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

mm/madvise.c: In function 'madvise_split':
>> mm/madvise.c:1023:9: error: implicit declaration of function 'hugetlb_vma_lock_alloc'; did you mean 'hugetlb_vma_lock_write'? [-Werror=implicit-function-declaration]
1023 | hugetlb_vma_lock_alloc(vma);
| ^~~~~~~~~~~~~~~~~~~~~~
| hugetlb_vma_lock_write
cc1: some warnings being treated as errors


vim +1023 mm/madvise.c

1013
1014 static int madvise_split(struct vm_area_struct *vma,
1015 unsigned long *new_flags)
1016 {
1017 if (!is_vm_hugetlb_page(vma) || !hugetlb_hgm_eligible(vma))
1018 return -EINVAL;
1019 /*
1020 * Attempt to allocate the VMA lock again. If it isn't allocated,
1021 * MADV_COLLAPSE won't work.
1022 */
> 1023 hugetlb_vma_lock_alloc(vma);
1024
1025 /* PMD sharing doesn't work with HGM. */
1026 hugetlb_unshare_all_pmds(vma);
1027
1028 *new_flags |= VM_HUGETLB_HGM;
1029 return 0;
1030 }
1031

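One possible direction for a fix (an assumption on my side, not something
stated in this report) would be to declare hugetlb_vma_lock_alloc() in
include/linux/hugetlb.h, drop the static from its definition in
mm/hugetlb.c, and add a no-op stub for !CONFIG_HUGETLB_PAGE builds such
as this one:

	/* Hypothetical sketch only; not taken from the series. */
	#ifdef CONFIG_HUGETLB_PAGE
	void hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
	#else
	static inline void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
	{
	}
	#endif
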
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests



2023-01-05 16:06:00

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 09/46] mm: add MADV_SPLIT to enable HugeTLB HGM

On 05.01.23 11:18, James Houghton wrote:
> Issuing ioctl(MADV_SPLIT) on a HugeTLB address range will enable
> HugeTLB HGM. MADV_SPLIT was chosen for the name so that this API can be
> applied to non-HugeTLB memory in the future, if such an application is
> to arise.
>
> MADV_SPLIT provides several API changes for some syscalls on HugeTLB
> address ranges:
> 1. UFFDIO_CONTINUE is allowed for MAP_SHARED VMAs at PAGE_SIZE
> alignment.
> 2. read()ing a page fault event from a userfaultfd will yield a
> PAGE_SIZE-rounded address, instead of a huge-page-size-rounded
> address (unless UFFD_FEATURE_EXACT_ADDRESS is used).
>
> There is no way to disable the API changes that come with issuing
> MADV_SPLIT. MADV_COLLAPSE can be used to collapse high-granularity page
> table mappings that come from the extended functionality that comes with
> using MADV_SPLIT.
>
> For post-copy live migration, the expected use-case is:
> 1. mmap(MAP_SHARED, some_fd) primary mapping
> 2. mmap(MAP_SHARED, some_fd) alias mapping
> 3. MADV_SPLIT the primary mapping
> 4. UFFDIO_REGISTER/etc. the primary mapping
> 5. Copy memory contents into alias mapping and UFFDIO_CONTINUE the
> corresponding PAGE_SIZE sections in the primary mapping.
>
> More API changes may be added in the future.
>
> Signed-off-by: James Houghton <[email protected]>
> ---
> arch/alpha/include/uapi/asm/mman.h | 2 ++
> arch/mips/include/uapi/asm/mman.h | 2 ++
> arch/parisc/include/uapi/asm/mman.h | 2 ++
> arch/xtensa/include/uapi/asm/mman.h | 2 ++
> include/linux/hugetlb.h | 2 ++
> include/uapi/asm-generic/mman-common.h | 2 ++
> mm/hugetlb.c | 3 +--
> mm/madvise.c | 26 ++++++++++++++++++++++++++
> 8 files changed, 39 insertions(+), 2 deletions(-)
>
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index 763929e814e9..7a26f3648b90 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -78,6 +78,8 @@
>
> #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
>
> +#define MADV_SPLIT 26 /* Enable hugepage high-granularity APIs */

I think we should make a split more generic, such that it also splits
(pte-maps) a THP. Has that been discussed?

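Purely to illustrate the suggestion (hypothetical; this series only wires
MADV_SPLIT up for HugeTLB), a generalized split would look the same from
userspace:

	/* Hypothetical: ask the kernel to PTE-map any THPs in [addr, addr + len). */
	if (madvise(addr, len, MADV_SPLIT))
		perror("madvise(MADV_SPLIT)");
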
--
Thanks,

David / dhildenb

2023-01-05 16:30:22

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 11/46] hugetlb: add hugetlb_pte to track HugeTLB page table entries

Hi James,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on next-20230105]
[cannot apply to kvm/queue shuah-kselftest/next shuah-kselftest/fixes arnd-asm-generic/master linus/master kvm/linux-next v6.2-rc2 v6.2-rc1 v6.1 v6.2-rc2]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230105-182428
patch link: https://lore.kernel.org/r/20230105101844.1893104-12-jthoughton%40google.com
patch subject: [PATCH 11/46] hugetlb: add hugetlb_pte to track HugeTLB page table entries
config: m68k-allmodconfig
compiler: m68k-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/2f657c22d05e0d32f2dfb41cb3a79a705bd9e37c
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230105-182428
git checkout 2f657c22d05e0d32f2dfb41cb3a79a705bd9e37c
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k SHELL=/bin/bash kernel/ mm/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>

All error/warnings (new ones prefixed by >>):

In file included from include/linux/migrate.h:8,
from mm/zsmalloc.c:60:
>> include/linux/hugetlb.h:1284:40: warning: 'struct hugetlb_pte' declared inside parameter list will not be visible outside of this definition or declaration
1284 | spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
| ^~~~~~~~~~~
include/linux/hugetlb.h: In function 'hugetlb_pte_lockptr':
>> include/linux/hugetlb.h:1286:20: error: invalid use of undefined type 'struct hugetlb_pte'
1286 | return hpte->ptl;
| ^~
include/linux/hugetlb.h: At top level:
include/linux/hugetlb.h:1290:37: warning: 'struct hugetlb_pte' declared inside parameter list will not be visible outside of this definition or declaration
1290 | spinlock_t *hugetlb_pte_lock(struct hugetlb_pte *hpte)
| ^~~~~~~~~~~
include/linux/hugetlb.h: In function 'hugetlb_pte_lock':
>> include/linux/hugetlb.h:1292:47: error: passing argument 1 of 'hugetlb_pte_lockptr' from incompatible pointer type [-Werror=incompatible-pointer-types]
1292 | spinlock_t *ptl = hugetlb_pte_lockptr(hpte);
| ^~~~
| |
| struct hugetlb_pte *
include/linux/hugetlb.h:1284:53: note: expected 'struct hugetlb_pte *' but argument is of type 'struct hugetlb_pte *'
1284 | spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
| ~~~~~~~~~~~~~~~~~~~~^~~~
include/linux/hugetlb.h: At top level:
include/linux/hugetlb.h:1301:32: warning: 'enum hugetlb_level' declared inside parameter list will not be visible outside of this definition or declaration
1301 | enum hugetlb_level level)
| ^~~~~~~~~~~~~
include/linux/hugetlb.h:1299:56: warning: 'struct hugetlb_pte' declared inside parameter list will not be visible outside of this definition or declaration
1299 | void hugetlb_pte_populate(struct mm_struct *mm, struct hugetlb_pte *hpte,
| ^~~~~~~~~~~
>> include/linux/hugetlb.h:1301:46: error: parameter 5 ('level') has incomplete type
1301 | enum hugetlb_level level)
| ~~~~~~~~~~~~~~~~~~~^~~~~
>> include/linux/hugetlb.h:1299:6: error: function declaration isn't a prototype [-Werror=strict-prototypes]
1299 | void hugetlb_pte_populate(struct mm_struct *mm, struct hugetlb_pte *hpte,
| ^~~~~~~~~~~~~~~~~~~~
include/linux/hugetlb.h: In function 'hugetlb_pte_populate':
>> include/linux/hugetlb.h:1303:9: error: implicit declaration of function '__hugetlb_pte_populate'; did you mean 'hugetlb_pte_populate'? [-Werror=implicit-function-declaration]
1303 | __hugetlb_pte_populate(hpte, ptep, shift, level,
| ^~~~~~~~~~~~~~~~~~~~~~
| hugetlb_pte_populate
cc1: some warnings being treated as errors
--
In file included from kernel/fork.c:52:
>> include/linux/hugetlb.h:1284:40: warning: 'struct hugetlb_pte' declared inside parameter list will not be visible outside of this definition or declaration
1284 | spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
| ^~~~~~~~~~~
include/linux/hugetlb.h: In function 'hugetlb_pte_lockptr':
>> include/linux/hugetlb.h:1286:20: error: invalid use of undefined type 'struct hugetlb_pte'
1286 | return hpte->ptl;
| ^~
include/linux/hugetlb.h: At top level:
include/linux/hugetlb.h:1290:37: warning: 'struct hugetlb_pte' declared inside parameter list will not be visible outside of this definition or declaration
1290 | spinlock_t *hugetlb_pte_lock(struct hugetlb_pte *hpte)
| ^~~~~~~~~~~
include/linux/hugetlb.h: In function 'hugetlb_pte_lock':
>> include/linux/hugetlb.h:1292:47: error: passing argument 1 of 'hugetlb_pte_lockptr' from incompatible pointer type [-Werror=incompatible-pointer-types]
1292 | spinlock_t *ptl = hugetlb_pte_lockptr(hpte);
| ^~~~
| |
| struct hugetlb_pte *
include/linux/hugetlb.h:1284:53: note: expected 'struct hugetlb_pte *' but argument is of type 'struct hugetlb_pte *'
1284 | spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
| ~~~~~~~~~~~~~~~~~~~~^~~~
include/linux/hugetlb.h: At top level:
include/linux/hugetlb.h:1301:32: warning: 'enum hugetlb_level' declared inside parameter list will not be visible outside of this definition or declaration
1301 | enum hugetlb_level level)
| ^~~~~~~~~~~~~
include/linux/hugetlb.h:1299:56: warning: 'struct hugetlb_pte' declared inside parameter list will not be visible outside of this definition or declaration
1299 | void hugetlb_pte_populate(struct mm_struct *mm, struct hugetlb_pte *hpte,
| ^~~~~~~~~~~
>> include/linux/hugetlb.h:1301:46: error: parameter 5 ('level') has incomplete type
1301 | enum hugetlb_level level)
| ~~~~~~~~~~~~~~~~~~~^~~~~
>> include/linux/hugetlb.h:1299:6: error: function declaration isn't a prototype [-Werror=strict-prototypes]
1299 | void hugetlb_pte_populate(struct mm_struct *mm, struct hugetlb_pte *hpte,
| ^~~~~~~~~~~~~~~~~~~~
include/linux/hugetlb.h: In function 'hugetlb_pte_populate':
>> include/linux/hugetlb.h:1303:9: error: implicit declaration of function '__hugetlb_pte_populate'; did you mean 'hugetlb_pte_populate'? [-Werror=implicit-function-declaration]
1303 | __hugetlb_pte_populate(hpte, ptep, shift, level,
| ^~~~~~~~~~~~~~~~~~~~~~
| hugetlb_pte_populate
kernel/fork.c: At top level:
kernel/fork.c:162:13: warning: no previous prototype for 'arch_release_task_struct' [-Wmissing-prototypes]
162 | void __weak arch_release_task_struct(struct task_struct *tsk)
| ^~~~~~~~~~~~~~~~~~~~~~~~
kernel/fork.c:862:20: warning: no previous prototype for 'arch_task_cache_init' [-Wmissing-prototypes]
862 | void __init __weak arch_task_cache_init(void) { }
| ^~~~~~~~~~~~~~~~~~~~
kernel/fork.c:957:12: warning: no previous prototype for 'arch_dup_task_struct' [-Wmissing-prototypes]
957 | int __weak arch_dup_task_struct(struct task_struct *dst,
| ^~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
--
In file included from mm/madvise.c:16:
>> include/linux/hugetlb.h:1284:40: warning: 'struct hugetlb_pte' declared inside parameter list will not be visible outside of this definition or declaration
1284 | spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
| ^~~~~~~~~~~
include/linux/hugetlb.h: In function 'hugetlb_pte_lockptr':
>> include/linux/hugetlb.h:1286:20: error: invalid use of undefined type 'struct hugetlb_pte'
1286 | return hpte->ptl;
| ^~
include/linux/hugetlb.h: At top level:
include/linux/hugetlb.h:1290:37: warning: 'struct hugetlb_pte' declared inside parameter list will not be visible outside of this definition or declaration
1290 | spinlock_t *hugetlb_pte_lock(struct hugetlb_pte *hpte)
| ^~~~~~~~~~~
include/linux/hugetlb.h: In function 'hugetlb_pte_lock':
>> include/linux/hugetlb.h:1292:47: error: passing argument 1 of 'hugetlb_pte_lockptr' from incompatible pointer type [-Werror=incompatible-pointer-types]
1292 | spinlock_t *ptl = hugetlb_pte_lockptr(hpte);
| ^~~~
| |
| struct hugetlb_pte *
include/linux/hugetlb.h:1284:53: note: expected 'struct hugetlb_pte *' but argument is of type 'struct hugetlb_pte *'
1284 | spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
| ~~~~~~~~~~~~~~~~~~~~^~~~
include/linux/hugetlb.h: At top level:
include/linux/hugetlb.h:1301:32: warning: 'enum hugetlb_level' declared inside parameter list will not be visible outside of this definition or declaration
1301 | enum hugetlb_level level)
| ^~~~~~~~~~~~~
include/linux/hugetlb.h:1299:56: warning: 'struct hugetlb_pte' declared inside parameter list will not be visible outside of this definition or declaration
1299 | void hugetlb_pte_populate(struct mm_struct *mm, struct hugetlb_pte *hpte,
| ^~~~~~~~~~~
>> include/linux/hugetlb.h:1301:46: error: parameter 5 ('level') has incomplete type
1301 | enum hugetlb_level level)
| ~~~~~~~~~~~~~~~~~~~^~~~~
>> include/linux/hugetlb.h:1299:6: error: function declaration isn't a prototype [-Werror=strict-prototypes]
1299 | void hugetlb_pte_populate(struct mm_struct *mm, struct hugetlb_pte *hpte,
| ^~~~~~~~~~~~~~~~~~~~
include/linux/hugetlb.h: In function 'hugetlb_pte_populate':
>> include/linux/hugetlb.h:1303:9: error: implicit declaration of function '__hugetlb_pte_populate'; did you mean 'hugetlb_pte_populate'? [-Werror=implicit-function-declaration]
1303 | __hugetlb_pte_populate(hpte, ptep, shift, level,
| ^~~~~~~~~~~~~~~~~~~~~~
| hugetlb_pte_populate
mm/madvise.c: In function 'madvise_split':
mm/madvise.c:1023:9: error: implicit declaration of function 'hugetlb_vma_lock_alloc'; did you mean 'hugetlb_vma_lock_write'? [-Werror=implicit-function-declaration]
1023 | hugetlb_vma_lock_alloc(vma);
| ^~~~~~~~~~~~~~~~~~~~~~
| hugetlb_vma_lock_write
cc1: some warnings being treated as errors


vim +1286 include/linux/hugetlb.h

1282
1283 static inline
> 1284 spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
1285 {
> 1286 return hpte->ptl;
1287 }
1288
1289 static inline
1290 spinlock_t *hugetlb_pte_lock(struct hugetlb_pte *hpte)
1291 {
> 1292 spinlock_t *ptl = hugetlb_pte_lockptr(hpte);
1293
1294 spin_lock(ptl);
1295 return ptl;
1296 }
1297
1298 static inline
> 1299 void hugetlb_pte_populate(struct mm_struct *mm, struct hugetlb_pte *hpte,
1300 pte_t *ptep, unsigned int shift,
> 1301 enum hugetlb_level level)
1302 {
> 1303 __hugetlb_pte_populate(hpte, ptep, shift, level,
1304 huge_pte_lockptr(shift, mm, ptep));
1305 }
1306

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests



2023-01-05 17:39:39

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step

Hi James,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on next-20230105]
[cannot apply to kvm/queue shuah-kselftest/next shuah-kselftest/fixes arnd-asm-generic/master linus/master kvm/linux-next v6.2-rc2 v6.2-rc1 v6.1 v6.2-rc2]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230105-182428
patch link: https://lore.kernel.org/r/20230105101844.1893104-14-jthoughton%40google.com
patch subject: [PATCH 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
config: m68k-allmodconfig
compiler: m68k-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/5395f068d45f39d202240799d3a8146226387f5c
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230105-182428
git checkout 5395f068d45f39d202240799d3a8146226387f5c
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k SHELL=/bin/bash fs/proc/ mm/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

In file included from include/linux/migrate.h:8,
from mm/zsmalloc.c:60:
>> include/linux/hugetlb.h:1275:34: error: return type is an incomplete type
1275 | static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
| ^~~~~~~~~~~~~~~~~~~
include/linux/hugetlb.h: In function 'hpage_size_to_level':
>> include/linux/hugetlb.h:1277:16: error: 'HUGETLB_LEVEL_PTE' undeclared (first use in this function)
1277 | return HUGETLB_LEVEL_PTE;
| ^~~~~~~~~~~~~~~~~
include/linux/hugetlb.h:1277:16: note: each undeclared identifier is reported only once for each function it appears in
include/linux/hugetlb.h:1277:16: error: 'return' with a value, in function returning void [-Werror=return-type]
include/linux/hugetlb.h:1275:34: note: declared here
1275 | static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
| ^~~~~~~~~~~~~~~~~~~
include/linux/hugetlb.h: At top level:
include/linux/hugetlb.h:1306:40: warning: 'struct hugetlb_pte' declared inside parameter list will not be visible outside of this definition or declaration
1306 | spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
| ^~~~~~~~~~~
include/linux/hugetlb.h: In function 'hugetlb_pte_lockptr':
include/linux/hugetlb.h:1308:20: error: invalid use of undefined type 'struct hugetlb_pte'
1308 | return hpte->ptl;
| ^~
include/linux/hugetlb.h: At top level:
include/linux/hugetlb.h:1312:37: warning: 'struct hugetlb_pte' declared inside parameter list will not be visible outside of this definition or declaration
1312 | spinlock_t *hugetlb_pte_lock(struct hugetlb_pte *hpte)
| ^~~~~~~~~~~
include/linux/hugetlb.h: In function 'hugetlb_pte_lock':
include/linux/hugetlb.h:1314:47: error: passing argument 1 of 'hugetlb_pte_lockptr' from incompatible pointer type [-Werror=incompatible-pointer-types]
1314 | spinlock_t *ptl = hugetlb_pte_lockptr(hpte);
| ^~~~
| |
| struct hugetlb_pte *
include/linux/hugetlb.h:1306:53: note: expected 'struct hugetlb_pte *' but argument is of type 'struct hugetlb_pte *'
1306 | spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
| ~~~~~~~~~~~~~~~~~~~~^~~~
include/linux/hugetlb.h: At top level:
include/linux/hugetlb.h:1321:56: warning: 'struct hugetlb_pte' declared inside parameter list will not be visible outside of this definition or declaration
1321 | void hugetlb_pte_populate(struct mm_struct *mm, struct hugetlb_pte *hpte,
| ^~~~~~~~~~~
include/linux/hugetlb.h:1323:46: error: parameter 5 ('level') has incomplete type
1323 | enum hugetlb_level level)
| ~~~~~~~~~~~~~~~~~~~^~~~~
include/linux/hugetlb.h:1321:6: error: function declaration isn't a prototype [-Werror=strict-prototypes]
1321 | void hugetlb_pte_populate(struct mm_struct *mm, struct hugetlb_pte *hpte,
| ^~~~~~~~~~~~~~~~~~~~
include/linux/hugetlb.h: In function 'hugetlb_pte_populate':
include/linux/hugetlb.h:1325:9: error: implicit declaration of function '__hugetlb_pte_populate'; did you mean 'hugetlb_pte_populate'? [-Werror=implicit-function-declaration]
1325 | __hugetlb_pte_populate(hpte, ptep, shift, level,
| ^~~~~~~~~~~~~~~~~~~~~~
| hugetlb_pte_populate
cc1: some warnings being treated as errors
--
In file included from mm/madvise.c:16:
>> include/linux/hugetlb.h:1275:34: error: return type is an incomplete type
1275 | static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
| ^~~~~~~~~~~~~~~~~~~
include/linux/hugetlb.h: In function 'hpage_size_to_level':
>> include/linux/hugetlb.h:1277:16: error: 'HUGETLB_LEVEL_PTE' undeclared (first use in this function)
1277 | return HUGETLB_LEVEL_PTE;
| ^~~~~~~~~~~~~~~~~
include/linux/hugetlb.h:1277:16: note: each undeclared identifier is reported only once for each function it appears in
include/linux/hugetlb.h:1277:16: error: 'return' with a value, in function returning void [-Werror=return-type]
include/linux/hugetlb.h:1275:34: note: declared here
1275 | static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
| ^~~~~~~~~~~~~~~~~~~
include/linux/hugetlb.h: At top level:
include/linux/hugetlb.h:1306:40: warning: 'struct hugetlb_pte' declared inside parameter list will not be visible outside of this definition or declaration
1306 | spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
| ^~~~~~~~~~~
include/linux/hugetlb.h: In function 'hugetlb_pte_lockptr':
include/linux/hugetlb.h:1308:20: error: invalid use of undefined type 'struct hugetlb_pte'
1308 | return hpte->ptl;
| ^~
include/linux/hugetlb.h: At top level:
include/linux/hugetlb.h:1312:37: warning: 'struct hugetlb_pte' declared inside parameter list will not be visible outside of this definition or declaration
1312 | spinlock_t *hugetlb_pte_lock(struct hugetlb_pte *hpte)
| ^~~~~~~~~~~
include/linux/hugetlb.h: In function 'hugetlb_pte_lock':
include/linux/hugetlb.h:1314:47: error: passing argument 1 of 'hugetlb_pte_lockptr' from incompatible pointer type [-Werror=incompatible-pointer-types]
1314 | spinlock_t *ptl = hugetlb_pte_lockptr(hpte);
| ^~~~
| |
| struct hugetlb_pte *
include/linux/hugetlb.h:1306:53: note: expected 'struct hugetlb_pte *' but argument is of type 'struct hugetlb_pte *'
1306 | spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
| ~~~~~~~~~~~~~~~~~~~~^~~~
include/linux/hugetlb.h: At top level:
include/linux/hugetlb.h:1321:56: warning: 'struct hugetlb_pte' declared inside parameter list will not be visible outside of this definition or declaration
1321 | void hugetlb_pte_populate(struct mm_struct *mm, struct hugetlb_pte *hpte,
| ^~~~~~~~~~~
include/linux/hugetlb.h:1323:46: error: parameter 5 ('level') has incomplete type
1323 | enum hugetlb_level level)
| ~~~~~~~~~~~~~~~~~~~^~~~~
include/linux/hugetlb.h:1321:6: error: function declaration isn't a prototype [-Werror=strict-prototypes]
1321 | void hugetlb_pte_populate(struct mm_struct *mm, struct hugetlb_pte *hpte,
| ^~~~~~~~~~~~~~~~~~~~
include/linux/hugetlb.h: In function 'hugetlb_pte_populate':
include/linux/hugetlb.h:1325:9: error: implicit declaration of function '__hugetlb_pte_populate'; did you mean 'hugetlb_pte_populate'? [-Werror=implicit-function-declaration]
1325 | __hugetlb_pte_populate(hpte, ptep, shift, level,
| ^~~~~~~~~~~~~~~~~~~~~~
| hugetlb_pte_populate
mm/madvise.c: In function 'madvise_split':
mm/madvise.c:1023:9: error: implicit declaration of function 'hugetlb_vma_lock_alloc'; did you mean 'hugetlb_vma_lock_write'? [-Werror=implicit-function-declaration]
1023 | hugetlb_vma_lock_alloc(vma);
| ^~~~~~~~~~~~~~~~~~~~~~
| hugetlb_vma_lock_write
cc1: some warnings being treated as errors
--
In file included from fs/proc/meminfo.c:6:
>> include/linux/hugetlb.h:1275:34: error: return type is an incomplete type
1275 | static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
| ^~~~~~~~~~~~~~~~~~~
include/linux/hugetlb.h: In function 'hpage_size_to_level':
>> include/linux/hugetlb.h:1277:16: error: 'HUGETLB_LEVEL_PTE' undeclared (first use in this function)
1277 | return HUGETLB_LEVEL_PTE;
| ^~~~~~~~~~~~~~~~~
include/linux/hugetlb.h:1277:16: note: each undeclared identifier is reported only once for each function it appears in
include/linux/hugetlb.h:1277:16: error: 'return' with a value, in function returning void [-Werror=return-type]
include/linux/hugetlb.h:1275:34: note: declared here
1275 | static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
| ^~~~~~~~~~~~~~~~~~~
include/linux/hugetlb.h: At top level:
include/linux/hugetlb.h:1306:40: warning: 'struct hugetlb_pte' declared inside parameter list will not be visible outside of this definition or declaration
1306 | spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
| ^~~~~~~~~~~
include/linux/hugetlb.h: In function 'hugetlb_pte_lockptr':
include/linux/hugetlb.h:1308:20: error: invalid use of undefined type 'struct hugetlb_pte'
1308 | return hpte->ptl;
| ^~
include/linux/hugetlb.h: At top level:
include/linux/hugetlb.h:1312:37: warning: 'struct hugetlb_pte' declared inside parameter list will not be visible outside of this definition or declaration
1312 | spinlock_t *hugetlb_pte_lock(struct hugetlb_pte *hpte)
| ^~~~~~~~~~~
include/linux/hugetlb.h: In function 'hugetlb_pte_lock':
include/linux/hugetlb.h:1314:47: error: passing argument 1 of 'hugetlb_pte_lockptr' from incompatible pointer type [-Werror=incompatible-pointer-types]
1314 | spinlock_t *ptl = hugetlb_pte_lockptr(hpte);
| ^~~~
| |
| struct hugetlb_pte *
include/linux/hugetlb.h:1306:53: note: expected 'struct hugetlb_pte *' but argument is of type 'struct hugetlb_pte *'
1306 | spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
| ~~~~~~~~~~~~~~~~~~~~^~~~
include/linux/hugetlb.h: At top level:
include/linux/hugetlb.h:1321:56: warning: 'struct hugetlb_pte' declared inside parameter list will not be visible outside of this definition or declaration
1321 | void hugetlb_pte_populate(struct mm_struct *mm, struct hugetlb_pte *hpte,
| ^~~~~~~~~~~
include/linux/hugetlb.h:1323:46: error: parameter 5 ('level') has incomplete type
1323 | enum hugetlb_level level)
| ~~~~~~~~~~~~~~~~~~~^~~~~
include/linux/hugetlb.h:1321:6: error: function declaration isn't a prototype [-Werror=strict-prototypes]
1321 | void hugetlb_pte_populate(struct mm_struct *mm, struct hugetlb_pte *hpte,
| ^~~~~~~~~~~~~~~~~~~~
include/linux/hugetlb.h: In function 'hugetlb_pte_populate':
include/linux/hugetlb.h:1325:9: error: implicit declaration of function '__hugetlb_pte_populate'; did you mean 'hugetlb_pte_populate'? [-Werror=implicit-function-declaration]
1325 | __hugetlb_pte_populate(hpte, ptep, shift, level,
| ^~~~~~~~~~~~~~~~~~~~~~
| hugetlb_pte_populate
fs/proc/meminfo.c: At top level:
fs/proc/meminfo.c:22:28: warning: no previous prototype for 'arch_report_meminfo' [-Wmissing-prototypes]
22 | void __attribute__((weak)) arch_report_meminfo(struct seq_file *m)
| ^~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors


vim +1275 include/linux/hugetlb.h

1274
> 1275 static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
1276 {
> 1277 return HUGETLB_LEVEL_PTE;
1278 }
1279 #endif /* CONFIG_HUGETLB_PAGE */
1280

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests



2023-01-05 19:36:41

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step

Hi James,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on next-20230105]
[cannot apply to kvm/queue shuah-kselftest/next shuah-kselftest/fixes arnd-asm-generic/master linus/master kvm/linux-next v6.2-rc2 v6.2-rc1 v6.1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230105-182428
patch link: https://lore.kernel.org/r/20230105101844.1893104-14-jthoughton%40google.com
patch subject: [PATCH 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
config: powerpc-randconfig-r005-20230105
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 8d9828ef5aa9688500657d36cd2aefbe12bbd162)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install powerpc cross compiling tool for clang build
# apt-get install binutils-powerpc-linux-gnu
# https://github.com/intel-lab-lkp/linux/commit/5395f068d45f39d202240799d3a8146226387f5c
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230105-182428
git checkout 5395f068d45f39d202240799d3a8146226387f5c
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=powerpc olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=powerpc SHELL=/bin/bash fs/proc/ mm/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

In file included from mm/filemap.c:36:
>> include/linux/hugetlb.h:1275:34: error: incomplete result type 'enum hugetlb_level' in function definition
static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
^
include/linux/hugetlb.h:1275:20: note: forward declaration of 'enum hugetlb_level'
static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
^
>> include/linux/hugetlb.h:1277:9: error: use of undeclared identifier 'HUGETLB_LEVEL_PTE'
return HUGETLB_LEVEL_PTE;
^
include/linux/hugetlb.h:1306:40: warning: declaration of 'struct hugetlb_pte' will not be visible outside of this function [-Wvisibility]
spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
^
include/linux/hugetlb.h:1308:13: error: incomplete definition of type 'struct hugetlb_pte'
return hpte->ptl;
~~~~^
include/linux/hugetlb.h:1306:40: note: forward declaration of 'struct hugetlb_pte'
spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
^
include/linux/hugetlb.h:1312:37: warning: declaration of 'struct hugetlb_pte' will not be visible outside of this function [-Wvisibility]
spinlock_t *hugetlb_pte_lock(struct hugetlb_pte *hpte)
^
include/linux/hugetlb.h:1314:40: error: incompatible pointer types passing 'struct hugetlb_pte *' to parameter of type 'struct hugetlb_pte *' [-Werror,-Wincompatible-pointer-types]
spinlock_t *ptl = hugetlb_pte_lockptr(hpte);
^~~~
include/linux/hugetlb.h:1306:53: note: passing argument to parameter 'hpte' here
spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
^
include/linux/hugetlb.h:1321:56: warning: declaration of 'struct hugetlb_pte' will not be visible outside of this function [-Wvisibility]
void hugetlb_pte_populate(struct mm_struct *mm, struct hugetlb_pte *hpte,
^
include/linux/hugetlb.h:1323:25: error: variable has incomplete type 'enum hugetlb_level'
enum hugetlb_level level)
^
include/linux/hugetlb.h:1275:20: note: forward declaration of 'enum hugetlb_level'
static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
^
include/linux/hugetlb.h:1325:2: error: call to undeclared function '__hugetlb_pte_populate'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
__hugetlb_pte_populate(hpte, ptep, shift, level,
^
include/linux/hugetlb.h:1325:2: note: did you mean 'hugetlb_pte_populate'?
include/linux/hugetlb.h:1321:6: note: 'hugetlb_pte_populate' declared here
void hugetlb_pte_populate(struct mm_struct *mm, struct hugetlb_pte *hpte,
^
3 warnings and 6 errors generated.
--
In file included from mm/shmem.c:38:
>> include/linux/hugetlb.h:1275:34: error: incomplete result type 'enum hugetlb_level' in function definition
static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
^
include/linux/hugetlb.h:1275:20: note: forward declaration of 'enum hugetlb_level'
static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
^
>> include/linux/hugetlb.h:1277:9: error: use of undeclared identifier 'HUGETLB_LEVEL_PTE'
return HUGETLB_LEVEL_PTE;
^
include/linux/hugetlb.h:1306:40: warning: declaration of 'struct hugetlb_pte' will not be visible outside of this function [-Wvisibility]
spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
^
include/linux/hugetlb.h:1308:13: error: incomplete definition of type 'struct hugetlb_pte'
return hpte->ptl;
~~~~^
include/linux/hugetlb.h:1306:40: note: forward declaration of 'struct hugetlb_pte'
spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
^
include/linux/hugetlb.h:1312:37: warning: declaration of 'struct hugetlb_pte' will not be visible outside of this function [-Wvisibility]
spinlock_t *hugetlb_pte_lock(struct hugetlb_pte *hpte)
^
include/linux/hugetlb.h:1314:40: error: incompatible pointer types passing 'struct hugetlb_pte *' to parameter of type 'struct hugetlb_pte *' [-Werror,-Wincompatible-pointer-types]
spinlock_t *ptl = hugetlb_pte_lockptr(hpte);
^~~~
include/linux/hugetlb.h:1306:53: note: passing argument to parameter 'hpte' here
spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
^
include/linux/hugetlb.h:1321:56: warning: declaration of 'struct hugetlb_pte' will not be visible outside of this function [-Wvisibility]
void hugetlb_pte_populate(struct mm_struct *mm, struct hugetlb_pte *hpte,
^
include/linux/hugetlb.h:1323:25: error: variable has incomplete type 'enum hugetlb_level'
enum hugetlb_level level)
^
include/linux/hugetlb.h:1275:20: note: forward declaration of 'enum hugetlb_level'
static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
^
include/linux/hugetlb.h:1325:2: error: call to undeclared function '__hugetlb_pte_populate'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
__hugetlb_pte_populate(hpte, ptep, shift, level,
^
include/linux/hugetlb.h:1325:2: note: did you mean 'hugetlb_pte_populate'?
include/linux/hugetlb.h:1321:6: note: 'hugetlb_pte_populate' declared here
void hugetlb_pte_populate(struct mm_struct *mm, struct hugetlb_pte *hpte,
^
In file included from mm/shmem.c:57:
include/linux/mman.h:153:9: warning: division by zero is undefined [-Wdivision-by-zero]
_calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) |
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/mman.h:132:21: note: expanded from macro '_calc_vm_trans'
: ((x) & (bit1)) / ((bit1) / (bit2))))
^ ~~~~~~~~~~~~~~~~~
include/linux/mman.h:154:9: warning: division by zero is undefined [-Wdivision-by-zero]
_calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) |
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/mman.h:132:21: note: expanded from macro '_calc_vm_trans'
: ((x) & (bit1)) / ((bit1) / (bit2))))
^ ~~~~~~~~~~~~~~~~~
5 warnings and 6 errors generated.
--
In file included from fs/proc/meminfo.c:6:
>> include/linux/hugetlb.h:1275:34: error: incomplete result type 'enum hugetlb_level' in function definition
static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
^
include/linux/hugetlb.h:1275:20: note: forward declaration of 'enum hugetlb_level'
static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
^
>> include/linux/hugetlb.h:1277:9: error: use of undeclared identifier 'HUGETLB_LEVEL_PTE'
return HUGETLB_LEVEL_PTE;
^
include/linux/hugetlb.h:1306:40: warning: declaration of 'struct hugetlb_pte' will not be visible outside of this function [-Wvisibility]
spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
^
include/linux/hugetlb.h:1308:13: error: incomplete definition of type 'struct hugetlb_pte'
return hpte->ptl;
~~~~^
include/linux/hugetlb.h:1306:40: note: forward declaration of 'struct hugetlb_pte'
spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
^
include/linux/hugetlb.h:1312:37: warning: declaration of 'struct hugetlb_pte' will not be visible outside of this function [-Wvisibility]
spinlock_t *hugetlb_pte_lock(struct hugetlb_pte *hpte)
^
include/linux/hugetlb.h:1314:40: error: incompatible pointer types passing 'struct hugetlb_pte *' to parameter of type 'struct hugetlb_pte *' [-Werror,-Wincompatible-pointer-types]
spinlock_t *ptl = hugetlb_pte_lockptr(hpte);
^~~~
include/linux/hugetlb.h:1306:53: note: passing argument to parameter 'hpte' here
spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
^
include/linux/hugetlb.h:1321:56: warning: declaration of 'struct hugetlb_pte' will not be visible outside of this function [-Wvisibility]
void hugetlb_pte_populate(struct mm_struct *mm, struct hugetlb_pte *hpte,
^
include/linux/hugetlb.h:1323:25: error: variable has incomplete type 'enum hugetlb_level'
enum hugetlb_level level)
^
include/linux/hugetlb.h:1275:20: note: forward declaration of 'enum hugetlb_level'
static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
^
include/linux/hugetlb.h:1325:2: error: call to undeclared function '__hugetlb_pte_populate'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
__hugetlb_pte_populate(hpte, ptep, shift, level,
^
include/linux/hugetlb.h:1325:2: note: did you mean 'hugetlb_pte_populate'?
include/linux/hugetlb.h:1321:6: note: 'hugetlb_pte_populate' declared here
void hugetlb_pte_populate(struct mm_struct *mm, struct hugetlb_pte *hpte,
^
In file included from fs/proc/meminfo.c:7:
include/linux/mman.h:153:9: warning: division by zero is undefined [-Wdivision-by-zero]
_calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) |
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/mman.h:132:21: note: expanded from macro '_calc_vm_trans'
: ((x) & (bit1)) / ((bit1) / (bit2))))
^ ~~~~~~~~~~~~~~~~~
include/linux/mman.h:154:9: warning: division by zero is undefined [-Wdivision-by-zero]
_calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) |
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/mman.h:132:21: note: expanded from macro '_calc_vm_trans'
: ((x) & (bit1)) / ((bit1) / (bit2))))
^ ~~~~~~~~~~~~~~~~~
fs/proc/meminfo.c:22:28: warning: no previous prototype for function 'arch_report_meminfo' [-Wmissing-prototypes]
void __attribute__((weak)) arch_report_meminfo(struct seq_file *m)
^
fs/proc/meminfo.c:22:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void __attribute__((weak)) arch_report_meminfo(struct seq_file *m)
^
static
6 warnings and 6 errors generated.


vim +1275 include/linux/hugetlb.h

1274
> 1275 static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
1276 {
> 1277 return HUGETLB_LEVEL_PTE;
1278 }
1279 #endif /* CONFIG_HUGETLB_PAGE */
1280

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests



2023-01-05 23:27:20

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 19/46] hugetlb: add HGM support for follow_hugetlb_page

On Thu, Jan 05, 2023 at 10:18:17AM +0000, James Houghton wrote:
> @@ -6649,13 +6655,20 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> */
> if (absent && (flags & FOLL_DUMP) &&
> !hugetlbfs_pagecache_present(h, vma, vaddr)) {
> - if (pte)
> + if (ptep)
> spin_unlock(ptl);
> hugetlb_vma_unlock_read(vma);
> remainder = 0;
> break;
> }
>
> + if (!absent && pte_present(pte) &&
> + !hugetlb_pte_present_leaf(&hpte, pte)) {
> + /* We raced with someone splitting the PTE, so retry. */
> + spin_unlock(ptl);

vma unlock missing here.

> + continue;
> + }
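
Something like the following, reusing the names from the hunk above
(untested sketch, not the final fix):

	if (!absent && pte_present(pte) &&
	    !hugetlb_pte_present_leaf(&hpte, pte)) {
		/* We raced with someone splitting the PTE, so retry. */
		spin_unlock(ptl);
		/* Also drop the VMA read lock before retrying. */
		hugetlb_vma_unlock_read(vma);
		continue;
	}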

--
Peter Xu

2023-01-05 23:34:25

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote:
> -static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
> +static void damon_hugetlb_mkold(struct hugetlb_pte *hpte, pte_t entry,
> + struct mm_struct *mm,
> struct vm_area_struct *vma, unsigned long addr)
> {
> bool referenced = false;
> - pte_t entry = huge_ptep_get(pte);
> + pte_t entry = huge_ptep_get(hpte->ptep);

My compiler throws me:

mm/damon/vaddr.c: In function ‘damon_hugetlb_mkold’:
mm/damon/vaddr.c:338:15: error: ‘entry’ redeclared as different kind of symbol
338 | pte_t entry = huge_ptep_get(hpte->ptep);
| ^~~~~

I guess this line can just be dropped.

> struct folio *folio = pfn_folio(pte_pfn(entry));
>
> folio_get(folio);

--
Peter Xu

2023-01-06 15:37:41

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 34/46] hugetlb: userfaultfd: when using MADV_SPLIT, round addresses to PAGE_SIZE

On Thu, Jan 05, 2023 at 10:18:32AM +0000, James Houghton wrote:
> MADV_SPLIT enables HugeTLB HGM which allows for UFFDIO_CONTINUE in
> PAGE_SIZE chunks. If a huge-page-aligned address were to be provided,
> userspace would be completely unable to take advantage of HGM. That
> would then require userspace to know to provide
> UFFD_FEATURE_EXACT_ADDRESS.
>
> This patch would make it harder to make a mistake. Instead of requiring
> userspace to provide UFFD_FEATURE_EXACT_ADDRESS, always provide a usable
> address.
>
> Signed-off-by: James Houghton <[email protected]>
> ---
> mm/hugetlb.c | 31 +++++++++++++++----------------
> 1 file changed, 15 insertions(+), 16 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5af6db52f34e..5b6215e03fe1 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5936,28 +5936,27 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
> unsigned long addr,
> unsigned long reason)
> {
> + u32 hash;
> + struct vm_fault vmf;
> +
> /*
> * Don't use the hpage-aligned address if the user has explicitly
> * enabled HGM.
> */
> if (hugetlb_hgm_advised(vma) && reason == VM_UFFD_MINOR)
> - haddr = address & PAGE_MASK;
> -
> - u32 hash;
> - struct vm_fault vmf = {
> - .vma = vma,
> - .address = haddr,
> - .real_address = addr,
> - .flags = flags,
> + haddr = addr & PAGE_MASK;
>
> - /*
> - * Hard to debug if it ends up being
> - * used by a callee that assumes
> - * something about the other
> - * uninitialized fields... same as in
> - * memory.c
> - */
> - };
> + vmf.vma = vma;
> + vmf.address = haddr;
> + vmf.real_address = addr;
> + vmf.flags = flags;

Const fields here:

mm/hugetlb.c: In function ‘hugetlb_handle_userfault’:
mm/hugetlb.c:5961:17: error: assignment of member ‘vma’ in read-only object
5961 | vmf.vma = vma;
| ^
mm/hugetlb.c:5962:21: error: assignment of member ‘address’ in read-only object
5962 | vmf.address = haddr;
| ^
mm/hugetlb.c:5963:26: error: assignment of member ‘real_address’ in read-only object
5963 | vmf.real_address = addr;

> + /*
> + * Hard to debug if it ends up being
> + * used by a callee that assumes
> + * something about the other
> + * uninitialized fields... same as in
> + * memory.c
> + */

PS: I think we can drop this along the way.

>
> /*
> * vma_lock and hugetlb_fault_mutex must be dropped before handling
> --
> 2.39.0.314.g84b9a713c41-goog
>

--
Peter Xu

2023-01-09 20:31:54

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 00/46] Based on latest mm-unstable (85b44c25cd1e).

On 01/05/23 11:47, David Hildenbrand wrote:
> On 05.01.23 11:17, James Houghton wrote:
> > This series introduces the concept of HugeTLB high-granularity mapping
> > (HGM). This series teaches HugeTLB how to map HugeTLB pages at
> > high-granularity, similar to how THPs can be PTE-mapped.
> >
> > Support for HGM in this series is for MAP_SHARED VMAs on x86 only. Other
> > architectures and (some) support for MAP_PRIVATE will come later.
>
> Why even care about the complexity of COW-sharable anon pages? TBH, I'd just
> limit this to MAP_SHARED and call it a day. Sure, we can come up with use
> cases for everything (snapshotting VMs using fork while also supporting
> optimized postcopy), but I think this would need some real justification for
> the added complexity and possible (likely!) issues.

I believe the primary use case driving this beyond MAP_SHARED would be
poisoning due to memory errors. Extending HGM seems to be the most
elegant way to start providing better support for this.
--
Mike Kravetz

2023-01-10 00:37:55

by Zach O'Keefe

[permalink] [raw]
Subject: Re: [PATCH 09/46] mm: add MADV_SPLIT to enable HugeTLB HGM

On Thu, Jan 5, 2023 at 7:29 AM David Hildenbrand <[email protected]> wrote:
>
> On 05.01.23 11:18, James Houghton wrote:
> > Issuing ioctl(MADV_SPLIT) on a HugeTLB address range will enable
> > HugeTLB HGM. MADV_SPLIT was chosen for the name so that this API can be
> > applied to non-HugeTLB memory in the future, if such an application is
> > to arise.
> >
> > MADV_SPLIT provides several API changes for some syscalls on HugeTLB
> > address ranges:
> > 1. UFFDIO_CONTINUE is allowed for MAP_SHARED VMAs at PAGE_SIZE
> > alignment.
> > 2. read()ing a page fault event from a userfaultfd will yield a
> > PAGE_SIZE-rounded address, instead of a huge-page-size-rounded
> > address (unless UFFD_FEATURE_EXACT_ADDRESS is used).
> >
> > There is no way to disable the API changes that come with issuing
> > MADV_SPLIT. MADV_COLLAPSE can be used to collapse high-granularity page
> > table mappings that come from the extended functionality that comes with
> > using MADV_SPLIT.
> >
> > For post-copy live migration, the expected use-case is:
> > 1. mmap(MAP_SHARED, some_fd) primary mapping
> > 2. mmap(MAP_SHARED, some_fd) alias mapping
> > 3. MADV_SPLIT the primary mapping
> > 4. UFFDIO_REGISTER/etc. the primary mapping
> > 5. Copy memory contents into alias mapping and UFFDIO_CONTINUE the
> > corresponding PAGE_SIZE sections in the primary mapping.
> >
> > More API changes may be added in the future.
> >
> > Signed-off-by: James Houghton <[email protected]>
> > ---
> > arch/alpha/include/uapi/asm/mman.h | 2 ++
> > arch/mips/include/uapi/asm/mman.h | 2 ++
> > arch/parisc/include/uapi/asm/mman.h | 2 ++
> > arch/xtensa/include/uapi/asm/mman.h | 2 ++
> > include/linux/hugetlb.h | 2 ++
> > include/uapi/asm-generic/mman-common.h | 2 ++
> > mm/hugetlb.c | 3 +--
> > mm/madvise.c | 26 ++++++++++++++++++++++++++
> > 8 files changed, 39 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> > index 763929e814e9..7a26f3648b90 100644
> > --- a/arch/alpha/include/uapi/asm/mman.h
> > +++ b/arch/alpha/include/uapi/asm/mman.h
> > @@ -78,6 +78,8 @@
> >
> > #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> >
> > +#define MADV_SPLIT 26 /* Enable hugepage high-granularity APIs */
>
> I think we should make a split more generic, such that it also splits
> (pte-maps) a THP. Has that been discussed?


Thanks James / David.

MADV_SPLIT for THP has come up a few times: first, during the initial
RFC about hugepage collapse in process context, as the natural inverse
operation required by a generic userspace-managed hugepage daemon; and
second -- more immediately practical -- to avoid stranding THPs on the
deferred split queue (and thus still incurring the memcg charge) for
too long [1].

However, its exact semantics / API have yet to be discussed and fleshed
out (though I'm planning to do exactly that in the near term).

Just as James has co-opted MADV_COLLAPSE for hugetlb, we can co-opt
MADV_SPLIT for THP, when the time comes -- which I think makes a lot
of sense.

Hopefully I can get my ducks in order to start a discussion about this
imminently.

Best,
Zach

[1] https://lore.kernel.org/linux-mm/[email protected]/

> --
> Thanks,
>
> David / dhildenb
>

2023-01-10 14:57:56

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 34/46] hugetlb: userfaultfd: when using MADV_SPLIT, round addresses to PAGE_SIZE

On Fri, Jan 6, 2023 at 10:13 AM Peter Xu <[email protected]> wrote:
>
> On Thu, Jan 05, 2023 at 10:18:32AM +0000, James Houghton wrote:
> > MADV_SPLIT enables HugeTLB HGM which allows for UFFDIO_CONTINUE in
> > PAGE_SIZE chunks. If a huge-page-aligned address were to be provided,
> > userspace would be completely unable to take advantage of HGM. That
> > would then require userspace to know to provide
> > UFFD_FEATURE_EXACT_ADDRESS.
> >
> > This patch would make it harder to make a mistake. Instead of requiring
> > userspace to provide UFFD_FEATURE_EXACT_ADDRESS, always provide a usable
> > address.
> >
> > Signed-off-by: James Houghton <[email protected]>
> > ---
> > mm/hugetlb.c | 31 +++++++++++++++----------------
> > 1 file changed, 15 insertions(+), 16 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 5af6db52f34e..5b6215e03fe1 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -5936,28 +5936,27 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
> > unsigned long addr,
> > unsigned long reason)
> > {
> > + u32 hash;
> > + struct vm_fault vmf;
> > +
> > /*
> > * Don't use the hpage-aligned address if the user has explicitly
> > * enabled HGM.
> > */
> > if (hugetlb_hgm_advised(vma) && reason == VM_UFFD_MINOR)
> > - haddr = address & PAGE_MASK;
> > -
> > - u32 hash;
> > - struct vm_fault vmf = {
> > - .vma = vma,
> > - .address = haddr,
> > - .real_address = addr,
> > - .flags = flags,
> > + haddr = addr & PAGE_MASK;
> >
> > - /*
> > - * Hard to debug if it ends up being
> > - * used by a callee that assumes
> > - * something about the other
> > - * uninitialized fields... same as in
> > - * memory.c
> > - */
> > - };
> > + vmf.vma = vma;
> > + vmf.address = haddr;
> > + vmf.real_address = addr;
> > + vmf.flags = flags;
>
> Const fields here:
>
> mm/hugetlb.c: In function ‘hugetlb_handle_userfault’:
> mm/hugetlb.c:5961:17: error: assignment of member ‘vma’ in read-only object
> 5961 | vmf.vma = vma;
> | ^
> mm/hugetlb.c:5962:21: error: assignment of member ‘address’ in read-only object
> 5962 | vmf.address = haddr;
> | ^
> mm/hugetlb.c:5963:26: error: assignment of member ‘real_address’ in read-only object
> 5963 | vmf.real_address = addr;

Thanks Peter for this and your other findings. Not sure why my
compiler (clang) let me do this. :/ Will send a v2 soon with this +
the other problems fixed.

2023-01-10 16:01:22

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 00/46] Based on latest mm-unstable (85b44c25cd1e).

On 09.01.23 20:53, Mike Kravetz wrote:
> On 01/05/23 11:47, David Hildenbrand wrote:
>> On 05.01.23 11:17, James Houghton wrote:
>>> This series introduces the concept of HugeTLB high-granularity mapping
>>> (HGM). This series teaches HugeTLB how to map HugeTLB pages at
>>> high-granularity, similar to how THPs can be PTE-mapped.
>>>
>>> Support for HGM in this series is for MAP_SHARED VMAs on x86 only. Other
>>> architectures and (some) support for MAP_PRIVATE will come later.
>>
>> Why even care about the complexity of COW-sharable anon pages? TBH, I'd just
>> limit this to MAP_SHARED and call it a day. Sure, we can come up with use
>> cases for everything (snapshotting VMs using fork while also supporting
>> optimized postcopy), but I think this would need some real justification for
>> the added complexity and possible (likely!) issues.
>
> I believe the primary use case driving this beyond MAP_SHARED would be
> poisoning due to memory errors. Extending HGM seems to be the most
> elegant way to start providing better support for this.

Good point. Although I wonder if in practice, most applicable users
either already are, or should switch to, using MAP_SHARED hugetlb.

--
Thanks,

David / dhildenb

2023-01-10 20:15:10

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 35/46] hugetlb: add MADV_COLLAPSE for hugetlb

> +
> + if (curr < end) {
> + /* Don't hold the VMA lock for too long. */
> + hugetlb_vma_unlock_write(vma);
> + cond_resched();
> + hugetlb_vma_lock_write(vma);

I need to drop/reacquire the mapping lock here too (missed this when I
added the bits to grab the mapping lock in this function).
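
Roughly something like this, assuming the mapping lock here is the
i_mmap rwsem taken through vma->vm_file->f_mapping (untested sketch):

	if (curr < end) {
		/* Don't hold the VMA lock or the mapping lock for too long. */
		hugetlb_vma_unlock_write(vma);
		i_mmap_unlock_write(vma->vm_file->f_mapping);
		cond_resched();
		i_mmap_lock_write(vma->vm_file->f_mapping);
		hugetlb_vma_lock_write(vma);
	}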

2023-01-11 22:07:01

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step

On Thu, Jan 05, 2023 at 10:18:11AM +0000, James Houghton wrote:

[...]

> +static int hugetlb_hgm_walk_uninit(struct hugetlb_pte *hpte,

Nitpick on the name: the "uninit" can be misread into pairing with some
other "init()" calls..

How about just call it hugetlb_hgm_walk (since it's the higher level API
comparing to the existing one)? Then the existing hugetlb_hgm_walk can be
called hugetlb_hgm_do_walk/__hugetlb_hgm_walk since it's one level down.

> + pte_t *ptep,
> + struct vm_area_struct *vma,
> + unsigned long addr,
> + unsigned long target_sz,
> + bool alloc)
> +{
> + struct hstate *h = hstate_vma(vma);
> +
> + hugetlb_pte_populate(vma->vm_mm, hpte, ptep, huge_page_shift(h),
> + hpage_size_to_level(huge_page_size(h)));

Another nitpick on name: I remembered we used to reach a consensus of using
hugetlb_pte_init before? Can we still avoid the word "populate" (if "init"
is not suitable since it can be updated during stepping, how about "setup")?

[...]

> +int hugetlb_walk_step(struct mm_struct *mm, struct hugetlb_pte *hpte,
> + unsigned long addr, unsigned long sz)
> +{
> + pte_t *ptep;
> + spinlock_t *ptl;
> +
> + switch (hpte->level) {
> + case HUGETLB_LEVEL_PUD:
> + ptep = (pte_t *)hugetlb_alloc_pmd(mm, hpte, addr);
> + if (IS_ERR(ptep))
> + return PTR_ERR(ptep);
> + hugetlb_pte_populate(mm, hpte, ptep, PMD_SHIFT,
> + HUGETLB_LEVEL_PMD);
> + break;
> + case HUGETLB_LEVEL_PMD:
> + ptep = hugetlb_alloc_pte(mm, hpte, addr);
> + if (IS_ERR(ptep))
> + return PTR_ERR(ptep);
> + ptl = pte_lockptr(mm, (pmd_t *)hpte->ptep);
> + __hugetlb_pte_populate(hpte, ptep, PAGE_SHIFT,
> + HUGETLB_LEVEL_PTE, ptl);
> + hpte->ptl = ptl;

This line seems to be superfluous (even if benign).

> + break;
> + default:
> + WARN_ONCE(1, "%s: got invalid level: %d (shift: %d)\n",
> + __func__, hpte->level, hpte->shift);
> + return -EINVAL;
> + }
> + return 0;
> +}
> +
> /*
> * Return a mask that can be used to update an address to the last huge
> * page in a page table page mapping size. Used to skip non-present
> --
> 2.39.0.314.g84b9a713c41-goog
>

--
Peter Xu

2023-01-11 23:19:27

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

James,

On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote:
> @@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> int mapcount = page_mapcount(page);
>
> if (mapcount >= 2)
> - mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> + mss->shared_hugetlb += hugetlb_pte_size(hpte);
> else
> - mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> + mss->private_hugetlb += hugetlb_pte_size(hpte);
> }
> return 0;

One thing interesting I found with hgm right now is mostly everything will
be counted as "shared" here, I think it's because mapcount is accounted
always to the huge page even if mapped in smaller sizes, so page_mapcount()
to a small page should be huge too because the head page mapcount should be
huge. I'm curious the reasons behind the mapcount decision.

For example, would that risk overflow with head_compound_mapcount? One 1G
page mapping all 4K takes 0.25M counts, while the limit should be 2G for
atomic_t. Looks like it's possible.

Btw, are the small page* pointers still needed in the latest HGM design?
Is there code taking care of disabling of hugetlb vmemmap optimization for
HGM? Or maybe it's not needed anymore for the current design?

Thanks,

--
Peter Xu

2023-01-12 15:21:55

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Wed, Jan 11, 2023 at 5:58 PM Peter Xu <[email protected]> wrote:
>
> James,
>
> On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote:
> > @@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> > int mapcount = page_mapcount(page);
> >
> > if (mapcount >= 2)
> > - mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > + mss->shared_hugetlb += hugetlb_pte_size(hpte);
> > else
> > - mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > + mss->private_hugetlb += hugetlb_pte_size(hpte);
> > }
> > return 0;
>
> One thing interesting I found with hgm right now is mostly everything will
> be counted as "shared" here, I think it's because mapcount is accounted
> always to the huge page even if mapped in smaller sizes, so page_mapcount()
> to a small page should be huge too because the head page mapcount should be
> huge. I'm curious the reasons behind the mapcount decision.
>
> For example, would that risk overflow with head_compound_mapcount? One 1G
> page mapping all 4K takes 0.25M counts, while the limit should be 2G for
> atomic_t. Looks like it's possible.

The original mapcount approach was "if the hstate-level PTE is
present, increment the compound_mapcount by 1" (basically "if any of
the hugepage is mapped, increment the compound_mapcount by 1"), but
this was painful to implement, so I changed it to what it is now (each
new PT mapping increments the compound_mapcount by 1). I think you're
right, there is absolutely an overflow risk. :( I'm not sure what the
best solution is. I could just go back to the old approach.

Regarding when things are accounted in private_hugetlb vs.
shared_hugetlb, HGM shouldn't change that, because it only applies to
shared mappings (right now anyway). It seems like "private_hugetlb"
can include cases where the page is shared but has only one mapping,
in which case HGM will change it from "private" to "shared".

>
> Btw, are the small page* pointers still needed in the latest HGM design?
> Is there code taking care of disabling of hugetlb vmemmap optimization for
> HGM? Or maybe it's not needed anymore for the current design?

The hugetlb vmemmap optimization can still be used with HGM, so there
is no code to disable it. We don't need small page* pointers either,
except for when we're dealing with mapping subpages, like in
hugetlb_no_page. Everything else can handle the hugetlb page as a
folio.

I hope we can keep compatibility with the vmemmap optimization while
solving the mapcount overflow risk.

Thanks Peter.
- James

2023-01-12 15:59:08

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step

On Wed, Jan 11, 2023 at 4:51 PM Peter Xu <[email protected]> wrote:
>
> On Thu, Jan 05, 2023 at 10:18:11AM +0000, James Houghton wrote:
>
> [...]
>
> > +static int hugetlb_hgm_walk_uninit(struct hugetlb_pte *hpte,
>
> Nitpick on the name: the "uninit" can be misread into pairing with some
> other "init()" calls..
>
> How about just call it hugetlb_hgm_walk (since it's the higher level API
> comparing to the existing one)? Then the existing hugetlb_hgm_walk can be
> called hugetlb_hgm_do_walk/__hugetlb_hgm_walk since it's one level down.
>
> > + pte_t *ptep,
> > + struct vm_area_struct *vma,
> > + unsigned long addr,
> > + unsigned long target_sz,
> > + bool alloc)
> > +{
> > + struct hstate *h = hstate_vma(vma);
> > +
> > + hugetlb_pte_populate(vma->vm_mm, hpte, ptep, huge_page_shift(h),
> > + hpage_size_to_level(huge_page_size(h)));
>
> Another nitpick on name: I remembered we used to reach a consensus of using
> hugetlb_pte_init before? Can we still avoid the word "populate" (if "init"
> is not suitable since it can be updated during stepping, how about "setup")?

Right, we did talk about this, sorry. Ok I'll go ahead with this name change.
- hugetlb_hgm_walk => __hugetlb_hgm_walk
- hugetlb_hgm_walk_uninit => hugetlb_hgm_walk
- [__,]hugetlb_pte_populate => [__,]hugetlb_pte_init

>
> [...]
>
> > +int hugetlb_walk_step(struct mm_struct *mm, struct hugetlb_pte *hpte,
> > + unsigned long addr, unsigned long sz)
> > +{
> > + pte_t *ptep;
> > + spinlock_t *ptl;
> > +
> > + switch (hpte->level) {
> > + case HUGETLB_LEVEL_PUD:
> > + ptep = (pte_t *)hugetlb_alloc_pmd(mm, hpte, addr);
> > + if (IS_ERR(ptep))
> > + return PTR_ERR(ptep);
> > + hugetlb_pte_populate(mm, hpte, ptep, PMD_SHIFT,
> > + HUGETLB_LEVEL_PMD);
> > + break;
> > + case HUGETLB_LEVEL_PMD:
> > + ptep = hugetlb_alloc_pte(mm, hpte, addr);
> > + if (IS_ERR(ptep))
> > + return PTR_ERR(ptep);
> > + ptl = pte_lockptr(mm, (pmd_t *)hpte->ptep);
> > + __hugetlb_pte_populate(hpte, ptep, PAGE_SHIFT,
> > + HUGETLB_LEVEL_PTE, ptl);
> > + hpte->ptl = ptl;
>
> This line seems to be superfluous (even if benign).

Nice catch! It shouldn't be there; I accidentally left it in when I
changed how `ptl` was handled.
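
With that line gone (and before the renames above), the PMD case would
just be:

	case HUGETLB_LEVEL_PMD:
		ptep = hugetlb_alloc_pte(mm, hpte, addr);
		if (IS_ERR(ptep))
			return PTR_ERR(ptep);
		ptl = pte_lockptr(mm, (pmd_t *)hpte->ptep);
		/* __hugetlb_pte_populate() already records the PTL. */
		__hugetlb_pte_populate(hpte, ptep, PAGE_SHIFT,
				       HUGETLB_LEVEL_PTE, ptl);
		break;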

Thanks Peter!

2023-01-12 16:17:49

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Thu, Jan 12, 2023 at 09:06:48AM -0500, James Houghton wrote:
> On Wed, Jan 11, 2023 at 5:58 PM Peter Xu <[email protected]> wrote:
> >
> > James,
> >
> > On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote:
> > > @@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> > > int mapcount = page_mapcount(page);
> > >
> > > if (mapcount >= 2)
> > > - mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > > + mss->shared_hugetlb += hugetlb_pte_size(hpte);
> > > else
> > > - mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > > + mss->private_hugetlb += hugetlb_pte_size(hpte);
> > > }
> > > return 0;
> >
> > One thing interesting I found with hgm right now is mostly everything will
> > be counted as "shared" here, I think it's because mapcount is accounted
> > always to the huge page even if mapped in smaller sizes, so page_mapcount()
> > to a small page should be huge too because the head page mapcount should be
> > huge. I'm curious the reasons behind the mapcount decision.
> >
> > For example, would that risk overflow with head_compound_mapcount? One 1G
> > page mapping all 4K takes 0.25M counts, while the limit should be 2G for
> > atomic_t. Looks like it's possible.
>
> The original mapcount approach was "if the hstate-level PTE is
> present, increment the compound_mapcount by 1" (basically "if any of
> the hugepage is mapped, increment the compound_mapcount by 1"), but
> this was painful to implement,

Any more info here on why it was painful? What is the major blocker?

> so I changed it to what it is now (each new PT mapping increments the
> compound_mapcount by 1). I think you're right, there is absolutely an
> overflow risk. :( I'm not sure what the best solution is. I could just go
> back to the old approach.

No rush on that; let's discuss it thoroughly before doing anything. We
have more context than when it was discussed initially in the calls, so
maybe a good time to revisit.

>
> Regarding when things are accounted in private_hugetlb vs.
> shared_hugetlb, HGM shouldn't change that, because it only applies to
> shared mappings (right now anyway). It seems like "private_hugetlb"
> can include cases where the page is shared but has only one mapping,
> in which case HGM will change it from "private" to "shared".

The two fields are not defined against VM_SHARED, it seems. At least not
with current code base.

Let me quote the code again just to be clear:

int mapcount = page_mapcount(page); <------------- [1]

if (mapcount >= 2)
mss->shared_hugetlb += hugetlb_pte_size(hpte);
else
mss->private_hugetlb += hugetlb_pte_size(hpte);

smaps_hugetlb_hgm_account(mss, hpte);

So that information (for some reason) is only relevant to how many mapcount
is there. If we have one 1G page and only two 4K mapped, with the existing
logic we should see 8K private_hugetlb while in fact I think it should be
8K shared_hugetlb due to page_mapcount() taking account of both 4K mappings
(as they all goes back to the head).

I have no idea whether violating that will be a problem or not, I suppose
at least it needs justification if it will be violated, or hopefully it can
be kept as-is.

>
> >
> > Btw, are the small page* pointers still needed in the latest HGM design?
> > Is there code taking care of disabling of hugetlb vmemmap optimization for
> > HGM? Or maybe it's not needed anymore for the current design?
>
> The hugetlb vmemmap optimization can still be used with HGM, so there
> is no code to disable it. We don't need small page* pointers either,
> except for when we're dealing with mapping subpages, like in
> hugetlb_no_page. Everything else can handle the hugetlb page as a
> folio.
>
> I hope we can keep compatibility with the vmemmap optimization while
> solving the mapcount overflow risk.

Yeh that'll be perfect if it works. But afaiu even with your current
design (ignoring all the issues on either smaps accounting or overflow
risks), we already reference the small pages, don't we? See:

static inline int page_mapcount(struct page *page)
{
int mapcount = atomic_read(&page->_mapcount) + 1; <-------- here

if (likely(!PageCompound(page)))
return mapcount;
page = compound_head(page);
return head_compound_mapcount(page) + mapcount;
}

Even if we assume small page->_mapcount should always be zero in this case,
we may need to take special care of hugetlb pages in page_mapcount() to not
reference it at all. Or I think it's reading random values and some days
it can be non-zero.

The other approach is probably using the thp approach. After Hugh's rework
on the thp accounting I assumed it would be even cleaner (at least no
DoubleMap complexity anymore.. even though I can't say I fully digested the
whole history of that). It's all about what's the major challenges of
using the same approach there with thp. You may have more knowledge on
that aspect so I'd be willing to know.

Thanks,

--
Peter Xu

2023-01-12 17:48:05

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Thu, Jan 12, 2023 at 10:29 AM Peter Xu <[email protected]> wrote:
>
> On Thu, Jan 12, 2023 at 09:06:48AM -0500, James Houghton wrote:
> > On Wed, Jan 11, 2023 at 5:58 PM Peter Xu <[email protected]> wrote:
> > >
> > > James,
> > >
> > > On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote:
> > > > @@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> > > > int mapcount = page_mapcount(page);
> > > >
> > > > if (mapcount >= 2)
> > > > - mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > > > + mss->shared_hugetlb += hugetlb_pte_size(hpte);
> > > > else
> > > > - mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > > > + mss->private_hugetlb += hugetlb_pte_size(hpte);
> > > > }
> > > > return 0;
> > >
> > > One thing interesting I found with hgm right now is mostly everything will
> > > be counted as "shared" here, I think it's because mapcount is accounted
> > > always to the huge page even if mapped in smaller sizes, so page_mapcount()
> > > to a small page should be huge too because the head page mapcount should be
> > > huge. I'm curious the reasons behind the mapcount decision.
> > >
> > > For example, would that risk overflow with head_compound_mapcount? One 1G
> > > page mapping all 4K takes 0.25M counts, while the limit should be 2G for
> > > atomic_t. Looks like it's possible.
> >
> > The original mapcount approach was "if the hstate-level PTE is
> > present, increment the compound_mapcount by 1" (basically "if any of
> > the hugepage is mapped, increment the compound_mapcount by 1"), but
> > this was painful to implement,
>
> Any more info here on why it was painful? What is the major blocker?

The original approach was implemented in RFC v1, but the
implementation was broken: the way refcount was handled was wrong; it
was incremented once for each new page table mapping. (How?
find_lock_page(), called once per hugetlb_no_page/UFFDIO_CONTINUE
would increment refcount and we wouldn't drop it, and in
__unmap_hugepage_range(), the mmu_gather bits would decrement the
refcount once per mapping.)

At the time, I figured the complexity of handling mapcount AND
refcount correctly in the original approach would be quite complex, so
I switched to the new one.

1. In places that already change the mapcount, check that we're
installing the hstate-level PTE, not a high-granularity PTE. Adjust
mapcount AND refcount appropriately.
2. In the HGM walking bits, indicate to the caller whether we made the
hstate-level PTE present. (hugetlb_[pmd,pte]_alloc is the source of
truth.) Need to
keep track of this until we figure out which page we're allocating
PTEs for, then change mapcount/refcount appropriately.
3. In unmapping bits, change mmu_gather/tlb bits to drop refcount only
once per hugepage. (This is probably the hardest of these three things
to get right.)

>
> > so I changed it to what it is now (each new PT mapping increments the
> > compound_mapcount by 1). I think you're right, there is absolutely an
> > overflow risk. :( I'm not sure what the best solution is. I could just go
> > back to the old approach.
>
> No rush on that; let's discuss it thoroughly before doing anything. We
> have more context than when it was discussed initially in the calls, so
> maybe a good time to revisit.
>
> >
> > Regarding when things are accounted in private_hugetlb vs.
> > shared_hugetlb, HGM shouldn't change that, because it only applies to
> > shared mappings (right now anyway). It seems like "private_hugetlb"
> > can include cases where the page is shared but has only one mapping,
> > in which case HGM will change it from "private" to "shared".
>
> The two fields are not defined against VM_SHARED, it seems. At least not
> with current code base.
>
> Let me quote the code again just to be clear:
>
> int mapcount = page_mapcount(page); <------------- [1]
>
> if (mapcount >= 2)
> mss->shared_hugetlb += hugetlb_pte_size(hpte);
> else
> mss->private_hugetlb += hugetlb_pte_size(hpte);
>
> smaps_hugetlb_hgm_account(mss, hpte);
>
> So that information (for some reason) is only relevant to how many mapcount
> is there. If we have one 1G page and only two 4K mapped, with the existing
> logic we should see 8K private_hugetlb while in fact I think it should be
> 8K shared_hugetlb due to page_mapcount() taking account of both 4K mappings
> (as they all goes back to the head).
>
> I have no idea whether violating that will be a problem or not, I suppose
> at least it needs justification if it will be violated, or hopefully it can
> be kept as-is.

Agreed that this is a problem. I'm not sure what should be done here.
It seems like the current upstream implementation is incorrect (surely
MAP_SHARED with only one mapping should still be shared_hugetlb not
private_hugetlb); the check should really be `if (vma->vm_flags &
VM_MAYSHARE)` instead of `mapcount >= 2`. If that change can be taken,
we don't have a problem here.
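
i.e. something like (untested):

	if (vma->vm_flags & VM_MAYSHARE)
		mss->shared_hugetlb += hugetlb_pte_size(hpte);
	else
		mss->private_hugetlb += hugetlb_pte_size(hpte);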

>
> >
> > >
> > > Btw, are the small page* pointers still needed in the latest HGM design?
> > > Is there code taking care of disabling of hugetlb vmemmap optimization for
> > > HGM? Or maybe it's not needed anymore for the current design?
> >
> > The hugetlb vmemmap optimization can still be used with HGM, so there
> > is no code to disable it. We don't need small page* pointers either,
> > except for when we're dealing with mapping subpages, like in
> > hugetlb_no_page. Everything else can handle the hugetlb page as a
> > folio.
> >
> > I hope we can keep compatibility with the vmemmap optimization while
> > solving the mapcount overflow risk.
>
> Yeh that'll be perfect if it works. But afaiu even with your current
> design (ignoring all the issues on either smaps accounting or overflow
> risks), we already reference the small pages, don't we? See:
>
> static inline int page_mapcount(struct page *page)
> {
> int mapcount = atomic_read(&page->_mapcount) + 1; <-------- here
>
> if (likely(!PageCompound(page)))
> return mapcount;
> page = compound_head(page);
> return head_compound_mapcount(page) + mapcount;
> }
>
> Even if we assume small page->_mapcount should always be zero in this case,
> we may need to take special care of hugetlb pages in page_mapcount() to not
> reference it at all. Or I think it's reading random values and some days
> it can be non-zero.

IIUC, it's ok to read from all the hugetlb subpage structs, you just
can't *write* to them (except the first few). The first page of page
structs is mapped RW; all the others are mapped RO to a single
physical page.

>
> The other approach is probably using the thp approach. After Hugh's rework
> on the thp accounting I assumed it would be even cleaner (at least no
> DoubleMap complexity anymore.. even though I can't say I fully digested the
> whole history of that). It's all about what's the major challenges of
> using the same approach there with thp. You may have more knowledge on
> that aspect so I'd be willing to know.

I need to take a closer look at Hugh's approach to see if we can do it
the same way. (I wonder if the 1G THP series has some ideas too.)

A really simple solution could be just to prevent userspace from doing
MADV_SPLIT (or, if we are enabling HGM due to hwpoison, ignore the
poison) if it could result in a mapcount overflow. For 1G pages,
userspace would need 8192 mappings to overflow mapcount/refcount.
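(2^31 / (1G / 4K) = 2,147,483,648 / 262,144 = 8,192 full mappings of
the same 1G page.)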

2023-01-12 18:31:42

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

> The original approach was implemented in RFC v1, but the
> implementation was broken: the way refcount was handled was wrong; it
> was incremented once for each new page table mapping. (How?
> find_lock_page(), called once per hugetlb_no_page/UFFDIO_CONTINUE
> would increment refcount and we wouldn't drop it, and in
> __unmap_hugepage_range(), the mmu_gather bits would decrement the
> refcount once per mapping.)
>
> At the time, I figured the complexity of handling mapcount AND
> refcount correctly in the original approach would be quite complex, so
> I switched to the new one.

Sorry I didn't make this clear... the following steps are how we could
correctly implement the original approach.

> 1. In places that already change the mapcount, check that we're
> installing the hstate-level PTE, not a high-granularity PTE. Adjust
> mapcount AND refcount appropriately.
> 2. In the HGM walking bits, indicate to the caller whether we made the
> hstate-level PTE present. (hugetlb_[pmd,pte]_alloc is the source of
> truth.) Need to
> keep track of this until we figure out which page we're allocating
> PTEs for, then change mapcount/refcount appropriately.
> 3. In unmapping bits, change mmu_gather/tlb bits to drop refcount only
> once per hugepage. (This is probably the hardest of these three things
> to get right.)

2023-01-12 18:50:54

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 19/46] hugetlb: add HGM support for follow_hugetlb_page

On Thu, Jan 05, 2023 at 10:18:17AM +0000, James Houghton wrote:
> @@ -6731,22 +6746,22 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> * and skip the same_page loop below.
> */
> if (!pages && !vmas && !pfn_offset &&
> - (vaddr + huge_page_size(h) < vma->vm_end) &&
> - (remainder >= pages_per_huge_page(h))) {
> - vaddr += huge_page_size(h);
> - remainder -= pages_per_huge_page(h);
> - i += pages_per_huge_page(h);
> + (vaddr + pages_per_hpte < vma->vm_end) &&
> + (remainder >= pages_per_hpte)) {
> + vaddr += pages_per_hpte;

This silently breaks hugetlb GUP.. should be

vaddr += hugetlb_pte_size(&hpte);

It caused mysterious MISSING events when I was playing with this tree,
and I'm surprised the root cause was here. So far the most
time-consuming one. :)

> + remainder -= pages_per_hpte;
> + i += pages_per_hpte;
> spin_unlock(ptl);
> hugetlb_vma_unlock_read(vma);
> continue;
> }

--
Peter Xu

2023-01-12 19:45:55

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 19/46] hugetlb: add HGM support for follow_hugetlb_page

On Thu, Jan 12, 2023 at 1:02 PM Peter Xu <[email protected]> wrote:
>
> On Thu, Jan 05, 2023 at 10:18:17AM +0000, James Houghton wrote:
> > @@ -6731,22 +6746,22 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> > * and skip the same_page loop below.
> > */
> > if (!pages && !vmas && !pfn_offset &&
> > - (vaddr + huge_page_size(h) < vma->vm_end) &&
> > - (remainder >= pages_per_huge_page(h))) {
> > - vaddr += huge_page_size(h);
> > - remainder -= pages_per_huge_page(h);
> > - i += pages_per_huge_page(h);
> > + (vaddr + pages_per_hpte < vma->vm_end) &&
> > + (remainder >= pages_per_hpte)) {
> > + vaddr += pages_per_hpte;
>
> This silently breaks hugetlb GUP.. should be
>
> vaddr += hugetlb_pte_size(&hpte);
>
> It caused mysterious MISSING events when I was playing with this tree,
> and I'm surprised the root cause was here. So far the most
> time-consuming one. :)

Thanks Peter!! And the `vaddr + pages_per_hpte < vma->vm_end` should
be `vaddr + hugetlb_pte_size(&hpte) < vma->vm_end` too.
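
So for v2 the fast path would read something like (untested):

		if (!pages && !vmas && !pfn_offset &&
		    (vaddr + hugetlb_pte_size(&hpte) < vma->vm_end) &&
		    (remainder >= pages_per_hpte)) {
			vaddr += hugetlb_pte_size(&hpte);
			remainder -= pages_per_hpte;
			i += pages_per_hpte;
			spin_unlock(ptl);
			hugetlb_vma_unlock_read(vma);
			continue;
		}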

2023-01-12 21:28:19

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 39/46] hugetlb: x86: enable high-granularity mapping

On Thu, Jan 5, 2023 at 5:19 AM James Houghton <[email protected]> wrote:
>
> Now that HGM is fully supported for GENERAL_HUGETLB, x86 can enable it.
> The x86 KVM MMU already properly handles HugeTLB HGM pages (it does a
> page table walk to determine which size to use in the second-stage page
> table instead of, for example, checking vma_mmu_pagesize, like arm64
> does).
>
> We could also enable HugeTLB HGM for arm (32-bit) at this point, as it
> also uses GENERAL_HUGETLB and I don't see anything else that is needed
> for it. However, I haven't tested on arm at all, so I won't enable it.

Given that we are using a high bit for VM_HUGETLB_HGM, we can only
support 64-bit architectures. Userfaultfd minor faults are limited to
64-bit architectures for the same reason: VM_UFFD_MINOR uses a high bit.

>
> Signed-off-by: James Houghton <[email protected]>
> ---
> arch/x86/Kconfig | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 3604074a878b..3d08cd45549c 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -126,6 +126,7 @@ config X86
> select ARCH_WANT_GENERAL_HUGETLB
> select ARCH_WANT_HUGE_PMD_SHARE
> select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP if X86_64
> + select ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING

This needs `if X86_64` at the end. Will be corrected for v2.
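
i.e.:

	select ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING if X86_64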

> select ARCH_WANT_LD_ORPHAN_WARN
> select ARCH_WANTS_THP_SWAP if X86_64
> select ARCH_HAS_PARANOID_L1D_FLUSH
> --
> 2.39.0.314.g84b9a713c41-goog
>

2023-01-12 21:32:54

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Thu, Jan 12, 2023 at 11:45:40AM -0500, James Houghton wrote:
> On Thu, Jan 12, 2023 at 10:29 AM Peter Xu <[email protected]> wrote:
> >
> > On Thu, Jan 12, 2023 at 09:06:48AM -0500, James Houghton wrote:
> > > On Wed, Jan 11, 2023 at 5:58 PM Peter Xu <[email protected]> wrote:
> > > >
> > > > James,
> > > >
> > > > On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote:
> > > > > @@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> > > > > int mapcount = page_mapcount(page);
> > > > >
> > > > > if (mapcount >= 2)
> > > > > - mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > > > > + mss->shared_hugetlb += hugetlb_pte_size(hpte);
> > > > > else
> > > > > - mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > > > > + mss->private_hugetlb += hugetlb_pte_size(hpte);
> > > > > }
> > > > > return 0;
> > > >
> > > > One thing interesting I found with hgm right now is mostly everything will
> > > > be counted as "shared" here, I think it's because mapcount is accounted
> > > > always to the huge page even if mapped in smaller sizes, so page_mapcount()
> > > > to a small page should be huge too because the head page mapcount should be
> > > > huge. I'm curious the reasons behind the mapcount decision.
> > > >
> > > > For example, would that risk overflow with head_compound_mapcount? One 1G
> > > > page mapping all 4K takes 0.25M counts, while the limit should be 2G for
> > > > atomic_t. Looks like it's possible.
> > >
> > > The original mapcount approach was "if the hstate-level PTE is
> > > present, increment the compound_mapcount by 1" (basically "if any of
> > > the hugepage is mapped, increment the compound_mapcount by 1"), but
> > > this was painful to implement,
> >
> > Any more info here on why it was painful? What is the major blocker?
>
> The original approach was implemented in RFC v1, but the
> implementation was broken: the way refcount was handled was wrong; it
> was incremented once for each new page table mapping. (How?
> find_lock_page(), called once per hugetlb_no_page/UFFDIO_CONTINUE
> would increment refcount and we wouldn't drop it, and in
> __unmap_hugepage_range(), the mmu_gather bits would decrement the
> refcount once per mapping.)

I'm not sure I fully get the point here, but perhaps it's mostly about how
hugetlb page cache is managed (in hstate size not PAGE_SIZE)?

static inline struct page *folio_file_page(struct folio *folio, pgoff_t index)
{
	/* HugeTLBfs indexes the page cache in units of hpage_size */
	if (folio_test_hugetlb(folio))
		return &folio->page;
	return folio_page(folio, index & (folio_nr_pages(folio) - 1));
}

I haven't thought that through either. Is it possible to switch the
pgcache layout to PAGE_SIZE granularity too when HGM is enabled (e.g. a
simple scheme would be to fail MADV_SPLIT if the hugetlb pgcache already
contains any page)?

If we keep using the same pgcache scheme (hpage_size-stepped indexes),
find_lock_page() will also easily contend on the head page lock, so we may
not be able to handle concurrent page faults on small mappings of the same
page as efficiently as THP.

>
> At the time, I figured the complexity of handling mapcount AND
> refcount correctly in the original approach would be quite complex, so
> I switched to the new one.
>
> 1. In places that already change the mapcount, check that we're
> installing the hstate-level PTE, not a high-granularity PTE. Adjust
> mapcount AND refcount appropriately.
> 2. In the HGM walking bits, to the caller if we made the hstate-level
> PTE present. (hugetlb_[pmd,pte]_alloc is the source of truth.) Need to
> keep track of this until we figure out which page we're allocating
> PTEs for, then change mapcount/refcount appropriately.
> 3. In unmapping bits, change mmu_gather/tlb bits to drop refcount only
> once per hugepage. (This is probably the hardest of these three things
> to get right.)
>
> >
> > > so I changed it to what it is now (each new PT mapping increments the
> > > compound_mapcount by 1). I think you're right, there is absolutely an
> > > overflow risk. :( I'm not sure what the best solution is. I could just go
> > > back to the old approach.
> >
> > No rush on that; let's discuss it thoroughly before doing anything. We
> > have more context than when it was discussed initially in the calls, so
> > maybe a good time to revisit.
> >
> > >
> > > Regarding when things are accounted in private_hugetlb vs.
> > > shared_hugetlb, HGM shouldn't change that, because it only applies to
> > > shared mappings (right now anyway). It seems like "private_hugetlb"
> > > can include cases where the page is shared but has only one mapping,
> > > in which case HGM will change it from "private" to "shared".
> >
> > The two fields are not defined against VM_SHARED, it seems. At least not
> > with current code base.
> >
> > Let me quote the code again just to be clear:
> >
> > int mapcount = page_mapcount(page); <------------- [1]
> >
> > if (mapcount >= 2)
> > mss->shared_hugetlb += hugetlb_pte_size(hpte);
> > else
> > mss->private_hugetlb += hugetlb_pte_size(hpte);
> >
> > smaps_hugetlb_hgm_account(mss, hpte);
> >
> > So that information (for some reason) is only relevant to how many mapcount
> > is there. If we have one 1G page and only two 4K mapped, with the existing
> > logic we should see 8K private_hugetlb while in fact I think it should be
> > 8K shared_hugetlb due to page_mapcount() taking account of both 4K mappings
> > (as they all goes back to the head).
> >
> > I have no idea whether violating that will be a problem or not, I suppose
> > at least it needs justification if it will be violated, or hopefully it can
> > be kept as-is.
>
> Agreed that this is a problem. I'm not sure what should be done here.
> It seems like the current upstream implementation is incorrect (surely
> MAP_SHARED with only one mapping should still be shared_hugetlb not
> private_hugetlb); the check should really be `if (vma->vm_flags &
> VM_MAYSHARE)` instead of `mapcount >= 2`. If that change can be taken,
> we don't have a problem here.

Naoya added the relevant code. Not sure whether he'll chime in.

commit 25ee01a2fca02dfb5a3ce316e77910c468108199
Author: Naoya Horiguchi <[email protected]>
Date: Thu Nov 5 18:47:11 2015 -0800

mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps

In all cases, it'll still be a slight ABI change, so worth considering the
effects on existing users.
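
For reference, the alternative check being proposed would make
smaps_hugetlb_range() classify by the VMA rather than by mapcount, roughly
(a sketch, not part of the posted series):

	if (vma->vm_flags & VM_MAYSHARE)
		mss->shared_hugetlb += hugetlb_pte_size(hpte);
	else
		mss->private_hugetlb += hugetlb_pte_size(hpte);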

>
> >
> > >
> > > >
> > > > Btw, are the small page* pointers still needed in the latest HGM design?
> > > > Is there code taking care of disabling of hugetlb vmemmap optimization for
> > > > HGM? Or maybe it's not needed anymore for the current design?
> > >
> > > The hugetlb vmemmap optimization can still be used with HGM, so there
> > > is no code to disable it. We don't need small page* pointers either,
> > > except for when we're dealing with mapping subpages, like in
> > > hugetlb_no_page. Everything else can handle the hugetlb page as a
> > > folio.
> > >
> > > I hope we can keep compatibility with the vmemmap optimization while
> > > solving the mapcount overflow risk.
> >
> > Yeh that'll be perfect if it works. But afaiu even with your current
> > design (ignoring all the issues on either smaps accounting or overflow
> > risks), we already referenced the small pages, aren't we? See:
> >
> > static inline int page_mapcount(struct page *page)
> > {
> > int mapcount = atomic_read(&page->_mapcount) + 1; <-------- here
> >
> > if (likely(!PageCompound(page)))
> > return mapcount;
> > page = compound_head(page);
> > return head_compound_mapcount(page) + mapcount;
> > }
> >
> > Even if we assume small page->_mapcount should always be zero in this case,
> > we may need to take special care of hugetlb pages in page_mapcount() to not
> > reference it at all. Or I think it's reading random values and some days
> > it can be non-zero.
>
> IIUC, it's ok to read from all the hugetlb subpage structs, you just
> can't *write* to them (except the first few). The first page of page
> structs is mapped RW; all the others are mapped RO to a single
> physical page.

I'm not familiar with vmemmap work, but I saw this:

/*
* Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
* to the page which @vmemmap_reuse is mapped to, then free the pages
* which the range [@vmemmap_start, @vmemmap_end] is mapped to.
*/
if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse))

It seems it'll reuse the 1st page of the huge page's page structs rather
than backing the rest of the vmemmap with zero pages. Would that be a
problem (since some small page* pointers can actually point to, e.g., the
head page* if referenced)?

>
> >
> > The other approach is probably using the thp approach. After Hugh's rework
> > on the thp accounting I assumed it would be even cleaner (at least no
> > DoubleMap complexity anymore.. even though I can't say I fully digested the
> > whole history of that). It's all about what's the major challenges of
> > using the same approach there with thp. You may have more knowledge on
> > that aspect so I'd be willing to know.
>
> I need to take a closer look at Hugh's approach to see if we can do it
> the same way. (I wonder if the 1G THP series has some ideas too.)

https://lore.kernel.org/all/[email protected]/

It's already in the mainline. I think it's mostly internally implemented
under the rmap code APIs. For the HGM effort, I'd start with simply
passing around compound=false when using the rmap APIs, and see what'll
start to fail.
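
Concretely, that starting point would look something like this (a sketch
using the current rmap signatures; whether it holds up is exactly the
experiment being suggested):

	/* account one 4K piece of a hugetlb page like a PTE-mapped THP */
	page_add_file_rmap(subpage, vma, false);	/* map: compound=false */
	/* ... and on unmap: */
	page_remove_rmap(subpage, vma, false);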

>
> A really simple solution could be just to prevent userspace from doing
> MADV_SPLIT (or, if we are enabling HGM due to hwpoison, ignore the
> poison) if it could result in a mapcount overflow. For 1G pages,
> userspace would need 8192 mappings to overflow mapcount/refcount.

I'm not sure you can calculate it that way; consider fork()s along the way
when pmd sharing is disabled, or whatever other reason causes the 1G pages
to be mapped in multiple places by more than one mmap().

Thanks,

--
Peter Xu

2023-01-12 21:51:08

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Thu, Jan 12, 2023 at 3:27 PM Peter Xu <[email protected]> wrote:
>
> On Thu, Jan 12, 2023 at 11:45:40AM -0500, James Houghton wrote:
> > On Thu, Jan 12, 2023 at 10:29 AM Peter Xu <[email protected]> wrote:
> > >
> > > On Thu, Jan 12, 2023 at 09:06:48AM -0500, James Houghton wrote:
> > > > On Wed, Jan 11, 2023 at 5:58 PM Peter Xu <[email protected]> wrote:
> > > > >
> > > > > James,
> > > > >
> > > > > On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote:
> > > > > > @@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> > > > > > int mapcount = page_mapcount(page);
> > > > > >
> > > > > > if (mapcount >= 2)
> > > > > > - mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > > > > > + mss->shared_hugetlb += hugetlb_pte_size(hpte);
> > > > > > else
> > > > > > - mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > > > > > + mss->private_hugetlb += hugetlb_pte_size(hpte);
> > > > > > }
> > > > > > return 0;
> > > > >
> > > > > One thing interesting I found with hgm right now is mostly everything will
> > > > > be counted as "shared" here, I think it's because mapcount is accounted
> > > > > always to the huge page even if mapped in smaller sizes, so page_mapcount()
> > > > > to a small page should be huge too because the head page mapcount should be
> > > > > huge. I'm curious the reasons behind the mapcount decision.
> > > > >
> > > > > For example, would that risk overflow with head_compound_mapcount? One 1G
> > > > > page mapping all 4K takes 0.25M counts, while the limit should be 2G for
> > > > > atomic_t. Looks like it's possible.
> > > >
> > > > The original mapcount approach was "if the hstate-level PTE is
> > > > present, increment the compound_mapcount by 1" (basically "if any of
> > > > the hugepage is mapped, increment the compound_mapcount by 1"), but
> > > > this was painful to implement,
> > >
> > > Any more info here on why it was painful? What is the major blocker?
> >
> > The original approach was implemented in RFC v1, but the
> > implementation was broken: the way refcount was handled was wrong; it
> > was incremented once for each new page table mapping. (How?
> > find_lock_page(), called once per hugetlb_no_page/UFFDIO_CONTINUE
> > would increment refcount and we wouldn't drop it, and in
> > __unmap_hugepage_range(), the mmu_gather bits would decrement the
> > refcount once per mapping.)
>
> I'm not sure I fully get the point here, but perhaps it's mostly about how
> hugetlb page cache is managed (in hstate size not PAGE_SIZE)?
>
> static inline struct page *folio_file_page(struct folio *folio, pgoff_t index)
> {
> /* HugeTLBfs indexes the page cache in units of hpage_size */
> if (folio_test_hugetlb(folio))
> return &folio->page;
> return folio_page(folio, index & (folio_nr_pages(folio) - 1));
> }
>
> I haven't thought through on that either. Is it possible that we switche
> the pgcache layout to be in PAGE_SIZE granule too when HGM enabled (e.g. a
> simple scheme is we can fail MADV_SPLIT if hugetlb pgcache contains any
> page already).
>
> If keep using the same pgcache scheme (hpage_size stepped indexes),
> find_lock_page() will also easily content on the head page lock, so we may
> not be able to handle concurrent page faults on small mappings on the same
> page as efficient as thp.

We keep the pgcache scheme the same: hpage_size stepped indexes. The
refcount and mapcount are both stored in the head page. The problems
with the implementation in RFC v1 were (1) refcount became much much
larger than mapcount, and (2) refcount would have the same overflow
problem we're discussing.

You're right -- find_lock_page() in hugetlb_no_page() and
hugetlb_mcopy_atomic_pte() will contend on the head page lock, but I
think it's possible to improve things here (i.e., replace them with
find_get_page() somehow).
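
A sketch of what that might look like (whether the page lock can really be
dropped at these call sites is exactly the open question):

	/*
	 * find_get_page() takes a reference without lock_page(), so
	 * concurrent 4K faults on the same hugepage would not serialize
	 * on the head page lock.
	 */
	hpage = find_get_page(mapping, idx);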

>
> >
> > At the time, I figured the complexity of handling mapcount AND
> > refcount correctly in the original approach would be quite complex, so
> > I switched to the new one.
> >
> > 1. In places that already change the mapcount, check that we're
> > installing the hstate-level PTE, not a high-granularity PTE. Adjust
> > mapcount AND refcount appropriately.
> > 2. In the HGM walking bits, to the caller if we made the hstate-level
> > PTE present. (hugetlb_[pmd,pte]_alloc is the source of truth.) Need to
> > keep track of this until we figure out which page we're allocating
> > PTEs for, then change mapcount/refcount appropriately.
> > 3. In unmapping bits, change mmu_gather/tlb bits to drop refcount only
> > once per hugepage. (This is probably the hardest of these three things
> > to get right.)
> >
> > >
> > > > so I changed it to what it is now (each new PT mapping increments the
> > > > compound_mapcount by 1). I think you're right, there is absolutely an
> > > > overflow risk. :( I'm not sure what the best solution is. I could just go
> > > > back to the old approach.
> > >
> > > No rush on that; let's discuss it thoroughly before doing anything. We
> > > have more context than when it was discussed initially in the calls, so
> > > maybe a good time to revisit.
> > >
> > > >
> > > > Regarding when things are accounted in private_hugetlb vs.
> > > > shared_hugetlb, HGM shouldn't change that, because it only applies to
> > > > shared mappings (right now anyway). It seems like "private_hugetlb"
> > > > can include cases where the page is shared but has only one mapping,
> > > > in which case HGM will change it from "private" to "shared".
> > >
> > > The two fields are not defined against VM_SHARED, it seems. At least not
> > > with current code base.
> > >
> > > Let me quote the code again just to be clear:
> > >
> > > int mapcount = page_mapcount(page); <------------- [1]
> > >
> > > if (mapcount >= 2)
> > > mss->shared_hugetlb += hugetlb_pte_size(hpte);
> > > else
> > > mss->private_hugetlb += hugetlb_pte_size(hpte);
> > >
> > > smaps_hugetlb_hgm_account(mss, hpte);
> > >
> > > So that information (for some reason) is only relevant to how many mapcount
> > > is there. If we have one 1G page and only two 4K mapped, with the existing
> > > logic we should see 8K private_hugetlb while in fact I think it should be
> > > 8K shared_hugetlb due to page_mapcount() taking account of both 4K mappings
> > > (as they all goes back to the head).
> > >
> > > I have no idea whether violating that will be a problem or not, I suppose
> > > at least it needs justification if it will be violated, or hopefully it can
> > > be kept as-is.
> >
> > Agreed that this is a problem. I'm not sure what should be done here.
> > It seems like the current upstream implementation is incorrect (surely
> > MAP_SHARED with only one mapping should still be shared_hugetlb not
> > private_hugetlb); the check should really be `if (vma->vm_flags &
> > VM_MAYSHARE)` instead of `mapcount >= 2`. If that change can be taken,
> > we don't have a problem here.
>
> Naoya added relevant code. Not sure whether he'll chim in.
>
> commit 25ee01a2fca02dfb5a3ce316e77910c468108199
> Author: Naoya Horiguchi <[email protected]>
> Date: Thu Nov 5 18:47:11 2015 -0800
>
> mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps
>
> In all cases, it'll still be a slight ABI change, so worth considering the
> effects to existing users.
>
> >
> > >
> > > >
> > > > >
> > > > > Btw, are the small page* pointers still needed in the latest HGM design?
> > > > > Is there code taking care of disabling of hugetlb vmemmap optimization for
> > > > > HGM? Or maybe it's not needed anymore for the current design?
> > > >
> > > > The hugetlb vmemmap optimization can still be used with HGM, so there
> > > > is no code to disable it. We don't need small page* pointers either,
> > > > except for when we're dealing with mapping subpages, like in
> > > > hugetlb_no_page. Everything else can handle the hugetlb page as a
> > > > folio.
> > > >
> > > > I hope we can keep compatibility with the vmemmap optimization while
> > > > solving the mapcount overflow risk.
> > >
> > > Yeh that'll be perfect if it works. But afaiu even with your current
> > > design (ignoring all the issues on either smaps accounting or overflow
> > > risks), we already referenced the small pages, aren't we? See:
> > >
> > > static inline int page_mapcount(struct page *page)
> > > {
> > > int mapcount = atomic_read(&page->_mapcount) + 1; <-------- here
> > >
> > > if (likely(!PageCompound(page)))
> > > return mapcount;
> > > page = compound_head(page);
> > > return head_compound_mapcount(page) + mapcount;
> > > }
> > >
> > > Even if we assume small page->_mapcount should always be zero in this case,
> > > we may need to take special care of hugetlb pages in page_mapcount() to not
> > > reference it at all. Or I think it's reading random values and some days
> > > it can be non-zero.
> >
> > IIUC, it's ok to read from all the hugetlb subpage structs, you just
> > can't *write* to them (except the first few). The first page of page
> > structs is mapped RW; all the others are mapped RO to a single
> > physical page.
>
> I'm not familiar with vmemmap work, but I saw this:
>
> /*
> * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
> * to the page which @vmemmap_reuse is mapped to, then free the pages
> * which the range [@vmemmap_start, @vmemmap_end] is mapped to.
> */
> if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse))
>
> It seems it'll reuse the 1st page of the huge page* rather than backing the
> rest vmemmap with zero pages. Would that be a problem (as I think some
> small page* can actually points to the e.g. head page* if referenced)?

It shouldn't be a problem if we don't use _mapcount and always use
compound_mapcount, which is how mapcount is handled in this version of
this series.
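
In other words (a sketch), hugetlb mapcount reads and updates in this
version only ever touch the head page, so the read-only vmemmap-optimized
tail page structs are never written:

	/* never touches the subpage _mapcount */
	mapcount = head_compound_mapcount(compound_head(page));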

>
> >
> > >
> > > The other approach is probably using the thp approach. After Hugh's rework
> > > on the thp accounting I assumed it would be even cleaner (at least no
> > > DoubleMap complexity anymore.. even though I can't say I fully digested the
> > > whole history of that). It's all about what's the major challenges of
> > > using the same approach there with thp. You may have more knowledge on
> > > that aspect so I'd be willing to know.
> >
> > I need to take a closer look at Hugh's approach to see if we can do it
> > the same way. (I wonder if the 1G THP series has some ideas too.)
>
> https://lore.kernel.org/all/[email protected]/
>
> It's already in the mainline. I think it's mostly internally implemented
> under the rmap code APIs. For the HGM effort, I'd start with simply
> passing around compound=false when using the rmap APIs, and see what'll
> start to fail.

I'll look into it, but doing it this way will use _mapcount, so we
won't be able to use the vmemmap optimization. I think even if we do
use Hugh's approach, refcount is still being kept on the head page, so
there's still an overflow risk there (but maybe I am
misunderstanding).

>
> >
> > A really simple solution could be just to prevent userspace from doing
> > MADV_SPLIT (or, if we are enabling HGM due to hwpoison, ignore the
> > poison) if it could result in a mapcount overflow. For 1G pages,
> > userspace would need 8192 mappings to overflow mapcount/refcount.
>
> I'm not sure you can calculate it; consider fork()s along the way when pmd
> sharing disabled, or whatever reason when the 1G pages mapped at multiple
> places with more than one mmap()s.

Yeah you're right. :(

2023-01-12 21:51:38

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Thu, Jan 12, 2023 at 04:17:52PM -0500, James Houghton wrote:
> I'll look into it, but doing it this way will use _mapcount, so we
> won't be able to use the vmemmap optimization. I think even if we do
> use Hugh's approach, refcount is still being kept on the head page, so
> there's still an overflow risk there (but maybe I am
> misunderstanding).

Could you remind me what the issue is with using refcounts on the small
pages rather than the head (assuming the vmemmap optimization can still be
disabled)?

Thanks,

--
Peter Xu

2023-01-16 11:43:28

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On 12.01.23 22:33, Peter Xu wrote:
> On Thu, Jan 12, 2023 at 04:17:52PM -0500, James Houghton wrote:
>> I'll look into it, but doing it this way will use _mapcount, so we
>> won't be able to use the vmemmap optimization. I think even if we do
>> use Hugh's approach, refcount is still being kept on the head page, so
>> there's still an overflow risk there (but maybe I am
>> misunderstanding).
>
> Could you remind me what's the issue if using refcount on the small pages
> rather than the head (assuming vmemmap still can be disabled)?

The THP-way of doing things is refcounting on the head page. All folios
use a single refcount on the head.

There has to be a pretty good reason to do it differently.

--
Thanks,

David / dhildenb

2023-01-17 23:08:38

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 35/46] hugetlb: add MADV_COLLAPSE for hugetlb

On Tue, Jan 17, 2023 at 01:38:24PM -0800, James Houghton wrote:
> > > + if (curr < end) {
> > > + /* Don't hold the VMA lock for too long. */
> > > + hugetlb_vma_unlock_write(vma);
> > > + cond_resched();
> > > + hugetlb_vma_lock_write(vma);
> >
> > The intention is good here but IIUC this will cause vma lock to be taken
> > after the i_mmap_rwsem, which can cause circular deadlocks. If to do this
> > properly we'll need to also release the i_mmap_rwsem.
>
> Sorry if you spent a long time debugging this! I sent a reply a week
> ago about this too.

Oops, yes, I somehow missed that one. No worry - it's reported by
lockdep. :)

>
> >
> > However it may make the resched() logic over complicated, meanwhile for 2M
> > huge pages I think this will be called for each 2M range which can be too
> > fine grained, so it looks like the "cur < end" check is a bit too aggresive.
> >
> > The other thing is I noticed that the long period of mmu notifier
> > invalidate between start -> end will (in reallife VM context) causing vcpu
> > threads spinning.
> >
> > I _think_ it's because is_page_fault_stale() (when during a vmexit
> > following a kvm page fault) always reports true during the long procedure
> > of MADV_COLLAPSE if to be called upon a large range, so even if we release
> > both locks here it may not tremedously on the VM migration use case because
> > of the long-standing mmu notifier invalidation procedure.
>
> Oh... indeed. Thanks for pointing that out.
>
> >
> > To summarize.. I think a simpler start version of hugetlb MADV_COLLAPSE can
> > drop this "if" block, and let the userapp decide the step size of COLLAPSE?
>
> I'll drop this resched logic. Thanks Peter.

Sounds good, thanks.

--
Peter Xu

2023-01-17 23:10:15

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 35/46] hugetlb: add MADV_COLLAPSE for hugetlb

> > + if (curr < end) {
> > + /* Don't hold the VMA lock for too long. */
> > + hugetlb_vma_unlock_write(vma);
> > + cond_resched();
> > + hugetlb_vma_lock_write(vma);
>
> The intention is good here but IIUC this will cause vma lock to be taken
> after the i_mmap_rwsem, which can cause circular deadlocks. If to do this
> properly we'll need to also release the i_mmap_rwsem.

Sorry if you spent a long time debugging this! I sent a reply a week
ago about this too.

>
> However it may make the resched() logic over complicated, meanwhile for 2M
> huge pages I think this will be called for each 2M range which can be too
> fine grained, so it looks like the "cur < end" check is a bit too aggresive.
>
> The other thing is I noticed that the long period of mmu notifier
> invalidate between start -> end will (in reallife VM context) causing vcpu
> threads spinning.
>
> I _think_ it's because is_page_fault_stale() (when during a vmexit
> following a kvm page fault) always reports true during the long procedure
> of MADV_COLLAPSE if to be called upon a large range, so even if we release
> both locks here it may not tremedously on the VM migration use case because
> of the long-standing mmu notifier invalidation procedure.

Oh... indeed. Thanks for pointing that out.

>
> To summarize.. I think a simpler start version of hugetlb MADV_COLLAPSE can
> drop this "if" block, and let the userapp decide the step size of COLLAPSE?

I'll drop this resched logic. Thanks Peter.

2023-01-17 23:47:12

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 35/46] hugetlb: add MADV_COLLAPSE for hugetlb

On Thu, Jan 05, 2023 at 10:18:33AM +0000, James Houghton wrote:
> This is a necessary extension to the UFFDIO_CONTINUE changes. When
> userspace finishes mapping an entire hugepage with UFFDIO_CONTINUE, the
> kernel has no mechanism to automatically collapse the page table to map
> the whole hugepage normally. We require userspace to inform us that they
> would like the mapping to be collapsed; they do this with MADV_COLLAPSE.
>
> If userspace has not mapped all of a hugepage with UFFDIO_CONTINUE, but
> only some, hugetlb_collapse will cause the requested range to be mapped
> as if it were UFFDIO_CONTINUE'd already. The effects of any
> UFFDIO_WRITEPROTECT calls may be undone by a call to MADV_COLLAPSE for
> intersecting address ranges.
>
> This commit is co-opting the same madvise mode that has been introduced
> to synchronously collapse THPs. The function that does THP collapsing
> has been renamed to madvise_collapse_thp.
>
> As with the rest of the high-granularity mapping support, MADV_COLLAPSE
> is only supported for shared VMAs right now.
>
> MADV_COLLAPSE has the same synchronization as huge_pmd_unshare.
>
> Signed-off-by: James Houghton <[email protected]>
> ---
> include/linux/huge_mm.h | 12 +--
> include/linux/hugetlb.h | 8 ++
> mm/hugetlb.c | 164 ++++++++++++++++++++++++++++++++++++++++
> mm/khugepaged.c | 4 +-
> mm/madvise.c | 18 ++++-
> 5 files changed, 197 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index a1341fdcf666..5d1e3c980f74 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -218,9 +218,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>
> int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
> int advice);
> -int madvise_collapse(struct vm_area_struct *vma,
> - struct vm_area_struct **prev,
> - unsigned long start, unsigned long end);
> +int madvise_collapse_thp(struct vm_area_struct *vma,
> + struct vm_area_struct **prev,
> + unsigned long start, unsigned long end);
> void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
> unsigned long end, long adjust_next);
> spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> @@ -367,9 +367,9 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
> return -EINVAL;
> }
>
> -static inline int madvise_collapse(struct vm_area_struct *vma,
> - struct vm_area_struct **prev,
> - unsigned long start, unsigned long end)
> +static inline int madvise_collapse_thp(struct vm_area_struct *vma,
> + struct vm_area_struct **prev,
> + unsigned long start, unsigned long end)
> {
> return -EINVAL;
> }
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index c8524ac49b24..e1baf939afb6 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -1298,6 +1298,8 @@ bool hugetlb_hgm_eligible(struct vm_area_struct *vma);
> int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
> struct vm_area_struct *vma, unsigned long start,
> unsigned long end);
> +int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long start, unsigned long end);
> #else
> static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> {
> @@ -1318,6 +1320,12 @@ int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
> {
> return -EINVAL;
> }
> +static inline
> +int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long start, unsigned long end)
> +{
> + return -EINVAL;
> +}
> #endif
>
> static inline spinlock_t *huge_pte_lock(struct hstate *h,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5b6215e03fe1..388c46c7e77a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -7852,6 +7852,170 @@ int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
> return 0;
> }
>
> +static bool hugetlb_hgm_collapsable(struct vm_area_struct *vma)
> +{
> + if (!hugetlb_hgm_eligible(vma))
> + return false;
> + if (!vma->vm_private_data) /* vma lock required for collapsing */
> + return false;
> + return true;
> +}
> +
> +/*
> + * Collapse the address range from @start to @end to be mapped optimally.
> + *
> + * This is only valid for shared mappings. The main use case for this function
> + * is following UFFDIO_CONTINUE. If a user UFFDIO_CONTINUEs an entire hugepage
> + * by calling UFFDIO_CONTINUE once for each 4K region, the kernel doesn't know
> + * to collapse the mapping after the final UFFDIO_CONTINUE. Instead, we leave
> + * it up to userspace to tell us to do so, via MADV_COLLAPSE.
> + *
> + * Any holes in the mapping will be filled. If there is no page in the
> + * pagecache for a region we're collapsing, the PTEs will be cleared.
> + *
> + * If high-granularity PTEs are uffd-wp markers, those markers will be dropped.
> + */
> +int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long start, unsigned long end)
> +{
> + struct hstate *h = hstate_vma(vma);
> + struct address_space *mapping = vma->vm_file->f_mapping;
> + struct mmu_notifier_range range;
> + struct mmu_gather tlb;
> + unsigned long curr = start;
> + int ret = 0;
> + struct page *hpage, *subpage;
> + pgoff_t idx;
> + bool writable = vma->vm_flags & VM_WRITE;
> + bool shared = vma->vm_flags & VM_SHARED;
> + struct hugetlb_pte hpte;
> + pte_t entry;
> +
> + /*
> + * This is only supported for shared VMAs, because we need to look up
> + * the page to use for any PTEs we end up creating.
> + */
> + if (!shared)
> + return -EINVAL;
> +
> + /* If HGM is not enabled, there is nothing to collapse. */
> + if (!hugetlb_hgm_enabled(vma))
> + return 0;
> +
> + /*
> + * We lost the VMA lock after splitting, so we can't safely collapse.
> + * We could improve this in the future (like take the mmap_lock for
> + * writing and try again), but for now just fail with ENOMEM.
> + */
> + if (unlikely(!hugetlb_hgm_collapsable(vma)))
> + return -ENOMEM;
> +
> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
> + start, end);
> + mmu_notifier_invalidate_range_start(&range);
> + tlb_gather_mmu(&tlb, mm);
> +
> + /*
> + * Grab the VMA lock and mapping sem for writing. This will prevent
> + * concurrent high-granularity page table walks, so that we can safely
> + * collapse and free page tables.
> + *
> + * This is the same locking that huge_pmd_unshare requires.
> + */
> + hugetlb_vma_lock_write(vma);
> + i_mmap_lock_write(vma->vm_file->f_mapping);
> +
> + while (curr < end) {
> + ret = hugetlb_alloc_largest_pte(&hpte, mm, vma, curr, end);
> + if (ret)
> + goto out;
> +
> + entry = huge_ptep_get(hpte.ptep);
> +
> + /*
> + * There is no work to do if the PTE doesn't point to page
> + * tables.
> + */
> + if (!pte_present(entry))
> + goto next_hpte;
> + if (hugetlb_pte_present_leaf(&hpte, entry))
> + goto next_hpte;
> +
> + idx = vma_hugecache_offset(h, vma, curr);
> + hpage = find_get_page(mapping, idx);
> +
> + if (hpage && !HPageMigratable(hpage)) {
> + /*
> + * Don't collapse a mapping to a page that is pending
> + * a migration. Migration swap entries may have placed
> + * in the page table.
> + */
> + ret = -EBUSY;
> + put_page(hpage);
> + goto out;
> + }
> +
> + if (hpage && PageHWPoison(hpage)) {
> + /*
> + * Don't collapse a mapping to a page that is
> + * hwpoisoned.
> + */
> + ret = -EHWPOISON;
> + put_page(hpage);
> + /*
> + * By setting ret to -EHWPOISON, if nothing else
> + * happens, we will tell userspace that we couldn't
> + * fully collapse everything due to poison.
> + *
> + * Skip this page, and continue to collapse the rest
> + * of the mapping.
> + */
> + curr = (curr & huge_page_mask(h)) + huge_page_size(h);
> + continue;
> + }
> +
> + /*
> + * Clear all the PTEs, and drop ref/mapcounts
> + * (on tlb_finish_mmu).
> + */
> + __unmap_hugepage_range(&tlb, vma, curr,
> + curr + hugetlb_pte_size(&hpte),
> + NULL,
> + ZAP_FLAG_DROP_MARKER);
> + /* Free the PTEs. */
> + hugetlb_free_pgd_range(&tlb,
> + curr, curr + hugetlb_pte_size(&hpte),
> + curr, curr + hugetlb_pte_size(&hpte));
> + if (!hpage) {
> + huge_pte_clear(mm, curr, hpte.ptep,
> + hugetlb_pte_size(&hpte));
> + goto next_hpte;
> + }
> +
> + page_dup_file_rmap(hpage, true);
> +
> + subpage = hugetlb_find_subpage(h, hpage, curr);
> + entry = make_huge_pte_with_shift(vma, subpage,
> + writable, hpte.shift);
> + set_huge_pte_at(mm, curr, hpte.ptep, entry);
> +next_hpte:
> + curr += hugetlb_pte_size(&hpte);
> +
> + if (curr < end) {
> + /* Don't hold the VMA lock for too long. */
> + hugetlb_vma_unlock_write(vma);
> + cond_resched();
> + hugetlb_vma_lock_write(vma);

The intention is good here but IIUC this will cause vma lock to be taken
after the i_mmap_rwsem, which can cause circular deadlocks. If to do this
properly we'll need to also release the i_mmap_rwsem.

However, it may make the resched() logic overcomplicated; meanwhile, for 2M
huge pages I think this will be called for each 2M range, which can be too
fine grained, so the "curr < end" check looks a bit too aggressive.

The other thing I noticed is that the long mmu notifier invalidate period
between start -> end will (in a real-life VM context) cause vcpu threads to
spin.

I _think_ it's because is_page_fault_stale() (when handling a vmexit
following a kvm page fault) always reports true during the long
MADV_COLLAPSE procedure when it is called on a large range, so even if we
release both locks here it may not help tremendously with the VM migration
use case because of the long-standing mmu notifier invalidation.

To summarize... I think a simpler starting version of hugetlb MADV_COLLAPSE
can drop this "if" block and let the user app decide the step size of
COLLAPSE?
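
E.g., a userspace caller could do something like this (sketch; addr is a
char * here):

	size_t step = 1UL << 30;	/* collapse 1G at a time, caller's choice */

	for (size_t off = 0; off < len; off += step) {
		size_t chunk = len - off < step ? len - off : step;

		madvise(addr + off, chunk, MADV_COLLAPSE);
	}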

> + }
> + }
> +out:
> + i_mmap_unlock_write(vma->vm_file->f_mapping);
> + hugetlb_vma_unlock_write(vma);
> + tlb_finish_mmu(&tlb);
> + mmu_notifier_invalidate_range_end(&range);
> + return ret;
> +}
> +
> #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */

--
Peter Xu

2023-01-18 00:26:42

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Mon, Jan 16, 2023 at 2:17 AM David Hildenbrand <[email protected]> wrote:
>
> On 12.01.23 22:33, Peter Xu wrote:
> > On Thu, Jan 12, 2023 at 04:17:52PM -0500, James Houghton wrote:
> >> I'll look into it, but doing it this way will use _mapcount, so we
> >> won't be able to use the vmemmap optimization. I think even if we do
> >> use Hugh's approach, refcount is still being kept on the head page, so
> >> there's still an overflow risk there (but maybe I am
> >> misunderstanding).
> >
> > Could you remind me what's the issue if using refcount on the small pages
> > rather than the head (assuming vmemmap still can be disabled)?
>
> The THP-way of doing things is refcounting on the head page. All folios
> use a single refcount on the head.
>
> There has to be a pretty good reason to do it differently.

Peter and I have discussed this a lot offline. There are two main problems here:

1. Refcount overflow

Refcount is always kept on the head page (before and after this
series). IIUC, this means that if THPs could be 1G in size, they too
would be susceptible to the same potential overflow. How easy is the
overflow? [1]

To deal with this, the best solution we've been able to come up with
is to check if refcount is > INT_MAX/2 (similar to try_get_page()),
and if it is, stop the operation (UFFDIO_CONTINUE or a page fault)
from proceeding. In the UFFDIO_CONTINUE case, return ENOMEM. In the
page fault case, return VM_FAULT_SIGBUS (not VM_FAULT_OOM; we don't
want to kill a random process).
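
A minimal sketch of that guard (the helper name is mine, not from the
series); callers would fail UFFDIO_CONTINUE with -ENOMEM and the fault path
with VM_FAULT_SIGBUS when it returns false:

/*
 * Refuse to install another HGM mapping if the head page's refcount is
 * already suspiciously high, mirroring the try_get_page() policy.
 */
static bool hugetlb_hgm_refcount_ok(struct page *hpage)
{
	return page_count(compound_head(hpage)) < INT_MAX / 2;
}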

(So David, I think this answers your question. Refcount should be
handled just like THPs.)

2. page_mapcount() API differences

In this series, page_mapcount() returns the total number of page table
references for the compound page. For example, if you have a
PTE-mapped 2M page (with no other mappings), page_mapcount() for each
4K page will be 512. This is not the same as a THP: page_mapcount()
would return 1 for each page. Because of the difference in
page_mapcount(), we have 4 problems:

i. Smaps uses page_mapcount() >= 2 to determine if hugetlb memory is
"private_hugetlb" or "shared_hugetlb".
ii. Migration with MPOL_MF_MOVE will check page_mapcount() to see if
the hugepage is shared or not. Pages that would otherwise be migrated
now require MPOL_MF_MOVE_ALL to be migrated.
[Really both of the above are checking how many VMAs are mapping our hugepage.]
iii. CoW. This isn't a problem right now because CoW is only possible
with MAP_PRIVATE VMAs and HGM can only be enabled for MAP_SHARED VMAs.
iv. The hwpoison handling code will check if it successfully unmapped
the poisoned page. This isn't a problem right now, as hwpoison will
unmap all the mappings for the hugepage, not just the 4K where the
poison was found.

Doing it this way allows HGM to remain compatible with the hugetlb
vmemmap optimization. None of the above problems strike me as
particularly major, but it's unclear to me how important it is to have
page_mapcount() have a consistent meaning for hugetlb vs non-hugetlb.

The other way page_mapcount() (let's say the "THP-like way") could be
done is like this: increment compound mapcount if we're mapping a
hugetlb page normally (e.g., 1G page with a PUD). If we're mapping at
high-granularity, increment the mapcount for each 4K page that is
getting mapped (e.g., PMD within a 1G page: increment the mapcount for
the 512 pages that are now mapped). This yields the same
page_mapcount() API we had before, but we lose the hugetlb vmemmap
optimization.
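
As a sketch (this is NOT what the posted series does; hpage, subpage, hpte,
and h are the usual context variables, and the rmap signatures are just the
current ones), the accounting would look roughly like:

	unsigned long i, nr = hugetlb_pte_size(hpte) / PAGE_SIZE;

	if (hugetlb_pte_size(hpte) == huge_page_size(h)) {
		/* mapped at the hstate level: one compound mapcount */
		page_add_file_rmap(hpage, vma, true);
	} else {
		/*
		 * High-granularity: account every 4K subpage, like a
		 * PTE-mapped THP. This forfeits the vmemmap optimization.
		 */
		for (i = 0; i < nr; i++)
			page_add_file_rmap(subpage + i, vma, false);
	}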

We could introduce an API like hugetlb_vma_mapcount() that would, for
hugetlb, give us the number of VMAs that map a hugepage, but I don't
think people would like this.

I'm curious what others think (Mike, Matthew?). I'm guessing the
THP-like way is probably what most people would want, though it would
be a real shame to lose the vmemmap optimization.

- James

[1] INT_MAX is 2^31. We increment the refcount once for each 4K
mapping in 1G: that's 512 * 512 (2^18). That means we only need 8192
(2^13) VMAs for the same 1G page to overflow refcount. I think it's
safe to say that if userspace is doing this, they are attempting to
overflow refcount.

2023-01-18 11:10:36

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On 18.01.23 00:11, James Houghton wrote:
> On Mon, Jan 16, 2023 at 2:17 AM David Hildenbrand <[email protected]> wrote:
>>
>> On 12.01.23 22:33, Peter Xu wrote:
>>> On Thu, Jan 12, 2023 at 04:17:52PM -0500, James Houghton wrote:
>>>> I'll look into it, but doing it this way will use _mapcount, so we
>>>> won't be able to use the vmemmap optimization. I think even if we do
>>>> use Hugh's approach, refcount is still being kept on the head page, so
>>>> there's still an overflow risk there (but maybe I am
>>>> misunderstanding).
>>>
>>> Could you remind me what's the issue if using refcount on the small pages
>>> rather than the head (assuming vmemmap still can be disabled)?
>>
>> The THP-way of doing things is refcounting on the head page. All folios
>> use a single refcount on the head.
>>
>> There has to be a pretty good reason to do it differently.
>
> Peter and I have discussed this a lot offline. There are two main problems here:
>
> 1. Refcount overflow
>
> Refcount is always kept on the head page (before and after this
> series). IIUC, this means that if THPs could be 1G in size, they too
> would be susceptible to the same potential overflow. How easy is the
> overflow? [1]

Right. You'd need 8k VMAs. With 2 MiB THP you'd need 4096k VMAs. So ~64
processes with 64k VMAs. Not impossible to achieve if one really wants
to break the system ...

Side note: a long long time ago, we used to have sub-page refcounts for
THP. IIRC, that was even before we had sub-page mapcounts and was used
to make COW decisions.

>
> To deal with this, the best solution we've been able to come up with
> is to check if refcount is > INT_MAX/2 (similar to try_get_page()),
> and if it is, stop the operation (UFFDIO_CONTINUE or a page fault)
> from proceeding. In the UFFDIO_CONTINUE case, return ENOMEM. In the
> page fault cause, return VM_FAULT_SIGBUS (not VM_FAULT_OOM; we don't
> want to kill a random process).

You'd have to also make sure that fork() won't do the same. At least
with uffd-wp, Peter also added page table copying during fork() for
MAP_SHARED mappings, which would have to be handled.

Of course, one can just disallow fork() with any HGM right from the
start and keep it all simpler to not open up a can of worms there.

Is it reasonable to have more than one (or a handful of) VMAs mapping a
huge page via HGM? Restricting it to a single one would make handling
much easier.

If there is ever demand for more HGM mappings, that whole problem (and
complexity) could be dealt with later. ... but I assume it will already
be a requirement for VMs (e.g., under QEMU) that share memory with other
processes (virtiofsd and friends?)


>
> (So David, I think this answers your question. Refcount should be
> handled just like THPs.)
>
> 2. page_mapcount() API differences
>
> In this series, page_mapcount() returns the total number of page table
> references for the compound page. For example, if you have a
> PTE-mapped 2M page (with no other mappings), page_mapcount() for each
> 4K page will be 512. This is not the same as a THP: page_mapcount()
> would return 1 for each page. Because of the difference in
> page_mapcount(), we have 4 problems:

IMHO, it would actually be great to just be able to remove the sub-page
mapcounts for THP and make it all simpler.

Right now, the sub-page mapcount is mostly required for making COW
decisions, but only for accounting purposes IIRC (NR_ANON_THPS,
NR_SHMEM_PMDMAPPED, NR_FILE_PMDMAPPED) and mlock handling IIRC. See
page_remove_rmap().

If we can avoid that complexity right from the start for hugetlb, great, ..

>
> i. Smaps uses page_mapcount() >= 2 to determine if hugetlb memory is
> "private_hugetlb" or "shared_hugetlb".
> ii. Migration with MPOL_MF_MOVE will check page_mapcount() to see if
> the hugepage is shared or not. Pages that would otherwise be migrated
> now require MPOL_MF_MOVE_ALL to be migrated.
> [Really both of the above are checking how many VMAs are mapping our hugepage.]
> iii. CoW. This isn't a problem right now because CoW is only possible
> with MAP_PRIVATE VMAs and HGM can only be enabled for MAP_SHARED VMAs.
> iv. The hwpoison handling code will check if it successfully unmapped
> the poisoned page. This isn't a problem right now, as hwpoison will
> unmap all the mappings for the hugepage, not just the 4K where the
> poison was found.
>
> Doing it this way allows HGM to remain compatible with the hugetlb
> vmemmap optimization. None of the above problems strike me as
> particularly major, but it's unclear to me how important it is to have
> page_mapcount() have a consistent meaning for hugetlb vs non-hugetlb.

See below, maybe we should tackle HGM from a different direction.

>
> The other way page_mapcount() (let's say the "THP-like way") could be
> done is like this: increment compound mapcount if we're mapping a
> hugetlb page normally (e.g., 1G page with a PUD). If we're mapping at
> high-granularity, increment the mapcount for each 4K page that is
> getting mapped (e.g., PMD within a 1G page: increment the mapcount for
> the 512 pages that are now mapped). This yields the same
> page_mapcount() API we had before, but we lose the hugetlb vmemmap
> optimization.
>
> We could introduce an API like hugetlb_vma_mapcount() that would, for
> hugetlb, give us the number of VMAs that map a hugepage, but I don't
> think people would like this.
>
> I'm curious what others think (Mike, Matthew?). I'm guessing the
> THP-like way is probably what most people would want, though it would
> be a real shame to lose the vmemmap optimization.

Heh, not me :) Having a single mapcount is certainly much cleaner. ...
and if we're dealing with refcount overflows already, mapcount overflows
are not an issue.


I wonder if the following crazy idea has already been discussed: treat
the whole mapping as a single large logical mapping. One reference and
one mapping, no matter how the individual parts are mapped into the
assigned page table sub-tree.

Because for hugetlb with MAP_SHARED, we know that the complete assigned
sub-tree of page tables can only map the given hugetlb page, no
fragments of something else. That's very different to THP in private
mappings ...

So as soon as the first piece gets mapped, we increment
refcount+mapcount. Other pieces in the same subtree don't do that.

Once the last piece is unmapped (or simpler: once the complete subtree
of page tables is gone), we decrement refcount+mapcount. Might require
some brain power to do this tracking, but I wouldn't call it impossible
right from the start.

Would such a design violate other design aspects that are important?

--
Thanks,

David / dhildenb

2023-01-18 15:45:34

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Wed, Jan 18, 2023 at 10:43:47AM +0100, David Hildenbrand wrote:
> On 18.01.23 00:11, James Houghton wrote:
> > On Mon, Jan 16, 2023 at 2:17 AM David Hildenbrand <[email protected]> wrote:
> > >
> > > On 12.01.23 22:33, Peter Xu wrote:
> > > > On Thu, Jan 12, 2023 at 04:17:52PM -0500, James Houghton wrote:
> > > > > I'll look into it, but doing it this way will use _mapcount, so we
> > > > > won't be able to use the vmemmap optimization. I think even if we do
> > > > > use Hugh's approach, refcount is still being kept on the head page, so
> > > > > there's still an overflow risk there (but maybe I am
> > > > > misunderstanding).
> > > >
> > > > Could you remind me what's the issue if using refcount on the small pages
> > > > rather than the head (assuming vmemmap still can be disabled)?
> > >
> > > The THP-way of doing things is refcounting on the head page. All folios
> > > use a single refcount on the head.
> > >
> > > There has to be a pretty good reason to do it differently.
> >
> > Peter and I have discussed this a lot offline. There are two main problems here:
> >
> > 1. Refcount overflow
> >
> > Refcount is always kept on the head page (before and after this
> > series). IIUC, this means that if THPs could be 1G in size, they too
> > would be susceptible to the same potential overflow. How easy is the
> > overflow? [1]
>
> Right. You'd need 8k VMAs. With 2 MiB THP you'd need 4096k VMAs. So ~64
> processes with 64k VMAs. Not impossible to achieve if one really wants to
> break the system ...
>
> Side note: a long long time ago, we used to have sub-page refcounts for THP.
> IIRC, that was even before we had sub-page mapcounts and was used to make
> COW decisions.
>
> >
> > To deal with this, the best solution we've been able to come up with
> > is to check if refcount is > INT_MAX/2 (similar to try_get_page()),
> > and if it is, stop the operation (UFFDIO_CONTINUE or a page fault)
> > from proceeding. In the UFFDIO_CONTINUE case, return ENOMEM. In the
> > page fault cause, return VM_FAULT_SIGBUS (not VM_FAULT_OOM; we don't
> > want to kill a random process).
>
> You'd have to also make sure that fork() won't do the same. At least with
> uffd-wp, Peter also added page table copying during fork() for MAP_SHARED
> mappings, which would have to be handled.

If we want such a check to make a real difference, IIUC we may want to
consider having a similar check in:

page_ref_add
page_ref_inc
page_ref_inc_return
page_ref_add_unless

But it's unfortunate that almost all the callers of these functions
(especially the first two) do not have a return value at all. Considering
how unlikely an overflow is so far, maybe this can also be done later (and
I think checking for negative values, as try_get_page() does, would suffice
too).

>
> Of course, one can just disallow fork() with any HGM right from the start
> and keep it all simpler to not open up a can of worms there.
>
> Is it reasonable, to have more than one (or a handful) of VMAs mapping a
> huge page via a HGM? Restricting it to a single one, would make handling
> much easier.
>
> If there is ever demand for more HGM mappings, that whole problem (and
> complexity) could be dealt with later. ... but I assume it will already be a
> requirement for VMs (e.g., under QEMU) that share memory with other
> processes (virtiofsd and friends?)

Yes, any form of multi-proc QEMU will need that for supporting HGM
postcopy.

>
>
> >
> > (So David, I think this answers your question. Refcount should be
> > handled just like THPs.)
> >
> > 2. page_mapcount() API differences
> >
> > In this series, page_mapcount() returns the total number of page table
> > references for the compound page. For example, if you have a
> > PTE-mapped 2M page (with no other mappings), page_mapcount() for each
> > 4K page will be 512. This is not the same as a THP: page_mapcount()
> > would return 1 for each page. Because of the difference in
> > page_mapcount(), we have 4 problems:
>
> IMHO, it would actually be great to just be able to remove the sub-page
> mapcounts for THP and make it all simpler.
>
> Right now, the sub-page mapcount is mostly required for making COW
> decisions, but only for accounting purposes IIRC (NR_ANON_THPS,
> NR_SHMEM_PMDMAPPED, NR_FILE_PMDMAPPED) and mlock handling IIRC. See
> page_remove_rmap().
>
> If we can avoid that complexity right from the start for hugetlb, great, ..
>
> >
> > i. Smaps uses page_mapcount() >= 2 to determine if hugetlb memory is
> > "private_hugetlb" or "shared_hugetlb".
> > ii. Migration with MPOL_MF_MOVE will check page_mapcount() to see if
> > the hugepage is shared or not. Pages that would otherwise be migrated
> > now require MPOL_MF_MOVE_ALL to be migrated.
> > [Really both of the above are checking how many VMAs are mapping our hugepage.]
> > iii. CoW. This isn't a problem right now because CoW is only possible
> > with MAP_PRIVATE VMAs and HGM can only be enabled for MAP_SHARED VMAs.
> > iv. The hwpoison handling code will check if it successfully unmapped
> > the poisoned page. This isn't a problem right now, as hwpoison will
> > unmap all the mappings for the hugepage, not just the 4K where the
> > poison was found.
> >
> > Doing it this way allows HGM to remain compatible with the hugetlb
> > vmemmap optimization. None of the above problems strike me as
> > particularly major, but it's unclear to me how important it is to have
> > page_mapcount() have a consistent meaning for hugetlb vs non-hugetlb.
>
> See below, maybe we should tackle HGM from a different direction.
>
> >
> > The other way page_mapcount() (let's say the "THP-like way") could be
> > done is like this: increment compound mapcount if we're mapping a
> > hugetlb page normally (e.g., 1G page with a PUD). If we're mapping at
> > high-granularity, increment the mapcount for each 4K page that is
> > getting mapped (e.g., PMD within a 1G page: increment the mapcount for
> > the 512 pages that are now mapped). This yields the same
> > page_mapcount() API we had before, but we lose the hugetlb vmemmap
> > optimization.
> >
> > We could introduce an API like hugetlb_vma_mapcount() that would, for
> > hugetlb, give us the number of VMAs that map a hugepage, but I don't
> > think people would like this.
> >
> > I'm curious what others think (Mike, Matthew?). I'm guessing the
> > THP-like way is probably what most people would want, though it would
> > be a real shame to lose the vmemmap optimization.
>
> Heh, not me :) Having a single mapcount is certainly much cleaner. ... and
> if we're dealing with refcount overflows already, mapcount overflows are not
> an issue.
>
>
> I wonder if the following crazy idea has already been discussed: treat the
> whole mapping as a single large logical mapping. One reference and one
> mapping, no matter how the individual parts are mapped into the assigned
> page table sub-tree.
>
> Because for hugetlb with MAP_SHARED, we know that the complete assigned
> sub-tree of page tables can only map the given hugetlb page, no fragments of
> something else. That's very different to THP in private mappings ...
>
> So as soon as the first piece gets mapped, we increment refcount+mapcount.
> Other pieces in the same subtree don't do that.
>
> Once the last piece is unmapped (or simpler: once the complete subtree of
> page tables is gone), we decrement refcount+mapcount. Might require some
> brain power to do this tracking, but I wouldn't call it impossible right
> from the start.
>
> Would such a design violate other design aspects that are important?

The question is how to maintain the above information.

It needs to be per-map (so one page mapped multiple times can be accounted
differently) and per-page (so one mapping/vma can contain multiple pages).
So far I think that's exactly the pgtable. If we can squeeze the
information into the pgtable it'll work out, but it's definitely not
trivial. Or we can maintain separate allocations for such information, but
that can be extra overhead too.

So far I'd still consider going with reusing the THP mapcounts, which will
mostly be what James mentioned above. The only difference is I'm not sure
whether we should allow mapping e.g. 2M ranges for 1G pages. THP mapcount
doesn't have an intermediate layer to maintain mapcount information at
sizes like 2M, so to me it's easier if we start with only mapping either
the hpage size or PAGE_SIZE, with no intermediate size allowed.

Allowing intermediate-size mappings can at least be error prone, to me.
One example: if some pgtable walker finds a 2M leaf, it may easily fetch
the PFN out of it assuming it's a compound page that satisfies
PageHead(pfn)==true, but that starts to break here, because the 2M PFN is
only a small-page pfn inside the 1G huge page in this case.
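
Roughly the assumption that would break (sketch):

	/* a walker that finds a present 2M leaf typically assumes this: */
	page = pmd_page(*pmd);
	VM_BUG_ON_PAGE(!PageHead(page), page);	/* fires inside a 1G HGM page */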

To me, intermediate-sized mappings are good to have but not required to
resolve the HGM problems, at least so far. That said, I'm fine with looking
at what it'll look like if James would like to keep pursuing that
direction.

Thanks,

--
Peter Xu

2023-01-18 16:50:57

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

> > > To deal with this, the best solution we've been able to come up with
> > > is to check if refcount is > INT_MAX/2 (similar to try_get_page()),
> > > and if it is, stop the operation (UFFDIO_CONTINUE or a page fault)
> > > from proceeding. In the UFFDIO_CONTINUE case, return ENOMEM. In the
> > > page fault cause, return VM_FAULT_SIGBUS (not VM_FAULT_OOM; we don't
> > > want to kill a random process).
> >
> > You'd have to also make sure that fork() won't do the same. At least with
> > uffd-wp, Peter also added page table copying during fork() for MAP_SHARED
> > mappings, which would have to be handled.

Indeed, thank you! copy_hugetlb_page_range() (and therefore fork() in
some cases) would need this check too.

>
> If we want such a check to make a real difference, IIUC we may want to
> consider having similar check in:
>
> page_ref_add
> page_ref_inc
> page_ref_inc_return
> page_ref_add_unless
>
> But it's unfortunate that mostly all the callers to these functions
> (especially the first two) do not have a retval yet at all. Considering
> the low possibility so far on having it overflow, maybe it can also be done
> for later (and I think checking negative as try_get_page would suffice too).

I think as long as we annotate the few cases that userspace can
exploit to overflow refcount, we should be ok. I think this was the
same idea with try_get_page(): it is supposed to be used in places
that userspace could reasonably exploit to overflow refcount.

>
> >
> > Of course, one can just disallow fork() with any HGM right from the start
> > and keep it all simpler to not open up a can of worms there.
> >
> > Is it reasonable, to have more than one (or a handful) of VMAs mapping a
> > huge page via a HGM? Restricting it to a single one, would make handling
> > much easier.
> >
> > If there is ever demand for more HGM mappings, that whole problem (and
> > complexity) could be dealt with later. ... but I assume it will already be a
> > requirement for VMs (e.g., under QEMU) that share memory with other
> > processes (virtiofsd and friends?)
>
> Yes, any form of multi-proc QEMU will need that for supporting HGM
> postcopy.

+1. Disallowing fork() entirely is quite a restrictive limitation.

[snip]
> > > I'm curious what others think (Mike, Matthew?). I'm guessing the
> > > THP-like way is probably what most people would want, though it would
> > > be a real shame to lose the vmemmap optimization.
> >
> > Heh, not me :) Having a single mapcount is certainly much cleaner. ... and
> > if we're dealing with refcount overflows already, mapcount overflows are not
> > an issue.
> >
> >
> > I wonder if the following crazy idea has already been discussed: treat the
> > whole mapping as a single large logical mapping. One reference and one
> > mapping, no matter how the individual parts are mapped into the assigned
> > page table sub-tree.
> >
> > Because for hugetlb with MAP_SHARED, we know that the complete assigned
> > sub-tree of page tables can only map the given hugetlb page, no fragments of
> > something else. That's very different to THP in private mappings ...
> >
> > So as soon as the first piece gets mapped, we increment refcount+mapcount.
> > Other pieces in the same subtree don't do that.
> >
> > Once the last piece is unmapped (or simpler: once the complete subtree of
> > page tables is gone), we decrement refcount+mapcount. Might require some
> > brain power to do this tracking, but I wouldn't call it impossible right
> > from the start.
> >
> > Would such a design violate other design aspects that are important?

This is actually how mapcount was treated in HGM RFC v1 (though not
refcount); it is doable for both [2].

One caveat here: if a page is unmapped in small pieces, it is
difficult to know if the page is legitimately completely unmapped (we
would have to check all the PTEs in the page table). In RFC v1, I
sidestepped this caveat by saying that "page_mapcount() is incremented
if the hstate-level PTE is present". A single unmap on the whole
hugepage will clear the hstate-level PTE, thus decrementing the
mapcount.
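
For illustration, a rough sketch of that RFC v1 rule (the helper names and the
"newly present" flag are hypothetical, not code from the series):

/*
 * Mapcount follows the hstate-level entry only: account once when the
 * hstate-level PTE goes from none to present (no matter how much of
 * the page is HGM-mapped underneath), and drop it once when that
 * entry is cleared again.
 */
static void hgm_account_map(struct page *hpage, struct vm_area_struct *vma,
                            bool hstate_pte_newly_present)
{
        if (hstate_pte_newly_present)
                page_add_file_rmap(hpage, vma, true);
}

static void hgm_account_unmap(struct page *hpage, struct vm_area_struct *vma)
{
        /* Called when the hstate-level PTE is cleared, not per 4K piece. */
        page_remove_rmap(hpage, vma, true);
}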

On a related note, there still exists an (albeit minor) API difference
vs. THPs: a piece of a page that is legitimately unmapped can still
have a positive page_mapcount().

Given that this approach allows us to retain the hugetlb vmemmap
optimization (and it wouldn't require a horrible amount of
complexity), I prefer this approach over the THP-like approach.

>
> The question is how to maintaining above information.
>
> It needs to be per-map (so one page mapped multiple times can be accounted
> differently), and per-page (so one mapping/vma can contain multiple pages).
> So far I think that's exactly the pgtable. If we can squeeze information
> into the pgtable it'll work out, but definitely not trivial. Or we can
> maintain seperate allocates for such information, but that can be extra
> overheads too.

I don't think we necessarily need to check the page table if we allow
for the limitations stated above.

>
> So far I'd still consider going with reusing thp mapcounts, which will
> mostly be what James mentioned above. The only difference is I'm not sure
> whether we should allow mapping e.g. 2M ranges for 1G pages. THP mapcount
> doesn't have intermediate layer to maintain mapcount information like 2M,
> so to me it's easier we start with only mapping either the hpage size or
> PAGE_SIZE, not any intermediate size allowed.
>
> Having intermediate size mapping allowed can at least be error prone to
> me. One example is if some pgtable walker found a 2M page, it may easily
> fetch the PFN out of it, assuming it's a compound page and it should
> satisfy PageHead(pfn)==true but it'll start to break here, because the 2M
> PFN will only be a small page pfn for the 1G huge page in this case.

I assume you mean PageHuge(page). There are cases where assumptions
are made about hugetlb pages and VMAs; they need to be corrected. It
sounds like you're really talking about errors wrt missing a
compound_head()/page_folio(), but it can be even more general than
that. For example, the arm64 KVM MMU assumes that hugetlb HGM doesn't
exist, and so it needs to be fixed before HGM can be enabled for
arm64.

If a page is HGM-mapped, I think it's more error-prone to somehow make
PageHuge() come back with false. So the only solution I see here is
careful auditing and testing.

I don't really see how allowing/disallowing intermediate sizes changes this
problem either.

>
> To me, intermediate sized mappings are good to have but not required to
> resolve HGM problems, at least so far. Said that, I'm fine with looking at
> what it'll look like if James would like to keep persuing that direction.

I've mentioned this to Peter already, but I don't think discarding
intermediate mapping levels really reduces complexity all that much.

[2]: Using the names of functions as of v1 (Peter has since suggested
a name change): hugetlb_alloc_{pte,pmd} will tell us if we really
allocated the PTE. That info can be passed up through
hugetlb_walk_step()=>hugetlb_hgm_walk() (where we only care if the
*hstate-level* PTE got allocated) and hugetlb_full_walk*() for the
caller to determine if refcount/mapcount should be incremented.

Thank you both for your thoughts so far. :)

- James

2023-01-18 18:40:29

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On 18.01.23 16:35, Peter Xu wrote:
> On Wed, Jan 18, 2023 at 10:43:47AM +0100, David Hildenbrand wrote:
>> On 18.01.23 00:11, James Houghton wrote:
>>> On Mon, Jan 16, 2023 at 2:17 AM David Hildenbrand <[email protected]> wrote:
>>>>
>>>> On 12.01.23 22:33, Peter Xu wrote:
>>>>> On Thu, Jan 12, 2023 at 04:17:52PM -0500, James Houghton wrote:
>>>>>> I'll look into it, but doing it this way will use _mapcount, so we
>>>>>> won't be able to use the vmemmap optimization. I think even if we do
>>>>>> use Hugh's approach, refcount is still being kept on the head page, so
>>>>>> there's still an overflow risk there (but maybe I am
>>>>>> misunderstanding).
>>>>>
>>>>> Could you remind me what's the issue if using refcount on the small pages
>>>>> rather than the head (assuming vmemmap still can be disabled)?
>>>>
>>>> The THP-way of doing things is refcounting on the head page. All folios
>>>> use a single refcount on the head.
>>>>
>>>> There has to be a pretty good reason to do it differently.
>>>
>>> Peter and I have discussed this a lot offline. There are two main problems here:
>>>
>>> 1. Refcount overflow
>>>
>>> Refcount is always kept on the head page (before and after this
>>> series). IIUC, this means that if THPs could be 1G in size, they too
>>> would be susceptible to the same potential overflow. How easy is the
>>> overflow? [1]
>>
>> Right. You'd need 8k VMAs. With 2 MiB THP you'd need 4096k VMAs. So ~64
>> processes with 64k VMAs. Not impossible to achieve if one really wants to
>> break the system ...
>>
>> Side note: a long long time ago, we used to have sub-page refcounts for THP.
>> IIRC, that was even before we had sub-page mapcounts and was used to make
>> COW decisions.
>>
>>>
>>> To deal with this, the best solution we've been able to come up with
>>> is to check if refcount is > INT_MAX/2 (similar to try_get_page()),
>>> and if it is, stop the operation (UFFDIO_CONTINUE or a page fault)
>>> from proceeding. In the UFFDIO_CONTINUE case, return ENOMEM. In the
>>> page fault cause, return VM_FAULT_SIGBUS (not VM_FAULT_OOM; we don't
>>> want to kill a random process).
>>
>> You'd have to also make sure that fork() won't do the same. At least with
>> uffd-wp, Peter also added page table copying during fork() for MAP_SHARED
>> mappings, which would have to be handled.
>
> If we want such a check to make a real difference, IIUC we may want to
> consider having similar check in:
>
> page_ref_add
> page_ref_inc
> page_ref_inc_return
> page_ref_add_unless
>
> But it's unfortunate that mostly all the callers to these functions
> (especially the first two) do not have a retval yet at all. Considering
> the low possibility so far on having it overflow, maybe it can also be done
> for later (and I think checking negative as try_get_page would suffice too).
>
>>
>> Of course, one can just disallow fork() with any HGM right from the start
>> and keep it all simpler to not open up a can of worms there.
>>
>> Is it reasonable, to have more than one (or a handful) of VMAs mapping a
>> huge page via a HGM? Restricting it to a single one, would make handling
>> much easier.
>>
>> If there is ever demand for more HGM mappings, that whole problem (and
>> complexity) could be dealt with later. ... but I assume it will already be a
>> requirement for VMs (e.g., under QEMU) that share memory with other
>> processes (virtiofsd and friends?)
>
> Yes, any form of multi-proc QEMU will need that for supporting HGM
> postcopy.
>
>>
>>
>>>
>>> (So David, I think this answers your question. Refcount should be
>>> handled just like THPs.)
>>>
>>> 2. page_mapcount() API differences
>>>
>>> In this series, page_mapcount() returns the total number of page table
>>> references for the compound page. For example, if you have a
>>> PTE-mapped 2M page (with no other mappings), page_mapcount() for each
>>> 4K page will be 512. This is not the same as a THP: page_mapcount()
>>> would return 1 for each page. Because of the difference in
>>> page_mapcount(), we have 4 problems:
>>
>> IMHO, it would actually be great to just be able to remove the sub-page
>> mapcounts for THP and make it all simpler.
>>
>> Right now, the sub-page mapcount is mostly required for making COW
>> decisions, but only for accounting purposes IIRC (NR_ANON_THPS,
>> NR_SHMEM_PMDMAPPED, NR_FILE_PMDMAPPED) and mlock handling IIRC. See
>> page_remove_rmap().
>>
>> If we can avoid that complexity right from the start for hugetlb, great, ..
>>
>>>
>>> i. Smaps uses page_mapcount() >= 2 to determine if hugetlb memory is
>>> "private_hugetlb" or "shared_hugetlb".
>>> ii. Migration with MPOL_MF_MOVE will check page_mapcount() to see if
>>> the hugepage is shared or not. Pages that would otherwise be migrated
>>> now require MPOL_MF_MOVE_ALL to be migrated.
>>> [Really both of the above are checking how many VMAs are mapping our hugepage.]
>>> iii. CoW. This isn't a problem right now because CoW is only possible
>>> with MAP_PRIVATE VMAs and HGM can only be enabled for MAP_SHARED VMAs.
>>> iv. The hwpoison handling code will check if it successfully unmapped
>>> the poisoned page. This isn't a problem right now, as hwpoison will
>>> unmap all the mappings for the hugepage, not just the 4K where the
>>> poison was found.
>>>
>>> Doing it this way allows HGM to remain compatible with the hugetlb
>>> vmemmap optimization. None of the above problems strike me as
>>> particularly major, but it's unclear to me how important it is to have
>>> page_mapcount() have a consistent meaning for hugetlb vs non-hugetlb.
>>
>> See below, maybe we should tackle HGM from a different direction.
>>
>>>
>>> The other way page_mapcount() (let's say the "THP-like way") could be
>>> done is like this: increment compound mapcount if we're mapping a
>>> hugetlb page normally (e.g., 1G page with a PUD). If we're mapping at
>>> high-granularity, increment the mapcount for each 4K page that is
>>> getting mapped (e.g., PMD within a 1G page: increment the mapcount for
>>> the 512 pages that are now mapped). This yields the same
>>> page_mapcount() API we had before, but we lose the hugetlb vmemmap
>>> optimization.
>>>
>>> We could introduce an API like hugetlb_vma_mapcount() that would, for
>>> hugetlb, give us the number of VMAs that map a hugepage, but I don't
>>> think people would like this.
>>>
>>> I'm curious what others think (Mike, Matthew?). I'm guessing the
>>> THP-like way is probably what most people would want, though it would
>>> be a real shame to lose the vmemmap optimization.
>>
>> Heh, not me :) Having a single mapcount is certainly much cleaner. ... and
>> if we're dealing with refcount overflows already, mapcount overflows are not
>> an issue.
>>
>>
>> I wonder if the following crazy idea has already been discussed: treat the
>> whole mapping as a single large logical mapping. One reference and one
>> mapping, no matter how the individual parts are mapped into the assigned
>> page table sub-tree.
>>
>> Because for hugetlb with MAP_SHARED, we know that the complete assigned
>> sub-tree of page tables can only map the given hugetlb page, no fragments of
>> something else. That's very different to THP in private mappings ...
>>
>> So as soon as the first piece gets mapped, we increment refcount+mapcount.
>> Other pieces in the same subtree don't do that.
>>
>> Once the last piece is unmapped (or simpler: once the complete subtree of
>> page tables is gone), we decrement refcount+mapcount. Might require some
>> brain power to do this tracking, but I wouldn't call it impossible right
>> from the start.
>>
>> Would such a design violate other design aspects that are important?
>
> The question is how to maintaining above information.

Right.

>
> It needs to be per-map (so one page mapped multiple times can be accounted
> differently), and per-page (so one mapping/vma can contain multiple pages).
> So far I think that's exactly the pgtable. If we can squeeze information
> into the pgtable it'll work out, but definitely not trivial. Or we can
> maintain seperate allocates for such information, but that can be extra
> overheads too.

If there is no sub-pgtable level, there is certainly no HGM. If there is
a sub-pgtable, we can most probably store that information in that pgtable's
memmap. Maybe simply a pointer to the hugetlb page. As long as the
pointer is there, we increment the mapcount/refcount.


Either directly, or via some additional metadata. The metadata should be
small and most probably "nothing relevant in size" compared to the actual
1 GiB page or the 2 MiB+ of page tables needed to cover 1 GiB.

We could even teach most pgtable walkers to just assume that "logically"
there is simply a hugetlb page mapped, without traversing the actual
sub-pgtables. IIUC, only pgtable walkers that actually want to access
page content (page faults, pinning) or change PTEs (mprotect, uffd)
would really care. Maybe stuff like smaps could just say "well, there is
a hugetlb page mapped" and continue. Just a thought.




>
> So far I'd still consider going with reusing thp mapcounts, which will
> mostly be what James mentioned above. The only difference is I'm not sure
> whether we should allow mapping e.g. 2M ranges for 1G pages. THP mapcount
> doesn't have intermediate layer to maintain mapcount information like 2M,
> so to me it's easier we start with only mapping either the hpage size or
> PAGE_SIZE, not any intermediate size allowed.
>
> Having intermediate size mapping allowed can at least be error prone to
> me. One example is if some pgtable walker found a 2M page, it may easily
> fetch the PFN out of it, assuming it's a compound page and it should
> satisfy PageHead(pfn)==true but it'll start to break here, because the 2M
> PFN will only be a small page pfn for the 1G huge page in this case.
>
> To me, intermediate sized mappings are good to have but not required to
> resolve HGM problems, at least so far. Said that, I'm fine with looking at
> what it'll look like if James would like to keep persuing that direction.

Yeah, was just an idea from my side to avoid most of the refcount and
mapcount issues -- in theory :)

Let me think about that all a bit more ...

--
Thanks,

David / dhildenb

2023-01-18 18:47:41

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

>>> Once the last piece is unmapped (or simpler: once the complete subtree of
>>> page tables is gone), we decrement refcount+mapcount. Might require some
>>> brain power to do this tracking, but I wouldn't call it impossible right
>>> from the start.
>>>
>>> Would such a design violate other design aspects that are important?
>
> This is actually how mapcount was treated in HGM RFC v1 (though not
> refcount); it is doable for both [2].
>
> One caveat here: if a page is unmapped in small pieces, it is
> difficult to know if the page is legitimately completely unmapped (we
> would have to check all the PTEs in the page table). In RFC v1, I
> sidestepped this caveat by saying that "page_mapcount() is incremented
> if the hstate-level PTE is present". A single unmap on the whole
> hugepage will clear the hstate-level PTE, thus decrementing the
> mapcount.
>
> On a related note, there still exists an (albeit minor) API difference
> vs. THPs: a piece of a page that is legitimately unmapped can still
> have a positive page_mapcount().
>
> Given that this approach allows us to retain the hugetlb vmemmap
> optimization (and it wouldn't require a horrible amount of
> complexity), I prefer this approach over the THP-like approach.

If we can store (directly/indirectly) metadata in the highest pgtable
that HGM-maps a hugetlb page, I guess what would be reasonable is:

* hugetlb page pointer
* mapped size

Whenever mapping/unmapping sub-parts, we'd have to update that information.

Once "mapped size" dropped to 0, we know that the hugetlb page was
completely unmapped and we can drop the refcount+mapcount, clear
metadata (including hugetlb page pointer) [+ remove the page tables?].

Similarly, once "mapped size" corresponds to the hugetlb size, we can
immediately spot that everything is mapped.
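
A rough sketch of the per-subtree metadata this would need (all names
hypothetical):

/* Stored alongside the highest page table that HGM-maps the hugepage. */
struct hgm_subtree_meta {
        struct page     *hugetlb_page;  /* which hugetlb page this subtree maps */
        unsigned long   mapped_bytes;   /* how much of it is currently mapped */
};

/*
 * 0 -> nonzero: first piece mapped, take refcount + mapcount once.
 * nonzero -> 0: last piece unmapped, drop them and clear the metadata
 *               (and possibly free the sub-page tables).
 * mapped_bytes == huge_page_size(h): the whole hugepage is mapped.
 */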

Again, just a high-level idea.

--
Thanks,

David / dhildenb

2023-01-18 19:39:14

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On 01/18/23 08:39, James Houghton wrote:
> > > > To deal with this, the best solution we've been able to come up with
> > > > is to check if refcount is > INT_MAX/2 (similar to try_get_page()),
> > > > and if it is, stop the operation (UFFDIO_CONTINUE or a page fault)
> > > > from proceeding. In the UFFDIO_CONTINUE case, return ENOMEM. In the
> > > > page fault cause, return VM_FAULT_SIGBUS (not VM_FAULT_OOM; we don't
> > > > want to kill a random process).
> > >
> > > You'd have to also make sure that fork() won't do the same. At least with
> > > uffd-wp, Peter also added page table copying during fork() for MAP_SHARED
> > > mappings, which would have to be handled.
>
> Indeed, thank you! copy_hugetlb_page_range() (and therefore fork() in
> some cases) would need this check too.
>
> >
> > If we want such a check to make a real difference, IIUC we may want to
> > consider having similar check in:
> >
> > page_ref_add
> > page_ref_inc
> > page_ref_inc_return
> > page_ref_add_unless
> >
> > But it's unfortunate that mostly all the callers to these functions
> > (especially the first two) do not have a retval yet at all. Considering
> > the low possibility so far on having it overflow, maybe it can also be done
> > for later (and I think checking negative as try_get_page would suffice too).
>
> I think as long as we annotate the few cases that userspace can
> exploit to overflow refcount, we should be ok. I think this was the
> same idea with try_get_page(): it is supposed to be used in places
> that userspace could reasonably exploit to overflow refcount.
>
> >
> > >
> > > Of course, one can just disallow fork() with any HGM right from the start
> > > and keep it all simpler to not open up a can of worms there.
> > >
> > > Is it reasonable, to have more than one (or a handful) of VMAs mapping a
> > > huge page via a HGM? Restricting it to a single one, would make handling
> > > much easier.
> > >
> > > If there is ever demand for more HGM mappings, that whole problem (and
> > > complexity) could be dealt with later. ... but I assume it will already be a
> > > requirement for VMs (e.g., under QEMU) that share memory with other
> > > processes (virtiofsd and friends?)
> >
> > Yes, any form of multi-proc QEMU will need that for supporting HGM
> > postcopy.
>
> +1. Disallowing fork() entirely is quite a restrictive limitation.
>
> [snip]
> > > > I'm curious what others think (Mike, Matthew?). I'm guessing the
> > > > THP-like way is probably what most people would want, though it would
> > > > be a real shame to lose the vmemmap optimization.
> > >
> > > Heh, not me :) Having a single mapcount is certainly much cleaner. ... and
> > > if we're dealing with refcount overflows already, mapcount overflows are not
> > > an issue.
> > >
> > >
> > > I wonder if the following crazy idea has already been discussed: treat the
> > > whole mapping as a single large logical mapping. One reference and one
> > > mapping, no matter how the individual parts are mapped into the assigned
> > > page table sub-tree.
> > >
> > > Because for hugetlb with MAP_SHARED, we know that the complete assigned
> > > sub-tree of page tables can only map the given hugetlb page, no fragments of
> > > something else. That's very different to THP in private mappings ...
> > >
> > > So as soon as the first piece gets mapped, we increment refcount+mapcount.
> > > Other pieces in the same subtree don't do that.
> > >
> > > Once the last piece is unmapped (or simpler: once the complete subtree of
> > > page tables is gone), we decrement refcount+mapcount. Might require some
> > > brain power to do this tracking, but I wouldn't call it impossible right
> > > from the start.
> > >
> > > Would such a design violate other design aspects that are important?
>
> This is actually how mapcount was treated in HGM RFC v1 (though not
> refcount); it is doable for both [2].

My apologies for being late to the party :)

When Peter first brought up the issue with ref/map_count overflows I was
thinking that we should use a scheme like David describes above. As
James points out, this was the approach taken in the first RFC.

> One caveat here: if a page is unmapped in small pieces, it is
> difficult to know if the page is legitimately completely unmapped (we
> would have to check all the PTEs in the page table).

Are we allowing unmapping of small (non-huge page sized) areas with HGM?
We must be if you are concerned with it. What API would cause this?
I just do not remember this discussion.

> would have to check all the PTEs in the page table). In RFC v1, I
> sidestepped this caveat by saying that "page_mapcount() is incremented
> if the hstate-level PTE is present". A single unmap on the whole
> hugepage will clear the hstate-level PTE, thus decrementing the
> mapcount.
>
> On a related note, there still exists an (albeit minor) API difference
> vs. THPs: a piece of a page that is legitimately unmapped can still
> have a positive page_mapcount().
>
> Given that this approach allows us to retain the hugetlb vmemmap
> optimization (and it wouldn't require a horrible amount of
> complexity), I prefer this approach over the THP-like approach.

Me too.

> >
> > The question is how to maintaining above information.
> >
> > It needs to be per-map (so one page mapped multiple times can be accounted
> > differently), and per-page (so one mapping/vma can contain multiple pages).
> > So far I think that's exactly the pgtable. If we can squeeze information
> > into the pgtable it'll work out, but definitely not trivial. Or we can
> > maintain seperate allocates for such information, but that can be extra
> > overheads too.
>
> I don't think we necessarily need to check the page table if we allow
> for the limitations stated above.
>

When I was thinking about this I was a bit concerned about having enough
information to know exactly when to inc or dec the counts. I was actually
worried about knowing when to do the increment. I don't recall how it was
done in the first RFC, but from a high level it would need to be done
when the first hstate level PTE is allocated/added to the page table.
Right? My concern was with all the places where we could 'error out'
after allocating the PTE, but before initializing it. I was just thinking
that we might need to scan the page table or keep metadata for better
or easier accounting.

I think Peter mentioned it elsewhere, we should come up with a workable
scheme for HGM ref/map counting. This can be done somewhat independently.
--
Mike Kravetz

2023-01-19 17:12:04

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

> > > > I wonder if the following crazy idea has already been discussed: treat the
> > > > whole mapping as a single large logical mapping. One reference and one
> > > > mapping, no matter how the individual parts are mapped into the assigned
> > > > page table sub-tree.
> > > >
> > > > Because for hugetlb with MAP_SHARED, we know that the complete assigned
> > > > sub-tree of page tables can only map the given hugetlb page, no fragments of
> > > > something else. That's very different to THP in private mappings ...
> > > >
> > > > So as soon as the first piece gets mapped, we increment refcount+mapcount.
> > > > Other pieces in the same subtree don't do that.
> > > >
> > > > Once the last piece is unmapped (or simpler: once the complete subtree of
> > > > page tables is gone), we decrement refcount+mapcount. Might require some
> > > > brain power to do this tracking, but I wouldn't call it impossible right
> > > > from the start.
> > > >
> > > > Would such a design violate other design aspects that are important?
> >
> > This is actually how mapcount was treated in HGM RFC v1 (though not
> > refcount); it is doable for both [2].
>
> My apologies for being late to the party :)
>
> When Peter first brought up the issue with ref/map_count overflows I was
> thinking that we should use a scheme like David describes above. As
> James points out, this was the approach taken in the first RFC.
>
> > One caveat here: if a page is unmapped in small pieces, it is
> > difficult to know if the page is legitimately completely unmapped (we
> > would have to check all the PTEs in the page table).
>
> Are we allowing unmapping of small (non-huge page sized) areas with HGM?
> We must be if you are concerned with it. What API would cause this?
> I just do not remember this discussion.

There was some discussion about allowing MADV_DONTNEED on
less-than-hugepage pieces [3] (it actually motivated the switch from
UFFD_FEATURE_MINOR_HUGETLBFS_HGM to MADV_SPLIT). It isn't implemented
in this series, but it could be implemented in the future.

In the more immediate future, we want hwpoison to unmap at 4K, so
MADV_HWPOISON would be one mechanism that userspace could use to do
this.

>
> > would have to check all the PTEs in the page table). In RFC v1, I
> > sidestepped this caveat by saying that "page_mapcount() is incremented
> > if the hstate-level PTE is present". A single unmap on the whole
> > hugepage will clear the hstate-level PTE, thus decrementing the
> > mapcount.
> >
> > On a related note, there still exists an (albeit minor) API difference
> > vs. THPs: a piece of a page that is legitimately unmapped can still
> > have a positive page_mapcount().
> >
> > Given that this approach allows us to retain the hugetlb vmemmap
> > optimization (and it wouldn't require a horrible amount of
> > complexity), I prefer this approach over the THP-like approach.
>
> Me too.
>
> > >
> > > The question is how to maintaining above information.
> > >
> > > It needs to be per-map (so one page mapped multiple times can be accounted
> > > differently), and per-page (so one mapping/vma can contain multiple pages).
> > > So far I think that's exactly the pgtable. If we can squeeze information
> > > into the pgtable it'll work out, but definitely not trivial. Or we can
> > > maintain seperate allocates for such information, but that can be extra
> > > overheads too.
> >
> > I don't think we necessarily need to check the page table if we allow
> > for the limitations stated above.
> >
>
> When I was thinking about this I was a bit concerned about having enough
> information to know exactly when to inc or dec counts. I was actually
> worried about knowing to do the increment. I don't recall how it was
> done in the first RFC, but from a high level it would need to be done
> when the first hstate level PTE is allocated/added to the page table.
> Right? My concern was with all the places where we could 'error out'
> after allocating the PTE, but before initializing it. I was just thinking
> that we might need to scan the page table or keep metadata for better
> or easier accounting.

The only two places where we can *create* a high-granularity page
table are: __mcopy_atomic_hugetlb (UFFDIO_CONTINUE) and
copy_hugetlb_page_range. RFC v1 did not properly deal with the cases
where we error out. To correctly handle these cases, we basically have
to do the pagecache lookup before touching the page table.

1. For __mcopy_atomic_hugetlb, we can look up the page before doing the
PT walk/alloc. If the PT walk tells us to inc the page ref/mapcount, we do
so immediately. We can easily pass the page into
hugetlb_mcopy_atomic_pte() (via 'pagep').

2. For copy_hugetlb_page_range() for VM_MAYSHARE, we can also do the
lookup before we do the page table walk. I'm not sure how to support
non-shared HGM mappings with this scheme (in this series, we also
don't support non-shared; we return -EINVAL).
NB: The only case where high-granularity mappings for !VM_MAYSHARE
VMAs would come up is as a result of hwpoison.

So we can avoid keeping additional metadata for what this series is
trying to accomplish, but if the above isn't acceptable, then I/we can
try to come up with a scheme that would be acceptable.
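
As a rough sketch of the ordering in (1) above (a fragment; the walk helper's
exact name/signature and the "created" flag are approximations, and
hugetlb_take_hgm_ref() is hypothetical):

        /* 1. Look the page up in the page cache before touching page tables. */
        page = find_lock_page(mapping, idx);
        if (!page)
                return -EFAULT;

        /* 2. Walk (and possibly allocate) the high-granularity page table. */
        ret = hugetlb_full_walk_alloc(&hpte, dst_vma, dst_addr, len);
        if (ret) {
                unlock_page(page);
                put_page(page);
                return ret;
        }

        /* 3. If the walk created the hstate-level entry, account the extra
         *    ref/mapcount right away, so a later failure has nothing
         *    half-accounted to undo. */
        if (hstate_pte_was_created)
                hugetlb_take_hgm_ref(page, dst_vma);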

There is also the possibility that the scheme implemented in this
version of the series is acceptable (i.e., the page_mapcount() API
difference, which results in slightly modified page migration behavior
and smaps output, is ok... assuming we have the refcount overflow
check).

>
> I think Peter mentioned it elsewhere, we should come up with a workable
> scheme for HGM ref/map counting. This can be done somewhat independently.

FWIW, what makes the most sense to me right now is to implement the
THP-like scheme and mark HGM as mutually exclusive with the vmemmap
optimization. We can later come up with a scheme that lets us retain
compatibility. (Is that what you mean by "this can be done somewhat
independently", Mike?)

[3]: https://lore.kernel.org/linux-mm/[email protected]/T/#m9a1090108b61d32c04b68a1f3f2577644823a999

- James

2023-01-19 17:44:46

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On 01/19/23 08:57, James Houghton wrote:
> > > > > I wonder if the following crazy idea has already been discussed: treat the
> > > > > whole mapping as a single large logical mapping. One reference and one
> > > > > mapping, no matter how the individual parts are mapped into the assigned
> > > > > page table sub-tree.
> > > > >
> > > > > Because for hugetlb with MAP_SHARED, we know that the complete assigned
> > > > > sub-tree of page tables can only map the given hugetlb page, no fragments of
> > > > > something else. That's very different to THP in private mappings ...
> > > > >
> > > > > So as soon as the first piece gets mapped, we increment refcount+mapcount.
> > > > > Other pieces in the same subtree don't do that.
> > > > >
> > > > > Once the last piece is unmapped (or simpler: once the complete subtree of
> > > > > page tables is gone), we decrement refcount+mapcount. Might require some
> > > > > brain power to do this tracking, but I wouldn't call it impossible right
> > > > > from the start.
> > > > >
> > > > > Would such a design violate other design aspects that are important?
> > >
> > > This is actually how mapcount was treated in HGM RFC v1 (though not
> > > refcount); it is doable for both [2].
> >
> > My apologies for being late to the party :)
> >
> > When Peter first brought up the issue with ref/map_count overflows I was
> > thinking that we should use a scheme like David describes above. As
> > James points out, this was the approach taken in the first RFC.
> >
> > > One caveat here: if a page is unmapped in small pieces, it is
> > > difficult to know if the page is legitimately completely unmapped (we
> > > would have to check all the PTEs in the page table).
> >
> > Are we allowing unmapping of small (non-huge page sized) areas with HGM?
> > We must be if you are concerned with it. What API would cause this?
> > I just do not remember this discussion.
>
> There was some discussion about allowing MADV_DONTNEED on
> less-than-hugepage pieces [3] (it actually motivated the switch from
> UFFD_FEATURE_MINOR_HUGETLBFS_HGM to MADV_SPLIT). It isn't implemented
> in this series, but it could be implemented in the future.

OK, so we do not actually create HGM mappings until a uffd operation is
done at less-than-huge-page-size granularity. MADV_SPLIT just says
that HGM mappings are 'possible' for this vma. Hopefully, my understanding
is correct.

I was concerned about things like the page fault path, but in that case
we have already 'entered HGM mode' via a uffd operation.

Both David and Peter have asked whether eliminating intermediate mapping
levels would be a simplification. I trust your response that it would
not help much in the current design/implementation. But, it did get me
thinking about something else.

Perhaps we have discussed this before, and perhaps it does not meet all
user needs, but one way to possibly simplify this is:

- 'Enable HGM' via MADV_SPLIT. Must be done at huge page (hstate)
granularity.
- MADV_SPLIT implicitly unmaps everything within the range.
- MADV_SPLIT says all mappings for this vma will now be done at a base
(4K) page size granularity. The vma would be marked in some way.
- I think this eliminates the need for hugetlb_ptes as we KNOW the
mapping size.
- We still use huge pages to back 4K mappings, and we still have to deal
with the ref/map_count issues.
- Code touching hugetlb page tables would KNOW the mapping size up front.

Again, apologies if we talked about and previously dismissed this type
of approach.

> > When I was thinking about this I was a bit concerned about having enough
> > information to know exactly when to inc or dec counts. I was actually
> > worried about knowing to do the increment. I don't recall how it was
> > done in the first RFC, but from a high level it would need to be done
> > when the first hstate level PTE is allocated/added to the page table.
> > Right? My concern was with all the places where we could 'error out'
> > after allocating the PTE, but before initializing it. I was just thinking
> > that we might need to scan the page table or keep metadata for better
> > or easier accounting.
>
> The only two places where we can *create* a high-granularity page
> table are: __mcopy_atomic_hugetlb (UFFDIO_CONTINUE) and
> copy_hugetlb_page_range. RFC v1 did not properly deal with the cases
> where we error out. To correctly handle these cases, we basically have
> to do the pagecache lookup before touching the page table.
>
> 1. For __mcopy_atomic_hugetlb, we can lookup the page before doing the
> PT walk/alloc. If PT walk tells us to inc the page ref/mapcount, we do
> so immediately. We can easily pass the page into
> hugetlb_mcopy_atomic_pte() (via 'pagep') .
>
> 2. For copy_hugetlb_page_range() for VM_MAYSHARE, we can also do the
> lookup before we do the page table walk. I'm not sure how to support
> non-shared HGM mappings with this scheme (in this series, we also
> don't support non-shared; we return -EINVAL).
> NB: The only case where high-granularity mappings for !VM_MAYSHARE
> VMAs would come up is as a result of hwpoison.
>
> So we can avoid keeping additional metadata for what this series is
> trying to accomplish, but if the above isn't acceptable, then I/we can
> try to come up with a scheme that would be acceptable.

Ok, I was thinking we had to deal with other code paths such as page
fault. But, now I understand that is not the case with this design.

> There is also the possibility that the scheme implemented in this
> version of the series is acceptable (i.e., the page_mapcount() API
> difference, which results in slightly modified page migration behavior
> and smaps output, is ok... assuming we have the refcount overflow
> check).
>
> >
> > I think Peter mentioned it elsewhere, we should come up with a workable
> > scheme for HGM ref/map counting. This can be done somewhat independently.
>
> FWIW, what makes the most sense to me right now is to implement the
> THP-like scheme and mark HGM as mutually exclusive with the vmemmap
> optimization. We can later come up with a scheme that lets us retain
> compatibility. (Is that what you mean by "this can be done somewhat
> independently", Mike?)

Sort of, I was only saying that getting the ref/map counting right seems
like a task that can be worked on independently. Using the THP-like scheme
is good.
--
Mike Kravetz

2023-01-19 20:46:26

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Thu, Jan 19, 2023 at 9:32 AM Mike Kravetz <[email protected]> wrote:
>
> On 01/19/23 08:57, James Houghton wrote:
> > > > > > I wonder if the following crazy idea has already been discussed: treat the
> > > > > > whole mapping as a single large logical mapping. One reference and one
> > > > > > mapping, no matter how the individual parts are mapped into the assigned
> > > > > > page table sub-tree.
> > > > > >
> > > > > > Because for hugetlb with MAP_SHARED, we know that the complete assigned
> > > > > > sub-tree of page tables can only map the given hugetlb page, no fragments of
> > > > > > something else. That's very different to THP in private mappings ...
> > > > > >
> > > > > > So as soon as the first piece gets mapped, we increment refcount+mapcount.
> > > > > > Other pieces in the same subtree don't do that.
> > > > > >
> > > > > > Once the last piece is unmapped (or simpler: once the complete subtree of
> > > > > > page tables is gone), we decrement refcount+mapcount. Might require some
> > > > > > brain power to do this tracking, but I wouldn't call it impossible right
> > > > > > from the start.
> > > > > >
> > > > > > Would such a design violate other design aspects that are important?
> > > >
> > > > This is actually how mapcount was treated in HGM RFC v1 (though not
> > > > refcount); it is doable for both [2].
> > >
> > > My apologies for being late to the party :)
> > >
> > > When Peter first brought up the issue with ref/map_count overflows I was
> > > thinking that we should use a scheme like David describes above. As
> > > James points out, this was the approach taken in the first RFC.
> > >
> > > > One caveat here: if a page is unmapped in small pieces, it is
> > > > difficult to know if the page is legitimately completely unmapped (we
> > > > would have to check all the PTEs in the page table).
> > >
> > > Are we allowing unmapping of small (non-huge page sized) areas with HGM?
> > > We must be if you are concerned with it. What API would cause this?
> > > I just do not remember this discussion.
> >
> > There was some discussion about allowing MADV_DONTNEED on
> > less-than-hugepage pieces [3] (it actually motivated the switch from
> > UFFD_FEATURE_MINOR_HUGETLBFS_HGM to MADV_SPLIT). It isn't implemented
> > in this series, but it could be implemented in the future.
>
> OK, so we do not actually create HGM mappings until a uffd operation is
> done at a less than huge page size granularity. MADV_SPLIT just says
> that HGM mappings are 'possible' for this vma. Hopefully, my understanding
> is correct.

Right, that's the current meaning of MADV_SPLIT for hugetlb.

> I was concerned about things like the page fault path, but in that case
> we have already 'entered HGM mode' via a uffd operation.
>
> Both David and Peter have asked whether eliminating intermediate mapping
> levels would be a simplification. I trust your response that it would
> not help much in the current design/implementation. But, it did get me
> thinking about something else.
>
> Perhaps we have discussed this before, and perhaps it does not meet all
> user needs, but one way possibly simplify this is:
>
> - 'Enable HGM' via MADV_SPLIT. Must be done at huge page (hstate)
> granularity.
> - MADV_SPLIT implicitly unmaps everything with in the range.
> - MADV_SPLIT says all mappings for this vma will now be done at a base
> (4K) page size granularity. vma would be marked some way.
> - I think this eliminates the need for hugetlb_pte's as we KNOW the
> mapping size.
> - We still use huge pages to back 4K mappings, and we still have to deal
> with the ref/map_count issues.
> - Code touching hugetlb page tables would KNOW the mapping size up front.
>
> Again, apologies if we talked about and previously dismissed this type
> of approach.

I think Peter was the one who originally suggested an approach like
this, and it meets my needs. However, I still think the way that
things are currently implemented is the right way to go.

Assuming we want decent performance, we can't get away with the same
strategy of just passing pte_t*s everywhere. The PTL for a 4K PTE
should be based on the PMD above the PTE, so we need to either pass
around the PMD above our PTE, or we need to pass around the PTL. This
is something that hugetlb_pte does for us, so, in some sense, even
going with this simpler approach, we still need a hugetlb_pte-like
construct.
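
For reference, a minimal sketch of what such a construct carries (the field
set is approximate, not the exact struct from this series):

/*
 * The walker hands back the entry it found plus the lock that covers
 * it, so callers lock the right thing no matter which level the entry
 * lives at.
 */
struct hugetlb_pte {
        pte_t           *ptep;  /* the entry, at whatever level it lives */
        unsigned int    shift;  /* mapping size this entry covers */
        spinlock_t      *ptl;   /* page table lock protecting @ptep */
};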

Although most of the other complexity that HGM introduces would have
to be introduced either way (like having to deal with putting
compound_head()/page_folio() in more places and doing some
per-architecture updates), there are some complexities that the
simpler approach avoids:

- We avoid problems related to compound PTEs (the problem being: two
threads racing to populate a contiguous and non-contiguous PTE that
take up the same space could lead to user-detectable incorrect
behavior. This isn't hard to fix; it will be fixed when I send the arm64
patches up.)

- We don't need to check if PTEs get split from under us in PT walks.
(In a lot of cases, the appropriate action is just to treat the PTE as
if it were pte_none().) In the same vein, we don't need
hugetlb_pte_present_leaf() at all, because PTEs we find will always be
leaves.

- We don't have to deal with sorting hstates or implementing
for_each_hgm_shift()/hugetlb_alloc_largest_pte().

None of these complexities are particularly major in my opinion.

This might seem kind of contrived, but let's say you have a VM with 1T
of memory, and you find 100 memory errors all in different 1G pages
over the life of this VM (years, potentially). Having 10% of your
memory be 4K-mapped is definitely worse than having 10% be 2M-mapped
(lost performance and increased memory overhead). There might be other
cases in the future where being able to have intermediate mapping
sizes could be helpful.

> > > When I was thinking about this I was a bit concerned about having enough
> > > information to know exactly when to inc or dec counts. I was actually
> > > worried about knowing to do the increment. I don't recall how it was
> > > done in the first RFC, but from a high level it would need to be done
> > > when the first hstate level PTE is allocated/added to the page table.
> > > Right? My concern was with all the places where we could 'error out'
> > > after allocating the PTE, but before initializing it. I was just thinking
> > > that we might need to scan the page table or keep metadata for better
> > > or easier accounting.
> >
> > The only two places where we can *create* a high-granularity page
> > table are: __mcopy_atomic_hugetlb (UFFDIO_CONTINUE) and
> > copy_hugetlb_page_range. RFC v1 did not properly deal with the cases
> > where we error out. To correctly handle these cases, we basically have
> > to do the pagecache lookup before touching the page table.
> >
> > 1. For __mcopy_atomic_hugetlb, we can lookup the page before doing the
> > PT walk/alloc. If PT walk tells us to inc the page ref/mapcount, we do
> > so immediately. We can easily pass the page into
> > hugetlb_mcopy_atomic_pte() (via 'pagep') .
> >
> > 2. For copy_hugetlb_page_range() for VM_MAYSHARE, we can also do the
> > lookup before we do the page table walk. I'm not sure how to support
> > non-shared HGM mappings with this scheme (in this series, we also
> > don't support non-shared; we return -EINVAL).
> > NB: The only case where high-granularity mappings for !VM_MAYSHARE
> > VMAs would come up is as a result of hwpoison.
> >
> > So we can avoid keeping additional metadata for what this series is
> > trying to accomplish, but if the above isn't acceptable, then I/we can
> > try to come up with a scheme that would be acceptable.
>
> Ok, I was thinking we had to deal with other code paths such as page
> fault. But, now I understand that is not the case with this design.
>
> > There is also the possibility that the scheme implemented in this
> > version of the series is acceptable (i.e., the page_mapcount() API
> > difference, which results in slightly modified page migration behavior
> > and smaps output, is ok... assuming we have the refcount overflow
> > check).
> >
> > >
> > > I think Peter mentioned it elsewhere, we should come up with a workable
> > > scheme for HGM ref/map counting. This can be done somewhat independently.
> >
> > FWIW, what makes the most sense to me right now is to implement the
> > THP-like scheme and mark HGM as mutually exclusive with the vmemmap
> > optimization. We can later come up with a scheme that lets us retain
> > compatibility. (Is that what you mean by "this can be done somewhat
> > independently", Mike?)
>
> Sort of, I was only saying that getting the ref/map counting right seems
> like a task than can be independently worked. Using the THP-like scheme
> is good.

Ok! So if you're ok with the intermediate mapping sizes, it sounds
like I should go ahead and implement the THP-like scheme.

- James

2023-01-19 21:32:37

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Thu, Jan 19, 2023 at 11:42:26AM -0800, James Houghton wrote:
> - We avoid problems related to compound PTEs (the problem being: two
> threads racing to populate a contiguous and non-contiguous PTE that
> take up the same space could lead to user-detectable incorrect
> behavior. This isn't hard to fix; it will be when I send the arm64
> patches up.)

Could you elaborate this one a bit more?

> This might seem kind of contrived, but let's say you have a VM with 1T
> of memory, and you find 100 memory errors all in different 1G pages
> over the life of this VM (years, potentially). Having 10% of your
> memory be 4K-mapped is definitely worse than having 10% be 2M-mapped
> (lost performance and increased memory overhead). There might be other
> cases in the future where being able to have intermediate mapping
> sizes could be helpful.

This is not the norm, or is it? How does the possibility of bad pages
distribute over hosts over the years? This can definitely affect how we should
target the intermediate level mappings.

Thanks,

--
Peter Xu

2023-01-19 22:53:11

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Thu, Jan 19, 2023 at 02:00:32PM -0800, Mike Kravetz wrote:
> I do not know much about the (primary) live migration use case. My
> guess is that page table lock contention may be an issue? In this use
> case, HGM is only enabled for the duration the live migration operation,
> then a MADV_COLLAPSE is performed. If contention is likely to be an
> issue during this time, then yes we would need to pass around with
> something like hugetlb_pte.

I'm not aware of any such contention issue. IMHO the migration problem is
mostly that transferring such a large page is too slow. Shrinking the page
size should already resolve the major problem here, IIUC.

AFAIU a 4K-only solution should only reduce lock contention, because locks
will always be pte-level if VM_HUGETLB_HGM is set. When walking and creating
the intermediate pgtable entries we can use atomic ops just like generic mm,
so no lock is needed at all. With uncertainty about the size of the mappings,
we'd need to take any of the multiple layers of locks.

[...]

> > None of these complexities are particularly major in my opinion.
>
> Perhaps not. I was just thinking about the overall complexity of the
> hugetlb code after HGM. Currently, it is 'relatively simple' with
> fixed huge page sizes. IMO, much simpler than THP with two possible
> mapping sizes. With HGM and intermediate mapping sizes, it seems
> things could get more complicated than THP. Perhaps it is just me.

Please count me in. :) I'm still happy to see what it'll look like if
James thinks having that complexity doesn't greatly affect the whole design.

Thanks,

--
Peter Xu

2023-01-19 23:13:43

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Thu, Jan 19, 2023 at 2:23 PM Peter Xu <[email protected]> wrote:
>
> On Thu, Jan 19, 2023 at 02:00:32PM -0800, Mike Kravetz wrote:
> > I do not know much about the (primary) live migration use case. My
> > guess is that page table lock contention may be an issue? In this use
> > case, HGM is only enabled for the duration the live migration operation,
> > then a MADV_COLLAPSE is performed. If contention is likely to be an
> > issue during this time, then yes we would need to pass around with
> > something like hugetlb_pte.
>
> I'm not aware of any such contention issue. IMHO the migration problem is
> majorly about being too slow transferring a page being so large. Shrinking
> the page size should resolve the major problem already here IIUC.

This will be problematic if you scale up VMs to be quite large. Google
upstreamed the "TDP MMU" for KVM/x86, which removed the need to take the
MMU lock for writing in the EPT violation path. We found that this
change is required for VMs with >200 or so vCPUs to consistently avoid CPU
soft lockups in the guest.

Requiring each UFFDIO_CONTINUE (in the post-copy path) to serialize on
the same PTL would be problematic in the same way.

>
> AFAIU 4K-only solution should only reduce any lock contention because locks
> will always be pte-level if VM_HUGETLB_HGM set. When walking and creating
> the intermediate pgtable entries we can use atomic ops just like generic
> mm, so no lock needed at all. With uncertainty on the size of mappings,
> we'll need to take any of the multiple layers of locks.
>

Other than taking the HugeTLB VMA lock for reading, walking/allocating
page tables won't need any additional locking.

We take the PTL to allocate the next level down, but so does generic
mm (look at __pud_alloc, __pmd_alloc for example). Maybe I am
misunderstanding.

- James

2023-01-19 23:14:45

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Thu, Jan 19, 2023 at 12:53 PM Peter Xu <[email protected]> wrote:
>
> On Thu, Jan 19, 2023 at 11:42:26AM -0800, James Houghton wrote:
> > - We avoid problems related to compound PTEs (the problem being: two
> > threads racing to populate a contiguous and non-contiguous PTE that
> > take up the same space could lead to user-detectable incorrect
> > behavior. This isn't hard to fix; it will be when I send the arm64
> > patches up.)
>
> Could you elaborate this one a bit more?

In hugetlb_mcopy_atomic_pte(), we check that the PTE we're about to
overwrite is pte_none() before overwriting it. For contiguous PTEs,
this only checks the first PTE in the bunch.

If someone came around and populated one of the PTEs that lie in the
middle of a potentially contiguous group of PTEs, we could end up
overwriting that PTE if we later did a UFFDIO_CONTINUE in such a way as
to create a contiguous PTE.

We would expect to get EEXIST here, but in this case the operation
would succeed. To fix this, we can just check that ALL the PTEs in the
contiguous bunch have the value that we're expecting, not just the
first one.
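
A sketch of that fix (names approximate, arm64-flavoured):

/*
 * Require every slot covered by the would-be contiguous PTE to still
 * be pte_none() before installing it, instead of testing only the
 * first one.
 */
static bool cont_range_is_none(pte_t *ptep, unsigned int ncontig)
{
        unsigned int i;

        for (i = 0; i < ncontig; i++)
                if (!pte_none(ptep_get(ptep + i)))
                        return false;
        return true;
}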

hugetlb_no_page() has the same problem, but it's not immediately clear
to me how it would result in incorrect behavior.

>
> > This might seem kind of contrived, but let's say you have a VM with 1T
> > of memory, and you find 100 memory errors all in different 1G pages
> > over the life of this VM (years, potentially). Having 10% of your
> > memory be 4K-mapped is definitely worse than having 10% be 2M-mapped
> > (lost performance and increased memory overhead). There might be other
> > cases in the future where being able to have intermediate mapping
> > sizes could be helpful.
>
> This is not the norm, or is it? How the possibility of bad pages can
> distribute over hosts over years? This can definitely affect how we should
> target the intermediate level mappings.

I can't really speak for norms generally, but I can try to speak for
Google Cloud. Google Cloud hasn't had memory error virtualization for
very long (only about a year), but we've seen cases where VMs can pick
up several memory errors over a few days/weeks. IMO, 100 errors in
separate 1G pages over a few years isn't completely nonsensical,
especially if the memory that you're using isn't so reliable or was
damaged in shipping (like if it was flown over the poles or
something!).

Now there is the concern about how an application would handle it. In
a VMM's case, we can virtualize the error for the guest. In the guest,
it's possible that a good chunk of the errors lie in unused pages and
so can be easily marked as poisoned. It's possible that recovery is
much more difficult. It's not unreasonable for an application to
recover from a lot of memory errors.

- James

2023-01-19 23:15:01

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On 01/19/23 11:42, James Houghton wrote:
> On Thu, Jan 19, 2023 at 9:32 AM Mike Kravetz <[email protected]> wrote:
> > On 01/19/23 08:57, James Houghton wrote:
> >
> > OK, so we do not actually create HGM mappings until a uffd operation is
> > done at a less than huge page size granularity. MADV_SPLIT just says
> > that HGM mappings are 'possible' for this vma. Hopefully, my understanding
> > is correct.
>
> Right, that's the current meaning of MADV_SPLIT for hugetlb.
>
> > I was concerned about things like the page fault path, but in that case
> > we have already 'entered HGM mode' via a uffd operation.
> >
> > Both David and Peter have asked whether eliminating intermediate mapping
> > levels would be a simplification. I trust your response that it would
> > not help much in the current design/implementation. But, it did get me
> > thinking about something else.
> >
> > Perhaps we have discussed this before, and perhaps it does not meet all
> > user needs, but one way possibly simplify this is:
> >
> > - 'Enable HGM' via MADV_SPLIT. Must be done at huge page (hstate)
> > granularity.
> > - MADV_SPLIT implicitly unmaps everything with in the range.
> > - MADV_SPLIT says all mappings for this vma will now be done at a base
> > (4K) page size granularity. vma would be marked some way.
> > - I think this eliminates the need for hugetlb_pte's as we KNOW the
> > mapping size.
> > - We still use huge pages to back 4K mappings, and we still have to deal
> > with the ref/map_count issues.
> > - Code touching hugetlb page tables would KNOW the mapping size up front.
> >
> > Again, apologies if we talked about and previously dismissed this type
> > of approach.
>
> I think Peter was the one who originally suggested an approach like
> this, and it meets my needs. However, I still think the way that
> things are currently implemented is the right way to go.
>
> Assuming we want decent performance, we can't get away with the same
> strategy of just passing pte_t*s everywhere. The PTL for a 4K PTE
> should be based on the PMD above the PTE, so we need to either pass
> around the PMD above our PTE, or we need to pass around the PTL. This
> is something that hugetlb_pte does for us, so, in some sense, even
> going with this simpler approach, we still need a hugetlb_pte-like
> construct.

Agree there is this performance hit. However, the 'simplest' approach
would be to just use the page table lock as is done by default for 4K
PTEs.

I do not know much about the (primary) live migration use case. My
guess is that page table lock contention may be an issue? In this use
case, HGM is only enabled for the duration of the live migration operation,
then a MADV_COLLAPSE is performed. If contention is likely to be an
issue during this time, then yes, we would need to pass around
something like hugetlb_pte.

> Although most of the other complexity that HGM introduces would have
> to be introduced either way (like having to deal with putting
> compound_head()/page_folio() in more places and doing some
> per-architecture updates), there are some complexities that the
> simpler approach avoids:
>
> - We avoid problems related to compound PTEs (the problem being: two
> threads racing to populate a contiguous and non-contiguous PTE that
> take up the same space could lead to user-detectable incorrect
> behavior. This isn't hard to fix; it will be when I send the arm64
> patches up.)
>
> - We don't need to check if PTEs get split from under us in PT walks.
> (In a lot of cases, the appropriate action is just to treat the PTE as
> if it were pte_none().) In the same vein, we don't need
> hugetlb_pte_present_leaf() at all, because PTEs we find will always be
> leaves.
>
> - We don't have to deal with sorting hstates or implementing
> for_each_hgm_shift()/hugetlb_alloc_largest_pte().
>
> None of these complexities are particularly major in my opinion.

Perhaps not. I was just thinking about the overall complexity of the
hugetlb code after HGM. Currently, it is 'relatively simple' with
fixed huge page sizes. IMO, much simpler than THP with two possible
mapping sizes. With HGM and intermediate mapping sizes, it seems
things could get more complicated than THP. Perhaps it is just me.
I am just too familiar with the current code and a bit anxious about
added complexity. But, I felt the same about vmemmap optimizations. :)

> This might seem kind of contrived, but let's say you have a VM with 1T
> of memory, and you find 100 memory errors all in different 1G pages
> over the life of this VM (years, potentially). Having 10% of your
> memory be 4K-mapped is definitely worse than having 10% be 2M-mapped
> (lost performance and increased memory overhead). There might be other
> cases in the future where being able to have intermediate mapping
> sizes could be helpful.

That may be a bit contrived. We know memory error handling is a future
use case, but I believe there is work outside of HGM that needs to be
done to handle such situations. For example, HGM will allow the 1G
mapping to isolate the 4K page with the error. This prevents errors when
you fault almost anywhere within the 1G page. But, there still remains
the possibility of accessing that 4K page with the error. IMO, it will
require user space/application intervention to address this, as only the
application knows about the potentially lost data. This is still something
that needs to be designed. It would then make sense for the application
to also determine how it wants to proceed WRT mapping the 1G area.
Perhaps it will want (and there will exist a mechanism) to migrate the
data to a new 1G page without error.

> > > > I think Peter mentioned it elsewhere, we should come up with a workable
> > > > scheme for HGM ref/map counting. This can be done somewhat independently.
> > >
> > > FWIW, what makes the most sense to me right now is to implement the
> > > THP-like scheme and mark HGM as mutually exclusive with the vmemmap
> > > optimization. We can later come up with a scheme that lets us retain
> > > compatibility. (Is that what you mean by "this can be done somewhat
> > > independently", Mike?)
> >
> > Sort of, I was only saying that getting the ref/map counting right seems
> > like a task that can be independently worked. Using the THP-like scheme
> > is good.
>
> Ok! So if you're ok with the intermediate mapping sizes, it sounds
> like I should go ahead and implement the THP-like scheme.

Yes, I am OK with it. Just expressed a bit of concern above.
--
Mike Kravetz

2023-01-19 23:17:31

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 35/46] hugetlb: add MADV_COLLAPSE for hugetlb

On Thu, Jan 05, 2023 at 10:18:33AM +0000, James Houghton wrote:
> + /*
> + * Grab the VMA lock and mapping sem for writing. This will prevent
> + * concurrent high-granularity page table walks, so that we can safely
> + * collapse and free page tables.
> + *
> + * This is the same locking that huge_pmd_unshare requires.
> + */
> + hugetlb_vma_lock_write(vma);
> + i_mmap_lock_write(vma->vm_file->f_mapping);

One thing I just noticed - do we need the mmap write lock here? I don't
quickly see what stops another thread from holding the mmap lock for read
and walking the pgtables being collapsed.

Thanks,

--
Peter Xu

2023-01-19 23:39:30

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 35/46] hugetlb: add MADV_COLLAPSE for hugetlb

On Thu, Jan 19, 2023 at 2:37 PM Peter Xu <[email protected]> wrote:
>
> On Thu, Jan 05, 2023 at 10:18:33AM +0000, James Houghton wrote:
> > + /*
> > + * Grab the VMA lock and mapping sem for writing. This will prevent
> > + * concurrent high-granularity page table walks, so that we can safely
> > + * collapse and free page tables.
> > + *
> > + * This is the same locking that huge_pmd_unshare requires.
> > + */
> > + hugetlb_vma_lock_write(vma);
> > + i_mmap_lock_write(vma->vm_file->f_mapping);
>
> One thing I just noticed - do we need the mmap write lock here? I don't
> quickly see what stops another thread from holding the mmap lock for read
> and walking the pgtables being collapsed.

Maybe. Does huge_pmd_unshare() have the same problem?

2023-01-19 23:47:22

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Thu, Jan 19, 2023 at 3:07 PM Peter Xu <[email protected]> wrote:
>
> On Thu, Jan 19, 2023 at 02:35:12PM -0800, James Houghton wrote:
> > On Thu, Jan 19, 2023 at 2:23 PM Peter Xu <[email protected]> wrote:
> > >
> > > On Thu, Jan 19, 2023 at 02:00:32PM -0800, Mike Kravetz wrote:
> > > > I do not know much about the (primary) live migration use case. My
> > > > guess is that page table lock contention may be an issue? In this use
> > > > case, HGM is only enabled for the duration the live migration operation,
> > > > then a MADV_COLLAPSE is performed. If contention is likely to be an
> > > > issue during this time, then yes we would need to pass around with
> > > > something like hugetlb_pte.
> > >
> > > I'm not aware of any such contention issue. IMHO the migration problem is
> > > majorly about being too slow transferring a page being so large. Shrinking
> > > the page size should resolve the major problem already here IIUC.
> >
> > This will be problematic if you scale up VMs to be quite large.
>
> Do you mean that for the postcopy use case one can leverage e.g. 2M
> mappings (over 1G) to avoid lock contentions when VM is large I agree it
> should be more efficient than having 512 4K page installed, but I think
> it'll make the page fault resolution slower too if some thead is only
> looking for a 4k portion of it.

No, that's not what I meant. Sorry. If you can use the PTL that is
normally used for 4K PTEs, then you're right, there is no contention
problem. However, this PTL is determined by the value of the PMD, so
you need a pointer to the PMD to determine what the PTL should be (or
a pointer to the PTL itself).

In hugetlb, we only ever pass around the PTE pointer, and we rely on
huge_pte_lockptr() to find the PTL for us (and it does so
appropriately for everything except 4K PTEs). We would need to add the
complexity of passing around a PMD or PTL everywhere, and that's what
hugetlb_pte does for us. So that complexity is basically unavoidable,
unless you're ok with 4K PTEs taking mm->page_table_lock (I'm
not).
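
To make that concrete, here is a rough sketch of what such a construct
needs to carry (field and helper names here are illustrative, not
necessarily the exact layout in the series):

struct hugetlb_pte {
	pte_t *ptep;		/* the entry itself, at whatever level it lives */
	unsigned int shift;	/* mapping size this entry covers */
	spinlock_t *ptl;	/* the PTL to take for this entry */
};

static inline spinlock_t *hugetlb_pte_lockptr(struct hugetlb_pte *hpte)
{
	/*
	 * For 4K-level entries this is the split lock of the PMD above
	 * the PTE; for larger entries it is whatever huge_pte_lockptr()
	 * would have returned. Either way, callers no longer need a
	 * pointer to the PMD itself.
	 */
	return hpte->ptl;
}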

>
> > Google upstreamed the "TDP MMU" for KVM/x86 that removed the need to take
> > the MMU lock for writing in the EPT violation path. We found that this
> > change is required for VMs >200 or so vCPUs to consistently avoid CPU
> > soft lockups in the guest.
>
> After the kvm mmu rwlock convertion, it'll allow concurrent page faults
> even if only 4K pages are used, so it seems not directly relevant to what
> we're discussing here, no?

Right. I was just bringing it up to say that if 4K PTLs were
mm->page_table_lock, we would have a problem.

>
> >
> > Requiring each UFFDIO_CONTINUE (in the post-copy path) to serialize on
> > the same PTL would be problematic in the same way.
>
> Pte-level pgtable lock only covers 2M range, so I think it depends on which
> is the address that the vcpu is faulted on? IIUC the major case should be
> that the faulted threads are not falling upon the same 2M range.

Right. I think my comment should make more sense with the above clarification.

>
> >
> > >
> > > AFAIU 4K-only solution should only reduce any lock contention because locks
> > > will always be pte-level if VM_HUGETLB_HGM set. When walking and creating
> > > the intermediate pgtable entries we can use atomic ops just like generic
> > > mm, so no lock needed at all. With uncertainty on the size of mappings,
> > > we'll need to take any of the multiple layers of locks.
> > >
> >
> > Other than taking the HugeTLB VMA lock for reading, walking/allocating
> > page tables won't need any additional locking.
>
> Actually when revisiting the locks I'm getting a bit confused on whether
> the vma lock is needed if pmd sharing is anyway forbidden for HGM. I
> raised a question in the other patch of MADV_COLLAPSE, maybe they're
> related questions so we can keep it there.

We can discuss there. :) I take both the VMA lock and mapping lock so
that it can stay in sync with huge_pmd_unshare(), and so HGM walks
have the same synchronization as regular hugetlb PT walks.

>
> >
> > We take the PTL to allocate the next level down, but so does generic
> > mm (look at __pud_alloc, __pmd_alloc for example). Maybe I am
> > misunderstanding.
>
> Sorry you're right, please ignore that. I don't know why I had that
> impression that spinlocks are not needed in that process.
>
> Actually I am also curious why atomics won't work (by holding mmap read
> lock, then do cmpxchg(old_entry=0, new_entry) upon the pgtable entries). I
> think it's possible I just missed something else.

I think there are cases where we need to make sure the value of a PTE
isn't going to change from under us while we're doing some kind of
other operation, and so a compare-and-swap won't really be what we
need.
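
As a purely illustrative sketch (not code from this series): a cmpxchg is
fine for "install if empty", but a read-then-act sequence needs the PTL
held across the whole thing, roughly:

static bool hpte_check_and_act(struct hstate *h, struct mm_struct *mm,
			       pte_t *ptep)
{
	spinlock_t *ptl;
	pte_t pte;
	bool present;

	/*
	 * The PTL is held across the read *and* whatever we do with the
	 * value, so the entry cannot be split or changed underneath us.
	 * A lone cmpxchg could only install a new value atomically; it
	 * could not keep the value stable for the rest of the operation.
	 */
	ptl = huge_pte_lock(h, mm, ptep);
	pte = huge_ptep_get(ptep);
	present = !huge_pte_none(pte) && pte_present(pte);
	/* ... act on 'pte' here ... */
	spin_unlock(ptl);

	return present;
}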

2023-01-19 23:48:52

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Thu, Jan 19, 2023 at 02:35:12PM -0800, James Houghton wrote:
> On Thu, Jan 19, 2023 at 2:23 PM Peter Xu <[email protected]> wrote:
> >
> > On Thu, Jan 19, 2023 at 02:00:32PM -0800, Mike Kravetz wrote:
> > > I do not know much about the (primary) live migration use case. My
> > > guess is that page table lock contention may be an issue? In this use
> > > case, HGM is only enabled for the duration the live migration operation,
> > > then a MADV_COLLAPSE is performed. If contention is likely to be an
> > > issue during this time, then yes we would need to pass around with
> > > something like hugetlb_pte.
> >
> > I'm not aware of any such contention issue. IMHO the migration problem is
> > majorly about being too slow transferring a page being so large. Shrinking
> > the page size should resolve the major problem already here IIUC.
>
> This will be problematic if you scale up VMs to be quite large.

Do you mean that for the postcopy use case one can leverage e.g. 2M
mappings (over 1G) to avoid lock contention when the VM is large? I agree
it should be more efficient than having 512 4K pages installed, but I
think it'll make the page fault resolution slower too if some thread is
only looking for a 4K portion of it.

> Google upstreamed the "TDP MMU" for KVM/x86 that removed the need to take
> the MMU lock for writing in the EPT violation path. We found that this
> change is required for VMs >200 or so vCPUs to consistently avoid CPU
> soft lockups in the guest.

After the kvm mmu rwlock conversion, it'll allow concurrent page faults
even if only 4K pages are used, so it seems not directly relevant to what
we're discussing here, no?

>
> Requiring each UFFDIO_CONTINUE (in the post-copy path) to serialize on
> the same PTL would be problematic in the same way.

The pte-level pgtable lock only covers a 2M range, so I think it depends on
which address the vcpu faulted on? IIUC the major case should be that the
faulting threads are not falling upon the same 2M range.

>
> >
> > AFAIU 4K-only solution should only reduce any lock contention because locks
> > will always be pte-level if VM_HUGETLB_HGM set. When walking and creating
> > the intermediate pgtable entries we can use atomic ops just like generic
> > mm, so no lock needed at all. With uncertainty on the size of mappings,
> > we'll need to take any of the multiple layers of locks.
> >
>
> Other than taking the HugeTLB VMA lock for reading, walking/allocating
> page tables won't need any additional locking.

Actually when revisiting the locks I'm getting a bit confused on whether
the vma lock is needed if pmd sharing is anyway forbidden for HGM. I
raised a question in the other patch of MADV_COLLAPSE, maybe they're
related questions so we can keep it there.

>
> We take the PTL to allocate the next level down, but so does generic
> mm (look at __pud_alloc, __pmd_alloc for example). Maybe I am
> misunderstanding.

Sorry you're right, please ignore that. I don't know why I had that
impression that spinlocks are not needed in that process.

Actually I am also curious why atomics won't work (by holding mmap read
lock, then do cmpxchg(old_entry=0, new_entry) upon the pgtable entries). I
think it's possible I just missed something else.

--
Peter Xu

2023-01-20 00:07:02

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On 01/19/23 18:07, Peter Xu wrote:
>
> Actually when revisiting the locks I'm getting a bit confused on whether
> the vma lock is needed if pmd sharing is anyway forbidden for HGM. I
> raised a question in the other patch of MADV_COLLAPSE, maybe they're
> related questions so we can keep it there.

I can quickly answer that. Yes. The vma lock is also being used for
fault/truncation synchronization. Commit e700898fa075 makes sure it is
used even on architectures that do not support PMD sharing.

I had come up with a rather ugly method of using the fault mutex for
fault/truncation synchronization, but using the vma lock was more
elegant.
--
Mike Kravetz

2023-01-20 18:11:14

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Thu, Jan 19, 2023 at 03:26:14PM -0800, James Houghton wrote:
> On Thu, Jan 19, 2023 at 3:07 PM Peter Xu <[email protected]> wrote:
> >
> > On Thu, Jan 19, 2023 at 02:35:12PM -0800, James Houghton wrote:
> > > On Thu, Jan 19, 2023 at 2:23 PM Peter Xu <[email protected]> wrote:
> > > >
> > > > On Thu, Jan 19, 2023 at 02:00:32PM -0800, Mike Kravetz wrote:
> > > > > I do not know much about the (primary) live migration use case. My
> > > > > guess is that page table lock contention may be an issue? In this use
> > > > > case, HGM is only enabled for the duration the live migration operation,
> > > > > then a MADV_COLLAPSE is performed. If contention is likely to be an
> > > > > issue during this time, then yes we would need to pass around with
> > > > > something like hugetlb_pte.
> > > >
> > > > I'm not aware of any such contention issue. IMHO the migration problem is
> > > > majorly about being too slow transferring a page being so large. Shrinking
> > > > the page size should resolve the major problem already here IIUC.
> > >
> > > This will be problematic if you scale up VMs to be quite large.
> >
> > Do you mean that for the postcopy use case one can leverage e.g. 2M
> > mappings (over 1G) to avoid lock contentions when VM is large I agree it
> > should be more efficient than having 512 4K page installed, but I think
> > it'll make the page fault resolution slower too if some thead is only
> > looking for a 4k portion of it.
>
> No, that's not what I meant. Sorry. If you can use the PTL that is
> normally used for 4K PTEs, then you're right, there is no contention
> problem. However, this PTL is determined by the value of the PMD, so
> you need a pointer to the PMD to determine what the PTL should be (or
> a pointer to the PTL itself).
>
> In hugetlb, we only ever pass around the PTE pointer, and we rely on
> huge_pte_lockptr() to find the PTL for us (and it does so
> appropriately for everything except 4K PTEs). We would need to add the
> complexity of passing around a PMD or PTL everywhere, and that's what
> hugetlb_pte does for us. So that complexity is basically unavoidable,
> unless you're ok with 4K PTEs with taking mm->page_table_lock (I'm
> not).
>
> >
> > > Google upstreamed the "TDP MMU" for KVM/x86 that removed the need to take
> > > the MMU lock for writing in the EPT violation path. We found that this
> > > change is required for VMs >200 or so vCPUs to consistently avoid CPU
> > > soft lockups in the guest.
> >
> > After the kvm mmu rwlock convertion, it'll allow concurrent page faults
> > even if only 4K pages are used, so it seems not directly relevant to what
> > we're discussing here, no?
>
> Right. I was just bringing it up to say that if 4K PTLs were
> mm->page_table_lock, we would have a problem.

Ah I see what you meant. We definitely don't want to use the
page_table_lock for sure.

So if it's about keeping hugetlb_pte I'm fine with it, no matter what the
final version will look like.

>
> >
> > >
> > > Requiring each UFFDIO_CONTINUE (in the post-copy path) to serialize on
> > > the same PTL would be problematic in the same way.
> >
> > Pte-level pgtable lock only covers 2M range, so I think it depends on which
> > is the address that the vcpu is faulted on? IIUC the major case should be
> > that the faulted threads are not falling upon the same 2M range.
>
> Right. I think my comment should make more sense with the above clarification.
>
> >
> > >
> > > >
> > > > AFAIU 4K-only solution should only reduce any lock contention because locks
> > > > will always be pte-level if VM_HUGETLB_HGM set. When walking and creating
> > > > the intermediate pgtable entries we can use atomic ops just like generic
> > > > mm, so no lock needed at all. With uncertainty on the size of mappings,
> > > > we'll need to take any of the multiple layers of locks.
> > > >
> > >
> > > Other than taking the HugeTLB VMA lock for reading, walking/allocating
> > > page tables won't need any additional locking.
> >
> > Actually when revisiting the locks I'm getting a bit confused on whether
> > the vma lock is needed if pmd sharing is anyway forbidden for HGM. I
> > raised a question in the other patch of MADV_COLLAPSE, maybe they're
> > related questions so we can keep it there.
>
> We can discuss there. :) I take both the VMA lock and mapping lock so
> that it can stay in sync with huge_pmd_unshare(), and so HGM walks
> have the same synchronization as regular hugetlb PT walks.

Sure. :)

Now, after a second thought, I don't think it's unsafe to take the vma
write lock here, especially for VM_SHARED. I can't think of anything that
will go wrong. That's because we need the vma lock anywhere we walk the
pgtables while holding only the mmap_sem for read, I think, out of concern
that pmd sharing may be possible.

But I'm not sure whether this is the cleanest way to do it.

IMHO the major special part of hugetlb compared to generic mm is pgtable
thread safety. I worry that complicating this lock can potentially make
the hugetlb code even more special, which is not good for the long term if
we still hope to merge more hugetlb code with the generic paths.

Since pmd sharing is impossible for HGM, the original vma lock is not
needed here. Meanwhile, what we want to guard against is the pgtable
walkers. They're logically protected by either the mmap lock or the
mapping lock (for rmap walkers). Fast-gup is another thing, but so far I
think it's all safe as long as the mmu gather facilities are followed.

Somehow I have a feeling that the hugetlb vma lock (with the pgtable
sharing explorations in the hugetlb world still ongoing..) may keep
evolving in the future, and it should be helpful to keep its semantics
simple too.

So to summarize: I wonder whether we can use mmap write lock and
i_mmap_rwsem write lock to protect collapsing for hugetlb, just like what
we do with THP collapsing (after Jann's fix).

Reusing madvise_need_mmap_write() is not easily feasible because it runs
before the vma scanning, so we can't take the write lock conditionally only
for hugetlb; but that's the next question to ask, and only if we can first
reach a consensus on the lock scheme for HGM in general.
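
For reference, roughly what I mean (collapse_hugetlb_range() and the
wrapper name are placeholders, and as said above madvise() itself would
really be the one taking the mmap lock):

static int hugetlb_collapse_sketch(struct vm_area_struct *vma,
				   unsigned long start, unsigned long end)
{
	struct mm_struct *mm = vma->vm_mm;
	int ret;

	mmap_write_lock(mm);				/* excludes pgtable walkers holding mmap read */
	i_mmap_lock_write(vma->vm_file->f_mapping);	/* excludes rmap walkers */

	ret = collapse_hugetlb_range(vma, start, end);	/* placeholder: collapse + free HGM pgtables */

	i_mmap_unlock_write(vma->vm_file->f_mapping);
	mmap_write_unlock(mm);
	return ret;
}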

>
> >
> > >
> > > We take the PTL to allocate the next level down, but so does generic
> > > mm (look at __pud_alloc, __pmd_alloc for example). Maybe I am
> > > misunderstanding.
> >
> > Sorry you're right, please ignore that. I don't know why I had that
> > impression that spinlocks are not needed in that process.
> >
> > Actually I am also curious why atomics won't work (by holding mmap read
> > lock, then do cmpxchg(old_entry=0, new_entry) upon the pgtable entries). I
> > think it's possible I just missed something else.
>
> I think there are cases where we need to make sure the value of a PTE
> isn't going to change from under us while we're doing some kind of
> other operation, and so a compare-and-swap won't really be what we
> need.

Currently the pgtable spinlock is only taken while populating the
pgtables. If that can happen, then it can also happen right after we
release the spinlock in e.g. __pmd_alloc().

One thing I can think of is that we need more done than just installing
the pgtable entries, in which case atomics will stop working. E.g. on
x86 we have paravirt_alloc_pmd(). But I'm not sure whether that's the only
reason.

--
Peter Xu

2023-01-23 15:22:17

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

Hi, Mike,

On Thu, Jan 19, 2023 at 03:44:25PM -0800, Mike Kravetz wrote:
> On 01/19/23 18:07, Peter Xu wrote:
> >
> > Actually when revisiting the locks I'm getting a bit confused on whether
> > the vma lock is needed if pmd sharing is anyway forbidden for HGM. I
> > raised a question in the other patch of MADV_COLLAPSE, maybe they're
> > related questions so we can keep it there.
>
> I can quickly answer that. Yes. The vma lock is also being used for
> fault/truncation synchronization. Commit e700898fa075 make sure it is
> even used on architectures that do not support PMD sharing.
>
> I had come up with a rather ugly method of using the fault mutex for
> fault/truncation synchronization, but using the vma lock was more
> elegant.

Thanks for answering, I'll need to read some more on truncation later.
Before that, since COLLAPSE will already require the i_mmap_rwsem write
lock, does that mean it is naturally race-free against truncation even
without the vma lock?

--
Peter Xu


2023-01-23 17:51:01

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On 01/23/23 10:19, Peter Xu wrote:
> Hi, Mike,
>
> On Thu, Jan 19, 2023 at 03:44:25PM -0800, Mike Kravetz wrote:
> > On 01/19/23 18:07, Peter Xu wrote:
> > >
> > > Actually when revisiting the locks I'm getting a bit confused on whether
> > > the vma lock is needed if pmd sharing is anyway forbidden for HGM. I
> > > raised a question in the other patch of MADV_COLLAPSE, maybe they're
> > > related questions so we can keep it there.
> >
> > I can quickly answer that. Yes. The vma lock is also being used for
> > fault/truncation synchronization. Commit e700898fa075 make sure it is
> > even used on architectures that do not support PMD sharing.
> >
> > I had come up with a rather ugly method of using the fault mutex for
> > fault/truncation synchronization, but using the vma lock was more
> > elegant.
>
> Thanks for answering, I'll need to read some more on truncation later.
> Before that, since COLLAPSE will already require the i_mmap_rwsem write
> lock already, does it mean it is naturally race-free against truncation
> even without vma lock?

Yes, and thanks for making me take a closer look at COLLAPSE. :)

--
Mike Kravetz

2023-01-26 16:59:33

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Thu, Jan 19, 2023 at 11:42 AM James Houghton <[email protected]> wrote:
>
> On Thu, Jan 19, 2023 at 9:32 AM Mike Kravetz <[email protected]> wrote:
> >
> > On 01/19/23 08:57, James Houghton wrote:
> > > FWIW, what makes the most sense to me right now is to implement the
> > > THP-like scheme and mark HGM as mutually exclusive with the vmemmap
> > > optimization. We can later come up with a scheme that lets us retain
> > > compatibility. (Is that what you mean by "this can be done somewhat
> > > independently", Mike?)
> >
> > Sort of, I was only saying that getting the ref/map counting right seems
> > like a task that can be independently worked. Using the THP-like scheme
> > is good.
>
> Ok! So if you're ok with the intermediate mapping sizes, it sounds
> like I should go ahead and implement the THP-like scheme.

It turns out that the THP-like scheme significantly slows down
MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes
the vast majority of the time spent in MADV_COLLAPSE when collapsing
1G mappings. It is doing 262k atomic decrements, so this makes sense.

This is only really a problem because this is done between
mmu_notifier_invalidate_range_start() and
mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to
access any of the 1G page while we're doing this (and it can take like
~1 second for each 1G, at least on the x86 server I was testing on).
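
(That 262k is just the number of 4K subpages in a 1G page: 2^30 / 2^12 =
262,144, i.e. one atomic decrement per subpage.)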

- James

2023-01-26 20:31:43

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

James,

On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote:
> It turns out that the THP-like scheme significantly slows down
> MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes
> the vast majority of the time spent in MADV_COLLAPSE when collapsing
> 1G mappings. It is doing 262k atomic decrements, so this makes sense.
>
> This is only really a problem because this is done between
> mmu_notifier_invalidate_range_start() and
> mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to
> access any of the 1G page while we're doing this (and it can take like
> ~1 second for each 1G, at least on the x86 server I was testing on).

Did you try to measure the time, or it's a quick observation from perf?

IIRC I used to measure some atomic ops, it is not as drastic as I thought.
But maybe it depends on many things.

I'm curious how the 1sec is provisioned between the procedures. E.g., I
would expect mmu_notifier_invalidate_range_start() to also take some time
too as it should walk the smally mapped EPT pgtables.

Since we'll still keep the intermediate levels around - from application
POV, one other thing to remedy this is further shrink the size of COLLAPSE
so potentially for a very large page we can start with building 2M layers.
But then collapse will need to be run at least two rounds.

--
Peter Xu


2023-01-27 21:02:46

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Thu, Jan 26, 2023 at 12:31 PM Peter Xu <[email protected]> wrote:
>
> James,
>
> On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote:
> > It turns out that the THP-like scheme significantly slows down
> > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes
> > the vast majority of the time spent in MADV_COLLAPSE when collapsing
> > 1G mappings. It is doing 262k atomic decrements, so this makes sense.
> >
> > This is only really a problem because this is done between
> > mmu_notifier_invalidate_range_start() and
> > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to
> > access any of the 1G page while we're doing this (and it can take like
> > ~1 second for each 1G, at least on the x86 server I was testing on).
>
> Did you try to measure the time, or it's a quick observation from perf?

I put some ktime_get()s in.

>
> IIRC I used to measure some atomic ops, it is not as drastic as I thought.
> But maybe it depends on many things.
>
> I'm curious how the 1sec is provisioned between the procedures. E.g., I
> would expect mmu_notifier_invalidate_range_start() to also take some time
> too as it should walk the smally mapped EPT pgtables.

Somehow this doesn't take all that long (only like 10-30ms when
collapsing from 4K -> 1G) compared to hugetlb_collapse().

>
> Since we'll still keep the intermediate levels around - from application
> POV, one other thing to remedy this is further shrink the size of COLLAPSE
> so potentially for a very large page we can start with building 2M layers.
> But then collapse will need to be run at least two rounds.

That's exactly what I thought to do. :) I realized, too, that this is
actually how userspace *should* collapse things to avoid holding up
vCPUs too long. I think this is a good reason to keep intermediate
page sizes.

When collapsing 4K -> 1G, the mapcount scheme doesn't actually make a
huge difference: the THP-like scheme is about 30% slower overall.

When collapsing 4K -> 2M -> 1G, the mapcount scheme makes a HUGE
difference. For the THP-like scheme, collapsing 4K -> 2M requires
decrementing and then re-incrementing subpage->_mapcount, and then
from 2M -> 1G, we have to decrement all 262k subpages->_mapcount. For
the head-only scheme, for each 2M in the 4K -> 2M collapse, we
decrement the compound_mapcount 512 times (once per PTE), then
increment it once. And then for 2M -> 1G, for each 1G, we decrement
mapcount again by 512 (once per PMD), incrementing it once.
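
Putting rough numbers on just the 2M -> 1G step (derived from the
description above, not new measurements): the THP-like scheme touches
every subpage, i.e. 2^30 / 2^12 = 262,144 atomic decrements per 1G, while
the head-only scheme does 512 decrements plus 1 increment = 513 atomic
ops per 1G -- roughly 500x fewer atomic operations for that step.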

The mapcount decrements are about on par with how long it takes to do
other things, like updating page tables. The main problem is, with the
THP-like scheme (implemented like this [1]), there isn't a way to
avoid the 262k decrements when collapsing 1G. So if we want
MADV_COLLAPSE to be fast and we want a THP-like page_mapcount() API,
then I think something more clever needs to be implemented.

[1]: https://github.com/48ca/linux/blob/hgmv2-jan24/mm/hugetlb.c#L127-L178


- James

2023-01-30 17:30:44

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> On Thu, Jan 26, 2023 at 12:31 PM Peter Xu <[email protected]> wrote:
> >
> > James,
> >
> > On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote:
> > > It turns out that the THP-like scheme significantly slows down
> > > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes
> > > the vast majority of the time spent in MADV_COLLAPSE when collapsing
> > > 1G mappings. It is doing 262k atomic decrements, so this makes sense.
> > >
> > > This is only really a problem because this is done between
> > > mmu_notifier_invalidate_range_start() and
> > > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to
> > > access any of the 1G page while we're doing this (and it can take like
> > > ~1 second for each 1G, at least on the x86 server I was testing on).
> >
> > Did you try to measure the time, or it's a quick observation from perf?
>
> I put some ktime_get()s in.
>
> >
> > IIRC I used to measure some atomic ops, it is not as drastic as I thought.
> > But maybe it depends on many things.
> >
> > I'm curious how the 1sec is provisioned between the procedures. E.g., I
> > would expect mmu_notifier_invalidate_range_start() to also take some time
> > too as it should walk the smally mapped EPT pgtables.
>
> Somehow this doesn't take all that long (only like 10-30ms when
> collapsing from 4K -> 1G) compared to hugetlb_collapse().

Did you populate the EPT pgtable as much as possible when measuring this?

IIUC this number should be pretty much relevant to how many pages are
shadowed in the kvm pgtables. If the EPT table is mostly empty it should
be super fast, but OTOH it can be much slower when it's populated,
because the tdp mmu needs to handle the pgtable leaves one by one.

E.g. it should be fully populated if you have a program busy dirtying most
of the guest pages during test migration.

Write op should be the worst case here since it'll require the atomic op
being applied; see kvm_tdp_mmu_write_spte().

>
> >
> > Since we'll still keep the intermediate levels around - from application
> > POV, one other thing to remedy this is further shrink the size of COLLAPSE
> > so potentially for a very large page we can start with building 2M layers.
> > But then collapse will need to be run at least two rounds.
>
> That's exactly what I thought to do. :) I realized, too, that this is
> actually how userspace *should* collapse things to avoid holding up
> vCPUs too long. I think this is a good reason to keep intermediate
> page sizes.
>
> When collapsing 4K -> 1G, the mapcount scheme doesn't actually make a
> huge difference: the THP-like scheme is about 30% slower overall.
>
> When collapsing 4K -> 2M -> 1G, the mapcount scheme makes a HUGE
> difference. For the THP-like scheme, collapsing 4K -> 2M requires
> decrementing and then re-incrementing subpage->_mapcount, and then
> from 2M -> 1G, we have to decrement all 262k subpages->_mapcount. For
> the head-only scheme, for each 2M in the 4K -> 2M collapse, we
> decrement the compound_mapcount 512 times (once per PTE), then
> increment it once. And then for 2M -> 1G, for each 1G, we decrement
> mapcount again by 512 (once per PMD), incrementing it once.

Did you have quantified numbers (with your ktime tweak) to compare these?
If we want to go the other route, I think these will be materials to
justify any other approach on mapcount handling.

>
> The mapcount decrements are about on par with how long it takes to do
> other things, like updating page tables. The main problem is, with the
> THP-like scheme (implemented like this [1]), there isn't a way to
> avoid the 262k decrements when collapsing 1G. So if we want
> MADV_COLLAPSE to be fast and we want a THP-like page_mapcount() API,
> then I think something more clever needs to be implemented.
>
> [1]: https://github.com/48ca/linux/blob/hgmv2-jan24/mm/hugetlb.c#L127-L178

I believe the whole goal of HGM is trying to face the same challenge we'd
have if we allowed 1G THPs to exist and be splittable for anon.

I don't remember whether we discussed below, maybe we did? Anyway...

Another way to not use the thp mapcount, nor break smaps and similar calls
to page_mapcount() on small pages, is to increase the hpage mapcount only
when the hstate pXd entry (for 1G, the PUD) is populated (no matter whether
as a leaf or a non-leaf), and to decrease the mapcount when the pXd entry
is removed (for a leaf, that's the same as now; for HGM, it's when freeing
the pgtable under the PUD entry).
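
A very rough sketch of the idea (the mapcount helpers are made up for
illustration, not code from the series):

/*
 * Illustration only: the hugepage's mapcount follows the hstate-level
 * entry (PUD for 1G), regardless of whether that entry is a leaf or
 * points down into HGM page tables.
 */
static void hgm_hstate_entry_populate(struct page *hpage, pud_t *pudp,
				      pud_t entry)
{
	if (pud_none(*pudp))
		hugetlb_mapcount_inc(hpage);	/* made-up helper: one count per pXd entry */
	set_pud(pudp, entry);			/* leaf, or pointer to lower HGM levels */
}

static void hgm_hstate_entry_clear(struct page *hpage, pud_t *pudp)
{
	pud_clear(pudp);			/* leaf unmap, or freeing the HGM pgtable */
	hugetlb_mapcount_dec(hpage);		/* made-up helper */
}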

Again, in all cases I think some solid measurements would definitely be
helpful (as commented above) to see how much overhead will there be and
whether that'll start to become a problem at least for the current
motivations of the whole HGM idea.

Thanks,

--
Peter Xu


2023-01-30 18:40:22

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <[email protected]> wrote:
>
> On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> > On Thu, Jan 26, 2023 at 12:31 PM Peter Xu <[email protected]> wrote:
> > >
> > > James,
> > >
> > > On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote:
> > > > It turns out that the THP-like scheme significantly slows down
> > > > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes
> > > > the vast majority of the time spent in MADV_COLLAPSE when collapsing
> > > > 1G mappings. It is doing 262k atomic decrements, so this makes sense.
> > > >
> > > > This is only really a problem because this is done between
> > > > mmu_notifier_invalidate_range_start() and
> > > > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to
> > > > access any of the 1G page while we're doing this (and it can take like
> > > > ~1 second for each 1G, at least on the x86 server I was testing on).
> > >
> > > Did you try to measure the time, or it's a quick observation from perf?
> >
> > I put some ktime_get()s in.
> >
> > >
> > > IIRC I used to measure some atomic ops, it is not as drastic as I thought.
> > > But maybe it depends on many things.
> > >
> > > I'm curious how the 1sec is provisioned between the procedures. E.g., I
> > > would expect mmu_notifier_invalidate_range_start() to also take some time
> > > too as it should walk the smally mapped EPT pgtables.
> >
> > Somehow this doesn't take all that long (only like 10-30ms when
> > collapsing from 4K -> 1G) compared to hugetlb_collapse().
>
> Did you populate as much the EPT pgtable when measuring this?
>
> IIUC this number should be pretty much relevant to how many pages are
> shadowed to the kvm pgtables. If the EPT table is mostly empty it should
> be super fast, but OTOH it can be much slower if when it's populated,
> because tdp mmu should need to handle the pgtable leaves one by one.
>
> E.g. it should be fully populated if you have a program busy dirtying most
> of the guest pages during test migration.

That's what I was doing. I was running a workload in the guest that
just writes 8 bytes to a page and jumps ahead a few pages on all
vCPUs, touching most of its memory.

But there is more to understand; I'll collect more results. I'm not
sure why the EPT can be unmapped/collapsed so quickly.

>
> Write op should be the worst here case since it'll require the atomic op
> being applied; see kvm_tdp_mmu_write_spte().
>
> >
> > >
> > > Since we'll still keep the intermediate levels around - from application
> > > POV, one other thing to remedy this is further shrink the size of COLLAPSE
> > > so potentially for a very large page we can start with building 2M layers.
> > > But then collapse will need to be run at least two rounds.
> >
> > That's exactly what I thought to do. :) I realized, too, that this is
> > actually how userspace *should* collapse things to avoid holding up
> > vCPUs too long. I think this is a good reason to keep intermediate
> > page sizes.
> >
> > When collapsing 4K -> 1G, the mapcount scheme doesn't actually make a
> > huge difference: the THP-like scheme is about 30% slower overall.
> >
> > When collapsing 4K -> 2M -> 1G, the mapcount scheme makes a HUGE
> > difference. For the THP-like scheme, collapsing 4K -> 2M requires
> > decrementing and then re-incrementing subpage->_mapcount, and then
> > from 2M -> 1G, we have to decrement all 262k subpages->_mapcount. For
> > the head-only scheme, for each 2M in the 4K -> 2M collapse, we
> > decrement the compound_mapcount 512 times (once per PTE), then
> > increment it once. And then for 2M -> 1G, for each 1G, we decrement
> > mapcount again by 512 (once per PMD), incrementing it once.
>
> Did you have quantified numbers (with your ktime treak) to compare these?
> If we want to go the other route, I think these will be materials to
> justify any other approach on mapcount handling.

Ok, I can do that. Give me a couple of days to collect more results and
organize them in a helpful way.

(If it's helpful at all, here are some results I collected last week:
[2]. Please ignore it if it's not helpful.)

>
> >
> > The mapcount decrements are about on par with how long it takes to do
> > other things, like updating page tables. The main problem is, with the
> > THP-like scheme (implemented like this [1]), there isn't a way to
> > avoid the 262k decrements when collapsing 1G. So if we want
> > MADV_COLLAPSE to be fast and we want a THP-like page_mapcount() API,
> > then I think something more clever needs to be implemented.
> >
> > [1]: https://github.com/48ca/linux/blob/hgmv2-jan24/mm/hugetlb.c#L127-L178
>
> I believe the whole goal of HGM is trying to face the same challenge if
> we'll allow 1G THP exist and being able to split for anon.
>
> I don't remember whether we discussed below, maybe we did? Anyway...
>
> Another way to not use thp mapcount, nor break smaps and similar calls to
> page_mapcount() on small page, is to only increase the hpage mapcount only
> when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> entry is removed (for leaf, it's the same as for now; for HGM, it's when
> freeing pgtable of the PUD entry).

Right, and this is doable. Also it seems like this is pretty close to
the direction Matthew Wilcox wants to go with THPs.

Something I noticed though, from the implementation of
folio_referenced()/folio_referenced_one(), is that folio_mapcount()
ought to report the total number of PTEs that are pointing on the page
(or the number of times page_vma_mapped_walk returns true). FWIW,
folio_referenced() is never called for hugetlb folios.

>
> Again, in all cases I think some solid measurements would definitely be
> helpful (as commented above) to see how much overhead will there be and
> whether that'll start to become a problem at least for the current
> motivations of the whole HGM idea.
>
> Thanks,
>
> --
> Peter Xu
>

Thanks, Peter!

[2]: https://pastebin.com/raw/DVfNFi2m

- James

2023-01-30 21:15:31

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <[email protected]> wrote:
> >
> > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> > > On Thu, Jan 26, 2023 at 12:31 PM Peter Xu <[email protected]> wrote:
> > > >
> > > > James,
> > > >
> > > > On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote:
> > > > > It turns out that the THP-like scheme significantly slows down
> > > > > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes
> > > > > the vast majority of the time spent in MADV_COLLAPSE when collapsing
> > > > > 1G mappings. It is doing 262k atomic decrements, so this makes sense.
> > > > >
> > > > > This is only really a problem because this is done between
> > > > > mmu_notifier_invalidate_range_start() and
> > > > > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to
> > > > > access any of the 1G page while we're doing this (and it can take like
> > > > > ~1 second for each 1G, at least on the x86 server I was testing on).
> > > >
> > > > Did you try to measure the time, or it's a quick observation from perf?
> > >
> > > I put some ktime_get()s in.
> > >
> > > >
> > > > IIRC I used to measure some atomic ops, it is not as drastic as I thought.
> > > > But maybe it depends on many things.
> > > >
> > > > I'm curious how the 1sec is provisioned between the procedures. E.g., I
> > > > would expect mmu_notifier_invalidate_range_start() to also take some time
> > > > too as it should walk the smally mapped EPT pgtables.
> > >
> > > Somehow this doesn't take all that long (only like 10-30ms when
> > > collapsing from 4K -> 1G) compared to hugetlb_collapse().
> >
> > Did you populate as much the EPT pgtable when measuring this?
> >
> > IIUC this number should be pretty much relevant to how many pages are
> > shadowed to the kvm pgtables. If the EPT table is mostly empty it should
> > be super fast, but OTOH it can be much slower if when it's populated,
> > because tdp mmu should need to handle the pgtable leaves one by one.
> >
> > E.g. it should be fully populated if you have a program busy dirtying most
> > of the guest pages during test migration.
>
> That's what I was doing. I was running a workload in the guest that
> just writes 8 bytes to a page and jumps ahead a few pages on all
> vCPUs, touching most of its memory.
>
> But there is more to understand; I'll collect more results. I'm not
> sure why the EPT can be unmapped/collapsed so quickly.

Maybe something smart done by the hypervisor?

>
> >
> > Write op should be the worst here case since it'll require the atomic op
> > being applied; see kvm_tdp_mmu_write_spte().
> >
> > >
> > > >
> > > > Since we'll still keep the intermediate levels around - from application
> > > > POV, one other thing to remedy this is further shrink the size of COLLAPSE
> > > > so potentially for a very large page we can start with building 2M layers.
> > > > But then collapse will need to be run at least two rounds.
> > >
> > > That's exactly what I thought to do. :) I realized, too, that this is
> > > actually how userspace *should* collapse things to avoid holding up
> > > vCPUs too long. I think this is a good reason to keep intermediate
> > > page sizes.
> > >
> > > When collapsing 4K -> 1G, the mapcount scheme doesn't actually make a
> > > huge difference: the THP-like scheme is about 30% slower overall.
> > >
> > > When collapsing 4K -> 2M -> 1G, the mapcount scheme makes a HUGE
> > > difference. For the THP-like scheme, collapsing 4K -> 2M requires
> > > decrementing and then re-incrementing subpage->_mapcount, and then
> > > from 2M -> 1G, we have to decrement all 262k subpages->_mapcount. For
> > > the head-only scheme, for each 2M in the 4K -> 2M collapse, we
> > > decrement the compound_mapcount 512 times (once per PTE), then
> > > increment it once. And then for 2M -> 1G, for each 1G, we decrement
> > > mapcount again by 512 (once per PMD), incrementing it once.
> >
> > Did you have quantified numbers (with your ktime treak) to compare these?
> > If we want to go the other route, I think these will be materials to
> > justify any other approach on mapcount handling.
>
> Ok, I can do that. GIve me a couple days to collect more results and
> organize them in a helpful way.
>
> (If it's helpful at all, here are some results I collected last week:
> [2]. Please ignore it if it's not helpful.)

It's helpful already at least to me, thanks. Yes the change is drastic.

>
> >
> > >
> > > The mapcount decrements are about on par with how long it takes to do
> > > other things, like updating page tables. The main problem is, with the
> > > THP-like scheme (implemented like this [1]), there isn't a way to
> > > avoid the 262k decrements when collapsing 1G. So if we want
> > > MADV_COLLAPSE to be fast and we want a THP-like page_mapcount() API,
> > > then I think something more clever needs to be implemented.
> > >
> > > [1]: https://github.com/48ca/linux/blob/hgmv2-jan24/mm/hugetlb.c#L127-L178
> >
> > I believe the whole goal of HGM is trying to face the same challenge if
> > we'll allow 1G THP exist and being able to split for anon.
> >
> > I don't remember whether we discussed below, maybe we did? Anyway...
> >
> > Another way to not use thp mapcount, nor break smaps and similar calls to
> > page_mapcount() on small page, is to only increase the hpage mapcount only
> > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> > as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> > entry is removed (for leaf, it's the same as for now; for HGM, it's when
> > freeing pgtable of the PUD entry).
>
> Right, and this is doable. Also it seems like this is pretty close to
> the direction Matthew Wilcox wants to go with THPs.

I may not be familiar with it, do you mean this one?

https://lore.kernel.org/all/Y9Afwds%[email protected]/

For hugetlb I think it should be easier to maintain than for any-sized
folios, because there's the pgtable non-leaf entry to track rmap
information and the folio size is static at the hpage size.

It'll be different for folios, where it can be a randomly sized chunk of
pages, so it needs to be managed by batching the ptes on install/zap.

>
> Something I noticed though, from the implementation of
> folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> ought to report the total number of PTEs that are pointing on the page
> (or the number of times page_vma_mapped_walk returns true). FWIW,
> folio_referenced() is never called for hugetlb folios.

FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
it'll walk every leaf page being mapped, big or small, so IIUC that number
should match with what it expects to see later, more or less.

But I agree the mapcount/referenced value itself is debatable to me, just
like what you raised in the other thread on page migration. Meanwhile, I
am not certain whether the mapcount is accurate either, because AFAICT the
mapcount can be modified if e.g. a new page mapping is established before
taking the page lock later in folio_referenced().

It's just that I don't see any severe issue due to any of the above, as
long as that information is only used as a hint for next steps, e.g.,
which page to swap out.

--
Peter Xu


2023-02-01 00:24:57

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <[email protected]> wrote:
>
> On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <[email protected]> wrote:
> > >
> > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> > > > On Thu, Jan 26, 2023 at 12:31 PM Peter Xu <[email protected]> wrote:
> > > > >
> > > > > James,
> > > > >
> > > > > On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote:
> > > > > > It turns out that the THP-like scheme significantly slows down
> > > > > > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes
> > > > > > the vast majority of the time spent in MADV_COLLAPSE when collapsing
> > > > > > 1G mappings. It is doing 262k atomic decrements, so this makes sense.
> > > > > >
> > > > > > This is only really a problem because this is done between
> > > > > > mmu_notifier_invalidate_range_start() and
> > > > > > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to
> > > > > > access any of the 1G page while we're doing this (and it can take like
> > > > > > ~1 second for each 1G, at least on the x86 server I was testing on).
> > > > >
> > > > > Did you try to measure the time, or it's a quick observation from perf?
> > > >
> > > > I put some ktime_get()s in.
> > > >
> > > > >
> > > > > IIRC I used to measure some atomic ops, it is not as drastic as I thought.
> > > > > But maybe it depends on many things.
> > > > >
> > > > > I'm curious how the 1sec is provisioned between the procedures. E.g., I
> > > > > would expect mmu_notifier_invalidate_range_start() to also take some time
> > > > > too as it should walk the smally mapped EPT pgtables.
> > > >
> > > > Somehow this doesn't take all that long (only like 10-30ms when
> > > > collapsing from 4K -> 1G) compared to hugetlb_collapse().
> > >
> > > Did you populate as much the EPT pgtable when measuring this?
> > >
> > > IIUC this number should be pretty much relevant to how many pages are
> > > shadowed to the kvm pgtables. If the EPT table is mostly empty it should
> > > be super fast, but OTOH it can be much slower if when it's populated,
> > > because tdp mmu should need to handle the pgtable leaves one by one.
> > >
> > > E.g. it should be fully populated if you have a program busy dirtying most
> > > of the guest pages during test migration.
> >
> > That's what I was doing. I was running a workload in the guest that
> > just writes 8 bytes to a page and jumps ahead a few pages on all
> > vCPUs, touching most of its memory.
> >
> > But there is more to understand; I'll collect more results. I'm not
> > sure why the EPT can be unmapped/collapsed so quickly.
>
> Maybe something smart done by the hypervisor?

Doing a little bit more digging, it looks like the
invalidate_range_start notifier clears the sptes, and then later on
(on the next EPT violation) the page tables are freed. I still need
to look at how they end up being so much faster, but I thought
that was interesting.

>
> >
> > >
> > > Write op should be the worst here case since it'll require the atomic op
> > > being applied; see kvm_tdp_mmu_write_spte().
> > >
> > > >
> > > > >
> > > > > Since we'll still keep the intermediate levels around - from application
> > > > > POV, one other thing to remedy this is further shrink the size of COLLAPSE
> > > > > so potentially for a very large page we can start with building 2M layers.
> > > > > But then collapse will need to be run at least two rounds.
> > > >
> > > > That's exactly what I thought to do. :) I realized, too, that this is
> > > > actually how userspace *should* collapse things to avoid holding up
> > > > vCPUs too long. I think this is a good reason to keep intermediate
> > > > page sizes.
> > > >
> > > > When collapsing 4K -> 1G, the mapcount scheme doesn't actually make a
> > > > huge difference: the THP-like scheme is about 30% slower overall.
> > > >
> > > > When collapsing 4K -> 2M -> 1G, the mapcount scheme makes a HUGE
> > > > difference. For the THP-like scheme, collapsing 4K -> 2M requires
> > > > decrementing and then re-incrementing subpage->_mapcount, and then
> > > > from 2M -> 1G, we have to decrement all 262k subpages->_mapcount. For
> > > > the head-only scheme, for each 2M in the 4K -> 2M collapse, we
> > > > decrement the compound_mapcount 512 times (once per PTE), then
> > > > increment it once. And then for 2M -> 1G, for each 1G, we decrement
> > > > mapcount again by 512 (once per PMD), incrementing it once.
> > >
> > > Did you have quantified numbers (with your ktime treak) to compare these?
> > > If we want to go the other route, I think these will be materials to
> > > justify any other approach on mapcount handling.
> >
> > Ok, I can do that. GIve me a couple days to collect more results and
> > organize them in a helpful way.
> >
> > (If it's helpful at all, here are some results I collected last week:
> > [2]. Please ignore it if it's not helpful.)
>
> It's helpful already at least to me, thanks. Yes the change is drastic.

That data only contains THP-like mapcount performance, not performance
for the head-only way. But the head-only scheme makes the 2M -> 1G step
very good (the "56" comes down to about the same as everything else,
instead of being ~100-500x bigger).

>
> >
> > >
> > > >
> > > > The mapcount decrements are about on par with how long it takes to do
> > > > other things, like updating page tables. The main problem is, with the
> > > > THP-like scheme (implemented like this [1]), there isn't a way to
> > > > avoid the 262k decrements when collapsing 1G. So if we want
> > > > MADV_COLLAPSE to be fast and we want a THP-like page_mapcount() API,
> > > > then I think something more clever needs to be implemented.
> > > >
> > > > [1]: https://github.com/48ca/linux/blob/hgmv2-jan24/mm/hugetlb.c#L127-L178
> > >
> > > I believe the whole goal of HGM is trying to face the same challenge if
> > > we'll allow 1G THP exist and being able to split for anon.
> > >
> > > I don't remember whether we discussed below, maybe we did? Anyway...
> > >
> > > Another way to not use thp mapcount, nor break smaps and similar calls to
> > > page_mapcount() on small page, is to only increase the hpage mapcount only
> > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> > > entry is removed (for leaf, it's the same as for now; for HGM, it's when
> > > freeing pgtable of the PUD entry).
> >
> > Right, and this is doable. Also it seems like this is pretty close to
> > the direction Matthew Wilcox wants to go with THPs.
>
> I may not be familiar with it, do you mean this one?
>
> https://lore.kernel.org/all/Y9Afwds%[email protected]/

Yep that's it.

>
> For hugetlb I think it should be easier to maintain rather than any-sized
> folios, because there's the pgtable non-leaf entry to track rmap
> information and the folio size being static to hpage size.
>
> It'll be different to folios where it can be random sized pages chunk, so
> it needs to be managed by batching the ptes when install/zap.

Agreed. It's probably easier for HugeTLB because they're always
"naturally aligned" and yeah they can't change sizes.

>
> >
> > Something I noticed though, from the implementation of
> > folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> > ought to report the total number of PTEs that are pointing on the page
> > (or the number of times page_vma_mapped_walk returns true). FWIW,
> > folio_referenced() is never called for hugetlb folios.
>
> FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
> it'll walk every leaf page being mapped, big or small, so IIUC that number
> should match with what it expects to see later, more or less.

I don't fully understand what you mean here.

>
> But I agree the mapcount/referenced value itself is debatable to me, just
> like what you raised in the other thread on page migration. Meanwhile, I
> am not certain whether the mapcount is accurate either because AFAICT the
> mapcount can be modified if e.g. new page mapping established as long as
> before taking the page lock later in folio_referenced().
>
> It's just that I don't see any severe issue either due to any of above, as
> long as that information is only used as a hint for next steps, e.g., to
> swap which page out.

I also don't see a big problem with folio_referenced() (and you're
right that folio_mapcount() can be stale by the time it takes the
folio lock). It still seems like folio_mapcount() should return the
total number of PTEs that map the page though. Are you saying that
breaking this would be ok?

2023-02-01 01:25:34

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote:
> On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <[email protected]> wrote:
> >
> > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <[email protected]> wrote:
> > > >
> > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> > > > > On Thu, Jan 26, 2023 at 12:31 PM Peter Xu <[email protected]> wrote:
> > > > > >
> > > > > > James,
> > > > > >
> > > > > > On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote:
> > > > > > > It turns out that the THP-like scheme significantly slows down
> > > > > > > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes
> > > > > > > the vast majority of the time spent in MADV_COLLAPSE when collapsing
> > > > > > > 1G mappings. It is doing 262k atomic decrements, so this makes sense.
> > > > > > >
> > > > > > > This is only really a problem because this is done between
> > > > > > > mmu_notifier_invalidate_range_start() and
> > > > > > > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to
> > > > > > > access any of the 1G page while we're doing this (and it can take like
> > > > > > > ~1 second for each 1G, at least on the x86 server I was testing on).
> > > > > >
> > > > > > Did you try to measure the time, or it's a quick observation from perf?
> > > > >
> > > > > I put some ktime_get()s in.
> > > > >
> > > > > >
> > > > > > IIRC I used to measure some atomic ops, it is not as drastic as I thought.
> > > > > > But maybe it depends on many things.
> > > > > >
> > > > > > I'm curious how the 1sec is provisioned between the procedures. E.g., I
> > > > > > would expect mmu_notifier_invalidate_range_start() to also take some time
> > > > > > too as it should walk the smally mapped EPT pgtables.
> > > > >
> > > > > Somehow this doesn't take all that long (only like 10-30ms when
> > > > > collapsing from 4K -> 1G) compared to hugetlb_collapse().
> > > >
> > > > Did you populate as much the EPT pgtable when measuring this?
> > > >
> > > > IIUC this number should be pretty much relevant to how many pages are
> > > > shadowed to the kvm pgtables. If the EPT table is mostly empty it should
> > > > be super fast, but OTOH it can be much slower if when it's populated,
> > > > because tdp mmu should need to handle the pgtable leaves one by one.
> > > >
> > > > E.g. it should be fully populated if you have a program busy dirtying most
> > > > of the guest pages during test migration.
> > >
> > > That's what I was doing. I was running a workload in the guest that
> > > just writes 8 bytes to a page and jumps ahead a few pages on all
> > > vCPUs, touching most of its memory.
> > >
> > > But there is more to understand; I'll collect more results. I'm not
> > > sure why the EPT can be unmapped/collapsed so quickly.
> >
> > Maybe something smart done by the hypervisor?
>
> Doing a little bit more digging, it looks like the
> invalidate_range_start notifier clears the sptes, and then later on
> (on the next EPT violation), the page tables are freed. I still need
> to look at how they end up being so much faster still, but I thought
> that was interesting.
>
> >
> > >
> > > >
> > > > Write op should be the worst here case since it'll require the atomic op
> > > > being applied; see kvm_tdp_mmu_write_spte().
> > > >
> > > > >
> > > > > >
> > > > > > Since we'll still keep the intermediate levels around - from application
> > > > > > POV, one other thing to remedy this is further shrink the size of COLLAPSE
> > > > > > so potentially for a very large page we can start with building 2M layers.
> > > > > > But then collapse will need to be run at least two rounds.
> > > > >
> > > > > That's exactly what I thought to do. :) I realized, too, that this is
> > > > > actually how userspace *should* collapse things to avoid holding up
> > > > > vCPUs too long. I think this is a good reason to keep intermediate
> > > > > page sizes.
> > > > >
> > > > > When collapsing 4K -> 1G, the mapcount scheme doesn't actually make a
> > > > > huge difference: the THP-like scheme is about 30% slower overall.
> > > > >
> > > > > When collapsing 4K -> 2M -> 1G, the mapcount scheme makes a HUGE
> > > > > difference. For the THP-like scheme, collapsing 4K -> 2M requires
> > > > > decrementing and then re-incrementing subpage->_mapcount, and then
> > > > > from 2M -> 1G, we have to decrement all 262k subpages->_mapcount. For
> > > > > the head-only scheme, for each 2M in the 4K -> 2M collapse, we
> > > > > decrement the compound_mapcount 512 times (once per PTE), then
> > > > > increment it once. And then for 2M -> 1G, for each 1G, we decrement
> > > > > mapcount again by 512 (once per PMD), incrementing it once.
> > > >
> > > > Did you have quantified numbers (with your ktime tweak) to compare these?
> > > > If we want to go the other route, I think these will be materials to
> > > > justify any other approach on mapcount handling.
> > >
> > > Ok, I can do that. Give me a couple days to collect more results and
> > > organize them in a helpful way.
> > >
> > > (If it's helpful at all, here are some results I collected last week:
> > > [2]. Please ignore it if it's not helpful.)
> >
> > It's helpful already at least to me, thanks. Yes the change is drastic.
>
> That data only contains THP-like mapcount performance, with no numbers
> for the head-only way. But the head-only scheme makes the 2M -> 1G
> collapse very good (the "56" comes down to about the same as everything
> else, instead of being ~100-500x bigger).

Oops, I think I misread those. Yeah please keep sharing information if you
come up with any.

>
> >
> > >
> > > >
> > > > >
> > > > > The mapcount decrements are about on par with how long it takes to do
> > > > > other things, like updating page tables. The main problem is, with the
> > > > > THP-like scheme (implemented like this [1]), there isn't a way to
> > > > > avoid the 262k decrements when collapsing 1G. So if we want
> > > > > MADV_COLLAPSE to be fast and we want a THP-like page_mapcount() API,
> > > > > then I think something more clever needs to be implemented.
> > > > >
> > > > > [1]: https://github.com/48ca/linux/blob/hgmv2-jan24/mm/hugetlb.c#L127-L178
> > > >
> > > > I believe the whole goal of HGM is trying to face the same challenge if
> > > > we'll allow 1G THP exist and being able to split for anon.
> > > >
> > > > I don't remember whether we discussed below, maybe we did? Anyway...
> > > >
> > > > Another way to not use thp mapcount, nor break smaps and similar calls to
> > > > page_mapcount() on small page, is to only increase the hpage mapcount only
> > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when
> > > > freeing pgtable of the PUD entry).
> > >
> > > Right, and this is doable. Also it seems like this is pretty close to
> > > the direction Matthew Wilcox wants to go with THPs.
> >
> > I may not be familiar with it, do you mean this one?
> >
> > https://lore.kernel.org/all/Y9Afwds%[email protected]/
>
> Yep that's it.
>
> >
> > For hugetlb I think it should be easier to maintain rather than any-sized
> > folios, because there's the pgtable non-leaf entry to track rmap
> > information and the folio size being static to hpage size.
> >
> > It'll be different to folios where it can be random sized pages chunk, so
> > it needs to be managed by batching the ptes when install/zap.
>
> Agreed. It's probably easier for HugeTLB because they're always
> "naturally aligned" and yeah they can't change sizes.
>
> >
> > >
> > > Something I noticed though, from the implementation of
> > > folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> > > ought to report the total number of PTEs that are pointing on the page
> > > (or the number of times page_vma_mapped_walk returns true). FWIW,
> > > folio_referenced() is never called for hugetlb folios.
> >
> > FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
> > it'll walk every leaf page being mapped, big or small, so IIUC that number
> > should match with what it expects to see later, more or less.
>
> I don't fully understand what you mean here.

I meant the rmap_walk pairing with folio_referenced_one() will walk all the
leaves for the folio, big or small. I think that will match the number
with what got returned from folio_mapcount().
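
For illustration only, a heavily simplified sketch of that pairing (this
is not the real mm/rmap.c code: the young/referenced checks are omitted
and only the counting is shown):

static bool folio_referenced_one_sketch(struct folio *folio,
					struct vm_area_struct *vma,
					unsigned long address, void *arg)
{
	struct folio_referenced_arg *pra = arg;
	DEFINE_FOLIO_PVMW(pvmw, folio, vma, address, 0);

	/* One iteration per leaf mapping (PTE or PMD) of the folio. */
	while (page_vma_mapped_walk(&pvmw)) {
		pra->referenced++;
		pra->mapcount--;
	}

	/* Stop the rmap walk once every counted mapping has been seen. */
	return pra->mapcount != 0;
}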

>
> >
> > But I agree the mapcount/referenced value itself is debatable to me, just
> > like what you raised in the other thread on page migration. Meanwhile, I
> > am not certain whether the mapcount is accurate either because AFAICT the
> > mapcount can be modified if e.g. new page mapping established as long as
> > before taking the page lock later in folio_referenced().
> >
> > It's just that I don't see any severe issue either due to any of above, as
> > long as that information is only used as a hint for next steps, e.g., to
> > swap which page out.
>
> I also don't see a big problem with folio_referenced() (and you're
> right that folio_mapcount() can be stale by the time it takes the
> folio lock). It still seems like folio_mapcount() should return the
> total number of PTEs that map the page though. Are you saying that
> breaking this would be ok?

I didn't quite follow - isn't that already doing so?

folio_mapcount() is total_compound_mapcount() here, IIUC it is an
accumulated value of all possible PTEs or PMDs being mapped as long as it's
all or part of the folio being mapped.

--
Peter Xu


2023-02-01 15:46:29

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <[email protected]> wrote:
>
> On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote:
> > On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <[email protected]> wrote:
> > >
> > > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> > > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <[email protected]> wrote:
> > > > >
> > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
[snip]
> > > > > Another way to not use thp mapcount, nor break smaps and similar calls to
> > > > > page_mapcount() on small page, is to only increase the hpage mapcount only
> > > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> > > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> > > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when
> > > > > freeing pgtable of the PUD entry).
> > > >
> > > > Right, and this is doable. Also it seems like this is pretty close to
> > > > the direction Matthew Wilcox wants to go with THPs.
> > >
> > > I may not be familiar with it, do you mean this one?
> > >
> > > https://lore.kernel.org/all/Y9Afwds%[email protected]/
> >
> > Yep that's it.
> >
> > >
> > > For hugetlb I think it should be easier to maintain rather than any-sized
> > > folios, because there's the pgtable non-leaf entry to track rmap
> > > information and the folio size being static to hpage size.
> > >
> > > It'll be different to folios where it can be random sized pages chunk, so
> > > it needs to be managed by batching the ptes when install/zap.
> >
> > Agreed. It's probably easier for HugeTLB because they're always
> > "naturally aligned" and yeah they can't change sizes.
> >
> > >
> > > >
> > > > Something I noticed though, from the implementation of
> > > > folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> > > > ought to report the total number of PTEs that are pointing on the page
> > > > (or the number of times page_vma_mapped_walk returns true). FWIW,
> > > > folio_referenced() is never called for hugetlb folios.
> > >
> > > FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
> > > it'll walk every leaf page being mapped, big or small, so IIUC that number
> > > should match with what it expects to see later, more or less.
> >
> > I don't fully understand what you mean here.
>
> I meant the rmap_walk pairing with folio_referenced_one() will walk all the
> leaves for the folio, big or small. I think that will match the number
> with what got returned from folio_mapcount().

See below.

>
> >
> > >
> > > But I agree the mapcount/referenced value itself is debatable to me, just
> > > like what you raised in the other thread on page migration. Meanwhile, I
> > > am not certain whether the mapcount is accurate either because AFAICT the
> > > mapcount can be modified if e.g. new page mapping established as long as
> > > before taking the page lock later in folio_referenced().
> > >
> > > It's just that I don't see any severe issue either due to any of above, as
> > > long as that information is only used as a hint for next steps, e.g., to
> > > swap which page out.
> >
> > I also don't see a big problem with folio_referenced() (and you're
> > right that folio_mapcount() can be stale by the time it takes the
> > folio lock). It still seems like folio_mapcount() should return the
> > total number of PTEs that map the page though. Are you saying that
> > breaking this would be ok?
>
> I didn't quite follow - isn't that already doing so?
>
> folio_mapcount() is total_compound_mapcount() here, IIUC it is an
> accumulated value of all possible PTEs or PMDs being mapped as long as it's
> all or part of the folio being mapped.

We've talked about 3 ways of handling mapcount:

1. The RFC v2 way, which is head-only, and we increment the compound
mapcount for each PT mapping we have. So a PTE-mapped 2M page,
compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias).
2. The THP-like way. If we are fully mapping the hugetlb page with the
hstate-level PTE, we increment the compound mapcount, otherwise we
increment subpage->_mapcount.
3. The RFC v1 way (the way you have suggested above), which is
head-only, and we increment the compound mapcount if the hstate-level
PTE is made present.

With #1 and #2, there is no concern with folio_mapcount(). But with
#3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA
would yield 1 instead of 512 (right?). That's what I mean.
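
To make that concrete, here is an illustrative pseudo-C sketch (not code
from this series; the enum, the helper name, and the inc_*_mapcount()
wrappers are all made up) of what each scheme increments when one
page-table mapping of a hugetlb folio is installed:

enum hgm_mapcount_scheme { RFC_V2_HEAD_ONLY, THP_LIKE, RFC_V1_HEAD_ONLY };

static void hgm_account_one_mapping(enum hgm_mapcount_scheme scheme,
				    struct folio *folio, struct page *subpage,
				    bool hstate_level, bool new_hstate_entry)
{
	switch (scheme) {
	case RFC_V2_HEAD_ONLY:
		/* #1: bump the compound mapcount once per PTE/PMD/PUD mapping. */
		inc_compound_mapcount(folio);
		break;
	case THP_LIKE:
		/*
		 * #2: compound mapcount only for a full hstate-level mapping,
		 * otherwise the subpage's _mapcount.
		 */
		if (hstate_level)
			inc_compound_mapcount(folio);
		else
			inc_subpage_mapcount(subpage);
		break;
	case RFC_V1_HEAD_ONLY:
		/*
		 * #3: bump once, when the hstate-level entry first becomes
		 * present; lower-level mappings installed under it later
		 * don't add to the count.
		 */
		if (new_hstate_entry)
			inc_compound_mapcount(folio);
		break;
	}
}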

#1 has problems wrt smaps and migration (though there were other
problems with those anyway that Mike has fixed), and #2 makes
MADV_COLLAPSE slow to the point of being unusable for some
applications.

It seems like the least bad option is #1, but maybe we should have a
face-to-face discussion about it? I'm still collecting some more
performance numbers.

- James

2023-02-01 15:58:29

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On 01.02.23 16:45, James Houghton wrote:
> On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <[email protected]> wrote:
>>
>> On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote:
>>> On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <[email protected]> wrote:
>>>>
>>>> On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
>>>>> On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <[email protected]> wrote:
>>>>>>
>>>>>> On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> [snip]
>>>>>> Another way to not use thp mapcount, nor break smaps and similar calls to
>>>>>> page_mapcount() on small page, is to only increase the hpage mapcount only
>>>>>> when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
>>>>>> as leaf or a non-leaf), and the mapcount can be decreased when the pXd
>>>>>> entry is removed (for leaf, it's the same as for now; for HGM, it's when
>>>>>> freeing pgtable of the PUD entry).
>>>>>
>>>>> Right, and this is doable. Also it seems like this is pretty close to
>>>>> the direction Matthew Wilcox wants to go with THPs.
>>>>
>>>> I may not be familiar with it, do you mean this one?
>>>>
>>>> https://lore.kernel.org/all/Y9Afwds%[email protected]/
>>>
>>> Yep that's it.
>>>
>>>>
>>>> For hugetlb I think it should be easier to maintain rather than any-sized
>>>> folios, because there's the pgtable non-leaf entry to track rmap
>>>> information and the folio size being static to hpage size.
>>>>
>>>> It'll be different to folios where it can be random sized pages chunk, so
>>>> it needs to be managed by batching the ptes when install/zap.
>>>
>>> Agreed. It's probably easier for HugeTLB because they're always
>>> "naturally aligned" and yeah they can't change sizes.
>>>
>>>>
>>>>>
>>>>> Something I noticed though, from the implementation of
>>>>> folio_referenced()/folio_referenced_one(), is that folio_mapcount()
>>>>> ought to report the total number of PTEs that are pointing on the page
>>>>> (or the number of times page_vma_mapped_walk returns true). FWIW,
>>>>> folio_referenced() is never called for hugetlb folios.
>>>>
>>>> FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
>>>> it'll walk every leaf page being mapped, big or small, so IIUC that number
>>>> should match with what it expects to see later, more or less.
>>>
>>> I don't fully understand what you mean here.
>>
>> I meant the rmap_walk pairing with folio_referenced_one() will walk all the
>> leaves for the folio, big or small. I think that will match the number
>> with what got returned from folio_mapcount().
>
> See below.
>
>>
>>>
>>>>
>>>> But I agree the mapcount/referenced value itself is debatable to me, just
>>>> like what you raised in the other thread on page migration. Meanwhile, I
>>>> am not certain whether the mapcount is accurate either because AFAICT the
>>>> mapcount can be modified if e.g. new page mapping established as long as
>>>> before taking the page lock later in folio_referenced().
>>>>
>>>> It's just that I don't see any severe issue either due to any of above, as
>>>> long as that information is only used as a hint for next steps, e.g., to
>>>> swap which page out.
>>>
>>> I also don't see a big problem with folio_referenced() (and you're
>>> right that folio_mapcount() can be stale by the time it takes the
>>> folio lock). It still seems like folio_mapcount() should return the
>>> total number of PTEs that map the page though. Are you saying that
>>> breaking this would be ok?
>>
>> I didn't quite follow - isn't that already doing so?
>>
>> folio_mapcount() is total_compound_mapcount() here, IIUC it is an
>> accumulated value of all possible PTEs or PMDs being mapped as long as it's
>> all or part of the folio being mapped.
>
> We've talked about 3 ways of handling mapcount:
>
> 1. The RFC v2 way, which is head-only, and we increment the compound
> mapcount for each PT mapping we have. So a PTE-mapped 2M page,
> compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias).
> 2. The THP-like way. If we are fully mapping the hugetlb page with the
> hstate-level PTE, we increment the compound mapcount, otherwise we
> increment subpage->_mapcount.
> 3. The RFC v1 way (the way you have suggested above), which is
> head-only, and we increment the compound mapcount if the hstate-level
> PTE is made present.
>
> With #1 and #2, there is no concern with folio_mapcount(). But with
> #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA
> would yield 1 instead of 512 (right?). That's what I mean.

My 2 cents:

The mapcount is primarily used (in hugetlb context) to

(a) Detect if a page might be shared. mapcount > 1 implies that two
independent page table hierarchies are mapping the page. We care about
mapcount == 1 vs. mapcount != 1.

(b) Detect if unmapping was successful. We care about mapcount == 0 vs.
mapcount != 0.

For hugetlb, I don't see why we should care about the subpage mapcount
at all.

For (a) it's even good to count "somehow mapped into a single page table
structure" as "mapcount == 1" For (b), we don't care as long as "still
mapped" implies "mapcount != 0".

--
Thanks,

David / dhildenb


2023-02-01 16:23:05

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Wed, Feb 01, 2023 at 07:45:17AM -0800, James Houghton wrote:
> On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <[email protected]> wrote:
> >
> > On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote:
> > > On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <[email protected]> wrote:
> > > >
> > > > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> > > > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <[email protected]> wrote:
> > > > > >
> > > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> [snip]
> > > > > > Another way to not use thp mapcount, nor break smaps and similar calls to
> > > > > > page_mapcount() on small page, is to only increase the hpage mapcount only
> > > > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> > > > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> > > > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when
> > > > > > freeing pgtable of the PUD entry).
> > > > >
> > > > > Right, and this is doable. Also it seems like this is pretty close to
> > > > > the direction Matthew Wilcox wants to go with THPs.
> > > >
> > > > I may not be familiar with it, do you mean this one?
> > > >
> > > > https://lore.kernel.org/all/Y9Afwds%[email protected]/
> > >
> > > Yep that's it.
> > >
> > > >
> > > > For hugetlb I think it should be easier to maintain rather than any-sized
> > > > folios, because there's the pgtable non-leaf entry to track rmap
> > > > information and the folio size being static to hpage size.
> > > >
> > > > It'll be different to folios where it can be random sized pages chunk, so
> > > > it needs to be managed by batching the ptes when install/zap.
> > >
> > > Agreed. It's probably easier for HugeTLB because they're always
> > > "naturally aligned" and yeah they can't change sizes.
> > >
> > > >
> > > > >
> > > > > Something I noticed though, from the implementation of
> > > > > folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> > > > > ought to report the total number of PTEs that are pointing on the page
> > > > > (or the number of times page_vma_mapped_walk returns true). FWIW,
> > > > > folio_referenced() is never called for hugetlb folios.
> > > >
> > > > FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
> > > > it'll walk every leaf page being mapped, big or small, so IIUC that number
> > > > should match with what it expects to see later, more or less.
> > >
> > > I don't fully understand what you mean here.
> >
> > I meant the rmap_walk pairing with folio_referenced_one() will walk all the
> > leaves for the folio, big or small. I think that will match the number
> > with what got returned from folio_mapcount().
>
> See below.
>
> >
> > >
> > > >
> > > > But I agree the mapcount/referenced value itself is debatable to me, just
> > > > like what you raised in the other thread on page migration. Meanwhile, I
> > > > am not certain whether the mapcount is accurate either because AFAICT the
> > > > mapcount can be modified if e.g. new page mapping established as long as
> > > > before taking the page lock later in folio_referenced().
> > > >
> > > > It's just that I don't see any severe issue either due to any of above, as
> > > > long as that information is only used as a hint for next steps, e.g., to
> > > > swap which page out.
> > >
> > > I also don't see a big problem with folio_referenced() (and you're
> > > right that folio_mapcount() can be stale by the time it takes the
> > > folio lock). It still seems like folio_mapcount() should return the
> > > total number of PTEs that map the page though. Are you saying that
> > > breaking this would be ok?
> >
> > I didn't quite follow - isn't that already doing so?
> >
> > folio_mapcount() is total_compound_mapcount() here, IIUC it is an
> > accumulated value of all possible PTEs or PMDs being mapped as long as it's
> > all or part of the folio being mapped.
>
> We've talked about 3 ways of handling mapcount:
>
> 1. The RFC v2 way, which is head-only, and we increment the compound
> mapcount for each PT mapping we have. So a PTE-mapped 2M page,
> compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias).
> 2. The THP-like way. If we are fully mapping the hugetlb page with the
> hstate-level PTE, we increment the compound mapcount, otherwise we
> increment subpage->_mapcount.
> 3. The RFC v1 way (the way you have suggested above), which is
> head-only, and we increment the compound mapcount if the hstate-level
> PTE is made present.

Oh, that's where it came from! It took quite some months going through all
these, I can hardly remember the details.

>
> With #1 and #2, there is no concern with folio_mapcount(). But with
> #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA
> would yield 1 instead of 512 (right?). That's what I mean.
>
> #1 has problems wrt smaps and migration (though there were other
> problems with those anyway that Mike has fixed), and #2 makes
> MADV_COLLAPSE slow to the point of being unusable for some
> applications.

Ah so you're talking about after HGM being applied.. while I was only
talking about THPs.

If to apply the logic here with idea 3), the worst case is we'll need to
have special care of HGM hugetlb in folio_referenced_one(), so the default
page_vma_mapped_walk() may not apply anymore - the resource is always in
hstate sized, so counting small ptes do not help too - we can just walk
until the hstate entry and do referenced++ if it's not none, at the
entrance of folio_referenced_one().

But I'm not sure whether that'll be necessary at all, as I'm not sure
whether that path can be triggered at all in any form (where from the top
it should always be shrink_page_list()). In that sense maybe we can also
consider adding a WARN_ON_ONCE() in folio_referenced() where it is a
hugetlb page that got passed in? Meanwhile, adding a TODO comment
explaining that current walk won't work easily for HGM only, so when it
will be applicable to hugetlb we need to rework?
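
Something along these lines, just as a sketch (the wrapper name is made
up, and it assumes folio_referenced() keeps its current signature; the
real change would simply be the check plus the TODO comment inside
folio_referenced() itself):

static int folio_referenced_no_hgm(struct folio *folio, int is_locked,
				   struct mem_cgroup *memcg,
				   unsigned long *vm_flags)
{
	/*
	 * TODO: the generic small-pte walk doesn't make sense for
	 * HGM-mapped hugetlb, where the reclaim unit is always
	 * hstate-sized; rework this before letting hugetlb folios in.
	 */
	if (WARN_ON_ONCE(folio_test_hugetlb(folio)))
		return 0;

	return folio_referenced(folio, is_locked, memcg, vm_flags);
}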

I confess that's not pretty, though. But that'll make 3) with no major
defect from function-wise.

Side note: did we finish folio conversion on hugetlb at all? I think at
least we need some helper like folio_test_huge(). It seems still missing.
Maybe it's another clue that hugetlb is not important to folio_referenced()
because it's already fully converted?

>
> It seems like the least bad option is #1, but maybe we should have a
> face-to-face discussion about it? I'm still collecting some more
> performance numbers.

Let's see how it goes..

Thanks,

--
Peter Xu


2023-02-01 17:59:17

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Wed, Feb 1, 2023 at 7:56 AM David Hildenbrand <[email protected]> wrote:
>
> On 01.02.23 16:45, James Houghton wrote:
> > On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <[email protected]> wrote:
> >>
> >> On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote:
> >>> On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <[email protected]> wrote:
> >>>>
> >>>> On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> >>>>> On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <[email protected]> wrote:
> >>>>>>
> >>>>>> On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> > [snip]
> >>>>>> Another way to not use thp mapcount, nor break smaps and similar calls to
> >>>>>> page_mapcount() on small page, is to only increase the hpage mapcount only
> >>>>>> when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> >>>>>> as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> >>>>>> entry is removed (for leaf, it's the same as for now; for HGM, it's when
> >>>>>> freeing pgtable of the PUD entry).
> >>>>>
> >>>>> Right, and this is doable. Also it seems like this is pretty close to
> >>>>> the direction Matthew Wilcox wants to go with THPs.
> >>>>
> >>>> I may not be familiar with it, do you mean this one?
> >>>>
> >>>> https://lore.kernel.org/all/Y9Afwds%[email protected]/
> >>>
> >>> Yep that's it.
> >>>
> >>>>
> >>>> For hugetlb I think it should be easier to maintain rather than any-sized
> >>>> folios, because there's the pgtable non-leaf entry to track rmap
> >>>> information and the folio size being static to hpage size.
> >>>>
> >>>> It'll be different to folios where it can be random sized pages chunk, so
> >>>> it needs to be managed by batching the ptes when install/zap.
> >>>
> >>> Agreed. It's probably easier for HugeTLB because they're always
> >>> "naturally aligned" and yeah they can't change sizes.
> >>>
> >>>>
> >>>>>
> >>>>> Something I noticed though, from the implementation of
> >>>>> folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> >>>>> ought to report the total number of PTEs that are pointing on the page
> >>>>> (or the number of times page_vma_mapped_walk returns true). FWIW,
> >>>>> folio_referenced() is never called for hugetlb folios.
> >>>>
> >>>> FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
> >>>> it'll walk every leaf page being mapped, big or small, so IIUC that number
> >>>> should match with what it expects to see later, more or less.
> >>>
> >>> I don't fully understand what you mean here.
> >>
> >> I meant the rmap_walk pairing with folio_referenced_one() will walk all the
> >> leaves for the folio, big or small. I think that will match the number
> >> with what got returned from folio_mapcount().
> >
> > See below.
> >
> >>
> >>>
> >>>>
> >>>> But I agree the mapcount/referenced value itself is debatable to me, just
> >>>> like what you raised in the other thread on page migration. Meanwhile, I
> >>>> am not certain whether the mapcount is accurate either because AFAICT the
> >>>> mapcount can be modified if e.g. new page mapping established as long as
> >>>> before taking the page lock later in folio_referenced().
> >>>>
> >>>> It's just that I don't see any severe issue either due to any of above, as
> >>>> long as that information is only used as a hint for next steps, e.g., to
> >>>> swap which page out.
> >>>
> >>> I also don't see a big problem with folio_referenced() (and you're
> >>> right that folio_mapcount() can be stale by the time it takes the
> >>> folio lock). It still seems like folio_mapcount() should return the
> >>> total number of PTEs that map the page though. Are you saying that
> >>> breaking this would be ok?
> >>
> >> I didn't quite follow - isn't that already doing so?
> >>
> >> folio_mapcount() is total_compound_mapcount() here, IIUC it is an
> >> accumulated value of all possible PTEs or PMDs being mapped as long as it's
> >> all or part of the folio being mapped.
> >
> > We've talked about 3 ways of handling mapcount:
> >
> > 1. The RFC v2 way, which is head-only, and we increment the compound
> > mapcount for each PT mapping we have. So a PTE-mapped 2M page,
> > compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias).
> > 2. The THP-like way. If we are fully mapping the hugetlb page with the
> > hstate-level PTE, we increment the compound mapcount, otherwise we
> > increment subpage->_mapcount.
> > 3. The RFC v1 way (the way you have suggested above), which is
> > head-only, and we increment the compound mapcount if the hstate-level
> > PTE is made present.
> >
> > With #1 and #2, there is no concern with folio_mapcount(). But with
> > #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA
> > would yield 1 instead of 512 (right?). That's what I mean.
>
> My 2 cents:
>
> The mapcount is primarily used (in hugetlb context) to
>
> (a) Detect if a page might be shared. mapcount > 1 implies that two
> independent page table hierarchies are mapping the page. We care about
> mapcount == 1 vs. mapcount != 1.
>
> (b) Detect if unmapping was successful. We care about mapcount == 0 vs.
> mapcount != 0.
>
> For hugetlb, I don't see why we should care about the subpage mapcount
> at all.

Agreed -- it shouldn't really matter all that much.

>
> For (a) it's even good to count "somehow mapped into a single page table
> structure" as "mapcount == 1" For (b), we don't care as long as "still
> mapped" implies "mapcount != 0".

Thanks for your thoughts, David. So it sounds like you're still
squarely in the #3 camp. :)

2023-02-01 18:02:27

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

>>> 1. The RFC v2 way, which is head-only, and we increment the compound
>>> mapcount for each PT mapping we have. So a PTE-mapped 2M page,
>>> compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias).
>>> 2. The THP-like way. If we are fully mapping the hugetlb page with the
>>> hstate-level PTE, we increment the compound mapcount, otherwise we
>>> increment subpage->_mapcount.
>>> 3. The RFC v1 way (the way you have suggested above), which is
>>> head-only, and we increment the compound mapcount if the hstate-level
>>> PTE is made present.
>>>
>>> With #1 and #2, there is no concern with folio_mapcount(). But with
>>> #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA
>>> would yield 1 instead of 512 (right?). That's what I mean.
>>
>> My 2 cents:
>>
>> The mapcount is primarily used (in hugetlb context) to
>>
>> (a) Detect if a page might be shared. mapcount > 1 implies that two
>> independent page table hierarchies are mapping the page. We care about
>> mapcount == 1 vs. mapcount != 1.
>>
>> (b) Detect if unmapping was successful. We care about mapcount == 0 vs.
>> mapcount != 0.
>>
>> For hugetlb, I don't see why we should care about the subpage mapcount
>> at all.
>
> Agreed -- it shouldn't really matter all that much.
>
>>
>> For (a) it's even good to count "somehow mapped into a single page table
>> structure" as "mapcount == 1" For (b), we don't care as long as "still
>> mapped" implies "mapcount != 0".
>
> Thanks for your thoughts, David. So it sounds like you're still
> squarely in the #3 camp. :)

Well, yes. As long as we can get it implemented in a clean way ... :)

--
Thanks,

David / dhildenb


2023-02-01 21:33:05

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Wed, Feb 1, 2023 at 8:22 AM Peter Xu <[email protected]> wrote:
>
> On Wed, Feb 01, 2023 at 07:45:17AM -0800, James Houghton wrote:
> > On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <[email protected]> wrote:
> > >
> > > On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote:
> > > > On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <[email protected]> wrote:
> > > > >
> > > > > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> > > > > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <[email protected]> wrote:
> > > > > > >
> > > > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> > [snip]
> > > > > > > Another way to not use thp mapcount, nor break smaps and similar calls to
> > > > > > > page_mapcount() on small page, is to only increase the hpage mapcount only
> > > > > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> > > > > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> > > > > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when
> > > > > > > freeing pgtable of the PUD entry).
> > > > > >
> > > > > > Right, and this is doable. Also it seems like this is pretty close to
> > > > > > the direction Matthew Wilcox wants to go with THPs.
> > > > >
> > > > > I may not be familiar with it, do you mean this one?
> > > > >
> > > > > https://lore.kernel.org/all/Y9Afwds%[email protected]/
> > > >
> > > > Yep that's it.
> > > >
> > > > >
> > > > > For hugetlb I think it should be easier to maintain rather than any-sized
> > > > > folios, because there's the pgtable non-leaf entry to track rmap
> > > > > information and the folio size being static to hpage size.
> > > > >
> > > > > It'll be different to folios where it can be random sized pages chunk, so
> > > > > it needs to be managed by batching the ptes when install/zap.
> > > >
> > > > Agreed. It's probably easier for HugeTLB because they're always
> > > > "naturally aligned" and yeah they can't change sizes.
> > > >
> > > > >
> > > > > >
> > > > > > Something I noticed though, from the implementation of
> > > > > > folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> > > > > > ought to report the total number of PTEs that are pointing on the page
> > > > > > (or the number of times page_vma_mapped_walk returns true). FWIW,
> > > > > > folio_referenced() is never called for hugetlb folios.
> > > > >
> > > > > FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
> > > > > it'll walk every leaf page being mapped, big or small, so IIUC that number
> > > > > should match with what it expects to see later, more or less.
> > > >
> > > > I don't fully understand what you mean here.
> > >
> > > I meant the rmap_walk pairing with folio_referenced_one() will walk all the
> > > leaves for the folio, big or small. I think that will match the number
> > > with what got returned from folio_mapcount().
> >
> > See below.
> >
> > >
> > > >
> > > > >
> > > > > But I agree the mapcount/referenced value itself is debatable to me, just
> > > > > like what you raised in the other thread on page migration. Meanwhile, I
> > > > > am not certain whether the mapcount is accurate either because AFAICT the
> > > > > mapcount can be modified if e.g. new page mapping established as long as
> > > > > before taking the page lock later in folio_referenced().
> > > > >
> > > > > It's just that I don't see any severe issue either due to any of above, as
> > > > > long as that information is only used as a hint for next steps, e.g., to
> > > > > swap which page out.
> > > >
> > > > I also don't see a big problem with folio_referenced() (and you're
> > > > right that folio_mapcount() can be stale by the time it takes the
> > > > folio lock). It still seems like folio_mapcount() should return the
> > > > total number of PTEs that map the page though. Are you saying that
> > > > breaking this would be ok?
> > >
> > > I didn't quite follow - isn't that already doing so?
> > >
> > > folio_mapcount() is total_compound_mapcount() here, IIUC it is an
> > > accumulated value of all possible PTEs or PMDs being mapped as long as it's
> > > all or part of the folio being mapped.
> >
> > We've talked about 3 ways of handling mapcount:
> >
> > 1. The RFC v2 way, which is head-only, and we increment the compound
> > mapcount for each PT mapping we have. So a PTE-mapped 2M page,
> > compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias).
> > 2. The THP-like way. If we are fully mapping the hugetlb page with the
> > hstate-level PTE, we increment the compound mapcount, otherwise we
> > increment subpage->_mapcount.
> > 3. The RFC v1 way (the way you have suggested above), which is
> > head-only, and we increment the compound mapcount if the hstate-level
> > PTE is made present.
>
> Oh, that's where it came from! It took quite some months going through all
> these, I can hardly remember the details.
>
> >
> > With #1 and #2, there is no concern with folio_mapcount(). But with
> > #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA
> > would yield 1 instead of 512 (right?). That's what I mean.
> >
> > #1 has problems wrt smaps and migration (though there were other
> > problems with those anyway that Mike has fixed), and #2 makes
> > MADV_COLLAPSE slow to the point of being unusable for some
> > applications.
>
> Ah so you're talking about after HGM being applied.. while I was only
> talking about THPs.
>
> If to apply the logic here with idea 3), the worst case is we'll need to
> have special care of HGM hugetlb in folio_referenced_one(), so the default
> page_vma_mapped_walk() may not apply anymore - the resource is always in
> hstate sized, so counting small ptes do not help too - we can just walk
> until the hstate entry and do referenced++ if it's not none, at the
> entrance of folio_referenced_one().
>
> But I'm not sure whether that'll be necessary at all, as I'm not sure
> whether that path can be triggered at all in any form (where from the top
> it should always be shrink_page_list()). In that sense maybe we can also
> consider adding a WARN_ON_ONCE() in folio_referenced() where it is a
> hugetlb page that got passed in? Meanwhile, adding a TODO comment
> explaining that current walk won't work easily for HGM only, so when it
> will be applicable to hugetlb we need to rework?
>
> I confess that's not pretty, though. But that'll make 3) with no major
> defect from function-wise.

Another potential idea would be to add something like page_vmacount().
For non-HugeTLB pages, page_vmacount() == page_mapcount(). Then for
HugeTLB pages, we could keep a separate count (in one of the tail
pages, I guess). And then in the places that matter (so smaps,
migration, and maybe CoW and hwpoison), potentially change their calls
to page_vmacount() instead of page_mapcount().

Then to implement page_vmacount(), we do the RFC v1 mapcount approach
(but like.... correctly this time). And then for page_mapcount(), we
do the RFC v2 mapcount approach (head-only, once per PTE).

Then we fix folio_referenced() without needing to special-case it for
HugeTLB. :) Or we could just special-case it. *shrug*

Does that sound reasonable? We still have the problem where a series
of partial unmaps could leave page_vmacount() incremented, but I
don't think that's a big problem.
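
A rough sketch of the idea, purely hypothetical (none of these helpers
or fields exist; where exactly the hugetlb counter would live in the
tail pages is hand-waved):

/* Number of VMAs / page table hierarchies mapping some part of the page. */
static inline int page_vmacount(struct page *page)
{
	struct folio *folio = page_folio(page);

	if (folio_test_hugetlb(folio))
		/*
		 * Bumped once when the hstate-level entry becomes present
		 * (RFC v1 style); _hugetlb_vmacount is a made-up field.
		 */
		return atomic_read(&folio->_hugetlb_vmacount);

	/* Per the proposal, page_vmacount() == page_mapcount() otherwise. */
	return page_mapcount(page);
}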

>
> Side note: did we finish folio conversion on hugetlb at all? I think at
> least we need some helper like folio_test_huge(). It seems still missing.
> Maybe it's another clue that hugetlb is not important to folio_referenced()
> because it's already fully converted?

I'm not sure. A lot of work was done pretty recently, so I bet
there's probably some work left to do.

2023-02-01 21:52:45

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Wed, Feb 01, 2023 at 01:32:21PM -0800, James Houghton wrote:
> On Wed, Feb 1, 2023 at 8:22 AM Peter Xu <[email protected]> wrote:
> >
> > On Wed, Feb 01, 2023 at 07:45:17AM -0800, James Houghton wrote:
> > > On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <[email protected]> wrote:
> > > >
> > > > On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote:
> > > > > On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <[email protected]> wrote:
> > > > > >
> > > > > > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> > > > > > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> > > [snip]
> > > > > > > > Another way to not use thp mapcount, nor break smaps and similar calls to
> > > > > > > > page_mapcount() on small page, is to only increase the hpage mapcount only
> > > > > > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> > > > > > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> > > > > > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when
> > > > > > > > freeing pgtable of the PUD entry).
> > > > > > >
> > > > > > > Right, and this is doable. Also it seems like this is pretty close to
> > > > > > > the direction Matthew Wilcox wants to go with THPs.
> > > > > >
> > > > > > I may not be familiar with it, do you mean this one?
> > > > > >
> > > > > > https://lore.kernel.org/all/Y9Afwds%[email protected]/
> > > > >
> > > > > Yep that's it.
> > > > >
> > > > > >
> > > > > > For hugetlb I think it should be easier to maintain rather than any-sized
> > > > > > folios, because there's the pgtable non-leaf entry to track rmap
> > > > > > information and the folio size being static to hpage size.
> > > > > >
> > > > > > It'll be different to folios where it can be random sized pages chunk, so
> > > > > > it needs to be managed by batching the ptes when install/zap.
> > > > >
> > > > > Agreed. It's probably easier for HugeTLB because they're always
> > > > > "naturally aligned" and yeah they can't change sizes.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Something I noticed though, from the implementation of
> > > > > > > folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> > > > > > > ought to report the total number of PTEs that are pointing on the page
> > > > > > > (or the number of times page_vma_mapped_walk returns true). FWIW,
> > > > > > > folio_referenced() is never called for hugetlb folios.
> > > > > >
> > > > > > FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
> > > > > > it'll walk every leaf page being mapped, big or small, so IIUC that number
> > > > > > should match with what it expects to see later, more or less.
> > > > >
> > > > > I don't fully understand what you mean here.
> > > >
> > > > I meant the rmap_walk pairing with folio_referenced_one() will walk all the
> > > > leaves for the folio, big or small. I think that will match the number
> > > > with what got returned from folio_mapcount().
> > >
> > > See below.
> > >
> > > >
> > > > >
> > > > > >
> > > > > > But I agree the mapcount/referenced value itself is debatable to me, just
> > > > > > like what you raised in the other thread on page migration. Meanwhile, I
> > > > > > am not certain whether the mapcount is accurate either because AFAICT the
> > > > > > mapcount can be modified if e.g. new page mapping established as long as
> > > > > > before taking the page lock later in folio_referenced().
> > > > > >
> > > > > > It's just that I don't see any severe issue either due to any of above, as
> > > > > > long as that information is only used as a hint for next steps, e.g., to
> > > > > > swap which page out.
> > > > >
> > > > > I also don't see a big problem with folio_referenced() (and you're
> > > > > right that folio_mapcount() can be stale by the time it takes the
> > > > > folio lock). It still seems like folio_mapcount() should return the
> > > > > total number of PTEs that map the page though. Are you saying that
> > > > > breaking this would be ok?
> > > >
> > > > I didn't quite follow - isn't that already doing so?
> > > >
> > > > folio_mapcount() is total_compound_mapcount() here, IIUC it is an
> > > > accumulated value of all possible PTEs or PMDs being mapped as long as it's
> > > > all or part of the folio being mapped.
> > >
> > > We've talked about 3 ways of handling mapcount:
> > >
> > > 1. The RFC v2 way, which is head-only, and we increment the compound
> > > mapcount for each PT mapping we have. So a PTE-mapped 2M page,
> > > compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias).
> > > 2. The THP-like way. If we are fully mapping the hugetlb page with the
> > > hstate-level PTE, we increment the compound mapcount, otherwise we
> > > increment subpage->_mapcount.
> > > 3. The RFC v1 way (the way you have suggested above), which is
> > > head-only, and we increment the compound mapcount if the hstate-level
> > > PTE is made present.
> >
> > Oh, that's where it came from! It took quite some months going through all
> > these, I can hardly remember the details.
> >
> > >
> > > With #1 and #2, there is no concern with folio_mapcount(). But with
> > > #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA
> > > would yield 1 instead of 512 (right?). That's what I mean.
> > >
> > > #1 has problems wrt smaps and migration (though there were other
> > > problems with those anyway that Mike has fixed), and #2 makes
> > > MADV_COLLAPSE slow to the point of being unusable for some
> > > applications.
> >
> > Ah so you're talking about after HGM being applied.. while I was only
> > talking about THPs.
> >
> > If to apply the logic here with idea 3), the worst case is we'll need to
> > have special care of HGM hugetlb in folio_referenced_one(), so the default
> > page_vma_mapped_walk() may not apply anymore - the resource is always in
> > hstate sized, so counting small ptes do not help too - we can just walk
> > until the hstate entry and do referenced++ if it's not none, at the
> > entrance of folio_referenced_one().
> >
> > But I'm not sure whether that'll be necessary at all, as I'm not sure
> > whether that path can be triggered at all in any form (where from the top
> > it should always be shrink_page_list()). In that sense maybe we can also
> > consider adding a WARN_ON_ONCE() in folio_referenced() where it is a
> > hugetlb page that got passed in? Meanwhile, adding a TODO comment
> > explaining that current walk won't work easily for HGM only, so when it
> > will be applicable to hugetlb we need to rework?
> >
> > I confess that's not pretty, though. But that'll make 3) with no major
> > defect from function-wise.
>
> Another potential idea would be to add something like page_vmacount().
> For non-HugeTLB pages, page_vmacount() == page_mapcount(). Then for
> HugeTLB pages, we could keep a separate count (in one of the tail
> pages, I guess). And then in the places that matter (so smaps,
> migration, and maybe CoW and hwpoison), potentially change their calls
> to page_vmacount() instead of page_mapcount().
>
> Then to implement page_vmacount(), we do the RFC v1 mapcount approach
> (but like.... correctly this time). And then for page_mapcount(), we
> do the RFC v2 mapcount approach (head-only, once per PTE).
>
> Then we fix folio_referenced() without needing to special-case it for
> HugeTLB. :) Or we could just special-case it. *shrug*
>
> Does that sound reasonable? We still have the problem where a series
> of partial unmaps could leave page_vmacount() incremented, but I
> don't think that's a big problem.

I'm afraid someone will stop you from introducing yet another definition of
mapcount, while others are trying to remove it. :)

Or, can we just drop folio_referenced_arg.mapcount? We need to keep:

        if (!pra.mapcount)
                return 0;

By replacing it with folio_mapcount which is definitely something
worthwhile, but what about the rest?

If it can be dropped, afaict it'll naturally work with HGM again.

IIUC that's an optimization where we want to stop the rmap walk as long as
we found all the pages, however (1) IIUC it's not required to function, and
(2) it's not guaranteed to work as solid anyway.. As we've discussed
before: right after it reads mapcount (but before taking the page lock),
the mapcount can get decreased by 1, then it'll still need to loop over all
the vmas just to find that there's one "mysterious" mapcount lost.

Personally I really have no idea on how much that optimization can help.
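
Concretely, the sketch would just be keeping the cheap bail-out at the
top of folio_referenced() but reading the mapcount directly (untested,
fragment only, to show the shape):

        /* Bail out early if the folio isn't mapped at all ... */
        if (!folio_mapcount(folio))
                return 0;
        /*
         * ... and drop pra.mapcount plus the early return in
         * folio_referenced_one(), accepting that the rmap walk then
         * always visits every VMA of the mapping.
         */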

--
Peter Xu


2023-02-02 00:25:19

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Wed, Feb 1, 2023 at 1:51 PM Peter Xu <[email protected]> wrote:
>
> On Wed, Feb 01, 2023 at 01:32:21PM -0800, James Houghton wrote:
> > On Wed, Feb 1, 2023 at 8:22 AM Peter Xu <[email protected]> wrote:
> > >
> > > On Wed, Feb 01, 2023 at 07:45:17AM -0800, James Houghton wrote:
> > > > On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <[email protected]> wrote:
> > > > >
> > > > > On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote:
> > > > > > On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <[email protected]> wrote:
> > > > > > >
> > > > > > > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> > > > > > > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> > > > [snip]
> > > > > > > > > Another way to not use thp mapcount, nor break smaps and similar calls to
> > > > > > > > > page_mapcount() on small page, is to only increase the hpage mapcount only
> > > > > > > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> > > > > > > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> > > > > > > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when
> > > > > > > > > freeing pgtable of the PUD entry).
> > > > > > > >
> > > > > > > > Right, and this is doable. Also it seems like this is pretty close to
> > > > > > > > the direction Matthew Wilcox wants to go with THPs.
> > > > > > >
> > > > > > > I may not be familiar with it, do you mean this one?
> > > > > > >
> > > > > > > https://lore.kernel.org/all/Y9Afwds%[email protected]/
> > > > > >
> > > > > > Yep that's it.
> > > > > >
> > > > > > >
> > > > > > > For hugetlb I think it should be easier to maintain rather than any-sized
> > > > > > > folios, because there's the pgtable non-leaf entry to track rmap
> > > > > > > information and the folio size being static to hpage size.
> > > > > > >
> > > > > > > It'll be different to folios where it can be random sized pages chunk, so
> > > > > > > it needs to be managed by batching the ptes when install/zap.
> > > > > >
> > > > > > Agreed. It's probably easier for HugeTLB because they're always
> > > > > > "naturally aligned" and yeah they can't change sizes.
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Something I noticed though, from the implementation of
> > > > > > > > folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> > > > > > > > ought to report the total number of PTEs that are pointing on the page
> > > > > > > > (or the number of times page_vma_mapped_walk returns true). FWIW,
> > > > > > > > folio_referenced() is never called for hugetlb folios.
> > > > > > >
> > > > > > > FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
> > > > > > > it'll walk every leaf page being mapped, big or small, so IIUC that number
> > > > > > > should match with what it expects to see later, more or less.
> > > > > >
> > > > > > I don't fully understand what you mean here.
> > > > >
> > > > > I meant the rmap_walk pairing with folio_referenced_one() will walk all the
> > > > > leaves for the folio, big or small. I think that will match the number
> > > > > with what got returned from folio_mapcount().
> > > >
> > > > See below.
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > But I agree the mapcount/referenced value itself is debatable to me, just
> > > > > > > like what you raised in the other thread on page migration. Meanwhile, I
> > > > > > > am not certain whether the mapcount is accurate either because AFAICT the
> > > > > > > mapcount can be modified if e.g. new page mapping established as long as
> > > > > > > before taking the page lock later in folio_referenced().
> > > > > > >
> > > > > > > It's just that I don't see any severe issue either due to any of above, as
> > > > > > > long as that information is only used as a hint for next steps, e.g., to
> > > > > > > swap which page out.
> > > > > >
> > > > > > I also don't see a big problem with folio_referenced() (and you're
> > > > > > right that folio_mapcount() can be stale by the time it takes the
> > > > > > folio lock). It still seems like folio_mapcount() should return the
> > > > > > total number of PTEs that map the page though. Are you saying that
> > > > > > breaking this would be ok?
> > > > >
> > > > > I didn't quite follow - isn't that already doing so?
> > > > >
> > > > > folio_mapcount() is total_compound_mapcount() here, IIUC it is an
> > > > > accumulated value of all possible PTEs or PMDs being mapped as long as it's
> > > > > all or part of the folio being mapped.
> > > >
> > > > We've talked about 3 ways of handling mapcount:
> > > >
> > > > 1. The RFC v2 way, which is head-only, and we increment the compound
> > > > mapcount for each PT mapping we have. So a PTE-mapped 2M page,
> > > > compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias).
> > > > 2. The THP-like way. If we are fully mapping the hugetlb page with the
> > > > hstate-level PTE, we increment the compound mapcount, otherwise we
> > > > increment subpage->_mapcount.
> > > > 3. The RFC v1 way (the way you have suggested above), which is
> > > > head-only, and we increment the compound mapcount if the hstate-level
> > > > PTE is made present.
> > >
> > > Oh that's where it comes from! It took quite some months going through all
> > > these, I can hardly remember the details.
> > >
> > > >
> > > > With #1 and #2, there is no concern with folio_mapcount(). But with
> > > > #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA
> > > > would yield 1 instead of 512 (right?). That's what I mean.
> > > >
> > > > #1 has problems wrt smaps and migration (though there were other
> > > > problems with those anyway that Mike has fixed), and #2 makes
> > > > MADV_COLLAPSE slow to the point of being unusable for some
> > > > applications.
> > >
> > > Ah so you're talking about after HGM being applied.. while I was only
> > > talking about THPs.
> > >
> > > If to apply the logic here with idea 3), the worst case is we'll need to
> > > have special care of HGM hugetlb in folio_referenced_one(), so the default
> > > page_vma_mapped_walk() may not apply anymore - the resource is always in
> > > hstate sized, so counting small ptes do not help too - we can just walk
> > > until the hstate entry and do referenced++ if it's not none, at the
> > > entrance of folio_referenced_one().
> > >
> > > But I'm not sure whether that'll be necessary at all, as I'm not sure
> > > whether that path can be triggered at all in any form (where from the top
> > > it should always be shrink_page_list()). In that sense maybe we can also
> > > consider adding a WARN_ON_ONCE() in folio_referenced() where it is a
> > > hugetlb page that got passed in? Meanwhile, adding a TODO comment
> > > explaining that current walk won't work easily for HGM only, so when it
> > > will be applicable to hugetlb we need to rework?
> > >
> > > I confess that's not pretty, though. But that'll make 3) with no major
> > > defect from function-wise.
> >
> > Another potential idea would be to add something like page_vmacount().
> > For non-HugeTLB pages, page_vmacount() == page_mapcount(). Then for
> > HugeTLB pages, we could keep a separate count (in one of the tail
> > pages, I guess). And then in the places that matter (so smaps,
> > migration, and maybe CoW and hwpoison), potentially change their calls
> > to page_vmacount() instead of page_mapcount().
> >
> > Then to implement page_vmacount(), we do the RFC v1 mapcount approach
> > (but like.... correctly this time). And then for page_mapcount(), we
> > do the RFC v2 mapcount approach (head-only, once per PTE).
> >
> > Then we fix folio_referenced() without needing to special-case it for
> > HugeTLB. :) Or we could just special-case it. *shrug*
> >
> > Does that sound reasonable? We still have the problem where a series
> > of partially unmaps could leave page_vmacount() incremented, but I
> > don't think that's a big problem.
>
> I'm afraid someone will stop you from introducing yet another definition of
> mapcount, where others are trying to remove it. :)
>
> Or, can we just drop folio_referenced_arg.mapcount? We need to keep:
>
> if (!pra.mapcount)
> return 0;
>
> By replacing it with folio_mapcount which is definitely something
> worthwhile, but what about the rest?
>
> If it can be dropped, afaict it'll naturally work with HGM again.
>
> IIUC that's an optimization where we want to stop the rmap walk as long as
> we found all the pages, however (1) IIUC it's not required to function, and
> (2) it's not guaranteed to work as solid anyway.. As we've discussed
> before: right after it reads mapcount (but before taking the page lock),
> the mapcount can get decreased by 1, then it'll still need to loop over all
> the vmas just to find that there's one "mysterious" mapcount lost.
>
> Personally I really have no idea on how much that optimization can help.

Ok, yeah, I think pra.mapcount can be removed too. (And we replace
!pra.mapcount with !folio_mapcount().)
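
Roughly, the early exit in folio_referenced() would become something
like this (a sketch against current mm/rmap.c, not a real patch):

        /* was: if (!pra.mapcount) return 0; */
        if (!folio_mapcount(folio))
                return 0;

        /*
         * ... and folio_referenced_one() no longer decrements
         * pra->mapcount to stop the rmap walk early; it only
         * accumulates pra->referenced / pra->vm_flags.
         */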

I don't see any other existing users of folio_mapcount() and
total_mapcount() that are problematic. We do need to make sure to keep
refcount and mapcount in sync though; it can be done.

So I'll compare this "RFC v1" way with the THP-like way and get you a
performance comparison.


- James

2023-02-07 16:31:41

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Wed, Feb 1, 2023 at 4:24 PM James Houghton <[email protected]> wrote:
>
> On Wed, Feb 1, 2023 at 1:51 PM Peter Xu <[email protected]> wrote:
> >
> > On Wed, Feb 01, 2023 at 01:32:21PM -0800, James Houghton wrote:
> > > On Wed, Feb 1, 2023 at 8:22 AM Peter Xu <[email protected]> wrote:
> > > >
> > > > On Wed, Feb 01, 2023 at 07:45:17AM -0800, James Houghton wrote:
> > > > > On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <[email protected]> wrote:
> > > > > >
> > > > > > On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote:
> > > > > > > On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> > > > > > > > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> > > > > [snip]
> > > > > > > > > > Another way to not use thp mapcount, nor break smaps and similar calls to
> > > > > > > > > > page_mapcount() on small page, is to only increase the hpage mapcount only
> > > > > > > > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> > > > > > > > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> > > > > > > > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when
> > > > > > > > > > freeing pgtable of the PUD entry).
> > > > > > > > >
> > > > > > > > > Right, and this is doable. Also it seems like this is pretty close to
> > > > > > > > > the direction Matthew Wilcox wants to go with THPs.
> > > > > > > >
> > > > > > > > I may not be familiar with it, do you mean this one?
> > > > > > > >
> > > > > > > > https://lore.kernel.org/all/Y9Afwds%[email protected]/
> > > > > > >
> > > > > > > Yep that's it.
> > > > > > >
> > > > > > > >
> > > > > > > > For hugetlb I think it should be easier to maintain rather than any-sized
> > > > > > > > folios, because there's the pgtable non-leaf entry to track rmap
> > > > > > > > information and the folio size being static to hpage size.
> > > > > > > >
> > > > > > > > It'll be different to folios where it can be random sized pages chunk, so
> > > > > > > > it needs to be managed by batching the ptes when install/zap.
> > > > > > >
> > > > > > > Agreed. It's probably easier for HugeTLB because they're always
> > > > > > > "naturally aligned" and yeah they can't change sizes.
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Something I noticed though, from the implementation of
> > > > > > > > > folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> > > > > > > > > ought to report the total number of PTEs that are pointing on the page
> > > > > > > > > (or the number of times page_vma_mapped_walk returns true). FWIW,
> > > > > > > > > folio_referenced() is never called for hugetlb folios.
> > > > > > > >
> > > > > > > > FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
> > > > > > > > it'll walk every leaf page being mapped, big or small, so IIUC that number
> > > > > > > > should match with what it expects to see later, more or less.
> > > > > > >
> > > > > > > I don't fully understand what you mean here.
> > > > > >
> > > > > > I meant the rmap_walk pairing with folio_referenced_one() will walk all the
> > > > > > leaves for the folio, big or small. I think that will match the number
> > > > > > with what got returned from folio_mapcount().
> > > > >
> > > > > See below.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > But I agree the mapcount/referenced value itself is debatable to me, just
> > > > > > > > like what you raised in the other thread on page migration. Meanwhile, I
> > > > > > > > am not certain whether the mapcount is accurate either because AFAICT the
> > > > > > > > mapcount can be modified if e.g. new page mapping established as long as
> > > > > > > > before taking the page lock later in folio_referenced().
> > > > > > > >
> > > > > > > > It's just that I don't see any severe issue either due to any of above, as
> > > > > > > > long as that information is only used as a hint for next steps, e.g., to
> > > > > > > > swap which page out.
> > > > > > >
> > > > > > > I also don't see a big problem with folio_referenced() (and you're
> > > > > > > right that folio_mapcount() can be stale by the time it takes the
> > > > > > > folio lock). It still seems like folio_mapcount() should return the
> > > > > > > total number of PTEs that map the page though. Are you saying that
> > > > > > > breaking this would be ok?
> > > > > >
> > > > > > I didn't quite follow - isn't that already doing so?
> > > > > >
> > > > > > folio_mapcount() is total_compound_mapcount() here, IIUC it is an
> > > > > > accumulated value of all possible PTEs or PMDs being mapped as long as it's
> > > > > > all or part of the folio being mapped.
> > > > >
> > > > > We've talked about 3 ways of handling mapcount:
> > > > >
> > > > > 1. The RFC v2 way, which is head-only, and we increment the compound
> > > > > mapcount for each PT mapping we have. So a PTE-mapped 2M page,
> > > > > compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias).
> > > > > 2. The THP-like way. If we are fully mapping the hugetlb page with the
> > > > > hstate-level PTE, we increment the compound mapcount, otherwise we
> > > > > increment subpage->_mapcount.
> > > > > 3. The RFC v1 way (the way you have suggested above), which is
> > > > > head-only, and we increment the compound mapcount if the hstate-level
> > > > > PTE is made present.
> > > >
> > > > Oh that's where it comes from! It took quite some months going through all
> > > > these, I can hardly remember the details.
> > > >
> > > > >
> > > > > With #1 and #2, there is no concern with folio_mapcount(). But with
> > > > > #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA
> > > > > would yield 1 instead of 512 (right?). That's what I mean.
> > > > >
> > > > > #1 has problems wrt smaps and migration (though there were other
> > > > > problems with those anyway that Mike has fixed), and #2 makes
> > > > > MADV_COLLAPSE slow to the point of being unusable for some
> > > > > applications.
> > > >
> > > > Ah so you're talking about after HGM being applied.. while I was only
> > > > talking about THPs.
> > > >
> > > > If to apply the logic here with idea 3), the worst case is we'll need to
> > > > have special care of HGM hugetlb in folio_referenced_one(), so the default
> > > > page_vma_mapped_walk() may not apply anymore - the resource is always in
> > > > hstate sized, so counting small ptes do not help too - we can just walk
> > > > until the hstate entry and do referenced++ if it's not none, at the
> > > > entrance of folio_referenced_one().
> > > >
> > > > But I'm not sure whether that'll be necessary at all, as I'm not sure
> > > > whether that path can be triggered at all in any form (where from the top
> > > > it should always be shrink_page_list()). In that sense maybe we can also
> > > > consider adding a WARN_ON_ONCE() in folio_referenced() where it is a
> > > > hugetlb page that got passed in? Meanwhile, adding a TODO comment
> > > > explaining that current walk won't work easily for HGM only, so when it
> > > > will be applicable to hugetlb we need to rework?
> > > >
> > > > I confess that's not pretty, though. But that'll make 3) with no major
> > > > defect from function-wise.
> > >
> > > Another potential idea would be to add something like page_vmacount().
> > > For non-HugeTLB pages, page_vmacount() == page_mapcount(). Then for
> > > HugeTLB pages, we could keep a separate count (in one of the tail
> > > pages, I guess). And then in the places that matter (so smaps,
> > > migration, and maybe CoW and hwpoison), potentially change their calls
> > > to page_vmacount() instead of page_mapcount().
> > >
> > > Then to implement page_vmacount(), we do the RFC v1 mapcount approach
> > > (but like.... correctly this time). And then for page_mapcount(), we
> > > do the RFC v2 mapcount approach (head-only, once per PTE).
> > >
> > > Then we fix folio_referenced() without needing to special-case it for
> > > HugeTLB. :) Or we could just special-case it. *shrug*
> > >
> > > Does that sound reasonable? We still have the problem where a series
> > > of partially unmaps could leave page_vmacount() incremented, but I
> > > don't think that's a big problem.
> >
> > I'm afraid someone will stop you from introducing yet another definition of
> > mapcount, where others are trying to remove it. :)
> >
> > Or, can we just drop folio_referenced_arg.mapcount? We need to keep:
> >
> > if (!pra.mapcount)
> > return 0;
> >
> > By replacing it with folio_mapcount which is definitely something
> > worthwhile, but what about the rest?
> >
> > If it can be dropped, afaict it'll naturally work with HGM again.
> >
> > IIUC that's an optimization where we want to stop the rmap walk as long as
> > we found all the pages, however (1) IIUC it's not required to function, and
> > (2) it's not guaranteed to work as solid anyway.. As we've discussed
> > before: right after it reads mapcount (but before taking the page lock),
> > the mapcount can get decreased by 1, then it'll still need to loop over all
> > the vmas just to find that there's one "mysterious" mapcount lost.
> >
> > Personally I really have no idea on how much that optimization can help.
>
> Ok, yeah, I think pra.mapcount can be removed too. (And we replace
> !pra.mapcount with !folio_mapcount().)
>
> I don't see any other existing users of folio_mapcount() and
> total_mapcount() that are problematic. We do need to make sure to keep
> refcount and mapcount in sync though; it can be done.
>
> So I'll compare this "RFC v1" way with the THP-like way and get you a
> performance comparison.

Here is the result: [1] (sorry it took a little while heh). The
implementation of the "RFC v1" way is pretty horrible[2] (and this
implementation probably has bugs anyway; it doesn't account for the
folio_referenced() problem).

Matthew is trying to solve the same problem with THPs right now: [3].
I haven't figured out how we can apply Matthew's approach to HGM
right now, but there probably is a way. (If we left the mapcount
increment bits in the same place, we couldn't just check the
hstate-level PTE; it would have already been made present.)

We could:
- use the THP-like way and tolerate ~1 second collapses
- use the (non-RFC) v1 way and tolerate the migration/smaps differences
- use the RFC v1 way and tolerate the complicated mapcount accounting
- flesh out [3] and see if it can be applied to HGM nicely

I'm happy to go with any of these approaches.

[1]: https://pastebin.com/raw/hJzFJHiD
[2]: https://github.com/48ca/linux/commit/4495f16a09b660aff44b3edcc125aa3a3df85976
[3]: https://lore.kernel.org/linux-mm/[email protected]/

- James

2023-02-07 22:46:47

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

> Here is the result: [1] (sorry it took a little while heh). The
> implementation of the "RFC v1" way is pretty horrible[2] (and this
> implementation probably has bugs anyway; it doesn't account for the
> folio_referenced() problem).
>
> Matthew is trying to solve the same problem with THPs right now: [3].
> I haven't figured out how we can apply Matthew's approach to HGM
> right now, but there probably is a way. (If we left the mapcount
> increment bits in the same place, we couldn't just check the
> hstate-level PTE; it would have already been made present.)
>
> We could:
> - use the THP-like way and tolerate ~1 second collapses

Another thought here. We don't necessarily *need* to collapse the page
table mappings in between mmu_notifier_invalidate_range_start() and
mmu_notifier_invalidate_range_end(), as the pfns aren't changing,
we aren't punching any holes, and we aren't changing permission bits.
If we had an MMU notifier that simply informed KVM that we collapsed
the page tables *after* we finished collapsing, then it would be ok
for hugetlb_collapse() to be slow.

If this MMU notifier is something that makes sense, it probably
applies to MADV_COLLAPSE for THPs as well.
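
To make it concrete, the notifier op might look something like this
(name and signature are invented here; no such op exists today):

        /*
         * Hypothetical op: tell secondary MMUs that the primary MMU
         * collapsed its mappings of [start, end) into larger page
         * table entries.  The pfns and protections are unchanged, so
         * this is only a hint; KVM could rebuild its second-stage
         * mappings lazily, outside of an
         * invalidate_range_start()/end() critical section.
         */
        void (*collapse)(struct mmu_notifier *mn, struct mm_struct *mm,
                         unsigned long start, unsigned long end);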


> - use the (non-RFC) v1 way and tolerate the migration/smaps differences
> - use the RFC v1 way and tolerate the complicated mapcount accounting
> - flesh out [3] and see if it can be applied to HGM nicely
>
> I'm happy to go with any of these approaches.
>
> [1]: https://pastebin.com/raw/hJzFJHiD
> [2]: https://github.com/48ca/linux/commit/4495f16a09b660aff44b3edcc125aa3a3df85976
> [3]: https://lore.kernel.org/linux-mm/[email protected]/

- James

2023-02-07 23:14:14

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

James,

On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote:
> > Here is the result: [1] (sorry it took a little while heh). The

Thanks. From what I can tell, that number shows that it'll be great we
start with your rfcv1 mapcount approach, which mimics what's proposed by
Matthew for generic folio.

> > implementation of the "RFC v1" way is pretty horrible[2] (and this

Any more information on why it's horrible? :)

A quick comment is I'm wondering whether that "whether we should boost the
mapcount" value can be hidden in hugetlb_pte* so you don't need to pass
over a lot of bool* deep into the hgm walk routines.
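
Something like the below, maybe (field name invented here; the actual
hugetlb_pte layout in your series may differ):

        struct hugetlb_pte {
                /* ... existing fields (ptep, shift, ptl, ...) ... */

                /*
                 * Set by the HGM walk when it had to create the
                 * hstate-level entry, so the caller knows to take the
                 * compound mapcount without a bool * being threaded
                 * through every walk routine.
                 */
                bool created_hstate_entry;
        };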

> > implementation probably has bugs anyway; it doesn't account for the
> > folio_referenced() problem).

I thought we reached a consensus on the resolution, by a proposal to remove
folio_referenced_arg.mapcount. Is it not working for some reason?

> >
> > Matthew is trying to solve the same problem with THPs right now: [3].
> > I haven't figured out how we can apply Matthew's approach to HGM
> > right now, but there probably is a way. (If we left the mapcount
> > increment bits in the same place, we couldn't just check the
> > hstate-level PTE; it would have already been made present.)

I'm just worried that (1) this may add yet another dependency to your work
which is still during discussion phase, and (2) whether the folio approach
is easily applicable here, e.g., we may not want to populate all the ptes
for hugetlb HGMs by default.

> >
> > We could:
> > - use the THP-like way and tolerate ~1 second collapses
>
> Another thought here. We don't necessarily *need* to collapse the page
> table mappings in between mmu_notifier_invalidate_range_start() and
> mmu_notifier_invalidate_range_end(), as the pfns aren't changing,
> we aren't punching any holes, and we aren't changing permission bits.
> If we had an MMU notifier that simply informed KVM that we collapsed
> the page tables *after* we finished collapsing, then it would be ok
> for hugetlb_collapse() to be slow.

That's a great point! It'll definitely apply to either approach.

>
> If this MMU notifier is something that makes sense, it probably
> applies to MADV_COLLAPSE for THPs as well.

THPs are definitely different, mmu notifiers should be required there,
afaict. Isn't that what the current code does?

See collapse_and_free_pmd() for shmem and collapse_huge_page() for anon.

>
>
> > - use the (non-RFC) v1 way and tolerate the migration/smaps differences
> > - use the RFC v1 way and tolerate the complicated mapcount accounting
> > - flesh out [3] and see if it can be applied to HGM nicely
> >
> > I'm happy to go with any of these approaches.
> >
> > [1]: https://pastebin.com/raw/hJzFJHiD
> > [2]: https://github.com/48ca/linux/commit/4495f16a09b660aff44b3edcc125aa3a3df85976
> > [3]: https://lore.kernel.org/linux-mm/[email protected]/
>
> - James
>

--
Peter Xu


2023-02-08 00:26:44

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Tue, Feb 7, 2023 at 3:13 PM Peter Xu <[email protected]> wrote:
>
> James,
>
> On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote:
> > > Here is the result: [1] (sorry it took a little while heh). The
>
> Thanks. From what I can tell, that number shows that it'll be great we
> start with your rfcv1 mapcount approach, which mimics what's proposed by
> Matthew for generic folio.

Do you think the RFC v1 way is better than doing the THP-like way
*with the additional MMU notifier*?

>
> > > implementation of the "RFC v1" way is pretty horrible[2] (and this
>
> Any more information on why it's horrible? :)

I figured the code would speak for itself, heh. It's quite complicated.

I really didn't like:
1. The 'inc' business in copy_hugetlb_page_range.
2. How/where I call put_page()/folio_put() to keep the refcount and
mapcount synced up.
3. Having to check the page cache in UFFDIO_CONTINUE.

>
> A quick comment is I'm wondering whether that "whether we should boost the
> mapcount" value can be hidden in hugetlb_pte* so you don't need to pass
> over a lot of bool* deep into the hgm walk routines.

Oh yeah, that's a great idea.

>
> > > implementation probably has bugs anyway; it doesn't account for the
> > > folio_referenced() problem).
>
> I thought we reached a consensus on the resolution, by a proposal to remove
> folio_referenced_arg.mapcount. Is it not working for some reason?

I think that works, I just didn't bother here. I just wanted to show
you approximately what it would look like to implement the RFC v1
approach.

>
> > >
> > > Matthew is trying to solve the same problem with THPs right now: [3].
> > > I haven't figured out how we can apply Matthew's approach to HGM
> > > right now, but there probably is a way. (If we left the mapcount
> > > increment bits in the same place, we couldn't just check the
> > > hstate-level PTE; it would have already been made present.)
>
> I'm just worried that (1) this may add yet another dependency to your work
> which is still during discussion phase, and (2) whether the folio approach
> is easily applicable here, e.g., we may not want to populate all the ptes
> for hugetlb HGMs by default.

That's true. I definitely don't want to wait for this either. It seems
like Matthew's approach won't work very well for us -- when doing a
lot of high-granularity UFFDIO_CONTINUEs on a 1G page, checking all
the PTEs to see if any of them are mapped would get really slow.

>
> > >
> > > We could:
> > > - use the THP-like way and tolerate ~1 second collapses
> >
> > Another thought here. We don't necessarily *need* to collapse the page
> > table mappings in between mmu_notifier_invalidate_range_start() and
> > mmu_notifier_invalidate_range_end(), as the pfns aren't changing,
> > we aren't punching any holes, and we aren't changing permission bits.
> > If we had an MMU notifier that simply informed KVM that we collapsed
> > the page tables *after* we finished collapsing, then it would be ok
> > for hugetlb_collapse() to be slow.
>
> That's a great point! It'll definitely apply to either approach.
>
> >
> > If this MMU notifier is something that makes sense, it probably
> > applies to MADV_COLLAPSE for THPs as well.
>
> THPs are definitely different, mmu notifiers should be required there,
> afaict. Isn't that what the current code does?
>
> See collapse_and_free_pmd() for shmem and collapse_huge_page() for anon.

Oh, yes, of course, MADV_COLLAPSE can actually move things around and
properly make THPs. Thanks. But it would apply if we were only
collapsing PTE-mapped THPs, I think?


- James

2023-02-08 16:17:15

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Tue, Feb 07, 2023 at 04:26:02PM -0800, James Houghton wrote:
> On Tue, Feb 7, 2023 at 3:13 PM Peter Xu <[email protected]> wrote:
> >
> > James,
> >
> > On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote:
> > > > Here is the result: [1] (sorry it took a little while heh). The
> >
> > Thanks. From what I can tell, that number shows that it'll be great we
> > start with your rfcv1 mapcount approach, which mimics what's proposed by
> > Matthew for generic folio.
>
> Do you think the RFC v1 way is better than doing the THP-like way
> *with the additional MMU notifier*?

What's the additional MMU notifier you're referring to?

>
> >
> > > > implementation of the "RFC v1" way is pretty horrible[2] (and this
> >
> > Any more information on why it's horrible? :)
>
> I figured the code would speak for itself, heh. It's quite complicated.
>
> I really didn't like:
> 1. The 'inc' business in copy_hugetlb_page_range.
> 2. How/where I call put_page()/folio_put() to keep the refcount and
> mapcount synced up.
> 3. Having to check the page cache in UFFDIO_CONTINUE.

I think the complexity is one thing which I'm fine with so far. However
when I think again about the things behind that complexity, I noticed there
may be at least one flaw that may not be trivial to work around.

It's about truncation. The problem is now we use the pgtable entry to
represent the mapcount, but the pgtable entry cannot be zapped easily,
unless vma unmapped or collapsed.

It means e.g. truncate_inode_folio() may stop working for hugetlb (of
course, with page lock held). The mappings will be removed for real, but
not the mapcount for HGM anymore, because unmap_mapping_folio() only zaps
the pgtable leaves, not the ones that we used to account for mapcounts.

So the kernel may see weird things, like mapcount>0 after
truncate_inode_folio() being finished completely.

For HGM to do the right thing, we may want to also remove the non-leaf
entries when truncating or doing similar things like a rmap walk to drop
any mappings for a page/folio. Though that's not doable for now because
the locks that truncate_inode_folio() holds are weaker than what we need to free
the pgtable non-leaf entries - we'll need mmap write lock for that, the
same as when we unmap or collapse.

Matthew's design doesn't have such issue if the ptes need to be populated,
because mapcount is still with the leaves; not the case for us here.

If that's the case, _maybe_ we still need to start with the stupid but
working approach of subpage mapcounts.

[...]

> > > > Matthew is trying to solve the same problem with THPs right now: [3].
> > > > I haven't figured out how we can apply Matthew's approach to HGM
> > > > right now, but there probably is a way. (If we left the mapcount
> > > > increment bits in the same place, we couldn't just check the
> > > > hstate-level PTE; it would have already been made present.)
> >
> > I'm just worried that (1) this may add yet another dependency to your work
> > which is still during discussion phase, and (2) whether the folio approach
> > is easily applicable here, e.g., we may not want to populate all the ptes
> > for hugetlb HGMs by default.
>
> That's true. I definitely don't want to wait for this either. It seems
> like Matthew's approach won't work very well for us -- when doing a
> lot of high-granularity UFFDIO_CONTINUEs on a 1G page, checking all
> the PTEs to see if any of them are mapped would get really slow.

I think it'll be a common problem to userfaultfd when it comes, e.g.,
userfaultfd by design is PAGE_SIZE based so far. It needs page size
granule on pgtable manipulations, unless we extend the userfaultfd protocol
to support folios, iiuc.

>
> >
> > > >
> > > > We could:
> > > > - use the THP-like way and tolerate ~1 second collapses
> > >
> > > Another thought here. We don't necessarily *need* to collapse the page
> > > table mappings in between mmu_notifier_invalidate_range_start() and
> > > mmu_notifier_invalidate_range_end(), as the pfns aren't changing,
> > > we aren't punching any holes, and we aren't changing permission bits.
> > > If we had an MMU notifier that simply informed KVM that we collapsed
> > > the page tables *after* we finished collapsing, then it would be ok
> > > for hugetlb_collapse() to be slow.
> >
> > That's a great point! It'll definitely apply to either approach.
> >
> > >
> > > If this MMU notifier is something that makes sense, it probably
> > > applies to MADV_COLLAPSE for THPs as well.
> >
> > THPs are definitely different, mmu notifiers should be required there,
> > afaict. Isn't that what the current code does?
> >
> > See collapse_and_free_pmd() for shmem and collapse_huge_page() for anon.
>
> Oh, yes, of course, MADV_COLLAPSE can actually move things around and
> properly make THPs. Thanks. But it would apply if we were only
> collapsing PTE-mapped THPs, I think?

Yes it applies I think. And if I'm not wrong it's also doing so. :) See
collapse_pte_mapped_thp().

While for anon we always allocate a new page, hence not applicable.

--
Peter Xu


2023-02-09 16:44:27

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Wed, Feb 8, 2023 at 8:16 AM Peter Xu <[email protected]> wrote:
>
> On Tue, Feb 07, 2023 at 04:26:02PM -0800, James Houghton wrote:
> > On Tue, Feb 7, 2023 at 3:13 PM Peter Xu <[email protected]> wrote:
> > >
> > > James,
> > >
> > > On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote:
> > > > > Here is the result: [1] (sorry it took a little while heh). The
> > >
> > > Thanks. From what I can tell, that number shows that it'll be great we
> > > start with your rfcv1 mapcount approach, which mimics what's proposed by
> > > Matthew for generic folio.
> >
> > Do you think the RFC v1 way is better than doing the THP-like way
> > *with the additional MMU notifier*?
>
> What's the additional MMU notifier you're referring to?

An MMU notifier that informs KVM that a collapse has happened without
having to invalidate_range_start() and invalidate_range_end(), the one
you're replying to lower down in the email. :) [ see below... ]

>
> >
> > >
> > > > > implementation of the "RFC v1" way is pretty horrible[2] (and this
> > >
> > > Any more information on why it's horrible? :)
> >
> > I figured the code would speak for itself, heh. It's quite complicated.
> >
> > I really didn't like:
> > 1. The 'inc' business in copy_hugetlb_page_range.
> > 2. How/where I call put_page()/folio_put() to keep the refcount and
> > mapcount synced up.
> > 3. Having to check the page cache in UFFDIO_CONTINUE.
>
> I think the complexity is one thing which I'm fine with so far. However
> when I think again about the things behind that complexity, I noticed there
> may be at least one flaw that may not be trivial to work around.
>
> It's about truncation. The problem is now we use the pgtable entry to
> represent the mapcount, but the pgtable entry cannot be zapped easily,
> unless vma unmapped or collapsed.
>
> It means e.g. truncate_inode_folio() may stop working for hugetlb (of
> course, with page lock held). The mappings will be removed for real, but
> not the mapcount for HGM anymore, because unmap_mapping_folio() only zaps
> the pgtable leaves, not the ones that we used to account for mapcounts.
>
> So the kernel may see weird things, like mapcount>0 after
> truncate_inode_folio() being finished completely.
>
> For HGM to do the right thing, we may want to also remove the non-leaf
> entries when truncating or doing similar things like a rmap walk to drop
> any mappings for a page/folio. Though that's not doable for now because
> the locks that truncate_inode_folio() holds are weaker than what we need to free
> the pgtable non-leaf entries - we'll need mmap write lock for that, the
> same as when we unmap or collapse.
>
> Matthew's design doesn't have such issue if the ptes need to be populated,
> because mapcount is still with the leaves; not the case for us here.
>
> If that's the case, _maybe_ we still need to start with the stupid but
> working approach of subpage mapcounts.

Good point. I can't immediately think of a solution. I would prefer to
go with the subpage mapcount approach to simplify HGM for now;
optimizing mapcount for HugeTLB can then be handled separately. If
you're ok with this, I'll go ahead and send v2.

One way that might be possible: using the PAGE_SPECIAL bit on the
hstate-level PTE to indicate if mapcount has been incremented or not
(if the PTE is pointing to page tables). As far as I can tell,
PAGE_SPECIAL doesn't carry any meaning for HugeTLB PTEs, but we would
need to be careful with existing PTE examination code as to not
misinterpret these PTEs.
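
Roughly (helper names are invented here; pte_special()/pte_mkspecial()
are only available with CONFIG_ARCH_HAS_PTE_SPECIAL, which x86 has):

        /*
         * Has the compound mapcount already been taken for this VMA?
         * Only meaningful for a non-leaf hstate-level entry that
         * points to lower page tables.
         */
        static bool hugetlb_hgm_mapcount_taken(pte_t hstate_pte)
        {
                return pte_special(hstate_pte);
        }

        /* Mark the entry after incrementing the compound mapcount. */
        static pte_t hugetlb_hgm_mark_mapcount_taken(pte_t hstate_pte)
        {
                return pte_mkspecial(hstate_pte);
        }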

>
> [...]
>
> > > > > Matthew is trying to solve the same problem with THPs right now: [3].
> > > > > I haven't figured out how we can apply Matthew's approach to HGM
> > > > > right now, but there probably is a way. (If we left the mapcount
> > > > > increment bits in the same place, we couldn't just check the
> > > > > hstate-level PTE; it would have already been made present.)
> > >
> > > I'm just worried that (1) this may add yet another dependency to your work
> > > which is still during discussion phase, and (2) whether the folio approach
> > > is easily applicable here, e.g., we may not want to populate all the ptes
> > > for hugetlb HGMs by default.
> >
> > That's true. I definitely don't want to wait for this either. It seems
> > like Matthew's approach won't work very well for us -- when doing a
> > lot of high-granularity UFFDIO_CONTINUEs on a 1G page, checking all
> > the PTEs to see if any of them are mapped would get really slow.
>
> I think it'll be a common problem to userfaultfd when it comes, e.g.,
> userfaultfd by design is PAGE_SIZE based so far. It needs page size
> granule on pgtable manipulations, unless we extend the userfaultfd protocol
> to support folios, iiuc.
>
> >
> > >
> > > > >
> > > > > We could:
> > > > > - use the THP-like way and tolerate ~1 second collapses
> > > >
> > > > Another thought here. We don't necessarily *need* to collapse the page
> > > > table mappings in between mmu_notifier_invalidate_range_start() and
> > > > mmu_notifier_invalidate_range_end(), as the pfns aren't changing,
> > > > we aren't punching any holes, and we aren't changing permission bits.
> > > > If we had an MMU notifier that simply informed KVM that we collapsed
> > > > the page tables *after* we finished collapsing, then it would be ok
> > > > for hugetlb_collapse() to be slow.

[ from above... ] This MMU notifier. :)

> > >
> > > That's a great point! It'll definitely apply to either approach.
> > >
> > > >
> > > > If this MMU notifier is something that makes sense, it probably
> > > > applies to MADV_COLLAPSE for THPs as well.
> > >
> > > THPs are definitely different, mmu notifiers should be required there,
> > > afaict. Isn't that what the current code does?
> > >
> > > See collapse_and_free_pmd() for shmem and collapse_huge_page() for anon.
> >
> > Oh, yes, of course, MADV_COLLAPSE can actually move things around and
> > properly make THPs. Thanks. But it would apply if we were only
> > collapsing PTE-mapped THPs, I think?
>
> Yes it applies I think. And if I'm not wrong it's also doing so. :) See
> collapse_pte_mapped_thp().
>
> While for anon we always allocate a new page, hence not applicable.
>
> --
> Peter Xu

Thanks Peter!
- James

2023-02-09 19:11:49

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Thu, Feb 09, 2023 at 08:43:45AM -0800, James Houghton wrote:
> On Wed, Feb 8, 2023 at 8:16 AM Peter Xu <[email protected]> wrote:
> >
> > On Tue, Feb 07, 2023 at 04:26:02PM -0800, James Houghton wrote:
> > > On Tue, Feb 7, 2023 at 3:13 PM Peter Xu <[email protected]> wrote:
> > > >
> > > > James,
> > > >
> > > > On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote:
> > > > > > Here is the result: [1] (sorry it took a little while heh). The
> > > >
> > > > Thanks. From what I can tell, that number shows that it'll be great we
> > > > start with your rfcv1 mapcount approach, which mimics what's proposed by
> > > > Matthew for generic folio.
> > >
> > > Do you think the RFC v1 way is better than doing the THP-like way
> > > *with the additional MMU notifier*?
> >
> > What's the additional MMU notifier you're referring to?
>
> An MMU notifier that informs KVM that a collapse has happened without
> having to invalidate_range_start() and invalidate_range_end(), the one
> you're replying to lower down in the email. :) [ see below... ]

Isn't that something that is needed no matter what mapcount approach we'll
go for? Did I miss something?

>
> >
> > >
> > > >
> > > > > > implementation of the "RFC v1" way is pretty horrible[2] (and this
> > > >
> > > > Any more information on why it's horrible? :)
> > >
> > > I figured the code would speak for itself, heh. It's quite complicated.
> > >
> > > I really didn't like:
> > > 1. The 'inc' business in copy_hugetlb_page_range.
> > > 2. How/where I call put_page()/folio_put() to keep the refcount and
> > > mapcount synced up.
> > > 3. Having to check the page cache in UFFDIO_CONTINUE.
> >
> > I think the complexity is one thing which I'm fine with so far. However
> > when I think again about the things behind that complexity, I noticed there
> > may be at least one flaw that may not be trivial to work around.
> >
> > It's about truncation. The problem is now we use the pgtable entry to
> > represent the mapcount, but the pgtable entry cannot be zapped easily,
> > unless vma unmapped or collapsed.
> >
> > It means e.g. truncate_inode_folio() may stop working for hugetlb (of
> > course, with page lock held). The mappings will be removed for real, but
> > not the mapcount for HGM anymore, because unmap_mapping_folio() only zaps
> > the pgtable leaves, not the ones that we used to account for mapcounts.
> >
> > So the kernel may see weird things, like mapcount>0 after
> > truncate_inode_folio() being finished completely.
> >
> > For HGM to do the right thing, we may want to also remove the non-leaf
> > entries when truncating or doing similar things like a rmap walk to drop
> > any mappings for a page/folio. Though that's not doable for now because
> > the locks that truncate_inode_folio() holds are weaker than what we need to free
> > the pgtable non-leaf entries - we'll need mmap write lock for that, the
> > same as when we unmap or collapse.
> >
> > Matthew's design doesn't have such issue if the ptes need to be populated,
> > because mapcount is still with the leaves; not the case for us here.
> >
> > If that's the case, _maybe_ we still need to start with the stupid but
> > working approach of subpage mapcounts.
>
> Good point. I can't immediately think of a solution. I would prefer to
> go with the subpage mapcount approach to simplify HGM for now;
> optimizing mapcount for HugeTLB can then be handled separately. If
> you're ok with this, I'll go ahead and send v2.

I'm okay with it, but I suggest waiting for at least another day or two
to see whether Mike or others have any comments.

>
> One way that might be possible: using the PAGE_SPECIAL bit on the
> hstate-level PTE to indicate if mapcount has been incremented or not
> (if the PTE is pointing to page tables). As far as I can tell,
> PAGE_SPECIAL doesn't carry any meaning for HugeTLB PTEs, but we would
> need to be careful with existing PTE examination code as to not
> misinterpret these PTEs.

This is an interesting idea. :) Yes I don't see it being used at all in any
pgtable non-leaves.

Then it's about how to let the zap code know when to remove the special
bit, hence the mapcount, because not all of them should.

Maybe it can be passed over as a new zap_flags_t bit?
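
E.g. something like this (flag name and bit are made up here, next to
the existing ZAP_FLAG_* bits in include/linux/mm.h):

        /* Also drop the HGM mapcount carried by the hstate-level entry. */
        #define ZAP_FLAG_HGM_DROP_MAPCOUNT      ((__force zap_flags_t) BIT(2))

        /* Then, in the hugetlb zap path (sketch only): */
        if ((zap_flags & ZAP_FLAG_HGM_DROP_MAPCOUNT) &&
            pte_special(hstate_pte))
                page_remove_rmap(hpage, vma, true);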

Thanks,

--
Peter Xu


2023-02-09 19:50:07

by James Houghton

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Thu, Feb 9, 2023 at 11:11 AM Peter Xu <[email protected]> wrote:
>
> On Thu, Feb 09, 2023 at 08:43:45AM -0800, James Houghton wrote:
> > On Wed, Feb 8, 2023 at 8:16 AM Peter Xu <[email protected]> wrote:
> > >
> > > On Tue, Feb 07, 2023 at 04:26:02PM -0800, James Houghton wrote:
> > > > On Tue, Feb 7, 2023 at 3:13 PM Peter Xu <[email protected]> wrote:
> > > > >
> > > > > James,
> > > > >
> > > > > On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote:
> > > > > > > Here is the result: [1] (sorry it took a little while heh). The
> > > > >
> > > > > Thanks. From what I can tell, that number shows that it'll be great we
> > > > > start with your rfcv1 mapcount approach, which mimics what's proposed by
> > > > > Matthew for generic folio.
> > > >
> > > > Do you think the RFC v1 way is better than doing the THP-like way
> > > > *with the additional MMU notifier*?
> > >
> > > What's the additional MMU notifier you're referring to?
> >
> > An MMU notifier that informs KVM that a collapse has happened without
> > having to invalidate_range_start() and invalidate_range_end(), the one
> > you're replying to lower down in the email. :) [ see below... ]
>
> Isn't that something that is needed no matter what mapcount approach we'll
> go for? Did I miss something?

It's not really needed for anything, but it could be an optimization
for both approaches. However, for the subpage-mapcount approach, it
would have a *huge* impact. That's what I mean.

>
> >
> > >
> > > >
> > > > >
> > > > > > > implementation of the "RFC v1" way is pretty horrible[2] (and this
> > > > >
> > > > > Any more information on why it's horrible? :)
> > > >
> > > > I figured the code would speak for itself, heh. It's quite complicated.
> > > >
> > > > I really didn't like:
> > > > 1. The 'inc' business in copy_hugetlb_page_range.
> > > > 2. How/where I call put_page()/folio_put() to keep the refcount and
> > > > mapcount synced up.
> > > > 3. Having to check the page cache in UFFDIO_CONTINUE.
> > >
> > > I think the complexity is one thing which I'm fine with so far. However
> > > when I think again about the things behind that complexity, I noticed there
> > > may be at least one flaw that may not be trivial to work around.
> > >
> > > It's about truncation. The problem is now we use the pgtable entry to
> > > represent the mapcount, but the pgtable entry cannot be zapped easily,
> > > unless vma unmapped or collapsed.
> > >
> > > It means e.g. truncate_inode_folio() may stop working for hugetlb (of
> > > course, with page lock held). The mappings will be removed for real, but
> > > not the mapcount for HGM anymore, because unmap_mapping_folio() only zaps
> > > the pgtable leaves, not the ones that we used to account for mapcounts.
> > >
> > > So the kernel may see weird things, like mapcount>0 after
> > > truncate_inode_folio() being finished completely.
> > >
> > > For HGM to do the right thing, we may want to also remove the non-leaf
> > > entries when truncating or doing similar things like a rmap walk to drop
> > > any mappings for a page/folio. Though that's not doable for now because
> > > the locks that truncate_inode_folio() holds are weaker than what we need to free
> > > the pgtable non-leaf entries - we'll need mmap write lock for that, the
> > > same as when we unmap or collapse.
> > >
> > > Matthew's design doesn't have such issue if the ptes need to be populated,
> > > because mapcount is still with the leaves; not the case for us here.
> > >
> > > If that's the case, _maybe_ we still need to start with the stupid but
> > > working approach of subpage mapcounts.
> >
> > Good point. I can't immediately think of a solution. I would prefer to
> > go with the subpage mapcount approach to simplify HGM for now;
> > optimizing mapcount for HugeTLB can then be handled separately. If
> > you're ok with this, I'll go ahead and send v2.
>
> I'm okay with it, but I suggest waiting for at least another day or two
> to see whether Mike or others have any comments.

Ok. :)

>
> >
> > One way that might be possible: using the PAGE_SPECIAL bit on the
> > hstate-level PTE to indicate if mapcount has been incremented or not
> > (if the PTE is pointing to page tables). As far as I can tell,
> > PAGE_SPECIAL doesn't carry any meaning for HugeTLB PTEs, but we would
> > need to be careful with existing PTE examination code as to not
> > misinterpret these PTEs.
>
> This is an interesting idea. :) Yes I don't see it being used at all in any
> pgtable non-leaves.
>
> Then it's about how to let the zap code know when to remove the special
> bit, hence the mapcount, because not all of them should.
>
> Maybe it can be passed over as a new zap_flags_t bit?

Here[1] is one way it could be done (it doesn't work 100% correctly,
it's just approximately what we could do). Basically we pass in the
entire range that we are unmapping ("floor" and "ceil"), and if
hugetlb_remove_rmap finds that we're doing the final removal of a page
that we are entirely unmapping (i.e., floor <= (addr &
huge_page_mask(h)) and "ceil" covers the end of the hugepage), we drop
the compound mapcount. Having a zap flag would probably work too.
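
The core of the check is roughly this (a sketch; the commit linked
below differs in the details):

        /*
         * Only drop the compound mapcount when the range being torn
         * down ("floor" to "ceil") covers the whole hugepage.  The
         * real code also has to make sure this only fires once per
         * hugepage, on the final rmap removal within it.
         */
        static bool hugetlb_hgm_covers_hugepage(struct hstate *h,
                                                unsigned long addr,
                                                unsigned long floor,
                                                unsigned long ceil)
        {
                unsigned long hstart = addr & huge_page_mask(h);

                return floor <= hstart &&
                       ceil >= hstart + huge_page_size(h);
        }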

I think something like [1] ought to go in its own series. :)

[1]: https://github.com/48ca/linux/commit/de884eaaadf61b8dcfb1defd99bbf487667e46f4

- James

2023-02-09 20:24:13

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Thu, Feb 09, 2023 at 11:49:25AM -0800, James Houghton wrote:
> On Thu, Feb 9, 2023 at 11:11 AM Peter Xu <[email protected]> wrote:
> >
> > On Thu, Feb 09, 2023 at 08:43:45AM -0800, James Houghton wrote:
> > > On Wed, Feb 8, 2023 at 8:16 AM Peter Xu <[email protected]> wrote:
> > > >
> > > > On Tue, Feb 07, 2023 at 04:26:02PM -0800, James Houghton wrote:
> > > > > On Tue, Feb 7, 2023 at 3:13 PM Peter Xu <[email protected]> wrote:
> > > > > >
> > > > > > James,
> > > > > >
> > > > > > On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote:
> > > > > > > > Here is the result: [1] (sorry it took a little while heh). The
> > > > > >
> > > > > > Thanks. From what I can tell, that number shows that it'll be great we
> > > > > > start with your rfcv1 mapcount approach, which mimics what's proposed by
> > > > > > Matthew for generic folio.
> > > > >
> > > > > Do you think the RFC v1 way is better than doing the THP-like way
> > > > > *with the additional MMU notifier*?
> > > >
> > > > What's the additional MMU notifier you're referring to?
> > >
> > > An MMU notifier that informs KVM that a collapse has happened without
> > > having to invalidate_range_start() and invalidate_range_end(), the one
> > > you're replying to lower down in the email. :) [ see below... ]
> >
> > Isn't that something that is needed no matter what mapcount approach we'll
> > go for? Did I miss something?
>
> It's not really needed for anything, but it could be an optimization
> for both approaches. However, for the subpage-mapcount approach, it
> would have a *huge* impact. That's what I mean.

Ah, okay.

>
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > > > implementation of the "RFC v1" way is pretty horrible[2] (and this
> > > > > >
> > > > > > Any more information on why it's horrible? :)
> > > > >
> > > > > I figured the code would speak for itself, heh. It's quite complicated.
> > > > >
> > > > > I really didn't like:
> > > > > 1. The 'inc' business in copy_hugetlb_page_range.
> > > > > 2. How/where I call put_page()/folio_put() to keep the refcount and
> > > > > mapcount synced up.
> > > > > 3. Having to check the page cache in UFFDIO_CONTINUE.
> > > >
> > > > I think the complexity is one thing which I'm fine with so far. However
> > > > when I think again about the things behind that complexity, I noticed there
> > > > may be at least one flaw that may not be trivial to work around.
> > > >
> > > > It's about truncation. The problem is now we use the pgtable entry to
> > > > represent the mapcount, but the pgtable entry cannot be zapped easily,
> > > > unless vma unmapped or collapsed.
> > > >
> > > > It means e.g. truncate_inode_folio() may stop working for hugetlb (of
> > > > course, with page lock held). The mappings will be removed for real, but
> > > > not the mapcount for HGM anymore, because unmap_mapping_folio() only zaps
> > > > the pgtable leaves, not the ones that we used to account for mapcounts.
> > > >
> > > > So the kernel may see weird things, like mapcount>0 after
> > > > truncate_inode_folio() being finished completely.
> > > >
> > > > For HGM to do the right thing, we may want to also remove the non-leaf
> > > > entries when truncating or doing similar things like a rmap walk to drop
> > > > any mappings for a page/folio. Though that's not doable for now because
> > > > the locks that truncate_inode_folio() holds are weaker than what we need to free
> > > > the pgtable non-leaf entries - we'll need mmap write lock for that, the
> > > > same as when we unmap or collapse.
> > > >
> > > > Matthew's design doesn't have such issue if the ptes need to be populated,
> > > > because mapcount is still with the leaves; not the case for us here.
> > > >
> > > > If that's the case, _maybe_ we still need to start with the stupid but
> > > > working approach of subpage mapcounts.
> > >
> > > Good point. I can't immediately think of a solution. I would prefer to
> > > go with the subpage mapcount approach to simplify HGM for now;
> > > optimizing mapcount for HugeTLB can then be handled separately. If
> > > you're ok with this, I'll go ahead and send v2.
> >
> > I'm okay with it, but I suggest waiting for at least another day or
> > two to see whether Mike or others have any comments.
>
> Ok. :)
>
> >
> > >
> > > One way that might be possible: using the PAGE_SPECIAL bit on the
> > > hstate-level PTE to indicate if mapcount has been incremented or not
> > > (if the PTE is pointing to page tables). As far as I can tell,
> > > PAGE_SPECIAL doesn't carry any meaning for HugeTLB PTEs, but we would
> > > need to be careful with existing PTE examination code as to not
> > > misinterpret these PTEs.
> >
> > This is an interesting idea. :) Yes I don't see it being used at all in any
> > pgtable non-leaves.
> >
> > Then it's about how to let the zap code know when to remove the special
> > bit, hence the mapcount, because not all of them should.
> >
> > Maybe it can be passed over as a new zap_flags_t bit?
>
> Here[1] is one way it could be done (it doesn't work 100% correctly,
> it's just approximately what we could do). Basically we pass in the
> entire range that we are unmapping ("floor" and "ceil"), and if
> hugetlb_remove_rmap finds that we're doing the final removal of a page
> that we are entirely unmapping (i.e., floor <= (addr &
> huge_page_mask(h)) and "ceil" covers the end of the hugepage), we drop
> the compound mapcount. Having a zap flag would probably work too.

Yeah maybe flags are not needed at all. I had a quick glance, looks good
in general.

I think the trick is when it's not unmapped in a single shot. Consider
someone zapping the first half of an HGM-mapped hpage, then the other
half. The range may not always tell the whole story, so rmap might be
left over in some cases.
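
E.g., from userspace (illustration only, assuming HGM makes the partial
DONTNEEDs below zap just the covered small ptes of a 2M hugepage at
"addr"):

        madvise(addr, 1UL << 20, MADV_DONTNEED);               /* first half  */
        madvise(addr + (1UL << 20), 1UL << 20, MADV_DONTNEED); /* second half */

        /*
         * Neither call alone covers the whole hugepage, so a purely
         * range-based check never sees a covering unmap, even though
         * every small pte ends up zapped in the end.
         */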

But maybe it is not a big deal. The only thing I can think of so far is
the partial DONTNEED, but I think maybe it's fine to leave it there until
another, more serious request to either truncate or unmap it. At least
all rmap walks should work as expected.

>
> I think something like [1] ought to go in its own series. :)
>
> [1]: https://github.com/48ca/linux/commit/de884eaaadf61b8dcfb1defd99bbf487667e46f4

Yes I agree it can be worked on top.

--
Peter Xu