2022-10-21 17:22:26

by James Houghton

Subject: [RFC PATCH v2 00/47] hugetlb: introduce HugeTLB high-granularity mapping

This RFC v2 is a more complete and correct implementation of
the original high-granularity mapping RFC[1]. For HGM background and
motivation, please see the original RFC.

This series has changed quite significantly since its first version, so
I've dropped all the Reviewed-bys that it picked up, and I am not
including a full changelog here. Some notable changes:
1. mapcount rules have been simplified (now: the number of times a
hugepage is referenced in page tables, still tracked on the head
page).
2. Synchronizing page table collapsing is now done using the VMA lock
that Mike introduced recently.
3. PTE splitting is only supported for blank PTEs, and it is done
without needing to hold the VMA lock for writing. In many places,
we explicitly check if a PTE has been split from under us.
4. The userspace API has changed slightly.

This series implements high-granularity mapping basics, enough to
support PAGE_SIZE-aligned UFFDIO_CONTINUE operations and MADV_COLLAPSE
for shared HugeTLB VMAs on x86. The main use case for this is post-copy
for virtual machines, one of the important HGM use cases described in
[1]. MADV_COLLAPSE was originally introduced for THPs[2], but it is
now meaningful for HGM, and so I am co-opting the same API.

- Userspace API

There are two main ways userspace interacts with high-granularity
mappings:
1. Create them with UFFDIO_CONTINUE in an appropriately configured
userfaultfd VMA.
2. Collapse high-granularity mappings with MADV_COLLAPSE.

The userfaultfd bits of the userspace API have changed slightly since
RFC v1. To configure a userfaultfd VMA to enable HGM, userspace must
provide UFFD_FEATURE_MINOR_HUGETLBFS_HGM and UFFD_FEATURE_EXACT_ADDRESS
in its call to UFFDIO_API.
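
For illustration, here is a minimal userspace sketch of both interactions
(error handling omitted). It mirrors the selftests later in this series;
the function name hgm_postcopy_example is made up, and MADV_COLLAPSE may
need recent kernel headers to be defined:

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* "mem"/"len" describe a shared HugeTLB mapping of one or more hugepages. */
static void hgm_postcopy_example(char *mem, size_t len)
{
	int uffd = syscall(__NR_userfaultfd, 0);
	size_t off, pagesize = getpagesize();
	struct uffdio_api api = {
		.api = UFFD_API,
		/* Both the HGM and EXACT_ADDRESS features are needed for HGM. */
		.features = UFFD_FEATURE_MINOR_HUGETLBFS |
			    UFFD_FEATURE_MINOR_HUGETLBFS_HGM |
			    UFFD_FEATURE_EXACT_ADDRESS,
	};
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)mem, .len = len },
		.mode = UFFDIO_REGISTER_MODE_MINOR,
	};

	ioctl(uffd, UFFDIO_API, &api);
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* 1. Map the range at PAGE_SIZE granularity, one base page at a time. */
	for (off = 0; off < len; off += pagesize) {
		struct uffdio_continue cont = {
			.range = {
				.start = (unsigned long)mem + off,
				.len = pagesize,
			},
			.mode = 0,
		};

		ioctl(uffd, UFFDIO_CONTINUE, &cont);
	}

	/* 2. Once post-copy is done, collapse back to huge mappings. */
	madvise(mem, len, MADV_COLLAPSE);

	close(uffd);
}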

- A Note About KVM

Normally KVM (as well as any other non-HugeTLB code that assumes that
HugeTLB pages will always be mapped with huge PTEs) would need to be
enlightened to do the correct thing with high-granularity-mapped HugeTLB
pages. It turns out that the x86 TDP MMU already handles HGM mappings
correctly, but other architectures' KVM MMUs, like arm64's, will need to
be updated before HGM can be enabled for those architectures.

- How complete is this series?

I have tested this series with the self-tests that I have modified and
added, and I have run real, large end-to-end migration tests. This
series should be mostly stable, though I haven't tested DAMON and other
pieces that were slightly changed by this series.

There is a bug in the current x86 TDP MMU that prevents MADV_COLLAPSE
from having an effect. That is, the second-stage mappings will remain
small. This will be fixed with [3], so unless you have [3] merged into
your tree, you will see that MADV_COLLAPSE has no effect on virtual
machine performance.

- Future Work

The main areas of future work are:
1) Support more architectures (arm64 support is mostly complete, but
supporting it is not trivial, and to keep this RFC as short as
possible, I will send the arm64 support series separately).
2) Improve performance. Right now we take two per-hpage locks in the
hot path for userfaultfd-based post-copy live migration: the page
lock and the fault mutex. To improve post-copy performance as much
as possible, we likely need to improve this locking strategy.
3) Support PAGE_SIZE poisoning of HugeTLB pages. To provide userspace
with consistent poison behavior whether using MAP_PRIVATE or
MAP_SHARED, more work is needed to implement basic HGM support for
MAP_PRIVATE mappings.

- Patches

Patches 1-4: Cleanup.
Patches 5-6: Extend the HugeTLB shared VMA lock struct.
Patches 7-14: Create hugetlb_pte and implement HGM basics (PT walking,
enabling HGM).
Patches 15-30: Make existing routines compatible with HGM.
Patches 31-35: Extend userfaultfd to support high-granularity CONTINUEs.
Patch 36: Add HugeTLB HGM support to MADV_COLLAPSE.
Patches 37-40: Cleanup, add HGM stats, and enable HGM for x86.
Patches 41-47: Documentation and selftests.

This series is based on mm-everything-2022-10-20-00-43.

Finally, I will be on vacation next week (until Nov 2, unfortunate
timing). I will try to respond before Nov 2; I wanted to get this series
up ASAP.

[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] commit 7d8faaf155454 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
[3] https://lore.kernel.org/kvm/[email protected]/

James Houghton (47):
hugetlb: don't set PageUptodate for UFFDIO_CONTINUE
hugetlb: remove mk_huge_pte; it is unused
hugetlb: remove redundant pte_mkhuge in migration path
hugetlb: only adjust address ranges when VMAs want PMD sharing
hugetlb: make hugetlb_vma_lock_alloc return its failure reason
hugetlb: extend vma lock for shared vmas
hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
hugetlb: add HGM enablement functions
hugetlb: make huge_pte_lockptr take an explicit shift argument.
hugetlb: add hugetlb_pte to track HugeTLB page table entries
hugetlb: add hugetlb_pmd_alloc and hugetlb_pte_alloc
hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
hugetlb: add make_huge_pte_with_shift
hugetlb: make default arch_make_huge_pte understand small mappings
hugetlbfs: for unmapping, treat HGM-mapped pages as potentially mapped
hugetlb: make unmapping compatible with high-granularity mappings
hugetlb: make hugetlb_change_protection compatible with HGM
hugetlb: enlighten follow_hugetlb_page to support HGM
hugetlb: make hugetlb_follow_page_mask HGM-enabled
hugetlb: use struct hugetlb_pte for walk_hugetlb_range
mm: rmap: provide pte_order in page_vma_mapped_walk
mm: rmap: make page_vma_mapped_walk callers use pte_order
rmap: update hugetlb lock comment for HGM
hugetlb: update page_vma_mapped to do high-granularity walks
hugetlb: add HGM support for copy_hugetlb_page_range
hugetlb: make move_hugetlb_page_tables compatible with HGM
hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page
rmap: in try_to_{migrate,unmap}_one, check head page for page flags
hugetlb: add high-granularity migration support
hugetlb: add high-granularity check for hwpoison in fault path
hugetlb: sort hstates in hugetlb_init_hstates
hugetlb: add for_each_hgm_shift
userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
hugetlb: userfaultfd: add support for high-granularity UFFDIO_CONTINUE
userfaultfd: require UFFD_FEATURE_EXACT_ADDRESS when using HugeTLB HGM
hugetlb: add MADV_COLLAPSE for hugetlb
hugetlb: remove huge_pte_lock and huge_pte_lockptr
hugetlb: replace make_huge_pte with make_huge_pte_with_shift
mm: smaps: add stats for HugeTLB mapping size
hugetlb: x86: enable high-granularity mapping
docs: hugetlb: update hugetlb and userfaultfd admin-guides with HGM
info
docs: proc: include information about HugeTLB HGM
selftests/vm: add HugeTLB HGM to userfaultfd selftest
selftests/kvm: add HugeTLB HGM to KVM demand paging selftest
selftests/vm: add anon and shared hugetlb to migration test
selftests/vm: add hugetlb HGM test to migration selftest
selftests/vm: add HGM UFFDIO_CONTINUE and hwpoison tests

Documentation/admin-guide/mm/hugetlbpage.rst | 4 +
Documentation/admin-guide/mm/userfaultfd.rst | 16 +-
Documentation/filesystems/proc.rst | 56 +-
arch/powerpc/mm/pgtable.c | 3 +-
arch/s390/include/asm/hugetlb.h | 5 -
arch/s390/mm/gmap.c | 20 +-
arch/x86/Kconfig | 1 +
fs/Kconfig | 7 +
fs/hugetlbfs/inode.c | 27 +-
fs/proc/task_mmu.c | 184 ++-
fs/userfaultfd.c | 56 +-
include/asm-generic/hugetlb.h | 5 -
include/asm-generic/tlb.h | 6 +-
include/linux/huge_mm.h | 12 +-
include/linux/hugetlb.h | 173 ++-
include/linux/pagewalk.h | 11 +-
include/linux/rmap.h | 5 +
include/linux/swapops.h | 8 +-
include/linux/userfaultfd_k.h | 7 +
include/uapi/linux/userfaultfd.h | 2 +
mm/damon/vaddr.c | 57 +-
mm/debug_vm_pgtable.c | 2 +-
mm/hmm.c | 21 +-
mm/hugetlb.c | 1209 ++++++++++++++---
mm/khugepaged.c | 4 +-
mm/madvise.c | 24 +-
mm/memory-failure.c | 17 +-
mm/mempolicy.c | 28 +-
mm/migrate.c | 20 +-
mm/mincore.c | 17 +-
mm/mprotect.c | 18 +-
mm/page_vma_mapped.c | 60 +-
mm/pagewalk.c | 32 +-
mm/rmap.c | 102 +-
mm/userfaultfd.c | 46 +-
.../selftests/kvm/demand_paging_test.c | 20 +-
.../testing/selftests/kvm/include/test_util.h | 2 +
tools/testing/selftests/kvm/lib/kvm_util.c | 2 +-
tools/testing/selftests/kvm/lib/test_util.c | 14 +
tools/testing/selftests/vm/Makefile | 1 +
tools/testing/selftests/vm/hugetlb-hgm.c | 326 +++++
tools/testing/selftests/vm/migration.c | 222 ++-
tools/testing/selftests/vm/userfaultfd.c | 90 +-
43 files changed, 2449 insertions(+), 493 deletions(-)
create mode 100644 tools/testing/selftests/vm/hugetlb-hgm.c

--
2.38.0.135.g90850a2211-goog


2022-10-21 17:22:33

by James Houghton

Subject: [RFC PATCH v2 46/47] selftests/vm: add hugetlb HGM test to migration selftest

This is mostly the same as the shared HugeTLB case, but instead of
mapping the page with a regular page fault, we map it with lots of
UFFDIO_CONTINUE operations. We also verify that the contents haven't
changed after the migration, which could happen if the
post-migration PTEs pointed to the wrong page.

Signed-off-by: James Houghton <[email protected]>
---
tools/testing/selftests/vm/migration.c | 139 +++++++++++++++++++++++++
1 file changed, 139 insertions(+)

diff --git a/tools/testing/selftests/vm/migration.c b/tools/testing/selftests/vm/migration.c
index 21577a84d7e4..89cb5934f139 100644
--- a/tools/testing/selftests/vm/migration.c
+++ b/tools/testing/selftests/vm/migration.c
@@ -14,6 +14,11 @@
#include <signal.h>
#include <time.h>
#include <sys/statfs.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <linux/userfaultfd.h>
+#include <sys/syscall.h>
+#include <fcntl.h>

#define TWOMEG (2<<20)
#define RUNTIME (60)
@@ -265,4 +270,138 @@ TEST_F_TIMEOUT(migration, shared_hugetlb, 2*RUNTIME)
close(fd);
}

+#ifdef __NR_userfaultfd
+static int map_at_high_granularity(char *mem, size_t length)
+{
+ int i;
+ int ret;
+ int uffd = syscall(__NR_userfaultfd, 0);
+ struct uffdio_api api;
+ struct uffdio_register reg;
+ int pagesize = getpagesize();
+
+ if (uffd < 0) {
+ perror("couldn't create uffd");
+ return uffd;
+ }
+
+ api.api = UFFD_API;
+ api.features = UFFD_FEATURE_MISSING_HUGETLBFS
+ | UFFD_FEATURE_MINOR_HUGETLBFS
+ | UFFD_FEATURE_MINOR_HUGETLBFS_HGM;
+
+ ret = ioctl(uffd, UFFDIO_API, &api);
+ if (ret || api.api != UFFD_API) {
+ perror("UFFDIO_API failed");
+ goto out;
+ }
+
+ reg.range.start = (unsigned long)mem;
+ reg.range.len = length;
+
+ reg.mode = UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_MINOR;
+
+ ret = ioctl(uffd, UFFDIO_REGISTER, &reg);
+ if (ret) {
+ perror("UFFDIO_REGISTER failed");
+ goto out;
+ }
+
+ /* UFFDIO_CONTINUE each 4K segment of the 2M page. */
+ for (i = 0; i < length/pagesize; ++i) {
+ struct uffdio_continue cont;
+
+ cont.range.start = (unsigned long long)mem + i * pagesize;
+ cont.range.len = pagesize;
+ cont.mode = 0;
+ ret = ioctl(uffd, UFFDIO_CONTINUE, &cont);
+ if (ret) {
+ fprintf(stderr, "UFFDIO_CONTINUE failed "
+ "for %llx -> %llx: %d\n",
+ cont.range.start,
+ cont.range.start + cont.range.len,
+ errno);
+ goto out;
+ }
+ }
+ ret = 0;
+out:
+ close(uffd);
+ return ret;
+}
+#else
+static int map_at_high_granularity(char *mem, size_t length)
+{
+ fprintf(stderr, "Userfaultfd missing\n");
+ return -1;
+}
+#endif /* __NR_userfaultfd */
+
+/*
+ * Tests the high-granularity hugetlb migration entry paths.
+ */
+TEST_F_TIMEOUT(migration, shared_hugetlb_hgm, 2*RUNTIME)
+{
+ uint64_t *ptr;
+ int i;
+ int fd;
+ unsigned long sz;
+ struct statfs filestat;
+
+ if (self->nthreads < 2 || self->n1 < 0 || self->n2 < 0)
+ SKIP(return, "Not enough threads or NUMA nodes available");
+
+ fd = memfd_create("tmp_hugetlb", MFD_HUGETLB);
+ if (fd < 0)
+ SKIP(return, "Couldn't create hugetlb memfd");
+
+ if (fstatfs(fd, &filestat) < 0)
+ SKIP(return, "Couldn't fstatfs hugetlb file");
+
+ sz = filestat.f_bsize;
+
+ if (ftruncate(fd, sz))
+ SKIP(return, "Couldn't allocate hugetlb pages");
+
+ if (fallocate(fd, 0, 0, sz) < 0) {
+ perror("fallocate failed");
+ SKIP(return, "fallocate failed");
+ }
+
+ ptr = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (ptr == MAP_FAILED)
+ SKIP(return, "Could not allocate hugetlb pages");
+
+ /*
+ * We have to call map_at_high_granularity() before we memset();
+ * otherwise memset() would map everything at the hugepage size.
+ */
+ if (map_at_high_granularity((char *)ptr, sz) < 0)
+ SKIP(return, "Could not map HugeTLB range at high granularity");
+
+ /* Populate the page we're migrating. */
+ for (i = 0; i < sz/sizeof(*ptr); ++i)
+ ptr[i] = i;
+
+ for (i = 0; i < self->nthreads - 1; i++)
+ if (pthread_create(&self->threads[i], NULL, access_mem, ptr))
+ perror("Couldn't create thread");
+
+ ASSERT_EQ(migrate(ptr, self->n1, self->n2, 10), 0);
+ for (i = 0; i < self->nthreads - 1; i++) {
+ ASSERT_EQ(pthread_cancel(self->threads[i]), 0);
+ pthread_join(self->threads[i], NULL);
+ }
+
+ /* Check that the contents didn't change. */
+ for (i = 0; i < sz/sizeof(*ptr); ++i) {
+ ASSERT_EQ(ptr[i], i);
+ if (ptr[i] != i)
+ break;
+ }
+
+ ftruncate(fd, 0);
+ close(fd);
+}
+
TEST_HARNESS_MAIN
--
2.38.0.135.g90850a2211-goog

2022-10-21 17:22:36

by James Houghton

Subject: [RFC PATCH v2 37/47] hugetlb: remove huge_pte_lock and huge_pte_lockptr

They are replaced with hugetlb_pte_lock{,ptr}. Any callers that haven't
already been converted are never reached when HGM is in use, so we handle
them by populating a hugetlb_pte with the standard, hstate-sized huge PTE.

Signed-off-by: James Houghton <[email protected]>
---
include/linux/hugetlb.h | 28 +++-------------------------
mm/hugetlb.c | 15 ++++++++++-----
2 files changed, 13 insertions(+), 30 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 5378b98cc7b8..e6dc25b15403 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1015,14 +1015,6 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
return modified_mask;
}

-static inline spinlock_t *huge_pte_lockptr(unsigned int shift,
- struct mm_struct *mm, pte_t *pte)
-{
- if (shift == PMD_SHIFT)
- return pmd_lockptr(mm, (pmd_t *) pte);
- return &mm->page_table_lock;
-}
-
#ifndef hugepages_supported
/*
* Some platform decide whether they support huge pages at boot
@@ -1226,12 +1218,6 @@ static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
return 0;
}

-static inline spinlock_t *huge_pte_lockptr(unsigned int shift,
- struct mm_struct *mm, pte_t *pte)
-{
- return &mm->page_table_lock;
-}
-
static inline void hugetlb_count_init(struct mm_struct *mm)
{
}
@@ -1307,16 +1293,6 @@ int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
}
#endif

-static inline spinlock_t *huge_pte_lock(struct hstate *h,
- struct mm_struct *mm, pte_t *pte)
-{
- spinlock_t *ptl;
-
- ptl = huge_pte_lockptr(huge_page_shift(h), mm, pte);
- spin_lock(ptl);
- return ptl;
-}
-
static inline
spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
{
@@ -1324,7 +1300,9 @@ spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
BUG_ON(!hpte->ptep);
if (hpte->ptl)
return hpte->ptl;
- return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm, hpte->ptep);
+ if (hugetlb_pte_level(hpte) == HUGETLB_LEVEL_PMD)
+ return pmd_lockptr(mm, (pmd_t *) hpte->ptep);
+ return &mm->page_table_lock;
}

static inline
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d80db81a1fa5..9d4e41c41f78 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5164,9 +5164,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
put_page(hpage);

/* Install the new huge page if src pte stable */
- dst_ptl = huge_pte_lock(h, dst, dst_pte);
- src_ptl = huge_pte_lockptr(huge_page_shift(h),
- src, src_pte);
+ dst_ptl = hugetlb_pte_lock(dst, &dst_hpte);
+ src_ptl = hugetlb_pte_lockptr(src, &src_hpte);
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
entry = huge_ptep_get(src_pte);
if (!pte_same(src_pte_old, entry)) {
@@ -7465,6 +7464,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
pte_t *spte = NULL;
pte_t *pte;
spinlock_t *ptl;
+ struct hugetlb_pte hpte;

i_mmap_lock_read(mapping);
vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
@@ -7485,7 +7485,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
if (!spte)
goto out;

- ptl = huge_pte_lock(hstate_vma(vma), mm, spte);
+ hugetlb_pte_populate(&hpte, (pte_t *)pud, PUD_SHIFT, HUGETLB_LEVEL_PUD);
+ ptl = hugetlb_pte_lock(mm, &hpte);
if (pud_none(*pud)) {
pud_populate(mm, pud,
(pmd_t *)((unsigned long)spte & PAGE_MASK));
@@ -8179,6 +8180,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
unsigned long address, start, end;
spinlock_t *ptl;
pte_t *ptep;
+ struct hugetlb_pte hpte;

if (!(vma->vm_flags & VM_MAYSHARE))
return;
@@ -8203,7 +8205,10 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
ptep = huge_pte_offset(mm, address, sz);
if (!ptep)
continue;
- ptl = huge_pte_lock(h, mm, ptep);
+
+ hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h),
+ hpage_size_to_level(sz));
+ ptl = hugetlb_pte_lock(mm, &hpte);
huge_pmd_unshare(mm, vma, address, ptep);
spin_unlock(ptl);
}
--
2.38.0.135.g90850a2211-goog

2022-10-21 17:22:55

by James Houghton

Subject: [RFC PATCH v2 38/47] hugetlb: replace make_huge_pte with make_huge_pte_with_shift

This removes the old definition of make_huge_pte; the shift must now
always be given explicitly. All call sites are cleaned up.

Signed-off-by: James Houghton <[email protected]>
---
mm/hugetlb.c | 31 ++++++++++++-------------------
1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9d4e41c41f78..b26142bec4fe 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4908,9 +4908,9 @@ const struct vm_operations_struct hugetlb_vm_ops = {
.pagesize = hugetlb_vm_op_pagesize,
};

-static pte_t make_huge_pte_with_shift(struct vm_area_struct *vma,
- struct page *page, int writable,
- int shift)
+static pte_t make_huge_pte(struct vm_area_struct *vma,
+ struct page *page, int writable,
+ int shift)
{
pte_t entry;

@@ -4926,14 +4926,6 @@ static pte_t make_huge_pte_with_shift(struct vm_area_struct *vma,
return entry;
}

-static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
- int writable)
-{
- unsigned int shift = huge_page_shift(hstate_vma(vma));
-
- return make_huge_pte_with_shift(vma, page, writable, shift);
-}
-
static void set_huge_ptep_writable(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)
{
@@ -4974,10 +4966,12 @@ static void
hugetlb_install_page(struct vm_area_struct *vma, pte_t *ptep, unsigned long addr,
struct page *new_page)
{
+ struct hstate *h = hstate_vma(vma);
__SetPageUptodate(new_page);
hugepage_add_new_anon_rmap(new_page, vma, addr);
- set_huge_pte_at(vma->vm_mm, addr, ptep, make_huge_pte(vma, new_page, 1));
- hugetlb_count_add(pages_per_huge_page(hstate_vma(vma)), vma->vm_mm);
+ set_huge_pte_at(vma->vm_mm, addr, ptep, make_huge_pte(vma, new_page, 1,
+ huge_page_shift(h)));
+ hugetlb_count_add(pages_per_huge_page(h), vma->vm_mm);
ClearHPageRestoreReserve(new_page);
SetHPageMigratable(new_page);
}
@@ -5737,7 +5731,8 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
page_remove_rmap(old_page, vma, true);
hugepage_add_new_anon_rmap(new_page, vma, haddr);
set_huge_pte_at(mm, haddr, ptep,
- make_huge_pte(vma, new_page, !unshare));
+ make_huge_pte(vma, new_page, !unshare,
+ huge_page_shift(h)));
SetHPageMigratable(new_page);
/* Make the old page be freed below */
new_page = old_page;
@@ -6033,7 +6028,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
page_dup_file_rmap(page, true);

subpage = hugetlb_find_subpage(h, page, haddr_hgm);
- new_pte = make_huge_pte_with_shift(vma, subpage,
+ new_pte = make_huge_pte(vma, subpage,
((vma->vm_flags & VM_WRITE)
&& (vma->vm_flags & VM_SHARED)),
hpte->shift);
@@ -6481,8 +6476,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
subpage = hugetlb_find_subpage(h, page, dst_addr);
WARN_ON_ONCE(subpage != page && !hugetlb_hgm_enabled(dst_vma));

- _dst_pte = make_huge_pte_with_shift(dst_vma, subpage, writable,
- dst_hpte->shift);
+ _dst_pte = make_huge_pte(dst_vma, subpage, writable, dst_hpte->shift);
/*
* Always mark UFFDIO_COPY page dirty; note that this may not be
* extremely important for hugetlbfs for now since swapping is not
@@ -8044,8 +8038,7 @@ int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
page_dup_file_rmap(hpage, true);

subpage = hugetlb_find_subpage(h, hpage, curr);
- entry = make_huge_pte_with_shift(vma, subpage,
- writable, hpte.shift);
+ entry = make_huge_pte(vma, subpage, writable, hpte.shift);
set_huge_pte_at(mm, curr, hpte.ptep, entry);
next_hpte:
curr += hugetlb_pte_size(&hpte);
--
2.38.0.135.g90850a2211-goog

2022-10-21 17:23:17

by James Houghton

Subject: [RFC PATCH v2 20/47] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

The main change in this commit is to walk_hugetlb_range, which now
supports walking HGM mappings; all walk_hugetlb_range callers must be
updated to use the new API and take the correct action (a sketch of a
converted callback follows the list of callers below).

Listing all the changes to the callers:

For s390 changes, we simply ignore HGM PTEs (we don't support s390 yet).

For smaps, shared_hugetlb (and private_hugetlb, although private
mappings don't support HGM) may now not be divisible by the hugepage
size. The appropriate changes have been made to support analyzing HGM
PTEs.

For pagemap, we ignore non-leaf PTEs by treating that as if they were
none PTEs. We can only end up with non-leaf PTEs if they had just been
updated from a none PTE.

For show_numa_map, the challenge is that, if any part of a hugepage is
mapped, we have to count that entire page exactly once, as the results
are given in units of hugepages. To support HGM mappings, we keep track
of the last page that we looked at. If the hugepage we are currently
looking at is the same as the last one, then we are looking at a page
that has been mapped at high granularity, and we've already accounted
for it.

For DAMON, we treat non-leaf PTEs as if they were blank, for the same
reason as pagemap.

For hwpoison, we proactively update the logic to support the case when
hpte is pointing to a subpage within the poisoned hugepage.

For queue_pages_hugetlb/migration, we ignore all HGM-enabled VMAs for
now.

For mincore, we ignore non-leaf PTEs for the same reason as pagemap.

For mprotect/prot_none_hugetlb_entry, we retry the walk when we get a
non-leaf PTE.
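
As a reference for the converted callers, this is roughly the shape a
hugetlb_entry callback takes under the new API (the callback name is
hypothetical; the helpers are the ones introduced earlier in this series):

static int example_hugetlb_entry(struct hugetlb_pte *hpte, unsigned long addr,
				 struct mm_walk *walk)
{
	pte_t pte = huge_ptep_get(hpte->ptep);

	if (pte_present(pte) && !hugetlb_pte_present_leaf(hpte, pte))
		/* Raced with the split of a blank PTE; treat it as none. */
		return 0;

	/* Handle the mapping covering [addr, addr + hugetlb_pte_size(hpte)). */
	return 0;
}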

Signed-off-by: James Houghton <[email protected]>
---
arch/s390/mm/gmap.c | 20 ++++++++--
fs/proc/task_mmu.c | 83 +++++++++++++++++++++++++++++-----------
include/linux/pagewalk.h | 11 ++++--
mm/damon/vaddr.c | 57 +++++++++++++++++----------
mm/hmm.c | 21 ++++++----
mm/memory-failure.c | 17 ++++----
mm/mempolicy.c | 12 ++++--
mm/mincore.c | 17 ++++++--
mm/mprotect.c | 18 ++++++---
mm/pagewalk.c | 32 +++++++++++++---
10 files changed, 203 insertions(+), 85 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 02d15c8dc92e..d65c15b5dccb 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2622,13 +2622,25 @@ static int __s390_enable_skey_pmd(pmd_t *pmd, unsigned long addr,
return 0;
}

-static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
- unsigned long hmask, unsigned long next,
+static int __s390_enable_skey_hugetlb(struct hugetlb_pte *hpte,
+ unsigned long addr,
struct mm_walk *walk)
{
- pmd_t *pmd = (pmd_t *)pte;
+ struct hstate *h = hstate_vma(walk->vma);
+ pmd_t *pmd;
unsigned long start, end;
- struct page *page = pmd_page(*pmd);
+ struct page *page;
+
+ if (huge_page_size(h) != hugetlb_pte_size(hpte))
+ /* Ignore high-granularity PTEs. */
+ return 0;
+
+ if (!pte_present(huge_ptep_get(hpte->ptep)))
+ /* Ignore non-present PTEs. */
+ return 0;
+
+ pmd = (pmd_t *)hpte->ptep;
+ page = pmd_page(*pmd);

/*
* The write check makes sure we do not set a key on shared
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8a74cdcc9af0..be78cdb7677e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -720,18 +720,28 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
}

#ifdef CONFIG_HUGETLB_PAGE
-static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long end,
- struct mm_walk *walk)
+static int smaps_hugetlb_range(struct hugetlb_pte *hpte,
+ unsigned long addr,
+ struct mm_walk *walk)
{
struct mem_size_stats *mss = walk->private;
struct vm_area_struct *vma = walk->vma;
struct page *page = NULL;
+ pte_t pte = huge_ptep_get(hpte->ptep);

- if (pte_present(*pte)) {
- page = vm_normal_page(vma, addr, *pte);
- } else if (is_swap_pte(*pte)) {
- swp_entry_t swpent = pte_to_swp_entry(*pte);
+ if (pte_present(pte)) {
+ /* We only care about leaf-level PTEs. */
+ if (!hugetlb_pte_present_leaf(hpte, pte))
+ /*
+ * The only case where hpte is not a leaf is when it
+ * was originally none and has since been split from
+ * under us, so exclude it as we would a none PTE.
+ */
+ return 0;
+
+ page = vm_normal_page(vma, addr, pte);
+ } else if (is_swap_pte(pte)) {
+ swp_entry_t swpent = pte_to_swp_entry(pte);

if (is_pfn_swap_entry(swpent))
page = pfn_swap_entry_to_page(swpent);
@@ -740,9 +750,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
int mapcount = page_mapcount(page);

if (mapcount >= 2)
- mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
+ mss->shared_hugetlb += hugetlb_pte_size(hpte);
else
- mss->private_hugetlb += huge_page_size(hstate_vma(vma));
+ mss->private_hugetlb += hugetlb_pte_size(hpte);
}
return 0;
}
@@ -1561,22 +1571,31 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,

#ifdef CONFIG_HUGETLB_PAGE
/* This function walks within one hugetlb entry in the single call */
-static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
- unsigned long addr, unsigned long end,
+static int pagemap_hugetlb_range(struct hugetlb_pte *hpte,
+ unsigned long addr,
struct mm_walk *walk)
{
struct pagemapread *pm = walk->private;
struct vm_area_struct *vma = walk->vma;
u64 flags = 0, frame = 0;
int err = 0;
- pte_t pte;
+ unsigned long hmask = hugetlb_pte_mask(hpte);
+ unsigned long end = addr + hugetlb_pte_size(hpte);
+ pte_t pte = huge_ptep_get(hpte->ptep);
+ struct page *page;

if (vma->vm_flags & VM_SOFTDIRTY)
flags |= PM_SOFT_DIRTY;

- pte = huge_ptep_get(ptep);
if (pte_present(pte)) {
- struct page *page = pte_page(pte);
+ /*
+ * We raced with this PTE being split, which can only happen if
+ * it was blank before. Treat it as if it were blank.
+ */
+ if (!hugetlb_pte_present_leaf(hpte, pte))
+ return 0;
+
+ page = pte_page(pte);

if (!PageAnon(page))
flags |= PM_FILE;
@@ -1857,10 +1876,16 @@ static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
}
#endif

+struct show_numa_map_private {
+ struct numa_maps *md;
+ struct page *last_page;
+};
+
static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)
{
- struct numa_maps *md = walk->private;
+ struct show_numa_map_private *priv = walk->private;
+ struct numa_maps *md = priv->md;
struct vm_area_struct *vma = walk->vma;
spinlock_t *ptl;
pte_t *orig_pte;
@@ -1872,6 +1897,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
struct page *page;

page = can_gather_numa_stats_pmd(*pmd, vma, addr);
+ priv->last_page = page;
if (page)
gather_stats(page, md, pmd_dirty(*pmd),
HPAGE_PMD_SIZE/PAGE_SIZE);
@@ -1885,6 +1911,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
do {
struct page *page = can_gather_numa_stats(*pte, vma, addr);
+ priv->last_page = page;
if (!page)
continue;
gather_stats(page, md, pte_dirty(*pte), 1);
@@ -1895,19 +1922,25 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
return 0;
}
#ifdef CONFIG_HUGETLB_PAGE
-static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long end, struct mm_walk *walk)
+static int gather_hugetlb_stats(struct hugetlb_pte *hpte, unsigned long addr,
+ struct mm_walk *walk)
{
- pte_t huge_pte = huge_ptep_get(pte);
+ struct show_numa_map_private *priv = walk->private;
+ pte_t huge_pte = huge_ptep_get(hpte->ptep);
struct numa_maps *md;
struct page *page;

- if (!pte_present(huge_pte))
+ if (!hugetlb_pte_present_leaf(hpte, huge_pte))
+ return 0;
+
+ page = compound_head(pte_page(huge_pte));
+ if (priv->last_page == page)
+ /* we've already accounted for this page */
return 0;

- page = pte_page(huge_pte);
+ priv->last_page = page;

- md = walk->private;
+ md = priv->md;
gather_stats(page, md, pte_dirty(huge_pte), 1);
return 0;
}
@@ -1937,9 +1970,15 @@ static int show_numa_map(struct seq_file *m, void *v)
struct file *file = vma->vm_file;
struct mm_struct *mm = vma->vm_mm;
struct mempolicy *pol;
+
char buffer[64];
int nid;

+ struct show_numa_map_private numa_map_private;
+
+ numa_map_private.md = md;
+ numa_map_private.last_page = NULL;
+
if (!mm)
return 0;

@@ -1969,7 +2008,7 @@ static int show_numa_map(struct seq_file *m, void *v)
seq_puts(m, " huge");

/* mmap_lock is held by m_start */
- walk_page_vma(vma, &show_numa_ops, md);
+ walk_page_vma(vma, &show_numa_ops, &numa_map_private);

if (!md->pages)
goto out;
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 2f8f6cc980b4..7ed065ea5dba 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -3,6 +3,7 @@
#define _LINUX_PAGEWALK_H

#include <linux/mm.h>
+#include <linux/hugetlb.h>

struct mm_walk;

@@ -21,7 +22,10 @@ struct mm_walk;
* depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD.
* Any folded depths (where PTRS_PER_P?D is equal to 1)
* are skipped.
- * @hugetlb_entry: if set, called for each hugetlb entry
+ * @hugetlb_entry: if set, called for each hugetlb entry. In the presence
+ * of high-granularity hugetlb entries, @hugetlb_entry is
+ * called only for leaf-level entries (i.e., hstate-level
+ * page table entries are ignored if they are not leaves).
* @test_walk: caller specific callback function to determine whether
* we walk over the current vma or not. Returning 0 means
* "do page table walk over the current vma", returning
@@ -47,9 +51,8 @@ struct mm_walk_ops {
unsigned long next, struct mm_walk *walk);
int (*pte_hole)(unsigned long addr, unsigned long next,
int depth, struct mm_walk *walk);
- int (*hugetlb_entry)(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long next,
- struct mm_walk *walk);
+ int (*hugetlb_entry)(struct hugetlb_pte *hpte,
+ unsigned long addr, struct mm_walk *walk);
int (*test_walk)(unsigned long addr, unsigned long next,
struct mm_walk *walk);
int (*pre_vma)(unsigned long start, unsigned long end,
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 15f03df66db6..42845e1b560d 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -330,48 +330,55 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
}

#ifdef CONFIG_HUGETLB_PAGE
-static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
+static void damon_hugetlb_mkold(struct hugetlb_pte *hpte, pte_t entry,
+ struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long addr)
{
bool referenced = false;
- pte_t entry = huge_ptep_get(pte);
struct page *page = pte_page(entry);
+ struct page *hpage = compound_head(page);

- get_page(page);
+ get_page(hpage);

if (pte_young(entry)) {
referenced = true;
entry = pte_mkold(entry);
- set_huge_pte_at(mm, addr, pte, entry);
+ set_huge_pte_at(mm, addr, hpte->ptep, entry);
}

#ifdef CONFIG_MMU_NOTIFIER
if (mmu_notifier_clear_young(mm, addr,
- addr + huge_page_size(hstate_vma(vma))))
+ addr + hugetlb_pte_size(hpte)))
referenced = true;
#endif /* CONFIG_MMU_NOTIFIER */

if (referenced)
- set_page_young(page);
+ set_page_young(hpage);

- set_page_idle(page);
- put_page(page);
+ set_page_idle(hpage);
+ put_page(hpage);
}

-static int damon_mkold_hugetlb_entry(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long end,
+static int damon_mkold_hugetlb_entry(struct hugetlb_pte *hpte,
+ unsigned long addr,
struct mm_walk *walk)
{
- struct hstate *h = hstate_vma(walk->vma);
spinlock_t *ptl;
pte_t entry;

- ptl = huge_pte_lock(h, walk->mm, pte);
- entry = huge_ptep_get(pte);
+ ptl = hugetlb_pte_lock(walk->mm, hpte);
+ entry = huge_ptep_get(hpte->ptep);
if (!pte_present(entry))
goto out;

- damon_hugetlb_mkold(pte, walk->mm, walk->vma, addr);
+ if (!hugetlb_pte_present_leaf(hpte, entry))
+ /*
+ * We raced with someone splitting a blank PTE. Treat this PTE
+ * as if it were blank.
+ */
+ goto out;
+
+ damon_hugetlb_mkold(hpte, entry, walk->mm, walk->vma, addr);

out:
spin_unlock(ptl);
@@ -484,31 +491,39 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
}

#ifdef CONFIG_HUGETLB_PAGE
-static int damon_young_hugetlb_entry(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long end,
+static int damon_young_hugetlb_entry(struct hugetlb_pte *hpte,
+ unsigned long addr,
struct mm_walk *walk)
{
struct damon_young_walk_private *priv = walk->private;
struct hstate *h = hstate_vma(walk->vma);
- struct page *page;
+ struct page *page, *hpage;
spinlock_t *ptl;
pte_t entry;

- ptl = huge_pte_lock(h, walk->mm, pte);
- entry = huge_ptep_get(pte);
+ ptl = hugetlb_pte_lock(walk->mm, hpte);
+ entry = huge_ptep_get(hpte->ptep);
if (!pte_present(entry))
goto out;

+ if (!hugetlb_pte_present_leaf(hpte, entry))
+ /*
+ * We raced with someone splitting a blank PTE. Treat this PTE
+ * as if it were blank.
+ */
+ goto out;
+
page = pte_page(entry);
- get_page(page);
+ hpage = compound_head(page);
+ get_page(hpage);

- if (pte_young(entry) || !page_is_idle(page) ||
+ if (pte_young(entry) || !page_is_idle(hpage) ||
mmu_notifier_test_young(walk->mm, addr)) {
*priv->page_sz = huge_page_size(h);
priv->young = true;
}

- put_page(page);
+ put_page(hpage);

out:
spin_unlock(ptl);
diff --git a/mm/hmm.c b/mm/hmm.c
index 3850fb625dda..76679b46ad5e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -469,27 +469,34 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
#endif

#ifdef CONFIG_HUGETLB_PAGE
-static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
- unsigned long start, unsigned long end,
+static int hmm_vma_walk_hugetlb_entry(struct hugetlb_pte *hpte,
+ unsigned long start,
struct mm_walk *walk)
{
unsigned long addr = start, i, pfn;
struct hmm_vma_walk *hmm_vma_walk = walk->private;
struct hmm_range *range = hmm_vma_walk->range;
- struct vm_area_struct *vma = walk->vma;
unsigned int required_fault;
unsigned long pfn_req_flags;
unsigned long cpu_flags;
+ unsigned long hmask = hugetlb_pte_mask(hpte);
+ unsigned int order = hugetlb_pte_shift(hpte) - PAGE_SHIFT;
+ unsigned long end = start + hugetlb_pte_size(hpte);
spinlock_t *ptl;
pte_t entry;

- ptl = huge_pte_lock(hstate_vma(vma), walk->mm, pte);
- entry = huge_ptep_get(pte);
+ ptl = hugetlb_pte_lock(walk->mm, hpte);
+ entry = huge_ptep_get(hpte->ptep);
+
+ if (!hugetlb_pte_present_leaf(hpte, entry)) {
+ spin_unlock(ptl);
+ return -EAGAIN;
+ }

i = (start - range->start) >> PAGE_SHIFT;
pfn_req_flags = range->hmm_pfns[i];
cpu_flags = pte_to_hmm_pfn_flags(range, entry) |
- hmm_pfn_flags_order(huge_page_order(hstate_vma(vma)));
+ hmm_pfn_flags_order(order);
required_fault =
hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, cpu_flags);
if (required_fault) {
@@ -593,7 +600,7 @@ int hmm_range_fault(struct hmm_range *range)
* in pfns. All entries < last in the pfn array are set to their
* output, and all >= are still at their input values.
*/
- } while (ret == -EBUSY);
+ } while (ret == -EBUSY || ret == -EAGAIN);
return ret;
}
EXPORT_SYMBOL(hmm_range_fault);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index bead6bccc7f2..505efba59d29 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -628,6 +628,7 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
unsigned long poisoned_pfn, struct to_kill *tk)
{
unsigned long pfn = 0;
+ unsigned long base_pages_poisoned = (1UL << shift) / PAGE_SIZE;

if (pte_present(pte)) {
pfn = pte_pfn(pte);
@@ -638,7 +639,8 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
pfn = swp_offset_pfn(swp);
}

- if (!pfn || pfn != poisoned_pfn)
+ if (!pfn || pfn < poisoned_pfn ||
+ pfn >= poisoned_pfn + base_pages_poisoned)
return 0;

set_to_kill(tk, addr, shift);
@@ -704,16 +706,15 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
}

#ifdef CONFIG_HUGETLB_PAGE
-static int hwpoison_hugetlb_range(pte_t *ptep, unsigned long hmask,
- unsigned long addr, unsigned long end,
- struct mm_walk *walk)
+static int hwpoison_hugetlb_range(struct hugetlb_pte *hpte,
+ unsigned long addr,
+ struct mm_walk *walk)
{
struct hwp_walk *hwp = walk->private;
- pte_t pte = huge_ptep_get(ptep);
- struct hstate *h = hstate_vma(walk->vma);
+ pte_t pte = huge_ptep_get(hpte->ptep);

- return check_hwpoisoned_entry(pte, addr, huge_page_shift(h),
- hwp->pfn, &hwp->tk);
+ return check_hwpoisoned_entry(pte, addr & hugetlb_pte_mask(hpte),
+ hpte->shift, hwp->pfn, &hwp->tk);
}
#else
#define hwpoison_hugetlb_range NULL
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 61aa9aedb728..275bc549590e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -558,8 +558,8 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
return addr != end ? -EIO : 0;
}

-static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long end,
+static int queue_pages_hugetlb(struct hugetlb_pte *hpte,
+ unsigned long addr,
struct mm_walk *walk)
{
int ret = 0;
@@ -570,8 +570,12 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
spinlock_t *ptl;
pte_t entry;

- ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
- entry = huge_ptep_get(pte);
+ /* We don't migrate high-granularity HugeTLB mappings for now. */
+ if (hugetlb_hgm_enabled(walk->vma))
+ return -EINVAL;
+
+ ptl = hugetlb_pte_lock(walk->mm, hpte);
+ entry = huge_ptep_get(hpte->ptep);
if (!pte_present(entry))
goto unlock;
page = pte_page(entry);
diff --git a/mm/mincore.c b/mm/mincore.c
index a085a2aeabd8..0894965b3944 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -22,18 +22,29 @@
#include <linux/uaccess.h>
#include "swap.h"

-static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
- unsigned long end, struct mm_walk *walk)
+static int mincore_hugetlb(struct hugetlb_pte *hpte, unsigned long addr,
+ struct mm_walk *walk)
{
#ifdef CONFIG_HUGETLB_PAGE
unsigned char present;
+ unsigned long end = addr + hugetlb_pte_size(hpte);
unsigned char *vec = walk->private;
+ pte_t pte = huge_ptep_get(hpte->ptep);

/*
* Hugepages under user process are always in RAM and never
* swapped out, but theoretically it needs to be checked.
*/
- present = pte && !huge_pte_none(huge_ptep_get(pte));
+ present = !huge_pte_none(pte);
+
+ /*
+ * If the pte is present but not a leaf, we raced with someone
+ * splitting it. For someone to have split it, it must have been
+ * huge_pte_none before, so treat it as such.
+ */
+ if (pte_present(pte) && !hugetlb_pte_present_leaf(hpte, pte))
+ present = false;
+
for (; addr != end; vec++, addr += PAGE_SIZE)
*vec = present;
walk->private = vec;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 99762403cc8f..9975b86035e0 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -524,12 +524,16 @@ static int prot_none_pte_entry(pte_t *pte, unsigned long addr,
0 : -EACCES;
}

-static int prot_none_hugetlb_entry(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long next,
+static int prot_none_hugetlb_entry(struct hugetlb_pte *hpte,
+ unsigned long addr,
struct mm_walk *walk)
{
- return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
- 0 : -EACCES;
+ pte_t pte = huge_ptep_get(hpte->ptep);
+
+ if (!hugetlb_pte_present_leaf(hpte, pte))
+ return -EAGAIN;
+ return pfn_modify_allowed(pte_pfn(pte),
+ *(pgprot_t *)(walk->private)) ? 0 : -EACCES;
}

static int prot_none_test(unsigned long addr, unsigned long next,
@@ -572,8 +576,10 @@ mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
(newflags & VM_ACCESS_FLAGS) == 0) {
pgprot_t new_pgprot = vm_get_page_prot(newflags);

- error = walk_page_range(current->mm, start, end,
- &prot_none_walk_ops, &new_pgprot);
+ do {
+ error = walk_page_range(current->mm, start, end,
+ &prot_none_walk_ops, &new_pgprot);
+ } while (error == -EAGAIN);
if (error)
return error;
}
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index bb33c1e8c017..2318aae98f1e 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -3,6 +3,7 @@
#include <linux/highmem.h>
#include <linux/sched.h>
#include <linux/hugetlb.h>
+#include <linux/minmax.h>

/*
* We want to know the real level where a entry is located ignoring any
@@ -301,20 +302,39 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
pte_t *pte;
const struct mm_walk_ops *ops = walk->ops;
int err = 0;
+ struct hugetlb_pte hpte;
+
+ if (hugetlb_hgm_enabled(vma))
+ /*
+ * We could potentially do high-granularity walks. Grab the
+ * VMA lock to prevent PTEs from becoming invalid.
+ */
+ hugetlb_vma_lock_read(vma);

do {
- next = hugetlb_entry_end(h, addr, end);
pte = huge_pte_offset(walk->mm, addr & hmask, sz);
-
- if (pte)
- err = ops->hugetlb_entry(pte, hmask, addr, next, walk);
- else if (ops->pte_hole)
- err = ops->pte_hole(addr, next, -1, walk);
+ if (!pte) {
+ next = hugetlb_entry_end(h, addr, end);
+ if (ops->pte_hole)
+ err = ops->pte_hole(addr, next, -1, walk);
+ } else {
+ hugetlb_pte_populate(&hpte, pte, huge_page_shift(h),
+ hpage_size_to_level(sz));
+ hugetlb_hgm_walk(walk->mm, vma, &hpte, addr,
+ PAGE_SIZE,
+ /*stop_at_none=*/true);
+ err = ops->hugetlb_entry(
+ &hpte, addr, walk);
+ next = min(addr + hugetlb_pte_size(&hpte), end);
+ }

if (err)
break;
} while (addr = next, addr != end);

+ if (hugetlb_hgm_enabled(vma))
+ hugetlb_vma_unlock_read(vma);
+
return err;
}

--
2.38.0.135.g90850a2211-goog

2022-10-21 17:23:57

by James Houghton

Subject: [RFC PATCH v2 10/47] hugetlb: add hugetlb_pte to track HugeTLB page table entries

After high-granularity mapping, page table entries for HugeTLB pages can
be of any size/type. (For example, we can have a 1G page mapped with a
mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
PTE after we have done a page table walk.

Without this, we'd have to pass around the "size" of the PTE everywhere.
We effectively did this before; it could be fetched from the hstate,
which we pass around pretty much everywhere.

hugetlb_pte_present_leaf is included here as a helper function that will
be used frequently later on.
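
For context, the typical pattern that later patches use looks roughly like
this (hugetlb_pte_lock() is added by this patch, hpage_size_to_level()
comes from a later patch in the series, and the walker itself is
illustrative only):

static void example_hstate_walk(struct mm_struct *mm,
				struct vm_area_struct *vma,
				unsigned long addr)
{
	struct hstate *h = hstate_vma(vma);
	unsigned long sz = huge_page_size(h);
	pte_t *ptep = huge_pte_offset(mm, addr, sz);
	struct hugetlb_pte hpte;
	spinlock_t *ptl;
	pte_t pte;

	if (!ptep)
		return;

	/* Describe the hstate-level entry with a hugetlb_pte. */
	hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h),
			     hpage_size_to_level(sz));

	ptl = hugetlb_pte_lock(mm, &hpte);
	pte = huge_ptep_get(hpte.ptep);

	/*
	 * A present entry may not be a leaf if it has been split from
	 * under us; hugetlb_pte_present_leaf() tells the two apart.
	 */
	if (hugetlb_pte_present_leaf(&hpte, pte)) {
		/* Use the hugetlb_pte_size(&hpte) bytes mapped at addr. */
	}

	spin_unlock(ptl);
}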

Signed-off-by: James Houghton <[email protected]>
---
include/linux/hugetlb.h | 88 +++++++++++++++++++++++++++++++++++++++++
mm/hugetlb.c | 29 ++++++++++++++
2 files changed, 117 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index db3ed6095b1c..d30322108b34 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -50,6 +50,75 @@ enum {
__NR_USED_SUBPAGE,
};

+enum hugetlb_level {
+ HUGETLB_LEVEL_PTE = 1,
+ /*
+ * We always include PMD, PUD, and P4D in this enum definition so that,
+ * when logged as an integer, we can easily tell which level it is.
+ */
+ HUGETLB_LEVEL_PMD,
+ HUGETLB_LEVEL_PUD,
+ HUGETLB_LEVEL_P4D,
+ HUGETLB_LEVEL_PGD,
+};
+
+struct hugetlb_pte {
+ pte_t *ptep;
+ unsigned int shift;
+ enum hugetlb_level level;
+ spinlock_t *ptl;
+};
+
+static inline
+void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
+ unsigned int shift, enum hugetlb_level level)
+{
+ WARN_ON_ONCE(!ptep);
+ hpte->ptep = ptep;
+ hpte->shift = shift;
+ hpte->level = level;
+ hpte->ptl = NULL;
+}
+
+static inline
+unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
+{
+ WARN_ON_ONCE(!hpte->ptep);
+ return 1UL << hpte->shift;
+}
+
+static inline
+unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
+{
+ WARN_ON_ONCE(!hpte->ptep);
+ return ~(hugetlb_pte_size(hpte) - 1);
+}
+
+static inline
+unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
+{
+ WARN_ON_ONCE(!hpte->ptep);
+ return hpte->shift;
+}
+
+static inline
+enum hugetlb_level hugetlb_pte_level(const struct hugetlb_pte *hpte)
+{
+ WARN_ON_ONCE(!hpte->ptep);
+ return hpte->level;
+}
+
+static inline
+void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
+{
+ dest->ptep = src->ptep;
+ dest->shift = src->shift;
+ dest->level = src->level;
+ dest->ptl = src->ptl;
+}
+
+bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
+
struct hugepage_subpool {
spinlock_t lock;
long count;
@@ -1210,6 +1279,25 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
return ptl;
}

+static inline
+spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
+{
+
+ BUG_ON(!hpte->ptep);
+ if (hpte->ptl)
+ return hpte->ptl;
+ return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm, hpte->ptep);
+}
+
+static inline
+spinlock_t *hugetlb_pte_lock(struct mm_struct *mm, struct hugetlb_pte *hpte)
+{
+ spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
+
+ spin_lock(ptl);
+ return ptl;
+}
+
#if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
extern void __init hugetlb_cma_reserve(int order);
#else
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ef7662bd0068..a0e46d35dabc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1127,6 +1127,35 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
return false;
}

+bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte)
+{
+ pgd_t pgd;
+ p4d_t p4d;
+ pud_t pud;
+ pmd_t pmd;
+
+ WARN_ON_ONCE(!hpte->ptep);
+ switch (hugetlb_pte_level(hpte)) {
+ case HUGETLB_LEVEL_PGD:
+ pgd = __pgd(pte_val(pte));
+ return pgd_present(pgd) && pgd_leaf(pgd);
+ case HUGETLB_LEVEL_P4D:
+ p4d = __p4d(pte_val(pte));
+ return p4d_present(p4d) && p4d_leaf(p4d);
+ case HUGETLB_LEVEL_PUD:
+ pud = __pud(pte_val(pte));
+ return pud_present(pud) && pud_leaf(pud);
+ case HUGETLB_LEVEL_PMD:
+ pmd = __pmd(pte_val(pte));
+ return pmd_present(pmd) && pmd_leaf(pmd);
+ case HUGETLB_LEVEL_PTE:
+ return pte_present(pte);
+ default:
+ WARN_ON_ONCE(1);
+ return false;
+ }
+}
+
static void enqueue_huge_page(struct hstate *h, struct page *page)
{
int nid = page_to_nid(page);
--
2.38.0.135.g90850a2211-goog

2022-10-21 17:24:00

by James Houghton

Subject: [RFC PATCH v2 43/47] selftests/vm: add HugeTLB HGM to userfaultfd selftest

This test case behaves similarly to the regular shared HugeTLB
configuration, except that it uses 4K pages instead of hugepages, and
that we skip the UFFDIO_COPY tests, as UFFDIO_CONTINUE is the only ioctl
that supports PAGE_SIZE-aligned regions.

This doesn't test MADV_COLLAPSE. Other tests are added later to exercise
MADV_COLLAPSE.

Signed-off-by: James Houghton <[email protected]>
---
tools/testing/selftests/vm/userfaultfd.c | 90 +++++++++++++++++++-----
1 file changed, 74 insertions(+), 16 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 7f22844ed704..c9cdfb20f292 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -73,9 +73,10 @@ static unsigned long nr_cpus, nr_pages, nr_pages_per_cpu, page_size, hpage_size;
#define BOUNCE_POLL (1<<3)
static int bounces;

-#define TEST_ANON 1
-#define TEST_HUGETLB 2
-#define TEST_SHMEM 3
+#define TEST_ANON 1
+#define TEST_HUGETLB 2
+#define TEST_HUGETLB_HGM 3
+#define TEST_SHMEM 4
static int test_type;

#define UFFD_FLAGS (O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY)
@@ -93,6 +94,8 @@ static volatile bool test_uffdio_zeropage_eexist = true;
static bool test_uffdio_wp = true;
/* Whether to test uffd minor faults */
static bool test_uffdio_minor = false;
+static bool test_uffdio_copy = true;
+
static bool map_shared;
static int mem_fd;
static unsigned long long *count_verify;
@@ -151,7 +154,7 @@ static void usage(void)
fprintf(stderr, "\nUsage: ./userfaultfd <test type> <MiB> <bounces> "
"[hugetlbfs_file]\n\n");
fprintf(stderr, "Supported <test type>: anon, hugetlb, "
- "hugetlb_shared, shmem\n\n");
+ "hugetlb_shared, hugetlb_shared_hgm, shmem\n\n");
fprintf(stderr, "'Test mods' can be joined to the test type string with a ':'. "
"Supported mods:\n");
fprintf(stderr, "\tsyscall - Use userfaultfd(2) (default)\n");
@@ -167,6 +170,11 @@ static void usage(void)
exit(1);
}

+static bool test_is_hugetlb(void)
+{
+ return test_type == TEST_HUGETLB || test_type == TEST_HUGETLB_HGM;
+}
+
#define _err(fmt, ...) \
do { \
int ret = errno; \
@@ -381,8 +389,12 @@ static struct uffd_test_ops *uffd_test_ops;

static inline uint64_t uffd_minor_feature(void)
{
- if (test_type == TEST_HUGETLB && map_shared)
- return UFFD_FEATURE_MINOR_HUGETLBFS;
+ if (test_is_hugetlb() && map_shared)
+ return UFFD_FEATURE_MINOR_HUGETLBFS |
+ (test_type == TEST_HUGETLB_HGM
+ ? (UFFD_FEATURE_MINOR_HUGETLBFS_HGM |
+ UFFD_FEATURE_EXACT_ADDRESS)
+ : 0);
else if (test_type == TEST_SHMEM)
return UFFD_FEATURE_MINOR_SHMEM;
else
@@ -393,7 +405,7 @@ static uint64_t get_expected_ioctls(uint64_t mode)
{
uint64_t ioctls = UFFD_API_RANGE_IOCTLS;

- if (test_type == TEST_HUGETLB)
+ if (test_is_hugetlb())
ioctls &= ~(1 << _UFFDIO_ZEROPAGE);

if (!((mode & UFFDIO_REGISTER_MODE_WP) && test_uffdio_wp))
@@ -500,13 +512,16 @@ static void uffd_test_ctx_clear(void)
static void uffd_test_ctx_init(uint64_t features)
{
unsigned long nr, cpu;
+ uint64_t enabled_features = features;

uffd_test_ctx_clear();

uffd_test_ops->allocate_area((void **)&area_src, true);
uffd_test_ops->allocate_area((void **)&area_dst, false);

- userfaultfd_open(&features);
+ userfaultfd_open(&enabled_features);
+ if ((enabled_features & features) != features)
+ err("couldn't enable all features");

count_verify = malloc(nr_pages * sizeof(unsigned long long));
if (!count_verify)
@@ -726,13 +741,21 @@ static void uffd_handle_page_fault(struct uffd_msg *msg,
struct uffd_stats *stats)
{
unsigned long offset;
+ unsigned long address;

if (msg->event != UFFD_EVENT_PAGEFAULT)
err("unexpected msg event %u", msg->event);

+ /*
+ * Round down address to nearest page_size.
+ * We do this manually because we specified UFFD_FEATURE_EXACT_ADDRESS
+ * to support UFFD_FEATURE_MINOR_HUGETLBFS_HGM.
+ */
+ address = msg->arg.pagefault.address & ~(page_size - 1);
+
if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) {
/* Write protect page faults */
- wp_range(uffd, msg->arg.pagefault.address, page_size, false);
+ wp_range(uffd, address, page_size, false);
stats->wp_faults++;
} else if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR) {
uint8_t *area;
@@ -751,11 +774,10 @@ static void uffd_handle_page_fault(struct uffd_msg *msg,
*/

area = (uint8_t *)(area_dst +
- ((char *)msg->arg.pagefault.address -
- area_dst_alias));
+ ((char *)address - area_dst_alias));
for (b = 0; b < page_size; ++b)
area[b] = ~area[b];
- continue_range(uffd, msg->arg.pagefault.address, page_size);
+ continue_range(uffd, address, page_size);
stats->minor_faults++;
} else {
/*
@@ -782,7 +804,7 @@ static void uffd_handle_page_fault(struct uffd_msg *msg,
if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE)
err("unexpected write fault");

- offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst;
+ offset = (char *)address - area_dst;
offset &= ~(page_size-1);

if (copy_page(uffd, offset))
@@ -1192,6 +1214,12 @@ static int userfaultfd_events_test(void)
char c;
struct uffd_stats stats = { 0 };

+ if (!test_uffdio_copy) {
+ printf("Skipping userfaultfd events test "
+ "(test_uffdio_copy=false)\n");
+ return 0;
+ }
+
printf("testing events (fork, remap, remove): ");
fflush(stdout);

@@ -1245,6 +1273,12 @@ static int userfaultfd_sig_test(void)
char c;
struct uffd_stats stats = { 0 };

+ if (!test_uffdio_copy) {
+ printf("Skipping userfaultfd signal test "
+ "(test_uffdio_copy=false)\n");
+ return 0;
+ }
+
printf("testing signal delivery: ");
fflush(stdout);

@@ -1538,6 +1572,12 @@ static int userfaultfd_stress(void)
pthread_attr_init(&attr);
pthread_attr_setstacksize(&attr, 16*1024*1024);

+ if (!test_uffdio_copy) {
+ printf("Skipping userfaultfd stress test "
+ "(test_uffdio_copy=false)\n");
+ bounces = 0;
+ }
+
while (bounces--) {
printf("bounces: %d, mode:", bounces);
if (bounces & BOUNCE_RANDOM)
@@ -1696,6 +1736,16 @@ static void set_test_type(const char *type)
uffd_test_ops = &hugetlb_uffd_test_ops;
/* Minor faults require shared hugetlb; only enable here. */
test_uffdio_minor = true;
+ } else if (!strcmp(type, "hugetlb_shared_hgm")) {
+ map_shared = true;
+ test_type = TEST_HUGETLB_HGM;
+ uffd_test_ops = &hugetlb_uffd_test_ops;
+ /*
+ * HugeTLB HGM only changes UFFDIO_CONTINUE, so don't test
+ * UFFDIO_COPY.
+ */
+ test_uffdio_minor = true;
+ test_uffdio_copy = false;
} else if (!strcmp(type, "shmem")) {
map_shared = true;
test_type = TEST_SHMEM;
@@ -1731,6 +1781,7 @@ static void parse_test_type_arg(const char *raw_type)
err("Unsupported test: %s", raw_type);

if (test_type == TEST_HUGETLB)
+ /* TEST_HUGETLB_HGM gets small pages. */
page_size = hpage_size;
else
page_size = sysconf(_SC_PAGE_SIZE);
@@ -1813,22 +1864,29 @@ int main(int argc, char **argv)
nr_cpus = x < y ? x : y;
}
nr_pages_per_cpu = bytes / page_size / nr_cpus;
+ if (test_type == TEST_HUGETLB_HGM)
+ /*
+ * `page_size` refers to the page_size we can use in
+ * UFFDIO_CONTINUE. We still need nr_pages to be appropriately
+ * aligned, so align it here.
+ */
+ nr_pages_per_cpu -= nr_pages_per_cpu % (hpage_size / page_size);
if (!nr_pages_per_cpu) {
_err("invalid MiB");
usage();
}
+ nr_pages = nr_pages_per_cpu * nr_cpus;

bounces = atoi(argv[3]);
if (bounces <= 0) {
_err("invalid bounces");
usage();
}
- nr_pages = nr_pages_per_cpu * nr_cpus;

- if (test_type == TEST_SHMEM || test_type == TEST_HUGETLB) {
+ if (test_type == TEST_SHMEM || test_is_hugetlb()) {
unsigned int memfd_flags = 0;

- if (test_type == TEST_HUGETLB)
+ if (test_is_hugetlb())
memfd_flags = MFD_HUGETLB;
mem_fd = memfd_create(argv[0], memfd_flags);
if (mem_fd < 0)
--
2.38.0.135.g90850a2211-goog

2022-10-21 17:24:26

by James Houghton

Subject: [RFC PATCH v2 07/47] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING

This adds the Kconfig to enable or disable high-granularity mapping.
Each architecture must explicitly opt in to it (via
ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING), but once an architecture has
opted in, HGM will be enabled by default if HUGETLB_PAGE is enabled.
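
For reference, code elsewhere in this series is expected to be guarded
roughly like this (hugetlb_hgm_enabled() is added by a later patch; this
is just the shape of the CONFIG guard, not the exact declaration):

struct vm_area_struct;

#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
bool hugetlb_hgm_enabled(struct vm_area_struct *vma);
#else
static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
{
	return false;
}
#endif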

Signed-off-by: James Houghton <[email protected]>
---
fs/Kconfig | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/fs/Kconfig b/fs/Kconfig
index 2685a4d0d353..ce2567946016 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -267,6 +267,13 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
enable HVO by default. It can be disabled via hugetlb_free_vmemmap=off
(boot command line) or hugetlb_optimize_vmemmap (sysctl).

+config ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING
+ bool
+
+config HUGETLB_HIGH_GRANULARITY_MAPPING
+ def_bool HUGETLB_PAGE
+ depends on ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING
+
config MEMFD_CREATE
def_bool TMPFS || HUGETLBFS

--
2.38.0.135.g90850a2211-goog

2022-10-21 17:24:32

by James Houghton

Subject: [RFC PATCH v2 16/47] hugetlb: make unmapping compatible with high-granularity mappings

Enlighten __unmap_hugepage_range to deal with high-granularity mappings.
This doesn't change its API; it still must be called with hugepage
alignment, but it will correctly unmap hugepages that have been mapped
at high granularity.

The rules for mapcount and refcount here are:
1. Refcount and mapcount are tracked on the head page.
2. Each page table mapping into any part of an hpage will increase that
hpage's mapcount and refcount by 1.

Eventually, functionality here can be expanded to allow users to call
MADV_DONTNEED on PAGE_SIZE-aligned sections of a hugepage, but that is
not done here.

Signed-off-by: James Houghton <[email protected]>
---
include/asm-generic/tlb.h | 6 ++--
mm/hugetlb.c | 76 +++++++++++++++++++++++++--------------
2 files changed, 52 insertions(+), 30 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 492dce43236e..c378a44915a9 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -566,9 +566,9 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
__tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)

-#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address) \
+#define tlb_remove_huge_tlb_entry(tlb, hpte, address) \
do { \
- unsigned long _sz = huge_page_size(h); \
+ unsigned long _sz = hugetlb_pte_size(&hpte); \
if (_sz >= P4D_SIZE) \
tlb_flush_p4d_range(tlb, address, _sz); \
else if (_sz >= PUD_SIZE) \
@@ -577,7 +577,7 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
tlb_flush_pmd_range(tlb, address, _sz); \
else \
tlb_flush_pte_range(tlb, address, _sz); \
- __tlb_remove_tlb_entry(tlb, ptep, address); \
+ __tlb_remove_tlb_entry(tlb, hpte.ptep, address);\
} while (0)

/**
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 74a4afda1a7e..227150c25763 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5221,10 +5221,10 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
- pte_t *ptep;
+ struct hugetlb_pte hpte;
pte_t pte;
spinlock_t *ptl;
- struct page *page;
+ struct page *hpage, *subpage;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
struct mmu_notifier_range range;
@@ -5235,11 +5235,6 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
BUG_ON(start & ~huge_page_mask(h));
BUG_ON(end & ~huge_page_mask(h));

- /*
- * This is a hugetlb vma, all the pte entries should point
- * to huge page.
- */
- tlb_change_page_size(tlb, sz);
tlb_start_vma(tlb, vma);

/*
@@ -5251,26 +5246,35 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
mmu_notifier_invalidate_range_start(&range);
last_addr_mask = hugetlb_mask_last_page(h);
address = start;
- for (; address < end; address += sz) {
- ptep = huge_pte_offset(mm, address, sz);
+
+ while (address < end) {
+ pte_t *ptep = huge_pte_offset(mm, address, sz);
+
if (!ptep) {
address |= last_addr_mask;
+ address += sz;
continue;
}
+ hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h),
+ hpage_size_to_level(huge_page_size(h)));
+ hugetlb_hgm_walk(mm, vma, &hpte, address,
+ PAGE_SIZE, /*stop_at_none=*/true);

- ptl = huge_pte_lock(h, mm, ptep);
- if (huge_pmd_unshare(mm, vma, address, ptep)) {
+ ptl = hugetlb_pte_lock(mm, &hpte);
+ if (huge_pmd_unshare(mm, vma, address, hpte.ptep)) {
spin_unlock(ptl);
tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
force_flush = true;
address |= last_addr_mask;
+ address += sz;
continue;
}

- pte = huge_ptep_get(ptep);
+ pte = huge_ptep_get(hpte.ptep);
+
if (huge_pte_none(pte)) {
spin_unlock(ptl);
- continue;
+ goto next_hpte;
}

/*
@@ -5287,25 +5291,36 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
*/
if (pte_swp_uffd_wp_any(pte) &&
!(zap_flags & ZAP_FLAG_DROP_MARKER))
- set_huge_pte_at(mm, address, ptep,
+ set_huge_pte_at(mm, address, hpte.ptep,
make_pte_marker(PTE_MARKER_UFFD_WP));
else
#endif
- huge_pte_clear(mm, address, ptep, sz);
+ huge_pte_clear(mm, address, hpte.ptep,
+ hugetlb_pte_size(&hpte));
+ spin_unlock(ptl);
+ goto next_hpte;
+ }
+
+ if (unlikely(!hugetlb_pte_present_leaf(&hpte, pte))) {
+ /*
+ * We raced with someone splitting out from under us.
+ * Retry the walk.
+ */
spin_unlock(ptl);
continue;
}

- page = pte_page(pte);
+ subpage = pte_page(pte);
+ hpage = compound_head(subpage);
/*
* If a reference page is supplied, it is because a specific
* page is being unmapped, not a range. Ensure the page we
* are about to unmap is the actual page of interest.
*/
if (ref_page) {
- if (page != ref_page) {
+ if (hpage != ref_page) {
spin_unlock(ptl);
- continue;
+ goto next_hpte;
}
/*
* Mark the VMA as having unmapped its page so that
@@ -5315,27 +5330,34 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
set_vma_resv_flags(vma, HPAGE_RESV_UNMAPPED);
}

- pte = huge_ptep_get_and_clear(mm, address, ptep);
- tlb_remove_huge_tlb_entry(h, tlb, ptep, address);
+ pte = huge_ptep_get_and_clear(mm, address, hpte.ptep);
+ tlb_change_page_size(tlb, hugetlb_pte_size(&hpte));
+ tlb_remove_huge_tlb_entry(tlb, hpte, address);
if (huge_pte_dirty(pte))
- set_page_dirty(page);
+ set_page_dirty(hpage);
#ifdef CONFIG_PTE_MARKER_UFFD_WP
/* Leave a uffd-wp pte marker if needed */
if (huge_pte_uffd_wp(pte) &&
!(zap_flags & ZAP_FLAG_DROP_MARKER))
- set_huge_pte_at(mm, address, ptep,
+ set_huge_pte_at(mm, address, hpte.ptep,
make_pte_marker(PTE_MARKER_UFFD_WP));
#endif
- hugetlb_count_sub(pages_per_huge_page(h), mm);
- page_remove_rmap(page, vma, true);
+ hugetlb_count_sub(hugetlb_pte_size(&hpte)/PAGE_SIZE, mm);
+ page_remove_rmap(hpage, vma, true);

spin_unlock(ptl);
- tlb_remove_page_size(tlb, page, huge_page_size(h));
/*
- * Bail out after unmapping reference page if supplied
+ * Lower the reference count on the head page.
+ */
+ tlb_remove_page_size(tlb, hpage, sz);
+ /*
+ * Bail out after unmapping reference page if supplied,
+ * and there's only one PTE mapping this page.
*/
- if (ref_page)
+ if (ref_page && hugetlb_pte_size(&hpte) == sz)
break;
+next_hpte:
+ address += hugetlb_pte_size(&hpte);
}
mmu_notifier_invalidate_range_end(&range);
tlb_end_vma(tlb, vma);
--
2.38.0.135.g90850a2211-goog

2022-10-21 17:24:35

by James Houghton

[permalink] [raw]
Subject: [RFC PATCH v2 21/47] mm: rmap: provide pte_order in page_vma_mapped_walk

page_vma_mapped_walk callers will need this information to know how
HugeTLB pages are mapped. pte_order only applies if pte is not NULL.

Signed-off-by: James Houghton <[email protected]>
---
include/linux/rmap.h | 1 +
mm/page_vma_mapped.c | 3 +++
2 files changed, 4 insertions(+)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bd3504d11b15..e0557ede2951 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -378,6 +378,7 @@ struct page_vma_mapped_walk {
pmd_t *pmd;
pte_t *pte;
spinlock_t *ptl;
+ unsigned int pte_order;
unsigned int flags;
};

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 93e13fc17d3c..395ca4e21c56 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -16,6 +16,7 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
static bool map_pte(struct page_vma_mapped_walk *pvmw)
{
pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
+ pvmw->pte_order = 0;
if (!(pvmw->flags & PVMW_SYNC)) {
if (pvmw->flags & PVMW_MIGRATION) {
if (!is_swap_pte(*pvmw->pte))
@@ -174,6 +175,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (!pvmw->pte)
return false;

+ pvmw->pte_order = huge_page_order(hstate);
pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
if (!check_pte(pvmw))
return not_found(pvmw);
@@ -269,6 +271,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
}
pte_unmap(pvmw->pte);
pvmw->pte = NULL;
+ pvmw->pte_order = 0;
goto restart;
}
pvmw->pte++;
--
2.38.0.135.g90850a2211-goog

2022-10-21 17:24:37

by James Houghton

[permalink] [raw]
Subject: [RFC PATCH v2 14/47] hugetlb: make default arch_make_huge_pte understand small mappings

This is a simple change: don't create a "huge" PTE if we are making a
regular, PAGE_SIZE PTE. All architectures that want to implement HGM
likely need to be changed in a similar way if they implement their own
version of arch_make_huge_pte.

Signed-off-by: James Houghton <[email protected]>
---
include/linux/hugetlb.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 4b1548adecde..d305742e9d44 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -907,7 +907,7 @@ static inline void arch_clear_hugepage_flags(struct page *page) { }
static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
vm_flags_t flags)
{
- return pte_mkhuge(entry);
+ return shift > PAGE_SHIFT ? pte_mkhuge(entry) : entry;
}
#endif

--
2.38.0.135.g90850a2211-goog

2022-10-21 17:25:00

by James Houghton

[permalink] [raw]
Subject: [RFC PATCH v2 31/47] hugetlb: sort hstates in hugetlb_init_hstates

When using HugeTLB high-granularity mapping, we need to go through the
supported hugepage sizes in decreasing order so that we pick the largest
size that works. Consider the case where we're faulting in a 1G hugepage
for the first time: we want hugetlb_fault/hugetlb_no_page to map it with
a PUD. By going through the sizes in decreasing order, we will find that
PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too.

This commit also changes bootmem hugepages from storing hstate pointers
directly to storing the hstate sizes. The hstate pointers stored for
boot-time-allocated hugepages would no longer refer to the correct hstate
after we sort the hstates.
`gather_bootmem_prealloc`, called after the hstates have been sorted,
now converts the size to the correct hstate.

Signed-off-by: James Houghton <[email protected]>
---
include/linux/hugetlb.h | 2 +-
mm/hugetlb.c | 49 ++++++++++++++++++++++++++++++++---------
2 files changed, 40 insertions(+), 11 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d305742e9d44..e25f97cdd086 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -772,7 +772,7 @@ struct hstate {

struct huge_bootmem_page {
struct list_head list;
- struct hstate *hstate;
+ unsigned long hstate_sz;
};

int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bb0005d57cab..d6f07968156c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -34,6 +34,7 @@
#include <linux/nospec.h>
#include <linux/delayacct.h>
#include <linux/memory.h>
+#include <linux/sort.h>

#include <asm/page.h>
#include <asm/pgalloc.h>
@@ -49,6 +50,10 @@

int hugetlb_max_hstate __read_mostly;
unsigned int default_hstate_idx;
+/*
+ * After hugetlb_init_hstates is called, hstates will be sorted from largest
+ * to smallest.
+ */
struct hstate hstates[HUGE_MAX_HSTATE];

#ifdef CONFIG_CMA
@@ -3189,7 +3194,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
/* Put them into a private list first because mem_map is not up yet */
INIT_LIST_HEAD(&m->list);
list_add(&m->list, &huge_boot_pages);
- m->hstate = h;
+ m->hstate_sz = huge_page_size(h);
return 1;
}

@@ -3203,7 +3208,7 @@ static void __init gather_bootmem_prealloc(void)

list_for_each_entry(m, &huge_boot_pages, list) {
struct page *page = virt_to_page(m);
- struct hstate *h = m->hstate;
+ struct hstate *h = size_to_hstate(m->hstate_sz);

VM_BUG_ON(!hstate_is_gigantic(h));
WARN_ON(page_count(page) != 1);
@@ -3319,9 +3324,38 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
kfree(node_alloc_noretry);
}

+static int compare_hstates_decreasing(const void *a, const void *b)
+{
+ unsigned long sz_a = huge_page_size((const struct hstate *)a);
+ unsigned long sz_b = huge_page_size((const struct hstate *)b);
+
+ if (sz_a < sz_b)
+ return 1;
+ if (sz_a > sz_b)
+ return -1;
+ return 0;
+}
+
+static void sort_hstates(void)
+{
+ unsigned long default_hstate_sz = huge_page_size(&default_hstate);
+
+ /* Sort from largest to smallest. */
+ sort(hstates, hugetlb_max_hstate, sizeof(*hstates),
+ compare_hstates_decreasing, NULL);
+
+ /*
+ * We may have changed the location of the default hstate, so we need to
+ * update it.
+ */
+ default_hstate_idx = hstate_index(size_to_hstate(default_hstate_sz));
+}
+
static void __init hugetlb_init_hstates(void)
{
- struct hstate *h, *h2;
+ struct hstate *h;
+
+ sort_hstates();

for_each_hstate(h) {
/* oversize hugepages were init'ed in early boot */
@@ -3340,13 +3374,8 @@ static void __init hugetlb_init_hstates(void)
continue;
if (hugetlb_cma_size && h->order <= HUGETLB_PAGE_ORDER)
continue;
- for_each_hstate(h2) {
- if (h2 == h)
- continue;
- if (h2->order < h->order &&
- h2->order > h->demote_order)
- h->demote_order = h2->order;
- }
+ if (h + 1 < &hstates[hugetlb_max_hstate])
+ h->demote_order = huge_page_order(h + 1);
}
}

--
2.38.0.135.g90850a2211-goog

2022-10-21 17:25:18

by James Houghton

[permalink] [raw]
Subject: [RFC PATCH v2 25/47] hugetlb: add HGM support for copy_hugetlb_page_range

This allows fork() to work with high-granularity mappings. The page
table structure is copied such that partially mapped regions will remain
partially mapped in the same way for the new process.

A page's reference count is incremented for *each* portion of it that is
mapped in the page table. For example, if you have a PMD-mapped 1G page,
the reference count and mapcount will be incremented by 512.

Signed-off-by: James Houghton <[email protected]>
---
mm/hugetlb.c | 81 +++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 65 insertions(+), 16 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5783a8307a77..7d692907cbf3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4946,7 +4946,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
struct vm_area_struct *src_vma)
{
pte_t *src_pte, *dst_pte, entry;
- struct page *ptepage;
+ struct hugetlb_pte src_hpte, dst_hpte;
+ struct page *ptepage, *hpage;
unsigned long addr;
bool cow = is_cow_mapping(src_vma->vm_flags);
struct hstate *h = hstate_vma(src_vma);
@@ -4956,6 +4957,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
unsigned long last_addr_mask;
int ret = 0;

+ if (hugetlb_hgm_enabled(src_vma)) {
+ /*
+ * src_vma might have high-granularity PTEs, and dst_vma will
+ * need to copy those.
+ */
+ ret = enable_hugetlb_hgm(dst_vma);
+ if (ret)
+ return ret;
+ }
+
if (cow) {
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, src_vma, src,
src_vma->vm_start,
@@ -4967,18 +4978,22 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
/*
* For shared mappings the vma lock must be held before
* calling huge_pte_offset in the src vma. Otherwise, the
- * returned ptep could go away if part of a shared pmd and
- * another thread calls huge_pmd_unshare.
+ * returned ptep could go away if
+ * - part of a shared pmd and another thread calls
+ * huge_pmd_unshare, or
+ * - another thread collapses a high-granularity mapping.
*/
hugetlb_vma_lock_read(src_vma);
}

last_addr_mask = hugetlb_mask_last_page(h);
- for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
+ addr = src_vma->vm_start;
+ while (addr < src_vma->vm_end) {
spinlock_t *src_ptl, *dst_ptl;
+ unsigned long hpte_sz;
src_pte = huge_pte_offset(src, addr, sz);
if (!src_pte) {
- addr |= last_addr_mask;
+ addr = (addr | last_addr_mask) + sz;
continue;
}
dst_pte = huge_pte_alloc(dst, dst_vma, addr, sz);
@@ -4987,6 +5002,26 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
break;
}

+ hugetlb_pte_populate(&src_hpte, src_pte, huge_page_shift(h),
+ hpage_size_to_level(huge_page_size(h)));
+ hugetlb_pte_populate(&dst_hpte, dst_pte, huge_page_shift(h),
+ hpage_size_to_level(huge_page_size(h)));
+
+ if (hugetlb_hgm_enabled(src_vma)) {
+ hugetlb_hgm_walk(src, src_vma, &src_hpte, addr,
+ PAGE_SIZE, /*stop_at_none=*/true);
+ ret = hugetlb_hgm_walk(dst, dst_vma, &dst_hpte, addr,
+ hugetlb_pte_size(&src_hpte),
+ /*stop_at_none=*/false);
+ if (ret)
+ break;
+
+ src_pte = src_hpte.ptep;
+ dst_pte = dst_hpte.ptep;
+ }
+
+ hpte_sz = hugetlb_pte_size(&src_hpte);
+
/*
* If the pagetables are shared don't copy or take references.
*
@@ -4996,12 +5031,12 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
* to reliably determine whether pte is shared.
*/
if (page_count(virt_to_page(dst_pte)) > 1) {
- addr |= last_addr_mask;
+ addr = (addr | last_addr_mask) + sz;
continue;
}

- dst_ptl = huge_pte_lock(h, dst, dst_pte);
- src_ptl = huge_pte_lockptr(huge_page_shift(h), src, src_pte);
+ dst_ptl = hugetlb_pte_lock(dst, &dst_hpte);
+ src_ptl = hugetlb_pte_lockptr(src, &src_hpte);
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
entry = huge_ptep_get(src_pte);
again:
@@ -5042,10 +5077,15 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
*/
if (userfaultfd_wp(dst_vma))
set_huge_pte_at(dst, addr, dst_pte, entry);
+ } else if (!hugetlb_pte_present_leaf(&src_hpte, entry)) {
+ /* Retry the walk. */
+ spin_unlock(src_ptl);
+ spin_unlock(dst_ptl);
+ continue;
} else {
- entry = huge_ptep_get(src_pte);
ptepage = pte_page(entry);
- get_page(ptepage);
+ hpage = compound_head(ptepage);
+ get_page(hpage);

/*
* Failing to duplicate the anon rmap is a rare case
@@ -5058,24 +5098,29 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
* sleep during the process.
*/
if (!PageAnon(ptepage)) {
- page_dup_file_rmap(ptepage, true);
- } else if (page_try_dup_anon_rmap(ptepage, true,
+ page_dup_file_rmap(hpage, true);
+ } else if (page_try_dup_anon_rmap(hpage, true,
src_vma)) {
pte_t src_pte_old = entry;
struct page *new;

+ if (hugetlb_hgm_enabled(src_vma)) {
+ ret = -EINVAL;
+ break;
+ }
+
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
/* Do not use reserve as it's private owned */
new = alloc_huge_page(dst_vma, addr, 1);
if (IS_ERR(new)) {
- put_page(ptepage);
+ put_page(hpage);
ret = PTR_ERR(new);
break;
}
- copy_user_huge_page(new, ptepage, addr, dst_vma,
+ copy_user_huge_page(new, hpage, addr, dst_vma,
npages);
- put_page(ptepage);
+ put_page(hpage);

/* Install the new huge page if src pte stable */
dst_ptl = huge_pte_lock(h, dst, dst_pte);
@@ -5093,6 +5138,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
hugetlb_install_page(dst_vma, dst_pte, addr, new);
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
+ addr += hugetlb_pte_size(&src_hpte);
continue;
}

@@ -5109,10 +5155,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
}

set_huge_pte_at(dst, addr, dst_pte, entry);
- hugetlb_count_add(npages, dst);
+ hugetlb_count_add(
+ hugetlb_pte_size(&dst_hpte) / PAGE_SIZE,
+ dst);
}
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
+ addr += hugetlb_pte_size(&src_hpte);
}

if (cow) {
--
2.38.0.135.g90850a2211-goog

2022-10-21 17:25:20

by James Houghton

[permalink] [raw]
Subject: [RFC PATCH v2 26/47] hugetlb: make move_hugetlb_page_tables compatible with HGM

This is very similar to the support that was added to
copy_hugetlb_page_range. We simply do a high-granularity walk now, and
most of the rest of the code stays the same.

Signed-off-by: James Houghton <[email protected]>
---
mm/hugetlb.c | 47 ++++++++++++++++++++++++++++++++---------------
1 file changed, 32 insertions(+), 15 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7d692907cbf3..16b0d192445c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5174,16 +5174,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
return ret;
}

-static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
- unsigned long new_addr, pte_t *src_pte, pte_t *dst_pte)
+static void move_hugetlb_pte(struct vm_area_struct *vma, unsigned long old_addr,
+ unsigned long new_addr, struct hugetlb_pte *src_hpte,
+ struct hugetlb_pte *dst_hpte)
{
- struct hstate *h = hstate_vma(vma);
struct mm_struct *mm = vma->vm_mm;
spinlock_t *src_ptl, *dst_ptl;
pte_t pte;

- dst_ptl = huge_pte_lock(h, mm, dst_pte);
- src_ptl = huge_pte_lockptr(huge_page_shift(h), mm, src_pte);
+ dst_ptl = hugetlb_pte_lock(mm, dst_hpte);
+ src_ptl = hugetlb_pte_lockptr(mm, src_hpte);

/*
* We don't have to worry about the ordering of src and dst ptlocks
@@ -5192,8 +5192,8 @@ static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
if (src_ptl != dst_ptl)
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);

- pte = huge_ptep_get_and_clear(mm, old_addr, src_pte);
- set_huge_pte_at(mm, new_addr, dst_pte, pte);
+ pte = huge_ptep_get_and_clear(mm, old_addr, src_hpte->ptep);
+ set_huge_pte_at(mm, new_addr, dst_hpte->ptep, pte);

if (src_ptl != dst_ptl)
spin_unlock(src_ptl);
@@ -5214,6 +5214,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
pte_t *src_pte, *dst_pte;
struct mmu_notifier_range range;
bool shared_pmd = false;
+ struct hugetlb_pte src_hpte, dst_hpte;

mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, old_addr,
old_end);
@@ -5229,20 +5230,28 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
/* Prevent race with file truncation */
hugetlb_vma_lock_write(vma);
i_mmap_lock_write(mapping);
- for (; old_addr < old_end; old_addr += sz, new_addr += sz) {
+ while (old_addr < old_end) {
src_pte = huge_pte_offset(mm, old_addr, sz);
if (!src_pte) {
- old_addr |= last_addr_mask;
- new_addr |= last_addr_mask;
+ old_addr = (old_addr | last_addr_mask) + sz;
+ new_addr = (new_addr | last_addr_mask) + sz;
continue;
}
- if (huge_pte_none(huge_ptep_get(src_pte)))
+
+ hugetlb_pte_populate(&src_hpte, src_pte, huge_page_shift(h),
+ hpage_size_to_level(sz));
+ hugetlb_hgm_walk(mm, vma, &src_hpte, old_addr,
+ PAGE_SIZE, /*stop_at_none=*/true);
+ if (huge_pte_none(huge_ptep_get(src_hpte.ptep))) {
+ old_addr += hugetlb_pte_size(&src_hpte);
+ new_addr += hugetlb_pte_size(&src_hpte);
continue;
+ }

- if (huge_pmd_unshare(mm, vma, old_addr, src_pte)) {
+ if (huge_pmd_unshare(mm, vma, old_addr, src_hpte.ptep)) {
shared_pmd = true;
- old_addr |= last_addr_mask;
- new_addr |= last_addr_mask;
+ old_addr = (old_addr | last_addr_mask) + sz;
+ new_addr = (new_addr | last_addr_mask) + sz;
continue;
}

@@ -5250,7 +5259,15 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
if (!dst_pte)
break;

- move_huge_pte(vma, old_addr, new_addr, src_pte, dst_pte);
+ hugetlb_pte_populate(&dst_hpte, dst_pte, huge_page_shift(h),
+ hpage_size_to_level(sz));
+ if (hugetlb_hgm_walk(mm, vma, &dst_hpte, new_addr,
+ hugetlb_pte_size(&src_hpte),
+ /*stop_at_none=*/false))
+ break;
+ move_hugetlb_pte(vma, old_addr, new_addr, &src_hpte, &dst_hpte);
+ old_addr += hugetlb_pte_size(&src_hpte);
+ new_addr += hugetlb_pte_size(&src_hpte);
}

if (shared_pmd)
--
2.38.0.135.g90850a2211-goog

2022-10-21 17:25:40

by James Houghton

[permalink] [raw]
Subject: [RFC PATCH v2 06/47] hugetlb: extend vma lock for shared vmas

This allows us to add more data into the shared structure, which we will
use to store whether or not HGM is enabled for this VMA, as HGM
is only available for shared mappings.

It may be better to include HGM as a VMA flag instead of extending the
VMA lock structure.

Signed-off-by: James Houghton <[email protected]>
---
include/linux/hugetlb.h | 4 +++
mm/hugetlb.c | 65 +++++++++++++++++++++--------------------
2 files changed, 37 insertions(+), 32 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a899bc76d677..534958499ac4 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -121,6 +121,10 @@ struct hugetlb_vma_lock {
struct vm_area_struct *vma;
};

+struct hugetlb_shared_vma_data {
+ struct hugetlb_vma_lock vma_lock;
+};
+
extern struct resv_map *resv_map_alloc(void);
void resv_map_release(struct kref *ref);

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dc82256b89dd..5ae8bc8c928e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -91,8 +91,8 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;

/* Forward declaration */
static int hugetlb_acct_memory(struct hstate *h, long delta);
-static void hugetlb_vma_lock_free(struct vm_area_struct *vma);
-static int hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
+static void hugetlb_vma_data_free(struct vm_area_struct *vma);
+static int hugetlb_vma_data_alloc(struct vm_area_struct *vma);
static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);

static inline bool subpool_is_free(struct hugepage_subpool *spool)
@@ -4643,11 +4643,11 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
if (vma_lock) {
if (vma_lock->vma != vma) {
vma->vm_private_data = NULL;
- hugetlb_vma_lock_alloc(vma);
+ hugetlb_vma_data_alloc(vma);
} else
pr_warn("HugeTLB: vma_lock already exists in %s.\n", __func__);
} else
- hugetlb_vma_lock_alloc(vma);
+ hugetlb_vma_data_alloc(vma);
}
}

@@ -4659,7 +4659,7 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
unsigned long reserve, start, end;
long gbl_reserve;

- hugetlb_vma_lock_free(vma);
+ hugetlb_vma_data_free(vma);

resv = vma_resv_map(vma);
if (!resv || !is_vma_resv_set(vma, HPAGE_RESV_OWNER))
@@ -6629,7 +6629,7 @@ bool hugetlb_reserve_pages(struct inode *inode,
/*
* vma specific semaphore used for pmd sharing synchronization
*/
- hugetlb_vma_lock_alloc(vma);
+ hugetlb_vma_data_alloc(vma);

/*
* Only apply hugepage reservation if asked. At fault time, an
@@ -6753,7 +6753,7 @@ bool hugetlb_reserve_pages(struct inode *inode,
hugetlb_cgroup_uncharge_cgroup_rsvd(hstate_index(h),
chg * pages_per_huge_page(h), h_cg);
out_err:
- hugetlb_vma_lock_free(vma);
+ hugetlb_vma_data_free(vma);
if (!vma || vma->vm_flags & VM_MAYSHARE)
/* Only call region_abort if the region_chg succeeded but the
* region_add failed or didn't run.
@@ -6901,55 +6901,55 @@ static bool __vma_shareable_flags_pmd(struct vm_area_struct *vma)
void hugetlb_vma_lock_read(struct vm_area_struct *vma)
{
if (__vma_shareable_flags_pmd(vma)) {
- struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+ struct hugetlb_shared_vma_data *data = vma->vm_private_data;

- down_read(&vma_lock->rw_sema);
+ down_read(&data->vma_lock.rw_sema);
}
}

void hugetlb_vma_unlock_read(struct vm_area_struct *vma)
{
if (__vma_shareable_flags_pmd(vma)) {
- struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+ struct hugetlb_shared_vma_data *data = vma->vm_private_data;

- up_read(&vma_lock->rw_sema);
+ up_read(&data->vma_lock.rw_sema);
}
}

void hugetlb_vma_lock_write(struct vm_area_struct *vma)
{
if (__vma_shareable_flags_pmd(vma)) {
- struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+ struct hugetlb_shared_vma_data *data = vma->vm_private_data;

- down_write(&vma_lock->rw_sema);
+ down_write(&data->vma_lock.rw_sema);
}
}

void hugetlb_vma_unlock_write(struct vm_area_struct *vma)
{
if (__vma_shareable_flags_pmd(vma)) {
- struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+ struct hugetlb_shared_vma_data *data = vma->vm_private_data;

- up_write(&vma_lock->rw_sema);
+ up_write(&data->vma_lock.rw_sema);
}
}

int hugetlb_vma_trylock_write(struct vm_area_struct *vma)
{
- struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+ struct hugetlb_shared_vma_data *data = vma->vm_private_data;

if (!__vma_shareable_flags_pmd(vma))
return 1;

- return down_write_trylock(&vma_lock->rw_sema);
+ return down_write_trylock(&data->vma_lock.rw_sema);
}

void hugetlb_vma_assert_locked(struct vm_area_struct *vma)
{
if (__vma_shareable_flags_pmd(vma)) {
- struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+ struct hugetlb_shared_vma_data *data = vma->vm_private_data;

- lockdep_assert_held(&vma_lock->rw_sema);
+ lockdep_assert_held(&data->vma_lock.rw_sema);
}
}

@@ -6985,7 +6985,7 @@ static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma)
}
}

-static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
+static void hugetlb_vma_data_free(struct vm_area_struct *vma)
{
/*
* Only present in sharable vmas.
@@ -6994,16 +6994,17 @@ static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
return;

if (vma->vm_private_data) {
- struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+ struct hugetlb_shared_vma_data *data = vma->vm_private_data;
+ struct hugetlb_vma_lock *vma_lock = &data->vma_lock;

down_write(&vma_lock->rw_sema);
__hugetlb_vma_unlock_write_put(vma_lock);
}
}

-static int hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
+static int hugetlb_vma_data_alloc(struct vm_area_struct *vma)
{
- struct hugetlb_vma_lock *vma_lock;
+ struct hugetlb_shared_vma_data *data;

/* Only establish in (flags) sharable vmas */
if (!vma || !(vma->vm_flags & VM_MAYSHARE))
@@ -7013,8 +7014,8 @@ static int hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
if (vma->vm_private_data)
return 0;

- vma_lock = kmalloc(sizeof(*vma_lock), GFP_KERNEL);
- if (!vma_lock) {
+ data = kmalloc(sizeof(*data), GFP_KERNEL);
+ if (!data) {
/*
* If we can not allocate structure, then vma can not
* participate in pmd sharing. This is only a possible
@@ -7025,14 +7026,14 @@ static int hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
* until the file is removed. Warn in the unlikely case of
* allocation failure.
*/
- pr_warn_once("HugeTLB: unable to allocate vma specific lock\n");
+ pr_warn_once("HugeTLB: unable to allocate vma shared data\n");
return -ENOMEM;
}

- kref_init(&vma_lock->refs);
- init_rwsem(&vma_lock->rw_sema);
- vma_lock->vma = vma;
- vma->vm_private_data = vma_lock;
+ kref_init(&data->vma_lock.refs);
+ init_rwsem(&data->vma_lock.rw_sema);
+ data->vma_lock.vma = vma;
+ vma->vm_private_data = data;
return 0;
}

@@ -7157,11 +7158,11 @@ static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma)
{
}

-static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
+static void hugetlb_vma_data_free(struct vm_area_struct *vma)
{
}

-static int hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
+static int hugetlb_vma_data_alloc(struct vm_area_struct *vma)
{
return 0;
}
--
2.38.0.135.g90850a2211-goog

2022-10-21 17:42:03

by James Houghton

[permalink] [raw]
Subject: [RFC PATCH v2 47/47] selftests/vm: add HGM UFFDIO_CONTINUE and hwpoison tests

This tests that high-granularity CONTINUEs at all sizes work
(exercising contiguous PTE sizes for arm64, when support is added). This
also tests that collapse works and hwpoison works correctly (although we
aren't yet testing high-granularity poison).

Signed-off-by: James Houghton <[email protected]>
---
tools/testing/selftests/vm/Makefile | 1 +
tools/testing/selftests/vm/hugetlb-hgm.c | 326 +++++++++++++++++++++++
2 files changed, 327 insertions(+)
create mode 100644 tools/testing/selftests/vm/hugetlb-hgm.c

diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 00920cb8b499..da1e01a5ac9b 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -32,6 +32,7 @@ TEST_GEN_FILES += compaction_test
TEST_GEN_FILES += gup_test
TEST_GEN_FILES += hmm-tests
TEST_GEN_FILES += hugetlb-madvise
+TEST_GEN_FILES += hugetlb-hgm
TEST_GEN_FILES += hugepage-mmap
TEST_GEN_FILES += hugepage-mremap
TEST_GEN_FILES += hugepage-shm
diff --git a/tools/testing/selftests/vm/hugetlb-hgm.c b/tools/testing/selftests/vm/hugetlb-hgm.c
new file mode 100644
index 000000000000..e36a1c988bb4
--- /dev/null
+++ b/tools/testing/selftests/vm/hugetlb-hgm.c
@@ -0,0 +1,326 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test uncommon cases in HugeTLB high-granularity mapping:
+ * 1. Test all supported high-granularity page sizes (with MADV_COLLAPSE).
+ * 2. Test MADV_HWPOISON behavior.
+ */
+
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <sys/poll.h>
+#include <stdint.h>
+#include <string.h>
+
+#include <linux/userfaultfd.h>
+#include <linux/magic.h>
+#include <sys/mman.h>
+#include <sys/statfs.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <signal.h>
+#include <pthread.h>
+
+#define PAGE_MASK ~(4096 - 1)
+
+#ifndef MADV_COLLAPSE
+#define MADV_COLLAPSE 25
+#endif
+
+#define PREFIX " ... "
+
+int userfaultfd(int flags)
+{
+ return syscall(__NR_userfaultfd, flags);
+}
+
+int map_range(int uffd, char *addr, uint64_t length)
+{
+ struct uffdio_continue cont = {
+ .range = (struct uffdio_range) {
+ .start = (uint64_t)addr,
+ .len = length,
+ },
+ .mode = 0,
+ .mapped = 0,
+ };
+
+ if (ioctl(uffd, UFFDIO_CONTINUE, &cont) < 0) {
+ perror("UFFDIO_CONTINUE failed");
+ return -1;
+ }
+ return 0;
+}
+
+int check_equal(char *mapping, size_t length, char value)
+{
+ size_t i;
+
+ for (i = 0; i < length; ++i)
+ if (mapping[i] != value) {
+ printf("mismatch at %p (%d != %d)\n", &mapping[i],
+ mapping[i], value);
+ return -1;
+ }
+
+ return 0;
+}
+
+int test_continues(int uffd, char *primary_map, char *secondary_map, size_t len,
+ bool verify)
+{
+ size_t offset = 0;
+ unsigned char iter = 0;
+ unsigned long pagesize = getpagesize();
+ uint64_t size;
+
+ for (size = len/2; size >= pagesize;
+ offset += size, size /= 2) {
+ iter++;
+ memset(secondary_map + offset, iter, size);
+ printf(PREFIX "UFFDIO_CONTINUE: %p -> %p = %d%s\n",
+ primary_map + offset,
+ primary_map + offset + size,
+ iter,
+ verify ? " (and verify)" : "");
+ if (map_range(uffd, primary_map + offset, size))
+ return -1;
+ if (verify && check_equal(primary_map + offset, size, iter))
+ return -1;
+ }
+ return 0;
+}
+
+int test_collapse(char *primary_map, size_t len, bool hwpoison)
+{
+ size_t offset;
+ int i;
+ uint64_t size;
+
+ printf(PREFIX "collapsing %p -> %p\n", primary_map, primary_map + len);
+ if (madvise(primary_map, len, MADV_COLLAPSE) < 0) {
+ if (errno == EHWPOISON && hwpoison) {
+ /* this is expected for the hwpoison test. */
+ printf(PREFIX "could not collapse due to poison\n");
+ return 0;
+ }
+ perror("collapse failed");
+ return -1;
+ }
+
+ printf(PREFIX "verifying %p -> %p\n", primary_map, primary_map + len);
+
+ offset = 0;
+ i = 0;
+ for (size = len/2; size > 4096; offset += size, size /= 2) {
+ if (check_equal(primary_map + offset, size, ++i))
+ return -1;
+ }
+ /* expect the last 4K to be zero. */
+ if (check_equal(primary_map + len - 4096, 4096, 0))
+ return -1;
+
+ return 0;
+}
+
+static void *poisoned_addr;
+
+void sigbus_handler(int signo, siginfo_t *info, void *context)
+{
+ if (info->si_code != BUS_MCEERR_AR)
+ goto kill;
+ poisoned_addr = info->si_addr;
+kill:
+ pthread_exit(NULL);
+}
+
+void *access_mem(void *addr)
+{
+ volatile char *ptr = addr;
+
+ *ptr;
+ return NULL;
+}
+
+int test_poison_sigbus(char *addr)
+{
+ int ret = 0;
+ pthread_t pthread;
+
+ poisoned_addr = (void *)0xBADBADBAD;
+ ret = pthread_create(&pthread, NULL, &access_mem, addr);
+ if (ret) {
+ printf("failed to create thread: %s\n", strerror(ret));
+ return ret;
+ }
+
+ pthread_join(pthread, NULL);
+ if (poisoned_addr != addr) {
+ printf("got incorrect poisoned address: %p vs %p\n",
+ poisoned_addr, addr);
+ return -1;
+ }
+ return 0;
+}
+
+int test_hwpoison(char *primary_map, size_t len)
+{
+ const unsigned long pagesize = getpagesize();
+ const int num_poison_checks = 512;
+ unsigned long bytes_per_check = len/num_poison_checks;
+ struct sigaction new, old;
+ int i;
+
+ printf(PREFIX "poisoning %p -> %p\n", primary_map, primary_map + len);
+ if (madvise(primary_map, len, MADV_HWPOISON) < 0) {
+ perror("MADV_HWPOISON failed");
+ return -1;
+ }
+
+ printf(PREFIX "checking that it was poisoned "
+ "(%d addresses within %p -> %p)\n",
+ num_poison_checks, primary_map, primary_map + len);
+
+ new.sa_sigaction = &sigbus_handler;
+ new.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGBUS, &new, &old) < 0) {
+ perror("could not setup SIGBUS handler");
+ return -1;
+ }
+
+ if (pagesize > bytes_per_check)
+ bytes_per_check = pagesize;
+
+ for (i = 0; i < len; i += bytes_per_check)
+ if (test_poison_sigbus(primary_map + i) < 0)
+ return -1;
+ /* check very last byte, because we left it unmapped */
+ if (test_poison_sigbus(primary_map + len - 1))
+ return -1;
+
+ return 0;
+}
+
+int test_hgm(int fd, size_t hugepagesize, size_t len, bool hwpoison)
+{
+ int ret = 0;
+ int uffd;
+ char *primary_map, *secondary_map;
+ struct uffdio_api api;
+ struct uffdio_register reg;
+
+ if (ftruncate(fd, len) < 0) {
+ perror("ftruncate failed");
+ return -1;
+ }
+
+ uffd = userfaultfd(O_CLOEXEC | O_NONBLOCK);
+ if (uffd < 0) {
+ perror("uffd not created");
+ return -1;
+ }
+
+ primary_map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (primary_map == MAP_FAILED) {
+ perror("mmap for primary mapping failed");
+ ret = -1;
+ goto close_uffd;
+ }
+ secondary_map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (secondary_map == MAP_FAILED) {
+ perror("mmap for secondary mapping failed");
+ ret = -1;
+ goto unmap_primary;
+ }
+
+ printf(PREFIX "primary mapping: %p\n", primary_map);
+ printf(PREFIX "secondary mapping: %p\n", secondary_map);
+
+ api.api = UFFD_API;
+ api.features = UFFD_FEATURE_MINOR_HUGETLBFS |
+ UFFD_FEATURE_MISSING_HUGETLBFS |
+ UFFD_FEATURE_MINOR_HUGETLBFS_HGM | UFFD_FEATURE_SIGBUS |
+ UFFD_FEATURE_EXACT_ADDRESS;
+ if (ioctl(uffd, UFFDIO_API, &api) == -1) {
+ perror("UFFDIO_API failed");
+ ret = -1;
+ goto out;
+ }
+ if (!(api.features & UFFD_FEATURE_MINOR_HUGETLBFS_HGM)) {
+ puts("UFFD_FEATURE_MINOR_HUGETLBFS_HGM not present");
+ ret = -1;
+ goto out;
+ }
+
+ reg.range.start = (unsigned long)primary_map;
+ reg.range.len = len;
+ reg.mode = UFFDIO_REGISTER_MODE_MINOR | UFFDIO_REGISTER_MODE_MISSING;
+ reg.ioctls = 0;
+ if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1) {
+ perror("register failed");
+ ret = -1;
+ goto out;
+ }
+
+ if (test_continues(uffd, primary_map, secondary_map, len, !hwpoison)
+ || (hwpoison && test_hwpoison(primary_map, len))
+ || test_collapse(primary_map, len, hwpoison)) {
+ ret = -1;
+ }
+
+ if (ftruncate(fd, 0) < 0) {
+ perror("ftruncate back to 0 failed");
+ ret = -1;
+ }
+
+out:
+ munmap(secondary_map, len);
+unmap_primary:
+ munmap(primary_map, len);
+close_uffd:
+ close(uffd);
+ return ret;
+}
+
+int main(void)
+{
+ int fd;
+ struct statfs file_stat;
+ size_t hugepagesize;
+ size_t len;
+
+ fd = memfd_create("hugetlb_tmp", MFD_HUGETLB);
+ if (fd < 0) {
+ perror("could not open hugetlbfs file");
+ return -1;
+ }
+
+ memset(&file_stat, 0, sizeof(file_stat));
+ if (fstatfs(fd, &file_stat)) {
+ perror("fstatfs failed");
+ goto close;
+ }
+ if (file_stat.f_type != HUGETLBFS_MAGIC) {
+ printf("not hugetlbfs file\n");
+ goto close;
+ }
+
+ hugepagesize = file_stat.f_bsize;
+ len = 2 * hugepagesize;
+ printf("HGM regular test...\n");
+ printf("HGM regular test: %s\n",
+ test_hgm(fd, hugepagesize, len, false)
+ ? "FAILED" : "PASSED");
+ printf("HGM hwpoison test...\n");
+ printf("HGM hwpoison test: %s\n",
+ test_hgm(fd, hugepagesize, len, true)
+ ? "FAILED" : "PASSED");
+close:
+ close(fd);
+
+ return 0;
+}
--
2.38.0.135.g90850a2211-goog

2022-11-16 20:37:38

by Peter Xu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 37/47] hugetlb: remove huge_pte_lock and huge_pte_lockptr

On Fri, Oct 21, 2022 at 04:36:53PM +0000, James Houghton wrote:
> They are replaced with hugetlb_pte_lock{,ptr}. All callers that haven't
> already been replaced don't get called when using HGM, so we handle them
> by populating hugetlb_ptes with the standard, hstate-sized huge PTEs.

I didn't yet check the rationale at all, but I just noticed there's one more
of these in the ppc code:

*** arch/powerpc/mm/pgtable.c:
huge_ptep_set_access_flags[264] assert_spin_locked(huge_pte_lockptr(huge_page_shift(h),

--
Peter Xu


2022-11-16 22:56:03

by Peter Xu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 10/47] hugetlb: add hugetlb_pte to track HugeTLB page table entries

On Fri, Oct 21, 2022 at 04:36:26PM +0000, James Houghton wrote:
> +struct hugetlb_pte {
> + pte_t *ptep;
> + unsigned int shift;
> + enum hugetlb_level level;
> + spinlock_t *ptl;
> +};

Do we need both shift + level? Maybe it's only meaningful for ARM where
the shift may not be directly calculated from level?

I'm wondering whether we can just maintain "shift" then we calculate
"level" realtime. It just reads a bit weird to have these two fields, also
a burden to most of the call sites where shift and level exactly match..

> +
> +static inline
> +void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
> + unsigned int shift, enum hugetlb_level level)

I'd think it's nicer to replace "populate" with something else, as populate
is definitely a meaningful word in vm world for "making something appear if
it wasn't". Maybe hugetlb_pte_setup()?

Even one step back, on the naming of hugetlb_pte.. Sorry to comment on
namings especially on this one, I really don't like to do that normally..
but here hugetlb_pte only walks the sub-page level of pgtables, meanwhile
it's not really a pte but an iterator. How about hugetlb_hgm_iter? "hgm"
tells that it only walks sub-level, and "iter" tells that it is an
iterator, being updated for each stepping downwards.

Then hugetlb_pte_populate() can be hugetlb_hgm_iter_init().

Take these comments with a grain of salt, and it never hurts to wait for a
2nd opinion before anything.

> +{
> + WARN_ON_ONCE(!ptep);
> + hpte->ptep = ptep;
> + hpte->shift = shift;
> + hpte->level = level;
> + hpte->ptl = NULL;
> +}
> +
> +static inline
> +unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
> +{
> + WARN_ON_ONCE(!hpte->ptep);
> + return 1UL << hpte->shift;
> +}
> +
> +static inline
> +unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
> +{
> + WARN_ON_ONCE(!hpte->ptep);
> + return ~(hugetlb_pte_size(hpte) - 1);
> +}
> +
> +static inline
> +unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
> +{
> + WARN_ON_ONCE(!hpte->ptep);
> + return hpte->shift;
> +}
> +
> +static inline
> +enum hugetlb_level hugetlb_pte_level(const struct hugetlb_pte *hpte)
> +{
> + WARN_ON_ONCE(!hpte->ptep);

There're definitely a bunch of hpte->ptep WARN_ON_ONCE()s.. AFAIK the
hugetlb_pte* will be setup once with valid ptep and then it should always
be. I rem someone commented on these helpers look not useful, which I must
confess I had the same feeling. But besides that, I'd rather drop all
these WARN_ON_ONCE()s but only check it when init() the iterator/pte.

> + return hpte->level;
> +}
> +
> +static inline
> +void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
> +{
> + dest->ptep = src->ptep;
> + dest->shift = src->shift;
> + dest->level = src->level;
> + dest->ptl = src->ptl;
> +}
> +
> +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
> +
> struct hugepage_subpool {
> spinlock_t lock;
> long count;
> @@ -1210,6 +1279,25 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
> return ptl;
> }
>
> +static inline
> +spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
> +{
> +
> + BUG_ON(!hpte->ptep);

Another BUG_ON(); better be dropped too.

> + if (hpte->ptl)
> + return hpte->ptl;
> + return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm, hpte->ptep);

I'm curious whether we can always have hpte->ptl set for a valid
hugetlb_pte. I think that means we'll need to also init the ptl in the
init() fn of the iterator. Then it'll be clear on which lock to take for
each valid hugetlb_pte.

> +}

--
Peter Xu


2022-11-17 01:24:51

by James Houghton

[permalink] [raw]
Subject: Re: [RFC PATCH v2 10/47] hugetlb: add hugetlb_pte to track HugeTLB page table entries

On Wed, Nov 16, 2022 at 2:18 PM Peter Xu <[email protected]> wrote:
>
> On Fri, Oct 21, 2022 at 04:36:26PM +0000, James Houghton wrote:
> > +struct hugetlb_pte {
> > + pte_t *ptep;
> > + unsigned int shift;
> > + enum hugetlb_level level;
> > + spinlock_t *ptl;
> > +};
>
> Do we need both shift + level? Maybe it's only meaningful for ARM where
> the shift may not be directly calculated from level?
>
> I'm wondering whether we can just maintain "shift" then we calculate
> "level" realtime. It just reads a bit weird to have these two fields, also
> a burden to most of the call sites where shift and level exactly match..

My main concern is interaction with folded levels. For example, if
PUD_SIZE and PMD_SIZE are the same, we want to do something like this:

pud = pud_offset(p4d, addr)
pmd = pmd_offset(pud, addr) /* this is just pmd = (pmd_t *) pud */
pte = pte_offset(pmd, addr)

and I think we should avoid quietly skipping the folded level, which
could happen:

pud = pud_offset(p4d, addr)
/* Each time, we go back to pte_t *, so if we stored PUD_SHIFT here,
it is impossible to know that `pud` came from `pud_offset` and not
`pmd_offset`. We must assume the deeper level so that we don't get
stuck in a loop. */
pte = pte_offset(pud, addr) /* pud is cast from (pud_t * -> pte_t * ->
pmd_t *) */

Quietly dropping p*d_offset for folded levels is safe; it's just a
cast that we're doing anyway. If you think this is fine, then I can
remove `level`. It might also be that this is a non-issue and that
there will never be a folded level underneath a hugepage level.

We could also change `ptep` to a union eventually (to clean up
"hugetlb casts everything to pte_t *" messiness), and having an
explicit `level` as a tag for the union would be nice help. In the
same way: I like having `level` explicitly so that we know for sure
where `ptep` came from.
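
Purely to illustrate the union idea, a rough sketch of where this could
eventually go (not something in this series; field names made up):

struct hugetlb_pte {
	union {
		pte_t *ptep;
		pmd_t *pmdp;
		pud_t *pudp;
	};
	unsigned int shift;
	enum hugetlb_level level;	/* tag: which union member is valid */
	spinlock_t *ptl;
};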

I can try to reduce the burden at the callsite while keeping `level`:
hpage_size_to_level() is really annoying to have everywhere.

>
> > +
> > +static inline
> > +void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
> > + unsigned int shift, enum hugetlb_level level)
>
> I'd think it's nicer to replace "populate" with something else, as populate
> is definitely a meaningful word in vm world for "making something appear if
> it wasn't". Maybe hugetlb_pte_setup()?
>
> Even one step back, on the naming of hugetlb_pte.. Sorry to comment on
> namings especially on this one, I really don't like to do that normally..
> but here hugetlb_pte only walks the sub-page level of pgtables, meanwhile
> it's not really a pte but an iterator. How about hugetlb_hgm_iter? "hgm"
> tells that it only walks sub-level, and "iter" tells that it is an
> iterator, being updated for each stepping downwards.
>
> Then hugetlb_pte_populate() can be hugetlb_hgm_iter_init().
>
> Take these comments with a grain of salt, and it never hurts to wait for a
> 2nd opinion before anything.

I think this is a great idea. :) Thank you! I'll make this change for
v1 unless someone has a better suggestion.

>
> > +{
> > + WARN_ON_ONCE(!ptep);
> > + hpte->ptep = ptep;
> > + hpte->shift = shift;
> > + hpte->level = level;
> > + hpte->ptl = NULL;
> > +}
> > +
> > +static inline
> > +unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
> > +{
> > + WARN_ON_ONCE(!hpte->ptep);
> > + return 1UL << hpte->shift;
> > +}
> > +
> > +static inline
> > +unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
> > +{
> > + WARN_ON_ONCE(!hpte->ptep);
> > + return ~(hugetlb_pte_size(hpte) - 1);
> > +}
> > +
> > +static inline
> > +unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
> > +{
> > + WARN_ON_ONCE(!hpte->ptep);
> > + return hpte->shift;
> > +}
> > +
> > +static inline
> > +enum hugetlb_level hugetlb_pte_level(const struct hugetlb_pte *hpte)
> > +{
> > + WARN_ON_ONCE(!hpte->ptep);
>
> There're definitely a bunch of hpte->ptep WARN_ON_ONCE()s.. AFAIK the
> hugetlb_pte* will be setup once with valid ptep and then it should always
> be. I rem someone commented on these helpers look not useful, which I must
> confess I had the same feeling. But besides that, I'd rather drop all
> these WARN_ON_ONCE()s but only check it when init() the iterator/pte.

The idea with these WARN_ON_ONCE()s is that they WARN for the case where
`hpte` was never populated/initialized, but I realize that we can't
even rely on hpte->ptep == NULL. I'll remove the WARN_ON_ONCE()s, and
I'll drop hugetlb_pte_shift and hugetlb_pte_level entirely.

I'll keep the hugetlb_pte_{size,mask,copy,present_leaf} helpers as
they are legitimately helpful.

>
> > + return hpte->level;
> > +}
> > +
> > +static inline
> > +void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
> > +{
> > + dest->ptep = src->ptep;
> > + dest->shift = src->shift;
> > + dest->level = src->level;
> > + dest->ptl = src->ptl;
> > +}
> > +
> > +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
> > +
> > struct hugepage_subpool {
> > spinlock_t lock;
> > long count;
> > @@ -1210,6 +1279,25 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
> > return ptl;
> > }
> >
> > +static inline
> > +spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
> > +{
> > +
> > + BUG_ON(!hpte->ptep);
>
> Another BUG_ON(); better be dropped too.

Can do.

>
> > + if (hpte->ptl)
> > + return hpte->ptl;
> > + return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm, hpte->ptep);
>
> I'm curious whether we can always have hpte->ptl set for a valid
> hugetlb_pte. I think that means we'll need to also init the ptl in the
> init() fn of the iterator. Then it'll be clear on which lock to take for
> each valid hugetlb_pte.

I can work on this for v1. Right now it's not very good: for 4K PTEs,
we manually set ->ptl while walking. I'll make it so that ->ptl is
always populated so the code is easier to read.
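
Roughly, something like this for the init path (just a sketch; the name and
exact signature are placeholders, and the 4K case may still need to pick a
different ->ptl during the walk):

static inline void
hugetlb_pte_init(struct mm_struct *mm, struct hugetlb_pte *hpte,
		 pte_t *ptep, unsigned int shift, enum hugetlb_level level)
{
	hpte->ptep = ptep;
	hpte->shift = shift;
	hpte->level = level;
	/* Resolve the page table lock up front instead of lazily. */
	hpte->ptl = huge_pte_lockptr(shift, mm, ptep);
}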

- James

>
> > +}

>
> --
> Peter Xu
>

2022-11-17 17:32:13

by Peter Xu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 10/47] hugetlb: add hugetlb_pte to track HugeTLB page table entries

On Wed, Nov 16, 2022 at 05:00:08PM -0800, James Houghton wrote:
> On Wed, Nov 16, 2022 at 2:18 PM Peter Xu <[email protected]> wrote:
> >
> > On Fri, Oct 21, 2022 at 04:36:26PM +0000, James Houghton wrote:
> > > +struct hugetlb_pte {
> > > + pte_t *ptep;
> > > + unsigned int shift;
> > > + enum hugetlb_level level;
> > > + spinlock_t *ptl;
> > > +};
> >
> > Do we need both shift + level? Maybe it's only meaningful for ARM where
> > the shift may not be directly calculated from level?
> >
> > I'm wondering whether we can just maintain "shift" then we calculate
> > "level" realtime. It just reads a bit weird to have these two fields, also
> > a burden to most of the call sites where shift and level exactly match..
>
> My main concern is interaction with folded levels. For example, if
> PUD_SIZE and PMD_SIZE are the same, we want to do something like this:
>
> pud = pud_offset(p4d, addr)
> pmd = pmd_offset(pud, addr) /* this is just pmd = (pmd_t *) pud */
> pte = pte_offset(pmd, addr)
>
> and I think we should avoid quietly skipping the folded level, which
> could happen:
>
> pud = pud_offset(p4d, addr)
> /* Each time, we go back to pte_t *, so if we stored PUD_SHIFT here,
> it is impossible to know that `pud` came from `pud_offset` and not
> `pmd_offset`. We must assume the deeper level so that we don't get
> stuck in a loop. */
> pte = pte_offset(pud, addr) /* pud is cast from (pud_t * -> pte_t * ->
> pmd_t *) */
>
> Quietly dropping p*d_offset for folded levels is safe; it's just a
> cast that we're doing anyway. If you think this is fine, then I can
> remove `level`. It might also be that this is a non-issue and that
> there will never be a folded level underneath a hugepage level.
>
> We could also change `ptep` to a union eventually (to clean up
> "hugetlb casts everything to pte_t *" messiness), and having an
> explicit `level` as a tag for the union would be nice help. In the
> same way: I like having `level` explicitly so that we know for sure
> where `ptep` came from.

Makes sense.

>
> I can try to reduce the burden at the callsite while keeping `level`:
> hpage_size_to_level() is really annoying to have everywhere.

Yeah this would be nice.

From what I see, most callers have paired level/shift, so maybe we can
make hugetlb_hgm_iter_init() take only one of them and calculate the
other. Then there can also be a __hugetlb_hgm_iter_init() which takes
both, used only in the few places where an explicit level/shift is
necessary. hugetlb_hgm_iter_init() calls __hugetlb_hgm_iter_init().
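
Untested sketch of what I mean (keeping the current struct name, and
assuming hpage_size_to_level() stays as the size-to-level mapping):

static inline void
__hugetlb_hgm_iter_init(struct hugetlb_pte *hpte, pte_t *ptep,
			unsigned int shift, enum hugetlb_level level)
{
	/* Explicit shift + level, for the few callers that need both. */
	hpte->ptep = ptep;
	hpte->shift = shift;
	hpte->level = level;
	hpte->ptl = NULL;
}

static inline void
hugetlb_hgm_iter_init(struct hugetlb_pte *hpte, pte_t *ptep,
		      unsigned int shift)
{
	/* Common case: derive the level from the shift. */
	__hugetlb_hgm_iter_init(hpte, ptep, shift,
				hpage_size_to_level(1UL << shift));
}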

Thanks,

--
Peter Xu


2022-11-30 21:23:02

by Peter Xu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 06/47] hugetlb: extend vma lock for shared vmas

On Fri, Oct 21, 2022 at 04:36:22PM +0000, James Houghton wrote:
> This allows us to add more data into the shared structure, which we will
> use to store whether or not HGM is enabled for this VMA, as HGM
> is only available for shared mappings.
>
> It may be better to include HGM as a VMA flag instead of extending the
> VMA lock structure.
>
> Signed-off-by: James Houghton <[email protected]>
> ---
> include/linux/hugetlb.h | 4 +++
> mm/hugetlb.c | 65 +++++++++++++++++++++--------------------
> 2 files changed, 37 insertions(+), 32 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index a899bc76d677..534958499ac4 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -121,6 +121,10 @@ struct hugetlb_vma_lock {
> struct vm_area_struct *vma;
> };
>
> +struct hugetlb_shared_vma_data {
> + struct hugetlb_vma_lock vma_lock;
> +};

How about adding a comment above hugetlb_vma_lock showing how it should be
used correctly? We lacked documentation on the lock for pmd sharing
protection; now that the same lock is reused for HGM pgtables, I think some
documentation will definitely help.

To summarize, I think so far it means:

- Read lock needed when one wants to stabilize VM_SHARED pgtables (covers
both pmd shared pgtables or hgm low-level pgtables)

- Write lock needed when one wants to release VM_SHARED pgtable pages
(covers both pmd unshare or releasing hgm low-level pgtables)

Or something like that.
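
Maybe something along these lines above the struct definition (wording is
only a rough sketch):

/*
 * hugetlb_vma_lock protects the page tables of VM_SHARED hugetlb VMAs.
 *
 * - Hold it for read to stabilize the VMA's page tables, i.e. to keep a
 *   shared PMD from being unshared and (with HGM) to keep high-granularity
 *   page tables from being collapsed or freed from under the walker.
 *
 * - Hold it for write to unshare a PMD or to collapse/release
 *   high-granularity page tables.
 */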

--
Peter Xu

2022-11-30 22:08:40

by Peter Xu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 25/47] hugetlb: add HGM support for copy_hugetlb_page_range

On Fri, Oct 21, 2022 at 04:36:41PM +0000, James Houghton wrote:
> This allows fork() to work with high-granularity mappings. The page
> table structure is copied such that partially mapped regions will remain
> partially mapped in the same way for the new process.
>
> A page's reference count is incremented for *each* portion of it that is
> mapped in the page table. For example, if you have a PMD-mapped 1G page,
> the reference count and mapcount will be incremented by 512.
>
> Signed-off-by: James Houghton <[email protected]>

I have a feeling that this path is not triggered. See:

bcd51a3c679d ("hugetlb: lazy page table copies in fork()", 2022-07-17)

It might be helpful to have it when exploring private mapping support of
hgm on page poison in the future. But the thing is if we want this to be
accepted we still need a way to test it. I just don't see how to test this
without the private support being there..

--
Peter Xu

2022-11-30 23:58:34

by James Houghton

[permalink] [raw]
Subject: Re: [RFC PATCH v2 25/47] hugetlb: add HGM support for copy_hugetlb_page_range

On Wed, Nov 30, 2022 at 4:32 PM Peter Xu <[email protected]> wrote:
>
> On Fri, Oct 21, 2022 at 04:36:41PM +0000, James Houghton wrote:
> > This allows fork() to work with high-granularity mappings. The page
> > table structure is copied such that partially mapped regions will remain
> > partially mapped in the same way for the new process.
> >
> > A page's reference count is incremented for *each* portion of it that is
> > mapped in the page table. For example, if you have a PMD-mapped 1G page,
> > the reference count and mapcount will be incremented by 512.
> >
> > Signed-off-by: James Houghton <[email protected]>
>
> I have a feeling that this path is not triggered. See:
>
> bcd51a3c679d ("hugetlb: lazy page table copies in fork()", 2022-07-17)
>
> It might be helpful to have it when exploring private mapping support of
> hgm on page poison in the future. But the thing is if we want this to be
> accepted we still need a way to test it. I just don't see how to test this
> without the private support being there..

We can trigger this behavior by registering the VMA with
uffd-writeprotect. I didn't include any self-tests for this though;
I'll make sure to actually test this path in v1.

- James

>
> --
> Peter Xu
>

2022-12-01 00:02:14

by James Houghton

[permalink] [raw]
Subject: Re: [RFC PATCH v2 06/47] hugetlb: extend vma lock for shared vmas

On Wed, Nov 30, 2022 at 4:01 PM Peter Xu <[email protected]> wrote:
>
> On Fri, Oct 21, 2022 at 04:36:22PM +0000, James Houghton wrote:
> > This allows us to add more data into the shared structure, which we will
> > use to store whether or not HGM is enabled for this VMA, as HGM
> > is only available for shared mappings.
> >
> > It may be better to include HGM as a VMA flag instead of extending the
> > VMA lock structure.
> >
> > Signed-off-by: James Houghton <[email protected]>
> > ---
> > include/linux/hugetlb.h | 4 +++
> > mm/hugetlb.c | 65 +++++++++++++++++++++--------------------
> > 2 files changed, 37 insertions(+), 32 deletions(-)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index a899bc76d677..534958499ac4 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -121,6 +121,10 @@ struct hugetlb_vma_lock {
> > struct vm_area_struct *vma;
> > };
> >
> > +struct hugetlb_shared_vma_data {
> > + struct hugetlb_vma_lock vma_lock;
> > +};
>
> How about adding a comment above hugetlb_vma_lock showing how it should be
> used correctly? We lacked documentation on the lock for pmd sharing
> protection; now that the same lock is reused for HGM pgtables, I think some
> documentation will definitely help.
>
> To summarize, I think so far it means:
>
> - Read lock needed when one wants to stabilize VM_SHARED pgtables (covers
> both pmd shared pgtables or hgm low-level pgtables)
>
> - Write lock needed when one wants to release VM_SHARED pgtable pages
> (covers both pmd unshare or releasing hgm low-level pgtables)
>
> Or something like that.

Will do. I'll make this change together with the rmap comment update
("rmap: update hugetlb lock comment for HGM").

- James

>
> --
> Peter Xu
>

2022-12-08 01:12:20

by Mina Almasry

[permalink] [raw]
Subject: Re: [RFC PATCH v2 10/47] hugetlb: add hugetlb_pte to track HugeTLB page table entries

On Fri, Oct 21, 2022 at 9:37 AM James Houghton <[email protected]> wrote:
>
> After high-granularity mapping, page table entries for HugeTLB pages can
> be of any size/type. (For example, we can have a 1G page mapped with a
> mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
> PTE after we have done a page table walk.
>
> Without this, we'd have to pass around the "size" of the PTE everywhere.
> We effectively did this before; it could be fetched from the hstate,
> which we pass around pretty much everywhere.
>
> hugetlb_pte_present_leaf is included here as a helper function that will
> be used frequently later on.
>
> Signed-off-by: James Houghton <[email protected]>
> ---
> include/linux/hugetlb.h | 88 +++++++++++++++++++++++++++++++++++++++++
> mm/hugetlb.c | 29 ++++++++++++++
> 2 files changed, 117 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index db3ed6095b1c..d30322108b34 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -50,6 +50,75 @@ enum {
> __NR_USED_SUBPAGE,
> };
>
> +enum hugetlb_level {
> + HUGETLB_LEVEL_PTE = 1,
> + /*
> + * We always include PMD, PUD, and P4D in this enum definition so that,
> + * when logged as an integer, we can easily tell which level it is.
> + */
> + HUGETLB_LEVEL_PMD,
> + HUGETLB_LEVEL_PUD,
> + HUGETLB_LEVEL_P4D,
> + HUGETLB_LEVEL_PGD,
> +};
> +

Don't we need to support CONTIG_PTE/PMD levels here for ARM64?

> +struct hugetlb_pte {
> + pte_t *ptep;
> + unsigned int shift;
> + enum hugetlb_level level;

Is shift + level redundant? When would those diverge?

> + spinlock_t *ptl;
> +};
> +
> +static inline
> +void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
> + unsigned int shift, enum hugetlb_level level)
> +{
> + WARN_ON_ONCE(!ptep);
> + hpte->ptep = ptep;
> + hpte->shift = shift;
> + hpte->level = level;
> + hpte->ptl = NULL;
> +}
> +
> +static inline
> +unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
> +{
> + WARN_ON_ONCE(!hpte->ptep);
> + return 1UL << hpte->shift;
> +}
> +
> +static inline
> +unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
> +{
> + WARN_ON_ONCE(!hpte->ptep);
> + return ~(hugetlb_pte_size(hpte) - 1);
> +}
> +
> +static inline
> +unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
> +{
> + WARN_ON_ONCE(!hpte->ptep);
> + return hpte->shift;
> +}
> +
> +static inline
> +enum hugetlb_level hugetlb_pte_level(const struct hugetlb_pte *hpte)
> +{
> + WARN_ON_ONCE(!hpte->ptep);
> + return hpte->level;
> +}
> +
> +static inline
> +void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
> +{
> + dest->ptep = src->ptep;
> + dest->shift = src->shift;
> + dest->level = src->level;
> + dest->ptl = src->ptl;
> +}
> +
> +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
> +
> struct hugepage_subpool {
> spinlock_t lock;
> long count;
> @@ -1210,6 +1279,25 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
> return ptl;
> }
>
> +static inline
> +spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
> +{
> +
> + BUG_ON(!hpte->ptep);

I think BUG_ON()s will be frowned upon. This function also doesn't
really need ptep. Maybe let hugetlb_pte_shift() decide to BUG_ON() if
necessary.


> + if (hpte->ptl)
> + return hpte->ptl;
> + return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm, hpte->ptep);

> I don't know if this fallback to huge_pte_lockptr() should be obvious
to the reader. If not, a comment would help.

> +}
> +
> +static inline
> +spinlock_t *hugetlb_pte_lock(struct mm_struct *mm, struct hugetlb_pte *hpte)
> +{
> + spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
> +
> + spin_lock(ptl);
> + return ptl;
> +}
> +
> #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
> extern void __init hugetlb_cma_reserve(int order);
> #else
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ef7662bd0068..a0e46d35dabc 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1127,6 +1127,35 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
> return false;
> }
>
> +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte)

I also don't know if this is obvious to other readers, but I'm quite
confused that we pass both hugetlb_pte and pte_t here, especially when
hpte has a pte_t inside of it. Maybe a comment would help.

> +{
> + pgd_t pgd;
> + p4d_t p4d;
> + pud_t pud;
> + pmd_t pmd;
> +
> + WARN_ON_ONCE(!hpte->ptep);
> + switch (hugetlb_pte_level(hpte)) {
> + case HUGETLB_LEVEL_PGD:
> + pgd = __pgd(pte_val(pte));
> + return pgd_present(pgd) && pgd_leaf(pgd);
> + case HUGETLB_LEVEL_P4D:
> + p4d = __p4d(pte_val(pte));
> + return p4d_present(p4d) && p4d_leaf(p4d);
> + case HUGETLB_LEVEL_PUD:
> + pud = __pud(pte_val(pte));
> + return pud_present(pud) && pud_leaf(pud);
> + case HUGETLB_LEVEL_PMD:
> + pmd = __pmd(pte_val(pte));
> + return pmd_present(pmd) && pmd_leaf(pmd);
> + case HUGETLB_LEVEL_PTE:
> + return pte_present(pte);
> + default:
> + WARN_ON_ONCE(1);
> + return false;
> + }
> +}
> +
> static void enqueue_huge_page(struct hstate *h, struct page *page)
> {
> int nid = page_to_nid(page);
> --
> 2.38.0.135.g90850a2211-goog
>

2022-12-09 16:11:45

by James Houghton

[permalink] [raw]
Subject: Re: [RFC PATCH v2 10/47] hugetlb: add hugetlb_pte to track HugeTLB page table entries

On Wed, Dec 7, 2022 at 7:46 PM Mina Almasry <[email protected]> wrote:
>
> On Fri, Oct 21, 2022 at 9:37 AM James Houghton <[email protected]> wrote:
> >
> > After high-granularity mapping, page table entries for HugeTLB pages can
> > be of any size/type. (For example, we can have a 1G page mapped with a
> > mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
> > PTE after we have done a page table walk.
> >
> > Without this, we'd have to pass around the "size" of the PTE everywhere.
> > We effectively did this before; it could be fetched from the hstate,
> > which we pass around pretty much everywhere.
> >
> > hugetlb_pte_present_leaf is included here as a helper function that will
> > be used frequently later on.
> >
> > Signed-off-by: James Houghton <[email protected]>
> > ---
> > include/linux/hugetlb.h | 88 +++++++++++++++++++++++++++++++++++++++++
> > mm/hugetlb.c | 29 ++++++++++++++
> > 2 files changed, 117 insertions(+)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index db3ed6095b1c..d30322108b34 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -50,6 +50,75 @@ enum {
> > __NR_USED_SUBPAGE,
> > };
> >
> > +enum hugetlb_level {
> > + HUGETLB_LEVEL_PTE = 1,
> > + /*
> > + * We always include PMD, PUD, and P4D in this enum definition so that,
> > + * when logged as an integer, we can easily tell which level it is.
> > + */
> > + HUGETLB_LEVEL_PMD,
> > + HUGETLB_LEVEL_PUD,
> > + HUGETLB_LEVEL_P4D,
> > + HUGETLB_LEVEL_PGD,
> > +};
> > +
>
> Don't we need to support CONTIG_PTE/PMD levels here for ARM64?

Yeah, which is why shift and level aren't quite the same thing.
Contiguous PMDs would be HUGETLB_LEVEL_PMD but have shift =
CONT_PMD_SHIFT, whereas regular PMDs would have shift = PMD_SHIFT.
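
For example (arm64-flavored sketch, using hugetlb_pte_populate() from
this patch):

	/* Regular PMD mapping: */
	hugetlb_pte_populate(&hpte, (pte_t *)pmdp, PMD_SHIFT,
			     HUGETLB_LEVEL_PMD);

	/* Contiguous PMD mapping (arm64): same level, bigger shift. */
	hugetlb_pte_populate(&hpte, (pte_t *)pmdp, CONT_PMD_SHIFT,
			     HUGETLB_LEVEL_PMD);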

>
> > +struct hugetlb_pte {
> > + pte_t *ptep;
> > + unsigned int shift;
> > + enum hugetlb_level level;
>
> Is shift + level redundant? When would those diverge?

Peter asked a very similar question. `shift` can be used to determine
`level` if no levels are being folded. In the case of folded levels,
you might have a single shift that corresponds to multiple "levels".
That isn't necessarily a problem, as folding a level just means
casting your p?d_t* differently, but I think it's good to be able to
*know* that if the hugetlb_pte was populated with a pud_t*, we always
treat it like a pud_t*.

If `ptep` was instead a union, then `level` would be the tag. Perhaps
it should be written that way.

>
> > + spinlock_t *ptl;
> > +};
> > +
> > +static inline
> > +void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
> > + unsigned int shift, enum hugetlb_level level)
> > +{
> > + WARN_ON_ONCE(!ptep);
> > + hpte->ptep = ptep;
> > + hpte->shift = shift;
> > + hpte->level = level;
> > + hpte->ptl = NULL;
> > +}
> > +
> > +static inline
> > +unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
> > +{
> > + WARN_ON_ONCE(!hpte->ptep);
> > + return 1UL << hpte->shift;
> > +}
> > +
> > +static inline
> > +unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
> > +{
> > + WARN_ON_ONCE(!hpte->ptep);
> > + return ~(hugetlb_pte_size(hpte) - 1);
> > +}
> > +
> > +static inline
> > +unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
> > +{
> > + WARN_ON_ONCE(!hpte->ptep);
> > + return hpte->shift;
> > +}
> > +
> > +static inline
> > +enum hugetlb_level hugetlb_pte_level(const struct hugetlb_pte *hpte)
> > +{
> > + WARN_ON_ONCE(!hpte->ptep);
> > + return hpte->level;
> > +}
> > +
> > +static inline
> > +void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
> > +{
> > + dest->ptep = src->ptep;
> > + dest->shift = src->shift;
> > + dest->level = src->level;
> > + dest->ptl = src->ptl;
> > +}
> > +
> > +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
> > +
> > struct hugepage_subpool {
> > spinlock_t lock;
> > long count;
> > @@ -1210,6 +1279,25 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
> > return ptl;
> > }
> >
> > +static inline
> > +spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
> > +{
> > +
> > + BUG_ON(!hpte->ptep);
>
> I think BUG_ON()s will be frowned upon. This function also doesn't
> really need ptep. Maybe let hugetlb_pte_shift() decide to BUG_ON() if
> necessary.

Right. I'll remove this (and others that aren't really necessary).
Peter's suggestion to just let the kernel take a #pf and crash
(thereby logging more info) SGTM.

>
>
> > + if (hpte->ptl)
> > + return hpte->ptl;
> > + return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm, hpte->ptep);
>
> > I don't know if this fallback to huge_pte_lockptr() should be obvious
> to the reader. If not, a comment would help.

I'll clean this up a little for the next version. If something like
this branch stays, I'll add a comment.
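
Something like this, maybe (assuming hpte->ptl is only set when the
walker has a more specific lock to hand back):

	/*
	 * The walker may have stashed a lock pointer in hpte->ptl (e.g.
	 * the split PTL of a lower-level page table it walked into); if
	 * so, use it. Otherwise derive the lock from the PTE's shift,
	 * the same way huge_pte_lockptr() does for hugepage-sized PTEs.
	 */
	if (hpte->ptl)
		return hpte->ptl;
	return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm, hpte->ptep);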

>
> > +}
> > +
> > +static inline
> > +spinlock_t *hugetlb_pte_lock(struct mm_struct *mm, struct hugetlb_pte *hpte)
> > +{
> > + spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
> > +
> > + spin_lock(ptl);
> > + return ptl;
> > +}
> > +
> > #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
> > extern void __init hugetlb_cma_reserve(int order);
> > #else
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index ef7662bd0068..a0e46d35dabc 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1127,6 +1127,35 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
> > return false;
> > }
> >
> > +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte)
>
> I also don't know if this is obvious to other readers, but I'm quite
> confused that we pass both hugetlb_pte and pte_t here, especially when
> hpte has a pte_t inside of it. Maybe a comment would help.

It's possible for the value of the pte to change if we haven't locked
the PTL; we only store a pte_t* in hugetlb_pte, not the value itself.

Thinking about this... we *do* store `shift` which technically depends
on the value of the PTE. If the PTE is pte_none, the true `shift` of
the PTE is ambiguous, and so we just provide what the user asked for.
That could lead to a scenario where UFFDIO_CONTINUE(some 4K page) then
UFFDIO_CONTINUE(CONT_PTE_SIZE range around that page) can both succeed
because we merely check if the *first* PTE in the contiguous bunch is
none/has changed.

So, in the case of a contiguous PTE where we *think* we're overwriting
a bunch of none PTEs, we need to check that each PTE we're overwriting
is still none while holding the PTL. That means that the PTL we use
for cont PTEs and non-cont PTEs of the same level must be the same.
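
Roughly (hypothetical helper; the name, signature, and use of
ptep_get() are placeholders, and ncontig would be CONT_PTES/CONT_PMDS
on arm64):

/*
 * With the PTL held, check that every PTE in the contiguous group is
 * still none before installing the contiguous mapping.
 */
static bool all_contig_ptes_none(pte_t *ptep, unsigned int ncontig)
{
	unsigned int i;

	for (i = 0; i < ncontig; i++)
		if (!pte_none(ptep_get(ptep + i)))
			return false;
	return true;
}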

So for the next version, I'll:
- add some requirement that contiguous and non-contiguous PTEs on the
same level must use the same PTL
- think up some kind of API like all_contig_ptes_none(), but it only
really applies for arm64, so I think actually putting it in can wait.
I'll at least put a comment in hugetlb_mcopy_atomic_pte and
hugetlb_no_page (near the final huge_pte_none() and pte_same()
checks).


>
> > +{
> > + pgd_t pgd;
> > + p4d_t p4d;
> > + pud_t pud;
> > + pmd_t pmd;
> > +
> > + WARN_ON_ONCE(!hpte->ptep);
> > + switch (hugetlb_pte_level(hpte)) {
> > + case HUGETLB_LEVEL_PGD:
> > + pgd = __pgd(pte_val(pte));
> > + return pgd_present(pgd) && pgd_leaf(pgd);
> > + case HUGETLB_LEVEL_P4D:
> > + p4d = __p4d(pte_val(pte));
> > + return p4d_present(p4d) && p4d_leaf(p4d);
> > + case HUGETLB_LEVEL_PUD:
> > + pud = __pud(pte_val(pte));
> > + return pud_present(pud) && pud_leaf(pud);
> > + case HUGETLB_LEVEL_PMD:
> > + pmd = __pmd(pte_val(pte));
> > + return pmd_present(pmd) && pmd_leaf(pmd);
> > + case HUGETLB_LEVEL_PTE:
> > + return pte_present(pte);
> > + default:
> > + WARN_ON_ONCE(1);
> > + return false;
> > + }
> > +}
> > +
> > static void enqueue_huge_page(struct hstate *h, struct page *page)
> > {
> > int nid = page_to_nid(page);
> > --
> > 2.38.0.135.g90850a2211-goog
> >

2022-12-09 23:15:05

by Mike Kravetz

[permalink] [raw]
Subject: Re: [RFC PATCH v2 06/47] hugetlb: extend vma lock for shared vmas

On 11/30/22 16:01, Peter Xu wrote:
> On Fri, Oct 21, 2022 at 04:36:22PM +0000, James Houghton wrote:
> > This allows us to add more data into the shared structure, which we will
> > use to store whether or not HGM is enabled for this VMA, as HGM
> > is only available for shared mappings.
> >
> > It may be better to include HGM as a VMA flag instead of extending the
> > VMA lock structure.
> >
> > Signed-off-by: James Houghton <[email protected]>
> > ---
> > include/linux/hugetlb.h | 4 +++
> > mm/hugetlb.c | 65 +++++++++++++++++++++--------------------
> > 2 files changed, 37 insertions(+), 32 deletions(-)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index a899bc76d677..534958499ac4 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -121,6 +121,10 @@ struct hugetlb_vma_lock {
> > struct vm_area_struct *vma;
> > };
> >
> > +struct hugetlb_shared_vma_data {
> > + struct hugetlb_vma_lock vma_lock;
> > +};
>
> How about adding a comment above hugetlb_vma_lock showing how it should be
> used correctly? We have lacked documentation on the lock for pmd sharing
> protections; now that the same lock is being reused for HGM pgtables, I
> think some documentation will definitely help.
>
> To summarize, I think so far it means:
>
> - Read lock needed when one wants to stabilize VM_SHARED pgtables (covers
> both pmd shared pgtables or hgm low-level pgtables)
>
> - Write lock needed when one wants to release VM_SHARED pgtable pages
> (covers both pmd unshare or releasing hgm low-level pgtables)

Peter must have read ahead and knows that you plan to use the vma_lock for HGM.

The commit message implies that you only need some type of indication (a flag,
for instance) that HGM is enabled for the vma.

No objections to expanding the structure as is done here.

If this is the direction we take, and someday this is extended to private
mappings, we could use the same scheme to expand the reserve map structure.
--
Mike Kravetz

2022-12-09 23:16:08

by Mike Kravetz

[permalink] [raw]
Subject: Re: [RFC PATCH v2 07/47] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING

On 10/21/22 16:36, James Houghton wrote:
> This adds the Kconfig to enable or disable high-granularity mapping.
> Each architecture must explicitly opt-in to it (via
> ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING), but when opted in, HGM will
> be enabled by default if HUGETLB_PAGE is enabled.
>
> Signed-off-by: James Houghton <[email protected]>
> ---
> fs/Kconfig | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 2685a4d0d353..ce2567946016 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -267,6 +267,13 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
> enable HVO by default. It can be disabled via hugetlb_free_vmemmap=off
> (boot command line) or hugetlb_optimize_vmemmap (sysctl).
>
> +config ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING
> + bool
> +
> +config HUGETLB_HIGH_GRANULARITY_MAPPING
> + def_bool HUGETLB_PAGE
> + depends on ARCH_WANT_HUGETLB_HIGH_GRANULARITY_MAPPING

Might also need to make this depend on CONFIG_ARCH_WANT_HUGE_PMD_SHARE,
as the vma lock allocation will only happen in this case.

--
Mike Kravetz

> +
> config MEMFD_CREATE
> def_bool TMPFS || HUGETLBFS
>
> --
> 2.38.0.135.g90850a2211-goog
>

2022-12-13 20:16:43

by Mike Kravetz

[permalink] [raw]
Subject: Re: [RFC PATCH v2 10/47] hugetlb: add hugetlb_pte to track HugeTLB page table entries

On 12/09/22 11:02, James Houghton wrote:
> On Wed, Dec 7, 2022 at 7:46 PM Mina Almasry <[email protected]> wrote:
> > On Fri, Oct 21, 2022 at 9:37 AM James Houghton <[email protected]> wrote:
> > >
> > > +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte)
> >
> > I also don't know if this is obvious to other readers, but I'm quite
> > confused that we pass both hugetlb_pte and pte_t here, especially when
> > hpte has a pte_t inside of it. Maybe a comment would help.
>
> It's possible for the value of the pte to change if we haven't locked
> the PTL; we only store a pte_t* in hugetlb_pte, not the value itself.

I had comments similar to Mina and Peter on other parts of this patch. Calling
this without some type of locking is 'interesting'. I have not yet looked at
callers (without locking), but I assume such callers can handle stale results.

> Thinking about this... we *do* store `shift` which technically depends
> on the value of the PTE. If the PTE is pte_none, the true `shift` of
> the PTE is ambiguous, and so we just provide what the user asked for.
> That could lead to a scenario where UFFDIO_CONTINUE(some 4K page) then
> UFFDIO_CONTINUE(CONT_PTE_SIZE range around that page) can both succeed
> because we merely check if the *first* PTE in the contiguous bunch is
> none/has changed.

Right, Yuck!

>
> So, in the case of a contiguous PTE where we *think* we're overwriting
> a bunch of none PTEs, we need to check that each PTE we're overwriting
> is still none while holding the PTL. That means that the PTL we use
> for cont PTEs and non-cont PTEs of the same level must be the same.
>
> So for the next version, I'll:
> - add some requirement that contiguous and non-contiguous PTEs on the
> same level must use the same PTL
> - think up some kind of API like all_contig_ptes_none(), but it only
> really applies for arm64, so I think actually putting it in can wait.
> I'll at least put a comment in hugetlb_mcopy_atomic_pte and
> hugetlb_no_page (near the final huge_pte_none() and pte_same()
> checks).
>
--
Mike Kravetz

2022-12-14 22:30:11

by Mike Kravetz

[permalink] [raw]
Subject: Re: [RFC PATCH v2 14/47] hugetlb: make default arch_make_huge_pte understand small mappings

On 10/21/22 16:36, James Houghton wrote:
> This is a simple change: don't create a "huge" PTE if we are making a
> regular, PAGE_SIZE PTE. All architectures that want to implement HGM
> likely need to be changed in a similar way if they implement their own
> version of arch_make_huge_pte.

Nothing wrong with this patch.

However, I wish there were some way we could flag this requirement in
arch-specific code. It just seems like something that would be easy to
overlook.

--
Mike Kravetz

> Signed-off-by: James Houghton <[email protected]>
> ---
> include/linux/hugetlb.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 4b1548adecde..d305742e9d44 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -907,7 +907,7 @@ static inline void arch_clear_hugepage_flags(struct page *page) { }
> static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
> vm_flags_t flags)
> {
> - return pte_mkhuge(entry);
> + return shift > PAGE_SHIFT ? pte_mkhuge(entry) : entry;
> }
> #endif
>
> --
> 2.38.0.135.g90850a2211-goog
>

2022-12-15 00:46:48

by Mike Kravetz

[permalink] [raw]
Subject: Re: [RFC PATCH v2 16/47] hugetlb: make unmapping compatible with high-granularity mappings

On 10/21/22 16:36, James Houghton wrote:
> Enlighten __unmap_hugepage_range to deal with high-granularity mappings.
> This doesn't change its API; it still must be called with hugepage
> alignment, but it will correctly unmap hugepages that have been mapped
> at high granularity.
>
> The rules for mapcount and refcount here are:
> 1. Refcount and mapcount are tracked on the head page.
> 2. Each page table mapping into some part of an hpage will increase that
> hpage's mapcount and refcount by 1.
>
> Eventually, functionality here can be expanded to allow users to call
> MADV_DONTNEED on PAGE_SIZE-aligned sections of a hugepage, but that is
> not done here.
>
> Signed-off-by: James Houghton <[email protected]>
> ---
> include/asm-generic/tlb.h | 6 ++--
> mm/hugetlb.c | 76 +++++++++++++++++++++++++--------------
> 2 files changed, 52 insertions(+), 30 deletions(-)

All looks reasonable, nothing stands out.

--
Mike Kravetz