2022-06-24 18:01:54

by James Houghton

Subject: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping

This RFC introduces the concept of HugeTLB high-granularity mapping
(HGM)[1]. In broad terms, this series teaches HugeTLB how to map HugeTLB
pages at different granularities, and, more importantly, to partially map
a HugeTLB page. This cover letter will go over
- the motivation for these changes
- userspace API
- some of the changes to HugeTLB to make this work
- limitations & future enhancements

High-granularity mapping does *not* involve dissolving the hugepages
themselves; it only affects how they are mapped.

---- Motivation ----

Being able to map HugeTLB memory with PAGE_SIZE PTEs has important use
cases in post-copy live migration and memory failure handling.

- Live Migration (userfaultfd)
For post-copy live migration using userfaultfd, we currently have to
install an entire hugepage before we can allow a guest to access that page.
This is because, right now, either the WHOLE hugepage is mapped or NONE of
it is, so the guest can either access the whole hugepage or none of it.
This makes post-copy live migration for 1G HugeTLB-backed VMs completely
infeasible.

With high-granularity mapping, we can map PAGE_SIZE pieces of a hugepage,
thereby allowing the guest to access only PAGE_SIZE chunks, and getting
page faults on the rest (and triggering another demand-fetch). This gives
userspace the flexibility to install PAGE_SIZE chunks of memory into a
hugepage, making migration of 1G-backed VMs perfectly feasible, and it
vastly reduces the vCPU stall time during post-copy for 2M-backed VMs.

At Google, for a 48 vCPU VM in post-copy, we can expect these approximate
per-page median fetch latencies:
4K: <100us
2M: >10ms
Being able to unpause a vCPU 100x quicker is helpful for guest stability,
and being able to use 1G pages at all can significantly improve steady-state
guest performance.

After fully copying a hugepage over the network, we will want to collapse
the mapping down to what it would normally be (e.g., one PUD for a 1G
page). Rather than having the kernel do this automatically, we leave it up
to userspace to tell us to collapse a range (via MADV_COLLAPSE, co-opting
the API that is being introduced for THPs[2]).

- Memory Failure
When a memory error is found within a HugeTLB page, it would be ideal if we
could unmap only the PAGE_SIZE section that contained the error. This is
what THPs are able to do. Using high-granularity mapping, we could do this,
but this isn't tackled in this patch series.

---- Userspace API ----

This patch series introduces a single way to take advantage of
high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
userspace to resolve MINOR page faults on shared VMAs.
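
For illustration, here is a rough sketch of the userspace side (error
handling omitted; the ability to pass a PAGE_SIZE-sized range to
UFFDIO_CONTINUE on hugetlbfs is what this series adds):

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

/*
 * Sketch: resolve a MINOR fault on a single 4K chunk of a hugepage in a
 * shared hugetlbfs VMA registered with UFFDIO_REGISTER_MODE_MINOR.
 * Without HGM, .len would have to be a whole hugepage.
 */
static int continue_4k(int uffd, unsigned long fault_addr)
{
	struct uffdio_continue cont = {
		.range = {
			.start = fault_addr & ~(4096UL - 1),
			.len = 4096,
		},
		.mode = 0,
	};

	return ioctl(uffd, UFFDIO_CONTINUE, &cont);
}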

To collapse a HugeTLB address range that has been mapped with several
UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
userspace to know when all pages (that they care about) have been fetched.

---- HugeTLB Changes ----

- Mapcount
Mapcount is handled differently than before: a hugepage's mapcount is
increased when its PUD is not none. With this scheme, hugepages that aren't
mapped at high granularity keep the same mapcounts they would have had
pre-HGM.

- Page table walking and manipulation
A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
high-granularity mappings. Eventually, it may be possible to merge
hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.

We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
This is because we generally need to know the "size" of a PTE (previously
always just huge_page_size(hstate)).
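
As a rough sketch (the real definition is in the patches themselves), a
hugetlb_pte pairs a page table entry pointer with the shift that says how
much address space that entry covers:

/* Illustrative only; see the hugetlb_pte patch for the real definition. */
struct hugetlb_pte {
	pte_t *ptep;
	unsigned int shift;	/* e.g. PUD_SHIFT, PMD_SHIFT, PAGE_SHIFT */
};

static inline unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
{
	return 1UL << hpte->shift;
}

static inline unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
{
	return ~(hugetlb_pte_size(hpte) - 1);
}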

For every page table manipulation function that has a huge version (e.g.
ptep_get and huge_ptep_get), there is a new wrapper (e.g. hugetlb_ptep_get)
that picks the correct version depending on whether the HugeTLB PTE really
is "huge".
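
For example, the hugetlb_ptep_get wrapper would look roughly like this (a
sketch, not the exact code from the series):

static inline pte_t hugetlb_ptep_get(const struct hugetlb_pte *hpte)
{
	if (hpte->shift > PAGE_SHIFT)	/* really huge */
		return huge_ptep_get(hpte->ptep);
	return ptep_get(hpte->ptep);
}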

- Synchronization
For existing bits of HugeTLB, synchronization is unchanged. For splitting
and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
writing, and for doing high-granularity page table walks, we require it to
be held for reading.

---- Limitations & Future Changes ----

This patch series only implements high-granularity mapping for VM_SHARED
VMAs. I intend to implement enough HGM to support 4K unmapping for memory
failure recovery for both shared and private mappings.

The memory failure use case poses its own challenges that can be
addressed, but I will do so in a separate RFC.

Performance has not been heavily scrutinized with this patch series. There
are places where lock contention can significantly reduce performance. This
will be addressed later.

The patch series, as it stands right now, is compatible with the VMEMMAP
page struct optimization[3], as we do not need to modify data contained
in the subpage page structs.

Other omissions:
- Compatibility with userfaultfd write-protect (will be included in v1).
- Support for mremap() (will be included in v1). This looks a lot like
the support we have for fork().
- Documentation changes (will be included in v1).
- PMD sharing and hugepage migration are completely ignored (support will
be included in v1).
- Implementations for architectures that don't use GENERAL_HUGETLB other
than arm64.

---- Patch Breakdown ----

Patch 1 - Preliminary changes
Patch 2-10 - HugeTLB HGM core changes
Patch 11-13 - HugeTLB HGM page table walking functionality
Patch 14-19 - HugeTLB HGM compatibility with other bits
Patch 20-23 - Userfaultfd and collapse changes
Patch 24-26 - arm64 support and selftests

[1] This used to be called HugeTLB double mapping, a bad and confusing
name. "High-granularity mapping" is not a great name either. I am open
to better names.
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")

James Houghton (26):
hugetlb: make hstate accessor functions const
hugetlb: sort hstates in hugetlb_init_hstates
hugetlb: add make_huge_pte_with_shift
hugetlb: make huge_pte_lockptr take an explicit shift argument.
hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
mm: make free_p?d_range functions public
hugetlb: add hugetlb_pte to track HugeTLB page table entries
hugetlb: add hugetlb_free_range to free PT structures
hugetlb: add hugetlb_hgm_enabled
hugetlb: add for_each_hgm_shift
hugetlb: add hugetlb_walk_to to do PT walks
hugetlb: add HugeTLB splitting functionality
hugetlb: add huge_pte_alloc_high_granularity
hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page
hugetlb: make unmapping compatible with high-granularity mappings
hugetlb: make hugetlb_change_protection compatible with HGM
hugetlb: update follow_hugetlb_page to support HGM
hugetlb: use struct hugetlb_pte for walk_hugetlb_range
hugetlb: add HGM support for copy_hugetlb_page_range
hugetlb: add support for high-granularity UFFDIO_CONTINUE
hugetlb: add hugetlb_collapse
madvise: add uapi for HugeTLB HGM collapse: MADV_COLLAPSE
userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
arm64/hugetlb: add support for high-granularity mappings
selftests: add HugeTLB HGM to userfaultfd selftest
selftests: add HugeTLB HGM to KVM demand paging selftest

arch/arm64/Kconfig | 1 +
arch/arm64/mm/hugetlbpage.c | 63 ++
arch/powerpc/mm/pgtable.c | 3 +-
arch/s390/mm/gmap.c | 8 +-
fs/Kconfig | 7 +
fs/proc/task_mmu.c | 35 +-
fs/userfaultfd.c | 10 +-
include/asm-generic/tlb.h | 6 +-
include/linux/hugetlb.h | 177 +++-
include/linux/mm.h | 7 +
include/linux/pagewalk.h | 3 +-
include/uapi/asm-generic/mman-common.h | 2 +
include/uapi/linux/userfaultfd.h | 2 +
mm/damon/vaddr.c | 34 +-
mm/hmm.c | 7 +-
mm/hugetlb.c | 987 +++++++++++++++---
mm/madvise.c | 23 +
mm/memory.c | 8 +-
mm/mempolicy.c | 11 +-
mm/migrate.c | 3 +-
mm/mincore.c | 4 +-
mm/mprotect.c | 6 +-
mm/page_vma_mapped.c | 3 +-
mm/pagewalk.c | 18 +-
mm/userfaultfd.c | 57 +-
.../testing/selftests/kvm/include/test_util.h | 2 +
tools/testing/selftests/kvm/lib/kvm_util.c | 2 +-
tools/testing/selftests/kvm/lib/test_util.c | 14 +
tools/testing/selftests/vm/userfaultfd.c | 61 +-
29 files changed, 1314 insertions(+), 250 deletions(-)

--
2.37.0.rc0.161.g10f37bed90-goog


2022-06-24 18:01:56

by James Houghton

Subject: [RFC PATCH 25/26] selftests: add HugeTLB HGM to userfaultfd selftest

The new hugetlb_shared_hgm test mode behaves just like the regular shared
HugeTLB configuration, except that UFFDIO_CONTINUE operates on 4K chunks
instead of whole hugepages.

This doesn't test collapsing yet. I'll add a test for that for v1.

Signed-off-by: James Houghton <[email protected]>
---
tools/testing/selftests/vm/userfaultfd.c | 61 ++++++++++++++++++++----
1 file changed, 51 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 0bdfc1955229..9cbb959519a6 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -64,7 +64,7 @@

#ifdef __NR_userfaultfd

-static unsigned long nr_cpus, nr_pages, nr_pages_per_cpu, page_size;
+static unsigned long nr_cpus, nr_pages, nr_pages_per_cpu, page_size, hpage_size;

#define BOUNCE_RANDOM (1<<0)
#define BOUNCE_RACINGFAULTS (1<<1)
@@ -72,9 +72,10 @@ static unsigned long nr_cpus, nr_pages, nr_pages_per_cpu, page_size;
#define BOUNCE_POLL (1<<3)
static int bounces;

-#define TEST_ANON 1
-#define TEST_HUGETLB 2
-#define TEST_SHMEM 3
+#define TEST_ANON 1
+#define TEST_HUGETLB 2
+#define TEST_HUGETLB_HGM 3
+#define TEST_SHMEM 4
static int test_type;

/* exercise the test_uffdio_*_eexist every ALARM_INTERVAL_SECS */
@@ -85,6 +86,7 @@ static volatile bool test_uffdio_zeropage_eexist = true;
static bool test_uffdio_wp = true;
/* Whether to test uffd minor faults */
static bool test_uffdio_minor = false;
+static bool test_uffdio_copy = true;

static bool map_shared;
static int shm_fd;
@@ -140,12 +142,17 @@ static void usage(void)
fprintf(stderr, "\nUsage: ./userfaultfd <test type> <MiB> <bounces> "
"[hugetlbfs_file]\n\n");
fprintf(stderr, "Supported <test type>: anon, hugetlb, "
- "hugetlb_shared, shmem\n\n");
+ "hugetlb_shared, hugetlb_shared_hgm, shmem\n\n");
fprintf(stderr, "Examples:\n\n");
fprintf(stderr, "%s", examples);
exit(1);
}

+static bool test_is_hugetlb(void)
+{
+ return test_type == TEST_HUGETLB || test_type == TEST_HUGETLB_HGM;
+}
+
#define _err(fmt, ...) \
do { \
int ret = errno; \
@@ -348,7 +355,7 @@ static struct uffd_test_ops *uffd_test_ops;

static inline uint64_t uffd_minor_feature(void)
{
- if (test_type == TEST_HUGETLB && map_shared)
+ if (test_is_hugetlb() && map_shared)
return UFFD_FEATURE_MINOR_HUGETLBFS;
else if (test_type == TEST_SHMEM)
return UFFD_FEATURE_MINOR_SHMEM;
@@ -360,7 +367,7 @@ static uint64_t get_expected_ioctls(uint64_t mode)
{
uint64_t ioctls = UFFD_API_RANGE_IOCTLS;

- if (test_type == TEST_HUGETLB)
+ if (test_is_hugetlb())
ioctls &= ~(1 << _UFFDIO_ZEROPAGE);

if (!((mode & UFFDIO_REGISTER_MODE_WP) && test_uffdio_wp))
@@ -1116,6 +1123,12 @@ static int userfaultfd_events_test(void)
char c;
struct uffd_stats stats = { 0 };

+ if (!test_uffdio_copy) {
+ printf("Skipping userfaultfd events test "
+ "(test_uffdio_copy=false)\n");
+ return 0;
+ }
+
printf("testing events (fork, remap, remove): ");
fflush(stdout);

@@ -1169,6 +1182,12 @@ static int userfaultfd_sig_test(void)
char c;
struct uffd_stats stats = { 0 };

+ if (!test_uffdio_copy) {
+ printf("Skipping userfaultfd signal test "
+ "(test_uffdio_copy=false)\n");
+ return 0;
+ }
+
printf("testing signal delivery: ");
fflush(stdout);

@@ -1438,6 +1457,12 @@ static int userfaultfd_stress(void)
pthread_attr_init(&attr);
pthread_attr_setstacksize(&attr, 16*1024*1024);

+ if (!test_uffdio_copy) {
+ printf("Skipping userfaultfd stress test "
+ "(test_uffdio_copy=false)\n");
+ bounces = 0;
+ }
+
while (bounces--) {
printf("bounces: %d, mode:", bounces);
if (bounces & BOUNCE_RANDOM)
@@ -1598,6 +1623,13 @@ static void set_test_type(const char *type)
uffd_test_ops = &hugetlb_uffd_test_ops;
/* Minor faults require shared hugetlb; only enable here. */
test_uffdio_minor = true;
+ } else if (!strcmp(type, "hugetlb_shared_hgm")) {
+ map_shared = true;
+ test_type = TEST_HUGETLB_HGM;
+ uffd_test_ops = &hugetlb_uffd_test_ops;
+ /* Minor faults require shared hugetlb; only enable here. */
+ test_uffdio_minor = true;
+ test_uffdio_copy = false;
} else if (!strcmp(type, "shmem")) {
map_shared = true;
test_type = TEST_SHMEM;
@@ -1607,8 +1639,10 @@ static void set_test_type(const char *type)
err("Unknown test type: %s", type);
}

+ hpage_size = default_huge_page_size();
if (test_type == TEST_HUGETLB)
- page_size = default_huge_page_size();
+ // TEST_HUGETLB_HGM gets small pages.
+ page_size = hpage_size;
else
page_size = sysconf(_SC_PAGE_SIZE);

@@ -1658,19 +1692,26 @@ int main(int argc, char **argv)
nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
nr_pages_per_cpu = atol(argv[2]) * 1024*1024 / page_size /
nr_cpus;
+ if (test_type == TEST_HUGETLB_HGM)
+ /*
+ * `page_size` refers to the page_size we can use in
+ * UFFDIO_CONTINUE. We still need nr_pages to be appropriately
+ * aligned, so align it here.
+ */
+ nr_pages_per_cpu -= nr_pages_per_cpu % (hpage_size / page_size);
if (!nr_pages_per_cpu) {
_err("invalid MiB");
usage();
}
+ nr_pages = nr_pages_per_cpu * nr_cpus;

bounces = atoi(argv[3]);
if (bounces <= 0) {
_err("invalid bounces");
usage();
}
- nr_pages = nr_pages_per_cpu * nr_cpus;

- if (test_type == TEST_HUGETLB && map_shared) {
+ if (test_is_hugetlb() && map_shared) {
if (argc < 5)
usage();
huge_fd = open(argv[4], O_CREAT | O_RDWR, 0755);
--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-24 18:02:00

by James Houghton

Subject: [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates

When using HugeTLB high-granularity mapping, we need to go through the
supported hugepage sizes in decreasing order so that we pick the largest
size that works. Consider the case where we're faulting in a 1G hugepage
for the first time: we want hugetlb_fault/hugetlb_no_page to map it with
a PUD. By going through the sizes in decreasing order, we will find that
PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too.

Signed-off-by: James Houghton <[email protected]>
---
mm/hugetlb.c | 40 +++++++++++++++++++++++++++++++++++++---
1 file changed, 37 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a57e1be41401..5df838d86f32 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -33,6 +33,7 @@
#include <linux/migrate.h>
#include <linux/nospec.h>
#include <linux/delayacct.h>
+#include <linux/sort.h>

#include <asm/page.h>
#include <asm/pgalloc.h>
@@ -48,6 +49,10 @@

int hugetlb_max_hstate __read_mostly;
unsigned int default_hstate_idx;
+/*
+ * After hugetlb_init_hstates is called, hstates will be sorted from largest
+ * to smallest.
+ */
struct hstate hstates[HUGE_MAX_HSTATE];

#ifdef CONFIG_CMA
@@ -3144,14 +3149,43 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
kfree(node_alloc_noretry);
}

+static int compare_hstates_decreasing(const void *a, const void *b)
+{
+ const int shift_a = huge_page_shift((const struct hstate *)a);
+ const int shift_b = huge_page_shift((const struct hstate *)b);
+
+ if (shift_a < shift_b)
+ return 1;
+ if (shift_a > shift_b)
+ return -1;
+ return 0;
+}
+
+static void sort_hstates(void)
+{
+ unsigned long default_hstate_sz = huge_page_size(&default_hstate);
+
+ /* Sort from largest to smallest. */
+ sort(hstates, hugetlb_max_hstate, sizeof(*hstates),
+ compare_hstates_decreasing, NULL);
+
+ /*
+ * We may have changed the location of the default hstate, so we need to
+ * update it.
+ */
+ default_hstate_idx = hstate_index(size_to_hstate(default_hstate_sz));
+}
+
static void __init hugetlb_init_hstates(void)
{
struct hstate *h, *h2;

- for_each_hstate(h) {
- if (minimum_order > huge_page_order(h))
- minimum_order = huge_page_order(h);
+ sort_hstates();

+ /* The last hstate is now the smallest. */
+ minimum_order = huge_page_order(&hstates[hugetlb_max_hstate - 1]);
+
+ for_each_hstate(h) {
/* oversize hugepages were init'ed in early boot */
if (!hstate_is_gigantic(h))
hugetlb_hstate_alloc_pages(h);
--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-24 18:02:01

by James Houghton

Subject: [RFC PATCH 24/26] arm64/hugetlb: add support for high-granularity mappings

This is included in this RFC to demonstrate how an architecture that
doesn't use ARCH_WANT_GENERAL_HUGETLB can be updated to support HugeTLB
high-granularity mappings: an architecture just needs to implement
hugetlb_walk_to.

Signed-off-by: James Houghton <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/mm/hugetlbpage.c | 63 +++++++++++++++++++++++++++++++++++++
2 files changed, 64 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1652a9800ebe..74108713a99a 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -99,6 +99,7 @@ config ARM64
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+ select ARCH_HAS_SPECIAL_HUGETLB_HGM
select ARCH_WANT_LD_ORPHAN_WARN
select ARCH_WANTS_NO_INSTR
select ARCH_HAS_UBSAN_SANITIZE_ALL
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index e2a5ec9fdc0d..1901818bed9d 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -281,6 +281,69 @@ void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr,
set_pte(ptep, pte);
}

+int hugetlb_walk_to(struct mm_struct *mm, struct hugetlb_pte *hpte,
+ unsigned long addr, unsigned long sz, bool stop_at_none)
+{
+ pgd_t *pgdp;
+ p4d_t *p4dp;
+ pte_t *ptep;
+
+ if (!hpte->ptep) {
+ pgdp = pgd_offset(mm, addr);
+ p4dp = p4d_offset(pgdp, addr);
+ if (!p4dp)
+ return -ENOMEM;
+ hugetlb_pte_populate(hpte, (pte_t *)p4dp, P4D_SHIFT);
+ }
+
+ while (hugetlb_pte_size(hpte) > sz &&
+ !hugetlb_pte_present_leaf(hpte) &&
+ !(stop_at_none && hugetlb_pte_none(hpte))) {
+ if (hpte->shift == PMD_SHIFT) {
+ unsigned long rounded_addr = sz == CONT_PTE_SIZE
+ ? addr & CONT_PTE_MASK
+ : addr;
+
+ ptep = pte_offset_kernel((pmd_t *)hpte->ptep,
+ rounded_addr);
+ if (!ptep)
+ return -ENOMEM;
+ if (sz == CONT_PTE_SIZE)
+ hpte->shift = CONT_PTE_SHIFT;
+ else
+ hpte->shift = pte_cont(*ptep) ? CONT_PTE_SHIFT
+ : PAGE_SHIFT;
+ hpte->ptep = ptep;
+ } else if (hpte->shift == PUD_SHIFT) {
+ pud_t *pudp = (pud_t *)hpte->ptep;
+
+ ptep = (pte_t *)pmd_alloc(mm, pudp, addr);
+
+ if (!ptep)
+ return -ENOMEM;
+ if (sz == CONT_PMD_SIZE)
+ hpte->shift = CONT_PMD_SHIFT;
+ else
+ hpte->shift = pte_cont(*ptep) ? CONT_PMD_SHIFT
+ : PMD_SHIFT;
+ hpte->ptep = ptep;
+ } else if (hpte->shift == P4D_SHIFT) {
+ ptep = (pte_t *)pud_alloc(mm, (p4d_t *)hpte->ptep, addr);
+ if (!ptep)
+ return -ENOMEM;
+ hpte->shift = PUD_SHIFT;
+ hpte->ptep = ptep;
+ } else
+ /*
+ * This also catches the cases of CONT_PMD_SHIFT and
+ * CONT_PTE_SHIFT. Those PTEs should always be leaves.
+ */
+ BUG();
+ }
+
+ return 0;
+}
+
pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, unsigned long sz)
{
--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-24 18:02:55

by James Houghton

Subject: [RFC PATCH 16/26] hugetlb: make hugetlb_change_protection compatible with HGM

HugeTLB is now able to change the protection of hugepages that are
mapped at high granularity.

I need to add more of the HugeTLB PTE wrapper functions to clean up this
patch. I'll do this in the next version.

Signed-off-by: James Houghton <[email protected]>
---
mm/hugetlb.c | 91 +++++++++++++++++++++++++++++++++++-----------------
1 file changed, 62 insertions(+), 29 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 51fc1d3f122f..f9c7daa6c090 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6476,14 +6476,15 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
{
struct mm_struct *mm = vma->vm_mm;
unsigned long start = address;
- pte_t *ptep;
pte_t pte;
struct hstate *h = hstate_vma(vma);
- unsigned long pages = 0, psize = huge_page_size(h);
+ unsigned long base_pages = 0, psize = huge_page_size(h);
bool shared_pmd = false;
struct mmu_notifier_range range;
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+ struct hugetlb_pte hpte;
+ bool hgm_enabled = hugetlb_hgm_enabled(vma);

/*
* In the case of shared PMDs, the area to flush could be beyond
@@ -6499,28 +6500,38 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,

mmu_notifier_invalidate_range_start(&range);
i_mmap_lock_write(vma->vm_file->f_mapping);
- for (; address < end; address += psize) {
+ while (address < end) {
spinlock_t *ptl;
- ptep = huge_pte_offset(mm, address, psize);
- if (!ptep)
+ pte_t *ptep = huge_pte_offset(mm, address, huge_page_size(h));
+
+ if (!ptep) {
+ address += huge_page_size(h);
continue;
- ptl = huge_pte_lock(h, mm, ptep);
- if (huge_pmd_unshare(mm, vma, &address, ptep)) {
+ }
+ hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h));
+ if (hgm_enabled) {
+ int ret = hugetlb_walk_to(mm, &hpte, address, PAGE_SIZE,
+ /*stop_at_none=*/true);
+ BUG_ON(ret);
+ }
+
+ ptl = hugetlb_pte_lock(mm, &hpte);
+ if (huge_pmd_unshare(mm, vma, &address, hpte.ptep)) {
/*
* When uffd-wp is enabled on the vma, unshare
* shouldn't happen at all. Warn about it if it
* happened due to some reason.
*/
WARN_ON_ONCE(uffd_wp || uffd_wp_resolve);
- pages++;
+ base_pages += hugetlb_pte_size(&hpte) / PAGE_SIZE;
spin_unlock(ptl);
shared_pmd = true;
- continue;
+ goto next_hpte;
}
- pte = huge_ptep_get(ptep);
+ pte = hugetlb_ptep_get(&hpte);
if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
spin_unlock(ptl);
- continue;
+ goto next_hpte;
}
if (unlikely(is_hugetlb_entry_migration(pte))) {
swp_entry_t entry = pte_to_swp_entry(pte);
@@ -6540,12 +6551,13 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
newpte = pte_swp_mkuffd_wp(newpte);
else if (uffd_wp_resolve)
newpte = pte_swp_clear_uffd_wp(newpte);
- set_huge_swap_pte_at(mm, address, ptep,
- newpte, psize);
- pages++;
+ set_huge_swap_pte_at(mm, address, hpte.ptep,
+ newpte,
+ hugetlb_pte_size(&hpte));
+ base_pages += hugetlb_pte_size(&hpte) / PAGE_SIZE;
}
spin_unlock(ptl);
- continue;
+ goto next_hpte;
}
if (unlikely(pte_marker_uffd_wp(pte))) {
/*
@@ -6553,21 +6565,40 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
* no need for huge_ptep_modify_prot_start/commit().
*/
if (uffd_wp_resolve)
- huge_pte_clear(mm, address, ptep, psize);
+ huge_pte_clear(mm, address, hpte.ptep, psize);
}
- if (!huge_pte_none(pte)) {
+ if (!hugetlb_pte_none(&hpte)) {
pte_t old_pte;
- unsigned int shift = huge_page_shift(hstate_vma(vma));
-
- old_pte = huge_ptep_modify_prot_start(vma, address, ptep);
- pte = huge_pte_modify(old_pte, newprot);
- pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
- if (uffd_wp)
- pte = huge_pte_mkuffd_wp(huge_pte_wrprotect(pte));
- else if (uffd_wp_resolve)
- pte = huge_pte_clear_uffd_wp(pte);
- huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
- pages++;
+ unsigned int shift = hpte.shift;
+ /*
+ * This is ugly. This will be cleaned up in a future
+ * version of this series.
+ */
+ if (shift > PAGE_SHIFT) {
+ old_pte = huge_ptep_modify_prot_start(
+ vma, address, hpte.ptep);
+ pte = huge_pte_modify(old_pte, newprot);
+ pte = arch_make_huge_pte(
+ pte, shift, vma->vm_flags);
+ if (uffd_wp)
+ pte = huge_pte_mkuffd_wp(huge_pte_wrprotect(pte));
+ else if (uffd_wp_resolve)
+ pte = huge_pte_clear_uffd_wp(pte);
+ huge_ptep_modify_prot_commit(
+ vma, address, hpte.ptep,
+ old_pte, pte);
+ } else {
+ old_pte = ptep_modify_prot_start(
+ vma, address, hpte.ptep);
+ pte = pte_modify(old_pte, newprot);
+ if (uffd_wp)
+ pte = pte_mkuffd_wp(pte_wrprotect(pte));
+ else if (uffd_wp_resolve)
+ pte = pte_clear_uffd_wp(pte);
+ ptep_modify_prot_commit(
+ vma, address, hpte.ptep, old_pte, pte);
+ }
+ base_pages += hugetlb_pte_size(&hpte) / PAGE_SIZE;
} else {
/* None pte */
if (unlikely(uffd_wp))
@@ -6576,6 +6607,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
make_pte_marker(PTE_MARKER_UFFD_WP));
}
spin_unlock(ptl);
+next_hpte:
+ address += hugetlb_pte_size(&hpte);
}
/*
* Must flush TLB before releasing i_mmap_rwsem: x86's huge_pmd_unshare
@@ -6597,7 +6630,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
i_mmap_unlock_write(vma->vm_file->f_mapping);
mmu_notifier_invalidate_range_end(&range);

- return pages << h->order;
+ return base_pages;
}

/* Return true if reservation was successful, false otherwise. */
--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-24 18:03:11

by James Houghton

Subject: [RFC PATCH 11/26] hugetlb: add hugetlb_walk_to to do PT walks

This adds hugetlb_walk_to for architectures that use GENERAL_HUGETLB,
including x86.
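
A typical caller (mirroring how the later patches use it) seeds a
hugetlb_pte at the hstate level and then walks down; a rough sketch with
simplified error handling:

static int walk_example(struct mm_struct *mm, struct vm_area_struct *vma,
			unsigned long addr, struct hugetlb_pte *hpte)
{
	struct hstate *h = hstate_vma(vma);
	pte_t *ptep = huge_pte_offset(mm, addr, huge_page_size(h));

	if (!ptep)
		return -ENOENT;
	/* Start at the hstate-level entry... */
	hugetlb_pte_populate(hpte, ptep, huge_page_shift(h));
	/* ...and walk down to the entry that covers addr. */
	return hugetlb_walk_to(mm, hpte, addr, PAGE_SIZE,
			       /*stop_at_none=*/true);
}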

Signed-off-by: James Houghton <[email protected]>
---
include/linux/hugetlb.h | 2 ++
mm/hugetlb.c | 45 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 47 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e7a6b944d0cc..605aa19d8572 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -258,6 +258,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, unsigned long sz);
pte_t *huge_pte_offset(struct mm_struct *mm,
unsigned long addr, unsigned long sz);
+int hugetlb_walk_to(struct mm_struct *mm, struct hugetlb_pte *hpte,
+ unsigned long addr, unsigned long sz, bool stop_at_none);
int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long *addr, pte_t *ptep);
void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 557b0afdb503..3ec2a921ee6f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6981,6 +6981,51 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
return (pte_t *)pmd;
}

+int hugetlb_walk_to(struct mm_struct *mm, struct hugetlb_pte *hpte,
+ unsigned long addr, unsigned long sz, bool stop_at_none)
+{
+ pte_t *ptep;
+
+ if (!hpte->ptep) {
+ pgd_t *pgd = pgd_offset(mm, addr);
+
+ if (!pgd)
+ return -ENOMEM;
+ ptep = (pte_t *)p4d_alloc(mm, pgd, addr);
+ if (!ptep)
+ return -ENOMEM;
+ hugetlb_pte_populate(hpte, ptep, P4D_SHIFT);
+ }
+
+ while (hugetlb_pte_size(hpte) > sz &&
+ !hugetlb_pte_present_leaf(hpte) &&
+ !(stop_at_none && hugetlb_pte_none(hpte))) {
+ if (hpte->shift == PMD_SHIFT) {
+ ptep = pte_alloc_map(mm, (pmd_t *)hpte->ptep, addr);
+ if (!ptep)
+ return -ENOMEM;
+ hpte->shift = PAGE_SHIFT;
+ hpte->ptep = ptep;
+ } else if (hpte->shift == PUD_SHIFT) {
+ ptep = (pte_t *)pmd_alloc(mm, (pud_t *)hpte->ptep,
+ addr);
+ if (!ptep)
+ return -ENOMEM;
+ hpte->shift = PMD_SHIFT;
+ hpte->ptep = ptep;
+ } else if (hpte->shift == P4D_SHIFT) {
+ ptep = (pte_t *)pud_alloc(mm, (p4d_t *)hpte->ptep,
+ addr);
+ if (!ptep)
+ return -ENOMEM;
+ hpte->shift = PUD_SHIFT;
+ hpte->ptep = ptep;
+ } else
+ BUG();
+ }
+ return 0;
+}
+
#endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */

#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-24 18:03:44

by James Houghton

Subject: [RFC PATCH 03/26] hugetlb: add make_huge_pte_with_shift

This allows us to make huge PTEs at shifts other than the hstate shift,
which will be necessary for high-granularity mappings.
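
For example (a sketch based on how the later splitting and collapse
patches use it; hugetlb_find_subpage is added later in the series), this
is how a single 4K subpage of a hugepage gets mapped:

static void map_one_subpage(struct vm_area_struct *vma, struct mm_struct *mm,
			    struct hstate *h, struct page *hpage,
			    struct hugetlb_pte *hpte, unsigned long addr,
			    bool writable)
{
	/* Pick the 4K subpage of hpage that covers addr. */
	struct page *subpage = hugetlb_find_subpage(h, hpage, addr);
	pte_t entry = make_huge_pte_with_shift(vma, subpage, writable,
					       PAGE_SHIFT);

	set_huge_pte_at(mm, addr, hpte->ptep, entry);
}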

Signed-off-by: James Houghton <[email protected]>
---
mm/hugetlb.c | 33 ++++++++++++++++++++-------------
1 file changed, 20 insertions(+), 13 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5df838d86f32..0eec34edf3b2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4686,23 +4686,30 @@ const struct vm_operations_struct hugetlb_vm_ops = {
.pagesize = hugetlb_vm_op_pagesize,
};

+static pte_t make_huge_pte_with_shift(struct vm_area_struct *vma,
+ struct page *page, int writable,
+ int shift)
+{
+ bool huge = shift > PAGE_SHIFT;
+ pte_t entry = huge ? mk_huge_pte(page, vma->vm_page_prot)
+ : mk_pte(page, vma->vm_page_prot);
+
+ if (writable)
+ entry = huge ? huge_pte_mkwrite(entry) : pte_mkwrite(entry);
+ else
+ entry = huge ? huge_pte_wrprotect(entry) : pte_wrprotect(entry);
+ pte_mkyoung(entry);
+ if (huge)
+ entry = arch_make_huge_pte(entry, shift, vma->vm_flags);
+ return entry;
+}
+
static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
- int writable)
+ int writable)
{
- pte_t entry;
unsigned int shift = huge_page_shift(hstate_vma(vma));

- if (writable) {
- entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_huge_pte(page,
- vma->vm_page_prot)));
- } else {
- entry = huge_pte_wrprotect(mk_huge_pte(page,
- vma->vm_page_prot));
- }
- entry = pte_mkyoung(entry);
- entry = arch_make_huge_pte(entry, shift, vma->vm_flags);
-
- return entry;
+ return make_huge_pte_with_shift(vma, page, writable, shift);
}

static void set_huge_ptep_writable(struct vm_area_struct *vma,
--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-24 18:03:44

by James Houghton

Subject: [RFC PATCH 18/26] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

Although this change is large, it is somewhat straightforward. Before,
all users of walk_hugetlb_range could get the size of the PTE just by
checking the hmask or the mm_walk struct. With HGM, that information is
held in the hugetlb_pte struct, so we provide that instead of the raw
pte_t*.
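
For callback implementers the conversion is mechanical; roughly (a made-up
callback for illustration, not taken from the diff below):

/* Before: the PTE size came from hmask. */
static int count_mapped_old(pte_t *pte, unsigned long hmask,
			    unsigned long addr, unsigned long end,
			    struct mm_walk *walk)
{
	unsigned long *bytes = walk->private;

	if (pte_present(huge_ptep_get(pte)))
		*bytes += ~hmask + 1;
	return 0;
}

/* After: the size (and mask, if needed) come from the hugetlb_pte. */
static int count_mapped_new(struct hugetlb_pte *hpte, unsigned long addr,
			    unsigned long end, struct mm_walk *walk)
{
	unsigned long *bytes = walk->private;

	if (hugetlb_pte_present_leaf(hpte))
		*bytes += hugetlb_pte_size(hpte);
	return 0;
}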

Signed-off-by: James Houghton <[email protected]>
---
arch/s390/mm/gmap.c | 8 ++++++--
fs/proc/task_mmu.c | 35 +++++++++++++++++++----------------
include/linux/pagewalk.h | 3 ++-
mm/damon/vaddr.c | 34 ++++++++++++++++++----------------
mm/hmm.c | 7 ++++---
mm/mempolicy.c | 11 ++++++++---
mm/mincore.c | 4 ++--
mm/mprotect.c | 6 +++---
mm/pagewalk.c | 18 ++++++++++++++++--
9 files changed, 78 insertions(+), 48 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index b8ae4a4aa2ba..518cebfd72cd 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2620,10 +2620,14 @@ static int __s390_enable_skey_pmd(pmd_t *pmd, unsigned long addr,
return 0;
}

-static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
- unsigned long hmask, unsigned long next,
+static int __s390_enable_skey_hugetlb(struct hugetlb_pte *hpte,
+ unsigned long addr, unsigned long next,
struct mm_walk *walk)
{
+ if (!hugetlb_pte_present_leaf(hpte) ||
+ hugetlb_pte_size(hpte) != PMD_SIZE)
+ return -EINVAL;
+
pmd_t *pmd = (pmd_t *)pte;
unsigned long start, end;
struct page *page = pmd_page(*pmd);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2d04e3470d4c..b2d683f99fa9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -714,18 +714,19 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
}

#ifdef CONFIG_HUGETLB_PAGE
-static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
+static int smaps_hugetlb_range(struct hugetlb_pte *hpte,
unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
struct mem_size_stats *mss = walk->private;
struct vm_area_struct *vma = walk->vma;
struct page *page = NULL;
+ pte_t pte = hugetlb_ptep_get(hpte);

- if (pte_present(*pte)) {
- page = vm_normal_page(vma, addr, *pte);
- } else if (is_swap_pte(*pte)) {
- swp_entry_t swpent = pte_to_swp_entry(*pte);
+ if (hugetlb_pte_present_leaf(hpte)) {
+ page = vm_normal_page(vma, addr, pte);
+ } else if (is_swap_pte(pte)) {
+ swp_entry_t swpent = pte_to_swp_entry(pte);

if (is_pfn_swap_entry(swpent))
page = pfn_swap_entry_to_page(swpent);
@@ -734,9 +735,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
int mapcount = page_mapcount(page);

if (mapcount >= 2)
- mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
+ mss->shared_hugetlb += hugetlb_pte_size(hpte);
else
- mss->private_hugetlb += huge_page_size(hstate_vma(vma));
+ mss->private_hugetlb += hugetlb_pte_size(hpte);
}
return 0;
}
@@ -1535,7 +1536,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,

#ifdef CONFIG_HUGETLB_PAGE
/* This function walks within one hugetlb entry in the single call */
-static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
+static int pagemap_hugetlb_range(struct hugetlb_pte *hpte,
unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
@@ -1543,13 +1544,13 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
struct vm_area_struct *vma = walk->vma;
u64 flags = 0, frame = 0;
int err = 0;
- pte_t pte;
+ unsigned long hmask = hugetlb_pte_mask(hpte);

if (vma->vm_flags & VM_SOFTDIRTY)
flags |= PM_SOFT_DIRTY;

- pte = huge_ptep_get(ptep);
- if (pte_present(pte)) {
+ if (hugetlb_pte_present_leaf(hpte)) {
+ pte_t pte = hugetlb_ptep_get(hpte);
struct page *page = pte_page(pte);

if (!PageAnon(page))
@@ -1565,7 +1566,7 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
if (pm->show_pfn)
frame = pte_pfn(pte) +
((addr & ~hmask) >> PAGE_SHIFT);
- } else if (pte_swp_uffd_wp_any(pte)) {
+ } else if (pte_swp_uffd_wp_any(hugetlb_ptep_get(hpte))) {
flags |= PM_UFFD_WP;
}

@@ -1869,17 +1870,19 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
return 0;
}
#ifdef CONFIG_HUGETLB_PAGE
-static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long end, struct mm_walk *walk)
+static int gather_hugetlb_stats(struct hugetlb_pte *hpte, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
{
- pte_t huge_pte = huge_ptep_get(pte);
+ pte_t huge_pte = hugetlb_ptep_get(hpte);
struct numa_maps *md;
struct page *page;

- if (!pte_present(huge_pte))
+ if (!hugetlb_pte_present_leaf(hpte))
return 0;

page = pte_page(huge_pte);
+ if (page != compound_head(page))
+ return 0;

md = walk->private;
gather_stats(page, md, pte_dirty(huge_pte), 1);
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index ac7b38ad5903..0d21e25df37f 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -3,6 +3,7 @@
#define _LINUX_PAGEWALK_H

#include <linux/mm.h>
+#include <linux/hugetlb.h>

struct mm_walk;

@@ -47,7 +48,7 @@ struct mm_walk_ops {
unsigned long next, struct mm_walk *walk);
int (*pte_hole)(unsigned long addr, unsigned long next,
int depth, struct mm_walk *walk);
- int (*hugetlb_entry)(pte_t *pte, unsigned long hmask,
+ int (*hugetlb_entry)(struct hugetlb_pte *hpte,
unsigned long addr, unsigned long next,
struct mm_walk *walk);
int (*test_walk)(unsigned long addr, unsigned long next,
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 59e1653799f8..ce50b937dcf2 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -324,14 +324,15 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
}

#ifdef CONFIG_HUGETLB_PAGE
-static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
+static void damon_hugetlb_mkold(struct hugetlb_pte *hpte, struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long addr)
{
bool referenced = false;
pte_t entry = huge_ptep_get(pte);
struct page *page = pte_page(entry);
+ struct page *hpage = compound_head(page);

- get_page(page);
+ get_page(hpage);

if (pte_young(entry)) {
referenced = true;
@@ -342,18 +343,18 @@ static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,

#ifdef CONFIG_MMU_NOTIFIER
if (mmu_notifier_clear_young(mm, addr,
- addr + huge_page_size(hstate_vma(vma))))
+ addr + hugetlb_pte_size(hpte)))
referenced = true;
#endif /* CONFIG_MMU_NOTIFIER */

if (referenced)
- set_page_young(page);
+ set_page_young(hpage);

- set_page_idle(page);
- put_page(page);
+ set_page_idle(hpage);
+ put_page(hpage);
}

-static int damon_mkold_hugetlb_entry(pte_t *pte, unsigned long hmask,
+static int damon_mkold_hugetlb_entry(struct hugetlb_pte *hpte,
unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
@@ -361,12 +362,12 @@ static int damon_mkold_hugetlb_entry(pte_t *pte, unsigned long hmask,
spinlock_t *ptl;
pte_t entry;

- ptl = huge_pte_lock(h, walk->mm, pte);
- entry = huge_ptep_get(pte);
+ ptl = huge_pte_lock_shift(hpte->shift, walk->mm, hpte->ptep);
+ entry = huge_ptep_get(hpte->ptep);
if (!pte_present(entry))
goto out;

- damon_hugetlb_mkold(pte, walk->mm, walk->vma, addr);
+ damon_hugetlb_mkold(hpte, walk->mm, walk->vma, addr);

out:
spin_unlock(ptl);
@@ -474,31 +475,32 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
}

#ifdef CONFIG_HUGETLB_PAGE
-static int damon_young_hugetlb_entry(pte_t *pte, unsigned long hmask,
+static int damon_young_hugetlb_entry(struct hugetlb_pte *hpte,
unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
struct damon_young_walk_private *priv = walk->private;
struct hstate *h = hstate_vma(walk->vma);
- struct page *page;
+ struct page *page, *hpage;
spinlock_t *ptl;
pte_t entry;

- ptl = huge_pte_lock(h, walk->mm, pte);
+ ptl = huge_pte_lock_shift(hpte->shift, walk->mm, hpte->ptep);
entry = huge_ptep_get(pte);
if (!pte_present(entry))
goto out;

page = pte_page(entry);
- get_page(page);
+ hpage = compound_head(page);
+ get_page(hpage);

- if (pte_young(entry) || !page_is_idle(page) ||
+ if (pte_young(entry) || !page_is_idle(hpage) ||
mmu_notifier_test_young(walk->mm, addr)) {
*priv->page_sz = huge_page_size(h);
priv->young = true;
}

- put_page(page);
+ put_page(hpage);

out:
spin_unlock(ptl);
diff --git a/mm/hmm.c b/mm/hmm.c
index 3fd3242c5e50..1ad5d76fa8be 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -472,7 +472,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
#endif

#ifdef CONFIG_HUGETLB_PAGE
-static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
+static int hmm_vma_walk_hugetlb_entry(struct hugetlb_pte *hpte,
unsigned long start, unsigned long end,
struct mm_walk *walk)
{
@@ -483,11 +483,12 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
unsigned int required_fault;
unsigned long pfn_req_flags;
unsigned long cpu_flags;
+ unsigned long hmask = hugetlb_pte_mask(hpte);
spinlock_t *ptl;
pte_t entry;

- ptl = huge_pte_lock(hstate_vma(vma), walk->mm, pte);
- entry = huge_ptep_get(pte);
+ ptl = huge_pte_lock_shift(hpte->shift, walk->mm, hpte->ptep);
+ entry = huge_ptep_get(hpte->ptep);

i = (start - range->start) >> PAGE_SHIFT;
pfn_req_flags = range->hmm_pfns[i];
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d39b01fd52fe..a1d82db7c19f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -559,7 +559,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
return addr != end ? -EIO : 0;
}

-static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
+static int queue_pages_hugetlb(struct hugetlb_pte *hpte,
unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
@@ -571,8 +571,13 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
spinlock_t *ptl;
pte_t entry;

- ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
- entry = huge_ptep_get(pte);
+ /* We don't migrate high-granularity HugeTLB mappings for now. */
+ if (hugetlb_pte_size(hpte) !=
+ huge_page_size(hstate_vma(walk->vma)))
+ return -EINVAL;
+
+ ptl = hugetlb_pte_lock(walk->mm, hpte);
+ entry = hugetlb_ptep_get(hpte);
if (!pte_present(entry))
goto unlock;
page = pte_page(entry);
diff --git a/mm/mincore.c b/mm/mincore.c
index fa200c14185f..dc1717dc6a2c 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -22,7 +22,7 @@
#include <linux/uaccess.h>
#include "swap.h"

-static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
+static int mincore_hugetlb(struct hugetlb_pte *hpte, unsigned long addr,
unsigned long end, struct mm_walk *walk)
{
#ifdef CONFIG_HUGETLB_PAGE
@@ -33,7 +33,7 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
* Hugepages under user process are always in RAM and never
* swapped out, but theoretically it needs to be checked.
*/
- present = pte && !huge_pte_none(huge_ptep_get(pte));
+ present = hpte->ptep && !hugetlb_pte_none(hpte);
for (; addr != end; vec++, addr += PAGE_SIZE)
*vec = present;
walk->private = vec;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ba5592655ee3..9c5a35a1c0eb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -476,12 +476,12 @@ static int prot_none_pte_entry(pte_t *pte, unsigned long addr,
0 : -EACCES;
}

-static int prot_none_hugetlb_entry(pte_t *pte, unsigned long hmask,
+static int prot_none_hugetlb_entry(struct hugetlb_pte *hpte,
unsigned long addr, unsigned long next,
struct mm_walk *walk)
{
- return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
- 0 : -EACCES;
+ return pfn_modify_allowed(pte_pfn(*hpte->ptep),
+ *(pgprot_t *)(walk->private)) ? 0 : -EACCES;
}

static int prot_none_test(unsigned long addr, unsigned long next,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 9b3db11a4d1d..f8e24a0a0179 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -3,6 +3,7 @@
#include <linux/highmem.h>
#include <linux/sched.h>
#include <linux/hugetlb.h>
+#include <linux/minmax.h>

/*
* We want to know the real level where a entry is located ignoring any
@@ -301,13 +302,26 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
pte_t *pte;
const struct mm_walk_ops *ops = walk->ops;
int err = 0;
+ struct hugetlb_pte hpte;

do {
- next = hugetlb_entry_end(h, addr, end);
pte = huge_pte_offset(walk->mm, addr & hmask, sz);
+ if (!pte) {
+ next = hugetlb_entry_end(h, addr, end);
+ } else {
+ hugetlb_pte_populate(&hpte, pte, huge_page_shift(h));
+ if (hugetlb_hgm_enabled(vma)) {
+ err = hugetlb_walk_to(walk->mm, &hpte, addr,
+ PAGE_SIZE,
+ /*stop_at_none=*/true);
+ if (err)
+ break;
+ }
+ next = min(addr + hugetlb_pte_size(&hpte), end);
+ }

if (pte)
- err = ops->hugetlb_entry(pte, hmask, addr, next, walk);
+ err = ops->hugetlb_entry(&hpte, addr, next, walk);
else if (ops->pte_hole)
err = ops->pte_hole(addr, next, -1, walk);

--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-24 18:04:17

by James Houghton

Subject: [RFC PATCH 10/26] hugetlb: add for_each_hgm_shift

This is a helper macro to loop through all the usable page sizes for a
high-granularity-enabled HugeTLB VMA. Given the VMA's hstate, it will
loop, in descending order, through the page sizes that HugeTLB supports
for this architecture; it always includes PAGE_SIZE.
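
The intended usage pattern, as the later splitting and collapse patches
use it, is roughly (a sketch; largest_usable_shift is made up for
illustration):

/*
 * Find the largest HGM-usable size that is aligned at curr and does not
 * run past end. h is the VMA's hstate.
 */
static unsigned int largest_usable_shift(struct hstate *h,
					 unsigned long curr, unsigned long end)
{
	struct hstate *tmp_h;
	unsigned int shift;

	for_each_hgm_shift(h, tmp_h, shift) {
		unsigned long sz = 1UL << shift;

		if (IS_ALIGNED(curr, sz) && curr + sz <= end)
			return shift;
	}
	/*
	 * The loop always ends with PAGE_SHIFT, so we only get here if
	 * curr/end are not page-aligned.
	 */
	return PAGE_SHIFT;
}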

Signed-off-by: James Houghton <[email protected]>
---
mm/hugetlb.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8b10b941458d..557b0afdb503 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6989,6 +6989,16 @@ bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
/* All shared VMAs have HGM enabled. */
return vma->vm_flags & VM_SHARED;
}
+static unsigned int __shift_for_hstate(struct hstate *h)
+{
+ if (h >= &hstates[hugetlb_max_hstate])
+ return PAGE_SHIFT;
+ return huge_page_shift(h);
+}
+#define for_each_hgm_shift(hstate, tmp_h, shift) \
+ for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
+ (tmp_h) <= &hstates[hugetlb_max_hstate]; \
+ (tmp_h)++)
#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */

/*
--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-24 18:04:50

by James Houghton

Subject: [RFC PATCH 12/26] hugetlb: add HugeTLB splitting functionality

The new function, hugetlb_split_to_shift, will optimally split the page
table to map a particular address at a particular granularity.

This is useful for punching a hole in the mapping and for mapping small
sections of a HugeTLB page (via UFFDIO_CONTINUE, for example).
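
A rough sketch of a caller (locking per the cover letter: splitting
requires i_mmap_rwsem held for writing; error handling simplified):

static int split_to_4k(struct mm_struct *mm, struct vm_area_struct *vma,
		       unsigned long addr)
{
	struct hstate *h = hstate_vma(vma);
	struct hugetlb_pte hpte;
	pte_t *ptep;
	int ret;

	i_mmap_lock_write(vma->vm_file->f_mapping);
	ptep = huge_pte_offset(mm, addr, huge_page_size(h));
	if (!ptep) {
		ret = -ENOENT;
		goto out;
	}
	hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h));
	ret = hugetlb_split_to_shift(mm, vma, &hpte, addr, PAGE_SHIFT);
out:
	i_mmap_unlock_write(vma->vm_file->f_mapping);
	return ret;
}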

Signed-off-by: James Houghton <[email protected]>
---
mm/hugetlb.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 122 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3ec2a921ee6f..eaffe7b4f67c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -102,6 +102,18 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
/* Forward declaration */
static int hugetlb_acct_memory(struct hstate *h, long delta);

+/*
+ * Find the subpage that corresponds to `addr` in `hpage`.
+ */
+static struct page *hugetlb_find_subpage(struct hstate *h, struct page *hpage,
+ unsigned long addr)
+{
+ size_t idx = (addr & ~huge_page_mask(h))/PAGE_SIZE;
+
+ BUG_ON(idx >= pages_per_huge_page(h));
+ return &hpage[idx];
+}
+
static inline bool subpool_is_free(struct hugepage_subpool *spool)
{
if (spool->count)
@@ -7044,6 +7056,116 @@ static unsigned int __shift_for_hstate(struct hstate *h)
for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
(tmp_h) <= &hstates[hugetlb_max_hstate]; \
(tmp_h)++)
+
+/*
+ * Given a particular address, split the HugeTLB PTE that currently maps it
+ * so that, for the given address, the PTE that maps it is `desired_shift`.
+ * This function will always split the HugeTLB PTE optimally.
+ *
+ * For example, given a HugeTLB 1G page that is mapped from VA 0 to 1G. If we
+ * call this function with addr=0 and desired_shift=PAGE_SHIFT, will result in
+ * these changes to the page table:
+ * 1. The PUD will be split into 2M PMDs.
+ * 2. The first PMD will be split again into 4K PTEs.
+ */
+static int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
+ const struct hugetlb_pte *hpte,
+ unsigned long addr, unsigned long desired_shift)
+{
+ unsigned long start, end, curr;
+ unsigned long desired_sz = 1UL << desired_shift;
+ struct hstate *h = hstate_vma(vma);
+ int ret;
+ struct hugetlb_pte new_hpte;
+ struct mmu_notifier_range range;
+ struct page *hpage = NULL;
+ struct page *subpage;
+ pte_t old_entry;
+ struct mmu_gather tlb;
+
+ BUG_ON(!hpte->ptep);
+ BUG_ON(hugetlb_pte_size(hpte) == desired_sz);
+
+ start = addr & hugetlb_pte_mask(hpte);
+ end = start + hugetlb_pte_size(hpte);
+
+ i_mmap_assert_write_locked(vma->vm_file->f_mapping);
+
+ BUG_ON(!hpte->ptep);
+ /* This function only works if we are looking at a leaf-level PTE. */
+ BUG_ON(!hugetlb_pte_none(hpte) && !hugetlb_pte_present_leaf(hpte));
+
+ /*
+ * Clear the PTE so that we will allocate the PT structures when
+ * walking the page table.
+ */
+ old_entry = huge_ptep_get_and_clear(mm, start, hpte->ptep);
+
+ if (!huge_pte_none(old_entry))
+ hpage = pte_page(old_entry);
+
+ BUG_ON(!IS_ALIGNED(start, desired_sz));
+ BUG_ON(!IS_ALIGNED(end, desired_sz));
+
+ for (curr = start; curr < end;) {
+ struct hstate *tmp_h;
+ unsigned int shift;
+
+ for_each_hgm_shift(h, tmp_h, shift) {
+ unsigned long sz = 1UL << shift;
+
+ if (!IS_ALIGNED(curr, sz) || curr + sz > end)
+ continue;
+ /*
+ * If this PTE would include `addr`, we need to make sure we split
+ * all the way down to the desired size; go to a smaller size if
+ * this one is still too large.
+ */
+ if (curr <= addr && curr + sz > addr &&
+ shift > desired_shift)
+ continue;
+
+ /*
+ * Continue the page table walk to the level we want,
+ * allocate PT structures as we go.
+ */
+ hugetlb_pte_copy(&new_hpte, hpte);
+ ret = hugetlb_walk_to(mm, &new_hpte, curr, sz,
+ /*stop_at_none=*/false);
+ if (ret)
+ goto err;
+ BUG_ON(hugetlb_pte_size(&new_hpte) != sz);
+ if (hpage) {
+ pte_t new_entry;
+
+ subpage = hugetlb_find_subpage(h, hpage, curr);
+ new_entry = make_huge_pte_with_shift(vma, subpage,
+ huge_pte_write(old_entry),
+ shift);
+ set_huge_pte_at(mm, curr, new_hpte.ptep, new_entry);
+ }
+ curr += sz;
+ goto next;
+ }
+ /* We couldn't find a size that worked. */
+ BUG();
+next:
+ continue;
+ }
+
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
+ start, end);
+ mmu_notifier_invalidate_range_start(&range);
+ return 0;
+err:
+ tlb_gather_mmu(&tlb, mm);
+ /* Free any newly allocated page table entries. */
+ hugetlb_free_range(&tlb, hpte, start, curr);
+ /* Restore the old entry. */
+ set_huge_pte_at(mm, start, hpte->ptep, old_entry);
+ tlb_finish_mmu(&tlb);
+ return ret;
+}
#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */

/*
--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-24 18:05:18

by James Houghton

Subject: [RFC PATCH 21/26] hugetlb: add hugetlb_collapse

This is what implements MADV_COLLAPSE for HugeTLB pages. This is a
necessary extension to the UFFDIO_CONTINUE changes. When userspace
finishes mapping an entire hugepage with UFFDIO_CONTINUE, the kernel has
no mechanism to automatically collapse the page table to map the whole
hugepage normally. We require userspace to inform us that they would
like the hugepages to be collapsed; they do this with MADV_COLLAPSE.

If userspace has mapped only some of a hugepage with UFFDIO_CONTINUE
rather than all of it, hugetlb_collapse will cause the requested range to
be mapped as if it had already been UFFDIO_CONTINUE'd.
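
From userspace, the expected call once every chunk of interest has been
UFFDIO_CONTINUE'd is just an madvise() of the hugepage-aligned range
(MADV_COLLAPSE is the value being introduced for THPs and co-opted here):

#include <sys/mman.h>

/* Sketch: collapse a fully (or partially) CONTINUE'd hugepage range. */
static int collapse_range(void *addr, size_t len)
{
	return madvise(addr, len, MADV_COLLAPSE);
}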

Signed-off-by: James Houghton <[email protected]>
---
include/linux/hugetlb.h | 7 ++++
mm/hugetlb.c | 88 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 95 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c207b1ac6195..438057dc3b75 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1197,6 +1197,8 @@ int huge_pte_alloc_high_granularity(struct hugetlb_pte *hpte,
unsigned int desired_sz,
enum split_mode mode,
bool write_locked);
+int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end);
#else
static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
{
@@ -1221,6 +1223,11 @@ static inline int huge_pte_alloc_high_granularity(struct hugetlb_pte *hpte,
{
return -EINVAL;
}
+static inline int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ return -EINVAL;
+}
#endif

static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 09fa57599233..70bb3a1342d9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7280,6 +7280,94 @@ int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
return -EINVAL;
}

+/*
+ * Collapse the address range from @start to @end to be mapped optimally.
+ *
+ * This is only valid for shared mappings. The main use case for this function
+ * is following UFFDIO_CONTINUE. If a user UFFDIO_CONTINUEs an entire hugepage
+ * by calling UFFDIO_CONTINUE once for each 4K region, the kernel doesn't know
+ * to collapse the mapping after the final UFFDIO_CONTINUE. Instead, we leave
+ * it up to userspace to tell us to do so, via MADV_COLLAPSE.
+ *
+ * Any holes in the mapping will be filled. If there is no page in the
+ * pagecache for a region we're collapsing, the PTEs will be cleared.
+ */
+int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
+{
+ struct hstate *h = hstate_vma(vma);
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ struct mmu_notifier_range range;
+ struct mmu_gather tlb;
+ struct hstate *tmp_h;
+ unsigned int shift;
+ unsigned long curr = start;
+ int ret = 0;
+ struct page *hpage, *subpage;
+ pgoff_t idx;
+ bool writable = vma->vm_flags & VM_WRITE;
+ bool shared = vma->vm_flags & VM_SHARED;
+ pte_t entry;
+
+ /*
+ * This is only supported for shared VMAs, because we need to look up
+ * the page to use for any PTEs we end up creating.
+ */
+ if (!shared)
+ return -EINVAL;
+
+ i_mmap_assert_write_locked(mapping);
+
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
+ start, end);
+ mmu_notifier_invalidate_range_start(&range);
+ tlb_gather_mmu(&tlb, mm);
+
+ while (curr < end) {
+ for_each_hgm_shift(h, tmp_h, shift) {
+ unsigned long sz = 1UL << shift;
+ struct hugetlb_pte hpte;
+
+ if (!IS_ALIGNED(curr, sz) || curr + sz > end)
+ continue;
+
+ hugetlb_pte_init(&hpte);
+ ret = hugetlb_walk_to(mm, &hpte, curr, sz,
+ /*stop_at_none=*/false);
+ if (ret)
+ goto out;
+ if (hugetlb_pte_size(&hpte) >= sz)
+ goto hpte_finished;
+
+ idx = vma_hugecache_offset(h, vma, curr);
+ hpage = find_lock_page(mapping, idx);
+ hugetlb_free_range(&tlb, &hpte, curr,
+ curr + hugetlb_pte_size(&hpte));
+ if (!hpage) {
+ hugetlb_pte_clear(mm, &hpte, curr);
+ goto hpte_finished;
+ }
+
+ subpage = hugetlb_find_subpage(h, hpage, curr);
+ entry = make_huge_pte_with_shift(vma, subpage,
+ writable, shift);
+ set_huge_pte_at(mm, curr, hpte.ptep, entry);
+ unlock_page(hpage);
+hpte_finished:
+ curr += hugetlb_pte_size(&hpte);
+ goto next;
+ }
+ ret = -EINVAL;
+ goto out;
+next:
+ continue;
+ }
+out:
+ tlb_finish_mmu(&tlb);
+ mmu_notifier_invalidate_range_end(&range);
+ return ret;
+}
+
/*
* Given a particular address, split the HugeTLB PTE that currently maps it
* so that, for the given address, the PTE that maps it is `desired_shift`.
--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-24 18:15:06

by James Houghton

Subject: [RFC PATCH 05/26] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING

This adds the Kconfig to enable or disable high-granularity mapping. It
is enabled by default for architectures that use
ARCH_WANT_GENERAL_HUGETLB.

There is also an arch-specific config, ARCH_HAS_SPECIAL_HUGETLB_HGM, which
an architecture that doesn't use general HugeTLB selects once it has been
updated to support HGM.

Signed-off-by: James Houghton <[email protected]>
---
fs/Kconfig | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/fs/Kconfig b/fs/Kconfig
index 5976eb33535f..d76c7d812656 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -268,6 +268,13 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
to enable optimizing vmemmap pages of HugeTLB by default. It can then
be disabled on the command line via hugetlb_free_vmemmap=off.

+config ARCH_HAS_SPECIAL_HUGETLB_HGM
+ bool
+
+config HUGETLB_HIGH_GRANULARITY_MAPPING
+ def_bool ARCH_WANT_GENERAL_HUGETLB || ARCH_HAS_SPECIAL_HUGETLB_HGM
+ depends on HUGETLB_PAGE
+
config MEMFD_CREATE
def_bool TMPFS || HUGETLBFS

--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-24 18:15:10

by James Houghton

Subject: [RFC PATCH 26/26] selftests: add HugeTLB HGM to KVM demand paging selftest

This doesn't address collapsing yet, and it only works with the MINOR
mode (UFFDIO_CONTINUE).

Signed-off-by: James Houghton <[email protected]>
---
tools/testing/selftests/kvm/include/test_util.h | 2 ++
tools/testing/selftests/kvm/lib/kvm_util.c | 2 +-
tools/testing/selftests/kvm/lib/test_util.c | 14 ++++++++++++++
3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
index 99e0dcdc923f..6209e44981a7 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -87,6 +87,7 @@ enum vm_mem_backing_src_type {
VM_MEM_SRC_ANONYMOUS_HUGETLB_16GB,
VM_MEM_SRC_SHMEM,
VM_MEM_SRC_SHARED_HUGETLB,
+ VM_MEM_SRC_SHARED_HUGETLB_HGM,
NUM_SRC_TYPES,
};

@@ -105,6 +106,7 @@ size_t get_def_hugetlb_pagesz(void);
const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i);
size_t get_backing_src_pagesz(uint32_t i);
bool is_backing_src_hugetlb(uint32_t i);
+bool is_backing_src_shared_hugetlb(enum vm_mem_backing_src_type src_type);
void backing_src_help(const char *flag);
enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name);
long get_run_delay(void);
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 1665a220abcb..382f8fb75b7f 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -993,7 +993,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
region->fd = -1;
if (backing_src_is_shared(src_type))
region->fd = kvm_memfd_alloc(region->mmap_size,
- src_type == VM_MEM_SRC_SHARED_HUGETLB);
+ is_backing_src_shared_hugetlb(src_type));

region->mmap_start = mmap(NULL, region->mmap_size,
PROT_READ | PROT_WRITE,
diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/selftests/kvm/lib/test_util.c
index 6d23878bbfe1..710dc42077fe 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -254,6 +254,13 @@ const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i)
*/
.flag = MAP_SHARED,
},
+ [VM_MEM_SRC_SHARED_HUGETLB_HGM] = {
+ /*
+ * Identical to shared_hugetlb except for the name.
+ */
+ .name = "shared_hugetlb_hgm",
+ .flag = MAP_SHARED,
+ },
};
_Static_assert(ARRAY_SIZE(aliases) == NUM_SRC_TYPES,
"Missing new backing src types?");
@@ -272,6 +279,7 @@ size_t get_backing_src_pagesz(uint32_t i)
switch (i) {
case VM_MEM_SRC_ANONYMOUS:
case VM_MEM_SRC_SHMEM:
+ case VM_MEM_SRC_SHARED_HUGETLB_HGM:
return getpagesize();
case VM_MEM_SRC_ANONYMOUS_THP:
return get_trans_hugepagesz();
@@ -288,6 +296,12 @@ bool is_backing_src_hugetlb(uint32_t i)
return !!(vm_mem_backing_src_alias(i)->flag & MAP_HUGETLB);
}

+bool is_backing_src_shared_hugetlb(enum vm_mem_backing_src_type src_type)
+{
+ return src_type == VM_MEM_SRC_SHARED_HUGETLB ||
+ src_type == VM_MEM_SRC_SHARED_HUGETLB_HGM;
+}
+
static void print_available_backing_src_types(const char *prefix)
{
int i;
--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-24 18:15:15

by James Houghton

Subject: [RFC PATCH 23/26] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM

This feature bit lets userspace detect that its kernel was compiled with
HugeTLB high-granularity mapping and that UFFDIO_CONTINUE operations on
PAGE_SIZE-aligned chunks of a hugepage are valid.
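
A minimal userspace detection sketch (error handling omitted; the feature bit
is only reported when the kernel was built with HGM support):

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <linux/userfaultfd.h>

	int main(void)
	{
		int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
		/* Pass features=0 to query what the kernel supports. */
		struct uffdio_api api = { .api = UFFD_API, .features = 0 };

		if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
			return 1;
		if (api.features & UFFD_FEATURE_MINOR_HUGETLBFS_HGM)
			printf("4K UFFDIO_CONTINUE on HugeTLB supported\n");
		return 0;
	}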

Signed-off-by: James Houghton <[email protected]>
---
fs/userfaultfd.c | 7 ++++++-
include/uapi/linux/userfaultfd.h | 2 ++
2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 77c1b8a7d0b9..59bfdb7a67e0 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1935,10 +1935,15 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
goto err_out;
/* report all available features and ioctls to userland */
uffdio_api.features = UFFD_API_FEATURES;
+
#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
uffdio_api.features &=
~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
-#endif
+#ifndef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+ uffdio_api.features &= ~UFFD_FEATURE_MINOR_HUGETLBFS_HGM;
+#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
+#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+
#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
#endif
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 7d32b1e797fb..50fbcb0bcba0 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -32,6 +32,7 @@
UFFD_FEATURE_SIGBUS | \
UFFD_FEATURE_THREAD_ID | \
UFFD_FEATURE_MINOR_HUGETLBFS | \
+ UFFD_FEATURE_MINOR_HUGETLBFS_HGM | \
UFFD_FEATURE_MINOR_SHMEM | \
UFFD_FEATURE_EXACT_ADDRESS | \
UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
@@ -213,6 +214,7 @@ struct uffdio_api {
#define UFFD_FEATURE_MINOR_SHMEM (1<<10)
#define UFFD_FEATURE_EXACT_ADDRESS (1<<11)
#define UFFD_FEATURE_WP_HUGETLBFS_SHMEM (1<<12)
+#define UFFD_FEATURE_MINOR_HUGETLBFS_HGM (1<<13)
__u64 features;

__u64 ioctls;
--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-24 18:16:14

by James Houghton

[permalink] [raw]
Subject: [RFC PATCH 01/26] hugetlb: make hstate accessor functions const

This is just a const-correctness change so that the new hugetlb_pte
changes can be const-correct too.

Acked-by: David Rientjes <[email protected]>

Signed-off-by: James Houghton <[email protected]>
---
include/linux/hugetlb.h | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e4cff27d1198..498a4ae3d462 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -715,7 +715,7 @@ static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
return hstate_file(vma->vm_file);
}

-static inline unsigned long huge_page_size(struct hstate *h)
+static inline unsigned long huge_page_size(const struct hstate *h)
{
return (unsigned long)PAGE_SIZE << h->order;
}
@@ -729,27 +729,27 @@ static inline unsigned long huge_page_mask(struct hstate *h)
return h->mask;
}

-static inline unsigned int huge_page_order(struct hstate *h)
+static inline unsigned int huge_page_order(const struct hstate *h)
{
return h->order;
}

-static inline unsigned huge_page_shift(struct hstate *h)
+static inline unsigned huge_page_shift(const struct hstate *h)
{
return h->order + PAGE_SHIFT;
}

-static inline bool hstate_is_gigantic(struct hstate *h)
+static inline bool hstate_is_gigantic(const struct hstate *h)
{
return huge_page_order(h) >= MAX_ORDER;
}

-static inline unsigned int pages_per_huge_page(struct hstate *h)
+static inline unsigned int pages_per_huge_page(const struct hstate *h)
{
return 1 << h->order;
}

-static inline unsigned int blocks_per_huge_page(struct hstate *h)
+static inline unsigned int blocks_per_huge_page(const struct hstate *h)
{
return huge_page_size(h) / 512;
}
--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-24 18:16:29

by James Houghton

[permalink] [raw]
Subject: [RFC PATCH 22/26] madvise: add uapi for HugeTLB HGM collapse: MADV_COLLAPSE

This commit co-opts the MADV_COLLAPSE madvise mode that is being introduced
by [email protected] to manually collapse THPs[1].

As with the rest of the high-granularity mapping support, MADV_COLLAPSE
is only supported for shared VMAs right now.

[1] https://lore.kernel.org/linux-mm/[email protected]/
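
A rough sketch of the intended userspace flow (hypothetical helper; the range
is assumed to be hugepage-aligned and already fully populated with
UFFDIO_CONTINUE):

	#include <stddef.h>
	#include <sys/mman.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE 25	/* value proposed by this patch */
	#endif

	/* Collapse a fully-fetched HugeTLB range back to huge mappings. */
	static int collapse_hugetlb_range(void *addr, size_t len)
	{
		return madvise(addr, len, MADV_COLLAPSE);
	}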

Signed-off-by: James Houghton <[email protected]>
---
include/uapi/asm-generic/mman-common.h | 2 ++
mm/madvise.c | 23 +++++++++++++++++++++++
2 files changed, 25 insertions(+)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6c1aa92a92e4..b686920ca731 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -77,6 +77,8 @@

#define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */

+#define MADV_COLLAPSE 25 /* collapse an address range into hugepages */
+
/* compatibility flags */
#define MAP_FILE 0

diff --git a/mm/madvise.c b/mm/madvise.c
index d7b4f2602949..c624c0f02276 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -59,6 +59,7 @@ static int madvise_need_mmap_write(int behavior)
case MADV_FREE:
case MADV_POPULATE_READ:
case MADV_POPULATE_WRITE:
+ case MADV_COLLAPSE:
return 0;
default:
/* be safe, default to 1. list exceptions explicitly */
@@ -981,6 +982,20 @@ static long madvise_remove(struct vm_area_struct *vma,
return error;
}

+static int madvise_collapse(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start, unsigned long end)
+{
+ bool shared = vma->vm_flags & VM_SHARED;
+ *prev = vma;
+
+ /* Only allow collapsing for HGM-enabled, shared mappings. */
+ if (!is_vm_hugetlb_page(vma) || !hugetlb_hgm_enabled(vma) || !shared)
+ return -EINVAL;
+
+ return hugetlb_collapse(vma->vm_mm, vma, start, end);
+}
+
/*
* Apply an madvise behavior to a region of a vma. madvise_update_vma
* will handle splitting a vm area into separate areas, each area with its own
@@ -1011,6 +1026,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
case MADV_POPULATE_READ:
case MADV_POPULATE_WRITE:
return madvise_populate(vma, prev, start, end, behavior);
+ case MADV_COLLAPSE:
+ return madvise_collapse(vma, prev, start, end);
case MADV_NORMAL:
new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
break;
@@ -1158,6 +1175,9 @@ madvise_behavior_valid(int behavior)
#ifdef CONFIG_MEMORY_FAILURE
case MADV_SOFT_OFFLINE:
case MADV_HWPOISON:
+#endif
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+ case MADV_COLLAPSE:
#endif
return true;

@@ -1351,6 +1371,9 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
* triggering read faults if required
* MADV_POPULATE_WRITE - populate (prefault) page tables writable by
* triggering write faults if required
+ * MADV_COLLAPSE - collapse a high-granularity HugeTLB mapping into huge
+ * mappings. This is useful after an entire hugepage has been
+ * mapped with individual small UFFDIO_CONTINUE operations.
*
* return values:
* zero - success
--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-24 18:17:13

by James Houghton

[permalink] [raw]
Subject: [RFC PATCH 08/26] hugetlb: add hugetlb_free_range to free PT structures

This is a helper function for freeing the bits of the page table that
map a particular HugeTLB PTE.

Signed-off-by: James Houghton <[email protected]>
---
include/linux/hugetlb.h | 2 ++
mm/hugetlb.c | 17 +++++++++++++++++
2 files changed, 19 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 1d4ec9dfdebf..33ba48fac551 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -107,6 +107,8 @@ bool hugetlb_pte_none_mostly(const struct hugetlb_pte *hpte);
pte_t hugetlb_ptep_get(const struct hugetlb_pte *hpte);
void hugetlb_pte_clear(struct mm_struct *mm, const struct hugetlb_pte *hpte,
unsigned long address);
+void hugetlb_free_range(struct mmu_gather *tlb, const struct hugetlb_pte *hpte,
+ unsigned long start, unsigned long end);

struct hugepage_subpool {
spinlock_t lock;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1a1434e29740..a2d2ffa76173 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1120,6 +1120,23 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
return false;
}

+void hugetlb_free_range(struct mmu_gather *tlb, const struct hugetlb_pte *hpte,
+ unsigned long start, unsigned long end)
+{
+ unsigned long floor = start & hugetlb_pte_mask(hpte);
+ unsigned long ceiling = floor + hugetlb_pte_size(hpte);
+
+ if (hugetlb_pte_size(hpte) >= PGDIR_SIZE) {
+ free_p4d_range(tlb, (pgd_t *)hpte->ptep, start, end, floor, ceiling);
+ } else if (hugetlb_pte_size(hpte) >= P4D_SIZE) {
+ free_pud_range(tlb, (p4d_t *)hpte->ptep, start, end, floor, ceiling);
+ } else if (hugetlb_pte_size(hpte) >= PUD_SIZE) {
+ free_pmd_range(tlb, (pud_t *)hpte->ptep, start, end, floor, ceiling);
+ } else if (hugetlb_pte_size(hpte) >= PMD_SIZE) {
+ free_pte_range(tlb, (pmd_t *)hpte->ptep, start);
+ }
+}
+
bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte)
{
pgd_t pgd;
--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-24 18:17:54

by James Houghton

[permalink] [raw]
Subject: [RFC PATCH 19/26] hugetlb: add HGM support for copy_hugetlb_page_range

This allows fork() to work with high-granularity mappings. The page
table structure is copied such that partially mapped regions will remain
partially mapped in the same way for the new process.

Signed-off-by: James Houghton <[email protected]>
---
mm/hugetlb.c | 74 +++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 59 insertions(+), 15 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index aadfcee947cf..0ec2f231524e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4851,7 +4851,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
struct vm_area_struct *src_vma)
{
pte_t *src_pte, *dst_pte, entry, dst_entry;
- struct page *ptepage;
+ struct hugetlb_pte src_hpte, dst_hpte;
+ struct page *ptepage, *hpage;
unsigned long addr;
bool cow = is_cow_mapping(src_vma->vm_flags);
struct hstate *h = hstate_vma(src_vma);
@@ -4878,17 +4879,44 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
i_mmap_lock_read(mapping);
}

- for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
+ addr = src_vma->vm_start;
+ while (addr < src_vma->vm_end) {
spinlock_t *src_ptl, *dst_ptl;
+ unsigned long hpte_sz;
src_pte = huge_pte_offset(src, addr, sz);
- if (!src_pte)
+ if (!src_pte) {
+ addr += sz;
continue;
+ }
dst_pte = huge_pte_alloc(dst, dst_vma, addr, sz);
if (!dst_pte) {
ret = -ENOMEM;
break;
}

+ hugetlb_pte_populate(&src_hpte, src_pte, huge_page_shift(h));
+ hugetlb_pte_populate(&dst_hpte, dst_pte, huge_page_shift(h));
+
+ if (hugetlb_hgm_enabled(src_vma)) {
+ BUG_ON(!hugetlb_hgm_enabled(dst_vma));
+ ret = hugetlb_walk_to(src, &src_hpte, addr,
+ PAGE_SIZE, /*stop_at_none=*/true);
+ if (ret)
+ break;
+ ret = huge_pte_alloc_high_granularity(
+ &dst_hpte, dst, dst_vma, addr,
+ hugetlb_pte_shift(&src_hpte),
+ HUGETLB_SPLIT_NONE,
+ /*write_locked=*/false);
+ if (ret)
+ break;
+
+ src_pte = src_hpte.ptep;
+ dst_pte = dst_hpte.ptep;
+ }
+
+ hpte_sz = hugetlb_pte_size(&src_hpte);
+
/*
* If the pagetables are shared don't copy or take references.
* dst_pte == src_pte is the common case of src/dest sharing.
@@ -4899,16 +4927,19 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
* after taking the lock below.
*/
dst_entry = huge_ptep_get(dst_pte);
- if ((dst_pte == src_pte) || !huge_pte_none(dst_entry))
+ if ((dst_pte == src_pte) || !hugetlb_pte_none(&dst_hpte)) {
+ addr += hugetlb_pte_size(&src_hpte);
continue;
+ }

- dst_ptl = huge_pte_lock(h, dst, dst_pte);
- src_ptl = huge_pte_lockptr(huge_page_shift(h), src, src_pte);
+ dst_ptl = hugetlb_pte_lock(dst, &dst_hpte);
+ src_ptl = hugetlb_pte_lockptr(src, &src_hpte);
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
entry = huge_ptep_get(src_pte);
dst_entry = huge_ptep_get(dst_pte);
again:
- if (huge_pte_none(entry) || !huge_pte_none(dst_entry)) {
+ if (hugetlb_pte_none(&src_hpte) ||
+ !hugetlb_pte_none(&dst_hpte)) {
/*
* Skip if src entry none. Also, skip in the
* unlikely case dst entry !none as this implies
@@ -4931,11 +4962,12 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
if (userfaultfd_wp(src_vma) && uffd_wp)
entry = huge_pte_mkuffd_wp(entry);
set_huge_swap_pte_at(src, addr, src_pte,
- entry, sz);
+ entry, hpte_sz);
}
if (!userfaultfd_wp(dst_vma) && uffd_wp)
entry = huge_pte_clear_uffd_wp(entry);
- set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
+ set_huge_swap_pte_at(dst, addr, dst_pte, entry,
+ hpte_sz);
} else if (unlikely(is_pte_marker(entry))) {
/*
* We copy the pte marker only if the dst vma has
@@ -4946,7 +4978,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
} else {
entry = huge_ptep_get(src_pte);
ptepage = pte_page(entry);
- get_page(ptepage);
+ hpage = compound_head(ptepage);
+ get_page(hpage);

/*
* Failing to duplicate the anon rmap is a rare case
@@ -4959,9 +4992,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
* sleep during the process.
*/
if (!PageAnon(ptepage)) {
- page_dup_file_rmap(ptepage, true);
+ /* Only dup_rmap once for a page */
+ if (IS_ALIGNED(addr, sz))
+ page_dup_file_rmap(hpage, true);
} else if (page_try_dup_anon_rmap(ptepage, true,
src_vma)) {
+ if (hugetlb_hgm_enabled(src_vma)) {
+ ret = -EINVAL;
+ break;
+ }
+ BUG_ON(!IS_ALIGNED(addr, hugetlb_pte_size(&src_hpte)));
pte_t src_pte_old = entry;
struct page *new;

@@ -4970,13 +5010,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
/* Do not use reserve as it's private owned */
new = alloc_huge_page(dst_vma, addr, 1);
if (IS_ERR(new)) {
- put_page(ptepage);
+ put_page(hpage);
ret = PTR_ERR(new);
break;
}
- copy_user_huge_page(new, ptepage, addr, dst_vma,
+ copy_user_huge_page(new, hpage, addr, dst_vma,
npages);
- put_page(ptepage);
+ put_page(hpage);

/* Install the new huge page if src pte stable */
dst_ptl = huge_pte_lock(h, dst, dst_pte);
@@ -4994,6 +5034,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
hugetlb_install_page(dst_vma, dst_pte, addr, new);
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
+ addr += hugetlb_pte_size(&src_hpte);
continue;
}

@@ -5010,10 +5051,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
}

set_huge_pte_at(dst, addr, dst_pte, entry);
- hugetlb_count_add(npages, dst);
+ hugetlb_count_add(
+ hugetlb_pte_size(&dst_hpte) / PAGE_SIZE,
+ dst);
}
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
+ addr += hugetlb_pte_size(&src_hpte);
}

if (cow) {
--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-24 18:19:13

by James Houghton

[permalink] [raw]
Subject: [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE

The changes here are very similar to the changes made to
hugetlb_no_page, where we do a high-granularity page table walk and
do accounting slightly differently because we are mapping only a piece
of a page.
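
A minimal sketch of the userspace side this enables (assumes a MAP_SHARED
HugeTLB region already registered with UFFDIO_REGISTER_MODE_MINOR; error
handling omitted):

	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/userfaultfd.h>

	/* Resolve a MINOR fault by mapping just one base page of the
	 * hugepage instead of the whole thing. */
	static int continue_one_page(int uffd, unsigned long fault_addr,
				     unsigned long base_page_size)
	{
		struct uffdio_continue cont;

		memset(&cont, 0, sizeof(cont));
		cont.range.start = fault_addr & ~(base_page_size - 1);
		cont.range.len = base_page_size;
		return ioctl(uffd, UFFDIO_CONTINUE, &cont);
	}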

Signed-off-by: James Houghton <[email protected]>
---
fs/userfaultfd.c | 3 +++
include/linux/hugetlb.h | 6 +++--
mm/hugetlb.c | 54 +++++++++++++++++++++-----------------
mm/userfaultfd.c | 57 +++++++++++++++++++++++++++++++----------
4 files changed, 82 insertions(+), 38 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index e943370107d0..77c1b8a7d0b9 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -245,6 +245,9 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
if (!ptep)
goto out;

+ if (hugetlb_hgm_enabled(vma))
+ goto out;
+
ret = false;
pte = huge_ptep_get(ptep);

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ac4ac8fbd901..c207b1ac6195 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -221,13 +221,15 @@ unsigned long hugetlb_total_pages(void);
vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, unsigned int flags);
#ifdef CONFIG_USERFAULTFD
-int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
+int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
+ struct hugetlb_pte *dst_hpte,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
unsigned long src_addr,
enum mcopy_atomic_mode mode,
struct page **pagep,
- bool wp_copy);
+ bool wp_copy,
+ bool new_mapping);
#endif /* CONFIG_USERFAULTFD */
bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0ec2f231524e..09fa57599233 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5808,6 +5808,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
vma_end_reservation(h, vma, haddr);
}

+ /* This lock will get pretty expensive at 4K. */
ptl = hugetlb_pte_lock(mm, hpte);
ret = 0;
/* If pte changed from under us, retry */
@@ -6098,24 +6099,26 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* modifications for huge pages.
*/
int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
- pte_t *dst_pte,
+ struct hugetlb_pte *dst_hpte,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
unsigned long src_addr,
enum mcopy_atomic_mode mode,
struct page **pagep,
- bool wp_copy)
+ bool wp_copy,
+ bool new_mapping)
{
bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
struct hstate *h = hstate_vma(dst_vma);
struct address_space *mapping = dst_vma->vm_file->f_mapping;
+ unsigned long haddr = dst_addr & huge_page_mask(h);
pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
unsigned long size;
int vm_shared = dst_vma->vm_flags & VM_SHARED;
pte_t _dst_pte;
spinlock_t *ptl;
int ret = -ENOMEM;
- struct page *page;
+ struct page *page, *subpage;
int writable;
bool page_in_pagecache = false;

@@ -6130,12 +6133,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
* a non-missing case. Return -EEXIST.
*/
if (vm_shared &&
- hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+ hugetlbfs_pagecache_present(h, dst_vma, haddr)) {
ret = -EEXIST;
goto out;
}

- page = alloc_huge_page(dst_vma, dst_addr, 0);
+ page = alloc_huge_page(dst_vma, haddr, 0);
if (IS_ERR(page)) {
ret = -ENOMEM;
goto out;
@@ -6151,13 +6154,13 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
/* Free the allocated page which may have
* consumed a reservation.
*/
- restore_reserve_on_error(h, dst_vma, dst_addr, page);
+ restore_reserve_on_error(h, dst_vma, haddr, page);
put_page(page);

/* Allocate a temporary page to hold the copied
* contents.
*/
- page = alloc_huge_page_vma(h, dst_vma, dst_addr);
+ page = alloc_huge_page_vma(h, dst_vma, haddr);
if (!page) {
ret = -ENOMEM;
goto out;
@@ -6171,14 +6174,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
}
} else {
if (vm_shared &&
- hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+ hugetlbfs_pagecache_present(h, dst_vma, haddr)) {
put_page(*pagep);
ret = -EEXIST;
*pagep = NULL;
goto out;
}

- page = alloc_huge_page(dst_vma, dst_addr, 0);
+ page = alloc_huge_page(dst_vma, haddr, 0);
if (IS_ERR(page)) {
ret = -ENOMEM;
*pagep = NULL;
@@ -6216,8 +6219,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
page_in_pagecache = true;
}

- ptl = huge_pte_lockptr(huge_page_shift(h), dst_mm, dst_pte);
- spin_lock(ptl);
+ ptl = hugetlb_pte_lock(dst_mm, dst_hpte);

/*
* Recheck the i_size after holding PT lock to make sure not
@@ -6239,14 +6241,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
* registered, we firstly wr-protect a none pte which has no page cache
* page backing it, then access the page.
*/
- if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
+ if (!hugetlb_pte_none_mostly(dst_hpte))
goto out_release_unlock;

- if (vm_shared) {
- page_dup_file_rmap(page, true);
- } else {
- ClearHPageRestoreReserve(page);
- hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
+ if (new_mapping) {
+ if (vm_shared) {
+ page_dup_file_rmap(page, true);
+ } else {
+ ClearHPageRestoreReserve(page);
+ hugepage_add_new_anon_rmap(page, dst_vma, haddr);
+ }
}

/*
@@ -6258,7 +6262,11 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
else
writable = dst_vma->vm_flags & VM_WRITE;

- _dst_pte = make_huge_pte(dst_vma, page, writable);
+ subpage = hugetlb_find_subpage(h, page, dst_addr);
+ if (subpage != page)
+ BUG_ON(!hugetlb_hgm_enabled(dst_vma));
+
+ _dst_pte = make_huge_pte(dst_vma, subpage, writable);
/*
* Always mark UFFDIO_COPY page dirty; note that this may not be
* extremely important for hugetlbfs for now since swapping is not
@@ -6271,14 +6279,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
if (wp_copy)
_dst_pte = huge_pte_mkuffd_wp(_dst_pte);

- set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+ set_huge_pte_at(dst_mm, dst_addr, dst_hpte->ptep, _dst_pte);

- (void)huge_ptep_set_access_flags(dst_vma, dst_addr, dst_pte, _dst_pte,
- dst_vma->vm_flags & VM_WRITE);
- hugetlb_count_add(pages_per_huge_page(h), dst_mm);
+ (void)huge_ptep_set_access_flags(dst_vma, dst_addr, dst_hpte->ptep,
+ _dst_pte, dst_vma->vm_flags & VM_WRITE);
+ hugetlb_count_add(hugetlb_pte_size(dst_hpte) / PAGE_SIZE, dst_mm);

/* No need to invalidate - it was non-present before */
- update_mmu_cache(dst_vma, dst_addr, dst_pte);
+ update_mmu_cache(dst_vma, dst_addr, dst_hpte->ptep);

spin_unlock(ptl);
if (!is_continue)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 4f4892a5f767..ee40d98068bf 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -310,14 +310,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
{
int vm_shared = dst_vma->vm_flags & VM_SHARED;
ssize_t err;
- pte_t *dst_pte;
unsigned long src_addr, dst_addr;
long copied;
struct page *page;
- unsigned long vma_hpagesize;
+ unsigned long vma_hpagesize, vma_altpagesize;
pgoff_t idx;
u32 hash;
struct address_space *mapping;
+ bool use_hgm = hugetlb_hgm_enabled(dst_vma) &&
+ mode == MCOPY_ATOMIC_CONTINUE;
+ struct hstate *h = hstate_vma(dst_vma);

/*
* There is no default zero huge page for all huge page sizes as
@@ -335,12 +337,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
copied = 0;
page = NULL;
vma_hpagesize = vma_kernel_pagesize(dst_vma);
+ if (use_hgm)
+ vma_altpagesize = PAGE_SIZE;
+ else
+ vma_altpagesize = vma_hpagesize;

/*
* Validate alignment based on huge page size
*/
err = -EINVAL;
- if (dst_start & (vma_hpagesize - 1) || len & (vma_hpagesize - 1))
+ if (dst_start & (vma_altpagesize - 1) || len & (vma_altpagesize - 1))
goto out_unlock;

retry:
@@ -361,6 +367,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
vm_shared = dst_vma->vm_flags & VM_SHARED;
}

+ BUG_ON(!vm_shared && use_hgm);
+
/*
* If not shared, ensure the dst_vma has a anon_vma.
*/
@@ -371,11 +379,13 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
}

while (src_addr < src_start + len) {
+ struct hugetlb_pte hpte;
+ bool new_mapping;
BUG_ON(dst_addr >= dst_start + len);

/*
* Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
- * i_mmap_rwsem ensures the dst_pte remains valid even
+ * i_mmap_rwsem ensures the hpte.ptep remains valid even
* in the case of shared pmds. fault mutex prevents
* races with other faulting threads.
*/
@@ -383,27 +393,47 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
i_mmap_lock_read(mapping);
idx = linear_page_index(dst_vma, dst_addr);
hash = hugetlb_fault_mutex_hash(mapping, idx);
+ /* This lock will get expensive at 4K. */
mutex_lock(&hugetlb_fault_mutex_table[hash]);

- err = -ENOMEM;
- dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
- if (!dst_pte) {
+ err = 0;
+
+ pte_t *ptep = huge_pte_alloc(dst_mm, dst_vma, dst_addr,
+ vma_hpagesize);
+ if (!ptep)
+ err = -ENOMEM;
+ else {
+ hugetlb_pte_populate(&hpte, ptep,
+ huge_page_shift(h));
+ /*
+ * If the hstate-level PTE is not none, then a mapping
+ * was previously established.
+ * The per-hpage mutex prevents double-counting.
+ */
+ new_mapping = hugetlb_pte_none(&hpte);
+ if (use_hgm)
+ err = hugetlb_alloc_largest_pte(&hpte, dst_mm, dst_vma,
+ dst_addr,
+ dst_start + len);
+ }
+
+ if (err) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
i_mmap_unlock_read(mapping);
goto out_unlock;
}

if (mode != MCOPY_ATOMIC_CONTINUE &&
- !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
+ !hugetlb_pte_none_mostly(&hpte)) {
err = -EEXIST;
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
i_mmap_unlock_read(mapping);
goto out_unlock;
}

- err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
+ err = hugetlb_mcopy_atomic_pte(dst_mm, &hpte, dst_vma,
dst_addr, src_addr, mode, &page,
- wp_copy);
+ wp_copy, new_mapping);

mutex_unlock(&hugetlb_fault_mutex_table[hash]);
i_mmap_unlock_read(mapping);
@@ -413,6 +443,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
if (unlikely(err == -ENOENT)) {
mmap_read_unlock(dst_mm);
BUG_ON(!page);
+ BUG_ON(hpte.shift != huge_page_shift(h));

err = copy_huge_page_from_user(page,
(const void __user *)src_addr,
@@ -430,9 +461,9 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
BUG_ON(page);

if (!err) {
- dst_addr += vma_hpagesize;
- src_addr += vma_hpagesize;
- copied += vma_hpagesize;
+ dst_addr += hugetlb_pte_size(&hpte);
+ src_addr += hugetlb_pte_size(&hpte);
+ copied += hugetlb_pte_size(&hpte);

if (fatal_signal_pending(current))
err = -EINTR;
--
2.37.0.rc0.161.g10f37bed90-goog

2022-06-24 18:45:08

by Mina Almasry

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping

On Fri, Jun 24, 2022 at 10:37 AM James Houghton <[email protected]> wrote:
>
> This RFC introduces the concept of HugeTLB high-granularity mapping
> (HGM)[1]. In broad terms, this series teaches HugeTLB how to map HugeTLB
> pages at different granularities, and, more importantly, to partially map
> a HugeTLB page. This cover letter will go over
> - the motivation for these changes
> - userspace API
> - some of the changes to HugeTLB to make this work
> - limitations & future enhancements
>
> High-granularity mapping does *not* involve dissolving the hugepages
> themselves; it only affects how they are mapped.
>
> ---- Motivation ----
>
> Being able to map HugeTLB memory with PAGE_SIZE PTEs has important use
> cases in post-copy live migration and memory failure handling.
>
> - Live Migration (userfaultfd)
> For post-copy live migration, using userfaultfd, currently we have to
> install an entire hugepage before we can allow a guest to access that page.
> This is because, right now, either the WHOLE hugepage is mapped or NONE of
> it is. So either the guest can access the WHOLE hugepage or NONE of it.
> This makes post-copy live migration for 1G HugeTLB-backed VMs completely
> infeasible.
>
> With high-granularity mapping, we can map PAGE_SIZE pieces of a hugepage,
> thereby allowing the guest to access only PAGE_SIZE chunks, and getting
> page faults on the rest (and triggering another demand-fetch). This gives
> userspace the flexibility to install PAGE_SIZE chunks of memory into a
> hugepage, making migration of 1G-backed VMs perfectly feasible, and it
> vastly reduces the vCPU stall time during post-copy for 2M-backed VMs.
>
> At Google, for a 48 vCPU VM in post-copy, we can expect these approximate
> per-page median fetch latencies:
> 4K: <100us
> 2M: >10ms
> Being able to unpause a vCPU 100x quicker is helpful for guest stability,
> and being able to use 1G pages at all can significant improve steady-state
> guest performance.
>
> After fully copying a hugepage over the network, we will want to collapse
> the mapping down to what it would normally be (e.g., one PUD for a 1G
> page). Rather than having the kernel do this automatically, we leave it up
> to userspace to tell us to collapse a range (via MADV_COLLAPSE, co-opting
> the API that is being introduced for THPs[2]).
>
> - Memory Failure
> When a memory error is found within a HugeTLB page, it would be ideal if we
> could unmap only the PAGE_SIZE section that contained the error. This is
> what THPs are able to do. Using high-granularity mapping, we could do this,
> but this isn't tackled in this patch series.
>
> ---- Userspace API ----
>
> This patch series introduces a single way to take advantage of
> high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> userspace to resolve MINOR page faults on shared VMAs.
>
> To collapse a HugeTLB address range that has been mapped with several
> UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> userspace to know when all pages (that they care about) have been fetched.
>

Thanks James! Cover letter looks good. A few questions:

Why not have the kernel collapse the hugepage once all the 4K pages
have been fetched automatically? It would remove the need for a new
userspace API, and AFAICT there aren't really any cases where it is
beneficial to have a hugepage sharded into 4K mappings when those
mappings can be collapsed.

> ---- HugeTLB Changes ----
>
> - Mapcount
> The way mapcount is handled is different from the way that it was handled
> before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> be increased. This scheme means that, for hugepages that aren't mapped at
> high granularity, their mapcounts will remain the same as what they would
> have been pre-HGM.
>

Sorry, I didn't quite follow this. It says mapcount is handled
differently, but the same if the page is not mapped at high
granularity. Can you elaborate on how the mapcount handling will be
different when the page is mapped at high granularity?

> - Page table walking and manipulation
> A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> high-granularity mappings. Eventually, it's possible to merge
> hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
>
> We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> This is because we generally need to know the "size" of a PTE (previously
> always just huge_page_size(hstate)).
>
> For every page table manipulation function that has a huge version (e.g.
> huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> hugetlb_ptep_get). The correct version is used depending on if a HugeTLB
> PTE really is "huge".
>
> - Synchronization
> For existing bits of HugeTLB, synchronization is unchanged. For splitting
> and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
> writing, and for doing high-granularity page table walks, we require it to
> be held for reading.
>
> ---- Limitations & Future Changes ----
>
> This patch series only implements high-granularity mapping for VM_SHARED
> VMAs. I intend to implement enough HGM to support 4K unmapping for memory
> failure recovery for both shared and private mappings.
>
> The memory failure use case poses its own challenges that can be
> addressed, but I will do so in a separate RFC.
>
> Performance has not been heavily scrutinized with this patch series. There
> are places where lock contention can significantly reduce performance. This
> will be addressed later.
>
> The patch series, as it stands right now, is compatible with the VMEMMAP
> page struct optimization[3], as we do not need to modify data contained
> in the subpage page structs.
>
> Other omissions:
> - Compatibility with userfaultfd write-protect (will be included in v1).
> - Support for mremap() (will be included in v1). This looks a lot like
> the support we have for fork().
> - Documentation changes (will be included in v1).
> - Completely ignores PMD sharing and hugepage migration (will be included
> in v1).
> - Implementations for architectures that don't use GENERAL_HUGETLB other
> than arm64.
>
> ---- Patch Breakdown ----
>
> Patch 1 - Preliminary changes
> Patch 2-10 - HugeTLB HGM core changes
> Patch 11-13 - HugeTLB HGM page table walking functionality
> Patch 14-19 - HugeTLB HGM compatibility with other bits
> Patch 20-23 - Userfaultfd and collapse changes
> Patch 24-26 - arm64 support and selftests
>
> [1] This used to be called HugeTLB double mapping, a bad and confusing
> name. "High-granularity mapping" is not a great name either. I am open
> to better names.

I would drop 1 extra word and do "granular mapping", as in the mapping
is more granular than what it normally is (2MB/1G, etc).

> [2] https://lore.kernel.org/linux-mm/[email protected]/
> [3] commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
>
> James Houghton (26):
> hugetlb: make hstate accessor functions const
> hugetlb: sort hstates in hugetlb_init_hstates
> hugetlb: add make_huge_pte_with_shift
> hugetlb: make huge_pte_lockptr take an explicit shift argument.
> hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> mm: make free_p?d_range functions public
> hugetlb: add hugetlb_pte to track HugeTLB page table entries
> hugetlb: add hugetlb_free_range to free PT structures
> hugetlb: add hugetlb_hgm_enabled
> hugetlb: add for_each_hgm_shift
> hugetlb: add hugetlb_walk_to to do PT walks
> hugetlb: add HugeTLB splitting functionality
> hugetlb: add huge_pte_alloc_high_granularity
> hugetlb: add HGM support for hugetlb_fault and hugetlb_no_page
> hugetlb: make unmapping compatible with high-granularity mappings
> hugetlb: make hugetlb_change_protection compatible with HGM
> hugetlb: update follow_hugetlb_page to support HGM
> hugetlb: use struct hugetlb_pte for walk_hugetlb_range
> hugetlb: add HGM support for copy_hugetlb_page_range
> hugetlb: add support for high-granularity UFFDIO_CONTINUE
> hugetlb: add hugetlb_collapse
> madvise: add uapi for HugeTLB HGM collapse: MADV_COLLAPSE
> userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM
> arm64/hugetlb: add support for high-granularity mappings
> selftests: add HugeTLB HGM to userfaultfd selftest
> selftests: add HugeTLB HGM to KVM demand paging selftest
>
> arch/arm64/Kconfig | 1 +
> arch/arm64/mm/hugetlbpage.c | 63 ++
> arch/powerpc/mm/pgtable.c | 3 +-
> arch/s390/mm/gmap.c | 8 +-
> fs/Kconfig | 7 +
> fs/proc/task_mmu.c | 35 +-
> fs/userfaultfd.c | 10 +-
> include/asm-generic/tlb.h | 6 +-
> include/linux/hugetlb.h | 177 +++-
> include/linux/mm.h | 7 +
> include/linux/pagewalk.h | 3 +-
> include/uapi/asm-generic/mman-common.h | 2 +
> include/uapi/linux/userfaultfd.h | 2 +
> mm/damon/vaddr.c | 34 +-
> mm/hmm.c | 7 +-
> mm/hugetlb.c | 987 +++++++++++++++---
> mm/madvise.c | 23 +
> mm/memory.c | 8 +-
> mm/mempolicy.c | 11 +-
> mm/migrate.c | 3 +-
> mm/mincore.c | 4 +-
> mm/mprotect.c | 6 +-
> mm/page_vma_mapped.c | 3 +-
> mm/pagewalk.c | 18 +-
> mm/userfaultfd.c | 57 +-
> .../testing/selftests/kvm/include/test_util.h | 2 +
> tools/testing/selftests/kvm/lib/kvm_util.c | 2 +-
> tools/testing/selftests/kvm/lib/test_util.c | 14 +
> tools/testing/selftests/vm/userfaultfd.c | 61 +-
> 29 files changed, 1314 insertions(+), 250 deletions(-)
>
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>

2022-06-24 18:58:17

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping

On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
> [1] This used to be called HugeTLB double mapping, a bad and confusing
> name. "High-granularity mapping" is not a great name either. I am open
> to better names.

Oh good, I was grinding my teeth every time I read it ;-)

How does "Fine granularity" work for you?
"sub-page mapping" might work too.

2022-06-24 18:59:32

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping

On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
> - Page table walking and manipulation
> A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> high-granularity mappings. Eventually, it's possible to merge
> hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
>
> We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> This is because we generally need to know the "size" of a PTE (previously
> always just huge_page_size(hstate)).
>
> For every page table manipulation function that has a huge version (e.g.
> huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> hugetlb_ptep_get). The correct version is used depending on if a HugeTLB
> PTE really is "huge".

I'm disappointed to hear that page table walking is going to become even
more special. I'd much prefer it if hugetlb walking were exactly the
same as THP walking. This seems like a good time to do at least some
of that work.

Was there a reason you chose the "more complexity" direction?

2022-06-24 19:05:58

by Mina Almasry

[permalink] [raw]
Subject: Re: [RFC PATCH 01/26] hugetlb: make hstate accessor functions const

On Fri, Jun 24, 2022 at 10:37 AM James Houghton <[email protected]> wrote:
>
> This is just a const-correctness change so that the new hugetlb_pte
> changes can be const-correct too.
>
> Acked-by: David Rientjes <[email protected]>
>

Reviewed-By: Mina Almasry <[email protected]>

> Signed-off-by: James Houghton <[email protected]>
> ---
> include/linux/hugetlb.h | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index e4cff27d1198..498a4ae3d462 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -715,7 +715,7 @@ static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> return hstate_file(vma->vm_file);
> }
>
> -static inline unsigned long huge_page_size(struct hstate *h)
> +static inline unsigned long huge_page_size(const struct hstate *h)
> {
> return (unsigned long)PAGE_SIZE << h->order;
> }
> @@ -729,27 +729,27 @@ static inline unsigned long huge_page_mask(struct hstate *h)
> return h->mask;
> }
>
> -static inline unsigned int huge_page_order(struct hstate *h)
> +static inline unsigned int huge_page_order(const struct hstate *h)
> {
> return h->order;
> }
>
> -static inline unsigned huge_page_shift(struct hstate *h)
> +static inline unsigned huge_page_shift(const struct hstate *h)
> {
> return h->order + PAGE_SHIFT;
> }
>
> -static inline bool hstate_is_gigantic(struct hstate *h)
> +static inline bool hstate_is_gigantic(const struct hstate *h)
> {
> return huge_page_order(h) >= MAX_ORDER;
> }
>
> -static inline unsigned int pages_per_huge_page(struct hstate *h)
> +static inline unsigned int pages_per_huge_page(const struct hstate *h)
> {
> return 1 << h->order;
> }
>
> -static inline unsigned int blocks_per_huge_page(struct hstate *h)
> +static inline unsigned int blocks_per_huge_page(const struct hstate *h)
> {
> return huge_page_size(h) / 512;
> }
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>

2022-06-24 19:07:15

by Mina Almasry

[permalink] [raw]
Subject: Re: [RFC PATCH 03/26] hugetlb: add make_huge_pte_with_shift

On Fri, Jun 24, 2022 at 10:37 AM James Houghton <[email protected]> wrote:
>
> This allows us to make huge PTEs at shifts other than the hstate shift,
> which will be necessary for high-granularity mappings.
>

Can you elaborate on why?

> Signed-off-by: James Houghton <[email protected]>
> ---
> mm/hugetlb.c | 33 ++++++++++++++++++++-------------
> 1 file changed, 20 insertions(+), 13 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5df838d86f32..0eec34edf3b2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4686,23 +4686,30 @@ const struct vm_operations_struct hugetlb_vm_ops = {
> .pagesize = hugetlb_vm_op_pagesize,
> };
>
> +static pte_t make_huge_pte_with_shift(struct vm_area_struct *vma,
> + struct page *page, int writable,
> + int shift)
> +{
> + bool huge = shift > PAGE_SHIFT;
> + pte_t entry = huge ? mk_huge_pte(page, vma->vm_page_prot)
> + : mk_pte(page, vma->vm_page_prot);
> +
> + if (writable)
> + entry = huge ? huge_pte_mkwrite(entry) : pte_mkwrite(entry);
> + else
> + entry = huge ? huge_pte_wrprotect(entry) : pte_wrprotect(entry);
> + pte_mkyoung(entry);
> + if (huge)
> + entry = arch_make_huge_pte(entry, shift, vma->vm_flags);
> + return entry;
> +}
> +
> static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
> - int writable)
> + int writable)

Looks like an unnecessary diff?

> {
> - pte_t entry;
> unsigned int shift = huge_page_shift(hstate_vma(vma));
>
> - if (writable) {
> - entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_huge_pte(page,
> - vma->vm_page_prot)));

In this case there is an intermediate call to huge_pte_mkdirty() that
is not done in make_huge_pte_with_shift(). Why was this removed?

> - } else {
> - entry = huge_pte_wrprotect(mk_huge_pte(page,
> - vma->vm_page_prot));
> - }
> - entry = pte_mkyoung(entry);
> - entry = arch_make_huge_pte(entry, shift, vma->vm_flags);
> -
> - return entry;
> + return make_huge_pte_with_shift(vma, page, writable, shift);

I think it's marginally cleaner to calculate the shift inline:

return make_huge_pte_with_shift(vma, page, writable,
huge_page_shift(hstate_vma(vma)));

> }
>
> static void set_huge_ptep_writable(struct vm_area_struct *vma,
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>

2022-06-24 19:23:12

by Mina Almasry

[permalink] [raw]
Subject: Re: [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates

On Fri, Jun 24, 2022 at 10:37 AM James Houghton <[email protected]> wrote:
>
> When using HugeTLB high-granularity mapping, we need to go through the
> supported hugepage sizes in decreasing order so that we pick the largest
> size that works. Consider the case where we're faulting in a 1G hugepage
> for the first time: we want hugetlb_fault/hugetlb_no_page to map it with
> a PUD. By going through the sizes in decreasing order, we will find that
> PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too.
>

Mostly nits:

Reviewed-by: Mina Almasry <[email protected]>

> Signed-off-by: James Houghton <[email protected]>
> ---
> mm/hugetlb.c | 40 +++++++++++++++++++++++++++++++++++++---
> 1 file changed, 37 insertions(+), 3 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a57e1be41401..5df838d86f32 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -33,6 +33,7 @@
> #include <linux/migrate.h>
> #include <linux/nospec.h>
> #include <linux/delayacct.h>
> +#include <linux/sort.h>
>
> #include <asm/page.h>
> #include <asm/pgalloc.h>
> @@ -48,6 +49,10 @@
>
> int hugetlb_max_hstate __read_mostly;
> unsigned int default_hstate_idx;
> +/*
> + * After hugetlb_init_hstates is called, hstates will be sorted from largest
> + * to smallest.
> + */
> struct hstate hstates[HUGE_MAX_HSTATE];
>
> #ifdef CONFIG_CMA
> @@ -3144,14 +3149,43 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> kfree(node_alloc_noretry);
> }
>
> +static int compare_hstates_decreasing(const void *a, const void *b)
> +{
> + const int shift_a = huge_page_shift((const struct hstate *)a);
> + const int shift_b = huge_page_shift((const struct hstate *)b);
> +
> + if (shift_a < shift_b)
> + return 1;
> + if (shift_a > shift_b)
> + return -1;
> + return 0;
> +}
> +
> +static void sort_hstates(void)

Maybe sort_hstates_descending(void) for extra clarity.

> +{
> + unsigned long default_hstate_sz = huge_page_size(&default_hstate);
> +
> + /* Sort from largest to smallest. */

I'd remove this redundant comment; it's somewhat obvious what the next
line does.

> + sort(hstates, hugetlb_max_hstate, sizeof(*hstates),
> + compare_hstates_decreasing, NULL);
> +
> + /*
> + * We may have changed the location of the default hstate, so we need to
> + * update it.
> + */
> + default_hstate_idx = hstate_index(size_to_hstate(default_hstate_sz));
> +}
> +
> static void __init hugetlb_init_hstates(void)
> {
> struct hstate *h, *h2;
>
> - for_each_hstate(h) {
> - if (minimum_order > huge_page_order(h))
> - minimum_order = huge_page_order(h);
> + sort_hstates();
>
> + /* The last hstate is now the smallest. */

Same, given that above is sort_hstates().

> + minimum_order = huge_page_order(&hstates[hugetlb_max_hstate - 1]);
> +
> + for_each_hstate(h) {
> /* oversize hugepages were init'ed in early boot */
> if (!hstate_is_gigantic(h))
> hugetlb_hstate_alloc_pages(h);
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>

2022-06-27 12:35:54

by manish.mishra

[permalink] [raw]
Subject: Re: [RFC PATCH 03/26] hugetlb: add make_huge_pte_with_shift


On 24/06/22 11:06 pm, James Houghton wrote:
> This allows us to make huge PTEs at shifts other than the hstate shift,
> which will be necessary for high-granularity mappings.
>
> Signed-off-by: James Houghton <[email protected]>
> ---
> mm/hugetlb.c | 33 ++++++++++++++++++++-------------
> 1 file changed, 20 insertions(+), 13 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5df838d86f32..0eec34edf3b2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4686,23 +4686,30 @@ const struct vm_operations_struct hugetlb_vm_ops = {
> .pagesize = hugetlb_vm_op_pagesize,
> };
reviewed-by: [email protected]
>
> +static pte_t make_huge_pte_with_shift(struct vm_area_struct *vma,
> + struct page *page, int writable,
> + int shift)
> +{
> + bool huge = shift > PAGE_SHIFT;
> + pte_t entry = huge ? mk_huge_pte(page, vma->vm_page_prot)
> + : mk_pte(page, vma->vm_page_prot);
> +
> + if (writable)
> + entry = huge ? huge_pte_mkwrite(entry) : pte_mkwrite(entry);
> + else
> + entry = huge ? huge_pte_wrprotect(entry) : pte_wrprotect(entry);
> + pte_mkyoung(entry);
> + if (huge)
> + entry = arch_make_huge_pte(entry, shift, vma->vm_flags);
> + return entry;
> +}
> +
> static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
> - int writable)
> + int writable)
> {
> - pte_t entry;
> unsigned int shift = huge_page_shift(hstate_vma(vma));
>
> - if (writable) {
> - entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_huge_pte(page,
> - vma->vm_page_prot)));
> - } else {
> - entry = huge_pte_wrprotect(mk_huge_pte(page,
> - vma->vm_page_prot));
> - }
> - entry = pte_mkyoung(entry);
> - entry = arch_make_huge_pte(entry, shift, vma->vm_flags);
> -
> - return entry;
> + return make_huge_pte_with_shift(vma, page, writable, shift);
> }
>
> static void set_huge_ptep_writable(struct vm_area_struct *vma,

2022-06-27 12:36:34

by manish.mishra

[permalink] [raw]
Subject: Re: [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates


On 24/06/22 11:06 pm, James Houghton wrote:
> When using HugeTLB high-granularity mapping, we need to go through the
> supported hugepage sizes in decreasing order so that we pick the largest
> size that works. Consider the case where we're faulting in a 1G hugepage
> for the first time: we want hugetlb_fault/hugetlb_no_page to map it with
> a PUD. By going through the sizes in decreasing order, we will find that
> PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too.
>
> Signed-off-by: James Houghton <[email protected]>
> ---
> mm/hugetlb.c | 40 +++++++++++++++++++++++++++++++++++++---
> 1 file changed, 37 insertions(+), 3 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a57e1be41401..5df838d86f32 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -33,6 +33,7 @@
> #include <linux/migrate.h>
> #include <linux/nospec.h>
> #include <linux/delayacct.h>
> +#include <linux/sort.h>
>
> #include <asm/page.h>
> #include <asm/pgalloc.h>
> @@ -48,6 +49,10 @@
>
> int hugetlb_max_hstate __read_mostly;
> unsigned int default_hstate_idx;
> +/*
> + * After hugetlb_init_hstates is called, hstates will be sorted from largest
> + * to smallest.
> + */
> struct hstate hstates[HUGE_MAX_HSTATE];
>
> #ifdef CONFIG_CMA
> @@ -3144,14 +3149,43 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> kfree(node_alloc_noretry);
> }
>
> +static int compare_hstates_decreasing(const void *a, const void *b)
> +{
> + const int shift_a = huge_page_shift((const struct hstate *)a);
> + const int shift_b = huge_page_shift((const struct hstate *)b);
> +
> + if (shift_a < shift_b)
> + return 1;
> + if (shift_a > shift_b)
> + return -1;
> + return 0;
> +}
> +
> +static void sort_hstates(void)
> +{
> + unsigned long default_hstate_sz = huge_page_size(&default_hstate);
> +
> + /* Sort from largest to smallest. */
> + sort(hstates, hugetlb_max_hstate, sizeof(*hstates),
> + compare_hstates_decreasing, NULL);
> +
> + /*
> + * We may have changed the location of the default hstate, so we need to
> + * update it.
> + */
> + default_hstate_idx = hstate_index(size_to_hstate(default_hstate_sz));
> +}
> +
> static void __init hugetlb_init_hstates(void)
> {
> struct hstate *h, *h2;
>
> - for_each_hstate(h) {
> - if (minimum_order > huge_page_order(h))
> - minimum_order = huge_page_order(h);
> + sort_hstates();
>
> + /* The last hstate is now the smallest. */
> + minimum_order = huge_page_order(&hstates[hugetlb_max_hstate - 1]);
> +
> + for_each_hstate(h) {
> /* oversize hugepages were init'ed in early boot */
> if (!hstate_is_gigantic(h))
> hugetlb_hstate_alloc_pages(h);

Now that the hstates are ordered, can the code that calculates demote_order be
optimised too? I mean, could it simply use the order of the hstate at the next index?
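
Something like this untested sketch is what I mean (the existing gigantic/CMA
restrictions would still need to be applied on top):

	/* With hstates sorted largest-to-smallest, the demote target of
	 * hstates[i] is simply the next (smaller) entry in the array. */
	static void __init set_demote_orders(void)
	{
		int i;

		for (i = 0; i + 1 < hugetlb_max_hstate; i++)
			hstates[i].demote_order =
				huge_page_order(&hstates[i + 1]);
	}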


2022-06-27 12:37:21

by manish.mishra

[permalink] [raw]
Subject: Re: [RFC PATCH 05/26] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING


On 24/06/22 11:06 pm, James Houghton wrote:
> This adds the Kconfig to enable or disable high-granularity mapping. It
> is enabled by default for architectures that use
> ARCH_WANT_GENERAL_HUGETLB.
>
> There is also an arch-specific config ARCH_HAS_SPECIAL_HUGETLB_HGM which
> controls whether or not the architecture has been updated to support
> HGM if it doesn't use general HugeTLB.
>
> Signed-off-by: James Houghton <[email protected]>
reviewed-by:[email protected]
> ---
> fs/Kconfig | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 5976eb33535f..d76c7d812656 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -268,6 +268,13 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
> to enable optimizing vmemmap pages of HugeTLB by default. It can then
> be disabled on the command line via hugetlb_free_vmemmap=off.
>
> +config ARCH_HAS_SPECIAL_HUGETLB_HGM
> + bool
> +
> +config HUGETLB_HIGH_GRANULARITY_MAPPING
> + def_bool ARCH_WANT_GENERAL_HUGETLB || ARCH_HAS_SPECIAL_HUGETLB_HGM
> + depends on HUGETLB_PAGE
> +
> config MEMFD_CREATE
> def_bool TMPFS || HUGETLBFS
>

2022-06-27 13:02:29

by manish.mishra

[permalink] [raw]
Subject: Re: [RFC PATCH 08/26] hugetlb: add hugetlb_free_range to free PT structures


On 24/06/22 11:06 pm, James Houghton wrote:
> This is a helper function for freeing the bits of the page table that
> map a particular HugeTLB PTE.
>
> Signed-off-by: James Houghton <[email protected]>
> ---
> include/linux/hugetlb.h | 2 ++
> mm/hugetlb.c | 17 +++++++++++++++++
> 2 files changed, 19 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1d4ec9dfdebf..33ba48fac551 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -107,6 +107,8 @@ bool hugetlb_pte_none_mostly(const struct hugetlb_pte *hpte);
> pte_t hugetlb_ptep_get(const struct hugetlb_pte *hpte);
> void hugetlb_pte_clear(struct mm_struct *mm, const struct hugetlb_pte *hpte,
> unsigned long address);
> +void hugetlb_free_range(struct mmu_gather *tlb, const struct hugetlb_pte *hpte,
> + unsigned long start, unsigned long end);
>
> struct hugepage_subpool {
> spinlock_t lock;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1a1434e29740..a2d2ffa76173 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1120,6 +1120,23 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
> return false;
> }
>
> +void hugetlb_free_range(struct mmu_gather *tlb, const struct hugetlb_pte *hpte,
> + unsigned long start, unsigned long end)
> +{
> + unsigned long floor = start & hugetlb_pte_mask(hpte);
> + unsigned long ceiling = floor + hugetlb_pte_size(hpte);
> +
> + if (hugetlb_pte_size(hpte) >= PGDIR_SIZE) {

Sorry, again I did not understand why this is a >= check and not just ==. Does
it help on non-x86 arches?

> + free_p4d_range(tlb, (pgd_t *)hpte->ptep, start, end, floor, ceiling);
> + } else if (hugetlb_pte_size(hpte) >= P4D_SIZE) {
> + free_pud_range(tlb, (p4d_t *)hpte->ptep, start, end, floor, ceiling);
> + } else if (hugetlb_pte_size(hpte) >= PUD_SIZE) {
> + free_pmd_range(tlb, (pud_t *)hpte->ptep, start, end, floor, ceiling);
> + } else if (hugetlb_pte_size(hpte) >= PMD_SIZE) {
> + free_pte_range(tlb, (pmd_t *)hpte->ptep, start);
> + }
> +}
> +
> bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte)
> {
> pgd_t pgd;

2022-06-27 13:40:17

by manish.mishra

[permalink] [raw]
Subject: Re: [RFC PATCH 10/26] hugetlb: add for_each_hgm_shift


On 24/06/22 11:06 pm, James Houghton wrote:
> This is a helper macro to loop through all the usable page sizes for a
> high-granularity-enabled HugeTLB VMA. Given the VMA's hstate, it will
> loop, in descending order, through the page sizes that HugeTLB supports
> for this architecture; it always includes PAGE_SIZE.
reviewed-by:[email protected]
> Signed-off-by: James Houghton <[email protected]>
> ---
> mm/hugetlb.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8b10b941458d..557b0afdb503 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6989,6 +6989,16 @@ bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> /* All shared VMAs have HGM enabled. */
> return vma->vm_flags & VM_SHARED;
> }
> +static unsigned int __shift_for_hstate(struct hstate *h)
> +{
> + if (h >= &hstates[hugetlb_max_hstate])
> + return PAGE_SHIFT;
> + return huge_page_shift(h);
> +}
> +#define for_each_hgm_shift(hstate, tmp_h, shift) \
> + for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
> + (tmp_h) <= &hstates[hugetlb_max_hstate]; \
> + (tmp_h)++)
> #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
>
> /*

2022-06-27 13:42:35

by manish.mishra

[permalink] [raw]
Subject: Re: [RFC PATCH 11/26] hugetlb: add hugetlb_walk_to to do PT walks


On 24/06/22 11:06 pm, James Houghton wrote:
> This adds it for architectures that use GENERAL_HUGETLB, including x86.
>
> Signed-off-by: James Houghton <[email protected]>
> ---
> include/linux/hugetlb.h | 2 ++
> mm/hugetlb.c | 45 +++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 47 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index e7a6b944d0cc..605aa19d8572 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -258,6 +258,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long addr, unsigned long sz);
> pte_t *huge_pte_offset(struct mm_struct *mm,
> unsigned long addr, unsigned long sz);
> +int hugetlb_walk_to(struct mm_struct *mm, struct hugetlb_pte *hpte,
> + unsigned long addr, unsigned long sz, bool stop_at_none);
> int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long *addr, pte_t *ptep);
> void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 557b0afdb503..3ec2a921ee6f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6981,6 +6981,51 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
> return (pte_t *)pmd;
> }


No strong feelings, but this name looks confusing to me: it does not only walk
the page tables, it can also allocate.

> +int hugetlb_walk_to(struct mm_struct *mm, struct hugetlb_pte *hpte,
> + unsigned long addr, unsigned long sz, bool stop_at_none)
> +{
> + pte_t *ptep;
> +
> + if (!hpte->ptep) {
> + pgd_t *pgd = pgd_offset(mm, addr);
> +
> + if (!pgd)
> + return -ENOMEM;
> + ptep = (pte_t *)p4d_alloc(mm, pgd, addr);
> + if (!ptep)
> + return -ENOMEM;
> + hugetlb_pte_populate(hpte, ptep, P4D_SHIFT);
> + }
> +
> + while (hugetlb_pte_size(hpte) > sz &&
> + !hugetlb_pte_present_leaf(hpte) &&
> + !(stop_at_none && hugetlb_pte_none(hpte))) {

Should the ordering of these if-else conditions be reversed? That would read
more naturally and possibly do fewer condition checks as we go from top to bottom.

> + if (hpte->shift == PMD_SHIFT) {
> + ptep = pte_alloc_map(mm, (pmd_t *)hpte->ptep, addr);
> + if (!ptep)
> + return -ENOMEM;
> + hpte->shift = PAGE_SHIFT;
> + hpte->ptep = ptep;
> + } else if (hpte->shift == PUD_SHIFT) {
> + ptep = (pte_t *)pmd_alloc(mm, (pud_t *)hpte->ptep,
> + addr);
> + if (!ptep)
> + return -ENOMEM;
> + hpte->shift = PMD_SHIFT;
> + hpte->ptep = ptep;
> + } else if (hpte->shift == P4D_SHIFT) {
> + ptep = (pte_t *)pud_alloc(mm, (p4d_t *)hpte->ptep,
> + addr);
> + if (!ptep)
> + return -ENOMEM;
> + hpte->shift = PUD_SHIFT;
> + hpte->ptep = ptep;
> + } else
> + BUG();
> + }
> + return 0;
> +}
> +
> #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
>
> #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING

2022-06-27 14:14:36

by manish.mishra

[permalink] [raw]
Subject: Re: [RFC PATCH 12/26] hugetlb: add HugeTLB splitting functionality


On 24/06/22 11:06 pm, James Houghton wrote:
> The new function, hugetlb_split_to_shift, will optimally split the page
> table to map a particular address at a particular granularity.
>
> This is useful for punching a hole in the mapping and for mapping small
> sections of a HugeTLB page (via UFFDIO_CONTINUE, for example).
>
> Signed-off-by: James Houghton <[email protected]>
> ---
> mm/hugetlb.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 122 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3ec2a921ee6f..eaffe7b4f67c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -102,6 +102,18 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
> /* Forward declaration */
> static int hugetlb_acct_memory(struct hstate *h, long delta);
>
> +/*
> + * Find the subpage that corresponds to `addr` in `hpage`.
> + */
> +static struct page *hugetlb_find_subpage(struct hstate *h, struct page *hpage,
> + unsigned long addr)
> +{
> + size_t idx = (addr & ~huge_page_mask(h))/PAGE_SIZE;
> +
> + BUG_ON(idx >= pages_per_huge_page(h));
> + return &hpage[idx];
> +}
> +
> static inline bool subpool_is_free(struct hugepage_subpool *spool)
> {
> if (spool->count)
> @@ -7044,6 +7056,116 @@ static unsigned int __shift_for_hstate(struct hstate *h)
> for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
> (tmp_h) <= &hstates[hugetlb_max_hstate]; \
> (tmp_h)++)
> +
> +/*
> + * Given a particular address, split the HugeTLB PTE that currently maps it
> + * so that, for the given address, the PTE that maps it is `desired_shift`.
> + * This function will always split the HugeTLB PTE optimally.
> + *
> + * For example, given a 1G HugeTLB page mapped from VA 0 to 1G, calling this
> + * function with addr=0 and desired_shift=PAGE_SHIFT will result in
> + * these changes to the page table:
> + * 1. The PUD will be split into 2M PMDs.
> + * 2. The first PMD will be split again into 4K PTEs.
> + */
> +static int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
> + const struct hugetlb_pte *hpte,
> + unsigned long addr, unsigned long desired_shift)
> +{
> + unsigned long start, end, curr;
> + unsigned long desired_sz = 1UL << desired_shift;
> + struct hstate *h = hstate_vma(vma);
> + int ret;
> + struct hugetlb_pte new_hpte;
> + struct mmu_notifier_range range;
> + struct page *hpage = NULL;
> + struct page *subpage;
> + pte_t old_entry;
> + struct mmu_gather tlb;
> +
> + BUG_ON(!hpte->ptep);
> + BUG_ON(hugetlb_pte_size(hpte) == desired_sz);
Can this be BUG_ON(hugetlb_pte_size(hpte) <= desired_sz)? Splitting only makes
sense when the existing PTE is strictly larger than the desired size.
> +
> + start = addr & hugetlb_pte_mask(hpte);
> + end = start + hugetlb_pte_size(hpte);
> +
> + i_mmap_assert_write_locked(vma->vm_file->f_mapping);

Since this is just changing mappings, is holding f_mapping required? I mean,
in the future is there any plan or way to use some per-process sub-lock instead?

> +
> + BUG_ON(!hpte->ptep);
> + /* This function only works if we are looking at a leaf-level PTE. */
> + BUG_ON(!hugetlb_pte_none(hpte) && !hugetlb_pte_present_leaf(hpte));
> +
> + /*
> + * Clear the PTE so that we will allocate the PT structures when
> + * walking the page table.
> + */
> + old_entry = huge_ptep_get_and_clear(mm, start, hpte->ptep);
> +
> + if (!huge_pte_none(old_entry))
> + hpage = pte_page(old_entry);
> +
> + BUG_ON(!IS_ALIGNED(start, desired_sz));
> + BUG_ON(!IS_ALIGNED(end, desired_sz));
> +
> + for (curr = start; curr < end;) {
> + struct hstate *tmp_h;
> + unsigned int shift;
> +
> + for_each_hgm_shift(h, tmp_h, shift) {
> + unsigned long sz = 1UL << shift;
> +
> + if (!IS_ALIGNED(curr, sz) || curr + sz > end)
> + continue;
> + /*
> + * If we are including `addr`, we need to make sure
> + * splitting down to the correct size. Go to a smaller
> + * size if we are not.
> + */
> + if (curr <= addr && curr + sz > addr &&
> + shift > desired_shift)
> + continue;
> +
> + /*
> + * Continue the page table walk to the level we want,
> + * allocate PT structures as we go.
> + */

As I understand it, this for_each_hgm_shift loop is just there to find the
right shift; the code below this line could then be moved out of the loop.
No strong feeling, but it looks more proper and may make the code easier to
understand.
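
Roughly like this (untested; it only illustrates the shape, reusing the locals
from the quoted function):

	for (curr = start; curr < end;) {
		struct hstate *tmp_h;
		unsigned int shift;
		unsigned long sz = 0;
		bool found = false;

		/* First, only pick the largest HGM size that fits at `curr`. */
		for_each_hgm_shift(h, tmp_h, shift) {
			sz = 1UL << shift;
			if (!IS_ALIGNED(curr, sz) || curr + sz > end)
				continue;
			if (curr <= addr && curr + sz > addr &&
			    shift > desired_shift)
				continue;
			found = true;
			break;
		}
		/* We couldn't find a size that worked. */
		BUG_ON(!found);

		/* Then walk/allocate and install the PTE once, outside the search. */
		hugetlb_pte_copy(&new_hpte, hpte);
		ret = hugetlb_walk_to(mm, &new_hpte, curr, sz,
				      /*stop_at_none=*/false);
		if (ret)
			goto err;
		if (hpage) {
			subpage = hugetlb_find_subpage(h, hpage, curr);
			set_huge_pte_at(mm, curr, new_hpte.ptep,
					make_huge_pte_with_shift(vma, subpage,
						huge_pte_write(old_entry), shift));
		}
		curr += sz;
	}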

> + hugetlb_pte_copy(&new_hpte, hpte);
> + ret = hugetlb_walk_to(mm, &new_hpte, curr, sz,
> + /*stop_at_none=*/false);
> + if (ret)
> + goto err;
> + BUG_ON(hugetlb_pte_size(&new_hpte) != sz);
> + if (hpage) {
> + pte_t new_entry;
> +
> + subpage = hugetlb_find_subpage(h, hpage, curr);
> + new_entry = make_huge_pte_with_shift(vma, subpage,
> + huge_pte_write(old_entry),
> + shift);
> + set_huge_pte_at(mm, curr, new_hpte.ptep, new_entry);
> + }
> + curr += sz;
> + goto next;
> + }
> + /* We couldn't find a size that worked. */
> + BUG();
> +next:
> + continue;
> + }
> +
> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> + start, end);
> + mmu_notifier_invalidate_range_start(&range);

Sorry, I did not understand where the TLB flush is taken care of in the
success case. I see that set_huge_pte_at does not do it internally by itself.

> + return 0;
> +err:
> + tlb_gather_mmu(&tlb, mm);
> + /* Free any newly allocated page table entries. */
> + hugetlb_free_range(&tlb, hpte, start, curr);
> + /* Restore the old entry. */
> + set_huge_pte_at(mm, start, hpte->ptep, old_entry);
> + tlb_finish_mmu(&tlb);
> + return ret;
> +}
> #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
>
> /*

2022-06-27 16:52:50

by James Houghton

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping

On Fri, Jun 24, 2022 at 11:29 AM Matthew Wilcox <[email protected]> wrote:
>
> On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
> > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > name. "High-granularity mapping" is not a great name either. I am open
> > to better names.
>
> Oh good, I was grinding my teeth every time I read it ;-)
>
> How does "Fine granularity" work for you?
> "sub-page mapping" might work too.

"Granularity", as I've come to realize, is hard to say, so I think I
prefer sub-page mapping. :) So to recap the suggestions I have so far:

1. Sub-page mapping
2. Granular mapping
3. Flexible mapping

I'll pick one of these (or maybe some other one that works better) for
the next version of this series.

2022-06-27 16:53:30

by James Houghton

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping

On Fri, Jun 24, 2022 at 11:47 AM Matthew Wilcox <[email protected]> wrote:
>
> On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
> > - Page table walking and manipulation
> > A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> > high-granularity mappings. Eventually, it's possible to merge
> > hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> >
> > We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> > This is because we generally need to know the "size" of a PTE (previously
> > always just huge_page_size(hstate)).
> >
> > For every page table manipulation function that has a huge version (e.g.
> > huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> > hugetlb_ptep_get). The correct version is used depending on if a HugeTLB
> > PTE really is "huge".
>
> I'm disappointed to hear that page table walking is going to become even
> more special. I'd much prefer it if hugetlb walking were exactly the
> same as THP walking. This seems like a good time to do at least some
> of that work.
>
> Was there a reason you chose the "more complexity" direction?

I chose this direction because it seemed to be the most
straightforward to get to a working prototype and then to an RFC. I
agree with your sentiment -- I'll see what I can do to reconcile THP
walking with HugeTLB(+HGM) walking.

2022-06-27 17:08:03

by James Houghton

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping

On Fri, Jun 24, 2022 at 11:41 AM Mina Almasry <[email protected]> wrote:
>
> On Fri, Jun 24, 2022 at 10:37 AM James Houghton <[email protected]> wrote:
> >
> > [trimmed...]
> > ---- Userspace API ----
> >
> > This patch series introduces a single way to take advantage of
> > high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> > userspace to resolve MINOR page faults on shared VMAs.
> >
> > To collapse a HugeTLB address range that has been mapped with several
> > UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> > userspace to know when all pages (that they care about) have been fetched.
> >
>
> Thanks James! Cover letter looks good. A few questions:
>
> Why not have the kernel collapse the hugepage once all the 4K pages
> have been fetched automatically? It would remove the need for a new
> userspace API, and AFACT there aren't really any cases where it is
> beneficial to have a hugepage sharded into 4K mappings when those
> mappings can be collapsed.

The reason that we don't automatically collapse mappings is because it
would take additional complexity, and it is less flexible. Consider
the case of 1G pages on x86: currently, userspace can collapse the
whole page when it's all ready, but they can also choose to collapse a
2M piece of it. On architectures with more supported hugepage sizes
(e.g., arm64), userspace has even more possibilities for when to
collapse. This likely further complicates a potential
automatic-collapse solution. Userspace may also want to collapse the
mapping for an entire hugepage without completely mapping the hugepage
first (this would also be possible by issuing UFFDIO_CONTINUE on all
the holes, though).
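
To make the userspace flow concrete, here is a minimal sketch (guest_addr,
uffd and fetch_piece() are assumed to exist; MADV_COLLAPSE is the definition
from the THP series, applied to HugeTLB as proposed here):

	unsigned long off;

	/* Resolve each 4K piece as its contents arrive over the network. */
	for (off = 0; off < (1UL << 30); off += 4096) {
		struct uffdio_continue cont = {
			.range = { .start = guest_addr + off, .len = 4096 },
		};

		fetch_piece(off);		/* hypothetical demand-fetch */
		if (ioctl(uffd, UFFDIO_CONTINUE, &cont))
			err(1, "UFFDIO_CONTINUE");
	}

	/* Everything is mapped at 4K; collapse back to a single 1G mapping. */
	if (madvise((void *)guest_addr, 1UL << 30, MADV_COLLAPSE))
		err(1, "MADV_COLLAPSE");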

>
> > ---- HugeTLB Changes ----
> >
> > - Mapcount
> > The way mapcount is handled is different from the way that it was handled
> > before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> > be increased. This scheme means that, for hugepages that aren't mapped at
> > high granularity, their mapcounts will remain the same as what they would
> > have been pre-HGM.
> >
>
> Sorry, I didn't quite follow this. It says mapcount is handled
> differently, but the same if the page is not mapped at high
> granularity. Can you elaborate on how the mapcount handling will be
> different when the page is mapped at high granularity?

I guess I didn't phrase this very well. For the sake of simplicity,
consider 1G pages on x86, typically mapped with leaf-level PUDs.
Previously, there were two possibilities for how a hugepage was
mapped, either it was (1) completely mapped (PUD is present and a
leaf), or (2) it wasn't mapped (PUD is none). Now we have a third
case, where the PUD is not none but also not a leaf (this usually
means that the page is partially mapped). We handle this case as if
the whole page was mapped. That is, if we partially map a hugepage
that was previously unmapped (making the PUD point to PMDs), we
increment its mapcount, and if we completely unmap a partially mapped
hugepage (making the PUD none), we decrement its mapcount. If we
collapse a non-leaf PUD to a leaf PUD, we don't change mapcount.

It is possible for a PUD to be present and not a leaf (mapcount has
been incremented) but for the page to still be unmapped: if the PMDs
(or PTEs) underneath are all none. This case is atypical, and as of
this RFC (without bestowing MADV_DONTNEED with HGM flexibility), I
think it would be very difficult to get this to happen.
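
Summarised as a sketch (the helpers and the hpage/pud locals here are
stand-ins for illustration, not code from the series):

	/*
	 * Rule of thumb for the scheme described above:
	 *   - PUD transitions none -> non-none (leaf or not): mapcount++
	 *   - PUD transitions non-none -> none: mapcount--
	 *   - leaf <-> non-leaf transitions (split/collapse): no change
	 */
	if (pud_none(old_pud) && !pud_none(new_pud))
		hugepage_mapcount_inc(hpage);	/* stand-in helper */
	else if (!pud_none(old_pud) && pud_none(new_pud))
		hugepage_mapcount_dec(hpage);	/* stand-in helper */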

>
> > - Page table walking and manipulation
> > A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> > high-granularity mappings. Eventually, it's possible to merge
> > hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> >
> > We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> > This is because we generally need to know the "size" of a PTE (previously
> > always just huge_page_size(hstate)).
> >
> > For every page table manipulation function that has a huge version (e.g.
> > huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> > hugetlb_ptep_get). The correct version is used depending on if a HugeTLB
> > PTE really is "huge".
> >
> > - Synchronization
> > For existing bits of HugeTLB, synchronization is unchanged. For splitting
> > and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
> > writing, and for doing high-granularity page table walks, we require it to
> > be held for reading.
> >
> > ---- Limitations & Future Changes ----
> >
> > This patch series only implements high-granularity mapping for VM_SHARED
> > VMAs. I intend to implement enough HGM to support 4K unmapping for memory
> > failure recovery for both shared and private mappings.
> >
> > The memory failure use case poses its own challenges that can be
> > addressed, but I will do so in a separate RFC.
> >
> > Performance has not been heavily scrutinized with this patch series. There
> > are places where lock contention can significantly reduce performance. This
> > will be addressed later.
> >
> > The patch series, as it stands right now, is compatible with the VMEMMAP
> > page struct optimization[3], as we do not need to modify data contained
> > in the subpage page structs.
> >
> > Other omissions:
> > - Compatibility with userfaultfd write-protect (will be included in v1).
> > - Support for mremap() (will be included in v1). This looks a lot like
> > the support we have for fork().
> > - Documentation changes (will be included in v1).
> > - Completely ignores PMD sharing and hugepage migration (will be included
> > in v1).
> > - Implementations for architectures that don't use GENERAL_HUGETLB other
> > than arm64.
> >
> > ---- Patch Breakdown ----
> >
> > Patch 1 - Preliminary changes
> > Patch 2-10 - HugeTLB HGM core changes
> > Patch 11-13 - HugeTLB HGM page table walking functionality
> > Patch 14-19 - HugeTLB HGM compatibility with other bits
> > Patch 20-23 - Userfaultfd and collapse changes
> > Patch 24-26 - arm64 support and selftests
> >
> > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > name. "High-granularity mapping" is not a great name either. I am open
> > to better names.
>
> I would drop 1 extra word and do "granular mapping", as in the mapping
> is more granular than what it normally is (2MB/1G, etc).

Noted. :)

2022-06-27 19:01:44

by Mike Kravetz

[permalink] [raw]
Subject: Re: [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates

On 06/24/22 17:36, James Houghton wrote:
> When using HugeTLB high-granularity mapping, we need to go through the
> supported hugepage sizes in decreasing order so that we pick the largest
> size that works. Consider the case where we're faulting in a 1G hugepage
> for the first time: we want hugetlb_fault/hugetlb_no_page to map it with
> a PUD. By going through the sizes in decreasing order, we will find that
> PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too.
>
> Signed-off-by: James Houghton <[email protected]>
> ---
> mm/hugetlb.c | 40 +++++++++++++++++++++++++++++++++++++---
> 1 file changed, 37 insertions(+), 3 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a57e1be41401..5df838d86f32 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -33,6 +33,7 @@
> #include <linux/migrate.h>
> #include <linux/nospec.h>
> #include <linux/delayacct.h>
> +#include <linux/sort.h>
>
> #include <asm/page.h>
> #include <asm/pgalloc.h>
> @@ -48,6 +49,10 @@
>
> int hugetlb_max_hstate __read_mostly;
> unsigned int default_hstate_idx;
> +/*
> + * After hugetlb_init_hstates is called, hstates will be sorted from largest
> + * to smallest.
> + */
> struct hstate hstates[HUGE_MAX_HSTATE];
>
> #ifdef CONFIG_CMA
> @@ -3144,14 +3149,43 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> kfree(node_alloc_noretry);
> }
>
> +static int compare_hstates_decreasing(const void *a, const void *b)
> +{
> + const int shift_a = huge_page_shift((const struct hstate *)a);
> + const int shift_b = huge_page_shift((const struct hstate *)b);
> +
> + if (shift_a < shift_b)
> + return 1;
> + if (shift_a > shift_b)
> + return -1;
> + return 0;
> +}
> +
> +static void sort_hstates(void)
> +{
> + unsigned long default_hstate_sz = huge_page_size(&default_hstate);
> +
> + /* Sort from largest to smallest. */
> + sort(hstates, hugetlb_max_hstate, sizeof(*hstates),
> + compare_hstates_decreasing, NULL);
> +
> + /*
> + * We may have changed the location of the default hstate, so we need to
> + * update it.
> + */
> + default_hstate_idx = hstate_index(size_to_hstate(default_hstate_sz));
> +}
> +
> static void __init hugetlb_init_hstates(void)
> {
> struct hstate *h, *h2;
>
> - for_each_hstate(h) {
> - if (minimum_order > huge_page_order(h))
> - minimum_order = huge_page_order(h);
> + sort_hstates();
>
> + /* The last hstate is now the smallest. */
> + minimum_order = huge_page_order(&hstates[hugetlb_max_hstate - 1]);
> +
> + for_each_hstate(h) {
> /* oversize hugepages were init'ed in early boot */
> if (!hstate_is_gigantic(h))
> hugetlb_hstate_alloc_pages(h);

This may/will cause problems for gigantic hugetlb pages allocated at boot
time. See alloc_bootmem_huge_page() where a pointer to the associated hstate
is encoded within the allocated hugetlb page. These pages are added to
hugetlb pools by the routine gather_bootmem_prealloc() which uses the saved
hstate to prep the gigantic page and add it to the correct pool. Currently,
gather_bootmem_prealloc is called after hugetlb_init_hstates. So, changing
hstate order will cause errors.

I do not see any reason why we could not call gather_bootmem_prealloc before
hugetlb_init_hstates to avoid this issue.
--
Mike Kravetz

2022-06-27 19:03:38

by Dr. David Alan Gilbert

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping

* James Houghton ([email protected]) wrote:
> On Fri, Jun 24, 2022 at 11:29 AM Matthew Wilcox <[email protected]> wrote:
> >
> > On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
> > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > > name. "High-granularity mapping" is not a great name either. I am open
> > > to better names.
> >
> > Oh good, I was grinding my teeth every time I read it ;-)
> >
> > How does "Fine granularity" work for you?
> > "sub-page mapping" might work too.
>
> "Granularity", as I've come to realize, is hard to say, so I think I
> prefer sub-page mapping. :) So to recap the suggestions I have so far:
>
> 1. Sub-page mapping
> 2. Granular mapping
> 3. Flexible mapping
>
> I'll pick one of these (or maybe some other one that works better) for
> the next version of this series.

<shrug> Just a name; SPM might work (although may confuse those
architectures which had subprotection for normal pages), and at least
we can mispronounce it.

In 14/26 your commit message says:

1. Faults can be passed to handle_userfault. (Userspace will want to
use UFFD_FEATURE_REAL_ADDRESS to get the real address to know which
region they should be call UFFDIO_CONTINUE on later.)

can you explain what that new UFFD_FEATURE does?

Dave

--
Dr. David Alan Gilbert / [email protected] / Manchester, UK

2022-06-27 21:10:22

by James Houghton

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping

On Mon, Jun 27, 2022 at 10:56 AM Dr. David Alan Gilbert
<[email protected]> wrote:
>
> * James Houghton ([email protected]) wrote:
> > On Fri, Jun 24, 2022 at 11:29 AM Matthew Wilcox <[email protected]> wrote:
> > >
> > > On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
> > > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > > > name. "High-granularity mapping" is not a great name either. I am open
> > > > to better names.
> > >
> > > Oh good, I was grinding my teeth every time I read it ;-)
> > >
> > > How does "Fine granularity" work for you?
> > > "sub-page mapping" might work too.
> >
> > "Granularity", as I've come to realize, is hard to say, so I think I
> > prefer sub-page mapping. :) So to recap the suggestions I have so far:
> >
> > 1. Sub-page mapping
> > 2. Granular mapping
> > 3. Flexible mapping
> >
> > I'll pick one of these (or maybe some other one that works better) for
> > the next version of this series.
>
> <shrug> Just a name; SPM might work (although may confuse those
> architectures which had subprotection for normal pages), and at least
> we can mispronounce it.
>
> In 14/26 your commit message says:
>
> 1. Faults can be passed to handle_userfault. (Userspace will want to
> use UFFD_FEATURE_REAL_ADDRESS to get the real address to know which
> region they should be call UFFDIO_CONTINUE on later.)
>
> can you explain what that new UFFD_FEATURE does?

+cc Nadav Amit <[email protected]> to check me here.

Sorry, this should be UFFD_FEATURE_EXACT_ADDRESS. It isn't a new
feature, and it actually isn't needed (I will correct the commit
message). Why it isn't needed is a little bit complicated, though. Let
me explain:

Before UFFD_FEATURE_EXACT_ADDRESS was introduced, the address that
userfaultfd gave userspace for HugeTLB pages was rounded down to be
hstate-size-aligned. This would have had to change, because userspace,
to take advantage of HGM, needs to know which 4K piece to install.

However, after UFFD_FEATURE_EXACT_ADDRESS was introduced[1], the
address was rounded down to be PAGE_SIZE-aligned instead, even if the
flag wasn't used. I think this was an unintended change. If the flag
is used, then the address isn't rounded at all -- that was the
intended purpose of this flag. Hope that makes sense.

The new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS_HGM, informs
userspace that high-granularity CONTINUEs are available.

[1] commit 824ddc601adc ("userfaultfd: provide unmasked address on page-fault")
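
For completeness, a minimal sketch of the userspace handshake
(UFFD_FEATURE_EXACT_ADDRESS already exists upstream;
UFFD_FEATURE_MINOR_HUGETLBFS_HGM is only introduced by this series):

	struct uffdio_api api = {
		.api = UFFD_API,
		/* Ask for unrounded fault addresses plus HGM minor faults. */
		.features = UFFD_FEATURE_EXACT_ADDRESS |
			    UFFD_FEATURE_MINOR_HUGETLBFS_HGM,
	};

	if (ioctl(uffd, UFFDIO_API, &api))
		err(1, "UFFDIO_API");
	/* uffd_msg.arg.pagefault.address is now the exact faulting address. */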


>
> Dave
>
> --
> Dr. David Alan Gilbert / [email protected] / Manchester, UK
>

2022-06-28 00:33:41

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping



> On Jun 27, 2022, at 1:31 PM, James Houghton <[email protected]> wrote:
>
> On Mon, Jun 27, 2022 at 10:56 AM Dr. David Alan Gilbert
> <[email protected]> wrote:
>>
>> * James Houghton ([email protected]) wrote:
>>> On Fri, Jun 24, 2022 at 11:29 AM Matthew Wilcox <[email protected]> wrote:
>>>>
>>>> On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
>>>>> [1] This used to be called HugeTLB double mapping, a bad and confusing
>>>>> name. "High-granularity mapping" is not a great name either. I am open
>>>>> to better names.
>>>>
>>>> Oh good, I was grinding my teeth every time I read it ;-)
>>>>
>>>> How does "Fine granularity" work for you?
>>>> "sub-page mapping" might work too.
>>>
>>> "Granularity", as I've come to realize, is hard to say, so I think I
>>> prefer sub-page mapping. :) So to recap the suggestions I have so far:
>>>
>>> 1. Sub-page mapping
>>> 2. Granular mapping
>>> 3. Flexible mapping
>>>
>>> I'll pick one of these (or maybe some other one that works better) for
>>> the next version of this series.
>>
>> <shrug> Just a name; SPM might work (although may confuse those
>> architectures which had subprotection for normal pages), and at least
>> we can mispronounce it.
>>
>> In 14/26 your commit message says:
>>
>> 1. Faults can be passed to handle_userfault. (Userspace will want to
>> use UFFD_FEATURE_REAL_ADDRESS to get the real address to know which
>> region they should be call UFFDIO_CONTINUE on later.)
>>
>> can you explain what that new UFFD_FEATURE does?
>
> +cc Nadav Amit <[email protected]> to check me here.
>
> Sorry, this should be UFFD_FEATURE_EXACT_ADDRESS. It isn't a new
> feature, and it actually isn't needed (I will correct the commit
> message). Why it isn't needed is a little bit complicated, though. Let
> me explain:
>
> Before UFFD_FEATURE_EXACT_ADDRESS was introduced, the address that
> userfaultfd gave userspace for HugeTLB pages was rounded down to be
> hstate-size-aligned. This would have had to change, because userspace,
> to take advantage of HGM, needs to know which 4K piece to install.
>
> However, after UFFD_FEATURE_EXACT_ADDRESS was introduced[1], the
> address was rounded down to be PAGE_SIZE-aligned instead, even if the
> flag wasn't used. I think this was an unintended change. If the flag
> is used, then the address isn't rounded at all -- that was the
> intended purpose of this flag. Hope that makes sense.
>
> The new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS_HGM, informs
> userspace that high-granularity CONTINUEs are available.
>
> [1] commit 824ddc601adc ("userfaultfd: provide unmasked address on page-fault")

Indeed, this change of behavior (not aligning to huge pages when the flag is
not set) was unintentional. If you want to fix it in a separate patch so
it can be backported, that may be a good idea.

For the record, there was a short period of time in 2016 when the exact
fault address was delivered even when UFFD_FEATURE_EXACT_ADDRESS was not
provided. We had some arguments whether this was a regression...

BTW: I should have thought of the use-case of knowing the exact address
in huge pages. It would have shortened my discussions with Andrea on whether
this feature (UFFD_FEATURE_EXACT_ADDRESS) is needed. :)

2022-06-28 08:47:50

by Dr. David Alan Gilbert

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping

* James Houghton ([email protected]) wrote:
> On Mon, Jun 27, 2022 at 10:56 AM Dr. David Alan Gilbert
> <[email protected]> wrote:
> >
> > * James Houghton ([email protected]) wrote:
> > > On Fri, Jun 24, 2022 at 11:29 AM Matthew Wilcox <[email protected]> wrote:
> > > >
> > > > On Fri, Jun 24, 2022 at 05:36:30PM +0000, James Houghton wrote:
> > > > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > > > > name. "High-granularity mapping" is not a great name either. I am open
> > > > > to better names.
> > > >
> > > > Oh good, I was grinding my teeth every time I read it ;-)
> > > >
> > > > How does "Fine granularity" work for you?
> > > > "sub-page mapping" might work too.
> > >
> > > "Granularity", as I've come to realize, is hard to say, so I think I
> > > prefer sub-page mapping. :) So to recap the suggestions I have so far:
> > >
> > > 1. Sub-page mapping
> > > 2. Granular mapping
> > > 3. Flexible mapping
> > >
> > > I'll pick one of these (or maybe some other one that works better) for
> > > the next version of this series.
> >
> > <shrug> Just a name; SPM might work (although may confuse those
> > architectures which had subprotection for normal pages), and at least
> > we can mispronounce it.
> >
> > In 14/26 your commit message says:
> >
> > 1. Faults can be passed to handle_userfault. (Userspace will want to
> > use UFFD_FEATURE_REAL_ADDRESS to get the real address to know which
> > region they should be call UFFDIO_CONTINUE on later.)
> >
> > can you explain what that new UFFD_FEATURE does?
>
> +cc Nadav Amit <[email protected]> to check me here.
>
> Sorry, this should be UFFD_FEATURE_EXACT_ADDRESS. It isn't a new
> feature, and it actually isn't needed (I will correct the commit
> message). Why it isn't needed is a little bit complicated, though. Let
> me explain:
>
> Before UFFD_FEATURE_EXACT_ADDRESS was introduced, the address that
> userfaultfd gave userspace for HugeTLB pages was rounded down to be
> hstate-size-aligned. This would have had to change, because userspace,
> to take advantage of HGM, needs to know which 4K piece to install.
>
> However, after UFFD_FEATURE_EXACT_ADDRESS was introduced[1], the
> address was rounded down to be PAGE_SIZE-aligned instead, even if the
> flag wasn't used. I think this was an unintended change. If the flag
> is used, then the address isn't rounded at all -- that was the
> intended purpose of this flag. Hope that makes sense.

Oh that's 'fun'; right but the need for the less-rounded address makes
sense.

One other thing I thought of; you provide the modified 'CONTINUE'
behaviour, which works for postcopy as long as you use two mappings in
userspace; one protected by userfault, and one which you do the writes
to, and then issue the CONTINUE into the protected mapping; that's fine,
but it's not currently how we have our postcopy code wired up in qemu,
we have one mapping and use UFFDIO_COPY to place the page.
Requiring the two mappings is fine, but it's probably worth pointing out
the need for it somewhere.
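
For reference, a rough sketch of that two-mapping arrangement (path, lengths
and error handling are purely illustrative; uffd, len, off and incoming_page
are assumed to exist):

	int fd = open("/dev/hugepages/guest-mem", O_RDWR);
	char *guest = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	char *stage = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)guest, .len = len },
		.mode = UFFDIO_REGISTER_MODE_MINOR,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* On a minor fault at guest + off: write through the staging mapping... */
	memcpy(stage + off, incoming_page, 4096);
	/* ...then expose just that 4K piece in the registered mapping. */
	struct uffdio_continue cont = {
		.range = { .start = (unsigned long)guest + off, .len = 4096 },
	};
	ioctl(uffd, UFFDIO_CONTINUE, &cont);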

Dave

> The new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS_HGM, informs
> userspace that high-granularity CONTINUEs are available.
>
> [1] commit 824ddc601adc ("userfaultfd: provide unmasked address on page-fault")
>
>
> >
> > Dave
> >
> > --
> > Dr. David Alan Gilbert / [email protected] / Manchester, UK
> >
>
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK

2022-06-28 14:32:17

by Muchun Song

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping

On Mon, Jun 27, 2022 at 09:27:38AM -0700, James Houghton wrote:
> On Fri, Jun 24, 2022 at 11:41 AM Mina Almasry <[email protected]> wrote:
> >
> > On Fri, Jun 24, 2022 at 10:37 AM James Houghton <[email protected]> wrote:
> > >
> > > [trimmed...]
> > > ---- Userspace API ----
> > >
> > > This patch series introduces a single way to take advantage of
> > > high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> > > userspace to resolve MINOR page faults on shared VMAs.
> > >
> > > To collapse a HugeTLB address range that has been mapped with several
> > > UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> > > userspace to know when all pages (that they care about) have been fetched.
> > >
> >
> > Thanks James! Cover letter looks good. A few questions:
> >
> > Why not have the kernel collapse the hugepage once all the 4K pages
> > have been fetched automatically? It would remove the need for a new
> > userspace API, and AFACT there aren't really any cases where it is
> > beneficial to have a hugepage sharded into 4K mappings when those
> > mappings can be collapsed.
>
> The reason that we don't automatically collapse mappings is because it
> would take additional complexity, and it is less flexible. Consider
> the case of 1G pages on x86: currently, userspace can collapse the
> whole page when it's all ready, but they can also choose to collapse a
> 2M piece of it. On architectures with more supported hugepage sizes
> (e.g., arm64), userspace has even more possibilities for when to
> collapse. This likely further complicates a potential
> automatic-collapse solution. Userspace may also want to collapse the
> mapping for an entire hugepage without completely mapping the hugepage
> first (this would also be possible by issuing UFFDIO_CONTINUE on all
> the holes, though).
>
> >
> > > ---- HugeTLB Changes ----
> > >
> > > - Mapcount
> > > The way mapcount is handled is different from the way that it was handled
> > > before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> > > be increased. This scheme means that, for hugepages that aren't mapped at
> > > high granularity, their mapcounts will remain the same as what they would
> > > have been pre-HGM.
> > >
> >
> > Sorry, I didn't quite follow this. It says mapcount is handled

+1

> > differently, but the same if the page is not mapped at high
> > granularity. Can you elaborate on how the mapcount handling will be
> > different when the page is mapped at high granularity?
>
> I guess I didn't phrase this very well. For the sake of simplicity,
> consider 1G pages on x86, typically mapped with leaf-level PUDs.
> Previously, there were two possibilities for how a hugepage was
> mapped, either it was (1) completely mapped (PUD is present and a
> leaf), or (2) it wasn't mapped (PUD is none). Now we have a third
> case, where the PUD is not none but also not a leaf (this usually
> means that the page is partially mapped). We handle this case as if
> the whole page was mapped. That is, if we partially map a hugepage
> that was previously unmapped (making the PUD point to PMDs), we
> increment its mapcount, and if we completely unmap a partially mapped
> hugepage (making the PUD none), we decrement its mapcount. If we
> collapse a non-leaf PUD to a leaf PUD, we don't change mapcount.
>
> It is possible for a PUD to be present and not a leaf (mapcount has
> been incremented) but for the page to still be unmapped: if the PMDs
> (or PTEs) underneath are all none. This case is atypical, and as of
> this RFC (without bestowing MADV_DONTNEED with HGM flexibility), I
> think it would be very difficult to get this to happen.
>

It is a good explanation. I think it is better to go to cover letter.

Thanks.

> >
> > > - Page table walking and manipulation
> > > A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> > > high-granularity mappings. Eventually, it's possible to merge
> > > hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> > >
> > > We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> > > This is because we generally need to know the "size" of a PTE (previously
> > > always just huge_page_size(hstate)).
> > >
> > > For every page table manipulation function that has a huge version (e.g.
> > > huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> > > hugetlb_ptep_get). The correct version is used depending on if a HugeTLB
> > > PTE really is "huge".
> > >
> > > - Synchronization
> > > For existing bits of HugeTLB, synchronization is unchanged. For splitting
> > > and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
> > > writing, and for doing high-granularity page table walks, we require it to
> > > be held for reading.
> > >
> > > ---- Limitations & Future Changes ----
> > >
> > > This patch series only implements high-granularity mapping for VM_SHARED
> > > VMAs. I intend to implement enough HGM to support 4K unmapping for memory
> > > failure recovery for both shared and private mappings.
> > >
> > > The memory failure use case poses its own challenges that can be
> > > addressed, but I will do so in a separate RFC.
> > >
> > > Performance has not been heavily scrutinized with this patch series. There
> > > are places where lock contention can significantly reduce performance. This
> > > will be addressed later.
> > >
> > > The patch series, as it stands right now, is compatible with the VMEMMAP
> > > page struct optimization[3], as we do not need to modify data contained
> > > in the subpage page structs.
> > >
> > > Other omissions:
> > > - Compatibility with userfaultfd write-protect (will be included in v1).
> > > - Support for mremap() (will be included in v1). This looks a lot like
> > > the support we have for fork().
> > > - Documentation changes (will be included in v1).
> > > - Completely ignores PMD sharing and hugepage migration (will be included
> > > in v1).
> > > - Implementations for architectures that don't use GENERAL_HUGETLB other
> > > than arm64.
> > >
> > > ---- Patch Breakdown ----
> > >
> > > Patch 1 - Preliminary changes
> > > Patch 2-10 - HugeTLB HGM core changes
> > > Patch 11-13 - HugeTLB HGM page table walking functionality
> > > Patch 14-19 - HugeTLB HGM compatibility with other bits
> > > Patch 20-23 - Userfaultfd and collapse changes
> > > Patch 24-26 - arm64 support and selftests
> > >
> > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > > name. "High-granularity mapping" is not a great name either. I am open
> > > to better names.
> >
> > I would drop 1 extra word and do "granular mapping", as in the mapping
> > is more granular than what it normally is (2MB/1G, etc).
>
> Noted. :)
>

2022-06-28 16:03:21

by James Houghton

[permalink] [raw]
Subject: Re: [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates

On Mon, Jun 27, 2022 at 5:09 AM manish.mishra <[email protected]> wrote:
>
>
> On 24/06/22 11:06 pm, James Houghton wrote:
> > When using HugeTLB high-granularity mapping, we need to go through the
> > supported hugepage sizes in decreasing order so that we pick the largest
> > size that works. Consider the case where we're faulting in a 1G hugepage
> > for the first time: we want hugetlb_fault/hugetlb_no_page to map it with
> > a PUD. By going through the sizes in decreasing order, we will find that
> > PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too.
> >
> > Signed-off-by: James Houghton <[email protected]>
> > ---
> > mm/hugetlb.c | 40 +++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 37 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index a57e1be41401..5df838d86f32 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -33,6 +33,7 @@
> > #include <linux/migrate.h>
> > #include <linux/nospec.h>
> > #include <linux/delayacct.h>
> > +#include <linux/sort.h>
> >
> > #include <asm/page.h>
> > #include <asm/pgalloc.h>
> > @@ -48,6 +49,10 @@
> >
> > int hugetlb_max_hstate __read_mostly;
> > unsigned int default_hstate_idx;
> > +/*
> > + * After hugetlb_init_hstates is called, hstates will be sorted from largest
> > + * to smallest.
> > + */
> > struct hstate hstates[HUGE_MAX_HSTATE];
> >
> > #ifdef CONFIG_CMA
> > @@ -3144,14 +3149,43 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> > kfree(node_alloc_noretry);
> > }
> >
> > +static int compare_hstates_decreasing(const void *a, const void *b)
> > +{
> > + const int shift_a = huge_page_shift((const struct hstate *)a);
> > + const int shift_b = huge_page_shift((const struct hstate *)b);
> > +
> > + if (shift_a < shift_b)
> > + return 1;
> > + if (shift_a > shift_b)
> > + return -1;
> > + return 0;
> > +}
> > +
> > +static void sort_hstates(void)
> > +{
> > + unsigned long default_hstate_sz = huge_page_size(&default_hstate);
> > +
> > + /* Sort from largest to smallest. */
> > + sort(hstates, hugetlb_max_hstate, sizeof(*hstates),
> > + compare_hstates_decreasing, NULL);
> > +
> > + /*
> > + * We may have changed the location of the default hstate, so we need to
> > + * update it.
> > + */
> > + default_hstate_idx = hstate_index(size_to_hstate(default_hstate_sz));
> > +}
> > +
> > static void __init hugetlb_init_hstates(void)
> > {
> > struct hstate *h, *h2;
> >
> > - for_each_hstate(h) {
> > - if (minimum_order > huge_page_order(h))
> > - minimum_order = huge_page_order(h);
> > + sort_hstates();
> >
> > + /* The last hstate is now the smallest. */
> > + minimum_order = huge_page_order(&hstates[hugetlb_max_hstate - 1]);
> > +
> > + for_each_hstate(h) {
> > /* oversize hugepages were init'ed in early boot */
> > if (!hstate_is_gigantic(h))
> > hugetlb_hstate_alloc_pages(h);
>
> Now that the hstates are ordered, can the code which calculates demote_order
> be optimised too? I mean, it could just be the order of the hstate at the next index.
>

Indeed -- thanks for catching that. I'll make this optimization for
the next version of this series.
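
Something along these lines, I think (sketch only, untested):

	/* With hstates[] sorted largest to smallest, each hstate (except the
	 * smallest) can simply demote to the next entry. */
	for (i = 0; i < hugetlb_max_hstate - 1; i++)
		hstates[i].demote_order = huge_page_order(&hstates[i + 1]);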

>

2022-06-28 16:04:44

by James Houghton

[permalink] [raw]
Subject: Re: [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates

On Mon, Jun 27, 2022 at 11:42 AM Mike Kravetz <[email protected]> wrote:
>
> On 06/24/22 17:36, James Houghton wrote:
> > When using HugeTLB high-granularity mapping, we need to go through the
> > supported hugepage sizes in decreasing order so that we pick the largest
> > size that works. Consider the case where we're faulting in a 1G hugepage
> > for the first time: we want hugetlb_fault/hugetlb_no_page to map it with
> > a PUD. By going through the sizes in decreasing order, we will find that
> > PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too.
> >
> > Signed-off-by: James Houghton <[email protected]>
> > ---
> > mm/hugetlb.c | 40 +++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 37 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index a57e1be41401..5df838d86f32 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -33,6 +33,7 @@
> > #include <linux/migrate.h>
> > #include <linux/nospec.h>
> > #include <linux/delayacct.h>
> > +#include <linux/sort.h>
> >
> > #include <asm/page.h>
> > #include <asm/pgalloc.h>
> > @@ -48,6 +49,10 @@
> >
> > int hugetlb_max_hstate __read_mostly;
> > unsigned int default_hstate_idx;
> > +/*
> > + * After hugetlb_init_hstates is called, hstates will be sorted from largest
> > + * to smallest.
> > + */
> > struct hstate hstates[HUGE_MAX_HSTATE];
> >
> > #ifdef CONFIG_CMA
> > @@ -3144,14 +3149,43 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> > kfree(node_alloc_noretry);
> > }
> >
> > +static int compare_hstates_decreasing(const void *a, const void *b)
> > +{
> > + const int shift_a = huge_page_shift((const struct hstate *)a);
> > + const int shift_b = huge_page_shift((const struct hstate *)b);
> > +
> > + if (shift_a < shift_b)
> > + return 1;
> > + if (shift_a > shift_b)
> > + return -1;
> > + return 0;
> > +}
> > +
> > +static void sort_hstates(void)
> > +{
> > + unsigned long default_hstate_sz = huge_page_size(&default_hstate);
> > +
> > + /* Sort from largest to smallest. */
> > + sort(hstates, hugetlb_max_hstate, sizeof(*hstates),
> > + compare_hstates_decreasing, NULL);
> > +
> > + /*
> > + * We may have changed the location of the default hstate, so we need to
> > + * update it.
> > + */
> > + default_hstate_idx = hstate_index(size_to_hstate(default_hstate_sz));
> > +}
> > +
> > static void __init hugetlb_init_hstates(void)
> > {
> > struct hstate *h, *h2;
> >
> > - for_each_hstate(h) {
> > - if (minimum_order > huge_page_order(h))
> > - minimum_order = huge_page_order(h);
> > + sort_hstates();
> >
> > + /* The last hstate is now the smallest. */
> > + minimum_order = huge_page_order(&hstates[hugetlb_max_hstate - 1]);
> > +
> > + for_each_hstate(h) {
> > /* oversize hugepages were init'ed in early boot */
> > if (!hstate_is_gigantic(h))
> > hugetlb_hstate_alloc_pages(h);
>
> This may/will cause problems for gigantic hugetlb pages allocated at boot
> time. See alloc_bootmem_huge_page() where a pointer to the associated hstate
> is encoded within the allocated hugetlb page. These pages are added to
> hugetlb pools by the routine gather_bootmem_prealloc() which uses the saved
> hstate to prep the gigantic page and add it to the correct pool. Currently,
> gather_bootmem_prealloc is called after hugetlb_init_hstates. So, changing
> hstate order will cause errors.
>
> I do not see any reason why we could not call gather_bootmem_prealloc before
> hugetlb_init_hstates to avoid this issue.

Thanks for catching this, Mike. Your suggestion certainly seems to
work, but it also seems kind of error prone. I'll have to look at the
code more closely, but maybe it would be better if I just maintained a
separate `struct hstate *sorted_hstate_ptrs[]`, where the original
locations of the hstates remain unchanged, so as not to break
gather_bootmem_prealloc/other things.
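
Roughly what I have in mind (sketch only; the name and placement are not
final):

	static struct hstate *sorted_hstate_ptrs[HUGE_MAX_HSTATE];

	static int cmp_hstates_desc(const void *a, const void *b)
	{
		const struct hstate *ha = *(const struct hstate **)a;
		const struct hstate *hb = *(const struct hstate **)b;

		if (huge_page_shift(ha) < huge_page_shift(hb))
			return 1;
		if (huge_page_shift(ha) > huge_page_shift(hb))
			return -1;
		return 0;
	}

	static void __init build_sorted_hstate_ptrs(void)
	{
		int i;

		for (i = 0; i < hugetlb_max_hstate; i++)
			sorted_hstate_ptrs[i] = &hstates[i];
		/* hstates[] itself stays in place; only the pointers are sorted. */
		sort(sorted_hstate_ptrs, hugetlb_max_hstate,
		     sizeof(*sorted_hstate_ptrs), cmp_hstates_desc, NULL);
	}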

> --
> Mike Kravetz

2022-06-28 17:41:53

by James Houghton

[permalink] [raw]
Subject: Re: [RFC PATCH 01/26] hugetlb: make hstate accessor functions const

On Mon, Jun 27, 2022 at 5:09 AM manish.mishra <[email protected]> wrote:
>
>
> On 27/06/22 5:06 pm, manish.mishra wrote:
>
>
> On 24/06/22 11:06 pm, James Houghton wrote:
>
> This is just a const-correctness change so that the new hugetlb_pte
> changes can be const-correct too.
>
> Acked-by: David Rientjes <[email protected]>
>
> Signed-off-by: James Houghton <[email protected]>
> ---
> include/linux/hugetlb.h | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index e4cff27d1198..498a4ae3d462 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -715,7 +715,7 @@ static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> return hstate_file(vma->vm_file);
> }
>
> -static inline unsigned long huge_page_size(struct hstate *h)
> +static inline unsigned long huge_page_size(const struct hstate *h)
> {
> return (unsigned long)PAGE_SIZE << h->order;
> }
> @@ -729,27 +729,27 @@ static inline unsigned long huge_page_mask(struct hstate *h)
> return h->mask;
> }
>
> -static inline unsigned int huge_page_order(struct hstate *h)
> +static inline unsigned int huge_page_order(const struct hstate *h)
> {
> return h->order;
> }
>
> -static inline unsigned huge_page_shift(struct hstate *h)
> +static inline unsigned huge_page_shift(const struct hstate *h)
> {
> return h->order + PAGE_SHIFT;
> }
>
> -static inline bool hstate_is_gigantic(struct hstate *h)
> +static inline bool hstate_is_gigantic(const struct hstate *h)
> {
> return huge_page_order(h) >= MAX_ORDER;
> }
>
> -static inline unsigned int pages_per_huge_page(struct hstate *h)
> +static inline unsigned int pages_per_huge_page(const struct hstate *h)
> {
> return 1 << h->order;
> }
>
> -static inline unsigned int blocks_per_huge_page(struct hstate *h)
> +static inline unsigned int blocks_per_huge_page(const struct hstate *h)
> {
> return huge_page_size(h) / 512;
> }
>
> James, I just wanted to check why you did this selectively, only for these functions,
> and not for something like hstate_index, which I also see used in your code.

I'll look into which other functions can be made const. We need
huge_page_shift() to take `const struct hstate *h` so that the hstates
can be sorted, and it then followed to make the surrounding, related
functions const as well. I could also just leave it at
huge_page_shift().

The commit message here is wrong -- the hugetlb_pte const-correctness
is a separate issue that doesn't depend on the constness of hstates. I'll
fix that -- sorry about that.

2022-06-28 17:55:34

by Mina Almasry

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping

On Mon, Jun 27, 2022 at 9:27 AM James Houghton <[email protected]> wrote:
>
> On Fri, Jun 24, 2022 at 11:41 AM Mina Almasry <[email protected]> wrote:
> >
> > On Fri, Jun 24, 2022 at 10:37 AM James Houghton <[email protected]> wrote:
> > >
> > > [trimmed...]
> > > ---- Userspace API ----
> > >
> > > This patch series introduces a single way to take advantage of
> > > high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> > > userspace to resolve MINOR page faults on shared VMAs.
> > >
> > > To collapse a HugeTLB address range that has been mapped with several
> > > UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> > > userspace to know when all pages (that they care about) have been fetched.
> > >
> >
> > Thanks James! Cover letter looks good. A few questions:
> >
> > Why not have the kernel collapse the hugepage once all the 4K pages
> > have been fetched automatically? It would remove the need for a new
> > userspace API, and AFACT there aren't really any cases where it is
> > beneficial to have a hugepage sharded into 4K mappings when those
> > mappings can be collapsed.
>
> The reason that we don't automatically collapse mappings is because it
> would take additional complexity, and it is less flexible. Consider
> the case of 1G pages on x86: currently, userspace can collapse the
> whole page when it's all ready, but they can also choose to collapse a
> 2M piece of it. On architectures with more supported hugepage sizes
> (e.g., arm64), userspace has even more possibilities for when to
> collapse. This likely further complicates a potential
> automatic-collapse solution. Userspace may also want to collapse the
> mapping for an entire hugepage without completely mapping the hugepage
> first (this would also be possible by issuing UFFDIO_CONTINUE on all
> the holes, though).
>

To be honest, I don't think I'm a fan of this. I don't think this
saves complexity, but rather pushes it to the userspace. I.e. the
userspace now must track which regions are faulted in and which are
not to call MADV_COLLAPSE at the right time. Also, if the userspace
gets it wrong it may accidentally not call MADV_COLLAPSE (and not get
any hugepages) or call MADV_COLLAPSE too early and have to deal with a
storm of maybe hundreds of minor faults at once which may take too
long to resolve and may impact guest stability, yes?

For these reasons I think automatic collapsing is something that will
eventually be implemented by us or someone else, and at that point
MADV_COLLAPSE for hugetlb memory will become obsolete; i.e. this patch
is adding a userspace API that will probably need to be maintained in
perpetuity but is actually likely to become obsolete "soon".
For this reason I had hoped that automatic collapsing would come with
V1.

I wonder if we can have a very simple first try at automatic
collapsing for V1? I.e., can we support collapsing to the hstate size
and only that? So 4K pages can only be either collapsed to 2MB or 1G
on x86 depending on the hstate size. I think this may not be too
difficult to implement: we can have a counter similar to mapcount that
tracks how many of the subpages are mapped (subpage_mapcount). Once
all the subpages are mapped (the counter reaches a certain value),
trigger collapsing similar to hstate size MADV_COLLAPSE.
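
Very roughly, something like this (every name here is hypothetical, just to
sketch the idea):

	/* Called after a 4K piece of the hugepage at haddr has been mapped. */
	static void hgm_note_subpage_mapped(struct vm_area_struct *vma,
					    unsigned long haddr,
					    atomic_t *nr_subpages_mapped)
	{
		struct hstate *h = hstate_vma(vma);

		/* Once every subpage is mapped, collapse back to one huge PTE. */
		if (atomic_inc_return(nr_subpages_mapped) == pages_per_huge_page(h))
			hugetlb_collapse_range(vma, haddr,
					       haddr + huge_page_size(h));
	}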

I gather that no one else reviewing this has raised this issue thus
far so it might not be a big deal and I will continue to review the
RFC, but I had hoped for automatic collapsing myself for the reasons
above.

> >
> > > ---- HugeTLB Changes ----
> > >
> > > - Mapcount
> > > The way mapcount is handled is different from the way that it was handled
> > > before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> > > be increased. This scheme means that, for hugepages that aren't mapped at
> > > high granularity, their mapcounts will remain the same as what they would
> > > have been pre-HGM.
> > >
> >
> > Sorry, I didn't quite follow this. It says mapcount is handled
> > differently, but the same if the page is not mapped at high
> > granularity. Can you elaborate on how the mapcount handling will be
> > different when the page is mapped at high granularity?
>
> I guess I didn't phrase this very well. For the sake of simplicity,
> consider 1G pages on x86, typically mapped with leaf-level PUDs.
> Previously, there were two possibilities for how a hugepage was
> mapped, either it was (1) completely mapped (PUD is present and a
> leaf), or (2) it wasn't mapped (PUD is none). Now we have a third
> case, where the PUD is not none but also not a leaf (this usually
> means that the page is partially mapped). We handle this case as if
> the whole page was mapped. That is, if we partially map a hugepage
> that was previously unmapped (making the PUD point to PMDs), we
> increment its mapcount, and if we completely unmap a partially mapped
> hugepage (making the PUD none), we decrement its mapcount. If we
> collapse a non-leaf PUD to a leaf PUD, we don't change mapcount.
>
> It is possible for a PUD to be present and not a leaf (mapcount has
> been incremented) but for the page to still be unmapped: if the PMDs
> (or PTEs) underneath are all none. This case is atypical, and as of
> this RFC (without bestowing MADV_DONTNEED with HGM flexibility), I
> think it would be very difficult to get this to happen.
>

Thank you for the detailed explanation. Please add it to the cover letter.

I wonder about the case "PUD present but all the PMDs are none": is that a
bug? I don't understand the usefulness of that. Not a comment on this
patch, but rather a curiosity.

> >
> > > - Page table walking and manipulation
> > > A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> > > high-granularity mappings. Eventually, it's possible to merge
> > > hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> > >
> > > We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> > > This is because we generally need to know the "size" of a PTE (previously
> > > always just huge_page_size(hstate)).
> > >
> > > For every page table manipulation function that has a huge version (e.g.
> > > huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> > > hugetlb_ptep_get). The correct version is used depending on if a HugeTLB
> > > PTE really is "huge".
> > >
> > > - Synchronization
> > > For existing bits of HugeTLB, synchronization is unchanged. For splitting
> > > and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
> > > writing, and for doing high-granularity page table walks, we require it to
> > > be held for reading.
> > >
> > > ---- Limitations & Future Changes ----
> > >
> > > This patch series only implements high-granularity mapping for VM_SHARED
> > > VMAs. I intend to implement enough HGM to support 4K unmapping for memory
> > > failure recovery for both shared and private mappings.
> > >
> > > The memory failure use case poses its own challenges that can be
> > > addressed, but I will do so in a separate RFC.
> > >
> > > Performance has not been heavily scrutinized with this patch series. There
> > > are places where lock contention can significantly reduce performance. This
> > > will be addressed later.
> > >
> > > The patch series, as it stands right now, is compatible with the VMEMMAP
> > > page struct optimization[3], as we do not need to modify data contained
> > > in the subpage page structs.
> > >
> > > Other omissions:
> > > - Compatibility with userfaultfd write-protect (will be included in v1).
> > > - Support for mremap() (will be included in v1). This looks a lot like
> > > the support we have for fork().
> > > - Documentation changes (will be included in v1).
> > > - Completely ignores PMD sharing and hugepage migration (will be included
> > > in v1).
> > > - Implementations for architectures that don't use GENERAL_HUGETLB other
> > > than arm64.
> > >
> > > ---- Patch Breakdown ----
> > >
> > > Patch 1 - Preliminary changes
> > > Patch 2-10 - HugeTLB HGM core changes
> > > Patch 11-13 - HugeTLB HGM page table walking functionality
> > > Patch 14-19 - HugeTLB HGM compatibility with other bits
> > > Patch 20-23 - Userfaultfd and collapse changes
> > > Patch 24-26 - arm64 support and selftests
> > >
> > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > > name. "High-granularity mapping" is not a great name either. I am open
> > > to better names.
> >
> > I would drop 1 extra word and do "granular mapping", as in the mapping
> > is more granular than what it normally is (2MB/1G, etc).
>
> Noted. :)

2022-06-28 17:58:36

by Dr. David Alan Gilbert

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping

* Mina Almasry ([email protected]) wrote:
> On Mon, Jun 27, 2022 at 9:27 AM James Houghton <[email protected]> wrote:
> >
> > On Fri, Jun 24, 2022 at 11:41 AM Mina Almasry <[email protected]> wrote:
> > >
> > > On Fri, Jun 24, 2022 at 10:37 AM James Houghton <[email protected]> wrote:
> > > >
> > > > [trimmed...]
> > > > ---- Userspace API ----
> > > >
> > > > This patch series introduces a single way to take advantage of
> > > > high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> > > > userspace to resolve MINOR page faults on shared VMAs.
> > > >
> > > > To collapse a HugeTLB address range that has been mapped with several
> > > > UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> > > > userspace to know when all pages (that they care about) have been fetched.
> > > >
> > >
> > > Thanks James! Cover letter looks good. A few questions:
> > >
> > > Why not have the kernel collapse the hugepage once all the 4K pages
> > > have been fetched automatically? It would remove the need for a new
> > > userspace API, and AFACT there aren't really any cases where it is
> > > beneficial to have a hugepage sharded into 4K mappings when those
> > > mappings can be collapsed.
> >
> > The reason that we don't automatically collapse mappings is because it
> > would take additional complexity, and it is less flexible. Consider
> > the case of 1G pages on x86: currently, userspace can collapse the
> > whole page when it's all ready, but they can also choose to collapse a
> > 2M piece of it. On architectures with more supported hugepage sizes
> > (e.g., arm64), userspace has even more possibilities for when to
> > collapse. This likely further complicates a potential
> > automatic-collapse solution. Userspace may also want to collapse the
> > mapping for an entire hugepage without completely mapping the hugepage
> > first (this would also be possible by issuing UFFDIO_CONTINUE on all
> > the holes, though).
> >
>
> To be honest, I don't think I'm a fan of this. I don't think this
> saves complexity, but rather pushes it to the userspace. I.e. the
> userspace now must track which regions are faulted in and which are
> not to call MADV_COLLAPSE at the right time. Also, if the userspace
> gets it wrong it may accidentally not call MADV_COLLAPSE (and not get
> any hugepages) or call MADV_COLLAPSE too early and have to deal with a
> storm of maybe hundreds of minor faults at once which may take too
> long to resolve and may impact guest stability, yes?

I think it depends on whether the userspace is already holding bitmaps
and data structures to let it know when the right time to call collapse
is; if it already has to do all that bookkeeping for its own postcopy
or whatever process, then getting userspace to call it is easy.
(I don't know whether it does!)

Dave

> For these reasons I think automatic collapsing is something that will
> eventually be implemented by us or someone else, and at that point
> MADV_COLLAPSE for hugetlb memory will become obsolete; i.e. this patch
> is adding a userspace API that will probably need to be maintained for
> perpetuity but actually is likely going to be going obsolete "soon".
> For this reason I had hoped that automatic collapsing would come with
> V1.
>
> I wonder if we can have a very simple first try at automatic
> collapsing for V1? I.e., can we support collapsing to the hstate size
> and only that? So 4K pages can only be either collapsed to 2MB or 1G
> on x86 depending on the hstate size. I think this may be not too
> difficult to implement: we can have a counter similar to mapcount that
> tracks how many of the subpages are mapped (subpage_mapcount). Once
> all the subpages are mapped (the counter reaches a certain value),
> trigger collapsing similar to hstate size MADV_COLLAPSE.
>
> I gather that no one else reviewing this has raised this issue thus
> far so it might not be a big deal and I will continue to review the
> RFC, but I had hoped for automatic collapsing myself for the reasons
> above.
>
> > >
> > > > ---- HugeTLB Changes ----
> > > >
> > > > - Mapcount
> > > > The way mapcount is handled is different from the way that it was handled
> > > > before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> > > > be increased. This scheme means that, for hugepages that aren't mapped at
> > > > high granularity, their mapcounts will remain the same as what they would
> > > > have been pre-HGM.
> > > >
> > >
> > > Sorry, I didn't quite follow this. It says mapcount is handled
> > > differently, but the same if the page is not mapped at high
> > > granularity. Can you elaborate on how the mapcount handling will be
> > > different when the page is mapped at high granularity?
> >
> > I guess I didn't phrase this very well. For the sake of simplicity,
> > consider 1G pages on x86, typically mapped with leaf-level PUDs.
> > Previously, there were two possibilities for how a hugepage was
> > mapped, either it was (1) completely mapped (PUD is present and a
> > leaf), or (2) it wasn't mapped (PUD is none). Now we have a third
> > case, where the PUD is not none but also not a leaf (this usually
> > means that the page is partially mapped). We handle this case as if
> > the whole page was mapped. That is, if we partially map a hugepage
> > that was previously unmapped (making the PUD point to PMDs), we
> > increment its mapcount, and if we completely unmap a partially mapped
> > hugepage (making the PUD none), we decrement its mapcount. If we
> > collapse a non-leaf PUD to a leaf PUD, we don't change mapcount.
> >
> > It is possible for a PUD to be present and not a leaf (mapcount has
> > been incremented) but for the page to still be unmapped: if the PMDs
> > (or PTEs) underneath are all none. This case is atypical, and as of
> > this RFC (without bestowing MADV_DONTNEED with HGM flexibility), I
> > think it would be very difficult to get this to happen.
> >
>
> Thank you for the detailed explanation. Please add it to the cover letter.
>
> I wonder about the case "PUD present but all the PMDs are none": is that a
> bug? I don't understand the usefulness of that. Not a comment on this
> patch but rather a curiosity.
>
> > >
> > > > - Page table walking and manipulation
> > > > A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> > > > high-granularity mappings. Eventually, it's possible to merge
> > > > hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> > > >
> > > > We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> > > > This is because we generally need to know the "size" of a PTE (previously
> > > > always just huge_page_size(hstate)).
> > > >
> > > > For every page table manipulation function that has a huge version (e.g.
> > > > huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> > > > hugetlb_ptep_get). The correct version is used depending on if a HugeTLB
> > > > PTE really is "huge".
> > > >
> > > > - Synchronization
> > > > For existing bits of HugeTLB, synchronization is unchanged. For splitting
> > > > and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
> > > > writing, and for doing high-granularity page table walks, we require it to
> > > > be held for reading.
> > > >
> > > > ---- Limitations & Future Changes ----
> > > >
> > > > This patch series only implements high-granularity mapping for VM_SHARED
> > > > VMAs. I intend to implement enough HGM to support 4K unmapping for memory
> > > > failure recovery for both shared and private mappings.
> > > >
> > > > The memory failure use case poses its own challenges that can be
> > > > addressed, but I will do so in a separate RFC.
> > > >
> > > > Performance has not been heavily scrutinized with this patch series. There
> > > > are places where lock contention can significantly reduce performance. This
> > > > will be addressed later.
> > > >
> > > > The patch series, as it stands right now, is compatible with the VMEMMAP
> > > > page struct optimization[3], as we do not need to modify data contained
> > > > in the subpage page structs.
> > > >
> > > > Other omissions:
> > > > - Compatibility with userfaultfd write-protect (will be included in v1).
> > > > - Support for mremap() (will be included in v1). This looks a lot like
> > > > the support we have for fork().
> > > > - Documentation changes (will be included in v1).
> > > > - Completely ignores PMD sharing and hugepage migration (will be included
> > > > in v1).
> > > > - Implementations for architectures that don't use GENERAL_HUGETLB other
> > > > than arm64.
> > > >
> > > > ---- Patch Breakdown ----
> > > >
> > > > Patch 1 - Preliminary changes
> > > > Patch 2-10 - HugeTLB HGM core changes
> > > > Patch 11-13 - HugeTLB HGM page table walking functionality
> > > > Patch 14-19 - HugeTLB HGM compatibility with other bits
> > > > Patch 20-23 - Userfaultfd and collapse changes
> > > > Patch 24-26 - arm64 support and selftests
> > > >
> > > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > > > name. "High-granularity mapping" is not a great name either. I am open
> > > > to better names.
> > >
> > > I would drop 1 extra word and do "granular mapping", as in the mapping
> > > is more granular than what it normally is (2MB/1G, etc).
> >
> > Noted. :)
>
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK

2022-06-28 20:46:59

by Mina Almasry

[permalink] [raw]
Subject: Re: [RFC PATCH 08/26] hugetlb: add hugetlb_free_range to free PT structures

On Fri, Jun 24, 2022 at 10:37 AM James Houghton <[email protected]> wrote:
>
> This is a helper function for freeing the bits of the page table that
> map a particular HugeTLB PTE.
>
> Signed-off-by: James Houghton <[email protected]>
> ---
> include/linux/hugetlb.h | 2 ++
> mm/hugetlb.c | 17 +++++++++++++++++
> 2 files changed, 19 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1d4ec9dfdebf..33ba48fac551 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -107,6 +107,8 @@ bool hugetlb_pte_none_mostly(const struct hugetlb_pte *hpte);
> pte_t hugetlb_ptep_get(const struct hugetlb_pte *hpte);
> void hugetlb_pte_clear(struct mm_struct *mm, const struct hugetlb_pte *hpte,
> unsigned long address);
> +void hugetlb_free_range(struct mmu_gather *tlb, const struct hugetlb_pte *hpte,
> + unsigned long start, unsigned long end);
>
> struct hugepage_subpool {
> spinlock_t lock;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1a1434e29740..a2d2ffa76173 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1120,6 +1120,23 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
> return false;
> }
>
> +void hugetlb_free_range(struct mmu_gather *tlb, const struct hugetlb_pte *hpte,
> + unsigned long start, unsigned long end)
> +{
> + unsigned long floor = start & hugetlb_pte_mask(hpte);
> + unsigned long ceiling = floor + hugetlb_pte_size(hpte);
> +
> + if (hugetlb_pte_size(hpte) >= PGDIR_SIZE) {
> + free_p4d_range(tlb, (pgd_t *)hpte->ptep, start, end, floor, ceiling);
> + } else if (hugetlb_pte_size(hpte) >= P4D_SIZE) {
> + free_pud_range(tlb, (p4d_t *)hpte->ptep, start, end, floor, ceiling);
> + } else if (hugetlb_pte_size(hpte) >= PUD_SIZE) {
> + free_pmd_range(tlb, (pud_t *)hpte->ptep, start, end, floor, ceiling);
> + } else if (hugetlb_pte_size(hpte) >= PMD_SIZE) {
> + free_pte_range(tlb, (pmd_t *)hpte->ptep, start);
> + }

Same as the previous patch: I wonder about >=, and if possible
calculate hugetlb_pte_size() once, or use *_SHIFT comparison.
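Something like the below is what I have in mind -- only a sketch, reusing the
helpers already in this patch (and keeping >= until the question above is
answered):

	void hugetlb_free_range(struct mmu_gather *tlb, const struct hugetlb_pte *hpte,
			unsigned long start, unsigned long end)
	{
		/* Evaluate the PTE size once instead of once per branch. */
		unsigned long sz = hugetlb_pte_size(hpte);
		unsigned long floor = start & hugetlb_pte_mask(hpte);
		unsigned long ceiling = floor + sz;

		if (sz >= PGDIR_SIZE)
			free_p4d_range(tlb, (pgd_t *)hpte->ptep, start, end, floor, ceiling);
		else if (sz >= P4D_SIZE)
			free_pud_range(tlb, (p4d_t *)hpte->ptep, start, end, floor, ceiling);
		else if (sz >= PUD_SIZE)
			free_pmd_range(tlb, (pud_t *)hpte->ptep, start, end, floor, ceiling);
		else if (sz >= PMD_SIZE)
			free_pte_range(tlb, (pmd_t *)hpte->ptep, start);
	}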

> +}
> +
> bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte)
> {
> pgd_t pgd;
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>

2022-06-28 20:57:27

by Mina Almasry

[permalink] [raw]
Subject: Re: [RFC PATCH 05/26] hugetlb: add CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING

On Mon, Jun 27, 2022 at 5:29 AM manish.mishra <[email protected]> wrote:
>
>
> On 24/06/22 11:06 pm, James Houghton wrote:
> > This adds the Kconfig to enable or disable high-granularity mapping. It
> > is enabled by default for architectures that use
> > ARCH_WANT_GENERAL_HUGETLB.
> >
> > There is also an arch-specific config ARCH_HAS_SPECIAL_HUGETLB_HGM which
> > controls whether or not the architecture has been updated to support
> > HGM if it doesn't use general HugeTLB.
> >
> > Signed-off-by: James Houghton <[email protected]>
> reviewed-by:[email protected]

Mostly minor nits,

Reviewed-by: Mina Almasry <[email protected]>

> > ---
> > fs/Kconfig | 7 +++++++
> > 1 file changed, 7 insertions(+)
> >
> > diff --git a/fs/Kconfig b/fs/Kconfig
> > index 5976eb33535f..d76c7d812656 100644
> > --- a/fs/Kconfig
> > +++ b/fs/Kconfig
> > @@ -268,6 +268,13 @@ config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
> > to enable optimizing vmemmap pages of HugeTLB by default. It can then
> > be disabled on the command line via hugetlb_free_vmemmap=off.
> >
> > +config ARCH_HAS_SPECIAL_HUGETLB_HGM

Nit: would have preferred just ARCH_HAS_HUGETLB_HGM, as ARCH implies
arch-specific.

> > + bool
> > +
> > +config HUGETLB_HIGH_GRANULARITY_MAPPING
> > + def_bool ARCH_WANT_GENERAL_HUGETLB || ARCH_HAS_SPECIAL_HUGETLB_HGM

Nit: would have preferred to go with either HGM _or_
HIGH_GRANULARITY_MAPPING (or whatever new name comes up), rather than
both, for consistency's sake.

> > + depends on HUGETLB_PAGE
> > +
> > config MEMFD_CREATE
> > def_bool TMPFS || HUGETLBFS
> >

2022-06-28 22:22:26

by Mina Almasry

[permalink] [raw]
Subject: Re: [RFC PATCH 10/26] hugetlb: add for_each_hgm_shift

On Fri, Jun 24, 2022 at 10:37 AM James Houghton <[email protected]> wrote:
>
> This is a helper macro to loop through all the usable page sizes for a
> high-granularity-enabled HugeTLB VMA. Given the VMA's hstate, it will
> loop, in descending order, through the page sizes that HugeTLB supports
> for this architecture; it always includes PAGE_SIZE.
>
> Signed-off-by: James Houghton <[email protected]>
> ---
> mm/hugetlb.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8b10b941458d..557b0afdb503 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6989,6 +6989,16 @@ bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> /* All shared VMAs have HGM enabled. */
> return vma->vm_flags & VM_SHARED;
> }
> +static unsigned int __shift_for_hstate(struct hstate *h)
> +{
> + if (h >= &hstates[hugetlb_max_hstate])
> + return PAGE_SHIFT;

h >= &hstates[hugetlb_max_hstate] means that h is out of bounds, no? Am
I missing something here?

So is this intending to do:

if (h == &hstates[hugetlb_max_hstate])
return PAGE_SHIFT;

? If so, could we write it as so?

I'm also wondering why __shift_for_hstate(&hstates[hugetlb_max_hstate])
== PAGE_SHIFT? Isn't the last hstate the smallest hstate, which should
be 2MB on x86? Shouldn't this return PMD_SHIFT in that case?

> + return huge_page_shift(h);
> +}
> +#define for_each_hgm_shift(hstate, tmp_h, shift) \
> + for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
> + (tmp_h) <= &hstates[hugetlb_max_hstate]; \
> + (tmp_h)++)
> #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
>
> /*
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>

2022-06-29 06:30:06

by Muchun Song

[permalink] [raw]
Subject: Re: [RFC PATCH 01/26] hugetlb: make hstate accessor functions const

On Fri, Jun 24, 2022 at 05:36:31PM +0000, James Houghton wrote:
> This is just a const-correctness change so that the new hugetlb_pte
> changes can be const-correct too.
>
> Acked-by: David Rientjes <[email protected]>
>
> Signed-off-by: James Houghton <[email protected]>

This is a good start. I also want to make those helpers take a
const-qualified parameter. It seems you have forgotten to update them
for the !CONFIG_HUGETLB_PAGE case.

Thanks.

> ---
> include/linux/hugetlb.h | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index e4cff27d1198..498a4ae3d462 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -715,7 +715,7 @@ static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
> return hstate_file(vma->vm_file);
> }
>
> -static inline unsigned long huge_page_size(struct hstate *h)
> +static inline unsigned long huge_page_size(const struct hstate *h)
> {
> return (unsigned long)PAGE_SIZE << h->order;
> }
> @@ -729,27 +729,27 @@ static inline unsigned long huge_page_mask(struct hstate *h)
> return h->mask;
> }
>
> -static inline unsigned int huge_page_order(struct hstate *h)
> +static inline unsigned int huge_page_order(const struct hstate *h)
> {
> return h->order;
> }
>
> -static inline unsigned huge_page_shift(struct hstate *h)
> +static inline unsigned huge_page_shift(const struct hstate *h)
> {
> return h->order + PAGE_SHIFT;
> }
>
> -static inline bool hstate_is_gigantic(struct hstate *h)
> +static inline bool hstate_is_gigantic(const struct hstate *h)
> {
> return huge_page_order(h) >= MAX_ORDER;
> }
>
> -static inline unsigned int pages_per_huge_page(struct hstate *h)
> +static inline unsigned int pages_per_huge_page(const struct hstate *h)
> {
> return 1 << h->order;
> }
>
> -static inline unsigned int blocks_per_huge_page(struct hstate *h)
> +static inline unsigned int blocks_per_huge_page(const struct hstate *h)
> {
> return huge_page_size(h) / 512;
> }
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>
>

2022-06-29 07:01:12

by Muchun Song

[permalink] [raw]
Subject: Re: [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates

On Tue, Jun 28, 2022 at 08:40:27AM -0700, James Houghton wrote:
> On Mon, Jun 27, 2022 at 11:42 AM Mike Kravetz <[email protected]> wrote:
> >
> > On 06/24/22 17:36, James Houghton wrote:
> > > When using HugeTLB high-granularity mapping, we need to go through the
> > > supported hugepage sizes in decreasing order so that we pick the largest
> > > size that works. Consider the case where we're faulting in a 1G hugepage
> > > for the first time: we want hugetlb_fault/hugetlb_no_page to map it with
> > > a PUD. By going through the sizes in decreasing order, we will find that
> > > PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too.
> > >
> > > Signed-off-by: James Houghton <[email protected]>
> > > ---
> > > mm/hugetlb.c | 40 +++++++++++++++++++++++++++++++++++++---
> > > 1 file changed, 37 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index a57e1be41401..5df838d86f32 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -33,6 +33,7 @@
> > > #include <linux/migrate.h>
> > > #include <linux/nospec.h>
> > > #include <linux/delayacct.h>
> > > +#include <linux/sort.h>
> > >
> > > #include <asm/page.h>
> > > #include <asm/pgalloc.h>
> > > @@ -48,6 +49,10 @@
> > >
> > > int hugetlb_max_hstate __read_mostly;
> > > unsigned int default_hstate_idx;
> > > +/*
> > > + * After hugetlb_init_hstates is called, hstates will be sorted from largest
> > > + * to smallest.
> > > + */
> > > struct hstate hstates[HUGE_MAX_HSTATE];
> > >
> > > #ifdef CONFIG_CMA
> > > @@ -3144,14 +3149,43 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> > > kfree(node_alloc_noretry);
> > > }
> > >
> > > +static int compare_hstates_decreasing(const void *a, const void *b)
> > > +{
> > > + const int shift_a = huge_page_shift((const struct hstate *)a);
> > > + const int shift_b = huge_page_shift((const struct hstate *)b);
> > > +
> > > + if (shift_a < shift_b)
> > > + return 1;
> > > + if (shift_a > shift_b)
> > > + return -1;
> > > + return 0;
> > > +}
> > > +
> > > +static void sort_hstates(void)
> > > +{
> > > + unsigned long default_hstate_sz = huge_page_size(&default_hstate);
> > > +
> > > + /* Sort from largest to smallest. */
> > > + sort(hstates, hugetlb_max_hstate, sizeof(*hstates),
> > > + compare_hstates_decreasing, NULL);
> > > +
> > > + /*
> > > + * We may have changed the location of the default hstate, so we need to
> > > + * update it.
> > > + */
> > > + default_hstate_idx = hstate_index(size_to_hstate(default_hstate_sz));
> > > +}
> > > +
> > > static void __init hugetlb_init_hstates(void)
> > > {
> > > struct hstate *h, *h2;
> > >
> > > - for_each_hstate(h) {
> > > - if (minimum_order > huge_page_order(h))
> > > - minimum_order = huge_page_order(h);
> > > + sort_hstates();
> > >
> > > + /* The last hstate is now the smallest. */
> > > + minimum_order = huge_page_order(&hstates[hugetlb_max_hstate - 1]);
> > > +
> > > + for_each_hstate(h) {
> > > /* oversize hugepages were init'ed in early boot */
> > > if (!hstate_is_gigantic(h))
> > > hugetlb_hstate_alloc_pages(h);
> >
> > This may/will cause problems for gigantic hugetlb pages allocated at boot
> > time. See alloc_bootmem_huge_page() where a pointer to the associated hstate
> > is encoded within the allocated hugetlb page. These pages are added to
> > hugetlb pools by the routine gather_bootmem_prealloc() which uses the saved
> > hstate to add prep the gigantic page and add to the correct pool. Currently,
> > gather_bootmem_prealloc is called after hugetlb_init_hstates. So, changing
> > hstate order will cause errors.
> >
> > I do not see any reason why we could not call gather_bootmem_prealloc before
> > hugetlb_init_hstates to avoid this issue.
>
> Thanks for catching this, Mike. Your suggestion certainly seems to
> work, but it also seems kind of error prone. I'll have to look at the
> code more closely, but maybe it would be better if I just maintained a
> separate `struct hstate *sorted_hstate_ptrs[]`, where the original

I don't think this is a good idea if you really rely on the order of
the initialization in this patch. The easier solution is changing
huge_bootmem_page->hstate to huge_bootmem_page->hugepagesz. Then we
can use size_to_hstate(huge_bootmem_page->hugepagesz) in
gather_bootmem_prealloc().
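Roughly like this -- just a sketch, not build-tested, with the surrounding
bootmem code recalled from memory:

	struct huge_bootmem_page {
		struct list_head list;
		unsigned long hugepagesz;	/* was: struct hstate *hstate */
	};

	/* When recording a boot-time gigantic page, store the size instead: */
	m->hugepagesz = huge_page_size(h);	/* was: m->hstate = h; */

	/*
	 * In gather_bootmem_prealloc(), resolve the hstate by size, which is
	 * stable even after hstates[] has been sorted:
	 */
	struct hstate *h = size_to_hstate(m->hugepagesz);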

Thanks.

> locations of the hstates remain unchanged, as to not break
> gather_bootmem_prealloc/other things.
>
> > --
> > Mike Kravetz
>

2022-06-29 14:50:45

by manish.mishra

[permalink] [raw]
Subject: Re: [RFC PATCH 12/26] hugetlb: add HugeTLB splitting functionality


On 24/06/22 11:06 pm, James Houghton wrote:
> The new function, hugetlb_split_to_shift, will optimally split the page
> table to map a particular address at a particular granularity.
>
> This is useful for punching a hole in the mapping and for mapping small
> sections of a HugeTLB page (via UFFDIO_CONTINUE, for example).
>
> Signed-off-by: James Houghton <[email protected]>
> ---
> mm/hugetlb.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 122 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3ec2a921ee6f..eaffe7b4f67c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -102,6 +102,18 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
> /* Forward declaration */
> static int hugetlb_acct_memory(struct hstate *h, long delta);
>
> +/*
> + * Find the subpage that corresponds to `addr` in `hpage`.
> + */
> +static struct page *hugetlb_find_subpage(struct hstate *h, struct page *hpage,
> + unsigned long addr)
> +{
> + size_t idx = (addr & ~huge_page_mask(h))/PAGE_SIZE;
> +
> + BUG_ON(idx >= pages_per_huge_page(h));
> + return &hpage[idx];
> +}
> +
> static inline bool subpool_is_free(struct hugepage_subpool *spool)
> {
> if (spool->count)
> @@ -7044,6 +7056,116 @@ static unsigned int __shift_for_hstate(struct hstate *h)
> for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
> (tmp_h) <= &hstates[hugetlb_max_hstate]; \
> (tmp_h)++)
> +
> +/*
> + * Given a particular address, split the HugeTLB PTE that currently maps it
> + * so that, for the given address, the PTE that maps it is `desired_shift`.
> + * This function will always split the HugeTLB PTE optimally.
> + *
> + * For example, given a HugeTLB 1G page that is mapped from VA 0 to 1G. If we
> + * call this function with addr=0 and desired_shift=PAGE_SHIFT, will result in
> + * these changes to the page table:
> + * 1. The PUD will be split into 2M PMDs.
> + * 2. The first PMD will be split again into 4K PTEs.
> + */
> +static int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
> + const struct hugetlb_pte *hpte,
> + unsigned long addr, unsigned long desired_shift)
> +{
> + unsigned long start, end, curr;
> + unsigned long desired_sz = 1UL << desired_shift;
> + struct hstate *h = hstate_vma(vma);
> + int ret;
> + struct hugetlb_pte new_hpte;
> + struct mmu_notifier_range range;
> + struct page *hpage = NULL;
> + struct page *subpage;
> + pte_t old_entry;
> + struct mmu_gather tlb;
> +
> + BUG_ON(!hpte->ptep);
> + BUG_ON(hugetlb_pte_size(hpte) == desired_sz);
> +
> + start = addr & hugetlb_pte_mask(hpte);
> + end = start + hugetlb_pte_size(hpte);
> +
> + i_mmap_assert_write_locked(vma->vm_file->f_mapping);
> +
> + BUG_ON(!hpte->ptep);
> + /* This function only works if we are looking at a leaf-level PTE. */
> + BUG_ON(!hugetlb_pte_none(hpte) && !hugetlb_pte_present_leaf(hpte));
> +
> + /*
> + * Clear the PTE so that we will allocate the PT structures when
> + * walking the page table.
> + */
> + old_entry = huge_ptep_get_and_clear(mm, start, hpte->ptep);

Sorry, I missed this last time: what if an HGM mapping is present here and the
current hpte is at a higher level? Where will we clear and free the child
page-table pages? I see it does not happen in huge_ptep_get_and_clear.

> +
> + if (!huge_pte_none(old_entry))
> + hpage = pte_page(old_entry);
> +
> + BUG_ON(!IS_ALIGNED(start, desired_sz));
> + BUG_ON(!IS_ALIGNED(end, desired_sz));
> +
> + for (curr = start; curr < end;) {
> + struct hstate *tmp_h;
> + unsigned int shift;
> +
> + for_each_hgm_shift(h, tmp_h, shift) {
> + unsigned long sz = 1UL << shift;
> +
> + if (!IS_ALIGNED(curr, sz) || curr + sz > end)
> + continue;
> + /*
> + * If we are including `addr`, we need to make sure
> + * splitting down to the correct size. Go to a smaller
> + * size if we are not.
> + */
> + if (curr <= addr && curr + sz > addr &&
> + shift > desired_shift)
> + continue;
> +
> + /*
> + * Continue the page table walk to the level we want,
> + * allocate PT structures as we go.
> + */
> + hugetlb_pte_copy(&new_hpte, hpte);
> + ret = hugetlb_walk_to(mm, &new_hpte, curr, sz,
> + /*stop_at_none=*/false);
> + if (ret)
> + goto err;
> + BUG_ON(hugetlb_pte_size(&new_hpte) != sz);
> + if (hpage) {
> + pte_t new_entry;
> +
> + subpage = hugetlb_find_subpage(h, hpage, curr);
> + new_entry = make_huge_pte_with_shift(vma, subpage,
> + huge_pte_write(old_entry),
> + shift);
> + set_huge_pte_at(mm, curr, new_hpte.ptep, new_entry);
> + }
> + curr += sz;
> + goto next;
> + }
> + /* We couldn't find a size that worked. */
> + BUG();
> +next:
> + continue;
> + }
> +
> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> + start, end);
> + mmu_notifier_invalidate_range_start(&range);
> + return 0;
> +err:
> + tlb_gather_mmu(&tlb, mm);
> + /* Free any newly allocated page table entries. */
> + hugetlb_free_range(&tlb, hpte, start, curr);
> + /* Restore the old entry. */
> + set_huge_pte_at(mm, start, hpte->ptep, old_entry);
> + tlb_finish_mmu(&tlb);
> + return ret;
> +}
> #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
>
> /*

2022-06-29 16:45:11

by James Houghton

[permalink] [raw]
Subject: Re: [RFC PATCH 12/26] hugetlb: add HugeTLB splitting functionality

On Mon, Jun 27, 2022 at 6:51 AM manish.mishra <[email protected]> wrote:
>
>
> On 24/06/22 11:06 pm, James Houghton wrote:
> > The new function, hugetlb_split_to_shift, will optimally split the page
> > table to map a particular address at a particular granularity.
> >
> > This is useful for punching a hole in the mapping and for mapping small
> > sections of a HugeTLB page (via UFFDIO_CONTINUE, for example).
> >
> > Signed-off-by: James Houghton <[email protected]>
> > ---
> > mm/hugetlb.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 122 insertions(+)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 3ec2a921ee6f..eaffe7b4f67c 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -102,6 +102,18 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
> > /* Forward declaration */
> > static int hugetlb_acct_memory(struct hstate *h, long delta);
> >
> > +/*
> > + * Find the subpage that corresponds to `addr` in `hpage`.
> > + */
> > +static struct page *hugetlb_find_subpage(struct hstate *h, struct page *hpage,
> > + unsigned long addr)
> > +{
> > + size_t idx = (addr & ~huge_page_mask(h))/PAGE_SIZE;
> > +
> > + BUG_ON(idx >= pages_per_huge_page(h));
> > + return &hpage[idx];
> > +}
> > +
> > static inline bool subpool_is_free(struct hugepage_subpool *spool)
> > {
> > if (spool->count)
> > @@ -7044,6 +7056,116 @@ static unsigned int __shift_for_hstate(struct hstate *h)
> > for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
> > (tmp_h) <= &hstates[hugetlb_max_hstate]; \
> > (tmp_h)++)
> > +
> > +/*
> > + * Given a particular address, split the HugeTLB PTE that currently maps it
> > + * so that, for the given address, the PTE that maps it is `desired_shift`.
> > + * This function will always split the HugeTLB PTE optimally.
> > + *
> > + * For example, given a HugeTLB 1G page that is mapped from VA 0 to 1G. If we
> > + * call this function with addr=0 and desired_shift=PAGE_SHIFT, will result in
> > + * these changes to the page table:
> > + * 1. The PUD will be split into 2M PMDs.
> > + * 2. The first PMD will be split again into 4K PTEs.
> > + */
> > +static int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
> > + const struct hugetlb_pte *hpte,
> > + unsigned long addr, unsigned long desired_shift)
> > +{
> > + unsigned long start, end, curr;
> > + unsigned long desired_sz = 1UL << desired_shift;
> > + struct hstate *h = hstate_vma(vma);
> > + int ret;
> > + struct hugetlb_pte new_hpte;
> > + struct mmu_notifier_range range;
> > + struct page *hpage = NULL;
> > + struct page *subpage;
> > + pte_t old_entry;
> > + struct mmu_gather tlb;
> > +
> > + BUG_ON(!hpte->ptep);
> > + BUG_ON(hugetlb_pte_size(hpte) == desired_sz);
> Can it be BUG_ON(hugetlb_pte_size(hpte) <= desired_sz)?

Sure -- I think that's better.

> > +
> > + start = addr & hugetlb_pte_mask(hpte);
> > + end = start + hugetlb_pte_size(hpte);
> > +
> > + i_mmap_assert_write_locked(vma->vm_file->f_mapping);
>
> As this is just changing mappings, is holding f_mapping required? I mean, in
> future, is there any plan or way to use some per-process sub-lock?

We don't need to hold a per-mapping lock here, you're right; a per-VMA
lock will do just fine. I'll replace this with a per-VMA lock for the
next version.

>
> > +
> > + BUG_ON(!hpte->ptep);
> > + /* This function only works if we are looking at a leaf-level PTE. */
> > + BUG_ON(!hugetlb_pte_none(hpte) && !hugetlb_pte_present_leaf(hpte));
> > +
> > + /*
> > + * Clear the PTE so that we will allocate the PT structures when
> > + * walking the page table.
> > + */
> > + old_entry = huge_ptep_get_and_clear(mm, start, hpte->ptep);
> > +
> > + if (!huge_pte_none(old_entry))
> > + hpage = pte_page(old_entry);
> > +
> > + BUG_ON(!IS_ALIGNED(start, desired_sz));
> > + BUG_ON(!IS_ALIGNED(end, desired_sz));
> > +
> > + for (curr = start; curr < end;) {
> > + struct hstate *tmp_h;
> > + unsigned int shift;
> > +
> > + for_each_hgm_shift(h, tmp_h, shift) {
> > + unsigned long sz = 1UL << shift;
> > +
> > + if (!IS_ALIGNED(curr, sz) || curr + sz > end)
> > + continue;
> > + /*
> > + * If we are including `addr`, we need to make sure
> > + * splitting down to the correct size. Go to a smaller
> > + * size if we are not.
> > + */
> > + if (curr <= addr && curr + sz > addr &&
> > + shift > desired_shift)
> > + continue;
> > +
> > + /*
> > + * Continue the page table walk to the level we want,
> > + * allocate PT structures as we go.
> > + */
>
> As I understand it, this for_each_hgm_shift loop is just to find the right
> shift size, so the code below this line can be moved out of the loop. No
> strong feeling, but it looks more proper and may make the code easier to
> understand.

Agreed. I'll clean this up.

>
> > + hugetlb_pte_copy(&new_hpte, hpte);
> > + ret = hugetlb_walk_to(mm, &new_hpte, curr, sz,
> > + /*stop_at_none=*/false);
> > + if (ret)
> > + goto err;
> > + BUG_ON(hugetlb_pte_size(&new_hpte) != sz);
> > + if (hpage) {
> > + pte_t new_entry;
> > +
> > + subpage = hugetlb_find_subpage(h, hpage, curr);
> > + new_entry = make_huge_pte_with_shift(vma, subpage,
> > + huge_pte_write(old_entry),
> > + shift);
> > + set_huge_pte_at(mm, curr, new_hpte.ptep, new_entry);
> > + }
> > + curr += sz;
> > + goto next;
> > + }
> > + /* We couldn't find a size that worked. */
> > + BUG();
> > +next:
> > + continue;
> > + }
> > +
> > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> > + start, end);
> > + mmu_notifier_invalidate_range_start(&range);
>
> Sorry, I did not understand where the TLB flush will be taken care of in the
> success case. I see set_huge_pte_at does not do it internally by itself.

A TLB flush isn't necessary in the success case -- pages that were
mapped before will continue to be mapped the same way, so the TLB
entries will still be valid. If we're splitting a none P*D, then
there's nothing to flush. If we're splitting a present P*D, then the
flush will come if/when we clear any of the page table entries below
the P*D.

>
> > + return 0;
> > +err:
> > + tlb_gather_mmu(&tlb, mm);
> > + /* Free any newly allocated page table entries. */
> > + hugetlb_free_range(&tlb, hpte, start, curr);
> > + /* Restore the old entry. */
> > + set_huge_pte_at(mm, start, hpte->ptep, old_entry);
> > + tlb_finish_mmu(&tlb);
> > + return ret;
> > +}
> > #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> >
> > /*

2022-06-29 16:45:40

by James Houghton

[permalink] [raw]
Subject: Re: [RFC PATCH 12/26] hugetlb: add HugeTLB splitting functionality

On Wed, Jun 29, 2022 at 7:33 AM manish.mishra <[email protected]> wrote:
>
>
> On 24/06/22 11:06 pm, James Houghton wrote:
> > The new function, hugetlb_split_to_shift, will optimally split the page
> > table to map a particular address at a particular granularity.
> >
> > This is useful for punching a hole in the mapping and for mapping small
> > sections of a HugeTLB page (via UFFDIO_CONTINUE, for example).
> >
> > Signed-off-by: James Houghton <[email protected]>
> > ---
> > mm/hugetlb.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 122 insertions(+)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 3ec2a921ee6f..eaffe7b4f67c 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -102,6 +102,18 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
> > /* Forward declaration */
> > static int hugetlb_acct_memory(struct hstate *h, long delta);
> >
> > +/*
> > + * Find the subpage that corresponds to `addr` in `hpage`.
> > + */
> > +static struct page *hugetlb_find_subpage(struct hstate *h, struct page *hpage,
> > + unsigned long addr)
> > +{
> > + size_t idx = (addr & ~huge_page_mask(h))/PAGE_SIZE;
> > +
> > + BUG_ON(idx >= pages_per_huge_page(h));
> > + return &hpage[idx];
> > +}
> > +
> > static inline bool subpool_is_free(struct hugepage_subpool *spool)
> > {
> > if (spool->count)
> > @@ -7044,6 +7056,116 @@ static unsigned int __shift_for_hstate(struct hstate *h)
> > for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
> > (tmp_h) <= &hstates[hugetlb_max_hstate]; \
> > (tmp_h)++)
> > +
> > +/*
> > + * Given a particular address, split the HugeTLB PTE that currently maps it
> > + * so that, for the given address, the PTE that maps it is `desired_shift`.
> > + * This function will always split the HugeTLB PTE optimally.
> > + *
> > + * For example, given a HugeTLB 1G page that is mapped from VA 0 to 1G. If we
> > + * call this function with addr=0 and desired_shift=PAGE_SHIFT, will result in
> > + * these changes to the page table:
> > + * 1. The PUD will be split into 2M PMDs.
> > + * 2. The first PMD will be split again into 4K PTEs.
> > + */
> > +static int hugetlb_split_to_shift(struct mm_struct *mm, struct vm_area_struct *vma,
> > + const struct hugetlb_pte *hpte,
> > + unsigned long addr, unsigned long desired_shift)
> > +{
> > + unsigned long start, end, curr;
> > + unsigned long desired_sz = 1UL << desired_shift;
> > + struct hstate *h = hstate_vma(vma);
> > + int ret;
> > + struct hugetlb_pte new_hpte;
> > + struct mmu_notifier_range range;
> > + struct page *hpage = NULL;
> > + struct page *subpage;
> > + pte_t old_entry;
> > + struct mmu_gather tlb;
> > +
> > + BUG_ON(!hpte->ptep);
> > + BUG_ON(hugetlb_pte_size(hpte) == desired_sz);
> > +
> > + start = addr & hugetlb_pte_mask(hpte);
> > + end = start + hugetlb_pte_size(hpte);
> > +
> > + i_mmap_assert_write_locked(vma->vm_file->f_mapping);
> > +
> > + BUG_ON(!hpte->ptep);
> > + /* This function only works if we are looking at a leaf-level PTE. */
> > + BUG_ON(!hugetlb_pte_none(hpte) && !hugetlb_pte_present_leaf(hpte));
> > +
> > + /*
> > + * Clear the PTE so that we will allocate the PT structures when
> > + * walking the page table.
> > + */
> > + old_entry = huge_ptep_get_and_clear(mm, start, hpte->ptep);
>
> Sorry, I missed this last time: what if an HGM mapping is present here and the
> current hpte is at a higher level? Where will we clear and free the child
> page-table pages? I see it does not happen in huge_ptep_get_and_clear.

This shouldn't happen because earlier we have
BUG_ON(!hugetlb_pte_none(hpte) && !hugetlb_pte_present_leaf(hpte));

i.e., hpte must either be none or present and leaf-level.

>
> > +
> > + if (!huge_pte_none(old_entry))
> > + hpage = pte_page(old_entry);
> > +
> > + BUG_ON(!IS_ALIGNED(start, desired_sz));
> > + BUG_ON(!IS_ALIGNED(end, desired_sz));
> > +
> > + for (curr = start; curr < end;) {
> > + struct hstate *tmp_h;
> > + unsigned int shift;
> > +
> > + for_each_hgm_shift(h, tmp_h, shift) {
> > + unsigned long sz = 1UL << shift;
> > +
> > + if (!IS_ALIGNED(curr, sz) || curr + sz > end)
> > + continue;
> > + /*
> > + * If we are including `addr`, we need to make sure
> > + * splitting down to the correct size. Go to a smaller
> > + * size if we are not.
> > + */
> > + if (curr <= addr && curr + sz > addr &&
> > + shift > desired_shift)
> > + continue;
> > +
> > + /*
> > + * Continue the page table walk to the level we want,
> > + * allocate PT structures as we go.
> > + */
> > + hugetlb_pte_copy(&new_hpte, hpte);
> > + ret = hugetlb_walk_to(mm, &new_hpte, curr, sz,
> > + /*stop_at_none=*/false);
> > + if (ret)
> > + goto err;
> > + BUG_ON(hugetlb_pte_size(&new_hpte) != sz);
> > + if (hpage) {
> > + pte_t new_entry;
> > +
> > + subpage = hugetlb_find_subpage(h, hpage, curr);
> > + new_entry = make_huge_pte_with_shift(vma, subpage,
> > + huge_pte_write(old_entry),
> > + shift);
> > + set_huge_pte_at(mm, curr, new_hpte.ptep, new_entry);
> > + }
> > + curr += sz;
> > + goto next;
> > + }
> > + /* We couldn't find a size that worked. */
> > + BUG();
> > +next:
> > + continue;
> > + }
> > +
> > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> > + start, end);
> > + mmu_notifier_invalidate_range_start(&range);
> > + return 0;
> > +err:
> > + tlb_gather_mmu(&tlb, mm);
> > + /* Free any newly allocated page table entries. */
> > + hugetlb_free_range(&tlb, hpte, start, curr);
> > + /* Restore the old entry. */
> > + set_huge_pte_at(mm, start, hpte->ptep, old_entry);
> > + tlb_finish_mmu(&tlb);
> > + return ret;
> > +}
> > #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> >
> > /*

2022-06-29 18:46:01

by James Houghton

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping

On Tue, Jun 28, 2022 at 10:56 AM Dr. David Alan Gilbert
<[email protected]> wrote:
>
> * Mina Almasry ([email protected]) wrote:
> > On Mon, Jun 27, 2022 at 9:27 AM James Houghton <[email protected]> wrote:
> > >
> > > On Fri, Jun 24, 2022 at 11:41 AM Mina Almasry <[email protected]> wrote:
> > > >
> > > > On Fri, Jun 24, 2022 at 10:37 AM James Houghton <[email protected]> wrote:
> > > > >
> > > > > [trimmed...]
> > > > > ---- Userspace API ----
> > > > >
> > > > > This patch series introduces a single way to take advantage of
> > > > > high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> > > > > userspace to resolve MINOR page faults on shared VMAs.
> > > > >
> > > > > To collapse a HugeTLB address range that has been mapped with several
> > > > > UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> > > > > userspace to know when all pages (that they care about) have been fetched.
> > > > >
> > > >
> > > > Thanks James! Cover letter looks good. A few questions:
> > > >
> > > > Why not have the kernel collapse the hugepage once all the 4K pages
> > > > have been fetched automatically? It would remove the need for a new
> > > > userspace API, and AFAICT there aren't really any cases where it is
> > > > beneficial to have a hugepage sharded into 4K mappings when those
> > > > mappings can be collapsed.
> > >
> > > The reason that we don't automatically collapse mappings is because it
> > > would take additional complexity, and it is less flexible. Consider
> > > the case of 1G pages on x86: currently, userspace can collapse the
> > > whole page when it's all ready, but they can also choose to collapse a
> > > 2M piece of it. On architectures with more supported hugepage sizes
> > > (e.g., arm64), userspace has even more possibilities for when to
> > > collapse. This likely further complicates a potential
> > > automatic-collapse solution. Userspace may also want to collapse the
> > > mapping for an entire hugepage without completely mapping the hugepage
> > > first (this would also be possible by issuing UFFDIO_CONTINUE on all
> > > the holes, though).
> > >
> >
> > To be honest I don't think I'm a fan of this. I don't think this
> > saves complexity, but rather pushes it to the userspace. I.e. the
> > userspace now must track which regions are faulted in and which are
> > not to call MADV_COLLAPSE at the right time. Also, if the userspace
> > gets it wrong it may accidentally not call MADV_COLLAPSE (and not get
> > any hugepages) or call MADV_COLLAPSE too early and have to deal with a
> > storm of maybe hundreds of minor faults at once which may take too
> > long to resolve and may impact guest stability, yes?
>
> I think it depends on whether the userspace is already holding bitmaps
> and data structures to let it know when the right time to call collapse
> is; if it already has to do all that bookkeeping for its own postcopy
> or whatever process, then getting userspace to call it is easy.
> (I don't know whether it does!)

Userspace generally has a lot of information about which pages have
been UFFDIO_CONTINUE'd, but it may not have the information (say,
some atomic count per hpage) to tell it exactly when to collapse.

I think it's worth discussing the tmpfs/THP case right now, too. Right
now, after userfaultfd post-copy, all the THPs we have will be
PTE-mapped. To deal with this, we need to use Zach's MADV_COLLAPSE to
collapse the mappings to PMD mappings (we don't want to wait for
khugepaged to happen upon them -- we want good performance ASAP :)).
In fact, IIUC, khugepaged actually won't collapse these *ever* right
now. I suppose we could enlighten tmpfs's UFFDIO_CONTINUE to
automatically collapse too (thus avoiding the need for MADV_COLLAPSE),
but that could be complicated/unwanted (if that is something we might
want, maybe we should have a separate discussion).

So, as it stands today, we intend to use MADV_COLLAPSE explicitly in
the tmpfs case as soon as it is supported, and so it follows that it's
ok to require userspace to do the same thing for HugeTLBFS-backed
memory.
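
For reference, the userspace side of this is just one madvise() call per
fully-populated region -- a sketch, assuming the MADV_COLLAPSE advice from [2]
lands with these semantics (the constant below is a placeholder until it is in
the uapi headers):

	#include <stdio.h>
	#include <sys/mman.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE 25	/* placeholder; use whatever [2] finally defines */
	#endif

	/*
	 * Collapse one hugepage-aligned region once userspace knows every piece
	 * of it has been UFFDIO_CONTINUE'd (or is otherwise fully populated).
	 */
	static int collapse_region(void *addr, size_t len)
	{
		if (madvise(addr, len, MADV_COLLAPSE)) {
			perror("madvise(MADV_COLLAPSE)");
			return -1;
		}
		return 0;
	}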

>
> Dave
>
> > For these reasons I think automatic collapsing is something that will
> > eventually be implemented by us or someone else, and at that point
> > MADV_COLLAPSE for hugetlb memory will become obsolete; i.e. this patch
> > is adding a userspace API that will probably need to be maintained for
> > perpetuity but actually is likely going to be going obsolete "soon".
> > For this reason I had hoped that automatic collapsing would come with
> > V1.

Small, unimportant clarification: the API, as described here, won't be
*completely* meaningless if we end up implementing automatic
collapsing :) It still has the effect of not requiring other
UFFDIO_CONTINUE operations to be done for the collapsed region.

> >
> > I wonder if we can have a very simple first try at automatic
> > collapsing for V1? I.e., can we support collapsing to the hstate size
> > and only that? So 4K pages can only be either collapsed to 2MB or 1G
> > on x86 depending on the hstate size. I think this may be not too
> > difficult to implement: we can have a counter similar to mapcount that
> > tracks how many of the subpages are mapped (subpage_mapcount). Once
> > all the subpages are mapped (the counter reaches a certain value),
> > trigger collapsing similar to hstate size MADV_COLLAPSE.
> >

In my estimation, to implement automatic collapsing, for one VMA we
will need a per-hstate count; when the count reaches the maximum
number, we collapse automatically to the next most optimal size. So if
we finish filling in enough PTEs for a CONT_PTE, we will collapse to a
CONT_PTE. If we finish filling up CONT_PTEs to a PMD, then collapse to
a PMD.

If you are suggesting to only collapse to the hstate size at the end,
then we lose flexibility.

> > I gather that no one else reviewing this has raised this issue thus
> > far so it might not be a big deal and I will continue to review the
> > RFC, but I had hoped for automatic collapsing myself for the reasons
> > above.

Thanks for the thorough review, Mina. :)

> >
> > > >
> > > > > ---- HugeTLB Changes ----
> > > > >
> > > > > - Mapcount
> > > > > The way mapcount is handled is different from the way that it was handled
> > > > > before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> > > > > be increased. This scheme means that, for hugepages that aren't mapped at
> > > > > high granularity, their mapcounts will remain the same as what they would
> > > > > have been pre-HGM.
> > > > >
> > > >
> > > > Sorry, I didn't quite follow this. It says mapcount is handled
> > > > differently, but the same if the page is not mapped at high
> > > > granularity. Can you elaborate on how the mapcount handling will be
> > > > different when the page is mapped at high granularity?
> > >
> > > I guess I didn't phrase this very well. For the sake of simplicity,
> > > consider 1G pages on x86, typically mapped with leaf-level PUDs.
> > > Previously, there were two possibilities for how a hugepage was
> > > mapped, either it was (1) completely mapped (PUD is present and a
> > > leaf), or (2) it wasn't mapped (PUD is none). Now we have a third
> > > case, where the PUD is not none but also not a leaf (this usually
> > > means that the page is partially mapped). We handle this case as if
> > > the whole page was mapped. That is, if we partially map a hugepage
> > > that was previously unmapped (making the PUD point to PMDs), we
> > > increment its mapcount, and if we completely unmap a partially mapped
> > > hugepage (making the PUD none), we decrement its mapcount. If we
> > > collapse a non-leaf PUD to a leaf PUD, we don't change mapcount.
> > >
> > > It is possible for a PUD to be present and not a leaf (mapcount has
> > > been incremented) but for the page to still be unmapped: if the PMDs
> > > (or PTEs) underneath are all none. This case is atypical, and as of
> > > this RFC (without bestowing MADV_DONTNEED with HGM flexibility), I
> > > think it would be very difficult to get this to happen.
> > >
> >
> > Thank you for the detailed explanation. Please add it to the cover letter.
> >
> > I wonder about the case "PUD present but all the PMDs are none": is that a
> > bug? I don't understand the usefulness of that. Not a comment on this
> > patch but rather a curiosity.
> >
> > > >
> > > > > - Page table walking and manipulation
> > > > > A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> > > > > high-granularity mappings. Eventually, it's possible to merge
> > > > > hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> > > > >
> > > > > We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> > > > > This is because we generally need to know the "size" of a PTE (previously
> > > > > always just huge_page_size(hstate)).
> > > > >
> > > > > For every page table manipulation function that has a huge version (e.g.
> > > > > huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> > > > > hugetlb_ptep_get). The correct version is used depending on if a HugeTLB
> > > > > PTE really is "huge".
> > > > >
> > > > > - Synchronization
> > > > > For existing bits of HugeTLB, synchronization is unchanged. For splitting
> > > > > and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
> > > > > writing, and for doing high-granularity page table walks, we require it to
> > > > > be held for reading.
> > > > >
> > > > > ---- Limitations & Future Changes ----
> > > > >
> > > > > This patch series only implements high-granularity mapping for VM_SHARED
> > > > > VMAs. I intend to implement enough HGM to support 4K unmapping for memory
> > > > > failure recovery for both shared and private mappings.
> > > > >
> > > > > The memory failure use case poses its own challenges that can be
> > > > > addressed, but I will do so in a separate RFC.
> > > > >
> > > > > Performance has not been heavily scrutinized with this patch series. There
> > > > > are places where lock contention can significantly reduce performance. This
> > > > > will be addressed later.
> > > > >
> > > > > The patch series, as it stands right now, is compatible with the VMEMMAP
> > > > > page struct optimization[3], as we do not need to modify data contained
> > > > > in the subpage page structs.
> > > > >
> > > > > Other omissions:
> > > > > - Compatibility with userfaultfd write-protect (will be included in v1).
> > > > > - Support for mremap() (will be included in v1). This looks a lot like
> > > > > the support we have for fork().
> > > > > - Documentation changes (will be included in v1).
> > > > > - Completely ignores PMD sharing and hugepage migration (will be included
> > > > > in v1).
> > > > > - Implementations for architectures that don't use GENERAL_HUGETLB other
> > > > > than arm64.
> > > > >
> > > > > ---- Patch Breakdown ----
> > > > >
> > > > > Patch 1 - Preliminary changes
> > > > > Patch 2-10 - HugeTLB HGM core changes
> > > > > Patch 11-13 - HugeTLB HGM page table walking functionality
> > > > > Patch 14-19 - HugeTLB HGM compatibility with other bits
> > > > > Patch 20-23 - Userfaultfd and collapse changes
> > > > > Patch 24-26 - arm64 support and selftests
> > > > >
> > > > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > > > > name. "High-granularity mapping" is not a great name either. I am open
> > > > > to better names.
> > > >
> > > > I would drop 1 extra word and do "granular mapping", as in the mapping
> > > > is more granular than what it normally is (2MB/1G, etc).
> > >
> > > Noted. :)
> >
> --
> Dr. David Alan Gilbert / [email protected] / Manchester, UK
>

2022-06-29 21:17:40

by Axel Rasmussen

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping

On Tue, Jun 28, 2022 at 10:27 AM Mina Almasry <[email protected]> wrote:
>
> On Mon, Jun 27, 2022 at 9:27 AM James Houghton <[email protected]> wrote:
> >
> > On Fri, Jun 24, 2022 at 11:41 AM Mina Almasry <[email protected]> wrote:
> > >
> > > On Fri, Jun 24, 2022 at 10:37 AM James Houghton <[email protected]> wrote:
> > > >
> > > > [trimmed...]
> > > > ---- Userspace API ----
> > > >
> > > > This patch series introduces a single way to take advantage of
> > > > high-granularity mapping: via UFFDIO_CONTINUE. UFFDIO_CONTINUE allows
> > > > userspace to resolve MINOR page faults on shared VMAs.
> > > >
> > > > To collapse a HugeTLB address range that has been mapped with several
> > > > UFFDIO_CONTINUE operations, userspace can issue MADV_COLLAPSE. We expect
> > > > userspace to know when all pages (that they care about) have been fetched.
> > > >
> > >
> > > Thanks James! Cover letter looks good. A few questions:
> > >
> > > Why not have the kernel collapse the hugepage once all the 4K pages
> > > have been fetched automatically? It would remove the need for a new
> > > userspace API, and AFAICT there aren't really any cases where it is
> > > beneficial to have a hugepage sharded into 4K mappings when those
> > > mappings can be collapsed.
> >
> > The reason that we don't automatically collapse mappings is because it
> > would take additional complexity, and it is less flexible. Consider
> > the case of 1G pages on x86: currently, userspace can collapse the
> > whole page when it's all ready, but they can also choose to collapse a
> > 2M piece of it. On architectures with more supported hugepage sizes
> > (e.g., arm64), userspace has even more possibilities for when to
> > collapse. This likely further complicates a potential
> > automatic-collapse solution. Userspace may also want to collapse the
> > mapping for an entire hugepage without completely mapping the hugepage
> > first (this would also be possible by issuing UFFDIO_CONTINUE on all
> > the holes, though).
> >
>
> To be honest I don't think I'm a fan of this. I don't think this
> saves complexity, but rather pushes it to the userspace. I.e. the
> userspace now must track which regions are faulted in and which are
> not to call MADV_COLLAPSE at the right time. Also, if the userspace
> gets it wrong it may accidentally not call MADV_COLLAPSE (and not get
> any hugepages) or call MADV_COLLAPSE too early and have to deal with a
> storm of maybe hundreds of minor faults at once which may take too
> long to resolve and may impact guest stability, yes?

I disagree; I think this is state userspace needs to maintain anyway,
even if we ignore the use case James' series is about.

One example: today, you can't UFFDIO_CONTINUE a region which is
already mapped - you'll get -EEXIST. So, userspace needs to be sure
not to double-continue an area. We could think about relaxing this,
but there's a tradeoff - being more permissive means it's "easier to
use", but, it also means we're less strict about catching potentially
buggy userspaces.

There's another case that I don't see any way to get rid of. The way
live migration at least for GCE works is, we have two things
installing new pages: the on-demand fetcher, which reacts to UFFD
events and resolves them. And then we have the background fetcher,
which goes along and fetches pages which haven't been touched /
requested yet (and which may never be, it's not uncommon for a guest
to have at least *some* pages which are very infrequently / never
touched). In order for the background fetcher to know what pages to
transfer over the network, or not, userspace has to remember which
ones it's already installed.

Another point is, consider the use case of UFFDIO_CONTINUE over
UFFDIO_COPY. When userspace gets a UFFD event for a page, the
assumption is that it's somewhat likely the page is already up to
date, because we already copied it over from the source machine before
we stopped the guest and restarted it running on the target machine
("precopy"). So, we want to maintain a dirty bitmap, which tells us
which pages are clean or not - when we get a UFFD event, we check the
bitmap, and only if the page is dirty do we actually go fetch it over
the network - otherwise we just UFFDIO_CONTINUE and we're done.
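
To make that last flow concrete, the demand paging thread ends up looking
roughly like this (a sketch only; page_is_dirty() and
fetch_page_over_network() are stand-ins for whatever bookkeeping and transport
the VMM already has):

	#include <stdint.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/userfaultfd.h>

	/* Stand-ins for the VMM's existing dirty bitmap / network machinery. */
	extern int page_is_dirty(uint64_t addr);
	extern void fetch_page_over_network(uint64_t addr);
	extern uint64_t fault_granularity;	/* PAGE_SIZE once HGM is in play */

	static void handle_one_minor_fault(int uffd)
	{
		struct uffd_msg msg;
		struct uffdio_continue cont;
		uint64_t addr;

		if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
			return;
		if (msg.event != UFFD_EVENT_PAGEFAULT ||
		    !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR))
			return;

		addr = msg.arg.pagefault.address & ~(fault_granularity - 1);

		/* Only go to the network if precopy left this page stale. */
		if (page_is_dirty(addr))
			fetch_page_over_network(addr);

		/* The contents are now up to date; just map them for the guest. */
		cont.range.start = addr;
		cont.range.len = fault_granularity;
		cont.mode = 0;
		ioctl(uffd, UFFDIO_CONTINUE, &cont);
	}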

>
> For these reasons I think automatic collapsing is something that will
> eventually be implemented by us or someone else, and at that point
> MADV_COLLAPSE for hugetlb memory will become obsolete; i.e. this patch
> is adding a userspace API that will probably need to be maintained for
> perpetuity but actually is likely going to be going obsolete "soon".
> For this reason I had hoped that automatic collapsing would come with
> V1.
>
> I wonder if we can have a very simple first try at automatic
> collapsing for V1? I.e., can we support collapsing to the hstate size
> and only that? So 4K pages can only be either collapsed to 2MB or 1G
> on x86 depending on the hstate size. I think this may be not too
> difficult to implement: we can have a counter similar to mapcount that
> tracks how many of the subpages are mapped (subpage_mapcount). Once
> all the subpages are mapped (the counter reaches a certain value),
> trigger collapsing similar to hstate size MADV_COLLAPSE.

I'm not sure I agree this is likely.

Two problems:

One is, say you UFFDIO_CONTINUE a 4k PTE. If we wanted collapsing to
happen automatically, we'd need to answer the question: is this the
last 4k PTE in a 2M region, so now it can be collapsed? Today the only
way to know is to go check - walk the PTEs. This is expensive, and
it's something we'd have to do on each and every UFFDIO_CONTINUE
operation -- this sucks because we're incurring the cost on every
operation, even though most of them (only 1 / 512, say) the answer
will be "no it wasn't the last one, we can't collapse yet". For
on-demand paging, it's really critical installing the page is as fast
as possible -- in an ideal world it would be exactly as fast as a
"normal" minor fault and the guest would not even be able to tell at
all that it was in the process of being migrated.

Now, as you pointed out, we can just store a mapcount somewhere which
keeps track of how many PTEs in each 2M region are installed or not.
So, then we can more quickly check in UFFDIO_CONTINUE. But, we have
the memory overhead and CPU time overhead of maintaining this
metadata. And, it's not like having the kernel do this means userspace
doesn't have to - like I described above, I think userspace would
*also* need to keep track of this same thing anyway, so now we're
doing it 2x.

Another problem I see is, it seems like collapsing automatically would
involve letting UFFD know a bit too much for my liking about hugetlbfs
internals. It seems to me more ideal to have it know as little as
possible about how hugetlbfs works internally.



Also, there are some benefits to letting userspace decide when / if to
collapse.

For example, userspace might decide it prefers to MADV_COLLAPSE
immediately, in the demand paging thread. Or, it might decide it's
okay to let it be collapsed a bit later, and leave that up to some
other background thread. It might MADV_COLLAPSE as soon as it sees a
complete 2M region, or maybe it wants to batch things up and waits
until it has a full 1G region to collapse. It might also do different
things for different regions, e.g. depending on if they were hot or
cold (demand paged vs. background fetched). I don't see any single
"right way" to do things here, I just see tradeoffs, which userspace
is in a good position to decide on.

>
> I gather that no one else reviewing this has raised this issue thus
> far so it might not be a big deal and I will continue to review the
> RFC, but I had hoped for automatic collapsing myself for the reasons
> above.
>
> > >
> > > > ---- HugeTLB Changes ----
> > > >
> > > > - Mapcount
> > > > The way mapcount is handled is different from the way that it was handled
> > > > before. If the PUD for a hugepage is not none, a hugepage's mapcount will
> > > > be increased. This scheme means that, for hugepages that aren't mapped at
> > > > high granularity, their mapcounts will remain the same as what they would
> > > > have been pre-HGM.
> > > >
> > >
> > > Sorry, I didn't quite follow this. It says mapcount is handled
> > > differently, but the same if the page is not mapped at high
> > > granularity. Can you elaborate on how the mapcount handling will be
> > > different when the page is mapped at high granularity?
> >
> > I guess I didn't phrase this very well. For the sake of simplicity,
> > consider 1G pages on x86, typically mapped with leaf-level PUDs.
> > Previously, there were two possibilities for how a hugepage was
> > mapped, either it was (1) completely mapped (PUD is present and a
> > leaf), or (2) it wasn't mapped (PUD is none). Now we have a third
> > case, where the PUD is not none but also not a leaf (this usually
> > means that the page is partially mapped). We handle this case as if
> > the whole page was mapped. That is, if we partially map a hugepage
> > that was previously unmapped (making the PUD point to PMDs), we
> > increment its mapcount, and if we completely unmap a partially mapped
> > hugepage (making the PUD none), we decrement its mapcount. If we
> > collapse a non-leaf PUD to a leaf PUD, we don't change mapcount.
> >
> > It is possible for a PUD to be present and not a leaf (mapcount has
> > been incremented) but for the page to still be unmapped: if the PMDs
> > (or PTEs) underneath are all none. This case is atypical, and as of
> > this RFC (without bestowing MADV_DONTNEED with HGM flexibility), I
> > think it would be very difficult to get this to happen.
> >
>
> Thank you for the detailed explanation. Please add it to the cover letter.
>
> I wonder about the case "PUD present but all the PMDs are none": is that a
> bug? I don't understand the usefulness of that. Not a comment on this
> patch but rather a curiosity.
>
> > >
> > > > - Page table walking and manipulation
> > > > A new function, hugetlb_walk_to, handles walking HugeTLB page tables for
> > > > high-granularity mappings. Eventually, it's possible to merge
> > > > hugetlb_walk_to with huge_pte_offset and huge_pte_alloc.
> > > >
> > > > We keep track of HugeTLB page table entries with a new struct, hugetlb_pte.
> > > > This is because we generally need to know the "size" of a PTE (previously
> > > > always just huge_page_size(hstate)).
> > > >
> > > > For every page table manipulation function that has a huge version (e.g.
> > > > huge_ptep_get and ptep_get), there is a wrapper for it (e.g.
> > > > hugetlb_ptep_get). The correct version is used depending on if a HugeTLB
> > > > PTE really is "huge".
> > > >
> > > > - Synchronization
> > > > For existing bits of HugeTLB, synchronization is unchanged. For splitting
> > > > and collapsing HugeTLB PTEs, we require that the i_mmap_rw_sem is held for
> > > > writing, and for doing high-granularity page table walks, we require it to
> > > > be held for reading.
> > > >
> > > > ---- Limitations & Future Changes ----
> > > >
> > > > This patch series only implements high-granularity mapping for VM_SHARED
> > > > VMAs. I intend to implement enough HGM to support 4K unmapping for memory
> > > > failure recovery for both shared and private mappings.
> > > >
> > > > The memory failure use case poses its own challenges that can be
> > > > addressed, but I will do so in a separate RFC.
> > > >
> > > > Performance has not been heavily scrutinized with this patch series. There
> > > > are places where lock contention can significantly reduce performance. This
> > > > will be addressed later.
> > > >
> > > > The patch series, as it stands right now, is compatible with the VMEMMAP
> > > > page struct optimization[3], as we do not need to modify data contained
> > > > in the subpage page structs.
> > > >
> > > > Other omissions:
> > > > - Compatibility with userfaultfd write-protect (will be included in v1).
> > > > - Support for mremap() (will be included in v1). This looks a lot like
> > > > the support we have for fork().
> > > > - Documentation changes (will be included in v1).
> > > > - Completely ignores PMD sharing and hugepage migration (will be included
> > > > in v1).
> > > > - Implementations for architectures that don't use GENERAL_HUGETLB other
> > > > than arm64.
> > > >
> > > > ---- Patch Breakdown ----
> > > >
> > > > Patch 1 - Preliminary changes
> > > > Patch 2-10 - HugeTLB HGM core changes
> > > > Patch 11-13 - HugeTLB HGM page table walking functionality
> > > > Patch 14-19 - HugeTLB HGM compatibility with other bits
> > > > Patch 20-23 - Userfaultfd and collapse changes
> > > > Patch 24-26 - arm64 support and selftests
> > > >
> > > > [1] This used to be called HugeTLB double mapping, a bad and confusing
> > > > name. "High-granularity mapping" is not a great name either. I am open
> > > > to better names.
> > >
> > > I would drop 1 extra word and do "granular mapping", as in the mapping
> > > is more granular than what it normally is (2MB/1G, etc).
> >
> > Noted. :)

2022-06-29 21:19:38

by James Houghton

[permalink] [raw]
Subject: Re: [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates

On Wed, Jun 29, 2022 at 2:06 PM Mike Kravetz <[email protected]> wrote:
>
> On 06/29/22 14:39, Muchun Song wrote:
> > On Tue, Jun 28, 2022 at 08:40:27AM -0700, James Houghton wrote:
> > > On Mon, Jun 27, 2022 at 11:42 AM Mike Kravetz <[email protected]> wrote:
> > > >
> > > > On 06/24/22 17:36, James Houghton wrote:
> > > > > When using HugeTLB high-granularity mapping, we need to go through the
> > > > > supported hugepage sizes in decreasing order so that we pick the largest
> > > > > size that works. Consider the case where we're faulting in a 1G hugepage
> > > > > for the first time: we want hugetlb_fault/hugetlb_no_page to map it with
> > > > > a PUD. By going through the sizes in decreasing order, we will find that
> > > > > PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too.
> > > > >
> > > >
> > > > This may/will cause problems for gigantic hugetlb pages allocated at boot
> > > > time. See alloc_bootmem_huge_page() where a pointer to the associated hstate
> > > > is encoded within the allocated hugetlb page. These pages are added to
> > > > hugetlb pools by the routine gather_bootmem_prealloc() which uses the saved
> > > > hstate to prep the gigantic page and add it to the correct pool. Currently,
> > > > gather_bootmem_prealloc is called after hugetlb_init_hstates. So, changing
> > > > hstate order will cause errors.
> > > >
> > > > I do not see any reason why we could not call gather_bootmem_prealloc before
> > > > hugetlb_init_hstates to avoid this issue.
> > >
> > > Thanks for catching this, Mike. Your suggestion certainly seems to
> > > work, but it also seems kind of error prone. I'll have to look at the
> > > code more closely, but maybe it would be better if I just maintained a
> > > separate `struct hstate *sorted_hstate_ptrs[]`, where the original
> >
> > I don't think this is a good idea if you really rely on the order of
> > the initialization in this patch. An easier solution is changing
> > huge_bootmem_page->hstate to huge_bootmem_page->hugepagesz. Then we
> > can use size_to_hstate(huge_bootmem_page->hugepagesz) in
> > gather_bootmem_prealloc().
> >
>
> That is a much better solution. Thanks Muchun!

Indeed. Thank you, Muchun. :)
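
For reference, a rough sketch of my reading of Muchun's suggestion
(illustrative, not an actual patch; the field name comes from his mail,
everything else follows the existing bootmem machinery):

/*
 * Record the page size instead of an hstate pointer, so that sorting
 * hstates[] in hugetlb_init_hstates() cannot invalidate the reference.
 */
struct huge_bootmem_page {
        struct list_head list;
        unsigned long hugepagesz;       /* was: struct hstate *hstate */
};

/* In alloc_bootmem_huge_page():  m->hugepagesz = huge_page_size(h);  */

/* In gather_bootmem_prealloc(), look the hstate back up by size: */
list_for_each_entry(m, &huge_boot_pages, list) {
        struct hstate *h = size_to_hstate(m->hugepagesz);
        /* ... prep the gigantic page and add it to h's pool ... */
}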

>
> --
> Mike Kravetz

2022-06-29 21:35:42

by Mike Kravetz

[permalink] [raw]
Subject: Re: [RFC PATCH 02/26] hugetlb: sort hstates in hugetlb_init_hstates

On 06/29/22 14:39, Muchun Song wrote:
> On Tue, Jun 28, 2022 at 08:40:27AM -0700, James Houghton wrote:
> > On Mon, Jun 27, 2022 at 11:42 AM Mike Kravetz <[email protected]> wrote:
> > >
> > > On 06/24/22 17:36, James Houghton wrote:
> > > > When using HugeTLB high-granularity mapping, we need to go through the
> > > > supported hugepage sizes in decreasing order so that we pick the largest
> > > > size that works. Consider the case where we're faulting in a 1G hugepage
> > > > for the first time: we want hugetlb_fault/hugetlb_no_page to map it with
> > > > a PUD. By going through the sizes in decreasing order, we will find that
> > > > PUD_SIZE works before finding out that PMD_SIZE or PAGE_SIZE work too.
> > > >
> > >
> > > This may/will cause problems for gigantic hugetlb pages allocated at boot
> > > time. See alloc_bootmem_huge_page() where a pointer to the associated hstate
> > > is encoded within the allocated hugetlb page. These pages are added to
> > > hugetlb pools by the routine gather_bootmem_prealloc() which uses the saved
> > > hstate to prep the gigantic page and add it to the correct pool. Currently,
> > > gather_bootmem_prealloc is called after hugetlb_init_hstates. So, changing
> > > hstate order will cause errors.
> > >
> > > I do not see any reason why we could not call gather_bootmem_prealloc before
> > > hugetlb_init_hstates to avoid this issue.
> >
> > Thanks for catching this, Mike. Your suggestion certainly seems to
> > work, but it also seems kind of error prone. I'll have to look at the
> > code more closely, but maybe it would be better if I just maintained a
> > separate `struct hstate *sorted_hstate_ptrs[]`, where the original
>
> I don't think this is a good idea if you really rely on the order of
> the initialization in this patch. An easier solution is changing
> huge_bootmem_page->hstate to huge_bootmem_page->hugepagesz. Then we
> can use size_to_hstate(huge_bootmem_page->hugepagesz) in
> gather_bootmem_prealloc().
>

That is a much better solution. Thanks Muchun!

--
Mike Kravetz

2022-06-30 16:42:00

by Peter Xu

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping

On Tue, Jun 28, 2022 at 09:20:41AM +0100, Dr. David Alan Gilbert wrote:
> One other thing I thought of; you provide the modified 'CONTINUE'
> behaviour, which works for postcopy as long as you use two mappings in
> userspace; one protected by userfault, and one which you do the writes
> to, and then issue the CONTINUE into the protected mapping; that's fine,
> but it's not currently how we have our postcopy code wired up in qemu,
> we have one mapping and use UFFDIO_COPY to place the page.
> Requiring the two mappings is fine, but it's probably worth pointing out
> the need for it somewhere.

It'll be about CONTINUE, maybe not directly related to sub-page mapping,
but indeed that's something we may need to do. It's also in my poc [1]
previously (I never got time to get back to it yet though..).

It's just that two mappings are not required. E.g., one could use a fd on
the file and lseek()/write() to the file to update content rather than
using another mapping. It might be just slower.

Or, IMHO an app can legally just delay faulting of some mapping using minor
mode and maybe the app doesn't even need to modify the page content before
CONTINUE for some reason, then it's even not needed to have either the
other mapping or the fd. Fundamentally, MINOR mode and CONTINUE provides
another way to trap page fault when page cache existed. It doesn't really
define whether or how the data will be modified.

It's just that for QEMU unfortunately we may need to have that two mappings
just for this use case indeed..

[1] https://github.com/xzpeter/qemu/commit/41538a9a8ff5c981af879afe48e4ecca9a1aabc8
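
For completeness, a minimal sketch (illustrative only, not from the
thread) of the two-mapping scheme described above: the handler thread
writes page contents through a second, non-registered mapping of the
same hugetlbfs file, then resolves the MINOR fault with UFFDIO_CONTINUE
on the registered mapping.

#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * guest_map: mapping registered with userfaultfd in MINOR mode.
 * write_map: second mapping of the same hugetlbfs file, not registered.
 * Error handling and alignment checks omitted.
 */
static int resolve_minor_fault(int uffd, char *guest_map, char *write_map,
                               size_t off, const void *src, size_t pagesz)
{
        struct uffdio_continue cont = { 0 };

        /* Populate the page cache through the non-registered mapping. */
        memcpy(write_map + off, src, pagesz);

        /* Map the now-present page cache page into the registered mapping. */
        cont.range.start = (unsigned long)(guest_map + off);
        cont.range.len = pagesz;
        return ioctl(uffd, UFFDIO_CONTINUE, &cont);
}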

Thanks,

--
Peter Xu

2022-06-30 19:36:49

by Peter Xu

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping

On Tue, Jun 28, 2022 at 12:04:28AM +0000, Nadav Amit wrote:
> > [1] commit 824ddc601adc ("userfaultfd: provide unmasked address on page-fault")
>
> Indeed this change of behavior (not aligning to huge-pages when flags is
> not set) was unintentional. If you want to fix it in a separate patch so
> it would be backported, that may be a good idea.

The fix seems to be straightforward, though. Nadav, wanna post a patch
yourself?

That seems to be an accident and it's just that having sub-page mapping
rely on the accident is probably not desirable.. So irrelevant of the
separate patch I'd suggest we keep the requirement on enabling the exact
addr feature for sub-page mapping.
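
For what it's worth, a small sketch (illustrative, not from this series)
of opting in on the userspace side, combining the exact-address feature
added by the commit above with hugetlbfs minor faults:

#include <err.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

/* uffd is a userfaultfd file descriptor, e.g. from
 * syscall(__NR_userfaultfd, O_CLOEXEC). */
static void enable_exact_addr(int uffd)
{
        struct uffdio_api api = {
                .api = UFFD_API,
                .features = UFFD_FEATURE_MINOR_HUGETLBFS |
                            UFFD_FEATURE_EXACT_ADDRESS,
        };

        if (ioctl(uffd, UFFDIO_API, &api))
                err(1, "UFFDIO_API");   /* kernel lacks the feature */
}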

Thanks,

--
Peter Xu

2022-07-01 06:18:11

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC PATCH 00/26] hugetlb: Introduce HugeTLB high-granularity mapping

On Jun 30, 2022, at 12:21 PM, Peter Xu <[email protected]> wrote:

> On Tue, Jun 28, 2022 at 12:04:28AM +0000, Nadav Amit wrote:
>>> [1] commit 824ddc601adc ("userfaultfd: provide unmasked address on page-fault")
>>
>> Indeed this change of behavior (not aligning to huge-pages when flags is
>> not set) was unintentional. If you want to fix it in a separate patch so
>> it would be backported, that may be a good idea.
>
> The fix seems to be straightforward, though. Nadav, wanna post a patch
> yourself?

Yes, even I can do it :)

Just busy right now, so I’ll try to do it over the weekend.

2022-07-07 21:51:51

by Mike Kravetz

[permalink] [raw]
Subject: Re: [RFC PATCH 10/26] hugetlb: add for_each_hgm_shift

On 06/28/22 14:58, Mina Almasry wrote:
> On Fri, Jun 24, 2022 at 10:37 AM James Houghton <[email protected]> wrote:
> >
> > This is a helper macro to loop through all the usable page sizes for a
> > high-granularity-enabled HugeTLB VMA. Given the VMA's hstate, it will
> > loop, in descending order, through the page sizes that HugeTLB supports
> > for this architecture; it always includes PAGE_SIZE.
> >
> > Signed-off-by: James Houghton <[email protected]>
> > ---
> > mm/hugetlb.c | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 8b10b941458d..557b0afdb503 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6989,6 +6989,16 @@ bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> > /* All shared VMAs have HGM enabled. */
> > return vma->vm_flags & VM_SHARED;
> > }
> > +static unsigned int __shift_for_hstate(struct hstate *h)
> > +{
> > + if (h >= &hstates[hugetlb_max_hstate])
> > + return PAGE_SHIFT;
>
> h > &hstates[hugetlb_max_hstate] means that h is out of bounds, no? am
> I missing something here?
>
> So is this intending to do:
>
> if (h == &hstates[hugetlb_max_hstate])
> return PAGE_SHIFT;
>
> ? If so, could we write it as so?
>
> I'm also wondering why __shift_for_hstate(hstate[hugetlb_max_hstate])
> == PAGE_SHIFT? Isn't the last hstate the smallest hstate which should
> be 2MB on x86? Shouldn't this return PMD_SHIFT in that case?
>

I too am missing how this is working for similar reasons.
--
Mike Kravetz

> > + return huge_page_shift(h);
> > +}
> > +#define for_each_hgm_shift(hstate, tmp_h, shift) \
> > + for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
> > + (tmp_h) <= &hstates[hugetlb_max_hstate]; \
> > + (tmp_h)++)
> > #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> >
> > /*
> > --
> > 2.37.0.rc0.161.g10f37bed90-goog
> >

2022-07-07 23:36:37

by Mike Kravetz

[permalink] [raw]
Subject: Re: [RFC PATCH 11/26] hugetlb: add hugetlb_walk_to to do PT walks

On 06/27/22 18:37, manish.mishra wrote:
>
> On 24/06/22 11:06 pm, James Houghton wrote:
> > This adds it for architectures that use GENERAL_HUGETLB, including x86.

I expect this will be used in arch independent code and there will need to
be at least a stub for all architectures?

> >
> > Signed-off-by: James Houghton <[email protected]>
> > ---
> > include/linux/hugetlb.h | 2 ++
> > mm/hugetlb.c | 45 +++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 47 insertions(+)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index e7a6b944d0cc..605aa19d8572 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -258,6 +258,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> > unsigned long addr, unsigned long sz);
> > pte_t *huge_pte_offset(struct mm_struct *mm,
> > unsigned long addr, unsigned long sz);
> > +int hugetlb_walk_to(struct mm_struct *mm, struct hugetlb_pte *hpte,
> > + unsigned long addr, unsigned long sz, bool stop_at_none);
> > int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> > unsigned long *addr, pte_t *ptep);
> > void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 557b0afdb503..3ec2a921ee6f 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6981,6 +6981,51 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
> > return (pte_t *)pmd;
> > }
>
>
> No strong feeling, but this name looks confusing to me, as it does
> not only walk over page tables but can also alloc.
>

Somewhat agree. With this we have:
- huge_pte_offset to walk/lookup a pte
- huge_pte_alloc to allocate ptes
- hugetlb_walk_to which does some/all of both

Do not see anything obviously wrong with the routine, but future
direction would be to combine/clean up these routines with similar
purpose.
--
Mike Kravetz

> > +int hugetlb_walk_to(struct mm_struct *mm, struct hugetlb_pte *hpte,
> > + unsigned long addr, unsigned long sz, bool stop_at_none)
> > +{
> > + pte_t *ptep;
> > +
> > + if (!hpte->ptep) {
> > + pgd_t *pgd = pgd_offset(mm, addr);
> > +
> > + if (!pgd)
> > + return -ENOMEM;
> > + ptep = (pte_t *)p4d_alloc(mm, pgd, addr);
> > + if (!ptep)
> > + return -ENOMEM;
> > + hugetlb_pte_populate(hpte, ptep, P4D_SHIFT);
> > + }
> > +
> > + while (hugetlb_pte_size(hpte) > sz &&
> > + !hugetlb_pte_present_leaf(hpte) &&
> > + !(stop_at_none && hugetlb_pte_none(hpte))) {
>
> Should the ordering of these if-else conditions be reversed? I mean, it would look
> more natural and possibly require fewer condition checks as we go from top to bottom.
>
> > + if (hpte->shift == PMD_SHIFT) {
> > + ptep = pte_alloc_map(mm, (pmd_t *)hpte->ptep, addr);
> > + if (!ptep)
> > + return -ENOMEM;
> > + hpte->shift = PAGE_SHIFT;
> > + hpte->ptep = ptep;
> > + } else if (hpte->shift == PUD_SHIFT) {
> > + ptep = (pte_t *)pmd_alloc(mm, (pud_t *)hpte->ptep,
> > + addr);
> > + if (!ptep)
> > + return -ENOMEM;
> > + hpte->shift = PMD_SHIFT;
> > + hpte->ptep = ptep;
> > + } else if (hpte->shift == P4D_SHIFT) {
> > + ptep = (pte_t *)pud_alloc(mm, (p4d_t *)hpte->ptep,
> > + addr);
> > + if (!ptep)
> > + return -ENOMEM;
> > + hpte->shift = PUD_SHIFT;
> > + hpte->ptep = ptep;
> > + } else
> > + BUG();
> > + }
> > + return 0;
> > +}
> > +
> > #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
> > #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING

2022-07-08 16:10:06

by James Houghton

[permalink] [raw]
Subject: Re: [RFC PATCH 10/26] hugetlb: add for_each_hgm_shift

On Tue, Jun 28, 2022 at 2:58 PM Mina Almasry <[email protected]> wrote:
>
> On Fri, Jun 24, 2022 at 10:37 AM James Houghton <[email protected]> wrote:
> >
> > This is a helper macro to loop through all the usable page sizes for a
> > high-granularity-enabled HugeTLB VMA. Given the VMA's hstate, it will
> > loop, in descending order, through the page sizes that HugeTLB supports
> > for this architecture; it always includes PAGE_SIZE.
> >
> > Signed-off-by: James Houghton <[email protected]>
> > ---
> > mm/hugetlb.c | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 8b10b941458d..557b0afdb503 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6989,6 +6989,16 @@ bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> > /* All shared VMAs have HGM enabled. */
> > return vma->vm_flags & VM_SHARED;
> > }
> > +static unsigned int __shift_for_hstate(struct hstate *h)
> > +{
> > + if (h >= &hstates[hugetlb_max_hstate])
> > + return PAGE_SHIFT;
>
> h > &hstates[hugetlb_max_hstate] means that h is out of bounds, no? am
> I missing something here?

Yeah, it goes out of bounds intentionally. Maybe I should have called
this out. We need for_each_hgm_shift to include PAGE_SHIFT, and there
is no hstate for it. So to handle it, we iterate past the end of the
hstate array, and when we are past the end, we return PAGE_SHIFT and
stop iterating further. This is admittedly kind of gross; if you have
other suggestions for a way to get a clean `for_each_hgm_shift` macro
like this, I'm all ears. :)

>
> So is this intending to do:
>
> > if (h == &hstates[hugetlb_max_hstate])
> return PAGE_SHIFT;
>
> ? If so, could we write it as so?

Yeah, this works. I'll write it this way instead. If that condition is
true, `h` is out of bounds (`hugetlb_max_hstate` is past the end, not
the index for the final element). I guess `hugetlb_max_hstate` is a
bit of a misnomer.

>
> I'm also wondering why __shift_for_hstate(hstate[hugetlb_max_hstate])
> == PAGE_SHIFT? Isn't the last hstate the smallest hstate which should
> be 2MB on x86? Shouldn't this return PMD_SHIFT in that case?

`huge_page_shift(hstate[hugetlb_max_hstate-1])` is PMD_SHIFT on x86.
Actually reading `hstate[hugetlb_max_hstate]` would be bad, which is
why `__shift_for_hstate` exists: to return PAGE_SHIFT when we would
otherwise attempt to compute
`huge_page_shift(hstate[hugetlb_max_hstate])`.

>
> > + return huge_page_shift(h);
> > +}
> > +#define for_each_hgm_shift(hstate, tmp_h, shift) \
> > + for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
> > + (tmp_h) <= &hstates[hugetlb_max_hstate]; \

Note the <= here. If we wanted to always remain inbounds here, we'd
want < instead. But we don't have an hstate for PAGE_SIZE.

> > + (tmp_h)++)
> > #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> >
> > /*
> > --
> > 2.37.0.rc0.161.g10f37bed90-goog
> >

2022-07-09 22:34:12

by Mina Almasry

[permalink] [raw]
Subject: Re: [RFC PATCH 10/26] hugetlb: add for_each_hgm_shift

On Fri, Jul 8, 2022 at 8:52 AM James Houghton <[email protected]> wrote:
>
> On Tue, Jun 28, 2022 at 2:58 PM Mina Almasry <[email protected]> wrote:
> >
> > On Fri, Jun 24, 2022 at 10:37 AM James Houghton <[email protected]> wrote:
> > >
> > > This is a helper macro to loop through all the usable page sizes for a
> > > high-granularity-enabled HugeTLB VMA. Given the VMA's hstate, it will
> > > loop, in descending order, through the page sizes that HugeTLB supports
> > > for this architecture; it always includes PAGE_SIZE.
> > >
> > > Signed-off-by: James Houghton <[email protected]>
> > > ---
> > > mm/hugetlb.c | 10 ++++++++++
> > > 1 file changed, 10 insertions(+)
> > >
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 8b10b941458d..557b0afdb503 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -6989,6 +6989,16 @@ bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
> > > /* All shared VMAs have HGM enabled. */
> > > return vma->vm_flags & VM_SHARED;
> > > }
> > > +static unsigned int __shift_for_hstate(struct hstate *h)
> > > +{
> > > + if (h >= &hstates[hugetlb_max_hstate])
> > > + return PAGE_SHIFT;
> >
> > h > &hstates[hugetlb_max_hstate] means that h is out of bounds, no? am
> > I missing something here?
>
> Yeah, it goes out of bounds intentionally. Maybe I should have called
> this out. We need for_each_hgm_shift to include PAGE_SHIFT, and there
> is no hstate for it. So to handle it, we iterate past the end of the
> hstate array, and when we are past the end, we return PAGE_SHIFT and
> stop iterating further. This is admittedly kind of gross; if you have
> other suggestions for a way to get a clean `for_each_hgm_shift` macro
> like this, I'm all ears. :)
>
> >
> > So is this intending to do:
> >
> > > if (h == &hstates[hugetlb_max_hstate])
> > return PAGE_SHIFT;
> >
> > ? If so, could we write it as so?
>
> Yeah, this works. I'll write it this way instead. If that condition is
> true, `h` is out of bounds (`hugetlb_max_hstate` is past the end, not
> the index for the final element). I guess `hugetlb_max_hstate` is a
> bit of a misnomer.
>
> >
> > I'm also wondering why __shift_for_hstate(hstate[hugetlb_max_hstate])
> > == PAGE_SHIFT? Isn't the last hstate the smallest hstate which should
> > be 2MB on x86? Shouldn't this return PMD_SHIFT in that case?
>
> `huge_page_shift(hstate[hugetlb_max_hstate-1])` is PMD_SHIFT on x86.
> Actually reading `hstate[hugetlb_max_hstate]` would be bad, which is
> why `__shift_for_hstate` exists: to return PAGE_SHIFT when we would
> otherwise attempt to compute
> `huge_page_shift(hstate[hugetlb_max_hstate])`.
>
> >
> > > + return huge_page_shift(h);
> > > +}
> > > +#define for_each_hgm_shift(hstate, tmp_h, shift) \
> > > + for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
> > > + (tmp_h) <= &hstates[hugetlb_max_hstate]; \
>
> Note the <= here. If we wanted to always remain inbounds here, we'd
> want < instead. But we don't have an hstate for PAGE_SIZE.
>

I see, thanks for the explanation. I can see 2 options here to make
the code more understandable:

option (a), don't go past the array. I.e. for_each_hgm_shift() will
loop over all the hugetlb-supported shifts on this arch, and the
calling code falls back to PAGE_SHIFT if the hugetlb page shifts don't
work for it. I admit that could lead to code dup in the calling code,
but I have not gotten to the patch that calls this yet.

option (b), simply add a comment and/or make it more obvious that
you're intentionally going out of bounds, and you want to loop over
PAGE_SHIFT at the end. Something like:

+ /* Returns huge_page_shift(h) if h is a pointer to an hstate in
hstates[] array, PAGE_SHIFT otherwise. */
+static unsigned int __shift_for_hstate(struct hstate *h)
+{
+ if (h < &hstates[0] || h > &hstates[hugetlb_max_hstate - 1])
+ return PAGE_SHIFT;
+ return huge_page_shift(h);
+}
+
+ /* Loops over all the HGM shifts supported on this arch, from the
largest shift possible down to PAGE_SHIFT inclusive. */
+#define for_each_hgm_shift(hstate, tmp_h, shift) \
+ for ((tmp_h) = hstate; (shift) = __shift_for_hstate(tmp_h), \
+ (tmp_h) <= &hstates[hugetlb_max_hstate]; \
+ (tmp_h)++)
#endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */

> > > + (tmp_h)++)
> > > #endif /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
> > >
> > > /*
> > > --
> > > 2.37.0.rc0.161.g10f37bed90-goog
> > >

2022-07-11 23:51:57

by Mike Kravetz

[permalink] [raw]
Subject: Re: [RFC PATCH 19/26] hugetlb: add HGM support for copy_hugetlb_page_range

On 06/24/22 17:36, James Houghton wrote:
> This allows fork() to work with high-granularity mappings. The page
> table structure is copied such that partially mapped regions will remain
> partially mapped in the same way for the new process.
>
> Signed-off-by: James Houghton <[email protected]>
> ---
> mm/hugetlb.c | 74 +++++++++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 59 insertions(+), 15 deletions(-)

FYI -
With https://lore.kernel.org/linux-mm/[email protected]/
copy_hugetlb_page_range() should never be called for shared mappings.
Since HGM only works on shared mappings, code in this patch will never
be executed.

I have a TODO to remove shared mapping support from copy_hugetlb_page_range.
--
Mike Kravetz

2022-07-12 17:40:41

by James Houghton

[permalink] [raw]
Subject: Re: [RFC PATCH 19/26] hugetlb: add HGM support for copy_hugetlb_page_range

On Mon, Jul 11, 2022 at 4:41 PM Mike Kravetz <[email protected]> wrote:
>
> On 06/24/22 17:36, James Houghton wrote:
> > This allows fork() to work with high-granularity mappings. The page
> > table structure is copied such that partially mapped regions will remain
> > partially mapped in the same way for the new process.
> >
> > Signed-off-by: James Houghton <[email protected]>
> > ---
> > mm/hugetlb.c | 74 +++++++++++++++++++++++++++++++++++++++++-----------
> > 1 file changed, 59 insertions(+), 15 deletions(-)
>
> FYI -
> With https://lore.kernel.org/linux-mm/[email protected]/
> copy_hugetlb_page_range() should never be called for shared mappings.
> Since HGM only works on shared mappings, code in this patch will never
> be executed.
>
> I have a TODO to remove shared mapping support from copy_hugetlb_page_range.

Thanks Mike. If I understand things correctly, it seems like I don't
have to do anything to support fork() then; we just don't copy the
page table structure from the old VMA to the new one. That is, as
opposed to having the same bits of the old VMA being mapped in the new
one, the new VMA will have an empty page table. This would slightly
change userfaultfd's behavior on the new VMA, but that seems fine
to me.

- James

> --
> Mike Kravetz

2022-07-12 18:11:09

by Mike Kravetz

[permalink] [raw]
Subject: Re: [RFC PATCH 19/26] hugetlb: add HGM support for copy_hugetlb_page_range

On 07/12/22 10:19, James Houghton wrote:
> On Mon, Jul 11, 2022 at 4:41 PM Mike Kravetz <[email protected]> wrote:
> >
> > On 06/24/22 17:36, James Houghton wrote:
> > > This allows fork() to work with high-granularity mappings. The page
> > > table structure is copied such that partially mapped regions will remain
> > > partially mapped in the same way for the new process.
> > >
> > > Signed-off-by: James Houghton <[email protected]>
> > > ---
> > > mm/hugetlb.c | 74 +++++++++++++++++++++++++++++++++++++++++-----------
> > > 1 file changed, 59 insertions(+), 15 deletions(-)
> >
> > FYI -
> > With https://lore.kernel.org/linux-mm/[email protected]/
> > copy_hugetlb_page_range() should never be called for shared mappings.
> > Since HGM only works on shared mappings, code in this patch will never
> > be executed.
> >
> > I have a TODO to remove shared mapping support from copy_hugetlb_page_range.
>
> Thanks Mike. If I understand things correctly, it seems like I don't
> have to do anything to support fork() then; we just don't copy the
> page table structure from the old VMA to the new one.

Yes, for now. We will not copy the page tables for shared mappings.
When adding support for private mapping, we will need to handle the
HGM case.

> That is, as
> opposed to having the same bits of the old VMA being mapped in the new
> one, the new VMA will have an empty page table. This would slightly
> change userfaultfd's behavior on the new VMA, but that seems fine
> to me.

Right. Since the 'mapping size information' is essentially carried in
the page tables, it will be lost if page tables are not copied.

Not sure if anyone would depend on that behavior.

Axel, this may also impact minor fault processing. Any concerns?
Patch is sitting in Andrew's tree for next merge window.
--
Mike Kravetz

2022-07-15 16:36:35

by Peter Xu

[permalink] [raw]
Subject: Re: [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE

On Fri, Jun 24, 2022 at 05:36:50PM +0000, James Houghton wrote:
> The changes here are very similar to the changes made to
> hugetlb_no_page, where we do a high-granularity page table walk and
> do accounting slightly differently because we are mapping only a piece
> of a page.
>
> Signed-off-by: James Houghton <[email protected]>
> ---
> fs/userfaultfd.c | 3 +++
> include/linux/hugetlb.h | 6 +++--
> mm/hugetlb.c | 54 +++++++++++++++++++++-----------------
> mm/userfaultfd.c | 57 +++++++++++++++++++++++++++++++----------
> 4 files changed, 82 insertions(+), 38 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index e943370107d0..77c1b8a7d0b9 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -245,6 +245,9 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> if (!ptep)
> goto out;
>
> + if (hugetlb_hgm_enabled(vma))
> + goto out;
> +

This is weird. It means we'll never wait for sub-page mapping enabled
vmas. Why?

Not to mention hugetlb_hgm_enabled() currently is simply VM_SHARED, so it
means we'll stop waiting for all shared hugetlbfs uffd page faults..

I'd expect in the in-house postcopy tests you should see vcpu threads
spinning on the page faults until it's serviced.

IMO we still need to properly wait when the pgtable doesn't have the
faulted address covered. For sub-page mapping it'll probably need to walk
into sub-page levels.

> ret = false;
> pte = huge_ptep_get(ptep);
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index ac4ac8fbd901..c207b1ac6195 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -221,13 +221,15 @@ unsigned long hugetlb_total_pages(void);
> vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long address, unsigned int flags);
> #ifdef CONFIG_USERFAULTFD
> -int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
> +int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> + struct hugetlb_pte *dst_hpte,
> struct vm_area_struct *dst_vma,
> unsigned long dst_addr,
> unsigned long src_addr,
> enum mcopy_atomic_mode mode,
> struct page **pagep,
> - bool wp_copy);
> + bool wp_copy,
> + bool new_mapping);
> #endif /* CONFIG_USERFAULTFD */
> bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
> struct vm_area_struct *vma,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 0ec2f231524e..09fa57599233 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5808,6 +5808,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> vma_end_reservation(h, vma, haddr);
> }
>
> + /* This lock will get pretty expensive at 4K. */
> ptl = hugetlb_pte_lock(mm, hpte);
> ret = 0;
> /* If pte changed from under us, retry */
> @@ -6098,24 +6099,26 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> * modifications for huge pages.
> */
> int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> - pte_t *dst_pte,
> + struct hugetlb_pte *dst_hpte,
> struct vm_area_struct *dst_vma,
> unsigned long dst_addr,
> unsigned long src_addr,
> enum mcopy_atomic_mode mode,
> struct page **pagep,
> - bool wp_copy)
> + bool wp_copy,
> + bool new_mapping)
> {
> bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
> struct hstate *h = hstate_vma(dst_vma);
> struct address_space *mapping = dst_vma->vm_file->f_mapping;
> + unsigned long haddr = dst_addr & huge_page_mask(h);
> pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> unsigned long size;
> int vm_shared = dst_vma->vm_flags & VM_SHARED;
> pte_t _dst_pte;
> spinlock_t *ptl;
> int ret = -ENOMEM;
> - struct page *page;
> + struct page *page, *subpage;
> int writable;
> bool page_in_pagecache = false;
>
> @@ -6130,12 +6133,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> * a non-missing case. Return -EEXIST.
> */
> if (vm_shared &&
> - hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
> + hugetlbfs_pagecache_present(h, dst_vma, haddr)) {
> ret = -EEXIST;
> goto out;
> }
>
> - page = alloc_huge_page(dst_vma, dst_addr, 0);
> + page = alloc_huge_page(dst_vma, haddr, 0);
> if (IS_ERR(page)) {
> ret = -ENOMEM;
> goto out;
> @@ -6151,13 +6154,13 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> /* Free the allocated page which may have
> * consumed a reservation.
> */
> - restore_reserve_on_error(h, dst_vma, dst_addr, page);
> + restore_reserve_on_error(h, dst_vma, haddr, page);
> put_page(page);
>
> /* Allocate a temporary page to hold the copied
> * contents.
> */
> - page = alloc_huge_page_vma(h, dst_vma, dst_addr);
> + page = alloc_huge_page_vma(h, dst_vma, haddr);
> if (!page) {
> ret = -ENOMEM;
> goto out;
> @@ -6171,14 +6174,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> }
> } else {
> if (vm_shared &&
> - hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
> + hugetlbfs_pagecache_present(h, dst_vma, haddr)) {
> put_page(*pagep);
> ret = -EEXIST;
> *pagep = NULL;
> goto out;
> }
>
> - page = alloc_huge_page(dst_vma, dst_addr, 0);
> + page = alloc_huge_page(dst_vma, haddr, 0);
> if (IS_ERR(page)) {
> ret = -ENOMEM;
> *pagep = NULL;
> @@ -6216,8 +6219,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> page_in_pagecache = true;
> }
>
> - ptl = huge_pte_lockptr(huge_page_shift(h), dst_mm, dst_pte);
> - spin_lock(ptl);
> + ptl = hugetlb_pte_lock(dst_mm, dst_hpte);
>
> /*
> * Recheck the i_size after holding PT lock to make sure not
> @@ -6239,14 +6241,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> * registered, we firstly wr-protect a none pte which has no page cache
> * page backing it, then access the page.
> */
> - if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
> + if (!hugetlb_pte_none_mostly(dst_hpte))
> goto out_release_unlock;
>
> - if (vm_shared) {
> - page_dup_file_rmap(page, true);
> - } else {
> - ClearHPageRestoreReserve(page);
> - hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
> + if (new_mapping) {

IIUC you wanted to avoid the mapcount accounting when it's the sub-page
that was going to be mapped.

Is it a must we get this only from the caller? Can we know we're doing
sub-page mapping already here and make a decision with e.g. dst_hpte?

It looks weird to me to pass this explicitly from the caller, especially
that's when we don't really have the pgtable lock so I'm wondering about
possible race conditions too on having stale new_mapping values.

> + if (vm_shared) {
> + page_dup_file_rmap(page, true);
> + } else {
> + ClearHPageRestoreReserve(page);
> + hugepage_add_new_anon_rmap(page, dst_vma, haddr);
> + }
> }
>
> /*
> @@ -6258,7 +6262,11 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> else
> writable = dst_vma->vm_flags & VM_WRITE;
>
> - _dst_pte = make_huge_pte(dst_vma, page, writable);
> + subpage = hugetlb_find_subpage(h, page, dst_addr);
> + if (subpage != page)
> + BUG_ON(!hugetlb_hgm_enabled(dst_vma));
> +
> + _dst_pte = make_huge_pte(dst_vma, subpage, writable);
> /*
> * Always mark UFFDIO_COPY page dirty; note that this may not be
> * extremely important for hugetlbfs for now since swapping is not
> @@ -6271,14 +6279,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> if (wp_copy)
> _dst_pte = huge_pte_mkuffd_wp(_dst_pte);
>
> - set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
> + set_huge_pte_at(dst_mm, dst_addr, dst_hpte->ptep, _dst_pte);
>
> - (void)huge_ptep_set_access_flags(dst_vma, dst_addr, dst_pte, _dst_pte,
> - dst_vma->vm_flags & VM_WRITE);
> - hugetlb_count_add(pages_per_huge_page(h), dst_mm);
> + (void)huge_ptep_set_access_flags(dst_vma, dst_addr, dst_hpte->ptep,
> + _dst_pte, dst_vma->vm_flags & VM_WRITE);
> + hugetlb_count_add(hugetlb_pte_size(dst_hpte) / PAGE_SIZE, dst_mm);
>
> /* No need to invalidate - it was non-present before */
> - update_mmu_cache(dst_vma, dst_addr, dst_pte);
> + update_mmu_cache(dst_vma, dst_addr, dst_hpte->ptep);
>
> spin_unlock(ptl);
> if (!is_continue)
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 4f4892a5f767..ee40d98068bf 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -310,14 +310,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> {
> int vm_shared = dst_vma->vm_flags & VM_SHARED;
> ssize_t err;
> - pte_t *dst_pte;
> unsigned long src_addr, dst_addr;
> long copied;
> struct page *page;
> - unsigned long vma_hpagesize;
> + unsigned long vma_hpagesize, vma_altpagesize;
> pgoff_t idx;
> u32 hash;
> struct address_space *mapping;
> + bool use_hgm = hugetlb_hgm_enabled(dst_vma) &&
> + mode == MCOPY_ATOMIC_CONTINUE;
> + struct hstate *h = hstate_vma(dst_vma);
>
> /*
> * There is no default zero huge page for all huge page sizes as
> @@ -335,12 +337,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> copied = 0;
> page = NULL;
> vma_hpagesize = vma_kernel_pagesize(dst_vma);
> + if (use_hgm)
> + vma_altpagesize = PAGE_SIZE;

Do we need to check the "len" to know whether we should use sub-page
mapping or original hpage size? E.g. any old UFFDIO_CONTINUE code will
still want the old behavior I think.

> + else
> + vma_altpagesize = vma_hpagesize;
>
> /*
> * Validate alignment based on huge page size
> */
> err = -EINVAL;
> - if (dst_start & (vma_hpagesize - 1) || len & (vma_hpagesize - 1))
> + if (dst_start & (vma_altpagesize - 1) || len & (vma_altpagesize - 1))
> goto out_unlock;
>
> retry:
> @@ -361,6 +367,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> vm_shared = dst_vma->vm_flags & VM_SHARED;
> }
>
> + BUG_ON(!vm_shared && use_hgm);
> +
> /*
> * If not shared, ensure the dst_vma has a anon_vma.
> */
> @@ -371,11 +379,13 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> }
>
> while (src_addr < src_start + len) {
> + struct hugetlb_pte hpte;
> + bool new_mapping;
> BUG_ON(dst_addr >= dst_start + len);
>
> /*
> * Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
> - * i_mmap_rwsem ensures the dst_pte remains valid even
> + * i_mmap_rwsem ensures the hpte.ptep remains valid even
> * in the case of shared pmds. fault mutex prevents
> * races with other faulting threads.
> */
> @@ -383,27 +393,47 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> i_mmap_lock_read(mapping);
> idx = linear_page_index(dst_vma, dst_addr);
> hash = hugetlb_fault_mutex_hash(mapping, idx);
> + /* This lock will get expensive at 4K. */
> mutex_lock(&hugetlb_fault_mutex_table[hash]);
>
> - err = -ENOMEM;
> - dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
> - if (!dst_pte) {
> + err = 0;
> +
> + pte_t *ptep = huge_pte_alloc(dst_mm, dst_vma, dst_addr,
> + vma_hpagesize);
> + if (!ptep)
> + err = -ENOMEM;
> + else {
> + hugetlb_pte_populate(&hpte, ptep,
> + huge_page_shift(h));
> + /*
> + * If the hstate-level PTE is not none, then a mapping
> + * was previously established.
> + * The per-hpage mutex prevents double-counting.
> + */
> + new_mapping = hugetlb_pte_none(&hpte);
> + if (use_hgm)
> + err = hugetlb_alloc_largest_pte(&hpte, dst_mm, dst_vma,
> + dst_addr,
> + dst_start + len);
> + }
> +
> + if (err) {
> mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> i_mmap_unlock_read(mapping);
> goto out_unlock;
> }
>
> if (mode != MCOPY_ATOMIC_CONTINUE &&
> - !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
> + !hugetlb_pte_none_mostly(&hpte)) {
> err = -EEXIST;
> mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> i_mmap_unlock_read(mapping);
> goto out_unlock;
> }
>
> - err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
> + err = hugetlb_mcopy_atomic_pte(dst_mm, &hpte, dst_vma,
> dst_addr, src_addr, mode, &page,
> - wp_copy);
> + wp_copy, new_mapping);
>
> mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> i_mmap_unlock_read(mapping);
> @@ -413,6 +443,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> if (unlikely(err == -ENOENT)) {
> mmap_read_unlock(dst_mm);
> BUG_ON(!page);
> + BUG_ON(hpte.shift != huge_page_shift(h));
>
> err = copy_huge_page_from_user(page,
> (const void __user *)src_addr,
> @@ -430,9 +461,9 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> BUG_ON(page);
>
> if (!err) {
> - dst_addr += vma_hpagesize;
> - src_addr += vma_hpagesize;
> - copied += vma_hpagesize;
> + dst_addr += hugetlb_pte_size(&hpte);
> + src_addr += hugetlb_pte_size(&hpte);
> + copied += hugetlb_pte_size(&hpte);
>
> if (fatal_signal_pending(current))
> err = -EINTR;
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>

--
Peter Xu

2022-07-15 17:17:24

by James Houghton

[permalink] [raw]
Subject: Re: [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE

On Fri, Jul 15, 2022 at 9:21 AM Peter Xu <[email protected]> wrote:
>
> On Fri, Jun 24, 2022 at 05:36:50PM +0000, James Houghton wrote:
> > The changes here are very similar to the changes made to
> > hugetlb_no_page, where we do a high-granularity page table walk and
> > do accounting slightly differently because we are mapping only a piece
> > of a page.
> >
> > Signed-off-by: James Houghton <[email protected]>
> > ---
> > fs/userfaultfd.c | 3 +++
> > include/linux/hugetlb.h | 6 +++--
> > mm/hugetlb.c | 54 +++++++++++++++++++++-----------------
> > mm/userfaultfd.c | 57 +++++++++++++++++++++++++++++++----------
> > 4 files changed, 82 insertions(+), 38 deletions(-)
> >
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index e943370107d0..77c1b8a7d0b9 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -245,6 +245,9 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> > if (!ptep)
> > goto out;
> >
> > + if (hugetlb_hgm_enabled(vma))
> > + goto out;
> > +
>
> This is weird. It means we'll never wait for sub-page mapping enabled
> vmas. Why?
>

`ret` is true in this case, so we're actually *always* waiting.

> Not to mention hugetlb_hgm_enabled() currently is simply VM_SHARED, so it
> means we'll stop waiting for all shared hugetlbfs uffd page faults..
>
> I'd expect in the in-house postcopy tests you should see vcpu threads
> spinning on the page faults until it's serviced.
>
> IMO we still need to properly wait when the pgtable doesn't have the
> faulted address covered. For sub-page mapping it'll probably need to walk
> into sub-page levels.

Ok, SGTM. I'll do that for the next version. I'm not sure of the
consequences of returning `true` here when we should be returning
`false`.

>
> > ret = false;
> > pte = huge_ptep_get(ptep);
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index ac4ac8fbd901..c207b1ac6195 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -221,13 +221,15 @@ unsigned long hugetlb_total_pages(void);
> > vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > unsigned long address, unsigned int flags);
> > #ifdef CONFIG_USERFAULTFD
> > -int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
> > +int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > + struct hugetlb_pte *dst_hpte,
> > struct vm_area_struct *dst_vma,
> > unsigned long dst_addr,
> > unsigned long src_addr,
> > enum mcopy_atomic_mode mode,
> > struct page **pagep,
> > - bool wp_copy);
> > + bool wp_copy,
> > + bool new_mapping);
> > #endif /* CONFIG_USERFAULTFD */
> > bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
> > struct vm_area_struct *vma,
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 0ec2f231524e..09fa57599233 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -5808,6 +5808,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> > vma_end_reservation(h, vma, haddr);
> > }
> >
> > + /* This lock will get pretty expensive at 4K. */
> > ptl = hugetlb_pte_lock(mm, hpte);
> > ret = 0;
> > /* If pte changed from under us, retry */
> > @@ -6098,24 +6099,26 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > * modifications for huge pages.
> > */
> > int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > - pte_t *dst_pte,
> > + struct hugetlb_pte *dst_hpte,
> > struct vm_area_struct *dst_vma,
> > unsigned long dst_addr,
> > unsigned long src_addr,
> > enum mcopy_atomic_mode mode,
> > struct page **pagep,
> > - bool wp_copy)
> > + bool wp_copy,
> > + bool new_mapping)
> > {
> > bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
> > struct hstate *h = hstate_vma(dst_vma);
> > struct address_space *mapping = dst_vma->vm_file->f_mapping;
> > + unsigned long haddr = dst_addr & huge_page_mask(h);
> > pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> > unsigned long size;
> > int vm_shared = dst_vma->vm_flags & VM_SHARED;
> > pte_t _dst_pte;
> > spinlock_t *ptl;
> > int ret = -ENOMEM;
> > - struct page *page;
> > + struct page *page, *subpage;
> > int writable;
> > bool page_in_pagecache = false;
> >
> > @@ -6130,12 +6133,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > * a non-missing case. Return -EEXIST.
> > */
> > if (vm_shared &&
> > - hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
> > + hugetlbfs_pagecache_present(h, dst_vma, haddr)) {
> > ret = -EEXIST;
> > goto out;
> > }
> >
> > - page = alloc_huge_page(dst_vma, dst_addr, 0);
> > + page = alloc_huge_page(dst_vma, haddr, 0);
> > if (IS_ERR(page)) {
> > ret = -ENOMEM;
> > goto out;
> > @@ -6151,13 +6154,13 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > /* Free the allocated page which may have
> > * consumed a reservation.
> > */
> > - restore_reserve_on_error(h, dst_vma, dst_addr, page);
> > + restore_reserve_on_error(h, dst_vma, haddr, page);
> > put_page(page);
> >
> > /* Allocate a temporary page to hold the copied
> > * contents.
> > */
> > - page = alloc_huge_page_vma(h, dst_vma, dst_addr);
> > + page = alloc_huge_page_vma(h, dst_vma, haddr);
> > if (!page) {
> > ret = -ENOMEM;
> > goto out;
> > @@ -6171,14 +6174,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > }
> > } else {
> > if (vm_shared &&
> > - hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
> > + hugetlbfs_pagecache_present(h, dst_vma, haddr)) {
> > put_page(*pagep);
> > ret = -EEXIST;
> > *pagep = NULL;
> > goto out;
> > }
> >
> > - page = alloc_huge_page(dst_vma, dst_addr, 0);
> > + page = alloc_huge_page(dst_vma, haddr, 0);
> > if (IS_ERR(page)) {
> > ret = -ENOMEM;
> > *pagep = NULL;
> > @@ -6216,8 +6219,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > page_in_pagecache = true;
> > }
> >
> > - ptl = huge_pte_lockptr(huge_page_shift(h), dst_mm, dst_pte);
> > - spin_lock(ptl);
> > + ptl = hugetlb_pte_lock(dst_mm, dst_hpte);
> >
> > /*
> > * Recheck the i_size after holding PT lock to make sure not
> > @@ -6239,14 +6241,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > * registered, we firstly wr-protect a none pte which has no page cache
> > * page backing it, then access the page.
> > */
> > - if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
> > + if (!hugetlb_pte_none_mostly(dst_hpte))
> > goto out_release_unlock;
> >
> > - if (vm_shared) {
> > - page_dup_file_rmap(page, true);
> > - } else {
> > - ClearHPageRestoreReserve(page);
> > - hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
> > + if (new_mapping) {
>
> IIUC you wanted to avoid the mapcount accounting when it's the sub-page
> that was going to be mapped.
>
> Is it a must we get this only from the caller? Can we know we're doing
> sub-page mapping already here and make a decision with e.g. dst_hpte?
>
> It looks weird to me to pass this explicitly from the caller, especially
> that's when we don't really have the pgtable lock so I'm wondering about
> possible race conditions too on having stale new_mapping values.

The only way to know what the correct value for `new_mapping` should
be is to know if we had to change the hstate-level P*D to non-none to
service this UFFDIO_CONTINUE request. I'll see if there is a nice way
to do that check in `hugetlb_mcopy_atomic_pte`. Right now there is no
race, because we synchronize on the per-hpage mutex.

>
> > + if (vm_shared) {
> > + page_dup_file_rmap(page, true);
> > + } else {
> > + ClearHPageRestoreReserve(page);
> > + hugepage_add_new_anon_rmap(page, dst_vma, haddr);
> > + }
> > }
> >
> > /*
> > @@ -6258,7 +6262,11 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > else
> > writable = dst_vma->vm_flags & VM_WRITE;
> >
> > - _dst_pte = make_huge_pte(dst_vma, page, writable);
> > + subpage = hugetlb_find_subpage(h, page, dst_addr);
> > + if (subpage != page)
> > + BUG_ON(!hugetlb_hgm_enabled(dst_vma));
> > +
> > + _dst_pte = make_huge_pte(dst_vma, subpage, writable);
> > /*
> > * Always mark UFFDIO_COPY page dirty; note that this may not be
> > * extremely important for hugetlbfs for now since swapping is not
> > @@ -6271,14 +6279,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > if (wp_copy)
> > _dst_pte = huge_pte_mkuffd_wp(_dst_pte);
> >
> > - set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
> > + set_huge_pte_at(dst_mm, dst_addr, dst_hpte->ptep, _dst_pte);
> >
> > - (void)huge_ptep_set_access_flags(dst_vma, dst_addr, dst_pte, _dst_pte,
> > - dst_vma->vm_flags & VM_WRITE);
> > - hugetlb_count_add(pages_per_huge_page(h), dst_mm);
> > + (void)huge_ptep_set_access_flags(dst_vma, dst_addr, dst_hpte->ptep,
> > + _dst_pte, dst_vma->vm_flags & VM_WRITE);
> > + hugetlb_count_add(hugetlb_pte_size(dst_hpte) / PAGE_SIZE, dst_mm);
> >
> > /* No need to invalidate - it was non-present before */
> > - update_mmu_cache(dst_vma, dst_addr, dst_pte);
> > + update_mmu_cache(dst_vma, dst_addr, dst_hpte->ptep);
> >
> > spin_unlock(ptl);
> > if (!is_continue)
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 4f4892a5f767..ee40d98068bf 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -310,14 +310,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > {
> > int vm_shared = dst_vma->vm_flags & VM_SHARED;
> > ssize_t err;
> > - pte_t *dst_pte;
> > unsigned long src_addr, dst_addr;
> > long copied;
> > struct page *page;
> > - unsigned long vma_hpagesize;
> > + unsigned long vma_hpagesize, vma_altpagesize;
> > pgoff_t idx;
> > u32 hash;
> > struct address_space *mapping;
> > + bool use_hgm = hugetlb_hgm_enabled(dst_vma) &&
> > + mode == MCOPY_ATOMIC_CONTINUE;
> > + struct hstate *h = hstate_vma(dst_vma);
> >
> > /*
> > * There is no default zero huge page for all huge page sizes as
> > @@ -335,12 +337,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > copied = 0;
> > page = NULL;
> > vma_hpagesize = vma_kernel_pagesize(dst_vma);
> > + if (use_hgm)
> > + vma_altpagesize = PAGE_SIZE;
>
> Do we need to check the "len" to know whether we should use sub-page
> mapping or original hpage size? E.g. any old UFFDIO_CONTINUE code will
> still want the old behavior I think.

I think that's a fair point; however, if we enable HGM and the address
and len happen to be hstate-aligned, we basically do the same thing as
if HGM wasn't enabled. It could be a minor performance optimization to
do `vma_altpagesize=vma_hpagesize` in that case, but in terms of how
the page tables are set up, the end result would be the same.
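
(A possible shape for that optimization, sketched against the hunk
quoted above rather than taken from the RFC:)

        /*
         * Only drop to PAGE_SIZE granularity when the request is not
         * already hstate-aligned; otherwise behave exactly as before.
         */
        if (use_hgm && ((dst_start | len) & (vma_hpagesize - 1)))
                vma_altpagesize = PAGE_SIZE;
        else
                vma_altpagesize = vma_hpagesize;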

>
> > + else
> > + vma_altpagesize = vma_hpagesize;
> >
> > /*
> > * Validate alignment based on huge page size
> > */
> > err = -EINVAL;
> > - if (dst_start & (vma_hpagesize - 1) || len & (vma_hpagesize - 1))
> > + if (dst_start & (vma_altpagesize - 1) || len & (vma_altpagesize - 1))
> > goto out_unlock;
> >
> > retry:
> > @@ -361,6 +367,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > vm_shared = dst_vma->vm_flags & VM_SHARED;
> > }
> >
> > + BUG_ON(!vm_shared && use_hgm);
> > +
> > /*
> > * If not shared, ensure the dst_vma has a anon_vma.
> > */
> > @@ -371,11 +379,13 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > }
> >
> > while (src_addr < src_start + len) {
> > + struct hugetlb_pte hpte;
> > + bool new_mapping;
> > BUG_ON(dst_addr >= dst_start + len);
> >
> > /*
> > * Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
> > - * i_mmap_rwsem ensures the dst_pte remains valid even
> > + * i_mmap_rwsem ensures the hpte.ptep remains valid even
> > * in the case of shared pmds. fault mutex prevents
> > * races with other faulting threads.
> > */
> > @@ -383,27 +393,47 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > i_mmap_lock_read(mapping);
> > idx = linear_page_index(dst_vma, dst_addr);
> > hash = hugetlb_fault_mutex_hash(mapping, idx);
> > + /* This lock will get expensive at 4K. */
> > mutex_lock(&hugetlb_fault_mutex_table[hash]);
> >
> > - err = -ENOMEM;
> > - dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
> > - if (!dst_pte) {
> > + err = 0;
> > +
> > + pte_t *ptep = huge_pte_alloc(dst_mm, dst_vma, dst_addr,
> > + vma_hpagesize);
> > + if (!ptep)
> > + err = -ENOMEM;
> > + else {
> > + hugetlb_pte_populate(&hpte, ptep,
> > + huge_page_shift(h));
> > + /*
> > + * If the hstate-level PTE is not none, then a mapping
> > + * was previously established.
> > + * The per-hpage mutex prevents double-counting.
> > + */
> > + new_mapping = hugetlb_pte_none(&hpte);
> > + if (use_hgm)
> > + err = hugetlb_alloc_largest_pte(&hpte, dst_mm, dst_vma,
> > + dst_addr,
> > + dst_start + len);
> > + }
> > +
> > + if (err) {
> > mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> > i_mmap_unlock_read(mapping);
> > goto out_unlock;
> > }
> >
> > if (mode != MCOPY_ATOMIC_CONTINUE &&
> > - !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
> > + !hugetlb_pte_none_mostly(&hpte)) {
> > err = -EEXIST;
> > mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> > i_mmap_unlock_read(mapping);
> > goto out_unlock;
> > }
> >
> > - err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
> > + err = hugetlb_mcopy_atomic_pte(dst_mm, &hpte, dst_vma,
> > dst_addr, src_addr, mode, &page,
> > - wp_copy);
> > + wp_copy, new_mapping);
> >
> > mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> > i_mmap_unlock_read(mapping);
> > @@ -413,6 +443,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > if (unlikely(err == -ENOENT)) {
> > mmap_read_unlock(dst_mm);
> > BUG_ON(!page);
> > + BUG_ON(hpte.shift != huge_page_shift(h));
> >
> > err = copy_huge_page_from_user(page,
> > (const void __user *)src_addr,
> > @@ -430,9 +461,9 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > BUG_ON(page);
> >
> > if (!err) {
> > - dst_addr += vma_hpagesize;
> > - src_addr += vma_hpagesize;
> > - copied += vma_hpagesize;
> > + dst_addr += hugetlb_pte_size(&hpte);
> > + src_addr += hugetlb_pte_size(&hpte);
> > + copied += hugetlb_pte_size(&hpte);
> >
> > if (fatal_signal_pending(current))
> > err = -EINTR;
> > --
> > 2.37.0.rc0.161.g10f37bed90-goog
> >
>
> --
> Peter Xu
>

Thanks, Peter! :)

- James

2022-07-15 17:46:29

by Peter Xu

[permalink] [raw]
Subject: Re: [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE

On Fri, Jul 15, 2022 at 09:58:10AM -0700, James Houghton wrote:
> On Fri, Jul 15, 2022 at 9:21 AM Peter Xu <[email protected]> wrote:
> >
> > On Fri, Jun 24, 2022 at 05:36:50PM +0000, James Houghton wrote:
> > > The changes here are very similar to the changes made to
> > > hugetlb_no_page, where we do a high-granularity page table walk and
> > > do accounting slightly differently because we are mapping only a piece
> > > of a page.
> > >
> > > Signed-off-by: James Houghton <[email protected]>
> > > ---
> > > fs/userfaultfd.c | 3 +++
> > > include/linux/hugetlb.h | 6 +++--
> > > mm/hugetlb.c | 54 +++++++++++++++++++++-----------------
> > > mm/userfaultfd.c | 57 +++++++++++++++++++++++++++++++----------
> > > 4 files changed, 82 insertions(+), 38 deletions(-)
> > >
> > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > index e943370107d0..77c1b8a7d0b9 100644
> > > --- a/fs/userfaultfd.c
> > > +++ b/fs/userfaultfd.c
> > > @@ -245,6 +245,9 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> > > if (!ptep)
> > > goto out;
> > >
> > > + if (hugetlb_hgm_enabled(vma))
> > > + goto out;
> > > +
> >
> > This is weird. It means we'll never wait for sub-page mapping enabled
> > vmas. Why?
> >
>
> `ret` is true in this case, so we're actually *always* waiting.

Aha! Then I think that's another problem, sorry. :) See below.

>
> > Not to mention hugetlb_hgm_enabled() currently is simply VM_SHARED, so it
> > means we'll stop waiting for all shared hugetlbfs uffd page faults..
> >
> > I'd expect in the in-house postcopy tests you should see vcpu threads
> > spinning on the page faults until it's serviced.
> >
> > IMO we still need to properly wait when the pgtable doesn't have the
> > faulted address covered. For sub-page mapping it'll probably need to walk
> > into sub-page levels.
>
> Ok, SGTM. I'll do that for the next version. I'm not sure of the
> consequences of returning `true` here when we should be returning
> `false`.

We've put ourselves onto the wait queue; if another concurrent
UFFDIO_CONTINUE happened and the pte is already installed, I think this
thread could be waiting forever on the next schedule().

The solution should be the same - walking the sub-page pgtable would work,
afaict.
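
Roughly what I have in mind (only a sketch; hugetlb_walk_to() as posted
also allocates page tables, so a non-allocating variant would be needed
in this path):

	struct hugetlb_pte hpte;

	/* Start from the hstate-level entry we already looked up. */
	hugetlb_pte_populate(&hpte, ptep, huge_page_shift(hstate_vma(vma)));
	if (hugetlb_hgm_enabled(vma))
		/* Walk down to whatever actually covers the faulting address. */
		hugetlb_walk_to(ctx->mm, &hpte, address, PAGE_SIZE,
				/*stop_at_none=*/true);
	/* Only wait if that entry is still none. */
	if (hugetlb_pte_none(&hpte))
		ret = true;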

[...]

> > > @@ -6239,14 +6241,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > > * registered, we firstly wr-protect a none pte which has no page cache
> > > * page backing it, then access the page.
> > > */
> > > - if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
> > > + if (!hugetlb_pte_none_mostly(dst_hpte))
> > > goto out_release_unlock;
> > >
> > > - if (vm_shared) {
> > > - page_dup_file_rmap(page, true);
> > > - } else {
> > > - ClearHPageRestoreReserve(page);
> > > - hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
> > > + if (new_mapping) {
> >
> > IIUC you wanted to avoid the mapcount accountings when it's the sub-page
> > that was going to be mapped.
> >
> > Is it a must we get this only from the caller? Can we know we're doing
> > sub-page mapping already here and make a decision with e.g. dst_hpte?
> >
> > It looks weird to me to pass this explicitly from the caller, especially
> > that's when we don't really have the pgtable lock so I'm wondering about
> > possible race conditions too on having stale new_mapping values.
>
> The only way to know what the correct value for `new_mapping` should
> be is to know if we had to change the hstate-level P*D to non-none to
> service this UFFDIO_CONTINUE request. I'll see if there is a nice way
> to do that check in `hugetlb_mcopy_atomic_pte`.
> Right now there is no

Would "new_mapping = dest_hpte->shift != huge_page_shift(hstate)" work (or
something alike)?

> race, because we synchronize on the per-hpage mutex.

Yeah, I'm not familiar enough with that mutex to tell; as long as it
guarantees no pgtable updates (hmm, then why do we need the pgtable lock
here???) it looks fine.

[...]

> > > @@ -335,12 +337,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > > copied = 0;
> > > page = NULL;
> > > vma_hpagesize = vma_kernel_pagesize(dst_vma);
> > > + if (use_hgm)
> > > + vma_altpagesize = PAGE_SIZE;
> >
> > Do we need to check the "len" to know whether we should use sub-page
> > mapping or original hpage size? E.g. any old UFFDIO_CONTINUE code will
> > still want the old behavior I think.
>
> I think that's a fair point; however, if we enable HGM and the address
> and len happen to be hstate-aligned

The address can, but len (note! not "end" here) cannot?

> , we basically do the same thing as
> if HGM wasn't enabled. It could be a minor performance optimization to
> do `vma_altpagesize=vma_hpagesize` in that case, but in terms of how
> the page tables are set up, the end result would be the same.

Thanks,

--
Peter Xu

2022-07-15 22:34:32

by Axel Rasmussen

[permalink] [raw]
Subject: Re: [RFC PATCH 19/26] hugetlb: add HGM support for copy_hugetlb_page_range

On Tue, Jul 12, 2022 at 11:07 AM Mike Kravetz <[email protected]> wrote:
>
> On 07/12/22 10:19, James Houghton wrote:
> > On Mon, Jul 11, 2022 at 4:41 PM Mike Kravetz <[email protected]> wrote:
> > >
> > > On 06/24/22 17:36, James Houghton wrote:
> > > > This allows fork() to work with high-granularity mappings. The page
> > > > table structure is copied such that partially mapped regions will remain
> > > > partially mapped in the same way for the new process.
> > > >
> > > > Signed-off-by: James Houghton <[email protected]>
> > > > ---
> > > > mm/hugetlb.c | 74 +++++++++++++++++++++++++++++++++++++++++-----------
> > > > 1 file changed, 59 insertions(+), 15 deletions(-)
> > >
> > > FYI -
> > > With https://lore.kernel.org/linux-mm/[email protected]/
> > > copy_hugetlb_page_range() should never be called for shared mappings.
> > > Since HGM only works on shared mappings, code in this patch will never
> > > be executed.
> > >
> > > I have a TODO to remove shared mapping support from copy_hugetlb_page_range.
> >
> > Thanks Mike. If I understand things correctly, it seems like I don't
> > have to do anything to support fork() then; we just don't copy the
> > page table structure from the old VMA to the new one.
>
> Yes, for now. We will not copy the page tables for shared mappings.
> When adding support for private mapping, we will need to handle the
> HGM case.
>
> > That is, as
> > opposed to having the same bits of the old VMA being mapped in the new
> > one, the new VMA will have an empty page table. This would slightly
> > change how userfaultfd's behavior on the new VMA, but that seems fine
> > to me.
>
> Right. Since the 'mapping size information' is essentially carried in
> the page tables, it will be lost if page tables are not copied.
>
> Not sure if anyone would depend on that behavior.
>
> Axel, this may also impact minor fault processing. Any concerns?
> Patch is sitting in Andrew's tree for next merge window.

Sorry for the slow response, just catching up a bit here. :)

If I understand correctly, let's say we have a process where some
hugetlb pages are fully mapped (pages are in page cache, page table
entries exist). Once we fork(), we will (going forward) no longer copy the
page table entries, but I assume we still set up the underlying pages for
CoW. So I guess this means no fault would happen in the old process
if the memory was touched, but in the forked process it would generate
a minor fault?

To me that seems fine. When userspace gets a minor fault it's always
fine for it to just say "don't care, just UFFDIO_CONTINUE, no work
needed". For VM migration I don't think it's unreasonable to expect
userspace to remember whether or not the page is clean (it already
does this anyway) and whether or not a fork (without exec) had
happened. It seems to me it should work fine.

> --
> Mike Kravetz

2022-07-20 21:47:51

by James Houghton

[permalink] [raw]
Subject: Re: [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE

On Fri, Jul 15, 2022 at 10:20 AM Peter Xu <[email protected]> wrote:
>
> On Fri, Jul 15, 2022 at 09:58:10AM -0700, James Houghton wrote:
> > On Fri, Jul 15, 2022 at 9:21 AM Peter Xu <[email protected]> wrote:
> > >
> > > On Fri, Jun 24, 2022 at 05:36:50PM +0000, James Houghton wrote:
> > > > The changes here are very similar to the changes made to
> > > > hugetlb_no_page, where we do a high-granularity page table walk and
> > > > do accounting slightly differently because we are mapping only a piece
> > > > of a page.
> > > >
> > > > Signed-off-by: James Houghton <[email protected]>
> > > > ---
> > > > fs/userfaultfd.c | 3 +++
> > > > include/linux/hugetlb.h | 6 +++--
> > > > mm/hugetlb.c | 54 +++++++++++++++++++++-----------------
> > > > mm/userfaultfd.c | 57 +++++++++++++++++++++++++++++++----------
> > > > 4 files changed, 82 insertions(+), 38 deletions(-)
> > > >
> > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > > index e943370107d0..77c1b8a7d0b9 100644
> > > > --- a/fs/userfaultfd.c
> > > > +++ b/fs/userfaultfd.c
> > > > @@ -245,6 +245,9 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> > > > if (!ptep)
> > > > goto out;
> > > >
> > > > + if (hugetlb_hgm_enabled(vma))
> > > > + goto out;
> > > > +
> > >
> > > This is weird. It means we'll never wait for sub-page mapping enabled
> > > vmas. Why?
> > >
> >
> > `ret` is true in this case, so we're actually *always* waiting.
>
> Aha! Then I think that's another problem, sorry. :) See Below.
>
> >
> > > Not to mention hugetlb_hgm_enabled() currently is simply VM_SHARED, so it
> > > means we'll stop waiting for all shared hugetlbfs uffd page faults..
> > >
> > > I'd expect in the in-house postcopy tests you should see vcpu threads
> > > spinning on the page faults until it's serviced.
> > >
> > > IMO we still need to properly wait when the pgtable doesn't have the
> > > faulted address covered. For sub-page mapping it'll probably need to walk
> > > into sub-page levels.
> >
> > Ok, SGTM. I'll do that for the next version. I'm not sure of the
> > consequences of returning `true` here when we should be returning
> > `false`.
>
> We've put ourselves onto the wait queue, if another concurrent
> UFFDIO_CONTINUE happened and pte is already installed, I think this thread
> could be waiting forever on the next schedule().
>
> The solution should be the same - walking the sub-page pgtable would work,
> afaict.
>
> [...]
>
> > > > @@ -6239,14 +6241,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > > > * registered, we firstly wr-protect a none pte which has no page cache
> > > > * page backing it, then access the page.
> > > > */
> > > > - if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
> > > > + if (!hugetlb_pte_none_mostly(dst_hpte))
> > > > goto out_release_unlock;
> > > >
> > > > - if (vm_shared) {
> > > > - page_dup_file_rmap(page, true);
> > > > - } else {
> > > > - ClearHPageRestoreReserve(page);
> > > > - hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
> > > > + if (new_mapping) {
> > >
> > > IIUC you wanted to avoid the mapcount accountings when it's the sub-page
> > > that was going to be mapped.
> > >
> > > Is it a must we get this only from the caller? Can we know we're doing
> > > sub-page mapping already here and make a decision with e.g. dst_hpte?
> > >
> > > It looks weird to me to pass this explicitly from the caller, especially
> > > that's when we don't really have the pgtable lock so I'm wondering about
> > > possible race conditions too on having stale new_mapping values.
> >
> > The only way to know what the correct value for `new_mapping` should
> > be is to know if we had to change the hstate-level P*D to non-none to
> > service this UFFDIO_CONTINUE request. I'll see if there is a nice way
> > to do that check in `hugetlb_mcopy_atomic_pte`.
> > Right now there is no
>
> Would "new_mapping = dest_hpte->shift != huge_page_shift(hstate)" work (or
> something alike)?

This works in the hugetlb_fault case, because there we install the
largest PTE possible. If we are mapping a page
for the first time, we will use an hstate-sized PTE. But for
UFFDIO_CONTINUE, we may be installing a 4K PTE as the first PTE for
the whole hpage.
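
To spell it out with the caller-side code from this patch (same hunk as
above, comments added), the check has to happen before the split:

	/* hpte still describes the hstate-level entry here... */
	hugetlb_pte_populate(&hpte, ptep, huge_page_shift(h));
	/* ...so this really asks "was the whole hpage unmapped before?" */
	new_mapping = hugetlb_pte_none(&hpte);
	if (use_hgm)
		/* This may split hpte down to a 4K entry. */
		err = hugetlb_alloc_largest_pte(&hpte, dst_mm, dst_vma,
						dst_addr, dst_start + len);

After the split, hpte.shift may already be PAGE_SHIFT even though no part
of the hpage had ever been mapped, so a shift comparison inside
hugetlb_mcopy_atomic_pte can't recover that information.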

>
> > race, because we synchronize on the per-hpage mutex.
>
> Yeah not familiar with that mutex enough to tell, as long as that mutex
> guarantees no pgtable update (hmm, then why we need the pgtable lock
> here???) then it looks fine.

Let me take a closer look at this. I'll have a more detailed
explanation for the next version of the RFC.

>
> [...]
>
> > > > @@ -335,12 +337,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > > > copied = 0;
> > > > page = NULL;
> > > > vma_hpagesize = vma_kernel_pagesize(dst_vma);
> > > > + if (use_hgm)
> > > > + vma_altpagesize = PAGE_SIZE;
> > >
> > > Do we need to check the "len" to know whether we should use sub-page
> > > mapping or original hpage size? E.g. any old UFFDIO_CONTINUE code will
> > > still want the old behavior I think.
> >
> > I think that's a fair point; however, if we enable HGM and the address
> > and len happen to be hstate-aligned
>
> The address can, but len (note! not "end" here) cannot?

They both (dst_start and len) need to be hpage-aligned, otherwise we
won't be able to install hstate-sized PTEs. Like if we're installing
4K at the beginning of a 1G hpage, we can't install a PUD, because we
only want to install that 4K.
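
To illustrate (a rough sketch only, not the actual hugetlb_alloc_largest_pte()
from this series): the largest PTE we can install at an address is limited
both by the address's alignment and by how much of the requested range is
left:

	static unsigned long largest_pte_size(unsigned long addr, unsigned long end)
	{
		/* Illustrative: 4K base pages with PMD/PUD leaf levels. */
		const unsigned long sizes[] = { PUD_SIZE, PMD_SIZE, PAGE_SIZE };
		int i;

		for (i = 0; i < ARRAY_SIZE(sizes); i++)
			if (!(addr & (sizes[i] - 1)) && addr + sizes[i] <= end)
				return sizes[i];
		return PAGE_SIZE;
	}

So installing 4K at the start of a 1G hpage never yields a PUD, while an
hpage-aligned dst_start/len naturally degenerates to hstate-sized PTEs.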

>
> > , we basically do the same thing as
> > if HGM wasn't enabled. It could be a minor performance optimization to
> > do `vma_altpagesize=vma_hpagesize` in that case, but in terms of how
> > the page tables are set up, the end result would be the same.
>
> Thanks,

Thanks!

>
> --
> Peter Xu
>

2022-07-21 19:16:02

by Peter Xu

[permalink] [raw]
Subject: Re: [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE

On Wed, Jul 20, 2022 at 01:58:06PM -0700, James Houghton wrote:
> > > > > @@ -335,12 +337,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > > > > copied = 0;
> > > > > page = NULL;
> > > > > vma_hpagesize = vma_kernel_pagesize(dst_vma);
> > > > > + if (use_hgm)
> > > > > + vma_altpagesize = PAGE_SIZE;
> > > >
> > > > Do we need to check the "len" to know whether we should use sub-page
> > > > mapping or original hpage size? E.g. any old UFFDIO_CONTINUE code will
> > > > still want the old behavior I think.
> > >
> > > I think that's a fair point; however, if we enable HGM and the address
> > > and len happen to be hstate-aligned
> >
> > The address can, but len (note! not "end" here) cannot?
>
> They both (dst_start and len) need to be hpage-aligned, otherwise we
> won't be able to install hstate-sized PTEs. Like if we're installing
> 4K at the beginning of a 1G hpage, we can't install a PUD, because we
> only want to install that 4K.

I'm still confused...

Shouldn't one of the major goals of sub-page mapping be to grant the user
the capability to do UFFDIO_CONTINUE with len<hpagesize (so we install
pages at sub-page level)? If so, why does len always need to be
hpagesize-aligned?

--
Peter Xu

2022-07-21 19:53:12

by James Houghton

[permalink] [raw]
Subject: Re: [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE

On Thu, Jul 21, 2022 at 12:09 PM Peter Xu <[email protected]> wrote:
>
> On Wed, Jul 20, 2022 at 01:58:06PM -0700, James Houghton wrote:
> > > > > > @@ -335,12 +337,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > > > > > copied = 0;
> > > > > > page = NULL;
> > > > > > vma_hpagesize = vma_kernel_pagesize(dst_vma);
> > > > > > + if (use_hgm)
> > > > > > + vma_altpagesize = PAGE_SIZE;
> > > > >
> > > > > Do we need to check the "len" to know whether we should use sub-page
> > > > > mapping or original hpage size? E.g. any old UFFDIO_CONTINUE code will
> > > > > still want the old behavior I think.
> > > >
> > > > I think that's a fair point; however, if we enable HGM and the address
> > > > and len happen to be hstate-aligned
> > >
> > > The address can, but len (note! not "end" here) cannot?
> >
> > They both (dst_start and len) need to be hpage-aligned, otherwise we
> > won't be able to install hstate-sized PTEs. Like if we're installing
> > 4K at the beginning of a 1G hpage, we can't install a PUD, because we
> > only want to install that 4K.
>
> I'm still confused...
>
> Shouldn't one of the major goals of sub-page mapping is to grant user the
> capability to do UFFDIO_CONTINUE with len<hpagesize (so we install pages in
> sub-page level)? If so, why len needs to be always hpagesize aligned?

Sorry I misunderstood what you were asking. We allow both to be
PAGE_SIZE-aligned. :) That is indeed the goal of HGM.

If dst_start and len were both hpage-aligned, then we *could* set
`use_hgm = false`, and everything would still work. That's what I
thought you were asking about. I don't see any reason to do this
though, as `use_hgm = true` will only grant additional functionality,
and `use_hgm = false` would only -- at best -- be a minor performance
optimization in this case.
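
For reference, the userspace flow we have in mind is just the normal
UFFDIO_CONTINUE ioctl with a sub-hpage range, plus the MADV_COLLAPSE step
described in the cover letter once the whole hugepage has been fetched.
Roughly (a fragment only; uffd is a minor-mode userfaultfd registered on
the mapping, and huge_addr/huge_size/offset are made up for the example):

	struct uffdio_continue cont = {
		.range.start = huge_addr + offset,	/* PAGE_SIZE-aligned */
		.range.len   = 4096,			/* a single base page */
	};

	if (ioctl(uffd, UFFDIO_CONTINUE, &cont))
		err(1, "UFFDIO_CONTINUE");

	/* ... once every 4K piece of the hugepage has been CONTINUEd ... */
	if (madvise((void *)huge_addr, huge_size, MADV_COLLAPSE))
		err(1, "MADV_COLLAPSE");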

- James

>
> --
> Peter Xu
>

2022-07-21 20:07:01

by Peter Xu

[permalink] [raw]
Subject: Re: [RFC PATCH 20/26] hugetlb: add support for high-granularity UFFDIO_CONTINUE

On Thu, Jul 21, 2022 at 12:44:58PM -0700, James Houghton wrote:
> On Thu, Jul 21, 2022 at 12:09 PM Peter Xu <[email protected]> wrote:
> >
> > On Wed, Jul 20, 2022 at 01:58:06PM -0700, James Houghton wrote:
> > > > > > > @@ -335,12 +337,16 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> > > > > > > copied = 0;
> > > > > > > page = NULL;
> > > > > > > vma_hpagesize = vma_kernel_pagesize(dst_vma);
> > > > > > > + if (use_hgm)
> > > > > > > + vma_altpagesize = PAGE_SIZE;
> > > > > >
> > > > > > Do we need to check the "len" to know whether we should use sub-page
> > > > > > mapping or original hpage size? E.g. any old UFFDIO_CONTINUE code will
> > > > > > still want the old behavior I think.
> > > > >
> > > > > I think that's a fair point; however, if we enable HGM and the address
> > > > > and len happen to be hstate-aligned
> > > >
> > > > The address can, but len (note! not "end" here) cannot?
> > >
> > > They both (dst_start and len) need to be hpage-aligned, otherwise we
> > > won't be able to install hstate-sized PTEs. Like if we're installing
> > > 4K at the beginning of a 1G hpage, we can't install a PUD, because we
> > > only want to install that 4K.
> >
> > I'm still confused...
> >
> > Shouldn't one of the major goals of sub-page mapping is to grant user the
> > capability to do UFFDIO_CONTINUE with len<hpagesize (so we install pages in
> > sub-page level)? If so, why len needs to be always hpagesize aligned?
>
> Sorry I misunderstood what you were asking. We allow both to be
> PAGE_SIZE-aligned. :) That is indeed the goal of HGM.

Ah OK. :)

>
> If dst_start and len were both hpage-aligned, then we *could* set
> `use_hgm = false`, and everything would still work. That's what I
> thought you were asking about. I don't see any reason to do this
> though, as `use_hgm = true` will only grant additional functionality,
> and `use_hgm = false` would only -- at best -- be a minor performance
> optimization in this case.

I just want to make sure this patch won't break existing uffd-minor users,
or it'll be a kernel ABI breakage.

We'd still want existing compiled apps (for example) to run like before,
which IIUC means we should only use sub-page mapping when len!=hpagesize here.

I'm not sure it's only about perf - the app may not even be prepared to
receive additional page faults within the same huge page range.
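
IOW something like below (only a sketch, reusing the variables from the
hunk quoted above, or something along these lines):

	/*
	 * Keep the old whole-hpage behavior for existing users; only go
	 * to sub-page granularity when the request isn't hpage-aligned.
	 */
	if (use_hgm &&
	    (dst_start & (vma_hpagesize - 1) || len & (vma_hpagesize - 1)))
		vma_altpagesize = PAGE_SIZE;
	else
		vma_altpagesize = vma_hpagesize;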

--
Peter Xu

2022-09-08 19:13:02

by Peter Xu

[permalink] [raw]
Subject: Re: [RFC PATCH 11/26] hugetlb: add hugetlb_walk_to to do PT walks

On Fri, Jun 24, 2022 at 05:36:41PM +0000, James Houghton wrote:
> This adds it for architectures that use GENERAL_HUGETLB, including x86.
>
> Signed-off-by: James Houghton <[email protected]>
> ---
> include/linux/hugetlb.h | 2 ++
> mm/hugetlb.c | 45 +++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 47 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index e7a6b944d0cc..605aa19d8572 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -258,6 +258,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long addr, unsigned long sz);
> pte_t *huge_pte_offset(struct mm_struct *mm,
> unsigned long addr, unsigned long sz);
> +int hugetlb_walk_to(struct mm_struct *mm, struct hugetlb_pte *hpte,
> + unsigned long addr, unsigned long sz, bool stop_at_none);
> int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long *addr, pte_t *ptep);
> void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 557b0afdb503..3ec2a921ee6f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6981,6 +6981,51 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
> return (pte_t *)pmd;
> }
>
> +int hugetlb_walk_to(struct mm_struct *mm, struct hugetlb_pte *hpte,
> + unsigned long addr, unsigned long sz, bool stop_at_none)
> +{
> + pte_t *ptep;
> +
> + if (!hpte->ptep) {
> + pgd_t *pgd = pgd_offset(mm, addr);
> +
> + if (!pgd)
> + return -ENOMEM;
> + ptep = (pte_t *)p4d_alloc(mm, pgd, addr);
> + if (!ptep)
> + return -ENOMEM;
> + hugetlb_pte_populate(hpte, ptep, P4D_SHIFT);
> + }
> +
> + while (hugetlb_pte_size(hpte) > sz &&
> + !hugetlb_pte_present_leaf(hpte) &&
> + !(stop_at_none && hugetlb_pte_none(hpte))) {
> + if (hpte->shift == PMD_SHIFT) {
> + ptep = pte_alloc_map(mm, (pmd_t *)hpte->ptep, addr);

I had a feeling that the pairing pte_unmap() was lost.

I think most distros don't enable CONFIG_HIGHPTE at all, but still...
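
(For reference, with CONFIG_HIGHPTE the usual pairing is the generic
pattern below; it doesn't directly apply here since hpte->ptep outlives
the walk, so the unmap would have to happen in the callers:)

	pte_t *pte = pte_alloc_map(mm, pmd, addr);

	if (!pte)
		return -ENOMEM;
	/* ... read or update *pte ... */
	pte_unmap(pte);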

> + if (!ptep)
> + return -ENOMEM;
> + hpte->shift = PAGE_SHIFT;
> + hpte->ptep = ptep;
> + } else if (hpte->shift == PUD_SHIFT) {
> + ptep = (pte_t *)pmd_alloc(mm, (pud_t *)hpte->ptep,
> + addr);
> + if (!ptep)
> + return -ENOMEM;
> + hpte->shift = PMD_SHIFT;
> + hpte->ptep = ptep;
> + } else if (hpte->shift == P4D_SHIFT) {
> + ptep = (pte_t *)pud_alloc(mm, (p4d_t *)hpte->ptep,
> + addr);
> + if (!ptep)
> + return -ENOMEM;
> + hpte->shift = PUD_SHIFT;
> + hpte->ptep = ptep;
> + } else
> + BUG();
> + }
> + return 0;
> +}
> +
> #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
>
> #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> --
> 2.37.0.rc0.161.g10f37bed90-goog
>

--
Peter Xu