2023-08-01 14:58:04

by David Hildenbrand

Subject: [PATCH v2 0/8] smaps / mm/gup: fix gup_can_follow_protnone fallout

This is against mm/mm-unstable, but everything except patch #7 and #8
should apply on current master. Especially patch #1 and #2 should go
upstream first, so we can let the other stuff mature a bit longer.


This is the next attempt to handle the fallout of 474098edac26
("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()"), where I
accidentally missed that follow_page() and smaps implicitly kept the
FOLL_NUMA flag clear: unlike the ordinary GUP paths, they never set it
even when FOLL_FORCE was absent, so as not to trigger faults on
PROT_NONE-mapped PTEs.

Patch #1 fixes the known issues by reintroducing FOLL_NUMA as
FOLL_HONOR_NUMA_FAULT and decoupling it from FOLL_FORCE.
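
For illustration, the decoupled check boils down to something like the
following (a sketch of the idea, not necessarily the exact hunk from
patch #1):

  static inline bool gup_can_follow_protnone(struct vm_area_struct *vma,
                                             unsigned int flags)
  {
          /* Callers that don't opt in may follow protnone PTEs/PMDs. */
          if (!(flags & FOLL_HONOR_NUMA_FAULT))
                  return true;

          /*
           * NUMA hinting faults only apply in accessible VMAs; in
           * inaccessible (PROT_NONE) VMAs, pte_protnone() does not
           * indicate a NUMA hinting fault.
           */
          return !vma_is_accessible(vma);
  }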

Patch #2 is a cleanup that I think actually fixes some corner cases, so
I added a Fixes: tag.

Patch #3 makes KVM explicitly set FOLL_HONOR_NUMA_FAULT in the single
case where it is required, and documents the situation.
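
With names as in hva_to_pfn_slow(), the explicit opt-in looks roughly
like this (a sketch, not the exact hunk; see patch #3 for the full
comment):

          /*
           * Sketch: have KVM's slow GUP path honor NUMA hinting faults,
           * so that pages are not mapped while residing on a node the
           * VCPU should not be accessing.
           */
          unsigned int flags = FOLL_HWPOISON | FOLL_HONOR_NUMA_FAULT;

          if (write_fault)
                  flags |= FOLL_WRITE;
          npages = get_user_pages_unlocked(addr, 1, &page, flags);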

Patch #4 then stops implicitly setting FOLL_HONOR_NUMA_FAULT. But note that
for FOLL_WRITE we always implicitly honor NUMA hinting faults.
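
That's because a protnone PTE is never writable, so the existing GUP
logic already defers to the fault path; roughly (a sketch of existing
behavior, not new code):

          /*
           * A protnone PTE is never pte_write(), so a FOLL_WRITE lookup
           * cannot map it; GUP falls back to faultin_page(), which
           * performs the NUMA hinting fault handling.
           */
          if ((flags & FOLL_WRITE) && !pte_write(pte))
                  return NULL;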

Patch #5 and patch #6 clean up some comments.

Patch #7 improves the KSM functional tests such that patch #8 can
actually check for one of the known issues: KSM no longer working on
PROT_NONE mappings on x86-64 with CONFIG_NUMA_BALANCING.

Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: liubo <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Paolo Bonzini <[email protected]>

David Hildenbrand (8):
mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd()
kvm: explicitly set FOLL_HONOR_NUMA_FAULT in hva_to_pfn_slow()
mm/gup: don't implicitly set FOLL_HONOR_NUMA_FAULT
pgtable: improve pte_protnone() comment
mm/huge_memory: remove stale NUMA hinting comment from
follow_trans_huge_pmd()
selftest/mm: ksm_functional_tests: test in mmap_and_merge_range() if
anything got merged
selftest/mm: ksm_functional_tests: Add PROT_NONE test

 fs/proc/task_mmu.c                            |   3 +-
 include/linux/mm.h                            |  21 +++-
 include/linux/mm_types.h                      |   9 ++
 include/linux/pgtable.h                       |  16 ++-
 mm/gup.c                                      |  23 +++-
 mm/huge_memory.c                              |   3 +-
 .../selftests/mm/ksm_functional_tests.c       | 106 ++++++++++++++++--
 virt/kvm/kvm_main.c                           |  13 ++-
 8 files changed, 164 insertions(+), 30 deletions(-)

--
2.41.0



2023-08-01 15:21:43

by David Hildenbrand

Subject: [PATCH v2 6/8] mm/huge_memory: remove stale NUMA hinting comment from follow_trans_huge_pmd()

That comment for pmd_protnone() was added in commit 2b4847e73004
("mm: numa: serialise parallel get_user_page against THP migration"), which
noted:

THP does not unmap pages due to a lack of support for migration
entries at a PMD level. This allows races with get_user_pages

Nowadays, we do have PMD migration entries, so the comment no longer
applies. Let's drop it.
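
For reference, a THP under migration is nowadays visible as a
non-present PMD migration entry, which lookups can detect and wait on
using existing helpers, roughly:

          pmd_t pmde = READ_ONCE(*pmd);

          if (is_pmd_migration_entry(pmde))
                  pmd_migration_entry_wait(mm, pmd);      /* then retry */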

Signed-off-by: David Hildenbrand <[email protected]>
---
mm/huge_memory.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2cd3e5502180..0b709d2c46c6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1467,7 +1467,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
return ERR_PTR(-EFAULT);

- /* Full NUMA hinting faults to serialise migration in fault paths */
if (pmd_protnone(*pmd) && !gup_can_follow_protnone(vma, flags))
return NULL;

--
2.41.0


2023-08-01 15:25:26

by David Hildenbrand

Subject: [PATCH v2 8/8] selftest/mm: ksm_functional_tests: Add PROT_NONE test

Let's test whether merging and unmerging in PROT_NONE areas works as
expected.

Pass a page protection to mmap_and_merge_range(), which will trigger
an mprotect() after writing to the pages, but before enabling merging.

Make sure that unsharing works as expected, by performing a ptrace write
(using /proc/self/mem) and by setting MADV_UNMERGEABLE.

Note that this implicitly tests that ptrace writes in an inaccessible
(PROT_NONE) mapping work as expected.
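
For reference, assuming map points into the PROT_NONE range, such a
write boils down to the following minimal sketch (the test below does
the same per page via lseek()+write()):

          int fd = open("/proc/self/mem", O_RDWR);
          char val = 0x22;

          /*
           * /proc/self/mem writes use FOLL_FORCE, so this succeeds even
           * though the mapping itself is inaccessible (PROT_NONE).
           */
          if (fd < 0 ||
              pwrite(fd, &val, sizeof(val), (uintptr_t)map) != sizeof(val))
                  perror("write via /proc/self/mem");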

Signed-off-by: David Hildenbrand <[email protected]>
---
.../selftests/mm/ksm_functional_tests.c | 59 ++++++++++++++++---
1 file changed, 52 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/mm/ksm_functional_tests.c b/tools/testing/selftests/mm/ksm_functional_tests.c
index cb63b600cb4f..8fa4889ab4f3 100644
--- a/tools/testing/selftests/mm/ksm_functional_tests.c
+++ b/tools/testing/selftests/mm/ksm_functional_tests.c
@@ -27,6 +27,7 @@
#define KiB 1024u
#define MiB (1024 * KiB)

+static int mem_fd;
static int ksm_fd;
static int ksm_full_scans_fd;
static int proc_self_ksm_stat_fd;
@@ -144,7 +145,8 @@ static int ksm_unmerge(void)
return 0;
}

-static char *mmap_and_merge_range(char val, unsigned long size, bool use_prctl)
+static char *mmap_and_merge_range(char val, unsigned long size, int prot,
+ bool use_prctl)
{
char *map;
int ret;
@@ -176,6 +178,11 @@ static char *mmap_and_merge_range(char val, unsigned long size, bool use_prctl)
/* Make sure each page contains the same values to merge them. */
memset(map, val, size);

+ if (mprotect(map, size, prot)) {
+ ksft_test_result_skip("mprotect() failed\n");
+ goto unmap;
+ }
+
if (use_prctl) {
ret = prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0);
if (ret < 0 && errno == EINVAL) {
@@ -218,7 +225,7 @@ static void test_unmerge(void)

ksft_print_msg("[RUN] %s\n", __func__);

- map = mmap_and_merge_range(0xcf, size, false);
+ map = mmap_and_merge_range(0xcf, size, PROT_READ | PROT_WRITE, false);
if (map == MAP_FAILED)
return;

@@ -256,7 +263,7 @@ static void test_unmerge_zero_pages(void)
}

/* Let KSM deduplicate zero pages. */
- map = mmap_and_merge_range(0x00, size, false);
+ map = mmap_and_merge_range(0x00, size, PROT_READ | PROT_WRITE, false);
if (map == MAP_FAILED)
return;

@@ -304,7 +311,7 @@ static void test_unmerge_discarded(void)

ksft_print_msg("[RUN] %s\n", __func__);

- map = mmap_and_merge_range(0xcf, size, false);
+ map = mmap_and_merge_range(0xcf, size, PROT_READ | PROT_WRITE, false);
if (map == MAP_FAILED)
return;

@@ -336,7 +343,7 @@ static void test_unmerge_uffd_wp(void)

ksft_print_msg("[RUN] %s\n", __func__);

- map = mmap_and_merge_range(0xcf, size, false);
+ map = mmap_and_merge_range(0xcf, size, PROT_READ | PROT_WRITE, false);
if (map == MAP_FAILED)
return;

@@ -479,7 +486,7 @@ static void test_prctl_unmerge(void)

ksft_print_msg("[RUN] %s\n", __func__);

- map = mmap_and_merge_range(0xcf, size, true);
+ map = mmap_and_merge_range(0xcf, size, PROT_READ | PROT_WRITE, true);
if (map == MAP_FAILED)
return;

@@ -494,9 +501,42 @@ static void test_prctl_unmerge(void)
munmap(map, size);
}

+static void test_prot_none(void)
+{
+ const unsigned int size = 2 * MiB;
+ char *map;
+ int i;
+
+ ksft_print_msg("[RUN] %s\n", __func__);
+
+ map = mmap_and_merge_range(0x11, size, PROT_NONE, false);
+ if (map == MAP_FAILED)
+ return;
+
+ /* Store a unique value in each page of one half using ptrace. */
+ for (i = 0; i < size / 2; i += pagesize) {
+ lseek(mem_fd, (uintptr_t) map + i, SEEK_SET);
+ if (write(mem_fd, &i, sizeof(i)) != sizeof(i)) {
+ ksft_test_result_fail("ptrace write failed\n");
+ goto unmap;
+ }
+ }
+
+ /* Trigger unsharing on the other half. */
+ if (madvise(map + size / 2, size / 2, MADV_UNMERGEABLE)) {
+ ksft_test_result_fail("MADV_UNMERGEABLE failed\n");
+ goto unmap;
+ }
+
+ ksft_test_result(!range_maps_duplicates(map, size),
+ "Pages were unmerged\n");
+unmap:
+ munmap(map, size);
+}
+
int main(int argc, char **argv)
{
- unsigned int tests = 6;
+ unsigned int tests = 7;
int err;

#ifdef __NR_userfaultfd
@@ -508,6 +548,9 @@ int main(int argc, char **argv)

pagesize = getpagesize();

+ mem_fd = open("/proc/self/mem", O_RDWR);
+ if (mem_fd < 0)
+ ksft_exit_fail_msg("opening /proc/self/mem failed\n");
ksm_fd = open("/sys/kernel/mm/ksm/run", O_RDWR);
if (ksm_fd < 0)
ksft_exit_skip("open(\"/sys/kernel/mm/ksm/run\") failed\n");
@@ -529,6 +572,8 @@ int main(int argc, char **argv)
test_unmerge_uffd_wp();
#endif

+ test_prot_none();
+
test_prctl();
test_prctl_fork();
test_prctl_unmerge();
--
2.41.0


2023-08-01 15:31:41

by David Hildenbrand

Subject: [PATCH v2 5/8] pgtable: improve pte_protnone() comment

Especially the "For PROT_NONE VMAs, the PTEs are not marked
_PAGE_PROTNONE" is wrong: doing an mprotect(PROT_NONE) will end up
marking all PTEs on x86 as _PAGE_PROTNONE, making pte_protnone()
indicate "yes".

So let's improve the comment, so it's easier to grasp which semantics
pte_protnone() actually has.
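
To make the rule concrete: a caller that wants to reliably detect a
NUMA hinting fault has to combine both checks, something like the
following sketch (handle_numa_hinting_fault() is a made-up placeholder):

          /*
           * pte_protnone() alone is ambiguous; only in an accessible VMA
           * does it imply that a NUMA hinting fault is required.
           */
          if (pte_protnone(pte) && vma_is_accessible(vma))
                  handle_numa_hinting_fault();    /* hypothetical helper */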

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/pgtable.h | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f34e0f2cb4d8..6064f454c8e3 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1333,12 +1333,16 @@ static inline int pud_trans_unstable(pud_t *pud)

#ifndef CONFIG_NUMA_BALANCING
/*
- * Technically a PTE can be PROTNONE even when not doing NUMA balancing but
- * the only case the kernel cares is for NUMA balancing and is only ever set
- * when the VMA is accessible. For PROT_NONE VMAs, the PTEs are not marked
- * _PAGE_PROTNONE so by default, implement the helper as "always no". It
- * is the responsibility of the caller to distinguish between PROT_NONE
- * protections and NUMA hinting fault protections.
+ * In an inaccessible (PROT_NONE) VMA, pte_protnone() may indicate "yes". It is
+ * perfectly valid to indicate "no" in that case, which is why our default
+ * implementation defaults to "always no".
+ *
+ * In an accessible VMA, however, pte_protnone() reliably indicates PROT_NONE
+ * page protection due to NUMA hinting. NUMA hinting faults only apply in
+ * accessible VMAs.
+ *
+ * So, to reliably identify PROT_NONE PTEs that require a NUMA hinting fault,
+ * looking at the VMA accessibility is sufficient.
*/
static inline int pte_protnone(pte_t pte)
{
--
2.41.0


2023-08-01 18:15:00

by David Hildenbrand

Subject: Re: [PATCH v2 6/8] mm/huge_memory: remove stale NUMA hinting comment from follow_trans_huge_pmd()

On 01.08.23 18:07, Peter Xu wrote:
> On Tue, Aug 01, 2023 at 02:48:42PM +0200, David Hildenbrand wrote:
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 2cd3e5502180..0b709d2c46c6 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1467,7 +1467,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>> if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
>> return ERR_PTR(-EFAULT);
>>
>> - /* Full NUMA hinting faults to serialise migration in fault paths */
>> if (pmd_protnone(*pmd) && !gup_can_follow_protnone(vma, flags))
>> return NULL;
>
> Perhaps squashing into patch 1? Thanks,

I decided against it so I don't have to make the patch description of
patch #1 even longer with something that's mostly unrelated to the core
change.

--
Cheers,

David / dhildenb


2023-08-01 18:29:28

by Peter Xu

Subject: Re: [PATCH v2 6/8] mm/huge_memory: remove stale NUMA hinting comment from follow_trans_huge_pmd()

On Tue, Aug 01, 2023 at 02:48:42PM +0200, David Hildenbrand wrote:
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2cd3e5502180..0b709d2c46c6 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1467,7 +1467,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
> if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
> return ERR_PTR(-EFAULT);
>
> - /* Full NUMA hinting faults to serialise migration in fault paths */
> if (pmd_protnone(*pmd) && !gup_can_follow_protnone(vma, flags))
> return NULL;

Perhaps squashing into patch 1? Thanks,

--
Peter Xu


2023-08-02 16:51:30

by Mel Gorman

Subject: Re: [PATCH v2 6/8] mm/huge_memory: remove stale NUMA hinting comment from follow_trans_huge_pmd()

On Tue, Aug 01, 2023 at 02:48:42PM +0200, David Hildenbrand wrote:
> That comment for pmd_protnone() was added in commit 2b4847e73004
> ("mm: numa: serialise parallel get_user_page against THP migration"), which
> noted:
>
> THP does not unmap pages due to a lack of support for migration
> entries at a PMD level. This allows races with get_user_pages
>
> Nowadays, we do have PMD migration entries, so the comment no longer
> applies. Let's drop it.
>
> Signed-off-by: David Hildenbrand <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2023-08-02 16:53:39

by Mel Gorman

Subject: Re: [PATCH v2 5/8] pgtable: improve pte_protnone() comment

On Tue, Aug 01, 2023 at 02:48:41PM +0200, David Hildenbrand wrote:
> Especially the "For PROT_NONE VMAs, the PTEs are not marked
> _PAGE_PROTNONE" is wrong: doing an mprotect(PROT_NONE) will end up
> marking all PTEs on x86 as _PAGE_PROTNONE, making pte_protnone()
> indicate "yes".
>
> So let's improve the comment, so it's easier to grasp which semantics
> pte_protnone() actually has.
>
> Signed-off-by: David Hildenbrand <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs