2022-04-05 03:43:12

by Peter Xu

Subject: [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs

This is v8 of the series to add shmem+hugetlbfs support for userfaultfd
write protection. It is based on v5.17-mmots-2022-03-31-20-40.

I touched up two small details after the rebase, namely:

- Let UFFDIO_REGISTER fail gracefully if CONFIG_PTE_MARKER_UFFD_WP is not
set, in "mm/uffd: Enable write protection for shmem & hugetlbfs".

- Tweaked the patch "mm: Enable PTE markers by default" to make sure
it'll really be auto-enabled on x86_64 by kconfig where appropriate.

During testing of recent versions, I grew another unit test specifically
for uffd-wp (uffd-test [0]; the name is not important.. though). The
current vm/userfaultfd test doesn't verify which message we expect to
receive, so the simple new test can catch errors where e.g. one page was
wr-protected but then wrongly written without the fault-resolving thread
noticing, hence data corruption.

I used to find such issues only with umapsort, and MISSING mode won't
have those data loss issues. But now many of them can also be found with
uffd-test [0]. I plan to port it into the linux repo after this series
lands.

The whole tree can be found here for testing:

https://github.com/xzpeter/linux/tree/uffd-wp-shmem-hugetlbfs

Previous versions:

RFC: https://lore.kernel.org/lkml/[email protected]
v1: https://lore.kernel.org/lkml/[email protected]
v2: https://lore.kernel.org/lkml/[email protected]
v3: https://lore.kernel.org/lkml/[email protected]
v4: https://lore.kernel.org/lkml/[email protected]
v5: https://lore.kernel.org/lkml/[email protected]
v6: https://lore.kernel.org/lkml/[email protected]
v7: https://lore.kernel.org/lkml/[email protected]

Overview
========

Userfaultfd-wp anonymous support was merged two years ago, and quite a
few applications have started to leverage this capability, either to take
snapshots of user-app memory or for fully user-controlled swapping.

This series tries to complete the uffd-wp feature so that it covers all
the RAM-based memory types. So far uffd-wp is the only mode missing
file-backed support; the other modes (uffd-missing & uffd-minor) already
have it.

One major reason to do so is that anonymous pages sometimes cannot
satisfy the needs of applications, and there are growing users of shmem
and hugetlbfs, either for sharing purposes (e.g., sharing guest memory
between the hypervisor process and a device emulation process, or shmem
local live migration for upgrades), or for better TLB hit performance.

All this means that if a uffd-wp app wants to switch to any of these
memory types, it'll stop working. I think it's worthwhile to have the
kernel cover all these aspects.
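
For context, the userspace flow this series enables is the same as the
existing anonymous uffd-wp flow, just on a shmem or hugetlbfs mapping.
A minimal sketch (error handling omitted; "len" and "memfd" are assumed
to come from the app, e.g. via memfd_create()):

  int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
  struct uffdio_api api = { .api = UFFD_API };

  ioctl(uffd, UFFDIO_API, &api);

  void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED, memfd, 0);        /* shmem, not anon */

  struct uffdio_register reg = {
      .range = { .start = (unsigned long)mem, .len = len },
      .mode  = UFFDIO_REGISTER_MODE_WP,
  };
  ioctl(uffd, UFFDIO_REGISTER, &reg);

  struct uffdio_writeprotect wp = {
      .range = { .start = (unsigned long)mem, .len = len },
      .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
  };
  ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);         /* arm wr-protection */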

This series chose to protect pages at the pte level, not the page level.

One major reason is safety. I have no idea how we could make it safe if
any uffd-privileged app could wr-protect a page that any other
application can use: it would mean such an app could block any process,
potentially for as long as it wants.

The other reason is that it aligns very well not only with the anonymous
uffd-wp solution, but with uffd as a whole. For example, userfaultfd is
fundamentally implemented on top of VMAs: we set flags on VMAs showing
the status of uffd tracking. A per-page protection solution would cross
that VMA-based foundation line, and it could simply end up too far from
what's called userfaultfd.

PTE markers
===========

The patchset is based on an idea called PTE markers. It was discussed
in one of the mm alignment sessions and proposed starting from v6; this
is the 2nd version using the PTE marker idea.

A PTE marker is a new type of swap entry that is only applicable to
file-backed memory like shmem and hugetlbfs. It's used to persist some
pte-level information even after the original present ptes in the
pgtable have been zapped.

Logically pte markers can store more than uffd-wp information, but so
far only one bit is used, for the uffd-wp purpose. When a pte marker is
installed with the uffd-wp bit set, it means this pte is wr-protected
by uffd.

It solves the problem of, e.g., file-backed memory ptes getting zapped
for whatever reason (thp split, swap-out, ...): we can still keep the
wr-protect information in the ptes. Then when the page fault triggers
again, we'll know this pte is wr-protected, so we can treat it the same
as a normal uffd wr-protected pte.

The extra information is encoded into the swap entry, or the swp_offset
to be explicit, with the swp_type being PTE_MARKER. So far uffd-wp only
uses one bit out of the swap entry; the rest of the swp_offset bits are
still reserved for other purposes.
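
To illustrate, using the helpers added in patch 1 (this is a condensed
restatement of the kernel helpers, not a new API):

  pte_t pte = make_pte_marker(PTE_MARKER_UFFD_WP);
  /* == swp_entry_to_pte(swp_entry(SWP_PTE_MARKER, PTE_MARKER_UFFD_WP)) */

  if (is_pte_marker(pte)) {
          pte_marker marker = pte_marker_get(pte_to_swp_entry(pte));
          /* marker & PTE_MARKER_UFFD_WP: this pte was wr-protected */
  }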

There're two configs to enable/disable PTE markers:

CONFIG_PTE_MARKER
CONFIG_PTE_MARKER_UFFD_WP

We can set !PTE_MARKER to completely disable all the PTE markers, along
with uffd-wp support. I made two configs so we can also enable PTE
markers while disabling file-backed uffd-wp, for other purposes. At the
end of the series I enable CONFIG_PTE_MARKER by default; that patch is
standalone, though, so if anyone worries about having it on by default
we can turn it off again by simply dropping that oneliner patch. So far
I don't see a huge risk in doing so, so I kept that patch.

In most cases, PTE markers should be treated as none ptes. That is
because, unlike most other swap entry types, there's no PFN or block
offset information encoded into a PTE marker, just some well-defined
extra bits showing the status of the pte. These bits should only be
used as extra data when servicing an upcoming page fault; after that we
behave as if it's a none pte.

I did spend a lot of time going over all the pte_none() users this
time. It is indeed a challenge because there are a lot of them, and I
hope I didn't miss any place where pte markers need to be taken care
of. Luckily, they don't need to be considered in many cases, for
example: boot code, arch code (especially non-x86), kernel-only page
handling (e.g. CPA), or device driver code dealing with pure PFN
mappings.

I introduced pte_none_mostly() in this series for the places where we
need to handle pte markers the same as none ptes; the "mostly" is
another way to write "either a none pte or a pte marker".

I didn't replace pte_none() itself to cover pte markers, for the reasons below:

- Only very few pte_none() callers will ever handle pte markers. E.g.,
all the kernel pages require no knowledge of pte markers, so we don't
pollute the major use cases.

- Unconditionally changing pte_none() semantics could confuse people,
because pte_none() has existed for such a long time.

- Unconditionally changing pte_none() semantics could make pte_none()
slower, even in the many cases where pte markers cannot exist.

- There are cases where we'd like to handle pte markers differently
from pte_none(), so a full replacement is impossible anyway. E.g.
khugepaged should still treat pte markers as normal swap ptes rather
than none ptes, because pte markers always need a fault-in to merge the
marker with a valid pte. And the smaps code needs to parse PTE markers,
not none ptes.

Patch Layout
============

Introducing PTE marker and uffd-wp bit in PTE marker:

mm: Introduce PTE_MARKER swap entry
mm: Teach core mm about pte markers
mm: Check against orig_pte for finish_fault()
mm/uffd: PTE_MARKER_UFFD_WP

Adding support for shmem uffd-wp:

mm/shmem: Take care of UFFDIO_COPY_MODE_WP
mm/shmem: Handle uffd-wp special pte in page fault handler
mm/shmem: Persist uffd-wp bit across zapping for file-backed
mm/shmem: Allow uffd wr-protect none pte for file-backed mem
mm/shmem: Allows file-back mem to be uffd wr-protected on thps
mm/shmem: Handle uffd-wp during fork()

Adding support for hugetlbfs uffd-wp:

mm/hugetlb: Introduce huge pte version of uffd-wp helpers
mm/hugetlb: Hook page faults for uffd write protection
mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP
mm/hugetlb: Handle UFFDIO_WRITEPROTECT
mm/hugetlb: Handle pte markers in page faults
mm/hugetlb: Allow uffd wr-protect none ptes
mm/hugetlb: Only drop uffd-wp special pte if required
mm/hugetlb: Handle uffd-wp during fork()

Misc handling on the rest mm for uffd-wp file-backed:

mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered
mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

Enabling of uffd-wp on file-backed memory:

mm/uffd: Enable write protection for shmem & hugetlbfs
mm: Enable PTE markers by default
selftests/uffd: Enable uffd-wp for shmem/hugetlbfs

Tests
=====

- Compile test on x86_64 and aarch64 on different configs
- Kernel selftests (see the example invocation below)
- uffd-test [0]
- Umapsort [1,2] test for shmem/hugetlb, with swap on/off
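
As an example of the selftest runs above (a sketch; the selftest takes
<test type> <MiB> <bounces>, and the hugetlb runs assume hugepages are
reserved and, for hugetlb_shared, that a hugetlbfs file path is given):

  cd tools/testing/selftests/vm
  make userfaultfd
  ./userfaultfd shmem 128 32
  ./userfaultfd hugetlb_shared 256 32 /dev/hugepages/uffd-test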

Please review, thanks.

[0] https://github.com/xzpeter/clibs/tree/master/uffd-test
[1] https://github.com/xzpeter/umap-apps/tree/peter
[2] https://github.com/xzpeter/umap/tree/peter-shmem-hugetlbfs

Peter Xu (23):
mm: Introduce PTE_MARKER swap entry
mm: Teach core mm about pte markers
mm: Check against orig_pte for finish_fault()
mm/uffd: PTE_MARKER_UFFD_WP
mm/shmem: Take care of UFFDIO_COPY_MODE_WP
mm/shmem: Handle uffd-wp special pte in page fault handler
mm/shmem: Persist uffd-wp bit across zapping for file-backed
mm/shmem: Allow uffd wr-protect none pte for file-backed mem
mm/shmem: Allows file-back mem to be uffd wr-protected on thps
mm/shmem: Handle uffd-wp during fork()
mm/hugetlb: Introduce huge pte version of uffd-wp helpers
mm/hugetlb: Hook page faults for uffd write protection
mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP
mm/hugetlb: Handle UFFDIO_WRITEPROTECT
mm/hugetlb: Handle pte markers in page faults
mm/hugetlb: Allow uffd wr-protect none ptes
mm/hugetlb: Only drop uffd-wp special pte if required
mm/hugetlb: Handle uffd-wp during fork()
mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered
mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs
mm/uffd: Enable write protection for shmem & hugetlbfs
mm: Enable PTE markers by default
selftests/uffd: Enable uffd-wp for shmem/hugetlbfs

arch/s390/include/asm/hugetlb.h | 15 ++
fs/hugetlbfs/inode.c | 15 +-
fs/proc/task_mmu.c | 7 +
fs/userfaultfd.c | 31 ++--
include/asm-generic/hugetlb.h | 24 +++
include/linux/hugetlb.h | 27 ++--
include/linux/mm.h | 10 ++
include/linux/mm_inline.h | 43 +++++
include/linux/shmem_fs.h | 4 +-
include/linux/swap.h | 15 +-
include/linux/swapops.h | 79 +++++++++
include/linux/userfaultfd_k.h | 80 +++++++++
include/uapi/linux/userfaultfd.h | 10 +-
mm/Kconfig | 17 ++
mm/filemap.c | 5 +
mm/hmm.c | 2 +-
mm/hugetlb.c | 183 ++++++++++++++++-----
mm/khugepaged.c | 14 +-
mm/memcontrol.c | 8 +-
mm/memory.c | 196 ++++++++++++++++++++---
mm/mincore.c | 3 +-
mm/mprotect.c | 75 ++++++++-
mm/rmap.c | 8 +
mm/shmem.c | 4 +-
mm/userfaultfd.c | 54 +++++--
tools/testing/selftests/vm/userfaultfd.c | 4 +-
26 files changed, 807 insertions(+), 126 deletions(-)

--
2.32.0


2022-04-05 03:43:19

by Peter Xu

Subject: [PATCH v8 01/23] mm: Introduce PTE_MARKER swap entry

This patch introduces a new swap entry type called PTE_MARKER. It can
be installed for any pte that maps file-backed memory when the pte is
temporarily zapped, so as to maintain per-pte information.

The information kept in the pte is called a "marker". Here we define
the marker as "unsigned long" just to match pgoff_t, but it only works
as long as it fits in swp_offset(), which is e.g. currently 58 bits on
x86_64.

A new config CONFIG_PTE_MARKER is introduced too; it's off by default.
A bunch of helpers are defined along with it to service the rest of the
pte marker code.
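
As a quick illustration of how the new helpers compose (a sketch; the
marker bit below is hypothetical, since no real marker bit is defined
until a later patch, so PTE_MARKER_MASK is still 0 here):

  pte_marker marker = BIT(0);                 /* hypothetical bit */
  pte_t pte = make_pte_marker(marker);

  is_pte_marker(pte);                         /* true */
  pte_marker_get(pte_to_swp_entry(pte));      /* marker & PTE_MARKER_MASK */
  pte_none_mostly(pte);                       /* true: treated as none */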

Signed-off-by: Peter Xu <[email protected]>
---
include/asm-generic/hugetlb.h | 9 ++++
include/linux/swap.h | 15 ++++++-
include/linux/swapops.h | 78 +++++++++++++++++++++++++++++++++++
mm/Kconfig | 6 +++
4 files changed, 107 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index 8e1e6244a89d..f39cad20ffc6 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -2,6 +2,9 @@
#ifndef _ASM_GENERIC_HUGETLB_H
#define _ASM_GENERIC_HUGETLB_H

+#include <linux/swap.h>
+#include <linux/swapops.h>
+
static inline pte_t mk_huge_pte(struct page *page, pgprot_t pgprot)
{
return mk_pte(page, pgprot);
@@ -80,6 +83,12 @@ static inline int huge_pte_none(pte_t pte)
}
#endif

+/* Please refer to comments above pte_none_mostly() for the usage */
+static inline int huge_pte_none_mostly(pte_t pte)
+{
+ return huge_pte_none(pte) || is_pte_marker(pte);
+}
+
#ifndef __HAVE_ARCH_HUGE_PTE_WRPROTECT
static inline pte_t huge_pte_wrprotect(pte_t pte)
{
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7daae5a4b3e1..5553189d0215 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -55,6 +55,19 @@ static inline int current_is_kswapd(void)
* actions on faults.
*/

+/*
+ * PTE markers are used to persist information onto PTEs that are mapped with
+ * file-backed memories. As its name "PTE" hints, it should only be applied to
+ * the leaves of pgtables.
+ */
+#ifdef CONFIG_PTE_MARKER
+#define SWP_PTE_MARKER_NUM 1
+#define SWP_PTE_MARKER (MAX_SWAPFILES + SWP_HWPOISON_NUM + \
+ SWP_MIGRATION_NUM + SWP_DEVICE_NUM)
+#else
+#define SWP_PTE_MARKER_NUM 0
+#endif
+
/*
* Unaddressable device memory support. See include/linux/hmm.h and
* Documentation/vm/hmm.rst. Short description is we need struct pages for
@@ -107,7 +120,7 @@ static inline int current_is_kswapd(void)

#define MAX_SWAPFILES \
((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
- SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+ SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - SWP_PTE_MARKER_NUM)

/*
* Magic header for a swap area. The first part of the union is
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 32d517a28969..7a00627845f0 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -274,6 +274,84 @@ static inline int is_readable_migration_entry(swp_entry_t entry)

#endif

+typedef unsigned long pte_marker;
+
+#define PTE_MARKER_MASK (0)
+
+#ifdef CONFIG_PTE_MARKER
+
+static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
+{
+ return swp_entry(SWP_PTE_MARKER, marker);
+}
+
+static inline bool is_pte_marker_entry(swp_entry_t entry)
+{
+ return swp_type(entry) == SWP_PTE_MARKER;
+}
+
+static inline pte_marker pte_marker_get(swp_entry_t entry)
+{
+ return swp_offset(entry) & PTE_MARKER_MASK;
+}
+
+static inline bool is_pte_marker(pte_t pte)
+{
+ return is_swap_pte(pte) && is_pte_marker_entry(pte_to_swp_entry(pte));
+}
+
+#else /* CONFIG_PTE_MARKER */
+
+static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
+{
+ /* This should never be called if !CONFIG_PTE_MARKER */
+ WARN_ON_ONCE(1);
+ return swp_entry(0, 0);
+}
+
+static inline bool is_pte_marker_entry(swp_entry_t entry)
+{
+ return false;
+}
+
+static inline pte_marker pte_marker_get(swp_entry_t entry)
+{
+ return 0;
+}
+
+static inline bool is_pte_marker(pte_t pte)
+{
+ return false;
+}
+
+#endif /* CONFIG_PTE_MARKER */
+
+static inline pte_t make_pte_marker(pte_marker marker)
+{
+ return swp_entry_to_pte(make_pte_marker_entry(marker));
+}
+
+/*
+ * This is a special version to check pte_none() just to cover the case when
+ * the pte is a pte marker. It existed because in many cases the pte marker
+ * should be seen as a none pte; it's just that we have stored some information
+ * onto the none pte so it becomes not-none any more.
+ *
+ * It should be used when the pte is file-backed, ram-based and backing
+ * userspace pages, like shmem. It is not needed upon pgtables that do not
+ * support pte markers at all. For example, it's not needed on anonymous
+ * memory, kernel-only memory (including when the system is during-boot),
+ * non-ram based generic file-system. It's fine to be used even there, but the
+ * extra pte marker check will be pure overhead.
+ *
+ * For systems configured with !CONFIG_PTE_MARKER this will be automatically
+ * optimized to pte_none().
+ */
+static inline int pte_none_mostly(pte_t pte)
+{
+ return pte_none(pte) || is_pte_marker(pte);
+}
+
static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
{
struct page *p = pfn_to_page(swp_offset(entry));
diff --git a/mm/Kconfig b/mm/Kconfig
index 034d87953600..a1688b9314b2 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -909,6 +909,12 @@ config ANON_VMA_NAME
area from being merged with adjacent virtual memory areas due to the
difference in their name.

+config PTE_MARKER
+ bool "Marker PTEs support"
+
+ help
+ Allows to create marker PTEs for file-backed memory.
+
source "mm/damon/Kconfig"

endmenu
--
2.32.0

2022-04-05 03:43:20

by Peter Xu

Subject: [PATCH v8 16/23] mm/hugetlb: Allow uffd wr-protect none ptes

Teach the hugetlbfs code to wr-protect none ptes, in case page cache
exists for that pte. Meanwhile we also need to be able to recognize a
uffd-wp marker pte and remove it for uffd_wp_resolve.

While at it, introduce a variable "psize" to replace all references to
the huge page size fetcher.
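
The two new behaviors in hugetlb_change_protection() can be condensed
as follows (a restatement of the diff below, not extra code):

  /* uffd-wp marker pte + resolve: recover it into a none pte */
  if (pte_marker_uffd_wp(pte) && uffd_wp_resolve)
          huge_pte_clear(mm, address, ptep, psize);

  /* none pte + wr-protect: install a marker to trap the next fault */
  if (huge_pte_none(pte) && uffd_wp)
          set_huge_pte_at(mm, address, ptep,
                          make_pte_marker(PTE_MARKER_UFFD_WP));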

Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 28 ++++++++++++++++++++++++----
1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9317b790161d..578c48ef931a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6225,7 +6225,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
pte_t *ptep;
pte_t pte;
struct hstate *h = hstate_vma(vma);
- unsigned long pages = 0;
+ unsigned long pages = 0, psize = huge_page_size(h);
bool shared_pmd = false;
struct mmu_notifier_range range;
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
@@ -6245,13 +6245,19 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,

mmu_notifier_invalidate_range_start(&range);
i_mmap_lock_write(vma->vm_file->f_mapping);
- for (; address < end; address += huge_page_size(h)) {
+ for (; address < end; address += psize) {
spinlock_t *ptl;
- ptep = huge_pte_offset(mm, address, huge_page_size(h));
+ ptep = huge_pte_offset(mm, address, psize);
if (!ptep)
continue;
ptl = huge_pte_lock(h, mm, ptep);
if (huge_pmd_unshare(mm, vma, &address, ptep)) {
+ /*
+ * When uffd-wp is enabled on the vma, unshare
+ * shouldn't happen at all. Warn about it if it
+ * happened due to some reason.
+ */
+ WARN_ON_ONCE(uffd_wp || uffd_wp_resolve);
pages++;
spin_unlock(ptl);
shared_pmd = true;
@@ -6281,12 +6287,20 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
else if (uffd_wp_resolve)
newpte = pte_swp_clear_uffd_wp(newpte);
set_huge_swap_pte_at(mm, address, ptep,
- newpte, huge_page_size(h));
+ newpte, psize);
pages++;
}
spin_unlock(ptl);
continue;
}
+ if (unlikely(pte_marker_uffd_wp(pte))) {
+ /*
+ * This is changing a non-present pte into a none pte,
+ * no need for huge_ptep_modify_prot_start/commit().
+ */
+ if (uffd_wp_resolve)
+ huge_pte_clear(mm, address, ptep, psize);
+ }
if (!huge_pte_none(pte)) {
pte_t old_pte;
unsigned int shift = huge_page_shift(hstate_vma(vma));
@@ -6300,6 +6314,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
pte = huge_pte_clear_uffd_wp(pte);
huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
pages++;
+ } else {
+ /* None pte */
+ if (unlikely(uffd_wp))
+ /* Safe to modify directly (none->non-present). */
+ set_huge_pte_at(mm, address, ptep,
+ make_pte_marker(PTE_MARKER_UFFD_WP));
}
spin_unlock(ptl);
}
--
2.32.0

2022-04-05 03:43:27

by Peter Xu

Subject: [PATCH v8 13/23] mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP

Pass the wp_copy variable into hugetlb_mcopy_atomic_pte() throughout
the stack. Apply the UFFD_WP bit if UFFDIO_COPY_MODE_WP is set for the
UFFDIO_COPY.

Hugetlb pages are only managed by hugetlbfs, so we're safe even without
setting the dirty bit in the huge pte if the page is installed
read-only. However we'd better still keep the dirty bit set for a
read-only UFFDIO_COPY pte (when the UFFDIO_COPY_MODE_WP bit is set),
not only to match what we do with shmem, but also because the page does
contain dirty data that the kernel just copied from userspace.
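
From userspace, a write-protected atomic copy onto a registered
hugetlbfs range then looks like this sketch (names are illustrative;
dst/src/len must be aligned to the huge page size):

  struct uffdio_copy copy = {
      .dst  = dst_addr,                  /* huge-page aligned */
      .src  = (unsigned long)src_buf,
      .len  = huge_page_size,            /* e.g. 2MB */
      .mode = UFFDIO_COPY_MODE_WP,
  };
  ioctl(uffd, UFFDIO_COPY, &copy);       /* installs a wr-protected pte */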

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/hugetlb.h | 6 ++++--
mm/hugetlb.c | 29 +++++++++++++++++++++++------
mm/userfaultfd.c | 14 +++++++++-----
3 files changed, 36 insertions(+), 13 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 53c1b6082a4c..6347298778b6 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -160,7 +160,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
unsigned long dst_addr,
unsigned long src_addr,
enum mcopy_atomic_mode mode,
- struct page **pagep);
+ struct page **pagep,
+ bool wp_copy);
#endif /* CONFIG_USERFAULTFD */
bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
struct vm_area_struct *vma,
@@ -355,7 +356,8 @@ static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
unsigned long dst_addr,
unsigned long src_addr,
enum mcopy_atomic_mode mode,
- struct page **pagep)
+ struct page **pagep,
+ bool wp_copy)
{
BUG();
return 0;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 82df0fcfedf9..c94deead22b2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5795,7 +5795,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
unsigned long dst_addr,
unsigned long src_addr,
enum mcopy_atomic_mode mode,
- struct page **pagep)
+ struct page **pagep,
+ bool wp_copy)
{
bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
struct hstate *h = hstate_vma(dst_vma);
@@ -5925,7 +5926,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
goto out_release_unlock;

ret = -EEXIST;
- if (!huge_pte_none(huge_ptep_get(dst_pte)))
+ /*
+ * We allow to overwrite a pte marker: consider when both MISSING|WP
+ * registered, we firstly wr-protect a none pte which has no page cache
+ * page backing it, then access the page.
+ */
+ if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
goto out_release_unlock;

if (vm_shared) {
@@ -5935,17 +5941,28 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
}

- /* For CONTINUE on a non-shared VMA, don't set VM_WRITE for CoW. */
- if (is_continue && !vm_shared)
+ /*
+ * For either: (1) CONTINUE on a non-shared VMA, or (2) UFFDIO_COPY
+ * with wp flag set, don't set pte write bit.
+ */
+ if (wp_copy || (is_continue && !vm_shared))
writable = 0;
else
writable = dst_vma->vm_flags & VM_WRITE;

_dst_pte = make_huge_pte(dst_vma, page, writable);
- if (writable)
- _dst_pte = huge_pte_mkdirty(_dst_pte);
+ /*
+ * Always mark UFFDIO_COPY page dirty; note that this may not be
+ * extremely important for hugetlbfs for now since swapping is not
+ * supported, but we should still be clear in that this page cannot be
+ * thrown away at will, even if write bit not set.
+ */
+ _dst_pte = huge_pte_mkdirty(_dst_pte);
_dst_pte = pte_mkyoung(_dst_pte);

+ if (wp_copy)
+ _dst_pte = huge_pte_mkuffd_wp(_dst_pte);
+
set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);

(void)huge_ptep_set_access_flags(dst_vma, dst_addr, dst_pte, _dst_pte,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index b1c875b77fbb..da0b3ed2a6b5 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -304,7 +304,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long src_start,
unsigned long len,
- enum mcopy_atomic_mode mode)
+ enum mcopy_atomic_mode mode,
+ bool wp_copy)
{
int vm_shared = dst_vma->vm_flags & VM_SHARED;
ssize_t err;
@@ -392,7 +393,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
}

if (mode != MCOPY_ATOMIC_CONTINUE &&
- !huge_pte_none(huge_ptep_get(dst_pte))) {
+ !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
err = -EEXIST;
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
i_mmap_unlock_read(mapping);
@@ -400,7 +401,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
}

err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
- dst_addr, src_addr, mode, &page);
+ dst_addr, src_addr, mode, &page,
+ wp_copy);

mutex_unlock(&hugetlb_fault_mutex_table[hash]);
i_mmap_unlock_read(mapping);
@@ -455,7 +457,8 @@ extern ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long src_start,
unsigned long len,
- enum mcopy_atomic_mode mode);
+ enum mcopy_atomic_mode mode,
+ bool wp_copy);
#endif /* CONFIG_HUGETLB_PAGE */

static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
@@ -575,7 +578,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
*/
if (is_vm_hugetlb_page(dst_vma))
return __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
- src_start, len, mcopy_mode);
+ src_start, len, mcopy_mode,
+ wp_copy);

if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
goto out_unlock;
--
2.32.0

2022-04-05 03:43:30

by Peter Xu

Subject: [PATCH v8 08/23] mm/shmem: Allow uffd wr-protect none pte for file-backed mem

File-backed memory differs from anonymous memory in that even if the
pte is missing, the data could still reside either in the file or in
the page/swap cache. So when wr-protecting a pte, we need to consider
none ptes too.

We do that by installing the uffd-wp pte markers when necessary. Then
when there's a future write to the pte, the fault handler will take the
special path to first fault in the page read-only, and then report to
the userfaultfd server with the wr-protect message.

On the other hand, when unprotecting a page, it's also possible that
the pte got unmapped but replaced by the special uffd-wp marker. Then
we'll need to be able to recover a uffd-wp pte marker back into a none
pte, so that the next access to the page will fault in correctly as
usual.

Special care needs to be taken throughout the change_protection_range()
process. Since we now allow the user to wr-protect a none pte, we need
to be able to pre-populate the page table entries when we see
(!anonymous && MM_CP_UFFD_WP) requests, otherwise
change_protection_range() will simply skip wherever the pgtable entry
does not exist.

For example, the pgtable can be missing for a whole 2M pmd chunk while
the page cache exists for the whole 2M range. When we want to
wr-protect one 4K page within that 2M pmd range, we need to
pre-populate the pgtable and install the pte marker showing that we
want to get a message and block the thread when the page cache of that
4K page is written to. Without pre-populating the pmd,
change_protection() would simply skip the whole pmd.

Note that this patch only covers the small pages (pte level), not any
of the transparent huge pages yet. That will be done later, and this
patch is also a preparation for it.
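
Seen from userspace, the 2M pmd example above becomes possible with a
plain wr-protect ioctl (a hypothetical snippet; "base" is a shmem
mapping registered with UFFDIO_REGISTER_MODE_WP and never touched):

  struct uffdio_writeprotect wp = {
      .range = { .start = (unsigned long)base + 4096, .len = 4096 },
      .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
  };
  /* Installs a uffd-wp pte marker even though the pte was none; a
   * later write to base+4096 faults in read-only first, and then
   * delivers the wr-protect message. */
  ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);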

Signed-off-by: Peter Xu <[email protected]>
---
mm/mprotect.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 62 insertions(+), 2 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 709a6f73b764..bd62d5938c6c 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -30,6 +30,7 @@
#include <linux/mm_inline.h>
#include <linux/pgtable.h>
#include <linux/sched/sysctl.h>
+#include <linux/userfaultfd_k.h>
#include <asm/cacheflush.h>
#include <asm/mmu_context.h>
#include <asm/tlbflush.h>
@@ -188,8 +189,16 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
newpte = pte_swp_mksoft_dirty(newpte);
if (pte_swp_uffd_wp(oldpte))
newpte = pte_swp_mkuffd_wp(newpte);
- } else if (is_pte_marker_entry(entry)) {
- /* Skip it, the same as none pte */
+ } else if (pte_marker_entry_uffd_wp(entry)) {
+ /*
+ * If this is uffd-wp pte marker and we'd like
+ * to unprotect it, drop it; the next page
+ * fault will trigger without uffd trapping.
+ */
+ if (uffd_wp_resolve) {
+ pte_clear(vma->vm_mm, addr, pte);
+ pages++;
+ }
continue;
} else {
newpte = oldpte;
@@ -204,6 +213,20 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
set_pte_at(vma->vm_mm, addr, pte, newpte);
pages++;
}
+ } else {
+ /* It must be a none pte, or what else?.. */
+ WARN_ON_ONCE(!pte_none(oldpte));
+ if (unlikely(uffd_wp && !vma_is_anonymous(vma))) {
+ /*
+ * For file-backed mem, we need to be able to
+ * wr-protect a none pte, because even if the
+ * pte is none, the page/swap cache could
+ * exist. Do that by installing a marker.
+ */
+ set_pte_at(vma->vm_mm, addr, pte,
+ make_pte_marker(PTE_MARKER_UFFD_WP));
+ pages++;
+ }
}
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
@@ -237,6 +260,39 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
return 0;
}

+/* Return true if we're uffd wr-protecting file-backed memory, or false */
+static inline bool
+uffd_wp_protect_file(struct vm_area_struct *vma, unsigned long cp_flags)
+{
+ return (cp_flags & MM_CP_UFFD_WP) && !vma_is_anonymous(vma);
+}
+
+/*
+ * If wr-protecting the range for file-backed, populate pgtable for the case
+ * when pgtable is empty but page cache exists. When {pte|pmd|...}_alloc()
+ * failed it means no memory, we don't have a better option but stop.
+ */
+#define change_pmd_prepare(vma, pmd, cp_flags) \
+ do { \
+ if (unlikely(uffd_wp_protect_file(vma, cp_flags))) { \
+ if (WARN_ON_ONCE(pte_alloc(vma->vm_mm, pmd))) \
+ break; \
+ } \
+ } while (0)
+/*
+ * This is the general pud/p4d/pgd version of change_pmd_prepare(). We need to
+ * have separate change_pmd_prepare() because pte_alloc() returns 0 on success,
+ * while {pmd|pud|p4d}_alloc() returns the valid pointer on success.
+ */
+#define change_prepare(vma, high, low, addr, cp_flags) \
+ do { \
+ if (unlikely(uffd_wp_protect_file(vma, cp_flags))) { \
+ low##_t *p = low##_alloc(vma->vm_mm, high, addr); \
+ if (WARN_ON_ONCE(p == NULL)) \
+ break; \
+ } \
+ } while (0)
+
static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
pud_t *pud, unsigned long addr, unsigned long end,
pgprot_t newprot, unsigned long cp_flags)
@@ -255,6 +311,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,

next = pmd_addr_end(addr, end);

+ change_pmd_prepare(vma, pmd, cp_flags);
/*
* Automatic NUMA balancing walks the tables with mmap_lock
* held for read. It's possible a parallel update to occur
@@ -320,6 +377,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
pud = pud_offset(p4d, addr);
do {
next = pud_addr_end(addr, end);
+ change_prepare(vma, pud, pmd, addr, cp_flags);
if (pud_none_or_clear_bad(pud))
continue;
pages += change_pmd_range(vma, pud, addr, next, newprot,
@@ -340,6 +398,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
p4d = p4d_offset(pgd, addr);
do {
next = p4d_addr_end(addr, end);
+ change_prepare(vma, p4d, pud, addr, cp_flags);
if (p4d_none_or_clear_bad(p4d))
continue;
pages += change_pud_range(vma, p4d, addr, next, newprot,
@@ -365,6 +424,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
inc_tlb_flush_pending(mm);
do {
next = pgd_addr_end(addr, end);
+ change_prepare(vma, pgd, p4d, addr, cp_flags);
if (pgd_none_or_clear_bad(pgd))
continue;
pages += change_p4d_range(vma, pgd, addr, next, newprot,
--
2.32.0

2022-04-05 03:43:43

by Peter Xu

Subject: [PATCH v8 05/23] mm/shmem: Take care of UFFDIO_COPY_MODE_WP

Pass wp_copy into shmem_mfill_atomic_pte() through the stack, then
apply the UFFD_WP bit properly when the UFFDIO_COPY on shmem is
requested with UFFDIO_COPY_MODE_WP. wp_copy finally lands in
mfill_atomic_install_pte().

Note: we must do pte_wrprotect() if !writable in mfill_atomic_install_pte(), as
mk_pte() could return a writable pte (e.g., when VM_SHARED on a shmem file).
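
One intended use case is registering shmem with MISSING|WP at the same
time, then resolving a missing fault while keeping the page
wr-protected. A sketch (names illustrative):

  struct uffdio_register reg = {
      .range = { .start = (unsigned long)base, .len = len },
      .mode  = UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP,
  };
  ioctl(uffd, UFFDIO_REGISTER, &reg);

  /* in the fault handler: fill the page but keep it wr-protected */
  struct uffdio_copy copy = {
      .dst  = fault_addr & ~(page_size - 1),
      .src  = (unsigned long)page_buf,
      .len  = page_size,
      .mode = UFFDIO_COPY_MODE_WP,
  };
  ioctl(uffd, UFFDIO_COPY, &copy);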

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/shmem_fs.h | 4 ++--
mm/shmem.c | 4 ++--
mm/userfaultfd.c | 23 ++++++++++++++++++-----
3 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 3e915cc550bc..a68f982f22d1 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -145,11 +145,11 @@ extern int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
unsigned long src_addr,
- bool zeropage,
+ bool zeropage, bool wp_copy,
struct page **pagep);
#else /* !CONFIG_SHMEM */
#define shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, \
- src_addr, zeropage, pagep) ({ BUG(); 0; })
+ src_addr, zeropage, wp_copy, pagep) ({ BUG(); 0; })
#endif /* CONFIG_SHMEM */
#endif /* CONFIG_USERFAULTFD */

diff --git a/mm/shmem.c b/mm/shmem.c
index 7004c7f55716..9efb8a96d75e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2319,7 +2319,7 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
unsigned long src_addr,
- bool zeropage,
+ bool zeropage, bool wp_copy,
struct page **pagep)
{
struct inode *inode = file_inode(dst_vma->vm_file);
@@ -2392,7 +2392,7 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
goto out_release;

ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
- page, true, false);
+ page, true, wp_copy);
if (ret)
goto out_delete_from_cache;

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index dae25d985d15..b1c875b77fbb 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -77,10 +77,19 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
* Always mark a PTE as write-protected when needed, regardless of
* VM_WRITE, which the user might change.
*/
- if (wp_copy)
+ if (wp_copy) {
_dst_pte = pte_mkuffd_wp(_dst_pte);
- else if (writable)
+ writable = false;
+ }
+
+ if (writable)
_dst_pte = pte_mkwrite(_dst_pte);
+ else
+ /*
+ * We need this to make sure the write bit is removed, as
+ * mk_pte() could return a pte with the write bit set.
+ */
+ _dst_pte = pte_wrprotect(_dst_pte);

dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);

@@ -95,7 +104,12 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
}

ret = -EEXIST;
- if (!pte_none(*dst_pte))
+ /*
+ * We allow to overwrite a pte marker: consider when both MISSING|WP
+ * registered, we firstly wr-protect a none pte which has no page cache
+ * page backing it, then access the page.
+ */
+ if (!pte_none_mostly(*dst_pte))
goto out_unlock;

if (page_in_cache) {
@@ -479,11 +493,10 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
err = mfill_zeropage_pte(dst_mm, dst_pmd,
dst_vma, dst_addr);
} else {
- VM_WARN_ON_ONCE(wp_copy);
err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
dst_addr, src_addr,
mode != MCOPY_ATOMIC_NORMAL,
- page);
+ wp_copy, page);
}

return err;
--
2.32.0

2022-04-05 03:43:54

by Peter Xu

Subject: [PATCH v8 10/23] mm/shmem: Handle uffd-wp during fork()

Normally we skip copying pages during fork() for VM_SHARED shmem, but
we can't skip it anymore if uffd-wp is enabled on the dst vma. This
should only happen when the src uffd has UFFD_FEATURE_EVENT_FORK
enabled on a uffd-wp shmem vma, so that VM_UFFD_WP will be propagated
onto the dst vma too; then we should copy the pgtables with the uffd-wp
bit and pte markers, because this information would otherwise be lost.

Since the condition checks for deciding "whether a vma needs to copy
the pgtable during fork()" will become even more complicated, introduce
a helper vma_needs_copy() for it, so everything will be clearer.
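
For reference, the userspace setup that makes this path relevant looks
like the following sketch (assuming the monitor already holds a uffd
for the parent):

  struct uffdio_api api = {
      .api      = UFFD_API,
      .features = UFFD_FEATURE_EVENT_FORK,
  };
  ioctl(uffd, UFFDIO_API, &api);

  /* after the tracked process fork()s, the monitor reads an event: */
  struct uffd_msg msg;
  read(uffd, &msg, sizeof(msg));
  if (msg.event == UFFD_EVENT_FORK)
          child_uffd = msg.arg.fork.ufd;   /* tracks the child's mm */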

Signed-off-by: Peter Xu <[email protected]>
---
mm/memory.c | 49 +++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 41 insertions(+), 8 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 1144845ff734..8ba1bb196095 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -867,6 +867,14 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (try_restore_exclusive_pte(src_pte, src_vma, addr))
return -EBUSY;
return -ENOENT;
+ } else if (is_pte_marker_entry(entry)) {
+ /*
+ * If we're copying this pgtable, it should only be because
+ * dst_vma has uffd-wp enabled; do a sanity check.
+ */
+ WARN_ON_ONCE(!userfaultfd_wp(dst_vma));
+ set_pte_at(dst_mm, addr, dst_pte, pte);
+ return 0;
}
if (!userfaultfd_wp(dst_vma))
pte = pte_swp_clear_uffd_wp(pte);
@@ -1221,6 +1229,38 @@ copy_p4d_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
return 0;
}

+/*
+ * Return true if the vma needs to copy the pgtable during this fork(). Return
+ * false when we can speed up fork() by allowing lazy page faults later until
+ * when the child accesses the memory range.
+ */
+bool
+vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
+{
+ /*
+ * Always copy pgtables when dst_vma has uffd-wp enabled even if it's
+ * file-backed (e.g. shmem). Because when uffd-wp is enabled, pgtable
+ * contains uffd-wp protection information, that's something we can't
+ * retrieve from page cache, and skip copying will lose those info.
+ */
+ if (userfaultfd_wp(dst_vma))
+ return true;
+
+ if (src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP))
+ return true;
+
+ if (src_vma->anon_vma)
+ return true;
+
+ /*
+ * Don't copy ptes where a page fault will fill them correctly. Fork
+ * becomes much lighter when there are big shared or private readonly
+ * mappings. The tradeoff is that copy_page_range is more efficient
+ * than faulting.
+ */
+ return false;
+}
+
int
copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
{
@@ -1234,14 +1274,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
bool is_cow;
int ret;

- /*
- * Don't copy ptes where a page fault will fill them correctly.
- * Fork becomes much lighter when there are big shared or private
- * readonly mappings. The tradeoff is that copy_page_range is more
- * efficient than faulting.
- */
- if (!(src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
- !src_vma->anon_vma)
+ if (!vma_needs_copy(dst_vma, src_vma))
return 0;

if (is_vm_hugetlb_page(src_vma))
--
2.32.0

2022-04-05 03:43:56

by Peter Xu

Subject: [PATCH v8 02/23] mm: Teach core mm about pte markers

This patch still does not use pte markers in any way, but it teaches
the core mm about the pte marker idea.

For example, handle_pte_marker() is introduced; it will parse and
handle all the pte marker faults.

Many of the changes are just comments, so that we know where pte
markers can possibly show up and why no special code is needed for
those cases.

Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 10 ++++++----
mm/filemap.c | 5 +++++
mm/hmm.c | 2 +-
mm/memcontrol.c | 8 ++++++--
mm/memory.c | 23 +++++++++++++++++++++++
mm/mincore.c | 3 ++-
mm/mprotect.c | 3 +++
7 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index aa0c47cb0d16..8b4a94f5a238 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -249,9 +249,10 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,

/*
* Lockless access: we're in a wait_event so it's ok if it
- * changes under us.
+ * changes under us. PTE markers should be handled the same as none
+ * ptes here.
*/
- if (huge_pte_none(pte))
+ if (huge_pte_none_mostly(pte))
ret = true;
if (!huge_pte_write(pte) && (reason & VM_UFFD_WP))
ret = true;
@@ -330,9 +331,10 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
pte = pte_offset_map(pmd, address);
/*
* Lockless access: we're in a wait_event so it's ok if it
- * changes under us.
+ * changes under us. PTE markers should be handled the same as none
+ * ptes here.
*/
- if (pte_none(*pte))
+ if (pte_none_mostly(*pte))
ret = true;
if (!pte_write(*pte) && (reason & VM_UFFD_WP))
ret = true;
diff --git a/mm/filemap.c b/mm/filemap.c
index 3a5ffb5587cd..ef77dae8c28d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3382,6 +3382,11 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
vmf->pte += xas.xa_index - last_pgoff;
last_pgoff = xas.xa_index;

+ /*
+ * NOTE: If there're PTE markers, we'll leave them to be
+ * handled in the specific fault path, and it'll prohibit the
+ * fault-around logic.
+ */
if (!pte_none(*vmf->pte))
goto unlock;

diff --git a/mm/hmm.c b/mm/hmm.c
index af71aac3140e..3fd3242c5e50 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -239,7 +239,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
pte_t pte = *ptep;
uint64_t pfn_req_flags = *hmm_pfn;

- if (pte_none(pte)) {
+ if (pte_none_mostly(pte)) {
required_fault =
hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
if (required_fault)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7a08737bac4b..08af97c73f0f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5644,10 +5644,14 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,

if (pte_present(ptent))
page = mc_handle_present_pte(vma, addr, ptent);
+ else if (pte_none_mostly(ptent))
+ /*
+ * PTE markers should be treated as a none pte here, separated
+ * from other swap handling below.
+ */
+ page = mc_handle_file_pte(vma, addr, ptent);
else if (is_swap_pte(ptent))
page = mc_handle_swap_pte(vma, ptent, &ent);
- else if (pte_none(ptent))
- page = mc_handle_file_pte(vma, addr, ptent);

if (!page && !ent.val)
return ret;
diff --git a/mm/memory.c b/mm/memory.c
index 2c5d1bb4694f..3f396241a7db 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -100,6 +100,8 @@ struct page *mem_map;
EXPORT_SYMBOL(mem_map);
#endif

+static vm_fault_t do_fault(struct vm_fault *vmf);
+
/*
* A number of key systems in x86 including ioremap() rely on the assumption
* that high_memory defines the upper bound on direct map memory, then end
@@ -1415,6 +1417,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (!should_zap_page(details, page))
continue;
rss[mm_counter(page)]--;
+ } else if (is_pte_marker_entry(entry)) {
+ /* By default, simply drop all pte markers when zap */
} else if (is_hwpoison_entry(entry)) {
if (!should_zap_cows(details))
continue;
@@ -3555,6 +3559,23 @@ static inline bool should_try_to_free_swap(struct page *page,
page_count(page) == 2;
}

+static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
+{
+ swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
+ unsigned long marker = pte_marker_get(entry);
+
+ /*
+ * PTE markers should always be with file-backed memories, and the
+ * marker should never be empty. If anything weird happened, the best
+ * thing to do is to kill the process along with its mm.
+ */
+ if (WARN_ON_ONCE(vma_is_anonymous(vmf->vma) || !marker))
+ return VM_FAULT_SIGBUS;
+
+ /* TODO: handle pte markers */
+ return 0;
+}
+
/*
* We enter with non-exclusive mmap_lock (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
@@ -3592,6 +3613,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
} else if (is_hwpoison_entry(entry)) {
ret = VM_FAULT_HWPOISON;
+ } else if (is_pte_marker_entry(entry)) {
+ ret = handle_pte_marker(vmf);
} else {
print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
ret = VM_FAULT_SIGBUS;
diff --git a/mm/mincore.c b/mm/mincore.c
index f4f627325e12..fa200c14185f 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -122,7 +122,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
for (; addr != end; ptep++, addr += PAGE_SIZE) {
pte_t pte = *ptep;

- if (pte_none(pte))
+ /* We need to do cache lookup too for pte markers */
+ if (pte_none_mostly(pte))
__mincore_unmapped_range(addr, addr + PAGE_SIZE,
vma, vec);
else if (pte_present(pte))
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 56060acdabd3..709a6f73b764 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -188,6 +188,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
newpte = pte_swp_mksoft_dirty(newpte);
if (pte_swp_uffd_wp(oldpte))
newpte = pte_swp_mkuffd_wp(newpte);
+ } else if (is_pte_marker_entry(entry)) {
+ /* Skip it, the same as none pte */
+ continue;
} else {
newpte = oldpte;
}
--
2.32.0

2022-04-05 03:44:00

by Peter Xu

Subject: [PATCH v8 11/23] mm/hugetlb: Introduce huge pte version of uffd-wp helpers

They will be used in the follow-up patches to check/set/clear the
uffd-wp bit of a huge pte.

So far they reuse all the small pte helpers. Archs can override these
versions when necessary (with __HAVE_ARCH_HUGE_PTE_UFFD_WP* macros) in
the future.

Signed-off-by: Peter Xu <[email protected]>
---
arch/s390/include/asm/hugetlb.h | 15 +++++++++++++++
include/asm-generic/hugetlb.h | 15 +++++++++++++++
2 files changed, 30 insertions(+)

diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index bea47e7cc6a0..be99eda87f4d 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -115,6 +115,21 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
return pte_modify(pte, newprot);
}

+static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
+{
+ return pte;
+}
+
+static inline pte_t huge_pte_clear_uffd_wp(pte_t pte)
+{
+ return pte;
+}
+
+static inline int huge_pte_uffd_wp(pte_t pte)
+{
+ return 0;
+}
+
static inline bool gigantic_page_runtime_supported(void)
{
return true;
diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index f39cad20ffc6..896f341f614d 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -35,6 +35,21 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
return pte_modify(pte, newprot);
}

+static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
+{
+ return pte_mkuffd_wp(pte);
+}
+
+static inline pte_t huge_pte_clear_uffd_wp(pte_t pte)
+{
+ return pte_clear_uffd_wp(pte);
+}
+
+static inline int huge_pte_uffd_wp(pte_t pte)
+{
+ return pte_uffd_wp(pte);
+}
+
#ifndef __HAVE_ARCH_HUGE_PTE_CLEAR
static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, unsigned long sz)
--
2.32.0

2022-04-05 03:44:02

by Peter Xu

Subject: [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()

We used to check against the none pte in finish_fault(), with the
assumption that orig_pte is always the none pte.

This change prepares us to be able to call do_fault() on !none ptes.
For example, we should allow that to happen for pte markers, so that we
can restore information out of the pte markers.

Let's change the "pte_none" check into detecting changes since we
fetched orig_pte. One trivial thing to take care of here is that when
pmd==NULL for the pgtable, we may not initialize orig_pte at all in
handle_pte_fault().

By default orig_pte will be all zeros; the problem is that not all
architectures use all-zeros for a none pte. pte_clear() is the right
thing to use here so that we'll always have a valid orig_pte value for
the whole handle_pte_fault() call.

Signed-off-by: Peter Xu <[email protected]>
---
mm/memory.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3f396241a7db..b1af996b09ca 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4241,7 +4241,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
vmf->address, &vmf->ptl);
ret = 0;
/* Re-check under ptl */
- if (likely(pte_none(*vmf->pte)))
+ if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
do_set_pte(vmf, page, vmf->address);
else
ret = VM_FAULT_NOPAGE;
@@ -4709,6 +4709,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
* concurrent faults and from rmap lookups.
*/
vmf->pte = NULL;
+ /*
+ * Always initialize orig_pte. This matches with below
+ * code to have orig_pte to be the none pte if pte==NULL.
+ * This makes the rest code to be always safe to reference
+ * it, e.g. in finish_fault() we'll detect pte changes.
+ */
+ pte_clear(vmf->vma->vm_mm, vmf->address, &vmf->orig_pte);
} else {
/*
* If a huge pmd materialized under us just retry later. Use
--
2.32.0

2022-04-05 03:44:07

by Peter Xu

Subject: [PATCH v8 04/23] mm/uffd: PTE_MARKER_UFFD_WP

This patch introduces the 1st user of pte marker: the uffd-wp marker.

When the pte marker is installed with the uffd-wp bit set, it means this pte
was wr-protected by uffd.

We will use this special pte to arm the ptes that got either unmapped
or swapped out in a file-backed region that was previously
wr-protected. This special pte can trigger a page fault just like swap
entries.

This idea is greatly inspired by Hugh and Andrea in the discussion, which is
referenced in the links below.

Some helpers are introduced to detect whether a swap pte is uffd
wr-protected. With pte markers introduced, a swap pte can be
wr-protected in two forms: either it is a normal swap pte with
_PAGE_SWP_UFFD_WP set, or it's a pte marker that has PTE_MARKER_UFFD_WP
set.
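
In short, the two forms and the helper covering both are (condensed
from the code below):

  pte_swp_uffd_wp(pte)      /* normal swap pte, _PAGE_SWP_UFFD_WP set */
  pte_marker_uffd_wp(pte)   /* pte marker with PTE_MARKER_UFFD_WP set */
  pte_swp_uffd_wp_any(pte)  /* true if either of the two applies */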

Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Suggested-by: Andrea Arcangeli <[email protected]>
Suggested-by: Hugh Dickins <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/swapops.h | 3 ++-
include/linux/userfaultfd_k.h | 43 +++++++++++++++++++++++++++++++++++
mm/Kconfig | 9 ++++++++
3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 7a00627845f0..fffbba0036f6 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -276,7 +276,8 @@ static inline int is_readable_migration_entry(swp_entry_t entry)

typedef unsigned long pte_marker;

-#define PTE_MARKER_MASK (0)
+#define PTE_MARKER_UFFD_WP BIT(0)
+#define PTE_MARKER_MASK (PTE_MARKER_UFFD_WP)

#ifdef CONFIG_PTE_MARKER

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 33cea484d1ad..bd09c3c89b59 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -15,6 +15,8 @@

#include <linux/fcntl.h>
#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
#include <asm-generic/pgtable_uffd.h>

/* The set of all possible UFFD-related VM flags. */
@@ -236,4 +238,45 @@ static inline void userfaultfd_unmap_complete(struct mm_struct *mm,

#endif /* CONFIG_USERFAULTFD */

+static inline bool pte_marker_entry_uffd_wp(swp_entry_t entry)
+{
+ return is_pte_marker_entry(entry) &&
+ (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
+}
+
+static inline bool pte_marker_uffd_wp(pte_t pte)
+{
+#ifdef CONFIG_PTE_MARKER_UFFD_WP
+ swp_entry_t entry;
+
+ if (!is_swap_pte(pte))
+ return false;
+
+ entry = pte_to_swp_entry(pte);
+
+ return pte_marker_entry_uffd_wp(entry);
+#else
+ return false;
+#endif
+}
+
+/*
+ * Returns true if this is a swap pte and was uffd-wp wr-protected in either
+ * forms (pte marker or a normal swap pte), false otherwise.
+ */
+static inline bool pte_swp_uffd_wp_any(pte_t pte)
+{
+#ifdef CONFIG_PTE_MARKER_UFFD_WP
+ if (!is_swap_pte(pte))
+ return false;
+
+ if (pte_swp_uffd_wp(pte))
+ return true;
+
+ if (pte_marker_uffd_wp(pte))
+ return true;
+#endif
+ return false;
+}
+
#endif /* _LINUX_USERFAULTFD_K_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index a1688b9314b2..6e7c2d59fa96 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -915,6 +915,15 @@ config PTE_MARKER
help
Allows to create marker PTEs for file-backed memory.

+config PTE_MARKER_UFFD_WP
+ bool "Marker PTEs support for userfaultfd write protection"
+ depends on PTE_MARKER && HAVE_ARCH_USERFAULTFD_WP
+
+ help
+ Allows to create marker PTEs for userfaultfd write protection
+ purposes. It is required to enable userfaultfd write protection on
+ file-backed memory types like shmem and hugetlbfs.
+
source "mm/damon/Kconfig"

endmenu
--
2.32.0

2022-04-05 03:44:09

by Peter Xu

Subject: [PATCH v8 06/23] mm/shmem: Handle uffd-wp special pte in page fault handler

File-backed memory is prone to unmap/swap, so its ptes are always
unstable: the pages can easily be faulted back in later using the page
cache. This could lead to uffd-wp getting lost when unmapping or
swapping out such memory. One example is shmem. PTE markers are needed
to store that information.

This patch prepares for that by handling uffd-wp pte markers in the
page fault paths before the markers are installed elsewhere, so that
the page fault handler can already recognize uffd-wp pte markers.

The handling of a uffd-wp pte marker is similar to a missing fault;
it's just that we handle this "missing fault" when we see the pte
marker, and meanwhile we need to make sure the marker information is
kept while processing the fault.

This is a slow path of uffd-wp handling, because zapping of wr-protected shmem
ptes should be rare. So far it should only trigger in two conditions:

(1) When trying to punch holes in shmem_fallocate(), there is an optimization
to zap the pgtables before evicting the page.

(2) When swapping out shmem pages.

Because of this, the page fault handling is simplified too: we don't
send the wr-protect message on the 1st page fault; instead the page is
installed read-only, so the uffd-wp message will be generated on the
next fault, which triggers the do_wp_page() path of the general uffd-wp
handling.

Disable fault-around for all uffd-wp registered ranges for extra
safety, just like for uffd-minor faults, and clean the code up.
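
From the monitor's perspective the message flow is the usual uffd-wp
one; a sketch of the resolving side (names illustrative, "psize" is the
page size):

  struct uffd_msg msg;
  read(uffd, &msg, sizeof(msg));
  if (msg.event == UFFD_EVENT_PAGEFAULT &&
      (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) {
          struct uffdio_writeprotect wp = {
              .range = {
                  .start = msg.arg.pagefault.address & ~(psize - 1),
                  .len   = psize,
              },
              .mode  = 0,                /* un-protect, i.e. resolve */
          };
          ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);  /* wakes the faulter */
  }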

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/userfaultfd_k.h | 17 +++++++++
mm/memory.c | 67 ++++++++++++++++++++++++++++++-----
2 files changed, 75 insertions(+), 9 deletions(-)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index bd09c3c89b59..827e38b7be65 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -96,6 +96,18 @@ static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
}

+/*
+ * Don't do fault around for either WP or MINOR registered uffd range. For
+ * MINOR registered range, fault around will be a total disaster and ptes can
+ * be installed without notifications; for WP it should mostly be fine as long
+ * as the fault around checks for pte_none() before the installation, however
+ * to be super safe we just forbid it.
+ */
+static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
+{
+ return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
+}
+
static inline bool userfaultfd_missing(struct vm_area_struct *vma)
{
return vma->vm_flags & VM_UFFD_MISSING;
@@ -236,6 +248,11 @@ static inline void userfaultfd_unmap_complete(struct mm_struct *mm,
{
}

+static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
+{
+ return false;
+}
+
#endif /* CONFIG_USERFAULTFD */

static inline bool pte_marker_entry_uffd_wp(swp_entry_t entry)
diff --git a/mm/memory.c b/mm/memory.c
index b1af996b09ca..21abb8a30553 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3559,6 +3559,39 @@ static inline bool should_try_to_free_swap(struct page *page,
page_count(page) == 2;
}

+static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
+{
+ vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ /*
+ * Be careful so that we will only recover a special uffd-wp pte into a
+ * none pte. Otherwise it means the pte could have changed, so retry.
+ */
+ if (is_pte_marker(*vmf->pte))
+ pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
+}
+
+/*
+ * This is actually a page-missing access, but with uffd-wp special pte
+ * installed. It means this pte was wr-protected before being unmapped.
+ */
+static vm_fault_t pte_marker_handle_uffd_wp(struct vm_fault *vmf)
+{
+ /*
+ * Just in case there're leftover special ptes even after the region
+ * got unregistered - we can simply clear them. We can also do that
+ * proactively when e.g. when we do UFFDIO_UNREGISTER upon some uffd-wp
+ * ranges, but it should be more efficient to be done lazily here.
+ */
+ if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
+ return pte_marker_clear(vmf);
+
+ /* do_fault() can handle pte markers too like none pte */
+ return do_fault(vmf);
+}
+
static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
{
swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
@@ -3572,8 +3605,11 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
if (WARN_ON_ONCE(vma_is_anonymous(vmf->vma) || !marker))
return VM_FAULT_SIGBUS;

- /* TODO: handle pte markers */
- return 0;
+ if (pte_marker_entry_uffd_wp(entry))
+ return pte_marker_handle_uffd_wp(vmf);
+
+ /* This is an unknown pte marker */
+ return VM_FAULT_SIGBUS;
}

/*
@@ -4157,6 +4193,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
{
struct vm_area_struct *vma = vmf->vma;
+ bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
bool write = vmf->flags & FAULT_FLAG_WRITE;
bool prefault = vmf->address != addr;
pte_t entry;
@@ -4171,6 +4208,8 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)

if (write)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (unlikely(uffd_wp))
+ entry = pte_mkuffd_wp(pte_wrprotect(entry));
/* copy-on-write page */
if (write && !(vma->vm_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
@@ -4344,9 +4383,21 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
return vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
}

+/* Return true if we should do read fault-around, false otherwise */
+static inline bool should_fault_around(struct vm_fault *vmf)
+{
+ /* No ->map_pages? No way to fault around... */
+ if (!vmf->vma->vm_ops->map_pages)
+ return false;
+
+ if (uffd_disable_fault_around(vmf->vma))
+ return false;
+
+ return fault_around_bytes >> PAGE_SHIFT > 1;
+}
+
static vm_fault_t do_read_fault(struct vm_fault *vmf)
{
- struct vm_area_struct *vma = vmf->vma;
vm_fault_t ret = 0;

/*
@@ -4354,12 +4405,10 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
* if page by the offset is not ready to be mapped (cold cache or
* something).
*/
- if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
- if (likely(!userfaultfd_minor(vmf->vma))) {
- ret = do_fault_around(vmf);
- if (ret)
- return ret;
- }
+ if (should_fault_around(vmf)) {
+ ret = do_fault_around(vmf);
+ if (ret)
+ return ret;
}

ret = __do_fault(vmf);
--
2.32.0
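
For reference, the userspace side of resolving these faults looks roughly like
the sketch below: a write to a wr-protected page (whether backed by a present
pte or by a pte marker) shows up as a uffd message with UFFD_PAGEFAULT_FLAG_WP
set, and is resolved by un-protecting the page. Helper name is made up for
illustration; error handling omitted; assumes the range was registered with
UFFDIO_REGISTER_MODE_WP:

    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/userfaultfd.h>

    /* Resolve uffd-wp faults by dropping the protection on the faulting page */
    static void handle_wp_faults(int uffd, unsigned long page_size)
    {
        struct uffd_msg msg;

        while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
            if (msg.event != UFFD_EVENT_PAGEFAULT ||
                !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
                continue;

            struct uffdio_writeprotect wp = {
                .range = {
                    .start = msg.arg.pagefault.address & ~(page_size - 1),
                    .len = page_size,
                },
                .mode = 0,  /* un-protect, and wake the faulting thread */
            };
            ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
        }
    }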

2022-04-05 03:44:18

by Peter Xu

[permalink] [raw]
Subject: [PATCH v8 07/23] mm/shmem: Persist uffd-wp bit across zapping for file-backed

File-backed memory is prone to being unmapped at any time. It means all
information in the pte will be dropped, including the uffd-wp flag.

To persist the uffd-wp flag, we'll use the pte markers. This patch teaches the
zap code to understand uffd-wp and know when to keep or drop the uffd-wp bit.

Add a new flag ZAP_FLAG_DROP_MARKER and set it in zap_details when we don't
want to persist such information, for example, when destroying the whole
vma, or punching a hole in a shmem file. For all other cases we should never
drop the uffd-wp bit, or the wr-protect information will get lost.

The new ZAP_FLAG_DROP_MARKER needs to be put into mm.h rather than memory.c
because it'll be further referenced in hugetlb files later.

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 10 ++++++++
include/linux/mm_inline.h | 43 ++++++++++++++++++++++++++++++++++
mm/memory.c | 49 ++++++++++++++++++++++++++++++++++++---
mm/rmap.c | 8 +++++++
4 files changed, 107 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 26428ff262fc..857bc8f7af45 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3422,4 +3422,14 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
}
#endif

+typedef unsigned int __bitwise zap_flags_t;
+
+/*
+ * Whether to drop the pte markers, for example, the uffd-wp information for
+ * file-backed memory. This should only be specified when we will completely
+ * drop the page in the mm, either by truncation or unmapping of the vma. By
+ * default, the flag is not set.
+ */
+#define ZAP_FLAG_DROP_MARKER ((__force zap_flags_t) BIT(0))
+
#endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index ac32125745ab..7b25b53c474a 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -6,6 +6,8 @@
#include <linux/huge_mm.h>
#include <linux/swap.h>
#include <linux/string.h>
+#include <linux/userfaultfd_k.h>
+#include <linux/swapops.h>

/**
* folio_is_file_lru - Should the folio be on a file LRU or anon LRU?
@@ -316,5 +318,46 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
return atomic_read(&mm->tlb_flush_pending) > 1;
}

+/*
+ * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
+ * replace a none pte. NOTE! This should only be called when *pte is already
+ * cleared so we will never accidentally replace something valuable. Meanwhile
+ * none pte also means we are not demoting the pte so tlb flushed is not needed.
+ * E.g., when pte cleared the caller should have taken care of the tlb flush.
+ *
+ * Must be called with pgtable lock held so that no thread will see the none
+ * pte, and if they see it, they'll fault and serialize at the pgtable lock.
+ *
+ * This function is a no-op if PTE_MARKER_UFFD_WP is not enabled.
+ */
+static inline void
+pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *pte, pte_t pteval)
+{
+#ifdef CONFIG_PTE_MARKER_UFFD_WP
+ bool arm_uffd_pte = false;
+
+ /* The current status of the pte should be "cleared" before calling */
+ WARN_ON_ONCE(!pte_none(*pte));
+
+ if (vma_is_anonymous(vma) || !userfaultfd_wp(vma))
+ return;
+
+ /* A uffd-wp wr-protected normal pte */
+ if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
+ arm_uffd_pte = true;
+
+ /*
+ * A uffd-wp wr-protected swap pte. Note: this should even cover an
+ * existing pte marker with uffd-wp bit set.
+ */
+ if (unlikely(pte_swp_uffd_wp_any(pteval)))
+ arm_uffd_pte = true;
+
+ if (unlikely(arm_uffd_pte))
+ set_pte_at(vma->vm_mm, addr, pte,
+ make_pte_marker(PTE_MARKER_UFFD_WP));
+#endif
+}

#endif
diff --git a/mm/memory.c b/mm/memory.c
index 21abb8a30553..1144845ff734 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -74,6 +74,7 @@
#include <linux/perf_event.h>
#include <linux/ptrace.h>
#include <linux/vmalloc.h>
+#include <linux/mm_inline.h>

#include <trace/events/kmem.h>

@@ -1306,6 +1307,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
struct zap_details {
struct folio *single_folio; /* Locked folio to be unmapped */
bool even_cows; /* Zap COWed private pages too? */
+ zap_flags_t zap_flags; /* Extra flags for zapping */
};

/* Whether we should zap all COWed (private) pages too */
@@ -1334,6 +1336,29 @@ static inline bool should_zap_page(struct zap_details *details, struct page *pag
return !PageAnon(page);
}

+static inline bool zap_drop_file_uffd_wp(struct zap_details *details)
+{
+ if (!details)
+ return false;
+
+ return details->zap_flags & ZAP_FLAG_DROP_MARKER;
+}
+
+/*
+ * This function makes sure that we'll replace the none pte with an uffd-wp
+ * swap special pte marker when necessary. Must be with the pgtable lock held.
+ */
+static inline void
+zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *pte,
+ struct zap_details *details, pte_t pteval)
+{
+ if (zap_drop_file_uffd_wp(details))
+ return;
+
+ pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
+}
+
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
@@ -1371,6 +1396,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
tlb_remove_tlb_entry(tlb, pte, addr);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, details,
+ ptent);
if (unlikely(!page))
continue;

@@ -1401,6 +1428,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
page = pfn_swap_entry_to_page(entry);
if (unlikely(!should_zap_page(details, page)))
continue;
+ /*
+ * Both device private/exclusive mappings should only
+ * work with anonymous page so far, so we don't need to
+ * consider uffd-wp bit when zap. For more information,
+ * see zap_install_uffd_wp_if_needed().
+ */
+ WARN_ON_ONCE(!vma_is_anonymous(vma));
rss[mm_counter(page)]--;
if (is_device_private_entry(entry))
page_remove_rmap(page, vma, false);
@@ -1417,8 +1451,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (!should_zap_page(details, page))
continue;
rss[mm_counter(page)]--;
- } else if (is_pte_marker_entry(entry)) {
- /* By default, simply drop all pte markers when zap */
+ } else if (pte_marker_entry_uffd_wp(entry)) {
+ /* Only drop the uffd-wp marker if explicitly requested */
+ if (!zap_drop_file_uffd_wp(details))
+ continue;
} else if (is_hwpoison_entry(entry)) {
if (!should_zap_cows(details))
continue;
@@ -1427,6 +1463,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
WARN_ON_ONCE(1);
}
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
} while (pte++, addr += PAGE_SIZE, addr != end);

add_mm_rss_vec(mm, rss);
@@ -1637,12 +1674,17 @@ void unmap_vmas(struct mmu_gather *tlb,
unsigned long end_addr)
{
struct mmu_notifier_range range;
+ struct zap_details details = {
+ .zap_flags = ZAP_FLAG_DROP_MARKER,
+ /* Careful - we need to zap private pages too! */
+ .even_cows = true,
+ };

mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
start_addr, end_addr);
mmu_notifier_invalidate_range_start(&range);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
- unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
+ unmap_single_vma(tlb, vma, start_addr, end_addr, &details);
mmu_notifier_invalidate_range_end(&range);
}

@@ -3438,6 +3480,7 @@ void unmap_mapping_folio(struct folio *folio)

details.even_cows = false;
details.single_folio = folio;
+ details.zap_flags = ZAP_FLAG_DROP_MARKER;

i_mmap_lock_read(mapping);
if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
diff --git a/mm/rmap.c b/mm/rmap.c
index 208b2c683cec..69416072b1a6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -73,6 +73,7 @@
#include <linux/page_idle.h>
#include <linux/memremap.h>
#include <linux/userfaultfd_k.h>
+#include <linux/mm_inline.h>

#include <asm/tlbflush.h>

@@ -1538,6 +1539,13 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
pteval = ptep_clear_flush(vma, address, pvmw.pte);
}

+ /*
+ * Now the pte is cleared. If this pte was uffd-wp armed,
+ * we may want to replace a none pte with a marker pte if
+ * it's file-backed, so we don't lose the tracking info.
+ */
+ pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
+
/* Set the dirty flag on the folio now the pte is gone. */
if (pte_dirty(pteval))
folio_mark_dirty(folio);
--
2.32.0
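
To see the effect from userspace: a wr-protected page on shmem keeps trapping
writes even after its pte has been zapped, e.g. by MADV_DONTNEED, which does
not set ZAP_FLAG_DROP_MARKER. A rough sketch (helper name made up, error
handling omitted; assumes uffd has already gone through UFFDIO_API and a
monitor thread is consuming the messages):

    #define _GNU_SOURCE
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <linux/userfaultfd.h>

    static void wp_survives_zap(int uffd, unsigned long psz)
    {
        int fd = memfd_create("wp-test", 0);    /* shmem-backed */
        char *p;

        ftruncate(fd, psz);
        p = mmap(NULL, psz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        struct uffdio_register reg = {
            .range = { .start = (unsigned long)p, .len = psz },
            .mode = UFFDIO_REGISTER_MODE_WP,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        p[0] = 1;                               /* populate the pte */

        struct uffdio_writeprotect wp = {
            .range = { .start = (unsigned long)p, .len = psz },
            .mode = UFFDIO_WRITEPROTECT_MODE_WP,
        };
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

        /* Zap the pte; the uffd-wp marker should be left in place */
        madvise(p, psz, MADV_DONTNEED);

        p[0] = 2;                               /* should still trap as WP */
    }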

2022-04-05 03:44:23

by Peter Xu

[permalink] [raw]
Subject: [PATCH v8 09/23] mm/shmem: Allow file-backed mem to be uffd wr-protected on thps

We don't have a "huge" version of pte markers; instead, when necessary we split
the thp.

However, splitting the thp is not enough, because file-backed thps are handled
totally differently from anonymous thps: rather than doing a real split, the
thp pmd will simply get cleared in __split_huge_pmd_locked().

That is a problem when, e.g., a thp covers the range [0, 2M) but we want to
wr-protect a small page residing in the [4K, 8K) range, because after
__split_huge_pmd() returns there will be a none pmd, and change_pmd_range()
will just skip it right after the split.

Here we leverage the previously introduced change_pmd_prepare() macro so that
we'll populate the pmd with a pgtable page after the pmd split (in which
process the pmd will be cleared for cases like shmem). Then change_pte_range()
will do all the rest for us by installing the uffd-wp pte marker at any none
pte that we'd like to wr-protect.

Signed-off-by: Peter Xu <[email protected]>
---
mm/mprotect.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index bd62d5938c6c..e0a567b66d07 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -333,8 +333,15 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
}

if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
- if (next - addr != HPAGE_PMD_SIZE) {
+ if ((next - addr != HPAGE_PMD_SIZE) ||
+ uffd_wp_protect_file(vma, cp_flags)) {
__split_huge_pmd(vma, pmd, addr, false, NULL);
+ /*
+ * For file-backed, the pmd could have been
+ * cleared; make sure pmd populated if
+ * necessary, then fall-through to pte level.
+ */
+ change_pmd_prepare(vma, pmd, cp_flags);
} else {
int nr_ptes = change_huge_pmd(vma, pmd, addr,
newprot, cp_flags);
--
2.32.0
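
From the uffd API there is nothing new to do here: wr-protecting a 4K subrange
of a THP-backed shmem mapping just works, with the pmd split (and re-populated)
happening underneath. A hedged sketch; map_2m is assumed to be a 2M-aligned
shmem mapping already registered with UFFDIO_REGISTER_MODE_WP, and
MADV_HUGEPAGE is only a hint, so a thp may or may not have been mapped:

    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/userfaultfd.h>

    #define SZ_2M (2UL << 20)

    static void wp_one_page_in_thp(int uffd, char *map_2m, unsigned long psz)
    {
        madvise(map_2m, SZ_2M, MADV_HUGEPAGE);  /* hint: try to use a thp */
        memset(map_2m, 0, SZ_2M);               /* populate the range */

        /* Protect only [4K, 8K); the kernel splits the pmd if a thp is mapped */
        struct uffdio_writeprotect wp = {
            .range = { .start = (unsigned long)map_2m + psz, .len = psz },
            .mode = UFFDIO_WRITEPROTECT_MODE_WP,
        };
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
    }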

2022-04-05 03:44:24

by Peter Xu

[permalink] [raw]
Subject: [PATCH v8 12/23] mm/hugetlb: Hook page faults for uffd write protection

Hook up hugetlb_fault() with the capability to handle userfaultfd-wp faults.

We do this slightly earlier than hugetlb_wp() so that we can avoid taking some
extra locks that we definitely don't need.

Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dd642cfc538b..82df0fcfedf9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5711,6 +5711,26 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
goto out_ptl;

+ /* Handle userfault-wp first, before trying to lock more pages */
+ if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
+ (flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
+ struct vm_fault vmf = {
+ .vma = vma,
+ .address = haddr,
+ .real_address = address,
+ .flags = flags,
+ };
+
+ spin_unlock(ptl);
+ if (pagecache_page) {
+ unlock_page(pagecache_page);
+ put_page(pagecache_page);
+ }
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
+ return handle_userfault(&vmf, VM_UFFD_WP);
+ }
+
/*
* hugetlb_wp() requires page locks of pte_page(entry) and
* pagecache_page, so here we need take the former one
--
2.32.0
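
On the userspace side, arming wr-protection on hugetlb looks the same as for
anonymous memory; only the mapping differs. A minimal sketch (helper name made
up, error handling omitted; assumes 2M huge pages have been reserved, e.g. via
/proc/sys/vm/nr_hugepages):

    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/userfaultfd.h>

    #define HPAGE_2M (2UL << 20)

    static char *map_and_register_hugetlb(int uffd)
    {
        char *p = mmap(NULL, HPAGE_2M, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        struct uffdio_register reg = {
            .range = { .start = (unsigned long)p, .len = HPAGE_2M },
            .mode = UFFDIO_REGISTER_MODE_WP,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);
        return p;  /* write faults after UFFDIO_WRITEPROTECT get reported */
    }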

2022-04-05 03:44:30

by Peter Xu

[permalink] [raw]
Subject: [PATCH v8 14/23] mm/hugetlb: Handle UFFDIO_WRITEPROTECT

This starts by passing cp_flags into hugetlb_change_protection() so that
hugetlb will be able to handle MM_CP_UFFD_WP[_RESOLVE] requests.

huge_pte_clear_uffd_wp() is introduced to handle the case where
UFFDIO_WRITEPROTECT is requested upon migration entries of huge pages.

Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/hugetlb.h | 6 ++++--
mm/hugetlb.c | 13 ++++++++++++-
mm/mprotect.c | 3 ++-
mm/userfaultfd.c | 8 ++++++++
4 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 6347298778b6..38c5ac28b787 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -210,7 +210,8 @@ struct page *follow_huge_pgd(struct mm_struct *mm, unsigned long address,
int pmd_huge(pmd_t pmd);
int pud_huge(pud_t pud);
unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
- unsigned long address, unsigned long end, pgprot_t newprot);
+ unsigned long address, unsigned long end, pgprot_t newprot,
+ unsigned long cp_flags);

bool is_hugetlb_entry_migration(pte_t pte);
void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
@@ -391,7 +392,8 @@ static inline void move_hugetlb_state(struct page *oldpage,

static inline unsigned long hugetlb_change_protection(
struct vm_area_struct *vma, unsigned long address,
- unsigned long end, pgprot_t newprot)
+ unsigned long end, pgprot_t newprot,
+ unsigned long cp_flags)
{
return 0;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c94deead22b2..2401dd5997b7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6207,7 +6207,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
}

unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
- unsigned long address, unsigned long end, pgprot_t newprot)
+ unsigned long address, unsigned long end,
+ pgprot_t newprot, unsigned long cp_flags)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long start = address;
@@ -6217,6 +6218,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long pages = 0;
bool shared_pmd = false;
struct mmu_notifier_range range;
+ bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+ bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;

/*
* In the case of shared PMDs, the area to flush could be beyond
@@ -6263,6 +6266,10 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
entry = make_readable_migration_entry(
swp_offset(entry));
newpte = swp_entry_to_pte(entry);
+ if (uffd_wp)
+ newpte = pte_swp_mkuffd_wp(newpte);
+ else if (uffd_wp_resolve)
+ newpte = pte_swp_clear_uffd_wp(newpte);
set_huge_swap_pte_at(mm, address, ptep,
newpte, huge_page_size(h));
pages++;
@@ -6277,6 +6284,10 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
old_pte = huge_ptep_modify_prot_start(vma, address, ptep);
pte = huge_pte_modify(old_pte, newprot);
pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
+ if (uffd_wp)
+ pte = huge_pte_mkuffd_wp(huge_pte_wrprotect(pte));
+ else if (uffd_wp_resolve)
+ pte = huge_pte_clear_uffd_wp(pte);
huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
pages++;
}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index e0a567b66d07..6b0e8c213508 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -455,7 +455,8 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
BUG_ON((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL);

if (is_vm_hugetlb_page(vma))
- pages = hugetlb_change_protection(vma, start, end, newprot);
+ pages = hugetlb_change_protection(vma, start, end, newprot,
+ cp_flags);
else
pages = change_protection_range(vma, start, end, newprot,
cp_flags);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index da0b3ed2a6b5..58d67f2bf980 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -704,6 +704,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
atomic_t *mmap_changing)
{
struct vm_area_struct *dst_vma;
+ unsigned long page_mask;
pgprot_t newprot;
int err;

@@ -740,6 +741,13 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
if (!vma_is_anonymous(dst_vma))
goto out_unlock;

+ if (is_vm_hugetlb_page(dst_vma)) {
+ err = -EINVAL;
+ page_mask = vma_kernel_pagesize(dst_vma) - 1;
+ if ((start & page_mask) || (len & page_mask))
+ goto out_unlock;
+ }
+
if (enable_wp)
newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
else
--
2.32.0
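
Worth noting for users: the hunk in mwriteprotect_range() above makes
UFFDIO_WRITEPROTECT fail with -EINVAL on hugetlb vmas unless both start and
len are aligned to the vma's huge page size. A sketch of the same check from
the caller's side (illustrative helper name):

    #include <sys/ioctl.h>
    #include <linux/userfaultfd.h>

    /* For hugetlb vmas, start/len must be multiples of the huge page size */
    static int wp_hugetlb_range(int uffd, void *start, unsigned long len,
                                unsigned long hpage_sz)
    {
        if (((unsigned long)start | len) & (hpage_sz - 1))
            return -1;  /* the kernel would reject this with -EINVAL anyway */

        struct uffdio_writeprotect wp = {
            .range = { .start = (unsigned long)start, .len = len },
            .mode = UFFDIO_WRITEPROTECT_MODE_WP,
        };
        return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
    }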

2022-04-05 03:44:45

by Peter Xu

[permalink] [raw]
Subject: [PATCH v8 18/23] mm/hugetlb: Handle uffd-wp during fork()

Firstly, we'll need to pass dst_vma into copy_hugetlb_page_range(), because
for uffd-wp it's the dst vma that matters when deciding how we should treat
uffd-wp protected ptes.

We should recognize pte markers during fork and do the pte copy if needed.

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/hugetlb.h | 7 +++++--
mm/hugetlb.c | 42 +++++++++++++++++++++++++++--------------
mm/memory.c | 2 +-
3 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ab48b3bbb0e6..6df51d23b7ee 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -137,7 +137,8 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
struct vm_area_struct *new_vma,
unsigned long old_addr, unsigned long new_addr,
unsigned long len);
-int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
+int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
+ struct vm_area_struct *, struct vm_area_struct *);
long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
struct page **, struct vm_area_struct **,
unsigned long *, unsigned long *, long, unsigned int,
@@ -268,7 +269,9 @@ static inline struct page *follow_huge_addr(struct mm_struct *mm,
}

static inline int copy_hugetlb_page_range(struct mm_struct *dst,
- struct mm_struct *src, struct vm_area_struct *vma)
+ struct mm_struct *src,
+ struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma)
{
BUG();
return 0;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e4af8b357b90..e1571179698a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4706,23 +4706,24 @@ hugetlb_install_page(struct vm_area_struct *vma, pte_t *ptep, unsigned long addr
}

int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
- struct vm_area_struct *vma)
+ struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma)
{
pte_t *src_pte, *dst_pte, entry, dst_entry;
struct page *ptepage;
unsigned long addr;
- bool cow = is_cow_mapping(vma->vm_flags);
- struct hstate *h = hstate_vma(vma);
+ bool cow = is_cow_mapping(src_vma->vm_flags);
+ struct hstate *h = hstate_vma(src_vma);
unsigned long sz = huge_page_size(h);
unsigned long npages = pages_per_huge_page(h);
- struct address_space *mapping = vma->vm_file->f_mapping;
+ struct address_space *mapping = src_vma->vm_file->f_mapping;
struct mmu_notifier_range range;
int ret = 0;

if (cow) {
- mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, src,
- vma->vm_start,
- vma->vm_end);
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, src_vma, src,
+ src_vma->vm_start,
+ src_vma->vm_end);
mmu_notifier_invalidate_range_start(&range);
mmap_assert_write_locked(src);
raw_write_seqcount_begin(&src->write_protect_seq);
@@ -4736,12 +4737,12 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
i_mmap_lock_read(mapping);
}

- for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
+ for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
spinlock_t *src_ptl, *dst_ptl;
src_pte = huge_pte_offset(src, addr, sz);
if (!src_pte)
continue;
- dst_pte = huge_pte_alloc(dst, vma, addr, sz);
+ dst_pte = huge_pte_alloc(dst, dst_vma, addr, sz);
if (!dst_pte) {
ret = -ENOMEM;
break;
@@ -4776,6 +4777,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
} else if (unlikely(is_hugetlb_entry_migration(entry) ||
is_hugetlb_entry_hwpoisoned(entry))) {
swp_entry_t swp_entry = pte_to_swp_entry(entry);
+ bool uffd_wp = huge_pte_uffd_wp(entry);

if (!is_readable_migration_entry(swp_entry) && cow) {
/*
@@ -4785,10 +4787,21 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
swp_entry = make_readable_migration_entry(
swp_offset(swp_entry));
entry = swp_entry_to_pte(swp_entry);
+ if (userfaultfd_wp(src_vma) && uffd_wp)
+ entry = huge_pte_mkuffd_wp(entry);
set_huge_swap_pte_at(src, addr, src_pte,
entry, sz);
}
+ if (!userfaultfd_wp(dst_vma) && uffd_wp)
+ entry = huge_pte_clear_uffd_wp(entry);
set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
+ } else if (unlikely(is_pte_marker(entry))) {
+ /*
+ * We copy the pte marker only if the dst vma has
+ * uffd-wp enabled.
+ */
+ if (userfaultfd_wp(dst_vma))
+ set_huge_pte_at(dst, addr, dst_pte, entry);
} else {
entry = huge_ptep_get(src_pte);
ptepage = pte_page(entry);
@@ -4806,20 +4819,21 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
*/
if (!PageAnon(ptepage)) {
page_dup_file_rmap(ptepage, true);
- } else if (page_try_dup_anon_rmap(ptepage, true, vma)) {
+ } else if (page_try_dup_anon_rmap(ptepage, true,
+ src_vma)) {
pte_t src_pte_old = entry;
struct page *new;

spin_unlock(src_ptl);
spin_unlock(dst_ptl);
/* Do not use reserve as it's private owned */
- new = alloc_huge_page(vma, addr, 1);
+ new = alloc_huge_page(dst_vma, addr, 1);
if (IS_ERR(new)) {
put_page(ptepage);
ret = PTR_ERR(new);
break;
}
- copy_user_huge_page(new, ptepage, addr, vma,
+ copy_user_huge_page(new, ptepage, addr, dst_vma,
npages);
put_page(ptepage);

@@ -4829,13 +4843,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
entry = huge_ptep_get(src_pte);
if (!pte_same(src_pte_old, entry)) {
- restore_reserve_on_error(h, vma, addr,
+ restore_reserve_on_error(h, dst_vma, addr,
new);
put_page(new);
/* dst_entry won't change as in child */
goto again;
}
- hugetlb_install_page(vma, dst_pte, addr, new);
+ hugetlb_install_page(dst_vma, dst_pte, addr, new);
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
continue;
diff --git a/mm/memory.c b/mm/memory.c
index 9808edfe18d4..d1e9c2517dfb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1278,7 +1278,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
return 0;

if (is_vm_hugetlb_page(src_vma))
- return copy_hugetlb_page_range(dst_mm, src_mm, src_vma);
+ return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);

if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
/*
--
2.32.0
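
One user-visible consequence of the dst-vma rule: a plain fork() leaves the
child without any uffd context on its vmas, so the wp bits and markers are
deliberately not copied. A monitor that wants the protection to follow the
child needs UFFD_FEATURE_EVENT_FORK; a rough sketch of that handshake
(illustrative helper names, error handling omitted):

    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/userfaultfd.h>

    /* Handshake: ask for fork events so wp tracking can survive fork() */
    static void enable_fork_events(int uffd)
    {
        struct uffdio_api api = {
            .api = UFFD_API,
            .features = UFFD_FEATURE_EVENT_FORK,
        };
        ioctl(uffd, UFFDIO_API, &api);
    }

    /* In the monitor loop: a fork event carries a new uffd for the child mm */
    static int maybe_child_uffd(struct uffd_msg *msg)
    {
        if (msg->event == UFFD_EVENT_FORK)
            return (int)msg->arg.fork.ufd;  /* child vmas kept VM_UFFD_WP */
        return -1;
    }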

2022-04-05 03:44:46

by Peter Xu

[permalink] [raw]
Subject: [PATCH v8 19/23] mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered

When we're trying to collapse a 2M huge shmem page, don't retract the pgtable
pmd page if it's registered with uffd-wp, because that pgtable could have pte
markers installed. Recycling that pgtable means we'll lose the pte markers,
which could cause data loss for an uffd-wp enabled application on shmem.

Instead of disabling khugepaged on these files, simply skip retracting these
special VMAs; then the page cache can still be merged into a huge thp, and
other mms/vmas can still map the range of the file with a huge thp when
appropriate.

Note that checking VM_UFFD_WP needs to be done with mmap_sem held for write,
which avoids races like:

        khugepaged                    user thread
        ==========                    ===========
    check VM_UFFD_WP, not set
                                  UFFDIO_REGISTER with uffd-wp on shmem
                                  wr-protect some pages (install markers)
    take mmap_sem write lock
    erase pmd and free pmd page
      --> pte markers are dropped unnoticed!

Signed-off-by: Peter Xu <[email protected]>
---
mm/khugepaged.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 04a972259136..d7c5bb9fd1fb 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1464,6 +1464,10 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
return;

+ /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
+ if (userfaultfd_wp(vma))
+ return;
+
hpage = find_lock_page(vma->vm_file->f_mapping,
linear_page_index(vma, haddr));
if (!hpage)
@@ -1599,7 +1603,15 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
* reverse order. Trylock is a way to avoid deadlock.
*/
if (mmap_write_trylock(mm)) {
- if (!khugepaged_test_exit(mm))
+ /*
+ * When a vma is registered with uffd-wp, we can't
+ * recycle the pmd pgtable because there can be pte
+ * markers installed. Skip it only, so the rest mm/vma
+ * can still have the same file mapped hugely, however
+ * it'll always mapped in small page size for uffd-wp
+ * registered ranges.
+ */
+ if (!khugepaged_test_exit(mm) && !userfaultfd_wp(vma))
collapse_and_free_pmd(mm, vma, addr, pmd);
mmap_write_unlock(mm);
} else {
--
2.32.0

2022-04-05 03:44:48

by Peter Xu

[permalink] [raw]
Subject: [PATCH v8 21/23] mm/uffd: Enable write protection for shmem & hugetlbfs

We've had all the necessary changes ready for both shmem and hugetlbfs. Turn
on all the shmem/hugetlbfs switches for userfaultfd-wp.

We can expand UFFD_API_RANGE_IOCTLS_BASIC with _UFFDIO_WRITEPROTECT too because
all existing types now support write protection mode.

Since vma_can_userfault() will be used elsewhere, move it into userfaultfd_k.h.

Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 21 +++------------------
include/linux/userfaultfd_k.h | 20 ++++++++++++++++++++
include/uapi/linux/userfaultfd.h | 10 ++++++++--
mm/userfaultfd.c | 9 +++------
4 files changed, 34 insertions(+), 26 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 8b4a94f5a238..fb45522a2b44 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1257,24 +1257,6 @@ static __always_inline int validate_range(struct mm_struct *mm,
return 0;
}

-static inline bool vma_can_userfault(struct vm_area_struct *vma,
- unsigned long vm_flags)
-{
- /* FIXME: add WP support to hugetlbfs and shmem */
- if (vm_flags & VM_UFFD_WP) {
- if (is_vm_hugetlb_page(vma) || vma_is_shmem(vma))
- return false;
- }
-
- if (vm_flags & VM_UFFD_MINOR) {
- if (!(is_vm_hugetlb_page(vma) || vma_is_shmem(vma)))
- return false;
- }
-
- return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
- vma_is_shmem(vma);
-}
-
static int userfaultfd_register(struct userfaultfd_ctx *ctx,
unsigned long arg)
{
@@ -1955,6 +1937,9 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
#endif
#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
+#endif
+#ifndef CONFIG_PTE_MARKER_UFFD_WP
+ uffdio_api.features &= ~UFFD_FEATURE_WP_HUGETLBFS_SHMEM;
#endif
uffdio_api.ioctls = UFFD_API_IOCTLS;
ret = -EFAULT;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 827e38b7be65..ea11bed9bb7e 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -18,6 +18,7 @@
#include <linux/swap.h>
#include <linux/swapops.h>
#include <asm-generic/pgtable_uffd.h>
+#include <linux/hugetlb_inline.h>

/* The set of all possible UFFD-related VM flags. */
#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR)
@@ -140,6 +141,25 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
return vma->vm_flags & __VM_UFFD_FLAGS;
}

+static inline bool vma_can_userfault(struct vm_area_struct *vma,
+ unsigned long vm_flags)
+{
+ if (vm_flags & VM_UFFD_MINOR)
+ return is_vm_hugetlb_page(vma) || vma_is_shmem(vma);
+
+#ifndef CONFIG_PTE_MARKER_UFFD_WP
+ /*
+ * If user requested uffd-wp but not enabled pte markers for
+ * uffd-wp, then shmem & hugetlbfs are not supported but only
+ * anonymous.
+ */
+ if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
+ return false;
+#endif
+ return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
+ vma_is_shmem(vma);
+}
+
extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
extern void dup_userfaultfd_complete(struct list_head *);

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index ef739054cb1c..7d32b1e797fb 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -33,7 +33,8 @@
UFFD_FEATURE_THREAD_ID | \
UFFD_FEATURE_MINOR_HUGETLBFS | \
UFFD_FEATURE_MINOR_SHMEM | \
- UFFD_FEATURE_EXACT_ADDRESS)
+ UFFD_FEATURE_EXACT_ADDRESS | \
+ UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
#define UFFD_API_IOCTLS \
((__u64)1 << _UFFDIO_REGISTER | \
(__u64)1 << _UFFDIO_UNREGISTER | \
@@ -47,7 +48,8 @@
#define UFFD_API_RANGE_IOCTLS_BASIC \
((__u64)1 << _UFFDIO_WAKE | \
(__u64)1 << _UFFDIO_COPY | \
- (__u64)1 << _UFFDIO_CONTINUE)
+ (__u64)1 << _UFFDIO_CONTINUE | \
+ (__u64)1 << _UFFDIO_WRITEPROTECT)

/*
* Valid ioctl command number range with this API is from 0x00 to
@@ -194,6 +196,9 @@ struct uffdio_api {
* UFFD_FEATURE_EXACT_ADDRESS indicates that the exact address of page
* faults would be provided and the offset within the page would not be
* masked.
+ *
+ * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd
+ * write-protection mode is supported on both shmem and hugetlbfs.
*/
#define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
#define UFFD_FEATURE_EVENT_FORK (1<<1)
@@ -207,6 +212,7 @@ struct uffdio_api {
#define UFFD_FEATURE_MINOR_HUGETLBFS (1<<9)
#define UFFD_FEATURE_MINOR_SHMEM (1<<10)
#define UFFD_FEATURE_EXACT_ADDRESS (1<<11)
+#define UFFD_FEATURE_WP_HUGETLBFS_SHMEM (1<<12)
__u64 features;

__u64 ioctls;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 58d67f2bf980..156e9bdf9f23 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -730,15 +730,12 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,

err = -ENOENT;
dst_vma = find_dst_vma(dst_mm, start, len);
- /*
- * Make sure the vma is not shared, that the dst range is
- * both valid and fully within a single existing vma.
- */
- if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+
+ if (!dst_vma)
goto out_unlock;
if (!userfaultfd_wp(dst_vma))
goto out_unlock;
- if (!vma_is_anonymous(dst_vma))
+ if (!vma_can_userfault(dst_vma, dst_vma->vm_flags))
goto out_unlock;

if (is_vm_hugetlb_page(dst_vma)) {
--
2.32.0
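
With the feature bit in place, applications can probe at runtime whether the
running kernel supports uffd-wp on shmem/hugetlbfs before relying on it. A
minimal probe sketch (illustrative helper name, error handling trimmed):

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/userfaultfd.h>

    static int have_wp_shmem_hugetlb(void)
    {
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = { .api = UFFD_API };
        int ret = 0;

        if (uffd >= 0 && !ioctl(uffd, UFFDIO_API, &api))
            /* Cleared by the kernel when CONFIG_PTE_MARKER_UFFD_WP is unset */
            ret = !!(api.features & UFFD_FEATURE_WP_HUGETLBFS_SHMEM);
        if (uffd >= 0)
            close(uffd);
        return ret;
    }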

2022-04-05 03:45:02

by Peter Xu

[permalink] [raw]
Subject: [PATCH v8 15/23] mm/hugetlb: Handle pte markers in page faults

Allow the hugetlb code to handle pte markers just like none ptes. It's mostly
there; we just need to make sure we don't assume hugetlb_no_page() only handles
none ptes, so when detecting a pte change we should use pte_same() rather than
pte_none(). We need to pass in the old_pte to do the comparison.

Check the original pte to see whether it's a pte marker; if it is, we should
recover the uffd-wp bit on the new pte to be installed, so that the next write
will be trapped by uffd.

Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2401dd5997b7..9317b790161d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5412,7 +5412,8 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
struct vm_area_struct *vma,
struct address_space *mapping, pgoff_t idx,
- unsigned long address, pte_t *ptep, unsigned int flags)
+ unsigned long address, pte_t *ptep,
+ pte_t old_pte, unsigned int flags)
{
struct hstate *h = hstate_vma(vma);
vm_fault_t ret = VM_FAULT_SIGBUS;
@@ -5539,7 +5540,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,

ptl = huge_pte_lock(h, mm, ptep);
ret = 0;
- if (!huge_pte_none(huge_ptep_get(ptep)))
+ /* If pte changed from under us, retry */
+ if (!pte_same(huge_ptep_get(ptep), old_pte))
goto backout;

if (anon_rmap) {
@@ -5549,6 +5551,12 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
page_dup_file_rmap(page, true);
new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
&& (vma->vm_flags & VM_SHARED)));
+ /*
+ * If this pte was previously wr-protected, keep it wr-protected even
+ * if populated.
+ */
+ if (unlikely(pte_marker_uffd_wp(old_pte)))
+ new_pte = huge_pte_wrprotect(huge_pte_mkuffd_wp(new_pte));
set_huge_pte_at(mm, haddr, ptep, new_pte);

hugetlb_count_add(pages_per_huge_page(h), mm);
@@ -5666,8 +5674,10 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
mutex_lock(&hugetlb_fault_mutex_table[hash]);

entry = huge_ptep_get(ptep);
- if (huge_pte_none(entry)) {
- ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
+ /* PTE markers should be handled the same way as none pte */
+ if (huge_pte_none_mostly(entry)) {
+ ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+ entry, flags);
goto out_mutex;
}

--
2.32.0

2022-04-05 03:45:03

by Peter Xu

[permalink] [raw]
Subject: [PATCH v8 22/23] mm: Enable PTE markers by default

Enable PTE markers by default. On x86_64 it means it'll auto-enable
PTE_MARKER_UFFD_WP as well.

Signed-off-by: Peter Xu <[email protected]>
---
mm/Kconfig | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 6e7c2d59fa96..3eca34c864c5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -911,12 +911,14 @@ config ANON_VMA_NAME

config PTE_MARKER
bool "Marker PTEs support"
+ default y

help
Allows to create marker PTEs for file-backed memory.

config PTE_MARKER_UFFD_WP
bool "Marker PTEs support for userfaultfd write protection"
+ default y
depends on PTE_MARKER && HAVE_ARCH_USERFAULTFD_WP

help
--
2.32.0
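
For a quick sanity check on a built kernel, something like
"grep PTE_MARKER /boot/config-$(uname -r)" (or zgrep on /proc/config.gz where
available) should show both CONFIG_PTE_MARKER=y and CONFIG_PTE_MARKER_UFFD_WP=y
on x86_64 with the defaults above.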

2022-04-05 03:45:10

by Peter Xu

[permalink] [raw]
Subject: [PATCH v8 17/23] mm/hugetlb: Only drop uffd-wp special pte if required

As with shmem uffd-wp special ptes, only drop the uffd-wp special swap pte if
unmapping an entire vma or when synchronized such that faults cannot race with
the unmap operation. This requires passing zap_flags all the way to the lowest
level hugetlb unmap routine: __unmap_hugepage_range.

In general, unmap calls originating in hugetlbfs code will pass the
ZAP_FLAG_DROP_MARKER flag, as synchronization is in place to prevent faults.
The exception is hole punch, which will first unmap without any synchronization.
Later, when hole punch actually removes the page from the file, it will check to
see if there was a subsequent fault and, if so, take the hugetlb fault mutex
while unmapping again. This second unmap will pass in ZAP_FLAG_DROP_MARKER.

The justification for "whether to apply the ZAP_FLAG_DROP_MARKER flag when
unmapping a hugetlb range" is (IMHO): we should never reach a state where a
page fault could erroneously fault in a writable page-cache page that was
wr-protected, even for an extremely short period. That could happen if e.g. we
passed ZAP_FLAG_DROP_MARKER when hugetlbfs_punch_hole() calls
hugetlb_vmdelete_list(), because if a page fault happens after that call and
before remove_inode_hugepages() is executed, the page cache can be mapped
writable again in the small racy window, which can cause unexpected data to be
overwritten.

Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
fs/hugetlbfs/inode.c | 15 +++++++++------
include/linux/hugetlb.h | 8 +++++---
mm/hugetlb.c | 33 +++++++++++++++++++++++++--------
mm/memory.c | 5 ++++-
4 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 99c7477cee5c..8b5b9df2be7d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -404,7 +404,8 @@ static void remove_huge_page(struct page *page)
}

static void
-hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
+hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
+ unsigned long zap_flags)
{
struct vm_area_struct *vma;

@@ -438,7 +439,7 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
}

unmap_hugepage_range(vma, vma->vm_start + v_offset, v_end,
- NULL);
+ NULL, zap_flags);
}
}

@@ -516,7 +517,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
mutex_lock(&hugetlb_fault_mutex_table[hash]);
hugetlb_vmdelete_list(&mapping->i_mmap,
index * pages_per_huge_page(h),
- (index + 1) * pages_per_huge_page(h));
+ (index + 1) * pages_per_huge_page(h),
+ ZAP_FLAG_DROP_MARKER);
i_mmap_unlock_write(mapping);
}

@@ -582,7 +584,8 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
i_mmap_lock_write(mapping);
i_size_write(inode, offset);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
- hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
+ hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
+ ZAP_FLAG_DROP_MARKER);
i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, offset, LLONG_MAX);
}
@@ -615,8 +618,8 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
hugetlb_vmdelete_list(&mapping->i_mmap,
- hole_start >> PAGE_SHIFT,
- hole_end >> PAGE_SHIFT);
+ hole_start >> PAGE_SHIFT,
+ hole_end >> PAGE_SHIFT, 0);
i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, hole_start, hole_end);
inode_unlock(inode);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 38c5ac28b787..ab48b3bbb0e6 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -143,11 +143,12 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
unsigned long *, unsigned long *, long, unsigned int,
int *);
void unmap_hugepage_range(struct vm_area_struct *,
- unsigned long, unsigned long, struct page *);
+ unsigned long, unsigned long, struct page *,
+ unsigned long);
void __unmap_hugepage_range_final(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long start, unsigned long end,
- struct page *ref_page);
+ struct page *ref_page, unsigned long zap_flags);
void hugetlb_report_meminfo(struct seq_file *);
int hugetlb_report_node_meminfo(char *buf, int len, int nid);
void hugetlb_show_meminfo(void);
@@ -400,7 +401,8 @@ static inline unsigned long hugetlb_change_protection(

static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page)
+ unsigned long end, struct page *ref_page,
+ unsigned long zap_flags)
{
BUG();
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 578c48ef931a..e4af8b357b90 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4947,7 +4947,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,

static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
unsigned long start, unsigned long end,
- struct page *ref_page)
+ struct page *ref_page, unsigned long zap_flags)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -5003,7 +5003,18 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
* unmapped and its refcount is dropped, so just clear pte here.
*/
if (unlikely(!pte_present(pte))) {
- huge_pte_clear(mm, address, ptep, sz);
+ /*
+ * If the pte was wr-protected by uffd-wp in any of the
+ * swap forms, meanwhile the caller does not want to
+ * drop the uffd-wp bit in this zap, then replace the
+ * pte with a marker.
+ */
+ if (pte_swp_uffd_wp_any(pte) &&
+ !(zap_flags & ZAP_FLAG_DROP_MARKER))
+ set_huge_pte_at(mm, address, ptep,
+ make_pte_marker(PTE_MARKER_UFFD_WP));
+ else
+ huge_pte_clear(mm, address, ptep, sz);
spin_unlock(ptl);
continue;
}
@@ -5031,7 +5042,11 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
tlb_remove_huge_tlb_entry(h, tlb, ptep, address);
if (huge_pte_dirty(pte))
set_page_dirty(page);
-
+ /* Leave a uffd-wp pte marker if needed */
+ if (huge_pte_uffd_wp(pte) &&
+ !(zap_flags & ZAP_FLAG_DROP_MARKER))
+ set_huge_pte_at(mm, address, ptep,
+ make_pte_marker(PTE_MARKER_UFFD_WP));
hugetlb_count_sub(pages_per_huge_page(h), mm);
page_remove_rmap(page, vma, true);

@@ -5065,9 +5080,10 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct

void __unmap_hugepage_range_final(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page)
+ unsigned long end, struct page *ref_page,
+ unsigned long zap_flags)
{
- __unmap_hugepage_range(tlb, vma, start, end, ref_page);
+ __unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);

/*
* Clear this flag so that x86's huge_pmd_share page_table_shareable
@@ -5083,12 +5099,13 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
}

void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page)
+ unsigned long end, struct page *ref_page,
+ unsigned long zap_flags)
{
struct mmu_gather tlb;

tlb_gather_mmu(&tlb, vma->vm_mm);
- __unmap_hugepage_range(&tlb, vma, start, end, ref_page);
+ __unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);
tlb_finish_mmu(&tlb);
}

@@ -5143,7 +5160,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
*/
if (!is_vma_resv_set(iter_vma, HPAGE_RESV_OWNER))
unmap_hugepage_range(iter_vma, address,
- address + huge_page_size(h), page);
+ address + huge_page_size(h), page, 0);
}
i_mmap_unlock_write(mapping);
}
diff --git a/mm/memory.c b/mm/memory.c
index 8ba1bb196095..9808edfe18d4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1675,8 +1675,11 @@ static void unmap_single_vma(struct mmu_gather *tlb,
* safe to do nothing in this case.
*/
if (vma->vm_file) {
+ unsigned long zap_flags = details ?
+ details->zap_flags : 0;
i_mmap_lock_write(vma->vm_file->f_mapping);
- __unmap_hugepage_range_final(tlb, vma, start, end, NULL);
+ __unmap_hugepage_range_final(tlb, vma, start, end,
+ NULL, zap_flags);
i_mmap_unlock_write(vma->vm_file->f_mapping);
}
} else
--
2.32.0
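
The user-visible rule that falls out of this: truncation and hole punching
drop the wr-protection together with the page, so after a hole punch the next
write faults in a fresh page rather than raising a WP message (assuming the
range is registered with WP mode only, not MISSING). A sketch with an
illustrative helper name:

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* fd: hugetlbfs file mapped at p, with uffd-wp armed on the first page */
    static void punch_drops_wp(int fd, char *p, unsigned long hpage_sz)
    {
        /* The final unmap passes ZAP_FLAG_DROP_MARKER: the marker goes too */
        fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, hpage_sz);

        /* Faults in a brand new page; not reported as a WP fault */
        p[0] = 1;
    }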

2022-04-05 03:45:14

by Peter Xu

[permalink] [raw]
Subject: [PATCH v8 23/23] selftests/uffd: Enable uffd-wp for shmem/hugetlbfs

Now that we have added support for shmem and hugetlbfs, we can always turn the
uffd-wp test on.

Signed-off-by: Peter Xu <[email protected]>
---
tools/testing/selftests/vm/userfaultfd.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 92a4516f8f0d..bbc4a6d8cf7b 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -82,7 +82,7 @@ static int test_type;
static volatile bool test_uffdio_copy_eexist = true;
static volatile bool test_uffdio_zeropage_eexist = true;
/* Whether to test uffd write-protection */
-static bool test_uffdio_wp = false;
+static bool test_uffdio_wp = true;
/* Whether to test uffd minor faults */
static bool test_uffdio_minor = false;

@@ -1594,8 +1594,6 @@ static void set_test_type(const char *type)
if (!strcmp(type, "anon")) {
test_type = TEST_ANON;
uffd_test_ops = &anon_uffd_test_ops;
- /* Only enable write-protect test for anonymous test */
- test_uffdio_wp = true;
} else if (!strcmp(type, "hugetlb")) {
test_type = TEST_HUGETLB;
uffd_test_ops = &hugetlb_uffd_test_ops;
--
2.32.0
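
With this, the wp rounds run for all three memory types; the invocations look
roughly like the following (sizes and bounce counts are just examples, and the
hugetlb run assumes enough huge pages are reserved beforehand):

    ./userfaultfd anon 20 16
    ./userfaultfd shmem 20 16
    ./userfaultfd hugetlb 20 16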

2022-04-05 03:45:24

by Peter Xu

[permalink] [raw]
Subject: [PATCH v8 20/23] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

This requires the pagemap code to be able to recognize the newly introduced
swap special pte for uffd-wp, as well as the general hugetlb case that we
recently started to support. It should make pagemap uffd-wp support complete.

Signed-off-by: Peter Xu <[email protected]>
---
fs/proc/task_mmu.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f46060eb91b5..194dfd7abf2b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1421,6 +1421,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
migration = is_migration_entry(entry);
if (is_pfn_swap_entry(entry))
page = pfn_swap_entry_to_page(entry);
+ if (pte_marker_entry_uffd_wp(entry))
+ flags |= PM_UFFD_WP;
}

if (page && !PageAnon(page))
@@ -1556,10 +1558,15 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
if (page_mapcount(page) == 1)
flags |= PM_MMAP_EXCLUSIVE;

+ if (huge_pte_uffd_wp(pte))
+ flags |= PM_UFFD_WP;
+
flags |= PM_PRESENT;
if (pm->show_pfn)
frame = pte_pfn(pte) +
((addr & ~hmask) >> PAGE_SHIFT);
+ } else if (pte_swp_uffd_wp_any(pte)) {
+ flags |= PM_UFFD_WP;
}

for (; addr != end; addr += PAGE_SIZE) {
--
2.32.0
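
With this, userspace can double-check the protection state of any address by
reading its pagemap entry; the uffd-wp state is reported in bit 57 of the
64-bit entry (see Documentation/admin-guide/mm/pagemap.rst). A small sketch
(illustrative helper name, error handling trimmed):

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define PM_UFFD_WP_BIT (1ULL << 57)

    static int addr_is_uffd_wp(void *addr, unsigned long page_size)
    {
        uint64_t ent = 0;
        int fd = open("/proc/self/pagemap", O_RDONLY);
        off_t off = ((uintptr_t)addr / page_size) * sizeof(ent);

        if (fd >= 0) {
            pread(fd, &ent, sizeof(ent), off);
            close(fd);
        }
        return !!(ent & PM_UFFD_WP_BIT);
    }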

2022-04-06 08:52:36

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs

On Tue, Apr 05, 2022 at 03:49:12PM -0700, Andrew Morton wrote:
> On Tue, 5 Apr 2022 18:42:43 -0400 Peter Xu <[email protected]> wrote:
>
> > On Tue, Apr 05, 2022 at 03:16:16PM -0700, Andrew Morton wrote:
> > > On Mon, 4 Apr 2022 21:46:23 -0400 Peter Xu <[email protected]> wrote:
> > >
> > > > This is v8 of the series to add shmem+hugetlbfs support for userfaultfd
> > > > write protection.
> > >
> > > Various compilation catastrophes with x86_64 allnoconfig. I poked at
> > > the include ordering for a while but other things quickly became more
> > > attractive ;)
> >
> > Sorry about that. I still don't know what's the problem, but I'll give it
> > a shot soon.
> >
> > I think I only tried out with the new configs but not all the rest configs.
> > I thought there're some bot looking after that one, from which I used to
> > receive build reports. And IIRC I fixed some build issues in early versions
> > from those reports. Maybe I was wrong..
> >
> > Any more hints on the latter?
>
> `make allnoconfig'?
>

Ah! I thought when you mentioned "other things" you meant there're other
more severe issues... :)

For the allnoconfig, could you try with the attached quick fixup (upon
patch "mm/uffd: PTE_MARKER_UFFD_WP")?

That works for me on x86/arm, but I'm still trying out some other configs.

Thanks,

--
Peter Xu


Attachments:
0001-fixup-mm-uffd-PTE_MARKER_UFFD_WP.patch (921 bytes)

2022-04-06 11:43:11

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs

On Tue, 5 Apr 2022 18:42:43 -0400 Peter Xu <[email protected]> wrote:

> On Tue, Apr 05, 2022 at 03:16:16PM -0700, Andrew Morton wrote:
> > On Mon, 4 Apr 2022 21:46:23 -0400 Peter Xu <[email protected]> wrote:
> >
> > > This is v8 of the series to add shmem+hugetlbfs support for userfaultfd
> > > write protection.
> >
> > Various compilation catastrophes with x86_64 allnoconfig. I poked at
> > the include ordering for a while but other things quickly became more
> > attractive ;)
>
> Sorry about that. I still don't know what's the problem, but I'll give it
> a shot soon.
>
> I think I only tried out with the new configs but not all the rest configs.
> I thought there're some bot looking after that one, from which I used to
> receive build reports. And IIRC I fixed some build issues in early versions
> from those reports. Maybe I was wrong..
>
> Any more hints on the latter?

`make allnoconfig'?

2022-04-06 11:51:57

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs

On Mon, 4 Apr 2022 21:46:23 -0400 Peter Xu <[email protected]> wrote:

> This is v8 of the series to add shmem+hugetlbfs support for userfaultfd
> write protection.

Various compilation catastrophes with x86_64 allnoconfig. I poked at
the include ordering for a while but other things quickly became more
attractive ;)

2022-04-06 12:08:13

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs

On Tue, Apr 05, 2022 at 03:16:16PM -0700, Andrew Morton wrote:
> On Mon, 4 Apr 2022 21:46:23 -0400 Peter Xu <[email protected]> wrote:
>
> > This is v8 of the series to add shmem+hugetlbfs support for userfaultfd
> > write protection.
>
> Various compilation catastrophes with x86_64 allnoconfig. I poked at
> the include ordering for a while but other things quickly became more
> attractive ;)

Sorry about that. I still don't know what's the problem, but I'll give it
a shot soon.

I think I only tried out with the new configs but not all the rest configs.
I thought there're some bot looking after that one, from which I used to
receive build reports. And IIRC I fixed some build issues in early versions
from those reports. Maybe I was wrong..

Any more hints on the latter?

Thanks,

--
Peter Xu

2022-04-06 12:15:18

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs

On Tue, 5 Apr 2022 19:02:32 -0400 Peter Xu <[email protected]> wrote:

> For the allnoconfig, could you try with the attached quick fixup (upon
> patch "mm/uffd: PTE_MARKER_UFFD_WP")?

Works for me, thanks.

2022-04-06 14:48:33

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v8 10/23] mm/shmem: Handle uffd-wp during fork()

Hi Peter,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on hnaz-mm/master]
[cannot apply to arnd-asm-generic/master linus/master linux/master v5.18-rc1 next-20220405]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/intel-lab-lkp/linux/commits/Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20220405-100136
base: https://github.com/hnaz/linux-mm master
config: ia64-buildonly-randconfig-r005-20220405 (https://download.01.org/0day-ci/archive/20220406/[email protected]/config)
compiler: ia64-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/49e56558a1f453907d2813e1ba94d91f9d102e73
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20220405-100136
git checkout 49e56558a1f453907d2813e1ba94d91f9d102e73
# save the config file to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=ia64 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

In file included from arch/ia64/include/asm/pgtable.h:153,
from include/linux/pgtable.h:6,
from arch/ia64/include/asm/uaccess.h:40,
from include/linux/uaccess.h:11,
from arch/ia64/include/asm/sections.h:11,
from include/linux/interrupt.h:21,
from include/linux/kernel_stat.h:9,
from mm/memory.c:42:
arch/ia64/include/asm/mmu_context.h: In function 'reload_context':
arch/ia64/include/asm/mmu_context.h:127:48: warning: variable 'old_rr4' set but not used [-Wunused-but-set-variable]
127 | unsigned long rr0, rr1, rr2, rr3, rr4, old_rr4;
| ^~~~~~~
In file included from include/linux/mm_inline.h:9,
from mm/memory.c:44:
include/linux/userfaultfd_k.h: In function 'pte_marker_entry_uffd_wp':
include/linux/userfaultfd_k.h:260:16: error: implicit declaration of function 'is_pte_marker_entry' [-Werror=implicit-function-declaration]
260 | return is_pte_marker_entry(entry) &&
| ^~~~~~~~~~~~~~~~~~~
include/linux/userfaultfd_k.h:261:14: error: implicit declaration of function 'pte_marker_get' [-Werror=implicit-function-declaration]
261 | (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
| ^~~~~~~~~~~~~~
include/linux/userfaultfd_k.h:261:38: error: 'PTE_MARKER_UFFD_WP' undeclared (first use in this function)
261 | (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
| ^~~~~~~~~~~~~~~~~~
include/linux/userfaultfd_k.h:261:38: note: each undeclared identifier is reported only once for each function it appears in
In file included from include/linux/mm_inline.h:10,
from mm/memory.c:44:
include/linux/swapops.h: At top level:
include/linux/swapops.h:289:20: error: conflicting types for 'is_pte_marker_entry'; have 'bool(swp_entry_t)' {aka '_Bool(swp_entry_t)'}
289 | static inline bool is_pte_marker_entry(swp_entry_t entry)
| ^~~~~~~~~~~~~~~~~~~
In file included from include/linux/mm_inline.h:9,
from mm/memory.c:44:
include/linux/userfaultfd_k.h:260:16: note: previous implicit declaration of 'is_pte_marker_entry' with type 'int()'
260 | return is_pte_marker_entry(entry) &&
| ^~~~~~~~~~~~~~~~~~~
In file included from include/linux/mm_inline.h:10,
from mm/memory.c:44:
include/linux/swapops.h:294:26: error: conflicting types for 'pte_marker_get'; have 'pte_marker(swp_entry_t)' {aka 'long unsigned int(swp_entry_t)'}
294 | static inline pte_marker pte_marker_get(swp_entry_t entry)
| ^~~~~~~~~~~~~~
In file included from include/linux/mm_inline.h:9,
from mm/memory.c:44:
include/linux/userfaultfd_k.h:261:14: note: previous implicit declaration of 'pte_marker_get' with type 'int()'
261 | (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
| ^~~~~~~~~~~~~~
>> mm/memory.c:1238:1: warning: no previous prototype for 'vma_needs_copy' [-Wmissing-prototypes]
1238 | vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
| ^~~~~~~~~~~~~~
In file included from include/linux/mm_inline.h:9,
from mm/memory.c:44:
include/linux/userfaultfd_k.h: In function 'pte_marker_entry_uffd_wp':
include/linux/userfaultfd_k.h:262:1: error: control reaches end of non-void function [-Werror=return-type]
262 | }
| ^
cc1: some warnings being treated as errors


vim +/vma_needs_copy +1238 mm/memory.c

  1231
  1232  /*
  1233   * Return true if the vma needs to copy the pgtable during this fork(). Return
  1234   * false when we can speed up fork() by allowing lazy page faults later until
  1235   * when the child accesses the memory range.
  1236   */
  1237  bool
> 1238  vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
  1239  {
  1240          /*
  1241           * Always copy pgtables when dst_vma has uffd-wp enabled even if it's
  1242           * file-backed (e.g. shmem). Because when uffd-wp is enabled, pgtable
  1243           * contains uffd-wp protection information, that's something we can't
  1244           * retrieve from page cache, and skip copying will lose those info.
  1245           */
  1246          if (userfaultfd_wp(dst_vma))
  1247                  return true;
  1248
  1249          if (src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP))
  1250                  return true;
  1251
  1252          if (src_vma->anon_vma)
  1253                  return true;
  1254
  1255          /*
  1256           * Don't copy ptes where a page fault will fill them correctly. Fork
  1257           * becomes much lighter when there are big shared or private readonly
  1258           * mappings. The tradeoff is that copy_page_range is more efficient
  1259           * than faulting.
  1260           */
  1261          return false;
  1262  }
  1263

--
0-DAY CI Kernel Test Service
https://01.org/lkp

2022-04-06 16:19:42

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v8 10/23] mm/shmem: Handle uffd-wp during fork()

I assume the report below is the same issue as the allnoconfig one Andrew
reported; IOW, after the fixup is squashed, this report along with the other
one on patch 4 should go away. Let me know otherwise..

Thanks,

On Wed, Apr 06, 2022 at 02:16:56PM +0800, kernel test robot wrote:
> All warnings (new ones prefixed by >>):
>
> In file included from arch/ia64/include/asm/pgtable.h:153,
> from include/linux/pgtable.h:6,
> from arch/ia64/include/asm/uaccess.h:40,
> from include/linux/uaccess.h:11,
> from arch/ia64/include/asm/sections.h:11,
> from include/linux/interrupt.h:21,
> from include/linux/kernel_stat.h:9,
> from mm/memory.c:42:
> arch/ia64/include/asm/mmu_context.h: In function 'reload_context':
> arch/ia64/include/asm/mmu_context.h:127:48: warning: variable 'old_rr4' set but not used [-Wunused-but-set-variable]
> 127 | unsigned long rr0, rr1, rr2, rr3, rr4, old_rr4;
> | ^~~~~~~
> In file included from include/linux/mm_inline.h:9,
> from mm/memory.c:44:
> include/linux/userfaultfd_k.h: In function 'pte_marker_entry_uffd_wp':
> include/linux/userfaultfd_k.h:260:16: error: implicit declaration of function 'is_pte_marker_entry' [-Werror=implicit-function-declaration]
> 260 | return is_pte_marker_entry(entry) &&
> | ^~~~~~~~~~~~~~~~~~~
> include/linux/userfaultfd_k.h:261:14: error: implicit declaration of function 'pte_marker_get' [-Werror=implicit-function-declaration]
> 261 | (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
> | ^~~~~~~~~~~~~~
> include/linux/userfaultfd_k.h:261:38: error: 'PTE_MARKER_UFFD_WP' undeclared (first use in this function)
> 261 | (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
> | ^~~~~~~~~~~~~~~~~~
> include/linux/userfaultfd_k.h:261:38: note: each undeclared identifier is reported only once for each function it appears in
> In file included from include/linux/mm_inline.h:10,
> from mm/memory.c:44:
> include/linux/swapops.h: At top level:
> include/linux/swapops.h:289:20: error: conflicting types for 'is_pte_marker_entry'; have 'bool(swp_entry_t)' {aka '_Bool(swp_entry_t)'}
> 289 | static inline bool is_pte_marker_entry(swp_entry_t entry)
> | ^~~~~~~~~~~~~~~~~~~
> In file included from include/linux/mm_inline.h:9,
> from mm/memory.c:44:
> include/linux/userfaultfd_k.h:260:16: note: previous implicit declaration of 'is_pte_marker_entry' with type 'int()'
> 260 | return is_pte_marker_entry(entry) &&
> | ^~~~~~~~~~~~~~~~~~~
> In file included from include/linux/mm_inline.h:10,
> from mm/memory.c:44:
> include/linux/swapops.h:294:26: error: conflicting types for 'pte_marker_get'; have 'pte_marker(swp_entry_t)' {aka 'long unsigned int(swp_entry_t)'}
> 294 | static inline pte_marker pte_marker_get(swp_entry_t entry)
> | ^~~~~~~~~~~~~~
> In file included from include/linux/mm_inline.h:9,
> from mm/memory.c:44:
> include/linux/userfaultfd_k.h:261:14: note: previous implicit declaration of 'pte_marker_get' with type 'int()'
> 261 | (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
> | ^~~~~~~~~~~~~~

--
Peter Xu

2022-04-06 16:41:04

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v8 04/23] mm/uffd: PTE_MARKER_UFFD_WP

Hi Peter,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on hnaz-mm/master]
[also build test ERROR on arnd-asm-generic/master linus/master v5.18-rc1 next-20220405]
[cannot apply to linux/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/intel-lab-lkp/linux/commits/Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20220405-100136
base: https://github.com/hnaz/linux-mm master
config: ia64-buildonly-randconfig-r005-20220405 (https://download.01.org/0day-ci/archive/20220406/[email protected]/config)
compiler: ia64-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/5baea0f03d347e5b13fff03af297858f1247d51a
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20220405-100136
git checkout 5baea0f03d347e5b13fff03af297858f1247d51a
# save the config file to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=ia64 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

In file included from arch/ia64/include/asm/pgtable.h:153,
from include/linux/pgtable.h:6,
from arch/ia64/include/asm/uaccess.h:40,
from include/linux/uaccess.h:11,
from include/linux/sched/task.h:11,
from include/linux/sched/signal.h:9,
from include/linux/rcuwait.h:6,
from include/linux/percpu-rwsem.h:7,
from include/linux/fs.h:33,
from arch/ia64/kernel/sys_ia64.c:10:
arch/ia64/include/asm/mmu_context.h: In function 'reload_context':
arch/ia64/include/asm/mmu_context.h:127:48: warning: variable 'old_rr4' set but not used [-Wunused-but-set-variable]
127 | unsigned long rr0, rr1, rr2, rr3, rr4, old_rr4;
| ^~~~~~~
In file included from include/linux/hugetlb.h:14,
from arch/ia64/kernel/sys_ia64.c:21:
include/linux/userfaultfd_k.h: In function 'pte_marker_entry_uffd_wp':
>> include/linux/userfaultfd_k.h:243:16: error: implicit declaration of function 'is_pte_marker_entry' [-Werror=implicit-function-declaration]
243 | return is_pte_marker_entry(entry) &&
| ^~~~~~~~~~~~~~~~~~~
>> include/linux/userfaultfd_k.h:244:14: error: implicit declaration of function 'pte_marker_get' [-Werror=implicit-function-declaration]
244 | (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
| ^~~~~~~~~~~~~~
>> include/linux/userfaultfd_k.h:244:38: error: 'PTE_MARKER_UFFD_WP' undeclared (first use in this function)
244 | (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
| ^~~~~~~~~~~~~~~~~~
include/linux/userfaultfd_k.h:244:38: note: each undeclared identifier is reported only once for each function it appears in
arch/ia64/kernel/sys_ia64.c: At top level:
arch/ia64/kernel/sys_ia64.c:71:1: warning: no previous prototype for 'ia64_getpriority' [-Wmissing-prototypes]
71 | ia64_getpriority (int which, int who)
| ^~~~~~~~~~~~~~~~
arch/ia64/kernel/sys_ia64.c:85:1: warning: no previous prototype for 'sys_getpagesize' [-Wmissing-prototypes]
85 | sys_getpagesize (void)
| ^~~~~~~~~~~~~~~
arch/ia64/kernel/sys_ia64.c:91:1: warning: no previous prototype for 'ia64_brk' [-Wmissing-prototypes]
91 | ia64_brk (unsigned long brk)
| ^~~~~~~~
arch/ia64/kernel/sys_ia64.c:161:1: warning: no previous prototype for 'ia64_mremap' [-Wmissing-prototypes]
161 | ia64_mremap (unsigned long addr, unsigned long old_len, unsigned long new_len, unsigned long flags,
| ^~~~~~~~~~~
cc1: some warnings being treated as errors
--
In file included from arch/ia64/include/asm/pgtable.h:153,
from include/linux/pgtable.h:6,
from arch/ia64/include/asm/uaccess.h:40,
from include/linux/uaccess.h:11,
from include/linux/sched/task.h:11,
from kernel/fork.c:23:
arch/ia64/include/asm/mmu_context.h: In function 'reload_context':
arch/ia64/include/asm/mmu_context.h:127:48: warning: variable 'old_rr4' set but not used [-Wunused-but-set-variable]
127 | unsigned long rr0, rr1, rr2, rr3, rr4, old_rr4;
| ^~~~~~~
In file included from include/linux/hugetlb.h:14,
from kernel/fork.c:52:
include/linux/userfaultfd_k.h: In function 'pte_marker_entry_uffd_wp':
>> include/linux/userfaultfd_k.h:243:16: error: implicit declaration of function 'is_pte_marker_entry' [-Werror=implicit-function-declaration]
243 | return is_pte_marker_entry(entry) &&
| ^~~~~~~~~~~~~~~~~~~
>> include/linux/userfaultfd_k.h:244:14: error: implicit declaration of function 'pte_marker_get' [-Werror=implicit-function-declaration]
244 | (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
| ^~~~~~~~~~~~~~
>> include/linux/userfaultfd_k.h:244:38: error: 'PTE_MARKER_UFFD_WP' undeclared (first use in this function)
244 | (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
| ^~~~~~~~~~~~~~~~~~
include/linux/userfaultfd_k.h:244:38: note: each undeclared identifier is reported only once for each function it appears in
kernel/fork.c: At top level:
kernel/fork.c:163:13: warning: no previous prototype for 'arch_release_task_struct' [-Wmissing-prototypes]
163 | void __weak arch_release_task_struct(struct task_struct *tsk)
| ^~~~~~~~~~~~~~~~~~~~~~~~
kernel/fork.c:853:20: warning: no previous prototype for 'arch_task_cache_init' [-Wmissing-prototypes]
853 | void __init __weak arch_task_cache_init(void) { }
| ^~~~~~~~~~~~~~~~~~~~
kernel/fork.c:948:12: warning: no previous prototype for 'arch_dup_task_struct' [-Wmissing-prototypes]
948 | int __weak arch_dup_task_struct(struct task_struct *dst,
| ^~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
--
In file included from arch/ia64/include/asm/pgtable.h:153,
from include/linux/pgtable.h:6,
from include/linux/mm.h:29,
from kernel/sysctl.c:23:
arch/ia64/include/asm/mmu_context.h: In function 'reload_context':
arch/ia64/include/asm/mmu_context.h:127:48: warning: variable 'old_rr4' set but not used [-Wunused-but-set-variable]
127 | unsigned long rr0, rr1, rr2, rr3, rr4, old_rr4;
| ^~~~~~~
In file included from include/linux/hugetlb.h:14,
from kernel/sysctl.c:46:
include/linux/userfaultfd_k.h: In function 'pte_marker_entry_uffd_wp':
>> include/linux/userfaultfd_k.h:243:16: error: implicit declaration of function 'is_pte_marker_entry' [-Werror=implicit-function-declaration]
243 | return is_pte_marker_entry(entry) &&
| ^~~~~~~~~~~~~~~~~~~
>> include/linux/userfaultfd_k.h:244:14: error: implicit declaration of function 'pte_marker_get' [-Werror=implicit-function-declaration]
244 | (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
| ^~~~~~~~~~~~~~
>> include/linux/userfaultfd_k.h:244:38: error: 'PTE_MARKER_UFFD_WP' undeclared (first use in this function)
244 | (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
| ^~~~~~~~~~~~~~~~~~
include/linux/userfaultfd_k.h:244:38: note: each undeclared identifier is reported only once for each function it appears in
cc1: some warnings being treated as errors
--
In file included from arch/ia64/include/asm/pgtable.h:153,
from include/linux/pgtable.h:6,
from include/linux/mm.h:29,
from mm/vmscan.c:15:
arch/ia64/include/asm/mmu_context.h: In function 'reload_context':
arch/ia64/include/asm/mmu_context.h:127:48: warning: variable 'old_rr4' set but not used [-Wunused-but-set-variable]
127 | unsigned long rr0, rr1, rr2, rr3, rr4, old_rr4;
| ^~~~~~~
In file included from include/linux/hugetlb.h:14,
from include/linux/migrate.h:8,
from mm/vmscan.c:44:
include/linux/userfaultfd_k.h: In function 'pte_marker_entry_uffd_wp':
>> include/linux/userfaultfd_k.h:243:16: error: implicit declaration of function 'is_pte_marker_entry' [-Werror=implicit-function-declaration]
243 | return is_pte_marker_entry(entry) &&
| ^~~~~~~~~~~~~~~~~~~
>> include/linux/userfaultfd_k.h:244:14: error: implicit declaration of function 'pte_marker_get' [-Werror=implicit-function-declaration]
244 | (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
| ^~~~~~~~~~~~~~
>> include/linux/userfaultfd_k.h:244:38: error: 'PTE_MARKER_UFFD_WP' undeclared (first use in this function)
244 | (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
| ^~~~~~~~~~~~~~~~~~
include/linux/userfaultfd_k.h:244:38: note: each undeclared identifier is reported only once for each function it appears in
In file included from mm/vmscan.c:57:
include/linux/swapops.h: At top level:
>> include/linux/swapops.h:289:20: error: conflicting types for 'is_pte_marker_entry'; have 'bool(swp_entry_t)' {aka '_Bool(swp_entry_t)'}
289 | static inline bool is_pte_marker_entry(swp_entry_t entry)
| ^~~~~~~~~~~~~~~~~~~
In file included from include/linux/hugetlb.h:14,
from include/linux/migrate.h:8,
from mm/vmscan.c:44:
include/linux/userfaultfd_k.h:243:16: note: previous implicit declaration of 'is_pte_marker_entry' with type 'int()'
243 | return is_pte_marker_entry(entry) &&
| ^~~~~~~~~~~~~~~~~~~
In file included from mm/vmscan.c:57:
>> include/linux/swapops.h:294:26: error: conflicting types for 'pte_marker_get'; have 'pte_marker(swp_entry_t)' {aka 'long unsigned int(swp_entry_t)'}
294 | static inline pte_marker pte_marker_get(swp_entry_t entry)
| ^~~~~~~~~~~~~~
In file included from include/linux/hugetlb.h:14,
from include/linux/migrate.h:8,
from mm/vmscan.c:44:
include/linux/userfaultfd_k.h:244:14: note: previous implicit declaration of 'pte_marker_get' with type 'int()'
244 | (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
| ^~~~~~~~~~~~~~
cc1: some warnings being treated as errors
--
In file included from arch/ia64/include/asm/pgtable.h:153,
from include/linux/pgtable.h:6,
from arch/ia64/include/asm/uaccess.h:40,
from include/linux/uaccess.h:11,
from include/linux/sched/task.h:11,
from include/linux/sched/signal.h:9,
from include/linux/rcuwait.h:6,
from include/linux/percpu-rwsem.h:7,
from include/linux/fs.h:33,
from fs/proc/meminfo.c:2:
arch/ia64/include/asm/mmu_context.h: In function 'reload_context':
arch/ia64/include/asm/mmu_context.h:127:48: warning: variable 'old_rr4' set but not used [-Wunused-but-set-variable]
127 | unsigned long rr0, rr1, rr2, rr3, rr4, old_rr4;
| ^~~~~~~
In file included from include/linux/hugetlb.h:14,
from fs/proc/meminfo.c:6:
include/linux/userfaultfd_k.h: In function 'pte_marker_entry_uffd_wp':
>> include/linux/userfaultfd_k.h:243:16: error: implicit declaration of function 'is_pte_marker_entry' [-Werror=implicit-function-declaration]
243 | return is_pte_marker_entry(entry) &&
| ^~~~~~~~~~~~~~~~~~~
>> include/linux/userfaultfd_k.h:244:14: error: implicit declaration of function 'pte_marker_get' [-Werror=implicit-function-declaration]
244 | (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
| ^~~~~~~~~~~~~~
>> include/linux/userfaultfd_k.h:244:38: error: 'PTE_MARKER_UFFD_WP' undeclared (first use in this function)
244 | (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
| ^~~~~~~~~~~~~~~~~~
include/linux/userfaultfd_k.h:244:38: note: each undeclared identifier is reported only once for each function it appears in
fs/proc/meminfo.c: At top level:
fs/proc/meminfo.c:22:28: warning: no previous prototype for 'arch_report_meminfo' [-Wmissing-prototypes]
22 | void __attribute__((weak)) arch_report_meminfo(struct seq_file *m)
| ^~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors


vim +/is_pte_marker_entry +243 include/linux/userfaultfd_k.h

240
241 static inline bool pte_marker_entry_uffd_wp(swp_entry_t entry)
242 {
> 243 return is_pte_marker_entry(entry) &&
> 244 (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
245 }
246

--
0-DAY CI Kernel Test Service
https://01.org/lkp

2022-04-06 16:54:01

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v8 15/23] mm/hugetlb: Handle pte markers in page faults

Hi Peter,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on hnaz-mm/master]
[cannot apply to arnd-asm-generic/master linus/master linux/master v5.18-rc1 next-20220406]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/intel-lab-lkp/linux/commits/Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20220405-100136
base: https://github.com/hnaz/linux-mm master
config: s390-randconfig-r044-20220406 (https://download.01.org/0day-ci/archive/20220406/[email protected]/config)
compiler: s390-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/e7e7aaec811e2817cd169f0cc1d8f81bdf1f05c3
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20220405-100136
git checkout e7e7aaec811e2817cd169f0cc1d8f81bdf1f05c3
# save the config file to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=s390 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

mm/hugetlb.c: In function 'hugetlb_fault':
>> mm/hugetlb.c:5678:13: error: implicit declaration of function 'huge_pte_none_mostly'; did you mean 'pte_none_mostly'? [-Werror=implicit-function-declaration]
5678 | if (huge_pte_none_mostly(entry)) {
| ^~~~~~~~~~~~~~~~~~~~
| pte_none_mostly
cc1: some warnings being treated as errors


vim +5678 mm/hugetlb.c

5616
5617 vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
5618 unsigned long address, unsigned int flags)
5619 {
5620 pte_t *ptep, entry;
5621 spinlock_t *ptl;
5622 vm_fault_t ret;
5623 u32 hash;
5624 pgoff_t idx;
5625 struct page *page = NULL;
5626 struct page *pagecache_page = NULL;
5627 struct hstate *h = hstate_vma(vma);
5628 struct address_space *mapping;
5629 int need_wait_lock = 0;
5630 unsigned long haddr = address & huge_page_mask(h);
5631
5632 ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
5633 if (ptep) {
5634 /*
5635 * Since we hold no locks, ptep could be stale. That is
5636 * OK as we are only making decisions based on content and
5637 * not actually modifying content here.
5638 */
5639 entry = huge_ptep_get(ptep);
5640 if (unlikely(is_hugetlb_entry_migration(entry))) {
5641 migration_entry_wait_huge(vma, mm, ptep);
5642 return 0;
5643 } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
5644 return VM_FAULT_HWPOISON_LARGE |
5645 VM_FAULT_SET_HINDEX(hstate_index(h));
5646 }
5647
5648 /*
5649 * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
5650 * until finished with ptep. This serves two purposes:
5651 * 1) It prevents huge_pmd_unshare from being called elsewhere
5652 * and making the ptep no longer valid.
5653 * 2) It synchronizes us with i_size modifications during truncation.
5654 *
5655 * ptep could have already be assigned via huge_pte_offset. That
5656 * is OK, as huge_pte_alloc will return the same value unless
5657 * something has changed.
5658 */
5659 mapping = vma->vm_file->f_mapping;
5660 i_mmap_lock_read(mapping);
5661 ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
5662 if (!ptep) {
5663 i_mmap_unlock_read(mapping);
5664 return VM_FAULT_OOM;
5665 }
5666
5667 /*
5668 * Serialize hugepage allocation and instantiation, so that we don't
5669 * get spurious allocation failures if two CPUs race to instantiate
5670 * the same page in the page cache.
5671 */
5672 idx = vma_hugecache_offset(h, vma, haddr);
5673 hash = hugetlb_fault_mutex_hash(mapping, idx);
5674 mutex_lock(&hugetlb_fault_mutex_table[hash]);
5675
5676 entry = huge_ptep_get(ptep);
5677 /* PTE markers should be handled the same way as none pte */
> 5678 if (huge_pte_none_mostly(entry)) {
5679 ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
5680 entry, flags);
5681 goto out_mutex;
5682 }
5683
5684 ret = 0;
5685
5686 /*
5687 * entry could be a migration/hwpoison entry at this point, so this
5688 * check prevents the kernel from going below assuming that we have
5689 * an active hugepage in pagecache. This goto expects the 2nd page
5690 * fault, and is_hugetlb_entry_(migration|hwpoisoned) check will
5691 * properly handle it.
5692 */
5693 if (!pte_present(entry))
5694 goto out_mutex;
5695
5696 /*
5697 * If we are going to COW/unshare the mapping later, we examine the
5698 * pending reservations for this page now. This will ensure that any
5699 * allocations necessary to record that reservation occur outside the
5700 * spinlock. For private mappings, we also lookup the pagecache
5701 * page now as it is used to determine if a reservation has been
5702 * consumed.
5703 */
5704 if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
5705 !huge_pte_write(entry)) {
5706 if (vma_needs_reservation(h, vma, haddr) < 0) {
5707 ret = VM_FAULT_OOM;
5708 goto out_mutex;
5709 }
5710 /* Just decrements count, does not deallocate */
5711 vma_end_reservation(h, vma, haddr);
5712
5713 if (!(vma->vm_flags & VM_MAYSHARE))
5714 pagecache_page = hugetlbfs_pagecache_page(h,
5715 vma, haddr);
5716 }
5717
5718 ptl = huge_pte_lock(h, mm, ptep);
5719
5720 /* Check for a racing update before calling hugetlb_wp() */
5721 if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
5722 goto out_ptl;
5723
5724 /* Handle userfault-wp first, before trying to lock more pages */
5725 if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
5726 (flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
5727 struct vm_fault vmf = {
5728 .vma = vma,
5729 .address = haddr,
5730 .real_address = address,
5731 .flags = flags,
5732 };
5733
5734 spin_unlock(ptl);
5735 if (pagecache_page) {
5736 unlock_page(pagecache_page);
5737 put_page(pagecache_page);
5738 }
5739 mutex_unlock(&hugetlb_fault_mutex_table[hash]);
5740 i_mmap_unlock_read(mapping);
5741 return handle_userfault(&vmf, VM_UFFD_WP);
5742 }
5743
5744 /*
5745 * hugetlb_wp() requires page locks of pte_page(entry) and
5746 * pagecache_page, so here we need take the former one
5747 * when page != pagecache_page or !pagecache_page.
5748 */
5749 page = pte_page(entry);
5750 if (page != pagecache_page)
5751 if (!trylock_page(page)) {
5752 need_wait_lock = 1;
5753 goto out_ptl;
5754 }
5755
5756 get_page(page);
5757
5758 if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
5759 if (!huge_pte_write(entry)) {
5760 ret = hugetlb_wp(mm, vma, address, ptep, flags,
5761 pagecache_page, ptl);
5762 goto out_put_page;
5763 } else if (likely(flags & FAULT_FLAG_WRITE)) {
5764 entry = huge_pte_mkdirty(entry);
5765 }
5766 }
5767 entry = pte_mkyoung(entry);
5768 if (huge_ptep_set_access_flags(vma, haddr, ptep, entry,
5769 flags & FAULT_FLAG_WRITE))
5770 update_mmu_cache(vma, haddr, ptep);
5771 out_put_page:
5772 if (page != pagecache_page)
5773 unlock_page(page);
5774 put_page(page);
5775 out_ptl:
5776 spin_unlock(ptl);
5777
5778 if (pagecache_page) {
5779 unlock_page(pagecache_page);
5780 put_page(pagecache_page);
5781 }
5782 out_mutex:
5783 mutex_unlock(&hugetlb_fault_mutex_table[hash]);
5784 i_mmap_unlock_read(mapping);
5785 /*
5786 * Generally it's safe to hold refcount during waiting page lock. But
5787 * here we just wait to defer the next page fault to avoid busy loop and
5788 * the page is not used after unlocked before returning from the current
5789 * page fault. So we are safe from accessing freed page, even if we wait
5790 * here without taking refcount.
5791 */
5792 if (need_wait_lock)
5793 wait_on_page_locked(page);
5794 return ret;
5795 }
5796

--
0-DAY CI Kernel Test Service
https://01.org/lkp

2022-04-06 17:01:53

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v8 15/23] mm/hugetlb: Handle pte markers in page faults

On Wed, Apr 06, 2022 at 09:37:00PM +0800, kernel test robot wrote:
> Hi Peter,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on hnaz-mm/master]
> [cannot apply to arnd-asm-generic/master linus/master linux/master v5.18-rc1 next-20220406]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20220405-100136
> base: https://github.com/hnaz/linux-mm master
> config: s390-randconfig-r044-20220406 (https://download.01.org/0day-ci/archive/20220406/[email protected]/config)
> compiler: s390-linux-gcc (GCC) 11.2.0
> reproduce (this is a W=1 build):
> wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # https://github.com/intel-lab-lkp/linux/commit/e7e7aaec811e2817cd169f0cc1d8f81bdf1f05c3
> git remote add linux-review https://github.com/intel-lab-lkp/linux
> git fetch --no-tags linux-review Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20220405-100136
> git checkout e7e7aaec811e2817cd169f0cc1d8f81bdf1f05c3
> # save the config file to linux build tree
> mkdir build_dir
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=s390 SHELL=/bin/bash
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <[email protected]>
>
> All errors (new ones prefixed by >>):
>
> mm/hugetlb.c: In function 'hugetlb_fault':
> >> mm/hugetlb.c:5678:13: error: implicit declaration of function 'huge_pte_none_mostly'; did you mean 'pte_none_mostly'? [-Werror=implicit-function-declaration]
> 5678 | if (huge_pte_none_mostly(entry)) {
> | ^~~~~~~~~~~~~~~~~~~~
> | pte_none_mostly
> cc1: some warnings being treated as errors

Ah, the s390 stub was forgotten again, sorry. I hope someday s390 will
start to include asm-generic/hugetlb.h like all the other archs, because my
gut feeling is that's how it should work.. or the dir should be renamed to
asm-generic-without-s390/. :(

The expected fix patch is attached (to be squashed into patch "mm: Introduce
PTE_MARKER swap entry").

Thanks,

--
Peter Xu


Attachments:
0001-fixup-mm-Introduce-PTE_MARKER-swap-entry.patch (817.00 B)

2022-04-12 13:56:26

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v8 01/23] mm: Introduce PTE_MARKER swap entry

Hi Peter,

I noticed this while reviewing the next patch in the series. I think you need to
add CONFIG_PTE_MARKER to the below as well:

#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) || \
defined(CONFIG_DEVICE_PRIVATE)
static inline int non_swap_entry(swp_entry_t entry)
{
return swp_type(entry) >= MAX_SWAPFILES;
}
#else
static inline int non_swap_entry(swp_entry_t entry)
{
return 0;
}
#endif

Otherwise marker entries will be treated as swap entries, which is wrong, for
example, in swapin_walk_pmd_entry(), since marker entries are no longer
considered pte_none().
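
Concretely, I mean something like the below (a sketch, not a tested patch):

#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) || \
	defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_PTE_MARKER)
static inline int non_swap_entry(swp_entry_t entry)
{
	return swp_type(entry) >= MAX_SWAPFILES;
}
#else
static inline int non_swap_entry(swp_entry_t entry)
{
	return 0;
}
#endif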

- Alistair

Peter Xu <[email protected]> writes:

> This patch introduces a new swap entry type called PTE_MARKER. It can be
> installed for any pte that maps a file-backed memory when the pte is
> temporarily zapped, so as to maintain per-pte information.
>
> The information kept in the pte is called a "marker". Here we define the
> marker as "unsigned long" just to match pgoff_t, however it will only work if
> it still fits in swp_offset(), which is e.g. currently 58 bits on x86_64.
>
> A new config CONFIG_PTE_MARKER is introduced too; it's off by default. A bunch
> of helpers are defined altogether to service the rest of the pte marker code.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/asm-generic/hugetlb.h | 9 ++++
> include/linux/swap.h | 15 ++++++-
> include/linux/swapops.h | 78 +++++++++++++++++++++++++++++++++++
> mm/Kconfig | 6 +++
> 4 files changed, 107 insertions(+), 1 deletion(-)
>
> diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
> index 8e1e6244a89d..f39cad20ffc6 100644
> --- a/include/asm-generic/hugetlb.h
> +++ b/include/asm-generic/hugetlb.h
> @@ -2,6 +2,9 @@
> #ifndef _ASM_GENERIC_HUGETLB_H
> #define _ASM_GENERIC_HUGETLB_H
>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
> +
> static inline pte_t mk_huge_pte(struct page *page, pgprot_t pgprot)
> {
> return mk_pte(page, pgprot);
> @@ -80,6 +83,12 @@ static inline int huge_pte_none(pte_t pte)
> }
> #endif
>
> +/* Please refer to comments above pte_none_mostly() for the usage */
> +static inline int huge_pte_none_mostly(pte_t pte)
> +{
> + return huge_pte_none(pte) || is_pte_marker(pte);
> +}
> +
> #ifndef __HAVE_ARCH_HUGE_PTE_WRPROTECT
> static inline pte_t huge_pte_wrprotect(pte_t pte)
> {
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 7daae5a4b3e1..5553189d0215 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -55,6 +55,19 @@ static inline int current_is_kswapd(void)
> * actions on faults.
> */
>
> +/*
> + * PTE markers are used to persist information onto PTEs that are mapped with
> + * file-backed memories. As its name "PTE" hints, it should only be applied to
> + * the leaves of pgtables.
> + */
> +#ifdef CONFIG_PTE_MARKER
> +#define SWP_PTE_MARKER_NUM 1
> +#define SWP_PTE_MARKER (MAX_SWAPFILES + SWP_HWPOISON_NUM + \
> + SWP_MIGRATION_NUM + SWP_DEVICE_NUM)
> +#else
> +#define SWP_PTE_MARKER_NUM 0
> +#endif
> +
> /*
> * Unaddressable device memory support. See include/linux/hmm.h and
> * Documentation/vm/hmm.rst. Short description is we need struct pages for
> @@ -107,7 +120,7 @@ static inline int current_is_kswapd(void)
>
> #define MAX_SWAPFILES \
> ((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
> - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
> + SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - SWP_PTE_MARKER_NUM)
>
> /*
> * Magic header for a swap area. The first part of the union is
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 32d517a28969..7a00627845f0 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -274,6 +274,84 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
>
> #endif
>
> +typedef unsigned long pte_marker;
> +
> +#define PTE_MARKER_MASK (0)
> +
> +#ifdef CONFIG_PTE_MARKER
> +
> +static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
> +{
> + return swp_entry(SWP_PTE_MARKER, marker);
> +}
> +
> +static inline bool is_pte_marker_entry(swp_entry_t entry)
> +{
> + return swp_type(entry) == SWP_PTE_MARKER;
> +}
> +
> +static inline pte_marker pte_marker_get(swp_entry_t entry)
> +{
> + return swp_offset(entry) & PTE_MARKER_MASK;
> +}
> +
> +static inline bool is_pte_marker(pte_t pte)
> +{
> + return is_swap_pte(pte) && is_pte_marker_entry(pte_to_swp_entry(pte));
> +}
> +
> +#else /* CONFIG_PTE_MARKER */
> +
> +static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
> +{
> + /* This should never be called if !CONFIG_PTE_MARKER */
> + WARN_ON_ONCE(1);
> + return swp_entry(0, 0);
> +}
> +
> +static inline bool is_pte_marker_entry(swp_entry_t entry)
> +{
> + return false;
> +}
> +
> +static inline pte_marker pte_marker_get(swp_entry_t entry)
> +{
> + return 0;
> +}
> +
> +static inline bool is_pte_marker(pte_t pte)
> +{
> + return false;
> +}
> +
> +#endif /* CONFIG_PTE_MARKER */
> +
> +static inline pte_t make_pte_marker(pte_marker marker)
> +{
> + return swp_entry_to_pte(make_pte_marker_entry(marker));
> +}
> +
> +/*
> + * This is a special version to check pte_none() just to cover the case when
> + * the pte is a pte marker. It existed because in many cases the pte marker
> + * should be seen as a none pte; it's just that we have stored some information
> + * onto the none pte so it becomes not-none any more.
> + *
> + * It should be used when the pte is file-backed, ram-based and backing
> + * userspace pages, like shmem. It is not needed upon pgtables that do not
> + * support pte markers at all. For example, it's not needed on anonymous
> + * memory, kernel-only memory (including when the system is during-boot),
> + * non-ram based generic file-system. It's fine to be used even there, but the
> + * extra pte marker check will be pure overhead.
> + *
> + * For systems configured with !CONFIG_PTE_MARKER this will be automatically
> + * optimized to pte_none().
> + */
> +static inline int pte_none_mostly(pte_t pte)
> +{
> + return pte_none(pte) || is_pte_marker(pte);
> +}
> +
> static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
> {
> struct page *p = pfn_to_page(swp_offset(entry));
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 034d87953600..a1688b9314b2 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -909,6 +909,12 @@ config ANON_VMA_NAME
> area from being merged with adjacent virtual memory areas due to the
> difference in their name.
>
> +config PTE_MARKER
> + bool "Marker PTEs support"
> +
> + help
> + Allows to create marker PTEs for file-backed memory.
> +
> source "mm/damon/Kconfig"
>
> endmenu



2022-04-12 19:52:03

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()

Looks ok to me now and should work for all architectures.

Reviewed-by: Alistair Popple <[email protected]>

Peter Xu <[email protected]> writes:

> We used to check against none pte in finish_fault(), with the assumption
> that the orig_pte is always none pte.
>
> This change prepares us to be able to call do_fault() on !none ptes. For
> example, we should allow that to happen for pte marker so that we can restore
> information out of the pte markers.
>
> Let's change the "pte_none" check into detecting changes since we fetched
> orig_pte. One trivial thing to take care of here is that, when pmd==NULL
> for the pgtable, we may not initialize orig_pte at all in handle_pte_fault().
>
> By default orig_pte will be all zeros however the problem is not all
> architectures are using all-zeros for a none pte. pte_clear() will be the
> right thing to use here so that we'll always have a valid orig_pte value
> for the whole handle_pte_fault() call.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/memory.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 3f396241a7db..b1af996b09ca 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4241,7 +4241,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> vmf->address, &vmf->ptl);
> ret = 0;
> /* Re-check under ptl */
> - if (likely(pte_none(*vmf->pte)))
> + if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
> do_set_pte(vmf, page, vmf->address);
> else
> ret = VM_FAULT_NOPAGE;
> @@ -4709,6 +4709,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> * concurrent faults and from rmap lookups.
> */
> vmf->pte = NULL;
> + /*
> + * Always initialize orig_pte. This matches with below
> + * code to have orig_pte to be the none pte if pte==NULL.
> + * This makes the rest code to be always safe to reference
> + * it, e.g. in finish_fault() we'll detect pte changes.
> + */
> + pte_clear(vmf->vma->vm_mm, vmf->address, &vmf->orig_pte);
> } else {
> /*
> * If a huge pmd materialized under us just retry later. Use



2022-04-12 23:17:07

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v8 02/23] mm: Teach core mm about pte markers

Hi, Alistair,

On Tue, Apr 12, 2022 at 11:22:01AM +1000, Alistair Popple wrote:
> I've been reviewing existing pte_none() call sites and noticed the following:
>
> From khugepaged_scan_pmd():
>
> pte_t pteval = *_pte;
> if (is_swap_pte(pteval)) {
> if (++unmapped <= khugepaged_max_ptes_swap) {
> /*
> * Always be strict with uffd-wp
> * enabled swap entries. Please see
> * comment below for pte_uffd_wp().
> */
> if (pte_swp_uffd_wp(pteval)) {
> result = SCAN_PTE_UFFD_WP;
> goto out_unmap;
> }
> continue;
> } else {
> result = SCAN_EXCEED_SWAP_PTE;
> count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> goto out_unmap;
> }
> }
> if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> if (!userfaultfd_armed(vma) &&
> ++none_or_zero <= khugepaged_max_ptes_none) {
> continue;
> } else {
> result = SCAN_EXCEED_NONE_PTE;
> count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> goto out_unmap;
> }
> }
>
> I think the above could encounter a marker PTE, right? So the behaviour would
> be wrong in that case. As I understand things the is_swap_pte() path will be
> taken rather than pte_none(), but in the absence of any special handling
> shouldn't marker PTEs mostly be treated as pte_none() here?
>
> I think you need to s/pte_none/pte_none_mostly/ here and swap the order of
> conditionals around.

Isn't khugepaged_scan_pmd() only for anonymous?

The shmem side is covered by khugepaged_scan_file(), imho. We definitely
don't want to collapse shmem vma ranges that have uffd-wp registered, and
that's actually handled explicitly in "mm/khugepaged: Don't recycle vma
pgtable if uffd-wp registered". Please feel free to have a look.

Thanks,

--
Peter Xu

2022-04-12 23:17:18

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v8 01/23] mm: Introduce PTE_MARKER swap entry

On Tue, Apr 12, 2022 at 11:07:56AM +1000, Alistair Popple wrote:
> Hi Peter,

Hi, Alistair,

>
> I noticed this while reviewing the next patch in the series. I think you need to
> add CONFIG_PTE_MARKER to the below as well:
>
> #if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) || \
> defined(CONFIG_DEVICE_PRIVATE)
> static inline int non_swap_entry(swp_entry_t entry)
> {
> return swp_type(entry) >= MAX_SWAPFILES;
> }
> #else
> static inline int non_swap_entry(swp_entry_t entry)
> {
> return 0;
> }
> #endif
>
> Otherwise marker entries will be treated as swap entries, which is wrong, for
> example, in swapin_walk_pmd_entry(), since marker entries are no longer
> considered pte_none().

Thanks for the comment, that makes sense.

Instead of adding PTE_MARKER into this equation, I'm taking a step back and
wondering why we need to bother with non_swap_entry() at all, given that
MAX_SWAPFILES is already defined with proper knowledge of all these bits.

#define MAX_SWAPFILES \
((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)

So, I agree with your analysis, but instead of adding PTE_MARKER, what do
you think about dropping that complexity altogether (possibly with a
standalone patch)?

---8<---
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index d356ab4047f7..5af852b68805 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -387,18 +387,10 @@ static inline void num_poisoned_pages_inc(void)
}
#endif

-#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) || \
- defined(CONFIG_DEVICE_PRIVATE)
static inline int non_swap_entry(swp_entry_t entry)
{
return swp_type(entry) >= MAX_SWAPFILES;
}
-#else
-static inline int non_swap_entry(swp_entry_t entry)
-{
- return 0;
-}
-#endif

#endif /* CONFIG_MMU */
#endif /* _LINUX_SWAPOPS_H */
---8<---

Thanks,

--
Peter Xu

2022-04-12 23:17:30

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()

On Tue, Apr 12, 2022 at 12:05:41PM +1000, Alistair Popple wrote:
> Looks ok to me now and should work for all architectures.
>
> Reviewed-by: Alistair Popple <[email protected]>

Thanks again for taking a look. The series is already queued in -mm, but
I'll pick up these R-bs if there is a new version.

--
Peter Xu

2022-04-12 23:30:32

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v8 02/23] mm: Teach core mm about pte markers

I've been reviewing existing pte_none() call sites and noticed the following:

From khugepaged_scan_pmd():

pte_t pteval = *_pte;
if (is_swap_pte(pteval)) {
if (++unmapped <= khugepaged_max_ptes_swap) {
/*
* Always be strict with uffd-wp
* enabled swap entries. Please see
* comment below for pte_uffd_wp().
*/
if (pte_swp_uffd_wp(pteval)) {
result = SCAN_PTE_UFFD_WP;
goto out_unmap;
}
continue;
} else {
result = SCAN_EXCEED_SWAP_PTE;
count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
goto out_unmap;
}
}
if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
if (!userfaultfd_armed(vma) &&
++none_or_zero <= khugepaged_max_ptes_none) {
continue;
} else {
result = SCAN_EXCEED_NONE_PTE;
count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
goto out_unmap;
}
}

I think the above could encounter a marker PTE, right? So the behaviour would
be wrong in that case. As I understand things the is_swap_pte() path will be
taken rather than pte_none(), but in the absence of any special handling
shouldn't marker PTEs mostly be treated as pte_none() here?

I think you need to s/pte_none/pte_none_mostly/ here and swap the order of
conditionals around.
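
Roughly like this (a sketch of what I mean, untested):

	if (pte_none_mostly(pteval) || is_zero_pfn(pte_pfn(pteval))) {
		/* marker PTEs now take the "none" path */
		if (!userfaultfd_armed(vma) &&
		    ++none_or_zero <= khugepaged_max_ptes_none)
			continue;
		result = SCAN_EXCEED_NONE_PTE;
		count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
		goto out_unmap;
	} else if (is_swap_pte(pteval)) {
		/* real swap entries keep the existing handling */
		if (++unmapped <= khugepaged_max_ptes_swap) {
			if (pte_swp_uffd_wp(pteval)) {
				result = SCAN_PTE_UFFD_WP;
				goto out_unmap;
			}
			continue;
		}
		result = SCAN_EXCEED_SWAP_PTE;
		count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
		goto out_unmap;
	}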

- Alistair

Peter Xu <[email protected]> writes:

> This patch still does not use pte markers in any way; however, it teaches
> the core mm about the pte marker idea.
>
> For example, handle_pte_marker() is introduced that will parse and handle all
> the pte marker faults.
>
> Many of the places are more about commenting things up - so that we know
> there's the possibility of a pte marker showing up, and why we don't need
> special code for those cases.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> fs/userfaultfd.c | 10 ++++++----
> mm/filemap.c | 5 +++++
> mm/hmm.c | 2 +-
> mm/memcontrol.c | 8 ++++++--
> mm/memory.c | 23 +++++++++++++++++++++++
> mm/mincore.c | 3 ++-
> mm/mprotect.c | 3 +++
> 7 files changed, 46 insertions(+), 8 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index aa0c47cb0d16..8b4a94f5a238 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -249,9 +249,10 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
>
> /*
> * Lockless access: we're in a wait_event so it's ok if it
> - * changes under us.
> + * changes under us. PTE markers should be handled the same as none
> + * ptes here.
> */
> - if (huge_pte_none(pte))
> + if (huge_pte_none_mostly(pte))
> ret = true;
> if (!huge_pte_write(pte) && (reason & VM_UFFD_WP))
> ret = true;
> @@ -330,9 +331,10 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
> pte = pte_offset_map(pmd, address);
> /*
> * Lockless access: we're in a wait_event so it's ok if it
> - * changes under us.
> + * changes under us. PTE markers should be handled the same as none
> + * ptes here.
> */
> - if (pte_none(*pte))
> + if (pte_none_mostly(*pte))
> ret = true;
> if (!pte_write(*pte) && (reason & VM_UFFD_WP))
> ret = true;
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 3a5ffb5587cd..ef77dae8c28d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3382,6 +3382,11 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
> vmf->pte += xas.xa_index - last_pgoff;
> last_pgoff = xas.xa_index;
>
> + /*
> + * NOTE: If there're PTE markers, we'll leave them to be
> + * handled in the specific fault path, and it'll prohibit the
> + * fault-around logic.
> + */
> if (!pte_none(*vmf->pte))
> goto unlock;
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index af71aac3140e..3fd3242c5e50 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -239,7 +239,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> pte_t pte = *ptep;
> uint64_t pfn_req_flags = *hmm_pfn;
>
> - if (pte_none(pte)) {
> + if (pte_none_mostly(pte)) {
> required_fault =
> hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
> if (required_fault)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7a08737bac4b..08af97c73f0f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5644,10 +5644,14 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
>
> if (pte_present(ptent))
> page = mc_handle_present_pte(vma, addr, ptent);
> + else if (pte_none_mostly(ptent))
> + /*
> + * PTE markers should be treated as a none pte here, separated
> + * from other swap handling below.
> + */
> + page = mc_handle_file_pte(vma, addr, ptent);
> else if (is_swap_pte(ptent))
> page = mc_handle_swap_pte(vma, ptent, &ent);
> - else if (pte_none(ptent))
> - page = mc_handle_file_pte(vma, addr, ptent);
>
> if (!page && !ent.val)
> return ret;
> diff --git a/mm/memory.c b/mm/memory.c
> index 2c5d1bb4694f..3f396241a7db 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -100,6 +100,8 @@ struct page *mem_map;
> EXPORT_SYMBOL(mem_map);
> #endif
>
> +static vm_fault_t do_fault(struct vm_fault *vmf);
> +
> /*
> * A number of key systems in x86 including ioremap() rely on the assumption
> * that high_memory defines the upper bound on direct map memory, then end
> @@ -1415,6 +1417,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> if (!should_zap_page(details, page))
> continue;
> rss[mm_counter(page)]--;
> + } else if (is_pte_marker_entry(entry)) {
> + /* By default, simply drop all pte markers when zap */
> } else if (is_hwpoison_entry(entry)) {
> if (!should_zap_cows(details))
> continue;
> @@ -3555,6 +3559,23 @@ static inline bool should_try_to_free_swap(struct page *page,
> page_count(page) == 2;
> }
>
> +static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> +{
> + swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
> + unsigned long marker = pte_marker_get(entry);
> +
> + /*
> + * PTE markers should always be with file-backed memories, and the
> + * marker should never be empty. If anything weird happened, the best
> + * thing to do is to kill the process along with its mm.
> + */
> + if (WARN_ON_ONCE(vma_is_anonymous(vmf->vma) || !marker))
> + return VM_FAULT_SIGBUS;
> +
> + /* TODO: handle pte markers */
> + return 0;
> +}
> +
> /*
> * We enter with non-exclusive mmap_lock (to exclude vma changes,
> * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -3592,6 +3613,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> } else if (is_hwpoison_entry(entry)) {
> ret = VM_FAULT_HWPOISON;
> + } else if (is_pte_marker_entry(entry)) {
> + ret = handle_pte_marker(vmf);
> } else {
> print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
> ret = VM_FAULT_SIGBUS;
> diff --git a/mm/mincore.c b/mm/mincore.c
> index f4f627325e12..fa200c14185f 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -122,7 +122,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> for (; addr != end; ptep++, addr += PAGE_SIZE) {
> pte_t pte = *ptep;
>
> - if (pte_none(pte))
> + /* We need to do cache lookup too for pte markers */
> + if (pte_none_mostly(pte))
> __mincore_unmapped_range(addr, addr + PAGE_SIZE,
> vma, vec);
> else if (pte_present(pte))
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 56060acdabd3..709a6f73b764 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -188,6 +188,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> newpte = pte_swp_mksoft_dirty(newpte);
> if (pte_swp_uffd_wp(oldpte))
> newpte = pte_swp_mkuffd_wp(newpte);
> + } else if (is_pte_marker_entry(entry)) {
> + /* Skip it, the same as none pte */
> + continue;
> } else {
> newpte = oldpte;
> }



2022-04-13 02:33:07

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v8 01/23] mm: Introduce PTE_MARKER swap entry

Peter Xu <[email protected]> writes:

> On Tue, Apr 12, 2022 at 11:07:56AM +1000, Alistair Popple wrote:
>> Hi Peter,
>
> Hi, Alistair,
>
>>
>> I noticed this while reviewing the next patch in the series. I think you need to
>> add CONFIG_PTE_MARKER to the below as well:
>>
>> #if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) || \
>> defined(CONFIG_DEVICE_PRIVATE)
>> static inline int non_swap_entry(swp_entry_t entry)
>> {
>> return swp_type(entry) >= MAX_SWAPFILES;
>> }
>> #else
>> static inline int non_swap_entry(swp_entry_t entry)
>> {
>> return 0;
>> }
>> #endif
>>
>> Otherwise marker entries will be treated as swap entries, which is wrong, for
>> example, in swapin_walk_pmd_entry(), since marker entries are no longer
>> considered pte_none().
>
> Thanks for the comment, that makes sense.
>
> Instead of adding PTE_MARKER into this equation, I'm taking a step back and
> wondering why we need to bother with non_swap_entry() at all, given that
> MAX_SWAPFILES is already defined with proper knowledge of all these bits.

I was going to suggest it was to help the compiler optimise the non-swap
entry code away. But I just tested and it makes no difference to the .text
section size either way, so I think your suggestion is good, unless that
isn't true for other architecture/compiler combinations (I only tried
gcc-10.2.1 on x86_64).

That's a possibility, because the optimisation isn't obvious to me at least.

non_swap_entry() is equivalent to:

(entry.val >> SWP_TYPE_SHIFT) >= MAX_SWAPFILES;
(entry.val >> (BITS_PER_XA_VALUE - MAX_SWAPFILES_SHIFT)) >= (1<<5);
(entry.val >> (BITS_PER_LONG - 1 - 5)) >= (1<<5);
(entry.val >> 58) >= (1<<5);

Where entry.val is a long. So from that alone it's not obvious this could be
optimised away, because nothing there implies entry.val != (1<<63), which
would make the conditional true. But there's a lot of inlining going on in
the creation of swap entries which I didn't trace, so something must end up
implying entry.val < (1<<63).
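
As a toy userspace check of that arithmetic (my own sketch; it assumes
64-bit longs, MAX_SWAPFILES_SHIFT == 5, and approximates MAX_SWAPFILES as
(1<<5) the same way as above):

#include <assert.h>
#include <stdio.h>

int main(void)
{
	const int bits_per_long = 64;
	const int bits_per_xa_value = bits_per_long - 1;	/* 63 */
	const int max_swapfiles_shift = 5;
	const int swp_type_shift = bits_per_xa_value - max_swapfiles_shift;

	assert(swp_type_shift == 58);

	/*
	 * Any entry built with type < 32 keeps (val >> 58) below 32, so the
	 * non_swap_entry() comparison is false for all real swap entries.
	 */
	unsigned long type = 31;
	unsigned long offset = (1UL << swp_type_shift) - 1;
	unsigned long val = (type << swp_type_shift) | offset;

	printf("type=%lu non_swap=%d\n", val >> swp_type_shift,
	       (int)((val >> swp_type_shift) >= (1UL << max_swapfiles_shift)));
	return 0;
}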

>
> #define MAX_SWAPFILES \
> ((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
> SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
>
> So, I agree with your analysis, but instead of adding PTE_MARKER, what do
> you think about dropping that complexity altogether (possibly with a
> standalone patch)?
>
> ---8<---
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index d356ab4047f7..5af852b68805 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -387,18 +387,10 @@ static inline void num_poisoned_pages_inc(void)
> }
> #endif
>
> -#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) || \
> - defined(CONFIG_DEVICE_PRIVATE)
> static inline int non_swap_entry(swp_entry_t entry)
> {
> return swp_type(entry) >= MAX_SWAPFILES;
> }
> -#else
> -static inline int non_swap_entry(swp_entry_t entry)
> -{
> - return 0;
> -}
> -#endif
>
> #endif /* CONFIG_MMU */
> #endif /* _LINUX_SWAPOPS_H */
> ---8<---
>
> Thanks,



2022-04-13 17:51:05

by Marek Szyprowski

[permalink] [raw]
Subject: Re: [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()

Hi,

On 05.04.2022 03:48, Peter Xu wrote:
> We used to check against none pte in finish_fault(), with the assumption
> that the orig_pte is always none pte.
>
> This change prepares us to be able to call do_fault() on !none ptes. For
> example, we should allow that to happen for pte marker so that we can restore
> information out of the pte markers.
>
> Let's change the "pte_none" check into detecting changes since we fetched
> orig_pte. One trivial thing to take care of here is that, when pmd==NULL
> for the pgtable, we may not initialize orig_pte at all in handle_pte_fault().
>
> By default orig_pte will be all zeros however the problem is not all
> architectures are using all-zeros for a none pte. pte_clear() will be the
> right thing to use here so that we'll always have a valid orig_pte value
> for the whole handle_pte_fault() call.
>
> Signed-off-by: Peter Xu <[email protected]>

This patch landed in today's linux-next (next-20220413) as commit
fa6009949163 ("mm: check against orig_pte for finish_fault()").
Unfortunately it causes serious system instability on some ARM 32bit
machines. I've observed it on all tested boards (various Samsung Exynos
based, Raspberry Pi 3b and 4b, even QEMU's virt 32bit machine) when the
kernel was compiled with multi_v7_defconfig.

Here is a crash log from QEMU's ARM 32bit virt machine:

8<--- cut here ---
Unable to handle kernel paging request at virtual address e093263c
[e093263c] *pgd=42083811, *pte=00000000, *ppte=00000000
Internal error: Oops: 807 [#1] SMP ARM
Modules linked in:
CPU: 1 PID: 37 Comm: kworker/u4:0 Not tainted
5.18.0-rc2-00176-gfa6009949163 #11684
Hardware name: Generic DT based system
PC is at cpu_ca15_set_pte_ext+0x4c/0x58
LR is at handle_mm_fault+0x46c/0xbb0
pc : [<c031bdec>]    lr : [<c0478144>]    psr: 40000013
sp : e0931df8  ip : e0931e54  fp : c26a8000
r10: 00000081  r9 : c2230880  r8 : 00000000
r7 : 00000081  r6 : beffffed  r5 : c267f000  r4 : c2230880
r3 : 00000000  r2 : 00000000  r1 : 00000040  r0 : e0931e3c
Flags: nZcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 4020406a  DAC: 00000051
Register r0 information: 2-page vmalloc region starting at 0xe0930000
allocated at kernel_clone+0x8c/0x3a8
Register r1 information: non-paged memory
Register r2 information: NULL pointer
Register r3 information: NULL pointer
Register r4 information: slab task_struct start c2230880 pointer offset 0
Register r5 information: slab vm_area_struct start c267f000 pointer offset 0
Register r6 information: non-paged memory
Register r7 information: non-paged memory
Register r8 information: NULL pointer
Register r9 information: slab task_struct start c2230880 pointer offset 0
Register r10 information: non-paged memory
Register r11 information: slab mm_struct start c26a8000 pointer offset 0
size 168
Register r12 information: 2-page vmalloc region starting at 0xe0930000
allocated at kernel_clone+0x8c/0x3a8
Process kworker/u4:0 (pid: 37, stack limit = 0x(ptrval))
Stack: (0xe0931df8 to 0xe0932000)
...
---[ end trace 0000000000000000 ]---
CAN device driver interface
bgmac_bcma: Broadcom 47xx GBit MAC driver loaded
e1000e: Intel(R) PRO/1000 Network Driver
e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
igb: Intel(R) Gigabit Ethernet Network Driver
igb: Copyright (c) 2007-2014 Intel Corporation.
pegasus: Pegasus/Pegasus II USB Ethernet driver
usbcore: registered new interface driver pegasus
usbcore: registered new interface driver asix
usbcore: registered new interface driver ax88179_178a
usbcore: registered new interface driver cdc_ether
usbcore: registered new interface driver smsc75xx
usbcore: registered new interface driver smsc95xx
usbcore: registered new interface driver net1080
usbcore: registered new interface driver cdc_subset
usbcore: registered new interface driver zaurus
usbcore: registered new interface driver cdc_ncm
ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
ehci-pci: EHCI PCI platform driver
ehci-platform: EHCI generic platform driver
ehci-omap: OMAP-EHCI Host Controller driver
ehci-orion: EHCI orion driver
SPEAr-ehci: EHCI SPEAr driver
ehci-st: EHCI STMicroelectronics driver
ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
ohci-pci: OHCI PCI platform driver
ohci-platform: OHCI generic platform driver
SPEAr-ohci: OHCI SPEAr driver
ohci-st: OHCI STMicroelectronics driver
usbcore: registered new interface driver usb-storage
rtc-pl031 9010000.pl031: registered as rtc0
rtc-pl031 9010000.pl031: setting system clock to 2022-04-13T13:49:19 UTC
(1649857759)
i2c_dev: i2c /dev entries driver
sdhci: Secure Digital Host Controller Interface driver
sdhci: Copyright(c) Pierre Ossman
Synopsys Designware Multimedia Card Interface Driver
sdhci-pltfm: SDHCI platform and OF driver helper
ledtrig-cpu: registered to indicate activity on CPUs
usbcore: registered new interface driver usbhid
usbhid: USB HID core driver
NET: Registered PF_INET6 protocol family
Segment Routing with IPv6
In-situ OAM (IOAM) with IPv6
sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
NET: Registered PF_PACKET protocol family
can: controller area network core
NET: Registered PF_CAN protocol family
can: raw protocol
can: broadcast manager protocol
can: netlink gateway - max_hops=1
Key type dns_resolver registered
ThumbEE CPU extension supported.
Registering SWP/SWPB emulation handler
Loading compiled-in X.509 certificates
input: gpio-keys as /devices/platform/gpio-keys/input/input0
uart-pl011 9000000.pl011: no DMA platform data
EXT4-fs (vda): mounted filesystem with ordered data mode. Quota mode:
disabled.
VFS: Mounted root (ext4 filesystem) readonly on device 254:0.
devtmpfs: mounted
Freeing unused kernel image (initmem) memory: 2048K
Run /sbin/init as init process
  with arguments:
    /sbin/init
  with environment:
    HOME=/
    TERM=linux
8<--- cut here ---
Unable to handle kernel paging request at virtual address e082662c
[e082662c] *pgd=42083811, *pte=00000000, *ppte=00000000
Internal error: Oops: 807 [#2] SMP ARM
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Tainted: G      D
5.18.0-rc2-00176-gfa6009949163 #11684
Hardware name: Generic DT based system
PC is at cpu_ca15_set_pte_ext+0x4c/0x58
LR is at handle_mm_fault+0x46c/0xbb0
pc : [<c031bdec>]    lr : [<c0478144>]    psr: 40000013
sp : e0825de8  ip : e0825e44  fp : c213e000
r10: 00000081  r9 : c20e0000  r8 : 00000000
r7 : 00000081  r6 : befffff1  r5 : c2695000  r4 : c20e0000
r3 : 00000000  r2 : 00000000  r1 : 00000040  r0 : e0825e2c
Flags: nZcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 4020406a  DAC: 00000051
Register r0 information: 2-page vmalloc region starting at 0xe0824000
allocated at kernel_clone+0x8c/0x3a8
Register r1 information: non-paged memory
Register r2 information: NULL pointer
Register r3 information: NULL pointer
Register r4 information: slab task_struct start c20e0000 pointer offset 0
Register r5 information: slab vm_area_struct start c2695000 pointer offset 0
Register r6 information: non-paged memory
Register r7 information: non-paged memory
Register r8 information: NULL pointer
Register r9 information: slab task_struct start c20e0000 pointer offset 0
Register r10 information: non-paged memory
Register r11 information: slab mm_struct start c213e000 pointer offset 0
size 168
Register r12 information: 2-page vmalloc region starting at 0xe0824000
allocated at kernel_clone+0x8c/0x3a8
Process swapper/0 (pid: 1, stack limit = 0x(ptrval))
Stack: (0xe0825de8 to 0xe0826000)
...
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
CPU1: stopping
CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D
5.18.0-rc2-00176-gfa6009949163 #11684
Hardware name: Generic DT based system
 unwind_backtrace from show_stack+0x10/0x14
 show_stack from dump_stack_lvl+0x40/0x4c
 dump_stack_lvl from do_handle_IPI+0x2c4/0x2fc
 do_handle_IPI from ipi_handler+0x18/0x20
 ipi_handler from handle_percpu_devid_irq+0x8c/0x1e0
 handle_percpu_devid_irq from generic_handle_domain_irq+0x40/0x84
 generic_handle_domain_irq from gic_handle_irq+0x88/0xa8
 gic_handle_irq from generic_handle_arch_irq+0x34/0x44
 generic_handle_arch_irq from call_with_stack+0x18/0x20
 call_with_stack from __irq_svc+0x98/0xb0
Exception stack(0xe0869f50 to 0xe0869f98)
9f40:                                     00009ddc 00000000 00000001
c031be20
9f60: c20e5d80 c1b48f20 c1904d10 c1904d6c c183e9e8 c1b47971 00000000
00000000
9f80: c1904e24 e0869fa0 c0307b74 c0307b78 60000113 ffffffff
 __irq_svc from arch_cpu_idle+0x38/0x3c
 arch_cpu_idle from default_idle_call+0x3c/0xb8
 default_idle_call from do_idle+0x1f8/0x298
 do_idle from cpu_startup_entry+0x18/0x1c
 cpu_startup_entry from 0x40301780
---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---


> ---
> mm/memory.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 3f396241a7db..b1af996b09ca 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4241,7 +4241,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>  					       vmf->address, &vmf->ptl);
>  	ret = 0;
>  	/* Re-check under ptl */
> -	if (likely(pte_none(*vmf->pte)))
> +	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
>  		do_set_pte(vmf, page, vmf->address);
>  	else
>  		ret = VM_FAULT_NOPAGE;
> @@ -4709,6 +4709,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  		 * concurrent faults and from rmap lookups.
>  		 */
>  		vmf->pte = NULL;
> +		/*
> +		 * Always initialize orig_pte. This matches with below
> +		 * code to have orig_pte to be the none pte if pte==NULL.
> +		 * This makes the rest code to be always safe to reference
> +		 * it, e.g. in finish_fault() we'll detect pte changes.
> +		 */
> +		pte_clear(vmf->vma->vm_mm, vmf->address, &vmf->orig_pte);
>  	} else {
>  		/*
>  		 * If a huge pmd materialized under us just retry later. Use

Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland

2022-04-14 09:08:14

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v8 01/23] mm: Introduce PTE_MARKER swap entry

On Wed, Apr 13, 2022 at 10:30:33AM +1000, Alistair Popple wrote:
> Peter Xu <[email protected]> writes:
>
> > On Tue, Apr 12, 2022 at 11:07:56AM +1000, Alistair Popple wrote:
> >> Hi Peter,
> >
> > Hi, Alistair,
> >
> >>
> >> I noticed this while reviewing the next patch in the series. I think you need to
> >> add CONFIG_PTE_MARKER to the below as well:
> >>
> >> #if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) || \
> >>     defined(CONFIG_DEVICE_PRIVATE)
> >> static inline int non_swap_entry(swp_entry_t entry)
> >> {
> >> 	return swp_type(entry) >= MAX_SWAPFILES;
> >> }
> >> #else
> >> static inline int non_swap_entry(swp_entry_t entry)
> >> {
> >> 	return 0;
> >> }
> >> #endif
> >>
> >> Otherwise marker entries will be treated as swap entries, which is wrong for
> >> example in swapin_walk_pmd_entry() as marker entries are no longer considered
> >> pte_none().
> >
> > Thanks for the comment, that makes sense.
> >
> > Instead of adding PTE_MARKER into this equation, I'm going backward and
> > wondering why we need to bother with non_swap_entry() at all if
> > MAX_SWAPFILES is already defined with proper knowledge of all these bits.
>
> I was going to suggest it was to help the compiler optimise the non-swap entry
> code away. But I just tested and it makes no difference in .text section size
> either way so I think your suggestion is good unless that isn't true for other
> architecture/compiler combinations (I only tried gcc-10.2.1 and x86_64).
>
> That's a possibility because the optimisation isn't obvious to me at least.
>
> non_swap_entry() is equivalent to:
>
> (entry.val >> SWP_TYPE_SHIFT) >= MAX_SWAPFILES;
> (entry.val >> (BITS_PER_XA_VALUE - MAX_SWAPFILES_SHIFT)) >= (1<<5);
> (entry.val >> (BITS_PER_LONG - 1 - 5)) >= (1<<5);
> (entry.val >> 58) >= (1<<5);
>
> Where entry.val is a long. So from that alone it's not obvious this could be
> optimised away, because nothing there rules out entry.val >= (1<<63), which
> would make the conditional true. But there's a lot of inlining going on in the
> creation of swap entries which I didn't trace, so something must end up
> implying entry.val < (1<<63).

I think my point was that we check non_swap_entry() with the pre-assumption
that it's a swap entry, which means we're already on the slow path, after
we've parsed the pte entry and found it's not present.

So I'm doubting how much the optimization (even if ultimately applicable)
could help in reality, not to mention that it'll only have an effect when
none of the configs are set, so it's a micro-optimization on a slow path
in a rare config setup.

For any sane modern host I'd expect CONFIG_MIGRATION to be set at the very
least.. and that invalidates any potential optimization we're discussing here.
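
Roughly, the simplification I have in mind is the below (a sketch only to
show the direction, not the actual patch I'll post):

static inline int non_swap_entry(swp_entry_t entry)
{
	/*
	 * MAX_SWAPFILES already subtracts whichever special entry types
	 * (migration, hwpoison, device, ...) are compiled in, so this
	 * comparison stays correct for every config combination.
	 */
	return swp_type(entry) >= MAX_SWAPFILES;
}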

Let me post a patch for this and move the discussion there. Thanks a lot
for looking into it.

--
Peter Xu

2022-04-14 15:45:41

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()

On Wed, Apr 13, 2022 at 04:03:28PM +0200, Marek Szyprowski wrote:
> Hi,

Hi, Marek,

>
> On 05.04.2022 03:48, Peter Xu wrote:
> > We used to check against none pte in finish_fault(), with the assumption
> > that the orig_pte is always none pte.
> >
> > This change prepares us to be able to call do_fault() on !none ptes. For
> > example, we should allow that to happen for pte marker so that we can restore
> > information out of the pte markers.
> >
> > Let's change the "pte_none" check into detecting changes since we fetched
> > orig_pte. One trivial thing to take care of here is, when pmd==NULL for
> > the pgtable we may not initialize orig_pte at all in handle_pte_fault().
> >
> > By default orig_pte will be all zeros however the problem is not all
> > architectures are using all-zeros for a none pte. pte_clear() will be the
> > right thing to use here so that we'll always have a valid orig_pte value
> > for the whole handle_pte_fault() call.
> >
> > Signed-off-by: Peter Xu <[email protected]>
>
> This patch landed in today's linux-next (next-20220413) as commit fa6009949163
> ("mm: check against orig_pte for finish_fault()"). Unfortunately it
> causes serious system instability on some ARM 32bit machines. I've
> observed it on all tested boards (various Samsung Exynos based,
> Raspberry Pi 3b and 4b, even QEMU's virt 32bit machine) when the kernel
> was compiled from multi_v7_defconfig.

Thanks for the report.

>
> Here is a crash log from QEMU's ARM 32bit virt machine:
>
> 8<--- cut here ---
> Unable to handle kernel paging request at virtual address e093263c
> [e093263c] *pgd=42083811, *pte=00000000, *ppte=00000000
> Internal error: Oops: 807 [#1] SMP ARM
> Modules linked in:
> CPU: 1 PID: 37 Comm: kworker/u4:0 Not tainted
> 5.18.0-rc2-00176-gfa6009949163 #11684
> Hardware name: Generic DT based system
> PC is at cpu_ca15_set_pte_ext+0x4c/0x58
> LR is at handle_mm_fault+0x46c/0xbb0

I have a feeling that for some reason pte_clear() isn't working right
there when it's applied to a kernel stack variable on arm32. I'm a total
newbie to arm32, so what I'm reading is this:

https://people.kernel.org/linusw/arm32-page-tables

Especially:

https://dflund.se/~triad/images/classic-mmu-page-table.jpg

That matches what I read in arm32's proc-v7-2level.S, in the comment
above cpu_v7_set_pte_ext:

* - ptep - pointer to level 2 translation table entry
* (hardware version is stored at +2048 bytes) <----------

So it seems to me that arm32 needs to store some metadata at offset 0x800
from any pte_t* pointer passed to pte_clear(), which means the pointer must
point into a real pgtable or it'll write to random places in the kernel, am
I correct?

Does that mean all pte_*() operations upon a kernel stack variable will be
wrong? I thought it could easily happen in the rest of mm too, but I haven't
checked much yet. Apparently the current code just happens to work fine on
arm32 and no such violation has occurred so far.

That does sound a bit tricky, IMHO, and I don't have an immediate solution
to make it less tricky.. though I do have a workaround in mind: simply
not calling pte_clear() on the stack variable.
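
To illustrate the hazard with a sketch (assuming the +2048 hardware-copy
layout described above; the snippet is illustrative, not from any patch):

	pte_t orig_pte;			/* lives on the kernel stack */

	/* BAD on arm32: set_pte_ext() also writes the hardware copy at
	 * (&orig_pte + 2048), i.e. a random stack address, which matches
	 * the Oops 807 at cpu_ca15_set_pte_ext above. */
	pte_clear(mm, addr, &orig_pte);

	/* OK everywhere: a plain assignment only touches the variable. */
	orig_pte = __pte(0);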

Would you try the attached patch as a replacement for the problematic one?
That is, we need to revert commit fa6009949163 and apply the new one.
Please let me know whether it solves the problem; so far I've only
compile-tested it, but I'll run some more tests to make sure the uffd-wp
scenarios keep working with the new version.

Thanks,

--
Peter Xu


Attachments:
(No filename) (3.54 kB)
0001-mm-Check-against-orig_pte-for-finish_fault-when-prop.patch (4.10 kB)

2022-04-15 00:03:14

by Marek Szyprowski

[permalink] [raw]
Subject: Re: [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()

Hi Peter,

On 13.04.2022 18:43, Peter Xu wrote:
> [...]
>
> Would you try the attached patch as a replacement for the problematic one?
> That is, we need to revert commit fa6009949163 and apply the new one.
> Please let me know whether it solves the problem; so far I've only
> compile-tested it, but I'll run some more tests to make sure the uffd-wp
> scenarios keep working with the new version.

I've reverted fa6009949163 and applied the attached patch on top of
linux-next (next-20220413). The ARM 32bit issues went away. :)

Feel free to add:

Reported-by: Marek Szyprowski <[email protected]>

Tested-by: Marek Szyprowski <[email protected]>

Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland

2022-04-16 00:46:56

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()

On Thu, Apr 14, 2022 at 09:51:01AM +0200, Marek Szyprowski wrote:
> Hi Peter,
>
> On 13.04.2022 18:43, Peter Xu wrote:
> > [...]
>
> I've reverted fa6009949163 and applied the attached patch on top of
> linux-next (next-20220413). The ARM 32bit issues went away. :)
>
> Feel free to add:
>
> Reported-by: Marek Szyprowski <[email protected]>
>
> Tested-by: Marek Szyprowski <[email protected]>

Thanks, Marek, for the fast feedback!

I've also verified it for the uffd-wp case: the whole series keeps
running as usual and nothing else shows up with the new patch swapped in.

Andrew, any suggestion on how we proceed with the replacement patch?
E.g. do you want me to post it separately to the list?

Please let me know your preference, thanks.

--
Peter Xu

2022-04-16 01:16:30

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()

Hi,

On Mon, Apr 04, 2022 at 09:48:36PM -0400, Peter Xu wrote:
> We used to check against none pte in finish_fault(), with the assumption
> that the orig_pte is always none pte.
>
> This change prepares us to be able to call do_fault() on !none ptes. For
> example, we should allow that to happen for pte marker so that we can restore
> information out of the pte markers.
>
> Let's change the "pte_none" check into detecting changes since we fetched
> orig_pte. One trivial thing to take care of here is, when pmd==NULL for
> the pgtable we may not initialize orig_pte at all in handle_pte_fault().
>
> By default orig_pte will be all zeros however the problem is not all
> architectures are using all-zeros for a none pte. pte_clear() will be the
> right thing to use here so that we'll always have a valid orig_pte value
> for the whole handle_pte_fault() call.
>
> Signed-off-by: Peter Xu <[email protected]>

This patch crashes pretty much all arm images in linux-next. Reverting it
fixes the problem. Sample crash log and bisect results attached.

Guenter

---
[ 11.232343] 8<--- cut here ---
[ 11.232564] Unable to handle kernel paging request at virtual address 88016664
[ 11.232735] [88016664] *pgd=41cfd811, *pte=00000000, *ppte=00000000
[ 11.233128] Internal error: Oops: 807 [#1] ARM
[ 11.233385] CPU: 0 PID: 1 Comm: swapper Not tainted 5.18.0-rc2-next-20220414 #1
[ 11.233564] Hardware name: Generic DT based system
[ 11.233695] PC is at cpu_arm926_set_pte_ext+0x2c/0x40
[ 11.233863] LR is at handle_mm_fault+0x4b0/0x11a8
[ 11.233963] pc : [<8010e60c>] lr : [<802944ec>] psr: 00000113
[ 11.234080] sp : 88015e20 ip : 88015e7c fp : 00000492
[ 11.234179] r10: 00000000 r9 : 00000000 r8 : 81167e50
[ 11.234280] r7 : 00000000 r6 : 00000081 r5 : 7efffff1 r4 : 83034690
[ 11.234402] r3 : 00000043 r2 : 00000000 r1 : 00000000 r0 : 88016664
[ 11.234549] Flags: nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[ 11.234691] Control: 00093177 Table: 40004000 DAC: 00000053
[ 11.234816] Register r0 information: non-paged memory
[ 11.235031] Register r1 information: NULL pointer
[ 11.235127] Register r2 information: NULL pointer
[ 11.235219] Register r3 information: non-paged memory
[ 11.235316] Register r4 information: slab vm_area_struct start 83034688 data offset 8 pointer offset 0 allocated at vm_area_alloc+0x20/0x5c
[ 11.235825] kmem_cache_alloc+0x1fc/0x21c
[ 11.235926] vm_area_alloc+0x20/0x5c
[ 11.236007] alloc_bprm+0xd0/0x298
[ 11.236082] kernel_execve+0x34/0x194
[ 11.236159] kernel_init+0x6c/0x138
[ 11.236235] ret_from_fork+0x14/0x3c
[ 11.236330] Register r5 information: non-paged memory
[ 11.236432] Register r6 information: non-paged memory
[ 11.236529] Register r7 information: NULL pointer
[ 11.236620] Register r8 information: non-slab/vmalloc memory
[ 11.236741] Register r9 information: NULL pointer
[ 11.236833] Register r10 information: NULL pointer
[ 11.236926] Register r11 information: non-paged memory
[ 11.237023] Register r12 information: 2-page vmalloc region starting at 0x88014000 allocated at kernel_clone+0xa0/0x440
[ 11.237253] Process swapper (pid: 1, stack limit = 0x88014000)
[ 11.237388] Stack: (0x88015e20 to 0x88016000)
[ 11.237518] 5e20: ffffffff fffffffe 81d29be0 00000000 a0000193 00000000 81d2a1e8 00007f7e
[ 11.237670] 5e40: 816580a8 83034690 00000cc0 0007efff 7efff000 7efffff1 00000081 83199fb8
[ 11.237814] 5e60: 83199fb8 00000000 00000000 00000000 00000000 00000000 00000000 0a363e34
[ 11.237957] 5e80: 88015ea4 83034690 7efffff1 00002017 00000081 81f4dd00 00001fb8 00000000
[ 11.238100] 5ea0: 00000492 8028d160 00000000 81d29be0 00000001 00002017 80deedcc 81d29be0
[ 11.238241] 5ec0: 00000000 81f4dd00 7efffff1 88015f38 81f4dd60 00002017 00000000 8028d64c
[ 11.238383] 5ee0: 88015f38 00000000 00000000 7efffff1 81f4dd00 00000000 00000001 00000000
[ 11.238524] 5f00: 00000011 82d80800 00000001 7efffff1 81f4dd00 00000011 7efffff1 0000000b
[ 11.238666] 5f20: 82d80800 802ca218 88015f38 00000000 00000000 000001d3 80e0b43c 0a363e34
[ 11.238808] 5f40: 00000ffc 82d80800 81d73140 81d29be0 0000000b 802cb390 81d7315b 802ca0bc
[ 11.238950] 5f60: 8110c940 0000000c 82d80800 81d73140 8110c8b0 8110c93c 00000000 00000000
[ 11.239091] 5f80: 00000000 802cbf44 81107820 8110c8b0 00000000 00000000 00000000 80b05400
[ 11.239234] 5fa0: 00000000 80b05394 00000000 801000f8 00000000 00000000 00000000 00000000
[ 11.239376] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 11.239518] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
[ 11.239770] Code: e31300c0 03822e55 e3130003 13a02000 (e5802000)
[ 11.240097] ---[ end trace 0000000000000000 ]---
[ 11.240307] Kernel panic - not syncing: Fatal exception

--
# bad: [40354149f4d738dc3492d9998e45b3f02950369a] Add linux-next specific files for 20220414
# good: [ce522ba9ef7e2d9fb22a39eb3371c0c64e2a433e] Linux 5.18-rc2
git bisect start 'HEAD' 'v5.18-rc2'
# good: [0f52e407eccb0f7ed62fdd8907b0042f4195159e] Merge branch 'drm-next' of git://git.freedesktop.org/git/drm/drm.git
git bisect good 0f52e407eccb0f7ed62fdd8907b0042f4195159e
# good: [22b1b3a579c91a6afa945711eac72ab740b8f8e4] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
git bisect good 22b1b3a579c91a6afa945711eac72ab740b8f8e4
# good: [cbb5c08b3182cb498f67fa547392191a1d5622dd] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine.git
git bisect good cbb5c08b3182cb498f67fa547392191a1d5622dd
# good: [2acd94b759428825f0e8835fa24ad22c7b5c0e2c] Merge branch 'for-next/kspp' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git
git bisect good 2acd94b759428825f0e8835fa24ad22c7b5c0e2c
# bad: [d2d293faec99124d95590e88030ae3c8382fac7f] mm/shmem: persist uffd-wp bit across zapping for file-backed
git bisect bad d2d293faec99124d95590e88030ae3c8382fac7f
# good: [8cbcc910aec560e78e879cf82ed17e7e72d8a7d4] doc: update documentation for swap_activate and swap_rw
git bisect good 8cbcc910aec560e78e879cf82ed17e7e72d8a7d4
# good: [8c55a1ed1f9b95520b0307ba0ac6ff7f1aadfe9d] mm/page_alloc: simplify update of pgdat in wake_all_kswapds
git bisect good 8c55a1ed1f9b95520b0307ba0ac6ff7f1aadfe9d
# good: [3e68e467590511e2cf7f47194464a5512583f641] mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*
git bisect good 3e68e467590511e2cf7f47194464a5512583f641
# good: [3fb21f4e38824f4d8a183ffcccc03b357ad836d4] mm: mmap: register suitable readonly file vmas for khugepaged
git bisect good 3fb21f4e38824f4d8a183ffcccc03b357ad836d4
# bad: [fa600994916318341cf53e18769be547aa5975d2] mm: check against orig_pte for finish_fault()
git bisect bad fa600994916318341cf53e18769be547aa5975d2
# good: [1112411b72b5e9774897538260028a677d616779] fixup! mm: Introduce PTE_MARKER swap entry
git bisect good 1112411b72b5e9774897538260028a677d616779
# good: [1ae034d98f81a6cf8896b37c3dee9e099daeb3e7] mm: teach core mm about pte markers
git bisect good 1ae034d98f81a6cf8896b37c3dee9e099daeb3e7
# first bad commit: [fa600994916318341cf53e18769be547aa5975d2] mm: check against orig_pte for finish_fault()

2022-04-16 01:56:18

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()

On Fri, Apr 15, 2022 at 07:21:12AM -0700, Guenter Roeck wrote:
> Hi,

Hi, Guenter,

>
> On Mon, Apr 04, 2022 at 09:48:36PM -0400, Peter Xu wrote:
> > [...]
>
> This patch crashes pretty much all arm images in linux-next. Reverting it
> fixes the problem. Sample crash log and bisect results attached.

Sorry for the issue, and thanks for reporting and bisecting.

It was already reported by Marek, and this problematic patch will be
replaced by this one (already updated in -mm; it may land in -next a bit
later, I think):

https://lore.kernel.org/all/Ylb9rXJyPm8%[email protected]/

Thanks,

>
> Guenter
>
> [...]

--
Peter Xu

2022-04-16 02:30:16

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()

On Thu, Apr 14, 2022 at 01:57:40PM -0700, Andrew Morton wrote:
> On Thu, 14 Apr 2022 12:30:06 -0400 Peter Xu <[email protected]> wrote:
>
> > > Reported-by: Marek Szyprowski <[email protected]>
> > >
> > > Tested-by: Marek Szyprowski <[email protected]>
> >
> > Thanks, Marek, for the fast feedback!
>
> Certainly.
>
> > I've also verified it for the uffd-wp case: the whole series keeps
> > running as usual and nothing else shows up with the new patch swapped in.
> >
> > Andrew, any suggestion on how we proceed with the replacement patch?
> > E.g. do you want me to post it separately to the list?
>
> I turned it into an incremental diff and queued it against [03/23]:
>
> [...]

I verified the diff; it matches what I got. Thanks, Andrew.

--
Peter Xu

2022-04-16 02:35:07

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v8 03/23] mm: Check against orig_pte for finish_fault()

On Thu, 14 Apr 2022 12:30:06 -0400 Peter Xu <[email protected]> wrote:

> > Reported-by: Marek Szyprowski <[email protected]>
> >
> > Tested-by: Marek Szyprowski <[email protected]>
>
> Thanks, Marek, for the fast feedback!

Certainly.

> I've also verified it for the uffd-wp case: the whole series keeps
> running as usual and nothing else shows up with the new patch swapped in.
>
> Andrew, any suggestion on how we proceed with the replacement patch?
> E.g. do you want me to post it separately to the list?

I turned it into an incremental diff and queued it against [03/23]:

--- a/include/linux/mm_types.h~mm-check-against-orig_pte-for-finish_fault-fix
+++ a/include/linux/mm_types.h
@@ -814,6 +814,8 @@ typedef struct {
  * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to unshare (and mark
  *			exclusive) a possibly shared anonymous page that is
  *			mapped R/O.
+ * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
+ *				We should only access orig_pte if this flag set.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -850,6 +852,7 @@ enum fault_flag {
 	FAULT_FLAG_INSTRUCTION = 1 << 8,
 	FAULT_FLAG_INTERRUPTIBLE = 1 << 9,
 	FAULT_FLAG_UNSHARE = 1 << 10,
+	FAULT_FLAG_ORIG_PTE_VALID = 1 << 11,
 };
 
 #endif /* _LINUX_MM_TYPES_H */
--- a/mm/memory.c~mm-check-against-orig_pte-for-finish_fault-fix
+++ a/mm/memory.c
@@ -4194,6 +4194,15 @@ void do_set_pte(struct vm_fault *vmf, st
 	set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
 }
 
+static bool vmf_pte_changed(struct vm_fault *vmf)
+{
+	if (vmf->flags & FAULT_FLAG_ORIG_PTE_VALID) {
+		return !pte_same(*vmf->pte, vmf->orig_pte);
+	}
+
+	return !pte_none(*vmf->pte);
+}
+
 /**
  * finish_fault - finish page fault once we have prepared the page to fault
  *
@@ -4252,7 +4261,7 @@ vm_fault_t finish_fault(struct vm_fault
 					       vmf->address, &vmf->ptl);
 	ret = 0;
 	/* Re-check under ptl */
-	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
+	if (likely(!vmf_pte_changed(vmf)))
 		do_set_pte(vmf, page, vmf->address);
 	else
 		ret = VM_FAULT_NOPAGE;
@@ -4720,13 +4729,7 @@ static vm_fault_t handle_pte_fault(struc
 		 * concurrent faults and from rmap lookups.
 		 */
 		vmf->pte = NULL;
-		/*
-		 * Always initialize orig_pte. This matches with below
-		 * code to have orig_pte to be the none pte if pte==NULL.
-		 * This makes the rest code to be always safe to reference
-		 * it, e.g. in finish_fault() we'll detect pte changes.
-		 */
-		pte_clear(vmf->vma->vm_mm, vmf->address, &vmf->orig_pte);
+		vmf->flags &= ~FAULT_FLAG_ORIG_PTE_VALID;
 	} else {
 		/*
 		 * If a huge pmd materialized under us just retry later. Use
@@ -4750,6 +4753,7 @@ static vm_fault_t handle_pte_fault(struc
 		 */
 		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
 		vmf->orig_pte = *vmf->pte;
+		vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
 
 		/*
 		 * some architectures can have larger ptes than wordsize,
_

2022-04-19 18:11:39

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v8 01/23] mm: Introduce PTE_MARKER swap entry

Peter Xu <[email protected]> writes:

> This patch introduces a new swap entry type called PTE_MARKER. It can be
> installed for any pte that maps a file-backed memory when the pte is
> temporarily zapped, so as to maintain per-pte information.

Hi Peter,

Is there something I have missed that means PTE markers can only be used with
file-backed memory? Obviously that's all you care about for this patch series,
but if we needed to mark some anonymous PTE for special processing, is there
anything that would prevent us from using a PTE marker? Specifically I was thinking
about it in relation to this series:
<https://lore.kernel.org/linux-mm/[email protected]/>

> The information kept in the pte is called a "marker". Here we define the
> marker as "unsigned long" just to match pgoff_t; however, it will only work
> if it still fits in swp_offset(), which is e.g. currently 58 bits on x86_64.
>
> A new config CONFIG_PTE_MARKER is introduced too; it's by default off. A bunch
> of helpers are defined altogether to service the rest of the pte marker code.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/asm-generic/hugetlb.h | 9 ++++
> include/linux/swap.h | 15 ++++++-
> include/linux/swapops.h | 78 +++++++++++++++++++++++++++++++++++
> mm/Kconfig | 6 +++
> 4 files changed, 107 insertions(+), 1 deletion(-)
>
> diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
> index 8e1e6244a89d..f39cad20ffc6 100644
> --- a/include/asm-generic/hugetlb.h
> +++ b/include/asm-generic/hugetlb.h
> @@ -2,6 +2,9 @@
>  #ifndef _ASM_GENERIC_HUGETLB_H
>  #define _ASM_GENERIC_HUGETLB_H
>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
> +
>  static inline pte_t mk_huge_pte(struct page *page, pgprot_t pgprot)
>  {
>  	return mk_pte(page, pgprot);
> @@ -80,6 +83,12 @@ static inline int huge_pte_none(pte_t pte)
>  }
>  #endif
>
> +/* Please refer to comments above pte_none_mostly() for the usage */
> +static inline int huge_pte_none_mostly(pte_t pte)
> +{
> +	return huge_pte_none(pte) || is_pte_marker(pte);
> +}
> +
>  #ifndef __HAVE_ARCH_HUGE_PTE_WRPROTECT
>  static inline pte_t huge_pte_wrprotect(pte_t pte)
>  {
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 7daae5a4b3e1..5553189d0215 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -55,6 +55,19 @@ static inline int current_is_kswapd(void)
>   * actions on faults.
>   */
>
> +/*
> + * PTE markers are used to persist information onto PTEs that are mapped with
> + * file-backed memories. As its name "PTE" hints, it should only be applied to
> + * the leaves of pgtables.
> + */
> +#ifdef CONFIG_PTE_MARKER
> +#define SWP_PTE_MARKER_NUM 1
> +#define SWP_PTE_MARKER     (MAX_SWAPFILES + SWP_HWPOISON_NUM + \
> +			    SWP_MIGRATION_NUM + SWP_DEVICE_NUM)
> +#else
> +#define SWP_PTE_MARKER_NUM 0
> +#endif
> +
>  /*
>   * Unaddressable device memory support. See include/linux/hmm.h and
>   * Documentation/vm/hmm.rst. Short description is we need struct pages for
> @@ -107,7 +120,7 @@ static inline int current_is_kswapd(void)
>
>  #define MAX_SWAPFILES \
>  	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
> -	 SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
> +	 SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - SWP_PTE_MARKER_NUM)
>
>  /*
>   * Magic header for a swap area. The first part of the union is
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 32d517a28969..7a00627845f0 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -274,6 +274,84 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
>
>  #endif
>
> +typedef unsigned long pte_marker;
> +
> +#define PTE_MARKER_MASK (0)
> +
> +#ifdef CONFIG_PTE_MARKER
> +
> +static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
> +{
> +	return swp_entry(SWP_PTE_MARKER, marker);
> +}
> +
> +static inline bool is_pte_marker_entry(swp_entry_t entry)
> +{
> +	return swp_type(entry) == SWP_PTE_MARKER;
> +}
> +
> +static inline pte_marker pte_marker_get(swp_entry_t entry)
> +{
> +	return swp_offset(entry) & PTE_MARKER_MASK;
> +}
> +
> +static inline bool is_pte_marker(pte_t pte)
> +{
> +	return is_swap_pte(pte) && is_pte_marker_entry(pte_to_swp_entry(pte));
> +}
> +
> +#else /* CONFIG_PTE_MARKER */
> +
> +static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
> +{
> +	/* This should never be called if !CONFIG_PTE_MARKER */
> +	WARN_ON_ONCE(1);
> +	return swp_entry(0, 0);
> +}
> +
> +static inline bool is_pte_marker_entry(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
> +static inline pte_marker pte_marker_get(swp_entry_t entry)
> +{
> +	return 0;
> +}
> +
> +static inline bool is_pte_marker(pte_t pte)
> +{
> +	return false;
> +}
> +
> +#endif /* CONFIG_PTE_MARKER */
> +
> +static inline pte_t make_pte_marker(pte_marker marker)
> +{
> +	return swp_entry_to_pte(make_pte_marker_entry(marker));
> +}
> +
> +/*
> + * This is a special version to check pte_none() just to cover the case when
> + * the pte is a pte marker. It existed because in many cases the pte marker
> + * should be seen as a none pte; it's just that we have stored some information
> + * onto the none pte so it becomes not-none any more.
> + *
> + * It should be used when the pte is file-backed, ram-based and backing
> + * userspace pages, like shmem. It is not needed upon pgtables that do not
> + * support pte markers at all. For example, it's not needed on anonymous
> + * memory, kernel-only memory (including when the system is during-boot),
> + * non-ram based generic file-system. It's fine to be used even there, but the
> + * extra pte marker check will be pure overhead.
> + *
> + * For systems configured with !CONFIG_PTE_MARKER this will be automatically
> + * optimized to pte_none().
> + */
> +static inline int pte_none_mostly(pte_t pte)
> +{
> +	return pte_none(pte) || is_pte_marker(pte);
> +}
> +
>  static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
>  {
>  	struct page *p = pfn_to_page(swp_offset(entry));
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 034d87953600..a1688b9314b2 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -909,6 +909,12 @@ config ANON_VMA_NAME
>  	  area from being merged with adjacent virtual memory areas due to the
>  	  difference in their name.
>
> +config PTE_MARKER
> +	bool "Marker PTEs support"
> +
> +	help
> +	  Allows to create marker PTEs for file-backed memory.
> +
>  source "mm/damon/Kconfig"
>
>  endmenu
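
To make the intended usage concrete, a minimal sketch of a hypothetical
caller (example_file_fault() and do_install_page() are made up for
illustration; only the pte marker helpers come from the patch above):

static vm_fault_t example_file_fault(struct vm_fault *vmf, pte_t pte)
{
	pte_marker marker = 0;

	/* First read out whatever information the marker carries... */
	if (is_pte_marker(pte))
		marker = pte_marker_get(pte_to_swp_entry(pte));

	/* ...then treat the marker pte just like an empty pte. */
	if (pte_none_mostly(pte))
		return do_install_page(vmf, marker);	/* hypothetical */

	return VM_FAULT_NOPAGE;
}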


Attachments:
(No filename) (6.55 kB)

2022-04-19 22:59:43

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v8 22/23] mm: Enable PTE markers by default

Hi Peter,

On Mon, Apr 04, 2022 at 09:49:29PM -0400, Peter Xu wrote:
> Enable PTE markers by default. On x86_64 it means it'll auto-enable
> PTE_MARKER_UFFD_WP as well.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/Kconfig | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 6e7c2d59fa96..3eca34c864c5 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -911,12 +911,14 @@ config ANON_VMA_NAME
>
>  config PTE_MARKER
>  	bool "Marker PTEs support"
> +	default y
>
>  	help
>  	  Allows to create marker PTEs for file-backed memory.

make oldconfig just prompted me on these:

---
Marker PTEs support (PTE_MARKER) [Y/n/?] (NEW) ?

CONFIG_PTE_MARKER:

Allows to create marker PTEs for file-backed memory.

Symbol: PTE_MARKER [=y]
Type : bool
Defined at mm/Kconfig:1046
Prompt: Marker PTEs support
Location:
Main menu
-> Memory Management options
---

>  config PTE_MARKER_UFFD_WP
>  	bool "Marker PTEs support for userfaultfd write protection"
> +	default y
>  	depends on PTE_MARKER && HAVE_ARCH_USERFAULTFD_WP

It's not possible to answer them without looking at the code.

But after looking at the code, I'm still not sure why it asks
me. Isn't this infrastructure code?

Wouldn't it make more sense to remove the prompt string and have
userfaultfd simply select those?

If this is too experimental to enable by default, a more reasonable
question for the user would be a "userfaultfd file support" option or
something, and have *that* select the marker code.

Thanks!

2022-04-20 11:26:13

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v8 22/23] mm: Enable PTE markers by default

On Tue, Apr 19, 2022 at 05:24:28PM -0400, Johannes Weiner wrote:
> On Tue, Apr 19, 2022 at 04:28:16PM -0400, Peter Xu wrote:
> > On Tue, Apr 19, 2022 at 04:14:11PM -0400, Johannes Weiner wrote:
> > > On Tue, Apr 19, 2022 at 03:59:21PM -0400, Peter Xu wrote:
> > > > @@ -910,16 +910,16 @@ config ANON_VMA_NAME
> > > Btw, this doesn't do much without userfaultfd being enabled in
> > > general, right?
> >
> > So far yes, but I'm thinking there can be other potential users of
> > PTE_MARKERS in the mm world. The closest discussion is on the swap read
> > failures and this patch proposed by Miaohe:
> >
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > So I hope we can still keep them around here under mm/ if possible; my
> > gut feeling is that's really where they should live..
>
> Agreed, mm/ seems a good fit for PTE_MARKER.
>
> If it's invisible and gets selected as needed, it's less of a concern,
> IMO. I'm somewhat worried about when and how the user-visible options
> show up right now, though...
>
> > > Would it make sense to have it next to 'config USERFAULTFD' as a
> > > sub-option?
> >
> > Yes another good question. :)
> >
> > IIUC CONFIG_USERFAULTFD resides in init/Kconfig because it introduces a new
> > syscall. The same goes for the rest of the uffd bits added since then, namely:
> >
> > - USERFAULTFD_WP
> > - USERFAULTFD_MINOR
> >
> > What I am thinking now is the other way round from your suggestion: whether
> > we should move most of them out, at least the _WP and _MINOR configs, into
> > mm/? Because IMHO they are really pure mm ideas and irrelevant to
> > syscalls and init.
>
> I'm thinking the MM submenu would probably be a better fit for all
> user-visible userfaultfd options, including the syscall. Like you say,
> it's an MM concept.
>
> But if moving the syscall knob out from init isn't popular, IMO it
> would be better to add the new WP option to init as well. This ensures
> that when somebody selects userfaultfd, they also see the relevant
> suboptions and don't have to chase them down across multiple submenus.
>
> Conversely, they should also have the necessary depend clauses so that
> suboptions aren't visible without the main feature. E.g. it asked me
> > for userfaultfd options even though I have CONFIG_USERFAULTFD=n.

Hmm, this is a bit weird... since we do have that dependency chain for
PTE_MARKER_UFFD_WP -> HAVE_ARCH_USERFAULTFD_WP -> USERFAULTFD:

in arch/x86/Kconfig:
config X86
	...
	select HAVE_ARCH_USERFAULTFD_WP if X86_64 && USERFAULTFD

in mm/Kconfig (with/without the "mm/uffd: Hide PTE_MARKER" patch applied):

config PTE_MARKER_UFFD_WP
	...
	depends on HAVE_ARCH_USERFAULTFD_WP

So logically if !USERFAULTFD we shouldn't see PTE_MARKER_UFFD_WP at all?

That's also what I got when I tried it out, for either !USERFAULTFD on x86
or any non-x86 platform (because there we always have
!HAVE_ARCH_USERFAULTFD_WP, regardless of USERFAULTFD). Though I could have
missed something..

>
> What do you think?

I don't have a strong preference here; it's okay with me if the preference
is to only put user-visible configs into mm/Kconfig. It's just that I see
we already have tons of user-invisible configs in mm/Kconfig, to list some:

config ARCH_HAS_HUGEPD
config MAPPING_DIRTY_HELPERS
config KMAP_LOCAL
config KMAP_LOCAL_NON_LINEAR_PTE_ARRAY

But I'm not sure whether there's a rule of thumb about this somewhere.

In the meantime, I also looked at whether syscall configs are always and
only put under init/, and funnily enough I got:

$ find . -name Kconfig | xargs grep --color -E "\".*syscall.*\""
./init/Kconfig: bool "Enable process_vm_readv/writev syscalls"
./init/Kconfig: bool "uselib syscall"
./init/Kconfig: bool "sgetmask/ssetmask syscalls support" if EXPERT
./init/Kconfig: bool "Sysfs syscall support" if EXPERT
./init/Kconfig: bool "open by fhandle syscalls" if EXPERT
./init/Kconfig: bool "Enable madvise/fadvise syscalls" if EXPERT
./arch/xtensa/Kconfig: bool "Enable fast atomic syscalls"
./arch/xtensa/Kconfig: bool "Enable spill registers syscall"
./arch/powerpc/Kconfig: bool "Support setting protections for 4k subpages (subpage_prot syscall)"
./arch/powerpc/Kconfig: bool "Enable filtering of RTAS syscalls"
./arch/Kconfig: bool "Support for randomizing kernel stack offset on syscall entry" if EXPERT
./arch/s390/Kconfig: bool "Verify kernel signature during kexec_file_load() syscall"
./arch/sh/mm/Kconfig: bool "Support vsyscall page"
./arch/x86/Kconfig: bool "Enable vsyscall emulation" if EXPERT
./arch/x86/Kconfig: bool "Verify kernel signature during kexec_file_load() syscall"
./arch/x86/Kconfig: bool "Require a valid signature in kexec_file_load() syscall"
./arch/x86/Kconfig: prompt "vsyscall table for legacy applications"
./arch/arm64/Kconfig: bool "Verify kernel signature during kexec_file_load() syscall"
./arch/arm64/Kconfig: bool "Enable the tagged user addresses syscall ABI"
./kernel/trace/Kconfig: bool "Trace syscalls"
./kernel/trace/Kconfig: bool "Run selftest on syscall events"

So, putting the arch-specific lines aside, ftrace does have FTRACE_SYSCALLS
living in the kernel/trace/ dir.. I'm not sure whether we could move
USERFAULTFD and all the rest into mm/ as well? Or perhaps that's just a
bad example? :)

Thanks,

--
Peter Xu

2022-04-20 15:11:38

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v8 22/23] mm: Enable PTE markers by default

On Tue, Apr 19, 2022 at 04:14:11PM -0400, Johannes Weiner wrote:
> Hi Peter,
>
> On Tue, Apr 19, 2022 at 03:59:21PM -0400, Peter Xu wrote:
> > @@ -910,16 +910,16 @@ config ANON_VMA_NAME
> >  	  difference in their name.
> >
> >  config PTE_MARKER
> > -	bool "Marker PTEs support"
> > -	default y
> > +	bool
> >
> >  	help
> >  	  Allows to create marker PTEs for file-backed memory.
> >
> >  config PTE_MARKER_UFFD_WP
> > -	bool "Marker PTEs support for userfaultfd write protection"
> > +	bool "Userfaultfd write protection support for shmem/hugetlbfs"
> >  	default y
> > -	depends on PTE_MARKER && HAVE_ARCH_USERFAULTFD_WP
> > +	depends on HAVE_ARCH_USERFAULTFD_WP
> > +	select PTE_MARKER
>
> This is much easier to understand, thanks!

Cool! Sent as a formal patch just now.

>
> Btw, this doesn't do much without userfaultfd being enabled in
> general, right?

So far yes, but I'm thinking there can be other potential users of
PTE_MARKERS in the mm world. The closest discussion is on the swap read
failures and this patch proposed by Miaohe:

https://lore.kernel.org/lkml/[email protected]/

So I hope we can still keep them around here under mm/ if possible; my
gut feeling is that's really where they should live..

> Would it make sense to have it next to 'config USERFAULTFD' as a
> sub-option?

Yes another good question. :)

IIUC CONFIG_USERFAULTFD resides in init/Kconfig because it introduces a new
syscall. The same goes for the rest of the uffd bits added since then, namely:

- USERFAULTFD_WP
- USERFAULTFD_MINOR

What I am thinking now is the other way round from your suggestion: whether
we should move most of them out, at least the _WP and _MINOR configs, into
mm/? Because IMHO they are really pure mm ideas and irrelevant to
syscalls and init.

Any thoughts?

--
Peter Xu

2022-04-21 01:13:39

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v8 22/23] mm: Enable PTE markers by default

Hi Peter,

On Tue, Apr 19, 2022 at 03:59:21PM -0400, Peter Xu wrote:
> @@ -910,16 +910,16 @@ config ANON_VMA_NAME
>  	  difference in their name.
>
>  config PTE_MARKER
> -	bool "Marker PTEs support"
> -	default y
> +	bool
>
>  	help
>  	  Allows to create marker PTEs for file-backed memory.
>
>  config PTE_MARKER_UFFD_WP
> -	bool "Marker PTEs support for userfaultfd write protection"
> +	bool "Userfaultfd write protection support for shmem/hugetlbfs"
>  	default y
> -	depends on PTE_MARKER && HAVE_ARCH_USERFAULTFD_WP
> +	depends on HAVE_ARCH_USERFAULTFD_WP
> +	select PTE_MARKER

This is much easier to understand, thanks!

Btw, this doesn't do much without userfaultfd being enabled in
general, right? Would it make sense to have it next to 'config
USERFAULTFD' as a sub-option?

2022-04-21 19:44:16

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v8 01/23] mm: Introduce PTE_MARKER swap entry

On Tue, Apr 19, 2022 at 06:25:31PM +1000, Alistair Popple wrote:
> Hi Peter,

Hi, Alistair,

>
> Is there something I have missed that means PTE markers can only be used with
> file-backed memory? Obviously that's all you care about for this patch series,
> but if we needed to mark some anonymous PTE for special processing, is there
> anything that would prevent us from using a PTE marker? Specifically I was thinking
> about it in relation to this series:
> <https://lore.kernel.org/linux-mm/[email protected]/>

It's not necessarily restricted to file-backed memory. All the file-backed
checks in this series are just for safety and nothing else.

I think that swap-read-error case is a very good example where a pte marker
sounds like a great fit, but let's see whether people would still prefer to
stick with hwpoison, which makes some sense too. Let's keep the discussion
in that thread.

Thanks for the pointer!

--
Peter Xu

2022-04-22 12:10:56

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v8 22/23] mm: Enable PTE markers by default

On Wed, Apr 20, 2022 at 09:46:07AM -0400, Johannes Weiner wrote:
> My point was simply that for the user it might be easiest and most
> intuitive if userfaultfd and its related suboptions are 1) grouped
> together and 2) in the MM submenu.

Very reasonable request.

> Yeah it looks like there is a healthy mix ;) To add to the list:
>
> - mm/Kconfig has CONFIG_SWAP for the swapon/swapoff syscalls.
> - fs/Kconfig has CONFIG_FILE_LOCKING, which adds the flock() syscall.
> - Interestingly, fs/Kconfig has CONFIG_MEMFD_CREATE for memfd_create()
> which is implemented in mm/memfd.c.
>
> It seems reasonable to me to move the userfaultfd stuff to mm as well,
> especially when it's becoming more than just a single binary question
> on whether you want a syscall or not, and has MM-specific suboptions.

Thanks for the extra information!

Obviously, as you said, it's growing a little bit. I'll give it a shot
later today.

--
Peter Xu

2022-04-22 20:26:08

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v8 22/23] mm: Enable PTE markers by default

On Tue, Apr 19, 2022 at 06:01:14PM -0400, Peter Xu wrote:
> On Tue, Apr 19, 2022 at 05:24:28PM -0400, Johannes Weiner wrote:
> > On Tue, Apr 19, 2022 at 04:28:16PM -0400, Peter Xu wrote:
> > > On Tue, Apr 19, 2022 at 04:14:11PM -0400, Johannes Weiner wrote:
> > > > On Tue, Apr 19, 2022 at 03:59:21PM -0400, Peter Xu wrote:
> > > > > @@ -910,16 +910,16 @@ config ANON_VMA_NAME
> > > > Btw, this doesn't do much without userfaultfd being enabled in
> > > > general, right?
> > >
> > > So far yes, but I'm thinking there can be other potential users of
> > > PTE_MARKER in the mm world. The closest discussion is on the swap read
> > > failures and this patch proposed by Miaohe:
> > >
> > > https://lore.kernel.org/lkml/[email protected]/
> > >
> > > So I hope we can still keep them around here under mm/ if possible, and
> > > my gut feeling is they really should be..
> >
> > Agreed, mm/ seems a good fit for PTE_MARKER.
> >
> > If it's invisible and gets selected as needed, it's less of a concern,
> > IMO. I'm somewhat worried about when and how the user-visible options
> > show up right now, though...
> >
> > > > Would it make sense to have it next to 'config USERFAULTFD' as a
> > > > sub-option?
> > >
> > > Yes another good question. :)
> > >
> > > IIUC CONFIG_USERFAULTFD resides in init/Kconfig because it introduces a new
> > > syscall. The same goes for the rest of the uffd bits added since then, namely:
> > >
> > > - USERFAULTFD_WP
> > > - USERFAULTFD_MINOR
> > >
> > > What I am thinking now is the reverse of your suggestion: whether we should
> > > move most of them out, at least the _WP and _MINOR configs, into mm/?
> > > Because IMHO they are really pure mm ideas and have nothing to do with
> > > syscalls or init.
> >
> > I'm thinking the MM submenu would probably be a better fit for all
> > user-visible userfaultfd options, including the syscall. Like you say,
> > it's an MM concept.
> >
> > But if moving the syscall knob out from init isn't popular, IMO it
> > would be better to add the new WP option to init as well. This ensures
> > that when somebody selects userfaultfd, they also see the relevant
> > suboptions and don't have to chase them down across multiple submenus.
> >
> > Conversely, they should also have the necessary 'depends on' clauses so
> > that suboptions aren't visible without the main feature. E.g. it asked me
> > for userfaultfd options even though I have CONFIG_USERFAULTFD=n.
>
> Hmm, this is a bit weird... since we do have that dependency chain for
> PTE_MARKER_UFFD_WP -> HAVE_ARCH_USERFAULTFD_WP -> USERFAULTFD:
>
> in arch/x86/Kconfig:
> config X86
> ...
> select HAVE_ARCH_USERFAULTFD_WP if X86_64 && USERFAULTFD
>
> in mm/Kconfig (with/without the "mm/uffd: Hide PTE_MARKER" patch applied):
> config PTE_MARKER_UFFD_WP
> ...
> depends on HAVE_ARCH_USERFAULTFD_WP
>
> So logically if !USERFAULTFD we shouldn't see PTE_MARKER_UFFD_WP at all?
>
> That's also what I got when I tried it out, for either !USERFAULTFD on x86
> or any non-x86 platform (because there !HAVE_ARCH_USERFAULTFD_WP holds
> regardless of USERFAULTFD). Though I could have missed something..

Sorry, it asked me about PTE_MARKER, and that conclusion got stuck in
my head. Indeed, once that symbol is invisible we should be good.

> > What do you think?
>
> I don't have a strong preference here; it's okay with me if the preference
> is to only put user-visible configs into mm/Kconfig. It's just that I see
> we already have tons of user-invisible configs in mm/Kconfig, to list
> some:
>
> config ARCH_HAS_HUGEPD
> config MAPPING_DIRTY_HELPERS
> config KMAP_LOCAL
> config KMAP_LOCAL_NON_LINEAR_PTE_ARRAY
>
> But I'm not sure whether it's a rule of thumb somewhere else.

I wasn't objecting to invisible symbols in mm/.

My point was simply that for the user it might be easiest and most
intuitive if userfaultfd and its related suboptions are 1) grouped
together and 2) in the MM submenu.

> Meanwhile, I also looked at whether syscall configs are always and only
> put under init/, and funnily enough I got:
>
> $ find . -name Kconfig | xargs grep --color -E "\".*syscall.*\""
> ./init/Kconfig: bool "Enable process_vm_readv/writev syscalls"
> ./init/Kconfig: bool "uselib syscall"
> ./init/Kconfig: bool "sgetmask/ssetmask syscalls support" if EXPERT
> ./init/Kconfig: bool "Sysfs syscall support" if EXPERT
> ./init/Kconfig: bool "open by fhandle syscalls" if EXPERT
> ./init/Kconfig: bool "Enable madvise/fadvise syscalls" if EXPERT
> ./arch/xtensa/Kconfig: bool "Enable fast atomic syscalls"
> ./arch/xtensa/Kconfig: bool "Enable spill registers syscall"
> ./arch/powerpc/Kconfig: bool "Support setting protections for 4k subpages (subpage_prot syscall)"
> ./arch/powerpc/Kconfig: bool "Enable filtering of RTAS syscalls"
> ./arch/Kconfig: bool "Support for randomizing kernel stack offset on syscall entry" if EXPERT
> ./arch/s390/Kconfig: bool "Verify kernel signature during kexec_file_load() syscall"
> ./arch/sh/mm/Kconfig: bool "Support vsyscall page"
> ./arch/x86/Kconfig: bool "Enable vsyscall emulation" if EXPERT
> ./arch/x86/Kconfig: bool "Verify kernel signature during kexec_file_load() syscall"
> ./arch/x86/Kconfig: bool "Require a valid signature in kexec_file_load() syscall"
> ./arch/x86/Kconfig: prompt "vsyscall table for legacy applications"
> ./arch/arm64/Kconfig: bool "Verify kernel signature during kexec_file_load() syscall"
> ./arch/arm64/Kconfig: bool "Enable the tagged user addresses syscall ABI"
> ./kernel/trace/Kconfig: bool "Trace syscalls"
> ./kernel/trace/Kconfig: bool "Run selftest on syscall events"
>
> So, putting aside the arch-specific lines, ftrace does have FTRACE_SYSCALLS
> living in the kernel/trace/ dir.. not sure whether we could move
> USERFAULTFD and all the rest into mm/ as well? Or perhaps that's just a
> bad example? :)

Yeah it looks like there is a healthy mix ;) To add to the list:

- mm/Kconfig has CONFIG_SWAP for the swapon/swapoff syscalls.
- fs/Kconfig has CONFIG_FILE_LOCKING, which adds the flock() syscall.
- Interestingly, fs/Kconfig has CONFIG_MEMFD_CREATE for memfd_create()
which is implemented in mm/memfd.c.

It seems reasonable to me to move the userfaultfd stuff to mm as well,
especially when it's becoming more than just a single binary question
on whether you want a syscall or not, and has MM-specific suboptions.

2022-04-22 20:51:20

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v8 22/23] mm: Enable PTE markers by default

On Tue, Apr 19, 2022 at 11:13:48AM -0400, Johannes Weiner wrote:
> Hi Peter,

Hi, Johannes,

>
> On Mon, Apr 04, 2022 at 09:49:29PM -0400, Peter Xu wrote:
> > Enable PTE markers by default. On x86_64 it means it'll auto-enable
> > PTE_MARKER_UFFD_WP as well.
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > mm/Kconfig | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 6e7c2d59fa96..3eca34c864c5 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -911,12 +911,14 @@ config ANON_VMA_NAME
> >
> > config PTE_MARKER
> > bool "Marker PTEs support"
> > + default y
> >
> > help
> > Allows to create marker PTEs for file-backed memory.
>
> make oldconfig just prompted me on these:
>
> ---
> Marker PTEs support (PTE_MARKER) [Y/n/?] (NEW) ?
>
> CONFIG_PTE_MARKER:
>
> Allows to create marker PTEs for file-backed memory.
>
> Symbol: PTE_MARKER [=y]
> Type : bool
> Defined at mm/Kconfig:1046
> Prompt: Marker PTEs support
> Location:
> Main menu
> -> Memory Management options
> ---
>
> > config PTE_MARKER_UFFD_WP
> > bool "Marker PTEs support for userfaultfd write protection"
> > + default y
> > depends on PTE_MARKER && HAVE_ARCH_USERFAULTFD_WP
>
> It's not possible to answer them without looking at the code.
>
> But after looking at the code, I'm still not sure why it asks
> me. Isn't this infrastructure code?
>
> Wouldn't it make more sense to remove the prompt string and have
> userfaultfd simply select those?
>
> If this is too experimental to enable per default, a more reasonable
> question for the user would be a "userfaultfd file support" option or
> something, and have *that* select the marker code.

Thanks for raising this question.

Actually it's enabled by default right now, so I kept the options just to
make sure we can always explicitly disable them when we want. That's
mainly why I kept this patch as a standalone one, so we can even drop it
if we want.

That said, I fully agree with you that having two options seems to be
overkill, especially since the PTE_MARKER option will be too challenging
to understand correctly for anyone not familiar with it.

So after a second thought, here is a refined list of what I want to
achieve with the kbuild system for this new feature:

- On supported systems (x86_64), it should default to y with "make
olddefconfig", but be possible to turn off using "make oldconfig" etc.,
so the user has a choice when they want one.

- On unsupported systems (non-x86_64), it should always be n without
any prompt asked; the user won't even see this entry.

- The PTE_MARKER option should always be hidden, for all archs.

The patch I plan to post is attached (I also reworded the entry to not
mention pte markers). Does that look acceptable to you?
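
Based on the reworked hunk quoted elsewhere in this thread, the resulting
mm/Kconfig entries should look roughly like this:

	config PTE_MARKER
		bool

	config PTE_MARKER_UFFD_WP
		bool "Userfaultfd write protection support for shmem/hugetlbfs"
		default y
		depends on HAVE_ARCH_USERFAULTFD_WP
		select PTE_MARKER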

Thanks,

--
Peter Xu


Attachments:
0001-mm-uffd-Hide-PTE_MARKER-option.patch (1.51 kB)

2022-04-22 21:02:58

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v8 22/23] mm: Enable PTE markers by default

On Tue, Apr 19, 2022 at 04:28:16PM -0400, Peter Xu wrote:
> On Tue, Apr 19, 2022 at 04:14:11PM -0400, Johannes Weiner wrote:
> > On Tue, Apr 19, 2022 at 03:59:21PM -0400, Peter Xu wrote:
> > > @@ -910,16 +910,16 @@ config ANON_VMA_NAME
> > Btw, this doesn't do much without userfaultfd being enabled in
> > general, right?
>
> So far yes, but I'm thinking there can be other potential users of
> PTE_MARKER in the mm world. The closest discussion is on the swap read
> failures and this patch proposed by Miaohe:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> So I hope we can still keep them around here under mm/ if possible, and
> my gut feeling is they really should be..

Agreed, mm/ seems a good fit for PTE_MARKER.

If it's invisible and gets selected as needed, it's less of a concern,
IMO. I'm somewhat worried about when and how the user-visible options
show up right now, though...

> > Would it make sense to have it next to 'config USERFAULTFD' as a
> > sub-option?
>
> Yes another good question. :)
>
> IIUC CONFIG_USERFAULTFD resides in init/Kconfig because it introduces a new
> syscall. The same goes for the rest of the uffd bits added since then, namely:
>
> - USERFAULTFD_WP
> - USERFAULTFD_MINOR
>
> What I am thinking now is the reverse of your suggestion: whether we should
> move most of them out, at least the _WP and _MINOR configs, into mm/?
> Because IMHO they are really pure mm ideas and have nothing to do with
> syscalls or init.

I'm thinking the MM submenu would probably be a better fit for all
user-visible userfaultfd options, including the syscall. Like you say,
it's an MM concept.

But if moving the syscall knob out from init isn't popular, IMO it
would be better to add the new WP option to init as well. This ensures
that when somebody selects userfaultfd, they also see the relevant
suboptions and don't have to chase them down across multiple submenus.

Conversely, they should also have the necessary 'depends on' clauses so
that suboptions aren't visible without the main feature. E.g. it asked me
for userfaultfd options even though I have CONFIG_USERFAULTFD=n.

What do you think?

2022-05-11 08:13:15

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v8 00/23] userfaultfd-wp: Support shmem and hugetlbfs

On Mon, 4 Apr 2022 21:46:23 -0400 Peter Xu <[email protected]> wrote:

> This is v8 of the series to add shmem+hugetlbfs support for userfaultfd
> write protection.

I think we've finished tossing this patchset around, so I plan to feed
it into mm-stable later this week.

A few -fixes were added:

https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-introduce-pte_marker-swap-entry-fix.patch
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-teach-core-mm-about-pte-markers-fix.patch
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-check-against-orig_pte-for-finish_fault-fix.patch
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-check-against-orig_pte-for-finish_fault-fix-checkpatch-fixes.patch
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-uffd-pte_marker_uffd_wp-fix.patch
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-hugetlb-only-drop-uffd-wp-special-pte-if-required-fix.patch
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-hugetlb-only-drop-uffd-wp-special-pte-if-required-fix-fix.patch
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-shmem-handle-uffd-wp-during-fork-fix.patch
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-enable-pte-markers-by-default-fix.patch


2022-05-12 02:21:37

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v8 06/23] mm/shmem: Handle uffd-wp special pte in page fault handler

> +/*
> + * This is actually a page-missing access, but with uffd-wp special pte
> + * installed. It means this pte was wr-protected before being unmapped.
> + */
> +static vm_fault_t pte_marker_handle_uffd_wp(struct vm_fault *vmf)
> +{
> + /*
> + * Just in case there're leftover special ptes even after the region
> + * got unregistered - we can simply clear them. We can also do that
> + * proactively when e.g. when we do UFFDIO_UNREGISTER upon some uffd-wp
> + * ranges, but it should be more efficient to be done lazily here.
> + */
> + if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
> + return pte_marker_clear(vmf);

What would happen if we do an unregister followed by a register? IMHO we
should start with a clean uffd-wp slate then. Your comment makes me
assume that we could receive stale WP events, which would be wrong?

--
Thanks,

David / dhildenb


2022-05-14 01:53:53

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v8 06/23] mm/shmem: Handle uffd-wp special pte in page fault handler

On Wed, May 11, 2022 at 06:30:59PM +0200, David Hildenbrand wrote:
> > +/*
> > + * This is actually a page-missing access, but with uffd-wp special pte
> > + * installed. It means this pte was wr-protected before being unmapped.
> > + */
> > +static vm_fault_t pte_marker_handle_uffd_wp(struct vm_fault *vmf)
> > +{
> > + /*
> > + * Just in case there're leftover special ptes even after the region
> > + * got unregistered - we can simply clear them. We can also do that
> > + * proactively when e.g. when we do UFFDIO_UNREGISTER upon some uffd-wp
> > + * ranges, but it should be more efficient to be done lazily here.
> > + */
> > + if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
> > + return pte_marker_clear(vmf);
>
> What would happen if we do an unregister followed by a register? IMHO we
> should start with a clean uffd-wp slate then. Your comment makes me
> assume that we could receive stale WP events, which would be wrong?

I'd say it's not wrong; it's true and actually expected.

Firstly, userfaultfd (by design) always allows false positives (getting
the same message multiple times) but has no tolerance for false negatives
(a missed event, which means data corruption).

The latter should be obvious. For the former, the simplest example is when
two threads access the same missing page at the same time: two identical
messages will be generated. The same applies to wr-protect faults. It
would be non-trivial (or, IMHO, impossible) to avoid those.

In this specific case, it's about when to drop the uffd-wp bits on
unregister. Two obvious options: (1) during unregister, or (2) lazily.

Here I chose the lazy way, because clearing during unregister could slow
it down, and unregister typically happens when the program quits. In
short, with the current approach we quit fast. We could have leftovers,
but we'll take care of them when needed.
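
For reference, the lazy path is the pte_marker_clear() that the quoted hunk
falls back to; slightly simplified from the series here (the real code also
re-checks that the pte didn't change under us before dropping it):

	static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
	{
		vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
					       vmf->address, &vmf->ptl);
		/* Only recover a marker pte into a none pte */
		if (is_pte_marker(*vmf->pte))
			pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte);
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		return 0;
	}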

One important thing is that leftover ptes should not be common with the
normal register -> wr-protect -> unprotect -> unregister sequence.
Normally the process won't unregister until it quits, so the leftovers do
no harm to anyone.

Meanwhile, any user who wants to avoid the lazy way can simply do a
whole-range unprotect before unregister, as in the sketch below. So we
leave more choice to the user, and by default we make sure no syscall is
easily slowed down.
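
As a minimal sketch of that whole-range unprotect-before-unregister (error
handling elided; 'uffd', 'addr' and 'len' are assumed to come from an
earlier UFFDIO_API/UFFDIO_REGISTER setup):

	#include <stddef.h>
	#include <sys/ioctl.h>
	#include <linux/userfaultfd.h>

	static int unprotect_then_unregister(int uffd, void *addr, size_t len)
	{
		struct uffdio_writeprotect wp = {
			.range = { .start = (unsigned long)addr, .len = len },
			/* UFFDIO_WRITEPROTECT_MODE_WP not set: un-protect */
			.mode  = 0,
		};
		struct uffdio_range range = {
			.start = (unsigned long)addr,
			.len   = len,
		};

		if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
			return -1;
		return ioctl(uffd, UFFDIO_UNREGISTER, &range);
	}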

Hope that answers, thanks!

--
Peter Xu