2021-03-23 00:52:15

by Peter Xu

Subject: [PATCH 00/23] userfaultfd-wp: Support shmem and hugetlbfs

This patchset is based on tag v5.12-rc3-mmots-2021-03-17-22-26. To run the
selftest, the following two patches need to be applied first to fix a minor
mode page leak:

https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/[email protected]/

Since I didn't get any NACK on the previous RFC series for months, I decided
to remove the RFC tag starting from this version, so this is v1 of uffd-wp
support on shmem & hugetlbfs.

The whole series can also be found online [1].



The major feedback I'd like to get is on the new idea of the swap special
pte. It comes from suggestions by both Hugh and Andrea, and I appreciate
those discussions a lot.

In short, the so-called "swap special pte" in this patchset is a new type of
pte that didn't exist before; in this series it is first used for file-backed
memory. It is used to persist information even if the ptes get dropped while
the page cache still exists. For example, when splitting a file-backed huge
pmd, we may simply drop the pmd entry and wait for another fault to come.
That used to be fine, since all the information in the pte could be recovered
from the page cache when the next page fault triggered. However, uffd-wp is
per-pte information which cannot be kept in the page cache, so that
information needs to be maintained somewhere in the pgtable entry even if the
pgtable entry is going to be dropped. Here, instead of replacing it with a
none entry, we use the "swap special pte". Then, when the next page fault
triggers, we can observe orig_pte to recover this information.
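
As a rough sketch (not a literal hunk from the series; it only reuses the
helpers introduced in the patches below, and the call sites shown here are
purely illustrative):

	/* When zapping a wr-protected pte of a file-backed VMA, keep a
	 * marker instead of clearing the pte to none: */
	if (pte_present(ptent) && pte_uffd_wp(ptent) && !vma_is_anonymous(vma))
		set_pte_at(mm, addr, pte, pte_swp_mkuffd_wp_special(vma));
	else
		pte_clear(mm, addr, pte);

	/* At the next page fault, orig_pte still carries the information: */
	if (pte_swp_uffd_wp_special(vmf->orig_pte)) {
		/* Resolve like a missing fault, then re-install the pte
		 * with the uffd-wp bit set (see FAULT_FLAG_UFFD_WP below). */
	}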



I'm copy-pasting part of the commit message from the patch "mm/swap:
Introduce the idea of special swap ptes", which tries to explain this pte
from another angle:

We used to have special swap entries, like migration entries, hw-poison
entries, device private entries, etc.

Those "special swap entries" reside in the range that they need to be at least

swap entries first, and their types are decided by swp_type(entry).



This patch introduces another idea called "special swap ptes".



It's very easy to confuse these with "special swap entries", but a special
swap pte should never contain a swap entry at all. That means it's illegal
to call pte_to_swp_entry() upon a special swap pte.

Make the uffd-wp special pte the first special swap pte.

Before this patch, is_swap_pte()==true means one of the below:

  (a.1) The pte has a normal swap entry (non_swap_entry()==false). For
        example, when an anonymous page got swapped out.

  (a.2) The pte has a special swap entry (non_swap_entry()==true). For
        example, a migration entry, a hw-poison entry, etc.

After this patch, is_swap_pte()==true means one of the below, where case (b)
is added:

(a) The pte contains a swap entry.

  (a.1) The pte has a normal swap entry (non_swap_entry()==false). For
        example, when an anonymous page got swapped out.

  (a.2) The pte has a special swap entry (non_swap_entry()==true). For
        example, a migration entry, a hw-poison entry, etc.

(b) The pte does not contain a swap entry at all (so it cannot be passed
    into pte_to_swp_entry()). For example, the uffd-wp special swap pte.
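
To make the distinction concrete, here is an illustrative dispatch pattern
(not taken from the series itself) using the helpers that the swapops.h
patch below introduces:

	if (!is_swap_pte(pte)) {
		/* none or present pte */
	} else if (pte_has_swap_entry(pte)) {
		/* case (a): it really carries a swap entry */
		swp_entry_t entry = pte_to_swp_entry(pte);
		/* ... handle normal/migration/hwpoison entries ... */
	} else {
		/*
		 * case (b): a special swap pte (currently only the uffd-wp
		 * marker); never call pte_to_swp_entry() on it.
		 */
	}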



Hugetlbfs needs a similar thing because it's also file-backed. I directly
reused the same special pte there, though the shmem and hugetlb changes to
support this new pte are different, since they don't share much of the code
path.

Patch layout
============

Part (1): Shmem support; this is where the special swap pte is introduced.
Some zap rework is needed along the way:

  shmem/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
  mm: Clear vmf->pte after pte_unmap_same() returns
  mm/userfaultfd: Introduce special pte for unmapped file-backed mem
  mm/swap: Introduce the idea of special swap ptes
  shmem/userfaultfd: Handle uffd-wp special pte in page fault handler
  mm: Drop first_index/last_index in zap_details
  mm: Introduce zap_details.zap_flags
  mm: Introduce ZAP_FLAG_SKIP_SWAP
  mm: Pass zap_flags into unmap_mapping_pages()
  shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed
  shmem/userfaultfd: Allow wr-protect none pte for file-backed mem
  shmem/userfaultfd: Allows file-back mem to be uffd wr-protected on thps
  shmem/userfaultfd: Handle the left-overed special swap ptes
  shmem/userfaultfd: Pass over uffd-wp special swap pte when fork()

Part (2): Hugetlb support. We need to disable huge pmd sharing for uffd-wp
because it's not compatible, just like uffd minor mode. The rest are the
changes required to teach hugetlbfs to understand the special swap pte
introduced with the uffd-wp change:

  hugetlb/userfaultfd: Hook page faults for uffd write protection
  hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
  hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT
  hugetlb: Pass vma into huge_pte_alloc()
  hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled
  mm/hugetlb: Introduce huge version of special swap pte helpers
  mm/hugetlb: Move flush_hugetlb_tlb_range() into hugetlb.h
  hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp
  hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler
  hugetlb/userfaultfd: Allow wr-protect none ptes
  hugetlb/userfaultfd: Only drop uffd-wp special pte if required

Part (3): Enable both features in code and test



  userfaultfd: Enable write protection for shmem & hugetlbfs
  userfaultfd/selftests: Enable uffd-wp for shmem/hugetlbfs

Tests
=====

I've tested it with the userfaultfd kselftest program, and also with umapsort
[2], which should be even stricter. Page swapping in/out was exercised during
the umapsort runs.

If anyone would like to try umapsort, an extremely hacked version of the umap
library [3] is needed, because by default umap only supports anonymous
memory. So to test it, build [3] first, then [2].

Any comments would be greatly welcomed. Thanks,

[1] https://github.com/xzpeter/linux/tree/uffd-wp-shmem-hugetlbfs
[2] https://github.com/LLNL/umap-apps
[3] https://github.com/xzpeter/umap/tree/peter-shmem-hugetlbfs

Peter Xu (23):
  shmem/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
  mm: Clear vmf->pte after pte_unmap_same() returns
  mm/userfaultfd: Introduce special pte for unmapped file-backed mem
  mm/swap: Introduce the idea of special swap ptes
  shmem/userfaultfd: Handle uffd-wp special pte in page fault handler
  mm: Drop first_index/last_index in zap_details
  mm: Introduce zap_details.zap_flags
  mm: Introduce ZAP_FLAG_SKIP_SWAP
  mm: Pass zap_flags into unmap_mapping_pages()
  shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed
  shmem/userfaultfd: Allow wr-protect none pte for file-backed mem
  shmem/userfaultfd: Allows file-back mem to be uffd wr-protected on thps
  shmem/userfaultfd: Handle the left-overed special swap ptes
  shmem/userfaultfd: Pass over uffd-wp special swap pte when fork()
  hugetlb/userfaultfd: Hook page faults for uffd write protection
  hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
  hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT
  mm/hugetlb: Introduce huge version of special swap pte helpers
  hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler
  hugetlb/userfaultfd: Allow wr-protect none ptes
  hugetlb/userfaultfd: Only drop uffd-wp special pte if required
  mm/userfaultfd: Enable write protection for shmem & hugetlbfs
  userfaultfd/selftests: Enable uffd-wp for shmem/hugetlbfs

arch/arm64/kernel/mte.c | 2 +-
arch/x86/include/asm/pgtable.h | 28 +++
fs/dax.c | 10 +-
fs/hugetlbfs/inode.c | 15 +-
fs/proc/task_mmu.c | 14 +-
fs/userfaultfd.c | 38 ++--
include/asm-generic/hugetlb.h | 10 +
include/asm-generic/pgtable_uffd.h | 3 +
include/linux/hugetlb.h | 25 ++-
include/linux/mm.h | 50 ++++-
include/linux/mm_inline.h | 43 ++++
include/linux/shmem_fs.h | 5 +-
include/linux/swapops.h | 39 +++-
include/linux/userfaultfd_k.h | 46 +++++
include/uapi/linux/userfaultfd.h | 7 +-
mm/gup.c | 2 +-
mm/hmm.c | 2 +-
mm/hugetlb.c | 167 ++++++++++++---
mm/khugepaged.c | 14 +-
mm/madvise.c | 4 +-
mm/memcontrol.c | 2 +-
mm/memory.c | 249 ++++++++++++++++++-----
mm/migrate.c | 4 +-
mm/mincore.c | 2 +-
mm/mprotect.c | 63 +++++-
mm/mremap.c | 2 +-
mm/page_vma_mapped.c | 6 +-
mm/rmap.c | 8 +
mm/shmem.c | 31 ++-
mm/swapfile.c | 2 +-
mm/truncate.c | 17 +-
mm/userfaultfd.c | 37 ++--
tools/testing/selftests/vm/userfaultfd.c | 9 +-
33 files changed, 776 insertions(+), 180 deletions(-)



--

2.26.2





2021-03-23 00:52:37

by Peter Xu

Subject: [PATCH 03/23] mm/userfaultfd: Introduce special pte for unmapped file-backed mem

This patch introduces a very special swap-like pte for file-backed memory.

Currently it's only defined for x86_64, but any arch that can properly define
the UFFD_WP_SWP_PTE_SPECIAL value as requested should conceptually work too.

We will use this special pte to arm the ptes that got either unmapped or
swapped out for a file-backed region that was previously wr-protected. This
special pte will trigger a page fault just like a swap entry does, because
the faulting pte satisfies pte_none()==false && pte_present()==false.

Then we can revive the special pte into a normal pte backed by the page cache.

This idea is greatly inspired by Hugh and Andrea in the discussions
referenced in the links below.

The other idea (from Hugh) is that we use swp_type==1 and swp_offset=0 as the
special pte. The current solution (as pointed out by Andrea) is slightly
preferred in that we don't even need swp_entry_t knowledge at all in trapping
these accesses. Meanwhile, we also reuse _PAGE_SWP_UFFD_WP from the anonymous
swp entries.

This patch only introduces the special pte and its operators. It is not yet
used anywhere, so there is no functional change.
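
As an illustration only (these call sites are not part of this patch; the
helper name below is hypothetical, but the operators it uses are the ones
added here), later patches will install and recognize the marker roughly
like this:

	/* Hypothetical helper: install the marker on a file-backed pte
	 * that is being dropped while wr-protected. */
	static void uffd_wp_set_special(struct mm_struct *mm,
					struct vm_area_struct *vma,
					unsigned long addr, pte_t *pte)
	{
		if (!vma_is_anonymous(vma))
			set_pte_at(mm, addr, pte,
				   pte_swp_mkuffd_wp_special(vma));
	}

	/* ...and later, e.g. in a fault path, detect it with: */
	if (pte_swp_uffd_wp_special(*pte)) {
		/* treat like pte_none(), but remember the wr-protection */
	}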

Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Suggested-by: Andrea Arcangeli <[email protected]>
Suggested-by: Hugh Dickins <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
arch/x86/include/asm/pgtable.h | 28 ++++++++++++++++++++++++++++
include/asm-generic/pgtable_uffd.h | 3 +++
include/linux/userfaultfd_k.h | 21 +++++++++++++++++++++
3 files changed, 52 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..379bae343dd1 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1329,6 +1329,34 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
#endif

#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+
+/*
+ * This is a very special swap-like pte that marks this pte as "wr-protected"
+ * by userfaultfd-wp. It should only exist for file-backed memory where the
+ * page (previously got wr-protected) has been unmapped or swapped out.
+ *
+ * For anonymous memories, the userfaultfd-wp _PAGE_SWP_UFFD_WP bit is kept
+ * along with a real swp entry instead.
+ *
+ * Let's make some rules for this special pte:
+ *
+ * (1) pte_none()==false, so that it'll not trigger a missing page fault.
+ *
+ * (2) pte_present()==false, so that it's recognized as swap (is_swap_pte).
+ *
+ * (3) pte_swp_uffd_wp()==true, so it can be tested just like a swap pte that
+ * contains a valid swap entry, so that we can check a swap pte always
+ * using "is_swap_pte() && pte_swp_uffd_wp()" without caring about whether
+ * there's one swap entry inside of the pte.
+ *
+ * (4) It should not be a valid swap pte anywhere, so that when we see this pte
+ * we know it does not contain a swap entry.
+ *
+ * For x86, the simplest special pte which satisfies all of above should be the
+ * pte with only _PAGE_SWP_UFFD_WP bit set (where swp_type==swp_offset==0).
+ */
+#define UFFD_WP_SWP_PTE_SPECIAL __pte(_PAGE_SWP_UFFD_WP)
+
static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
{
return pte_set_flags(pte, _PAGE_SWP_UFFD_WP);
diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
index 828966d4c281..95e9811ce9d1 100644
--- a/include/asm-generic/pgtable_uffd.h
+++ b/include/asm-generic/pgtable_uffd.h
@@ -2,6 +2,9 @@
#define _ASM_GENERIC_PGTABLE_UFFD_H

#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+
+#define UFFD_WP_SWP_PTE_SPECIAL __pte(0)
+
static __always_inline int pte_uffd_wp(pte_t pte)
{
return 0;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 794d1538b8ba..bc733512c690 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -140,6 +140,17 @@ extern int userfaultfd_unmap_prep(struct vm_area_struct *vma,
extern void userfaultfd_unmap_complete(struct mm_struct *mm,
struct list_head *uf);

+static inline pte_t pte_swp_mkuffd_wp_special(struct vm_area_struct *vma)
+{
+ WARN_ON_ONCE(vma_is_anonymous(vma));
+ return UFFD_WP_SWP_PTE_SPECIAL;
+}
+
+static inline bool pte_swp_uffd_wp_special(pte_t pte)
+{
+ return pte_same(pte, UFFD_WP_SWP_PTE_SPECIAL);
+}
+
#else /* CONFIG_USERFAULTFD */

/* mm helpers */
@@ -229,6 +240,16 @@ static inline void userfaultfd_unmap_complete(struct mm_struct *mm,
{
}

+static inline pte_t pte_swp_mkuffd_wp_special(struct vm_area_struct *vma)
+{
+ return __pte(0);
+}
+
+static inline bool pte_swp_uffd_wp_special(pte_t pte)
+{
+ return false;
+}
+
#endif /* CONFIG_USERFAULTFD */

#endif /* _LINUX_USERFAULTFD_K_H */
--
2.26.2

2021-03-23 00:52:51

by Peter Xu

Subject: [PATCH 02/23] mm: Clear vmf->pte after pte_unmap_same() returns

pte_unmap_same() will always unmap the pte pointer. After the unmap, vmf->pte
will not be valid any more. We should clear it.

Leaving it stale has been safe only because no one accesses vmf->pte after
pte_unmap_same() returns, since the only caller of pte_unmap_same() (so far)
is do_swap_page(), where vmf->pte will in most cases be overwritten very soon.

pte_unmap_same() will be used in other places in follow-up patches, where
vmf->pte will not always be rewritten afterwards. Clearing it enables us to
call functions like finish_fault(), which conditionally unmaps the pte by
checking vmf->pte first. Likewise, alloc_set_pte() will make sure to allocate
a new pte even after pte_unmap_same() has been called.

Since we'll need to modify vmf->pte anyway, pass vmf directly into
pte_unmap_same(); that also avoids the long parameter list.
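
The resulting calling convention, as a minimal sketch (the calls are from the
diff below, the comments are only explanatory):

	if (!pte_unmap_same(vmf))	/* unmaps vmf->pte and sets it to NULL */
		goto out;
	/*
	 * From here on vmf->pte is NULL, so any later code that checks
	 * "if (vmf->pte)" before mapping/unmapping it again stays correct.
	 */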

Signed-off-by: Peter Xu <[email protected]>
---
mm/memory.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index a458a595331f..d534eba85756 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2607,19 +2607,20 @@ EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
* proceeding (but do_wp_page is only called after already making such a check;
* and do_anonymous_page can safely check later on).
*/
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
- pte_t *page_table, pte_t orig_pte)
+static inline int pte_unmap_same(struct vm_fault *vmf)
{
int same = 1;
#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
if (sizeof(pte_t) > sizeof(unsigned long)) {
- spinlock_t *ptl = pte_lockptr(mm, pmd);
+ spinlock_t *ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
spin_lock(ptl);
- same = pte_same(*page_table, orig_pte);
+ same = pte_same(*vmf->pte, vmf->orig_pte);
spin_unlock(ptl);
}
#endif
- pte_unmap(page_table);
+ pte_unmap(vmf->pte);
+ /* After unmap of pte, the pointer is invalid now - clear it. */
+ vmf->pte = NULL;
return same;
}

@@ -3308,7 +3309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vm_fault_t ret = 0;
void *shadow = NULL;

- if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
+ if (!pte_unmap_same(vmf))
goto out;

entry = pte_to_swp_entry(vmf->orig_pte);
--
2.26.2

2021-03-23 00:53:13

by Peter Xu

Subject: [PATCH 06/23] mm: Drop first_index/last_index in zap_details

The first_index/last_index fields in zap_details are actually only used in
unmap_mapping_range_tree(). Meanwhile, that function is only called from one
place, unmap_mapping_pages(). Instead of passing these two variables through
the whole stack of page zapping code, remove them from zap_details and simply
make them parameters of unmap_mapping_range_tree(), which is inlined.

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 2 --
mm/memory.c | 20 ++++++++++----------
2 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b78306eb7a63..389dd91134f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1707,8 +1707,6 @@ extern void user_shm_unlock(size_t, struct user_struct *);
*/
struct zap_details {
struct address_space *check_mapping; /* Check page->mapping if set */
- pgoff_t first_index; /* Lowest page->index to unmap */
- pgoff_t last_index; /* Highest page->index to unmap */
};

struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index b4ddba343abc..5e6175e00617 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3202,20 +3202,20 @@ static void unmap_mapping_range_vma(struct vm_area_struct *vma,
}

static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
+ pgoff_t first_index,
+ pgoff_t last_index,
struct zap_details *details)
{
struct vm_area_struct *vma;
pgoff_t vba, vea, zba, zea;

- vma_interval_tree_foreach(vma, root,
- details->first_index, details->last_index) {
-
+ vma_interval_tree_foreach(vma, root, first_index, last_index) {
vba = vma->vm_pgoff;
vea = vba + vma_pages(vma) - 1;
- zba = details->first_index;
+ zba = first_index;
if (zba < vba)
zba = vba;
- zea = details->last_index;
+ zea = last_index;
if (zea > vea)
zea = vea;

@@ -3241,17 +3241,17 @@ static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
pgoff_t nr, bool even_cows)
{
+ pgoff_t first_index = start, last_index = start + nr - 1;
struct zap_details details = { };

details.check_mapping = even_cows ? NULL : mapping;
- details.first_index = start;
- details.last_index = start + nr - 1;
- if (details.last_index < details.first_index)
- details.last_index = ULONG_MAX;
+ if (last_index < first_index)
+ last_index = ULONG_MAX;

i_mmap_lock_write(mapping);
if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
- unmap_mapping_range_tree(&mapping->i_mmap, &details);
+ unmap_mapping_range_tree(&mapping->i_mmap, first_index,
+ last_index, &details);
i_mmap_unlock_write(mapping);
}

--
2.26.2

2021-03-23 00:53:25

by Peter Xu

Subject: [PATCH 08/23] mm: Introduce ZAP_FLAG_SKIP_SWAP

Firstly, the comment in zap_pte_range() is misleading: it says we check
details->check_mapping, while the code actually only checks the details
pointer itself, so the comment doesn't match what the code does.

Meanwhile, it's also confusing that nothing explains why passing in the
details pointer should mean skipping all swap entries. A new user of
zap_details could easily miss this fact if they don't read all the way down
to zap_pte_range(), because there's no comment in zap_details mentioning it
at all, so swap entries could be erroneously skipped without being noticed.

This partly reverts 3e8715fdc03e ("mm: drop zap_details::check_swap_entries"),
but introduces the ZAP_FLAG_SKIP_SWAP flag, which means the opposite of the
previous "details" parameter: the caller should explicitly set this flag to
skip swap entries; otherwise swap entries will always be considered (which is
still the major case here).
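
Pieced together from the two hunks below (an illustrative consolidation, not
a new hunk), the behaviour reads as:

	/* unmap_mapping_pages() now opts in explicitly: */
	struct zap_details details = {
		.zap_mapping = mapping,
		.zap_flags   = ZAP_FLAG_SKIP_SWAP,
	};

	/* ...while zap_pte_range() only skips swap entries when asked to;
	 * zap_skip_swap(NULL) returns false, so plain munmap()/exit paths
	 * keep zapping swap entries as before: */
	if (unlikely(zap_skip_swap(details)))
		continue;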

Cc: Kirill A. Shutemov <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 12 ++++++++++++
mm/memory.c | 8 +++++---
2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bb513a346beb..c11fbce0d557 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1704,6 +1704,8 @@ extern void user_shm_unlock(size_t, struct user_struct *);

/* Whether to check page->mapping when zapping */
#define ZAP_FLAG_CHECK_MAPPING BIT(0)
+/* Whether to skip zapping swap entries */
+#define ZAP_FLAG_SKIP_SWAP BIT(1)

/*
* Parameter block passed down to zap_pte_range in exceptional cases.
@@ -1726,6 +1728,16 @@ zap_check_mapping_skip(struct zap_details *details, struct page *page)
return details->zap_mapping != page_rmapping(page);
}

+/* Return true if skip swap entries, false otherwise */
+static inline bool
+zap_skip_swap(struct zap_details *details)
+{
+ if (!details)
+ return false;
+
+ return details->zap_flags & ZAP_FLAG_SKIP_SWAP;
+}
+
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index 2e540b315481..a02c4d851cd4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1284,8 +1284,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
continue;
}

- /* If details->check_mapping, we leave swap entries. */
- if (unlikely(details))
+ if (unlikely(zap_skip_swap(details)))
continue;

if (!non_swap_entry(entry))
@@ -3225,7 +3224,10 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
pgoff_t nr, bool even_cows)
{
pgoff_t first_index = start, last_index = start + nr - 1;
- struct zap_details details = { .zap_mapping = mapping };
+ struct zap_details details = {
+ .zap_mapping = mapping,
+ .zap_flags = ZAP_FLAG_SKIP_SWAP,
+ };

if (!even_cows)
details.zap_flags |= ZAP_FLAG_CHECK_MAPPING;
--
2.26.2

2021-03-23 00:53:41

by Peter Xu

Subject: [PATCH 07/23] mm: Introduce zap_details.zap_flags

Instead of trying to introduce one variable for every new zap_details field,
let's introduce a flags field that can encode such true/false information.

Let's start by using this flag to clean up the existing check_mapping
variable. The name "check_mapping" implies it is a boolean, but it actually
stores the mapping itself, just in a way that it won't be set if we don't
want to check the mapping.

To make things clearer, introduce the first zap flag, ZAP_FLAG_CHECK_MAPPING,
so that we only check against the mapping if this bit is set. At the same
time, rename check_mapping to zap_mapping and always set it.

While at it, introduce another helper, zap_check_mapping_skip(), and use it
properly in zap_pte_range().

Some old comments in zap_pte_range() have been removed because they were
duplicated; now that we have the ZAP_FLAG_CHECK_MAPPING flag, this
information is easy to find by simply grepping for the flag.

This will also make life easier when we want to, e.g., pass zap_flags into
callers like unmap_mapping_pages() (instead of adding new booleans beside the
even_cows parameter).

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 19 ++++++++++++++++++-
mm/memory.c | 31 ++++++++-----------------------
2 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 389dd91134f9..bb513a346beb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1702,13 +1702,30 @@ static inline bool can_do_mlock(void) { return false; }
extern int user_shm_lock(size_t, struct user_struct *);
extern void user_shm_unlock(size_t, struct user_struct *);

+/* Whether to check page->mapping when zapping */
+#define ZAP_FLAG_CHECK_MAPPING BIT(0)
+
/*
* Parameter block passed down to zap_pte_range in exceptional cases.
*/
struct zap_details {
- struct address_space *check_mapping; /* Check page->mapping if set */
+ struct address_space *zap_mapping; /* Check page->mapping if set */
+ unsigned long zap_flags; /* Special flags for zapping */
};

+/* Return true if skip zapping this page, false otherwise */
+static inline bool
+zap_check_mapping_skip(struct zap_details *details, struct page *page)
+{
+ if (!details || !page)
+ return false;
+
+ if (!(details->zap_flags & ZAP_FLAG_CHECK_MAPPING))
+ return false;
+
+ return details->zap_mapping != page_rmapping(page);
+}
+
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index 5e6175e00617..2e540b315481 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1242,16 +1242,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct page *page;

page = vm_normal_page(vma, addr, ptent);
- if (unlikely(details) && page) {
- /*
- * unmap_shared_mapping_pages() wants to
- * invalidate cache without truncating:
- * unmap shared but keep private pages.
- */
- if (details->check_mapping &&
- details->check_mapping != page_rmapping(page))
- continue;
- }
+ if (unlikely(zap_check_mapping_skip(details, page)))
+ continue;
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
tlb_remove_tlb_entry(tlb, pte, addr);
@@ -1283,17 +1275,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (is_device_private_entry(entry)) {
struct page *page = device_private_entry_to_page(entry);

- if (unlikely(details && details->check_mapping)) {
- /*
- * unmap_shared_mapping_pages() wants to
- * invalidate cache without truncating:
- * unmap shared but keep private pages.
- */
- if (details->check_mapping !=
- page_rmapping(page))
- continue;
- }
-
+ if (unlikely(zap_check_mapping_skip(details, page)))
+ continue;
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
rss[mm_counter(page)]--;
page_remove_rmap(page, false);
@@ -3242,9 +3225,11 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
pgoff_t nr, bool even_cows)
{
pgoff_t first_index = start, last_index = start + nr - 1;
- struct zap_details details = { };
+ struct zap_details details = { .zap_mapping = mapping };
+
+ if (!even_cows)
+ details.zap_flags |= ZAP_FLAG_CHECK_MAPPING;

- details.check_mapping = even_cows ? NULL : mapping;
if (last_index < first_index)
last_index = ULONG_MAX;

--
2.26.2

2021-03-23 00:53:43

by Peter Xu

Subject: [PATCH 09/23] mm: Pass zap_flags into unmap_mapping_pages()

Give unmap_mapping_pages() more power by allowing callers to specify a zap
flag, so it can carry more information than just "whether we'd also like to
zap cow pages". With the new flag, we can remove the even_cows parameter,
because even_cows==false is equivalent to zap_flags==ZAP_FLAG_CHECK_MAPPING,
while even_cows==true means no zap flag is passed in (though in most cases we
have had even_cows==false).

No functional change intended.
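
For callers the conversion is mechanical; for example (taken from the hunks
below):

	/* before */ unmap_mapping_pages(mapping, index, 1, false);
	/* after  */ unmap_mapping_pages(mapping, index, 1, ZAP_FLAG_CHECK_MAPPING);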

Signed-off-by: Peter Xu <[email protected]>
---
fs/dax.c | 10 ++++++----
include/linux/mm.h | 4 ++--
mm/khugepaged.c | 3 ++-
mm/memory.c | 15 ++++++++-------
mm/truncate.c | 11 +++++++----
5 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 177b7d305a52..dd90a35d38be 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -514,7 +514,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
xas_unlock_irq(xas);
unmap_mapping_pages(mapping,
xas->xa_index & ~PG_PMD_COLOUR,
- PG_PMD_NR, false);
+ PG_PMD_NR, ZAP_FLAG_CHECK_MAPPING);
xas_reset(xas);
xas_lock_irq(xas);
}
@@ -609,7 +609,8 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping,
* guaranteed to either see new references or prevent new
* references from being established.
*/
- unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0);
+ unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1,
+ ZAP_FLAG_CHECK_MAPPING);

xas_lock_irq(&xas);
xas_for_each(&xas, entry, end_idx) {
@@ -740,9 +741,10 @@ static void *dax_insert_entry(struct xa_state *xas,
/* we are replacing a zero page with block mapping */
if (dax_is_pmd_entry(entry))
unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR,
- PG_PMD_NR, false);
+ PG_PMD_NR, ZAP_FLAG_CHECK_MAPPING);
else /* pte entry */
- unmap_mapping_pages(mapping, index, 1, false);
+ unmap_mapping_pages(mapping, index, 1,
+ ZAP_FLAG_CHECK_MAPPING);
}

xas_reset(xas);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c11fbce0d557..d38cd23a08be 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1784,7 +1784,7 @@ extern int fixup_user_fault(struct mm_struct *mm,
unsigned long address, unsigned int fault_flags,
bool *unlocked);
void unmap_mapping_pages(struct address_space *mapping,
- pgoff_t start, pgoff_t nr, bool even_cows);
+ pgoff_t start, pgoff_t nr, unsigned long zap_flags);
void unmap_mapping_range(struct address_space *mapping,
loff_t const holebegin, loff_t const holelen, int even_cows);
#else
@@ -1804,7 +1804,7 @@ static inline int fixup_user_fault(struct mm_struct *mm, unsigned long address,
return -EFAULT;
}
static inline void unmap_mapping_pages(struct address_space *mapping,
- pgoff_t start, pgoff_t nr, bool even_cows) { }
+ pgoff_t start, pgoff_t nr, unsigned long zap_flags) { }
static inline void unmap_mapping_range(struct address_space *mapping,
loff_t const holebegin, loff_t const holelen, int even_cows) { }
#endif
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 419a6acce326..7c75dff637e2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1837,7 +1837,8 @@ static void collapse_file(struct mm_struct *mm,
}

if (page_mapped(page))
- unmap_mapping_pages(mapping, index, 1, false);
+ unmap_mapping_pages(mapping, index, 1,
+ ZAP_FLAG_CHECK_MAPPING);

xas_lock_irq(&xas);
xas_set(&xas, index);
diff --git a/mm/memory.c b/mm/memory.c
index a02c4d851cd4..36204b898894 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3213,7 +3213,10 @@ static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
* @mapping: The address space containing pages to be unmapped.
* @start: Index of first page to be unmapped.
* @nr: Number of pages to be unmapped. 0 to unmap to end of file.
- * @even_cows: Whether to unmap even private COWed pages.
+ * @zap_flags: Zap flags for the process. E.g., when ZAP_FLAG_CHECK_MAPPING is
+ * passed into it, we will only zap the pages that are in the same mapping
+ * specified in the @mapping parameter; otherwise we will not check mapping,
+ * IOW cow pages will be zapped too.
*
* Unmap the pages in this address space from any userspace process which
* has them mmaped. Generally, you want to remove COWed pages as well when
@@ -3221,17 +3224,14 @@ static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
* cache.
*/
void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
- pgoff_t nr, bool even_cows)
+ pgoff_t nr, unsigned long zap_flags)
{
pgoff_t first_index = start, last_index = start + nr - 1;
struct zap_details details = {
.zap_mapping = mapping,
- .zap_flags = ZAP_FLAG_SKIP_SWAP,
+ .zap_flags = zap_flags | ZAP_FLAG_SKIP_SWAP,
};

- if (!even_cows)
- details.zap_flags |= ZAP_FLAG_CHECK_MAPPING;
-
if (last_index < first_index)
last_index = ULONG_MAX;

@@ -3273,7 +3273,8 @@ void unmap_mapping_range(struct address_space *mapping,
hlen = ULONG_MAX - hba + 1;
}

- unmap_mapping_pages(mapping, hba, hlen, even_cows);
+ unmap_mapping_pages(mapping, hba, hlen, even_cows ?
+ 0 : ZAP_FLAG_CHECK_MAPPING);
}
EXPORT_SYMBOL(unmap_mapping_range);

diff --git a/mm/truncate.c b/mm/truncate.c
index 95af244b112a..ba2cbe300e83 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -172,7 +172,8 @@ truncate_cleanup_page(struct address_space *mapping, struct page *page)
{
if (page_mapped(page)) {
unsigned int nr = thp_nr_pages(page);
- unmap_mapping_pages(mapping, page->index, nr, false);
+ unmap_mapping_pages(mapping, page->index, nr,
+ ZAP_FLAG_CHECK_MAPPING);
}

if (page_has_private(page))
@@ -652,14 +653,15 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
* Zap the rest of the file in one hit.
*/
unmap_mapping_pages(mapping, index,
- (1 + end - index), false);
+ (1 + end - index),
+ ZAP_FLAG_CHECK_MAPPING);
did_range_unmap = 1;
} else {
/*
* Just zap this page
*/
unmap_mapping_pages(mapping, index,
- 1, false);
+ 1, ZAP_FLAG_CHECK_MAPPING);
}
}
BUG_ON(page_mapped(page));
@@ -685,7 +687,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
* get remapped later.
*/
if (dax_mapping(mapping)) {
- unmap_mapping_pages(mapping, start, end - start + 1, false);
+ unmap_mapping_pages(mapping, start, end - start + 1,
+ ZAP_FLAG_CHECK_MAPPING);
}
out:
cleancache_invalidate_inode(mapping);
--
2.26.2

2021-03-23 00:54:33

by Peter Xu

Subject: [PATCH 12/23] shmem/userfaultfd: Allows file-back mem to be uffd wr-protected on thps

We don't have "huge" version of PTE_SWP_UFFD_WP_SPECIAL, instead when necessary
we split the thp if the huge page is uffd wr-protected previously.

However split the thp is not enough, because file-backed thp is handled totally
differently comparing to anonymous thps - rather than doing a real split, the
thp pmd will simply got dropped in __split_huge_pmd_locked().

That is definitely not enough if e.g. when there is a thp covers range [0, 2M)
but we want to wr-protect small page resides in [4K, 8K) range, because after
__split_huge_pmd() returns, there will be a none pmd.

Here we leverage the previously introduced change_protection_prepare() macro so
that we'll populate the pmd with a pgtable page. Then change_pte_range() will
do all the rest for us, e.g., install the uffd-wp swap special pte marker at
any pte that we'd like to wr-protect, under the protection of pgtable lock.

Signed-off-by: Peter Xu <[email protected]>
---
mm/mprotect.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6b63e3544b47..51c954afa406 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -296,8 +296,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
}

if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
- if (next - addr != HPAGE_PMD_SIZE) {
+ if (next - addr != HPAGE_PMD_SIZE ||
+ /* Uffd wr-protecting a file-backed memory range */
+ unlikely(!vma_is_anonymous(vma) &&
+ (cp_flags & MM_CP_UFFD_WP))) {
__split_huge_pmd(vma, pmd, addr, false, NULL);
+ /*
+ * For file-backed, the pmd could have been
+ * gone; still provide a pte pgtable if needed.
+ */
+ change_protection_prepare(vma, pmd, addr, cp_flags);
} else {
int nr_ptes = change_huge_pmd(vma, pmd, addr,
newprot, cp_flags);
--
2.26.2

2021-03-23 00:54:48

by Peter Xu

Subject: [PATCH 13/23] shmem/userfaultfd: Handle the left-overed special swap ptes

Note that the special uffd-wp swap pte can be left over even after the page
under the pte got evicted. Normally, when we evict a page we unmap its ptes
by walking through the reverse mapping. However, we never tracked such
information for the special swap ptes, because they're not real mappings but
just markers. So we need to take care of the case where we see a marker that
is actually meaningless (because the page behind it got evicted).

We have already taken care of that in, e.g., alloc_set_pte(), where we treat
the special swap pte as pte_none() when necessary. However, we also need to
teach userfaultfd itself about this, both for UFFDIO_COPY and when handling
page faults, so that everything still works as expected.

Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 15 +++++++++++++++
mm/shmem.c | 13 ++++++++++++-
2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index bd83379d4dd2..72956f9cc892 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -329,6 +329,21 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
*/
if (pte_none(*pte))
ret = true;
+ /*
+ * We also treat the swap special uffd-wp pte as the pte_none() here.
+ * This should in most cases be a missing event, as we never handle
+ * wr-protect upon a special uffd-wp swap pte - it should first be
+ * converted into a normal read request before handling wp. It just
+ * means the page/swap cache that backing this pte is gone, so this
+ * special pte is leftover.
+ *
+ * We can't simply replace it with a none pte because we're not with
+ * the pgtable lock here. Instead of taking it and clearing the pte,
+ * the easy way is to let UFFDIO_COPY understand this pte too when
+ * trying to install a new page onto it.
+ */
+ if (pte_swp_uffd_wp_special(*pte))
+ ret = true;
if (!pte_write(*pte) && (reason & VM_UFFD_WP))
ret = true;
pte_unmap(pte);
diff --git a/mm/shmem.c b/mm/shmem.c
index e88aaabaeb27..90d67406af66 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2469,7 +2469,18 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
goto out_release_unlock;

ret = -EEXIST;
- if (!pte_none(*dst_pte))
+ /*
+ * Besides the none pte, we also allow UFFDIO_COPY to install a pte
+ * onto the uffd-wp swap special pte, because that pte should be the
+ * same as a pte_none() just in that it contains wr-protect information
+ * (which could only be dropped when unmap the memory).
+ *
+ * It's safe to drop that marker because we know this is part of a
+ * MISSING fault, and the caller is very clear about this page missing
+ * rather than wr-protected. Then we're sure the wr-protect bit is
+ * just a leftover so it's useless already.
+ */
+ if (!pte_none(*dst_pte) && !pte_swp_uffd_wp_special(*dst_pte))
goto out_release_unlock;

if (!is_continue) {
--
2.26.2

2021-03-23 00:54:57

by Peter Xu

Subject: [PATCH 04/23] mm/swap: Introduce the idea of special swap ptes

We used to have special swap entries, like migration entries, hw-poison
entries, device private entries, etc.

Those "special swap entries" reside in the range that they need to be at least
swap entries first, and their types are decided by swp_type(entry).

This patch introduces another idea called "special swap ptes".

It's very easy to confuse these with "special swap entries", but a special
swap pte should never contain a swap entry at all. That means it's illegal
to call pte_to_swp_entry() upon a special swap pte.

Make the uffd-wp special pte the first special swap pte.

Before this patch, is_swap_pte()==true means one of the below:

(a.1) The pte has a normal swap entry (non_swap_entry()==false). For
example, when an anonymous page got swapped out.

(a.2) The pte has a special swap entry (non_swap_entry()==true). For
example, a migration entry, a hw-poison entry, etc.

After this patch, is_swap_pte()==true means one of the below, where case (b) is
added:

(a) The pte contains a swap entry.

(a.1) The pte has a normal swap entry (non_swap_entry()==false). For
example, when an anonymous page got swapped out.

(a.2) The pte has a special swap entry (non_swap_entry()==true). For
example, a migration entry, a hw-poison entry, etc.

(b) The pte does not contain a swap entry at all (so it cannot be passed
into pte_to_swp_entry()). For example, uffd-wp special swap pte.

Teach the whole mm core about this new idea. This is done by introducing
another helper called pte_has_swap_entry(), which covers cases (a.1) and
(a.2). Before this patch it is equivalent to is_swap_pte(), because there's
no special swap pte yet. Now, most of the previous users of is_swap_pte() in
mm core need to use the new helper pte_has_swap_entry() instead, to make sure
we won't try to parse a swap entry from a special swap pte (which does not
contain a swap entry at all!). We either handle the special swap pte
explicitly, or it naturally falls into the default "else" paths.

Warn properly (e.g., in do_swap_page()) when we see a special swap pte - we
should never call do_swap_page() upon those ptes, so just bail out early if
it happens.

Signed-off-by: Peter Xu <[email protected]>
---
arch/arm64/kernel/mte.c | 2 +-
fs/proc/task_mmu.c | 14 ++++++++------
include/linux/swapops.h | 39 ++++++++++++++++++++++++++++++++++++++-
mm/gup.c | 2 +-
mm/hmm.c | 2 +-
mm/khugepaged.c | 11 ++++++++++-
mm/madvise.c | 4 ++--
mm/memcontrol.c | 2 +-
mm/memory.c | 7 +++++++
mm/migrate.c | 4 ++--
mm/mincore.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 2 +-
mm/page_vma_mapped.c | 6 +++---
mm/swapfile.c | 2 +-
15 files changed, 78 insertions(+), 23 deletions(-)

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index b3c70a612c7a..ebe213cba913 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -30,7 +30,7 @@ static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
{
pte_t old_pte = READ_ONCE(*ptep);

- if (check_swap && is_swap_pte(old_pte)) {
+ if (check_swap && pte_has_swap_entry(old_pte)) {
swp_entry_t entry = pte_to_swp_entry(old_pte);

if (!non_swap_entry(entry) && mte_restore_tags(entry, page))
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fc9784544b24..4c95cc57a66a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -498,7 +498,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,

if (pte_present(*pte)) {
page = vm_normal_page(vma, addr, *pte);
- } else if (is_swap_pte(*pte)) {
+ } else if (pte_has_swap_entry(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);

if (!non_swap_entry(swpent)) {
@@ -518,8 +518,10 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
page = migration_entry_to_page(swpent);
else if (is_device_private_entry(swpent))
page = device_private_entry_to_page(swpent);
- } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
- && pte_none(*pte))) {
+ } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) &&
+ mss->check_shmem_swap &&
+ /* Here swap special pte is the same as none pte */
+ (pte_none(*pte) || is_swap_special_pte(*pte)))) {
page = xa_load(&vma->vm_file->f_mapping->i_pages,
linear_page_index(vma, addr));
if (xa_is_value(page))
@@ -691,7 +693,7 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,

if (pte_present(*pte)) {
page = vm_normal_page(vma, addr, *pte);
- } else if (is_swap_pte(*pte)) {
+ } else if (pte_has_swap_entry(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);

if (is_migration_entry(swpent))
@@ -1075,7 +1077,7 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
ptent = pte_wrprotect(old_pte);
ptent = pte_clear_soft_dirty(ptent);
ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
- } else if (is_swap_pte(ptent)) {
+ } else if (pte_has_swap_entry(ptent)) {
ptent = pte_swp_clear_soft_dirty(ptent);
set_pte_at(vma->vm_mm, addr, pte, ptent);
}
@@ -1375,7 +1377,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
page = vm_normal_page(vma, addr, pte);
if (pte_soft_dirty(pte))
flags |= PM_SOFT_DIRTY;
- } else if (is_swap_pte(pte)) {
+ } else if (pte_has_swap_entry(pte)) {
swp_entry_t entry;
if (pte_swp_soft_dirty(pte))
flags |= PM_SOFT_DIRTY;
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 7dd57303bb0c..7b7387d2892f 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -5,6 +5,7 @@
#include <linux/radix-tree.h>
#include <linux/bug.h>
#include <linux/mm_types.h>
+#include <linux/userfaultfd_k.h>

#ifdef CONFIG_MMU

@@ -52,12 +53,48 @@ static inline pgoff_t swp_offset(swp_entry_t entry)
return entry.val & SWP_OFFSET_MASK;
}

-/* check whether a pte points to a swap entry */
+/*
+ * is_swap_pte() returns true for three cases:
+ *
+ * (a) The pte contains a swap entry.
+ *
+ * (a.1) The pte has a normal swap entry (non_swap_entry()==false). For
+ * example, when an anonymous page got swapped out.
+ *
+ * (a.2) The pte has a special swap entry (non_swap_entry()==true). For
+ * example, a migration entry, a hw-poison entry, etc.
+ *
+ * (b) The pte does not contain a swap entry at all (so it cannot be passed
+ * into pte_to_swp_entry()). For example, uffd-wp special swap pte.
+ */
static inline int is_swap_pte(pte_t pte)
{
return !pte_none(pte) && !pte_present(pte);
}

+/*
+ * A swap-like special pte should only be used as special marker to trigger a
+ * page fault. We should treat them similarly as pte_none() in most cases,
+ * except that it may contain some special information that can persist within
+ * the pte. Currently the only special swap pte is UFFD_WP_SWP_PTE_SPECIAL.
+ *
+ * Note: we should never call pte_to_swp_entry() upon a special swap pte,
+ * because a swap special pte does not contain a swap entry!
+ */
+static inline bool is_swap_special_pte(pte_t pte)
+{
+ return pte_swp_uffd_wp_special(pte);
+}
+
+/*
+ * Returns true if the pte contains a swap entry. This includes not only the
+ * normal swp entry case, but also for migration entries, etc.
+ */
+static inline bool pte_has_swap_entry(pte_t pte)
+{
+ return is_swap_pte(pte) && !is_swap_special_pte(pte);
+}
+
/*
* Convert the arch-dependent pte representation of a swp_entry_t into an
* arch-independent swp_entry_t.
diff --git a/mm/gup.c b/mm/gup.c
index b3e647c8b7ee..53e9ddc3a829 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -474,7 +474,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
*/
if (likely(!(flags & FOLL_MIGRATION)))
goto no_page;
- if (pte_none(pte))
+ if (!pte_has_swap_entry(pte))
goto no_page;
entry = pte_to_swp_entry(pte);
if (!is_migration_entry(entry))
diff --git a/mm/hmm.c b/mm/hmm.c
index 943cb2ba4442..4dba5debf163 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -237,7 +237,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
pte_t pte = *ptep;
uint64_t pfn_req_flags = *hmm_pfn;

- if (pte_none(pte)) {
+ if (pte_none(pte) || is_swap_special_pte(pte)) {
required_fault =
hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
if (required_fault)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b81521dfbb1a..419a6acce326 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1019,7 +1019,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,

vmf.pte = pte_offset_map(pmd, address);
vmf.orig_pte = *vmf.pte;
- if (!is_swap_pte(vmf.orig_pte)) {
+ if (!pte_has_swap_entry(vmf.orig_pte)) {
pte_unmap(vmf.pte);
continue;
}
@@ -1248,6 +1248,15 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
_pte++, _address += PAGE_SIZE) {
pte_t pteval = *_pte;
if (is_swap_pte(pteval)) {
+ if (is_swap_special_pte(pteval)) {
+ /*
+ * Reuse SCAN_PTE_UFFD_WP. If there will be
+ * new users of is_swap_special_pte(), we'd
+ * better introduce a new result type.
+ */
+ result = SCAN_PTE_UFFD_WP;
+ goto out_unmap;
+ }
if (++unmapped <= khugepaged_max_ptes_swap) {
/*
* Always be strict with uffd-wp
diff --git a/mm/madvise.c b/mm/madvise.c
index 01fef79ac761..c77499d21aac 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -202,7 +202,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
pte = *(orig_pte + ((index - start) / PAGE_SIZE));
pte_unmap_unlock(orig_pte, ptl);

- if (pte_present(pte) || pte_none(pte))
+ if (!pte_has_swap_entry(pte))
continue;
entry = pte_to_swp_entry(pte);
if (unlikely(non_swap_entry(entry)))
@@ -594,7 +594,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
for (; addr != end; pte++, addr += PAGE_SIZE) {
ptent = *pte;

- if (pte_none(ptent))
+ if (pte_none(ptent) || is_swap_special_pte(ptent))
continue;
/*
* If the pte has swp_entry, just clear page table to
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 668d1d7c2645..64b347a15ded 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5558,7 +5558,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,

if (pte_present(ptent))
page = mc_handle_present_pte(vma, addr, ptent);
- else if (is_swap_pte(ptent))
+ else if (pte_has_swap_entry(ptent))
page = mc_handle_swap_pte(vma, ptent, &ent);
else if (pte_none(ptent))
page = mc_handle_file_pte(vma, addr, ptent, &ent);
diff --git a/mm/memory.c b/mm/memory.c
index d534eba85756..8c4ed1f9693c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3312,6 +3312,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (!pte_unmap_same(vmf))
goto out;

+ /*
+ * We should never call do_swap_page upon a swap special pte; just be
+ * safe to bail out if it happens.
+ */
+ if (WARN_ON_ONCE(is_swap_special_pte(vmf->orig_pte)))
+ goto out;
+
entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry))) {
if (is_migration_entry(entry)) {
diff --git a/mm/migrate.c b/mm/migrate.c
index 47df0df8f21a..08425acc2563 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -314,7 +314,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,

spin_lock(ptl);
pte = *ptep;
- if (!is_swap_pte(pte))
+ if (!pte_has_swap_entry(pte))
goto out;

entry = pte_to_swp_entry(pte);
@@ -2425,7 +2425,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,

pte = *ptep;

- if (pte_none(pte)) {
+ if (pte_none(pte) || is_swap_special_pte(pte)) {
if (vma_is_anonymous(vma)) {
mpfn = MIGRATE_PFN_MIGRATE;
migrate->cpages++;
diff --git a/mm/mincore.c b/mm/mincore.c
index 9122676b54d6..5728c3e6473f 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -121,7 +121,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
for (; addr != end; ptep++, addr += PAGE_SIZE) {
pte_t pte = *ptep;

- if (pte_none(pte))
+ if (pte_none(pte) || is_swap_special_pte(pte))
__mincore_unmapped_range(addr, addr + PAGE_SIZE,
vma, vec);
else if (pte_present(pte))
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94188df1ee55..b3def0a102bf 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -139,7 +139,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
pages++;
- } else if (is_swap_pte(oldpte)) {
+ } else if (pte_has_swap_entry(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
pte_t newpte;

diff --git a/mm/mremap.c b/mm/mremap.c
index 6934d199da54..cd9759ede04b 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -124,7 +124,7 @@ static pte_t move_soft_dirty_pte(pte_t pte)
#ifdef CONFIG_MEM_SOFT_DIRTY
if (pte_present(pte))
pte = pte_mksoft_dirty(pte);
- else if (is_swap_pte(pte))
+ else if (pte_has_swap_entry(pte))
pte = pte_swp_mksoft_dirty(pte);
#endif
return pte;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 86e3a3688d59..6b51759d9203 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -36,7 +36,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
* For more details on device private memory see HMM
* (include/linux/hmm.h or mm/hmm.c).
*/
- if (is_swap_pte(*pvmw->pte)) {
+ if (pte_has_swap_entry(*pvmw->pte)) {
swp_entry_t entry;

/* Handle un-addressable ZONE_DEVICE memory */
@@ -89,7 +89,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)

if (pvmw->flags & PVMW_MIGRATION) {
swp_entry_t entry;
- if (!is_swap_pte(*pvmw->pte))
+ if (!pte_has_swap_entry(*pvmw->pte))
return false;
entry = pte_to_swp_entry(*pvmw->pte);

@@ -97,7 +97,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
return false;

pfn = migration_entry_to_pfn(entry);
- } else if (is_swap_pte(*pvmw->pte)) {
+ } else if (pte_has_swap_entry(*pvmw->pte)) {
swp_entry_t entry;

/* Handle un-addressable ZONE_DEVICE memory */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 149e77454e3c..8aa4be074659 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1964,7 +1964,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
si = swap_info[type];
pte = pte_offset_map(pmd, addr);
do {
- if (!is_swap_pte(*pte))
+ if (!pte_has_swap_entry(*pte))
continue;

entry = pte_to_swp_entry(*pte);
--
2.26.2

2021-03-23 00:55:02

by Peter Xu

Subject: [PATCH 05/23] shmem/userfaultfd: Handle uffd-wp special pte in page fault handler

File-backed memory is prone to being unmapped or swapped out, so its ptes are
always unstable. This could lead to userfaultfd-wp information getting lost
when pages are unmapped or swapped out on such types of memory, for example,
shmem. To keep such information persistent, we will start to use the newly
introduced swap-like special ptes in place of a none pte when those ptes are
removed.

Prepare for this by handling such a special pte first, before it is actually
installed anywhere. A new fault flag, FAULT_FLAG_UFFD_WP, is introduced here.
When this flag is set, it means the current fault is resolving a page access
(either read or write) to the uffd-wp special pte.

Handling a page fault on this special pte is similar to handling a missing
fault, but it should happen after the pte-missing logic, since the special
pte is designed to be a swap-like pte. Meanwhile, it should be handled before
do_swap_page(), so that the swap core logic won't be confused by seeing such
an illegal swap pte.

This is a slow path of uffd-wp handling, because unmapping of wr-protected
shmem ptes should be rare. So far it should only trigger under two
conditions:

(1) When punching holes in shmem_fallocate(): there is a pre-unmap
optimization before evicting the page, which creates unmapped shmem ptes
covering wr-protected pages.

(2) When shmem pages are swapped out.

Because of this, the page fault handling is simplified too, by always
assuming it's a read fault when calling do_fault(). When it's actually a
write fault, it'll fault again when the page access is retried, and then
do_wp_page() will handle the rest of the message generation and delivery to
the userfaultfd.

Disable fault-around for such a special page fault, because the newly
introduced flag (FAULT_FLAG_UFFD_WP) only applies to the current pte rather
than to all the pages around it. Doing fault-around with the new flag could
wrongly tag the surrounding pages when installing ptes from the page cache on
a cache hit.

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 2 +
include/linux/userfaultfd_k.h | 11 ++++
mm/memory.c | 103 +++++++++++++++++++++++++++++++---
3 files changed, 107 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cb1e191da319..b78306eb7a63 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -441,6 +441,7 @@ extern pgprot_t protection_map[16];
* @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
* @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
* @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_UFFD_WP: When install new page entries, set uffd-wp bit.
*
* About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
* whether we would allow page faults to retry by specifying these two
@@ -471,6 +472,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_REMOTE 0x80
#define FAULT_FLAG_INSTRUCTION 0x100
#define FAULT_FLAG_INTERRUPTIBLE 0x200
+#define FAULT_FLAG_UFFD_WP 0x400

/*
* The default fault flags that should be used by most of the
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index bc733512c690..fefebe6e9656 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -89,6 +89,17 @@ static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
}

+/*
+ * Don't do fault around for FAULT_FLAG_UFFD_WP because it means we want to
+ * recover a previously wr-protected pte. This flag is a per-pte information,
+ * so it could confuse all the pages around the current page when faulted in.
+ * Similar reason for MINOR mode faults.
+ */
+static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
+{
+ return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
+}
+
static inline bool userfaultfd_missing(struct vm_area_struct *vma)
{
return vma->vm_flags & VM_UFFD_MISSING;
diff --git a/mm/memory.c b/mm/memory.c
index 8c4ed1f9693c..b4ddba343abc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3775,6 +3775,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
{
struct vm_area_struct *vma = vmf->vma;
+ bool uffd_wp = vmf->flags & FAULT_FLAG_UFFD_WP;
bool write = vmf->flags & FAULT_FLAG_WRITE;
bool prefault = vmf->address != addr;
pte_t entry;
@@ -3787,6 +3788,11 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)

if (write)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (uffd_wp) {
+ /* This should only be triggered by a read fault */
+ WARN_ON_ONCE(write);
+ entry = pte_mkuffd_wp(pte_wrprotect(entry));
+ }
/* copy-on-write page */
if (write && !(vma->vm_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
@@ -3816,6 +3822,7 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
*/
vm_fault_t finish_fault(struct vm_fault *vmf)
{
+ bool pte_stable, uffd_wp = vmf->flags & FAULT_FLAG_UFFD_WP;
struct vm_area_struct *vma = vmf->vma;
struct page *page;
vm_fault_t ret;
@@ -3854,8 +3861,14 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
ret = 0;
+
+ if (unlikely(uffd_wp))
+ pte_stable = pte_swp_uffd_wp_special(*vmf->pte);
+ else
+ pte_stable = pte_none(*vmf->pte);
+
/* Re-check under ptl */
- if (likely(pte_none(*vmf->pte)))
+ if (likely(pte_stable))
do_set_pte(vmf, page, vmf->address);
else
ret = VM_FAULT_NOPAGE;
@@ -3959,9 +3972,21 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
return vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
}

+/* Return true if we should do read fault-around, false otherwise */
+static inline bool should_fault_around(struct vm_fault *vmf)
+{
+ /* No ->map_pages? No way to fault around... */
+ if (!vmf->vma->vm_ops->map_pages)
+ return false;
+
+ if (uffd_disable_fault_around(vmf->vma))
+ return false;
+
+ return fault_around_bytes >> PAGE_SHIFT > 1;
+}
+
static vm_fault_t do_read_fault(struct vm_fault *vmf)
{
- struct vm_area_struct *vma = vmf->vma;
vm_fault_t ret = 0;

/*
@@ -3969,12 +3994,10 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
* if page by the offset is not ready to be mapped (cold cache or
* something).
*/
- if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
- if (likely(!userfaultfd_minor(vmf->vma))) {
- ret = do_fault_around(vmf);
- if (ret)
- return ret;
- }
+ if (should_fault_around(vmf)) {
+ ret = do_fault_around(vmf);
+ if (ret)
+ return ret;
}

ret = __do_fault(vmf);
@@ -4284,6 +4307,68 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
return VM_FAULT_FALLBACK;
}

+static vm_fault_t uffd_wp_clear_special(struct vm_fault *vmf)
+{
+ vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ /*
+ * Be careful so that we will only recover a special uffd-wp pte into a
+ * none pte. Otherwise it means the pte could have changed, so retry.
+ */
+ if (pte_swp_uffd_wp_special(*vmf->pte))
+ pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
+}
+
+/*
+ * This is actually a page-missing access, but with uffd-wp special pte
+ * installed. It means this pte was wr-protected before being unmapped.
+ */
+static vm_fault_t uffd_wp_handle_special(struct vm_fault *vmf)
+{
+ /* Careful! vmf->pte unmapped after return */
+ if (!pte_unmap_same(vmf))
+ return 0;
+
+ /*
+ * Just in case there're leftover special ptes even after the region
+ * got unregistered - we can simply clear them.
+ */
+ if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
+ return uffd_wp_clear_special(vmf);
+
+ /*
+ * Tell all the rest of the fault code: we're handling a special pte,
+ * always remember to arm the uffd-wp bit when installing the new pte.
+ */
+ vmf->flags |= FAULT_FLAG_UFFD_WP;
+
+ /*
+ * Let's assume this is a read fault no matter what. If it is a real
+ * write access, it'll fault again into do_wp_page() where the message
+ * will be generated before the thread yields itself.
+ *
+ * Ideally we can also handle write immediately before return, but this
+ * should be a slow path already (pte unmapped), so be simple first.
+ */
+ vmf->flags &= ~FAULT_FLAG_WRITE;
+
+ return do_fault(vmf);
+}
+
+static vm_fault_t do_swap_pte(struct vm_fault *vmf)
+{
+ /*
+ * We need to handle special swap ptes before handling ptes that
+ * contain swap entries, always.
+ */
+ if (unlikely(pte_swp_uffd_wp_special(vmf->orig_pte)))
+ return uffd_wp_handle_special(vmf);
+
+ return do_swap_page(vmf);
+}
+
/*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most
@@ -4358,7 +4443,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
}

if (!pte_present(vmf->orig_pte))
- return do_swap_page(vmf);
+ return do_swap_pte(vmf);

if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
--
2.26.2

2021-03-23 00:55:05

by Peter Xu

[permalink] [raw]
Subject: [PATCH 15/23] hugetlb/userfaultfd: Hook page faults for uffd write protection

Hook up hugetlbfs_fault() with the capability to handle userfaultfd-wp faults.

We do this slightly earlier than hugetlb_cow() so that we can avoid taking some
extra locks that we definitely don't need.
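
Not part of the patch, but for reference, here is a minimal userspace sketch
of the flow this enables. It assumes the whole series is applied, that a 2MB
hugetlb pool is configured (vm.nr_hugepages), and omits error handling:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

static size_t len = 2UL << 20;          /* assumes 2MB default hugepage size */
static char *page;

static void *writer(void *arg)
{
        (void)arg;
        page[0] = 2;    /* blocks in hugetlb_fault() until resolved below */
        return NULL;
}

int main(void)
{
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = { .api = UFFD_API };
        struct uffdio_register reg = { .mode = UFFDIO_REGISTER_MODE_WP };
        struct uffdio_writeprotect wp = { .mode = UFFDIO_WRITEPROTECT_MODE_WP };
        struct uffd_msg msg;
        pthread_t thr;

        page = mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        ioctl(uffd, UFFDIO_API, &api);
        page[0] = 1;                            /* fault the huge page in */

        reg.range.start = (unsigned long)page;
        reg.range.len = len;
        ioctl(uffd, UFFDIO_REGISTER, &reg);     /* -EINVAL without the series */

        wp.range = reg.range;
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);  /* wr-protect the whole range */

        pthread_create(&thr, NULL, writer, NULL);
        read(uffd, &msg, sizeof(msg));          /* wait for the wp fault */
        if (msg.event == UFFD_EVENT_PAGEFAULT &&
            (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
                printf("wp fault at 0x%llx\n",
                       (unsigned long long)msg.arg.pagefault.address);

        wp.mode = 0;                            /* resolve the protection... */
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);  /* ...which wakes the writer */
        pthread_join(thr, NULL);
        return 0;
}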

Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 56b78a206913..def2c7ddf3ae 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4643,6 +4643,25 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
goto out_ptl;

+ /* Handle userfault-wp first, before trying to lock more pages */
+ if (userfaultfd_pte_wp(vma, huge_ptep_get(ptep)) &&
+ (flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
+ struct vm_fault vmf = {
+ .vma = vma,
+ .address = haddr,
+ .flags = flags,
+ };
+
+ spin_unlock(ptl);
+ if (pagecache_page) {
+ unlock_page(pagecache_page);
+ put_page(pagecache_page);
+ }
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
+ return handle_userfault(&vmf, VM_UFFD_WP);
+ }
+
/*
* hugetlb_cow() requires page locks of pte_page(entry) and
* pagecache_page, so here we need take the former one
--
2.26.2

2021-03-23 00:55:09

by Peter Xu

[permalink] [raw]
Subject: [PATCH 16/23] hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP

Firstly, pass the wp_copy variable into hugetlb_mcopy_atomic_pte() throughout
the stack. Then, apply the UFFD_WP bit if UFFDIO_COPY_MODE_WP is specified
with UFFDIO_COPY. Introduce huge_pte_mkuffd_wp() for it.

Note that similar to how we've handled shmem, we'd better keep setting the
dirty bit even if UFFDIO_COPY_MODE_WP is provided, so that the core mm will
know this page contains valid data and never drop it.
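
Not part of the patch - a sketch of the matching userspace call, assuming
"uffd" is a userfaultfd with the destination hugetlbfs range registered using
UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP:

#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Resolve a MISSING fault at dst by copying one huge page from src, and
 * atomically leave it write-protected so that the first write will still
 * generate a uffd-wp message.
 */
static int copy_huge_page_wrprotected(int uffd, unsigned long dst,
                                      unsigned long src,
                                      unsigned long hpage_size)
{
        struct uffdio_copy copy = {
                .dst = dst,
                .src = src,
                .len = hpage_size,
                .mode = UFFDIO_COPY_MODE_WP,    /* arms the uffd-wp bit */
        };

        return ioctl(uffd, UFFDIO_COPY, &copy);
}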

Signed-off-by: Peter Xu <[email protected]>
---
include/asm-generic/hugetlb.h | 5 +++++
include/linux/hugetlb.h | 6 ++++--
mm/hugetlb.c | 22 +++++++++++++++++-----
mm/userfaultfd.c | 12 ++++++++----
4 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index 8e1e6244a89d..548212eccbd6 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -27,6 +27,11 @@ static inline pte_t huge_pte_mkdirty(pte_t pte)
return pte_mkdirty(pte);
}

+static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
+{
+ return pte_mkuffd_wp(pte);
+}
+
static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
{
return pte_modify(pte, newprot);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a7f7d5f328dc..ef8d2b8427b1 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -141,7 +141,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
unsigned long dst_addr,
unsigned long src_addr,
enum mcopy_atomic_mode mode,
- struct page **pagep);
+ struct page **pagep,
+ bool wp_copy);
#endif /* CONFIG_USERFAULTFD */
bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
struct vm_area_struct *vma,
@@ -321,7 +322,8 @@ static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
unsigned long dst_addr,
unsigned long src_addr,
enum mcopy_atomic_mode mode,
- struct page **pagep)
+ struct page **pagep,
+ bool wp_copy)
{
BUG();
return 0;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index def2c7ddf3ae..f0e55b341ebd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4725,7 +4725,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
unsigned long dst_addr,
unsigned long src_addr,
enum mcopy_atomic_mode mode,
- struct page **pagep)
+ struct page **pagep,
+ bool wp_copy)
{
bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
struct address_space *mapping;
@@ -4822,17 +4823,28 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
}

- /* For CONTINUE on a non-shared VMA, don't set VM_WRITE for CoW. */
- if (is_continue && !vm_shared)
+ /*
+ * For either: (1) CONTINUE on a non-shared VMA, or (2) UFFDIO_COPY
+ * with wp flag set, don't set pte write bit.
+ */
+ if (wp_copy || (is_continue && !vm_shared))
writable = 0;
else
writable = dst_vma->vm_flags & VM_WRITE;

_dst_pte = make_huge_pte(dst_vma, page, writable);
- if (writable)
- _dst_pte = huge_pte_mkdirty(_dst_pte);
+ /*
+ * Always mark UFFDIO_COPY page dirty; note that this may not be
+ * extremely important for hugetlbfs for now since swapping is not
+ * supported, but we should still make it clear that this page cannot be
+ * thrown away at will, even if the write bit is not set.
+ */
+ _dst_pte = huge_pte_mkdirty(_dst_pte);
_dst_pte = pte_mkyoung(_dst_pte);

+ if (wp_copy)
+ _dst_pte = huge_pte_mkuffd_wp(_dst_pte);
+
set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);

(void)huge_ptep_set_access_flags(dst_vma, dst_addr, dst_pte, _dst_pte,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0963e0d9ed20..78471ae3d25c 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -207,7 +207,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long src_start,
unsigned long len,
- enum mcopy_atomic_mode mode)
+ enum mcopy_atomic_mode mode,
+ bool wp_copy)
{
int vm_alloc_shared = dst_vma->vm_flags & VM_SHARED;
int vm_shared = dst_vma->vm_flags & VM_SHARED;
@@ -304,7 +305,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
}

err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
- dst_addr, src_addr, mode, &page);
+ dst_addr, src_addr, mode, &page,
+ wp_copy);

mutex_unlock(&hugetlb_fault_mutex_table[hash]);
i_mmap_unlock_read(mapping);
@@ -406,7 +408,8 @@ extern ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long src_start,
unsigned long len,
- enum mcopy_atomic_mode mode);
+ enum mcopy_atomic_mode mode,
+ bool wp_copy);
#endif /* CONFIG_HUGETLB_PAGE */

static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
@@ -527,7 +530,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
*/
if (is_vm_hugetlb_page(dst_vma))
return __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
- src_start, len, mcopy_mode);
+ src_start, len, mcopy_mode,
+ wp_copy);

if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
goto out_unlock;
--
2.26.2

2021-03-23 00:55:36

by Peter Xu

[permalink] [raw]
Subject: [PATCH 11/23] shmem/userfaultfd: Allow wr-protect none pte for file-backed mem

File-backed memory differs from anonymous memory in that even if the pte is
missing, the data could still reside either in the file or in the page/swap
cache. So when wr-protecting a pte, we need to consider none ptes too.

We do that by installing the uffd-wp special swap pte as a marker. So when
there's a future write to the pte, the fault handler will go the special path
to first fault-in the page as read-only, then report to userfaultfd server with
the wr-protect message.

On the other hand, when unprotecting a page, it's also possible that the pte
got unmapped but replaced by the special uffd-wp marker. Then we'll need to
be able to recover the uffd-wp special swap pte into a none pte, so that the
next access to the page will fault in correctly as usual, rather than sending
a uffd-wp message.

Special care needs to be taken throughout the change_protection_range()
process. Since we now allow the user to wr-protect a none pte, we need to be
able to pre-populate the page table entries when we see a !anonymous &&
MM_CP_UFFD_WP request; otherwise change_protection_range() will simply skip
ranges whose pgtable entries do not exist.

Note that this patch only covers small pages (pte level), not transparent
huge pages yet; it will be the base for thp support too.
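
Not part of the patch, but to illustrate the user-visible effect: with this
change, wr-protecting a shmem range works even for pages that have never been
faulted in by this process, roughly like below (error handling omitted):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

int main(void)
{
        size_t len = 16 * 4096;
        /* MAP_SHARED|MAP_ANONYMOUS memory is shmem-backed */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = { .api = UFFD_API };
        struct uffdio_register reg = {
                .range = { .start = (unsigned long)p, .len = len },
                .mode = UFFDIO_REGISTER_MODE_WP,
        };
        struct uffdio_writeprotect wp = {
                .range = { .start = (unsigned long)p, .len = len },
                .mode = UFFDIO_WRITEPROTECT_MODE_WP,
        };

        ioctl(uffd, UFFDIO_API, &api);
        ioctl(uffd, UFFDIO_REGISTER, &reg);
        /*
         * Nothing has been touched yet, so every pte in the range is still
         * none.  With this patch the ioctl below installs the uffd-wp
         * special ptes as markers instead of silently skipping the range.
         */
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
        return 0;
}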

Signed-off-by: Peter Xu <[email protected]>
---
mm/mprotect.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index b3def0a102bf..6b63e3544b47 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -29,6 +29,7 @@
#include <linux/uaccess.h>
#include <linux/mm_inline.h>
#include <linux/pgtable.h>
+#include <linux/userfaultfd_k.h>
#include <asm/cacheflush.h>
#include <asm/mmu_context.h>
#include <asm/tlbflush.h>
@@ -176,6 +177,32 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
set_pte_at(vma->vm_mm, addr, pte, newpte);
pages++;
}
+ } else if (unlikely(is_swap_special_pte(oldpte))) {
+ if (uffd_wp_resolve && !vma_is_anonymous(vma) &&
+ pte_swp_uffd_wp_special(oldpte)) {
+ /*
+ * This is uffd-wp special pte and we'd like to
+ * unprotect it. What we need to do is simply
+ * recover the pte into a none pte; the next
+ * page fault will fault in the page.
+ */
+ pte_clear(vma->vm_mm, addr, pte);
+ pages++;
+ }
+ } else {
+ /* It must be a none pte, or what else?.. */
+ WARN_ON_ONCE(!pte_none(oldpte));
+ if (unlikely(uffd_wp && !vma_is_anonymous(vma))) {
+ /*
+ * For file-backed mem, we need to be able to
+ * wr-protect even for a none pte! Because
+ * even if the pte is null, the page/swap cache
+ * could exist.
+ */
+ set_pte_at(vma->vm_mm, addr, pte,
+ pte_swp_mkuffd_wp_special(vma));
+ pages++;
+ }
}
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
@@ -209,6 +236,25 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
return 0;
}

+/*
+ * File-backed vma allows uffd wr-protect upon none ptes, because even if pte
+ * is missing, page/swap cache could exist. When that happens, the wr-protect
+ * information will be stored in the page table entries with the marker (e.g.,
+ * PTE_SWP_UFFD_WP_SPECIAL). Prepare for that by always populating the page
+ * tables to pte level, so that we'll install the markers in change_pte_range()
+ * where necessary.
+ *
+ * Note that we only need to do this in pmd level, because if pmd does not
+ * exist, it means the whole range covered by the pmd entry (of a pud) does not
+ * contain any valid data but all zeros. Then nothing to wr-protect.
+ */
+#define change_protection_prepare(vma, pmd, addr, cp_flags) \
+ do { \
+ if (unlikely((cp_flags & MM_CP_UFFD_WP) && pmd_none(*pmd) && \
+ !vma_is_anonymous(vma))) \
+ WARN_ON_ONCE(pte_alloc(vma->vm_mm, pmd)); \
+ } while (0)
+
static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
pud_t *pud, unsigned long addr, unsigned long end,
pgprot_t newprot, unsigned long cp_flags)
@@ -227,6 +273,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,

next = pmd_addr_end(addr, end);

+ change_protection_prepare(vma, pmd, addr, cp_flags);
+
/*
* Automatic NUMA balancing walks the tables with mmap_lock
* held for read. It's possible a parallel update to occur
--
2.26.2

2021-03-23 00:55:38

by Peter Xu

[permalink] [raw]
Subject: [PATCH 18/23] mm/hugetlb: Introduce huge version of special swap pte helpers

This is to prepare hugetlbfs to also recognize swap special ptes, such as the
uffd-wp special swap pte.

Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index fd3e87517e10..64e424b03774 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -93,6 +93,25 @@ static inline bool subpool_is_free(struct hugepage_subpool *spool)
return true;
}

+/*
+ * These are sister versions of is_swap_pte() and pte_has_swap_entry(). We
+ * need standalone ones because huge_pte_none() is handled differently from
+ * pte_none(). For more information, please refer to comments above
+ * is_swap_pte() and pte_has_swap_entry().
+ *
+ * Here we directly reuse the pte level of swap special ptes, for example, the
+ * pte_swp_uffd_wp_special(). It just stands for a huge page rather than a
+ * small page for hugetlbfs pages.
+ */
+static inline bool is_huge_swap_pte(pte_t pte)
+{
+ return !huge_pte_none(pte) && !pte_present(pte);
+}
+static inline bool huge_pte_has_swap_entry(pte_t pte)
+{
+ return is_huge_swap_pte(pte) && !is_swap_special_pte(pte);
+}
+
static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
{
spin_unlock(&spool->lock);
@@ -3726,7 +3745,7 @@ bool is_hugetlb_entry_migration(pte_t pte)
{
swp_entry_t swp;

- if (huge_pte_none(pte) || pte_present(pte))
+ if (!huge_pte_has_swap_entry(pte))
return false;
swp = pte_to_swp_entry(pte);
if (is_migration_entry(swp))
@@ -3739,7 +3758,7 @@ static bool is_hugetlb_entry_hwpoisoned(pte_t pte)
{
swp_entry_t swp;

- if (huge_pte_none(pte) || pte_present(pte))
+ if (!huge_pte_has_swap_entry(pte))
return false;
swp = pte_to_swp_entry(pte);
if (is_hwpoison_entry(swp))
--
2.26.2

2021-03-23 00:55:52

by Peter Xu

[permalink] [raw]
Subject: [PATCH 19/23] hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler

Teach the hugetlb page fault code to understand the uffd-wp special pte. For
example, when seeing such a pte we need to convert any write fault into a
read one (which is fake - we'll retry the write later if needed). Meanwhile,
for handle_userfault() we'll need to make sure we also wait on the special
swap pte, just like on a none pte.

Note that we also need to teach UFFDIO_COPY about this special pte across the
code path, so that we can safely install a new page over it as long as we
know it's just such a marker entry.

Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 5 ++++-
mm/hugetlb.c | 34 +++++++++++++++++++++++++++-------
mm/userfaultfd.c | 5 ++++-
3 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 72956f9cc892..f6fa34f58c37 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -245,8 +245,11 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
/*
* Lockless access: we're in a wait_event so it's ok if it
* changes under us.
+ *
+ * Regarding uffd-wp special case, please refer to comments in
+ * userfaultfd_must_wait().
*/
- if (huge_pte_none(pte))
+ if (huge_pte_none(pte) || pte_swp_uffd_wp_special(pte))
ret = true;
if (!huge_pte_write(pte) && (reason & VM_UFFD_WP))
ret = true;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 64e424b03774..448ef745d5ee 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4369,7 +4369,8 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
struct vm_area_struct *vma,
struct address_space *mapping, pgoff_t idx,
- unsigned long address, pte_t *ptep, unsigned int flags)
+ unsigned long address, pte_t *ptep,
+ pte_t old_pte, unsigned int flags)
{
struct hstate *h = hstate_vma(vma);
vm_fault_t ret = VM_FAULT_SIGBUS;
@@ -4493,7 +4494,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,

ptl = huge_pte_lock(h, mm, ptep);
ret = 0;
- if (!huge_pte_none(huge_ptep_get(ptep)))
+ if (!pte_same(huge_ptep_get(ptep), old_pte))
goto backout;

if (anon_rmap) {
@@ -4503,6 +4504,11 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
page_dup_rmap(page, true);
new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
&& (vma->vm_flags & VM_SHARED)));
+ if (unlikely(flags & FAULT_FLAG_UFFD_WP)) {
+ WARN_ON_ONCE(flags & FAULT_FLAG_WRITE);
+ /* We should have the write bit cleared already, but be safe */
+ new_pte = huge_pte_wrprotect(huge_pte_mkuffd_wp(new_pte));
+ }
set_huge_pte_at(mm, haddr, ptep, new_pte);

hugetlb_count_add(pages_per_huge_page(h), mm);
@@ -4584,9 +4590,16 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(is_hugetlb_entry_migration(entry))) {
migration_entry_wait_huge(vma, mm, ptep);
return 0;
- } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
+ } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) {
return VM_FAULT_HWPOISON_LARGE |
VM_FAULT_SET_HINDEX(hstate_index(h));
+ } else if (unlikely(is_swap_special_pte(entry))) {
+ /* Must be a uffd-wp special swap pte */
+ WARN_ON_ONCE(!pte_swp_uffd_wp_special(entry));
+ flags |= FAULT_FLAG_UFFD_WP;
+ /* Emulate a read fault */
+ flags &= ~FAULT_FLAG_WRITE;
+ }
}

/*
@@ -4618,8 +4631,13 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
mutex_lock(&hugetlb_fault_mutex_table[hash]);

entry = huge_ptep_get(ptep);
- if (huge_pte_none(entry)) {
- ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
+ /*
+ * The uffd-wp special pte should be handled the same way as a none pte
+ * here, because it's basically a none pte with a special marker attached
+ */
+ if (huge_pte_none(entry) || pte_swp_uffd_wp_special(entry)) {
+ ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+ entry, flags);
goto out_mutex;
}

@@ -4753,7 +4771,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
unsigned long size;
int vm_shared = dst_vma->vm_flags & VM_SHARED;
struct hstate *h = hstate_vma(dst_vma);
- pte_t _dst_pte;
+ pte_t _dst_pte, cur_pte;
spinlock_t *ptl;
int ret;
struct page *page;
@@ -4831,8 +4849,10 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
if (idx >= size)
goto out_release_unlock;

+ cur_pte = huge_ptep_get(dst_pte);
ret = -EEXIST;
- if (!huge_pte_none(huge_ptep_get(dst_pte)))
+ /* Please refer to shmem_mfill_atomic_pte() for uffd-wp special case */
+ if (!huge_pte_none(cur_pte) && !pte_swp_uffd_wp_special(cur_pte))
goto out_release_unlock;

if (vm_shared) {
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 01170197a3d7..a2b0dcc80a19 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -274,6 +274,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
}

while (src_addr < src_start + len) {
+ pte_t pteval;
+
BUG_ON(dst_addr >= dst_start + len);

/*
@@ -296,8 +298,9 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
goto out_unlock;
}

+ pteval = huge_ptep_get(dst_pte);
if (mode != MCOPY_ATOMIC_CONTINUE &&
- !huge_pte_none(huge_ptep_get(dst_pte))) {
+ !huge_pte_none(pteval) && !pte_swp_uffd_wp_special(pteval)) {
err = -EEXIST;
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
i_mmap_unlock_read(mapping);
--
2.26.2

2021-03-23 00:55:52

by Peter Xu

[permalink] [raw]
Subject: [PATCH 20/23] hugetlb/userfaultfd: Allow wr-protect none ptes

Teach the hugetlbfs code to wr-protect none ptes just in case the page cache
exists for that pte. Meanwhile we also need to be able to recognize a uffd-wp
marker pte and remove it for uffd_wp_resolve.

While at it, introduce a variable "psize" to replace all references to the
huge page size fetcher.
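
Not part of the patch - a sketch of the matching userspace resolve call;
after it, the marker ptes become none ptes again and the next access faults
in as usual. It assumes "uffd" is registered in wp mode over the hugetlb
range:

#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Undo a previous write-protect; clears uffd-wp markers on none ptes too. */
static int hugetlb_unprotect(int uffd, unsigned long start, unsigned long len)
{
        struct uffdio_writeprotect wp = {
                .range = { .start = start, .len = len },
                .mode = 0,      /* 0 == resolve, i.e. remove write protection */
        };

        return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}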

Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 29 +++++++++++++++++++++++++----
1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 448ef745d5ee..d4acf9d9d087 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5110,7 +5110,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
pte_t *ptep;
pte_t pte;
struct hstate *h = hstate_vma(vma);
- unsigned long pages = 0;
+ unsigned long pages = 0, psize = huge_page_size(h);
bool shared_pmd = false;
struct mmu_notifier_range range;
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
@@ -5130,13 +5130,19 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,

mmu_notifier_invalidate_range_start(&range);
i_mmap_lock_write(vma->vm_file->f_mapping);
- for (; address < end; address += huge_page_size(h)) {
+ for (; address < end; address += psize) {
spinlock_t *ptl;
- ptep = huge_pte_offset(mm, address, huge_page_size(h));
+ ptep = huge_pte_offset(mm, address, psize);
if (!ptep)
continue;
ptl = huge_pte_lock(h, mm, ptep);
if (huge_pmd_unshare(mm, vma, &address, ptep)) {
+ /*
+ * When uffd-wp is enabled on the vma, unshare
+ * shouldn't happen at all. Warn about it if it
+ * happened due to some reason.
+ */
+ WARN_ON_ONCE(uffd_wp || uffd_wp_resolve);
pages++;
spin_unlock(ptl);
shared_pmd = true;
@@ -5160,12 +5166,21 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
else if (uffd_wp_resolve)
newpte = pte_swp_clear_uffd_wp(newpte);
set_huge_swap_pte_at(mm, address, ptep,
- newpte, huge_page_size(h));
+ newpte, psize);
pages++;
}
spin_unlock(ptl);
continue;
}
+ if (unlikely(is_swap_special_pte(pte))) {
+ WARN_ON_ONCE(!pte_swp_uffd_wp_special(pte));
+ /*
+ * This is changing a non-present pte into a none pte,
+ * no need for huge_ptep_modify_prot_start/commit().
+ */
+ if (uffd_wp_resolve)
+ huge_pte_clear(mm, address, ptep, psize);
+ }
if (!huge_pte_none(pte)) {
pte_t old_pte;

@@ -5178,6 +5193,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
pte = huge_pte_clear_uffd_wp(pte);
huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
pages++;
+ } else {
+ /* None pte */
+ if (unlikely(uffd_wp))
+ /* Safe to modify directly (none->non-present). */
+ set_huge_pte_at(mm, address, ptep,
+ pte_swp_mkuffd_wp_special(vma));
}
spin_unlock(ptl);
}
--
2.26.2

2021-03-23 00:55:53

by Peter Xu

[permalink] [raw]
Subject: [PATCH 21/23] hugetlb/userfaultfd: Only drop uffd-wp special pte if required

Just like what we've done with the shmem uffd-wp special ptes, we shouldn't
drop the uffd-wp special swap pte for hugetlb either, unless we're unmapping
the whole vma, or punching a hole with safe locks held.

For example, remove_inode_hugepages() is safe to drop uffd-wp ptes, because
it has taken the hugetlb fault mutex so that no concurrent page fault can
trigger, while the call to hugetlb_vmdelete_list() in hugetlbfs_punch_hole()
is not safe. That's why the former call passes ZAP_FLAG_DROP_FILE_UFFD_WP,
while the latter one can't.

Signed-off-by: Peter Xu <[email protected]>
---
fs/hugetlbfs/inode.c | 15 +++++++++------
include/linux/hugetlb.h | 13 ++++++++-----
mm/hugetlb.c | 27 +++++++++++++++++++++------
mm/memory.c | 5 ++++-
4 files changed, 42 insertions(+), 18 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index d81f52b87bd7..5fe19e801a2b 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -399,7 +399,8 @@ static void remove_huge_page(struct page *page)
}

static void
-hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
+hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
+ unsigned long zap_flags)
{
struct vm_area_struct *vma;

@@ -432,7 +433,7 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
}

unmap_hugepage_range(vma, vma->vm_start + v_offset, v_end,
- NULL);
+ NULL, zap_flags);
}
}

@@ -513,7 +514,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
mutex_lock(&hugetlb_fault_mutex_table[hash]);
hugetlb_vmdelete_list(&mapping->i_mmap,
index * pages_per_huge_page(h),
- (index + 1) * pages_per_huge_page(h));
+ (index + 1) * pages_per_huge_page(h),
+ ZAP_FLAG_DROP_FILE_UFFD_WP);
i_mmap_unlock_write(mapping);
}

@@ -579,7 +581,8 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
i_mmap_lock_write(mapping);
i_size_write(inode, offset);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
- hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
+ hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
+ ZAP_FLAG_DROP_FILE_UFFD_WP);
i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, offset, LLONG_MAX);
}
@@ -612,8 +615,8 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
hugetlb_vmdelete_list(&mapping->i_mmap,
- hole_start >> PAGE_SHIFT,
- hole_end >> PAGE_SHIFT);
+ hole_start >> PAGE_SHIFT,
+ hole_end >> PAGE_SHIFT, 0);
i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, hole_start, hole_end);
inode_unlock(inode);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 92710600596e..4047fa042782 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -121,14 +121,15 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
unsigned long *, unsigned long *, long, unsigned int,
int *);
void unmap_hugepage_range(struct vm_area_struct *,
- unsigned long, unsigned long, struct page *);
+ unsigned long, unsigned long, struct page *,
+ unsigned long);
void __unmap_hugepage_range_final(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long start, unsigned long end,
- struct page *ref_page);
+ struct page *ref_page, unsigned long zap_flags);
void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
unsigned long start, unsigned long end,
- struct page *ref_page);
+ struct page *ref_page, unsigned long zap_flags);
void hugetlb_report_meminfo(struct seq_file *);
int hugetlb_report_node_meminfo(char *buf, int len, int nid);
void hugetlb_show_meminfo(void);
@@ -361,14 +362,16 @@ static inline unsigned long hugetlb_change_protection(

static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page)
+ unsigned long end, struct page *ref_page,
+ unsigned long zap_flags)
{
BUG();
}

static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page)
+ unsigned long end, struct page *ref_page,
+ unsigned long zap_flags)
{
BUG();
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d4acf9d9d087..deeae6d40dad 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3936,7 +3936,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,

void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
unsigned long start, unsigned long end,
- struct page *ref_page)
+ struct page *ref_page, unsigned long zap_flags)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -3988,6 +3988,19 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
continue;
}

+ if (unlikely(is_swap_special_pte(pte))) {
+ WARN_ON_ONCE(!pte_swp_uffd_wp_special(pte));
+ /*
+ * Only drop the special swap uffd-wp pte if
+ * e.g. unmapping a vma or punching a hole (with proper
+ * lock held so that concurrent page fault won't happen).
+ */
+ if (zap_flags & ZAP_FLAG_DROP_FILE_UFFD_WP)
+ huge_pte_clear(mm, address, ptep, sz);
+ spin_unlock(ptl);
+ continue;
+ }
+
/*
* Migrating hugepage or HWPoisoned hugepage is already
* unmapped and its refcount is dropped, so just clear pte here.
@@ -4039,9 +4052,10 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,

void __unmap_hugepage_range_final(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page)
+ unsigned long end, struct page *ref_page,
+ unsigned long zap_flags)
{
- __unmap_hugepage_range(tlb, vma, start, end, ref_page);
+ __unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);

/*
* Clear this flag so that x86's huge_pmd_share page_table_shareable
@@ -4057,12 +4071,13 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
}

void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page)
+ unsigned long end, struct page *ref_page,
+ unsigned long zap_flags)
{
struct mmu_gather tlb;

tlb_gather_mmu(&tlb, vma->vm_mm);
- __unmap_hugepage_range(&tlb, vma, start, end, ref_page);
+ __unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);
tlb_finish_mmu(&tlb);
}

@@ -4117,7 +4132,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
*/
if (!is_vma_resv_set(iter_vma, HPAGE_RESV_OWNER))
unmap_hugepage_range(iter_vma, address,
- address + huge_page_size(h), page);
+ address + huge_page_size(h), page, 0);
}
i_mmap_unlock_write(mapping);
}
diff --git a/mm/memory.c b/mm/memory.c
index 766946d3eab0..4bf7f8e83733 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1515,8 +1515,11 @@ static void unmap_single_vma(struct mmu_gather *tlb,
* safe to do nothing in this case.
*/
if (vma->vm_file) {
+ unsigned long zap_flags = details ?
+ details->zap_flags : 0;
i_mmap_lock_write(vma->vm_file->f_mapping);
- __unmap_hugepage_range_final(tlb, vma, start, end, NULL);
+ __unmap_hugepage_range_final(tlb, vma, start, end,
+ NULL, zap_flags);
i_mmap_unlock_write(vma->vm_file->f_mapping);
}
} else
--
2.26.2

2021-03-23 00:56:00

by Peter Xu

[permalink] [raw]
Subject: [PATCH 22/23] mm/userfaultfd: Enable write protection for shmem & hugetlbfs

We've had all the necessary changes ready for both shmem and hugetlbfs. Turn
on all the shmem/hugetlbfs switches for userfaultfd-wp.

Now we can remove the flags parameter for vma_can_userfault() since it's not
used any more. Meanwhile, we can expand UFFD_API_RANGE_IOCTLS_BASIC with
_UFFDIO_WRITEPROTECT too, because all existing types now support the write
protection mode.

Since vma_can_userfault() will be used elsewhere, move into userfaultfd_k.h.
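
Not part of the patch - a sketch of how an application could probe for the
new capability before relying on it; UFFD_FEATURE_WP_HUGETLBFS_SHMEM is the
feature bit added below:

#include <fcntl.h>
#include <stdbool.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Returns true if uffd-wp works on shmem and hugetlbfs on this kernel. */
static bool uffd_wp_file_backed_supported(void)
{
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = {
                .api = UFFD_API,
                .features = UFFD_FEATURE_WP_HUGETLBFS_SHMEM,
        };
        bool ret;

        if (uffd < 0)
                return false;
        /* UFFDIO_API fails if a requested feature is unknown to the kernel */
        ret = !ioctl(uffd, UFFDIO_API, &api) &&
              (api.features & UFFD_FEATURE_WP_HUGETLBFS_SHMEM);
        close(uffd);
        return ret;
}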

Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 18 ------------------
include/linux/userfaultfd_k.h | 14 ++++++++++++++
include/uapi/linux/userfaultfd.h | 7 +++++--
mm/userfaultfd.c | 10 +++-------
4 files changed, 22 insertions(+), 27 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index f6fa34f58c37..b4f30fe84aa3 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1275,24 +1275,6 @@ static __always_inline int validate_range(struct mm_struct *mm,
return 0;
}

-static inline bool vma_can_userfault(struct vm_area_struct *vma,
- unsigned long vm_flags)
-{
- /* FIXME: add WP support to hugetlbfs and shmem */
- if (vm_flags & VM_UFFD_WP) {
- if (is_vm_hugetlb_page(vma) || vma_is_shmem(vma))
- return false;
- }
-
- if (vm_flags & VM_UFFD_MINOR) {
- if (!(is_vm_hugetlb_page(vma) || vma_is_shmem(vma)))
- return false;
- }
-
- return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
- vma_is_shmem(vma);
-}
-
static int userfaultfd_register(struct userfaultfd_ctx *ctx,
unsigned long arg)
{
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index fefebe6e9656..413323fc81ca 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -16,6 +16,7 @@
#include <linux/fcntl.h>
#include <linux/mm.h>
#include <asm-generic/pgtable_uffd.h>
+#include <linux/hugetlb_inline.h>

/* The set of all possible UFFD-related VM flags. */
#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR)
@@ -132,6 +133,19 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
return vma->vm_flags & __VM_UFFD_FLAGS;
}

+static inline bool vma_can_userfault(struct vm_area_struct *vma,
+ unsigned long vm_flags)
+{
+ if (vm_flags & VM_UFFD_MINOR) {
+ if (!(is_vm_hugetlb_page(vma) || vma_is_shmem(vma)))
+ return false;
+ }
+
+ return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
+ vma_is_shmem(vma);
+}
+
+
extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
extern void dup_userfaultfd_complete(struct list_head *);

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 47d9790d863d..000cc4cfc787 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -32,7 +32,8 @@
UFFD_FEATURE_SIGBUS | \
UFFD_FEATURE_THREAD_ID | \
UFFD_FEATURE_MINOR_HUGETLBFS | \
- UFFD_FEATURE_MINOR_SHMEM)
+ UFFD_FEATURE_MINOR_SHMEM | \
+ UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
#define UFFD_API_IOCTLS \
((__u64)1 << _UFFDIO_REGISTER | \
(__u64)1 << _UFFDIO_UNREGISTER | \
@@ -46,7 +47,8 @@
#define UFFD_API_RANGE_IOCTLS_BASIC \
((__u64)1 << _UFFDIO_WAKE | \
(__u64)1 << _UFFDIO_COPY | \
- (__u64)1 << _UFFDIO_CONTINUE)
+ (__u64)1 << _UFFDIO_CONTINUE | \
+ (__u64)1 << _UFFDIO_WRITEPROTECT)

/*
* Valid ioctl command number range with this API is from 0x00 to
@@ -198,6 +200,7 @@ struct uffdio_api {
#define UFFD_FEATURE_THREAD_ID (1<<8)
#define UFFD_FEATURE_MINOR_HUGETLBFS (1<<9)
#define UFFD_FEATURE_MINOR_SHMEM (1<<10)
+#define UFFD_FEATURE_WP_HUGETLBFS_SHMEM (1<<11)
__u64 features;

__u64 ioctls;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index a2b0dcc80a19..e6c6095878bb 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -452,7 +452,6 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
break;
}
} else {
- VM_WARN_ON_ONCE(wp_copy);
err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
src_addr, mode, page, wp_copy);
}
@@ -683,15 +682,12 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,

err = -ENOENT;
dst_vma = find_dst_vma(dst_mm, start, len);
- /*
- * Make sure the vma is not shared, that the dst range is
- * both valid and fully within a single existing vma.
- */
- if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+
+ if (!dst_vma)
goto out_unlock;
if (!userfaultfd_wp(dst_vma))
goto out_unlock;
- if (!vma_is_anonymous(dst_vma))
+ if (!vma_can_userfault(dst_vma, dst_vma->vm_flags))
goto out_unlock;

if (is_vm_hugetlb_page(dst_vma)) {
--
2.26.2

2021-03-23 00:56:14

by Peter Xu

[permalink] [raw]
Subject: [PATCH 10/23] shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed

File-backed memory is prone to being unmapped at any time. It means all
information in the pte will be dropped, including the uffd-wp flag.

Since the uffd-wp info cannot be stored in page cache or swap cache, persist
this wr-protect information by installing the special uffd-wp marker pte when
we're going to unmap a uffd wr-protected pte. When the pte is accessed again,
we will know it's previously wr-protected by recognizing the special pte.

Meanwhile add a new flag ZAP_FLAG_DROP_FILE_UFFD_WP for when we don't want to
persist such information, for example when destroying the whole vma, or
punching a hole in a shmem file. For the latter, we can only drop the uffd-wp
bit while holding the page lock. It means the unmap_mapping_range() in
shmem_fallocate() still requires zapping without ZAP_FLAG_DROP_FILE_UFFD_WP,
because that path still races with the page faults.
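
Not part of the patch, but the user-visible guarantee is roughly the sketch
below (error handling omitted; MADV_DONTNEED stands in for any path that zaps
the ptes while the page stays in the page cache):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

int main(void)
{
        size_t len = 4096;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);  /* shmem-backed */
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = { .api = UFFD_API };
        struct uffdio_register reg = {
                .range = { .start = (unsigned long)p, .len = len },
                .mode = UFFDIO_REGISTER_MODE_WP,
        };
        struct uffdio_writeprotect wp = {
                .range = { .start = (unsigned long)p, .len = len },
                .mode = UFFDIO_WRITEPROTECT_MODE_WP,
        };

        ioctl(uffd, UFFDIO_API, &api);
        p[0] = 1;                       /* populate the pte and page cache */
        ioctl(uffd, UFFDIO_REGISTER, &reg);
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

        /* Zap the pte; the page itself stays in the shmem page cache. */
        madvise(p, len, MADV_DONTNEED);

        /*
         * With this patch the zap above leaves the uffd-wp marker behind, so
         * a later write to p[0] (from another thread) is still reported as a
         * UFFD_PAGEFAULT_FLAG_WP fault instead of the protection being lost.
         */
        return 0;
}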

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 11 ++++++++++
include/linux/mm_inline.h | 43 +++++++++++++++++++++++++++++++++++++++
mm/memory.c | 42 +++++++++++++++++++++++++++++++++++++-
mm/rmap.c | 8 ++++++++
mm/truncate.c | 8 +++++++-
5 files changed, 110 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d38cd23a08be..9cd215454cdb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1706,6 +1706,8 @@ extern void user_shm_unlock(size_t, struct user_struct *);
#define ZAP_FLAG_CHECK_MAPPING BIT(0)
/* Whether to skip zapping swap entries */
#define ZAP_FLAG_SKIP_SWAP BIT(1)
+/* Whether to completely drop uffd-wp entries for file-backed memory */
+#define ZAP_FLAG_DROP_FILE_UFFD_WP BIT(2)

/*
* Parameter block passed down to zap_pte_range in exceptional cases.
@@ -1738,6 +1740,15 @@ zap_skip_swap(struct zap_details *details)
return details->zap_flags & ZAP_FLAG_SKIP_SWAP;
}

+static inline bool
+zap_drop_file_uffd_wp(struct zap_details *details)
+{
+ if (!details)
+ return false;
+
+ return details->zap_flags & ZAP_FLAG_DROP_FILE_UFFD_WP;
+}
+
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 355ea1ee32bd..c29a6ef3a642 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -4,6 +4,8 @@

#include <linux/huge_mm.h>
#include <linux/swap.h>
+#include <linux/userfaultfd_k.h>
+#include <linux/swapops.h>

/**
* page_is_file_lru - should the page be on a file LRU or anon LRU?
@@ -104,4 +106,45 @@ static __always_inline void del_page_from_lru_list(struct page *page,
update_lru_size(lruvec, page_lru(page), page_zonenum(page),
-thp_nr_pages(page));
}
+
+/*
+ * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
+ * replace a none pte. NOTE! This should only be called when *pte is already
+ * cleared so we will never accidentally replace something valuable. Meanwhile
+ * none pte also means we are not demoting the pte so if tlb flushed then we
+ * don't need to do it again; otherwise if tlb flush is postponed then it's
+ * even better.
+ *
+ * Must be called with pgtable lock held.
+ */
+static inline void
+pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *pte, pte_t pteval)
+{
+#ifdef CONFIG_USERFAULTFD
+ bool arm_uffd_pte = false;
+
+ /* The current status of the pte should be "cleared" before calling */
+ WARN_ON_ONCE(!pte_none(*pte));
+
+ if (vma_is_anonymous(vma))
+ return;
+
+ /* A uffd-wp wr-protected normal pte */
+ if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
+ arm_uffd_pte = true;
+
+ /*
+ * A uffd-wp wr-protected swap pte. Note: this should even work for
+ * pte_swp_uffd_wp_special() too.
+ */
+ if (unlikely(is_swap_pte(pteval) && pte_swp_uffd_wp(pteval)))
+ arm_uffd_pte = true;
+
+ if (unlikely(arm_uffd_pte))
+ set_pte_at(vma->vm_mm, addr, pte,
+ pte_swp_mkuffd_wp_special(vma));
+#endif
+}
+
#endif
diff --git a/mm/memory.c b/mm/memory.c
index 36204b898894..8be28bcaa044 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -73,6 +73,7 @@
#include <linux/perf_event.h>
#include <linux/ptrace.h>
#include <linux/vmalloc.h>
+#include <linux/mm_inline.h>

#include <trace/events/kmem.h>

@@ -1210,6 +1211,21 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
return ret;
}

+/*
+ * This function makes sure that we'll replace the none pte with an uffd-wp
+ * swap special pte marker when necessary. Must be with the pgtable lock held.
+ */
+static inline void
+zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *pte,
+ struct zap_details *details, pte_t pteval)
+{
+ if (zap_drop_file_uffd_wp(details))
+ return;
+
+ pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
+}
+
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
@@ -1247,6 +1263,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
tlb_remove_tlb_entry(tlb, pte, addr);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, details,
+ ptent);
if (unlikely(!page))
continue;

@@ -1271,6 +1289,22 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
continue;
}

+ /*
+ * If this is a special uffd-wp marker pte... Drop it only if
+ * enforced to do so.
+ */
+ if (unlikely(is_swap_special_pte(ptent))) {
+ WARN_ON_ONCE(!pte_swp_uffd_wp_special(ptent));
+ /*
+ * If this is a common unmap of ptes, keep this as is.
+ * Drop it only if this is a whole-vma destruction.
+ */
+ if (zap_drop_file_uffd_wp(details))
+ ptep_get_and_clear_full(mm, addr, pte,
+ tlb->fullmm);
+ continue;
+ }
+
entry = pte_to_swp_entry(ptent);
if (is_device_private_entry(entry)) {
struct page *page = device_private_entry_to_page(entry);
@@ -1281,6 +1315,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
rss[mm_counter(page)]--;
page_remove_rmap(page, false);
put_page(page);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, details,
+ ptent);
continue;
}

@@ -1298,6 +1334,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (unlikely(!free_swap_and_cache(entry)))
print_bad_pte(vma, addr, ptent, NULL);
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
} while (pte++, addr += PAGE_SIZE, addr != end);

add_mm_rss_vec(mm, rss);
@@ -1497,12 +1534,15 @@ void unmap_vmas(struct mmu_gather *tlb,
unsigned long end_addr)
{
struct mmu_notifier_range range;
+ struct zap_details details = {
+ .zap_flags = ZAP_FLAG_DROP_FILE_UFFD_WP,
+ };

mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
start_addr, end_addr);
mmu_notifier_invalidate_range_start(&range);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
- unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
+ unmap_single_vma(tlb, vma, start_addr, end_addr, &details);
mmu_notifier_invalidate_range_end(&range);
}

diff --git a/mm/rmap.c b/mm/rmap.c
index b0fc27e77d6d..5e25c57164fc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -72,6 +72,7 @@
#include <linux/page_idle.h>
#include <linux/memremap.h>
#include <linux/userfaultfd_k.h>
+#include <linux/mm_inline.h>

#include <asm/tlbflush.h>

@@ -1571,6 +1572,13 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
pteval = ptep_clear_flush(vma, address, pvmw.pte);
}

+ /*
+ * Now the pte is cleared. If this is uffd-wp armed pte, we
+ * may want to replace a none pte with a marker pte if it's
+ * file-backed, so we don't lose the tracking information.
+ */
+ pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
+
/* Move the dirty bit to the page. Now the pte is gone. */
if (pte_dirty(pteval))
set_page_dirty(page);
diff --git a/mm/truncate.c b/mm/truncate.c
index ba2cbe300e83..65fed21e52bd 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -173,7 +173,13 @@ truncate_cleanup_page(struct address_space *mapping, struct page *page)
if (page_mapped(page)) {
unsigned int nr = thp_nr_pages(page);
unmap_mapping_pages(mapping, page->index, nr,
- ZAP_FLAG_CHECK_MAPPING);
+ ZAP_FLAG_CHECK_MAPPING |
+ /*
+ * Now it's safe to drop uffd-wp because
+ * we're with page lock, and the page is
+ * being truncated.
+ */
+ ZAP_FLAG_DROP_FILE_UFFD_WP);
}

if (page_has_private(page))
--
2.26.2

2021-03-23 00:56:49

by Peter Xu

[permalink] [raw]
Subject: [PATCH 14/23] shmem/userfaultfd: Pass over uffd-wp special swap pte when fork()

It should be handled similarly to other uffd-wp wr-protected ptes: pass it
over when the dst_vma has VM_UFFD_WP armed, otherwise drop it.

Signed-off-by: Peter Xu <[email protected]>
---
mm/memory.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8be28bcaa044..766946d3eab0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -715,8 +715,21 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
unsigned long vm_flags = dst_vma->vm_flags;
pte_t pte = *src_pte;
struct page *page;
- swp_entry_t entry = pte_to_swp_entry(pte);
+ swp_entry_t entry;
+
+ if (unlikely(is_swap_special_pte(pte))) {
+ /*
+ * uffd-wp special swap pte is the only possibility for now.
+ * If dst vma is registered with uffd-wp, copy it over.
+ * Otherwise, ignore this pte as if it's a none pte would work.
+ */
+ WARN_ON_ONCE(!pte_swp_uffd_wp_special(pte));
+ if (userfaultfd_wp(dst_vma))
+ set_pte_at(dst_mm, addr, dst_pte, pte);
+ return 0;
+ }

+ entry = pte_to_swp_entry(pte);
if (likely(!non_swap_entry(entry))) {
if (swap_duplicate(entry) < 0)
return entry.val;
--
2.26.2

2021-03-23 00:56:59

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 00/23] userfaultfd-wp: Support shmem and hugetlbfs

On Mon, Mar 22, 2021 at 08:48:49PM -0400, Peter Xu wrote:
> This patchset is based on tag v5.12-rc3-mmots-2021-03-17-22-26. To run the
> selftest, need to apply the two patches to fix minor mode page leak:
>
> https://lore.kernel.org/lkml/[email protected]/
> https://lore.kernel.org/lkml/[email protected]/
>
> Since I didn't get any NACK in the previous RFC series for months, I decided to
> remove the RFC tag starting from this version, so this is v1 of uffd-wp support
> on shmem & hugetlb.

Attaching changelog, rfc->v1:
- fix up syzbot reported issue
- add a new feature bit UFFD_FEATURE_WP_HUGETLBFS_SHMEM exported in uapi, so
that apps can detect the new kernel capability.
- check for all pte_to_swp_entry callers too (hmm, etc.) [JasonG]
- dropped the first few patches that are not directly related to this series; I
will post them separately as standalone series

Add Cc too (I'll remember to send the series with full cc list next time..).

Thanks,

--
Peter Xu

2021-03-23 00:57:15

by Peter Xu

[permalink] [raw]
Subject: [PATCH 23/23] userfaultfd/selftests: Enable uffd-wp for shmem/hugetlbfs

Now that we have added support for shmem and hugetlbfs, we can turn the
uffd-wp test on unconditionally.

Define HUGETLB_EXPECTED_IOCTLS to avoid using UFFD_API_RANGE_IOCTLS_BASIC,
because UFFD_API_RANGE_IOCTLS_BASIC is normally a superset of capabilities,
while the test may not exercise them all. E.g., when hugetlb is registered
without minor mode, we need to explicitly remove _UFFDIO_CONTINUE. The same
goes for uffd-wp: we'll need to explicitly remove _UFFDIO_WRITEPROTECT if the
region is not registered with uffd-wp.

In the long term, we may consider dropping the UFFD_API_* macros completely
from the uapi/linux/userfaultfd.h header, because a kernel header update
could otherwise easily break userspace builds.

Signed-off-by: Peter Xu <[email protected]>
---
tools/testing/selftests/vm/userfaultfd.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 1f5f9362ec7b..5fa9a506ded5 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -80,7 +80,7 @@ static int test_type;
static volatile bool test_uffdio_copy_eexist = true;
static volatile bool test_uffdio_zeropage_eexist = true;
/* Whether to test uffd write-protection */
-static bool test_uffdio_wp = false;
+static bool test_uffdio_wp = true;

static bool map_shared;
static int shm_fd;
@@ -319,6 +319,9 @@ struct uffd_test_ops {
(1 << _UFFDIO_ZEROPAGE) | \
(1 << _UFFDIO_WRITEPROTECT))

+#define HUGETLB_EXPECTED_IOCTLS ((1 << _UFFDIO_WAKE) | \
+ (1 << _UFFDIO_COPY))
+
static struct uffd_test_ops anon_uffd_test_ops = {
.expected_ioctls = ANON_EXPECTED_IOCTLS,
.allocate_area = anon_allocate_area,
@@ -334,7 +337,7 @@ static struct uffd_test_ops shmem_uffd_test_ops = {
};

static struct uffd_test_ops hugetlb_uffd_test_ops = {
- .expected_ioctls = UFFD_API_RANGE_IOCTLS_BASIC & ~(1 << _UFFDIO_CONTINUE),
+ .expected_ioctls = HUGETLB_EXPECTED_IOCTLS,
.allocate_area = hugetlb_allocate_area,
.release_pages = hugetlb_release_pages,
.alias_mapping = hugetlb_alias_mapping,
@@ -1433,8 +1436,6 @@ static void set_test_type(const char *type)
if (!strcmp(type, "anon")) {
test_type = TEST_ANON;
uffd_test_ops = &anon_uffd_test_ops;
- /* Only enable write-protect test for anonymous test */
- test_uffdio_wp = true;
} else if (!strcmp(type, "hugetlb")) {
test_type = TEST_HUGETLB;
uffd_test_ops = &hugetlb_uffd_test_ops;
--
2.26.2

2021-03-23 00:57:27

by Peter Xu

[permalink] [raw]
Subject: [PATCH 17/23] hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT

This starts from passing cp_flags into hugetlb_change_protection() so hugetlb
will be able to handle MM_CP_UFFD_WP[_RESOLVE] requests.

huge_pte_clear_uffd_wp() is introduced to handle the case where the
UFFDIO_WRITEPROTECT is requested upon migrating huge page entries.
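
Not part of the patch - a sketch of the corresponding userspace call; note
that the range must be aligned to the huge page size, matching the check
added to mwriteprotect_range() below:

#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Write-protect [start, start + len) of a hugetlb registration.  Both start
 * and len must be multiples of the vma's huge page size.
 */
static int hugetlb_wrprotect(int uffd, unsigned long start,
                             unsigned long len, unsigned long hpage_size)
{
        struct uffdio_writeprotect wp = {
                .range = { .start = start, .len = len },
                .mode = UFFDIO_WRITEPROTECT_MODE_WP,
        };

        if ((start | len) & (hpage_size - 1))
                return -1;      /* the kernel would reject this with -EINVAL */

        return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}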

Signed-off-by: Peter Xu <[email protected]>
---
include/asm-generic/hugetlb.h | 5 +++++
include/linux/hugetlb.h | 6 ++++--
mm/hugetlb.c | 13 ++++++++++++-
mm/mprotect.c | 3 ++-
mm/userfaultfd.c | 8 ++++++++
5 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index 548212eccbd6..181cdc3297e7 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -32,6 +32,11 @@ static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
return pte_mkuffd_wp(pte);
}

+static inline pte_t huge_pte_clear_uffd_wp(pte_t pte)
+{
+ return pte_clear_uffd_wp(pte);
+}
+
static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
{
return pte_modify(pte, newprot);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ef8d2b8427b1..92710600596e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -190,7 +190,8 @@ struct page *follow_huge_pgd(struct mm_struct *mm, unsigned long address,
int pmd_huge(pmd_t pmd);
int pud_huge(pud_t pud);
unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
- unsigned long address, unsigned long end, pgprot_t newprot);
+ unsigned long address, unsigned long end, pgprot_t newprot,
+ unsigned long cp_flags);

bool is_hugetlb_entry_migration(pte_t pte);
void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
@@ -352,7 +353,8 @@ static inline void move_hugetlb_state(struct page *oldpage,

static inline unsigned long hugetlb_change_protection(
struct vm_area_struct *vma, unsigned long address,
- unsigned long end, pgprot_t newprot)
+ unsigned long end, pgprot_t newprot,
+ unsigned long cp_flags)
{
return 0;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f0e55b341ebd..fd3e87517e10 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5063,7 +5063,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
}

unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
- unsigned long address, unsigned long end, pgprot_t newprot)
+ unsigned long address, unsigned long end,
+ pgprot_t newprot, unsigned long cp_flags)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long start = address;
@@ -5073,6 +5074,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long pages = 0;
bool shared_pmd = false;
struct mmu_notifier_range range;
+ bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+ bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;

/*
* In the case of shared PMDs, the area to flush could be beyond
@@ -5113,6 +5116,10 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,

make_migration_entry_read(&entry);
newpte = swp_entry_to_pte(entry);
+ if (uffd_wp)
+ newpte = pte_swp_mkuffd_wp(newpte);
+ else if (uffd_wp_resolve)
+ newpte = pte_swp_clear_uffd_wp(newpte);
set_huge_swap_pte_at(mm, address, ptep,
newpte, huge_page_size(h));
pages++;
@@ -5126,6 +5133,10 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
old_pte = huge_ptep_modify_prot_start(vma, address, ptep);
pte = pte_mkhuge(huge_pte_modify(old_pte, newprot));
pte = arch_make_huge_pte(pte, vma, NULL, 0);
+ if (uffd_wp)
+ pte = huge_pte_mkuffd_wp(huge_pte_wrprotect(pte));
+ else if (uffd_wp_resolve)
+ pte = huge_pte_clear_uffd_wp(pte);
huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
pages++;
}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 51c954afa406..fe5a5b96a61f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -416,7 +416,8 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
BUG_ON((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL);

if (is_vm_hugetlb_page(vma))
- pages = hugetlb_change_protection(vma, start, end, newprot);
+ pages = hugetlb_change_protection(vma, start, end, newprot,
+ cp_flags);
else
pages = change_protection_range(vma, start, end, newprot,
cp_flags);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 78471ae3d25c..01170197a3d7 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -654,6 +654,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
unsigned long len, bool enable_wp, bool *mmap_changing)
{
struct vm_area_struct *dst_vma;
+ unsigned long page_mask;
pgprot_t newprot;
int err;

@@ -690,6 +691,13 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
if (!vma_is_anonymous(dst_vma))
goto out_unlock;

+ if (is_vm_hugetlb_page(dst_vma)) {
+ err = -EINVAL;
+ page_mask = vma_kernel_pagesize(dst_vma) - 1;
+ if ((start & page_mask) || (len & page_mask))
+ goto out_unlock;
+ }
+
if (enable_wp)
newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
else
--
2.26.2

2021-03-23 02:14:16

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 07/23] mm: Introduce zap_details.zap_flags

On Mon, Mar 22, 2021 at 08:48:56PM -0400, Peter Xu wrote:
> +/* Whether to check page->mapping when zapping */
> +#define ZAP_FLAG_CHECK_MAPPING BIT(0)
> +
> /*
> * Parameter block passed down to zap_pte_range in exceptional cases.
> */
> struct zap_details {
> - struct address_space *check_mapping; /* Check page->mapping if set */
> + struct address_space *zap_mapping; /* Check page->mapping if set */

Now the comment is wrong. It used to mean "If this is NULL, zap pages
with any mapping", but now it's always set, and the decision about whether
to check the mapping is in the flag.

Honestly, I'd remove the comments from both these members. They don't add
anything to understandability now.

> + unsigned long zap_flags; /* Special flags for zapping */
> };

2021-03-23 02:36:23

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH 02/23] mm: Clear vmf->pte after pte_unmap_same() returns

Hi:
On 2021/3/23 8:48, Peter Xu wrote:
> pte_unmap_same() will always unmap the pte pointer. After the unmap, vmf->pte
> will not be valid any more. We should clear it.
>
> It was safe only because no one is accessing vmf->pte after pte_unmap_same()
> returns, since the only caller of pte_unmap_same() (so far) is do_swap_page(),
> where vmf->pte will in most cases be overwritten very soon.
>
> pte_unmap_same() will be used in other places in follow up patches, so that
> vmf->pte will not always be re-written. This patch enables us to call
> functions like finish_fault() because that'll conditionally unmap the pte by
> checking vmf->pte first. Or, alloc_set_pte() will make sure to allocate a new
> pte even after calling pte_unmap_same().
>
> Since we'll need to modify vmf->pte, directly pass in vmf into pte_unmap_same()
> and then we can also avoid the long parameter list.
>
> Signed-off-by: Peter Xu <[email protected]>

Good cleanup! Thanks.
Reviewed-by: Miaohe Lin <[email protected]>

> ---
> mm/memory.c | 13 +++++++------
> 1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index a458a595331f..d534eba85756 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2607,19 +2607,20 @@ EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
> * proceeding (but do_wp_page is only called after already making such a check;
> * and do_anonymous_page can safely check later on).
> */
> -static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
> - pte_t *page_table, pte_t orig_pte)
> +static inline int pte_unmap_same(struct vm_fault *vmf)
> {
> int same = 1;
> #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
> if (sizeof(pte_t) > sizeof(unsigned long)) {
> - spinlock_t *ptl = pte_lockptr(mm, pmd);
> + spinlock_t *ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> spin_lock(ptl);
> - same = pte_same(*page_table, orig_pte);
> + same = pte_same(*vmf->pte, vmf->orig_pte);
> spin_unlock(ptl);
> }
> #endif
> - pte_unmap(page_table);
> + pte_unmap(vmf->pte);
> + /* After unmap of pte, the pointer is invalid now - clear it. */
> + vmf->pte = NULL;
> return same;
> }
>
> @@ -3308,7 +3309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> vm_fault_t ret = 0;
> void *shadow = NULL;
>
> - if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
> + if (!pte_unmap_same(vmf))
> goto out;
>
> entry = pte_to_swp_entry(vmf->orig_pte);
>

2021-03-23 15:42:39

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 02/23] mm: Clear vmf->pte after pte_unmap_same() returns

On Tue, Mar 23, 2021 at 10:34:45AM +0800, Miaohe Lin wrote:
> Hi:
> On 2021/3/23 8:48, Peter Xu wrote:
> > pte_unmap_same() will always unmap the pte pointer. After the unmap, vmf->pte
> > will not be valid any more. We should clear it.
> >
> > It was safe only because no one is accessing vmf->pte after pte_unmap_same()
> > returns, since the only caller of pte_unmap_same() (so far) is do_swap_page(),
> > where vmf->pte will in most cases be overwritten very soon.
> >
> > pte_unmap_same() will be used in other places in follow up patches, so that
> > vmf->pte will not always be re-written. This patch enables us to call
> > functions like finish_fault() because that'll conditionally unmap the pte by
> > checking vmf->pte first. Or, alloc_set_pte() will make sure to allocate a new
> > pte even after calling pte_unmap_same().
> >
> > Since we'll need to modify vmf->pte, directly pass in vmf into pte_unmap_same()
> > and then we can also avoid the long parameter list.
> >
> > Signed-off-by: Peter Xu <[email protected]>
>
> Good cleanup! Thanks.
> Reviewed-by: Miaohe Lin <[email protected]>

Just a note that this is not a pure cleanup - the latter patches may start to
depend on the clearing of vmf->pte in their logic.
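
A rough illustration (not a hunk from this series; finish_fault() is just the
example the commit message itself mentions):

	if (!pte_unmap_same(vmf))
		goto out;
	/*
	 * vmf->pte is NULL here, so a later helper that conditionally
	 * unmaps by checking vmf->pte first can be called safely.
	 */
	ret = finish_fault(vmf);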

Thanks for the quick review!

--
Peter Xu

2021-03-23 15:46:30

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 07/23] mm: Introduce zap_details.zap_flags

On Tue, Mar 23, 2021 at 02:11:29AM +0000, Matthew Wilcox wrote:
> On Mon, Mar 22, 2021 at 08:48:56PM -0400, Peter Xu wrote:
> > +/* Whether to check page->mapping when zapping */
> > +#define ZAP_FLAG_CHECK_MAPPING BIT(0)
> > +
> > /*
> > * Parameter block passed down to zap_pte_range in exceptional cases.
> > */
> > struct zap_details {
> > - struct address_space *check_mapping; /* Check page->mapping if set */
> > + struct address_space *zap_mapping; /* Check page->mapping if set */
>
> Now the comment is wrong. It used to mean "If this is NULL, zap pages
> with any mapping", but now it's always set, and the decision about whether
> to check the mapping is in the flag.
>
> Honestly, I'd remove the comments from both these members. They don't add
> anything to understandability now.

Agreed, I'm removing them. Thanks,

--
Peter Xu

2021-04-22 00:19:42

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 00/23] userfaultfd-wp: Support shmem and hugetlbfs

On Mon, Mar 22, 2021 at 08:48:49PM -0400, Peter Xu wrote:
> This patchset is based on tag v5.12-rc3-mmots-2021-03-17-22-26. To run the
> selftest, need to apply the two patches to fix minor mode page leak:
>
> https://lore.kernel.org/lkml/[email protected]/
> https://lore.kernel.org/lkml/[email protected]/
>
> Since I didn't get any NACK in the previous RFC series for months, I decided to
> remove the RFC tag starting from this version, so this is v1 of uffd-wp support
> on shmem & hugetlb.
>
> The whole series can also be found online [1].
>
> The major comment I'd like to get is on the new idea of swap special pte. That
> comes from suggestions from both Hugh and Andrea and I appreciated a lot for
> those discussions.
>
> In short, the so-called "swap special pte" in this patchset is a new type of
> pte that doesn't exist in the past, but it got used initially in this series in
> file-backed memories. It is used to persist information even if the ptes got
> dropped meanwhile when the page cache still existed. For example, when
> splitting a file-backed huge pmd, we could be simply dropping the pmd entry
> then wait until another fault coming. It's okay in the past since all
> information in the pte can be retained from the page cache when the next page
> fault triggers. However in this case, uffd-wp is per-pte information which
> cannot be kept in page cache, so that information needs to be maintained
> somehow still in the pgtable entry, even if the pgtable entry is going to be
> dropped. Here instead of replacing with a none entry, we used the "swap
> special pte". Then when the next page fault triggers, we can observe orig_pte
> to retain this information.
>
> I'm copy-pasting some commit message from the patch "mm/swap: Introduce the
> idea of special swap ptes", where it tried to explain this pte in another angle:
>
> We used to have special swap entries, like migration entries, hw-poison
> entries, device private entries, etc.
>
> Those "special swap entries" reside in the range that they need to be at least
> swap entries first, and their types are decided by swp_type(entry).
>
> This patch introduces another idea called "special swap ptes".
>
> It's very easy to get confused against "special swap entries", but a speical
> swap pte should never contain a swap entry at all. It means, it's illegal to
> call pte_to_swp_entry() upon a special swap pte.
>
> Make the uffd-wp special pte to be the first special swap pte.
>
> Before this patch, is_swap_pte()==true means one of the below:
>
> (a.1) The pte has a normal swap entry (non_swap_entry()==false). For
> example, when an anonymous page got swapped out.
>
> (a.2) The pte has a special swap entry (non_swap_entry()==true). For
> example, a migration entry, a hw-poison entry, etc.
>
> After this patch, is_swap_pte()==true means one of the below, where case (b) is
> added:
>
> (a) The pte contains a swap entry.
>
> (a.1) The pte has a normal swap entry (non_swap_entry()==false). For
> example, when an anonymous page got swapped out.
>
> (a.2) The pte has a special swap entry (non_swap_entry()==true). For
> example, a migration entry, a hw-poison entry, etc.
>
> (b) The pte does not contain a swap entry at all (so it cannot be passed
> into pte_to_swp_entry()). For example, uffd-wp special swap pte.
>
> Hugetlbfs needs similar thing because it's also file-backed. I directly reused
> the same special pte there, though the shmem/hugetlb change on supporting this
> new pte is different since they don't share code path a lot.
>
> Patch layout
> ============
>
> Part (1): Shmem support, this is where the special swap pte is introduced.
> Some zap rework is needed within the process:
>
> shmem/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
> mm: Clear vmf->pte after pte_unmap_same() returns
> mm/userfaultfd: Introduce special pte for unmapped file-backed mem
> mm/swap: Introduce the idea of special swap ptes
> shmem/userfaultfd: Handle uffd-wp special pte in page fault handler
> mm: Drop first_index/last_index in zap_details
> mm: Introduce zap_details.zap_flags
> mm: Introduce ZAP_FLAG_SKIP_SWAP
> mm: Pass zap_flags into unmap_mapping_pages()
> shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed
> shmem/userfaultfd: Allow wr-protect none pte for file-backed mem
> shmem/userfaultfd: Allows file-back mem to be uffd wr-protected on thps
> shmem/userfaultfd: Handle the left-overed special swap ptes
> shmem/userfaultfd: Pass over uffd-wp special swap pte when fork()
>
> Part (2): Hugetlb support, we need to disable huge pmd sharing for uffd-wp
> because not compatible just like uffd minor mode. The rest is the changes
> required to teach hugetlbfs understand the special swap pte too that introduced
> with the uffd-wp change:
>
> hugetlb/userfaultfd: Hook page faults for uffd write protection
> hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
> hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT
> hugetlb: Pass vma into huge_pte_alloc()
> hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled
> mm/hugetlb: Introduce huge version of special swap pte helpers
> mm/hugetlb: Move flush_hugetlb_tlb_range() into hugetlb.h
> hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp
> hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler
> hugetlb/userfaultfd: Allow wr-protect none ptes
> hugetlb/userfaultfd: Only drop uffd-wp special pte if required
>
> Part (3): Enable both features in code and test
>
> userfaultfd: Enable write protection for shmem & hugetlbfs
> userfaultfd/selftests: Enable uffd-wp for shmem/hugetlbfs
>
> Tests
> =========
>
> I've tested it using either userfaultfd kselftest program, but also with
> umapsort [2] which should be even stricter. Tested page swapping in/out during
> umapsort.
>
> If anyone would like to try umapsort, need to use an extremely hacked version
> of umap library [3], because by default umap only supports anonymous. So to
> test it we need to build [3] then [2].
>
> Any comment would be greatly welcomed. Thanks,
>
> [1] https://github.com/xzpeter/linux/tree/uffd-wp-shmem-hugetlbfs
> [2] https://github.com/LLNL/umap-apps
> [3] https://github.com/xzpeter/umap/tree/peter-shmem-hugetlbfs

Hugh, Mike, Andrew,

Any comment for this series?

Thanks,

--
Peter Xu

2021-04-22 01:25:48

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 00/23] userfaultfd-wp: Support shmem and hugetlbfs

On 4/21/21 9:03 AM, Peter Xu wrote:
> Hugh, Mike, Andrew,
>
> Any comment for this series?
>

Sorry Peter, I always get preempted with something else.

I'll start looking at the hugetlb specific changes and back my way into
swap special pte support. I feel qualified to review the hugetlb stuff
and hope others will join in on the common infrastructure changes.

--
Mike Kravetz

2021-04-22 01:26:36

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 15/23] hugetlb/userfaultfd: Hook page faults for uffd write protection

On 3/22/21 5:49 PM, Peter Xu wrote:
> Hook up hugetlbfs_fault() with the capability to handle userfaultfd-wp faults.
>
> We do this slightly earlier than hugetlb_cow() so that we can avoid taking some
> extra locks that we definitely don't need.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/hugetlb.c | 19 +++++++++++++++++++
> 1 file changed, 19 insertions(+)

That is pretty straightforward,

Reviewed-by: Mike Kravetz <[email protected]>
--
Mike Kravetz

2021-04-22 01:27:40

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 16/23] hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP

On 3/22/21 5:49 PM, Peter Xu wrote:
> Firstly, pass the wp_copy variable into hugetlb_mcopy_atomic_pte() thoughout
> the stack. Then, apply the UFFD_WP bit if UFFDIO_COPY_MODE_WP is with
> UFFDIO_COPY. Introduce huge_pte_mkuffd_wp() for it.
>
> Note that similar to how we've handled shmem, we'd better keep setting the
> dirty bit even if UFFDIO_COPY_MODE_WP is provided, so that the core mm will
> know this page contains valid data and never drop it.

There is nothing wrong with setting the dirty bit in this manner to be
consistent. But, since hugetlb pages are only managed by hugetlbfs, the
core mm will not drop them.

>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/asm-generic/hugetlb.h | 5 +++++
> include/linux/hugetlb.h | 6 ++++--
> mm/hugetlb.c | 22 +++++++++++++++++-----
> mm/userfaultfd.c | 12 ++++++++----
> 4 files changed, 34 insertions(+), 11 deletions(-)
>
> diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
> index 8e1e6244a89d..548212eccbd6 100644
> --- a/include/asm-generic/hugetlb.h
> +++ b/include/asm-generic/hugetlb.h
> @@ -27,6 +27,11 @@ static inline pte_t huge_pte_mkdirty(pte_t pte)
> return pte_mkdirty(pte);
> }
>
> +static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
> +{
> + return pte_mkuffd_wp(pte);
> +}
> +

Just want to verify that userfaultfd wp support is only enabled for
x86_64 now? I only ask because there are arch specific hugetlb pte
manipulation routines for some architectures.

> static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
> {
> return pte_modify(pte, newprot);
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index a7f7d5f328dc..ef8d2b8427b1 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -141,7 +141,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
> unsigned long dst_addr,
> unsigned long src_addr,
> enum mcopy_atomic_mode mode,
> - struct page **pagep);
> + struct page **pagep,
> + bool wp_copy);
> #endif /* CONFIG_USERFAULTFD */
> bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
> struct vm_area_struct *vma,
> @@ -321,7 +322,8 @@ static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> unsigned long dst_addr,
> unsigned long src_addr,
> enum mcopy_atomic_mode mode,
> - struct page **pagep)
> + struct page **pagep,
> + bool wp_copy)
> {
> BUG();
> return 0;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index def2c7ddf3ae..f0e55b341ebd 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4725,7 +4725,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> unsigned long dst_addr,
> unsigned long src_addr,
> enum mcopy_atomic_mode mode,
> - struct page **pagep)
> + struct page **pagep,
> + bool wp_copy)
> {
> bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
> struct address_space *mapping;
> @@ -4822,17 +4823,28 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
> }
>
> - /* For CONTINUE on a non-shared VMA, don't set VM_WRITE for CoW. */
> - if (is_continue && !vm_shared)
> + /*
> + * For either: (1) CONTINUE on a non-shared VMA, or (2) UFFDIO_COPY
> + * with wp flag set, don't set pte write bit.
> + */
> + if (wp_copy || (is_continue && !vm_shared))
> writable = 0;
> else
> writable = dst_vma->vm_flags & VM_WRITE;
>
> _dst_pte = make_huge_pte(dst_vma, page, writable);
> - if (writable)
> - _dst_pte = huge_pte_mkdirty(_dst_pte);
> + /*
> + * Always mark UFFDIO_COPY page dirty; note that this may not be
> + * extremely important for hugetlbfs for now since swapping is not
> + * supported, but we should still be clear in that this page cannot be
> + * thrown away at will, even if write bit not set.

As mentioned earlier, there should not be any issue with hugetlb pages
being thrown away without the dirty bit set. Perhaps the comment should
reflect that this is mostly for consistency.

Note to self: this may help when I get back to hugetlb soft dirty
support.

Other than that, patch looks good.
--
Mike Kravetz

2021-04-22 01:32:08

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 16/23] hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP

On Wed, Apr 21, 2021 at 04:06:39PM -0700, Mike Kravetz wrote:
> On 3/22/21 5:49 PM, Peter Xu wrote:
> > Firstly, pass the wp_copy variable into hugetlb_mcopy_atomic_pte() thoughout
> > the stack. Then, apply the UFFD_WP bit if UFFDIO_COPY_MODE_WP is with
> > UFFDIO_COPY. Introduce huge_pte_mkuffd_wp() for it.
> >
> > Note that similar to how we've handled shmem, we'd better keep setting the
> > dirty bit even if UFFDIO_COPY_MODE_WP is provided, so that the core mm will
> > know this page contains valid data and never drop it.
>
> There is nothing wrong with setting the dirty bit in this manner to be
> consistent. But, since hugetlb pages are only managed by hugetlbfs, the
> core mm will not drop them.

Right, for this paragraph, how about I rephrase it as below?

Hugetlb pages are only managed by hugetlbfs, so we're safe even without
setting the dirty bit in the huge pte if the page is installed as read-only.
However, we'd better still keep the dirty bit set for a read-only UFFDIO_COPY
pte (when the UFFDIO_COPY_MODE_WP bit is set), not only to match what we do
with shmem, but also because the page does contain dirty data that the kernel
just copied from userspace.

>
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > include/asm-generic/hugetlb.h | 5 +++++
> > include/linux/hugetlb.h | 6 ++++--
> > mm/hugetlb.c | 22 +++++++++++++++++-----
> > mm/userfaultfd.c | 12 ++++++++----
> > 4 files changed, 34 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
> > index 8e1e6244a89d..548212eccbd6 100644
> > --- a/include/asm-generic/hugetlb.h
> > +++ b/include/asm-generic/hugetlb.h
> > @@ -27,6 +27,11 @@ static inline pte_t huge_pte_mkdirty(pte_t pte)
> > return pte_mkdirty(pte);
> > }
> >
> > +static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
> > +{
> > + return pte_mkuffd_wp(pte);
> > +}
> > +
>
> Just want to verify that userfaultfd wp support is only enabled for
> x86_64 now? I only ask because there are arch specific hugetlb pte
> manipulation routines for some architectures.

Yes it's x86_64 only, as we have:

select HAVE_ARCH_USERFAULTFD_WP if X86_64 && USERFAULTFD

Here the helper defines the huge pte uffd-wp bit to be the same as the
pte-level bit, across all archs. However, it should be fine since for any arch
that does not support uffd-wp, it'll be a no-op:

static __always_inline pte_t pte_mkuffd_wp(pte_t pte)
{
return pte;
}

Meanwhile it shouldn't be called anywhere either.

Here I could also define __HAVE_ARCH_HUGE_PTE_MKUFFD_WP, however I didn't do
so as I thought the above should be enough. We can define it when it becomes
really necessary.
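
If an arch-specific version ever becomes necessary, I'd expect the usual
asm-generic override pattern, roughly (sketch only):

#ifndef __HAVE_ARCH_HUGE_PTE_MKUFFD_WP
static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
{
	return pte_mkuffd_wp(pte);
}
#endif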

>
> > static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
> > {
> > return pte_modify(pte, newprot);
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index a7f7d5f328dc..ef8d2b8427b1 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -141,7 +141,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
> > unsigned long dst_addr,
> > unsigned long src_addr,
> > enum mcopy_atomic_mode mode,
> > - struct page **pagep);
> > + struct page **pagep,
> > + bool wp_copy);
> > #endif /* CONFIG_USERFAULTFD */
> > bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
> > struct vm_area_struct *vma,
> > @@ -321,7 +322,8 @@ static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > unsigned long dst_addr,
> > unsigned long src_addr,
> > enum mcopy_atomic_mode mode,
> > - struct page **pagep)
> > + struct page **pagep,
> > + bool wp_copy)
> > {
> > BUG();
> > return 0;
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index def2c7ddf3ae..f0e55b341ebd 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -4725,7 +4725,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > unsigned long dst_addr,
> > unsigned long src_addr,
> > enum mcopy_atomic_mode mode,
> > - struct page **pagep)
> > + struct page **pagep,
> > + bool wp_copy)
> > {
> > bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
> > struct address_space *mapping;
> > @@ -4822,17 +4823,28 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> > hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
> > }
> >
> > - /* For CONTINUE on a non-shared VMA, don't set VM_WRITE for CoW. */
> > - if (is_continue && !vm_shared)
> > + /*
> > + * For either: (1) CONTINUE on a non-shared VMA, or (2) UFFDIO_COPY
> > + * with wp flag set, don't set pte write bit.
> > + */
> > + if (wp_copy || (is_continue && !vm_shared))
> > writable = 0;
> > else
> > writable = dst_vma->vm_flags & VM_WRITE;
> >
> > _dst_pte = make_huge_pte(dst_vma, page, writable);
> > - if (writable)
> > - _dst_pte = huge_pte_mkdirty(_dst_pte);
> > + /*
> > + * Always mark UFFDIO_COPY page dirty; note that this may not be
> > + * extremely important for hugetlbfs for now since swapping is not
> > + * supported, but we should still be clear in that this page cannot be
> > + * thrown away at will, even if write bit not set.
>
> As mentioned earlier there should not be any issue with hugetlb pages
> being thrown away without dirty set. Perhaps, the comment should
> reflect that this is mostly for consistency.

Yes, function-wise it seems unnecessary, however I also think it's a bit more
than being consistent with what we do with shmem (as also rephrased in the
commit message above): a UFFDIO_COPY page should be marked as dirty by the
definition of the "dirty bit", imho, since the data is indeed dirty (the
kernel copied that data from userspace)?

>
> Note to self: this may help when I get back to hugetlb soft dirty
> support.
>
> Other than that, patch looks good.

Thanks!

--
Peter Xu

2021-04-22 01:32:12

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 00/23] userfaultfd-wp: Support shmem and hugetlbfs

On Wed, Apr 21, 2021 at 02:39:38PM -0700, Mike Kravetz wrote:
> Sorry Peter, always get preempted with something else.

No worry.

>
> I'll start looking at the hugetlb specific changes and back my way into
> swap special pte support. I feel qualified to review the hugetlb stuff
> and hope others will join in on the common infrastructure changes.

That'll be great; thanks Mike!

--
Peter Xu

2021-04-22 18:24:03

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 17/23] hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT

On 3/22/21 5:49 PM, Peter Xu wrote:
> This starts from passing cp_flags into hugetlb_change_protection() so hugetlb
> will be able to handle MM_CP_UFFD_WP[_RESOLVE] requests.
>
> huge_pte_clear_uffd_wp() is introduced to handle the case where the
> UFFDIO_WRITEPROTECT is requested upon migrating huge page entries.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/asm-generic/hugetlb.h | 5 +++++
> include/linux/hugetlb.h | 6 ++++--
> mm/hugetlb.c | 13 ++++++++++++-
> mm/mprotect.c | 3 ++-
> mm/userfaultfd.c | 8 ++++++++
> 5 files changed, 31 insertions(+), 4 deletions(-)

Thanks,

Reviewed-by: Mike Kravetz <[email protected]>

--
Mike Kravetz

2021-04-22 19:02:06

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 18/23] mm/hugetlb: Introduce huge version of special swap pte helpers

On 3/22/21 5:49 PM, Peter Xu wrote:
> This is to let hugetlbfs be prepared to also recognize swap special ptes just
> like uffd-wp special swap ptes.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/hugetlb.c | 23 +++++++++++++++++++++--
> 1 file changed, 21 insertions(+), 2 deletions(-)

The multiple uses of swap entries can be confusing. Thanks for the
additional comments in the original userfaultfd wp code.

Reviewed-by: Mike Kravetz <[email protected]>
--
Mike Kravetz

2021-04-22 22:47:15

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 19/23] hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler

On 3/22/21 5:50 PM, Peter Xu wrote:
> Teach the hugetlb page fault code to understand uffd-wp special pte. For
> example, when seeing such a pte we need to convert any write fault into a read
> one (which is fake - we'll retry the write later if so). Meanwhile, for
> handle_userfault() we'll need to make sure we must wait for the special swap
> pte too just like a none pte.
>
> Note that we also need to teach UFFDIO_COPY about this special pte across the
> code path so that we can safely install a new page at this special pte as long
> as we know it's a stall entry.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> fs/userfaultfd.c | 5 ++++-
> mm/hugetlb.c | 34 +++++++++++++++++++++++++++-------
> mm/userfaultfd.c | 5 ++++-
> 3 files changed, 35 insertions(+), 9 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 72956f9cc892..f6fa34f58c37 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -245,8 +245,11 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> /*
> * Lockless access: we're in a wait_event so it's ok if it
> * changes under us.
> + *
> + * Regarding uffd-wp special case, please refer to comments in
> + * userfaultfd_must_wait().
> */
> - if (huge_pte_none(pte))
> + if (huge_pte_none(pte) || pte_swp_uffd_wp_special(pte))
> ret = true;
> if (!huge_pte_write(pte) && (reason & VM_UFFD_WP))
> ret = true;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 64e424b03774..448ef745d5ee 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4369,7 +4369,8 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
> static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> struct vm_area_struct *vma,
> struct address_space *mapping, pgoff_t idx,
> - unsigned long address, pte_t *ptep, unsigned int flags)
> + unsigned long address, pte_t *ptep,
> + pte_t old_pte, unsigned int flags)
> {
> struct hstate *h = hstate_vma(vma);
> vm_fault_t ret = VM_FAULT_SIGBUS;
> @@ -4493,7 +4494,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>
> ptl = huge_pte_lock(h, mm, ptep);
> ret = 0;
> - if (!huge_pte_none(huge_ptep_get(ptep)))
> + if (!pte_same(huge_ptep_get(ptep), old_pte))
> goto backout;
>
> if (anon_rmap) {
> @@ -4503,6 +4504,11 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> page_dup_rmap(page, true);
> new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
> && (vma->vm_flags & VM_SHARED)));
> + if (unlikely(flags & FAULT_FLAG_UFFD_WP)) {
> + WARN_ON_ONCE(flags & FAULT_FLAG_WRITE);
> + /* We should have the write bit cleared already, but be safe */
> + new_pte = huge_pte_wrprotect(huge_pte_mkuffd_wp(new_pte));
> + }
> set_huge_pte_at(mm, haddr, ptep, new_pte);
>
> hugetlb_count_add(pages_per_huge_page(h), mm);
> @@ -4584,9 +4590,16 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> if (unlikely(is_hugetlb_entry_migration(entry))) {
> migration_entry_wait_huge(vma, mm, ptep);
> return 0;
> - } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> + } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) {
> return VM_FAULT_HWPOISON_LARGE |
> VM_FAULT_SET_HINDEX(hstate_index(h));
> + } else if (unlikely(is_swap_special_pte(entry))) {
> + /* Must be a uffd-wp special swap pte */
> + WARN_ON_ONCE(!pte_swp_uffd_wp_special(entry));
> + flags |= FAULT_FLAG_UFFD_WP;
> + /* Emulate a read fault */
> + flags &= ~FAULT_FLAG_WRITE;
> + }

The comment above this if/else block points out that we hold no locks
and are only checking conditions that would cause a quick return. Yet,
this new code is potentially modifying flags. Pretty sure we can race
and have the entry change.

Not sure of all the side effects of emulating a read if the changed entry is
not a uffd-wp special swap pte and we emulate a read when we should not.

Perhaps we should just put this check and modification of flags after
taking the fault mutex and before the change below?
--
Mike Kravetz

> }
>
> /*
> @@ -4618,8 +4631,13 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> mutex_lock(&hugetlb_fault_mutex_table[hash]);
>
> entry = huge_ptep_get(ptep);
> - if (huge_pte_none(entry)) {
> - ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
> + /*
> + * FAULT_FLAG_UFFD_WP should be handled merely the same as pte none
> + * because it's basically a none pte with a special marker
> + */
> + if (huge_pte_none(entry) || pte_swp_uffd_wp_special(entry)) {
> + ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
> + entry, flags);
> goto out_mutex;
> }
>
> @@ -4753,7 +4771,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> unsigned long size;
> int vm_shared = dst_vma->vm_flags & VM_SHARED;
> struct hstate *h = hstate_vma(dst_vma);
> - pte_t _dst_pte;
> + pte_t _dst_pte, cur_pte;
> spinlock_t *ptl;
> int ret;
> struct page *page;
> @@ -4831,8 +4849,10 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> if (idx >= size)
> goto out_release_unlock;
>
> + cur_pte = huge_ptep_get(dst_pte);
> ret = -EEXIST;
> - if (!huge_pte_none(huge_ptep_get(dst_pte)))
> + /* Please refer to shmem_mfill_atomic_pte() for uffd-wp special case */
> + if (!huge_pte_none(cur_pte) && !pte_swp_uffd_wp_special(cur_pte))
> goto out_release_unlock;
>
> if (vm_shared) {
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 01170197a3d7..a2b0dcc80a19 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -274,6 +274,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> }
>
> while (src_addr < src_start + len) {
> + pte_t pteval;
> +
> BUG_ON(dst_addr >= dst_start + len);
>
> /*
> @@ -296,8 +298,9 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> goto out_unlock;
> }
>
> + pteval = huge_ptep_get(dst_pte);
> if (mode != MCOPY_ATOMIC_CONTINUE &&
> - !huge_pte_none(huge_ptep_get(dst_pte))) {
> + !huge_pte_none(pteval) && !pte_swp_uffd_wp_special(pteval)) {
> err = -EEXIST;
> mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> i_mmap_unlock_read(mapping);
>

2021-04-23 00:10:05

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 20/23] hugetlb/userfaultfd: Allow wr-protect none ptes

On 3/22/21 5:50 PM, Peter Xu wrote:
> Teach hugetlbfs code to wr-protect none ptes just in case the page cache
> existed for that pte. Meanwhile we also need to be able to recognize a uffd-wp
> marker pte and remove it for uffd_wp_resolve.
>
> Since at it, introduce a variable "psize" to replace all references to the huge
> page size fetcher.

Introducing psize may have caused most of the changes. :) I like it.

>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/hugetlb.c | 29 +++++++++++++++++++++++++----
> 1 file changed, 25 insertions(+), 4 deletions(-)

Reviewed-by: Mike Kravetz <[email protected]>
--
Mike Kravetz

2021-04-23 20:35:00

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 21/23] hugetlb/userfaultfd: Only drop uffd-wp special pte if required

On 3/22/21 5:50 PM, Peter Xu wrote:
> Just like what we've done with shmem uffd-wp special ptes, we shouldn't drop
> uffd-wp special swap pte for hugetlb too, only if we're going to unmap the
> whole vma, or we're punching a hole with safe locks held.
>
> For example, remove_inode_hugepages() is safe to drop uffd-wp ptes, because it
> has taken hugetlb fault mutex so that no concurrent page fault would trigger.
> While the call to hugetlb_vmdelete_list() in hugetlbfs_punch_hole() is not
> safe. That's why the previous call will be with ZAP_FLAG_DROP_FILE_UFFD_WP,
> while the latter one won't be able to.

This commit message is a bit confusing, but the hugetlb hole punch code
path is a bit confusing. :) How about something like this?

As with shmem uffd-wp special ptes, only drop the uffd-wp special swap
pte if unmapping an entire vma or synchronized such that faults can not
race with the unmap operation. This requires passing zap_flags all the
way to the lowest level hugetlb unmap routine: __unmap_hugepage_range.

In general, unmap calls originating in hugetlbfs code will pass the
ZAP_FLAG_DROP_FILE_UFFD_WP flag as synchronization is in place to prevent
faults. The exception is hole punch which will first unmap without any
synchronization. Later when hole punch actually removes the page from
the file, it will check to see if there was a subsequent fault and if
so take the hugetlb fault mutex while unmapping again. This second
unmap will pass in ZAP_FLAG_DROP_FILE_UFFD_WP.

>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> fs/hugetlbfs/inode.c | 15 +++++++++------
> include/linux/hugetlb.h | 13 ++++++++-----
> mm/hugetlb.c | 27 +++++++++++++++++++++------
> mm/memory.c | 5 ++++-
> 4 files changed, 42 insertions(+), 18 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index d81f52b87bd7..5fe19e801a2b 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -399,7 +399,8 @@ static void remove_huge_page(struct page *page)
> }
>
> static void
> -hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
> +hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
> + unsigned long zap_flags)
> {
> struct vm_area_struct *vma;
>
> @@ -432,7 +433,7 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
> }
>
> unmap_hugepage_range(vma, vma->vm_start + v_offset, v_end,
> - NULL);
> + NULL, zap_flags);
> }
> }
>
> @@ -513,7 +514,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> mutex_lock(&hugetlb_fault_mutex_table[hash]);
> hugetlb_vmdelete_list(&mapping->i_mmap,
> index * pages_per_huge_page(h),
> - (index + 1) * pages_per_huge_page(h));
> + (index + 1) * pages_per_huge_page(h),
> + ZAP_FLAG_DROP_FILE_UFFD_WP);
> i_mmap_unlock_write(mapping);
> }
>
> @@ -579,7 +581,8 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
> i_mmap_lock_write(mapping);
> i_size_write(inode, offset);
> if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
> - hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
> + hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
> + ZAP_FLAG_DROP_FILE_UFFD_WP);
> i_mmap_unlock_write(mapping);
> remove_inode_hugepages(inode, offset, LLONG_MAX);
> }
> @@ -612,8 +615,8 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> i_mmap_lock_write(mapping);
> if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
> hugetlb_vmdelete_list(&mapping->i_mmap,
> - hole_start >> PAGE_SHIFT,
> - hole_end >> PAGE_SHIFT);
> + hole_start >> PAGE_SHIFT,
> + hole_end >> PAGE_SHIFT, 0);
> i_mmap_unlock_write(mapping);
> remove_inode_hugepages(inode, hole_start, hole_end);
> inode_unlock(inode);
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 92710600596e..4047fa042782 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -121,14 +121,15 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
> unsigned long *, unsigned long *, long, unsigned int,
> int *);
> void unmap_hugepage_range(struct vm_area_struct *,
> - unsigned long, unsigned long, struct page *);
> + unsigned long, unsigned long, struct page *,
> + unsigned long);
> void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> struct vm_area_struct *vma,
> unsigned long start, unsigned long end,
> - struct page *ref_page);
> + struct page *ref_page, unsigned long zap_flags);
> void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
> unsigned long start, unsigned long end,
> - struct page *ref_page);
> + struct page *ref_page, unsigned long zap_flags);

Nothing introduced with your patch, but it seems __unmap_hugepage_range_final
does not need to be in the header and can be static in hugetlb.c.

Everything else looks good,

Reviewed-by: Mike Kravetz <[email protected]>
--
Mike Kravetz

2021-04-26 02:09:56

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 19/23] hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler

On Thu, Apr 22, 2021 at 03:45:39PM -0700, Mike Kravetz wrote:
> On 3/22/21 5:50 PM, Peter Xu wrote:
> > Teach the hugetlb page fault code to understand uffd-wp special pte. For
> > example, when seeing such a pte we need to convert any write fault into a read
> > one (which is fake - we'll retry the write later if so). Meanwhile, for
> > handle_userfault() we'll need to make sure we must wait for the special swap
> > pte too just like a none pte.
> >
> > Note that we also need to teach UFFDIO_COPY about this special pte across the
> > code path so that we can safely install a new page at this special pte as long
> > as we know it's a stall entry.
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > fs/userfaultfd.c | 5 ++++-
> > mm/hugetlb.c | 34 +++++++++++++++++++++++++++-------
> > mm/userfaultfd.c | 5 ++++-
> > 3 files changed, 35 insertions(+), 9 deletions(-)
> >
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 72956f9cc892..f6fa34f58c37 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -245,8 +245,11 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> > /*
> > * Lockless access: we're in a wait_event so it's ok if it
> > * changes under us.
> > + *
> > + * Regarding uffd-wp special case, please refer to comments in
> > + * userfaultfd_must_wait().
> > */
> > - if (huge_pte_none(pte))
> > + if (huge_pte_none(pte) || pte_swp_uffd_wp_special(pte))
> > ret = true;
> > if (!huge_pte_write(pte) && (reason & VM_UFFD_WP))
> > ret = true;
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 64e424b03774..448ef745d5ee 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -4369,7 +4369,8 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
> > static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> > struct vm_area_struct *vma,
> > struct address_space *mapping, pgoff_t idx,
> > - unsigned long address, pte_t *ptep, unsigned int flags)
> > + unsigned long address, pte_t *ptep,
> > + pte_t old_pte, unsigned int flags)
> > {
> > struct hstate *h = hstate_vma(vma);
> > vm_fault_t ret = VM_FAULT_SIGBUS;
> > @@ -4493,7 +4494,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >
> > ptl = huge_pte_lock(h, mm, ptep);
> > ret = 0;
> > - if (!huge_pte_none(huge_ptep_get(ptep)))
> > + if (!pte_same(huge_ptep_get(ptep), old_pte))
> > goto backout;
> >
> > if (anon_rmap) {
> > @@ -4503,6 +4504,11 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> > page_dup_rmap(page, true);
> > new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
> > && (vma->vm_flags & VM_SHARED)));
> > + if (unlikely(flags & FAULT_FLAG_UFFD_WP)) {
> > + WARN_ON_ONCE(flags & FAULT_FLAG_WRITE);
> > + /* We should have the write bit cleared already, but be safe */
> > + new_pte = huge_pte_wrprotect(huge_pte_mkuffd_wp(new_pte));
> > + }
> > set_huge_pte_at(mm, haddr, ptep, new_pte);
> >
> > hugetlb_count_add(pages_per_huge_page(h), mm);
> > @@ -4584,9 +4590,16 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > if (unlikely(is_hugetlb_entry_migration(entry))) {
> > migration_entry_wait_huge(vma, mm, ptep);
> > return 0;
> > - } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> > + } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) {
> > return VM_FAULT_HWPOISON_LARGE |
> > VM_FAULT_SET_HINDEX(hstate_index(h));
> > + } else if (unlikely(is_swap_special_pte(entry))) {
> > + /* Must be a uffd-wp special swap pte */
> > + WARN_ON_ONCE(!pte_swp_uffd_wp_special(entry));
> > + flags |= FAULT_FLAG_UFFD_WP;
> > + /* Emulate a read fault */
> > + flags &= ~FAULT_FLAG_WRITE;
> > + }
>
> The comment above this if/else block points out that we hold no locks
> and are only checking conditions that would cause a quick return. Yet,
> this new code is potentially modifying flags. Pretty sure we can race
> and have the entry change.
>
> Not sure of all the side effects of emulating a read if changed entry is
> not a uffd-wp special swap pte and we emulate read when we should not.
>
> Perhaps we should just put this check and modification of flags after
> taking the fault mutex and before the change below?

That's a great point. Even the WARN_ON_ONCE could trigger if the pte got
modified in parallel, so that's definitely broken.

Yes, I'd better do that with the pgtable lock held; hugetlb_no_page() should
be the only function that needs to handle this special case. So maybe I don't
need to emulate the READ fault at all, but just check pte_swp_uffd_wp_special()
with the lock held, and then doing the wrprotect properly should suffice.
Maybe that's even true for shmem; I'll think more about it.
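
Roughly something like this inside hugetlb_no_page() (sketch of the idea
only, not the final patch):

	/* With huge_pte_lock() held, pte_same() already verified old_pte */
	new_pte = make_huge_pte(vma, page, writable);
	if (unlikely(pte_swp_uffd_wp_special(old_pte)))
		/* Install wr-protected directly, no need to fake a read fault */
		new_pte = huge_pte_wrprotect(huge_pte_mkuffd_wp(new_pte));
	set_huge_pte_at(mm, haddr, ptep, new_pte);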

Thanks!

--
Peter Xu

2021-04-26 21:19:03

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/23] hugetlb/userfaultfd: Only drop uffd-wp special pte if required

On Fri, Apr 23, 2021 at 01:33:08PM -0700, Mike Kravetz wrote:
> On 3/22/21 5:50 PM, Peter Xu wrote:
> > Just like what we've done with shmem uffd-wp special ptes, we shouldn't drop
> > uffd-wp special swap pte for hugetlb too, only if we're going to unmap the
> > whole vma, or we're punching a hole with safe locks held.
> >
> > For example, remove_inode_hugepages() is safe to drop uffd-wp ptes, because it
> > has taken hugetlb fault mutex so that no concurrent page fault would trigger.
> > While the call to hugetlb_vmdelete_list() in hugetlbfs_punch_hole() is not
> > safe. That's why the previous call will be with ZAP_FLAG_DROP_FILE_UFFD_WP,
> > while the latter one won't be able to.
>
> This commit message is a bit confusing, but the hugetlb hole punch code
> path is a bit confusing. :) How about something like this?
>
> As with shmem uffd-wp special ptes, only drop the uffd-wp special swap
> pte if unmapping an entire vma or synchronized such that faults can not
> race with the unmap operation. This requires passing zap_flags all the
> way to the lowest level hugetlb unmap routine: __unmap_hugepage_range.
>
> In general, unmap calls originated in hugetlbfs code will pass the
> ZAP_FLAG_DROP_FILE_UFFD_WP flag as synchronization is in place to prevent
> faults. The exception is hole punch which will first unmap without any
> synchronization. Later when hole punch actually removes the page from
> the file, it will check to see if there was a subsequent fault and if
> so take the hugetlb fault mutex while unmapping again. This second
> unmap will pass in ZAP_FLAG_DROP_FILE_UFFD_WP.

Sure, I can replace mine.

Maybe it's because I didn't explain the reasoning enough, so it's confusing.
The core justification for "whether to apply the ZAP_FLAG_DROP_FILE_UFFD_WP
flag when unmapping a hugetlb range" is (IMHO): we should never reach a state
where a page fault could erroneously fault in a wr-protected page-cache page
as writable, even for an extremely short period. That could happen if, e.g.,
we passed ZAP_FLAG_DROP_FILE_UFFD_WP in hugetlbfs_punch_hole() when calling
hugetlb_vmdelete_list(), because if a page fault triggers after that call and
before the remove_inode_hugepages() right after it, the page cache can be
mapped writable again in that small window, which can cause data corruption.
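
Roughly, the window I'm worried about (illustration only):

	hugetlbfs_punch_hole()                concurrent page fault
	----------------------                ---------------------
	hugetlb_vmdelete_list(..., DROP_WP)
	  -> uffd-wp special pte dropped
	                                      sees a none pte, page cache still
	                                      present -> maps the page writable
	                                      again (data can be corrupted)
	remove_inode_hugepages()
	  -> takes the fault mutex, but too late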

>
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > fs/hugetlbfs/inode.c | 15 +++++++++------
> > include/linux/hugetlb.h | 13 ++++++++-----
> > mm/hugetlb.c | 27 +++++++++++++++++++++------
> > mm/memory.c | 5 ++++-
> > 4 files changed, 42 insertions(+), 18 deletions(-)
> >
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index d81f52b87bd7..5fe19e801a2b 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -399,7 +399,8 @@ static void remove_huge_page(struct page *page)
> > }
> >
> > static void
> > -hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
> > +hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
> > + unsigned long zap_flags)
> > {
> > struct vm_area_struct *vma;
> >
> > @@ -432,7 +433,7 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
> > }
> >
> > unmap_hugepage_range(vma, vma->vm_start + v_offset, v_end,
> > - NULL);
> > + NULL, zap_flags);
> > }
> > }
> >
> > @@ -513,7 +514,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> > mutex_lock(&hugetlb_fault_mutex_table[hash]);
> > hugetlb_vmdelete_list(&mapping->i_mmap,
> > index * pages_per_huge_page(h),
> > - (index + 1) * pages_per_huge_page(h));
> > + (index + 1) * pages_per_huge_page(h),
> > + ZAP_FLAG_DROP_FILE_UFFD_WP);
> > i_mmap_unlock_write(mapping);
> > }
> >
> > @@ -579,7 +581,8 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
> > i_mmap_lock_write(mapping);
> > i_size_write(inode, offset);
> > if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
> > - hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
> > + hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
> > + ZAP_FLAG_DROP_FILE_UFFD_WP);
> > i_mmap_unlock_write(mapping);
> > remove_inode_hugepages(inode, offset, LLONG_MAX);
> > }
> > @@ -612,8 +615,8 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> > i_mmap_lock_write(mapping);
> > if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
> > hugetlb_vmdelete_list(&mapping->i_mmap,
> > - hole_start >> PAGE_SHIFT,
> > - hole_end >> PAGE_SHIFT);
> > + hole_start >> PAGE_SHIFT,
> > + hole_end >> PAGE_SHIFT, 0);
> > i_mmap_unlock_write(mapping);
> > remove_inode_hugepages(inode, hole_start, hole_end);
> > inode_unlock(inode);
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 92710600596e..4047fa042782 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -121,14 +121,15 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
> > unsigned long *, unsigned long *, long, unsigned int,
> > int *);
> > void unmap_hugepage_range(struct vm_area_struct *,
> > - unsigned long, unsigned long, struct page *);
> > + unsigned long, unsigned long, struct page *,
> > + unsigned long);
> > void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> > struct vm_area_struct *vma,
> > unsigned long start, unsigned long end,
> > - struct page *ref_page);
> > + struct page *ref_page, unsigned long zap_flags);
> > void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > unsigned long start, unsigned long end,
> > - struct page *ref_page);
> > + struct page *ref_page, unsigned long zap_flags);
>
> Nothing introduced with your patch, but it seems __unmap_hugepage_range_final
> does not need to be in the header and can be static in hugetlb.c.

It seems to be used in unmap_single_vma() of mm/memory.c?

>
> Everything else looks good,
>
> Reviewed-by: Mike Kravetz <[email protected]>

Please let me know whether you want my extra paragraph in the commit message,
then I'll coordinate accordingly with the R-b. Thanks!

--
Peter Xu

2021-04-26 21:38:44

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 21/23] hugetlb/userfaultfd: Only drop uffd-wp special pte if required

On 4/26/21 2:16 PM, Peter Xu wrote:
> On Fri, Apr 23, 2021 at 01:33:08PM -0700, Mike Kravetz wrote:
>> On 3/22/21 5:50 PM, Peter Xu wrote:
>>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>>> index 92710600596e..4047fa042782 100644
>>> --- a/include/linux/hugetlb.h
>>> +++ b/include/linux/hugetlb.h
>>> @@ -121,14 +121,15 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
>>> unsigned long *, unsigned long *, long, unsigned int,
>>> int *);
>>> void unmap_hugepage_range(struct vm_area_struct *,
>>> - unsigned long, unsigned long, struct page *);
>>> + unsigned long, unsigned long, struct page *,
>>> + unsigned long);
>>> void __unmap_hugepage_range_final(struct mmu_gather *tlb,
>>> struct vm_area_struct *vma,
>>> unsigned long start, unsigned long end,
>>> - struct page *ref_page);
>>> + struct page *ref_page, unsigned long zap_flags);
>>> void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>> unsigned long start, unsigned long end,
>>> - struct page *ref_page);
>>> + struct page *ref_page, unsigned long zap_flags);
>>
>> Nothing introduced with your patch, but it seems __unmap_hugepage_range_final
>> does not need to be in the header and can be static in hugetlb.c.
>
> It seems to be used in unmap_single_vma() of mm/memory.c?

Sorry, that should have been __unmap_hugepage_range. No need for you to
address this in this series.

>>
>> Everything else looks good,
>>
>> Reviewed-by: Mike Kravetz <[email protected]>
>
> Please let me know whether you want my extra paragraph in the commit message,
> then I'll coordinate accordingly with the R-b. Thanks!

Yes, please do add the extra paragraph.

--
Mike Kravetz

2021-04-26 22:08:17

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 21/23] hugetlb/userfaultfd: Only drop uffd-wp special pte if required

On Mon, Apr 26, 2021 at 02:36:26PM -0700, Mike Kravetz wrote:
> On 4/26/21 2:16 PM, Peter Xu wrote:
> > On Fri, Apr 23, 2021 at 01:33:08PM -0700, Mike Kravetz wrote:
> >> On 3/22/21 5:50 PM, Peter Xu wrote:
> >>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> >>> index 92710600596e..4047fa042782 100644
> >>> --- a/include/linux/hugetlb.h
> >>> +++ b/include/linux/hugetlb.h
> >>> @@ -121,14 +121,15 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
> >>> unsigned long *, unsigned long *, long, unsigned int,
> >>> int *);
> >>> void unmap_hugepage_range(struct vm_area_struct *,
> >>> - unsigned long, unsigned long, struct page *);
> >>> + unsigned long, unsigned long, struct page *,
> >>> + unsigned long);
> >>> void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> >>> struct vm_area_struct *vma,
> >>> unsigned long start, unsigned long end,
> >>> - struct page *ref_page);
> >>> + struct page *ref_page, unsigned long zap_flags);
> >>> void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >>> unsigned long start, unsigned long end,
> >>> - struct page *ref_page);
> >>> + struct page *ref_page, unsigned long zap_flags);
> >>
> >> Nothing introduced with your patch, but it seems __unmap_hugepage_range_final
> >> does not need to be in the header and can be static in hugetlb.c.
> >
> > It seems to be used in unmap_single_vma() of mm/memory.c?
>
> Sorry, that should have been __unmap_hugepage_range. No need for you to
> address in this series.

Ah yes; I'd rather add a patch if you won't object, since my series will touch
that definition.

>
> >>
> >> Everything else looks good,
> >>
> >> Reviewed-by: Mike Kravetz <[email protected]>
> >
> > Please let me know whether you want my extra paragraph in the commit message,
> > then I'll coordinate accordingly with the R-b. Thanks!
>
> Yes, please do add the extra paragraph.

Will do.

--
Peter Xu

2021-04-26 23:10:35

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 21/23] hugetlb/userfaultfd: Only drop uffd-wp special pte if required

On 4/26/21 3:05 PM, Peter Xu wrote:
> On Mon, Apr 26, 2021 at 02:36:26PM -0700, Mike Kravetz wrote:
>> On 4/26/21 2:16 PM, Peter Xu wrote:
>>> On Fri, Apr 23, 2021 at 01:33:08PM -0700, Mike Kravetz wrote:
>>>> On 3/22/21 5:50 PM, Peter Xu wrote:
>>>>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>>>>> index 92710600596e..4047fa042782 100644
>>>>> --- a/include/linux/hugetlb.h
>>>>> +++ b/include/linux/hugetlb.h
>>>>> @@ -121,14 +121,15 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
>>>>> unsigned long *, unsigned long *, long, unsigned int,
>>>>> int *);
>>>>> void unmap_hugepage_range(struct vm_area_struct *,
>>>>> - unsigned long, unsigned long, struct page *);
>>>>> + unsigned long, unsigned long, struct page *,
>>>>> + unsigned long);
>>>>> void __unmap_hugepage_range_final(struct mmu_gather *tlb,
>>>>> struct vm_area_struct *vma,
>>>>> unsigned long start, unsigned long end,
>>>>> - struct page *ref_page);
>>>>> + struct page *ref_page, unsigned long zap_flags);
>>>>> void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>>>> unsigned long start, unsigned long end,
>>>>> - struct page *ref_page);
>>>>> + struct page *ref_page, unsigned long zap_flags);
>>>>
>>>> Nothing introduced with your patch, but it seems __unmap_hugepage_range_final
>>>> does not need to be in the header and can be static in hugetlb.c.
>>>
>>> It seems to be used in unmap_single_vma() of mm/memory.c?
>>
>> Sorry, that should have been __unmap_hugepage_range. No need for you to
>> address in this series.
>
> Ah yes; I'd rather add a patch if you won't object, since my series will touch
> that definition.
>

No objection if you want to fix, thanks.

--
Mike Kravetz