2021-07-15 20:15:36

by Peter Xu

Subject: [PATCH v5 00/26] userfaultfd-wp: Support shmem and hugetlbfs

This is v5 of uffd-wp shmem & hugetlbfs support, which completes uffd-wp as a

full feature. It's based on v5.14-rc1.



I reposted the whole series mainly to trigger the syzbot tests again; sorry if

it brings a bit of noise. Please let me know if there's an easier way to trigger

the syzbot tests instead of reposting the whole series.



Meanwhile, recent discussion around soft-dirty shows that soft-dirty may have

a similar requirement to uffd-wp on persisting the dirty information:



https://lore.kernel.org/lkml/[email protected]/



Then the mechanism provided in this patchset may be suitable for soft-dirty too.



The whole series can also be found online [1].



v5 changelog:

- Fix two issues spotted by syzbot

- Compile test with (1) !USERFAULTFD, (2) USERFAULTFD && !USERFAULTFD_WP



Previous versions:



RFC: https://lore.kernel.org/lkml/[email protected]/

v1: https://lore.kernel.org/lkml/[email protected]/

v2: https://lore.kernel.org/lkml/[email protected]/

v3: https://lore.kernel.org/lkml/[email protected]/

v4: https://lore.kernel.org/lkml/[email protected]/



About Swap Special PTE

======================



In short, the so-called "swap special pte" in this patchset is a new type of

pte that didn't exist before; it is first used in this series for file-backed

memories. It is used to persist information even if the ptes get dropped

while the page cache still exists. For example, when splitting a file-backed

huge pmd, we could simply drop the pmd entry and then wait until another fault

comes. That was okay in the past since all information in the pte could be

recovered from the page cache when the next page fault triggers. However,

uffd-wp is per-pte information which cannot be kept in the page cache, so that

information needs to be maintained somewhere in the pgtable entry, even if the

pgtable entry is going to be dropped. Here, instead of replacing it with a

none entry, we use the "swap special pte". Then when the next page fault

triggers, we can observe orig_pte to recover this information.



I'm copy-pasting part of the commit message from the patch "mm/swap: Introduce

the idea of special swap ptes", which explains this pte from another angle:



We used to have special swap entries, like migration entries, hw-poison

entries, device private entries, etc.



Those "special swap entries" reside in the range that they need to be at least

swap entries first, and their types are decided by swp_type(entry).



This patch introduces another idea called "special swap ptes".



It's very easy to confuse these with "special swap entries", but a special

swap pte should never contain a swap entry at all. That means it's illegal to

call pte_to_swp_entry() upon a special swap pte.



Make the uffd-wp special pte the first special swap pte.



Before this patch, is_swap_pte()==true means one of the below:



(a.1) The pte has a normal swap entry (non_swap_entry()==false). For

example, when an anonymous page got swapped out.



(a.2) The pte has a special swap entry (non_swap_entry()==true). For

example, a migration entry, a hw-poison entry, etc.



After this patch, is_swap_pte()==true means one of the below, where case (b) is

added:



(a) The pte contains a swap entry.



(a.1) The pte has a normal swap entry (non_swap_entry()==false). For

example, when an anonymous page got swapped out.



(a.2) The pte has a special swap entry (non_swap_entry()==true). For

example, a migration entry, a hw-poison entry, etc.



(b) The pte does not contain a swap entry at all (so it cannot be passed

into pte_to_swp_entry()). For example, uffd-wp special swap pte.
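
To make the distinction concrete, a pte walker that needs to tell these cases
apart could be structured roughly as below.  This is only a minimal sketch
using the helpers introduced later in this series (it is not code from any of
the patches; locking and error handling are omitted):

  if (pte_present(pte)) {
          /* A present page; not a swap pte at all */
  } else if (is_swap_pte(pte)) {
          if (is_swap_special_pte(pte)) {
                  /* Case (b): e.g. the uffd-wp special pte; no swap entry inside */
          } else {
                  swp_entry_t entry = pte_to_swp_entry(pte);

                  if (non_swap_entry(entry)) {
                          /* Case (a.2): migration, hw-poison, device private, ... */
                  } else {
                          /* Case (a.1): a real swap entry */
                  }
          }
  } else {
          /* pte_none() */
  }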



Hugetlbfs needs a similar thing because it's also file-backed. I directly

reused the same special pte there, though the shmem and hugetlb changes for

supporting this new pte are different since they don't share much code.



Patch layout

============



Part (1): Shmem support, this is where the special swap pte is introduced.

Some zap rework is needed within the process:



mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte

shmem/userfaultfd: Take care of UFFDIO_COPY_MODE_WP

mm: Clear vmf->pte after pte_unmap_same() returns

mm/userfaultfd: Introduce special pte for unmapped file-backed mem

mm/swap: Introduce the idea of special swap ptes

shmem/userfaultfd: Handle uffd-wp special pte in page fault handler

mm: Drop first_index/last_index in zap_details

mm: Introduce zap_details.zap_flags

mm: Introduce ZAP_FLAG_SKIP_SWAP

shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed

shmem/userfaultfd: Allow wr-protect none pte for file-backed mem

shmem/userfaultfd: Allows file-back mem to be uffd wr-protected on thps

shmem/userfaultfd: Handle the left-overed special swap ptes

shmem/userfaultfd: Pass over uffd-wp special swap pte when fork()



Part (2): Hugetlb support. The "disable huge pmd sharing for uffd-wp" patches

have already been merged. The rest are the changes required to teach hugetlbfs

to understand the special swap pte too, as introduced with the uffd-wp change:



mm/hugetlb: Drop __unmap_hugepage_range definition from hugetlb.h

mm/hugetlb: Introduce huge pte version of uffd-wp helpers

hugetlb/userfaultfd: Hook page faults for uffd write protection

hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP

hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT

mm/hugetlb: Introduce huge version of special swap pte helpers

hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler

hugetlb/userfaultfd: Allow wr-protect none ptes

hugetlb/userfaultfd: Only drop uffd-wp special pte if required



Part (3): Enable both features in code and test (plus pagemap support)



mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

userfaultfd: Enable write protection for shmem & hugetlbfs

userfaultfd/selftests: Enable uffd-wp for shmem/hugetlbfs



Tests

=====



I've tested it using the userfaultfd kselftest program, and also with

umapsort [2], which should be even stricter. Page swapping in/out was tested

during the umapsort runs.



If anyone would like to try umapsort, you'll need to use an extremely hacked

version of the umap library [3], because by default umap only supports

anonymous memory. So to test it we need to build [3] first, then [2].



Any comments would be greatly welcomed. Thanks,



[1] https://github.com/xzpeter/linux/tree/uffd-wp-shmem-hugetlbfs

[2] https://github.com/xzpeter/umap-apps/tree/peter

[3] https://github.com/xzpeter/umap/tree/peter-shmem-hugetlbfs

[4] https://github.com/xzpeter/umap-apps/commit/b0c2c7b1cd9dcb6835e7c59d02ece1f6b7f1ea01



Peter Xu (26):

mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte

shmem/userfaultfd: Take care of UFFDIO_COPY_MODE_WP

mm: Clear vmf->pte after pte_unmap_same() returns

mm/userfaultfd: Introduce special pte for unmapped file-backed mem

mm/swap: Introduce the idea of special swap ptes

shmem/userfaultfd: Handle uffd-wp special pte in page fault handler

mm: Drop first_index/last_index in zap_details

mm: Introduce zap_details.zap_flags

mm: Introduce ZAP_FLAG_SKIP_SWAP

shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed

shmem/userfaultfd: Allow wr-protect none pte for file-backed mem

shmem/userfaultfd: Allows file-back mem to be uffd wr-protected on

thps

shmem/userfaultfd: Handle the left-overed special swap ptes

shmem/userfaultfd: Pass over uffd-wp special swap pte when fork()

mm/hugetlb: Drop __unmap_hugepage_range definition from hugetlb.h

mm/hugetlb: Introduce huge pte version of uffd-wp helpers

hugetlb/userfaultfd: Hook page faults for uffd write protection

hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP

hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT

mm/hugetlb: Introduce huge version of special swap pte helpers

hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler

hugetlb/userfaultfd: Allow wr-protect none ptes

hugetlb/userfaultfd: Only drop uffd-wp special pte if required

mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

mm/userfaultfd: Enable write protection for shmem & hugetlbfs

userfaultfd/selftests: Enable uffd-wp for shmem/hugetlbfs



arch/arm64/kernel/mte.c | 2 +-

arch/x86/include/asm/pgtable.h | 28 +++

fs/hugetlbfs/inode.c | 15 +-

fs/proc/task_mmu.c | 21 +-

fs/userfaultfd.c | 41 ++--

include/asm-generic/hugetlb.h | 15 ++

include/asm-generic/pgtable_uffd.h | 3 +

include/linux/hugetlb.h | 30 ++-

include/linux/mm.h | 44 +++-

include/linux/mm_inline.h | 42 ++++

include/linux/shmem_fs.h | 4 +-

include/linux/swapops.h | 39 +++-

include/linux/userfaultfd_k.h | 54 +++++

include/uapi/linux/userfaultfd.h | 10 +-

mm/gup.c | 2 +-

mm/hmm.c | 2 +-

mm/hugetlb.c | 164 ++++++++++++---

mm/khugepaged.c | 11 +-

mm/madvise.c | 4 +-

mm/memcontrol.c | 2 +-

mm/memory.c | 244 +++++++++++++++++------

mm/migrate.c | 4 +-

mm/mincore.c | 2 +-

mm/mprotect.c | 63 +++++-

mm/mremap.c | 2 +-

mm/page_vma_mapped.c | 6 +-

mm/rmap.c | 8 +

mm/shmem.c | 5 +-

mm/swapfile.c | 2 +-

mm/userfaultfd.c | 73 +++++--

tools/testing/selftests/vm/userfaultfd.c | 9 +-

31 files changed, 763 insertions(+), 188 deletions(-)



--

2.31.1





2021-07-15 20:15:44

by Peter Xu

Subject: [PATCH v5 05/26] mm/swap: Introduce the idea of special swap ptes

We used to have special swap entries, like migration entries, hw-poison
entries, device private entries, etc.

Those "special swap entries" reside in the range that they need to be at least
swap entries first, and their types are decided by swp_type(entry).

This patch introduces another idea called "special swap ptes".

It's very easy to confuse these with "special swap entries", but a special
swap pte should never contain a swap entry at all. That means it's illegal to
call pte_to_swp_entry() upon a special swap pte.

Make the uffd-wp special pte the first special swap pte.

Before this patch, is_swap_pte()==true means one of the below:

(a.1) The pte has a normal swap entry (non_swap_entry()==false). For
example, when an anonymous page got swapped out.

(a.2) The pte has a special swap entry (non_swap_entry()==true). For
example, a migration entry, a hw-poison entry, etc.

After this patch, is_swap_pte()==true means one of the below, where case (b) is
added:

(a) The pte contains a swap entry.

(a.1) The pte has a normal swap entry (non_swap_entry()==false). For
example, when an anonymous page got swapped out.

(a.2) The pte has a special swap entry (non_swap_entry()==true). For
example, a migration entry, a hw-poison entry, etc.

(b) The pte does not contain a swap entry at all (so it cannot be passed
into pte_to_swp_entry()). For example, uffd-wp special swap pte.

Teach the whole mm core about this new idea. This is done by introducing
another helper called pte_has_swap_entry(), which covers cases (a.1) and
(a.2). Before this patch it behaves the same as is_swap_pte() because there's
no special swap pte yet. Most previous users of is_swap_pte() in mm core now
need to use the new helper pte_has_swap_entry() instead, to make sure we won't
try to parse a swap entry from a swap special pte (which does not contain a
swap entry at all!). Callers either handle the swap special pte explicitly, or
naturally fall into the default "else" paths.

Warn properly (e.g., in do_swap_page()) when we see a special swap pte - we
should never call do_swap_page() upon those ptes, and should just bail out early if
it happens.
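
As a rough illustration of the conversion rule applied across mm/ (this is a
sketch, not code taken from the patch; the helpers are the ones introduced
above):

  swp_entry_t entry;

  if (pte_has_swap_entry(pte)) {
          /* Only here is pte_to_swp_entry() legal; was: is_swap_pte(pte) */
          entry = pte_to_swp_entry(pte);
  } else if (is_swap_special_pte(pte)) {
          /* No swap entry inside; mostly treated like pte_none() */
  }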

Signed-off-by: Peter Xu <[email protected]>
---
arch/arm64/kernel/mte.c | 2 +-
fs/proc/task_mmu.c | 14 ++++++++------
include/linux/swapops.h | 39 ++++++++++++++++++++++++++++++++++++++-
mm/gup.c | 2 +-
mm/hmm.c | 2 +-
mm/khugepaged.c | 11 ++++++++++-
mm/madvise.c | 4 ++--
mm/memcontrol.c | 2 +-
mm/memory.c | 7 +++++++
mm/migrate.c | 4 ++--
mm/mincore.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 2 +-
mm/page_vma_mapped.c | 6 +++---
mm/swapfile.c | 2 +-
15 files changed, 78 insertions(+), 23 deletions(-)

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 69b3fde8759e..841ff639b4b5 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -35,7 +35,7 @@ EXPORT_SYMBOL_GPL(mte_async_mode);
static void mte_sync_page_tags(struct page *page, pte_t old_pte,
bool check_swap, bool pte_is_tagged)
{
- if (check_swap && is_swap_pte(old_pte)) {
+ if (check_swap && pte_has_swap_entry(old_pte)) {
swp_entry_t entry = pte_to_swp_entry(old_pte);

if (!non_swap_entry(entry) && mte_restore_tags(entry, page))
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index eb97468dfe4c..9c5af77b5290 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -498,7 +498,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,

if (pte_present(*pte)) {
page = vm_normal_page(vma, addr, *pte);
- } else if (is_swap_pte(*pte)) {
+ } else if (pte_has_swap_entry(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);

if (!non_swap_entry(swpent)) {
@@ -516,8 +516,10 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
}
} else if (is_pfn_swap_entry(swpent))
page = pfn_swap_entry_to_page(swpent);
- } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
- && pte_none(*pte))) {
+ } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) &&
+ mss->check_shmem_swap &&
+ /* Here swap special pte is the same as none pte */
+ (pte_none(*pte) || is_swap_special_pte(*pte)))) {
page = xa_load(&vma->vm_file->f_mapping->i_pages,
linear_page_index(vma, addr));
if (xa_is_value(page))
@@ -689,7 +691,7 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,

if (pte_present(*pte)) {
page = vm_normal_page(vma, addr, *pte);
- } else if (is_swap_pte(*pte)) {
+ } else if (pte_has_swap_entry(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);

if (is_pfn_swap_entry(swpent))
@@ -1071,7 +1073,7 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
ptent = pte_wrprotect(old_pte);
ptent = pte_clear_soft_dirty(ptent);
ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
- } else if (is_swap_pte(ptent)) {
+ } else if (pte_has_swap_entry(ptent)) {
ptent = pte_swp_clear_soft_dirty(ptent);
set_pte_at(vma->vm_mm, addr, pte, ptent);
}
@@ -1374,7 +1376,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
flags |= PM_SOFT_DIRTY;
if (pte_uffd_wp(pte))
flags |= PM_UFFD_WP;
- } else if (is_swap_pte(pte)) {
+ } else if (pte_has_swap_entry(pte)) {
swp_entry_t entry;
if (pte_swp_soft_dirty(pte))
flags |= PM_SOFT_DIRTY;
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index d356ab4047f7..7f46dec3be1d 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -5,6 +5,7 @@
#include <linux/radix-tree.h>
#include <linux/bug.h>
#include <linux/mm_types.h>
+#include <linux/userfaultfd_k.h>

#ifdef CONFIG_MMU

@@ -62,12 +63,48 @@ static inline pgoff_t swp_offset(swp_entry_t entry)
return entry.val & SWP_OFFSET_MASK;
}

-/* check whether a pte points to a swap entry */
+/*
+ * is_swap_pte() returns true for three cases:
+ *
+ * (a) The pte contains a swap entry.
+ *
+ * (a.1) The pte has a normal swap entry (non_swap_entry()==false). For
+ * example, when an anonymous page got swapped out.
+ *
+ * (a.2) The pte has a special swap entry (non_swap_entry()==true). For
+ * example, a migration entry, a hw-poison entry, etc.
+ *
+ * (b) The pte does not contain a swap entry at all (so it cannot be passed
+ * into pte_to_swp_entry()). For example, uffd-wp special swap pte.
+ */
static inline int is_swap_pte(pte_t pte)
{
return !pte_none(pte) && !pte_present(pte);
}

+/*
+ * A swap-like special pte should only be used as special marker to trigger a
+ * page fault. We should treat them similarly as pte_none() in most cases,
+ * except that it may contain some special information that can persist within
+ * the pte. Currently the only special swap pte is UFFD_WP_SWP_PTE_SPECIAL.
+ *
+ * Note: we should never call pte_to_swp_entry() upon a special swap pte,
+ * because a swap special pte does not contain a swap entry!
+ */
+static inline bool is_swap_special_pte(pte_t pte)
+{
+ return pte_swp_uffd_wp_special(pte);
+}
+
+/*
+ * Returns true if the pte contains a swap entry. This includes not only the
+ * normal swp entry case, but also for migration entries, etc.
+ */
+static inline bool pte_has_swap_entry(pte_t pte)
+{
+ return is_swap_pte(pte) && !is_swap_special_pte(pte);
+}
+
/*
* Convert the arch-dependent pte representation of a swp_entry_t into an
* arch-independent swp_entry_t.
diff --git a/mm/gup.c b/mm/gup.c
index 42b8b1fa6521..425c08788921 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -513,7 +513,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
*/
if (likely(!(flags & FOLL_MIGRATION)))
goto no_page;
- if (pte_none(pte))
+ if (!pte_has_swap_entry(pte))
goto no_page;
entry = pte_to_swp_entry(pte);
if (!is_migration_entry(entry))
diff --git a/mm/hmm.c b/mm/hmm.c
index fad6be2bf072..aba1bf2c6742 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -239,7 +239,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
pte_t pte = *ptep;
uint64_t pfn_req_flags = *hmm_pfn;

- if (pte_none(pte)) {
+ if (pte_none(pte) || is_swap_special_pte(pte)) {
required_fault =
hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
if (required_fault)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b0412be08fa2..7376a9b5bfc9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1018,7 +1018,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,

vmf.pte = pte_offset_map(pmd, address);
vmf.orig_pte = *vmf.pte;
- if (!is_swap_pte(vmf.orig_pte)) {
+ if (!pte_has_swap_entry(vmf.orig_pte)) {
pte_unmap(vmf.pte);
continue;
}
@@ -1245,6 +1245,15 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
_pte++, _address += PAGE_SIZE) {
pte_t pteval = *_pte;
if (is_swap_pte(pteval)) {
+ if (is_swap_special_pte(pteval)) {
+ /*
+ * Reuse SCAN_PTE_UFFD_WP. If there will be
+ * new users of is_swap_special_pte(), we'd
+ * better introduce a new result type.
+ */
+ result = SCAN_PTE_UFFD_WP;
+ goto out_unmap;
+ }
if (++unmapped <= khugepaged_max_ptes_swap) {
/*
* Always be strict with uffd-wp
diff --git a/mm/madvise.c b/mm/madvise.c
index 6d3d348b17f4..2a8d2a9fc514 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -204,7 +204,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
pte = *(orig_pte + ((index - start) / PAGE_SIZE));
pte_unmap_unlock(orig_pte, ptl);

- if (pte_present(pte) || pte_none(pte))
+ if (!pte_has_swap_entry(pte))
continue;
entry = pte_to_swp_entry(pte);
if (unlikely(non_swap_entry(entry)))
@@ -596,7 +596,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
for (; addr != end; pte++, addr += PAGE_SIZE) {
ptent = *pte;

- if (pte_none(ptent))
+ if (pte_none(ptent) || is_swap_special_pte(ptent))
continue;
/*
* If the pte has swp_entry, just clear page table to
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae1f5d0cb581..4b46c099ad94 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5738,7 +5738,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,

if (pte_present(ptent))
page = mc_handle_present_pte(vma, addr, ptent);
- else if (is_swap_pte(ptent))
+ else if (pte_has_swap_entry(ptent))
page = mc_handle_swap_pte(vma, ptent, &ent);
else if (pte_none(ptent))
page = mc_handle_file_pte(vma, addr, ptent, &ent);
diff --git a/mm/memory.c b/mm/memory.c
index 0e0de08a2cd5..998a4f9a3744 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3491,6 +3491,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (!pte_unmap_same(vmf))
goto out;

+ /*
+ * We should never call do_swap_page upon a swap special pte; just be
+ * safe to bail out if it happens.
+ */
+ if (WARN_ON_ONCE(is_swap_special_pte(vmf->orig_pte)))
+ goto out;
+
entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry))) {
if (is_migration_entry(entry)) {
diff --git a/mm/migrate.c b/mm/migrate.c
index 23cbd9de030b..b477d0d5f911 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -294,7 +294,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,

spin_lock(ptl);
pte = *ptep;
- if (!is_swap_pte(pte))
+ if (!pte_has_swap_entry(pte))
goto out;

entry = pte_to_swp_entry(pte);
@@ -2276,7 +2276,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,

pte = *ptep;

- if (pte_none(pte)) {
+ if (pte_none(pte) || is_swap_special_pte(pte)) {
if (vma_is_anonymous(vma)) {
mpfn = MIGRATE_PFN_MIGRATE;
migrate->cpages++;
diff --git a/mm/mincore.c b/mm/mincore.c
index 9122676b54d6..5728c3e6473f 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -121,7 +121,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
for (; addr != end; ptep++, addr += PAGE_SIZE) {
pte_t pte = *ptep;

- if (pte_none(pte))
+ if (pte_none(pte) || is_swap_special_pte(pte))
__mincore_unmapped_range(addr, addr + PAGE_SIZE,
vma, vec);
else if (pte_present(pte))
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 883e2cc85cad..4b743394afbe 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -139,7 +139,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
pages++;
- } else if (is_swap_pte(oldpte)) {
+ } else if (pte_has_swap_entry(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
pte_t newpte;

diff --git a/mm/mremap.c b/mm/mremap.c
index 5989d3990020..122b279333ee 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -125,7 +125,7 @@ static pte_t move_soft_dirty_pte(pte_t pte)
#ifdef CONFIG_MEM_SOFT_DIRTY
if (pte_present(pte))
pte = pte_mksoft_dirty(pte);
- else if (is_swap_pte(pte))
+ else if (pte_has_swap_entry(pte))
pte = pte_swp_mksoft_dirty(pte);
#endif
return pte;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index f7b331081791..ff57b67426af 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -36,7 +36,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
* For more details on device private memory see HMM
* (include/linux/hmm.h or mm/hmm.c).
*/
- if (is_swap_pte(*pvmw->pte)) {
+ if (pte_has_swap_entry(*pvmw->pte)) {
swp_entry_t entry;

/* Handle un-addressable ZONE_DEVICE memory */
@@ -90,7 +90,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)

if (pvmw->flags & PVMW_MIGRATION) {
swp_entry_t entry;
- if (!is_swap_pte(*pvmw->pte))
+ if (!pte_has_swap_entry(*pvmw->pte))
return false;
entry = pte_to_swp_entry(*pvmw->pte);

@@ -99,7 +99,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
return false;

pfn = swp_offset(entry);
- } else if (is_swap_pte(*pvmw->pte)) {
+ } else if (pte_has_swap_entry(*pvmw->pte)) {
swp_entry_t entry;

/* Handle un-addressable ZONE_DEVICE memory */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1e07d1c776f2..4993b4454c13 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1951,7 +1951,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
si = swap_info[type];
pte = pte_offset_map(pmd, addr);
do {
- if (!is_swap_pte(*pte))
+ if (!pte_has_swap_entry(*pte))
continue;

entry = pte_to_swp_entry(*pte);
--
2.31.1

2021-07-15 20:15:55

by Peter Xu

Subject: [PATCH v5 03/26] mm: Clear vmf->pte after pte_unmap_same() returns

pte_unmap_same() will always unmap the pte pointer. After the unmap, vmf->pte
will not be valid any more. We should clear it.

This was safe only because nothing accesses vmf->pte after pte_unmap_same()
returns, since the only caller of pte_unmap_same() (so far) is do_swap_page(),
where vmf->pte will in most cases be overwritten very soon afterwards.

pte_unmap_same() will be used in other places in follow-up patches, where
vmf->pte will not always be re-written. This patch enables us to call
functions like finish_fault(), which conditionally unmaps the pte by checking
vmf->pte first, or alloc_set_pte(), which will make sure to allocate a new
pte even after pte_unmap_same() has been called.

Since we'll need to modify vmf->pte, directly pass vmf into pte_unmap_same()
and then we can also avoid the long parameter list.
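
The resulting calling convention looks roughly like the sketch below
(illustrative only, not a quote of the diff):

  if (!pte_unmap_same(vmf))
          goto out;
  /*
   * vmf->pte is NULL from here on; code that runs later (e.g.
   * finish_fault()) can check "if (vmf->pte)" before touching it.
   */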

Reviewed-by: Miaohe Lin <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/memory.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 747a01d495f2..0e0de08a2cd5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2724,19 +2724,20 @@ EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
* proceeding (but do_wp_page is only called after already making such a check;
* and do_anonymous_page can safely check later on).
*/
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
- pte_t *page_table, pte_t orig_pte)
+static inline int pte_unmap_same(struct vm_fault *vmf)
{
int same = 1;
#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
if (sizeof(pte_t) > sizeof(unsigned long)) {
- spinlock_t *ptl = pte_lockptr(mm, pmd);
+ spinlock_t *ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
spin_lock(ptl);
- same = pte_same(*page_table, orig_pte);
+ same = pte_same(*vmf->pte, vmf->orig_pte);
spin_unlock(ptl);
}
#endif
- pte_unmap(page_table);
+ pte_unmap(vmf->pte);
+ /* After unmap of pte, the pointer is invalid now - clear it. */
+ vmf->pte = NULL;
return same;
}

@@ -3487,7 +3488,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vm_fault_t ret = 0;
void *shadow = NULL;

- if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
+ if (!pte_unmap_same(vmf))
goto out;

entry = pte_to_swp_entry(vmf->orig_pte);
--
2.31.1

2021-07-15 20:15:59

by Peter Xu

Subject: [PATCH v5 06/26] shmem/userfaultfd: Handle uffd-wp special pte in page fault handler

File-backed memories are prone to unmap/swap so the ptes are always unstable.
This could lead to userfaultfd-wp information getting lost when pages are
unmapped or swapped out on such types of memory, for example, shmem. To keep
this information persistent, we will start to use the newly introduced
swap-like special ptes to replace a none pte when those ptes are removed.

Prepare for this by first teaching the general page fault handler how to
handle such a special pte, before later patches start installing it.

Handling a page fault on this special pte is similar to handling a missing
fault, but it should happen after the pte-missing logic since the special pte
is designed to be a swap-like pte. Meanwhile it should be handled before
do_swap_page() so that the swap core logic won't be confused by such an
illegal swap pte.

This is a slow path of uffd-wp handling, because unmapping wr-protected shmem
ptes should be rare. So far it should only trigger in two conditions:

(1) When trying to punch holes in shmem_fallocate(), there will be a
pre-unmap optimization before evicting the page. That will create
unmapped shmem ptes with wr-protected pages covered.

(2) Swapping out shmem pages

Because of this, the page fault handling is simplified too: we do not send the
wr-protect message at the 1st page fault, instead the page will be installed
read-only, so the message will only be generated at the next write, which will
trigger the do_wp_page() path of general uffd-wp handling.

Disable fault-around for all uffd-wp registered ranges for extra safety, and
clean the code up a bit after we introduced MINOR fault.
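
In short, the fault dispatch after this patch looks like the following
(a condensed sketch of the hunks below, not a verbatim quote):

  /* handle_pte_fault(), non-present pte path */
  if (!pte_present(vmf->orig_pte))
          return do_swap_pte(vmf);

  /* do_swap_pte(): special swap ptes are filtered out first */
  if (unlikely(pte_swp_uffd_wp_special(vmf->orig_pte)))
          return uffd_wp_handle_special(vmf);   /* missing-style fault */
  return do_swap_page(vmf);                     /* real swap entries */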

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/userfaultfd_k.h | 17 +++++++
mm/memory.c | 88 +++++++++++++++++++++++++++++++----
2 files changed, 95 insertions(+), 10 deletions(-)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index bb5a72a2b07a..92606d95b005 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -94,6 +94,18 @@ static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
}

+/*
+ * Don't do fault around for either WP or MINOR registered uffd range. For
+ * MINOR registered range, fault around will be a total disaster and ptes can
+ * be installed without notifications; for WP it should mostly be fine as long
+ * as the fault around checks for pte_none() before the installation, however
+ * to be super safe we just forbid it.
+ */
+static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
+{
+ return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
+}
+
static inline bool userfaultfd_missing(struct vm_area_struct *vma)
{
return vma->vm_flags & VM_UFFD_MISSING;
@@ -259,6 +271,11 @@ static inline bool pte_swp_uffd_wp_special(pte_t pte)
return false;
}

+static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
+{
+ return false;
+}
+
#endif /* CONFIG_USERFAULTFD */

#endif /* _LINUX_USERFAULTFD_K_H */
diff --git a/mm/memory.c b/mm/memory.c
index 998a4f9a3744..ba8033ca6682 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3964,6 +3964,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
{
struct vm_area_struct *vma = vmf->vma;
+ bool uffd_wp = pte_swp_uffd_wp_special(vmf->orig_pte);
bool write = vmf->flags & FAULT_FLAG_WRITE;
bool prefault = vmf->address != addr;
pte_t entry;
@@ -3978,6 +3979,8 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)

if (write)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (unlikely(uffd_wp))
+ entry = pte_mkuffd_wp(pte_wrprotect(entry));
/* copy-on-write page */
if (write && !(vma->vm_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
@@ -4045,8 +4048,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
ret = 0;
- /* Re-check under ptl */
- if (likely(pte_none(*vmf->pte)))
+
+ /*
+ * Re-check under ptl. Note: this will cover both none pte and
+ * uffd-wp-special swap pte
+ */
+ if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
do_set_pte(vmf, page, vmf->address);
else
ret = VM_FAULT_NOPAGE;
@@ -4150,9 +4157,21 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
return vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
}

+/* Return true if we should do read fault-around, false otherwise */
+static inline bool should_fault_around(struct vm_fault *vmf)
+{
+ /* No ->map_pages? No way to fault around... */
+ if (!vmf->vma->vm_ops->map_pages)
+ return false;
+
+ if (uffd_disable_fault_around(vmf->vma))
+ return false;
+
+ return fault_around_bytes >> PAGE_SHIFT > 1;
+}
+
static vm_fault_t do_read_fault(struct vm_fault *vmf)
{
- struct vm_area_struct *vma = vmf->vma;
vm_fault_t ret = 0;

/*
@@ -4160,12 +4179,10 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
* if page by the offset is not ready to be mapped (cold cache or
* something).
*/
- if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
- if (likely(!userfaultfd_minor(vmf->vma))) {
- ret = do_fault_around(vmf);
- if (ret)
- return ret;
- }
+ if (should_fault_around(vmf)) {
+ ret = do_fault_around(vmf);
+ if (ret)
+ return ret;
}

ret = __do_fault(vmf);
@@ -4484,6 +4501,57 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
return VM_FAULT_FALLBACK;
}

+static vm_fault_t uffd_wp_clear_special(struct vm_fault *vmf)
+{
+ vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ /*
+ * Be careful so that we will only recover a special uffd-wp pte into a
+ * none pte. Otherwise it means the pte could have changed, so retry.
+ */
+ if (pte_swp_uffd_wp_special(*vmf->pte))
+ pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
+}
+
+/*
+ * This is actually a page-missing access, but with uffd-wp special pte
+ * installed. It means this pte was wr-protected before being unmapped.
+ */
+static vm_fault_t uffd_wp_handle_special(struct vm_fault *vmf)
+{
+ /* Careful! vmf->pte unmapped after return */
+ if (!pte_unmap_same(vmf))
+ return 0;
+
+ /*
+ * Just in case there're leftover special ptes even after the region
+ * got unregistered - we can simply clear them.
+ */
+ if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
+ return uffd_wp_clear_special(vmf);
+
+ /*
+ * Here we share most code with do_fault(), in which we can identify
+ * whether this is "none pte fault" or "uffd-wp-special fault" by
+ * checking the vmf->orig_pte.
+ */
+ return do_fault(vmf);
+}
+
+static vm_fault_t do_swap_pte(struct vm_fault *vmf)
+{
+ /*
+ * We need to handle special swap ptes before handling ptes that
+ * contain swap entries, always.
+ */
+ if (unlikely(pte_swp_uffd_wp_special(vmf->orig_pte)))
+ return uffd_wp_handle_special(vmf);
+
+ return do_swap_page(vmf);
+}
+
/*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most
@@ -4558,7 +4626,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
}

if (!pte_present(vmf->orig_pte))
- return do_swap_page(vmf);
+ return do_swap_pte(vmf);

if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
--
2.31.1

2021-07-15 20:16:04

by Peter Xu

Subject: [PATCH v5 04/26] mm/userfaultfd: Introduce special pte for unmapped file-backed mem

This patch introduces a very special swap-like pte for file-backed memories.

Currently it's only defined for x86_64, but any arch that can properly define
the UFFD_WP_SWP_PTE_SPECIAL value as requested should conceptually work too.

We will use this special pte to arm the ptes that got either unmapped or
swapped out for a file-backed region that was previously wr-protected. This
special pte can trigger a page fault just like swap entries do, because it
satisfies pte_none()==false && pte_present()==false.

Then we can revive the special pte into a normal pte backed by the page cache.

This idea is greatly inspired by Hugh and Andrea in the discussion, which is
referenced in the links below.

The other idea (from Hugh) is that we use swp_type==1 and swp_offset=0 as the
special pte. The current solution (as pointed out by Andrea) is slightly
preferred in that we don't even need swp_entry_t knowledge at all in trapping
these accesses. Meanwhile, we also reuse _PAGE_SWP_UFFD_WP from the anonymous
swp entries.
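
As a rough usage sketch (the arming side only appears in later patches of
this series, so this is illustrative rather than part of this patch):

  /* Arm: replace a just-cleared pte of a wr-protected file-backed vma */
  set_pte_at(mm, addr, ptep, pte_swp_mkuffd_wp_special(vma));

  /* Detect at fault time: none==false, present==false, special==true */
  if (pte_swp_uffd_wp_special(*ptep))
          /* previously wr-protected; handled like a missing page fault */ ;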

This patch only introduces the special pte and its operators. It's not yet
used anywhere, so there is no functional change.

Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Suggested-by: Andrea Arcangeli <[email protected]>
Suggested-by: Hugh Dickins <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
arch/x86/include/asm/pgtable.h | 28 ++++++++++++++++++++++++++++
include/asm-generic/pgtable_uffd.h | 3 +++
include/linux/userfaultfd_k.h | 25 +++++++++++++++++++++++++
3 files changed, 56 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 448cd01eb3ec..71b1e73d5b26 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1300,6 +1300,34 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
#endif

#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+
+/*
+ * This is a very special swap-like pte that marks this pte as "wr-protected"
+ * by userfaultfd-wp. It should only exist for file-backed memory where the
+ * page (previously got wr-protected) has been unmapped or swapped out.
+ *
+ * For anonymous memories, the userfaultfd-wp _PAGE_SWP_UFFD_WP bit is kept
+ * along with a real swp entry instead.
+ *
+ * Let's make some rules for this special pte:
+ *
+ * (1) pte_none()==false, so that it'll not trigger a missing page fault.
+ *
+ * (2) pte_present()==false, so that it's recognized as swap (is_swap_pte).
+ *
+ * (3) pte_swp_uffd_wp()==true, so it can be tested just like a swap pte that
+ * contains a valid swap entry, so that we can check a swap pte always
+ * using "is_swap_pte() && pte_swp_uffd_wp()" without caring about whether
+ * there's one swap entry inside of the pte.
+ *
+ * (4) It should not be a valid swap pte anywhere, so that when we see this pte
+ * we know it does not contain a swap entry.
+ *
+ * For x86, the simplest special pte which satisfies all of above should be the
+ * pte with only _PAGE_SWP_UFFD_WP bit set (where swp_type==swp_offset==0).
+ */
+#define UFFD_WP_SWP_PTE_SPECIAL __pte(_PAGE_SWP_UFFD_WP)
+
static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
{
return pte_set_flags(pte, _PAGE_SWP_UFFD_WP);
diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
index 828966d4c281..95e9811ce9d1 100644
--- a/include/asm-generic/pgtable_uffd.h
+++ b/include/asm-generic/pgtable_uffd.h
@@ -2,6 +2,9 @@
#define _ASM_GENERIC_PGTABLE_UFFD_H

#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+
+#define UFFD_WP_SWP_PTE_SPECIAL __pte(0)
+
static __always_inline int pte_uffd_wp(pte_t pte)
{
return 0;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 331d2ccf0bcc..bb5a72a2b07a 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -145,6 +145,21 @@ extern int userfaultfd_unmap_prep(struct vm_area_struct *vma,
extern void userfaultfd_unmap_complete(struct mm_struct *mm,
struct list_head *uf);

+static inline pte_t pte_swp_mkuffd_wp_special(struct vm_area_struct *vma)
+{
+ WARN_ON_ONCE(vma_is_anonymous(vma));
+ return UFFD_WP_SWP_PTE_SPECIAL;
+}
+
+static inline bool pte_swp_uffd_wp_special(pte_t pte)
+{
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+ return pte_same(pte, UFFD_WP_SWP_PTE_SPECIAL);
+#else
+ return false;
+#endif
+}
+
#else /* CONFIG_USERFAULTFD */

/* mm helpers */
@@ -234,6 +249,16 @@ static inline void userfaultfd_unmap_complete(struct mm_struct *mm,
{
}

+static inline pte_t pte_swp_mkuffd_wp_special(struct vm_area_struct *vma)
+{
+ return __pte(0);
+}
+
+static inline bool pte_swp_uffd_wp_special(pte_t pte)
+{
+ return false;
+}
+
#endif /* CONFIG_USERFAULTFD */

#endif /* _LINUX_USERFAULTFD_K_H */
--
2.31.1

2021-07-15 20:16:24

by Peter Xu

Subject: [PATCH v5 08/26] mm: Introduce zap_details.zap_flags

Instead of trying to introduce one variable for every new zap_details field,
let's introduce a flag field that can start to encode true/false information.

Let's start by using this flag to clean up the existing check_mapping
variable. The name "check_mapping" implies it is a boolean, but it actually
stores the mapping itself, just in a way that it won't be set if we don't want
to check the mapping.

To make things clearer, introduce the 1st zap flag ZAP_FLAG_CHECK_MAPPING, so
that we only check against the mapping if this bit is set. At the same time,
we can rename check_mapping to zap_mapping and always set it.

While at it, introduce another helper zap_skip_check_mapping() and use it in
zap_pte_range() properly.

Some old comments have been removed in zap_pte_range() because they were
duplicated, and now that we have the ZAP_FLAG_CHECK_MAPPING flag it's very
easy to find this information by simply grepping for the flag.

It'll also make life easier when we want to e.g. pass zap_flags into the
callers like unmap_mapping_pages() (instead of adding new booleans besides the
even_cows parameter).
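
For illustration, a caller that wants the old check_mapping behaviour would
now do something like the sketch below (based on the hunks that follow):

  struct zap_details details = {
          .zap_mapping = mapping,
          .zap_flags   = ZAP_FLAG_CHECK_MAPPING,
  };

  /*
   * zap_skip_check_mapping(&details, page) is then true only for pages
   * whose page_rmapping() differs from "mapping"; without the flag the
   * mapping is ignored and nothing is skipped.
   */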

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 19 ++++++++++++++++++-
mm/memory.c | 34 ++++++++++------------------------
2 files changed, 28 insertions(+), 25 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 795b3cd643ca..26088174daab 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1715,14 +1715,31 @@ static inline bool can_do_mlock(void) { return false; }
extern int user_shm_lock(size_t, struct ucounts *);
extern void user_shm_unlock(size_t, struct ucounts *);

+/* Whether to check page->mapping when zapping */
+#define ZAP_FLAG_CHECK_MAPPING BIT(0)
+
/*
* Parameter block passed down to zap_pte_range in exceptional cases.
*/
struct zap_details {
- struct address_space *check_mapping; /* Check page->mapping if set */
+ struct address_space *zap_mapping;
struct page *single_page; /* Locked page to be unmapped */
+ unsigned long zap_flags;
};

+/* Return true if skip zapping this page, false otherwise */
+static inline bool
+zap_skip_check_mapping(struct zap_details *details, struct page *page)
+{
+ if (!details || !page)
+ return false;
+
+ if (!(details->zap_flags & ZAP_FLAG_CHECK_MAPPING))
+ return false;
+
+ return details->zap_mapping != page_rmapping(page);
+}
+
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index 4c269d7b3d83..2a5a6650f069 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1333,16 +1333,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct page *page;

page = vm_normal_page(vma, addr, ptent);
- if (unlikely(details) && page) {
- /*
- * unmap_shared_mapping_pages() wants to
- * invalidate cache without truncating:
- * unmap shared but keep private pages.
- */
- if (details->check_mapping &&
- details->check_mapping != page_rmapping(page))
- continue;
- }
+ if (unlikely(zap_skip_check_mapping(details, page)))
+ continue;
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
tlb_remove_tlb_entry(tlb, pte, addr);
@@ -1375,17 +1367,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
is_device_exclusive_entry(entry)) {
struct page *page = pfn_swap_entry_to_page(entry);

- if (unlikely(details && details->check_mapping)) {
- /*
- * unmap_shared_mapping_pages() wants to
- * invalidate cache without truncating:
- * unmap shared but keep private pages.
- */
- if (details->check_mapping !=
- page_rmapping(page))
- continue;
- }
-
+ if (unlikely(zap_skip_check_mapping(details, page)))
+ continue;
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
rss[mm_counter(page)]--;

@@ -3369,8 +3352,9 @@ void unmap_mapping_page(struct page *page)
first_index = page->index;
last_index = page->index + thp_nr_pages(page) - 1;

- details.check_mapping = mapping;
+ details.zap_mapping = mapping;
details.single_page = page;
+ details.zap_flags = ZAP_FLAG_CHECK_MAPPING;

i_mmap_lock_write(mapping);
if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
@@ -3395,9 +3379,11 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
pgoff_t nr, bool even_cows)
{
pgoff_t first_index = start, last_index = start + nr - 1;
- struct zap_details details = { };
+ struct zap_details details = { .zap_mapping = mapping };
+
+ if (!even_cows)
+ details.zap_flags |= ZAP_FLAG_CHECK_MAPPING;

- details.check_mapping = even_cows ? NULL : mapping;
if (last_index < first_index)
last_index = ULONG_MAX;

--
2.31.1

2021-07-15 20:16:41

by Peter Xu

[permalink] [raw]
Subject: [PATCH v5 01/26] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte

Setting the pte dirty was done conditionally previously, as there's one shmem
special case where we use SetPageDirty() instead. However that's not
necessary, and it should be easier and cleaner to do it unconditionally in
mfill_atomic_install_pte().

The most recent discussion about this is here, where Hugh explained the history
of SetPageDirty() and why it's possible that it's not required at all:

https://lore.kernel.org/lkml/[email protected]/

Currently mfill_atomic_install_pte() has three callers:

1. shmem_mfill_atomic_pte
2. mcopy_atomic_pte
3. mcontinue_atomic_pte

After the change: case (1) should have its SetPageDirty replaced by the dirty
bit on pte (so we unify them together, finally), case (2) should have no
functional change at all as it has page_in_cache==false, case (3) may add a
dirty bit to the pte. However since case (3) is UFFDIO_CONTINUE for shmem,
it's nearly 100% sure the page is dirty already, so it should not make a real
difference either.

This should make it much easier to follow which cases will set the dirty bit
for uffd, as we'll now simply set it for all uffd-related ioctls. Meanwhile,
there's no longer any need for special handling of SetPageDirty().

Cc: Hugh Dickins <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/shmem.c | 1 -
mm/userfaultfd.c | 3 +--
2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 70d9ce294bb4..dc9f95b5fb34 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2449,7 +2449,6 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
shmem_recalc_inode(inode);
spin_unlock_irq(&info->lock);

- SetPageDirty(page);
unlock_page(page);
return 0;
out_delete_from_cache:
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0e2132834bc7..b30a3724c701 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -69,10 +69,9 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
pgoff_t offset, max_off;

_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
+ _dst_pte = pte_mkdirty(_dst_pte);
if (page_in_cache && !vm_shared)
writable = false;
- if (writable || !page_in_cache)
- _dst_pte = pte_mkdirty(_dst_pte);
if (writable) {
if (wp_copy)
_dst_pte = pte_mkuffd_wp(_dst_pte);
--
2.31.1

2021-07-15 20:16:41

by Peter Xu

Subject: [PATCH v5 10/26] shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed

File-backed memory is prone to being unmapped at any time. It means all
information in the pte will be dropped, including the uffd-wp flag.

Since the uffd-wp info cannot be stored in page cache or swap cache, persist
this wr-protect information by installing the special uffd-wp marker pte when
we're going to unmap a uffd wr-protected pte. When the pte is accessed again,
we will know it's previously wr-protected by recognizing the special pte.

Meanwhile add a new flag ZAP_FLAG_DROP_FILE_UFFD_WP for when we don't want to
persist such information, for example, when destroying the whole vma, or when
punching a hole in a shmem file. For the latter, we can only drop the uffd-wp
bit when holding the page lock. It means the unmap_mapping_range() in
shmem_fallocate() still needs to zap without ZAP_FLAG_DROP_FILE_UFFD_WP
because that's still racy with the page faults.
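
The zap-time rule, condensed from the changes below (sketch only, shown in
the context of zap_pte_range()):

  ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
  /*
   * For a file-backed vma, unless the caller passed
   * ZAP_FLAG_DROP_FILE_UFFD_WP, a uffd-wp'ed pte is replaced by the
   * special marker instead of becoming pte_none().
   */
  zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);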

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 11 +++++++++
include/linux/mm_inline.h | 42 +++++++++++++++++++++++++++++++++
mm/memory.c | 49 +++++++++++++++++++++++++++++++++++++--
mm/rmap.c | 8 +++++++
4 files changed, 108 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 62a75e4414e3..8de230fc7b84 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1719,6 +1719,8 @@ extern void user_shm_unlock(size_t, struct ucounts *);
#define ZAP_FLAG_CHECK_MAPPING BIT(0)
/* Whether to skip zapping swap entries */
#define ZAP_FLAG_SKIP_SWAP BIT(1)
+/* Whether to completely drop uffd-wp entries for file-backed memory */
+#define ZAP_FLAG_DROP_FILE_UFFD_WP BIT(2)

/*
* Parameter block passed down to zap_pte_range in exceptional cases.
@@ -1752,6 +1754,15 @@ zap_skip_swap(struct zap_details *details)
return details->zap_flags & ZAP_FLAG_SKIP_SWAP;
}

+static inline bool
+zap_drop_file_uffd_wp(struct zap_details *details)
+{
+ if (!details)
+ return false;
+
+ return details->zap_flags & ZAP_FLAG_DROP_FILE_UFFD_WP;
+}
+
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 355ea1ee32bd..95556b4bfe7a 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -4,6 +4,8 @@

#include <linux/huge_mm.h>
#include <linux/swap.h>
+#include <linux/userfaultfd_k.h>
+#include <linux/swapops.h>

/**
* page_is_file_lru - should the page be on a file LRU or anon LRU?
@@ -104,4 +106,44 @@ static __always_inline void del_page_from_lru_list(struct page *page,
update_lru_size(lruvec, page_lru(page), page_zonenum(page),
-thp_nr_pages(page));
}
+
+/*
+ * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
+ * replace a none pte. NOTE! This should only be called when *pte is already
+ * cleared so we will never accidentally replace something valuable. Meanwhile
+ * none pte also means we are not demoting the pte so tlb flushed is not needed.
+ * E.g., when pte cleared the caller should have taken care of the tlb flush.
+ *
+ * Must be called with pgtable lock held.
+ */
+static inline void
+pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *pte, pte_t pteval)
+{
+#ifdef CONFIG_USERFAULTFD
+ bool arm_uffd_pte = false;
+
+ /* The current status of the pte should be "cleared" before calling */
+ WARN_ON_ONCE(!pte_none(*pte));
+
+ if (vma_is_anonymous(vma))
+ return;
+
+ /* A uffd-wp wr-protected normal pte */
+ if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
+ arm_uffd_pte = true;
+
+ /*
+ * A uffd-wp wr-protected swap pte. Note: this should even work for
+ * pte_swp_uffd_wp_special() too.
+ */
+ if (unlikely(is_swap_pte(pteval) && pte_swp_uffd_wp(pteval)))
+ arm_uffd_pte = true;
+
+ if (unlikely(arm_uffd_pte))
+ set_pte_at(vma->vm_mm, addr, pte,
+ pte_swp_mkuffd_wp_special(vma));
+#endif
+}
+
#endif
diff --git a/mm/memory.c b/mm/memory.c
index d6b1adbf29e4..223781f115e9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -73,6 +73,7 @@
#include <linux/perf_event.h>
#include <linux/ptrace.h>
#include <linux/vmalloc.h>
+#include <linux/mm_inline.h>

#include <trace/events/kmem.h>

@@ -1301,6 +1302,21 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
return ret;
}

+/*
+ * This function makes sure that we'll replace the none pte with an uffd-wp
+ * swap special pte marker when necessary. Must be with the pgtable lock held.
+ */
+static inline void
+zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *pte,
+ struct zap_details *details, pte_t pteval)
+{
+ if (zap_drop_file_uffd_wp(details))
+ return;
+
+ pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
+}
+
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
@@ -1338,6 +1354,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
tlb_remove_tlb_entry(tlb, pte, addr);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, details,
+ ptent);
if (unlikely(!page))
continue;

@@ -1362,6 +1380,22 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
continue;
}

+ /*
+ * If this is a special uffd-wp marker pte... Drop it only if
+ * enforced to do so.
+ */
+ if (unlikely(is_swap_special_pte(ptent))) {
+ WARN_ON_ONCE(!pte_swp_uffd_wp_special(ptent));
+ /*
+ * If this is a common unmap of ptes, keep this as is.
+ * Drop it only if this is a whole-vma destruction.
+ */
+ if (zap_drop_file_uffd_wp(details))
+ ptep_get_and_clear_full(mm, addr, pte,
+ tlb->fullmm);
+ continue;
+ }
+
entry = pte_to_swp_entry(ptent);
if (is_device_private_entry(entry) ||
is_device_exclusive_entry(entry)) {
@@ -1369,6 +1403,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,

if (unlikely(zap_skip_check_mapping(details, page)))
continue;
+ /*
+ * Both device private/exclusive mappings should only
+ * work with anonymous page so far, so we don't need to
+ * consider uffd-wp bit when zap. For more information,
+ * see zap_install_uffd_wp_if_needed().
+ */
+ WARN_ON_ONCE(!vma_is_anonymous(vma));
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
rss[mm_counter(page)]--;

@@ -1393,6 +1434,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (unlikely(!free_swap_and_cache(entry)))
print_bad_pte(vma, addr, ptent, NULL);
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
} while (pte++, addr += PAGE_SIZE, addr != end);

add_mm_rss_vec(mm, rss);
@@ -1603,12 +1645,15 @@ void unmap_vmas(struct mmu_gather *tlb,
unsigned long end_addr)
{
struct mmu_notifier_range range;
+ struct zap_details details = {
+ .zap_flags = ZAP_FLAG_DROP_FILE_UFFD_WP,
+ };

mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
start_addr, end_addr);
mmu_notifier_invalidate_range_start(&range);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
- unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
+ unmap_single_vma(tlb, vma, start_addr, end_addr, &details);
mmu_notifier_invalidate_range_end(&range);
}

@@ -3353,7 +3398,7 @@ void unmap_mapping_page(struct page *page)

details.zap_mapping = mapping;
details.single_page = page;
- details.zap_flags = ZAP_FLAG_CHECK_MAPPING;
+ details.zap_flags = ZAP_FLAG_CHECK_MAPPING | ZAP_FLAG_DROP_FILE_UFFD_WP;

i_mmap_lock_write(mapping);
if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
diff --git a/mm/rmap.c b/mm/rmap.c
index 795f9d5f8386..92ba81567089 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -72,6 +72,7 @@
#include <linux/page_idle.h>
#include <linux/memremap.h>
#include <linux/userfaultfd_k.h>
+#include <linux/mm_inline.h>

#include <asm/tlbflush.h>

@@ -1515,6 +1516,13 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
pteval = ptep_clear_flush(vma, address, pvmw.pte);
}

+ /*
+ * Now the pte is cleared. If this is uffd-wp armed pte, we
+ * may want to replace a none pte with a marker pte if it's
+ * file-backed, so we don't lose the tracking information.
+ */
+ pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
+
/* Move the dirty bit to the page. Now the pte is gone. */
if (pte_dirty(pteval))
set_page_dirty(page);
--
2.31.1

2021-07-15 20:17:19

by Peter Xu

Subject: [PATCH v5 12/26] shmem/userfaultfd: Allows file-back mem to be uffd wr-protected on thps

We don't have a "huge" version of UFFD_WP_SWP_PTE_SPECIAL; instead, when
necessary we split the thp if the huge page was uffd wr-protected previously.

However splitting the thp is not enough, because file-backed thps are handled
totally differently compared to anonymous thps - rather than doing a real
split, the thp pmd will simply get dropped in __split_huge_pmd_locked().

That is definitely not enough if, e.g., there is a thp covering the range
[0, 2M) but we want to wr-protect a small page residing in the [4K, 8K) range,
because after __split_huge_pmd() returns, there will be a none pmd.

Here we leverage the previously introduced change_protection_prepare() macro so
that we'll populate the pmd with a pgtable page. Then change_pte_range() will
do all the rest for us, e.g., install the uffd-wp swap special pte marker at
any pte that we'd like to wr-protect, under the protection of pgtable lock.
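
For example, the request that exercises this path could look like the
following userspace sketch (assuming "uffd" is a userfaultfd registered with
UFFDIO_REGISTER_MODE_WP over a shmem mapping at "base" that is currently
backed by a 2M thp; names and offsets are illustrative):

  #include <linux/userfaultfd.h>   /* uapi: struct uffdio_writeprotect */
  #include <sys/ioctl.h>

  struct uffdio_writeprotect wp = {
          .range = { .start = base + 0x1000, .len = 0x1000 },  /* [4K, 8K) */
          .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
  };
  /*
   * Splits the thp pmd, re-populates a pte pgtable, then installs the
   * uffd-wp special pte for the single 4K page.
   */
  ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);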

Signed-off-by: Peter Xu <[email protected]>
---
mm/mprotect.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8ec85b276975..3fcb87b59696 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -306,8 +306,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
}

if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
- if (next - addr != HPAGE_PMD_SIZE) {
+ if (next - addr != HPAGE_PMD_SIZE ||
+ /* Uffd wr-protecting a file-backed memory range */
+ unlikely(!vma_is_anonymous(vma) &&
+ (cp_flags & MM_CP_UFFD_WP))) {
__split_huge_pmd(vma, pmd, addr, false, NULL);
+ /*
+ * For file-backed, the pmd could have been
+ * gone; still provide a pte pgtable if needed.
+ */
+ change_protection_prepare(vma, pmd, addr, cp_flags);
} else {
int nr_ptes = change_huge_pmd(vma, pmd, addr,
newprot, cp_flags);
--
2.31.1

2021-07-15 20:17:31

by Peter Xu

Subject: [PATCH v5 09/26] mm: Introduce ZAP_FLAG_SKIP_SWAP

Firstly, the comment in zap_pte_range() is misleading because the code checks
against "details" rather than "check_mapping", so the comment doesn't match
what the code actually does.

Meanwhile, it's also confusing that nothing explains why passing in the
details pointer means skipping all swap entries. A new user of zap_details
could very possibly miss this fact if they don't read as deep as
zap_pte_range(), because there's no comment at zap_details mentioning it at
all, so swap entries could be erroneously skipped without being noticed.

This partly reverts 3e8715fdc03e ("mm: drop zap_details::check_swap_entries"),
but introduces the ZAP_FLAG_SKIP_SWAP flag, which means the opposite of the
previous "details" parameter: the caller should explicitly set this to skip swap
entries, otherwise swap entries will always be considered (which is still the
major case here).
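
A sketch of the new convention (based on the hunks below):

  /* Default: swap entries are zapped like before */
  struct zap_details details = { .zap_mapping = mapping };

  /* Opt out explicitly, as unmap_mapping_pages() now does */
  details.zap_flags |= ZAP_FLAG_SKIP_SWAP;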

Cc: Kirill A. Shutemov <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 12 ++++++++++++
mm/memory.c | 8 +++++---
2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 26088174daab..62a75e4414e3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1717,6 +1717,8 @@ extern void user_shm_unlock(size_t, struct ucounts *);

/* Whether to check page->mapping when zapping */
#define ZAP_FLAG_CHECK_MAPPING BIT(0)
+/* Whether to skip zapping swap entries */
+#define ZAP_FLAG_SKIP_SWAP BIT(1)

/*
* Parameter block passed down to zap_pte_range in exceptional cases.
@@ -1740,6 +1742,16 @@ zap_skip_check_mapping(struct zap_details *details, struct page *page)
return details->zap_mapping != page_rmapping(page);
}

+/* Return true if skip swap entries, false otherwise */
+static inline bool
+zap_skip_swap(struct zap_details *details)
+{
+ if (!details)
+ return false;
+
+ return details->zap_flags & ZAP_FLAG_SKIP_SWAP;
+}
+
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index 2a5a6650f069..d6b1adbf29e4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1379,8 +1379,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
continue;
}

- /* If details->check_mapping, we leave swap entries. */
- if (unlikely(details))
+ if (unlikely(zap_skip_swap(details)))
continue;

if (!non_swap_entry(entry))
@@ -3379,7 +3378,10 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
pgoff_t nr, bool even_cows)
{
pgoff_t first_index = start, last_index = start + nr - 1;
- struct zap_details details = { .zap_mapping = mapping };
+ struct zap_details details = {
+ .zap_mapping = mapping,
+ .zap_flags = ZAP_FLAG_SKIP_SWAP,
+ };

if (!even_cows)
details.zap_flags |= ZAP_FLAG_CHECK_MAPPING;
--
2.31.1

2021-07-15 20:17:32

by Peter Xu

[permalink] [raw]
Subject: [PATCH v5 07/26] mm: Drop first_index/last_index in zap_details

The first_index/last_index parameters in zap_details are actually only used in
unmap_mapping_range_tree(). Meanwhile, this function is only called by
unmap_mapping_pages() once. Instead of passing these two variables through the
whole stack of page zapping code, remove them from zap_details and simply make
them parameters of unmap_mapping_range_tree(), which is inlined.

Reviewed-by: Alistair Popple <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 2 --
mm/memory.c | 29 ++++++++++++++++-------------
2 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 57453dba41b9..795b3cd643ca 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1720,8 +1720,6 @@ extern void user_shm_unlock(size_t, struct ucounts *);
*/
struct zap_details {
struct address_space *check_mapping; /* Check page->mapping if set */
- pgoff_t first_index; /* Lowest page->index to unmap */
- pgoff_t last_index; /* Highest page->index to unmap */
struct page *single_page; /* Locked page to be unmapped */
};

diff --git a/mm/memory.c b/mm/memory.c
index ba8033ca6682..4c269d7b3d83 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3322,20 +3322,20 @@ static void unmap_mapping_range_vma(struct vm_area_struct *vma,
}

static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
+ pgoff_t first_index,
+ pgoff_t last_index,
struct zap_details *details)
{
struct vm_area_struct *vma;
pgoff_t vba, vea, zba, zea;

- vma_interval_tree_foreach(vma, root,
- details->first_index, details->last_index) {
-
+ vma_interval_tree_foreach(vma, root, first_index, last_index) {
vba = vma->vm_pgoff;
vea = vba + vma_pages(vma) - 1;
- zba = details->first_index;
+ zba = first_index;
if (zba < vba)
zba = vba;
- zea = details->last_index;
+ zea = last_index;
if (zea > vea)
zea = vea;

@@ -3361,18 +3361,21 @@ void unmap_mapping_page(struct page *page)
{
struct address_space *mapping = page->mapping;
struct zap_details details = { };
+ pgoff_t first_index, last_index;

VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(PageTail(page));

+ first_index = page->index;
+ last_index = page->index + thp_nr_pages(page) - 1;
+
details.check_mapping = mapping;
- details.first_index = page->index;
- details.last_index = page->index + thp_nr_pages(page) - 1;
details.single_page = page;

i_mmap_lock_write(mapping);
if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
- unmap_mapping_range_tree(&mapping->i_mmap, &details);
+ unmap_mapping_range_tree(&mapping->i_mmap, first_index,
+ last_index, &details);
i_mmap_unlock_write(mapping);
}

@@ -3391,17 +3394,17 @@ void unmap_mapping_page(struct page *page)
void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
pgoff_t nr, bool even_cows)
{
+ pgoff_t first_index = start, last_index = start + nr - 1;
struct zap_details details = { };

details.check_mapping = even_cows ? NULL : mapping;
- details.first_index = start;
- details.last_index = start + nr - 1;
- if (details.last_index < details.first_index)
- details.last_index = ULONG_MAX;
+ if (last_index < first_index)
+ last_index = ULONG_MAX;

i_mmap_lock_write(mapping);
if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
- unmap_mapping_range_tree(&mapping->i_mmap, &details);
+ unmap_mapping_range_tree(&mapping->i_mmap, first_index,
+ last_index, &details);
i_mmap_unlock_write(mapping);
}

--
2.31.1

2021-07-15 20:17:39

by Peter Xu

[permalink] [raw]
Subject: [PATCH v5 16/26] mm/hugetlb: Introduce huge pte version of uffd-wp helpers

They will be used in follow-up patches to check/set/clear the uffd-wp bit of a
huge pte.

So far these reuse the small pte helpers. Archs can override these versions
when necessary (with __HAVE_ARCH_HUGE_PTE_UFFD_WP* macros) in the future.

Signed-off-by: Peter Xu <[email protected]>
---
include/asm-generic/hugetlb.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index 8e1e6244a89d..c45b9deb41ff 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -32,6 +32,21 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
return pte_modify(pte, newprot);
}

+static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
+{
+ return pte_mkuffd_wp(pte);
+}
+
+static inline pte_t huge_pte_clear_uffd_wp(pte_t pte)
+{
+ return pte_clear_uffd_wp(pte);
+}
+
+static inline int huge_pte_uffd_wp(pte_t pte)
+{
+ return pte_uffd_wp(pte);
+}
+
#ifndef __HAVE_ARCH_HUGE_PTE_CLEAR
static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, unsigned long sz)
--
2.31.1

2021-07-15 20:18:08

by Peter Xu

[permalink] [raw]
Subject: [PATCH v5 17/26] hugetlb/userfaultfd: Hook page faults for uffd write protection

Hook up hugetlb_fault() with the capability to handle userfaultfd-wp faults.

We do this slightly earlier than hugetlb_cow() so that we can avoid taking some
extra locks that we definitely don't need.

Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4bdd637b0c29..d34636085eaf 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5059,6 +5059,25 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
goto out_ptl;

+ /* Handle userfault-wp first, before trying to lock more pages */
+ if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
+ (flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
+ struct vm_fault vmf = {
+ .vma = vma,
+ .address = haddr,
+ .flags = flags,
+ };
+
+ spin_unlock(ptl);
+ if (pagecache_page) {
+ unlock_page(pagecache_page);
+ put_page(pagecache_page);
+ }
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
+ return handle_userfault(&vmf, VM_UFFD_WP);
+ }
+
/*
* hugetlb_cow() requires page locks of pte_page(entry) and
* pagecache_page, so here we need take the former one
--
2.31.1
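
For context, a sketch of the userspace side that such a fault ends up at (not
part of the patch; field and flag names are the existing uffd-wp uapi, error
handling omitted): the monitor thread reads the fault message, checks
UFFD_PAGEFAULT_FLAG_WP, and resolves it by un-protecting the faulting huge
page, which also wakes the faulting thread.

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Fragment: resolve uffd-wp write faults reported on a hugetlb range. */
static void handle_wp_faults(int uffd, unsigned long hpage_size)
{
	struct uffd_msg msg;

	while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
		if (msg.event != UFFD_EVENT_PAGEFAULT ||
		    !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
			continue;

		struct uffdio_writeprotect wp = {
			.range = {
				.start = msg.arg.pagefault.address &
					 ~(hpage_size - 1),
				.len = hpage_size,
			},
			.mode = 0,	/* clear WP; wakes the faulter */
		};
		ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
	}
}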

2021-07-15 20:18:08

by Peter Xu

[permalink] [raw]
Subject: [PATCH v5 14/26] shmem/userfaultfd: Pass over uffd-wp special swap pte when fork()

It should be handled similarly to other uffd-wp wr-protected ptes: we should
pass it over when the dst_vma has VM_UFFD_WP armed; otherwise drop it.

Signed-off-by: Peter Xu <[email protected]>
---
mm/memory.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 223781f115e9..af91bee934c7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -776,8 +776,21 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
unsigned long vm_flags = dst_vma->vm_flags;
pte_t pte = *src_pte;
struct page *page;
- swp_entry_t entry = pte_to_swp_entry(pte);
+ swp_entry_t entry;
+
+ if (unlikely(is_swap_special_pte(pte))) {
+ /*
+ * uffd-wp special swap pte is the only possibility for now.
+ * If dst vma is registered with uffd-wp, copy it over.
+ * Otherwise, ignore this pte as if it's a none pte would work.
+ */
+ WARN_ON_ONCE(!pte_swp_uffd_wp_special(pte));
+ if (userfaultfd_wp(dst_vma))
+ set_pte_at(dst_mm, addr, dst_pte, pte);
+ return 0;
+ }

+ entry = pte_to_swp_entry(pte);
if (likely(!non_swap_entry(entry))) {
if (swap_duplicate(entry) < 0)
return -EIO;
--
2.31.1

2021-07-15 20:18:28

by Peter Xu

[permalink] [raw]
Subject: [PATCH v5 15/26] mm/hugetlb: Drop __unmap_hugepage_range definition from hugetlb.h

Drop it in the header since it's only used in hugetlb.c.

Suggested-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/hugetlb.h | 10 ----------
mm/hugetlb.c | 6 +++---
2 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index f7ca1a3870ea..c30f39815e13 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -143,9 +143,6 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long start, unsigned long end,
struct page *ref_page);
-void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
- unsigned long start, unsigned long end,
- struct page *ref_page);
void hugetlb_report_meminfo(struct seq_file *);
int hugetlb_report_node_meminfo(char *buf, int len, int nid);
void hugetlb_show_meminfo(void);
@@ -385,13 +382,6 @@ static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
BUG();
}

-static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
- struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page)
-{
- BUG();
-}
-
static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long address,
unsigned int flags)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 924553aa8f78..4bdd637b0c29 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4331,9 +4331,9 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
return ret;
}

-void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
- unsigned long start, unsigned long end,
- struct page *ref_page)
+static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ struct page *ref_page)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
--
2.31.1

2021-07-15 20:18:45

by Peter Xu

[permalink] [raw]
Subject: [PATCH v5 23/26] hugetlb/userfaultfd: Only drop uffd-wp special pte if required

As with the shmem uffd-wp special ptes, only drop the uffd-wp special swap pte
if unmapping an entire vma or synchronized such that faults cannot race with
the unmap operation. This requires passing zap_flags all the way to the lowest
level hugetlb unmap routine: __unmap_hugepage_range.

In general, unmap calls originating in hugetlbfs code will pass the
ZAP_FLAG_DROP_FILE_UFFD_WP flag, as synchronization is in place to prevent
faults. The exception is hole punch, which will first unmap without any
synchronization. Later, when hole punch actually removes the page from the
file, it will check to see if there was a subsequent fault and, if so, take
the hugetlb fault mutex while unmapping again. This second unmap will pass in
ZAP_FLAG_DROP_FILE_UFFD_WP.

The core justification for "whether to apply the ZAP_FLAG_DROP_FILE_UFFD_WP
flag when unmapping a hugetlb range" is (IMHO): we should never reach a state
where a page fault could erroneously fault in a wr-protected page-cache page
as writable, even for an extremely short period. That could happen if e.g. we
passed ZAP_FLAG_DROP_FILE_UFFD_WP in hugetlbfs_punch_hole() when calling
hugetlb_vmdelete_list(), because if a page fault triggers after that call and
before the remove_inode_hugepages() right after it, the page cache can be
mapped writable again in that small window, which can cause data corruption.

Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
fs/hugetlbfs/inode.c | 15 +++++++++------
include/linux/hugetlb.h | 8 +++++---
mm/hugetlb.c | 27 +++++++++++++++++++++------
mm/memory.c | 5 ++++-
4 files changed, 39 insertions(+), 16 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 926eeb9bf4eb..fdbb972b781b 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -404,7 +404,8 @@ static void remove_huge_page(struct page *page)
}

static void
-hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
+hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
+ unsigned long zap_flags)
{
struct vm_area_struct *vma;

@@ -437,7 +438,7 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
}

unmap_hugepage_range(vma, vma->vm_start + v_offset, v_end,
- NULL);
+ NULL, zap_flags);
}
}

@@ -515,7 +516,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
mutex_lock(&hugetlb_fault_mutex_table[hash]);
hugetlb_vmdelete_list(&mapping->i_mmap,
index * pages_per_huge_page(h),
- (index + 1) * pages_per_huge_page(h));
+ (index + 1) * pages_per_huge_page(h),
+ ZAP_FLAG_DROP_FILE_UFFD_WP);
i_mmap_unlock_write(mapping);
}

@@ -581,7 +583,8 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
i_mmap_lock_write(mapping);
i_size_write(inode, offset);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
- hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
+ hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
+ ZAP_FLAG_DROP_FILE_UFFD_WP);
i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, offset, LLONG_MAX);
}
@@ -614,8 +617,8 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
hugetlb_vmdelete_list(&mapping->i_mmap,
- hole_start >> PAGE_SHIFT,
- hole_end >> PAGE_SHIFT);
+ hole_start >> PAGE_SHIFT,
+ hole_end >> PAGE_SHIFT, 0);
i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, hole_start, hole_end);
inode_unlock(inode);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e19ca363803d..809bb63ecf9e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -138,11 +138,12 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
unsigned long *, unsigned long *, long, unsigned int,
int *);
void unmap_hugepage_range(struct vm_area_struct *,
- unsigned long, unsigned long, struct page *);
+ unsigned long, unsigned long, struct page *,
+ unsigned long);
void __unmap_hugepage_range_final(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long start, unsigned long end,
- struct page *ref_page);
+ struct page *ref_page, unsigned long zap_flags);
void hugetlb_report_meminfo(struct seq_file *);
int hugetlb_report_node_meminfo(char *buf, int len, int nid);
void hugetlb_show_meminfo(void);
@@ -381,7 +382,8 @@ static inline unsigned long hugetlb_change_protection(

static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page)
+ unsigned long end, struct page *ref_page,
+ unsigned long zap_flags)
{
BUG();
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 875adf3b000c..5e18a20c1999 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4353,7 +4353,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,

static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
unsigned long start, unsigned long end,
- struct page *ref_page)
+ struct page *ref_page, unsigned long zap_flags)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -4405,6 +4405,19 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
continue;
}

+ if (unlikely(is_swap_special_pte(pte))) {
+ WARN_ON_ONCE(!pte_swp_uffd_wp_special(pte));
+ /*
+ * Only drop the special swap uffd-wp pte if
+ * e.g. unmapping a vma or punching a hole (with proper
+ * lock held so that concurrent page fault won't happen).
+ */
+ if (zap_flags & ZAP_FLAG_DROP_FILE_UFFD_WP)
+ huge_pte_clear(mm, address, ptep, sz);
+ spin_unlock(ptl);
+ continue;
+ }
+
/*
* Migrating hugepage or HWPoisoned hugepage is already
* unmapped and its refcount is dropped, so just clear pte here.
@@ -4456,9 +4469,10 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct

void __unmap_hugepage_range_final(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page)
+ unsigned long end, struct page *ref_page,
+ unsigned long zap_flags)
{
- __unmap_hugepage_range(tlb, vma, start, end, ref_page);
+ __unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);

/*
* Clear this flag so that x86's huge_pmd_share page_table_shareable
@@ -4474,12 +4488,13 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
}

void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page)
+ unsigned long end, struct page *ref_page,
+ unsigned long zap_flags)
{
struct mmu_gather tlb;

tlb_gather_mmu(&tlb, vma->vm_mm);
- __unmap_hugepage_range(&tlb, vma, start, end, ref_page);
+ __unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);
tlb_finish_mmu(&tlb);
}

@@ -4534,7 +4549,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
*/
if (!is_vma_resv_set(iter_vma, HPAGE_RESV_OWNER))
unmap_hugepage_range(iter_vma, address,
- address + huge_page_size(h), page);
+ address + huge_page_size(h), page, 0);
}
i_mmap_unlock_write(mapping);
}
diff --git a/mm/memory.c b/mm/memory.c
index af91bee934c7..c4a80f45e48f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1626,8 +1626,11 @@ static void unmap_single_vma(struct mmu_gather *tlb,
* safe to do nothing in this case.
*/
if (vma->vm_file) {
+ unsigned long zap_flags = details ?
+ details->zap_flags : 0;
i_mmap_lock_write(vma->vm_file->f_mapping);
- __unmap_hugepage_range_final(tlb, vma, start, end, NULL);
+ __unmap_hugepage_range_final(tlb, vma, start, end,
+ NULL, zap_flags);
i_mmap_unlock_write(vma->vm_file->f_mapping);
}
} else
--
2.31.1

2021-07-15 20:18:54

by Peter Xu

[permalink] [raw]
Subject: [PATCH v5 26/26] userfaultfd/selftests: Enable uffd-wp for shmem/hugetlbfs

Now that we have added support for shmem and hugetlbfs, we can always turn the
uffd-wp test on.

Define HUGETLB_EXPECTED_IOCTLS to avoid using UFFD_API_RANGE_IOCTLS_BASIC,
because UFFD_API_RANGE_IOCTLS_BASIC is normally a superset of capabilities
that the test may not satisfy. E.g., when hugetlb is registered without minor
mode, we need to explicitly remove _UFFDIO_CONTINUE. The same applies to
uffd-wp: we need to explicitly remove _UFFDIO_WRITEPROTECT if the range is not
registered with uffd-wp.

In the long term, we may consider dropping the UFFD_API_* macros completely
from the uapi/linux/userfaultfd.h header, because they make it easy for a
kernel header update to break userspace.

Signed-off-by: Peter Xu <[email protected]>
---
tools/testing/selftests/vm/userfaultfd.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index e363bdaff59d..015f2df8ece4 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -80,7 +80,7 @@ static int test_type;
static volatile bool test_uffdio_copy_eexist = true;
static volatile bool test_uffdio_zeropage_eexist = true;
/* Whether to test uffd write-protection */
-static bool test_uffdio_wp = false;
+static bool test_uffdio_wp = true;
/* Whether to test uffd minor faults */
static bool test_uffdio_minor = false;

@@ -320,6 +320,9 @@ struct uffd_test_ops {
(1 << _UFFDIO_ZEROPAGE) | \
(1 << _UFFDIO_WRITEPROTECT))

+#define HUGETLB_EXPECTED_IOCTLS ((1 << _UFFDIO_WAKE) | \
+ (1 << _UFFDIO_COPY))
+
static struct uffd_test_ops anon_uffd_test_ops = {
.expected_ioctls = ANON_EXPECTED_IOCTLS,
.allocate_area = anon_allocate_area,
@@ -335,7 +338,7 @@ static struct uffd_test_ops shmem_uffd_test_ops = {
};

static struct uffd_test_ops hugetlb_uffd_test_ops = {
- .expected_ioctls = UFFD_API_RANGE_IOCTLS_BASIC & ~(1 << _UFFDIO_CONTINUE),
+ .expected_ioctls = HUGETLB_EXPECTED_IOCTLS,
.allocate_area = hugetlb_allocate_area,
.release_pages = hugetlb_release_pages,
.alias_mapping = hugetlb_alias_mapping,
@@ -1580,8 +1583,6 @@ static void set_test_type(const char *type)
if (!strcmp(type, "anon")) {
test_type = TEST_ANON;
uffd_test_ops = &anon_uffd_test_ops;
- /* Only enable write-protect test for anonymous test */
- test_uffdio_wp = true;
} else if (!strcmp(type, "hugetlb")) {
test_type = TEST_HUGETLB;
uffd_test_ops = &hugetlb_uffd_test_ops;
--
2.31.1

2021-07-15 20:19:18

by Peter Xu

[permalink] [raw]
Subject: [PATCH v5 20/26] mm/hugetlb: Introduce huge version of special swap pte helpers

This prepares hugetlbfs to also recognize swap special ptes, just like the
uffd-wp special swap ptes.

Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4edb3ee885ea..517ee30f272c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -95,6 +95,26 @@ static inline bool subpool_is_free(struct hugepage_subpool *spool)
return true;
}

+/*
+ * These are sister versions of is_swap_pte() and pte_has_swap_entry(). We
+ * need standalone ones because huge_pte_none() is handled differently from
+ * pte_none(). For more information, please refer to comments above
+ * is_swap_pte() and pte_has_swap_entry().
+ *
+ * Here we directly reuse the pte level of swap special ptes, for example, the
+ * pte_swp_uffd_wp_special(). It just stands for a huge page rather than a
+ * small page for hugetlbfs pages.
+ */
+static inline bool is_huge_swap_pte(pte_t pte)
+{
+ return !huge_pte_none(pte) && !pte_present(pte);
+}
+
+static inline bool huge_pte_has_swap_entry(pte_t pte)
+{
+ return is_huge_swap_pte(pte) && !is_swap_special_pte(pte);
+}
+
static inline void unlock_or_release_subpool(struct hugepage_subpool *spool,
unsigned long irq_flags)
{
@@ -4138,7 +4158,7 @@ bool is_hugetlb_entry_migration(pte_t pte)
{
swp_entry_t swp;

- if (huge_pte_none(pte) || pte_present(pte))
+ if (!huge_pte_has_swap_entry(pte))
return false;
swp = pte_to_swp_entry(pte);
if (is_migration_entry(swp))
@@ -4151,7 +4171,7 @@ static bool is_hugetlb_entry_hwpoisoned(pte_t pte)
{
swp_entry_t swp;

- if (huge_pte_none(pte) || pte_present(pte))
+ if (!huge_pte_has_swap_entry(pte))
return false;
swp = pte_to_swp_entry(pte);
if (is_hwpoison_entry(swp))
--
2.31.1

2021-07-15 20:19:28

by Peter Xu

[permalink] [raw]
Subject: [PATCH v5 18/26] hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP

Firstly, pass the wp_copy variable into hugetlb_mcopy_atomic_pte() throughout
the stack. Then apply the UFFD_WP bit if UFFDIO_COPY_MODE_WP is set with
UFFDIO_COPY. Introduce huge_pte_mkuffd_wp() for it.

Hugetlb pages are only managed by hugetlbfs, so we're safe even without
setting the dirty bit in the huge pte if the page is installed as read-only.
However we'd better still keep the dirty bit set for a read-only UFFDIO_COPY
pte (when the UFFDIO_COPY_MODE_WP bit is set), not only to match what we do
with shmem, but also because the page does contain dirty data that the kernel
just copied from userspace.

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/hugetlb.h | 6 ++++--
mm/hugetlb.c | 22 +++++++++++++++++-----
mm/userfaultfd.c | 12 ++++++++----
3 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c30f39815e13..fcdbf9f46d85 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -155,7 +155,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
unsigned long dst_addr,
unsigned long src_addr,
enum mcopy_atomic_mode mode,
- struct page **pagep);
+ struct page **pagep,
+ bool wp_copy);
#endif /* CONFIG_USERFAULTFD */
bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
struct vm_area_struct *vma,
@@ -336,7 +337,8 @@ static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
unsigned long dst_addr,
unsigned long src_addr,
enum mcopy_atomic_mode mode,
- struct page **pagep)
+ struct page **pagep,
+ bool wp_copy)
{
BUG();
return 0;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d34636085eaf..880cb2137d04 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5141,7 +5141,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
unsigned long dst_addr,
unsigned long src_addr,
enum mcopy_atomic_mode mode,
- struct page **pagep)
+ struct page **pagep,
+ bool wp_copy)
{
bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
struct hstate *h = hstate_vma(dst_vma);
@@ -5277,17 +5278,28 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
}

- /* For CONTINUE on a non-shared VMA, don't set VM_WRITE for CoW. */
- if (is_continue && !vm_shared)
+ /*
+ * For either: (1) CONTINUE on a non-shared VMA, or (2) UFFDIO_COPY
+ * with wp flag set, don't set pte write bit.
+ */
+ if (wp_copy || (is_continue && !vm_shared))
writable = 0;
else
writable = dst_vma->vm_flags & VM_WRITE;

_dst_pte = make_huge_pte(dst_vma, page, writable);
- if (writable)
- _dst_pte = huge_pte_mkdirty(_dst_pte);
+ /*
+ * Always mark UFFDIO_COPY page dirty; note that this may not be
+ * extremely important for hugetlbfs for now since swapping is not
+ * supported, but we should still be clear in that this page cannot be
+ * thrown away at will, even if write bit not set.
+ */
+ _dst_pte = huge_pte_mkdirty(_dst_pte);
_dst_pte = pte_mkyoung(_dst_pte);

+ if (wp_copy)
+ _dst_pte = huge_pte_mkuffd_wp(_dst_pte);
+
set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);

(void)huge_ptep_set_access_flags(dst_vma, dst_addr, dst_pte, _dst_pte,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0c7212dfb95d..501d6b9f7a5a 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -297,7 +297,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long src_start,
unsigned long len,
- enum mcopy_atomic_mode mode)
+ enum mcopy_atomic_mode mode,
+ bool wp_copy)
{
int vm_shared = dst_vma->vm_flags & VM_SHARED;
ssize_t err;
@@ -393,7 +394,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
}

err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
- dst_addr, src_addr, mode, &page);
+ dst_addr, src_addr, mode, &page,
+ wp_copy);

mutex_unlock(&hugetlb_fault_mutex_table[hash]);
i_mmap_unlock_read(mapping);
@@ -448,7 +450,8 @@ extern ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long src_start,
unsigned long len,
- enum mcopy_atomic_mode mode);
+ enum mcopy_atomic_mode mode,
+ bool wp_copy);
#endif /* CONFIG_HUGETLB_PAGE */

static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
@@ -568,7 +571,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
*/
if (is_vm_hugetlb_page(dst_vma))
return __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
- src_start, len, mcopy_mode);
+ src_start, len, mcopy_mode,
+ wp_copy);

if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
goto out_unlock;
--
2.31.1
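
As a usage sketch (illustrative only, not part of the patch): this is how a
monitor would resolve a MISSING fault on a wr-protected hugetlb range without
losing the protection - UFFDIO_COPY with UFFDIO_COPY_MODE_WP installs the page
write-protected (and, per the above, dirty), so a later write still raises a
WP event.

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

/* Fragment: atomically fill one huge page and keep it wr-protected. */
static int copy_hpage_wp(int uffd, void *dst, void *src,
			 unsigned long hpage_size)
{
	struct uffdio_copy copy = {
		.dst = (unsigned long)dst,	/* huge-page aligned address */
		.src = (unsigned long)src,	/* staging buffer, hpage_size bytes */
		.len = hpage_size,
		.mode = UFFDIO_COPY_MODE_WP,
	};

	return ioctl(uffd, UFFDIO_COPY, &copy);
}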

2021-07-15 20:19:30

by Peter Xu

[permalink] [raw]
Subject: [PATCH v5 19/26] hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT

This starts from passing cp_flags into hugetlb_change_protection() so hugetlb
will be able to handle MM_CP_UFFD_WP[_RESOLVE] requests.

huge_pte_clear_uffd_wp() is introduced to handle the case where the
UFFDIO_WRITEPROTECT is requested upon migrating huge page entries.

Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/hugetlb.h | 6 ++++--
mm/hugetlb.c | 13 ++++++++++++-
mm/mprotect.c | 3 ++-
mm/userfaultfd.c | 8 ++++++++
4 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index fcdbf9f46d85..e19ca363803d 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -205,7 +205,8 @@ struct page *follow_huge_pgd(struct mm_struct *mm, unsigned long address,
int pmd_huge(pmd_t pmd);
int pud_huge(pud_t pud);
unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
- unsigned long address, unsigned long end, pgprot_t newprot);
+ unsigned long address, unsigned long end, pgprot_t newprot,
+ unsigned long cp_flags);

bool is_hugetlb_entry_migration(pte_t pte);
void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
@@ -372,7 +373,8 @@ static inline void move_hugetlb_state(struct page *oldpage,

static inline unsigned long hugetlb_change_protection(
struct vm_area_struct *vma, unsigned long address,
- unsigned long end, pgprot_t newprot)
+ unsigned long end, pgprot_t newprot,
+ unsigned long cp_flags)
{
return 0;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 880cb2137d04..4edb3ee885ea 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5519,7 +5519,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
}

unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
- unsigned long address, unsigned long end, pgprot_t newprot)
+ unsigned long address, unsigned long end,
+ pgprot_t newprot, unsigned long cp_flags)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long start = address;
@@ -5529,6 +5530,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long pages = 0;
bool shared_pmd = false;
struct mmu_notifier_range range;
+ bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+ bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;

/*
* In the case of shared PMDs, the area to flush could be beyond
@@ -5570,6 +5573,10 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
entry = make_readable_migration_entry(
swp_offset(entry));
newpte = swp_entry_to_pte(entry);
+ if (uffd_wp)
+ newpte = pte_swp_mkuffd_wp(newpte);
+ else if (uffd_wp_resolve)
+ newpte = pte_swp_clear_uffd_wp(newpte);
set_huge_swap_pte_at(mm, address, ptep,
newpte, huge_page_size(h));
pages++;
@@ -5584,6 +5591,10 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
old_pte = huge_ptep_modify_prot_start(vma, address, ptep);
pte = pte_mkhuge(huge_pte_modify(old_pte, newprot));
pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
+ if (uffd_wp)
+ pte = huge_pte_mkuffd_wp(huge_pte_wrprotect(pte));
+ else if (uffd_wp_resolve)
+ pte = huge_pte_clear_uffd_wp(pte);
huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
pages++;
}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 3fcb87b59696..96f4df023439 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -426,7 +426,8 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
BUG_ON((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL);

if (is_vm_hugetlb_page(vma))
- pages = hugetlb_change_protection(vma, start, end, newprot);
+ pages = hugetlb_change_protection(vma, start, end, newprot,
+ cp_flags);
else
pages = change_protection_range(vma, start, end, newprot,
cp_flags);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 501d6b9f7a5a..7ba721aca1c5 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -695,6 +695,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
unsigned long len, bool enable_wp, bool *mmap_changing)
{
struct vm_area_struct *dst_vma;
+ unsigned long page_mask;
pgprot_t newprot;
int err;

@@ -731,6 +732,13 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
if (!vma_is_anonymous(dst_vma))
goto out_unlock;

+ if (is_vm_hugetlb_page(dst_vma)) {
+ err = -EINVAL;
+ page_mask = vma_kernel_pagesize(dst_vma) - 1;
+ if ((start & page_mask) || (len & page_mask))
+ goto out_unlock;
+ }
+
if (enable_wp)
newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
else
--
2.31.1
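
A hedged usage sketch of the constraint added above (illustrative only, error
handling and feature negotiation omitted): for a hugetlb vma, both the start
and the length passed to UFFDIO_WRITEPROTECT must be multiples of the huge
page size, otherwise the ioctl now fails with EINVAL.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	unsigned long hpage = 2UL << 20;	/* assumes 2M default hugepages */
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	char *mem = mmap(NULL, 2 * hpage, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	struct uffdio_api api = { .api = UFFD_API };
	ioctl(uffd, UFFDIO_API, &api);

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)mem, .len = 2 * hpage },
		.mode = UFFDIO_REGISTER_MODE_WP,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	struct uffdio_writeprotect wp = {
		/* huge-page aligned; a 4K-sized range would get -EINVAL */
		.range = { .start = (unsigned long)mem, .len = hpage },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
	};
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
	return 0;
}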

2021-07-15 20:19:30

by Peter Xu

[permalink] [raw]
Subject: [PATCH v5 21/26] hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler

Teach the hugetlb page fault code to understand the uffd-wp special pte. For
example, when seeing such a pte we need to convert any write fault into a read
one (which is fake - we'll retry the write later if needed). Meanwhile, for
handle_userfault() we need to make sure we wait on the special swap pte too,
just like on a none pte.

Note that we also need to teach UFFDIO_COPY about this special pte across the
code path so that we can safely install a new page at this special pte as
long as we know it's a stale entry.

Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 5 ++++-
mm/hugetlb.c | 26 ++++++++++++++++++++------
mm/userfaultfd.c | 5 ++++-
3 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index e1c1cbc7bcc8..644df737fbb2 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -245,8 +245,11 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
/*
* Lockless access: we're in a wait_event so it's ok if it
* changes under us.
+ *
+ * Regarding uffd-wp special case, please refer to comments in
+ * userfaultfd_must_wait().
*/
- if (huge_pte_none(pte))
+ if (huge_pte_none(pte) || pte_swp_uffd_wp_special(pte))
ret = true;
if (!huge_pte_write(pte) && (reason & VM_UFFD_WP))
ret = true;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 517ee30f272c..5941b5cd7ecc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4786,7 +4786,8 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
struct vm_area_struct *vma,
struct address_space *mapping, pgoff_t idx,
- unsigned long address, pte_t *ptep, unsigned int flags)
+ unsigned long address, pte_t *ptep,
+ pte_t old_pte, unsigned int flags)
{
struct hstate *h = hstate_vma(vma);
vm_fault_t ret = VM_FAULT_SIGBUS;
@@ -4910,7 +4911,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,

ptl = huge_pte_lock(h, mm, ptep);
ret = 0;
- if (!huge_pte_none(huge_ptep_get(ptep)))
+ if (!pte_same(huge_ptep_get(ptep), old_pte))
goto backout;

if (anon_rmap) {
@@ -4920,6 +4921,12 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
page_dup_rmap(page, true);
new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
&& (vma->vm_flags & VM_SHARED)));
+ /*
+ * If this pte was previously wr-protected, keep it wr-protected even
+ * if populated.
+ */
+ if (unlikely(pte_swp_uffd_wp_special(old_pte)))
+ new_pte = huge_pte_wrprotect(huge_pte_mkuffd_wp(new_pte));
set_huge_pte_at(mm, haddr, ptep, new_pte);

hugetlb_count_add(pages_per_huge_page(h), mm);
@@ -5035,8 +5042,13 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
mutex_lock(&hugetlb_fault_mutex_table[hash]);

entry = huge_ptep_get(ptep);
- if (huge_pte_none(entry)) {
- ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
+ /*
+ * uffd-wp-special should be handled merely the same as pte none
+ * because it's basically a none pte with a special marker
+ */
+ if (huge_pte_none(entry) || pte_swp_uffd_wp_special(entry)) {
+ ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+ entry, flags);
goto out_mutex;
}

@@ -5170,7 +5182,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
unsigned long size;
int vm_shared = dst_vma->vm_flags & VM_SHARED;
- pte_t _dst_pte;
+ pte_t _dst_pte, cur_pte;
spinlock_t *ptl;
int ret = -ENOMEM;
struct page *page;
@@ -5287,8 +5299,10 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
if (idx >= size)
goto out_release_unlock;

+ cur_pte = huge_ptep_get(dst_pte);
ret = -EEXIST;
- if (!huge_pte_none(huge_ptep_get(dst_pte)))
+ /* Please refer to shmem_mfill_atomic_pte() for uffd-wp special case */
+ if (!huge_pte_none(cur_pte) && !pte_swp_uffd_wp_special(cur_pte))
goto out_release_unlock;

if (vm_shared) {
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 7ba721aca1c5..a8038903effd 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -363,6 +363,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
}

while (src_addr < src_start + len) {
+ pte_t pteval;
+
BUG_ON(dst_addr >= dst_start + len);

/*
@@ -385,8 +387,9 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
goto out_unlock;
}

+ pteval = huge_ptep_get(dst_pte);
if (mode != MCOPY_ATOMIC_CONTINUE &&
- !huge_pte_none(huge_ptep_get(dst_pte))) {
+ !huge_pte_none(pteval) && !pte_swp_uffd_wp_special(pteval)) {
err = -EEXIST;
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
i_mmap_unlock_read(mapping);
--
2.31.1

2021-07-15 20:19:32

by Peter Xu

[permalink] [raw]
Subject: [PATCH v5 13/26] shmem/userfaultfd: Handle the left-overed special swap ptes

Note that the special uffd-wp swap pte can be left over even if the page under
the pte got evicted. Normally when evicting a page, we will unmap the ptes by
walking through the reverse mapping. However we never tracked such information
for the special swap ptes because they're not real mappings but just markers.
So we need to take care of the case where we see a marker that is actually
meaningless (the page behind it got evicted).

We have already taken care of that in e.g. alloc_set_pte(), where we'll treat
the special swap pte as pte_none() when necessary. However we also need to
teach userfaultfd itself about it, in both UFFDIO_COPY and page fault
handling, so that everything still works as expected.

Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 15 +++++++++++++++
mm/userfaultfd.c | 13 ++++++++++++-
2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index f6e0f0c0d0e5..e1c1cbc7bcc8 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -329,6 +329,21 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
*/
if (pte_none(*pte))
ret = true;
+ /*
+ * We also treat the swap special uffd-wp pte as the pte_none() here.
+ * This should in most cases be a missing event, as we never handle
+ * wr-protect upon a special uffd-wp swap pte - it should first be
+ * converted into a normal read request before handling wp. It just
+ * means the page/swap cache that backing this pte is gone, so this
+ * special pte is leftover.
+ *
+ * We can't simply replace it with a none pte because we're not with
+ * the pgtable lock here. Instead of taking it and clearing the pte,
+ * the easy way is to let UFFDIO_COPY understand this pte too when
+ * trying to install a new page onto it.
+ */
+ if (pte_swp_uffd_wp_special(*pte))
+ ret = true;
if (!pte_write(*pte) && (reason & VM_UFFD_WP))
ret = true;
pte_unmap(pte);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 2a9c9e6eb876..0c7212dfb95d 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -100,7 +100,18 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
}

ret = -EEXIST;
- if (!pte_none(*dst_pte))
+ /*
+ * Besides the none pte, we also allow UFFDIO_COPY to install a pte
+ * onto the uffd-wp swap special pte, because that pte should be the
+ * same as a pte_none() just in that it contains wr-protect information
+ * (which could only be dropped when unmap the memory).
+ *
+ * It's safe to drop that marker because we know this is part of a
+ * MISSING fault, and the caller is very clear about this page missing
+ * rather than wr-protected. Then we're sure the wr-protect bit is
+ * just a leftover so it's useless already and is the same as none pte.
+ */
+ if (!pte_none(*dst_pte) && !pte_swp_uffd_wp_special(*dst_pte))
goto out_unlock;

if (page_in_cache)
--
2.31.1

2021-07-15 20:19:33

by Peter Xu

[permalink] [raw]
Subject: [PATCH v5 24/26] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

This requires the pagemap code to be able to recognize the newly introduced
swap special pte for uffd-wp, as well as the general hugetlb case that we
recently started to support. This should make pagemap uffd-wp support
complete.

Signed-off-by: Peter Xu <[email protected]>
---
fs/proc/task_mmu.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 9c5af77b5290..988e29fa1f00 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1389,6 +1389,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
flags |= PM_SWAP;
if (is_pfn_swap_entry(entry))
page = pfn_swap_entry_to_page(entry);
+ } else if (pte_swp_uffd_wp_special(pte)) {
+ flags |= PM_UFFD_WP;
}

if (page && !PageAnon(page))
@@ -1522,10 +1524,15 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
if (page_mapcount(page) == 1)
flags |= PM_MMAP_EXCLUSIVE;

+ if (huge_pte_uffd_wp(pte))
+ flags |= PM_UFFD_WP;
+
flags |= PM_PRESENT;
if (pm->show_pfn)
frame = pte_pfn(pte) +
((addr & ~hmask) >> PAGE_SHIFT);
+ } else if (pte_swp_uffd_wp_special(pte)) {
+ flags |= PM_UFFD_WP;
}

for (; addr != end; addr += PAGE_SIZE) {
--
2.31.1
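
For completeness, a sketch of how userspace can observe the bit (assuming the
pagemap uffd-wp bit stays at bit 57, as with the existing anonymous support):
read the 64-bit pagemap entry for a virtual address and test that bit; with
this patch it reports correctly for shmem and hugetlbfs mappings too.

#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Fragment: return 1 if the pte mapping @addr is uffd wr-protected. */
static int addr_is_uffd_wp(void *addr)
{
	int fd = open("/proc/self/pagemap", O_RDONLY);
	uint64_t ent = 0;
	off_t off = ((uintptr_t)addr / sysconf(_SC_PAGESIZE)) * sizeof(ent);
	int wp;

	pread(fd, &ent, sizeof(ent), off);
	wp = !!(ent & (1ULL << 57));	/* PM_UFFD_WP */
	close(fd);
	return wp;
}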

2021-07-15 20:19:39

by Peter Xu

[permalink] [raw]
Subject: [PATCH v5 22/26] hugetlb/userfaultfd: Allow wr-protect none ptes

Teach the hugetlbfs code to wr-protect none ptes in case the page cache exists
for that pte. Meanwhile we also need to be able to recognize a uffd-wp marker
pte and remove it for uffd_wp_resolve.

While at it, introduce a variable "psize" to replace all references to the
huge page size fetcher.

Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 29 +++++++++++++++++++++++++----
1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5941b5cd7ecc..875adf3b000c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5561,7 +5561,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
pte_t *ptep;
pte_t pte;
struct hstate *h = hstate_vma(vma);
- unsigned long pages = 0;
+ unsigned long pages = 0, psize = huge_page_size(h);
bool shared_pmd = false;
struct mmu_notifier_range range;
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
@@ -5581,13 +5581,19 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,

mmu_notifier_invalidate_range_start(&range);
i_mmap_lock_write(vma->vm_file->f_mapping);
- for (; address < end; address += huge_page_size(h)) {
+ for (; address < end; address += psize) {
spinlock_t *ptl;
- ptep = huge_pte_offset(mm, address, huge_page_size(h));
+ ptep = huge_pte_offset(mm, address, psize);
if (!ptep)
continue;
ptl = huge_pte_lock(h, mm, ptep);
if (huge_pmd_unshare(mm, vma, &address, ptep)) {
+ /*
+ * When uffd-wp is enabled on the vma, unshare
+ * shouldn't happen at all. Warn about it if it
+ * happened due to some reason.
+ */
+ WARN_ON_ONCE(uffd_wp || uffd_wp_resolve);
pages++;
spin_unlock(ptl);
shared_pmd = true;
@@ -5612,12 +5618,21 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
else if (uffd_wp_resolve)
newpte = pte_swp_clear_uffd_wp(newpte);
set_huge_swap_pte_at(mm, address, ptep,
- newpte, huge_page_size(h));
+ newpte, psize);
pages++;
}
spin_unlock(ptl);
continue;
}
+ if (unlikely(is_swap_special_pte(pte))) {
+ WARN_ON_ONCE(!pte_swp_uffd_wp_special(pte));
+ /*
+ * This is changing a non-present pte into a none pte,
+ * no need for huge_ptep_modify_prot_start/commit().
+ */
+ if (uffd_wp_resolve)
+ huge_pte_clear(mm, address, ptep, psize);
+ }
if (!huge_pte_none(pte)) {
pte_t old_pte;
unsigned int shift = huge_page_shift(hstate_vma(vma));
@@ -5631,6 +5646,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
pte = huge_pte_clear_uffd_wp(pte);
huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
pages++;
+ } else {
+ /* None pte */
+ if (unlikely(uffd_wp))
+ /* Safe to modify directly (none->non-present). */
+ set_huge_pte_at(mm, address, ptep,
+ pte_swp_mkuffd_wp_special(vma));
}
spin_unlock(ptl);
}
--
2.31.1

2021-07-15 20:19:50

by Peter Xu

[permalink] [raw]
Subject: [PATCH v5 11/26] shmem/userfaultfd: Allow wr-protect none pte for file-backed mem

File-backed memory differs from anonymous memory in that even if the pte is
missing, the data could still reside either in the file or in the page/swap
cache. So when wr-protecting a pte, we need to consider none ptes too.

We do that by installing the uffd-wp special swap pte as a marker. So when
there's a future write to the pte, the fault handler will take the special
path to first fault in the page as read-only, then report to the userfaultfd
server with the wr-protect message.

On the other hand, when unprotecting a page, it's also possible that the pte
got unmapped and replaced by the special uffd-wp marker. Then we'll need to be
able to recover the uffd-wp special swap pte into a none pte, so that the next
access to the page faults in correctly as usual when it triggers the fault
handler, rather than sending a uffd-wp message.

Special care needs to be taken throughout the change_protection_range()
process. Since we now allow the user to wr-protect a none pte, we need to be
able to pre-populate the page table entries when we see !anonymous &&
MM_CP_UFFD_WP requests; otherwise change_protection_range() will always skip
ranges where the pgtable entry does not exist.

Note that this patch only covers small pages (pte level) and does not cover
any transparent huge pages yet. But it will be a base for thps too.

Signed-off-by: Peter Xu <[email protected]>
---
mm/mprotect.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 4b743394afbe..8ec85b276975 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -29,6 +29,7 @@
#include <linux/uaccess.h>
#include <linux/mm_inline.h>
#include <linux/pgtable.h>
+#include <linux/userfaultfd_k.h>
#include <asm/cacheflush.h>
#include <asm/mmu_context.h>
#include <asm/tlbflush.h>
@@ -186,6 +187,32 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
set_pte_at(vma->vm_mm, addr, pte, newpte);
pages++;
}
+ } else if (unlikely(is_swap_special_pte(oldpte))) {
+ if (uffd_wp_resolve && !vma_is_anonymous(vma) &&
+ pte_swp_uffd_wp_special(oldpte)) {
+ /*
+ * This is uffd-wp special pte and we'd like to
+ * unprotect it. What we need to do is simply
+ * recover the pte into a none pte; the next
+ * page fault will fault in the page.
+ */
+ pte_clear(vma->vm_mm, addr, pte);
+ pages++;
+ }
+ } else {
+ /* It must be an none page, or what else?.. */
+ WARN_ON_ONCE(!pte_none(oldpte));
+ if (unlikely(uffd_wp && !vma_is_anonymous(vma))) {
+ /*
+ * For file-backed mem, we need to be able to
+ * wr-protect even for a none pte! Because
+ * even if the pte is null, the page/swap cache
+ * could exist.
+ */
+ set_pte_at(vma->vm_mm, addr, pte,
+ pte_swp_mkuffd_wp_special(vma));
+ pages++;
+ }
}
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
@@ -219,6 +246,25 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
return 0;
}

+/*
+ * File-backed vma allows uffd wr-protect upon none ptes, because even if pte
+ * is missing, page/swap cache could exist. When that happens, the wr-protect
+ * information will be stored in the page table entries with the marker (e.g.,
+ * PTE_SWP_UFFD_WP_SPECIAL). Prepare for that by always populating the page
+ * tables to pte level, so that we'll install the markers in change_pte_range()
+ * where necessary.
+ *
+ * Note that we only need to do this in pmd level, because if pmd does not
+ * exist, it means the whole range covered by the pmd entry (of a pud) does not
+ * contain any valid data but all zeros. Then nothing to wr-protect.
+ */
+#define change_protection_prepare(vma, pmd, addr, cp_flags) \
+ do { \
+ if (unlikely((cp_flags & MM_CP_UFFD_WP) && pmd_none(*pmd) && \
+ !vma_is_anonymous(vma))) \
+ WARN_ON_ONCE(pte_alloc(vma->vm_mm, pmd)); \
+ } while (0)
+
static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
pud_t *pud, unsigned long addr, unsigned long end,
pgprot_t newprot, unsigned long cp_flags)
@@ -237,6 +283,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,

next = pmd_addr_end(addr, end);

+ change_protection_prepare(vma, pmd, addr, cp_flags);
+
/*
* Automatic NUMA balancing walks the tables with mmap_lock
* held for read. It's possible a parallel update to occur
--
2.31.1
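
The case this patch cares about can be shown from userspace roughly as follows
(a sketch assuming the series is applied; error handling omitted): the page
cache gets populated through one mapping, the registered mapping still has
none ptes, and wr-protecting it must install the markers so that a later write
through it generates a WP message instead of silently mapping the cached page
writable.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	size_t len = 4096;
	int memfd = memfd_create("wp-none-pte", 0);
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	char *alias, *mem;

	ftruncate(memfd, len);
	alias = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);
	memset(alias, 1, len);		/* populate the page cache */

	/* Second mapping: the page cache exists but its ptes are still none */
	mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);

	struct uffdio_api api = { .api = UFFD_API };
	ioctl(uffd, UFFDIO_API, &api);

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)mem, .len = len },
		.mode = UFFDIO_REGISTER_MODE_WP,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* Installs the uffd-wp special pte even though the pte is none */
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)mem, .len = len },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
	};
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

	/* A write to mem[0] now traps into the monitor with the WP flag set */
	return 0;
}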

2021-07-15 20:20:26

by Peter Xu

[permalink] [raw]
Subject: [PATCH v5 25/26] mm/userfaultfd: Enable write protection for shmem & hugetlbfs

We've had all the necessary changes ready for both shmem and hugetlbfs. Turn
on all the shmem/hugetlbfs switches for userfaultfd-wp.

We can expand UFFD_API_RANGE_IOCTLS_BASIC with _UFFDIO_WRITEPROTECT too because
all existing types now support write protection mode.

Since vma_can_userfault() will be used elsewhere, move it into userfaultfd_k.h.

Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 21 ++-------------------
include/linux/userfaultfd_k.h | 12 ++++++++++++
include/uapi/linux/userfaultfd.h | 10 ++++++++--
mm/userfaultfd.c | 9 +++------
4 files changed, 25 insertions(+), 27 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 644df737fbb2..980879117110 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1275,24 +1275,6 @@ static __always_inline int validate_range(struct mm_struct *mm,
return 0;
}

-static inline bool vma_can_userfault(struct vm_area_struct *vma,
- unsigned long vm_flags)
-{
- /* FIXME: add WP support to hugetlbfs and shmem */
- if (vm_flags & VM_UFFD_WP) {
- if (is_vm_hugetlb_page(vma) || vma_is_shmem(vma))
- return false;
- }
-
- if (vm_flags & VM_UFFD_MINOR) {
- if (!(is_vm_hugetlb_page(vma) || vma_is_shmem(vma)))
- return false;
- }
-
- return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
- vma_is_shmem(vma);
-}
-
static int userfaultfd_register(struct userfaultfd_ctx *ctx,
unsigned long arg)
{
@@ -1966,7 +1948,8 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
#endif
#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
- uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
+ uffdio_api.features &=
+ ~(UFFD_FEATURE_PAGEFAULT_FLAG_WP | UFFD_FEATURE_WP_HUGETLBFS_SHMEM);
#endif
uffdio_api.ioctls = UFFD_API_IOCTLS;
ret = -EFAULT;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 92606d95b005..4382240de7c3 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -16,6 +16,7 @@
#include <linux/fcntl.h>
#include <linux/mm.h>
#include <asm-generic/pgtable_uffd.h>
+#include <linux/hugetlb_inline.h>

/* The set of all possible UFFD-related VM flags. */
#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR)
@@ -138,6 +139,17 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
return vma->vm_flags & __VM_UFFD_FLAGS;
}

+static inline bool vma_can_userfault(struct vm_area_struct *vma,
+ unsigned long vm_flags)
+{
+ if (vm_flags & VM_UFFD_MINOR)
+ return is_vm_hugetlb_page(vma) || vma_is_shmem(vma);
+
+ return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
+ vma_is_shmem(vma);
+}
+
+
extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
extern void dup_userfaultfd_complete(struct list_head *);

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 05b31d60acf6..a67b5185a7a9 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -32,7 +32,8 @@
UFFD_FEATURE_SIGBUS | \
UFFD_FEATURE_THREAD_ID | \
UFFD_FEATURE_MINOR_HUGETLBFS | \
- UFFD_FEATURE_MINOR_SHMEM)
+ UFFD_FEATURE_MINOR_SHMEM | \
+ UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
#define UFFD_API_IOCTLS \
((__u64)1 << _UFFDIO_REGISTER | \
(__u64)1 << _UFFDIO_UNREGISTER | \
@@ -46,7 +47,8 @@
#define UFFD_API_RANGE_IOCTLS_BASIC \
((__u64)1 << _UFFDIO_WAKE | \
(__u64)1 << _UFFDIO_COPY | \
- (__u64)1 << _UFFDIO_CONTINUE)
+ (__u64)1 << _UFFDIO_CONTINUE | \
+ (__u64)1 << _UFFDIO_WRITEPROTECT)

/*
* Valid ioctl command number range with this API is from 0x00 to
@@ -189,6 +191,9 @@ struct uffdio_api {
*
* UFFD_FEATURE_MINOR_SHMEM indicates the same support as
* UFFD_FEATURE_MINOR_HUGETLBFS, but for shmem-backed pages instead.
+ *
+ * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd
+ * write-protection mode is supported on both shmem and hugetlbfs.
*/
#define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
#define UFFD_FEATURE_EVENT_FORK (1<<1)
@@ -201,6 +206,7 @@ struct uffdio_api {
#define UFFD_FEATURE_THREAD_ID (1<<8)
#define UFFD_FEATURE_MINOR_HUGETLBFS (1<<9)
#define UFFD_FEATURE_MINOR_SHMEM (1<<10)
+#define UFFD_FEATURE_WP_HUGETLBFS_SHMEM (1<<11)
__u64 features;

__u64 ioctls;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index a8038903effd..6587ed787707 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -724,15 +724,12 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,

err = -ENOENT;
dst_vma = find_dst_vma(dst_mm, start, len);
- /*
- * Make sure the vma is not shared, that the dst range is
- * both valid and fully within a single existing vma.
- */
- if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+
+ if (!dst_vma)
goto out_unlock;
if (!userfaultfd_wp(dst_vma))
goto out_unlock;
- if (!vma_is_anonymous(dst_vma))
+ if (!vma_can_userfault(dst_vma, dst_vma->vm_flags))
goto out_unlock;

if (is_vm_hugetlb_page(dst_vma)) {
--
2.31.1
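
A short sketch of how userspace can probe for the new capability (illustrative
only; UFFD_FEATURE_WP_HUGETLBFS_SHMEM is the feature bit added by this patch,
so this only builds against headers that carry it):

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP |
			    UFFD_FEATURE_WP_HUGETLBFS_SHMEM,
	};

	/* UFFDIO_API fails if the kernel doesn't know the requested features */
	if (ioctl(uffd, UFFDIO_API, &api) ||
	    !(api.features & UFFD_FEATURE_WP_HUGETLBFS_SHMEM)) {
		fprintf(stderr, "uffd-wp on shmem/hugetlbfs not supported\n");
		return 1;
	}
	printf("uffd-wp supported on shmem and hugetlbfs\n");
	return 0;
}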

2021-07-16 05:52:46

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v5 05/26] mm/swap: Introduce the idea of special swap ptes

Hi Peter,

[...]

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ae1f5d0cb581..4b46c099ad94 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5738,7 +5738,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
>
> if (pte_present(ptent))
> page = mc_handle_present_pte(vma, addr, ptent);
> - else if (is_swap_pte(ptent))
> + else if (pte_has_swap_entry(ptent))
> page = mc_handle_swap_pte(vma, ptent, &ent);
> else if (pte_none(ptent))
> page = mc_handle_file_pte(vma, addr, ptent, &ent);

As I understand things, pte_none() == false for a special swap pte, but
shouldn't this be treated as pte_none() here? I.e. does this need to be
pte_none(ptent) || is_swap_special_pte() here?

> diff --git a/mm/memory.c b/mm/memory.c
> index 0e0de08a2cd5..998a4f9a3744 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3491,6 +3491,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (!pte_unmap_same(vmf))
> goto out;
>
> + /*
> + * We should never call do_swap_page upon a swap special pte; just be
> + * safe to bail out if it happens.
> + */
> + if (WARN_ON_ONCE(is_swap_special_pte(vmf->orig_pte)))
> + goto out;
> +
> entry = pte_to_swp_entry(vmf->orig_pte);
> if (unlikely(non_swap_entry(entry))) {
> if (is_migration_entry(entry)) {

Are there other changes required here? Because we can end up with stale
special ptes, and a special pte is !pte_none, don't we need to fix some of the
!pte_none checks in these functions:

insert_pfn() -> checks for !pte_none
remap_pte_range() -> BUG_ON(!pte_none)
apply_to_pte_range() -> didn't check further but it tests for !pte_none

In general it feels like I might be missing something here though. There are
plenty of checks in the kernel for pte_none() which haven't been updated. Is
there some rule that says none of those paths can see a special pte?

> diff --git a/mm/migrate.c b/mm/migrate.c
> index 23cbd9de030b..b477d0d5f911 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -294,7 +294,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
>
> spin_lock(ptl);
> pte = *ptep;
> - if (!is_swap_pte(pte))
> + if (!pte_has_swap_entry(pte))
> goto out;
>
> entry = pte_to_swp_entry(pte);
> @@ -2276,7 +2276,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>
> pte = *ptep;
>
> - if (pte_none(pte)) {
> + if (pte_none(pte) || is_swap_special_pte(pte)) {

I was wondering if we can lose the special pte information here? However I see
that in migrate_vma_insert_page() we check again and fail the migration if
!pte_none(), so I think this is ok.

I think it would be better if this check was moved below so the migration fails
early. Ie:

if (pte_none(pte)) {
if (vma_is_anonymous(vma) && !is_swap_special_pte(pte)) {

Also how does this work for page migration in general? I can see in
page_vma_mapped_walk() that we skip special ptes, but doesn't this mean we
lose the special pte in that instance? Or is that ok for some reason?

> if (vma_is_anonymous(vma)) {
> mpfn = MIGRATE_PFN_MIGRATE;
> migrate->cpages++;
> diff --git a/mm/mincore.c b/mm/mincore.c
> index 9122676b54d6..5728c3e6473f 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -121,7 +121,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> for (; addr != end; ptep++, addr += PAGE_SIZE) {
> pte_t pte = *ptep;
>
> - if (pte_none(pte))
> + if (pte_none(pte) || is_swap_special_pte(pte))
> __mincore_unmapped_range(addr, addr + PAGE_SIZE,
> vma, vec);
> else if (pte_present(pte))
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 883e2cc85cad..4b743394afbe 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -139,7 +139,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> }
> ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
> pages++;
> - } else if (is_swap_pte(oldpte)) {
> + } else if (pte_has_swap_entry(oldpte)) {
> swp_entry_t entry = pte_to_swp_entry(oldpte);
> pte_t newpte;
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 5989d3990020..122b279333ee 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -125,7 +125,7 @@ static pte_t move_soft_dirty_pte(pte_t pte)
> #ifdef CONFIG_MEM_SOFT_DIRTY
> if (pte_present(pte))
> pte = pte_mksoft_dirty(pte);
> - else if (is_swap_pte(pte))
> + else if (pte_has_swap_entry(pte))
> pte = pte_swp_mksoft_dirty(pte);
> #endif
> return pte;
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index f7b331081791..ff57b67426af 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -36,7 +36,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
> * For more details on device private memory see HMM
> * (include/linux/hmm.h or mm/hmm.c).
> */
> - if (is_swap_pte(*pvmw->pte)) {
> + if (pte_has_swap_entry(*pvmw->pte)) {
> swp_entry_t entry;
>
> /* Handle un-addressable ZONE_DEVICE memory */
> @@ -90,7 +90,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
>
> if (pvmw->flags & PVMW_MIGRATION) {
> swp_entry_t entry;
> - if (!is_swap_pte(*pvmw->pte))
> + if (!pte_has_swap_entry(*pvmw->pte))
> return false;
> entry = pte_to_swp_entry(*pvmw->pte);
>
> @@ -99,7 +99,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
> return false;
>
> pfn = swp_offset(entry);
> - } else if (is_swap_pte(*pvmw->pte)) {
> + } else if (pte_has_swap_entry(*pvmw->pte)) {
> swp_entry_t entry;
>
> /* Handle un-addressable ZONE_DEVICE memory */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 1e07d1c776f2..4993b4454c13 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1951,7 +1951,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> si = swap_info[type];
> pte = pte_offset_map(pmd, addr);
> do {
> - if (!is_swap_pte(*pte))
> + if (!pte_has_swap_entry(*pte))
> continue;
>
> entry = pte_to_swp_entry(*pte);
>




2021-07-16 19:14:00

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v5 05/26] mm/swap: Introduce the idea of special swap ptes

On Fri, Jul 16, 2021 at 03:50:52PM +1000, Alistair Popple wrote:
> Hi Peter,
>
> [...]
>
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ae1f5d0cb581..4b46c099ad94 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -5738,7 +5738,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
> >
> > if (pte_present(ptent))
> > page = mc_handle_present_pte(vma, addr, ptent);
> > - else if (is_swap_pte(ptent))
> > + else if (pte_has_swap_entry(ptent))
> > page = mc_handle_swap_pte(vma, ptent, &ent);
> > else if (pte_none(ptent))
> > page = mc_handle_file_pte(vma, addr, ptent, &ent);
>
> As I understand things pte_none() == False for a special swap pte, but
> shouldn't this be treated as pte_none() here? Ie. does this need to be
> pte_none(ptent) || is_swap_special_pte() here?

Looks correct; here the page/swap cache could hide behind the special pte just
like a none pte. Will fix it. Thanks!
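
For clarity, the fixed hunk could end up looking something like below (just a
sketch; it assumes is_swap_special_pte() from this patch and leaves the rest of
get_mctgt_type() untouched):

	if (pte_present(ptent))
		page = mc_handle_present_pte(vma, addr, ptent);
	else if (pte_has_swap_entry(ptent))
		page = mc_handle_swap_pte(vma, ptent, &ent);
	else if (pte_none(ptent) || is_swap_special_pte(ptent))
		/* the page/swap cache can hide behind it, same as a none pte */
		page = mc_handle_file_pte(vma, addr, ptent, &ent);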

>
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 0e0de08a2cd5..998a4f9a3744 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3491,6 +3491,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > if (!pte_unmap_same(vmf))
> > goto out;
> >
> > + /*
> > + * We should never call do_swap_page upon a swap special pte; just be
> > + * safe to bail out if it happens.
> > + */
> > + if (WARN_ON_ONCE(is_swap_special_pte(vmf->orig_pte)))
> > + goto out;
> > +
> > entry = pte_to_swp_entry(vmf->orig_pte);
> > if (unlikely(non_swap_entry(entry))) {
> > if (is_migration_entry(entry)) {
>
> Are there other changes required here? Because we can end up with stale special
> ptes, and a special pte is !pte_none, don't we need to fix some of the !pte_none
> checks in these functions:
>
> insert_pfn() -> checks for !pte_none
> remap_pte_range() -> BUG_ON(!pte_none)
> apply_to_pte_range() -> didn't check further but it tests for !pte_none
>
> In general it feels like I might be missing something here though. There are
> plenty of checks in the kernel for pte_none() which haven't been updated. Is
> there some rule that says none of those paths can see a special pte?

My rule on doing this was to only care about vmas that can be backed by RAM,
mainly shmem/hugetlb, so the special pte can only exist within those vmas. I
believe in most pte_none() users this special pte won't exist.

So if it's not related to RAM backed memory at all, maybe it's fine to keep the
pte_none() usage like before.
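
As a rough summary of how I categorize a pte in this series (sketch only, in
comment form):

	/*
	 *  pte_none(pte)             - empty entry
	 *  pte_present(pte)          - maps a page
	 *  is_swap_pte(pte)          - neither none nor present, then either:
	 *    pte_has_swap_entry(pte)   - carries a real swap entry, so
	 *                                pte_to_swp_entry() is legal on it
	 *    is_swap_special_pte(pte)  - carries no swap entry at all (e.g. the
	 *                                uffd-wp special pte); pte_to_swp_entry()
	 *                                must never be called on it
	 */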

Take the example of insert_pfn() referenced first - I think it can be used to
map some MMIO regions, but I don't think we'll call that upon a RAM region
(either shmem or hugetlb), nor can it be uffd wr-protected. So I'm not sure
adding special pte check there would be helpful.

apply_to_pte_range() seems to be a bit special - I think it's the pte_fn_t
callback that decides whether the special pte matters. I had a quick look; it
seems to still be used mostly by all kinds of driver code, not mm core. It's
used in two forms:

apply_to_page_range
apply_to_existing_page_range

The first one creates ptes only, so it ignores the pte_none() check and I skipped it.

The second one has two call sites:

*** arch/powerpc/mm/pageattr.c:
change_memory_attr[99] return apply_to_existing_page_range(&init_mm, start, size,
set_memory_attr[132] return apply_to_existing_page_range(&init_mm, start, sz, set_page_attr,

*** mm/kasan/shadow.c:
kasan_release_vmalloc[485] apply_to_existing_page_range(&init_mm,

I'll leave the ppc callers for now as uffd-wp is not even supported there.
kasan_release_vmalloc() should be for kernel-allocated memory only, so it
should not be a target for the special pte either.

So indeed it's hard to cover 100% of the pte_none() users and make sure things
are used right. As stated above I still believe most callers don't need it;
worst case, if someone triggers uffd-wp issues with a specific feature, we can
look into it then. I am not sure it's a good idea to add this check for all the
pte_none() users, because mostly they would be useless checks, imho.

So far the plan is to cover the places we know may be affected, like in this
patch, where the change actually makes a difference; hopefully we won't miss
any important spots.

>
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 23cbd9de030b..b477d0d5f911 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -294,7 +294,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> >
> > spin_lock(ptl);
> > pte = *ptep;
> > - if (!is_swap_pte(pte))
> > + if (!pte_has_swap_entry(pte))
> > goto out;
> >
> > entry = pte_to_swp_entry(pte);
> > @@ -2276,7 +2276,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >
> > pte = *ptep;
> >
> > - if (pte_none(pte)) {
> > + if (pte_none(pte) || is_swap_special_pte(pte)) {
>
> I was wondering if we can lose the special pte information here? However I see
> that in migrate_vma_insert_page() we check again and fail the migration if
> !pte_none() so I think this is ok.
>
> I think it would be better if this check was moved below so the migration fails
> early. Ie:
>
> if (pte_none(pte)) {
> if (vma_is_anonymous(vma) && !is_swap_special_pte(pte)) {

Hmm.. but shouldn't vma_is_anonymous()==true already mean it must not be a
swap special pte? Because the swap special pte only exists when !vma_is_anonymous().

>
> Also how does this work for page migration in general? I can see in
> page_vma_mapped_walk() that we skip special ptes, but doesn't this mean we
> lose the special pte in that instance? Or is that ok for some reason?

Do you mean try_to_migrate_one()? Does it need to be aware of that? Per my
understanding that's only for anonymous private memory, and in that world
there should be no swap special pte (page_lock_anon_vma_read will return NULL
early for !vma_is_anonymous).

Thanks,

--
Peter Xu

2021-07-19 09:55:05

by Tiberiu A Georgescu

[permalink] [raw]
Subject: Re: [PATCH v5 24/26] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs


Hello Peter,

> On 15 Jul 2021, at 21:16, Peter Xu <[email protected]> wrote:
>
> This requires the pagemap code to be able to recognize the newly introduced
> swap special pte for uffd-wp, meanwhile the general case for hugetlb that we
> recently start to support. It should make pagemap uffd-wp support complete.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> fs/proc/task_mmu.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 9c5af77b5290..988e29fa1f00 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1389,6 +1389,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
> flags |= PM_SWAP;
> if (is_pfn_swap_entry(entry))
> page = pfn_swap_entry_to_page(entry);
> + } else if (pte_swp_uffd_wp_special(pte)) {
> + flags |= PM_UFFD_WP;
> }

^ Would it not be important to also add PM_SWAP to flags?

Kind regards,
Tibi

2021-07-19 17:39:41

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v5 24/26] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

On Mon, Jul 19, 2021 at 09:53:36AM +0000, Tiberiu Georgescu wrote:
>
> Hello Peter,

Hi, Tiberiu,

>
> > On 15 Jul 2021, at 21:16, Peter Xu <[email protected]> wrote:
> >
> > This requires the pagemap code to be able to recognize the newly introduced
> > swap special pte for uffd-wp, meanwhile the general case for hugetlb that we
> > recently start to support. It should make pagemap uffd-wp support complete.
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > fs/proc/task_mmu.c | 7 +++++++
> > 1 file changed, 7 insertions(+)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 9c5af77b5290..988e29fa1f00 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -1389,6 +1389,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
> > flags |= PM_SWAP;
> > if (is_pfn_swap_entry(entry))
> > page = pfn_swap_entry_to_page(entry);
> > + } else if (pte_swp_uffd_wp_special(pte)) {
> > + flags |= PM_UFFD_WP;
> > }
>
> ^ Would it not be important to also add PM_SWAP to flags?

Hmm, I'm not sure; it's the same as a none pte in this case, so imho we still
can't tell whether the page is swapped out or the pte simply got zapped while
the page cache still hits (even if swapped out may be the most likely case).

What we're clear is we know it's uffd wr-protected, so maybe setting PM_UFFD_WP
is still the simplest?
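
For completeness, a pagemap consumer would then observe it roughly like below
(userspace sketch; bit positions as documented in
Documentation/admin-guide/mm/pagemap.rst, i.e. uffd-wp at bit 57, swap at bit
62, present at bit 63). With this patch, a wr-protected page whose pte got
zapped reports the uffd-wp bit set while both swap and present read as zero:

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	/* Sketch: dump the uffd-wp/swap/present pagemap bits of one address */
	static int dump_pagemap_bits(void *addr)
	{
		uint64_t ent;
		off_t off = ((uintptr_t)addr / getpagesize()) * sizeof(ent);
		int fd = open("/proc/self/pagemap", O_RDONLY);

		if (fd < 0)
			return -1;
		if (pread(fd, &ent, sizeof(ent), off) != sizeof(ent)) {
			close(fd);
			return -1;
		}
		close(fd);

		printf("uffd-wp=%d swap=%d present=%d\n",
		       !!(ent & (1ULL << 57)),	/* PM_UFFD_WP */
		       !!(ent & (1ULL << 62)),	/* PM_SWAP */
		       !!(ent & (1ULL << 63)));	/* PM_PRESENT */
		return 0;
	}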

Thanks,

--
Peter Xu

2021-07-19 18:29:47

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v5 24/26] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

On Mon, Jul 19, 2021 at 05:23:14PM +0000, Tiberiu Georgescu wrote:
> > What we're clear is we know it's uffd wr-protected, so maybe setting PM_UFFD_WP
> > is still the simplest?
>
> That's right, but if we were to require any of the differentiations above, how
> does keeping another bit on the special pte sound to you? One that signals
> whether the page is on swap or not (none or zapped).

I don't know how to do it even with an extra bit in the pte. The thing is we
need some mechanism to trigger the tweak of that bit in the pte when switching
from "present" to "swapped out", while I don't see how that could be done.

Consider page reclaim: we unmap and zap the ptes first before swapping the
pages out, so by the time we do the pageout() we've already released the rmap
and have no way to figure out which pte to tweak, afaiu. It also looks
complicated just for maintaining this information.

>
> Is there any other clearer way to do it? We wouldn't want to overload the
> special pte unnecessarily.

I feel like the solution you proposed in the other patch for soft dirty might
work. It's just that it seems heavier, especially because we'll try to look up
the page cache for every single pte_none() (and after this patch including the
swap special pte) even if the page is never accessed.
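
Roughly what that lookup would mean per none/special pte - a sketch only,
assuming a shmem mapping, where a swapped-out page stays in the page cache
xarray as a swap value entry:

	struct address_space *mapping = file_inode(vma->vm_file)->i_mapping;
	pgoff_t pgoff = linear_page_index(vma, addr);
	void *entry = xa_load(&mapping->i_pages, pgoff); /* extra walk per pte */

	if (xa_is_value(entry))
		flags |= PM_SWAP;	/* shmem page is swapped out */
	else if (entry)
		flags |= PM_FILE;	/* page still sits in the page cache */

So every none-looking pte costs an xarray walk even when the range was never
touched.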

I expect it will regress the case of a normal soft-dirty user when the memory
is sparsely used, because there'll be plenty of page cache lookup operations
that are destined to be useless.

I'm also curious what the real use of accurate PM_SWAP accounting would be. To
me the current implementation may not provide an accurate value but should be
good enough for most cases. However, I'm not sure whether that also holds for
your use case.

Thanks,

--
Peter Xu

2021-07-19 18:30:50

by Tiberiu A Georgescu

[permalink] [raw]
Subject: Re: [PATCH v5 24/26] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs


> On 19 Jul 2021, at 17:03, Peter Xu <[email protected]> wrote:
>
> On Mon, Jul 19, 2021 at 09:53:36AM +0000, Tiberiu Georgescu wrote:
>>
>> Hello Peter,
>
> Hi, Tiberiu,
>
>>
>>> On 15 Jul 2021, at 21:16, Peter Xu <[email protected]> wrote:
>>>
>>> This requires the pagemap code to be able to recognize the newly introduced
>>> swap special pte for uffd-wp, meanwhile the general case for hugetlb that we
>>> recently start to support. It should make pagemap uffd-wp support complete.
>>>
>>> Signed-off-by: Peter Xu <[email protected]>
>>> ---
>>> fs/proc/task_mmu.c | 7 +++++++
>>> 1 file changed, 7 insertions(+)
>>>
>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>> index 9c5af77b5290..988e29fa1f00 100644
>>> --- a/fs/proc/task_mmu.c
>>> +++ b/fs/proc/task_mmu.c
>>> @@ -1389,6 +1389,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>>> flags |= PM_SWAP;
>>> if (is_pfn_swap_entry(entry))
>>> page = pfn_swap_entry_to_page(entry);
>>> + } else if (pte_swp_uffd_wp_special(pte)) {
>>> + flags |= PM_UFFD_WP;
>>> }
>>
>> ^ Would it not be important to also add PM_SWAP to flags?
>
> Hmm, I'm not sure; it's the same as a none pte in this case, so imho we still
> can't tell whether the page is swapped out or the pte simply got zapped while
> the page cache still hits (even if swapped out may be the most likely case).

Yeah, that's true. Come to think of it, we also can't tell a none pte from a
swapped-out shmem page (all bits are cleared out).

>
> What we're clear is we know it's uffd wr-protected, so maybe setting PM_UFFD_WP
> is still the simplest?

That's right, but if we were to require any of the differentiations above, how
does keeping another bit on the special pte sound to you? One that signals
whether the page is on swap or not (none or zapped).

Is there any other clearer way to do it? We wouldn't want to overload the
special pte unnecessarily.

Thanks,

--
Tibi

2021-07-19 20:28:20

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v5 00/26] userfaultfd-wp: Support shmem and hugetlbfs

On 15.07.21 22:13, Peter Xu wrote:
> This is v5 of uffd-wp shmem & hugetlbfs support, which completes uffd-wp as a
> full feature. It's based on v5.14-rc1.
>
> I reposted the whole series majorly to trigger the syzbot tests again; sorry if
> it brings a bit of noise. Please let me know if there's easier way to trigger
> the syzbot test instead of reposting the whole series.
>
> Meanwhile, recently discussion around soft-dirty shows that soft-dirty may have
> similar requirement as uffd-wp on persisting the dirty information:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> Then the mechanism provided in this patchset may be suitable for soft-dirty too.
>
> The whole series can also be found online [1].
>
> v5 changelog:
> - Fix two issues spotted by syzbot
> - Compile test with (1) !USERFAULTFD, (2) USERFAULTFD && !USERFAULTFD_WP
>
> Previous versions:
>
> RFC: https://lore.kernel.org/lkml/[email protected]/
> v1: https://lore.kernel.org/lkml/[email protected]/
> v2: https://lore.kernel.org/lkml/[email protected]/
> v3: https://lore.kernel.org/lkml/[email protected]/
> v4: https://lore.kernel.org/lkml/[email protected]/
>
> About Swap Special PTE
> ======================
>
> In short, the so-called "swap special pte" in this patchset is a new type of
> pte that doesn't exist in the past, but it got used initially in this series in
> file-backed memories. It is used to persist information even if the ptes got
> dropped meanwhile when the page cache still existed. For example, when
> splitting a file-backed huge pmd, we could be simply dropping the pmd entry
> then wait until another fault coming. It's okay in the past since all
> information in the pte can be retained from the page cache when the next page
> fault triggers. However in this case, uffd-wp is per-pte information which
> cannot be kept in page cache, so that information needs to be maintained
> somehow still in the pgtable entry, even if the pgtable entry is going to be
> dropped. Here instead of replacing with a none entry, we used the "swap
> special pte". Then when the next page fault triggers, we can observe orig_pte
> to retain this information.
>
> I'm copy-pasting some commit message from the patch "mm/swap: Introduce the
> idea of special swap ptes", where it tried to explain this pte in another angle:
>
> We used to have special swap entries, like migration entries, hw-poison
> entries, device private entries, etc.
>
> Those "special swap entries" reside in the range that they need to be at least
> swap entries first, and their types are decided by swp_type(entry).
>
> This patch introduces another idea called "special swap ptes".
>
> It's very easy to get confused against "special swap entries", but a speical
> swap pte should never contain a swap entry at all. It means, it's illegal to
> call pte_to_swp_entry() upon a special swap pte.
>
> Make the uffd-wp special pte to be the first special swap pte.
>
> Before this patch, is_swap_pte()==true means one of the below:
>
> (a.1) The pte has a normal swap entry (non_swap_entry()==false). For
> example, when an anonymous page got swapped out.
>
> (a.2) The pte has a special swap entry (non_swap_entry()==true). For
> example, a migration entry, a hw-poison entry, etc.
>
> After this patch, is_swap_pte()==true means one of the below, where case (b) is
> added:
>
> (a) The pte contains a swap entry.
>
> (a.1) The pte has a normal swap entry (non_swap_entry()==false). For
> example, when an anonymous page got swapped out.
>
> (a.2) The pte has a special swap entry (non_swap_entry()==true). For
> example, a migration entry, a hw-poison entry, etc.
>
> (b) The pte does not contain a swap entry at all (so it cannot be passed
> into pte_to_swp_entry()). For example, uffd-wp special swap pte.
>
> Hugetlbfs needs similar thing because it's also file-backed. I directly reused
> the same special pte there, though the shmem/hugetlb change on supporting this
> new pte is different since they don't share code path a lot.
>
> Patch layout
> ============
>
> Part (1): Shmem support, this is where the special swap pte is introduced.
> Some zap rework is needed within the process:
>
> mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte
> shmem/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
> mm: Clear vmf->pte after pte_unmap_same() returns
> mm/userfaultfd: Introduce special pte for unmapped file-backed mem
> mm/swap: Introduce the idea of special swap ptes
> shmem/userfaultfd: Handle uffd-wp special pte in page fault handler
> mm: Drop first_index/last_index in zap_details
> mm: Introduce zap_details.zap_flags
> mm: Introduce ZAP_FLAG_SKIP_SWAP
> shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed
> shmem/userfaultfd: Allow wr-protect none pte for file-backed mem
> shmem/userfaultfd: Allows file-back mem to be uffd wr-protected on thps
> shmem/userfaultfd: Handle the left-overed special swap ptes
> shmem/userfaultfd: Pass over uffd-wp special swap pte when fork()
>
> Part (2): Hugetlb support; disable huge pmd sharing for uffd-wp patches have been
> merged. The rest is the changes required to teach hugetlbfs understand the
> special swap pte too that introduced with the uffd-wp change:
>
> mm/hugetlb: Drop __unmap_hugepage_range definition from hugetlb.h
> mm/hugetlb: Introduce huge pte version of uffd-wp helpers
> hugetlb/userfaultfd: Hook page faults for uffd write protection
> hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
> hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT
> mm/hugetlb: Introduce huge version of special swap pte helpers
> hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler
> hugetlb/userfaultfd: Allow wr-protect none ptes
> hugetlb/userfaultfd: Only drop uffd-wp special pte if required
>
> Part (3): Enable both features in code and test (plus pagemap support)
>
> mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs
> userfaultfd: Enable write protection for shmem & hugetlbfs
> userfaultfd/selftests: Enable uffd-wp for shmem/hugetlbfs
>
> Tests
> =====
>
> I've tested it using either userfaultfd kselftest program, but also with
> umapsort [2] which should be even stricter. Tested page swapping in/out during
> umapsort.
>
> If anyone would like to try umapsort, need to use an extremely hacked version
> of umap library [3], because by default umap only supports anonymous. So to
> test it we need to build [3] then [2].
>
> Any comment would be greatly welcomed. Thanks,

Hi Peter,

I just stumbled over copy_page_range() optimization

/*
* Don't copy ptes where a page fault will fill them correctly.
* Fork becomes much lighter when there are big shared or private
* readonly mappings. The tradeoff is that copy_page_range is more
* efficient than faulting.
*/
if (!(src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
!src_vma->anon_vma)
return 0;

IIUC, that means you'll not copy the WP bits for shmem and,
therefore, lose them during fork.

--
Thanks,

David / dhildenb

2021-07-19 20:48:08

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v5 00/26] userfaultfd-wp: Support shmem and hugetlbfs

On Mon, Jul 19, 2021 at 09:21:18PM +0200, David Hildenbrand wrote:
> Hi Peter,

Hi, David,

>
> I just stumbled over copy_page_range() optimization
>
> /*
> * Don't copy ptes where a page fault will fill them correctly.
> * Fork becomes much lighter when there are big shared or private
> * readonly mappings. The tradeoff is that copy_page_range is more
> * efficient than faulting.
> */
> if (!(src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
> !src_vma->anon_vma)
> return 0;
>
> IIUC, that means you'll not copy the WP bits for shmem and,
> therefore, lose them during fork.

Good point.

I think the fix shouldn't be hard - we can also skip this optimization if
dst_vma->vm_flags has VM_UFFD_WP set (that means UFFD_FEATURE_EVENT_FORK is
enabled too). But I'll look a bit more into the page copy path to make sure it
works (maybe I'll add a small test case too).
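
Something along the lines of below, I think (sketch only, untested; checking
dst_vma so it only triggers when the uffd-wp context actually survived the
fork):

	/*
	 * Don't take the "skip copying ptes" shortcut when uffd-wp needs to
	 * carry its per-pte wr-protect bits across fork().
	 */
	if (!(src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
	    !src_vma->anon_vma && !userfaultfd_wp(dst_vma))
		return 0;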

Thanks!

--
Peter Xu

2021-07-20 15:45:41

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v5 17/26] hugetlb/userfaultfd: Hook page faults for uffd write protection

Hi Peter,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kselftest/next]
[also build test ERROR on linus/master v5.14-rc2 next-20210720]
[cannot apply to hnaz-linux-mm/master asm-generic/master arm64/for-next/core linux/master tip/x86/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20210716-041947
base: https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
config: s390-randconfig-r023-20210716 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 5d5b08761f944d5b9822d582378333cc4b36a0a7)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install s390 cross compiling tool for clang build
# apt-get install binutils-s390x-linux-gnu
# https://github.com/0day-ci/linux/commit/071935856c8e636cafde633db59259d75069cc8f
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20210716-041947
git checkout 071935856c8e636cafde633db59259d75069cc8f
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=s390

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

In file included from mm/hugetlb.c:19:
In file included from include/linux/memblock.h:14:
In file included from arch/s390/include/asm/dma.h:5:
In file included from arch/s390/include/asm/io.h:75:
include/asm-generic/io.h:464:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
val = __raw_readb(PCI_IOBASE + addr);
~~~~~~~~~~ ^
include/asm-generic/io.h:477:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
~~~~~~~~~~ ^
include/uapi/linux/byteorder/big_endian.h:36:59: note: expanded from macro '__le16_to_cpu'
#define __le16_to_cpu(x) __swab16((__force __u16)(__le16)(x))
^
include/uapi/linux/swab.h:102:54: note: expanded from macro '__swab16'
#define __swab16(x) (__u16)__builtin_bswap16((__u16)(x))
^
In file included from mm/hugetlb.c:19:
In file included from include/linux/memblock.h:14:
In file included from arch/s390/include/asm/dma.h:5:
In file included from arch/s390/include/asm/io.h:75:
include/asm-generic/io.h:490:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
~~~~~~~~~~ ^
include/uapi/linux/byteorder/big_endian.h:34:59: note: expanded from macro '__le32_to_cpu'
#define __le32_to_cpu(x) __swab32((__force __u32)(__le32)(x))
^
include/uapi/linux/swab.h:115:54: note: expanded from macro '__swab32'
#define __swab32(x) (__u32)__builtin_bswap32((__u32)(x))
^
In file included from mm/hugetlb.c:19:
In file included from include/linux/memblock.h:14:
In file included from arch/s390/include/asm/dma.h:5:
In file included from arch/s390/include/asm/io.h:75:
include/asm-generic/io.h:501:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
__raw_writeb(value, PCI_IOBASE + addr);
~~~~~~~~~~ ^
include/asm-generic/io.h:511:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
__raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
~~~~~~~~~~ ^
include/asm-generic/io.h:521:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
__raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
~~~~~~~~~~ ^
include/asm-generic/io.h:609:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
readsb(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
include/asm-generic/io.h:617:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
readsw(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
include/asm-generic/io.h:625:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
readsl(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
include/asm-generic/io.h:634:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
writesb(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
include/asm-generic/io.h:643:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
writesw(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
include/asm-generic/io.h:652:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
writesl(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
>> mm/hugetlb.c:5063:29: error: implicit declaration of function 'huge_pte_uffd_wp' [-Werror,-Wimplicit-function-declaration]
if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
^
12 warnings and 1 error generated.


vim +/huge_pte_uffd_wp +5063 mm/hugetlb.c

4957
4958 vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
4959 unsigned long address, unsigned int flags)
4960 {
4961 pte_t *ptep, entry;
4962 spinlock_t *ptl;
4963 vm_fault_t ret;
4964 u32 hash;
4965 pgoff_t idx;
4966 struct page *page = NULL;
4967 struct page *pagecache_page = NULL;
4968 struct hstate *h = hstate_vma(vma);
4969 struct address_space *mapping;
4970 int need_wait_lock = 0;
4971 unsigned long haddr = address & huge_page_mask(h);
4972
4973 ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
4974 if (ptep) {
4975 /*
4976 * Since we hold no locks, ptep could be stale. That is
4977 * OK as we are only making decisions based on content and
4978 * not actually modifying content here.
4979 */
4980 entry = huge_ptep_get(ptep);
4981 if (unlikely(is_hugetlb_entry_migration(entry))) {
4982 migration_entry_wait_huge(vma, mm, ptep);
4983 return 0;
4984 } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
4985 return VM_FAULT_HWPOISON_LARGE |
4986 VM_FAULT_SET_HINDEX(hstate_index(h));
4987 }
4988
4989 /*
4990 * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
4991 * until finished with ptep. This serves two purposes:
4992 * 1) It prevents huge_pmd_unshare from being called elsewhere
4993 * and making the ptep no longer valid.
4994 * 2) It synchronizes us with i_size modifications during truncation.
4995 *
4996 * ptep could have already be assigned via huge_pte_offset. That
4997 * is OK, as huge_pte_alloc will return the same value unless
4998 * something has changed.
4999 */
5000 mapping = vma->vm_file->f_mapping;
5001 i_mmap_lock_read(mapping);
5002 ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
5003 if (!ptep) {
5004 i_mmap_unlock_read(mapping);
5005 return VM_FAULT_OOM;
5006 }
5007
5008 /*
5009 * Serialize hugepage allocation and instantiation, so that we don't
5010 * get spurious allocation failures if two CPUs race to instantiate
5011 * the same page in the page cache.
5012 */
5013 idx = vma_hugecache_offset(h, vma, haddr);
5014 hash = hugetlb_fault_mutex_hash(mapping, idx);
5015 mutex_lock(&hugetlb_fault_mutex_table[hash]);
5016
5017 entry = huge_ptep_get(ptep);
5018 if (huge_pte_none(entry)) {
5019 ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
5020 goto out_mutex;
5021 }
5022
5023 ret = 0;
5024
5025 /*
5026 * entry could be a migration/hwpoison entry at this point, so this
5027 * check prevents the kernel from going below assuming that we have
5028 * an active hugepage in pagecache. This goto expects the 2nd page
5029 * fault, and is_hugetlb_entry_(migration|hwpoisoned) check will
5030 * properly handle it.
5031 */
5032 if (!pte_present(entry))
5033 goto out_mutex;
5034
5035 /*
5036 * If we are going to COW the mapping later, we examine the pending
5037 * reservations for this page now. This will ensure that any
5038 * allocations necessary to record that reservation occur outside the
5039 * spinlock. For private mappings, we also lookup the pagecache
5040 * page now as it is used to determine if a reservation has been
5041 * consumed.
5042 */
5043 if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
5044 if (vma_needs_reservation(h, vma, haddr) < 0) {
5045 ret = VM_FAULT_OOM;
5046 goto out_mutex;
5047 }
5048 /* Just decrements count, does not deallocate */
5049 vma_end_reservation(h, vma, haddr);
5050
5051 if (!(vma->vm_flags & VM_MAYSHARE))
5052 pagecache_page = hugetlbfs_pagecache_page(h,
5053 vma, haddr);
5054 }
5055
5056 ptl = huge_pte_lock(h, mm, ptep);
5057
5058 /* Check for a racing update before calling hugetlb_cow */
5059 if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
5060 goto out_ptl;
5061
5062 /* Handle userfault-wp first, before trying to lock more pages */
> 5063 if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
5064 (flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
5065 struct vm_fault vmf = {
5066 .vma = vma,
5067 .address = haddr,
5068 .flags = flags,
5069 };
5070
5071 spin_unlock(ptl);
5072 if (pagecache_page) {
5073 unlock_page(pagecache_page);
5074 put_page(pagecache_page);
5075 }
5076 mutex_unlock(&hugetlb_fault_mutex_table[hash]);
5077 i_mmap_unlock_read(mapping);
5078 return handle_userfault(&vmf, VM_UFFD_WP);
5079 }
5080
5081 /*
5082 * hugetlb_cow() requires page locks of pte_page(entry) and
5083 * pagecache_page, so here we need take the former one
5084 * when page != pagecache_page or !pagecache_page.
5085 */
5086 page = pte_page(entry);
5087 if (page != pagecache_page)
5088 if (!trylock_page(page)) {
5089 need_wait_lock = 1;
5090 goto out_ptl;
5091 }
5092
5093 get_page(page);
5094
5095 if (flags & FAULT_FLAG_WRITE) {
5096 if (!huge_pte_write(entry)) {
5097 ret = hugetlb_cow(mm, vma, address, ptep,
5098 pagecache_page, ptl);
5099 goto out_put_page;
5100 }
5101 entry = huge_pte_mkdirty(entry);
5102 }
5103 entry = pte_mkyoung(entry);
5104 if (huge_ptep_set_access_flags(vma, haddr, ptep, entry,
5105 flags & FAULT_FLAG_WRITE))
5106 update_mmu_cache(vma, haddr, ptep);
5107 out_put_page:
5108 if (page != pagecache_page)
5109 unlock_page(page);
5110 put_page(page);
5111 out_ptl:
5112 spin_unlock(ptl);
5113
5114 if (pagecache_page) {
5115 unlock_page(pagecache_page);
5116 put_page(pagecache_page);
5117 }
5118 out_mutex:
5119 mutex_unlock(&hugetlb_fault_mutex_table[hash]);
5120 i_mmap_unlock_read(mapping);
5121 /*
5122 * Generally it's safe to hold refcount during waiting page lock. But
5123 * here we just wait to defer the next page fault to avoid busy loop and
5124 * the page is not used after unlocked before returning from the current
5125 * page fault. So we are safe from accessing freed page, even if we wait
5126 * here without taking refcount.
5127 */
5128 if (need_wait_lock)
5129 wait_on_page_locked(page);
5130 return ret;
5131 }
5132

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (13.07 kB)
.config.gz (27.64 kB)

2021-07-21 00:00:54

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v5 18/26] hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP

Hi Peter,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kselftest/next]
[also build test ERROR on linus/master v5.14-rc2 next-20210720]
[cannot apply to hnaz-linux-mm/master asm-generic/master arm64/for-next/core linux/master tip/x86/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20210716-041947
base: https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
config: s390-randconfig-r023-20210716 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 5d5b08761f944d5b9822d582378333cc4b36a0a7)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install s390 cross compiling tool for clang build
# apt-get install binutils-s390x-linux-gnu
# https://github.com/0day-ci/linux/commit/2ad11be7ccbb4fada15c5ec48a35630d450fc9ca
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20210716-041947
git checkout 2ad11be7ccbb4fada15c5ec48a35630d450fc9ca
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=s390

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

In file included from mm/hugetlb.c:19:
In file included from include/linux/memblock.h:14:
In file included from arch/s390/include/asm/dma.h:5:
In file included from arch/s390/include/asm/io.h:75:
include/asm-generic/io.h:464:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
val = __raw_readb(PCI_IOBASE + addr);
~~~~~~~~~~ ^
include/asm-generic/io.h:477:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
~~~~~~~~~~ ^
include/uapi/linux/byteorder/big_endian.h:36:59: note: expanded from macro '__le16_to_cpu'
#define __le16_to_cpu(x) __swab16((__force __u16)(__le16)(x))
^
include/uapi/linux/swab.h:102:54: note: expanded from macro '__swab16'
#define __swab16(x) (__u16)__builtin_bswap16((__u16)(x))
^
In file included from mm/hugetlb.c:19:
In file included from include/linux/memblock.h:14:
In file included from arch/s390/include/asm/dma.h:5:
In file included from arch/s390/include/asm/io.h:75:
include/asm-generic/io.h:490:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
~~~~~~~~~~ ^
include/uapi/linux/byteorder/big_endian.h:34:59: note: expanded from macro '__le32_to_cpu'
#define __le32_to_cpu(x) __swab32((__force __u32)(__le32)(x))
^
include/uapi/linux/swab.h:115:54: note: expanded from macro '__swab32'
#define __swab32(x) (__u32)__builtin_bswap32((__u32)(x))
^
In file included from mm/hugetlb.c:19:
In file included from include/linux/memblock.h:14:
In file included from arch/s390/include/asm/dma.h:5:
In file included from arch/s390/include/asm/io.h:75:
include/asm-generic/io.h:501:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
__raw_writeb(value, PCI_IOBASE + addr);
~~~~~~~~~~ ^
include/asm-generic/io.h:511:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
__raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
~~~~~~~~~~ ^
include/asm-generic/io.h:521:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
__raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
~~~~~~~~~~ ^
include/asm-generic/io.h:609:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
readsb(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
include/asm-generic/io.h:617:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
readsw(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
include/asm-generic/io.h:625:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
readsl(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
include/asm-generic/io.h:634:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
writesb(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
include/asm-generic/io.h:643:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
writesw(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
include/asm-generic/io.h:652:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
writesl(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
mm/hugetlb.c:5063:29: error: implicit declaration of function 'huge_pte_uffd_wp' [-Werror,-Wimplicit-function-declaration]
if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
^
>> mm/hugetlb.c:5301:14: error: implicit declaration of function 'huge_pte_mkuffd_wp' [-Werror,-Wimplicit-function-declaration]
_dst_pte = huge_pte_mkuffd_wp(_dst_pte);
^
mm/hugetlb.c:5301:14: note: did you mean 'pte_mkuffd_wp'?
include/asm-generic/pgtable_uffd.h:18:30: note: 'pte_mkuffd_wp' declared here
static __always_inline pte_t pte_mkuffd_wp(pte_t pte)
^
>> mm/hugetlb.c:5301:12: error: assigning to 'pte_t' from incompatible type 'int'
_dst_pte = huge_pte_mkuffd_wp(_dst_pte);
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
12 warnings and 3 errors generated.


vim +/huge_pte_mkuffd_wp +5301 mm/hugetlb.c

5132
5133 #ifdef CONFIG_USERFAULTFD
5134 /*
5135 * Used by userfaultfd UFFDIO_COPY. Based on mcopy_atomic_pte with
5136 * modifications for huge pages.
5137 */
5138 int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
5139 pte_t *dst_pte,
5140 struct vm_area_struct *dst_vma,
5141 unsigned long dst_addr,
5142 unsigned long src_addr,
5143 enum mcopy_atomic_mode mode,
5144 struct page **pagep,
5145 bool wp_copy)
5146 {
5147 bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
5148 struct hstate *h = hstate_vma(dst_vma);
5149 struct address_space *mapping = dst_vma->vm_file->f_mapping;
5150 pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
5151 unsigned long size;
5152 int vm_shared = dst_vma->vm_flags & VM_SHARED;
5153 pte_t _dst_pte;
5154 spinlock_t *ptl;
5155 int ret = -ENOMEM;
5156 struct page *page;
5157 int writable;
5158
5159 if (is_continue) {
5160 ret = -EFAULT;
5161 page = find_lock_page(mapping, idx);
5162 if (!page)
5163 goto out;
5164 } else if (!*pagep) {
5165 /* If a page already exists, then it's UFFDIO_COPY for
5166 * a non-missing case. Return -EEXIST.
5167 */
5168 if (vm_shared &&
5169 hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
5170 ret = -EEXIST;
5171 goto out;
5172 }
5173
5174 page = alloc_huge_page(dst_vma, dst_addr, 0);
5175 if (IS_ERR(page)) {
5176 ret = -ENOMEM;
5177 goto out;
5178 }
5179
5180 ret = copy_huge_page_from_user(page,
5181 (const void __user *) src_addr,
5182 pages_per_huge_page(h), false);
5183
5184 /* fallback to copy_from_user outside mmap_lock */
5185 if (unlikely(ret)) {
5186 ret = -ENOENT;
5187 /* Free the allocated page which may have
5188 * consumed a reservation.
5189 */
5190 restore_reserve_on_error(h, dst_vma, dst_addr, page);
5191 put_page(page);
5192
5193 /* Allocate a temporary page to hold the copied
5194 * contents.
5195 */
5196 page = alloc_huge_page_vma(h, dst_vma, dst_addr);
5197 if (!page) {
5198 ret = -ENOMEM;
5199 goto out;
5200 }
5201 *pagep = page;
5202 /* Set the outparam pagep and return to the caller to
5203 * copy the contents outside the lock. Don't free the
5204 * page.
5205 */
5206 goto out;
5207 }
5208 } else {
5209 if (vm_shared &&
5210 hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
5211 put_page(*pagep);
5212 ret = -EEXIST;
5213 *pagep = NULL;
5214 goto out;
5215 }
5216
5217 page = alloc_huge_page(dst_vma, dst_addr, 0);
5218 if (IS_ERR(page)) {
5219 ret = -ENOMEM;
5220 *pagep = NULL;
5221 goto out;
5222 }
5223 copy_huge_page(page, *pagep);
5224 put_page(*pagep);
5225 *pagep = NULL;
5226 }
5227
5228 /*
5229 * The memory barrier inside __SetPageUptodate makes sure that
5230 * preceding stores to the page contents become visible before
5231 * the set_pte_at() write.
5232 */
5233 __SetPageUptodate(page);
5234
5235 /* Add shared, newly allocated pages to the page cache. */
5236 if (vm_shared && !is_continue) {
5237 size = i_size_read(mapping->host) >> huge_page_shift(h);
5238 ret = -EFAULT;
5239 if (idx >= size)
5240 goto out_release_nounlock;
5241
5242 /*
5243 * Serialization between remove_inode_hugepages() and
5244 * huge_add_to_page_cache() below happens through the
5245 * hugetlb_fault_mutex_table that here must be hold by
5246 * the caller.
5247 */
5248 ret = huge_add_to_page_cache(page, mapping, idx);
5249 if (ret)
5250 goto out_release_nounlock;
5251 }
5252
5253 ptl = huge_pte_lockptr(h, dst_mm, dst_pte);
5254 spin_lock(ptl);
5255
5256 /*
5257 * Recheck the i_size after holding PT lock to make sure not
5258 * to leave any page mapped (as page_mapped()) beyond the end
5259 * of the i_size (remove_inode_hugepages() is strict about
5260 * enforcing that). If we bail out here, we'll also leave a
5261 * page in the radix tree in the vm_shared case beyond the end
5262 * of the i_size, but remove_inode_hugepages() will take care
5263 * of it as soon as we drop the hugetlb_fault_mutex_table.
5264 */
5265 size = i_size_read(mapping->host) >> huge_page_shift(h);
5266 ret = -EFAULT;
5267 if (idx >= size)
5268 goto out_release_unlock;
5269
5270 ret = -EEXIST;
5271 if (!huge_pte_none(huge_ptep_get(dst_pte)))
5272 goto out_release_unlock;
5273
5274 if (vm_shared) {
5275 page_dup_rmap(page, true);
5276 } else {
5277 ClearHPageRestoreReserve(page);
5278 hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
5279 }
5280
5281 /*
5282 * For either: (1) CONTINUE on a non-shared VMA, or (2) UFFDIO_COPY
5283 * with wp flag set, don't set pte write bit.
5284 */
5285 if (wp_copy || (is_continue && !vm_shared))
5286 writable = 0;
5287 else
5288 writable = dst_vma->vm_flags & VM_WRITE;
5289
5290 _dst_pte = make_huge_pte(dst_vma, page, writable);
5291 /*
5292 * Always mark UFFDIO_COPY page dirty; note that this may not be
5293 * extremely important for hugetlbfs for now since swapping is not
5294 * supported, but we should still be clear in that this page cannot be
5295 * thrown away at will, even if write bit not set.
5296 */
5297 _dst_pte = huge_pte_mkdirty(_dst_pte);
5298 _dst_pte = pte_mkyoung(_dst_pte);
5299
5300 if (wp_copy)
> 5301 _dst_pte = huge_pte_mkuffd_wp(_dst_pte);
5302
5303 set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
5304
5305 (void)huge_ptep_set_access_flags(dst_vma, dst_addr, dst_pte, _dst_pte,
5306 dst_vma->vm_flags & VM_WRITE);
5307 hugetlb_count_add(pages_per_huge_page(h), dst_mm);
5308
5309 /* No need to invalidate - it was non-present before */
5310 update_mmu_cache(dst_vma, dst_addr, dst_pte);
5311
5312 spin_unlock(ptl);
5313 if (!is_continue)
5314 SetHPageMigratable(page);
5315 if (vm_shared || is_continue)
5316 unlock_page(page);
5317 ret = 0;
5318 out:
5319 return ret;
5320 out_release_unlock:
5321 spin_unlock(ptl);
5322 if (vm_shared || is_continue)
5323 unlock_page(page);
5324 out_release_nounlock:
5325 restore_reserve_on_error(h, dst_vma, dst_addr, page);
5326 put_page(page);
5327 goto out;
5328 }
5329 #endif /* CONFIG_USERFAULTFD */
5330

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (13.94 kB)
.config.gz (27.64 kB)

2021-07-21 08:37:10

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v5 19/26] hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT

Hi Peter,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kselftest/next]
[also build test ERROR on linus/master v5.14-rc2 next-20210720]
[cannot apply to hnaz-linux-mm/master asm-generic/master arm64/for-next/core linux/master tip/x86/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20210716-041947
base: https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
config: s390-randconfig-r023-20210716 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 5d5b08761f944d5b9822d582378333cc4b36a0a7)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install s390 cross compiling tool for clang build
# apt-get install binutils-s390x-linux-gnu
# https://github.com/0day-ci/linux/commit/23779145f29982887db86a44763fa794325c479f
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Peter-Xu/userfaultfd-wp-Support-shmem-and-hugetlbfs/20210716-041947
git checkout 23779145f29982887db86a44763fa794325c479f
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=s390

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

In file included from mm/hugetlb.c:19:
In file included from include/linux/memblock.h:14:
In file included from arch/s390/include/asm/dma.h:5:
In file included from arch/s390/include/asm/io.h:75:
include/asm-generic/io.h:464:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
val = __raw_readb(PCI_IOBASE + addr);
~~~~~~~~~~ ^
include/asm-generic/io.h:477:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
~~~~~~~~~~ ^
include/uapi/linux/byteorder/big_endian.h:36:59: note: expanded from macro '__le16_to_cpu'
#define __le16_to_cpu(x) __swab16((__force __u16)(__le16)(x))
^
include/uapi/linux/swab.h:102:54: note: expanded from macro '__swab16'
#define __swab16(x) (__u16)__builtin_bswap16((__u16)(x))
^
In file included from mm/hugetlb.c:19:
In file included from include/linux/memblock.h:14:
In file included from arch/s390/include/asm/dma.h:5:
In file included from arch/s390/include/asm/io.h:75:
include/asm-generic/io.h:490:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
~~~~~~~~~~ ^
include/uapi/linux/byteorder/big_endian.h:34:59: note: expanded from macro '__le32_to_cpu'
#define __le32_to_cpu(x) __swab32((__force __u32)(__le32)(x))
^
include/uapi/linux/swab.h:115:54: note: expanded from macro '__swab32'
#define __swab32(x) (__u32)__builtin_bswap32((__u32)(x))
^
In file included from mm/hugetlb.c:19:
In file included from include/linux/memblock.h:14:
In file included from arch/s390/include/asm/dma.h:5:
In file included from arch/s390/include/asm/io.h:75:
include/asm-generic/io.h:501:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
__raw_writeb(value, PCI_IOBASE + addr);
~~~~~~~~~~ ^
include/asm-generic/io.h:511:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
__raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
~~~~~~~~~~ ^
include/asm-generic/io.h:521:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
__raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
~~~~~~~~~~ ^
include/asm-generic/io.h:609:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
readsb(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
include/asm-generic/io.h:617:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
readsw(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
include/asm-generic/io.h:625:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
readsl(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
include/asm-generic/io.h:634:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
writesb(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
include/asm-generic/io.h:643:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
writesw(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
include/asm-generic/io.h:652:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
writesl(PCI_IOBASE + addr, buffer, count);
~~~~~~~~~~ ^
mm/hugetlb.c:5063:29: error: implicit declaration of function 'huge_pte_uffd_wp' [-Werror,-Wimplicit-function-declaration]
if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
^
mm/hugetlb.c:5301:14: error: implicit declaration of function 'huge_pte_mkuffd_wp' [-Werror,-Wimplicit-function-declaration]
_dst_pte = huge_pte_mkuffd_wp(_dst_pte);
^
mm/hugetlb.c:5301:14: note: did you mean 'pte_mkuffd_wp'?
include/asm-generic/pgtable_uffd.h:18:30: note: 'pte_mkuffd_wp' declared here
static __always_inline pte_t pte_mkuffd_wp(pte_t pte)
^
mm/hugetlb.c:5301:12: error: assigning to 'pte_t' from incompatible type 'int'
_dst_pte = huge_pte_mkuffd_wp(_dst_pte);
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/hugetlb.c:5595:11: error: implicit declaration of function 'huge_pte_mkuffd_wp' [-Werror,-Wimplicit-function-declaration]
pte = huge_pte_mkuffd_wp(huge_pte_wrprotect(pte));
^
mm/hugetlb.c:5595:9: error: assigning to 'pte_t' from incompatible type 'int'
pte = huge_pte_mkuffd_wp(huge_pte_wrprotect(pte));
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> mm/hugetlb.c:5597:11: error: implicit declaration of function 'huge_pte_clear_uffd_wp' [-Werror,-Wimplicit-function-declaration]
pte = huge_pte_clear_uffd_wp(pte);
^
mm/hugetlb.c:5597:11: note: did you mean 'pte_clear_uffd_wp'?
include/asm-generic/pgtable_uffd.h:28:30: note: 'pte_clear_uffd_wp' declared here
static __always_inline pte_t pte_clear_uffd_wp(pte_t pte)
^
mm/hugetlb.c:5597:9: error: assigning to 'pte_t' from incompatible type 'int'
pte = huge_pte_clear_uffd_wp(pte);
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
12 warnings and 7 errors generated.


vim +/huge_pte_clear_uffd_wp +5597 mm/hugetlb.c

  5520	
  5521	unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
  5522			unsigned long address, unsigned long end,
  5523			pgprot_t newprot, unsigned long cp_flags)
  5524	{
  5525		struct mm_struct *mm = vma->vm_mm;
  5526		unsigned long start = address;
  5527		pte_t *ptep;
  5528		pte_t pte;
  5529		struct hstate *h = hstate_vma(vma);
  5530		unsigned long pages = 0;
  5531		bool shared_pmd = false;
  5532		struct mmu_notifier_range range;
  5533		bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
  5534		bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
  5535	
  5536		/*
  5537		 * In the case of shared PMDs, the area to flush could be beyond
  5538		 * start/end. Set range.start/range.end to cover the maximum possible
  5539		 * range if PMD sharing is possible.
  5540		 */
  5541		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA,
  5542					0, vma, mm, start, end);
  5543		adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
  5544	
  5545		BUG_ON(address >= end);
  5546		flush_cache_range(vma, range.start, range.end);
  5547	
  5548		mmu_notifier_invalidate_range_start(&range);
  5549		i_mmap_lock_write(vma->vm_file->f_mapping);
  5550		for (; address < end; address += huge_page_size(h)) {
  5551			spinlock_t *ptl;
  5552			ptep = huge_pte_offset(mm, address, huge_page_size(h));
  5553			if (!ptep)
  5554				continue;
  5555			ptl = huge_pte_lock(h, mm, ptep);
  5556			if (huge_pmd_unshare(mm, vma, &address, ptep)) {
  5557				pages++;
  5558				spin_unlock(ptl);
  5559				shared_pmd = true;
  5560				continue;
  5561			}
  5562			pte = huge_ptep_get(ptep);
  5563			if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
  5564				spin_unlock(ptl);
  5565				continue;
  5566			}
  5567			if (unlikely(is_hugetlb_entry_migration(pte))) {
  5568				swp_entry_t entry = pte_to_swp_entry(pte);
  5569	
  5570				if (is_writable_migration_entry(entry)) {
  5571					pte_t newpte;
  5572	
  5573					entry = make_readable_migration_entry(
  5574								swp_offset(entry));
  5575					newpte = swp_entry_to_pte(entry);
  5576					if (uffd_wp)
  5577						newpte = pte_swp_mkuffd_wp(newpte);
  5578					else if (uffd_wp_resolve)
  5579						newpte = pte_swp_clear_uffd_wp(newpte);
  5580					set_huge_swap_pte_at(mm, address, ptep,
  5581							     newpte, huge_page_size(h));
  5582					pages++;
  5583				}
  5584				spin_unlock(ptl);
  5585				continue;
  5586			}
  5587			if (!huge_pte_none(pte)) {
  5588				pte_t old_pte;
  5589				unsigned int shift = huge_page_shift(hstate_vma(vma));
  5590	
  5591				old_pte = huge_ptep_modify_prot_start(vma, address, ptep);
  5592				pte = pte_mkhuge(huge_pte_modify(old_pte, newprot));
  5593				pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
  5594				if (uffd_wp)
  5595					pte = huge_pte_mkuffd_wp(huge_pte_wrprotect(pte));
  5596				else if (uffd_wp_resolve)
> 5597				pte = huge_pte_clear_uffd_wp(pte);
  5598				huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
  5599				pages++;
  5600			}
  5601			spin_unlock(ptl);
  5602		}
  5603		/*
  5604		 * Must flush TLB before releasing i_mmap_rwsem: x86's huge_pmd_unshare
  5605		 * may have cleared our pud entry and done put_page on the page table:
  5606		 * once we release i_mmap_rwsem, another task can do the final put_page
  5607		 * and that page table be reused and filled with junk. If we actually
  5608		 * did unshare a page of pmds, flush the range corresponding to the pud.
  5609		 */
  5610		if (shared_pmd)
  5611			flush_hugetlb_tlb_range(vma, range.start, range.end);
  5612		else
  5613			flush_hugetlb_tlb_range(vma, start, end);
  5614		/*
  5615		 * No need to call mmu_notifier_invalidate_range() we are downgrading
  5616		 * page table protection not changing it to point to a new page.
  5617		 *
  5618		 * See Documentation/vm/mmu_notifier.rst
  5619		 */
  5620		i_mmap_unlock_write(vma->vm_file->f_mapping);
  5621		mmu_notifier_invalidate_range_end(&range);
  5622	
  5623		return pages << h->order;
  5624	}
5625

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]



2021-07-21 11:41:51

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v5 05/26] mm/swap: Introduce the idea of special swap ptes

On Saturday, 17 July 2021 5:11:33 AM AEST Peter Xu wrote:
> On Fri, Jul 16, 2021 at 03:50:52PM +1000, Alistair Popple wrote:
> > Hi Peter,
> >
> > [...]
> >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index ae1f5d0cb581..4b46c099ad94 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -5738,7 +5738,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
> > >
> > > if (pte_present(ptent))
> > > page = mc_handle_present_pte(vma, addr, ptent);
> > > - else if (is_swap_pte(ptent))
> > > + else if (pte_has_swap_entry(ptent))
> > > page = mc_handle_swap_pte(vma, ptent, &ent);
> > > else if (pte_none(ptent))
> > > page = mc_handle_file_pte(vma, addr, ptent, &ent);
> >
> > As I understand things pte_none() == False for a special swap pte, but
> > shouldn't this be treated as pte_none() here? Ie. does this need to be
> > pte_none(ptent) || is_swap_special_pte() here?
>
> Looks correct; here the page/swap cache could hide behind the special pte just
> like a none pte. Will fix it. Thanks!
>
> >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 0e0de08a2cd5..998a4f9a3744 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -3491,6 +3491,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > if (!pte_unmap_same(vmf))
> > > goto out;
> > >
> > > + /*
> > > + * We should never call do_swap_page upon a swap special pte; just be
> > > + * safe to bail out if it happens.
> > > + */
> > > + if (WARN_ON_ONCE(is_swap_special_pte(vmf->orig_pte)))
> > > + goto out;
> > > +
> > > entry = pte_to_swp_entry(vmf->orig_pte);
> > > if (unlikely(non_swap_entry(entry))) {
> > > if (is_migration_entry(entry)) {
> >
> > Are there other changes required here? Because we can end up with stale special
> > pte's and a special pte is !pte_none don't we need to fix some of the !pte_none
> > checks in these functions:
> >
> > insert_pfn() -> checks for !pte_none
> > remap_pte_range() -> BUG_ON(!pte_none)
> > apply_to_pte_range() -> didn't check further but it tests for !pte_none
> >
> > In general it feels like I might be missing something here though. There are
> > plenty of checks in the kernel for pte_none() which haven't been updated. Is
> > there some rule that says none of those paths can see a special pte?
>
> My rule on doing this was to only care about vma that can be backed by RAM,
> majorly shmem/hugetlb, so the special pte can only exist there within those
> vmas. I believe in most pte_none() users this special pte won't exist.
>
> So if it's not related to RAM backed memory at all, maybe it's fine to keep the
> pte_none() usage like before.
>
> Take the example of insert_pfn() referenced first - I think it can be used to
> map some MMIO regions, but I don't think we'll call that upon a RAM region
> (either shmem or hugetlb), nor can it be uffd wr-protected. So I'm not sure
> adding special pte check there would be helpful.
>
> apply_to_pte_range() seems to be a bit special - I think the pte_fn_t matters
> more on whether the special pte will matter. I had a quick look, it seems
> still be used mostly by all kinds of driver code not mm core. It's used in two
> forms:
>
> apply_to_page_range
> apply_to_existing_page_range
>
> The first one creates ptes only, so it ignores the pte_none() check so I skipped.
>
> The second one has two call sites:
>
> *** arch/powerpc/mm/pageattr.c:
> change_memory_attr[99] return apply_to_existing_page_range(&init_mm, start, size,
> set_memory_attr[132] return apply_to_existing_page_range(&init_mm, start, sz, set_page_attr,
>
> *** mm/kasan/shadow.c:
> kasan_release_vmalloc[485] apply_to_existing_page_range(&init_mm,
>
> I'll leave the ppc callers for now as uffd-wp is not even supported there. The
> kasan_release_vmalloc() should be for kernel allocated memories only, so should
> not be a target for special pte either.
>
> So indeed it's hard to 100% cover all pte_none() users to make sure things are
> used right. As stated above I still believe most callers don't need that, but
> the worst case is if someone triggered uffd-wp issues with a specific feature,
> we can look into it. I am not sure whether it's good we add this for all the
> pte_none() users, because mostly they'll be useless checks, imho.

I wonder then - should we make pte_none() return true for these special ptes
as well? It seems that if we do miss any callers it could result in some fairly
hard-to-find bugs if the code follows a different path due to the presence of
an unexpected special pte changing the result of pte_none().

> So far what I planned to do is to cover most things we know that may be
> affected like this patch so the change may bring a difference, hopefully we
> won't miss any important spots.
>
> >
> > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > index 23cbd9de030b..b477d0d5f911 100644
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -294,7 +294,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> > >
> > > spin_lock(ptl);
> > > pte = *ptep;
> > > - if (!is_swap_pte(pte))
> > > + if (!pte_has_swap_entry(pte))
> > > goto out;
> > >
> > > entry = pte_to_swp_entry(pte);
> > > @@ -2276,7 +2276,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > >
> > > pte = *ptep;
> > >
> > > - if (pte_none(pte)) {
> > > + if (pte_none(pte) || is_swap_special_pte(pte)) {
> >
> > I was wondering if we can loose the special pte information here? However I see
> > that in migrate_vma_insert_page() we check again and fail the migration if
> > !pte_none() so I think this is ok.
> >
> > I think it would be better if this check was moved below so the migration fails
> > early. Ie:
> >
> > if (pte_none(pte)) {
> > if (vma_is_anonymous(vma) && !is_swap_special_pte(pte)) {
>
> Hmm.. but shouldn't vma_is_anonymous()==true already means it must not be a
> swap special pte? Because swap special pte only exists when !vma_is_anonymous().

Oh ok that makes sense. With the code written that way it is easy to forget
that though so maybe a comment would help?

> >
> > Also how does this work for page migration in general? I can see in
> > page_vma_mapped_walk() that we skip special pte's, but doesn't this mean we
> > loose the special pte in that instance? Or is that ok for some reason?
>
> Do you mean try_to_migrate_one()? Does it need to be aware of that? Per my
> understanding that's only for anonymous private memory, while in that world
> there should have no swap special pte (page_lock_anon_vma_read will return NULL
> early for !vma_is_anonymous).

As far as I know try_to_migrate_one() gets called for both anonymous pages and
file-backed pages. page_lock_anon_vma_read() is only called in the case of an
anonymous vma. See the implementation of rmap_walk() - it will call either
rmap_walk_anon() or rmap_walk_file() depending on the result of PageAnon().

- Alistair

> Thanks,
>
>




2021-07-21 19:28:08

by Ivan Teterevkov

[permalink] [raw]
Subject: RE: [PATCH v5 24/26] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

On Mon, Jul 19, 2021 5:56 PM +0000, Peter Xu wrote:
> I'm also curious what would be the real use to have an accurate PM_SWAP
> accounting. To me current implementation may not provide accurate value but
> should be good enough for most cases. However not sure whether it's also true
> for your use case.

We want the PM_SWAP bit implemented (for shared memory in the pagemap
interface) to enhance live migration for the fraction of guest
VMs that have their pages swapped out to the host swap. Once those pages
are paged in and transferred over the network, we then want to release them
with madvise(MADV_PAGEOUT), preserving the working set of the guest VMs
and reducing the thrashing of the host swap.
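
(For illustration, a rough userspace sketch of that flow could look like the
below; assumptions: PM_SWAP is bit 62 of a pagemap entry as documented in
pagemap.rst, MADV_PAGEOUT needs a 5.4+ kernel and a libc that defines it, and
reporting PM_SWAP for shmem is exactly what is being discussed here; error
handling is trimmed.)

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define PM_SWAP_BIT	(1ULL << 62)	/* "page swapped" bit of a pagemap entry */

/* Check whether the page backing 'addr' is currently swapped out. */
static int page_swapped(int pagemap_fd, void *addr, long psize)
{
	uint64_t ent;
	off_t off = ((uintptr_t)addr / psize) * sizeof(ent);

	if (pread(pagemap_fd, &ent, sizeof(ent), off) != sizeof(ent))
		return -1;
	return !!(ent & PM_SWAP_BIT);
}

/* After the page was faulted in and sent over the network, push it back out. */
static void release_page(void *addr, long psize)
{
	madvise(addr, psize, MADV_PAGEOUT);
}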

At this point, we don't really need the PM_UFFD_WP or PM_SOFT_DIRTY bits
in the pagemap report and were considering them only if they were easy to
retrieve. The latter seems to require some plumbing through a variety
of use cases in the kernel, so our intention at the moment is to capture it
in the pagemap docs as a known issue, presumably to be handled by CRIU users.

(Cc Pavel Emelyanov CRIU chief maintainer)

Thanks,
Ivan

2021-07-21 21:03:21

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v5 24/26] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

On 21.07.21 16:38, Ivan Teterevkov wrote:
> On Mon, Jul 19, 2021 5:56 PM +0000, Peter Xu wrote:
>> I'm also curious what would be the real use to have an accurate PM_SWAP
>> accounting. To me current implementation may not provide accurate value but
>> should be good enough for most cases. However not sure whether it's also true
>> for your use case.
>
> We want the PM_SWAP bit implemented (for shared memory in the pagemap
> interface) to enhance the live migration for some fraction of the guest
> VMs that have their pages swapped out to the host swap. Once those pages
> are paged in and transferred over network, we then want to release them
> with madvise(MADV_PAGEOUT) and preserve the working set of the guest VMs
> to reduce the thrashing of the host swap.

There are 3 possibilities I think (swap is just another variant of the
page cache):

1) The page is not in the page cache, e.g., it resides on disk or in a
swap file. pte_none().
2) The page is in the page cache and is not mapped into the page table.
pte_none().
3) The page is in the page cache and mapped into the page table.
!pte_none().

Do I understand correctly that you want to identify 1) and indicate it
via PM_SWAP?

--
Thanks,

David / dhildenb

2021-07-21 21:07:57

by Ivan Teterevkov

[permalink] [raw]
Subject: RE: [PATCH v5 24/26] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

On Wed, Jul 21, 2021 4:20 PM +0000, David Hildenbrand wrote:
> On 21.07.21 16:38, Ivan Teterevkov wrote:
> > On Mon, Jul 19, 2021 5:56 PM +0000, Peter Xu wrote:
> >> I'm also curious what would be the real use to have an accurate
> >> PM_SWAP accounting. To me current implementation may not provide
> >> accurate value but should be good enough for most cases. However not
> >> sure whether it's also true for your use case.
> >
> > We want the PM_SWAP bit implemented (for shared memory in the pagemap
> > interface) to enhance the live migration for some fraction of the
> > guest VMs that have their pages swapped out to the host swap. Once
> > those pages are paged in and transferred over network, we then want to
> > release them with madvise(MADV_PAGEOUT) and preserve the working set
> > of the guest VMs to reduce the thrashing of the host swap.
>
> There are 3 possibilities I think (swap is just another variant of the page cache):
>
> 1) The page is not in the page cache, e.g., it resides on disk or in a swap file.
> pte_none().
> 2) The page is in the page cache and is not mapped into the page table.
> pte_none().
> 3) The page is in the page cache and mapped into the page table.
> !pte_none().
>
> Do I understand correctly that you want to identify 1) and indicate it via
> PM_SWAP?

Yes, and I also want to outline the context so we're on the same page.

This series introduces support for userfaultfd-wp on shared memory
because once a shared page is swapped out, its PTE is cleared. Upon retrieval
from the swap file, there's no way to "recover" the _PAGE_SWP_UFFD_WP flag
because, unlike with private memory, it's not kept in the PTE or anywhere else.

We came across the same issue with PM_SWAP in the pagemap interface, but
fortunately there's a place we can query: the i_pages field of
struct address_space (the XArray). In https://lkml.org/lkml/2021/7/14/595
we do it similarly to what shmem_fault() does when it handles a #PF.
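
(For reference, that kind of lookup could be sketched roughly as below;
shmem_pte_swapped() is a made-up name and locking is ignored, it's only meant
to show how a swapped-out shmem page shows up as a value entry in the
mapping's XArray, similar to what shmem_fault() checks.)

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/xarray.h>

/*
 * For a pte_none() slot in a shmem vma, peek at the page cache: a
 * swapped-out shmem page is kept in i_pages as an XArray value entry
 * encoding the swap entry, so xa_is_value() says "this page went to
 * swap" even though the pte itself is empty.
 */
static bool shmem_pte_swapped(struct vm_area_struct *vma, unsigned long addr)
{
	struct address_space *mapping = file_inode(vma->vm_file)->i_mapping;
	pgoff_t pgoff = linear_page_index(vma, addr);
	void *entry = xa_load(&mapping->i_pages, pgoff);

	return xa_is_value(entry);
}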

Now, in the context of this series, we were exploring whether it makes
any practical sense to introduce more brand new flags to the special
PTE to populate the pagemap flags "on the spot" from the given PTE.

However, I can't see how (and why) to achieve that specifically for
PM_SWAP even with an extra bit: the XArray is precisely what we need for
the live migration use case. Another flag PM_SOFT_DIRTY suffers the same
problem as UFFD_WP_SWP_PTE_SPECIAL before this patch series, but we don't
need it at the moment.

Hope that clarification makes sense?

The only outstanding note I have is about the compatibility of our
patches around pte_to_pagemap_entry(). I think the resulting code
should look like this:

static pagemap_entry_t pte_to_pagemap_entry(...)
{
	if (pte_present(pte)) {
		...
	} else if (is_swap_pte(pte) || shmem_file(vma->vm_file)) {
		...
		if (pte_swp_uffd_wp_special(pte)) {
			flags |= PM_UFFD_WP;
		}
	}
}

The is_swap_pte() branch will be taken for the swapped out shared pages,
thanks to shmem_file(), so the pte_swp_uffd_wp_special() can be checked
inside.

Alternatively, we could just remove "else" statement:

static pagemap_entry_t pte_to_pagemap_entry(...)
{
	if (pte_present(pte)) {
		...
	} else if (is_swap_pte(pte) || shmem_file(vma->vm_file)) {
		...
	}

	if (pte_swp_uffd_wp_special(pte)) {
		flags |= PM_UFFD_WP;
	}
}

What do you reckon?

Thanks,
Ivan

2021-07-21 21:36:55

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v5 05/26] mm/swap: Introduce the idea of special swap ptes

On Wed, Jul 21, 2021 at 09:28:49PM +1000, Alistair Popple wrote:
> On Saturday, 17 July 2021 5:11:33 AM AEST Peter Xu wrote:
> > On Fri, Jul 16, 2021 at 03:50:52PM +1000, Alistair Popple wrote:
> > > Hi Peter,
> > >
> > > [...]
> > >
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index ae1f5d0cb581..4b46c099ad94 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -5738,7 +5738,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
> > > >
> > > > if (pte_present(ptent))
> > > > page = mc_handle_present_pte(vma, addr, ptent);
> > > > - else if (is_swap_pte(ptent))
> > > > + else if (pte_has_swap_entry(ptent))
> > > > page = mc_handle_swap_pte(vma, ptent, &ent);
> > > > else if (pte_none(ptent))
> > > > page = mc_handle_file_pte(vma, addr, ptent, &ent);
> > >
> > > As I understand things pte_none() == False for a special swap pte, but
> > > shouldn't this be treated as pte_none() here? Ie. does this need to be
> > > pte_none(ptent) || is_swap_special_pte() here?
> >
> > Looks correct; here the page/swap cache could hide behind the special pte just
> > like a none pte. Will fix it. Thanks!
> >
> > >
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index 0e0de08a2cd5..998a4f9a3744 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -3491,6 +3491,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > if (!pte_unmap_same(vmf))
> > > > goto out;
> > > >
> > > > + /*
> > > > + * We should never call do_swap_page upon a swap special pte; just be
> > > > + * safe to bail out if it happens.
> > > > + */
> > > > + if (WARN_ON_ONCE(is_swap_special_pte(vmf->orig_pte)))
> > > > + goto out;
> > > > +
> > > > entry = pte_to_swp_entry(vmf->orig_pte);
> > > > if (unlikely(non_swap_entry(entry))) {
> > > > if (is_migration_entry(entry)) {
> > >
> > > Are there other changes required here? Because we can end up with stale special
> > > pte's and a special pte is !pte_none don't we need to fix some of the !pte_none
> > > checks in these functions:
> > >
> > > insert_pfn() -> checks for !pte_none
> > > remap_pte_range() -> BUG_ON(!pte_none)
> > > apply_to_pte_range() -> didn't check further but it tests for !pte_none
> > >
> > > In general it feels like I might be missing something here though. There are
> > > plenty of checks in the kernel for pte_none() which haven't been updated. Is
> > > there some rule that says none of those paths can see a special pte?
> >
> > My rule on doing this was to only care about vma that can be backed by RAM,
> > majorly shmem/hugetlb, so the special pte can only exist there within those
> > vmas. I believe in most pte_none() users this special pte won't exist.
> >
> > So if it's not related to RAM backed memory at all, maybe it's fine to keep the
> > pte_none() usage like before.
> >
> > Take the example of insert_pfn() referenced first - I think it can be used to
> > map some MMIO regions, but I don't think we'll call that upon a RAM region
> > (either shmem or hugetlb), nor can it be uffd wr-protected. So I'm not sure
> > adding special pte check there would be helpful.
> >
> > apply_to_pte_range() seems to be a bit special - I think the pte_fn_t matters
> > more on whether the special pte will matter. I had a quick look, it seems
> > still be used mostly by all kinds of driver code not mm core. It's used in two
> > forms:
> >
> > apply_to_page_range
> > apply_to_existing_page_range
> >
> > The first one creates ptes only, so it ignores the pte_none() check so I skipped.
> >
> > The second one has two call sites:
> >
> > *** arch/powerpc/mm/pageattr.c:
> > change_memory_attr[99] return apply_to_existing_page_range(&init_mm, start, size,
> > set_memory_attr[132] return apply_to_existing_page_range(&init_mm, start, sz, set_page_attr,
> >
> > *** mm/kasan/shadow.c:
> > kasan_release_vmalloc[485] apply_to_existing_page_range(&init_mm,
> >
> > I'll leave the ppc callers for now as uffd-wp is not even supported there. The
> > kasan_release_vmalloc() should be for kernel allocated memories only, so should
> > not be a target for special pte either.
> >
> > So indeed it's hard to 100% cover all pte_none() users to make sure things are
> > used right. As stated above I still believe most callers don't need that, but
> > the worst case is if someone triggered uffd-wp issues with a specific feature,
> > we can look into it. I am not sure whether it's good we add this for all the
> > pte_none() users, because mostly they'll be useless checks, imho.
>
> I wonder then - should we make pte_none() return true for these special pte's
> as well? It seems if we do miss any callers it could result in some fairly hard
> to find bugs if the code follows a different path due to the presence of an
> unexpected special pte changing the result of pte_none().

I thought about something similar before, but I didn't dare to change
pte_none() as it's been there for ages and I'm afraid people will get confused
when its meaning changes. So even if we want some helper identifying
"either a none pte or the swap special pte", it should use a different name.

Modifying the meaning of pte_none() could also bring other risks, e.g. for when
we really want an empty pte to mean something else some day. It turns out there's
no easy way around identifying the cases one by one, at least to me. I'm always
open to good suggestions.

Btw, as you mentioned before, we can use a new number out of MAX_SWAPFILES,
that'll make all these easier a bit here, then we don't need to worry on
pte_none() issues too. Two days ago Hugh has raised some similar concern on
whether it's good to implement this uffd-wp special pte like this. I think we
can discuss this separately.

>
> > So far what I planned to do is to cover most things we know that may be
> > affected like this patch so the change may bring a difference, hopefully we
> > won't miss any important spots.
> >
> > >
> > > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > > index 23cbd9de030b..b477d0d5f911 100644
> > > > --- a/mm/migrate.c
> > > > +++ b/mm/migrate.c
> > > > @@ -294,7 +294,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> > > >
> > > > spin_lock(ptl);
> > > > pte = *ptep;
> > > > - if (!is_swap_pte(pte))
> > > > + if (!pte_has_swap_entry(pte))
> > > > goto out;
> > > >
> > > > entry = pte_to_swp_entry(pte);
> > > > @@ -2276,7 +2276,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > > >
> > > > pte = *ptep;
> > > >
> > > > - if (pte_none(pte)) {
> > > > + if (pte_none(pte) || is_swap_special_pte(pte)) {
> > >
> > > I was wondering if we can loose the special pte information here? However I see
> > > that in migrate_vma_insert_page() we check again and fail the migration if
> > > !pte_none() so I think this is ok.
> > >
> > > I think it would be better if this check was moved below so the migration fails
> > > early. Ie:
> > >
> > > if (pte_none(pte)) {
> > > if (vma_is_anonymous(vma) && !is_swap_special_pte(pte)) {
> >
> > Hmm.. but shouldn't vma_is_anonymous()==true already means it must not be a
> > swap special pte? Because swap special pte only exists when !vma_is_anonymous().
>
> Oh ok that makes sense. With the code written that way it is easy to forget
> that though so maybe a comment would help?

I've put most of the words into the comment of is_swap_special_pte(). Do you
perhaps have a suggestion for the comment here?

>
> > >
> > > Also how does this work for page migration in general? I can see in
> > > page_vma_mapped_walk() that we skip special pte's, but doesn't this mean we
> > > loose the special pte in that instance? Or is that ok for some reason?
> >
> > Do you mean try_to_migrate_one()? Does it need to be aware of that? Per my
> > understanding that's only for anonymous private memory, while in that world
> > there should have no swap special pte (page_lock_anon_vma_read will return NULL
> > early for !vma_is_anonymous).
>
> As far as I know try_to_migrate_one() gets called for both anonymous pages and
> file-backed pages. page_lock_anon_vma_read() is only called in the case of an
> anonymous vma. See the implementation of rmap_walk() - it will call either
> rmap_walk_anon() or rmap_walk_file() depending on the result of PageAnon().

I may have replied too soon there. :) I think you're right.

So I think how it should work with page migration is: we skip that pte just
like you said (check_pte returns false), then the per-pte info will be
kept there, regardless of what the backing page is. When it faults, it'll
bring in either the old or the new page depending on whether migration has
finished. Does that sound workable to you?

Thanks,

--
Peter Xu

2021-07-21 21:53:03

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v5 17/26] hugetlb/userfaultfd: Hook page faults for uffd write protection

On Tue, Jul 20, 2021 at 11:37:36PM +0800, kernel test robot wrote:
> config: s390-randconfig-r023-20210716 (attached as .config)

[...]

> >> mm/hugetlb.c:5063:29: error: implicit declaration of function 'huge_pte_uffd_wp' [-Werror,-Wimplicit-function-declaration]
> if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
> ^
> 12 warnings and 1 error generated.

I remember I raised this question once about why s390 redefines a lot of huge pte
operations on its own even though they're defined the same in the generic hugetlb.h..
I think there was a plan to rework that, but it definitely hasn't landed yet.

Will squash the below into the patch "mm/hugetlb: Introduce huge pte version of
uffd-wp helpers":

---8<---
diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index 60f9241e5e4a..19c4b4431d27 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -115,6 +115,21 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
 	return pte_modify(pte, newprot);
 }
 
+static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
+{
+	return pte;
+}
+
+static inline pte_t huge_pte_clear_uffd_wp(pte_t pte)
+{
+	return pte;
+}
+
+static inline int huge_pte_uffd_wp(pte_t pte)
+{
+	return 0;
+}
+
 static inline bool gigantic_page_runtime_supported(void)
 {
 	return true;
---8<---

Thanks,

--
Peter Xu

2021-07-21 22:29:16

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v5 24/26] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

Hi, Ivan,

On Wed, Jul 21, 2021 at 07:54:44PM +0000, Ivan Teterevkov wrote:
> On Wed, Jul 21, 2021 4:20 PM +0000, David Hildenbrand wrote:
> > On 21.07.21 16:38, Ivan Teterevkov wrote:
> > > On Mon, Jul 19, 2021 5:56 PM +0000, Peter Xu wrote:
> > >> I'm also curious what would be the real use to have an accurate
> > >> PM_SWAP accounting. To me current implementation may not provide
> > >> accurate value but should be good enough for most cases. However not
> > >> sure whether it's also true for your use case.
> > >
> > > We want the PM_SWAP bit implemented (for shared memory in the pagemap
> > > interface) to enhance the live migration for some fraction of the
> > > guest VMs that have their pages swapped out to the host swap. Once
> > > those pages are paged in and transferred over network, we then want to
> > > release them with madvise(MADV_PAGEOUT) and preserve the working set
> > > of the guest VMs to reduce the thrashing of the host swap.
> >
> > There are 3 possibilities I think (swap is just another variant of the page cache):
> >
> > 1) The page is not in the page cache, e.g., it resides on disk or in a swap file.
> > pte_none().
> > 2) The page is in the page cache and is not mapped into the page table.
> > pte_none().
> > 3) The page is in the page cache and mapped into the page table.
> > !pte_none().
> >
> > Do I understand correctly that you want to identify 1) and indicate it via
> > PM_SWAP?
>
> Yes, and I also want to outline the context so we're on the same page.
>
> This series introduces the support for userfaultfd-wp for shared memory
> because once a shared page is swapped, its PTE is cleared. Upon retrieval
> from a swap file, there's no way to "recover" the _PAGE_SWP_UFFD_WP flag
> because unlike private memory it's not kept in PTE or elsewhere.
>
> We came across the same issue with PM_SWAP in the pagemap interface, but
> fortunately, there's the place that we could query: the i_pages field of
> the struct address_space (XArray). In https://lkml.org/lkml/2021/7/14/595
> we do it similarly to what shmem_fault() does when it handles #PF.
>
> Now, in the context of this series, we were exploring whether it makes
> any practical sense to introduce more brand new flags to the special
> PTE to populate the pagemap flags "on the spot" from the given PTE.
>
> However, I can't see how (and why) to achieve that specifically for
> PM_SWAP even with an extra bit: the XArray is precisely what we need for
> the live migration use case. Another flag PM_SOFT_DIRTY suffers the same
> problem as UFFD_WP_SWP_PTE_SPECIAL before this patch series, but we don't
> need it at the moment.
>
> Hope that clarification makes sense?

Yes it helps, thanks.

So I can understand now how that patch came about initially; even if it may not
work for PM_SOFT_DIRTY, it does seem to work for PM_SWAP.

However I have a concern that I also raised in the other thread: I think
there'll be an extra and meaningless xa_load() for all the real pte_none()s
that aren't swapped out but simply have no page behind them from the very
beginning. That happens much more frequently when the memory being observed by
pagemap is mapped as a huge chunk and only sparsely populated.

With the old code we'll simply skip those ptes, but now I have no idea how much
overhead an xa_load() brings.

Btw, I think there's a way to implement such an idea similar to the swap
special uffd-wp pte - when page reclaim swaps out a shmem page, instead of
putting a none pte there, maybe we can set one bit in the otherwise-none pte
showing that this pte has been swapped out. When the page is faulted back in,
we just drop that bit.

That bit could also be scanned by the pagemap code to know that this page was
swapped out. That should be much lighter than an xa_load(), and it can be told
apart from a real none pte immediately, just by reading the pte value.
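
(Purely as an illustration of that idea, with completely made-up names and bit
position, the pte side could look like the sketch below; nothing like this
exists in the series.)

/* Hypothetical software bit left in an otherwise-none pte by page reclaim. */
#define _PTE_SHMEM_SWAPPED	(1UL << 5)	/* made-up bit position */

static inline pte_t shmem_swapped_marker_pte(void)
{
	return __pte(_PTE_SHMEM_SWAPPED);
}

static inline bool pte_is_shmem_swapped_marker(pte_t pte)
{
	/* Tells "was swapped out" apart from a real none pte by value alone. */
	return pte_val(pte) == _PTE_SHMEM_SWAPPED;
}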

Do you think this would work?

>
> The only outstanding note I have is about the compatibility of our
> patches around pte_to_pagemap_entry(). I think the resulting code
> should look like this:
>
> static pagemap_entry_t pte_to_pagemap_entry(...)
> {
> 	if (pte_present(pte)) {
> 		...
> 	} else if (is_swap_pte(pte) || shmem_file(vma->vm_file)) {
> 		...
> 		if (pte_swp_uffd_wp_special(pte)) {
> 			flags |= PM_UFFD_WP;
> 		}
> 	}
> }
>
> The is_swap_pte() branch will be taken for the swapped out shared pages,
> thanks to shmem_file(), so the pte_swp_uffd_wp_special() can be checked
> inside.
>
> Alternatively, we could just remove "else" statement:
>
> static pagemap_entry_t pte_to_pagemap_entry(...)
> {
> 	if (pte_present(pte)) {
> 		...
> 	} else if (is_swap_pte(pte) || shmem_file(vma->vm_file)) {
> 		...
> 	}
>
> 	if (pte_swp_uffd_wp_special(pte)) {
> 		flags |= PM_UFFD_WP;
> 	}
> }
>
> What do you reckon?

I don't worry too much about how we implement those in detail yet. Both look
fine to me.

Thanks,

--
Peter Xu

2021-07-21 22:59:03

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v5 24/26] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

On Wed, Jul 21, 2021 at 06:28:03PM -0400, Peter Xu wrote:
> Hi, Ivan,
>
> On Wed, Jul 21, 2021 at 07:54:44PM +0000, Ivan Teterevkov wrote:
> > On Wed, Jul 21, 2021 4:20 PM +0000, David Hildenbrand wrote:
> > > On 21.07.21 16:38, Ivan Teterevkov wrote:
> > > > On Mon, Jul 19, 2021 5:56 PM +0000, Peter Xu wrote:
> > > >> I'm also curious what would be the real use to have an accurate
> > > >> PM_SWAP accounting. To me current implementation may not provide
> > > >> accurate value but should be good enough for most cases. However not
> > > >> sure whether it's also true for your use case.
> > > >
> > > > We want the PM_SWAP bit implemented (for shared memory in the pagemap
> > > > interface) to enhance the live migration for some fraction of the
> > > > guest VMs that have their pages swapped out to the host swap. Once
> > > > those pages are paged in and transferred over network, we then want to
> > > > release them with madvise(MADV_PAGEOUT) and preserve the working set
> > > > of the guest VMs to reduce the thrashing of the host swap.
> > >
> > > There are 3 possibilities I think (swap is just another variant of the page cache):
> > >
> > > 1) The page is not in the page cache, e.g., it resides on disk or in a swap file.
> > > pte_none().
> > > 2) The page is in the page cache and is not mapped into the page table.
> > > pte_none().
> > > 3) The page is in the page cache and mapped into the page table.
> > > !pte_none().
> > >
> > > Do I understand correctly that you want to identify 1) and indicate it via
> > > PM_SWAP?
> >
> > Yes, and I also want to outline the context so we're on the same page.
> >
> > This series introduces the support for userfaultfd-wp for shared memory
> > because once a shared page is swapped, its PTE is cleared. Upon retrieval
> > from a swap file, there's no way to "recover" the _PAGE_SWP_UFFD_WP flag
> > because unlike private memory it's not kept in PTE or elsewhere.
> >
> > We came across the same issue with PM_SWAP in the pagemap interface, but
> > fortunately, there's the place that we could query: the i_pages field of
> > the struct address_space (XArray). In https://lkml.org/lkml/2021/7/14/595
> > we do it similarly to what shmem_fault() does when it handles #PF.
> >
> > Now, in the context of this series, we were exploring whether it makes
> > any practical sense to introduce more brand new flags to the special
> > PTE to populate the pagemap flags "on the spot" from the given PTE.
> >
> > However, I can't see how (and why) to achieve that specifically for
> > PM_SWAP even with an extra bit: the XArray is precisely what we need for
> > the live migration use case. Another flag PM_SOFT_DIRTY suffers the same
> > problem as UFFD_WP_SWP_PTE_SPECIAL before this patch series, but we don't
> > need it at the moment.
> >
> > Hope that clarification makes sense?
>
> Yes it helps, thanks.
>
> So I can understand now on how that patch comes initially, even if it may not
> work for PM_SOFT_DIRTY but it seems working indeed for PM_SWAP.
>
> However I have a concern that I raised also in the other thread: I think
> there'll be an extra and meaningless xa_load() for all the real pte_none()s
> that aren't swapped out but just having no page at the back from the very
> beginning. That happens much more frequent when the memory being observed by
> pagemap is mapped in a huge chunk and sparsely mapped.
>
> With old code we'll simply skip those ptes, but now I have no idea how much
> overhead would a xa_load() brings.
>
> Btw, I think there's a way to implement such an idea similar to the swap
> special uffd-wp pte - when page reclaim of shmem pages, instead of putting a
> none pte there maybe we can also have one bit set in the none pte showing that
> this pte is swapped out. When the page faulted back we just drop that bit.
>
> That bit could be also scanned by pagemap code to know that this page was
> swapped out. That should be much lighter than xa_load(), and that identifies
> immediately from a real none pte just by reading the value.
>
> Do you think this would work?

Btw, I think that's what Tiberiu used to mention, but I think I just changed my
mind.. Sorry to have caused such confusion.

So what I think now is: we can set it (instead of zeroing the pte) right when
page reclaim unmaps the pte. Code-wise, that can be a special flag
(maybe, TTU_PAGEOUT?) passed over to try_to_unmap() from shrink_page_list() to
differentiate it from other try_to_unmap()s.

I think that bit can also be dropped correctly, e.g. when punching a hole in the
file, rmap_walk() can find and drop the marker (I used to suspect the uffd-wp
bit could get left-overs, but on second thought it similarly seems
it won't; as long as hole punching and vma unmapping are always able to
scan those marker ptes, it seems all right to drop them correctly).

But that's my wild thoughts; I could have missed something too.

Thanks,

--
Peter Xu

2021-07-22 01:10:03

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v5 05/26] mm/swap: Introduce the idea of special swap ptes

On Thursday, 22 July 2021 7:35:32 AM AEST Peter Xu wrote:
> On Wed, Jul 21, 2021 at 09:28:49PM +1000, Alistair Popple wrote:
> > On Saturday, 17 July 2021 5:11:33 AM AEST Peter Xu wrote:
> > > On Fri, Jul 16, 2021 at 03:50:52PM +1000, Alistair Popple wrote:
> > > > Hi Peter,
> > > >
> > > > [...]
> > > >
> > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > index ae1f5d0cb581..4b46c099ad94 100644
> > > > > --- a/mm/memcontrol.c
> > > > > +++ b/mm/memcontrol.c
> > > > > @@ -5738,7 +5738,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
> > > > >
> > > > > if (pte_present(ptent))
> > > > > page = mc_handle_present_pte(vma, addr, ptent);
> > > > > - else if (is_swap_pte(ptent))
> > > > > + else if (pte_has_swap_entry(ptent))
> > > > > page = mc_handle_swap_pte(vma, ptent, &ent);
> > > > > else if (pte_none(ptent))
> > > > > page = mc_handle_file_pte(vma, addr, ptent, &ent);
> > > >
> > > > As I understand things pte_none() == False for a special swap pte, but
> > > > shouldn't this be treated as pte_none() here? Ie. does this need to be
> > > > pte_none(ptent) || is_swap_special_pte() here?
> > >
> > > Looks correct; here the page/swap cache could hide behind the special pte just
> > > like a none pte. Will fix it. Thanks!
> > >
> > > >
> > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > index 0e0de08a2cd5..998a4f9a3744 100644
> > > > > --- a/mm/memory.c
> > > > > +++ b/mm/memory.c
> > > > > @@ -3491,6 +3491,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > > if (!pte_unmap_same(vmf))
> > > > > goto out;
> > > > >
> > > > > + /*
> > > > > + * We should never call do_swap_page upon a swap special pte; just be
> > > > > + * safe to bail out if it happens.
> > > > > + */
> > > > > + if (WARN_ON_ONCE(is_swap_special_pte(vmf->orig_pte)))
> > > > > + goto out;
> > > > > +
> > > > > entry = pte_to_swp_entry(vmf->orig_pte);
> > > > > if (unlikely(non_swap_entry(entry))) {
> > > > > if (is_migration_entry(entry)) {
> > > >
> > > > Are there other changes required here? Because we can end up with stale special
> > > > pte's and a special pte is !pte_none don't we need to fix some of the !pte_none
> > > > checks in these functions:
> > > >
> > > > insert_pfn() -> checks for !pte_none
> > > > remap_pte_range() -> BUG_ON(!pte_none)
> > > > apply_to_pte_range() -> didn't check further but it tests for !pte_none
> > > >
> > > > In general it feels like I might be missing something here though. There are
> > > > plenty of checks in the kernel for pte_none() which haven't been updated. Is
> > > > there some rule that says none of those paths can see a special pte?
> > >
> > > My rule on doing this was to only care about vma that can be backed by RAM,
> > > majorly shmem/hugetlb, so the special pte can only exist there within those
> > > vmas. I believe in most pte_none() users this special pte won't exist.
> > >
> > > So if it's not related to RAM backed memory at all, maybe it's fine to keep the
> > > pte_none() usage like before.
> > >
> > > Take the example of insert_pfn() referenced first - I think it can be used to
> > > map some MMIO regions, but I don't think we'll call that upon a RAM region
> > > (either shmem or hugetlb), nor can it be uffd wr-protected. So I'm not sure
> > > adding special pte check there would be helpful.
> > >
> > > apply_to_pte_range() seems to be a bit special - I think the pte_fn_t matters
> > > more on whether the special pte will matter. I had a quick look, it seems
> > > still be used mostly by all kinds of driver code not mm core. It's used in two
> > > forms:
> > >
> > > apply_to_page_range
> > > apply_to_existing_page_range
> > >
> > > The first one creates ptes only, so it ignores the pte_none() check so I skipped.
> > >
> > > The second one has two call sites:
> > >
> > > *** arch/powerpc/mm/pageattr.c:
> > > change_memory_attr[99] return apply_to_existing_page_range(&init_mm, start, size,
> > > set_memory_attr[132] return apply_to_existing_page_range(&init_mm, start, sz, set_page_attr,
> > >
> > > *** mm/kasan/shadow.c:
> > > kasan_release_vmalloc[485] apply_to_existing_page_range(&init_mm,
> > >
> > > I'll leave the ppc callers for now as uffd-wp is not even supported there. The
> > > kasan_release_vmalloc() should be for kernel allocated memories only, so should
> > > not be a target for special pte either.
> > >
> > > So indeed it's hard to 100% cover all pte_none() users to make sure things are
> > > used right. As stated above I still believe most callers don't need that, but
> > > the worst case is if someone triggered uffd-wp issues with a specific feature,
> > > we can look into it. I am not sure whether it's good we add this for all the
> > > pte_none() users, because mostly they'll be useless checks, imho.
> >
> > I wonder then - should we make pte_none() return true for these special pte's
> > as well? It seems if we do miss any callers it could result in some fairly hard
> > to find bugs if the code follows a different path due to the presence of an
> > unexpected special pte changing the result of pte_none().
>
> I thought about something similar before, but I didn't dare to change
> pte_none() as it's been there for ages and I'm afraid people will get confused
> when it's meaning changed. So even if we want to have some helper identifying
> "either none pte or the swap special pte" it should use a different name.
>
> Modifying the meaning of pte_none() could also have other risks that when we
> really want an empty pte to be doing something else now. It turns out there's
> no easy way to not identify the case one by one, at least to me. I'm always
> open to good suggestions.

I'm not convinced it's changing the behaviour of pte_none() though, and my
concern is that introducing special swap ptes does change it. Prior to this,
clearing a pte would result in pte_none()==True. After this series, clearing a
pte can sometimes result in pte_none()==False because it doesn't really
get cleared.

Now as you say it's hard to cover 100% of pte_none() uses, so it's possible we
have missed cases that may now encounter a special pte and take a different
path (get_mctgt_type() is one example, I stopped looking for other possible
ones after mm/memory.c).

So perhaps if we want to keep pte_none() to check for really clear pte's then
what is required is converting all callers to a new helper
(pte_none_not_special()?) that treats special swap ptes as pte_none() and warns
if a special pte is encountered?
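
(Something like the rough sketch below, say; the name and the warning are only
the suggestion above, not anything in the series, while is_swap_special_pte()
is the helper introduced by this patch.)

static inline bool pte_none_not_special(pte_t pte)
{
	/* A special swap pte maps nothing, but reaching one here is suspicious. */
	if (WARN_ON_ONCE(is_swap_special_pte(pte)))
		return true;
	return pte_none(pte);
}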

> Btw, as you mentioned before, we can use a new number out of MAX_SWAPFILES,
> that'll make all these easier a bit here, then we don't need to worry on
> pte_none() issues too. Two days ago Hugh has raised some similar concern on
> whether it's good to implement this uffd-wp special pte like this. I think we
> can discuss this separately.

Yes, I saw that and personally I still prefer that approach.

> >
> > > So far what I planned to do is to cover most things we know that may be
> > > affected like this patch so the change may bring a difference, hopefully we
> > > won't miss any important spots.
> > >
> > > >
> > > > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > > > index 23cbd9de030b..b477d0d5f911 100644
> > > > > --- a/mm/migrate.c
> > > > > +++ b/mm/migrate.c
> > > > > @@ -294,7 +294,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> > > > >
> > > > > spin_lock(ptl);
> > > > > pte = *ptep;
> > > > > - if (!is_swap_pte(pte))
> > > > > + if (!pte_has_swap_entry(pte))
> > > > > goto out;
> > > > >
> > > > > entry = pte_to_swp_entry(pte);
> > > > > @@ -2276,7 +2276,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > > > >
> > > > > pte = *ptep;
> > > > >
> > > > > - if (pte_none(pte)) {
> > > > > + if (pte_none(pte) || is_swap_special_pte(pte)) {
> > > >
> > > > I was wondering if we can loose the special pte information here? However I see
> > > > that in migrate_vma_insert_page() we check again and fail the migration if
> > > > !pte_none() so I think this is ok.
> > > >
> > > > I think it would be better if this check was moved below so the migration fails
> > > > early. Ie:
> > > >
> > > > if (pte_none(pte)) {
> > > > if (vma_is_anonymous(vma) && !is_swap_special_pte(pte)) {
> > >
> > > Hmm.. but shouldn't vma_is_anonymous()==true already means it must not be a
> > > swap special pte? Because swap special pte only exists when !vma_is_anonymous().
> >
> > Oh ok that makes sense. With the code written that way it is easy to forget
> > that though so maybe a comment would help?
>
> I've put most words in comment of is_swap_special_pte(). Do you perhaps have a
> suggestion on the comment here?

Perhaps something like "swap special ptes only exist for !vma_is_anonymous(vma)"?

And I now see my original code suggestion was wrong anyway :)

> >
> > > >
> > > > Also how does this work for page migration in general? I can see in
> > > > page_vma_mapped_walk() that we skip special pte's, but doesn't this mean we
> > > > loose the special pte in that instance? Or is that ok for some reason?
> > >
> > > Do you mean try_to_migrate_one()? Does it need to be aware of that? Per my
> > > understanding that's only for anonymous private memory, while in that world
> > > there should have no swap special pte (page_lock_anon_vma_read will return NULL
> > > early for !vma_is_anonymous).
> >
> > As far as I know try_to_migrate_one() gets called for both anonymous pages and
> > file-backed pages. page_lock_anon_vma_read() is only called in the case of an
> > anonymous vma. See the implementation of rmap_walk() - it will call either
> > rmap_walk_anon() or rmap_walk_file() depending on the result of PageAnon().
>
> I may have replied too soon there. :) I think you're right.
>
> So I think how it should work with page migration is: we skip that pte just
> like what you said (check_pte returns false), then the per-pte info will be
> kept there, irrelevant of what's the backing page is. When it faults, it'll
> bring up with either the old/new page depending on migration finished or not.
> Does that sound working to you?

Yes actually I think this is ok. check_pte returns false for special pte's so
the existing special pte will be left in place to be dealt with as normal.

- Alistair

> Thanks,
>
>




2021-07-22 06:29:10

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v5 24/26] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

On 22.07.21 00:57, Peter Xu wrote:
> On Wed, Jul 21, 2021 at 06:28:03PM -0400, Peter Xu wrote:
>> Hi, Ivan,
>>
>> On Wed, Jul 21, 2021 at 07:54:44PM +0000, Ivan Teterevkov wrote:
>>> On Wed, Jul 21, 2021 4:20 PM +0000, David Hildenbrand wrote:
>>>> On 21.07.21 16:38, Ivan Teterevkov wrote:
>>>>> On Mon, Jul 19, 2021 5:56 PM +0000, Peter Xu wrote:
>>>>>> I'm also curious what would be the real use to have an accurate
>>>>>> PM_SWAP accounting. To me current implementation may not provide
>>>>>> accurate value but should be good enough for most cases. However not
>>>>>> sure whether it's also true for your use case.
>>>>>
>>>>> We want the PM_SWAP bit implemented (for shared memory in the pagemap
>>>>> interface) to enhance the live migration for some fraction of the
>>>>> guest VMs that have their pages swapped out to the host swap. Once
>>>>> those pages are paged in and transferred over network, we then want to
>>>>> release them with madvise(MADV_PAGEOUT) and preserve the working set
>>>>> of the guest VMs to reduce the thrashing of the host swap.
>>>>
>>>> There are 3 possibilities I think (swap is just another variant of the page cache):
>>>>
>>>> 1) The page is not in the page cache, e.g., it resides on disk or in a swap file.
>>>> pte_none().
>>>> 2) The page is in the page cache and is not mapped into the page table.
>>>> pte_none().
>>>> 3) The page is in the page cache and mapped into the page table.
>>>> !pte_none().
>>>>
>>>> Do I understand correctly that you want to identify 1) and indicate it via
>>>> PM_SWAP?
>>>
>>> Yes, and I also want to outline the context so we're on the same page.
>>>
>>> This series introduces the support for userfaultfd-wp for shared memory
>>> because once a shared page is swapped, its PTE is cleared. Upon retrieval
>>> from a swap file, there's no way to "recover" the _PAGE_SWP_UFFD_WP flag
>>> because unlike private memory it's not kept in PTE or elsewhere.
>>>
>>> We came across the same issue with PM_SWAP in the pagemap interface, but
>>> fortunately, there's the place that we could query: the i_pages field of
>>> the struct address_space (XArray). In https://lkml.org/lkml/2021/7/14/595
>>> we do it similarly to what shmem_fault() does when it handles #PF.
>>>
>>> Now, in the context of this series, we were exploring whether it makes
>>> any practical sense to introduce more brand new flags to the special
>>> PTE to populate the pagemap flags "on the spot" from the given PTE.
>>>
>>> However, I can't see how (and why) to achieve that specifically for
>>> PM_SWAP even with an extra bit: the XArray is precisely what we need for
>>> the live migration use case. Another flag PM_SOFT_DIRTY suffers the same
>>> problem as UFFD_WP_SWP_PTE_SPECIAL before this patch series, but we don't
>>> need it at the moment.
>>>
>>> Hope that clarification makes sense?
>>
>> Yes it helps, thanks.
>>
>> So I can understand now on how that patch comes initially, even if it may not
>> work for PM_SOFT_DIRTY but it seems working indeed for PM_SWAP.
>>
>> However I have a concern that I raised also in the other thread: I think
>> there'll be an extra and meaningless xa_load() for all the real pte_none()s
>> that aren't swapped out but just having no page at the back from the very
>> beginning. That happens much more frequent when the memory being observed by
>> pagemap is mapped in a huge chunk and sparsely mapped.
>>
>> With old code we'll simply skip those ptes, but now I have no idea how much
>> overhead would a xa_load() brings.

Let's benchmark it then. I feel like we really shouldn't be storing
unnecessary data in page tables if it is readily available
somewhere else, because ...

>>
>> Btw, I think there's a way to implement such an idea similar to the swap
>> special uffd-wp pte - when page reclaim of shmem pages, instead of putting a
>> none pte there maybe we can also have one bit set in the none pte showing that
>> this pte is swapped out. When the page faulted back we just drop that bit.
>>
>> That bit could be also scanned by pagemap code to know that this page was
>> swapped out. That should be much lighter than xa_load(), and that identifies
>> immediately from a real none pte just by reading the value.

... we are optimizing a corner case feature (pagemap) by affecting other
system parts. Just imagine

1. Forking: will always have to copy the whole page tables for shmem
instead of optimizing.
2. New shmem mappings: will always have to sync back that bit from the
pagecache

And these are just the things that immediately come to mind. There is
certainly more (e.g., page table reclaim [1]).

>>
>> Do you think this would work?
>
> Btw, I think that's what Tiberiu used to mention, but I think I just changed my
> mind.. Sorry to have brought such a confusion.
>
> So what I think now is: we can set it (instead of zeroing the pte) right at
> unmapping the pte of page reclaim. Code-wise, that can be a special flag
> (maybe, TTU_PAGEOUT?) passed over to try_to_unmap() of shrink_page_list() to
> differenciate from other try_to_unmap()s.
>
> I think that bit can also be dropped correctly e.g. when punching a hole in the
> file, then rmap_walk() can find and drop the marker (I used to suspect uffd-wp
> bit could get left-overs, but after a second thought here similarly, it seems
> it won't; as long as hole punching and vma unmapping will always be able to
> scan those marker ptes, then it seems all right to drop them correctly).
>
> But that's my wild thoughts; I could have missed something too.
>

Adding to that, Peter, can you enlighten me on how uffd-wp on shmem,
combined with the uffd-wp bit in page tables, is supposed to work in
general when talking about multiple processes?

Shmem means any process can modify any memory. To be able to properly
catch writes to such memory, the only way I can see it working is

1. All processes register uffd-wp on the shmem VMA
2. All processes arm uffd-wp by setting the same uffd-wp bits in their
page tables for the affected shmem
3. All processes synchronize, sending each other uffd-wp events when
they receive one

This is quite ... suboptimal, I have to say. This is really the only way
I can imagine uffd-wp working reliably. Is there an obvious way to make
this work that I am missing?
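
(For concreteness, steps 1 and 2 for a single process would look roughly like
the sketch below; error handling is trimmed, and accepting
UFFDIO_REGISTER_MODE_WP on a shmem mapping is precisely what this series adds.)

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Register a shmem mapping for uffd-wp and write-protect the whole range. */
static int uffd_wp_arm(void *area, unsigned long len)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)area, .len = len },
		.mode = UFFDIO_REGISTER_MODE_WP,
	};
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)area, .len = len },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
	};
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	if (uffd < 0)
		return -1;
	if (ioctl(uffd, UFFDIO_API, &api) ||
	    ioctl(uffd, UFFDIO_REGISTER, &reg) ||
	    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp)) {
		close(uffd);
		return -1;
	}
	return uffd;	/* poll()/read() this fd for write-protect fault events */
}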

But then, all page tables are already supposed to contain the uffd-wp
bit. Which makes me think that we can actually get rid of the uffd-wp
bit in the page table for pte_none() entries and instead store this
information somewhere else (in the page cache?) for all entries combined.

So that simplification would result in

1. All processes register uffd-wp on the shmem VMA
2. One processes wp-protects uffd-wp via the page cache (we can update
all PTEs in other processes)
3. All processes synchronize, sending each other uffd-wp events when
they receive one

The semantics of uffd-wp on shmem would be different to what we have so
far ... which would be just fine as we never had uffd-wp on shared memory.

In an ideal world, 1. and 3. wouldn't be required and all registered
uffd listeners would be notified when any process writes to it.


Sure, for single-user shmem it would work just like !shmem, but then,
maybe that user really shouldn't be using shmem. But maybe I am missing
something important :)

> Thanks,
>

[1]
https://lkml.kernel.org/r/[email protected]

--
Thanks,

David / dhildenb

2021-07-22 15:24:57

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v5 05/26] mm/swap: Introduce the idea of special swap ptes

On Thu, Jul 22, 2021 at 11:08:53AM +1000, Alistair Popple wrote:
> On Thursday, 22 July 2021 7:35:32 AM AEST Peter Xu wrote:
> > On Wed, Jul 21, 2021 at 09:28:49PM +1000, Alistair Popple wrote:
> > > On Saturday, 17 July 2021 5:11:33 AM AEST Peter Xu wrote:
> > > > On Fri, Jul 16, 2021 at 03:50:52PM +1000, Alistair Popple wrote:
> > > > > Hi Peter,
> > > > >
> > > > > [...]
> > > > >
> > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > > index ae1f5d0cb581..4b46c099ad94 100644
> > > > > > --- a/mm/memcontrol.c
> > > > > > +++ b/mm/memcontrol.c
> > > > > > @@ -5738,7 +5738,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
> > > > > >
> > > > > > if (pte_present(ptent))
> > > > > > page = mc_handle_present_pte(vma, addr, ptent);
> > > > > > - else if (is_swap_pte(ptent))
> > > > > > + else if (pte_has_swap_entry(ptent))
> > > > > > page = mc_handle_swap_pte(vma, ptent, &ent);
> > > > > > else if (pte_none(ptent))
> > > > > > page = mc_handle_file_pte(vma, addr, ptent, &ent);
> > > > >
> > > > > As I understand things pte_none() == False for a special swap pte, but
> > > > > shouldn't this be treated as pte_none() here? Ie. does this need to be
> > > > > pte_none(ptent) || is_swap_special_pte() here?
> > > >
> > > > Looks correct; here the page/swap cache could hide behind the special pte just
> > > > like a none pte. Will fix it. Thanks!
> > > >
> > > > >
> > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > index 0e0de08a2cd5..998a4f9a3744 100644
> > > > > > --- a/mm/memory.c
> > > > > > +++ b/mm/memory.c
> > > > > > @@ -3491,6 +3491,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > > > if (!pte_unmap_same(vmf))
> > > > > > goto out;
> > > > > >
> > > > > > + /*
> > > > > > + * We should never call do_swap_page upon a swap special pte; just be
> > > > > > + * safe to bail out if it happens.
> > > > > > + */
> > > > > > + if (WARN_ON_ONCE(is_swap_special_pte(vmf->orig_pte)))
> > > > > > + goto out;
> > > > > > +
> > > > > > entry = pte_to_swp_entry(vmf->orig_pte);
> > > > > > if (unlikely(non_swap_entry(entry))) {
> > > > > > if (is_migration_entry(entry)) {
> > > > >
> > > > > Are there other changes required here? Because we can end up with stale special
> > > > > pte's and a special pte is !pte_none don't we need to fix some of the !pte_none
> > > > > checks in these functions:
> > > > >
> > > > > insert_pfn() -> checks for !pte_none
> > > > > remap_pte_range() -> BUG_ON(!pte_none)
> > > > > apply_to_pte_range() -> didn't check further but it tests for !pte_none
> > > > >
> > > > > In general it feels like I might be missing something here though. There are
> > > > > plenty of checks in the kernel for pte_none() which haven't been updated. Is
> > > > > there some rule that says none of those paths can see a special pte?
> > > >
> > > > My rule on doing this was to only care about vma that can be backed by RAM,
> > > > majorly shmem/hugetlb, so the special pte can only exist there within those
> > > > vmas. I believe in most pte_none() users this special pte won't exist.
> > > >
> > > > So if it's not related to RAM backed memory at all, maybe it's fine to keep the
> > > > pte_none() usage like before.
> > > >
> > > > Take the example of insert_pfn() referenced first - I think it can be used to
> > > > map some MMIO regions, but I don't think we'll call that upon a RAM region
> > > > (either shmem or hugetlb), nor can it be uffd wr-protected. So I'm not sure
> > > > adding special pte check there would be helpful.
> > > >
> > > > apply_to_pte_range() seems to be a bit special - I think it's the pte_fn_t that
> > > > decides whether the special pte will matter. I had a quick look; it seems to
> > > > still be used mostly by all kinds of driver code, not mm core. It's used in two
> > > > forms:
> > > >
> > > > apply_to_page_range
> > > > apply_to_existing_page_range
> > > >
> > > > The first one only creates ptes, so it ignores the pte_none() check and I skipped it.
> > > >
> > > > The second one has two call sites:
> > > >
> > > > *** arch/powerpc/mm/pageattr.c:
> > > > change_memory_attr[99] return apply_to_existing_page_range(&init_mm, start, size,
> > > > set_memory_attr[132] return apply_to_existing_page_range(&init_mm, start, sz, set_page_attr,
> > > >
> > > > *** mm/kasan/shadow.c:
> > > > kasan_release_vmalloc[485] apply_to_existing_page_range(&init_mm,
> > > >
> > > > I'll leave the ppc callers for now as uffd-wp is not even supported there. The
> > > > kasan_release_vmalloc() one should be for kernel-allocated memory only, so it
> > > > should not be a target for the special pte either.
> > > >
> > > > So indeed it's hard to cover 100% of the pte_none() users and make sure things
> > > > are used right. As stated above I still believe most callers don't need that,
> > > > and in the worst case, if someone triggers uffd-wp issues with a specific
> > > > feature, we can look into it. I am not sure it's a good idea to add this check
> > > > for all the pte_none() users, because mostly they'll be useless checks, imho.
> > >
> > > I wonder then - should we make pte_none() return true for these special ptes
> > > as well? It seems that if we do miss any callers it could result in some fairly
> > > hard-to-find bugs if the code follows a different path due to the presence of
> > > an unexpected special pte changing the result of pte_none().
> >
> > I thought about something similar before, but I didn't dare to change
> > pte_none() as it's been there for ages and I'm afraid people will get confused
> > when its meaning changes. So even if we want to have some helper identifying
> > "either a none pte or the swap special pte" it should use a different name.
> >
> > Modifying the meaning of pte_none() also carries other risks, e.g. for when we
> > really want an empty pte to mean something else in the future. It turns out
> > there's no easy way to avoid identifying the cases one by one, at least to me.
> > I'm always open to good suggestions.
>
> I'm not convinced it's changing the behaviour of pte_none() though and my
> concern is that introducing special swap ptes does change it. Prior to this
> clearing a pte would result in pte_none()==True. After this series clearing a
> pte can sometimes result in pte_none()==False because it doesn't really
> get cleared.

The thing is the uffd special pte is literally not "none"; there's something
inside. That's what makes it feel wrong to me. I'm not against trapping all
of pte_none(), but as I mentioned I think it at least needs to be renamed to
something else (maybe pte_none_mostly(), but I don't know..).
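
A rough sketch of what such a helper could look like, just to illustrate (the
name and where it would live are totally up for discussion, nothing is
decided):

  /*
   * Sketch only: "mostly none" means either a real none pte or one of
   * the swap special ptes (currently only the uffd-wp special pte).
   * Callers that only care about "is anything meaningful mapped here"
   * could use this instead of a bare pte_none().
   */
  static inline bool pte_none_mostly(pte_t pte)
  {
          return pte_none(pte) || is_swap_special_pte(pte);
  }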

>
> Now as you say it's hard to cover 100% of pte_none() uses, so it's possible we
> have missed cases that may now encounter a special pte and take a different
> path (get_mctgt_type() is one example, I stopped looking for other possible
> ones after mm/memory.c).
>
> So perhaps if we want to keep pte_none() to check for really clear pte's then
> what is required is converting all callers to a new helper
> (pte_none_not_special()?) that treats special swap ptes as pte_none() and warns
> if a special pte is encountered?

By double-checking all core mm calls to pte_none()?

The special swap pte shouldn't exist in most cases; so far it can only show up
in shmem and hugetlbfs. So IMHO we can sensibly skip a lot of the pte_none()
users depending on the type of memory they operate on, e.g. with something
like the sketch below.
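
Purely illustrative - this helper doesn't exist in the series; vma_is_shmem()
and is_vm_hugetlb_page() are just the existing checks I'd lean on:

  /*
   * Sketch only: whether a vma can contain the uffd-wp special pte at
   * all.  Callers working on other memory types could keep using plain
   * pte_none() without worrying about the special pte.
   */
  static inline bool vma_can_have_special_pte(struct vm_area_struct *vma)
  {
          return vma_is_shmem(vma) || is_vm_hugetlb_page(vma);
  }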

>
> > Btw, as you mentioned before, we could use a new number out of MAX_SWAPFILES;
> > that'll make all of this a bit easier, and then we don't need to worry about
> > the pte_none() issues either. Two days ago Hugh raised a similar concern on
> > whether it's a good idea to implement this uffd-wp special pte like this. I
> > think we can discuss that separately.
>
> Yes, I saw that and personally I still prefer that approach.

Yes, I see your preference. Let's hold off a bit on the pte_none() discussions;
I'll re-raise this in the cover letter soon. If everyone is okay with using yet
another swap type out of MAX_SWAPFILES and that's preferred, then I can switch
the design. Then I think I can also avoid touching the pte_none() bits at all,
which seems to be the controversial part here.

But still, I am also not convinced that we can blindly replace pte_none() with
"either a none pte or some special pte", either in this series or (if this
series switches to swp_entry) in the future when we want to use ptes that are
!pte_present and not swap entries. If we want to do that replacement, we'd
still need to check over all the users of pte_none() - which is the same work
as what we'd do now - and give it a proper rename.

Thanks,

--
Peter Xu

2021-07-22 16:09:29

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v5 24/26] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

On Thu, Jul 22, 2021 at 08:27:07AM +0200, David Hildenbrand wrote:
> On 22.07.21 00:57, Peter Xu wrote:
> > On Wed, Jul 21, 2021 at 06:28:03PM -0400, Peter Xu wrote:
> > > Hi, Ivan,
> > >
> > > On Wed, Jul 21, 2021 at 07:54:44PM +0000, Ivan Teterevkov wrote:
> > > > On Wed, Jul 21, 2021 4:20 PM +0000, David Hildenbrand wrote:
> > > > > On 21.07.21 16:38, Ivan Teterevkov wrote:
> > > > > > On Mon, Jul 19, 2021 5:56 PM +0000, Peter Xu wrote:
> > > > > > > I'm also curious what the real use would be for accurate PM_SWAP
> > > > > > > accounting. To me the current implementation may not provide an
> > > > > > > accurate value but should be good enough for most cases. However I'm
> > > > > > > not sure whether that's also true for your use case.
> > > > > >
> > > > > > We want the PM_SWAP bit implemented (for shared memory in the pagemap
> > > > > > interface) to enhance the live migration for some fraction of the
> > > > > > guest VMs that have their pages swapped out to the host swap. Once
> > > > > > those pages are paged in and transferred over network, we then want to
> > > > > > release them with madvise(MADV_PAGEOUT) and preserve the working set
> > > > > > of the guest VMs to reduce the thrashing of the host swap.
> > > > >
> > > > > There are 3 possibilities I think (swap is just another variant of the page cache):
> > > > >
> > > > > 1) The page is not in the page cache, e.g., it resides on disk or in a swap file.
> > > > > pte_none().
> > > > > 2) The page is in the page cache and is not mapped into the page table.
> > > > > pte_none().
> > > > > 3) The page is in the page cache and mapped into the page table.
> > > > > !pte_none().
> > > > >
> > > > > Do I understand correctly that you want to identify 1) and indicate it via
> > > > > PM_SWAP?
> > > >
> > > > Yes, and I also want to outline the context so we're on the same page.
> > > >
> > > > This series introduces the support for userfaultfd-wp for shared memory
> > > > because once a shared page is swapped, its PTE is cleared. Upon retrieval
> > > > from a swap file, there's no way to "recover" the _PAGE_SWP_UFFD_WP flag
> > > > because unlike private memory it's not kept in PTE or elsewhere.
> > > >
> > > > We came across the same issue with PM_SWAP in the pagemap interface, but
> > > > fortunately, there's the place that we could query: the i_pages field of
> > > > the struct address_space (XArray). In https://lkml.org/lkml/2021/7/14/595
> > > > we do it similarly to what shmem_fault() does when it handles #PF.
> > > >
> > > > Now, in the context of this series, we were exploring whether it makes
> > > > any practical sense to introduce more brand new flags to the special
> > > > PTE to populate the pagemap flags "on the spot" from the given PTE.
> > > >
> > > > However, I can't see how (and why) to achieve that specifically for
> > > > PM_SWAP even with an extra bit: the XArray is precisely what we need for
> > > > the live migration use case. Another flag PM_SOFT_DIRTY suffers the same
> > > > problem as UFFD_WP_SWP_PTE_SPECIAL before this patch series, but we don't
> > > > need it at the moment.
> > > >
> > > > Hope that clarification makes sense?
> > >
> > > Yes it helps, thanks.
> > >
> > > So I can understand now how that patch came about initially; even if it may
> > > not work for PM_SOFT_DIRTY, it does seem to work for PM_SWAP.
> > >
> > > However I have a concern that I also raised in the other thread: I think
> > > there'll be an extra and meaningless xa_load() for all the real pte_none()s
> > > that aren't swapped out but simply never had a page behind them in the first
> > > place. That happens much more frequently when the memory being observed by
> > > pagemap is mapped in a huge chunk but only sparsely populated.
> > >
> > > With the old code we'll simply skip those ptes, but now I have no idea how
> > > much overhead an xa_load() brings.
>
> Let's benchmark it then.

Yes, that's a good idea. The goal should be that we won't regress any existing
pagemap users because sampling became too slow.
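
For reference, a minimal userspace sketch of the kind of measurement I have in
mind (everything here - the 1GB size, the sparse touch pattern, timing a single
pagemap read - is just an assumption for illustration, not part of the series):

  /* Map a big shmem region, touch only a few pages so most ptes stay
   * none, then time a full pagemap scan over it. */
  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>
  #include <unistd.h>
  #include <sys/mman.h>

  int main(void)
  {
          size_t pgsz = sysconf(_SC_PAGESIZE);
          size_t size = 1UL << 30, npages = size / pgsz;
          char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
          uint64_t *pm = calloc(npages, sizeof(*pm));
          int fd = open("/proc/self/pagemap", O_RDONLY);
          struct timespec t0, t1;

          if (buf == MAP_FAILED || !pm || fd < 0)
                  return 1;

          /* Touch one page out of every 512 to keep the mapping sparse. */
          for (size_t i = 0; i < npages; i += 512)
                  buf[i * pgsz] = 1;

          clock_gettime(CLOCK_MONOTONIC, &t0);
          /* pagemap: one 64-bit entry per page, indexed by virtual pfn. */
          pread(fd, pm, npages * sizeof(*pm),
                ((uintptr_t)buf / pgsz) * sizeof(*pm));
          clock_gettime(CLOCK_MONOTONIC, &t1);

          printf("pagemap scan: %ld us\n",
                 (t1.tv_sec - t0.tv_sec) * 1000000L +
                 (t1.tv_nsec - t0.tv_nsec) / 1000L);
          return 0;
  }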

> I feel like we really shouldn't be storing
> unnecessary data in page tables if it is readily available somewhere
> else, because ...
>
> > >
> > > Btw, I think there's a way to implement such an idea, similar to the uffd-wp
> > > swap special pte - when reclaiming shmem pages, instead of putting a none pte
> > > there, maybe we can also set one bit in the (otherwise none) pte showing that
> > > this pte was swapped out. When the page is faulted back we just drop that bit.
> > >
> > > That bit could also be scanned by the pagemap code to know that this page was
> > > swapped out. That should be much lighter than xa_load(), and it can be told
> > > apart from a real none pte immediately, just by reading the value.
>
> ... we are optimizing a corner case feature (pagemap) by affecting other
> system parts. Just imagine
>
> 1. Forking: will always have to copy the whole page tables for shmem
> instead of optimizing.
> 2. New shmem mappings: will always have to sync back that bit from the
> pagecache

Both points seem valid.

Not sure whether we could keep the behavior single-process only; that should
satisfy e.g. the guest tracking use case and, I believe, most other cases.
Then for fork() we can ignore this bit, and for new mappings we can ignore it
too. But I do confess that's a new limitation even so.

>
> And these are just the things that immediately come to mind. There is
> certainly more (e.g., page table reclaim [1]).
>
> > >
> > > Do you think this would work?
> >
> > Btw, I think that's what Tiberiu mentioned before, but I think I just changed
> > my mind.. Sorry to have caused such confusion.
> >
> > So what I think now is: we can set it (instead of zeroing the pte) right when
> > unmapping the pte during page reclaim. Code-wise, that can be a special flag
> > (maybe TTU_PAGEOUT?) passed over to the try_to_unmap() in shrink_page_list() to
> > differentiate it from the other try_to_unmap()s.
> >
> > I think that bit can also be dropped correctly, e.g. when punching a hole in
> > the file rmap_walk() can find and drop the marker (I used to suspect the
> > uffd-wp bit could leave left-overs, but on second thought, similarly, it seems
> > it won't; as long as hole punching and vma unmapping are always able to scan
> > those marker ptes, it seems all right to drop them correctly).
> >
> > But that's my wild thoughts; I could have missed something too.
> >
>
> Adding to that, Peter, can you enlighten me on how uffd-wp on shmem combined
> with the uffd-wp bit in page tables is supposed to work in general when
> multiple processes are involved?
>
> Shmem means any process can modify any memory. To be able to properly catch
> writes to such memory, the only way I can see it working is
>
> 1. All processes register uffd-wp on the shmem VMA
> 2. All processes arm uffd-wp by setting the same uffd-wp bits in their page
> tables for the affected shmem
> 3. All processes synchronize, sending each other uffd-wp events when they
> receive one
>
> This is quite ... suboptimal, I have to say, but it is really the only way I
> can imagine uffd-wp working reliably. Is there any obvious way to make this
> work that I am missing?
>
> But then, all page tables are already supposed to contain the uffd-wp bit.
> Which makes me think that we can actually get rid of the uffd-wp bit in the
> page table for pte_none() entries and instead store this information
> somewhere else (in the page cache?) for all entries combined.
>
> So that simplification would result in
>
> 1. All processes register uffd-wp on the shmem VMA
> 2. One process arms uffd-wp via the page cache (we can update all
> PTEs in other processes)
> 3. All processes synchronize, sending each other uffd-wp events when they
> receive one
>
> The semantics of uffd-wp on shmem would be different to what we have so far
> ... which would be just fine as we never had uffd-wp on shared memory.
>
> In an ideal world, 1. and 3. wouldn't be required and all registered uffd
> listeners would be notified when any process writes to it.

This is also a good point and at least deserves a full discussion. I wish that
ideal world existed, but I just don't yet know whether it does..

As you see, even if we want to do some trick in the page cache, it'll still
need at least step 1, so that the process is okay with and aware of such a
trick; otherwise we start to grow more complicated privilege relationships.

It's also easier to do it the way we have it right now, e.g. one vma is bound
to only one uffd, and that uffd is definitely bound to the current mm. If we
allow cross-process operations it'll be hard to tell how that works - say
processes A, B and C share the memory and want to wr-protect it: who (which
uffd, as there can be three) should be servicing the fault from someone? Why
does that process have that privilege? I can't quickly tell.

So the per-mm idea and keeping things within pgtables do sound okay, and they
simplify things so that everything is at least self-contained within one
process, for privilege and everything else.

I also don't know whether dropping the uffd-wp bit would be easy for shmem.
The page can be wr-protected for other reasons (e.g. soft dirty), and even if
we check the vma's VM_UFFD_WP flag we won't be able to tell whether it is
wr-protected by uffd. I don't see how we could avoid false positives otherwise.

>
>
> Sure, for single-user shmem it would work just like !shmem, but then, maybe
> that user really shouldn't be using shmem. But maybe I am missing something
> important :)

I see this from a slightly different perspective; normally shmem and uffd-wp
come from two different requirements: (1) shmem is for data sharing, while (2)
uffd-wp is for tracking page changes. When the two requirements meet, we'll
need something like this series. I don't know how it will end up, but I hope
we can fill the gap so uffd-wp gets close to being a complete feature.

Thanks!

--
Peter Xu

2021-07-22 18:32:22

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v5 00/26] userfaultfd-wp: Support shmem and hugetlbfs

On Thu, Jul 15, 2021 at 04:13:56PM -0400, Peter Xu wrote:
> About Swap Special PTE
> ======================

I've got some more feedback regarding this series, both within review comments
and from other threads.

Hugh shared his concern that using this type of pte-level encoding could make
things even worse:

https://lore.kernel.org/linux-mm/[email protected]/

Since most of the context is irrelevant, I'm only quoting the p.s. section:

p.s. Peter, unrelated to this particular bug, and should not divert from
fixing it: but looking again at those swap encodings, and particularly the
soft_dirty manipulations: they look very fragile. I think uffd_wp was wrong
to follow that bad example, and your upcoming new encoding (that I have
previously called elegant) takes it a worse step further.

Alistair shared his preference for keeping swp_entry to store this extra
information:

https://lore.kernel.org/linux-mm/5071185.SEdLSG93TQ@nvdebian/

So I'm doing some introspection on whether I was simply too bold in trying to
introduce that pte idea: either I'm not the "suitable one" to introduce it as
it's indeed challenging, or maybe it's as simple as we don't really need to
worry about using up the swap address space yet, at least for now (currently
the worst case is MAX_SWAPFILES=32-4-2-1=25).

I don't yet have a plan to look into Hugh's idea of further dropping the
per-arch bits in swap ptes, e.g. _PAGE_SWP_SOFT_DIRTY or _PAGE_SWP_UFFD_WP; I
need more thought there. But what I can still do is think about whether we can
go back to swap entry ptes for this series.

Originally I was afraid of wasting a whole swp entry type just for one single
pte, so we came up with this idea (thanks again to Andrea and Hugh for
proposing it and for the discussions around it!). But did we just worry too
much about something that hasn't actually been a problem?

As time passes, there are indeed more similar requirements showing up that hit
issues like the ones file-backed uffd-wp wants to solve, this time around
pagemap:

- PM_SWAP info missing when shmem page swapped out
- PM_SOFT_DIRTY lost when shmem page swapped out

The 1st issue might be solved in another way and is still being discussed here:

https://lore.kernel.org/linux-mm/YPmX7ZyDFRCuLXrh@t490s/

I don't see a good way to solve the 2nd issue (assuming we'd like to solve it
in the first place; I don't know whether it was intentionally left unfixed for
some reason) without a solution similar to the one we'd like to apply to
maintain the uffd-wp bit, because these are all fundamentally issues around
persisting pte information for file-backed memories.

These requirements at least show that even if we introduce a new swp type
(maybe let's just call it SWP_PTE_MARKER) then uffd-wp won't be the only user,
so there are already potential users for more bits within that entry. A rough
sketch of what it could look like is below.
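
Nothing is implemented yet - the names, the bit assignments, and the exact
arithmetic below are all made up just to show the shape of the idea (and
MAX_SWAPFILES would shrink by one accordingly):

  /*
   * Sketch only: carve one more swap type out of MAX_SWAPFILES and let
   * the swap offset carry per-pte markers, uffd-wp being the first one.
   */
  #define SWP_PTE_MARKER_NUM  1
  #define SWP_PTE_MARKER      (MAX_SWAPFILES + SWP_HWPOISON_NUM + \
                               SWP_MIGRATION_NUM + SWP_DEVICE_NUM)

  #define PTE_MARKER_UFFD_WP  (1UL << 0)
  /* e.g. a PM_SWAP or soft-dirty marker could be added later */

  static inline swp_entry_t make_pte_marker_entry(unsigned long marker)
  {
          return swp_entry(SWP_PTE_MARKER, marker);
  }

  static inline bool is_pte_marker_entry(swp_entry_t entry)
  {
          return swp_type(entry) == SWP_PTE_MARKER;
  }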

In summary, I'm considering whether I should switch from the special swap pte
idea back to the swp entry idea (safer, according to Hugh, and also
arch-independent, according to Alistair). Before working on that, any early
comments would be greatly welcomed.

Thanks.

--
Peter Xu