2021-05-27 20:21:13

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 00/27] userfaultfd-wp: Support shmem and hugetlbfs

This is v3 of uffd-wp shmem & hugetlbfs support, which completes uffd-wp as a
full kernel feature: before this series it only supported anonymous memory.
It's based on the latest v5.13-rc3-mmots-2021-05-25-20-12.

The rebase was probably the hardest part, as I encountered quite a few
breakages here and there within a few mmots tags.  But now, after figuring
everything out (which did take time), it's settling.



The whole series can also be found online [1].



Nothing big really changed otherwise. Full changelog listed below.



v3:

- Rebase to v5.13-rc3-mmots-2021-05-25-20-12
- Fix commit message and comment for patch "shmem/userfaultfd: Handle uffd-wp
  special pte in page fault handler", dropping all references to
  FAULT_FLAG_UFFD_WP
- Reworked patch "shmem/userfaultfd: Take care of UFFDIO_COPY_MODE_WP" after
  Axel's refactoring on uffdio-copy/continue
- Added patch "mm/hugetlb: Introduce huge pte version of uffd-wp helpers", so
  that huge pte helpers are introduced in one patch.  Also add the
  huge_pte_uffd_wp helper, which was missing previously
- Added patch "mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs", to let
  the pagemap uffd-wp bit work for shmem/hugetlbfs
- Added patch "mm/shmem: Unconditionally set pte dirty in
  mfill_atomic_install_pte", to clean up the dirty bit handling together in
  uffdio-copy



v2:

- Add R-bs
- Added patch "mm/hugetlb: Drop __unmap_hugepage_range definition from
  hugetlb.h" as noticed/suggested by Mike Kravetz
- Fix commit message of patch "hugetlb/userfaultfd: Only drop uffd-wp special
  pte if required" [MikeK]
- Removed comments for fields in zap_details since they're either incorrect
  or not helping [Matthew]
- Rephrase commit message in patch "hugetlb/userfaultfd: Take care of
  UFFDIO_COPY_MODE_WP" to explain better why we set the dirty bit for
  UFFDIO_COPY in hugetlbfs [MikeK]
- Don't emulate READ for uffd-wp-special on both shmem & hugetlbfs
- Drop FAULT_FLAG_UFFD_WP flag, by checking vmf->orig_pte directly against
  pte_swp_uffd_wp_special()
- Fix race condition of page fault handling on uffd-wp-special [Mike]



About Swap Special PTE

======================



In short, the so-called "swap special pte" in this patchset is a new type of
pte that didn't exist before, and it is first used in this series for
file-backed memories.  It is used to persist information even after the ptes
have been dropped while the page cache still exists.  For example, when
splitting a file-backed huge pmd, we can simply drop the pmd entry and wait
for another fault to come.  That was okay in the past since all information
in the pte could be recovered from the page cache when the next page fault
triggers.  However, uffd-wp is per-pte information which cannot be kept in
the page cache, so that information needs to be maintained somehow in the
pgtable entry, even if the pgtable entry is going to be dropped.  Here,
instead of replacing it with a none entry, we use the "swap special pte".
Then when the next page fault triggers, we can observe orig_pte to retrieve
this information.



I'm copy-pasting part of the commit message from the patch "mm/swap:
Introduce the idea of special swap ptes", which tries to explain this pte
from another angle:



We used to have special swap entries, like migration entries, hw-poison

entries, device private entries, etc.



Those "special swap entries" reside in the range that they need to be at least

swap entries first, and their types are decided by swp_type(entry).



This patch introduces another idea called "special swap ptes".



It's very easy to confuse this with "special swap entries", but a special
swap pte should never contain a swap entry at all.  That means it's illegal
to call pte_to_swp_entry() on a special swap pte.



Make the uffd-wp special pte to be the first special swap pte.



Before this patch, is_swap_pte()==true means one of the below:

  (a.1) The pte has a normal swap entry (non_swap_entry()==false).  For
        example, when an anonymous page got swapped out.

  (a.2) The pte has a special swap entry (non_swap_entry()==true).  For
        example, a migration entry, a hw-poison entry, etc.

After this patch, is_swap_pte()==true means one of the below, where case (b)
is added:

 (a) The pte contains a swap entry.

  (a.1) The pte has a normal swap entry (non_swap_entry()==false).  For
        example, when an anonymous page got swapped out.

  (a.2) The pte has a special swap entry (non_swap_entry()==true).  For
        example, a migration entry, a hw-poison entry, etc.

 (b) The pte does not contain a swap entry at all (so it cannot be passed
     into pte_to_swp_entry()).  For example, the uffd-wp special swap pte.
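
To make the distinction above concrete, here is an illustrative-only sketch
(not code taken from the series) of how a pte walker could tell these cases
apart; pte_swp_uffd_wp_special() is the helper introduced later in this
series:

    /* Sketch only: distinguishing the is_swap_pte() cases above */
    if (is_swap_pte(pte)) {
            if (pte_swp_uffd_wp_special(pte)) {
                    /* case (b): no swap entry inside, just the uffd-wp marker */
            } else {
                    swp_entry_t entry = pte_to_swp_entry(pte);

                    if (non_swap_entry(entry)) {
                            /* case (a.2): migration, hw-poison, ... */
                    } else {
                            /* case (a.1): a normal swap entry */
                    }
            }
    }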



Hugetlbfs needs a similar thing because it's also file-backed.  I directly
reused the same special pte there, though the shmem and hugetlb changes for
supporting this new pte are different since they don't share much of the code
path.



Patch layout

============



Part (1): Shmem support, this is where the special swap pte is introduced.

Some zap rework is needed within the process:



mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte

shmem/userfaultfd: Take care of UFFDIO_COPY_MODE_WP

mm: Clear vmf->pte after pte_unmap_same() returns

mm/userfaultfd: Introduce special pte for unmapped file-backed mem

mm/swap: Introduce the idea of special swap ptes

shmem/userfaultfd: Handle uffd-wp special pte in page fault handler

mm: Drop first_index/last_index in zap_details

mm: Introduce zap_details.zap_flags

mm: Introduce ZAP_FLAG_SKIP_SWAP

mm: Pass zap_flags into unmap_mapping_pages()

shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed

shmem/userfaultfd: Allow wr-protect none pte for file-backed mem

shmem/userfaultfd: Allows file-back mem to be uffd wr-protected on thps

shmem/userfaultfd: Handle the left-overed special swap ptes

shmem/userfaultfd: Pass over uffd-wp special swap pte when fork()



Part (2): Hugetlb support.  The patches that disable huge pmd sharing for
uffd-wp have already been merged; the rest are the changes required to teach
hugetlbfs to understand the special swap pte that is introduced with the
uffd-wp change:



mm/hugetlb: Drop __unmap_hugepage_range definition from hugetlb.h

mm/hugetlb: Introduce huge pte version of uffd-wp helpers

hugetlb/userfaultfd: Hook page faults for uffd write protection

hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP

hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT

mm/hugetlb: Introduce huge version of special swap pte helpers

hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler

hugetlb/userfaultfd: Allow wr-protect none ptes

hugetlb/userfaultfd: Only drop uffd-wp special pte if required



Part (3): Enable both features in code and test (plus pagemap support)



mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

userfaultfd: Enable write protection for shmem & hugetlbfs

userfaultfd/selftests: Enable uffd-wp for shmem/hugetlbfs



Tests

=====



I've tested it with the userfaultfd kselftest program, and also with
umapsort [2] which should be even stricter.  Tested page swapping in/out
during umapsort.

If anyone would like to try umapsort, you'll need to use an extremely hacked
version of the umap library [3], because by default umap only supports
anonymous memory.  So to test it we need to build [3] first, then [2].



Any comment would be greatly welcomed. Thanks,



[1] https://github.com/xzpeter/linux/tree/uffd-wp-shmem-hugetlbfs

[2] https://github.com/LLNL/umap-apps

[3] https://github.com/xzpeter/umap/tree/peter-shmem-hugetlbfs



Peter Xu (27):

mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte

shmem/userfaultfd: Take care of UFFDIO_COPY_MODE_WP

mm: Clear vmf->pte after pte_unmap_same() returns

mm/userfaultfd: Introduce special pte for unmapped file-backed mem

mm/swap: Introduce the idea of special swap ptes

shmem/userfaultfd: Handle uffd-wp special pte in page fault handler

mm: Drop first_index/last_index in zap_details

mm: Introduce zap_details.zap_flags

mm: Introduce ZAP_FLAG_SKIP_SWAP

mm: Pass zap_flags into unmap_mapping_pages()

shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed

shmem/userfaultfd: Allow wr-protect none pte for file-backed mem

shmem/userfaultfd: Allows file-back mem to be uffd wr-protected on

thps

shmem/userfaultfd: Handle the left-overed special swap ptes

shmem/userfaultfd: Pass over uffd-wp special swap pte when fork()

mm/hugetlb: Drop __unmap_hugepage_range definition from hugetlb.h

mm/hugetlb: Introduce huge pte version of uffd-wp helpers

hugetlb/userfaultfd: Hook page faults for uffd write protection

hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP

hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT

mm/hugetlb: Introduce huge version of special swap pte helpers

hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler

hugetlb/userfaultfd: Allow wr-protect none ptes

hugetlb/userfaultfd: Only drop uffd-wp special pte if required

mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

mm/userfaultfd: Enable write protection for shmem & hugetlbfs

userfaultfd/selftests: Enable uffd-wp for shmem/hugetlbfs



arch/arm64/kernel/mte.c | 2 +-

arch/x86/include/asm/pgtable.h | 28 +++

fs/dax.c | 10 +-

fs/hugetlbfs/inode.c | 15 +-

fs/proc/task_mmu.c | 21 +-

fs/userfaultfd.c | 38 ++--

include/asm-generic/hugetlb.h | 15 ++

include/asm-generic/pgtable_uffd.h | 3 +

include/linux/hugetlb.h | 30 ++-

include/linux/mm.h | 48 ++++-

include/linux/mm_inline.h | 43 +++++

include/linux/shmem_fs.h | 4 +-

include/linux/swapops.h | 39 +++-

include/linux/userfaultfd_k.h | 45 +++++

include/uapi/linux/userfaultfd.h | 10 +-

mm/gup.c | 2 +-

mm/hmm.c | 2 +-

mm/hugetlb.c | 160 +++++++++++++---

mm/khugepaged.c | 14 +-

mm/madvise.c | 4 +-

mm/memcontrol.c | 2 +-

mm/memory.c | 234 +++++++++++++++++------

mm/migrate.c | 4 +-

mm/mincore.c | 2 +-

mm/mprotect.c | 63 +++++-

mm/mremap.c | 2 +-

mm/page_vma_mapped.c | 6 +-

mm/rmap.c | 8 +

mm/shmem.c | 5 +-

mm/swapfile.c | 2 +-

mm/truncate.c | 17 +-

mm/userfaultfd.c | 73 ++++---

tools/testing/selftests/vm/userfaultfd.c | 9 +-

33 files changed, 765 insertions(+), 195 deletions(-)



--

2.31.1





2021-05-27 20:21:32

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 01/27] mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte

It was done conditionally before, as there's one special case for shmem where
we use SetPageDirty() instead.  However that's not necessary, and it should be
easier and cleaner to do it unconditionally in mfill_atomic_install_pte().

The most recent discussion about this is here, where Hugh explained the history
of SetPageDirty() and why it's possible that it's not required at all:

https://lore.kernel.org/lkml/[email protected]/

Currently mfill_atomic_install_pte() has three callers:

1. shmem_mfill_atomic_pte
2. mcopy_atomic_pte
3. mcontinue_atomic_pte

After the change: case (1) should have its SetPageDirty replaced by the dirty
bit on the pte (so we finally unify them), case (2) should have no functional
change at all as it has page_in_cache==false, and case (3) may add a dirty bit
to the pte.  However since case (3) is UFFDIO_CONTINUE for shmem, the page is
almost certainly dirty already, so this should not make a real difference
either.

This should make it much easier to follow which cases set dirty for uffd, as
we'll now simply set it for all uffd-related ioctls.  Meanwhile, there's no
more special handling of SetPageDirty() when it's not needed.

Cc: Hugh Dickins <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/shmem.c | 1 -
mm/userfaultfd.c | 3 +--
2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index ccce139d4e5c..4085a5cf4a13 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2449,7 +2449,6 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
shmem_recalc_inode(inode);
spin_unlock_irq(&info->lock);

- SetPageDirty(page);
unlock_page(page);
return 0;
out_delete_from_cache:
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9ce5a3793ad4..462fa6e25e03 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -69,10 +69,9 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
pgoff_t offset, max_off;

_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
+ _dst_pte = pte_mkdirty(_dst_pte);
if (page_in_cache && !vm_shared)
writable = false;
- if (writable || !page_in_cache)
- _dst_pte = pte_mkdirty(_dst_pte);
if (writable) {
if (wp_copy)
_dst_pte = pte_mkuffd_wp(_dst_pte);
--
2.31.1

2021-05-27 20:21:42

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 02/27] shmem/userfaultfd: Take care of UFFDIO_COPY_MODE_WP

Firstly, pass wp_copy into shmem_mfill_atomic_pte() through the stack.
Then apply the UFFD_WP bit properly when the UFFDIO_COPY on shmem is with
UFFDIO_COPY_MODE_WP, so that wp_copy lands in mfill_atomic_install_pte(),
which was newly introduced very recently.

We need to make sure shmem_mfill_atomic_pte() will always set the dirty bit in
the pte even if UFFDIO_COPY_MODE_WP is set.  After the rework of the minor
fault series on shmem we need to slightly touch up the logic there, since
uffd-wp needs to be applied even if writable==false previously (e.g., for a
shmem private mapping).

Note: we must do pte_wrprotect() if !writable in mfill_atomic_install_pte(), as
mk_pte() could return a writable pte (e.g., when VM_SHARED on a shmem file).

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/shmem_fs.h | 4 ++--
mm/shmem.c | 4 ++--
mm/userfaultfd.c | 23 +++++++++++++++--------
3 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index a69ea4d97fdd..0b6a8e871036 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -128,11 +128,11 @@ extern int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
unsigned long src_addr,
- bool zeropage,
+ bool zeropage, bool wp_copy,
struct page **pagep);
#else /* !CONFIG_SHMEM */
#define shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, \
- src_addr, zeropage, pagep) ({ BUG(); 0; })
+ src_addr, zeropage, wp_copy, pagep) ({ BUG(); 0; })
#endif /* CONFIG_SHMEM */
#endif /* CONFIG_USERFAULTFD */

diff --git a/mm/shmem.c b/mm/shmem.c
index 4085a5cf4a13..5b648cbae37a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2368,7 +2368,7 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
unsigned long src_addr,
- bool zeropage,
+ bool zeropage, bool wp_copy,
struct page **pagep)
{
struct inode *inode = file_inode(dst_vma->vm_file);
@@ -2439,7 +2439,7 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
goto out_release;

ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
- page, true, false);
+ page, true, wp_copy);
if (ret)
goto out_delete_from_cache;

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 462fa6e25e03..3636f5be6390 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -70,14 +70,22 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,

_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
_dst_pte = pte_mkdirty(_dst_pte);
- if (page_in_cache && !vm_shared)
+ /* Don't write if uffd-wp wr-protected */
+ if (wp_copy) {
+ _dst_pte = pte_mkuffd_wp(_dst_pte);
writable = false;
- if (writable) {
- if (wp_copy)
- _dst_pte = pte_mkuffd_wp(_dst_pte);
- else
- _dst_pte = pte_mkwrite(_dst_pte);
}
+ /* Don't write if page cache privately mapped */
+ if (page_in_cache && !vm_shared)
+ writable = false;
+ if (writable)
+ _dst_pte = pte_mkwrite(_dst_pte);
+ else
+ /*
+ * We need this to make sure write bit removed; as mk_pte()
+ * could return a pte with write bit set.
+ */
+ _dst_pte = pte_wrprotect(_dst_pte);

dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);

@@ -515,11 +523,10 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
err = mfill_zeropage_pte(dst_mm, dst_pmd,
dst_vma, dst_addr);
} else {
- VM_WARN_ON_ONCE(wp_copy);
err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
dst_addr, src_addr,
mode != MCOPY_ATOMIC_NORMAL,
- page);
+ wp_copy, page);
}

return err;
--
2.31.1

2021-05-27 20:21:55

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 03/27] mm: Clear vmf->pte after pte_unmap_same() returns

pte_unmap_same() will always unmap the pte pointer. After the unmap, vmf->pte
will not be valid any more. We should clear it.

It was safe only because no one is accessing vmf->pte after pte_unmap_same()
returns, since the only caller of pte_unmap_same() (so far) is do_swap_page(),
where vmf->pte will in most cases be overwritten very soon.

pte_unmap_same() will be used in other places in follow-up patches, where
vmf->pte will not always be re-written right away.  This patch enables us to
call functions like finish_fault(), which will conditionally unmap the pte by
checking vmf->pte first.  Or, alloc_set_pte() will make sure to allocate a new
pte even after calling pte_unmap_same().

Since we'll need to modify vmf->pte, directly pass vmf into pte_unmap_same();
that also lets us avoid the long parameter list.

Reviewed-by: Miaohe Lin <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/memory.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2b7ffcbca175..0ccaae2647c0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2710,19 +2710,20 @@ EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
* proceeding (but do_wp_page is only called after already making such a check;
* and do_anonymous_page can safely check later on).
*/
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
- pte_t *page_table, pte_t orig_pte)
+static inline int pte_unmap_same(struct vm_fault *vmf)
{
int same = 1;
#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
if (sizeof(pte_t) > sizeof(unsigned long)) {
- spinlock_t *ptl = pte_lockptr(mm, pmd);
+ spinlock_t *ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
spin_lock(ptl);
- same = pte_same(*page_table, orig_pte);
+ same = pte_same(*vmf->pte, vmf->orig_pte);
spin_unlock(ptl);
}
#endif
- pte_unmap(page_table);
+ pte_unmap(vmf->pte);
+ /* After unmap of pte, the pointer is invalid now - clear it. */
+ vmf->pte = NULL;
return same;
}

@@ -3441,7 +3442,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vm_fault_t ret = 0;
void *shadow = NULL;

- if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
+ if (!pte_unmap_same(vmf))
goto out;

entry = pte_to_swp_entry(vmf->orig_pte);
--
2.31.1

2021-05-27 20:22:50

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 04/27] mm/userfaultfd: Introduce special pte for unmapped file-backed mem

This patch introduces a very special swap-like pte for file-backed memories.

Currently it's only defined for x86_64, but any arch that can properly define
the UFFD_WP_SWP_PTE_SPECIAL value as requested should conceptually work too.

We will use this special pte to arm the ptes that got either unmapped or
swapped out for a file-backed region that was previously wr-protected.  This
special pte can trigger a page fault just like swap entries do, since the
page fault will see pte_none()==false && pte_present()==false.

Then we can revive the special pte into a normal pte backed by the page cache.

This idea is greatly inspired by Hugh and Andrea in the discussion, which is
referenced in the links below.

The other idea (from Hugh) is that we use swp_type==1 and swp_offset=0 as the
special pte. The current solution (as pointed out by Andrea) is slightly
preferred in that we don't even need swp_entry_t knowledge at all in trapping
these accesses. Meanwhile, we also reuse _PAGE_SWP_UFFD_WP from the anonymous
swp entries.

This patch only introduces the special pte and its operators. It's not yet
applied to have any functional difference.
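
As an illustration only (simplified from how later patches in this series use
the two helpers; not part of this patch), the intended usage is roughly:

    /*
     * Sketch: when zapping a pte that was uffd-wp wr-protected on a
     * file-backed vma, leave the marker behind instead of a none pte...
     */
    if (pte_uffd_wp(old_pte) && !vma_is_anonymous(vma))
            set_pte_at(mm, addr, ptep, pte_swp_mkuffd_wp_special(vma));

    /* ... and later, in the fault path, recognize the marker again: */
    if (pte_swp_uffd_wp_special(*ptep))
            ; /* this pte was wr-protected before it got unmapped */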

Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/lkml/[email protected]/
Suggested-by: Andrea Arcangeli <[email protected]>
Suggested-by: Hugh Dickins <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
arch/x86/include/asm/pgtable.h | 28 ++++++++++++++++++++++++++++
include/asm-generic/pgtable_uffd.h | 3 +++
include/linux/userfaultfd_k.h | 21 +++++++++++++++++++++
3 files changed, 52 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b1099f2d9800..9781ba2da049 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1329,6 +1329,34 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
#endif

#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+
+/*
+ * This is a very special swap-like pte that marks this pte as "wr-protected"
+ * by userfaultfd-wp. It should only exist for file-backed memory where the
+ * page (previously got wr-protected) has been unmapped or swapped out.
+ *
+ * For anonymous memories, the userfaultfd-wp _PAGE_SWP_UFFD_WP bit is kept
+ * along with a real swp entry instead.
+ *
+ * Let's make some rules for this special pte:
+ *
+ * (1) pte_none()==false, so that it'll not trigger a missing page fault.
+ *
+ * (2) pte_present()==false, so that it's recognized as swap (is_swap_pte).
+ *
+ * (3) pte_swp_uffd_wp()==true, so it can be tested just like a swap pte that
+ * contains a valid swap entry, so that we can check a swap pte always
+ * using "is_swap_pte() && pte_swp_uffd_wp()" without caring about whether
+ * there's one swap entry inside of the pte.
+ *
+ * (4) It should not be a valid swap pte anywhere, so that when we see this pte
+ * we know it does not contain a swap entry.
+ *
+ * For x86, the simplest special pte which satisfies all of above should be the
+ * pte with only _PAGE_SWP_UFFD_WP bit set (where swp_type==swp_offset==0).
+ */
+#define UFFD_WP_SWP_PTE_SPECIAL __pte(_PAGE_SWP_UFFD_WP)
+
static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
{
return pte_set_flags(pte, _PAGE_SWP_UFFD_WP);
diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
index 828966d4c281..95e9811ce9d1 100644
--- a/include/asm-generic/pgtable_uffd.h
+++ b/include/asm-generic/pgtable_uffd.h
@@ -2,6 +2,9 @@
#define _ASM_GENERIC_PGTABLE_UFFD_H

#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+
+#define UFFD_WP_SWP_PTE_SPECIAL __pte(0)
+
static __always_inline int pte_uffd_wp(pte_t pte)
{
return 0;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 331d2ccf0bcc..93f932b53a71 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -145,6 +145,17 @@ extern int userfaultfd_unmap_prep(struct vm_area_struct *vma,
extern void userfaultfd_unmap_complete(struct mm_struct *mm,
struct list_head *uf);

+static inline pte_t pte_swp_mkuffd_wp_special(struct vm_area_struct *vma)
+{
+ WARN_ON_ONCE(vma_is_anonymous(vma));
+ return UFFD_WP_SWP_PTE_SPECIAL;
+}
+
+static inline bool pte_swp_uffd_wp_special(pte_t pte)
+{
+ return pte_same(pte, UFFD_WP_SWP_PTE_SPECIAL);
+}
+
#else /* CONFIG_USERFAULTFD */

/* mm helpers */
@@ -234,6 +245,16 @@ static inline void userfaultfd_unmap_complete(struct mm_struct *mm,
{
}

+static inline pte_t pte_swp_mkuffd_wp_special(struct vm_area_struct *vma)
+{
+ return __pte(0);
+}
+
+static inline bool pte_swp_uffd_wp_special(pte_t pte)
+{
+ return false;
+}
+
#endif /* CONFIG_USERFAULTFD */

#endif /* _LINUX_USERFAULTFD_K_H */
--
2.31.1

2021-05-27 20:22:51

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 06/27] shmem/userfaultfd: Handle uffd-wp special pte in page fault handler

File-backed memories are prone to unmap/swap so the ptes are always unstable.
This could lead to userfaultfd-wp information getting lost when such memory is
unmapped or swapped out, for example, on shmem.  To keep such information
persistent, we will start to use the newly introduced swap-like special ptes
to replace a none pte when those ptes are removed.

Prepare this by handling such a special pte first before it is applied in the
general page fault handler.

The handling of this special pte page fault is similar to a missing fault, but
it should happen after the pte-missing logic since the special pte is designed
to be a swap-like pte.  Meanwhile it should be handled before do_swap_page()
so that the swap core logic won't be confused by seeing such an illegal swap
pte.

This is a slow path of uffd-wp handling, because unmap of wr-protected shmem
ptes should be rare. So far it should only trigger in two conditions:

(1) When trying to punch holes in shmem_fallocate(), there will be a
pre-unmap optimization before evicting the page. That will create
unmapped shmem ptes with wr-protected pages covered.

(2) Swapping out of shmem pages

Because of this, the page fault handling is simplified too: we don't send the
wr-protect message on the 1st page fault; instead the page will be installed
read-only, so the message will only be generated upon the next write, which
will trigger the do_wp_page() path of the general uffd-wp handling.

Disable fault-around for all uffd-wp registered ranges for extra safety, and
clean the code up a bit after we introduced MINOR fault.

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/userfaultfd_k.h | 12 +++++
mm/memory.c | 88 +++++++++++++++++++++++++++++++----
2 files changed, 90 insertions(+), 10 deletions(-)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 93f932b53a71..ca3f794d07e9 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -94,6 +94,18 @@ static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
}

+/*
+ * Don't do fault around for either WP or MINOR registered uffd range. For
+ * MINOR registered range, fault around will be a total disaster and ptes can
+ * be installed without notifications; for WP it should mostly be fine as long
+ * as the fault around checks for pte_none() before the installation, however
+ * to be super safe we just forbid it.
+ */
+static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
+{
+ return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
+}
+
static inline bool userfaultfd_missing(struct vm_area_struct *vma)
{
return vma->vm_flags & VM_UFFD_MISSING;
diff --git a/mm/memory.c b/mm/memory.c
index 2b24af4616df..45a2f71e447a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3917,6 +3917,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
{
struct vm_area_struct *vma = vmf->vma;
+ bool uffd_wp = pte_swp_uffd_wp_special(vmf->orig_pte);
bool write = vmf->flags & FAULT_FLAG_WRITE;
bool prefault = vmf->address != addr;
pte_t entry;
@@ -3929,6 +3930,8 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)

if (write)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (unlikely(uffd_wp))
+ entry = pte_mkuffd_wp(pte_wrprotect(entry));
/* copy-on-write page */
if (write && !(vma->vm_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
@@ -3996,8 +3999,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
ret = 0;
- /* Re-check under ptl */
- if (likely(pte_none(*vmf->pte)))
+
+ /*
+ * Re-check under ptl. Note: this will cover both none pte and
+ * uffd-wp-special swap pte
+ */
+ if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
do_set_pte(vmf, page, vmf->address);
else
ret = VM_FAULT_NOPAGE;
@@ -4101,9 +4108,21 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
return vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
}

+/* Return true if we should do read fault-around, false otherwise */
+static inline bool should_fault_around(struct vm_fault *vmf)
+{
+ /* No ->map_pages? No way to fault around... */
+ if (!vmf->vma->vm_ops->map_pages)
+ return false;
+
+ if (uffd_disable_fault_around(vmf->vma))
+ return false;
+
+ return fault_around_bytes >> PAGE_SHIFT > 1;
+}
+
static vm_fault_t do_read_fault(struct vm_fault *vmf)
{
- struct vm_area_struct *vma = vmf->vma;
vm_fault_t ret = 0;

/*
@@ -4111,12 +4130,10 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
* if page by the offset is not ready to be mapped (cold cache or
* something).
*/
- if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
- if (likely(!userfaultfd_minor(vmf->vma))) {
- ret = do_fault_around(vmf);
- if (ret)
- return ret;
- }
+ if (should_fault_around(vmf)) {
+ ret = do_fault_around(vmf);
+ if (ret)
+ return ret;
}

ret = __do_fault(vmf);
@@ -4435,6 +4452,57 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
return VM_FAULT_FALLBACK;
}

+static vm_fault_t uffd_wp_clear_special(struct vm_fault *vmf)
+{
+ vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ /*
+ * Be careful so that we will only recover a special uffd-wp pte into a
+ * none pte. Otherwise it means the pte could have changed, so retry.
+ */
+ if (pte_swp_uffd_wp_special(*vmf->pte))
+ pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
+}
+
+/*
+ * This is actually a page-missing access, but with uffd-wp special pte
+ * installed. It means this pte was wr-protected before being unmapped.
+ */
+static vm_fault_t uffd_wp_handle_special(struct vm_fault *vmf)
+{
+ /* Careful! vmf->pte unmapped after return */
+ if (!pte_unmap_same(vmf))
+ return 0;
+
+ /*
+ * Just in case there're leftover special ptes even after the region
+ * got unregistered - we can simply clear them.
+ */
+ if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
+ return uffd_wp_clear_special(vmf);
+
+ /*
+ * Here we share most code with do_fault(), in which we can identify
+ * whether this is "none pte fault" or "uffd-wp-special fault" by
+ * checking the vmf->orig_pte.
+ */
+ return do_fault(vmf);
+}
+
+static vm_fault_t do_swap_pte(struct vm_fault *vmf)
+{
+ /*
+ * We need to handle special swap ptes before handling ptes that
+ * contain swap entries, always.
+ */
+ if (unlikely(pte_swp_uffd_wp_special(vmf->orig_pte)))
+ return uffd_wp_handle_special(vmf);
+
+ return do_swap_page(vmf);
+}
+
/*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most
@@ -4509,7 +4577,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
}

if (!pte_present(vmf->orig_pte))
- return do_swap_page(vmf);
+ return do_swap_pte(vmf);

if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
--
2.31.1

2021-05-27 20:24:39

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 11/27] shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed

File-backed memory is prone to being unmapped at any time.  That means all
information in the pte will be dropped, including the uffd-wp flag.

Since the uffd-wp info cannot be stored in the page cache or the swap cache,
persist this wr-protect information by installing the special uffd-wp marker
pte when we're going to unmap a uffd wr-protected pte.  When the pte is
accessed again, we will know it was previously wr-protected by recognizing the
special pte.

Meanwhile add a new flag, ZAP_FLAG_DROP_FILE_UFFD_WP, for when we don't want
to persist such information, for example when destroying the whole vma or
punching a hole in a shmem file.  For the latter, we can only drop the uffd-wp
bit while holding the page lock.  That means the unmap_mapping_range() in
shmem_fallocate() still requires zapping without ZAP_FLAG_DROP_FILE_UFFD_WP
because that path is still racy with the page faults.

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 11 ++++++++++
include/linux/mm_inline.h | 43 +++++++++++++++++++++++++++++++++++++++
mm/memory.c | 42 +++++++++++++++++++++++++++++++++++++-
mm/rmap.c | 8 ++++++++
mm/truncate.c | 8 +++++++-
5 files changed, 110 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b1fb2826e29c..5989fc7ed00d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1725,6 +1725,8 @@ extern void user_shm_unlock(size_t, struct user_struct *);
#define ZAP_FLAG_CHECK_MAPPING BIT(0)
/* Whether to skip zapping swap entries */
#define ZAP_FLAG_SKIP_SWAP BIT(1)
+/* Whether to completely drop uffd-wp entries for file-backed memory */
+#define ZAP_FLAG_DROP_FILE_UFFD_WP BIT(2)

/*
* Parameter block passed down to zap_pte_range in exceptional cases.
@@ -1757,6 +1759,15 @@ zap_skip_swap(struct zap_details *details)
return details->zap_flags & ZAP_FLAG_SKIP_SWAP;
}

+static inline bool
+zap_drop_file_uffd_wp(struct zap_details *details)
+{
+ if (!details)
+ return false;
+
+ return details->zap_flags & ZAP_FLAG_DROP_FILE_UFFD_WP;
+}
+
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 355ea1ee32bd..c29a6ef3a642 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -4,6 +4,8 @@

#include <linux/huge_mm.h>
#include <linux/swap.h>
+#include <linux/userfaultfd_k.h>
+#include <linux/swapops.h>

/**
* page_is_file_lru - should the page be on a file LRU or anon LRU?
@@ -104,4 +106,45 @@ static __always_inline void del_page_from_lru_list(struct page *page,
update_lru_size(lruvec, page_lru(page), page_zonenum(page),
-thp_nr_pages(page));
}
+
+/*
+ * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
+ * replace a none pte. NOTE! This should only be called when *pte is already
+ * cleared so we will never accidentally replace something valuable. Meanwhile
+ * none pte also means we are not demoting the pte so if tlb flushed then we
+ * don't need to do it again; otherwise if tlb flush is postponed then it's
+ * even better.
+ *
+ * Must be called with pgtable lock held.
+ */
+static inline void
+pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *pte, pte_t pteval)
+{
+#ifdef CONFIG_USERFAULTFD
+ bool arm_uffd_pte = false;
+
+ /* The current status of the pte should be "cleared" before calling */
+ WARN_ON_ONCE(!pte_none(*pte));
+
+ if (vma_is_anonymous(vma))
+ return;
+
+ /* A uffd-wp wr-protected normal pte */
+ if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
+ arm_uffd_pte = true;
+
+ /*
+ * A uffd-wp wr-protected swap pte. Note: this should even work for
+ * pte_swp_uffd_wp_special() too.
+ */
+ if (unlikely(is_swap_pte(pteval) && pte_swp_uffd_wp(pteval)))
+ arm_uffd_pte = true;
+
+ if (unlikely(arm_uffd_pte))
+ set_pte_at(vma->vm_mm, addr, pte,
+ pte_swp_mkuffd_wp_special(vma));
+#endif
+}
+
#endif
diff --git a/mm/memory.c b/mm/memory.c
index 319552efc782..3453b8ae5f4f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -73,6 +73,7 @@
#include <linux/perf_event.h>
#include <linux/ptrace.h>
#include <linux/vmalloc.h>
+#include <linux/mm_inline.h>

#include <trace/events/kmem.h>

@@ -1298,6 +1299,21 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
return ret;
}

+/*
+ * This function makes sure that we'll replace the none pte with an uffd-wp
+ * swap special pte marker when necessary. Must be with the pgtable lock held.
+ */
+static inline void
+zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *pte,
+ struct zap_details *details, pte_t pteval)
+{
+ if (zap_drop_file_uffd_wp(details))
+ return;
+
+ pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
+}
+
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
@@ -1335,6 +1351,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
tlb_remove_tlb_entry(tlb, pte, addr);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, details,
+ ptent);
if (unlikely(!page))
continue;

@@ -1359,6 +1377,22 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
continue;
}

+ /*
+ * If this is a special uffd-wp marker pte... Drop it only if
+ * enforced to do so.
+ */
+ if (unlikely(is_swap_special_pte(ptent))) {
+ WARN_ON_ONCE(!pte_swp_uffd_wp_special(ptent));
+ /*
+ * If this is a common unmap of ptes, keep this as is.
+ * Drop it only if this is a whole-vma destruction.
+ */
+ if (zap_drop_file_uffd_wp(details))
+ ptep_get_and_clear_full(mm, addr, pte,
+ tlb->fullmm);
+ continue;
+ }
+
entry = pte_to_swp_entry(ptent);
if (is_device_private_entry(entry) ||
is_device_exclusive_entry(entry)) {
@@ -1373,6 +1407,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
page_remove_rmap(page, false);

put_page(page);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, details,
+ ptent);
continue;
}

@@ -1390,6 +1426,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (unlikely(!free_swap_and_cache(entry)))
print_bad_pte(vma, addr, ptent, NULL);
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
} while (pte++, addr += PAGE_SIZE, addr != end);

add_mm_rss_vec(mm, rss);
@@ -1589,12 +1626,15 @@ void unmap_vmas(struct mmu_gather *tlb,
unsigned long end_addr)
{
struct mmu_notifier_range range;
+ struct zap_details details = {
+ .zap_flags = ZAP_FLAG_DROP_FILE_UFFD_WP,
+ };

mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
start_addr, end_addr);
mmu_notifier_invalidate_range_start(&range);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
- unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
+ unmap_single_vma(tlb, vma, start_addr, end_addr, &details);
mmu_notifier_invalidate_range_end(&range);
}

diff --git a/mm/rmap.c b/mm/rmap.c
index 0419c9a1a280..a94d9aed9d95 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -72,6 +72,7 @@
#include <linux/page_idle.h>
#include <linux/memremap.h>
#include <linux/userfaultfd_k.h>
+#include <linux/mm_inline.h>

#include <asm/tlbflush.h>

@@ -1509,6 +1510,13 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
pteval = ptep_clear_flush(vma, address, pvmw.pte);
}

+ /*
+ * Now the pte is cleared. If this is uffd-wp armed pte, we
+ * may want to replace a none pte with a marker pte if it's
+ * file-backed, so we don't lose the tracking information.
+ */
+ pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
+
/* Move the dirty bit to the page. Now the pte is gone. */
if (pte_dirty(pteval))
set_page_dirty(page);
diff --git a/mm/truncate.c b/mm/truncate.c
index 85cd84486589..62f9c488b986 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -173,7 +173,13 @@ truncate_cleanup_page(struct address_space *mapping, struct page *page)
if (page_mapped(page)) {
unsigned int nr = thp_nr_pages(page);
unmap_mapping_pages(mapping, page->index, nr,
- ZAP_FLAG_CHECK_MAPPING);
+ ZAP_FLAG_CHECK_MAPPING |
+ /*
+ * Now it's safe to drop uffd-wp because
+ * we're with page lock, and the page is
+ * being truncated.
+ */
+ ZAP_FLAG_DROP_FILE_UFFD_WP);
}

if (page_has_private(page))
--
2.31.1

2021-05-27 20:24:41

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 14/27] shmem/userfaultfd: Handle the left-overed special swap ptes

Note that the special uffd-wp swap pte can be left over even if the page under
the pte got evicted.  Normally when we evict a page, we will unmap the ptes by
walking through the reverse mapping.  However we never tracked such
information for the special swap ptes because they're not real mappings but
just markers.  So we need to take care of the case where we see a marker that
is actually meaningless (the page behind it got evicted).

We have already taken care of that in e.g. alloc_set_pte(), where we'll treat
the special swap pte as pte_none() when necessary.  However we also need to
teach userfaultfd itself about it, in both UFFDIO_COPY and page fault
handling, so that everything will still work as expected.

Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 15 +++++++++++++++
mm/userfaultfd.c | 13 ++++++++++++-
2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 19ebae443ade..15031d6f1f17 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -329,6 +329,21 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
*/
if (pte_none(*pte))
ret = true;
+ /*
+ * We also treat the swap special uffd-wp pte as the pte_none() here.
+ * This should in most cases be a missing event, as we never handle
+ * wr-protect upon a special uffd-wp swap pte - it should first be
+ * converted into a normal read request before handling wp. It just
+ * means the page/swap cache that backing this pte is gone, so this
+ * special pte is leftover.
+ *
+ * We can't simply replace it with a none pte because we're not with
+ * the pgtable lock here. Instead of taking it and clearing the pte,
+ * the easy way is to let UFFDIO_COPY understand this pte too when
+ * trying to install a new page onto it.
+ */
+ if (pte_swp_uffd_wp_special(*pte))
+ ret = true;
if (!pte_write(*pte) && (reason & VM_UFFD_WP))
ret = true;
pte_unmap(pte);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 3636f5be6390..147e86095070 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -100,7 +100,18 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
}

ret = -EEXIST;
- if (!pte_none(*dst_pte))
+ /*
+ * Besides the none pte, we also allow UFFDIO_COPY to install a pte
+ * onto the uffd-wp swap special pte, because that pte should be the
+ * same as a pte_none() just in that it contains wr-protect information
+ * (which could only be dropped when unmap the memory).
+ *
+ * It's safe to drop that marker because we know this is part of a
+ * MISSING fault, and the caller is very clear about this page missing
+ * rather than wr-protected. Then we're sure the wr-protect bit is
+ * just a leftover so it's useless already and is the same as none pte.
+ */
+ if (!pte_none(*dst_pte) && !pte_swp_uffd_wp_special(*dst_pte))
goto out_unlock;

if (page_in_cache)
--
2.31.1

2021-05-27 20:24:47

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 13/27] shmem/userfaultfd: Allows file-back mem to be uffd wr-protected on thps

We don't have "huge" version of PTE_SWP_UFFD_WP_SPECIAL, instead when necessary
we split the thp if the huge page is uffd wr-protected previously.

However split the thp is not enough, because file-backed thp is handled totally
differently comparing to anonymous thps - rather than doing a real split, the
thp pmd will simply got dropped in __split_huge_pmd_locked().

That is definitely not enough if e.g. when there is a thp covers range [0, 2M)
but we want to wr-protect small page resides in [4K, 8K) range, because after
__split_huge_pmd() returns, there will be a none pmd.

Here we leverage the previously introduced change_protection_prepare() macro so
that we'll populate the pmd with a pgtable page. Then change_pte_range() will
do all the rest for us, e.g., install the uffd-wp swap special pte marker at
any pte that we'd like to wr-protect, under the protection of pgtable lock.

Signed-off-by: Peter Xu <[email protected]>
---
mm/mprotect.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8ec85b276975..3fcb87b59696 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -306,8 +306,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
}

if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
- if (next - addr != HPAGE_PMD_SIZE) {
+ if (next - addr != HPAGE_PMD_SIZE ||
+ /* Uffd wr-protecting a file-backed memory range */
+ unlikely(!vma_is_anonymous(vma) &&
+ (cp_flags & MM_CP_UFFD_WP))) {
__split_huge_pmd(vma, pmd, addr, false, NULL);
+ /*
+ * For file-backed, the pmd could have been
+ * gone; still provide a pte pgtable if needed.
+ */
+ change_protection_prepare(vma, pmd, addr, cp_flags);
} else {
int nr_ptes = change_huge_pmd(vma, pmd, addr,
newprot, cp_flags);
--
2.31.1

2021-05-27 20:25:26

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 20/27] hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT

This starts from passing cp_flags into hugetlb_change_protection() so hugetlb
will be able to handle MM_CP_UFFD_WP[_RESOLVE] requests.

huge_pte_clear_uffd_wp() is introduced to handle the case where the
UFFDIO_WRITEPROTECT is requested upon migrating huge page entries.

Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/hugetlb.h | 6 ++++--
mm/hugetlb.c | 13 ++++++++++++-
mm/mprotect.c | 3 ++-
mm/userfaultfd.c | 8 ++++++++
4 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d238a69bcbb3..3e4c5c64d867 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -204,7 +204,8 @@ struct page *follow_huge_pgd(struct mm_struct *mm, unsigned long address,
int pmd_huge(pmd_t pmd);
int pud_huge(pud_t pud);
unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
- unsigned long address, unsigned long end, pgprot_t newprot);
+ unsigned long address, unsigned long end, pgprot_t newprot,
+ unsigned long cp_flags);

bool is_hugetlb_entry_migration(pte_t pte);
void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
@@ -368,7 +369,8 @@ static inline void move_hugetlb_state(struct page *oldpage,

static inline unsigned long hugetlb_change_protection(
struct vm_area_struct *vma, unsigned long address,
- unsigned long end, pgprot_t newprot)
+ unsigned long end, pgprot_t newprot,
+ unsigned long cp_flags)
{
return 0;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9bdcc208f5d9..b101c3af3ab5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5444,7 +5444,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
}

unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
- unsigned long address, unsigned long end, pgprot_t newprot)
+ unsigned long address, unsigned long end,
+ pgprot_t newprot, unsigned long cp_flags)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long start = address;
@@ -5454,6 +5455,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long pages = 0;
bool shared_pmd = false;
struct mmu_notifier_range range;
+ bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+ bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;

/*
* In the case of shared PMDs, the area to flush could be beyond
@@ -5495,6 +5498,10 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
entry = make_readable_migration_entry(
swp_offset(entry));
newpte = swp_entry_to_pte(entry);
+ if (uffd_wp)
+ newpte = pte_swp_mkuffd_wp(newpte);
+ else if (uffd_wp_resolve)
+ newpte = pte_swp_clear_uffd_wp(newpte);
set_huge_swap_pte_at(mm, address, ptep,
newpte, huge_page_size(h));
pages++;
@@ -5509,6 +5516,10 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
old_pte = huge_ptep_modify_prot_start(vma, address, ptep);
pte = pte_mkhuge(huge_pte_modify(old_pte, newprot));
pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
+ if (uffd_wp)
+ pte = huge_pte_mkuffd_wp(huge_pte_wrprotect(pte));
+ else if (uffd_wp_resolve)
+ pte = huge_pte_clear_uffd_wp(pte);
huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
pages++;
}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 3fcb87b59696..96f4df023439 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -426,7 +426,8 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
BUG_ON((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL);

if (is_vm_hugetlb_page(vma))
- pages = hugetlb_change_protection(vma, start, end, newprot);
+ pages = hugetlb_change_protection(vma, start, end, newprot,
+ cp_flags);
else
pages = change_protection_range(vma, start, end, newprot,
cp_flags);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 424d0adc3f80..82c235f555b8 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -743,6 +743,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
unsigned long len, bool enable_wp, bool *mmap_changing)
{
struct vm_area_struct *dst_vma;
+ unsigned long page_mask;
pgprot_t newprot;
int err;

@@ -779,6 +780,13 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
if (!vma_is_anonymous(dst_vma))
goto out_unlock;

+ if (is_vm_hugetlb_page(dst_vma)) {
+ err = -EINVAL;
+ page_mask = vma_kernel_pagesize(dst_vma) - 1;
+ if ((start & page_mask) || (len & page_mask))
+ goto out_unlock;
+ }
+
if (enable_wp)
newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
else
--
2.31.1

2021-05-27 20:25:36

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 18/27] hugetlb/userfaultfd: Hook page faults for uffd write protection

Hook up hugetlbfs_fault() with the capability to handle userfaultfd-wp faults.

We do this slightly earlier than hugetlb_cow() so that we can avoid taking some
extra locks that we definitely don't need.

Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 69a4b551c157..4cbbffd50080 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4980,6 +4980,25 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
goto out_ptl;

+ /* Handle userfault-wp first, before trying to lock more pages */
+ if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
+ (flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
+ struct vm_fault vmf = {
+ .vma = vma,
+ .address = haddr,
+ .flags = flags,
+ };
+
+ spin_unlock(ptl);
+ if (pagecache_page) {
+ unlock_page(pagecache_page);
+ put_page(pagecache_page);
+ }
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
+ return handle_userfault(&vmf, VM_UFFD_WP);
+ }
+
/*
* hugetlb_cow() requires page locks of pte_page(entry) and
* pagecache_page, so here we need take the former one
--
2.31.1

2021-05-27 20:25:48

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 22/27] hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler

Teach the hugetlb page fault code to understand the uffd-wp special pte.  For
example, when seeing such a pte we need to convert any write fault into a read
one (which is fake - we'll retry the write later if so).  Meanwhile, for
handle_userfault() we need to make sure to wait for the special swap pte too,
just like a none pte.

Note that we also need to teach UFFDIO_COPY about this special pte across the
code path so that we can safely install a new page at this special pte as long
as we know it's a stale entry.

Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 5 ++++-
mm/hugetlb.c | 26 ++++++++++++++++++++------
mm/userfaultfd.c | 5 ++++-
3 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 15031d6f1f17..6ef7b56760bf 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -245,8 +245,11 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
/*
* Lockless access: we're in a wait_event so it's ok if it
* changes under us.
+ *
+ * Regarding uffd-wp special case, please refer to comments in
+ * userfaultfd_must_wait().
*/
- if (huge_pte_none(pte))
+ if (huge_pte_none(pte) || pte_swp_uffd_wp_special(pte))
ret = true;
if (!huge_pte_write(pte) && (reason & VM_UFFD_WP))
ret = true;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c64dfd0a9883..a17d894312c0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4707,7 +4707,8 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
struct vm_area_struct *vma,
struct address_space *mapping, pgoff_t idx,
- unsigned long address, pte_t *ptep, unsigned int flags)
+ unsigned long address, pte_t *ptep,
+ pte_t old_pte, unsigned int flags)
{
struct hstate *h = hstate_vma(vma);
vm_fault_t ret = VM_FAULT_SIGBUS;
@@ -4831,7 +4832,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,

ptl = huge_pte_lock(h, mm, ptep);
ret = 0;
- if (!huge_pte_none(huge_ptep_get(ptep)))
+ if (!pte_same(huge_ptep_get(ptep), old_pte))
goto backout;

if (anon_rmap) {
@@ -4841,6 +4842,12 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
page_dup_rmap(page, true);
new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
&& (vma->vm_flags & VM_SHARED)));
+ /*
+ * If this pte was previously wr-protected, keep it wr-protected even
+ * if populated.
+ */
+ if (unlikely(pte_swp_uffd_wp_special(old_pte)))
+ new_pte = huge_pte_wrprotect(huge_pte_mkuffd_wp(new_pte));
set_huge_pte_at(mm, haddr, ptep, new_pte);

hugetlb_count_add(pages_per_huge_page(h), mm);
@@ -4956,8 +4963,13 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
mutex_lock(&hugetlb_fault_mutex_table[hash]);

entry = huge_ptep_get(ptep);
- if (huge_pte_none(entry)) {
- ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
+ /*
+ * uffd-wp-special should be handled merely the same as pte none
+ * because it's basically a none pte with a special marker
+ */
+ if (huge_pte_none(entry) || pte_swp_uffd_wp_special(entry)) {
+ ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+ entry, flags);
goto out_mutex;
}

@@ -5091,7 +5103,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
unsigned long size;
int vm_shared = dst_vma->vm_flags & VM_SHARED;
- pte_t _dst_pte;
+ pte_t _dst_pte, cur_pte;
spinlock_t *ptl;
int ret = -ENOMEM;
struct page *page;
@@ -5213,8 +5225,10 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
if (idx >= size)
goto out_release_unlock;

+ cur_pte = huge_ptep_get(dst_pte);
ret = -EEXIST;
- if (!huge_pte_none(huge_ptep_get(dst_pte)))
+ /* Please refer to shmem_mfill_atomic_pte() for uffd-wp special case */
+ if (!huge_pte_none(cur_pte) && !pte_swp_uffd_wp_special(cur_pte))
goto out_release_unlock;

if (vm_shared) {
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 82c235f555b8..af79f3d3a001 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -364,6 +364,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
}

while (src_addr < src_start + len) {
+ pte_t pteval;
+
BUG_ON(dst_addr >= dst_start + len);

/*
@@ -386,8 +388,9 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
goto out_unlock;
}

+ pteval = huge_ptep_get(dst_pte);
if (mode != MCOPY_ATOMIC_CONTINUE &&
- !huge_pte_none(huge_ptep_get(dst_pte))) {
+ !huge_pte_none(pteval) && !pte_swp_uffd_wp_special(pteval)) {
err = -EEXIST;
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
i_mmap_unlock_read(mapping);
--
2.31.1

2021-05-27 20:26:15

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 15/27] shmem/userfaultfd: Pass over uffd-wp special swap pte when fork()

It should be handled similarly like other uffd-wp wr-protected ptes: we should
pass it over when the dst_vma has VM_UFFD_WP armed, otherwise drop it.

Signed-off-by: Peter Xu <[email protected]>
---
mm/memory.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3453b8ae5f4f..8372b212993a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -777,8 +777,21 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
unsigned long vm_flags = dst_vma->vm_flags;
pte_t pte = *src_pte;
struct page *page;
- swp_entry_t entry = pte_to_swp_entry(pte);
+ swp_entry_t entry;
+
+ if (unlikely(is_swap_special_pte(pte))) {
+ /*
+ * uffd-wp special swap pte is the only possibility for now.
+ * If dst vma is registered with uffd-wp, copy it over.
+ * Otherwise, ignore this pte as if it's a none pte would work.
+ */
+ WARN_ON_ONCE(!pte_swp_uffd_wp_special(pte));
+ if (userfaultfd_wp(dst_vma))
+ set_pte_at(dst_mm, addr, dst_pte, pte);
+ return 0;
+ }

+ entry = pte_to_swp_entry(pte);
if (likely(!non_swap_entry(entry))) {
if (swap_duplicate(entry) < 0)
return -EAGAIN;
--
2.31.1

2021-05-27 20:27:43

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 25/27] mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

This requires the pagemap code to be able to recognize the newly introduced
swap special pte for uffd-wp, as well as the general hugetlb case that we
recently started to support.  It should make pagemap uffd-wp support complete.

Signed-off-by: Peter Xu <[email protected]>
---
fs/proc/task_mmu.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 9c5af77b5290..988e29fa1f00 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1389,6 +1389,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
flags |= PM_SWAP;
if (is_pfn_swap_entry(entry))
page = pfn_swap_entry_to_page(entry);
+ } else if (pte_swp_uffd_wp_special(pte)) {
+ flags |= PM_UFFD_WP;
}

if (page && !PageAnon(page))
@@ -1522,10 +1524,15 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
if (page_mapcount(page) == 1)
flags |= PM_MMAP_EXCLUSIVE;

+ if (huge_pte_uffd_wp(pte))
+ flags |= PM_UFFD_WP;
+
flags |= PM_PRESENT;
if (pm->show_pfn)
frame = pte_pfn(pte) +
((addr & ~hmask) >> PAGE_SHIFT);
+ } else if (pte_swp_uffd_wp_special(pte)) {
+ flags |= PM_UFFD_WP;
}

for (; addr != end; addr += PAGE_SIZE) {
--
2.31.1

2021-05-27 20:27:43

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 26/27] mm/userfaultfd: Enable write protection for shmem & hugetlbfs

We've had all the necessary changes ready for both shmem and hugetlbfs. Turn
on all the shmem/hugetlbfs switches for userfaultfd-wp.

We can expand UFFD_API_RANGE_IOCTLS_BASIC with _UFFDIO_WRITEPROTECT too because
all existing types now support write protection mode.

Since vma_can_userfault() will be used elsewhere, move it into userfaultfd_k.h.
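
For reference, below is a minimal userspace sketch of what becomes possible
with this patch: a shmem-backed (MAP_SHARED|MAP_ANONYMOUS) area registered in
wp mode and then write-protected with UFFDIO_WRITEPROTECT. This is only an
illustration with error handling omitted; it is not part of the patch or the
selftests. The new UFFD_FEATURE_WP_HUGETLBFS_SHMEM bit can additionally be
checked in the uffdio_api features returned by UFFDIO_API:

    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
            size_t len = 16 * 4096;
            int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
            struct uffdio_api api = { .api = UFFD_API };
            char *area;

            ioctl(uffd, UFFDIO_API, &api);

            /* MAP_SHARED | MAP_ANONYMOUS gives a shmem-backed mapping. */
            area = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);

            struct uffdio_register reg = {
                    .range = { .start = (unsigned long)area, .len = len },
                    .mode = UFFDIO_REGISTER_MODE_WP,
            };
            ioctl(uffd, UFFDIO_REGISTER, &reg);

            /* Write-protect the range; writes now generate uffd-wp faults. */
            struct uffdio_writeprotect wp = {
                    .range = { .start = (unsigned long)area, .len = len },
                    .mode = UFFDIO_WRITEPROTECT_MODE_WP,
            };
            ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

            return 0;
    }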

Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 18 ------------------
include/linux/userfaultfd_k.h | 12 ++++++++++++
include/uapi/linux/userfaultfd.h | 10 ++++++++--
mm/userfaultfd.c | 9 +++------
4 files changed, 23 insertions(+), 26 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 6ef7b56760bf..140bab07935e 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1275,24 +1275,6 @@ static __always_inline int validate_range(struct mm_struct *mm,
return 0;
}

-static inline bool vma_can_userfault(struct vm_area_struct *vma,
- unsigned long vm_flags)
-{
- /* FIXME: add WP support to hugetlbfs and shmem */
- if (vm_flags & VM_UFFD_WP) {
- if (is_vm_hugetlb_page(vma) || vma_is_shmem(vma))
- return false;
- }
-
- if (vm_flags & VM_UFFD_MINOR) {
- if (!(is_vm_hugetlb_page(vma) || vma_is_shmem(vma)))
- return false;
- }
-
- return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
- vma_is_shmem(vma);
-}
-
static int userfaultfd_register(struct userfaultfd_ctx *ctx,
unsigned long arg)
{
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index ca3f794d07e9..489fb375e66c 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -16,6 +16,7 @@
#include <linux/fcntl.h>
#include <linux/mm.h>
#include <asm-generic/pgtable_uffd.h>
+#include <linux/hugetlb_inline.h>

/* The set of all possible UFFD-related VM flags. */
#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR)
@@ -138,6 +139,17 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
return vma->vm_flags & __VM_UFFD_FLAGS;
}

+static inline bool vma_can_userfault(struct vm_area_struct *vma,
+ unsigned long vm_flags)
+{
+ if (vm_flags & VM_UFFD_MINOR)
+ return is_vm_hugetlb_page(vma) || vma_is_shmem(vma);
+
+ return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
+ vma_is_shmem(vma);
+}
+
+
extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
extern void dup_userfaultfd_complete(struct list_head *);

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 159a74e9564f..09b7ec69c97d 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -32,7 +32,8 @@
UFFD_FEATURE_SIGBUS | \
UFFD_FEATURE_THREAD_ID | \
UFFD_FEATURE_MINOR_HUGETLBFS | \
- UFFD_FEATURE_MINOR_SHMEM)
+ UFFD_FEATURE_MINOR_SHMEM | \
+ UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
#define UFFD_API_IOCTLS \
((__u64)1 << _UFFDIO_REGISTER | \
(__u64)1 << _UFFDIO_UNREGISTER | \
@@ -46,7 +47,8 @@
#define UFFD_API_RANGE_IOCTLS_BASIC \
((__u64)1 << _UFFDIO_WAKE | \
(__u64)1 << _UFFDIO_COPY | \
- (__u64)1 << _UFFDIO_CONTINUE)
+ (__u64)1 << _UFFDIO_CONTINUE | \
+ (__u64)1 << _UFFDIO_WRITEPROTECT)

/*
* Valid ioctl command number range with this API is from 0x00 to
@@ -189,6 +191,9 @@ struct uffdio_api {
*
* UFFD_FEATURE_MINOR_SHMEM indicates the same support as
* UFFD_FEATURE_MINOR_HUGETLBFS, but for shmem-backed pages instead.
+ *
+ * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd
+ * write-protection mode is supported on both shmem and hugetlbfs.
*/
#define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
#define UFFD_FEATURE_EVENT_FORK (1<<1)
@@ -201,6 +206,7 @@ struct uffdio_api {
#define UFFD_FEATURE_THREAD_ID (1<<8)
#define UFFD_FEATURE_MINOR_HUGETLBFS (1<<9)
#define UFFD_FEATURE_MINOR_SHMEM (1<<10)
+#define UFFD_FEATURE_WP_HUGETLBFS_SHMEM (1<<11)
__u64 features;

__u64 ioctls;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index af79f3d3a001..7ff9176149e0 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -772,15 +772,12 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,

err = -ENOENT;
dst_vma = find_dst_vma(dst_mm, start, len);
- /*
- * Make sure the vma is not shared, that the dst range is
- * both valid and fully within a single existing vma.
- */
- if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+
+ if (!dst_vma)
goto out_unlock;
if (!userfaultfd_wp(dst_vma))
goto out_unlock;
- if (!vma_is_anonymous(dst_vma))
+ if (!vma_can_userfault(dst_vma, dst_vma->vm_flags))
goto out_unlock;

if (is_vm_hugetlb_page(dst_vma)) {
--
2.31.1

2021-05-27 20:27:43

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 24/27] hugetlb/userfaultfd: Only drop uffd-wp special pte if required

As with shmem uffd-wp special ptes, only drop the uffd-wp special swap pte if
unmapping an entire vma or synchronized such that faults cannot race with the
unmap operation. This requires passing zap_flags all the way to the lowest
level hugetlb unmap routine: __unmap_hugepage_range.

In general, unmap calls originating in hugetlbfs code will pass the
ZAP_FLAG_DROP_FILE_UFFD_WP flag as synchronization is in place to prevent
faults. The exception is hole punch which will first unmap without any
synchronization. Later when hole punch actually removes the page from the
file, it will check to see if there was a subsequent fault and if so take the
hugetlb fault mutex while unmapping again. This second unmap will pass in
ZAP_FLAG_DROP_FILE_UFFD_WP.

The core justification for whether to apply the ZAP_FLAG_DROP_FILE_UFFD_WP flag
when unmapping a hugetlb range is (IMHO): we should never reach a state where a
page fault could erroneously map a wr-protected page-cache page as writable,
even for an extremely short period. That could happen if, e.g., we passed
ZAP_FLAG_DROP_FILE_UFFD_WP in hugetlbfs_punch_hole() when calling
hugetlb_vmdelete_list(): if a page fault triggers after that call and before
the remove_inode_hugepages() right after it, the page cache can be mapped
writable again in that small window, which can cause data corruption.

Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
fs/hugetlbfs/inode.c | 15 +++++++++------
include/linux/hugetlb.h | 8 +++++---
mm/hugetlb.c | 27 +++++++++++++++++++++------
mm/memory.c | 5 ++++-
4 files changed, 39 insertions(+), 16 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 55efd3dd04f6..b917fb4c670e 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -404,7 +404,8 @@ static void remove_huge_page(struct page *page)
}

static void
-hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
+hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
+ unsigned long zap_flags)
{
struct vm_area_struct *vma;

@@ -437,7 +438,7 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
}

unmap_hugepage_range(vma, vma->vm_start + v_offset, v_end,
- NULL);
+ NULL, zap_flags);
}
}

@@ -515,7 +516,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
mutex_lock(&hugetlb_fault_mutex_table[hash]);
hugetlb_vmdelete_list(&mapping->i_mmap,
index * pages_per_huge_page(h),
- (index + 1) * pages_per_huge_page(h));
+ (index + 1) * pages_per_huge_page(h),
+ ZAP_FLAG_DROP_FILE_UFFD_WP);
i_mmap_unlock_write(mapping);
}

@@ -581,7 +583,8 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
i_mmap_lock_write(mapping);
i_size_write(inode, offset);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
- hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
+ hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
+ ZAP_FLAG_DROP_FILE_UFFD_WP);
i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, offset, LLONG_MAX);
}
@@ -614,8 +617,8 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
hugetlb_vmdelete_list(&mapping->i_mmap,
- hole_start >> PAGE_SHIFT,
- hole_end >> PAGE_SHIFT);
+ hole_start >> PAGE_SHIFT,
+ hole_end >> PAGE_SHIFT, 0);
i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, hole_start, hole_end);
inode_unlock(inode);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3e4c5c64d867..d3e8b3b38ded 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -138,11 +138,12 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
unsigned long *, unsigned long *, long, unsigned int,
int *);
void unmap_hugepage_range(struct vm_area_struct *,
- unsigned long, unsigned long, struct page *);
+ unsigned long, unsigned long, struct page *,
+ unsigned long);
void __unmap_hugepage_range_final(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long start, unsigned long end,
- struct page *ref_page);
+ struct page *ref_page, unsigned long zap_flags);
void hugetlb_report_meminfo(struct seq_file *);
int hugetlb_report_node_meminfo(char *buf, int len, int nid);
void hugetlb_show_meminfo(void);
@@ -377,7 +378,8 @@ static inline unsigned long hugetlb_change_protection(

static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page)
+ unsigned long end, struct page *ref_page,
+ unsigned long zap_flags)
{
BUG();
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c4dd0c531bb5..78675158911c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4274,7 +4274,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,

void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
unsigned long start, unsigned long end,
- struct page *ref_page)
+ struct page *ref_page, unsigned long zap_flags)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -4326,6 +4326,19 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
continue;
}

+ if (unlikely(is_swap_special_pte(pte))) {
+ WARN_ON_ONCE(!pte_swp_uffd_wp_special(pte));
+ /*
+ * Only drop the special swap uffd-wp pte if
+ * e.g. unmapping a vma or punching a hole (with proper
+ * lock held so that concurrent page fault won't happen).
+ */
+ if (zap_flags & ZAP_FLAG_DROP_FILE_UFFD_WP)
+ huge_pte_clear(mm, address, ptep, sz);
+ spin_unlock(ptl);
+ continue;
+ }
+
/*
* Migrating hugepage or HWPoisoned hugepage is already
* unmapped and its refcount is dropped, so just clear pte here.
@@ -4377,9 +4390,10 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,

void __unmap_hugepage_range_final(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page)
+ unsigned long end, struct page *ref_page,
+ unsigned long zap_flags)
{
- __unmap_hugepage_range(tlb, vma, start, end, ref_page);
+ __unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);

/*
* Clear this flag so that x86's huge_pmd_share page_table_shareable
@@ -4395,12 +4409,13 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
}

void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page)
+ unsigned long end, struct page *ref_page,
+ unsigned long zap_flags)
{
struct mmu_gather tlb;

tlb_gather_mmu(&tlb, vma->vm_mm);
- __unmap_hugepage_range(&tlb, vma, start, end, ref_page);
+ __unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);
tlb_finish_mmu(&tlb);
}

@@ -4455,7 +4470,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
*/
if (!is_vma_resv_set(iter_vma, HPAGE_RESV_OWNER))
unmap_hugepage_range(iter_vma, address,
- address + huge_page_size(h), page);
+ address + huge_page_size(h), page, 0);
}
i_mmap_unlock_write(mapping);
}
diff --git a/mm/memory.c b/mm/memory.c
index 8372b212993a..4427f48e446d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1607,8 +1607,11 @@ static void unmap_single_vma(struct mmu_gather *tlb,
* safe to do nothing in this case.
*/
if (vma->vm_file) {
+ unsigned long zap_flags = details ?
+ details->zap_flags : 0;
i_mmap_lock_write(vma->vm_file->f_mapping);
- __unmap_hugepage_range_final(tlb, vma, start, end, NULL);
+ __unmap_hugepage_range_final(tlb, vma, start, end,
+ NULL, zap_flags);
i_mmap_unlock_write(vma->vm_file->f_mapping);
}
} else
--
2.31.1

2021-05-27 23:56:41

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 05/27] mm/swap: Introduce the idea of special swap ptes

We used to have special swap entries, like migration entries, hw-poison
entries, device private entries, etc.

Those "special swap entries" reside in the range that they need to be at least
swap entries first, and their types are decided by swp_type(entry).

This patch introduces another idea called "special swap ptes".

It's very easy to confuse this with "special swap entries", but a special swap
pte should never contain a swap entry at all. That means it's illegal to call
pte_to_swp_entry() upon a special swap pte.

Make the uffd-wp special pte the first special swap pte.

Before this patch, is_swap_pte()==true means one of the below:

(a.1) The pte has a normal swap entry (non_swap_entry()==false). For
example, when an anonymous page got swapped out.

(a.2) The pte has a special swap entry (non_swap_entry()==true). For
example, a migration entry, a hw-poison entry, etc.

After this patch, is_swap_pte()==true means one of the below, where case (b) is
added:

(a) The pte contains a swap entry.

(a.1) The pte has a normal swap entry (non_swap_entry()==false). For
example, when an anonymous page got swapped out.

(a.2) The pte has a special swap entry (non_swap_entry()==true). For
example, a migration entry, a hw-poison entry, etc.

(b) The pte does not contain a swap entry at all (so it cannot be passed
into pte_to_swp_entry()). For example, uffd-wp special swap pte.

Teach the whole mm core about this new idea. It's done by introducing another
helper called pte_has_swap_entry(), which stands for cases (a.1) and (a.2).
Before this patch it is the same as is_swap_pte(), because there's no special
swap pte yet. For most of the previous uses of is_swap_pte() in mm core, we'll
now need to use the new helper pte_has_swap_entry() instead, to make sure we
won't try to parse a swap entry from a swap special pte (which does not contain
a swap entry at all!). We either handle the swap special pte explicitly, or
it'll naturally take the default "else" paths.
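
The typical conversion pattern in a caller then looks like below. This is only
an illustration of the two new helpers, not a literal hunk from this patch:

    if (pte_has_swap_entry(pte)) {
            /* Safe: there is a real swap entry (or migration/hwpoison/
             * device entry) encoded inside the pte. */
            swp_entry_t entry = pte_to_swp_entry(pte);
            /* ... handle the entry ... */
    } else if (is_swap_special_pte(pte)) {
            /* No swap entry inside: it's only a marker, and should mostly
             * be treated like pte_none(). */
    }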

Warn properly (e.g., in do_swap_page()) when we see a special swap pte - we
should never call do_swap_page() upon those ptes, but just bail out early if it
happens.

Signed-off-by: Peter Xu <[email protected]>
---
arch/arm64/kernel/mte.c | 2 +-
fs/proc/task_mmu.c | 14 ++++++++------
include/linux/swapops.h | 39 ++++++++++++++++++++++++++++++++++++++-
mm/gup.c | 2 +-
mm/hmm.c | 2 +-
mm/khugepaged.c | 11 ++++++++++-
mm/madvise.c | 4 ++--
mm/memcontrol.c | 2 +-
mm/memory.c | 7 +++++++
mm/migrate.c | 4 ++--
mm/mincore.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 2 +-
mm/page_vma_mapped.c | 6 +++---
mm/swapfile.c | 2 +-
15 files changed, 78 insertions(+), 23 deletions(-)

diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 125a10e413e9..a6fd3fb3eacb 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -36,7 +36,7 @@ static void mte_sync_page_tags(struct page *page, pte_t *ptep, bool check_swap)
{
pte_t old_pte = READ_ONCE(*ptep);

- if (check_swap && is_swap_pte(old_pte)) {
+ if (check_swap && pte_has_swap_entry(old_pte)) {
swp_entry_t entry = pte_to_swp_entry(old_pte);

if (!non_swap_entry(entry) && mte_restore_tags(entry, page))
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index eb97468dfe4c..9c5af77b5290 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -498,7 +498,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,

if (pte_present(*pte)) {
page = vm_normal_page(vma, addr, *pte);
- } else if (is_swap_pte(*pte)) {
+ } else if (pte_has_swap_entry(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);

if (!non_swap_entry(swpent)) {
@@ -516,8 +516,10 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
}
} else if (is_pfn_swap_entry(swpent))
page = pfn_swap_entry_to_page(swpent);
- } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
- && pte_none(*pte))) {
+ } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) &&
+ mss->check_shmem_swap &&
+ /* Here swap special pte is the same as none pte */
+ (pte_none(*pte) || is_swap_special_pte(*pte)))) {
page = xa_load(&vma->vm_file->f_mapping->i_pages,
linear_page_index(vma, addr));
if (xa_is_value(page))
@@ -689,7 +691,7 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,

if (pte_present(*pte)) {
page = vm_normal_page(vma, addr, *pte);
- } else if (is_swap_pte(*pte)) {
+ } else if (pte_has_swap_entry(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);

if (is_pfn_swap_entry(swpent))
@@ -1071,7 +1073,7 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
ptent = pte_wrprotect(old_pte);
ptent = pte_clear_soft_dirty(ptent);
ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
- } else if (is_swap_pte(ptent)) {
+ } else if (pte_has_swap_entry(ptent)) {
ptent = pte_swp_clear_soft_dirty(ptent);
set_pte_at(vma->vm_mm, addr, pte, ptent);
}
@@ -1374,7 +1376,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
flags |= PM_SOFT_DIRTY;
if (pte_uffd_wp(pte))
flags |= PM_UFFD_WP;
- } else if (is_swap_pte(pte)) {
+ } else if (pte_has_swap_entry(pte)) {
swp_entry_t entry;
if (pte_swp_soft_dirty(pte))
flags |= PM_SOFT_DIRTY;
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index af3d2661e41e..4a316c015fe0 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -5,6 +5,7 @@
#include <linux/radix-tree.h>
#include <linux/bug.h>
#include <linux/mm_types.h>
+#include <linux/userfaultfd_k.h>

#ifdef CONFIG_MMU

@@ -52,12 +53,48 @@ static inline pgoff_t swp_offset(swp_entry_t entry)
return entry.val & SWP_OFFSET_MASK;
}

-/* check whether a pte points to a swap entry */
+/*
+ * is_swap_pte() returns true for three cases:
+ *
+ * (a) The pte contains a swap entry.
+ *
+ * (a.1) The pte has a normal swap entry (non_swap_entry()==false). For
+ * example, when an anonymous page got swapped out.
+ *
+ * (a.2) The pte has a special swap entry (non_swap_entry()==true). For
+ * example, a migration entry, a hw-poison entry, etc.
+ *
+ * (b) The pte does not contain a swap entry at all (so it cannot be passed
+ * into pte_to_swp_entry()). For example, uffd-wp special swap pte.
+ */
static inline int is_swap_pte(pte_t pte)
{
return !pte_none(pte) && !pte_present(pte);
}

+/*
+ * A swap-like special pte should only be used as special marker to trigger a
+ * page fault. We should treat them similarly as pte_none() in most cases,
+ * except that it may contain some special information that can persist within
+ * the pte. Currently the only special swap pte is UFFD_WP_SWP_PTE_SPECIAL.
+ *
+ * Note: we should never call pte_to_swp_entry() upon a special swap pte,
+ * because a swap special pte does not contain a swap entry!
+ */
+static inline bool is_swap_special_pte(pte_t pte)
+{
+ return pte_swp_uffd_wp_special(pte);
+}
+
+/*
+ * Returns true if the pte contains a swap entry. This includes not only the
+ * normal swp entry case, but also migration entries, etc.
+ */
+static inline bool pte_has_swap_entry(pte_t pte)
+{
+ return is_swap_pte(pte) && !is_swap_special_pte(pte);
+}
+
/*
* Convert the arch-dependent pte representation of a swp_entry_t into an
* arch-independent swp_entry_t.
diff --git a/mm/gup.c b/mm/gup.c
index 29a0c7d87024..e03590c9c68e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -485,7 +485,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
*/
if (likely(!(flags & FOLL_MIGRATION)))
goto no_page;
- if (pte_none(pte))
+ if (!pte_has_swap_entry(pte))
goto no_page;
entry = pte_to_swp_entry(pte);
if (!is_migration_entry(entry))
diff --git a/mm/hmm.c b/mm/hmm.c
index fad6be2bf072..aba1bf2c6742 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -239,7 +239,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
pte_t pte = *ptep;
uint64_t pfn_req_flags = *hmm_pfn;

- if (pte_none(pte)) {
+ if (pte_none(pte) || is_swap_special_pte(pte)) {
required_fault =
hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
if (required_fault)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b0412be08fa2..7376a9b5bfc9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1018,7 +1018,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,

vmf.pte = pte_offset_map(pmd, address);
vmf.orig_pte = *vmf.pte;
- if (!is_swap_pte(vmf.orig_pte)) {
+ if (!pte_has_swap_entry(vmf.orig_pte)) {
pte_unmap(vmf.pte);
continue;
}
@@ -1245,6 +1245,15 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
_pte++, _address += PAGE_SIZE) {
pte_t pteval = *_pte;
if (is_swap_pte(pteval)) {
+ if (is_swap_special_pte(pteval)) {
+ /*
+ * Reuse SCAN_PTE_UFFD_WP. If there will be
+ * new users of is_swap_special_pte(), we'd
+ * better introduce a new result type.
+ */
+ result = SCAN_PTE_UFFD_WP;
+ goto out_unmap;
+ }
if (++unmapped <= khugepaged_max_ptes_swap) {
/*
* Always be strict with uffd-wp
diff --git a/mm/madvise.c b/mm/madvise.c
index 012129fbfaf8..ebde36d685ad 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -204,7 +204,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
pte = *(orig_pte + ((index - start) / PAGE_SIZE));
pte_unmap_unlock(orig_pte, ptl);

- if (pte_present(pte) || pte_none(pte))
+ if (!pte_has_swap_entry(pte))
continue;
entry = pte_to_swp_entry(pte);
if (unlikely(non_swap_entry(entry)))
@@ -596,7 +596,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
for (; addr != end; pte++, addr += PAGE_SIZE) {
ptent = *pte;

- if (pte_none(ptent))
+ if (pte_none(ptent) || is_swap_special_pte(ptent))
continue;
/*
* If the pte has swp_entry, just clear page table to
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cb864f87b01d..f684f6cf6fce 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5719,7 +5719,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,

if (pte_present(ptent))
page = mc_handle_present_pte(vma, addr, ptent);
- else if (is_swap_pte(ptent))
+ else if (pte_has_swap_entry(ptent))
page = mc_handle_swap_pte(vma, ptent, &ent);
else if (pte_none(ptent))
page = mc_handle_file_pte(vma, addr, ptent, &ent);
diff --git a/mm/memory.c b/mm/memory.c
index 0ccaae2647c0..2b24af4616df 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3445,6 +3445,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (!pte_unmap_same(vmf))
goto out;

+ /*
+ * We should never call do_swap_page upon a swap special pte; just be
+ * safe to bail out if it happens.
+ */
+ if (WARN_ON_ONCE(is_swap_special_pte(vmf->orig_pte)))
+ goto out;
+
entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry))) {
if (is_migration_entry(entry)) {
diff --git a/mm/migrate.c b/mm/migrate.c
index 91ee6f0941b4..2468c5d00f30 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -294,7 +294,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,

spin_lock(ptl);
pte = *ptep;
- if (!is_swap_pte(pte))
+ if (!pte_has_swap_entry(pte))
goto out;

entry = pte_to_swp_entry(pte);
@@ -2248,7 +2248,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,

pte = *ptep;

- if (pte_none(pte)) {
+ if (pte_none(pte) || is_swap_special_pte(pte)) {
if (vma_is_anonymous(vma)) {
mpfn = MIGRATE_PFN_MIGRATE;
migrate->cpages++;
diff --git a/mm/mincore.c b/mm/mincore.c
index 9122676b54d6..5728c3e6473f 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -121,7 +121,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
for (; addr != end; ptep++, addr += PAGE_SIZE) {
pte_t pte = *ptep;

- if (pte_none(pte))
+ if (pte_none(pte) || is_swap_special_pte(pte))
__mincore_unmapped_range(addr, addr + PAGE_SIZE,
vma, vec);
else if (pte_present(pte))
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 883e2cc85cad..4b743394afbe 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -139,7 +139,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
pages++;
- } else if (is_swap_pte(oldpte)) {
+ } else if (pte_has_swap_entry(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
pte_t newpte;

diff --git a/mm/mremap.c b/mm/mremap.c
index b7523589f218..64cd6581e05a 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -125,7 +125,7 @@ static pte_t move_soft_dirty_pte(pte_t pte)
#ifdef CONFIG_MEM_SOFT_DIRTY
if (pte_present(pte))
pte = pte_mksoft_dirty(pte);
- else if (is_swap_pte(pte))
+ else if (pte_has_swap_entry(pte))
pte = pte_swp_mksoft_dirty(pte);
#endif
return pte;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index f535bcb4950c..c2f9bcee2273 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -36,7 +36,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
* For more details on device private memory see HMM
* (include/linux/hmm.h or mm/hmm.c).
*/
- if (is_swap_pte(*pvmw->pte)) {
+ if (pte_has_swap_entry(*pvmw->pte)) {
swp_entry_t entry;

/* Handle un-addressable ZONE_DEVICE memory */
@@ -90,7 +90,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)

if (pvmw->flags & PVMW_MIGRATION) {
swp_entry_t entry;
- if (!is_swap_pte(*pvmw->pte))
+ if (!pte_has_swap_entry(*pvmw->pte))
return false;
entry = pte_to_swp_entry(*pvmw->pte);

@@ -99,7 +99,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
return false;

pfn = swp_offset(entry);
- } else if (is_swap_pte(*pvmw->pte)) {
+ } else if (pte_has_swap_entry(*pvmw->pte)) {
swp_entry_t entry;

/* Handle un-addressable ZONE_DEVICE memory */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index cbb4c0795284..2401b2a90443 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1972,7 +1972,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
si = swap_info[type];
pte = pte_offset_map(pmd, addr);
do {
- if (!is_swap_pte(*pte))
+ if (!pte_has_swap_entry(*pte))
continue;

entry = pte_to_swp_entry(*pte);
--
2.31.1

2021-05-27 23:56:47

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 08/27] mm: Introduce zap_details.zap_flags

Instead of introducing one variable for every new zap_details field, let's
introduce a flags field so that it can start to encode true/false information.

Let's use this flag first to clean up the existing check_mapping variable.
The name "check_mapping" implies it is a "boolean", but it actually stores the
mapping itself; it's simply left unset when we don't want to check the
mapping.

To make things clearer, introduce the first zap flag ZAP_FLAG_CHECK_MAPPING, so
that we only check against the mapping if this bit is set. At the same time,
rename check_mapping into zap_mapping and always set it.

While at it, introduce another helper zap_check_mapping_skip() and use it in
zap_pte_range() properly.

Some old comments in zap_pte_range() have been removed because they were
duplicated, and now that we have the ZAP_FLAG_CHECK_MAPPING flag it's easy to
find this information by simply grepping for the flag.

It'll also make life easier when we want to, e.g., pass zap_flags into callers
like unmap_mapping_pages() (instead of adding new booleans besides the
even_cows parameter).

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 19 ++++++++++++++++++-
mm/memory.c | 31 ++++++++-----------------------
2 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index db155be8e66c..52d3ef2ed753 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1721,13 +1721,30 @@ static inline bool can_do_mlock(void) { return false; }
extern int user_shm_lock(size_t, struct user_struct *);
extern void user_shm_unlock(size_t, struct user_struct *);

+/* Whether to check page->mapping when zapping */
+#define ZAP_FLAG_CHECK_MAPPING BIT(0)
+
/*
* Parameter block passed down to zap_pte_range in exceptional cases.
*/
struct zap_details {
- struct address_space *check_mapping; /* Check page->mapping if set */
+ struct address_space *zap_mapping;
+ unsigned long zap_flags;
};

+/* Return true if skip zapping this page, false otherwise */
+static inline bool
+zap_check_mapping_skip(struct zap_details *details, struct page *page)
+{
+ if (!details || !page)
+ return false;
+
+ if (!(details->zap_flags & ZAP_FLAG_CHECK_MAPPING))
+ return false;
+
+ return details->zap_mapping != page_rmapping(page);
+}
+
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index 27cf8a6375c6..c9dc4e9e05b5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1330,16 +1330,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct page *page;

page = vm_normal_page(vma, addr, ptent);
- if (unlikely(details) && page) {
- /*
- * unmap_shared_mapping_pages() wants to
- * invalidate cache without truncating:
- * unmap shared but keep private pages.
- */
- if (details->check_mapping &&
- details->check_mapping != page_rmapping(page))
- continue;
- }
+ if (unlikely(zap_check_mapping_skip(details, page)))
+ continue;
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
tlb_remove_tlb_entry(tlb, pte, addr);
@@ -1372,17 +1364,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
is_device_exclusive_entry(entry)) {
struct page *page = pfn_swap_entry_to_page(entry);

- if (unlikely(details && details->check_mapping)) {
- /*
- * unmap_shared_mapping_pages() wants to
- * invalidate cache without truncating:
- * unmap shared but keep private pages.
- */
- if (details->check_mapping !=
- page_rmapping(page))
- continue;
- }
-
+ if (unlikely(zap_check_mapping_skip(details, page)))
+ continue;
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
rss[mm_counter(page)]--;

@@ -3345,9 +3328,11 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
pgoff_t nr, bool even_cows)
{
pgoff_t first_index = start, last_index = start + nr - 1;
- struct zap_details details = { };
+ struct zap_details details = { .zap_mapping = mapping };
+
+ if (!even_cows)
+ details.zap_flags |= ZAP_FLAG_CHECK_MAPPING;

- details.check_mapping = even_cows ? NULL : mapping;
if (last_index < first_index)
last_index = ULONG_MAX;

--
2.31.1

2021-05-27 23:57:24

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 19/27] hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP

Firstly, pass the wp_copy variable into hugetlb_mcopy_atomic_pte() throughout
the stack. Then, apply the UFFD_WP bit if UFFDIO_COPY_MODE_WP is set with
UFFDIO_COPY. Introduce huge_pte_mkuffd_wp() for it.

Hugetlb pages are only managed by hugetlbfs, so we're safe even without setting
the dirty bit in the huge pte if the page is installed as read-only. However
we'd better still keep the dirty bit set for a read-only UFFDIO_COPY pte (when
the UFFDIO_COPY_MODE_WP bit is set), not only to match what we do with shmem,
but also because the page does contain dirty data that the kernel just copied
from userspace.
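
As an illustration of the intended uABI usage (not part of this patch; error
handling omitted), a monitor resolving a MISSING fault on a hugetlbfs area can
now ask for the page to be installed wr-protected in a single step:

    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>

    /* Copy one huge page worth of data into the faulting range and install
     * the pte write-protected atomically (instead of COPY then WRITEPROTECT). */
    static int copy_page_wp(int uffd, unsigned long dst, void *src,
                            unsigned long hpage_size)
    {
            struct uffdio_copy copy = {
                    .dst = dst,                  /* huge page aligned address */
                    .src = (unsigned long)src,
                    .len = hpage_size,
                    .mode = UFFDIO_COPY_MODE_WP, /* install wr-protected */
            };

            return ioctl(uffd, UFFDIO_COPY, &copy);
    }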

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/hugetlb.h | 6 ++++--
mm/hugetlb.c | 22 +++++++++++++++++-----
mm/userfaultfd.c | 12 ++++++++----
3 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 7ef2b8c2ff41..d238a69bcbb3 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -155,7 +155,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
unsigned long dst_addr,
unsigned long src_addr,
enum mcopy_atomic_mode mode,
- struct page **pagep);
+ struct page **pagep,
+ bool wp_copy);
#endif /* CONFIG_USERFAULTFD */
bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
struct vm_area_struct *vma,
@@ -337,7 +338,8 @@ static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
unsigned long dst_addr,
unsigned long src_addr,
enum mcopy_atomic_mode mode,
- struct page **pagep)
+ struct page **pagep,
+ bool wp_copy)
{
BUG();
return 0;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4cbbffd50080..9bdcc208f5d9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5062,7 +5062,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
unsigned long dst_addr,
unsigned long src_addr,
enum mcopy_atomic_mode mode,
- struct page **pagep)
+ struct page **pagep,
+ bool wp_copy)
{
bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
struct hstate *h = hstate_vma(dst_vma);
@@ -5203,17 +5204,28 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
}

- /* For CONTINUE on a non-shared VMA, don't set VM_WRITE for CoW. */
- if (is_continue && !vm_shared)
+ /*
+ * For either: (1) CONTINUE on a non-shared VMA, or (2) UFFDIO_COPY
+ * with wp flag set, don't set pte write bit.
+ */
+ if (wp_copy || (is_continue && !vm_shared))
writable = 0;
else
writable = dst_vma->vm_flags & VM_WRITE;

_dst_pte = make_huge_pte(dst_vma, page, writable);
- if (writable)
- _dst_pte = huge_pte_mkdirty(_dst_pte);
+ /*
+ * Always mark UFFDIO_COPY page dirty; note that this may not be
+ * extremely important for hugetlbfs for now since swapping is not
+ * supported, but we should still be clear that this page cannot be
+ * thrown away at will, even if the write bit is not set.
+ */
+ _dst_pte = huge_pte_mkdirty(_dst_pte);
_dst_pte = pte_mkyoung(_dst_pte);

+ if (wp_copy)
+ _dst_pte = huge_pte_mkuffd_wp(_dst_pte);
+
set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);

(void)huge_ptep_set_access_flags(dst_vma, dst_addr, dst_pte, _dst_pte,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 147e86095070..424d0adc3f80 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -297,7 +297,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long src_start,
unsigned long len,
- enum mcopy_atomic_mode mode)
+ enum mcopy_atomic_mode mode,
+ bool wp_copy)
{
int vm_alloc_shared = dst_vma->vm_flags & VM_SHARED;
int vm_shared = dst_vma->vm_flags & VM_SHARED;
@@ -394,7 +395,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
}

err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
- dst_addr, src_addr, mode, &page);
+ dst_addr, src_addr, mode, &page,
+ wp_copy);

mutex_unlock(&hugetlb_fault_mutex_table[hash]);
i_mmap_unlock_read(mapping);
@@ -496,7 +498,8 @@ extern ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long src_start,
unsigned long len,
- enum mcopy_atomic_mode mode);
+ enum mcopy_atomic_mode mode,
+ bool wp_copy);
#endif /* CONFIG_HUGETLB_PAGE */

static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
@@ -616,7 +619,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
*/
if (is_vm_hugetlb_page(dst_vma))
return __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
- src_start, len, mcopy_mode);
+ src_start, len, mcopy_mode,
+ wp_copy);

if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
goto out_unlock;
--
2.31.1

2021-05-27 23:57:37

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 21/27] mm/hugetlb: Introduce huge version of special swap pte helpers

This prepares hugetlbfs to also recognize swap special ptes, such as the
uffd-wp special swap pte.

Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b101c3af3ab5..c64dfd0a9883 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -133,6 +133,26 @@ static inline bool subpool_is_free(struct hugepage_subpool *spool)
return true;
}

+/*
+ * These are sister versions of is_swap_pte() and pte_has_swap_entry(). We
+ * need standalone ones because huge_pte_none() is handled differently from
+ * pte_none(). For more information, please refer to comments above
+ * is_swap_pte() and pte_has_swap_entry().
+ *
+ * Here we directly reuse the pte level of swap special ptes, for example, the
+ * pte_swp_uffd_wp_special(). It just stands for a huge page rather than a
+ * small page for hugetlbfs pages.
+ */
+static inline bool is_huge_swap_pte(pte_t pte)
+{
+ return !huge_pte_none(pte) && !pte_present(pte);
+}
+
+static inline bool huge_pte_has_swap_entry(pte_t pte)
+{
+ return is_huge_swap_pte(pte) && !is_swap_special_pte(pte);
+}
+
static inline void unlock_or_release_subpool(struct hugepage_subpool *spool,
unsigned long irq_flags)
{
@@ -4061,7 +4081,7 @@ bool is_hugetlb_entry_migration(pte_t pte)
{
swp_entry_t swp;

- if (huge_pte_none(pte) || pte_present(pte))
+ if (!huge_pte_has_swap_entry(pte))
return false;
swp = pte_to_swp_entry(pte);
if (is_migration_entry(swp))
@@ -4074,7 +4094,7 @@ static bool is_hugetlb_entry_hwpoisoned(pte_t pte)
{
swp_entry_t swp;

- if (huge_pte_none(pte) || pte_present(pte))
+ if (!huge_pte_has_swap_entry(pte))
return false;
swp = pte_to_swp_entry(pte);
if (is_hwpoison_entry(swp))
--
2.31.1

2021-05-27 23:58:34

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 09/27] mm: Introduce ZAP_FLAG_SKIP_SWAP

Firstly, the comment in zap_pte_range() is misleading because the code checks
against details rather than check_mapping, so the comment doesn't match what
the code does.

Meanwhile, it's also confusing in that it doesn't explain why passing in the
details pointer means skipping all swap entries. A new user of zap_details
could easily miss this fact unless they read all the way down to
zap_pte_range(), because there's no comment at zap_details mentioning it at
all, so swap entries could be erroneously skipped without being noticed.

This partly reverts 3e8715fdc03e ("mm: drop zap_details::check_swap_entries"),
but introduces the ZAP_FLAG_SKIP_SWAP flag, which means the opposite of the
previous "details" parameter: the caller should explicitly set this to skip
swap entries; otherwise swap entries will always be considered (which is still
the common case here).

Cc: Kirill A. Shutemov <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 12 ++++++++++++
mm/memory.c | 8 +++++---
2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 52d3ef2ed753..1adf313a01fe 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1723,6 +1723,8 @@ extern void user_shm_unlock(size_t, struct user_struct *);

/* Whether to check page->mapping when zapping */
#define ZAP_FLAG_CHECK_MAPPING BIT(0)
+/* Whether to skip zapping swap entries */
+#define ZAP_FLAG_SKIP_SWAP BIT(1)

/*
* Parameter block passed down to zap_pte_range in exceptional cases.
@@ -1745,6 +1747,16 @@ zap_check_mapping_skip(struct zap_details *details, struct page *page)
return details->zap_mapping != page_rmapping(page);
}

+/* Return true if skip swap entries, false otherwise */
+static inline bool
+zap_skip_swap(struct zap_details *details)
+{
+ if (!details)
+ return false;
+
+ return details->zap_flags & ZAP_FLAG_SKIP_SWAP;
+}
+
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index c9dc4e9e05b5..8a3751be87ba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1376,8 +1376,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
continue;
}

- /* If details->check_mapping, we leave swap entries. */
- if (unlikely(details))
+ if (unlikely(zap_skip_swap(details)))
continue;

if (!non_swap_entry(entry))
@@ -3328,7 +3327,10 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
pgoff_t nr, bool even_cows)
{
pgoff_t first_index = start, last_index = start + nr - 1;
- struct zap_details details = { .zap_mapping = mapping };
+ struct zap_details details = {
+ .zap_mapping = mapping,
+ .zap_flags = ZAP_FLAG_SKIP_SWAP,
+ };

if (!even_cows)
details.zap_flags |= ZAP_FLAG_CHECK_MAPPING;
--
2.31.1

2021-05-27 23:59:06

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 16/27] mm/hugetlb: Drop __unmap_hugepage_range definition from hugetlb.h

Drop it in the header since it's only used in hugetlb.c.

Suggested-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/hugetlb.h | 10 ----------
1 file changed, 10 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d944aa0202f0..7ef2b8c2ff41 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -143,9 +143,6 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long start, unsigned long end,
struct page *ref_page);
-void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
- unsigned long start, unsigned long end,
- struct page *ref_page);
void hugetlb_report_meminfo(struct seq_file *);
int hugetlb_report_node_meminfo(char *buf, int len, int nid);
void hugetlb_show_meminfo(void);
@@ -381,13 +378,6 @@ static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
BUG();
}

-static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
- struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page)
-{
- BUG();
-}
-
static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long address,
unsigned int flags)
--
2.31.1

2021-05-28 01:25:30

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 07/27] mm: Drop first_index/last_index in zap_details

The first_index/last_index parameters in zap_details are actually only used in
unmap_mapping_range_tree(). Meanwhile, this function is only called by
unmap_mapping_pages() once. Instead of passing these two variables through the
whole stack of page zapping code, remove them from zap_details and let them
simply be parameters of unmap_mapping_range_tree(), which is inlined.

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 2 --
mm/memory.c | 20 ++++++++++----------
2 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef9ea6dfefff..db155be8e66c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1726,8 +1726,6 @@ extern void user_shm_unlock(size_t, struct user_struct *);
*/
struct zap_details {
struct address_space *check_mapping; /* Check page->mapping if set */
- pgoff_t first_index; /* Lowest page->index to unmap */
- pgoff_t last_index; /* Highest page->index to unmap */
};

struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index 45a2f71e447a..27cf8a6375c6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3305,20 +3305,20 @@ static void unmap_mapping_range_vma(struct vm_area_struct *vma,
}

static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
+ pgoff_t first_index,
+ pgoff_t last_index,
struct zap_details *details)
{
struct vm_area_struct *vma;
pgoff_t vba, vea, zba, zea;

- vma_interval_tree_foreach(vma, root,
- details->first_index, details->last_index) {
-
+ vma_interval_tree_foreach(vma, root, first_index, last_index) {
vba = vma->vm_pgoff;
vea = vba + vma_pages(vma) - 1;
- zba = details->first_index;
+ zba = first_index;
if (zba < vba)
zba = vba;
- zea = details->last_index;
+ zea = last_index;
if (zea > vea)
zea = vea;

@@ -3344,17 +3344,17 @@ static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
pgoff_t nr, bool even_cows)
{
+ pgoff_t first_index = start, last_index = start + nr - 1;
struct zap_details details = { };

details.check_mapping = even_cows ? NULL : mapping;
- details.first_index = start;
- details.last_index = start + nr - 1;
- if (details.last_index < details.first_index)
- details.last_index = ULONG_MAX;
+ if (last_index < first_index)
+ last_index = ULONG_MAX;

i_mmap_lock_write(mapping);
if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
- unmap_mapping_range_tree(&mapping->i_mmap, &details);
+ unmap_mapping_range_tree(&mapping->i_mmap, first_index,
+ last_index, &details);
i_mmap_unlock_write(mapping);
}

--
2.31.1

2021-05-28 01:25:57

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 10/27] mm: Pass zap_flags into unmap_mapping_pages()

Give unmap_mapping_pages() more power by letting callers specify a zap flag, so
that it can carry more information than "whether we'd also like to zap cow
pages". With the new flag, we can remove the even_cows parameter because
even_cows==false is equivalent to zap_flags==ZAP_FLAG_CHECK_MAPPING, while
even_cows==true means passing no zap flag (though in most cases we have had
even_cows==false).

No functional change intended.

Signed-off-by: Peter Xu <[email protected]>
---
fs/dax.c | 10 ++++++----
include/linux/mm.h | 4 ++--
mm/khugepaged.c | 3 ++-
mm/memory.c | 15 ++++++++-------
mm/truncate.c | 11 +++++++----
5 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 62352cbcf0f4..09d482c1595b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -528,7 +528,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
xas_unlock_irq(xas);
unmap_mapping_pages(mapping,
xas->xa_index & ~PG_PMD_COLOUR,
- PG_PMD_NR, false);
+ PG_PMD_NR, ZAP_FLAG_CHECK_MAPPING);
xas_reset(xas);
xas_lock_irq(xas);
}
@@ -623,7 +623,8 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping,
* guaranteed to either see new references or prevent new
* references from being established.
*/
- unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0);
+ unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1,
+ ZAP_FLAG_CHECK_MAPPING);

xas_lock_irq(&xas);
xas_for_each(&xas, entry, end_idx) {
@@ -754,9 +755,10 @@ static void *dax_insert_entry(struct xa_state *xas,
/* we are replacing a zero page with block mapping */
if (dax_is_pmd_entry(entry))
unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR,
- PG_PMD_NR, false);
+ PG_PMD_NR, ZAP_FLAG_CHECK_MAPPING);
else /* pte entry */
- unmap_mapping_pages(mapping, index, 1, false);
+ unmap_mapping_pages(mapping, index, 1,
+ ZAP_FLAG_CHECK_MAPPING);
}

xas_reset(xas);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1adf313a01fe..b1fb2826e29c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1803,7 +1803,7 @@ extern int fixup_user_fault(struct mm_struct *mm,
unsigned long address, unsigned int fault_flags,
bool *unlocked);
void unmap_mapping_pages(struct address_space *mapping,
- pgoff_t start, pgoff_t nr, bool even_cows);
+ pgoff_t start, pgoff_t nr, unsigned long zap_flags);
void unmap_mapping_range(struct address_space *mapping,
loff_t const holebegin, loff_t const holelen, int even_cows);
#else
@@ -1823,7 +1823,7 @@ static inline int fixup_user_fault(struct mm_struct *mm, unsigned long address,
return -EFAULT;
}
static inline void unmap_mapping_pages(struct address_space *mapping,
- pgoff_t start, pgoff_t nr, bool even_cows) { }
+ pgoff_t start, pgoff_t nr, unsigned long zap_flags) { }
static inline void unmap_mapping_range(struct address_space *mapping,
loff_t const holebegin, loff_t const holelen, int even_cows) { }
#endif
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 7376a9b5bfc9..9e89a032e2fd 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1830,7 +1830,8 @@ static void collapse_file(struct mm_struct *mm,
}

if (page_mapped(page))
- unmap_mapping_pages(mapping, index, 1, false);
+ unmap_mapping_pages(mapping, index, 1,
+ ZAP_FLAG_CHECK_MAPPING);

xas_lock_irq(&xas);
xas_set(&xas, index);
diff --git a/mm/memory.c b/mm/memory.c
index 8a3751be87ba..319552efc782 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3316,7 +3316,10 @@ static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
* @mapping: The address space containing pages to be unmapped.
* @start: Index of first page to be unmapped.
* @nr: Number of pages to be unmapped. 0 to unmap to end of file.
- * @even_cows: Whether to unmap even private COWed pages.
+ * @zap_flags: Zap flags for the process. E.g., when ZAP_FLAG_CHECK_MAPPING is
+ * passed into it, we will only zap the pages that are in the same mapping
+ * specified in the @mapping parameter; otherwise we will not check mapping,
+ * IOW cow pages will be zapped too.
*
* Unmap the pages in this address space from any userspace process which
* has them mmaped. Generally, you want to remove COWed pages as well when
@@ -3324,17 +3327,14 @@ static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
* cache.
*/
void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
- pgoff_t nr, bool even_cows)
+ pgoff_t nr, unsigned long zap_flags)
{
pgoff_t first_index = start, last_index = start + nr - 1;
struct zap_details details = {
.zap_mapping = mapping,
- .zap_flags = ZAP_FLAG_SKIP_SWAP,
+ .zap_flags = zap_flags | ZAP_FLAG_SKIP_SWAP,
};

- if (!even_cows)
- details.zap_flags |= ZAP_FLAG_CHECK_MAPPING;
-
if (last_index < first_index)
last_index = ULONG_MAX;

@@ -3376,7 +3376,8 @@ void unmap_mapping_range(struct address_space *mapping,
hlen = ULONG_MAX - hba + 1;
}

- unmap_mapping_pages(mapping, hba, hlen, even_cows);
+ unmap_mapping_pages(mapping, hba, hlen, even_cows ?
+ 0 : ZAP_FLAG_CHECK_MAPPING);
}
EXPORT_SYMBOL(unmap_mapping_range);

diff --git a/mm/truncate.c b/mm/truncate.c
index 57a618c4a0d6..85cd84486589 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -172,7 +172,8 @@ truncate_cleanup_page(struct address_space *mapping, struct page *page)
{
if (page_mapped(page)) {
unsigned int nr = thp_nr_pages(page);
- unmap_mapping_pages(mapping, page->index, nr, false);
+ unmap_mapping_pages(mapping, page->index, nr,
+ ZAP_FLAG_CHECK_MAPPING);
}

if (page_has_private(page))
@@ -652,14 +653,15 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
* Zap the rest of the file in one hit.
*/
unmap_mapping_pages(mapping, index,
- (1 + end - index), false);
+ (1 + end - index),
+ ZAP_FLAG_CHECK_MAPPING);
did_range_unmap = 1;
} else {
/*
* Just zap this page
*/
unmap_mapping_pages(mapping, index,
- 1, false);
+ 1, ZAP_FLAG_CHECK_MAPPING);
}
}
BUG_ON(page_mapped(page));
@@ -685,7 +687,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
* get remapped later.
*/
if (dax_mapping(mapping)) {
- unmap_mapping_pages(mapping, start, end - start + 1, false);
+ unmap_mapping_pages(mapping, start, end - start + 1,
+ ZAP_FLAG_CHECK_MAPPING);
}
out:
cleancache_invalidate_inode(mapping);
--
2.31.1

2021-05-28 01:26:00

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 12/27] shmem/userfaultfd: Allow wr-protect none pte for file-backed mem

File-backed memory differs from anonymous memory in that even if the pte is
missing, the data could still reside either in the file or in the page/swap
cache. So when wr-protecting a pte, we need to consider none ptes too.

We do that by installing the uffd-wp special swap pte as a marker. So when
there's a future write to the pte, the fault handler will take the special path
to first fault in the page as read-only, then report to the userfaultfd server
with a wr-protect message.

On the other hand, when unprotecting a page, it's also possible that the pte
got unmapped but replaced by the special uffd-wp marker. Then we'll need to be
able to recover from a uffd-wp special swap pte into a none pte, so that the
next access to the page will fault in correctly as usual through the fault
handler, rather than sending a uffd-wp message.

Special care needs to be taken throughout the change_protection_range()
process. Since we now allow userspace to wr-protect a none pte, we need to be
able to pre-populate the page table entries if we see a !anonymous &&
MM_CP_UFFD_WP request; otherwise change_protection_range() will always skip
ranges where the pgtable entry does not exist.

Note that this patch only covers small pages (pte level) and does not yet cover
transparent huge pages. But it will be a base for THPs too.
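
For completeness, below is a userspace sketch (illustration only, error
handling omitted) of the resolve side a monitor is expected to implement: once
a write hits a wr-protected shmem page, even one that had never been faulted
in before, the fault message carries the WP flag and can be resolved by
clearing the protection over the faulting page:

    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    /* Read one message from the uffd and, if it is a write-protect fault,
     * resolve it by dropping the protection on the faulting page. */
    static void handle_one_wp_fault(int uffd, unsigned long page_size)
    {
            struct uffd_msg msg;

            if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
                    return;
            if (msg.event != UFFD_EVENT_PAGEFAULT ||
                !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
                    return;

            struct uffdio_writeprotect wp = {
                    .range = {
                            .start = msg.arg.pagefault.address & ~(page_size - 1),
                            .len = page_size,
                    },
                    .mode = 0,   /* no _WP bit: unprotect so the writer proceeds */
            };
            ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
    }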

Signed-off-by: Peter Xu <[email protected]>
---
mm/mprotect.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 4b743394afbe..8ec85b276975 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -29,6 +29,7 @@
#include <linux/uaccess.h>
#include <linux/mm_inline.h>
#include <linux/pgtable.h>
+#include <linux/userfaultfd_k.h>
#include <asm/cacheflush.h>
#include <asm/mmu_context.h>
#include <asm/tlbflush.h>
@@ -186,6 +187,32 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
set_pte_at(vma->vm_mm, addr, pte, newpte);
pages++;
}
+ } else if (unlikely(is_swap_special_pte(oldpte))) {
+ if (uffd_wp_resolve && !vma_is_anonymous(vma) &&
+ pte_swp_uffd_wp_special(oldpte)) {
+ /*
+ * This is uffd-wp special pte and we'd like to
+ * unprotect it. What we need to do is simply
+ * recover the pte into a none pte; the next
+ * page fault will fault in the page.
+ */
+ pte_clear(vma->vm_mm, addr, pte);
+ pages++;
+ }
+ } else {
+ /* It must be a none pte, or what else?.. */
+ WARN_ON_ONCE(!pte_none(oldpte));
+ if (unlikely(uffd_wp && !vma_is_anonymous(vma))) {
+ /*
+ * For file-backed mem, we need to be able to
+ * wr-protect even for a none pte! Because
+ * even if the pte is null, the page/swap cache
+ * could exist.
+ */
+ set_pte_at(vma->vm_mm, addr, pte,
+ pte_swp_mkuffd_wp_special(vma));
+ pages++;
+ }
}
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
@@ -219,6 +246,25 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
return 0;
}

+/*
+ * File-backed vma allows uffd wr-protect upon none ptes, because even if pte
+ * is missing, page/swap cache could exist. When that happens, the wr-protect
+ * information will be stored in the page table entries with the marker (e.g.,
+ * PTE_SWP_UFFD_WP_SPECIAL). Prepare for that by always populating the page
+ * tables to pte level, so that we'll install the markers in change_pte_range()
+ * where necessary.
+ *
+ * Note that we only need to do this in pmd level, because if pmd does not
+ * exist, it means the whole range covered by the pmd entry (of a pud) does not
+ * contain any valid data but all zeros. Then nothing to wr-protect.
+ */
+#define change_protection_prepare(vma, pmd, addr, cp_flags) \
+ do { \
+ if (unlikely((cp_flags & MM_CP_UFFD_WP) && pmd_none(*pmd) && \
+ !vma_is_anonymous(vma))) \
+ WARN_ON_ONCE(pte_alloc(vma->vm_mm, pmd)); \
+ } while (0)
+
static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
pud_t *pud, unsigned long addr, unsigned long end,
pgprot_t newprot, unsigned long cp_flags)
@@ -237,6 +283,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,

next = pmd_addr_end(addr, end);

+ change_protection_prepare(vma, pmd, addr, cp_flags);
+
/*
* Automatic NUMA balancing walks the tables with mmap_lock
* held for read. It's possible a parallel update to occur
--
2.31.1

2021-05-28 01:27:11

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 17/27] mm/hugetlb: Introduce huge pte version of uffd-wp helpers

They will be used in follow-up patches to check/set/clear the uffd-wp bit of a
huge pte.

So far they reuse the small pte helpers. Archs can override these versions when
necessary (with __HAVE_ARCH_HUGE_PTE_UFFD_WP* macros) in the future.

Signed-off-by: Peter Xu <[email protected]>
---
include/asm-generic/hugetlb.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index 8e1e6244a89d..c45b9deb41ff 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -32,6 +32,21 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
return pte_modify(pte, newprot);
}

+static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
+{
+ return pte_mkuffd_wp(pte);
+}
+
+static inline pte_t huge_pte_clear_uffd_wp(pte_t pte)
+{
+ return pte_clear_uffd_wp(pte);
+}
+
+static inline int huge_pte_uffd_wp(pte_t pte)
+{
+ return pte_uffd_wp(pte);
+}
+
#ifndef __HAVE_ARCH_HUGE_PTE_CLEAR
static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, unsigned long sz)
--
2.31.1

2021-05-28 01:28:12

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 23/27] hugetlb/userfaultfd: Allow wr-protect none ptes

Teach the hugetlbfs code to wr-protect none ptes, in case the page cache
exists for that pte. Meanwhile we also need to be able to recognize a uffd-wp
marker pte and remove it for uffd_wp_resolve.

While at it, introduce a variable "psize" to replace the repeated
huge_page_size(h) calls.

Reviewed-by: Mike Kravetz <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 29 +++++++++++++++++++++++++----
1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a17d894312c0..c4dd0c531bb5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5486,7 +5486,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
pte_t *ptep;
pte_t pte;
struct hstate *h = hstate_vma(vma);
- unsigned long pages = 0;
+ unsigned long pages = 0, psize = huge_page_size(h);
bool shared_pmd = false;
struct mmu_notifier_range range;
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
@@ -5506,13 +5506,19 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,

mmu_notifier_invalidate_range_start(&range);
i_mmap_lock_write(vma->vm_file->f_mapping);
- for (; address < end; address += huge_page_size(h)) {
+ for (; address < end; address += psize) {
spinlock_t *ptl;
- ptep = huge_pte_offset(mm, address, huge_page_size(h));
+ ptep = huge_pte_offset(mm, address, psize);
if (!ptep)
continue;
ptl = huge_pte_lock(h, mm, ptep);
if (huge_pmd_unshare(mm, vma, &address, ptep)) {
+ /*
+ * When uffd-wp is enabled on the vma, unshare
+ * shouldn't happen at all. Warn about it if it
+ * happens for some reason.
+ */
+ WARN_ON_ONCE(uffd_wp || uffd_wp_resolve);
pages++;
spin_unlock(ptl);
shared_pmd = true;
@@ -5537,12 +5543,21 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
else if (uffd_wp_resolve)
newpte = pte_swp_clear_uffd_wp(newpte);
set_huge_swap_pte_at(mm, address, ptep,
- newpte, huge_page_size(h));
+ newpte, psize);
pages++;
}
spin_unlock(ptl);
continue;
}
+ if (unlikely(is_swap_special_pte(pte))) {
+ WARN_ON_ONCE(!pte_swp_uffd_wp_special(pte));
+ /*
+ * This is changing a non-present pte into a none pte,
+ * no need for huge_ptep_modify_prot_start/commit().
+ */
+ if (uffd_wp_resolve)
+ huge_pte_clear(mm, address, ptep, psize);
+ }
if (!huge_pte_none(pte)) {
pte_t old_pte;
unsigned int shift = huge_page_shift(hstate_vma(vma));
@@ -5556,6 +5571,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
pte = huge_pte_clear_uffd_wp(pte);
huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
pages++;
+ } else {
+ /* None pte */
+ if (unlikely(uffd_wp))
+ /* Safe to modify directly (none->non-present). */
+ set_huge_pte_at(mm, address, ptep,
+ pte_swp_mkuffd_wp_special(vma));
}
spin_unlock(ptl);
}
--
2.31.1

2021-05-28 01:28:28

by Peter Xu

[permalink] [raw]
Subject: [PATCH v3 27/27] userfaultfd/selftests: Enable uffd-wp for shmem/hugetlbfs

After adding support for shmem and hugetlbfs, we can now always turn the
uffd-wp test on.

Define HUGETLB_EXPECTED_IOCTLS to avoid using UFFD_API_RANGE_IOCTLS_BASIC,
because UFFD_API_RANGE_IOCTLS_BASIC is normally a superset of capabilities,
while the test may not satisfy them all. E.g., when hugetlb is registered
without minor mode, we need to explicitly remove _UFFDIO_CONTINUE. The same
applies to uffd-wp: we need to explicitly remove _UFFDIO_WRITEPROTECT if the
range is not registered with uffd-wp.

In the long term, we may consider dropping the UFFD_API_* macros completely
from the uapi/linux/userfaultfd.h header, because otherwise a kernel header
update could easily break userspace.

Signed-off-by: Peter Xu <[email protected]>
---
tools/testing/selftests/vm/userfaultfd.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index e363bdaff59d..015f2df8ece4 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -80,7 +80,7 @@ static int test_type;
static volatile bool test_uffdio_copy_eexist = true;
static volatile bool test_uffdio_zeropage_eexist = true;
/* Whether to test uffd write-protection */
-static bool test_uffdio_wp = false;
+static bool test_uffdio_wp = true;
/* Whether to test uffd minor faults */
static bool test_uffdio_minor = false;

@@ -320,6 +320,9 @@ struct uffd_test_ops {
(1 << _UFFDIO_ZEROPAGE) | \
(1 << _UFFDIO_WRITEPROTECT))

+#define HUGETLB_EXPECTED_IOCTLS ((1 << _UFFDIO_WAKE) | \
+ (1 << _UFFDIO_COPY))
+
static struct uffd_test_ops anon_uffd_test_ops = {
.expected_ioctls = ANON_EXPECTED_IOCTLS,
.allocate_area = anon_allocate_area,
@@ -335,7 +338,7 @@ static struct uffd_test_ops shmem_uffd_test_ops = {
};

static struct uffd_test_ops hugetlb_uffd_test_ops = {
- .expected_ioctls = UFFD_API_RANGE_IOCTLS_BASIC & ~(1 << _UFFDIO_CONTINUE),
+ .expected_ioctls = HUGETLB_EXPECTED_IOCTLS,
.allocate_area = hugetlb_allocate_area,
.release_pages = hugetlb_release_pages,
.alias_mapping = hugetlb_alias_mapping,
@@ -1580,8 +1583,6 @@ static void set_test_type(const char *type)
if (!strcmp(type, "anon")) {
test_type = TEST_ANON;
uffd_test_ops = &anon_uffd_test_ops;
- /* Only enable write-protect test for anonymous test */
- test_uffdio_wp = true;
} else if (!strcmp(type, "hugetlb")) {
test_type = TEST_HUGETLB;
uffd_test_ops = &hugetlb_uffd_test_ops;
--
2.31.1

2021-05-28 08:34:20

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v3 04/27] mm/userfaultfd: Introduce special pte for unmapped file-backed mem

On Friday, 28 May 2021 6:19:04 AM AEST Peter Xu wrote:
> This patch introduces a very special swap-like pte for file-backed memories.
>
> Currently it's only defined for x86_64 only, but as long as any arch that
> can properly define the UFFD_WP_SWP_PTE_SPECIAL value as requested, it
> should conceptually work too.
>
> We will use this special pte to arm the ptes that got either unmapped or
> swapped out for a file-backed region that was previously wr-protected. This
> special pte could trigger a page fault just like swap entries, and as long
> as the page fault will satisfy pte_none()==false && pte_present()==false.
>
> Then we can revive the special pte into a normal pte backed by the page
> cache.
>
> This idea is greatly inspired by Hugh and Andrea in the discussion, which is
> referenced in the links below.
>
> The other idea (from Hugh) is that we use swp_type==1 and swp_offset=0 as
> the special pte. The current solution (as pointed out by Andrea) is
> slightly preferred in that we don't even need swp_entry_t knowledge at all
> in trapping these accesses. Meanwhile, we also reuse _PAGE_SWP_UFFD_WP
> from the anonymous swp entries.

So to confirm my understanding the reason you use this special swap pte
instead of a new swp_type is that you only need the fault and have no extra
information that needs storing in the pte?

Personally I think it might be better to define a new swp_type for this rather
than introducing a new arch-specific concept. swp_type entries are portable so
wouldn't need extra arch-specific bits defined. And as I understand things not
all architectures (eg. ARM) have spare bits in their swap entry encoding
anyway so would have to reserve a bit specifically for this which would be
less efficient than using a swp_type.

Anyway it seems I missed the initial discussion so don't have a strong opinion
here, mainly just wanted to check my understanding of what's required and how
these special entries work.

> This patch only introduces the special pte and its operators. It's not yet
> applied to have any functional difference.
>
> Link: https://lore.kernel.org/lkml/[email protected]/
> Link: https://lore.kernel.org/lkml/[email protected]/
> Suggested-by: Andrea Arcangeli <[email protected]>
> Suggested-by: Hugh Dickins <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> arch/x86/include/asm/pgtable.h | 28 ++++++++++++++++++++++++++++
> include/asm-generic/pgtable_uffd.h | 3 +++
> include/linux/userfaultfd_k.h | 21 +++++++++++++++++++++
> 3 files changed, 52 insertions(+)
>
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index b1099f2d9800..9781ba2da049 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1329,6 +1329,34 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t
> pmd) #endif
>
> #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +
> +/*
> + * This is a very special swap-like pte that marks this pte as
> "wr-protected" + * by userfaultfd-wp. It should only exist for file-backed
> memory where the + * page (previously got wr-protected) has been unmapped
> or swapped out. + *
> + * For anonymous memories, the userfaultfd-wp _PAGE_SWP_UFFD_WP bit is kept
> + * along with a real swp entry instead.
> + *
> + * Let's make some rules for this special pte:
> + *
> + * (1) pte_none()==false, so that it'll not trigger a missing page fault.
> + *
> + * (2) pte_present()==false, so that it's recognized as swap (is_swap_pte).
> + *
> + * (3) pte_swp_uffd_wp()==true, so it can be tested just like a swap pte
> that + * contains a valid swap entry, so that we can check a swap pte
> always + * using "is_swap_pte() && pte_swp_uffd_wp()" without caring
> about whether + * there's one swap entry inside of the pte.
> + *
> + * (4) It should not be a valid swap pte anywhere, so that when we see this
> pte + * we know it does not contain a swap entry.
> + *
> + * For x86, the simplest special pte which satisfies all of above should be
> the + * pte with only _PAGE_SWP_UFFD_WP bit set (where
> swp_type==swp_offset==0). + */
> +#define UFFD_WP_SWP_PTE_SPECIAL __pte(_PAGE_SWP_UFFD_WP)
> +
> static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
> {
> return pte_set_flags(pte, _PAGE_SWP_UFFD_WP);
> diff --git a/include/asm-generic/pgtable_uffd.h
> b/include/asm-generic/pgtable_uffd.h index 828966d4c281..95e9811ce9d1
> 100644
> --- a/include/asm-generic/pgtable_uffd.h
> +++ b/include/asm-generic/pgtable_uffd.h
> @@ -2,6 +2,9 @@
> #define _ASM_GENERIC_PGTABLE_UFFD_H
>
> #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +
> +#define UFFD_WP_SWP_PTE_SPECIAL __pte(0)
> +
> static __always_inline int pte_uffd_wp(pte_t pte)
> {
> return 0;
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 331d2ccf0bcc..93f932b53a71 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -145,6 +145,17 @@ extern int userfaultfd_unmap_prep(struct vm_area_struct
> *vma, extern void userfaultfd_unmap_complete(struct mm_struct *mm,
> struct list_head *uf);
>
> +static inline pte_t pte_swp_mkuffd_wp_special(struct vm_area_struct *vma)
> +{
> + WARN_ON_ONCE(vma_is_anonymous(vma));
> + return UFFD_WP_SWP_PTE_SPECIAL;
> +}
> +
> +static inline bool pte_swp_uffd_wp_special(pte_t pte)
> +{
> + return pte_same(pte, UFFD_WP_SWP_PTE_SPECIAL);
> +}
> +
> #else /* CONFIG_USERFAULTFD */
>
> /* mm helpers */
> @@ -234,6 +245,16 @@ static inline void userfaultfd_unmap_complete(struct
> mm_struct *mm, {
> }
>
> +static inline pte_t pte_swp_mkuffd_wp_special(struct vm_area_struct *vma)
> +{
> + return __pte(0);
> +}
> +
> +static inline bool pte_swp_uffd_wp_special(pte_t pte)
> +{
> + return false;
> +}
> +
> #endif /* CONFIG_USERFAULTFD */
>
> #endif /* _LINUX_USERFAULTFD_K_H */




2021-05-28 12:57:06

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v3 04/27] mm/userfaultfd: Introduce special pte for unmapped file-backed mem

On Fri, May 28, 2021 at 06:32:52PM +1000, Alistair Popple wrote:
> On Friday, 28 May 2021 6:19:04 AM AEST Peter Xu wrote:
> > This patch introduces a very special swap-like pte for file-backed memories.
> >
> > Currently it's only defined for x86_64 only, but as long as any arch that
> > can properly define the UFFD_WP_SWP_PTE_SPECIAL value as requested, it
> > should conceptually work too.
> >
> > We will use this special pte to arm the ptes that got either unmapped or
> > swapped out for a file-backed region that was previously wr-protected. This
> > special pte could trigger a page fault just like swap entries, and as long
> > as the page fault will satisfy pte_none()==false && pte_present()==false.
> >
> > Then we can revive the special pte into a normal pte backed by the page
> > cache.
> >
> > This idea is greatly inspired by Hugh and Andrea in the discussion, which is
> > referenced in the links below.
> >
> > The other idea (from Hugh) is that we use swp_type==1 and swp_offset=0 as
> > the special pte. The current solution (as pointed out by Andrea) is
> > slightly preferred in that we don't even need swp_entry_t knowledge at all
> > in trapping these accesses. Meanwhile, we also reuse _PAGE_SWP_UFFD_WP
> > from the anonymous swp entries.
>
> So to confirm my understanding the reason you use this special swap pte
> instead of a new swp_type is that you only need the fault and have no extra
> information that needs storing in the pte?

Yes.

>
> Personally I think it might be better to define a new swp_type for this rather
> than introducing a new arch-specific concept.

The concept should not be arch-specific, it's the pte that's arch-specific.

> swp_type entries are portable so wouldn't need extra arch-specific bits
> defined. And as I understand things not all architectures (eg. ARM) have
> spare bits in their swap entry encoding anyway so would have to reserve a bit
> specifically for this which would be less efficient than using a swp_type.

It looks like a trade-off to me: I think it's fine to use a swap type in my
series - as you said it's portable - but it will also waste swap address space
for the arch when the arch enables it.

The format of the special pte that triggers the fault should be only a small
portion of the code change in this series. The main logic stays the same - we
just replace this pte with that one. IMHO it also means the format can be
changed in the future; it's just that I don't know whether it's wise to take
over a new swap type from the start.

>
> Anyway it seems I missed the initial discussion so don't have a strong opinion
> here, mainly just wanted to check my understanding of what's required and how
> these special entries work.

Thanks for mentioning this and joining the discussion. I don't know ARM well
enough, so it's good to know we may have an issue finding the bits. Actually,
before finding this bit for file-backed uffd-wp specifically, we first need to
find a bit in the normal pte for ARM anyway (see _PAGE_UFFD_WP). If there's no
strong reason to switch to a new swap type, I'd tend to leave all of this to
the future when we enable it on ARM.

--
Peter Xu

2021-06-02 14:41:33

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v3 00/27] userfaultfd-wp: Support shmem and hugetlbfs

On Thu, May 27, 2021 at 04:19:00PM -0400, Peter Xu wrote:
> This is v3 of uffd-wp shmem & hugetlbfs support, which completes uffd-wp as a
> kernel full feature, as it only supports anonymous before this series. It's
> based on latest v5.13-rc3-mmots-2021-05-25-20-12.

Andrew,

Any suggestion on how I should move on with this series?

Thanks,

--
Peter Xu

2021-06-02 22:39:03

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v3 00/27] userfaultfd-wp: Support shmem and hugetlbfs

On Wed, 2 Jun 2021 10:40:04 -0400 Peter Xu <[email protected]> wrote:

> On Thu, May 27, 2021 at 04:19:00PM -0400, Peter Xu wrote:
> > This is v3 of uffd-wp shmem & hugetlbfs support, which completes uffd-wp as a
> > kernel full feature, as it only supports anonymous before this series. It's
> > based on latest v5.13-rc3-mmots-2021-05-25-20-12.
>
> Andrew,
>
> Any suggestion on how I should move on with this series?
>

It is large, and thinly reviewed. I haven't seriously looked at it
yet. If nothing much else happens I might toss it in there for some
additional exposure but I do think more input from other developers is
needed before we go further.

2021-06-03 00:11:07

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v3 00/27] userfaultfd-wp: Support shmem and hugetlbfs

On Wed, Jun 02, 2021 at 03:36:06PM -0700, Andrew Morton wrote:
> On Wed, 2 Jun 2021 10:40:04 -0400 Peter Xu <[email protected]> wrote:
>
> > On Thu, May 27, 2021 at 04:19:00PM -0400, Peter Xu wrote:
> > > This is v3 of uffd-wp shmem & hugetlbfs support, which completes uffd-wp as a
> > > kernel full feature, as it only supports anonymous before this series. It's
> > > based on latest v5.13-rc3-mmots-2021-05-25-20-12.
> >
> > Andrew,
> >
> > Any suggestion on how I should move on with this series?
> >
>
> It is large, and thinly reviewed. I haven't seriously looked at it
> yet. If nothing much else happens I might toss it in there for some
> additional exposure but I do think more input from other developers is
> needed before we go further.

It's just that the 1st RFC series was posted ~6 months ago and the major things
should be mostly the same since then (we've got a few patches merged; but
mostly for the sake of dependencies of other projects):

https://lore.kernel.org/lkml/[email protected]/

So I'm starting to doubt whether I should ask you for help (after bothering at
least Mike and Hugh already :), as I have very little confidence that this
series will be reviewed thoroughly in the near future if I do nothing.

But I definitely agree with you, it's still a relatively large changeset
without much review done. Let's keep it there for some more time, then.

Thanks,

--
Peter Xu

2021-06-03 11:57:19

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v3 04/27] mm/userfaultfd: Introduce special pte for unmapped file-backed mem

On Friday, 28 May 2021 10:56:02 PM AEST Peter Xu wrote:
> On Fri, May 28, 2021 at 06:32:52PM +1000, Alistair Popple wrote:
> > On Friday, 28 May 2021 6:19:04 AM AEST Peter Xu wrote:
> > > This patch introduces a very special swap-like pte for file-backed
> > > memories.
> > >
> > > Currently it's only defined for x86_64 only, but as long as any arch
> > > that
> > > can properly define the UFFD_WP_SWP_PTE_SPECIAL value as requested, it
> > > should conceptually work too.
> > >
> > > We will use this special pte to arm the ptes that got either unmapped or
> > > swapped out for a file-backed region that was previously wr-protected.
> > > This special pte could trigger a page fault just like swap entries, and
> > > as long as the page fault will satisfy pte_none()==false &&
> > > pte_present()==false.
> > >
> > > Then we can revive the special pte into a normal pte backed by the page
> > > cache.
> > >
> > > This idea is greatly inspired by Hugh and Andrea in the discussion,
> > > which is referenced in the links below.
> > >
> > > The other idea (from Hugh) is that we use swp_type==1 and swp_offset=0
> > > as
> > > the special pte. The current solution (as pointed out by Andrea) is
> > > slightly preferred in that we don't even need swp_entry_t knowledge at
> > > all
> > > in trapping these accesses. Meanwhile, we also reuse _PAGE_SWP_UFFD_WP
> > > from the anonymous swp entries.
> >
> > So to confirm my understanding the reason you use this special swap pte
> > instead of a new swp_type is that you only need the fault and have no
> > extra
> > information that needs storing in the pte?
>
> Yes.
>
> > Personally I think it might be better to define a new swp_type for this
> > rather than introducing a new arch-specific concept.
>
> The concept should not be arch-specific, it's the pte that's arch-specific.

Right, agree this is a minor detail.

> > swp_type entries are portable so wouldn't need extra arch-specific bits
> > defined. And as I understand things not all architectures (eg. ARM) have
> > spare bits in their swap entry encoding anyway so would have to reserve a
> > bit specifically for this which would be less efficient than using a
> > swp_type.
> It looks a trade-off to me: I think it's fine to use swap type in my series,
> as you said it's portable, but it will also waste the swap address space
> for the arch when the arch enables it.
>
> The format of the special pte to trigger the fault in this series should be
> only a small portion of the code change. The main logic should still be the
> same - we just replace this pte with that one. IMHO it also means the
> format can be changed in the future, it's just that I don't know whether
> it's wise to take over a new swap type from start.
>
> > Anyway it seems I missed the initial discussion so don't have a strong
> > opinion here, mainly just wanted to check my understanding of what's
> > required and how these special entries work.
>
> Thanks for mentioning this and join the discussion. I don't know ARM enough
> so good to know we may have issue on finding the bits. Actually before
> finding this bit for file-backed uffd-wp specifically, we need to firstly
> find a bit in the normal pte for ARM too anyways (see _PAGE_UFFD_WP). If
> there's no strong reason to switch to a new swap type, I'd tend to leave
> all these to the future when we enable them on ARM.

Yeah, makes sense to me. As you say it should be easy to change and other
architectures need to find another bit anyway. Not sure how useful it will be
but I'll try and take a look over the rest of the series as well.

> --
> Peter Xu




2021-06-03 14:54:06

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v3 04/27] mm/userfaultfd: Introduce special pte for unmapped file-backed mem

On Thu, Jun 03, 2021 at 09:53:45PM +1000, Alistair Popple wrote:
> On Friday, 28 May 2021 10:56:02 PM AEST Peter Xu wrote:
> > On Fri, May 28, 2021 at 06:32:52PM +1000, Alistair Popple wrote:
> > > On Friday, 28 May 2021 6:19:04 AM AEST Peter Xu wrote:
> > > > This patch introduces a very special swap-like pte for file-backed
> > > > memories.
> > > >
> > > > Currently it's only defined for x86_64 only, but as long as any arch
> > > > that
> > > > can properly define the UFFD_WP_SWP_PTE_SPECIAL value as requested, it
> > > > should conceptually work too.
> > > >
> > > > We will use this special pte to arm the ptes that got either unmapped or
> > > > swapped out for a file-backed region that was previously wr-protected.
> > > > This special pte could trigger a page fault just like swap entries, and
> > > > as long as the page fault will satisfy pte_none()==false &&
> > > > pte_present()==false.
> > > >
> > > > Then we can revive the special pte into a normal pte backed by the page
> > > > cache.
> > > >
> > > > This idea is greatly inspired by Hugh and Andrea in the discussion,
> > > > which is referenced in the links below.
> > > >
> > > > The other idea (from Hugh) is that we use swp_type==1 and swp_offset=0
> > > > as
> > > > the special pte. The current solution (as pointed out by Andrea) is
> > > > slightly preferred in that we don't even need swp_entry_t knowledge at
> > > > all
> > > > in trapping these accesses. Meanwhile, we also reuse _PAGE_SWP_UFFD_WP
> > > > from the anonymous swp entries.
> > >
> > > So to confirm my understanding the reason you use this special swap pte
> > > instead of a new swp_type is that you only need the fault and have no
> > > extra
> > > information that needs storing in the pte?
> >
> > Yes.
> >
> > > Personally I think it might be better to define a new swp_type for this
> > > rather than introducing a new arch-specific concept.
> >
> > The concept should not be arch-specific, it's the pte that's arch-specific.
>
> Right, agree this is a minor detail.

I can't say it's a minor detail, as that's still indeed one of the major ideas
that I'd like to get comments on within the whole series. It's currently an
outcome of the previous discussion with Andrea and Hugh, but of course if
there's a better idea with reasoning I can always consider reworking the
series.

>
> > > swp_type entries are portable so wouldn't need extra arch-specific bits
> > > defined. And as I understand things not all architectures (eg. ARM) have
> > > spare bits in their swap entry encoding anyway so would have to reserve a
> > > bit specifically for this which would be less efficient than using a
> > > swp_type.
> > It looks a trade-off to me: I think it's fine to use swap type in my series,
> > as you said it's portable, but it will also waste the swap address space
> > for the arch when the arch enables it.
> >
> > The format of the special pte to trigger the fault in this series should be
> > only a small portion of the code change. The main logic should still be the
> > same - we just replace this pte with that one. IMHO it also means the
> > format can be changed in the future, it's just that I don't know whether
> > it's wise to take over a new swap type from start.
> >
> > > Anyway it seems I missed the initial discussion so don't have a strong
> > > opinion here, mainly just wanted to check my understanding of what's
> > > required and how these special entries work.
> >
> > Thanks for mentioning this and join the discussion. I don't know ARM enough
> > so good to know we may have issue on finding the bits. Actually before
> > finding this bit for file-backed uffd-wp specifically, we need to firstly
> > find a bit in the normal pte for ARM too anyways (see _PAGE_UFFD_WP). If
> > there's no strong reason to switch to a new swap type, I'd tend to leave
> > all these to the future when we enable them on ARM.
>
> Yeah, makes sense to me. As you say it should be easy to change and other
> architectures need to find another bit anyway. Not sure how useful it will be
> but I'll try and take a look over the rest of the series as well.

I'll highly appreciate that. Thanks Alistair!

--
Peter Xu

2021-06-04 00:56:59

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v3 04/27] mm/userfaultfd: Introduce special pte for unmapped file-backed mem

On Friday, 4 June 2021 12:51:19 AM AEST Peter Xu wrote:
> On Thu, Jun 03, 2021 at 09:53:45PM +1000, Alistair Popple wrote:
> > On Friday, 28 May 2021 10:56:02 PM AEST Peter Xu wrote:
> > > On Fri, May 28, 2021 at 06:32:52PM +1000, Alistair Popple wrote:
> > > > On Friday, 28 May 2021 6:19:04 AM AEST Peter Xu wrote:
> > > > > This patch introduces a very special swap-like pte for file-backed
> > > > > memories.
> > > > >
> > > > > Currently it's only defined for x86_64 only, but as long as any arch
> > > > > that
> > > > > can properly define the UFFD_WP_SWP_PTE_SPECIAL value as requested,
> > > > > it
> > > > > should conceptually work too.
> > > > >
> > > > > We will use this special pte to arm the ptes that got either
> > > > > unmapped or
> > > > > swapped out for a file-backed region that was previously
> > > > > wr-protected.
> > > > > This special pte could trigger a page fault just like swap entries,
> > > > > and
> > > > > as long as the page fault will satisfy pte_none()==false &&
> > > > > pte_present()==false.
> > > > >
> > > > > Then we can revive the special pte into a normal pte backed by the
> > > > > page
> > > > > cache.
> > > > >
> > > > > This idea is greatly inspired by Hugh and Andrea in the discussion,
> > > > > which is referenced in the links below.
> > > > >
> > > > > The other idea (from Hugh) is that we use swp_type==1 and
> > > > > swp_offset=0
> > > > > as
> > > > > the special pte. The current solution (as pointed out by Andrea) is
> > > > > slightly preferred in that we don't even need swp_entry_t knowledge
> > > > > at
> > > > > all
> > > > > in trapping these accesses. Meanwhile, we also reuse
> > > > > _PAGE_SWP_UFFD_WP
> > > > > from the anonymous swp entries.
> > > >
> > > > So to confirm my understanding the reason you use this special swap
> > > > pte
> > > > instead of a new swp_type is that you only need the fault and have no
> > > > extra
> > > > information that needs storing in the pte?
> > >
> > > Yes.
> > >
> > > > Personally I think it might be better to define a new swp_type for
> > > > this
> > > > rather than introducing a new arch-specific concept.
> > >
> > > The concept should not be arch-specific, it's the pte that's
> > > arch-specific.
> >
> > Right, agree this is a minor detail.
>
> I can't say it's a minor detail, as that's still indeed one of the major
> ideas that I'd like to get comment for within the whole series. It's
> currently an outcome from previous discussion with Andrea and Hugh, but of
> course if there's better idea with reasoning I can always consider to
> rework the series.

Sorry, I wasn't very clear there. What I meant is that the high-level, arch-
independent concept of using a special swap pte for this is the most important
aspect of the design, and it looks good to me.

The detail which is perhaps less important is whether to implement this using
a new swap entry type or arch-specific swap bit. The argument for using a swap
type is it will work across architectures due to the use of pte_to_swp_entry()
and swp_entry_to_pte() to convert to and from the arch-dependent and
independent representations.

The argument against seems to have been that it is wasting a swap type.
However if I'm understanding correctly that's not true for all architectures,
and needing to reserve a bit is more wasteful than using a swap type. For
example ARM encodes swap entries like so:

* Encode and decode a swap entry. Swap entries are stored in the Linux
* page tables as follows:
*
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
* <--------------- offset ------------------------> < type -> 0 0

So the only way to get a spare bit is to reduce the width of type (or offset)
which would halve the number of swap types. And if I understand correctly the
same argument might apply to x86 - the spare bit being used here could instead
be used to expand the width of type if a lack of available swap types is a
concern.
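
As a rough sketch of that alternative (not from this series; SWP_UFFD_WP_SPECIAL
is a made-up name for a new reserved swap type, carved out next to the existing
migration/hwpoison/device entries), the helpers could look like:

static inline pte_t pte_swp_mkuffd_wp_special(struct vm_area_struct *vma)
{
	return swp_entry_to_pte(swp_entry(SWP_UFFD_WP_SPECIAL, 0));
}

static inline bool pte_swp_uffd_wp_special(pte_t pte)
{
	return is_swap_pte(pte) &&
	       swp_type(pte_to_swp_entry(pte)) == SWP_UFFD_WP_SPECIAL;
}

That keeps the check portable across architectures, at the cost of consuming
one of the remaining swap types.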

> > > > swp_type entries are portable so wouldn't need extra arch-specific
> > > > bits
> > > > defined. And as I understand things not all architectures (eg. ARM)
> > > > have
> > > > spare bits in their swap entry encoding anyway so would have to
> > > > reserve a
> > > > bit specifically for this which would be less efficient than using a
> > > > swp_type.
> > >
> > > It looks a trade-off to me: I think it's fine to use swap type in my
> > > series, as you said it's portable, but it will also waste the swap
> > > address space for the arch when the arch enables it.
> > >
> > > The format of the special pte to trigger the fault in this series should
> > > be
> > > only a small portion of the code change. The main logic should still be
> > > the same - we just replace this pte with that one. IMHO it also means
> > > the format can be changed in the future, it's just that I don't know
> > > whether it's wise to take over a new swap type from start.
> > >
> > > > Anyway it seems I missed the initial discussion so don't have a strong
> > > > opinion here, mainly just wanted to check my understanding of what's
> > > > required and how these special entries work.
> > >
> > > Thanks for mentioning this and join the discussion. I don't know ARM
> > > enough
> > > so good to know we may have issue on finding the bits. Actually before
> > > finding this bit for file-backed uffd-wp specifically, we need to
> > > firstly
> > > find a bit in the normal pte for ARM too anyways (see _PAGE_UFFD_WP).
> > > If
> > > there's no strong reason to switch to a new swap type, I'd tend to leave
> > > all these to the future when we enable them on ARM.
> >
> > Yeah, makes sense to me. As you say it should be easy to change and other
> > architectures need to find another bit anyway. Not sure how useful it will
> > be but I'll try and take a look over the rest of the series as well.
>
> I'll highly appreciate that. Thanks Alistair!
>
> --
> Peter Xu




2021-06-04 03:18:03

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v3 04/27] mm/userfaultfd: Introduce special pte for unmapped file-backed mem

On Fri, 4 Jun 2021, Alistair Popple wrote:
>
> The detail which is perhaps less important is whether to implement this using
> a new swap entry type or arch-specific swap bit. The argument for using a swap
> type is it will work across architectures due to the use of pte_to_swp_entry()
> and swp_entry_to_pte() to convert to and from the arch-dependent and
> independent representations.
>
> The argument against seems to have been that it is wasting a swap type.
> However if I'm understanding correctly that's not true for all architectures,
> and needing to reserve a bit is more wasteful than using a swap type.

I'm on the outside, not paying much attention here,
but thought Peter would have cleared this up already.

My understanding is that it does *not* use an additional arch-dependent
bit, but puts the _PAGE_UFFD_WP bit (already set aside by any architecture
implementing UFFD WP) to an additional use. That's why I called this
design (from Andrea) more elegant than mine (swap type business).

If I've got that wrong, and yet another arch-dependent bit is needed,
then I very much agree with you: finding arch-dependent pte bits is a
much tougher job than another play with swap type.

(And "more elegant" might not be "easier to understand": you decide.)

Hugh

2021-06-04 06:17:57

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v3 04/27] mm/userfaultfd: Introduce special pte for unmapped file-backed mem

On Friday, 4 June 2021 1:14:31 PM AEST Hugh Dickins wrote:
> On Fri, 4 Jun 2021, Alistair Popple wrote:
> >
> > The detail which is perhaps less important is whether to implement this using
> > a new swap entry type or arch-specific swap bit. The argument for using a swap
> > type is it will work across architectures due to the use of pte_to_swp_entry()
> > and swp_entry_to_pte() to convert to and from the arch-dependent and
> > independent representations.
> >
> > The argument against seems to have been that it is wasting a swap type.
> > However if I'm understanding correctly that's not true for all architectures,
> > and needing to reserve a bit is more wasteful than using a swap type.
>
> I'm on the outside, not paying much attention here,
> but thought Peter would have cleared this up already.
>
> My understanding is that it does *not* use an additional arch-dependent
> bit, but puts the _PAGE_UFFD_WP bit (already set aside by any architecture
> implementing UFFD WP) to an additional use. That's why I called this
> design (from Andrea) more elegant than mine (swap type business).

Oh my bad, I had somehow missed this was reusing an *existing* arch-dependent
swap bit (_PAGE_SWP_UFFD_WP, although the same argument could apply) even
though it's in the commit message. Obviously I should have read that more
carefully, apologies for the noise but thanks for the clarification.

> If I've got that wrong, and yet another arch-dependent bit is needed,
> then I very much agree with you: finding arch-dependent pte bits is a
> much tougher job than another play with swap type.
>
> (And "more elegant" might not be "easier to understand": you decide.)

Agree, that's a somewhat subjective debate. Conceptually I don't think this is
particularly difficult to understand. It just adds another slightly different
class of special swap ptes to know about.

- Alistair

> Hugh




2021-06-04 16:05:39

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v3 04/27] mm/userfaultfd: Introduce special pte for unmapped file-backed mem

On Fri, Jun 04, 2021 at 04:16:30PM +1000, Alistair Popple wrote:
> > My understanding is that it does *not* use an additional arch-dependent
> > bit, but puts the _PAGE_UFFD_WP bit (already set aside by any architecture
> > implementing UFFD WP) to an additional use. That's why I called this
> > design (from Andrea) more elegant than mine (swap type business).
>
> Oh my bad, I had somehow missed this was reusing an *existing* arch-dependent
> swap bit (_PAGE_SWP_UFFD_WP, although the same argument could apply) even
> though it's in the commit message. Obviously I should have read that more
> carefully, apologies for the noise but thanks for the clarification.

Right, as Hugh mentioned, what this series wants to use is one explicit pte
that no one should ever be using, so ideally that is the most economical
approach from the address-space point of view.

Meanwhile I think that pte doesn't actually need to be related to _PAGE_UFFD_WP
at all; as long as it's a specific pte value it will serve the same goal (even
if we reused a new swp type, I'd probably only use one pte of it and leave the
rest for other uses; but who knows who will start to use the rest!).

I kept using it because it was suggested by Andrea (it actually has
type==off==0 as Hugh suggested too - so it keeps a bit of both suggestions!),
and it's a good choice since (1) it has never been used by anyone before, and
(2) it is _somehow_ related to uffd-wp already by having that specific bit set
in the special pte, which is also the only bit set in the u64 field.

It also looks very nice when debugging, because when I dump the ptes it reads
0x4 on x86.. so the pte value is even easy to read as a number. :)
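
For reference, with the existing x86 definitions (where _PAGE_SWP_UFFD_WP
reuses _PAGE_USER, i.e. bit 2), that value works out as:

#define UFFD_WP_SWP_PTE_SPECIAL	__pte(_PAGE_SWP_UFFD_WP)	/* == __pte(0x4) on x86 */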

However I can see that it is less easy to follow than the swap type solution.
In any case it's still something worth thinking about before using up the swap
types - they're not plentiful, and we keep shrinking MAX_SWAPFILES.. so let's
see whether uffd-wp could be the 1st to open a new field of the unused
"invalid/swap pte" address space.

Meanwhile, I did have a look at ARM regarding supporting uffd-wp in general,
starting from anonymous pages. I doubt whether it can be done for older ARMs
(uffd-wp isn't even supported on 32-bit x86, after all), but for ARM64 I see
it has:

For normal ptes:

/*
* Level 3 descriptor (PTE).
*/
#define PTE_VALID (_AT(pteval_t, 1) << 0)
#define PTE_TYPE_MASK (_AT(pteval_t, 3) << 0)
#define PTE_TYPE_PAGE (_AT(pteval_t, 3) << 0)
#define PTE_TABLE_BIT (_AT(pteval_t, 1) << 1)
#define PTE_USER (_AT(pteval_t, 1) << 6) /* AP[1] */
#define PTE_RDONLY (_AT(pteval_t, 1) << 7) /* AP[2] */
#define PTE_SHARED (_AT(pteval_t, 3) << 8) /* SH[1:0], inner shareable */
#define PTE_AF (_AT(pteval_t, 1) << 10) /* Access Flag */
#define PTE_NG (_AT(pteval_t, 1) << 11) /* nG */
#define PTE_GP (_AT(pteval_t, 1) << 50) /* BTI guarded */
#define PTE_DBM (_AT(pteval_t, 1) << 51) /* Dirty Bit Management */
#define PTE_CONT (_AT(pteval_t, 1) << 52) /* Contiguous range */
#define PTE_PXN (_AT(pteval_t, 1) << 53) /* Privileged XN */
#define PTE_UXN (_AT(pteval_t, 1) << 54) /* User XN */

For swap ptes:

/*
* Encode and decode a swap entry:
* bits 0-1: present (must be zero)
* bits 2-7: swap type
* bits 8-57: swap offset
* bit 58: PTE_PROT_NONE (must be zero)
*/

So I feel like we still have a chance there, at least for 64-bit ARM, as both
normal and swap ptes have some bits free (bits 2-5,9 for normal ptes; bits
59-63 for swap ptes). But as I know little about ARM64, I hope I looked at the
right things..
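
As a purely hypothetical sketch (untested, and the bit choice is arbitrary),
arm64 could take one of those free swap pte bits for the swap-side flag, e.g.:

/* Hypothetical: reuse free swap pte bit 59 for uffd-wp on arm64 */
#define PTE_SWP_UFFD_WP		(_AT(pteval_t, 1) << 59)

static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
{
	return set_pte_bit(pte, __pgprot(PTE_SWP_UFFD_WP));
}

static inline int pte_swp_uffd_wp(pte_t pte)
{
	return !!(pte_val(pte) & PTE_SWP_UFFD_WP);
}

A normal-pte bit (for _PAGE_UFFD_WP) would still need to be found separately,
as noted above.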

Thanks,

--
Peter Xu

2021-06-09 00:29:25

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v3 04/27] mm/userfaultfd: Introduce special pte for unmapped file-backed mem

On Saturday, 5 June 2021 2:01:59 AM AEST Peter Xu wrote:
> On Fri, Jun 04, 2021 at 04:16:30PM +1000, Alistair Popple wrote:
> > > My understanding is that it does *not* use an additional arch-dependent
> > > bit, but puts the _PAGE_UFFD_WP bit (already set aside by any architecture
> > > implementing UFFD WP) to an additional use. That's why I called this
> > > design (from Andrea) more elegant than mine (swap type business).
> >
> > Oh my bad, I had somehow missed this was reusing an *existing* arch-dependent
> > swap bit (_PAGE_SWP_UFFD_WP, although the same argument could apply) even
> > though it's in the commit message. Obviously I should have read that more
> > carefully, apologies for the noise but thanks for the clarification.
>
> Right, as Hugh mentioned what this series wanted to use is one explicit pte
> that no one should ever be using, so ideally that should be the most saving way
> per address-space pov.
>
> Meanwhile I think that pte can actually be not related to _PAGE_UFFD_WP at all,
> as long as it's a specific pte value then it will service the same goal (even
> if to reuse a new swp type, I'll probably only use one pte for it and leave the
> rest for other use; but who knows who will start to use the rest!).
>
> I kept using it because that's suggested by Andrea (it actually has
> type==off==0 as Hugh suggested too - so it keeps a suggestion of both!) and
> it's a good idea to use it since (1) it's never used by anyone before, and (2)
> it is _somehow_ related to uffd-wp itself already by having that specific bit
> set in the special pte, while that's also the only bit set for the u64 field.
>
> It looks very nice too when debug, because when I dump the ptes it reads 0x4 on
> x86.. so the pte value is even easy to read as a number. :)
>
> However I can see that it is less easy to follow than the swap type solution.
> In all cases it's still something worth thinking about before using up the swap
> types - it's not so rich there, and we keep shrinking MAX_SWAPFILES.. so let's
> see whether uffd-wp could be the 1st one to open a new field for unused
> "invalid/swap pte" address space.

Agreed, that matches what I was thinking as well. If we do end up having more
swap types such as this which don't need to store much information in the swap
pte itself, we could define a special swap type (eg. this bit) for that.

> Meanwhile, I did have a look at ARM on supporting uffd-wp in general, starting
> from anonymous pages. I doubt whether it can be done for old arms (uffd-wp not
> even supported on 32bit x86 after all), but for ARM64 I see it has:
>
> For normal ptes:
>
> /*
> * Level 3 descriptor (PTE).
> */
> #define PTE_VALID (_AT(pteval_t, 1) << 0)
> #define PTE_TYPE_MASK (_AT(pteval_t, 3) << 0)
> #define PTE_TYPE_PAGE (_AT(pteval_t, 3) << 0)
> #define PTE_TABLE_BIT (_AT(pteval_t, 1) << 1)
> #define PTE_USER (_AT(pteval_t, 1) << 6) /* AP[1] */
> #define PTE_RDONLY (_AT(pteval_t, 1) << 7) /* AP[2] */
> #define PTE_SHARED (_AT(pteval_t, 3) << 8) /* SH[1:0], inner shareable */
> #define PTE_AF (_AT(pteval_t, 1) << 10) /* Access Flag */
> #define PTE_NG (_AT(pteval_t, 1) << 11) /* nG */
> #define PTE_GP (_AT(pteval_t, 1) << 50) /* BTI guarded */
> #define PTE_DBM (_AT(pteval_t, 1) << 51) /* Dirty Bit Management */
> #define PTE_CONT (_AT(pteval_t, 1) << 52) /* Contiguous range */
> #define PTE_PXN (_AT(pteval_t, 1) << 53) /* Privileged XN */
> #define PTE_UXN (_AT(pteval_t, 1) << 54) /* User XN */
>
> For swap ptes:
>
> /*
> * Encode and decode a swap entry:
> * bits 0-1: present (must be zero)
> * bits 2-7: swap type
> * bits 8-57: swap offset
> * bit 58: PTE_PROT_NONE (must be zero)
> */
>
> So I feel like we still have chance there at least for 64bit ARM? As both
> normal/swap ptes have some bits free (bits 2-5,9 for normal ptes; bits 59-63
> for swap ptes). But as I know little on ARM64, I hope I looked at the right
> things..

I don't claim to be an expert there either. Given there's already a bit
defined for x86 anyway (which is what I missed) I now think the special
swap idea is ok, although I still need to look at the rest of the series.

> Thanks,
>
> --
> Peter Xu
>




2021-06-09 15:47:05

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v3 04/27] mm/userfaultfd: Introduce special pte for unmapped file-backed mem

On Wed, Jun 09, 2021 at 11:06:32PM +1000, Alistair Popple wrote:
> On Friday, 28 May 2021 6:19:04 AM AEST Peter Xu wrote:
>
> [...]
>
> > diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
> > index 828966d4c281..95e9811ce9d1 100644
> > --- a/include/asm-generic/pgtable_uffd.h
> > +++ b/include/asm-generic/pgtable_uffd.h
> > @@ -2,6 +2,9 @@
> > #define _ASM_GENERIC_PGTABLE_UFFD_H
> >
> > #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> > +
> > +#define UFFD_WP_SWP_PTE_SPECIAL __pte(0)
> > +
> > static __always_inline int pte_uffd_wp(pte_t pte)
> > {
> > return 0;
> > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > index 331d2ccf0bcc..93f932b53a71 100644
> > --- a/include/linux/userfaultfd_k.h
> > +++ b/include/linux/userfaultfd_k.h
> > @@ -145,6 +145,17 @@ extern int userfaultfd_unmap_prep(struct vm_area_struct *vma,
> > extern void userfaultfd_unmap_complete(struct mm_struct *mm,
> > struct list_head *uf);
> >
> > +static inline pte_t pte_swp_mkuffd_wp_special(struct vm_area_struct *vma)
> > +{
> > + WARN_ON_ONCE(vma_is_anonymous(vma));
> > + return UFFD_WP_SWP_PTE_SPECIAL;
> > +}
> > +
> > +static inline bool pte_swp_uffd_wp_special(pte_t pte)
> > +{
> > + return pte_same(pte, UFFD_WP_SWP_PTE_SPECIAL);
> > +}
> > +
>
> Sorry, only just noticed this but do we need to define a different version of
> this helper that returns false for CONFIG_HAVE_ARCH_USERFAULTFD_WP=n to avoid
> spurious matches with __pte(0) on architectures supporting userfaultfd but not
> userfaultfd-wp?

Good point.. Yes, we definitely don't want the empty pte to be recognized as
the special pte.. I'll squash the below into the same patch:

----8<----
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 489fb375e66c..23ca449240d1 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -177,7 +177,11 @@ static inline pte_t pte_swp_mkuffd_wp_special(struct vm_area_struct *vma)

static inline bool pte_swp_uffd_wp_special(pte_t pte)
{
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
return pte_same(pte, UFFD_WP_SWP_PTE_SPECIAL);
+#else
+ return false;
+#endif
}

#else /* CONFIG_USERFAULTFD */
----8<----

I'll see whether I can give it a dry run with USERFAULTFD enabled but without
HAVE_ARCH_USERFAULTFD_WP.

Thanks for spotting that!

--
Peter Xu

2021-06-09 17:28:41

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v3 04/27] mm/userfaultfd: Introduce special pte for unmapped file-backed mem

On Friday, 28 May 2021 6:19:04 AM AEST Peter Xu wrote:

[...]

> diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
> index 828966d4c281..95e9811ce9d1 100644
> --- a/include/asm-generic/pgtable_uffd.h
> +++ b/include/asm-generic/pgtable_uffd.h
> @@ -2,6 +2,9 @@
> #define _ASM_GENERIC_PGTABLE_UFFD_H
>
> #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> +
> +#define UFFD_WP_SWP_PTE_SPECIAL __pte(0)
> +
> static __always_inline int pte_uffd_wp(pte_t pte)
> {
> return 0;
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 331d2ccf0bcc..93f932b53a71 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -145,6 +145,17 @@ extern int userfaultfd_unmap_prep(struct vm_area_struct *vma,
> extern void userfaultfd_unmap_complete(struct mm_struct *mm,
> struct list_head *uf);
>
> +static inline pte_t pte_swp_mkuffd_wp_special(struct vm_area_struct *vma)
> +{
> + WARN_ON_ONCE(vma_is_anonymous(vma));
> + return UFFD_WP_SWP_PTE_SPECIAL;
> +}
> +
> +static inline bool pte_swp_uffd_wp_special(pte_t pte)
> +{
> + return pte_same(pte, UFFD_WP_SWP_PTE_SPECIAL);
> +}
> +

Sorry, only just noticed this but do we need to define a different version of
this helper that returns false for CONFIG_HAVE_ARCH_USERFAULTFD_WP=n to avoid
spurious matches with __pte(0) on architectures supporting userfaultfd but not
userfaultfd-wp?

> #else /* CONFIG_USERFAULTFD */
>
> /* mm helpers */
> @@ -234,6 +245,16 @@ static inline void userfaultfd_unmap_complete(struct mm_struct *mm,
> {
> }
>
> +static inline pte_t pte_swp_mkuffd_wp_special(struct vm_area_struct *vma)
> +{
> + return __pte(0);
> +}
> +
> +static inline bool pte_swp_uffd_wp_special(pte_t pte)
> +{
> + return false;
> +}
> +
> #endif /* CONFIG_USERFAULTFD */
>
> #endif /* _LINUX_USERFAULTFD_K_H */
>




2021-06-17 09:00:33

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v3 06/27] shmem/userfaultfd: Handle uffd-wp special pte in page fault handler

On Friday, 28 May 2021 6:21:22 AM AEST Peter Xu wrote:
> File-backed memories are prone to unmap/swap so the ptes are always unstable.
> This could lead to userfaultfd-wp information got lost when unmapped or swapped
> out on such types of memory, for example, shmem. To keep such an information
> persistent, we will start to use the newly introduced swap-like special ptes to
> replace a null pte when those ptes were removed.
>
> Prepare this by handling such a special pte first before it is applied in the
> general page fault handler.
>
> The handling of this special pte page fault is similar to missing fault, but it
> should happen after the pte missing logic since the special pte is designed to
> be a swap-like pte. Meanwhile it should be handled before do_swap_page() so
> that the swap core logic won't be confused to see such an illegal swap pte.
>
> This is a slow path of uffd-wp handling, because unmap of wr-protected shmem
> ptes should be rare. So far it should only trigger in two conditions:
>
> (1) When trying to punch holes in shmem_fallocate(), there will be a
> pre-unmap optimization before evicting the page. That will create
> unmapped shmem ptes with wr-protected pages covered.
>
> (2) Swapping out of shmem pages
>
> Because of this, the page fault handling is simplifed too by not sending the
> wr-protect message in the 1st page fault, instead the page will be installed
> read-only, so the message will be generated until the next write, which will
> trigger the do_wp_page() path of general uffd-wp handling.
>
> Disable fault-around for all uffd-wp registered ranges for extra safety, and
> clean the code up a bit after we introduced MINOR fault.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/linux/userfaultfd_k.h | 12 +++++
> mm/memory.c | 88 +++++++++++++++++++++++++++++++----
> 2 files changed, 90 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 93f932b53a71..ca3f794d07e9 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -94,6 +94,18 @@ static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
> return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
> }
>
> +/*
> + * Don't do fault around for either WP or MINOR registered uffd range. For
> + * MINOR registered range, fault around will be a total disaster and ptes can
> + * be installed without notifications; for WP it should mostly be fine as long
> + * as the fault around checks for pte_none() before the installation, however
> + * to be super safe we just forbid it.
> + */
> +static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
> +{
> + return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
> +}
> +
> static inline bool userfaultfd_missing(struct vm_area_struct *vma)
> {
> return vma->vm_flags & VM_UFFD_MISSING;
> diff --git a/mm/memory.c b/mm/memory.c
> index 2b24af4616df..45a2f71e447a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3917,6 +3917,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
> void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
> {
> struct vm_area_struct *vma = vmf->vma;
> + bool uffd_wp = pte_swp_uffd_wp_special(vmf->orig_pte);
> bool write = vmf->flags & FAULT_FLAG_WRITE;
> bool prefault = vmf->address != addr;
> pte_t entry;
> @@ -3929,6 +3930,8 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>
> if (write)
> entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> + if (unlikely(uffd_wp))
> + entry = pte_mkuffd_wp(pte_wrprotect(entry));
> /* copy-on-write page */
> if (write && !(vma->vm_flags & VM_SHARED)) {
> inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
> @@ -3996,8 +3999,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> vmf->address, &vmf->ptl);
> ret = 0;
> - /* Re-check under ptl */
> - if (likely(pte_none(*vmf->pte)))
> +
> + /*
> + * Re-check under ptl. Note: this will cover both none pte and
> + * uffd-wp-special swap pte
> + */
> + if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
> do_set_pte(vmf, page, vmf->address);
> else
> ret = VM_FAULT_NOPAGE;
> @@ -4101,9 +4108,21 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
> return vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
> }
>
> +/* Return true if we should do read fault-around, false otherwise */
> +static inline bool should_fault_around(struct vm_fault *vmf)
> +{
> + /* No ->map_pages? No way to fault around... */
> + if (!vmf->vma->vm_ops->map_pages)
> + return false;
> +
> + if (uffd_disable_fault_around(vmf->vma))
> + return false;
> +
> + return fault_around_bytes >> PAGE_SHIFT > 1;
> +}
> +
> static vm_fault_t do_read_fault(struct vm_fault *vmf)
> {
> - struct vm_area_struct *vma = vmf->vma;
> vm_fault_t ret = 0;
>
> /*
> @@ -4111,12 +4130,10 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
> * if page by the offset is not ready to be mapped (cold cache or
> * something).
> */
> - if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
> - if (likely(!userfaultfd_minor(vmf->vma))) {
> - ret = do_fault_around(vmf);
> - if (ret)
> - return ret;
> - }
> + if (should_fault_around(vmf)) {
> + ret = do_fault_around(vmf);
> + if (ret)
> + return ret;
> }
>
> ret = __do_fault(vmf);
> @@ -4435,6 +4452,57 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
> return VM_FAULT_FALLBACK;
> }
>
> +static vm_fault_t uffd_wp_clear_special(struct vm_fault *vmf)
> +{
> + vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> + vmf->address, &vmf->ptl);
> + /*
> + * Be careful so that we will only recover a special uffd-wp pte into a
> + * none pte. Otherwise it means the pte could have changed, so retry.
> + */
> + if (pte_swp_uffd_wp_special(*vmf->pte))
> + pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte);
> + pte_unmap_unlock(vmf->pte, vmf->ptl);
> + return 0;
> +}
> +
> +/*
> + * This is actually a page-missing access, but with uffd-wp special pte
> + * installed. It means this pte was wr-protected before being unmapped.
> + */
> +static vm_fault_t uffd_wp_handle_special(struct vm_fault *vmf)
> +{
> + /* Careful! vmf->pte unmapped after return */
> + if (!pte_unmap_same(vmf))
> + return 0;
> +
> + /*
> + * Just in case there're leftover special ptes even after the region
> + * got unregistered - we can simply clear them.
> + */
> + if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
> + return uffd_wp_clear_special(vmf);
> +
> + /*
> + * Here we share most code with do_fault(), in which we can identify
> + * whether this is "none pte fault" or "uffd-wp-special fault" by
> + * checking the vmf->orig_pte.
> + */
> + return do_fault(vmf);
> +}
> +
> +static vm_fault_t do_swap_pte(struct vm_fault *vmf)
> +{
> + /*
> + * We need to handle special swap ptes before handling ptes that
> + * contain swap entries, always.
> + */
> + if (unlikely(pte_swp_uffd_wp_special(vmf->orig_pte)))
> + return uffd_wp_handle_special(vmf);
> +
> + return do_swap_page(vmf);

Probably pretty minor in the scheme of things but why not add this special
case directly to do_swap_page()? Your earlier "shmem/userfaultfd: Handle
uffd-wp special pte in page fault handler" adds this to do_swap_page()
anyway:

/*
* We should never call do_swap_page upon a swap special pte; just be
* safe to bail out if it happens.
*/
if (WARN_ON_ONCE(is_swap_special_pte(vmf->orig_pte)))
goto out;

So this patch could instead replace the warning with the call to
uffd_wp_handle_special(), which also means you can remove the extra
pte_unmap_same(vmf) check in uffd_wp_handle_special().

I suppose you might have to worry about other callers of do_swap_page(),
but the only other one I could see was __collapse_huge_page_swapin().
Initially I thought that might be able to trigger the warning here but I
see it checks pte_has_swap_entry() first which should skip it if it finds
the special pte.
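
A minimal sketch of the suggestion above (untested, reusing the helpers from
this series) would be for do_swap_page() to start with something like:

	/* Handle the uffd-wp special pte before any real swap entry logic */
	if (unlikely(pte_swp_uffd_wp_special(vmf->orig_pte)))
		return uffd_wp_handle_special(vmf);

instead of warning and bailing out.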

- Alistair

> +}
> +
> /*
> * These routines also need to handle stuff like marking pages dirty
> * and/or accessed for architectures that don't do it in hardware (most
> @@ -4509,7 +4577,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> }
>
> if (!pte_present(vmf->orig_pte))
> - return do_swap_page(vmf);
> + return do_swap_pte(vmf);
>
> if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> return do_numa_page(vmf);
>




2021-06-17 15:13:34

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v3 06/27] shmem/userfaultfd: Handle uffd-wp special pte in page fault handler

On Thu, Jun 17, 2021 at 06:59:09PM +1000, Alistair Popple wrote:
> > +static vm_fault_t do_swap_pte(struct vm_fault *vmf)
> > +{
> > + /*
> > + * We need to handle special swap ptes before handling ptes that
> > + * contain swap entries, always.
> > + */
> > + if (unlikely(pte_swp_uffd_wp_special(vmf->orig_pte)))
> > + return uffd_wp_handle_special(vmf);
> > +
> > + return do_swap_page(vmf);
>
> Probably pretty minor in the scheme of things but why not add this special
> case directly to do_swap_page()? Your earlier "shmem/userfaultfd: Handle
> uffd-wp special pte in page fault handler" adds this to do_swap_page()
> anyway:
>
> /*
> * We should never call do_swap_page upon a swap special pte; just be
> * safe to bail out if it happens.
> */
> if (WARN_ON_ONCE(is_swap_special_pte(vmf->orig_pte)))
> goto out;
>
> So this patch could instead replace the warning with the call to
> uffd_wp_handle_special(), which also means you can remove the extra
> pte_unmap_same(vmf) check in uffd_wp_handle_special().
>
> I suppose you might have to worry about other callers of do_swap_page(),
> but the only other one I could see was __collapse_huge_page_swapin().
> Initially I thought that might be able to trigger the warning here but I
> see it checks pte_has_swap_entry() first which should skip it if it finds
> the special pte.

Yes, I wanted to keep the existing caller untouched, and to keep its semantics
too rather than burden it with the new idea (it turns out do_swap_page has a
history long enough to predate the introduction of git to Linux).

The other reason is that this series is the first to introduce the new kind of
swap pte which does not actually have a page backing it, so I figured it's good
to call the new handler do_swap_pte() (as a swap pte can either contain a swap
entry or not), and keep do_swap_page() for the old type of swap pte (which does
contain a swap entry).

Thanks,

--
Peter Xu

2021-06-21 08:43:04

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v3 11/27] shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed

On Friday, 28 May 2021 6:22:14 AM AEST Peter Xu wrote:
> File-backed memory is prone to being unmapped at any time. It means all
> information in the pte will be dropped, including the uffd-wp flag.
>
> Since the uffd-wp info cannot be stored in page cache or swap cache, persist
> this wr-protect information by installing the special uffd-wp marker pte when
> we're going to unmap a uffd wr-protected pte. When the pte is accessed again,
> we will know it's previously wr-protected by recognizing the special pte.
>
> Meanwhile add a new flag ZAP_FLAG_DROP_FILE_UFFD_WP when we don't want to
> persist such information. For example, when destroying the whole vma, or
> punching a hole in a shmem file. For the latter, we can only drop the uffd-wp
> bit when holding the page lock. It means the unmap_mapping_range() in
> shmem_fallocate() still requires zapping without ZAP_FLAG_DROP_FILE_UFFD_WP
> because that's still racy with the page faults.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/linux/mm.h | 11 ++++++++++
> include/linux/mm_inline.h | 43 +++++++++++++++++++++++++++++++++++++++
> mm/memory.c | 42 +++++++++++++++++++++++++++++++++++++-
> mm/rmap.c | 8 ++++++++
> mm/truncate.c | 8 +++++++-
> 5 files changed, 110 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b1fb2826e29c..5989fc7ed00d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1725,6 +1725,8 @@ extern void user_shm_unlock(size_t, struct user_struct *);
> #define ZAP_FLAG_CHECK_MAPPING BIT(0)
> /* Whether to skip zapping swap entries */
> #define ZAP_FLAG_SKIP_SWAP BIT(1)
> +/* Whether to completely drop uffd-wp entries for file-backed memory */
> +#define ZAP_FLAG_DROP_FILE_UFFD_WP BIT(2)
>
> /*
> * Parameter block passed down to zap_pte_range in exceptional cases.
> @@ -1757,6 +1759,15 @@ zap_skip_swap(struct zap_details *details)
> return details->zap_flags & ZAP_FLAG_SKIP_SWAP;
> }
>
> +static inline bool
> +zap_drop_file_uffd_wp(struct zap_details *details)
> +{
> + if (!details)
> + return false;
> +
> + return details->zap_flags & ZAP_FLAG_DROP_FILE_UFFD_WP;
> +}

Is this a good default, having to explicitly specify that you don't want
special ptes left in place? For example the OOM killer seems to call
unmap_page_range() with details == NULL (although in practice only for
anonymous vmas, so it won't actually cause an issue). Similarly in madvise
for MADV_DONTNEED, although arguably I suppose that is the correct thing to
do there?
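
Just to spell out what I mean by the default (hypothetical call site, sketch
only): any caller passing details == NULL keeps the markers in place, and a
path that really wants them gone has to opt in explicitly, e.g.:

	struct zap_details details = {
		.zap_flags = ZAP_FLAG_DROP_FILE_UFFD_WP,
	};

	unmap_page_range(tlb, vma, vma->vm_start, vma->vm_end, &details);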

> struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> pte_t pte);
> struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 355ea1ee32bd..c29a6ef3a642 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -4,6 +4,8 @@
>
> #include <linux/huge_mm.h>
> #include <linux/swap.h>
> +#include <linux/userfaultfd_k.h>
> +#include <linux/swapops.h>
>
> /**
> * page_is_file_lru - should the page be on a file LRU or anon LRU?
> @@ -104,4 +106,45 @@ static __always_inline void del_page_from_lru_list(struct page *page,
> update_lru_size(lruvec, page_lru(page), page_zonenum(page),
> -thp_nr_pages(page));
> }
> +
> +/*
> + * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
> + * replace a none pte. NOTE! This should only be called when *pte is already
> + * cleared so we will never accidentally replace something valuable. Meanwhile
> + * none pte also means we are not demoting the pte so if tlb flushed then we
> + * don't need to do it again; otherwise if tlb flush is postponed then it's
> + * even better.
> + *
> + * Must be called with pgtable lock held.
> + */
> +static inline void
> +pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
> + pte_t *pte, pte_t pteval)
> +{
> +#ifdef CONFIG_USERFAULTFD
> + bool arm_uffd_pte = false;
> +
> + /* The current status of the pte should be "cleared" before calling */
> + WARN_ON_ONCE(!pte_none(*pte));
> +
> + if (vma_is_anonymous(vma))
> + return;
> +
> + /* A uffd-wp wr-protected normal pte */
> + if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
> + arm_uffd_pte = true;
> +
> + /*
> + * A uffd-wp wr-protected swap pte. Note: this should even work for
> + * pte_swp_uffd_wp_special() too.
> + */

I'm probably missing something but when can we actually have this case and why
would we want to leave a special pte behind? From what I can tell this is
called from try_to_unmap_one() where this won't be true or from zap_pte_range()
when not skipping swap pages.

> + if (unlikely(is_swap_pte(pteval) && pte_swp_uffd_wp(pteval)))
> + arm_uffd_pte = true;
> +
> + if (unlikely(arm_uffd_pte))
> + set_pte_at(vma->vm_mm, addr, pte,
> + pte_swp_mkuffd_wp_special(vma));
> +#endif
> +}
> +
> #endif
> diff --git a/mm/memory.c b/mm/memory.c
> index 319552efc782..3453b8ae5f4f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -73,6 +73,7 @@
> #include <linux/perf_event.h>
> #include <linux/ptrace.h>
> #include <linux/vmalloc.h>
> +#include <linux/mm_inline.h>
>
> #include <trace/events/kmem.h>
>
> @@ -1298,6 +1299,21 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> return ret;
> }
>
> +/*
> + * This function makes sure that we'll replace the none pte with an uffd-wp
> + * swap special pte marker when necessary. Must be with the pgtable lock held.
> + */
> +static inline void
> +zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *pte,
> + struct zap_details *details, pte_t pteval)
> +{
> + if (zap_drop_file_uffd_wp(details))
> + return;
> +
> + pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
> +}
> +
> static unsigned long zap_pte_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long addr, unsigned long end,
> @@ -1335,6 +1351,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> ptent = ptep_get_and_clear_full(mm, addr, pte,
> tlb->fullmm);
> tlb_remove_tlb_entry(tlb, pte, addr);
> + zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> + ptent);
> if (unlikely(!page))
> continue;
>
> @@ -1359,6 +1377,22 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> continue;
> }
>
> + /*
> + * If this is a special uffd-wp marker pte... Drop it only if
> + * enforced to do so.
> + */
> + if (unlikely(is_swap_special_pte(ptent))) {
> + WARN_ON_ONCE(!pte_swp_uffd_wp_special(ptent));

Why the WARN_ON and not just test pte_swp_uffd_wp_special() directly?
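
I.e. something along these lines (sketch only, mirroring the logic above):

	/* Sketch: test the special pte directly, no WARN_ON needed */
	if (unlikely(pte_swp_uffd_wp_special(ptent))) {
		/* Keep the marker on a common unmap; drop it only if asked */
		if (zap_drop_file_uffd_wp(details))
			ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
		continue;
	}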

> + /*
> + * If this is a common unmap of ptes, keep this as is.
> + * Drop it only if this is a whole-vma destruction.
> + */
> + if (zap_drop_file_uffd_wp(details))
> + ptep_get_and_clear_full(mm, addr, pte,
> + tlb->fullmm);
> + continue;
> + }
> +
> entry = pte_to_swp_entry(ptent);
> if (is_device_private_entry(entry) ||
> is_device_exclusive_entry(entry)) {
> @@ -1373,6 +1407,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> page_remove_rmap(page, false);
>
> put_page(page);
> + zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> + ptent);

Device entries only support anonymous vmas at present so should we drop this?
I guess I'm also a little confused by this because I'm not sure in what
scenarios you would want to zap swap entries but leave special swap ptes behind
(see also my earlier question above as well).

> continue;
> }
>
> @@ -1390,6 +1426,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> if (unlikely(!free_swap_and_cache(entry)))
> print_bad_pte(vma, addr, ptent, NULL);
> pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> + zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
> } while (pte++, addr += PAGE_SIZE, addr != end);
>
> add_mm_rss_vec(mm, rss);
> @@ -1589,12 +1626,15 @@ void unmap_vmas(struct mmu_gather *tlb,
> unsigned long end_addr)
> {
> struct mmu_notifier_range range;
> + struct zap_details details = {
> + .zap_flags = ZAP_FLAG_DROP_FILE_UFFD_WP,
> + };
>
> mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
> start_addr, end_addr);
> mmu_notifier_invalidate_range_start(&range);
> for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> - unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
> + unmap_single_vma(tlb, vma, start_addr, end_addr, &details);
> mmu_notifier_invalidate_range_end(&range);
> }
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0419c9a1a280..a94d9aed9d95 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -72,6 +72,7 @@
> #include <linux/page_idle.h>
> #include <linux/memremap.h>
> #include <linux/userfaultfd_k.h>
> +#include <linux/mm_inline.h>
>
> #include <asm/tlbflush.h>
>
> @@ -1509,6 +1510,13 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> pteval = ptep_clear_flush(vma, address, pvmw.pte);
> }
>
> + /*
> + * Now the pte is cleared. If this is uffd-wp armed pte, we
> + * may want to replace a none pte with a marker pte if it's
> + * file-backed, so we don't lose the tracking information.
> + */
> + pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);

From what I can tell we don't need to do this in try_to_migrate_one() (assuming
that goes in) as well because the existing uffd wp code already deals with
copying the pte bits over to the migration entries. Is that correct?

> +
> /* Move the dirty bit to the page. Now the pte is gone. */
> if (pte_dirty(pteval))
> set_page_dirty(page);
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 85cd84486589..62f9c488b986 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -173,7 +173,13 @@ truncate_cleanup_page(struct address_space *mapping, struct page *page)
> if (page_mapped(page)) {
> unsigned int nr = thp_nr_pages(page);
> unmap_mapping_pages(mapping, page->index, nr,
> - ZAP_FLAG_CHECK_MAPPING);
> + ZAP_FLAG_CHECK_MAPPING |
> + /*
> + * Now it's safe to drop uffd-wp because
> + * we're with page lock, and the page is
> + * being truncated.
> + */
> + ZAP_FLAG_DROP_FILE_UFFD_WP);
> }
>
> if (page_has_private(page))
>




2021-06-21 12:10:22

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v3 08/27] mm: Introduce zap_details.zap_flags

On Friday, 28 May 2021 6:21:30 AM AEST Peter Xu wrote:
> Instead of trying to introduce one variable for every new zap_details field,
> let's introduce a flag so that it can start to encode true/false information.
>
> Let's start to use this flag first to clean up the only check_mapping variable.
> Firstly, the name "check_mapping" implies this is a "boolean", but actually it
> stores the mapping inside, just in a way that it won't be set if we don't want
> to check the mapping.
>
> To make things clearer, introduce the 1st zap flag ZAP_FLAG_CHECK_MAPPING, so
> that we only check against the mapping if this bit is set. At the same time, we
> can rename check_mapping into zap_mapping and set it always.
>
> While at it, introduce another helper zap_check_mapping_skip() and use it in
> zap_pte_range() properly.
>
> Some old comments have been removed in zap_pte_range() because they're
> duplicated, and since now we're with ZAP_FLAG_CHECK_MAPPING flag, it'll be very
> easy to grep this information by simply grepping the flag.
>
> It'll also make life easier when we want to e.g. pass in zap_flags into the
> callers like unmap_mapping_pages() (instead of adding new booleans besides the
> even_cows parameter).
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/linux/mm.h | 19 ++++++++++++++++++-
> mm/memory.c | 31 ++++++++-----------------------
> 2 files changed, 26 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index db155be8e66c..52d3ef2ed753 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1721,13 +1721,30 @@ static inline bool can_do_mlock(void) { return false; }
> extern int user_shm_lock(size_t, struct user_struct *);
> extern void user_shm_unlock(size_t, struct user_struct *);
>
> +/* Whether to check page->mapping when zapping */
> +#define ZAP_FLAG_CHECK_MAPPING BIT(0)
> +
> /*
> * Parameter block passed down to zap_pte_range in exceptional cases.
> */
> struct zap_details {
> - struct address_space *check_mapping; /* Check page->mapping if set */
> + struct address_space *zap_mapping;
> + unsigned long zap_flags;
> };
>
> +/* Return true if skip zapping this page, false otherwise */
> +static inline bool
> +zap_check_mapping_skip(struct zap_details *details, struct page *page)
> +{
> + if (!details || !page)
> + return false;
> +
> + if (!(details->zap_flags & ZAP_FLAG_CHECK_MAPPING))
> + return false;
> +
> + return details->zap_mapping != page_rmapping(page);

I doubt this matters in practice, but there is a slight behaviour change
here that might be worth checking. Previously this check was equivalent
to:

details->zap_mapping && details->zap_mapping != page_rmapping(page)

Otherwise I think this looks good.

> +}
> +
> struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> pte_t pte);
> struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> diff --git a/mm/memory.c b/mm/memory.c
> index 27cf8a6375c6..c9dc4e9e05b5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1330,16 +1330,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> struct page *page;
>
> page = vm_normal_page(vma, addr, ptent);
> - if (unlikely(details) && page) {
> - /*
> - * unmap_shared_mapping_pages() wants to
> - * invalidate cache without truncating:
> - * unmap shared but keep private pages.
> - */
> - if (details->check_mapping &&
> - details->check_mapping != page_rmapping(page))
> - continue;
> - }
> + if (unlikely(zap_check_mapping_skip(details, page)))
> + continue;
> ptent = ptep_get_and_clear_full(mm, addr, pte,
> tlb->fullmm);
> tlb_remove_tlb_entry(tlb, pte, addr);
> @@ -1372,17 +1364,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> is_device_exclusive_entry(entry)) {
> struct page *page = pfn_swap_entry_to_page(entry);
>
> - if (unlikely(details && details->check_mapping)) {
> - /*
> - * unmap_shared_mapping_pages() wants to
> - * invalidate cache without truncating:
> - * unmap shared but keep private pages.
> - */
> - if (details->check_mapping !=
> - page_rmapping(page))
> - continue;
> - }
> -
> + if (unlikely(zap_check_mapping_skip(details, page)))
> + continue;
> pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> rss[mm_counter(page)]--;
>
> @@ -3345,9 +3328,11 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
> pgoff_t nr, bool even_cows)
> {
> pgoff_t first_index = start, last_index = start + nr - 1;
> - struct zap_details details = { };
> + struct zap_details details = { .zap_mapping = mapping };
> +
> + if (!even_cows)
> + details.zap_flags |= ZAP_FLAG_CHECK_MAPPING;
>
> - details.check_mapping = even_cows ? NULL : mapping;
> if (last_index < first_index)
> last_index = ULONG_MAX;
>
>




2021-06-21 12:21:33

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v3 07/27] mm: Drop first_index/last_index in zap_details

To me this looks like a good clean up that won't change any behaviour:

Reviewed-by: Alistair Popple <[email protected]>

On Friday, 28 May 2021 6:21:26 AM AEST Peter Xu wrote:
> The first_index/last_index parameters in zap_details are actually only used in
> unmap_mapping_range_tree(). At the meantime, this function is only called by
> unmap_mapping_pages() once. Instead of passing these two variables through the
> whole stack of page zapping code, remove them from zap_details and let them
> simply be parameters of unmap_mapping_range_tree(), which is inlined.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/linux/mm.h | 2 --
> mm/memory.c | 20 ++++++++++----------
> 2 files changed, 10 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ef9ea6dfefff..db155be8e66c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1726,8 +1726,6 @@ extern void user_shm_unlock(size_t, struct user_struct *);
> */
> struct zap_details {
> struct address_space *check_mapping; /* Check page->mapping if set */
> - pgoff_t first_index; /* Lowest page->index to unmap */
> - pgoff_t last_index; /* Highest page->index to unmap */
> };
>
> struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> diff --git a/mm/memory.c b/mm/memory.c
> index 45a2f71e447a..27cf8a6375c6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3305,20 +3305,20 @@ static void unmap_mapping_range_vma(struct vm_area_struct *vma,
> }
>
> static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
> + pgoff_t first_index,
> + pgoff_t last_index,
> struct zap_details *details)
> {
> struct vm_area_struct *vma;
> pgoff_t vba, vea, zba, zea;
>
> - vma_interval_tree_foreach(vma, root,
> - details->first_index, details->last_index) {
> -
> + vma_interval_tree_foreach(vma, root, first_index, last_index) {
> vba = vma->vm_pgoff;
> vea = vba + vma_pages(vma) - 1;
> - zba = details->first_index;
> + zba = first_index;
> if (zba < vba)
> zba = vba;
> - zea = details->last_index;
> + zea = last_index;
> if (zea > vea)
> zea = vea;
>
> @@ -3344,17 +3344,17 @@ static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
> void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
> pgoff_t nr, bool even_cows)
> {
> + pgoff_t first_index = start, last_index = start + nr - 1;
> struct zap_details details = { };
>
> details.check_mapping = even_cows ? NULL : mapping;
> - details.first_index = start;
> - details.last_index = start + nr - 1;
> - if (details.last_index < details.first_index)
> - details.last_index = ULONG_MAX;
> + if (last_index < first_index)
> + last_index = ULONG_MAX;
>
> i_mmap_lock_write(mapping);
> if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
> - unmap_mapping_range_tree(&mapping->i_mmap, &details);
> + unmap_mapping_range_tree(&mapping->i_mmap, first_index,
> + last_index, &details);
> i_mmap_unlock_write(mapping);
> }
>
>




2021-06-21 12:38:13

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v3 09/27] mm: Introduce ZAP_FLAG_SKIP_SWAP

On Friday, 28 May 2021 6:21:35 AM AEST Peter Xu wrote:
> Firstly, the comment in zap_pte_range() is misleading because the code checks
> against details rather than check_mapping, so the comment doesn't match what
> the code does.
>
> Meanwhile, it's confusing too on not explaining why passing in the details
> pointer would mean to skip all swap entries. New user of zap_details could
> very possibly miss this fact if they don't read deep until zap_pte_range()
> because there's no comment at zap_details talking about it at all, so swap
> entries could be erroneously skipped without being noticed.
>
> This partly reverts 3e8715fdc03e ("mm: drop zap_details::check_swap_entries"),
> but introduces a ZAP_FLAG_SKIP_SWAP flag, which means the opposite of the previous
> "details" parameter: the caller should explicitly set this to skip swap
> entries, otherwise swap entries will always be considered (which is still the
> major case here).
>
> Cc: Kirill A. Shutemov <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/linux/mm.h | 12 ++++++++++++
> mm/memory.c | 8 +++++---
> 2 files changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 52d3ef2ed753..1adf313a01fe 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1723,6 +1723,8 @@ extern void user_shm_unlock(size_t, struct user_struct *);
>
> /* Whether to check page->mapping when zapping */
> #define ZAP_FLAG_CHECK_MAPPING BIT(0)
> +/* Whether to skip zapping swap entries */
> +#define ZAP_FLAG_SKIP_SWAP BIT(1)
>
> /*
> * Parameter block passed down to zap_pte_range in exceptional cases.
> @@ -1745,6 +1747,16 @@ zap_check_mapping_skip(struct zap_details *details, struct page *page)
> return details->zap_mapping != page_rmapping(page);
> }
>
> +/* Return true if skip swap entries, false otherwise */
> +static inline bool
> +zap_skip_swap(struct zap_details *details)

Minor nit-pick but imho it would be nice if the naming was consistent between
this and check mapping. Ie. zap_skip_swap()/zap_skip_check_mapping() or
zap_swap_skip()/zap_check_mapping_skip().

> +{
> + if (!details)
> + return false;
> +
> + return details->zap_flags & ZAP_FLAG_SKIP_SWAP;
> +}
> +
> struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> pte_t pte);
> struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> diff --git a/mm/memory.c b/mm/memory.c
> index c9dc4e9e05b5..8a3751be87ba 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1376,8 +1376,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> continue;
> }
>
> - /* If details->check_mapping, we leave swap entries. */
> - if (unlikely(details))
> + if (unlikely(zap_skip_swap(details)))
> continue;
>
> if (!non_swap_entry(entry))
> @@ -3328,7 +3327,10 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
> pgoff_t nr, bool even_cows)
> {
> pgoff_t first_index = start, last_index = start + nr - 1;
> - struct zap_details details = { .zap_mapping = mapping };
> + struct zap_details details = {
> + .zap_mapping = mapping,

I meant to comment on this in the previous patch, but it might be nice to set
.zap_mapping in the !even_cows case below to make it very obvious it only
applies to ZAP_FLAG_CHECK_MAPPING.

Otherwise I think this is a good clean up which makes things clearer. I double
checked that unmap_mapping_pages() was the only place in the existing code that
needs ZAP_FLAG_SKIP_SWAP and that appears to be the case so there shouldn't be
any behaviour changes from this.

Reviewed-by: Alistair Popple <[email protected]>

> + .zap_flags = ZAP_FLAG_SKIP_SWAP,
> + };
>
> if (!even_cows)
> details.zap_flags |= ZAP_FLAG_CHECK_MAPPING;
>




2021-06-21 16:19:15

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v3 08/27] mm: Introduce zap_details.zap_flags

On Mon, Jun 21, 2021 at 10:09:00PM +1000, Alistair Popple wrote:
> On Friday, 28 May 2021 6:21:30 AM AEST Peter Xu wrote:
> > Instead of trying to introduce one variable for every new zap_details field,
> > let's introduce a flag so that it can start to encode true/false information.
> >
> > Let's start to use this flag first to clean up the only check_mapping variable.
> > Firstly, the name "check_mapping" implies this is a "boolean", but actually it
> > stores the mapping inside, just in a way that it won't be set if we don't want
> > to check the mapping.
> >
> > To make things clearer, introduce the 1st zap flag ZAP_FLAG_CHECK_MAPPING, so
> > that we only check against the mapping if this bit is set. At the same time, we
> > can rename check_mapping into zap_mapping and set it always.
> >
> > While at it, introduce another helper zap_check_mapping_skip() and use it in
> > zap_pte_range() properly.
> >
> > Some old comments have been removed in zap_pte_range() because they're
> > duplicated, and since now we're with ZAP_FLAG_CHECK_MAPPING flag, it'll be very
> > easy to grep this information by simply grepping the flag.
> >
> > It'll also make life easier when we want to e.g. pass in zap_flags into the
> > callers like unmap_mapping_pages() (instead of adding new booleans besides the
> > even_cows parameter).
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > include/linux/mm.h | 19 ++++++++++++++++++-
> > mm/memory.c | 31 ++++++++-----------------------
> > 2 files changed, 26 insertions(+), 24 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index db155be8e66c..52d3ef2ed753 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1721,13 +1721,30 @@ static inline bool can_do_mlock(void) { return false; }
> > extern int user_shm_lock(size_t, struct user_struct *);
> > extern void user_shm_unlock(size_t, struct user_struct *);
> >
> > +/* Whether to check page->mapping when zapping */
> > +#define ZAP_FLAG_CHECK_MAPPING BIT(0)
> > +
> > /*
> > * Parameter block passed down to zap_pte_range in exceptional cases.
> > */
> > struct zap_details {
> > - struct address_space *check_mapping; /* Check page->mapping if set */
> > + struct address_space *zap_mapping;
> > + unsigned long zap_flags;
> > };
> >
> > +/* Return true if skip zapping this page, false otherwise */
> > +static inline bool
> > +zap_check_mapping_skip(struct zap_details *details, struct page *page)
> > +{
> > + if (!details || !page)
> > + return false;
> > +
> > + if (!(details->zap_flags & ZAP_FLAG_CHECK_MAPPING))
> > + return false;

[1]

> > +
> > + return details->zap_mapping != page_rmapping(page);
>
> I doubt this matters in practice, but there is a slight behaviour change
> here that might be worth checking. Previously this check was equivalent
> to:
>
> details->zap_mapping && details->zap_mapping != page_rmapping(page)

Yes; IMHO "details->zap_mapping" is just replaced by the check at [1].

For example, there's only one real user of this mapping check, which is
unmap_mapping_pages() below [2].

With the old code, we have:

details.check_mapping = even_cows ? NULL : mapping;

So "details->zap_mapping" is only true if "!even_cows".

With the new code, we'll have:

if (!even_cows)
details.zap_flags |= ZAP_FLAG_CHECK_MAPPING;

So ZAP_FLAG_CHECK_MAPPING is only set if "!even_cows", while that's what we
check exactly at [1].

>
> Otherwise I think this looks good.
>
> > +}
> > +
> > struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > pte_t pte);
> > struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 27cf8a6375c6..c9dc4e9e05b5 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1330,16 +1330,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > struct page *page;
> >
> > page = vm_normal_page(vma, addr, ptent);
> > - if (unlikely(details) && page) {
> > - /*
> > - * unmap_shared_mapping_pages() wants to
> > - * invalidate cache without truncating:
> > - * unmap shared but keep private pages.
> > - */
> > - if (details->check_mapping &&
> > - details->check_mapping != page_rmapping(page))
> > - continue;
> > - }
> > + if (unlikely(zap_check_mapping_skip(details, page)))
> > + continue;
> > ptent = ptep_get_and_clear_full(mm, addr, pte,
> > tlb->fullmm);
> > tlb_remove_tlb_entry(tlb, pte, addr);
> > @@ -1372,17 +1364,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > is_device_exclusive_entry(entry)) {
> > struct page *page = pfn_swap_entry_to_page(entry);
> >
> > - if (unlikely(details && details->check_mapping)) {
> > - /*
> > - * unmap_shared_mapping_pages() wants to
> > - * invalidate cache without truncating:
> > - * unmap shared but keep private pages.
> > - */
> > - if (details->check_mapping !=
> > - page_rmapping(page))
> > - continue;
> > - }
> > -
> > + if (unlikely(zap_check_mapping_skip(details, page)))
> > + continue;
> > pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> > rss[mm_counter(page)]--;
> >
> > @@ -3345,9 +3328,11 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
> > pgoff_t nr, bool even_cows)
> > {
> > pgoff_t first_index = start, last_index = start + nr - 1;
> > - struct zap_details details = { };
> > + struct zap_details details = { .zap_mapping = mapping };
> > +
> > + if (!even_cows)
> > + details.zap_flags |= ZAP_FLAG_CHECK_MAPPING;
> >
> > - details.check_mapping = even_cows ? NULL : mapping;

[2]

> > if (last_index < first_index)
> > last_index = ULONG_MAX;

Thanks,

--
Peter Xu

2021-06-21 16:29:06

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v3 09/27] mm: Introduce ZAP_FLAG_SKIP_SWAP

On Mon, Jun 21, 2021 at 10:36:46PM +1000, Alistair Popple wrote:
> On Friday, 28 May 2021 6:21:35 AM AEST Peter Xu wrote:
> > Firstly, the comment in zap_pte_range() is misleading because the code checks
> > against details rather than check_mapping, so the comment doesn't match what
> > the code does.
> >
> > Meanwhile, it's confusing too on not explaining why passing in the details
> > pointer would mean to skip all swap entries. New user of zap_details could
> > very possibly miss this fact if they don't read deep until zap_pte_range()
> > because there's no comment at zap_details talking about it at all, so swap
> > entries could be erroneously skipped without being noticed.
> >
> > This partly reverts 3e8715fdc03e ("mm: drop zap_details::check_swap_entries"),
> > but introduces a ZAP_FLAG_SKIP_SWAP flag, which means the opposite of the previous
> > "details" parameter: the caller should explicitly set this to skip swap
> > entries, otherwise swap entries will always be considered (which is still the
> > major case here).
> >
> > Cc: Kirill A. Shutemov <[email protected]>
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > include/linux/mm.h | 12 ++++++++++++
> > mm/memory.c | 8 +++++---
> > 2 files changed, 17 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 52d3ef2ed753..1adf313a01fe 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1723,6 +1723,8 @@ extern void user_shm_unlock(size_t, struct user_struct *);
> >
> > /* Whether to check page->mapping when zapping */
> > #define ZAP_FLAG_CHECK_MAPPING BIT(0)
> > +/* Whether to skip zapping swap entries */
> > +#define ZAP_FLAG_SKIP_SWAP BIT(1)
> >
> > /*
> > * Parameter block passed down to zap_pte_range in exceptional cases.
> > @@ -1745,6 +1747,16 @@ zap_check_mapping_skip(struct zap_details *details, struct page *page)
> > return details->zap_mapping != page_rmapping(page);
> > }
> >
> > +/* Return true if skip swap entries, false otherwise */
> > +static inline bool
> > +zap_skip_swap(struct zap_details *details)
>
> Minor nit-pick but imho it would be nice if the naming was consistent between
> this and check mapping. Ie. zap_skip_swap()/zap_skip_check_mapping() or
> zap_swap_skip()/zap_check_mapping_skip().

Makes sense; I'll use zap_skip_swap()/zap_skip_check_mapping() I think, then I
keep this patch untouched.

>
> > +{
> > + if (!details)
> > + return false;
> > +
> > + return details->zap_flags & ZAP_FLAG_SKIP_SWAP;
> > +}
> > +
> > struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > pte_t pte);
> > struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > diff --git a/mm/memory.c b/mm/memory.c
> > index c9dc4e9e05b5..8a3751be87ba 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1376,8 +1376,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > continue;
> > }
> >
> > - /* If details->check_mapping, we leave swap entries. */
> > - if (unlikely(details))
> > + if (unlikely(zap_skip_swap(details)))
> > continue;
> >
> > if (!non_swap_entry(entry))
> > @@ -3328,7 +3327,10 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
> > pgoff_t nr, bool even_cows)
> > {
> > pgoff_t first_index = start, last_index = start + nr - 1;
> > - struct zap_details details = { .zap_mapping = mapping };
> > + struct zap_details details = {
> > + .zap_mapping = mapping,
>
> I meant to comment on this in the previous patch, but it might be nice to set
> .zap_mapping in the !even_cows case below to make it very obvious it only
> applies to ZAP_FLAG_CHECK_MAPPING.

I wanted to make it easy to understand by having zap_mapping always point to
the mapping it's zapping, so it does not contain any other information like
"whether we want to check the mapping is the same when zapping", which now stays
fully in the flags. Then it's always legal to reference zap_mapping without any
prior knowledge. But indeed it's only used by ZAP_FLAG_CHECK_MAPPING.

I do have a slight preference to keep it as the patch does, but I don't have a
strong opinion. Let me know if you insist; I can change.

>
> Otherwise I think this is a good clean up which makes things clearer. I double
> checked that unmap_mapping_pages() was the only place in the existing code that
> needs ZAP_FLAG_SKIP_SWAP and that appears to be the case so there shouldn't be
> any behaviour changes from this.
>
> Reviewed-by: Alistair Popple <[email protected]>

Since I won't change anything within this patch, I'll take this away, thanks!

--
Peter Xu

2021-06-22 00:44:41

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v3 11/27] shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed

On Mon, Jun 21, 2021 at 06:41:17PM +1000, Alistair Popple wrote:
> On Friday, 28 May 2021 6:22:14 AM AEST Peter Xu wrote:
> > File-backed memory is prone to being unmapped at any time. It means all
> > information in the pte will be dropped, including the uffd-wp flag.
> >
> > Since the uffd-wp info cannot be stored in page cache or swap cache, persist
> > this wr-protect information by installing the special uffd-wp marker pte when
> > we're going to unmap a uffd wr-protected pte. When the pte is accessed again,
> > we will know it's previously wr-protected by recognizing the special pte.
> >
> > Meanwhile add a new flag ZAP_FLAG_DROP_FILE_UFFD_WP when we don't want to
> > persist such information. For example, when destroying the whole vma, or
> > punching a hole in a shmem file. For the latter, we can only drop the uffd-wp
> > bit when holding the page lock. It means the unmap_mapping_range() in
> > shmem_fallocate() still requires zapping without ZAP_FLAG_DROP_FILE_UFFD_WP
> > because that's still racy with the page faults.
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > include/linux/mm.h | 11 ++++++++++
> > include/linux/mm_inline.h | 43 +++++++++++++++++++++++++++++++++++++++
> > mm/memory.c | 42 +++++++++++++++++++++++++++++++++++++-
> > mm/rmap.c | 8 ++++++++
> > mm/truncate.c | 8 +++++++-
> > 5 files changed, 110 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index b1fb2826e29c..5989fc7ed00d 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1725,6 +1725,8 @@ extern void user_shm_unlock(size_t, struct user_struct *);
> > #define ZAP_FLAG_CHECK_MAPPING BIT(0)
> > /* Whether to skip zapping swap entries */
> > #define ZAP_FLAG_SKIP_SWAP BIT(1)
> > +/* Whether to completely drop uffd-wp entries for file-backed memory */
> > +#define ZAP_FLAG_DROP_FILE_UFFD_WP BIT(2)
> >
> > /*
> > * Parameter block passed down to zap_pte_range in exceptional cases.
> > @@ -1757,6 +1759,15 @@ zap_skip_swap(struct zap_details *details)
> > return details->zap_flags & ZAP_FLAG_SKIP_SWAP;
> > }
> >
> > +static inline bool
> > +zap_drop_file_uffd_wp(struct zap_details *details)
> > +{
> > + if (!details)
> > + return false;
> > +
> > + return details->zap_flags & ZAP_FLAG_DROP_FILE_UFFD_WP;
> > +}
>
> Is this a good default, having to explicitly specify that you don't want
> special ptes left in place?

I made it explicitly the default so we won't accidentally drop that bit without
being aware of it, because losing the uffd-wp bit anywhere can directly cause
data corruption in userspace.

> For example the OOM killer seems to call unmap_page_range() with details ==
> NULL (although in practice only for anonymous vmas, so it won't actually cause
> an issue). Similarly in madvise for MADV_DONTNEED, although arguably I
> suppose that is the correct thing to do there?

So I must confess I'm not familiar with the OOM code; it looks to me like a
fast path to recycle pages that have a better chance of being reclaimed. Even
in exit_mmap() we'll do this first, before unmap_vmas(). Then it still looks
like the right thing to do if it's only a fast path, not to mention that if it
only runs on anonymous vmas then it's ignored anyway.

Basically I followed this rule: the bit should only be cleared if (1) the user
manually clears it using UFFDIO_WRITEPROTECT, or (2) we're unmapping the whole
region. There can be special cases, e.g. when unregistering a vma with
VM_UFFD_WP, but that's rare, and we also have code to take care of those lazily
(e.g., in do_swap_pte we'll restore such a uffd-wp special pte into a none pte
if we get a fault and the vma is not registered with uffd-wp at all).
Otherwise I never clear the bit.

>
> > struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > pte_t pte);
> > struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > index 355ea1ee32bd..c29a6ef3a642 100644
> > --- a/include/linux/mm_inline.h
> > +++ b/include/linux/mm_inline.h
> > @@ -4,6 +4,8 @@
> >
> > #include <linux/huge_mm.h>
> > #include <linux/swap.h>
> > +#include <linux/userfaultfd_k.h>
> > +#include <linux/swapops.h>
> >
> > /**
> > * page_is_file_lru - should the page be on a file LRU or anon LRU?
> > @@ -104,4 +106,45 @@ static __always_inline void del_page_from_lru_list(struct page *page,
> > update_lru_size(lruvec, page_lru(page), page_zonenum(page),
> > -thp_nr_pages(page));
> > }
> > +
> > +/*
> > + * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
> > + * replace a none pte. NOTE! This should only be called when *pte is already
> > + * cleared so we will never accidentally replace something valuable. Meanwhile
> > + * none pte also means we are not demoting the pte so if tlb flushed then we
> > + * don't need to do it again; otherwise if tlb flush is postponed then it's
> > + * even better.
> > + *
> > + * Must be called with pgtable lock held.
> > + */
> > +static inline void
> > +pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
> > + pte_t *pte, pte_t pteval)
> > +{
> > +#ifdef CONFIG_USERFAULTFD
> > + bool arm_uffd_pte = false;
> > +
> > + /* The current status of the pte should be "cleared" before calling */
> > + WARN_ON_ONCE(!pte_none(*pte));
> > +
> > + if (vma_is_anonymous(vma))
> > + return;
> > +
> > + /* A uffd-wp wr-protected normal pte */
> > + if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
> > + arm_uffd_pte = true;
> > +
> > + /*
> > + * A uffd-wp wr-protected swap pte. Note: this should even work for
> > + * pte_swp_uffd_wp_special() too.
> > + */
>
> I'm probably missing something but when can we actually have this case and why
> would we want to leave a special pte behind? From what I can tell this is
> called from try_to_unmap_one() where this won't be true or from zap_pte_range()
> when not skipping swap pages.

Yes this is a good question..

Initially I made this function cover all forms of the uffd-wp bit, for both
swap and present ptes; imho that's pretty safe. However for !anonymous cases we
don't keep a swap entry inside the pte even when swapped out, as it resides in
the shmem page cache instead. The only missing piece seems to be the device
private entries, as you also spotted below.

>
> > + if (unlikely(is_swap_pte(pteval) && pte_swp_uffd_wp(pteval)))
> > + arm_uffd_pte = true;
> > +
> > + if (unlikely(arm_uffd_pte))
> > + set_pte_at(vma->vm_mm, addr, pte,
> > + pte_swp_mkuffd_wp_special(vma));
> > +#endif
> > +}
> > +
> > #endif
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 319552efc782..3453b8ae5f4f 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -73,6 +73,7 @@
> > #include <linux/perf_event.h>
> > #include <linux/ptrace.h>
> > #include <linux/vmalloc.h>
> > +#include <linux/mm_inline.h>
> >
> > #include <trace/events/kmem.h>
> >
> > @@ -1298,6 +1299,21 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > return ret;
> > }
> >
> > +/*
> > + * This function makes sure that we'll replace the none pte with an uffd-wp
> > + * swap special pte marker when necessary. Must be with the pgtable lock held.
> > + */
> > +static inline void
> > +zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
> > + unsigned long addr, pte_t *pte,
> > + struct zap_details *details, pte_t pteval)
> > +{
> > + if (zap_drop_file_uffd_wp(details))
> > + return;
> > +
> > + pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
> > +}
> > +
> > static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > struct vm_area_struct *vma, pmd_t *pmd,
> > unsigned long addr, unsigned long end,
> > @@ -1335,6 +1351,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > ptent = ptep_get_and_clear_full(mm, addr, pte,
> > tlb->fullmm);
> > tlb_remove_tlb_entry(tlb, pte, addr);
> > + zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> > + ptent);
> > if (unlikely(!page))
> > continue;
> >
> > @@ -1359,6 +1377,22 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > continue;
> > }
> >
> > + /*
> > + * If this is a special uffd-wp marker pte... Drop it only if
> > + * enforced to do so.
> > + */
> > + if (unlikely(is_swap_special_pte(ptent))) {
> > + WARN_ON_ONCE(!pte_swp_uffd_wp_special(ptent));
>
> Why the WARN_ON and not just test pte_swp_uffd_wp_special() directly?
>
> > + /*
> > + * If this is a common unmap of ptes, keep this as is.
> > + * Drop it only if this is a whole-vma destruction.
> > + */
> > + if (zap_drop_file_uffd_wp(details))
> > + ptep_get_and_clear_full(mm, addr, pte,
> > + tlb->fullmm);
> > + continue;
> > + }
> > +
> > entry = pte_to_swp_entry(ptent);
> > if (is_device_private_entry(entry) ||
> > is_device_exclusive_entry(entry)) {
> > @@ -1373,6 +1407,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > page_remove_rmap(page, false);
> >
> > put_page(page);
> > + zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> > + ptent);
>
> Device entries only support anonymous vmas at present so should we drop this?
> I guess I'm also a little confused by this because I'm not sure in what
> scenarios you would want to zap swap entries but leave special swap ptes behind
> (see also my earlier question above as well).

If that's the case, maybe indeed this is not needed, and I can use a
WARN_ON_ONCE here instead, just in case the facts change. E.g., would it be
possible one day to have !anonymous support for device private entries?
Frankly I have no solid idea of how device private is used, so some more
context would be nice; since you know this much better than me, maybe it's a
good chance to learn more about it. :)
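
To be concrete, something like this is what I'd have in mind for the device
entry branch (untested sketch):

	/*
	 * Sketch: drop the zap_install_uffd_wp_if_needed() call here and
	 * instead assert the assumption that device private/exclusive
	 * entries only exist in anonymous vmas, so there is nothing to
	 * persist for uffd-wp.
	 */
	WARN_ON_ONCE(!vma_is_anonymous(vma));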

>
> > continue;
> > }
> >
> > @@ -1390,6 +1426,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > if (unlikely(!free_swap_and_cache(entry)))
> > print_bad_pte(vma, addr, ptent, NULL);
> > pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> > + zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
> > } while (pte++, addr += PAGE_SIZE, addr != end);
> >
> > add_mm_rss_vec(mm, rss);
> > @@ -1589,12 +1626,15 @@ void unmap_vmas(struct mmu_gather *tlb,
> > unsigned long end_addr)
> > {
> > struct mmu_notifier_range range;
> > + struct zap_details details = {
> > + .zap_flags = ZAP_FLAG_DROP_FILE_UFFD_WP,
> > + };
> >
> > mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
> > start_addr, end_addr);
> > mmu_notifier_invalidate_range_start(&range);
> > for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> > - unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
> > + unmap_single_vma(tlb, vma, start_addr, end_addr, &details);
> > mmu_notifier_invalidate_range_end(&range);
> > }
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 0419c9a1a280..a94d9aed9d95 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -72,6 +72,7 @@
> > #include <linux/page_idle.h>
> > #include <linux/memremap.h>
> > #include <linux/userfaultfd_k.h>
> > +#include <linux/mm_inline.h>
> >
> > #include <asm/tlbflush.h>
> >
> > @@ -1509,6 +1510,13 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > pteval = ptep_clear_flush(vma, address, pvmw.pte);
> > }
> >
> > + /*
> > + * Now the pte is cleared. If this is uffd-wp armed pte, we
> > + * may want to replace a none pte with a marker pte if it's
> > + * file-backed, so we don't lose the tracking information.
> > + */
> > + pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
>
> From what I can tell we don't need to do this in try_to_migrate_one() (assuming
> that goes in) as well because the existing uffd wp code already deals with
> copying the pte bits over to the migration entries. Is that correct?

I agree try_to_migrate_one() shouldn't need it. But I'm not sure about
try_to_unmap_one(), as e.g. I think we rely on this to make shmem work when a
page gets swapped out.

Thanks,

--
Peter Xu

2021-06-22 02:09:16

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v3 08/27] mm: Introduce zap_details.zap_flags

On Tuesday, 22 June 2021 2:16:50 AM AEST Peter Xu wrote:
> On Mon, Jun 21, 2021 at 10:09:00PM +1000, Alistair Popple wrote:
> > On Friday, 28 May 2021 6:21:30 AM AEST Peter Xu wrote:
> > > Instead of trying to introduce one variable for every new zap_details field,
> > > let's introduce a flag so that it can start to encode true/false information.
> > >
> > > Let's start to use this flag first to clean up the only check_mapping variable.
> > > Firstly, the name "check_mapping" implies this is a "boolean", but actually it
> > > stores the mapping inside, just in a way that it won't be set if we don't want
> > > to check the mapping.
> > >
> > > To make things clearer, introduce the 1st zap flag ZAP_FLAG_CHECK_MAPPING, so
> > > that we only check against the mapping if this bit is set. At the same time, we
> > > can rename check_mapping into zap_mapping and set it always.
> > >
> > > While at it, introduce another helper zap_check_mapping_skip() and use it in
> > > zap_pte_range() properly.
> > >
> > > Some old comments have been removed in zap_pte_range() because they're
> > > duplicated, and since now we're with ZAP_FLAG_CHECK_MAPPING flag, it'll be very
> > > easy to grep this information by simply grepping the flag.
> > >
> > > It'll also make life easier when we want to e.g. pass in zap_flags into the
> > > callers like unmap_mapping_pages() (instead of adding new booleans besides the
> > > even_cows parameter).
> > >
> > > Signed-off-by: Peter Xu <[email protected]>
> > > ---
> > > include/linux/mm.h | 19 ++++++++++++++++++-
> > > mm/memory.c | 31 ++++++++-----------------------
> > > 2 files changed, 26 insertions(+), 24 deletions(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index db155be8e66c..52d3ef2ed753 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -1721,13 +1721,30 @@ static inline bool can_do_mlock(void) { return false; }
> > > extern int user_shm_lock(size_t, struct user_struct *);
> > > extern void user_shm_unlock(size_t, struct user_struct *);
> > >
> > > +/* Whether to check page->mapping when zapping */
> > > +#define ZAP_FLAG_CHECK_MAPPING BIT(0)
> > > +
> > > /*
> > > * Parameter block passed down to zap_pte_range in exceptional cases.
> > > */
> > > struct zap_details {
> > > - struct address_space *check_mapping; /* Check page->mapping if set */
> > > + struct address_space *zap_mapping;
> > > + unsigned long zap_flags;
> > > };
> > >
> > > +/* Return true if skip zapping this page, false otherwise */
> > > +static inline bool
> > > +zap_check_mapping_skip(struct zap_details *details, struct page *page)
> > > +{
> > > + if (!details || !page)
> > > + return false;
> > > +
> > > + if (!(details->zap_flags & ZAP_FLAG_CHECK_MAPPING))
> > > + return false;
>
> [1]
>
> > > +
> > > + return details->zap_mapping != page_rmapping(page);
> >
> > I doubt this matters in practice, but there is a slight behaviour change
> > here that might be worth checking. Previously this check was equivalent
> > to:
> >
> > details->zap_mapping && details->zap_mapping != page_rmapping(page)
>
> Yes; IMHO "details->zap_mapping" is just replaced by the check at [1].

Yes, but what I meant is that this check is slightly different in behaviour
from the old code, which would never skip if check/zap_mapping == NULL, whereas
the new code will skip if
details->zap_mapping == NULL && page_rmapping(page) != NULL, because the check
has effectively become:

if ((details->zap_flags & ZAP_FLAG_CHECK_MAPPING) &&
details->zap_mapping != page_rmapping(page))
continue;

instead of:

if (details->zap_mapping &&
details->zap_mapping != page_rmapping(page))
continue;

As I said though I only looked at this superficially from the perspective of
whether this patch changes existing code behaviour. I doubt this is a real
problem because I assume
details->check_mapping == NULL && page_rmapping(page) != NULL can never
actually happen in practice.
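
If we ever did want to preserve the old semantics exactly, it would only need
an extra NULL check in the helper, e.g. (sketch):

	if (!details || !page)
		return false;

	if (!(details->zap_flags & ZAP_FLAG_CHECK_MAPPING) ||
	    !details->zap_mapping)
		return false;

	return details->zap_mapping != page_rmapping(page);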

> For example, there's only one real user of this mapping check, which is
> unmap_mapping_pages() below [2].
>
> With the old code, we have:
>
> details.check_mapping = even_cows ? NULL : mapping;
>
> So "details->zap_mapping" is only true if "!even_cows".
>
> With the new code, we'll have:
>
> if (!even_cows)
> details.zap_flags |= ZAP_FLAG_CHECK_MAPPING;
>
> So ZAP_FLAG_CHECK_MAPPING is only set if "!even_cows", while that's what we
> check exactly at [1].
> >
> > Otherwise I think this looks good.
> >
> > > +}
> > > +
> > > struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > > pte_t pte);
> > > struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 27cf8a6375c6..c9dc4e9e05b5 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1330,16 +1330,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > struct page *page;
> > >
> > > page = vm_normal_page(vma, addr, ptent);
> > > - if (unlikely(details) && page) {
> > > - /*
> > > - * unmap_shared_mapping_pages() wants to
> > > - * invalidate cache without truncating:
> > > - * unmap shared but keep private pages.
> > > - */
> > > - if (details->check_mapping &&
> > > - details->check_mapping != page_rmapping(page))
> > > - continue;
> > > - }
> > > + if (unlikely(zap_check_mapping_skip(details, page)))
> > > + continue;
> > > ptent = ptep_get_and_clear_full(mm, addr, pte,
> > > tlb->fullmm);
> > > tlb_remove_tlb_entry(tlb, pte, addr);
> > > @@ -1372,17 +1364,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > is_device_exclusive_entry(entry)) {
> > > struct page *page = pfn_swap_entry_to_page(entry);
> > >
> > > - if (unlikely(details && details->check_mapping)) {
> > > - /*
> > > - * unmap_shared_mapping_pages() wants to
> > > - * invalidate cache without truncating:
> > > - * unmap shared but keep private pages.
> > > - */
> > > - if (details->check_mapping !=
> > > - page_rmapping(page))
> > > - continue;
> > > - }
> > > -
> > > + if (unlikely(zap_check_mapping_skip(details, page)))
> > > + continue;
> > > pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> > > rss[mm_counter(page)]--;
> > >
> > > @@ -3345,9 +3328,11 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
> > > pgoff_t nr, bool even_cows)
> > > {
> > > pgoff_t first_index = start, last_index = start + nr - 1;
> > > - struct zap_details details = { };
> > > + struct zap_details details = { .zap_mapping = mapping };
> > > +
> > > + if (!even_cows)
> > > + details.zap_flags |= ZAP_FLAG_CHECK_MAPPING;
> > >
> > > - details.check_mapping = even_cows ? NULL : mapping;
>
> [2]
>
> > > if (last_index < first_index)
> > > last_index = ULONG_MAX;
>
> Thanks,
>
>




2021-06-22 02:13:50

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v3 09/27] mm: Introduce ZAP_FLAG_SKIP_SWAP

On Tuesday, 22 June 2021 2:26:23 AM AEST Peter Xu wrote:
> On Mon, Jun 21, 2021 at 10:36:46PM +1000, Alistair Popple wrote:
> > On Friday, 28 May 2021 6:21:35 AM AEST Peter Xu wrote:
> > > Firstly, the comment in zap_pte_range() is misleading because it checks against
> > > details rather than check_mapping, so the comment doesn't match what the code does.
> > >
> > > Meanwhile, it's confusing too on not explaining why passing in the details
> > > pointer would mean to skip all swap entries. New user of zap_details could
> > > very possibly miss this fact if they don't read deep until zap_pte_range()
> > > because there's no comment at zap_details talking about it at all, so swap
> > > entries could be erroneously skipped without being noticed.
> > >
> > > This partly reverts 3e8715fdc03e ("mm: drop zap_details::check_swap_entries"),
> > > but introduces a ZAP_FLAG_SKIP_SWAP flag, which means the opposite of the previous
> > > "details" parameter: the caller should explicitly set this to skip swap
> > > entries, otherwise swap entries will always be considered (which is still the
> > > major case here).
> > >
> > > Cc: Kirill A. Shutemov <[email protected]>
> > > Signed-off-by: Peter Xu <[email protected]>
> > > ---
> > > include/linux/mm.h | 12 ++++++++++++
> > > mm/memory.c | 8 +++++---
> > > 2 files changed, 17 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 52d3ef2ed753..1adf313a01fe 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -1723,6 +1723,8 @@ extern void user_shm_unlock(size_t, struct user_struct *);
> > >
> > > /* Whether to check page->mapping when zapping */
> > > #define ZAP_FLAG_CHECK_MAPPING BIT(0)
> > > +/* Whether to skip zapping swap entries */
> > > +#define ZAP_FLAG_SKIP_SWAP BIT(1)
> > >
> > > /*
> > > * Parameter block passed down to zap_pte_range in exceptional cases.
> > > @@ -1745,6 +1747,16 @@ zap_check_mapping_skip(struct zap_details *details, struct page *page)
> > > return details->zap_mapping != page_rmapping(page);
> > > }
> > >
> > > +/* Return true if skip swap entries, false otherwise */
> > > +static inline bool
> > > +zap_skip_swap(struct zap_details *details)
> >
> > Minor nit-pick but imho it would be nice if the naming was consistent between
> > this and check mapping. Ie. zap_skip_swap()/zap_skip_check_mapping() or
> > zap_swap_skip()/zap_check_mapping_skip().
>
> Makes sense; I'll use zap_skip_swap()/zap_skip_check_mapping() I think, then I
> keep this patch untouched.

Sounds good.

> >
> > > +{
> > > + if (!details)
> > > + return false;
> > > +
> > > + return details->zap_flags & ZAP_FLAG_SKIP_SWAP;
> > > +}
> > > +
> > > struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > > pte_t pte);
> > > struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index c9dc4e9e05b5..8a3751be87ba 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1376,8 +1376,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > continue;
> > > }
> > >
> > > - /* If details->check_mapping, we leave swap entries. */
> > > - if (unlikely(details))
> > > + if (unlikely(zap_skip_swap(details)))
> > > continue;
> > >
> > > if (!non_swap_entry(entry))
> > > @@ -3328,7 +3327,10 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
> > > pgoff_t nr, bool even_cows)
> > > {
> > > pgoff_t first_index = start, last_index = start + nr - 1;
> > > - struct zap_details details = { .zap_mapping = mapping };
> > > + struct zap_details details = {
> > > + .zap_mapping = mapping,
> >
> > I meant to comment on this in the previous patch, but it might be nice to set
> > .zap_mapping in the !even_cows case below to make it very obvious it only
> > applies to ZAP_FLAG_CHECK_MAPPING.
>
> I wanted to make it easy to understand by having zap_mapping always point to
> the mapping it's zapping, so it doesn't carry any other information like
> "whether we want to check that the mapping is the same when zapping", which now
> stays fully in the flags. Then it's always legal to reference zap_mapping
> without any prior knowledge. But indeed it's only used by ZAP_FLAG_CHECK_MAPPING.
>
> I do have a slight preference to keep it as the patch does, but I don't have a
> strong opinion. Let me know if you insist and I can change it.

No insistence from me if you want to keep it this way; it's all pretty obvious
anyway.
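
(For the record, the alternative I had in mind for unmap_mapping_pages() would
look roughly like this - a sketch only, using the flags from this series:)

	struct zap_details details = { .zap_flags = ZAP_FLAG_SKIP_SWAP };

	if (!even_cows) {
		details.zap_flags |= ZAP_FLAG_CHECK_MAPPING;
		/* Only set zap_mapping when it is actually going to be checked */
		details.zap_mapping = mapping;
	}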

> >
> > Otherwise I think this is a good clean-up which makes things clearer. I double
> > checked that unmap_mapping_pages() was the only place in the existing code that
> > needs ZAP_FLAG_SKIP_SWAP, and that appears to be the case, so there shouldn't
> > be any behaviour changes from this.
> >
> > Reviewed-by: Alistair Popple <[email protected]>
>
> Since I won't change anything within this patch, I'll take this away, thanks!
>
>




2021-06-22 12:48:49

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v3 11/27] shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed

On Tuesday, 22 June 2021 10:40:37 AM AEST Peter Xu wrote:
> On Mon, Jun 21, 2021 at 06:41:17PM +1000, Alistair Popple wrote:
> > On Friday, 28 May 2021 6:22:14 AM AEST Peter Xu wrote:
> > > File-backed memory is prone to being unmapped at any time. It means all
> > > information in the pte will be dropped, including the uffd-wp flag.
> > >
> > > Since the uffd-wp info cannot be stored in page cache or swap cache, persist
> > > this wr-protect information by installing the special uffd-wp marker pte when
> > > we're going to unmap a uffd wr-protected pte. When the pte is accessed again,
> > > we will know it's previously wr-protected by recognizing the special pte.
> > >
> > > Meanwhile add a new flag ZAP_FLAG_DROP_FILE_UFFD_WP for when we don't want to
> > > persist such information. For example, when destroying the whole vma, or
> > > punching a hole in a shmem file. For the latter, we can only drop the uffd-wp
> > > bit when holding the page lock. It means the unmap_mapping_range() in
> > > shmem_fallocate() still requires zapping without ZAP_FLAG_DROP_FILE_UFFD_WP
> > > because that's still racy with the page faults.
> > >
> > > Signed-off-by: Peter Xu <[email protected]>
> > > ---
> > > include/linux/mm.h | 11 ++++++++++
> > > include/linux/mm_inline.h | 43 +++++++++++++++++++++++++++++++++++++++
> > > mm/memory.c | 42 +++++++++++++++++++++++++++++++++++++-
> > > mm/rmap.c | 8 ++++++++
> > > mm/truncate.c | 8 +++++++-
> > > 5 files changed, 110 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index b1fb2826e29c..5989fc7ed00d 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -1725,6 +1725,8 @@ extern void user_shm_unlock(size_t, struct user_struct *);
> > > #define ZAP_FLAG_CHECK_MAPPING BIT(0)
> > > /* Whether to skip zapping swap entries */
> > > #define ZAP_FLAG_SKIP_SWAP BIT(1)
> > > +/* Whether to completely drop uffd-wp entries for file-backed memory */
> > > +#define ZAP_FLAG_DROP_FILE_UFFD_WP BIT(2)
> > >
> > > /*
> > > * Parameter block passed down to zap_pte_range in exceptional cases.
> > > @@ -1757,6 +1759,15 @@ zap_skip_swap(struct zap_details *details)
> > > return details->zap_flags & ZAP_FLAG_SKIP_SWAP;
> > > }
> > >
> > > +static inline bool
> > > +zap_drop_file_uffd_wp(struct zap_details *details)
> > > +{
> > > + if (!details)
> > > + return false;
> > > +
> > > + return details->zap_flags & ZAP_FLAG_DROP_FILE_UFFD_WP;
> > > +}
> >
> > Is this a good default having to explicitly specify that you don't want
> > special pte's left in place?
>
> I made it explicitly the default so we won't accidentally drop that bit without
> being aware of it; because missing of the uffd-wp bit anywhere can directly
> cause data corruption in the userspace.

Ok, I think that makes sense. I was just a little concerned about leaving
special pte's behind everywhere by accident and whether there would be any
unforeseen side effects from that. As you point out below though we do expect
that to happen occasionally and to clean them up when found.

> > For example the OOM killer seems to call unmap_page_range() with details ==
> > NULL (although in practice only for anonymous vmas so it wont actually cause
> > an issue). Similarly in madvise for MADV_DONTNEED, although arguably I
> > suppose that is the correct thing to do there?
>
> So I must confess I'm not familiar with the oom code; it looks to me like a
> fast path to recycle pages that have a better chance of being reclaimed. Even
> in exit_mmap() we'll do this first before unmap_vmas(). Then it still looks
> like the right thing to do if it's only a fast path, not to mention that if we
> only run it on anonymous memory then it's ignored anyway.

Don't confuse my ability to grep with understanding of the OOM killer :-)

I was just reviewing cases where we might leave behind unwanted special ptes.
I don't think I really found any but wanted to ask about them anyway to learn
more about the rules for them (which you have answered below, thanks!).

> Basically I followed this rule: the bit should never be cleared unless (1) the
> user manually clears it using UFFDIO_WRITEPROTECT, or (2) we're unmapping the
> whole region. There can be special cases, e.g. when unregistering a vma with
> VM_UFFD_WP, but that's a rare case, and we also have code to take care of those
> lazily (e.g., we'll restore such a uffd-wp special pte back into a none pte if
> we get a fault and the vma is not registered with uffd-wp at all, in
> do_swap_pte). Otherwise I never clear the bit.
>
> >
> > > struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > > pte_t pte);
> > > struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > > index 355ea1ee32bd..c29a6ef3a642 100644
> > > --- a/include/linux/mm_inline.h
> > > +++ b/include/linux/mm_inline.h
> > > @@ -4,6 +4,8 @@
> > >
> > > #include <linux/huge_mm.h>
> > > #include <linux/swap.h>
> > > +#include <linux/userfaultfd_k.h>
> > > +#include <linux/swapops.h>
> > >
> > > /**
> > > * page_is_file_lru - should the page be on a file LRU or anon LRU?
> > > @@ -104,4 +106,45 @@ static __always_inline void del_page_from_lru_list(struct page *page,
> > > update_lru_size(lruvec, page_lru(page), page_zonenum(page),
> > > -thp_nr_pages(page));
> > > }
> > > +
> > > +/*
> > > + * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
> > > + * replace a none pte. NOTE! This should only be called when *pte is already
> > > + * cleared so we will never accidentally replace something valuable. Meanwhile
> > > + * none pte also means we are not demoting the pte so if tlb flushed then we
> > > + * don't need to do it again; otherwise if tlb flush is postponed then it's
> > > + * even better.
> > > + *
> > > + * Must be called with pgtable lock held.
> > > + */
> > > +static inline void
> > > +pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
> > > + pte_t *pte, pte_t pteval)
> > > +{
> > > +#ifdef CONFIG_USERFAULTFD
> > > + bool arm_uffd_pte = false;
> > > +
> > > + /* The current status of the pte should be "cleared" before calling */
> > > + WARN_ON_ONCE(!pte_none(*pte));
> > > +
> > > + if (vma_is_anonymous(vma))
> > > + return;
> > > +
> > > + /* A uffd-wp wr-protected normal pte */
> > > + if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
> > > + arm_uffd_pte = true;
> > > +
> > > + /*
> > > + * A uffd-wp wr-protected swap pte. Note: this should even work for
> > > + * pte_swp_uffd_wp_special() too.
> > > + */
> >
> > I'm probably missing something but when can we actually have this case and why
> > would we want to leave a special pte behind? From what I can tell this is
> > called from try_to_unmap_one() where this won't be true or from zap_pte_range()
> > when not skipping swap pages.
>
> Yes this is a good question..
>
> Initially I made this function make sure I cover all forms of uffd-wp bit, that
> contains both swap and present ptes; imho that's pretty safe. However for
> !anonymous cases we don't keep swap entry inside pte even if swapped out, as
> they should reside in shmem page cache indeed. The only missing piece seems to
> be the device private entries as you also spotted below.

Yes, I think it's *probably* safe although I don't yet have a strong opinion
here ...

> > > + if (unlikely(is_swap_pte(pteval) && pte_swp_uffd_wp(pteval)))

... however if this can never happen would a WARN_ON() be better? It would also
mean you could remove arm_uffd_pte.

> > > + arm_uffd_pte = true;
> > > +
> > > + if (unlikely(arm_uffd_pte))
> > > + set_pte_at(vma->vm_mm, addr, pte,
> > > + pte_swp_mkuffd_wp_special(vma));
> > > +#endif
> > > +}
> > > +
> > > #endif
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 319552efc782..3453b8ae5f4f 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -73,6 +73,7 @@
> > > #include <linux/perf_event.h>
> > > #include <linux/ptrace.h>
> > > #include <linux/vmalloc.h>
> > > +#include <linux/mm_inline.h>
> > >
> > > #include <trace/events/kmem.h>
> > >
> > > @@ -1298,6 +1299,21 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > > return ret;
> > > }
> > >
> > > +/*
> > > + * This function makes sure that we'll replace the none pte with an uffd-wp
> > > + * swap special pte marker when necessary. Must be with the pgtable lock held.
> > > + */
> > > +static inline void
> > > +zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
> > > + unsigned long addr, pte_t *pte,
> > > + struct zap_details *details, pte_t pteval)
> > > +{
> > > + if (zap_drop_file_uffd_wp(details))
> > > + return;
> > > +
> > > + pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
> > > +}
> > > +
> > > static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > struct vm_area_struct *vma, pmd_t *pmd,
> > > unsigned long addr, unsigned long end,
> > > @@ -1335,6 +1351,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > ptent = ptep_get_and_clear_full(mm, addr, pte,
> > > tlb->fullmm);
> > > tlb_remove_tlb_entry(tlb, pte, addr);
> > > + zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> > > + ptent);
> > > if (unlikely(!page))
> > > continue;
> > >
> > > @@ -1359,6 +1377,22 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > continue;
> > > }
> > >
> > > + /*
> > > + * If this is a special uffd-wp marker pte... Drop it only if
> > > + * enforced to do so.
> > > + */
> > > + if (unlikely(is_swap_special_pte(ptent))) {
> > > + WARN_ON_ONCE(!pte_swp_uffd_wp_special(ptent));
> >
> > Why the WARN_ON and not just test pte_swp_uffd_wp_special() directly?
> >
> > > + /*
> > > + * If this is a common unmap of ptes, keep this as is.
> > > + * Drop it only if this is a whole-vma destruction.
> > > + */
> > > + if (zap_drop_file_uffd_wp(details))
> > > + ptep_get_and_clear_full(mm, addr, pte,
> > > + tlb->fullmm);
> > > + continue;
> > > + }
> > > +
> > > entry = pte_to_swp_entry(ptent);
> > > if (is_device_private_entry(entry) ||
> > > is_device_exclusive_entry(entry)) {
> > > @@ -1373,6 +1407,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > page_remove_rmap(page, false);
> > >
> > > put_page(page);
> > > + zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> > > + ptent);
> >
> > Device entries only support anonymous vmas at present so should we drop this?
> > I guess I'm also a little confused by this because I'm not sure in what
> > scenarios you would want to zap swap entries but leave special swap ptes behind
> > (see also my earlier question above as well).
>
> If that's the case, maybe indeed this is not needed, and I can use a
> WARN_ON_ONCE here instead, just in case some facts changes. E.g., would it be
> possible one day to have !anonymous support for device private entries?
> Frankly I have no solid idea on how device private is used, so some more
> context would be nice too; since I think you should know much better than me,
> so maybe it's a good chance to learn more about it. :)

Yes, a WARN_ON_ONCE() would be good if you remove it. We are planning to add
support for !anonymous device private entries at some point.

There's nothing too special about device private entries. They exist to store
some state and look up a device driver callback that gets called when the CPU
tries to access the page. For example see how do_swap_page() handles them:

	} else if (is_device_private_entry(entry)) {
		vmf->page = pfn_swap_entry_to_page(entry);
		ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);

Normally a device driver provides the implementation of migrate_to_ram() which
will copy the page back to CPU addressable memory and restore the PTE to a
normal functioning PTE using the migrate_vma_*() interfaces. Typically this is
used to allow migration of a page to memory that is not directly CPU addressable
(eg. GPU memory). Hopefully that goes some way to explaining what they are, but
if you have more questions let me know!
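
In driver terms the plumbing looks roughly like this (driver names hypothetical,
just a sketch):

/* Hypothetical driver wiring up migrate_to_ram for its device private pages */
static vm_fault_t mydrv_migrate_to_ram(struct vm_fault *vmf)
{
	/*
	 * Copy the data back from device memory to system RAM (typically via
	 * the migrate_vma_*() helpers) so the faulting address ends up with a
	 * normal present pte again.
	 */
	return mydrv_fault_back_to_ram(vmf);	/* driver-specific, hypothetical */
}

static const struct dev_pagemap_ops mydrv_pagemap_ops = {
	.migrate_to_ram	= mydrv_migrate_to_ram,
};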

As far as I know there should already be support for userfaultfd-wp on device
private pages, and given they can only currently exist in anon vmas I think we
should be safe to not install a special pte when unmapping. On the other hand
I suppose it doesn't matter if we do install one, right?

> >
> > > continue;
> > > }
> > >
> > > @@ -1390,6 +1426,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > if (unlikely(!free_swap_and_cache(entry)))
> > > print_bad_pte(vma, addr, ptent, NULL);
> > > pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> > > + zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
> > > } while (pte++, addr += PAGE_SIZE, addr != end);
> > >
> > > add_mm_rss_vec(mm, rss);
> > > @@ -1589,12 +1626,15 @@ void unmap_vmas(struct mmu_gather *tlb,
> > > unsigned long end_addr)
> > > {
> > > struct mmu_notifier_range range;
> > > + struct zap_details details = {
> > > + .zap_flags = ZAP_FLAG_DROP_FILE_UFFD_WP,
> > > + };
> > >
> > > mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
> > > start_addr, end_addr);
> > > mmu_notifier_invalidate_range_start(&range);
> > > for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> > > - unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
> > > + unmap_single_vma(tlb, vma, start_addr, end_addr, &details);
> > > mmu_notifier_invalidate_range_end(&range);
> > > }
> > >
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index 0419c9a1a280..a94d9aed9d95 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -72,6 +72,7 @@
> > > #include <linux/page_idle.h>
> > > #include <linux/memremap.h>
> > > #include <linux/userfaultfd_k.h>
> > > +#include <linux/mm_inline.h>
> > >
> > > #include <asm/tlbflush.h>
> > >
> > > @@ -1509,6 +1510,13 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > pteval = ptep_clear_flush(vma, address, pvmw.pte);
> > > }
> > >
> > > + /*
> > > + * Now the pte is cleared. If this is uffd-wp armed pte, we
> > > + * may want to replace a none pte with a marker pte if it's
> > > + * file-backed, so we don't lose the tracking information.
> > > + */
> > > + pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
> >
> > From what I can tell we don't need to do this in try_to_migrate_one() (assuming
> > that goes in) as well because the existing uffd wp code already deals with
> > copying the pte bits over to the migration entries. Is that correct?
>
> I agree try_to_migrate_one() shouldn't need it. But I'm not sure about
> try_to_unmap_one(), as e.g. I think we should rely on this to make shmem work
> when the page gets swapped out.

Oh for sure I agree you need it in try_to_unmap_one(); my code didn't change
the unmap path. It just split the migration cases (i.e. replacing mappings with
migration entries instead of unmapping) into a different function, so I just
wanted to make sure we didn't need it in try_to_migrate_one() (and I think we
agree it isn't needed there).

- Alistair

> Thanks,
>
>




2021-06-22 15:46:21

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v3 11/27] shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed

On Tue, Jun 22, 2021 at 10:47:11PM +1000, Alistair Popple wrote:
> On Tuesday, 22 June 2021 10:40:37 AM AEST Peter Xu wrote:
> > On Mon, Jun 21, 2021 at 06:41:17PM +1000, Alistair Popple wrote:
> > > On Friday, 28 May 2021 6:22:14 AM AEST Peter Xu wrote:
> > > > File-backed memory is prone to being unmapped at any time. It means all
> > > > information in the pte will be dropped, including the uffd-wp flag.
> > > >
> > > > Since the uffd-wp info cannot be stored in page cache or swap cache, persist
> > > > this wr-protect information by installing the special uffd-wp marker pte when
> > > > we're going to unmap a uffd wr-protected pte. When the pte is accessed again,
> > > > we will know it's previously wr-protected by recognizing the special pte.
> > > >
> > > > Meanwhile add a new flag ZAP_FLAG_DROP_FILE_UFFD_WP for when we don't want to
> > > > persist such information. For example, when destroying the whole vma, or
> > > > punching a hole in a shmem file. For the latter, we can only drop the uffd-wp
> > > > bit when holding the page lock. It means the unmap_mapping_range() in
> > > > shmem_fallocate() still requires zapping without ZAP_FLAG_DROP_FILE_UFFD_WP
> > > > because that's still racy with the page faults.
> > > >
> > > > Signed-off-by: Peter Xu <[email protected]>
> > > > ---
> > > > include/linux/mm.h | 11 ++++++++++
> > > > include/linux/mm_inline.h | 43 +++++++++++++++++++++++++++++++++++++++
> > > > mm/memory.c | 42 +++++++++++++++++++++++++++++++++++++-
> > > > mm/rmap.c | 8 ++++++++
> > > > mm/truncate.c | 8 +++++++-
> > > > 5 files changed, 110 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > index b1fb2826e29c..5989fc7ed00d 100644
> > > > --- a/include/linux/mm.h
> > > > +++ b/include/linux/mm.h
> > > > @@ -1725,6 +1725,8 @@ extern void user_shm_unlock(size_t, struct user_struct *);
> > > > #define ZAP_FLAG_CHECK_MAPPING BIT(0)
> > > > /* Whether to skip zapping swap entries */
> > > > #define ZAP_FLAG_SKIP_SWAP BIT(1)
> > > > +/* Whether to completely drop uffd-wp entries for file-backed memory */
> > > > +#define ZAP_FLAG_DROP_FILE_UFFD_WP BIT(2)
> > > >
> > > > /*
> > > > * Parameter block passed down to zap_pte_range in exceptional cases.
> > > > @@ -1757,6 +1759,15 @@ zap_skip_swap(struct zap_details *details)
> > > > return details->zap_flags & ZAP_FLAG_SKIP_SWAP;
> > > > }
> > > >
> > > > +static inline bool
> > > > +zap_drop_file_uffd_wp(struct zap_details *details)
> > > > +{
> > > > + if (!details)
> > > > + return false;
> > > > +
> > > > + return details->zap_flags & ZAP_FLAG_DROP_FILE_UFFD_WP;
> > > > +}
> > >
> > > Is this a good default having to explicitly specify that you don't want
> > > special pte's left in place?
> >
> > I made it explicitly the default so we won't accidentally drop that bit without
> > being aware of it; because missing of the uffd-wp bit anywhere can directly
> > cause data corruption in the userspace.
>
> Ok, I think that makes sense. I was just a little concerned about leaving
> special pte's behind everywhere by accident and whether there would be any
> unforeseen side effects from that. As you point out below though we do expect
> that to happen occasionally and to clean them up when found.

Right, that's a valid concern which I had too, but I found that it's
non-trivial to avoid those leftover uffd-wp bits. Since we need to take care
of them anyway, I just let it be like that, which doesn't look too bad so far.

One example is shmem file truncation, where we have an optimized path to drop
the mappings before taking the page lock - see shmem_fallocate(), which calls
unmap_mapping_range() (with no page lock held, so it's not safe to drop uffd-wp
there: a page fault could happen in parallel, and the page could be written
before being dropped, so data would potentially be lost) before calling
shmem_truncate_range() (which takes the page lock; it's the only safe place to
drop the uffd-wp bit). These are trivial-looking cases but very important too -
I once spent days debugging a data corruption caused by this, and only then did
I realize that leftover ptes are unavoidable in these corner cases.
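
Roughly, the ordering in question looks like this (simplified sketch, not the
exact shmem_fallocate() code):

	/* shmem hole punch, simplified: */

	/*
	 * Optimized unmap with no page lock held: a page fault can still race
	 * in and write the page before it's dropped, so the uffd-wp markers
	 * must be kept here (i.e. no ZAP_FLAG_DROP_FILE_UFFD_WP).
	 */
	unmap_mapping_range(inode->i_mapping, offset, len, 0);

	/*
	 * shmem_truncate_range() takes the page lock for each page; only
	 * there is it finally safe to drop the uffd-wp bit together with the
	 * page.
	 */
	shmem_truncate_range(inode, offset, offset + len - 1);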

>
> > > For example the OOM killer seems to call unmap_page_range() with details ==
> > > NULL (although in practice only for anonymous vmas so it won't actually cause
> > > an issue). Similarly in madvise for MADV_DONTNEED, although arguably I
> > > suppose that is the correct thing to do there?
> >
> > So I must confess I'm not familiar with the oom code; it looks to me like a
> > fast path to recycle pages that have a better chance of being reclaimed. Even
> > in exit_mmap() we'll do this first before unmap_vmas(). Then it still looks
> > like the right thing to do if it's only a fast path, not to mention that if we
> > only run it on anonymous memory then it's ignored anyway.
>
> Don't confuse my ability to grep with understanding of the OOM killer :-)
>
> I was just reviewing cases where we might leave behind unwanted special ptes.
> I don't think I really found any but wanted to ask about them anyway to learn
> more about the rules for them (which you have answered below, thanks!).

Yes, actually thanks for raising it too; I didn't really look closely at the
oom side before. It's good to double-check.

>
> > Basically I followed this rule: the bit should never be cleared unless (1) the
> > user manually clears it using UFFDIO_WRITEPROTECT, or (2) we're unmapping the
> > whole region.

(So obviously when I said "unmapping the whole region", it should include the
case where we truncate the pages; basically I'll let case (2) cover all the
cases where we're certain the page can be dropped, and so can the uffd-wp bit.)
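
(And for case (1), from userspace that's just the wr-protect ioctl with the WP
mode bit cleared - a minimal sketch, with the range being whatever the user
registered:)

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

/* Clear the uffd-wp bit on [start, start + len): a mode without
 * UFFDIO_WRITEPROTECT_MODE_WP means un-protect. */
static int uffd_unprotect(int uffd, unsigned long start, unsigned long len)
{
	struct uffdio_writeprotect wp = {
		.range = { .start = start, .len = len },
		.mode  = 0,
	};

	return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}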

> > There can be special cases, e.g. when unregistering a vma with VM_UFFD_WP, but
> > that's a rare case, and we also have code to take care of those lazily (e.g.,
> > we'll restore such a uffd-wp special pte back into a none pte if we get a
> > fault and the vma is not registered with uffd-wp at all, in do_swap_pte).
> > Otherwise I never clear the bit.
> >
> > >
> > > > struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > > > pte_t pte);
> > > > struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > > > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > > > index 355ea1ee32bd..c29a6ef3a642 100644
> > > > --- a/include/linux/mm_inline.h
> > > > +++ b/include/linux/mm_inline.h
> > > > @@ -4,6 +4,8 @@
> > > >
> > > > #include <linux/huge_mm.h>
> > > > #include <linux/swap.h>
> > > > +#include <linux/userfaultfd_k.h>
> > > > +#include <linux/swapops.h>
> > > >
> > > > /**
> > > > * page_is_file_lru - should the page be on a file LRU or anon LRU?
> > > > @@ -104,4 +106,45 @@ static __always_inline void del_page_from_lru_list(struct page *page,
> > > > update_lru_size(lruvec, page_lru(page), page_zonenum(page),
> > > > -thp_nr_pages(page));
> > > > }
> > > > +
> > > > +/*
> > > > + * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
> > > > + * replace a none pte. NOTE! This should only be called when *pte is already
> > > > + * cleared so we will never accidentally replace something valuable. Meanwhile
> > > > + * none pte also means we are not demoting the pte so if tlb flushed then we
> > > > + * don't need to do it again; otherwise if tlb flush is postponed then it's
> > > > + * even better.
> > > > + *
> > > > + * Must be called with pgtable lock held.
> > > > + */
> > > > +static inline void
> > > > +pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
> > > > + pte_t *pte, pte_t pteval)
> > > > +{
> > > > +#ifdef CONFIG_USERFAULTFD
> > > > + bool arm_uffd_pte = false;
> > > > +
> > > > + /* The current status of the pte should be "cleared" before calling */
> > > > + WARN_ON_ONCE(!pte_none(*pte));
> > > > +
> > > > + if (vma_is_anonymous(vma))
> > > > + return;
> > > > +
> > > > + /* A uffd-wp wr-protected normal pte */
> > > > + if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
> > > > + arm_uffd_pte = true;
> > > > +
> > > > + /*
> > > > + * A uffd-wp wr-protected swap pte. Note: this should even work for
> > > > + * pte_swp_uffd_wp_special() too.
> > > > + */
> > >
> > > I'm probably missing something but when can we actually have this case and why
> > > would we want to leave a special pte behind? From what I can tell this is
> > > called from try_to_unmap_one() where this won't be true or from zap_pte_range()
> > > when not skipping swap pages.
> >
> > Yes this is a good question..
> >
> > Initially I made this function make sure I cover all forms of uffd-wp bit, that
> > contains both swap and present ptes; imho that's pretty safe. However for
> > !anonymous cases we don't keep swap entry inside pte even if swapped out, as
> > they should reside in shmem page cache indeed. The only missing piece seems to
> > be the device private entries as you also spotted below.
>
> Yes, I think it's *probably* safe although I don't yet have a strong opinion
> here ...
>
> > > > + if (unlikely(is_swap_pte(pteval) && pte_swp_uffd_wp(pteval)))
>
> ... however if this can never happen would a WARN_ON() be better? It would also
> mean you could remove arm_uffd_pte.

Hmm, on second thought I think we can't make it a WARN_ON_ONCE()... this can
still be useful for a private mapping of a shmem file: in that case we'll have
the swap entry stored in the pte, not the page cache, so after page reclaim the
pte will contain a valid swap entry, while the vma is still "!anonymous".
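
(Something like the following userspace sketch gives such a !anonymous vma
whose ptes can still end up holding real swap entries after reclaim:)

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/* Private mapping of a shmem file: the vma is !anonymous, but a page CoW'ed
 * after a write is anon and can be swapped out, leaving a (possibly
 * uffd-wp'ed) swap entry in the pte. */
static char *map_private_shmem(size_t len)
{
	int fd = memfd_create("shm", 0);	/* shmem-backed file */
	char *p;

	ftruncate(fd, len);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	p[0] = 1;	/* write triggers CoW; reclaim can later swap it out */
	return p;
}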

>
> > > > + arm_uffd_pte = true;
> > > > +
> > > > + if (unlikely(arm_uffd_pte))
> > > > + set_pte_at(vma->vm_mm, addr, pte,
> > > > + pte_swp_mkuffd_wp_special(vma));
> > > > +#endif
> > > > +}
> > > > +
> > > > #endif
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index 319552efc782..3453b8ae5f4f 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -73,6 +73,7 @@
> > > > #include <linux/perf_event.h>
> > > > #include <linux/ptrace.h>
> > > > #include <linux/vmalloc.h>
> > > > +#include <linux/mm_inline.h>
> > > >
> > > > #include <trace/events/kmem.h>
> > > >
> > > > @@ -1298,6 +1299,21 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > > > return ret;
> > > > }
> > > >
> > > > +/*
> > > > + * This function makes sure that we'll replace the none pte with an uffd-wp
> > > > + * swap special pte marker when necessary. Must be with the pgtable lock held.
> > > > + */
> > > > +static inline void
> > > > +zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
> > > > + unsigned long addr, pte_t *pte,
> > > > + struct zap_details *details, pte_t pteval)
> > > > +{
> > > > + if (zap_drop_file_uffd_wp(details))
> > > > + return;
> > > > +
> > > > + pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
> > > > +}
> > > > +
> > > > static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > struct vm_area_struct *vma, pmd_t *pmd,
> > > > unsigned long addr, unsigned long end,
> > > > @@ -1335,6 +1351,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > ptent = ptep_get_and_clear_full(mm, addr, pte,
> > > > tlb->fullmm);
> > > > tlb_remove_tlb_entry(tlb, pte, addr);
> > > > + zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> > > > + ptent);
> > > > if (unlikely(!page))
> > > > continue;
> > > >
> > > > @@ -1359,6 +1377,22 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > continue;
> > > > }
> > > >
> > > > + /*
> > > > + * If this is a special uffd-wp marker pte... Drop it only if
> > > > + * enforced to do so.
> > > > + */
> > > > + if (unlikely(is_swap_special_pte(ptent))) {
> > > > + WARN_ON_ONCE(!pte_swp_uffd_wp_special(ptent));
> > >
> > > Why the WARN_ON and not just test pte_swp_uffd_wp_special() directly?
> > >
> > > > + /*
> > > > + * If this is a common unmap of ptes, keep this as is.
> > > > + * Drop it only if this is a whole-vma destruction.
> > > > + */
> > > > + if (zap_drop_file_uffd_wp(details))
> > > > + ptep_get_and_clear_full(mm, addr, pte,
> > > > + tlb->fullmm);
> > > > + continue;
> > > > + }
> > > > +
> > > > entry = pte_to_swp_entry(ptent);
> > > > if (is_device_private_entry(entry) ||
> > > > is_device_exclusive_entry(entry)) {
> > > > @@ -1373,6 +1407,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > page_remove_rmap(page, false);
> > > >
> > > > put_page(page);
> > > > + zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> > > > + ptent);
> > >
> > > Device entries only support anonymous vmas at present so should we drop this?
> > > I guess I'm also a little confused by this because I'm not sure in what
> > > scenarios you would want to zap swap entries but leave special swap ptes behind
> > > (see also my earlier question above as well).
> >
> > If that's the case, maybe indeed this is not needed, and I can use a
> > WARN_ON_ONCE here instead, just in case the facts change. E.g., would it be
> > possible one day to have !anonymous support for device private entries?
> > Frankly I have no solid idea of how device private memory is used, so some more
> > context would be nice too; since I think you know this much better than me,
> > maybe it's a good chance to learn more about it. :)
>
> Yes, a WARN_ON_ONCE() would be good if you remove it. We are planning to add
> support for !anonymous device private entries at some point.
>
> There's nothing too special about device private entries. They exist to store
> some state and look up a device driver callback that gets called when the CPU
> tries to access the page. For example see how do_swap_page() handles them:
>
> 	} else if (is_device_private_entry(entry)) {
> 		vmf->page = pfn_swap_entry_to_page(entry);
> 		ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
>
> Normally a device driver provides the implementation of migrate_to_ram() which
> will copy the page back to CPU addressable memory and restore the PTE to a
> normal functioning PTE using the migrate_vma_*() interfaces. Typically this is
> used to allow migration of a page to memory that is not directly CPU addressable
> (eg. GPU memory). Hopefully that goes some way to explaining what they are, but
> if you have more questions let me know!

Thanks for offering these details! So one thing I'm still uncertain about is
what exact type of memory is allowed to be mapped to device private. E.g.,
would "anonymous shared" be allowed as "anonymous"? I saw there seems to be one
ioctl defined that's used to bind these things:

DRM_IOCTL_DEF_DRV(NOUVEAU_SVM_BIND, nouveau_svmm_bind, DRM_RENDER_ALLOW),

Then nouveau_dmem_migrate_chunk() will initiate the device private entries, am
I right? Then to ask my previous question in another form: if the vaddr range
is coming from a userspace extension driver, would it be allowed to pass in
some vaddr range mapped with MAP_ANONYMOUS|MAP_SHARED?

>
> As far as I know there should already be support for userfaultfd-wp on device
> private pages, and given they can only currently exist in anon vmas I think we
> should be safe to not install a special pte when unmapping. On the other hand
> I suppose it doesn't matter if we do install one, right?

For this series, I wanted to make sure that even if there are unexpected
leftover uffd-wp special ptes we'll take care of them too. But let's see how
you answer the above question first.

>
> > >
> > > > continue;
> > > > }
> > > >
> > > > @@ -1390,6 +1426,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > if (unlikely(!free_swap_and_cache(entry)))
> > > > print_bad_pte(vma, addr, ptent, NULL);
> > > > pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> > > > + zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
> > > > } while (pte++, addr += PAGE_SIZE, addr != end);
> > > >
> > > > add_mm_rss_vec(mm, rss);
> > > > @@ -1589,12 +1626,15 @@ void unmap_vmas(struct mmu_gather *tlb,
> > > > unsigned long end_addr)
> > > > {
> > > > struct mmu_notifier_range range;
> > > > + struct zap_details details = {
> > > > + .zap_flags = ZAP_FLAG_DROP_FILE_UFFD_WP,
> > > > + };
> > > >
> > > > mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
> > > > start_addr, end_addr);
> > > > mmu_notifier_invalidate_range_start(&range);
> > > > for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> > > > - unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
> > > > + unmap_single_vma(tlb, vma, start_addr, end_addr, &details);
> > > > mmu_notifier_invalidate_range_end(&range);
> > > > }
> > > >
> > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > index 0419c9a1a280..a94d9aed9d95 100644
> > > > --- a/mm/rmap.c
> > > > +++ b/mm/rmap.c
> > > > @@ -72,6 +72,7 @@
> > > > #include <linux/page_idle.h>
> > > > #include <linux/memremap.h>
> > > > #include <linux/userfaultfd_k.h>
> > > > +#include <linux/mm_inline.h>
> > > >
> > > > #include <asm/tlbflush.h>
> > > >
> > > > @@ -1509,6 +1510,13 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > pteval = ptep_clear_flush(vma, address, pvmw.pte);
> > > > }
> > > >
> > > > + /*
> > > > + * Now the pte is cleared. If this is uffd-wp armed pte, we
> > > > + * may want to replace a none pte with a marker pte if it's
> > > > + * file-backed, so we don't lose the tracking information.
> > > > + */
> > > > + pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
> > >
> > > From what I can tell we don't need to do this in try_to_migrate_one() (assuming
> > > that goes in) as well because the existing uffd wp code already deals with
> > > copying the pte bits over to the migration entries. Is that correct?
> >
> > I agree try_to_migrate_one() shouldn't need it. But I'm not sure about
> > try_to_unmap_one(), as e.g. I think we should rely on this to make shmem work
> > when the page gets swapped out.
>
> Oh for sure I agree you need it in try_to_unmap_one(); my code didn't change
> the unmap path. It just split the migration cases (i.e. replacing mappings with
> migration entries instead of unmapping) into a different function, so I just
> wanted to make sure we didn't need it in try_to_migrate_one() (and I think we
> agree it isn't needed there).

Ah so I misunderstood - yes I think we're on the same page then!

Thanks,

--
Peter Xu

2021-06-23 06:07:27

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v3 11/27] shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed

On Wednesday, 23 June 2021 1:44:21 AM AEST Peter Xu wrote:
> On Tue, Jun 22, 2021 at 10:47:11PM +1000, Alistair Popple wrote:
> > On Tuesday, 22 June 2021 10:40:37 AM AEST Peter Xu wrote:
> > > On Mon, Jun 21, 2021 at 06:41:17PM +1000, Alistair Popple wrote:
> > > > On Friday, 28 May 2021 6:22:14 AM AEST Peter Xu wrote:
> > > > > File-backed memory is prone to being unmapped at any time. It means all
> > > > > information in the pte will be dropped, including the uffd-wp flag.
> > > > >
> > > > > Since the uffd-wp info cannot be stored in page cache or swap cache, persist
> > > > > this wr-protect information by installing the special uffd-wp marker pte when
> > > > > we're going to unmap a uffd wr-protected pte. When the pte is accessed again,
> > > > > we will know it's previously wr-protected by recognizing the special pte.
> > > > >
> > > > > Meanwhile add a new flag ZAP_FLAG_DROP_FILE_UFFD_WP for when we don't want to
> > > > > persist such information. For example, when destroying the whole vma, or
> > > > > punching a hole in a shmem file. For the latter, we can only drop the uffd-wp
> > > > > bit when holding the page lock. It means the unmap_mapping_range() in
> > > > > shmem_fallocate() still requires zapping without ZAP_FLAG_DROP_FILE_UFFD_WP
> > > > > because that's still racy with the page faults.
> > > > >
> > > > > Signed-off-by: Peter Xu <[email protected]>
> > > > > ---
> > > > > include/linux/mm.h | 11 ++++++++++
> > > > > include/linux/mm_inline.h | 43 +++++++++++++++++++++++++++++++++++++++
> > > > > mm/memory.c | 42 +++++++++++++++++++++++++++++++++++++-
> > > > > mm/rmap.c | 8 ++++++++
> > > > > mm/truncate.c | 8 +++++++-
> > > > > 5 files changed, 110 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > > index b1fb2826e29c..5989fc7ed00d 100644
> > > > > --- a/include/linux/mm.h
> > > > > +++ b/include/linux/mm.h
> > > > > @@ -1725,6 +1725,8 @@ extern void user_shm_unlock(size_t, struct user_struct *);
> > > > > #define ZAP_FLAG_CHECK_MAPPING BIT(0)
> > > > > /* Whether to skip zapping swap entries */
> > > > > #define ZAP_FLAG_SKIP_SWAP BIT(1)
> > > > > +/* Whether to completely drop uffd-wp entries for file-backed memory */
> > > > > +#define ZAP_FLAG_DROP_FILE_UFFD_WP BIT(2)
> > > > >
> > > > > /*
> > > > > * Parameter block passed down to zap_pte_range in exceptional cases.
> > > > > @@ -1757,6 +1759,15 @@ zap_skip_swap(struct zap_details *details)
> > > > > return details->zap_flags & ZAP_FLAG_SKIP_SWAP;
> > > > > }
> > > > >
> > > > > +static inline bool
> > > > > +zap_drop_file_uffd_wp(struct zap_details *details)
> > > > > +{
> > > > > + if (!details)
> > > > > + return false;
> > > > > +
> > > > > + return details->zap_flags & ZAP_FLAG_DROP_FILE_UFFD_WP;
> > > > > +}
> > > >
> > > > Is this a good default having to explicitly specify that you don't want
> > > > special pte's left in place?
> > >
> > > I made it explicitly the default so we won't accidentally drop that bit without
> > > being aware of it; because missing of the uffd-wp bit anywhere can directly
> > > cause data corruption in the userspace.
> >
> > Ok, I think that makes sense. I was just a little concerned about leaving
> > special pte's behind everywhere by accident and whether there would be any
> > unforeseen side effects from that. As you point out below though we do expect
> > that to happen occasionally and to clean them up when found.
>
> Right, that's a valid concern which I had too, but I found that it's
> non-trivial to avoid those leftover uffd-wp bits. Since we need to take care
> of them anyway, I just let it be like that, which doesn't look too bad so far.
>
> One example is shmem file truncation, where we have an optimized path to drop
> the mappings before taking the page lock - see shmem_fallocate(), which calls
> unmap_mapping_range() (with no page lock held, so it's not safe to drop uffd-wp
> there: a page fault could happen in parallel, and the page could be written
> before being dropped, so data would potentially be lost) before calling
> shmem_truncate_range() (which takes the page lock; it's the only safe place to
> drop the uffd-wp bit). These are trivial-looking cases but very important too -
> I once spent days debugging a data corruption caused by this, and only then did
> I realize that leftover ptes are unavoidable in these corner cases.
>
> >
> > > > For example the OOM killer seems to call unmap_page_range() with details ==
> > > > NULL (although in practice only for anonymous vmas so it won't actually cause
> > > > an issue). Similarly in madvise for MADV_DONTNEED, although arguably I
> > > > suppose that is the correct thing to do there?
> > >
> > > So I must confess I'm not familiar with the oom code; it looks to me like a
> > > fast path to recycle pages that have a better chance of being reclaimed. Even
> > > in exit_mmap() we'll do this first before unmap_vmas(). Then it still looks
> > > like the right thing to do if it's only a fast path, not to mention that if we
> > > only run it on anonymous memory then it's ignored anyway.
> >
> > Don't confuse my ability to grep with understanding of the OOM killer :-)
> >
> > I was just reviewing cases where we might leave behind unwanted special ptes.
> > I don't think I really found any but wanted to ask about them anyway to learn
> > more about the rules for them (which you have answered below, thanks!).
>
> Yes, actually thanks for raising it too; I didn't really look closely on the
> oom side before. It's good to double-check.
>
> >
> > > Basically I followed this rule: the bit should never be cleared unless (1) the
> > > user manually clears it using UFFDIO_WRITEPROTECT, or (2) we're unmapping the
> > > whole region.
>
> (So obviously when I said "unmapping the whole region", it should include the
> case where we truncate the pages; basically I'll let case (2) cover all the
> cases where we're certain the page can be dropped, and so can the uffd-wp bit.)
>
> > > There can be special cases, e.g. when unregistering a vma with VM_UFFD_WP, but
> > > that's a rare case, and we also have code to take care of those lazily (e.g.,
> > > we'll restore such a uffd-wp special pte back into a none pte if we get a
> > > fault and the vma is not registered with uffd-wp at all, in do_swap_pte).
> > > Otherwise I never clear the bit.
> > >
> > > >
> > > > > struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > > > > pte_t pte);
> > > > > struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > > > > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > > > > index 355ea1ee32bd..c29a6ef3a642 100644
> > > > > --- a/include/linux/mm_inline.h
> > > > > +++ b/include/linux/mm_inline.h
> > > > > @@ -4,6 +4,8 @@
> > > > >
> > > > > #include <linux/huge_mm.h>
> > > > > #include <linux/swap.h>
> > > > > +#include <linux/userfaultfd_k.h>
> > > > > +#include <linux/swapops.h>
> > > > >
> > > > > /**
> > > > > * page_is_file_lru - should the page be on a file LRU or anon LRU?
> > > > > @@ -104,4 +106,45 @@ static __always_inline void del_page_from_lru_list(struct page *page,
> > > > > update_lru_size(lruvec, page_lru(page), page_zonenum(page),
> > > > > -thp_nr_pages(page));
> > > > > }
> > > > > +
> > > > > +/*
> > > > > + * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
> > > > > + * replace a none pte. NOTE! This should only be called when *pte is already
> > > > > + * cleared so we will never accidentally replace something valuable. Meanwhile
> > > > > + * none pte also means we are not demoting the pte so if tlb flushed then we
> > > > > + * don't need to do it again; otherwise if tlb flush is postponed then it's
> > > > > + * even better.
> > > > > + *
> > > > > + * Must be called with pgtable lock held.
> > > > > + */
> > > > > +static inline void
> > > > > +pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
> > > > > + pte_t *pte, pte_t pteval)
> > > > > +{
> > > > > +#ifdef CONFIG_USERFAULTFD
> > > > > + bool arm_uffd_pte = false;
> > > > > +
> > > > > + /* The current status of the pte should be "cleared" before calling */
> > > > > + WARN_ON_ONCE(!pte_none(*pte));
> > > > > +
> > > > > + if (vma_is_anonymous(vma))
> > > > > + return;
> > > > > +
> > > > > + /* A uffd-wp wr-protected normal pte */
> > > > > + if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
> > > > > + arm_uffd_pte = true;
> > > > > +
> > > > > + /*
> > > > > + * A uffd-wp wr-protected swap pte. Note: this should even work for
> > > > > + * pte_swp_uffd_wp_special() too.
> > > > > + */
> > > >
> > > > I'm probably missing something but when can we actually have this case and why
> > > > would we want to leave a special pte behind? From what I can tell this is
> > > > called from try_to_unmap_one() where this won't be true or from zap_pte_range()
> > > > when not skipping swap pages.
> > >
> > > Yes this is a good question..
> > >
> > > Initially I made this function make sure I cover all forms of uffd-wp bit, that
> > > contains both swap and present ptes; imho that's pretty safe. However for
> > > !anonymous cases we don't keep swap entry inside pte even if swapped out, as
> > > they should reside in shmem page cache indeed. The only missing piece seems to
> > > be the device private entries as you also spotted below.
> >
> > Yes, I think it's *probably* safe although I don't yet have a strong opinion
> > here ...
> >
> > > > > + if (unlikely(is_swap_pte(pteval) && pte_swp_uffd_wp(pteval)))
> >
> > ... however if this can never happen would a WARN_ON() be better? It would also
> > mean you could remove arm_uffd_pte.
>
> Hmm, on second thought I think we can't make it a WARN_ON_ONCE()... this can
> still be useful for a private mapping of a shmem file: in that case we'll have
> the swap entry stored in the pte, not the page cache, so after page reclaim the
> pte will contain a valid swap entry, while the vma is still "!anonymous".

There's something (probably obvious) I must still be missing here. During
reclaim won't a private shmem mapping still have a present pteval here?
Therefore it won't trigger this case - the uffd-wp bit is set when the swap
entry is established further down in try_to_unmap_one(), right?

> >
> > > > > + arm_uffd_pte = true;
> > > > > +
> > > > > + if (unlikely(arm_uffd_pte))
> > > > > + set_pte_at(vma->vm_mm, addr, pte,
> > > > > + pte_swp_mkuffd_wp_special(vma));
> > > > > +#endif
> > > > > +}
> > > > > +
> > > > > #endif
> > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > index 319552efc782..3453b8ae5f4f 100644
> > > > > --- a/mm/memory.c
> > > > > +++ b/mm/memory.c
> > > > > @@ -73,6 +73,7 @@
> > > > > #include <linux/perf_event.h>
> > > > > #include <linux/ptrace.h>
> > > > > #include <linux/vmalloc.h>
> > > > > +#include <linux/mm_inline.h>
> > > > >
> > > > > #include <trace/events/kmem.h>
> > > > >
> > > > > @@ -1298,6 +1299,21 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > > > > return ret;
> > > > > }
> > > > >
> > > > > +/*
> > > > > + * This function makes sure that we'll replace the none pte with an uffd-wp
> > > > > + * swap special pte marker when necessary. Must be with the pgtable lock held.
> > > > > + */
> > > > > +static inline void
> > > > > +zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
> > > > > + unsigned long addr, pte_t *pte,
> > > > > + struct zap_details *details, pte_t pteval)
> > > > > +{
> > > > > + if (zap_drop_file_uffd_wp(details))
> > > > > + return;
> > > > > +
> > > > > + pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
> > > > > +}
> > > > > +
> > > > > static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > > struct vm_area_struct *vma, pmd_t *pmd,
> > > > > unsigned long addr, unsigned long end,
> > > > > @@ -1335,6 +1351,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > > ptent = ptep_get_and_clear_full(mm, addr, pte,
> > > > > tlb->fullmm);
> > > > > tlb_remove_tlb_entry(tlb, pte, addr);
> > > > > + zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> > > > > + ptent);
> > > > > if (unlikely(!page))
> > > > > continue;
> > > > >
> > > > > @@ -1359,6 +1377,22 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > > continue;
> > > > > }
> > > > >
> > > > > + /*
> > > > > + * If this is a special uffd-wp marker pte... Drop it only if
> > > > > + * enforced to do so.
> > > > > + */
> > > > > + if (unlikely(is_swap_special_pte(ptent))) {
> > > > > + WARN_ON_ONCE(!pte_swp_uffd_wp_special(ptent));
> > > >
> > > > Why the WARN_ON and not just test pte_swp_uffd_wp_special() directly?
> > > >
> > > > > + /*
> > > > > + * If this is a common unmap of ptes, keep this as is.
> > > > > + * Drop it only if this is a whole-vma destruction.
> > > > > + */
> > > > > + if (zap_drop_file_uffd_wp(details))
> > > > > + ptep_get_and_clear_full(mm, addr, pte,
> > > > > + tlb->fullmm);
> > > > > + continue;
> > > > > + }
> > > > > +
> > > > > entry = pte_to_swp_entry(ptent);
> > > > > if (is_device_private_entry(entry) ||
> > > > > is_device_exclusive_entry(entry)) {
> > > > > @@ -1373,6 +1407,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > > page_remove_rmap(page, false);
> > > > >
> > > > > put_page(page);
> > > > > + zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> > > > > + ptent);
> > > >
> > > > Device entries only support anonymous vmas at present so should we drop this?
> > > > I guess I'm also a little confused by this because I'm not sure in what
> > > > scenarios you would want to zap swap entries but leave special swap ptes behind
> > > > (see also my earlier question above as well).
> > >
> > > If that's the case, maybe indeed this is not needed, and I can use a
> > > WARN_ON_ONCE here instead, just in case the facts change. E.g., would it be
> > > possible one day to have !anonymous support for device private entries?
> > > Frankly I have no solid idea of how device private memory is used, so some more
> > > context would be nice too; since I think you know this much better than me,
> > > maybe it's a good chance to learn more about it. :)
> >
> > Yes, a WARN_ON_ONCE() would be good if you remove it. We are planning to add
> > support for !anonymous device private entries at some point.
> >
> > There's nothing too special about device private entries. They exist to store
> > some state and look up a device driver callback that gets called when the CPU
> > tries to access the page. For example see how do_swap_page() handles them:
> >
> > 	} else if (is_device_private_entry(entry)) {
> > 		vmf->page = pfn_swap_entry_to_page(entry);
> > 		ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> >
> > Normally a device driver provides the implementation of migrate_to_ram() which
> > will copy the page back to CPU addressable memory and restore the PTE to a
> > normal functioning PTE using the migrate_vma_*() interfaces. Typically this is
> > used to allow migration of a page to memory that is not directly CPU addressable
> > (eg. GPU memory). Hopefully that goes some way to explaining what they are, but
> > if you have more questions let me know!
>
> Thanks for offering these details! So one thing I'm still uncertain about is
> what exact type of memory is allowed to be mapped to device private. E.g.,
> would "anonymous shared" be allowed as "anonymous"? I saw there seems to be one
> ioctl defined that's used to bind these things:
>
> DRM_IOCTL_DEF_DRV(NOUVEAU_SVM_BIND, nouveau_svmm_bind, DRM_RENDER_ALLOW),
>
> Then nouveau_dmem_migrate_chunk() will initiate the device private entries, am
> I right? Then to ask my previous question in another form: if the vaddr range
> is coming from a userspace extension driver, would it be allowed to pass in
> some vaddr range mapped with MAP_ANONYMOUS|MAP_SHARED?

I should have been more specific - migration to device private pages currently
only supports non-file/shmem-backed pages. In other words the migrate_vma_*()
calls will fail for MAP_ANONYMOUS | MAP_SHARED when the target page is a device
private page.

For a present page this is enforced in migrate_vma_pages() when trying to
migrate to a device private page:

	mapping = page_mapping(page);

	if (is_zone_device_page(newpage)) {
		if (is_device_private_page(newpage)) {
			/*
			 * For now only support private anonymous when
			 * migrating to un-addressable device memory.
			 */
			if (mapping) {
				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
				continue;
			}
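
(So a range set up like the following sketch is shmem-backed, page_mapping()
is non-NULL for its pages, and migration to device private memory gets
skipped:)

	/* MAP_ANONYMOUS | MAP_SHARED is backed by shmem, so page_mapping()
	 * returns the shmem mapping rather than NULL. */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_ANONYMOUS | MAP_SHARED, -1, 0);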


> >
> > As far as I know there should already be support for userfaultfd-wp on device
> > private pages, and given they can only currently exist in anon vmas I think we
> > should be safe to not install a special pte when unmapping. On the other hand
> > I suppose it doesn't matter if we do install one, right?
>
> For this series, I wanted to make sure that even if there are unexpected
> leftover uffd-wp special ptes we'll take care of them too. But let's see how
> you answer the above question first.
>
> >
> > > >
> > > > > continue;
> > > > > }
> > > > >
> > > > > @@ -1390,6 +1426,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > > if (unlikely(!free_swap_and_cache(entry)))
> > > > > print_bad_pte(vma, addr, ptent, NULL);
> > > > > pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> > > > > + zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
> > > > > } while (pte++, addr += PAGE_SIZE, addr != end);
> > > > >
> > > > > add_mm_rss_vec(mm, rss);
> > > > > @@ -1589,12 +1626,15 @@ void unmap_vmas(struct mmu_gather *tlb,
> > > > > unsigned long end_addr)
> > > > > {
> > > > > struct mmu_notifier_range range;
> > > > > + struct zap_details details = {
> > > > > + .zap_flags = ZAP_FLAG_DROP_FILE_UFFD_WP,
> > > > > + };
> > > > >
> > > > > mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
> > > > > start_addr, end_addr);
> > > > > mmu_notifier_invalidate_range_start(&range);
> > > > > for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
> > > > > - unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
> > > > > + unmap_single_vma(tlb, vma, start_addr, end_addr, &details);
> > > > > mmu_notifier_invalidate_range_end(&range);
> > > > > }
> > > > >
> > > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > > index 0419c9a1a280..a94d9aed9d95 100644
> > > > > --- a/mm/rmap.c
> > > > > +++ b/mm/rmap.c
> > > > > @@ -72,6 +72,7 @@
> > > > > #include <linux/page_idle.h>
> > > > > #include <linux/memremap.h>
> > > > > #include <linux/userfaultfd_k.h>
> > > > > +#include <linux/mm_inline.h>
> > > > >
> > > > > #include <asm/tlbflush.h>
> > > > >
> > > > > @@ -1509,6 +1510,13 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > > pteval = ptep_clear_flush(vma, address, pvmw.pte);
> > > > > }
> > > > >
> > > > > + /*
> > > > > + * Now the pte is cleared. If this is uffd-wp armed pte, we
> > > > > + * may want to replace a none pte with a marker pte if it's
> > > > > + * file-backed, so we don't lose the tracking information.
> > > > > + */
> > > > > + pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
> > > >
> > > > From what I can tell we don't need to do this in try_to_migrate_one() (assuming
> > > > that goes in) as well because the existing uffd wp code already deals with
> > > > copying the pte bits over to the migration entries. Is that correct?
> > >
> > > I agree try_to_migrate_one() shouldn't need it. But I'm not sure about
> > > try_to_unmap_one(), as e.g. I think we should rely on this to make shmem work
> > > when the page gets swapped out.
> >
> > Oh for sure I agree you need it in try_to_unmap_one(); my code didn't change
> > the unmap path. It just split the migration cases (i.e. replacing mappings with
> > migration entries instead of unmapping) into a different function, so I just
> > wanted to make sure we didn't need it in try_to_migrate_one() (and I think we
> > agree it isn't needed there).
>
> Ah so I misunderstood - yes I think we're on the same page then!
>
> Thanks,
>
>




2021-06-23 15:33:22

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v3 11/27] shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed

On Wed, Jun 23, 2021 at 04:04:03PM +1000, Alistair Popple wrote:
> On Wednesday, 23 June 2021 1:44:21 AM AEST Peter Xu wrote:
> > On Tue, Jun 22, 2021 at 10:47:11PM +1000, Alistair Popple wrote:
> > > On Tuesday, 22 June 2021 10:40:37 AM AEST Peter Xu wrote:
> > > > On Mon, Jun 21, 2021 at 06:41:17PM +1000, Alistair Popple wrote:
> > > > > On Friday, 28 May 2021 6:22:14 AM AEST Peter Xu wrote:
> > > > > > File-backed memory is prone to being unmapped at any time. It means all
> > > > > > information in the pte will be dropped, including the uffd-wp flag.
> > > > > >
> > > > > > Since the uffd-wp info cannot be stored in page cache or swap cache, persist
> > > > > > this wr-protect information by installing the special uffd-wp marker pte when
> > > > > > we're going to unmap a uffd wr-protected pte. When the pte is accessed again,
> > > > > > we will know it's previously wr-protected by recognizing the special pte.
> > > > > >
> > > > > > Meanwhile add a new flag ZAP_FLAG_DROP_FILE_UFFD_WP for when we don't want to
> > > > > > persist such information. For example, when destroying the whole vma, or
> > > > > > punching a hole in a shmem file. For the latter, we can only drop the uffd-wp
> > > > > > bit when holding the page lock. It means the unmap_mapping_range() in
> > > > > > shmem_fallocate() still requires zapping without ZAP_FLAG_DROP_FILE_UFFD_WP
> > > > > > because that's still racy with the page faults.
> > > > > >
> > > > > > Signed-off-by: Peter Xu <[email protected]>
> > > > > > ---
> > > > > > include/linux/mm.h | 11 ++++++++++
> > > > > > include/linux/mm_inline.h | 43 +++++++++++++++++++++++++++++++++++++++
> > > > > > mm/memory.c | 42 +++++++++++++++++++++++++++++++++++++-
> > > > > > mm/rmap.c | 8 ++++++++
> > > > > > mm/truncate.c | 8 +++++++-
> > > > > > 5 files changed, 110 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > > > index b1fb2826e29c..5989fc7ed00d 100644
> > > > > > --- a/include/linux/mm.h
> > > > > > +++ b/include/linux/mm.h
> > > > > > @@ -1725,6 +1725,8 @@ extern void user_shm_unlock(size_t, struct user_struct *);
> > > > > > #define ZAP_FLAG_CHECK_MAPPING BIT(0)
> > > > > > /* Whether to skip zapping swap entries */
> > > > > > #define ZAP_FLAG_SKIP_SWAP BIT(1)
> > > > > > +/* Whether to completely drop uffd-wp entries for file-backed memory */
> > > > > > +#define ZAP_FLAG_DROP_FILE_UFFD_WP BIT(2)
> > > > > >
> > > > > > /*
> > > > > > * Parameter block passed down to zap_pte_range in exceptional cases.
> > > > > > @@ -1757,6 +1759,15 @@ zap_skip_swap(struct zap_details *details)
> > > > > > return details->zap_flags & ZAP_FLAG_SKIP_SWAP;
> > > > > > }
> > > > > >
> > > > > > +static inline bool
> > > > > > +zap_drop_file_uffd_wp(struct zap_details *details)
> > > > > > +{
> > > > > > + if (!details)
> > > > > > + return false;
> > > > > > +
> > > > > > + return details->zap_flags & ZAP_FLAG_DROP_FILE_UFFD_WP;
> > > > > > +}
> > > > >
> > > > > Is this a good default, having to explicitly specify that you don't want
> > > > > special ptes left in place?
> > > >
> > > > I made it explicitly the default so we won't accidentally drop that bit
> > > > without being aware of it, because losing the uffd-wp bit anywhere can
> > > > directly cause data corruption in userspace.
> > >
> > > Ok, I think that makes sense. I was just a little concerned about leaving
> > > special ptes behind everywhere by accident, and whether there would be any
> > > unforeseen side effects from that. As you point out below though we do expect
> > > that to happen occasionally and to clean them up when found.
> >
> > Right, that's a valid concern which I had too, but I found that it's
> > non-trivial to avoid those leftover uffd-wp bits. Since we need to take care
> > of them anyway, I just let it be like that, which doesn't look that bad so far.
> >
> > One example is shmem file truncation, where we have an optimized path to drop
> > the mappings before taking the page lock - see shmem_fallocate(), where we call
> > unmap_mapping_range() (with no page lock, so it's not safe to drop uffd-wp
> > there, as a page fault could happen in parallel! that would let the page be
> > written right before being dropped, so data could be lost), before calling
> > shmem_truncate_range() (which takes the page lock; that's the only safe place
> > to drop the uffd-wp bit). These look like trivial cases but they're very
> > important too - I once spent days debugging a data corruption caused by this,
> > and only then did I realize it's unavoidable to have those leftover ptes in
> > these corner cases.
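
To make that two-stage flow easier to see, here's a tiny userspace model (not
kernel code - the flag bits mirror this patch, but which flags each truncation
stage actually passes is only an assumption here, the rest is purely
illustrative) of the decision zap_drop_file_uffd_wp() makes:

#include <stdbool.h>
#include <stdio.h>

/* Simplified model of the zap_details flags introduced in this patch */
#define ZAP_FLAG_CHECK_MAPPING		(1u << 0)
#define ZAP_FLAG_SKIP_SWAP		(1u << 1)
#define ZAP_FLAG_DROP_FILE_UFFD_WP	(1u << 2)

struct zap_details {
	unsigned int zap_flags;
};

/* Mirrors zap_drop_file_uffd_wp(): NULL details means "keep the markers" */
static bool zap_drop_file_uffd_wp(const struct zap_details *details)
{
	if (!details)
		return false;
	return details->zap_flags & ZAP_FLAG_DROP_FILE_UFFD_WP;
}

int main(void)
{
	/*
	 * Stage 1: the optimized unmap_mapping_range() in shmem_fallocate()
	 * runs without the page lock, so a fault can race with it; it must
	 * keep the uffd-wp markers.
	 */
	struct zap_details racy_unmap = {
		.zap_flags = ZAP_FLAG_CHECK_MAPPING,
	};

	/*
	 * Stage 2: shmem_truncate_range() holds the page lock, the only
	 * safe point to finally drop the markers.
	 */
	struct zap_details locked_truncate = {
		.zap_flags = ZAP_FLAG_CHECK_MAPPING | ZAP_FLAG_DROP_FILE_UFFD_WP,
	};

	printf("racy unmap drops markers:      %d\n",
	       zap_drop_file_uffd_wp(&racy_unmap));		/* 0: keep */
	printf("locked truncate drops markers: %d\n",
	       zap_drop_file_uffd_wp(&locked_truncate));	/* 1: drop */
	return 0;
}
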
> >
> > >
> > > > > For example the OOM killer seems to call unmap_page_range() with details ==
> > > > > NULL (although in practice only for anonymous vmas, so it won't actually cause
> > > > > an issue). Similarly in madvise for MADV_DONTNEED, although arguably I
> > > > > suppose that is the correct thing to do there?
> > > >
> > > > So I must confess I'm not familiar with the oom code; it looks to me like a
> > > > fast path to recycle pages that have a better chance of being reclaimed. Even
> > > > in exit_mmap() we'll do this first, before unmap_vmas(). Then it still looks
> > > > like the right thing to do if it's only a fast path, not to mention that if it
> > > > only runs on anonymous vmas then it's ignored anyway.
> > >
> > > Don't confuse my ability to grep with understanding of the OOM killer :-)
> > >
> > > I was just reviewing cases where we might leave behind unwanted special ptes.
> > > I don't think I really found any but wanted to ask about them anyway to learn
> > > more about the rules for them (which you have answered below, thanks!).
> >
> > Yes, actually thanks for raising it too; I didn't really look closely at the
> > oom side before. It's good to double-check.
> >
> > >
> > > > Basically I followed this rule: the bit should never be cleared if (1) user
> > > > manually clear it using UFFDIO_WRITEPROTECT, (2) unmapping the whole region.
> >
> > (So obviously when I said "unmapping the whole region", it should include the
> > case where we truncate the pages; basically I'll let case (2) cover all cases
> > where we're certain the page can be dropped, and so can the uffd-wp bit)
> >
> > > > There can be special cases, e.g. when unregistering the vma with VM_UFFD_WP,
> > > > but that's a rare case, and we also have code to take care of those lazily
> > > > (e.g., we'll restore such a uffd-wp special pte into a none pte if we find
> > > > we've got a fault and the vma is not registered with uffd-wp at all, in
> > > > do_swap_page). Otherwise I never clear the bit.
> > > >
> > > > >
> > > > > > struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > > > > > pte_t pte);
> > > > > > struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > > > > > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > > > > > index 355ea1ee32bd..c29a6ef3a642 100644
> > > > > > --- a/include/linux/mm_inline.h
> > > > > > +++ b/include/linux/mm_inline.h
> > > > > > @@ -4,6 +4,8 @@
> > > > > >
> > > > > > #include <linux/huge_mm.h>
> > > > > > #include <linux/swap.h>
> > > > > > +#include <linux/userfaultfd_k.h>
> > > > > > +#include <linux/swapops.h>
> > > > > >
> > > > > > /**
> > > > > > * page_is_file_lru - should the page be on a file LRU or anon LRU?
> > > > > > @@ -104,4 +106,45 @@ static __always_inline void del_page_from_lru_list(struct page *page,
> > > > > > update_lru_size(lruvec, page_lru(page), page_zonenum(page),
> > > > > > -thp_nr_pages(page));
> > > > > > }
> > > > > > +
> > > > > > +/*
> > > > > > + * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
> > > > > > + * replace a none pte. NOTE! This should only be called when *pte is already
> > > > > > + * cleared so we will never accidentally replace something valuable. Meanwhile
> > > > > > + * none pte also means we are not demoting the pte so if tlb flushed then we
> > > > > > + * don't need to do it again; otherwise if tlb flush is postponed then it's
> > > > > > + * even better.
> > > > > > + *
> > > > > > + * Must be called with pgtable lock held.
> > > > > > + */
> > > > > > +static inline void
> > > > > > +pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
> > > > > > + pte_t *pte, pte_t pteval)
> > > > > > +{
> > > > > > +#ifdef CONFIG_USERFAULTFD
> > > > > > + bool arm_uffd_pte = false;
> > > > > > +
> > > > > > + /* The current status of the pte should be "cleared" before calling */
> > > > > > + WARN_ON_ONCE(!pte_none(*pte));
> > > > > > +
> > > > > > + if (vma_is_anonymous(vma))
> > > > > > + return;
> > > > > > +
> > > > > > + /* A uffd-wp wr-protected normal pte */
> > > > > > + if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
> > > > > > + arm_uffd_pte = true;
> > > > > > +
> > > > > > + /*
> > > > > > + * A uffd-wp wr-protected swap pte. Note: this should even work for
> > > > > > + * pte_swp_uffd_wp_special() too.
> > > > > > + */
> > > > >
> > > > > I'm probably missing something but when can we actually have this case and why
> > > > > would we want to leave a special pte behind? From what I can tell this is
> > > > > called from try_to_unmap_one() where this won't be true or from zap_pte_range()
> > > > > when not skipping swap pages.
> > > >
> > > > Yes this is a good question..
> > > >
> > > > Initially I made this function make sure I cover all forms of uffd-wp bit, that
> > > > contains both swap and present ptes; imho that's pretty safe. However for
> > > > !anonymous cases we don't keep swap entry inside pte even if swapped out, as
> > > > they should reside in shmem page cache indeed. The only missing piece seems to
> > > > be the device private entries as you also spotted below.
> > >
> > > Yes, I think it's *probably* safe although I don't yet have a strong opinion
> > > here ...
> > >
> > > > > > + if (unlikely(is_swap_pte(pteval) && pte_swp_uffd_wp(pteval)))
> > >
> > > ... however if this can never happen would a WARN_ON() be better? It would also
> > > mean you could remove arm_uffd_pte.
> >
> > Hmm, after a second thought I think we can't make it a WARN_ON_ONCE()... this
> > can still be useful for private mappings of shmem files: in that case we'll
> > have the swap entry stored in the pte, not the page cache, so after page
> > reclaim it will contain a valid swap entry, while it's still "!anonymous".
>
> There's something (probably obvious) I must still be missing here. During
> reclaim won't a private shmem mapping still have a present pteval here?
> Therefore it won't trigger this case - the uffd wp bit is set when the swap
> entry is established further down in try_to_unmap_one() right?

I agree if it's at the point when it gets reclaimed; however, what if we zap a
pte of a page that already got reclaimed? It should have the swap pte installed,
imho, which will have "is_swap_pte(pteval) && pte_swp_uffd_wp(pteval)" == true.
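
Just to spell out the case I mean, here's a toy standalone model (not the
kernel helper itself - the enum and the anon check are illustrative only) of
why the swap-pte branch still matters for a MAP_PRIVATE shmem mapping whose
page has already been reclaimed:

#include <stdbool.h>
#include <stdio.h>

/* Toy classification of the pte shapes pte_install_uffd_wp_if_needed() sees */
enum pte_kind {
	PTE_PRESENT_UFFD_WP,	/* mapped page with the uffd-wp bit         */
	PTE_SWAP_UFFD_WP,	/* real swap entry with the swp uffd-wp bit */
	PTE_UFFD_WP_MARKER,	/* the special marker pte itself            */
	PTE_OTHER,
};

/* Mirrors the helper's decision: only file-backed vmas arm the marker */
static bool should_arm_marker(enum pte_kind kind, bool vma_is_anon)
{
	if (vma_is_anon)
		return false;
	return kind == PTE_PRESENT_UFFD_WP ||
	       kind == PTE_SWAP_UFFD_WP ||
	       kind == PTE_UFFD_WP_MARKER;
}

int main(void)
{
	/*
	 * A MAP_PRIVATE shmem mapping whose page was already reclaimed: the
	 * pte holds a real swap entry carrying the swp uffd-wp bit, and the
	 * vma is still "!anonymous", so this is the branch that keeps the
	 * protection when such a pte gets zapped.
	 */
	printf("%d\n", should_arm_marker(PTE_SWAP_UFFD_WP, false));	/* 1 */
	return 0;
}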

>
> > >
> > > > > > + arm_uffd_pte = true;
> > > > > > +
> > > > > > + if (unlikely(arm_uffd_pte))
> > > > > > + set_pte_at(vma->vm_mm, addr, pte,
> > > > > > + pte_swp_mkuffd_wp_special(vma));
> > > > > > +#endif
> > > > > > +}
> > > > > > +
> > > > > > #endif
> > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > index 319552efc782..3453b8ae5f4f 100644
> > > > > > --- a/mm/memory.c
> > > > > > +++ b/mm/memory.c
> > > > > > @@ -73,6 +73,7 @@
> > > > > > #include <linux/perf_event.h>
> > > > > > #include <linux/ptrace.h>
> > > > > > #include <linux/vmalloc.h>
> > > > > > +#include <linux/mm_inline.h>
> > > > > >
> > > > > > #include <trace/events/kmem.h>
> > > > > >
> > > > > > @@ -1298,6 +1299,21 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > > > > > return ret;
> > > > > > }
> > > > > >
> > > > > > +/*
> > > > > > + * This function makes sure that we'll replace the none pte with an uffd-wp
> > > > > > + * swap special pte marker when necessary. Must be with the pgtable lock held.
> > > > > > + */
> > > > > > +static inline void
> > > > > > +zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
> > > > > > + unsigned long addr, pte_t *pte,
> > > > > > + struct zap_details *details, pte_t pteval)
> > > > > > +{
> > > > > > + if (zap_drop_file_uffd_wp(details))
> > > > > > + return;
> > > > > > +
> > > > > > + pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
> > > > > > +}
> > > > > > +
> > > > > > static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > > > struct vm_area_struct *vma, pmd_t *pmd,
> > > > > > unsigned long addr, unsigned long end,
> > > > > > @@ -1335,6 +1351,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > > > ptent = ptep_get_and_clear_full(mm, addr, pte,
> > > > > > tlb->fullmm);
> > > > > > tlb_remove_tlb_entry(tlb, pte, addr);
> > > > > > + zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> > > > > > + ptent);
> > > > > > if (unlikely(!page))
> > > > > > continue;
> > > > > >
> > > > > > @@ -1359,6 +1377,22 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > > > continue;
> > > > > > }
> > > > > >
> > > > > > + /*
> > > > > > + * If this is a special uffd-wp marker pte... Drop it only if
> > > > > > + * enforced to do so.
> > > > > > + */
> > > > > > + if (unlikely(is_swap_special_pte(ptent))) {
> > > > > > + WARN_ON_ONCE(!pte_swp_uffd_wp_special(ptent));
> > > > >
> > > > > Why the WARN_ON and not just test pte_swp_uffd_wp_special() directly?
> > > > >
> > > > > > + /*
> > > > > > + * If this is a common unmap of ptes, keep this as is.
> > > > > > + * Drop it only if this is a whole-vma destruction.
> > > > > > + */
> > > > > > + if (zap_drop_file_uffd_wp(details))
> > > > > > + ptep_get_and_clear_full(mm, addr, pte,
> > > > > > + tlb->fullmm);
> > > > > > + continue;
> > > > > > + }
> > > > > > +
> > > > > > entry = pte_to_swp_entry(ptent);
> > > > > > if (is_device_private_entry(entry) ||
> > > > > > is_device_exclusive_entry(entry)) {
> > > > > > @@ -1373,6 +1407,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > > > page_remove_rmap(page, false);
> > > > > >
> > > > > > put_page(page);
> > > > > > + zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> > > > > > + ptent);
> > > > >
> > > > > Device entries only support anonymous vmas at present so should we drop this?
> > > > > I guess I'm also a little confused by this because I'm not sure in what
> > > > > scenarios you would want to zap swap entries but leave special swap ptes behind
> > > > > (see also my earlier question above as well).
> > > >
> > > > If that's the case, maybe indeed this is not needed, and I can use a
> > > > WARN_ON_ONCE here instead, just in case some facts change. E.g., would it be
> > > > possible one day to have !anonymous support for device private entries?
> > > > Frankly I have no solid idea how device private is used, so some more
> > > > context would be nice too; since you know it much better than me, maybe
> > > > it's a good chance to learn more about it. :)
> > >
> > > Yes, a WARN_ON_ONCE() would be good if you remove it. We are planning to add
> > > support for !anonymous device private entries at some point.
> > >
> > > There's nothing too special about device private entries. They exist to store
> > > some state and look up a device driver callback that gets called when the CPU
> > > tries to access the page. For example see how do_swap_page() handles them:
> > >
> > > 	} else if (is_device_private_entry(entry)) {
> > > 		vmf->page = pfn_swap_entry_to_page(entry);
> > > 		ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> > >
> > > Normally a device driver provides the implementation of migrate_to_ram() which
> > > will copy the page back to CPU addressable memory and restore the PTE to a
> > > normal functioning PTE using the migrate_vma_*() interfaces. Typically this is
> > > used to allow migration of a page to memory that is not directly CPU addressable
> > > (eg. GPU memory). Hopefully that goes some way to explaining what they are, but
> > > if you have more questions let me know!
> >
> > Thanks for offering these details! So one thing I'm still uncertain about is
> > what exact type of memory is allowed to be mapped as device private. E.g.,
> > would "anonymous shared" be allowed as "anonymous"? I saw there seems to be
> > one ioctl defined that's used to bind these things:
> >
> > DRM_IOCTL_DEF_DRV(NOUVEAU_SVM_BIND, nouveau_svmm_bind, DRM_RENDER_ALLOW),
> >
> > Then nouveau_dmem_migrate_chunk() will initiate the device private entries, am
> > I right? Then to ask my previous question in another form: if the vaddr range
> > comes from a userspace extension driver, would it be allowed to pass in some
> > vaddr range mapped with MAP_ANONYMOUS|MAP_SHARED?
>
> I should have been more specific - device private pages currently only support
> non-file/shmem backed pages. In other words the migrate_vma_*() calls will fail
> for MAP_ANONYMOUS | MAP_SHARED when the target page is a device private page.
>
> For a present page this is enforced in migrate_vma_pages() when trying to
> migrate to a device private page:
>
> > 	mapping = page_mapping(page);
> >
> > 	if (is_zone_device_page(newpage)) {
> > 		if (is_device_private_page(newpage)) {
> > 			/*
> > 			 * For now only support private anonymous when
> > 			 * migrating to un-addressable device memory.
> > 			 */
> > 			if (mapping) {
> > 				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
> > 				continue;
> > 			}

Ah fair enough. :)

When I looked again, I also saw that there's a vma_is_anonymous() check right
at the entry of migrate_vma_insert_page() too.

I'll convert this device private call to a WARN_ON_ONCE() then, with proper
comments explaining why.
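
For reference, one possible shape of that conversion (just a sketch against the
hunk quoted above, not the actual follow-up patch; the exact WARN condition is
my guess - it asserts the case the dropped call would otherwise have handled):

		entry = pte_to_swp_entry(ptent);
		if (is_device_private_entry(entry) ||
		    is_device_exclusive_entry(entry)) {
			...
			page_remove_rmap(page, false);
			put_page(page);
			/*
			 * Device private/exclusive entries are only installed
			 * for anonymous vmas today, so there should never be
			 * a file-backed uffd-wp bit to persist here; assert
			 * that instead of calling
			 * zap_install_uffd_wp_if_needed().
			 */
			WARN_ON_ONCE(!vma_is_anonymous(vma) &&
				     pte_swp_uffd_wp(ptent));
			continue;
		}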

Thanks,

--
Peter Xu

2021-07-06 05:41:52

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v3 11/27] shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed

> > > > > > > struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > > > > > > pte_t pte);
> > > > > > > struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > > > > > > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > > > > > > index 355ea1ee32bd..c29a6ef3a642 100644
> > > > > > > --- a/include/linux/mm_inline.h
> > > > > > > +++ b/include/linux/mm_inline.h
> > > > > > > @@ -4,6 +4,8 @@
> > > > > > >
> > > > > > > #include <linux/huge_mm.h>
> > > > > > > #include <linux/swap.h>
> > > > > > > +#include <linux/userfaultfd_k.h>
> > > > > > > +#include <linux/swapops.h>
> > > > > > >
> > > > > > > /**
> > > > > > > * page_is_file_lru - should the page be on a file LRU or anon LRU?
> > > > > > > @@ -104,4 +106,45 @@ static __always_inline void del_page_from_lru_list(struct page *page,
> > > > > > > update_lru_size(lruvec, page_lru(page), page_zonenum(page),
> > > > > > > -thp_nr_pages(page));
> > > > > > > }
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
> > > > > > > + * replace a none pte. NOTE! This should only be called when *pte is already
> > > > > > > + * cleared so we will never accidentally replace something valuable. Meanwhile
> > > > > > > + * none pte also means we are not demoting the pte so if tlb flushed then we
> > > > > > > + * don't need to do it again; otherwise if tlb flush is postponed then it's
> > > > > > > + * even better.
> > > > > > > + *
> > > > > > > + * Must be called with pgtable lock held.
> > > > > > > + */
> > > > > > > +static inline void
> > > > > > > +pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
> > > > > > > + pte_t *pte, pte_t pteval)
> > > > > > > +{
> > > > > > > +#ifdef CONFIG_USERFAULTFD
> > > > > > > + bool arm_uffd_pte = false;
> > > > > > > +
> > > > > > > + /* The current status of the pte should be "cleared" before calling */
> > > > > > > + WARN_ON_ONCE(!pte_none(*pte));
> > > > > > > +
> > > > > > > + if (vma_is_anonymous(vma))
> > > > > > > + return;
> > > > > > > +
> > > > > > > + /* A uffd-wp wr-protected normal pte */
> > > > > > > + if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
> > > > > > > + arm_uffd_pte = true;
> > > > > > > +
> > > > > > > + /*
> > > > > > > + * A uffd-wp wr-protected swap pte. Note: this should even work for
> > > > > > > + * pte_swp_uffd_wp_special() too.
> > > > > > > + */
> > > > > >
> > > > > > I'm probably missing something but when can we actually have this case and why
> > > > > > would we want to leave a special pte behind? From what I can tell this is
> > > > > > called from try_to_unmap_one() where this won't be true or from zap_pte_range()
> > > > > > when not skipping swap pages.
> > > > >
> > > > > Yes this is a good question..
> > > > >
> > > > > Initially I made this function make sure I cover all forms of uffd-wp bit, that
> > > > > contains both swap and present ptes; imho that's pretty safe. However for
> > > > > !anonymous cases we don't keep swap entry inside pte even if swapped out, as
> > > > > they should reside in shmem page cache indeed. The only missing piece seems to
> > > > > be the device private entries as you also spotted below.
> > > >
> > > > Yes, I think it's *probably* safe although I don't yet have a strong opinion
> > > > here ...
> > > >
> > > > > > > + if (unlikely(is_swap_pte(pteval) && pte_swp_uffd_wp(pteval)))
> > > >
> > > > ... however if this can never happen would a WARN_ON() be better? It would also
> > > > mean you could remove arm_uffd_pte.
> > >
> > > Hmm, after a second thought I think we can't make it a WARN_ON_ONCE().. this
> > > can still be useful for private mapping of shmem files: in that case we'll have
> > > swap entry stored in pte not page cache, so after page reclaim it will contain
> > > a valid swap entry, while it's still "!anonymous".
> >
> > There's something (probably obvious) I must still be missing here. During
> > reclaim won't a private shmem mapping still have a present pteval here?
> > Therefore it won't trigger this case - the uffd wp bit is set when the swap
> > entry is established further down in try_to_unmap_one() right?
>
> I agree if it's at the point when it get reclaimed, however what if we zap a
> pte of a page already got reclaimed? It should have the swap pte installed,
> imho, which will have "is_swap_pte(pteval) && pte_swp_uffd_wp(pteval)"==true.

Apologies for the delay getting back to this, I hope to find some more time
to look at this again this week.

I guess what I am missing is why we care about a swap pte for a reclaimed page
getting zapped. I thought that would imply the mapping was getting torn down,
although I suppose in that case you still want the uffd-wp to apply in case a
new mapping appears there?

> >
> > > >
> > > > > > > + arm_uffd_pte = true;
> > > > > > > +
> > > > > > > + if (unlikely(arm_uffd_pte))
> > > > > > > + set_pte_at(vma->vm_mm, addr, pte,
> > > > > > > + pte_swp_mkuffd_wp_special(vma));
> > > > > > > +#endif
> > > > > > > +}
> > > > > > > +
> > > > > > > #endif
> > > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > > index 319552efc782..3453b8ae5f4f 100644
> > > > > > > --- a/mm/memory.c
> > > > > > > +++ b/mm/memory.c
> > > > > > > @@ -73,6 +73,7 @@
> > > > > > > #include <linux/perf_event.h>
> > > > > > > #include <linux/ptrace.h>
> > > > > > > #include <linux/vmalloc.h>
> > > > > > > +#include <linux/mm_inline.h>
> > > > > > >
> > > > > > > #include <trace/events/kmem.h>
> > > > > > >
> > > > > > > @@ -1298,6 +1299,21 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> > > > > > > return ret;
> > > > > > > }
> > > > > > >
> > > > > > > +/*
> > > > > > > + * This function makes sure that we'll replace the none pte with an uffd-wp
> > > > > > > + * swap special pte marker when necessary. Must be with the pgtable lock held.
> > > > > > > + */
> > > > > > > +static inline void
> > > > > > > +zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
> > > > > > > + unsigned long addr, pte_t *pte,
> > > > > > > + struct zap_details *details, pte_t pteval)
> > > > > > > +{
> > > > > > > + if (zap_drop_file_uffd_wp(details))
> > > > > > > + return;
> > > > > > > +
> > > > > > > + pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
> > > > > > > +}
> > > > > > > +
> > > > > > > static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > > > > struct vm_area_struct *vma, pmd_t *pmd,
> > > > > > > unsigned long addr, unsigned long end,
> > > > > > > @@ -1335,6 +1351,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > > > > ptent = ptep_get_and_clear_full(mm, addr, pte,
> > > > > > > tlb->fullmm);
> > > > > > > tlb_remove_tlb_entry(tlb, pte, addr);
> > > > > > > + zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> > > > > > > + ptent);
> > > > > > > if (unlikely(!page))
> > > > > > > continue;
> > > > > > >
> > > > > > > @@ -1359,6 +1377,22 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > > > > continue;
> > > > > > > }
> > > > > > >
> > > > > > > + /*
> > > > > > > + * If this is a special uffd-wp marker pte... Drop it only if
> > > > > > > + * enforced to do so.
> > > > > > > + */
> > > > > > > + if (unlikely(is_swap_special_pte(ptent))) {
> > > > > > > + WARN_ON_ONCE(!pte_swp_uffd_wp_special(ptent));
> > > > > >
> > > > > > Why the WARN_ON and not just test pte_swp_uffd_wp_special() directly?
> > > > > >
> > > > > > > + /*
> > > > > > > + * If this is a common unmap of ptes, keep this as is.
> > > > > > > + * Drop it only if this is a whole-vma destruction.
> > > > > > > + */
> > > > > > > + if (zap_drop_file_uffd_wp(details))
> > > > > > > + ptep_get_and_clear_full(mm, addr, pte,
> > > > > > > + tlb->fullmm);
> > > > > > > + continue;
> > > > > > > + }
> > > > > > > +
> > > > > > > entry = pte_to_swp_entry(ptent);
> > > > > > > if (is_device_private_entry(entry) ||
> > > > > > > is_device_exclusive_entry(entry)) {
> > > > > > > @@ -1373,6 +1407,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > > > > > > page_remove_rmap(page, false);
> > > > > > >
> > > > > > > put_page(page);
> > > > > > > + zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> > > > > > > + ptent);
> > > > > >
> > > > > > Device entries only support anonymous vmas at present so should we drop this?
> > > > > > I guess I'm also a little confused by this because I'm not sure in what
> > > > > > scenarios you would want to zap swap entries but leave special swap ptes behind
> > > > > > (see also my earlier question above as well).
> > > > >
> > > > > If that's the case, maybe indeed this is not needed, and I can use a
> > > > > WARN_ON_ONCE here instead, just in case some facts changes. E.g., would it be
> > > > > possible one day to have !anonymous support for device private entries?
> > > > > Frankly I have no solid idea on how device private is used, so some more
> > > > > context would be nice too; since I think you should know much better than me,
> > > > > so maybe it's a good chance to learn more about it. :)
> > > >
> > > > Yes, a WARN_ON_ONCE() would be good if you remove it. We are planning to add
> > > > support for !anonymous device private entries at some point.
> > > >
> > > > There's nothing too special about device private entries. They exist to store
> > > > some state and look up a device driver callback that gets called when the CPU
> > > > tries to access the page. For example see how do_swap_page() handles them:
> > > >
> > > > } else if (is_device_private_entry(entry)) {
> > > > vmf->page = pfn_swap_entry_to_page(entry);
> > > > ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> > > >
> > > > Normally a device driver provides the implementation of migrate_to_ram() which
> > > > will copy the page back to CPU addressable memory and restore the PTE to a
> > > > normal functioning PTE using the migrate_vma_*() interfaces. Typically this is
> > > > used to allow migration of a page to memory that is not directly CPU addressable
> > > > (eg. GPU memory). Hopefully that goes some way to explaining what they are, but
> > > > if you have more questions let me know!
> > >
> > > Thanks for offering these details! So one thing I'm still uncertain is what
> > > exact type of memory is allowed to be mapped to device private. E.g., would
> > > "anonymous shared" allowed as "anonymous"? I saw there seems to have one ioctl
> > > defined that's used to bind these things:
> > >
> > > DRM_IOCTL_DEF_DRV(NOUVEAU_SVM_BIND, nouveau_svmm_bind, DRM_RENDER_ALLOW),
> > >
> > > Then nouveau_dmem_migrate_chunk() will initiates the device private entries, am
> > > I right? Then to ask my previous question in another form: if the vaddr range
> > > is coming from an userspace extention driver, would it be allowed to pass in
> > > some vaddr range mapped with MAP_ANONYMOUS|MAP_SHARED?
> >
> > I should have been more specific - device private pages currently only support
> > non-file/shmem backed pages. In other words the migrate_vma_*() calls will fail
> > for MAP_ANONYMOUS | MAP_SHARED when the target page is a device private page.
> >
> > For a present page this is enforced in migrate_vma_pages() when trying to
> > migrate to a device private page:
> >
> > mapping = page_mapping(page);
> >
> > if (is_zone_device_page(newpage)) {
> > if (is_device_private_page(newpage)) {
> > /*
> > * For now only support private anonymous when
> > * migrating to un-addressable device memory.
> > */
> > if (mapping) {
> > migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
> > continue;
> > }
>
> Ah fair enough. :)
>
> When I looked again, I did also see that there's vma_is_anonymous() check right
> at the entry of migrate_vma_insert_page() too.
>
> I'll convert this device private call to a WARN_ON_ONCE() then, with proper
> comments explaining why.
>
> Thanks,
>
>




2021-07-06 15:36:06

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v3 11/27] shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed

On Tue, Jul 06, 2021 at 03:40:42PM +1000, Alistair Popple wrote:
> > > > > > > > struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > > > > > > > pte_t pte);
> > > > > > > > struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > > > > > > > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > > > > > > > index 355ea1ee32bd..c29a6ef3a642 100644
> > > > > > > > --- a/include/linux/mm_inline.h
> > > > > > > > +++ b/include/linux/mm_inline.h
> > > > > > > > @@ -4,6 +4,8 @@
> > > > > > > >
> > > > > > > > #include <linux/huge_mm.h>
> > > > > > > > #include <linux/swap.h>
> > > > > > > > +#include <linux/userfaultfd_k.h>
> > > > > > > > +#include <linux/swapops.h>
> > > > > > > >
> > > > > > > > /**
> > > > > > > > * page_is_file_lru - should the page be on a file LRU or anon LRU?
> > > > > > > > @@ -104,4 +106,45 @@ static __always_inline void del_page_from_lru_list(struct page *page,
> > > > > > > > update_lru_size(lruvec, page_lru(page), page_zonenum(page),
> > > > > > > > -thp_nr_pages(page));
> > > > > > > > }
> > > > > > > > +
> > > > > > > > +/*
> > > > > > > > + * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
> > > > > > > > + * replace a none pte. NOTE! This should only be called when *pte is already
> > > > > > > > + * cleared so we will never accidentally replace something valuable. Meanwhile
> > > > > > > > + * none pte also means we are not demoting the pte so if tlb flushed then we
> > > > > > > > + * don't need to do it again; otherwise if tlb flush is postponed then it's
> > > > > > > > + * even better.
> > > > > > > > + *
> > > > > > > > + * Must be called with pgtable lock held.
> > > > > > > > + */
> > > > > > > > +static inline void
> > > > > > > > +pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
> > > > > > > > + pte_t *pte, pte_t pteval)
> > > > > > > > +{
> > > > > > > > +#ifdef CONFIG_USERFAULTFD
> > > > > > > > + bool arm_uffd_pte = false;
> > > > > > > > +
> > > > > > > > + /* The current status of the pte should be "cleared" before calling */
> > > > > > > > + WARN_ON_ONCE(!pte_none(*pte));
> > > > > > > > +
> > > > > > > > + if (vma_is_anonymous(vma))
> > > > > > > > + return;
> > > > > > > > +
> > > > > > > > + /* A uffd-wp wr-protected normal pte */
> > > > > > > > + if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
> > > > > > > > + arm_uffd_pte = true;
> > > > > > > > +
> > > > > > > > + /*
> > > > > > > > + * A uffd-wp wr-protected swap pte. Note: this should even work for
> > > > > > > > + * pte_swp_uffd_wp_special() too.
> > > > > > > > + */
> > > > > > >
> > > > > > > I'm probably missing something but when can we actually have this case and why
> > > > > > > would we want to leave a special pte behind? From what I can tell this is
> > > > > > > called from try_to_unmap_one() where this won't be true or from zap_pte_range()
> > > > > > > when not skipping swap pages.
> > > > > >
> > > > > > Yes this is a good question..
> > > > > >
> > > > > > Initially I made this function make sure I cover all forms of uffd-wp bit, that
> > > > > > contains both swap and present ptes; imho that's pretty safe. However for
> > > > > > !anonymous cases we don't keep swap entry inside pte even if swapped out, as
> > > > > > they should reside in shmem page cache indeed. The only missing piece seems to
> > > > > > be the device private entries as you also spotted below.
> > > > >
> > > > > Yes, I think it's *probably* safe although I don't yet have a strong opinion
> > > > > here ...
> > > > >
> > > > > > > > + if (unlikely(is_swap_pte(pteval) && pte_swp_uffd_wp(pteval)))
> > > > >
> > > > > ... however if this can never happen would a WARN_ON() be better? It would also
> > > > > mean you could remove arm_uffd_pte.
> > > >
> > > > Hmm, after a second thought I think we can't make it a WARN_ON_ONCE().. this
> > > > can still be useful for private mapping of shmem files: in that case we'll have
> > > > swap entry stored in pte not page cache, so after page reclaim it will contain
> > > > a valid swap entry, while it's still "!anonymous".

[1]

> > >
> > > There's something (probably obvious) I must still be missing here. During
> > > reclaim won't a private shmem mapping still have a present pteval here?
> > > Therefore it won't trigger this case - the uffd wp bit is set when the swap
> > > entry is established further down in try_to_unmap_one() right?
> >
> > I agree if it's at the point when it get reclaimed, however what if we zap a
> > pte of a page already got reclaimed? It should have the swap pte installed,
> > imho, which will have "is_swap_pte(pteval) && pte_swp_uffd_wp(pteval)"==true.
>
> Apologies for the delay getting back to this, I hope to find some more time
> to look at this again this week.

No problem, please take your time reviewing the series.

>
> I guess what I am missing is why we care about a swap pte for a reclaimed page
> getting zapped. I thought that would imply the mapping was getting torn down,
> although I suppose in that case you still want the uffd-wp to apply in case a
> new mapping appears there?

For the torn down case it'll always have ZAP_FLAG_DROP_FILE_UFFD_WP set, so
pte_install_uffd_wp_if_needed() won't be called, as zap_drop_file_uffd_wp()
will return true:

static inline void
zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
			      unsigned long addr, pte_t *pte,
			      struct zap_details *details, pte_t pteval)
{
	if (zap_drop_file_uffd_wp(details))
		return;

	pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
}

As you can see, it's non-trivial to fully digest all of its caller stacks.
What I wanted to do with pte_install_uffd_wp_if_needed() is simply to provide
a helper that can convert any form of uffd-wp pte into a pte marker before it
is set to a none pte. Since uffd-wp can exist in two forms (either present or
swap), covering both forms (and, for the swap form, also covering the uffd-wp
special pte itself) is a very clear idea and easy to understand to me. I don't
even need to worry about who is calling it, or which case can be a swap pte
and which must not - we just call it whenever we want to persist the uffd-wp
bit (after a pte got cleared). That's why I still prefer to keep it as is in
all cases, as it just makes things straightforward to me.

Thanks,

--
Peter Xu

2021-07-08 02:57:48

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH v3 11/27] shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed

On Wednesday, 7 July 2021 1:35:18 AM AEST Peter Xu wrote:
> On Tue, Jul 06, 2021 at 03:40:42PM +1000, Alistair Popple wrote:
> > > > > > > > > struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > > > > > > > > pte_t pte);
> > > > > > > > > struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > > > > > > > > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > > > > > > > > index 355ea1ee32bd..c29a6ef3a642 100644
> > > > > > > > > --- a/include/linux/mm_inline.h
> > > > > > > > > +++ b/include/linux/mm_inline.h
> > > > > > > > > @@ -4,6 +4,8 @@
> > > > > > > > >
> > > > > > > > > #include <linux/huge_mm.h>
> > > > > > > > > #include <linux/swap.h>
> > > > > > > > > +#include <linux/userfaultfd_k.h>
> > > > > > > > > +#include <linux/swapops.h>
> > > > > > > > >
> > > > > > > > > /**
> > > > > > > > > * page_is_file_lru - should the page be on a file LRU or anon LRU?
> > > > > > > > > @@ -104,4 +106,45 @@ static __always_inline void del_page_from_lru_list(struct page *page,
> > > > > > > > > update_lru_size(lruvec, page_lru(page), page_zonenum(page),
> > > > > > > > > -thp_nr_pages(page));
> > > > > > > > > }
> > > > > > > > > +
> > > > > > > > > +/*
> > > > > > > > > + * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
> > > > > > > > > + * replace a none pte. NOTE! This should only be called when *pte is already
> > > > > > > > > + * cleared so we will never accidentally replace something valuable. Meanwhile
> > > > > > > > > + * none pte also means we are not demoting the pte so if tlb flushed then we
> > > > > > > > > + * don't need to do it again; otherwise if tlb flush is postponed then it's
> > > > > > > > > + * even better.
> > > > > > > > > + *
> > > > > > > > > + * Must be called with pgtable lock held.
> > > > > > > > > + */
> > > > > > > > > +static inline void
> > > > > > > > > +pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
> > > > > > > > > + pte_t *pte, pte_t pteval)
> > > > > > > > > +{
> > > > > > > > > +#ifdef CONFIG_USERFAULTFD
> > > > > > > > > + bool arm_uffd_pte = false;
> > > > > > > > > +
> > > > > > > > > + /* The current status of the pte should be "cleared" before calling */
> > > > > > > > > + WARN_ON_ONCE(!pte_none(*pte));
> > > > > > > > > +
> > > > > > > > > + if (vma_is_anonymous(vma))
> > > > > > > > > + return;
> > > > > > > > > +
> > > > > > > > > + /* A uffd-wp wr-protected normal pte */
> > > > > > > > > + if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
> > > > > > > > > + arm_uffd_pte = true;
> > > > > > > > > +
> > > > > > > > > + /*
> > > > > > > > > + * A uffd-wp wr-protected swap pte. Note: this should even work for
> > > > > > > > > + * pte_swp_uffd_wp_special() too.
> > > > > > > > > + */
> > > > > > > >
> > > > > > > > I'm probably missing something but when can we actually have this case and why
> > > > > > > > would we want to leave a special pte behind? From what I can tell this is
> > > > > > > > called from try_to_unmap_one() where this won't be true or from zap_pte_range()
> > > > > > > > when not skipping swap pages.
> > > > > > >
> > > > > > > Yes this is a good question..
> > > > > > >
> > > > > > > Initially I made this function make sure I cover all forms of uffd-wp bit, that
> > > > > > > contains both swap and present ptes; imho that's pretty safe. However for
> > > > > > > !anonymous cases we don't keep swap entry inside pte even if swapped out, as
> > > > > > > they should reside in shmem page cache indeed. The only missing piece seems to
> > > > > > > be the device private entries as you also spotted below.
> > > > > >
> > > > > > Yes, I think it's *probably* safe although I don't yet have a strong opinion
> > > > > > here ...
> > > > > >
> > > > > > > > > + if (unlikely(is_swap_pte(pteval) && pte_swp_uffd_wp(pteval)))
> > > > > >
> > > > > > ... however if this can never happen would a WARN_ON() be better? It would also
> > > > > > mean you could remove arm_uffd_pte.
> > > > >
> > > > > Hmm, after a second thought I think we can't make it a WARN_ON_ONCE().. this
> > > > > can still be useful for private mapping of shmem files: in that case we'll have
> > > > > swap entry stored in pte not page cache, so after page reclaim it will contain
> > > > > a valid swap entry, while it's still "!anonymous".
>
> [1]
>
> > > >
> > > > There's something (probably obvious) I must still be missing here. During
> > > > reclaim won't a private shmem mapping still have a present pteval here?
> > > > Therefore it won't trigger this case - the uffd wp bit is set when the swap
> > > > entry is established further down in try_to_unmap_one() right?
> > >
> > > I agree if it's at the point when it get reclaimed, however what if we zap a
> > > pte of a page already got reclaimed? It should have the swap pte installed,
> > > imho, which will have "is_swap_pte(pteval) && pte_swp_uffd_wp(pteval)"==true.
> >
> > Apologies for the delay getting back to this, I hope to find some more time
> > to look at this again this week.
>
> No problem, please take your time on reviewing the series.
>
> >
> > I guess what I am missing is why we care about a swap pte for a reclaimed page
> > getting zapped. I thought that would imply the mapping was getting torn down,
> > although I suppose in that case you still want the uffd-wp to apply in case a
> > new mapping appears there?
>
> For the torn down case it'll always have ZAP_FLAG_DROP_FILE_UFFD_WP set, so
> pte_install_uffd_wp_if_needed() won't be called, as zap_drop_file_uffd_wp()
> will return true:

Argh, thanks. I had forgotten that bit.

> static inline void
> zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
> 			      unsigned long addr, pte_t *pte,
> 			      struct zap_details *details, pte_t pteval)
> {
> 	if (zap_drop_file_uffd_wp(details))
> 		return;
> 
> 	pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
> }
>
> As you can see, it's non-trivial to fully digest all of its caller stacks.
> What I wanted to do with pte_install_uffd_wp_if_needed() is simply to provide
> a helper that can convert any form of uffd-wp pte into a pte marker before it
> is set to a none pte. Since uffd-wp can exist in two forms (either present or
> swap), covering both forms (and, for the swap form, also covering the uffd-wp
> special pte itself) is a very clear idea and easy to understand to me. I don't
> even need to worry about who is calling it, or which case can be a swap pte
> and which must not - we just call it whenever we want to persist the uffd-wp
> bit (after a pte got cleared). That's why I still prefer to keep it as is in
> all cases, as it just makes things straightforward to me.

Ok, that makes sense. I don't think there is an actual problem here; it was
just a little surprising to me, so I was trying to get a better understanding
of the caller stacks and when this might actually be required. As you say,
though, that is non-trivial, and in any case it's still ok to install these
bits, and a single function is simpler.

- Alistair

> Thanks,
>
>