This is an RFC series to support userfaultfd write-protect upon shmem and hugetlbfs.
PS. Note that there's a known issue [0] with the tlb flush against uffd-wp/soft-dirty
in general, and Nadav is working on it. It may or may not directly affect
shmem/hugetlbfs since there's normally no COW on shared mappings. Private
shmem could hit it, but that's another problem to solve in general; this
RFC is mainly to see whether there's any objection to the concept of the idea
specific to uffd-wp on shmem/hugetlbfs.
The whole series can also be found online [1].
The major comment I'd like to get is on the new idea of the swap special pte.
It comes from suggestions from both Hugh and Andrea, and I appreciate those
discussions a lot.
In short, it's a new type of pte that didn't exist before, used in file-backed
memory to persist information across ptes being erased (the page cache could
still exist, for example, so on the next page fault we can reload the page
cache with that specific information when necessary).
I'm copy-pasting part of the commit message from the patch "mm/swap: Introduce
the idea of special swap ptes", where uffd-wp becomes its first user:
We used to have special swap entries, like migration entries, hw-poison
entries, device private entries, etc.
Those "special swap entries" reside in the range that they need to be at least
swap entries first, and their types are decided by swp_type(entry).
This patch introduces another idea called "special swap ptes".
It's very easy to confuse it with "special swap entries", but a special swap
pte should never contain a swap entry at all. It means it's illegal to call
pte_to_swp_entry() upon a special swap pte.
Make the uffd-wp special pte the first special swap pte.
Before this patch, is_swap_pte()==true means one of the below:
(a.1) The pte has a normal swap entry (non_swap_entry()==false). For
example, when an anonymous page got swapped out.
(a.2) The pte has a special swap entry (non_swap_entry()==true). For
example, a migration entry, a hw-poison entry, etc.
After this patch, is_swap_pte()==true means one of the below, where case (b) is
added:
(a) The pte contains a swap entry.
(a.1) The pte has a normal swap entry (non_swap_entry()==false). For
example, when an anonymous page got swapped out.
(a.2) The pte has a special swap entry (non_swap_entry()==true). For
example, a migration entry, a hw-poison entry, etc.
(b) The pte does not contain a swap entry at all (so it cannot be passed
into pte_to_swp_entry()). For example, uffd-wp special swap pte.
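To make the classification above concrete, here is a tiny illustrative sketch
(not part of the series) of how a pte walker would tell these cases apart; it
uses the is_swap_special_pte() helper that this series introduces, plus the
existing helpers from <linux/swapops.h>:

/* Illustrative only: classify a pte under the new scheme. */
static const char *classify_swap_pte(pte_t pte)
{
	swp_entry_t entry;

	if (!is_swap_pte(pte))
		return "present or none pte";

	if (is_swap_special_pte(pte))
		/* Case (b): no swap entry inside; pte_to_swp_entry() is illegal */
		return "special swap pte (e.g. uffd-wp special pte)";

	entry = pte_to_swp_entry(pte);	/* only legal for case (a) */
	if (non_swap_entry(entry))
		return "special swap entry (a.2): migration, hw-poison, ...";

	return "normal swap entry (a.1)";
}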
Hugetlbfs needs a similar thing because it's also file-backed. I directly
reused the same special pte there, though the shmem and hugetlb changes to
support this new pte are different, since the two don't share much of the code
path.
Patch layout
============
Part (1): some fixes that I observed when working on this; feel free to skip
them for now because I think they're corner cases and irrelevant to the major
change:
mm/thp: Simplify copying of huge zero page pmd when fork
mm/userfaultfd: Fix uffd-wp special cases for fork()
mm/userfaultfd: Fix a few thp pmd missing uffd-wp bit
Part (2): Shmem support, this is where the special swap pte is introduced.
Some zap rework is needed within the process:
shmem/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
mm: Clear vmf->pte after pte_unmap_same() returns
mm/userfaultfd: Introduce special pte for unmapped file-backed mem
mm/swap: Introduce the idea of special swap ptes
shmem/userfaultfd: Handle uffd-wp special pte in page fault handler
mm: Drop first_index/last_index in zap_details
mm: Introduce zap_details.zap_flags
mm: Introduce ZAP_FLAG_SKIP_SWAP
mm: Pass zap_flags into unmap_mapping_pages()
shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed
shmem/userfaultfd: Allow wr-protect none pte for file-backed mem
shmem/userfaultfd: Allows file-back mem to be uffd wr-protected on thps
shmem/userfaultfd: Handle the left-overed special swap ptes
shmem/userfaultfd: Pass over uffd-wp special swap pte when fork()
Part (3): Hugetlb support. We need to disable huge pmd sharing for uffd-wp
because it's not compatible, just like uffd minor mode. The rest are the
changes required to teach hugetlbfs to understand the special swap pte
introduced with the uffd-wp change:
hugetlb/userfaultfd: Hook page faults for uffd write protection
hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT
hugetlb: Pass vma into huge_pte_alloc()
hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled
mm/hugetlb: Introduce huge version of special swap pte helpers
mm/hugetlb: Move flush_hugetlb_tlb_range() into hugetlb.h
hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp
hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler
hugetlb/userfaultfd: Allow wr-protect none ptes
hugetlb/userfaultfd: Only drop uffd-wp special pte if required
Part (4): Enable both features in code and test
userfaultfd: Enable write protection for shmem & hugetlbfs
userfaultfd/selftests: Enable uffd-wp for shmem/hugetlbfs
Tests
=========
I've tested it with the userfaultfd kselftest program, and also with
umapsort [2], which should be even stricter. No complicated mm setup has been
tested yet besides page swapping in/out, but in any case we'll need more tests
when this becomes non-RFC.
If anyone would like to try umapsort, you'll need an extremely hacked version
of the umap library [3], because by default umap only supports anonymous
memory. So to test it, build [3] first, then [2].
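For anyone who prefers a smaller reproducer than umapsort, the snippet below
(illustrative only, error handling trimmed; it needs a kernel with this series
applied, since current kernels refuse UFFDIO_REGISTER_MODE_WP on shmem)
registers uffd-wp on a shmem mapping and write-protects it:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

int main(void)
{
	size_t len = 4UL << 20;		/* 4MB of shmem via memfd */
	int memfd = memfd_create("uffd-wp-shmem", 0);
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = { 0 };
	struct uffdio_writeprotect wp = { 0 };
	char *buf;

	ftruncate(memfd, len);
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);
	ioctl(uffd, UFFDIO_API, &api);

	/* Register the shared (shmem) range in write-protect mode */
	reg.range.start = (unsigned long)buf;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_WP;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg))
		perror("UFFDIO_REGISTER");

	/* Write-protect it; subsequent writes raise uffd-wp messages */
	wp.range.start = (unsigned long)buf;
	wp.range.len = len;
	wp.mode = UFFDIO_WRITEPROTECT_MODE_WP;
	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
		perror("UFFDIO_WRITEPROTECT");

	return 0;
}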
Any comment would be greatly welcomed. Thanks,
[0] https://lore.kernel.org/lkml/[email protected]/
[1] https://github.com/xzpeter/linux/tree/uffd-wp-shmem-hugetlbfs
[2] https://github.com/LLNL/umap-apps
[3] https://github.com/xzpeter/umap/tree/peter-shmem-hugetlbfs
Peter Xu (30):
mm/thp: Simplify copying of huge zero page pmd when fork
mm/userfaultfd: Fix uffd-wp special cases for fork()
mm/userfaultfd: Fix a few thp pmd missing uffd-wp bit
shmem/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
mm: Clear vmf->pte after pte_unmap_same() returns
mm/userfaultfd: Introduce special pte for unmapped file-backed mem
mm/swap: Introduce the idea of special swap ptes
shmem/userfaultfd: Handle uffd-wp special pte in page fault handler
mm: Drop first_index/last_index in zap_details
mm: Introduce zap_details.zap_flags
mm: Introduce ZAP_FLAG_SKIP_SWAP
mm: Pass zap_flags into unmap_mapping_pages()
shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed
shmem/userfaultfd: Allow wr-protect none pte for file-backed mem
shmem/userfaultfd: Allows file-back mem to be uffd wr-protected on
thps
shmem/userfaultfd: Handle the left-overed special swap ptes
shmem/userfaultfd: Pass over uffd-wp special swap pte when fork()
hugetlb/userfaultfd: Hook page faults for uffd write protection
hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT
hugetlb: Pass vma into huge_pte_alloc()
hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled
mm/hugetlb: Introduce huge version of special swap pte helpers
mm/hugetlb: Move flush_hugetlb_tlb_range() into hugetlb.h
hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp
hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler
hugetlb/userfaultfd: Allow wr-protect none ptes
hugetlb/userfaultfd: Only drop uffd-wp special pte if required
userfaultfd: Enable write protection for shmem & hugetlbfs
userfaultfd/selftests: Enable uffd-wp for shmem/hugetlbfs
arch/arm64/mm/hugetlbpage.c | 5 +-
arch/ia64/mm/hugetlbpage.c | 3 +-
arch/mips/mm/hugetlbpage.c | 4 +-
arch/parisc/mm/hugetlbpage.c | 2 +-
arch/powerpc/mm/hugetlbpage.c | 3 +-
arch/s390/mm/hugetlbpage.c | 2 +-
arch/sh/mm/hugetlbpage.c | 2 +-
arch/sparc/mm/hugetlbpage.c | 2 +-
arch/x86/include/asm/pgtable.h | 28 +++
fs/dax.c | 10 +-
fs/hugetlbfs/inode.c | 15 +-
fs/proc/task_mmu.c | 14 +-
fs/userfaultfd.c | 80 +++++--
include/asm-generic/hugetlb.h | 10 +
include/asm-generic/pgtable_uffd.h | 3 +
include/linux/huge_mm.h | 3 +-
include/linux/hugetlb.h | 47 +++-
include/linux/mm.h | 50 +++-
include/linux/mm_inline.h | 43 ++++
include/linux/mmu_notifier.h | 1 +
include/linux/shmem_fs.h | 5 +-
include/linux/swapops.h | 41 +++-
include/linux/userfaultfd_k.h | 37 +++
include/uapi/linux/userfaultfd.h | 3 +-
mm/huge_memory.c | 36 ++-
mm/hugetlb.c | 174 +++++++++++---
mm/khugepaged.c | 14 +-
mm/memcontrol.c | 2 +-
mm/memory.c | 277 ++++++++++++++++++-----
mm/migrate.c | 2 +-
mm/mprotect.c | 63 +++++-
mm/mremap.c | 2 +-
mm/page_vma_mapped.c | 6 +-
mm/rmap.c | 8 +
mm/shmem.c | 39 +++-
mm/truncate.c | 17 +-
mm/userfaultfd.c | 37 +--
tools/testing/selftests/vm/userfaultfd.c | 14 +-
38 files changed, 881 insertions(+), 223 deletions(-)
--
2.26.2
File-backed memory is prone to being unmapped or swapped out, so its ptes are
always unstable. This could lead to the userfaultfd-wp information getting
lost when such memory is unmapped or swapped out, for example with shmem. To
keep this information persistent, we start to use the newly introduced
swap-like special ptes to replace the none pte when those ptes are removed.
Prepare for this by handling such a special pte before it is applied. Here
a new fault flag FAULT_FLAG_UFFD_WP is introduced. When this flag is set, it
means the current fault is to resolve a page access (either read or write) to
the uffd-wp special pte.
The handling of this special pte page fault is similar to a missing fault, but
it should happen after the pte-missing logic since the special pte is designed
to be a swap-like pte. Meanwhile it should be handled before do_swap_page() so
that the swap core logic won't be confused by such an illegal swap pte.
This is a slow path of uffd-wp handling, because unmap of wr-protected shmem
ptes should be rare. So far it should only trigger in two conditions:
(1) When trying to punch holes in shmem_fallocate(), there will be a
pre-unmap optimization before evicting the page. That will create
unmapped shmem ptes with wr-protected pages covered.
(2) Swapping out of shmem pages
Because of this, the page fault handling is simplified too, by always assuming
it's a read fault when calling do_fault(). When it's a write fault, it'll
fault again when the page access is retried, and then do_wp_page() will handle
the rest of the message generation and delivery to the userfaultfd.
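For reference, this is roughly what the monitor ends up seeing once the
message is delivered (illustrative only, not part of this patch; uffd is
assumed to be the userfaultfd that the range was registered with):

#include <unistd.h>
#include <linux/userfaultfd.h>

/* Read one message and check whether it is a uffd-wp (wr-protect) fault. */
static int is_wp_fault_report(int uffd, unsigned long *addr)
{
	struct uffd_msg msg;

	if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
		return 0;
	if (msg.event != UFFD_EVENT_PAGEFAULT)
		return 0;
	if (!(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
		return 0;
	*addr = (unsigned long)msg.arg.pagefault.address;
	return 1;	/* caller logs *addr, then un-protects it */
}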
Disable fault-around for such a special page fault, because the newly
introduced flag (FAULT_FLAG_UFFD_WP) only applies to the current pte rather
than all the pages around it. Doing fault-around with the new flag could
wrongly apply it to the rest of the pages when installing ptes from the page
cache on a cache hit.
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 2 +
mm/memory.c | 107 +++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 105 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..85d928764b64 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -426,6 +426,7 @@ extern pgprot_t protection_map[16];
* @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
* @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
* @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_UFFD_WP: When install new page entries, set uffd-wp bit.
*
* About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
* whether we would allow page faults to retry by specifying these two
@@ -456,6 +457,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_REMOTE 0x80
#define FAULT_FLAG_INSTRUCTION 0x100
#define FAULT_FLAG_INTERRUPTIBLE 0x200
+#define FAULT_FLAG_UFFD_WP 0x400
/*
* The default fault flags that should be used by most of the
diff --git a/mm/memory.c b/mm/memory.c
index 394c2602dce7..0b687f0be4d0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3797,6 +3797,7 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
+ bool pte_changed, uffd_wp = vmf->flags & FAULT_FLAG_UFFD_WP;
bool write = vmf->flags & FAULT_FLAG_WRITE;
pte_t entry;
vm_fault_t ret;
@@ -3807,14 +3808,27 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
return ret;
}
+ /*
+ * Note: besides pte missing, FAULT_FLAG_UFFD_WP could also trigger
+ * this path where vmf->pte got released before reaching here. In that
+ * case, even if vmf->pte==NULL, the pte actually still contains the
+ * protection pte (by pte_swp_mkuffd_wp_special()). For that case,
+ * we'd also like to allocate a new pte like pte none, but check
+ * differently for changing pte.
+ */
if (!vmf->pte) {
ret = pte_alloc_one_map(vmf);
if (ret)
return ret;
}
+ if (unlikely(uffd_wp))
+ pte_changed = !pte_swp_uffd_wp_special(*vmf->pte);
+ else
+ pte_changed = !pte_none(*vmf->pte);
+
/* Re-check under ptl */
- if (unlikely(!pte_none(*vmf->pte))) {
+ if (unlikely(pte_changed)) {
update_mmu_tlb(vma, vmf->address, vmf->pte);
return VM_FAULT_NOPAGE;
}
@@ -3824,6 +3838,11 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
entry = pte_sw_mkyoung(entry);
if (write)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (uffd_wp) {
+ /* This should only be triggered by a read fault */
+ WARN_ON_ONCE(write);
+ entry = pte_mkuffd_wp(pte_wrprotect(entry));
+ }
/* copy-on-write page */
if (write && !(vma->vm_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
@@ -3997,9 +4016,27 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
return ret;
}
+/* Return true if we should do read fault-around, false otherwise */
+static inline bool should_fault_around(struct vm_fault *vmf)
+{
+ /* No ->map_pages? No way to fault around... */
+ if (!vmf->vma->vm_ops->map_pages)
+ return false;
+
+ /*
+ * Don't do fault around for FAULT_FLAG_UFFD_WP because it means we
+ * want to recover a previously wr-protected pte. This flag is a
+ * per-pte information, so it could confuse all the pages around the
+ * current page when faulted in. Give up on that quickly.
+ */
+ if (vmf->flags & FAULT_FLAG_UFFD_WP)
+ return false;
+
+ return fault_around_bytes >> PAGE_SHIFT > 1;
+}
+
static vm_fault_t do_read_fault(struct vm_fault *vmf)
{
- struct vm_area_struct *vma = vmf->vma;
vm_fault_t ret = 0;
/*
@@ -4007,7 +4044,7 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf)
* if page by the offset is not ready to be mapped (cold cache or
* something).
*/
- if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
+ if (should_fault_around(vmf)) {
ret = do_fault_around(vmf);
if (ret)
return ret;
@@ -4322,6 +4359,68 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
return VM_FAULT_FALLBACK;
}
+static vm_fault_t uffd_wp_clear_special(struct vm_fault *vmf)
+{
+ vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ /*
+ * Be careful so that we will only recover a special uffd-wp pte into a
+ * none pte. Otherwise it means the pte could have changed, so retry.
+ */
+ if (pte_swp_uffd_wp_special(*vmf->pte))
+ pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return 0;
+}
+
+/*
+ * This is actually a page-missing access, but with uffd-wp special pte
+ * installed. It means this pte was wr-protected before being unmapped.
+ */
+vm_fault_t uffd_wp_handle_special(struct vm_fault *vmf)
+{
+ /* Careful! vmf->pte unmapped after return */
+ if (!pte_unmap_same(vmf))
+ return 0;
+
+ /*
+ * Just in case there're leftover special ptes even after the region
+ * got unregistered - we can simply clear them.
+ */
+ if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
+ return uffd_wp_clear_special(vmf);
+
+ /*
+ * Tell all the rest of the fault code: we're handling a special pte,
+ * always remember to arm the uffd-wp bit when installing the new pte.
+ */
+ vmf->flags |= FAULT_FLAG_UFFD_WP;
+
+ /*
+ * Let's assume this is a read fault no matter what. If it is a real
+ * write access, it'll fault again into do_wp_page() where the message
+ * will be generated before the thread yields itself.
+ *
+ * Ideally we can also handle write immediately before return, but this
+ * should be a slow path already (pte unmapped), so be simple first.
+ */
+ vmf->flags &= ~FAULT_FLAG_WRITE;
+
+ return do_fault(vmf);
+}
+
+static vm_fault_t do_swap_pte(struct vm_fault *vmf)
+{
+ /*
+ * We need to handle special swap ptes before handling ptes that
+ * contain swap entries, always.
+ */
+ if (unlikely(pte_swp_uffd_wp_special(vmf->orig_pte)))
+ return uffd_wp_handle_special(vmf);
+
+ return do_swap_page(vmf);
+}
+
/*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most
@@ -4385,7 +4484,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
}
if (!pte_present(vmf->orig_pte))
- return do_swap_page(vmf);
+ return do_swap_pte(vmf);
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
--
2.26.2
It should be handled similarly to other uffd-wp wr-protected ptes: we should
pass it over when the dst_vma has VM_UFFD_WP armed; otherwise drop it.
Signed-off-by: Peter Xu <[email protected]>
---
mm/memory.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/mm/memory.c b/mm/memory.c
index f87b5a8a098e..59d56f57ba2c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -703,8 +703,21 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
unsigned long vm_flags = dst_vma->vm_flags;
pte_t pte = *src_pte;
struct page *page;
- swp_entry_t entry = pte_to_swp_entry(pte);
+ swp_entry_t entry;
+
+ if (unlikely(is_swap_special_pte(pte))) {
+ /*
+ * uffd-wp special swap pte is the only possibility for now.
+ * If dst vma is registered with uffd-wp, copy it over.
+ * Otherwise, ignoring this pte as if it were a none pte would work.
+ */
+ WARN_ON_ONCE(!pte_swp_uffd_wp_special(pte));
+ if (userfaultfd_wp(dst_vma))
+ set_pte_at(dst_mm, addr, dst_pte, pte);
+ return 0;
+ }
+ entry = pte_to_swp_entry(pte);
if (likely(!non_swap_entry(entry))) {
if (swap_duplicate(entry) < 0)
return entry.val;
--
2.26.2
We don't have "huge" version of PTE_SWP_UFFD_WP_SPECIAL, instead when necessary
we split the thp if the huge page is uffd wr-protected previously.
However split the thp is not enough, because file-backed thp is handled totally
differently comparing to anonymous thps - rather than doing a real split, the
thp pmd will simply got dropped in __split_huge_pmd_locked().
That is definitely not enough if e.g. when there is a thp covers range [0, 2M)
but we want to wr-protect small page resides in [4K, 8K) range, because after
__split_huge_pmd() returns, there will be a none pmd.
Here we leverage the previously introduced change_protection_prepare() macro so
that we'll populate the pmd with a pgtable page. Then change_pte_range() will
do all the rest for us, e.g., install the uffd-wp swap special pte marker at
any pte that we'd like to wr-protect, under the protection of pgtable lock.
Signed-off-by: Peter Xu <[email protected]>
---
mm/mprotect.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c9390fd673fe..055871322007 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -296,8 +296,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
}
if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
- if (next - addr != HPAGE_PMD_SIZE) {
+ if (next - addr != HPAGE_PMD_SIZE ||
+ /* Uffd wr-protecting a file-backed memory range */
+ unlikely(!vma_is_anonymous(vma) &&
+ (cp_flags & MM_CP_UFFD_WP))) {
__split_huge_pmd(vma, pmd, addr, false, NULL);
+ /*
+ * For file-backed, the pmd could have been
+ * gone; still provide a pte pgtable if needed.
+ */
+ change_protection_prepare(vma, pmd, addr, cp_flags);
} else {
int nr_ptes = change_huge_pmd(vma, pmd, addr,
newprot, cp_flags);
--
2.26.2
Note that the special uffd-wp swap pte can be left over even if the page under
the pte got evicted. Normally when evicting a page, we unmap its ptes by
walking through the reverse mapping. However we never tracked such information
for the special swap ptes, because they're not real mappings but just markers.
So we need to take care of the case where we see a marker that is actually
meaningless (the page behind it got evicted).
We have already taken care of that in e.g. alloc_set_pte(), where we treat the
special swap pte as pte_none() when necessary. However we also need to teach
userfaultfd itself about it, in both UFFDIO_COPY and page fault handling, so
that everything still works as expected.
Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 15 +++++++++++++++
mm/shmem.c | 13 ++++++++++++-
2 files changed, 27 insertions(+), 1 deletion(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 000b457ad087..3537a43b69c9 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -331,6 +331,21 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
*/
if (pte_none(*pte))
ret = true;
+ /*
+ * We also treat the swap special uffd-wp pte as the pte_none() here.
+ * This should in most cases be a missing event, as we never handle
+ * wr-protect upon a special uffd-wp swap pte - it should first be
+ * converted into a normal read request before handling wp. It just
+ * means the page/swap cache that backing this pte is gone, so this
+ * special pte is leftover.
+ *
+ * We can't simply replace it with a none pte because we're not with
+ * the pgtable lock here. Instead of taking it and clearing the pte,
+ * the easy way is to let UFFDIO_COPY understand this pte too when
+ * trying to install a new page onto it.
+ */
+ if (pte_swp_uffd_wp_special(*pte))
+ ret = true;
if (!pte_write(*pte) && (reason & VM_UFFD_WP))
ret = true;
pte_unmap(pte);
diff --git a/mm/shmem.c b/mm/shmem.c
index de45333626f7..9947bcf92663 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2456,7 +2456,18 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
goto out_release_unlock;
ret = -EEXIST;
- if (!pte_none(*dst_pte))
+ /*
+ * Besides the none pte, we also allow UFFDIO_COPY to install a pte
+ * onto the uffd-wp swap special pte, because that pte should be the
+ * same as a pte_none(), except that it carries wr-protect information
+ * (which can only be dropped when the memory is unmapped).
+ *
+ * It's safe to drop that marker because we know this is part of a
+ * MISSING fault, and the caller is very clear about this page missing
+ * rather than wr-protected. Then we're sure the wr-protect bit is
+ * just a leftover so it's useless already.
+ */
+ if (!pte_none(*dst_pte) && !pte_swp_uffd_wp_special(*dst_pte))
goto out_release_unlock;
lru_cache_add(page);
--
2.26.2
File-backed memory differs from anonymous memory in that even if the pte is
missing, the data could still reside either in the file or in the page/swap
cache. So when wr-protecting a pte, we need to consider none ptes too.
We do that by installing the uffd-wp special swap pte as a marker. So when
there's a future write to the pte, the fault handler will go through the
special path to first fault in the page as read-only, and then report to the
userfaultfd server with the wr-protect message.
On the other hand, when unprotecting a page, it's also possible that the pte
got unmapped but replaced by the special uffd-wp marker. Then we'll need to be
able to recover the uffd-wp special swap pte into a none pte, so that the next
access to the page will fault in correctly as usual, rather than sending a
uffd-wp message.
Special care needs to be taken throughout the change_protection_range()
process. Since we now allow the user to wr-protect a none pte, we need to be
able to pre-populate the page table entries if we see !anonymous &&
MM_CP_UFFD_WP requests; otherwise change_protection_range() will always skip
when the pgtable entry does not exist.
Note that this patch only covers the small pages (pte level) and does not
cover transparent huge pages yet. But this will be a base for thps too.
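For completeness, the userspace side that ends up in the uffd_wp_resolve path
below looks roughly like this (illustrative only; uffd, addr and len are
assumed to come from an earlier UFFDIO_REGISTER_MODE_WP setup):

/* With UFFDIO_WRITEPROTECT_MODE_WP absent from .mode, the kernel takes the
 * uffd_wp_resolve path and recovers any uffd-wp special swap pte in the
 * range back into a none pte. */
struct uffdio_writeprotect wp = {
	.range = { .start = (unsigned long)addr, .len = len },
	.mode  = 0,
};

if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
	perror("UFFDIO_WRITEPROTECT (unprotect)");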
Signed-off-by: Peter Xu <[email protected]>
---
mm/mprotect.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index e75bfe43cedd..c9390fd673fe 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -29,6 +29,7 @@
#include <linux/uaccess.h>
#include <linux/mm_inline.h>
#include <linux/pgtable.h>
+#include <linux/userfaultfd_k.h>
#include <asm/cacheflush.h>
#include <asm/mmu_context.h>
#include <asm/tlbflush.h>
@@ -176,6 +177,32 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
set_pte_at(vma->vm_mm, addr, pte, newpte);
pages++;
}
+ } else if (unlikely(is_swap_special_pte(oldpte))) {
+ if (uffd_wp_resolve && !vma_is_anonymous(vma) &&
+ pte_swp_uffd_wp_special(oldpte)) {
+ /*
+ * This is uffd-wp special pte and we'd like to
+ * unprotect it. What we need to do is simply
+ * recover the pte into a none pte; the next
+ * page fault will fault in the page.
+ */
+ pte_clear(vma->vm_mm, addr, pte);
+ pages++;
+ }
+ } else {
+ /* It must be a none pte, or what else?.. */
+ WARN_ON_ONCE(!pte_none(oldpte));
+ if (unlikely(uffd_wp && !vma_is_anonymous(vma))) {
+ /*
+ * For file-backed mem, we need to be able to
+ * wr-protect even for a none pte! Because
+ * even if the pte is null, the page/swap cache
+ * could exist.
+ */
+ set_pte_at(vma->vm_mm, addr, pte,
+ pte_swp_mkuffd_wp_special(vma));
+ pages++;
+ }
}
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
@@ -209,6 +236,25 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
return 0;
}
+/*
+ * File-backed vma allows uffd wr-protect upon none ptes, because even if pte
+ * is missing, page/swap cache could exist. When that happens, the wr-protect
+ * information will be stored in the page table entries with the marker (e.g.,
+ * PTE_SWP_UFFD_WP_SPECIAL). Prepare for that by always populating the page
+ * tables to pte level, so that we'll install the markers in change_pte_range()
+ * where necessary.
+ *
+ * Note that we only need to do this in pmd level, because if pmd does not
+ * exist, it means the whole range covered by the pmd entry (of a pud) does not
+ * contain any valid data but all zeros. Then nothing to wr-protect.
+ */
+#define change_protection_prepare(vma, pmd, addr, cp_flags) \
+ do { \
+ if (unlikely((cp_flags & MM_CP_UFFD_WP) && pmd_none(*pmd) && \
+ !vma_is_anonymous(vma))) \
+ WARN_ON_ONCE(pte_alloc(vma->vm_mm, pmd)); \
+ } while (0)
+
static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
pud_t *pud, unsigned long addr, unsigned long end,
pgprot_t newprot, unsigned long cp_flags)
@@ -227,6 +273,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
next = pmd_addr_end(addr, end);
+ change_protection_prepare(vma, pmd, addr, cp_flags);
+
/*
* Automatic NUMA balancing walks the tables with mmap_lock
* held for read. It's possible a parallel update to occur
--
2.26.2
Teach the hugetlbfs code to wr-protect none ptes too, in case the page cache
exists for that pte. Meanwhile we also need to be able to recognize a uffd-wp
marker pte and remove it for uffd_wp_resolve.
While at it, introduce a variable "psize" to replace all references to the
huge page size fetcher.
Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 29 +++++++++++++++++++++++++----
1 file changed, 25 insertions(+), 4 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9b9f71ec30e1..7959fb4b1633 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4978,7 +4978,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
pte_t *ptep;
pte_t pte;
struct hstate *h = hstate_vma(vma);
- unsigned long pages = 0;
+ unsigned long pages = 0, psize = huge_page_size(h);
bool shared_pmd = false;
struct mmu_notifier_range range;
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
@@ -4998,13 +4998,19 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
mmu_notifier_invalidate_range_start(&range);
i_mmap_lock_write(vma->vm_file->f_mapping);
- for (; address < end; address += huge_page_size(h)) {
+ for (; address < end; address += psize) {
spinlock_t *ptl;
- ptep = huge_pte_offset(mm, address, huge_page_size(h));
+ ptep = huge_pte_offset(mm, address, psize);
if (!ptep)
continue;
ptl = huge_pte_lock(h, mm, ptep);
if (huge_pmd_unshare(mm, vma, &address, ptep)) {
+ /*
+ * When uffd-wp is enabled on the vma, unshare
+ * shouldn't happen at all. Warn about it if it
+ * happened due to some reason.
+ */
+ WARN_ON_ONCE(uffd_wp || uffd_wp_resolve);
pages++;
spin_unlock(ptl);
shared_pmd = true;
@@ -5028,12 +5034,21 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
else if (uffd_wp_resolve)
newpte = pte_swp_clear_uffd_wp(newpte);
set_huge_swap_pte_at(mm, address, ptep,
- newpte, huge_page_size(h));
+ newpte, psize);
pages++;
}
spin_unlock(ptl);
continue;
}
+ if (unlikely(is_swap_special_pte(pte))) {
+ WARN_ON_ONCE(!pte_swp_uffd_wp_special(pte));
+ /*
+ * This is changing a non-present pte into a none pte,
+ * no need for huge_ptep_modify_prot_start/commit().
+ */
+ if (uffd_wp_resolve)
+ huge_pte_clear(mm, address, ptep, psize);
+ }
if (!huge_pte_none(pte)) {
pte_t old_pte;
@@ -5046,6 +5061,12 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
pte = huge_pte_clear_uffd_wp(pte);
huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
pages++;
+ } else {
+ /* None pte */
+ if (unlikely(uffd_wp))
+ /* Safe to modify directly (none->non-present). */
+ set_huge_pte_at(mm, address, ptep,
+ pte_swp_mkuffd_wp_special(vma));
}
spin_unlock(ptl);
}
--
2.26.2
Hook up hugetlb_fault() with the capability to handle userfaultfd-wp faults.
We do this slightly earlier than hugetlb_cow() so that we can avoid taking some
extra locks that we definitely don't need.
Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d029d938d26d..dcbbba53bd10 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4544,6 +4544,25 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
goto out_ptl;
+ /* Handle userfault-wp first, before trying to lock more pages */
+ if (userfaultfd_pte_wp(vma, huge_ptep_get(ptep)) &&
+ (flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
+ struct vm_fault vmf = {
+ .vma = vma,
+ .address = haddr,
+ .flags = flags,
+ };
+
+ spin_unlock(ptl);
+ if (pagecache_page) {
+ unlock_page(pagecache_page);
+ put_page(pagecache_page);
+ }
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ i_mmap_unlock_read(mapping);
+ return handle_userfault(&vmf, VM_UFFD_WP);
+ }
+
/*
* hugetlb_cow() requires page locks of pte_page(entry) and
* pagecache_page, so here we need take the former one
--
2.26.2
This prepares hugetlbfs to also recognize swap special ptes, such as the
uffd-wp special swap pte.
Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dd2acb8b3f0f..16a07f41880e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -82,6 +82,25 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
/* Forward declaration */
static int hugetlb_acct_memory(struct hstate *h, long delta);
+/*
+ * These are sister versions of is_swap_pte() and pte_has_swap_entry(). We
+ * need standalone ones because huge_pte_none() is handled differently from
+ * pte_none(). For more information, please refer to comments above
+ * is_swap_pte() and pte_has_swap_entry().
+ *
+ * Here we directly reuse the pte level of swap special ptes, for example, the
+ * pte_swp_uffd_wp_special(). It just stands for a huge page rather than a
+ * small page for hugetlbfs pages.
+ */
+static inline bool is_huge_swap_pte(pte_t pte)
+{
+ return !huge_pte_none(pte) && !pte_present(pte);
+}
+static inline bool huge_pte_has_swap_entry(pte_t pte)
+{
+ return is_huge_swap_pte(pte) && !is_swap_special_pte(pte);
+}
+
static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
{
bool free = (spool->count == 0) && (spool->used_hpages == 0);
@@ -3710,7 +3729,7 @@ bool is_hugetlb_entry_migration(pte_t pte)
{
swp_entry_t swp;
- if (huge_pte_none(pte) || pte_present(pte))
+ if (!huge_pte_has_swap_entry(pte))
return false;
swp = pte_to_swp_entry(pte);
if (is_migration_entry(swp))
@@ -3723,7 +3742,7 @@ static bool is_hugetlb_entry_hwpoisoned(pte_t pte)
{
swp_entry_t swp;
- if (huge_pte_none(pte) || pte_present(pte))
+ if (!huge_pte_has_swap_entry(pte))
return false;
swp = pte_to_swp_entry(pte);
if (is_hwpoison_entry(swp))
--
2.26.2
Huge pmd sharing could bring problems to userfaultfd. The thing is that
userfaultfd runs its logic based on the special bits on page table entries,
but huge pmd sharing could potentially share page table entries for different
address ranges. That could cause issues in either of these cases:
- When sharing huge pmd page tables for an uffd write protected range, the
newly mapped huge pmd range will also be write protected unexpectedly, or,
- When we try to write protect a range of huge pmd shared range, we'll first
do huge_pmd_unshare() in hugetlb_change_protection(), however that also
means the UFFDIO_WRITEPROTECT could be silently skipped for the shared
region, which could lead to data loss.
While at it, a few other things are done altogether:
- Move want_pmd_share() from mm/hugetlb.c into linux/hugetlb.h, because
that's definitely something that arch code would like to use too
- ARM64 currently checks directly against CONFIG_ARCH_WANT_HUGE_PMD_SHARE when
trying to share huge pmds. Switch to the want_pmd_share() helper.
Signed-off-by: Peter Xu <[email protected]>
---
arch/arm64/mm/hugetlbpage.c | 3 +--
include/linux/hugetlb.h | 12 ++++++++++++
include/linux/userfaultfd_k.h | 9 +++++++++
mm/hugetlb.c | 5 ++---
4 files changed, 24 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 5b32ec888698..1a8ce0facfe8 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -284,8 +284,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
*/
ptep = pte_alloc_map(mm, pmdp, addr);
} else if (sz == PMD_SIZE) {
- if (IS_ENABLED(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) &&
- pud_none(READ_ONCE(*pudp)))
+ if (want_pmd_share(vma) && pud_none(READ_ONCE(*pudp)))
ptep = huge_pmd_share(mm, addr, pudp);
else
ptep = (pte_t *)pmd_alloc(mm, pudp, addr);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 7d4c5669e118..27ada597a8e6 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -11,6 +11,7 @@
#include <linux/kref.h>
#include <linux/pgtable.h>
#include <linux/gfp.h>
+#include <linux/userfaultfd_k.h>
struct ctl_table;
struct user_struct;
@@ -951,4 +952,15 @@ static inline __init void hugetlb_cma_check(void)
}
#endif
+static inline bool want_pmd_share(struct vm_area_struct *vma)
+{
+#ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
+ if (uffd_disable_huge_pmd_share(vma))
+ return false;
+ return true;
+#else
+ return false;
+#endif
+}
+
#endif /* _LINUX_HUGETLB_H */
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 7d6071a65ded..7d14444862d4 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -52,6 +52,15 @@ static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
return vma->vm_userfaultfd_ctx.ctx == vm_ctx.ctx;
}
+/*
+ * Never enable huge pmd sharing on uffd-wp registered vmas, because uffd-wp
+ * protect information is per pgtable entry.
+ */
+static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
+{
+ return vma->vm_flags & VM_UFFD_WP;
+}
+
static inline bool userfaultfd_missing(struct vm_area_struct *vma)
{
return vma->vm_flags & VM_UFFD_MISSING;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index eb7cd0c7d6d2..dd2acb8b3f0f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5386,7 +5386,7 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
*addr = ALIGN(*addr, HPAGE_SIZE * PTRS_PER_PTE) - HPAGE_SIZE;
return 1;
}
-#define want_pmd_share() (1)
+
#else /* !CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
{
@@ -5403,7 +5403,6 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
unsigned long *start, unsigned long *end)
{
}
-#define want_pmd_share() (0)
#endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
#ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB
@@ -5425,7 +5424,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
pte = (pte_t *)pud;
} else {
BUG_ON(sz != PMD_SIZE);
- if (want_pmd_share() && pud_none(*pud))
+ if (want_pmd_share(vma) && pud_none(*pud))
pte = huge_pmd_share(mm, addr, pud);
else
pte = (pte_t *)pmd_alloc(mm, pud, addr);
--
2.26.2
This is preparation work so that the per-architecture huge_pte_alloc() can
behave differently according to VMA attributes.
Signed-off-by: Peter Xu <[email protected]>
---
arch/arm64/mm/hugetlbpage.c | 2 +-
arch/ia64/mm/hugetlbpage.c | 3 ++-
arch/mips/mm/hugetlbpage.c | 4 ++--
arch/parisc/mm/hugetlbpage.c | 2 +-
arch/powerpc/mm/hugetlbpage.c | 3 ++-
arch/s390/mm/hugetlbpage.c | 2 +-
arch/sh/mm/hugetlbpage.c | 2 +-
arch/sparc/mm/hugetlbpage.c | 2 +-
include/linux/hugetlb.h | 2 +-
mm/hugetlb.c | 6 +++---
mm/userfaultfd.c | 2 +-
11 files changed, 16 insertions(+), 14 deletions(-)
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 55ecf6de9ff7..5b32ec888698 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -252,7 +252,7 @@ void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr,
set_pte(ptep, pte);
}
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, unsigned long sz)
{
pgd_t *pgdp;
diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
index b331f94d20ac..f993cb36c062 100644
--- a/arch/ia64/mm/hugetlbpage.c
+++ b/arch/ia64/mm/hugetlbpage.c
@@ -25,7 +25,8 @@ unsigned int hpage_shift = HPAGE_SHIFT_DEFAULT;
EXPORT_SYMBOL(hpage_shift);
pte_t *
-huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
+huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, unsigned long sz)
{
unsigned long taddr = htlbpage_to_page(addr);
pgd_t *pgd;
diff --git a/arch/mips/mm/hugetlbpage.c b/arch/mips/mm/hugetlbpage.c
index 77ffece9c270..c1d8f51c5255 100644
--- a/arch/mips/mm/hugetlbpage.c
+++ b/arch/mips/mm/hugetlbpage.c
@@ -21,8 +21,8 @@
#include <asm/tlb.h>
#include <asm/tlbflush.h>
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr,
- unsigned long sz)
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, unsigned long sz)
{
pgd_t *pgd;
p4d_t *p4d;
diff --git a/arch/parisc/mm/hugetlbpage.c b/arch/parisc/mm/hugetlbpage.c
index d7ba014a7fbb..e141441bfa64 100644
--- a/arch/parisc/mm/hugetlbpage.c
+++ b/arch/parisc/mm/hugetlbpage.c
@@ -44,7 +44,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
}
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, unsigned long sz)
{
pgd_t *pgd;
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 36c3800769fb..2514884c0d20 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -106,7 +106,8 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
* At this point we do the placement change only for BOOK3S 64. This would
* possibly work on other subarchs.
*/
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, unsigned long sz)
{
pgd_t *pg;
p4d_t *p4;
diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
index 3b5a4d25ca9b..da36d13ffc16 100644
--- a/arch/s390/mm/hugetlbpage.c
+++ b/arch/s390/mm/hugetlbpage.c
@@ -189,7 +189,7 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
return pte;
}
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, unsigned long sz)
{
pgd_t *pgdp;
diff --git a/arch/sh/mm/hugetlbpage.c b/arch/sh/mm/hugetlbpage.c
index 220d7bc43d2b..999ab5916e69 100644
--- a/arch/sh/mm/hugetlbpage.c
+++ b/arch/sh/mm/hugetlbpage.c
@@ -21,7 +21,7 @@
#include <asm/tlbflush.h>
#include <asm/cacheflush.h>
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, unsigned long sz)
{
pgd_t *pgd;
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index ec423b5f17dd..ae06f7df9750 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -272,7 +272,7 @@ static unsigned long huge_tte_to_size(pte_t pte)
return size;
}
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, unsigned long sz)
{
pgd_t *pgd;
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index fe1dde0afbaf..7d4c5669e118 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -162,7 +162,7 @@ extern struct list_head huge_boot_pages;
/* arch callbacks */
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, unsigned long sz);
pte_t *huge_pte_offset(struct mm_struct *mm,
unsigned long addr, unsigned long sz);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 18b236bac6cd..eb7cd0c7d6d2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3767,7 +3767,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
src_pte = huge_pte_offset(src, addr, sz);
if (!src_pte)
continue;
- dst_pte = huge_pte_alloc(dst, addr, sz);
+ dst_pte = huge_pte_alloc(dst, vma, addr, sz);
if (!dst_pte) {
ret = -ENOMEM;
break;
@@ -4484,7 +4484,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
*/
mapping = vma->vm_file->f_mapping;
i_mmap_lock_read(mapping);
- ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
+ ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
if (!ptep) {
i_mmap_unlock_read(mapping);
return VM_FAULT_OOM;
@@ -5407,7 +5407,7 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
#endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
#ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB
-pte_t *huge_pte_alloc(struct mm_struct *mm,
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, unsigned long sz)
{
pgd_t *pgd;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 480d91b783d4..3d49b888e3e8 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -291,7 +291,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
mutex_lock(&hugetlb_fault_mutex_table[hash]);
err = -ENOMEM;
- dst_pte = huge_pte_alloc(dst_mm, dst_addr, vma_hpagesize);
+ dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
if (!dst_pte) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
i_mmap_unlock_read(mapping);
--
2.26.2
Just like what we've done with shmem uffd-wp special ptes, we shouldn't drop
the uffd-wp special swap pte for hugetlb either, unless we're going to unmap
the whole vma, or we're punching a hole with the safe locks held.
For example, remove_inode_hugepages() is safe to drop uffd-wp ptes, because it
has taken the hugetlb fault mutex so that no concurrent page fault can
trigger. Meanwhile the call to hugetlb_vmdelete_list() in
hugetlbfs_punch_hole() is not safe. That's why the former call is done with
ZAP_FLAG_DROP_FILE_UFFD_WP, while the latter one cannot be.
Signed-off-by: Peter Xu <[email protected]>
---
fs/hugetlbfs/inode.c | 15 +++++++++------
include/linux/hugetlb.h | 13 ++++++++-----
mm/hugetlb.c | 27 +++++++++++++++++++++------
mm/memory.c | 5 ++++-
4 files changed, 42 insertions(+), 18 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index b5c109703daa..f9ff2ba5e47b 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -399,7 +399,8 @@ static void remove_huge_page(struct page *page)
}
static void
-hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
+hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
+ unsigned long zap_flags)
{
struct vm_area_struct *vma;
@@ -432,7 +433,7 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
}
unmap_hugepage_range(vma, vma->vm_start + v_offset, v_end,
- NULL);
+ NULL, zap_flags);
}
}
@@ -513,7 +514,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
mutex_lock(&hugetlb_fault_mutex_table[hash]);
hugetlb_vmdelete_list(&mapping->i_mmap,
index * pages_per_huge_page(h),
- (index + 1) * pages_per_huge_page(h));
+ (index + 1) * pages_per_huge_page(h),
+ ZAP_FLAG_DROP_FILE_UFFD_WP);
i_mmap_unlock_write(mapping);
}
@@ -579,7 +581,8 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
i_mmap_lock_write(mapping);
i_size_write(inode, offset);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
- hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
+ hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
+ ZAP_FLAG_DROP_FILE_UFFD_WP);
i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, offset, LLONG_MAX);
return 0;
@@ -613,8 +616,8 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
hugetlb_vmdelete_list(&mapping->i_mmap,
- hole_start >> PAGE_SHIFT,
- hole_end >> PAGE_SHIFT);
+ hole_start >> PAGE_SHIFT,
+ hole_end >> PAGE_SHIFT, 0);
i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, hole_start, hole_end);
inode_unlock(inode);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8841d118f45b..93f3c46439b2 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -121,14 +121,15 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
unsigned long *, unsigned long *, long, unsigned int,
int *);
void unmap_hugepage_range(struct vm_area_struct *,
- unsigned long, unsigned long, struct page *);
+ unsigned long, unsigned long, struct page *,
+ unsigned long);
void __unmap_hugepage_range_final(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long start, unsigned long end,
- struct page *ref_page);
+ struct page *ref_page, unsigned long zap_flags);
void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
unsigned long start, unsigned long end,
- struct page *ref_page);
+ struct page *ref_page, unsigned long zap_flags);
void hugetlb_report_meminfo(struct seq_file *);
int hugetlb_report_node_meminfo(char *buf, int len, int nid);
void hugetlb_show_meminfo(void);
@@ -353,14 +354,16 @@ static inline unsigned long hugetlb_change_protection(
static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page)
+ unsigned long end, struct page *ref_page,
+ unsigned long zap_flags)
{
BUG();
}
static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page)
+ unsigned long end, struct page *ref_page,
+ unsigned long zap_flags)
{
BUG();
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7959fb4b1633..731a26617673 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3864,7 +3864,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
unsigned long start, unsigned long end,
- struct page *ref_page)
+ struct page *ref_page, unsigned long zap_flags)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -3916,6 +3916,19 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
continue;
}
+ if (unlikely(is_swap_special_pte(pte))) {
+ WARN_ON_ONCE(!pte_swp_uffd_wp_special(pte));
+ /*
+ * Only drop the special swap uffd-wp pte if
+ * e.g. unmapping a vma or punching a hole (with proper
+ * lock held so that concurrent page fault won't happen).
+ */
+ if (zap_flags & ZAP_FLAG_DROP_FILE_UFFD_WP)
+ huge_pte_clear(mm, address, ptep, sz);
+ spin_unlock(ptl);
+ continue;
+ }
+
/*
* Migrating hugepage or HWPoisoned hugepage is already
* unmapped and its refcount is dropped, so just clear pte here.
@@ -3967,9 +3980,10 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
void __unmap_hugepage_range_final(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page)
+ unsigned long end, struct page *ref_page,
+ unsigned long zap_flags)
{
- __unmap_hugepage_range(tlb, vma, start, end, ref_page);
+ __unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
/*
* Clear this flag so that x86's huge_pmd_share page_table_shareable
@@ -3985,7 +3999,8 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
}
void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
- unsigned long end, struct page *ref_page)
+ unsigned long end, struct page *ref_page,
+ unsigned long zap_flags)
{
struct mm_struct *mm;
struct mmu_gather tlb;
@@ -4004,7 +4019,7 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
mm = vma->vm_mm;
tlb_gather_mmu(&tlb, mm, tlb_start, tlb_end);
- __unmap_hugepage_range(&tlb, vma, start, end, ref_page);
+ __unmap_hugepage_range(&tlb, vma, start, end, ref_page, zap_flags);
tlb_finish_mmu(&tlb, tlb_start, tlb_end);
}
@@ -4059,7 +4074,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
*/
if (!is_vma_resv_set(iter_vma, HPAGE_RESV_OWNER))
unmap_hugepage_range(iter_vma, address,
- address + huge_page_size(h), page);
+ address + huge_page_size(h), page, 0);
}
i_mmap_unlock_write(mapping);
}
diff --git a/mm/memory.c b/mm/memory.c
index 59d56f57ba2c..993ec7a7961a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1499,8 +1499,11 @@ static void unmap_single_vma(struct mmu_gather *tlb,
* safe to do nothing in this case.
*/
if (vma->vm_file) {
+ unsigned long zap_flags = details ?
+ details->zap_flags : 0;
i_mmap_lock_write(vma->vm_file->f_mapping);
- __unmap_hugepage_range_final(tlb, vma, start, end, NULL);
+ __unmap_hugepage_range_final(tlb, vma, start, end,
+ NULL, zap_flags);
i_mmap_unlock_write(vma->vm_file->f_mapping);
}
} else
--
2.26.2
Huge zero page is handled via a special path in copy_huge_pmd(), however it
should share most code with a normal thp page. Try to share more code with it
by removing the special path. The only leftover so far is the huge zero page
refcounting (mm_get_huge_zero_page()), because that is done separately with a
global counter.
This prepares for a future patch that modifies the huge pmd to be installed,
so that we don't need to duplicate the change explicitly for the huge zero
page case too.
Cc: Kirill A. Shutemov <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/huge_memory.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ec2bb93f7431..b64ad1947900 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1058,17 +1058,13 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
* a page table.
*/
if (is_huge_zero_pmd(pmd)) {
- struct page *zero_page;
/*
* get_huge_zero_page() will never allocate a new page here,
* since we already have a zero page to copy. It just takes a
* reference.
*/
- zero_page = mm_get_huge_zero_page(dst_mm);
- set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
- zero_page);
- ret = 0;
- goto out_unlock;
+ mm_get_huge_zero_page(dst_mm);
+ goto out_zero_page;
}
src_page = pmd_page(pmd);
@@ -1094,6 +1090,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
get_page(src_page);
page_dup_rmap(src_page, true);
add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+out_zero_page:
mm_inc_nr_ptes(dst_mm);
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
--
2.26.2
Prepare for it to be called outside of mm/hugetlb.c.
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/hugetlb.h | 8 ++++++++
mm/hugetlb.c | 8 --------
2 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 27ada597a8e6..8841d118f45b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -963,4 +963,12 @@ static inline bool want_pmd_share(struct vm_area_struct *vma)
#endif
}
+#ifndef __HAVE_ARCH_FLUSH_HUGETLB_TLB_RANGE
+/*
+ * ARCHes with special requirements for evicting HUGETLB backing TLB entries can
+ * implement this.
+ */
+#define flush_hugetlb_tlb_range(vma, addr, end) flush_tlb_range(vma, addr, end)
+#endif
+
#endif /* _LINUX_HUGETLB_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 16a07f41880e..a971513cdff6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4948,14 +4948,6 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
return i ? i : err;
}
-#ifndef __HAVE_ARCH_FLUSH_HUGETLB_TLB_RANGE
-/*
- * ARCHes with special requirements for evicting HUGETLB backing TLB entries can
- * implement this.
- */
-#define flush_hugetlb_tlb_range(vma, addr, end) flush_tlb_range(vma, addr, end)
-#endif
-
unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end,
pgprot_t newprot, unsigned long cp_flags)
--
2.26.2
Huge pmd sharing for hugetlbfs is racy with userfaultfd-wp, because
userfaultfd-wp is always based on pgtable entries, so those entries cannot be
shared.
Walk the hugetlb range and unshare all such mappings, if any, right before
UFFDIO_REGISTER succeeds and returns to userspace.
This will pair with want_pmd_share() in hugetlb code so that huge pmd sharing
is completely disabled for userfaultfd-wp registered range.
Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 43 ++++++++++++++++++++++++++++++++++++
include/linux/mmu_notifier.h | 1 +
2 files changed, 44 insertions(+)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 3537a43b69c9..3190dff39d6c 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -15,6 +15,7 @@
#include <linux/sched/signal.h>
#include <linux/sched/mm.h>
#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
#include <linux/poll.h>
#include <linux/slab.h>
#include <linux/seq_file.h>
@@ -1198,6 +1199,45 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf,
}
}
+/*
+ * This function will unconditionally remove all the shared pmd pgtable entries
+ * within the specific vma for a hugetlbfs memory range.
+ */
+static void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
+{
+ struct hstate *h = hstate_vma(vma);
+ unsigned long sz = huge_page_size(h);
+ struct mm_struct *mm = vma->vm_mm;
+ struct mmu_notifier_range range;
+ unsigned long address;
+ spinlock_t *ptl;
+ pte_t *ptep;
+
+ /*
+ * No need to call adjust_range_if_pmd_sharing_possible(), because
+ * we're going to operate on the whole vma
+ */
+ mmu_notifier_range_init(&range, MMU_NOTIFY_HUGETLB_UNSHARE,
+ 0, vma, mm, vma->vm_start, vma->vm_end);
+ mmu_notifier_invalidate_range_start(&range);
+ i_mmap_lock_write(vma->vm_file->f_mapping);
+ for (address = vma->vm_start; address < vma->vm_end; address += sz) {
+ ptep = huge_pte_offset(mm, address, sz);
+ if (!ptep)
+ continue;
+ ptl = huge_pte_lock(h, mm, ptep);
+ huge_pmd_unshare(mm, vma, &address, ptep);
+ spin_unlock(ptl);
+ }
+ flush_hugetlb_tlb_range(vma, vma->vm_start, vma->vm_end);
+ i_mmap_unlock_write(vma->vm_file->f_mapping);
+ /*
+ * No need to call mmu_notifier_invalidate_range(), see
+ * Documentation/vm/mmu_notifier.rst.
+ */
+ mmu_notifier_invalidate_range_end(&range);
+}
+
static void __wake_userfault(struct userfaultfd_ctx *ctx,
struct userfaultfd_wake_range *range)
{
@@ -1456,6 +1496,9 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
vma->vm_flags = new_flags;
vma->vm_userfaultfd_ctx.ctx = ctx;
+ if (is_vm_hugetlb_page(vma) && uffd_disable_huge_pmd_share(vma))
+ hugetlb_unshare_all_pmds(vma);
+
skip:
prev = vma;
start = vma->vm_end;
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b8200782dede..ff50c8528113 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -51,6 +51,7 @@ enum mmu_notifier_event {
MMU_NOTIFY_SOFT_DIRTY,
MMU_NOTIFY_RELEASE,
MMU_NOTIFY_MIGRATE,
+ MMU_NOTIFY_HUGETLB_UNSHARE,
};
#define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
--
2.26.2
Teach the hugetlb page fault code to understand the uffd-wp special pte. For
example, when seeing such a pte we need to convert any write fault into a read
one (which is fake - we'll retry the write later if needed). Meanwhile,
handle_userfault() must wait on the special swap pte just like it does on a
none pte.
Note that we also need to teach UFFDIO_COPY about this special pte across the
code path, so that we can safely install a new page at this special pte as
long as we know it is a stall entry (i.e. the wr-protect marker rather than a
valid mapping).
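From the monitor's point of view, a write to a wr-protected hugetlb page then
surfaces as an ordinary uffd-wp message. A rough sketch of the read side
(hypothetical helper, blocking read assumed, no error handling):

    #include <linux/userfaultfd.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Report one wr-protect fault delivered on a WP-registered range. */
    static void handle_one_fault(int uffd)
    {
        struct uffd_msg msg;

        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
            return;
        if (msg.event == UFFD_EVENT_PAGEFAULT &&
            (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
            printf("wp fault at 0x%llx\n",
                   (unsigned long long)msg.arg.pagefault.address);
        /* Resolving the fault (un-protect / UFFDIO_COPY) is left out here. */
    }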
Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 5 ++++-
mm/hugetlb.c | 35 ++++++++++++++++++++++++++++-------
mm/userfaultfd.c | 3 ++-
3 files changed, 34 insertions(+), 9 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 3190dff39d6c..3264ec46242b 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -248,8 +248,11 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
/*
* Lockless access: we're in a wait_event so it's ok if it
* changes under us.
+ *
+ * Regarding uffd-wp special case, please refer to comments in
+ * userfaultfd_must_wait().
*/
- if (huge_pte_none(pte))
+ if (huge_pte_none(pte) || pte_swp_uffd_wp_special(pte))
ret = true;
if (!huge_pte_write(pte) && (reason & VM_UFFD_WP))
ret = true;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a971513cdff6..9b9f71ec30e1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4253,7 +4253,8 @@ int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
struct vm_area_struct *vma,
struct address_space *mapping, pgoff_t idx,
- unsigned long address, pte_t *ptep, unsigned int flags)
+ unsigned long address, pte_t *ptep,
+ pte_t old_pte, unsigned int flags)
{
struct hstate *h = hstate_vma(vma);
vm_fault_t ret = VM_FAULT_SIGBUS;
@@ -4297,6 +4298,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
.vma = vma,
.address = haddr,
.flags = flags,
+ .orig_pte = old_pte,
/*
* Hard to debug if it ends up being
* used by a callee that assumes
@@ -4394,7 +4396,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
ptl = huge_pte_lock(h, mm, ptep);
ret = 0;
- if (!huge_pte_none(huge_ptep_get(ptep)))
+ if (!pte_same(huge_ptep_get(ptep), old_pte))
goto backout;
if (anon_rmap) {
@@ -4404,6 +4406,11 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
page_dup_rmap(page, true);
new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
&& (vma->vm_flags & VM_SHARED)));
+ if (unlikely(flags & FAULT_FLAG_UFFD_WP)) {
+ WARN_ON_ONCE(flags & FAULT_FLAG_WRITE);
+ /* We should have the write bit cleared already, but be safe */
+ new_pte = huge_pte_wrprotect(huge_pte_mkuffd_wp(new_pte));
+ }
set_huge_pte_at(mm, haddr, ptep, new_pte);
hugetlb_count_add(pages_per_huge_page(h), mm);
@@ -4485,9 +4492,16 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(is_hugetlb_entry_migration(entry))) {
migration_entry_wait_huge(vma, mm, ptep);
return 0;
- } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
+ } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) {
return VM_FAULT_HWPOISON_LARGE |
VM_FAULT_SET_HINDEX(hstate_index(h));
+ } else if (unlikely(is_swap_special_pte(entry))) {
+ /* Must be a uffd-wp special swap pte */
+ WARN_ON_ONCE(!pte_swp_uffd_wp_special(entry));
+ flags |= FAULT_FLAG_UFFD_WP;
+ /* Emulate a read fault */
+ flags &= ~FAULT_FLAG_WRITE;
+ }
}
/*
@@ -4519,8 +4533,13 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
mutex_lock(&hugetlb_fault_mutex_table[hash]);
entry = huge_ptep_get(ptep);
- if (huge_pte_none(entry)) {
- ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
+ /*
+ * FAULT_FLAG_UFFD_WP should be handled merely the same as pte none
+ * because it's basically a none pte with a special marker
+ */
+ if (huge_pte_none(entry) || pte_swp_uffd_wp_special(entry)) {
+ ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+ entry, flags);
goto out_mutex;
}
@@ -4651,7 +4670,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
unsigned long size;
int vm_shared = dst_vma->vm_flags & VM_SHARED;
struct hstate *h = hstate_vma(dst_vma);
- pte_t _dst_pte;
+ pte_t _dst_pte, cur_pte;
spinlock_t *ptl;
int ret;
struct page *page;
@@ -4725,8 +4744,10 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
if (idx >= size)
goto out_release_unlock;
+ cur_pte = huge_ptep_get(dst_pte);
ret = -EEXIST;
- if (!huge_pte_none(huge_ptep_get(dst_pte)))
+ /* Please refer to shmem_mfill_atomic_pte() for uffd-wp special case */
+ if (!huge_pte_none(cur_pte) && !pte_swp_uffd_wp_special(cur_pte))
goto out_release_unlock;
if (vm_shared) {
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 3d49b888e3e8..1dff5b9a2c26 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -300,7 +300,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
err = -EEXIST;
dst_pteval = huge_ptep_get(dst_pte);
- if (!huge_pte_none(dst_pteval)) {
+ if (!huge_pte_none(dst_pteval) &&
+ !pte_swp_uffd_wp_special(dst_pteval)) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
i_mmap_unlock_read(mapping);
goto out_unlock;
--
2.26.2
This starts by passing cp_flags into hugetlb_change_protection() so that
hugetlb will be able to handle MM_CP_UFFD_WP[_RESOLVE] requests.
huge_pte_clear_uffd_wp() is introduced to handle the case where
UFFDIO_WRITEPROTECT is requested upon migrating huge page entries.
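For illustration, this is roughly how the new path is driven from userspace.
Per the mwriteprotect_range() hunk below, start/len must be aligned to the
vma's huge page size or the ioctl fails with -EINVAL (hypothetical helper, no
error handling):

    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>

    /* Write-protect (protect != 0) or un-protect a registered hugetlb range. */
    static int uffd_wp_range(int uffd, unsigned long start, unsigned long len,
                             int protect)
    {
        struct uffdio_writeprotect wp = {
            .range = { .start = start, .len = len },
            .mode  = protect ? UFFDIO_WRITEPROTECT_MODE_WP : 0,
        };

        return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
    }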
Signed-off-by: Peter Xu <[email protected]>
---
include/asm-generic/hugetlb.h | 5 +++++
include/linux/hugetlb.h | 6 ++++--
mm/hugetlb.c | 13 ++++++++++++-
mm/mprotect.c | 3 ++-
mm/userfaultfd.c | 8 ++++++++
5 files changed, 31 insertions(+), 4 deletions(-)
diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index 548212eccbd6..181cdc3297e7 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -32,6 +32,11 @@ static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
return pte_mkuffd_wp(pte);
}
+static inline pte_t huge_pte_clear_uffd_wp(pte_t pte)
+{
+ return pte_clear_uffd_wp(pte);
+}
+
static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
{
return pte_modify(pte, newprot);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index bd061f7eedcb..fe1dde0afbaf 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -185,7 +185,8 @@ struct page *follow_huge_pgd(struct mm_struct *mm, unsigned long address,
int pmd_huge(pmd_t pmd);
int pud_huge(pud_t pud);
unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
- unsigned long address, unsigned long end, pgprot_t newprot);
+ unsigned long address, unsigned long end, pgprot_t newprot,
+ unsigned long cp_flags);
bool is_hugetlb_entry_migration(pte_t pte);
@@ -343,7 +344,8 @@ static inline void move_hugetlb_state(struct page *oldpage,
static inline unsigned long hugetlb_change_protection(
struct vm_area_struct *vma, unsigned long address,
- unsigned long end, pgprot_t newprot)
+ unsigned long end, pgprot_t newprot,
+ unsigned long cp_flags)
{
return 0;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 563b8f70537f..18b236bac6cd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4938,7 +4938,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
#endif
unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
- unsigned long address, unsigned long end, pgprot_t newprot)
+ unsigned long address, unsigned long end,
+ pgprot_t newprot, unsigned long cp_flags)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long start = address;
@@ -4948,6 +4949,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long pages = 0;
bool shared_pmd = false;
struct mmu_notifier_range range;
+ bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+ bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
/*
* In the case of shared PMDs, the area to flush could be beyond
@@ -4988,6 +4991,10 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
make_migration_entry_read(&entry);
newpte = swp_entry_to_pte(entry);
+ if (uffd_wp)
+ newpte = pte_swp_mkuffd_wp(newpte);
+ else if (uffd_wp_resolve)
+ newpte = pte_swp_clear_uffd_wp(newpte);
set_huge_swap_pte_at(mm, address, ptep,
newpte, huge_page_size(h));
pages++;
@@ -5001,6 +5008,10 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
old_pte = huge_ptep_modify_prot_start(vma, address, ptep);
pte = pte_mkhuge(huge_pte_modify(old_pte, newprot));
pte = arch_make_huge_pte(pte, vma, NULL, 0);
+ if (uffd_wp)
+ pte = huge_pte_mkuffd_wp(huge_pte_wrprotect(pte));
+ else if (uffd_wp_resolve)
+ pte = huge_pte_clear_uffd_wp(pte);
huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
pages++;
}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 055871322007..fce87ac99117 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -416,7 +416,8 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
BUG_ON((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL);
if (is_vm_hugetlb_page(vma))
- pages = hugetlb_change_protection(vma, start, end, newprot);
+ pages = hugetlb_change_protection(vma, start, end, newprot,
+ cp_flags);
else
pages = change_protection_range(vma, start, end, newprot,
cp_flags);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index b00e5e6b8b8b..480d91b783d4 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -644,6 +644,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
unsigned long len, bool enable_wp, bool *mmap_changing)
{
struct vm_area_struct *dst_vma;
+ unsigned long page_mask;
pgprot_t newprot;
int err;
@@ -680,6 +681,13 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
if (!vma_is_anonymous(dst_vma))
goto out_unlock;
+ if (is_vm_hugetlb_page(dst_vma)) {
+ err = -EINVAL;
+ page_mask = vma_kernel_pagesize(dst_vma) - 1;
+ if ((start & page_mask) || (len & page_mask))
+ goto out_unlock;
+ }
+
if (enable_wp)
newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
else
--
2.26.2
Firstly, pass the wp_copy variable into hugetlb_mcopy_atomic_pte() throughout
the stack. Then, apply the UFFD_WP bit if UFFDIO_COPY_MODE_WP is specified
with UFFDIO_COPY. Introduce huge_pte_mkuffd_wp() for it.
Note that, similar to how we handled shmem, we keep setting the dirty bit even
if UFFDIO_COPY_MODE_WP is provided, so that the core mm knows this page
contains valid data and will never drop it.
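As a usage sketch (hypothetical helper): copy in one huge page and leave it
wr-protected, so the very first write after the copy is reported to the
monitor. It assumes the range is registered with both
UFFDIO_REGISTER_MODE_MISSING and UFFDIO_REGISTER_MODE_WP, and that src points
to a buffer of the vma's huge page size:

    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>

    /* Resolve a missing fault with a wr-protected copy of one huge page. */
    static int uffd_copy_wp(int uffd, unsigned long dst, void *src,
                            unsigned long hpage_size)
    {
        struct uffdio_copy copy = {
            .dst  = dst,
            .src  = (unsigned long)src,
            .len  = hpage_size,
            .mode = UFFDIO_COPY_MODE_WP,
        };

        return ioctl(uffd, UFFDIO_COPY, &copy);
    }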
Signed-off-by: Peter Xu <[email protected]>
---
include/asm-generic/hugetlb.h | 5 +++++
include/linux/hugetlb.h | 6 ++++--
mm/hugetlb.c | 9 +++++++--
mm/userfaultfd.c | 12 ++++++++----
4 files changed, 24 insertions(+), 8 deletions(-)
diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index 8e1e6244a89d..548212eccbd6 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -27,6 +27,11 @@ static inline pte_t huge_pte_mkdirty(pte_t pte)
return pte_mkdirty(pte);
}
+static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
+{
+ return pte_mkuffd_wp(pte);
+}
+
static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
{
return pte_modify(pte, newprot);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ebca2ef02212..bd061f7eedcb 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -138,7 +138,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
unsigned long src_addr,
- struct page **pagep);
+ struct page **pagep,
+ bool wp_copy);
int hugetlb_reserve_pages(struct inode *inode, long from, long to,
struct vm_area_struct *vma,
vm_flags_t vm_flags);
@@ -313,7 +314,8 @@ static inline int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
unsigned long src_addr,
- struct page **pagep)
+ struct page **pagep,
+ bool wp_copy)
{
BUG();
return 0;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dcbbba53bd10..563b8f70537f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4624,7 +4624,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
unsigned long src_addr,
- struct page **pagep)
+ struct page **pagep,
+ bool wp_copy)
{
struct address_space *mapping;
pgoff_t idx;
@@ -4717,8 +4718,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
}
_dst_pte = make_huge_pte(dst_vma, page, dst_vma->vm_flags & VM_WRITE);
- if (dst_vma->vm_flags & VM_WRITE)
+ if (dst_vma->vm_flags & VM_WRITE) {
_dst_pte = huge_pte_mkdirty(_dst_pte);
+ if (wp_copy)
+ _dst_pte = huge_pte_mkuffd_wp(
+ huge_pte_wrprotect(_dst_pte));
+ }
_dst_pte = pte_mkyoung(_dst_pte);
set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 6d4b3b7c7f9f..b00e5e6b8b8b 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -207,7 +207,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long src_start,
unsigned long len,
- bool zeropage)
+ bool zeropage,
+ bool wp_copy)
{
int vm_alloc_shared = dst_vma->vm_flags & VM_SHARED;
int vm_shared = dst_vma->vm_flags & VM_SHARED;
@@ -306,7 +307,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
}
err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
- dst_addr, src_addr, &page);
+ dst_addr, src_addr, &page,
+ wp_copy);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
i_mmap_unlock_read(mapping);
@@ -408,7 +410,8 @@ extern ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long src_start,
unsigned long len,
- bool zeropage);
+ bool zeropage,
+ bool wp_copy);
#endif /* CONFIG_HUGETLB_PAGE */
static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
@@ -527,7 +530,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
*/
if (is_vm_hugetlb_page(dst_vma))
return __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
- src_start, len, zeropage);
+ src_start, len, zeropage,
+ wp_copy);
if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
goto out_unlock;
--
2.26.2
We've had all the necessary changes ready for both shmem and hugetlbfs. Turn
on all the shmem/hugetlbfs switches for userfaultfd-wp.
Now we can remove the flags parameter of vma_can_userfault() since it is not
used any more. Meanwhile, we can expand UFFD_API_RANGE_IOCTLS_BASIC with
_UFFDIO_WRITEPROTECT too, because all existing vma types now support the write
protection mode.
Since vma_can_userfault() will be used elsewhere, move it into
userfaultfd_k.h.
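Userspace can detect the new capability from the ioctls bitmap returned by
UFFDIO_REGISTER; a minimal sketch (hypothetical helper):

    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>

    /* Returns non-zero if the just-registered range supports wr-protect. */
    static int range_supports_wp(int uffd, struct uffdio_register *reg)
    {
        if (ioctl(uffd, UFFDIO_REGISTER, reg) < 0)
            return 0;
        /* With this series, shmem/hugetlbfs ranges report this bit too. */
        return !!(reg->ioctls & ((__u64)1 << _UFFDIO_WRITEPROTECT));
    }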
Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 17 ++++-------------
include/linux/userfaultfd_k.h | 7 +++++++
include/uapi/linux/userfaultfd.h | 3 ++-
mm/userfaultfd.c | 10 +++-------
4 files changed, 16 insertions(+), 21 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 3264ec46242b..88ad90fc8539 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1307,15 +1307,6 @@ static __always_inline int validate_range(struct mm_struct *mm,
return 0;
}
-static inline bool vma_can_userfault(struct vm_area_struct *vma,
- unsigned long vm_flags)
-{
- /* FIXME: add WP support to hugetlbfs and shmem */
- return vma_is_anonymous(vma) ||
- ((is_vm_hugetlb_page(vma) || vma_is_shmem(vma)) &&
- !(vm_flags & VM_UFFD_WP));
-}
-
static int userfaultfd_register(struct userfaultfd_ctx *ctx,
unsigned long arg)
{
@@ -1394,7 +1385,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
/* check not compatible vmas */
ret = -EINVAL;
- if (!vma_can_userfault(cur, vm_flags))
+ if (!vma_can_userfault(cur))
goto out_unlock;
/*
@@ -1453,7 +1444,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
do {
cond_resched();
- BUG_ON(!vma_can_userfault(vma, vm_flags));
+ BUG_ON(!vma_can_userfault(vma));
BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
vma->vm_userfaultfd_ctx.ctx != ctx);
WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
@@ -1602,7 +1593,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
* provides for more strict behavior to notice
* unregistration errors.
*/
- if (!vma_can_userfault(cur, cur->vm_flags))
+ if (!vma_can_userfault(cur))
goto out_unlock;
found = true;
@@ -1616,7 +1607,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
do {
cond_resched();
- BUG_ON(!vma_can_userfault(vma, vma->vm_flags));
+ BUG_ON(!vma_can_userfault(vma));
/*
* Nothing to do: this vma is already registered into this
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 7d14444862d4..fd7031173949 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -16,6 +16,7 @@
#include <linux/fcntl.h>
#include <linux/mm.h>
#include <asm-generic/pgtable_uffd.h>
+#include <linux/hugetlb_inline.h>
/*
* CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
@@ -88,6 +89,12 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
}
+static inline bool vma_can_userfault(struct vm_area_struct *vma)
+{
+ return vma_is_anonymous(vma) || vma_is_shmem(vma) ||
+ is_vm_hugetlb_page(vma);
+}
+
extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
extern void dup_userfaultfd_complete(struct list_head *);
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index e7e98bde221f..83bcd739de50 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -39,7 +39,8 @@
(__u64)1 << _UFFDIO_WRITEPROTECT)
#define UFFD_API_RANGE_IOCTLS_BASIC \
((__u64)1 << _UFFDIO_WAKE | \
- (__u64)1 << _UFFDIO_COPY)
+ (__u64)1 << _UFFDIO_COPY | \
+ (__u64)1 << _UFFDIO_WRITEPROTECT)
/*
* Valid ioctl command number range with this API is from 0x00 to
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 1dff5b9a2c26..3ad52f01553b 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -445,7 +445,6 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
err = mfill_zeropage_pte(dst_mm, dst_pmd,
dst_vma, dst_addr);
} else {
- VM_WARN_ON_ONCE(wp_copy);
if (!zeropage)
err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd,
dst_vma, dst_addr,
@@ -671,15 +670,12 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
err = -ENOENT;
dst_vma = find_dst_vma(dst_mm, start, len);
- /*
- * Make sure the vma is not shared, that the dst range is
- * both valid and fully within a single existing vma.
- */
- if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+
+ if (!dst_vma)
goto out_unlock;
if (!userfaultfd_wp(dst_vma))
goto out_unlock;
- if (!vma_is_anonymous(dst_vma))
+ if (!vma_can_userfault(dst_vma))
goto out_unlock;
if (is_vm_hugetlb_page(dst_vma)) {
--
2.26.2
Give unmap_mapping_pages() more power by allowing callers to specify zap
flags, so that it can pass in more information than just "whether we'd also
like to zap cow pages".
With the new flags, we can remove the even_cows parameter: even_cows==false is
equivalent to passing zap_flags==ZAP_FLAG_CHECK_MAPPING, while even_cows==true
means passing no zap flag at all (though in most cases we have had
even_cows==false).
No functional change intended.
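The conversion rule is mechanical; spelled out as a hypothetical helper (not
part of the patch):

    /* Hypothetical helper, only to spell out the even_cows conversion rule. */
    static inline unsigned long even_cows_to_zap_flags(bool even_cows)
    {
        /* even_cows == false used to mean "only zap this mapping's pages". */
        return even_cows ? 0 : ZAP_FLAG_CHECK_MAPPING;
    }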
Signed-off-by: Peter Xu <[email protected]>
---
fs/dax.c | 10 ++++++----
include/linux/mm.h | 4 ++--
mm/khugepaged.c | 3 ++-
mm/memory.c | 15 ++++++++-------
mm/truncate.c | 11 +++++++----
5 files changed, 25 insertions(+), 18 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 5b47834f2e1b..6a123c2bfc59 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -517,7 +517,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
xas_unlock_irq(xas);
unmap_mapping_pages(mapping,
xas->xa_index & ~PG_PMD_COLOUR,
- PG_PMD_NR, false);
+ PG_PMD_NR, ZAP_FLAG_CHECK_MAPPING);
xas_reset(xas);
xas_lock_irq(xas);
}
@@ -612,7 +612,8 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping,
* guaranteed to either see new references or prevent new
* references from being established.
*/
- unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0);
+ unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1,
+ ZAP_FLAG_CHECK_MAPPING);
xas_lock_irq(&xas);
xas_for_each(&xas, entry, end_idx) {
@@ -743,9 +744,10 @@ static void *dax_insert_entry(struct xa_state *xas,
/* we are replacing a zero page with block mapping */
if (dax_is_pmd_entry(entry))
unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR,
- PG_PMD_NR, false);
+ PG_PMD_NR, ZAP_FLAG_CHECK_MAPPING);
else /* pte entry */
- unmap_mapping_pages(mapping, index, 1, false);
+ unmap_mapping_pages(mapping, index, 1,
+ ZAP_FLAG_CHECK_MAPPING);
}
xas_reset(xas);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0b1d04404275..57bb3d680844 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1710,7 +1710,7 @@ extern int fixup_user_fault(struct mm_struct *mm,
unsigned long address, unsigned int fault_flags,
bool *unlocked);
void unmap_mapping_pages(struct address_space *mapping,
- pgoff_t start, pgoff_t nr, bool even_cows);
+ pgoff_t start, pgoff_t nr, unsigned long zap_flags);
void unmap_mapping_range(struct address_space *mapping,
loff_t const holebegin, loff_t const holelen, int even_cows);
#else
@@ -1730,7 +1730,7 @@ static inline int fixup_user_fault(struct mm_struct *mm, unsigned long address,
return -EFAULT;
}
static inline void unmap_mapping_pages(struct address_space *mapping,
- pgoff_t start, pgoff_t nr, bool even_cows) { }
+ pgoff_t start, pgoff_t nr, unsigned long zap_flags) { }
static inline void unmap_mapping_range(struct address_space *mapping,
loff_t const holebegin, loff_t const holelen, int even_cows) { }
#endif
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 20807163a25f..981d7abb09ef 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1817,7 +1817,8 @@ static void collapse_file(struct mm_struct *mm,
}
if (page_mapped(page))
- unmap_mapping_pages(mapping, index, 1, false);
+ unmap_mapping_pages(mapping, index, 1,
+ ZAP_FLAG_CHECK_MAPPING);
xas_lock_irq(&xas);
xas_set(&xas, index);
diff --git a/mm/memory.c b/mm/memory.c
index 873b2515e187..afe09fccdee1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3156,7 +3156,10 @@ static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
* @mapping: The address space containing pages to be unmapped.
* @start: Index of first page to be unmapped.
* @nr: Number of pages to be unmapped. 0 to unmap to end of file.
- * @even_cows: Whether to unmap even private COWed pages.
+ * @zap_flags: Zap flags for the process. E.g., when ZAP_FLAG_CHECK_MAPPING is
+ * passed into it, we will only zap the pages that are in the same mapping
+ * specified in the @mapping parameter; otherwise we will not check mapping,
+ * IOW cow pages will be zapped too.
*
* Unmap the pages in this address space from any userspace process which
* has them mmaped. Generally, you want to remove COWed pages as well when
@@ -3164,17 +3167,14 @@ static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
* cache.
*/
void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
- pgoff_t nr, bool even_cows)
+ pgoff_t nr, unsigned long zap_flags)
{
pgoff_t first_index = start, last_index = start + nr - 1;
struct zap_details details = {
.zap_mapping = mapping,
- .zap_flags = ZAP_FLAG_SKIP_SWAP,
+ .zap_flags = zap_flags | ZAP_FLAG_SKIP_SWAP,
};
- if (!even_cows)
- details.zap_flags |= ZAP_FLAG_CHECK_MAPPING;
-
if (last_index < first_index)
last_index = ULONG_MAX;
@@ -3216,7 +3216,8 @@ void unmap_mapping_range(struct address_space *mapping,
hlen = ULONG_MAX - hba + 1;
}
- unmap_mapping_pages(mapping, hba, hlen, even_cows);
+ unmap_mapping_pages(mapping, hba, hlen, even_cows ?
+ 0 : ZAP_FLAG_CHECK_MAPPING);
}
EXPORT_SYMBOL(unmap_mapping_range);
diff --git a/mm/truncate.c b/mm/truncate.c
index 960edf5803ca..dac66749e400 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -178,7 +178,8 @@ truncate_cleanup_page(struct address_space *mapping, struct page *page)
{
if (page_mapped(page)) {
unsigned int nr = thp_nr_pages(page);
- unmap_mapping_pages(mapping, page->index, nr, false);
+ unmap_mapping_pages(mapping, page->index, nr,
+ ZAP_FLAG_CHECK_MAPPING);
}
if (page_has_private(page))
@@ -750,14 +751,15 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
* Zap the rest of the file in one hit.
*/
unmap_mapping_pages(mapping, index,
- (1 + end - index), false);
+ (1 + end - index),
+ ZAP_FLAG_CHECK_MAPPING);
did_range_unmap = 1;
} else {
/*
* Just zap this page
*/
unmap_mapping_pages(mapping, index,
- 1, false);
+ 1, ZAP_FLAG_CHECK_MAPPING);
}
}
BUG_ON(page_mapped(page));
@@ -783,7 +785,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
* get remapped later.
*/
if (dax_mapping(mapping)) {
- unmap_mapping_pages(mapping, start, end - start + 1, false);
+ unmap_mapping_pages(mapping, start, end - start + 1,
+ ZAP_FLAG_CHECK_MAPPING);
}
out:
cleancache_invalidate_inode(mapping);
--
2.26.2
The first_index/last_index parameters in zap_details are actually only used
in unmap_mapping_range_tree(). Meanwhile, that function is only called once,
from unmap_mapping_pages(). Instead of passing these two variables through the
whole stack of page zapping code, remove them from zap_details and simply make
them parameters of unmap_mapping_range_tree(), which is inlined.
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 2 --
mm/memory.c | 20 ++++++++++----------
2 files changed, 10 insertions(+), 12 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 85d928764b64..faf9538c13b2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1635,8 +1635,6 @@ extern void user_shm_unlock(size_t, struct user_struct *);
*/
struct zap_details {
struct address_space *check_mapping; /* Check page->mapping if set */
- pgoff_t first_index; /* Lowest page->index to unmap */
- pgoff_t last_index; /* Highest page->index to unmap */
};
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index 0b687f0be4d0..dd49dea276e3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3145,20 +3145,20 @@ static void unmap_mapping_range_vma(struct vm_area_struct *vma,
}
static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
+ pgoff_t first_index,
+ pgoff_t last_index,
struct zap_details *details)
{
struct vm_area_struct *vma;
pgoff_t vba, vea, zba, zea;
- vma_interval_tree_foreach(vma, root,
- details->first_index, details->last_index) {
-
+ vma_interval_tree_foreach(vma, root, first_index, last_index) {
vba = vma->vm_pgoff;
vea = vba + vma_pages(vma) - 1;
- zba = details->first_index;
+ zba = first_index;
if (zba < vba)
zba = vba;
- zea = details->last_index;
+ zea = last_index;
if (zea > vea)
zea = vea;
@@ -3184,17 +3184,17 @@ static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
pgoff_t nr, bool even_cows)
{
+ pgoff_t first_index = start, last_index = start + nr - 1;
struct zap_details details = { };
details.check_mapping = even_cows ? NULL : mapping;
- details.first_index = start;
- details.last_index = start + nr - 1;
- if (details.last_index < details.first_index)
- details.last_index = ULONG_MAX;
+ if (last_index < first_index)
+ last_index = ULONG_MAX;
i_mmap_lock_write(mapping);
if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
- unmap_mapping_range_tree(&mapping->i_mmap, &details);
+ unmap_mapping_range_tree(&mapping->i_mmap, first_index,
+ last_index, &details);
i_mmap_unlock_write(mapping);
}
--
2.26.2
pte_unmap_same() will always unmap the pte pointer. After the unmap, vmf->pte
will not be valid any more, so we should clear it.
It was previously safe only because no one accesses vmf->pte after
pte_unmap_same() returns, since the only caller of pte_unmap_same() (so far)
is do_swap_page(), where vmf->pte will in most cases be overwritten very soon.
pte_unmap_same() will be used in other places in follow-up patches, where
vmf->pte will not always be re-written. This patch enables us to call
functions like finish_fault(), because that will conditionally unmap the pte
by checking vmf->pte first. Or, alloc_set_pte() will make sure to allocate a
new pte even after calling pte_unmap_same().
Since we'll need to modify vmf->pte, pass vmf directly into pte_unmap_same();
that also avoids the long parameter list.
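A short fragment of the caller-side contract this establishes (illustration
only, not a complete function):

    if (!pte_unmap_same(vmf))
        goto out;               /* the pte changed under us, bail out */
    /*
     * vmf->pte is NULL here.  Helpers that check vmf->pte before touching
     * it (e.g. finish_fault()) will map or allocate a fresh pte instead of
     * dereferencing the stale pointer.
     */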
Signed-off-by: Peter Xu <[email protected]>
---
mm/memory.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index d6d2873368e1..5ab3106cdd35 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2559,19 +2559,20 @@ EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
* proceeding (but do_wp_page is only called after already making such a check;
* and do_anonymous_page can safely check later on).
*/
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
- pte_t *page_table, pte_t orig_pte)
+static inline int pte_unmap_same(struct vm_fault *vmf)
{
int same = 1;
#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
if (sizeof(pte_t) > sizeof(unsigned long)) {
- spinlock_t *ptl = pte_lockptr(mm, pmd);
+ spinlock_t *ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
spin_lock(ptl);
- same = pte_same(*page_table, orig_pte);
+ same = pte_same(*vmf->pte, vmf->orig_pte);
spin_unlock(ptl);
}
#endif
- pte_unmap(page_table);
+ pte_unmap(vmf->pte);
+ /* After unmap of pte, the pointer is invalid now - clear it. */
+ vmf->pte = NULL;
return same;
}
@@ -3251,7 +3252,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vm_fault_t ret = 0;
void *shadow = NULL;
- if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
+ if (!pte_unmap_same(vmf))
goto out;
entry = pte_to_swp_entry(vmf->orig_pte);
--
2.26.2
We tried to do something similar in b569a1760782 ("userfaultfd: wp: drop
_PAGE_UFFD_WP properly when fork") previously, but it did not get everything
right. A few fixes around that code path:
1. We were referencing the VM_UFFD_WP vm_flags on the _old_ vma rather than
the new vma. That was overlooked in b569a1760782, so it did not work as
expected. Thanks to the recent rework of the fork code (7a4830c380f3a8b3), we
can easily get the new vma now, so switch the checks to that (see the
userspace sketch below for when this matters).
2. Dropping the uffd-wp bit in copy_huge_pmd() could be wrong if the huge pmd
is a migration huge pmd. When that happens, instead of using pmd_uffd_wp(),
we should use pmd_swp_uffd_wp(). The fix is simply to handle them separately.
3. We forgot to carry over the uffd-wp bit for a write migration huge pmd
entry. This also happens in copy_huge_pmd(), where we convert a write huge
migration entry into a read one.
4. In copy_nonpresent_pte(), drop the uffd-wp bit if necessary for swap ptes.
5. In copy_present_page(), when COW is enforced during fork(), we also need to
carry over the uffd-wp bit if VM_UFFD_WP is armed on the new vma and the pte
to be copied has the uffd-wp bit set.
Remove the comment in copy_present_pte() about this: commenting only there
does not help much, and commenting everywhere would be overkill. Let's assume
the commit message is enough.
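As referenced in item 1, the new-vma check only matters when the uffd context
is carried into the child, i.e. when the uffd fork event is enabled. A minimal
userspace sketch of requesting it (hypothetical helper, error handling
trimmed):

    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>

    /* Ask for fork events so the uffd context (and wp bits) follows the child. */
    static int uffd_enable_fork_events(int uffd)
    {
        struct uffdio_api api = {
            .api      = UFFD_API,
            .features = UFFD_FEATURE_EVENT_FORK,
        };

        if (ioctl(uffd, UFFDIO_API, &api) < 0)
            return -1;
        return (api.features & UFFD_FEATURE_EVENT_FORK) ? 0 : -1;
    }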
Cc: Jerome Glisse <[email protected]>
Cc: Mike Rapoport <[email protected]>
Fixes: b569a1760782 ("userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork")
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/huge_mm.h | 3 ++-
mm/huge_memory.c | 23 ++++++++++-------------
mm/memory.c | 24 +++++++++++++-----------
3 files changed, 25 insertions(+), 25 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 0365aa97f8e7..77b8d2132c3a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -10,7 +10,8 @@
extern vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
- struct vm_area_struct *vma);
+ struct vm_area_struct *src_vma,
+ struct vm_area_struct *dst_vma);
extern void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd);
extern int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b64ad1947900..35d4acac6874 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -996,7 +996,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
- struct vm_area_struct *vma)
+ struct vm_area_struct *src_vma, struct vm_area_struct *dst_vma)
{
spinlock_t *dst_ptl, *src_ptl;
struct page *src_page;
@@ -1005,7 +1005,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
int ret = -ENOMEM;
/* Skip if can be re-fill on fault */
- if (!vma_is_anonymous(vma))
+ if (!vma_is_anonymous(dst_vma))
return 0;
pgtable = pte_alloc_one(dst_mm);
@@ -1019,14 +1019,6 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
ret = -EAGAIN;
pmd = *src_pmd;
- /*
- * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA
- * does not have the VM_UFFD_WP, which means that the uffd
- * fork event is not enabled.
- */
- if (!(vma->vm_flags & VM_UFFD_WP))
- pmd = pmd_clear_uffd_wp(pmd);
-
#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
if (unlikely(is_swap_pmd(pmd))) {
swp_entry_t entry = pmd_to_swp_entry(pmd);
@@ -1037,11 +1029,15 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd = swp_entry_to_pmd(entry);
if (pmd_swp_soft_dirty(*src_pmd))
pmd = pmd_swp_mksoft_dirty(pmd);
+ if (pmd_swp_uffd_wp(*src_pmd))
+ pmd = pmd_swp_mkuffd_wp(pmd);
set_pmd_at(src_mm, addr, src_pmd, pmd);
}
add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
mm_inc_nr_ptes(dst_mm);
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+ if (!(dst_vma->vm_flags & VM_UFFD_WP))
+ pmd = pmd_swp_clear_uffd_wp(pmd);
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
ret = 0;
goto out_unlock;
@@ -1077,13 +1073,13 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
* best effort that the pinned pages won't be replaced by another
* random page during the coming copy-on-write.
*/
- if (unlikely(is_cow_mapping(vma->vm_flags) &&
+ if (unlikely(is_cow_mapping(src_vma->vm_flags) &&
atomic_read(&src_mm->has_pinned) &&
page_maybe_dma_pinned(src_page))) {
pte_free(dst_mm, pgtable);
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
- __split_huge_pmd(vma, src_pmd, addr, false, NULL);
+ __split_huge_pmd(src_vma, src_pmd, addr, false, NULL);
return -EAGAIN;
}
@@ -1093,8 +1089,9 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
out_zero_page:
mm_inc_nr_ptes(dst_mm);
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
-
pmdp_set_wrprotect(src_mm, addr, src_pmd);
+ if (!(dst_vma->vm_flags & VM_UFFD_WP))
+ pmd = pmd_clear_uffd_wp(pmd);
pmd = pmd_mkold(pmd_wrprotect(pmd));
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
diff --git a/mm/memory.c b/mm/memory.c
index c48f8df6e502..d6d2873368e1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -696,10 +696,10 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
static unsigned long
copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
- pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
- unsigned long addr, int *rss)
+ pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *src_vma,
+ struct vm_area_struct *dst_vma, unsigned long addr, int *rss)
{
- unsigned long vm_flags = vma->vm_flags;
+ unsigned long vm_flags = dst_vma->vm_flags;
pte_t pte = *src_pte;
struct page *page;
swp_entry_t entry = pte_to_swp_entry(pte);
@@ -768,6 +768,8 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
set_pte_at(src_mm, addr, src_pte, pte);
}
}
+ if (!userfaultfd_wp(dst_vma))
+ pte = pte_swp_clear_uffd_wp(pte);
set_pte_at(dst_mm, addr, dst_pte, pte);
return 0;
}
@@ -839,6 +841,9 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
/* All done, just insert the new page copy in the child */
pte = mk_pte(new_page, dst_vma->vm_page_prot);
pte = maybe_mkwrite(pte_mkdirty(pte), dst_vma);
+ if (userfaultfd_wp(dst_vma) && pte_uffd_wp(*src_pte))
+ /* Uffd-wp needs to be delivered to dest pte as well */
+ pte = pte_wrprotect(pte_mkuffd_wp(pte));
set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
return 0;
}
@@ -888,12 +893,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
pte = pte_mkclean(pte);
pte = pte_mkold(pte);
- /*
- * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA
- * does not have the VM_UFFD_WP, which means that the uffd
- * fork event is not enabled.
- */
- if (!(vm_flags & VM_UFFD_WP))
+ if (!(dst_vma->vm_flags & VM_UFFD_WP))
pte = pte_clear_uffd_wp(pte);
set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
@@ -968,7 +968,8 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
if (unlikely(!pte_present(*src_pte))) {
entry.val = copy_nonpresent_pte(dst_mm, src_mm,
dst_pte, src_pte,
- src_vma, addr, rss);
+ src_vma, dst_vma,
+ addr, rss);
if (entry.val)
break;
progress += 8;
@@ -1046,7 +1047,8 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
int err;
VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
err = copy_huge_pmd(dst_mm, src_mm,
- dst_pmd, src_pmd, addr, src_vma);
+ dst_pmd, src_pmd, addr, src_vma,
+ dst_vma);
if (err == -ENOMEM)
return -ENOMEM;
if (!err)
--
2.26.2
We used to have special swap entries, like migration entries, hw-poison
entries, device private entries, etc.
Those "special swap entries" reside in the range that they need to be at least
swap entries first, and their types are decided by swp_type(entry).
This patch introduces another idea called "special swap ptes".
It's very easy to get confused against "special swap entries", but a special
swap pte should never contain a swap entry at all. It means, it's illegal to
call pte_to_swp_entry() upon a special swap pte.
Make the uffd-wp special pte to be the first special swap pte.
Before this patch, is_swap_pte()==true means one of the below:
(a.1) The pte has a normal swap entry (non_swap_entry()==false). For
example, when an anonymous page got swapped out.
(a.2) The pte has a special swap entry (non_swap_entry()==true). For
example, a migration entry, a hw-poison entry, etc.
After this patch, is_swap_pte()==true means one of the below, where case (b) is
added:
(a) The pte contains a swap entry.
(a.1) The pte has a normal swap entry (non_swap_entry()==false). For
example, when an anonymous page got swapped out.
(a.2) The pte has a special swap entry (non_swap_entry()==true). For
example, a migration entry, a hw-poison entry, etc.
(b) The pte does not contain a swap entry at all (so it cannot be passed
into pte_to_swp_entry()). For example, uffd-wp special swap pte.
Teach the whole mm core about this new idea. It's done by introducing another
helper called pte_has_swap_entry(), which stands for cases (a.1) and (a.2).
Before this patch it is the same as is_swap_pte(), because there's no special
swap pte yet. Now, for most of the previous users of is_swap_pte() in mm core,
we'll need to use the new helper pte_has_swap_entry() instead, to make sure we
won't try to parse a swap entry from a swap special pte (which does not
contain a swap entry at all!). We either handle the swap special pte
explicitly, or it naturally falls through to the default "else" paths.
Warn properly (e.g., in do_swap_page()) when we see a special swap pte - we
should never call do_swap_page() upon those ptes, so just bail out early if it
happens.
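Schematically, after this patch a pte walker distinguishes the cases like this
(fragment for illustration only):

    if (pte_none(pte) || pte_present(pte)) {
        /* not a swap pte at all */
    } else if (is_swap_special_pte(pte)) {
        /* case (b): e.g. the uffd-wp special pte; no swap entry inside */
    } else {
        /* cases (a.1)/(a.2): pte_has_swap_entry() is true, safe to parse */
        swp_entry_t entry = pte_to_swp_entry(pte);
    }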
Signed-off-by: Peter Xu <[email protected]>
---
fs/proc/task_mmu.c | 14 ++++++++------
include/linux/swapops.h | 39 ++++++++++++++++++++++++++++++++++++++-
mm/khugepaged.c | 11 ++++++++++-
mm/memcontrol.c | 2 +-
mm/memory.c | 7 +++++++
mm/migrate.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 2 +-
mm/page_vma_mapped.c | 6 +++---
9 files changed, 70 insertions(+), 15 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ee5a235b3056..5286fd23bbf4 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -498,7 +498,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
if (pte_present(*pte)) {
page = vm_normal_page(vma, addr, *pte);
- } else if (is_swap_pte(*pte)) {
+ } else if (pte_has_swap_entry(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);
if (!non_swap_entry(swpent)) {
@@ -518,8 +518,10 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
page = migration_entry_to_page(swpent);
else if (is_device_private_entry(swpent))
page = device_private_entry_to_page(swpent);
- } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
- && pte_none(*pte))) {
+ } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) &&
+ mss->check_shmem_swap &&
+ /* Here swap special pte is the same as none pte */
+ (pte_none(*pte) || is_swap_special_pte(*pte)))) {
page = xa_load(&vma->vm_file->f_mapping->i_pages,
linear_page_index(vma, addr));
if (xa_is_value(page))
@@ -688,7 +690,7 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
if (pte_present(*pte)) {
page = vm_normal_page(vma, addr, *pte);
- } else if (is_swap_pte(*pte)) {
+ } else if (pte_has_swap_entry(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);
if (is_migration_entry(swpent))
@@ -1053,7 +1055,7 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
ptent = pte_wrprotect(old_pte);
ptent = pte_clear_soft_dirty(ptent);
ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
- } else if (is_swap_pte(ptent)) {
+ } else if (pte_has_swap_entry(ptent)) {
ptent = pte_swp_clear_soft_dirty(ptent);
set_pte_at(vma->vm_mm, addr, pte, ptent);
}
@@ -1366,7 +1368,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
page = vm_normal_page(vma, addr, pte);
if (pte_soft_dirty(pte))
flags |= PM_SOFT_DIRTY;
- } else if (is_swap_pte(pte)) {
+ } else if (pte_has_swap_entry(pte)) {
swp_entry_t entry;
if (pte_swp_soft_dirty(pte))
flags |= PM_SOFT_DIRTY;
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 7dd57303bb0c..7b7387d2892f 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -5,6 +5,7 @@
#include <linux/radix-tree.h>
#include <linux/bug.h>
#include <linux/mm_types.h>
+#include <linux/userfaultfd_k.h>
#ifdef CONFIG_MMU
@@ -52,12 +53,48 @@ static inline pgoff_t swp_offset(swp_entry_t entry)
return entry.val & SWP_OFFSET_MASK;
}
-/* check whether a pte points to a swap entry */
+/*
+ * is_swap_pte() returns true for three cases:
+ *
+ * (a) The pte contains a swap entry.
+ *
+ * (a.1) The pte has a normal swap entry (non_swap_entry()==false). For
+ * example, when an anonymous page got swapped out.
+ *
+ * (a.2) The pte has a special swap entry (non_swap_entry()==true). For
+ * example, a migration entry, a hw-poison entry, etc.
+ *
+ * (b) The pte does not contain a swap entry at all (so it cannot be passed
+ * into pte_to_swp_entry()). For example, uffd-wp special swap pte.
+ */
static inline int is_swap_pte(pte_t pte)
{
return !pte_none(pte) && !pte_present(pte);
}
+/*
+ * A swap-like special pte should only be used as special marker to trigger a
+ * page fault. We should treat them similarly as pte_none() in most cases,
+ * except that it may contain some special information that can persist within
+ * the pte. Currently the only special swap pte is UFFD_WP_SWP_PTE_SPECIAL.
+ *
+ * Note: we should never call pte_to_swp_entry() upon a special swap pte,
+ * Because a swap special pte does not contain a swap entry!
+ */
+static inline bool is_swap_special_pte(pte_t pte)
+{
+ return pte_swp_uffd_wp_special(pte);
+}
+
+/*
+ * Returns true if the pte contains a swap entry. This includes not only the
+ * normal swp entry case, but also for migration entries, etc.
+ */
+static inline bool pte_has_swap_entry(pte_t pte)
+{
+ return is_swap_pte(pte) && !is_swap_special_pte(pte);
+}
+
/*
* Convert the arch-dependent pte representation of a swp_entry_t into an
* arch-independent swp_entry_t.
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4e3dff13eb70..20807163a25f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1006,7 +1006,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
for (; vmf.address < address + HPAGE_PMD_NR*PAGE_SIZE;
vmf.pte++, vmf.address += PAGE_SIZE) {
vmf.orig_pte = *vmf.pte;
- if (!is_swap_pte(vmf.orig_pte))
+ if (!pte_has_swap_entry(vmf.orig_pte))
continue;
swapped_in++;
ret = do_swap_page(&vmf);
@@ -1238,6 +1238,15 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
_pte++, _address += PAGE_SIZE) {
pte_t pteval = *_pte;
if (is_swap_pte(pteval)) {
+ if (is_swap_special_pte(pteval)) {
+ /*
+ * Reuse SCAN_PTE_UFFD_WP. If there will be
+ * new users of is_swap_special_pte(), we'd
+ * better introduce a new result type.
+ */
+ result = SCAN_PTE_UFFD_WP;
+ goto out_unmap;
+ }
if (++unmapped <= khugepaged_max_ptes_swap) {
/*
* Always be strict with uffd-wp
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 29459a6ce1c7..3af43a218b8b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5776,7 +5776,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
if (pte_present(ptent))
page = mc_handle_present_pte(vma, addr, ptent);
- else if (is_swap_pte(ptent))
+ else if (pte_has_swap_entry(ptent))
page = mc_handle_swap_pte(vma, ptent, &ent);
else if (pte_none(ptent))
page = mc_handle_file_pte(vma, addr, ptent, &ent);
diff --git a/mm/memory.c b/mm/memory.c
index 5ab3106cdd35..394c2602dce7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3255,6 +3255,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (!pte_unmap_same(vmf))
goto out;
+ /*
+ * We should never call do_swap_page upon a swap special pte; just be
+ * safe to bail out if it happens.
+ */
+ if (WARN_ON_ONCE(is_swap_special_pte(vmf->orig_pte)))
+ goto out;
+
entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry))) {
if (is_migration_entry(entry)) {
diff --git a/mm/migrate.c b/mm/migrate.c
index 5795cb82e27c..8a5459859e17 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -318,7 +318,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
spin_lock(ptl);
pte = *ptep;
- if (!is_swap_pte(pte))
+ if (!pte_has_swap_entry(pte))
goto out;
entry = pte_to_swp_entry(pte);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 56c02beb6041..e75bfe43cedd 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -139,7 +139,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
pages++;
- } else if (is_swap_pte(oldpte)) {
+ } else if (pte_has_swap_entry(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
pte_t newpte;
diff --git a/mm/mremap.c b/mm/mremap.c
index 138abbae4f75..f736fcbe1247 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -106,7 +106,7 @@ static pte_t move_soft_dirty_pte(pte_t pte)
#ifdef CONFIG_MEM_SOFT_DIRTY
if (pte_present(pte))
pte = pte_mksoft_dirty(pte);
- else if (is_swap_pte(pte))
+ else if (pte_has_swap_entry(pte))
pte = pte_swp_mksoft_dirty(pte);
#endif
return pte;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 5e77b269c330..c97884007232 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -36,7 +36,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
* For more details on device private memory see HMM
* (include/linux/hmm.h or mm/hmm.c).
*/
- if (is_swap_pte(*pvmw->pte)) {
+ if (pte_has_swap_entry(*pvmw->pte)) {
swp_entry_t entry;
/* Handle un-addressable ZONE_DEVICE memory */
@@ -88,7 +88,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
if (pvmw->flags & PVMW_MIGRATION) {
swp_entry_t entry;
- if (!is_swap_pte(*pvmw->pte))
+ if (!pte_has_swap_entry(*pvmw->pte))
return false;
entry = pte_to_swp_entry(*pvmw->pte);
@@ -96,7 +96,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
return false;
pfn = migration_entry_to_pfn(entry);
- } else if (is_swap_pte(*pvmw->pte)) {
+ } else if (pte_has_swap_entry(*pvmw->pte)) {
swp_entry_t entry;
/* Handle un-addressable ZONE_DEVICE memory */
--
2.26.2
File-backed memory is prone to being unmapped at any time. That means all
information in the pte will be dropped, including the uffd-wp flag.
Since the uffd-wp info cannot be stored in the page cache or the swap cache,
persist this wr-protect information by installing the special uffd-wp marker
pte when we're about to unmap a uffd wr-protected pte. When the pte is
accessed again, we will know it was previously wr-protected by recognizing the
special pte.
Meanwhile, add a new flag ZAP_FLAG_DROP_FILE_UFFD_WP for the cases where we
don't want to persist that information, for example when destroying the whole
vma or punching a hole in a shmem file. For the latter, we can only drop the
uffd-wp bit while holding the page lock; that means the unmap_mapping_range()
in shmem_fallocate() still needs to zap without ZAP_FLAG_DROP_FILE_UFFD_WP,
because that path is still racy with page faults.
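For present ptes the rule in the zap path boils down to the following
simplified fragment (the real code below also covers swap ptes and routes the
flag through zap_details):

    ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
    if (!(zap_flags & ZAP_FLAG_DROP_FILE_UFFD_WP) &&
        !vma_is_anonymous(vma) && pte_uffd_wp(ptent))
        /* Keep the wr-protect tracking across the unmap. */
        set_pte_at(mm, addr, pte, pte_swp_mkuffd_wp_special(vma));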
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 11 ++++++++++
include/linux/mm_inline.h | 43 +++++++++++++++++++++++++++++++++++++++
mm/memory.c | 42 +++++++++++++++++++++++++++++++++++++-
mm/rmap.c | 8 ++++++++
mm/truncate.c | 8 +++++++-
5 files changed, 110 insertions(+), 2 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 57bb3d680844..e4aba745be62 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1634,6 +1634,8 @@ extern void user_shm_unlock(size_t, struct user_struct *);
#define ZAP_FLAG_CHECK_MAPPING BIT(0)
/* Whether to skip zapping swap entries */
#define ZAP_FLAG_SKIP_SWAP BIT(1)
+/* Whether to completely drop uffd-wp entries for file-backed memory */
+#define ZAP_FLAG_DROP_FILE_UFFD_WP BIT(2)
/*
* Parameter block passed down to zap_pte_range in exceptional cases.
@@ -1666,6 +1668,15 @@ zap_skip_swap(struct zap_details *details)
return details->zap_flags & ZAP_FLAG_SKIP_SWAP;
}
+static inline bool
+zap_drop_file_uffd_wp(struct zap_details *details)
+{
+ if (!details)
+ return false;
+
+ return details->zap_flags & ZAP_FLAG_DROP_FILE_UFFD_WP;
+}
+
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 8fc71e9d7bb0..da4c710859e6 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -4,6 +4,8 @@
#include <linux/huge_mm.h>
#include <linux/swap.h>
+#include <linux/userfaultfd_k.h>
+#include <linux/swapops.h>
/**
* page_is_file_lru - should the page be on a file LRU or anon LRU?
@@ -125,4 +127,45 @@ static __always_inline enum lru_list page_lru(struct page *page)
}
return lru;
}
+
+/*
+ * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
+ * replace a none pte. NOTE! This should only be called when *pte is already
+ * cleared so we will never accidentally replace something valuable. Meanwhile
+ * none pte also means we are not demoting the pte so if tlb flushed then we
+ * don't need to do it again; otherwise if tlb flush is postponed then it's
+ * even better.
+ *
+ * Must be called with pgtable lock held.
+ */
+static inline void
+pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *pte, pte_t pteval)
+{
+#ifdef CONFIG_USERFAULTFD
+ bool arm_uffd_pte = false;
+
+ /* The current status of the pte should be "cleared" before calling */
+ WARN_ON_ONCE(!pte_none(*pte));
+
+ if (vma_is_anonymous(vma))
+ return;
+
+ /* A uffd-wp wr-protected normal pte */
+ if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
+ arm_uffd_pte = true;
+
+ /*
+ * A uffd-wp wr-protected swap pte. Note: this should even work for
+ * pte_swp_uffd_wp_special() too.
+ */
+ if (unlikely(is_swap_pte(pteval) && pte_swp_uffd_wp(pteval)))
+ arm_uffd_pte = true;
+
+ if (unlikely(arm_uffd_pte))
+ set_pte_at(vma->vm_mm, addr, pte,
+ pte_swp_mkuffd_wp_special(vma));
+#endif
+}
+
#endif
diff --git a/mm/memory.c b/mm/memory.c
index afe09fccdee1..f87b5a8a098e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -73,6 +73,7 @@
#include <linux/perf_event.h>
#include <linux/ptrace.h>
#include <linux/vmalloc.h>
+#include <linux/mm_inline.h>
#include <trace/events/kmem.h>
@@ -1194,6 +1195,21 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
return ret;
}
+/*
+ * This function makes sure that we'll replace the none pte with an uffd-wp
+ * swap special pte marker when necessary. Must be with the pgtable lock held.
+ */
+static inline void
+zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *pte,
+ struct zap_details *details, pte_t pteval)
+{
+ if (zap_drop_file_uffd_wp(details))
+ return;
+
+ pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
+}
+
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
@@ -1231,6 +1247,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
tlb_remove_tlb_entry(tlb, pte, addr);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, details,
+ ptent);
if (unlikely(!page))
continue;
@@ -1255,6 +1273,22 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
continue;
}
+ /*
+ * If this is a special uffd-wp marker pte... Drop it only if
+ * enforced to do so.
+ */
+ if (unlikely(is_swap_special_pte(ptent))) {
+ WARN_ON_ONCE(!pte_swp_uffd_wp_special(ptent));
+ /*
+ * If this is a common unmap of ptes, keep this as is.
+ * Drop it only if this is a whole-vma destruction.
+ */
+ if (zap_drop_file_uffd_wp(details))
+ ptep_get_and_clear_full(mm, addr, pte,
+ tlb->fullmm);
+ continue;
+ }
+
entry = pte_to_swp_entry(ptent);
if (is_device_private_entry(entry)) {
struct page *page = device_private_entry_to_page(entry);
@@ -1265,6 +1299,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
rss[mm_counter(page)]--;
page_remove_rmap(page, false);
put_page(page);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, details,
+ ptent);
continue;
}
@@ -1282,6 +1318,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (unlikely(!free_swap_and_cache(entry)))
print_bad_pte(vma, addr, ptent, NULL);
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
} while (pte++, addr += PAGE_SIZE, addr != end);
add_mm_rss_vec(mm, rss);
@@ -1481,12 +1518,15 @@ void unmap_vmas(struct mmu_gather *tlb,
unsigned long end_addr)
{
struct mmu_notifier_range range;
+ struct zap_details details = {
+ .zap_flags = ZAP_FLAG_DROP_FILE_UFFD_WP,
+ };
mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
start_addr, end_addr);
mmu_notifier_invalidate_range_start(&range);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
- unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
+ unmap_single_vma(tlb, vma, start_addr, end_addr, &details);
mmu_notifier_invalidate_range_end(&range);
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 31b29321adfe..f6cc0b9b1963 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -72,6 +72,7 @@
#include <linux/page_idle.h>
#include <linux/memremap.h>
#include <linux/userfaultfd_k.h>
+#include <linux/mm_inline.h>
#include <asm/tlbflush.h>
@@ -1560,6 +1561,13 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
pteval = ptep_clear_flush(vma, address, pvmw.pte);
}
+ /*
+ * Now the pte is cleared. If this is uffd-wp armed pte, we
+ * may want to replace a none pte with a marker pte if it's
+ * file-backed, so we don't lose the tracking information.
+ */
+ pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
+
/* Move the dirty bit to the page. Now the pte is gone. */
if (pte_dirty(pteval))
set_page_dirty(page);
diff --git a/mm/truncate.c b/mm/truncate.c
index dac66749e400..35df3b1d301e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -179,7 +179,13 @@ truncate_cleanup_page(struct address_space *mapping, struct page *page)
if (page_mapped(page)) {
unsigned int nr = thp_nr_pages(page);
unmap_mapping_pages(mapping, page->index, nr,
- ZAP_FLAG_CHECK_MAPPING);
+ ZAP_FLAG_CHECK_MAPPING |
+ /*
+ * Now it's safe to drop uffd-wp because
+ * we're with page lock, and the page is
+ * being truncated.
+ */
+ ZAP_FLAG_DROP_FILE_UFFD_WP);
}
if (page_has_private(page))
--
2.26.2
A few thp pmd paths were missing the uffd-wp bit; the fixes include:
1. When removing a migration pmd entry, keep the uffd-wp bit from the swap
pte. Note that we need to do this after setting the write bit, because
carrying uffd-wp means wr-protecting the entry again.
2. When changing a huge pmd and converting a write migration entry into a
read one, persist the same uffd-wp bit.
3. When converting a pmd into a swap entry, always drop the uffd-wp bit (as
is already done for the soft-dirty bit).
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/swapops.h | 2 ++
mm/huge_memory.c | 4 ++++
2 files changed, 6 insertions(+)
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index d9b7c9132c2f..7dd57303bb0c 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -258,6 +258,8 @@ static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
if (pmd_swp_soft_dirty(pmd))
pmd = pmd_swp_clear_soft_dirty(pmd);
+ if (pmd_swp_uffd_wp(pmd))
+ pmd = pmd_swp_clear_uffd_wp(pmd);
arch_entry = __pmd_to_swp_entry(pmd);
return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 35d4acac6874..4abc46e780a0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1810,6 +1810,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
newpmd = swp_entry_to_pmd(entry);
if (pmd_swp_soft_dirty(*pmd))
newpmd = pmd_swp_mksoft_dirty(newpmd);
+ if (pmd_swp_uffd_wp(*pmd))
+ newpmd = pmd_swp_mkuffd_wp(newpmd);
set_pmd_at(mm, addr, pmd, newpmd);
}
goto unlock;
@@ -2968,6 +2970,8 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
pmde = pmd_mksoft_dirty(pmde);
if (is_write_migration_entry(entry))
pmde = maybe_pmd_mkwrite(pmde, vma);
+ if (pmd_swp_uffd_wp(*pvmw->pmd))
+ pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));
flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE);
if (PageAnon(new))
--
2.26.2
Firstly, the comment in zap_pte_range() is misleading: the code checks against
"details" rather than "check_mapping", so the comment contradicts what the
code actually does.
Meanwhile, it is also confusing that nothing explains why passing in the
details pointer means skipping all swap entries. New users of zap_details
could easily miss this fact if they don't read all the way down to
zap_pte_range(), because there's no comment on zap_details mentioning it at
all, so swap entries could be erroneously skipped without anyone noticing.
This partly reverts 3e8715fdc03e ("mm: drop zap_details::check_swap_entries"),
but introduces a ZAP_FLAG_SKIP_SWAP flag whose meaning is the opposite of the
previous "details" parameter: the caller must explicitly set it to skip swap
entries; otherwise swap entries are always considered (which is still the
common case).
Cc: Kirill A. Shutemov <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 12 ++++++++++++
mm/memory.c | 8 +++++---
2 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2380e1df6a49..0b1d04404275 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1632,6 +1632,8 @@ extern void user_shm_unlock(size_t, struct user_struct *);
/* Whether to check page->mapping when zapping */
#define ZAP_FLAG_CHECK_MAPPING BIT(0)
+/* Whether to skip zapping swap entries */
+#define ZAP_FLAG_SKIP_SWAP BIT(1)
/*
* Parameter block passed down to zap_pte_range in exceptional cases.
@@ -1654,6 +1656,16 @@ zap_check_mapping_skip(struct zap_details *details, struct page *page)
return details->zap_mapping != page_rmapping(page);
}
+/* Return true if skip swap entries, false otherwise */
+static inline bool
+zap_skip_swap(struct zap_details *details)
+{
+ if (!details)
+ return false;
+
+ return details->zap_flags & ZAP_FLAG_SKIP_SWAP;
+}
+
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index 43d8641dbe18..873b2515e187 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1268,8 +1268,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
continue;
}
- /* If details->check_mapping, we leave swap entries. */
- if (unlikely(details))
+ if (unlikely(zap_skip_swap(details)))
continue;
if (!non_swap_entry(entry))
@@ -3168,7 +3167,10 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
pgoff_t nr, bool even_cows)
{
pgoff_t first_index = start, last_index = start + nr - 1;
- struct zap_details details = { .zap_mapping = mapping };
+ struct zap_details details = {
+ .zap_mapping = mapping,
+ .zap_flags = ZAP_FLAG_SKIP_SWAP,
+ };
if (!even_cows)
details.zap_flags |= ZAP_FLAG_CHECK_MAPPING;
--
2.26.2
Instead of introducing one variable for every new zap_details field, let's
introduce a flags field so that it can start to encode true/false information.
Start by using this flag to clean up the only existing field, check_mapping.
Firstly, the name "check_mapping" implies it is a boolean, but it actually
stores the mapping itself, simply left unset when we don't want to check the
mapping.
To make things clearer, introduce the first zap flag, ZAP_FLAG_CHECK_MAPPING,
so that we only check against the mapping when this bit is set. At the same
time, rename check_mapping to zap_mapping and set it unconditionally.
While at it, introduce another helper, zap_check_mapping_skip(), and use it
properly in zap_pte_range().
Some old comments in zap_pte_range() have been removed because they were
duplicated; now that we have the ZAP_FLAG_CHECK_MAPPING flag, the same
information is easy to find by simply grepping for the flag.
This will also make life easier when we want to, e.g., pass zap_flags into
callers like unmap_mapping_pages() (instead of adding new booleans besides
the even_cows parameter).
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 19 ++++++++++++++++++-
mm/memory.c | 31 ++++++++-----------------------
2 files changed, 26 insertions(+), 24 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index faf9538c13b2..2380e1df6a49 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1630,13 +1630,30 @@ static inline bool can_do_mlock(void) { return false; }
extern int user_shm_lock(size_t, struct user_struct *);
extern void user_shm_unlock(size_t, struct user_struct *);
+/* Whether to check page->mapping when zapping */
+#define ZAP_FLAG_CHECK_MAPPING BIT(0)
+
/*
* Parameter block passed down to zap_pte_range in exceptional cases.
*/
struct zap_details {
- struct address_space *check_mapping; /* Check page->mapping if set */
+ struct address_space *zap_mapping; /* Check page->mapping if set */
+ unsigned long zap_flags; /* Special flags for zapping */
};
+/* Return true if skip zapping this page, false otherwise */
+static inline bool
+zap_check_mapping_skip(struct zap_details *details, struct page *page)
+{
+ if (!details || !page)
+ return false;
+
+ if (!(details->zap_flags & ZAP_FLAG_CHECK_MAPPING))
+ return false;
+
+ return details->zap_mapping != page_rmapping(page);
+}
+
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte);
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index dd49dea276e3..43d8641dbe18 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1226,16 +1226,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct page *page;
page = vm_normal_page(vma, addr, ptent);
- if (unlikely(details) && page) {
- /*
- * unmap_shared_mapping_pages() wants to
- * invalidate cache without truncating:
- * unmap shared but keep private pages.
- */
- if (details->check_mapping &&
- details->check_mapping != page_rmapping(page))
- continue;
- }
+ if (unlikely(zap_check_mapping_skip(details, page)))
+ continue;
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
tlb_remove_tlb_entry(tlb, pte, addr);
@@ -1267,17 +1259,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (is_device_private_entry(entry)) {
struct page *page = device_private_entry_to_page(entry);
- if (unlikely(details && details->check_mapping)) {
- /*
- * unmap_shared_mapping_pages() wants to
- * invalidate cache without truncating:
- * unmap shared but keep private pages.
- */
- if (details->check_mapping !=
- page_rmapping(page))
- continue;
- }
-
+ if (unlikely(zap_check_mapping_skip(details, page)))
+ continue;
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
rss[mm_counter(page)]--;
page_remove_rmap(page, false);
@@ -3185,9 +3168,11 @@ void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
pgoff_t nr, bool even_cows)
{
pgoff_t first_index = start, last_index = start + nr - 1;
- struct zap_details details = { };
+ struct zap_details details = { .zap_mapping = mapping };
+
+ if (!even_cows)
+ details.zap_flags |= ZAP_FLAG_CHECK_MAPPING;
- details.check_mapping = even_cows ? NULL : mapping;
if (last_index < first_index)
last_index = ULONG_MAX;
--
2.26.2
Now that we have added support for shmem and hugetlbfs, we can turn the
uffd-wp test on unconditionally.
Signed-off-by: Peter Xu <[email protected]>
---
tools/testing/selftests/vm/userfaultfd.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index c4425597769a..219251c194da 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -77,8 +77,8 @@ static int test_type;
#define ALARM_INTERVAL_SECS 10
static volatile bool test_uffdio_copy_eexist = true;
static volatile bool test_uffdio_zeropage_eexist = true;
-/* Whether to test uffd write-protection */
-static bool test_uffdio_wp = false;
+/* Whether to test uffd write-protection. Default is to test uffd-wp always */
+static bool test_uffdio_wp = true;
static bool map_shared;
static int huge_fd;
@@ -295,9 +295,9 @@ struct uffd_test_ops {
void (*alias_mapping)(__u64 *start, size_t len, unsigned long offset);
};
-#define SHMEM_EXPECTED_IOCTLS ((1 << _UFFDIO_WAKE) | \
+#define HUGETLB_EXPECTED_IOCTLS ((1 << _UFFDIO_WAKE) | \
(1 << _UFFDIO_COPY) | \
- (1 << _UFFDIO_ZEROPAGE))
+ (1 << _UFFDIO_WRITEPROTECT))
#define ANON_EXPECTED_IOCTLS ((1 << _UFFDIO_WAKE) | \
(1 << _UFFDIO_COPY) | \
@@ -312,14 +312,14 @@ static struct uffd_test_ops anon_uffd_test_ops = {
};
static struct uffd_test_ops shmem_uffd_test_ops = {
- .expected_ioctls = SHMEM_EXPECTED_IOCTLS,
+ .expected_ioctls = ANON_EXPECTED_IOCTLS,
.allocate_area = shmem_allocate_area,
.release_pages = shmem_release_pages,
.alias_mapping = noop_alias_mapping,
};
static struct uffd_test_ops hugetlb_uffd_test_ops = {
- .expected_ioctls = UFFD_API_RANGE_IOCTLS_BASIC,
+ .expected_ioctls = HUGETLB_EXPECTED_IOCTLS,
.allocate_area = hugetlb_allocate_area,
.release_pages = hugetlb_release_pages,
.alias_mapping = hugetlb_alias_mapping,
@@ -1453,8 +1453,6 @@ static void set_test_type(const char *type)
if (!strcmp(type, "anon")) {
test_type = TEST_ANON;
uffd_test_ops = &anon_uffd_test_ops;
- /* Only enable write-protect test for anonymous test */
- test_uffdio_wp = true;
} else if (!strcmp(type, "hugetlb")) {
test_type = TEST_HUGETLB;
uffd_test_ops = &hugetlb_uffd_test_ops;
--
2.26.2
On Fri, Jan 15, 2021 at 12:08:44PM -0500, Peter Xu wrote:
> We used to have special swap entries, like migration entries, hw-poison
> entries, device private entries, etc.
>
> Those "special swap entries" reside in the range that they need to be at least
> swap entries first, and their types are decided by swp_type(entry).
>
> This patch introduces another idea called "special swap ptes".
>
> It's very easy to get confused against "special swap entries", but a speical
> swap pte should never contain a swap entry at all. It means, it's illegal to
> call pte_to_swp_entry() upon a special swap pte.
>
> Make the uffd-wp special pte to be the first special swap pte.
>
> Before this patch, is_swap_pte()==true means one of the below:
>
> (a.1) The pte has a normal swap entry (non_swap_entry()==false). For
> example, when an anonymous page got swapped out.
>
> (a.2) The pte has a special swap entry (non_swap_entry()==true). For
> example, a migration entry, a hw-poison entry, etc.
>
> After this patch, is_swap_pte()==true means one of the below, where case (b) is
> added:
>
> (a) The pte contains a swap entry.
>
> (a.1) The pte has a normal swap entry (non_swap_entry()==false). For
> example, when an anonymous page got swapped out.
>
> (a.2) The pte has a special swap entry (non_swap_entry()==true). For
> example, a migration entry, a hw-poison entry, etc.
>
> (b) The pte does not contain a swap entry at all (so it cannot be passed
> into pte_to_swp_entry()). For example, uffd-wp special swap pte.
Does the stuff in hmm.c need updating too?
if (!pte_present(pte)) {
swp_entry_t entry = pte_to_swp_entry(pte);
Jason
On Mon, Jan 18, 2021 at 03:40:13PM -0400, Jason Gunthorpe wrote:
> Does the stuff in hmm.c need updating too?
>
> if (!pte_present(pte)) {
> swp_entry_t entry = pte_to_swp_entry(pte);
This idea should be applied tree-wide, so yes. Even if I'm not 100% sure
whether HMM would be a good candidate for file-backed uffd-wp, I agree we
should also take care of those call sites.
I guess I managed to take care of all the is_swap_pte() callers but forgot
that I should also look for pte_to_swp_entry() users, since sometimes
pte_none() and pte_present() are checked separately, so there are quite a few
places that I should have covered but missed. HMM is one of them.
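As a rough sketch of the kind of filtering such callers would need (for
illustration only, not the final hmm patch; is_swap_special_pte() is the
helper introduced in this series, the return values here are placeholders):

/*
 * Sketch only: how a !pte_present() path that decodes swap entries
 * would need to be guarded once special swap ptes exist.
 */
static int decode_nonpresent_pte(pte_t pte, swp_entry_t *entry)
{
	if (pte_none(pte))
		return -ENOENT;

	/* Special swap ptes (e.g. the uffd-wp marker) carry no swap entry. */
	if (is_swap_special_pte(pte))
		return -EINVAL;

	*entry = pte_to_swp_entry(pte);
	return 0;
}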
Thanks for spotting this, Jason. I'll take care of those in the next version.
--
Peter Xu
On Fri, Jan 15, 2021 at 9:09 AM Peter Xu <[email protected]> wrote:
>
> It is a preparation work to be able to behave differently in the per
> architecture huge_pte_alloc() according to different VMA attributes.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> arch/arm64/mm/hugetlbpage.c | 2 +-
> arch/ia64/mm/hugetlbpage.c | 3 ++-
> arch/mips/mm/hugetlbpage.c | 4 ++--
> arch/parisc/mm/hugetlbpage.c | 2 +-
> arch/powerpc/mm/hugetlbpage.c | 3 ++-
> arch/s390/mm/hugetlbpage.c | 2 +-
> arch/sh/mm/hugetlbpage.c | 2 +-
> arch/sparc/mm/hugetlbpage.c | 2 +-
> include/linux/hugetlb.h | 2 +-
> mm/hugetlb.c | 6 +++---
> mm/userfaultfd.c | 2 +-
> 11 files changed, 16 insertions(+), 14 deletions(-)
>
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index 55ecf6de9ff7..5b32ec888698 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -252,7 +252,7 @@ void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr,
> set_pte(ptep, pte);
> }
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm,
> +pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long addr, unsigned long sz)
> {
> pgd_t *pgdp;
> diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
> index b331f94d20ac..f993cb36c062 100644
> --- a/arch/ia64/mm/hugetlbpage.c
> +++ b/arch/ia64/mm/hugetlbpage.c
> @@ -25,7 +25,8 @@ unsigned int hpage_shift = HPAGE_SHIFT_DEFAULT;
> EXPORT_SYMBOL(hpage_shift);
>
> pte_t *
> -huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
> +huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long addr, unsigned long sz)
> {
> unsigned long taddr = htlbpage_to_page(addr);
> pgd_t *pgd;
> diff --git a/arch/mips/mm/hugetlbpage.c b/arch/mips/mm/hugetlbpage.c
> index 77ffece9c270..c1d8f51c5255 100644
> --- a/arch/mips/mm/hugetlbpage.c
> +++ b/arch/mips/mm/hugetlbpage.c
> @@ -21,8 +21,8 @@
> #include <asm/tlb.h>
> #include <asm/tlbflush.h>
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr,
> - unsigned long sz)
> +pte_t *huge_pte_alloc(struct mm_struct *mm, structt vm_area_struct *vma,
This was pointed out to me just after I sent v3 of my series today
(which includes this patch):
Typo, s/structt/struct/.
> + unsigned long addr, unsigned long sz)
> {
> pgd_t *pgd;
> p4d_t *p4d;
> diff --git a/arch/parisc/mm/hugetlbpage.c b/arch/parisc/mm/hugetlbpage.c
> index d7ba014a7fbb..e141441bfa64 100644
> --- a/arch/parisc/mm/hugetlbpage.c
> +++ b/arch/parisc/mm/hugetlbpage.c
> @@ -44,7 +44,7 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
> }
>
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm,
> +pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long addr, unsigned long sz)
> {
> pgd_t *pgd;
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 36c3800769fb..2514884c0d20 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -106,7 +106,8 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
> * At this point we do the placement change only for BOOK3S 64. This would
> * possibly work on other subarchs.
> */
> -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
> +pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long addr, unsigned long sz)
> {
> pgd_t *pg;
> p4d_t *p4;
> diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
> index 3b5a4d25ca9b..da36d13ffc16 100644
> --- a/arch/s390/mm/hugetlbpage.c
> +++ b/arch/s390/mm/hugetlbpage.c
> @@ -189,7 +189,7 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
> return pte;
> }
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm,
> +pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long addr, unsigned long sz)
> {
> pgd_t *pgdp;
> diff --git a/arch/sh/mm/hugetlbpage.c b/arch/sh/mm/hugetlbpage.c
> index 220d7bc43d2b..999ab5916e69 100644
> --- a/arch/sh/mm/hugetlbpage.c
> +++ b/arch/sh/mm/hugetlbpage.c
> @@ -21,7 +21,7 @@
> #include <asm/tlbflush.h>
> #include <asm/cacheflush.h>
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm,
> +pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long addr, unsigned long sz)
> {
> pgd_t *pgd;
> diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
> index ec423b5f17dd..ae06f7df9750 100644
> --- a/arch/sparc/mm/hugetlbpage.c
> +++ b/arch/sparc/mm/hugetlbpage.c
> @@ -272,7 +272,7 @@ static unsigned long huge_tte_to_size(pte_t pte)
> return size;
> }
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm,
> +pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long addr, unsigned long sz)
> {
> pgd_t *pgd;
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index fe1dde0afbaf..7d4c5669e118 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -162,7 +162,7 @@ extern struct list_head huge_boot_pages;
>
> /* arch callbacks */
>
> -pte_t *huge_pte_alloc(struct mm_struct *mm,
> +pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long addr, unsigned long sz);
> pte_t *huge_pte_offset(struct mm_struct *mm,
> unsigned long addr, unsigned long sz);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 18b236bac6cd..eb7cd0c7d6d2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3767,7 +3767,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> src_pte = huge_pte_offset(src, addr, sz);
> if (!src_pte)
> continue;
> - dst_pte = huge_pte_alloc(dst, addr, sz);
> + dst_pte = huge_pte_alloc(dst, vma, addr, sz);
> if (!dst_pte) {
> ret = -ENOMEM;
> break;
> @@ -4484,7 +4484,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> */
> mapping = vma->vm_file->f_mapping;
> i_mmap_lock_read(mapping);
> - ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
> + ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
> if (!ptep) {
> i_mmap_unlock_read(mapping);
> return VM_FAULT_OOM;
> @@ -5407,7 +5407,7 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
>
> #ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB
> -pte_t *huge_pte_alloc(struct mm_struct *mm,
> +pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long addr, unsigned long sz)
> {
> pgd_t *pgd;
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 480d91b783d4..3d49b888e3e8 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -291,7 +291,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> mutex_lock(&hugetlb_fault_mutex_table[hash]);
>
> err = -ENOMEM;
> - dst_pte = huge_pte_alloc(dst_mm, dst_addr, vma_hpagesize);
> + dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
> if (!dst_pte) {
> mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> i_mmap_unlock_read(mapping);
> --
> 2.26.2
>
On Thu, Jan 28, 2021 at 02:59:13PM -0800, Axel Rasmussen wrote:
> > +pte_t *huge_pte_alloc(struct mm_struct *mm, structt vm_area_struct *vma,
>
> This was pointed out to me just after I sent v3 of my series today
> (which includes this patch):
>
> Typo, s/structt/struct/.
Thanks Axel - fixed here too. It's strange that it didn't complain.
Re the minor fault series, I thought it would be good to have some comments
from Andrea/Mike or others, but in any case I'll read your v3 next week.
(A small heads-up: you'd better use v3.1 next time for that single patch, so
that if there is a complete v4 series later, that patch won't clash with it.)
--
Peter Xu
On Fri, Jan 15, 2021 at 12:08:37PM -0500, Peter Xu wrote:
> This is a RFC series to support userfaultfd upon shmem and hugetlbfs.
>
> PS. Note that there's a known issue [0] with tlb against uffd-wp/soft-dirty in
> general and Nadav is working on it. It may or may not directly affect
> shmem/hugetlbfs since there're no COW on shared mappings normally. Private
> shmem could hit, but still that's another problem to solve in general, and this
> RFC is majorly to see whether there's any objection on the concept of the idea
> specific to uffd-wp on shmem/hugetlbfs.
>
> The whole series can also be found online [1].
>
> The major comment I'd like to get is on the new idea of swap special pte. That
> comes from suggestions from both Hugh and Andrea and I appreciated a lot for
> those discussions.
>
> In short, it's a new type of pte that doesn't exist in the past, while used in
> file-backed memories to persist information across ptes being erased (but the
> page cache could still exist, for example, so in the next page fault we can
> reload the page cache with that specific information when necessary).
>
> I'm copy-pasting some commit message from the patch "mm/swap: Introduce the
> idea of special swap ptes", where uffd-wp becomes the first user of it:
>
> We used to have special swap entries, like migration entries, hw-poison
> entries, device private entries, etc.
>
> Those "special swap entries" reside in the range that they need to be at least
> swap entries first, and their types are decided by swp_type(entry).
>
> This patch introduces another idea called "special swap ptes".
>
> It's very easy to get confused against "special swap entries", but a speical
> swap pte should never contain a swap entry at all. It means, it's illegal to
> call pte_to_swp_entry() upon a special swap pte.
>
> Make the uffd-wp special pte to be the first special swap pte.
>
> Before this patch, is_swap_pte()==true means one of the below:
>
> (a.1) The pte has a normal swap entry (non_swap_entry()==false). For
> example, when an anonymous page got swapped out.
>
> (a.2) The pte has a special swap entry (non_swap_entry()==true). For
> example, a migration entry, a hw-poison entry, etc.
>
> After this patch, is_swap_pte()==true means one of the below, where case (b) is
> added:
>
> (a) The pte contains a swap entry.
>
> (a.1) The pte has a normal swap entry (non_swap_entry()==false). For
> example, when an anonymous page got swapped out.
>
> (a.2) The pte has a special swap entry (non_swap_entry()==true). For
> example, a migration entry, a hw-poison entry, etc.
>
> (b) The pte does not contain a swap entry at all (so it cannot be passed
> into pte_to_swp_entry()). For example, uffd-wp special swap pte.
>
> Hugetlbfs needs similar thing because it's also file-backed. I directly reused
> the same special pte there, though the shmem/hugetlb change on supporting this
> new pte is different since they don't share code path a lot.
Hugh & Mike,
Would any of you have comments/concerns on the high-level design of this
series?
It would be great to know, especially any major objection, before moving on
to a non-RFC version.
Thanks,
--
Peter Xu
On Fri, Jan 29, 2021 at 2:31 PM Peter Xu <[email protected]> wrote:
>
> On Thu, Jan 28, 2021 at 02:59:13PM -0800, Axel Rasmussen wrote:
> > > +pte_t *huge_pte_alloc(struct mm_struct *mm, structt vm_area_struct *vma,
> >
> > This was pointed out to me just after I sent v3 of my series today
> > (which includes this patch):
> >
> > Typo, s/structt/struct/.
>
> Thanks Axel - fixed here too. It's strange why it didn't complain.
>
> Re the minor fault series, I thought it would be good to have some comment from
> Andrea/Mike or others, but in all cases I'll read your v3 next week.
>
> (A small heads-up: you'd better use v3.1 next time for that single patch, so
> that just in case there will be a complete v4 series then that patch won't
> collapse with it)
Thanks!
And, I'll keep that in mind for next time. I had seen it done that way
before, but it slipped my mind.
>
> --
> Peter Xu
>
On 1/29/21 2:49 PM, Peter Xu wrote:
> On Fri, Jan 15, 2021 at 12:08:37PM -0500, Peter Xu wrote:
>> This is a RFC series to support userfaultfd upon shmem and hugetlbfs.
...
> Huge & Mike,
>
> Would any of you have comment/concerns on the high-level design of this series?
>
> It would be great to know it, especially major objection, before move on to an
> non-rfc version.
My apologies for not looking at this sooner. Even now, I have only taken
a very brief look at the hugetlbfs patches.
Coincidentally, I am working on the 'BUG' that soft dirty does not work for
hugetlbfs. As you can imagine, there is some overlap in handling of wp ptes
set for soft dirty. In addition, pmd sharing must be disabled for soft dirty
as here and in Axel's uffd minor fault code.
No objections to the overall approach based on my quick look.
I'll try to take a closer look at the areas where efforts overlap.
--
Mike Kravetz
On Fri, Feb 05, 2021 at 01:53:34PM -0800, Mike Kravetz wrote:
> On 1/29/21 2:49 PM, Peter Xu wrote:
> > On Fri, Jan 15, 2021 at 12:08:37PM -0500, Peter Xu wrote:
> >> This is a RFC series to support userfaultfd upon shmem and hugetlbfs.
> ...
> > Huge & Mike,
> >
> > Would any of you have comment/concerns on the high-level design of this series?
> >
> > It would be great to know it, especially major objection, before move on to an
> > non-rfc version.
>
> My apologies for not looking at this sooner. Even now, I have only taken
> a very brief look at the hugetlbfs patches.
>
> Coincidentally, I am working on the 'BUG' that soft dirty does not work for
> hugetlbfs. As you can imagine, there is some overlap in handling of wp ptes
> set for soft dirty. In addition, pmd sharing must be disabled for soft dirty
> as here and in Axel's uffd minor fault code.
Interesting to know that we'll reach and need something common from different
directions, especially when they all mostly happen at the same time. :)
Is there a real "BUG" that you mentioned? I'd be glad to read about it if
there is a link or something.
>
> No objections to the overall approach based on my quick look.
Thanks for having a look.
So for hugetlb one major thing is indeed the pmd sharing part, on which it
seems we've already got very good consensus.
The other thing that I'd love to get some comments on is a topic shared with
shmem: for a file-backed memory type, uffd-wp needs a consolidated way to
record wr-protect information even if the pgtable entries are flushed.
That comes from a fundamental difference between anonymous and file-backed
memory: anonymous pages keep all info in the pgtable entry, but file-backed
memory does not, e.g., its pgtable entries can be dropped at any time as long
as the page cache is there.
I then went to look at how soft-dirty handles this issue, and there's
actually a paragraph about it:
While in most cases tracking memory changes by #PF-s is more than enough
there is still a scenario when we can lose soft dirty bits -- a task
unmaps a previously mapped memory region and then maps a new one at
exactly the same place. When unmap is called, the kernel internally
clears PTE values including soft dirty bits. To notify user space
application about such memory region renewal the kernel always marks
new memory regions (and expanded regions) as soft dirty.
I feel like it just means soft-dirty currently allows false positives: we
could have set the soft dirty bit even if the page is clean. And that's what
this series wants to avoid: it uses the new concept called "swap special pte"
to persist that information even for file-backed memory, all in the name of
avoiding those false positives.
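As a rough illustration (not the exact code in the series), the fault path
can then still deliver the wr-protect event even though the original pte is
long gone:

	/*
	 * Sketch only: the shmem fault path finds the uffd-wp special
	 * swap pte that was left behind when the real pte got zapped,
	 * so it can still report the write-protect fault to userspace.
	 */
	if (pte_swp_uffd_wp_special(vmf->orig_pte))
		return handle_userfault(vmf, VM_UFFD_WP);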
>
> I'll try to take a closer look at the areas where efforts overlap.
I dumped the above just hoping it might help a bit more with the reviews, but
if not, I totally agree we can focus on the overlapping part.
And, I'd be more than glad to read your work too if I can understand more on
what you're working on with the hugetlb soft dirty bug, since I do feel uffd-wp
is servicing similar goals just like what soft-dirty does, so we could share a
lot of common knowledge there. :)
Thanks again!
--
Peter Xu
On Fri, 29 Jan 2021, Peter Xu wrote:
>
> Huge & Mike,
>
> Would any of you have comment/concerns on the high-level design of this series?
>
> It would be great to know it, especially major objection, before move on to an
> non-rfc version.
Seeing Mike's update prompts me to speak up: I have been looking, and
will continue to look through it - will report when done; but find I've
been making very little forward progress from one day to the next.
It is very confusing, inevitably; but you have done an *outstanding*
job on acknowledging the confusion, and commenting it in great detail.
Hugh
On Fri, Feb 05, 2021 at 02:21:47PM -0800, Hugh Dickins wrote:
> On Fri, 29 Jan 2021, Peter Xu wrote:
> >
> > Huge & Mike,
> >
> > Would any of you have comment/concerns on the high-level design of this series?
> >
> > It would be great to know it, especially major objection, before move on to an
> > non-rfc version.
>
> Seeing Mike's update prompts me to speak up: I have been looking, and
> will continue to look through it - will report when done; but find I've
> been making very little forward progress from one day to the next.
>
> It is very confusing, inevitably; but you have done an *outstanding*
> job on acknowledging the confusion, and commenting it in great detail.
I'm honored to receive such an evaluation, thanks Hugh!
As a quick summary - what I did in this series is mostly what you suggested:
using swp_type==1 && swp_offset==0 as a special pte, so the swap code can
trap it. The only difference is that "swp_type==1 && swp_offset==0" still
uses valid swp_entry address space, so I introduced the "swap special pte"
idea hoping to make it clearer, which is also based on Andrea's suggestion.
I hope I didn't make it even worse. :)
It's just that I don't want to make this an idea that "only works for
uffd-wp". What I'm wondering is whether we can provide a common way to keep
some records in pgtable entries that point to file-backed memory. Say,
currently for file-backed memory we can only have either a valid pte (either
RO or RW) or a none pte. So maybe we could provide a way to start using the
rest of the pte address space that we haven't used yet.
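To make that concrete, here is a minimal sketch of the encoding; the helper
names match this series, but the bodies are illustrative only, not
necessarily the final code:

/*
 * Reserve one value inside the swap pte space (swp_type==1,
 * swp_offset==0) as a marker pte that carries no swap entry at all.
 */
#define UFFD_WP_SWP_PTE_SPECIAL	swp_entry(1, 0)

static inline pte_t pte_swp_mkuffd_wp_special(struct vm_area_struct *vma)
{
	WARN_ON_ONCE(vma_is_anonymous(vma));
	return swp_entry_to_pte(UFFD_WP_SWP_PTE_SPECIAL);
}

static inline bool pte_swp_uffd_wp_special(pte_t pte)
{
	return pte_same(pte, swp_entry_to_pte(UFFD_WP_SWP_PTE_SPECIAL));
}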
Please take your time on reviewing the series. Any of your future comment
would be greatly welcomed.
Thanks,
--
Peter Xu
On Tue, Feb 09, 2021 at 11:29:56AM -0800, Mike Kravetz wrote:
> On 2/5/21 6:36 PM, Peter Xu wrote:
> > On Fri, Feb 05, 2021 at 01:53:34PM -0800, Mike Kravetz wrote:
> >> On 1/29/21 2:49 PM, Peter Xu wrote:
> >>> On Fri, Jan 15, 2021 at 12:08:37PM -0500, Peter Xu wrote:
> >>>> This is a RFC series to support userfaultfd upon shmem and hugetlbfs.
> >> ...
> >>> Huge & Mike,
> >>>
> >>> Would any of you have comment/concerns on the high-level design of this series?
> >>>
> >>> It would be great to know it, especially major objection, before move on to an
> >>> non-rfc version.
> >>
> >> My apologies for not looking at this sooner. Even now, I have only taken
> >> a very brief look at the hugetlbfs patches.
> >>
> >> Coincidentally, I am working on the 'BUG' that soft dirty does not work for
> >> hugetlbfs. As you can imagine, there is some overlap in handling of wp ptes
> >> set for soft dirty. In addition, pmd sharing must be disabled for soft dirty
> >> as here and in Axel's uffd minor fault code.
> >
> > Interesting to know that we'll reach and need something common from different
> > directions, especially when they all mostly happen at the same time. :)
> >
> > Is there a real "BUG" that you mentioned? I'd be glad to read about it if
> > there is a link or something.
> >
>
> Sorry, I was referring to a bugzilla bug not a BUG(). Bottom line is that
> hugetlb was mostly overlooked when soft dirty support was added. A thread
> mostly from me is at:
> lore.kernel.org/r/[email protected]
> I am close to sending out a RFC, but keep getting distracted.
Thanks. Indeed I see no reason to not have hugetlb supported for soft dirty.
Tracking 1G huge pages could be too coarse and heavy, but 2M at least still
seems reasonable.
>
> >> No objections to the overall approach based on my quick look.
> >
> > Thanks for having a look.
> >
> > So for hugetlb one major thing is indeed about the pmd sharing part, which
> > seems that we've got very good consensus on.
>
> Yes
>
> > The other thing that I'd love to get some comment would be a shared topic with
> > shmem in that: for a file-backed memory type, uffd-wp needs a consolidated way
> > to record wr-protect information even if the pgtable entries were flushed.
> > That comes from a fundamental difference between anonymous and file-backed
> > memory in that anonymous pages keep all info in the pgtable entry, but
> > file-backed memory is not, e.g., pgtable entries can be dropped at any time as
> > long as page cache is there.
>
> Sorry, but I can not recall this difference for hugetlb pages. What operations
> lead to flushing of pagetable entries? It would need to be something other
> than unmap as it seems we want to lose the information in unmap IIUC.
For hugetlbfs I know of two cases.
One is exactly the huge pmd sharing mentioned above, where we'll drop the
pgtable entries for a specific process but the page cache will still exist.
The other one is hugetlbfs_punch_hole(), where hugetlb_vmdelete_list() is
called before remove_inode_hugepages(). For uffd-wp, there is a very small
window in which a wr-protected huge page can be written after the pgtable
entry has been flushed but before the page is finally dropped in
remove_inode_hugepages(). In some apps that could cause data loss.
>
> > I goes to look at soft-dirty then regarding this issue, and there's actually a
> > paragraph about it:
> >
> > While in most cases tracking memory changes by #PF-s is more than enough
> > there is still a scenario when we can lose soft dirty bits -- a task
> > unmaps a previously mapped memory region and then maps a new one at
> > exactly the same place. When unmap is called, the kernel internally
> > clears PTE values including soft dirty bits. To notify user space
> > application about such memory region renewal the kernel always marks
> > new memory regions (and expanded regions) as soft dirty.
> >
> > I feel like it just means soft-dirty currently allows false positives: we could
> > have set the soft dirty bit even if the page is clean. And that's what this
> > series wanted to avoid: it used the new concept called "swap special pte" to
> > persistent that information even for file-backed memory. That all goes for
> > avoiding those false positives.
>
> Yes, I have seen this with soft dirty. It really does not seem right. When
> you first create a mapping, even before faulting in anything the vma is marked
> VM_SOFTDIRTY and from the user's perspective all addresses/pages appear dirty.
Right, that seems suboptimal. It is understandable, since dirty info is
indeed tolerant of false positives, so soft-dirty simply side-stepped the
issue that uffd-wp wants to solve in this series. It would be great to know
whether the current approach in this series works to remove those false
positives.
>
> To be honest, I am not sure you want to try and carry per-process/per-mapping
> wp information in the file.
What this series does is try to persist that information in pgtable entries,
rather than in the file (or page cache). Frankly I can't say whether that's
optimal either, so I'm always open to comments. So far I think it's a valid
solution, but it's always possible that I missed something important.
> In the comment about soft dirty above, it seems
> reasonable that unmapping would clear all soft dirty information. Also,
> unmapping would clear any uffd state/information.
Right, unmap should always mean "drop all information in the ptes". It's in
the patch below that we tried to treat things differently:
https://github.com/xzpeter/linux/commit/e958e9ee8d33e9a6602f93cdbe24a0c3614ab5e2
A quick summary of the patch above: only when we unmap or truncate the
hugetlbfs file do we call hugetlb_vmdelete_list() with
ZAP_FLAG_DROP_FILE_UFFD_WP (which means we drop all the information,
including the uffd-wp bit).
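Roughly, the zap side boils down to something like the sketch below, reusing
the helper names from the series; the bodies are illustrative rather than the
exact implementation:

/* Drop the uffd-wp marker only when the caller explicitly asks for it. */
static inline bool zap_drop_file_uffd_wp(struct zap_details *details)
{
	return details && (details->zap_flags & ZAP_FLAG_DROP_FILE_UFFD_WP);
}

/*
 * When zapping a pte on a file-backed, uffd-wp armed VMA, leave the
 * special marker behind instead of a none pte, unless the whole range
 * is being torn down (unmap or truncate).
 */
static inline void
zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
			      pte_t *pte, struct zap_details *details,
			      pte_t pteval)
{
	if (zap_drop_file_uffd_wp(details))
		return;
	pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
}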
Thanks,
--
Peter Xu
On 2/5/21 6:36 PM, Peter Xu wrote:
> On Fri, Feb 05, 2021 at 01:53:34PM -0800, Mike Kravetz wrote:
>> On 1/29/21 2:49 PM, Peter Xu wrote:
>>> On Fri, Jan 15, 2021 at 12:08:37PM -0500, Peter Xu wrote:
>>>> This is a RFC series to support userfaultfd upon shmem and hugetlbfs.
>> ...
>>> Huge & Mike,
>>>
>>> Would any of you have comment/concerns on the high-level design of this series?
>>>
>>> It would be great to know it, especially major objection, before move on to an
>>> non-rfc version.
>>
>> My apologies for not looking at this sooner. Even now, I have only taken
>> a very brief look at the hugetlbfs patches.
>>
>> Coincidentally, I am working on the 'BUG' that soft dirty does not work for
>> hugetlbfs. As you can imagine, there is some overlap in handling of wp ptes
>> set for soft dirty. In addition, pmd sharing must be disabled for soft dirty
>> as here and in Axel's uffd minor fault code.
>
> Interesting to know that we'll reach and need something common from different
> directions, especially when they all mostly happen at the same time. :)
>
> Is there a real "BUG" that you mentioned? I'd be glad to read about it if
> there is a link or something.
>
Sorry, I was referring to a bugzilla bug not a BUG(). Bottom line is that
hugetlb was mostly overlooked when soft dirty support was added. A thread
mostly from me is at:
lore.kernel.org/r/[email protected]
I am close to sending out a RFC, but keep getting distracted.
>> No objections to the overall approach based on my quick look.
>
> Thanks for having a look.
>
> So for hugetlb one major thing is indeed about the pmd sharing part, which
> seems that we've got very good consensus on.
Yes
> The other thing that I'd love to get some comment would be a shared topic with
> shmem in that: for a file-backed memory type, uffd-wp needs a consolidated way
> to record wr-protect information even if the pgtable entries were flushed.
> That comes from a fundamental difference between anonymous and file-backed
> memory in that anonymous pages keep all info in the pgtable entry, but
> file-backed memory is not, e.g., pgtable entries can be dropped at any time as
> long as page cache is there.
Sorry, but I can not recall this difference for hugetlb pages. What operations
lead to flushing of pagetable entries? It would need to be something other
than unmap as it seems we want to lose the information in unmap IIUC.
> I goes to look at soft-dirty then regarding this issue, and there's actually a
> paragraph about it:
>
> While in most cases tracking memory changes by #PF-s is more than enough
> there is still a scenario when we can lose soft dirty bits -- a task
> unmaps a previously mapped memory region and then maps a new one at
> exactly the same place. When unmap is called, the kernel internally
> clears PTE values including soft dirty bits. To notify user space
> application about such memory region renewal the kernel always marks
> new memory regions (and expanded regions) as soft dirty.
>
> I feel like it just means soft-dirty currently allows false positives: we could
> have set the soft dirty bit even if the page is clean. And that's what this
> series wanted to avoid: it used the new concept called "swap special pte" to
> persistent that information even for file-backed memory. That all goes for
> avoiding those false positives.
Yes, I have seen this with soft dirty. It really does not seem right. When
you first create a mapping, even before faulting in anything the vma is marked
VM_SOFTDIRTY and from the user's perspective all addresses/pages appear dirty.
To be honest, I am not sure you want to try and carry per-process/per-mapping
wp information in the file. In the comment about soft dirty above, it seems
reasonable that unmapping would clear all soft dirty information. Also,
unmapping would clear any uffd state/information.
--
Mike Kravetz