This series adds a new userfaultfd feature, UFFDIO_POISON. See commit 4
for a detailed description of the feature.
The series is based on Linus master (partial 6.5 merge window), and
structured like this:
- Patches 1-3 are preparation / refactoring
- Patches 4-6 implement and advertise the new feature
- Patches 7-8 implement a unit test for the new feature
Changelog:
v2 -> v3:
- Rebase onto current Linus master.
- Don't overwrite existing PTE markers for non-hugetlb UFFDIO_POISON.
Before, non-hugetlb would override them, but hugetlb would not. I don't
think there's a use case where we *want* to override a UFFD_WP marker
for example, so take the more conservative behavior for all kinds of
memory.
- [Peter] Drop hugetlb mfill atomic refactoring, since it isn't needed
for this series (we don't touch that code directly anyway).
- [Peter] Switch to re-using PTE_MARKER_SWAPIN_ERROR instead of defining
new PTE_MARKER_UFFD_POISON.
- [Peter] Extract start / len range overflow check into existing
validate_range helper; this fixes the style issue of unnecessary braces
in the UFFDIO_POISON implementation, because this code is just deleted.
- [Peter] Extract file size check out into a new helper.
- [Peter] Defer actually "enabling" the new feature until the last commit
in the series; combine this with adding the documentation. As a
consequence, move the selftest commits after this one.
- [Randy] Fix typo in documentation.
v1 -> v2:
- [Peter] Return VM_FAULT_HWPOISON not VM_FAULT_SIGBUS, to yield the
correct behavior for KVM (guest MCE).
- [Peter] Rename UFFDIO_SIGBUS to UFFDIO_POISON.
- [Peter] Implement hugetlbfs support for UFFDIO_POISON.
Axel Rasmussen (8):
mm: make PTE_MARKER_SWAPIN_ERROR more general
mm: userfaultfd: check for start + len overflow in validate_range
mm: userfaultfd: extract file size check out into a helper
mm: userfaultfd: add new UFFDIO_POISON ioctl
mm: userfaultfd: support UFFDIO_POISON for hugetlbfs
mm: userfaultfd: document and enable new UFFDIO_POISON feature
selftests/mm: refactor uffd_poll_thread to allow custom fault handlers
selftests/mm: add uffd unit test for UFFDIO_POISON
Documentation/admin-guide/mm/userfaultfd.rst | 15 +++
fs/userfaultfd.c | 73 ++++++++++--
include/linux/mm_inline.h | 19 +++
include/linux/swapops.h | 10 +-
include/linux/userfaultfd_k.h | 4 +
include/uapi/linux/userfaultfd.h | 25 +++-
mm/hugetlb.c | 51 ++++++--
mm/madvise.c | 2 +-
mm/memory.c | 15 ++-
mm/mprotect.c | 4 +-
mm/shmem.c | 4 +-
mm/swapfile.c | 2 +-
mm/userfaultfd.c | 83 ++++++++++---
tools/testing/selftests/mm/uffd-common.c | 5 +-
tools/testing/selftests/mm/uffd-common.h | 3 +
tools/testing/selftests/mm/uffd-stress.c | 12 +-
tools/testing/selftests/mm/uffd-unit-tests.c | 117 +++++++++++++++++++
17 files changed, 377 insertions(+), 67 deletions(-)
--
2.41.0.255.g8b1d071c50-goog
Most userfaultfd ioctls take a `start + len` range as an argument.
We have the validate_range helper to check that such ranges are valid.
However, some (but not all!) ioctls *also* check that `start + len`
doesn't wrap around (overflow).
Just check for this in validate_range. This saves some repetitive code,
and adds the check to some ioctls which weren't bothering to check for
it before.
Signed-off-by: Axel Rasmussen <[email protected]>
---
fs/userfaultfd.c | 15 +++------------
1 file changed, 3 insertions(+), 12 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 7cecd49e078b..2e84684c46f0 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1306,6 +1306,8 @@ static __always_inline int validate_range(struct mm_struct *mm,
return -EINVAL;
if (len > task_size - start)
return -EINVAL;
+ if (start + len <= start)
+ return -EINVAL;
return 0;
}
@@ -1760,14 +1762,8 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
ret = validate_range(ctx->mm, uffdio_copy.dst, uffdio_copy.len);
if (ret)
goto out;
- /*
- * double check for wraparound just in case. copy_from_user()
- * will later check uffdio_copy.src + uffdio_copy.len to fit
- * in the userland range.
- */
+
ret = -EINVAL;
- if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
- goto out;
if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
goto out;
if (uffdio_copy.mode & UFFDIO_COPY_MODE_WP)
@@ -1927,11 +1923,6 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
goto out;
ret = -EINVAL;
- /* double check for wraparound just in case. */
- if (uffdio_continue.range.start + uffdio_continue.range.len <=
- uffdio_continue.range.start) {
- goto out;
- }
if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE |
UFFDIO_CONTINUE_MODE_WP))
goto out;
--
2.41.0.255.g8b1d071c50-goog
Future patches will re-use PTE_MARKER_SWAPIN_ERROR to implement
UFFDIO_POISON, so make some various preparations for that:
First, rename it to just PTE_MARKER_ERROR. The "SWAPIN" can be confusing
since we're going to re-use it for something not really related to swap.
This can be particularly confusing for things like hugetlbfs, which
doesn't support swap whatsoever. Also rename some various helper
functions.
Next, fix pte marker copying for hugetlbfs. Previously, it would WARN on
seeing a PTE_MARKER_SWAPIN_ERROR, since hugetlbfs doesn't support swap.
But, since we're going to re-use it, we want it to go ahead and copy it
just like non-hugetlbfs memory does today. Since the code to do this is
more complicated now, pull it out into a helper which can be re-used in
both places. While we're at it, also make it slightly more explicit in
its handling of e.g. uffd wp markers.
For non-hugetlbfs page faults, instead of returning VM_FAULT_SIGBUS for
an error entry, return VM_FAULT_HWPOISON. For most cases this change
doesn't matter, e.g. a userspace program would receive a SIGBUS either
way. But for UFFDIO_POISON, this change will let KVM guests get an MCE
out of the box, instead of giving a SIGBUS to the hypervisor and
requiring it to somehow inject an MCE.
Finally, for hugetlbfs faults, handle PTE_MARKER_ERROR, and return
VM_FAULT_HWPOISON_LARGE in such cases. Note that this can't happen today
because the lack of swap support means we'll never end up with such a
PTE anyway, but this behavior will be needed once such entries *can*
show up via UFFDIO_POISON.
Signed-off-by: Axel Rasmussen <[email protected]>
---
include/linux/mm_inline.h | 19 +++++++++++++++++++
include/linux/swapops.h | 10 +++++-----
mm/hugetlb.c | 32 +++++++++++++++++++++-----------
mm/madvise.c | 2 +-
mm/memory.c | 15 +++++++++------
mm/mprotect.c | 4 ++--
mm/shmem.c | 4 ++--
mm/swapfile.c | 2 +-
8 files changed, 60 insertions(+), 28 deletions(-)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 21d6c72bcc71..329bd9370b49 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -523,6 +523,25 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
return atomic_read(&mm->tlb_flush_pending) > 1;
}
+/*
+ * Computes the pte marker to copy from the given source entry into dst_vma.
+ * If no marker should be copied, returns 0.
+ * The caller should insert a new pte created with make_pte_marker().
+ */
+static inline pte_marker copy_pte_marker(
+ swp_entry_t entry, struct vm_area_struct *dst_vma)
+{
+ pte_marker srcm = pte_marker_get(entry);
+ /* Always copy error entries. */
+ pte_marker dstm = srcm & PTE_MARKER_ERROR;
+
+ /* Only copy PTE markers if UFFD register matches. */
+ if ((srcm & PTE_MARKER_UFFD_WP) && userfaultfd_wp(dst_vma))
+ dstm |= PTE_MARKER_UFFD_WP;
+
+ return dstm;
+}
+
/*
* If this pte is wr-protected by uffd-wp in any form, arm the special pte to
* replace a none pte. NOTE! This should only be called when *pte is already
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 4c932cb45e0b..5f1818d48dd6 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -393,7 +393,7 @@ static inline bool is_migration_entry_dirty(swp_entry_t entry)
typedef unsigned long pte_marker;
#define PTE_MARKER_UFFD_WP BIT(0)
-#define PTE_MARKER_SWAPIN_ERROR BIT(1)
+#define PTE_MARKER_ERROR BIT(1)
#define PTE_MARKER_MASK (BIT(2) - 1)
static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
@@ -421,15 +421,15 @@ static inline pte_t make_pte_marker(pte_marker marker)
return swp_entry_to_pte(make_pte_marker_entry(marker));
}
-static inline swp_entry_t make_swapin_error_entry(void)
+static inline swp_entry_t make_error_swp_entry(void)
{
- return make_pte_marker_entry(PTE_MARKER_SWAPIN_ERROR);
+ return make_pte_marker_entry(PTE_MARKER_ERROR);
}
-static inline int is_swapin_error_entry(swp_entry_t entry)
+static inline int is_error_swp_entry(swp_entry_t entry)
{
return is_pte_marker_entry(entry) &&
- (pte_marker_get(entry) & PTE_MARKER_SWAPIN_ERROR);
+ (pte_marker_get(entry) & PTE_MARKER_ERROR);
}
/*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bce28cca73a1..934e129d9939 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -34,6 +34,7 @@
#include <linux/nospec.h>
#include <linux/delayacct.h>
#include <linux/memory.h>
+#include <linux/mm_inline.h>
#include <asm/page.h>
#include <asm/pgalloc.h>
@@ -5101,15 +5102,12 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
entry = huge_pte_clear_uffd_wp(entry);
set_huge_pte_at(dst, addr, dst_pte, entry);
} else if (unlikely(is_pte_marker(entry))) {
- /* No swap on hugetlb */
- WARN_ON_ONCE(
- is_swapin_error_entry(pte_to_swp_entry(entry)));
- /*
- * We copy the pte marker only if the dst vma has
- * uffd-wp enabled.
- */
- if (userfaultfd_wp(dst_vma))
- set_huge_pte_at(dst, addr, dst_pte, entry);
+ pte_marker marker = copy_pte_marker(
+ pte_to_swp_entry(entry), dst_vma);
+
+ if (marker)
+ set_huge_pte_at(dst, addr, dst_pte,
+ make_pte_marker(marker));
} else {
entry = huge_ptep_get(src_pte);
pte_folio = page_folio(pte_page(entry));
@@ -6090,14 +6088,26 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
entry = huge_ptep_get(ptep);
- /* PTE markers should be handled the same way as none pte */
- if (huge_pte_none_mostly(entry))
+ if (huge_pte_none_mostly(entry)) {
+ if (is_pte_marker(entry)) {
+ pte_marker marker =
+ pte_marker_get(pte_to_swp_entry(entry));
+
+ if (marker & PTE_MARKER_ERROR) {
+ ret = VM_FAULT_HWPOISON_LARGE;
+ goto out_mutex;
+ }
+ }
+
/*
+ * Other PTE markers should be handled the same way as none PTE.
+ *
* hugetlb_no_page will drop vma lock and hugetlb fault
* mutex internally, which make us return immediately.
*/
return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
entry, flags);
+ }
ret = 0;
diff --git a/mm/madvise.c b/mm/madvise.c
index 886f06066622..59e954586e2a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -660,7 +660,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
free_swap_and_cache(entry);
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
} else if (is_hwpoison_entry(entry) ||
- is_swapin_error_entry(entry)) {
+ is_error_swp_entry(entry)) {
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
}
continue;
diff --git a/mm/memory.c b/mm/memory.c
index 0ae594703021..c8b6de99d14c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -860,8 +860,11 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
return -EBUSY;
return -ENOENT;
} else if (is_pte_marker_entry(entry)) {
- if (is_swapin_error_entry(entry) || userfaultfd_wp(dst_vma))
- set_pte_at(dst_mm, addr, dst_pte, pte);
+ pte_marker marker = copy_pte_marker(entry, dst_vma);
+
+ if (marker)
+ set_pte_at(dst_mm, addr, dst_pte,
+ make_pte_marker(marker));
return 0;
}
if (!userfaultfd_wp(dst_vma))
@@ -1500,7 +1503,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
!zap_drop_file_uffd_wp(details))
continue;
} else if (is_hwpoison_entry(entry) ||
- is_swapin_error_entry(entry)) {
+ is_error_swp_entry(entry)) {
if (!should_zap_cows(details))
continue;
} else {
@@ -3647,7 +3650,7 @@ static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
* none pte. Otherwise it means the pte could have changed, so retry.
*
* This should also cover the case where e.g. the pte changed
- * quickly from a PTE_MARKER_UFFD_WP into PTE_MARKER_SWAPIN_ERROR.
+ * quickly from a PTE_MARKER_UFFD_WP into PTE_MARKER_ERROR.
* So is_pte_marker() check is not enough to safely drop the pte.
*/
if (pte_same(vmf->orig_pte, ptep_get(vmf->pte)))
@@ -3693,8 +3696,8 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
return VM_FAULT_SIGBUS;
/* Higher priority than uffd-wp when data corrupted */
- if (marker & PTE_MARKER_SWAPIN_ERROR)
- return VM_FAULT_SIGBUS;
+ if (marker & PTE_MARKER_ERROR)
+ return VM_FAULT_HWPOISON;
if (pte_marker_entry_uffd_wp(entry))
return pte_marker_handle_uffd_wp(vmf);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6f658d483704..47d255c8c2f2 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -230,10 +230,10 @@ static long change_pte_range(struct mmu_gather *tlb,
newpte = pte_swp_mkuffd_wp(newpte);
} else if (is_pte_marker_entry(entry)) {
/*
- * Ignore swapin errors unconditionally,
+ * Ignore error swap entries unconditionally,
* because any access should sigbus anyway.
*/
- if (is_swapin_error_entry(entry))
+ if (is_error_swp_entry(entry))
continue;
/*
* If this is uffd-wp pte marker and we'd like
diff --git a/mm/shmem.c b/mm/shmem.c
index 2f2e0e618072..c0f408c2c020 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1707,7 +1707,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
swp_entry_t swapin_error;
void *old;
- swapin_error = make_swapin_error_entry();
+ swapin_error = make_error_swp_entry();
old = xa_cmpxchg_irq(&mapping->i_pages, index,
swp_to_radix_entry(swap),
swp_to_radix_entry(swapin_error), 0);
@@ -1752,7 +1752,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
swap = radix_to_swp_entry(*foliop);
*foliop = NULL;
- if (is_swapin_error_entry(swap))
+ if (is_error_swp_entry(swap))
return -EIO;
si = get_swap_device(swap);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8e6dde68b389..72e110387e67 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1773,7 +1773,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
swp_entry = make_hwpoison_entry(swapcache);
page = swapcache;
} else {
- swp_entry = make_swapin_error_entry();
+ swp_entry = make_error_swp_entry();
}
new_pte = swp_entry_to_pte(swp_entry);
ret = 0;
--
2.41.0.255.g8b1d071c50-goog
This code is already duplicated twice, and UFFDIO_POISON will do the
same check a third time. So, it's worth extracting into a helper to save
repetitive lines of code.
Signed-off-by: Axel Rasmussen <[email protected]>
---
mm/userfaultfd.c | 38 ++++++++++++++++++++------------------
1 file changed, 20 insertions(+), 18 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index a2bf37ee276d..4244ca7ee903 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -45,6 +45,22 @@ struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm,
return dst_vma;
}
+/* Check if dst_addr is outside of file's size. Must be called with ptl held. */
+static bool mfill_file_over_size(struct vm_area_struct *dst_vma,
+ unsigned long dst_addr)
+{
+ struct inode *inode;
+ pgoff_t offset, max_off;
+
+ if (!dst_vma->vm_file)
+ return false;
+
+ inode = dst_vma->vm_file->f_inode;
+ offset = linear_page_index(dst_vma, dst_addr);
+ max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+ return offset >= max_off;
+}
+
/*
* Install PTEs, to map dst_addr (within dst_vma) to page.
*
@@ -64,8 +80,6 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
bool page_in_cache = page_mapping(page);
spinlock_t *ptl;
struct folio *folio;
- struct inode *inode;
- pgoff_t offset, max_off;
_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
_dst_pte = pte_mkdirty(_dst_pte);
@@ -81,14 +95,9 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
if (!dst_pte)
goto out;
- if (vma_is_shmem(dst_vma)) {
- /* serialize against truncate with the page table lock */
- inode = dst_vma->vm_file->f_inode;
- offset = linear_page_index(dst_vma, dst_addr);
- max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+ if (mfill_file_over_size(dst_vma, dst_addr)) {
ret = -EFAULT;
- if (unlikely(offset >= max_off))
- goto out_unlock;
+ goto out_unlock;
}
ret = -EEXIST;
@@ -211,8 +220,6 @@ static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
pte_t _dst_pte, *dst_pte;
spinlock_t *ptl;
int ret;
- pgoff_t offset, max_off;
- struct inode *inode;
_dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
dst_vma->vm_page_prot));
@@ -220,14 +227,9 @@ static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
dst_pte = pte_offset_map_lock(dst_vma->vm_mm, dst_pmd, dst_addr, &ptl);
if (!dst_pte)
goto out;
- if (dst_vma->vm_file) {
- /* the shmem MAP_PRIVATE case requires checking the i_size */
- inode = dst_vma->vm_file->f_inode;
- offset = linear_page_index(dst_vma, dst_addr);
- max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+ if (mfill_file_over_size(dst_vma, dst_addr)) {
ret = -EFAULT;
- if (unlikely(offset >= max_off))
- goto out_unlock;
+ goto out_unlock;
}
ret = -EEXIST;
if (!pte_none(ptep_get(dst_pte)))
--
2.41.0.255.g8b1d071c50-goog
On Thu, Jul 06, 2023 at 03:50:29PM -0700, Axel Rasmussen wrote:
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 886f06066622..59e954586e2a 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -660,7 +660,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> free_swap_and_cache(entry);
> pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> } else if (is_hwpoison_entry(entry) ||
> - is_swapin_error_entry(entry)) {
> + is_error_swp_entry(entry)) {
is_error_swp_entry() made me think whether we should call this marker as
ERROR at all - it's too generic, is_error_swp_entry() can be easily read as
"this swap entry is not legal". Can we come up with a less generic one?
PTE_MARKER_PAGE_POISONED / PTE_MARKER_POISONED / PTE_MARKER_PAGE_ERR / ...?
We do use poisoned only in real bad physical pages before, but I think we
can still keep using it when applying it to a pte. I think it's fine to
call a pte poisoned if it's not legal to access, just like a poisoned page.
> pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> }
> continue;
The code change across the whole patch look good to me otherwise, thanks.
--
Peter Xu
On Thu, Jul 06, 2023 at 03:50:30PM -0700, Axel Rasmussen wrote:
> Most userfaultfd ioctls take a `start + len` range as an argument.
> We have the validate_range helper to check that such ranges are valid.
> However, some (but not all!) ioctls *also* check that `start + len`
> doesn't wrap around (overflow).
>
> Just check for this in validate_range. This saves some repetitive code,
> and adds the check to some ioctls which weren't bothering to check for
> it before.
>
> Signed-off-by: Axel Rasmussen <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
--
Peter Xu
On Thu, Jul 06, 2023 at 03:50:31PM -0700, Axel Rasmussen wrote:
> This code is already duplicated twice, and UFFDIO_POISON will do the
> same check a third time. So, it's worth extracting into a helper to save
> repetitive lines of code.
>
> Signed-off-by: Axel Rasmussen <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
--
Peter Xu