This series implements initial write protection support for
userfaultfd. Currently only anonymous memory is supported; shmem and
hugetlbfs are not covered yet. This is the 5th version of the series.
The latest code can also be found at:
https://github.com/xzpeter/linux/tree/uffd-wp-merged
v5 changelog:
- rebase
- drop two patches:
"userfaultfd: wp: handle COW properly for uffd-wp"
"mm: introduce do_wp_page_cont()"
  instead, always remove the write bit when resolving an uffd-wp page
  fault in the previous patch ("userfaultfd: wp: apply _PAGE_UFFD_WP
  bit"), then COW will be handled correctly in the page fault handler
  [Andrea]
v4 changelog:
- add r-bs
- use kernel-doc format for fault_flag_allow_retry_first [Jerome]
- drop "export wp_page_copy", add new patch to split do_wp_page(), use
it in change_pte_range() to replace the wp_page_copy(). [Jerome] (I
thought about different ways to do this but I still can't find a
100% good way for all... in this version I still used the
do_wp_page_cont naming. We can still discuss this and how we should
split do_wp_page)
- make sure uffd-wp will also apply to device private entries which
HMM uses [Jerome]
v3 changelog:
- take r-bs
- patch 1: fix typo [Jerome]
- patch 2: use brackets where proper around (flags & VM_FAULT_RETRY)
(there're three places to change, not four...) [Jerome]
- patch 4: make sure TRIED is applied correctly on all archs, add more
comment to explain the new page fault mechanism [Jerome]
- patch 7: in do_swap_page() remove the two lines to remove
FAULT_FLAG_WRITE flag [Jerome]
- patch 10: another brackets change like above, and in
  mfill_atomic_pte return -EINVAL when wp_copy==1 is detected upon
  shared memory [Jerome]
- patch 12: move _PAGE_CHG_MASK change to patch 8 [Jerome]
- patch 14: wp_page_copy() - fix write bit; change_pte_range() -
detect PTE change after COW [Jerome]
- patch 17: remove the last paragraph of the commit message; no need
  to drop the two lines in do_swap_page() since they've been directly
  dropped in patch 7; touch up remove_migration_pte() to only detect
  the uffd-wp bit if it's a read migration entry [Jerome]
- add patch: "userfaultfd: wp: declare _UFFDIO_WRITEPROTECT
conditionally", which remove _UFFDIO_WRITEPROTECT bit if detected
non-anonymous memory during REGISTER; meanwhile fixup the test case
for shmem too for expected ioctls returned from REGISTER [Mike]
- add patch: "userfaultfd: wp: fixup swap entries in
change_pte_range", the new patch will allow to apply the uffd-wp
bits upon swap entries directly (e.g., when the page is during
migration or the page was swapped out). Please see the patch for
detail information.
v2 changelog:
- add some r-bs
- split the patch "mm: userfault: return VM_FAULT_RETRY on signals"
  into two: one to focus on the signal behavior change, the other to
  remove the NOPAGE special path in handle_userfault().  Remove the
  ARC-specific change and that part of the commit message, since it's
  already fixed in 4d447455e73b [Jerome]
- return -ENOENT when VMA is invalid for UFFDIO_WRITEPROTECT to match
UFFDIO_COPY errno [Mike]
- add a new patch to introduce helper to find valid VMA for uffd
[Mike]
- check against VM_MAYWRITE instead of VM_WRITE when registering UFFD
WP [Mike]
- MM_CP_DIRTY_ACCT is used incorrectly, fix it up [Jerome]
- make sure the lock_page behavior will not be changed [Jerome]
- reorder the whole series, introduce the new ioctl last. [Jerome]
- fix up uffdio_writeprotect() following commit df2cc96e77011cf79 to
  return -EAGAIN when mm layout changes are detected [Mike]
v1 can be found at: https://lkml.org/lkml/2019/1/21/130
Any comment would be greatly welcomed. Thanks.
Overview
====================
The uffd-wp work was initiated by Shaohua Li [1], and later continued
by Andrea [2].  This series is based upon Andrea's latest userfaultfd
tree, and it is a continuation of the work from both Shaohua and
Andrea.  Many of the follow-up ideas come from Andrea too.
Besides the old MISSING register mode of userfaultfd, the new uffd-wp
support provides an alternative register mode called
UFFDIO_REGISTER_MODE_WP that can be used to listen not only to
missing page faults but also to write-protection page faults; the two
modes can even be registered together.  At the same time, the new
feature also provides a new userfaultfd ioctl called
UFFDIO_WRITEPROTECT, which allows userspace to write protect a range
of memory or to fix up the write permission of faulted pages.
Please refer to the document patch "userfaultfd: wp:
UFFDIO_REGISTER_MODE_WP documentation update" for more information on
the new interface and what it can do.
The major workflow of an uffd-wp program should be (a minimal sketch
in C follows the list):

1. Register a memory region with WP mode using UFFDIO_REGISTER_MODE_WP

2. Write protect part or all of the registered region using
   UFFDIO_WRITEPROTECT, passing in UFFDIO_WRITEPROTECT_MODE_WP to
   show that we want to write protect the range.

3. Start a working thread that modifies the protected pages, while
   listening to UFFD messages.

4. When a write is detected upon the protected range, a page fault
   happens, and a UFFD message will be generated and reported to the
   page fault handling thread.

5. The page fault handler thread resolves the page fault using the
   new UFFDIO_WRITEPROTECT ioctl, this time passing in
   !UFFDIO_WRITEPROTECT_MODE_WP to show that we want to restore the
   write permission.  Before this operation, the fault handler
   thread can do anything it wants, e.g., dump the page to
   persistent storage.

6. The worker thread will continue running with the write
   permission correctly restored in step 5.
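
As a rough illustration of steps 1, 2 and 5, below is a hedged
userspace sketch based on the uapi introduced by this series (struct
uffdio_writeprotect, UFFDIO_WRITEPROTECT and the mode flags).  The
wrapper names wp_register()/wp_range() are hypothetical, and the
userfaultfd(2) fd creation plus the UFFDIO_API handshake are assumed
to have been done on "uffd" already:

  #include <stdbool.h>
  #include <stddef.h>
  #include <sys/ioctl.h>
  #include <linux/userfaultfd.h>

  /* Step 1: register the region in write-protect mode. */
  static int wp_register(int uffd, void *addr, size_t len)
  {
          struct uffdio_register reg = {
                  .range = { .start = (unsigned long)addr, .len = len },
                  .mode  = UFFDIO_REGISTER_MODE_WP,
          };

          return ioctl(uffd, UFFDIO_REGISTER, &reg);
  }

  /* Step 2 (wp == true) and step 5 (wp == false). */
  static int wp_range(int uffd, void *addr, size_t len, bool wp)
  {
          struct uffdio_writeprotect prot = {
                  .range = { .start = (unsigned long)addr, .len = len },
                  .mode  = wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0,
          };

          return ioctl(uffd, UFFDIO_WRITEPROTECT, &prot);
  }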
Currently there are already two projects based on this new
userfaultfd feature.

QEMU Live Snapshot: The project provides a way to allow the QEMU
                    hypervisor to take snapshots of VMs without
                    stopping the VM [3].

LLNL umap library:  The project provides a mmap-like interface and
                    "allow to have an application specific buffer of
                    pages cached from a large file, i.e. out-of-core
                    execution using memory map" [4][5].
Before posting the patchset, this series was smoke tested against QEMU
live snapshot and the LLNL umap library (by doing parallel quicksort
using 128 sorting threads + 80 uffd servicing threads). My sincere
thanks to Marty Mcfadden and Denis Plotnikov for the help along the
way.
TODO
=============
- hugetlbfs/shmem support
- performance
- more architectures
- cooperate with mprotect()-allowed processes (???)
- ...
References
==========
[1] https://lwn.net/Articles/666187/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/log/?h=userfault
[3] https://github.com/denis-plotnikov/qemu/commits/background-snapshot-kvm
[4] https://github.com/LLNL/umap
[5] https://llnl-umap.readthedocs.io/en/develop/
[6] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=b245ecf6cf59156966f3da6e6b674f6695a5ffa5
[7] https://lkml.org/lkml/2018/11/21/370
[8] https://lkml.org/lkml/2018/12/30/64
Andrea Arcangeli (5):
userfaultfd: wp: hook userfault handler to write protection fault
userfaultfd: wp: add WP pagetable tracking to x86
userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers
userfaultfd: wp: add UFFDIO_COPY_MODE_WP
userfaultfd: wp: add the writeprotect API to userfaultfd ioctl
Martin Cracauer (1):
userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update
Peter Xu (16):
mm: gup: rename "nonblocking" to "locked" where proper
mm: userfault: return VM_FAULT_RETRY on signals
userfaultfd: don't retake mmap_sem to emulate NOPAGE
mm: allow VM_FAULT_RETRY for multiple times
mm: gup: allow VM_FAULT_RETRY for multiple times
mm: merge parameters for change_protection()
userfaultfd: wp: apply _PAGE_UFFD_WP bit
userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork
userfaultfd: wp: add pmd_swp_*uffd_wp() helpers
userfaultfd: wp: support swap and page migration
khugepaged: skip collapse if uffd-wp detected
userfaultfd: introduce helper vma_find_uffd
userfaultfd: wp: don't wake up when doing write protect
userfaultfd: wp: declare _UFFDIO_WRITEPROTECT conditionally
userfaultfd: selftests: refactor statistics
userfaultfd: selftests: add write-protect test
Shaohua Li (3):
userfaultfd: wp: add helper for writeprotect check
userfaultfd: wp: support write protection for userfault vma range
userfaultfd: wp: enabled write protection in userfaultfd API
Documentation/admin-guide/mm/userfaultfd.rst | 51 +++++
arch/alpha/mm/fault.c | 4 +-
arch/arc/mm/fault.c | 12 +-
arch/arm/mm/fault.c | 9 +-
arch/arm64/mm/fault.c | 11 +-
arch/hexagon/mm/vm_fault.c | 3 +-
arch/ia64/mm/fault.c | 3 +-
arch/m68k/mm/fault.c | 5 +-
arch/microblaze/mm/fault.c | 3 +-
arch/mips/mm/fault.c | 3 +-
arch/nds32/mm/fault.c | 7 +-
arch/nios2/mm/fault.c | 5 +-
arch/openrisc/mm/fault.c | 3 +-
arch/parisc/mm/fault.c | 6 +-
arch/powerpc/mm/fault.c | 8 +-
arch/riscv/mm/fault.c | 9 +-
arch/s390/mm/fault.c | 14 +-
arch/sh/mm/fault.c | 5 +-
arch/sparc/mm/fault_32.c | 4 +-
arch/sparc/mm/fault_64.c | 4 +-
arch/um/kernel/trap.c | 6 +-
arch/unicore32/mm/fault.c | 8 +-
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 67 ++++++
arch/x86/include/asm/pgtable_64.h | 8 +-
arch/x86/include/asm/pgtable_types.h | 11 +-
arch/x86/mm/fault.c | 8 +-
arch/xtensa/mm/fault.c | 4 +-
drivers/gpu/drm/ttm/ttm_bo_vm.c | 12 +-
fs/userfaultfd.c | 130 +++++++----
include/asm-generic/pgtable.h | 1 +
include/asm-generic/pgtable_uffd.h | 66 ++++++
include/linux/huge_mm.h | 2 +-
include/linux/mm.h | 60 ++++-
include/linux/swapops.h | 2 +
include/linux/userfaultfd_k.h | 42 +++-
include/trace/events/huge_memory.h | 1 +
include/uapi/linux/userfaultfd.h | 40 +++-
init/Kconfig | 5 +
mm/filemap.c | 2 +-
mm/gup.c | 61 ++---
mm/huge_memory.c | 32 ++-
mm/hugetlb.c | 14 +-
mm/khugepaged.c | 23 ++
mm/memory.c | 26 ++-
mm/mempolicy.c | 2 +-
mm/migrate.c | 6 +
mm/mprotect.c | 74 ++++--
mm/rmap.c | 6 +
mm/shmem.c | 2 +-
mm/userfaultfd.c | 148 +++++++++---
tools/testing/selftests/vm/userfaultfd.c | 225 +++++++++++++++----
52 files changed, 974 insertions(+), 290 deletions(-)
create mode 100644 include/asm-generic/pgtable_uffd.h
--
2.21.0
There are plenty of places around __get_user_pages() that have a
parameter "nonblocking" which does not really mean "it won't block"
(because it can in fact block), but instead indicates whether the
mmap_sem is released by up_read() during the page fault handling,
mostly when VM_FAULT_RETRY is returned.

We have the correct naming in e.g. get_user_pages_locked() or
get_user_pages_remote() as "locked", however there are still many
places that use "nonblocking" as the name.

Rename those places to "locked" where proper, to better suit the
functionality of the variable.  While at it, fix up some of the
comments accordingly.
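
For context, here is a hedged sketch of the caller pattern that the
"locked" name describes (pin_user_range() is a hypothetical wrapper;
get_user_pages_locked() is the existing GUP API):

  static long pin_user_range(unsigned long start, unsigned long nr_pages,
                             struct page **pages)
  {
          int locked = 1;
          long ret;

          down_read(&current->mm->mmap_sem);
          ret = get_user_pages_locked(start, nr_pages, FOLL_WRITE,
                                      pages, &locked);
          /*
           * "locked" now tells whether the mmap_sem is still held:
           * GUP may have dropped it (setting *locked to 0) while
           * waiting for a page fault to be handled.
           */
          if (locked)
                  up_read(&current->mm->mmap_sem);
          return ret;
  }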
Reviewed-by: Mike Rapoport <[email protected]>
Reviewed-by: Jerome Glisse <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 44 +++++++++++++++++++++-----------------------
mm/hugetlb.c | 8 ++++----
2 files changed, 25 insertions(+), 27 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index ddde097cf9e4..58d282115d9b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -625,12 +625,12 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
}
/*
- * mmap_sem must be held on entry. If @nonblocking != NULL and
- * *@flags does not include FOLL_NOWAIT, the mmap_sem may be released.
- * If it is, *@nonblocking will be set to 0 and -EBUSY returned.
+ * mmap_sem must be held on entry. If @locked != NULL and *@flags
+ * does not include FOLL_NOWAIT, the mmap_sem may be released. If it
+ * is, *@locked will be set to 0 and -EBUSY returned.
*/
static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
- unsigned long address, unsigned int *flags, int *nonblocking)
+ unsigned long address, unsigned int *flags, int *locked)
{
unsigned int fault_flags = 0;
vm_fault_t ret;
@@ -642,7 +642,7 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
fault_flags |= FAULT_FLAG_WRITE;
if (*flags & FOLL_REMOTE)
fault_flags |= FAULT_FLAG_REMOTE;
- if (nonblocking)
+ if (locked)
fault_flags |= FAULT_FLAG_ALLOW_RETRY;
if (*flags & FOLL_NOWAIT)
fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
@@ -668,8 +668,8 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
}
if (ret & VM_FAULT_RETRY) {
- if (nonblocking && !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
- *nonblocking = 0;
+ if (locked && !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
+ *locked = 0;
return -EBUSY;
}
@@ -746,7 +746,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
* only intends to ensure the pages are faulted in.
* @vmas: array of pointers to vmas corresponding to each page.
* Or NULL if the caller does not require them.
- * @nonblocking: whether waiting for disk IO or mmap_sem contention
+ * @locked: whether we're still with the mmap_sem held
*
* Returns number of pages pinned. This may be fewer than the number
* requested. If nr_pages is 0 or negative, returns 0. If no pages
@@ -775,13 +775,11 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
* appropriate) must be called after the page is finished with, and
* before put_page is called.
*
- * If @nonblocking != NULL, __get_user_pages will not wait for disk IO
- * or mmap_sem contention, and if waiting is needed to pin all pages,
- * *@nonblocking will be set to 0. Further, if @gup_flags does not
- * include FOLL_NOWAIT, the mmap_sem will be released via up_read() in
- * this case.
+ * If @locked != NULL, *@locked will be set to 0 when mmap_sem is
+ * released by an up_read(). That can happen if @gup_flags does not
+ * have FOLL_NOWAIT.
*
- * A caller using such a combination of @nonblocking and @gup_flags
+ * A caller using such a combination of @locked and @gup_flags
* must therefore hold the mmap_sem for reading only, and recognize
* when it's been released. Otherwise, it must be held for either
* reading or writing and will not be released.
@@ -793,7 +791,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
- struct vm_area_struct **vmas, int *nonblocking)
+ struct vm_area_struct **vmas, int *locked)
{
long ret = 0, i = 0;
struct vm_area_struct *vma = NULL;
@@ -837,7 +835,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
if (is_vm_hugetlb_page(vma)) {
i = follow_hugetlb_page(mm, vma, pages, vmas,
&start, &nr_pages, i,
- gup_flags, nonblocking);
+ gup_flags, locked);
continue;
}
}
@@ -855,7 +853,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
page = follow_page_mask(vma, start, foll_flags, &ctx);
if (!page) {
ret = faultin_page(tsk, vma, start, &foll_flags,
- nonblocking);
+ locked);
switch (ret) {
case 0:
goto retry;
@@ -1508,7 +1506,7 @@ EXPORT_SYMBOL(get_user_pages);
* @vma: target vma
* @start: start address
* @end: end address
- * @nonblocking:
+ * @locked: whether the mmap_sem is still held
*
* This takes care of mlocking the pages too if VM_LOCKED is set.
*
@@ -1516,14 +1514,14 @@ EXPORT_SYMBOL(get_user_pages);
*
* vma->vm_mm->mmap_sem must be held.
*
- * If @nonblocking is NULL, it may be held for read or write and will
+ * If @locked is NULL, it may be held for read or write and will
* be unperturbed.
*
- * If @nonblocking is non-NULL, it must held for read only and may be
- * released. If it's released, *@nonblocking will be set to 0.
+ * If @locked is non-NULL, it must held for read only and may be
+ * released. If it's released, *@locked will be set to 0.
*/
long populate_vma_page_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end, int *nonblocking)
+ unsigned long start, unsigned long end, int *locked)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long nr_pages = (end - start) / PAGE_SIZE;
@@ -1558,7 +1556,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
* not result in a stack expansion that recurses back here.
*/
return __get_user_pages(current, mm, start, nr_pages, gup_flags,
- NULL, NULL, nonblocking);
+ NULL, NULL, locked);
}
/*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ac843d32b019..ba179c2fa8fb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4240,7 +4240,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page **pages, struct vm_area_struct **vmas,
unsigned long *position, unsigned long *nr_pages,
- long i, unsigned int flags, int *nonblocking)
+ long i, unsigned int flags, int *locked)
{
unsigned long pfn_offset;
unsigned long vaddr = *position;
@@ -4311,7 +4311,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
spin_unlock(ptl);
if (flags & FOLL_WRITE)
fault_flags |= FAULT_FLAG_WRITE;
- if (nonblocking)
+ if (locked)
fault_flags |= FAULT_FLAG_ALLOW_RETRY;
if (flags & FOLL_NOWAIT)
fault_flags |= FAULT_FLAG_ALLOW_RETRY |
@@ -4328,9 +4328,9 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
break;
}
if (ret & VM_FAULT_RETRY) {
- if (nonblocking &&
+ if (locked &&
!(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
- *nonblocking = 0;
+ *locked = 0;
*nr_pages = 0;
/*
* VM_FAULT_RETRY must not return an
--
2.21.0
The idea comes from the upstream discussion between Linus and Andrea:

https://lkml.org/lkml/2017/10/30/560

A summary of the issue: there was a special path in handle_userfault()
in the past where we would return VM_FAULT_NOPAGE when we detected
non-fatal signals while waiting for userfault handling.  We did that
by reacquiring the mmap_sem before returning.  However that brings a
risk in that the vmas might have changed when we retake the mmap_sem,
and we could even be holding an invalid vma structure.

This patch removes that risky path in handle_userfault(), so the
callers of handle_mm_fault() will know that the VMAs might have
changed.  Meanwhile we don't lose responsiveness either, since with
the previous patch the core mm code can now handle non-fatal
userspace signals quickly even if we return VM_FAULT_RETRY.
Suggested-by: Andrea Arcangeli <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Reviewed-by: Jerome Glisse <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 24 ------------------------
1 file changed, 24 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 3b30301c90ec..5dbef45ecbf5 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -516,30 +516,6 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
__set_current_state(TASK_RUNNING);
- if (return_to_userland) {
- if (signal_pending(current) &&
- !fatal_signal_pending(current)) {
- /*
- * If we got a SIGSTOP or SIGCONT and this is
- * a normal userland page fault, just let
- * userland return so the signal will be
- * handled and gdb debugging works. The page
- * fault code immediately after we return from
- * this function is going to release the
- * mmap_sem and it's not depending on it
- * (unlike gup would if we were not to return
- * VM_FAULT_RETRY).
- *
- * If a fatal signal is pending we still take
- * the streamlined VM_FAULT_RETRY failure path
- * and there's no need to retake the mmap_sem
- * in such case.
- */
- down_read(&mm->mmap_sem);
- ret = VM_FAULT_NOPAGE;
- }
- }
-
/*
* Here we race with the list_del; list_add in
* userfaultfd_ctx_read(), however because we don't ever run
--
2.21.0
The idea comes from a discussion between Linus and Andrea [1].
Before this patch we only allow a page fault to be retried once.  We
achieved this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
handle_mm_fault() the second time.  This was mainly used to avoid
unexpected starvation of the system by looping forever to handle the
page fault on a single page.  However that should hardly happen:
after all, each code path that returns VM_FAULT_RETRY will first wait
for a condition to happen (during which time we should possibly yield
the cpu), before VM_FAULT_RETRY is really returned.
This patch removes the restriction by keeping the
FAULT_FLAG_ALLOW_RETRY flag when we receive VM_FAULT_RETRY.  It means
that the page fault handler can now retry the page fault multiple
times if necessary, without the need to generate another page fault
event.  Meanwhile we still keep the FAULT_FLAG_TRIED flag so the page
fault handler can still identify whether a page fault is the first
attempt or not.
Then we'll have these combinations of fault flags (only considering
the ALLOW_RETRY flag and the TRIED flag):

 - ALLOW_RETRY and !TRIED:  the page fault allows retry, and this is
   the first try

 - ALLOW_RETRY and TRIED:   the page fault allows retry, and this is
   not the first try

 - !ALLOW_RETRY and !TRIED: the page fault does not allow retry at
   all

 - !ALLOW_RETRY and TRIED:  this is forbidden and should never be
   used
In existing code we have multiple places that have taken special care
of the first condition above by checking against (fault_flags &
FAULT_FLAG_ALLOW_RETRY).  This patch introduces a simple helper to
detect the first retry of a page fault by checking against both
(fault_flags & FAULT_FLAG_ALLOW_RETRY) and !(fault_flags &
FAULT_FLAG_TRIED), because now even the 2nd try will have the
ALLOW_RETRY flag set, then uses that helper in all existing special
paths.
One example is in __lock_page_or_retry(): now we'll drop the mmap_sem
only on the first attempt of the page fault and keep it in follow-up
retries, so the old locking behavior is retained.
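
As a summary sketch (mirroring the arch diffs below, e.g. the s390
hunk; not an extra change), the retry pattern in an arch page fault
handler now looks roughly like:

  retry:
          fault = handle_mm_fault(vma, address, flags);

          if (fault & VM_FAULT_RETRY) {
                  /* mmap_sem has been released by the core mm */
                  if (signal_pending(current))
                          return;
                  /*
                   * Keep FAULT_FLAG_ALLOW_RETRY set so that further
                   * retries stay possible; only record that we have
                   * tried at least once.
                   */
                  flags |= FAULT_FLAG_TRIED;
                  down_read(&mm->mmap_sem);
                  goto retry;
          }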
This will be a nice enhancement for the current code [2] and at the
same time supporting material for the future userfaultfd-writeprotect
work, since in that work there will always be an explicit userfault
writeprotect retry for protected pages, and if that cannot resolve
the page fault (e.g., when userfaultfd-writeprotect is used in
conjunction with swapped pages) then we'll possibly need a 3rd retry
of the page fault.  It might also benefit other potential users with
requirements similar to userfault write-protection.

GUP code is not touched yet and will be covered in a follow-up patch.
Please read the thread below for more information.
[1] https://lkml.org/lkml/2017/11/2/833
[2] https://lkml.org/lkml/2018/12/30/64
Suggested-by: Linus Torvalds <[email protected]>
Suggested-by: Andrea Arcangeli <[email protected]>
Reviewed-by: Jerome Glisse <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
arch/alpha/mm/fault.c | 2 +-
arch/arc/mm/fault.c | 1 -
arch/arm/mm/fault.c | 3 ---
arch/arm64/mm/fault.c | 5 ----
arch/hexagon/mm/vm_fault.c | 1 -
arch/ia64/mm/fault.c | 1 -
arch/m68k/mm/fault.c | 3 ---
arch/microblaze/mm/fault.c | 1 -
arch/mips/mm/fault.c | 1 -
arch/nds32/mm/fault.c | 1 -
arch/nios2/mm/fault.c | 3 ---
arch/openrisc/mm/fault.c | 1 -
arch/parisc/mm/fault.c | 4 +---
arch/powerpc/mm/fault.c | 6 -----
arch/riscv/mm/fault.c | 5 ----
arch/s390/mm/fault.c | 5 +---
arch/sh/mm/fault.c | 1 -
arch/sparc/mm/fault_32.c | 1 -
arch/sparc/mm/fault_64.c | 1 -
arch/um/kernel/trap.c | 1 -
arch/unicore32/mm/fault.c | 4 +---
arch/x86/mm/fault.c | 2 --
arch/xtensa/mm/fault.c | 1 -
drivers/gpu/drm/ttm/ttm_bo_vm.c | 12 +++++++---
include/linux/mm.h | 41 ++++++++++++++++++++++++++++++++-
mm/filemap.c | 2 +-
mm/shmem.c | 2 +-
27 files changed, 55 insertions(+), 56 deletions(-)
diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 8a2ef90b4bfc..6a02c0fb36b9 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -169,7 +169,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
else
current->min_flt++;
if (fault & VM_FAULT_RETRY) {
- flags &= ~FAULT_FLAG_ALLOW_RETRY;
+ flags |= FAULT_FLAG_TRIED;
/* No need to up_read(&mm->mmap_sem) as we would
* have already released it in __lock_page_or_retry
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index 3517820aea07..144d25b2e044 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -165,7 +165,6 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
}
if (fault & VM_FAULT_RETRY) {
- flags &= ~FAULT_FLAG_ALLOW_RETRY;
flags |= FAULT_FLAG_TRIED;
goto retry;
}
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index c41c021bbe40..7910b4b5205d 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -342,9 +342,6 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
regs, addr);
}
if (fault & VM_FAULT_RETRY) {
- /* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
- * of starvation. */
- flags &= ~FAULT_FLAG_ALLOW_RETRY;
flags |= FAULT_FLAG_TRIED;
goto retry;
}
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 890ec3a693e6..c36da19d9098 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -524,12 +524,7 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
return 0;
}
- /*
- * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk of
- * starvation.
- */
if (mm_flags & FAULT_FLAG_ALLOW_RETRY) {
- mm_flags &= ~FAULT_FLAG_ALLOW_RETRY;
mm_flags |= FAULT_FLAG_TRIED;
goto retry;
}
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index febb4f96ba6f..21b6e9d8f2a1 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -102,7 +102,6 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
else
current->min_flt++;
if (fault & VM_FAULT_RETRY) {
- flags &= ~FAULT_FLAG_ALLOW_RETRY;
flags |= FAULT_FLAG_TRIED;
goto retry;
}
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 62c2d39d2bed..9de95d39935e 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -189,7 +189,6 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
else
current->min_flt++;
if (fault & VM_FAULT_RETRY) {
- flags &= ~FAULT_FLAG_ALLOW_RETRY;
flags |= FAULT_FLAG_TRIED;
/* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index d9808a807ab8..b1b2109e4ab4 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -162,9 +162,6 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
else
current->min_flt++;
if (fault & VM_FAULT_RETRY) {
- /* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
- * of starvation. */
- flags &= ~FAULT_FLAG_ALLOW_RETRY;
flags |= FAULT_FLAG_TRIED;
/*
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index 4fd2dbd0c5ca..05a4847ac0bf 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -236,7 +236,6 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
else
current->min_flt++;
if (fault & VM_FAULT_RETRY) {
- flags &= ~FAULT_FLAG_ALLOW_RETRY;
flags |= FAULT_FLAG_TRIED;
/*
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 92374fd091d2..9953b5b571df 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -178,7 +178,6 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
tsk->min_flt++;
}
if (fault & VM_FAULT_RETRY) {
- flags &= ~FAULT_FLAG_ALLOW_RETRY;
flags |= FAULT_FLAG_TRIED;
/*
diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c
index da777de8a62e..3642bdd7909d 100644
--- a/arch/nds32/mm/fault.c
+++ b/arch/nds32/mm/fault.c
@@ -242,7 +242,6 @@ void do_page_fault(unsigned long entry, unsigned long addr,
1, regs, addr);
}
if (fault & VM_FAULT_RETRY) {
- flags &= ~FAULT_FLAG_ALLOW_RETRY;
flags |= FAULT_FLAG_TRIED;
/* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index bdb1f9db75ba..9d4961d51db4 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -157,9 +157,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
else
current->min_flt++;
if (fault & VM_FAULT_RETRY) {
- /* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
- * of starvation. */
- flags &= ~FAULT_FLAG_ALLOW_RETRY;
flags |= FAULT_FLAG_TRIED;
/*
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index f9f47dc32f94..05c754664fcb 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -181,7 +181,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
else
tsk->min_flt++;
if (fault & VM_FAULT_RETRY) {
- flags &= ~FAULT_FLAG_ALLOW_RETRY;
flags |= FAULT_FLAG_TRIED;
/* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index 29422eec329d..675b221af198 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -327,14 +327,12 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
else
current->min_flt++;
if (fault & VM_FAULT_RETRY) {
- flags &= ~FAULT_FLAG_ALLOW_RETRY;
-
/*
* No need to up_read(&mm->mmap_sem) as we would
* have already released it in __lock_page_or_retry
* in mm/filemap.c.
*/
-
+ flags |= FAULT_FLAG_TRIED;
goto retry;
}
}
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index c2168b298c82..63564afb24ed 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -608,13 +608,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
* case.
*/
if (unlikely(fault & VM_FAULT_RETRY)) {
- /* We retry only once */
if (flags & FAULT_FLAG_ALLOW_RETRY) {
- /*
- * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
- * of starvation.
- */
- flags &= ~FAULT_FLAG_ALLOW_RETRY;
flags |= FAULT_FLAG_TRIED;
if (is_user && signal_pending(current))
return 0;
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 4aa7a2343353..cc76c8766951 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -142,11 +142,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
1, regs, addr);
}
if (fault & VM_FAULT_RETRY) {
- /*
- * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
- * of starvation.
- */
- flags &= ~(FAULT_FLAG_ALLOW_RETRY);
flags |= FAULT_FLAG_TRIED;
/*
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 94087ba285be..e460043776f3 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -530,10 +530,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
fault = VM_FAULT_PFAULT;
goto out_up;
}
- /* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
- * of starvation. */
- flags &= ~(FAULT_FLAG_ALLOW_RETRY |
- FAULT_FLAG_RETRY_NOWAIT);
+ flags &= ~FAULT_FLAG_RETRY_NOWAIT;
flags |= FAULT_FLAG_TRIED;
down_read(&mm->mmap_sem);
goto retry;
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index baf5d73df40c..cd710e2d7c57 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -498,7 +498,6 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
regs, address);
}
if (fault & VM_FAULT_RETRY) {
- flags &= ~FAULT_FLAG_ALLOW_RETRY;
flags |= FAULT_FLAG_TRIED;
/*
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index a2c83104fe35..6735cd1c09b9 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -261,7 +261,6 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
1, regs, address);
}
if (fault & VM_FAULT_RETRY) {
- flags &= ~FAULT_FLAG_ALLOW_RETRY;
flags |= FAULT_FLAG_TRIED;
/* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index cad71ec5c7b3..28d5b4d012c6 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -459,7 +459,6 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
1, regs, address);
}
if (fault & VM_FAULT_RETRY) {
- flags &= ~FAULT_FLAG_ALLOW_RETRY;
flags |= FAULT_FLAG_TRIED;
/* No need to up_read(&mm->mmap_sem) as we would
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 05dcd4c5f0d5..e7723c133c7f 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -99,7 +99,6 @@ int handle_page_fault(unsigned long address, unsigned long ip,
else
current->min_flt++;
if (fault & VM_FAULT_RETRY) {
- flags &= ~FAULT_FLAG_ALLOW_RETRY;
flags |= FAULT_FLAG_TRIED;
goto retry;
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index 3611f19234a1..efca122b5ef7 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -261,9 +261,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
else
tsk->min_flt++;
if (fault & VM_FAULT_RETRY) {
- /* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
- * of starvation. */
- flags &= ~FAULT_FLAG_ALLOW_RETRY;
+ flags |= FAULT_FLAG_TRIED;
goto retry;
}
}
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index dcd7c1393be3..8d3fbd3dca75 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1465,9 +1465,7 @@ void do_user_addr_fault(struct pt_regs *regs,
if (unlikely(fault & VM_FAULT_RETRY)) {
bool is_user = flags & FAULT_FLAG_USER;
- /* Retry at most once */
if (flags & FAULT_FLAG_ALLOW_RETRY) {
- flags &= ~FAULT_FLAG_ALLOW_RETRY;
flags |= FAULT_FLAG_TRIED;
if (is_user && signal_pending(tsk))
return;
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index 792dad5e2f12..7cd55f2d66c9 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -128,7 +128,6 @@ void do_page_fault(struct pt_regs *regs)
else
current->min_flt++;
if (fault & VM_FAULT_RETRY) {
- flags &= ~FAULT_FLAG_ALLOW_RETRY;
flags |= FAULT_FLAG_TRIED;
/* No need to up_read(&mm->mmap_sem) as we would
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 6dacff49c1cc..8f2f9ee6effa 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -61,9 +61,10 @@ static vm_fault_t ttm_bo_vm_fault_idle(struct ttm_buffer_object *bo,
/*
* If possible, avoid waiting for GPU with mmap_sem
- * held.
+ * held. We only do this if the fault allows retry and this
+ * is the first attempt.
*/
- if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
+ if (fault_flag_allow_retry_first(vmf->flags)) {
ret = VM_FAULT_RETRY;
if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
goto out_unlock;
@@ -132,7 +133,12 @@ static vm_fault_t ttm_bo_vm_fault(struct vm_fault *vmf)
* for the buffer to become unreserved.
*/
if (unlikely(!reservation_object_trylock(bo->resv))) {
- if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
+ /*
+ * If the fault allows retry and this is the first
+ * fault attempt, we try to release the mmap_sem
+ * before waiting
+ */
+ if (fault_flag_allow_retry_first(vmf->flags)) {
if (!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
ttm_bo_get(bo);
up_read(&vmf->vma->vm_mm->mmap_sem);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index dd0b5f4e1e45..dcaca899e4a8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -383,16 +383,55 @@ extern unsigned int kobjsize(const void *objp);
*/
extern pgprot_t protection_map[16];
+/*
+ * About FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_TRIED: we can specify whether we
+ * would allow page faults to retry by specifying these two fault flags
+ * correctly. Currently there can be three legal combinations:
+ *
+ * (a) ALLOW_RETRY and !TRIED: this means the page fault allows retry, and
+ * this is the first try
+ *
+ * (b) ALLOW_RETRY and TRIED: this means the page fault allows retry, and
+ * we've already tried at least once
+ *
+ * (c) !ALLOW_RETRY and !TRIED: this means the page fault does not allow retry
+ *
+ * The unlisted combination (!ALLOW_RETRY && TRIED) is illegal and should never
+ * be used. Note that page faults can be allowed to retry for multiple times,
+ * in which case we'll have an initial fault with flags (a) then later on
+ * continuous faults with flags (b). We should always try to detect pending
+ * signals before a retry to make sure the continuous page faults can still be
+ * interrupted if necessary.
+ */
+
#define FAULT_FLAG_WRITE 0x01 /* Fault was a write access */
#define FAULT_FLAG_MKWRITE 0x02 /* Fault was mkwrite of existing pte */
#define FAULT_FLAG_ALLOW_RETRY 0x04 /* Retry fault if blocking */
#define FAULT_FLAG_RETRY_NOWAIT 0x08 /* Don't drop mmap_sem and wait when retrying */
#define FAULT_FLAG_KILLABLE 0x10 /* The fault task is in SIGKILL killable region */
-#define FAULT_FLAG_TRIED 0x20 /* Second try */
+#define FAULT_FLAG_TRIED 0x20 /* We've tried once */
#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */
#define FAULT_FLAG_REMOTE 0x80 /* faulting for non current tsk/mm */
#define FAULT_FLAG_INSTRUCTION 0x100 /* The fault was during an instruction fetch */
+/**
+ * fault_flag_allow_retry_first - check ALLOW_RETRY the first time
+ *
+ * This is mostly used for places where we want to try to avoid taking
+ * the mmap_sem for too long a time when waiting for another condition
+ * to change, in which case we can try to be polite to release the
+ * mmap_sem in the first round to avoid potential starvation of other
+ * processes that would also want the mmap_sem.
+ *
+ * Return: true if the page fault allows retry and this is the first
+ * attempt of the fault handling; false otherwise.
+ */
+static inline bool fault_flag_allow_retry_first(unsigned int flags)
+{
+ return (flags & FAULT_FLAG_ALLOW_RETRY) &&
+ (!(flags & FAULT_FLAG_TRIED));
+}
+
#define FAULT_FLAG_TRACE \
{ FAULT_FLAG_WRITE, "WRITE" }, \
{ FAULT_FLAG_MKWRITE, "MKWRITE" }, \
diff --git a/mm/filemap.c b/mm/filemap.c
index df2006ba0cfa..83fdf429f795 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1381,7 +1381,7 @@ EXPORT_SYMBOL_GPL(__lock_page_killable);
int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
unsigned int flags)
{
- if (flags & FAULT_FLAG_ALLOW_RETRY) {
+ if (fault_flag_allow_retry_first(flags)) {
/*
* CAUTION! In this case, mmap_sem is not released
* even though return 0.
diff --git a/mm/shmem.c b/mm/shmem.c
index 1bb3b8dc8bb2..ef3a19c83927 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2009,7 +2009,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function);
ret = VM_FAULT_NOPAGE;
- if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
+ if (fault_flag_allow_retry_first(vmf->flags) &&
!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
/* It's polite to up mmap_sem if we can */
up_read(&vma->vm_mm->mmap_sem);
--
2.21.0
The idea comes from the upstream discussion between Linus and Andrea:

https://lkml.org/lkml/2017/10/30/560

A summary of the issue: there was a special path in handle_userfault()
in the past where we would return VM_FAULT_NOPAGE when we detected
non-fatal signals while waiting for userfault handling.  We did that
by reacquiring the mmap_sem before returning.  However that brings a
risk in that the vmas might have changed when we retake the mmap_sem,
and we could even be holding an invalid vma structure.
This patch removes the special path, and we'll return VM_FAULT_RETRY
via the common path even if we have got such signals.  Then, for all
the architectures that pass FAULT_FLAG_ALLOW_RETRY into
handle_mm_fault(), we check not only for SIGKILL but for all the
other pending userspace signals right after we return from
handle_mm_fault().  This allows userspace to handle non-fatal signals
faster than before.
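
In other words, the common pattern applied across the arch handlers
in the diff below is roughly (a summary sketch, not an additional
change):

  fault = handle_mm_fault(vma, address, flags);

  if ((fault & VM_FAULT_RETRY) && signal_pending(current)) {
          /* Kernel-mode faults can only bail out on fatal signals */
          if (fatal_signal_pending(current) && !user_mode(regs))
                  goto no_context;
          /*
           * mmap_sem was already released in __lock_page_or_retry();
           * return so the (possibly non-fatal) signal gets handled
           * before the fault is retried.
           */
          return;
  }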
This patch is preparation work for the next patch, which will finally
remove the special code path mentioned above in handle_userfault().
Suggested-by: Linus Torvalds <[email protected]>
Suggested-by: Andrea Arcangeli <[email protected]>
Reviewed-by: Jerome Glisse <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
arch/alpha/mm/fault.c | 2 +-
arch/arc/mm/fault.c | 11 ++++-------
arch/arm/mm/fault.c | 6 +++---
arch/arm64/mm/fault.c | 6 +++---
arch/hexagon/mm/vm_fault.c | 2 +-
arch/ia64/mm/fault.c | 2 +-
arch/m68k/mm/fault.c | 2 +-
arch/microblaze/mm/fault.c | 2 +-
arch/mips/mm/fault.c | 2 +-
arch/nds32/mm/fault.c | 6 +++---
arch/nios2/mm/fault.c | 2 +-
arch/openrisc/mm/fault.c | 2 +-
arch/parisc/mm/fault.c | 2 +-
arch/powerpc/mm/fault.c | 2 ++
arch/riscv/mm/fault.c | 4 ++--
arch/s390/mm/fault.c | 9 ++++++---
arch/sh/mm/fault.c | 4 ++++
arch/sparc/mm/fault_32.c | 3 +++
arch/sparc/mm/fault_64.c | 3 +++
arch/um/kernel/trap.c | 5 ++++-
arch/unicore32/mm/fault.c | 4 ++--
arch/x86/mm/fault.c | 6 +++++-
arch/xtensa/mm/fault.c | 3 +++
23 files changed, 56 insertions(+), 34 deletions(-)
diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 188fc9256baf..8a2ef90b4bfc 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -150,7 +150,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
the fault. */
fault = handle_mm_fault(vma, address, flags);
- if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+ if ((fault & VM_FAULT_RETRY) && signal_pending(current))
return;
if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index 6836095251ed..3517820aea07 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -139,17 +139,14 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
*/
fault = handle_mm_fault(vma, address, flags);
- if (fatal_signal_pending(current)) {
-
+ if (unlikely((fault & VM_FAULT_RETRY) && signal_pending(current))) {
+ if (fatal_signal_pending(current) && !user_mode(regs))
+ goto no_context;
/*
* if fault retry, mmap_sem already relinquished by core mm
* so OK to return to user mode (with signal handled first)
*/
- if (fault & VM_FAULT_RETRY) {
- if (!user_mode(regs))
- goto no_context;
- return;
- }
+ return;
}
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 58f69fa07df9..c41c021bbe40 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -314,12 +314,12 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
fault = __do_page_fault(mm, addr, fsr, flags, tsk);
- /* If we need to retry but a fatal signal is pending, handle the
+ /* If we need to retry but a signal is pending, handle the
* signal first. We do not need to release the mmap_sem because
* it would already be released in __lock_page_or_retry in
* mm/filemap.c. */
- if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
- if (!user_mode(regs))
+ if (unlikely(fault & VM_FAULT_RETRY && signal_pending(current))) {
+ if (fatal_signal_pending(current) && !user_mode(regs))
goto no_context;
return 0;
}
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index a30818ed9c60..890ec3a693e6 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -513,13 +513,13 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
if (fault & VM_FAULT_RETRY) {
/*
- * If we need to retry but a fatal signal is pending,
+ * If we need to retry but a signal is pending,
* handle the signal first. We do not need to release
* the mmap_sem because it would already be released
* in __lock_page_or_retry in mm/filemap.c.
*/
- if (fatal_signal_pending(current)) {
- if (!user_mode(regs))
+ if (signal_pending(current)) {
+ if (fatal_signal_pending(current) && !user_mode(regs))
goto no_context;
return 0;
}
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index b7a99aa5b0ba..febb4f96ba6f 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -91,7 +91,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
fault = handle_mm_fault(vma, address, flags);
- if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+ if ((fault & VM_FAULT_RETRY) && signal_pending(current))
return;
/* The most common case -- we are done. */
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 5baeb022f474..62c2d39d2bed 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -163,7 +163,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
*/
fault = handle_mm_fault(vma, address, flags);
- if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+ if ((fault & VM_FAULT_RETRY) && signal_pending(current))
return;
if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 9b6163c05a75..d9808a807ab8 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -138,7 +138,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
fault = handle_mm_fault(vma, address, flags);
pr_debug("handle_mm_fault returns %x\n", fault);
- if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+ if ((fault & VM_FAULT_RETRY) && signal_pending(current))
return 0;
if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index 202ad6a494f5..4fd2dbd0c5ca 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -217,7 +217,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
*/
fault = handle_mm_fault(vma, address, flags);
- if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+ if ((fault & VM_FAULT_RETRY) && signal_pending(current))
return;
if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 73d8a0f0b810..92374fd091d2 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -154,7 +154,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
*/
fault = handle_mm_fault(vma, address, flags);
- if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+ if ((fault & VM_FAULT_RETRY) && signal_pending(current))
return;
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c
index 68d5f2a27f38..da777de8a62e 100644
--- a/arch/nds32/mm/fault.c
+++ b/arch/nds32/mm/fault.c
@@ -206,12 +206,12 @@ void do_page_fault(unsigned long entry, unsigned long addr,
fault = handle_mm_fault(vma, addr, flags);
/*
- * If we need to retry but a fatal signal is pending, handle the
+ * If we need to retry but a signal is pending, handle the
* signal first. We do not need to release the mmap_sem because it
* would already be released in __lock_page_or_retry in mm/filemap.c.
*/
- if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
- if (!user_mode(regs))
+ if ((fault & VM_FAULT_RETRY) && signal_pending(current)) {
+ if (fatal_signal_pending(current) && !user_mode(regs))
goto no_context;
return;
}
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index 6a2e716b959f..bdb1f9db75ba 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -133,7 +133,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
*/
fault = handle_mm_fault(vma, address, flags);
- if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+ if ((fault & VM_FAULT_RETRY) && signal_pending(current))
return;
if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index 9eee5bf3db27..f9f47dc32f94 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -161,7 +161,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
fault = handle_mm_fault(vma, address, flags);
- if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+ if ((fault & VM_FAULT_RETRY) && signal_pending(current))
return;
if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index c8e8b7c05558..29422eec329d 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -303,7 +303,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
fault = handle_mm_fault(vma, address, flags);
- if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+ if ((fault & VM_FAULT_RETRY) && signal_pending(current))
return;
if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index ec6b7ad70659..c2168b298c82 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -616,6 +616,8 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
*/
flags &= ~FAULT_FLAG_ALLOW_RETRY;
flags |= FAULT_FLAG_TRIED;
+ if (is_user && signal_pending(current))
+ return 0;
if (!fatal_signal_pending(current))
goto retry;
}
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 3e2708c626a8..4aa7a2343353 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -111,11 +111,11 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
fault = handle_mm_fault(vma, addr, flags);
/*
- * If we need to retry but a fatal signal is pending, handle the
+ * If we need to retry but a signal is pending, handle the
* signal first. We do not need to release the mmap_sem because it
* would already be released in __lock_page_or_retry in mm/filemap.c.
*/
- if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(tsk))
+ if ((fault & VM_FAULT_RETRY) && signal_pending(tsk))
return;
if (unlikely(fault & VM_FAULT_ERROR)) {
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index df75d574246d..94087ba285be 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -493,9 +493,12 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
* the fault.
*/
fault = handle_mm_fault(vma, address, flags);
- /* No reason to continue if interrupted by SIGKILL. */
- if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
- fault = VM_FAULT_SIGNAL;
+ /* Do not continue if interrupted by signals. */
+ if ((fault & VM_FAULT_RETRY) && signal_pending(current)) {
+ if (fatal_signal_pending(current))
+ fault = VM_FAULT_SIGNAL;
+ else
+ fault = 0;
if (flags & FAULT_FLAG_RETRY_NOWAIT)
goto out_up;
goto out;
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 6defd2c6d9b1..baf5d73df40c 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -506,6 +506,10 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
* have already released it in __lock_page_or_retry
* in mm/filemap.c.
*/
+
+ if (user_mode(regs) && signal_pending(tsk))
+ return;
+
goto retry;
}
}
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index b0440b0edd97..a2c83104fe35 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -269,6 +269,9 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
* in mm/filemap.c.
*/
+ if (user_mode(regs) && signal_pending(tsk))
+ return;
+
goto retry;
}
}
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index 8f8a604c1300..cad71ec5c7b3 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -467,6 +467,9 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
* in mm/filemap.c.
*/
+ if (user_mode(regs) && signal_pending(current))
+ return;
+
goto retry;
}
}
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 0e8b6158f224..05dcd4c5f0d5 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -76,8 +76,11 @@ int handle_page_fault(unsigned long address, unsigned long ip,
fault = handle_mm_fault(vma, address, flags);
- if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+ if ((fault & VM_FAULT_RETRY) && signal_pending(current)) {
+ if (is_user && !fatal_signal_pending(current))
+ err = 0;
goto out_nosemaphore;
+ }
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM) {
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index b9a3a50644c1..3611f19234a1 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -248,11 +248,11 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
fault = __do_pf(mm, addr, fsr, flags, tsk);
- /* If we need to retry but a fatal signal is pending, handle the
+ /* If we need to retry but a signal is pending, handle the
* signal first. We do not need to release the mmap_sem because
* it would already be released in __lock_page_or_retry in
* mm/filemap.c. */
- if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+ if ((fault & VM_FAULT_RETRY) && signal_pending(current))
return 0;
if (!(fault & VM_FAULT_ERROR) && (flags & FAULT_FLAG_ALLOW_RETRY)) {
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 46df4c6aae46..dcd7c1393be3 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1463,16 +1463,20 @@ void do_user_addr_fault(struct pt_regs *regs,
* that we made any progress. Handle this case first.
*/
if (unlikely(fault & VM_FAULT_RETRY)) {
+ bool is_user = flags & FAULT_FLAG_USER;
+
/* Retry at most once */
if (flags & FAULT_FLAG_ALLOW_RETRY) {
flags &= ~FAULT_FLAG_ALLOW_RETRY;
flags |= FAULT_FLAG_TRIED;
+ if (is_user && signal_pending(tsk))
+ return;
if (!fatal_signal_pending(tsk))
goto retry;
}
/* User mode? Just return to handle the fatal exception */
- if (flags & FAULT_FLAG_USER)
+ if (is_user)
return;
/* Not returning to user mode? Handle exceptions or die: */
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index 2ab0e0dcd166..792dad5e2f12 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -136,6 +136,9 @@ void do_page_fault(struct pt_regs *regs)
* in mm/filemap.c.
*/
+ if (user_mode(regs) && signal_pending(current))
+ return;
+
goto retry;
}
}
--
2.21.0
From: Andrea Arcangeli <[email protected]>
Accurate userfaultfd WP tracking is possible by tracking exactly
which virtual memory ranges were writeprotected by userland.  We
can't rely only on the RW bit of the mapped pagetable because that
information is destroyed by fork() or KSM or swap.  If we were to
rely on that, we'd need to stay on the safe side and generate false
positive wp faults for every swapped out page.
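
For illustration, write-protecting a present pte with the new helpers
would look roughly like below (the wrapper is hypothetical; the
helpers are the ones added by this patch):

  static pte_t example_uffd_wrprotect(pte_t pte)
  {
          /* Clear the hardware write bit... */
          pte = pte_wrprotect(pte);
          /* ...and record that uffd (not COW/KSM/swap) did it. */
          return pte_mkuffd_wp(pte);
  }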
Signed-off-by: Andrea Arcangeli <[email protected]>
[peterx: append _PAGE_UFFD_WP to _PAGE_CHG_MASK]
Reviewed-by: Jerome Glisse <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 52 ++++++++++++++++++++++++++++
arch/x86/include/asm/pgtable_64.h | 8 ++++-
arch/x86/include/asm/pgtable_types.h | 11 +++++-
include/asm-generic/pgtable.h | 1 +
include/asm-generic/pgtable_uffd.h | 51 +++++++++++++++++++++++++++
init/Kconfig | 5 +++
7 files changed, 127 insertions(+), 2 deletions(-)
create mode 100644 include/asm-generic/pgtable_uffd.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2bbbd4d1ba31..3e06f679126d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -217,6 +217,7 @@ config X86
select USER_STACKTRACE_SUPPORT
select VIRT_TO_BUS
select X86_FEATURE_NAMES if PROC_FS
+ select HAVE_ARCH_USERFAULTFD_WP if USERFAULTFD
config INSTRUCTION_DECODER
def_bool y
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5e0509b41986..5b254b851082 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -25,6 +25,7 @@
#include <asm/x86_init.h>
#include <asm/fpu/xstate.h>
#include <asm/fpu/api.h>
+#include <asm-generic/pgtable_uffd.h>
extern pgd_t early_top_pgt[PTRS_PER_PGD];
int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
@@ -310,6 +311,23 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
return native_make_pte(v & ~clear);
}
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+static inline int pte_uffd_wp(pte_t pte)
+{
+ return pte_flags(pte) & _PAGE_UFFD_WP;
+}
+
+static inline pte_t pte_mkuffd_wp(pte_t pte)
+{
+ return pte_set_flags(pte, _PAGE_UFFD_WP);
+}
+
+static inline pte_t pte_clear_uffd_wp(pte_t pte)
+{
+ return pte_clear_flags(pte, _PAGE_UFFD_WP);
+}
+#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+
static inline pte_t pte_mkclean(pte_t pte)
{
return pte_clear_flags(pte, _PAGE_DIRTY);
@@ -389,6 +407,23 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
return native_make_pmd(v & ~clear);
}
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+static inline int pmd_uffd_wp(pmd_t pmd)
+{
+ return pmd_flags(pmd) & _PAGE_UFFD_WP;
+}
+
+static inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
+{
+ return pmd_set_flags(pmd, _PAGE_UFFD_WP);
+}
+
+static inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
+{
+ return pmd_clear_flags(pmd, _PAGE_UFFD_WP);
+}
+#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+
static inline pmd_t pmd_mkold(pmd_t pmd)
{
return pmd_clear_flags(pmd, _PAGE_ACCESSED);
@@ -1371,6 +1406,23 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
#endif
#endif
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
+{
+ return pte_set_flags(pte, _PAGE_SWP_UFFD_WP);
+}
+
+static inline int pte_swp_uffd_wp(pte_t pte)
+{
+ return pte_flags(pte) & _PAGE_SWP_UFFD_WP;
+}
+
+static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
+{
+ return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP);
+}
+#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+
#define PKRU_AD_BIT 0x1
#define PKRU_WD_BIT 0x2
#define PKRU_BITS_PER_PKEY 2
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 0bb566315621..627666b1c3c0 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -189,7 +189,7 @@ extern void sync_global_pgds(unsigned long start, unsigned long end);
*
* | ... | 11| 10| 9|8|7|6|5| 4| 3|2| 1|0| <- bit number
* | ... |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
- * | TYPE (59-63) | ~OFFSET (9-58) |0|0|X|X| X| X|X|SD|0| <- swp entry
+ * | TYPE (59-63) | ~OFFSET (9-58) |0|0|X|X| X| X|F|SD|0| <- swp entry
*
* G (8) is aliased and used as a PROT_NONE indicator for
* !present ptes. We need to start storing swap entries above
@@ -197,9 +197,15 @@ extern void sync_global_pgds(unsigned long start, unsigned long end);
* erratum where they can be incorrectly set by hardware on
* non-present PTEs.
*
+ * Bits 1-4 are not used in non-present format and available for
+ * special use described below:
+ *
* SD (1) in swp entry is used to store soft dirty bit, which helps us
* remember soft dirty over page migration
*
+ * F (2) in swp entry is used to record when a pagetable is
+ * writeprotected by userfaultfd WP support.
+ *
* Bit 7 in swp entry should be 0 because pmd_present checks not only P,
* but also L and G.
*
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index d6ff0bbdb394..dd9c6295d610 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -32,6 +32,7 @@
#define _PAGE_BIT_SPECIAL _PAGE_BIT_SOFTW1
#define _PAGE_BIT_CPA_TEST _PAGE_BIT_SOFTW1
+#define _PAGE_BIT_UFFD_WP _PAGE_BIT_SOFTW2 /* userfaultfd wrprotected */
#define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
#define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4
@@ -100,6 +101,14 @@
#define _PAGE_SWP_SOFT_DIRTY (_AT(pteval_t, 0))
#endif
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+#define _PAGE_UFFD_WP (_AT(pteval_t, 1) << _PAGE_BIT_UFFD_WP)
+#define _PAGE_SWP_UFFD_WP _PAGE_USER
+#else
+#define _PAGE_UFFD_WP (_AT(pteval_t, 0))
+#define _PAGE_SWP_UFFD_WP (_AT(pteval_t, 0))
+#endif
+
#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
#define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX)
#define _PAGE_DEVMAP (_AT(u64, 1) << _PAGE_BIT_DEVMAP)
@@ -124,7 +133,7 @@
*/
#define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
_PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \
- _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
+ _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_UFFD_WP)
#define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
/*
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 75d9d68a6de7..1e979845e1cb 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -10,6 +10,7 @@
#include <linux/mm_types.h>
#include <linux/bug.h>
#include <linux/errno.h>
+#include <asm-generic/pgtable_uffd.h>
#if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \
defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS
diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
new file mode 100644
index 000000000000..643d1bf559c2
--- /dev/null
+++ b/include/asm-generic/pgtable_uffd.h
@@ -0,0 +1,51 @@
+#ifndef _ASM_GENERIC_PGTABLE_UFFD_H
+#define _ASM_GENERIC_PGTABLE_UFFD_H
+
+#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+static __always_inline int pte_uffd_wp(pte_t pte)
+{
+ return 0;
+}
+
+static __always_inline int pmd_uffd_wp(pmd_t pmd)
+{
+ return 0;
+}
+
+static __always_inline pte_t pte_mkuffd_wp(pte_t pte)
+{
+ return pte;
+}
+
+static __always_inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
+{
+ return pmd;
+}
+
+static __always_inline pte_t pte_clear_uffd_wp(pte_t pte)
+{
+ return pte;
+}
+
+static __always_inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
+{
+ return pmd;
+}
+
+static __always_inline pte_t pte_swp_mkuffd_wp(pte_t pte)
+{
+ return pte;
+}
+
+static __always_inline int pte_swp_uffd_wp(pte_t pte)
+{
+ return 0;
+}
+
+static __always_inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
+{
+ return pte;
+}
+#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+
+#endif /* _ASM_GENERIC_PGTABLE_UFFD_H */
diff --git a/init/Kconfig b/init/Kconfig
index 0e2344389501..763dc7fcf361 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1453,6 +1453,11 @@ config ADVISE_SYSCALLS
applications use these syscalls, you can disable this option to save
space.
+config HAVE_ARCH_USERFAULTFD_WP
+ bool
+ help
+ Arch has userfaultfd write protection support
+
config MEMBARRIER
bool "Enable membarrier() system call" if EXPERT
default y
--
2.21.0
This is the gup counterpart of the change that allows VM_FAULT_RETRY
to happen more than once.
Reviewed-by: Jerome Glisse <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 17 +++++++++++++----
mm/hugetlb.c | 6 ++++--
2 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 58d282115d9b..ac8d5b73c212 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -647,7 +647,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
if (*flags & FOLL_NOWAIT)
fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
if (*flags & FOLL_TRIED) {
- VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
+ /*
+ * Note: FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_TRIED
+ * can co-exist
+ */
fault_flags |= FAULT_FLAG_TRIED;
}
@@ -1062,17 +1065,23 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
if (likely(pages))
pages += ret;
start += ret << PAGE_SHIFT;
+ lock_dropped = true;
+retry:
/*
* Repeat on the address that fired VM_FAULT_RETRY
- * without FAULT_FLAG_ALLOW_RETRY but with
+ * with both FAULT_FLAG_ALLOW_RETRY and
* FAULT_FLAG_TRIED.
*/
*locked = 1;
- lock_dropped = true;
down_read(&mm->mmap_sem);
ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
- pages, NULL, NULL);
+ pages, NULL, locked);
+ if (!*locked) {
+ /* Continue to retry until we succeed */
+ BUG_ON(ret != 0);
+ goto retry;
+ }
if (ret != 1) {
BUG_ON(ret > 1);
if (!pages_done)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ba179c2fa8fb..d9c739f9a28e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4317,8 +4317,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
fault_flags |= FAULT_FLAG_ALLOW_RETRY |
FAULT_FLAG_RETRY_NOWAIT;
if (flags & FOLL_TRIED) {
- VM_WARN_ON_ONCE(fault_flags &
- FAULT_FLAG_ALLOW_RETRY);
+ /*
+ * Note: FAULT_FLAG_ALLOW_RETRY and
+ * FAULT_FLAG_TRIED can co-exist
+ */
fault_flags |= FAULT_FLAG_TRIED;
}
ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
--
2.21.0
change_protection() was used by both the NUMA and mprotect() code, and
there's one parameter for each of these callers (dirty_accountable and
prot_numa). Further, these parameters are passed down along the calls:
- change_protection_range()
- change_p4d_range()
- change_pud_range()
- change_pmd_range()
- ...
Now we introduce a flags argument for change_protection() and all these
helpers to replace those parameters. Then we can avoid passing multiple
parameters multiple times along the way.
More importantly, it'll greatly simplify the work if we want to
introduce any new parameters to change_protection(). In the follow-up
patches, a new parameter for userfaultfd write protection will be
introduced.
No functional change at all.
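For example, after this change the two existing call sites become (as
taken from the mempolicy.c and mprotect.c hunks below):

    /* NUMA hinting path: */
    nr_updated = change_protection(vma, addr, end, PAGE_NONE,
                                   MM_CP_PROT_NUMA);

    /* mprotect() path: */
    change_protection(vma, start, end, vma->vm_page_prot,
                      dirty_accountable ? MM_CP_DIRTY_ACCT : 0);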
Reviewed-by: Jerome Glisse <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/huge_mm.h | 2 +-
include/linux/mm.h | 14 +++++++++++++-
mm/huge_memory.c | 3 ++-
mm/mempolicy.c | 2 +-
mm/mprotect.c | 29 ++++++++++++++++-------------
5 files changed, 33 insertions(+), 17 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7cd5c150c21d..a81a6ed609ac 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -46,7 +46,7 @@ extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
pmd_t *old_pmd, pmd_t *new_pmd);
extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, pgprot_t newprot,
- int prot_numa);
+ unsigned long cp_flags);
vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
enum transparent_hugepage_flag {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index dcaca899e4a8..a93ac1c37940 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1708,9 +1708,21 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
unsigned long old_addr, struct vm_area_struct *new_vma,
unsigned long new_addr, unsigned long len,
bool need_rmap_locks);
+
+/*
+ * Flags used by change_protection(). For now we make it a bitmap so
+ * that we can pass in multiple flags just like parameters. However
+ * for now all the callers only use one of the flags at the same
+ * time.
+ */
+/* Whether we should allow dirty bit accounting */
+#define MM_CP_DIRTY_ACCT (1UL << 0)
+/* Whether this protection change is for NUMA hints */
+#define MM_CP_PROT_NUMA (1UL << 1)
+
extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgprot_t newprot,
- int dirty_accountable, int prot_numa);
+ unsigned long cp_flags);
extern int mprotect_fixup(struct vm_area_struct *vma,
struct vm_area_struct **pprev, unsigned long start,
unsigned long end, unsigned long newflags);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9f8bce9a6b32..b7149a0acac1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1903,13 +1903,14 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
* - HPAGE_PMD_NR is protections changed and TLB flush necessary
*/
int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr, pgprot_t newprot, int prot_numa)
+ unsigned long addr, pgprot_t newprot, unsigned long cp_flags)
{
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
pmd_t entry;
bool preserve_write;
int ret;
+ bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
ptl = __pmd_trans_huge_lock(pmd, vma);
if (!ptl)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 01600d80ae01..dea6a49573e3 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -575,7 +575,7 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
{
int nr_updated;
- nr_updated = change_protection(vma, addr, end, PAGE_NONE, 0, 1);
+ nr_updated = change_protection(vma, addr, end, PAGE_NONE, MM_CP_PROT_NUMA);
if (nr_updated)
count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index bf38dfbbb4b4..ae9caa4c6562 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,12 +37,14 @@
static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end, pgprot_t newprot,
- int dirty_accountable, int prot_numa)
+ unsigned long cp_flags)
{
pte_t *pte, oldpte;
spinlock_t *ptl;
unsigned long pages = 0;
int target_node = NUMA_NO_NODE;
+ bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT;
+ bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
/*
* Can be called with only the mmap_sem for reading by
@@ -163,7 +165,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
pud_t *pud, unsigned long addr, unsigned long end,
- pgprot_t newprot, int dirty_accountable, int prot_numa)
+ pgprot_t newprot, unsigned long cp_flags)
{
pmd_t *pmd;
unsigned long next;
@@ -195,7 +197,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
__split_huge_pmd(vma, pmd, addr, false, NULL);
} else {
int nr_ptes = change_huge_pmd(vma, pmd, addr,
- newprot, prot_numa);
+ newprot, cp_flags);
if (nr_ptes) {
if (nr_ptes == HPAGE_PMD_NR) {
@@ -210,7 +212,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
/* fall through, the trans huge pmd just split */
}
this_pages = change_pte_range(vma, pmd, addr, next, newprot,
- dirty_accountable, prot_numa);
+ cp_flags);
pages += this_pages;
next:
cond_resched();
@@ -226,7 +228,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
static inline unsigned long change_pud_range(struct vm_area_struct *vma,
p4d_t *p4d, unsigned long addr, unsigned long end,
- pgprot_t newprot, int dirty_accountable, int prot_numa)
+ pgprot_t newprot, unsigned long cp_flags)
{
pud_t *pud;
unsigned long next;
@@ -238,7 +240,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
if (pud_none_or_clear_bad(pud))
continue;
pages += change_pmd_range(vma, pud, addr, next, newprot,
- dirty_accountable, prot_numa);
+ cp_flags);
} while (pud++, addr = next, addr != end);
return pages;
@@ -246,7 +248,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
pgd_t *pgd, unsigned long addr, unsigned long end,
- pgprot_t newprot, int dirty_accountable, int prot_numa)
+ pgprot_t newprot, unsigned long cp_flags)
{
p4d_t *p4d;
unsigned long next;
@@ -258,7 +260,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
if (p4d_none_or_clear_bad(p4d))
continue;
pages += change_pud_range(vma, p4d, addr, next, newprot,
- dirty_accountable, prot_numa);
+ cp_flags);
} while (p4d++, addr = next, addr != end);
return pages;
@@ -266,7 +268,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
static unsigned long change_protection_range(struct vm_area_struct *vma,
unsigned long addr, unsigned long end, pgprot_t newprot,
- int dirty_accountable, int prot_numa)
+ unsigned long cp_flags)
{
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
@@ -283,7 +285,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
if (pgd_none_or_clear_bad(pgd))
continue;
pages += change_p4d_range(vma, pgd, addr, next, newprot,
- dirty_accountable, prot_numa);
+ cp_flags);
} while (pgd++, addr = next, addr != end);
/* Only flush the TLB if we actually modified any entries: */
@@ -296,14 +298,15 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgprot_t newprot,
- int dirty_accountable, int prot_numa)
+ unsigned long cp_flags)
{
unsigned long pages;
if (is_vm_hugetlb_page(vma))
pages = hugetlb_change_protection(vma, start, end, newprot);
else
- pages = change_protection_range(vma, start, end, newprot, dirty_accountable, prot_numa);
+ pages = change_protection_range(vma, start, end, newprot,
+ cp_flags);
return pages;
}
@@ -431,7 +434,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
vma_set_page_prot(vma);
change_protection(vma, start, end, vma->vm_page_prot,
- dirty_accountable, 0);
+ dirty_accountable ? MM_CP_DIRTY_ACCT : 0);
/*
* Private VM_LOCKED VMA becoming writable: trigger COW to avoid major
--
2.21.0
Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
change_protection() when used with uffd-wp, and make sure the two new
flags are never used together. Then,
- For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
when a range of memory is write protected by uffd
- For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
_PAGE_RW when write protection is resolved from userspace
And use this new interface in mwriteprotect_range() to replace the old
MM_CP_DIRTY_ACCT.
Do this change for both PTEs and huge PMDs. Then we can start to
identify which PTE/PMD is write protected for general reasons (e.g.,
COW or soft dirty tracking), and which is for userfaultfd-wp.
Since we should keep the _PAGE_UFFD_WP bit when doing pte_modify(), add it
into _PAGE_CHG_MASK as well. Meanwhile, since we have this new bit, we
can be even more strict when detecting uffd-wp page faults in either
do_wp_page() or wp_huge_pmd().
Once we have _PAGE_UFFD_WP, a special case is when a page is both
protected by the general COW logic and also userfault-wp. Here the
userfault-wp has higher priority and is handled first. Only after the
uffd-wp bit is cleared on the PTE/PMD will we continue to handle the
general COW. These are the steps of what will happen with such a
page:
1. CPU accesses a write protected shared page (so protected by both
general COW and uffd-wp); it is blocked by uffd-wp first because
do_wp_page() handles uffd-wp before general COW, so uffd-wp has
higher priority.
2. The uffd service thread receives the request and does
UFFDIO_WRITEPROTECT to remove the uffd-wp bit from the PTE/PMD.
However, here we still keep the write bit cleared. Notify the
blocked CPU.
3. The blocked CPU resumes the page fault process with a fault
retry. During the retry it'll notice the uffd-wp bit is gone this
time, but the page is still write protected by general COW, so
it'll go through the COW path in the fault handler, copy the page,
apply the write bit where necessary, and retry again.
4. The CPU will be able to access this page with the write bit set.
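For illustration, step 2 above maps to roughly the following userspace
call, using the UFFDIO_WRITEPROTECT ioctl introduced later in this
series (a minimal sketch; fault_addr and page_size are assumed to come
from the monitor's context, and error handling is omitted):

    struct uffdio_writeprotect wp = {
        .range = { .start = fault_addr & ~(page_size - 1),
                   .len   = page_size },
        .mode  = 0,   /* WP mode bit not set: resolve the protection */
    };
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);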
Suggested-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm.h | 5 +++++
mm/huge_memory.c | 18 +++++++++++++++++-
mm/memory.c | 4 ++--
mm/mprotect.c | 17 +++++++++++++++++
mm/userfaultfd.c | 8 ++++++--
5 files changed, 47 insertions(+), 5 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a93ac1c37940..beca76650271 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1719,6 +1719,11 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
#define MM_CP_DIRTY_ACCT (1UL << 0)
/* Whether this protection change is for NUMA hints */
#define MM_CP_PROT_NUMA (1UL << 1)
+/* Whether this change is for write protecting */
+#define MM_CP_UFFD_WP (1UL << 2) /* do wp */
+#define MM_CP_UFFD_WP_RESOLVE (1UL << 3) /* Resolve wp */
+#define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
+ MM_CP_UFFD_WP_RESOLVE)
extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgprot_t newprot,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b7149a0acac1..3fda79f6746b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1911,6 +1911,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
bool preserve_write;
int ret;
bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
+ bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+ bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
ptl = __pmd_trans_huge_lock(pmd, vma);
if (!ptl)
@@ -1977,6 +1979,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
entry = pmd_modify(entry, newprot);
if (preserve_write)
entry = pmd_mk_savedwrite(entry);
+ if (uffd_wp) {
+ entry = pmd_wrprotect(entry);
+ entry = pmd_mkuffd_wp(entry);
+ } else if (uffd_wp_resolve) {
+ /*
+ * Leave the write bit to be handled by PF interrupt
+ * handler, then things like COW could be properly
+ * handled.
+ */
+ entry = pmd_clear_uffd_wp(entry);
+ }
ret = HPAGE_PMD_NR;
set_pmd_at(mm, addr, pmd, entry);
BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
@@ -2125,7 +2138,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
struct page *page;
pgtable_t pgtable;
pmd_t old_pmd, _pmd;
- bool young, write, soft_dirty, pmd_migration = false;
+ bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
unsigned long addr;
int i;
@@ -2207,6 +2220,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
write = pmd_write(old_pmd);
young = pmd_young(old_pmd);
soft_dirty = pmd_soft_dirty(old_pmd);
+ uffd_wp = pmd_uffd_wp(old_pmd);
}
VM_BUG_ON_PAGE(!page_count(page), page);
page_ref_add(page, HPAGE_PMD_NR - 1);
@@ -2240,6 +2254,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
entry = pte_mkold(entry);
if (soft_dirty)
entry = pte_mksoft_dirty(entry);
+ if (uffd_wp)
+ entry = pte_mkuffd_wp(entry);
}
pte = pte_offset_map(&_pmd, addr);
BUG_ON(!pte_none(*pte));
diff --git a/mm/memory.c b/mm/memory.c
index 05bcd741855b..d79e6d1f8c62 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2579,7 +2579,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
- if (userfaultfd_wp(vma)) {
+ if (userfaultfd_pte_wp(vma, *vmf->pte)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
return handle_userfault(vmf, VM_UFFD_WP);
}
@@ -3800,7 +3800,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
{
if (vma_is_anonymous(vmf->vma)) {
- if (userfaultfd_wp(vmf->vma))
+ if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
return handle_userfault(vmf, VM_UFFD_WP);
return do_huge_pmd_wp_page(vmf, orig_pmd);
}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ae9caa4c6562..c7066d7384e3 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -45,6 +45,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
int target_node = NUMA_NO_NODE;
bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT;
bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
+ bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+ bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
/*
* Can be called with only the mmap_sem for reading by
@@ -116,6 +118,19 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (preserve_write)
ptent = pte_mk_savedwrite(ptent);
+ if (uffd_wp) {
+ ptent = pte_wrprotect(ptent);
+ ptent = pte_mkuffd_wp(ptent);
+ } else if (uffd_wp_resolve) {
+ /*
+ * Leave the write bit to be handled
+ * by PF interrupt handler, then
+ * things like COW could be properly
+ * handled.
+ */
+ ptent = pte_clear_uffd_wp(ptent);
+ }
+
/* Avoid taking write faults for known dirty pages */
if (dirty_accountable && pte_dirty(ptent) &&
(pte_soft_dirty(ptent) ||
@@ -302,6 +317,8 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
{
unsigned long pages;
+ BUG_ON((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL);
+
if (is_vm_hugetlb_page(vma))
pages = hugetlb_change_protection(vma, start, end, newprot);
else
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index c8e7846e9b7e..5363376cb07a 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -73,8 +73,12 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
goto out_release;
_dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
- if ((dst_vma->vm_flags & VM_WRITE) && !wp_copy)
- _dst_pte = pte_mkwrite(_dst_pte);
+ if (dst_vma->vm_flags & VM_WRITE) {
+ if (wp_copy)
+ _dst_pte = pte_mkuffd_wp(_dst_pte);
+ else
+ _dst_pte = pte_mkwrite(_dst_pte);
+ }
dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
if (dst_vma->vm_file) {
--
2.21.0
From: Shaohua Li <[email protected]>
Add a helper for the writeprotect check. It will be used later.
Cc: Andrea Arcangeli <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
Reviewed-by: Jerome Glisse <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/userfaultfd_k.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index ac9d71e24b81..5dc247af0f2e 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -52,6 +52,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
return vma->vm_flags & VM_UFFD_MISSING;
}
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+ return vma->vm_flags & VM_UFFD_WP;
+}
+
static inline bool userfaultfd_armed(struct vm_area_struct *vma)
{
return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
@@ -96,6 +101,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
return false;
}
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+ return false;
+}
+
static inline bool userfaultfd_armed(struct vm_area_struct *vma)
{
return false;
--
2.21.0
From: Andrea Arcangeli <[email protected]>
There are several cases in which a write protection fault can happen:
a write to the zero page, to a swapped page, or to a userfault write
protected page. When the fault happens, there is no way to know if
userfaultfd write protected the page before. Here we just blindly
issue a userfault notification for VMAs with VM_UFFD_WP set,
regardless of whether the app has write protected the page yet.
Applications should be ready to handle such wp faults.
v1: From: Shaohua Li <[email protected]>
v2: Handle the userfault in the common do_wp_page. If we get there a
pagetable is present and readonly so no need to do further processing
until we solve the userfault.
In the swapin case, always swapin as readonly. This will cause false
positive userfaults. We need to decide later whether to eliminate them
with a flag like soft-dirty in the swap entry (see
_PAGE_SWP_SOFT_DIRTY). hugetlbfs wouldn't need to worry about
swapouts, and tmpfs would be handled by a swap entry bit like
anonymous memory.
The main problem with no easy solution to eliminate the false
positives, will be if/when userfaultfd is extended to real filesystem
pagecache. When the pagecache is freed by reclaim we can't leave the
radix tree pinned if the inode and in turn the radix tree is reclaimed
as well.
The estimation is that full accuracy and lack of false positives
could be easily provided only to anonymous memory (as long as there's
no fork or as long as MADV_DONTFORK is used on the userfaultfd
anonymous range), tmpfs and hugetlbfs; it's most certainly worth
achieving, but in a later incremental patch.
v3: Add hooking point for THP wrprotect faults.
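From the application side, handling such a notification means checking
the wp flag on each pagefault message, roughly like this (a sketch;
the uffd message-reading loop is assumed to exist already):

    struct uffd_msg msg;
    if (read(uffd, &msg, sizeof(msg)) == sizeof(msg) &&
        msg.event == UFFD_EVENT_PAGEFAULT &&
        (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) {
        /* wp fault (possibly a false positive, see above): do the
         * bookkeeping, then resolve it via UFFDIO_WRITEPROTECT
         * with the WP mode bit cleared */
    }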
CC: Shaohua Li <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
[peterx: don't conditionally drop FAULT_FLAG_WRITE in do_swap_page]
Reviewed-by: Mike Rapoport <[email protected]>
Reviewed-by: Jerome Glisse <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/memory.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/mm/memory.c b/mm/memory.c
index ddf20bd0c317..05bcd741855b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2579,6 +2579,11 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
+ if (userfaultfd_wp(vma)) {
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return handle_userfault(vmf, VM_UFFD_WP);
+ }
+
vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
if (!vmf->page) {
/*
@@ -3794,8 +3799,11 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
/* `inline' is required to avoid gcc 4.1.2 build error */
static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
{
- if (vma_is_anonymous(vmf->vma))
+ if (vma_is_anonymous(vmf->vma)) {
+ if (userfaultfd_wp(vmf->vma))
+ return handle_userfault(vmf, VM_UFFD_WP);
return do_huge_pmd_wp_page(vmf, orig_pmd);
+ }
if (vmf->vma->vm_ops->huge_fault)
return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
--
2.21.0
Add these missing helpers for uffd-wp operations with pmd
swap/migration entries.
Reviewed-by: Jerome Glisse <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
arch/x86/include/asm/pgtable.h | 15 +++++++++++++++
include/asm-generic/pgtable_uffd.h | 15 +++++++++++++++
2 files changed, 30 insertions(+)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5b254b851082..0120fa671914 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1421,6 +1421,21 @@ static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
{
return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP);
}
+
+static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
+{
+ return pmd_set_flags(pmd, _PAGE_SWP_UFFD_WP);
+}
+
+static inline int pmd_swp_uffd_wp(pmd_t pmd)
+{
+ return pmd_flags(pmd) & _PAGE_SWP_UFFD_WP;
+}
+
+static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
+{
+ return pmd_clear_flags(pmd, _PAGE_SWP_UFFD_WP);
+}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
#define PKRU_AD_BIT 0x1
diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
index 643d1bf559c2..828966d4c281 100644
--- a/include/asm-generic/pgtable_uffd.h
+++ b/include/asm-generic/pgtable_uffd.h
@@ -46,6 +46,21 @@ static __always_inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
{
return pte;
}
+
+static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
+{
+ return pmd;
+}
+
+static inline int pmd_swp_uffd_wp(pmd_t pmd)
+{
+ return 0;
+}
+
+static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
+{
+ return pmd;
+}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
#endif /* _ASM_GENERIC_PGTABLE_UFFD_H */
--
2.21.0
For both swap and page migration, we use bit 2 of the entry to
identify whether the entry is uffd write-protected. It plays a
similar role to the existing soft dirty bit in swap entries, but only
for keeping the uffd-wp tracking for a specific PTE/PMD.
Something special here is that when we want to recover the uffd-wp bit
from a swap/migration entry to the PTE bit we'll also need to take care
of the _PAGE_RW bit and make sure it's cleared, otherwise even with the
_PAGE_UFFD_WP bit we can't trap it at all.
Previously, change_pte_range() did nothing for uffd if the PTE was a
swap entry. That can lead to a data mismatch if the page we are going
to write protect is swapped out when sending the UFFDIO_WRITEPROTECT.
This patch also applies/removes the uffd-wp bit even for the swap
entries.
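In sketch form, the swapin-side recovery in do_swap_page() looks like
this (mirroring the mm/memory.c hunk below):

    if (pte_swp_uffd_wp(vmf->orig_pte)) {
        pte = pte_mkuffd_wp(pte);
        /* also clear _PAGE_RW, or the wp fault can't trap */
        pte = pte_wrprotect(pte);
    }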
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/swapops.h | 2 ++
mm/huge_memory.c | 3 +++
mm/memory.c | 8 ++++++++
mm/migrate.c | 6 ++++++
mm/mprotect.c | 28 +++++++++++++++++-----------
mm/rmap.c | 6 ++++++
6 files changed, 42 insertions(+), 11 deletions(-)
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 4d961668e5fc..0c2923b1cdb7 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -68,6 +68,8 @@ static inline swp_entry_t pte_to_swp_entry(pte_t pte)
if (pte_swp_soft_dirty(pte))
pte = pte_swp_clear_soft_dirty(pte);
+ if (pte_swp_uffd_wp(pte))
+ pte = pte_swp_clear_uffd_wp(pte);
arch_entry = __pte_to_swp_entry(pte);
return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 757975920df8..eae25c58db9d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2221,6 +2221,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
write = is_write_migration_entry(entry);
young = false;
soft_dirty = pmd_swp_soft_dirty(old_pmd);
+ uffd_wp = pmd_swp_uffd_wp(old_pmd);
} else {
page = pmd_page(old_pmd);
if (pmd_dirty(old_pmd))
@@ -2253,6 +2254,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
entry = swp_entry_to_pte(swp_entry);
if (soft_dirty)
entry = pte_swp_mksoft_dirty(entry);
+ if (uffd_wp)
+ entry = pte_swp_mkuffd_wp(entry);
} else {
entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
entry = maybe_mkwrite(entry, vma);
diff --git a/mm/memory.c b/mm/memory.c
index 8c69257d6ef1..28e9342d00cc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -738,6 +738,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte = swp_entry_to_pte(entry);
if (pte_swp_soft_dirty(*src_pte))
pte = pte_swp_mksoft_dirty(pte);
+ if (pte_swp_uffd_wp(*src_pte))
+ pte = pte_swp_mkuffd_wp(pte);
set_pte_at(src_mm, addr, src_pte, pte);
}
} else if (is_device_private_entry(entry)) {
@@ -767,6 +769,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
is_cow_mapping(vm_flags)) {
make_device_private_entry_read(&entry);
pte = swp_entry_to_pte(entry);
+ if (pte_swp_uffd_wp(*src_pte))
+ pte = pte_swp_mkuffd_wp(pte);
set_pte_at(src_mm, addr, src_pte, pte);
}
}
@@ -2930,6 +2934,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
flush_icache_page(vma, page);
if (pte_swp_soft_dirty(vmf->orig_pte))
pte = pte_mksoft_dirty(pte);
+ if (pte_swp_uffd_wp(vmf->orig_pte)) {
+ pte = pte_mkuffd_wp(pte);
+ pte = pte_wrprotect(pte);
+ }
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
vmf->orig_pte = pte;
diff --git a/mm/migrate.c b/mm/migrate.c
index f2ecc2855a12..d8f1f6d13960 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -241,11 +241,15 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
entry = pte_to_swp_entry(*pvmw.pte);
if (is_write_migration_entry(entry))
pte = maybe_mkwrite(pte, vma);
+ else if (pte_swp_uffd_wp(*pvmw.pte))
+ pte = pte_mkuffd_wp(pte);
if (unlikely(is_zone_device_page(new))) {
if (is_device_private_page(new)) {
entry = make_device_private_entry(new, pte_write(pte));
pte = swp_entry_to_pte(entry);
+ if (pte_swp_uffd_wp(*pvmw.pte))
+ pte = pte_mkuffd_wp(pte);
} else if (is_device_public_page(new)) {
pte = pte_mkdevmap(pte);
}
@@ -2306,6 +2310,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
swp_pte = swp_entry_to_pte(entry);
if (pte_soft_dirty(pte))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
+ if (pte_uffd_wp(pte))
+ swp_pte = pte_swp_mkuffd_wp(swp_pte);
set_pte_at(mm, addr, ptep, swp_pte);
/*
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c7066d7384e3..a63737d9884e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -139,11 +139,11 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
pages++;
- } else if (IS_ENABLED(CONFIG_MIGRATION)) {
+ } else if (is_swap_pte(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
+ pte_t newpte;
if (is_write_migration_entry(entry)) {
- pte_t newpte;
/*
* A protection check is difficult so
* just be safe and disable write
@@ -152,22 +152,28 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
newpte = swp_entry_to_pte(entry);
if (pte_swp_soft_dirty(oldpte))
newpte = pte_swp_mksoft_dirty(newpte);
- set_pte_at(vma->vm_mm, addr, pte, newpte);
-
- pages++;
- }
-
- if (is_write_device_private_entry(entry)) {
- pte_t newpte;
-
+ if (pte_swp_uffd_wp(oldpte))
+ newpte = pte_swp_mkuffd_wp(newpte);
+ } else if (is_write_device_private_entry(entry)) {
/*
* We do not preserve soft-dirtiness. See
* copy_one_pte() for explanation.
*/
make_device_private_entry_read(&entry);
newpte = swp_entry_to_pte(entry);
- set_pte_at(vma->vm_mm, addr, pte, newpte);
+ if (pte_swp_uffd_wp(oldpte))
+ newpte = pte_swp_mkuffd_wp(newpte);
+ } else {
+ newpte = oldpte;
+ }
+ if (uffd_wp)
+ newpte = pte_swp_mkuffd_wp(newpte);
+ else if (uffd_wp_resolve)
+ newpte = pte_swp_clear_uffd_wp(newpte);
+
+ if (!pte_same(oldpte, newpte)) {
+ set_pte_at(vma->vm_mm, addr, pte, newpte);
pages++;
}
}
diff --git a/mm/rmap.c b/mm/rmap.c
index e5dfe2ae6b0d..dedde54dadb7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1471,6 +1471,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
swp_pte = swp_entry_to_pte(entry);
if (pte_soft_dirty(pteval))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
+ if (pte_uffd_wp(pteval))
+ swp_pte = pte_swp_mkuffd_wp(swp_pte);
set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
/*
* No need to invalidate here it will synchronize on
@@ -1563,6 +1565,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
swp_pte = swp_entry_to_pte(entry);
if (pte_soft_dirty(pteval))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
+ if (pte_uffd_wp(pteval))
+ swp_pte = pte_swp_mkuffd_wp(swp_pte);
set_pte_at(mm, address, pvmw.pte, swp_pte);
/*
* No need to invalidate here it will synchronize on
@@ -1629,6 +1633,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
swp_pte = swp_entry_to_pte(entry);
if (pte_soft_dirty(pteval))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
+ if (pte_uffd_wp(pteval))
+ swp_pte = pte_swp_mkuffd_wp(swp_pte);
set_pte_at(mm, address, pvmw.pte, swp_pte);
/* Invalidate as we cleared the pte */
mmu_notifier_invalidate_range(mm, address,
--
2.21.0
From: Andrea Arcangeli <[email protected]>
Implement helper methods to invoke userfaultfd wp faults more
selectively: not whenever a wp fault triggers on a VMA with
VM_UFFD_WP set in vma->vm_flags, but only if the _PAGE_UFFD_WP bit is
set in the pagetable too.
Signed-off-by: Andrea Arcangeli <[email protected]>
Reviewed-by: Jerome Glisse <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/userfaultfd_k.h | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 5dc247af0f2e..7b91b76aac58 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -14,6 +14,8 @@
#include <linux/userfaultfd.h> /* linux/include/uapi/linux/userfaultfd.h */
#include <linux/fcntl.h>
+#include <linux/mm.h>
+#include <asm-generic/pgtable_uffd.h>
/*
* CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
@@ -57,6 +59,18 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
return vma->vm_flags & VM_UFFD_WP;
}
+static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
+ pte_t pte)
+{
+ return userfaultfd_wp(vma) && pte_uffd_wp(pte);
+}
+
+static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
+ pmd_t pmd)
+{
+ return userfaultfd_wp(vma) && pmd_uffd_wp(pmd);
+}
+
static inline bool userfaultfd_armed(struct vm_area_struct *vma)
{
return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
@@ -106,6 +120,19 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
return false;
}
+static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
+ pte_t pte)
+{
+ return false;
+}
+
+static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
+ pmd_t pmd)
+{
+ return false;
+}
+
+
static inline bool userfaultfd_armed(struct vm_area_struct *vma)
{
return false;
--
2.21.0
Don't collapse the huge PMD if there are any userfault write
protected small PTEs. The problem is that the write protection is in
small page granularity and there's no way to keep all this write
protection information if the small pages are going to be merged into
a huge PMD. The same thing needs to be considered for swap entries
and migration entries, so do the check as well, disregarding
khugepaged_max_ptes_swap.
Reviewed-by: Jerome Glisse <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/trace/events/huge_memory.h | 1 +
mm/khugepaged.c | 23 +++++++++++++++++++++++
2 files changed, 24 insertions(+)
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index dd4db334bd63..2d7bad9cb976 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -13,6 +13,7 @@
EM( SCAN_PMD_NULL, "pmd_null") \
EM( SCAN_EXCEED_NONE_PTE, "exceed_none_pte") \
EM( SCAN_PTE_NON_PRESENT, "pte_non_present") \
+ EM( SCAN_PTE_UFFD_WP, "pte_uffd_wp") \
EM( SCAN_PAGE_RO, "no_writable_page") \
EM( SCAN_LACK_REFERENCED_PAGE, "lack_referenced_page") \
EM( SCAN_PAGE_NULL, "page_null") \
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0f7419938008..fc40aa214be7 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -29,6 +29,7 @@ enum scan_result {
SCAN_PMD_NULL,
SCAN_EXCEED_NONE_PTE,
SCAN_PTE_NON_PRESENT,
+ SCAN_PTE_UFFD_WP,
SCAN_PAGE_RO,
SCAN_LACK_REFERENCED_PAGE,
SCAN_PAGE_NULL,
@@ -1128,6 +1129,15 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
pte_t pteval = *_pte;
if (is_swap_pte(pteval)) {
if (++unmapped <= khugepaged_max_ptes_swap) {
+ /*
+ * Always be strict with uffd-wp
+ * enabled swap entries. Please see
+ * comment below for pte_uffd_wp().
+ */
+ if (pte_swp_uffd_wp(pteval)) {
+ result = SCAN_PTE_UFFD_WP;
+ goto out_unmap;
+ }
continue;
} else {
result = SCAN_EXCEED_SWAP_PTE;
@@ -1147,6 +1157,19 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
result = SCAN_PTE_NON_PRESENT;
goto out_unmap;
}
+ if (pte_uffd_wp(pteval)) {
+ /*
+ * Don't collapse the page if any of the small
+ * PTEs are armed with uffd write protection.
+ * Here we can also mark the new huge pmd as
+ * write protected if any of the small ones is
+ * marked, but that could bring unknown
+ * userfault messages that fall outside of
+ * the registered range. So, just be simple.
+ */
+ result = SCAN_PTE_UFFD_WP;
+ goto out_unmap;
+ }
if (pte_write(pteval))
writable = true;
--
2.21.0
UFFD_EVENT_FORK support for uffd-wp should already be there, except
that we should clear the uffd-wp bit if the uffd fork event is not
enabled. Detect that to avoid _PAGE_UFFD_WP being set even if the VMA
is not being tracked by VM_UFFD_WP. Do this for both small PTEs and
huge PMDs.
Reviewed-by: Jerome Glisse <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/huge_memory.c | 8 ++++++++
mm/memory.c | 8 ++++++++
2 files changed, 16 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3fda79f6746b..757975920df8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -980,6 +980,14 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
ret = -EAGAIN;
pmd = *src_pmd;
+ /*
+ * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA
+ * does not have the VM_UFFD_WP, which means that the uffd
+ * fork event is not enabled.
+ */
+ if (!(vma->vm_flags & VM_UFFD_WP))
+ pmd = pmd_clear_uffd_wp(pmd);
+
#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
if (unlikely(is_swap_pmd(pmd))) {
swp_entry_t entry = pmd_to_swp_entry(pmd);
diff --git a/mm/memory.c b/mm/memory.c
index d79e6d1f8c62..8c69257d6ef1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -790,6 +790,14 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte = pte_mkclean(pte);
pte = pte_mkold(pte);
+ /*
+ * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA
+ * does not have the VM_UFFD_WP, which means that the uffd
+ * fork event is not enabled.
+ */
+ if (!(vm_flags & VM_UFFD_WP))
+ pte = pte_clear_uffd_wp(pte);
+
page = vm_normal_page(vma, addr, pte);
if (page) {
get_page(page);
--
2.21.0
From: Andrea Arcangeli <[email protected]>
This allows UFFDIO_COPY to map pages write-protected.
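A minimal userspace sketch (assuming uffd, dst, buf and len are set up
elsewhere; error handling omitted):

    struct uffdio_copy copy = {
        .dst  = dst,
        .src  = (unsigned long)buf,
        .len  = len,
        .mode = UFFDIO_COPY_MODE_WP,   /* map the page write protected */
    };
    ioctl(uffd, UFFDIO_COPY, &copy);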
Signed-off-by: Andrea Arcangeli <[email protected]>
[peterx: switch to VM_WARN_ON_ONCE in mfill_atomic_pte; add brackets
around "dst_vma->vm_flags & VM_WRITE"; fix wordings in comments and
commit messages]
Reviewed-by: Jerome Glisse <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 5 +++--
include/linux/userfaultfd_k.h | 2 +-
include/uapi/linux/userfaultfd.h | 11 +++++-----
mm/userfaultfd.c | 36 ++++++++++++++++++++++----------
4 files changed, 35 insertions(+), 19 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 5dbef45ecbf5..c594945ad5bf 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1694,11 +1694,12 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
ret = -EINVAL;
if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
goto out;
- if (uffdio_copy.mode & ~UFFDIO_COPY_MODE_DONTWAKE)
+ if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP))
goto out;
if (mmget_not_zero(ctx->mm)) {
ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
- uffdio_copy.len, &ctx->mmap_changing);
+ uffdio_copy.len, &ctx->mmap_changing,
+ uffdio_copy.mode);
mmput(ctx->mm);
} else {
return -ESRCH;
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 7b91b76aac58..dcd33172b728 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -36,7 +36,7 @@ extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
unsigned long src_start, unsigned long len,
- bool *mmap_changing);
+ bool *mmap_changing, __u64 mode);
extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long len,
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 48f1a7c2f1f0..340f23bc251d 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -203,13 +203,14 @@ struct uffdio_copy {
__u64 dst;
__u64 src;
__u64 len;
+#define UFFDIO_COPY_MODE_DONTWAKE ((__u64)1<<0)
/*
- * There will be a wrprotection flag later that allows to map
- * pages wrprotected on the fly. And such a flag will be
- * available if the wrprotection ioctl are implemented for the
- * range according to the uffdio_register.ioctls.
+ * UFFDIO_COPY_MODE_WP will map the page write protected on
+ * the fly. UFFDIO_COPY_MODE_WP is available only if the
+ * write protected ioctl is implemented for the range
+ * according to the uffdio_register.ioctls.
*/
-#define UFFDIO_COPY_MODE_DONTWAKE ((__u64)1<<0)
+#define UFFDIO_COPY_MODE_WP ((__u64)1<<1)
__u64 mode;
/*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9932d5755e4c..c8e7846e9b7e 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -25,7 +25,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
struct vm_area_struct *dst_vma,
unsigned long dst_addr,
unsigned long src_addr,
- struct page **pagep)
+ struct page **pagep,
+ bool wp_copy)
{
struct mem_cgroup *memcg;
pte_t _dst_pte, *dst_pte;
@@ -71,9 +72,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg, false))
goto out_release;
- _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
- if (dst_vma->vm_flags & VM_WRITE)
- _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
+ _dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
+ if ((dst_vma->vm_flags & VM_WRITE) && !wp_copy)
+ _dst_pte = pte_mkwrite(_dst_pte);
dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
if (dst_vma->vm_file) {
@@ -398,7 +399,8 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
unsigned long dst_addr,
unsigned long src_addr,
struct page **page,
- bool zeropage)
+ bool zeropage,
+ bool wp_copy)
{
ssize_t err;
@@ -415,11 +417,13 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
if (!(dst_vma->vm_flags & VM_SHARED)) {
if (!zeropage)
err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
- dst_addr, src_addr, page);
+ dst_addr, src_addr, page,
+ wp_copy);
else
err = mfill_zeropage_pte(dst_mm, dst_pmd,
dst_vma, dst_addr);
} else {
+ VM_WARN_ON_ONCE(wp_copy);
if (!zeropage)
err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd,
dst_vma, dst_addr,
@@ -437,7 +441,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
unsigned long src_start,
unsigned long len,
bool zeropage,
- bool *mmap_changing)
+ bool *mmap_changing,
+ __u64 mode)
{
struct vm_area_struct *dst_vma;
ssize_t err;
@@ -445,6 +450,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
unsigned long src_addr, dst_addr;
long copied;
struct page *page;
+ bool wp_copy;
/*
* Sanitize the command parameters:
@@ -501,6 +507,14 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
dst_vma->vm_flags & VM_SHARED))
goto out_unlock;
+ /*
+ * validate 'mode' now that we know the dst_vma: don't allow
+ * a wrprotect copy if the userfaultfd didn't register as WP.
+ */
+ wp_copy = mode & UFFDIO_COPY_MODE_WP;
+ if (wp_copy && !(dst_vma->vm_flags & VM_UFFD_WP))
+ goto out_unlock;
+
/*
* If this is a HUGETLB vma, pass off to appropriate routine
*/
@@ -556,7 +570,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
BUG_ON(pmd_trans_huge(*dst_pmd));
err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
- src_addr, &page, zeropage);
+ src_addr, &page, zeropage, wp_copy);
cond_resched();
if (unlikely(err == -ENOENT)) {
@@ -603,14 +617,14 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
unsigned long src_start, unsigned long len,
- bool *mmap_changing)
+ bool *mmap_changing, __u64 mode)
{
return __mcopy_atomic(dst_mm, dst_start, src_start, len, false,
- mmap_changing);
+ mmap_changing, mode);
}
ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
unsigned long len, bool *mmap_changing)
{
- return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing);
+ return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0);
}
--
2.21.0
From: Andrea Arcangeli <[email protected]>
v1: From: Shaohua Li <[email protected]>
v2: cleanups, remove a branch.
[peterx writes up the commit message, as below...]
This patch introduces the new uffd-wp APIs for userspace.
Firstly, we'll allow UFFDIO_REGISTER with write protection tracking
using the new UFFDIO_REGISTER_MODE_WP flag. Note that this flag can
co-exist with the existing UFFDIO_REGISTER_MODE_MISSING, in which
case the userspace program can not only resolve missing page faults
but also track page data changes along the way.
Secondly, we introduce the new UFFDIO_WRITEPROTECT API to do page
level write protection tracking. Note that the memory region needs to
be registered with UFFDIO_REGISTER_MODE_WP before that.
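Put together, a minimal sketch of the new flow (assuming uffd, addr
and len are set up elsewhere; error handling omitted):

    struct uffdio_register reg = {
        .range = { .start = addr, .len = len },
        /* can be OR'ed with UFFDIO_REGISTER_MODE_MISSING */
        .mode  = UFFDIO_REGISTER_MODE_WP,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    struct uffdio_writeprotect wp = {
        .range = { .start = addr, .len = len },
        .mode  = UFFDIO_WRITEPROTECT_MODE_WP,  /* write protect the range */
    };
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);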
Signed-off-by: Andrea Arcangeli <[email protected]>
[peterx: remove useless block, write commit message, check against
VM_MAYWRITE rather than VM_WRITE when register]
Reviewed-by: Jerome Glisse <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 82 +++++++++++++++++++++++++-------
include/uapi/linux/userfaultfd.h | 23 +++++++++
2 files changed, 89 insertions(+), 16 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index c594945ad5bf..3cf19aeaa0e0 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -306,8 +306,11 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
if (!pmd_present(_pmd))
goto out;
- if (pmd_trans_huge(_pmd))
+ if (pmd_trans_huge(_pmd)) {
+ if (!pmd_write(_pmd) && (reason & VM_UFFD_WP))
+ ret = true;
goto out;
+ }
/*
* the pmd is stable (as in !pmd_trans_unstable) so we can re-read it
@@ -320,6 +323,8 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
*/
if (pte_none(*pte))
ret = true;
+ if (!pte_write(*pte) && (reason & VM_UFFD_WP))
+ ret = true;
pte_unmap(pte);
out:
@@ -1258,10 +1263,13 @@ static __always_inline int validate_range(struct mm_struct *mm,
return 0;
}
-static inline bool vma_can_userfault(struct vm_area_struct *vma)
+static inline bool vma_can_userfault(struct vm_area_struct *vma,
+ unsigned long vm_flags)
{
- return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
- vma_is_shmem(vma);
+ /* FIXME: add WP support to hugetlbfs and shmem */
+ return vma_is_anonymous(vma) ||
+ ((is_vm_hugetlb_page(vma) || vma_is_shmem(vma)) &&
+ !(vm_flags & VM_UFFD_WP));
}
static int userfaultfd_register(struct userfaultfd_ctx *ctx,
@@ -1293,15 +1301,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
vm_flags = 0;
if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
vm_flags |= VM_UFFD_MISSING;
- if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) {
+ if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP)
vm_flags |= VM_UFFD_WP;
- /*
- * FIXME: remove the below error constraint by
- * implementing the wprotect tracking mode.
- */
- ret = -EINVAL;
- goto out;
- }
ret = validate_range(mm, uffdio_register.range.start,
uffdio_register.range.len);
@@ -1351,7 +1352,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
/* check not compatible vmas */
ret = -EINVAL;
- if (!vma_can_userfault(cur))
+ if (!vma_can_userfault(cur, vm_flags))
goto out_unlock;
/*
@@ -1379,6 +1380,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
if (end & (vma_hpagesize - 1))
goto out_unlock;
}
+ if ((vm_flags & VM_UFFD_WP) && !(cur->vm_flags & VM_MAYWRITE))
+ goto out_unlock;
/*
* Check that this vma isn't already owned by a
@@ -1408,7 +1411,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
do {
cond_resched();
- BUG_ON(!vma_can_userfault(vma));
+ BUG_ON(!vma_can_userfault(vma, vm_flags));
BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
vma->vm_userfaultfd_ctx.ctx != ctx);
WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
@@ -1545,7 +1548,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
* provides for more strict behavior to notice
* unregistration errors.
*/
- if (!vma_can_userfault(cur))
+ if (!vma_can_userfault(cur, cur->vm_flags))
goto out_unlock;
found = true;
@@ -1559,7 +1562,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
do {
cond_resched();
- BUG_ON(!vma_can_userfault(vma));
+ BUG_ON(!vma_can_userfault(vma, vma->vm_flags));
/*
* Nothing to do: this vma is already registered into this
@@ -1772,6 +1775,50 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
return ret;
}
+static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
+ unsigned long arg)
+{
+ int ret;
+ struct uffdio_writeprotect uffdio_wp;
+ struct uffdio_writeprotect __user *user_uffdio_wp;
+ struct userfaultfd_wake_range range;
+
+ if (READ_ONCE(ctx->mmap_changing))
+ return -EAGAIN;
+
+ user_uffdio_wp = (struct uffdio_writeprotect __user *) arg;
+
+ if (copy_from_user(&uffdio_wp, user_uffdio_wp,
+ sizeof(struct uffdio_writeprotect)))
+ return -EFAULT;
+
+ ret = validate_range(ctx->mm, uffdio_wp.range.start,
+ uffdio_wp.range.len);
+ if (ret)
+ return ret;
+
+ if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
+ UFFDIO_WRITEPROTECT_MODE_WP))
+ return -EINVAL;
+ if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
+ (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
+ return -EINVAL;
+
+ ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
+ uffdio_wp.range.len, uffdio_wp.mode &
+ UFFDIO_WRITEPROTECT_MODE_WP,
+ &ctx->mmap_changing);
+ if (ret)
+ return ret;
+
+ if (!(uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) {
+ range.start = uffdio_wp.range.start;
+ range.len = uffdio_wp.range.len;
+ wake_userfault(ctx, &range);
+ }
+ return ret;
+}
+
static inline unsigned int uffd_ctx_features(__u64 user_features)
{
/*
@@ -1849,6 +1896,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
case UFFDIO_ZEROPAGE:
ret = userfaultfd_zeropage(ctx, arg);
break;
+ case UFFDIO_WRITEPROTECT:
+ ret = userfaultfd_writeprotect(ctx, arg);
+ break;
}
return ret;
}
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 340f23bc251d..95c4a160e5f8 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -52,6 +52,7 @@
#define _UFFDIO_WAKE (0x02)
#define _UFFDIO_COPY (0x03)
#define _UFFDIO_ZEROPAGE (0x04)
+#define _UFFDIO_WRITEPROTECT (0x06)
#define _UFFDIO_API (0x3F)
/* userfaultfd ioctl ids */
@@ -68,6 +69,8 @@
struct uffdio_copy)
#define UFFDIO_ZEROPAGE _IOWR(UFFDIO, _UFFDIO_ZEROPAGE, \
struct uffdio_zeropage)
+#define UFFDIO_WRITEPROTECT _IOWR(UFFDIO, _UFFDIO_WRITEPROTECT, \
+ struct uffdio_writeprotect)
/* read() structure */
struct uffd_msg {
@@ -232,4 +235,24 @@ struct uffdio_zeropage {
__s64 zeropage;
};
+struct uffdio_writeprotect {
+ struct uffdio_range range;
+/*
+ * UFFDIO_WRITEPROTECT_MODE_WP: set the flag to write protect a range,
+ * unset the flag to undo protection of a range which was previously
+ * write protected.
+ *
+ * UFFDIO_WRITEPROTECT_MODE_DONTWAKE: set the flag to avoid waking up
+ * any waiting thread after the operation succeeds.
+ *
+ * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
+ * therefore DONTWAKE flag is meaningless with WP=1. Removing write
+ * protection (WP=0) in response to a page fault wakes the faulting
+ * task unless DONTWAKE is set.
+ */
+#define UFFDIO_WRITEPROTECT_MODE_WP ((__u64)1<<0)
+#define UFFDIO_WRITEPROTECT_MODE_DONTWAKE ((__u64)1<<1)
+ __u64 mode;
+};
+
#endif /* _LINUX_USERFAULTFD_H */
--
2.21.0
It does not make sense to try to wake up any waiting thread when we're
write-protecting a memory region. Only wake up when resolving a write
protected page fault.
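In sketch form (reusing the struct uffdio_writeprotect from the
previous patch):

    wp.mode = UFFDIO_WRITEPROTECT_MODE_WP;   /* protect: never wakes */
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

    wp.mode = 0;   /* resolve: wakes the faulting thread,
                      unless DONTWAKE is also set */
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);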
Reviewed-by: Mike Rapoport <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 3cf19aeaa0e0..498971fa9163 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1782,6 +1782,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
struct uffdio_writeprotect uffdio_wp;
struct uffdio_writeprotect __user *user_uffdio_wp;
struct userfaultfd_wake_range range;
+ bool mode_wp, mode_dontwake;
if (READ_ONCE(ctx->mmap_changing))
return -EAGAIN;
@@ -1800,18 +1801,20 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
UFFDIO_WRITEPROTECT_MODE_WP))
return -EINVAL;
- if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) &&
- (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE))
+
+ mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
+ mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
+
+ if (mode_wp && mode_dontwake)
return -EINVAL;
ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
- uffdio_wp.range.len, uffdio_wp.mode &
- UFFDIO_WRITEPROTECT_MODE_WP,
+ uffdio_wp.range.len, mode_wp,
&ctx->mmap_changing);
if (ret)
return ret;
- if (!(uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) {
+ if (!mode_wp && !mode_dontwake) {
range.start = uffdio_wp.range.start;
range.len = uffdio_wp.range.len;
wake_userfault(ctx, &range);
--
2.21.0
From: Martin Cracauer <[email protected]>
Adds documentation about the write protection support.
Signed-off-by: Martin Cracauer <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
[peterx: rewrite in rst format; fixups here and there]
Reviewed-by: Jerome Glisse <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
Documentation/admin-guide/mm/userfaultfd.rst | 51 ++++++++++++++++++++
1 file changed, 51 insertions(+)
diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 5048cf661a8a..c30176e67900 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -108,6 +108,57 @@ UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
half copied page since it'll keep userfaulting until the copy has
finished.
+Notes:
+
+- If you requested UFFDIO_REGISTER_MODE_MISSING when registering then
+ you must provide some kind of page in your thread after reading from
+ the uffd. You must provide either UFFDIO_COPY or UFFDIO_ZEROPAGE.
+ The normal behavior of the OS automatically providing a zero page on
+  an anonymous mmapping is not in place.
+
+- None of the page-delivering ioctls default to the range that you
+ registered with. You must fill in all fields for the appropriate
+ ioctl struct including the range.
+
+- You get the address of the access that triggered the missing page
+ event out of a struct uffd_msg that you read in the thread from the
+ uffd. You can supply as many pages as you want with UFFDIO_COPY or
+ UFFDIO_ZEROPAGE. Keep in mind that unless you used DONTWAKE then
+ the first of any of those IOCTLs wakes up the faulting thread.
+
+- Be sure to test for all errors including (pollfd[0].revents &
+ POLLERR). This can happen, e.g. when ranges supplied were
+ incorrect.
+
+Write Protect Notifications
+---------------------------
+
+This is equivalent to (but faster than) using mprotect and a SIGSEGV
+signal handler.
+
+Firstly you need to register a range with UFFDIO_REGISTER_MODE_WP.
+Instead of using mprotect(2) you use ioctl(uffd, UFFDIO_WRITEPROTECT,
+struct uffdio_writeprotect *) with UFFDIO_WRITEPROTECT_MODE_WP set in
+the mode field of the struct passed in. The range does not default
+to and does not
+have to be identical to the range you registered with. You can write
+protect as many ranges as you like (inside the registered range).
+Then, in the thread reading from uffd the struct will have
+msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP set. Now you send
+ioctl(uffd, UFFDIO_WRITEPROTECT, struct uffdio_writeprotect *) again,
+this time with the mode field cleared of UFFDIO_WRITEPROTECT_MODE_WP.
+This wakes up the thread, which can then continue its writes. This
+allows you to do the bookkeeping about the write in the uffd reading
+thread before the ioctl.
+
+If you registered with both UFFDIO_REGISTER_MODE_MISSING and
+UFFDIO_REGISTER_MODE_WP then you need to think about the sequence in
+which you supply a page and undo write protect. Note that there is a
+difference between writes into a WP area and into a !WP area. The
+former will have UFFD_PAGEFAULT_FLAG_WP set, the latter
+UFFD_PAGEFAULT_FLAG_WRITE. The latter did not fail on protection but
+you still need to supply a page when UFFDIO_REGISTER_MODE_MISSING was
+used.
+
QEMU/KVM
========
--
2.21.0
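The flow described above, condensed into a sketch (monitor side only:
a separate writer thread is assumed to touch `area` and trigger the
WP fault; error handling trimmed and a 4 KiB page size assumed):

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

static void wp_monitor_sketch(void)
{
	size_t len = 4096;
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	struct uffdio_api api = { .api = UFFD_API };
	char *area = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	ioctl(uffd, UFFDIO_API, &api);

	/* Register the range for write-protect tracking */
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)area, .len = len },
		.mode = UFFDIO_REGISTER_MODE_WP,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* Fault the pages in first so there are PTEs to protect, then
	 * write protect the whole range (any sub-range works too) */
	memset(area, 0, len);
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)area, .len = len },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
	};
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

	/* The writer thread now blocks on its write; we see the fault */
	struct uffd_msg msg;
	read(uffd, &msg, sizeof(msg));
	if (msg.event == UFFD_EVENT_PAGEFAULT &&
	    (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) {
		/* Bookkeeping about the write goes here, then
		 * un-protect to wake the writer up */
		wp.mode = 0;
		ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
	}
}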
Only declare _UFFDIO_WRITEPROTECT if the user specified
UFFDIO_REGISTER_MODE_WP and all the checks passed. Then when the user
registers regions backed by shmem/hugetlbfs we won't expose the new
ioctl to them. Even for a completely anonymous memory range, we'll
only expose the new WP ioctl bit if the register mode has MODE_WP.
Reviewed-by: Mike Rapoport <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 498971fa9163..4e1d7748224a 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1465,14 +1465,24 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
up_write(&mm->mmap_sem);
mmput(mm);
if (!ret) {
+ __u64 ioctls_out;
+
+ ioctls_out = basic_ioctls ? UFFD_API_RANGE_IOCTLS_BASIC :
+ UFFD_API_RANGE_IOCTLS;
+
+ /*
+ * Declare the WP ioctl only if the WP mode is
+ * specified and all checks passed with the range
+ */
+ if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_WP))
+ ioctls_out &= ~((__u64)1 << _UFFDIO_WRITEPROTECT);
+
/*
* Now that we scanned all vmas we can already tell
* userland which ioctls methods are guaranteed to
* succeed on this range.
*/
- if (put_user(basic_ioctls ? UFFD_API_RANGE_IOCTLS_BASIC :
- UFFD_API_RANGE_IOCTLS,
- &user_uffdio_register->ioctls))
+ if (put_user(ioctls_out, &user_uffdio_register->ioctls))
ret = -EFAULT;
}
out:
--
2.21.0
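From the userspace side this means the advertised ioctl mask should be
checked after registration rather than assumed; a sketch (the helper
name is hypothetical):

#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Sketch: returns 1 if UFFDIO_WRITEPROTECT is usable on the range */
static int uffd_register_wp(int uffd, void *start, unsigned long len)
{
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)start, .len = len },
		.mode = UFFDIO_REGISTER_MODE_MISSING |
			UFFDIO_REGISTER_MODE_WP,
	};

	if (ioctl(uffd, UFFDIO_REGISTER, &reg))
		return 0;

	/* With this patch the WP bit is only advertised for anonymous
	 * ranges registered with MODE_WP */
	return !!(reg.ioctls & ((__u64)1 << _UFFDIO_WRITEPROTECT));
}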
We have multiple (and more coming) places that would like to find a
userfault-enabled VMA from an mm struct that covers a specific memory
range. This patch introduces a helper for that and applies it to the
existing code.
Suggested-by: Mike Rapoport <[email protected]>
Reviewed-by: Jerome Glisse <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/userfaultfd.c | 54 +++++++++++++++++++++++++++---------------------
1 file changed, 30 insertions(+), 24 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 5363376cb07a..6b9dd5b66f64 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -20,6 +20,34 @@
#include <asm/tlbflush.h>
#include "internal.h"
+/*
+ * Find a valid userfault enabled VMA region that covers the whole
+ * address range, or NULL on failure. Must be called with mmap_sem
+ * held.
+ */
+static struct vm_area_struct *vma_find_uffd(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long len)
+{
+ struct vm_area_struct *vma = find_vma(mm, start);
+
+ if (!vma)
+ return NULL;
+
+ /*
+ * Check the vma is registered in uffd, this is required to
+ * enforce the VM_MAYWRITE check done at uffd registration
+ * time.
+ */
+ if (!vma->vm_userfaultfd_ctx.ctx)
+ return NULL;
+
+ if (start < vma->vm_start || start + len > vma->vm_end)
+ return NULL;
+
+ return vma;
+}
+
static int mcopy_atomic_pte(struct mm_struct *dst_mm,
pmd_t *dst_pmd,
struct vm_area_struct *dst_vma,
@@ -228,20 +256,9 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
*/
if (!dst_vma) {
err = -ENOENT;
- dst_vma = find_vma(dst_mm, dst_start);
+ dst_vma = vma_find_uffd(dst_mm, dst_start, len);
if (!dst_vma || !is_vm_hugetlb_page(dst_vma))
goto out_unlock;
- /*
- * Check the vma is registered in uffd, this is
- * required to enforce the VM_MAYWRITE check done at
- * uffd registration time.
- */
- if (!dst_vma->vm_userfaultfd_ctx.ctx)
- goto out_unlock;
-
- if (dst_start < dst_vma->vm_start ||
- dst_start + len > dst_vma->vm_end)
- goto out_unlock;
err = -EINVAL;
if (vma_hpagesize != vma_kernel_pagesize(dst_vma))
@@ -487,20 +504,9 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
* both valid and fully within a single existing vma.
*/
err = -ENOENT;
- dst_vma = find_vma(dst_mm, dst_start);
+ dst_vma = vma_find_uffd(dst_mm, dst_start, len);
if (!dst_vma)
goto out_unlock;
- /*
- * Check the vma is registered in uffd, this is required to
- * enforce the VM_MAYWRITE check done at uffd registration
- * time.
- */
- if (!dst_vma->vm_userfaultfd_ctx.ctx)
- goto out_unlock;
-
- if (dst_start < dst_vma->vm_start ||
- dst_start + len > dst_vma->vm_end)
- goto out_unlock;
err = -EINVAL;
/*
--
2.21.0
This patch adds uffd tests for write protection.
Instead of introducing new tests for it, let's simply squash the
uffd-wp tests into the existing uffd-missing test cases. The changes
are:
(1) Bouncing tests
We do the write-protection in two ways during the bouncing test:
- By using UFFDIO_COPY_MODE_WP when resolving MISSING pages: this
  makes sure that, for each bounce process, every single page will
  fault at least twice: once for MISSING, once for WP.
- By calling UFFDIO_WRITEPROTECT directly on already-faulted memory:
  to further torture the explicit page protection procedures of
  uffd-wp, we split each bounce procedure into two halves (in the
  background thread). The first half is MISSING+WP for each page as
  explained above. After the first half, we write protect the faulted
  region in the background thread, so that at least half of the pages
  will be write protected again; this is what exercises the new
  UFFDIO_WRITEPROTECT call. Then we continue with the 2nd half, which
  will see both MISSING and WP faults for the 2nd half of the pages,
  plus WP-only faults from the 1st half.
(2) Event/Signal test
Mostly the previous tests, but now doing MISSING+WP for each page.
For the sigbus-mode test we need to provide a standalone path to
handle the write protection faults.
For all tests, also collect statistics for uffd-wp pages.
Signed-off-by: Peter Xu <[email protected]>
---
tools/testing/selftests/vm/userfaultfd.c | 157 +++++++++++++++++++----
1 file changed, 133 insertions(+), 24 deletions(-)
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 417dbdf4d379..fa362fe311e3 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -56,6 +56,7 @@
#include <linux/userfaultfd.h>
#include <setjmp.h>
#include <stdbool.h>
+#include <assert.h>
#include "../kselftest.h"
@@ -78,6 +79,8 @@ static int test_type;
#define ALARM_INTERVAL_SECS 10
static volatile bool test_uffdio_copy_eexist = true;
static volatile bool test_uffdio_zeropage_eexist = true;
+/* Whether to test uffd write-protection */
+static bool test_uffdio_wp = false;
static bool map_shared;
static int huge_fd;
@@ -92,6 +95,7 @@ pthread_attr_t attr;
struct uffd_stats {
int cpu;
unsigned long missing_faults;
+ unsigned long wp_faults;
};
/* pthread_mutex_t starts at page offset 0 */
@@ -141,9 +145,29 @@ static void uffd_stats_reset(struct uffd_stats *uffd_stats,
for (i = 0; i < n_cpus; i++) {
uffd_stats[i].cpu = i;
uffd_stats[i].missing_faults = 0;
+ uffd_stats[i].wp_faults = 0;
}
}
+static void uffd_stats_report(struct uffd_stats *stats, int n_cpus)
+{
+ int i;
+ unsigned long long miss_total = 0, wp_total = 0;
+
+ for (i = 0; i < n_cpus; i++) {
+ miss_total += stats[i].missing_faults;
+ wp_total += stats[i].wp_faults;
+ }
+
+ printf("userfaults: %llu missing (", miss_total);
+ for (i = 0; i < n_cpus; i++)
+ printf("%lu+", stats[i].missing_faults);
+ printf("\b), %llu wp (", wp_total);
+ for (i = 0; i < n_cpus; i++)
+ printf("%lu+", stats[i].wp_faults);
+ printf("\b)\n");
+}
+
static int anon_release_pages(char *rel_area)
{
int ret = 0;
@@ -264,10 +288,15 @@ struct uffd_test_ops {
void (*alias_mapping)(__u64 *start, size_t len, unsigned long offset);
};
-#define ANON_EXPECTED_IOCTLS ((1 << _UFFDIO_WAKE) | \
+#define SHMEM_EXPECTED_IOCTLS ((1 << _UFFDIO_WAKE) | \
(1 << _UFFDIO_COPY) | \
(1 << _UFFDIO_ZEROPAGE))
+#define ANON_EXPECTED_IOCTLS ((1 << _UFFDIO_WAKE) | \
+ (1 << _UFFDIO_COPY) | \
+ (1 << _UFFDIO_ZEROPAGE) | \
+ (1 << _UFFDIO_WRITEPROTECT))
+
static struct uffd_test_ops anon_uffd_test_ops = {
.expected_ioctls = ANON_EXPECTED_IOCTLS,
.allocate_area = anon_allocate_area,
@@ -276,7 +305,7 @@ static struct uffd_test_ops anon_uffd_test_ops = {
};
static struct uffd_test_ops shmem_uffd_test_ops = {
- .expected_ioctls = ANON_EXPECTED_IOCTLS,
+ .expected_ioctls = SHMEM_EXPECTED_IOCTLS,
.allocate_area = shmem_allocate_area,
.release_pages = shmem_release_pages,
.alias_mapping = noop_alias_mapping,
@@ -300,6 +329,21 @@ static int my_bcmp(char *str1, char *str2, size_t n)
return 0;
}
+static void wp_range(int ufd, __u64 start, __u64 len, bool wp)
+{
+ struct uffdio_writeprotect prms = { 0 };
+
+ /* Write protection page faults */
+ prms.range.start = start;
+ prms.range.len = len;
+ /* Set or clear write protection; clearing also wakes the thread */
+ prms.mode = wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0;
+
+ if (ioctl(ufd, UFFDIO_WRITEPROTECT, &prms))
+ fprintf(stderr, "clear WP failed for address 0x%Lx\n",
+ start), exit(1);
+}
+
static void *locking_thread(void *arg)
{
unsigned long cpu = (unsigned long) arg;
@@ -438,7 +482,10 @@ static int __copy_page(int ufd, unsigned long offset, bool retry)
uffdio_copy.dst = (unsigned long) area_dst + offset;
uffdio_copy.src = (unsigned long) area_src + offset;
uffdio_copy.len = page_size;
- uffdio_copy.mode = 0;
+ if (test_uffdio_wp)
+ uffdio_copy.mode = UFFDIO_COPY_MODE_WP;
+ else
+ uffdio_copy.mode = 0;
uffdio_copy.copy = 0;
if (ioctl(ufd, UFFDIO_COPY, &uffdio_copy)) {
/* real retval in ufdio_copy.copy */
@@ -495,15 +542,21 @@ static void uffd_handle_page_fault(struct uffd_msg *msg,
fprintf(stderr, "unexpected msg event %u\n",
msg->event), exit(1);
- if (bounces & BOUNCE_VERIFY &&
- msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE)
- fprintf(stderr, "unexpected write fault\n"), exit(1);
+ if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) {
+ wp_range(uffd, msg->arg.pagefault.address, page_size, false);
+ stats->wp_faults++;
+ } else {
+ /* Missing page faults */
+ if (bounces & BOUNCE_VERIFY &&
+ msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE)
+ fprintf(stderr, "unexpected write fault\n"), exit(1);
- offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst;
- offset &= ~(page_size-1);
+ offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst;
+ offset &= ~(page_size-1);
- if (copy_page(uffd, offset))
- stats->missing_faults++;
+ if (copy_page(uffd, offset))
+ stats->missing_faults++;
+ }
}
static void *uffd_poll_thread(void *arg)
@@ -589,11 +642,30 @@ static void *uffd_read_thread(void *arg)
static void *background_thread(void *arg)
{
unsigned long cpu = (unsigned long) arg;
- unsigned long page_nr;
+ unsigned long page_nr, start_nr, mid_nr, end_nr;
+
+ start_nr = cpu * nr_pages_per_cpu;
+ end_nr = (cpu+1) * nr_pages_per_cpu;
+ mid_nr = (start_nr + end_nr) / 2;
+
+ /* Copy the first half of the pages */
+ for (page_nr = start_nr; page_nr < mid_nr; page_nr++)
+ copy_page_retry(uffd, page_nr * page_size);
- for (page_nr = cpu * nr_pages_per_cpu;
- page_nr < (cpu+1) * nr_pages_per_cpu;
- page_nr++)
+ /*
+ * If we need to test uffd-wp, set it up now. Then we'll have
+ * at least the first half of the pages mapped already which
+ * can be write-protected for testing
+ */
+ if (test_uffdio_wp)
+ wp_range(uffd, (unsigned long)area_dst + start_nr * page_size,
+ nr_pages_per_cpu * page_size, true);
+
+ /*
+ * Continue the 2nd half of the page copying, handling write
+ * protection faults if any
+ */
+ for (page_nr = mid_nr; page_nr < end_nr; page_nr++)
copy_page_retry(uffd, page_nr * page_size);
return NULL;
@@ -755,17 +827,31 @@ static int faulting_process(int signal_test)
}
for (nr = 0; nr < split_nr_pages; nr++) {
+ int steps = 1;
+ unsigned long offset = nr * page_size;
+
if (signal_test) {
if (sigsetjmp(*sigbuf, 1) != 0) {
- if (nr == lastnr) {
+ if (steps == 1 && nr == lastnr) {
fprintf(stderr, "Signal repeated\n");
return 1;
}
lastnr = nr;
if (signal_test == 1) {
- if (copy_page(uffd, nr * page_size))
- signalled++;
+ if (steps == 1) {
+ /* This is a MISSING request */
+ steps++;
+ if (copy_page(uffd, offset))
+ signalled++;
+ } else {
+ /* This is a WP request */
+ assert(steps == 2);
+ wp_range(uffd,
+ (__u64)area_dst +
+ offset,
+ page_size, false);
+ }
} else {
signalled++;
continue;
@@ -778,8 +864,13 @@ static int faulting_process(int signal_test)
fprintf(stderr,
"nr %lu memory corruption %Lu %Lu\n",
nr, count,
- count_verify[nr]), exit(1);
- }
+ count_verify[nr]);
+ }
+ /*
+ * Trigger write protection if there is any by writing
+ * the same value back.
+ */
+ *area_count(area_dst, nr) = count;
}
if (signal_test)
@@ -801,6 +892,11 @@ static int faulting_process(int signal_test)
nr, count,
count_verify[nr]), exit(1);
}
+ /*
+ * Trigger write protection if there is any by writing
+ * the same value back.
+ */
+ *area_count(area_dst, nr) = count;
}
if (uffd_test_ops->release_pages(area_dst))
@@ -904,6 +1000,8 @@ static int userfaultfd_zeropage_test(void)
uffdio_register.range.start = (unsigned long) area_dst;
uffdio_register.range.len = nr_pages * page_size;
uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+ if (test_uffdio_wp)
+ uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
fprintf(stderr, "register failure\n"), exit(1);
@@ -949,6 +1047,8 @@ static int userfaultfd_events_test(void)
uffdio_register.range.start = (unsigned long) area_dst;
uffdio_register.range.len = nr_pages * page_size;
uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+ if (test_uffdio_wp)
+ uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
fprintf(stderr, "register failure\n"), exit(1);
@@ -979,7 +1079,8 @@ static int userfaultfd_events_test(void)
return 1;
close(uffd);
- printf("userfaults: %ld\n", stats.missing_faults);
+
+ uffd_stats_report(&stats, 1);
return stats.missing_faults != nr_pages;
}
@@ -1009,6 +1110,8 @@ static int userfaultfd_sig_test(void)
uffdio_register.range.start = (unsigned long) area_dst;
uffdio_register.range.len = nr_pages * page_size;
uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+ if (test_uffdio_wp)
+ uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
fprintf(stderr, "register failure\n"), exit(1);
@@ -1141,6 +1244,8 @@ static int userfaultfd_stress(void)
uffdio_register.range.start = (unsigned long) area_dst;
uffdio_register.range.len = nr_pages * page_size;
uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+ if (test_uffdio_wp)
+ uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) {
fprintf(stderr, "register failure\n");
return 1;
@@ -1195,6 +1300,11 @@ static int userfaultfd_stress(void)
if (stress(uffd_stats))
return 1;
+ /* Clear all the write protections if there is any */
+ if (test_uffdio_wp)
+ wp_range(uffd, (unsigned long)area_dst,
+ nr_pages * page_size, false);
+
/* unregister */
if (ioctl(uffd, UFFDIO_UNREGISTER, &uffdio_register.range)) {
fprintf(stderr, "unregister failure\n");
@@ -1233,10 +1343,7 @@ static int userfaultfd_stress(void)
area_src_alias = area_dst_alias;
area_dst_alias = tmp_area;
- printf("userfaults:");
- for (cpu = 0; cpu < nr_cpus; cpu++)
- printf(" %lu", uffd_stats[cpu].missing_faults);
- printf("\n");
+ uffd_stats_report(uffd_stats, nr_cpus);
}
if (err)
@@ -1276,6 +1383,8 @@ static void set_test_type(const char *type)
if (!strcmp(type, "anon")) {
test_type = TEST_ANON;
uffd_test_ops = &anon_uffd_test_ops;
+ /* Only enable write-protect test for anonymous test */
+ test_uffdio_wp = true;
} else if (!strcmp(type, "hugetlb")) {
test_type = TEST_HUGETLB;
uffd_test_ops = &hugetlb_uffd_test_ops;
--
2.21.0
From: Shaohua Li <[email protected]>
Add an API to enable/disable write protection on a VMA range. Unlike
mprotect, this doesn't split/merge VMAs.
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
[peterx:
- use the helper to find VMA;
- return -ENOENT if not found to match mcopy case;
- use the new MM_CP_UFFD_WP* flags for change_protection
- check against mmap_changing for failures]
Reviewed-by: Jerome Glisse <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/userfaultfd_k.h | 3 ++
mm/userfaultfd.c | 54 +++++++++++++++++++++++++++++++++++
2 files changed, 57 insertions(+)
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index dcd33172b728..a8e5f3ea9bb2 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -41,6 +41,9 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
unsigned long dst_start,
unsigned long len,
bool *mmap_changing);
+extern int mwriteprotect_range(struct mm_struct *dst_mm,
+ unsigned long start, unsigned long len,
+ bool enable_wp, bool *mmap_changing);
/* mm helpers */
static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 6b9dd5b66f64..4208592c7ca3 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -638,3 +638,57 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start,
{
return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0);
}
+
+int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
+ unsigned long len, bool enable_wp, bool *mmap_changing)
+{
+ struct vm_area_struct *dst_vma;
+ pgprot_t newprot;
+ int err;
+
+ /*
+ * Sanitize the command parameters:
+ */
+ BUG_ON(start & ~PAGE_MASK);
+ BUG_ON(len & ~PAGE_MASK);
+
+ /* Does the address range wrap, or is the span zero-sized? */
+ BUG_ON(start + len <= start);
+
+ down_read(&dst_mm->mmap_sem);
+
+ /*
+ * If memory mappings are changing because of non-cooperative
+ * operation (e.g. mremap) running in parallel, bail out and
+ * request the user to retry later
+ */
+ err = -EAGAIN;
+ if (mmap_changing && READ_ONCE(*mmap_changing))
+ goto out_unlock;
+
+ err = -ENOENT;
+ dst_vma = vma_find_uffd(dst_mm, start, len);
+ /*
+ * Make sure the vma is not shared, that the dst range is
+ * both valid and fully within a single existing vma.
+ */
+ if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+ goto out_unlock;
+ if (!userfaultfd_wp(dst_vma))
+ goto out_unlock;
+ if (!vma_is_anonymous(dst_vma))
+ goto out_unlock;
+
+ if (enable_wp)
+ newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
+ else
+ newprot = vm_get_page_prot(dst_vma->vm_flags);
+
+ change_protection(dst_vma, start, start + len, newprot,
+ enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
+
+ err = 0;
+out_unlock:
+ up_read(&dst_mm->mmap_sem);
+ return err;
+}
--
2.21.0
From: Shaohua Li <[email protected]>
Now it's safe to enable write protection in the userfaultfd API.
Cc: Andrea Arcangeli <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
Reviewed-by: Jerome Glisse <[email protected]>
Reviewed-by: Mike Rapoport <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/uapi/linux/userfaultfd.h | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 95c4a160e5f8..e7e98bde221f 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -19,7 +19,8 @@
* means the userland is reading).
*/
#define UFFD_API ((__u64)0xAA)
-#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK | \
+#define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP | \
+ UFFD_FEATURE_EVENT_FORK | \
UFFD_FEATURE_EVENT_REMAP | \
UFFD_FEATURE_EVENT_REMOVE | \
UFFD_FEATURE_EVENT_UNMAP | \
@@ -34,7 +35,8 @@
#define UFFD_API_RANGE_IOCTLS \
((__u64)1 << _UFFDIO_WAKE | \
(__u64)1 << _UFFDIO_COPY | \
- (__u64)1 << _UFFDIO_ZEROPAGE)
+ (__u64)1 << _UFFDIO_ZEROPAGE | \
+ (__u64)1 << _UFFDIO_WRITEPROTECT)
#define UFFD_API_RANGE_IOCTLS_BASIC \
((__u64)1 << _UFFDIO_WAKE | \
(__u64)1 << _UFFDIO_COPY)
--
2.21.0
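With the feature bit advertised, userspace can also probe for uffd-wp
support at UFFDIO_API handshake time, before registering anything; a
sketch (helper name hypothetical):

#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Sketch: returns 1 if the kernel advertises uffd-wp page faults */
static int uffd_has_wp(int uffd)
{
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = 0,	/* handshake: kernel fills in features */
	};

	if (ioctl(uffd, UFFDIO_API, &api))
		return 0;

	return !!(api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP);
}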
Introduce the uffd_stats structure for the self-test statistics, and
at the same time refactor the code to always pass a uffd_stats into
both the read()- and poll()-based fault handling threads, instead of
using two different ways to return the results. No functional change.
With the new structure, it's very easy to introduce new statistics.
Reviewed-by: Mike Rapoport <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
tools/testing/selftests/vm/userfaultfd.c | 76 +++++++++++++++---------
1 file changed, 49 insertions(+), 27 deletions(-)
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index b3e6497b080c..417dbdf4d379 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -88,6 +88,12 @@ static char *area_src, *area_src_alias, *area_dst, *area_dst_alias;
static char *zeropage;
pthread_attr_t attr;
+/* Userfaultfd test statistics */
+struct uffd_stats {
+ int cpu;
+ unsigned long missing_faults;
+};
+
/* pthread_mutex_t starts at page offset 0 */
#define area_mutex(___area, ___nr) \
((pthread_mutex_t *) ((___area) + (___nr)*page_size))
@@ -127,6 +133,17 @@ static void usage(void)
exit(1);
}
+static void uffd_stats_reset(struct uffd_stats *uffd_stats,
+ unsigned long n_cpus)
+{
+ int i;
+
+ for (i = 0; i < n_cpus; i++) {
+ uffd_stats[i].cpu = i;
+ uffd_stats[i].missing_faults = 0;
+ }
+}
+
static int anon_release_pages(char *rel_area)
{
int ret = 0;
@@ -469,8 +486,8 @@ static int uffd_read_msg(int ufd, struct uffd_msg *msg)
return 0;
}
-/* Return 1 if page fault handled by us; otherwise 0 */
-static int uffd_handle_page_fault(struct uffd_msg *msg)
+static void uffd_handle_page_fault(struct uffd_msg *msg,
+ struct uffd_stats *stats)
{
unsigned long offset;
@@ -485,18 +502,19 @@ static int uffd_handle_page_fault(struct uffd_msg *msg)
offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst;
offset &= ~(page_size-1);
- return copy_page(uffd, offset);
+ if (copy_page(uffd, offset))
+ stats->missing_faults++;
}
static void *uffd_poll_thread(void *arg)
{
- unsigned long cpu = (unsigned long) arg;
+ struct uffd_stats *stats = (struct uffd_stats *)arg;
+ unsigned long cpu = stats->cpu;
struct pollfd pollfd[2];
struct uffd_msg msg;
struct uffdio_register uffd_reg;
int ret;
char tmp_chr;
- unsigned long userfaults = 0;
pollfd[0].fd = uffd;
pollfd[0].events = POLLIN;
@@ -526,7 +544,7 @@ static void *uffd_poll_thread(void *arg)
msg.event), exit(1);
break;
case UFFD_EVENT_PAGEFAULT:
- userfaults += uffd_handle_page_fault(&msg);
+ uffd_handle_page_fault(&msg, stats);
break;
case UFFD_EVENT_FORK:
close(uffd);
@@ -545,28 +563,27 @@ static void *uffd_poll_thread(void *arg)
break;
}
}
- return (void *)userfaults;
+
+ return NULL;
}
pthread_mutex_t uffd_read_mutex = PTHREAD_MUTEX_INITIALIZER;
static void *uffd_read_thread(void *arg)
{
- unsigned long *this_cpu_userfaults;
+ struct uffd_stats *stats = (struct uffd_stats *)arg;
struct uffd_msg msg;
- this_cpu_userfaults = (unsigned long *) arg;
- *this_cpu_userfaults = 0;
-
pthread_mutex_unlock(&uffd_read_mutex);
/* from here cancellation is ok */
for (;;) {
if (uffd_read_msg(uffd, &msg))
continue;
- (*this_cpu_userfaults) += uffd_handle_page_fault(&msg);
+ uffd_handle_page_fault(&msg, stats);
}
- return (void *)NULL;
+
+ return NULL;
}
static void *background_thread(void *arg)
@@ -582,13 +599,12 @@ static void *background_thread(void *arg)
return NULL;
}
-static int stress(unsigned long *userfaults)
+static int stress(struct uffd_stats *uffd_stats)
{
unsigned long cpu;
pthread_t locking_threads[nr_cpus];
pthread_t uffd_threads[nr_cpus];
pthread_t background_threads[nr_cpus];
- void **_userfaults = (void **) userfaults;
finished = 0;
for (cpu = 0; cpu < nr_cpus; cpu++) {
@@ -597,12 +613,13 @@ static int stress(unsigned long *userfaults)
return 1;
if (bounces & BOUNCE_POLL) {
if (pthread_create(&uffd_threads[cpu], &attr,
- uffd_poll_thread, (void *)cpu))
+ uffd_poll_thread,
+ (void *)&uffd_stats[cpu]))
return 1;
} else {
if (pthread_create(&uffd_threads[cpu], &attr,
uffd_read_thread,
- &_userfaults[cpu]))
+ (void *)&uffd_stats[cpu]))
return 1;
pthread_mutex_lock(&uffd_read_mutex);
}
@@ -639,7 +656,8 @@ static int stress(unsigned long *userfaults)
fprintf(stderr, "pipefd write error\n");
return 1;
}
- if (pthread_join(uffd_threads[cpu], &_userfaults[cpu]))
+ if (pthread_join(uffd_threads[cpu],
+ (void *)&uffd_stats[cpu]))
return 1;
} else {
if (pthread_cancel(uffd_threads[cpu]))
@@ -910,11 +928,11 @@ static int userfaultfd_events_test(void)
{
struct uffdio_register uffdio_register;
unsigned long expected_ioctls;
- unsigned long userfaults;
pthread_t uffd_mon;
int err, features;
pid_t pid;
char c;
+ struct uffd_stats stats = { 0 };
printf("testing events (fork, remap, remove): ");
fflush(stdout);
@@ -941,7 +959,7 @@ static int userfaultfd_events_test(void)
"unexpected missing ioctl for anon memory\n"),
exit(1);
- if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, NULL))
+ if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats))
perror("uffd_poll_thread create"), exit(1);
pid = fork();
@@ -957,13 +975,13 @@ static int userfaultfd_events_test(void)
if (write(pipefd[1], &c, sizeof(c)) != sizeof(c))
perror("pipe write"), exit(1);
- if (pthread_join(uffd_mon, (void **)&userfaults))
+ if (pthread_join(uffd_mon, NULL))
return 1;
close(uffd);
- printf("userfaults: %ld\n", userfaults);
+ printf("userfaults: %ld\n", stats.missing_faults);
- return userfaults != nr_pages;
+ return stats.missing_faults != nr_pages;
}
static int userfaultfd_sig_test(void)
@@ -975,6 +993,7 @@ static int userfaultfd_sig_test(void)
int err, features;
pid_t pid;
char c;
+ struct uffd_stats stats = { 0 };
printf("testing signal delivery: ");
fflush(stdout);
@@ -1006,7 +1025,7 @@ static int userfaultfd_sig_test(void)
if (uffd_test_ops->release_pages(area_dst))
return 1;
- if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, NULL))
+ if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats))
perror("uffd_poll_thread create"), exit(1);
pid = fork();
@@ -1032,6 +1051,7 @@ static int userfaultfd_sig_test(void)
close(uffd);
return userfaults != 0;
}
+
static int userfaultfd_stress(void)
{
void *area;
@@ -1040,7 +1060,7 @@ static int userfaultfd_stress(void)
struct uffdio_register uffdio_register;
unsigned long cpu;
int err;
- unsigned long userfaults[nr_cpus];
+ struct uffd_stats uffd_stats[nr_cpus];
uffd_test_ops->allocate_area((void **)&area_src);
if (!area_src)
@@ -1169,8 +1189,10 @@ static int userfaultfd_stress(void)
if (uffd_test_ops->release_pages(area_dst))
return 1;
+ uffd_stats_reset(uffd_stats, nr_cpus);
+
/* bounce pass */
- if (stress(userfaults))
+ if (stress(uffd_stats))
return 1;
/* unregister */
@@ -1213,7 +1235,7 @@ static int userfaultfd_stress(void)
printf("userfaults:");
for (cpu = 0; cpu < nr_cpus; cpu++)
- printf(" %lu", userfaults[cpu]);
+ printf(" %lu", uffd_stats[cpu].missing_faults);
printf("\n");
}
--
2.21.0
So I still think this all *may* be ok, but at a minimum some of the
comments are misleading, and we need more docs on what happens with
normal signals.
I'm picking on just the first one I noticed, but I think there were
other architectures with this too:
On Wed, Jun 19, 2019 at 7:20 PM Peter Xu <[email protected]> wrote:
>
> diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
> index 6836095251ed..3517820aea07 100644
> --- a/arch/arc/mm/fault.c
> +++ b/arch/arc/mm/fault.c
> @@ -139,17 +139,14 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
> */
> fault = handle_mm_fault(vma, address, flags);
>
> - if (fatal_signal_pending(current)) {
> -
> + if (unlikely((fault & VM_FAULT_RETRY) && signal_pending(current))) {
> + if (fatal_signal_pending(current) && !user_mode(regs))
> + goto no_context;
> /*
> * if fault retry, mmap_sem already relinquished by core mm
> * so OK to return to user mode (with signal handled first)
> */
> - if (fault & VM_FAULT_RETRY) {
> - if (!user_mode(regs))
> - goto no_context;
> - return;
> - }
> + return;
> }
So note how the end result of this is:
(a) if a fatal signal is pending, and we're returning to kernel mode,
we do the exception handling
(b) otherwise, if *any* signal is pending, we'll just return and
retry the page fault
I have nothing against (a), and (b) is likely also ok, but it's worth
noting that (b) happens for kernel returns too. But the comment talks
about returning to user mode.
Is it ok to return to kernel mode when signals are pending? The signal
won't be handled, and we'll just retry the access.
Will we possibly keep retrying forever? When we take the fault again,
we'll set the FAULT_FLAG_ALLOW_RETRY again, so any fault handler that
says "if it allows retry, and signals are pending, just return" would
keep never making any progress, and we'd be stuck taking page faults
in kernel mode forever.
So I think the x86 code sequence is the much safer and more correct
one, because it will actually retry once, and set FAULT_FLAG_TRIED
(and it will clear the "FAULT_FLAG_ALLOW_RETRY" flag - but you'll
remove that clearing later in the series).
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 46df4c6aae46..dcd7c1393be3 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1463,16 +1463,20 @@ void do_user_addr_fault(struct pt_regs *regs,
> * that we made any progress. Handle this case first.
> */
> if (unlikely(fault & VM_FAULT_RETRY)) {
> + bool is_user = flags & FAULT_FLAG_USER;
> +
> /* Retry at most once */
> if (flags & FAULT_FLAG_ALLOW_RETRY) {
> flags &= ~FAULT_FLAG_ALLOW_RETRY;
> flags |= FAULT_FLAG_TRIED;
> + if (is_user && signal_pending(tsk))
> + return;
> if (!fatal_signal_pending(tsk))
> goto retry;
> }
>
> /* User mode? Just return to handle the fatal exception */
> - if (flags & FAULT_FLAG_USER)
> + if (is_user)
> return;
>
> /* Not returning to user mode? Handle exceptions or die: */
However, I think the real issue is that it just needs documentation
that a fault handler must not react to signal_pending() as part of the
fault handling itself (ie the VM_FAULT_RETRY can not be *because* of a
non-fatal signal), and there needs to be some guarantee of forward
progress.
At that point the "infinite page faults in kernel mode due to pending
signals" issue goes away. But it's not obvious in this patch, at
least.
Linus
On Sat, Jun 22, 2019 at 11:02:48AM -0700, Linus Torvalds wrote:
> So I still think this all *may* be ok, but at a minimum some of the
> comments are misleading, and we need more docs on what happens with
> normal signals.
>
> I'm picking on just the first one I noticed, but I think there were
> other architectures with this too:
>
> On Wed, Jun 19, 2019 at 7:20 PM Peter Xu <[email protected]> wrote:
> >
> > diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
> > index 6836095251ed..3517820aea07 100644
> > --- a/arch/arc/mm/fault.c
> > +++ b/arch/arc/mm/fault.c
> > @@ -139,17 +139,14 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
> > */
> > fault = handle_mm_fault(vma, address, flags);
> >
> > - if (fatal_signal_pending(current)) {
> > -
> > + if (unlikely((fault & VM_FAULT_RETRY) && signal_pending(current))) {
> > + if (fatal_signal_pending(current) && !user_mode(regs))
> > + goto no_context;
> > /*
> > * if fault retry, mmap_sem already relinquished by core mm
> > * so OK to return to user mode (with signal handled first)
> > */
> > - if (fault & VM_FAULT_RETRY) {
> > - if (!user_mode(regs))
> > - goto no_context;
> > - return;
> > - }
> > + return;
> > }
>
> So note how the end result of this is:
>
> (a) if a fatal signal is pending, and we're returning to kernel mode,
> we do the exception handling
>
> (b) otherwise, if *any* signal is pending, we'll just return and
> retry the page fault
>
> I have nothing against (a), and (b) is likely also ok, but it's worth
> noting that (b) happens for kernel returns too. But the comment talks
> about returning to user mode.
True. So even with the content of this patch I should at least have
touched up the comment, but I obviously missed that. Though reading
through the reply, I think it's the patch content that might need a
fixup rather than the comment...
>
> Is it ok to return to kernel mode when signals are pending? The signal
> won't be handled, and we'll just retry the access.
>
> Will we possibly keep retrying forever? When we take the fault again,
> we'll set the FAULT_FLAG_ALLOW_RETRY again, so any fault handler that
> says "if it allows retry, and signals are pending, just return" would
> keep never making any progress, and we'd be stuck taking page faults
> in kernel mode forever.
>
> So I think the x86 code sequence is the much safer and more correct
> one, because it will actually retry once, and set FAULT_FLAG_TRIED
> (and it will clear the "FAULT_FLAG_ALLOW_RETRY" flag - but you'll
> remove that clearing later in the series).
Indeed, at least the ARC code has more functional change than what
has been stated in the commit message (which is only about faster
signal handling). I wasn't paying much attention before because I
don't see "multiple retries" as a big problem here, and after all
that's what we finally want to achieve with the follow-up patches...
But I agree that maybe I should be even more explicit in this patch.
Does the change below (to be squashed into this patch) look good to
you? That's again an ARC-only example, but I can do similar things
for the other archs if you prefer:
/*
* if fault retry, mmap_sem already relinquished by core mm
* so OK to return to user mode (with signal handled first)
*/
- return;
+ if (user_mode(regs))
+ return;
>
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index 46df4c6aae46..dcd7c1393be3 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1463,16 +1463,20 @@ void do_user_addr_fault(struct pt_regs *regs,
> > * that we made any progress. Handle this case first.
> > */
> > if (unlikely(fault & VM_FAULT_RETRY)) {
> > + bool is_user = flags & FAULT_FLAG_USER;
> > +
> > /* Retry at most once */
> > if (flags & FAULT_FLAG_ALLOW_RETRY) {
> > flags &= ~FAULT_FLAG_ALLOW_RETRY;
> > flags |= FAULT_FLAG_TRIED;
> > + if (is_user && signal_pending(tsk))
> > + return;
> > if (!fatal_signal_pending(tsk))
> > goto retry;
> > }
> >
> > /* User mode? Just return to handle the fatal exception */
> > - if (flags & FAULT_FLAG_USER)
> > + if (is_user)
> > return;
> >
> > /* Not returning to user mode? Handle exceptions or die: */
>
> However, I think the real issue is that it just needs documentation
> that a fault handler must not react to signal_pending() as part of the
> fault handling itself (ie the VM_FAULT_RETRY can not be *because* of a
> non-fatal signal), and there needs to be some guarantee of forward
> progress.
Should we still be able to react to signal_pending() as part of fault
handling (because that's what this patch wants to do, at least for a
user-mode page fault)? Please kindly correct me if I misunderstood...
>
> At that point the "infinite page faults in kernel mode due to pending
> signals" issue goes away. But it's not obvious in this patch, at
> least.
Thanks,
--
Peter Xu
On Mon, Jun 24, 2019 at 3:43 PM Peter Xu <[email protected]> wrote:
>
> Should we still be able to react to signal_pending() as part of fault
> handling (because that's what this patch wants to do, at least for a
> user-mode page fault)? Please kindly correct me if I misunderstood...
I think that with this patch (modulo possible fix-ups) then yes, as
long as we're returning to user mode we can do signal_pending() and
return RETRY.
But I think we really want to add a new FAULT_FLAG_INTERRUPTIBLE bit
for that (the same way we already have FAULT_FLAG_KILLABLE for things
that can react to fatal signals), and only do it when that is set.
Then the page fault handler can set that flag when it's doing a
user-mode page fault.
Does that sound reasonable?
Linus
On Mon, Jun 24, 2019 at 09:31:42PM +0800, Linus Torvalds wrote:
> On Mon, Jun 24, 2019 at 3:43 PM Peter Xu <[email protected]> wrote:
> >
> > Should we still be able to react to signal_pending() as part of fault
> > handling (because that's what this patch wants to do, at least for a
> > user-mode page fault)? Please kindly correct me if I misunderstood...
>
> I think that with this patch (modulo possible fix-ups) then yes, as
> long as we're returning to user mode we can do signal_pending() and
> return RETRY.
>
> But I think we really want to add a new FAULT_FLAG_INTERRUPTIBLE bit
> for that (the same way we already have FAULT_FLAG_KILLABLE for things
> that can react to fatal signals), and only do it when that is set.
> Then the page fault handler can set that flag when it's doing a
> user-mode page fault.
>
> Does that sound reasonable?
Yes that sounds reasonable to me, and that matches perfectly with
TASK_INTERRUPTIBLE and TASK_KILLABLE. The only thing that I am a bit
uncertain about is whether we should define FAULT_FLAG_INTERRUPTIBLE as a
new bit or make it simply a combination of:
FAULT_FLAG_KILLABLE | FAULT_FLAG_USER
The problem is that when we do set_current_state() with either
TASK_INTERRUPTIBLE or TASK_KILLABLE we'll only choose one of them,
never both. Here, since the fault flag is a bitmask, if we introduce
a new FAULT_FLAG_INTERRUPTIBLE bit and use it in the fault flags then
we should probably make sure that FAULT_FLAG_KILLABLE is also set
along with it (since IMHO it won't make much sense to make a page
fault "interruptible" but "un-killable"...). Considering that
TASK_INTERRUPTIBLE should also only apply to user-mode page faults,
the same dependency seems to exist with FAULT_FLAG_USER. So I'm
thinking that maybe using the combination to express the meaning "we
would like this page fault to be interruptible, even for general
userspace signals" would be nicer?
AFAIK currently only handle_userfault() has such code to handle
normal signals besides SIGKILL, and it already tries to detect this
using the following rule:
return_to_userland =
(vmf->flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
(FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);
Then if we define that globally and officially, we can probably
replace this simply with:
return_to_userland = vmf->flags & FAULT_FLAG_INTERRUPTIBLE;
What do you think?
Thanks,
--
Peter Xu
On Tue, Jun 25, 2019 at 1:31 PM Peter Xu <[email protected]> wrote:
>
> Yes that sounds reasonable to me, and that matches perfectly with
> TASK_INTERRUPTIBLE and TASK_KILLABLE. The only thing that I am a bit
> uncertain about is whether we should define FAULT_FLAG_INTERRUPTIBLE as a
> new bit or make it simply a combination of:
>
> FAULT_FLAG_KILLABLE | FAULT_FLAG_USER
It needs to be a new bit, I think.
Some things could potentially care about the difference between "can I
abort this thing because the task will *die* and never see the end
result" and "can I abort this thing because it will be retried".
For a regular page fault, maybe FAULT_FLAG_INTERRUPTIBLE will always be
set for the same things that set FAULT_FLAG_KILLABLE when it happens
from user mode, but at least conceptually I think they are different,
and it could make a difference for things like get_user_pages() or
similar.
Also, I actually don't think we should ever expose FAULT_FLAG_USER to
any fault handlers anyway. It has a very specific meaning for memory
cgroup handling, and no other fault handler should likely ever care
about "was this a user fault". So I'd actually prefer for people to
ignore and forget that hacky flag entirely, rather than give it subtle
semantic meaning together with KILLABLE.
[ Side note: this is the point where I may soon lose internet access,
so I'll probably not be able to participate in the discussion any more
for a while ]
Linus
On Wed, Jun 26, 2019 at 09:59:58AM +0800, Linus Torvalds wrote:
> On Tue, Jun 25, 2019 at 1:31 PM Peter Xu <[email protected]> wrote:
> >
> > Yes that sounds reasonable to me, and that matches perfectly with
> > TASK_INTERRUPTIBLE and TASK_KILLABLE. The only thing that I am a bit
> > uncertain about is whether we should define FAULT_FLAG_INTERRUPTIBLE as a
> > new bit or make it simply a combination of:
> >
> > FAULT_FLAG_KILLABLE | FAULT_FLAG_USER
>
> It needs to be a new bit, I think.
>
> Some things could potentially care about the difference between "can I
> abort this thing because the task will *die* and never see the end
> result" and "can I abort this thing because it will be retried".
>
> For a regular page fault, maybe FAULT_FLAG_INTERRUPTIBLE will always be
> set for the same things that set FAULT_FLAG_KILLABLE when it happens
> from user mode, but at least conceptually I think they are different,
> and it could make a difference for things like get_user_pages() or
> similar.
>
> Also, I actually don't think we should ever expose FAULT_FLAG_USER to
> any fault handlers anyway. It has a very specific meaning for memory
> cgroup handling, and no other fault handler should likely ever care
> about "was this a user fault". So I'd actually prefer for people to
> ignore and forget that hacky flag entirely, rather than give it subtle
> semantic meaning together with KILLABLE.
OK.
>
> [ Side note: this is the point where I may soon lose internet access,
> so I'll probably not be able to participate in the discussion any more
> for a while ]
I appreciate these suggestions. I'll prepare something with that new
bit and see whether it can be accepted. I'll also try to split those
out of the bigger series.
Thanks,
--
Peter Xu
On Wed, Jun 19, 2019 at 7:20 PM Peter Xu <[email protected]> wrote:
> This series implements initial write protection support for
> userfaultfd. Currently both shmem and hugetlbfs are not supported
> yet, but only anonymous memory. This is the 4nd version of it.
>
> The latest code can also be found at:
>
> https://github.com/xzpeter/linux/tree/uffd-wp-merged
Hi Peter - I ported the branch you had above on top of v5.4.20 (what I
happened to be running locally), and fixed one issue that was causing
crashes for me:
https://github.com/bpowers/linux/commit/61086b5a0fa4aeb494e86d999926551a4323b84f
I wrote a small test program here:
https://github.com/plasma-umass/Mesh/blob/master/src/test/userfaultfd-kernel-copy.cc
and write protection support for userfaultfd (with eventual shmem
support) would be _hugely_ helpful for a userspace memory allocator
I'm working on.
Is there anything I can do to help get this considered for mainline?
We have some time before the 5.7 merge window opens up.
Tested-by: Bobby Powers <[email protected]>
On Mon, Feb 17, 2020 at 07:59:12PM -0800, Bobby Powers wrote:
> On Wed, Jun 19, 2019 at 7:20 PM Peter Xu <[email protected]> wrote:
> > This series implements initial write protection support for
> > userfaultfd. Currently both shmem and hugetlbfs are not supported
> > yet, but only anonymous memory. This is the 4nd version of it.
> >
> > The latest code can also be found at:
> >
> > https://github.com/xzpeter/linux/tree/uffd-wp-merged
>
> Hi Peter - I ported the branch you had above on top of v5.4.20 (what I
> happened to be running locally), and fixed one issue that was causing
> crashes for me:
> https://github.com/bpowers/linux/commit/61086b5a0fa4aeb494e86d999926551a4323b84f
Hi, Bobby,
Thanks for playing with the branch!
Yes, this should be needed if you have 7d0325749a6c ("userfaultfd:
untag user pointers", 2019-09-25) in your base branch where the
address is replaced by its pointer.
> I wrote a small test program here:
> https://github.com/plasma-umass/Mesh/blob/master/src/test/userfaultfd-kernel-copy.cc
Just FYI, there are some other tests/libraries out there [1,2].
The series also has the uffd selftest for write protection.
> and write protection support for userfaultfd (with eventual shmem
> support) would be _hugely_ helpful for a userspace memory allocator
> I'm working on. Is there anything I can do to help get this
> considered for mainline? We have some time before the 5.7 merge
> window opens up.
>
> Tested-by: Bobby Powers <[email protected]>
Thanks for the tag! Yes, it would be great if we could continue to
work on those, but for now let's see whether we can move forward with
what we have first (it's already two series, without much certainty
on whether they can get merged soon). Considering that we've got
quite a few pings again for both the mm retry series and the
write-protect work, I'll rebase the two series, then test and post
them this week. I'll keep you in the loop.
Thanks,
[1] https://github.com/LLNL/umap
[2] https://github.com/xzpeter/clibs/blob/master/gpl/userspace/uffd-test/uffd-test.c
--
Peter Xu