2023-02-02 11:30:33

by Muhammad Usama Anjum

Subject: [PATCH v10 0/6] Implement IOCTL to get and/or clear the info about PTEs

*Changes in v10*
- Add specific condition to return error if hugetlb is used with wp
async
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation

*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code

*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation

*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
flags

Hello,

Note:
"Soft-dirty" and "written-to" are used as synonyms here. As the kernel
already has a soft-dirty feature which we have given up on using, we use
the written-to terminology while using UFFD async WP under the hood.

This IOCTL, PAGEMAP_SCAN, on the pagemap file can be used to get and/or
clear info about page table entries. The following operations are
supported by this ioctl:
- Get the information if the pages have been written to (PAGE_IS_WRITTEN),
file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
(PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
pages have been written to.
- Find pages which have been written to and write-protect them again
(atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE).
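
As an illustration, here is a minimal sketch of the atomic get-and-clear
usage. scan_and_clear() is a hypothetical helper built only from the uapi
names added in this series; error handling is trimmed, and ioctl() returns
the number of filled page_region entries or -1 with errno set:

#include <sys/ioctl.h>
#include <linux/fs.h>	/* PAGEMAP_SCAN, struct pagemap_scan_arg */

static long scan_and_clear(int pagemap_fd, void *start, unsigned long len,
			   struct page_region *vec, unsigned long vec_len)
{
	struct pagemap_scan_arg arg = {
		.start = (unsigned long)start,
		.len = len,
		.vec = (unsigned long)vec,
		.vec_len = vec_len,
		.flags = PAGEMAP_WP_ENGAGE,	/* write-protect again */
		.required_mask = PAGE_IS_WRITTEN,
		.return_mask = PAGE_IS_WRITTEN,
	};

	return ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);
}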

It is possible to find and clear soft-dirty pages entirely in userspace,
but it isn't efficient:
- mprotect plus a SIGSEGV handler for bookkeeping
- synchronous userfaultfd wp with a fault handler for bookkeeping

Some benchmarks can be seen here [1]. This series adds features that
weren't present earlier:
- There was no atomic way to get the soft-dirty/written-to status of
pages and clear it in the kernel.
- The pages which have been written to could not be found accurately.
(The kernel's soft-dirty PTE bit plus the soft-dirty VMA bit report
more soft-dirty pages than there actually are.)

Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have a use case where we need to track the soft-dirty PTE bit of
only specific pages on demand. We need this tracking-and-clear mechanism
for a region of memory while the process is running, to emulate the
getWriteWatch() syscall of Windows.

*(Since the v7 patch series, UFFD is used instead of the soft-dirty
feature to find pages which have been written to)*:
We stop using the soft-dirty flags for finding which pages have been
written to. That mechanism is too fragile, as it reports more soft-dirty
pages than there actually are. There is no interest in correcting it
[2][3], as this is how the feature was written years ago and its
behaviour shouldn't be changed now. Peter Xu suggested using the async
version of UFFD WP [4] as it is inherently based on the PTEs.

So in this patch series, I've added a new mode to UFFD which is an
asynchronous version of the write protect. When this variant of UFFD WP
is used, page faults are resolved automatically by the kernel. The pages
which have been written to can be found by reading the pagemap file
(!PM_UFFD_WP). This feature can be used to find which pages have been
written to from the time the pages were write-protected. It works just
like the soft-dirty flag, without showing any extra pages which aren't
soft-dirty in reality.
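
For example, that check can be sketched like this, assuming the pagemap
entry format documented in Documentation/admin-guide/mm/pagemap.rst
(where bit 57 carries the uffd-wp state); the helper name is illustrative:

#include <stdint.h>
#include <unistd.h>

#define PM_UFFD_WP	(1ULL << 57)	/* pagemap.rst: pte is uffd-wp */

/* Written-to pages are those whose uffd-wp bit got cleared by a write. */
static int page_was_written(int pagemap_fd, unsigned long vaddr,
			    unsigned long page_size)
{
	uint64_t entry;

	if (pread(pagemap_fd, &entry, sizeof(entry),
		  (vaddr / page_size) * sizeof(entry)) != sizeof(entry))
		return -1;
	return !(entry & PM_UFFD_WP);
}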

The information about whether a page is file-mapped, present or swapped
is required for the CRIU project [5][6]. The addition of the required
mask, anyof mask, excluded mask and return mask is also required for the
CRIU project [5].

The IOCTL returns the addresses of the pages which match the specified
masks. The page addresses are returned in struct page_region in a compact
form. The optional max_pages supports the use case where the user only
wants a specific number of pages: there is no need to find all the pages
of interest in the range when max_pages is specified, and the IOCTL
returns once the maximum number of pages has been found. If max_pages is
specified, it must be equal to or greater than vec_size. This restriction
is needed to handle the worst case in which one page_region contains info
about only one page and cannot be compacted. This is needed to emulate
the Windows getWriteWatch() syscall.
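
To make the compact form concrete, a sketch of consuming the returned
vector (continuing the scan_and_clear() sketch above, so vec and the
return value ret are assumptions from there; needs <stdio.h>):

	/* Each page_region describes 'len' pages starting at 'start'. */
	for (long i = 0; i < ret; i++)
		printf("%#llx..%#llx: bitmap %#llx\n",
		       (unsigned long long)vec[i].start,
		       (unsigned long long)(vec[i].start +
					    vec[i].len * page_size),
		       (unsigned long long)vec[i].bitmap);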

The patch series includes a detailed selftest which can be used as an
example for uffd async wp and the PAGEMAP_SCAN IOCTL. It shows the
interface usage as well.

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/all/[email protected]
[3] https://lore.kernel.org/all/[email protected]
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/[email protected]/
[6] https://lore.kernel.org/all/[email protected]/

Regards,
Muhammad Usama Anjum

Muhammad Usama Anjum (6):
userfaultfd: Add UFFD WP Async support
userfaultfd: update documentation to describe UFFD_FEATURE_WP_ASYNC
fs/proc/task_mmu: Implement IOCTL to get and/or clear the info about
PTEs
tools headers UAPI: Update linux/fs.h with the kernel sources
mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
selftests: vm: add pagemap ioctl tests

Documentation/admin-guide/mm/pagemap.rst | 24 +
Documentation/admin-guide/mm/userfaultfd.rst | 7 +
fs/proc/task_mmu.c | 290 ++++++
fs/userfaultfd.c | 20 +-
include/linux/userfaultfd_k.h | 11 +
include/uapi/linux/fs.h | 50 ++
include/uapi/linux/userfaultfd.h | 10 +-
mm/memory.c | 23 +-
tools/include/uapi/linux/fs.h | 50 ++
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 5 +-
tools/testing/selftests/vm/pagemap_ioctl.c | 881 +++++++++++++++++++
12 files changed, 1364 insertions(+), 8 deletions(-)
create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c

--
2.30.2



2023-02-02 11:30:43

by Muhammad Usama Anjum

Subject: [PATCH v10 1/6] userfaultfd: Add UFFD WP Async support

Add a new WP Async mode (UFFD_FEATURE_WP_ASYNC) which resolves the page
faults on its own. It can be used to track which pages have been
written to from the time the pages were write-protected. It is a very
efficient way to track changes, as uffd is by nature pte/pmd based.

Synchronous UFFD WP sends the page faults to userspace, where the pages
which have been written to can be tracked, but it is not efficient. This
is why this asynchronous version is being added. After setting WP Async,
the pages which have been written to can be found in the pagemap file, or
the information can be obtained from the PAGEMAP_SCAN IOCTL.
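
For instance, the feature handshake can be sketched as below (a hedged
sketch mirroring the selftest added later in this series; the helper name
is illustrative and error handling is trimmed):

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>

static int init_uffd_wp_async(void)
{
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_WP_ASYNC,
	};
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
		return -1;
	/* Faults on write-protected ranges now get resolved in-kernel. */
	return uffd;
}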

Suggested-by: Peter Xu <[email protected]>
Signed-off-by: Muhammad Usama Anjum <[email protected]>
---
Changes in v10:
- Build fix
- Update comments and add error condition to return error from uffd
register if hugetlb pages are present when wp async flag is set

Changes in v9:
- Correct the fault resolution with code contributed by Peter

Changes in v7:
- Remove UFFDIO_WRITEPROTECT_MODE_ASYNC_WP and add UFFD_FEATURE_WP_ASYNC
- Handle automatic page fault resolution in better way (thanks to Peter)

---
fs/userfaultfd.c | 20 ++++++++++++++++++--
include/linux/userfaultfd_k.h | 11 +++++++++++
include/uapi/linux/userfaultfd.h | 10 +++++++++-
mm/memory.c | 23 ++++++++++++++++++++---
4 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 15a5bf765d43..422f2530c63e 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1422,10 +1422,15 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
goto out_unlock;

/*
- * Note vmas containing huge pages
+ * Note vmas containing huge pages. Hugetlb isn't supported
+ * with UFFD_FEATURE_WP_ASYNC.
*/
- if (is_vm_hugetlb_page(cur))
+ if (is_vm_hugetlb_page(cur)) {
+ if (ctx->features & UFFD_FEATURE_WP_ASYNC)
+ goto out_unlock;
+
basic_ioctls = true;
+ }

found = true;
}
@@ -1867,6 +1872,10 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;

+ /* The unprotection is not supported if in async WP mode */
+ if (!mode_wp && (ctx->features & UFFD_FEATURE_WP_ASYNC))
+ return -EINVAL;
+
if (mode_wp && mode_dontwake)
return -EINVAL;

@@ -1950,6 +1959,13 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
return ret;
}

+int userfaultfd_wp_async(struct vm_area_struct *vma)
+{
+ struct userfaultfd_ctx *ctx = vma->vm_userfaultfd_ctx.ctx;
+
+ return (ctx && (ctx->features & UFFD_FEATURE_WP_ASYNC));
+}
+
static inline unsigned int uffd_ctx_features(__u64 user_features)
{
/*
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 9df0b9a762cc..38c92c2beb16 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -179,6 +179,7 @@ extern int userfaultfd_unmap_prep(struct mm_struct *mm, unsigned long start,
unsigned long end, struct list_head *uf);
extern void userfaultfd_unmap_complete(struct mm_struct *mm,
struct list_head *uf);
+extern int userfaultfd_wp_async(struct vm_area_struct *vma);

#else /* CONFIG_USERFAULTFD */

@@ -189,6 +190,11 @@ static inline vm_fault_t handle_userfault(struct vm_fault *vmf,
return VM_FAULT_SIGBUS;
}

+static inline void uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *vma,
+ unsigned long start, unsigned long len, bool enable_wp)
+{
+}
+
static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
struct vm_userfaultfd_ctx vm_ctx)
{
@@ -274,6 +280,11 @@ static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
return false;
}

+static inline int userfaultfd_wp_async(struct vm_area_struct *vma)
+{
+ return false;
+}
+
#endif /* CONFIG_USERFAULTFD */

static inline bool pte_marker_entry_uffd_wp(swp_entry_t entry)
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 005e5e306266..30a6f32cf564 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -38,7 +38,8 @@
UFFD_FEATURE_MINOR_HUGETLBFS | \
UFFD_FEATURE_MINOR_SHMEM | \
UFFD_FEATURE_EXACT_ADDRESS | \
- UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
+ UFFD_FEATURE_WP_HUGETLBFS_SHMEM | \
+ UFFD_FEATURE_WP_ASYNC)
#define UFFD_API_IOCTLS \
((__u64)1 << _UFFDIO_REGISTER | \
(__u64)1 << _UFFDIO_UNREGISTER | \
@@ -203,6 +204,12 @@ struct uffdio_api {
*
* UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd
* write-protection mode is supported on both shmem and hugetlbfs.
+ *
+ * UFFD_FEATURE_WP_ASYNC indicates that userfaultfd write-protection
+ * asynchronous mode is supported in which the write fault is automatically
+ * resolved and the write-protection is unset. It only supports anon and shmem
+ * (hugetlb isn't supported). It only takes effect when a vma is registered
+ * with write-protection mode. Otherwise the flag is ignored.
*/
#define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
#define UFFD_FEATURE_EVENT_FORK (1<<1)
@@ -217,6 +224,7 @@ struct uffdio_api {
#define UFFD_FEATURE_MINOR_SHMEM (1<<10)
#define UFFD_FEATURE_EXACT_ADDRESS (1<<11)
#define UFFD_FEATURE_WP_HUGETLBFS_SHMEM (1<<12)
+#define UFFD_FEATURE_WP_ASYNC (1<<13)
__u64 features;

__u64 ioctls;
diff --git a/mm/memory.c b/mm/memory.c
index 4000e9f017e0..75331fbf7cb4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3351,8 +3351,21 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)

if (likely(!unshare)) {
if (userfaultfd_pte_wp(vma, *vmf->pte)) {
- pte_unmap_unlock(vmf->pte, vmf->ptl);
- return handle_userfault(vmf, VM_UFFD_WP);
+ if (userfaultfd_wp_async(vma)) {
+ /*
+ * Nothing needed (cache flush, TLB invalidations,
+ * etc.) because we're only removing the uffd-wp bit,
+ * which is completely invisible to the user.
+ */
+ pte_t pte = pte_clear_uffd_wp(*vmf->pte);
+
+ set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
+ /* Update this to be prepared for following up CoW handling */
+ vmf->orig_pte = pte;
+ } else {
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ return handle_userfault(vmf, VM_UFFD_WP);
+ }
}

/*
@@ -4812,8 +4825,11 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)

if (vma_is_anonymous(vmf->vma)) {
if (likely(!unshare) &&
- userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
+ userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) {
+ if (userfaultfd_wp_async(vmf->vma))
+ goto split;
return handle_userfault(vmf, VM_UFFD_WP);
+ }
return do_huge_pmd_wp_page(vmf);
}

@@ -4825,6 +4841,7 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
}
}

+split:
/* COW or write-notify handled on pte level: split pmd. */
__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);

--
2.30.2


2023-02-02 11:30:55

by Muhammad Usama Anjum

Subject: [PATCH v10 2/6] userfaultfd: update documentation to describe UFFD_FEATURE_WP_ASYNC

Explain the difference UFFD_FEATURE_WP_ASYNC makes to the
write-protection (UFFDIO_WRITEPROTECT_MODE_WP) mode.

Signed-off-by: Muhammad Usama Anjum <[email protected]>
---
Documentation/admin-guide/mm/userfaultfd.rst | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 83f31919ebb3..4747e7bd5b26 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -221,6 +221,13 @@ former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter
you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was
used.

+If ``UFFD_FEATURE_WP_ASYNC`` is set while calling the ``UFFDIO_API`` ioctl,
+the behaviour of ``UFFDIO_WRITEPROTECT_MODE_WP`` changes such that faults
+for anon and shmem are resolved automatically by the kernel instead of a
+message being sent to the userfaultfd. Hugetlb isn't supported. The
+``pagemap`` file can be read to find which pages have the ``PM_UFFD_WP``
+flag set, which means they are write-protected.
+
QEMU/KVM
========

--
2.30.2


2023-02-02 11:30:59

by Muhammad Usama Anjum

Subject: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or clear the info about PTEs

This IOCTL, PAGEMAP_SCAN, on the pagemap file can be used to get and/or
clear info about page table entries. The following operations are
supported by this ioctl:
- Get the information if the pages have been written to (PAGE_IS_WRITTEN),
file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
(PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
pages have been written to.
- Find pages which have been written to and write-protect them again
(atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE).

To get information about which pages have been written to and/or to
write-protect pages, the following must be performed first, in order:
- The userfaultfd file descriptor is created with the userfaultfd syscall.
- The UFFD_FEATURE_WP_ASYNC feature is set through the UFFDIO_API IOCTL.
- The memory range is registered with UFFDIO_REGISTER_MODE_WP mode
through the UFFDIO_REGISTER IOCTL.
Then any part of the registered memory, or the whole region, can be
write-protected using the UFFDIO_WRITEPROTECT IOCTL or the PAGEMAP_SCAN
IOCTL, as sketched below.
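
A hedged sketch of that sequence follows. Here mem/mem_size stand for an
already mmap'ed anon or shmem range (assumptions for illustration); the
includes and error handling are elided, mirroring the selftest in this
series:

	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API,
				  .features = UFFD_FEATURE_WP_ASYNC };
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)mem, .len = mem_size },
		.mode = UFFDIO_REGISTER_MODE_WP,
	};
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)mem, .len = mem_size },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
	};

	ioctl(uffd, UFFDIO_API, &api);		/* enable async WP */
	ioctl(uffd, UFFDIO_REGISTER, &reg);	/* register the range */
	/* Engage protection; PAGEMAP_WP_ENGAGE would work as well. */
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);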

struct pagemap_scan_arg is used as the argument of the IOCTL. In this
struct:
- The range is specified through start and len.
- The output buffer of struct page_region array and its size are
specified as vec and vec_len.
- The optional maximum number of requested pages is specified in
max_pages.
- The flags are specified in the flags field. PAGEMAP_WP_ENGAGE is the
only flag added at this time.
- The masks are specified in required_mask, anyof_mask, excluded_mask
and return_mask.

This IOCTL can be extended to get information about more PTE bits. It
doesn't support hugetlb at the moment, so no information about hugetlb
pages can be obtained. This patch has evolved from a basic patch from
Gabriel Krisman Bertazi.

Signed-off-by: Muhammad Usama Anjum <[email protected]>
---
Changes in v10:
- move changes in tools/include/uapi/linux/fs.h to separate patch
- update commit message

Change in v8:
- Correct is_pte_uffd_wp()
- Improve readability and error checks
- Remove some un-needed code

Changes in v7:
- Rebase on top of latest next
- Fix some corner cases
- Base soft-dirty on the uffd wp async
- Update the terminologies
- Optimize the memory usage inside the ioctl

Changes in v6:
- Rename variables and update comments
- Make IOCTL independent of soft_dirty config
- Change masks and bitmap type to _u64
- Improve code quality

Changes in v5:
- Remove tlb flushing even for clear operation

Changes in v4:
- Update the interface and implementation

Changes in v3:
- Tighten the user-kernel interface by using explicit types and add more
error checking

Changes in v2:
- Convert the interface from syscall to ioctl
- Remove pidfd support as it doesn't make sense in ioctl
---
fs/proc/task_mmu.c | 290 ++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/fs.h | 50 +++++++
2 files changed, 340 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e35a0398db63..c6bde19d63d9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -19,6 +19,7 @@
#include <linux/shmem_fs.h>
#include <linux/uaccess.h>
#include <linux/pkeys.h>
+#include <linux/minmax.h>

#include <asm/elf.h>
#include <asm/tlb.h>
@@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
}
#endif

+static inline bool is_pte_uffd_wp(pte_t pte)
+{
+ if ((pte_present(pte) && pte_uffd_wp(pte)) ||
+ (pte_swp_uffd_wp_any(pte)))
+ return true;
+ return false;
+}
+
+static inline bool is_pmd_uffd_wp(pmd_t pmd)
+{
+ if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
+ (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
+ return true;
+ return false;
+}
+
#if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmdp)
@@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file)
return 0;
}

+#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \
+ PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define IS_WP_ENGAGE_OP(a) (a->flags & PAGEMAP_WP_ENGAGE)
+#define IS_GET_OP(a) (a->vec)
+#define HAS_NO_SPACE(p) (p->max_pages && (p->found_pages == p->max_pages))
+
+#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap) \
+ (wt | file << 1 | present << 2 | swap << 3)
+#define IS_WT_REQUIRED(a) \
+ ((a->required_mask & PAGE_IS_WRITTEN) || \
+ (a->anyof_mask & PAGE_IS_WRITTEN))
+
+struct pagemap_scan_private {
+ struct page_region *vec;
+ struct page_region prev;
+ unsigned long vec_len, vec_index;
+ unsigned int max_pages, found_pages, flags;
+ unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
+};
+
+static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk)
+{
+ struct pagemap_scan_private *p = walk->private;
+ struct vm_area_struct *vma = walk->vma;
+
+ if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma))
+ return -EPERM;
+ if (vma->vm_flags & VM_PFNMAP)
+ return 1;
+ return 0;
+}
+
+static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
+ struct pagemap_scan_private *p, unsigned long addr,
+ unsigned int len)
+{
+ unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
+ bool cpy = true;
+ struct page_region *prev = &p->prev;
+
+ if (HAS_NO_SPACE(p))
+ return -ENOSPC;
+
+ if (p->max_pages && p->found_pages + len >= p->max_pages)
+ len = p->max_pages - p->found_pages;
+ if (!len)
+ return -EINVAL;
+
+ if (p->required_mask)
+ cpy = ((p->required_mask & cur) == p->required_mask);
+ if (cpy && p->anyof_mask)
+ cpy = (p->anyof_mask & cur);
+ if (cpy && p->excluded_mask)
+ cpy = !(p->excluded_mask & cur);
+ bitmap = cur & p->return_mask;
+ if (cpy && bitmap) {
+ if ((prev->len) && (prev->bitmap == bitmap) &&
+ (prev->start + prev->len * PAGE_SIZE == addr)) {
+ prev->len += len;
+ p->found_pages += len;
+ } else if (p->vec_index < p->vec_len) {
+ if (prev->len) {
+ memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
+ p->vec_index++;
+ }
+ prev->start = addr;
+ prev->len = len;
+ prev->bitmap = bitmap;
+ p->found_pages += len;
+ } else {
+ return -ENOSPC;
+ }
+ }
+ return 0;
+}
+
+static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
+ unsigned long *vec_index)
+{
+ struct page_region *prev = &p->prev;
+
+ if (prev->len) {
+ if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
+ return -EFAULT;
+ p->vec_index++;
+ (*vec_index)++;
+ prev->len = 0;
+ }
+ return 0;
+}
+
+static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
+ unsigned long end, struct mm_walk *walk)
+{
+ struct pagemap_scan_private *p = walk->private;
+ struct vm_area_struct *vma = walk->vma;
+ unsigned long addr = end;
+ spinlock_t *ptl;
+ int ret = 0;
+ pte_t *pte;
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ ptl = pmd_trans_huge_lock(pmd, vma);
+ if (ptl) {
+ bool pmd_wt;
+
+ pmd_wt = !is_pmd_uffd_wp(*pmd);
+ /*
+ * Break huge page into small pages if the operation to be performed
+ * is on a portion of the huge page.
+ */
+ if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
+ spin_unlock(ptl);
+ split_huge_pmd(vma, pmd, start);
+ goto process_smaller_pages;
+ }
+ if (IS_GET_OP(p))
+ ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd),
+ is_swap_pmd(*pmd), p, start,
+ (end - start)/PAGE_SIZE);
+ spin_unlock(ptl);
+ if (!ret) {
+ if (pmd_wt && IS_WP_ENGAGE_OP(p))
+ uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true);
+ }
+ return ret;
+ }
+process_smaller_pages:
+ if (pmd_trans_unstable(pmd))
+ return 0;
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+ pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
+ if (IS_GET_OP(p)) {
+ for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
+ ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file,
+ pte_present(*pte), is_swap_pte(*pte), p, addr, 1);
+ if (ret)
+ break;
+ }
+ }
+ pte_unmap_unlock(pte - 1, ptl);
+ if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start))
+ uffd_wp_range(walk->mm, vma, start, addr - start, true);
+
+ cond_resched();
+ return ret;
+}
+
+static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth,
+ struct mm_walk *walk)
+{
+ struct pagemap_scan_private *p = walk->private;
+ struct vm_area_struct *vma = walk->vma;
+ int ret = 0;
+
+ if (vma)
+ ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
+ (end - addr)/PAGE_SIZE);
+ return ret;
+}
+
+/* No hugetlb support is present. */
+static const struct mm_walk_ops pagemap_scan_ops = {
+ .test_walk = pagemap_scan_test_walk,
+ .pmd_entry = pagemap_scan_pmd_entry,
+ .pte_hole = pagemap_scan_pte_hole,
+};
+
+static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg)
+{
+ unsigned long empty_slots, vec_index = 0;
+ unsigned long __user start, end;
+ unsigned long __start, __end;
+ struct page_region __user *vec;
+ struct pagemap_scan_private p;
+ int ret = 0;
+
+ start = (unsigned long)untagged_addr(arg->start);
+ vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
+
+ /* Validate memory ranges */
+ if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
+ return -EINVAL;
+ if (IS_GET_OP(arg) && ((arg->vec_len == 0) ||
+ (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region)))))
+ return -EINVAL;
+
+ /* Detect illegal flags and masks */
+ if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) ||
+ (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) ||
+ (arg->return_mask & ~PAGEMAP_BITS_ALL))
+ return -EINVAL;
+ if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) ||
+ !arg->return_mask))
+ return -EINVAL;
+ /* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */
+ if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) ||
+ (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS)))
+ return -EINVAL;
+
+ end = start + arg->len;
+ p.max_pages = arg->max_pages;
+ p.found_pages = 0;
+ p.flags = arg->flags;
+ p.required_mask = arg->required_mask;
+ p.anyof_mask = arg->anyof_mask;
+ p.excluded_mask = arg->excluded_mask;
+ p.return_mask = arg->return_mask;
+ p.prev.len = 0;
+ p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
+
+ if (IS_GET_OP(arg)) {
+ p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), GFP_KERNEL);
+ if (!p.vec)
+ return -ENOMEM;
+ } else {
+ p.vec = NULL;
+ }
+ __start = __end = start;
+ while (!ret && __end < end) {
+ p.vec_index = 0;
+ empty_slots = arg->vec_len - vec_index;
+ if (p.vec_len > empty_slots)
+ p.vec_len = empty_slots;
+
+ __end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
+ if (__end > end)
+ __end = end;
+
+ mmap_read_lock(mm);
+ ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p);
+ mmap_read_unlock(mm);
+ if (!(!ret || ret == -ENOSPC))
+ goto free_data;
+
+ __start = __end;
+ if (IS_GET_OP(arg) && p.vec_index) {
+ if (copy_to_user(&vec[vec_index], p.vec,
+ p.vec_index * sizeof(struct page_region))) {
+ ret = -EFAULT;
+ goto free_data;
+ }
+ vec_index += p.vec_index;
+ }
+ }
+ ret = export_prev_to_out(&p, vec, &vec_index);
+ if (!ret)
+ ret = vec_index;
+free_data:
+ if (IS_GET_OP(arg))
+ kfree(p.vec);
+
+ return ret;
+}
+
+static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+ struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg __user *)arg;
+ struct mm_struct *mm = file->private_data;
+ struct pagemap_scan_arg argument;
+
+ if (cmd == PAGEMAP_SCAN) {
+ if (copy_from_user(&argument, uarg, sizeof(struct pagemap_scan_arg)))
+ return -EFAULT;
+ return do_pagemap_cmd(mm, &argument);
+ }
+ return -EINVAL;
+}
+
const struct file_operations proc_pagemap_operations = {
.llseek = mem_lseek, /* borrow this */
.read = pagemap_read,
.open = pagemap_open,
.release = pagemap_release,
+ .unlocked_ioctl = pagemap_scan_ioctl,
+ .compat_ioctl = pagemap_scan_ioctl,
};
#endif /* CONFIG_PROC_PAGE_MONITOR */

diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..1ae9a8684b48 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t;
#define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
RWF_APPEND)

+/* Pagemap ioctl */
+#define PAGEMAP_SCAN _IOWR('f', 16, struct pagemap_scan_arg)
+
+/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_arg */
+#define PAGE_IS_WRITTEN (1 << 0)
+#define PAGE_IS_FILE (1 << 1)
+#define PAGE_IS_PRESENT (1 << 2)
+#define PAGE_IS_SWAPPED (1 << 3)
+
+/*
+ * struct page_region - Page region with bitmap flags
+ * @start: Start of the region
+ * @len: Length of the region
+ * @bitmap: Bits set for the region
+ */
+struct page_region {
+ __u64 start;
+ __u64 len;
+ __u64 bitmap;
+};
+
+/*
+ * struct pagemap_scan_arg - Pagemap ioctl argument
+ * @start: Starting address of the region
+ * @len: Length of the region (All the pages in this length are included)
+ * @vec: Address of page_region struct array for output
+ * @vec_len: Length of the page_region struct array
+ * @max_pages: Optional max return pages
+ * @flags: Flags for the IOCTL
+ * @required_mask: Required mask - All of these bits have to be set in the PTE
+ * @anyof_mask: Any mask - Any of these bits are set in the PTE
+ * @excluded_mask: Exclude mask - None of these bits are set in the PTE
+ * @return_mask: Bits that are to be reported in page_region
+ */
+struct pagemap_scan_arg {
+ __u64 start;
+ __u64 len;
+ __u64 vec;
+ __u64 vec_len;
+ __u32 max_pages;
+ __u32 flags;
+ __u64 required_mask;
+ __u64 anyof_mask;
+ __u64 excluded_mask;
+ __u64 return_mask;
+};
+
+/* Special flags */
+#define PAGEMAP_WP_ENGAGE (1 << 0)
+
#endif /* _UAPI_LINUX_FS_H */
--
2.30.2


2023-02-02 11:31:15

by Muhammad Usama Anjum

Subject: [PATCH v10 4/6] tools headers UAPI: Update linux/fs.h with the kernel sources

A new IOCTL and new macros have been added in the kernel sources. Update
the tools header file as well.

Signed-off-by: Muhammad Usama Anjum <[email protected]>
---
tools/include/uapi/linux/fs.h | 50 +++++++++++++++++++++++++++++++++++
1 file changed, 50 insertions(+)

diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h
index b7b56871029c..1ae9a8684b48 100644
--- a/tools/include/uapi/linux/fs.h
+++ b/tools/include/uapi/linux/fs.h
@@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t;
#define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
RWF_APPEND)

+/* Pagemap ioctl */
+#define PAGEMAP_SCAN _IOWR('f', 16, struct pagemap_scan_arg)
+
+/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_arg */
+#define PAGE_IS_WRITTEN (1 << 0)
+#define PAGE_IS_FILE (1 << 1)
+#define PAGE_IS_PRESENT (1 << 2)
+#define PAGE_IS_SWAPPED (1 << 3)
+
+/*
+ * struct page_region - Page region with bitmap flags
+ * @start: Start of the region
+ * @len: Length of the region
+ * @bitmap: Bits set for the region
+ */
+struct page_region {
+ __u64 start;
+ __u64 len;
+ __u64 bitmap;
+};
+
+/*
+ * struct pagemap_scan_arg - Pagemap ioctl argument
+ * @start: Starting address of the region
+ * @len: Length of the region (All the pages in this length are included)
+ * @vec: Address of page_region struct array for output
+ * @vec_len: Length of the page_region struct array
+ * @max_pages: Optional max return pages
+ * @flags: Flags for the IOCTL
+ * @required_mask: Required mask - All of these bits have to be set in the PTE
+ * @anyof_mask: Any mask - Any of these bits are set in the PTE
+ * @excluded_mask: Exclude mask - None of these bits are set in the PTE
+ * @return_mask: Bits that are to be reported in page_region
+ */
+struct pagemap_scan_arg {
+ __u64 start;
+ __u64 len;
+ __u64 vec;
+ __u64 vec_len;
+ __u32 max_pages;
+ __u32 flags;
+ __u64 required_mask;
+ __u64 anyof_mask;
+ __u64 excluded_mask;
+ __u64 return_mask;
+};
+
+/* Special flags */
+#define PAGEMAP_WP_ENGAGE (1 << 0)
+
#endif /* _UAPI_LINUX_FS_H */
--
2.30.2


2023-02-02 11:31:27

by Muhammad Usama Anjum

Subject: [PATCH v10 5/6] mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL

Add an explanation and the method to use write-protection and written-to
detection on a memory range.

Signed-off-by: Muhammad Usama Anjum <[email protected]>
---
Documentation/admin-guide/mm/pagemap.rst | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index 6e2e416af783..1cb2189e9a0d 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -230,3 +230,27 @@ Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
always 12 at most architectures). Since Linux 3.11 their meaning changes
after first clear of soft-dirty bits. Since Linux 4.2 they are used for
flags unconditionally.
+
+Pagemap Scan IOCTL
+==================
+
+The ``PAGEMAP_SCAN`` IOCTL on the pagemap file can be used to get and/or clear
+the info about page table entries. The following operations are supported in
+this IOCTL:
+- Get the information if the pages have been written-to (``PAGE_IS_WRITTEN``),
+ file mapped (``PAGE_IS_FILE``), present (``PAGE_IS_PRESENT``) or swapped
+ (``PAGE_IS_SWAPPED``).
+- Write-protect the pages (``PAGEMAP_WP_ENGAGE``) to start finding which
+ pages have been written-to.
+- Find pages which have been written-to and write protect the pages
+ (atomic ``PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE``)
+
+To get information about which pages have been written to and/or write-protect
+pages, the following must be performed first, in order:
+ 1. The userfaultfd file descriptor is created with ``userfaultfd`` syscall.
+ 2. The ``UFFD_FEATURE_WP_ASYNC`` feature is set by ``UFFDIO_API`` IOCTL.
+ 3. The memory range is registered with ``UFFDIO_REGISTER_MODE_WP`` mode
+ through ``UFFDIO_REGISTER`` IOCTL.
+Then any part of the registered memory or the whole memory region can be
+write-protected using the ``UFFDIO_WRITEPROTECT`` IOCTL or the ``PAGEMAP_SCAN``
+IOCTL.
--
2.30.2


2023-02-02 11:31:49

by Muhammad Usama Anjum

Subject: [PATCH v10 6/6] selftests: vm: add pagemap ioctl tests

Add pagemap ioctl tests. Add several different types of tests to judge
the correctness of the interface.

Signed-off-by: Muhammad Usama Anjum <[email protected]>
---
Changes in v7:
- Add and update all test cases

Changes in v6:
- Rename variables

Changes in v4:
- Updated all the tests to conform to new IOCTL

Changes in v3:
- Add another test to do sanity of flags

Changes in v2:
- Update the tests to use the ioctl interface instead of syscall

TAP version 13
1..54
ok 1 sanity_tests_sd wrong flag specified
ok 2 sanity_tests_sd wrong mask specified
ok 3 sanity_tests_sd wrong return mask specified
ok 4 sanity_tests_sd mixture of correct and wrong flag
ok 5 sanity_tests_sd Clear area with larger vec size
ok 6 sanity_tests_sd Repeated pattern of dirty and non-dirty pages
ok 7 sanity_tests_sd Repeated pattern of dirty and non-dirty pages in parts
ok 8 sanity_tests_sd Two regions
ok 9 Page testing: all new pages must be soft dirty
ok 10 Page testing: all pages must not be soft dirty
ok 11 Page testing: all pages dirty other than first and the last one
ok 12 Page testing: only middle page dirty
ok 13 Page testing: only two middle pages dirty
ok 14 Page testing: only get 2 dirty pages and clear them as well
ok 15 Page testing: Range clear only
ok 16 Large Page testing: all new pages must be soft dirty
ok 17 Large Page testing: all pages must not be soft dirty
ok 18 Large Page testing: all pages dirty other than first and the last one
ok 19 Large Page testing: only middle page dirty
ok 20 Large Page testing: only two middle pages dirty
ok 21 Large Page testing: only get 2 dirty pages and clear them as well
ok 22 Large Page testing: Range clear only
ok 23 Huge page testing: all new pages must be soft dirty
ok 24 Huge page testing: all pages must not be soft dirty
ok 25 Huge page testing: all pages dirty other than first and the last one
ok 26 Huge page testing: only middle page dirty
ok 27 Huge page testing: only two middle pages dirty
ok 28 Huge page testing: only get 2 dirty pages and clear them as well
ok 29 Huge page testing: Range clear only
ok 30 hpage_unit_tests all new huge page must be dirty
ok 31 hpage_unit_tests all the huge page must not be dirty
ok 32 hpage_unit_tests all the huge page must be dirty and clear
ok 33 hpage_unit_tests only middle page dirty
ok 34 hpage_unit_tests clear first half of huge page
ok 35 hpage_unit_tests clear first half of huge page with limited buffer
ok 36 hpage_unit_tests clear second half huge page
ok 37 Test test_simple
ok 38 mprotect_tests Both pages dirty
ok 39 mprotect_tests Both pages are not soft dirty
ok 40 mprotect_tests Both pages dirty after remap and mprotect
ok 41 mprotect_tests Clear and make the pages dirty
ok 42 sanity_tests clear op can only be specified with PAGE_IS_WRITTEN
ok 43 sanity_tests required_mask specified
ok 44 sanity_tests anyof_mask specified
ok 45 sanity_tests excluded_mask specified
ok 46 sanity_tests required_mask and anyof_mask specified
ok 47 sanity_tests Get sd and present pages with anyof_mask
ok 48 sanity_tests Get all the pages with required_mask
ok 49 sanity_tests Get sd and present pages with required_mask and anyof_mask
ok 50 sanity_tests Don't get sd pages
ok 51 sanity_tests Don't get present pages
ok 52 sanity_tests Find dirty present pages with return mask
ok 53 sanity_tests Memory mapped file
ok 54 unmapped_region_tests Get status of pages
# Totals: pass:54 fail:0 xfail:0 xpass:0 skip:0 error:0
---
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/Makefile | 5 +-
tools/testing/selftests/vm/pagemap_ioctl.c | 881 +++++++++++++++++++++
3 files changed, 885 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/vm/pagemap_ioctl.c

diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 1f8c36a9fa10..9e7e0ae26582 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -17,6 +17,7 @@ mremap_dontunmap
mremap_test
on-fault-limit
transhuge-stress
+pagemap_ioctl
protection_keys
protection_keys_32
protection_keys_64
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 89c14e41bd43..54c074440a1b 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -24,9 +24,8 @@ MACHINE ?= $(shell echo $(uname_M) | sed -e 's/aarch64.*/arm64/' -e 's/ppc64.*/p
# things despite using incorrect values such as an *occasionally* incomplete
# LDLIBS.
MAKEFLAGS += --no-builtin-rules
-
CFLAGS = -Wall -I $(top_srcdir) -I $(top_srcdir)/usr/include $(EXTRA_CFLAGS) $(KHDR_INCLUDES)
-LDLIBS = -lrt -lpthread
+LDLIBS = -lrt -lpthread -lm
TEST_GEN_FILES = cow
TEST_GEN_FILES += compaction_test
TEST_GEN_FILES += gup_test
@@ -52,6 +51,7 @@ TEST_GEN_FILES += on-fault-limit
TEST_GEN_FILES += thuge-gen
TEST_GEN_FILES += transhuge-stress
TEST_GEN_FILES += userfaultfd
+TEST_GEN_PROGS += pagemap_ioctl
TEST_GEN_PROGS += soft-dirty
TEST_GEN_PROGS += split_huge_page_test
TEST_GEN_FILES += ksm_tests
@@ -103,6 +103,7 @@ $(OUTPUT)/cow: vm_util.c
$(OUTPUT)/khugepaged: vm_util.c
$(OUTPUT)/ksm_functional_tests: vm_util.c
$(OUTPUT)/madv_populate: vm_util.c
+$(OUTPUT)/pagemap_ioctl: vm_util.c
$(OUTPUT)/soft-dirty: vm_util.c
$(OUTPUT)/split_huge_page_test: vm_util.c
$(OUTPUT)/userfaultfd: vm_util.c
diff --git a/tools/testing/selftests/vm/pagemap_ioctl.c b/tools/testing/selftests/vm/pagemap_ioctl.c
new file mode 100644
index 000000000000..09b676a626d8
--- /dev/null
+++ b/tools/testing/selftests/vm/pagemap_ioctl.c
@@ -0,0 +1,881 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdio.h>
+#include <fcntl.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <errno.h>
+#include <malloc.h>
+#include "vm_util.h"
+#include "../kselftest.h"
+#include <linux/types.h>
+#include <linux/userfaultfd.h>
+#include <linux/fs.h>
+#include <sys/ioctl.h>
+#include <sys/stat.h>
+#include <math.h>
+#include <asm/unistd.h>
+
+#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \
+ PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | \
+ PAGE_IS_SWAPPED)
+
+#define TEST_ITERATIONS 10
+#define PAGEMAP "/proc/self/pagemap"
+int pagemap_fd;
+int uffd;
+int page_size;
+int hpage_size;
+
+static long pagemap_ioctl(void *start, int len, void *vec, int vec_len, int flag,
+ int max_pages, long required_mask, long anyof_mask, long excluded_mask,
+ long return_mask)
+{
+ struct pagemap_scan_arg arg;
+ int ret;
+
+ arg.start = (uintptr_t)start;
+ arg.len = len;
+ arg.vec = (uintptr_t)vec;
+ arg.vec_len = vec_len;
+ arg.flags = flag;
+ arg.max_pages = max_pages;
+ arg.required_mask = required_mask;
+ arg.anyof_mask = anyof_mask;
+ arg.excluded_mask = excluded_mask;
+ arg.return_mask = return_mask;
+
+ ret = ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);
+
+ return ret;
+}
+
+int init_uffd(void)
+{
+ struct uffdio_api uffdio_api;
+
+ uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+ if (uffd == -1)
+ ksft_exit_fail_msg("uffd syscall failed\n");
+
+ uffdio_api.api = UFFD_API;
+ uffdio_api.features = UFFD_FEATURE_WP_ASYNC;
+ if (ioctl(uffd, UFFDIO_API, &uffdio_api))
+ ksft_exit_fail_msg("UFFDIO_API\n");
+
+ if (uffdio_api.api != UFFD_API)
+ ksft_exit_fail_msg("UFFDIO_API error %llu\n", uffdio_api.api);
+
+ return 0;
+}
+
+int wp_init(void *lpBaseAddress, int dwRegionSize)
+{
+ struct uffdio_register uffdio_register;
+ struct uffdio_writeprotect wp;
+
+ /* TODO: can it be avoided? Write protect doesn't engage on the pages if they aren't
+ * present already. The pages can be made present by writing to them.
+ */
+ memset(lpBaseAddress, -1, dwRegionSize);
+
+ uffdio_register.range.start = (unsigned long)lpBaseAddress;
+ uffdio_register.range.len = dwRegionSize;
+ uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
+ if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1)
+ ksft_exit_fail_msg("ioctl(UFFDIO_REGISTER)\n");
+
+ if (!(uffdio_register.ioctls & UFFDIO_WRITEPROTECT))
+ ksft_exit_fail_msg("ioctl set is incorrect\n");
+
+ if (rand() % 2) {
+ wp.range.start = (unsigned long)lpBaseAddress;
+ wp.range.len = dwRegionSize;
+ wp.mode = UFFDIO_WRITEPROTECT_MODE_WP;
+
+ if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) == -1)
+ ksft_exit_fail_msg("ioctl(UFFDIO_WRITEPROTECT)\n");
+ } else {
+ if (pagemap_ioctl(lpBaseAddress, dwRegionSize, NULL, 0, PAGEMAP_WP_ENGAGE, 0,
+ 0, 0, 0, 0) < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", 1, errno, strerror(errno));
+ }
+ return 0;
+}
+
+int wp_free(void *lpBaseAddress, int dwRegionSize)
+{
+ struct uffdio_register uffdio_register;
+
+ uffdio_register.range.start = (unsigned long)lpBaseAddress;
+ uffdio_register.range.len = dwRegionSize;
+ uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
+ if (ioctl(uffd, UFFDIO_UNREGISTER, &uffdio_register.range))
+ ksft_exit_fail_msg("ioctl unregister failure\n");
+ return 0;
+}
+
+int clear_softdirty_wp(void *lpBaseAddress, int dwRegionSize)
+{
+ struct uffdio_writeprotect wp;
+
+ if (rand() % 2) {
+ wp.range.start = (unsigned long)lpBaseAddress;
+ wp.range.len = dwRegionSize;
+ wp.mode = UFFDIO_WRITEPROTECT_MODE_WP;
+
+ if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) == -1)
+ ksft_exit_fail_msg("ioctl(UFFDIO_WRITEPROTECT)\n");
+ } else {
+ if (pagemap_ioctl(lpBaseAddress, dwRegionSize, NULL, 0, PAGEMAP_WP_ENGAGE, 0,
+ 0, 0, 0, 0) < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", 1, errno, strerror(errno));
+ }
+ return 0;
+}
+
+int sanity_tests_sd(void)
+{
+ char *mem, *m[2];
+ int mem_size, vec_size, ret, ret2, ret3, i, num_pages = 10;
+ struct page_region *vec;
+
+ vec_size = 100;
+ mem_size = num_pages * page_size;
+
+ vec = malloc(sizeof(struct page_region) * vec_size);
+ if (!vec)
+ ksft_exit_fail_msg("error nomem\n");
+ mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+ if (mem == MAP_FAILED)
+ ksft_exit_fail_msg("error nomem\n");
+
+ wp_init(mem, mem_size);
+
+ /* 1. wrong operation */
+ ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, -1,
+ 0, PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN) < 0,
+ "%s wrong flag specified\n", __func__);
+ ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, 8,
+ 0, 0x1111, 0, 0, PAGE_IS_WRITTEN) < 0,
+ "%s wrong mask specified\n", __func__);
+ ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, 0,
+ 0, PAGE_IS_WRITTEN, 0, 0, 0x1000) < 0,
+ "%s wrong return mask specified\n", __func__);
+ ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size,
+ PAGEMAP_WP_ENGAGE | 0x32,
+ 0, PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN) < 0,
+ "%s mixture of correct and wrong flag\n", __func__);
+
+ /* 2. Clear area with larger vec size */
+ ret = pagemap_ioctl(mem, mem_size, vec, vec_size, PAGEMAP_WP_ENGAGE, 0,
+ PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN);
+ ksft_test_result(ret >= 0, "%s Clear area with larger vec size\n", __func__);
+
+ /* 3. Repeated pattern of dirty and non-dirty pages */
+ for (i = 0; i < mem_size; i += 2 * page_size)
+ mem[i]++;
+
+ ret = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0, PAGE_IS_WRITTEN, 0, 0,
+ PAGE_IS_WRITTEN);
+ if (ret < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+ ksft_test_result(ret == mem_size/(page_size * 2),
+ "%s Repeated pattern of dirty and non-dirty pages\n", __func__);
+
+ /* 4. Repeated pattern of dirty and non-dirty pages in parts */
+ ret = pagemap_ioctl(mem, mem_size, vec, num_pages/5, PAGEMAP_WP_ENGAGE,
+ num_pages/2 - 2, PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN);
+ if (ret < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+ ret2 = pagemap_ioctl(mem, mem_size, vec, 2, 0, 0, PAGE_IS_WRITTEN, 0, 0,
+ PAGE_IS_WRITTEN);
+ if (ret2 < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret2, errno, strerror(errno));
+
+ ret3 = pagemap_ioctl(mem, mem_size, vec, num_pages/2, 0, 0, PAGE_IS_WRITTEN, 0, 0,
+ PAGE_IS_WRITTEN);
+ if (ret3 < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret3, errno, strerror(errno));
+
+ ksft_test_result((ret + ret3) == num_pages/2 && ret2 == 2,
+ "%s Repeated pattern of dirty and non-dirty pages in parts\n", __func__);
+
+ wp_free(mem, mem_size);
+ munmap(mem, mem_size);
+
+ /* 5. Two regions */
+ m[0] = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+ if (m[0] == MAP_FAILED)
+ ksft_exit_fail_msg("error nomem\n");
+ m[1] = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+ if (m[1] == MAP_FAILED)
+ ksft_exit_fail_msg("error nomem\n");
+
+ wp_init(m[0], mem_size);
+ wp_init(m[1], mem_size);
+
+ memset(m[0], 'a', mem_size);
+ memset(m[1], 'b', mem_size);
+
+ ret = pagemap_ioctl(m[0], mem_size, NULL, 0, PAGEMAP_WP_ENGAGE, 0,
+ 0, 0, 0, 0);
+ if (ret < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+ ret = pagemap_ioctl(m[1], mem_size, vec, 1, 0, 0, PAGE_IS_WRITTEN, 0, 0,
+ PAGE_IS_WRITTEN);
+ if (ret < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+ ksft_test_result(ret == 1 && vec[0].len == mem_size/page_size,
+ "%s Two regions\n", __func__);
+
+ wp_free(m[0], mem_size);
+ wp_free(m[1], mem_size);
+ munmap(m[0], mem_size);
+ munmap(m[1], mem_size);
+
+ free(vec);
+ return 0;
+}
+
+int base_tests(char *prefix, char *mem, int mem_size, int skip)
+{
+ int vec_size, ret, dirty, dirty2;
+ struct page_region *vec, *vec2;
+
+ if (skip) {
+ ksft_test_result_skip("%s all new pages must be soft dirty\n", prefix);
+ ksft_test_result_skip("%s all pages must not be soft dirty\n", prefix);
+ ksft_test_result_skip("%s all pages dirty other than first and the last one\n",
+ prefix);
+ ksft_test_result_skip("%s only middle page dirty\n", prefix);
+ ksft_test_result_skip("%s only two middle pages dirty\n", prefix);
+ ksft_test_result_skip("%s only get 2 dirty pages and clear them as well\n", prefix);
+ ksft_test_result_skip("%s Range clear only\n", prefix);
+ return 0;
+ }
+
+ vec_size = mem_size/page_size;
+ vec = malloc(sizeof(struct page_region) * vec_size);
+ vec2 = malloc(sizeof(struct page_region) * vec_size);
+
+ /* 1. all new pages must not be soft dirty */
+ dirty = pagemap_ioctl(mem, mem_size, vec, 1, PAGEMAP_WP_ENGAGE, vec_size - 2,
+ PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN);
+ if (dirty < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno));
+
+ dirty2 = pagemap_ioctl(mem, mem_size, vec2, 1, PAGEMAP_WP_ENGAGE, 0,
+ PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN);
+ if (dirty2 < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", dirty2, errno, strerror(errno));
+
+ ksft_test_result(dirty == 0 && dirty2 == 0,
+ "%s all new pages must be soft dirty\n", prefix);
+
+ /* 2. all pages must not be soft dirty */
+ dirty = pagemap_ioctl(mem, mem_size, vec, 1, 0, 0, PAGE_IS_WRITTEN, 0, 0,
+ PAGE_IS_WRITTEN);
+ if (dirty < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno));
+
+ ksft_test_result(dirty == 0, "%s all pages must not be soft dirty\n", prefix);
+
+ /* 3. all pages dirty other than first and the last one */
+ memset(mem + page_size, 0, mem_size - (2 * page_size));
+
+ dirty = pagemap_ioctl(mem, mem_size, vec, 1, 0, 0, PAGE_IS_WRITTEN, 0, 0,
+ PAGE_IS_WRITTEN);
+ if (dirty < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno));
+
+ ksft_test_result(dirty == 1 && vec[0].len >= vec_size - 2 && vec[0].len <= vec_size,
+ "%s all pages dirty other than first and the last one\n", prefix);
+
+ /* 4. only middle page dirty */
+ clear_softdirty_wp(mem, mem_size);
+ mem[vec_size/2 * page_size]++;
+
+ dirty = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0, PAGE_IS_WRITTEN, 0, 0,
+ PAGE_IS_WRITTEN);
+ if (dirty < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno));
+
+ ksft_test_result(dirty == 1 && vec[0].len >= 1,
+ "%s only middle page dirty\n", prefix);
+
+ /* 5. only two middle pages dirty and walk over only middle pages */
+ clear_softdirty_wp(mem, mem_size);
+ mem[vec_size/2 * page_size]++;
+ mem[(vec_size/2 + 1) * page_size]++;
+
+ dirty = pagemap_ioctl(&mem[vec_size/2 * page_size], 2 * page_size, vec, 1, 0, 0,
+ PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN);
+ if (dirty < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno));
+
+ ksft_test_result(dirty == 1 && vec[0].start == (uintptr_t)(&mem[vec_size/2 * page_size]) &&
+ vec[0].len == 2,
+ "%s only two middle pages dirty\n", prefix);
+
+ /* 6. only get 2 dirty pages and clear them as well */
+ memset(mem, -1, mem_size);
+
+ /* get and clear second and third pages */
+ ret = pagemap_ioctl(mem + page_size, 2 * page_size, vec, 1, PAGEMAP_WP_ENGAGE,
+ 2, PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN);
+ if (ret < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+ dirty = pagemap_ioctl(mem, mem_size, vec2, vec_size, 0, 0,
+ PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN);
+ if (dirty < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno));
+
+ ksft_test_result(ret == 1 && vec[0].len == 2 &&
+ vec[0].start == (uintptr_t)(mem + page_size) &&
+ dirty == 2 && vec2[0].len == 1 && vec2[0].start == (uintptr_t)mem &&
+ vec2[1].len == vec_size - 3 &&
+ vec2[1].start == (uintptr_t)(mem + 3 * page_size),
+ "%s only get 2 dirty pages and clear them as well\n", prefix);
+
+ /* 7. Range clear only */
+ memset(mem, -1, mem_size);
+
+ dirty = pagemap_ioctl(mem, mem_size, NULL, 0, PAGEMAP_WP_ENGAGE, 0,
+ 0, 0, 0, 0);
+ if (dirty < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno));
+
+ dirty2 = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0,
+ PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN);
+ if (dirty2 < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", dirty2, errno, strerror(errno));
+
+ ksft_test_result(dirty == 0 && dirty2 == 0, "%s Range clear only\n",
+ prefix);
+
+ free(vec);
+ free(vec2);
+ return 0;
+}
+
+void *gethugepage(int map_size)
+{
+ int ret;
+ char *map;
+
+ map = memalign(hpage_size, map_size);
+ if (!map)
+ ksft_exit_fail_msg("memalign failed %d %s\n", errno, strerror(errno));
+
+ ret = madvise(map, map_size, MADV_HUGEPAGE);
+ if (ret)
+ ksft_exit_fail_msg("madvise failed %d %d %s\n", ret, errno, strerror(errno));
+
+ wp_init(map, map_size);
+
+ if (check_huge_anon(map, map_size/hpage_size, hpage_size))
+ return map;
+
+ free(map);
+ return NULL;
+
+}
+
+int hpage_unit_tests(void)
+{
+ char *map;
+ int ret;
+ size_t num_pages = 10;
+ int map_size = hpage_size * num_pages;
+ int vec_size = map_size/page_size;
+ struct page_region *vec, *vec2;
+
+ vec = malloc(sizeof(struct page_region) * vec_size);
+ vec2 = malloc(sizeof(struct page_region) * vec_size);
+ if (!vec || !vec2)
+ ksft_exit_fail_msg("malloc failed\n");
+
+ map = gethugepage(map_size);
+ if (map) {
+ /* 1. all new huge page must not be dirty */
+ ret = pagemap_ioctl(map, map_size, vec, vec_size, PAGEMAP_WP_ENGAGE, 0,
+ PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN);
+ if (ret < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+ ksft_test_result(ret == 0, "%s all new huge page must be dirty\n", __func__);
+
+ /* 2. all the huge page must not be dirty */
+ ret = pagemap_ioctl(map, map_size, vec, vec_size, 0, 0,
+ PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN);
+ if (ret < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+ ksft_test_result(ret == 0, "%s all the huge page must not be dirty\n", __func__);
+
+ /* 3. all the huge page must be dirty and clear dirty as well */
+ memset(map, -1, map_size);
+ ret = pagemap_ioctl(map, map_size, vec, vec_size, PAGEMAP_WP_ENGAGE, 0,
+ PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN);
+ if (ret < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+ ksft_test_result(ret == 1 && vec[0].start == (uintptr_t)map &&
+ vec[0].len == vec_size && vec[0].bitmap == PAGE_IS_WRITTEN,
+ "%s all the huge page must be dirty and clear\n", __func__);
+
+ /* 4. only middle page dirty */
+ wp_free(map, map_size);
+ free(map);
+ map = gethugepage(map_size);
+ wp_init(map, map_size);
+ clear_softdirty_wp(map, map_size);
+ map[vec_size/2 * page_size]++;
+
+ ret = pagemap_ioctl(map, map_size, vec, vec_size, 0, 0,
+ PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN);
+ if (ret < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+ ksft_test_result(ret == 1 && vec[0].len > 0,
+ "%s only middle page dirty\n", __func__);
+
+ wp_free(map, map_size);
+ free(map);
+ } else {
+ ksft_test_result_skip("all new huge page must be dirty\n");
+ ksft_test_result_skip("all the huge page must not be dirty\n");
+ ksft_test_result_skip("all the huge page must be dirty and clear\n");
+ ksft_test_result_skip("only middle page dirty\n");
+ }
+
+ /* 5. clear first half of huge page */
+ map = gethugepage(map_size);
+ if (map) {
+
+ memset(map, 0, map_size);
+
+ ret = pagemap_ioctl(map, map_size/2, NULL, 0, PAGEMAP_WP_ENGAGE, 0,
+ 0, 0, 0, 0);
+ if (ret < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+ ret = pagemap_ioctl(map, map_size, vec, vec_size, 0, 0,
+ PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN);
+ if (ret < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+ ksft_test_result(ret == 1 && vec[0].len == vec_size/2 &&
+ vec[0].start == (uintptr_t)(map + map_size/2),
+ "%s clear first half of huge page\n", __func__);
+ wp_free(map, map_size);
+ free(map);
+ } else {
+ ksft_test_result_skip("clear first half of huge page\n");
+ }
+
+ /* 6. clear first half of huge page with limited buffer */
+ map = gethugepage(map_size);
+ if (map) {
+ memset(map, 0, map_size);
+
+ ret = pagemap_ioctl(map, map_size, vec, vec_size, PAGEMAP_WP_ENGAGE,
+ vec_size/2, PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN);
+ if (ret < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+ ret = pagemap_ioctl(map, map_size, vec, vec_size, 0, 0,
+ PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN);
+ if (ret < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+ ksft_test_result(ret == 1 && vec[0].len == vec_size/2 &&
+ vec[0].start == (uintptr_t)(map + map_size/2),
+ "%s clear first half of huge page with limited buffer\n",
+ __func__);
+ wp_free(map, map_size);
+ free(map);
+ } else {
+ ksft_test_result_skip("clear first half of huge page with limited buffer\n");
+ }
+
+ /* 7. clear second half of huge page */
+ map = gethugepage(map_size);
+ if (map) {
+ memset(map, -1, map_size);
+ ret = pagemap_ioctl(map + map_size/2, map_size/2, NULL, 0, PAGEMAP_WP_ENGAGE,
+ 0, 0, 0, 0, 0);
+ if (ret < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+ ret = pagemap_ioctl(map, map_size, vec, vec_size, 0, 0,
+ PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN);
+ if (ret < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+ ksft_test_result(ret == 1 && vec[0].len == vec_size/2,
+ "%s clear second half huge page\n", __func__);
+ wp_free(map, map_size);
+ free(map);
+ } else {
+ ksft_test_result_skip("clear second half huge page\n");
+ }
+
+ free(vec);
+ free(vec2);
+ return 0;
+}
+
+int unmapped_region_tests(void)
+{
+ void *start = (void *)0x10000000;
+ int dirty, len = 0x00040000;
+ int vec_size = len / page_size;
+ struct page_region *vec = malloc(sizeof(struct page_region) * vec_size);
+
+ /* 1. Get dirty pages */
+ dirty = pagemap_ioctl(start, len, vec, vec_size, 0, 0, PAGEMAP_NON_WRITTEN_BITS, 0, 0,
+ PAGEMAP_NON_WRITTEN_BITS);
+ if (dirty < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", dirty, errno, strerror(errno));
+
+ ksft_test_result(dirty >= 0, "%s Get status of pages\n", __func__);
+
+ free(vec);
+ return 0;
+}
+
+static void test_simple(void)
+{
+ int i;
+ char *map;
+ struct page_region vec;
+
+ map = aligned_alloc(page_size, page_size);
+ if (!map)
+ ksft_exit_fail_msg("aligned_alloc failed\n");
+ wp_init(map, page_size);
+
+ clear_softdirty_wp(map, page_size);
+
+ for (i = 0 ; i < TEST_ITERATIONS; i++) {
+ if (pagemap_ioctl(map, page_size, &vec, 1, 0, 0,
+ PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN) == 1) {
+ ksft_print_msg("dirty bit was 1, but should be 0 (i=%d)\n", i);
+ break;
+ }
+
+ clear_softdirty_wp(map, page_size);
+ /* Write something to the page to get the dirty bit enabled on the page */
+ map[0]++;
+
+ if (pagemap_ioctl(map, page_size, &vec, 1, 0, 0,
+ PAGE_IS_WRITTEN, 0, 0, PAGE_IS_WRITTEN) == 0) {
+ ksft_print_msg("dirty bit was 0, but should be 1 (i=%d)\n", i);
+ break;
+ }
+
+ clear_softdirty_wp(map, page_size);
+ }
+ wp_free(map, page_size);
+ free(map);
+
+ ksft_test_result(i == TEST_ITERATIONS, "Test %s\n", __func__);
+}
+
+int sanity_tests(void)
+{
+ char *mem, *fmem;
+ int mem_size, vec_size, ret;
+ struct page_region *vec;
+
+ /* 1. wrong operation */
+ mem_size = 10 * page_size;
+ vec_size = mem_size / page_size;
+
+ vec = malloc(sizeof(struct page_region) * vec_size);
+ mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+ if (mem == MAP_FAILED || !vec)
+ ksft_exit_fail_msg("error nomem\n");
+
+ wp_init(mem, mem_size);
+
+ ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, PAGEMAP_WP_ENGAGE, 0,
+ PAGEMAP_BITS_ALL, 0, 0, PAGEMAP_BITS_ALL) < 0,
+ "%s clear op can only be specified with PAGE_IS_WRITTEN\n", __func__);
+ ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0,
+ PAGEMAP_BITS_ALL, 0, 0, PAGEMAP_BITS_ALL) >= 0,
+ "%s required_mask specified\n", __func__);
+ ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0,
+ 0, PAGEMAP_BITS_ALL, 0, PAGEMAP_BITS_ALL) >= 0,
+ "%s anyof_mask specified\n", __func__);
+ ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0,
+ 0, 0, PAGEMAP_BITS_ALL, PAGEMAP_BITS_ALL) >= 0,
+ "%s excluded_mask specified\n", __func__);
+ ksft_test_result(pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0,
+ PAGEMAP_BITS_ALL, PAGEMAP_BITS_ALL, 0,
+ PAGEMAP_BITS_ALL) >= 0,
+ "%s required_mask and anyof_mask specified\n", __func__);
+ wp_free(mem, mem_size);
+ munmap(mem, mem_size);
+
+ /* 2. Get sd and present pages with anyof_mask */
+ mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+ if (mem == MAP_FAILED)
+ ksft_exit_fail_msg("error nomem\n");
+ wp_init(mem, mem_size);
+
+ memset(mem, 0, mem_size);
+
+ ret = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0,
+ 0, PAGEMAP_BITS_ALL, 0, PAGEMAP_BITS_ALL);
+ ksft_test_result(ret >= 0 && vec[0].start == (uintptr_t)mem && vec[0].len == vec_size &&
+ vec[0].bitmap == (PAGE_IS_WRITTEN | PAGE_IS_PRESENT),
+ "%s Get sd and present pages with anyof_mask\n", __func__);
+
+ /* 3. Get sd and present pages with required_mask */
+ ret = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0,
+ PAGEMAP_BITS_ALL, 0, 0, PAGEMAP_BITS_ALL);
+ ksft_test_result(ret >= 0 && vec[0].start == (uintptr_t)mem && vec[0].len == vec_size &&
+ vec[0].bitmap == (PAGE_IS_WRITTEN | PAGE_IS_PRESENT),
+ "%s Get all the pages with required_mask\n", __func__);
+
+ /* 4. Get sd and present pages with required_mask and anyof_mask */
+ ret = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0,
+ PAGE_IS_WRITTEN, PAGE_IS_PRESENT, 0, PAGEMAP_BITS_ALL);
+ ksft_test_result(ret >= 0 && vec[0].start == (uintptr_t)mem && vec[0].len == vec_size &&
+ vec[0].bitmap == (PAGE_IS_WRITTEN | PAGE_IS_PRESENT),
+ "%s Get sd and present pages with required_mask and anyof_mask\n",
+ __func__);
+
+ /* 5. Don't get sd pages */
+ ret = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0,
+ 0, 0, PAGE_IS_WRITTEN, PAGEMAP_BITS_ALL);
+ ksft_test_result(ret == 0, "%s Don't get sd pages\n", __func__);
+
+ /* 6. Don't get present pages */
+ ret = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0,
+ 0, 0, PAGE_IS_PRESENT, PAGEMAP_BITS_ALL);
+ ksft_test_result(ret == 0, "%s Don't get present pages\n", __func__);
+
+ wp_free(mem, mem_size);
+ munmap(mem, mem_size);
+
+ /* 7. Find dirty present pages with return mask */
+ mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+ if (mem == MAP_FAILED)
+ ksft_exit_fail_msg("error nomem\n");
+ wp_init(mem, mem_size);
+
+ memset(mem, 0, mem_size);
+
+ ret = pagemap_ioctl(mem, mem_size, vec, vec_size, 0, 0,
+ 0, PAGEMAP_BITS_ALL, 0, PAGE_IS_WRITTEN);
+ ksft_test_result(ret >= 0 && vec[0].start == (uintptr_t)mem && vec[0].len == vec_size &&
+ vec[0].bitmap == PAGE_IS_WRITTEN,
+ "%s Find dirty present pages with return mask\n", __func__);
+ wp_free(mem, mem_size);
+ munmap(mem, mem_size);
+
+ /* 8. Memory mapped file */
+ int fd;
+ struct stat sbuf;
+
+ fd = open(__FILE__, O_RDONLY);
+ if (fd < 0) {
+ ksft_test_result_skip("%s Memory mapped file\n", __func__);
+ goto free_vec_and_return;
+ }
+
+ ret = stat(__FILE__, &sbuf);
+ if (ret < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+ fmem = mmap(NULL, sbuf.st_size, PROT_READ, MAP_SHARED, fd, 0);
+ if (fmem == MAP_FAILED)
+ ksft_exit_fail_msg("error nomem\n");
+
+ ret = pagemap_ioctl(fmem, sbuf.st_size, vec, vec_size, 0, 0,
+ 0, PAGEMAP_NON_WRITTEN_BITS, 0, PAGEMAP_NON_WRITTEN_BITS);
+
+ ksft_test_result(ret >= 0 && vec[0].start == (uintptr_t)fmem &&
+ vec[0].len == ceilf((float)sbuf.st_size/page_size) &&
+ vec[0].bitmap == PAGE_IS_FILE,
+ "%s Memory mapped file\n", __func__);
+
+ munmap(fmem, sbuf.st_size);
+ close(fd);
+
+free_vec_and_return:
+ free(vec);
+ return 0;
+}
+
+int mprotect_tests(void)
+{
+ int ret;
+ char *mem, *mem2;
+ struct page_region vec;
+ int pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+
+ if (pagemap_fd < 0) {
+ fprintf(stderr, "open() failed\n");
+ exit(1);
+ }
+
+ /* 1. Map two pages */
+ mem = mmap(0, 2 * page_size, PROT_READ|PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+ if (mem == MAP_FAILED)
+ ksft_exit_fail_msg("error nomem\n");
+ wp_init(mem, 2 * page_size);
+
+ /* Populate both pages. */
+ memset(mem, 1, 2 * page_size);
+
+ ret = pagemap_ioctl(mem, 2 * page_size, &vec, 1, 0, 0, PAGE_IS_WRITTEN,
+ 0, 0, PAGE_IS_WRITTEN);
+ if (ret < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+ ksft_test_result(ret == 1 && vec.len == 2, "%s Both pages dirty\n", __func__);
+
+ /* 2. Start softdirty tracking. Clear VM_SOFTDIRTY and clear the softdirty PTE bit. */
+ ret = pagemap_ioctl(mem, 2 * page_size, NULL, 0, PAGEMAP_WP_ENGAGE, 0,
+ 0, 0, 0, 0);
+ if (ret < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+ ksft_test_result(pagemap_ioctl(mem, 2 * page_size, &vec, 1, 0, 0, PAGE_IS_WRITTEN,
+ 0, 0, PAGE_IS_WRITTEN) == 0,
+ "%s Both pages are not soft dirty\n", __func__);
+
+ /* 3. Remap the second page */
+ mem2 = mmap(mem + page_size, page_size, PROT_READ|PROT_WRITE,
+ MAP_PRIVATE|MAP_ANON|MAP_FIXED, -1, 0);
+ if (mem2 == MAP_FAILED)
+ ksft_exit_fail_msg("error nomem\n");
+ wp_init(mem2, page_size);
+
+ /* Protect + unprotect. */
+ mprotect(mem, 2 * page_size, PROT_READ);
+ mprotect(mem, 2 * page_size, PROT_READ|PROT_WRITE);
+
+ /* Modify both pages. */
+ memset(mem, 2, 2 * page_size);
+
+ ret = pagemap_ioctl(mem, 2 * page_size, &vec, 1, 0, 0, PAGE_IS_WRITTEN,
+ 0, 0, PAGE_IS_WRITTEN);
+ if (ret < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+ ksft_test_result(ret == 1 && vec.len == 2,
+ "%s Both pages dirty after remap and mprotect\n", __func__);
+
+ /* 4. Clear and make the pages dirty */
+ ret = pagemap_ioctl(mem, 2 * page_size, NULL, 0, PAGEMAP_WP_ENGAGE, 0,
+ 0, 0, 0, 0);
+ if (ret < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+ memset(mem, 'A', 2 * page_size);
+
+ ret = pagemap_ioctl(mem, 2 * page_size, &vec, 1, 0, 0, PAGE_IS_WRITTEN,
+ 0, 0, PAGE_IS_WRITTEN);
+ if (ret < 0)
+ ksft_exit_fail_msg("error %d %d %s\n", ret, errno, strerror(errno));
+
+ ksft_test_result(ret == 1 && vec.len == 2,
+ "%s Clear and make the pages dirty\n", __func__);
+
+ wp_free(mem, 2 * page_size);
+ munmap(mem, 2 * page_size);
+ return 0;
+}
+
+int main(void)
+{
+ char *mem, *map;
+ int mem_size;
+
+ ksft_print_header();
+ ksft_set_plan(54);
+
+ page_size = getpagesize();
+ hpage_size = read_pmd_pagesize();
+
+ pagemap_fd = open(PAGEMAP, O_RDWR);
+ if (pagemap_fd < 0)
+ return -EINVAL;
+
+ if (init_uffd())
+ ksft_exit_fail_msg("uffd init failed\n");
+
+ /*
+ * Soft-dirty PTE bit tests
+ */
+
+ /* 1. Sanity testing */
+ sanity_tests_sd();
+
+ /* 2. Normal page testing */
+ mem_size = 10 * page_size;
+ mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+ if (mem == MAP_FAILED)
+ ksft_exit_fail_msg("error nomem\n");
+ wp_init(mem, mem_size);
+
+ base_tests("Page testing:", mem, mem_size, 0);
+
+ wp_free(mem, mem_size);
+ munmap(mem, mem_size);
+
+ /* 3. Large page testing */
+ mem_size = 512 * 10 * page_size;
+ mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+ if (mem == MAP_FAILED)
+ ksft_exit_fail_msg("error nomem\n");
+ wp_init(mem, mem_size);
+
+ base_tests("Large Page testing:", mem, mem_size, 0);
+
+ wp_free(mem, mem_size);
+ munmap(mem, mem_size);
+
+ /* 4. Huge page testing */
+ map = gethugepage(hpage_size);
+ if (map) {
+ base_tests("Huge page testing:", map, hpage_size, 0);
+ wp_free(map, hpage_size);
+ free(map);
+ } else {
+ base_tests("Huge page testing:", NULL, 0, 1);
+ }
+
+ /* 5. Huge page tests */
+ hpage_unit_tests();
+
+ /* 6. Iterative test */
+ test_simple();
+
+ /* 7. Mprotect test */
+ mprotect_tests();
+
+ /*
+ * Other PTE bit tests
+ */
+
+ /* 1. Sanity testing */
+ sanity_tests();
+
+ /* 2. Unmapped address test */
+ unmapped_region_tests();
+
+ close(pagemap_fd);
+ return ksft_exit_pass();
+}
--
2.30.2


2023-02-08 21:13:13

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v10 1/6] userfaultfd: Add UFFD WP Async support

On Thu, Feb 02, 2023 at 04:29:10PM +0500, Muhammad Usama Anjum wrote:
> Add new WP Async mode (UFFD_FEATURE_WP_ASYNC) which resolves the page
> faults on its own. It can be used to track which pages have been
> written-to from the time the pages were write-protected. It is a very
> efficient way to track the changes as uffd is by nature pte/pmd based.
>
> UFFD synchronous WP sends the page faults to userspace, where the
> pages which have been written-to can be tracked. But it is not efficient.
> This is why this asynchronous version is being added. After setting
> WP Async, the pages which have been written-to can be found in the pagemap
> file, or the information can be obtained from the PAGEMAP_IOCTL.
>
> Suggested-by: Peter Xu <[email protected]>
> Signed-off-by: Muhammad Usama Anjum <[email protected]>
> ---
> Changes in v10:
> - Build fix
> - Update comments and add error condition to return error from uffd
> register if hugetlb pages are present when wp async flag is set
>
> Changes in v9:
> - Correct the fault resolution with code contributed by Peter
>
> Changes in v7:
> - Remove UFFDIO_WRITEPROTECT_MODE_ASYNC_WP and add UFFD_FEATURE_WP_ASYNC
> - Handle automatic page fault resolution in better way (thanks to Peter)
>
> update to wp async
>
> uffd wp async
> ---
> fs/userfaultfd.c | 20 ++++++++++++++++++--
> include/linux/userfaultfd_k.h | 11 +++++++++++
> include/uapi/linux/userfaultfd.h | 10 +++++++++-
> mm/memory.c | 23 ++++++++++++++++++++---
> 4 files changed, 58 insertions(+), 6 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 15a5bf765d43..422f2530c63e 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1422,10 +1422,15 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> goto out_unlock;
>
> /*
> - * Note vmas containing huge pages
> + * Note vmas containing huge pages. Hugetlb isn't supported
> + * with UFFD_FEATURE_WP_ASYNC.
> */

Need to set "ret = -EINVAL;" here. Or..

> - if (is_vm_hugetlb_page(cur))
> + if (is_vm_hugetlb_page(cur)) {
> + if (ctx->features & UFFD_FEATURE_WP_ASYNC)
> + goto out_unlock;

.. it'll return -EBUSY, which does not sound like the right errcode here.
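
Something like this, perhaps (an untested sketch of what I mean):

	if (is_vm_hugetlb_page(cur)) {
		if (ctx->features & UFFD_FEATURE_WP_ASYNC) {
			ret = -EINVAL;
			goto out_unlock;
		}
		basic_ioctls = true;
	}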

> +

Drop this empty line?

> basic_ioctls = true;
> + }
>
> found = true;
> }

Other than that looks good, thanks.

--
Peter Xu


2023-02-08 21:32:19

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v10 2/6] userfaultfd: update documentation to describe UFFD_FEATURE_WP_ASYNC

On Thu, Feb 02, 2023 at 04:29:11PM +0500, Muhammad Usama Anjum wrote:
> Explain the difference created by UFFD_FEATURE_WP_ASYNC to the write
> protection (UFFDIO_WRITEPROTECT_MODE_WP) mode.
>
> Signed-off-by: Muhammad Usama Anjum <[email protected]>
> ---
> Documentation/admin-guide/mm/userfaultfd.rst | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
> index 83f31919ebb3..4747e7bd5b26 100644
> --- a/Documentation/admin-guide/mm/userfaultfd.rst
> +++ b/Documentation/admin-guide/mm/userfaultfd.rst
> @@ -221,6 +221,13 @@ former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter
> you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was
> used.
>
> +If ``UFFD_FEATURE_WP_ASYNC`` is set while calling ``UFFDIO_API`` ioctl, the
> +behaviour of ``UFFDIO_WRITEPROTECT_MODE_WP`` changes such that faults for

UFFDIO_WRITEPROTECT_MODE_WP is only a mode flag of the UFFDIO_WRITEPROTECT
ioctl; what is forbidden in async mode is issuing that ioctl without the
flag set (i.e. manually resolving a fault).

> +anon and shmem are resolved automatically by the kernel instead of sending
> +the message to the userfaultfd. The hugetlb isn't supported. The ``pagemap``
> +file can be read to find which pages have ``PM_UFFD_WP`` flag set which
> +means they are write-protected.

Here's my version. Please feel free to do modifications on top.

If the userfaultfd context (the one that ``UFFDIO_REGISTER_MODE_WP`` was
registered against) has the ``UFFD_FEATURE_WP_ASYNC`` feature enabled, it
will work in async write protection mode. It can be seen as a more
accurate version of soft-dirty tracking, while the results will not
be easily affected by other operations like vma merging.

Compared to the generic mode, the async mode will not generate any
userfaultfd message when the protected memory range is written to. Instead,
the kernel will automatically resolve the page fault immediately by
dropping the uffd-wp bit in the pgtables. The user app can collect the
"written/dirty" status by looking up the uffd-wp bit for the pages of
interest in /proc/<pid>/pagemap.

The page will remain under uffd-wp async tracking until the page is
explicitly write-protected by the ``UFFDIO_WRITEPROTECT`` ioctl with the
mode flag ``UFFDIO_WRITEPROTECT_MODE_WP`` set. Trying to resolve a page
fault that was tracked by async mode userfaultfd-wp is invalid.

Currently ``UFFD_FEATURE_WP_ASYNC`` only supports anonymous and shmem
memory. Hugetlb is not yet supported.
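
For instance, a minimal (untested) sketch of that lookup from userspace,
assuming PM_UFFD_WP is bit 57 of the pagemap entry as defined in
fs/proc/task_mmu.c:

	#include <stdint.h>
	#include <unistd.h>

	#define PM_UFFD_WP (1ULL << 57)

	/* 1: still write-protected (not written to), 0: written, -1: error */
	int page_is_wp(int pagemap_fd, unsigned long vaddr, unsigned long page_size)
	{
		uint64_t ent;
		off_t off = (off_t)(vaddr / page_size) * sizeof(ent);

		if (pread(pagemap_fd, &ent, sizeof(ent), off) != sizeof(ent))
			return -1;
		return !!(ent & PM_UFFD_WP);
	}

(pagemap_fd being an fd opened on /proc/<pid>/pagemap.)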

--
Peter Xu


2023-02-08 22:16:33

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
> the info about page table entries. The following operations are supported
> in this ioctl:
> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
> file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
> (PAGE_IS_SWAPPED).
> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
> pages have been written-to.
> - Find pages which have been written-to and write protect the pages
> (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
>
> To get information about which pages have been written-to and/or write
> protect the pages, following must be performed first in order:
> - The userfaultfd file descriptor is created with userfaultfd syscall.
> - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL.
> - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode
> through UFFDIO_REGISTER IOCTL.
> Then any part of the registered memory or the whole memory region
> can be write protected using the UFFDIO_WRITEPROTECT IOCTL or
> PAGEMAP_SCAN IOCTL.
>
> struct pagemap_scan_args is used as the argument of the IOCTL. In this
> struct:
> - The range is specified through start and len.
> - The output buffer of struct page_region array and size is specified as
> vec and vec_len.
> - The optional maximum requested pages are specified in the max_pages.
> - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE
> is the only added flag at this time.
> - The masks are specified in required_mask, anyof_mask, excluded_mask
> and return_mask.
>
> This IOCTL can be extended to get information about more PTE bits. This
> IOCTL doesn't support hugetlbs at the moment. No information about
> hugetlb can be obtained. This patch has evolved from a basic patch from
> Gabriel Krisman Bertazi.
>
> Signed-off-by: Muhammad Usama Anjum <[email protected]>
> ---
> Changes in v10:
> - move changes in tools/include/uapi/linux/fs.h to separate patch
> - update commit message
>
> Change in v8:
> - Correct is_pte_uffd_wp()
> - Improve readability and error checks
> - Remove some un-needed code
>
> Changes in v7:
> - Rebase on top of latest next
> - Fix some corner cases
> - Base soft-dirty on the uffd wp async
> - Update the terminologies
> - Optimize the memory usage inside the ioctl
>
> Changes in v6:
> - Rename variables and update comments
> - Make IOCTL independent of soft_dirty config
> - Change masks and bitmap type to _u64
> - Improve code quality
>
> Changes in v5:
> - Remove tlb flushing even for clear operation
>
> Changes in v4:
> - Update the interface and implementation
>
> Changes in v3:
> - Tighten the user-kernel interface by using explicit types and add more
> error checking
>
> Changes in v2:
> - Convert the interface from syscall to ioctl
> - Remove pidfd support as it doesn't make sense in ioctl
> ---
> fs/proc/task_mmu.c | 290 ++++++++++++++++++++++++++++++++++++++++
> include/uapi/linux/fs.h | 50 +++++++
> 2 files changed, 340 insertions(+)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index e35a0398db63..c6bde19d63d9 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -19,6 +19,7 @@
> #include <linux/shmem_fs.h>
> #include <linux/uaccess.h>
> #include <linux/pkeys.h>
> +#include <linux/minmax.h>
>
> #include <asm/elf.h>
> #include <asm/tlb.h>
> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
> }
> #endif
>
> +static inline bool is_pte_uffd_wp(pte_t pte)
> +{
> + if ((pte_present(pte) && pte_uffd_wp(pte)) ||
> + (pte_swp_uffd_wp_any(pte)))
> + return true;
> + return false;

Sorry I should have mentioned this earlier: you can directly return here.

return (pte_present(pte) && pte_uffd_wp(pte)) || pte_swp_uffd_wp_any(pte);

> +}
> +
> +static inline bool is_pmd_uffd_wp(pmd_t pmd)
> +{
> + if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
> + (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
> + return true;
> + return false;

Same here.

> +}
> +
> #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
> unsigned long addr, pmd_t *pmdp)
> @@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file)
> return 0;
> }
>
> +#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \
> + PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
> +#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
> +#define IS_WP_ENGAGE_OP(a) (a->flags & PAGEMAP_WP_ENGAGE)
> +#define IS_GET_OP(a) (a->vec)
> +#define HAS_NO_SPACE(p) (p->max_pages && (p->found_pages == p->max_pages))
> +
> +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap) \
> + (wt | file << 1 | present << 2 | swap << 3)
> +#define IS_WT_REQUIRED(a) \
> + ((a->required_mask & PAGE_IS_WRITTEN) || \
> + (a->anyof_mask & PAGE_IS_WRITTEN))
> +
> +struct pagemap_scan_private {
> + struct page_region *vec;
> + struct page_region prev;
> + unsigned long vec_len, vec_index;
> + unsigned int max_pages, found_pages, flags;
> + unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
> +};
> +
> +static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk)
> +{
> + struct pagemap_scan_private *p = walk->private;
> + struct vm_area_struct *vma = walk->vma;
> +
> + if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma))

Should this be:

(IS_WT_REQUIRED(p) && (!userfaultfd_wp(vma) || !userfaultfd_wp_async(vma)))

Instead?

> + return -EPERM;
> + if (vma->vm_flags & VM_PFNMAP)
> + return 1;
> + return 0;
> +}
> +
> +static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
> + struct pagemap_scan_private *p, unsigned long addr,
> + unsigned int len)
> +{
> + unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
> + bool cpy = true;
> + struct page_region *prev = &p->prev;

Nit: switch the above two lines?

> +
> + if (HAS_NO_SPACE(p))
> + return -ENOSPC;
> +
> + if (p->max_pages && p->found_pages + len >= p->max_pages)
> + len = p->max_pages - p->found_pages;

If "p->found_pages + len >= p->max_pages", shouldn't this already return -ENOSPC?

> + if (!len)
> + return -EINVAL;
> +
> + if (p->required_mask)
> + cpy = ((p->required_mask & cur) == p->required_mask);
> + if (cpy && p->anyof_mask)
> + cpy = (p->anyof_mask & cur);
> + if (cpy && p->excluded_mask)
> + cpy = !(p->excluded_mask & cur);
> + bitmap = cur & p->return_mask;
> + if (cpy && bitmap) {
> + if ((prev->len) && (prev->bitmap == bitmap) &&
> + (prev->start + prev->len * PAGE_SIZE == addr)) {
> + prev->len += len;
> + p->found_pages += len;
> + } else if (p->vec_index < p->vec_len) {
> + if (prev->len) {
> + memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
> + p->vec_index++;
> + }

IIUC you can have:

int pagemap_scan_deposit(struct pagemap_scan_private *p)
{
	if (p->vec_index >= p->vec_len)
		return -ENOSPC;

	if (p->prev.len) {
		memcpy(&p->vec[p->vec_index], &p->prev, sizeof(struct page_region));
		p->vec_index++;
	}

	return 0;
}

Then call it here. I think it can also be called below to replace
export_prev_to_out().

> + prev->start = addr;
> + prev->len = len;
> + prev->bitmap = bitmap;
> + p->found_pages += len;
> + } else {
> + return -ENOSPC;
> + }
> + }
> + return 0;
> +}
> +
> +static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
> + unsigned long *vec_index)
> +{
> + struct page_region *prev = &p->prev;
> +
> + if (prev->len) {
> + if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
> + return -EFAULT;
> + p->vec_index++;
> + (*vec_index)++;
> + prev->len = 0;
> + }
> + return 0;
> +}
> +
> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> + unsigned long end, struct mm_walk *walk)
> +{
> + struct pagemap_scan_private *p = walk->private;
> + struct vm_area_struct *vma = walk->vma;
> + unsigned long addr = end;

This assignment is useless?

> + spinlock_t *ptl;
> + int ret = 0;
> + pte_t *pte;
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + ptl = pmd_trans_huge_lock(pmd, vma);
> + if (ptl) {
> + bool pmd_wt;
> +
> + pmd_wt = !is_pmd_uffd_wp(*pmd);
> + /*
> + * Break the huge page into small pages if the operation needs to be
> + * performed on only a portion of the huge page.
> + */
> + if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
> + spin_unlock(ptl);
> + split_huge_pmd(vma, pmd, start);
> + goto process_smaller_pages;
> + }
> + if (IS_GET_OP(p))
> + ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd),
> + is_swap_pmd(*pmd), p, start,
> + (end - start)/PAGE_SIZE);
> + spin_unlock(ptl);
> + if (!ret) {
> + if (pmd_wt && IS_WP_ENGAGE_OP(p))
> + uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true);
> + }
> + return ret;
> + }
> +process_smaller_pages:
> + if (pmd_trans_unstable(pmd))
> + return 0;
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> + pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
> + if (IS_GET_OP(p)) {
> + for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
> + ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file,
> + pte_present(*pte), is_swap_pte(*pte), p, addr, 1);
> + if (ret)
> + break;
> + }
> + }
> + pte_unmap_unlock(pte - 1, ptl);
> + if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start))
> + uffd_wp_range(walk->mm, vma, start, addr - start, true);
> +
> + cond_resched();
> + return ret;
> +}
> +
> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth,
> + struct mm_walk *walk)
> +{
> + struct pagemap_scan_private *p = walk->private;
> + struct vm_area_struct *vma = walk->vma;
> + int ret = 0;
> +
> + if (vma)
> + ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
> + (end - addr)/PAGE_SIZE);
> + return ret;
> +}
> +
> +/* No hugetlb support is present. */
> +static const struct mm_walk_ops pagemap_scan_ops = {
> + .test_walk = pagemap_scan_test_walk,
> + .pmd_entry = pagemap_scan_pmd_entry,
> + .pte_hole = pagemap_scan_pte_hole,
> +};
> +
> +static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg)
> +{
> + unsigned long empty_slots, vec_index = 0;
> + unsigned long __user start, end;
> + unsigned long __start, __end;
> + struct page_region __user *vec;
> + struct pagemap_scan_private p;
> + int ret = 0;
> +
> + start = (unsigned long)untagged_addr(arg->start);
> + vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
> +
> + /* Validate memory ranges */
> + if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
> + return -EINVAL;
> + if (IS_GET_OP(arg) && ((arg->vec_len == 0) ||
> + (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region)))))
> + return -EINVAL;
> +
> + /* Detect illegal flags and masks */
> + if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) ||
> + (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) ||
> + (arg->return_mask & ~PAGEMAP_BITS_ALL))
> + return -EINVAL;
> + if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) ||
> + !arg->return_mask))
> + return -EINVAL;
> + /* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */
> + if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) ||
> + (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS)))
> + return -EINVAL;

I think you said you'd clean this up a bit, but I don't see that it has been.

> +
> + end = start + arg->len;
> + p.max_pages = arg->max_pages;
> + p.found_pages = 0;
> + p.flags = arg->flags;
> + p.required_mask = arg->required_mask;
> + p.anyof_mask = arg->anyof_mask;
> + p.excluded_mask = arg->excluded_mask;
> + p.return_mask = arg->return_mask;
> + p.prev.len = 0;
> + p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
> +
> + if (IS_GET_OP(arg)) {
> + p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), GFP_KERNEL);
> + if (!p.vec)
> + return -ENOMEM;
> + } else {
> + p.vec = NULL;
> + }
> + __start = __end = start;
> + while (!ret && __end < end) {
> + p.vec_index = 0;
> + empty_slots = arg->vec_len - vec_index;
> + if (p.vec_len > empty_slots)
> + p.vec_len = empty_slots;
> +
> + __end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
> + if (__end > end)
> + __end = end;
> +
> + mmap_read_lock(mm);
> + ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p);
> + mmap_read_unlock(mm);
> + if (!(!ret || ret == -ENOSPC))
> + goto free_data;
> +
> + __start = __end;
> + if (IS_GET_OP(arg) && p.vec_index) {
> + if (copy_to_user(&vec[vec_index], p.vec,
> + p.vec_index * sizeof(struct page_region))) {
> + ret = -EFAULT;
> + goto free_data;
> + }
> + vec_index += p.vec_index;
> + }

I think you can move copy_to_user() to outside the loop, then call
pagemap_scan_deposit() before copy_to_user(), then I think you can drop
the ugly export_prev_to_out()..

> + }
> + ret = export_prev_to_out(&p, vec, &vec_index);
> + if (!ret)
> + ret = vec_index;
> +free_data:
> + if (IS_GET_OP(arg))
> + kfree(p.vec);
> +
> + return ret;
> +}
> +
> +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> + struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg __user *)arg;
> + struct mm_struct *mm = file->private_data;
> + struct pagemap_scan_arg argument;
> +
> + if (cmd == PAGEMAP_SCAN) {
> + if (copy_from_user(&argument, uarg, sizeof(struct pagemap_scan_arg)))
> + return -EFAULT;
> + return do_pagemap_cmd(mm, &argument);
> + }
> + return -EINVAL;
> +}
> +
> const struct file_operations proc_pagemap_operations = {
> .llseek = mem_lseek, /* borrow this */
> .read = pagemap_read,
> .open = pagemap_open,
> .release = pagemap_release,
> + .unlocked_ioctl = pagemap_scan_ioctl,
> + .compat_ioctl = pagemap_scan_ioctl,
> };
> #endif /* CONFIG_PROC_PAGE_MONITOR */
>
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index b7b56871029c..1ae9a8684b48 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t;
> #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> RWF_APPEND)
>
> +/* Pagemap ioctl */
> +#define PAGEMAP_SCAN _IOWR('f', 16, struct pagemap_scan_arg)
> +
> +/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */
> +#define PAGE_IS_WRITTEN (1 << 0)
> +#define PAGE_IS_FILE (1 << 1)
> +#define PAGE_IS_PRESENT (1 << 2)
> +#define PAGE_IS_SWAPPED (1 << 3)
> +
> +/*
> + * struct page_region - Page region with bitmap flags
> + * @start: Start of the region
> + * @len: Length of the region
> + * @bitmap: Bits set for the region
> + */
> +struct page_region {
> + __u64 start;
> + __u64 len;
> + __u64 bitmap;
> +};
> +
> +/*
> + * struct pagemap_scan_arg - Pagemap ioctl argument
> + * @start: Starting address of the region
> + * @len: Length of the region (All the pages in this length are included)
> + * @vec: Address of page_region struct array for output
> + * @vec_len: Length of the page_region struct array
> + * @max_pages: Optional max return pages
> + * @flags: Flags for the IOCTL
> + * @required_mask: Required mask - All of these bits have to be set in the PTE
> + * @anyof_mask: Any mask - Any of these bits are set in the PTE
> + * @excluded_mask: Exclude mask - None of these bits are set in the PTE
> + * @return_mask: Bits that are to be reported in page_region
> + */
> +struct pagemap_scan_arg {
> + __u64 start;
> + __u64 len;
> + __u64 vec;
> + __u64 vec_len;
> + __u32 max_pages;
> + __u32 flags;
> + __u64 required_mask;
> + __u64 anyof_mask;
> + __u64 excluded_mask;
> + __u64 return_mask;
> +};
> +
> +/* Special flags */
> +#define PAGEMAP_WP_ENGAGE (1 << 0)
> +
> #endif /* _UAPI_LINUX_FS_H */
> --
> 2.30.2
>

--
Peter Xu


2023-02-08 22:23:01

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
...
Hi Muhammad! I'm really sorry for not commenting on this code; I'm just out of time and I
fear I can't look at it with precise care for at least some time. Hopefully the other CRIU
guys will pick it up. Anyway, here are a few comments from a glance.

> +
> +static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
> + struct pagemap_scan_private *p, unsigned long addr,
> + unsigned int len)
> +{

This is a big function, and usually that's a flag not to declare it as "inline" unless
there is a very serious reason to.

> + unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
> + bool cpy = true;
> + struct page_region *prev = &p->prev;
> +
> + if (HAS_NO_SPACE(p))
> + return -ENOSPC;
> +
> + if (p->max_pages && p->found_pages + len >= p->max_pages)
> + len = p->max_pages - p->found_pages;
> + if (!len)
> + return -EINVAL;
> +
> + if (p->required_mask)
> + cpy = ((p->required_mask & cur) == p->required_mask);
> + if (cpy && p->anyof_mask)
> + cpy = (p->anyof_mask & cur);
> + if (cpy && p->excluded_mask)
> + cpy = !(p->excluded_mask & cur);
> + bitmap = cur & p->return_mask;
> + if (cpy && bitmap) {

You can exit early here simply

	if (!cpy || !bitmap)
		return 0;

saving one tab for the code below.

> + if ((prev->len) && (prev->bitmap == bitmap) &&
> + (prev->start + prev->len * PAGE_SIZE == addr)) {
> + prev->len += len;
> + p->found_pages += len;
> + } else if (p->vec_index < p->vec_len) {
> + if (prev->len) {
> + memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
> + p->vec_index++;
> + }
> + prev->start = addr;
> + prev->len = len;
> + prev->bitmap = bitmap;
> + p->found_pages += len;
> + } else {
> + return -ENOSPC;
> + }
> + }
> + return 0;
> +}
> +
> +static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
> + unsigned long *vec_index)
> +{

No need for inline either.

> + struct page_region *prev = &p->prev;
> +
> + if (prev->len) {
> + if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
> + return -EFAULT;
> + p->vec_index++;
> + (*vec_index)++;
> + prev->len = 0;
> + }
> + return 0;
> +}
> +
> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> + unsigned long end, struct mm_walk *walk)
> +{

Same, no need for inline. I have a few more comments in mind; I'll try to
collect them tomorrow.

2023-02-09 15:28:15

by Muhammad Usama Anjum

[permalink] [raw]
Subject: Re: [PATCH v10 1/6] userfaultfd: Add UFFD WP Async support

Hi Peter,

Thank you so much for reviewing!

On 2/9/23 2:12 AM, Peter Xu wrote:
> On Thu, Feb 02, 2023 at 04:29:10PM +0500, Muhammad Usama Anjum wrote:
>> Add new WP Async mode (UFFD_FEATURE_WP_ASYNC) which resolves the page
>> faults on its own. It can be used to track which pages have been
>> written-to from the time the pages were write-protected. It is a very
>> efficient way to track the changes as uffd is by nature pte/pmd based.
>>
>> UFFD synchronous WP sends the page faults to userspace, where the
>> pages which have been written-to can be tracked. But it is not efficient.
>> This is why this asynchronous version is being added. After setting
>> WP Async, the pages which have been written-to can be found in the pagemap
>> file, or the information can be obtained from the PAGEMAP_IOCTL.
>>
>> Suggested-by: Peter Xu <[email protected]>
>> Signed-off-by: Muhammad Usama Anjum <[email protected]>
>> ---
>> Changes in v10:
>> - Build fix
>> - Update comments and add error condition to return error from uffd
>> register if hugetlb pages are present when wp async flag is set
>>
>> Changes in v9:
>> - Correct the fault resolution with code contributed by Peter
>>
>> Changes in v7:
>> - Remove UFFDIO_WRITEPROTECT_MODE_ASYNC_WP and add UFFD_FEATURE_WP_ASYNC
>> - Handle automatic page fault resolution in better way (thanks to Peter)
>>
>> update to wp async
>>
>> uffd wp async
>> ---
>> fs/userfaultfd.c | 20 ++++++++++++++++++--
>> include/linux/userfaultfd_k.h | 11 +++++++++++
>> include/uapi/linux/userfaultfd.h | 10 +++++++++-
>> mm/memory.c | 23 ++++++++++++++++++++---
>> 4 files changed, 58 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
>> index 15a5bf765d43..422f2530c63e 100644
>> --- a/fs/userfaultfd.c
>> +++ b/fs/userfaultfd.c
>> @@ -1422,10 +1422,15 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>> goto out_unlock;
>>
>> /*
>> - * Note vmas containing huge pages
>> + * Note vmas containing huge pages. Hugetlb isn't supported
>> + * with UFFD_FEATURE_WP_ASYNC.
>> */
>
> Need to set "ret = -EINVAL;" here. Or..
Will fix in next version.

>
>> - if (is_vm_hugetlb_page(cur))
>> + if (is_vm_hugetlb_page(cur)) {
>> + if (ctx->features & UFFD_FEATURE_WP_ASYNC)
>> + goto out_unlock;
>
> .. it'll return -EBUSY, which does not sound like the right errcode here.
>
>> +
>
> Drop this empty line?
>
>> basic_ioctls = true;
>> + }
>>
>> found = true;
>> }
>
> Other than that looks good, thanks.
Thank you so much! This wouldn't have been possible without your help.

>

--
BR,
Muhammad Usama Anjum

2023-02-09 15:48:26

by Muhammad Usama Anjum

[permalink] [raw]
Subject: Re: [PATCH v10 2/6] userfaultfd: update documentation to describe UFFD_FEATURE_WP_ASYNC

On 2/9/23 2:31 AM, Peter Xu wrote:
> On Thu, Feb 02, 2023 at 04:29:11PM +0500, Muhammad Usama Anjum wrote:
>> Explain the difference created by UFFD_FEATURE_WP_ASYNC to the write
>> protection (UFFDIO_WRITEPROTECT_MODE_WP) mode.
>>
>> Signed-off-by: Muhammad Usama Anjum <[email protected]>
>> ---
>> Documentation/admin-guide/mm/userfaultfd.rst | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
>> index 83f31919ebb3..4747e7bd5b26 100644
>> --- a/Documentation/admin-guide/mm/userfaultfd.rst
>> +++ b/Documentation/admin-guide/mm/userfaultfd.rst
>> @@ -221,6 +221,13 @@ former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter
>> you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was
>> used.
>>
>> +If ``UFFD_FEATURE_WP_ASYNC`` is set while calling ``UFFDIO_API`` ioctl, the
>> +behaviour of ``UFFDIO_WRITEPROTECT_MODE_WP`` changes such that faults for
>
> UFFDIO_WRITEPROTECT_MODE_WP is only a mode flag of the UFFDIO_WRITEPROTECT
> ioctl; what is forbidden in async mode is issuing that ioctl without the
> flag set (i.e. manually resolving a fault).
>
>> +anon and shmem are resolved automatically by the kernel instead of sending
>> +the message to the userfaultfd. The hugetlb isn't supported. The ``pagemap``
>> +file can be read to find which pages have ``PM_UFFD_WP`` flag set which
>> +means they are write-protected.
>
> Here's my version. Please feel free to do modifications on top.
>
> If the userfaultfd context (the one that ``UFFDIO_REGISTER_MODE_WP`` was
> registered against) has the ``UFFD_FEATURE_WP_ASYNC`` feature enabled, it
> will work in async write protection mode. It can be seen as a more
> accurate version of soft-dirty tracking, while the results will not
> be easily affected by other operations like vma merging.
>
> Compared to the generic mode, the async mode will not generate any
> userfaultfd message when the protected memory range is written to. Instead,
> the kernel will automatically resolve the page fault immediately by
> dropping the uffd-wp bit in the pgtables. The user app can collect the
> "written/dirty" status by looking up the uffd-wp bit for the pages of
> interest in /proc/<pid>/pagemap.
>
> The page will remain under uffd-wp async tracking until the page is
> explicitly write-protected by the ``UFFDIO_WRITEPROTECT`` ioctl with the
> mode flag ``UFFDIO_WRITEPROTECT_MODE_WP`` set. Trying to resolve a page
> fault that was tracked by async mode userfaultfd-wp is invalid.
>
> Currently ``UFFD_FEATURE_WP_ASYNC`` only supports anonymous and shmem
> memory. Hugetlb is not yet supported.
>
Your version will replace the documentation. I'll add a Suggested-by tag as well.
Thanks.

--
BR,
Muhammad Usama Anjum

2023-02-09 19:27:42

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v10 5/6] mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL

On Thu, Feb 02, 2023 at 04:29:14PM +0500, Muhammad Usama Anjum wrote:
> Add some explanation and method to use write-protection and written-to
> on memory range.
>
> Signed-off-by: Muhammad Usama Anjum <[email protected]>
> ---
> Documentation/admin-guide/mm/pagemap.rst | 24 ++++++++++++++++++++++++
> 1 file changed, 24 insertions(+)
>
> diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
> index 6e2e416af783..1cb2189e9a0d 100644
> --- a/Documentation/admin-guide/mm/pagemap.rst
> +++ b/Documentation/admin-guide/mm/pagemap.rst
> @@ -230,3 +230,27 @@ Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
> always 12 at most architectures). Since Linux 3.11 their meaning changes
> after first clear of soft-dirty bits. Since Linux 4.2 they are used for
> flags unconditionally.
> +
> +Pagemap Scan IOCTL
> +==================
> +
> +The ``PAGEMAP_SCAN`` IOCTL on the pagemap file can be used to get and/or clear
> +the info about page table entries. The following operations are supported in
> +this IOCTL:
> +- Get the information if the pages have been written-to (``PAGE_IS_WRITTEN``),
> + file mapped (``PAGE_IS_FILE``), present (``PAGE_IS_PRESENT``) or swapped
> + (``PAGE_IS_SWAPPED``).
> +- Write-protect the pages (``PAGEMAP_WP_ENGAGE``) to start finding which
> + pages have been written-to.
> +- Find pages which have been written-to and write protect the pages
> + (atomic ``PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE``)

Could we extend this section a bit more? Some points for reference:

- The new struct you introduced, definitions of each of the fields, and
generic use cases for each of the field/ops.

- It'll be nice to list the OPs the new interface supports (GET,
WP_ENGAGE, GET+WP_ENGAGE).

- When should people use this rather than the old pagemap interface?
What are the major problems to solve / what's the major difference?
(Maybe nice to reference the Windows API too here)
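
Even a tiny usage snippet would help, e.g. (hypothetical, based on the uapi
in patch 3/6):

	struct pagemap_scan_arg arg = {
		.start = (uintptr_t)mem,
		.len = mem_size,
		.vec = (uintptr_t)vec,
		.vec_len = vec_len,
		.flags = PAGEMAP_WP_ENGAGE,	/* atomic GET + WP_ENGAGE */
		.required_mask = PAGE_IS_WRITTEN,
		.return_mask = PAGE_IS_WRITTEN,
	};
	/* on success, returns the number of page_region entries filled in vec */
	int n = ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);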

> +
> +To get information about which pages have been written-to and/or write protect
> +the pages, following must be performed first in order:
> + 1. The userfaultfd file descriptor is created with ``userfaultfd`` syscall.
> + 2. The ``UFFD_FEATURE_WP_ASYNC`` feature is set by ``UFFDIO_API`` IOCTL.
> + 3. The memory range is registered with ``UFFDIO_REGISTER_MODE_WP`` mode
> + through ``UFFDIO_REGISTER`` IOCTL.
> +Then any part of the registered memory or the whole memory region can be
> +write protected using the ``UFFDIO_WRITEPROTECT`` IOCTL or ``PAGEMAP_SCAN``
> +IOCTL.

This part looks good.
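
For reference, that sequence in code would be roughly (an untested sketch;
UFFD_FEATURE_WP_ASYNC is the new feature from this series):

	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	struct uffdio_api api = { .api = UFFD_API,
				  .features = UFFD_FEATURE_WP_ASYNC };
	ioctl(uffd, UFFDIO_API, &api);

	struct uffdio_register reg = {
		.range = { .start = (uintptr_t)mem, .len = mem_size },
		.mode = UFFDIO_REGISTER_MODE_WP,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* write-protect the whole range to start the tracking */
	struct uffdio_writeprotect wp = {
		.range = { .start = (uintptr_t)mem, .len = mem_size },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
	};
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);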

Thanks,

--
Peter Xu


2023-02-13 08:19:31

by Muhammad Usama Anjum

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

Hi Cyrill,

Thank you for your time and review.

On 2/9/23 3:22 AM, Cyrill Gorcunov wrote:
> On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
> ...
> Hi Muhammad! I'm really sorry for not commenting on this code; I'm just out of time and I
> fear I can't look at it with precise care for at least some time. Hopefully the other CRIU
> guys will pick it up. Anyway, here are a few comments from a glance.
>
>> +
>> +static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
>> + struct pagemap_scan_private *p, unsigned long addr,
>> + unsigned int len)
>> +{
>
> This is a big function, and usually that's a flag not to declare it as "inline" unless
> there is a very serious reason to.
I'll remove all these inlines in the next revision.

>
>> + unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
>> + bool cpy = true;
>> + struct page_region *prev = &p->prev;
>> +
>> + if (HAS_NO_SPACE(p))
>> + return -ENOSPC;
>> +
>> + if (p->max_pages && p->found_pages + len >= p->max_pages)
>> + len = p->max_pages - p->found_pages;
>> + if (!len)
>> + return -EINVAL;
>> +
>> + if (p->required_mask)
>> + cpy = ((p->required_mask & cur) == p->required_mask);
>> + if (cpy && p->anyof_mask)
>> + cpy = (p->anyof_mask & cur);
>> + if (cpy && p->excluded_mask)
>> + cpy = !(p->excluded_mask & cur);
>> + bitmap = cur & p->return_mask;
>> + if (cpy && bitmap) {
>
> You can exit early here simply
>
> 	if (!cpy || !bitmap)
> 		return 0;
I'm avoiding an extra return here.

>
> saving one tab for the code below.
>
>> + if ((prev->len) && (prev->bitmap == bitmap) &&
>> + (prev->start + prev->len * PAGE_SIZE == addr)) {
>> + prev->len += len;
>> + p->found_pages += len;
>> + } else if (p->vec_index < p->vec_len) {
>> + if (prev->len) {
>> + memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
>> + p->vec_index++;
>> + }
>> + prev->start = addr;
>> + prev->len = len;
>> + prev->bitmap = bitmap;
>> + p->found_pages += len;
>> + } else {
>> + return -ENOSPC;
>> + }
>> + }
>> + return 0;
>> +}
>> +
>> +static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
>> + unsigned long *vec_index)
>> +{
>
> No need for inline either.
>
>> + struct page_region *prev = &p->prev;
>> +
>> + if (prev->len) {
>> + if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
>> + return -EFAULT;
>> + p->vec_index++;
>> + (*vec_index)++;
>> + prev->len = 0;
>> + }
>> + return 0;
>> +}
>> +
>> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>> + unsigned long end, struct mm_walk *walk)
>> +{
>
> Same, no need for inline. I have a few more comments in mind; I'll try to
> collect them tomorrow.
Your review would be much appreciated.

--
BR,
Muhammad Usama Anjum

2023-02-13 10:46:45

by Muhammad Usama Anjum

[permalink] [raw]
Subject: Re: [PATCH v10 5/6] mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL

On 2/10/23 12:26 AM, Peter Xu wrote:
> On Thu, Feb 02, 2023 at 04:29:14PM +0500, Muhammad Usama Anjum wrote:
>> Add some explanation and method to use write-protection and written-to
>> on memory range.
>>
>> Signed-off-by: Muhammad Usama Anjum <[email protected]>
>> ---
>> Documentation/admin-guide/mm/pagemap.rst | 24 ++++++++++++++++++++++++
>> 1 file changed, 24 insertions(+)
>>
>> diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
>> index 6e2e416af783..1cb2189e9a0d 100644
>> --- a/Documentation/admin-guide/mm/pagemap.rst
>> +++ b/Documentation/admin-guide/mm/pagemap.rst
>> @@ -230,3 +230,27 @@ Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
>> always 12 at most architectures). Since Linux 3.11 their meaning changes
>> after first clear of soft-dirty bits. Since Linux 4.2 they are used for
>> flags unconditionally.
>> +
>> +Pagemap Scan IOCTL
>> +==================
>> +
>> +The ``PAGEMAP_SCAN`` IOCTL on the pagemap file can be used to get and/or clear
>> +the info about page table entries. The following operations are supported in
>> +this IOCTL:
>> +- Get the information if the pages have been written-to (``PAGE_IS_WRITTEN``),
>> + file mapped (``PAGE_IS_FILE``), present (``PAGE_IS_PRESENT``) or swapped
>> + (``PAGE_IS_SWAPPED``).
>> +- Write-protect the pages (``PAGEMAP_WP_ENGAGE``) to start finding which
>> + pages have been written-to.
>> +- Find pages which have been written-to and write protect the pages
>> + (atomic ``PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE``)
>
> Could we extend this section a bit more? Some points for reference:
>
> - The new struct you introduced, definitions of each of the fields, and
> generic use cases for each of the field/ops.
>
> - It'll be nice to list the OPs the new interface supports (GET,
> WP_ENGAGE, GET+WP_ENGAGE).
>
> - When should people use this rather than the old pagemap interface?
> What are the major problems to solve / what's the major difference?
> (Maybe nice to reference the Windows API too here)
I'll update the documentation.

>
>> +
>> +To get information about which pages have been written-to and/or write protect
>> +the pages, following must be performed first in order:
>> + 1. The userfaultfd file descriptor is created with ``userfaultfd`` syscall.
>> + 2. The ``UFFD_FEATURE_WP_ASYNC`` feature is set by ``UFFDIO_API`` IOCTL.
>> + 3. The memory range is registered with ``UFFDIO_REGISTER_MODE_WP`` mode
>> + through ``UFFDIO_REGISTER`` IOCTL.
>> +Then any part of the registered memory or the whole memory region can be
>> +write protected using the ``UFFDIO_WRITEPROTECT`` IOCTL or ``PAGEMAP_SCAN``
>> +IOCTL.
>
> This part looks good.
>
> Thanks,
>

--
BR,
Muhammad Usama Anjum

2023-02-13 12:55:41

by Muhammad Usama Anjum

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On 2/9/23 3:15 AM, Peter Xu wrote:
> On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
>> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
>> the info about page table entries. The following operations are supported
>> in this ioctl:
>> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>> file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>> (PAGE_IS_SWAPPED).
>> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
>> pages have been written-to.
>> - Find pages which have been written-to and write protect the pages
>> (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
>>
>> To get information about which pages have been written-to and/or write
>> protect the pages, following must be performed first in order:
>> - The userfaultfd file descriptor is created with userfaultfd syscall.
>> - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL.
>> - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode
>> through UFFDIO_REGISTER IOCTL.
>> Then any part of the registered memory or the whole memory region
>> can be write protected using the UFFDIO_WRITEPROTECT IOCTL or
>> PAGEMAP_SCAN IOCTL.
>>
>> struct pagemap_scan_args is used as the argument of the IOCTL. In this
>> struct:
>> - The range is specified through start and len.
>> - The output buffer of struct page_region array and size is specified as
>> vec and vec_len.
>> - The optional maximum requested pages are specified in the max_pages.
>> - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE
>> is the only added flag at this time.
>> - The masks are specified in required_mask, anyof_mask, excluded_mask
>> and return_mask.
>>
>> This IOCTL can be extended to get information about more PTE bits. This
>> IOCTL doesn't support hugetlbs at the moment. No information about
>> hugetlb can be obtained. This patch has evolved from a basic patch from
>> Gabriel Krisman Bertazi.
>>
>> Signed-off-by: Muhammad Usama Anjum <[email protected]>
>> ---
>> Changes in v10:
>> - move changes in tools/include/uapi/linux/fs.h to separate patch
>> - update commit message
>>
>> Change in v8:
>> - Correct is_pte_uffd_wp()
>> - Improve readability and error checks
>> - Remove some un-needed code
>>
>> Changes in v7:
>> - Rebase on top of latest next
>> - Fix some corner cases
>> - Base soft-dirty on the uffd wp async
>> - Update the terminologies
>> - Optimize the memory usage inside the ioctl
>>
>> Changes in v6:
>> - Rename variables and update comments
>> - Make IOCTL independent of soft_dirty config
>> - Change masks and bitmap type to _u64
>> - Improve code quality
>>
>> Changes in v5:
>> - Remove tlb flushing even for clear operation
>>
>> Changes in v4:
>> - Update the interface and implementation
>>
>> Changes in v3:
>> - Tighten the user-kernel interface by using explicit types and add more
>> error checking
>>
>> Changes in v2:
>> - Convert the interface from syscall to ioctl
>> - Remove pidfd support as it doesn't make sense in ioctl
>> ---
>> fs/proc/task_mmu.c | 290 ++++++++++++++++++++++++++++++++++++++++
>> include/uapi/linux/fs.h | 50 +++++++
>> 2 files changed, 340 insertions(+)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index e35a0398db63..c6bde19d63d9 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -19,6 +19,7 @@
>> #include <linux/shmem_fs.h>
>> #include <linux/uaccess.h>
>> #include <linux/pkeys.h>
>> +#include <linux/minmax.h>
>>
>> #include <asm/elf.h>
>> #include <asm/tlb.h>
>> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
>> }
>> #endif
>>
>> +static inline bool is_pte_uffd_wp(pte_t pte)
>> +{
>> + if ((pte_present(pte) && pte_uffd_wp(pte)) ||
>> + (pte_swp_uffd_wp_any(pte)))
>> + return true;
>> + return false;
>
> Sorry I should have mentioned this earlier: you can directly return here.
No problem at all. I'm replacing these two helper functions with the following
in the next version, so that !present pages don't show as dirty (a page is
reported as written only if it is present or swapped and its uffd-wp bit has
been dropped):

static inline bool is_pte_written(pte_t pte)
{
if ((pte_present(pte) && pte_uffd_wp(pte)) ||
(pte_swp_uffd_wp_any(pte)))
return false;
return (pte_present(pte) || is_swap_pte(pte));
}

static inline bool is_pmd_written(pmd_t pmd)
{
if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
(is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
return false;
return (pmd_present(pmd) || is_swap_pmd(pmd));
}

>
> return (pte_present(pte) && pte_uffd_wp(pte)) || pte_swp_uffd_wp_any(pte);
>
>> +}
>> +
>> +static inline bool is_pmd_uffd_wp(pmd_t pmd)
>> +{
>> + if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
>> + (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
>> + return true;
>> + return false;
>
> Same here.
>
>> +}
>> +
>> #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>> static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
>> unsigned long addr, pmd_t *pmdp)
>> @@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file)
>> return 0;
>> }
>>
>> +#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \
>> + PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
>> +#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
>> +#define IS_WP_ENGAGE_OP(a) (a->flags & PAGEMAP_WP_ENGAGE)
>> +#define IS_GET_OP(a) (a->vec)
>> +#define HAS_NO_SPACE(p) (p->max_pages && (p->found_pages == p->max_pages))
>> +
>> +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap) \
>> + (wt | file << 1 | present << 2 | swap << 3)
>> +#define IS_WT_REQUIRED(a) \
>> + ((a->required_mask & PAGE_IS_WRITTEN) || \
>> + (a->anyof_mask & PAGE_IS_WRITTEN))
>> +
>> +struct pagemap_scan_private {
>> + struct page_region *vec;
>> + struct page_region prev;
>> + unsigned long vec_len, vec_index;
>> + unsigned int max_pages, found_pages, flags;
>> + unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
>> +};
>> +
>> +static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk)
>> +{
>> + struct pagemap_scan_private *p = walk->private;
>> + struct vm_area_struct *vma = walk->vma;
>> +
>> + if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma))
>
> Should this be:
>
> (IS_WT_REQUIRED(p) && (!userfaultfd_wp(vma) || !userfaultfd_wp_async(vma)))
>
> Instead?
Correct. I'll fix this.

>
>> + return -EPERM;
>> + if (vma->vm_flags & VM_PFNMAP)
>> + return 1;
>> + return 0;
>> +}
>> +
>> +static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
>> + struct pagemap_scan_private *p, unsigned long addr,
>> + unsigned int len)
>> +{
>> + unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
>> + bool cpy = true;
>> + struct page_region *prev = &p->prev;
>
> Nit: switch the above two lines?
I'll fix this.

>
>> +
>> + if (HAS_NO_SPACE(p))
>> + return -ENOSPC;
>> +
>> + if (p->max_pages && p->found_pages + len >= p->max_pages)
>> + len = p->max_pages - p->found_pages;
>
> If "p->found_pages + len >= p->max_pages", shouldn't this already return -ENOSPC?
Length calculation is happening in the functions calling this function. I'll
move this out of here to make things logically better.

>
>> + if (!len)
>> + return -EINVAL;
>> +
>> + if (p->required_mask)
>> + cpy = ((p->required_mask & cur) == p->required_mask);
>> + if (cpy && p->anyof_mask)
>> + cpy = (p->anyof_mask & cur);
>> + if (cpy && p->excluded_mask)
>> + cpy = !(p->excluded_mask & cur);
>> + bitmap = cur & p->return_mask;
>> + if (cpy && bitmap) {
>> + if ((prev->len) && (prev->bitmap == bitmap) &&
>> + (prev->start + prev->len * PAGE_SIZE == addr)) {
>> + prev->len += len;
>> + p->found_pages += len;
>> + } else if (p->vec_index < p->vec_len) {
>> + if (prev->len) {
>> + memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
>> + p->vec_index++;
>> + }
>
> IIUC you can have:
>
> int pagemap_scan_deposit(struct pagemap_scan_private *p)
> {
> 	if (p->vec_index >= p->vec_len)
> 		return -ENOSPC;
>
> 	if (p->prev.len) {
> 		memcpy(&p->vec[p->vec_index], &p->prev, sizeof(struct page_region));
> 		p->vec_index++;
> 	}
>
> 	return 0;
> }
>
> Then call it here. I think it can also be called below to replace
> export_prev_to_out().
No, this isn't possible. We fill up prev until the next range doesn't merge
with it. At that point, we put prev into the output buffer and the new range
is put into prev. Now that we have shifted to smaller page walks of <= 512
entries, we want to visit all ranges before finally putting prev into the
output. Sorry for this somewhat complex method. The problem is that we want
to merge consecutive matching regions into one entry in the output; to
achieve this across multiple page walks, prev is used.

Let's suppose we want to visit memory from 0x7FFF00000000 to 0x7FFF00400000,
i.e. a length of 1024 pages, where all of the memory has been written to.
walk_page_range() will be called 2 times. In the first call, prev will be
set with a length of 512. In the second call, prev will be updated to 1024,
as the previous range stored in prev can be extended. After this, prev will
be stored to the user output buffer, consuming only 1 struct page_region.

If we stored prev back to the output memory on every walk_page_range() call,
we wouldn't get 1 struct page_region with length 1024. Instead we would get
2 page_region structs with half the length each.
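
In code, this is just the merge condition in pagemap_scan_output() above:

	/* the new range continues prev with identical flags: extend it */
	if (prev->len && prev->bitmap == bitmap &&
	    prev->start + prev->len * PAGE_SIZE == addr) {
		prev->len += len;	/* e.g. 512 + 512 -> one 1024-page region */
		p->found_pages += len;
	} else {
		/* not mergeable: flush prev to the output and restart it at addr */
	}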

>
>> + prev->start = addr;
>> + prev->len = len;
>> + prev->bitmap = bitmap;
>> + p->found_pages += len;
>> + } else {
>> + return -ENOSPC;
>> + }
>> + }
>> + return 0;
>> +}
>> +
>> +static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
>> + unsigned long *vec_index)
>> +{
>> + struct page_region *prev = &p->prev;
>> +
>> + if (prev->len) {
>> + if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
>> + return -EFAULT;
>> + p->vec_index++;
>> + (*vec_index)++;
>> + prev->len = 0;
>> + }
>> + return 0;
>> +}
>> +
>> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>> + unsigned long end, struct mm_walk *walk)
>> +{
>> + struct pagemap_scan_private *p = walk->private;
>> + struct vm_area_struct *vma = walk->vma;
>> + unsigned long addr = end;
>
> This assignment is useless?
No, this assignment gets used when only the WP_ENGAGE operation is performed
on normal-sized pages.
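
In that case the GET loop is skipped entirely, addr keeps its initial value
end, and the tail of the function write-protects the whole range:

	/* addr == end here because the GET-only loop above was skipped */
	if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start))
		uffd_wp_range(walk->mm, vma, start, addr - start, true);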

>
>> + spinlock_t *ptl;
>> + int ret = 0;
>> + pte_t *pte;
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> + ptl = pmd_trans_huge_lock(pmd, vma);
>> + if (ptl) {
>> + bool pmd_wt;
>> +
>> + pmd_wt = !is_pmd_uffd_wp(*pmd);
>> + /*
>> + * Break huge page into small pages if operation needs to be performed is
>> + * on a portion of the huge page.
>> + */
>> + if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
>> + spin_unlock(ptl);
>> + split_huge_pmd(vma, pmd, start);
>> + goto process_smaller_pages;
>> + }
>> + if (IS_GET_OP(p))
>> + ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd),
>> + is_swap_pmd(*pmd), p, start,
>> + (end - start)/PAGE_SIZE);
>> + spin_unlock(ptl);
>> + if (!ret) {
>> + if (pmd_wt && IS_WP_ENGAGE_OP(p))
>> + uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true);
>> + }
>> + return ret;
>> + }
>> +process_smaller_pages:
>> + if (pmd_trans_unstable(pmd))
>> + return 0;
>> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>> +
>> + pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
>> + if (IS_GET_OP(p)) {
>> + for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
>> + ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file,
>> + pte_present(*pte), is_swap_pte(*pte), p, addr, 1);
>> + if (ret)
>> + break;
>> + }
>> + }
>> + pte_unmap_unlock(pte - 1, ptl);
>> + if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start))
>> + uffd_wp_range(walk->mm, vma, start, addr - start, true);
>> +
>> + cond_resched();
>> + return ret;
>> +}
>> +
>> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth,
>> + struct mm_walk *walk)
>> +{
>> + struct pagemap_scan_private *p = walk->private;
>> + struct vm_area_struct *vma = walk->vma;
>> + int ret = 0;
>> +
>> + if (vma)
>> + ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
>> + (end - addr)/PAGE_SIZE);
>> + return ret;
>> +}
>> +
>> +/* No hugetlb support is present. */
>> +static const struct mm_walk_ops pagemap_scan_ops = {
>> + .test_walk = pagemap_scan_test_walk,
>> + .pmd_entry = pagemap_scan_pmd_entry,
>> + .pte_hole = pagemap_scan_pte_hole,
>> +};
>> +
>> +static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg)
>> +{
>> + unsigned long empty_slots, vec_index = 0;
>> + unsigned long __user start, end;
>> + unsigned long __start, __end;
>> + struct page_region __user *vec;
>> + struct pagemap_scan_private p;
>> + int ret = 0;
>> +
>> + start = (unsigned long)untagged_addr(arg->start);
>> + vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
>> +
>> + /* Validate memory ranges */
>> + if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
>> + return -EINVAL;
>> + if (IS_GET_OP(arg) && ((arg->vec_len == 0) ||
>> + (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region)))))
>> + return -EINVAL;
>> +
>> + /* Detect illegal flags and masks */
>> + if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) ||
>> + (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) ||
>> + (arg->return_mask & ~PAGEMAP_BITS_ALL))
>> + return -EINVAL;
>> + if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) ||
>> + !arg->return_mask))
>> + return -EINVAL;
>> + /* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */
>> + if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) ||
>> + (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS)))
>> + return -EINVAL;
>
> I think you said you'll clean this up a bit. I don't think so..
You had shown a really clean way to express all these error-checking
conditions, but I wasn't able to restructure the current conditions that
nicely. I'd at least done something to make them look better. Sorry, I'll
revisit and try to come up with easier-to-follow error checks.
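
For instance, the checks could be gathered into one helper along these lines
(just a sketch; the helper name is made up):

	static bool pagemap_scan_args_valid(struct pagemap_scan_arg *arg,
					    unsigned long start,
					    struct page_region __user *vec)
	{
		/* Validate memory ranges */
		if (!IS_ALIGNED(start, PAGE_SIZE) ||
		    !access_ok((void __user *)start, arg->len))
			return false;

		/* Detect illegal flags and masks */
		if ((arg->flags & ~PAGEMAP_WP_ENGAGE) ||
		    ((arg->required_mask | arg->anyof_mask | arg->excluded_mask |
		      arg->return_mask) & ~PAGEMAP_BITS_ALL))
			return false;

		if (IS_GET_OP(arg)) {
			if (!arg->vec_len ||
			    !access_ok((void __user *)vec,
				       arg->vec_len * sizeof(struct page_region)))
				return false;
			if ((!arg->required_mask && !arg->anyof_mask &&
			     !arg->excluded_mask) || !arg->return_mask)
				return false;
		}

		/* Non-written bits can't be requested with PAGEMAP_WP_ENGAGE */
		if (IS_WP_ENGAGE_OP(arg) &&
		    ((arg->required_mask | arg->anyof_mask) &
		     PAGEMAP_NON_WRITTEN_BITS))
			return false;

		return true;
	}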

--
BR,
Muhammad Usama Anjum

2023-02-13 21:43:09

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On Mon, Feb 13, 2023 at 05:55:19PM +0500, Muhammad Usama Anjum wrote:
> On 2/9/23 3:15 AM, Peter Xu wrote:
> > On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
> >> [...]
> >> +static inline bool is_pte_uffd_wp(pte_t pte)
> >> +{
> >> + if ((pte_present(pte) && pte_uffd_wp(pte)) ||
> >> + (pte_swp_uffd_wp_any(pte)))
> >> + return true;
> >> + return false;
> >
> > Sorry I should have mentioned this earlier: you can directly return here.
> No problem at all. I'm replacing these two helper functions with the
> following in the next version, so that !present pages don't show as dirty:
>
> static inline bool is_pte_written(pte_t pte)
> {
> if ((pte_present(pte) && pte_uffd_wp(pte)) ||
> (pte_swp_uffd_wp_any(pte)))
> return false;
> return (pte_present(pte) || is_swap_pte(pte));
> }

Could you explain why you don't want to return dirty for !present? A page
can be written then swapped out. Don't you want to know that happened
(from dirty tracking POV)?

The code looks weird to me too.. We only have three types of ptes: (1)
present, (2) swap, (3) none.

Then, "(pte_present() || is_swap_pte())" is the same as !pte_none(). Is
that what you're really looking for?
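
If so, the helper could simply become (sketch):

	static inline bool is_pte_written(pte_t pte)
	{
		if ((pte_present(pte) && pte_uffd_wp(pte)) ||
		    pte_swp_uffd_wp_any(pte))
			return false;
		return !pte_none(pte);
	}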

>
> static inline bool is_pmd_written(pmd_t pmd)
> {
> if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
> (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
> return false;
> return (pmd_present(pmd) || is_swap_pmd(pmd));
> }

[...]

> >> + bitmap = cur & p->return_mask;
> >> + if (cpy && bitmap) {
> >> + if ((prev->len) && (prev->bitmap == bitmap) &&
> >> + (prev->start + prev->len * PAGE_SIZE == addr)) {
> >> + prev->len += len;
> >> + p->found_pages += len;
> >> + } else if (p->vec_index < p->vec_len) {
> >> + if (prev->len) {
> >> + memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
> >> + p->vec_index++;
> >> + }
> >
> > IIUC you can have:
> >
> > int pagemap_scan_deposit(p)
> > {
> > if (p->vec_index >= p->vec_len)
> > return -ENOSPC;
> >
> > if (p->prev->len) {
> > memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
> > p->vec_index++;
> > }
> >
> > return 0;
> > }
> >
> > Then call it here. I think it can also be called below to replace
> > export_prev_to_out().
> No, this isn't possible. We fill up prev until the next range doesn't merge
> with it. At that point, we put prev into the output buffer and the new range
> is put into prev. Now that we have shifted to smaller page walks of <= 512
> entries, we want to visit all ranges before finally putting prev into the
> output. Sorry for this somewhat complex method. The problem is that we want
> to merge consecutive matching regions into one entry in the output, and to
> achieve this across multiple page walks, prev is used.
>
> Let's suppose we want to visit memory from 0x7FFF00000000 to 0x7FFF00400000,
> a length of 1024 pages, and all of the memory has been written.
> walk_page_range() will be called 2 times. In the first call, prev will be
> set with a length of 512. In the second call, prev will be updated to 1024,
> as the range stored in prev can be extended. After this, prev will be stored
> to the user output buffer, consuming only 1 struct page_region.
>
> If we stored prev back to output memory in every walk_page_range() call, we
> wouldn't get 1 struct page_region with length 1024. Instead we would get
> 2 page_region structs with half the length.

I didn't mean to merge PREV for each pgtable walk. What I meant is I think
with such a pagemap_scan_deposit() you can rewrite it as:

if (cpy && bitmap) {
if ((prev->len) && (prev->bitmap == bitmap) &&
(prev->start + prev->len * PAGE_SIZE == addr)) {
prev->len += len;
p->found_pages += len;
} else {
if (pagemap_scan_deposit(p))
return -ENOSPC;
prev->start = addr;
prev->len = len;
prev->bitmap = bitmap;
p->found_pages += len;
}
}

Then you can reuse pagemap_scan_deposit() right before returning to
userspace, just to flush PREV to p->vec properly in a single helper.
It also makes the code slightly easier to read.

--
Peter Xu


2023-02-14 07:57:58

by Muhammad Usama Anjum

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On 2/14/23 2:42 AM, Peter Xu wrote:
> On Mon, Feb 13, 2023 at 05:55:19PM +0500, Muhammad Usama Anjum wrote:
>> On 2/9/23 3:15 AM, Peter Xu wrote:
>>> On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
>>>> [...]
>>>> +static inline bool is_pte_uffd_wp(pte_t pte)
>>>> +{
>>>> + if ((pte_present(pte) && pte_uffd_wp(pte)) ||
>>>> + (pte_swp_uffd_wp_any(pte)))
>>>> + return true;
>>>> + return false;
>>>
>>> Sorry I should have mentioned this earlier: you can directly return here.
>> No problem at all. I'm replacing these two helper functions with the
>> following in the next version, so that !present pages don't show as dirty:
>>
>> static inline bool is_pte_written(pte_t pte)
>> {
>> if ((pte_present(pte) && pte_uffd_wp(pte)) ||
>> (pte_swp_uffd_wp_any(pte)))
>> return false;
>> return (pte_present(pte) || is_swap_pte(pte));
>> }
>
> Could you explain why you don't want to return dirty for !present? A page
> can be written then swapped out. Don't you want to know that happened
> (from dirty tracking POV)?
>
> The code looks weird to me too.. We only have three types of ptes: (1)
> present, (2) swap, (3) none.
>
> Then, "(pte_present() || is_swap_pte())" is the same as !pte_none(). Is
> that what you're really looking for?
Yes, this is what I've been trying to do. I'll use !pte_none() to make it
simpler.

>
>>
>> [...]
>
> I didn't mean to merge PREV for each pgtable walk. What I meant is I think
> with such a pagemap_scan_deposit() you can rewrite it as:
>
> if (cpy && bitmap) {
> if ((prev->len) && (prev->bitmap == bitmap) &&
> (prev->start + prev->len * PAGE_SIZE == addr)) {
> prev->len += len;
> p->found_pages += len;
> } else {
> if (pagemap_scan_deposit(p))
> return -ENOSPC;
> prev->start = addr;
> prev->len = len;
> prev->bitmap = bitmap;
> p->found_pages += len;
> }
> }
>
> Then you can reuse pagemap_scan_deposit() right before returning to
> userspace, just to flush PREV to p->vec properly in a single helper.
> It also makes the code slightly easier to read.
Yeah, this would have worked as you described. But in pagemap_scan_output()
we are flushing prev to p->vec, while later, in export_prev_to_out(), we
need to flush prev to user memory directly.



--
BR,
Muhammad Usama Anjum

2023-02-14 21:00:04

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On Tue, Feb 14, 2023 at 12:57:21PM +0500, Muhammad Usama Anjum wrote:
> On 2/14/23 2:42 AM, Peter Xu wrote:
> > On Mon, Feb 13, 2023 at 05:55:19PM +0500, Muhammad Usama Anjum wrote:
> >> On 2/9/23 3:15 AM, Peter Xu wrote:
> >>> On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
> >>>> [...]
> >>>> +static inline bool is_pte_uffd_wp(pte_t pte)
> >>>> +{
> >>>> + if ((pte_present(pte) && pte_uffd_wp(pte)) ||
> >>>> + (pte_swp_uffd_wp_any(pte)))
> >>>> + return true;
> >>>> + return false;
> >>>
> >>> Sorry I should have mentioned this earlier: you can directly return here.
> >> No problem at all. I'm replacing these two helper functions with the
> >> following in the next version, so that !present pages don't show as dirty:
> >>
> >> static inline bool is_pte_written(pte_t pte)
> >> {
> >> if ((pte_present(pte) && pte_uffd_wp(pte)) ||
> >> (pte_swp_uffd_wp_any(pte)))
> >> return false;
> >> return (pte_present(pte) || is_swap_pte(pte));
> >> }
> >
> > Could you explain why you don't want to return dirty for !present? A page
> > can be written then swapped out. Don't you want to know that happened
> > (from dirty tracking POV)?
> >
> > The code looks weird to me too.. We only have three types of ptes: (1)
> > present, (2) swap, (3) none.
> >
> > Then, "(pte_present() || is_swap_pte())" is the same as !pte_none(). Is
> > that what you're really looking for?
> Yes, this is what I've been trying to do. I'll use !pte_none() to make it
> simpler.

Ah I think I see what you wanted to do now.. But I'm afraid it won't work
for all cases.

So IIUC the problem is that an anon pte can be empty, but since the uffd-wp
bit doesn't persist on anon ptes once they become none, we lose it and cannot
distinguish those pages from pages that have been written. Your solution will
solve the problem for anonymous memory, but I think it'll break file memories.

Example:

Consider one shmem page that got mapped, write protected (using UFFDIO_WP
ioctl), written again (removing uffd-wp bit automatically), then zapped.
The pte will be pte_none() but it's actually written, afaiu.
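
As pseudo-steps (sketch of the sequence above):

	/* shmem page, vma registered with UFFDIO_REGISTER_MODE_WP */
	/* 1. page mapped and written      -> pte present, !uffd-wp        */
	/* 2. UFFDIO_WRITEPROTECT (wp=1)   -> pte present, uffd-wp set     */
	/* 3. page written again           -> uffd-wp cleared automatically */
	/* 4. page zapped                  -> pte_none(), history lost     */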

Maybe it's time we should introduce UFFD_FEATURE_WP_ZEROPAGE, so we'll need
to install pte markers for anonymous too (then it will work similarly to
shmem/hugetlbfs, in that we'll report writes to zero pages), and then you'll
need to have the new UFFD_FEATURE_WP_ASYNC depend on it. With that I think
you can keep using the old check and it should start to work.

Please let me know if my understanding is correct above.

I'll see whether I can quickly play with UFFD_FEATURE_WP_ZEROPAGE with some
patch in the meantime. That's something we wanted before too, when the app
cares about zero pages on anon. We used to populate the pages before doing
ioctl(UFFDIO_WP) to make sure zero pages would be reported too, but that
flag should be more efficient.
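
On the API side, the dependency could be enforced roughly like this (sketch
only; UFFD_FEATURE_WP_ZEROPAGE is just a proposal at this point):

	/* in the UFFDIO_API handling */
	if ((features & UFFD_FEATURE_WP_ASYNC) &&
	    !(features & UFFD_FEATURE_WP_ZEROPAGE))
		return -EINVAL;	/* WP_ASYNC requires WP_ZEROPAGE */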

>
> >
> >>
> >> [...]
> >
> > I didn't mean to merge PREV for each pgtable walk. What I meant is I think
> > with such a pagemap_scan_deposit() you can rewrite it as:
> >
> > if (cpy && bitmap) {
> > if ((prev->len) && (prev->bitmap == bitmap) &&
> > (prev->start + prev->len * PAGE_SIZE == addr)) {
> > prev->len += len;
> > p->found_pages += len;
> > } else {
> > if (pagemap_scan_deposit(p))
> > return -ENOSPC;
> > prev->start = addr;
> > prev->len = len;
> > prev->bitmap = bitmap;
> > p->found_pages += len;
> > }
> > }
> >
> > Then you can reuse pagemap_scan_deposit() right before returning to
> > userspace, just to flush PREV to p->vec properly in a single helper.
> > It also makes the code slightly easier to read.
> Yeah, this would have worked as you described. But in pagemap_scan_output()
> we are flushing prev to p->vec, while later, in export_prev_to_out(), we
> need to flush prev to user memory directly.

I think there's a loop to copy_to_user(). Could you use the new helper so
the copy_to_user() loop will work without export_prev_to_out()?

I really hope we can get rid of export_prev_to_out(). Thanks,
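
Something along these lines, I mean (sketch, using the proposed helper):

	/* after the last walk_page_range() chunk */
	ret = pagemap_scan_deposit(&p);	/* flush pending prev into p.vec */
	if (!ret && p.vec_index &&
	    copy_to_user(vec, p.vec, p.vec_index * sizeof(struct page_region)))
		ret = -EFAULT;	/* one copy_to_user(), no export_prev_to_out() */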

--
Peter Xu


2023-02-15 10:03:48

by Muhammad Usama Anjum

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On 2/15/23 1:59 AM, Peter Xu wrote:
[..]
>>> [...]
>> Yes, this is what I've been trying to do. I'll use !pte_none() to make it
>> simpler.
>
> Ah I think I see what you wanted to do now.. But I'm afraid it won't work
> for all cases.
>
> So IIUC the problem is that an anon pte can be empty, but since the uffd-wp
> bit doesn't persist on anon ptes once they become none, we lose it and cannot
> distinguish those pages from pages that have been written. Your solution will
> solve the problem for anonymous memory, but I think it'll break file memories.
>
> Example:
>
> Consider one shmem page that got mapped, write protected (using UFFDIO_WP
> ioctl), written again (removing uffd-wp bit automatically), then zapped.
> The pte will be pte_none() but it's actually written, afaiu.
>
> Maybe it's time we should introduce UFFD_FEATURE_WP_ZEROPAGE, so we'll need
> to install pte markers for anonymous too (then it will work similarly to
> shmem/hugetlbfs, in that we'll report writes to zero pages), and then you'll
> need to have the new UFFD_FEATURE_WP_ASYNC depend on it. With that I think
> you can keep using the old check and it should start to work.
>
> Please let me know if my understanding is correct above.
Thank you for identifying it. Your understanding seems on point. I'll
research PTE markers; I'm looking at your patches about them [1]. Can you
refer me to the "mm alignment sessions" discussion, in the form of a
presentation or a transcript if one is available?

>
> I'll see whether I can quickly play with UFFD_FEATURE_WP_ZEROPAGE with some
> patch in the meantime. That's something we wanted before too, when the app
> cares about zero pages on anon. We used to populate the pages before doing
> ioctl(UFFDIO_WP) to make sure zero pages would be reported too, but that
> flag should be more efficient.
Is this discussion public? For which application were you looking into this?
I'll dig in to see how I can contribute to it.

>
>>> [...]
>> Yeah, this would have worked as you described. But in pagemap_scan_output()
>> we are flushing prev to p->vec, while later, in export_prev_to_out(), we
>> need to flush prev to user memory directly.
>
> I think there's a loop to copy_to_user(). Could you use the new helper so
> the copy_to_user() loop will work without export_prev_to_out()?
>
> I really hope we can get rid of export_prev_to_out(). Thanks,
I truly understand how you feel about export_prev_to_out(). It is really
difficult to follow. Even I had to try hard to come up with the current
code, which avoids consuming a lot of kernel memory while still giving the
user compact output. I can surely wrap both of these in a dirty-looking
macro, but I'm unable to find a decent one to replace them. I think I'll
put a comment somewhere to explain what's going on.


--
BR,
Muhammad Usama Anjum

2023-02-15 21:13:38

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On Wed, Feb 15, 2023 at 03:03:09PM +0500, Muhammad Usama Anjum wrote:
> On 2/15/23 1:59 AM, Peter Xu wrote:
> [..]
> > [...]
> > Maybe it's time we should introduce UFFD_FEATURE_WP_ZEROPAGE, so we'll need
> > to install pte markers for anonymous too (then it will work similarly to
> > shmem/hugetlbfs, in that we'll report writes to zero pages), and then you'll
> > need to have the new UFFD_FEATURE_WP_ASYNC depend on it. With that I think
> > you can keep using the old check and it should start to work.
> >
> > Please let me know if my understanding is correct above.
> Thank you for identifying it. Your understanding seems on point. I'll
> research PTE markers; I'm looking at your patches about them [1]. Can you
> refer me to the "mm alignment sessions" discussion, in the form of a
> presentation or a transcript if one is available?

No worry now; after a second thought I think the zero page is better than
pte markers, and I've got a patch here that makes it work by injecting zero
pages for anonymous memory:

https://lore.kernel.org/all/[email protected]/

I think we'd also better enforce that your new WP_ASYNC feature bit depends
on this one, i.e. fail UFFDIO_API if WP_ASYNC && !WP_ZEROPAGE.

Could you please try rebasing your work on top of this one? Hopefully it'll
work for you already. Note again that you'll need to go back to the old
is_pte|pmd_written() to make things always work, I think.

[...]

> I truly understand how you feel about export_prev_to_out(). It is really
> difficult to follow. Even I had to try hard to come up with the current
> code, which avoids consuming a lot of kernel memory while still giving the
> user compact output. I can surely wrap both of these in a dirty-looking
> macro, but I'm unable to find a decent one to replace them. I think I'll
> put a comment somewhere to explain what's going on.

So maybe I still missed something? I'll read the new version when it comes.

Thanks,

--
Peter Xu


2023-02-17 09:38:09

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v10 1/6] userfaultfd: Add UFFD WP Async support

Hi Muhammad,

On Thu, Feb 02, 2023 at 04:29:10PM +0500, Muhammad Usama Anjum wrote:
> Add a new WP Async mode (UFFD_FEATURE_WP_ASYNC) which resolves the page
> faults on its own. It can be used to track which pages have been
> written-to from the time the pages were write-protected. It is a very
> efficient way to track changes, as uffd is by nature pte/pmd based.
>
> UFFD synchronous WP sends the page faults to userspace, where the pages
> which have been written-to can be tracked. But it is not efficient. This
> is why this asynchronous version is being added. After setting WP Async,
> the pages which have been written to can be found in the pagemap file, or
> the information can be obtained from the PAGEMAP_IOCTL.
>
> Suggested-by: Peter Xu <[email protected]>
> Signed-off-by: Muhammad Usama Anjum <[email protected]>
> ---
> Changes in v10:
> - Build fix
> - Update comments and add error condition to return error from uffd
> register if hugetlb pages are present when wp async flag is set
>
> Changes in v9:
> - Correct the fault resolution with code contributed by Peter
>
> Changes in v7:
> - Remove UFFDIO_WRITEPROTECT_MODE_ASYNC_WP and add UFFD_FEATURE_WP_ASYNC
> - Handle automatic page fault resolution in better way (thanks to Peter)
>
> update to wp async
>
> uffd wp async
> ---
> fs/userfaultfd.c | 20 ++++++++++++++++++--
> include/linux/userfaultfd_k.h | 11 +++++++++++
> include/uapi/linux/userfaultfd.h | 10 +++++++++-
> mm/memory.c | 23 ++++++++++++++++++++---
> 4 files changed, 58 insertions(+), 6 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 15a5bf765d43..422f2530c63e 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1422,10 +1422,15 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> goto out_unlock;
>
> /*
> - * Note vmas containing huge pages
> + * Note vmas containing huge pages. Hugetlb isn't supported
> + * with UFFD_FEATURE_WP_ASYNC.
> */
> - if (is_vm_hugetlb_page(cur))
> + if (is_vm_hugetlb_page(cur)) {
> + if (ctx->features & UFFD_FEATURE_WP_ASYNC)
> + goto out_unlock;
> +
> basic_ioctls = true;
> + }
>
> found = true;
> }
> @@ -1867,6 +1872,10 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
> mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
> mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
>
> + /* The unprotection is not supported if in async WP mode */
> + if (!mode_wp && (ctx->features & UFFD_FEATURE_WP_ASYNC))
> + return -EINVAL;
> +
> if (mode_wp && mode_dontwake)
> return -EINVAL;
>
> @@ -1950,6 +1959,13 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
> return ret;
> }
>
> +int userfaultfd_wp_async(struct vm_area_struct *vma)
> +{
> + struct userfaultfd_ctx *ctx = vma->vm_userfaultfd_ctx.ctx;
> +
> + return (ctx && (ctx->features & UFFD_FEATURE_WP_ASYNC));
> +}
> +
> static inline unsigned int uffd_ctx_features(__u64 user_features)
> {
> /*
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 9df0b9a762cc..38c92c2beb16 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -179,6 +179,7 @@ extern int userfaultfd_unmap_prep(struct mm_struct *mm, unsigned long start,
> unsigned long end, struct list_head *uf);
> extern void userfaultfd_unmap_complete(struct mm_struct *mm,
> struct list_head *uf);
> +extern int userfaultfd_wp_async(struct vm_area_struct *vma);
>
> #else /* CONFIG_USERFAULTFD */
>
> @@ -189,6 +190,11 @@ static inline vm_fault_t handle_userfault(struct vm_fault *vmf,
> return VM_FAULT_SIGBUS;
> }
>
> +static inline void uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *vma,
> + unsigned long start, unsigned long len, bool enable_wp)
> +{
> +}
> +
> static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
> struct vm_userfaultfd_ctx vm_ctx)
> {
> @@ -274,6 +280,11 @@ static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
> return false;
> }
>
> +static inline int userfaultfd_wp_async(struct vm_area_struct *vma)
> +{
> + return false;
> +}
> +
> #endif /* CONFIG_USERFAULTFD */
>
> static inline bool pte_marker_entry_uffd_wp(swp_entry_t entry)
> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> index 005e5e306266..30a6f32cf564 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -38,7 +38,8 @@
> UFFD_FEATURE_MINOR_HUGETLBFS | \
> UFFD_FEATURE_MINOR_SHMEM | \
> UFFD_FEATURE_EXACT_ADDRESS | \
> - UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
> + UFFD_FEATURE_WP_HUGETLBFS_SHMEM | \
> + UFFD_FEATURE_WP_ASYNC)
> #define UFFD_API_IOCTLS \
> ((__u64)1 << _UFFDIO_REGISTER | \
> (__u64)1 << _UFFDIO_UNREGISTER | \
> @@ -203,6 +204,12 @@ struct uffdio_api {
> *
> * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd
> * write-protection mode is supported on both shmem and hugetlbfs.
> + *
> + * UFFD_FEATURE_WP_ASYNC indicates that userfaultfd write-protection
> + * asynchronous mode is supported in which the write fault is automatically
> + * resolved and write-protection is un-set. It only supports anon and shmem
> + * (hugetlb isn't supported). It only takes effect when a vma is registered
> + * with write-protection mode. Otherwise the flag is ignored.
> */

Most of mm/ adheres to the 80-character limit. Please make your changes
follow it as well.

> #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
> #define UFFD_FEATURE_EVENT_FORK (1<<1)
> @@ -217,6 +224,7 @@ struct uffdio_api {
> #define UFFD_FEATURE_MINOR_SHMEM (1<<10)
> #define UFFD_FEATURE_EXACT_ADDRESS (1<<11)
> #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM (1<<12)
> +#define UFFD_FEATURE_WP_ASYNC (1<<13)
> __u64 features;
>
> __u64 ioctls;
> diff --git a/mm/memory.c b/mm/memory.c
> index 4000e9f017e0..75331fbf7cb4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3351,8 +3351,21 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>
> if (likely(!unshare)) {
> if (userfaultfd_pte_wp(vma, *vmf->pte)) {
> - pte_unmap_unlock(vmf->pte, vmf->ptl);
> - return handle_userfault(vmf, VM_UFFD_WP);
> + if (userfaultfd_wp_async(vma)) {
> + /*
> + * Nothing needed (cache flush, TLB invalidations,
> + * etc.) because we're only removing the uffd-wp bit,
> + * which is completely invisible to the user.
> + */
> + pte_t pte = pte_clear_uffd_wp(*vmf->pte);
> +
> + set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> + /* Update this to be prepared for following up CoW handling */
> + vmf->orig_pte = pte;
> + } else {
> + pte_unmap_unlock(vmf->pte, vmf->ptl);
> + return handle_userfault(vmf, VM_UFFD_WP);
> + }

You can invert the condition here and reduce the nesting:

if (!userfaultfd_wp_async(vma)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
return handle_userfault(vmf, VM_UFFD_WP);
}

/* handle async WP */
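
With the body from the patch filled in, that becomes (sketch):

	if (!userfaultfd_wp_async(vma)) {
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		return handle_userfault(vmf, VM_UFFD_WP);
	}

	/*
	 * Async WP: clear the uffd-wp bit and fall through to the normal
	 * CoW handling. As the comment in the patch explains, no cache
	 * flush or TLB invalidation is needed for this.
	 */
	pte_t pte = pte_clear_uffd_wp(*vmf->pte);

	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
	/* Update this to be prepared for the following CoW handling */
	vmf->orig_pte = pte;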

> }
>
> /*
> @@ -4812,8 +4825,11 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
>
> if (vma_is_anonymous(vmf->vma)) {
> if (likely(!unshare) &&
> - userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
> + userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) {
> + if (userfaultfd_wp_async(vmf->vma))
> + goto split;
> return handle_userfault(vmf, VM_UFFD_WP);
> + }
> return do_huge_pmd_wp_page(vmf);
> }
>
> @@ -4825,6 +4841,7 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
> }
> }
>
> +split:
> /* COW or write-notify handled on pte level: split pmd. */
> __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
>
> --
> 2.30.2
>

--
Sincerely yours,
Mike.

2023-02-17 10:10:51

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
> [...]
> @@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file)
> return 0;
> }
>
> +#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \
> + PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
> +#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
> +#define IS_WP_ENGAGE_OP(a) (a->flags & PAGEMAP_WP_ENGAGE)
> +#define IS_GET_OP(a) (a->vec)
> +#define HAS_NO_SPACE(p) (p->max_pages && (p->found_pages == p->max_pages))
> +
> +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap) \
> + (wt | file << 1 | present << 2 | swap << 3)
> +#define IS_WT_REQUIRED(a) \
> + ((a->required_mask & PAGE_IS_WRITTEN) || \
> + (a->anyof_mask & PAGE_IS_WRITTEN))

All these macros are specific to pagemap_scan_ioctl() and should be
namespaced accordingly, e.g. PM_SCAN_BITS_ALL, PM_SCAN_BITMAP etc.

Also, IS_<opname>_OP() will be more readable as PM_SCAN_OP_IS_<opname>. I'd
suggest open-coding IS_WP_ENGAGE_OP() and IS_GET_OP(), and making
HAS_NO_SPACE() and IS_WT_REQUIRED() static inlines rather than macros.

And I'd also make IS_GET_OP() more explicit by defining a PAGEMAP_WP_GET or
similar flag rather than using arg->vec.
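
E.g. (sketch; the names are only suggestions):

	#define PM_SCAN_BITS_ALL	(PAGE_IS_WRITTEN | PAGE_IS_FILE | \
					 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)

	static inline bool pm_scan_has_no_space(const struct pagemap_scan_private *p)
	{
		return p->max_pages && p->found_pages == p->max_pages;
	}

	static inline bool pm_scan_wt_required(const struct pagemap_scan_private *p)
	{
		return (p->required_mask | p->anyof_mask) & PAGE_IS_WRITTEN;
	}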

> +
> +struct pagemap_scan_private {
> + struct page_region *vec;
> + struct page_region prev;
> + unsigned long vec_len, vec_index;
> + unsigned int max_pages, found_pages, flags;
> + unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
> +};
> +
> +static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk)

Please keep the lines under the 80-character limit.

> +{
> + struct pagemap_scan_private *p = walk->private;
> + struct vm_area_struct *vma = walk->vma;
> +
> + if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma))
> + return -EPERM;
> + if (vma->vm_flags & VM_PFNMAP)
> + return 1;
> + return 0;
> +}
> +
> +static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
> + struct pagemap_scan_private *p, unsigned long addr,
> + unsigned int len)
> +{
> + unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
> + bool cpy = true;
> + struct page_region *prev = &p->prev;
> +
> + if (HAS_NO_SPACE(p))
> + return -ENOSPC;
> +
> + if (p->max_pages && p->found_pages + len >= p->max_pages)
> + len = p->max_pages - p->found_pages;
> + if (!len)
> + return -EINVAL;
> +
> + if (p->required_mask)
> + cpy = ((p->required_mask & cur) == p->required_mask);
> + if (cpy && p->anyof_mask)
> + cpy = (p->anyof_mask & cur);
> + if (cpy && p->excluded_mask)
> + cpy = !(p->excluded_mask & cur);
> + bitmap = cur & p->return_mask;
> + if (cpy && bitmap) {
> + if ((prev->len) && (prev->bitmap == bitmap) &&
> + (prev->start + prev->len * PAGE_SIZE == addr)) {
> + prev->len += len;
> + p->found_pages += len;
> + } else if (p->vec_index < p->vec_len) {
> + if (prev->len) {
> + memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
> + p->vec_index++;
> + }
> + prev->start = addr;
> + prev->len = len;
> + prev->bitmap = bitmap;
> + p->found_pages += len;
> + } else {
> + return -ENOSPC;
> + }
> + }
> + return 0;

Please don't save on empty lines. Empty lines between logical pieces
improve readability.

> +}
> +
> +static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
> + unsigned long *vec_index)
> +{
> + struct page_region *prev = &p->prev;
> +
> + if (prev->len) {
> + if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
> + return -EFAULT;
> + p->vec_index++;
> + (*vec_index)++;
> + prev->len = 0;
> + }
> + return 0;
> +}
> +
> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> + unsigned long end, struct mm_walk *walk)
> +{
> + struct pagemap_scan_private *p = walk->private;
> + struct vm_area_struct *vma = walk->vma;
> + unsigned long addr = end;
> + spinlock_t *ptl;
> + int ret = 0;
> + pte_t *pte;
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + ptl = pmd_trans_huge_lock(pmd, vma);
> + if (ptl) {
> + bool pmd_wt;
> +
> + pmd_wt = !is_pmd_uffd_wp(*pmd);
> + /*
> + * Break huge page into small pages if operation needs to be performed is
> + * on a portion of the huge page.
> + */
> + if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
> + spin_unlock(ptl);
> + split_huge_pmd(vma, pmd, start);
> + goto process_smaller_pages;
> + }
> + if (IS_GET_OP(p))
> + ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd),
> + is_swap_pmd(*pmd), p, start,
> + (end - start)/PAGE_SIZE);
> + spin_unlock(ptl);
> + if (!ret) {
> + if (pmd_wt && IS_WP_ENGAGE_OP(p))
> + uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true);
> + }
> + return ret;
> + }
> +process_smaller_pages:
> + if (pmd_trans_unstable(pmd))
> + return 0;
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> + pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
> + if (IS_GET_OP(p)) {
> + for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
> + ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file,
> + pte_present(*pte), is_swap_pte(*pte), p, addr, 1);
> + if (ret)
> + break;
> + }
> + }
> + pte_unmap_unlock(pte - 1, ptl);
> + if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start))
> + uffd_wp_range(walk->mm, vma, start, addr - start, true);
> +
> + cond_resched();
> + return ret;
> +}
> +
> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth,
> + struct mm_walk *walk)
> +{
> + struct pagemap_scan_private *p = walk->private;
> + struct vm_area_struct *vma = walk->vma;
> + int ret = 0;
> +
> + if (vma)
> + ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
> + (end - addr)/PAGE_SIZE);
> + return ret;
> +}
> +
> +/* No hugetlb support is present. */
> +static const struct mm_walk_ops pagemap_scan_ops = {
> + .test_walk = pagemap_scan_test_walk,
> + .pmd_entry = pagemap_scan_pmd_entry,
> + .pte_hole = pagemap_scan_pte_hole,
> +};
> +
> +static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg)
> +{
> + unsigned long empty_slots, vec_index = 0;
> + unsigned long __user start, end;
> + unsigned long __start, __end;
> + struct page_region __user *vec;
> + struct pagemap_scan_private p;
> + int ret = 0;
> +
> + start = (unsigned long)untagged_addr(arg->start);
> + vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
> +
> + /* Validate memory ranges */
> + if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
> + return -EINVAL;
> + if (IS_GET_OP(arg) && ((arg->vec_len == 0) ||
> + (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region)))))
> + return -EINVAL;
> +
> + /* Detect illegal flags and masks */
> + if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) ||
> + (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) ||
> + (arg->return_mask & ~PAGEMAP_BITS_ALL))
> + return -EINVAL;
> + if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) ||
> + !arg->return_mask))
> + return -EINVAL;
> + /* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */
> + if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) ||
> + (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS)))
> + return -EINVAL;

I'd split argument validation into a separate function and split the OR'ed
conditions into separate if statements, e.g

bool pm_scan_args_valid(struct pagemap_scan_arg *arg)
{
	if (IS_GET_OP(arg)) {
		if (!arg->return_mask)
			return false;
		if (!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask)
			return false;
	}

	/* ... */

	return true;
}
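
Then do_pagemap_cmd() could start with a single

	if (!pm_scan_args_valid(arg))
		return -EINVAL;

instead of the long chain of checks.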

> +
> + end = start + arg->len;
> + p.max_pages = arg->max_pages;
> + p.found_pages = 0;
> + p.flags = arg->flags;
> + p.required_mask = arg->required_mask;
> + p.anyof_mask = arg->anyof_mask;
> + p.excluded_mask = arg->excluded_mask;
> + p.return_mask = arg->return_mask;
> + p.prev.len = 0;
> + p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
> +
> + if (IS_GET_OP(arg)) {
> + p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), GFP_KERNEL);
> + if (!p.vec)
> + return -ENOMEM;
> + } else {
> + p.vec = NULL;
> + }
> + __start = __end = start;
> + while (!ret && __end < end) {
> + p.vec_index = 0;
> + empty_slots = arg->vec_len - vec_index;
> + if (p.vec_len > empty_slots)
> + p.vec_len = empty_slots;
> +
> + __end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
> + if (__end > end)
> + __end = end;
> +
> + mmap_read_lock(mm);
> + ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p);
> + mmap_read_unlock(mm);
> + if (!(!ret || ret == -ENOSPC))
> + goto free_data;
> +
> + __start = __end;
> + if (IS_GET_OP(arg) && p.vec_index) {
> + if (copy_to_user(&vec[vec_index], p.vec,
> + p.vec_index * sizeof(struct page_region))) {
> + ret = -EFAULT;
> + goto free_data;
> + }
> + vec_index += p.vec_index;
> + }
> + }
> + ret = export_prev_to_out(&p, vec, &vec_index);
> + if (!ret)
> + ret = vec_index;
> +free_data:
> + if (IS_GET_OP(arg))
> + kfree(p.vec);
> +
> + return ret;
> +}
> +
> +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> + struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg __user *)arg;
> + struct mm_struct *mm = file->private_data;
> + struct pagemap_scan_arg argument;
> +
> + if (cmd == PAGEMAP_SCAN) {
> + if (copy_from_user(&argument, uarg, sizeof(struct pagemap_scan_arg)))
> + return -EFAULT;
> + return do_pagemap_cmd(mm, &argument);
> + }
> + return -EINVAL;
> +}
> +
> const struct file_operations proc_pagemap_operations = {
> .llseek = mem_lseek, /* borrow this */
> .read = pagemap_read,
> .open = pagemap_open,
> .release = pagemap_release,
> + .unlocked_ioctl = pagemap_scan_ioctl,
> + .compat_ioctl = pagemap_scan_ioctl,
> };
> #endif /* CONFIG_PROC_PAGE_MONITOR */
>
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index b7b56871029c..1ae9a8684b48 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t;
> #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> RWF_APPEND)
>
> +/* Pagemap ioctl */
> +#define PAGEMAP_SCAN _IOWR('f', 16, struct pagemap_scan_arg)
> +
> +/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */
> +#define PAGE_IS_WRITTEN (1 << 0)
> +#define PAGE_IS_FILE (1 << 1)
> +#define PAGE_IS_PRESENT (1 << 2)
> +#define PAGE_IS_SWAPPED (1 << 3)
> +
> +/*
> + * struct page_region - Page region with bitmap flags
> + * @start: Start of the region
> + * @len: Length of the region
> + * bitmap: Bits sets for the region
> + */
> +struct page_region {
> + __u64 start;
> + __u64 len;
> + __u64 bitmap;
> +};
> +
> +/*
> + * struct pagemap_scan_arg - Pagemap ioctl argument
> + * @start: Starting address of the region
> + * @len: Length of the region (All the pages in this length are included)
> + * @vec: Address of page_region struct array for output
> + * @vec_len: Length of the page_region struct array
> + * @max_pages: Optional max return pages
> + * @flags: Flags for the IOCTL
> + * @required_mask: Required mask - All of these bits have to be set in the PTE
> + * @anyof_mask: Any mask - Any of these bits are set in the PTE
> + * @excluded_mask: Exclude mask - None of these bits are set in the PTE
> + * @return_mask: Bits that are to be reported in page_region
> + */
> +struct pagemap_scan_arg {
> + __u64 start;
> + __u64 len;
> + __u64 vec;
> + __u64 vec_len;
> + __u32 max_pages;
> + __u32 flags;
> + __u64 required_mask;
> + __u64 anyof_mask;
> + __u64 excluded_mask;
> + __u64 return_mask;
> +};
> +
> +/* Special flags */
> +#define PAGEMAP_WP_ENGAGE (1 << 0)
> +
> #endif /* _UAPI_LINUX_FS_H */
> --
> 2.30.2
>

--
Sincerely yours,
Mike.

2023-02-17 10:40:24

by Muhammad Usama Anjum

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On 2/16/23 2:12 AM, Peter Xu wrote:
> On Wed, Feb 15, 2023 at 03:03:09PM +0500, Muhammad Usama Anjum wrote:
>> On 2/15/23 1:59 AM, Peter Xu wrote:
>> [..]
>>>>>> static inline bool is_pte_written(pte_t pte)
>>>>>> {
>>>>>> 	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
>>>>>> 	    (pte_swp_uffd_wp_any(pte)))
>>>>>> 		return false;
>>>>>> 	return (pte_present(pte) || is_swap_pte(pte));
>>>>>> }
>>>>>
>>>>> Could you explain why you don't want to return dirty for !present? A page
>>>>> can be written then swapped out. Don't you want to know that happened
>>>>> (from dirty tracking POV)?
>>>>>
>>>>> The code looks weird to me too.. We only have three types of ptes: (1)
>>>>> present, (2) swap, (3) none.
>>>>>
>>>>> Then, "(pte_present() || is_swap_pte())" is the same as !pte_none(). Is
>>>>> that what you're really looking for?
>>>> Yes, this is what I've been trying to do. I'll use !pte_none() to make it
>>>> simpler.
>>>
>>> Ah I think I see what you wanted to do now.. But I'm afraid it won't work
>>> for all cases.
>>>
>>> So IIUC the problem is anon pte can be empty, but since uffd-wp bit doesn't
>>> persist on anon (but none) ptes, then we got it lost and we cannot identify
>>> it from pages being written. Your solution will solve problem for
>>> anonymous, but I think it'll break file memories.
>>>
>>> Example:
>>>
>>> Consider one shmem page that got mapped, write protected (using UFFDIO_WP
>>> ioctl), written again (removing uffd-wp bit automatically), then zapped.
>>> The pte will be pte_none() but it's actually written, afaiu.
>>>
>>> Maybe it's time we should introduce UFFD_FEATURE_WP_ZEROPAGE, so we'll need
>>> to install pte markers for anonymous too (then it will work similarly like
>>> shmem/hugetlbfs, that we'll report writting to zero pages), then you'll
>>> need to have the new UFFD_FEATURE_WP_ASYNC depend on it. With that I think
>>> you can keep using the old check and it should start to work.
>>>
>>> Please let me know if my understanding is correct above.
>> Thank you for identifying it. Your understanding seems on point. I'll
>> research PTE markers; I'm looking at your patches about them [1]. Can you
>> refer me to the "mm alignment sessions" discussion, in the form of a
>> presentation or a transcript if one is available?
>
> No worry now, after a second thought I think zero page is better than pte
> markers, and I've got a patch that works for it here by injecting zero
> pages for anonymous:
>
> https://lore.kernel.org/all/[email protected]/
>
> I think we'd better also enforce your new WP_ASYNC feature bit to depend
> on this one, i.e. fail the UFFDIO_API if WP_ASYNC && !WP_ZEROPAGE.
>
> Could you please try by rebasing your work upon this one? Hope it'll work
> for you already. Note again that you'll need to go back to the old
> is_pte|pmd_written() to make things work always, I think.
Thank you so much for sending the ZEROPAGE patch. I've rebased my patches
on top of it and all my tests for anon memory are passing. Now we don't
need to touch the page before engaging WP, which is what we wanted to
achieve. So the !wp state can be translated directly to the soft-dirty
flag (is_pte_soft_dirty == !is_pte_uffd_wp).
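
For the feature dependency, I plan to add a check to the UFFDIO_API
handling along these lines (a rough sketch; assuming the new bit keeps the
UFFD_FEATURE_WP_ZEROPAGE name from your patch):

	/* WP_ASYNC relies on zeropage installation to track all writes */
	if ((features & UFFD_FEATURE_WP_ASYNC) &&
	    !(features & UFFD_FEATURE_WP_ZEROPAGE))
		goto err_out;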

I only have a few file mem and shmem tests so far. I'll write more.

>
> [...]
>
>> I truly understand how you feel about export_prev_to_out(). It is really
>> difficult to follow. Even I had to try hard to come up with the current
>> code to avoid consuming a lot of kernel memory while still giving the
>> user compact output. I could surely wrap both of these in a dirty-looking
>> macro, but I'm unable to find a decent one to replace them. I think I'll
>> put a comment somewhere to explain what's going on.
>
> So maybe I still missed something? I'll read the new version when it comes.
Let's reconvene on the next patches if you feel they can still be improved.
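
For now, the idea in one line: p->prev buffers the region currently being
extended, so that adjacent pages with identical flags get merged before
anything is copied out; export_prev_to_out() just flushes that last pending
region to the user buffer once the walk ends. Something like this comment
could go above the helper:

	/*
	 * The walker merges contiguous pages with identical flags into
	 * p->prev; export_prev_to_out() flushes the final pending region
	 * to the user vector after the page walk is done.
	 */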

>
> Thanks,
>

--
BR,
Muhammad Usama Anjum

2023-02-17 15:19:09

by Michał Mirosław

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On Thu, 2 Feb 2023 at 12:30, Muhammad Usama Anjum
<[email protected]> wrote:
[...]
> - The masks are specified in required_mask, anyof_mask, excluded_mask
> and return_mask.
[...]

May I suggest a slightly modified interface for the flags?

As I understand, the return_mask is what is applied to the page flags to
aggregate the list. This is a separate thing, and I think it doesn't need
changes except maybe an improvement in the documentation and visual
distinction.

For the page-selection mechanism, currently required_mask and excluded_mask
have conflicting responsibilities. I suggest to rework that to:
1. negated_flags: page flags which are to be negated before applying the
   page selection using the following masks;
2. required_flags: flags which all have to be set in the (negation-applied)
   page flags;
3. anyof_flags: flags of which at least one has to be set in the
   (negation-applied) page flags.

IOW, the resulting algorithm would be:

	tested_flags = page_flags ^ negated_flags;
	if (~tested_flags & required_flags)
		skip page;
	if (!(tested_flags & anyof_flags))
		skip page;

	aggregate_on(page_flags & return_flags);
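
If I read the current semantics right, they would map onto this scheme
roughly as:

	negated_flags  = excluded_mask;
	required_flags = required_mask | excluded_mask;
	anyof_flags    = anyof_mask;

which seems to cover the existing use cases (assuming anyof_mask and
excluded_mask don't overlap).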

Best Regards
Michał Mirosław

2023-02-19 13:52:17

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs


On 2/2/23 1:29 PM, Muhammad Usama Anjum wrote:
> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
> the info about page table entries. The following operations are supported
> in this ioctl:
> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
> file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
> (PAGE_IS_SWAPPED).
> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
> pages have been written-to.
> - Find pages which have been written-to and write protect the pages
> (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
>
> To get information about which pages have been written-to and/or write
> protect the pages, the following must be performed first, in order:
> - The userfaultfd file descriptor is created with userfaultfd syscall.
> - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL.
> - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode
> through UFFDIO_REGISTER IOCTL.
> Then any part of the registered memory or the whole memory region
> can be write protected using the UFFDIO_WRITEPROTECT IOCTL or
> PAGEMAP_SCAN IOCTL.
>
> struct pagemap_scan_arg is used as the argument of the IOCTL. In this
> struct:
> - The range is specified through start and len.
> - The output buffer of struct page_region array and size is specified as
> vec and vec_len.
> - The optional maximum requested pages are specified in the max_pages.
> - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE
> is the only added flag at this time.
> - The masks are specified in required_mask, anyof_mask, excluded_mask
> and return_mask.
>
> This IOCTL can be extended to get information about more PTE bits. This
> IOCTL doesn't support hugetlbs at the moment. No information about
> hugetlb can be obtained. This patch has evolved from a basic patch from
> Gabriel Krisman Bertazi.

I was not involved before, so I am not commenting on the API and code to
avoid making unhelpful noise.

Having said that, some things in the code seem quite dirty and make it
hard to read and understand.

> Signed-off-by: Muhammad Usama Anjum <[email protected]>
> ---
> Changes in v10:
> - move changes in tools/include/uapi/linux/fs.h to separate patch
> - update commit message
>
> Change in v8:
> - Correct is_pte_uffd_wp()
> - Improve readability and error checks
> - Remove some un-needed code
>
> Changes in v7:
> - Rebase on top of latest next
> - Fix some corner cases
> - Base soft-dirty on the uffd wp async
> - Update the terminologies
> - Optimize the memory usage inside the ioctl
>
> Changes in v6:
> - Rename variables and update comments
> - Make IOCTL independent of soft_dirty config
> - Change masks and bitmap type to _u64
> - Improve code quality
>
> Changes in v5:
> - Remove tlb flushing even for clear operation
>
> Changes in v4:
> - Update the interface and implementation
>
> Changes in v3:
> - Tighten the user-kernel interface by using explicit types and add more
> error checking
>
> Changes in v2:
> - Convert the interface from syscall to ioctl
> - Remove pidfd support as it doesn't make sense in ioctl
> ---
> fs/proc/task_mmu.c | 290 ++++++++++++++++++++++++++++++++++++++++
> include/uapi/linux/fs.h | 50 +++++++
> 2 files changed, 340 insertions(+)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index e35a0398db63..c6bde19d63d9 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -19,6 +19,7 @@
> #include <linux/shmem_fs.h>
> #include <linux/uaccess.h>
> #include <linux/pkeys.h>
> +#include <linux/minmax.h>
>
> #include <asm/elf.h>
> #include <asm/tlb.h>
> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
> }
> #endif
>
> +static inline bool is_pte_uffd_wp(pte_t pte)
> +{
> + if ((pte_present(pte) && pte_uffd_wp(pte)) ||
> + (pte_swp_uffd_wp_any(pte)))
> + return true;
> + return false;
> +}
> +
> +static inline bool is_pmd_uffd_wp(pmd_t pmd)
> +{
> + if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
> + (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
> + return true;
> + return false;
> +}
> +
> #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
> unsigned long addr, pmd_t *pmdp)
> @@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file)
> return 0;
> }
>
> +#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \
> + PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
> +#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
> +#define IS_WP_ENGAGE_OP(a) (a->flags & PAGEMAP_WP_ENGAGE)
> +#define IS_GET_OP(a) (a->vec)
> +#define HAS_NO_SPACE(p) (p->max_pages && (p->found_pages == p->max_pages))

I think that in general it is better to have an inline function instead of
a macro when possible, as it is clearer and checks types. Anyhow, IMHO most
of these macros are better off open-coded.

> +
> +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap) \
> + (wt | file << 1 | present << 2 | swap << 3)
> +#define IS_WT_REQUIRED(a) \
> + ((a->required_mask & PAGE_IS_WRITTEN) || \
> + (a->anyof_mask & PAGE_IS_WRITTEN))
> +
> +struct pagemap_scan_private {
> + struct page_region *vec;
> + struct page_region prev;
> + unsigned long vec_len, vec_index;
> + unsigned int max_pages, found_pages, flags;
> + unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
> +};
> +
> +static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk)
> +{
> + struct pagemap_scan_private *p = walk->private;
> + struct vm_area_struct *vma = walk->vma;
> +
> + if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma))
> + return -EPERM;
> + if (vma->vm_flags & VM_PFNMAP)
> + return 1;
> + return 0;
> +}
> +
> +static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
> + struct pagemap_scan_private *p, unsigned long addr,
> + unsigned int len)
> +{
> + unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
> + bool cpy = true;
> + struct page_region *prev = &p->prev;
> +
> + if (HAS_NO_SPACE(p))
> + return -ENOSPC;
> +
> + if (p->max_pages && p->found_pages + len >= p->max_pages)
> + len = p->max_pages - p->found_pages;
> + if (!len)
> + return -EINVAL;
> +
> + if (p->required_mask)
> + cpy = ((p->required_mask & cur) == p->required_mask);
> + if (cpy && p->anyof_mask)
> + cpy = (p->anyof_mask & cur);
> + if (cpy && p->excluded_mask)
> + cpy = !(p->excluded_mask & cur);
> + bitmap = cur & p->return_mask;
> + if (cpy && bitmap) {
> + if ((prev->len) && (prev->bitmap == bitmap) &&
> + (prev->start + prev->len * PAGE_SIZE == addr)) {
> + prev->len += len;
The use of "len" both for bytes and pages is very confusing. Consider
changing the name to n_pages or something similar.
> + p->found_pages += len;
> + } else if (p->vec_index < p->vec_len) {
> + if (prev->len) {
> + memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
> + p->vec_index++;
> + }
> + prev->start = addr;
> + prev->len = len;
> + prev->bitmap = bitmap;
> + p->found_pages += len;
> + } else {
> + return -ENOSPC;
> + }
> + }
> + return 0;
> +}
> +
> +static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
> + unsigned long *vec_index)
> +{
> + struct page_region *prev = &p->prev;
> +
> + if (prev->len) {
> + if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
> + return -EFAULT;
> + p->vec_index++;
> + (*vec_index)++;
> + prev->len = 0;
> + }
> + return 0;
> +}
> +
> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> + unsigned long end, struct mm_walk *walk)
> +{
> + struct pagemap_scan_private *p = walk->private;
> + struct vm_area_struct *vma = walk->vma;
> + unsigned long addr = end;
> + spinlock_t *ptl;
> + int ret = 0;
> + pte_t *pte;
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + ptl = pmd_trans_huge_lock(pmd, vma);
> + if (ptl) {
> + bool pmd_wt;
> +
> + pmd_wt = !is_pmd_uffd_wp(*pmd);
> + /*
> + * Break huge page into small pages if operation needs to be performed is
> + * on a portion of the huge page.
> + */
> + if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
> + spin_unlock(ptl);
> + split_huge_pmd(vma, pmd, start);
> + goto process_smaller_pages;
I think that such gotos are really confusing and should be avoided; using
'else' could have easily prevented the need for the goto. Even that is not
the best solution though, since I think it would have been better to
invert the conditions.
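
E.g., a rough sketch of the inverted structure:

	if (ptl) {
		if (!pmd_wt || !IS_WP_ENGAGE_OP(p) || end - start >= HPAGE_SIZE) {
			/* handle the whole huge pmd and return */
		} else {
			spin_unlock(ptl);
			split_huge_pmd(vma, pmd, start);
			/* fall through to the per-pte loop below */
		}
	}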
> + }
> + if (IS_GET_OP(p))
> + ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd),
> + is_swap_pmd(*pmd), p, start,
> + (end - start)/PAGE_SIZE);
> + spin_unlock(ptl);
> + if (!ret) {
> + if (pmd_wt && IS_WP_ENGAGE_OP(p))
> + uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true);
> + }
> + return ret;
> + }
> +process_smaller_pages:
> + if (pmd_trans_unstable(pmd))
> + return 0;
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> + pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
> + if (IS_GET_OP(p)) {
> + for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
> + ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file,
> + pte_present(*pte), is_swap_pte(*pte), p, addr, 1);
> + if (ret)
> + break;
> + }
> + }
> + pte_unmap_unlock(pte - 1, ptl);
We might not have entered the loop at all, in which case pte - 1 would be wrong.
> + if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start))
What does 'addr - start' mean as a condition? If you want to say they are
not equal, why not write 'addr != start'?
> + uffd_wp_range(walk->mm, vma, start, addr - start, true);
> +
> + cond_resched();
> + return ret;
> +}
> +
> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth,
> + struct mm_walk *walk)
> +{
> + struct pagemap_scan_private *p = walk->private;
> + struct vm_area_struct *vma = walk->vma;
> + int ret = 0;
> +
> + if (vma)
> + ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
> + (end - addr)/PAGE_SIZE);
> + return ret;
> +}
> +
> +/* No hugetlb support is present. */
> +static const struct mm_walk_ops pagemap_scan_ops = {
> + .test_walk = pagemap_scan_test_walk,
> + .pmd_entry = pagemap_scan_pmd_entry,
> + .pte_hole = pagemap_scan_pte_hole,
> +};
> +
> +static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg)
> +{
> + unsigned long empty_slots, vec_index = 0;
> + unsigned long __user start, end;

The whole point of the __user attribute is to be applied to pointers.

> + unsigned long __start, __end;

I think such names do not convey sufficient information.

> + struct page_region __user *vec;
> + struct pagemap_scan_private p;
> + int ret = 0;
> +
> + start = (unsigned long)untagged_addr(arg->start);
> + vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
> +
> + /* Validate memory ranges */
> + if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
> + return -EINVAL;
> + if (IS_GET_OP(arg) && ((arg->vec_len == 0) ||
> + (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region)))))
> + return -EINVAL;
> +
> + /* Detect illegal flags and masks */
> + if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) ||
> + (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) ||
> + (arg->return_mask & ~PAGEMAP_BITS_ALL))

Using bitwise or to check

    (arg->required_mask | arg->anyof_mask |
     arg->excluded_mask | arg->return_mask) & ~PAGE_MAP_BITS_ALL

Would have been much cleaner, IMHO.

> + return -EINVAL;
> + if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) ||
> + !arg->return_mask))
> + return -EINVAL;
> + /* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */
> + if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) ||
> + (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS)))
> + return -EINVAL;
> +
> + end = start + arg->len;
> + p.max_pages = arg->max_pages;
> + p.found_pages = 0;
> + p.flags = arg->flags;
> + p.required_mask = arg->required_mask;
> + p.anyof_mask = arg->anyof_mask;
> + p.excluded_mask = arg->excluded_mask;
> + p.return_mask = arg->return_mask;
> + p.prev.len = 0;
> + p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
> +
> + if (IS_GET_OP(arg)) {
> + p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), GFP_KERNEL);
> + if (!p.vec)
> + return -ENOMEM;
> + } else {
> + p.vec = NULL;
I find it cleaner to initialize 'p.vec = NULL' unconditionally before the
IS_GET_OP() check.
> + }
> + __start = __end = start;
> + while (!ret && __end < end) {
> + p.vec_index = 0;
> + empty_slots = arg->vec_len - vec_index;
> + if (p.vec_len > empty_slots)
> + p.vec_len = empty_slots;
> +
> + __end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
> + if (__end > end)
> + __end = end;
Easier to understand using min().
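E.g., since both operands are unsigned long, plain min() works:

	__end = min((__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK, end);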
> +
> + mmap_read_lock(mm);
> + ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p);
> + mmap_read_unlock(mm);
> + if (!(!ret || ret == -ENOSPC))

Double negations complicate things unnecessarily; 'if (ret && ret != -ENOSPC)'
says the same thing directly.

And if you already bail out on error (via the goto), why do you also check
ret in the while loop condition?

> + goto free_data;
> +
> + __start = __end;
> + if (IS_GET_OP(arg) && p.vec_index) {
> + if (copy_to_user(&vec[vec_index], p.vec,
> + p.vec_index * sizeof(struct page_region))) {
> + ret = -EFAULT;
> + goto free_data;
> + }
> + vec_index += p.vec_index;
> + }
> + }
> + ret = export_prev_to_out(&p, vec, &vec_index);
> + if (!ret)
> + ret = vec_index;
> +free_data:
> + if (IS_GET_OP(arg))
> + kfree(p.vec);
Just call it unconditionally; kfree(NULL) is a no-op.
> +
> + return ret;
> +}
> +
> +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> + struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg __user *)arg;
> + struct mm_struct *mm = file->private_data;
> + struct pagemap_scan_arg argument;
> +
> + if (cmd == PAGEMAP_SCAN) {
> + if (copy_from_user(&argument, uarg, sizeof(struct pagemap_scan_arg)))
> + return -EFAULT;
> + return do_pagemap_cmd(mm, &argument);
> + }
> + return -EINVAL;
> +}
> +
> const struct file_operations proc_pagemap_operations = {
> .llseek = mem_lseek, /* borrow this */
> .read = pagemap_read,
> .open = pagemap_open,
> .release = pagemap_release,
> + .unlocked_ioctl = pagemap_scan_ioctl,
> + .compat_ioctl = pagemap_scan_ioctl,
> };
> #endif /* CONFIG_PROC_PAGE_MONITOR */
>
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index b7b56871029c..1ae9a8684b48 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t;
> #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> RWF_APPEND)
>
> +/* Pagemap ioctl */
> +#define PAGEMAP_SCAN _IOWR('f', 16, struct pagemap_scan_arg)
> +
> +/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */
> +#define PAGE_IS_WRITTEN (1 << 0)
> +#define PAGE_IS_FILE (1 << 1)
> +#define PAGE_IS_PRESENT (1 << 2)
> +#define PAGE_IS_SWAPPED (1 << 3)

These names are way too generic and are likely to be misused for the
wrong purpose. The "_IS_" part seems confusing as well. So I think the
naming needs to be fixed and some new type (using typedef) or enum
should be introduced to hold these flags. I understand it is part of
uapi and it is less common there, but it is not unheard of and does make
things clearer.
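
For example, something along these lines (names are only placeholders):

	enum pagemap_scan_page_flags {
		PAGE_SCAN_WRITTEN	= 1 << 0,
		PAGE_SCAN_FILE		= 1 << 1,
		PAGE_SCAN_PRESENT	= 1 << 2,
		PAGE_SCAN_SWAPPED	= 1 << 3,
	};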


> +
> +/*
> + * struct page_region - Page region with bitmap flags
> + * @start: Start of the region
> + * @len: Length of the region
> + * bitmap: Bits sets for the region
> + */
> +struct page_region {
> + __u64 start;
> + __u64 len;

I presume in bytes. Would be useful to mention.

> + __u64 bitmap;
> +};
> +
> +/*
> + * struct pagemap_scan_arg - Pagemap ioctl argument
> + * @start: Starting address of the region
> + * @len: Length of the region (All the pages in this length are included)
> + * @vec: Address of page_region struct array for output
> + * @vec_len: Length of the page_region struct array
> + * @max_pages: Optional max return pages
> + * @flags: Flags for the IOCTL
> + * @required_mask: Required mask - All of these bits have to be set in the PTE
> + * @anyof_mask: Any mask - Any of these bits are set in the PTE
> + * @excluded_mask: Exclude mask - None of these bits are set in the PTE
> + * @return_mask: Bits that are to be reported in page_region
> + */
> +struct pagemap_scan_arg {
> + __u64 start;
> + __u64 len;
> + __u64 vec;
> + __u64 vec_len;
> + __u32 max_pages;
> + __u32 flags;
> + __u64 required_mask;
> + __u64 anyof_mask;
> + __u64 excluded_mask;
> + __u64 return_mask;
> +};
> +
> +/* Special flags */
> +#define PAGEMAP_WP_ENGAGE (1 << 0)
> +
> #endif /* _UAPI_LINUX_FS_H */

2023-02-20 08:36:47

by Muhammad Usama Anjum

[permalink] [raw]
Subject: Re: [PATCH v10 1/6] userfaultfd: Add UFFD WP Async support

Hi Mike,

Thanks for reviewing.

On 2/17/23 2:37 PM, Mike Rapoport wrote:
> Hi Muhammad,
>
> On Thu, Feb 02, 2023 at 04:29:10PM +0500, Muhammad Usama Anjum wrote:
>> Add new WP Async mode (UFFD_FEATURE_WP_ASYNC) which resolves the page
>> faults on its own. It can be used to track that which pages have been
>> written-to from the time the pages were write-protected. It is very
>> efficient way to track the changes as uffd is by nature pte/pmd based.
>>
>> UFFD synchronous WP sends the page faults to the userspace where the
>> pages which have been written-to can be tracked. But it is not efficient.
>> This is why this asynchronous version is being added. After setting the
>> WP Async, the pages which have been written to can be found in the pagemap
>> file or information can be obtained from the PAGEMAP_IOCTL.
>>
>> Suggested-by: Peter Xu <[email protected]>
>> Signed-off-by: Muhammad Usama Anjum <[email protected]>
>> ---
>> Changes in v10:
>> - Build fix
>> - Update comments and add error condition to return error from uffd
>> register if hugetlb pages are present when wp async flag is set
>>
>> Changes in v9:
>> - Correct the fault resolution with code contributed by Peter
>>
>> Changes in v7:
>> - Remove UFFDIO_WRITEPROTECT_MODE_ASYNC_WP and add UFFD_FEATURE_WP_ASYNC
>> - Handle automatic page fault resolution in better way (thanks to Peter)
>>
>> update to wp async
>>
>> uffd wp async
>> ---
>> fs/userfaultfd.c | 20 ++++++++++++++++++--
>> include/linux/userfaultfd_k.h | 11 +++++++++++
>> include/uapi/linux/userfaultfd.h | 10 +++++++++-
>> mm/memory.c | 23 ++++++++++++++++++++---
>> 4 files changed, 58 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
>> index 15a5bf765d43..422f2530c63e 100644
>> --- a/fs/userfaultfd.c
>> +++ b/fs/userfaultfd.c
>> @@ -1422,10 +1422,15 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
>> goto out_unlock;
>>
>> /*
>> - * Note vmas containing huge pages
>> + * Note vmas containing huge pages. Hugetlb isn't supported
>> + * with UFFD_FEATURE_WP_ASYNC.
>> */
>> - if (is_vm_hugetlb_page(cur))
>> + if (is_vm_hugetlb_page(cur)) {
>> + if (ctx->features & UFFD_FEATURE_WP_ASYNC)
>> + goto out_unlock;
>> +
>> basic_ioctls = true;
>> + }
>>
>> found = true;
>> }
>> @@ -1867,6 +1872,10 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>> mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
>> mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
>>
>> + /* The unprotection is not supported if in async WP mode */
>> + if (!mode_wp && (ctx->features & UFFD_FEATURE_WP_ASYNC))
>> + return -EINVAL;
>> +
>> if (mode_wp && mode_dontwake)
>> return -EINVAL;
>>
>> @@ -1950,6 +1959,13 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
>> return ret;
>> }
>>
>> +int userfaultfd_wp_async(struct vm_area_struct *vma)
>> +{
>> + struct userfaultfd_ctx *ctx = vma->vm_userfaultfd_ctx.ctx;
>> +
>> + return (ctx && (ctx->features & UFFD_FEATURE_WP_ASYNC));
>> +}
>> +
>> static inline unsigned int uffd_ctx_features(__u64 user_features)
>> {
>> /*
>> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
>> index 9df0b9a762cc..38c92c2beb16 100644
>> --- a/include/linux/userfaultfd_k.h
>> +++ b/include/linux/userfaultfd_k.h
>> @@ -179,6 +179,7 @@ extern int userfaultfd_unmap_prep(struct mm_struct *mm, unsigned long start,
>> unsigned long end, struct list_head *uf);
>> extern void userfaultfd_unmap_complete(struct mm_struct *mm,
>> struct list_head *uf);
>> +extern int userfaultfd_wp_async(struct vm_area_struct *vma);
>>
>> #else /* CONFIG_USERFAULTFD */
>>
>> @@ -189,6 +190,11 @@ static inline vm_fault_t handle_userfault(struct vm_fault *vmf,
>> return VM_FAULT_SIGBUS;
>> }
>>
>> +static inline void uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *vma,
>> + unsigned long start, unsigned long len, bool enable_wp)
>> +{
>> +}
>> +
>> static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
>> struct vm_userfaultfd_ctx vm_ctx)
>> {
>> @@ -274,6 +280,11 @@ static inline bool uffd_disable_fault_around(struct vm_area_struct *vma)
>> return false;
>> }
>>
>> +static inline int userfaultfd_wp_async(struct vm_area_struct *vma)
>> +{
>> + return false;
>> +}
>> +
>> #endif /* CONFIG_USERFAULTFD */
>>
>> static inline bool pte_marker_entry_uffd_wp(swp_entry_t entry)
>> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
>> index 005e5e306266..30a6f32cf564 100644
>> --- a/include/uapi/linux/userfaultfd.h
>> +++ b/include/uapi/linux/userfaultfd.h
>> @@ -38,7 +38,8 @@
>> UFFD_FEATURE_MINOR_HUGETLBFS | \
>> UFFD_FEATURE_MINOR_SHMEM | \
>> UFFD_FEATURE_EXACT_ADDRESS | \
>> - UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
>> + UFFD_FEATURE_WP_HUGETLBFS_SHMEM | \
>> + UFFD_FEATURE_WP_ASYNC)
>> #define UFFD_API_IOCTLS \
>> ((__u64)1 << _UFFDIO_REGISTER | \
>> (__u64)1 << _UFFDIO_UNREGISTER | \
>> @@ -203,6 +204,12 @@ struct uffdio_api {
>> *
>> * UFFD_FEATURE_WP_HUGETLBFS_SHMEM indicates that userfaultfd
>> * write-protection mode is supported on both shmem and hugetlbfs.
>> + *
>> + * UFFD_FEATURE_WP_ASYNC indicates that userfaultfd write-protection
>> + * asynchronous mode is supported in which the write fault is automatically
>> + * resolved and write-protection is un-set. It only supports anon and shmem
>> + * (hugetlb isn't supported). It only takes effect when a vma is registered
>> + * with write-protection mode. Otherwise the flag is ignored.
>> */
>
> Most of mm/ adheres to the 80-character limit. Please make your changes
> follow it as well.
Will update in next version.

>
>> #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0)
>> #define UFFD_FEATURE_EVENT_FORK (1<<1)
>> @@ -217,6 +224,7 @@ struct uffdio_api {
>> #define UFFD_FEATURE_MINOR_SHMEM (1<<10)
>> #define UFFD_FEATURE_EXACT_ADDRESS (1<<11)
>> #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM (1<<12)
>> +#define UFFD_FEATURE_WP_ASYNC (1<<13)
>> __u64 features;
>>
>> __u64 ioctls;
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 4000e9f017e0..75331fbf7cb4 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3351,8 +3351,21 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>>
>> if (likely(!unshare)) {
>> if (userfaultfd_pte_wp(vma, *vmf->pte)) {
>> - pte_unmap_unlock(vmf->pte, vmf->ptl);
>> - return handle_userfault(vmf, VM_UFFD_WP);
>> + if (userfaultfd_wp_async(vma)) {
>> + /*
>> + * Nothing needed (cache flush, TLB invalidations,
>> + * etc.) because we're only removing the uffd-wp bit,
>> + * which is completely invisible to the user.
>> + */
>> + pte_t pte = pte_clear_uffd_wp(*vmf->pte);
>> +
>> + set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>> + /* Update this to be prepared for following up CoW handling */
>> + vmf->orig_pte = pte;
>> + } else {
>> + pte_unmap_unlock(vmf->pte, vmf->ptl);
>> + return handle_userfault(vmf, VM_UFFD_WP);
>> + }
>
> You can revert the condition here and reduce the nesting:
>
> 	if (!userfaultfd_wp_async(vma)) {
> 		pte_unmap_unlock(vmf->pte, vmf->ptl);
> 		return handle_userfault(vmf, VM_UFFD_WP);
> 	}
>
> 	/* handle async WP */
I'll update in next version.

>
>> }
>>
>> /*
>> @@ -4812,8 +4825,11 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
>>
>> if (vma_is_anonymous(vmf->vma)) {
>> if (likely(!unshare) &&
>> - userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
>> + userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) {
>> + if (userfaultfd_wp_async(vmf->vma))
>> + goto split;
>> return handle_userfault(vmf, VM_UFFD_WP);
>> + }
>> return do_huge_pmd_wp_page(vmf);
>> }
>>
>> @@ -4825,6 +4841,7 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
>> }
>> }
>>
>> +split:
>> /* COW or write-notify handled on pte level: split pmd. */
>> __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
>>
>> --
>> 2.30.2
>>
>

--
BR,
Muhammad Usama Anjum

2023-02-20 10:40:26

by Muhammad Usama Anjum

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On 2/17/23 3:10 PM, Mike Rapoport wrote:
> On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
>> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
>> the info about page table entries. The following operations are supported
>> in this ioctl:
>> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>> file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>> (PAGE_IS_SWAPPED).
>> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
>> pages have been written-to.
>> - Find pages which have been written-to and write protect the pages
>> (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
>>
>> To get information about which pages have been written-to and/or write
>> protect the pages, the following must be performed first, in order:
>> - The userfaultfd file descriptor is created with userfaultfd syscall.
>> - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL.
>> - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode
>> through UFFDIO_REGISTER IOCTL.
>> Then any part of the registered memory or the whole memory region
>> can be write protected using the UFFDIO_WRITEPROTECT IOCTL or
>> PAGEMAP_SCAN IOCTL.
>>
>> struct pagemap_scan_arg is used as the argument of the IOCTL. In this
>> struct:
>> - The range is specified through start and len.
>> - The output buffer of struct page_region array and size is specified as
>> vec and vec_len.
>> - The optional maximum requested pages are specified in the max_pages.
>> - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE
>> is the only added flag at this time.
>> - The masks are specified in required_mask, anyof_mask, excluded_mask
>> and return_mask.
>>
>> This IOCTL can be extended to get information about more PTE bits. This
>> IOCTL doesn't support hugetlbs at the moment. No information about
>> hugetlb can be obtained. This patch has evolved from a basic patch from
>> Gabriel Krisman Bertazi.
>>
>> Signed-off-by: Muhammad Usama Anjum <[email protected]>
>> ---
>> Changes in v10:
>> - move changes in tools/include/uapi/linux/fs.h to separate patch
>> - update commit message
>>
>> Change in v8:
>> - Correct is_pte_uffd_wp()
>> - Improve readability and error checks
>> - Remove some un-needed code
>>
>> Changes in v7:
>> - Rebase on top of latest next
>> - Fix some corner cases
>> - Base soft-dirty on the uffd wp async
>> - Update the terminologies
>> - Optimize the memory usage inside the ioctl
>>
>> Changes in v6:
>> - Rename variables and update comments
>> - Make IOCTL independent of soft_dirty config
>> - Change masks and bitmap type to _u64
>> - Improve code quality
>>
>> Changes in v5:
>> - Remove tlb flushing even for clear operation
>>
>> Changes in v4:
>> - Update the interface and implementation
>>
>> Changes in v3:
>> - Tighten the user-kernel interface by using explicit types and add more
>> error checking
>>
>> Changes in v2:
>> - Convert the interface from syscall to ioctl
>> - Remove pidfd support as it doesn't make sense in ioctl
>> ---
>> fs/proc/task_mmu.c | 290 ++++++++++++++++++++++++++++++++++++++++
>> include/uapi/linux/fs.h | 50 +++++++
>> 2 files changed, 340 insertions(+)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index e35a0398db63..c6bde19d63d9 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -19,6 +19,7 @@
>> #include <linux/shmem_fs.h>
>> #include <linux/uaccess.h>
>> #include <linux/pkeys.h>
>> +#include <linux/minmax.h>
>>
>> #include <asm/elf.h>
>> #include <asm/tlb.h>
>> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
>> }
>> #endif
>>
>> +static inline bool is_pte_uffd_wp(pte_t pte)
>> +{
>> + if ((pte_present(pte) && pte_uffd_wp(pte)) ||
>> + (pte_swp_uffd_wp_any(pte)))
>> + return true;
>> + return false;
>> +}
>> +
>> +static inline bool is_pmd_uffd_wp(pmd_t pmd)
>> +{
>> + if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
>> + (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
>> + return true;
>> + return false;
>> +}
>> +
>> #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>> static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
>> unsigned long addr, pmd_t *pmdp)
>> @@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file)
>> return 0;
>> }
>>
>> +#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \
>> + PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
>> +#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
>> +#define IS_WP_ENGAGE_OP(a) (a->flags & PAGEMAP_WP_ENGAGE)
>> +#define IS_GET_OP(a) (a->vec)
>> +#define HAS_NO_SPACE(p) (p->max_pages && (p->found_pages == p->max_pages))
>> +
>> +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap) \
>> + (wt | file << 1 | present << 2 | swap << 3)
>> +#define IS_WT_REQUIRED(a) \
>> + ((a->required_mask & PAGE_IS_WRITTEN) || \
>> + (a->anyof_mask & PAGE_IS_WRITTEN))
>
> All these macros are specific to pagemap_scan_ioctl() and should be
> namespaced accordingly, e.g. PM_SCAN_BITS_ALL, PM_SCAN_BITMAP etc.
>
> Also, IS_<opname>_OP() will be more readable as PM_SCAN_OP_IS_<opname> and
> I'd suggest to open code IS_WP_ENGAGE_OP() and IS_GET_OP() and make
> HAS_NO_SPACE() and IS_WT_REQUIRED() static inlines rather than macros.
Will do in next version.

>
> And I'd also make IS_GET_OP() more explicit by defining a PAGEMAP_WP_GET or
> similar flag rather than using arg->vec.
I had it in the first revisions, but the explicit GET op was removed in
previous iterations after some feedback. Peter has also suggested this.
I'll add the GET_OP flag again.
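
Something like this, perhaps (names tentative):

	/* Operations for the PAGEMAP_SCAN ioctl */
	#define PM_SCAN_OP_GET	(1 << 0)
	#define PM_SCAN_OP_WP	(1 << 1)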

>
>> +
>> +struct pagemap_scan_private {
>> + struct page_region *vec;
>> + struct page_region prev;
>> + unsigned long vec_len, vec_index;
>> + unsigned int max_pages, found_pages, flags;
>> + unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
>> +};
>> +
>> +static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk)
>
> Please keep the lines under the 80-character limit.
>
>> +{
>> + struct pagemap_scan_private *p = walk->private;
>> + struct vm_area_struct *vma = walk->vma;
>> +
>> + if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma))
>> + return -EPERM;
>> + if (vma->vm_flags & VM_PFNMAP)
>> + return 1;
>> + return 0;
>> +}
>> +
>> +static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
>> + struct pagemap_scan_private *p, unsigned long addr,
>> + unsigned int len)
>> +{
>> + unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
>> + bool cpy = true;
>> + struct page_region *prev = &p->prev;
>> +
>> + if (HAS_NO_SPACE(p))
>> + return -ENOSPC;
>> +
>> + if (p->max_pages && p->found_pages + len >= p->max_pages)
>> + len = p->max_pages - p->found_pages;
>> + if (!len)
>> + return -EINVAL;
>> +
>> + if (p->required_mask)
>> + cpy = ((p->required_mask & cur) == p->required_mask);
>> + if (cpy && p->anyof_mask)
>> + cpy = (p->anyof_mask & cur);
>> + if (cpy && p->excluded_mask)
>> + cpy = !(p->excluded_mask & cur);
>> + bitmap = cur & p->return_mask;
>> + if (cpy && bitmap) {
>> + if ((prev->len) && (prev->bitmap == bitmap) &&
>> + (prev->start + prev->len * PAGE_SIZE == addr)) {
>> + prev->len += len;
>> + p->found_pages += len;
>> + } else if (p->vec_index < p->vec_len) {
>> + if (prev->len) {
>> + memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
>> + p->vec_index++;
>> + }
>> + prev->start = addr;
>> + prev->len = len;
>> + prev->bitmap = bitmap;
>> + p->found_pages += len;
>> + } else {
>> + return -ENOSPC;
>> + }
>> + }
>> + return 0;
>
> Please don't save on empty lines. Empty lines between logical pieces
> improve readability.
Sorry, I'll add them.

>
>> +}
>> +
>> +static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
>> + unsigned long *vec_index)
>> +{
>> + struct page_region *prev = &p->prev;
>> +
>> + if (prev->len) {
>> + if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
>> + return -EFAULT;
>> + p->vec_index++;
>> + (*vec_index)++;
>> + prev->len = 0;
>> + }
>> + return 0;
>> +}
>> +
>> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>> + unsigned long end, struct mm_walk *walk)
>> +{
>> + struct pagemap_scan_private *p = walk->private;
>> + struct vm_area_struct *vma = walk->vma;
>> + unsigned long addr = end;
>> + spinlock_t *ptl;
>> + int ret = 0;
>> + pte_t *pte;
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> + ptl = pmd_trans_huge_lock(pmd, vma);
>> + if (ptl) {
>> + bool pmd_wt;
>> +
>> + pmd_wt = !is_pmd_uffd_wp(*pmd);
>> + /*
>> + * Break huge page into small pages if operation needs to be performed is
>> + * on a portion of the huge page.
>> + */
>> + if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
>> + spin_unlock(ptl);
>> + split_huge_pmd(vma, pmd, start);
>> + goto process_smaller_pages;
>> + }
>> + if (IS_GET_OP(p))
>> + ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd),
>> + is_swap_pmd(*pmd), p, start,
>> + (end - start)/PAGE_SIZE);
>> + spin_unlock(ptl);
>> + if (!ret) {
>> + if (pmd_wt && IS_WP_ENGAGE_OP(p))
>> + uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true);
>> + }
>> + return ret;
>> + }
>> +process_smaller_pages:
>> + if (pmd_trans_unstable(pmd))
>> + return 0;
>> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>> +
>> + pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
>> + if (IS_GET_OP(p)) {
>> + for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
>> + ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file,
>> + pte_present(*pte), is_swap_pte(*pte), p, addr, 1);
>> + if (ret)
>> + break;
>> + }
>> + }
>> + pte_unmap_unlock(pte - 1, ptl);
>> + if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start))
>> + uffd_wp_range(walk->mm, vma, start, addr - start, true);
>> +
>> + cond_resched();
>> + return ret;
>> +}
>> +
>> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth,
>> + struct mm_walk *walk)
>> +{
>> + struct pagemap_scan_private *p = walk->private;
>> + struct vm_area_struct *vma = walk->vma;
>> + int ret = 0;
>> +
>> + if (vma)
>> + ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
>> + (end - addr)/PAGE_SIZE);
>> + return ret;
>> +}
>> +
>> +/* No hugetlb support is present. */
>> +static const struct mm_walk_ops pagemap_scan_ops = {
>> + .test_walk = pagemap_scan_test_walk,
>> + .pmd_entry = pagemap_scan_pmd_entry,
>> + .pte_hole = pagemap_scan_pte_hole,
>> +};
>> +
>> +static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg)
>> +{
>> + unsigned long empty_slots, vec_index = 0;
>> + unsigned long __user start, end;
>> + unsigned long __start, __end;
>> + struct page_region __user *vec;
>> + struct pagemap_scan_private p;
>> + int ret = 0;
>> +
>> + start = (unsigned long)untagged_addr(arg->start);
>> + vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
>> +
>> + /* Validate memory ranges */
>> + if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
>> + return -EINVAL;
>> + if (IS_GET_OP(arg) && ((arg->vec_len == 0) ||
>> + (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region)))))
>> + return -EINVAL;
>> +
>> + /* Detect illegal flags and masks */
>> + if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) ||
>> + (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) ||
>> + (arg->return_mask & ~PAGEMAP_BITS_ALL))
>> + return -EINVAL;
>> + if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) ||
>> + !arg->return_mask))
>> + return -EINVAL;
>> + /* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */
>> + if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) ||
>> + (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS)))
>> + return -EINVAL;
>
> I'd split argument validation into a separate function and split the OR'ed
> conditions into separate if statements, e.g
>
> bool pm_scan_args_valid(struct pagemap_scan_arg *arg)
> {
> 	if (IS_GET_OP(arg)) {
> 		if (!arg->return_mask)
> 			return false;
> 		if (!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask)
> 			return false;
> 	}
>
> 	/* ... */
>
> 	return true;
> }
This seems like a very good way. Thank you so much!
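
Roughly, I'm thinking of something like this (an untested sketch folding in
the existing checks; the access_ok() check on vec would fold in similarly):

	static bool pagemap_scan_args_valid(struct pagemap_scan_arg *arg,
					    unsigned long start)
	{
		if (!IS_ALIGNED(start, PAGE_SIZE))
			return false;
		if (!access_ok((void __user *)start, arg->len))
			return false;
		if (arg->flags & ~PAGEMAP_WP_ENGAGE)
			return false;
		if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask |
		     arg->return_mask) & ~PAGEMAP_BITS_ALL)
			return false;

		if (IS_GET_OP(arg)) {
			if (!arg->vec_len || !arg->return_mask)
				return false;
			if (!arg->required_mask && !arg->anyof_mask &&
			    !arg->excluded_mask)
				return false;
		}

		/* Non-written bits can't be requested with PAGEMAP_WP_ENGAGE. */
		if (IS_WP_ENGAGE_OP(arg) &&
		    ((arg->required_mask | arg->anyof_mask) &
		     PAGEMAP_NON_WRITTEN_BITS))
			return false;

		return true;
	}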

>
>> +
>> + end = start + arg->len;
>> + p.max_pages = arg->max_pages;
>> + p.found_pages = 0;
>> + p.flags = arg->flags;
>> + p.required_mask = arg->required_mask;
>> + p.anyof_mask = arg->anyof_mask;
>> + p.excluded_mask = arg->excluded_mask;
>> + p.return_mask = arg->return_mask;
>> + p.prev.len = 0;
>> + p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
>> +
>> + if (IS_GET_OP(arg)) {
>> + p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), GFP_KERNEL);
>> + if (!p.vec)
>> + return -ENOMEM;
>> + } else {
>> + p.vec = NULL;
>> + }
>> + __start = __end = start;
>> + while (!ret && __end < end) {
>> + p.vec_index = 0;
>> + empty_slots = arg->vec_len - vec_index;
>> + if (p.vec_len > empty_slots)
>> + p.vec_len = empty_slots;
>> +
>> + __end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
>> + if (__end > end)
>> + __end = end;
>> +
>> + mmap_read_lock(mm);
>> + ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p);
>> + mmap_read_unlock(mm);
>> + if (!(!ret || ret == -ENOSPC))
>> + goto free_data;
>> +
>> + __start = __end;
>> + if (IS_GET_OP(arg) && p.vec_index) {
>> + if (copy_to_user(&vec[vec_index], p.vec,
>> + p.vec_index * sizeof(struct page_region))) {
>> + ret = -EFAULT;
>> + goto free_data;
>> + }
>> + vec_index += p.vec_index;
>> + }
>> + }
>> + ret = export_prev_to_out(&p, vec, &vec_index);
>> + if (!ret)
>> + ret = vec_index;
>> +free_data:
>> + if (IS_GET_OP(arg))
>> + kfree(p.vec);
>> +
>> + return ret;
>> +}
>> +
>> +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>> +{
>> + struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg __user *)arg;
>> + struct mm_struct *mm = file->private_data;
>> + struct pagemap_scan_arg argument;
>> +
>> + if (cmd == PAGEMAP_SCAN) {
>> + if (copy_from_user(&argument, uarg, sizeof(struct pagemap_scan_arg)))
>> + return -EFAULT;
>> + return do_pagemap_cmd(mm, &argument);
>> + }
>> + return -EINVAL;
>> +}
>> +
>> const struct file_operations proc_pagemap_operations = {
>> .llseek = mem_lseek, /* borrow this */
>> .read = pagemap_read,
>> .open = pagemap_open,
>> .release = pagemap_release,
>> + .unlocked_ioctl = pagemap_scan_ioctl,
>> + .compat_ioctl = pagemap_scan_ioctl,
>> };
>> #endif /* CONFIG_PROC_PAGE_MONITOR */
>>
>> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
>> index b7b56871029c..1ae9a8684b48 100644
>> --- a/include/uapi/linux/fs.h
>> +++ b/include/uapi/linux/fs.h
>> @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t;
>> #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
>> RWF_APPEND)
>>
>> +/* Pagemap ioctl */
>> +#define PAGEMAP_SCAN _IOWR('f', 16, struct pagemap_scan_arg)
>> +
>> +/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */
>> +#define PAGE_IS_WRITTEN (1 << 0)
>> +#define PAGE_IS_FILE (1 << 1)
>> +#define PAGE_IS_PRESENT (1 << 2)
>> +#define PAGE_IS_SWAPPED (1 << 3)
>> +
>> +/*
>> + * struct page_region - Page region with bitmap flags
>> + * @start: Start of the region
>> + * @len: Length of the region
>> + * bitmap: Bits sets for the region
>> + */
>> +struct page_region {
>> + __u64 start;
>> + __u64 len;
>> + __u64 bitmap;
>> +};
>> +
>> +/*
>> + * struct pagemap_scan_arg - Pagemap ioctl argument
>> + * @start: Starting address of the region
>> + * @len: Length of the region (All the pages in this length are included)
>> + * @vec: Address of page_region struct array for output
>> + * @vec_len: Length of the page_region struct array
>> + * @max_pages: Optional max return pages
>> + * @flags: Flags for the IOCTL
>> + * @required_mask: Required mask - All of these bits have to be set in the PTE
>> + * @anyof_mask: Any mask - Any of these bits are set in the PTE
>> + * @excluded_mask: Exclude mask - None of these bits are set in the PTE
>> + * @return_mask: Bits that are to be reported in page_region
>> + */
>> +struct pagemap_scan_arg {
>> + __u64 start;
>> + __u64 len;
>> + __u64 vec;
>> + __u64 vec_len;
>> + __u32 max_pages;
>> + __u32 flags;
>> + __u64 required_mask;
>> + __u64 anyof_mask;
>> + __u64 excluded_mask;
>> + __u64 return_mask;
>> +};
>> +
>> +/* Special flags */
>> +#define PAGEMAP_WP_ENGAGE (1 << 0)
>> +
>> #endif /* _UAPI_LINUX_FS_H */
>> --
>> 2.30.2
>>
>

--
BR,
Muhammad Usama Anjum

2023-02-20 11:38:28

by Muhammad Usama Anjum

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On 2/20/23 3:38 PM, Muhammad Usama Anjum wrote:
>>> +#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \
>>> + PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
>>> +#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
>>> +#define IS_WP_ENGAGE_OP(a) (a->flags & PAGEMAP_WP_ENGAGE)
>>> +#define IS_GET_OP(a) (a->vec)
>>> +#define HAS_NO_SPACE(p) (p->max_pages && (p->found_pages == p->max_pages))
>>> +
>>> +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap) \
>>> + (wt | file << 1 | present << 2 | swap << 3)
>>> +#define IS_WT_REQUIRED(a) \
>>> + ((a->required_mask & PAGE_IS_WRITTEN) || \
>>> + (a->anyof_mask & PAGE_IS_WRITTEN))
>> All these macros are specific to pagemap_scan_ioctl() and should be
>> namespaced accordingly, e.g. PM_SCAN_BITS_ALL, PM_SCAN_BITMAP etc.
>>
>> Also, IS_<opname>_OP() will be more readable as PM_SCAN_OP_IS_<opname> and
>> I'd suggest to open code IS_WP_ENGAGE_OP() and IS_GET_OP() and make
>> HAS_NO_SPACE() and IS_WT_REQUIRED() static inlines rather than macros.
> Will do in next version.
>

IS_WP_ENGAGE_OP() and IS_GET_OP(), renamed to PM_SCAN_OP_IS_WP() and
PM_SCAN_OP_IS_GET(), still seem better to me than open coding, as I find
them more readable. I can open code them if you insist.

--
BR,
Muhammad Usama Anjum

2023-02-20 13:17:33

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On Mon, Feb 20, 2023 at 04:38:10PM +0500, Muhammad Usama Anjum wrote:
> On 2/20/23 3:38 PM, Muhammad Usama Anjum wrote:
> >>> +#define PAGEMAP_BITS_ALL (PAGE_IS_WRITTEN | PAGE_IS_FILE | \
> >>> + PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
> >>> +#define PAGEMAP_NON_WRITTEN_BITS (PAGE_IS_FILE | PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
> >>> +#define IS_WP_ENGAGE_OP(a) (a->flags & PAGEMAP_WP_ENGAGE)
> >>> +#define IS_GET_OP(a) (a->vec)
> >>> +#define HAS_NO_SPACE(p) (p->max_pages && (p->found_pages == p->max_pages))
> >>> +
> >>> +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap) \
> >>> + (wt | file << 1 | present << 2 | swap << 3)
> >>> +#define IS_WT_REQUIRED(a) \
> >>> + ((a->required_mask & PAGE_IS_WRITTEN) || \
> >>> + (a->anyof_mask & PAGE_IS_WRITTEN))
> >> All these macros are specific to pagemap_scan_ioctl() and should be
> >> namespaced accordingly, e.g. PM_SCAN_BITS_ALL, PM_SCAN_BITMAP etc.
> >>
> >> Also, IS_<opname>_OP() will be more readable as PM_SCAN_OP_IS_<opname> and
> >> I'd suggest to open code IS_WP_ENGAGE_OP() and IS_GET_OP() and make
> >> HAS_NO_SPACE() and IS_WT_REQUIRED() static inlines rather than macros.
> > Will do in next version.
> >
>
> IS_WP_ENGAGE_OP() and IS_GET_OP() which can be renamed to
> PM_SCAN_OP_IS_WP() and PM_SCAN_OP_IS_GET() seem better to me instead of
> open code as they seem more readable to me. I can open code if you insist.

I'd suggest to see how the rework of pagemap_scan_pmd_entry() pans out. An
open-coded '&' is surely clearer than a macro/function, but if it's buried
in a long sequence of conditions, it may not be such a clear win.

> --
> BR,
> Muhammad Usama Anjum

--
Sincerely yours,
Mike.

2023-02-20 13:25:12

by Muhammad Usama Anjum

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

Hello Nadav,

Thank you so much for reviewing!

On 2/19/23 6:52 PM, Nadav Amit wrote:
>
> On 2/2/23 1:29 PM, Muhammad Usama Anjum wrote:
>> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
>> the info about page table entries. The following operations are supported
>> in this ioctl:
>> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>>    file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>>    (PAGE_IS_SWAPPED).
>> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
>>    pages have been written-to.
>> - Find pages which have been written-to and write protect the pages
>>    (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
>>
>> To get information about which pages have been written-to and/or write
>> protect the pages, following must be performed first in order:
>> - The userfaultfd file descriptor is created with userfaultfd syscall.
>> - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL.
>> - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode
>>    through UFFDIO_REGISTER IOCTL.
>> Then any part of the registered memory or the whole memory region
>> can be write protected using the UFFDIO_WRITEPROTECT IOCTL or
>> PAGEMAP_SCAN IOCTL.
>>
>> struct pagemap_scan_arg is used as the argument of the IOCTL. In this
>> struct:
>> - The range is specified through start and len.
>> - The output buffer of struct page_region array and size is specified as
>>    vec and vec_len.
>> - The optional maximum requested pages are specified in the max_pages.
>> - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE
>>    is the only added flag at this time.
>> - The masks are specified in required_mask, anyof_mask, excluded_mask
>>    and return_mask.
>>
>> This IOCTL can be extended to get information about more PTE bits. This
>> IOCTL doesn't support hugetlbs at the moment. No information about
>> hugetlb can be obtained. This patch has evolved from a basic patch from
>> Gabriel Krisman Bertazi.
>
> I was not involved before, so I am not commenting on the API and code to
> avoid making unhelpful noise.
>
> Having said that, some things in the code seem quite dirty and make the
> code hard to understand.
There is a new proposal about the flags in the interface. I'll include you
there.

>
>> Signed-off-by: Muhammad Usama Anjum <[email protected]>
>> ---
>> Changes in v10:
>> - move changes in tools/include/uapi/linux/fs.h to separate patch
>> - update commit message
>>
>> Change in v8:
>> - Correct is_pte_uffd_wp()
>> - Improve readability and error checks
>> - Remove some un-needed code
>>
>> Changes in v7:
>> - Rebase on top of latest next
>> - Fix some corner cases
>> - Base soft-dirty on the uffd wp async
>> - Update the terminologies
>> - Optimize the memory usage inside the ioctl
>>
>> Changes in v6:
>> - Rename variables and update comments
>> - Make IOCTL independent of soft_dirty config
>> - Change masks and bitmap type to _u64
>> - Improve code quality
>>
>> Changes in v5:
>> - Remove tlb flushing even for clear operation
>>
>> Changes in v4:
>> - Update the interface and implementation
>>
>> Changes in v3:
>> - Tighten the user-kernel interface by using explicit types and add more
>>    error checking
>>
>> Changes in v2:
>> - Convert the interface from syscall to ioctl
>> - Remove pidfd support as it doesn't make sense in ioctl
>> ---
>>   fs/proc/task_mmu.c      | 290 ++++++++++++++++++++++++++++++++++++++++
>>   include/uapi/linux/fs.h |  50 +++++++
>>   2 files changed, 340 insertions(+)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index e35a0398db63..c6bde19d63d9 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -19,6 +19,7 @@
>>   #include <linux/shmem_fs.h>
>>   #include <linux/uaccess.h>
>>   #include <linux/pkeys.h>
>> +#include <linux/minmax.h>
>>     #include <asm/elf.h>
>>   #include <asm/tlb.h>
>> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct
>> vm_area_struct *vma,
>>   }
>>   #endif
>>   +static inline bool is_pte_uffd_wp(pte_t pte)
>> +{
>> +    if ((pte_present(pte) && pte_uffd_wp(pte)) ||
>> +        (pte_swp_uffd_wp_any(pte)))
>> +        return true;
>> +    return false;
>> +}
>> +
>> +static inline bool is_pmd_uffd_wp(pmd_t pmd)
>> +{
>> +    if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
>> +        (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
>> +        return true;
>> +    return false;
>> +}
>> +
>>   #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>>   static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
>>           unsigned long addr, pmd_t *pmdp)
>> @@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode,
>> struct file *file)
>>       return 0;
>>   }
>>   +#define PAGEMAP_BITS_ALL        (PAGE_IS_WRITTEN | PAGE_IS_FILE |    \
>> +                     PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
>> +#define PAGEMAP_NON_WRITTEN_BITS    (PAGE_IS_FILE |    PAGE_IS_PRESENT |
>> PAGE_IS_SWAPPED)
>> +#define IS_WP_ENGAGE_OP(a)        (a->flags & PAGEMAP_WP_ENGAGE)
>> +#define IS_GET_OP(a)            (a->vec)
>> +#define HAS_NO_SPACE(p)            (p->max_pages && (p->found_pages ==
>> p->max_pages))
>
> I think that in general it is better to have an inline function instead of
> macros when possible, as it is clearer and checks types. Anyhow, IMHO most
> of these macros would be better open-coded.
I'll update most of these in next version.

>
>> +
>> +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap)    \
>> +    (wt | file << 1 | present << 2 | swap << 3)
>> +#define IS_WT_REQUIRED(a)                \
>> +    ((a->required_mask & PAGE_IS_WRITTEN) ||    \
>> +     (a->anyof_mask & PAGE_IS_WRITTEN))
>> +
>> +struct pagemap_scan_private {
>> +    struct page_region *vec;
>> +    struct page_region prev;
>> +    unsigned long vec_len, vec_index;
>> +    unsigned int max_pages, found_pages, flags;
>> +    unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
>> +};
>> +
>> +static int pagemap_scan_test_walk(unsigned long start, unsigned long
>> end, struct mm_walk *walk)
>> +{
>> +    struct pagemap_scan_private *p = walk->private;
>> +    struct vm_area_struct *vma = walk->vma;
>> +
>> +    if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) &&
>> !userfaultfd_wp_async(vma))
>> +        return -EPERM;
>> +    if (vma->vm_flags & VM_PFNMAP)
>> +        return 1;
>> +    return 0;
>> +}
>> +
>> +static inline int pagemap_scan_output(bool wt, bool file, bool pres,
>> bool swap,
>> +                      struct pagemap_scan_private *p, unsigned long addr,
>> +                      unsigned int len)
>> +{
>> +    unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
>> +    bool cpy = true;
>> +    struct page_region *prev = &p->prev;
>> +
>> +    if (HAS_NO_SPACE(p))
>> +        return -ENOSPC;
>> +
>> +    if (p->max_pages && p->found_pages + len >= p->max_pages)
>> +        len = p->max_pages - p->found_pages;
>> +    if (!len)
>> +        return -EINVAL;
>> +
>> +    if (p->required_mask)
>> +        cpy = ((p->required_mask & cur) == p->required_mask);
>> +    if (cpy && p->anyof_mask)
>> +        cpy = (p->anyof_mask & cur);
>> +    if (cpy && p->excluded_mask)
>> +        cpy = !(p->excluded_mask & cur);
>> +    bitmap = cur & p->return_mask;
>> +    if (cpy && bitmap) {
>> +        if ((prev->len) && (prev->bitmap == bitmap) &&
>> +            (prev->start + prev->len * PAGE_SIZE == addr)) {
>> +            prev->len += len;
> The use of "len" both for bytes and pages is very confusing. Consider
> changing the name to n_pages or something similar.
Will update in next version.

>> +            p->found_pages += len;
>> +        } else if (p->vec_index < p->vec_len) {
>> +            if (prev->len) {
>> +                memcpy(&p->vec[p->vec_index], prev, sizeof(struct
>> page_region));
>> +                p->vec_index++;
>> +            }
>> +            prev->start = addr;
>> +            prev->len = len;
>> +            prev->bitmap = bitmap;
>> +            p->found_pages += len;
>> +        } else {
>> +            return -ENOSPC;
>> +        }
>> +    }
>> +    return 0;
>> +}
>> +
>> +static inline int export_prev_to_out(struct pagemap_scan_private *p,
>> struct page_region __user *vec,
>> +                     unsigned long *vec_index)
>> +{
>> +    struct page_region *prev = &p->prev;
>> +
>> +    if (prev->len) {
>> +        if (copy_to_user(&vec[*vec_index], prev, sizeof(struct
>> page_region)))
>> +            return -EFAULT;
>> +        p->vec_index++;
>> +        (*vec_index)++;
>> +        prev->len = 0;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>> +                     unsigned long end, struct mm_walk *walk)
>> +{
>> +    struct pagemap_scan_private *p = walk->private;
>> +    struct vm_area_struct *vma = walk->vma;
>> +    unsigned long addr = end;
>> +    spinlock_t *ptl;
>> +    int ret = 0;
>> +    pte_t *pte;
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +    ptl = pmd_trans_huge_lock(pmd, vma);
>> +    if (ptl) {
>> +        bool pmd_wt;
>> +
>> +        pmd_wt = !is_pmd_uffd_wp(*pmd);
>> +        /*
>> +         * Break the huge page into small pages if the operation to be
>> +         * performed is on a portion of the huge page.
>> +         */
>> +        if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
>> +            spin_unlock(ptl);
>> +            split_huge_pmd(vma, pmd, start);
>> +            goto process_smaller_pages;
> I think that such goto's are really confusing and should be avoided.
> Using 'else' could have easily prevented the need for the goto. It is not
> the best solution though, since I think it would have been better to
> invert the conditions.
Yeah, an else can be used here. But then we'd have to add a tab to all the
code after it. We already have so many tabs and very little space left to
write code. Not sure which is better.

>> +        }
>> +        if (IS_GET_OP(p))
>> +            ret = pagemap_scan_output(pmd_wt, vma->vm_file,
>> pmd_present(*pmd),
>> +                          is_swap_pmd(*pmd), p, start,
>> +                          (end - start)/PAGE_SIZE);
>> +        spin_unlock(ptl);
>> +        if (!ret) {
>> +            if (pmd_wt && IS_WP_ENGAGE_OP(p))
>> +                uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true);
>> +        }
>> +        return ret;
>> +    }
>> +process_smaller_pages:
>> +    if (pmd_trans_unstable(pmd))
>> +        return 0;
>> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>> +
>> +    pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
>> +    if (IS_GET_OP(p)) {
>> +        for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
>> +            ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file,
>> +                          pte_present(*pte), is_swap_pte(*pte), p, addr,
>> 1);
>> +            if (ret)
>> +                break;
>> +        }
>> +    }
>> +    pte_unmap_unlock(pte - 1, ptl);
> We might not have entered the loop, and pte - 1 would be wrong.
>> +    if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start))
> What does 'addr - start' mean? If you want to say they are not equal, why
> not say so?
This has been revamped in the next version.

>> +        uffd_wp_range(walk->mm, vma, start, addr - start, true);
>> +
>> +    cond_resched();
>> +    return ret;
>> +}
>> +
>> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end,
>> int depth,
>> +                 struct mm_walk *walk)
>> +{
>> +    struct pagemap_scan_private *p = walk->private;
>> +    struct vm_area_struct *vma = walk->vma;
>> +    int ret = 0;
>> +
>> +    if (vma)
>> +        ret = pagemap_scan_output(false, vma->vm_file, false, false, p,
>> addr,
>> +                      (end - addr)/PAGE_SIZE);
>> +    return ret;
>> +}
>> +
>> +/* No hugetlb support is present. */
>> +static const struct mm_walk_ops pagemap_scan_ops = {
>> +    .test_walk = pagemap_scan_test_walk,
>> +    .pmd_entry = pagemap_scan_pmd_entry,
>> +    .pte_hole = pagemap_scan_pte_hole,
>> +};
>> +
>> +static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg
>> *arg)
>> +{
>> +    unsigned long empty_slots, vec_index = 0;
>> +    unsigned long __user start, end;
>
> The whole point of __user (attribute) is to be assigned to pointers.
I'll remove it.

>
>> +    unsigned long __start, __end;
>
> I think such names do not convey sufficient information.
I'll update it.

>
>> +    struct page_region __user *vec;
>> +    struct pagemap_scan_private p;
>> +    int ret = 0;
>> +
>> +    start = (unsigned long)untagged_addr(arg->start);
>> +    vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
>> +
>> +    /* Validate memory ranges */
>> +    if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user
>> *)start, arg->len)))
>> +        return -EINVAL;
>> +    if (IS_GET_OP(arg) && ((arg->vec_len == 0) ||
>> +        (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct
>> page_region)))))
>> +        return -EINVAL;
>> +
>> +    /* Detect illegal flags and masks */
>> +    if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask &
>> ~PAGEMAP_BITS_ALL) ||
>> +        (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask &
>> ~PAGEMAP_BITS_ALL) ||
>> +        (arg->return_mask & ~PAGEMAP_BITS_ALL))
>
> Using bitwise or to check
>
>     (arg->required_mask | arg->anyof_mask |
>      arg->excluded_mask | arg->return_mask) & ~PAGEMAP_BITS_ALL
>
> Would have been much cleaner, IMHO.
I'll update it.

>
>> +        return -EINVAL;
>> +    if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask &&
>> !arg->excluded_mask) ||
>> +                !arg->return_mask))
>> +        return -EINVAL;
>> +    /* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also
>> specified. */
>> +    if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask &
>> PAGEMAP_NON_WRITTEN_BITS) ||
>> +        (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS)))
>> +        return -EINVAL;
>> +
>> +    end = start + arg->len;
>> +    p.max_pages = arg->max_pages;
>> +    p.found_pages = 0;
>> +    p.flags = arg->flags;
>> +    p.required_mask = arg->required_mask;
>> +    p.anyof_mask = arg->anyof_mask;
>> +    p.excluded_mask = arg->excluded_mask;
>> +    p.return_mask = arg->return_mask;
>> +    p.prev.len = 0;
>> +    p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
>> +
>> +    if (IS_GET_OP(arg)) {
>> +        p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
>> GFP_KERNEL);
>> +        if (!p.vec)
>> +            return -ENOMEM;
>> +    } else {
>> +        p.vec = NULL;
> I find it cleaner to initialize 'p.vec = NULL' unconditionally before
> IS_GET_OP() check.
It'll get updated.

>> +    }
>> +    __start = __end = start;
>> +    while (!ret && __end < end) {
>> +        p.vec_index = 0;
>> +        empty_slots = arg->vec_len - vec_index;
>> +        if (p.vec_len > empty_slots)
>> +            p.vec_len = empty_slots;
>> +
>> +        __end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
>> +        if (__end > end)
>> +            __end = end;
> Easier to understand using min().
Will update.

>> +
>> +        mmap_read_lock(mm);
>> +        ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p);
>> +        mmap_read_unlock(mm);
>> +        if (!(!ret || ret == -ENOSPC))
>
> Double negations complicate things unnecessarily.
>
> And if you already "break" on ret, why do you check the condition in the
> while loop?
Ohh, good catch.

>
>> +            goto free_data;
>> +
>> +        __start = __end;
>> +        if (IS_GET_OP(arg) && p.vec_index) {
>> +            if (copy_to_user(&vec[vec_index], p.vec,
>> +                     p.vec_index * sizeof(struct page_region))) {
>> +                ret = -EFAULT;
>> +                goto free_data;
>> +            }
>> +            vec_index += p.vec_index;
>> +        }
>> +    }
>> +    ret = export_prev_to_out(&p, vec, &vec_index);
>> +    if (!ret)
>> +        ret = vec_index;
>> +free_data:
>> +    if (IS_GET_OP(arg))
>> +        kfree(p.vec);
> Just call it unconditionally.
I didn't know it. I'll do it.

>> +
>> +    return ret;
>> +}
>> +
>> +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd,
>> unsigned long arg)
>> +{
>> +    struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg
>> __user *)arg;
>> +    struct mm_struct *mm = file->private_data;
>> +    struct pagemap_scan_arg argument;
>> +
>> +    if (cmd == PAGEMAP_SCAN) {
>> +        if (copy_from_user(&argument, uarg, sizeof(struct
>> pagemap_scan_arg)))
>> +            return -EFAULT;
>> +        return do_pagemap_cmd(mm, &argument);
>> +    }
>> +    return -EINVAL;
>> +}
>> +
>>   const struct file_operations proc_pagemap_operations = {
>>       .llseek        = mem_lseek, /* borrow this */
>>       .read        = pagemap_read,
>>       .open        = pagemap_open,
>>       .release    = pagemap_release,
>> +    .unlocked_ioctl = pagemap_scan_ioctl,
>> +    .compat_ioctl    = pagemap_scan_ioctl,
>>   };
>>   #endif /* CONFIG_PROC_PAGE_MONITOR */
>>   diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
>> index b7b56871029c..1ae9a8684b48 100644
>> --- a/include/uapi/linux/fs.h
>> +++ b/include/uapi/linux/fs.h
>> @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t;
>>   #define RWF_SUPPORTED    (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
>>                RWF_APPEND)
>>   +/* Pagemap ioctl */
>> +#define PAGEMAP_SCAN    _IOWR('f', 16, struct pagemap_scan_arg)
>> +
>> +/* Bits are set in the bitmap of the page_region and masks in
>> pagemap_scan_args */
>> +#define PAGE_IS_WRITTEN        (1 << 0)
>> +#define PAGE_IS_FILE        (1 << 1)
>> +#define PAGE_IS_PRESENT        (1 << 2)
>> +#define PAGE_IS_SWAPPED        (1 << 3)
>
> These names are way too generic and are likely to be misused for the wrong
> purpose. The "_IS_" part seems confusing as well. So I think the naming
> needs to be fixed and some new type (using typedef) or enum should be
> introduced to hold these flags. I understand it is part of uapi and it is
> less common there, but it is not unheard of and does make things clearer.
Do you think PM_SCAN_PAGE_IS_* would work here?

>
>
>> +
>> +/*
>> + * struct page_region - Page region with bitmap flags
>> + * @start:    Start of the region
>> + * @len:    Length of the region
>> + * @bitmap:    Bits set for the region
>> + */
>> +struct page_region {
>> +    __u64 start;
>> +    __u64 len;
>
> I presume in bytes. Would be useful to mention.
Length of region in pages.

>
>> +    __u64 bitmap;
>> +};
>> +
>> +/*
>> + * struct pagemap_scan_arg - Pagemap ioctl argument
>> + * @start:        Starting address of the region
>> + * @len:        Length of the region (All the pages in this length are
>> included)
>> + * @vec:        Address of page_region struct array for output
>> + * @vec_len:        Length of the page_region struct array
>> + * @max_pages:        Optional max return pages
>> + * @flags:        Flags for the IOCTL
>> + * @required_mask:    Required mask - All of these bits have to be set
>> in the PTE
>> + * @anyof_mask:        Any mask - Any of these bits are set in the PTE
>> + * @excluded_mask:    Exclude mask - None of these bits are set in the PTE
>> + * @return_mask:    Bits that are to be reported in page_region
>> + */
>> +struct pagemap_scan_arg {
>> +    __u64 start;
>> +    __u64 len;
>> +    __u64 vec;
>> +    __u64 vec_len;
>> +    __u32 max_pages;
>> +    __u32 flags;
>> +    __u64 required_mask;
>> +    __u64 anyof_mask;
>> +    __u64 excluded_mask;
>> +    __u64 return_mask;
>> +};
>> +
>> +/* Special flags */
>> +#define PAGEMAP_WP_ENGAGE    (1 << 0)
>> +
>>   #endif /* _UAPI_LINUX_FS_H */

--
BR,
Muhammad Usama Anjum

2023-02-20 13:27:09

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

Hi,

On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
> the info about page table entries. The following operations are supported
> in this ioctl:
> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
> file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
> (PAGE_IS_SWAPPED).
> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
> pages have been written-to.
> - Find pages which have been written-to and write protect the pages
> (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
>
> +/*
> + * struct pagemap_scan_arg - Pagemap ioctl argument
> + * @start: Starting address of the region
> + * @len: Length of the region (All the pages in this length are included)
> + * @vec: Address of page_region struct array for output
> + * @vec_len: Length of the page_region struct array
> + * @max_pages: Optional max return pages
> + * @flags: Flags for the IOCTL
> + * @required_mask: Required mask - All of these bits have to be set in the PTE
> + * @anyof_mask: Any mask - Any of these bits are set in the PTE
> + * @excluded_mask: Exclude mask - None of these bits are set in the PTE
> + * @return_mask: Bits that are to be reported in page_region
> + */
> +struct pagemap_scan_arg {
> + __u64 start;
> + __u64 len;
> + __u64 vec;
> + __u64 vec_len;
> + __u32 max_pages;
> + __u32 flags;
> + __u64 required_mask;
> + __u64 anyof_mask;
> + __u64 excluded_mask;
> + __u64 return_mask;
> +};

After Nadav's comment I've realized I missed the API part :)

A few quick notes for now:
* The arg struct is fixed, so it would be impossible to extend the API
later. Following the clone3() example, I'd add a 'size' field to
pagemap_scan_arg so that it would be possible to add new fields afterwards.
* Please make flags __u64, just in case
* Put size and flags at the beginning of the struct, e.g.

struct pagemap_scan_arg {
        size_t size;
        __u64 flags;
        /* all the rest */
};
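
For illustration, the clone3()-style copy could then look roughly like
this (a sketch only; 'usize' stands for the size value read from
userspace, and the struct is assumed to have grown the suggested 'size'
member):

        struct pagemap_scan_arg arg;
        int err;

        /* Zero-fill fields an older caller did not pass and reject
         * unknown trailing data from a newer userspace. */
        err = copy_struct_from_user(&arg, sizeof(arg), uarg, usize);
        if (err)
                return err;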

> +
> +/* Special flags */
> +#define PAGEMAP_WP_ENGAGE (1 << 0)
> +
> #endif /* _UAPI_LINUX_FS_H */
> --
> 2.30.2
>

--
Sincerely yours,
Mike.

2023-02-21 07:02:36

by Muhammad Usama Anjum

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On 2/20/23 6:26 PM, Mike Rapoport wrote:
> Hi,
>
> On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
>> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
>> the info about page table entries. The following operations are supported
>> in this ioctl:
>> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>> file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>> (PAGE_IS_SWAPPED).
>> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
>> pages have been written-to.
>> - Find pages which have been written-to and write protect the pages
>> (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
>>
>> +/*
>> + * struct pagemap_scan_arg - Pagemap ioctl argument
>> + * @start: Starting address of the region
>> + * @len: Length of the region (All the pages in this length are included)
>> + * @vec: Address of page_region struct array for output
>> + * @vec_len: Length of the page_region struct array
>> + * @max_pages: Optional max return pages
>> + * @flags: Flags for the IOCTL
>> + * @required_mask: Required mask - All of these bits have to be set in the PTE
>> + * @anyof_mask: Any mask - Any of these bits are set in the PTE
>> + * @excluded_mask: Exclude mask - None of these bits are set in the PTE
>> + * @return_mask: Bits that are to be reported in page_region
>> + */
>> +struct pagemap_scan_arg {
>> + __u64 start;
>> + __u64 len;
>> + __u64 vec;
>> + __u64 vec_len;
>> + __u32 max_pages;
>> + __u32 flags;
>> + __u64 required_mask;
>> + __u64 anyof_mask;
>> + __u64 excluded_mask;
>> + __u64 return_mask;
>> +};
>
> After Nadav's comment I've realized I missed the API part :)
>
> A few quick notes for now:
> * The arg struct is fixed, so it would be impossible to extend the API
> later. Following the clone3() example, I'd add a 'size' field to
> pagemap_scan_arg so that it would be possible to add new fields afterwards.
> * Please make flags __u64, just in case
> * Put size and flags at the beginning of the struct, e.g.
>
> struct pagemap_scan_arg {
>         size_t size;
>         __u64 flags;
>         /* all the rest */
> };
Updated. Thank you so much!

>
>> +
>> +/* Special flags */
>> +#define PAGEMAP_WP_ENGAGE (1 << 0)
>> +
>> #endif /* _UAPI_LINUX_FS_H */
>> --
>> 2.30.2
>>
>

--
BR,
Muhammad Usama Anjum

2023-02-21 10:29:07

by Muhammad Usama Anjum

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

Hi Michał,

Thank you so much for comment!

On 2/17/23 8:18 PM, Michał Mirosław wrote:
> On Thu, 2 Feb 2023 at 12:30, Muhammad Usama Anjum
> <[email protected]> wrote:
> [...]
>> - The masks are specified in required_mask, anyof_mask, excluded_mask
>> and return_mask.
> [...]

The interface was suggested by Andrei back in the review of v3 [1]:
> I mean we should be able to specify for what pages we need to get info
> for. An ioctl argument can have these four fields:
> * required bits (rmask & mask == mask) - all bits from this mask have to be set.
> * any of these bits (amask & mask != 0) - any of these bits is set.
> * exclude masks (emask & mask == 0) - none of these bits are set.
> * return mask - bits that have to be reported to user.
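
For example, with the current interface, selecting written, non-swapped
pages and reporting back the written and file bits would look roughly like
this illustrative userspace snippet ('mem', 'mem_size' and 'pagemap_fd'
are placeholders, not names from this series):

        struct page_region regions[32];
        struct pagemap_scan_arg arg = {
                .start = (__u64)(uintptr_t)mem,
                .len = mem_size,
                .vec = (__u64)(uintptr_t)regions,
                .vec_len = 32,
                .required_mask = PAGE_IS_WRITTEN,  /* all of these must be set */
                .excluded_mask = PAGE_IS_SWAPPED,  /* none of these may be set */
                .return_mask = PAGE_IS_WRITTEN | PAGE_IS_FILE,
        };
        int ret = ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);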

>
> May I suggest a slightly modified interface for the flags?
I've added everyone who may be interested in making the interface better.

>
> As I understand, the return_mask is what is applied to page flags to
> aggregate the list.
> This is a separate thing, and I think it doesn't need changes except
> maybe an improvement
> in the documentation and visual distinction.
>
> For the page-selection mechanism, currently required_mask and
> excluded_mask have conflicting
They are opposite of each other:
All the set bits in required_mask must be set for the page to be selected.
All the set bits in excluded_mask must _not_ be set for the page to be
selected.

> responsibilities. I suggest to rework that to:
> 1. negated_flags: page flags which are to be negated before applying
> the page selection using following masks;
Sorry, I'm unable to understand the negation (which is XOR?). Let's look at
the truth table:

Page Flag    negated_flags    Result
    0              0             0
    0              1             1
    1              0             1
    1              1             0

If a page flag is 0 and the negated_flag is 1, the result would be 1, which
has changed the page flag. It isn't making sense to me. Why is the page
flag bit being flipped?

When Andrei proposed these masks, they seemed like a fancy way of filtering
inside the kernel and were straightforward to understand. These masks would
help his use cases for CRIU, so I'd included them. Please can you elaborate
on the purpose of the negation?

> 2. required_flags: flags which all have to be set in the
> (negation-applied) page flags;
> 3. anyof_flags: flags of which at least one has to be set in the
> (negation-applied) page flags;
>
> IOW, the resulting algorithm would be:
>
> tested_flags = page_flags ^ negated_flags;
> if (~tested_flags & required_flags)
>         skip page;
> if (!(tested_flags & anyof_flags))
>         skip page;
>
> aggregate_on(page_flags & return_flags);
>
> Best Regards
> Michał Mirosław

[1] https://lore.kernel.org/all/[email protected]

--
BR,
Muhammad Usama Anjum

2023-02-21 12:42:48

by Michał Mirosław

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum
<[email protected]> wrote:
>
> Hi Michał,
>
> Thank you so much for comment!
>
> On 2/17/23 8:18 PM, Michał Mirosław wrote:
[...]
> > For the page-selection mechanism, currently required_mask and
> > excluded_mask have conflicting
> They are opposite of each other:
> All the set bits in required_mask must be set for the page to be selected.
> All the set bits in excluded_mask must _not_ be set for the page to be
> selected.
>
> > responsibilities. I suggest to rework that to:
> > 1. negated_flags: page flags which are to be negated before applying
> > the page selection using following masks;
> Sorry, I'm unable to understand the negation (which is XOR?). Let's look at
> the truth table:
>
> Page Flag    negated_flags    Result
>     0              0             0
>     0              1             1
>     1              0             1
>     1              1             0
>
> If a page flag is 0 and the negated_flag is 1, the result would be 1, which
> has changed the page flag. It isn't making sense to me. Why is the page
> flag bit being flipped?
>
> When Andrei proposed these masks, they seemed like a fancy way of filtering
> inside the kernel and were straightforward to understand. These masks would
> help his use cases for CRIU, so I'd included them. Please can you elaborate
> on the purpose of the negation?

The XOR is a way to invert the tested value of a flag (from positive
to negative and the other way) without having the API with invalid
values (with required_flags and excluded_flags you need to define a
rule about what happens if a flag is present in both of the masks -
either prioritise one mask over the other or reject the call).
(Note: the XOR is applied only to the value of the flags for the
purpose of testing page-selection criteria.)

So:
1. if a flag is not set in negated_flags, but set in required_flags,
then it means "this flag must be one" - equivalent to it being set in
required_flag (in your current version of the API).
2. if a flag is set in negated_flags and also in required_flags, then
it means "this flag must be zero" - equivalent to it being set in
excluded_flags.

The same thing goes for anyof_flags: if a flag is set in anyof_flags,
then for it to be considered matched:
1. it must have a value of 1 if it is not set in negated_flags
2. it must have a value of 0 if it is set in negated_flags

BTW, I think I assumed that both conditions (all flags in
required_flags and at least one in anyof_flags is present) need to be
true for the page to be selected - is this your intention? The example
code has a bug though, in that if anyof_flags is zero it will never
match. Let me fix the selection part:

// calc. a mask of flags that have expected ("active") values
tested_flags = page_flags ^ negated_flags;
// are all required flags in "active" state? [== all zero when negated]
if (~tested_flags & required_mask)
        skip page;
// is any extra flag "active"?
if (anyof_flags && !(tested_flags & anyof_flags))
        skip page;


Best Regards
Michał Mirosław

2023-02-22 10:11:47

by Muhammad Usama Anjum

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On 2/21/23 5:42 PM, Michał Mirosław wrote:
> On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum
> <[email protected]> wrote:
>>
>> Hi Michał,
>>
>> Thank you so much for comment!
>>
>> On 2/17/23 8:18 PM, Michał Mirosław wrote:
> [...]
>>> For the page-selection mechanism, currently required_mask and
>>> excluded_mask have conflicting
>> They are opposite of each other:
>> All the set bits in required_mask must be set for the page to be selected.
>> All the set bits in excluded_mask must _not_ be set for the page to be
>> selected.
>>
>>> responsibilities. I suggest to rework that to:
>>> 1. negated_flags: page flags which are to be negated before applying
>>> the page selection using following masks;
>> Sorry, I'm unable to understand the negation (which is XOR?). Let's look at
>> the truth table:
>>
>> Page Flag    negated_flags    Result
>>     0              0             0
>>     0              1             1
>>     1              0             1
>>     1              1             0
>>
>> If a page flag is 0 and the negated_flag is 1, the result would be 1, which
>> has changed the page flag. It isn't making sense to me. Why is the page
>> flag bit being flipped?
>>
>> When Andrei proposed these masks, they seemed like a fancy way of filtering
>> inside the kernel and were straightforward to understand. These masks would
>> help his use cases for CRIU, so I'd included them. Please can you elaborate
>> on the purpose of the negation?
>
> The XOR is a way to invert the tested value of a flag (from positive
> to negative and the other way) without having the API with invalid
> values (with required_flags and excluded_flags you need to define a
> rule about what happens if a flag is present in both of the masks -
> either prioritise one mask over the other or reject the call).
At minimum, one mask (required, any or excluded) must be specified. For a
page to get selected, the page flags must fulfill the criterion of all the
specified masks.

If a flag is present in both required_mask and excluded_mask, the
required_mask would select the page, but the excluded_mask would drop it.
So the page would be dropped. It is the user's responsibility to specify
the flags correctly.

matched = true;
if (p->required_mask)
        matched = ((p->required_mask & bitmap) == p->required_mask);
if (matched && p->anyof_mask)
        matched = (p->anyof_mask & bitmap);
if (matched && p->excluded_mask)
        matched = !(p->excluded_mask & bitmap);

if (matched && bitmap) {
        // page selected
}

Do you accept/like this behavior of the masks after the explanation?

> (Note: the XOR is applied only to the value of the flags for the
> purpose of testing page-selection criteria.)
>
> So:
> 1. if a flag is not set in negated_flags, but set in required_flags,
> then it means "this flag must be one" - equivalent to it being set in
> required_flag (in your current version of the API).
> 2. if a flag is set in negated_flags and also in required_flags, then
> it means "this flag must be zero" - equivalent to it being set in
> excluded_flags.
Let's translate the words into a table:

pageflags    required_flags    negated_flags    matched
    1              1                 0            yes
    0              1                 1            yes

>
> The same thing goes for anyof_flags: if a flag is set in anyof_flags,
> then for it to be considered matched:
> 1. it must have a value of 1 if it is not set in negated_flags
> 2. it must have a value of 0 if it is set in negated_flags

pageflags    anyof_flags    negated_flags    matched
    1             1                0           yes
    0             1                1           yes

>
> BTW, I think I assumed that both conditions (all flags in
> required_flags and at least one in anyof_flags is present) need to be
> true for the page to be selected - is this your intention?
All the masks are optional. If all or any of the 3 masks are specified, the
page flags must pass these masks to get selected.

> The example
> code has a bug though, in that if anyof_flags is zero it will never
> match. Let me fix the selection part:
>
> // calc. a mask of flags that have expected ("active") values
> tested_flags = page_flags ^ negated_flags;
> // are all required flags in "active" state? [== all zero when negated]
> if (~tested_flags & required_mask)
>         skip page;
> // is any extra flag "active"?
> if (anyof_flags && !(tested_flags & anyof_flags))
>         skip page;
>
After taking a while to understand this and compare it with the already
present flag system, `negated_flags` is comparatively difficult to
understand, while the already present flags seem easier.

>
> Best Regards
> Michał Mirosław

--
BR,
Muhammad Usama Anjum

2023-02-22 10:44:47

by Michał Mirosław

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On Wed, 22 Feb 2023 at 11:11, Muhammad Usama Anjum
<[email protected]> wrote:
> On 2/21/23 5:42 PM, Michał Mirosław wrote:
> > On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum
> > <[email protected]> wrote:
> >>
> >> Hi Michał,
> >>
> >> Thank you so much for comment!
> >>
> >> On 2/17/23 8:18 PM, Michał Mirosław wrote:
> > [...]
> >>> For the page-selection mechanism, currently required_mask and
> >>> excluded_mask have conflicting
> >> They are opposite of each other:
> >> All the set bits in required_mask must be set for the page to be selected.
> >> All the set bits in excluded_mask must _not_ be set for the page to be
> >> selected.
> >>
> >>> responsibilities. I suggest to rework that to:
> >>> 1. negated_flags: page flags which are to be negated before applying
> >>> the page selection using following masks;
> >> Sorry, I'm unable to understand the negation (which is XOR?). Let's look at
> >> the truth table:
> >>
> >> Page Flag    negated_flags    Result
> >>     0              0             0
> >>     0              1             1
> >>     1              0             1
> >>     1              1             0
> >>
> >> If a page flag is 0 and the negated_flag is 1, the result would be 1, which
> >> has changed the page flag. It isn't making sense to me. Why is the page
> >> flag bit being flipped?
> >>
> >> When Andrei proposed these masks, they seemed like a fancy way of filtering
> >> inside the kernel and were straightforward to understand. These masks would
> >> help his use cases for CRIU, so I'd included them. Please can you elaborate
> >> on the purpose of the negation?
> >
> > The XOR is a way to invert the tested value of a flag (from positive
> > to negative and the other way) without having the API with invalid
> > values (with required_flags and excluded_flags you need to define a
> > rule about what happens if a flag is present in both of the masks -
> > either prioritise one mask over the other or reject the call).
> At minimum, one mask (required, any or excluded) must be specified. For a
> page to get selected, the page flags must fulfill the criterion of all the
> specified masks.

[Please see the comment below.]

[...]
> Let's translate the words into a table:
[Yes, those tables captured the intent correctly.]

> > BTW, I think I assumed that both conditions (all flags in
> > required_flags and at least one in anyof_flags is present) need to be
> > true for the page to be selected - is this your intention?
> All the masks are optional. If all or any of the 3 masks are specified, the
> page flags must pass these masks to get selected.

This explanation contradicts in part the introductory paragraph, but
this version seems more useful as you can pass all masks zero to have
all pages selected.

> > The example
> > code has a bug though, in that if anyof_flags is zero it will never
> > match. Let me fix the selection part:
> >
> > // calc. a mask of flags that have expected ("active") values
> > tested_flags = page_flags ^ negated_flags;
> > // are all required flags in "active" state? [== all zero when negated]
> > if (~tested_flags & required_mask)
> >         skip page;
> > // is any extra flag "active"?
> > if (anyof_flags && !(tested_flags & anyof_flags))
> >         skip page;
> >
> After taking a while to understand this and compare it with the already
> present flag system, `negated_flags` is comparatively difficult to
> understand, while the already present flags seem easier.

Maybe replacing negated_flags in the API with matched_values =
~negated_flags would make this better?

We compare having to understand XOR vs having to understand ordering
of required_flags and excluded_flags.
IOW my proposal is to replace branches in the masks interpretation (if
in one set then matches but if in another set then doesn't; if flags
match ... ) with plain calculation (flag is matching when equals
~negated_flags; if flags match the masks ...).

Best Regards
Michał Mirosław

2023-02-22 11:07:17

by Muhammad Usama Anjum

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On 2/22/23 3:44 PM, Michał Mirosław wrote:
> On Wed, 22 Feb 2023 at 11:11, Muhammad Usama Anjum
> <[email protected]> wrote:
>> On 2/21/23 5:42 PM, Michał Mirosław wrote:
>>> On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum
>>> <[email protected]> wrote:
>>>>
>>>> Hi Michał,
>>>>
>>>> Thank you so much for comment!
>>>>
>>>> On 2/17/23 8:18 PM, Michał Mirosław wrote:
>>> [...]
>>>>> For the page-selection mechanism, currently required_mask and
>>>>> excluded_mask have conflicting
>>>> They are opposite of each other:
>>>> All the set bits in required_mask must be set for the page to be selected.
>>>> All the set bits in excluded_mask must _not_ be set for the page to be
>>>> selected.
>>>>
>>>>> responsibilities. I suggest to rework that to:
>>>>> 1. negated_flags: page flags which are to be negated before applying
>>>>> the page selection using following masks;
>>>> Sorry, I'm unable to understand the negation (which is XOR?). Let's look at
>>>> the truth table:
>>>>
>>>> Page Flag    negated_flags    Result
>>>>     0              0             0
>>>>     0              1             1
>>>>     1              0             1
>>>>     1              1             0
>>>>
>>>> If a page flag is 0 and the negated_flag is 1, the result would be 1, which
>>>> has changed the page flag. It isn't making sense to me. Why is the page
>>>> flag bit being flipped?
>>>>
>>>> When Andrei proposed these masks, they seemed like a fancy way of filtering
>>>> inside the kernel and were straightforward to understand. These masks would
>>>> help his use cases for CRIU, so I'd included them. Please can you elaborate
>>>> on the purpose of the negation?
>>>
>>> The XOR is a way to invert the tested value of a flag (from positive
>>> to negative and the other way) without having the API with invalid
>>> values (with required_flags and excluded_flags you need to define a
>>> rule about what happens if a flag is present in both of the masks -
>>> either prioritise one mask over the other or reject the call).
>> At minimum, one mask (required, any or excluded) must be specified. For a
>> page to get selected, the page flags must fulfill the criterion of all the
>> specified masks.
>
> [Please see the comment below.]
>
> [...]
>> Let's translate the words into a table:
> [Yes, those tables captured the intent correctly.]
>
>>> BTW, I think I assumed that both conditions (all flags in
>>> required_flags and at least one in anyof_flags is present) need to be
>>> true for the page to be selected - is this your intention?
>> All the masks are optional. If all or any of the 3 masks are specified, the
>> page flags must pass these masks to get selected.
>
> This explanation contradicts in part the introductory paragraph, but
> this version seems more useful as you can pass all masks zero to have
> all pages selected.
Sorry, I wrote that wrongly. (Not all the masks are optional.) Let me
rephrase: at least one of the 3 masks (required, anyof, exclude) must be
specified, and return_mask must always be specified. An error is returned
if all 3 masks (required, anyof, exclude) are zero or if return_mask is
zero.
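
In code, that rule amounts to a check like this (sketch of the intent, not
necessarily the final code):

        if ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) ||
            !arg->return_mask)
                return -EINVAL;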

>
>>> The example
>>> code has a bug though, in that if anyof_flags is zero it will never
>>> match. Let me fix the selection part:
>>>
>>> // calc. a mask of flags that have expected ("active") values
>>> tested_flags = page_flags ^ negated_flags;
>>> // are all required flags in "active" state? [== all zero when negated]
>>> if (~tested_flags & required_mask)
>>>         skip page;
>>> // is any extra flag "active"?
>>> if (anyof_flags && !(tested_flags & anyof_flags))
>>>         skip page;
>>>
>> After taking a while to understand this and compare it with the already
>> present flag system, `negated_flags` is comparatively difficult to
>> understand, while the already present flags seem easier.
>
> Maybe replacing negated_flags in the API with matched_values =
> ~negated_flags would make this better?
>
> We compare having to understand XOR vs having to understand ordering
> of required_flags and excluded_flags.
There is no ordering in the current masks scheme; no mask takes precedence.
For a page to get selected, the definitions of all the masks must be
fulfilled. You have come up with a good example: what if required_mask ==
excluded_mask? In that case, no page will fulfill the criterion and hence
no page would be selected. It is the user's fault if he doesn't understand
the definitions of these masks correctly.

Now thinking about it, I can add an error check which would return an
error if a bit matches in both the required and excluded masks. Would you
like that? Let's put this check in place.
(Previously I'd left this to the user's wisdom. If he specifies the same
bits in both masks, he'll get no addresses out of the syscall.)
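
Something like this (sketch of the proposed check):

        /* A bit that is both required and excluded can never match. */
        if (arg->required_mask & arg->excluded_mask)
                return -EINVAL;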

> IOW my proposal is to replace branches in the masks interpretation (if
> in one set then matches but if in another set then doesn't; if flags
> match ... ) with plain calculation (flag is matching when equals
> ~negated_flags; if flags match the masks ...).
>
> Best Regards
> Michał Mirosław

--
BR,
Muhammad Usama Anjum

2023-02-22 11:48:34

by Michał Mirosław

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On Wed, 22 Feb 2023 at 12:06, Muhammad Usama Anjum
<[email protected]> wrote:
>
> On 2/22/23 3:44 PM, Michał Mirosław wrote:
> > On Wed, 22 Feb 2023 at 11:11, Muhammad Usama Anjum
> > <[email protected]> wrote:
> >> On 2/21/23 5:42 PM, Michał Mirosław wrote:
> >>> On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum
> >>> <[email protected]> wrote:
> >>>>
> >>>> Hi Michał,
> >>>>
> >>>> Thank you so much for comment!
> >>>>
> >>>> On 2/17/23 8:18 PM, Michał Mirosław wrote:
> >>> [...]
> >>>>> For the page-selection mechanism, currently required_mask and
> >>>>> excluded_mask have conflicting
> >>>> They are opposite of each other:
> >>>> All the set bits in required_mask must be set for the page to be selected.
> >>>> All the set bits in excluded_mask must _not_ be set for the page to be
> >>>> selected.
> >>>>
> >>>>> responsibilities. I suggest to rework that to:
> >>>>> 1. negated_flags: page flags which are to be negated before applying
> >>>>> the page selection using following masks;
> >>>> Sorry, I'm unable to understand the negation (which is XOR?). Let's look at
> >>>> the truth table:
> >>>>
> >>>> Page Flag    negated_flags    Result
> >>>>     0              0             0
> >>>>     0              1             1
> >>>>     1              0             1
> >>>>     1              1             0
> >>>>
> >>>> If a page flag is 0 and the negated_flag is 1, the result would be 1, which
> >>>> has changed the page flag. It isn't making sense to me. Why is the page
> >>>> flag bit being flipped?
> >>>>
> >>>> When Andrei proposed these masks, they seemed like a fancy way of filtering
> >>>> inside the kernel and were straightforward to understand. These masks would
> >>>> help his use cases for CRIU, so I'd included them. Please can you elaborate
> >>>> on the purpose of the negation?
> >>>
> >>> The XOR is a way to invert the tested value of a flag (from positive
> >>> to negative and the other way) without having the API with invalid
> >>> values (with required_flags and excluded_flags you need to define a
> >>> rule about what happens if a flag is present in both of the masks -
> >>> either prioritise one mask over the other or reject the call).
> >> At minimum, one mask (required, any or excluded) must be specified. For a
> >> page to get selected, the page flags must fulfill the criterion of all the
> >> specified masks.
> >
> > [Please see the comment below.]
> >
> > [...]
> >> Let's translate the words into a table:
> > [Yes, those tables captured the intent correctly.]
> >
> >>> BTW, I think I assumed that both conditions (all flags in
> >>> required_flags and at least one in anyof_flags is present) need to be
> >>> true for the page to be selected - is this your intention?
> >> All the masks are optional. If all or any of the 3 masks are specified, the
> >> page flags must pass these masks to get selected.
> >
> > This explanation contradicts in part the introductory paragraph, but
> > this version seems more useful as you can pass all masks zero to have
> > all pages selected.
> Sorry, I wrote that wrongly. (Not all the masks are optional.) Let me
> rephrase: at least one of the 3 masks (required, anyof, exclude) must be
> specified, and return_mask must always be specified. An error is returned
> if all 3 masks (required, anyof, exclude) are zero or if return_mask is
> zero.

Why do you need those restrictions? I'd guess it is valid to request a
list of all pages with zero return_mask - this will return a compact
list of used ranges of the virtual address space.

> >> After taking a while to understand this and compare it with the already
> >> present flag system, `negated_flags` is comparatively difficult to
> >> understand, while the already present flags seem easier.
> >
> > Maybe replacing negated_flags in the API with matched_values =
> > ~negated_flags would make this better?
> >
> > We compare having to understand XOR vs having to understand ordering
> > of required_flags and excluded_flags.
> There is no ordering in the current masks scheme; no mask takes precedence.
> For a page to get selected, the definitions of all the masks must be
> fulfilled. You have come up with a good example: what if required_mask ==
> excluded_mask? In that case, no page will fulfill the criterion and hence
> no page would be selected. It is the user's fault if he doesn't understand
> the definitions of these masks correctly.
>
> Now thinking about it, I can add an error check which would return an
> error if a bit matches in both the required and excluded masks. Would you
> like that? Let's put this check in place.
> (Previously I'd left this to the user's wisdom. If he specifies the same
> bits in both masks, he'll get no addresses out of the syscall.)

This error case is (one of) the problems I propose avoiding. You also
need much more text to describe the required/excluded flags interactions
and edge cases than saying that a flag must have a value equal to the
corresponding bit in ~negated_flags to be matched by the required/anyof
masks.
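
To make the equivalence concrete, the current pair of masks maps onto my
proposal roughly as follows (illustrative):

        negated_flags  = excluded_mask;
        required_flags = required_mask | excluded_mask;
        /* a page then matches iff
         *   ((page_flags ^ negated_flags) & required_flags) == required_flags,
         * i.e. required bits must be 1 and excluded bits must be 0. */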

> > IOW my proposal is to replace branches in the masks interpretation (if
> > in one set then matches but if in another set then doesn't; if flags
> > match ... ) with plain calculation (flag is matching when equals
> > ~negated_flags; if flags match the masks ...).

Best Regards
Michał Mirosław

2023-02-22 19:10:40

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs



> On Feb 20, 2023, at 5:24 AM, Muhammad Usama Anjum <[email protected]> wrote:
>
>>> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>>> + unsigned long end, struct mm_walk *walk)
>>> +{
>>> + struct pagemap_scan_private *p = walk->private;
>>> + struct vm_area_struct *vma = walk->vma;
>>> + unsigned long addr = end;
>>> + spinlock_t *ptl;
>>> + int ret = 0;
>>> + pte_t *pte;
>>> +
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> + ptl = pmd_trans_huge_lock(pmd, vma);
>>> + if (ptl) {
>>> + bool pmd_wt;
>>> +
>>> + pmd_wt = !is_pmd_uffd_wp(*pmd);
>>> + /*
>>> + * Break the huge page into small pages if the operation to be
>>> + * performed is on a portion of the huge page.
>>> + */
>>> + if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
>>> + spin_unlock(ptl);
>>> + split_huge_pmd(vma, pmd, start);
>>> + goto process_smaller_pages;
>> I think that such goto's are really confusing and should be avoided.
>> Using 'else' could have easily prevented the need for the goto. It is not
>> the best solution though, since I think it would have been better to
>> invert the conditions.
> Yeah, an else can be used here. But then we'd have to add a tab to all
> the code after it. We already have so many tabs and very little space
> left to write code. Not sure which is better.

goto’s are usually not the right solution. You can extract things into a different
function if you have to.

I’m not sure why IS_GET_OP(p) might be false and what’s the meaning of taking the
lock and dropping it in such a case. I think that the code can be simplified and
additional condition nesting can be avoided.
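
As an illustration only (reusing this patch's own helpers; not a tested or
final version), the THP handling could be hoisted into its own function,
with the goto replaced by a distinguished return value:

static int pagemap_scan_thp_entry(pmd_t *pmd, unsigned long start,
                                  unsigned long end, struct mm_walk *walk,
                                  spinlock_t *ptl)
{
        struct pagemap_scan_private *p = walk->private;
        struct vm_area_struct *vma = walk->vma;
        bool pmd_wt = !is_pmd_uffd_wp(*pmd);
        int ret = 0;

        /* Split if only a part of the huge page is to be write protected. */
        if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
                spin_unlock(ptl);
                split_huge_pmd(vma, pmd, start);
                return -EAGAIN; /* caller falls through to the PTE loop */
        }
        if (IS_GET_OP(p))
                ret = pagemap_scan_output(pmd_wt, vma->vm_file,
                                          pmd_present(*pmd), is_swap_pmd(*pmd),
                                          p, start, (end - start) / PAGE_SIZE);
        spin_unlock(ptl);
        if (!ret && pmd_wt && IS_WP_ENGAGE_OP(p))
                uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true);
        return ret;
}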

>>> --- a/include/uapi/linux/fs.h
>>> +++ b/include/uapi/linux/fs.h
>>> @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t;
>>> #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
>>> RWF_APPEND)
>>> +/* Pagemap ioctl */
>>> +#define PAGEMAP_SCAN _IOWR('f', 16, struct pagemap_scan_arg)
>>> +
>>> +/* Bits are set in the bitmap of the page_region and masks in
>>> pagemap_scan_args */
>>> +#define PAGE_IS_WRITTEN (1 << 0)
>>> +#define PAGE_IS_FILE (1 << 1)
>>> +#define PAGE_IS_PRESENT (1 << 2)
>>> +#define PAGE_IS_SWAPPED (1 << 3)
>>
>> These names are way too generic and are likely to be misused for the wrong
>> purpose. The "_IS_" part seems confusing as well. So I think the naming
>> needs to be fixed and some new type (using typedef) or enum should be
>> introduced to hold these flags. I understand it is part of uapi and it is
>> less common there, but it is not unheard of and does make things clearer.
> Do you think PM_SCAN_PAGE_IS_* would work here?

Can we lose the IS somehow?

>
>>
>>
>>> +
>>> +/*
>>> + * struct page_region - Page region with bitmap flags
>>> + * @start: Start of the region
>>> + * @len: Length of the region
>>> + * @bitmap: Bits set for the region
>>> + */
>>> +struct page_region {
>>> + __u64 start;
>>> + __u64 len;
>>
>> I presume in bytes. Would be useful to mention.
> Length of region in pages.

Very unintuitive to me I must say. If the start is an address, I would expect
the len to be in bytes.

2023-02-23 06:44:23

by Muhammad Usama Anjum

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On 2/22/23 4:48 PM, Michał Mirosław wrote:
> On Wed, 22 Feb 2023 at 12:06, Muhammad Usama Anjum
> <[email protected]> wrote:
>>
>> On 2/22/23 3:44 PM, Michał Mirosław wrote:
>>> On Wed, 22 Feb 2023 at 11:11, Muhammad Usama Anjum
>>> <[email protected]> wrote:
>>>> On 2/21/23 5:42 PM, Michał Mirosław wrote:
>>>>> On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Hi Michał,
>>>>>>
>>>>>> Thank you so much for comment!
>>>>>>
>>>>>> On 2/17/23 8:18 PM, Michał Mirosław wrote:
>>>>> [...]
>>>>>>> For the page-selection mechanism, currently required_mask and
>>>>>>> excluded_mask have conflicting
>>>>>> They are opposite of each other:
>>>>>> All the set bits in required_mask must be set for the page to be selected.
>>>>>> All the set bits in excluded_mask must _not_ be set for the page to be
>>>>>> selected.
>>>>>>
>>>>>>> responsibilities. I suggest to rework that to:
>>>>>>> 1. negated_flags: page flags which are to be negated before applying
>>>>>>> the page selection using following masks;
>>>>>> Sorry, I'm unable to understand the negation (which is XOR?). Let's look at
>>>>>> the truth table:
>>>>>>
>>>>>> Page Flag    negated_flags    Result
>>>>>>     0              0             0
>>>>>>     0              1             1
>>>>>>     1              0             1
>>>>>>     1              1             0
>>>>>>
>>>>>> If a page flag is 0 and the negated_flag is 1, the result would be 1, which
>>>>>> has changed the page flag. It isn't making sense to me. Why is the page
>>>>>> flag bit being flipped?
>>>>>>
>>>>>> When Andrei proposed these masks, they seemed like a fancy way of filtering
>>>>>> inside the kernel and were straightforward to understand. These masks would
>>>>>> help his use cases for CRIU, so I'd included them. Please can you elaborate
>>>>>> on the purpose of the negation?
>>>>>
>>>>> The XOR is a way to invert the tested value of a flag (from positive
>>>>> to negative and the other way) without having the API with invalid
>>>>> values (with required_flags and excluded_flags you need to define a
>>>>> rule about what happens if a flag is present in both of the masks -
>>>>> either prioritise one mask over the other or reject the call).
>>>> At minimum, one mask (required, any or excluded) must be specified. For a
>>>> page to get selected, the page flags must fulfill the criterion of all the
>>>> specified masks.
>>>
>>> [Please see the comment below.]
>>>
>>> [...]
>>>> Lets translate words into table:
>>> [Yes, those tables captured the intent correctly.]
>>>
>>>>> BTW, I think I assumed that both conditions (all flags in
>>>>> required_flags and at least one in anyof_flags is present) need to be
>>>>> true for the page to be selected - is this your intention?
>>>> All the masks are optional. If all or any of the 3 masks are specified, the
>>>> page flags must pass these masks to get selected.
>>>
>>> This explanation contradicts in part the introductory paragraph, but
>>> this version seems more useful as you can pass all masks zero to have
>>> all pages selected.
>> Sorry, I stated that incorrectly. (The masks are not all optional.) Let me
>> rephrase: at least one of the three masks (required, anyof, excluded) must
>> be specified, and return_mask must always be specified. An error is
>> returned if all three masks (required, anyof, excluded) are zero or if
>> return_mask is zero.
>
> Why do you need those restrictions? I'd guess it is valid to request a
> list of all pages with zero return_mask - this will return a compact
> list of used ranges of the virtual address space.
At the moment we support 4 flags (PAGE_IS_WRITTEN, PAGE_IS_FILE,
PAGE_IS_PRESENT and PAGE_IS_SWAPPED). The idea is that the user mentions
the flags of interest in return_mask. If only one flag is wanted, the user
specifies just that one. Admittedly, with a single flag of interest,
mentioning it in the return mask adds no information. But we want
uniformity: with two or more flags of interest, return_mask becomes
compulsory. So to keep things simple and generic for any number of flags,
return_mask must be specified even if there is only one flag of interest.

>
>>>> After taking a while to understand this and compare with already present
>>>> flag system, `negated flags` is comparatively difficult to understand while
>>>> already present flags seem easier.
>>>
>>> Maybe replacing negated_flags in the API with matched_values =
>>> ~negated_flags would make this better?
>>>
>>> We compare having to understand XOR vs having to understand ordering
>>> of required_flags and excluded_flags.
>> There is no ordering in the current mask scheme; no mask takes precedence.
>> For a page to get selected, the conditions of all the masks must be
>> fulfilled. You have come up with a good example: what if required_mask ==
>> excluded_mask? In that case no page will fulfill the criteria and hence no
>> page would be selected. It is the user's fault for not understanding the
>> definitions of these masks correctly.
>>
>> Now thinking about it, I can add an error check which returns an error if
>> any bit is set in both the required and excluded masks. Would you like
>> that? Let's put this check in place.
>> (Previously I'd left this to the user's wisdom: specifying the same bits
>> in both masks just yields no addresses from the syscall.)
>
> This error case is (one of) the problems I propose avoiding. You also
> need much more text to describe the requred/excluded flags
> interactions and edge cases than saying that a flag must have a value
> equal to corresponding bit in ~negated_flags to be matched by
> requried/anyof masks.
I've found excluded_mask very intuitive compared to negated_mask, which is
so difficult to understand that I don't know how to use it correctly.
Let's take an example: I want pages which are PAGE_IS_WRITTEN and are not
PAGE_IS_FILE. In addition, the pages must be PAGE_IS_PRESENT or
PAGE_IS_SWAPPED. This can be specified as:

required_mask = PAGE_IS_WRITTEN
excluded_mask = PAGE_IS_FILE
anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAPPED

(a) assume page_flags = 0b1111
skip the page, as 0b1111 & 0b0010 is non-zero

(b) assume page_flags = 0b1001
select the page, as 0b1001 & 0b0010 is zero

It seems intuitive, right? How would you achieve the same thing with
negated_mask?

required_mask = PAGE_IS_WRITTEN
negated_mask = PAGE_IS_FILE
anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAPPED

(1) assume page_flags = 0b1111
tested_flags = 0b1111 ^ 0b0010 = 0b1101

(2) assume page_flags = 0b1001
tested_flags = 0b1001 ^ 0b0010 = 0b1011

In (1), we wanted to skip pages which have PAGE_IS_FILE set. But
negated_mask has just flipped that bit, the page still gets tested for
selection, and it would get selected. That is wrong.

In (2), the PAGE_IS_FILE bit of page_flags was 0 and ends up set to 1
(i.e. PAGE_IS_FILE) in tested_flags.
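
For reference, the three-mask selection rule described above can be written out as a sketch (this is the semantics under discussion, not code from the patch; the -EINVAL rejection is the check proposed earlier in the thread):

```c
#include <errno.h>
#include <stdbool.h>

/* Current scheme: every required bit set, no excluded bit set, and at
 * least one anyof bit set when an anyof mask is given. */
static bool page_selected(unsigned long page_flags,
			  unsigned long required_mask,
			  unsigned long anyof_mask,
			  unsigned long excluded_mask)
{
	if ((page_flags & required_mask) != required_mask)
		return false;
	if (anyof_mask && !(page_flags & anyof_mask))
		return false;
	if (page_flags & excluded_mask)
		return false;
	return true;
}

/* Reject conflicting masks up front instead of silently matching nothing. */
static int validate_masks(unsigned long required_mask,
			  unsigned long excluded_mask)
{
	return (required_mask & excluded_mask) ? -EINVAL : 0;
}
```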

>
>>> IOW my proposal is to replace branches in the masks interpretation (if
>>> in one set then matches but if in another set then doesn't; if flags
>>> match ... ) with plain calculation (flag is matching when equals
>>> ~negated_flags; if flags match the masks ...).
>
> Best Regards
> Michał Mirosław

--
BR,
Muhammad Usama Anjum

2023-02-23 07:10:58

by Muhammad Usama Anjum

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

Hi Nadav, Mike, Michał,

Can you please share your thoughts at [A] below?

On 2/23/23 12:10 AM, Nadav Amit wrote:
>
>
>> On Feb 20, 2023, at 5:24 AM, Muhammad Usama Anjum <[email protected]> wrote:
>>
>>>> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>>>> + unsigned long end, struct mm_walk *walk)
>>>> +{
>>>> + struct pagemap_scan_private *p = walk->private;
>>>> + struct vm_area_struct *vma = walk->vma;
>>>> + unsigned long addr = end;
>>>> + spinlock_t *ptl;
>>>> + int ret = 0;
>>>> + pte_t *pte;
>>>> +
>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> + ptl = pmd_trans_huge_lock(pmd, vma);
>>>> + if (ptl) {
>>>> + bool pmd_wt;
>>>> +
>>>> + pmd_wt = !is_pmd_uffd_wp(*pmd);
>>>> + /*
>>>> + * Break the huge page into small pages if the operation to be
>>>> + * performed touches only a portion of the huge page.
>>>> + */
>>>> + if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
>>>> + spin_unlock(ptl);
>>>> + split_huge_pmd(vma, pmd, start);
>>>> + goto process_smaller_pages;
>>> I think that such gotos are really confusing and should be avoided; using
>>> an 'else' could have easily prevented the need for the goto. It is not the
>>> best solution though, since I think it would have been better to invert
>>> the conditions.
>> Yeah, an else can be used here. But then we'd have to add a tab to all the
>> code inside it. We already have so many tabs and very little room left to
>> write code. Not sure which is better.
>
> goto’s are usually not the right solution. You can extract things into a different
> function if you have to.
>
> I’m not sure why IS_GET_OP(p) might be false and what’s the meaning of taking the
> lock and dropping it in such a case. I think that the code can be simplified and
> additional condition nesting can be avoided.
The lock is taken and we check whether the pmd has UFFD_WP set. In the next
version, the GET check has been removed as we have dropped the WP_ENGAGE +
!GET operation. So GET is always specified and the condition isn't needed.

Please comment on next version if you want anything more optimized.
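
As an illustration of the extraction Nadav suggests, the THP branch could move into a helper that tells the caller whether to fall through to the per-PTE loop, which removes both the goto and a level of nesting. The identifiers are taken from the quoted patch; the body is only a sketch, not the actual next version:

```c
/* Returns true if the caller should process the range as small pages
 * (no huge pmd here, or the huge pmd was just split), false if the
 * huge page was fully handled under the lock. */
static bool pagemap_scan_try_thp(pmd_t *pmd, unsigned long start,
				 unsigned long end, struct mm_walk *walk)
{
	struct pagemap_scan_private *p = walk->private;
	struct vm_area_struct *vma = walk->vma;
	spinlock_t *ptl;

	ptl = pmd_trans_huge_lock(pmd, vma);
	if (!ptl)
		return true;

	if (!is_pmd_uffd_wp(*pmd) && IS_WP_ENGAGE_OP(p) &&
	    end - start < HPAGE_SIZE) {
		spin_unlock(ptl);
		split_huge_pmd(vma, pmd, start);
		return true;
	}

	/* ... report and/or write-protect the whole huge page ... */
	spin_unlock(ptl);
	return false;
}
```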

>
>>>> --- a/include/uapi/linux/fs.h
>>>> +++ b/include/uapi/linux/fs.h
>>>> @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t;
>>>> #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
>>>> RWF_APPEND)
>>>> +/* Pagemap ioctl */
>>>> +#define PAGEMAP_SCAN _IOWR('f', 16, struct pagemap_scan_arg)
>>>> +
>>>> +/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */
>>>> +#define PAGE_IS_WRITTEN (1 << 0)
>>>> +#define PAGE_IS_FILE (1 << 1)
>>>> +#define PAGE_IS_PRESENT (1 << 2)
>>>> +#define PAGE_IS_SWAPPED (1 << 3)
>>>
>>> These names are way too generic and are likely to be misused for the wrong
>>> purpose. The "_IS_" part seems confusing as well. So I think the naming
>>> needs to be fixed and some new type (using typedef) or enum should be
>>> introduced to hold these flags. I understand it is part of uapi and it is
>>> less common there, but it is not unheard of and does make things clearer.
>> Do you think PM_SCAN_PAGE_IS_* work here?
>
> Can we lose the IS somehow?
[A] Do you think these names would work better: PM_SCAN_WRITTEN_PAGE,
PM_SCAN_FILE_PAGE, PM_SCAN_SWAP_PAGE, PM_SCAN_PRESENT_PAGE?

>
>>
>>>
>>>
>>>> +
>>>> +/*
>>>> + * struct page_region - Page region with bitmap flags
>>>> + * @start: Start of the region
>>>> + * @len: Length of the region
>>>> + * @bitmap: Bits set for the region
>>>> + */
>>>> +struct page_region {
>>>> + __u64 start;
>>>> + __u64 len;
>>>
>>> I presume in bytes. Would be useful to mention.
>> Length of region in pages.
>
> Very unintuitive to me I must say. If the start is an address, I would expect
> the len to be in bytes.
The PAGEMAP_SCAN ioctl works at page granularity: we tell the user whether
a page has certain flags or not. Keeping the length in bytes doesn't make
sense. For example, a returned region with len = 3 and 4 KiB pages covers
12 KiB starting at start.

>

--
BR,
Muhammad Usama Anjum

2023-02-23 08:42:17

by Michał Mirosław

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On Thu, 23 Feb 2023 at 07:44, Muhammad Usama Anjum
<[email protected]> wrote:
>
> On 2/22/23 4:48 PM, Michał Mirosław wrote:
> > On Wed, 22 Feb 2023 at 12:06, Muhammad Usama Anjum
> > <[email protected]> wrote:
[...]
> >>>>> BTW, I think I assumed that both conditions (all flags in
> >>>>> required_flags and at least one in anyof_flags is present) need to be
> >>>>> true for the page to be selected - is this your intention?
> >>>> All the masks are optional. If all or any of the 3 masks are specified, the
> >>>> page flags must pass these masks to get selected.
> >>>
> >>> This explanation contradicts in part the introductory paragraph, but
> >>> this version seems more useful as you can pass all masks zero to have
> >>> all pages selected.
> >> Sorry, I wrote it wrongly. (All the masks are not optional.) Let me
> >> rephrase. All or at least any 1 of the 3 masks (required, any, exclude)
> >> must be specified. The return_mask must always be specified. Error is
> >> returned if all 3 masks (required, anyof, exclude) are zero or return_mask
> >> is zero.
> >
> > Why do you need those restrictions? I'd guess it is valid to request a
> > list of all pages with zero return_mask - this will return a compact
> > list of used ranges of the virtual address space.
> At the moment we support 4 flags (PAGE_IS_WRITTEN, PAGE_IS_FILE,
> PAGE_IS_PRESENT and PAGE_IS_SWAPPED). The idea is that the user mentions
> the flags of interest in return_mask. If only one flag is wanted, the user
> specifies just that one. Admittedly, with a single flag of interest,
> mentioning it in the return mask adds no information. But we want
> uniformity: with two or more flags of interest, return_mask becomes
> compulsory. So to keep things simple and generic for any number of flags,
> return_mask must be specified even if there is only one flag of interest.

I'm not sure why we want uniformity in the case of 1 flag. If a
user specifies a single required flag, I'd expect he doesn't need to
look at the flags returned, as those will duplicate the information
conveyed by the mere presence of a page. A user might also require a
single flag but want all of them returned. Both requests - return 1
flag and return 0 flags - would give meaningful output, so why force
one way or the other? Allowing both also enables users to express
their intent: they need either just a list of pages, or a list with
per-page flags - the need would follow from the code structure or
other factors.

> >>>> After taking a while to understand this and compare with already present
> >>>> flag system, `negated flags` is comparatively difficult to understand while
> >>>> already present flags seem easier.
> >>>
> >>> Maybe replacing negated_flags in the API with matched_values =
> >>> ~negated_flags would make this better?
> >>>
> >>> We compare having to understand XOR vs having to understand ordering
> >>> of required_flags and excluded_flags.
> >> There is no ordering in current masks scheme. No mask is preferable. For a
> >> page to get selected, all the definitions of the masks must be fulfilled.
> >> You have come up with good example that what if required_mask =
> >> exclude_mask. In this case, no page will fulfill the criterion and hence no
> >> page would be selected. It is user's fault that he isn't understanding the
> >> definitions of these masks correctly.
> >>
> >> Now thinking about it, I can add a error check which would return error if
> >> a bit in required and excluded masks matches. Would you like it? Lets put
> >> this check in place.
> >> (Previously I'd left it for user's wisdom not to do this. If he'll specify
> >> same masks in them, he'll get no addresses out of the syscall.)
> >
> > This error case is (one of) the problems I propose avoiding. You also
> > need much more text to describe the requred/excluded flags
> > interactions and edge cases than saying that a flag must have a value
> > equal to corresponding bit in ~negated_flags to be matched by
> > requried/anyof masks.
> I've found excluded_mask very intuitive as compared to negated_mask which
> is so difficult to understand that I don't know how to use it correctly.
> Lets take an example, I want pages which are PAGE_IS_WRITTEN and are not
> PAGE_IS_FILE. In addition, the pages must be PAGE_IS_PRESENT or
> PAGE_IS_SWAPPED. This can be specified as:
>
> required_mask = PAGE_IS_WRITTEN
> excluded_mask = PAGE_IS_FILE
> anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAPPED
>
> (a) assume page_flags = 0b1111
> skip page as 0b1111 & 0b0010 = true
>
> (b) assume page_flags = 0b1001
> select page as 0b1001 & 0b0010 = false
>
> It seemed intuitive. Right? How would you achieve same thing with negated_mask?
>
> required_mask = PAGE_IS_WRITTEN
> negated_mask = PAGE_IS_FILE
> anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAPPED
>
> (1) assume page_flags = 0b1111
> tested_flags = 0b1111 ^ 0b0010 = 0b1101
>
> (2) assume page_flags = 0b1001
> tested_flags = 0b1001 ^ 0b0010 = 0b1011
>
> In (1), we wanted to skip pages which have PAGE_IS_FILE set. But
> negated_mask has just masked it and page is still getting tested if it
> should be selected and it would get selected. It is wrong.
>
> In (2), the PAGE_IS_FILE bit of page_flags was 0 and got updated to 1 or
> PAGE_IS_FILE in tested_flags.

I require flags PAGE_IS_WRITTEN=1, PAGE_IS_FILE=0, so:

required_mask = PAGE_IS_WRITTEN | PAGE_IS_FILE;
negated_flags = PAGE_IS_FILE; // flags I want zero

I also require one of PAGE_IS_PRESENT=1 or PAGE_IS_SWAPPED=1, so:

anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAPPED;

Another case: I want to analyse a process' working set:

required_mask = 0;
negated_flags = PAGE_IS_FILE;
anyof_mask = PAGE_IS_FILE | PAGE_IS_WRITTEN;

-> gathering pages modified [WRITTEN=1] or not backed by a file [FILE=0].

To clarify a bit: negated_flags doesn't mask anything; the field inverts
the values of the flags (marks some as "active low", to use an
electronic-signal analogy).
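
Spelled out as code, the scheme Michał describes might look like the sketch below: the flag values are XORed with negated_flags once, and only the required/anyof tests remain, so a separate excluded mask (and the conflicting-bits edge case) disappears.

```c
#include <stdbool.h>

/* negated_flags scheme: a flag matches when its value equals the
 * corresponding bit of ~negated_flags, so both masks are applied to
 * (page_flags ^ negated_flags). */
static bool page_selected(unsigned long page_flags,
			  unsigned long required_mask,
			  unsigned long anyof_mask,
			  unsigned long negated_flags)
{
	unsigned long tested = page_flags ^ negated_flags;

	if ((tested & required_mask) != required_mask)
		return false;
	if (anyof_mask && !(tested & anyof_mask))
		return false;
	return true;
}
```

With required_mask = PAGE_IS_WRITTEN | PAGE_IS_FILE and negated_flags = PAGE_IS_FILE, this selects pages with WRITTEN=1 and FILE=0, matching the example above.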

Best Regards
Michał Mirosław

2023-02-23 09:23:58

by Muhammad Usama Anjum

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On 2/23/23 1:41 PM, Michał Mirosław wrote:
> On Thu, 23 Feb 2023 at 07:44, Muhammad Usama Anjum
> <[email protected]> wrote:
>>
>> On 2/22/23 4:48 PM, Michał Mirosław wrote:
>>> On Wed, 22 Feb 2023 at 12:06, Muhammad Usama Anjum
>>> <[email protected]> wrote:
> [...]
>>>>>>> BTW, I think I assumed that both conditions (all flags in
>>>>>>> required_flags and at least one in anyof_flags is present) need to be
>>>>>>> true for the page to be selected - is this your intention?
>>>>>> All the masks are optional. If all or any of the 3 masks are specified, the
>>>>>> page flags must pass these masks to get selected.
>>>>>
>>>>> This explanation contradicts in part the introductory paragraph, but
>>>>> this version seems more useful as you can pass all masks zero to have
>>>>> all pages selected.
>>>> Sorry, I wrote it wrongly. (All the masks are not optional.) Let me
>>>> rephrase. All or at least any 1 of the 3 masks (required, any, exclude)
>>>> must be specified. The return_mask must always be specified. Error is
>>>> returned if all 3 masks (required, anyof, exclude) are zero or return_mask
>>>> is zero.
>>>
>>> Why do you need those restrictions? I'd guess it is valid to request a
>>> list of all pages with zero return_mask - this will return a compact
>>> list of used ranges of the virtual address space.
>> At the time, we are supporting 4 flags (PAGE_IS_WRITTEN, PAGE_IS_FILE,
>> PAGE_IS_PRESENT and PAGE_IS_SWAPPED). The idea is that user mention his
>> flags of interest in the return_mask. If he wants only 1 flag, he'll
>> specify it. Definitely if user wants only 1 flag, initially it doesn't make
>> any sense to mention in the return mask. But we want uniformity. If user
>> want, 2 or more flags in returned, return_mask becomes compulsory. So to
>> keep things simple and generic for any number of flags of interest
>> returned, the return_mask must be specified even if the flag of interest is
>> only 1.
>
> I'm not sure why do we want uniformity in the case of 1 flag? If a
> user specifies a single required flag, I'd expect he doesn't need to
> look at the flags returned as those will duplicate the information
> from mere presence of a page. A user might also require a single flag,
> but want all of them returned. Both requests - return 1 flag and
> return 0 flags would give meaningful output, so why force one way or
> the other? Allowing two will also enable users to express the intent:
> they need either just a list of pages, or they need a list with
> per-page flags - the need would follow from the code structure or
> other factors.
We can add as much flexibility as people ask for while keeping the code
simple. But it is going to be ugly to add an error check which only allows
return_mask == 0 when there is exactly 1 flag of interest mentioned by the
user. The error check below is essential to return deterministic output. Do
you think this case is worth supporting, instead of keeping things uniform
for both the 1-flag and multi-flag cases?
if (return_mask == 0 && hweight_long(required_mask | any_mask) != 1)
return error;

>
>>>>>> After taking a while to understand this and compare with already present
>>>>>> flag system, `negated flags` is comparatively difficult to understand while
>>>>>> already present flags seem easier.
>>>>>
>>>>> Maybe replacing negated_flags in the API with matched_values =
>>>>> ~negated_flags would make this better?
>>>>>
>>>>> We compare having to understand XOR vs having to understand ordering
>>>>> of required_flags and excluded_flags.
>>>> There is no ordering in current masks scheme. No mask is preferable. For a
>>>> page to get selected, all the definitions of the masks must be fulfilled.
>>>> You have come up with good example that what if required_mask =
>>>> exclude_mask. In this case, no page will fulfill the criterion and hence no
>>>> page would be selected. It is user's fault that he isn't understanding the
>>>> definitions of these masks correctly.
>>>>
>>>> Now thinking about it, I can add a error check which would return error if
>>>> a bit in required and excluded masks matches. Would you like it? Lets put
>>>> this check in place.
>>>> (Previously I'd left it for user's wisdom not to do this. If he'll specify
>>>> same masks in them, he'll get no addresses out of the syscall.)
>>>
>>> This error case is (one of) the problems I propose avoiding. You also
>>> need much more text to describe the requred/excluded flags
>>> interactions and edge cases than saying that a flag must have a value
>>> equal to corresponding bit in ~negated_flags to be matched by
>>> requried/anyof masks.
>> I've found excluded_mask very intuitive as compared to negated_mask which
>> is so difficult to understand that I don't know how to use it correctly.
>> Lets take an example, I want pages which are PAGE_IS_WRITTEN and are not
>> PAGE_IS_FILE. In addition, the pages must be PAGE_IS_PRESENT or
>> PAGE_IS_SWAPPED. This can be specified as:
>>
>> required_mask = PAGE_IS_WRITTEN
>> excluded_mask = PAGE_IS_FILE
>> anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAPPED
>>
>> (a) assume page_flags = 0b1111
>> skip page as 0b1111 & 0b0010 = true
>>
>> (b) assume page_flags = 0b1001
>> select page as 0b1001 & 0b0010 = false
>>
>> It seemed intuitive. Right? How would you achieve same thing with negated_mask?
>>
>> required_mask = PAGE_IS_WRITTEN
>> negated_mask = PAGE_IS_FILE
>> anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAPPED
>>
>> (1) assume page_flags = 0b1111
>> tested_flags = 0b1111 ^ 0b0010 = 0b1101
>>
>> (2) assume page_flags = 0b1001
>> tested_flags = 0b1001 ^ 0b0010 = 0b1011
>>
>> In (1), we wanted to skip pages which have PAGE_IS_FILE set. But
>> negated_mask has just masked it and page is still getting tested if it
>> should be selected and it would get selected. It is wrong.
>>
>> In (2), the PAGE_IS_FILE bit of page_flags was 0 and got updated to 1 or
>> PAGE_IS_FILE in tested_flags.
>
> I require flags PAGE_IS_WRITTEN=1, PAGE_IS_FILE=0, so:
>
> required_mask = PAGE_IS_WRITTEN | PAGE_IS_FILE;
> negated_flags = PAGE_IS_FILE; // flags I want zero
You want PAGE_IS_FILE to be zero and at the same time you are requiring
PAGE_IS_FILE. It is confusing. Let's go with the excluded mask, with the
rule that excluded_mask must never have any bit in common with
required_mask. Let's stay with this, as it is intuitive and easy to use
from the user's perspective. Andrei and Danylo suggested this mask scheme
and have use cases for it; Andrei and Danylo, please comment as well.

>
> I also require one of PAGE_IS_PRESENT=1 or PAGE_IS_SWAPPED=1, so:
>
> anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAPPED;
>
> Another case: I want to analyse a process' working set:
>
> required_mask = 0;
> negated_flags = PAGE_IS_FILE;
> anyof_mask = PAGE_IS_FILE | PAGE_IS_WRITTEN;
>
> -> gathering pages modified [WRITTEN=1] or not backed by a file [FILE=0].
>
> To clarify a bit: negated_flags doesn't mask anything: the field
> inverts values of the flags (marks some "active low", if you consider
> electronic signal analogy).
>
> Best Regards
> Michał Mirosław

--
BR,
Muhammad Usama Anjum

2023-02-23 09:42:34

by Michał Mirosław

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On Thu, 23 Feb 2023 at 10:23, Muhammad Usama Anjum
<[email protected]> wrote:
>
> On 2/23/23 1:41 PM, Michał Mirosław wrote:
> > On Thu, 23 Feb 2023 at 07:44, Muhammad Usama Anjum
> > <[email protected]> wrote:
> >>
> >> On 2/22/23 4:48 PM, Michał Mirosław wrote:
> >>> On Wed, 22 Feb 2023 at 12:06, Muhammad Usama Anjum
> >>> <[email protected]> wrote:
> > [...]
> >>>>>>> BTW, I think I assumed that both conditions (all flags in
> >>>>>>> required_flags and at least one in anyof_flags is present) need to be
> >>>>>>> true for the page to be selected - is this your intention?
> >>>>>> All the masks are optional. If all or any of the 3 masks are specified, the
> >>>>>> page flags must pass these masks to get selected.
> >>>>>
> >>>>> This explanation contradicts in part the introductory paragraph, but
> >>>>> this version seems more useful as you can pass all masks zero to have
> >>>>> all pages selected.
> >>>> Sorry, I wrote it wrongly. (All the masks are not optional.) Let me
> >>>> rephrase. All or at least any 1 of the 3 masks (required, any, exclude)
> >>>> must be specified. The return_mask must always be specified. Error is
> >>>> returned if all 3 masks (required, anyof, exclude) are zero or return_mask
> >>>> is zero.
> >>>
> >>> Why do you need those restrictions? I'd guess it is valid to request a
> >>> list of all pages with zero return_mask - this will return a compact
> >>> list of used ranges of the virtual address space.
> >> At the time, we are supporting 4 flags (PAGE_IS_WRITTEN, PAGE_IS_FILE,
> >> PAGE_IS_PRESENT and PAGE_IS_SWAPPED). The idea is that user mention his
> >> flags of interest in the return_mask. If he wants only 1 flag, he'll
> >> specify it. Definitely if user wants only 1 flag, initially it doesn't make
> >> any sense to mention in the return mask. But we want uniformity. If user
> >> want, 2 or more flags in returned, return_mask becomes compulsory. So to
> >> keep things simple and generic for any number of flags of interest
> >> returned, the return_mask must be specified even if the flag of interest is
> >> only 1.
> >
> > I'm not sure why do we want uniformity in the case of 1 flag? If a
> > user specifies a single required flag, I'd expect he doesn't need to
> > look at the flags returned as those will duplicate the information
> > from mere presence of a page. A user might also require a single flag,
> > but want all of them returned. Both requests - return 1 flag and
> > return 0 flags would give meaningful output, so why force one way or
> > the other? Allowing two will also enable users to express the intent:
> > they need either just a list of pages, or they need a list with
> > per-page flags - the need would follow from the code structure or
> > other factors.
> We can add as much flexibility as much people ask by keeping code simple.
> But it is going to be dirty to add error check which detects if return_mask
> = 0 and if there is only 1 flag of interest mentioned by the user. The
> following mentioned error check is essential to return deterministic
> output. Do you think this case is worth it to support and we don't want to
> go with the generality for both 1 or more flag cases?
>
> if (return_mask == 0 && hweight_long(required_mask | any_mask) != 1)
> return error;

Why would you want to add this error check? If a user requires
multiple flags but cares only about a list of matching pages, then it
would be natural to express this intent as return_mask = 0.

> [...]
> >
> > I require flags PAGE_IS_WRITTEN=1, PAGE_IS_FILE=0, so:
> >
> > required_mask = PAGE_IS_WRITTEN | PAGE_IS_FILE;
> > negated_flags = PAGE_IS_FILE; // flags I want zero
> You want PAGE_IS_FILE to be zero and at the same time you are requiring the
> PAGE_IS_FILE. It is confusing.

Ok, I believe the misunderstanding comes from the naming. I "require"
the flag to be a particular value - hence include it in
"required_flags" and specify the required value in ~negated_flags. You
"require" the flag to be set (equal 1) and so include it in
"required_flags" and you "require" the flag to be clear (equal to 0)
so include it in "excluded_flags". Both approaches are correct, but I
would not consider one "easier" than the other. The former is more
general, though - makes any_of also able to match on flags cleared and
removes the possibility of a conflicting case of a flag present in
both sets.

Maybe considered_flags or matched_flags would then make the field
easier to understand?

Best Regards
Michał Mirosław

2023-02-23 17:11:20

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs


> On Feb 22, 2023, at 11:10 PM, Muhammad Usama Anjum <[email protected]> wrote:
>
> Hi Nadav, Mike, Michał,
>
> Can you please share your thoughts at [A] below?

I promised I wouldn't talk about the API, but I was persuaded to reconsider. I
have a general question regarding the suitability of the currently proposed
high-level API. To explore the design space, I'd like to suggest an
alternative that may have some advantages. If it has already been considered
and dismissed, feel free to ignore this.

I believe we have two distinct usage scenarios: (1) vectored reads from pagemap,
and (2) atomic UFFD WP-read/protect. It's possible that these require separate
interfaces.

Regarding vectored reads, I believe the simplest solution is to maintain the
current pagemap entry format for output and extend it if necessary. The input
can be a vector of ranges. I'm uncertain about the purpose of fields such
as 'anyof_mask' in 'pagemap_scan_arg', so I can't confirm their necessity or
whether the input needs to be made more complicated. There is a possibility
that fields such as 'anyof_mask' might expose internal APIs, so I hope they're
not required.

For the atomic operation of 'PAGE_IS_WRITTEN' + 'PAGEMAP_WP_ENGAGE', a different
mechanism might be necessary. This function appears to be UFFD-specific.
Instead of the proposed IOCTL, an alternative option is to
use 'UFFD_FEATURE_WP_ASYNC' to log the pages that were written, similar to
page-modification logging on Intel. Since this feature appears to be specific
to UFFD, I believe it would be more appropriate to include the log as part of
the UFFD mechanism rather than the pagemap.

From my experience with UFFD, proper ordering of events is crucial, although it
is not always done well. Therefore, we should aim for improvement, not
regression. I believe that utilizing the pagemap-based mechanism for WP'ing
might be a step in the wrong direction. I think that it would have been better
to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and
events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the
file descriptor unless the log is full.
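
Purely as an illustration of the WP-log idea (nothing like this exists in the uffd uapi; every name below is invented for the sketch), the log could be a ring of per-page entries that userspace drains:

```c
#include <linux/types.h>

/* Hypothetical ring layout for an async uffd-wp write log. */
struct uffd_wp_log_entry {
	__u64 addr;		/* page-aligned address that was written to */
};

struct uffd_wp_log {
	__u64 head;		/* advanced by the kernel on each async wp fault */
	__u64 tail;		/* advanced by userspace as entries are consumed */
	__u64 size;		/* number of entries, a power of two */
	struct uffd_wp_log_entry entries[];
};
```

The "log is full" case mentioned above would then be head - tail == size, which is where the wake-up and backpressure questions come in.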

I am sorry to chime in this late, but I think the complications that the
proposed mechanism might raise are not negligible. And anyhow, this patch set
still requires quite a bit of work before it can be merged.

2023-02-24 02:20:52

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On Tue, Feb 21, 2023 at 4:42 AM Michał Mirosław <[email protected]> wrote:
>
> On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum
> <[email protected]> wrote:
> >
> > Hi Michał,
> >
> > Thank you so much for comment!
> >
> > On 2/17/23 8:18 PM, Michał Mirosław wrote:
> [...]
> > > For the page-selection mechanism, currently required_mask and
> > > excluded_mask have conflicting
> > They are opposite of each other:
> > All the set bits in required_mask must be set for the page to be selected.
> > All the set bits in excluded_mask must _not_ be set for the page to be
> > selected.
> >
> > > responsibilities. I suggest to rework that to:
> > > 1. negated_flags: page flags which are to be negated before applying
> > > the page selection using following masks;
> > Sorry, I'm unable to understand the negation (which is XOR?). Let's look
> > at the truth table:
> >
> > Page flag   negated_flags   Result (XOR)
> > 0           0               0
> > 0           1               1
> > 1           0               1
> > 1           1               0
> >
> > If a page flag is 0 and negated_flags is 1, the result would be 1, which
> > has changed the page flag. It doesn't make sense to me. Why is the page
> > flag bit being flipped?
> >
> > When Andrei proposed these masks, they seemed like a fancy way of
> > filtering inside the kernel and were straightforward to understand. These
> > masks would help his use cases for CRIU, so I included them. Can you
> > please elaborate on the purpose of the negation?
>
> The XOR is a way to invert the tested value of a flag (from positive
> to negative and the other way) without having the API with invalid
> values (with required_flags and excluded_flags you need to define a
> rule about what happens if a flag is present in both of the masks -
> either prioritise one mask over the other or reject the call).
> (Note: the XOR is applied only to the value of the flags for the
> purpose of testing page-selection criteria.)

Michał,

Your API isn't much different from the current one, but it requires
a bit more brain activity to understand.

The current set of masks can be easily translated to the new one:
negated_flags = excluded_flags
required_flags_new = excluded_flags | required_flags

As for invalid values, I think it is an advantage of the current API.
I mean we can easily detect invalid values and return EINVAL. With your
API, such mistakes will be undetectable.

As for priorities, I don't see that problem here, if I'm not missing something.

We can rewrite the code this way:
```
if (required_mask && ((page_flags & required_mask) != required_mask))
	skip page;
if (anyof_mask && !(page_flags & anyof_mask))
	skip page;
if (page_flags & excluded_mask)
	skip page;
```

I think the result is always the same no matter in what order each
mask is applied.

Thanks,
Andrei

2023-02-25 09:39:13

by Michał Mirosław

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On Fri, 24 Feb 2023 at 03:20, Andrei Vagin <[email protected]> wrote:
>
> On Tue, Feb 21, 2023 at 4:42 AM Michał Mirosław <[email protected]> wrote:
> >
> > On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum
> > <[email protected]> wrote:
> > >
> > > Hi Michał,
> > >
> > > Thank you so much for comment!
> > >
> > > On 2/17/23 8:18 PM, Michał Mirosław wrote:
> > [...]
> > > > For the page-selection mechanism, currently required_mask and
> > > > excluded_mask have conflicting
> > > They are opposite of each other:
> > > All the set bits in required_mask must be set for the page to be selected.
> > > All the set bits in excluded_mask must _not_ be set for the page to be
> > > selected.
> > >
> > > > responsibilities. I suggest to rework that to:
> > > > 1. negated_flags: page flags which are to be negated before applying
> > > > the page selection using following masks;
> > > Sorry, I'm unable to understand the negation (which is XOR?). Let's look
> > > at the truth table:
> > >
> > > Page flag   negated_flags   Result (XOR)
> > > 0           0               0
> > > 0           1               1
> > > 1           0               1
> > > 1           1               0
> > >
> > > If a page flag is 0 and negated_flags is 1, the result would be 1, which
> > > has changed the page flag. It doesn't make sense to me. Why is the page
> > > flag bit being flipped?
> > >
> > > When Andrei proposed these masks, they seemed like a fancy way of
> > > filtering inside the kernel and were straightforward to understand. These
> > > masks would help his use cases for CRIU, so I included them. Can you
> > > please elaborate on the purpose of the negation?
> >
> > The XOR is a way to invert the tested value of a flag (from positive
> > to negative and the other way) without having the API with invalid
> > values (with required_flags and excluded_flags you need to define a
> > rule about what happens if a flag is present in both of the masks -
> > either prioritise one mask over the other or reject the call).
> > (Note: the XOR is applied only to the value of the flags for the
> > purpose of testing page-selection criteria.)
>
> Michał,
>
> Your API isn't much different from the current one, but it requires
> a bit more brain activity for understanding.
>
> The current set of masks can be easy translated to the new one:
> negated_flags = excluded_flags
> required_flags_new = excluded_flags | required_flags
>
> As for invalid values, I think it is an advantage of the current API.
> I mean we can easily detect invalid values and return EINVAL. With your
> API, such mistakes will be undetectable.
>
> As for priorities, I don't see this problem here If I don't miss something.
>
> We can rewrite the code this way:
> ```
> if (required_mask && ((page_flags & required_mask) != required_mask)
> skip page;
> if (anyof_mask && !(page_flags & anyof_mask))
> skip page;
> if (page_flags & excluded_mask)
> skip page;
> ```
>
> I think the result is always the same no matter in what order each
> mask is applied.

Hi,

I would not want the discussion to wander into easier/harder territory,
as that highly depends on the experience one has. What I'm arguing about
is the consistency of the API. Let me expand a bit on that.

We have two ways to look at the page_flags:
A. the field represents a *set of elements* (tags, attributes)
present on the page;
B. the field represents a bitfield (structure; a fixed set of boolean
fields having a value of 0 or 1)

From A follows the include/exclude way of API design for matching the
flags, and from B the matched mask (which flags to check) + value set
(what values to require).

My argument is that B is consistent with how the flags are used in the
kernel: we don't have operations that add or remove flags, but we have
operations that set or change their value.

Best Regards
Michał Mirosław

2023-02-27 21:19:55

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On Thu, Feb 23, 2023 at 05:11:11PM +0000, Nadav Amit wrote:
> From my experience with UFFD, proper ordering of events is crucial, although it
> is not always done well. Therefore, we should aim for improvement, not
> regression. I believe that utilizing the pagemap-based mechanism for WP'ing
> might be a step in the wrong direction. I think that it would have been better
> to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and
> events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the
> file descriptor unless the log is full.

Yes this is an interesting question to think about..

Keeping the data in the pgtable has one advantage: it doesn't need any
complexity for maintaining the log, and there is no possibility of "log
full".

If "log full" is possible, then the next question is whether we should
let the worker wait for the monitor when the monitor is not fast enough to
collect the data. That adds a slight dependency between the two threads; I
think it can make the tracking harder or impossible in latency-sensitive
workloads.

The other thing is that we can also make the log "never full" by making it
a bitmap covering the registered ranges, but I don't know whether
it'll be worth the effort.
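
A sketch of that bitmap variant, KVM_GET_DIRTY_LOG style (the names here are hypothetical): one bit per page of a registered range, so there is no log that can fill up.

```c
#include <linux/bitmap.h>
#include <linux/mm.h>

/* Hypothetical per-range dirty bitmap for async uffd-wp. */
struct wp_dirty_bitmap {
	unsigned long start;	/* first address of the tracked range */
	unsigned long nr_pages;	/* number of pages covered by @bits */
	unsigned long *bits;	/* one bit per page */
};

/* Called while resolving an async wp fault: O(1), and never "full". */
static void wp_mark_written(struct wp_dirty_bitmap *b, unsigned long addr)
{
	set_bit((addr - b->start) >> PAGE_SHIFT, b->bits);
}
```

Userspace would then fetch and clear the bitmap in one call, similar to what KVM_GET_DIRTY_LOG does.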

Thanks,

--
Peter Xu


2023-02-27 23:09:20

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs



> On Feb 27, 2023, at 1:18 PM, Peter Xu <[email protected]> wrote:
>
> On Thu, Feb 23, 2023 at 05:11:11PM +0000, Nadav Amit wrote:
>> From my experience with UFFD, proper ordering of events is crucial, although it
>> is not always done well. Therefore, we should aim for improvement, not
>> regression. I believe that utilizing the pagemap-based mechanism for WP'ing
>> might be a step in the wrong direction. I think that it would have been better
>> to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and
>> events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the
>> file descriptor unless the log is full.
>
> Yes this is an interesting question to think about..
>
> Keeping the data in the pgtable has one advantage: it doesn't need any
> complexity for maintaining the log, and there is no possibility of "log full".

I understand your concern, but I think that eventually it might be simpler
to maintain, since the logic of how to process the log is moved to userspace.

At the same time, handling inputs from pagemap and uffd handlers and sync’ing
them would not be too easy for userspace.

But yes, allocation on the heap for userfaultfd_wait_queue-like entries would
be needed, and there are some issues of ordering the events (I think all #PF
and other events should be ordered regardless) and how not to traverse all
async-userfaultfd_wait_queue’s (except those that block if the log is full)
when a wakeup is needed.

>
> If "log full" is possible, then the next question is whether we should
> let the worker wait for the monitor when the monitor is not fast enough to
> collect the data. That adds a slight dependency between the two threads; I
> think it can make the tracking harder or impossible in latency-sensitive
> workloads.

Again, I understand your concern. But this model that I propose is not new.
It is used with PML (page-modification logging) and KVM, and IIRC there is
a similar interface between KVM and QEMU to provide this information. There
are endless other examples for similar producer-consumer mechanisms that
might lead to stall in extreme cases.

>
> The other thing is that we can also make the log "never full" by making it
> a bitmap covering the registered ranges, but I don't know whether
> it'll be worth the effort.

I do not see a benefit of half-log half-scan. It tries to take the
data-structure of one format and combine it with another.

Anyhow, I was just giving my 2 cents. Admittedly, I did not follow the
threads of previous versions and I did not see userspace components that
use the API to say something smart. Personally, I do not find the current
API proposal to be very consistent and simple, and it seems to me that it
lets pagemap do userfaultfd-related tasks, which might be considered
inappropriate and non-intuitive.

If I derailed the discussion, I apologize.

2023-02-28 15:56:47

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On Mon, Feb 27, 2023 at 11:09:12PM +0000, Nadav Amit wrote:
>
>
> > On Feb 27, 2023, at 1:18 PM, Peter Xu <[email protected]> wrote:
> >
> > On Thu, Feb 23, 2023 at 05:11:11PM +0000, Nadav Amit wrote:
> >> From my experience with UFFD, proper ordering of events is crucial, although it
> >> is not always done well. Therefore, we should aim for improvement, not
> >> regression. I believe that utilizing the pagemap-based mechanism for WP'ing
> >> might be a step in the wrong direction. I think that it would have been better
> >> to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and
> >> events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the
> >> file descriptor unless the log is full.
> >
> > Yes this is an interesting question to think about..
> >
> > Keeping the data in the pgtable has one advantage: it doesn't need any
> > complexity for maintaining the log, and there is no possibility of "log full".
>
> I understand your concern, but I think that eventually it might be simpler
> to maintain, since the logic of how to process the log is moved to userspace.
>
> At the same time, handling inputs from pagemap and uffd handlers and sync’ing
> them would not be too easy for userspace.

I do not expect a common uffd-wp async user to provide a fault handler at
all. In my imagination it's in most cases used standalone from other uffd
modes; it means all the faults will still be handled by the kernel. Here
we only leverage the accuracy of userfaultfd compared to soft-dirty, so
these are not really "user" faults.

>
> But yes, allocation on the heap for userfaultfd_wait_queue-like entries would
> be needed, and there are some issues of ordering the events (I think all #PF
> and other events should be ordered regardless) and how not to traverse all
> async-userfaultfd_wait_queue’s (except those that block if the log is full)
> when a wakeup is needed.

Will there be an ordering requirement for an async mode? Considering it
should be async to whatever else, I would think it's not a problem, but
maybe I missed something.

>
> >
> > If "log full" is possible, then the next question is whether we should
> > let the worker wait for the monitor when the monitor is not fast enough to
> > collect the data. That adds a slight dependency between the two threads; I
> > think it can make the tracking harder or impossible in latency-sensitive
> > workloads.
>
> Again, I understand your concern. But this model that I propose is not new.
> It is used with PML (page-modification logging) and KVM, and IIRC there is
> a similar interface between KVM and QEMU to provide this information. There
> are endless other examples for similar producer-consumer mechanisms that
> might lead to stall in extreme cases.

Yes, I'm not against thinking about using similar structures here. It's just
that it's definitely more complicated on the interface side; at the least we
need yet one more interface to set up the rings and define their semantics.

Note that although Muhammad is defining another new interface here too for
pagemap, I don't think it's strictly needed for uffd-wp async mode. One
can use uffd-wp async mode with PM_UFFD_WP, which the current pagemap
interface provides already.
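
For completeness, a user-space sketch of that existing route: read /proc/<pid>/pagemap and test the uffd-wp bit of each 64-bit entry (bit 57, per Documentation/admin-guide/mm/pagemap.rst):

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PM_UFFD_WP	(1ULL << 57)	/* pte is uffd-wp write-protected */

/* Returns 1 if the page at vaddr has been written to since it was
 * write-protected (its uffd-wp bit is cleared), 0 if it is still
 * write-protected, -1 on a short read. */
static int page_written(int pagemap_fd, unsigned long vaddr, long page_size)
{
	uint64_t ent;
	off_t off = (off_t)(vaddr / page_size) * sizeof(ent);

	if (pread(pagemap_fd, &ent, sizeof(ent), off) != sizeof(ent))
		return -1;
	return !(ent & PM_UFFD_WP);
}
```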

So what Muhammad is proposing here is two things to me: (1) uffd-wp async,
plus (2) a new pagemap interface (which will closely work with (1) only if
we need atomicity on get-dirty and reprotect).

Defining a new interface for uffd-wp async mode would be something extra, so
IMHO besides the heap allocation for the rings, we need to also justify
whether that is needed. That's why I think it's fine to go with what
Muhammad proposed, because it's a minimal changeset at least for userfault
to support an async mode, and anything else can be done on top if necessary.

Going back a bit to the "lead to stall in extreme cases" above, I also
want to mention that the VM use case is slightly different - dirty tracking
is only heavily used during migration afaict, and that's a short period. Not
a lot of people will complain that performance degrades during that period,
because it's just rare. And even without the ring, the perf is really
bad during migration anyway... especially when huge pages are used to back
the guest RAM.

Here it's slightly different to me: it's about tracking dirty pages during
any possible workload, and it can be monitored periodically and frequently.
So IMHO stricter than a VM use case where migration is the only period to
use it.

>
> >
> > The other thing is that we can also make the log "never full" by making it
> > a bitmap covering the registered ranges, but I don't know whether
> > it'll be worth the effort.
>
> I do not see a benefit of half-log half-scan. It tries to take the
> data-structure of one format and combine it with another.

What I'm saying here is not half-log / half-scan, but using a single bitmap
to store which pages are dirty, just like KVM_GET_DIRTY_LOG. I think it
avoids the "stall" issue above.

>
> Anyhow, I was just giving my 2 cents. Admittedly, I did not follow the
> threads of previous versions and I did not see userspace components that
> use the API to say something smart.

Actually it's similar here. :) So I'm probably not the best one to describe
what the API should look like.

What I know is that the new pagemap interface is welcomed by CRIU
developers, so it may be something good with or without userfaultfd getting
involved. I see this as "let's add one more bit for uffd-wp" in the new
interface only.

Quoting some links I got from Muhammad before about CRIU usage:

https://lore.kernel.org/all/[email protected]
https://lore.kernel.org/all/[email protected]

> Personally, I do not find the current API proposal to be very consistent
> and simple, and it seems to me that it lets pagemap do
> userfaultfd-related tasks, which might be considered inappropriate and
> non-intuitive.

Yes, I agree. I just don't know what's the best way to avoid this.

The issue here, IIUC, is that Muhammad needs one operation to do what Windows
does with the getWriteWatch() API. It means we need to combine GET and
PROTECT in a single shot. If we want to use pagemap as the GET, then to me
there is no choice but to PROTECT here as well.

I think it would be the same with soft-dirty if that were used: it would mean
extending the soft-dirty modifications from clear_refs to pagemap too, which
I also don't think is as clean.

>
> If I derailed the discussion, I apologize.

Not at all. I just wished you joined earlier!

--
Peter Xu


2023-02-28 17:21:33

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs



> On Feb 28, 2023, at 7:55 AM, Peter Xu <[email protected]> wrote:
>
> On Mon, Feb 27, 2023 at 11:09:12PM +0000, Nadav Amit wrote:
>>
>>
>>> [...]
>>
>> I understand your concern, but I think that eventually it might be simpler
>> to maintain, since the logic of how to process the log is moved to userspace.
>>
>> At the same time, handling inputs from pagemap and uffd handlers and sync’ing
>> them would not be too easy for userspace.
>
> I do not expect a common uffd-wp async user to provide a fault handler at
> all. In my imagination it's in most cases used standalone from other uffd
> modes; it means all the faults will still be handled by the kernel. Here
> we only leverage the accuracy of userfaultfd compared to soft-dirty, so
> these are not really "user" faults.

If that is the only use-case, it might make sense. But I guess most users would
most likely use some library (and not syscalls directly). So slightly
complicating the API for better generality may be reasonable.

>
>>
>> But yes, allocation on the heap for userfaultfd_wait_queue-like entries would
>> be needed, and there are some issues of ordering the events (I think all #PF
>> and other events should be ordered regardless) and how not to traverse all
>> async-userfaultfd_wait_queue’s (except those that block if the log is full)
>> when a wakeup is needed.
>
> Will there be an ordering requirement for an async mode? Considering it
> should be async to whatever else, I would think it's not a problem, but
> maybe I missed something.

You may be right, but I am not sure. I am still not sure what use-cases are
targeted in this patch-set. For the CRIU checkpoint use-case (when the app is
not running), I guess the current interface makes sense. But if there are
use-cases in which you do care about UFFD events, this can become an issue.

But even in some obvious use-cases, this might be the wrong interface and
cause major performance issues. If we think about incremental copying of
modified pages (a la pre-copy live-migration, or creating point-in-time
snapshots), it seems to me much more efficient for the application to have a
log than to traverse all the page tables.


>>
>>>
>>> If "log full" is possible, then the next question is whether we should
>>> let the worker wait for the monitor when the monitor is not fast enough to
>>> collect the data. That adds a slight dependency between the two threads; I
>>> think it can make the tracking harder or impossible in latency-sensitive
>>> workloads.
>>
>> Again, I understand your concern. But this model that I propose is not new.
>> It is used with PML (page-modification logging) and KVM, and IIRC there is
>> a similar interface between KVM and QEMU to provide this information. There
>> are endless other examples for similar producer-consumer mechanisms that
>> might lead to stall in extreme cases.
>
> [...]
>
> Here it's slightly different to me: it's about tracking dirty pages during
> any possible workload, and it can be monitored periodically and frequently.
> So IMHO stricter than a VM use case where migration is the only period to
> use it.

I still don’t get the use-cases. "monitored periodically and frequently” is
not a use-case. And as I said before, actually, monitoring frequently is
more performant with a log than with scanning all the page-tables.

>
>>
>>>
>>> The other thing is that we can also make the log "never full" by making it
>>> a bitmap covering the registered ranges, but I don't know whether
>>> it'll be worth the effort.
>>
>> I do not see a benefit of half-log half-scan. It tries to take the
>> data-structure of one format and combine it with another.
>
> What I'm saying here is not half-log / half-scan, but using a single bitmap
> to store which pages are dirty, just like KVM_GET_DIRTY_LOG. I think it
> avoids the "stall" issue above.

Oh, I never went into the KVM details before - stupid me. If that’s what
was eventually proven to work for KVM/QEMU, then it really does sound like
the pagemap solution that Muhammad proposed.

But still, not entangling pagemap with userfaultfd (and especially
uffd-wp) can be beneficial. Linus has already thrown some comments here
and there about disliking uffd-wp, and I’m not sure adding uffd-wp-specific
stuff to pagemap would be welcomed.

Anyhow, thanks for all the explanations. In the end, I understand that
using bitmaps can be more efficient than a log if the bits are condensed.

2023-02-28 19:32:38

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

On Tue, Feb 28, 2023 at 05:21:20PM +0000, Nadav Amit wrote:
>
>
> > On Feb 28, 2023, at 7:55 AM, Peter Xu <[email protected]> wrote:
> >
> > On Mon, Feb 27, 2023 at 11:09:12PM +0000, Nadav Amit wrote:
> >>
> >>
> >>> On Feb 27, 2023, at 1:18 PM, Peter Xu <[email protected]> wrote:
> >>>
> >>> On Thu, Feb 23, 2023 at 05:11:11PM +0000, Nadav Amit wrote:
> >>>> From my experience with UFFD, proper ordering of events is crucial,
> >>>> although it is not always done well. Therefore, we should aim for
> >>>> improvement, not regression. I believe that utilizing the pagemap-based
> >>>> mechanism for WP'ing might be a step in the wrong direction. I think it
> >>>> would have been better to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log, ordered
> >>>> together with UFFD #PF and other events. The 'UFFD_FEATURE_WP_ASYNC' log
> >>>> may not need to wake waiters on the file descriptor unless the log is
> >>>> full.
> >>>
> >>> Yes, this is an interesting question to think about..
> >>>
> >>> Keeping the data in the pgtable has one advantage: it doesn't need any
> >>> complexity for maintaining the log, and there is no possibility of "log
> >>> full".
> >>
> >> I understand your concern, but I think that in the end it might be
> >> simpler to maintain, since the logic of how to process the log is moved
> >> to userspace.
> >>
> >> At the same time, handling inputs from pagemap and uffd handlers and
> >> sync’ing them would not be too easy for userspace.
> >
> > I do not expect a common uffd-wp async user to provide a fault handler at
> > all. I imagine that in most cases it will be used standalone, apart from
> > other uffd modes, meaning all the faults will still be resolved by the
> > kernel. Here we only leverage the accuracy of userfaultfd compared to
> > soft-dirty, so these are not really "user" faults.
>
> If that is the only use-case, it might make sense. But I guess most users
> would most likely use some library (and not syscalls directly), so
> slightly complicating the API for better generality may be reasonable.
>
> >
> >>
> >> But yes, heap allocation for userfaultfd_wait_queue-like entries would be
> >> needed, and there are some issues around ordering the events (I think all
> >> #PF and other events should be ordered regardless) and around avoiding
> >> traversal of all async userfaultfd_wait_queue's (except those that block
> >> when the log is full) when a wakeup is needed.
> >
> > Will there be an ordering requirement for an async mode? Considering it
> > should be asynchronous with respect to everything else, I would think
> > it's not a problem, but maybe I missed something.
>
> You may be right, but I am not sure. I am still not sure what use-cases
> are targeted by this patch-set. For the CRIU checkpoint use-case (when the
> app is not running), I guess the current interface makes sense. But if
> there are use-cases in which you do care about UFFD events, this can
> become an issue.
>
> But even in some obvious use-cases, this might be the wrong interface for
> major performance reasons. If we think about incremental copying of
> modified pages (a la pre-copy live-migration, or creating point-in-time
> snapshots), it seems to me much more efficient for the application to have
> a log than to traverse all the page-tables.

IMHO snapshots may not need a log at all - they need CoW before the write
happens. Nor is that the case for swapping with userfaults, IIUC. IOW, in
those cases people don't care which page got dirtied; they care that the
data is not modified until the app allows it.

But I get the point, and I agree that collecting by scanning is slower.

>
>
> >>
> >>>
> >>> If a "log full" condition is possible, then the next question is whether
> >>> we should make the worker wait for the monitor when the monitor is not
> >>> fast enough to collect the data. It adds a slight dependency between the
> >>> two threads; I think it can make the tracking harder or impossible in
> >>> latency-sensitive workloads.
> >>
> >> Again, I understand your concern. But the model that I propose is not new.
> >> It is used with PML (page-modification logging) and KVM, and IIRC there is
> >> a similar interface between KVM and QEMU to provide this information. There
> >> are endless other examples of similar producer-consumer mechanisms that
> >> might lead to stalls in extreme cases.
> >
> > Yes, I'm not against thinking of using similar structures here. It's just
> > that it's definitely more complicated at the interface level; at a minimum
> > we need yet another interface to set up the rings and define their
> > semantics.
> >
> > Note that although Muhammud is defining another new interface here too for
> > pagemap, I don't think it's strictly needed for uffd-wp async mode. One
> > can use uffd-wp async mode with PM_UFFD_WP, which is already part of the
> > current pagemap interface.
> >
> > So what Muhammud is proposing here are two things to me: (1) uffd-wp async,
> > plus (2) a new pagemap interface (which will closely work with (1) only if
> > we need atomicity on get-dirty and reprotect).
> >
> > Defining a new interface for uffd-wp async mode would be something extra,
> > so IMHO, besides the heap allocation for the rings, we also need to
> > justify whether it is needed. That's why I think it's fine to go with what
> > Muhammud proposed: it's a minimal changeset, at least for userfault to
> > support an async mode, and anything else can be done on top if necessary.
> >
> > Going back a bit to the "lead to stalls in extreme cases" above, I also
> > want to mention that the VM use case is slightly different - dirty
> > tracking is only heavily used during migration afaict, and that's a short
> > period. Not a lot of people will complain that performance degrades during
> > that period, because it's just rare. And even without the ring, the perf
> > is really bad during migration anyway... especially when huge pages are
> > used to back the guest RAM.
> >
> > Here it's slightly different to me: it's about tracking dirty pages during
> > any possible workload, and it can be monitored periodically and
> > frequently. So the requirements are IMHO stricter than in the VM use case,
> > where migration is the only period when it's used.
>
> I still don’t get the use-cases. “Monitored periodically and frequently”
> is not a use-case. And as I said before, monitoring frequently is actually
> more performant with a log than with scanning all the page-tables.

Feel free to ignore this part if we're not talking about using a ring
structure; my previous comment was mostly about that, and bitmaps won't
have this issue. Here I see a bitmap as one way to implement a log, where
each page is recorded by one bit. My comment was that we should be careful
about using rings.
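
For reference, the KVM bitmap is consumed from userspace roughly like
this (a minimal sketch, error handling mostly omitted; the vm fd and the
memslot setup are assumed to exist elsewhere):

#include <linux/kvm.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/*
 * Fetch the dirty bitmap of one memslot: one bit per guest page.  In the
 * default mode the kernel clears the bits as part of this call.
 */
static uint64_t *get_dirty_bitmap(int vm_fd, uint32_t slot, size_t npages)
{
	size_t bytes = ((npages + 63) / 64) * 8;
	uint64_t *bitmap = calloc(1, bytes);
	struct kvm_dirty_log log = {
		.slot = slot,
		.dirty_bitmap = bitmap,
	};

	if (!bitmap || ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log) < 0) {
		free(bitmap);
		return NULL;
	}
	return bitmap;	/* caller walks the set bits, then frees it */
}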

Side note: the kvm dirty ring is actually even trickier; see the soft-full
threshold (kvm_dirty_ring.soft_limit) in addition to the hard-full event,
which makes sure hard-full won't really trigger (otherwise we're prone to
losing dirty bits). I don't think we'd have the same issue here, so we
could let hard-full trigger, but it's still unwanted to halt the threads
being tracked for dirty pages. I don't know whether the ring would have
other side effects, though..
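
Conceptually the soft-limit trick looks like this (a sketch of the idea
only, not the actual KVM code):

/*
 * Sketch of the producer-side check: ask for collection at a soft limit,
 * before the ring is truly full, so no dirty bit is dropped at hard-full.
 */
struct dirty_ring {
	unsigned int size;		/* total slots */
	unsigned int soft_limit;	/* size minus some headroom */
	unsigned int used;		/* produced, not yet collected */
};

static int ring_needs_collection(const struct dirty_ring *ring)
{
	return ring->used >= ring->soft_limit;
}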

>
> >
> >>
> >>>
> >>> The other thing is that we can also make the log "never get full" by
> >>> making it a bitmap covering any registered ranges, but I don't know
> >>> whether it'll be worth the effort.
> >>
> >> I do not see a benefit in a half-log, half-scan approach. It tries to take
> >> the data-structure of one format and combine it with another.
> >
> > What I'm saying here is not half-log / half-scan, but using a single bitmap
> > to record which pages are dirty, just like KVM_GET_DIRTY_LOG. I think it
> > avoids the "stall" issue above.
>
> Oh, I never went into the KVM details before - stupid me. If that’s what
> was eventually proven to work for KVM/QEMU, then it really does sound like
> the pagemap solution that Muhammad proposed.
>
> But still, not entangling pagemap with userfaultfd (and especially
> uffd-wp) can be beneficial. Linus has already thrown some comments here
> and there about disliking uffd-wp, and I’m not sure adding uffd-wp-specific
> stuff to pagemap would be welcomed.

Yes, I don't know either.. As I mentioned, I'm not super happy with the
interface either, but it's the simplest I can think of so far.

IOW, from a "userfaultfd-side reviewer" POV, I'm fine if someone wants to
leverage the concepts of uffd-wp and its internals in a separate but very
lightweight patch just to implement an async mode of uffd-wp. But I'm
always open to any suggestions too. It's just that when there are multiple
options and we're not confident in either, I normally prefer the simplest
and cleanest (even if less efficient).
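
To make that minimal async flow concrete, something like the sketch below
should be all a tracker needs (error handling omitted; UFFD_FEATURE_WP_ASYNC
is the feature bit proposed by this series, and a 4K page size is assumed):

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#define PM_UFFD_WP	(1ULL << 57)	/* pagemap flag bit */
#define PAGE_SZ		4096ULL		/* assumed page size */

static void track_writes(void *addr, uint64_t len)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_WP_ASYNC,	/* added by this series */
	};
	struct uffdio_register reg = {
		.range = { .start = (uintptr_t)addr, .len = len },
		.mode = UFFDIO_REGISTER_MODE_WP,
	};
	struct uffdio_writeprotect wp = {
		.range = { .start = (uintptr_t)addr, .len = len },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
	};

	ioctl(uffd, UFFDIO_API, &api);
	ioctl(uffd, UFFDIO_REGISTER, &reg);
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

	/* ... tracked workload runs; the kernel resolves the wp faults ... */

	int pm = open("/proc/self/pagemap", O_RDONLY);
	uint64_t written = 0;

	for (uint64_t off = 0; off < len; off += PAGE_SZ) {
		uint64_t ent, idx = ((uintptr_t)addr + off) / PAGE_SZ;

		pread(pm, &ent, sizeof(ent), idx * sizeof(ent));
		if (!(ent & PM_UFFD_WP))	/* wp bit gone => written to */
			written++;
	}
	close(pm);
}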

> Anyhow, thanks for all the explanations. In the end, I understand that
> using bitmaps can be more efficient than a log if the bits are condensed.

Note that I think what Muhammad (sorry, Muhammad! I think I spelled your
name wrongly before, starting from some email..) proposed is not a bitmap
but an array of ranges that can coalesce the result into a very condensed
form. There are pros and cons to each.
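
Roughly the idea (an illustrative sketch; the struct below is made up for
illustration, and the exact uapi layout in the series may differ):

#include <stddef.h>
#include <stdint.h>

/* Illustrative only: consecutive pages with identical flags collapse
 * into a single range entry. */
struct page_range {
	uint64_t start;	/* first page address */
	uint64_t len;	/* bytes covered */
	uint64_t flags;	/* e.g. written/file/present/swapped */
};

/* Append one page, coalescing with the previous range when it is
 * adjacent and carries the same flags; returns the new vector length. */
static size_t emit_page(struct page_range *vec, size_t n,
			uint64_t addr, uint64_t flags, uint64_t pgsize)
{
	if (n && vec[n - 1].flags == flags &&
	    vec[n - 1].start + vec[n - 1].len == addr) {
		vec[n - 1].len += pgsize;
		return n;
	}
	vec[n].start = addr;
	vec[n].len = pgsize;
	vec[n].flags = flags;
	return n + 1;
}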

Again, I can't comment much on that API, but since there are a bunch of
other developers looking at it, and they're also potential future users,
I'll trust their judgement and just focus more on the other side of things.

Thanks,

--
Peter Xu


2023-03-01 02:00:00

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v10 3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs



> On Feb 28, 2023, at 11:31 AM, Peter Xu <[email protected]> wrote:
>
>> Anyhow, thanks for all the explanations. In the end, I understand that
>> using bitmaps can be more efficient than a log if the bits are condensed.
>
> Note that I think what Muhammad (sorry, Muhammad! I think I spelled your
> name wrongly before, starting from some email..) proposed is not a bitmap
> but an array of ranges that can coalesce the result into a very condensed
> form. There are pros and cons to each.
>
> Again, I can't comment much on that API, but since there are a bunch of
> other developers looking at it, and they're also potential future users,
> I'll trust their judgement and just focus more on the other side of things.


Thanks Peter for your patience.

I would just note that I understood that Muhammad did not propose a
condensed bitmap; my comment was a hint that handling a condensed bitmap
(at least on x86) can be done rather efficiently. I am not sure about
other representations.
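
To illustrate what I mean by "rather efficiently": a scan of a condensed
bitmap can skip clear words wholesale, and finding the next set bit is a
single instruction on x86. A sketch:

#include <stdint.h>
#include <stdio.h>

/*
 * Visit every set bit (dirty page) in the bitmap, skipping clear words
 * wholesale; __builtin_ctzll compiles to a single TZCNT/BSF on x86.
 */
static void for_each_dirty(const uint64_t *bitmap, size_t nwords,
			   uint64_t base, uint64_t pgsize)
{
	for (size_t w = 0; w < nwords; w++) {
		uint64_t word = bitmap[w];

		while (word) {
			unsigned int bit = __builtin_ctzll(word);

			printf("dirty page at %#llx\n",
			       (unsigned long long)(base + (w * 64 + bit) * pgsize));
			word &= word - 1;	/* clear the lowest set bit */
		}
	}
}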

Thanks for your explanations again, Peter.