From: Nadav Amit <[email protected]>
The goal of these patches is to add support for
process_madvise(MADV_DONTNEED). Yet, in the process some (arguably)
useful cleanups, a bug fix and performance enhancements are performed.
The patches try to consolidate the logic across different behaviors, and
to a certain extent overlap/conflict with an outstanding patch that does
something similar [1]. This consolidation however is mostly orthogonal
to the aforementioned one and done in order to clarify what is done with
respect to locks and TLB for each behavior and to batch these operations
more efficiently on process_madvise().
process_madvise(MADV_DONTNEED) is useful for two reasons: (a) it allows
userfaultfd monitors to unmap memory from monitored processes; and (b)
it is more efficient than madvise() since it is vectored and batches TLB
flushes more aggressively.
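For illustration, a minimal userspace sketch of the intended usage (a
sketch only: it assumes a kernel with this series applied and recent
kernel headers; there is no libc wrapper, so the raw syscalls are used):

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        struct iovec iov[2];
        long pidfd;
        ssize_t ret;

        if (argc != 4) {
                fprintf(stderr, "usage: %s <pid> <addr1> <addr2>\n", argv[0]);
                return 1;
        }

        /* Requires appropriate privileges over the target process. */
        pidfd = syscall(SYS_pidfd_open, atol(argv[1]), 0);
        if (pidfd < 0) {
                perror("pidfd_open");
                return 1;
        }

        /* Two non-contiguous, page-aligned ranges, zapped in one syscall. */
        iov[0].iov_base = (void *)strtoul(argv[2], NULL, 0);
        iov[0].iov_len = 4096;
        iov[1].iov_base = (void *)strtoul(argv[3], NULL, 0);
        iov[1].iov_len = 4096;

        ret = syscall(SYS_process_madvise, pidfd, iov, 2, MADV_DONTNEED, 0);
        if (ret < 0)
                perror("process_madvise");
        return ret < 0;
}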
The first three patches should not interfere with the outstanding patch
[1]. The rest might need to be rebased after [1] is applied.
[1] https://lore.kernel.org/linux-mm/CAJuCfpFDBJ_W1y2tqAT4BGtPbWrjjDud_JuKO8ZbnjYfeVNvRg@mail.gmail.com/
Cc: Peter Xu <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Mike Rapoport <[email protected]>
Nadav Amit (8):
mm/madvise: propagate vma->vm_end changes
mm/madvise: remove unnecessary check on madvise_dontneed_free()
mm/madvise: remove unnecessary checks on madvise_free_single_vma()
mm/madvise: define madvise behavior in a struct
mm/madvise: perform certain operations once on process_madvise()
mm/madvise: more aggressive TLB batching
mm/madvise: deduplicate code in madvise_dontneed_free()
mm/madvise: process_madvise(MADV_DONTNEED)
mm/internal.h | 5 +
mm/madvise.c | 396 ++++++++++++++++++++++++++++++--------------------
mm/memory.c | 2 +-
3 files changed, 248 insertions(+), 155 deletions(-)
--
2.25.1
From: Nadav Amit <[email protected]>
The comment in madvise_dontneed_free() says that vma splits that occur
while the mmap-lock is dropped, during userfaultfd_remove(), should be
handled correctly, but nothing in the code indicates that it is so: prev
is invalidated, and do_madvise() will therefore continue to update VMAs
from the "obsolete" end (i.e., the one before the split).
Propagate the changes to end from madvise_dontneed_free() back to
do_madvise() and continue the updates from the new end accordingly.
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Mike Rapoport <[email protected]>
Fixes: 70ccb92fdd90 ("userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED")
Signed-off-by: Nadav Amit <[email protected]>
---
mm/madvise.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 0734db8d53a7..a2b05352ebfe 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -768,10 +768,11 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
static long madvise_dontneed_free(struct vm_area_struct *vma,
struct vm_area_struct **prev,
- unsigned long start, unsigned long end,
+ unsigned long start, unsigned long *pend,
int behavior)
{
struct mm_struct *mm = vma->vm_mm;
+ unsigned long end = *pend;
*prev = vma;
if (!can_madv_lru_vma(vma))
@@ -811,6 +812,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
* end-vma->vm_end range, but the manager can
* handle a repetition fine.
*/
+ *pend = end;
end = vma->vm_end;
}
VM_WARN_ON(start >= end);
@@ -980,8 +982,10 @@ static int madvise_inject_error(int behavior,
static long
madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
- unsigned long start, unsigned long end, int behavior)
+ unsigned long start, unsigned long *pend, int behavior)
{
+ unsigned long end = *pend;
+
switch (behavior) {
case MADV_REMOVE:
return madvise_remove(vma, prev, start, end);
@@ -993,7 +997,7 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
return madvise_pageout(vma, prev, start, end);
case MADV_FREE:
case MADV_DONTNEED:
- return madvise_dontneed_free(vma, prev, start, end, behavior);
+ return madvise_dontneed_free(vma, prev, start, pend, behavior);
case MADV_POPULATE_READ:
case MADV_POPULATE_WRITE:
return madvise_populate(vma, prev, start, end, behavior);
@@ -1199,7 +1203,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
tmp = end;
/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
- error = madvise_vma(vma, &prev, start, tmp, behavior);
+ error = madvise_vma(vma, &prev, start, &tmp, behavior);
if (error)
goto out;
start = tmp;
--
2.25.1
From: Nadav Amit <[email protected]>
madvise_dontneed_free() is called only from madvise_vma() and the
behavior is always either MADV_FREE or MADV_DONTNEED. There is no need
to check again in madvise_dontneed_free() if the behavior is any
different.
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
mm/madvise.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index a2b05352ebfe..fe843513a4e8 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -820,10 +820,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
if (behavior == MADV_DONTNEED)
return madvise_dontneed_single_vma(vma, start, end);
- else if (behavior == MADV_FREE)
+ else /* behavior == MADV_FREE */
return madvise_free_single_vma(vma, start, end);
- else
- return -EINVAL;
}
static long madvise_populate(struct vm_area_struct *vma,
--
2.25.1
From: Nadav Amit <[email protected]>
madvise_free_single_vma() currently rechecks that the range fits within
the VMA, adapts it accordingly, and returns -EINVAL if the range is
entirely outside of the VMA.
The error-code of -EINVAL is incorrect according to the man pages (as it
should have been -ENOMEM), but anyhow the range that is provided to
madvise_free_single_vma() should always be valid. It is set correctly in
do_madvise() and then rechecked in madvise_dontneed_free() if the
mmap-lock is dropped.
Remove this check.
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
mm/madvise.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index fe843513a4e8..17e39c70704b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -716,14 +716,8 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
if (!vma_is_anonymous(vma))
return -EINVAL;
- range.start = max(vma->vm_start, start_addr);
- if (range.start >= vma->vm_end)
- return -EINVAL;
- range.end = min(vma->vm_end, end_addr);
- if (range.end <= vma->vm_start)
- return -EINVAL;
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
- range.start, range.end);
+ start_addr, end_addr);
lru_add_drain();
tlb_gather_mmu(&tlb, mm);
--
2.25.1
From: Nadav Amit <[email protected]>
The madvise behaviors differ from one another in several ways, and the
handling of these differences is spread across several functions. Use
the design pattern from io_uring to define the actions that are
required for each behavior.
The next patches will get rid of old helper functions that are modified
in this patch and the redundant use of array_index_nospec(). The next
patches will add more actions for each behavior into the new struct.
No functional change is intended.
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
mm/madvise.c | 168 +++++++++++++++++++++++++++++++++------------------
1 file changed, 109 insertions(+), 59 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 17e39c70704b..127507c71ba9 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -29,6 +29,7 @@
#include <linux/swapops.h>
#include <linux/shmem_fs.h>
#include <linux/mmu_notifier.h>
+#include <linux/nospec.h>
#include <asm/tlb.h>
@@ -39,6 +40,101 @@ struct madvise_walk_private {
bool pageout;
};
+struct madvise_info {
+ u8 behavior_valid: 1;
+ u8 process_behavior_valid: 1;
+ u8 need_mmap_read_only: 1;
+};
+
+static const struct madvise_info madvise_info[MADV_SOFT_OFFLINE+1] = {
+ [MADV_DOFORK] = {
+ .behavior_valid = 1,
+ },
+ [MADV_DONTFORK] = {
+ .behavior_valid = 1,
+ },
+ [MADV_NORMAL] = {
+ .behavior_valid = 1,
+ },
+ [MADV_SEQUENTIAL] = {
+ .behavior_valid = 1,
+ },
+ [MADV_RANDOM] = {
+ .behavior_valid = 1,
+ },
+ [MADV_REMOVE] = {
+ .behavior_valid = 1,
+ .need_mmap_read_only = 1,
+ },
+ [MADV_WILLNEED] = {
+ .behavior_valid = 1,
+ .process_behavior_valid = 1,
+ .need_mmap_read_only = 1,
+ },
+ [MADV_DONTNEED] = {
+ .behavior_valid = 1,
+ .need_mmap_read_only = 1,
+ },
+ [MADV_FREE] = {
+ .behavior_valid = 1,
+ .need_mmap_read_only = 1,
+ },
+ [MADV_COLD] = {
+ .behavior_valid = 1,
+ .process_behavior_valid = 1,
+ .need_mmap_read_only = 1,
+ },
+ [MADV_PAGEOUT] = {
+ .behavior_valid = 1,
+ .process_behavior_valid = 1,
+ .need_mmap_read_only = 1,
+ },
+#ifdef CONFIG_KSM
+ [MADV_MERGEABLE] = {
+ .behavior_valid = 1,
+ },
+ [MADV_UNMERGEABLE] = {
+ .behavior_valid = 1,
+ },
+#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ [MADV_HUGEPAGE] = {
+ .behavior_valid = 1,
+ },
+ [MADV_NOHUGEPAGE] = {
+ .behavior_valid = 1,
+ },
+#endif
+ [MADV_DONTDUMP] = {
+ .behavior_valid = 1,
+ },
+ [MADV_DODUMP] = {
+ .behavior_valid = 1,
+ },
+ [MADV_WIPEONFORK] = {
+ .behavior_valid = 1,
+ },
+ [MADV_KEEPONFORK] = {
+ .behavior_valid = 1,
+ },
+#ifdef CONFIG_MEMORY_FAILURE
+ [MADV_HWPOISON] = {
+ .behavior_valid = 1,
+ },
+ [MADV_SOFT_OFFLINE] = {
+ .behavior_valid = 1,
+ },
+#endif
+ [MADV_POPULATE_READ] = {
+ .behavior_valid = 1,
+ .need_mmap_read_only = 1,
+ },
+ [MADV_POPULATE_WRITE] = {
+ .behavior_valid = 1,
+ .need_mmap_read_only = 1,
+ },
+};
+
/*
* Any behaviour which results in changes to the vma->vm_flags needs to
* take mmap_lock for writing. Others, which simply traverse vmas, need
@@ -46,20 +142,7 @@ struct madvise_walk_private {
*/
static int madvise_need_mmap_write(int behavior)
{
- switch (behavior) {
- case MADV_REMOVE:
- case MADV_WILLNEED:
- case MADV_DONTNEED:
- case MADV_COLD:
- case MADV_PAGEOUT:
- case MADV_FREE:
- case MADV_POPULATE_READ:
- case MADV_POPULATE_WRITE:
- return 0;
- default:
- /* be safe, default to 1. list exceptions explicitly */
- return 1;
- }
+ return !madvise_info[behavior].need_mmap_read_only;
}
/*
@@ -999,56 +1082,23 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
}
static bool
-madvise_behavior_valid(int behavior)
+madvise_behavior_valid(int *behavior)
{
- switch (behavior) {
- case MADV_DOFORK:
- case MADV_DONTFORK:
- case MADV_NORMAL:
- case MADV_SEQUENTIAL:
- case MADV_RANDOM:
- case MADV_REMOVE:
- case MADV_WILLNEED:
- case MADV_DONTNEED:
- case MADV_FREE:
- case MADV_COLD:
- case MADV_PAGEOUT:
- case MADV_POPULATE_READ:
- case MADV_POPULATE_WRITE:
-#ifdef CONFIG_KSM
- case MADV_MERGEABLE:
- case MADV_UNMERGEABLE:
-#endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- case MADV_HUGEPAGE:
- case MADV_NOHUGEPAGE:
-#endif
- case MADV_DONTDUMP:
- case MADV_DODUMP:
- case MADV_WIPEONFORK:
- case MADV_KEEPONFORK:
-#ifdef CONFIG_MEMORY_FAILURE
- case MADV_SOFT_OFFLINE:
- case MADV_HWPOISON:
-#endif
- return true;
-
- default:
+ if (*behavior >= ARRAY_SIZE(madvise_info))
return false;
- }
+
+ *behavior = array_index_nospec(*behavior, ARRAY_SIZE(madvise_info));
+ return madvise_info[*behavior].behavior_valid;
}
static bool
-process_madvise_behavior_valid(int behavior)
+process_madvise_behavior_valid(int *behavior)
{
- switch (behavior) {
- case MADV_COLD:
- case MADV_PAGEOUT:
- case MADV_WILLNEED:
- return true;
- default:
+ if (*behavior >= ARRAY_SIZE(madvise_info))
return false;
- }
+
+ *behavior = array_index_nospec(*behavior, ARRAY_SIZE(madvise_info));
+ return madvise_info[*behavior].process_behavior_valid;
}
/*
@@ -1133,7 +1183,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
start = untagged_addr(start);
- if (!madvise_behavior_valid(behavior))
+ if (!madvise_behavior_valid(&behavior))
return error;
if (!PAGE_ALIGNED(start))
@@ -1258,7 +1308,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
goto put_pid;
}
- if (!process_madvise_behavior_valid(behavior)) {
+ if (!process_madvise_behavior_valid(&behavior)) {
ret = -EINVAL;
goto release_task;
}
--
2.25.1
From: Nadav Amit <[email protected]>
There are certain operations that can be performed only once on
process_madvise() instead of performing them for each IO vector.
Acquiring the mmap-lock and initializing blk_plug are two such
operations.
Collect the aforementioned operations into madvise_start() and
madvise_finish(). The next patches will add additional operations into
these functions.
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
mm/madvise.c | 139 +++++++++++++++++++++++++++++++--------------------
1 file changed, 86 insertions(+), 53 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 127507c71ba9..84b86ae85671 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -43,6 +43,13 @@ struct madvise_walk_private {
struct madvise_info {
u8 behavior_valid: 1;
u8 process_behavior_valid: 1;
+ u8 no_mmap_lock: 1;
+
+ /*
+ * Any behaviour which results in changes to the vma->vm_flags needs to
+ * take mmap_lock for writing. Others, which simply traverse vmas, need
+ * to only take it for reading.
+ */
u8 need_mmap_read_only: 1;
};
@@ -120,9 +127,11 @@ static const struct madvise_info madvise_info[MADV_SOFT_OFFLINE+1] = {
#ifdef CONFIG_MEMORY_FAILURE
[MADV_HWPOISON] = {
.behavior_valid = 1,
+ .no_mmap_lock = 1,
},
[MADV_SOFT_OFFLINE] = {
.behavior_valid = 1,
+ .no_mmap_lock = 1,
},
#endif
[MADV_POPULATE_READ] = {
@@ -135,16 +144,6 @@ static const struct madvise_info madvise_info[MADV_SOFT_OFFLINE+1] = {
},
};
-/*
- * Any behaviour which results in changes to the vma->vm_flags needs to
- * take mmap_lock for writing. Others, which simply traverse vmas, need
- * to only take it for reading.
- */
-static int madvise_need_mmap_write(int behavior)
-{
- return !madvise_info[behavior].need_mmap_read_only;
-}
-
/*
* We can potentially split a vm area into separate
* areas, each area with its own behavior.
@@ -1081,26 +1080,6 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
}
}
-static bool
-madvise_behavior_valid(int *behavior)
-{
- if (*behavior >= ARRAY_SIZE(madvise_info))
- return false;
-
- *behavior = array_index_nospec(*behavior, ARRAY_SIZE(madvise_info));
- return madvise_info[*behavior].behavior_valid;
-}
-
-static bool
-process_madvise_behavior_valid(int *behavior)
-{
- if (*behavior >= ARRAY_SIZE(madvise_info))
- return false;
-
- *behavior = array_index_nospec(*behavior, ARRAY_SIZE(madvise_info));
- return madvise_info[*behavior].process_behavior_valid;
-}
-
/*
* The madvise(2) system call.
*
@@ -1171,21 +1150,17 @@ process_madvise_behavior_valid(int *behavior)
* -EBADF - map exists, but area maps something that isn't a file.
* -EAGAIN - a kernel resource was temporarily unavailable.
*/
-int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
+int madvise_one_range(struct mm_struct *mm, unsigned long start, size_t len_in,
+ int behavior)
{
unsigned long end, tmp;
struct vm_area_struct *vma, *prev;
int unmapped_error = 0;
int error = -EINVAL;
- int write;
size_t len;
- struct blk_plug plug;
start = untagged_addr(start);
- if (!madvise_behavior_valid(&behavior))
- return error;
-
if (!PAGE_ALIGNED(start))
return error;
len = PAGE_ALIGN(len_in);
@@ -1207,14 +1182,6 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
return madvise_inject_error(behavior, start, start + len_in);
#endif
- write = madvise_need_mmap_write(behavior);
- if (write) {
- if (mmap_write_lock_killable(mm))
- return -EINTR;
- } else {
- mmap_read_lock(mm);
- }
-
/*
* If the interval [start,end) covers some unmapped address
* ranges, just ignore them, but return -ENOMEM at the end.
@@ -1224,7 +1191,6 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
if (vma && start > vma->vm_start)
prev = vma;
- blk_start_plug(&plug);
for (;;) {
/* Still start < end. */
error = -ENOMEM;
@@ -1260,15 +1226,72 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
vma = find_vma(mm, start);
}
out:
- blk_finish_plug(&plug);
- if (write)
- mmap_write_unlock(mm);
- else
- mmap_read_unlock(mm);
return error;
}
+static int
+madvise_start(struct mm_struct *mm, struct madvise_info behavior_info,
+ struct blk_plug *plug)
+{
+ if (!behavior_info.no_mmap_lock) {
+ if (behavior_info.need_mmap_read_only)
+ mmap_read_lock(mm);
+ else if (mmap_write_lock_killable(mm))
+ return -EINTR;
+ }
+
+ blk_start_plug(plug);
+ return 0;
+}
+
+static void
+madvise_finish(struct mm_struct *mm, struct madvise_info behavior_info,
+ struct blk_plug *plug)
+{
+ blk_finish_plug(plug);
+
+ if (!behavior_info.no_mmap_lock) {
+ if (behavior_info.need_mmap_read_only)
+ mmap_read_unlock(mm);
+ else
+ mmap_write_unlock(mm);
+ }
+}
+
+static struct madvise_info madvise_behavior_info(int behavior)
+{
+ if (behavior >= ARRAY_SIZE(madvise_info) || behavior < 0) {
+ const struct madvise_info invalid = {0};
+ return invalid;
+ }
+
+ behavior = array_index_nospec(behavior, ARRAY_SIZE(madvise_info));
+ return madvise_info[behavior];
+}
+
+int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
+ int behavior)
+{
+ struct madvise_info behavior_info;
+ struct blk_plug plug;
+ int ret = -EINVAL;
+
+ behavior_info = madvise_behavior_info(behavior);
+
+ if (!behavior_info.behavior_valid)
+ return ret;
+
+ ret = madvise_start(mm, behavior_info, &plug);
+ if (ret != 0)
+ return ret;
+
+ ret = madvise_one_range(mm, start, len_in, behavior);
+
+ madvise_finish(mm, behavior_info, &plug);
+ return ret;
+}
+
SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
{
return do_madvise(current->mm, start, len_in, behavior);
@@ -1286,6 +1309,8 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
struct mm_struct *mm;
size_t total_len;
unsigned int f_flags;
+ struct madvise_info behavior_info;
+ struct blk_plug plug;
if (flags != 0) {
ret = -EINVAL;
@@ -1308,7 +1333,9 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
goto put_pid;
}
- if (!process_madvise_behavior_valid(&behavior)) {
+ behavior_info = madvise_behavior_info(behavior);
+
+ if (!behavior_info.process_behavior_valid) {
ret = -EINVAL;
goto release_task;
}
@@ -1331,15 +1358,21 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
total_len = iov_iter_count(&iter);
+ ret = madvise_start(mm, behavior_info, &plug);
+ if (ret != 0)
+ goto release_mm;
+
while (iov_iter_count(&iter)) {
iovec = iov_iter_iovec(&iter);
- ret = do_madvise(mm, (unsigned long)iovec.iov_base,
- iovec.iov_len, behavior);
+ ret = madvise_one_range(mm, (unsigned long)iovec.iov_base,
+ iovec.iov_len, behavior);
if (ret < 0)
break;
iov_iter_advance(&iter, iovec.iov_len);
}
+ madvise_finish(mm, behavior_info, &plug);
+
if (ret == 0)
ret = total_len - iov_iter_count(&iter);
--
2.25.1
From: Nadav Amit <[email protected]>
Userfaultfd users, for the sake of memory management, debugging or other
types of monitoring may wish to use process_madvise(MADV_DONTNEED).
Moreover, since process_madvise() supports vectored operations, and now
supports efficient TLB flushes, existing users of madvise(MADV_DONTNEED)
that wish to apply the advice to non-contiguous memory may prefer
the vectored process_madvise() flavor for performance reasons.
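For example (a sketch; 'iov', 'nranges' and 'pidfd' are hypothetical,
and a kernel with this series is assumed):

        /* Today: one syscall, and one TLB flush, per range. */
        for (i = 0; i < nranges; i++)
                madvise(iov[i].iov_base, iov[i].iov_len, MADV_DONTNEED);

        /* With this series: one vectored call that batches TLB flushes. */
        syscall(SYS_process_madvise, pidfd, iov, nranges, MADV_DONTNEED, 0);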
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
mm/madvise.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/madvise.c b/mm/madvise.c
index 9528c38fb6a4..d8f70960680e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -81,6 +81,7 @@ static const struct madvise_info madvise_info[MADV_SOFT_OFFLINE+1] = {
},
[MADV_DONTNEED] = {
.behavior_valid = 1,
+ .process_behavior_valid = 1,
.need_mmap_read_only = 1,
.tlb_batching = 1,
},
--
2.25.1
From: Nadav Amit <[email protected]>
process_madvise(MADV_COLD) and future variants lose the opportunity to
perform TLB batching across IO vectors. As a result, a TLB flush (and
potentially a shootdown) is performed for each IO vector, instead of one
for the whole operation.
Batch TLB operations across IO vectors. Open code zap_page_range() to do
so. A future patch will eliminate redundancies between
madvise_free_single_vma() and madvise_dontneed_single_vma().
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
mm/internal.h | 5 +++
mm/madvise.c | 104 +++++++++++++++++++++++++++++---------------------
mm/memory.c | 2 +-
3 files changed, 67 insertions(+), 44 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 18256e32a14c..2313a6739ce7 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -59,6 +59,11 @@ void unmap_page_range(struct mmu_gather *tlb,
unsigned long addr, unsigned long end,
struct zap_details *details);
+void unmap_single_vma(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, unsigned long start_addr,
+ unsigned long end_addr,
+ struct zap_details *details);
+
void do_page_cache_ra(struct readahead_control *, unsigned long nr_to_read,
unsigned long lookahead_size);
void force_page_cache_ra(struct readahead_control *, unsigned long nr);
diff --git a/mm/madvise.c b/mm/madvise.c
index 84b86ae85671..e679cfa94655 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -51,6 +51,7 @@ struct madvise_info {
* to only take it for reading.
*/
u8 need_mmap_read_only: 1;
+ u8 tlb_batching: 1;
};
static const struct madvise_info madvise_info[MADV_SOFT_OFFLINE+1] = {
@@ -81,20 +82,24 @@ static const struct madvise_info madvise_info[MADV_SOFT_OFFLINE+1] = {
[MADV_DONTNEED] = {
.behavior_valid = 1,
.need_mmap_read_only = 1,
+ .tlb_batching = 1,
},
[MADV_FREE] = {
.behavior_valid = 1,
.need_mmap_read_only = 1,
+ .tlb_batching = 1,
},
[MADV_COLD] = {
.behavior_valid = 1,
.process_behavior_valid = 1,
.need_mmap_read_only = 1,
+ .tlb_batching = 1,
},
[MADV_PAGEOUT] = {
.behavior_valid = 1,
.process_behavior_valid = 1,
.need_mmap_read_only = 1,
+ .tlb_batching = 1,
},
#ifdef CONFIG_KSM
[MADV_MERGEABLE] = {
@@ -578,21 +583,16 @@ static void madvise_cold_page_range(struct mmu_gather *tlb,
tlb_end_vma(tlb, vma);
}
-static long madvise_cold(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start_addr, unsigned long end_addr)
+static long madvise_cold(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ struct vm_area_struct **prev, unsigned long start_addr,
+ unsigned long end_addr)
{
- struct mm_struct *mm = vma->vm_mm;
- struct mmu_gather tlb;
-
*prev = vma;
if (!can_madv_lru_vma(vma))
return -EINVAL;
lru_add_drain();
- tlb_gather_mmu(&tlb, mm);
- madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
- tlb_finish_mmu(&tlb);
+ madvise_cold_page_range(tlb, vma, start_addr, end_addr);
return 0;
}
@@ -628,13 +628,10 @@ static inline bool can_do_pageout(struct vm_area_struct *vma)
file_permission(vma->vm_file, MAY_WRITE) == 0;
}
-static long madvise_pageout(struct vm_area_struct *vma,
+static long madvise_pageout(struct mmu_gather *tlb, struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start_addr, unsigned long end_addr)
{
- struct mm_struct *mm = vma->vm_mm;
- struct mmu_gather tlb;
-
*prev = vma;
if (!can_madv_lru_vma(vma))
return -EINVAL;
@@ -643,9 +640,7 @@ static long madvise_pageout(struct vm_area_struct *vma,
return 0;
lru_add_drain();
- tlb_gather_mmu(&tlb, mm);
- madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
- tlb_finish_mmu(&tlb);
+ madvise_pageout_page_range(tlb, vma, start_addr, end_addr);
return 0;
}
@@ -787,12 +782,12 @@ static const struct mm_walk_ops madvise_free_walk_ops = {
.pmd_entry = madvise_free_pte_range,
};
-static int madvise_free_single_vma(struct vm_area_struct *vma,
+static int madvise_free_single_vma(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
unsigned long start_addr, unsigned long end_addr)
{
struct mm_struct *mm = vma->vm_mm;
struct mmu_notifier_range range;
- struct mmu_gather tlb;
/* MADV_FREE works for only anon vma at the moment */
if (!vma_is_anonymous(vma))
@@ -802,16 +797,14 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
start_addr, end_addr);
lru_add_drain();
- tlb_gather_mmu(&tlb, mm);
update_hiwater_rss(mm);
mmu_notifier_invalidate_range_start(&range);
tlb_start_vma(&tlb, vma);
walk_page_range(vma->vm_mm, range.start, range.end,
- &madvise_free_walk_ops, &tlb);
- tlb_end_vma(&tlb, vma);
+ &madvise_free_walk_ops, tlb);
+ tlb_end_vma(tlb, vma);
mmu_notifier_invalidate_range_end(&range);
- tlb_finish_mmu(&tlb);
return 0;
}
@@ -820,7 +813,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
* Application no longer needs these pages. If the pages are dirty,
* it's OK to just throw them away. The app will be more careful about
* data it wants to keep. Be sure to free swap resources too. The
- * zap_page_range call sets things up for shrink_active_list to actually free
+ * unmap_single_vma call sets things up for shrink_active_list to actually free
* these pages later if no one else has touched them in the meantime,
* although we could add these pages to a global reuse list for
* shrink_active_list to pick up before reclaiming other pages.
@@ -835,14 +828,28 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
* An interface that causes the system to free clean pages and flush
* dirty pages is already available as msync(MS_INVALIDATE).
*/
-static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
+static long madvise_dontneed_single_vma(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
- zap_page_range(vma, start, end - start);
+ struct mm_struct *mm = vma->vm_mm;
+ struct mmu_notifier_range range;
+
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, start,
+ end);
+
+ lru_add_drain();
+ update_hiwater_rss(mm);
+
+ mmu_notifier_invalidate_range_start(&range);
+ unmap_single_vma(tlb, vma, start, end, NULL);
+ mmu_notifier_invalidate_range_end(&range);
+
return 0;
}
-static long madvise_dontneed_free(struct vm_area_struct *vma,
+static long madvise_dontneed_free(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long *pend,
int behavior)
@@ -895,9 +902,9 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
}
if (behavior == MADV_DONTNEED)
- return madvise_dontneed_single_vma(vma, start, end);
+ return madvise_dontneed_single_vma(tlb, vma, start, end);
else /* behavior == MADV_FREE */
- return madvise_free_single_vma(vma, start, end);
+ return madvise_free_single_vma(tlb, vma, start, end);
}
static long madvise_populate(struct vm_area_struct *vma,
@@ -1055,8 +1062,9 @@ static int madvise_inject_error(int behavior,
#endif
static long
-madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
- unsigned long start, unsigned long *pend, int behavior)
+madvise_vma(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ struct vm_area_struct **prev, unsigned long start,
+ unsigned long *pend, int behavior)
{
unsigned long end = *pend;
@@ -1066,12 +1074,13 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
case MADV_WILLNEED:
return madvise_willneed(vma, prev, start, end);
case MADV_COLD:
- return madvise_cold(vma, prev, start, end);
+ return madvise_cold(tlb, vma, prev, start, end);
case MADV_PAGEOUT:
- return madvise_pageout(vma, prev, start, end);
+ return madvise_pageout(tlb, vma, prev, start, end);
case MADV_FREE:
case MADV_DONTNEED:
- return madvise_dontneed_free(vma, prev, start, pend, behavior);
+ return madvise_dontneed_free(tlb, vma, prev, start, pend,
+ behavior);
case MADV_POPULATE_READ:
case MADV_POPULATE_WRITE:
return madvise_populate(vma, prev, start, end, behavior);
@@ -1151,7 +1160,7 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
* -EAGAIN - a kernel resource was temporarily unavailable.
*/
int madvise_one_range(struct mm_struct *mm, unsigned long start, size_t len_in,
- int behavior)
+ int behavior, struct mmu_gather *tlb)
{
unsigned long end, tmp;
struct vm_area_struct *vma, *prev;
@@ -1211,7 +1220,7 @@ int madvise_one_range(struct mm_struct *mm, unsigned long start, size_t len_in,
tmp = end;
/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
- error = madvise_vma(vma, &prev, start, &tmp, behavior);
+ error = madvise_vma(tlb, vma, &prev, start, &tmp, behavior);
if (error)
goto out;
start = tmp;
@@ -1225,6 +1234,7 @@ int madvise_one_range(struct mm_struct *mm, unsigned long start, size_t len_in,
else /* madvise_remove dropped mmap_lock */
vma = find_vma(mm, start);
}
+
out:
return error;
@@ -1232,7 +1242,7 @@ int madvise_one_range(struct mm_struct *mm, unsigned long start, size_t len_in,
static int
madvise_start(struct mm_struct *mm, struct madvise_info behavior_info,
- struct blk_plug *plug)
+ struct mmu_gather *tlb, struct blk_plug *plug)
{
if (!behavior_info.no_mmap_lock) {
if (behavior_info.need_mmap_read_only)
@@ -1241,16 +1251,22 @@ madvise_start(struct mm_struct *mm, struct madvise_info behavior_info,
return -EINTR;
}
+ if (behavior_info.tlb_batching)
+ tlb_gather_mmu(tlb, mm);
+
blk_start_plug(plug);
return 0;
}
static void
madvise_finish(struct mm_struct *mm, struct madvise_info behavior_info,
- struct blk_plug *plug)
+ struct mmu_gather *tlb, struct blk_plug *plug)
{
blk_finish_plug(plug);
+ if (behavior_info.tlb_batching)
+ tlb_finish_mmu(tlb);
+
if (!behavior_info.no_mmap_lock) {
if (behavior_info.need_mmap_read_only)
mmap_read_unlock(mm);
@@ -1274,6 +1290,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
int behavior)
{
struct madvise_info behavior_info;
+ struct mmu_gather tlb;
struct blk_plug plug;
int ret = -EINVAL;
@@ -1282,13 +1299,13 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
if (!behavior_info.behavior_valid)
return ret;
- ret = madvise_start(mm, behavior_info, &plug);
+ ret = madvise_start(mm, behavior_info, &tlb, &plug);
if (ret != 0)
return ret;
- ret = madvise_one_range(mm, start, len_in, behavior);
+ ret = madvise_one_range(mm, start, len_in, behavior, &tlb);
- madvise_finish(mm, behavior_info, &plug);
+ madvise_finish(mm, behavior_info, &tlb, &plug);
return ret;
}
@@ -1310,6 +1327,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
size_t total_len;
unsigned int f_flags;
struct madvise_info behavior_info;
+ struct mmu_gather tlb;
struct blk_plug plug;
if (flags != 0) {
@@ -1358,20 +1376,20 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
total_len = iov_iter_count(&iter);
- ret = madvise_start(mm, behavior_info, &plug);
+ ret = madvise_start(mm, behavior_info, &tlb, &plug);
if (ret != 0)
goto release_mm;
while (iov_iter_count(&iter)) {
iovec = iov_iter_iovec(&iter);
ret = madvise_one_range(mm, (unsigned long)iovec.iov_base,
- iovec.iov_len, behavior);
+ iovec.iov_len, behavior, &tlb);
if (ret < 0)
break;
iov_iter_advance(&iter, iovec.iov_len);
}
- madvise_finish(mm, behavior_info, &plug);
+ madvise_finish(mm, behavior_info, &tlb, &plug);
if (ret == 0)
ret = total_len - iov_iter_count(&iter);
diff --git a/mm/memory.c b/mm/memory.c
index 12a7b2094434..a00fb705a77f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1541,7 +1541,7 @@ void unmap_page_range(struct mmu_gather *tlb,
}
-static void unmap_single_vma(struct mmu_gather *tlb,
+void unmap_single_vma(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start_addr,
unsigned long end_addr,
struct zap_details *details)
--
2.25.1
From: Nadav Amit <[email protected]>
Following the previous patches, madvise_dontneed_single_vma() and
madvise_free_single_vma() have redundant code. Consolidate it
into madvise_dontneed_free().
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Colin Cross <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
mm/madvise.c | 40 +++++++++++++++-------------------------
1 file changed, 15 insertions(+), 25 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index e679cfa94655..9528c38fb6a4 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -786,25 +786,14 @@ static int madvise_free_single_vma(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long start_addr, unsigned long end_addr)
{
- struct mm_struct *mm = vma->vm_mm;
- struct mmu_notifier_range range;
-
/* MADV_FREE works for only anon vma at the moment */
if (!vma_is_anonymous(vma))
return -EINVAL;
- mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
- start_addr, end_addr);
-
- lru_add_drain();
- update_hiwater_rss(mm);
-
- mmu_notifier_invalidate_range_start(&range);
tlb_start_vma(&tlb, vma);
- walk_page_range(vma->vm_mm, range.start, range.end,
+ walk_page_range(vma->vm_mm, start_addr, end_addr,
&madvise_free_walk_ops, tlb);
tlb_end_vma(tlb, vma);
- mmu_notifier_invalidate_range_end(&range);
return 0;
}
@@ -832,18 +821,7 @@ static long madvise_dontneed_single_vma(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
- struct mm_struct *mm = vma->vm_mm;
- struct mmu_notifier_range range;
-
- mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, start,
- end);
-
- lru_add_drain();
- update_hiwater_rss(mm);
-
- mmu_notifier_invalidate_range_start(&range);
unmap_single_vma(tlb, vma, start, end, NULL);
- mmu_notifier_invalidate_range_end(&range);
return 0;
}
@@ -855,7 +833,9 @@ static long madvise_dontneed_free(struct mmu_gather *tlb,
int behavior)
{
struct mm_struct *mm = vma->vm_mm;
+ struct mmu_notifier_range range;
unsigned long end = *pend;
+ long ret;
*prev = vma;
if (!can_madv_lru_vma(vma))
@@ -901,10 +881,20 @@ static long madvise_dontneed_free(struct mmu_gather *tlb,
VM_WARN_ON(start >= end);
}
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, start,
+ end);
+
+ lru_add_drain();
+ update_hiwater_rss(mm);
+ mmu_notifier_invalidate_range_start(&range);
+
if (behavior == MADV_DONTNEED)
- return madvise_dontneed_single_vma(tlb, vma, start, end);
+ ret = madvise_dontneed_single_vma(tlb, vma, start, end);
else /* behavior == MADV_FREE */
- return madvise_free_single_vma(tlb, vma, start, end);
+ ret = madvise_free_single_vma(tlb, vma, start, end);
+
+ mmu_notifier_invalidate_range_end(&range);
+ return ret;
}
static long madvise_populate(struct vm_area_struct *vma,
--
2.25.1
On Sun, Sep 26, 2021 at 09:12:52AM -0700, Nadav Amit wrote:
> From: Nadav Amit <[email protected]>
>
> The comment in madvise_dontneed_free() says that vma splits that occur
> while the mmap-lock is dropped, during userfaultfd_remove(), should be
> handled correctly, but nothing in the code indicates that it is so: prev
> is invalidated, and do_madvise() will therefore continue to update VMAs
> from the "obsolete" end (i.e., the one before the split).
>
> Propagate the changes to end from madvise_dontneed_free() back to
> do_madvise() and continue the updates from the new end accordingly.
Could you describe in detail a race that would lead to wrong behaviour?
If the mmap lock was dropped, any change to the VMA layout can appear. We
can have a totally unrelated VMA there.
Either way, if userspace changes the VMA layout for a region that is under
madvise(MADV_DONTNEED), it is totally broken. I don't see a valid reason to
do this.
The current behaviour looks reasonable to me. Yes, we can miss VMAs, but
these VMAs can also be created just after madvise() is finished.
--
Kirill A. Shutemov
On Sun, Sep 26, 2021 at 09:12:53AM -0700, Nadav Amit wrote:
> From: Nadav Amit <[email protected]>
>
> madvise_dontneed_free() is called only from madvise_vma() and the
> behavior is always either MADV_FREE or MADV_DONTNEED. There is no need
> to check again in madvise_dontneed_free() if the behavior is any
> different.
So what. The check is free. Compiler should be clever enough to eliminate
the additional check. If there's a new MADV_DONTNEED flavour, the change
would have to be effectively reverted.
NAK.
--
Kirill A. Shutemov
On Sun, Sep 26, 2021 at 09:12:54AM -0700, Nadav Amit wrote:
> From: Nadav Amit <[email protected]>
>
> madvise_free_single_vma() currently rechecks that the range fits within
> the VMA, adapts it accordingly, and returns -EINVAL if the range is
> entirely outside of the VMA.
>
> The error-code of -EINVAL is incorrect according to the man pages (as it
> should have been -ENOMEM), but anyhow the range that is provided to
> madvise_free_single_vma() should always be valid. It is set correctly in
> do_madvise() and then rechecked in madvise_dontneed_free() is the
> mmap-lock is dropped.
>
> Remove this check.
>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Colin Cross <[email protected]>
> > Cc: Suren Baghdasaryan <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Signed-off-by: Nadav Amit <[email protected]>
> ---
> mm/madvise.c | 8 +-------
> 1 file changed, 1 insertion(+), 7 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index fe843513a4e8..17e39c70704b 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -716,14 +716,8 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> if (!vma_is_anonymous(vma))
> return -EINVAL;
>
> - range.start = max(vma->vm_start, start_addr);
> - if (range.start >= vma->vm_end)
> - return -EINVAL;
> - range.end = min(vma->vm_end, end_addr);
> - if (range.end <= vma->vm_start)
> - return -EINVAL;
How did you test this change?
As far as I can see, you leave 'range' uninitialized, but use it for
walk_page_range() and mmu_notifiers.
NAK.
> mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
> - range.start, range.end);
> + start_addr, end_addr);
>
> lru_add_drain();
> tlb_gather_mmu(&tlb, mm);
> --
> 2.25.1
>
>
--
Kirill A. Shutemov
On 26.09.21 18:12, Nadav Amit wrote:
> From: Nadav Amit <[email protected]>
>
> The goal of these patches is to add support for
> process_madvise(MADV_DONTNEED). Yet, in the process some (arguably)
> useful cleanups, a bug fix and performance enhancements are performed.
>
> The patches try to consolidate the logic across different behaviors, and
> to a certain extent overlap/conflict with an outstanding patch that does
> something similar [1]. This consolidation however is mostly orthogonal
> to the aforementioned one and done in order to clarify what is done with
> respect to locks and TLB for each behavior and to batch these operations
> more efficiently on process_madvise().
>
> process_madvise(MADV_DONTNEED) is useful for two reasons: (a) it allows
> userfaultfd monitors to unmap memory from monitored processes; and (b)
> it is more efficient than madvise() since it is vectored and batches TLB
> flushes more aggressively.
MADV_DONTNEED on MAP_PRIVATE memory is a target-visible operation; this
is very different to all the other process_madvise() calls we allow,
which are merely hints, but the target cannot be broken. I don't think
this is acceptable.
--
Thanks,
David / dhildenb
On Mon, Sep 27, 2021 at 12:17:09PM +0300, Kirill A. Shutemov wrote:
> On Sun, Sep 26, 2021 at 09:12:54AM -0700, Nadav Amit wrote:
> > From: Nadav Amit <[email protected]>
> >
> > madvise_free_single_vma() currently rechecks that the range fits within
> > the VMA, adapts it accordingly, and returns -EINVAL if the range is
> > entirely outside of the VMA.
> >
> > The error-code of -EINVAL is incorrect according to the man pages (as it
> > should have been -ENOMEM), but anyhow the range that is provided to
> > madvise_free_single_vma() should always be valid. It is set correctly in
> > do_madvise() and then rechecked in madvise_dontneed_free() is the
s/is/if/
> > mmap-lock is dropped.
> >
> > Remove this check.
> >
> > Cc: Andrea Arcangeli <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Minchan Kim <[email protected]>
> > Cc: Colin Cross <[email protected]>
> > > Cc: Suren Baghdasaryan <[email protected]>
> > Cc: Mike Rapoport <[email protected]>
> > Signed-off-by: Nadav Amit <[email protected]>
> > ---
> > mm/madvise.c | 8 +-------
> > 1 file changed, 1 insertion(+), 7 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index fe843513a4e8..17e39c70704b 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -716,14 +716,8 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> > if (!vma_is_anonymous(vma))
> > return -EINVAL;
> >
> > - range.start = max(vma->vm_start, start_addr);
> > - if (range.start >= vma->vm_end)
> > - return -EINVAL;
> > - range.end = min(vma->vm_end, end_addr);
> > - if (range.end <= vma->vm_start)
> > - return -EINVAL;
>
> How did you test this change?
>
> As far as I can see, you leave 'range' uninitialized, but use it for
> walk_page_range() and mmu_notifiers.
>
> NAK.
Sorry. My bad. mmu_notifier_range_init will init the range even if mmu
notifiers are disabled.
>
> > mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
> > - range.start, range.end);
> > + start_addr, end_addr);
> >
> > lru_add_drain();
> > tlb_gather_mmu(&tlb, mm);
> > --
> > 2.25.1
> >
> >
>
> --
> Kirill A. Shutemov
--
Kirill A. Shutemov
On Sun, Sep 26, 2021 at 09:12:55AM -0700, Nadav Amit wrote:
> From: Nadav Amit <[email protected]>
>
> The madvise behaviors differ from one another in several ways, and the
> handling of these differences is spread across several functions. Use
> the design pattern from io_uring to define the actions that are
> required for each behavior.
>
> The next patches will get rid of old helper functions that are modified
> in this patch and the redundant use of array_index_nospec(). The next
> patches will add more actions for each behavior into the new struct.
>
> No functional change is intended.
>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Colin Cross <[email protected]>
> Cc: Suren Baghdasaryan <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Signed-off-by: Nadav Amit <[email protected]>
> ---
> mm/madvise.c | 168 +++++++++++++++++++++++++++++++++------------------
> 1 file changed, 109 insertions(+), 59 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 17e39c70704b..127507c71ba9 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -29,6 +29,7 @@
> #include <linux/swapops.h>
> #include <linux/shmem_fs.h>
> #include <linux/mmu_notifier.h>
> +#include <linux/nospec.h>
>
> #include <asm/tlb.h>
>
> @@ -39,6 +40,101 @@ struct madvise_walk_private {
> bool pageout;
> };
>
> +struct madvise_info {
> + u8 behavior_valid: 1;
> + u8 process_behavior_valid: 1;
> + u8 need_mmap_read_only: 1;
> +};
> +
> +static const struct madvise_info madvise_info[MADV_SOFT_OFFLINE+1] = {
MADV_SOFT_OFFLINE+1 smells bad.
And I don't like the change in general. Given that MADV_SOFT_OFFLINE is
101, the array will be mostly empty.
I donno. I don't see any improvement with the patch. But maybe it's only me.
--
Kirill A. Shutemov
> On Sep 27, 2021, at 2:08 AM, Kirill A. Shutemov <[email protected]> wrote:
>
> On Sun, Sep 26, 2021 at 09:12:52AM -0700, Nadav Amit wrote:
>> From: Nadav Amit <[email protected]>
>>
>> The comment in madvise_dontneed_free() says that vma splits that occur
>> while the mmap-lock is dropped, during userfaultfd_remove(), should be
>> handled correctly, but nothing in the code indicates that it is so: prev
>> is invalidated, and do_madvise() will therefore continue to update VMAs
>> from the "obsolete" end (i.e., the one before the split).
>>
>> Propagate the changes to end from madvise_dontneed_free() back to
>> do_madvise() and continue the updates from the new end accordingly.
>
> Could you describe in detail a race that would lead to wrong behaviour?
Thanks for the quick response.
For instance, madvise(MADV_DONTNEED) can race with mprotect() and cause
the VMA to split.
Something like:
CPU0                                        CPU1
----                                        ----
madvise(0x10000, 0x2000, MADV_DONTNEED)
-> userfaultfd_remove()
[ mmap-lock dropped ]

                                            mprotect(0x11000, 0x1000, PROT_READ)
                                            [splitting the VMA]

                                            read(uffd)
                                            [unblocking userfaultfd_remove()]

[ resuming ]
end = vma->vm_end
[end == 0x11000]

madvise_dontneed_single_vma(vma, 0x10000, 0x11000)
Following this operation, 0x11000-0x12000 would not be zapped.
> If the mmap lock was dropped, any change to the VMA layout can appear. We
> can have a totally unrelated VMA there.
Yes, but we are not talking about completely unrelated VMAs. If
userspace registered a region to be monitored using userfaultfd,
it expects this region to be handled as any other region. This is
a change of behavior that only affects regions with uffd.
The comment in the code explicitly says that this scenario should be
handled:
/*
* Don't fail if end > vma->vm_end. If the old
* vma was split while the mmap_lock was
* released the effect of the concurrent
* operation may not cause madvise() to
* have an undefined result. There may be an
* adjacent next vma that we'll walk
* next. userfaultfd_remove() will generate an
* UFFD_EVENT_REMOVE repetition on the
* end-vma->vm_end range, but the manager can
* handle a repetition fine.
*/
Unless I am missing something, this does not happen in the current
code.
>
> Either way, if userspace changes the VMA layout for a region that is under
> madvise(MADV_DONTNEED), it is totally broken. I don't see a valid reason to
> do this.
>
> The current behaviour looks reasonable to me. Yes, we can miss VMAs, but
> these VMAs can also be created just after madvise() is finished.
Again, we are not talking about newly created VMAs.
Alternatively, this comment should be removed and perhaps the
documentation should be updated.
> On Sep 27, 2021, at 2:31 AM, Kirill A. Shutemov <[email protected]> wrote:
>
> On Sun, Sep 26, 2021 at 09:12:55AM -0700, Nadav Amit wrote:
>> From: Nadav Amit <[email protected]>
>>
>> The madvise behaviors differ from one another in several ways, and the
>> handling of these differences is spread across several functions. Use
>> the design pattern from io_uring to define the actions that are
>> required for each behavior.
>>
>> The next patches will get rid of old helper functions that are modified
>> in this patch and the redundant use of array_index_nospec(). The next
>> patches will add more actions for each behavior into the new struct.
>>
>> No functional change is intended.
>>
>> Cc: Andrea Arcangeli <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Minchan Kim <[email protected]>
>> Cc: Colin Cross <[email protected]>
>> Cc: Suren Baghdasaryan <[email protected]>
>> Cc: Mike Rapoport <[email protected]>
>> Signed-off-by: Nadav Amit <[email protected]>
>> ---
>> mm/madvise.c | 168 +++++++++++++++++++++++++++++++++------------------
>> 1 file changed, 109 insertions(+), 59 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 17e39c70704b..127507c71ba9 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -29,6 +29,7 @@
>> #include <linux/swapops.h>
>> #include <linux/shmem_fs.h>
>> #include <linux/mmu_notifier.h>
>> +#include <linux/nospec.h>
>>
>> #include <asm/tlb.h>
>>
>> @@ -39,6 +40,101 @@ struct madvise_walk_private {
>> bool pageout;
>> };
>>
>> +struct madvise_info {
>> + u8 behavior_valid: 1;
>> + u8 process_behavior_valid: 1;
>> + u8 need_mmap_read_only: 1;
>> +};
>> +
>> +static const struct madvise_info madvise_info[MADV_SOFT_OFFLINE+1] = {
>
> MADV_SOFT_OFFLINE+1 smells bad.
I can set another constant instead and let the compiler shout if anything
outside the array is initialized.
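Something along these lines; a userspace sketch of the idea with
hypothetical names, where the compiler rejects any designated
initializer past the explicit bound:

#include <stdio.h>

#define BEHAVIOR_LAST 7 /* hypothetical stand-in for MADV_SOFT_OFFLINE */

struct info { unsigned char valid : 1; };

static const struct info table[BEHAVIOR_LAST + 1] = {
        [3] = { .valid = 1 },
        /* [BEHAVIOR_LAST + 1] = { .valid = 1 },  <- compile-time error */
};

int main(void)
{
        printf("%d\n", table[3].valid);
        return 0;
}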
>
> And I don't like the change in general. Given that MADV_SOFT_OFFLINE is
> 101, the array will be mostly empty.
Seriously, this is less than 128B - two cachelines. Perhaps the array should
be aligned. But this whole change should have no effect on code/data size.
>
> I donno. I don't see any improvement with the patch. But maybe it's only me.
The following patches make it clearer when TLB flushes are batched and
when mmap_lock is not taken (which is by the way not clear from the code).
I could have added two more functions for that and it would have taken
me less time. I do not think the end result of having ~5 different
functions to figure out the actions needed for each behavior would be
as clear.
> On Sep 27, 2021, at 2:24 AM, David Hildenbrand <[email protected]> wrote:
>
> On 26.09.21 18:12, Nadav Amit wrote:
>> From: Nadav Amit <[email protected]>
>> The goal of these patches is to add support for
>> process_madvise(MADV_DONTNEED). Yet, in the process some (arguably)
>> useful cleanups, a bug fix and performance enhancements are performed.
>> The patches try to consolidate the logic across different behaviors, and
>> to a certain extent overlap/conflict with an outstanding patch that does
>> something similar [1]. This consolidation however is mostly orthogonal
>> to the aforementioned one and done in order to clarify what is done with
>> respect to locks and TLB for each behavior and to batch these operations
>> more efficiently on process_madvise().
>> process_madvise(MADV_DONTNEED) is useful for two reasons: (a) it allows
>> userfaultfd monitors to unmap memory from monitored processes; and (b)
>> it is more efficient than madvise() since it is vectored and batches TLB
>> flushes more aggressively.
>
> MADV_DONTNEED on MAP_PRIVATE memory is a target-visible operation; this is very different to all the other process_madvise() calls we allow, which are merely hints, but the target cannot be broken. I don't think this is acceptable.
This is a fair point, which I expected, but did not address properly.
I guess an additional capability, such as CAP_SYS_PTRACE needs to be
required in this case. Would that ease your mind?
On 27.09.21 12:41, Nadav Amit wrote:
>
>
>> On Sep 27, 2021, at 2:24 AM, David Hildenbrand <[email protected]> wrote:
>>
>> On 26.09.21 18:12, Nadav Amit wrote:
>>> From: Nadav Amit <[email protected]>
>>> The goal of these patches is to add support for
>>> process_madvise(MADV_DONTNEED). Yet, in the process some (arguably)
>>> useful cleanups, a bug fix and performance enhancements are performed.
>>> The patches try to consolidate the logic across different behaviors, and
>>> to a certain extent overlap/conflict with an outstanding patch that does
>>> something similar [1]. This consolidation however is mostly orthogonal
>>> to the aforementioned one and done in order to clarify what is done with
>>> respect to locks and TLB for each behavior and to batch these operations
>>> more efficiently on process_madvise().
>>> process_madvise(MADV_DONTNEED) is useful for two reasons: (a) it allows
>>> userfaultfd monitors to unmap memory from monitored processes; and (b)
>>> it is more efficient than madvise() since it is vectored and batches TLB
>>> flushes more aggressively.
>>
>> MADV_DONTNEED on MAP_PRIVATE memory is a target-visible operation; this is very different to all the other process_madvise() calls we allow, which are merely hints, but the target cannot be broken. I don't think this is acceptable.
>
> This is a fair point, which I expected, but did not address properly.
>
> I guess an additional capability, such as CAP_SYS_PTRACE needs to be
> required in this case. Would that ease your mind?
I think it would be slightly better, but I'm still missing a clear use
case that justifies messing with the page tables of other processes in
that way, especially with MAP_PRIVATE mappings. Can you maybe elaborate
a bit on a) and b)?
Especially, why would a) make sense or be required? When would it be a
good idea to zap random pages of a target process, especially with
MAP_PRIVATE? How would the target use case make sure that the target
process doesn't suddenly lose data? I would have assumed that you can
really only do something sane with uffd() if a) the process decided to
give up on some pages (madvise(DONTNEED)) b) the process hasn't touched
these pages yet.
Can you also comment a bit more on b)? Who cares about that? And would
we suddenly expect users of madvise() to switch to process_madvise()
because it's more effective? It sounds a bit weird to me TBH, but most
probably I am missing details :)
--
Thanks,
David / dhildenb
> On Sep 27, 2021, at 2:11 AM, Kirill A. Shutemov <[email protected]> wrote:
>
> On Sun, Sep 26, 2021 at 09:12:53AM -0700, Nadav Amit wrote:
>> From: Nadav Amit <[email protected]>
>>
>> madvise_dontneed_free() is called only from madvise_vma() and the
>> behavior is always either MADV_FREE or MADV_DONTNEED. There is no need
>> to check again in madvise_dontneed_free() if the behavior is any
>> different.
>
> So what. The check is free. Compiler should be clever enough to eliminate
> the additional check. If there's a new MADV_DONTNEED flavour, the change
> would have to be effectively reverted.
>
> NAK.
I hate bikeshedding, but I will take the bait, since I see no
reason for this NAK.
I do not know what future change you have in mind in which quietly
failing in madvise_dontneed_free() would be the right behavior.
If the current code is presumed to be more “robust” against future
changes since there is an additional check, I would argue that this
is not the case: failing silently on a code-path that should never
run is not the right thing to do.
Having redundant checks that are not documented as such does not make
the code more readable or maintainable.
Having said that, if you want, I can turn this condition into
WARN_ON_ONCE() or VM_BUG_ON(), although I really see no reason to
do so.
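i.e., something along these lines (a sketch, not part of this series):

	if (behavior == MADV_DONTNEED)
		return madvise_dontneed_single_vma(vma, start, end);

	/* madvise_vma() only passes MADV_DONTNEED or MADV_FREE here. */
	VM_WARN_ON_ONCE(behavior != MADV_FREE);
	return madvise_free_single_vma(vma, start, end);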
[ You might just as well add a default statement to the switch in
madvise_behavior(), which BTW would have been much more reasonable,
but only if it does not fail silently as the one we discuss. ]
Note that I made this change not out of boredom, but because I
needed to change this piece of code later for TLB batching. I
did not want to sneak this change into another patch or to leave
this confusing code. Anyhow, I wasted enough time on this
trivial patch.
On Mon, Sep 27, 2021 at 03:11:20AM -0700, Nadav Amit wrote:
>
> > On Sep 27, 2021, at 2:08 AM, Kirill A. Shutemov <[email protected]> wrote:
> >
> > On Sun, Sep 26, 2021 at 09:12:52AM -0700, Nadav Amit wrote:
> >> From: Nadav Amit <[email protected]>
> >>
> >> The comment in madvise_dontneed_free() says that vma splits that occur
> >> while the mmap-lock is dropped, during userfaultfd_remove(), should be
> >> handled correctly, but nothing in the code indicates that it is so: prev
> >> is invalidated, and do_madvise() will therefore continue to update VMAs
> >> from the "obsolete" end (i.e., the one before the split).
> >>
> >> Propagate the changes to end from madvise_dontneed_free() back to
> >> do_madvise() and continue the updates from the new end accordingly.
> >
> > Could you describe in detail a race that would lead to wrong behaviour?
>
> Thanks for the quick response.
>
> For instance, madvise(MADV_DONTNEED) can race with mprotect() and cause
> the VMA to split.
>
> Something like:
>
> CPU0                                        CPU1
> ----                                        ----
> madvise(0x10000, 0x2000, MADV_DONTNEED)
> -> userfaultfd_remove()
> [ mmap-lock dropped ]
>
>                                             mprotect(0x11000, 0x1000, PROT_READ)
>                                             [splitting the VMA]
>
>                                             read(uffd)
>                                             [unblocking userfaultfd_remove()]
>
> [ resuming ]
> end = vma->vm_end
> [end == 0x11000]
>
> madvise_dontneed_single_vma(vma, 0x10000, 0x11000)
>
> Following this operation, 0x11000-0x12000 would not be zapped.
Okay, fair enough.
Wouldn't something like this work too:
diff --git a/mm/madvise.c b/mm/madvise.c
index 0734db8d53a7..0898120c5c04 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -796,6 +796,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
*/
return -ENOMEM;
}
+ *prev = vma;
if (!can_madv_lru_vma(vma))
return -EINVAL;
if (end > vma->vm_end) {
--
Kirill A. Shutemov
> On Sep 27, 2021, at 3:58 AM, David Hildenbrand <[email protected]> wrote:
>
> On 27.09.21 12:41, Nadav Amit wrote:
>>> On Sep 27, 2021, at 2:24 AM, David Hildenbrand <[email protected]> wrote:
>>>
>>> On 26.09.21 18:12, Nadav Amit wrote:
>>>> From: Nadav Amit <[email protected]>
>>>> The goal of these patches is to add support for
>>>> process_madvise(MADV_DONTNEED). Yet, in the process some (arguably)
>>>> useful cleanups, a bug fix and performance enhancements are performed.
>>>> The patches try to consolidate the logic across different behaviors, and
>>>> to a certain extent overlap/conflict with an outstanding patch that does
>>>> something similar [1]. This consolidation however is mostly orthogonal
>>>> to the aforementioned one and done in order to clarify what is done in
>>>> respect to locks and TLB for each behavior and to batch these operations
>>>> more efficiently on process_madvise().
>>>> process_madvise(MADV_DONTNEED) is useful for two reasons: (a) it allows
>>>> userfaultfd monitors to unmap memory from monitored processes; and (b)
>>>> it is more efficient than madvise() since it is vectored and batches TLB
>>>> flushes more aggressively.
>>>
> MADV_DONTNEED on MAP_PRIVATE memory is a target-visible operation; this is very different to all the other process_madvise() calls we allow, which are merely hints, but the target cannot be broken. I don't think this is acceptable.
>> This is a fair point, which I expected, but did not address properly.
>> I guess an additional capability, such as CAP_SYS_PTRACE needs to be
>> required in this case. Would that ease your mind?
>
> I think it would be slightly better, but I'm still missing a clear use case that justifies messing with the page tables of other processes in that way, especially with MAP_PRIVATE mappings. Can you maybe elaborate a bit on a) and b)?
>
> Especially, why would a) make sense or be required? When would it be a good idea to zap random pages of a target process, especially with MAP_PRIVATE? How would the target use case make sure that the target process doesn't suddenly lose data? I would have assumed that you can really only do something sane with uffd() if 1) the process decided to give up on some pages (madvise(DONTNEED)) 2) the process hasn't touched these pages yet.
>
> Can you also comment a bit more on b)? Who cares about that? And would we suddenly expect users of madvise() to switch to process_madvise() because it's more effective? It sounds a bit weird to me TBH, but most probably I am missing details :)
Ok, ok, your criticism is fair. I tried to hold back some details in order to
prevent the discussion from digressing. I am going to focus on (a) which is
what I really have in mind.
The use-case that I explore is a userspace memory manager with some level of
cooperation of the monitored processes.
The manager is notified on memory regions that it should monitor
(through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
using the remote-userfaultfd that you saw on the second thread. When it wants
to reclaim (anonymous) memory, it:
1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
UFFD-WP to do so efficiently, a patch which I did not send yet).
2. Calls process_vm_readv() to read that memory of that process.
3. Write it back to “swap”.
4. Calls process_madvise(MADV_DONTNEED) to zap it.
Once the memory is accessed again, the manager uses UFFD-COPY to bring it
back. This is really work-in-progress, but eventually performance is not as
bad as you would imagine (some patches for efficient use of uffd with
iouring are needed for that matter).
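For illustration, steps 2-4 of this sequence reduce to roughly the
following sketch. reclaim_page() is a hypothetical helper, step 1 is
omitted since it needs the manager's userfaultfd on the target, and the
final call only works with the process_madvise(MADV_DONTNEED) support
that this series proposes (current kernels reject the advice):

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <sys/types.h>
	#include <sys/uio.h>
	#include <unistd.h>

	#ifndef __NR_process_madvise
	#define __NR_process_madvise 440
	#endif

	/* Reclaim one page of the monitored process into a "swap" file. */
	static int reclaim_page(int pidfd, pid_t pid, void *remote_addr,
				int swap_fd, off_t swap_off)
	{
		char buf[4096];
		struct iovec local = { .iov_base = buf, .iov_len = sizeof(buf) };
		struct iovec remote = { .iov_base = remote_addr,
					.iov_len = sizeof(buf) };

		/* Step 2: copy the page out of the monitored process. */
		if (process_vm_readv(pid, &local, 1, &remote, 1, 0) !=
		    (ssize_t)sizeof(buf))
			return -1;

		/* Step 3: write it to the manager's backing store. */
		if (pwrite(swap_fd, buf, sizeof(buf), swap_off) !=
		    (ssize_t)sizeof(buf))
			return -1;

		/*
		 * Step 4: zap the page remotely; the next access faults and
		 * the manager resolves it with UFFDIO_COPY.
		 */
		return syscall(__NR_process_madvise, pidfd, &remote, 1,
			       MADV_DONTNEED, 0) == (ssize_t)sizeof(buf) ?
			0 : -1;
	}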
I am aware that there are some caveats, as zapping the memory does not
guarantee that the memory would be freed since it might be pinned for a
variety of reasons. That's the reason I mentioned the processes have "some
level of cooperation" with the manager. It is not intended to deal with
adversaries or uncommon corner cases (e.g., processes that use UFFD for
their own reasons).
Putting aside my use-case (which I am sure people would be glad to criticize),
I can imagine debuggers or emulators may also find use for similar schemes
(although I do not have concrete use-cases for them).
On Mon, Sep 27, 2021 at 03:31:21AM -0700, Nadav Amit wrote:
>
>
> > On Sep 27, 2021, at 2:31 AM, Kirill A. Shutemov <[email protected]> wrote:
> >
> > On Sun, Sep 26, 2021 at 09:12:55AM -0700, Nadav Amit wrote:
> >> From: Nadav Amit <[email protected]>
> >>
> >> The different behaviors of madvise are different in several ways, which
> >> are distributed across several functions. Use the design pattern from
> >> iouring in order to define the actions that are required for each
> >> behavior.
> >>
> >> The next patches will get rid of old helper functions that are modified
> >> in this patch and the redundant use of array_index_nospec(). The next
> >> patches will add more actions for each leaf into the new struct.
> >>
> >> No functional change is intended.
> >>
> >> Cc: Andrea Arcangeli <[email protected]>
> >> Cc: Andrew Morton <[email protected]>
> >> Cc: Minchan Kim <[email protected]>
> >> Cc: Colin Cross <[email protected]>
> >> Cc: Suren Baghdasarya <[email protected]>
> >> Cc: Mike Rapoport <[email protected]>
> >> Signed-off-by: Nadav Amit <[email protected]>
> >> ---
> >> mm/madvise.c | 168 +++++++++++++++++++++++++++++++++------------------
> >> 1 file changed, 109 insertions(+), 59 deletions(-)
> >>
> >> diff --git a/mm/madvise.c b/mm/madvise.c
> >> index 17e39c70704b..127507c71ba9 100644
> >> --- a/mm/madvise.c
> >> +++ b/mm/madvise.c
> >> @@ -29,6 +29,7 @@
> >> #include <linux/swapops.h>
> >> #include <linux/shmem_fs.h>
> >> #include <linux/mmu_notifier.h>
> >> +#include <linux/nospec.h>
> >>
> >> #include <asm/tlb.h>
> >>
> >> @@ -39,6 +40,101 @@ struct madvise_walk_private {
> >> bool pageout;
> >> };
> >>
> >> +struct madvise_info {
> >> + u8 behavior_valid: 1;
> >> + u8 process_behavior_valid: 1;
> >> + u8 need_mmap_read_only: 1;
> >> +};
> >> +
> >> +static const struct madvise_info madvise_info[MADV_SOFT_OFFLINE+1] = {
> >
> > MADV_SOFT_OFFLINE+1 smells bad.
>
> I can set another constant instead and let the compiler shout if anything
> outside the array is initialized.
I would rather introduce a function that would return struct madvise_info
for a given behavior. The function would have a switch inside. The default:
may have BUILD_BUG() or something.
--
Kirill A. Shutemov
On Mon 27-09-21 05:00:11, Nadav Amit wrote:
[...]
> The manager is notified on memory regions that it should monitor
> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
> using the remote-userfaultfd that you saw on the second thread. When it wants
> to reclaim (anonymous) memory, it:
>
> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
> UFFD-WP to do so efficiently, a patch which I did not send yet).
> 2. Calls process_vm_readv() to read that memory of that process.
> 3. Write it back to “swap”.
> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?
MADV_DONTNEED on a remote process has been proposed in the past several
times and it has always been rejected because it is a free ticket to all
sorts of hard to debug problems as it is just a free ticket for a remote
memory corruption. An additional capability requirement might reduce the
risk to some degree but I still do not think this is a good idea.
--
Michal Hocko
SUSE Labs
On Mon, Sep 27, 2021 at 04:05:47AM -0700, Nadav Amit wrote:
> Having said that, if you want, I can turn this condition into
> WARN_ON_ONCE() or VM_BUG_ON(), although I really see no reason to
> do so.
BUILD_BUG() should be fine here.
--
Kirill A. Shutemov
> On Sep 27, 2021, at 4:55 AM, Kirill A. Shutemov <[email protected]> wrote:
>
> On Mon, Sep 27, 2021 at 03:11:20AM -0700, Nadav Amit wrote:
>>
>>> On Sep 27, 2021, at 2:08 AM, Kirill A. Shutemov <[email protected]> wrote:
>>>
>>> On Sun, Sep 26, 2021 at 09:12:52AM -0700, Nadav Amit wrote:
>>>> From: Nadav Amit <[email protected]>
>>>>
>>>> The comment in madvise_dontneed_free() says that vma splits that occur
>>>> while the mmap-lock is dropped, during userfaultfd_remove(), should be
>>>> handled correctly, but nothing in the code indicates that it is so: prev
>>>> is invalidated, and do_madvise() will therefore continue to update VMAs
>>>> from the "obsolete" end (i.e., the one before the split).
>>>>
>>>> Propagate the changes to end from madvise_dontneed_free() back to
>>>> do_madvise() and continue the updates from the new end accordingly.
>>>
>>> Could you describe in details a race that would lead to wrong behaviour?
>>
>> Thanks for the quick response.
>>
>> For instance, madvise(MADV_DONTNEED) can race with mprotect() and cause
>> the VMA to split.
>>
>> Something like:
>>
>> CPU0 CPU1
>> ---- ----
>> madvise(0x10000, 0x2000, MADV_DONTNEED)
>> -> userfaultfd_remove()
>> [ mmap-lock dropped ]
>> mprotect(0x11000, 0x1000, PROT_READ)
>> [splitting the VMA]
>>
>> read(uffd)
>> [unblocking userfaultfd_remove()]
>>
>> [ resuming ]
>> end = vma->vm_end
>> [end == 0x11000]
>>
>> madvise_dontneed_single_vma(vma, 0x10000, 0x11000)
>>
>> Following this operation, 0x11000-0x12000 would not be zapped.
>
> Okay, fair enough.
>
> Wouldn't something like this work too:
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 0734db8d53a7..0898120c5c04 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -796,6 +796,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> */
> return -ENOMEM;
> }
> + *prev = vma;
> if (!can_madv_lru_vma(vma))
> return -EINVAL;
> if (end > vma->vm_end) {
Admittedly (embarrassingly?) I didn’t even consider it since all the
comments say that once the lock is dropped prev should be invalidated.
Let’s see, considering the aforementioned scenario and that there is
initially one VMA between 0x10000-0x12000.
Looking at the code from do_madvise():
[ end == 0x12000 ]
tmp = vma->vm_end;
[ tmp == 0x12000 ]
if (end < tmp)
tmp = end;
/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
error = madvise_vma(vma, &prev, start, tmp, behavior);
[ prev->vm_end == 0x11000 after the split]
if (error)
goto out;
start = tmp;
[ start == 0x12000 ]
if (prev && start < prev->vm_end)
start = prev->vm_end;
[ The condition (start < prev->vm_end) is false, start not updated ]
error = unmapped_error;
if (start >= end)
goto out;
[ start >= end; so we end without updating the second part of the split ]
So it does not work.
Perhaps adding this one on top of yours? I can test it when I wake up.
It is cleaner, but I am not sure if I am missing something.
diff --git a/mm/madvise.c b/mm/madvise.c
index 0734db8d53a7..548fc106e70b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1203,7 +1203,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
if (error)
goto out;
start = tmp;
- if (prev && start < prev->vm_end)
+ if (prev)
start = prev->vm_end;
error = unmapped_error;
if (start >= end)
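Taken together with the *prev = vma hunk above, the tail of the
do_madvise() loop would then read (sketch):

	error = madvise_vma(vma, &prev, start, tmp, behavior);
	if (error)
		goto out;
	start = tmp;
	if (prev)			/* prev tracks the possibly-split vma, */
		start = prev->vm_end;	/* so resume from its post-split end */
	error = unmapped_error;
	if (start >= end)
		goto out;

In the raced example, start would become 0x11000 (< end == 0x12000), so
the loop would go on and zap the second half of the split VMA.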
On Mon, Sep 27, 2021 at 05:33:39AM -0700, Nadav Amit wrote:
>
>
> > On Sep 27, 2021, at 4:55 AM, Kirill A. Shutemov <[email protected]> wrote:
> >
> > On Mon, Sep 27, 2021 at 03:11:20AM -0700, Nadav Amit wrote:
> >>
> >>> On Sep 27, 2021, at 2:08 AM, Kirill A. Shutemov <[email protected]> wrote:
> >>>
> >>> On Sun, Sep 26, 2021 at 09:12:52AM -0700, Nadav Amit wrote:
> >>>> From: Nadav Amit <[email protected]>
> >>>>
> >>>> The comment in madvise_dontneed_free() says that vma splits that occur
> >>>> while the mmap-lock is dropped, during userfaultfd_remove(), should be
> >>>> handled correctly, but nothing in the code indicates that it is so: prev
> >>>> is invalidated, and do_madvise() will therefore continue to update VMAs
> >>>> from the "obsolete" end (i.e., the one before the split).
> >>>>
> >>>> Propagate the changes to end from madvise_dontneed_free() back to
> >>>> do_madvise() and continue the updates from the new end accordingly.
> >>>
> >>> Could you describe in details a race that would lead to wrong behaviour?
> >>
> >> Thanks for the quick response.
> >>
> >> For instance, madvise(MADV_DONTNEED) can race with mprotect() and cause
> >> the VMA to split.
> >>
> >> Something like:
> >>
> >> CPU0 CPU1
> >> ---- ----
> >> madvise(0x10000, 0x2000, MADV_DONTNEED)
> >> -> userfaultfd_remove()
> >> [ mmap-lock dropped ]
> >> mprotect(0x11000, 0x1000, PROT_READ)
> >> [splitting the VMA]
> >>
> >> read(uffd)
> >> [unblocking userfaultfd_remove()]
> >>
> >> [ resuming ]
> >> end = vma->vm_end
> >> [end == 0x11000]
> >>
> >> madvise_dontneed_single_vma(vma, 0x10000, 0x11000)
> >>
> >> Following this operation, 0x11000-0x12000 would not be zapped.
> >
> > Okay, fair enough.
> >
> > Wouldn't something like this work too:
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 0734db8d53a7..0898120c5c04 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -796,6 +796,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> > */
> > return -ENOMEM;
> > }
> > + *prev = vma;
> > if (!can_madv_lru_vma(vma))
> > return -EINVAL;
> > if (end > vma->vm_end) {
>
> Admittedly (embarrassingly?) I didn’t even consider it since all the
> comments say that once the lock is dropped prev should be invalidated.
>
> Let’s see, considering the aforementioned scenario and that there is
> initially one VMA between 0x10000-0x12000.
>
> Looking at the code from do_madvise():
>
> [ end == 0x12000 ]
>
> tmp = vma->vm_end;
>
> [ tmp == 0x12000 ]
>
> if (end < tmp)
> tmp = end;
>
> /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
>
> error = madvise_vma(vma, &prev, start, tmp, behavior);
>
> [ prev->vm_end == 0x11000 after the split]
>
> if (error)
> goto out;
> start = tmp;
>
> [ start == 0x12000 ]
>
> if (prev && start < prev->vm_end)
> start = prev->vm_end;
>
> [ The condition (start < prev->vm_end) is false, start not updated ]
>
> error = unmapped_error;
> if (start >= end)
> goto out;
>
> [ start >= end; so we end without updating the second part of the split ]
>
> So it does not work.
>
> Perhaps adding this one on top of yours? I can test it when I wake up.
> It is cleaner, but I am not sure if I am missing something.
It should work.
BTW, shouldn't we bring madvise_willneed() and madvise_remove() to the
same scheme?
--
Kirill A. Shutemov
> On Sep 27, 2021, at 5:19 AM, Kirill A. Shutemov <[email protected]> wrote:
>
> On Mon, Sep 27, 2021 at 04:05:47AM -0700, Nadav Amit wrote:
>> Having said that, if you want, I can turn this condition into
>> WARN_ON_ONCE() or VM_BUG_ON(), although I really see no reason to
>> do so.
>
> BUILD_BUG() should be fine here.
It does not work. At least my gcc is not smart enough to figure it
out at build time.
I can put instead:
BUILD_BUG_ON(__builtin_constant_p(behavior));
for potentially smarter compilers (clang?), but I doubt it would work.
> On Sep 27, 2021, at 5:45 AM, Kirill A. Shutemov <[email protected]> wrote:
>
> On Mon, Sep 27, 2021 at 05:33:39AM -0700, Nadav Amit wrote:
>>
>>
>>> On Sep 27, 2021, at 4:55 AM, Kirill A. Shutemov <[email protected]> wrote:
>>>
>>> On Mon, Sep 27, 2021 at 03:11:20AM -0700, Nadav Amit wrote:
>>>>
>>>>> On Sep 27, 2021, at 2:08 AM, Kirill A. Shutemov <[email protected]> wrote:
>>>>>
>>>>> On Sun, Sep 26, 2021 at 09:12:52AM -0700, Nadav Amit wrote:
>>>>>> From: Nadav Amit <[email protected]>
>>>>>>
>>>>>> The comment in madvise_dontneed_free() says that vma splits that occur
>>>>>> while the mmap-lock is dropped, during userfaultfd_remove(), should be
>>>>>> handled correctly, but nothing in the code indicates that it is so: prev
>>>>>> is invalidated, and do_madvise() will therefore continue to update VMAs
>>>>>> from the "obsolete" end (i.e., the one before the split).
>>>>>>
>>
[snip]
>> Perhaps adding this one on top of yours? I can test it when I wake up.
>> It is cleaner, but I am not sure if I am missing something.
>
> It should work.
>
> BTW, shouldn't we bring madvise_willneed() and madvise_remove() to the
> same scheme?
You are right, even just for consistency. My only problem is that I am afraid
to backport such a change. For MADV_DONTNEED, I saw an explicit assumption.
I can do it all in one patch if we agree that none of it goes into stable
(which I clumsily forgot to cc, but might find the patch and backport it).
On 27.09.21 14:00, Nadav Amit wrote:
>
>
>> On Sep 27, 2021, at 3:58 AM, David Hildenbrand <[email protected]> wrote:
>>
>> On 27.09.21 12:41, Nadav Amit wrote:
>>>> On Sep 27, 2021, at 2:24 AM, David Hildenbrand <[email protected]> wrote:
>>>>
>>>> On 26.09.21 18:12, Nadav Amit wrote:
>>>>> From: Nadav Amit <[email protected]>
>>>>> The goal of these patches is to add support for
>>>>> process_madvise(MADV_DONTNEED). Yet, in the process some (arguably)
>>>>> useful cleanups, a bug fix and performance enhancements are performed.
>>>>> The patches try to consolidate the logic across different behaviors, and
>>>>> to a certain extent overlap/conflict with an outstanding patch that does
>>>>> something similar [1]. This consolidation however is mostly orthogonal
>>>>> to the aforementioned one and done in order to clarify what is done in
>>>>> respect to locks and TLB for each behavior and to batch these operations
>>>>> more efficiently on process_madvise().
>>>>> process_madvise(MADV_DONTNEED) is useful for two reasons: (a) it allows
>>>>> userfaultfd monitors to unmap memory from monitored processes; and (b)
>>>>> it is more efficient than madvise() since it is vectored and batches TLB
>>>>> flushes more aggressively.
>>>>
>>>> MADV_DONTNEED on MAP_PRIVATE memory is a target-visible operation; this is very different to all the other process_madvise() calls we allow, which are merely hints, but the target cannot be broken. I don't think this is acceptable.
>>> This is a fair point, which I expected, but did not address properly.
>>> I guess an additional capability, such as CAP_SYS_PTRACE needs to be
>>> required in this case. Would that ease your mind?
>>
>> I think it would be slightly better, but I'm still missing a clear use case that justifies messing with the page tables of other processes in that way, especially with MAP_PRIVATE mappings. Can you maybe elaborate a bit on a) and b)?
>>
>> Especially, why would a) make sense or be required? When would it be a good idea to zap random pages of a target process, especially with MAP_PRIVATE? How would the target use case make sure that the target process doesn't suddenly lose data? I would have assumed that you can really only do something sane with uffd() if 1) the process decided to give up on some pages (madvise(DONTNEED)) 2) the process hasn't touched these pages yet.
>>
>> Can you also comment a bit more on b)? Who cares about that? And would we suddenly expect users of madvise() to switch to process_madvise() because it's more effective? It sounds a bit weird to me TBH, but most probably I am missing details :)
>
> Ok, ok, your criticism is fair. I tried to hold back some details in order to
> prevent the discussion from digressing. I am going to focus on (a) which is
> what I really have in mind.
Thanks for the details!
>
> The use-case that I explore is a userspace memory manager with some level of
> cooperation of the monitored processes.
>
> The manager is notified on memory regions that it should monitor
> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
> using the remote-userfaultfd that you saw on the second thread. When it wants
> to reclaim (anonymous) memory, it:
>
> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
> UFFD-WP to do so efficiently, a patch which I did not send yet).
> 2. Calls process_vm_readv() to read that memory of that process.
> 3. Write it back to “swap”.
> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
>
> Once the memory is accessed again, the manager uses UFFD-COPY to bring it
> back. This is really work-in-progress, but eventually performance is not as
> bad as you would imagine (some patches for efficient use of uffd with
> iouring are needed for that matter).
Again, thanks for the details. I guess this should basically work,
although it involves a lot of complexity (read: all flavors of uffd on
other processes). And I am not so sure about performance aspects.
"Performance is not as bad as you think" doesn't sound like the words
you would want to hear from a car dealer ;) So there has to be another
big benefit to do such user space swapping.
>
> I am aware that there are some caveats, as zapping the memory does not
> guarantee that the memory would be freed since it might be pinned for a
> variety of reasons. That's the reason I mentioned the processes have "some
> level of cooperation" with the manager. It is not intended to deal with
> adversaries or uncommon corner cases (e.g., processes that use UFFD for
> their own reasons).
It's not only long-term pinnings. Pages could have been de-duplicated
(COW after fork, KSM, shared zeropage). Further, you'll most probably
lose any kind of "aging" ("accessed") information on pages, or how would
you track that?
Although I can see that this might work, I do wonder if it's a use case
worth supporting. As Michal correctly raised, we already have other
infrastructure in place to trigger swapin/swapout. I recall that also
damon wants to let you write advanced policies for that by monitoring
actual access characteristics.
>
> Putting aside my use-case (which I am sure people would be glad to criticize),
> I can imagine debuggers or emulators may also find use for similar schemes
> (although I do not have concrete use-cases for them).
I'd be curious about use cases for debuggers/emulators. Especially for
emulators I'd guess it makes more sense to just do it within the
process. And for debuggers, I'm having a hard time why it would make
sense to throw away a page instead of just overwriting it with $PATTERN
(e.g., 0). But I'm sure people can be creative :)
--
Thanks,
David / dhildenb
> On Sep 27, 2021, at 5:16 AM, Michal Hocko <[email protected]> wrote:
>
> On Mon 27-09-21 05:00:11, Nadav Amit wrote:
> [...]
>> The manager is notified on memory regions that it should monitor
>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
>> using the remote-userfaultfd that you saw on the second thread. When it wants
>> to reclaim (anonymous) memory, it:
>>
>> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
>> UFFD-WP to do so efficiently, a patch which I did not send yet).
>> 2. Calls process_vm_readv() to read that memory of that process.
>> 3. Write it back to “swap”.
>> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
>
> Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?
Providing hints to the kernel only takes you so far.
The kernel does not want (for a good reason) to be completely
configurable when it comes to reclaim and prefetch policies. Doing
so from userspace allows you to be fully configurable.
> MADV_DONTNEED on a remote process has been proposed in the past several
> times and it has always been rejected because it is a free ticket to all
> sorts of hard to debug problems as it is just a free ticket for a remote
> memory corruption. An additional capability requirement might reduce the
> risk to some degree but I still do not think this is a good idea.
I would argue that there is nothing bad that remote MADV_DONTNEED can do
that process_vm_writev() cannot do as well (putting aside ptrace).
process_vm_writev() is checking:
mm = mm_access(task, PTRACE_MODE_ATTACH_REALCREDS)
Wouldn't adding such a condition suffice?
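As a sketch, this would be a small change in the process_madvise()
syscall path, where the kernel currently does
mm_access(task, PTRACE_MODE_READ_FSCREDS): destructive advice would
demand the same credentials as process_vm_writev():

	/* Destructive advice demands ptrace-attach credentials. */
	unsigned int mode = behavior == MADV_DONTNEED ?
			    PTRACE_MODE_ATTACH_REALCREDS :
			    PTRACE_MODE_READ_FSCREDS;

	mm = mm_access(task, mode);
	if (IS_ERR_OR_NULL(mm)) {
		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
		goto release_task;
	}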
> On Sep 27, 2021, at 10:05 AM, David Hildenbrand <[email protected]> wrote:
>
> On 27.09.21 14:00, Nadav Amit wrote:
>>> On Sep 27, 2021, at 3:58 AM, David Hildenbrand <[email protected]> wrote:
>>>
>>> On 27.09.21 12:41, Nadav Amit wrote:
>>>>> On Sep 27, 2021, at 2:24 AM, David Hildenbrand <[email protected]> wrote:
>>>>>
>>>>> On 26.09.21 18:12, Nadav Amit wrote:
>>>>>> From: Nadav Amit <[email protected]>
>>>>>> The goal of these patches is to add support for
>>>>>> process_madvise(MADV_DONTNEED). Yet, in the process some (arguably)
>>>>>> useful cleanups, a bug fix and performance enhancements are performed.
>>>>>> The patches try to consolidate the logic across different behaviors, and
>>>>>> to a certain extent overlap/conflict with an outstanding patch that does
>>>>>> something similar [1]. This consolidation however is mostly orthogonal
>>>>>> to the aforementioned one and done in order to clarify what is done in
>>>>>> respect to locks and TLB for each behavior and to batch these operations
>>>>>> more efficiently on process_madvise().
>>>>>> process_madvise(MADV_DONTNEED) is useful for two reasons: (a) it allows
>>>>>> userfaultfd monitors to unmap memory from monitored processes; and (b)
>>>>>> it is more efficient than madvise() since it is vectored and batches TLB
>>>>>> flushes more aggressively.
>>>>>
>>>>> MADV_DONTNEED on MAP_PRIVATE memory is a target-visible operation; this is very different to all the other process_madvise() calls we allow, which are merely hints, but the target cannot be broken. I don't think this is acceptable.
>>>> This is a fair point, which I expected, but did not address properly.
>>>> I guess an additional capability, such as CAP_SYS_PTRACE needs to be
>>>> required in this case. Would that ease your mind?
>>>
>>> I think it would be slightly better, but I'm still missing a clear use case that justifies messing with the page tables of other processes in that way, especially with MAP_PRIVATE mappings. Can you maybe elaborate a bit on a) and b)?
>>>
>>> Especially, why would a) make sense or be required? When would it be a good idea to zap random pages of a target process, especially with MAP_PRIVATE? How would the target use case make sure that the target process doesn't suddenly lose data? I would have assumed that you can really only do something sane with uffd() if 1) the process decided to give up on some pages (madvise(DONTNEED)) 2) the process hasn't touched these pages yet.
>>>
>>> Can you also comment a bit more on b)? Who cares about that? And would we suddenly expect users of madvise() to switch to process_madvise() because it's more effective? It sounds a bit weird to me TBH, but most probably I am missing details :)
>> Ok, ok, your criticism is fair. I tried to hold back some details in order to
>> prevent the discussion from digressing. I am going to focus on (a) which is
>> what I really have in mind.
>
> Thanks for the details!
>
>> The use-case that I explore is a userspace memory manager with some level of
>> cooperation of the monitored processes.
>> The manager is notified on memory regions that it should monitor
>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
>> using the remote-userfaultfd that you saw on the second thread. When it wants
>> to reclaim (anonymous) memory, it:
>> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
>> UFFD-WP to do so efficiently, a patch which I did not send yet).
>> 2. Calls process_vm_readv() to read that memory of that process.
>> 3. Write it back to “swap”.
>> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
>> Once the memory is accessed again, the manager uses UFFD-COPY to bring it
>> back. This is really work-in-progress, but eventually performance is not as
>> bad as you would imagine (some patches for efficient use of uffd with
>> iouring are needed for that matter).
>
> Again, thanks for the details. I guess this should basically work, although it involves a lot of complexity (read: all flavors of uffd on other processes). And I am not so sure about performance aspects. "Performance is not as bad as you think" doesn't sound like the words you would want to hear from a car dealer ;) So there has to be another big benefit to do such user space swapping.
There is some complexity, indeed. Worse, there are some quirks of UFFD
that make life hard for no reason and some uffd and iouring bugs.
As for my sales pitch - I agree that I am not the best car dealer… :(
When I say performance is not bad, I mean that the core operations of
page-fault handling, prefetch and reclaim do not induce high overhead
*after* the improvements I sent or mentioned.
The benefit of doing so from userspace is that you have full control
over the reclaim/prefetch policies, so you may be able to make better
decisions.
Some workloads have predictable access patterns (see for instance "MAGE:
Nearly Zero-Cost Virtual Memory for Secure Computation", OSDI’21). You may
be able to handle such access patterns without requiring intrusive changes to the
workload.
>
>> I am aware that there are some caveats, as zapping the memory does not
>> guarantee that the memory would be freed since it might be pinned for a
>> variety of reasons. That's the reason I mentioned the processes have "some
>> level of cooperation" with the manager. It is not intended to deal with
>> adversaries or uncommon corner cases (e.g., processes that use UFFD for
>> their own reasons).
>
> It's not only long-term pinnings. Pages could have been de-duplicated (COW after fork, KSM, shared zeropage). Further, you'll most probably lose any kind of "aging" ("accessed") information on pages, or how would you track that?
I know it’s not just long-term pinnings. That’s what “variety of reasons”
stood for. ;-)
Aging is a tool for certain types of reclamation policies. Some do not
require it (e.g., random). You can also have compiler/application-guided
reclamation policies. If you are really into “aging”, you may be able
to use PEBS or other CPU facilities to track it.
Anyhow, the access-bit by itself is not such a great solution to track
aging. Setting it can induce overheads of >500 cycles from my (and
others) experience.
>
> Although I can see that this might work, I do wonder if it's a use case worth supporting. As Michal correctly raised, we already have other infrastructure in place to trigger swapin/swapout. I recall that also damon wants to let you write advanced policies for that by monitoring actual access characteristics.
Hints, such as those Michal mentioned, prevent the efficient use of
userfaultfd. Using MADV_PAGEOUT will not trigger another uffd event
when the page is brought back from swap. So using
MADV_PAGEOUT/MADV_WILLNEED does not allow you to have a custom
prefetch policy, for instance. It would also require you to live
with the kernel reclamation/IO stack for better and worse.
As for DAMON, I am not very familiar with it, but from what I remember
it seemed to look in a similar direction. IMHO it is more intrusive
and less configurable (although it can have the advantage of better
integration with various kernel mechanism). I was wondering for a
second why you give me such a hard time for a pretty straight-forward
extension for process_madvise(), but then I remembered that DAMON got
into the kernel after >30 versions, so I’ll shut up about that. ;-)
>
>> Putting aside my use-case (which I am sure people would be glad to criticize),
>> I can imagine debuggers or emulators may also find use for similar schemes
>> (although I do not have concrete use-cases for them).
>
> I'd be curious about use cases for debuggers/emulators. Especially for emulators I'd guess it makes more sense to just do it within the process. And for debuggers, I'm having a hard time seeing why it would make sense to throw away a page instead of just overwriting it with $PATTERN (e.g., 0). But I'm sure people can be creative :)
I have some more vague ideas, but I am afraid that you will keep
saying that it makes more sense to handle such events from within
a process. I am not sure that this is true. Even for the emulators
that we discuss, the emulated program might run in a different
address space (for sandboxing). You may be able to avoid the need
for remote-UFFD and get away with the current non-cooperative
UFFD, but zapping the memory (for atomic updates) would still
require process_madvise(MADV_DONTNEED) [putting aside various
ptrace solutions].
Anyhow, David, I really appreciate your feedback. And you make
strong points about issues I encounter. Yet, eventually, I think
that the main question in this discussion is whether enabling
process_madvise(MADV_DONTNEED) is any different - from security
point of view - than process_vm_writev(), not to mention ptrace.
If not, then the same security guards should suffice, I would
argue.
> On Sep 27, 2021, at 5:14 AM, Kirill A. Shutemov <[email protected]> wrote:
>
> On Mon, Sep 27, 2021 at 03:31:21AM -0700, Nadav Amit wrote:
>>
>>
>>> On Sep 27, 2021, at 2:31 AM, Kirill A. Shutemov <[email protected]> wrote:
>>>
>>> On Sun, Sep 26, 2021 at 09:12:55AM -0700, Nadav Amit wrote:
>>>> From: Nadav Amit <[email protected]>
>>>>
>>>> The different behaviors of madvise are different in several ways, which
>>>> are distributed across several functions. Use the design pattern from
>>>> iouring in order to define the actions that are required for each
>>>> behavior.
>>>>
>>>> The next patches will get rid of old helper functions that are modified
>>>> in this patch and the redundant use of array_index_nospec(). The next
>>>> patches will add more actions for each leaf into the new struct.
>>>
[ snip ]
>>> MADV_SOFT_OFFLINE+1 smells bad.
>>
>> I can set another constant instead and let the compiler shout if anything
>> outside the array is initialized.
>
> I would rather introduce a function that would return struct madvise_info
> for a given behavior. The function would have a switch inside. The default:
> may have BUILD_BUG() or something.
Sounds better than my solution. I will do so.
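Roughly, with the flag names from the posted patch (the case list here
is abridged, and the default arm warns at runtime because, as noted
above, gcc could not prove a BUILD_BUG() arm dead at build time):

	static struct madvise_info madvise_behavior_info(int behavior)
	{
		switch (behavior) {
		case MADV_DONTNEED:
		case MADV_FREE:
			return (struct madvise_info) {
				.behavior_valid = 1,
				.need_mmap_read_only = 1,
			};
		case MADV_COLD:
		case MADV_PAGEOUT:
			return (struct madvise_info) {
				.behavior_valid = 1,
				.process_behavior_valid = 1,
				.need_mmap_read_only = 1,
			};
		/* ... one case (or group) per remaining behavior ... */
		default:
			WARN_ON_ONCE(1);
			return (struct madvise_info) {};
		}
	}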
>>
>> Again, thanks for the details. I guess this should basically work, although it involves a lot of complexity (read: all flavors of uffd on other processes). And I am not so sure about performance aspects. "Performance is not as bad as you think" doesn't sound like the words you would want to hear from a car dealer ;) So there has to be another big benefit to do such user space swapping.
>
> There is some complexity, indeed. Worse, there are some quirks of UFFD
> that make life hard for no reason and some uffd and iouring bugs.
>
> As for my sales pitch - I agree that I am not the best car dealer… :(
:)
> When I say performance is not bad, I mean that the core operations of
> page-fault handling, prefetch and reclaim do not induce high overhead
> *after* the improvements I sent or mentioned.
>
> The benefit of doing so from userspace is that you have full control
> over the reclaim/prefetch policies, so you may be able to make better
> decisions.
>
> Some workloads have predictable access patterns (see for instance "MAGE:
> Nearly Zero-Cost Virtual Memory for Secure Computation", OSDI’21). You may
> be able to handle such access patterns without requiring intrusive changes to the
> workload.
Thanks for the pointer.
And my question would be if something like DAMON would actually be what
you want.
>
>
>>
>>> I am aware that there are some caveats, as zapping the memory does not
>>> guarantee that the memory would be freed since it might be pinned for a
>>> variety of reasons. That's the reason I mentioned the processes have "some
>>> level of cooperation" with the manager. It is not intended to deal with
>>> adversaries or uncommon corner cases (e.g., processes that use UFFD for
>>> their own reasons).
>>
>> It's not only long-term pinnings. Pages could have been de-duplicated (COW after fork, KSM, shared zeropage). Further, you'll most probably lose any kind of "aging" ("accessed") information on pages, or how would you track that?
>
> I know it’s not just long-term pinnings. That’s what “variety of reasons”
> stood for. ;-)
>
> Aging is a tool for certain types of reclamation policies. Some do not
> require it (e.g., random). You can also have compiler/application-guided
> reclamation policies. If you are really into “aging”, you may be able
> to use PEBS or other CPU facilities to track it.
>
> Anyhow, the access-bit by itself is not such a great solution to track
> aging. Setting it can induce overheads of >500 cycles from my (and
> others) experience.
Well, I'm certainly no expert on that; I would assume it's relevant in
corner cases only: if your application accesses all its memory
permanently a swap setup is already "broken". If you have plenty of old
memory (VMs, databases, ...) it should work reasonably well. But yeah,
detecting the working set size is a hard problem, and "access"
bits can be sub-optimal.
After all, that's what the Linux kernel has been relying on for a long
time ... and IIRC it might be extended by multiple "aging" queues soon.
>
>>
>> Although I can see that this might work, I do wonder if it's a use case worth supporting. As Michal correctly raised, we already have other infrastructure in place to trigger swapin/swapout. I recall that also damon wants to let you write advanced policies for that by monitoring actual access characteristics.
>
> Hints, such as those Michal mentioned, prevent the efficient use of
> userfaultfd. Using MADV_PAGEOUT will not trigger another uffd event
> when the page is brought back from swap. So using
> MADV_PAGEOUT/MADV_WILLNEED does not allow you to have a custom
> prefetch policy, for instance. It would also require you to live
> with the kernel reclamation/IO stack for better and worse.
Would more uffd (or similar) events help?
>
> As for DAMON, I am not very familiar with it, but from what I remember
> it seemed to look in a similar direction. IMHO it is more intrusive
> and less configurable (although it can have the advantage of better
> integration with various kernel mechanism). I was wondering for a
> second why you give me such a hard time for a pretty straight-forward
> extension for process_madvise(), but then I remembered that DAMON got
> into the kernel after >30 versions, so I’ll shut up about that. ;-)
It took ... quite a long time, indeed :)
>
>>
>>> Putting aside my use-case (which I am sure people would be glad to criticize),
>>> I can imagine debuggers or emulators may also find use for similar schemes
>>> (although I do not have concrete use-cases for them).
>>
>> I'd be curious about use cases for debuggers/emulators. Especially for emulators I'd guess it makes more sense to just do it within the process. And for debuggers, I'm having a hard time seeing why it would make sense to throw away a page instead of just overwriting it with $PATTERN (e.g., 0). But I'm sure people can be creative :)
>
> I have some more vague ideas, but I am afraid that you will keep
> saying that it makes more sense to handle such events from within
> a process. I am not sure that this is true. Even for the emulators
> that we discuss, the emulated program might run in a different
> address space (for sandboxing). You may be able to avoid the need
> for remote-UFFD and get away with the current non-cooperative
> UFFD, but zapping the memory (for atomic updates) would still
> require process_madvise(MADV_DONTNEED) [putting aside various
> ptrace solutions].
>
> Anyhow, David, I really appreciate your feedback. And you make
> strong points about issues I encounter. Yet, eventually, I think
> that the main question in this discussion is whether enabling
> process_madvise(MADV_DONTNEED) is any different - from security
> point of view - than process_vm_writev(), not to mention ptrace.
> If not, then the same security guards should suffice, I would
> argue.
>
You raise an excellent point (and it should have been part of your
initial sales pitch): how does it differ from process_vm_writev().
I can say that it differs in a way that you can break applications in
more extreme ways. Let me give you two examples:
1. longterm pinnings: you raised this yourself; this can break an
application silently and there is barely a safe way your tooling could
handle it.
2. pagemap: applications can depend on the populated (present|swap)
information in the pagemap for correctness. For example, there was
recently a discussion to use pagemap information to speed up live
migration of VMs, by skipping migration of !populated pages. There is
currently no way your tooling can fake that. In comparison, ordinary
swapping in the kernel can handle it.
Is it easy to break an application with process_vm_writev()? Yes. When
talking about dynamic debugging, it's expected that you break the target
already -- or the target is already broken. Is it easier to break an
application with process_madvise(MADV_DONTNEED)? I'd say yes, especially
when implementing something way beyond debugging as you describe.
I'm giving you "a hard time" for the reason Michal raised: we discussed
this in the past already at least two times IIRC and "it is a free
ticket to all sorts of hard to debug problems" in our opinion; especially
when we mess around in other process address spaces for anything besides debugging.
I'm not the person to ack/nack this, I'm just asking the questions :)
--
Thanks,
David / dhildenb
> On Sep 28, 2021, at 1:53 AM, David Hildenbrand <[email protected]> wrote:
>
>>>
>>> Again, thanks for the details. I guess this should basically work, although it involves a lot of complexity (read: all flavors of uffd on other processes). And I am not so sure about performance aspects. "Performance is not as bad as you think" doesn't sound like the words you would want to hear from a car dealer ;) So there has to be another big benefit to do such user space swapping.
>> There is some complexity, indeed. Worse, there are some quirks of UFFD
>> that make life hard for no reason and some uffd and iouring bugs.
>> As for my sales pitch - I agree that I am not the best car dealer… :(
>
> :)
>
>> When I say performance is not bad, I mean that the core operations of
>> page-fault handling, prefetch and reclaim do not induce high overhead
>> *after* the improvements I sent or mentioned.
>> The benefit of doing so from userspace is that you have full control
>> over the reclaim/prefetch policies, so you may be able to make better
>> decisions.
>> Some workloads have predictable access patterns (see for instance "MAGE:
>> Nearly Zero-Cost Virtual Memory for Secure Computation", OSDI’21). You may
>> be able to handle such access patterns without requiring intrusive changes to the
>> workload.
>
> Thanks for the pointer.
>
> And my question would be if something like DAMON would actually be what you want.
I looked into DAMON and even with the proposed future extensions it sounds
like a different approach with certain benefits but with many limitations.
The major limitation of DAMON is that you need to predefine the logic you
want for reclamation into the kernel. You can add programability through
some API or even eBPF, but it would never be as easy or as versatile as
what a userspace manager can achieve. We already have pretty much all the
facilities to do so from userspace, and the missing parts (at least for
a basic userspace manager) are almost already there. In contrast, see how
many iterations were needed for the basic DAMON implementation.
The second, also big, difference is that DAMON looks only at reclamation.
If you want a custom prefetch scheme or a different I/O stack for backing
storage, you cannot have one.
>
>>>
>>>> I am aware that there are some caveats, as zapping the memory does not
>>>> guarantee that the memory would be freed since it might be pinned for a
>>>> variety of reasons. That's the reason I mentioned the processes have "some
>>>> level of cooperation" with the manager. It is not intended to deal with
>>>> adversaries or uncommon corner cases (e.g., processes that use UFFD for
>>>> their own reasons).
>>>
>>> It's not only long-term pinnings. Pages could have been de-duplicated (COW after fork, KSM, shared zeropage). Further, you'll most probably lose any kind of "aging" ("accessed") information on pages, or how would you track that?
>> I know it’s not just long-term pinnings. That’s what “variety of reasons”
>> stood for. ;-)
>> Aging is a tool for certain types of reclamation policies. Some do not
>> require it (e.g., random). You can also have compiler/application-guided
>> reclamation policies. If you are really into “aging”, you may be able
>> to use PEBS or other CPU facilities to track it.
>> Anyhow, the access-bit by itself is not such a great solution to track
>> aging. Setting it can induce overheads of >500 cycles from my (and
>> others) experience.
>
> Well, I'm certainly no expert on that; I would assume it's relevant in corner cases only: if your application accesses all its memory permanently a swap setup is already "broken". If you have plenty of old memory (VMs, databases, ...) it should work reasonably well. But yeah, detecting the working set size is a hard problem, and "access"
> bits can be sub-optimal.
>
> After all, that's what the Linux kernel has been relying on for a long time ... and IIRC it might be extended by multiple "aging" queues soon.
I see your point. I do not want to waste time in our discussion by
focusing on recency-based reclamation schemes and access-bits. I was
just trying to point out that such schemes and the access-bit have their own
drawbacks.
>
>>>
>>> Although I can see that this might work, I do wonder if it's a use case worth supporting. As Michal correctly raised, we already have other infrastructure in place to trigger swapin/swapout. I recall that also damon wants to let you write advanced policies for that by monitoring actual access characteristics.
>> Hints, such as those Michal mentioned, prevent the efficient use of
>> userfaultfd. Using MADV_PAGEOUT will not trigger another uffd event
>> when the page is brought back from swap. So using
>> MADV_PAGEOUT/MADV_WILLNEED does not allow you to have a custom
>> prefetch policy, for instance. It would also require you to live
>> with the kernel reclamation/IO stack for better and worse.
>
> Would more uffd (or similar) events help?
>
>> As for DAMON, I am not very familiar with it, but from what I remember
>> it seemed to look in a similar direction. IMHO it is more intrusive
>> and less configurable (although it can have the advantage of better
>> integration with various kernel mechanism). I was wondering for a
>> second why you give me such a hard time for a pretty straight-forward
>> extension for process_madvise(), but then I remembered that DAMON got
>> into the kernel after >30 versions, so I’ll shut up about that. ;-)
>
> It took ... quite a long time, indeed :)
>
>>>
>>>> Putting aside my use-case (which I am sure people would be glad to criticize),
>>>> I can imagine debuggers or emulators may also find use for similar schemes
>>>> (although I do not have concrete use-cases for them).
>>>
>>> I'd be curious about use cases for debuggers/emulators. Especially for emulators I'd guess it makes more sense to just do it within the process. And for debuggers, I'm having a hard time seeing why it would make sense to throw away a page instead of just overwriting it with $PATTERN (e.g., 0). But I'm sure people can be creative :)
>> I have some more vague ideas, but I am afraid that you will keep
>> saying that it makes more sense to handle such events from within
>> a process. I am not sure that this is true. Even for the emulators
>> that we discuss, the emulated program might run in a different
>> address space (for sandboxing). You may be able to avoid the need
>> for remote-UFFD and get away with the current non-cooperative
>> UFFD, but zapping the memory (for atomic updates) would still
>> require process_madvise(MADV_DONTNEED) [putting aside various
>> ptrace solutions].
>> Anyhow, David, I really appreciate your feedback. And you make
>> strong points about issues I encounter. Yet, eventually, I think
>> that the main question in this discussion is whether enabling
>> process_madvise(MADV_DONTNEED) is any different - from security
>> point of view - than process_vm_writev(), not to mention ptrace.
>> If not, then the same security guards should suffice, I would
>> argue.
>
>> You raise an excellent point (and it should have been part of your initial sales pitch): how does it differ from process_vm_writev().
>
> I can say that it differs in a way that you can break applications in more extreme ways. Let me give you two examples:
>
> 1. longterm pinnings: you raised this yourself; this can break an application silently and there is barely a safe way your tooling could handle it.
>
>> 2. pagemap: applications can depend on the populated (present|swap) information in the pagemap for correctness. For example, there was recently a discussion to use pagemap information to speed up live migration of VMs, by skipping migration of !populated pages. There is currently no way your tooling can fake that. In comparison, ordinary swapping in the kernel can handle it.
I understand (1). As for (2): the scenario that you mention sounds
very specific, and one can argue that ignoring UFFD-registered
regions in such a case is either (1) wrong or (2) should trigger
some UFFD event.
>
> Is it easy to break an application with process_vm_writev()? Yes. When talking about dynamic debugging, it's expected that you break the target already -- or the target is already broken. Is it easier to break an application with process_madvise(MADV_DONTNEED)? I'd say yes, especially when implementing something way beyond debugging as you describe.
If you do not know what you are doing, you can easily break anything.
Note that there are other APIs that can break your application even
worse, specifically ptrace().
> I'm giving you "a hard time" for the reason Michal raised: we discussed this in the past already at least two times IIRC and "it is a free ticket to all sorts of hard to debug problem" in our opinion; especially when we mess around in other process address spaces besides for debugging.
>
> I'm not the person to ack/nack this, I'm just asking the questions :)
I see your points and I try to look for a path of least resistance.
I thought that process_madvise() is a nice interface to hook into.
But if you are concerned it will be misused, how about adding instead
an IOCTL that will zap pages but only in UFFD-registered regions?
A separate IOCTL for this matter would have the advantage of being more
tailored for UFFD, not to notify UFFD upon “remove” and to be less
likely to be misused.
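Purely to illustrate the shape of that alternative (UFFDIO_ZAP, its
ioctl number and its mode flag are hypothetical; nothing like it exists
in the uapi today):

	/* Hypothetical addition to include/uapi/linux/userfaultfd.h: */
	struct uffdio_zap {
		struct uffdio_range range;
	#define UFFDIO_ZAP_MODE_DONTWAKE	((__u64)1<<0)
		__u64 mode;
	};

	#define _UFFDIO_ZAP	(0x11)
	#define UFFDIO_ZAP	_IOWR(UFFDIO, _UFFDIO_ZAP, struct uffdio_zap)

The handler would zap pages only within VMAs registered with the issuing
userfaultfd, and could skip the "remove" notification as described above.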
On Mon 27-09-21 12:12:46, Nadav Amit wrote:
>
> > On Sep 27, 2021, at 5:16 AM, Michal Hocko <[email protected]> wrote:
> >
> > On Mon 27-09-21 05:00:11, Nadav Amit wrote:
> > [...]
> >> The manager is notified on memory regions that it should monitor
> >> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
> >> using the remote-userfaultfd that you saw on the second thread. When it wants
> >> to reclaim (anonymous) memory, it:
> >>
> >> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
> >> UFFD-WP to do so efficiently, a patch which I did not send yet).
> >> 2. Calls process_vm_readv() to read that memory of that process.
> >> 3. Write it back to “swap”.
> >> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
> >
> > Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?
>
> Providing hints to the kernel only takes you so far.
> The kernel does not want (for a good reason) to be completely
> configurable when it comes to reclaim and prefetch policies. Doing
> so from userspace allows you to be fully configurable.
I am sorry but I do not follow. Your scenario is describing a user
space driven reclaim. Something that MADV_{COLD,PAGEOUT} have been
designed for. What are you missing in the existing functionality?
> > MADV_DONTNEED on a remote process has been proposed in the past several
> > times and it has always been rejected because it is a free ticket to all
> > sorts of hard to debug problems as it is just a free ticket for a remote
> > memory corruption. An additional capability requirement might reduce the
> > risk to some degree but I still do not think this is a good idea.
>
> I would argue that there is nothing bad that remote MADV_DONTNEED can do
> that process_vm_writev() cannot do as well (putting aside ptrace).
I am not arguing this would be the first syscall to allow tricky and
hard to debug corruptions if used without care.
> process_vm_writev() is checking:
>
> mm = mm_access(task, PTRACE_MODE_ATTACH_REALCREDS)
>
> Wouldn't adding such a condition suffice?
This would be a minimum requirement. Another one is a sensible usecase
that is not covered by existing functionality.
--
Michal Hocko
SUSE Labs
> On Sep 29, 2021, at 12:52 AM, Michal Hocko <[email protected]> wrote:
>
> On Mon 27-09-21 12:12:46, Nadav Amit wrote:
>>
>>> On Sep 27, 2021, at 5:16 AM, Michal Hocko <[email protected]> wrote:
>>>
>>> On Mon 27-09-21 05:00:11, Nadav Amit wrote:
>>> [...]
>>>> The manager is notified on memory regions that it should monitor
>>>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
>>>> using the remote-userfaultfd that you saw on the second thread. When it wants
>>>> to reclaim (anonymous) memory, it:
>>>>
>>>> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
>>>> UFFD-WP to do so efficiently, a patch which I did not send yet).
>>>> 2. Calls process_vm_readv() to read that memory of that process.
>>>> 3. Write it back to “swap”.
>>>> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
>>>
>>> Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?
>>
>> Providing hints to the kernel only takes you so far.
>> The kernel does not want (for a good reason) to be completely
>> configurable when it comes to reclaim and prefetch policies. Doing
>> so from userspace allows you to be fully configurable.
>
> I am sorry but I do not follow. Your scenario is describing a user
> space driven reclaim. Something that MADV_{COLD,PAGEOUT} have been
> designed for. What are you missing in the existing functionality?
Using MADV_COLD/MADV_PAGEOUT does not allow userspace to control
many aspects of paging out memory:
1. Writeback: writeback ahead of time, dynamic clustering, etc.
2. Batching (regardless, MADV_PAGEOUT does a pretty bad batching job
on non-contiguous memory).
3. No guarantee that the page is actually reclaimed (e.g., due to
writeback) or when that takes place.
4. I/O stack for swapping - you must use the kernel I/O stack (FUSE,
as non-performant as it is, cannot be used for swap AFAIK).
5. Other operations (e.g., locking, working set tracking) that might
be unnecessary or might interfere.
In addition, the use of MADV_COLD/MADV_PAGEOUT prevents the use
of userfaultfd to trap page-faults and react accordingly, so you
are also prevented from:
6. Having your own custom prefetching policy in response to #PF.
There are additional use-cases I can try to formalize in which
MADV_COLD/MADV_PAGEOUT is insufficient. But the main difference
is pretty clear, I think: one is a hint that applies only to
page reclamation. The other enables the direct control of
userspace over (almost) all aspects of paging.
As I suggested before, if it is preferred, this can be a UFFD
IOCTL instead of process_madvise() behavior, thereby lowering
the risk of a misuse.
I would emphasize that this feature (i.e.,
process_madvise(MADV_DONTNEED) or a similar new UFFD feature)
has little to no effect on kernel robustness, complexity,
security, or API surface. So the impact on the kernel is
negligible.
>>
>> Thanks for the pointer.
>>
>> And my question would be if something like DAMON would actually be what you want.
>
> I looked into DAMON and even with the proposed future extensions it sounds
> like a different approach with certain benefits but with many limitations.
>
> The major limitation of DAMON is that you need to predefine the logic you
> want for reclamation into the kernel. You can add programability through
> some API or even eBPF, but it would never be as easy or as versatile as
> what a userspace manager can achieve. We already have pretty much all the
> facilities to do so from userspace, and the missing parts (at least for
> a basic userspace manager) are almost already there. In contrast, see how
> many iterations were needed for the basic DAMON implementation.
I can see what you're saying when looking at optimizing a handful of
special applications. I still fail to see how something like that could
work as a full replacement for in-kernel swapping. I'm happy to learn.
>
> The second, also big, difference is that DAMON looks only at reclamation.
> If you want a custom prefetch scheme or a different I/O stack for backing
> storage, you cannot have one.
I do wonder if it could be extended for prefetching. But I am absolutely
not a DAMON expert.
[...]
>>
>> You raise an excellent point (and it should have been part of your initial sales pitch): how does it differ from process_vm_writev()?
>>
>> I can say that it differs in a way that you can break applications in more extreme ways. Let me give you two examples:
>>
>> 1. longterm pinnings: you raised this yourself; this can break an application silently and there is barely a safe way your tooling could handle it.
>>
>> 2. pagemap: applications can depend on the populated (present|swap) information in the pagemap for correctness. For example, there was recently a discussion to use pagemap information to speed up live migration of VMs, by skipping migration of !populated pages. There is currently no way your tooling can fake that. In comparison, ordinary swapping in the kernel can handle it.
>
> I understand (1). As for (2): the scenario that you mention sounds
> very specific, and one can argue that ignoring UFFD-registered
> regions in such a case is either (1) wrong or (2) should trigger
> some UFFD event.
>
>>
>> Is it easy to break an application with process_vm_writev()? Yes. When talking about dynamic debugging, it's expected that you break the target already -- or the target is already broken. Is it easier to break an application with process_madvise(MADV_DONTNEED)? I'd say yes, especially when implementing something way beyond debugging as you describe.
>
> If you do not know what you are doing, you can easily break anything.
> Note that there are other APIs that can break your application even
> worse, specifically ptrace().
>
>> I'm giving you "a hard time" for the reason Michal raised: we discussed this in the past already at least two times IIRC and "it is a free ticket to all sorts of hard to debug problem" in our opinion; especially when we mess around in other process address spaces besides for debugging.
>>
>> I'm not the person to ack/nack this, I'm just asking the questions :)
>
> I see your points and I try to look for a path of least resistance.
> I thought that process_madvise() is a nice interface to hook into.
It would be the right interface -- iff the operation didn't have a bad
smell to it. We don't really want applications to mess around in the
page table layout of some other process: however, that is exactly what
you require. By unlocking that interface for that use case we agree that
what you are proposing is a "sane use case", but ...
>
> But if you are concerned it will be misused, how about adding instead
> an IOCTL that will zap pages but only in UFFD-registered regions?
> A separate IOCTL for this matter has the advantage of being more
> tailored to UFFD, of not notifying UFFD upon “remove”, and of being
> less likely to be misused.
... that won't change the fact that with your user-space swapping
approach, which requires this interface, we can break some applications
silently, and that's really the major concern I have.
I mean, there are more cases where you can just harm the target
application, I think; for example, if the target application uses
SOFTDIRTY tracking.
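To make the pagemap point concrete, here is a minimal sketch (mine,
purely for illustration) of a target that samples its own pagemap
entries, with the bit layout as documented in
Documentation/admin-guide/mm/pagemap.rst:

#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* pagemap_fd is open("/proc/self/pagemap", O_RDONLY). */
static int page_populated(int pagemap_fd, uintptr_t vaddr)
{
	uint64_t ent;
	long psize = sysconf(_SC_PAGESIZE);
	off_t off = (off_t)(vaddr / psize) * sizeof(ent);

	if (pread(pagemap_fd, &ent, sizeof(ent), off) != (ssize_t)sizeof(ent))
		return -1;

	/* Bit 63: page present; bit 62: swapped; bit 55: soft-dirty. */
	return !!(ent & (3ULL << 62));
}

A consumer of this information (or of the soft-dirty bit it cleared
earlier via /proc/<pid>/clear_refs) cannot tell that a page became
!populated because it was zapped remotely rather than never touched.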
To judge if this is a sane use case we want to support, it would help a
lot if there were actual code+evaluation from actually implementing
some of these advanced policies. Because you raise a lot of interesting
points in your reply to Michal to back your use case, and naive me
thinks "this sounds interesting but ... aren't we losing a lot of
flexibility+features when doing this in user space? Does anyone actually
want to do it like that?".
Again, I'm not the person to ack/nack this, I'm just questioning if the
use case that requires this interface is actually something that will
get used later in real life because it has real advantages, or if it's a
pure research project that will get abandoned at some point and we'd end
up exposing an interface we really didn't want to expose so far
(especially because all other requests so far were bogus).
--
Thanks,
David / dhildenb
> On Oct 4, 2021, at 10:58 AM, David Hildenbrand <[email protected]> wrote:
>
>>>
>>> Thanks for the pointer.
>>>
>>> And my question would be if something like DAMON would actually be what you want.
>> I looked into DAMON and even with the proposed future extensions it sounds
>> like a different approach with certain benefits but with many limitations.
>> The major limitation of DAMON is that you need to predefine the logic you
>> want for reclamation into the kernel. You can add programmability through
>> some API or even eBPF, but it would never be as easy or as versatile as
>> what a userspace manager can achieve. We already have pretty much all the
>> facilities to do so from userspace, and the missing parts (at least for a
>> basic userspace manager) are almost already there. In contrast, see how
>> many iterations are needed for the basic DAMON implementation.
>
> I can see what you're saying when looking at optimizing a handful of special applications. I still fail to see how something like that could work as a full replacement for in-kernel swapping. I'm happy to learn.
I am not arguing it is a full replacement, at least at this stage.
>
>> The second, also big, difference is that DAMON looks only at reclamation.
>> If you want a custom prefetch scheme or a different I/O stack for backing
>> storage, you cannot have one.
>
> I do wonder if it could be extended for prefetching. But I am absolutely not a DAMON expert.
>
> [...]
These are two different approaches. One is to provide some logic
for the kernel (DAMON). The other is to provide userspace full
control over paging operations (with caveats). Obviously, due to
the caveats, the kernel paging mechanism behaves as a backup.
>
>>>
>>> You raise an excellent point (and it should have been part of your initial sales pitch): how does it differ from process_vm_writev()?
>>>
>>> I can say that it differs in a way that you can break applications in more extreme ways. Let me give you two examples:
>>>
>>> 1. longterm pinnings: you raised this yourself; this can break an application silently and there is barely a safe way your tooling could handle it.
>>>
>>> 2. pagemap: applications can depend on the populated (present|swap) information in the pagemap for correctness. For example, there was recently a discussion to use pagemap information to speed up live migration of VMs, by skipping migration of !populated pages. There is currently no way your tooling can fake that. In comparison, ordinary swapping in the kernel can handle it.
>> I understand (1). As for (2): the scenario that you mention sounds
>> very specific, and one can argue that ignoring UFFD-registered
>> regions in such a case is either (1) wrong or (2) should trigger
>> some UFFD event.
>>>
>>> Is it easy to break an application with process_vm_writev()? Yes. When talking about dynamic debugging, it's expected that you break the target already -- or the target is already broken. Is it easier to break an application with process_madvise(MADV_DONTNEED)? I'd say yes, especially when implementing something way beyond debugging as you describe.
>> If you do not know what you are doing, you can easily break anything.
>> Note that there are other APIs that can break your application even
>> worse, specifically ptrace().
>>> I'm giving you "a hard time" for the reason Michal raised: we discussed this in the past already at least two times IIRC and "it is a free ticket to all sorts of hard to debug problem" in our opinion; especially when we mess around in other process address spaces besides for debugging.
>>>
>>> I'm not the person to ack/nack this, I'm just asking the questions :)
>> I see your points and I try to look for a path of least resistance.
>> I thought that process_madvise() is a nice interface to hook into.
>
> It would be the right interface -- iff the operation didn't have a bad smell to it. We don't really want applications to mess around in the page table layout of some other process: however, that is exactly what you require. By unlocking that interface for that use case we agree that what you are proposing is a "sane use case", but ...
>
>> But if you are concerned it will be misused, how about adding instead
>> an IOCTL that will zap pages but only in UFFD-registered regions?
>> A separate IOCTL for this matter has the advantage of being more
>> tailored to UFFD, of not notifying UFFD upon “remove”, and of being
>> less likely to be misused.
>
> ... that won't change the fact that with your user-space swapping approach, which requires this interface, we can break some applications silently, and that's really the major concern I have.
>
> I mean, there are more cases where you can just harm the target application, I think; for example, if the target application uses SOFTDIRTY tracking.
>
>
> To judge if this is a sane use case we want to support, it would help a lot if there were actual code+evaluation from actually implementing some of these advanced policies. Because you raise a lot of interesting points in your reply to Michal to back your use case, and naive me thinks "this sounds interesting but ... aren't we losing a lot of flexibility+features when doing this in user space? Does anyone actually want to do it like that?".
>
> Again, I'm not the person to ack/nack this, I'm just questioning if the use case that requires this interface is actually something that will get used later in real life because it has real advantages, or if it's a pure research project that will get abandoned at some point and we'd end up exposing an interface we really didn't want to expose so far (especially because all other requests so far were bogus).
I do want to release the code, but it is really
incomplete/immature at this point. I would note that there are additional
use cases, such as workloads that have discardable cache (or memoization
data), which want a central/another entity to discard the data when
there is memory pressure. (You can think about it as a userspace
shrinker).
Anyhow, as a path of least resistance, I think I would do the
following:
1. Wait for the other madvise related patches to be applied.
2. Simplify the patches, specifically removing the data structure
changes, based on Kirill's feedback.
3. Defer the enablement of MADV_DONTNEED until I can show
code/performance numbers.
Regards,
Nadav
On 07.10.21 18:19, Nadav Amit wrote:
>
>
>> On Oct 4, 2021, at 10:58 AM, David Hildenbrand <[email protected]> wrote:
>>
>>>>
>>>> Thanks for the pointer.
>>>>
>>>> And my question would be if something like DAMON would actually be what you want.
>>> I looked into DAMON and even with the proposed future extensions it sounds
>>> like a different approach with certain benefits but with many limitations.
>>> The major limitation of DAMON is that you need to predefine the logic you
>>> want for reclamation into the kernel. You can add programmability through
>>> some API or even eBPF, but it would never be as easy or as versatile as
>>> what a userspace manager can achieve. We already have pretty much all the
>>> facilities to do so from userspace, and the missing parts (at least for a
>>> basic userspace manager) are almost already there. In contrast, see how
>>> many iterations are needed for the basic DAMON implementation.
>>
>> I can see what you're saying when looking at optimizing a handful of special applications. I still fail to see how something like that could work as a full replacement for in-kernel swapping. I'm happy to learn.
>
> I am not arguing it is a full replacement, at least at this stage.
>
>>
>>> The second, also big, difference is that DAMON looks only at reclamation.
>>> If you want a custom prefetch scheme or a different I/O stack for backing
>>> storage, you cannot have one.
>>
>> I do wonder if it could be extended for prefetching. But I am absolutely not a DAMON expert.
>>
>> [...]
>
> These are two different approaches. One is to provide some logic
> for the kernel (DAMON). The other is to provide userspace full
> control over paging operations (with caveats). Obviously, due to
> the caveats, the kernel paging mechanism behaves as a backup.
>
>>
>>>>
>>>> You raise an excellent point (and it should have been part of your initial sales pitch): how does it differ from process_vm_writev()?
>>>>
>>>> I can say that it differs in a way that you can break applications in more extreme ways. Let me give you two examples:
>>>>
>>>> 1. longterm pinnings: you raised this yourself; this can break an application silently and there is barely a safe way your tooling could handle it.
>>>>
>>>> 2. pagemap: applications can depend on the populated (present|swap) information in the pagemap for correctness. For example, there was recently a discussion to use pagemap information to speed up live migration of VMs, by skipping migration of !populated pages. There is currently no way your tooling can fake that. In comparison, ordinary swapping in the kernel can handle it.
>>> I understand (1). As for (2): the scenario that you mention sounds
>>> very specific, and one can argue that ignoring UFFD-registered
>>> regions in such a case is either (1) wrong or (2) should trigger
>>> some UFFD event.
>>>>
>>>> Is it easy to break an application with process_vm_writev()? Yes. When talking about dynamic debugging, it's expected that you break the target already -- or the target is already broken. Is it easier to break an application with process_madvise(MADV_DONTNEED)? I'd say yes, especially when implementing something way beyond debugging as you describe.
>>> If you do not know what you are doing, you can easily break anything.
>>> Note that there are other APIs that can break your application even
>>> worse, specifically ptrace().
>>>> I'm giving you "a hard time" for the reason Michal raised: we discussed this in the past already at least two times IIRC and "it is a free ticket to all sorts of hard to debug problem" in our opinion; especially when we mess around in other process address spaces besides for debugging.
>>>>
>>>> I'm not the person to ack/nack this, I'm just asking the questions :)
>>> I see your points and I try to look for a path of least resistance.
>>> I thought that process_madvise() is a nice interface to hook into.
>>
>> It would be the right interface -- iff the operation didn't have a bad smell to it. We don't really want applications to mess around in the page table layout of some other process: however, that is exactly what you require. By unlocking that interface for that use case we agree that what you are proposing is a "sane use case", but ...
>>
>>> But if you are concerned it will be misused, how about adding instead
>>> an IOCTL that will zap pages but only in UFFD-registered regions?
>>> A separate IOCTL for this matter has the advantage of being more
>>> tailored to UFFD, of not notifying UFFD upon “remove”, and of being
>>> less likely to be misused.
>>
>> ... that won't change the fact that with your user-space swapping approach, which requires this interface, we can break some applications silently, and that's really the major concern I have.
>>
>> I mean, there are more cases where you can just harm the target application, I think; for example, if the target application uses SOFTDIRTY tracking.
>>
>>
>> To judge if this is a sane use case we want to support, it would help a lot if there were actual code+evaluation from actually implementing some of these advanced policies. Because you raise a lot of interesting points in your reply to Michal to back your use case, and naive me thinks "this sounds interesting but ... aren't we losing a lot of flexibility+features when doing this in user space? Does anyone actually want to do it like that?".
>>
>> Again, I'm not the person to ack/nack this, I'm just questioning if the use case that requires this interface is actually something that will get used later in real life because it has real advantages, or if it's a pure research project that will get abandoned at some point and we'd end up exposing an interface we really didn't want to expose so far (especially because all other requests so far were bogus).
>
> I do want to release the code, but it is really
> incomplete/immature at this point. I would note that there are additional
> use cases, such as workloads that have discardable cache (or memoization
> data), which want a central/another entity to discard the data when
> there is memory pressure. (You can think about it as a userspace
> shrinker).
>
> Anyhow, as a path of least resistance, I think I would do the
> following:
>
> 1. Wait for the other madvise related patches to be applied.
> 2. Simplify the patches, specifically removing the data structure
> changes, based on Kirill's feedback.
> 3. Defer the enablement of MADV_DONTNEED until I can show
> code/performance numbers.
Sounds excellent. For your project to make progress at this stage I
assume this stuff doesn't have to be upstream, but it's good to discuss
upstream-ability.
Happy to learn more once you have more details to share.
--
Thanks,
David / dhildenb
On Wed, Sep 29, 2021 at 11:31:25AM -0700, Nadav Amit wrote:
>
>
> > On Sep 29, 2021, at 12:52 AM, Michal Hocko <[email protected]> wrote:
> >
> > On Mon 27-09-21 12:12:46, Nadav Amit wrote:
> >>
> >>> On Sep 27, 2021, at 5:16 AM, Michal Hocko <[email protected]> wrote:
> >>>
> >>> On Mon 27-09-21 05:00:11, Nadav Amit wrote:
> >>> [...]
> >>>> The manager is notified on memory regions that it should monitor
> >>>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
> >>>> using the remote-userfaultfd that you saw on the second thread. When it wants
> >>>> to reclaim (anonymous) memory, it:
> >>>>
> >>>> 1. Uses UFFD-WP to protect that memory (and for this matter I have a vectored
> >>>> UFFD-WP to do so efficiently, a patch which I have not sent yet).
> >>>> 2. Calls process_vm_readv() to read that memory from that process.
> >>>> 3. Writes it back to “swap”.
> >>>> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
> >>>
> >>> Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?
> >>
> >> Providing hints to the kernel only takes you so far.
> >> The kernel does not want (for a good reason) to be completely
> >> configurable when it comes to reclaim and prefetch policies. Doing
> >> so from userspace allows you to be fully configurable.
> >
> > I am sorry but I do not follow. Your scenario is describing a user
> > space driven reclaim. Something that MADV_{COLD,PAGEOUT} have been
> > designed for. What are you missing in the existing functionality?
>
> Using MADV_COLD/MADV_PAGEOUT does not allow userspace to control
> many aspects of paging out memory:
>
> 1. Writeback: writeback ahead of time, dynamic clustering, etc.
> 2. Batching (regardless, MADV_PAGEOUT does a pretty bad batching job
> on non-contiguous memory).
> 3. No guarantee the page is actually reclaimed (e.g., due to writeback)
> or when that takes place.
> 4. I/O stack for swapping - you must use the kernel I/O stack (FUSE,
> non-performant as it is, cannot be used for swap AFAIK).
> 5. Other operations (e.g., locking, working set tracking) that
> might not be necessary or that might interfere.
>
> In addition, the use of MADV_COLD/MADV_PAGEOUT prevents the use
> of userfaultfd to trap page-faults and react accordingly, so you
> are also prevented from:
>
> 6. Having your own custom prefetching policy in response to #PF.
>
> There are additional use-cases I can try to formalize in which
> MADV_COLD/MADV_PAGEOUT is insufficient. But the main difference
> is pretty clear, I think: one is a hint that applies only to
> page reclamation. The other enables direct control by
> userspace over (almost) all aspects of paging.
>
> As I suggested before, if it is preferred, this can be a UFFD
> IOCTL instead of process_madvise() behavior, thereby lowering
> the risk of misuse.
(Sorry to join so late..)
Yeah I'm wondering whether that could add one extra layer of security. But as
you mentioned, we already have process_vm_writev(), so there's indeed no
strong reason to reject process_madvise(DONTNEED) either, it seems.
Not sure whether you're aware of the umap project from LLNL:
https://github.com/LLNL/umap
From what I can tell, that's really doing a very similar thing to what you
proposed here, but it's just a local version of things. IOW in umap the
DONTNEED can already be done locally with madvise() in the umap-maintained
threads. That removes the need to introduce the new process_madvise()
interface, and it's definitely safer as it's per-mm and per-task.
I think you mentioned above that the tracee program will need to cooperate in
this case, so I'm wondering whether some solution like umap would be fine too,
as that also requires cooperation of the tracee program. It's just that the
cooperation may be slightly more than with your solution, but frankly I think
that's still trivial; before I understand the details of your solution I can't
really tell.
E.g. for a program to use umap, I think it needs to replace mmap() with umap()
where we want the buffers to be managed by the umap library rather than the
kernel; then linking against the umap library should work. If the remote
solution you're proposing requires similar (or even more complicated)
cooperation, then it'll be controversial whether that can be done per-mm just
like how umap is designed and used. So IMHO it'll be great to share more
details on those parts if umap cannot satisfy the current need - IMHO it
satisfies all the features you described for fully customized pageout and
page faulting in, it's just done in a single mm.
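For reference, the usage I have in mind, from my reading of the umap
README, is roughly the below; I'm writing the umap()/uunmap() signatures
and the UMAP_* names from memory, so treat them as assumptions rather
than verified API:

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include "umap/umap.h"	/* LLNL umap library */

int main(void)
{
	int fd = open("backing.dat", O_RDWR);
	size_t len = 1UL << 30;

	/*
	 * mmap() replacement: faults on [base, base+len) are then served
	 * by umap's filler threads, and eviction (including the local
	 * madvise()) is done by its evictor threads.
	 */
	void *base = umap(NULL, len, PROT_READ | PROT_WRITE,
			  UMAP_PRIVATE, fd, 0);
	if (base == UMAP_FAILED)
		return 1;

	/* ... use base like ordinary memory ... */

	uunmap(base, len);
	return 0;
}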
Thanks,
--
Peter Xu
> On Oct 12, 2021, at 4:14 PM, Peter Xu <[email protected]> wrote:
>
> On Wed, Sep 29, 2021 at 11:31:25AM -0700, Nadav Amit wrote:
>>
>>
>>> On Sep 29, 2021, at 12:52 AM, Michal Hocko <[email protected]> wrote:
>>>
>>> On Mon 27-09-21 12:12:46, Nadav Amit wrote:
>>>>
>>>>> On Sep 27, 2021, at 5:16 AM, Michal Hocko <[email protected]> wrote:
>>>>>
>>>>> On Mon 27-09-21 05:00:11, Nadav Amit wrote:
>>>>> [...]
>>>>>> The manager is notified on memory regions that it should monitor
>>>>>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
>>>>>> using the remote-userfaultfd that you saw on the second thread. When it wants
>>>>>> to reclaim (anonymous) memory, it:
>>>>>>
>>>>>> 1. Uses UFFD-WP to protect that memory (and for this matter I have a vectored
>>>>>> UFFD-WP to do so efficiently, a patch which I have not sent yet).
>>>>>> 2. Calls process_vm_readv() to read that memory from that process.
>>>>>> 3. Writes it back to “swap”.
>>>>>> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
>>>>>
>>>>> Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?
>>>>
>>>> Providing hints to the kernel only takes you so far.
>>>> The kernel does not want (for a good reason) to be completely
>>>> configurable when it comes to reclaim and prefetch policies. Doing
>>>> so from userspace allows you to be fully configurable.
>>>
>>> I am sorry but I do not follow. Your scenario is describing a user
>>> space driven reclaim. Something that MADV_{COLD,PAGEOUT} have been
>>> designed for. What are you missing in the existing functionality?
>>
>> Using MADV_COLD/MADV_PAGEOUT does not allow userspace to control
>> many aspects of paging out memory:
>>
>> 1. Writeback: writeback ahead of time, dynamic clustering, etc.
>> 2. Batching (regardless, MADV_PAGEOUT does a pretty bad batching job
>> on non-contiguous memory).
>> 3. No guarantee the page is actually reclaimed (e.g., due to writeback)
>> or when that takes place.
>> 4. I/O stack for swapping - you must use the kernel I/O stack (FUSE,
>> non-performant as it is, cannot be used for swap AFAIK).
>> 5. Other operations (e.g., locking, working set tracking) that
>> might not be necessary or that might interfere.
>>
>> In addition, the use of MADV_COLD/MADV_PAGEOUT prevents the use
>> of userfaultfd to trap page-faults and react accordingly, so you
>> are also prevented from:
>>
>> 6. Having your own custom prefetching policy in response to #PF.
>>
>> There are additional use-cases I can try to formalize in which
>> MADV_COLD/MADV_PAGEOUT is insufficient. But the main difference
>> is pretty clear, I think: one is a hint that applies only to
>> page reclamation. The other enables direct control by
>> userspace over (almost) all aspects of paging.
>>
>> As I suggested before, if it is preferred, this can be a UFFD
>> IOCTL instead of process_madvise() behavior, thereby lowering
>> the risk of misuse.
>
> (Sorry to join so late..)
>
> Yeah I'm wondering whether that could add one extra layer of security. But as
> you mentioned, we already have process_vm_writev(), so there's indeed no
> strong reason to reject process_madvise(DONTNEED) either, it seems.
>
> Not sure whether you're aware of the umap project from LLNL:
>
> https://github.com/LLNL/umap
>
> From what I can tell, that's really doing a very similar thing to what you
> proposed here, but it's just a local version of things. IOW in umap the
> DONTNEED can already be done locally with madvise() in the umap-maintained
> threads. That removes the need to introduce the new process_madvise()
> interface, and it's definitely safer as it's per-mm and per-task.
>
> I think you mentioned above that the tracee program will need to cooperate in
> this case, so I'm wondering whether some solution like umap would be fine too,
> as that also requires cooperation of the tracee program. It's just that the
> cooperation may be slightly more than with your solution, but frankly I think
> that's still trivial; before I understand the details of your solution I can't
> really tell.
>
> E.g. for a program to use umap, I think it needs to replace mmap() with umap()
> where we want the buffers to be managed by the umap library rather than the
> kernel; then linking against the umap library should work. If the remote
> solution you're proposing requires similar (or even more complicated)
> cooperation, then it'll be controversial whether that can be done per-mm just
> like how umap is designed and used. So IMHO it'll be great to share more
> details on those parts if umap cannot satisfy the current need - IMHO it
> satisfies all the features you described for fully customized pageout and
> page faulting in, it's just done in a single mm.
Thanks for your feedback, Peter.
I am familiar with umap, perhaps not enough, but I am aware of it.
From my experience, the existing interfaces are not sufficient if you look
for a high performance (low overhead) solution for multiple processes. The
cooperation issues are something that I mentioned preemptively to avoid
unnecessary discussion, but I believe they can be resolved (I have just
deferred handling them).
Specifically for performance, several new kernel features are needed, for
instance, support for io_uring with async operations, a vectored
UFFDIO_WRITEPROTECT(V) which batches TLB flushes across VMAs, and a
vectored madvise(). Even if we talk in the context of a single mm, I
cannot see umap being performant for low latency devices without those
facilities.
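To illustrate what the vectored variant would save, this is roughly what
write-protecting N ranges costs with the existing (since 5.7)
UFFDIO_WRITEPROTECT API - one ioctl, and potentially one TLB flush, per
range (a minimal sketch; the proposed vectored ioctl would batch the
flushes):

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

/* Today: one UFFDIO_WRITEPROTECT ioctl (and flush) per range. */
static int wp_ranges(int uffd, const struct uffdio_range *ranges, int n)
{
	for (int i = 0; i < n; i++) {
		struct uffdio_writeprotect wp = {
			.range = ranges[i],
			.mode  = UFFDIO_WRITEPROTECT_MODE_WP,
		};

		if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
			return -1;
	}
	return 0;
}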
Anyhow, I take your feedback and I will resend the patch for enabling
MADV_DONTNEED with the other patches once I am done. As for the TLB batching
itself, I think it has independent value - but I am not going to
argue about it now if there is pushback against it.
On Wed, Oct 13, 2021 at 08:47:11AM -0700, Nadav Amit wrote:
>
>
> > On Oct 12, 2021, at 4:14 PM, Peter Xu <[email protected]> wrote:
> >
> > On Wed, Sep 29, 2021 at 11:31:25AM -0700, Nadav Amit wrote:
> >>
> >>
> >>> On Sep 29, 2021, at 12:52 AM, Michal Hocko <[email protected]> wrote:
> >>>
> >>> On Mon 27-09-21 12:12:46, Nadav Amit wrote:
> >>>>
> >>>>> On Sep 27, 2021, at 5:16 AM, Michal Hocko <[email protected]> wrote:
> >>>>>
> >>>>> On Mon 27-09-21 05:00:11, Nadav Amit wrote:
> >>>>> [...]
> >>>>>> The manager is notified on memory regions that it should monitor
> >>>>>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
> >>>>>> using the remote-userfaultfd that you saw on the second thread. When it wants
> >>>>>> to reclaim (anonymous) memory, it:
> >>>>>>
> >>>>>> 1. Uses UFFD-WP to protect that memory (and for this matter I have a vectored
> >>>>>> UFFD-WP to do so efficiently, a patch which I have not sent yet).
> >>>>>> 2. Calls process_vm_readv() to read that memory from that process.
> >>>>>> 3. Writes it back to “swap”.
> >>>>>> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
> >>>>>
> >>>>> Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?
> >>>>
> >>>> Providing hints to the kernel only takes you so far.
> >>>> The kernel does not want (for a good reason) to be completely
> >>>> configurable when it comes to reclaim and prefetch policies. Doing
> >>>> so from userspace allows you to be fully configurable.
> >>>
> >>> I am sorry but I do not follow. Your scenario is describing a user
> >>> space driven reclaim. Something that MADV_{COLD,PAGEOUT} have been
> >>> designed for. What are you missing in the existing functionality?
> >>
> >> Using MADV_COLD/MADV_PAGEOUT does not allow userspace to control
> >> many aspects of paging out memory:
> >>
> >> 1. Writeback: writeback ahead of time, dynamic clustering, etc.
> >> 2. Batching (regardless, MADV_PAGEOUT does a pretty bad batching job
> >> on non-contiguous memory).
> >> 3. No guarantee the page is actually reclaimed (e.g., due to writeback)
> >> or when that takes place.
> >> 4. I/O stack for swapping - you must use the kernel I/O stack (FUSE,
> >> non-performant as it is, cannot be used for swap AFAIK).
> >> 5. Other operations (e.g., locking, working set tracking) that
> >> might not be necessary or that might interfere.
> >>
> >> In addition, the use of MADV_COLD/MADV_PAGEOUT prevents the use
> >> of userfaultfd to trap page-faults and react accordingly, so you
> >> are also prevented from:
> >>
> >> 6. Having your own custom prefetching policy in response to #PF.
> >>
> >> There are additional use-cases I can try to formalize in which
> >> MADV_COLD/MADV_PAGEOUT is insufficient. But the main difference
> >> is pretty clear, I think: one is a hint that applies only to
> >> page reclamation. The other enables direct control by
> >> userspace over (almost) all aspects of paging.
> >>
> >> As I suggested before, if it is preferred, this can be a UFFD
> >> IOCTL instead of process_madvise() behavior, thereby lowering
> >> the risk of misuse.
> >
> > (Sorry to join so late..)
> >
> > Yeah I'm wondering whether that could add one extra layer of security. But as
> > you mentioned, we already have process_vm_writev(), so there's indeed no
> > strong reason to reject process_madvise(DONTNEED) either, it seems.
> >
> > Not sure whether you're aware of the umap project from LLNL:
> >
> > https://github.com/LLNL/umap
> >
> > From what I can tell, that's really doing a very similar thing to what you
> > proposed here, but it's just a local version of things. IOW in umap the
> > DONTNEED can already be done locally with madvise() in the umap-maintained
> > threads. That removes the need to introduce the new process_madvise()
> > interface, and it's definitely safer as it's per-mm and per-task.
> >
> > I think you mentioned above that the tracee program will need to cooperate in
> > this case, so I'm wondering whether some solution like umap would be fine too,
> > as that also requires cooperation of the tracee program. It's just that the
> > cooperation may be slightly more than with your solution, but frankly I think
> > that's still trivial; before I understand the details of your solution I can't
> > really tell.
> >
> > E.g. for a program to use umap, I think it needs to replace mmap() with umap()
> > where we want the buffers to be managed by the umap library rather than the
> > kernel; then linking against the umap library should work. If the remote
> > solution you're proposing requires similar (or even more complicated)
> > cooperation, then it'll be controversial whether that can be done per-mm just
> > like how umap is designed and used. So IMHO it'll be great to share more
> > details on those parts if umap cannot satisfy the current need - IMHO it
> > satisfies all the features you described for fully customized pageout and
> > page faulting in, it's just done in a single mm.
>
> Thanks for your feedback, Peter.
>
> I am familiar with umap, perhaps not enough, but I am aware of it.
>
> From my experience, the existing interfaces are not sufficient if you look
> for a high performance (low overhead) solution for multiple processes. The
> cooperation issues are something that I mentioned preemptively to avoid
> unnecessary discussion, but I believe they can be resolved (I have just
> deferred handling them).
>
> Specifically for performance, several new kernel features are needed, for
> instance, support for io_uring with async operations, a vectored
> UFFDIO_WRITEPROTECT(V) which batches TLB flushes across VMAs, and a
> vectored madvise(). Even if we talk in the context of a single mm, I
> cannot see umap being performant for low latency devices without those
> facilities.
>
> Anyhow, I take your feedback and I will resend the patch for enabling
> MADV_DONTNEED with the other patches once I am done. As for the TLB batching
> itself, I think it has independent value - but I am not going to
> argue about it now if there is pushback against it.
Fair enough.
Yes, my comment was mostly about whether a remote interface is needed or
whether we can still do it locally (frankly I always wanted to have some
remote interface to manipulate uffd, but still anything like that should
require some justification for sure).
I totally agree that the rest of your work - optimizing TLB flushing
(especially the two points Andrea mentioned: reducing TLB flushes for huge
pmd protection changes, and pte promotion) and the vectored interfaces -
sounds reasonable, and they're definitely separate issues compared to this
one.
Thanks,
--
Peter Xu