2021-08-27 19:20:42

by Suren Baghdasaryan

Subject: [PATCH v8 0/3] Anonymous VMA naming patches

There were a number of previous attempts to upstream support for anonymous
VMA naming. The original submission by Colin Cross [1] implemented a
dictionary of refcounted names to reuse the same name strings. Dave Hansen
suggested [2] using userspace pointers instead and the patch was rewritten
that way. The last version of this patch, v7, was posted by Sumit Semwal [3],
and a very similar patch has been used in Android to name anonymous VMAs
for a number of years. Kees Cook raised concerns about this patch [4],
noting the lack of string sanitization and the use of userspace pointers
from the kernel. In conclusion [5], it was suggested to strndup_user the
strings from userspace, perform appropriate checks and store a copy as a
vm_area_struct member. The performance impact of the additional strdup's
during fork() should be measured by allocating a large number (64k) of VMAs
with the longest possible names and timing fork()s.

This patchset implements the suggested approach in the first 2 patches and
the 3rd patch implements simple refcounting to avoid strdup'ing the names
during fork() and minimize the regression.

The proposed test was conducted on an ARM64 Android device with the CPU
frequency locked at 2.4GHz, the performance governor selected, and the
Android system stopped (adb shell stop) to minimize noise. The test covers
3 different scenarios; a rough sketch of the benchmark follows the scenario
list below. In each scenario a process with 64K named anonymous VMAs forks
children 1000 times while timing each fork and reporting the average time.
The scenarios differ in the VMA content:

1. VMAs are not populated with any data (not a realistic scenario, but it
helps emphasize the regression).
2. Each VMA contains 1 page populated with random data.
3. Each VMA contains 10 pages populated with random data.
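
For concreteness, below is a rough, self-contained sketch of such a
benchmark. It is not the exact harness that produced the numbers that
follow; it assumes a kernel carrying this patchset, defines the new prctl
and mmap constants locally in case the installed headers lack them, and
picks an arbitrary (assumed-free) base address with one-page holes so the
named mappings cannot merge with each other:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#ifndef PR_SET_VMA
#define PR_SET_VMA		0x53564d41
#define PR_SET_VMA_ANON_NAME	0
#endif
#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE	0x100000
#endif

#define NR_VMAS		(64 * 1024)
#define NR_FORKS	1000
#define VMA_PAGES	10
#define TOUCH_PAGES	1	/* 0, 1 or 10 for the three scenarios */

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	uintptr_t base = 0x4000000000UL;	/* assumed-free address range */
	char name[64];
	double total_ns = 0;

	memset(name, 'x', sizeof(name) - 1);	/* longest allowed name */
	name[sizeof(name) - 1] = '\0';

	for (long i = 0; i < NR_VMAS; i++) {
		/* Leave a one-page hole after each region so neighbours never merge. */
		char *p = mmap((void *)(base + i * (VMA_PAGES + 1) * page),
			       VMA_PAGES * page, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE,
			       -1, 0);
		if (p == MAP_FAILED)
			return 1;
		prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, (unsigned long)p,
		      VMA_PAGES * page, (unsigned long)name);
		/* Populate the first TOUCH_PAGES pages of the region. */
		for (int j = 0; j < TOUCH_PAGES; j++)
			p[j * page] = (char)(i + j);
	}

	for (int i = 0; i < NR_FORKS; i++) {
		struct timespec t0, t1;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		pid_t pid = fork();
		clock_gettime(CLOCK_MONOTONIC, &t1);
		if (pid == 0)
			_exit(0);	/* child exits immediately */
		waitpid(pid, NULL, 0);
		total_ns += (t1.tv_sec - t0.tv_sec) * 1e9 +
			    (t1.tv_nsec - t0.tv_nsec);
	}

	printf("average fork(): %.2f ms\n", total_ns / NR_FORKS / 1e6);
	return 0;
}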

With the first 2 patches implementing the strdup approach, the average
fork() times are:

                            unnamed VMAs   named VMAs   REGRESSION
Unpopulated VMAs                 16.73ms      23.34ms       39.51%
VMAs with 1 page of data         51.98ms      59.94ms       15.31%
VMAs with 10 pages of data       66.86ms      76.31ms       14.13%

From the perf results, the regression can be attributed to the strlen() and
strdup() calls. The shrinking of the regression as more data is populated
can be attributed mostly to anon_vma_fork() and copy_page_range() consuming
more time during fork().

With the refcounting implemented in the last patch of this series, the
results are:

                            unnamed VMAs   named VMAs   REGRESSION
Unpopulated VMAs                 16.36ms      18.35ms       12.16%
VMAs with 1 page of data         48.16ms      51.30ms        6.52%
VMAs with 10 pages of data       64.23ms      67.69ms        5.39%

From the perf results, the regression can be attributed to
refcount_inc_checked() (called from kref_get()).

While there is obviously a measurable regression, 64K named anonymous VMAs
is truly a worst-case scenario. In real usage, the only current user of
this feature, namely Android, rarely has processes whose VMA count even
reaches 4000 (the highest I have measured). The regression of forking a
process with that number of VMAs is at the noise level.

1. https://lore.kernel.org/linux-mm/[email protected]/
2. https://lore.kernel.org/linux-mm/[email protected]/
3. https://lore.kernel.org/linux-mm/[email protected]/
4. https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
5. https://lore.kernel.org/linux-mm/[email protected]/

Colin Cross (2):
mm: rearrange madvise code to allow for reuse
mm: add a field to store names for private anonymous memory

Suren Baghdasaryan (1):
mm: add anonymous vma name refcounting

Documentation/filesystems/proc.rst | 2 +
fs/proc/task_mmu.c | 14 +-
fs/userfaultfd.c | 7 +-
include/linux/mm.h | 13 +-
include/linux/mm_types.h | 55 +++-
include/uapi/linux/prctl.h | 3 +
kernel/fork.c | 2 +
kernel/sys.c | 48 ++++
mm/madvise.c | 447 +++++++++++++++++++----------
mm/mempolicy.c | 3 +-
mm/mlock.c | 2 +-
mm/mmap.c | 38 +--
mm/mprotect.c | 2 +-
13 files changed, 462 insertions(+), 174 deletions(-)

--
2.33.0.259.gc128427fd7-goog


2021-08-27 19:20:48

by Suren Baghdasaryan

Subject: [PATCH v8 1/3] mm: rearrange madvise code to allow for reuse

From: Colin Cross <[email protected]>

Refactor the madvise syscall to allow for parts of it to be reused by a
prctl syscall that affects vmas.

Move the code that walks vmas in a virtual address range into a function
that takes a function pointer as a parameter. The only caller for now is
sys_madvise, which uses it to call madvise_vma_behavior on each vma, but
the next patch will add an additional caller.

Move handling all vma behaviors inside madvise_behavior, and rename it to
madvise_vma_behavior.

Move the code that updates the flags on a vma, including splitting or
merging the vma as necessary, into a new function called
madvise_update_vma. The next patch will add support for updating a new
anon_name field as well.

Signed-off-by: Colin Cross <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Jan Glauber <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: "Serge E. Hallyn" <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Tang Chen <[email protected]>
Cc: Robin Holt <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
[sumits: rebased over v5.9-rc3]
Signed-off-by: Sumit Semwal <[email protected]>
[surenb: rebased over v5.14-rc7]
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/madvise.c | 319 +++++++++++++++++++++++++++------------------------
1 file changed, 172 insertions(+), 147 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 5c065bc8b5f6..359cd3fa612c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -63,76 +63,20 @@ static int madvise_need_mmap_write(int behavior)
}

/*
- * We can potentially split a vm area into separate
- * areas, each area with its own behavior.
+ * Update the vm_flags on regiion of a vma, splitting it or merging it as
+ * necessary. Must be called with mmap_sem held for writing;
*/
-static long madvise_behavior(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start, unsigned long end, int behavior)
+static int madvise_update_vma(struct vm_area_struct *vma,
+ struct vm_area_struct **prev, unsigned long start,
+ unsigned long end, unsigned long new_flags)
{
struct mm_struct *mm = vma->vm_mm;
- int error = 0;
+ int error;
pgoff_t pgoff;
- unsigned long new_flags = vma->vm_flags;
-
- switch (behavior) {
- case MADV_NORMAL:
- new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
- break;
- case MADV_SEQUENTIAL:
- new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ;
- break;
- case MADV_RANDOM:
- new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ;
- break;
- case MADV_DONTFORK:
- new_flags |= VM_DONTCOPY;
- break;
- case MADV_DOFORK:
- if (vma->vm_flags & VM_IO) {
- error = -EINVAL;
- goto out;
- }
- new_flags &= ~VM_DONTCOPY;
- break;
- case MADV_WIPEONFORK:
- /* MADV_WIPEONFORK is only supported on anonymous memory. */
- if (vma->vm_file || vma->vm_flags & VM_SHARED) {
- error = -EINVAL;
- goto out;
- }
- new_flags |= VM_WIPEONFORK;
- break;
- case MADV_KEEPONFORK:
- new_flags &= ~VM_WIPEONFORK;
- break;
- case MADV_DONTDUMP:
- new_flags |= VM_DONTDUMP;
- break;
- case MADV_DODUMP:
- if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) {
- error = -EINVAL;
- goto out;
- }
- new_flags &= ~VM_DONTDUMP;
- break;
- case MADV_MERGEABLE:
- case MADV_UNMERGEABLE:
- error = ksm_madvise(vma, start, end, behavior, &new_flags);
- if (error)
- goto out_convert_errno;
- break;
- case MADV_HUGEPAGE:
- case MADV_NOHUGEPAGE:
- error = hugepage_madvise(vma, &new_flags, behavior);
- if (error)
- goto out_convert_errno;
- break;
- }

if (new_flags == vma->vm_flags) {
*prev = vma;
- goto out;
+ return 0;
}

pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
@@ -149,21 +93,21 @@ static long madvise_behavior(struct vm_area_struct *vma,
if (start != vma->vm_start) {
if (unlikely(mm->map_count >= sysctl_max_map_count)) {
error = -ENOMEM;
- goto out;
+ return error;
}
error = __split_vma(mm, vma, start, 1);
if (error)
- goto out_convert_errno;
+ return error;
}

if (end != vma->vm_end) {
if (unlikely(mm->map_count >= sysctl_max_map_count)) {
error = -ENOMEM;
- goto out;
+ return error;
}
error = __split_vma(mm, vma, end, 0);
if (error)
- goto out_convert_errno;
+ return error;
}

success:
@@ -172,15 +116,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
*/
vma->vm_flags = new_flags;

-out_convert_errno:
- /*
- * madvise() returns EAGAIN if kernel resources, such as
- * slab, are temporarily unavailable.
- */
- if (error == -ENOMEM)
- error = -EAGAIN;
-out:
- return error;
+ return 0;
}

#ifdef CONFIG_SWAP
@@ -930,6 +866,96 @@ static long madvise_remove(struct vm_area_struct *vma,
return error;
}

+/*
+ * Apply an madvise behavior to a region of a vma. madvise_update_vma
+ * will handle splitting a vm area into separate areas, each area with its own
+ * behavior.
+ */
+static int madvise_vma_behavior(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start, unsigned long end,
+ unsigned long behavior)
+{
+ int error = 0;
+ unsigned long new_flags = vma->vm_flags;
+
+ switch (behavior) {
+ case MADV_REMOVE:
+ return madvise_remove(vma, prev, start, end);
+ case MADV_WILLNEED:
+ return madvise_willneed(vma, prev, start, end);
+ case MADV_COLD:
+ return madvise_cold(vma, prev, start, end);
+ case MADV_PAGEOUT:
+ return madvise_pageout(vma, prev, start, end);
+ case MADV_FREE:
+ case MADV_DONTNEED:
+ return madvise_dontneed_free(vma, prev, start, end, behavior);
+ case MADV_POPULATE_READ:
+ case MADV_POPULATE_WRITE:
+ return madvise_populate(vma, prev, start, end, behavior);
+ case MADV_NORMAL:
+ new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
+ break;
+ case MADV_SEQUENTIAL:
+ new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ;
+ break;
+ case MADV_RANDOM:
+ new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ;
+ break;
+ case MADV_DONTFORK:
+ new_flags |= VM_DONTCOPY;
+ break;
+ case MADV_DOFORK:
+ if (vma->vm_flags & VM_IO) {
+ error = -EINVAL;
+ goto out;
+ }
+ new_flags &= ~VM_DONTCOPY;
+ break;
+ case MADV_WIPEONFORK:
+ /* MADV_WIPEONFORK is only supported on anonymous memory. */
+ if (vma->vm_file || vma->vm_flags & VM_SHARED) {
+ error = -EINVAL;
+ goto out;
+ }
+ new_flags |= VM_WIPEONFORK;
+ break;
+ case MADV_KEEPONFORK:
+ new_flags &= ~VM_WIPEONFORK;
+ break;
+ case MADV_DONTDUMP:
+ new_flags |= VM_DONTDUMP;
+ break;
+ case MADV_DODUMP:
+ if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) {
+ error = -EINVAL;
+ goto out;
+ }
+ new_flags &= ~VM_DONTDUMP;
+ break;
+ case MADV_MERGEABLE:
+ case MADV_UNMERGEABLE:
+ error = ksm_madvise(vma, start, end, behavior, &new_flags);
+ if (error)
+ goto out;
+ break;
+ case MADV_HUGEPAGE:
+ case MADV_NOHUGEPAGE:
+ error = hugepage_madvise(vma, &new_flags, behavior);
+ if (error)
+ goto out;
+ break;
+ }
+
+ error = madvise_update_vma(vma, prev, start, end, new_flags);
+
+out:
+ if (error == -ENOMEM)
+ error = -EAGAIN;
+ return error;
+}
+
#ifdef CONFIG_MEMORY_FAILURE
/*
* Error injection support for memory error handling.
@@ -978,30 +1004,6 @@ static int madvise_inject_error(int behavior,
}
#endif

-static long
-madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
- unsigned long start, unsigned long end, int behavior)
-{
- switch (behavior) {
- case MADV_REMOVE:
- return madvise_remove(vma, prev, start, end);
- case MADV_WILLNEED:
- return madvise_willneed(vma, prev, start, end);
- case MADV_COLD:
- return madvise_cold(vma, prev, start, end);
- case MADV_PAGEOUT:
- return madvise_pageout(vma, prev, start, end);
- case MADV_FREE:
- case MADV_DONTNEED:
- return madvise_dontneed_free(vma, prev, start, end, behavior);
- case MADV_POPULATE_READ:
- case MADV_POPULATE_WRITE:
- return madvise_populate(vma, prev, start, end, behavior);
- default:
- return madvise_behavior(vma, prev, start, end, behavior);
- }
-}
-
static bool
madvise_behavior_valid(int behavior)
{
@@ -1054,6 +1056,73 @@ process_madvise_behavior_valid(int behavior)
}
}

+/*
+ * Walk the vmas in range [start,end), and call the visit function on each one.
+ * The visit function will get start and end parameters that cover the overlap
+ * between the current vma and the original range. Any unmapped regions in the
+ * original range will result in this function returning -ENOMEM while still
+ * calling the visit function on all of the existing vmas in the range.
+ * Must be called with the mmap_lock held for reading or writing.
+ */
+static
+int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
+ unsigned long end, unsigned long arg,
+ int (*visit)(struct vm_area_struct *vma,
+ struct vm_area_struct **prev, unsigned long start,
+ unsigned long end, unsigned long arg))
+{
+ struct vm_area_struct *vma;
+ struct vm_area_struct *prev;
+ unsigned long tmp;
+ int unmapped_error = 0;
+
+ /*
+ * If the interval [start,end) covers some unmapped address
+ * ranges, just ignore them, but return -ENOMEM at the end.
+ * - different from the way of handling in mlock etc.
+ */
+ vma = find_vma_prev(mm, start, &prev);
+ if (vma && start > vma->vm_start)
+ prev = vma;
+
+ for (;;) {
+ int error;
+
+ /* Still start < end. */
+ if (!vma)
+ return -ENOMEM;
+
+ /* Here start < (end|vma->vm_end). */
+ if (start < vma->vm_start) {
+ unmapped_error = -ENOMEM;
+ start = vma->vm_start;
+ if (start >= end)
+ break;
+ }
+
+ /* Here vma->vm_start <= start < (end|vma->vm_end) */
+ tmp = vma->vm_end;
+ if (end < tmp)
+ tmp = end;
+
+ /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
+ error = visit(vma, &prev, start, tmp, arg);
+ if (error)
+ return error;
+ start = tmp;
+ if (prev && start < prev->vm_end)
+ start = prev->vm_end;
+ if (start >= end)
+ break;
+ if (prev)
+ vma = prev->vm_next;
+ else /* madvise_remove dropped mmap_lock */
+ vma = find_vma(mm, start);
+ }
+
+ return unmapped_error;
+}
+
/*
* The madvise(2) system call.
*
@@ -1126,9 +1195,7 @@ process_madvise_behavior_valid(int behavior)
*/
int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
{
- unsigned long end, tmp;
- struct vm_area_struct *vma, *prev;
- int unmapped_error = 0;
+ unsigned long end;
int error = -EINVAL;
int write;
size_t len;
@@ -1168,51 +1235,9 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
mmap_read_lock(mm);
}

- /*
- * If the interval [start,end) covers some unmapped address
- * ranges, just ignore them, but return -ENOMEM at the end.
- * - different from the way of handling in mlock etc.
- */
- vma = find_vma_prev(mm, start, &prev);
- if (vma && start > vma->vm_start)
- prev = vma;
-
blk_start_plug(&plug);
- for (;;) {
- /* Still start < end. */
- error = -ENOMEM;
- if (!vma)
- goto out;
-
- /* Here start < (end|vma->vm_end). */
- if (start < vma->vm_start) {
- unmapped_error = -ENOMEM;
- start = vma->vm_start;
- if (start >= end)
- goto out;
- }
-
- /* Here vma->vm_start <= start < (end|vma->vm_end) */
- tmp = vma->vm_end;
- if (end < tmp)
- tmp = end;
-
- /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
- error = madvise_vma(vma, &prev, start, tmp, behavior);
- if (error)
- goto out;
- start = tmp;
- if (prev && start < prev->vm_end)
- start = prev->vm_end;
- error = unmapped_error;
- if (start >= end)
- goto out;
- if (prev)
- vma = prev->vm_next;
- else /* madvise_remove dropped mmap_lock */
- vma = find_vma(mm, start);
- }
-out:
+ error = madvise_walk_vmas(mm, start, end, behavior,
+ madvise_vma_behavior);
blk_finish_plug(&plug);
if (write)
mmap_write_unlock(mm);
--
2.33.0.259.gc128427fd7-goog

2021-08-27 19:21:11

by Suren Baghdasaryan

Subject: [PATCH v8 2/3] mm: add a field to store names for private anonymous memory

From: Colin Cross <[email protected]>

In many userspace applications, and especially in VM-based applications
such as those Android relies on heavily, there are multiple different
allocators in use. At a minimum there is libc malloc and the stack, and in
many cases there are libc malloc, the stack, direct syscalls to mmap
anonymous memory, and multiple VM heaps (one for small objects, one for big
objects, etc.). Each of these layers usually has its own tools to inspect
its usage: malloc by compiling a debug version, the VM through heap
inspection tools, and for direct syscalls there is usually no way to track
them.

On Android we heavily use a set of tools that use an extended version of
the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped
in userspace and slice their usage by process, shared (COW) vs. unique
mappings, backing, etc. This can account for real physical memory usage
even in cases like fork without exec (which Android uses heavily to share
as many private COW pages as possible between processes), Kernel SamePage
Merging, and clean zero pages. It produces a measurement of the pages
that only exist in that process (USS, for unique), and a measurement of
the physical memory usage of that process with the cost of shared pages
being evenly split between processes that share them (PSS).

If all anonymous memory is indistinguishable then figuring out the real
physical memory usage (PSS) of each heap requires either a pagemap walking
tool that can understand the heap debugging of every layer, or for every
layer's heap debugging tools to implement the pagemap walking logic, in
which case it is hard to get a consistent view of memory across the whole
system.

Tracking the information in userspace leads to all sorts of problems.
It either needs to be stored inside the process, which means every
process has to have an API to export its current heap information upon
request, or it has to be stored externally in a filesystem that
somebody needs to clean up on crashes. It needs to be readable while
the process is still running, so it has to have some sort of
synchronization with every layer of userspace. Efficiently tracking
the ranges requires reimplementing something like the kernel vma
trees, and linking to it from every layer of userspace. It requires
more memory, more syscalls, more runtime cost, and more complexity to
separately track regions that the kernel is already tracking.

This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
userspace-provided name for anonymous vmas. The names of named anonymous
vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>].

Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
Setting the name to NULL clears it. The name length limit is 64 bytes
including the NUL terminator (to have some reasonable limit, and because
the longest name used in Android has 50 characters), and the name is
checked to contain only printable characters.
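
As an illustration, here is a minimal usage sketch of the interface
described above. It assumes a kernel with this patch applied, defines the
new prctl constants locally in case the installed uapi headers lack them,
and omits error handling for brevity; the name "demo heap" is just an
example:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <unistd.h>

#ifndef PR_SET_VMA
#define PR_SET_VMA		0x53564d41
#define PR_SET_VMA_ANON_NAME	0
#endif

int main(void)
{
	size_t len = 16 * (size_t)sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Name the region; it should appear as "[anon:demo heap]" in maps. */
	prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
	      (unsigned long)p, len, (unsigned long)"demo heap");

	/* Print the named anonymous mappings of this process. */
	char line[256];
	FILE *f = fopen("/proc/self/maps", "r");
	while (fgets(line, sizeof(line), f))
		if (strstr(line, "[anon:"))
			fputs(line, stdout);
	fclose(f);

	/* Passing NULL (0) as the name clears it again. */
	prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, (unsigned long)p, len, 0);

	munmap(p, len);
	return 0;
}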

The name is stored in a pointer in the shared union in vm_area_struct
that points to a null-terminated string. Anonymous vmas that have the same
name (equivalent strings) and are otherwise mergeable will be merged.
The name pointers are not shared between vmas even if they contain the
same name. The name pointer is stored in a union with fields that are
only used on file-backed mappings, so it does not increase memory usage.

The patch is based on the original patch developed by Colin Cross, more
specifically on its latest version [1] posted upstream by Sumit Semwal.
That version used a userspace pointer to store vma names, and in that
design name pointers could be shared between vmas. However, during the last
upstreaming attempt Kees Cook raised concerns [2] about this approach and
suggested copying the name into kernel memory space, performing validity
checks [3] and storing it as a string referenced from vm_area_struct.
One big concern is fork() performance, which would need to strdup anonymous
vma names. Dave Hansen suggested experimenting with the worst-case scenario
of forking a process with 64k vmas having the longest possible names [4].
I ran this experiment on an ARM64 Android device and recorded a worst-case
regression of almost 40% when forking such a process. This regression is
addressed in the followup patch, which replaces the pointer to a name with
a refcounted structure that allows sharing the name pointer between vmas
with the same name. Instead of duplicating the string during fork() or when
splitting a vma, it increments the refcount.

[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
[3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
[4] https://lore.kernel.org/linux-mm/[email protected]/

Signed-off-by: Colin Cross <[email protected]>
[surenb: rebased over v5.14-rc7, replaced userpointer with a kernel copy
and added input sanitization. The bulk of the work here was done by Colin
Cross, therefore, with his permission, keeping him as the author]
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
Documentation/filesystems/proc.rst | 2 +
fs/proc/task_mmu.c | 14 +++-
fs/userfaultfd.c | 7 +-
include/linux/mm.h | 13 +++-
include/linux/mm_types.h | 48 +++++++++++--
include/uapi/linux/prctl.h | 3 +
kernel/fork.c | 2 +
kernel/sys.c | 48 +++++++++++++
mm/madvise.c | 112 +++++++++++++++++++++++++++--
mm/mempolicy.c | 3 +-
mm/mlock.c | 2 +-
mm/mmap.c | 38 +++++-----
mm/mprotect.c | 2 +-
13 files changed, 261 insertions(+), 33 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 042c418f4090..a067eec54ef1 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -431,6 +431,8 @@ is not associated with a file:
[stack] the stack of the main process
[vdso] the "virtual dynamic shared object",
the kernel system call handler
+[anon:<name>] an anonymous mapping that has been
+ named by userspace
======= ====================================

or if empty, the mapping is anonymous.
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index eb97468dfe4c..2ce5b3c4e7fc 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -308,6 +308,8 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)

name = arch_vma_name(vma);
if (!name) {
+ const char *anon_name;
+
if (!mm) {
name = "[vdso]";
goto done;
@@ -319,8 +321,18 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
goto done;
}

- if (is_stack(vma))
+ if (is_stack(vma)) {
name = "[stack]";
+ goto done;
+ }
+
+ anon_name = vma_anon_name(vma);
+ if (anon_name) {
+ seq_pad(m, ' ');
+ seq_puts(m, "[anon:");
+ seq_write(m, anon_name, strlen(anon_name));
+ seq_putc(m, ']');
+ }
}

done:
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 5c2d806e6ae5..5057843fb71a 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -876,7 +876,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
new_flags, vma->anon_vma,
vma->vm_file, vma->vm_pgoff,
vma_policy(vma),
- NULL_VM_UFFD_CTX);
+ NULL_VM_UFFD_CTX, vma_anon_name(vma));
if (prev)
vma = prev;
else
@@ -1440,7 +1440,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
prev = vma_merge(mm, prev, start, vma_end, new_flags,
vma->anon_vma, vma->vm_file, vma->vm_pgoff,
vma_policy(vma),
- ((struct vm_userfaultfd_ctx){ ctx }));
+ ((struct vm_userfaultfd_ctx){ ctx }),
+ vma_anon_name(vma));
if (prev) {
vma = prev;
goto next;
@@ -1617,7 +1618,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
prev = vma_merge(mm, prev, start, vma_end, new_flags,
vma->anon_vma, vma->vm_file, vma->vm_pgoff,
vma_policy(vma),
- NULL_VM_UFFD_CTX);
+ NULL_VM_UFFD_CTX, vma_anon_name(vma));
if (prev) {
vma = prev;
goto next;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7ca22e6e694a..45c003fae7fe 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2548,7 +2548,7 @@ static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
extern struct vm_area_struct *vma_merge(struct mm_struct *,
struct vm_area_struct *prev, unsigned long addr, unsigned long end,
unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
- struct mempolicy *, struct vm_userfaultfd_ctx);
+ struct mempolicy *, struct vm_userfaultfd_ctx, const char *);
extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
unsigned long addr, int new_below);
@@ -3283,5 +3283,16 @@ static inline int seal_check_future_write(int seals, struct vm_area_struct *vma)
return 0;
}

+#ifdef CONFIG_ADVISE_SYSCALLS
+int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
+ unsigned long len_in, const char *name);
+#else
+static inline int
+madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
+ unsigned long len_in, const char *name) {
+ return 0;
+}
+#endif
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 52bbd2b7cb46..26a30f7a5228 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -342,11 +342,19 @@ struct vm_area_struct {
/*
* For areas with an address space and backing store,
* linkage into the address_space->i_mmap interval tree.
+ *
+ * For private anonymous mappings, a pointer to a null terminated string
+ * containing the name given to the vma, or NULL if unnamed.
*/
- struct {
- struct rb_node rb;
- unsigned long rb_subtree_last;
- } shared;
+
+ union {
+ struct {
+ struct rb_node rb;
+ unsigned long rb_subtree_last;
+ } shared;
+ /* Serialized by mmap_sem. */
+ char *anon_name;
+ };

/*
* A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
@@ -801,4 +809,36 @@ typedef struct {
unsigned long val;
} swp_entry_t;

+/*
+ * mmap_lock should be read-locked when calling vma_anon_name() and while using
+ * the returned pointer.
+ */
+extern const char *vma_anon_name(struct vm_area_struct *vma);
+
+/*
+ * mmap_lock should be read-locked for orig_vma->vm_mm.
+ * mmap_lock should be write-locked for new_vma->vm_mm or new_vma should be
+ * isolated.
+ */
+extern void dup_vma_anon_name(struct vm_area_struct *orig_vma,
+ struct vm_area_struct *new_vma);
+
+/*
+ * mmap_lock should be write-locked or vma should have been isolated under
+ * write-locked mmap_lock protection.
+ */
+extern void free_vma_anon_name(struct vm_area_struct *vma);
+
+/* mmap_lock should be read-locked */
+static inline bool is_same_vma_anon_name(struct vm_area_struct *vma,
+ const char *name)
+{
+ const char *vma_name = vma_anon_name(vma);
+
+ if (likely(!vma_name))
+ return name == NULL;
+
+ return name && !strcmp(name, vma_name);
+}
+
#endif /* _LINUX_MM_TYPES_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 967d9c55323d..968582cd91b5 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -267,4 +267,7 @@ struct prctl_mm_map {
# define PR_SCHED_CORE_SHARE_FROM 3 /* pull core_sched cookie to pid */
# define PR_SCHED_CORE_MAX 4

+#define PR_SET_VMA 0x53564d41
+# define PR_SET_VMA_ANON_NAME 0
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 44f4c2d83763..e086f56a4628 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -366,12 +366,14 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
*new = data_race(*orig);
INIT_LIST_HEAD(&new->anon_vma_chain);
new->vm_next = new->vm_prev = NULL;
+ dup_vma_anon_name(orig, new);
}
return new;
}

void vm_area_free(struct vm_area_struct *vma)
{
+ free_vma_anon_name(vma);
kmem_cache_free(vm_area_cachep, vma);
}

diff --git a/kernel/sys.c b/kernel/sys.c
index ef1a78f5d71c..c48267a8b857 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2298,6 +2298,51 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,

#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)

+#ifdef CONFIG_MMU
+
+#define ANON_VMA_NAME_MAX_LEN 64
+
+static int prctl_set_vma(unsigned long opt, unsigned long addr,
+ unsigned long size, unsigned long arg)
+{
+ struct mm_struct *mm = current->mm;
+ char *name, *pch;
+ int error;
+
+ switch (opt) {
+ case PR_SET_VMA_ANON_NAME:
+ name = strndup_user((const char __user *)arg,
+ ANON_VMA_NAME_MAX_LEN);
+
+ if (IS_ERR(name))
+ return PTR_ERR(name);
+
+ for (pch = name; *pch != '\0'; pch++) {
+ if (!isprint(*pch)) {
+ kfree(name);
+ return -EINVAL;
+ }
+ }
+
+ mmap_write_lock(mm);
+ error = madvise_set_anon_name(mm, addr, size, name);
+ mmap_write_unlock(mm);
+ kfree(name);
+ break;
+ default:
+ error = -EINVAL;
+ }
+
+ return error;
+}
+#else /* CONFIG_MMU */
+static int prctl_set_vma(unsigned long opt, unsigned long start,
+ unsigned long size, unsigned long arg)
+{
+ return -EINVAL;
+}
+#endif
+
SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
unsigned long, arg4, unsigned long, arg5)
{
@@ -2567,6 +2612,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = sched_core_share_pid(arg2, arg3, arg4, arg5);
break;
#endif
+ case PR_SET_VMA:
+ error = prctl_set_vma(arg2, arg3, arg4, arg5);
+ break;
default:
error = -EINVAL;
break;
diff --git a/mm/madvise.c b/mm/madvise.c
index 359cd3fa612c..bc029f3fca6a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -18,6 +18,7 @@
#include <linux/fadvise.h>
#include <linux/sched.h>
#include <linux/sched/mm.h>
+#include <linux/string.h>
#include <linux/uio.h>
#include <linux/ksm.h>
#include <linux/fs.h>
@@ -62,19 +63,74 @@ static int madvise_need_mmap_write(int behavior)
}
}

+static inline bool has_vma_anon_name(struct vm_area_struct *vma)
+{
+ return !vma->vm_file && vma->anon_name;
+}
+
+const char *vma_anon_name(struct vm_area_struct *vma)
+{
+ if (!has_vma_anon_name(vma))
+ return NULL;
+
+ mmap_assert_locked(vma->vm_mm);
+
+ return vma->anon_name;
+}
+
+void dup_vma_anon_name(struct vm_area_struct *orig_vma,
+ struct vm_area_struct *new_vma)
+{
+ if (!has_vma_anon_name(orig_vma))
+ return;
+
+ new_vma->anon_name = kstrdup(orig_vma->anon_name, GFP_KERNEL);
+}
+
+void free_vma_anon_name(struct vm_area_struct *vma)
+{
+ if (!has_vma_anon_name(vma))
+ return;
+
+ kfree(vma->anon_name);
+ vma->anon_name = NULL;
+}
+
+/* mmap_lock should be write-locked */
+static void replace_vma_anon_name(struct vm_area_struct *vma, const char *name)
+{
+ if (!name) {
+ free_vma_anon_name(vma);
+ return;
+ }
+
+ if (vma->anon_name) {
+ /* Should never happen, to dup use dup_vma_anon_name() */
+ WARN_ON(vma->anon_name == name);
+
+ /* Same name, nothing to do here */
+ if (!strcmp(name, vma->anon_name))
+ return;
+
+ free_vma_anon_name(vma);
+ }
+ vma->anon_name = kstrdup(name, GFP_KERNEL);
+}
+
/*
- * Update the vm_flags on regiion of a vma, splitting it or merging it as
+ * Update the vm_flags on region of a vma, splitting it or merging it as
* necessary. Must be called with mmap_sem held for writing;
*/
static int madvise_update_vma(struct vm_area_struct *vma,
struct vm_area_struct **prev, unsigned long start,
- unsigned long end, unsigned long new_flags)
+ unsigned long end, unsigned long new_flags,
+ const char *name)
{
struct mm_struct *mm = vma->vm_mm;
int error;
pgoff_t pgoff;

- if (new_flags == vma->vm_flags) {
+ if (new_flags == vma->vm_flags && is_same_vma_anon_name(vma, name)) {
*prev = vma;
return 0;
}
@@ -82,7 +138,7 @@ static int madvise_update_vma(struct vm_area_struct *vma,
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
vma->vm_file, pgoff, vma_policy(vma),
- vma->vm_userfaultfd_ctx);
+ vma->vm_userfaultfd_ctx, name);
if (*prev) {
vma = *prev;
goto success;
@@ -115,10 +171,30 @@ static int madvise_update_vma(struct vm_area_struct *vma,
* vm_flags is protected by the mmap_lock held in write mode.
*/
vma->vm_flags = new_flags;
+ if (!vma->vm_file)
+ replace_vma_anon_name(vma, name);

return 0;
}

+static int madvise_vma_anon_name(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start, unsigned long end,
+ unsigned long name)
+{
+ int error;
+
+ /* Only anonymous mappings can be named */
+ if (vma->vm_file)
+ return -EINVAL;
+
+ error = madvise_update_vma(vma, prev, start, end, vma->vm_flags,
+ (const char *)name);
+ if (error == -ENOMEM)
+ error = -EAGAIN;
+ return error;
+}
+
#ifdef CONFIG_SWAP
static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
unsigned long end, struct mm_walk *walk)
@@ -948,7 +1024,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
break;
}

- error = madvise_update_vma(vma, prev, start, end, new_flags);
+ error = madvise_update_vma(vma, prev, start, end, new_flags,
+ vma_anon_name(vma));

out:
if (error == -ENOMEM)
@@ -1123,6 +1200,31 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
return unmapped_error;
}

+int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
+ unsigned long len_in, const char *name)
+{
+ unsigned long end;
+ unsigned long len;
+
+ if (start & ~PAGE_MASK)
+ return -EINVAL;
+ len = (len_in + ~PAGE_MASK) & PAGE_MASK;
+
+ /* Check to see whether len was rounded up from small -ve to zero */
+ if (len_in && !len)
+ return -EINVAL;
+
+ end = start + len;
+ if (end < start)
+ return -EINVAL;
+
+ if (end == start)
+ return 0;
+
+ return madvise_walk_vmas(mm, start, end, (unsigned long)name,
+ madvise_vma_anon_name);
+}
+
/*
* The madvise(2) system call.
*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e32360e90274..cc21ca7e9d40 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -811,7 +811,8 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
((vmstart - vma->vm_start) >> PAGE_SHIFT);
prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
vma->anon_vma, vma->vm_file, pgoff,
- new_pol, vma->vm_userfaultfd_ctx);
+ new_pol, vma->vm_userfaultfd_ctx,
+ vma_anon_name(vma));
if (prev) {
vma = prev;
next = vma->vm_next;
diff --git a/mm/mlock.c b/mm/mlock.c
index 16d2ee160d43..c878515680af 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -511,7 +511,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
vma->vm_file, pgoff, vma_policy(vma),
- vma->vm_userfaultfd_ctx);
+ vma->vm_userfaultfd_ctx, vma_anon_name(vma));
if (*prev) {
vma = *prev;
goto success;
diff --git a/mm/mmap.c b/mm/mmap.c
index ca54d36d203a..baf00fbb1f4c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1032,7 +1032,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
*/
static inline int is_mergeable_vma(struct vm_area_struct *vma,
struct file *file, unsigned long vm_flags,
- struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+ const char *anon_name)
{
/*
* VM_SOFTDIRTY should not prevent from VMA merging, if we
@@ -1050,6 +1051,8 @@ static inline int is_mergeable_vma(struct vm_area_struct *vma,
return 0;
if (!is_mergeable_vm_userfaultfd_ctx(vma, vm_userfaultfd_ctx))
return 0;
+ if (!is_same_vma_anon_name(vma, anon_name))
+ return 0;
return 1;
}

@@ -1082,9 +1085,10 @@ static int
can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
struct anon_vma *anon_vma, struct file *file,
pgoff_t vm_pgoff,
- struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+ const char *anon_name)
{
- if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
+ if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name) &&
is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
if (vma->vm_pgoff == vm_pgoff)
return 1;
@@ -1103,9 +1107,10 @@ static int
can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
struct anon_vma *anon_vma, struct file *file,
pgoff_t vm_pgoff,
- struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+ const char *anon_name)
{
- if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
+ if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name) &&
is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
pgoff_t vm_pglen;
vm_pglen = vma_pages(vma);
@@ -1116,9 +1121,9 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
}

/*
- * Given a mapping request (addr,end,vm_flags,file,pgoff), figure out
- * whether that can be merged with its predecessor or its successor.
- * Or both (it neatly fills a hole).
+ * Given a mapping request (addr,end,vm_flags,file,pgoff,anon_name),
+ * figure out whether that can be merged with its predecessor or its
+ * successor. Or both (it neatly fills a hole).
*
* In most cases - when called for mmap, brk or mremap - [addr,end) is
* certain not to be mapped by the time vma_merge is called; but when
@@ -1163,7 +1168,8 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
unsigned long end, unsigned long vm_flags,
struct anon_vma *anon_vma, struct file *file,
pgoff_t pgoff, struct mempolicy *policy,
- struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+ const char *anon_name)
{
pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
struct vm_area_struct *area, *next;
@@ -1193,7 +1199,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
mpol_equal(vma_policy(prev), policy) &&
can_vma_merge_after(prev, vm_flags,
anon_vma, file, pgoff,
- vm_userfaultfd_ctx)) {
+ vm_userfaultfd_ctx, anon_name)) {
/*
* OK, it can. Can we now merge in the successor as well?
*/
@@ -1202,7 +1208,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
can_vma_merge_before(next, vm_flags,
anon_vma, file,
pgoff+pglen,
- vm_userfaultfd_ctx) &&
+ vm_userfaultfd_ctx, anon_name) &&
is_mergeable_anon_vma(prev->anon_vma,
next->anon_vma, NULL)) {
/* cases 1, 6 */
@@ -1225,7 +1231,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
mpol_equal(policy, vma_policy(next)) &&
can_vma_merge_before(next, vm_flags,
anon_vma, file, pgoff+pglen,
- vm_userfaultfd_ctx)) {
+ vm_userfaultfd_ctx, anon_name)) {
if (prev && addr < prev->vm_end) /* case 4 */
err = __vma_adjust(prev, prev->vm_start,
addr, prev->vm_pgoff, NULL, next);
@@ -1766,7 +1772,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
* Can we just expand an old mapping?
*/
vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
- NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
+ NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
if (vma)
goto out;

@@ -1825,7 +1831,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
*/
if (unlikely(vm_flags != vma->vm_flags && prev)) {
merge = vma_merge(mm, prev, vma->vm_start, vma->vm_end, vma->vm_flags,
- NULL, vma->vm_file, vma->vm_pgoff, NULL, NULL_VM_UFFD_CTX);
+ NULL, vma->vm_file, vma->vm_pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
if (merge) {
/* ->mmap() can change vma->vm_file and fput the original file. So
* fput the vma->vm_file here or we would add an extra fput for file
@@ -3087,7 +3093,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla

/* Can we just expand an old private anonymous mapping? */
vma = vma_merge(mm, prev, addr, addr + len, flags,
- NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX);
+ NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
if (vma)
goto out;

@@ -3280,7 +3286,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
return NULL; /* should never get here */
new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
- vma->vm_userfaultfd_ctx);
+ vma->vm_userfaultfd_ctx, vma_anon_name(vma));
if (new_vma) {
/*
* Source vma may have been merged into new_vma
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 883e2cc85cad..a48ff8e79f48 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -464,7 +464,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*pprev = vma_merge(mm, *pprev, start, end, newflags,
vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
- vma->vm_userfaultfd_ctx);
+ vma->vm_userfaultfd_ctx, vma_anon_name(vma));
if (*pprev) {
vma = *pprev;
VM_WARN_ON((vma->vm_flags ^ newflags) & ~VM_SOFTDIRTY);
--
2.33.0.259.gc128427fd7-goog

2021-08-27 19:21:26

by Suren Baghdasaryan

Subject: [PATCH v8 3/3] mm: add anonymous vma name refcounting

While forking a process with a high number (64K) of named anonymous vmas,
the overhead caused by strdup() is noticeable. Experiments with an ARM64
Android device show up to a 40% performance regression when forking a
process with 64k unpopulated anonymous vmas using the maximum name length,
compared to the same process with the same number of unnamed anonymous
vmas.
Introduce a refcounted anon_vma_name structure to avoid the overhead of
copying vma names during fork() and when splitting named anonymous vmas.
When a vma is duplicated, instead of copying the name we increment the
refcount of this structure. Multiple vmas can point to the same
anon_vma_name as long as they increment the refcount. The name member of
the anon_vma_name structure is assigned at structure allocation time and is
never changed. If a vma's name changes, the refcount of the original
structure is dropped, a new anon_vma_name structure is allocated to hold
the new name, and the vma pointer is updated to point to the new structure.
With this approach the fork() performance regression is reduced 3-4x, and
with usecases using a more reasonable number of VMAs (a few thousand) the
regression is not measurable.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm_types.h | 9 ++++++++-
mm/madvise.c | 42 +++++++++++++++++++++++++++++++++-------
2 files changed, 43 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 26a30f7a5228..a7361acf2921 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -5,6 +5,7 @@
#include <linux/mm_types_task.h>

#include <linux/auxvec.h>
+#include <linux/kref.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/rbtree.h>
@@ -302,6 +303,12 @@ struct vm_userfaultfd_ctx {
struct vm_userfaultfd_ctx {};
#endif /* CONFIG_USERFAULTFD */

+struct anon_vma_name {
+ struct kref kref;
+ /* The name needs to be at the end because it is dynamically sized. */
+ char name[];
+};
+
/*
* This struct describes a virtual memory area. There is one of these
* per VM-area/task. A VM area is any part of the process virtual memory
@@ -353,7 +360,7 @@ struct vm_area_struct {
unsigned long rb_subtree_last;
} shared;
/* Serialized by mmap_sem. */
- char *anon_name;
+ struct anon_vma_name *anon_name;
};

/*
diff --git a/mm/madvise.c b/mm/madvise.c
index bc029f3fca6a..32ac5dc5ebf3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -63,6 +63,27 @@ static int madvise_need_mmap_write(int behavior)
}
}

+static struct anon_vma_name *anon_vma_name_alloc(const char *name)
+{
+ struct anon_vma_name *anon_name;
+ size_t len = strlen(name);
+
+ /* Add 1 for NUL terminator at the end of the anon_name->name */
+ anon_name = kzalloc(sizeof(*anon_name) + len + 1,
+ GFP_KERNEL);
+ kref_init(&anon_name->kref);
+ strcpy(anon_name->name, name);
+
+ return anon_name;
+}
+
+static void vma_anon_name_free(struct kref *kref)
+{
+ struct anon_vma_name *anon_name =
+ container_of(kref, struct anon_vma_name, kref);
+ kfree(anon_name);
+}
+
static inline bool has_vma_anon_name(struct vm_area_struct *vma)
{
return !vma->vm_file && vma->anon_name;
@@ -75,7 +96,7 @@ const char *vma_anon_name(struct vm_area_struct *vma)

mmap_assert_locked(vma->vm_mm);

- return vma->anon_name;
+ return vma->anon_name->name;
}

void dup_vma_anon_name(struct vm_area_struct *orig_vma,
@@ -84,37 +105,44 @@ void dup_vma_anon_name(struct vm_area_struct *orig_vma,
if (!has_vma_anon_name(orig_vma))
return;

- new_vma->anon_name = kstrdup(orig_vma->anon_name, GFP_KERNEL);
+ kref_get(&orig_vma->anon_name->kref);
+ new_vma->anon_name = orig_vma->anon_name;
}

void free_vma_anon_name(struct vm_area_struct *vma)
{
+ struct anon_vma_name *anon_name;
+
if (!has_vma_anon_name(vma))
return;

- kfree(vma->anon_name);
+ anon_name = vma->anon_name;
vma->anon_name = NULL;
+ kref_put(&anon_name->kref, vma_anon_name_free);
}

/* mmap_lock should be write-locked */
static void replace_vma_anon_name(struct vm_area_struct *vma, const char *name)
{
+ const char *anon_name;
+
if (!name) {
free_vma_anon_name(vma);
return;
}

- if (vma->anon_name) {
+ anon_name = vma_anon_name(vma);
+ if (anon_name) {
/* Should never happen, to dup use dup_vma_anon_name() */
- WARN_ON(vma->anon_name == name);
+ WARN_ON(anon_name == name);

/* Same name, nothing to do here */
- if (!strcmp(name, vma->anon_name))
+ if (!strcmp(name, anon_name))
return;

free_vma_anon_name(vma);
}
- vma->anon_name = kstrdup(name, GFP_KERNEL);
+ vma->anon_name = anon_vma_name_alloc(name);
}

/*
--
2.33.0.259.gc128427fd7-goog

2021-08-28 00:15:26

by Kees Cook

Subject: Re: [PATCH v8 1/3] mm: rearrange madvise code to allow for reuse

On Fri, Aug 27, 2021 at 12:18:56PM -0700, Suren Baghdasaryan wrote:
> From: Colin Cross <[email protected]>
>
> Refactor the madvise syscall to allow for parts of it to be reused by a
> prctl syscall that affects vmas.
>
> Move the code that walks vmas in a virtual address range into a function
> that takes a function pointer as a parameter. The only caller for now is
> sys_madvise, which uses it to call madvise_vma_behavior on each vma, but
> the next patch will add an additional caller.
>
> Move handling all vma behaviors inside madvise_behavior, and rename it to
> madvise_vma_behavior.
>
> Move the code that updates the flags on a vma, including splitting or
> merging the vma as necessary, into a new function called
> madvise_update_vma. The next patch will add support for updating a new
> anon_name field as well.
>
> Signed-off-by: Colin Cross <[email protected]>
> Cc: Pekka Enberg <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: "Eric W. Biederman" <[email protected]>
> Cc: Jan Glauber <[email protected]>
> Cc: John Stultz <[email protected]>
> Cc: Rob Landley <[email protected]>
> Cc: Cyrill Gorcunov <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: "Serge E. Hallyn" <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Michel Lespinasse <[email protected]>
> Cc: Tang Chen <[email protected]>
> Cc: Robin Holt <[email protected]>
> Cc: Shaohua Li <[email protected]>
> Cc: Sasha Levin <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> [sumits: rebased over v5.9-rc3]
> Signed-off-by: Sumit Semwal <[email protected]>
> [surenb: rebased over v5.14-rc7]
> Signed-off-by: Suren Baghdasaryan <[email protected]>

Other folks have already reviewed this, and it does look okay to me,
too, but I find it a bit hard to review. There are at least 3 things
happening in this patch:
- moving to the walker
- merging two behavior routines
- extracting flag setting from behavior checking

It seems like those could be separate patches, but I'm probably overly
picky. :)

-Kees

> ---
> mm/madvise.c | 319 +++++++++++++++++++++++++++------------------------
> 1 file changed, 172 insertions(+), 147 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 5c065bc8b5f6..359cd3fa612c 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -63,76 +63,20 @@ static int madvise_need_mmap_write(int behavior)
> }
>
> /*
> - * We can potentially split a vm area into separate
> - * areas, each area with its own behavior.
> + * Update the vm_flags on regiion of a vma, splitting it or merging it as
> + * necessary. Must be called with mmap_sem held for writing;
> */
> -static long madvise_behavior(struct vm_area_struct *vma,
> - struct vm_area_struct **prev,
> - unsigned long start, unsigned long end, int behavior)
> +static int madvise_update_vma(struct vm_area_struct *vma,
> + struct vm_area_struct **prev, unsigned long start,
> + unsigned long end, unsigned long new_flags)
> {
> struct mm_struct *mm = vma->vm_mm;
> - int error = 0;
> + int error;
> pgoff_t pgoff;
> - unsigned long new_flags = vma->vm_flags;
> -
> - switch (behavior) {
> - case MADV_NORMAL:
> - new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
> - break;
> - case MADV_SEQUENTIAL:
> - new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ;
> - break;
> - case MADV_RANDOM:
> - new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ;
> - break;
> - case MADV_DONTFORK:
> - new_flags |= VM_DONTCOPY;
> - break;
> - case MADV_DOFORK:
> - if (vma->vm_flags & VM_IO) {
> - error = -EINVAL;
> - goto out;
> - }
> - new_flags &= ~VM_DONTCOPY;
> - break;
> - case MADV_WIPEONFORK:
> - /* MADV_WIPEONFORK is only supported on anonymous memory. */
> - if (vma->vm_file || vma->vm_flags & VM_SHARED) {
> - error = -EINVAL;
> - goto out;
> - }
> - new_flags |= VM_WIPEONFORK;
> - break;
> - case MADV_KEEPONFORK:
> - new_flags &= ~VM_WIPEONFORK;
> - break;
> - case MADV_DONTDUMP:
> - new_flags |= VM_DONTDUMP;
> - break;
> - case MADV_DODUMP:
> - if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) {
> - error = -EINVAL;
> - goto out;
> - }
> - new_flags &= ~VM_DONTDUMP;
> - break;
> - case MADV_MERGEABLE:
> - case MADV_UNMERGEABLE:
> - error = ksm_madvise(vma, start, end, behavior, &new_flags);
> - if (error)
> - goto out_convert_errno;
> - break;
> - case MADV_HUGEPAGE:
> - case MADV_NOHUGEPAGE:
> - error = hugepage_madvise(vma, &new_flags, behavior);
> - if (error)
> - goto out_convert_errno;
> - break;
> - }
>
> if (new_flags == vma->vm_flags) {
> *prev = vma;
> - goto out;
> + return 0;
> }
>
> pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> @@ -149,21 +93,21 @@ static long madvise_behavior(struct vm_area_struct *vma,
> if (start != vma->vm_start) {
> if (unlikely(mm->map_count >= sysctl_max_map_count)) {
> error = -ENOMEM;
> - goto out;
> + return error;
> }
> error = __split_vma(mm, vma, start, 1);
> if (error)
> - goto out_convert_errno;
> + return error;
> }
>
> if (end != vma->vm_end) {
> if (unlikely(mm->map_count >= sysctl_max_map_count)) {
> error = -ENOMEM;
> - goto out;
> + return error;
> }
> error = __split_vma(mm, vma, end, 0);
> if (error)
> - goto out_convert_errno;
> + return error;
> }
>
> success:
> @@ -172,15 +116,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
> */
> vma->vm_flags = new_flags;
>
> -out_convert_errno:
> - /*
> - * madvise() returns EAGAIN if kernel resources, such as
> - * slab, are temporarily unavailable.
> - */
> - if (error == -ENOMEM)
> - error = -EAGAIN;
> -out:
> - return error;
> + return 0;
> }
>
> #ifdef CONFIG_SWAP
> @@ -930,6 +866,96 @@ static long madvise_remove(struct vm_area_struct *vma,
> return error;
> }
>
> +/*
> + * Apply an madvise behavior to a region of a vma. madvise_update_vma
> + * will handle splitting a vm area into separate areas, each area with its own
> + * behavior.
> + */
> +static int madvise_vma_behavior(struct vm_area_struct *vma,
> + struct vm_area_struct **prev,
> + unsigned long start, unsigned long end,
> + unsigned long behavior)
> +{
> + int error = 0;
> + unsigned long new_flags = vma->vm_flags;
> +
> + switch (behavior) {
> + case MADV_REMOVE:
> + return madvise_remove(vma, prev, start, end);
> + case MADV_WILLNEED:
> + return madvise_willneed(vma, prev, start, end);
> + case MADV_COLD:
> + return madvise_cold(vma, prev, start, end);
> + case MADV_PAGEOUT:
> + return madvise_pageout(vma, prev, start, end);
> + case MADV_FREE:
> + case MADV_DONTNEED:
> + return madvise_dontneed_free(vma, prev, start, end, behavior);
> + case MADV_POPULATE_READ:
> + case MADV_POPULATE_WRITE:
> + return madvise_populate(vma, prev, start, end, behavior);
> + case MADV_NORMAL:
> + new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
> + break;
> + case MADV_SEQUENTIAL:
> + new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ;
> + break;
> + case MADV_RANDOM:
> + new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ;
> + break;
> + case MADV_DONTFORK:
> + new_flags |= VM_DONTCOPY;
> + break;
> + case MADV_DOFORK:
> + if (vma->vm_flags & VM_IO) {
> + error = -EINVAL;
> + goto out;
> + }
> + new_flags &= ~VM_DONTCOPY;
> + break;
> + case MADV_WIPEONFORK:
> + /* MADV_WIPEONFORK is only supported on anonymous memory. */
> + if (vma->vm_file || vma->vm_flags & VM_SHARED) {
> + error = -EINVAL;
> + goto out;
> + }
> + new_flags |= VM_WIPEONFORK;
> + break;
> + case MADV_KEEPONFORK:
> + new_flags &= ~VM_WIPEONFORK;
> + break;
> + case MADV_DONTDUMP:
> + new_flags |= VM_DONTDUMP;
> + break;
> + case MADV_DODUMP:
> + if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) {
> + error = -EINVAL;
> + goto out;
> + }
> + new_flags &= ~VM_DONTDUMP;
> + break;
> + case MADV_MERGEABLE:
> + case MADV_UNMERGEABLE:
> + error = ksm_madvise(vma, start, end, behavior, &new_flags);
> + if (error)
> + goto out;
> + break;
> + case MADV_HUGEPAGE:
> + case MADV_NOHUGEPAGE:
> + error = hugepage_madvise(vma, &new_flags, behavior);
> + if (error)
> + goto out;
> + break;
> + }
> +
> + error = madvise_update_vma(vma, prev, start, end, new_flags);
> +
> +out:
> + if (error == -ENOMEM)
> + error = -EAGAIN;
> + return error;
> +}
> +
> #ifdef CONFIG_MEMORY_FAILURE
> /*
> * Error injection support for memory error handling.
> @@ -978,30 +1004,6 @@ static int madvise_inject_error(int behavior,
> }
> #endif
>
> -static long
> -madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> - unsigned long start, unsigned long end, int behavior)
> -{
> - switch (behavior) {
> - case MADV_REMOVE:
> - return madvise_remove(vma, prev, start, end);
> - case MADV_WILLNEED:
> - return madvise_willneed(vma, prev, start, end);
> - case MADV_COLD:
> - return madvise_cold(vma, prev, start, end);
> - case MADV_PAGEOUT:
> - return madvise_pageout(vma, prev, start, end);
> - case MADV_FREE:
> - case MADV_DONTNEED:
> - return madvise_dontneed_free(vma, prev, start, end, behavior);
> - case MADV_POPULATE_READ:
> - case MADV_POPULATE_WRITE:
> - return madvise_populate(vma, prev, start, end, behavior);
> - default:
> - return madvise_behavior(vma, prev, start, end, behavior);
> - }
> -}
> -
> static bool
> madvise_behavior_valid(int behavior)
> {
> @@ -1054,6 +1056,73 @@ process_madvise_behavior_valid(int behavior)
> }
> }
>
> +/*
> + * Walk the vmas in range [start,end), and call the visit function on each one.
> + * The visit function will get start and end parameters that cover the overlap
> + * between the current vma and the original range. Any unmapped regions in the
> + * original range will result in this function returning -ENOMEM while still
> + * calling the visit function on all of the existing vmas in the range.
> + * Must be called with the mmap_lock held for reading or writing.
> + */
> +static
> +int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
> + unsigned long end, unsigned long arg,
> + int (*visit)(struct vm_area_struct *vma,
> + struct vm_area_struct **prev, unsigned long start,
> + unsigned long end, unsigned long arg))
> +{
> + struct vm_area_struct *vma;
> + struct vm_area_struct *prev;
> + unsigned long tmp;
> + int unmapped_error = 0;
> +
> + /*
> + * If the interval [start,end) covers some unmapped address
> + * ranges, just ignore them, but return -ENOMEM at the end.
> + * - different from the way of handling in mlock etc.
> + */
> + vma = find_vma_prev(mm, start, &prev);
> + if (vma && start > vma->vm_start)
> + prev = vma;
> +
> + for (;;) {
> + int error;
> +
> + /* Still start < end. */
> + if (!vma)
> + return -ENOMEM;
> +
> + /* Here start < (end|vma->vm_end). */
> + if (start < vma->vm_start) {
> + unmapped_error = -ENOMEM;
> + start = vma->vm_start;
> + if (start >= end)
> + break;
> + }
> +
> + /* Here vma->vm_start <= start < (end|vma->vm_end) */
> + tmp = vma->vm_end;
> + if (end < tmp)
> + tmp = end;
> +
> + /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> + error = visit(vma, &prev, start, tmp, arg);
> + if (error)
> + return error;
> + start = tmp;
> + if (prev && start < prev->vm_end)
> + start = prev->vm_end;
> + if (start >= end)
> + break;
> + if (prev)
> + vma = prev->vm_next;
> + else /* madvise_remove dropped mmap_lock */
> + vma = find_vma(mm, start);
> + }
> +
> + return unmapped_error;
> +}
> +
> /*
> * The madvise(2) system call.
> *
> @@ -1126,9 +1195,7 @@ process_madvise_behavior_valid(int behavior)
> */
> int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
> {
> - unsigned long end, tmp;
> - struct vm_area_struct *vma, *prev;
> - int unmapped_error = 0;
> + unsigned long end;
> int error = -EINVAL;
> int write;
> size_t len;
> @@ -1168,51 +1235,9 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> mmap_read_lock(mm);
> }
>
> - /*
> - * If the interval [start,end) covers some unmapped address
> - * ranges, just ignore them, but return -ENOMEM at the end.
> - * - different from the way of handling in mlock etc.
> - */
> - vma = find_vma_prev(mm, start, &prev);
> - if (vma && start > vma->vm_start)
> - prev = vma;
> -
> blk_start_plug(&plug);
> - for (;;) {
> - /* Still start < end. */
> - error = -ENOMEM;
> - if (!vma)
> - goto out;
> -
> - /* Here start < (end|vma->vm_end). */
> - if (start < vma->vm_start) {
> - unmapped_error = -ENOMEM;
> - start = vma->vm_start;
> - if (start >= end)
> - goto out;
> - }
> -
> - /* Here vma->vm_start <= start < (end|vma->vm_end) */
> - tmp = vma->vm_end;
> - if (end < tmp)
> - tmp = end;
> -
> - /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> - error = madvise_vma(vma, &prev, start, tmp, behavior);
> - if (error)
> - goto out;
> - start = tmp;
> - if (prev && start < prev->vm_end)
> - start = prev->vm_end;
> - error = unmapped_error;
> - if (start >= end)
> - goto out;
> - if (prev)
> - vma = prev->vm_next;
> - else /* madvise_remove dropped mmap_lock */
> - vma = find_vma(mm, start);
> - }
> -out:
> + error = madvise_walk_vmas(mm, start, end, behavior,
> + madvise_vma_behavior);
> blk_finish_plug(&plug);
> if (write)
> mmap_write_unlock(mm);
> --
> 2.33.0.259.gc128427fd7-goog
>

--
Kees Cook
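
For reference, the point of the walker refactor quoted above is that additional range operations only need to supply a visitor callback. A minimal sketch of how a later caller might reuse madvise_walk_vmas() (the function names below are illustrative assumptions, not code from this series):

/*
 * Hypothetical visitor: reset the readahead hints on the part of each
 * vma that overlaps the requested range, reusing madvise_update_vma()
 * for the split/merge handling. The mmap_lock must be held for writing
 * because madvise_update_vma() may split or merge the vma.
 */
static int visit_reset_readahead(struct vm_area_struct *vma,
				 struct vm_area_struct **prev,
				 unsigned long start, unsigned long end,
				 unsigned long arg)
{
	unsigned long new_flags = vma->vm_flags & ~VM_RAND_READ & ~VM_SEQ_READ;

	return madvise_update_vma(vma, prev, start, end, new_flags);
}

/* Hypothetical caller: walks [start, end) with the visitor above. */
static int reset_readahead_range(struct mm_struct *mm, unsigned long start,
				 unsigned long end)
{
	return madvise_walk_vmas(mm, start, end, 0, visit_reset_readahead);
}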

2021-08-28 01:01:17

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v8 1/3] mm: rearrange madvise code to allow for reuse

On Fri, Aug 27, 2021 at 5:14 PM Kees Cook <[email protected]> wrote:
>
> On Fri, Aug 27, 2021 at 12:18:56PM -0700, Suren Baghdasaryan wrote:
> > From: Colin Cross <[email protected]>
> >
> > Refactor the madvise syscall to allow for parts of it to be reused by a
> > prctl syscall that affects vmas.
> >
> > Move the code that walks vmas in a virtual address range into a function
> > that takes a function pointer as a parameter. The only caller for now is
> > sys_madvise, which uses it to call madvise_vma_behavior on each vma, but
> > the next patch will add an additional caller.
> >
> > Move handling all vma behaviors inside madvise_behavior, and rename it to
> > madvise_vma_behavior.
> >
> > Move the code that updates the flags on a vma, including splitting or
> > merging the vma as necessary, into a new function called
> > madvise_update_vma. The next patch will add support for updating a new
> > anon_name field as well.
> >
> > Signed-off-by: Colin Cross <[email protected]>
> > Cc: Pekka Enberg <[email protected]>
> > Cc: Dave Hansen <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Oleg Nesterov <[email protected]>
> > Cc: "Eric W. Biederman" <[email protected]>
> > Cc: Jan Glauber <[email protected]>
> > Cc: John Stultz <[email protected]>
> > Cc: Rob Landley <[email protected]>
> > Cc: Cyrill Gorcunov <[email protected]>
> > Cc: Kees Cook <[email protected]>
> > Cc: "Serge E. Hallyn" <[email protected]>
> > Cc: David Rientjes <[email protected]>
> > Cc: Al Viro <[email protected]>
> > Cc: Hugh Dickins <[email protected]>
> > Cc: Rik van Riel <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Michel Lespinasse <[email protected]>
> > Cc: Tang Chen <[email protected]>
> > Cc: Robin Holt <[email protected]>
> > Cc: Shaohua Li <[email protected]>
> > Cc: Sasha Levin <[email protected]>
> > Cc: Johannes Weiner <[email protected]>
> > Cc: Minchan Kim <[email protected]>
> > Signed-off-by: Andrew Morton <[email protected]>
> > [sumits: rebased over v5.9-rc3]
> > Signed-off-by: Sumit Semwal <[email protected]>
> > [surenb: rebased over v5.14-rc7]
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
>
> Other folks have already reviewed this, and it does look okay to me,
> too, but I find it a bit hard to review. There are at least 3 things
> happening in this patch:
> - moving to the walker
> - merging two behavior routines
> - extracting flag setting from behavior checking
>
> It seems like those could be separate patches, but I'm probably overly
> picky. :)

Thank you for taking a look. I wanted to keep the patch as close to
the original as possible, but if more people find it hard to review
then I'll break it up into smaller pieces.
Thanks,
Suren.

>
> -Kees
>
> > ---
> > mm/madvise.c | 319 +++++++++++++++++++++++++++------------------------
> > 1 file changed, 172 insertions(+), 147 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 5c065bc8b5f6..359cd3fa612c 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -63,76 +63,20 @@ static int madvise_need_mmap_write(int behavior)
> > }
> >
> > /*
> > - * We can potentially split a vm area into separate
> > - * areas, each area with its own behavior.
> > + * Update the vm_flags on a region of a vma, splitting it or merging it as
> > + * necessary. Must be called with mmap_sem held for writing.
> > */
> > -static long madvise_behavior(struct vm_area_struct *vma,
> > - struct vm_area_struct **prev,
> > - unsigned long start, unsigned long end, int behavior)
> > +static int madvise_update_vma(struct vm_area_struct *vma,
> > + struct vm_area_struct **prev, unsigned long start,
> > + unsigned long end, unsigned long new_flags)
> > {
> > struct mm_struct *mm = vma->vm_mm;
> > - int error = 0;
> > + int error;
> > pgoff_t pgoff;
> > - unsigned long new_flags = vma->vm_flags;
> > -
> > - switch (behavior) {
> > - case MADV_NORMAL:
> > - new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
> > - break;
> > - case MADV_SEQUENTIAL:
> > - new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ;
> > - break;
> > - case MADV_RANDOM:
> > - new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ;
> > - break;
> > - case MADV_DONTFORK:
> > - new_flags |= VM_DONTCOPY;
> > - break;
> > - case MADV_DOFORK:
> > - if (vma->vm_flags & VM_IO) {
> > - error = -EINVAL;
> > - goto out;
> > - }
> > - new_flags &= ~VM_DONTCOPY;
> > - break;
> > - case MADV_WIPEONFORK:
> > - /* MADV_WIPEONFORK is only supported on anonymous memory. */
> > - if (vma->vm_file || vma->vm_flags & VM_SHARED) {
> > - error = -EINVAL;
> > - goto out;
> > - }
> > - new_flags |= VM_WIPEONFORK;
> > - break;
> > - case MADV_KEEPONFORK:
> > - new_flags &= ~VM_WIPEONFORK;
> > - break;
> > - case MADV_DONTDUMP:
> > - new_flags |= VM_DONTDUMP;
> > - break;
> > - case MADV_DODUMP:
> > - if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) {
> > - error = -EINVAL;
> > - goto out;
> > - }
> > - new_flags &= ~VM_DONTDUMP;
> > - break;
> > - case MADV_MERGEABLE:
> > - case MADV_UNMERGEABLE:
> > - error = ksm_madvise(vma, start, end, behavior, &new_flags);
> > - if (error)
> > - goto out_convert_errno;
> > - break;
> > - case MADV_HUGEPAGE:
> > - case MADV_NOHUGEPAGE:
> > - error = hugepage_madvise(vma, &new_flags, behavior);
> > - if (error)
> > - goto out_convert_errno;
> > - break;
> > - }
> >
> > if (new_flags == vma->vm_flags) {
> > *prev = vma;
> > - goto out;
> > + return 0;
> > }
> >
> > pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
> > @@ -149,21 +93,21 @@ static long madvise_behavior(struct vm_area_struct *vma,
> > if (start != vma->vm_start) {
> > if (unlikely(mm->map_count >= sysctl_max_map_count)) {
> > error = -ENOMEM;
> > - goto out;
> > + return error;
> > }
> > error = __split_vma(mm, vma, start, 1);
> > if (error)
> > - goto out_convert_errno;
> > + return error;
> > }
> >
> > if (end != vma->vm_end) {
> > if (unlikely(mm->map_count >= sysctl_max_map_count)) {
> > error = -ENOMEM;
> > - goto out;
> > + return error;
> > }
> > error = __split_vma(mm, vma, end, 0);
> > if (error)
> > - goto out_convert_errno;
> > + return error;
> > }
> >
> > success:
> > @@ -172,15 +116,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
> > */
> > vma->vm_flags = new_flags;
> >
> > -out_convert_errno:
> > - /*
> > - * madvise() returns EAGAIN if kernel resources, such as
> > - * slab, are temporarily unavailable.
> > - */
> > - if (error == -ENOMEM)
> > - error = -EAGAIN;
> > -out:
> > - return error;
> > + return 0;
> > }
> >
> > #ifdef CONFIG_SWAP
> > @@ -930,6 +866,96 @@ static long madvise_remove(struct vm_area_struct *vma,
> > return error;
> > }
> >
> > +/*
> > + * Apply an madvise behavior to a region of a vma. madvise_update_vma
> > + * will handle splitting a vm area into separate areas, each area with its own
> > + * behavior.
> > + */
> > +static int madvise_vma_behavior(struct vm_area_struct *vma,
> > + struct vm_area_struct **prev,
> > + unsigned long start, unsigned long end,
> > + unsigned long behavior)
> > +{
> > + int error = 0;
> > + unsigned long new_flags = vma->vm_flags;
> > +
> > + switch (behavior) {
> > + case MADV_REMOVE:
> > + return madvise_remove(vma, prev, start, end);
> > + case MADV_WILLNEED:
> > + return madvise_willneed(vma, prev, start, end);
> > + case MADV_COLD:
> > + return madvise_cold(vma, prev, start, end);
> > + case MADV_PAGEOUT:
> > + return madvise_pageout(vma, prev, start, end);
> > + case MADV_FREE:
> > + case MADV_DONTNEED:
> > + return madvise_dontneed_free(vma, prev, start, end, behavior);
> > + case MADV_POPULATE_READ:
> > + case MADV_POPULATE_WRITE:
> > + return madvise_populate(vma, prev, start, end, behavior);
> > + case MADV_NORMAL:
> > + new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
> > + break;
> > + case MADV_SEQUENTIAL:
> > + new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ;
> > + break;
> > + case MADV_RANDOM:
> > + new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ;
> > + break;
> > + case MADV_DONTFORK:
> > + new_flags |= VM_DONTCOPY;
> > + break;
> > + case MADV_DOFORK:
> > + if (vma->vm_flags & VM_IO) {
> > + error = -EINVAL;
> > + goto out;
> > + }
> > + new_flags &= ~VM_DONTCOPY;
> > + break;
> > + case MADV_WIPEONFORK:
> > + /* MADV_WIPEONFORK is only supported on anonymous memory. */
> > + if (vma->vm_file || vma->vm_flags & VM_SHARED) {
> > + error = -EINVAL;
> > + goto out;
> > + }
> > + new_flags |= VM_WIPEONFORK;
> > + break;
> > + case MADV_KEEPONFORK:
> > + new_flags &= ~VM_WIPEONFORK;
> > + break;
> > + case MADV_DONTDUMP:
> > + new_flags |= VM_DONTDUMP;
> > + break;
> > + case MADV_DODUMP:
> > + if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) {
> > + error = -EINVAL;
> > + goto out;
> > + }
> > + new_flags &= ~VM_DONTDUMP;
> > + break;
> > + case MADV_MERGEABLE:
> > + case MADV_UNMERGEABLE:
> > + error = ksm_madvise(vma, start, end, behavior, &new_flags);
> > + if (error)
> > + goto out;
> > + break;
> > + case MADV_HUGEPAGE:
> > + case MADV_NOHUGEPAGE:
> > + error = hugepage_madvise(vma, &new_flags, behavior);
> > + if (error)
> > + goto out;
> > + break;
> > + }
> > +
> > + error = madvise_update_vma(vma, prev, start, end, new_flags);
> > +
> > +out:
> > + if (error == -ENOMEM)
> > + error = -EAGAIN;
> > + return error;
> > +}
> > +
> > #ifdef CONFIG_MEMORY_FAILURE
> > /*
> > * Error injection support for memory error handling.
> > @@ -978,30 +1004,6 @@ static int madvise_inject_error(int behavior,
> > }
> > #endif
> >
> > -static long
> > -madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > - unsigned long start, unsigned long end, int behavior)
> > -{
> > - switch (behavior) {
> > - case MADV_REMOVE:
> > - return madvise_remove(vma, prev, start, end);
> > - case MADV_WILLNEED:
> > - return madvise_willneed(vma, prev, start, end);
> > - case MADV_COLD:
> > - return madvise_cold(vma, prev, start, end);
> > - case MADV_PAGEOUT:
> > - return madvise_pageout(vma, prev, start, end);
> > - case MADV_FREE:
> > - case MADV_DONTNEED:
> > - return madvise_dontneed_free(vma, prev, start, end, behavior);
> > - case MADV_POPULATE_READ:
> > - case MADV_POPULATE_WRITE:
> > - return madvise_populate(vma, prev, start, end, behavior);
> > - default:
> > - return madvise_behavior(vma, prev, start, end, behavior);
> > - }
> > -}
> > -
> > static bool
> > madvise_behavior_valid(int behavior)
> > {
> > @@ -1054,6 +1056,73 @@ process_madvise_behavior_valid(int behavior)
> > }
> > }
> >
> > +/*
> > + * Walk the vmas in range [start,end), and call the visit function on each one.
> > + * The visit function will get start and end parameters that cover the overlap
> > + * between the current vma and the original range. Any unmapped regions in the
> > + * original range will result in this function returning -ENOMEM while still
> > + * calling the visit function on all of the existing vmas in the range.
> > + * Must be called with the mmap_lock held for reading or writing.
> > + */
> > +static
> > +int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
> > + unsigned long end, unsigned long arg,
> > + int (*visit)(struct vm_area_struct *vma,
> > + struct vm_area_struct **prev, unsigned long start,
> > + unsigned long end, unsigned long arg))
> > +{
> > + struct vm_area_struct *vma;
> > + struct vm_area_struct *prev;
> > + unsigned long tmp;
> > + int unmapped_error = 0;
> > +
> > + /*
> > + * If the interval [start,end) covers some unmapped address
> > + * ranges, just ignore them, but return -ENOMEM at the end.
> > + * - different from the way of handling in mlock etc.
> > + */
> > + vma = find_vma_prev(mm, start, &prev);
> > + if (vma && start > vma->vm_start)
> > + prev = vma;
> > +
> > + for (;;) {
> > + int error;
> > +
> > + /* Still start < end. */
> > + if (!vma)
> > + return -ENOMEM;
> > +
> > + /* Here start < (end|vma->vm_end). */
> > + if (start < vma->vm_start) {
> > + unmapped_error = -ENOMEM;
> > + start = vma->vm_start;
> > + if (start >= end)
> > + break;
> > + }
> > +
> > + /* Here vma->vm_start <= start < (end|vma->vm_end) */
> > + tmp = vma->vm_end;
> > + if (end < tmp)
> > + tmp = end;
> > +
> > + /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> > + error = visit(vma, &prev, start, tmp, arg);
> > + if (error)
> > + return error;
> > + start = tmp;
> > + if (prev && start < prev->vm_end)
> > + start = prev->vm_end;
> > + if (start >= end)
> > + break;
> > + if (prev)
> > + vma = prev->vm_next;
> > + else /* madvise_remove dropped mmap_lock */
> > + vma = find_vma(mm, start);
> > + }
> > +
> > + return unmapped_error;
> > +}
> > +
> > /*
> > * The madvise(2) system call.
> > *
> > @@ -1126,9 +1195,7 @@ process_madvise_behavior_valid(int behavior)
> > */
> > int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
> > {
> > - unsigned long end, tmp;
> > - struct vm_area_struct *vma, *prev;
> > - int unmapped_error = 0;
> > + unsigned long end;
> > int error = -EINVAL;
> > int write;
> > size_t len;
> > @@ -1168,51 +1235,9 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> > mmap_read_lock(mm);
> > }
> >
> > - /*
> > - * If the interval [start,end) covers some unmapped address
> > - * ranges, just ignore them, but return -ENOMEM at the end.
> > - * - different from the way of handling in mlock etc.
> > - */
> > - vma = find_vma_prev(mm, start, &prev);
> > - if (vma && start > vma->vm_start)
> > - prev = vma;
> > -
> > blk_start_plug(&plug);
> > - for (;;) {
> > - /* Still start < end. */
> > - error = -ENOMEM;
> > - if (!vma)
> > - goto out;
> > -
> > - /* Here start < (end|vma->vm_end). */
> > - if (start < vma->vm_start) {
> > - unmapped_error = -ENOMEM;
> > - start = vma->vm_start;
> > - if (start >= end)
> > - goto out;
> > - }
> > -
> > - /* Here vma->vm_start <= start < (end|vma->vm_end) */
> > - tmp = vma->vm_end;
> > - if (end < tmp)
> > - tmp = end;
> > -
> > - /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> > - error = madvise_vma(vma, &prev, start, tmp, behavior);
> > - if (error)
> > - goto out;
> > - start = tmp;
> > - if (prev && start < prev->vm_end)
> > - start = prev->vm_end;
> > - error = unmapped_error;
> > - if (start >= end)
> > - goto out;
> > - if (prev)
> > - vma = prev->vm_next;
> > - else /* madvise_remove dropped mmap_lock */
> > - vma = find_vma(mm, start);
> > - }
> > -out:
> > + error = madvise_walk_vmas(mm, start, end, behavior,
> > + madvise_vma_behavior);
> > blk_finish_plug(&plug);
> > if (write)
> > mmap_write_unlock(mm);
> > --
> > 2.33.0.259.gc128427fd7-goog
> >
>
> --
> Kees Cook

2021-08-28 01:54:49

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v8 2/3] mm: add a field to store names for private anonymous memory

On Fri, Aug 27, 2021 at 12:18:57PM -0700, Suren Baghdasaryan wrote:
> + anon_name = vma_anon_name(vma);
> + if (anon_name) {
> + seq_pad(m, ' ');
> + seq_puts(m, "[anon:");
> + seq_write(m, anon_name, strlen(anon_name));
> + seq_putc(m, ']');
> + }

...

> + case PR_SET_VMA_ANON_NAME:
> + name = strndup_user((const char __user *)arg,
> + ANON_VMA_NAME_MAX_LEN);
> +
> + if (IS_ERR(name))
> + return PTR_ERR(name);
> +
> + for (pch = name; *pch != '\0'; pch++) {
> + if (!isprint(*pch)) {
> + kfree(name);
> + return -EINVAL;

I think isprint() is too weak a check. For example, I would suggest
forbidding the following characters: ':', ']', '[', ' '. Perhaps
isalnum() would be better? (permit a-zA-Z0-9) I wouldn't necessarily
be opposed to some punctuation characters, but let's avoid creating
confusion. Do you happen to know which characters are actually in use
today?

2021-08-28 05:30:45

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v8 3/3] mm: add anonymous vma name refcounting

On Fri, Aug 27, 2021 at 12:18:58PM -0700, Suren Baghdasaryan wrote:
> While forking a process with high number (64K) of named anonymous vmas the
> overhead caused by strdup() is noticeable. Experiments with ARM64 Android
> device show up to 40% performance regression when forking a process with
> 64k unpopulated anonymous vmas using the max name lengths vs the same
> process with the same number of anonymous vmas having no name.
> Introduce anon_vma_name refcounted structure to avoid the overhead of
> copying vma names during fork() and when splitting named anonymous vmas.
> When a vma is duplicated, instead of copying the name we increment the
> refcount of this structure. Multiple vmas can point to the same
> anon_vma_name as long as they increment the refcount. The name member of
> anon_vma_name structure is assigned at structure allocation time and is
> never changed. If vma name changes then the refcount of the original
> structure is dropped, a new anon_vma_name structure is allocated
> to hold the new name and the vma pointer is updated to point to the new
> structure.
> > With this approach the fork() performance regression is reduced 3-4x,
> > and with usecases using a more reasonable number of VMAs (a few
> > thousand) the regression is not measurable.

I like the refcounting; thank you!

Since patch2 adds a lot of things that are changed by patch3; maybe
combine them?

--
Kees Cook
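
A minimal sketch of the refcounting scheme described in the commit message above. Only an allocation helper is part of the posted patch; the struct layout and the helper bodies below are assumptions based on that description, not the actual code:

struct anon_vma_name {
	struct kref kref;
	/* Written once at allocation time, never modified afterwards. */
	char name[];
};

static void vma_anon_name_release(struct kref *kref)
{
	kfree(container_of(kref, struct anon_vma_name, kref));
}

/* fork()/split path: share the structure instead of strdup()ing the name. */
static void dup_vma_anon_name(struct vm_area_struct *orig_vma,
			      struct vm_area_struct *new_vma)
{
	struct anon_vma_name *anon_name = orig_vma->anon_name;

	if (anon_name) {
		kref_get(&anon_name->kref);
		new_vma->anon_name = anon_name;
	}
}

/* Clearing or replacing a name drops one reference. */
static void free_vma_anon_name(struct vm_area_struct *vma)
{
	if (vma->anon_name) {
		kref_put(&vma->anon_name->kref, vma_anon_name_release);
		vma->anon_name = NULL;
	}
}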

2021-08-28 06:00:42

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v8 2/3] mm: add a field to store names for private anonymous memory

On Sat, Aug 28, 2021 at 02:47:03AM +0100, Matthew Wilcox wrote:
> On Fri, Aug 27, 2021 at 12:18:57PM -0700, Suren Baghdasaryan wrote:
> > + anon_name = vma_anon_name(vma);
> > + if (anon_name) {
> > + seq_pad(m, ' ');
> > + seq_puts(m, "[anon:");
> > + seq_write(m, anon_name, strlen(anon_name));
> > + seq_putc(m, ']');
> > + }

Maybe after seq_pad, use: seq_printf(m, "[anon:%s]", anon_name);

>
> ...
>
> > + case PR_SET_VMA_ANON_NAME:
> > + name = strndup_user((const char __user *)arg,
> > + ANON_VMA_NAME_MAX_LEN);
> > +
> > + if (IS_ERR(name))
> > + return PTR_ERR(name);
> > +
> > + for (pch = name; *pch != '\0'; pch++) {
> > + if (!isprint(*pch)) {
> > + kfree(name);
> > + return -EINVAL;
>
> I think isprint() is too weak a check. For example, I would suggest
> forbidding the following characters: ':', ']', '[', ' '. Perhaps
> isalnum() would be better? (permit a-zA-Z0-9) I wouldn't necessarily
> be opposed to some punctuation characters, but let's avoid creating
> confusion. Do you happen to know which characters are actually in use
> today?

There's some sense in refusing [, ], and :, but removing " " seems
unhelpful for reasonable descriptors. As long as weird stuff is escaped,
I think it's fine. Any parser can just extract with m|\[anon:(.*)\]$|

For example, just escape it here instead of refusing to take it. Something
like:

name = strndup_user((const char __user *)arg,
ANON_VMA_NAME_MAX_LEN);
escaped = kasprintf(GFP_KERNEL, "%pE", name);
if (!escaped) {
kfree(name);
return -ENOMEM;
}
kfree(name);
name = escaped;

--
Kees Cook

2021-08-28 12:51:00

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH v8 0/3] Anonymous VMA naming patches

Hi!

> Documentation/filesystems/proc.rst | 2 +

Documentation for the setting part would be welcome, too.

Best regards,
Pavel
--
http://www.livejournal.com/~pavelmachek



2021-08-28 16:21:16

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH v8 1/3] mm: rearrange madvise code to allow for reuse

On Fri, Aug 27, 2021 at 12:18:56PM -0700, Suren Baghdasaryan wrote:
...
>
> +/*
> + * Apply an madvise behavior to a region of a vma. madvise_update_vma
> + * will handle splitting a vm area into separate areas, each area with its own
> + * behavior.
> + */
> +static int madvise_vma_behavior(struct vm_area_struct *vma,
> + struct vm_area_struct **prev,
> + unsigned long start, unsigned long end,
> + unsigned long behavior)
> +{
> + int error = 0;


Hi Suren! A nitpick -- this variable is never used with default value
so I think we could drop assignment here.
...
> + case MADV_DONTFORK:
> + new_flags |= VM_DONTCOPY;
> + break;
> + case MADV_DOFORK:
> + if (vma->vm_flags & VM_IO) {
> + error = -EINVAL;

We can exit early here, without jumping to the end of the function, right?

> + goto out;
> + }
> + new_flags &= ~VM_DONTCOPY;
> + break;
> + case MADV_WIPEONFORK:
> + /* MADV_WIPEONFORK is only supported on anonymous memory. */
> + if (vma->vm_file || vma->vm_flags & VM_SHARED) {
> + error = -EINVAL;

And here too.

> + goto out;
> + }
> + new_flags |= VM_WIPEONFORK;
> + break;
> + case MADV_KEEPONFORK:
> + new_flags &= ~VM_WIPEONFORK;
> + break;
> + case MADV_DONTDUMP:
> + new_flags |= VM_DONTDUMP;
> + break;
> + case MADV_DODUMP:
> + if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) {
> + error = -EINVAL;

Same.

> + goto out;
> + }
> + new_flags &= ~VM_DONTDUMP;
> + break;
> + case MADV_MERGEABLE:
> + case MADV_UNMERGEABLE:
> + error = ksm_madvise(vma, start, end, behavior, &new_flags);
> + if (error)
> + goto out;
> + break;
> + case MADV_HUGEPAGE:
> + case MADV_NOHUGEPAGE:
> + error = hugepage_madvise(vma, &new_flags, behavior);
> + if (error)
> + goto out;
> + break;
> + }
> +
> + error = madvise_update_vma(vma, prev, start, end, new_flags);
> +
> +out:

I suppose we better keep the former comment on why we map ENOMEM to EAGAIN?

Cyrill

2021-08-28 21:29:14

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v8 3/3] mm: add anonymous vma name refcounting

On Fri, Aug 27, 2021 at 10:28 PM Kees Cook <[email protected]> wrote:
>
> On Fri, Aug 27, 2021 at 12:18:58PM -0700, Suren Baghdasaryan wrote:
> > While forking a process with high number (64K) of named anonymous vmas the
> > overhead caused by strdup() is noticeable. Experiments with ARM64 Android
> > device show up to 40% performance regression when forking a process with
> > 64k unpopulated anonymous vmas using the max name lengths vs the same
> > process with the same number of anonymous vmas having no name.
> > Introduce anon_vma_name refcounted structure to avoid the overhead of
> > copying vma names during fork() and when splitting named anonymous vmas.
> > When a vma is duplicated, instead of copying the name we increment the
> > refcount of this structure. Multiple vmas can point to the same
> > anon_vma_name as long as they increment the refcount. The name member of
> > anon_vma_name structure is assigned at structure allocation time and is
> > never changed. If vma name changes then the refcount of the original
> > structure is dropped, a new anon_vma_name structure is allocated
> > to hold the new name and the vma pointer is updated to point to the new
> > structure.
> > With this approach the fork() performance regression is reduced 3-4x,
> > and with usecases using a more reasonable number of VMAs (a few
> > thousand) the regression is not measurable.
>
> I like the refcounting; thank you!
>
> Since patch2 adds a lot of things that are changed by patch3; maybe
> combine them?

I thought it would be easier to review with the main logic being
written using a basic type (string) first and then replacing the basic
type with a more complex refcounted structure. Also, if someone would
like to rerun the tests and measure the regression of the strdup vs
refcounting approaches, keeping this patch separate makes it easier to
set up these tests.
If that's not convenient I can absolutely squash them together.

>
> --
> Kees Cook

2021-08-28 21:30:18

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [PATCH v8 2/3] mm: add a field to store names for private anonymous memory

On Fri, Aug 27, 2021 at 12:18:57PM -0700, Suren Baghdasaryan wrote:
>
> The name is stored in a pointer in the shared union in vm_area_struct
> that points to a null terminated string. Anonymous vmas with the same
> name (equivalent strings) that are otherwise mergeable will be merged.
> The name pointers are not shared between vmas even if they contain the
> same name. The name pointer is stored in a union with fields that are
> only used on file-backed mappings, so it does not increase memory usage.
>
> The patch is based on the original patch developed by Colin Cross, more
> specifically on its latest version [1] posted upstream by Sumit Semwal.
> It used a userspace pointer to store vma names. In that design, name
> pointers could be shared between vmas. However during the last upstreaming
> attempt, Kees Cook raised concerns [2] about this approach and suggested
> to copy the name into kernel memory space, perform validity checks [3]
> and store as a string referenced from vm_area_struct.
> One big concern is about fork() performance which would need to strdup
> anonymous vma names. Dave Hansen suggested experimenting with worst-case
> scenario of forking a process with 64k vmas having longest possible names
> [4]. I ran this experiment on an ARM64 Android device and recorded a
> worst-case regression of almost 40% when forking such a process. This
> regression is addressed in the followup patch which replaces the pointer
> to a name with a refcounted structure that allows sharing the name pointer
> between vmas of the same name. Instead of duplicating the string during
> fork() or when splitting a vma it increments the refcount.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
> [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
> [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
> [4] https://lore.kernel.org/linux-mm/[email protected]/
...
> +
> +/* mmap_lock should be read-locked */
> +static inline bool is_same_vma_anon_name(struct vm_area_struct *vma,
> + const char *name)
> +{
> + const char *vma_name = vma_anon_name(vma);
> +
> + if (likely(!vma_name))
> + return name == NULL;
> +
> + return name && !strcmp(name, vma_name);
> +}

Hi Suren! There is a very important point about this new feature: if
we assign a name to some VMA it will no longer be mergeable even if a
neighbouring VMA matches all other attributes such as flags,
permissions and so on. I mean, our vma_merge() starts considering the
vma names, and a name mismatch potentially blocks merging that would
happen without this new feature. Is this known behaviour, or am I
missing something pretty obvious here?

2021-08-28 21:55:11

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v8 2/3] mm: add a field to store names for private anonymous memory

On Fri, Aug 27, 2021 at 10:52 PM Kees Cook <[email protected]> wrote:
>
> On Sat, Aug 28, 2021 at 02:47:03AM +0100, Matthew Wilcox wrote:
> > On Fri, Aug 27, 2021 at 12:18:57PM -0700, Suren Baghdasaryan wrote:
> > > + anon_name = vma_anon_name(vma);
> > > + if (anon_name) {
> > > + seq_pad(m, ' ');
> > > + seq_puts(m, "[anon:");
> > > + seq_write(m, anon_name, strlen(anon_name));
> > > + seq_putc(m, ']');
> > > + }
>
> Maybe after seq_pad, use: seq_printf(m, "[anon:%s]", anon_name);

Good idea. Will change.

>
> >
> > ...
> >
> > > + case PR_SET_VMA_ANON_NAME:
> > > + name = strndup_user((const char __user *)arg,
> > > + ANON_VMA_NAME_MAX_LEN);
> > > +
> > > + if (IS_ERR(name))
> > > + return PTR_ERR(name);
> > > +
> > > + for (pch = name; *pch != '\0'; pch++) {
> > > + if (!isprint(*pch)) {
> > > + kfree(name);
> > > + return -EINVAL;
> >
> > I think isprint() is too weak a check. For example, I would suggest
> > forbidding the following characters: ':', ']', '[', ' '. Perhaps
> > isalnum() would be better? (permit a-zA-Z0-9) I wouldn't necessarily
> > be opposed to some punctuation characters, but let's avoid creating
> > confusion. Do you happen to know which characters are actually in use
> > today?
>
> There's some sense in refusing [, ], and :, but removing " " seems
> unhelpful for reasonable descriptors. As long as weird stuff is escaped,
> I think it's fine. Any parser can just extract with m|\[anon:(.*)\]$|

I see no issue in forbidding '[' and ']' but whitespace and ':' are
currently used by Android. Would forbidding or escaping '[' and ']' be
enough?

>
> For example, just escape it here instead of refusing to take it. Something
> like:
>
> name = strndup_user((const char __user *)arg,
> ANON_VMA_NAME_MAX_LEN);
> escaped = kasprintf(GFP_KERNEL, "%pE", name);

Did you mean "%*pE" as in
https://www.kernel.org/doc/html/latest/core-api/printk-formats.html#raw-buffer-as-an-escaped-string
?

> if (!escaped) {
> kfree(name);
> return -ENOMEM;
> }
> kfree(name);
> name = escaped;
>
> --
> Kees Cook

2021-08-28 21:55:41

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v8 2/3] mm: add a field to store names for private anonymous memory

On Sat, Aug 28, 2021 at 2:28 PM Cyrill Gorcunov <[email protected]> wrote:
>
> On Fri, Aug 27, 2021 at 12:18:57PM -0700, Suren Baghdasaryan wrote:
> >
> > The name is stored in a pointer in the shared union in vm_area_struct
> > that points to a null terminated string. Anonymous vmas with the same
> > name (equivalent strings) that are otherwise mergeable will be merged.
> > The name pointers are not shared between vmas even if they contain the
> > same name. The name pointer is stored in a union with fields that are
> > only used on file-backed mappings, so it does not increase memory usage.
> >
> > The patch is based on the original patch developed by Colin Cross, more
> > specifically on its latest version [1] posted upstream by Sumit Semwal.
> > It used a userspace pointer to store vma names. In that design, name
> > pointers could be shared between vmas. However during the last upstreaming
> > attempt, Kees Cook raised concerns [2] about this approach and suggested
> > to copy the name into kernel memory space, perform validity checks [3]
> > and store as a string referenced from vm_area_struct.
> > One big concern is about fork() performance which would need to strdup
> > anonymous vma names. Dave Hansen suggested experimenting with worst-case
> > scenario of forking a process with 64k vmas having longest possible names
> > [4]. I ran this experiment on an ARM64 Android device and recorded a
> > worst-case regression of almost 40% when forking such a process. This
> > regression is addressed in the followup patch which replaces the pointer
> > to a name with a refcounted structure that allows sharing the name pointer
> > between vmas of the same name. Instead of duplicating the string during
> > fork() or when splitting a vma it increments the refcount.
> >
> > [1] https://lore.kernel.org/linux-mm/[email protected]/
> > [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
> > [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
> > [4] https://lore.kernel.org/linux-mm/[email protected]/
> ...
> > +
> > +/* mmap_lock should be read-locked */
> > +static inline bool is_same_vma_anon_name(struct vm_area_struct *vma,
> > + const char *name)
> > +{
> > + const char *vma_name = vma_anon_name(vma);
> > +
> > + if (likely(!vma_name))
> > + return name == NULL;
> > +
> > + return name && !strcmp(name, vma_name);
> > +}
>
> Hi Suren! There is a very important point about this new feature: if
> we assign a name to some VMA it will no longer be mergeable even if a
> neighbouring VMA matches all other attributes such as flags,
> permissions and so on. I mean, our vma_merge() starts considering the
> vma names, and a name mismatch potentially blocks merging that would
> happen without this new feature. Is this known behaviour, or am I
> missing something pretty obvious here?

Hi Cyrill,
Correct, this is a known drawback of naming an anonymous VMA. I think
I'll need to document this in the prctl(2) manpage, which I should update
to include this new PR_SET_VMA_ANON_NAME option.
Thanks for pointing it out!
Suren.
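
To make the consequence concrete, here is an illustrative userspace sequence (a sketch only; PR_SET_VMA and PR_SET_VMA_ANON_NAME come from the patched uapi headers, and error handling is omitted):

#include <sys/mman.h>
#include <sys/prctl.h>
#include <unistd.h>

/*
 * One two-page anonymous mapping, then a different name for each page.
 * Because the name is now part of the mergeability check, the region
 * ends up as two [anon:...] vmas in /proc/self/maps, where an unnamed
 * mapping would have stayed a single vma.
 */
static void name_two_pools(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, (unsigned long)p,
	      page, (unsigned long)"pool one");
	prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, (unsigned long)p + page,
	      page, (unsigned long)"pool two");
}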

2021-08-28 22:05:00

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v8 1/3] mm: rearrange madvise code to allow for reuse

On Sat, Aug 28, 2021 at 9:19 AM Cyrill Gorcunov <[email protected]> wrote:
>
> On Fri, Aug 27, 2021 at 12:18:56PM -0700, Suren Baghdasaryan wrote:
> ...
> >
> > +/*
> > + * Apply an madvise behavior to a region of a vma. madvise_update_vma
> > + * will handle splitting a vm area into separate areas, each area with its own
> > + * behavior.
> > + */
> > +static int madvise_vma_behavior(struct vm_area_struct *vma,
> > + struct vm_area_struct **prev,
> > + unsigned long start, unsigned long end,
> > + unsigned long behavior)
> > +{
> > + int error = 0;
>
>
> Hi Suren! A nitpick -- this variable is never used with default value
> so I think we could drop assignment here.
> ...
> > + case MADV_DONTFORK:
> > + new_flags |= VM_DONTCOPY;
> > + break;
> > + case MADV_DOFORK:
> > + if (vma->vm_flags & VM_IO) {
> > + error = -EINVAL;
>
> We can exit early here, without jumping to the end of the function, right?
>
> > + goto out;
> > + }
> > + new_flags &= ~VM_DONTCOPY;
> > + break;
> > + case MADV_WIPEONFORK:
> > + /* MADV_WIPEONFORK is only supported on anonymous memory. */
> > + if (vma->vm_file || vma->vm_flags & VM_SHARED) {
> > + error = -EINVAL;
>
> And here too.
>
> > + goto out;
> > + }
> > + new_flags |= VM_WIPEONFORK;
> > + break;
> > + case MADV_KEEPONFORK:
> > + new_flags &= ~VM_WIPEONFORK;
> > + break;
> > + case MADV_DONTDUMP:
> > + new_flags |= VM_DONTDUMP;
> > + break;
> > + case MADV_DODUMP:
> > + if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) {
> > + error = -EINVAL;
>
> Same.
>
> > + goto out;
> > + }
> > + new_flags &= ~VM_DONTDUMP;
> > + break;
> > + case MADV_MERGEABLE:
> > + case MADV_UNMERGEABLE:
> > + error = ksm_madvise(vma, start, end, behavior, &new_flags);
> > + if (error)
> > + goto out;
> > + break;
> > + case MADV_HUGEPAGE:
> > + case MADV_NOHUGEPAGE:
> > + error = hugepage_madvise(vma, &new_flags, behavior);
> > + if (error)
> > + goto out;
> > + break;
> > + }
> > +
> > + error = madvise_update_vma(vma, prev, start, end, new_flags);
> > +
> > +out:
>
> I suppose we better keep the former comment on why we map ENOMEM to EAGAIN?

Thanks for the review Cyrill! Proposed changes sound good to me. Will
change in the next revision.
Suren.

>
> Cyrill

2021-08-28 22:07:47

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v8 0/3] Anonymous VMA naming patches

On Sat, Aug 28, 2021 at 5:48 AM Pavel Machek <[email protected]> wrote:
>
> Hi!
>
> > Documentation/filesystems/proc.rst | 2 +
>
> Documentation for the setting part would be welcome, too.

Absolutely! Thanks for reminding me. I'll add a description of the new
PR_SET_VMA and PR_SET_VMA_ANON_NAME prctl(2) options to the second
patch of this series, which introduces them. After the patch is
finalized and accepted I'll also post a patch to update the prctl(2)
manpage.
Thanks,
Suren.


>
> Best regards,
> Pavel
> --
> http://www.livejournal.com/~pavelmachek

2021-08-30 07:05:33

by Rolf Eike Beer

[permalink] [raw]
Subject: Re: [PATCH v8 3/3] mm: add anonymous vma name refcounting

Am Freitag, 27. August 2021, 21:18:58 CEST schrieb Suren Baghdasaryan:
> While forking a process with high number (64K) of named anonymous vmas the
> overhead caused by strdup() is noticeable. Experiments with ARM64 Android
> device show up to 40% performance regression when forking a process with
> 64k unpopulated anonymous vmas using the max name lengths vs the same
> process with the same number of anonymous vmas having no name.
> Introduce anon_vma_name refcounted structure to avoid the overhead of
> copying vma names during fork() and when splitting named anonymous vmas.
> When a vma is duplicated, instead of copying the name we increment the
> refcount of this structure. Multiple vmas can point to the same
> anon_vma_name as long as they increment the refcount. The name member of
> anon_vma_name structure is assigned at structure allocation time and is
> never changed. If vma name changes then the refcount of the original
> structure is dropped, a new anon_vma_name structure is allocated
> to hold the new name and the vma pointer is updated to point to the new
> structure.
> With this approach the fork() performance regression is reduced 3-4x,
> and with usecases using a more reasonable number of VMAs (a few
> thousand) the regression is not measurable.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> include/linux/mm_types.h | 9 ++++++++-
> mm/madvise.c | 42 +++++++++++++++++++++++++++++++++-------
> 2 files changed, 43 insertions(+), 8 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index bc029f3fca6a..32ac5dc5ebf3 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -63,6 +63,27 @@ static int madvise_need_mmap_write(int behavior)
> }
> }
>
> +static struct anon_vma_name *anon_vma_name_alloc(const char *name)
> +{
> + struct anon_vma_name *anon_name;
> + size_t len = strlen(name);
> +
> + /* Add 1 for NUL terminator at the end of the anon_name->name */
> + anon_name = kzalloc(sizeof(*anon_name) + len + 1,
> + GFP_KERNEL);
> + kref_init(&anon_name->kref);
> + strcpy(anon_name->name, name);
> +
> + return anon_name;
> +}

Given that you overwrite everything in that struct anyway, this could be reduced
to kmalloc(), no? And it definitely needs a NULL check.
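
A sketch of what the allocation could look like with both points addressed (not the actual follow-up revision; the struct layout matches the quoted hunk):

static struct anon_vma_name *anon_vma_name_alloc(const char *name)
{
	struct anon_vma_name *anon_name;
	/* Copy the string including its NUL terminator. */
	size_t count = strlen(name) + 1;

	anon_name = kmalloc(struct_size(anon_name, name, count), GFP_KERNEL);
	if (!anon_name)
		return NULL;

	kref_init(&anon_name->kref);
	memcpy(anon_name->name, name, count);

	return anon_name;
}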

Eike
--
Rolf Eike Beer, emlix GmbH, http://www.emlix.com
Fon +49 551 30664-0, Fax +49 551 30664-11
Gothaer Platz 3, 37083 Göttingen, Germany
Sitz der Gesellschaft: Göttingen, Amtsgericht Göttingen HR B 3160
Geschäftsführung: Heike Jordan, Dr. Uwe Kracke – Ust-IdNr.: DE 205 198 055

emlix - smart embedded open source



2021-08-30 08:14:17

by Rasmus Villemoes

[permalink] [raw]
Subject: Re: [PATCH v8 2/3] mm: add a field to store names for private anonymous memory

On 28/08/2021 23.47, Suren Baghdasaryan wrote:
> On Fri, Aug 27, 2021 at 10:52 PM Kees Cook <[email protected]> wrote:
>>
>>>> + case PR_SET_VMA_ANON_NAME:
>>>> + name = strndup_user((const char __user *)arg,
>>>> + ANON_VMA_NAME_MAX_LEN);
>>>> +
>>>> + if (IS_ERR(name))
>>>> + return PTR_ERR(name);
>>>> +
>>>> + for (pch = name; *pch != '\0'; pch++) {
>>>> + if (!isprint(*pch)) {
>>>> + kfree(name);
>>>> + return -EINVAL;
>>>
>>> I think isprint() is too weak a check. For example, I would suggest
>>> forbidding the following characters: ':', ']', '[', ' '. Perhaps

Indeed. There's also the issue that the kernel's ctype actually
implements some almost-but-not-quite latin1, so (some) chars above 0x7f
would also pass isprint() - while everybody today expects utf-8, so the
ability to put almost arbitrary sequences of chars with the high bit set
could certainly confuse some parsers. IOW, don't use isprint() at all,
just explicitly check for the byte values that we end up agreeing to
allow/forbid.

>>> isalnum() would be better? (permit a-zA-Z0-9) I wouldn't necessarily
>>> be opposed to some punctuation characters, but let's avoid creating
>>> confusion. Do you happen to know which characters are actually in use
>>> today?
>>
>> There's some sense in refusing [, ], and :, but removing " " seems
>> unhelpful for reasonable descriptors. As long as weird stuff is escaped,
>> I think it's fine. Any parser can just extract with m|\[anon:(.*)\]$|
>
> I see no issue in forbidding '[' and ']' but whitespace and ':' are
> currently used by Android. Would forbidding or escaping '[' and ']' be
> enough?

how about allowing [0x20, 0x7e] except [0x5b, 0x5d], i.e. all printable
(including space) ascii characters, except [ \ ] - the brackets as
already discussed, and backslash because then there's nobody who can get
confused about whether there's some (and then which?) escaping mechanism
in play - "\n" is simply never going to appear. Simple rules, easy to
implement, easy to explain in a man page.

>>
>> For example, just escape it here instead of refusing to take it. Something
>> like:
>>
>> name = strndup_user((const char __user *)arg,
>> ANON_VMA_NAME_MAX_LEN);
>> escaped = kasprintf(GFP_KERNEL, "%pE", name);

I would not go down that road. First, it makes it much harder to explain
the rules for what are allowed and not allowed. Second, parsers become
much more complicated. Third, does the length limit then apply to the
escaped or unescaped string?

Rasmus

2021-08-30 16:15:04

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v8 3/3] mm: add anonymous vma name refcounting

On Mon, Aug 30, 2021 at 12:03 AM Rolf Eike Beer <[email protected]> wrote:
>
> Am Freitag, 27. August 2021, 21:18:58 CEST schrieb Suren Baghdasaryan:
> > While forking a process with high number (64K) of named anonymous vmas the
> > overhead caused by strdup() is noticeable. Experiments with ARM64 Android
> > device show up to 40% performance regression when forking a process with
> > 64k unpopulated anonymous vmas using the max name lengths vs the same
> > process with the same number of anonymous vmas having no name.
> > Introduce anon_vma_name refcounted structure to avoid the overhead of
> > copying vma names during fork() and when splitting named anonymous vmas.
> > When a vma is duplicated, instead of copying the name we increment the
> > refcount of this structure. Multiple vmas can point to the same
> > anon_vma_name as long as they increment the refcount. The name member of
> > anon_vma_name structure is assigned at structure allocation time and is
> > never changed. If vma name changes then the refcount of the original
> > structure is dropped, a new anon_vma_name structure is allocated
> > to hold the new name and the vma pointer is updated to point to the new
> > structure.
> > With this approach the fork() performance regression is reduced 3-4x,
> > and with usecases using a more reasonable number of VMAs (a few
> > thousand) the regression is not measurable.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > include/linux/mm_types.h | 9 ++++++++-
> > mm/madvise.c | 42 +++++++++++++++++++++++++++++++++-------
> > 2 files changed, 43 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index bc029f3fca6a..32ac5dc5ebf3 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -63,6 +63,27 @@ static int madvise_need_mmap_write(int behavior)
> > }
> > }
> >
> > +static struct anon_vma_name *anon_vma_name_alloc(const char *name)
> > +{
> > + struct anon_vma_name *anon_name;
> > + size_t len = strlen(name);
> > +
> > + /* Add 1 for NUL terminator at the end of the anon_name->name */
> > + anon_name = kzalloc(sizeof(*anon_name) + len + 1,
> > + GFP_KERNEL);
> > + kref_init(&anon_name->kref);
> > + strcpy(anon_name->name, name);
> > +
> > + return anon_name;
> > +}
>
> Given that you overwrite everything in that struct anyway, this could be reduced
> to kmalloc(), no? And it definitely needs a NULL check.

Ack. I'll address both points in the next revision.
Thanks!
Suren.

>
> Eike
> --
> Rolf Eike Beer, emlix GmbH, http://www.emlix.com
> Fon +49 551 30664-0, Fax +49 551 30664-11
> Gothaer Platz 3, 37083 Göttingen, Germany
> Sitz der Gesellschaft: Göttingen, Amtsgericht Göttingen HR B 3160
> Geschäftsführung: Heike Jordan, Dr. Uwe Kracke – Ust-IdNr.: DE 205 198 055
>
> emlix - smart embedded open source

2021-08-30 16:19:01

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v8 2/3] mm: add a field to store names for private anonymous memory

On Mon, Aug 30, 2021 at 1:12 AM Rasmus Villemoes
<[email protected]> wrote:
>
> On 28/08/2021 23.47, Suren Baghdasaryan wrote:
> > On Fri, Aug 27, 2021 at 10:52 PM Kees Cook <[email protected]> wrote:
> >>
> >>>> + case PR_SET_VMA_ANON_NAME:
> >>>> + name = strndup_user((const char __user *)arg,
> >>>> + ANON_VMA_NAME_MAX_LEN);
> >>>> +
> >>>> + if (IS_ERR(name))
> >>>> + return PTR_ERR(name);
> >>>> +
> >>>> + for (pch = name; *pch != '\0'; pch++) {
> >>>> + if (!isprint(*pch)) {
> >>>> + kfree(name);
> >>>> + return -EINVAL;
> >>>
> >>> I think isprint() is too weak a check. For example, I would suggest
> >>> forbidding the following characters: ':', ']', '[', ' '. Perhaps
>
> Indeed. There's also the issue that the kernel's ctype actually
> implements some almost-but-not-quite latin1, so (some) chars above 0x7f
> would also pass isprint() - while everybody today expects utf-8, so the
> ability to put almost arbitrary sequences of chars with the high bit set
> could certainly confuse some parsers. IOW, don't use isprint() at all,
> just explicitly check for the byte values that we end up agreeing to
> allow/forbid.
>
> >>> isalnum() would be better? (permit a-zA-Z0-9) I wouldn't necessarily
> >>> be opposed to some punctuation characters, but let's avoid creating
> >>> confusion. Do you happen to know which characters are actually in use
> >>> today?
> >>
> >> There's some sense in refusing [, ], and :, but removing " " seems
> >> unhelpful for reasonable descriptors. As long as weird stuff is escaped,
> >> I think it's fine. Any parser can just extract with m|\[anon:(.*)\]$|
> >
> > I see no issue in forbidding '[' and ']' but whitespace and ':' are
> > currently used by Android. Would forbidding or escaping '[' and ']' be
> > enough?
>
> how about allowing [0x20, 0x7e] except [0x5b, 0x5d], i.e. all printable
> (including space) ascii characters, except [ \ ] - the brackets as
> already discussed, and backslash because then there's nobody who can get
> confused about whether there's some (and then which?) escaping mechanism
> in play - "\n" is simply never going to appear. Simple rules, easy to
> implement, easy to explain in a man page.

Thanks for the suggestion, Rasmus. I'm all for keeping it simple.
Kees, Matthew, would that be acceptable?

>
> >>
> >> For example, just escape it here instead of refusing to take it. Something
> >> like:
> >>
> >> name = strndup_user((const char __user *)arg,
> >> ANON_VMA_NAME_MAX_LEN);
> >> escaped = kasprintf(GFP_KERNEL, "%pE", name);
>
> I would not go down that road. First, it makes it much harder to explain
> the rules for what are allowed and not allowed. Second, parsers become
> much more complicated. Third, does the length limit then apply to the
> escaped or unescaped string?
>
> Rasmus

2021-08-30 17:06:11

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v8 2/3] mm: add a field to store names for private anonymous memory

On Mon, Aug 30, 2021 at 09:16:14AM -0700, Suren Baghdasaryan wrote:
> On Mon, Aug 30, 2021 at 1:12 AM Rasmus Villemoes
> <[email protected]> wrote:
> >
> > On 28/08/2021 23.47, Suren Baghdasaryan wrote:
> > > On Fri, Aug 27, 2021 at 10:52 PM Kees Cook <[email protected]> wrote:
> > >>
> > >>>> + case PR_SET_VMA_ANON_NAME:
> > >>>> + name = strndup_user((const char __user *)arg,
> > >>>> + ANON_VMA_NAME_MAX_LEN);
> > >>>> +
> > >>>> + if (IS_ERR(name))
> > >>>> + return PTR_ERR(name);
> > >>>> +
> > >>>> + for (pch = name; *pch != '\0'; pch++) {
> > >>>> + if (!isprint(*pch)) {
> > >>>> + kfree(name);
> > >>>> + return -EINVAL;
> > >>>
> > >>> I think isprint() is too weak a check. For example, I would suggest
> > >>> forbidding the following characters: ':', ']', '[', ' '. Perhaps
> >
> > Indeed. There's also the issue that the kernel's ctype actually
> > implements some almost-but-not-quite latin1, so (some) chars above 0x7f
> > would also pass isprint() - while everybody today expects utf-8, so the
> > ability to put almost arbitrary sequences of chars with the high bit set
> > could certainly confuse some parsers. IOW, don't use isprint() at all,
> > just explicitly check for the byte values that we end up agreeing to
> > allow/forbid.
> >
> > >>> isalnum() would be better? (permit a-zA-Z0-9) I wouldn't necessarily
> > >>> be opposed to some punctuation characters, but let's avoid creating
> > >>> confusion. Do you happen to know which characters are actually in use
> > >>> today?
> > >>
> > >> There's some sense in refusing [, ], and :, but removing " " seems
> > >> unhelpful for reasonable descriptors. As long as weird stuff is escaped,
> > >> I think it's fine. Any parser can just extract with m|\[anon:(.*)\]$|
> > >
> > > I see no issue in forbidding '[' and ']' but whitespace and ':' are
> > > currently used by Android. Would forbidding or escaping '[' and ']' be
> > > enough?
> >
> > how about allowing [0x20, 0x7e] except [0x5b, 0x5d], i.e. all printable
> > (including space) ascii characters, except [ \ ] - the brackets as
> > already discussed, and backslash because then there's nobody who can get
> > confused about whether there's some (and then which?) escaping mechanism
> > in play - "\n" is simply never going to appear. Simple rules, easy to
> > implement, easy to explain in a man page.
>
> Thanks for the suggestion, Rasmus. I'm all for keeping it simple.
> Kees, Matthew, would that be acceptable?

Yes, I think so. It permits all kinds of characters that might
be confusing if passed on to something else, but we can't prohibit
everything, and forbidding just these three should remove any confusion
for any parser of /proc. Little Bobby Tables thanks you.
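
A minimal sketch of the rule settled on here, i.e. printable ASCII (0x20-0x7e) minus '[', '\' and ']'; the helper name and the loop are assumptions about how the prctl path might use it:

static bool is_valid_name_char(char ch)
{
	/* Printable ASCII, including space... */
	if (ch < 0x20 || ch > 0x7e)
		return false;
	/* ...except the three characters discussed above. */
	return ch != '[' && ch != '\\' && ch != ']';
}

	/* In the prctl handler, replacing the isprint() loop: */
	for (pch = name; *pch != '\0'; pch++) {
		if (!is_valid_name_char(*pch)) {
			kfree(name);
			return -EINVAL;
		}
	}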

2021-08-31 17:23:33

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v8 2/3] mm: add a field to store names for private anonymous memory

On Mon, Aug 30, 2021 at 9:59 AM Matthew Wilcox <[email protected]> wrote:
>
> On Mon, Aug 30, 2021 at 09:16:14AM -0700, Suren Baghdasaryan wrote:
> > On Mon, Aug 30, 2021 at 1:12 AM Rasmus Villemoes
> > <[email protected]> wrote:
> > >
> > > On 28/08/2021 23.47, Suren Baghdasaryan wrote:
> > > > On Fri, Aug 27, 2021 at 10:52 PM Kees Cook <[email protected]> wrote:
> > > >>
> > > >>>> + case PR_SET_VMA_ANON_NAME:
> > > >>>> + name = strndup_user((const char __user *)arg,
> > > >>>> + ANON_VMA_NAME_MAX_LEN);
> > > >>>> +
> > > >>>> + if (IS_ERR(name))
> > > >>>> + return PTR_ERR(name);
> > > >>>> +
> > > >>>> + for (pch = name; *pch != '\0'; pch++) {
> > > >>>> + if (!isprint(*pch)) {
> > > >>>> + kfree(name);
> > > >>>> + return -EINVAL;
> > > >>>
> > > >>> I think isprint() is too weak a check. For example, I would suggest
> > > >>> forbidding the following characters: ':', ']', '[', ' '. Perhaps
> > >
> > > Indeed. There's also the issue that the kernel's ctype actually
> > > implements some almost-but-not-quite latin1, so (some) chars above 0x7f
> > > would also pass isprint() - while everybody today expects utf-8, so the
> > > ability to put almost arbitrary sequences of chars with the high bit set
> > > could certainly confuse some parsers. IOW, don't use isprint() at all,
> > > just explicitly check for the byte values that we end up agreeing to
> > > allow/forbid.
> > >
> > > >>> isalnum() would be better? (permit a-zA-Z0-9) I wouldn't necessarily
> > > >>> be opposed to some punctuation characters, but let's avoid creating
> > > >>> confusion. Do you happen to know which characters are actually in use
> > > >>> today?
> > > >>
> > > >> There's some sense in refusing [, ], and :, but removing " " seems
> > > >> unhelpful for reasonable descriptors. As long as weird stuff is escaped,
> > > >> I think it's fine. Any parser can just extract with m|\[anon:(.*)\]$|
> > > >
> > > > I see no issue in forbidding '[' and ']' but whitespace and ':' are
> > > > currently used by Android. Would forbidding or escaping '[' and ']' be
> > > > enough?
> > >
> > > how about allowing [0x20, 0x7e] except [0x5b, 0x5d], i.e. all printable
> > > (including space) ascii characters, except [ \ ] - the brackets as
> > > already discussed, and backslash because then there's nobody who can get
> > > confused about whether there's some (and then which?) escaping mechanism
> > > in play - "\n" is simply never going to appear. Simple rules, easy to
> > > implement, easy to explain in a man page.
> >
> > Thanks for the suggestion, Rasmus. I'm all for keeping it simple.
> > Kees, Matthew, would that be acceptable?
>
> Yes, I think so. It permits all kinds of characters that might
> be confusing if passed on to something else, but we can't prohibit
> everything, and forbidding just these three should remove any confusion
> for any parser of /proc. Little Bobby Tables thanks you.

Thanks for all the feedback! I think I have enough change suggestions
to respin the next revision. Will send an update later today.

2021-09-01 08:12:17

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v8 2/3] mm: add a field to store names for private anonymous memory

On Fri 27-08-21 12:18:57, Suren Baghdasaryan wrote:
[...]
> +static void replace_vma_anon_name(struct vm_area_struct *vma, const char *name)
> +{
> + if (!name) {
> + free_vma_anon_name(vma);
> + return;
> + }
> +
> + if (vma->anon_name) {
> + /* Should never happen, to dup use dup_vma_anon_name() */
> + WARN_ON(vma->anon_name == name);

What is the point of this warning?

> +
> + /* Same name, nothing to do here */
> + if (!strcmp(name, vma->anon_name))
> + return;
> +
> + free_vma_anon_name(vma);
> + }
> + vma->anon_name = kstrdup(name, GFP_KERNEL);
> +}
--
Michal Hocko
SUSE Labs

2021-09-01 15:48:16

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v8 2/3] mm: add a field to store names for private anonymous memory

On Wed, Sep 1, 2021 at 1:10 AM 'Michal Hocko' via kernel-team
<[email protected]> wrote:
>
> On Fri 27-08-21 12:18:57, Suren Baghdasaryan wrote:
> [...]
> > +static void replace_vma_anon_name(struct vm_area_struct *vma, const char *name)
> > +{
> > + if (!name) {
> > + free_vma_anon_name(vma);
> > + return;
> > + }
> > +
> > + if (vma->anon_name) {
> > + /* Should never happen, to dup use dup_vma_anon_name() */
> > + WARN_ON(vma->anon_name == name);
>
> What is the point of this warning?

I wanted to make sure replace_vma_anon_name() is not used from inside
vm_area_dup() or some similar place (does not exist today but maybe in
the future) where "new" vma is a copy of "orig" vma and
new->anon_name==orig->anon_name. If someone by mistake calls
replace_vma_anon_name(new, orig->anon_name) and
new->anon_name==orig->anon_name then they will keep pointing to the
same name pointer, which breaks an assumption that ->anon_name
pointers are not shared among vmas even if the string is the same.
That would eventually lead to a use-after-free error. After the next
patch implementing refcounting, the same situation would leave both the
new and orig vmas pointing to the same anon_vma_name structure without
raising the refcount, which would also lead to a use-after-free error.
That's why the above comment asks to use dup_vma_anon_name() if this
warning ever fires.
I can remove the warning, but I thought the problem is subtle enough to
warrant some safeguards.

>
> > +
> > + /* Same name, nothing to do here */
> > + if (!strcmp(name, vma->anon_name))
> > + return;
> > +
> > + free_vma_anon_name(vma);
> > + }
> > + vma->anon_name = kstrdup(name, GFP_KERNEL);
> > +}
> --
> Michal Hocko
> SUSE Labs

2021-09-01 18:49:51

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v8 2/3] mm: add a field to store names for private anonymous memory

On Fri 27-08-21 12:18:57, Suren Baghdasaryan wrote:
[...]
> Userspace can set the name for a region of memory by calling
> prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
> Setting the name to NULL clears it.

Maybe I am missing this part but I do not see this being handled
anywhere.
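
For reference, the advertised usage from userspace would be roughly the
following - a sketch, not part of the patch; the fallback PR_* values
below are assumptions in case the system headers do not define them yet:

#include <stddef.h>
#include <sys/mman.h>
#include <sys/prctl.h>

#ifndef PR_SET_VMA
#define PR_SET_VMA		0x53564d41
#define PR_SET_VMA_ANON_NAME	0
#endif

int main(void)
{
	size_t len = 4096;
	void *start = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (start == MAP_FAILED)
		return 1;

	/* Name the mapping; the name then shows up next to it in
	 * /proc/<pid>/maps. */
	prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, (unsigned long)start, len,
	      (unsigned long)"test name");

	/* Passing NULL is documented above to clear the name again. */
	prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, (unsigned long)start, len,
	      (unsigned long)NULL);

	return 0;
}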

[...]
> @@ -3283,5 +3283,16 @@ static inline int seal_check_future_write(int seals, struct vm_area_struct *vma)
> return 0;
> }
>
> +#ifdef CONFIG_ADVISE_SYSCALLS
> +int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> + unsigned long len_in, const char *name);
> +#else
> +static inline int
> +madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> + unsigned long len_in, const char *name) {
> + return 0;
> +}
> +#endif

You want to make this depend on CONFIG_PROC_FS.

[...]
> +#ifdef CONFIG_MMU
> +
> +#define ANON_VMA_NAME_MAX_LEN 64
> +
> +static int prctl_set_vma(unsigned long opt, unsigned long addr,
> + unsigned long size, unsigned long arg)
> +{
> + struct mm_struct *mm = current->mm;
> + char *name, *pch;
> + int error;
> +
> + switch (opt) {
> + case PR_SET_VMA_ANON_NAME:
> + name = strndup_user((const char __user *)arg,
> + ANON_VMA_NAME_MAX_LEN);
> +
> + if (IS_ERR(name))
> + return PTR_ERR(name);

Unless I am missing something, a NULL name would lead to an error rather
than clearing the name as advertised above.

> +
> + for (pch = name; *pch != '\0'; pch++) {
> + if (!isprint(*pch)) {
> + kfree(name);
> + return -EINVAL;
> + }
> + }
> +
> + mmap_write_lock(mm);
> + error = madvise_set_anon_name(mm, addr, size, name);
> + mmap_write_unlock(mm);
> + kfree(name);
> + break;
> + default:
> + error = -EINVAL;
> + }
> +
> + return error;
--
Michal Hocko
SUSE Labs

2021-09-01 20:06:56

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v8 2/3] mm: add a field to store names for private anonymous memory

On Wed, Sep 1, 2021 at 1:09 AM 'Michal Hocko' via kernel-team
<[email protected]> wrote:
>
> On Fri 27-08-21 12:18:57, Suren Baghdasaryan wrote:
> [...]
> > Userspace can set the name for a region of memory by calling
> > prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
> > Setting the name to NULL clears it.
>
> Maybe I am missing this part but I do not see this being handled
> anywhere.

It's handled in replace_vma_anon_name(). When name == NULL we call
free_vma_anon_name(), which frees the string and resets the anon_name
pointer. Except that, as you noticed, the check after strndup_user()
prevents NULL from ever being passed here. I forgot to test this case
after the conversion to strndup_user() and missed this important point.
Thanks for pointing it out. Will fix and retest.
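
One possible shape of the fix - a sketch for discussion, not the actual
respin; only the PR_SET_VMA_ANON_NAME case is shown:

	case PR_SET_VMA_ANON_NAME:
		if (!arg) {
			/* NULL clears the name via replace_vma_anon_name(). */
			name = NULL;
		} else {
			name = strndup_user((const char __user *)arg,
					    ANON_VMA_NAME_MAX_LEN);
			if (IS_ERR(name))
				return PTR_ERR(name);

			for (pch = name; *pch != '\0'; pch++) {
				if (!isprint(*pch)) {
					kfree(name);
					return -EINVAL;
				}
			}
		}

		mmap_write_lock(mm);
		error = madvise_set_anon_name(mm, addr, size, name);
		mmap_write_unlock(mm);
		kfree(name);		/* kfree(NULL) is a no-op */
		break;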

>
> [...]
> > @@ -3283,5 +3283,16 @@ static inline int seal_check_future_write(int seals, struct vm_area_struct *vma)
> > return 0;
> > }
> >
> > +#ifdef CONFIG_ADVISE_SYSCALLS
> > +int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> > + unsigned long len_in, const char *name);
> > +#else
> > +static inline int
> > +madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> > + unsigned long len_in, const char *name) {
> > + return 0;
> > +}
> > +#endif
>
> You want to make this depend on CONFIG_PROC_FS.

Ack.
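
Sketching one way to read that suggestion (not the actual respin), the
declaration stub would be keyed off CONFIG_PROC_FS instead:

#ifdef CONFIG_PROC_FS
int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
			  unsigned long len_in, const char *name);
#else
static inline int
madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
		      unsigned long len_in, const char *name)
{
	return 0;
}
#endif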

>
> [...]
> > +#ifdef CONFIG_MMU
> > +
> > +#define ANON_VMA_NAME_MAX_LEN 64
> > +
> > +static int prctl_set_vma(unsigned long opt, unsigned long addr,
> > + unsigned long size, unsigned long arg)
> > +{
> > + struct mm_struct *mm = current->mm;
> > + char *name, *pch;
> > + int error;
> > +
> > + switch (opt) {
> > + case PR_SET_VMA_ANON_NAME:
> > + name = strndup_user((const char __user *)arg,
> > + ANON_VMA_NAME_MAX_LEN);
> > +
> > + if (IS_ERR(name))
> > + return PTR_ERR(name);
>
> Unless I am missing something, a NULL name would lead to an error rather
> than clearing the name as advertised above.

Correct, I missed that. Will fix.

>
> > +
> > + for (pch = name; *pch != '\0'; pch++) {
> > + if (!isprint(*pch)) {
> > + kfree(name);
> > + return -EINVAL;
> > + }
> > + }
> > +
> > + mmap_write_lock(mm);
> > + error = madvise_set_anon_name(mm, addr, size, name);
> > + mmap_write_unlock(mm);
> > + kfree(name);
> > + break;
> > + default:
> > + error = -EINVAL;
> > + }
> > +
> > + return error;
> --
> Michal Hocko
> SUSE Labs
>

2021-09-03 13:40:01

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v8 2/3] mm: add a field to store names for private anonymous memory

On Wed 01-09-21 08:42:29, Suren Baghdasaryan wrote:
> On Wed, Sep 1, 2021 at 1:10 AM 'Michal Hocko' via kernel-team
> <[email protected]> wrote:
> >
> > On Fri 27-08-21 12:18:57, Suren Baghdasaryan wrote:
> > [...]
> > > +static void replace_vma_anon_name(struct vm_area_struct *vma, const char *name)
> > > +{
> > > + if (!name) {
> > > + free_vma_anon_name(vma);
> > > + return;
> > > + }
> > > +
> > > + if (vma->anon_name) {
> > > + /* Should never happen, to dup use dup_vma_anon_name() */
> > > + WARN_ON(vma->anon_name == name);
> >
> > What is the point of this warning?
>
> I wanted to make sure replace_vma_anon_name() is not used from inside
> vm_area_dup() or some similar place (none exists today, but one might in
> the future) where the "new" vma is a copy of the "orig" vma and
> new->anon_name == orig->anon_name. If someone mistakenly calls
> replace_vma_anon_name(new, orig->anon_name) while
> new->anon_name == orig->anon_name, both vmas keep pointing to the same
> name, which breaks the assumption that ->anon_name pointers are never
> shared among vmas even when the strings are identical. That would
> eventually lead to a use-after-free. After the next patch, which
> implements refcounting, a similar situation would leave both the new and
> the orig vma pointing to the same anon_vma_name structure without
> raising the refcount, which would also lead to a use-after-free. That's
> why the comment above asks callers to use dup_vma_anon_name() if this
> warning ever fires.
> I can remove the warning, but I thought the problem was subtle enough to
> warrant some safeguards.

To me this sounds very much like debugging code that shouldn't make it
into the final patch to be merged. I do see your point about an early
diagnostic, but we are talking about internal MM code, which is not
really designed to be robust against its own failures, so I do not see
why this case should be special.
--
Michal Hocko
SUSE Labs

2021-09-03 15:48:53

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v8 2/3] mm: add a field to store names for private anonymous memory

On Fri, Sep 3, 2021 at 4:49 AM 'Michal Hocko' via kernel-team
<[email protected]> wrote:
>
> On Wed 01-09-21 08:42:29, Suren Baghdasaryan wrote:
> > On Wed, Sep 1, 2021 at 1:10 AM 'Michal Hocko' via kernel-team
> > <[email protected]> wrote:
> > >
> > > On Fri 27-08-21 12:18:57, Suren Baghdasaryan wrote:
> > > [...]
> > > > +static void replace_vma_anon_name(struct vm_area_struct *vma, const char *name)
> > > > +{
> > > > + if (!name) {
> > > > + free_vma_anon_name(vma);
> > > > + return;
> > > > + }
> > > > +
> > > > + if (vma->anon_name) {
> > > > + /* Should never happen, to dup use dup_vma_anon_name() */
> > > > + WARN_ON(vma->anon_name == name);
> > >
> > > What is the point of this warning?
> >
> > I wanted to make sure replace_vma_anon_name() is not used from inside
> > vm_area_dup() or some similar place (none exists today, but one might in
> > the future) where the "new" vma is a copy of the "orig" vma and
> > new->anon_name == orig->anon_name. If someone mistakenly calls
> > replace_vma_anon_name(new, orig->anon_name) while
> > new->anon_name == orig->anon_name, both vmas keep pointing to the same
> > name, which breaks the assumption that ->anon_name pointers are never
> > shared among vmas even when the strings are identical. That would
> > eventually lead to a use-after-free. After the next patch, which
> > implements refcounting, a similar situation would leave both the new and
> > the orig vma pointing to the same anon_vma_name structure without
> > raising the refcount, which would also lead to a use-after-free. That's
> > why the comment above asks callers to use dup_vma_anon_name() if this
> > warning ever fires.
> > I can remove the warning, but I thought the problem was subtle enough to
> > warrant some safeguards.
>
> To me this sounds very much like debugging code that shouldn't make it
> into the final patch to be merged. I do see your point about an early
> diagnostic, but we are talking about internal MM code, which is not
> really designed to be robust against its own failures, so I do not see
> why this case should be special.

Fair enough. I posted v9 yesterday but will respin another version in
a couple of days. Will remove the warning then.
Thanks,
Suren.

> --
> Michal Hocko
> SUSE Labs
>