2021-04-28 17:37:54

by Liam R. Howlett

Subject: [PATCH 00/94] Introducing the Maple Tree

The maple tree is an RCU-safe, range-based B-tree designed to use modern
processor caches efficiently. There are a number of places in the kernel
where a non-overlapping range-based tree would be beneficial, especially
one with a simple interface. The first user covered in this patch set is
the vm_area_struct, where three data structures are replaced by the
maple tree: the augmented rbtree, the vma cache, and the linked list of
VMAs in the mm_struct. The long-term goal is to reduce or remove the
mmap_sem contention.
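
For readers unfamiliar with the interface, the following is a minimal,
illustrative sketch of how the tree is driven in the patches below. It uses
only calls that appear in the diffs (MA_STATE(), mas_store(), mas_set(),
mtree_load(), mas_for_each()); the function name is made up, and locking,
preallocation and error handling are simplified.

        /* Sketch only, not taken verbatim from the patches. */
        static void maple_tree_example(struct mm_struct *mm,
                                       struct vm_area_struct *vma,
                                       unsigned long addr)
        {
                void *entry;
                MA_STATE(mas, &mm->mm_mt, 0, 0);

                /* Store a VMA over the range it covers (as dup_mmap() does below) */
                mas.index = vma->vm_start;
                mas.last = vma->vm_end - 1;
                mas_store(&mas, vma);

                /* Point lookup: which entry covers this address? (cf. vma_lookup()) */
                entry = mtree_load(&mm->mm_mt, addr);

                /* Iterate all entries without the linked list (cf. the conversions) */
                mas_set(&mas, 0);
                rcu_read_lock();
                mas_for_each(&mas, entry, ULONG_MAX)
                        pr_debug("vma %px\n", entry);
                rcu_read_unlock();
        }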

The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes. With the increased branching factor, it is significantly shorter than
the rbtree so it has fewer cache misses. The removal of the linked list
between subsequent entries also reduces the cache misses and the need to pull
in the previous and next VMA during many tree alterations.

This patch set is based on next-20210419

Link:
https://github.com/oracle/linux-uek/releases/tag/howlett%2Fmaple%2F20210419

Performance on a 144-core x86:

While still using the mmap_sem, performance on real-world workloads is
fairly similar; micro-benchmarks show more variation.

Increase in performance in the following micro-benchmarks in Hmean:
- wis brk1-threads: Disregard, this is useless.
- wis malloc1-threads: Increase of 15% to 500%
- wis page_fault1-threads: Increase of up to 15%

Decrease in performance in the following micro-benchmarks in Hmean:
- wis brk1-processes: Decrease of 45% due to RCU-required allocations
- wis signal1-threads: -3% to -20%

Mixed:
- wis malloc1-processes: +9% to -20%
- wis pthread_mutex1-threads: +9% to -16%
- wis page_fault3-threads: +7% to -21%


kernbench:
Amean user-2 880.66 ( 0.00%) 885.49 * -0.55%*
Amean syst-2 143.88 ( 0.00%) 152.64 * -6.09%*
Amean elsp-2 518.11 ( 0.00%) 524.97 * -1.32%*
Amean user-4 903.74 ( 0.00%) 908.41 * -0.52%*
Amean syst-4 149.61 ( 0.00%) 158.42 * -5.89%*
Amean elsp-4 270.00 ( 0.00%) 275.83 * -2.16%*
Amean user-8 955.25 ( 0.00%) 961.88 * -0.69%*
Amean syst-8 158.76 ( 0.00%) 169.04 * -6.48%*
Amean elsp-8 146.95 ( 0.00%) 148.79 * -1.25%*
Amean user-16 1033.15 ( 0.00%) 1037.42 * -0.41%*
Amean syst-16 168.01 ( 0.00%) 177.40 * -5.59%*
Amean elsp-16 84.63 ( 0.00%) 84.91 * -0.33%*
Amean user-32 1135.48 ( 0.00%) 1130.56 * 0.43%*
Amean syst-32 181.35 ( 0.00%) 190.54 * -5.06%*
Amean elsp-32 50.53 ( 0.00%) 51.88 * -2.67%*
Amean user-64 1191.24 ( 0.00%) 1205.41 * -1.19%*
Amean syst-64 189.13 ( 0.00%) 200.28 * -5.89%*
Amean elsp-64 31.75 ( 0.00%) 31.95 * -0.63%*
Amean user-128 1608.94 ( 0.00%) 1613.01 * -0.25%*
Amean syst-128 239.32 ( 0.00%) 253.50 * -5.93%*
Amean elsp-128 25.34 ( 0.00%) 26.02 * -2.69%*

gitcheckout:
Amean User 0.00 ( 0.00%) 0.00 * 0.00%*
Amean System 8.33 ( 0.00%) 7.86 * 5.59%*
Amean Elapsed 21.85 ( 0.00%) 21.34 * 2.29%*
Amean CPU 98.80 ( 0.00%) 99.00 * -0.20%*


Patch organization:
Patches 1-20 introduce a new VMA API: vma_lookup(). This looks up the VMA at a
given address and returns either the VMA or NULL. In the existing VMA code,
find_vma() is often used to look up a VMA and then check its limits (or the
caller incorrectly assumes that find_vma() does not continue searching
upwards). Initially, vma_lookup() is just a wrapper for
find_vma_intersection(), but it becomes more efficient once the maple tree is
used.
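
A sketch of the initial wrapper described above (illustrative only; later in
the series vma_lookup() reads straight from the maple tree via mtree_load(),
as visible in the diff context of patch 52 below):

        static inline
        struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
        {
                /* Only return a VMA that actually contains @addr. */
                return find_vma_intersection(mm, addr, addr + 1);
        }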

Patches 21-25 update the radix tree test suite to support what is necessary for
the maple tree.

Patch 26 adds the maple tree and its test code; 36,152 of its lines are in
the test_maple_tree.c file alone.

Patches 27-33 are the removal of the rbtree from the mm_struct.

Patches 34-44 are the removal of the vmacache from the kernel.

Patches 45-51 are optimizations for using the maple tree directly.

Patches 52-91 are the removal of the linked list from the mm_struct.

Patches 92-94 clean up the locking around the splitting of VMAs using the
maple tree state.


Liam R. Howlett (94):
mm: Add vma_lookup()
drm/i915/selftests: Use vma_lookup() in __igt_mmap()
arch/arc/kernel/troubleshoot: use vma_lookup() instead of
find_vma_intersection()
arch/arm64/kvm: Use vma_lookup() instead of find_vma_intersection()
arch/powerpc/kvm/book3s_hv_uvmem: Use vma_lookup() instead of
find_vma_intersection()
arch/powerpc/kvm/book3s: Use vma_lookup() in
kvmppc_hv_setup_htab_rma()
arch/mips/kernel/traps: Use vma_lookup() instead of
find_vma_intersection()
x86/sgx: Use vma_lookup() in sgx_encl_find()
virt/kvm: Use vma_lookup() instead of find_vma_intersection()
vfio: Use vma_lookup() instead of find_vma_intersection()
net/ipv4/tcp: Use vma_lookup() in tcp_zerocopy_receive()
drm/amdgpu: Use vma_lookup() in amdgpu_ttm_tt_get_user_pages()
media: videobuf2: Use vma_lookup() in get_vaddr_frames()
misc/sgi-gru/grufault: Use vma_lookup() in gru_find_vma()
kernel/events/uprobes: Use vma_lookup() in find_active_uprobe()
lib/test_hmm: Use vma_lookup() in dmirror_migrate()
mm/ksm: Use vma_lookup() in find_mergeable_vma()
mm/migrate: Use vma_lookup() in do_pages_stat_array()
mm/mremap: Use vma_lookup() in vma_to_resize()
mm/memory.c: Use vma_lookup() instead of find_vma_intersection()
radix tree test suite: Enhancements for Maple Tree
radix tree test suite: Add support for fallthrough attribute
radix tree test suite: Add support for kmem_cache_free_bulk
radix tree test suite: Add kmem_cache_alloc_bulk() support
radix tree test suite: Add __must_be_array() support
Maple Tree: Add new data structure
mm: Start tracking VMAs with maple tree
mm/mmap: Introduce unlock_range() for code cleanup
mm/mmap: Change find_vma() to use the maple tree
mm/mmap: Change find_vma_prev() to use maple tree
mm/mmap: Change unmapped_area and unmapped_area_topdown to use maple
tree
kernel/fork: Convert dup_mmap to use maple tree
mm: Remove rb tree.
arch/m68k/kernel/sys_m68k: Use vma_lookup() in sys_cacheflush()
xen/privcmd: Optimize privcmd_ioctl_mmap() by using vma_lookup()
mm: Optimize find_exact_vma() to use vma_lookup()
mm/khugepaged: Optimize collapse_pte_mapped_thp() by using
vma_lookup()
mm/gup: Add mm_populate_vma() for use when the vma is known
mm/mmap: Change do_brk_flags() to expand existing VMA and add
do_brk_munmap()
mm/mmap: Change vm_brk_flags() to use mm_populate_vma()
mm: Change find_vma_intersection() to maple tree and make find_vma()
to inline.
mm/mmap: Change mmap_region() to use maple tree state
mm/mmap: Drop munmap_vma_range()
mm: Remove vmacache
mm/mmap: Change __do_munmap() to avoid unnecessary lookups.
mm/mmap: Move mmap_region() below do_munmap()
mm/mmap: Add do_mas_munmap() and wrapper for __do_munmap()
mmap: Use find_vma_intersection in do_mmap() for overlap
mmap: Remove __do_munmap() in favour of do_mas_munmap()
mm/mmap: Change do_brk_munmap() to use do_mas_align_munmap()
mmap: make remove_vma_list() inline
mm: Introduce vma_next() and vma_prev()
arch/arm64: Remove mmap linked list from vdso.
arch/parisc: Remove mmap linked list from kernel/cache
arch/powerpc: Remove mmap linked list from mm/book3s32/tlb
arch/powerpc: Remove mmap linked list from mm/book3s64/subpage_prot
arch/s390: Use maple tree iterators instead of linked list.
arch/x86: Use maple tree iterators for vdso/vma
arch/xtensa: Use maple tree iterators for unmapped area
drivers/misc/cxl: Use maple tree iterators for cxl_prefault_vma()
drivers/tee/optee: Use maple tree iterators for __check_mem_type()
fs/binfmt_elf: Use maple tree iterators for fill_files_note()
fs/coredump: Use maple tree iterators in place of linked list
fs/exec: Use vma_next() instead of linked list
fs/proc/base: Use maple tree iterators in place of linked list
fs/proc/task_mmu: Stop using linked list and highest_vm_end
fs/userfaultfd: Stop using vma linked list.
ipc/shm: Stop using the vma linked list
kernel/acct: Use maple tree iterators instead of linked list
kernel/events/core: Use maple tree iterators instead of linked list
kernel/events/uprobes: Use maple tree iterators instead of linked list
kernel/sched/fair: Use maple tree iterators instead of linked list
kernel/sys: Use maple tree iterators instead of linked list
mm/gup: Use maple tree navigation instead of linked list
mm/huge_memory: Use vma_next() instead of vma linked list
mm/khugepaged: Use maple tree iterators instead of vma linked list
mm/ksm: Use maple tree iterators instead of vma linked list
mm/madvise: Use vma_next instead of vma linked list
mm/memcontrol: Stop using mm->highest_vm_end
mm/mempolicy: Use maple tree iterators instead of vma linked list
mm/mlock: Use maple tree iterators instead of vma linked list
mm/mprotect: Use maple tree navigation instead of vma linked list
mm/mremap: Use vma_next() instead of vma linked list
mm/msync: Use vma_next() instead of vma linked list
mm/oom_kill: Use maple tree iterators instead of vma linked list
mm/pagewalk: Use vma_next() instead of vma linked list
mm/swapfile: Use maple tree iterator instead of vma linked list
mm/util: Remove __vma_link_list() and __vma_unlink_list()
arch/um/kernel/tlb: Stop using linked list
bpf: Remove VMA linked list
mm: Remove vma linked list.
mm: Return a bool from anon_vma_interval_tree_verify()
mm/mmap: Add mas_split_vma() and use it for munmap()
mm: Move mas locking outside of munmap() path.

Documentation/core-api/index.rst | 1 +
Documentation/core-api/maple-tree.rst | 36 +
MAINTAINERS | 12 +
arch/arc/kernel/troubleshoot.c | 8 +-
arch/arm64/kernel/vdso.c | 5 +-
arch/arm64/kvm/mmu.c | 2 +-
arch/m68k/kernel/sys_m68k.c | 4 +-
arch/mips/kernel/traps.c | 4 +-
arch/parisc/kernel/cache.c | 15 +-
arch/powerpc/kvm/book3s_hv.c | 4 +-
arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
arch/powerpc/mm/book3s32/tlb.c | 5 +-
arch/powerpc/mm/book3s64/subpage_prot.c | 15 +-
arch/s390/configs/debug_defconfig | 1 -
arch/s390/mm/gmap.c | 8 +-
arch/um/kernel/tlb.c | 16 +-
arch/x86/entry/vdso/vma.c | 12 +-
arch/x86/kernel/cpu/sgx/encl.h | 4 +-
arch/x86/kernel/tboot.c | 2 +-
arch/xtensa/kernel/syscall.c | 4 +-
drivers/firmware/efi/efi.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 4 +-
.../drm/i915/gem/selftests/i915_gem_mman.c | 2 +-
drivers/media/common/videobuf2/frame_vector.c | 2 +-
drivers/misc/cxl/fault.c | 6 +-
drivers/misc/sgi-gru/grufault.c | 4 +-
drivers/tee/optee/call.c | 15 +-
drivers/vfio/vfio_iommu_type1.c | 2 +-
drivers/xen/privcmd.c | 2 +-
fs/binfmt_elf.c | 5 +-
fs/coredump.c | 13 +-
fs/exec.c | 9 +-
fs/proc/base.c | 7 +-
fs/proc/task_mmu.c | 45 +-
fs/proc/task_nommu.c | 55 +-
fs/userfaultfd.c | 43 +-
include/linux/maple_tree.h | 449 +
include/linux/mm.h | 62 +-
include/linux/mm_types.h | 33 +-
include/linux/mm_types_task.h | 5 -
include/linux/sched.h | 1 -
include/linux/sched/mm.h | 3 +
include/linux/vm_event_item.h | 4 -
include/linux/vmacache.h | 28 -
include/linux/vmstat.h | 6 -
include/trace/events/maple_tree.h | 227 +
include/trace/events/mmap.h | 71 +
init/main.c | 2 +
ipc/shm.c | 13 +-
kernel/acct.c | 8 +-
kernel/bpf/task_iter.c | 6 +-
kernel/debug/debug_core.c | 12 -
kernel/events/core.c | 7 +-
kernel/events/uprobes.c | 29 +-
kernel/fork.c | 50 +-
kernel/sched/fair.c | 14 +-
kernel/sys.c | 6 +-
lib/Kconfig.debug | 15 +-
lib/Makefile | 3 +-
lib/maple_tree.c | 6393 +++
lib/test_hmm.c | 5 +-
lib/test_maple_tree.c | 36152 ++++++++++++++++
mm/Makefile | 2 +-
mm/debug.c | 12 +-
mm/gup.c | 27 +-
mm/huge_memory.c | 4 +-
mm/init-mm.c | 4 +-
mm/internal.h | 51 +-
mm/interval_tree.c | 6 +-
mm/khugepaged.c | 13 +-
mm/ksm.c | 32 +-
mm/madvise.c | 2 +-
mm/memcontrol.c | 6 +-
mm/memory.c | 45 +-
mm/mempolicy.c | 44 +-
mm/migrate.c | 4 +-
mm/mlock.c | 26 +-
mm/mmap.c | 2271 +-
mm/mprotect.c | 7 +-
mm/mremap.c | 17 +-
mm/msync.c | 2 +-
mm/nommu.c | 135 +-
mm/oom_kill.c | 5 +-
mm/pagewalk.c | 2 +-
mm/swapfile.c | 9 +-
mm/util.c | 32 -
mm/vmacache.c | 117 -
mm/vmstat.c | 4 -
net/ipv4/tcp.c | 4 +-
tools/testing/radix-tree/.gitignore | 2 +
tools/testing/radix-tree/Makefile | 13 +-
tools/testing/radix-tree/generated/autoconf.h | 1 +
tools/testing/radix-tree/linux.c | 78 +-
tools/testing/radix-tree/linux/kernel.h | 10 +
tools/testing/radix-tree/linux/maple_tree.h | 7 +
tools/testing/radix-tree/linux/slab.h | 2 +
tools/testing/radix-tree/maple.c | 59 +
tools/testing/radix-tree/test.h | 1 +
.../radix-tree/trace/events/maple_tree.h | 8 +
virt/kvm/kvm_main.c | 2 +-
100 files changed, 45347 insertions(+), 1699 deletions(-)
create mode 100644 Documentation/core-api/maple-tree.rst
create mode 100644 include/linux/maple_tree.h
delete mode 100644 include/linux/vmacache.h
create mode 100644 include/trace/events/maple_tree.h
create mode 100644 lib/maple_tree.c
create mode 100644 lib/test_maple_tree.c
delete mode 100644 mm/vmacache.c
create mode 100644 tools/testing/radix-tree/linux/maple_tree.h
create mode 100644 tools/testing/radix-tree/maple.c
create mode 100644 tools/testing/radix-tree/trace/events/maple_tree.h

--
2.30.2


2021-04-28 17:37:56

by Liam R. Howlett

Subject: [PATCH 06/94] arch/powerpc/kvm/book3s: Use vma_lookup() in kvmppc_hv_setup_htab_rma()

Using vma_lookup() removes the requirement to check if the address is
within the returned vma.

Signed-off-by: Liam R. Howlett <[email protected]>
---
arch/powerpc/kvm/book3s_hv.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 4a532410e128..89a942e652e5 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4759,8 +4759,8 @@ static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu)
/* Look up the VMA for the start of this memory slot */
hva = memslot->userspace_addr;
mmap_read_lock(kvm->mm);
- vma = find_vma(kvm->mm, hva);
- if (!vma || vma->vm_start > hva || (vma->vm_flags & VM_IO))
+ vma = vma_lookup(kvm->mm, hva);
+ if (!vma || (vma->vm_flags & VM_IO))
goto up_out;

psize = vma_kernel_pagesize(vma);
--
2.30.2

2021-04-28 17:37:57

by Liam R. Howlett

Subject: [PATCH 25/94] radix tree test suite: Add __must_be_array() support

Signed-off-by: Liam R. Howlett <[email protected]>
---
tools/testing/radix-tree/linux/kernel.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/radix-tree/linux/kernel.h
index 99979aeaa379..e44603a181da 100644
--- a/tools/testing/radix-tree/linux/kernel.h
+++ b/tools/testing/radix-tree/linux/kernel.h
@@ -31,4 +31,6 @@
# define fallthrough do {} while (0) /* fallthrough */
#endif /* __has_attribute */

+#define __must_be_array(a) BUILD_BUG_ON_ZERO(__same_type((a), &(a)[0]))
+
#endif /* _KERNEL_H */
--
2.30.2

2021-04-28 17:37:58

by Liam R. Howlett

Subject: [PATCH 05/94] arch/powerpc/kvm/book3s_hv_uvmem: Use vma_lookup() instead of find_vma_intersection()

Use the new vma_lookup() call for abstraction & code readability.

Signed-off-by: Liam R. Howlett <[email protected]>
---
arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 84e5a2dc8be5..34720b79588f 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -614,7 +614,7 @@ void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,

/* Fetch the VMA if addr is not in the latest fetched one */
if (!vma || addr >= vma->vm_end) {
- vma = find_vma_intersection(kvm->mm, addr, addr+1);
+ vma = vma_lookup(kvm->mm, addr);
if (!vma) {
pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
break;
--
2.30.2

2021-04-28 17:37:59

by Liam R. Howlett

Subject: [PATCH 24/94] radix tree test suite: Add kmem_cache_alloc_bulk() support

Signed-off-by: Liam R. Howlett <[email protected]>
---
tools/testing/radix-tree/linux.c | 51 +++++++++++++++++++++++++++
tools/testing/radix-tree/linux/slab.h | 1 +
2 files changed, 52 insertions(+)

diff --git a/tools/testing/radix-tree/linux.c b/tools/testing/radix-tree/linux.c
index 380bbc0a48d6..fb19a40ebb46 100644
--- a/tools/testing/radix-tree/linux.c
+++ b/tools/testing/radix-tree/linux.c
@@ -99,6 +99,57 @@ void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list)
for (int i = 0; i < size; i++)
kmem_cache_free(cachep, list[i]);
}
+int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
+ void **p)
+{
+ size_t i;
+
+ if (kmalloc_verbose)
+ printk("Bulk alloc %lu\n", size);
+
+ if (!(gfp & __GFP_DIRECT_RECLAIM) && cachep->non_kernel < size)
+ return 0;
+
+ if (!(gfp & __GFP_DIRECT_RECLAIM))
+ cachep->non_kernel -= size;
+
+ pthread_mutex_lock(&cachep->lock);
+ if (cachep->nr_objs >= size) {
+ struct radix_tree_node *node = cachep->objs;
+
+ for (i = 0; i < size; i++) {
+ cachep->nr_objs--;
+ cachep->objs = node->parent;
+ p[i] = cachep->objs;
+ }
+ pthread_mutex_unlock(&cachep->lock);
+ node->parent = NULL;
+ } else {
+ pthread_mutex_unlock(&cachep->lock);
+ for (i = 0; i < size; i++) {
+ if (cachep->align) {
+ posix_memalign(&p[i], cachep->align,
+ cachep->size * size);
+ } else {
+ p[i] = malloc(cachep->size * size);
+ }
+ if (cachep->ctor)
+ cachep->ctor(p[i]);
+ else if (gfp & __GFP_ZERO)
+ memset(p[i], 0, cachep->size);
+ }
+ }
+
+ for (i = 0; i < size; i++) {
+ uatomic_inc(&nr_allocated);
+ uatomic_inc(&nr_tallocated);
+ if (kmalloc_verbose)
+ printf("Allocating %p from slab\n", p[i]);
+ }
+
+ return size;
+}
+

void *kmalloc(size_t size, gfp_t gfp)
{
diff --git a/tools/testing/radix-tree/linux/slab.h b/tools/testing/radix-tree/linux/slab.h
index 53b79c15b3a2..ba42b8cc11d0 100644
--- a/tools/testing/radix-tree/linux/slab.h
+++ b/tools/testing/radix-tree/linux/slab.h
@@ -25,4 +25,5 @@ struct kmem_cache *kmem_cache_create(const char *name, unsigned int size,
void (*ctor)(void *));

void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t, void **);
+int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t, size_t, void **);
#endif /* SLAB_H */
--
2.30.2

2021-04-28 17:38:00

by Liam R. Howlett

Subject: [PATCH 28/94] mm/mmap: Introduce unlock_range() for code cleanup

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mmap.c | 37 ++++++++++++++++++-------------------
1 file changed, 18 insertions(+), 19 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index bce25db96fd1..112be171b662 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2992,6 +2992,20 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
return __split_vma(mm, vma, addr, new_below);
}

+static inline void unlock_range(struct vm_area_struct *start, unsigned long limit)
+{
+ struct mm_struct *mm = start->vm_mm;
+ struct vm_area_struct *tmp = start;
+
+ while (tmp && tmp->vm_start < limit) {
+ if (tmp->vm_flags & VM_LOCKED) {
+ mm->locked_vm -= vma_pages(tmp);
+ munlock_vma_pages_all(tmp);
+ }
+
+ tmp = tmp->vm_next;
+ }
+}
/* Munmap is split into 2 main parts -- this part which finds
* what needs doing, and the areas themselves, which do the
* work. This now handles partial unmappings.
@@ -3080,17 +3094,8 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
/*
* unlock any mlock()ed ranges before detaching vmas
*/
- if (mm->locked_vm) {
- struct vm_area_struct *tmp = vma;
- while (tmp && tmp->vm_start < end) {
- if (tmp->vm_flags & VM_LOCKED) {
- mm->locked_vm -= vma_pages(tmp);
- munlock_vma_pages_all(tmp);
- }
-
- tmp = tmp->vm_next;
- }
- }
+ if (mm->locked_vm)
+ unlock_range(vma, end);

/* Detach vmas from rbtree */
if (!detach_vmas_to_be_unmapped(mm, vma, prev, end))
@@ -3377,14 +3382,8 @@ void exit_mmap(struct mm_struct *mm)
mmap_write_unlock(mm);
}

- if (mm->locked_vm) {
- vma = mm->mmap;
- while (vma) {
- if (vma->vm_flags & VM_LOCKED)
- munlock_vma_pages_all(vma);
- vma = vma->vm_next;
- }
- }
+ if (mm->locked_vm)
+ unlock_range(mm->mmap, ULONG_MAX);

arch_exit_mmap(mm);

--
2.30.2

2021-04-28 17:38:02

by Liam R. Howlett

Subject: [PATCH 23/94] radix tree test suite: Add support for kmem_cache_free_bulk

Signed-off-by: Liam R. Howlett <[email protected]>
---
tools/testing/radix-tree/linux.c | 9 +++++++++
tools/testing/radix-tree/linux/slab.h | 1 +
2 files changed, 10 insertions(+)

diff --git a/tools/testing/radix-tree/linux.c b/tools/testing/radix-tree/linux.c
index 93f7de81fbe8..380bbc0a48d6 100644
--- a/tools/testing/radix-tree/linux.c
+++ b/tools/testing/radix-tree/linux.c
@@ -91,6 +91,15 @@ void kmem_cache_free(struct kmem_cache *cachep, void *objp)
pthread_mutex_unlock(&cachep->lock);
}

+void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list)
+{
+ if (kmalloc_verbose)
+ printk("Bulk free %p[0-%lu]\n", list, size - 1);
+
+ for (int i = 0; i < size; i++)
+ kmem_cache_free(cachep, list[i]);
+}
+
void *kmalloc(size_t size, gfp_t gfp)
{
void *ret;
diff --git a/tools/testing/radix-tree/linux/slab.h b/tools/testing/radix-tree/linux/slab.h
index 2958830ce4d7..53b79c15b3a2 100644
--- a/tools/testing/radix-tree/linux/slab.h
+++ b/tools/testing/radix-tree/linux/slab.h
@@ -24,4 +24,5 @@ struct kmem_cache *kmem_cache_create(const char *name, unsigned int size,
unsigned int align, unsigned int flags,
void (*ctor)(void *));

+void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t, void **);
#endif /* SLAB_H */
--
2.30.2

2021-04-28 17:38:02

by Liam R. Howlett

Subject: [PATCH 13/94] media: videobuf2: Use vma_lookup() in get_vaddr_frames()

vma_lookup() does the same as find_vma_intersection() with a range of 1,
but is easier to understand.

Signed-off-by: Liam R. Howlett <[email protected]>
---
drivers/media/common/videobuf2/frame_vector.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/media/common/videobuf2/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c
index 91fea7199e85..b84b706073cb 100644
--- a/drivers/media/common/videobuf2/frame_vector.c
+++ b/drivers/media/common/videobuf2/frame_vector.c
@@ -64,7 +64,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
do {
unsigned long *nums = frame_vector_pfns(vec);

- vma = find_vma_intersection(mm, start, start + 1);
+ vma = vma_lookup(mm, start);
if (!vma)
break;

--
2.30.2

2021-04-28 17:38:07

by Liam R. Howlett

Subject: [PATCH 35/94] xen/privcmd: Optimize privcmd_ioctl_mmap() by using vma_lookup()

vma_lookup() walks the VMA tree only as far as the requested address,
whereas find_vma() keeps searching upwards after walking to that address.
Since this caller requires the address to equal vm_start, it is more
efficient to only walk to the requested value.
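
Illustration of the semantic difference being relied on here (not part of
the patch; mm and addr are placeholders):

        vma = find_vma(mm, addr);     /* first VMA with vm_end > addr; may start above addr */
        vma = vma_lookup(mm, addr);   /* non-NULL only if vm_start <= addr < vm_end */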

Signed-off-by: Liam R. Howlett <[email protected]>
---
drivers/xen/privcmd.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index 720a7b7abd46..5f903ae9af7e 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -282,7 +282,7 @@ static long privcmd_ioctl_mmap(struct file *file, void __user *udata)
struct page, lru);
struct privcmd_mmap_entry *msg = page_address(page);

- vma = find_vma(mm, msg->va);
+ vma = vma_lookup(mm, msg->va);
rc = -EINVAL;

if (!vma || (msg->va != vma->vm_start) || vma->vm_private_data)
--
2.30.2

2021-04-28 17:38:07

by Liam R. Howlett

Subject: [PATCH 08/94] x86/sgx: Use vma_lookup() in sgx_encl_find()

Using vma_lookup() removes the requirement to check if the address is
within the returned vma.

Signed-off-by: Liam R. Howlett <[email protected]>
---
arch/x86/kernel/cpu/sgx/encl.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index 6e74f85b6264..fec43ca65065 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -91,8 +91,8 @@ static inline int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
{
struct vm_area_struct *result;

- result = find_vma(mm, addr);
- if (!result || result->vm_ops != &sgx_vm_ops || addr < result->vm_start)
+ result = vma_lookup(mm, addr);
+ if (!result || result->vm_ops != &sgx_vm_ops)
return -EINVAL;

*vma = result;
--
2.30.2

2021-04-28 17:38:07

by Liam R. Howlett

Subject: [PATCH 36/94] mm: Optimize find_exact_vma() to use vma_lookup()

Use vma_lookup() to walk the tree to the start value requested. If
the vma at the start does not match, then the answer is NULL and there
is no need to look at the next vma the way that find_vma() would.

Signed-off-by: Liam R. Howlett <[email protected]>
---
include/linux/mm.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 146976070fed..cf17491be249 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2733,7 +2733,7 @@ static inline unsigned long vma_pages(struct vm_area_struct *vma)
static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
unsigned long vm_start, unsigned long vm_end)
{
- struct vm_area_struct *vma = find_vma(mm, vm_start);
+ struct vm_area_struct *vma = vma_lookup(mm, vm_start);

if (vma && (vma->vm_start != vm_start || vma->vm_end != vm_end))
vma = NULL;
--
2.30.2

2021-04-28 17:38:08

by Liam R. Howlett

Subject: [PATCH 32/94] kernel/fork: Convert dup_mmap to use maple tree

Use the maple tree iterator to duplicate the mm_struct trees.

Signed-off-by: Liam R. Howlett <[email protected]>
---
include/linux/mm.h | 2 --
include/linux/sched/mm.h | 3 +++
kernel/fork.c | 24 +++++++++++++++++++-----
mm/mmap.c | 4 ----
4 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e89bacfa9145..7f7dff6ad884 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2498,8 +2498,6 @@ extern bool arch_has_descending_max_zone_pfns(void);
/* nommu.c */
extern atomic_long_t mmap_pages_allocated;
extern int nommu_shrink_inode_mappings(struct inode *, size_t, size_t);
-/* maple_tree */
-void vma_store(struct mm_struct *mm, struct vm_area_struct *vma);

/* interval_tree.c */
void vma_interval_tree_insert(struct vm_area_struct *node,
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index e24b1fe348e3..76cab3aea6ab 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -8,6 +8,7 @@
#include <linux/mm_types.h>
#include <linux/gfp.h>
#include <linux/sync_core.h>
+#include <linux/maple_tree.h>

/*
* Routines for handling mm_structs
@@ -67,11 +68,13 @@ static inline void mmdrop(struct mm_struct *mm)
*/
static inline void mmget(struct mm_struct *mm)
{
+ mt_set_in_rcu(&mm->mm_mt);
atomic_inc(&mm->mm_users);
}

static inline bool mmget_not_zero(struct mm_struct *mm)
{
+ mt_set_in_rcu(&mm->mm_mt);
return atomic_inc_not_zero(&mm->mm_users);
}

diff --git a/kernel/fork.c b/kernel/fork.c
index c37abaf28eb9..832416ff613e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -477,7 +477,9 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
struct rb_node **rb_link, *rb_parent;
int retval;
- unsigned long charge;
+ unsigned long charge = 0;
+ MA_STATE(old_mas, &oldmm->mm_mt, 0, 0);
+ MA_STATE(mas, &mm->mm_mt, 0, 0);
LIST_HEAD(uf);

uprobe_start_dup_mmap();
@@ -511,7 +513,13 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
goto out;

prev = NULL;
- for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
+
+ retval = mas_entry_count(&mas, oldmm->map_count);
+ if (retval)
+ goto fail_nomem;
+
+ rcu_read_lock();
+ mas_for_each(&old_mas, mpnt, ULONG_MAX) {
struct file *file;

if (mpnt->vm_flags & VM_DONTCOPY) {
@@ -525,7 +533,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
*/
if (fatal_signal_pending(current)) {
retval = -EINTR;
- goto out;
+ goto loop_out;
}
if (mpnt->vm_flags & VM_ACCOUNT) {
unsigned long len = vma_pages(mpnt);
@@ -594,7 +602,9 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
rb_parent = &tmp->vm_rb;

/* Link the vma into the MT */
- vma_store(mm, tmp);
+ mas.index = tmp->vm_start;
+ mas.last = tmp->vm_end - 1;
+ mas_store(&mas, tmp);

mm->map_count++;
if (!(tmp->vm_flags & VM_WIPEONFORK))
@@ -604,14 +614,17 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
tmp->vm_ops->open(tmp);

if (retval)
- goto out;
+ goto loop_out;
}
/* a new mm has just been created */
retval = arch_dup_mmap(oldmm, mm);
+loop_out:
out:
+ rcu_read_unlock();
mmap_write_unlock(mm);
flush_tlb_mm(oldmm);
mmap_write_unlock(oldmm);
+ mas_destroy(&mas);
dup_userfaultfd_complete(&uf);
fail_uprobe_end:
uprobe_end_dup_mmap();
@@ -1092,6 +1105,7 @@ static inline void __mmput(struct mm_struct *mm)
{
VM_BUG_ON(atomic_read(&mm->mm_users));

+ mt_clear_in_rcu(&mm->mm_mt);
uprobe_clear_state(mm);
exit_aio(mm);
ksm_exit(mm);
diff --git a/mm/mmap.c b/mm/mmap.c
index 929c2f9eb3f5..1bd43f4db28e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -780,10 +780,6 @@ static inline void vma_mt_store(struct mm_struct *mm, struct vm_area_struct *vma
GFP_KERNEL);
}

-void vma_store(struct mm_struct *mm, struct vm_area_struct *vma) {
- vma_mt_store(mm, vma);
-}
-
static void
__vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
struct vm_area_struct *prev, struct rb_node **rb_link,
--
2.30.2

2021-04-28 17:38:09

by Liam R. Howlett

Subject: [PATCH 30/94] mm/mmap: Change find_vma_prev() to use maple tree

Change the implementation of find_vma_prev to use the new maple tree
data structure.

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mmap.c | 27 +++++++++++++++++----------
1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 3a9a9aee2f63..0fc81b02935f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2508,23 +2508,30 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
}
EXPORT_SYMBOL(find_vma);

-/*
- * Same as find_vma, but also return a pointer to the previous VMA in *pprev.
+/**
+ * find_vma_prev() - Find the VMA for a given address, or the next vma and
+ * set %pprev to the previous VMA, if any.
+ * @mm: The mm_struct to check
+ * @addr: The address
+ * @pprev: The pointer to set to the previous VMA
+ *
+ * Returns: The VMA associated with @addr, or the next vma.
+ * May return %NULL in the case of no vma at addr or above.
*/
struct vm_area_struct *
find_vma_prev(struct mm_struct *mm, unsigned long addr,
- struct vm_area_struct **pprev)
+ struct vm_area_struct **pprev)
{
struct vm_area_struct *vma;
+ MA_STATE(mas, &mm->mm_mt, addr, addr);

- vma = find_vma(mm, addr);
- if (vma) {
- *pprev = vma->vm_prev;
- } else {
- struct rb_node *rb_node = rb_last(&mm->mm_rb);
+ rcu_read_lock();
+ vma = mas_find(&mas, ULONG_MAX);
+ if (!vma)
+ mas_reset(&mas);

- *pprev = rb_node ? rb_entry(rb_node, struct vm_area_struct, vm_rb) : NULL;
- }
+ *pprev = mas_prev(&mas, 0);
+ rcu_read_unlock();
return vma;
}

--
2.30.2

2021-04-28 17:38:10

by Liam R. Howlett

Subject: [PATCH 09/94] virt/kvm: Use vma_lookup() instead of find_vma_intersection()

Use the new vma_lookup() call for abstraction & code readability.

Signed-off-by: Liam R. Howlett <[email protected]>
---
virt/kvm/kvm_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 383df23514b9..aedb642cb4be 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2019,7 +2019,7 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
}

retry:
- vma = find_vma_intersection(current->mm, addr, addr + 1);
+ vma = vma_lookup(current->mm, addr);

if (vma == NULL)
pfn = KVM_PFN_ERR_FAULT;
--
2.30.2

2021-04-28 17:38:12

by Liam R. Howlett

Subject: [PATCH 42/94] mm/mmap: Change mmap_region() to use maple tree state

Signed-off-by: Liam R. Howlett <[email protected]>
---
lib/maple_tree.c | 1 +
mm/mmap.c | 218 ++++++++++++++++++++++++++++++++++++++++-------
2 files changed, 187 insertions(+), 32 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 3a272ec5ccaa..6fa7557e7140 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -3946,6 +3946,7 @@ static inline void *_mas_store(struct ma_state *mas, void *entry, bool overwrite
if (ret > 2)
return NULL;
spanning_store:
+
return content;
}

diff --git a/mm/mmap.c b/mm/mmap.c
index df39c01eda12..4c873313549a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -516,9 +516,10 @@ munmap_vma_range(struct mm_struct *mm, unsigned long start, unsigned long len,
struct vm_area_struct **pprev, struct list_head *uf)
{
// Needs optimization.
- while (range_has_overlap(mm, start, start + len, pprev))
+ while (range_has_overlap(mm, start, start + len, pprev)) {
if (do_munmap(mm, start, len, uf))
return -ENOMEM;
+ }
return 0;
}
static unsigned long count_vma_pages_range(struct mm_struct *mm,
@@ -595,6 +596,27 @@ static inline void vma_mt_store(struct mm_struct *mm, struct vm_area_struct *vma
GFP_KERNEL);
}

+static void vma_mas_link(struct mm_struct *mm, struct vm_area_struct *vma,
+ struct ma_state *mas, struct vm_area_struct *prev)
+{
+ struct address_space *mapping = NULL;
+
+ if (vma->vm_file) {
+ mapping = vma->vm_file->f_mapping;
+ i_mmap_lock_write(mapping);
+ }
+
+ vma_mas_store(vma, mas);
+ __vma_link_list(mm, vma, prev);
+ __vma_link_file(vma);
+
+ if (mapping)
+ i_mmap_unlock_write(mapping);
+
+ mm->map_count++;
+ validate_mm(mm);
+}
+
static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
struct vm_area_struct *prev)
{
@@ -630,6 +652,98 @@ static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
mm->map_count++;
}

+inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end, pgoff_t pgoff,
+ struct vm_area_struct *next)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct address_space *mapping = NULL;
+ struct rb_root_cached *root = NULL;
+ struct anon_vma *anon_vma = vma->anon_vma;
+ struct file *file = vma->vm_file;
+ bool remove_next = false;
+ int error;
+
+ if (next && (vma != next) && (end == next->vm_end)) {
+ remove_next = true;
+ if (next->anon_vma && !vma->anon_vma) {
+ vma->anon_vma = next->anon_vma;
+ error = anon_vma_clone(vma, next);
+ if (error)
+ return error;
+ }
+ }
+
+ vma_adjust_trans_huge(vma, start, end, 0);
+
+ if (file) {
+ mapping = file->f_mapping;
+ root = &mapping->i_mmap;
+ uprobe_munmap(vma, vma->vm_start, vma->vm_end);
+ i_mmap_lock_write(mapping);
+ }
+
+ if (anon_vma) {
+ anon_vma_lock_write(anon_vma);
+ anon_vma_interval_tree_pre_update_vma(vma);
+ }
+
+ if (file) {
+ flush_dcache_mmap_lock(mapping);
+ vma_interval_tree_remove(vma, root);
+ }
+
+ vma->vm_start = start;
+ vma->vm_end = end;
+ vma->vm_pgoff = pgoff;
+ /* Note: mas must be pointing to the expanding VMA */
+ vma_mas_store(vma, mas);
+
+ if (file) {
+ vma_interval_tree_insert(vma, root);
+ flush_dcache_mmap_unlock(mapping);
+ }
+
+ /* Expanding over the next vma */
+ if (remove_next) {
+ /* Remove from mm linked list - also updates highest_vm_end */
+ __vma_unlink_list(mm, next);
+
+ /* Kill the cache */
+ vmacache_invalidate(mm);
+
+ if (file)
+ __remove_shared_vm_struct(next, file, mapping);
+
+ } else if (!next) {
+ mm->highest_vm_end = vm_end_gap(vma);
+ }
+
+ if (anon_vma) {
+ anon_vma_interval_tree_post_update_vma(vma);
+ anon_vma_unlock_write(anon_vma);
+ }
+
+ if (file) {
+ i_mmap_unlock_write(mapping);
+ uprobe_mmap(vma);
+ }
+
+ if (remove_next) {
+ if (file) {
+ uprobe_munmap(next, next->vm_start, next->vm_end);
+ fput(file);
+ }
+ if (next->anon_vma)
+ anon_vma_merge(vma, next);
+ mm->map_count--;
+ mpol_put(vma_policy(next));
+ vm_area_free(next);
+ }
+
+ validate_mm(mm);
+ return 0;
+}
/*
* We cannot adjust vm_start, vm_end, vm_pgoff fields of a vma that
* is already present in an i_mmap tree without adjusting the tree.
@@ -1615,9 +1729,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
struct list_head *uf)
{
struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma, *prev, *merge;
- int error;
+ struct vm_area_struct *vma = NULL;
+ struct vm_area_struct *prev, *next;
+ pgoff_t pglen = len >> PAGE_SHIFT;
unsigned long charged = 0;
+ unsigned long end = addr + len;
+ unsigned long merge_start = addr, merge_end = end;
+ pgoff_t vm_pgoff;
+ int error;
+ MA_STATE(mas, &mm->mm_mt, addr, end - 1);

/* Check against address space limit. */
if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
@@ -1627,16 +1747,17 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
* MAP_FIXED may remove pages of mappings that intersects with
* requested mapping. Account for the pages it would unmap.
*/
- nr_pages = count_vma_pages_range(mm, addr, addr + len);
+ nr_pages = count_vma_pages_range(mm, addr, end);

if (!may_expand_vm(mm, vm_flags,
(len >> PAGE_SHIFT) - nr_pages))
return -ENOMEM;
}

- /* Clear old maps, set up prev and uf */
- if (munmap_vma_range(mm, addr, len, &prev, uf))
+ /* Unmap any existing mapping in the area */
+ if (do_munmap(mm, addr, len, uf))
return -ENOMEM;
+
/*
* Private writable mapping: check memory availability
*/
@@ -1647,14 +1768,44 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vm_flags |= VM_ACCOUNT;
}

- /*
- * Can we just expand an old mapping?
- */
- vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
- NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
- if (vma)
- goto out;

+ if (vm_flags & VM_SPECIAL) {
+ prev = mas_prev(&mas, 0);
+ goto cannot_expand;
+ }
+
+ /* Attempt to expand an old mapping */
+
+ /* Check next */
+ next = mas_next(&mas, ULONG_MAX);
+ if (next && next->vm_start == end && vma_policy(next) &&
+ can_vma_merge_before(next, vm_flags, NULL, file, pgoff+pglen,
+ NULL_VM_UFFD_CTX)) {
+ merge_end = next->vm_end;
+ vma = next;
+ vm_pgoff = next->vm_pgoff - pglen;
+ }
+
+ /* Check prev */
+ prev = mas_prev(&mas, 0);
+ if (prev && prev->vm_end == addr && !vma_policy(prev) &&
+ can_vma_merge_after(prev, vm_flags, NULL, file, pgoff,
+ NULL_VM_UFFD_CTX)) {
+ merge_start = prev->vm_start;
+ vma = prev;
+ vm_pgoff = prev->vm_pgoff;
+ }
+
+
+ /* Actually expand, if possible */
+ if (vma &&
+ !vma_expand(&mas, vma, merge_start, merge_end, vm_pgoff, next)) {
+ khugepaged_enter_vma_merge(prev, vm_flags);
+ goto expanded;
+ }
+
+ mas_set_range(&mas, addr, end - 1);
+cannot_expand:
/*
* Determine the object being mapped and call the appropriate
* specific mapper. the address has already been validated, but
@@ -1667,7 +1818,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
}

vma->vm_start = addr;
- vma->vm_end = addr + len;
+ vma->vm_end = end;
vma->vm_flags = vm_flags;
vma->vm_page_prot = vm_get_page_prot(vm_flags);
vma->vm_pgoff = pgoff;
@@ -1698,8 +1849,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
*
* Answer: Yes, several device drivers can do it in their
* f_op->mmap method. -DaveM
- * Bug: If addr is changed, prev, rb_link, rb_parent should
- * be updated for vma_link()
*/
WARN_ON_ONCE(addr != vma->vm_start);

@@ -1708,18 +1857,25 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
/* If vm_flags changed after call_mmap(), we should try merge vma again
* as we may succeed this time.
*/
- if (unlikely(vm_flags != vma->vm_flags && prev)) {
- merge = vma_merge(mm, prev, vma->vm_start, vma->vm_end, vma->vm_flags,
- NULL, vma->vm_file, vma->vm_pgoff, NULL, NULL_VM_UFFD_CTX);
- if (merge) {
+ if (unlikely(vm_flags != vma->vm_flags && prev &&
+ prev->vm_end == addr && !vma_policy(prev) &&
+ can_vma_merge_after(prev, vm_flags, NULL, file,
+ pgoff, NULL_VM_UFFD_CTX))) {
+ merge_start = prev->vm_start;
+ vm_pgoff = prev->vm_pgoff;
+ if (!vma_expand(&mas, prev, merge_start, merge_end,
+ vm_pgoff, next)) {
/* ->mmap() can change vma->vm_file and fput the original file. So
* fput the vma->vm_file here or we would add an extra fput for file
* and cause general protection fault ultimately.
*/
fput(vma->vm_file);
vm_area_free(vma);
- vma = merge;
- /* Update vm_flags to pick up the change. */
+ vma = prev;
+ /* Update vm_flags and possible addr to pick up the change. We don't
+ * warn here if addr changed as the vma is not linked by vma_link().
+ */
+ addr = vma->vm_start;
vm_flags = vma->vm_flags;
goto unmap_writable;
}
@@ -1743,7 +1899,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
goto free_vma;
}

- vma_link(mm, vma, prev);
+ mas.index = mas.last = addr;
+ mas_walk(&mas);
+ vma_mas_link(mm, vma, &mas, prev);
/* Once vma denies write, undo our temporary denial count */
if (file) {
unmap_writable:
@@ -1753,14 +1911,14 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
allow_write_access(file);
}
file = vma->vm_file;
-out:
+expanded:
perf_event_mmap(vma);

vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
if (vm_flags & VM_LOCKED) {
if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
- is_vm_hugetlb_page(vma) ||
- vma == get_gate_vma(current->mm))
+ is_vm_hugetlb_page(vma) ||
+ vma == get_gate_vma(current->mm))
vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
else
mm->locked_vm += (len >> PAGE_SHIFT);
@@ -2585,16 +2743,13 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
arch_unmap(mm, start, end);

/* Find the first overlapping VMA */
- vma = find_vma(mm, start);
+ vma = find_vma_intersection(mm, start, end);
if (!vma)
return 0;
+
prev = vma->vm_prev;
/* we have start < vma->vm_end */

- /* if it doesn't overlap, we have nothing.. */
- if (vma->vm_start >= end)
- return 0;
-
/*
* If we need to split any vma, do it now to save pain later.
*
@@ -2604,7 +2759,6 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
*/
if (start > vma->vm_start) {
int error;
-
/*
* Make sure that map_count on return from munmap() will
* not exceed its limit; but let map_count go just above
--
2.30.2

2021-04-28 17:38:13

by Liam R. Howlett

Subject: [PATCH 50/94] mm/mmap: Change do_brk_munmap() to use do_mas_align_munmap()

do_brk_munmap() already has aligned addresses.

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mmap.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index cf4aa715eb63..3b1a9f6bc39b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2965,12 +2965,16 @@ static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
struct vm_area_struct unmap;
unsigned long unmap_pages;
- int ret = 1;
+ int ret;

arch_unmap(mm, newbrk, oldbrk);

- if (likely(vma->vm_start >= newbrk)) { // remove entire mapping(s)
- ret = do_mas_munmap(mas, mm, newbrk, oldbrk-newbrk, uf, true);
+ if (likely((vma->vm_end < oldbrk) ||
+ ((vma->vm_start == newbrk) && (vma->vm_end == oldbrk)))) {
+ // remove entire mapping(s)
+ mas->last = oldbrk - 1;
+ ret = do_mas_align_munmap(mas, vma, mm, newbrk, oldbrk, uf,
+ true);
goto munmap_full_vma;
}

--
2.30.2

2021-04-28 17:38:15

by Liam R. Howlett

Subject: [PATCH 56/94] arch/powerpc: Remove mmap linked list from mm/book3s64/subpage_prot

Start using the maple tree

Signed-off-by: Liam R. Howlett <[email protected]>
---
arch/powerpc/mm/book3s64/subpage_prot.c | 15 ++++-----------
1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c b/arch/powerpc/mm/book3s64/subpage_prot.c
index 60c6ea16a972..51722199408e 100644
--- a/arch/powerpc/mm/book3s64/subpage_prot.c
+++ b/arch/powerpc/mm/book3s64/subpage_prot.c
@@ -149,25 +149,18 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
unsigned long len)
{
struct vm_area_struct *vma;
+ MA_STATE(mas, &mm->mm_mt, addr, addr);

/*
* We don't try too hard, we just mark all the vma in that range
* VM_NOHUGEPAGE and split them.
*/
- vma = find_vma(mm, addr);
- /*
- * If the range is in unmapped range, just return
- */
- if (vma && ((addr + len) <= vma->vm_start))
- return;
-
- while (vma) {
- if (vma->vm_start >= (addr + len))
- break;
+ rcu_read_lock();
+ mas_for_each(&mas, vma, addr + len) {
vma->vm_flags |= VM_NOHUGEPAGE;
walk_page_vma(vma, &subpage_walk_ops, NULL);
- vma = vma->vm_next;
}
+ rcu_read_unlock();
}
#else
static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
--
2.30.2

2021-04-28 17:38:19

by Liam R. Howlett

Subject: [PATCH 73/94] kernel/sys: Use maple tree iterators instead of linked list

Signed-off-by: Liam R. Howlett <[email protected]>
---
kernel/sys.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 3a583a29815f..3a33ef07cc22 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1864,9 +1864,11 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
err = -EBUSY;
if (exe_file) {
struct vm_area_struct *vma;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

mmap_read_lock(mm);
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ rcu_read_lock();
+ mas_for_each(&mas, vma, ULONG_MAX) {
if (!vma->vm_file)
continue;
if (path_equal(&vma->vm_file->f_path,
@@ -1874,6 +1876,7 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
goto exit_err;
}

+ rcu_read_unlock();
mmap_read_unlock(mm);
fput(exe_file);
}
@@ -1888,6 +1891,7 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
fdput(exe);
return err;
exit_err:
+ rcu_read_unlock();
mmap_read_unlock(mm);
fput(exe_file);
goto exit;
--
2.30.2

2021-04-28 17:38:22

by Liam R. Howlett

Subject: [PATCH 63/94] fs/coredump: Use maple tree iterators in place of linked list

Signed-off-by: Liam R. Howlett <[email protected]>
---
fs/coredump.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 2868e3e171ae..b7f42e81d84d 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -1056,10 +1056,10 @@ static unsigned long vma_dump_size(struct vm_area_struct *vma,
return vma->vm_end - vma->vm_start;
}

-static struct vm_area_struct *first_vma(struct task_struct *tsk,
+static struct vm_area_struct *first_vma(struct mm_struct *mm,
struct vm_area_struct *gate_vma)
{
- struct vm_area_struct *ret = tsk->mm->mmap;
+ struct vm_area_struct *ret = find_vma(mm, 0);

if (ret)
return ret;
@@ -1070,12 +1070,13 @@ static struct vm_area_struct *first_vma(struct task_struct *tsk,
* Helper function for iterating across a vma list. It ensures that the caller
* will visit `gate_vma' prior to terminating the search.
*/
-static struct vm_area_struct *next_vma(struct vm_area_struct *this_vma,
+static struct vm_area_struct *next_vma(struct mm_struct *mm,
+ struct vm_area_struct *this_vma,
struct vm_area_struct *gate_vma)
{
struct vm_area_struct *ret;

- ret = this_vma->vm_next;
+ ret = vma_next(mm, this_vma);
if (ret)
return ret;
if (this_vma == gate_vma)
@@ -1113,8 +1114,8 @@ int dump_vma_snapshot(struct coredump_params *cprm, int *vma_count,
return -ENOMEM;
}

- for (i = 0, vma = first_vma(current, gate_vma); vma != NULL;
- vma = next_vma(vma, gate_vma), i++) {
+ for (i = 0, vma = first_vma(mm, gate_vma); vma != NULL;
+ vma = next_vma(mm, vma, gate_vma), i++) {
struct core_vma_metadata *m = (*vma_meta) + i;

m->start = vma->vm_start;
--
2.30.2

2021-04-28 17:38:24

by Liam R. Howlett

Subject: [PATCH 62/94] fs/binfmt_elf: Use maple tree iterators for fill_files_note()

Signed-off-by: Liam R. Howlett <[email protected]>
---
fs/binfmt_elf.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 187b3f2b9202..264e37903949 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1612,6 +1612,7 @@ static int fill_files_note(struct memelfnote *note)
user_long_t *data;
user_long_t *start_end_ofs;
char *name_base, *name_curpos;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

/* *Estimated* file count and total data size needed */
count = mm->map_count;
@@ -1636,7 +1637,8 @@ static int fill_files_note(struct memelfnote *note)
name_base = name_curpos = ((char *)data) + names_ofs;
remaining = size - names_ofs;
count = 0;
- for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
+ rcu_read_lock();
+ mas_for_each(&mas, vma, ULONG_MAX) {
struct file *file;
const char *filename;

@@ -1665,6 +1667,7 @@ static int fill_files_note(struct memelfnote *note)
*start_end_ofs++ = vma->vm_pgoff;
count++;
}
+ rcu_read_unlock();

/* Now we know exact count of files, can store it */
data[0] = count;
--
2.30.2

2021-04-28 17:39:37

by Liam R. Howlett

Subject: [PATCH 22/94] radix tree test suite: Add support for fallthrough attribute

Add support for fallthrough on case statements. Note this does *NOT*
check for missing fallthrough, but does allow compiling of code with
fallthrough in case statements.
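
For context, this is the kind of construct the macro lets the shared kernel
sources compile in the userspace test suite (hypothetical example; the labels
and helpers are made up):

        switch (cmd) {
        case SETUP:
                do_setup();
                fallthrough;    /* the attribute when available, a no-op otherwise */
        case RUN:
                do_run();
                break;
        }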

Signed-off-by: Liam R. Howlett <[email protected]>
---
tools/testing/radix-tree/linux/kernel.h | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/radix-tree/linux/kernel.h
index c5c9d05f29da..99979aeaa379 100644
--- a/tools/testing/radix-tree/linux/kernel.h
+++ b/tools/testing/radix-tree/linux/kernel.h
@@ -24,4 +24,11 @@
#define __must_hold(x)

#define EXPORT_PER_CPU_SYMBOL_GPL(x)
+
+#if __has_attribute(__fallthrough__)
+# define fallthrough __attribute__((__fallthrough__))
+#else
+# define fallthrough do {} while (0) /* fallthrough */
+#endif /* __has_attribute */
+
#endif /* _KERNEL_H */
--
2.30.2

2021-04-28 17:39:38

by Liam R. Howlett

Subject: [PATCH 80/94] mm/mempolicy: Use maple tree iterators instead of vma linked list

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mempolicy.c | 44 +++++++++++++++++++++++++++-----------------
1 file changed, 27 insertions(+), 17 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d79fa299b70c..efffa5e5aabf 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -404,10 +404,13 @@ void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new)
void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
{
struct vm_area_struct *vma;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

mmap_write_lock(mm);
- for (vma = mm->mmap; vma; vma = vma->vm_next)
+ mas_lock(&mas);
+ mas_for_each(&mas, vma, ULONG_MAX)
mpol_rebind_policy(vma->vm_policy, new);
+ mas_unlock(&mas);
mmap_write_unlock(mm);
}

@@ -671,7 +674,7 @@ static unsigned long change_prot_numa(struct vm_area_struct *vma,
static int queue_pages_test_walk(unsigned long start, unsigned long end,
struct mm_walk *walk)
{
- struct vm_area_struct *vma = walk->vma;
+ struct vm_area_struct *next, *vma = walk->vma;
struct queue_pages *qp = walk->private;
unsigned long endvma = vma->vm_end;
unsigned long flags = qp->flags;
@@ -686,9 +689,10 @@ static int queue_pages_test_walk(unsigned long start, unsigned long end,
/* hole at head side of range */
return -EFAULT;
}
+ next = vma_next(vma->vm_mm, vma);
if (!(flags & MPOL_MF_DISCONTIG_OK) &&
((vma->vm_end < qp->end) &&
- (!vma->vm_next || vma->vm_end < vma->vm_next->vm_start)))
+ (!next || vma->vm_end < next->vm_start)))
/* hole at middle or tail of range */
return -EFAULT;

@@ -802,28 +806,28 @@ static int vma_replace_policy(struct vm_area_struct *vma,
static int mbind_range(struct mm_struct *mm, unsigned long start,
unsigned long end, struct mempolicy *new_pol)
{
- struct vm_area_struct *next;
struct vm_area_struct *prev;
struct vm_area_struct *vma;
int err = 0;
pgoff_t pgoff;
unsigned long vmstart;
unsigned long vmend;
+ MA_STATE(mas, &mm->mm_mt, start, start);

- vma = find_vma(mm, start);
+ rcu_read_lock();
+ vma = mas_find(&mas, ULONG_MAX);
VM_BUG_ON(!vma);

- prev = vma->vm_prev;
+ prev = mas_prev(&mas, 0);
if (start > vma->vm_start)
prev = vma;

- for (; vma && vma->vm_start < end; prev = vma, vma = next) {
- next = vma->vm_next;
+ mas_for_each(&mas, vma, end - 1) {
vmstart = max(start, vma->vm_start);
vmend = min(end, vma->vm_end);

if (mpol_equal(vma_policy(vma), new_pol))
- continue;
+ goto next;

pgoff = vma->vm_pgoff +
((vmstart - vma->vm_start) >> PAGE_SHIFT);
@@ -832,7 +836,7 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
new_pol, vma->vm_userfaultfd_ctx);
if (prev) {
vma = prev;
- next = vma->vm_next;
+ mas_set(&mas, vma->vm_end);
if (mpol_equal(vma_policy(vma), new_pol))
continue;
/* vma_merge() joined vma && vma->next, case 8 */
@@ -848,13 +852,16 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
if (err)
goto out;
}
- replace:
+replace:
err = vma_replace_policy(vma, new_pol);
if (err)
goto out;
+next:
+ prev = vma;
}

- out:
+out:
+ rcu_read_unlock();
return err;
}

@@ -975,7 +982,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
* want to return MPOL_DEFAULT in this case.
*/
mmap_read_lock(mm);
- vma = find_vma_intersection(mm, addr, addr+1);
+ vma = vma_lookup(mm, addr);
if (!vma) {
mmap_read_unlock(mm);
return -EFAULT;
@@ -1082,6 +1089,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
int flags)
{
nodemask_t nmask;
+ struct vm_area_struct *vma;
LIST_HEAD(pagelist);
int err = 0;
struct migration_target_control mtc = {
@@ -1097,8 +1105,9 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
* need migration. Between passing in the full user address
* space range and MPOL_MF_DISCONTIG_OK, this call can not fail.
*/
+ vma = find_vma(mm, 0);
VM_BUG_ON(!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)));
- queue_pages_range(mm, mm->mmap->vm_start, mm->task_size, &nmask,
+ queue_pages_range(mm, vma->vm_start, mm->task_size, &nmask,
flags | MPOL_MF_DISCONTIG_OK, &pagelist);

if (!list_empty(&pagelist)) {
@@ -1227,14 +1236,15 @@ static struct page *new_page(struct page *page, unsigned long start)
{
struct vm_area_struct *vma;
unsigned long address;
+ MA_STATE(mas, &current->mm->mm_mt, start, start);

- vma = find_vma(current->mm, start);
- while (vma) {
+ rcu_read_lock();
+ mas_for_each(&mas, vma, ULONG_MAX) {
address = page_address_in_vma(page, vma);
if (address != -EFAULT)
break;
- vma = vma->vm_next;
}
+ rcu_read_unlock();

if (PageHuge(page)) {
return alloc_huge_page_vma(page_hstate(compound_head(page)),
--
2.30.2

2021-04-28 17:39:38

by Liam R. Howlett

Subject: [PATCH 34/94] arch/m68k/kernel/sys_m68k: Use vma_lookup() in sys_cacheflush()

Using vma_lookup() simplifies the check on the returned vma: the start
address is guaranteed to lie within the returned vma, so only the end
address still needs to be verified.

Signed-off-by: Liam R. Howlett <[email protected]>
---
arch/m68k/kernel/sys_m68k.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/m68k/kernel/sys_m68k.c b/arch/m68k/kernel/sys_m68k.c
index f55bdcb8e4f1..bd0274c7592e 100644
--- a/arch/m68k/kernel/sys_m68k.c
+++ b/arch/m68k/kernel/sys_m68k.c
@@ -402,8 +402,8 @@ sys_cacheflush (unsigned long addr, int scope, int cache, unsigned long len)
* to this process.
*/
mmap_read_lock(current->mm);
- vma = find_vma(current->mm, addr);
- if (!vma || addr < vma->vm_start || addr + len > vma->vm_end)
+ vma = vma_lookup(current->mm, addr);
+ if (!vma || addr + len > vma->vm_end)
goto out_unlock;
}

--
2.30.2

2021-04-28 17:39:38

by Liam R. Howlett

Subject: [PATCH 84/94] mm/msync: Use vma_next() instead of vma linked list

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/msync.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/msync.c b/mm/msync.c
index 137d1c104f3e..d5fcecc95829 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -104,7 +104,7 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
error = 0;
goto out_unlock;
}
- vma = vma->vm_next;
+ vma = vma_next(mm, vma);
}
}
out_unlock:
--
2.30.2

2021-04-28 17:39:39

by Liam R. Howlett

Subject: [PATCH 15/94] kernel/events/uprobes: Use vma_lookup() in find_active_uprobe()

vma_lookup() will only return the VMA which contains the requested
address, so the code is easier to read.

Signed-off-by: Liam R. Howlett <[email protected]>
---
kernel/events/uprobes.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6addc9780319..907d4ee00cb2 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2046,8 +2046,8 @@ static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp)
struct vm_area_struct *vma;

mmap_read_lock(mm);
- vma = find_vma(mm, bp_vaddr);
- if (vma && vma->vm_start <= bp_vaddr) {
+ vma = vma_lookup(mm, bp_vaddr);
+ if (vma) {
if (valid_vma(vma, false)) {
struct inode *inode = file_inode(vma->vm_file);
loff_t offset = vaddr_to_offset(vma, bp_vaddr);
--
2.30.2

2021-04-28 17:39:40

by Liam R. Howlett

Subject: [PATCH 14/94] misc/sgi-gru/grufault: Use vma_lookup() in gru_find_vma()

Use vma_lookup() to avoid needing to validate the VMA returned and for
easier to read code.

Signed-off-by: Liam R. Howlett <[email protected]>
---
drivers/misc/sgi-gru/grufault.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
index 723825524ea0..d7ef61e602ed 100644
--- a/drivers/misc/sgi-gru/grufault.c
+++ b/drivers/misc/sgi-gru/grufault.c
@@ -49,8 +49,8 @@ struct vm_area_struct *gru_find_vma(unsigned long vaddr)
{
struct vm_area_struct *vma;

- vma = find_vma(current->mm, vaddr);
- if (vma && vma->vm_start <= vaddr && vma->vm_ops == &gru_vm_ops)
+ vma = vma_lookup(current->mm, vaddr);
+ if (vma && vma->vm_ops == &gru_vm_ops)
return vma;
return NULL;
}
--
2.30.2

2021-04-28 17:39:49

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 11/94] net/ipv4/tcp: Use vma_lookup() in tcp_zerocopy_receive()

Clean up code by using vma_lookup() to look up a specific VMA.

Signed-off-by: Liam R. Howlett <[email protected]>
---
net/ipv4/tcp.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e14fd0c50c10..d4781a514012 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2094,8 +2094,8 @@ static int tcp_zerocopy_receive(struct sock *sk,

mmap_read_lock(current->mm);

- vma = find_vma(current->mm, address);
- if (!vma || vma->vm_start > address || vma->vm_ops != &tcp_vm_ops) {
+ vma = vma_lookup(current->mm, address);
+ if (!vma || vma->vm_ops != &tcp_vm_ops) {
mmap_read_unlock(current->mm);
return -EINVAL;
}
--
2.30.2

2021-04-28 17:39:49

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 10/94] vfio: Use vma_lookup() instead of find_vma_intersection()

Use the new vma_lookup() call for abstraction & code readability.

Signed-off-by: Liam R. Howlett <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index a0747c35a778..fb695bf0b1c4 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -567,7 +567,7 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
vaddr = untagged_addr(vaddr);

retry:
- vma = find_vma_intersection(mm, vaddr, vaddr + 1);
+ vma = vma_lookup(mm, vaddr);

if (vma && vma->vm_flags & VM_PFNMAP) {
ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
--
2.30.2

2021-04-28 17:39:52

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 52/94] mm: Introduce vma_next() and vma_prev()

Add vma_next() and vma_prev() helpers that use the maple tree to return the
VMA after or before a given VMA, and rename the existing internal vma_next()
in mm/mmap.c to _vma_next() to avoid the name collision.
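
A minimal usage sketch of the new helper (illustrative only, assuming the
definitions added to include/linux/mm.h below; @mm and @vma come from the
caller):

	/* Walk forward from a known VMA using the new maple tree helper. */
	struct vm_area_struct *cur = vma;

	while (cur) {
		/* ... operate on cur ... */
		cur = vma_next(mm, cur);
	}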

Signed-off-by: Liam R. Howlett <[email protected]>
---
include/linux/mm.h | 33 +++++++++++++++++++++++++++++++++
mm/mmap.c | 12 ++++++------
2 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cbc79a9fa911..82b076787515 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2694,6 +2694,24 @@ struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
return mtree_load(&mm->mm_mt, addr);
}

+static inline struct vm_area_struct *vma_next(struct mm_struct *mm,
+ const struct vm_area_struct *vma)
+{
+ MA_STATE(mas, &mm->mm_mt, 0, 0);
+
+ mas_set(&mas, vma->vm_end);
+ return mas_next(&mas, ULONG_MAX);
+}
+
+static inline struct vm_area_struct *vma_prev(struct mm_struct *mm,
+ const struct vm_area_struct *vma)
+{
+ MA_STATE(mas, &mm->mm_mt, 0, 0);
+
+ mas_set(&mas, vma->vm_start);
+ return mas_prev(&mas, 0);
+}
+
static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
{
unsigned long vm_start = vma->vm_start;
@@ -2735,6 +2753,21 @@ static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
return vma;
}

+static inline struct vm_area_struct *vma_mas_next(struct ma_state *mas)
+{
+ struct ma_state tmp;
+
+ memcpy(&tmp, mas, sizeof(tmp));
+ return mas_next(&tmp, ULONG_MAX);
+}
+
+static inline struct vm_area_struct *vma_mas_prev(struct ma_state *mas)
+{
+ struct ma_state tmp;
+
+ memcpy(&tmp, mas, sizeof(tmp));
+ return mas_prev(&tmp, 0);
+}
static inline bool range_in_vma(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
diff --git a/mm/mmap.c b/mm/mmap.c
index a8e4f836b167..51a29bb789ba 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -481,7 +481,7 @@ static bool range_has_overlap(struct mm_struct *mm, unsigned long start,
}

/*
- * vma_next() - Get the next VMA.
+ * _vma_next() - Get the next VMA or the first.
* @mm: The mm_struct.
* @vma: The current vma.
*
@@ -489,7 +489,7 @@ static bool range_has_overlap(struct mm_struct *mm, unsigned long start,
*
* Returns: The next VMA after @vma.
*/
-static inline struct vm_area_struct *vma_next(struct mm_struct *mm,
+static inline struct vm_area_struct *_vma_next(struct mm_struct *mm,
struct vm_area_struct *vma)
{
if (!vma)
@@ -1144,7 +1144,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
if (vm_flags & VM_SPECIAL)
return NULL;

- next = vma_next(mm, prev);
+ next = _vma_next(mm, prev);
area = next;
if (area && area->vm_end == end) /* cases 6, 7, 8 */
next = next->vm_next;
@@ -2302,7 +2302,7 @@ static void unmap_region(struct mm_struct *mm,
struct vm_area_struct *vma, struct vm_area_struct *prev,
unsigned long start, unsigned long end)
{
- struct vm_area_struct *next = vma_next(mm, prev);
+ struct vm_area_struct *next = _vma_next(mm, prev);
struct mmu_gather tlb;

lru_add_drain();
@@ -2453,7 +2453,7 @@ static int do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
if (error)
return error;
prev = vma;
- vma = vma_next(mm, prev);
+ vma = _vma_next(mm, prev);
mas->index = start;
mas_reset(mas);
} else {
@@ -2470,7 +2470,7 @@ static int do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
int error = __split_vma(mm, last, end, 1);
if (error)
return error;
- vma = vma_next(mm, prev);
+ vma = _vma_next(mm, prev);
mas_reset(mas);
}

--
2.30.2

2021-04-28 17:39:58

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 17/94] mm/ksm: Use vma_lookup() in find_mergeable_vma()

vma_lookup() checks the limits of the VMA, so the code can be more readable
and clean.

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/ksm.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 6bbe314c5260..ced6830d0ff4 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -521,10 +521,8 @@ static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
struct vm_area_struct *vma;
if (ksm_test_exit(mm))
return NULL;
- vma = find_vma(mm, addr);
- if (!vma || vma->vm_start > addr)
- return NULL;
- if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
+ vma = vma_lookup(mm, addr);
+ if (!vma || !(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
return NULL;
return vma;
}
--
2.30.2

2021-04-28 17:39:59

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 45/94] mm/mmap: Change __do_munmap() to avoid unnecessary lookups.

As there is no longer a vmacache, find_vma() is more expensive, so avoid
unnecessary lookups.
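
A sketch of the lookup pattern this change moves to (illustrative only,
using the maple tree API from this series; start and end are placeholders):
a maple state is set up once and reused, instead of calling find_vma()
repeatedly.

	MA_STATE(mas, &mm->mm_mt, start, start);
	struct vm_area_struct *vma;

	/* First VMA overlapping [start, end); the state can be reused later. */
	vma = mas_find(&mas, end - 1);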

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mmap.c | 115 ++++++++++++++++++++++++++++--------------------------
1 file changed, 59 insertions(+), 56 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 10c42a41e023..8ce36776fe43 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2551,44 +2551,6 @@ static void unmap_region(struct mm_struct *mm,
tlb_finish_mmu(&tlb);
}

-/*
- * Create a list of vma's touched by the unmap, removing them from the mm's
- * vma list as we go..
- */
-static bool
-detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
- struct vm_area_struct *prev, unsigned long end)
-{
- struct vm_area_struct **insertion_point;
- struct vm_area_struct *tail_vma = NULL;
-
- insertion_point = (prev ? &prev->vm_next : &mm->mmap);
- vma->vm_prev = NULL;
- vma_mt_szero(mm, vma->vm_start, end);
- do {
- mm->map_count--;
- tail_vma = vma;
- vma = vma->vm_next;
- } while (vma && vma->vm_start < end);
- *insertion_point = vma;
- if (vma)
- vma->vm_prev = prev;
- else
- mm->highest_vm_end = prev ? vm_end_gap(prev) : 0;
- tail_vma->vm_next = NULL;
-
- /*
- * Do not downgrade mmap_lock if we are next to VM_GROWSDOWN or
- * VM_GROWSUP VMA. Such VMAs can change their size under
- * down_read(mmap_lock) and collide with the VMA we are about to unmap.
- */
- if (vma && (vma->vm_flags & VM_GROWSDOWN))
- return false;
- if (prev && (prev->vm_flags & VM_GROWSUP))
- return false;
- return true;
-}
-
/*
* __split_vma() bypasses sysctl_max_map_count checking. We use this where it
* has already been checked or doesn't make sense to fail.
@@ -2668,18 +2630,24 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
return __split_vma(mm, vma, addr, new_below);
}

-static inline void unlock_range(struct vm_area_struct *start, unsigned long limit)
+static inline int unlock_range(struct vm_area_struct *start,
+ struct vm_area_struct **tail, unsigned long limit)
{
struct mm_struct *mm = start->vm_mm;
struct vm_area_struct *tmp = start;
+ int count = 0;

while (tmp && tmp->vm_start < limit) {
+ *tail = tmp;
+ count++;
if (tmp->vm_flags & VM_LOCKED) {
mm->locked_vm -= vma_pages(tmp);
munlock_vma_pages_all(tmp);
}
tmp = tmp->vm_next;
}
+
+ return count;
}
/* Munmap is split into 2 main parts -- this part which finds
* what needs doing, and the areas themselves, which do the
@@ -2691,24 +2659,24 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
{
unsigned long end;
struct vm_area_struct *vma, *prev, *last;
+ MA_STATE(mas, &mm->mm_mt, start, start);

if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start)
return -EINVAL;

- len = PAGE_ALIGN(len);
- end = start + len;
- if (len == 0)
+ end = start + PAGE_ALIGN(len);
+ if (end == start)
return -EINVAL;

/* arch_unmap() might do unmaps itself. */
arch_unmap(mm, start, end);

/* Find the first overlapping VMA */
- vma = find_vma_intersection(mm, start, end);
+ vma = mas_find(&mas, end - 1);
if (!vma)
return 0;

- prev = vma->vm_prev;
+ mas.last = end - 1;
/* we have start < vma->vm_end */

/*
@@ -2732,16 +2700,27 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
if (error)
return error;
prev = vma;
+ vma = vma_next(mm, prev);
+ mas.index = start;
+ mas_reset(&mas);
+ } else {
+ prev = vma->vm_prev;
}

+ if (vma->vm_end >= end)
+ last = vma;
+ else
+ last = find_vma_intersection(mm, end - 1, end);
+
/* Does it split the last one? */
- last = find_vma(mm, end);
- if (last && end > last->vm_start) {
+ if (last && end < last->vm_end) {
int error = __split_vma(mm, last, end, 1);
if (error)
return error;
+ vma = vma_next(mm, prev);
+ mas_reset(&mas);
}
- vma = vma_next(mm, prev);
+

if (unlikely(uf)) {
/*
@@ -2754,22 +2733,46 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
* failure that it's not worth optimizing it for.
*/
int error = userfaultfd_unmap_prep(vma, start, end, uf);
+
if (error)
return error;
}

/*
- * unlock any mlock()ed ranges before detaching vmas
+ * unlock any mlock()ed ranges before detaching vmas, count the number
+ * of VMAs to be dropped, and return the tail entry of the affected
+ * area.
*/
- if (mm->locked_vm)
- unlock_range(vma, end);
+ mm->map_count -= unlock_range(vma, &last, end);
+ /* Drop removed area from the tree */
+ mas_store_gfp(&mas, NULL, GFP_KERNEL);
+
+ /* Detach vmas from the MM linked list */
+ vma->vm_prev = NULL;
+ if (prev)
+ prev->vm_next = last->vm_next;
+ else
+ mm->mmap = last->vm_next;

- /* Detach vmas from the MM linked list and remove from the mm tree*/
- if (!detach_vmas_to_be_unmapped(mm, vma, prev, end))
- downgrade = false;
+ if (last->vm_next) {
+ last->vm_next->vm_prev = prev;
+ last->vm_next = NULL;
+ } else
+ mm->highest_vm_end = prev ? vm_end_gap(prev) : 0;

- if (downgrade)
- mmap_write_downgrade(mm);
+ /*
+ * Do not downgrade mmap_lock if we are next to VM_GROWSDOWN or
+ * VM_GROWSUP VMA. Such VMAs can change their size under
+ * down_read(mmap_lock) and collide with the VMA we are about to unmap.
+ */
+ if (downgrade) {
+ if (last && (last->vm_flags & VM_GROWSDOWN))
+ downgrade = false;
+ else if (prev && (prev->vm_flags & VM_GROWSUP))
+ downgrade = false;
+ else
+ mmap_write_downgrade(mm);
+ }

unmap_region(mm, vma, prev, start, end);

@@ -3182,7 +3185,7 @@ void exit_mmap(struct mm_struct *mm)
}

if (mm->locked_vm)
- unlock_range(mm->mmap, ULONG_MAX);
+ unlock_range(mm->mmap, &vma, ULONG_MAX);

arch_exit_mmap(mm);

--
2.30.2

2021-04-28 17:40:00

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 31/94] mm/mmap: Change unmapped_area and unmapped_area_topdown to use maple tree

Use the new maple tree data structure to find an unmapped area.
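
A minimal sketch of the gap search this patch switches to (illustrative
only; low, high and length are placeholders for the values taken from
struct vm_unmapped_area_info):

	MA_STATE(mas, &current->mm->mm_mt, 0, 0);

	if (mas_empty_area(&mas, low, high - 1, length))
		return -ENOMEM;
	return mas.index;	/* lowest address of a suitable gap */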

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mmap.c | 263 +++++++-----------------------------------------------
1 file changed, 32 insertions(+), 231 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 0fc81b02935f..929c2f9eb3f5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2049,260 +2049,61 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
return error;
}

+/* unmapped_area() Find an area between the low_limit and the high_limit with
+ * the correct alignment and offset, all from @info. Note: current->mm is used
+ * for the search.
+ *
+ * @info: The unmapped area information including the range (low_limit -
+ * high_limit), the alignment offset and mask.
+ *
+ * Return: A memory address or -ENOMEM.
+ */
static unsigned long unmapped_area(struct vm_unmapped_area_info *info)
{
- /*
- * We implement the search by looking for an rbtree node that
- * immediately follows a suitable gap. That is,
- * - gap_start = vma->vm_prev->vm_end <= info->high_limit - length;
- * - gap_end = vma->vm_start >= info->low_limit + length;
- * - gap_end - gap_start >= length
- */
+ unsigned long length, gap;

- struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma;
- unsigned long length, low_limit, high_limit, gap_start, gap_end;
- unsigned long gap;
- MA_STATE(mas, &mm->mm_mt, 0, 0);
+ MA_STATE(mas, &current->mm->mm_mt, 0, 0);

/* Adjust search length to account for worst case alignment overhead */
length = info->length + info->align_mask;
if (length < info->length)
return -ENOMEM;

- rcu_read_lock();
- mas_empty_area_rev(&mas, info->low_limit, info->high_limit - 1,
- length);
- rcu_read_unlock();
- gap = mas.index;
- gap += (info->align_offset - gap) & info->align_mask;
-
- /* Adjust search limits by the desired length */
- if (info->high_limit < length)
- return -ENOMEM;
- high_limit = info->high_limit - length;
-
- if (info->low_limit > high_limit)
+ if (mas_empty_area(&mas, info->low_limit, info->high_limit - 1,
+ length)) {
return -ENOMEM;
- low_limit = info->low_limit + length;
-
- /* Check if rbtree root looks promising */
- if (RB_EMPTY_ROOT(&mm->mm_rb))
- goto check_highest;
- vma = rb_entry(mm->mm_rb.rb_node, struct vm_area_struct, vm_rb);
- if (vma->rb_subtree_gap < length)
- goto check_highest;
-
- while (true) {
- /* Visit left subtree if it looks promising */
- gap_end = vm_start_gap(vma);
- if (gap_end >= low_limit && vma->vm_rb.rb_left) {
- struct vm_area_struct *left =
- rb_entry(vma->vm_rb.rb_left,
- struct vm_area_struct, vm_rb);
- if (left->rb_subtree_gap >= length) {
- vma = left;
- continue;
- }
- }
-
- gap_start = vma->vm_prev ? vm_end_gap(vma->vm_prev) : 0;
-check_current:
- /* Check if current node has a suitable gap */
- if (gap_start > high_limit)
- return -ENOMEM;
- if (gap_end >= low_limit &&
- gap_end > gap_start && gap_end - gap_start >= length)
- goto found;
-
- /* Visit right subtree if it looks promising */
- if (vma->vm_rb.rb_right) {
- struct vm_area_struct *right =
- rb_entry(vma->vm_rb.rb_right,
- struct vm_area_struct, vm_rb);
- if (right->rb_subtree_gap >= length) {
- vma = right;
- continue;
- }
- }
-
- /* Go back up the rbtree to find next candidate node */
- while (true) {
- struct rb_node *prev = &vma->vm_rb;
- if (!rb_parent(prev))
- goto check_highest;
- vma = rb_entry(rb_parent(prev),
- struct vm_area_struct, vm_rb);
- if (prev == vma->vm_rb.rb_left) {
- gap_start = vm_end_gap(vma->vm_prev);
- gap_end = vm_start_gap(vma);
- goto check_current;
- }
- }
}
-
-check_highest:
- /* Check highest gap, which does not precede any rbtree node */
- gap_start = mm->highest_vm_end;
- gap_end = ULONG_MAX; /* Only for VM_BUG_ON below */
- if (gap_start > high_limit)
- return -ENOMEM;
-
-found:
- /* We found a suitable gap. Clip it with the original low_limit. */
- if (gap_start < info->low_limit)
- gap_start = info->low_limit;
-
- /* Adjust gap address to the desired alignment */
- gap_start += (info->align_offset - gap_start) & info->align_mask;
-
- VM_BUG_ON(gap_start + info->length > info->high_limit);
- VM_BUG_ON(gap_start + info->length > gap_end);
-
- VM_BUG_ON(gap != gap_start);
- return gap_start;
-}
-
-static inline unsigned long top_area_aligned(struct vm_unmapped_area_info *info,
- unsigned long end)
-{
- return (end - info->length - info->align_offset) & (~info->align_mask);
+ gap = mas.index;
+ gap += (info->align_offset - gap) & info->align_mask;
+ return gap;
}

+/* unmapped_area_topdown() Find an area between the low_limit and the
+ * high_limit with the correct alignment and offset at the highest available
+ * address, all from @info. Note: current->mm is used for the search.
+ *
+ * @info: The unmapped area information including the range (low_limit -
+ * high_limit), the alignment offset and mask.
+ *
+ * Return: A memory address or -ENOMEM.
+ */
static unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)
{
- struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma = NULL;
- unsigned long length, low_limit, high_limit, gap_start, gap_end;
- unsigned long gap;
-
- MA_STATE(mas, &mm->mm_mt, 0, 0);
- validate_mm_mt(mm);
+ unsigned long length, gap;

+ MA_STATE(mas, &current->mm->mm_mt, 0, 0);
/* Adjust search length to account for worst case alignment overhead */
length = info->length + info->align_mask;
if (length < info->length)
return -ENOMEM;

- rcu_read_lock();
- mas_empty_area_rev(&mas, info->low_limit, info->high_limit - 1,
- length);
- rcu_read_unlock();
- gap = (mas.index + info->align_mask) & ~info->align_mask;
- gap -= info->align_offset & info->align_mask;
-
- /*
- * Adjust search limits by the desired length.
- * See implementation comment at top of unmapped_area().
- */
- gap_end = info->high_limit;
- if (gap_end < length)
+ if (mas_empty_area_rev(&mas, info->low_limit, info->high_limit - 1,
+ length)) {
return -ENOMEM;
- high_limit = gap_end - length;
-
- if (info->low_limit > high_limit)
- return -ENOMEM;
- low_limit = info->low_limit + length;
-
- /* Check highest gap, which does not precede any rbtree node */
- gap_start = mm->highest_vm_end;
- if (gap_start <= high_limit)
- goto found_highest;
-
- /* Check if rbtree root looks promising */
- if (RB_EMPTY_ROOT(&mm->mm_rb))
- return -ENOMEM;
- vma = rb_entry(mm->mm_rb.rb_node, struct vm_area_struct, vm_rb);
- if (vma->rb_subtree_gap < length)
- return -ENOMEM;
-
- while (true) {
- /* Visit right subtree if it looks promising */
- gap_start = vma->vm_prev ? vm_end_gap(vma->vm_prev) : 0;
- if (gap_start <= high_limit && vma->vm_rb.rb_right) {
- struct vm_area_struct *right =
- rb_entry(vma->vm_rb.rb_right,
- struct vm_area_struct, vm_rb);
- if (right->rb_subtree_gap >= length) {
- vma = right;
- continue;
- }
- }
-
-check_current:
- /* Check if current node has a suitable gap */
- gap_end = vm_start_gap(vma);
- if (gap_end < low_limit)
- return -ENOMEM;
- if (gap_start <= high_limit &&
- gap_end > gap_start && gap_end - gap_start >= length)
- goto found;
-
- /* Visit left subtree if it looks promising */
- if (vma->vm_rb.rb_left) {
- struct vm_area_struct *left =
- rb_entry(vma->vm_rb.rb_left,
- struct vm_area_struct, vm_rb);
- if (left->rb_subtree_gap >= length) {
- vma = left;
- continue;
- }
- }
-
- /* Go back up the rbtree to find next candidate node */
- while (true) {
- struct rb_node *prev = &vma->vm_rb;
- if (!rb_parent(prev))
- return -ENOMEM;
- vma = rb_entry(rb_parent(prev),
- struct vm_area_struct, vm_rb);
- if (prev == vma->vm_rb.rb_right) {
- gap_start = vma->vm_prev ?
- vm_end_gap(vma->vm_prev) : 0;
- goto check_current;
- }
- }
}
-
-found:
- /* We found a suitable gap. Clip it with the original high_limit. */
- if (gap_end > info->high_limit)
- gap_end = info->high_limit;
-
-found_highest:
- /* Compute highest gap address at the desired alignment */
- gap_end -= info->length;
- gap_end -= (gap_end - info->align_offset) & info->align_mask;
-
- VM_BUG_ON(gap_end < info->low_limit);
- VM_BUG_ON(gap_end < gap_start);
-
- if (gap != gap_end) {
- pr_err("%s: %px Gap was found: mt %lu gap_end %lu\n", __func__,
- mm, gap, gap_end);
- pr_err("window was %lu - %lu size %lu\n", info->high_limit,
- info->low_limit, length);
- pr_err("mas.min %lu max %lu mas.last %lu\n", mas.min, mas.max,
- mas.last);
- pr_err("mas.index %lu align mask %lu offset %lu\n", mas.index,
- info->align_mask, info->align_offset);
- pr_err("rb_find_vma find on %lu => %px (%px)\n", mas.index,
- find_vma(mm, mas.index), vma);
-#if defined(CONFIG_DEBUG_MAPLE_TREE)
- mt_dump(&mm->mm_mt);
-#endif
- {
- struct vm_area_struct *dv = mm->mmap;
-
- while (dv) {
- printk("vma %px %lu-%lu\n", dv, dv->vm_start, dv->vm_end);
- dv = dv->vm_next;
- }
- }
- VM_BUG_ON(gap != gap_end);
- }
-
- return gap_end;
+ gap = (mas.index + info->align_mask) & ~info->align_mask;
+ gap -= info->align_offset & info->align_mask;
+ return gap;
}

/*
--
2.30.2

2021-04-28 17:40:02

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 46/94] mm/mmap: Move mmap_region() below do_munmap()

Relocate mmap_region() below do_munmap() in preparation for the next
commit. There are no functional changes.

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mmap.c | 472 +++++++++++++++++++++++++++---------------------------
1 file changed, 236 insertions(+), 236 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 8ce36776fe43..0106b5accd7c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1695,242 +1695,6 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
}

-unsigned long mmap_region(struct file *file, unsigned long addr,
- unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
- struct list_head *uf)
-{
- struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma = NULL;
- struct vm_area_struct *prev, *next;
- pgoff_t pglen = len >> PAGE_SHIFT;
- unsigned long charged = 0;
- unsigned long end = addr + len;
- unsigned long merge_start = addr, merge_end = end;
- pgoff_t vm_pgoff;
- int error;
- MA_STATE(mas, &mm->mm_mt, addr, end - 1);
-
- /* Check against address space limit. */
- if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
- unsigned long nr_pages;
-
- /*
- * MAP_FIXED may remove pages of mappings that intersects with
- * requested mapping. Account for the pages it would unmap.
- */
- nr_pages = count_vma_pages_range(mm, addr, end);
-
- if (!may_expand_vm(mm, vm_flags,
- (len >> PAGE_SHIFT) - nr_pages))
- return -ENOMEM;
- }
-
- /* Unmap any existing mapping in the area */
- if (do_munmap(mm, addr, len, uf))
- return -ENOMEM;
-
- /*
- * Private writable mapping: check memory availability
- */
- if (accountable_mapping(file, vm_flags)) {
- charged = len >> PAGE_SHIFT;
- if (security_vm_enough_memory_mm(mm, charged))
- return -ENOMEM;
- vm_flags |= VM_ACCOUNT;
- }
-
-
- if (vm_flags & VM_SPECIAL) {
- prev = mas_prev(&mas, 0);
- goto cannot_expand;
- }
-
- /* Attempt to expand an old mapping */
-
- /* Check next */
- next = mas_next(&mas, ULONG_MAX);
- if (next && next->vm_start == end && vma_policy(next) &&
- can_vma_merge_before(next, vm_flags, NULL, file, pgoff+pglen,
- NULL_VM_UFFD_CTX)) {
- merge_end = next->vm_end;
- vma = next;
- vm_pgoff = next->vm_pgoff - pglen;
- }
-
- /* Check prev */
- prev = mas_prev(&mas, 0);
- if (prev && prev->vm_end == addr && !vma_policy(prev) &&
- can_vma_merge_after(prev, vm_flags, NULL, file, pgoff,
- NULL_VM_UFFD_CTX)) {
- merge_start = prev->vm_start;
- vma = prev;
- vm_pgoff = prev->vm_pgoff;
- }
-
-
- /* Actually expand, if possible */
- if (vma &&
- !vma_expand(&mas, vma, merge_start, merge_end, vm_pgoff, next)) {
- khugepaged_enter_vma_merge(prev, vm_flags);
- goto expanded;
- }
-
- mas_set_range(&mas, addr, end - 1);
-cannot_expand:
- /*
- * Determine the object being mapped and call the appropriate
- * specific mapper. the address has already been validated, but
- * not unmapped, but the maps are removed from the list.
- */
- vma = vm_area_alloc(mm);
- if (!vma) {
- error = -ENOMEM;
- goto unacct_error;
- }
-
- vma->vm_start = addr;
- vma->vm_end = end;
- vma->vm_flags = vm_flags;
- vma->vm_page_prot = vm_get_page_prot(vm_flags);
- vma->vm_pgoff = pgoff;
-
- if (file) {
- if (vm_flags & VM_DENYWRITE) {
- error = deny_write_access(file);
- if (error)
- goto free_vma;
- }
- if (vm_flags & VM_SHARED) {
- error = mapping_map_writable(file->f_mapping);
- if (error)
- goto allow_write_and_free_vma;
- }
-
- /* ->mmap() can change vma->vm_file, but must guarantee that
- * vma_link() below can deny write-access if VM_DENYWRITE is set
- * and map writably if VM_SHARED is set. This usually means the
- * new file must not have been exposed to user-space, yet.
- */
- vma->vm_file = get_file(file);
- error = call_mmap(file, vma);
- if (error)
- goto unmap_and_free_vma;
-
- /* Can addr have changed??
- *
- * Answer: Yes, several device drivers can do it in their
- * f_op->mmap method. -DaveM
- */
- WARN_ON_ONCE(addr != vma->vm_start);
-
- addr = vma->vm_start;
-
- /* If vm_flags changed after call_mmap(), we should try merge vma again
- * as we may succeed this time.
- */
- if (unlikely(vm_flags != vma->vm_flags && prev &&
- prev->vm_end == addr && !vma_policy(prev) &&
- can_vma_merge_after(prev, vm_flags, NULL, file,
- pgoff, NULL_VM_UFFD_CTX))) {
- merge_start = prev->vm_start;
- vm_pgoff = prev->vm_pgoff;
- if (!vma_expand(&mas, prev, merge_start, merge_end,
- vm_pgoff, next)) {
- /* ->mmap() can change vma->vm_file and fput the original file. So
- * fput the vma->vm_file here or we would add an extra fput for file
- * and cause general protection fault ultimately.
- */
- fput(vma->vm_file);
- vm_area_free(vma);
- vma = prev;
- /* Update vm_flags and possible addr to pick up the change. We don't
- * warn here if addr changed as the vma is not linked by vma_link().
- */
- addr = vma->vm_start;
- vm_flags = vma->vm_flags;
- goto unmap_writable;
- }
- }
-
- vm_flags = vma->vm_flags;
- } else if (vm_flags & VM_SHARED) {
- error = shmem_zero_setup(vma);
- if (error)
- goto free_vma;
- } else {
- vma_set_anonymous(vma);
- }
-
- /* Allow architectures to sanity-check the vm_flags */
- if (!arch_validate_flags(vma->vm_flags)) {
- error = -EINVAL;
- if (file)
- goto unmap_and_free_vma;
- else
- goto free_vma;
- }
-
- mas.index = mas.last = addr;
- mas_walk(&mas);
- vma_mas_link(mm, vma, &mas, prev);
- /* Once vma denies write, undo our temporary denial count */
- if (file) {
-unmap_writable:
- if (vm_flags & VM_SHARED)
- mapping_unmap_writable(file->f_mapping);
- if (vm_flags & VM_DENYWRITE)
- allow_write_access(file);
- }
- file = vma->vm_file;
-expanded:
- perf_event_mmap(vma);
-
- vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
- if (vm_flags & VM_LOCKED) {
- if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
- is_vm_hugetlb_page(vma) ||
- vma == get_gate_vma(current->mm))
- vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
- else
- mm->locked_vm += (len >> PAGE_SHIFT);
- }
-
- if (file)
- uprobe_mmap(vma);
-
- /*
- * New (or expanded) vma always get soft dirty status.
- * Otherwise user-space soft-dirty page tracker won't
- * be able to distinguish situation when vma area unmapped,
- * then new mapped in-place (which must be aimed as
- * a completely new data area).
- */
- vma->vm_flags |= VM_SOFTDIRTY;
-
- vma_set_page_prot(vma);
-
- return addr;
-
-unmap_and_free_vma:
- fput(vma->vm_file);
- vma->vm_file = NULL;
-
- /* Undo any partial mapping done by a device driver. */
- unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
- charged = 0;
- if (vm_flags & VM_SHARED)
- mapping_unmap_writable(file->f_mapping);
-allow_write_and_free_vma:
- if (vm_flags & VM_DENYWRITE)
- allow_write_access(file);
-free_vma:
- vm_area_free(vma);
-unacct_error:
- if (charged)
- vm_unacct_memory(charged);
- return error;
-}
-
/* unmapped_area() Find an area between the low_limit and the high_limit with
* the correct alignment and offset, all from @info. Note: current->mm is used
* for the search.
@@ -2788,6 +2552,242 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
return __do_munmap(mm, start, len, uf, false);
}

+unsigned long mmap_region(struct file *file, unsigned long addr,
+ unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
+ struct list_head *uf)
+{
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma = NULL;
+ struct vm_area_struct *prev, *next;
+ pgoff_t pglen = len >> PAGE_SHIFT;
+ unsigned long charged = 0;
+ unsigned long end = addr + len;
+ unsigned long merge_start = addr, merge_end = end;
+ pgoff_t vm_pgoff;
+ int error;
+ MA_STATE(mas, &mm->mm_mt, addr, end - 1);
+
+ /* Check against address space limit. */
+ if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
+ unsigned long nr_pages;
+
+ /*
+ * MAP_FIXED may remove pages of mappings that intersects with
+ * requested mapping. Account for the pages it would unmap.
+ */
+ nr_pages = count_vma_pages_range(mm, addr, end);
+
+ if (!may_expand_vm(mm, vm_flags,
+ (len >> PAGE_SHIFT) - nr_pages))
+ return -ENOMEM;
+ }
+
+ /* Unmap any existing mapping in the area */
+ if (do_munmap(mm, addr, len, uf))
+ return -ENOMEM;
+
+ /*
+ * Private writable mapping: check memory availability
+ */
+ if (accountable_mapping(file, vm_flags)) {
+ charged = len >> PAGE_SHIFT;
+ if (security_vm_enough_memory_mm(mm, charged))
+ return -ENOMEM;
+ vm_flags |= VM_ACCOUNT;
+ }
+
+
+ if (vm_flags & VM_SPECIAL) {
+ prev = mas_prev(&mas, 0);
+ goto cannot_expand;
+ }
+
+ /* Attempt to expand an old mapping */
+
+ /* Check next */
+ next = mas_next(&mas, ULONG_MAX);
+ if (next && next->vm_start == end && vma_policy(next) &&
+ can_vma_merge_before(next, vm_flags, NULL, file, pgoff+pglen,
+ NULL_VM_UFFD_CTX)) {
+ merge_end = next->vm_end;
+ vma = next;
+ vm_pgoff = next->vm_pgoff - pglen;
+ }
+
+ /* Check prev */
+ prev = mas_prev(&mas, 0);
+ if (prev && prev->vm_end == addr && !vma_policy(prev) &&
+ can_vma_merge_after(prev, vm_flags, NULL, file, pgoff,
+ NULL_VM_UFFD_CTX)) {
+ merge_start = prev->vm_start;
+ vma = prev;
+ vm_pgoff = prev->vm_pgoff;
+ }
+
+
+ /* Actually expand, if possible */
+ if (vma &&
+ !vma_expand(&mas, vma, merge_start, merge_end, vm_pgoff, next)) {
+ khugepaged_enter_vma_merge(prev, vm_flags);
+ goto expanded;
+ }
+
+ mas_set_range(&mas, addr, end - 1);
+cannot_expand:
+ /*
+ * Determine the object being mapped and call the appropriate
+ * specific mapper. the address has already been validated, but
+ * not unmapped, but the maps are removed from the list.
+ */
+ vma = vm_area_alloc(mm);
+ if (!vma) {
+ error = -ENOMEM;
+ goto unacct_error;
+ }
+
+ vma->vm_start = addr;
+ vma->vm_end = end;
+ vma->vm_flags = vm_flags;
+ vma->vm_page_prot = vm_get_page_prot(vm_flags);
+ vma->vm_pgoff = pgoff;
+
+ if (file) {
+ if (vm_flags & VM_DENYWRITE) {
+ error = deny_write_access(file);
+ if (error)
+ goto free_vma;
+ }
+ if (vm_flags & VM_SHARED) {
+ error = mapping_map_writable(file->f_mapping);
+ if (error)
+ goto allow_write_and_free_vma;
+ }
+
+ /* ->mmap() can change vma->vm_file, but must guarantee that
+ * vma_link() below can deny write-access if VM_DENYWRITE is set
+ * and map writably if VM_SHARED is set. This usually means the
+ * new file must not have been exposed to user-space, yet.
+ */
+ vma->vm_file = get_file(file);
+ error = call_mmap(file, vma);
+ if (error)
+ goto unmap_and_free_vma;
+
+ /* Can addr have changed??
+ *
+ * Answer: Yes, several device drivers can do it in their
+ * f_op->mmap method. -DaveM
+ */
+ WARN_ON_ONCE(addr != vma->vm_start);
+
+ addr = vma->vm_start;
+
+ /* If vm_flags changed after call_mmap(), we should try merge vma again
+ * as we may succeed this time.
+ */
+ if (unlikely(vm_flags != vma->vm_flags && prev &&
+ prev->vm_end == addr && !vma_policy(prev) &&
+ can_vma_merge_after(prev, vm_flags, NULL, file,
+ pgoff, NULL_VM_UFFD_CTX))) {
+ merge_start = prev->vm_start;
+ vm_pgoff = prev->vm_pgoff;
+ if (!vma_expand(&mas, prev, merge_start, merge_end,
+ vm_pgoff, next)) {
+ /* ->mmap() can change vma->vm_file and fput the original file. So
+ * fput the vma->vm_file here or we would add an extra fput for file
+ * and cause general protection fault ultimately.
+ */
+ fput(vma->vm_file);
+ vm_area_free(vma);
+ vma = prev;
+ /* Update vm_flags and possible addr to pick up the change. We don't
+ * warn here if addr changed as the vma is not linked by vma_link().
+ */
+ addr = vma->vm_start;
+ vm_flags = vma->vm_flags;
+ goto unmap_writable;
+ }
+ }
+
+ vm_flags = vma->vm_flags;
+ } else if (vm_flags & VM_SHARED) {
+ error = shmem_zero_setup(vma);
+ if (error)
+ goto free_vma;
+ } else {
+ vma_set_anonymous(vma);
+ }
+
+ /* Allow architectures to sanity-check the vm_flags */
+ if (!arch_validate_flags(vma->vm_flags)) {
+ error = -EINVAL;
+ if (file)
+ goto unmap_and_free_vma;
+ else
+ goto free_vma;
+ }
+
+ mas.index = mas.last = addr;
+ mas_walk(&mas);
+ vma_mas_link(mm, vma, &mas, prev);
+ /* Once vma denies write, undo our temporary denial count */
+ if (file) {
+unmap_writable:
+ if (vm_flags & VM_SHARED)
+ mapping_unmap_writable(file->f_mapping);
+ if (vm_flags & VM_DENYWRITE)
+ allow_write_access(file);
+ }
+ file = vma->vm_file;
+expanded:
+ perf_event_mmap(vma);
+
+ vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
+ if (vm_flags & VM_LOCKED) {
+ if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
+ is_vm_hugetlb_page(vma) ||
+ vma == get_gate_vma(current->mm))
+ vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+ else
+ mm->locked_vm += (len >> PAGE_SHIFT);
+ }
+
+ if (file)
+ uprobe_mmap(vma);
+
+ /*
+ * New (or expanded) vma always get soft dirty status.
+ * Otherwise user-space soft-dirty page tracker won't
+ * be able to distinguish situation when vma area unmapped,
+ * then new mapped in-place (which must be aimed as
+ * a completely new data area).
+ */
+ vma->vm_flags |= VM_SOFTDIRTY;
+
+ vma_set_page_prot(vma);
+
+ return addr;
+
+unmap_and_free_vma:
+ fput(vma->vm_file);
+ vma->vm_file = NULL;
+
+ /* Undo any partial mapping done by a device driver. */
+ unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
+ charged = 0;
+ if (vm_flags & VM_SHARED)
+ mapping_unmap_writable(file->f_mapping);
+allow_write_and_free_vma:
+ if (vm_flags & VM_DENYWRITE)
+ allow_write_access(file);
+free_vma:
+ vm_area_free(vma);
+unacct_error:
+ if (charged)
+ vm_unacct_memory(charged);
+ return error;
+}
+
static int __vm_munmap(unsigned long start, size_t len, bool downgrade)
{
int ret;
--
2.30.2

2021-04-28 17:40:01

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 33/94] mm: Remove rb tree.

Remove the RB tree and start using the maple tree for vm_area_struct
tracking.

Drop validate_mm() calls in expand_upwards() and expand_downwards() as
the lock is not held.
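
A sketch of the iteration pattern that replaces the rbtree walks
(illustrative only; @mm is a placeholder, and the converted callers below
also hold mmap_read_lock()):

	MA_STATE(mas, &mm->mm_mt, 0, 0);
	struct vm_area_struct *vma;

	rcu_read_lock();
	mas_for_each(&mas, vma, ULONG_MAX) {
		/* ... per-VMA work ... */
	}
	rcu_read_unlock();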

Signed-off-by: Liam R. Howlett <[email protected]>
---
arch/x86/kernel/tboot.c | 1 -
drivers/firmware/efi/efi.c | 1 -
fs/proc/task_nommu.c | 55 ++--
include/linux/mm.h | 4 +-
include/linux/mm_types.h | 26 +-
kernel/fork.c | 8 -
mm/init-mm.c | 2 -
mm/mmap.c | 525 ++++++++-----------------------------
mm/nommu.c | 96 +++----
mm/util.c | 8 +
10 files changed, 185 insertions(+), 541 deletions(-)

diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index 6f978f722dff..121f28bb2209 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -97,7 +97,6 @@ void __init tboot_probe(void)

static pgd_t *tboot_pg_dir;
static struct mm_struct tboot_mm = {
- .mm_rb = RB_ROOT,
.mm_mt = MTREE_INIT(mm_mt, MAPLE_ALLOC_RANGE),
.pgd = swapper_pg_dir,
.mm_users = ATOMIC_INIT(2),
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 271ae8c7bb07..8aaeaa824576 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -54,7 +54,6 @@ static unsigned long __initdata mem_reserve = EFI_INVALID_TABLE_ADDR;
static unsigned long __initdata rt_prop = EFI_INVALID_TABLE_ADDR;

struct mm_struct efi_mm = {
- .mm_rb = RB_ROOT,
.mm_mt = MTREE_INIT(mm_mt, MAPLE_ALLOC_RANGE),
.mm_users = ATOMIC_INIT(2),
.mm_count = ATOMIC_INIT(1),
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index a6d21fc0033c..8691a1216d1c 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -22,15 +22,13 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
{
struct vm_area_struct *vma;
struct vm_region *region;
- struct rb_node *p;
unsigned long bytes = 0, sbytes = 0, slack = 0, size;
-
- mmap_read_lock(mm);
- for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) {
- vma = rb_entry(p, struct vm_area_struct, vm_rb);
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

+ mmap_read_lock(mm);
+ rcu_read_lock();
+ mas_for_each(&mas, vma, ULONG_MAX) {
bytes += kobjsize(vma);
-
region = vma->vm_region;
if (region) {
size = kobjsize(region);
@@ -53,7 +51,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
sbytes += kobjsize(mm);
else
bytes += kobjsize(mm);
-
+
if (current->fs && current->fs->users > 1)
sbytes += kobjsize(current->fs);
else
@@ -77,20 +75,21 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
"Shared:\t%8lu bytes\n",
bytes, slack, sbytes);

+ rcu_read_unlock();
mmap_read_unlock(mm);
}

unsigned long task_vsize(struct mm_struct *mm)
{
struct vm_area_struct *vma;
- struct rb_node *p;
unsigned long vsize = 0;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

mmap_read_lock(mm);
- for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) {
- vma = rb_entry(p, struct vm_area_struct, vm_rb);
+ rcu_read_lock();
+ mas_for_each(&mas, vma, ULONG_MAX)
vsize += vma->vm_end - vma->vm_start;
- }
+ rcu_read_unlock();
mmap_read_unlock(mm);
return vsize;
}
@@ -101,12 +100,12 @@ unsigned long task_statm(struct mm_struct *mm,
{
struct vm_area_struct *vma;
struct vm_region *region;
- struct rb_node *p;
unsigned long size = kobjsize(mm);
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

mmap_read_lock(mm);
- for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) {
- vma = rb_entry(p, struct vm_area_struct, vm_rb);
+ rcu_read_lock();
+ mas_for_each(&mas, vma, ULONG_MAX) {
size += kobjsize(vma);
region = vma->vm_region;
if (region) {
@@ -119,6 +118,7 @@ unsigned long task_statm(struct mm_struct *mm,
>> PAGE_SHIFT;
*data = (PAGE_ALIGN(mm->start_stack) - (mm->start_data & PAGE_MASK))
>> PAGE_SHIFT;
+ rcu_read_unlock();
mmap_read_unlock(mm);
size >>= PAGE_SHIFT;
size += *text + *data;
@@ -190,17 +190,20 @@ static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma)
*/
static int show_map(struct seq_file *m, void *_p)
{
- struct rb_node *p = _p;
-
- return nommu_vma_show(m, rb_entry(p, struct vm_area_struct, vm_rb));
+ return nommu_vma_show(m, _p);
}

static void *m_start(struct seq_file *m, loff_t *pos)
{
struct proc_maps_private *priv = m->private;
struct mm_struct *mm;
- struct rb_node *p;
- loff_t n = *pos;
+ struct vm_area_struct *vma;
+ unsigned long addr = *pos;
+ MA_STATE(mas, &priv->mm->mm_mt, addr, addr);
+
+ /* See m_next(). Zero at the start or after lseek. */
+ if (addr == -1UL)
+ return NULL;

/* pin the task and mm whilst we play with them */
priv->task = get_proc_task(priv->inode);
@@ -216,14 +219,12 @@ static void *m_start(struct seq_file *m, loff_t *pos)
return ERR_PTR(-EINTR);
}

- /* start from the Nth VMA */
- for (p = rb_first(&mm->mm_rb); p; p = rb_next(p))
- if (n-- == 0)
- return p;
+ /* start the next element from addr */
+ vma = mas_find(&mas, ULONG_MAX);

mmap_read_unlock(mm);
mmput(mm);
- return NULL;
+ return vma;
}

static void m_stop(struct seq_file *m, void *_vml)
@@ -242,10 +243,10 @@ static void m_stop(struct seq_file *m, void *_vml)

static void *m_next(struct seq_file *m, void *_p, loff_t *pos)
{
- struct rb_node *p = _p;
+ struct vm_area_struct *vma = _p;

- (*pos)++;
- return p ? rb_next(p) : NULL;
+ *pos = vma->vm_end;
+ return vma->vm_next;
}

static const struct seq_operations proc_pid_maps_ops = {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7f7dff6ad884..146976070fed 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2553,8 +2553,6 @@ extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
extern int split_vma(struct mm_struct *, struct vm_area_struct *,
unsigned long addr, int new_below);
extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
-extern void __vma_link_rb(struct mm_struct *, struct vm_area_struct *,
- struct rb_node **, struct rb_node *);
extern void unlink_file_vma(struct vm_area_struct *);
extern struct vm_area_struct *copy_vma(struct vm_area_struct **,
unsigned long addr, unsigned long len, pgoff_t pgoff,
@@ -2699,7 +2697,7 @@ static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * m
static inline
struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
{
- return find_vma_intersection(mm, addr, addr + 1);
+ return mtree_load(&mm->mm_mt, addr);
}

static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 51733fc44daf..41551bfa6ce0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -311,19 +311,6 @@ struct vm_area_struct {

/* linked list of VM areas per task, sorted by address */
struct vm_area_struct *vm_next, *vm_prev;
-
- struct rb_node vm_rb;
-
- /*
- * Largest free memory gap in bytes to the left of this VMA.
- * Either between this VMA and vma->vm_prev, or between one of the
- * VMAs below us in the VMA rbtree and its ->vm_prev. This helps
- * get_unmapped_area find a free area of the right size.
- */
- unsigned long rb_subtree_gap;
-
- /* Second cache line starts here. */
-
struct mm_struct *vm_mm; /* The address space we belong to. */

/*
@@ -333,6 +320,12 @@ struct vm_area_struct {
pgprot_t vm_page_prot;
unsigned long vm_flags; /* Flags, see mm.h. */

+ /* Information about our backing store: */
+ unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
+ * units
+ */
+ /* Second cache line starts here. */
+ struct file *vm_file; /* File we map to (can be NULL). */
/*
* For areas with an address space and backing store,
* linkage into the address_space->i_mmap interval tree.
@@ -351,16 +344,14 @@ struct vm_area_struct {
struct list_head anon_vma_chain; /* Serialized by mmap_lock &
* page_table_lock */
struct anon_vma *anon_vma; /* Serialized by page_table_lock */
+ /* Third cache line starts here. */

/* Function pointers to deal with this struct. */
const struct vm_operations_struct *vm_ops;

- /* Information about our backing store: */
- unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
- units */
- struct file * vm_file; /* File we map to (can be NULL). */
void * vm_private_data; /* was vm_pte (shared mem) */

+
#ifdef CONFIG_SWAP
atomic_long_t swap_readahead_info;
#endif
@@ -389,7 +380,6 @@ struct mm_struct {
struct {
struct vm_area_struct *mmap; /* list of VMAs */
struct maple_tree mm_mt;
- struct rb_root mm_rb;
u64 vmacache_seqnum; /* per-thread vmacache */
#ifdef CONFIG_MMU
unsigned long (*get_unmapped_area) (struct file *filp,
diff --git a/kernel/fork.c b/kernel/fork.c
index 832416ff613e..83afd3007a2b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -475,7 +475,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
struct mm_struct *oldmm)
{
struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
- struct rb_node **rb_link, *rb_parent;
int retval;
unsigned long charge = 0;
MA_STATE(old_mas, &oldmm->mm_mt, 0, 0);
@@ -502,8 +501,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
mm->exec_vm = oldmm->exec_vm;
mm->stack_vm = oldmm->stack_vm;

- rb_link = &mm->mm_rb.rb_node;
- rb_parent = NULL;
pprev = &mm->mmap;
retval = ksm_fork(mm, oldmm);
if (retval)
@@ -597,10 +594,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
tmp->vm_prev = prev;
prev = tmp;

- __vma_link_rb(mm, tmp, rb_link, rb_parent);
- rb_link = &tmp->vm_rb.rb_right;
- rb_parent = &tmp->vm_rb;
-
/* Link the vma into the MT */
mas.index = tmp->vm_start;
mas.last = tmp->vm_end - 1;
@@ -1033,7 +1026,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
struct user_namespace *user_ns)
{
mm->mmap = NULL;
- mm->mm_rb = RB_ROOT;
mt_init_flags(&mm->mm_mt, MAPLE_ALLOC_RANGE);
mm->vmacache_seqnum = 0;
atomic_set(&mm->mm_users, 1);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 2014d4b82294..04bbe5172b72 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -1,6 +1,5 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/mm_types.h>
-#include <linux/rbtree.h>
#include <linux/maple_tree.h>
#include <linux/rwsem.h>
#include <linux/spinlock.h>
@@ -28,7 +27,6 @@
* and size this cpu_bitmask to NR_CPUS.
*/
struct mm_struct init_mm = {
- .mm_rb = RB_ROOT,
.mm_mt = MTREE_INIT(mm_mt, MAPLE_ALLOC_RANGE),
.pgd = swapper_pg_dir,
.mm_users = ATOMIC_INIT(2),
diff --git a/mm/mmap.c b/mm/mmap.c
index 1bd43f4db28e..7747047c4cbe 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -38,7 +38,6 @@
#include <linux/audit.h>
#include <linux/khugepaged.h>
#include <linux/uprobes.h>
-#include <linux/rbtree_augmented.h>
#include <linux/notifier.h>
#include <linux/memory.h>
#include <linux/printk.h>
@@ -290,93 +289,6 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
return origbrk;
}

-static inline unsigned long vma_compute_gap(struct vm_area_struct *vma)
-{
- unsigned long gap, prev_end;
-
- /*
- * Note: in the rare case of a VM_GROWSDOWN above a VM_GROWSUP, we
- * allow two stack_guard_gaps between them here, and when choosing
- * an unmapped area; whereas when expanding we only require one.
- * That's a little inconsistent, but keeps the code here simpler.
- */
- gap = vm_start_gap(vma);
- if (vma->vm_prev) {
- prev_end = vm_end_gap(vma->vm_prev);
- if (gap > prev_end)
- gap -= prev_end;
- else
- gap = 0;
- }
- return gap;
-}
-
-#ifdef CONFIG_DEBUG_VM_RB
-static unsigned long vma_compute_subtree_gap(struct vm_area_struct *vma)
-{
- unsigned long max = vma_compute_gap(vma), subtree_gap;
- if (vma->vm_rb.rb_left) {
- subtree_gap = rb_entry(vma->vm_rb.rb_left,
- struct vm_area_struct, vm_rb)->rb_subtree_gap;
- if (subtree_gap > max)
- max = subtree_gap;
- }
- if (vma->vm_rb.rb_right) {
- subtree_gap = rb_entry(vma->vm_rb.rb_right,
- struct vm_area_struct, vm_rb)->rb_subtree_gap;
- if (subtree_gap > max)
- max = subtree_gap;
- }
- return max;
-}
-
-static int browse_rb(struct mm_struct *mm)
-{
- struct rb_root *root = &mm->mm_rb;
- int i = 0, j, bug = 0;
- struct rb_node *nd, *pn = NULL;
- unsigned long prev = 0, pend = 0;
-
- for (nd = rb_first(root); nd; nd = rb_next(nd)) {
- struct vm_area_struct *vma;
- vma = rb_entry(nd, struct vm_area_struct, vm_rb);
- if (vma->vm_start < prev) {
- pr_emerg("vm_start %lx < prev %lx\n",
- vma->vm_start, prev);
- bug = 1;
- }
- if (vma->vm_start < pend) {
- pr_emerg("vm_start %lx < pend %lx\n",
- vma->vm_start, pend);
- bug = 1;
- }
- if (vma->vm_start > vma->vm_end) {
- pr_emerg("vm_start %lx > vm_end %lx\n",
- vma->vm_start, vma->vm_end);
- bug = 1;
- }
- spin_lock(&mm->page_table_lock);
- if (vma->rb_subtree_gap != vma_compute_subtree_gap(vma)) {
- pr_emerg("free gap %lx, correct %lx\n",
- vma->rb_subtree_gap,
- vma_compute_subtree_gap(vma));
- bug = 1;
- }
- spin_unlock(&mm->page_table_lock);
- i++;
- pn = nd;
- prev = vma->vm_start;
- pend = vma->vm_end;
- }
- j = 0;
- for (nd = pn; nd; nd = rb_prev(nd))
- j++;
- if (i != j) {
- pr_emerg("backwards %d, forwards %d\n", j, i);
- bug = 1;
- }
- return bug ? -1 : i;
-}
#if defined(CONFIG_DEBUG_MAPLE_TREE)
extern void mt_validate(struct maple_tree *mt);
extern void mt_dump(const struct maple_tree *mt);
@@ -405,17 +317,25 @@ static void validate_mm_mt(struct mm_struct *mm)
dump_stack();
#ifdef CONFIG_DEBUG_VM
dump_vma(vma_mt);
- pr_emerg("and next in rb\n");
+ pr_emerg("and vm_next\n");
dump_vma(vma->vm_next);
-#endif
+#endif // CONFIG_DEBUG_VM
pr_emerg("mt piv: %px %lu - %lu\n", vma_mt,
mas.index, mas.last);
pr_emerg("mt vma: %px %lu - %lu\n", vma_mt,
vma_mt->vm_start, vma_mt->vm_end);
- pr_emerg("rb vma: %px %lu - %lu\n", vma,
+ if (vma->vm_prev) {
+ pr_emerg("ll prev: %px %lu - %lu\n",
+ vma->vm_prev, vma->vm_prev->vm_start,
+ vma->vm_prev->vm_end);
+ }
+ pr_emerg("ll vma: %px %lu - %lu\n", vma,
vma->vm_start, vma->vm_end);
- pr_emerg("rb->next = %px %lu - %lu\n", vma->vm_next,
- vma->vm_next->vm_start, vma->vm_next->vm_end);
+ if (vma->vm_next) {
+ pr_emerg("ll next: %px %lu - %lu\n",
+ vma->vm_next, vma->vm_next->vm_start,
+ vma->vm_next->vm_end);
+ }

mt_dump(mas.tree);
if (vma_mt->vm_end != mas.last + 1) {
@@ -441,21 +361,6 @@ static void validate_mm_mt(struct mm_struct *mm)
rcu_read_unlock();
mt_validate(&mm->mm_mt);
}
-#else
-#define validate_mm_mt(root) do { } while (0)
-#endif
-static void validate_mm_rb(struct rb_root *root, struct vm_area_struct *ignore)
-{
- struct rb_node *nd;
-
- for (nd = rb_first(root); nd; nd = rb_next(nd)) {
- struct vm_area_struct *vma;
- vma = rb_entry(nd, struct vm_area_struct, vm_rb);
- VM_BUG_ON_VMA(vma != ignore &&
- vma->rb_subtree_gap != vma_compute_subtree_gap(vma),
- vma);
- }
-}

static void validate_mm(struct mm_struct *mm)
{
@@ -464,6 +369,8 @@ static void validate_mm(struct mm_struct *mm)
unsigned long highest_address = 0;
struct vm_area_struct *vma = mm->mmap;

+ validate_mm_mt(mm);
+
while (vma) {
struct anon_vma *anon_vma = vma->anon_vma;
struct anon_vma_chain *avc;
@@ -488,80 +395,13 @@ static void validate_mm(struct mm_struct *mm)
mm->highest_vm_end, highest_address);
bug = 1;
}
- i = browse_rb(mm);
- if (i != mm->map_count) {
- if (i != -1)
- pr_emerg("map_count %d rb %d\n", mm->map_count, i);
- bug = 1;
- }
VM_BUG_ON_MM(bug, mm);
}
-#else
-#define validate_mm_rb(root, ignore) do { } while (0)
+
+#else // !CONFIG_DEBUG_MAPLE_TREE
#define validate_mm_mt(root) do { } while (0)
#define validate_mm(mm) do { } while (0)
-#endif
-
-RB_DECLARE_CALLBACKS_MAX(static, vma_gap_callbacks,
- struct vm_area_struct, vm_rb,
- unsigned long, rb_subtree_gap, vma_compute_gap)
-
-/*
- * Update augmented rbtree rb_subtree_gap values after vma->vm_start or
- * vma->vm_prev->vm_end values changed, without modifying the vma's position
- * in the rbtree.
- */
-static void vma_gap_update(struct vm_area_struct *vma)
-{
- /*
- * As it turns out, RB_DECLARE_CALLBACKS_MAX() already created
- * a callback function that does exactly what we want.
- */
- vma_gap_callbacks_propagate(&vma->vm_rb, NULL);
-}
-
-static inline void vma_rb_insert(struct vm_area_struct *vma,
- struct rb_root *root)
-{
- /* All rb_subtree_gap values must be consistent prior to insertion */
- validate_mm_rb(root, NULL);
-
- rb_insert_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
-}
-
-static void __vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root)
-{
- /*
- * Note rb_erase_augmented is a fairly large inline function,
- * so make sure we instantiate it only once with our desired
- * augmented rbtree callbacks.
- */
- rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
-}
-
-static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
- struct rb_root *root,
- struct vm_area_struct *ignore)
-{
- /*
- * All rb_subtree_gap values must be consistent prior to erase,
- * with the possible exception of
- *
- * a. the "next" vma being erased if next->vm_start was reduced in
- * __vma_adjust() -> __vma_unlink()
- * b. the vma being erased in detach_vmas_to_be_unmapped() ->
- * vma_rb_erase()
- */
- validate_mm_rb(root, ignore);
-
- __vma_rb_erase(vma, root);
-}
-
-static __always_inline void vma_rb_erase(struct vm_area_struct *vma,
- struct rb_root *root)
-{
- vma_rb_erase_ignore(vma, root, vma);
-}
+#endif // CONFIG_DEBUG_MAPLE_TREE

/*
* vma has some anon_vma assigned, and is already inserted on that
@@ -595,38 +435,26 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
}

-static int find_vma_links(struct mm_struct *mm, unsigned long addr,
- unsigned long end, struct vm_area_struct **pprev,
- struct rb_node ***rb_link, struct rb_node **rb_parent)
+/* Private
+ * range_has_overlap() - Check the @start - @end range for overlapping VMAs and
+ * sets up a pointer to the previous VMA
+ *
+ * @mm - the mm struct
+ * @start - the start address of the range
+ * @end - the end address of the range
+ * @pprev - the pointer to the pointer of the previous VMA
+ *
+ * Returns: True if there is an overlapping VMA, false otherwise
+ */
+static bool range_has_overlap(struct mm_struct *mm, unsigned long start,
+ unsigned long end, struct vm_area_struct **pprev)
{
- struct rb_node **__rb_link, *__rb_parent, *rb_prev;
-
- __rb_link = &mm->mm_rb.rb_node;
- rb_prev = __rb_parent = NULL;
+ struct vm_area_struct *existing;

- while (*__rb_link) {
- struct vm_area_struct *vma_tmp;
-
- __rb_parent = *__rb_link;
- vma_tmp = rb_entry(__rb_parent, struct vm_area_struct, vm_rb);
-
- if (vma_tmp->vm_end > addr) {
- /* Fail if an existing vma overlaps the area */
- if (vma_tmp->vm_start < end)
- return -ENOMEM;
- __rb_link = &__rb_parent->rb_left;
- } else {
- rb_prev = __rb_parent;
- __rb_link = &__rb_parent->rb_right;
- }
- }
-
- *pprev = NULL;
- if (rb_prev)
- *pprev = rb_entry(rb_prev, struct vm_area_struct, vm_rb);
- *rb_link = __rb_link;
- *rb_parent = __rb_parent;
- return 0;
+ MA_STATE(mas, &mm->mm_mt, start, start);
+ existing = mas_find(&mas, end - 1);
+ *pprev = mas_prev(&mas, 0);
+ return existing ? true : false;
}

/*
@@ -653,8 +481,6 @@ static inline struct vm_area_struct *vma_next(struct mm_struct *mm,
* @start: The start of the range.
* @len: The length of the range.
* @pprev: pointer to the pointer that will be set to previous vm_area_struct
- * @rb_link: the rb_node
- * @rb_parent: the parent rb_node
*
* Find all the vm_area_struct that overlap from @start to
* @end and munmap them. Set @pprev to the previous vm_area_struct.
@@ -663,76 +489,41 @@ static inline struct vm_area_struct *vma_next(struct mm_struct *mm,
*/
static inline int
munmap_vma_range(struct mm_struct *mm, unsigned long start, unsigned long len,
- struct vm_area_struct **pprev, struct rb_node ***link,
- struct rb_node **parent, struct list_head *uf)
+ struct vm_area_struct **pprev, struct list_head *uf)
{
-
- while (find_vma_links(mm, start, start + len, pprev, link, parent))
+ // Needs optimization.
+ while (range_has_overlap(mm, start, start + len, pprev))
if (do_munmap(mm, start, len, uf))
return -ENOMEM;
-
return 0;
}
static unsigned long count_vma_pages_range(struct mm_struct *mm,
unsigned long addr, unsigned long end)
{
unsigned long nr_pages = 0;
- unsigned long nr_mt_pages = 0;
struct vm_area_struct *vma;
+ unsigned long vm_start, vm_end;
+ MA_STATE(mas, &mm->mm_mt, addr, addr);

- /* Find first overlapping mapping */
- vma = find_vma_intersection(mm, addr, end);
+ /* Find first overlapping mapping */
+ vma = mas_find(&mas, end - 1);
if (!vma)
return 0;

- nr_pages = (min(end, vma->vm_end) -
- max(addr, vma->vm_start)) >> PAGE_SHIFT;
+ vm_start = vma->vm_start;
+ vm_end = vma->vm_end;
+ nr_pages = (min(end, vm_end) - max(addr, vm_start)) >> PAGE_SHIFT;

/* Iterate over the rest of the overlaps */
- for (vma = vma->vm_next; vma; vma = vma->vm_next) {
- unsigned long overlap_len;
-
- if (vma->vm_start > end)
- break;
-
- overlap_len = min(end, vma->vm_end) - vma->vm_start;
- nr_pages += overlap_len >> PAGE_SHIFT;
+ mas_for_each(&mas, vma, end) {
+ vm_start = vma->vm_start;
+ vm_end = vma->vm_end;
+ nr_pages += (min(end, vm_end) - vm_start) >> PAGE_SHIFT;
}

- mt_for_each(&mm->mm_mt, vma, addr, end) {
- nr_mt_pages +=
- (min(end, vma->vm_end) - vma->vm_start) >> PAGE_SHIFT;
- }
-
- VM_BUG_ON_MM(nr_pages != nr_mt_pages, mm);
-
return nr_pages;
}

-void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
- struct rb_node **rb_link, struct rb_node *rb_parent)
-{
- /* Update tracking information for the gap following the new vma. */
- if (vma->vm_next)
- vma_gap_update(vma->vm_next);
- else
- mm->highest_vm_end = vm_end_gap(vma);
-
- /*
- * vma->vm_prev wasn't known when we followed the rbtree to find the
- * correct insertion point for that vma. As a result, we could not
- * update the vma vm_rb parents rb_subtree_gap values on the way down.
- * So, we first insert the vma with a zero rb_subtree_gap value
- * (to be consistent with what we did on the way down), and then
- * immediately update the gap to the correct value. Finally we
- * rebalance the rbtree after all augmented values have been set.
- */
- rb_link_node(&vma->vm_rb, rb_parent, rb_link);
- vma->rb_subtree_gap = 0;
- vma_gap_update(vma);
- vma_rb_insert(vma, &mm->mm_rb);
-}
-
static void __vma_link_file(struct vm_area_struct *vma)
{
struct file *file;
@@ -780,19 +571,8 @@ static inline void vma_mt_store(struct mm_struct *mm, struct vm_area_struct *vma
GFP_KERNEL);
}

-static void
-__vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
- struct vm_area_struct *prev, struct rb_node **rb_link,
- struct rb_node *rb_parent)
-{
- vma_mt_store(mm, vma);
- __vma_link_list(mm, vma, prev);
- __vma_link_rb(mm, vma, rb_link, rb_parent);
-}
-
static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
- struct vm_area_struct *prev, struct rb_node **rb_link,
- struct rb_node *rb_parent)
+ struct vm_area_struct *prev)
{
struct address_space *mapping = NULL;

@@ -801,7 +581,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
i_mmap_lock_write(mapping);
}

- __vma_link(mm, vma, prev, rb_link, rb_parent);
+ vma_mt_store(mm, vma);
+ __vma_link_list(mm, vma, prev);
__vma_link_file(vma);

if (mapping)
@@ -813,30 +594,18 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,

/*
* Helper for vma_adjust() in the split_vma insert case: insert a vma into the
- * mm's list and rbtree. It has already been inserted into the interval tree.
+ * mm's list and the mm tree. It has already been inserted into the interval tree.
*/
static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
{
struct vm_area_struct *prev;
- struct rb_node **rb_link, *rb_parent;

- if (find_vma_links(mm, vma->vm_start, vma->vm_end,
- &prev, &rb_link, &rb_parent))
- BUG();
- __vma_link(mm, vma, prev, rb_link, rb_parent);
+ BUG_ON(range_has_overlap(mm, vma->vm_start, vma->vm_end, &prev));
+ vma_mt_store(mm, vma);
+ __vma_link_list(mm, vma, prev);
mm->map_count++;
}

-static __always_inline void __vma_unlink(struct mm_struct *mm,
- struct vm_area_struct *vma,
- struct vm_area_struct *ignore)
-{
- vma_rb_erase_ignore(vma, &mm->mm_rb, ignore);
- __vma_unlink_list(mm, vma);
- /* Kill the cache */
- vmacache_invalidate(mm);
-}
-
/*
* We cannot adjust vm_start, vm_end, vm_pgoff fields of a vma that
* is already present in an i_mmap tree without adjusting the tree.
@@ -854,13 +623,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
struct rb_root_cached *root = NULL;
struct anon_vma *anon_vma = NULL;
struct file *file = vma->vm_file;
- bool start_changed = false, end_changed = false;
+ bool vma_changed = false;
long adjust_next = 0;
int remove_next = 0;

- validate_mm(mm);
- validate_mm_mt(mm);
-
if (next && !insert) {
struct vm_area_struct *exporter = NULL, *importer = NULL;

@@ -986,21 +752,23 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
}

if (start != vma->vm_start) {
- unsigned long old_start = vma->vm_start;
+ if (vma->vm_start < start)
+ vma_mt_szero(mm, vma->vm_start, start);
+ else
+ vma_changed = true;
vma->vm_start = start;
- if (old_start < start)
- vma_mt_szero(mm, old_start, start);
- start_changed = true;
}
if (end != vma->vm_end) {
- unsigned long old_end = vma->vm_end;
+ if (vma->vm_end > end)
+ vma_mt_szero(mm, end, vma->vm_end);
+ else
+ vma_changed = true;
vma->vm_end = end;
- if (old_end > end)
- vma_mt_szero(mm, end, old_end);
- end_changed = true;
+ if (!next)
+ mm->highest_vm_end = vm_end_gap(vma);
}

- if (end_changed || start_changed)
+ if (vma_changed)
vma_mt_store(mm, vma);

vma->vm_pgoff = pgoff;
@@ -1018,25 +786,9 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
}

if (remove_next) {
- /*
- * vma_merge has merged next into vma, and needs
- * us to remove next before dropping the locks.
- * Since we have expanded over this vma, the maple tree will
- * have overwritten by storing the value
- */
- if (remove_next != 3)
- __vma_unlink(mm, next, next);
- else
- /*
- * vma is not before next if they've been
- * swapped.
- *
- * pre-swap() next->vm_start was reduced so
- * tell validate_mm_rb to ignore pre-swap()
- * "next" (which is stored in post-swap()
- * "vma").
- */
- __vma_unlink(mm, next, vma);
+ __vma_unlink_list(mm, next);
+ /* Kill the cache */
+ vmacache_invalidate(mm);
if (file)
__remove_shared_vm_struct(next, file, mapping);
} else if (insert) {
@@ -1046,15 +798,6 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
* (it may either follow vma or precede it).
*/
__insert_vm_struct(mm, insert);
- } else {
- if (start_changed)
- vma_gap_update(vma);
- if (end_changed) {
- if (!next)
- mm->highest_vm_end = vm_end_gap(vma);
- else if (!adjust_next)
- vma_gap_update(next);
- }
}

if (anon_vma) {
@@ -1112,10 +855,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
remove_next = 1;
end = next->vm_end;
goto again;
- }
- else if (next)
- vma_gap_update(next);
- else {
+ } else if (!next) {
/*
* If remove_next == 2 we obviously can't
* reach this path.
@@ -1142,8 +882,6 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
uprobe_mmap(insert);

validate_mm(mm);
- validate_mm_mt(mm);
-
return 0;
}

@@ -1290,7 +1028,6 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
struct vm_area_struct *area, *next;
int err;

- validate_mm_mt(mm);
/*
* We later require that vma->vm_flags == vm_flags,
* so this tests vma->vm_flags & VM_SPECIAL, too.
@@ -1366,7 +1103,6 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
khugepaged_enter_vma_merge(area, vm_flags);
return area;
}
- validate_mm_mt(mm);

return NULL;
}
@@ -1536,6 +1272,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
vm_flags_t vm_flags;
int pkey = 0;

+ validate_mm(mm);
*populate = 0;

if (!len)
@@ -1856,10 +1593,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma, *prev, *merge;
int error;
- struct rb_node **rb_link, *rb_parent;
unsigned long charged = 0;

- validate_mm_mt(mm);
/* Check against address space limit. */
if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
unsigned long nr_pages;
@@ -1875,8 +1610,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
return -ENOMEM;
}

- /* Clear old maps, set up prev, rb_link, rb_parent, and uf */
- if (munmap_vma_range(mm, addr, len, &prev, &rb_link, &rb_parent, uf))
+ /* Clear old maps, set up prev and uf */
+ if (munmap_vma_range(mm, addr, len, &prev, uf))
return -ENOMEM;
/*
* Private writable mapping: check memory availability
@@ -1984,7 +1719,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
goto free_vma;
}

- vma_link(mm, vma, prev, rb_link, rb_parent);
+ vma_link(mm, vma, prev);
/* Once vma denies write, undo our temporary denial count */
if (file) {
unmap_writable:
@@ -2021,7 +1756,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,

vma_set_page_prot(vma);

- validate_mm_mt(mm);
return addr;

unmap_and_free_vma:
@@ -2041,7 +1775,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
unacct_error:
if (charged)
vm_unacct_memory(charged);
- validate_mm_mt(mm);
return error;
}

@@ -2324,9 +2057,6 @@ find_vma_prev(struct mm_struct *mm, unsigned long addr,

rcu_read_lock();
vma = mas_find(&mas, ULONG_MAX);
- if (!vma)
- mas_reset(&mas);
-
*pprev = mas_prev(&mas, 0);
rcu_read_unlock();
return vma;
@@ -2390,7 +2120,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
unsigned long gap_addr;
int error = 0;

- validate_mm_mt(mm);
if (!(vma->vm_flags & VM_GROWSUP))
return -EFAULT;

@@ -2437,15 +2166,13 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
error = acct_stack_growth(vma, size, grow);
if (!error) {
/*
- * vma_gap_update() doesn't support concurrent
- * updates, but we only hold a shared mmap_lock
- * lock here, so we need to protect against
- * concurrent vma expansions.
- * anon_vma_lock_write() doesn't help here, as
- * we don't guarantee that all growable vmas
- * in a mm share the same root anon vma.
- * So, we reuse mm->page_table_lock to guard
- * against concurrent vma expansions.
+ * We only hold a shared mmap_lock lock here, so
+ * we need to protect against concurrent vma
+ * expansions. anon_vma_lock_write() doesn't
+ * help here, as we don't guarantee that all
+ * growable vmas in a mm share the same root
+ * anon vma. So, we reuse mm->page_table_lock
+ * to guard against concurrent vma expansions.
*/
spin_lock(&mm->page_table_lock);
if (vma->vm_flags & VM_LOCKED)
@@ -2453,10 +2180,9 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
vm_stat_account(mm, vma->vm_flags, grow);
anon_vma_interval_tree_pre_update_vma(vma);
vma->vm_end = address;
+ vma_mt_store(mm, vma);
anon_vma_interval_tree_post_update_vma(vma);
- if (vma->vm_next)
- vma_gap_update(vma->vm_next);
- else
+ if (!vma->vm_next)
mm->highest_vm_end = vm_end_gap(vma);
spin_unlock(&mm->page_table_lock);

@@ -2466,8 +2192,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
}
anon_vma_unlock_write(vma->anon_vma);
khugepaged_enter_vma_merge(vma, vma->vm_flags);
- validate_mm(mm);
- validate_mm_mt(mm);
return error;
}
#endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
@@ -2475,14 +2199,12 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
/*
* vma is the first one with address < vma->vm_start. Have to extend vma.
*/
-int expand_downwards(struct vm_area_struct *vma,
- unsigned long address)
+int expand_downwards(struct vm_area_struct *vma, unsigned long address)
{
struct mm_struct *mm = vma->vm_mm;
struct vm_area_struct *prev;
int error = 0;

- validate_mm(mm);
address &= PAGE_MASK;
if (address < mmap_min_addr)
return -EPERM;
@@ -2519,15 +2241,13 @@ int expand_downwards(struct vm_area_struct *vma,
error = acct_stack_growth(vma, size, grow);
if (!error) {
/*
- * vma_gap_update() doesn't support concurrent
- * updates, but we only hold a shared mmap_lock
- * lock here, so we need to protect against
- * concurrent vma expansions.
- * anon_vma_lock_write() doesn't help here, as
- * we don't guarantee that all growable vmas
- * in a mm share the same root anon vma.
- * So, we reuse mm->page_table_lock to guard
- * against concurrent vma expansions.
+ * We only hold a shared mmap_lock lock here, so
+ * we need to protect against concurrent vma
+ * expansions. anon_vma_lock_write() doesn't
+ * help here, as we don't guarantee that all
+ * growable vmas in a mm share the same root
+ * anon vma. So, we reuse mm->page_table_lock
+ * to guard against concurrent vma expansions.
*/
spin_lock(&mm->page_table_lock);
if (vma->vm_flags & VM_LOCKED)
@@ -2539,7 +2259,6 @@ int expand_downwards(struct vm_area_struct *vma,
/* Overwrite old entry in mtree. */
vma_mt_store(mm, vma);
anon_vma_interval_tree_post_update_vma(vma);
- vma_gap_update(vma);
spin_unlock(&mm->page_table_lock);

perf_event_mmap(vma);
@@ -2548,7 +2267,6 @@ int expand_downwards(struct vm_area_struct *vma,
}
anon_vma_unlock_write(vma->anon_vma);
khugepaged_enter_vma_merge(vma, vma->vm_flags);
- validate_mm(mm);
return error;
}

@@ -2681,16 +2399,14 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
vma->vm_prev = NULL;
vma_mt_szero(mm, vma->vm_start, end);
do {
- vma_rb_erase(vma, &mm->mm_rb);
mm->map_count--;
tail_vma = vma;
vma = vma->vm_next;
} while (vma && vma->vm_start < end);
*insertion_point = vma;
- if (vma) {
+ if (vma)
vma->vm_prev = prev;
- vma_gap_update(vma);
- } else
+ else
mm->highest_vm_end = prev ? vm_end_gap(prev) : 0;
tail_vma->vm_next = NULL;

@@ -2821,11 +2537,7 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
if (len == 0)
return -EINVAL;

- /*
- * arch_unmap() might do unmaps itself. It must be called
- * and finish any rbtree manipulation before this code
- * runs and also starts to manipulate the rbtree.
- */
+ /* arch_unmap() might do unmaps itself. */
arch_unmap(mm, start, end);

/* Find the first overlapping VMA */
@@ -2833,7 +2545,7 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
if (!vma)
return 0;
prev = vma->vm_prev;
- /* we have start < vma->vm_end */
+ /* we have start < vma->vm_end */

/* if it doesn't overlap, we have nothing.. */
if (vma->vm_start >= end)
@@ -2893,7 +2605,7 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
if (mm->locked_vm)
unlock_range(vma, end);

- /* Detach vmas from rbtree */
+ /* Detach vmas from the MM linked list and remove from the mm tree */
if (!detach_vmas_to_be_unmapped(mm, vma, prev, end))
downgrade = false;

@@ -3041,11 +2753,11 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
* anonymous maps. eventually we may be able to do some
* brk-specific accounting here.
*/
-static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long flags, struct list_head *uf)
+static int do_brk_flags(unsigned long addr, unsigned long len,
+ unsigned long flags, struct list_head *uf)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma, *prev;
- struct rb_node **rb_link, *rb_parent;
pgoff_t pgoff = addr >> PAGE_SHIFT;
int error;
unsigned long mapped_addr;
@@ -3064,8 +2776,8 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
if (error)
return error;

- /* Clear old maps, set up prev, rb_link, rb_parent, and uf */
- if (munmap_vma_range(mm, addr, len, &prev, &rb_link, &rb_parent, uf))
+ /* Clear old maps, set up prev and uf */
+ if (munmap_vma_range(mm, addr, len, &prev, uf))
return -ENOMEM;

/* Check against address space limits *after* clearing old maps... */
@@ -3099,7 +2811,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
vma->vm_pgoff = pgoff;
vma->vm_flags = flags;
vma->vm_page_prot = vm_get_page_prot(flags);
- vma_link(mm, vma, prev, rb_link, rb_parent);
+ vma_link(mm, vma, prev);
out:
perf_event_mmap(vma);
mm->total_vm += len >> PAGE_SHIFT;
@@ -3219,26 +2931,10 @@ void exit_mmap(struct mm_struct *mm)
int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
{
struct vm_area_struct *prev;
- struct rb_node **rb_link, *rb_parent;
- unsigned long start = vma->vm_start;
- struct vm_area_struct *overlap = NULL;

- if (find_vma_links(mm, vma->vm_start, vma->vm_end,
- &prev, &rb_link, &rb_parent))
+ if (range_has_overlap(mm, vma->vm_start, vma->vm_end, &prev))
return -ENOMEM;

- overlap = mt_find(&mm->mm_mt, &start, vma->vm_end - 1);
- if (overlap) {
-
- pr_err("Found vma ending at %lu\n", start - 1);
- pr_err("vma : %lu => %lu-%lu\n", (unsigned long)overlap,
- overlap->vm_start, overlap->vm_end - 1);
-#if defined(CONFIG_DEBUG_MAPLE_TREE)
- mt_dump(&mm->mm_mt);
-#endif
- BUG();
- }
-
if ((vma->vm_flags & VM_ACCOUNT) &&
security_vm_enough_memory_mm(mm, vma_pages(vma)))
return -ENOMEM;
@@ -3260,7 +2956,7 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT;
}

- vma_link(mm, vma, prev, rb_link, rb_parent);
+ vma_link(mm, vma, prev);
return 0;
}

@@ -3276,9 +2972,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
unsigned long vma_start = vma->vm_start;
struct mm_struct *mm = vma->vm_mm;
struct vm_area_struct *new_vma, *prev;
- struct rb_node **rb_link, *rb_parent;
bool faulted_in_anon_vma = true;
- unsigned long index = addr;

validate_mm_mt(mm);
/*
@@ -3290,10 +2984,9 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
faulted_in_anon_vma = false;
}

- if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
+ if (range_has_overlap(mm, addr, addr + len, &prev))
return NULL; /* should never get here */
- if (mt_find(&mm->mm_mt, &index, addr+len - 1))
- BUG();
+
new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
vma->vm_userfaultfd_ctx);
@@ -3334,7 +3027,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
get_file(new_vma->vm_file);
if (new_vma->vm_ops && new_vma->vm_ops->open)
new_vma->vm_ops->open(new_vma);
- vma_link(mm, new_vma, prev, rb_link, rb_parent);
+ vma_link(mm, new_vma, prev);
*need_rmap_locks = false;
}
validate_mm_mt(mm);
diff --git a/mm/nommu.c b/mm/nommu.c
index 8848cf7cb7c1..c410f99203fb 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -566,13 +566,14 @@ static void put_nommu_region(struct vm_region *region)
*/
static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)
{
- struct vm_area_struct *pvma, *prev;
struct address_space *mapping;
- struct rb_node **p, *parent, *rb_prev;
+ struct vm_area_struct *prev;
+ MA_STATE(mas, &mm->mm_mt, vma->vm_start, vma->vm_end);

BUG_ON(!vma->vm_region);

mm->map_count++;
+ printk("mm at %u\n", mm->map_count);
vma->vm_mm = mm;

/* add the VMA to the mapping */
@@ -586,42 +587,12 @@ static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)
i_mmap_unlock_write(mapping);
}

+ rcu_read_lock();
+ prev = mas_prev(&mas, 0);
+ rcu_read_unlock();
+ mas_reset(&mas);
/* add the VMA to the tree */
- parent = rb_prev = NULL;
- p = &mm->mm_rb.rb_node;
- while (*p) {
- parent = *p;
- pvma = rb_entry(parent, struct vm_area_struct, vm_rb);
-
- /* sort by: start addr, end addr, VMA struct addr in that order
- * (the latter is necessary as we may get identical VMAs) */
- if (vma->vm_start < pvma->vm_start)
- p = &(*p)->rb_left;
- else if (vma->vm_start > pvma->vm_start) {
- rb_prev = parent;
- p = &(*p)->rb_right;
- } else if (vma->vm_end < pvma->vm_end)
- p = &(*p)->rb_left;
- else if (vma->vm_end > pvma->vm_end) {
- rb_prev = parent;
- p = &(*p)->rb_right;
- } else if (vma < pvma)
- p = &(*p)->rb_left;
- else if (vma > pvma) {
- rb_prev = parent;
- p = &(*p)->rb_right;
- } else
- BUG();
- }
-
- rb_link_node(&vma->vm_rb, parent, p);
- rb_insert_color(&vma->vm_rb, &mm->mm_rb);
-
- /* add VMA to the VMA list also */
- prev = NULL;
- if (rb_prev)
- prev = rb_entry(rb_prev, struct vm_area_struct, vm_rb);
-
+ vma_mas_store(vma, &mas);
__vma_link_list(mm, vma, prev);
}

@@ -634,6 +605,7 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
struct address_space *mapping;
struct mm_struct *mm = vma->vm_mm;
struct task_struct *curr = current;
+ MA_STATE(mas, &vma->vm_mm->mm_mt, 0, 0);

mm->map_count--;
for (i = 0; i < VMACACHE_SIZE; i++) {
@@ -643,7 +615,6 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
break;
}
}
-
/* remove the VMA from the mapping */
if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
@@ -656,8 +627,7 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
}

/* remove from the MM's tree and list */
- rb_erase(&vma->vm_rb, &mm->mm_rb);
-
+ vma_mas_remove(vma, &mas);
__vma_unlink_list(mm, vma);
}

@@ -681,24 +651,21 @@ static void delete_vma(struct mm_struct *mm, struct vm_area_struct *vma)
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
struct vm_area_struct *vma;
+ MA_STATE(mas, &mm->mm_mt, addr, addr);

/* check the cache first */
vma = vmacache_find(mm, addr);
if (likely(vma))
return vma;

- /* trawl the list (there may be multiple mappings in which addr
- * resides) */
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
- if (vma->vm_start > addr)
- return NULL;
- if (vma->vm_end > addr) {
- vmacache_update(addr, vma);
- return vma;
- }
- }
+ rcu_read_lock();
+ vma = mas_walk(&mas);
+ rcu_read_unlock();

- return NULL;
+ if (vma)
+ vmacache_update(addr, vma);
+
+ return vma;
}
EXPORT_SYMBOL(find_vma);

@@ -730,26 +697,25 @@ static struct vm_area_struct *find_vma_exact(struct mm_struct *mm,
{
struct vm_area_struct *vma;
unsigned long end = addr + len;
+ MA_STATE(mas, &mm->mm_mt, addr, addr);

/* check the cache first */
vma = vmacache_find_exact(mm, addr, end);
if (vma)
return vma;

- /* trawl the list (there may be multiple mappings in which addr
- * resides) */
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
- if (vma->vm_start < addr)
- continue;
- if (vma->vm_start > addr)
- return NULL;
- if (vma->vm_end == end) {
- vmacache_update(addr, vma);
- return vma;
- }
- }
-
- return NULL;
+ rcu_read_lock();
+ vma = mas_walk(&mas);
+ rcu_read_unlock();
+ if (!vma)
+ return NULL;
+ if (vma->vm_start != addr)
+ return NULL;
+ if (vma->vm_end != end)
+ return NULL;
+
+ vmacache_update(addr, vma);
+ return vma;
}

/*
diff --git a/mm/util.c b/mm/util.c
index 0b6dd9d81da7..35deaa0ccac5 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -287,6 +287,8 @@ void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
vma->vm_next = next;
if (next)
next->vm_prev = vma;
+ else
+ mm->highest_vm_end = vm_end_gap(vma);
}

void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma)
@@ -301,6 +303,12 @@ void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma)
mm->mmap = next;
if (next)
next->vm_prev = prev;
+ else {
+ if (prev)
+ mm->highest_vm_end = vm_end_gap(prev);
+ else
+ mm->highest_vm_end = 0;
+ }
}

/* Check if the vma is being used as a stack by this task */
--
2.30.2

2021-04-28 17:40:07

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 37/94] mm/khugepaged: Optimize collapse_pte_mapped_thp() by using vma_lookup()

vma_lookup() walks the vma tree once and does not continue searching for
the next vma. Since the exact vma is checked below anyway, this is a more
efficient way to search.
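
For reference, vma_lookup() behaves roughly like the open-coded check it
replaces; a minimal sketch (not the actual helper introduced earlier in this
series) looks like:

	static inline struct vm_area_struct *vma_lookup(struct mm_struct *mm,
							unsigned long addr)
	{
		struct vm_area_struct *vma = find_vma(mm, addr);

		/* find_vma() may return the next vma above addr; reject it. */
		if (vma && addr < vma->vm_start)
			vma = NULL;

		return vma;
	}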

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/khugepaged.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6c0185fdd815..33cf91529f0b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1431,7 +1431,7 @@ static int khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
{
unsigned long haddr = addr & HPAGE_PMD_MASK;
- struct vm_area_struct *vma = find_vma(mm, haddr);
+ struct vm_area_struct *vma = vma_lookup(mm, haddr);
struct page *hpage;
pte_t *start_pte, *pte;
pmd_t *pmd, _pmd;
--
2.30.2

2021-04-28 17:40:08

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 59/94] arch/xtensa: Use maple tree iterators for unmapped area

Signed-off-by: Liam R. Howlett <[email protected]>
---
arch/xtensa/kernel/syscall.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/xtensa/kernel/syscall.c b/arch/xtensa/kernel/syscall.c
index 201356faa7e6..118fe0ca7594 100644
--- a/arch/xtensa/kernel/syscall.c
+++ b/arch/xtensa/kernel/syscall.c
@@ -58,6 +58,7 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags)
{
struct vm_area_struct *vmm;
+ MA_STATE(mas, &mm->mm_mt, addr, addr);

if (flags & MAP_FIXED) {
/* We do not accept a shared mapping if it would violate
@@ -79,7 +80,8 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
else
addr = PAGE_ALIGN(addr);

- for (vmm = find_vma(current->mm, addr); ; vmm = vmm->vm_next) {
+ /* Must hold mm_mt lock */
+ mas_for_each(&mas, vmm, ULONG_MAX) {
/* At this point: (!vmm || addr < vmm->vm_end). */
if (TASK_SIZE - len < addr)
return -ENOMEM;
--
2.30.2

2021-04-28 17:40:15

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 38/94] mm/gup: Add mm_populate_vma() for use when the vma is known

When the vma is already known, avoid calling mm_populate(), which would
search for the vma again before populating it.
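
A hypothetical caller that already has the vma in hand (the condition below
is only illustrative, not taken from this series) would then do:

	/* Illustrative only: populate a freshly mapped, mlock()ed vma. */
	if (vma->vm_flags & VM_LOCKED)
		mm_populate_vma(vma, vma->vm_start, vma->vm_end);

instead of going through mm_populate(addr, len), which has to look the vma
up again.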

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/gup.c | 20 ++++++++++++++++++++
mm/internal.h | 4 ++++
2 files changed, 24 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index c3a17b189064..48fe98ab0729 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1468,6 +1468,26 @@ long populate_vma_page_range(struct vm_area_struct *vma,
NULL, NULL, locked);
}

+/*
+ * mm_populate_vma() - Populate a single range in a single vma.
+ * @vma: The vma to populate.
+ * @start: The start address to populate
+ * @end: The end address to stop populating
+ *
+ * Note: Ignores errors.
+ */
+void mm_populate_vma(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end)
+{
+ struct mm_struct *mm = current->mm;
+ int locked = 1;
+
+ mmap_read_lock(mm);
+ populate_vma_page_range(vma, start, end, &locked);
+ if (locked)
+ mmap_read_unlock(mm);
+}
+
/*
* __mm_populate - populate and/or mlock pages within a range of address space.
*
diff --git a/mm/internal.h b/mm/internal.h
index 7ad55938d391..583f2f1e6ff8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -346,6 +346,10 @@ static inline bool is_data_mapping(vm_flags_t flags)
return (flags & (VM_WRITE | VM_SHARED | VM_STACK)) == VM_WRITE;
}

+/* mm/gup.c */
+extern void mm_populate_vma(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end);
+
/* Maple tree operations using VMAs */
/*
* vma_mas_store() - Store a VMA in the maple tree.
--
2.30.2

2021-04-28 17:40:17

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 55/94] arch/powerpc: Remove mmap linked list from mm/book3s32/tlb

Start using the maple tree iterator instead of the vma linked list.

Signed-off-by: Liam R. Howlett <[email protected]>
---
arch/powerpc/mm/book3s32/tlb.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s32/tlb.c b/arch/powerpc/mm/book3s32/tlb.c
index 19f0ef950d77..2c8b991de8e8 100644
--- a/arch/powerpc/mm/book3s32/tlb.c
+++ b/arch/powerpc/mm/book3s32/tlb.c
@@ -81,6 +81,7 @@ EXPORT_SYMBOL(hash__flush_range);
void hash__flush_tlb_mm(struct mm_struct *mm)
{
struct vm_area_struct *mp;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

/*
* It is safe to go down the mm's list of vmas when called
@@ -88,8 +89,10 @@ void hash__flush_tlb_mm(struct mm_struct *mm)
* unmap_region or exit_mmap, but not from vmtruncate on SMP -
* but it seems dup_mmap is the only SMP case which gets here.
*/
- for (mp = mm->mmap; mp != NULL; mp = mp->vm_next)
+ rcu_read_lock();
+ mas_for_each(&mas, mp, ULONG_MAX)
hash__flush_range(mp->vm_mm, mp->vm_start, mp->vm_end);
+ rcu_read_unlock();
}
EXPORT_SYMBOL(hash__flush_tlb_mm);

--
2.30.2

2021-04-28 17:40:17

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 49/94] mmap: Remove __do_munmap() in favour of do_mas_munmap()

Export the new interface and use it in place of the old one.
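
Callers now seed a maple state at the start address themselves; the
conversion of __vm_munmap() below is representative:

	MA_STATE(mas, &mm->mm_mt, start, start);

	ret = do_mas_munmap(&mas, mm, start, len, &uf, downgrade);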

Signed-off-by: Liam R. Howlett <[email protected]>
---
include/linux/mm.h | 4 ++--
mm/mmap.c | 16 ++++------------
mm/mremap.c | 7 ++++---
3 files changed, 10 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dd8abaa433f9..cbc79a9fa911 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2604,8 +2604,8 @@ extern unsigned long mmap_region(struct file *file, unsigned long addr,
extern unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot, unsigned long flags,
unsigned long pgoff, unsigned long *populate, struct list_head *uf);
-extern int __do_munmap(struct mm_struct *, unsigned long, size_t,
- struct list_head *uf, bool downgrade);
+extern int do_mas_munmap(struct ma_state *mas, struct mm_struct *mm,
+ unsigned long start, size_t len, struct list_head *uf, bool downgrade);
extern int do_munmap(struct mm_struct *, unsigned long, size_t,
struct list_head *uf);
extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
diff --git a/mm/mmap.c b/mm/mmap.c
index 3e67fb5eac31..cf4aa715eb63 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2574,13 +2574,6 @@ int do_mas_munmap(struct ma_state *mas, struct mm_struct *mm,
return do_mas_align_munmap(mas, vma, mm, start, end, uf, downgrade);
}

-int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
- struct list_head *uf, bool downgrade)
-{
- MA_STATE(mas, &mm->mm_mt, start, start);
- return do_mas_munmap(&mas, mm, start, len, uf, downgrade);
-}
-
/* do_munmap() - Wrapper function for non-maple tree aware do_munmap() calls.
* @mm: The mm_struct
* @start: The start address to munmap
@@ -2590,7 +2583,8 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
struct list_head *uf)
{
- return __do_munmap(mm, start, len, uf, false);
+ MA_STATE(mas, &mm->mm_mt, start, start);
+ return do_mas_munmap(&mas, mm, start, len, uf, false);
}

unsigned long mmap_region(struct file *file, unsigned long addr,
@@ -2834,11 +2828,12 @@ static int __vm_munmap(unsigned long start, size_t len, bool downgrade)
int ret;
struct mm_struct *mm = current->mm;
LIST_HEAD(uf);
+ MA_STATE(mas, &mm->mm_mt, start, start);

if (mmap_write_lock_killable(mm))
return -EINTR;

- ret = __do_munmap(mm, start, len, &uf, downgrade);
+ ret = do_mas_munmap(&mas, mm, start, len, &uf, downgrade);
/*
* Returning 1 indicates mmap_lock is downgraded.
* But 1 is not legal return value of vm_munmap() and munmap(), reset
@@ -2975,9 +2970,6 @@ static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
arch_unmap(mm, newbrk, oldbrk);

if (likely(vma->vm_start >= newbrk)) { // remove entire mapping(s)
- mas_set(mas, newbrk);
- if (vma->vm_start != newbrk)
- mas_reset(mas); // cause a re-walk for the first overlap.
ret = do_mas_munmap(mas, mm, newbrk, oldbrk-newbrk, uf, true);
goto munmap_full_vma;
}
diff --git a/mm/mremap.c b/mm/mremap.c
index 04143755cd1e..d2dba8188be5 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -881,14 +881,15 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
/*
* Always allow a shrinking remap: that just unmaps
* the unnecessary pages..
- * __do_munmap does all the needed commit accounting, and
+ * do_mas_munmap does all the needed commit accounting, and
* downgrades mmap_lock to read if so directed.
*/
if (old_len >= new_len) {
int retval;
+ MA_STATE(mas, &mm->mm_mt, addr + new_len, addr + new_len);

- retval = __do_munmap(mm, addr+new_len, old_len - new_len,
- &uf_unmap, true);
+ retval = do_mas_munmap(&mas, mm, addr + new_len,
+ old_len - new_len, &uf_unmap, true);
if (retval < 0 && old_len != new_len) {
ret = retval;
goto out;
--
2.30.2

2021-04-28 17:40:19

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 18/94] mm/migrate: Use vma_lookup() in do_pages_stat_array()

Using vma_lookup() allows for cleaner and clearer code.

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/migrate.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index b234c3f3acb7..611781c0f9b5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1833,8 +1833,8 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
struct page *page;
int err = -EFAULT;

- vma = find_vma(mm, addr);
- if (!vma || addr < vma->vm_start)
+ vma = vma_lookup(mm, addr);
+ if (!vma)
goto set_status;

/* FOLL_DUMP to ignore special (like zero) pages */
--
2.30.2

2021-04-28 17:40:19

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 76/94] mm/khugepaged: Use maple tree iterators instead of vma linked list

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/khugepaged.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 33cf91529f0b..4983a25c5a90 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2063,6 +2063,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
struct mm_struct *mm;
struct vm_area_struct *vma;
int progress = 0;
+ MA_STATE(mas, NULL, 0, 0);

VM_BUG_ON(!pages);
lockdep_assert_held(&khugepaged_mm_lock);
@@ -2079,18 +2080,22 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
khugepaged_collapse_pte_mapped_thps(mm_slot);

mm = mm_slot->mm;
+ mas.tree = &mm->mm_mt;
/*
* Don't wait for semaphore (to avoid long wait times). Just move to
* the next mm on the list.
*/
vma = NULL;
+ mas_set(&mas, khugepaged_scan.address);
if (unlikely(!mmap_read_trylock(mm)))
goto breakouterloop_mmap_lock;
+
+ rcu_read_lock();
if (likely(!khugepaged_test_exit(mm)))
- vma = find_vma(mm, khugepaged_scan.address);
+ vma = mas_find(&mas, ULONG_MAX);

progress++;
- for (; vma; vma = vma->vm_next) {
+ mas_for_each(&mas, vma, ULONG_MAX) {
unsigned long hstart, hend;

cond_resched();
@@ -2129,6 +2134,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
pgoff_t pgoff = linear_page_index(vma,
khugepaged_scan.address);

+ rcu_read_unlock();
mmap_read_unlock(mm);
ret = 1;
khugepaged_scan_file(mm, file, pgoff, hpage);
@@ -2149,6 +2155,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
}
}
breakouterloop:
+ rcu_read_unlock();
mmap_read_unlock(mm); /* exit_mmap will destroy ptes after this */
breakouterloop_mmap_lock:

--
2.30.2

2021-04-28 17:40:19

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 82/94] mm/mprotect: Use maple tree navigation instead of vma linked list

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mprotect.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index e7a443157988..c468a823627f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -518,6 +518,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
const int grows = prot & (PROT_GROWSDOWN|PROT_GROWSUP);
const bool rier = (current->personality & READ_IMPLIES_EXEC) &&
(prot & PROT_READ);
+ MA_STATE(mas, &current->mm->mm_mt, start, start);

start = untagged_addr(start);

@@ -549,11 +550,11 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
if ((pkey != -1) && !mm_pkey_is_allocated(current->mm, pkey))
goto out;

- vma = find_vma(current->mm, start);
+ vma = mas_find(&mas, ULONG_MAX);
error = -ENOMEM;
if (!vma)
goto out;
- prev = vma->vm_prev;
+ prev = mas_prev(&mas, 0);
if (unlikely(grows & PROT_GROWSDOWN)) {
if (vma->vm_start >= end)
goto out;
@@ -634,7 +635,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
if (nstart >= end)
goto out;

- vma = prev->vm_next;
+ vma = vma_next(current->mm, prev);
if (!vma || vma->vm_start != nstart) {
error = -ENOMEM;
goto out;
--
2.30.2

2021-04-28 17:40:20

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 75/94] mm/huge_memory: Use vma_next() instead of vma linked list

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/huge_memory.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 63ed6b25deaa..cf05049afb1b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2325,11 +2325,11 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
split_huge_pmd_if_needed(vma, end);

/*
- * If we're also updating the vma->vm_next->vm_start,
+ * If we're also updating the vma_next(vma)->vm_start,
* check if we need to split it.
*/
if (adjust_next > 0) {
- struct vm_area_struct *next = vma->vm_next;
+ struct vm_area_struct *next = vma_next(vma->vm_mm, vma);
unsigned long nstart = next->vm_start;
nstart += adjust_next;
split_huge_pmd_if_needed(next, nstart);
--
2.30.2

2021-04-28 17:40:24

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 90/94] bpf: Remove VMA linked list

Use vma_next() and remove the reference to the start of the linked list.

Signed-off-by: Liam R. Howlett <[email protected]>
---
kernel/bpf/task_iter.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index b68cb5d6d6eb..c1c71adc7c1a 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -400,10 +400,10 @@ task_vma_seq_get_next(struct bpf_iter_seq_task_vma_info *info)

switch (op) {
case task_vma_iter_first_vma:
- curr_vma = curr_task->mm->mmap;
+ curr_vma = find_vma(curr_task->mm, 0);
break;
case task_vma_iter_next_vma:
- curr_vma = curr_vma->vm_next;
+ curr_vma = vma_next(curr_vma->vm_mm, curr_vma);
break;
case task_vma_iter_find_vma:
/* We dropped mmap_lock so it is necessary to use find_vma
@@ -417,7 +417,7 @@ task_vma_seq_get_next(struct bpf_iter_seq_task_vma_info *info)
if (curr_vma &&
curr_vma->vm_start == info->prev_vm_start &&
curr_vma->vm_end == info->prev_vm_end)
- curr_vma = curr_vma->vm_next;
+ curr_vma = vma_next(curr_vma->vm_mm, curr_vma);
break;
}
if (!curr_vma) {
--
2.30.2

2021-04-28 17:40:27

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 92/94] mm: Return a bool from anon_vma_interval_tree_verify()

Return a bool so that the caller can report which vma has the issue.

Signed-off-by: Liam R. Howlett <[email protected]>
---
include/linux/mm.h | 2 +-
mm/interval_tree.c | 6 +++---
mm/mmap.c | 7 +++++--
3 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e1f1ae32fa9d..6bf5369ad319 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2526,7 +2526,7 @@ anon_vma_interval_tree_iter_first(struct rb_root_cached *root,
struct anon_vma_chain *anon_vma_interval_tree_iter_next(
struct anon_vma_chain *node, unsigned long start, unsigned long last);
#ifdef CONFIG_DEBUG_VM_RB
-void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
+bool anon_vma_interval_tree_verify(struct anon_vma_chain *node);
#endif

#define anon_vma_interval_tree_foreach(avc, root, start, last) \
diff --git a/mm/interval_tree.c b/mm/interval_tree.c
index 32e390c42c53..5958e27d4381 100644
--- a/mm/interval_tree.c
+++ b/mm/interval_tree.c
@@ -103,9 +103,9 @@ anon_vma_interval_tree_iter_next(struct anon_vma_chain *node,
}

#ifdef CONFIG_DEBUG_VM_RB
-void anon_vma_interval_tree_verify(struct anon_vma_chain *node)
+bool anon_vma_interval_tree_verify(struct anon_vma_chain *node)
{
- WARN_ON_ONCE(node->cached_vma_start != avc_start_pgoff(node));
- WARN_ON_ONCE(node->cached_vma_last != avc_last_pgoff(node));
+ return WARN_ON_ONCE(node->cached_vma_start != avc_start_pgoff(node)) ||
+ WARN_ON_ONCE(node->cached_vma_last != avc_last_pgoff(node));
}
#endif
diff --git a/mm/mmap.c b/mm/mmap.c
index c2baf006bcde..ae1ffe726405 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -326,8 +326,10 @@ static void validate_mm(struct mm_struct *mm)
struct anon_vma_chain *avc;
if (anon_vma) {
anon_vma_lock_read(anon_vma);
- list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
- anon_vma_interval_tree_verify(avc);
+ list_for_each_entry(avc, &vma->anon_vma_chain, same_vma) {
+ if (anon_vma_interval_tree_verify(avc))
+ pr_warn("Interval tree issue in %px\n", vma);
+ }
anon_vma_unlock_read(anon_vma);
}
#endif
@@ -339,6 +341,7 @@ static void validate_mm(struct mm_struct *mm)
pr_emerg("map_count %d mas_for_each %d\n", mm->map_count, i);
bug = 1;
}
+
VM_BUG_ON_MM(bug, mm);
}

--
2.30.2

2021-04-28 17:40:27

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 93/94] mm/mmap: Add mas_split_vma() and use it for munmap()

Use the maple state when splitting a vma so the state does not have to be
rewalked or reset after a split. This is also needed to clean up the
locking.
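
A condensed view of the resulting call pattern in do_mas_align_munmap()
(taken from the hunk below) shows the state being reused instead of
re-walked:

	error = mas_split_vma(mm, mas, vma, start, 0);
	if (error)
		return error;

	prev = vma;
	vma = mas_walk(mas);	/* mas still points at the affected range */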

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mmap.c | 185 +++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 175 insertions(+), 10 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index ae1ffe726405..5335bd72bda3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2182,6 +2182,178 @@ static void unmap_region(struct mm_struct *mm,
max);
tlb_finish_mmu(&tlb);
}
+
+/*
+ *
+ * Does not support inserting a new vma and modifying the other side of the
+ * vma; mas will point to insert or the newly zeroed area.
+ */
+static inline
+int vma_shrink(struct ma_state *mas, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end, pgoff_t pgoff,
+ struct vm_area_struct *insert)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct address_space *mapping = NULL;
+ struct rb_root_cached *root = NULL;
+ struct anon_vma *anon_vma = NULL;
+ struct file *file = vma->vm_file;
+ unsigned long old_end = vma->vm_end, old_start = vma->vm_start;
+
+ validate_mm(mm);
+ vma_adjust_trans_huge(vma, start, end, 0);
+ if (file) {
+ mapping = file->f_mapping;
+ root = &mapping->i_mmap;
+ uprobe_munmap(vma, vma->vm_start, vma->vm_end);
+
+ i_mmap_lock_write(mapping);
+ /*
+ * Put into interval tree now, so instantiated pages are visible
+ * to arm/parisc __flush_dcache_page throughout; but we cannot
+ * insert into address space until vma start or end is updated.
+ */
+
+ if (insert)
+ __vma_link_file(insert);
+ }
+
+ anon_vma = vma->anon_vma;
+ if (anon_vma) {
+ anon_vma_lock_write(anon_vma);
+ anon_vma_interval_tree_pre_update_vma(vma);
+ }
+
+ if (file) {
+ flush_dcache_mmap_lock(mapping);
+ vma_interval_tree_remove(vma, root);
+ }
+
+ vma->vm_start = start;
+ vma->vm_end = end;
+ vma->vm_pgoff = pgoff;
+ if (!insert) {
+
+ /* If vm_start changed, and the insert does not end at the old
+ * start, then that area needs to be zeroed
+ */
+ if (old_start != vma->vm_start) {
+ mas->last = end;
+ mas_store_gfp(mas, NULL, GFP_KERNEL);
+ }
+
+ /* If vm_end changed, and the insert does not start at the new
+ * end, then that area needs to be zeroed
+ */
+ if (old_end != vma->vm_end) {
+ mas->index = end;
+ mas->last = old_end;
+ mas_store_gfp(mas, NULL, GFP_KERNEL);
+ }
+ }
+
+ if (file) {
+ vma_interval_tree_insert(vma, root);
+ flush_dcache_mmap_unlock(mapping);
+ }
+
+ if (insert) { /* Insert the new vma into the maple tree. */
+ vma_mas_store(insert, mas);
+ mm->map_count++;
+ }
+
+ if (anon_vma) {
+ anon_vma_interval_tree_post_update_vma(vma);
+ anon_vma_unlock_write(anon_vma);
+ }
+
+ if (file) {
+ i_mmap_unlock_write(mapping);
+ uprobe_mmap(vma);
+ if (insert)
+ uprobe_mmap(insert);
+ }
+
+ validate_mm(mm);
+ return 0;
+}
+
+/*
+ * mas_split_vma() - Split the VMA into two.
+ *
+ * @mm: The mm_struct
+ * @mas: The maple state - must point to the vma being altered
+ * @vma: The vma to split
+ * @addr: The address to split @vma
+ * @new_below: Add the new vma at the lower address (first part) of vma.
+ *
+ * Note: The @mas must point to the vma that is being split or MAS_START.
+ * Upon return, @mas points to the new VMA. sysctl_max_map_count is not
+ * checked.
+ */
+int mas_split_vma(struct mm_struct *mm, struct ma_state *mas,
+ struct vm_area_struct *vma, unsigned long addr, int new_below)
+{
+ struct vm_area_struct *new;
+ int err;
+
+ validate_mm(mm);
+ if (vma->vm_ops && vma->vm_ops->may_split) {
+ err = vma->vm_ops->may_split(vma, addr);
+ if (err)
+ return err;
+ }
+
+ new = vm_area_dup(vma);
+ if (!new)
+ return -ENOMEM;
+
+ if (new_below)
+ new->vm_end = addr;
+ else {
+ new->vm_start = addr;
+ new->vm_pgoff += ((addr - vma->vm_start) >> PAGE_SHIFT);
+ }
+
+ err = vma_dup_policy(vma, new);
+ if (err)
+ goto out_free_vma;
+
+ err = anon_vma_clone(new, vma);
+ if (err)
+ goto out_free_mpol;
+
+ if (new->vm_file)
+ get_file(new->vm_file);
+
+ if (new->vm_ops && new->vm_ops->open)
+ new->vm_ops->open(new);
+
+ if (new_below)
+ err = vma_shrink(mas, vma, addr, vma->vm_end, vma->vm_pgoff +
+ ((addr - new->vm_start) >> PAGE_SHIFT), new);
+ else
+ err = vma_shrink(mas, vma, vma->vm_start, addr, vma->vm_pgoff,
+ new);
+
+ validate_mm(mm);
+ /* Success. */
+ if (!err)
+ return 0;
+
+ /* Clean everything up if vma_shrink() failed. */
+ if (new->vm_ops && new->vm_ops->close)
+ new->vm_ops->close(new);
+ if (new->vm_file)
+ fput(new->vm_file);
+ unlink_anon_vmas(new);
+ out_free_mpol:
+ mpol_put(vma_policy(new));
+ out_free_vma:
+ vm_area_free(new);
+ return err;
+}
+
/*
* __split_vma() bypasses sysctl_max_map_count checking. We use this where it
* has already been checked or doesn't make sense to fail.
@@ -2330,12 +2502,11 @@ static int do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
return -ENOMEM;

- error = __split_vma(mm, vma, start, 0);
+ error = mas_split_vma(mm, mas, vma, start, 0);
if (error)
return error;

prev = vma;
- mas_set_range(mas, start, end - 1);
vma = mas_walk(mas);

} else {
@@ -2353,11 +2524,10 @@ static int do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
/* Does it split the last one? */
if (last && end < last->vm_end) {
int error;
- error = __split_vma(mm, last, end, 1);
+ error = mas_split_vma(mm, mas, last, end, 1);
if (error)
return error;
- mas_set(mas, end - 1);
- last = mas_walk(mas);
+ validate_mm(mm);
}
next = mas_next(mas, ULONG_MAX);

@@ -2518,11 +2688,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vm_flags |= VM_ACCOUNT;
}

- mas_set_range(&mas, addr, end - 1);
- mas_walk(&mas); // Walk to the empty area (munmapped above)
ma_prev = mas;
prev = mas_prev(&ma_prev, 0);
-
if (vm_flags & VM_SPECIAL)
goto cannot_expand;

@@ -2694,10 +2861,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
* a completely new data area).
*/
vma->vm_flags |= VM_SOFTDIRTY;
-
vma_set_page_prot(vma);
validate_mm(mm);
-
return addr;

unmap_and_free_vma:
--
2.30.2

2021-04-28 17:40:27

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 91/94] mm: Remove vma linked list.

The vma linked list has been replaced by the maple tree iterators and the
vma_next()/vma_prev() functions.

As part of this change, the maple tree iterators are also used in
free_pgd_range(), zap_page_range(), and unmap_single_vma().

The internal _vma_next() has been dropped.
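
The typical conversion (a sketch of what the hunks below do, with
do_something() as a placeholder) replaces the linked-list walk

	for (vma = mm->mmap; vma; vma = vma->vm_next)
		do_something(vma);

with a maple tree iteration or an explicit neighbour lookup:

	MA_STATE(mas, &mm->mm_mt, 0, 0);

	mas_for_each(&mas, vma, ULONG_MAX)
		do_something(vma);

	/* or, when only the neighbour is needed: */
	next = vma_next(vma->vm_mm, vma);
	prev = vma_prev(vma->vm_mm, vma);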

Signed-off-by: Liam R. Howlett <[email protected]>
---
fs/proc/task_nommu.c | 2 +-
include/linux/mm.h | 2 +-
include/linux/mm_types.h | 6 -
kernel/fork.c | 15 +-
mm/debug.c | 12 +-
mm/internal.h | 4 +-
mm/memory.c | 41 ++-
mm/mmap.c | 573 +++++++++++++++++----------------------
mm/nommu.c | 14 +-
9 files changed, 301 insertions(+), 368 deletions(-)

diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index 8691a1216d1c..be02e8997ddf 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -246,7 +246,7 @@ static void *m_next(struct seq_file *m, void *_p, loff_t *pos)
struct vm_area_struct *vma = _p;

*pos = vma->vm_end;
- return vma->vm_next;
+ return vma_next(vma->vm_mm, vma);
}

static const struct seq_operations proc_pid_maps_ops = {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 82b076787515..e1f1ae32fa9d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1731,7 +1731,7 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
void zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size);
void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
- unsigned long start, unsigned long end);
+ struct ma_state *mas, unsigned long start, unsigned long end);

struct mmu_notifier_range;

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 304692ada024..ca9fb13c2aca 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -303,14 +303,11 @@ struct vm_userfaultfd_ctx {};
* library, the executable area etc).
*/
struct vm_area_struct {
- /* The first cache line has the info for VMA tree walking. */
-
unsigned long vm_start; /* Our start address within vm_mm. */
unsigned long vm_end; /* The first byte after our end address
within vm_mm. */

/* linked list of VM areas per task, sorted by address */
- struct vm_area_struct *vm_next, *vm_prev;
struct mm_struct *vm_mm; /* The address space we belong to. */

/*
@@ -324,7 +321,6 @@ struct vm_area_struct {
unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
* units
*/
- /* Second cache line starts here. */
struct file *vm_file; /* File we map to (can be NULL). */
/*
* For areas with an address space and backing store,
@@ -378,7 +374,6 @@ struct core_state {
struct kioctx_table;
struct mm_struct {
struct {
- struct vm_area_struct *mmap; /* list of VMAs */
struct maple_tree mm_mt;
#ifdef CONFIG_MMU
unsigned long (*get_unmapped_area) (struct file *filp,
@@ -393,7 +388,6 @@ struct mm_struct {
unsigned long mmap_compat_legacy_base;
#endif
unsigned long task_size; /* size of task vm space */
- unsigned long highest_vm_end; /* highest vma end address */
pgd_t * pgd;

#ifdef CONFIG_MEMBARRIER
diff --git a/kernel/fork.c b/kernel/fork.c
index fe0922f75cc5..6da1022e8758 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -364,7 +364,6 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
*/
*new = data_race(*orig);
INIT_LIST_HEAD(&new->anon_vma_chain);
- new->vm_next = new->vm_prev = NULL;
}
return new;
}
@@ -473,7 +472,7 @@ EXPORT_SYMBOL(free_task);
static __latent_entropy int dup_mmap(struct mm_struct *mm,
struct mm_struct *oldmm)
{
- struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
+ struct vm_area_struct *mpnt, *tmp;
int retval;
unsigned long charge = 0;
MA_STATE(old_mas, &oldmm->mm_mt, 0, 0);
@@ -500,7 +499,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
mm->exec_vm = oldmm->exec_vm;
mm->stack_vm = oldmm->stack_vm;

- pprev = &mm->mmap;
retval = ksm_fork(mm, oldmm);
if (retval)
goto out;
@@ -508,8 +506,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
if (retval)
goto out;

- prev = NULL;
-
retval = mas_entry_count(&mas, oldmm->map_count);
if (retval)
goto fail_nomem;
@@ -585,14 +581,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
if (is_vm_hugetlb_page(tmp))
reset_vma_resv_huge_pages(tmp);

- /*
- * Link in the new vma and copy the page table entries.
- */
- *pprev = tmp;
- pprev = &tmp->vm_next;
- tmp->vm_prev = prev;
- prev = tmp;
-
/* Link the vma into the MT */
mas.index = tmp->vm_start;
mas.last = tmp->vm_end - 1;
@@ -1024,7 +1012,6 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
struct user_namespace *user_ns)
{
- mm->mmap = NULL;
mt_init_flags(&mm->mm_mt, MAPLE_ALLOC_RANGE);
atomic_set(&mm->mm_users, 1);
atomic_set(&mm->mm_count, 1);
diff --git a/mm/debug.c b/mm/debug.c
index f382d319722a..55afc85c4fcc 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -203,8 +203,8 @@ void dump_vma(const struct vm_area_struct *vma)
"prot %lx anon_vma %px vm_ops %px\n"
"pgoff %lx file %px private_data %px\n"
"flags: %#lx(%pGv)\n",
- vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_next,
- vma->vm_prev, vma->vm_mm,
+ vma, (void *)vma->vm_start, (void *)vma->vm_end,
+ vma_next(vma->vm_mm, vma), vma_prev(vma->vm_mm, vma), vma->vm_mm,
(unsigned long)pgprot_val(vma->vm_page_prot),
vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
vma->vm_file, vma->vm_private_data,
@@ -214,11 +214,11 @@ EXPORT_SYMBOL(dump_vma);

void dump_mm(const struct mm_struct *mm)
{
- pr_emerg("mm %px mmap %px task_size %lu\n"
+ pr_emerg("mm %px task_size %lu\n"
#ifdef CONFIG_MMU
"get_unmapped_area %px\n"
#endif
- "mmap_base %lu mmap_legacy_base %lu highest_vm_end %lu\n"
+ "mmap_base %lu mmap_legacy_base %lu\n"
"pgd %px mm_users %d mm_count %d pgtables_bytes %lu map_count %d\n"
"hiwater_rss %lx hiwater_vm %lx total_vm %lx locked_vm %lx\n"
"pinned_vm %llx data_vm %lx exec_vm %lx stack_vm %lx\n"
@@ -242,11 +242,11 @@ void dump_mm(const struct mm_struct *mm)
"tlb_flush_pending %d\n"
"def_flags: %#lx(%pGv)\n",

- mm, mm->mmap, mm->task_size,
+ mm, mm->task_size,
#ifdef CONFIG_MMU
mm->get_unmapped_area,
#endif
- mm->mmap_base, mm->mmap_legacy_base, mm->highest_vm_end,
+ mm->mmap_base, mm->mmap_legacy_base,
mm->pgd, atomic_read(&mm->mm_users),
atomic_read(&mm->mm_count),
mm_pgtables_bytes(mm),
diff --git a/mm/internal.h b/mm/internal.h
index 34d00548aa81..0fb161ee7f73 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -36,8 +36,8 @@ void page_writeback_init(void);

vm_fault_t do_swap_page(struct vm_fault *vmf);

-void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
- unsigned long floor, unsigned long ceiling);
+void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
+ struct vm_area_struct *vma, unsigned long floor, unsigned long ceiling);

static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
{
diff --git a/mm/memory.c b/mm/memory.c
index 91e2a4c8dfd3..4de079ad0d48 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -399,13 +399,18 @@ void free_pgd_range(struct mmu_gather *tlb,
} while (pgd++, addr = next, addr != end);
}

-void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
- unsigned long floor, unsigned long ceiling)
+void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
+ struct vm_area_struct *vma, unsigned long floor, unsigned long ceiling)
{
- while (vma) {
- struct vm_area_struct *next = vma->vm_next;
+ struct vm_area_struct *next;
+ struct ma_state ma_next = *mas;
+
+ do {
unsigned long addr = vma->vm_start;

+ next = mas_find(&ma_next, ceiling - 1);
+ BUG_ON(vma->vm_start < floor);
+ BUG_ON(vma->vm_end - 1 > ceiling - 1);
/*
* Hide vma from rmap and truncate_pagecache before freeing
* pgtables
@@ -422,16 +427,20 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
*/
while (next && next->vm_start <= vma->vm_end + PMD_SIZE
&& !is_vm_hugetlb_page(next)) {
+ *mas = ma_next;
vma = next;
- next = vma->vm_next;
+ next = mas_find(&ma_next, ceiling - 1);
+ BUG_ON(vma->vm_start < floor);
+ BUG_ON(vma->vm_end - 1 > ceiling - 1);
unlink_anon_vmas(vma);
unlink_file_vma(vma);
}
free_pgd_range(tlb, addr, vma->vm_end,
floor, next ? next->vm_start : ceiling);
}
+ *mas = ma_next;
vma = next;
- }
+ } while (vma);
}

int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
@@ -1510,16 +1519,19 @@ static void unmap_single_vma(struct mmu_gather *tlb,
* drops the lock and schedules.
*/
void unmap_vmas(struct mmu_gather *tlb,
- struct vm_area_struct *vma, unsigned long start_addr,
- unsigned long end_addr)
+ struct vm_area_struct *vma, struct ma_state *mas,
+ unsigned long start_addr, unsigned long end_addr)
{
struct mmu_notifier_range range;

mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
start_addr, end_addr);
mmu_notifier_invalidate_range_start(&range);
- for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
+ do {
+ BUG_ON(vma->vm_start < start_addr);
+ BUG_ON(vma->vm_end > end_addr);
unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
+ } while ((vma = mas_find(mas, end_addr - 1)) != NULL);
mmu_notifier_invalidate_range_end(&range);
}

@@ -1536,15 +1548,20 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
{
struct mmu_notifier_range range;
struct mmu_gather tlb;
+ unsigned long end = start + size;
+ MA_STATE(mas, &vma->vm_mm->mm_mt, start, start);

lru_add_drain();
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
- start, start + size);
+ start, end);
tlb_gather_mmu(&tlb, vma->vm_mm);
update_hiwater_rss(vma->vm_mm);
mmu_notifier_invalidate_range_start(&range);
- for ( ; vma && vma->vm_start < range.end; vma = vma->vm_next)
- unmap_single_vma(&tlb, vma, start, range.end, NULL);
+ rcu_read_lock();
+ mas_for_each(&mas, vma, end - 1)
+ unmap_single_vma(&tlb, vma, start, end, NULL);
+ rcu_read_unlock();
+
mmu_notifier_invalidate_range_end(&range);
tlb_finish_mmu(&tlb);
}
diff --git a/mm/mmap.c b/mm/mmap.c
index ed1b9df86966..c2baf006bcde 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -74,10 +74,6 @@ int mmap_rnd_compat_bits __read_mostly = CONFIG_ARCH_MMAP_RND_COMPAT_BITS;
static bool ignore_rlimit_data;
core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);

-static void unmap_region(struct mm_struct *mm,
- struct vm_area_struct *vma, struct vm_area_struct *prev,
- unsigned long start, unsigned long end);
-
/* description of effects of mapping type and prot in current implementation.
* this is due to the limited x86 page protection hardware. The expected
* behavior is in parens:
@@ -175,10 +171,8 @@ void unlink_file_vma(struct vm_area_struct *vma)
/*
* Close a vm structure and free it, returning the next.
*/
-static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
+static void remove_vma(struct vm_area_struct *vma)
{
- struct vm_area_struct *next = vma->vm_next;
-
might_sleep();
if (vma->vm_ops && vma->vm_ops->close)
vma->vm_ops->close(vma);
@@ -186,13 +180,13 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
fput(vma->vm_file);
mpol_put(vma_policy(vma));
vm_area_free(vma);
- return next;
}

static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
unsigned long newbrk, unsigned long oldbrk,
struct list_head *uf);
-static int do_brk_flags(struct ma_state *mas, struct vm_area_struct **brkvma,
+static int do_brk_flags(struct ma_state *mas, struct ma_state *ma_prev,
+ struct vm_area_struct **brkvma,
unsigned long addr, unsigned long request,
unsigned long flags);
SYSCALL_DEFINE1(brk, unsigned long, brk)
@@ -205,6 +199,7 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
bool downgraded = false;
LIST_HEAD(uf);
MA_STATE(mas, &mm->mm_mt, 0, 0);
+ struct ma_state ma_neighbour;

if (mmap_write_lock_killable(mm))
return -EINTR;
@@ -261,7 +256,6 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
* before calling do_brk_munmap().
*/
mm->brk = brk;
- mas.last = oldbrk - 1;
ret = do_brk_munmap(&mas, brkvma, newbrk, oldbrk, &uf);
if (ret == 1) {
downgraded = true;
@@ -272,26 +266,26 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
mm->brk = origbrk;
goto out;
}
+ ma_neighbour = mas;
+ next = mas_next(&ma_neighbour, newbrk + PAGE_SIZE + stack_guard_gap);
/* Only check if the next VMA is within the stack_guard_gap of the
* expansion area */
- next = mas_next(&mas, newbrk + PAGE_SIZE + stack_guard_gap);
/* Check against existing mmap mappings. */
if (next && newbrk + PAGE_SIZE > vm_start_gap(next))
goto out;

- brkvma = mas_prev(&mas, mm->start_brk);
+ brkvma = mas_prev(&ma_neighbour, mm->start_brk);
if (brkvma) {
- if(brkvma->vm_start >= oldbrk)
+ if (brkvma->vm_start >= oldbrk)
goto out; // Trying to map over another vma.

- if (brkvma->vm_end <= min_brk) {
+ if (brkvma->vm_end <= min_brk)
brkvma = NULL;
- mas_reset(&mas);
- }
}

/* Ok, looks good - let it rip. */
- if (do_brk_flags(&mas, &brkvma, oldbrk, newbrk - oldbrk, 0) < 0)
+ if (do_brk_flags(&mas, &ma_neighbour, &brkvma, oldbrk,
+ newbrk - oldbrk, 0) < 0)
goto out;

mm->brk = brk;
@@ -316,85 +310,17 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
extern void mt_validate(struct maple_tree *mt);
extern void mt_dump(const struct maple_tree *mt);

-/* Validate the maple tree */
-static void validate_mm_mt(struct mm_struct *mm)
-{
- struct maple_tree *mt = &mm->mm_mt;
- struct vm_area_struct *vma_mt, *vma = mm->mmap;
-
- MA_STATE(mas, mt, 0, 0);
- rcu_read_lock();
- mas_for_each(&mas, vma_mt, ULONG_MAX) {
- if (xa_is_zero(vma_mt))
- continue;
-
- if (!vma)
- break;
-
- if ((vma != vma_mt) ||
- (vma->vm_start != vma_mt->vm_start) ||
- (vma->vm_end != vma_mt->vm_end) ||
- (vma->vm_start != mas.index) ||
- (vma->vm_end - 1 != mas.last)) {
- pr_emerg("issue in %s\n", current->comm);
- dump_stack();
-#ifdef CONFIG_DEBUG_VM
- dump_vma(vma_mt);
- pr_emerg("and vm_next\n");
- dump_vma(vma->vm_next);
-#endif // CONFIG_DEBUG_VM
- pr_emerg("mt piv: %px %lu - %lu\n", vma_mt,
- mas.index, mas.last);
- pr_emerg("mt vma: %px %lu - %lu\n", vma_mt,
- vma_mt->vm_start, vma_mt->vm_end);
- if (vma->vm_prev) {
- pr_emerg("ll prev: %px %lu - %lu\n",
- vma->vm_prev, vma->vm_prev->vm_start,
- vma->vm_prev->vm_end);
- }
- pr_emerg("ll vma: %px %lu - %lu\n", vma,
- vma->vm_start, vma->vm_end);
- if (vma->vm_next) {
- pr_emerg("ll next: %px %lu - %lu\n",
- vma->vm_next, vma->vm_next->vm_start,
- vma->vm_next->vm_end);
- }
-
- mt_dump(mas.tree);
- if (vma_mt->vm_end != mas.last + 1) {
- pr_err("vma: %px vma_mt %lu-%lu\tmt %lu-%lu\n",
- mm, vma_mt->vm_start, vma_mt->vm_end,
- mas.index, mas.last);
- mt_dump(mas.tree);
- }
- VM_BUG_ON_MM(vma_mt->vm_end != mas.last + 1, mm);
- if (vma_mt->vm_start != mas.index) {
- pr_err("vma: %px vma_mt %px %lu - %lu doesn't match\n",
- mm, vma_mt, vma_mt->vm_start, vma_mt->vm_end);
- mt_dump(mas.tree);
- }
- VM_BUG_ON_MM(vma_mt->vm_start != mas.index, mm);
- }
- VM_BUG_ON(vma != vma_mt);
- vma = vma->vm_next;
-
- }
- VM_BUG_ON(vma);
-
- rcu_read_unlock();
- mt_validate(&mm->mm_mt);
-}
-
static void validate_mm(struct mm_struct *mm)
{
int bug = 0;
int i = 0;
- unsigned long highest_address = 0;
- struct vm_area_struct *vma = mm->mmap;
+ struct vm_area_struct *vma;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

- validate_mm_mt(mm);
+ mmap_assert_locked(mm);

- while (vma) {
+ mt_validate(&mm->mm_mt);
+ mas_for_each(&mas, vma, ULONG_MAX) {
#ifdef CONFIG_DEBUG_VM_RB
struct anon_vma *anon_vma = vma->anon_vma;
struct anon_vma_chain *avc;
@@ -405,24 +331,18 @@ static void validate_mm(struct mm_struct *mm)
anon_vma_unlock_read(anon_vma);
}
#endif
- highest_address = vm_end_gap(vma);
- vma = vma->vm_next;
+ VM_BUG_ON(mas.index != vma->vm_start);
+ VM_BUG_ON(mas.last != vma->vm_end - 1);
i++;
}
if (i != mm->map_count) {
- pr_emerg("map_count %d vm_next %d\n", mm->map_count, i);
- bug = 1;
- }
- if (highest_address != mm->highest_vm_end) {
- pr_emerg("mm->highest_vm_end %lx, found %lx\n",
- mm->highest_vm_end, highest_address);
+ pr_emerg("map_count %d mas_for_each %d\n", mm->map_count, i);
bug = 1;
}
VM_BUG_ON_MM(bug, mm);
}

#else // !CONFIG_DEBUG_MAPLE_TREE
-#define validate_mm_mt(root) do { } while (0)
#define validate_mm(mm) do { } while (0)
#endif // CONFIG_DEBUG_MAPLE_TREE

@@ -469,7 +389,7 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
*
* Returns: True if there is an overlapping VMA, false otherwise
*/
-static bool range_has_overlap(struct mm_struct *mm, unsigned long start,
+static inline bool range_has_overlap(struct mm_struct *mm, unsigned long start,
unsigned long end, struct vm_area_struct **pprev)
{
struct vm_area_struct *existing;
@@ -480,24 +400,6 @@ static bool range_has_overlap(struct mm_struct *mm, unsigned long start,
return existing ? true : false;
}

-/*
- * _vma_next() - Get the next VMA or the first.
- * @mm: The mm_struct.
- * @vma: The current vma.
- *
- * If @vma is NULL, return the first vma in the mm.
- *
- * Returns: The next VMA after @vma.
- */
-static inline struct vm_area_struct *_vma_next(struct mm_struct *mm,
- struct vm_area_struct *vma)
-{
- if (!vma)
- return mm->mmap;
-
- return vma->vm_next;
-}
-
static unsigned long count_vma_pages_range(struct mm_struct *mm,
unsigned long addr, unsigned long end)
{
@@ -573,7 +475,7 @@ static inline void vma_mt_store(struct mm_struct *mm, struct vm_area_struct *vma
}

static void vma_mas_link(struct mm_struct *mm, struct vm_area_struct *vma,
- struct ma_state *mas, struct vm_area_struct *prev)
+ struct ma_state *mas)
{
struct address_space *mapping = NULL;

@@ -592,36 +494,26 @@ static void vma_mas_link(struct mm_struct *mm, struct vm_area_struct *vma,
validate_mm(mm);
}

-static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
- struct vm_area_struct *prev)
+static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
{
- struct address_space *mapping = NULL;
-
- if (vma->vm_file) {
- mapping = vma->vm_file->f_mapping;
- i_mmap_lock_write(mapping);
- }
-
- vma_mt_store(mm, vma);
- __vma_link_file(vma);
+ MA_STATE(mas, &mm->mm_mt, vma->vm_start, vma->vm_end - 1);

- if (mapping)
- i_mmap_unlock_write(mapping);
-
- mm->map_count++;
- validate_mm(mm);
+ vma_mas_link(mm, vma, &mas);
}

/*
* Helper for vma_adjust() in the split_vma insert case: insert a vma into the
* mm's list and the mm tree. It has already been inserted into the interval tree.
*/
-static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
+static inline void __insert_vm_struct(struct mm_struct *mm,
+ struct vm_area_struct *vma)
{
- struct vm_area_struct *prev;
+ MA_STATE(mas, &mm->mm_mt, vma->vm_start, vma->vm_end - 1);
+
+ BUG_ON(mas_find(&mas, vma->vm_end - 1));
+ mas_reset(&mas);

- BUG_ON(range_has_overlap(mm, vma->vm_start, vma->vm_end, &prev));
- vma_mt_store(mm, vma);
+ vma_mas_store(vma, &mas);
mm->map_count++;
}

@@ -637,9 +529,11 @@ inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,
bool remove_next = false;
int error;

+
if (next && (vma != next) && (end == next->vm_end)) {
+ /* Expanding existing VMA over a gap and next */
remove_next = true;
- if (next->anon_vma && !vma->anon_vma) {
+ if (next->anon_vma && !anon_vma) {
vma->anon_vma = next->anon_vma;
error = anon_vma_clone(vma, next);
if (error)
@@ -669,7 +563,6 @@ inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,
vma->vm_start = start;
vma->vm_end = end;
vma->vm_pgoff = pgoff;
- /* Note: mas must be pointing to the expanding VMA */
vma_mas_store(vma, mas);

if (file) {
@@ -678,11 +571,8 @@ inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,
}

/* Expanding over the next vma */
- if (remove_next && file) {
+ if (remove_next && file)
__remove_shared_vm_struct(next, file, mapping);
- } else if (!next) {
- mm->highest_vm_end = vm_end_gap(vma);
- }

if (anon_vma) {
anon_vma_interval_tree_post_update_vma(vma);
@@ -721,7 +611,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
struct vm_area_struct *expand)
{
struct mm_struct *mm = vma->vm_mm;
- struct vm_area_struct *next = vma->vm_next, *orig_vma = vma;
+ struct vm_area_struct *next = vma_next(mm, vma), *orig_vma = vma;
struct address_space *mapping = NULL;
struct rb_root_cached *root = NULL;
struct anon_vma *anon_vma = NULL;
@@ -730,6 +620,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
long adjust_next = 0;
int remove_next = 0;

+ validate_mm(mm);
if (next && !insert) {
struct vm_area_struct *exporter = NULL, *importer = NULL;

@@ -762,7 +653,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
*/
remove_next = 1 + (end > next->vm_end);
VM_WARN_ON(remove_next == 2 &&
- end != next->vm_next->vm_end);
+ end != vma_next(mm, next)->vm_end);
/* trim end to next, for case 6 first pass */
end = next->vm_end;
}
@@ -775,7 +666,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
* next, if the vma overlaps with it.
*/
if (remove_next == 2 && !next->anon_vma)
- exporter = next->vm_next;
+ exporter = vma_next(mm, next);

} else if (end > next->vm_start) {
/*
@@ -867,8 +758,6 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
else
vma_changed = true;
vma->vm_end = end;
- if (!next)
- mm->highest_vm_end = vm_end_gap(vma);
}

if (vma_changed)
@@ -936,7 +825,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
* "next->vm_prev->vm_end" changed and the
* "vma->vm_next" gap must be updated.
*/
- next = vma->vm_next;
+ next = vma_next(mm, vma);
} else {
/*
* For the scope of the comment "next" and
@@ -954,27 +843,6 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
remove_next = 1;
end = next->vm_end;
goto again;
- } else if (!next) {
- /*
- * If remove_next == 2 we obviously can't
- * reach this path.
- *
- * If remove_next == 3 we can't reach this
- * path because pre-swap() next is always not
- * NULL. pre-swap() "next" is not being
- * removed and its next->vm_end is not altered
- * (and furthermore "end" already matches
- * next->vm_end in remove_next == 3).
- *
- * We reach this only in the remove_next == 1
- * case if the "next" vma that was removed was
- * the highest vma of the mm. However in such
- * case next->vm_end == "end" and the extended
- * "vma" has vma->vm_end == next->vm_end so
- * mm->highest_vm_end doesn't need any update
- * in remove_next == 1 case.
- */
- VM_WARN_ON(mm->highest_vm_end != vm_end_gap(vma));
}
}
if (insert && file)
@@ -1134,10 +1002,14 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
if (vm_flags & VM_SPECIAL)
return NULL;

- next = _vma_next(mm, prev);
+ if (!prev)
+ next = find_vma(mm, 0);
+ else
+ next = vma_next(mm, prev);
+
area = next;
if (area && area->vm_end == end) /* cases 6, 7, 8 */
- next = next->vm_next;
+ next = vma_next(mm, next);

/* verify some invariant that must be enforced by the caller */
VM_WARN_ON(prev && addr <= prev->vm_start);
@@ -1203,6 +1075,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
return area;
}

+ validate_mm(mm);
return NULL;
}

@@ -1272,17 +1145,20 @@ static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old, struct vm_
struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = NULL;
+ struct vm_area_struct *next, *prev;

/* Try next first. */
- if (vma->vm_next) {
- anon_vma = reusable_anon_vma(vma->vm_next, vma, vma->vm_next);
+ next = vma_next(vma->vm_mm, vma);
+ if (next) {
+ anon_vma = reusable_anon_vma(next, vma, next);
if (anon_vma)
return anon_vma;
}

/* Try prev next. */
- if (vma->vm_prev)
- anon_vma = reusable_anon_vma(vma->vm_prev, vma->vm_prev, vma);
+ prev = vma_prev(vma->vm_mm, vma);
+ if (prev)
+ anon_vma = reusable_anon_vma(prev, prev, vma);

/*
* We might reach here with anon_vma == NULL if we can't find
@@ -2055,7 +1931,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
if (gap_addr < address || gap_addr > TASK_SIZE)
gap_addr = TASK_SIZE;

- next = vma->vm_next;
+ next = vma_next(mm, vma);
if (next && next->vm_start < gap_addr && vma_is_accessible(next)) {
if (!(next->vm_flags & VM_GROWSUP))
return -ENOMEM;
@@ -2101,8 +1977,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
vma->vm_end = address;
vma_mt_store(mm, vma);
anon_vma_interval_tree_post_update_vma(vma);
- if (!vma->vm_next)
- mm->highest_vm_end = vm_end_gap(vma);
spin_unlock(&mm->page_table_lock);

perf_event_mmap(vma);
@@ -2129,7 +2003,7 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
return -EPERM;

/* Enforce stack_guard_gap */
- prev = vma->vm_prev;
+ prev = vma_prev(mm, vma);
/* Check that both stack segments have the same anon_vma? */
if (prev && !(prev->vm_flags & VM_GROWSDOWN) &&
vma_is_accessible(prev)) {
@@ -2264,21 +2138,21 @@ EXPORT_SYMBOL_GPL(find_extend_vma);
*
* Called with the mm semaphore held.
*/
-static inline void remove_vma_list(struct mm_struct *mm,
- struct vm_area_struct *vma)
+static inline void remove_mt(struct mm_struct *mm, struct ma_state *mas)
{
+ struct vm_area_struct *vma;
unsigned long nr_accounted = 0;

/* Update high watermark before we lower total_vm */
update_hiwater_vm(mm);
- do {
+ mas_for_each(mas, vma, ULONG_MAX) {
long nrpages = vma_pages(vma);

if (vma->vm_flags & VM_ACCOUNT)
nr_accounted += nrpages;
vm_stat_account(mm, vma->vm_flags, -nrpages);
- vma = remove_vma(vma);
- } while (vma);
+ remove_vma(vma);
+ }
vm_unacct_memory(nr_accounted);
validate_mm(mm);
}
@@ -2289,21 +2163,22 @@ static inline void remove_vma_list(struct mm_struct *mm,
* Called with the mm semaphore held.
*/
static void unmap_region(struct mm_struct *mm,
- struct vm_area_struct *vma, struct vm_area_struct *prev,
- unsigned long start, unsigned long end)
+ struct vm_area_struct *vma, struct ma_state *mas,
+ unsigned long start, unsigned long end,
+ struct vm_area_struct *prev, unsigned long max)
{
- struct vm_area_struct *next = _vma_next(mm, prev);
struct mmu_gather tlb;
+ struct ma_state ma_pgtb = *mas;

lru_add_drain();
tlb_gather_mmu(&tlb, mm);
update_hiwater_rss(mm);
- unmap_vmas(&tlb, vma, start, end);
- free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
- next ? next->vm_start : USER_PGTABLES_CEILING);
+ unmap_vmas(&tlb, vma, mas, start, end);
+ free_pgtables(&tlb, &ma_pgtb, vma,
+ prev ? prev->vm_end : FIRST_USER_ADDRESS,
+ max);
tlb_finish_mmu(&tlb);
}
-
/*
* __split_vma() bypasses sysctl_max_map_count checking. We use this where it
* has already been checked or doesn't make sense to fail.
@@ -2313,7 +2188,6 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
{
struct vm_area_struct *new;
int err;
- validate_mm_mt(mm);

if (vma->vm_ops && vma->vm_ops->may_split) {
err = vma->vm_ops->may_split(vma, addr);
@@ -2366,7 +2240,6 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
mpol_put(vma_policy(new));
out_free_vma:
vm_area_free(new);
- validate_mm_mt(mm);
return err;
}

@@ -2383,24 +2256,31 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
return __split_vma(mm, vma, addr, new_below);
}

-static inline int unlock_range(struct vm_area_struct *start,
- struct vm_area_struct **tail, unsigned long limit)
+
+static inline void detach_range(struct mm_struct *mm, struct ma_state *mas,
+ struct ma_state *dst, struct vm_area_struct **vma)
{
- struct mm_struct *mm = start->vm_mm;
- struct vm_area_struct *tmp = start;
+ unsigned long start = dst->index;
+ unsigned long end = dst->last;
int count = 0;

- while (tmp && tmp->vm_start < limit) {
- *tail = tmp;
+ do {
count++;
- if (tmp->vm_flags & VM_LOCKED) {
- mm->locked_vm -= vma_pages(tmp);
- munlock_vma_pages_all(tmp);
+ *vma = mas_prev(mas, start);
+ BUG_ON((*vma)->vm_start < start);
+ BUG_ON((*vma)->vm_end > end + 1);
+ vma_mas_store(*vma, dst);
+ if ((*vma)->vm_flags & VM_LOCKED) {
+ mm->locked_vm -= vma_pages(*vma);
+ munlock_vma_pages_all(*vma);
}
- tmp = tmp->vm_next;
- }
+ } while ((*vma)->vm_start > start);

- return count;
+ /* Drop removed area from the tree */
+ mas->last = end;
+ mas_store_gfp(mas, NULL, GFP_KERNEL);
+ /* Decrement map_count */
+ mm->map_count -= count;
}

/* do_mas_align_munmap() - munmap the aligned region from @start to @end.
@@ -2419,9 +2299,17 @@ static int do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
struct mm_struct *mm, unsigned long start, unsigned long end,
struct list_head *uf, bool downgrade)
{
- struct vm_area_struct *prev, *last;
+ struct vm_area_struct *prev, *last, *next = NULL;
+ struct maple_tree mt_detach;
+ unsigned long max = USER_PGTABLES_CEILING;
+ MA_STATE(dst, NULL, start, end - 1);
+ struct ma_state tmp;
/* we have start < vma->vm_end */

+ validate_mm(mm);
+ /* arch_unmap() might do unmaps itself. */
+ arch_unmap(mm, start, end);
+
/*
* If we need to split any vma, do it now to save pain later.
*
@@ -2442,28 +2330,33 @@ static int do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
error = __split_vma(mm, vma, start, 0);
if (error)
return error;
+
prev = vma;
- vma = _vma_next(mm, prev);
- mas->index = start;
- mas_reset(mas);
+ mas_set_range(mas, start, end - 1);
+ vma = mas_walk(mas);
+
} else {
- prev = vma->vm_prev;
+ tmp = *mas;
+ prev = mas_prev(&tmp, 0);
}

- if (vma->vm_end >= end)
+ if (end < vma->vm_end) {
last = vma;
- else
- last = find_vma_intersection(mm, end - 1, end);
+ } else {
+ mas_set(mas, end - 1);
+ last = mas_walk(mas);
+ }

/* Does it split the last one? */
if (last && end < last->vm_end) {
- int error = __split_vma(mm, last, end, 1);
+ int error;
+ error = __split_vma(mm, last, end, 1);
if (error)
return error;
- vma = _vma_next(mm, prev);
- mas_reset(mas);
+ mas_set(mas, end - 1);
+ last = mas_walk(mas);
}
-
+ next = mas_next(mas, ULONG_MAX);

if (unlikely(uf)) {
/*
@@ -2481,27 +2374,15 @@ static int do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
return error;
}

- /*
- * unlock any mlock()ed ranges before detaching vmas, count the number
- * of VMAs to be dropped, and return the tail entry of the affected
- * area.
- */
- mm->map_count -= unlock_range(vma, &last, end);
- /* Drop removed area from the tree */
- mas_store_gfp(mas, NULL, GFP_KERNEL);
+ /* Point of no return */
+ mas_lock(mas);
+ if (next)
+ max = next->vm_start;

- /* Detach vmas from the MM linked list */
- vma->vm_prev = NULL;
- if (prev)
- prev->vm_next = last->vm_next;
- else
- mm->mmap = last->vm_next;
-
- if (last->vm_next) {
- last->vm_next->vm_prev = prev;
- last->vm_next = NULL;
- } else
- mm->highest_vm_end = prev ? vm_end_gap(prev) : 0;
+ mtree_init(&mt_detach, MAPLE_ALLOC_RANGE);
+ dst.tree = &mt_detach;
+ detach_range(mm, mas, &dst, &vma);
+ mas_unlock(mas);

/*
* Do not downgrade mmap_lock if we are next to VM_GROWSDOWN or
@@ -2509,7 +2390,7 @@ static int do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
* down_read(mmap_lock) and collide with the VMA we are about to unmap.
*/
if (downgrade) {
- if (last && (last->vm_flags & VM_GROWSDOWN))
+ if (next && (next->vm_flags & VM_GROWSDOWN))
downgrade = false;
else if (prev && (prev->vm_flags & VM_GROWSUP))
downgrade = false;
@@ -2517,11 +2398,16 @@ static int do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
mmap_write_downgrade(mm);
}

- unmap_region(mm, vma, prev, start, end);
+ /* Unmap the region */
+ unmap_region(mm, vma, &dst, start, end, prev, max);
+
+ /* Statistics and freeing VMAs */
+ mas_set(&dst, start);
+ remove_mt(mm, &dst);

- /* Fix up all other VM information */
- remove_vma_list(mm, vma);
+ mtree_destroy(&mt_detach);

+ validate_mm(mm);
return downgrade ? 1 : 0;
}

@@ -2546,16 +2432,14 @@ int do_mas_munmap(struct ma_state *mas, struct mm_struct *mm,
unsigned long end;
struct vm_area_struct *vma;

- if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start)
+ if ((offset_in_page(start)) || (start > TASK_SIZE) ||
+ (len > TASK_SIZE - start))
return -EINVAL;

end = start + PAGE_ALIGN(len);
if (end == start)
return -EINVAL;

- /* arch_unmap() might do unmaps itself. */
- arch_unmap(mm, start, end);
-
/* Find the first overlapping VMA */
vma = mas_find(mas, end - 1);
if (!vma)
@@ -2574,8 +2458,11 @@ int do_mas_munmap(struct ma_state *mas, struct mm_struct *mm,
int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
struct list_head *uf)
{
+ int ret;
MA_STATE(mas, &mm->mm_mt, start, start);
- return do_mas_munmap(&mas, mm, start, len, uf, false);
+
+ ret = do_mas_munmap(&mas, mm, start, len, uf, false);
+ return ret;
}

unsigned long mmap_region(struct file *file, unsigned long addr,
@@ -2584,15 +2471,18 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma = NULL;
- struct vm_area_struct *prev, *next;
+ struct vm_area_struct *prev, *next = NULL;
pgoff_t pglen = len >> PAGE_SHIFT;
unsigned long charged = 0;
unsigned long end = addr + len;
unsigned long merge_start = addr, merge_end = end;
+ unsigned long max = USER_PGTABLES_CEILING;
pgoff_t vm_pgoff;
int error;
+ struct ma_state ma_prev, tmp;
MA_STATE(mas, &mm->mm_mt, addr, end - 1);

+
/* Check against address space limit. */
if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
unsigned long nr_pages;
@@ -2608,57 +2498,66 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
return -ENOMEM;
}

+ validate_mm(mm);
/* Unmap any existing mapping in the area */
- if (do_mas_munmap(&mas, mm, addr, len, uf, false))
+ if (do_mas_munmap(&mas, mm, addr, len, uf, false)) {
return -ENOMEM;
+ }

/*
* Private writable mapping: check memory availability
*/
if (accountable_mapping(file, vm_flags)) {
charged = len >> PAGE_SHIFT;
- if (security_vm_enough_memory_mm(mm, charged))
+ if (security_vm_enough_memory_mm(mm, charged)) {
return -ENOMEM;
+ }
vm_flags |= VM_ACCOUNT;
}

+ mas_set_range(&mas, addr, end - 1);
+ mas_walk(&mas); // Walk to the empty area (munmapped above)
+ ma_prev = mas;
+ prev = mas_prev(&ma_prev, 0);

- if (vm_flags & VM_SPECIAL) {
- prev = mas_prev(&mas, 0);
+ if (vm_flags & VM_SPECIAL)
goto cannot_expand;
- }

/* Attempt to expand an old mapping */

/* Check next */
- next = mas_next(&mas, ULONG_MAX);
- if (next && next->vm_start == end && vma_policy(next) &&
- can_vma_merge_before(next, vm_flags, NULL, file, pgoff+pglen,
- NULL_VM_UFFD_CTX)) {
- merge_end = next->vm_end;
- vma = next;
- vm_pgoff = next->vm_pgoff - pglen;
+ tmp = mas;
+ next = mas_next(&tmp, ULONG_MAX);
+ if (next) {
+ max = next->vm_start;
+ if (next->vm_start == end && vma_policy(next) &&
+ can_vma_merge_before(next, vm_flags, NULL, file,
+ pgoff + pglen, NULL_VM_UFFD_CTX)) {
+ /* Try to expand next back over the requested area */
+ merge_end = next->vm_end;
+ vma = next;
+ vm_pgoff = next->vm_pgoff - pglen;
+ }
}

/* Check prev */
- prev = mas_prev(&mas, 0);
if (prev && prev->vm_end == addr && !vma_policy(prev) &&
can_vma_merge_after(prev, vm_flags, NULL, file, pgoff,
NULL_VM_UFFD_CTX)) {
+ /* Try to expand the prev over the requested area */
merge_start = prev->vm_start;
vma = prev;
+ mas = ma_prev;
vm_pgoff = prev->vm_pgoff;
}

-
/* Actually expand, if possible */
if (vma &&
!vma_expand(&mas, vma, merge_start, merge_end, vm_pgoff, next)) {
- khugepaged_enter_vma_merge(prev, vm_flags);
+ khugepaged_enter_vma_merge(vma, vm_flags);
goto expanded;
}

- mas_set_range(&mas, addr, end - 1);
cannot_expand:
/*
* Determine the object being mapped and call the appropriate
@@ -2704,9 +2603,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
* Answer: Yes, several device drivers can do it in their
* f_op->mmap method. -DaveM
*/
- WARN_ON_ONCE(addr != vma->vm_start);
+ if (addr != vma->vm_start) {
+ WARN_ON_ONCE(addr != vma->vm_start);
+ addr = vma->vm_start;
+ mas_set_range(&mas, addr, end - 1);
+ }

- addr = vma->vm_start;

/* If vm_flags changed after call_mmap(), we should try merge vma again
* as we may succeed this time.
@@ -2717,7 +2619,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
pgoff, NULL_VM_UFFD_CTX))) {
merge_start = prev->vm_start;
vm_pgoff = prev->vm_pgoff;
- if (!vma_expand(&mas, prev, merge_start, merge_end,
+ if (!vma_expand(&ma_prev, prev, merge_start, merge_end,
vm_pgoff, next)) {
/* ->mmap() can change vma->vm_file and fput the original file. So
* fput the vma->vm_file here or we would add an extra fput for file
@@ -2742,6 +2644,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
goto free_vma;
} else {
vma_set_anonymous(vma);
+ vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT;
}

/* Allow architectures to sanity-check the vm_flags */
@@ -2753,9 +2656,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
goto free_vma;
}

- mas.index = mas.last = addr;
- mas_walk(&mas);
- vma_mas_link(mm, vma, &mas, prev);
+ vma_mas_link(mm, vma, &mas);
+
/* Once vma denies write, undo our temporary denial count */
if (file) {
unmap_writable:
@@ -2791,6 +2693,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vma->vm_flags |= VM_SOFTDIRTY;

vma_set_page_prot(vma);
+ validate_mm(mm);

return addr;

@@ -2799,7 +2702,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vma->vm_file = NULL;

/* Undo any partial mapping done by a device driver. */
- unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
+ mas_set_range(&mas, addr, end - 1);
+ unmap_region(mm, vma, &mas, vma->vm_start, vma->vm_end, prev, max);
charged = 0;
if (vm_flags & VM_SHARED)
mapping_unmap_writable(file->f_mapping);
@@ -2823,7 +2727,6 @@ static int __vm_munmap(unsigned long start, size_t len, bool downgrade)

if (mmap_write_lock_killable(mm))
return -EINTR;
-
ret = do_mas_munmap(&mas, mm, start, len, &uf, downgrade);
/*
* Returning 1 indicates mmap_lock is downgraded.
@@ -2866,15 +2769,16 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
unsigned long populate = 0;
unsigned long ret = -EINVAL;
struct file *file;
+ MA_STATE(mas, &mm->mm_mt, start, start);

pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. See Documentation/vm/remap_file_pages.rst.\n",
current->comm, current->pid);

if (prot)
return ret;
+
start = start & PAGE_MASK;
size = size & PAGE_MASK;
-
if (start + size <= start)
return ret;

@@ -2885,20 +2789,22 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
if (mmap_write_lock_killable(mm))
return -EINTR;

- vma = find_vma(mm, start);
+ mas_set(&mas, start);
+ vma = mas_walk(&mas);

if (!vma || !(vma->vm_flags & VM_SHARED))
goto out;

- if (start < vma->vm_start)
+ if (!vma->vm_file)
goto out;

if (start + size > vma->vm_end) {
- struct vm_area_struct *next;
+ struct vm_area_struct *prev, *next;

- for (next = vma->vm_next; next; next = next->vm_next) {
+ prev = vma;
+ mas_for_each(&mas, next, start + size) {
/* hole between vmas ? */
- if (next->vm_start != next->vm_prev->vm_end)
+ if (next->vm_start != prev->vm_end)
goto out;

if (next->vm_file != vma->vm_file)
@@ -2909,6 +2815,8 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,

if (start + size <= next->vm_end)
break;
+
+ prev = next;
}

if (!next)
@@ -2954,9 +2862,10 @@ static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
struct list_head *uf)
{
struct mm_struct *mm = vma->vm_mm;
- struct vm_area_struct unmap;
+ struct vm_area_struct unmap, *next;
unsigned long unmap_pages;
int ret;
+ struct ma_state ma_next;

arch_unmap(mm, newbrk, oldbrk);

@@ -2972,11 +2881,17 @@ static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
vma_init(&unmap, mm);
unmap.vm_start = newbrk;
unmap.vm_end = oldbrk;
+ unmap.vm_pgoff = newbrk >> PAGE_SHIFT;
+ if (vma->anon_vma)
+ vma_set_anonymous(&unmap);
+
ret = userfaultfd_unmap_prep(&unmap, newbrk, oldbrk, uf);
if (ret)
return ret;
- ret = 1;

+ ret = 1;
+ ma_next = *mas;
+ next = mas_next(&ma_next, ULONG_MAX);
// Change the oldbrk of vma to the newbrk of the munmap area
vma_adjust_trans_huge(vma, vma->vm_start, newbrk, 0);
if (vma->anon_vma) {
@@ -2984,15 +2899,16 @@ static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
anon_vma_interval_tree_pre_update_vma(vma);
}

- vma->vm_end = newbrk;
if (vma_mas_remove(&unmap, mas))
goto mas_store_fail;

+ vma->vm_end = newbrk;
if (vma->anon_vma) {
anon_vma_interval_tree_post_update_vma(vma);
anon_vma_unlock_write(vma->anon_vma);
}

+ validate_mm(mm);
unmap_pages = vma_pages(&unmap);
if (unmap.vm_flags & VM_LOCKED) {
mm->locked_vm -= unmap_pages;
@@ -3000,14 +2916,15 @@ static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
}

mmap_write_downgrade(mm);
- unmap_region(mm, &unmap, vma, newbrk, oldbrk);
+ unmap_region(mm, &unmap, mas, newbrk, oldbrk, vma,
+ next ? next->vm_start : 0);
/* Statistics */
vm_stat_account(mm, unmap.vm_flags, -unmap_pages);
if (unmap.vm_flags & VM_ACCOUNT)
vm_unacct_memory(unmap_pages);

munmap_full_vma:
- validate_mm_mt(mm);
+ validate_mm(mm);
return ret;

mas_store_fail:
@@ -3031,15 +2948,15 @@ static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
* do not match then create a new anonymous VMA. Eventually we may be able to
* do some brk-specific accounting here.
*/
-static int do_brk_flags(struct ma_state *mas, struct vm_area_struct **brkvma,
+static int do_brk_flags(struct ma_state *mas, struct ma_state *ma_prev,
+ struct vm_area_struct **brkvma,
unsigned long addr, unsigned long len,
unsigned long flags)
{
struct mm_struct *mm = current->mm;
- struct vm_area_struct *prev = NULL, *vma;
+ struct vm_area_struct *vma;
int error;
unsigned long mapped_addr;
- validate_mm_mt(mm);

/* Until we need other flags, refuse anything except VM_EXEC. */
if ((flags & (~VM_EXEC)) != 0)
@@ -3064,7 +2981,6 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct **brkvma,
if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
return -ENOMEM;

- mas->last = addr + len - 1;
if (*brkvma) {
vma = *brkvma;
/* Expand the existing vma if possible; almost never a singular
@@ -3073,17 +2989,21 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct **brkvma,
if ((!vma->anon_vma ||
list_is_singular(&vma->anon_vma_chain)) &&
((vma->vm_flags & ~VM_SOFTDIRTY) == flags)){
- mas->index = vma->vm_start;
+ ma_prev->index = vma->vm_start;
+ ma_prev->last = addr + len - 1;

vma_adjust_trans_huge(vma, addr, addr + len, 0);
if (vma->anon_vma) {
anon_vma_lock_write(vma->anon_vma);
anon_vma_interval_tree_pre_update_vma(vma);
}
+ mas_lock(ma_prev);
vma->vm_end = addr + len;
vma->vm_flags |= VM_SOFTDIRTY;
- if (mas_store_gfp(mas, vma, GFP_KERNEL))
+ if (mas_store_gfp(ma_prev, vma, GFP_KERNEL)) {
+ mas_unlock(ma_prev);
goto mas_mod_fail;
+ }

if (vma->anon_vma) {
anon_vma_interval_tree_post_update_vma(vma);
@@ -3092,11 +3012,9 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct **brkvma,
khugepaged_enter_vma_merge(vma, flags);
goto out;
}
- prev = vma;
}
- mas->index = addr;
- mas_walk(mas);

+ mas->last = addr + len - 1;
/* create a vma struct for an anonymous mapping */
vma = vm_area_alloc(mm);
if (!vma)
@@ -3111,9 +3029,6 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct **brkvma,
if (vma_mas_store(vma, mas))
goto mas_store_fail;

- if (!prev)
- prev = mas_prev(mas, 0);
-
mm->map_count++;
*brkvma = vma;
out:
@@ -3123,7 +3038,6 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct **brkvma,
if (flags & VM_LOCKED)
mm->locked_vm += (len >> PAGE_SHIFT);
vma->vm_flags |= VM_SOFTDIRTY;
- validate_mm_mt(mm);
return 0;

mas_store_fail:
@@ -3162,7 +3076,7 @@ int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags)

// This vma left intentionally blank.
mas_walk(&mas);
- ret = do_brk_flags(&mas, &vma, addr, len, flags);
+ ret = do_brk_flags(&mas, &mas, &vma, addr, len, flags);
populate = ((mm->def_flags & VM_LOCKED) != 0);
mmap_write_unlock(mm);
if (populate && !ret)
@@ -3183,6 +3097,8 @@ void exit_mmap(struct mm_struct *mm)
struct mmu_gather tlb;
struct vm_area_struct *vma;
unsigned long nr_accounted = 0;
+ struct ma_state mas2;
+ MA_STATE(mas, &mm->mm_mt, FIRST_USER_ADDRESS, FIRST_USER_ADDRESS);

/* mm's last user has gone, and its about to be pulled down */
mmu_notifier_release(mm);
@@ -3211,32 +3127,43 @@ void exit_mmap(struct mm_struct *mm)
mmap_write_unlock(mm);
}

- if (mm->locked_vm)
- unlock_range(mm->mmap, &vma, ULONG_MAX);
+ if (mm->locked_vm) {
+ mas_for_each(&mas, vma, ULONG_MAX) {
+ if (vma->vm_flags & VM_LOCKED) {
+ mm->locked_vm -= vma_pages(vma);
+ munlock_vma_pages_all(vma);
+ }
+ }
+ mas_set(&mas, FIRST_USER_ADDRESS);
+ }

arch_exit_mmap(mm);

- vma = mm->mmap;
- if (!vma) /* Can happen if dup_mmap() received an OOM */
+ vma = mas_find(&mas, ULONG_MAX);
+ if (!vma) { /* Can happen if dup_mmap() received an OOM */
+ rcu_read_unlock();
return;
+ }

lru_add_drain();
flush_cache_mm(mm);
tlb_gather_mmu_fullmm(&tlb, mm);
/* update_hiwater_rss(mm) here? but nobody should be looking */
/* Use -1 here to ensure all VMAs in the mm are unmapped */
- unmap_vmas(&tlb, vma, 0, -1);
- free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
+ mas2 = mas;
+ unmap_vmas(&tlb, vma, &mas, 0, -1);
+ free_pgtables(&tlb, &mas2, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
tlb_finish_mmu(&tlb);

/*
* Walk the list again, actually closing and freeing it,
* with preemption enabled, without holding any MM locks.
*/
- while (vma) {
+ mas_set(&mas, 0);
+ mas_for_each(&mas, vma, ULONG_MAX) {
if (vma->vm_flags & VM_ACCOUNT)
nr_accounted += vma_pages(vma);
- vma = remove_vma(vma);
+ remove_vma(vma);
cond_resched();
}

@@ -3251,9 +3178,7 @@ void exit_mmap(struct mm_struct *mm)
*/
int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
{
- struct vm_area_struct *prev;
-
- if (range_has_overlap(mm, vma->vm_start, vma->vm_end, &prev))
+ if (find_vma_intersection(mm, vma->vm_start, vma->vm_end))
return -ENOMEM;

if ((vma->vm_flags & VM_ACCOUNT) &&
@@ -3277,7 +3202,7 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT;
}

- vma_link(mm, vma, prev);
+ vma_link(mm, vma);
return 0;
}

@@ -3295,7 +3220,6 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
struct vm_area_struct *new_vma, *prev;
bool faulted_in_anon_vma = true;

- validate_mm_mt(mm);
/*
* If anonymous vma has not yet been faulted, update new pgoff
* to match new location, to increase its chance of merging.
@@ -3348,10 +3272,9 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
get_file(new_vma->vm_file);
if (new_vma->vm_ops && new_vma->vm_ops->open)
new_vma->vm_ops->open(new_vma);
- vma_link(mm, new_vma, prev);
+ vma_link(mm, new_vma);
*need_rmap_locks = false;
}
- validate_mm_mt(mm);
return new_vma;

out_free_mempol:
@@ -3359,7 +3282,6 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
out_free_vma:
vm_area_free(new_vma);
out:
- validate_mm_mt(mm);
return NULL;
}

@@ -3496,7 +3418,6 @@ static struct vm_area_struct *__install_special_mapping(
int ret;
struct vm_area_struct *vma;

- validate_mm_mt(mm);
vma = vm_area_alloc(mm);
if (unlikely(vma == NULL))
return ERR_PTR(-ENOMEM);
@@ -3518,12 +3439,10 @@ static struct vm_area_struct *__install_special_mapping(

perf_event_mmap(vma);

- validate_mm_mt(mm);
return vma;

out:
vm_area_free(vma);
- validate_mm_mt(mm);
return ERR_PTR(ret);
}

@@ -3648,12 +3567,14 @@ int mm_take_all_locks(struct mm_struct *mm)
{
struct vm_area_struct *vma;
struct anon_vma_chain *avc;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

BUG_ON(mmap_read_trylock(mm));

mutex_lock(&mm_all_locks_mutex);
+ rcu_read_lock();

- for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ mas_for_each(&mas, vma, ULONG_MAX) {
if (signal_pending(current))
goto out_unlock;
if (vma->vm_file && vma->vm_file->f_mapping &&
@@ -3661,7 +3582,8 @@ int mm_take_all_locks(struct mm_struct *mm)
vm_lock_mapping(mm, vma->vm_file->f_mapping);
}

- for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ mas_set(&mas, 0);
+ mas_for_each(&mas, vma, ULONG_MAX) {
if (signal_pending(current))
goto out_unlock;
if (vma->vm_file && vma->vm_file->f_mapping &&
@@ -3669,7 +3591,8 @@ int mm_take_all_locks(struct mm_struct *mm)
vm_lock_mapping(mm, vma->vm_file->f_mapping);
}

- for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ mas_set(&mas, 0);
+ mas_for_each(&mas, vma, ULONG_MAX) {
if (signal_pending(current))
goto out_unlock;
if (vma->anon_vma)
@@ -3677,9 +3600,11 @@ int mm_take_all_locks(struct mm_struct *mm)
vm_lock_anon_vma(mm, avc->anon_vma);
}

+ rcu_read_unlock();
return 0;

out_unlock:
+ rcu_read_unlock();
mm_drop_all_locks(mm);
return -EINTR;
}
@@ -3728,17 +3653,21 @@ void mm_drop_all_locks(struct mm_struct *mm)
{
struct vm_area_struct *vma;
struct anon_vma_chain *avc;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

BUG_ON(mmap_read_trylock(mm));
BUG_ON(!mutex_is_locked(&mm_all_locks_mutex));

- for (vma = mm->mmap; vma; vma = vma->vm_next) {
+
+ rcu_read_lock();
+ mas_for_each(&mas, vma, ULONG_MAX) {
if (vma->anon_vma)
list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
vm_unlock_anon_vma(avc->anon_vma);
if (vma->vm_file && vma->vm_file->f_mapping)
vm_unlock_mapping(vma->vm_file->f_mapping);
}
+ rcu_read_unlock();

mutex_unlock(&mm_all_locks_mutex);
}
diff --git a/mm/nommu.c b/mm/nommu.c
index 916038bafc65..a99e276445ce 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1428,7 +1428,8 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, struct list
return -EINVAL;
if (end == vma->vm_end)
goto erase_whole_vma;
- vma = vma_next(mm, vma);
+
+ vma = vma_next(mm, vma);
} while (vma);
return -EINVAL;
} else {
@@ -1479,18 +1480,23 @@ SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
void exit_mmap(struct mm_struct *mm)
{
struct vm_area_struct *vma;
+ MA_STATE(mas, NULL, 0, 0);

if (!mm)
return;

+ mas.tree = &mm->mm_mt;
mm->total_vm = 0;
-
- while ((vma = mm->mmap)) {
- mm->mmap = vma_next(mm, vma);
+ rcu_read_lock();
+ mas_for_each(&mas, vma, ULONG_MAX) {
delete_vma_from_mm(vma);
delete_vma(mm, vma);
+ rcu_read_unlock();
+ mas_pause(&mas);
cond_resched();
+ rcu_read_lock();
}
+ rcu_read_unlock();
}

int vm_brk(unsigned long addr, unsigned long len)
--
2.30.2

2021-04-28 17:40:51

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 81/94] mm/mlock: Use maple tree iterators instead of vma linked list

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mlock.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 5e9f4dea4e96..c2ba408852f9 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -562,6 +562,7 @@ static int apply_vma_lock_flags(unsigned long start, size_t len,
unsigned long nstart, end, tmp;
struct vm_area_struct *vma, *prev;
int error;
+ MA_STATE(mas, &current->mm->mm_mt, start, start);

VM_BUG_ON(offset_in_page(start));
VM_BUG_ON(len != PAGE_ALIGN(len));
@@ -570,11 +571,11 @@ static int apply_vma_lock_flags(unsigned long start, size_t len,
return -EINVAL;
if (end == start)
return 0;
- vma = find_vma(current->mm, start);
- if (!vma || vma->vm_start > start)
+ vma = mas_walk(&mas);
+ if (!vma)
return -ENOMEM;

- prev = vma->vm_prev;
+ prev = mas_prev(&mas, 0);
if (start > vma->vm_start)
prev = vma;

@@ -596,7 +597,7 @@ static int apply_vma_lock_flags(unsigned long start, size_t len,
if (nstart >= end)
break;

- vma = prev->vm_next;
+ vma = vma_next(prev->vm_mm, prev);
if (!vma || vma->vm_start != nstart) {
error = -ENOMEM;
break;
@@ -617,15 +618,13 @@ static unsigned long count_mm_mlocked_page_nr(struct mm_struct *mm,
{
struct vm_area_struct *vma;
unsigned long count = 0;
+ MA_STATE(mas, &mm->mm_mt, start, start);

if (mm == NULL)
mm = current->mm;

- vma = find_vma(mm, start);
- if (vma == NULL)
- return 0;
-
- for (; vma ; vma = vma->vm_next) {
+ rcu_read_lock();
+ mas_for_each(&mas, vma, start + len) {
if (start >= vma->vm_end)
continue;
if (start + len <= vma->vm_start)
@@ -640,6 +639,7 @@ static unsigned long count_mm_mlocked_page_nr(struct mm_struct *mm,
count += vma->vm_end - vma->vm_start;
}
}
+ rcu_read_unlock();

return count >> PAGE_SHIFT;
}
@@ -740,6 +740,7 @@ static int apply_mlockall_flags(int flags)
{
struct vm_area_struct *vma, *prev = NULL;
vm_flags_t to_add = 0;
+ MA_STATE(mas, &current->mm->mm_mt, 0, 0);

current->mm->def_flags &= VM_LOCKED_CLEAR_MASK;
if (flags & MCL_FUTURE) {
@@ -758,7 +759,8 @@ static int apply_mlockall_flags(int flags)
to_add |= VM_LOCKONFAULT;
}

- for (vma = current->mm->mmap; vma ; vma = prev->vm_next) {
+ rcu_read_lock();
+ mas_for_each(&mas, vma, ULONG_MAX) {
vm_flags_t newflags;

newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
@@ -766,8 +768,12 @@ static int apply_mlockall_flags(int flags)

/* Ignore errors */
mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags);
+ rcu_read_unlock();
+ mas_pause(&mas);
cond_resched();
+ rcu_read_lock();
}
+ rcu_read_unlock();
out:
return 0;
}
--
2.30.2

2021-04-28 19:39:27

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 03/94] arch/arc/kernel/troubleshoot: use vma_lookup() instead of find_vma_intersection()

Use the new vma_lookup() call for abstraction & code readability.
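
The conversions in this series all follow the same pattern: a find_vma()
call plus an explicit vm_start containment check becomes a single
vma_lookup() call. Below is a minimal sketch of the semantics relied on
here, expressed in terms of find_vma(); it is illustrative only (the
helper itself is introduced earlier in the series), so the name is
suffixed to avoid suggesting it is the actual implementation:

/* Sketch: return the VMA containing @addr, or NULL if none does. */
static inline struct vm_area_struct *vma_lookup_sketch(struct mm_struct *mm,
							unsigned long addr)
{
	struct vm_area_struct *vma = find_vma(mm, addr);

	/* find_vma() may return the VMA after @addr; require containment. */
	if (vma && addr < vma->vm_start)
		vma = NULL;

	return vma;
}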

Signed-off-by: Liam R. Howlett <[email protected]>
---
arch/arc/kernel/troubleshoot.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/arc/kernel/troubleshoot.c b/arch/arc/kernel/troubleshoot.c
index a331bb5d8319..7654c2e42dc0 100644
--- a/arch/arc/kernel/troubleshoot.c
+++ b/arch/arc/kernel/troubleshoot.c
@@ -83,12 +83,12 @@ static void show_faulting_vma(unsigned long address)
* non-inclusive vma
*/
mmap_read_lock(active_mm);
- vma = find_vma(active_mm, address);
+ vma = vma_lookup(active_mm, address);

- /* check against the find_vma( ) behaviour which returns the next VMA
- * if the container VMA is not found
+ /* Lookup the vma at the address and report if the container VMA is not
+ * found
*/
- if (vma && (vma->vm_start <= address)) {
+ if (vma) {
char buf[ARC_PATH_MAX];
char *nm = "?";

--
2.30.2

2021-04-28 19:39:27

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 07/94] arch/mips/kernel/traps: Use vma_lookup() instead of find_vma_intersection()

Use the new vma_lookup() call for abstraction & code readability.

Signed-off-by: Liam R. Howlett <[email protected]>
---
arch/mips/kernel/traps.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/mips/kernel/traps.c b/arch/mips/kernel/traps.c
index 0b4e06303c55..6f07362de5ce 100644
--- a/arch/mips/kernel/traps.c
+++ b/arch/mips/kernel/traps.c
@@ -784,7 +784,6 @@ void force_fcr31_sig(unsigned long fcr31, void __user *fault_addr,
int process_fpemu_return(int sig, void __user *fault_addr, unsigned long fcr31)
{
int si_code;
- struct vm_area_struct *vma;

switch (sig) {
case 0:
@@ -800,8 +799,7 @@ int process_fpemu_return(int sig, void __user *fault_addr, unsigned long fcr31)

case SIGSEGV:
mmap_read_lock(current->mm);
- vma = find_vma(current->mm, (unsigned long)fault_addr);
- if (vma && (vma->vm_start <= (unsigned long)fault_addr))
+ if (vma_lookup(current->mm, (unsigned long)fault_addr))
si_code = SEGV_ACCERR;
else
si_code = SEGV_MAPERR;
--
2.30.2

2021-04-28 19:39:41

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 12/94] drm/amdgpu: Use vma_lookup() in amdgpu_ttm_tt_get_user_pages()

Using vma_lookup() allows for cleaner code, as the vma start address
validation required with find_vma() is no longer needed.

Signed-off-by: Liam R. Howlett <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 3bef0432cac2..bd8df9bc9e38 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -709,8 +709,8 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
}

mmap_read_lock(mm);
- vma = find_vma(mm, start);
- if (unlikely(!vma || start < vma->vm_start)) {
+ vma = vma_lookup(mm, start);
+ if (unlikely(!vma)) {
r = -EFAULT;
goto out_unlock;
}
--
2.30.2

2021-04-28 19:39:43

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 04/94] arch/arm64/kvm: Use vma_lookup() instead of find_vma_intersection()

Use the new vma_lookup() call for abstraction & code readability.

Signed-off-by: Liam R. Howlett <[email protected]>
---
arch/arm64/kvm/mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index cd4d51ae3d4a..11069db817f0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -855,7 +855,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,

/* Let's check if we will get back a huge page backed by hugetlbfs */
mmap_read_lock(current->mm);
- vma = find_vma_intersection(current->mm, hva, hva + 1);
+ vma = vma_lookup(current->mm, hva);
if (unlikely(!vma)) {
kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
mmap_read_unlock(current->mm);
--
2.30.2

2021-04-28 19:40:21

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 21/94] radix tree test suite: Enhancements for Maple Tree

Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
tools/testing/radix-tree/linux.c | 16 +++++++++++++++-
tools/testing/radix-tree/linux/kernel.h | 1 +
2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/tools/testing/radix-tree/linux.c b/tools/testing/radix-tree/linux.c
index 2d9c59df60de..93f7de81fbe8 100644
--- a/tools/testing/radix-tree/linux.c
+++ b/tools/testing/radix-tree/linux.c
@@ -24,15 +24,28 @@ struct kmem_cache {
int nr_objs;
void *objs;
void (*ctor)(void *);
+ unsigned int non_kernel;
};

+void kmem_cache_set_non_kernel(struct kmem_cache *cachep, unsigned int val)
+{
+ cachep->non_kernel = val;
+}
+
+unsigned long kmem_cache_get_alloc(struct kmem_cache *cachep)
+{
+ return cachep->size * nr_allocated;
+}
void *kmem_cache_alloc(struct kmem_cache *cachep, int gfp)
{
void *p;

- if (!(gfp & __GFP_DIRECT_RECLAIM))
+ if (!(gfp & __GFP_DIRECT_RECLAIM) && !cachep->non_kernel)
return NULL;

+ if (!(gfp & __GFP_DIRECT_RECLAIM))
+ cachep->non_kernel--;
+
pthread_mutex_lock(&cachep->lock);
if (cachep->nr_objs) {
struct radix_tree_node *node = cachep->objs;
@@ -116,5 +129,6 @@ kmem_cache_create(const char *name, unsigned int size, unsigned int align,
ret->nr_objs = 0;
ret->objs = NULL;
ret->ctor = ctor;
+ ret->non_kernel = 0;
return ret;
}
diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/radix-tree/linux/kernel.h
index 39867fd80c8f..c5c9d05f29da 100644
--- a/tools/testing/radix-tree/linux/kernel.h
+++ b/tools/testing/radix-tree/linux/kernel.h
@@ -14,6 +14,7 @@
#include "../../../include/linux/kconfig.h"

#define printk printf
+#define pr_err printk
#define pr_info printk
#define pr_debug printk
#define pr_cont printk
--
2.30.2

2021-04-28 19:40:21

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 16/94] lib/test_hmm: Use vma_lookup() in dmirror_migrate()

vma_lookup() will only return the vma which contains the address and
will not return the next vma. This makes the code easier to
understand.

Signed-off-by: Liam R. Howlett <[email protected]>
---
lib/test_hmm.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 80a78877bd93..15f2e2db77bc 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -686,9 +686,8 @@ static int dmirror_migrate(struct dmirror *dmirror,

mmap_read_lock(mm);
for (addr = start; addr < end; addr = next) {
- vma = find_vma(mm, addr);
- if (!vma || addr < vma->vm_start ||
- !(vma->vm_flags & VM_READ)) {
+ vma = vma_lookup(mm, addr);
+ if (!vma || !(vma->vm_flags & VM_READ)) {
ret = -EINVAL;
goto out;
}
--
2.30.2

2021-04-28 19:40:37

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 20/94] mm/memory.c: Use vma_lookup() instead of find_vma_intersection()

Use the new vma_lookup() call for abstraction & code readability.

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/memory.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 27014c3bde9f..91e2a4c8dfd3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4944,8 +4944,8 @@ int __access_remote_vm(struct mm_struct *mm, unsigned long addr, void *buf,
* Check if this is a VM_IO | VM_PFNMAP VMA, which
* we can access using slightly different code.
*/
- vma = find_vma(mm, addr);
- if (!vma || vma->vm_start > addr)
+ vma = vma_lookup(mm, addr);
+ if (!vma)
break;
if (vma->vm_ops && vma->vm_ops->access)
ret = vma->vm_ops->access(vma, addr, buf,
--
2.30.2

2021-04-28 19:41:18

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 19/94] mm/mremap: Use vma_lookup() in vma_to_resize()

vma_lookup() checks the limits of the vma, so the code can be clearer.

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mremap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 47c255b60150..04143755cd1e 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -634,10 +634,10 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
unsigned long *p)
{
struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma = find_vma(mm, addr);
+ struct vm_area_struct *vma = vma_lookup(mm, addr);
unsigned long pgoff;

- if (!vma || vma->vm_start > addr)
+ if (!vma)
return ERR_PTR(-EFAULT);

/*
--
2.30.2

2021-04-28 19:58:22

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 29/94] mm/mmap: Change find_vma() to use the maple tree

Start using the maple tree to find VMA entries in an mm_struct.
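
The kernel-doc added below makes the contract explicit: find_vma() may
return the VMA after @addr rather than one containing it, so callers
that need containment must still check vm_start (or use vma_lookup()
from earlier in the series). A hypothetical caller, shown only to
illustrate the returned-VMA semantics; the function name is made up for
the example:

/* Hypothetical example: does any VMA contain @addr? */
static bool addr_is_mapped(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma;
	bool mapped = false;

	mmap_read_lock(mm);
	vma = find_vma(mm, addr);	/* containing VMA, next VMA, or NULL */
	if (vma && addr >= vma->vm_start)
		mapped = true;		/* @addr lies in [vm_start, vm_end) */
	mmap_read_unlock(mm);

	return mapped;
}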

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mmap.c | 28 ++++++++++------------------
1 file changed, 10 insertions(+), 18 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 112be171b662..3a9a9aee2f63 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2483,10 +2483,16 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,

EXPORT_SYMBOL(get_unmapped_area);

-/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
+/**
+ * find_vma() - Find the VMA for a given address, or the next vma.
+ * @mm: The mm_struct to check
+ * @addr: The address
+ *
+ * Returns: The VMA associated with addr, or the next vma.
+ * May return %NULL in the case of no vma at addr or above.
+ */
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
- struct rb_node *rb_node;
struct vm_area_struct *vma;

/* Check the cache first. */
@@ -2494,24 +2500,10 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
if (likely(vma))
return vma;

- rb_node = mm->mm_rb.rb_node;
-
- while (rb_node) {
- struct vm_area_struct *tmp;
-
- tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
-
- if (tmp->vm_end > addr) {
- vma = tmp;
- if (tmp->vm_start <= addr)
- break;
- rb_node = rb_node->rb_left;
- } else
- rb_node = rb_node->rb_right;
- }
-
+ vma = mt_find(&mm->mm_mt, &addr, ULONG_MAX);
if (vma)
vmacache_update(addr, vma);
+
return vma;
}
EXPORT_SYMBOL(find_vma);
--
2.30.2

2021-04-28 19:58:32

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 27/94] mm: Start tracking VMAs with maple tree

Start tracking the VMAs with the new maple tree structure in parallel
with the rb_tree. Add debug and trace events for maple tree operations,
and mirror the rb_tree that is built during fork into the maple tree.
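
As the vma_mas_store()/vma_mas_remove() helpers added to mm/internal.h
below show, a VMA occupies the inclusive range [vm_start, vm_end - 1]
in the maple tree. A minimal sketch of a store followed by a lookup,
using only interfaces visible in this series; the wrapper function is
hypothetical, it assumes the caller holds the mmap lock as the users in
this series do, and allocation failure handling is omitted for brevity:

static void vma_store_sketch(struct mm_struct *mm, struct vm_area_struct *vma)
{
	MA_STATE(mas, &mm->mm_mt, vma->vm_start, vma->vm_end - 1);

	/* Store the VMA over its inclusive range. */
	mas_lock(&mas);
	mas_store_gfp(&mas, vma, GFP_KERNEL);
	mas_unlock(&mas);

	/* Any address inside the range walks back to the same VMA. */
	mas_set(&mas, vma->vm_start);
	VM_BUG_ON(mas_walk(&mas) != vma);
}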

Signed-off-by: Liam R. Howlett <[email protected]>
---
arch/x86/kernel/tboot.c | 1 +
drivers/firmware/efi/efi.c | 1 +
include/linux/mm.h | 2 +
include/linux/mm_types.h | 2 +
include/trace/events/mmap.h | 71 ++++++++++++
init/main.c | 2 +
kernel/fork.c | 4 +
mm/init-mm.c | 2 +
mm/internal.h | 44 +++++++
mm/mmap.c | 224 +++++++++++++++++++++++++++++++++++-
10 files changed, 351 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index f9af561c3cd4..6f978f722dff 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -98,6 +98,7 @@ void __init tboot_probe(void)
static pgd_t *tboot_pg_dir;
static struct mm_struct tboot_mm = {
.mm_rb = RB_ROOT,
+ .mm_mt = MTREE_INIT(mm_mt, MAPLE_ALLOC_RANGE),
.pgd = swapper_pg_dir,
.mm_users = ATOMIC_INIT(2),
.mm_count = ATOMIC_INIT(1),
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 4b7ee3fa9224..271ae8c7bb07 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -55,6 +55,7 @@ static unsigned long __initdata rt_prop = EFI_INVALID_TABLE_ADDR;

struct mm_struct efi_mm = {
.mm_rb = RB_ROOT,
+ .mm_mt = MTREE_INIT(mm_mt, MAPLE_ALLOC_RANGE),
.mm_users = ATOMIC_INIT(2),
.mm_count = ATOMIC_INIT(1),
.write_protect_seq = SEQCNT_ZERO(efi_mm.write_protect_seq),
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7f7dff6ad884..e89bacfa9145 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2498,6 +2498,8 @@ extern bool arch_has_descending_max_zone_pfns(void);
/* nommu.c */
extern atomic_long_t mmap_pages_allocated;
extern int nommu_shrink_inode_mappings(struct inode *, size_t, size_t);
+/* maple_tree */
+void vma_store(struct mm_struct *mm, struct vm_area_struct *vma);

/* interval_tree.c */
void vma_interval_tree_insert(struct vm_area_struct *node,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6613b26a8894..51733fc44daf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -8,6 +8,7 @@
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/rbtree.h>
+#include <linux/maple_tree.h>
#include <linux/rwsem.h>
#include <linux/completion.h>
#include <linux/cpumask.h>
@@ -387,6 +388,7 @@ struct kioctx_table;
struct mm_struct {
struct {
struct vm_area_struct *mmap; /* list of VMAs */
+ struct maple_tree mm_mt;
struct rb_root mm_rb;
u64 vmacache_seqnum; /* per-thread vmacache */
#ifdef CONFIG_MMU
diff --git a/include/trace/events/mmap.h b/include/trace/events/mmap.h
index 4661f7ba07c0..4ffe3d348966 100644
--- a/include/trace/events/mmap.h
+++ b/include/trace/events/mmap.h
@@ -42,6 +42,77 @@ TRACE_EVENT(vm_unmapped_area,
__entry->low_limit, __entry->high_limit, __entry->align_mask,
__entry->align_offset)
);
+
+TRACE_EVENT(vma_mt_szero,
+ TP_PROTO(struct mm_struct *mm, unsigned long start,
+ unsigned long end),
+
+ TP_ARGS(mm, start, end),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct*, mm)
+ __field(unsigned long, start)
+ __field(unsigned long, end)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->start = start;
+ __entry->end = end - 1;
+ ),
+
+ TP_printk("mt_mod %px, (NULL), SNULL, %lu, %lu,",
+ __entry->mm,
+ (unsigned long) __entry->start,
+ (unsigned long) __entry->end
+ )
+);
+
+TRACE_EVENT(vma_mt_store,
+ TP_PROTO(struct mm_struct *mm, struct vm_area_struct *vma),
+
+ TP_ARGS(mm, vma),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct*, mm)
+ __field(struct vm_area_struct*, vma)
+ __field(unsigned long, vm_start)
+ __field(unsigned long, vm_end)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->vma = vma;
+ __entry->vm_start = vma->vm_start;
+ __entry->vm_end = vma->vm_end - 1;
+ ),
+
+ TP_printk("mt_mod %px, (%px), STORE, %lu, %lu,",
+ __entry->mm, __entry->vma,
+ (unsigned long) __entry->vm_start,
+ (unsigned long) __entry->vm_end
+ )
+);
+
+
+TRACE_EVENT(exit_mmap,
+ TP_PROTO(struct mm_struct *mm),
+
+ TP_ARGS(mm),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct*, mm)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ ),
+
+ TP_printk("mt_mod %px, DESTROY\n",
+ __entry->mm
+ )
+);
+
#endif

/* This part must be outside protection */
diff --git a/init/main.c b/init/main.c
index 7b6f49c4d388..f559c8fb5300 100644
--- a/init/main.c
+++ b/init/main.c
@@ -115,6 +115,7 @@ static int kernel_init(void *);

extern void init_IRQ(void);
extern void radix_tree_init(void);
+extern void maple_tree_init(void);

/*
* Debug helper: via this flag we know that we are in 'early bootup code'
@@ -951,6 +952,7 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
"Interrupts were enabled *very* early, fixing it\n"))
local_irq_disable();
radix_tree_init();
+ maple_tree_init();

/*
* Set up housekeeping before setting up workqueues to allow the unbound
diff --git a/kernel/fork.c b/kernel/fork.c
index 9de8c967c2d5..c37abaf28eb9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -593,6 +593,9 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
rb_link = &tmp->vm_rb.rb_right;
rb_parent = &tmp->vm_rb;

+ /* Link the vma into the MT */
+ vma_store(mm, tmp);
+
mm->map_count++;
if (!(tmp->vm_flags & VM_WIPEONFORK))
retval = copy_page_range(tmp, mpnt);
@@ -1018,6 +1021,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
{
mm->mmap = NULL;
mm->mm_rb = RB_ROOT;
+ mt_init_flags(&mm->mm_mt, MAPLE_ALLOC_RANGE);
mm->vmacache_seqnum = 0;
atomic_set(&mm->mm_users, 1);
atomic_set(&mm->mm_count, 1);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 153162669f80..2014d4b82294 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/mm_types.h>
#include <linux/rbtree.h>
+#include <linux/maple_tree.h>
#include <linux/rwsem.h>
#include <linux/spinlock.h>
#include <linux/list.h>
@@ -28,6 +29,7 @@
*/
struct mm_struct init_mm = {
.mm_rb = RB_ROOT,
+ .mm_mt = MTREE_INIT(mm_mt, MAPLE_ALLOC_RANGE),
.pgd = swapper_pg_dir,
.mm_users = ATOMIC_INIT(2),
.mm_count = ATOMIC_INIT(1),
diff --git a/mm/internal.h b/mm/internal.h
index f469f69309de..7ad55938d391 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -346,6 +346,50 @@ static inline bool is_data_mapping(vm_flags_t flags)
return (flags & (VM_WRITE | VM_SHARED | VM_STACK)) == VM_WRITE;
}

+/* Maple tree operations using VMAs */
+/*
+ * vma_mas_store() - Store a VMA in the maple tree.
+ * @vma: The vm_area_struct
+ * @mas: The maple state
+ *
+ * Efficient way to store a VMA in the maple tree when the @mas has already
+ * walked to the correct location.
+ *
+ * Note: the end address is inclusive in the maple tree.
+ */
+static inline int vma_mas_store(struct vm_area_struct *vma, struct ma_state *mas)
+{
+ int ret;
+
+ mas->index = vma->vm_start;
+ mas->last = vma->vm_end - 1;
+ mas_lock(mas);
+ ret = mas_store_gfp(mas, vma, GFP_KERNEL);
+ mas_unlock(mas);
+ return ret;
+}
+
+/*
+ * vma_mas_remove() - Remove a VMA from the maple tree.
+ * @vma: The vm_area_struct
+ * @mas: The maple state
+ *
+ * Efficient way to remove a VMA from the maple tree when the @mas has already
+ * been established and points to the correct location.
+ * Note: the end address is inclusive in the maple tree.
+ */
+static inline int vma_mas_remove(struct vm_area_struct *vma, struct ma_state *mas)
+{
+ int ret;
+
+ mas->index = vma->vm_start;
+ mas->last = vma->vm_end - 1;
+ mas_lock(mas);
+ ret = mas_store_gfp(mas, NULL, GFP_KERNEL);
+ mas_unlock(mas);
+ return ret;
+}
+
/* mm/util.c */
void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
struct vm_area_struct *prev);
diff --git a/mm/mmap.c b/mm/mmap.c
index 81f5595a8490..bce25db96fd1 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -377,7 +377,73 @@ static int browse_rb(struct mm_struct *mm)
}
return bug ? -1 : i;
}
+#if defined(CONFIG_DEBUG_MAPLE_TREE)
+extern void mt_validate(struct maple_tree *mt);
+extern void mt_dump(const struct maple_tree *mt);

+/* Validate the maple tree */
+static void validate_mm_mt(struct mm_struct *mm)
+{
+ struct maple_tree *mt = &mm->mm_mt;
+ struct vm_area_struct *vma_mt, *vma = mm->mmap;
+
+ MA_STATE(mas, mt, 0, 0);
+ rcu_read_lock();
+ mas_for_each(&mas, vma_mt, ULONG_MAX) {
+ if (xa_is_zero(vma_mt))
+ continue;
+
+ if (!vma)
+ break;
+
+ if ((vma != vma_mt) ||
+ (vma->vm_start != vma_mt->vm_start) ||
+ (vma->vm_end != vma_mt->vm_end) ||
+ (vma->vm_start != mas.index) ||
+ (vma->vm_end - 1 != mas.last)) {
+ pr_emerg("issue in %s\n", current->comm);
+ dump_stack();
+#ifdef CONFIG_DEBUG_VM
+ dump_vma(vma_mt);
+ pr_emerg("and next in rb\n");
+ dump_vma(vma->vm_next);
+#endif
+ pr_emerg("mt piv: %px %lu - %lu\n", vma_mt,
+ mas.index, mas.last);
+ pr_emerg("mt vma: %px %lu - %lu\n", vma_mt,
+ vma_mt->vm_start, vma_mt->vm_end);
+ pr_emerg("rb vma: %px %lu - %lu\n", vma,
+ vma->vm_start, vma->vm_end);
+ pr_emerg("rb->next = %px %lu - %lu\n", vma->vm_next,
+ vma->vm_next->vm_start, vma->vm_next->vm_end);
+
+ mt_dump(mas.tree);
+ if (vma_mt->vm_end != mas.last + 1) {
+ pr_err("vma: %px vma_mt %lu-%lu\tmt %lu-%lu\n",
+ mm, vma_mt->vm_start, vma_mt->vm_end,
+ mas.index, mas.last);
+ mt_dump(mas.tree);
+ }
+ VM_BUG_ON_MM(vma_mt->vm_end != mas.last + 1, mm);
+ if (vma_mt->vm_start != mas.index) {
+ pr_err("vma: %px vma_mt %px %lu - %lu doesn't match\n",
+ mm, vma_mt, vma_mt->vm_start, vma_mt->vm_end);
+ mt_dump(mas.tree);
+ }
+ VM_BUG_ON_MM(vma_mt->vm_start != mas.index, mm);
+ }
+ VM_BUG_ON(vma != vma_mt);
+ vma = vma->vm_next;
+
+ }
+ VM_BUG_ON(vma);
+
+ rcu_read_unlock();
+ mt_validate(&mm->mm_mt);
+}
+#else
+#define validate_mm_mt(root) do { } while (0)
+#endif
static void validate_mm_rb(struct rb_root *root, struct vm_area_struct *ignore)
{
struct rb_node *nd;
@@ -432,6 +498,7 @@ static void validate_mm(struct mm_struct *mm)
}
#else
#define validate_mm_rb(root, ignore) do { } while (0)
+#define validate_mm_mt(root) do { } while (0)
#define validate_mm(mm) do { } while (0)
#endif

@@ -610,6 +677,7 @@ static unsigned long count_vma_pages_range(struct mm_struct *mm,
unsigned long addr, unsigned long end)
{
unsigned long nr_pages = 0;
+ unsigned long nr_mt_pages = 0;
struct vm_area_struct *vma;

/* Find first overlapping mapping */
@@ -631,6 +699,13 @@ static unsigned long count_vma_pages_range(struct mm_struct *mm,
nr_pages += overlap_len >> PAGE_SHIFT;
}

+ mt_for_each(&mm->mm_mt, vma, addr, end) {
+ nr_mt_pages +=
+ (min(end, vma->vm_end) - vma->vm_start) >> PAGE_SHIFT;
+ }
+
+ VM_BUG_ON_MM(nr_pages != nr_mt_pages, mm);
+
return nr_pages;
}

@@ -677,11 +752,44 @@ static void __vma_link_file(struct vm_area_struct *vma)
}
}

+/*
+ * vma_mt_szero() - Set a given range to zero. Used when modifying a
+ * vm_area_struct start or end.
+ *
+ * @mm: The mm_struct
+ * @start: The start address to zero
+ * @end: The end address to zero.
+ */
+static inline void vma_mt_szero(struct mm_struct *mm, unsigned long start,
+ unsigned long end)
+{
+ trace_vma_mt_szero(mm, start, end);
+ mtree_store_range(&mm->mm_mt, start, end - 1, NULL, GFP_KERNEL);
+}
+
+/*
+ * vma_mt_store() - Store a given vm_area_struct in the maple tree.
+ *
+ * @mm: The mm_struct
+ * @vma: The vm_area_struct to store in the maple tree.
+ */
+static inline void vma_mt_store(struct mm_struct *mm, struct vm_area_struct *vma)
+{
+ trace_vma_mt_store(mm, vma);
+ mtree_store_range(&mm->mm_mt, vma->vm_start, vma->vm_end - 1, vma,
+ GFP_KERNEL);
+}
+
+void vma_store(struct mm_struct *mm, struct vm_area_struct *vma) {
+ vma_mt_store(mm, vma);
+}
+
static void
__vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
struct vm_area_struct *prev, struct rb_node **rb_link,
struct rb_node *rb_parent)
{
+ vma_mt_store(mm, vma);
__vma_link_list(mm, vma, prev);
__vma_link_rb(mm, vma, rb_link, rb_parent);
}
@@ -754,6 +862,9 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
long adjust_next = 0;
int remove_next = 0;

+ validate_mm(mm);
+ validate_mm_mt(mm);
+
if (next && !insert) {
struct vm_area_struct *exporter = NULL, *importer = NULL;

@@ -879,17 +990,28 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
}

if (start != vma->vm_start) {
+ unsigned long old_start = vma->vm_start;
vma->vm_start = start;
+ if (old_start < start)
+ vma_mt_szero(mm, old_start, start);
start_changed = true;
}
if (end != vma->vm_end) {
+ unsigned long old_end = vma->vm_end;
vma->vm_end = end;
+ if (old_end > end)
+ vma_mt_szero(mm, end, old_end);
end_changed = true;
}
+
+ if (end_changed || start_changed)
+ vma_mt_store(mm, vma);
+
vma->vm_pgoff = pgoff;
if (adjust_next) {
next->vm_start += adjust_next;
next->vm_pgoff += adjust_next >> PAGE_SHIFT;
+ vma_mt_store(mm, next);
}

if (file) {
@@ -903,6 +1025,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
/*
* vma_merge has merged next into vma, and needs
* us to remove next before dropping the locks.
+ * Since we have expanded over this vma, the maple tree entry has
+ * already been overwritten by storing the expanded vma above.
*/
if (remove_next != 3)
__vma_unlink(mm, next, next);
@@ -1022,6 +1146,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
uprobe_mmap(insert);

validate_mm(mm);
+ validate_mm_mt(mm);

return 0;
}
@@ -1169,6 +1294,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
struct vm_area_struct *area, *next;
int err;

+ validate_mm_mt(mm);
/*
* We later require that vma->vm_flags == vm_flags,
* so this tests vma->vm_flags & VM_SPECIAL, too.
@@ -1244,6 +1370,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
khugepaged_enter_vma_merge(area, vm_flags);
return area;
}
+ validate_mm_mt(mm);

return NULL;
}
@@ -1736,6 +1863,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
struct rb_node **rb_link, *rb_parent;
unsigned long charged = 0;

+ validate_mm_mt(mm);
/* Check against address space limit. */
if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
unsigned long nr_pages;
@@ -1897,6 +2025,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,

vma_set_page_prot(vma);

+ validate_mm_mt(mm);
return addr;

unmap_and_free_vma:
@@ -1916,6 +2045,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
unacct_error:
if (charged)
vm_unacct_memory(charged);
+ validate_mm_mt(mm);
return error;
}

@@ -1932,12 +2062,21 @@ static unsigned long unmapped_area(struct vm_unmapped_area_info *info)
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
unsigned long length, low_limit, high_limit, gap_start, gap_end;
+ unsigned long gap;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

/* Adjust search length to account for worst case alignment overhead */
length = info->length + info->align_mask;
if (length < info->length)
return -ENOMEM;

+ rcu_read_lock();
+ mas_empty_area_rev(&mas, info->low_limit, info->high_limit - 1,
+ length);
+ rcu_read_unlock();
+ gap = mas.index;
+ gap += (info->align_offset - gap) & info->align_mask;
+
/* Adjust search limits by the desired length */
if (info->high_limit < length)
return -ENOMEM;
@@ -2019,20 +2158,39 @@ static unsigned long unmapped_area(struct vm_unmapped_area_info *info)

VM_BUG_ON(gap_start + info->length > info->high_limit);
VM_BUG_ON(gap_start + info->length > gap_end);
+
+ VM_BUG_ON(gap != gap_start);
return gap_start;
}

+static inline unsigned long top_area_aligned(struct vm_unmapped_area_info *info,
+ unsigned long end)
+{
+ return (end - info->length - info->align_offset) & (~info->align_mask);
+}
+
static unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)
{
struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma;
+ struct vm_area_struct *vma = NULL;
unsigned long length, low_limit, high_limit, gap_start, gap_end;
+ unsigned long gap;
+
+ MA_STATE(mas, &mm->mm_mt, 0, 0);
+ validate_mm_mt(mm);

/* Adjust search length to account for worst case alignment overhead */
length = info->length + info->align_mask;
if (length < info->length)
return -ENOMEM;

+ rcu_read_lock();
+ mas_empty_area_rev(&mas, info->low_limit, info->high_limit - 1,
+ length);
+ rcu_read_unlock();
+ gap = (mas.index + info->align_mask) & ~info->align_mask;
+ gap -= info->align_offset & info->align_mask;
+
/*
* Adjust search limits by the desired length.
* See implementation comment at top of unmapped_area().
@@ -2118,6 +2276,32 @@ static unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)

VM_BUG_ON(gap_end < info->low_limit);
VM_BUG_ON(gap_end < gap_start);
+
+ if (gap != gap_end) {
+ pr_err("%s: %px Gap was found: mt %lu gap_end %lu\n", __func__,
+ mm, gap, gap_end);
+ pr_err("window was %lu - %lu size %lu\n", info->high_limit,
+ info->low_limit, length);
+ pr_err("mas.min %lu max %lu mas.last %lu\n", mas.min, mas.max,
+ mas.last);
+ pr_err("mas.index %lu align mask %lu offset %lu\n", mas.index,
+ info->align_mask, info->align_offset);
+ pr_err("rb_find_vma find on %lu => %px (%px)\n", mas.index,
+ find_vma(mm, mas.index), vma);
+#if defined(CONFIG_DEBUG_MAPLE_TREE)
+ mt_dump(&mm->mm_mt);
+#endif
+ {
+ struct vm_area_struct *dv = mm->mmap;
+
+ while (dv) {
+ printk("vma %px %lu-%lu\n", dv, dv->vm_start, dv->vm_end);
+ dv = dv->vm_next;
+ }
+ }
+ VM_BUG_ON(gap != gap_end);
+ }
+
return gap_end;
}

@@ -2330,7 +2514,6 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
vmacache_update(addr, vma);
return vma;
}
-
EXPORT_SYMBOL(find_vma);

/*
@@ -2411,6 +2594,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
unsigned long gap_addr;
int error = 0;

+ validate_mm_mt(mm);
if (!(vma->vm_flags & VM_GROWSUP))
return -EFAULT;

@@ -2487,6 +2671,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
anon_vma_unlock_write(vma->anon_vma);
khugepaged_enter_vma_merge(vma, vma->vm_flags);
validate_mm(mm);
+ validate_mm_mt(mm);
return error;
}
#endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
@@ -2501,6 +2686,7 @@ int expand_downwards(struct vm_area_struct *vma,
struct vm_area_struct *prev;
int error = 0;

+ validate_mm(mm);
address &= PAGE_MASK;
if (address < mmap_min_addr)
return -EPERM;
@@ -2554,6 +2740,8 @@ int expand_downwards(struct vm_area_struct *vma,
anon_vma_interval_tree_pre_update_vma(vma);
vma->vm_start = address;
vma->vm_pgoff -= grow;
+ /* Overwrite old entry in mtree. */
+ vma_mt_store(mm, vma);
anon_vma_interval_tree_post_update_vma(vma);
vma_gap_update(vma);
spin_unlock(&mm->page_table_lock);
@@ -2695,6 +2883,7 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,

insertion_point = (prev ? &prev->vm_next : &mm->mmap);
vma->vm_prev = NULL;
+ vma_mt_szero(mm, vma->vm_start, end);
do {
vma_rb_erase(vma, &mm->mm_rb);
mm->map_count--;
@@ -2733,6 +2922,7 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
{
struct vm_area_struct *new;
int err;
+ validate_mm_mt(mm);

if (vma->vm_ops && vma->vm_ops->may_split) {
err = vma->vm_ops->may_split(vma, addr);
@@ -2785,6 +2975,7 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
mpol_put(vma_policy(new));
out_free_vma:
vm_area_free(new);
+ validate_mm_mt(mm);
return err;
}

@@ -3057,6 +3248,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
pgoff_t pgoff = addr >> PAGE_SHIFT;
int error;
unsigned long mapped_addr;
+ validate_mm_mt(mm);

/* Until we need other flags, refuse anything except VM_EXEC. */
if ((flags & (~VM_EXEC)) != 0)
@@ -3114,6 +3306,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
if (flags & VM_LOCKED)
mm->locked_vm += (len >> PAGE_SHIFT);
vma->vm_flags |= VM_SOFTDIRTY;
+ validate_mm_mt(mm);
return 0;
}

@@ -3218,6 +3411,9 @@ void exit_mmap(struct mm_struct *mm)
vma = remove_vma(vma);
cond_resched();
}
+
+ trace_exit_mmap(mm);
+ mtree_destroy(&mm->mm_mt);
vm_unacct_memory(nr_accounted);
}

@@ -3229,10 +3425,25 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
{
struct vm_area_struct *prev;
struct rb_node **rb_link, *rb_parent;
+ unsigned long start = vma->vm_start;
+ struct vm_area_struct *overlap = NULL;

if (find_vma_links(mm, vma->vm_start, vma->vm_end,
&prev, &rb_link, &rb_parent))
return -ENOMEM;
+
+ overlap = mt_find(&mm->mm_mt, &start, vma->vm_end - 1);
+ if (overlap) {
+
+ pr_err("Found vma ending at %lu\n", start - 1);
+ pr_err("vma : %lu => %lu-%lu\n", (unsigned long)overlap,
+ overlap->vm_start, overlap->vm_end - 1);
+#if defined(CONFIG_DEBUG_MAPLE_TREE)
+ mt_dump(&mm->mm_mt);
+#endif
+ BUG();
+ }
+
if ((vma->vm_flags & VM_ACCOUNT) &&
security_vm_enough_memory_mm(mm, vma_pages(vma)))
return -ENOMEM;
@@ -3272,7 +3483,9 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
struct vm_area_struct *new_vma, *prev;
struct rb_node **rb_link, *rb_parent;
bool faulted_in_anon_vma = true;
+ unsigned long index = addr;

+ validate_mm_mt(mm);
/*
* If anonymous vma has not yet been faulted, update new pgoff
* to match new location, to increase its chance of merging.
@@ -3284,6 +3497,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,

if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
return NULL; /* should never get here */
+ if (mt_find(&mm->mm_mt, &index, addr+len - 1))
+ BUG();
new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
vma->vm_userfaultfd_ctx);
@@ -3327,6 +3542,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
vma_link(mm, new_vma, prev, rb_link, rb_parent);
*need_rmap_locks = false;
}
+ validate_mm_mt(mm);
return new_vma;

out_free_mempol:
@@ -3334,6 +3550,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
out_free_vma:
vm_area_free(new_vma);
out:
+ validate_mm_mt(mm);
return NULL;
}

@@ -3470,6 +3687,7 @@ static struct vm_area_struct *__install_special_mapping(
int ret;
struct vm_area_struct *vma;

+ validate_mm_mt(mm);
vma = vm_area_alloc(mm);
if (unlikely(vma == NULL))
return ERR_PTR(-ENOMEM);
@@ -3491,10 +3709,12 @@ static struct vm_area_struct *__install_special_mapping(

perf_event_mmap(vma);

+ validate_mm_mt(mm);
return vma;

out:
vm_area_free(vma);
+ validate_mm_mt(mm);
return ERR_PTR(ret);
}

--
2.30.2

2021-04-28 20:02:19

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 40/94] mm/mmap: Change vm_brk_flags() to use mm_populate_vma()

Skip searching for the VMA to populate after creating it, by passing a pointer to the new VMA into mm_populate_vma().
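
A rough sketch of what such a helper looks like is below; the real mm_populate_vma() is introduced earlier in this series, so the body here is only an assumption inferred from the call site, not the actual implementation:

    /* Populate a VMA that is already known, skipping the find_vma() walk. */
    static void mm_populate_vma(struct vm_area_struct *vma,
                                unsigned long start, unsigned long end)
    {
            struct mm_struct *mm = vma->vm_mm;

            mmap_read_lock(mm);
            /* Fault in the pages of this VMA over [start, end). */
            populate_vma_page_range(vma, start, end, NULL);
            mmap_read_unlock(mm);
    }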

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mmap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 6671e34b9693..7371fbf267ed 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3004,7 +3004,7 @@ int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags)
populate = ((mm->def_flags & VM_LOCKED) != 0);
mmap_write_unlock(mm);
if (populate && !ret)
- mm_populate(addr, len);
+ mm_populate_vma(vma, addr, addr + len);
return ret;
}
EXPORT_SYMBOL(vm_brk_flags);
--
2.30.2

2021-04-28 20:04:34

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 43/94] mm/mmap: Drop munmap_vma_range()

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mmap.c | 23 -----------------------
1 file changed, 23 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 4c873313549a..b730b57e47c9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -499,29 +499,6 @@ static inline struct vm_area_struct *vma_next(struct mm_struct *mm,
return vma->vm_next;
}

-/*
- * munmap_vma_range() - munmap VMAs that overlap a range.
- * @mm: The mm struct
- * @start: The start of the range.
- * @len: The length of the range.
- * @pprev: pointer to the pointer that will be set to previous vm_area_struct
- *
- * Find all the vm_area_struct that overlap from @start to
- * @end and munmap them. Set @pprev to the previous vm_area_struct.
- *
- * Returns: -ENOMEM on munmap failure or 0 on success.
- */
-static inline int
-munmap_vma_range(struct mm_struct *mm, unsigned long start, unsigned long len,
- struct vm_area_struct **pprev, struct list_head *uf)
-{
- // Needs optimization.
- while (range_has_overlap(mm, start, start + len, pprev)) {
- if (do_munmap(mm, start, len, uf))
- return -ENOMEM;
- }
- return 0;
-}
static unsigned long count_vma_pages_range(struct mm_struct *mm,
unsigned long addr, unsigned long end)
{
--
2.30.2

2021-04-28 20:19:27

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 60/94] drivers/misc/cxl: Use maple tree iterators for cxl_prefault_vma()

Signed-off-by: Liam R. Howlett <[email protected]>
---
drivers/misc/cxl/fault.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/misc/cxl/fault.c b/drivers/misc/cxl/fault.c
index 60c829113299..60a33b953ef4 100644
--- a/drivers/misc/cxl/fault.c
+++ b/drivers/misc/cxl/fault.c
@@ -313,6 +313,7 @@ static void cxl_prefault_vma(struct cxl_context *ctx)
struct vm_area_struct *vma;
int rc;
struct mm_struct *mm;
+ MA_STATE(mas, NULL, 0, 0);

mm = get_mem_context(ctx);
if (mm == NULL) {
@@ -321,8 +322,10 @@ static void cxl_prefault_vma(struct cxl_context *ctx)
return;
}

+ mas.tree = &mm->mm_mt;
mmap_read_lock(mm);
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ rcu_read_lock();
+ mas_for_each(&mas, vma, ULONG_MAX) {
for (ea = vma->vm_start; ea < vma->vm_end;
ea = next_segment(ea, slb.vsid)) {
rc = copro_calculate_slb(mm, ea, &slb);
@@ -336,6 +339,7 @@ static void cxl_prefault_vma(struct cxl_context *ctx)
last_esid = slb.esid;
}
}
+ rcu_read_unlock();
mmap_read_unlock(mm);

mmput(mm);
--
2.30.2

2021-04-28 20:19:27

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 66/94] fs/proc/task_mmu: Stop using linked list and highest_vm_end

Remove references to the mm_struct linked list and highest_vm_end in preparation for their removal.

Signed-off-by: Liam R. Howlett <[email protected]>
---
fs/proc/task_mmu.c | 44 ++++++++++++++++++++++++++------------------
1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 503e1355cf6e..f7ce3cc60cc7 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -164,14 +164,13 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
{
struct proc_maps_private *priv = m->private;
- struct vm_area_struct *next, *vma = v;
+ struct vm_area_struct *next = NULL, *vma = v;

- if (vma == priv->tail_vma)
- next = NULL;
- else if (vma->vm_next)
- next = vma->vm_next;
- else
- next = priv->tail_vma;
+ if (vma != priv->tail_vma) {
+ next = vma_next(vma->vm_mm, vma);
+ if (!next)
+ next = priv->tail_vma;
+ }

*ppos = next ? next->vm_start : -1UL;

@@ -844,16 +843,16 @@ static int show_smaps_rollup(struct seq_file *m, void *v)
{
struct proc_maps_private *priv = m->private;
struct mem_size_stats mss;
- struct mm_struct *mm;
+ struct mm_struct *mm = priv->mm;
struct vm_area_struct *vma;
- unsigned long last_vma_end = 0;
+ unsigned long vma_start = 0, last_vma_end = 0;
int ret = 0;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

priv->task = get_proc_task(priv->inode);
if (!priv->task)
return -ESRCH;

- mm = priv->mm;
if (!mm || !mmget_not_zero(mm)) {
ret = -ESRCH;
goto out_put_task;
@@ -866,8 +865,13 @@ static int show_smaps_rollup(struct seq_file *m, void *v)
goto out_put_mm;

hold_task_mempolicy(priv);
+ vma = mas_find(&mas, 0);
+
+ if (unlikely(!vma))
+ goto empty_set;

- for (vma = priv->mm->mmap; vma;) {
+ vma_start = vma->vm_start;
+ do {
smap_gather_stats(vma, &mss, 0);
last_vma_end = vma->vm_end;

@@ -876,6 +880,7 @@ static int show_smaps_rollup(struct seq_file *m, void *v)
* access it for write request.
*/
if (mmap_lock_is_contended(mm)) {
+ mas_pause(&mas);
mmap_read_unlock(mm);
ret = mmap_read_lock_killable(mm);
if (ret) {
@@ -919,7 +924,8 @@ static int show_smaps_rollup(struct seq_file *m, void *v)
* contains last_vma_end.
* Iterate VMA' from last_vma_end.
*/
- vma = find_vma(mm, last_vma_end - 1);
+ mas.index = mas.last = last_vma_end - 1;
+ vma = mas_find(&mas, ULONG_MAX);
/* Case 3 above */
if (!vma)
break;
@@ -933,11 +939,10 @@ static int show_smaps_rollup(struct seq_file *m, void *v)
smap_gather_stats(vma, &mss, last_vma_end);
}
/* Case 2 above */
- vma = vma->vm_next;
- }
+ } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);

- show_vma_header_prefix(m, priv->mm->mmap->vm_start,
- last_vma_end, 0, 0, 0, 0);
+empty_set:
+ show_vma_header_prefix(m, vma_start, last_vma_end, 0, 0, 0, 0);
seq_pad(m, ' ');
seq_puts(m, "[rollup]\n");

@@ -1230,6 +1235,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
return -ESRCH;
mm = get_task_mm(task);
if (mm) {
+ MA_STATE(mas, &mm->mm_mt, 0, 0);
struct mmu_notifier_range range;
struct clear_refs_private cp = {
.type = type,
@@ -1249,19 +1255,21 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
}

if (type == CLEAR_REFS_SOFT_DIRTY) {
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ mas_lock(&mas);
+ mas_for_each(&mas, vma, ULONG_MAX) {
if (!(vma->vm_flags & VM_SOFTDIRTY))
continue;
vma->vm_flags &= ~VM_SOFTDIRTY;
vma_set_page_prot(vma);
}
+ mas_unlock(&mas);

inc_tlb_flush_pending(mm);
mmu_notifier_range_init(&range, MMU_NOTIFY_SOFT_DIRTY,
0, NULL, mm, 0, -1UL);
mmu_notifier_invalidate_range_start(&range);
}
- walk_page_range(mm, 0, mm->highest_vm_end, &clear_refs_walk_ops,
+ walk_page_range(mm, 0, -1, &clear_refs_walk_ops,
&cp);
if (type == CLEAR_REFS_SOFT_DIRTY) {
mmu_notifier_invalidate_range_end(&range);
--
2.30.2

2021-04-28 20:19:27

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 57/94] arch/s390: Use maple tree iterators instead of linked list.

Signed-off-by: Liam R. Howlett <[email protected]>
---
arch/s390/mm/gmap.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 9bb2c7512cd5..77879744d652 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2502,8 +2502,10 @@ static const struct mm_walk_ops thp_split_walk_ops = {
static inline void thp_split_mm(struct mm_struct *mm)
{
struct vm_area_struct *vma;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

- for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
+ /* Must hold mm_mt lock already */
+ mas_for_each(&mas, vma, ULONG_MAX) {
vma->vm_flags &= ~VM_HUGEPAGE;
vma->vm_flags |= VM_NOHUGEPAGE;
walk_page_vma(vma, &thp_split_walk_ops, NULL);
@@ -2571,8 +2573,10 @@ int gmap_mark_unmergeable(void)
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
int ret;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

- for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ /* Must hold mm_mt lock already */
+ mas_for_each(&mas, vma, ULONG_MAX) {
ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
MADV_UNMERGEABLE, &vma->vm_flags);
if (ret)
--
2.30.2

2021-04-28 20:20:41

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 64/94] fs/exec: Use vma_next() instead of linked list

Signed-off-by: Liam R. Howlett <[email protected]>
---
fs/exec.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 3d3f7d46137c..3c9dae291b94 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -681,6 +681,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
unsigned long length = old_end - old_start;
unsigned long new_start = old_start - shift;
unsigned long new_end = old_end - shift;
+ struct vm_area_struct *next;
struct mmu_gather tlb;

BUG_ON(new_start > new_end);
@@ -708,12 +709,13 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)

lru_add_drain();
tlb_gather_mmu(&tlb, mm);
+ next = vma_next(mm, vma);
if (new_end > old_start) {
/*
* when the old and new regions overlap clear from new_end.
*/
free_pgd_range(&tlb, new_end, old_end, new_end,
- vma->vm_next ? vma->vm_next->vm_start : USER_PGTABLES_CEILING);
+ next ? next->vm_start : USER_PGTABLES_CEILING);
} else {
/*
* otherwise, clean from old_start; this is done to not touch
@@ -722,7 +724,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
* for the others its just a little faster.
*/
free_pgd_range(&tlb, old_start, old_end, new_end,
- vma->vm_next ? vma->vm_next->vm_start : USER_PGTABLES_CEILING);
+ next ? next->vm_start : USER_PGTABLES_CEILING);
}
tlb_finish_mmu(&tlb);

--
2.30.2

2021-04-28 20:21:02

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 68/94] ipc/shm: Stop using the vma linked list

When searching for a VMA, a maple state can be used instead of the linked list in the mm_struct.
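
For reference, the conversion pattern used here (and in most of the other call sites in this series) is roughly the following sketch; some call sites take the RCU read lock as shown, others rely on the mmap_lock alone:

    struct vm_area_struct *vma;
    MA_STATE(mas, &mm->mm_mt, addr, addr);

    rcu_read_lock();
    mas_for_each(&mas, vma, ULONG_MAX) {
            /* vma covers [vma->vm_start, vma->vm_end); VMAs are returned
             * in ascending address order, starting from addr. */
    }
    rcu_read_unlock();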

Signed-off-by: Liam R. Howlett <[email protected]>
---
ipc/shm.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/ipc/shm.c b/ipc/shm.c
index febd88daba8c..e26da39eccb5 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1631,6 +1631,7 @@ long ksys_shmdt(char __user *shmaddr)
loff_t size = 0;
struct file *file;
struct vm_area_struct *next;
+ MA_STATE(mas, &mm->mm_mt, addr, addr);
#endif

if (addr & ~PAGE_MASK)
@@ -1660,11 +1661,11 @@ long ksys_shmdt(char __user *shmaddr)
* match the usual checks anyway. So assume all vma's are
* above the starting address given.
*/
- vma = find_vma(mm, addr);

#ifdef CONFIG_MMU
+ vma = mas_find(&mas, ULONG_MAX);
while (vma) {
- next = vma->vm_next;
+ next = mas_find(&mas, ULONG_MAX);

/*
* Check if the starting address would match, i.e. it's
@@ -1703,21 +1704,21 @@ long ksys_shmdt(char __user *shmaddr)
*/
size = PAGE_ALIGN(size);
while (vma && (loff_t)(vma->vm_end - addr) <= size) {
- next = vma->vm_next;
-
/* finding a matching vma now does not alter retval */
if ((vma->vm_ops == &shm_vm_ops) &&
((vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff) &&
(vma->vm_file == file))
do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start, NULL);
- vma = next;
+
+ vma = mas_find(&mas, addr + size - 1);
}

#else /* CONFIG_MMU */
+ vma = mas_walk(&mas);
/* under NOMMU conditions, the exact address to be destroyed must be
* given
*/
- if (vma && vma->vm_start == addr && vma->vm_ops == &shm_vm_ops) {
+ if (vma && vma->vm_ops == &shm_vm_ops) {
do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start, NULL);
retval = 0;
}
--
2.30.2

2021-04-28 20:22:05

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 67/94] fs/userfaultfd: Stop using vma linked list.

Don't use the mm_struct linked list or vma->vm_next, in preparation for their removal.
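
One detail worth noting in the hunks below: whenever the loop has to drop its lock around cond_resched(), the maple state is paused first so the next iteration re-walks the tree from the current position instead of reusing a now-stale node. The pattern, sketched:

    mas_lock(&mas);
    mas_for_each(&mas, vma, ULONG_MAX) {
            /* mas_pause() forces the next mas_for_each() step to re-walk
             * the tree from mas.index after the lock has been dropped. */
            mas_unlock(&mas);
            mas_pause(&mas);
            cond_resched();
            mas_lock(&mas);

            /* ... per-vma work ... */
    }
    mas_unlock(&mas);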

Signed-off-by: Liam R. Howlett <[email protected]>
---
fs/userfaultfd.c | 43 ++++++++++++++++++++++++++++++++++++-------
1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 14f92285d04f..1fd0f5b5c934 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -605,14 +605,18 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
if (release_new_ctx) {
struct vm_area_struct *vma;
struct mm_struct *mm = release_new_ctx->mm;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

/* the various vma->vm_userfaultfd_ctx still points to it */
mmap_write_lock(mm);
- for (vma = mm->mmap; vma; vma = vma->vm_next)
+ mas_lock(&mas);
+ mas_for_each(&mas, vma, ULONG_MAX) {
if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
vma->vm_flags &= ~__VM_UFFD_FLAGS;
}
+ }
+ mas_unlock(&mas);
mmap_write_unlock(mm);

userfaultfd_ctx_put(release_new_ctx);
@@ -797,7 +801,10 @@ int userfaultfd_unmap_prep(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
struct list_head *unmaps)
{
- for ( ; vma && vma->vm_start < end; vma = vma->vm_next) {
+ MA_STATE(mas, &vma->vm_mm->mm_mt, vma->vm_start, vma->vm_start);
+
+ rcu_read_lock();
+ mas_for_each(&mas, vma, end) {
struct userfaultfd_unmap_ctx *unmap_ctx;
struct userfaultfd_ctx *ctx = vma->vm_userfaultfd_ctx.ctx;

@@ -816,6 +823,7 @@ int userfaultfd_unmap_prep(struct vm_area_struct *vma,
unmap_ctx->end = end;
list_add_tail(&unmap_ctx->list, unmaps);
}
+ rcu_read_unlock();

return 0;
}
@@ -847,6 +855,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
/* len == 0 means wake all */
struct userfaultfd_wake_range range = { .len = 0, };
unsigned long new_flags;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

WRITE_ONCE(ctx->released, true);

@@ -862,9 +871,14 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
* taking the mmap_lock for writing.
*/
mmap_write_lock(mm);
+ mas_lock(&mas);
prev = NULL;
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ mas_for_each(&mas, vma, ULONG_MAX) {
+ mas_unlock(&mas);
+ mas_pause(&mas);
cond_resched();
+ mas_lock(&mas);
+
BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
!!(vma->vm_flags & __VM_UFFD_FLAGS));
if (vma->vm_userfaultfd_ctx.ctx != ctx) {
@@ -884,6 +898,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
vma->vm_flags = new_flags;
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
}
+ mas_unlock(&mas);
mmap_write_unlock(mm);
mmput(mm);
wakeup:
@@ -1288,6 +1303,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
bool found;
bool basic_ioctls;
unsigned long start, end, vma_end;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

user_uffdio_register = (struct uffdio_register __user *) arg;

@@ -1326,6 +1342,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
goto out;

mmap_write_lock(mm);
+ rcu_read_lock();
vma = find_vma_prev(mm, start, &prev);
if (!vma)
goto out_unlock;
@@ -1351,8 +1368,12 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
*/
found = false;
basic_ioctls = false;
- for (cur = vma; cur && cur->vm_start < end; cur = cur->vm_next) {
+ mas_set(&mas, vma->vm_start);
+ mas_for_each(&mas, cur, end) {
+ rcu_read_unlock();
+ mas_pause(&mas);
cond_resched();
+ rcu_read_lock();

BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
!!(cur->vm_flags & __VM_UFFD_FLAGS));
@@ -1469,9 +1490,10 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
skip:
prev = vma;
start = vma->vm_end;
- vma = vma->vm_next;
+ vma = vma_next(mm, vma);
} while (vma && vma->vm_start < end);
out_unlock:
+ rcu_read_unlock();
mmap_write_unlock(mm);
mmput(mm);
if (!ret) {
@@ -1514,6 +1536,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
bool found;
unsigned long start, end, vma_end;
const void __user *buf = (void __user *)arg;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

ret = -EFAULT;
if (copy_from_user(&uffdio_unregister, buf, sizeof(uffdio_unregister)))
@@ -1557,8 +1580,13 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
*/
found = false;
ret = -EINVAL;
- for (cur = vma; cur && cur->vm_start < end; cur = cur->vm_next) {
+ rcu_read_lock();
+ mas_set(&mas, vma->vm_start);
+ mas_for_each(&mas, cur, end) {
+ rcu_read_unlock();
+ mas_pause(&mas);
cond_resched();
+ rcu_read_lock();

BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
!!(cur->vm_flags & __VM_UFFD_FLAGS));
@@ -1575,6 +1603,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,

found = true;
}
+ rcu_read_unlock();
BUG_ON(!found);

if (vma->vm_start < start)
@@ -1643,7 +1672,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
skip:
prev = vma;
start = vma->vm_end;
- vma = vma->vm_next;
+ vma = vma_next(mm, vma);
} while (vma && vma->vm_start < end);
out_unlock:
mmap_write_unlock(mm);
--
2.30.2

2021-04-28 20:23:02

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 74/94] mm/gup: Use maple tree navigation instead of linked list

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/gup.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 48fe98ab0729..4acb0aa75d80 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1502,6 +1502,7 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
struct vm_area_struct *vma = NULL;
int locked = 0;
long ret = 0;
+ MA_STATE(mas, &mm->mm_mt, start, start);

end = start + len;

@@ -1513,10 +1514,10 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
if (!locked) {
locked = 1;
mmap_read_lock(mm);
- vma = find_vma(mm, nstart);
+ vma = mas_find(&mas, end);
} else if (nstart >= vma->vm_end)
- vma = vma->vm_next;
- if (!vma || vma->vm_start >= end)
+ vma = mas_next(&mas, end);
+ if (!vma)
break;
/*
* Set [nstart; nend) to intersection of desired address
--
2.30.2

2021-04-28 20:30:16

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 70/94] kernel/events/core: Use maple tree iterators instead of linked list

Signed-off-by: Liam R. Howlett <[email protected]>
---
kernel/events/core.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f07943183041..73817c6c921e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10041,14 +10041,17 @@ static void perf_addr_filter_apply(struct perf_addr_filter *filter,
struct perf_addr_filter_range *fr)
{
struct vm_area_struct *vma;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

- for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ rcu_read_lock();
+ mas_for_each(&mas, vma, ULONG_MAX) {
if (!vma->vm_file)
continue;

if (perf_addr_filter_vma_adjust(filter, vma, fr))
- return;
+ break;
}
+ rcu_read_unlock();
}

/*
--
2.30.2

2021-04-28 20:30:16

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 69/94] kernel/acct: Use maple tree iterators instead of linked list

Signed-off-by: Liam R. Howlett <[email protected]>
---
kernel/acct.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/acct.c b/kernel/acct.c
index a64102be2bb0..82c79cab4faf 100644
--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -534,16 +534,16 @@ void acct_collect(long exitcode, int group_dead)
struct pacct_struct *pacct = &current->signal->pacct;
u64 utime, stime;
unsigned long vsize = 0;
+ MA_STATE(mas, &current->mm->mm_mt, 0, 0);

if (group_dead && current->mm) {
struct vm_area_struct *vma;

mmap_read_lock(current->mm);
- vma = current->mm->mmap;
- while (vma) {
+ rcu_read_lock();
+ mas_for_each(&mas, vma, ULONG_MAX)
vsize += vma->vm_end - vma->vm_start;
- vma = vma->vm_next;
- }
+ rcu_read_unlock();
mmap_read_unlock(current->mm);
}

--
2.30.2

2021-04-28 20:30:16

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 72/94] kernel/sched/fair: Use maple tree iterators instead of linked list

Signed-off-by: Liam R. Howlett <[email protected]>
---
kernel/sched/fair.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9b8ae02f1994..db403de2131e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2726,6 +2726,7 @@ static void task_numa_work(struct callback_head *work)
unsigned long start, end;
unsigned long nr_pte_updates = 0;
long pages, virtpages;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));

@@ -2778,13 +2779,17 @@ static void task_numa_work(struct callback_head *work)

if (!mmap_read_trylock(mm))
return;
- vma = find_vma(mm, start);
+
+ rcu_read_lock();
+ mas_set(&mas, start);
+ vma = mas_find(&mas, ULONG_MAX);
if (!vma) {
reset_ptenuma_scan(p);
start = 0;
- vma = mm->mmap;
+ mas_set(&mas, start);
}
- for (; vma; vma = vma->vm_next) {
+
+ mas_for_each(&mas, vma, ULONG_MAX) {
if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
continue;
@@ -2829,7 +2834,9 @@ static void task_numa_work(struct callback_head *work)
if (pages <= 0 || virtpages <= 0)
goto out;

+ rcu_read_unlock();
cond_resched();
+ rcu_read_lock();
} while (end != vma->vm_end);
}

@@ -2844,6 +2851,7 @@ static void task_numa_work(struct callback_head *work)
mm->numa_scan_offset = start;
else
reset_ptenuma_scan(p);
+ rcu_read_unlock();
mmap_read_unlock(mm);

/*
--
2.30.2

2021-04-28 20:30:16

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 65/94] fs/proc/base: Use maple tree iterators in place of linked list

Signed-off-by: Liam R. Howlett <[email protected]>
---
fs/proc/base.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index fd46d8dd0cf4..8f62ab2e77e5 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2315,6 +2315,7 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
GENRADIX(struct map_files_info) fa;
struct map_files_info *p;
int ret;
+ MA_STATE(mas, NULL, 0, 0);

genradix_init(&fa);

@@ -2340,8 +2341,10 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
mmput(mm);
goto out_put_task;
}
+ rcu_read_lock();

nr_files = 0;
+ mas.tree = &mm->mm_mt;

/*
* We need two passes here:
@@ -2353,7 +2356,8 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
* routine might require mmap_lock taken in might_fault().
*/

- for (vma = mm->mmap, pos = 2; vma; vma = vma->vm_next) {
+ pos = 2;
+ mas_for_each(&mas, vma, ULONG_MAX) {
if (!vma->vm_file)
continue;
if (++pos <= ctx->pos)
@@ -2371,6 +2375,7 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
p->end = vma->vm_end;
p->mode = vma->vm_file->f_mode;
}
+ rcu_read_unlock();
mmap_read_unlock(mm);
mmput(mm);

--
2.30.2

2021-04-28 20:30:17

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 77/94] mm/ksm: Use maple tree iterators instead of vma linked list

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/ksm.c | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index ced6830d0ff4..aa0cfe1ef2d7 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -969,11 +969,14 @@ static int unmerge_and_remove_all_rmap_items(void)
struct mm_slot, mm_list);
spin_unlock(&ksm_mmlist_lock);

- for (mm_slot = ksm_scan.mm_slot;
- mm_slot != &ksm_mm_head; mm_slot = ksm_scan.mm_slot) {
+ for (mm_slot = ksm_scan.mm_slot; mm_slot != &ksm_mm_head;
+ mm_slot = ksm_scan.mm_slot) {
+ MA_STATE(mas, &mm_slot->mm->mm_mt, 0, 0);
+
mm = mm_slot->mm;
mmap_read_lock(mm);
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ rcu_read_lock();
+ mas_for_each(&mas, vma, ULONG_MAX) {
if (ksm_test_exit(mm))
break;
if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
@@ -985,6 +988,7 @@ static int unmerge_and_remove_all_rmap_items(void)
}

remove_trailing_rmap_items(&mm_slot->rmap_list);
+ rcu_read_unlock();
mmap_read_unlock(mm);

spin_lock(&ksm_mmlist_lock);
@@ -1008,6 +1012,7 @@ static int unmerge_and_remove_all_rmap_items(void)
return 0;

error:
+ rcu_read_unlock();
mmap_read_unlock(mm);
spin_lock(&ksm_mmlist_lock);
ksm_scan.mm_slot = &ksm_mm_head;
@@ -2222,6 +2227,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
struct vm_area_struct *vma;
struct rmap_item *rmap_item;
int nid;
+ MA_STATE(mas, NULL, 0, 0);

if (list_empty(&ksm_mm_head.mm_list))
return NULL;
@@ -2279,13 +2285,15 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
}

mm = slot->mm;
+ mas.tree = &mm->mm_mt;
+
mmap_read_lock(mm);
+ rcu_read_lock();
if (ksm_test_exit(mm))
- vma = NULL;
- else
- vma = find_vma(mm, ksm_scan.address);
+ goto no_vmas;

- for (; vma; vma = vma->vm_next) {
+ mas_set(&mas, ksm_scan.address);
+ mas_for_each(&mas, vma, ULONG_MAX) {
if (!(vma->vm_flags & VM_MERGEABLE))
continue;
if (ksm_scan.address < vma->vm_start)
@@ -2313,6 +2321,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
ksm_scan.address += PAGE_SIZE;
} else
put_page(*page);
+ rcu_read_unlock();
mmap_read_unlock(mm);
return rmap_item;
}
@@ -2323,6 +2332,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
}

if (ksm_test_exit(mm)) {
+no_vmas:
ksm_scan.address = 0;
ksm_scan.rmap_list = &slot->rmap_list;
}
@@ -2351,9 +2361,11 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)

free_mm_slot(slot);
clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+ rcu_read_unlock();
mmap_read_unlock(mm);
mmdrop(mm);
} else {
+ rcu_read_unlock();
mmap_read_unlock(mm);
/*
* mmap_read_unlock(mm) first because after
--
2.30.2

2021-04-28 20:30:28

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 78/94] mm/madvise: Use vma_next instead of vma linked list

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/madvise.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 63e489e5bfdb..ce9c738b7663 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1140,7 +1140,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
if (start >= end)
goto out;
if (prev)
- vma = prev->vm_next;
+ vma = vma_next(mm, prev);
else /* madvise_remove dropped mmap_lock */
vma = find_vma(mm, start);
}
--
2.30.2

2021-04-28 20:30:29

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 71/94] kernel/events/uprobes: Use maple tree iterators instead of linked list

Signed-off-by: Liam R. Howlett <[email protected]>
---
kernel/events/uprobes.c | 25 ++++++++++++++++++-------
1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 907d4ee00cb2..8d3248ffbd68 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -355,13 +355,16 @@ static bool valid_ref_ctr_vma(struct uprobe *uprobe,
static struct vm_area_struct *
find_ref_ctr_vma(struct uprobe *uprobe, struct mm_struct *mm)
{
- struct vm_area_struct *tmp;
+ struct vm_area_struct *tmp = NULL;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

- for (tmp = mm->mmap; tmp; tmp = tmp->vm_next)
+ rcu_read_lock();
+ mas_for_each(&mas, tmp, ULONG_MAX)
if (valid_ref_ctr_vma(uprobe, tmp))
- return tmp;
+ break;
+ rcu_read_unlock();

- return NULL;
+ return tmp;
}

static int
@@ -1237,9 +1240,10 @@ static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm)
{
struct vm_area_struct *vma;
int err = 0;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

mmap_read_lock(mm);
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ mas_for_each(&mas, vma, ULONG_MAX) {
unsigned long vaddr;
loff_t offset;

@@ -1988,8 +1992,10 @@ bool uprobe_deny_signal(void)
static void mmf_recalc_uprobes(struct mm_struct *mm)
{
struct vm_area_struct *vma;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

- for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ rcu_read_lock();
+ mas_for_each(&mas, vma, ULONG_MAX) {
if (!valid_vma(vma, false))
continue;
/*
@@ -1999,10 +2005,15 @@ static void mmf_recalc_uprobes(struct mm_struct *mm)
* Or this uprobe can be filtered out.
*/
if (vma_has_uprobes(vma, vma->vm_start, vma->vm_end))
- return;
+ goto completed;
}
+ rcu_read_unlock();

clear_bit(MMF_HAS_UPROBES, &mm->flags);
+ return;
+
+completed:
+ rcu_read_unlock();
}

static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
--
2.30.2

2021-04-28 20:30:47

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 83/94] mm/mremap: Use vma_next() instead of vma linked list

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mremap.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index d2dba8188be5..3bd70eeed544 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -623,7 +623,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
if (excess) {
vma->vm_flags |= VM_ACCOUNT;
if (split)
- vma->vm_next->vm_flags |= VM_ACCOUNT;
+ vma_next(mm, vma)->vm_flags |= VM_ACCOUNT;
}

return new_addr;
@@ -796,9 +796,11 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
static int vma_expandable(struct vm_area_struct *vma, unsigned long delta)
{
unsigned long end = vma->vm_end + delta;
+ struct vm_area_struct *next;
if (end < vma->vm_end) /* overflow */
return 0;
- if (vma->vm_next && vma->vm_next->vm_start < end) /* intersection */
+ next = vma_next(vma->vm_mm, vma);
+ if (next && next->vm_start < end) /* intersection */
return 0;
if (get_unmapped_area(NULL, vma->vm_start, end - vma->vm_start,
0, MAP_FIXED) & ~PAGE_MASK)
--
2.30.2

2021-04-28 20:30:58

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 85/94] mm/oom_kill: Use maple tree iterators instead of vma linked list

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/oom_kill.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 54527de9cd2d..1f6491965802 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -515,6 +515,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
{
struct vm_area_struct *vma;
bool ret = true;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

/*
* Tell all users of get_user/copy_from_user etc... that the content
@@ -524,7 +525,8 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
*/
set_bit(MMF_UNSTABLE, &mm->flags);

- for (vma = mm->mmap ; vma; vma = vma->vm_next) {
+ rcu_read_lock();
+ mas_for_each(&mas, vma, ULONG_MAX) {
if (!can_madv_lru_vma(vma))
continue;

@@ -556,6 +558,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
tlb_finish_mmu(&tlb);
}
}
+ rcu_read_unlock();

return ret;
}
--
2.30.2

2021-04-28 20:31:01

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 79/94] mm/memcontrol: Stop using mm->highest_vm_end

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/memcontrol.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 64ada9e650a5..0272c9466502 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5666,7 +5666,7 @@ static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
unsigned long precharge;

mmap_read_lock(mm);
- walk_page_range(mm, 0, mm->highest_vm_end, &precharge_walk_ops, NULL);
+ walk_page_range(mm, 0, -1, &precharge_walk_ops, NULL);
mmap_read_unlock(mm);

precharge = mc.precharge;
@@ -5964,9 +5964,7 @@ static void mem_cgroup_move_charge(void)
* When we have consumed all precharges and failed in doing
* additional charge, the page walk just aborts.
*/
- walk_page_range(mc.mm, 0, mc.mm->highest_vm_end, &charge_walk_ops,
- NULL);
-
+ walk_page_range(mc.mm, 0, -1, &charge_walk_ops, NULL);
mmap_read_unlock(mc.mm);
atomic_dec(&mc.from->moving_account);
}
--
2.30.2

2021-04-28 20:31:18

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 86/94] mm/pagewalk: Use vma_next() instead of vma linked list

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/pagewalk.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f177..20bd8d14d042 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -408,7 +408,7 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
} else { /* inside vma */
walk.vma = vma;
next = min(end, vma->vm_end);
- vma = vma->vm_next;
+ vma = vma_next(mm, vma);

err = walk_page_test(start, next, &walk);
if (err > 0) {
--
2.30.2

2021-04-28 20:31:18

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 89/94] arch/um/kernel/tlb: Stop using linked list

Signed-off-by: Liam R. Howlett <[email protected]>
---
arch/um/kernel/tlb.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
index bc38f79ca3a3..0cbbebad70a0 100644
--- a/arch/um/kernel/tlb.c
+++ b/arch/um/kernel/tlb.c
@@ -584,21 +584,21 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,

void flush_tlb_mm(struct mm_struct *mm)
{
- struct vm_area_struct *vma = mm->mmap;
+ struct vm_area_struct *vma;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

- while (vma != NULL) {
+ /* Must hold mm_mt lock already */
+ mas_for_each(&mas, vma, ULONG_MAX)
fix_range(mm, vma->vm_start, vma->vm_end, 0);
- vma = vma->vm_next;
- }
}

void force_flush_all(void)
{
struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma = mm->mmap;
+ struct vm_area_struct *vma;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

- while (vma != NULL) {
+ /* Must hold mm_mt lock already */
+ mas_for_each(&mas, vma, ULONG_MAX)
fix_range(mm, vma->vm_start, vma->vm_end, 1);
- vma = vma->vm_next;
- }
}
--
2.30.2

2021-04-28 20:31:18

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 41/94] mm: Change find_vma_intersection() to maple tree and make find_vma() inline.

Move find_vma_intersection() to mmap.c and change its implementation to
use the maple tree.

When searching for a vma within a range, it is easier to use the maple
tree interface directly. find_vma() therefore becomes a special case of
find_vma_intersection(). find_vma_intersection() is exported for the kvm
module.
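
As a usage sketch, the overlap check in do_mmap() (converted later in this series) becomes a single call; the caller holds the mmap_lock:

    /* Is anything already mapped in [addr, addr + len)? */
    if (find_vma_intersection(mm, addr, addr + len))
            return -EEXIST;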

Signed-off-by: Liam R. Howlett <[email protected]>
---
include/linux/mm.h | 10 ++--------
mm/mmap.c | 40 +++++++++++++++++++++++++++++-----------
2 files changed, 31 insertions(+), 19 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cf17491be249..dd8abaa433f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2678,14 +2678,8 @@ extern struct vm_area_struct * find_vma_prev(struct mm_struct * mm, unsigned lon

/* Look up the first VMA which intersects the interval start_addr..end_addr-1,
NULL if none. Assume start_addr < end_addr. */
-static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * mm, unsigned long start_addr, unsigned long end_addr)
-{
- struct vm_area_struct * vma = find_vma(mm,start_addr);
-
- if (vma && end_addr <= vma->vm_start)
- vma = NULL;
- return vma;
-}
+extern struct vm_area_struct *find_vma_intersection(struct mm_struct *mm,
+ unsigned long start_addr, unsigned long end_addr);

/**
* vma_lookup() - Find a VMA at a specific address
diff --git a/mm/mmap.c b/mm/mmap.c
index 7371fbf267ed..df39c01eda12 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2037,32 +2037,50 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,

EXPORT_SYMBOL(get_unmapped_area);

-/**
- * find_vma() - Find the VMA for a given address, or the next vma.
- * @mm: The mm_struct to check
- * @addr: The address
+/*
+ * find_vma_intersection - Find the first vma between [@start, @end)
+ * @mm: The mm_struct to use.
+ * @start: The start address
+ * @end: The end address
*
- * Returns: The VMA associated with addr, or the next vma.
- * May return %NULL in the case of no vma at addr or above.
+ * Returns: The VMA associated with @start, or the next VMA within the range.
+ * May return %NULL in the case of no vma within the range.
*/
-struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+struct vm_area_struct *find_vma_intersection(struct mm_struct *mm,
+ unsigned long start_addr,
+ unsigned long end_addr)
{
struct vm_area_struct *vma;
- MA_STATE(mas, &mm->mm_mt, addr, addr);
+ MA_STATE(mas, &mm->mm_mt, start_addr, start_addr);

/* Check the cache first. */
- vma = vmacache_find(mm, addr);
+ vma = vmacache_find(mm, start_addr);
if (likely(vma))
return vma;

rcu_read_lock();
- vma = mas_find(&mas, -1);
+ vma = mas_find(&mas, end_addr - 1);
rcu_read_unlock();
if (vma)
- vmacache_update(addr, vma);
+ vmacache_update(mas.index, vma);

return vma;
}
+EXPORT_SYMBOL(find_vma_intersection);
+
+/**
+ * find_vma() - Find the VMA for a given address, or the next vma.
+ * @mm: The mm_struct to check
+ * @addr: The address
+ *
+ * Returns: The VMA associated with addr, or the next vma.
+ * May return NULL in the case of no vma at addr or above.
+ */
+inline struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+{
+ // Note: find_vma_intersection() subtracts 1 from end_addr, so 0 underflows to ULONG_MAX
+ return find_vma_intersection(mm, addr, 0);
+}
EXPORT_SYMBOL(find_vma);

/**
--
2.30.2

2021-04-28 20:31:25

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 48/94] mmap: Use find_vma_intersection in do_mmap() for overlap

When detecting a conflict with MAP_FIXED_NOREPLACE, using the new interface
avoids the need for a temporary variable.

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mmap.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 6fa93606e62b..3e67fb5eac31 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1425,9 +1425,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
return addr;

if (flags & MAP_FIXED_NOREPLACE) {
- struct vm_area_struct *vma = find_vma(mm, addr);
-
- if (vma && vma->vm_start < addr + len)
+ if (find_vma_intersection(mm, addr, addr + len))
return -EEXIST;
}

--
2.30.2

2021-04-28 20:31:26

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 39/94] mm/mmap: Change do_brk_flags() to expand existing VMA and add do_brk_munmap()

Avoid allocating a new VMA when it is not necessary. Expand or contract
the existing VMA instead. This avoids unnecessary tree manipulations
and allocations.

Once the VMA is known, use it directly when populating to avoid
unnecessary lookup work.
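
The heart of the expansion path is growing vm_end of the existing brk VMA and re-storing it over the enlarged range in the maple tree. A simplified sketch is below; it is not the actual implementation, which also handles flag and anon_vma compatibility checks, locking, accounting, the rbtree, and the fallback to allocating a new VMA:

    /* Sketch: grow an existing brk VMA by len bytes instead of allocating. */
    if (vma && vma->vm_end == addr) {
            mas->index = vma->vm_start;
            mas->last = addr + len - 1;     /* maple tree ranges are inclusive */
            vma->vm_end = addr + len;
            if (mas_store_gfp(mas, vma, GFP_KERNEL))
                    return -ENOMEM;
            return 0;
    }
    /* Otherwise a new anonymous VMA is created, as before. */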

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mmap.c | 269 +++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 214 insertions(+), 55 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 7747047c4cbe..6671e34b9693 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -190,17 +190,22 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
return next;
}

-static int do_brk_flags(unsigned long addr, unsigned long request, unsigned long flags,
- struct list_head *uf);
+static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
+ unsigned long newbrk, unsigned long oldbrk,
+ struct list_head *uf);
+static int do_brk_flags(struct ma_state *mas, struct vm_area_struct **brkvma,
+ unsigned long addr, unsigned long request,
+ unsigned long flags);
SYSCALL_DEFINE1(brk, unsigned long, brk)
{
unsigned long newbrk, oldbrk, origbrk;
struct mm_struct *mm = current->mm;
- struct vm_area_struct *next;
+ struct vm_area_struct *brkvma, *next = NULL;
unsigned long min_brk;
bool populate;
bool downgraded = false;
LIST_HEAD(uf);
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

if (mmap_write_lock_killable(mm))
return -EINTR;
@@ -240,37 +245,56 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
goto success;
}

- /*
- * Always allow shrinking brk.
- * __do_munmap() may downgrade mmap_lock to read.
- */
- if (brk <= mm->brk) {
+ mas_set(&mas, newbrk);
+ brkvma = mas_walk(&mas);
+ if (brkvma) { // munmap necessary, there is something at newbrk.
+ /*
+ * Always allow shrinking brk.
+ * do_brk_munmap() may downgrade mmap_lock to read.
+ */
int ret;

+ if (brkvma->vm_start >= oldbrk)
+ goto out; // mapping intersects with an existing non-brk vma.
/*
- * mm->brk must to be protected by write mmap_lock so update it
- * before downgrading mmap_lock. When __do_munmap() fails,
- * mm->brk will be restored from origbrk.
+ * mm->brk must be protected by the write mmap_lock.
+ * do_brk_munmap() may downgrade the lock, so update it
+ * before calling do_brk_munmap().
*/
mm->brk = brk;
- ret = __do_munmap(mm, newbrk, oldbrk-newbrk, &uf, true);
- if (ret < 0) {
- mm->brk = origbrk;
- goto out;
- } else if (ret == 1) {
+ mas.last = oldbrk - 1;
+ ret = do_brk_munmap(&mas, brkvma, newbrk, oldbrk, &uf);
+ if (ret == 1) {
downgraded = true;
- }
- goto success;
- }
+ goto success;
+ } else if (!ret)
+ goto success;

+ mm->brk = origbrk;
+ goto out;
+ }
+ /* Only check if the next VMA is within the stack_guard_gap of the
+ * expansion area */
+ next = mas_next(&mas, newbrk + PAGE_SIZE + stack_guard_gap);
/* Check against existing mmap mappings. */
- next = find_vma(mm, oldbrk);
if (next && newbrk + PAGE_SIZE > vm_start_gap(next))
goto out;

+ brkvma = mas_prev(&mas, mm->start_brk);
+ if (brkvma) {
+ if (brkvma->vm_start >= oldbrk)
+ goto out; // Trying to map over another vma.
+
+ if (brkvma->vm_end <= min_brk) {
+ brkvma = NULL;
+ mas_reset(&mas);
+ }
+ }
+
/* Ok, looks good - let it rip. */
- if (do_brk_flags(oldbrk, newbrk-oldbrk, 0, &uf) < 0)
+ if (do_brk_flags(&mas, &brkvma, oldbrk, newbrk - oldbrk, 0) < 0)
goto out;
+
mm->brk = brk;

success:
@@ -281,7 +305,7 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
mmap_write_unlock(mm);
userfaultfd_unmap_complete(mm, &uf);
if (populate)
- mm_populate(oldbrk, newbrk - oldbrk);
+ mm_populate_vma(brkvma, oldbrk, newbrk);
return brk;

out:
@@ -372,16 +396,16 @@ static void validate_mm(struct mm_struct *mm)
validate_mm_mt(mm);

while (vma) {
+#ifdef CONFIG_DEBUG_VM_RB
struct anon_vma *anon_vma = vma->anon_vma;
struct anon_vma_chain *avc;
-
if (anon_vma) {
anon_vma_lock_read(anon_vma);
list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
anon_vma_interval_tree_verify(avc);
anon_vma_unlock_read(anon_vma);
}
-
+#endif
highest_address = vm_end_gap(vma);
vma = vma->vm_next;
i++;
@@ -2024,13 +2048,16 @@ EXPORT_SYMBOL(get_unmapped_area);
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
struct vm_area_struct *vma;
+ MA_STATE(mas, &mm->mm_mt, addr, addr);

/* Check the cache first. */
vma = vmacache_find(mm, addr);
if (likely(vma))
return vma;

- vma = mt_find(&mm->mm_mt, &addr, ULONG_MAX);
+ rcu_read_lock();
+ vma = mas_find(&mas, -1);
+ rcu_read_unlock();
if (vma)
vmacache_update(addr, vma);

@@ -2514,7 +2541,6 @@ static inline void unlock_range(struct vm_area_struct *start, unsigned long limi
mm->locked_vm -= vma_pages(tmp);
munlock_vma_pages_all(tmp);
}
-
tmp = tmp->vm_next;
}
}
@@ -2749,16 +2775,105 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
}

/*
- * this is really a simplified "do_mmap". it only handles
- * anonymous maps. eventually we may be able to do some
- * brk-specific accounting here.
+ * do_brk_munmap() - Unmap a partial vma.
+ * @mas: The maple tree state.
+ * @vma: The vma to be modified
+ * @newbrk: The start of the address to unmap
+ * @oldbrk: The end of the address to unmap
+ * @uf: The userfaultfd list_head
+ *
+ * Returns: 0 on success.
+ * unmaps a partial VMA mapping. Does not handle alignment, downgrades lock if
+ * possible.
+ */
+static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
+ unsigned long newbrk, unsigned long oldbrk,
+ struct list_head *uf)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct vm_area_struct unmap;
+ unsigned long unmap_pages;
+ int ret = 1;
+
+ arch_unmap(mm, newbrk, oldbrk);
+
+ if (likely(vma->vm_start >= newbrk)) { // remove entire mapping(s)
+ mas_set(mas, newbrk);
+ if (vma->vm_start != newbrk)
+ mas_reset(mas); // cause a re-walk for the first overlap.
+ ret = __do_munmap(mm, newbrk, oldbrk - newbrk, uf, true);
+ goto munmap_full_vma;
+ }
+
+ vma_init(&unmap, mm);
+ unmap.vm_start = newbrk;
+ unmap.vm_end = oldbrk;
+ ret = userfaultfd_unmap_prep(&unmap, newbrk, oldbrk, uf);
+ if (ret)
+ return ret;
+ ret = 1;
+
+ // Shrink the vma to end at newbrk; [newbrk, oldbrk) gets unmapped.
+ vma_adjust_trans_huge(vma, vma->vm_start, newbrk, 0);
+ if (vma->anon_vma) {
+ anon_vma_lock_write(vma->anon_vma);
+ anon_vma_interval_tree_pre_update_vma(vma);
+ }
+
+ vma->vm_end = newbrk;
+ if (vma_mas_remove(&unmap, mas))
+ goto mas_store_fail;
+
+ vmacache_invalidate(vma->vm_mm);
+ if (vma->anon_vma) {
+ anon_vma_interval_tree_post_update_vma(vma);
+ anon_vma_unlock_write(vma->anon_vma);
+ }
+
+ unmap_pages = vma_pages(&unmap);
+ if (unmap.vm_flags & VM_LOCKED) {
+ mm->locked_vm -= unmap_pages;
+ munlock_vma_pages_range(&unmap, newbrk, oldbrk);
+ }
+
+ mmap_write_downgrade(mm);
+ unmap_region(mm, &unmap, vma, newbrk, oldbrk);
+ /* Statistics */
+ vm_stat_account(mm, unmap.vm_flags, -unmap_pages);
+ if (unmap.vm_flags & VM_ACCOUNT)
+ vm_unacct_memory(unmap_pages);
+
+munmap_full_vma:
+ validate_mm_mt(mm);
+ return ret;
+
+mas_store_fail:
+ vma->vm_end = oldbrk;
+ if (vma->anon_vma) {
+ anon_vma_interval_tree_post_update_vma(vma);
+ anon_vma_unlock_write(vma->anon_vma);
+ }
+ return -ENOMEM;
+}
+
+/*
+ * do_brk_flags() - Increase the brk vma if the flags match.
+ * @mas: The maple tree state.
+ * @brkvma: A pointer to the brk vma, or NULL if none exists
+ * @addr: The start address
+ * @len: The length of the increase
+ * @flags: The VMA flags
+ *
+ * Extend the brk VMA from addr to addr + len. If the VMA is NULL or the flags
+ * do not match then create a new anonymous VMA. Eventually we may be able to
+ * do some brk-specific accounting here.
*/
-static int do_brk_flags(unsigned long addr, unsigned long len,
- unsigned long flags, struct list_head *uf)
+static int do_brk_flags(struct ma_state *mas, struct vm_area_struct **brkvma,
+ unsigned long addr, unsigned long len,
+ unsigned long flags)
{
struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma, *prev;
- pgoff_t pgoff = addr >> PAGE_SHIFT;
+ struct vm_area_struct *prev = NULL, *vma;
int error;
unsigned long mapped_addr;
validate_mm_mt(mm);
@@ -2776,11 +2891,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len,
if (error)
return error;

- /* Clear old maps, set up prev and uf */
- if (munmap_vma_range(mm, addr, len, &prev, uf))
- return -ENOMEM;
-
- /* Check against address space limits *after* clearing old maps... */
+ /* Check against address space limits by the changed size */
if (!may_expand_vm(mm, flags, len >> PAGE_SHIFT))
return -ENOMEM;

@@ -2790,28 +2901,59 @@ static int do_brk_flags(unsigned long addr, unsigned long len,
if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
return -ENOMEM;

- /* Can we just expand an old private anonymous mapping? */
- vma = vma_merge(mm, prev, addr, addr + len, flags,
- NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX);
- if (vma)
- goto out;
+ mas->last = addr + len - 1;
+ if (*brkvma) {
+ vma = *brkvma;
+ /* Expand the existing vma if possible; almost never a singular
+ * list, so this will almost always fail. */

- /*
- * create a vma struct for an anonymous mapping
- */
- vma = vm_area_alloc(mm);
- if (!vma) {
- vm_unacct_memory(len >> PAGE_SHIFT);
- return -ENOMEM;
+ if ((!vma->anon_vma ||
+ list_is_singular(&vma->anon_vma_chain)) &&
+ ((vma->vm_flags & ~VM_SOFTDIRTY) == flags)) {
+ mas->index = vma->vm_start;
+
+ vma_adjust_trans_huge(vma, addr, addr + len, 0);
+ if (vma->anon_vma) {
+ anon_vma_lock_write(vma->anon_vma);
+ anon_vma_interval_tree_pre_update_vma(vma);
+ }
+ vma->vm_end = addr + len;
+ vma->vm_flags |= VM_SOFTDIRTY;
+ if (mas_store_gfp(mas, vma, GFP_KERNEL))
+ goto mas_mod_fail;
+
+ if (vma->anon_vma) {
+ anon_vma_interval_tree_post_update_vma(vma);
+ anon_vma_unlock_write(vma->anon_vma);
+ }
+ khugepaged_enter_vma_merge(vma, flags);
+ goto out;
+ }
+ prev = vma;
}
+ mas->index = addr;
+ mas_walk(mas);
+
+ /* create a vma struct for an anonymous mapping */
+ vma = vm_area_alloc(mm);
+ if (!vma)
+ goto vma_alloc_fail;

vma_set_anonymous(vma);
vma->vm_start = addr;
vma->vm_end = addr + len;
- vma->vm_pgoff = pgoff;
+ vma->vm_pgoff = addr >> PAGE_SHIFT;
vma->vm_flags = flags;
vma->vm_page_prot = vm_get_page_prot(flags);
- vma_link(mm, vma, prev);
+ if (vma_mas_store(vma, mas))
+ goto mas_store_fail;
+
+ if (!prev)
+ prev = mas_prev(mas, 0);
+
+ __vma_link_list(mm, vma, prev);
+ mm->map_count++;
+ *brkvma = vma;
out:
perf_event_mmap(vma);
mm->total_vm += len >> PAGE_SHIFT;
@@ -2821,15 +2963,31 @@ static int do_brk_flags(unsigned long addr, unsigned long len,
vma->vm_flags |= VM_SOFTDIRTY;
validate_mm_mt(mm);
return 0;
+
+mas_store_fail:
+ vm_area_free(vma);
+vma_alloc_fail:
+ vm_unacct_memory(len >> PAGE_SHIFT);
+ return -ENOMEM;
+
+mas_mod_fail:
+ vma->vm_end = addr;
+ if (vma->anon_vma) {
+ anon_vma_interval_tree_post_update_vma(vma);
+ anon_vma_unlock_write(vma->anon_vma);
+ }
+ return -ENOMEM;
+
}

int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags)
{
struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma = NULL;
unsigned long len;
int ret;
bool populate;
- LIST_HEAD(uf);
+ MA_STATE(mas, &mm->mm_mt, addr, addr);

len = PAGE_ALIGN(request);
if (len < request)
@@ -2840,10 +2998,11 @@ int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags)
if (mmap_write_lock_killable(mm))
return -EINTR;

- ret = do_brk_flags(addr, len, flags, &uf);
+ // This vma left intentionally blank.
+ mas_walk(&mas);
+ ret = do_brk_flags(&mas, &vma, addr, len, flags);
populate = ((mm->def_flags & VM_LOCKED) != 0);
mmap_write_unlock(mm);
- userfaultfd_unmap_complete(mm, &uf);
if (populate && !ret)
mm_populate(addr, len);
return ret;
--
2.30.2

2021-04-28 20:31:33

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 61/94] drivers/tee/optee: Use maple tree iterators for __check_mem_type()

Signed-off-by: Liam R. Howlett <[email protected]>
---
drivers/tee/optee/call.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/tee/optee/call.c b/drivers/tee/optee/call.c
index 6132cc8d014c..cfe55b9e92b6 100644
--- a/drivers/tee/optee/call.c
+++ b/drivers/tee/optee/call.c
@@ -550,11 +550,18 @@ static bool is_normal_memory(pgprot_t p)

static int __check_mem_type(struct vm_area_struct *vma, unsigned long end)
{
- while (vma && is_normal_memory(vma->vm_page_prot)) {
- if (vma->vm_end >= end)
- return 0;
- vma = vma->vm_next;
+ MA_STATE(mas, &vma->vm_mm->mm_mt, vma->vm_start, vma->vm_start);
+
+ rcu_read_lock();
+ mas_for_each(&mas, vma, end) {
+ if (!is_normal_memory(vma->vm_page_prot))
+ break;
}
+ rcu_read_unlock();
+
+ if (!vma)
+ return 0;

return -EINVAL;
}
--
2.30.2

2021-04-28 20:31:33

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 88/94] mm/util: Remove __vma_link_list() and __vma_unlink_list()

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/internal.h | 5 -----
mm/mmap.c | 19 ++++---------------
mm/nommu.c | 6 ++----
mm/util.c | 40 ----------------------------------------
4 files changed, 6 insertions(+), 64 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 583f2f1e6ff8..34d00548aa81 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -394,11 +394,6 @@ static inline int vma_mas_remove(struct vm_area_struct *vma, struct ma_state *ma
return ret;
}

-/* mm/util.c */
-void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
- struct vm_area_struct *prev);
-void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma);
-
#ifdef CONFIG_MMU
extern long populate_vma_page_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end, int *nonblocking);
diff --git a/mm/mmap.c b/mm/mmap.c
index 51a29bb789ba..ed1b9df86966 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -583,7 +583,6 @@ static void vma_mas_link(struct mm_struct *mm, struct vm_area_struct *vma,
}

vma_mas_store(vma, mas);
- __vma_link_list(mm, vma, prev);
__vma_link_file(vma);

if (mapping)
@@ -604,7 +603,6 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
}

vma_mt_store(mm, vma);
- __vma_link_list(mm, vma, prev);
__vma_link_file(vma);

if (mapping)
@@ -624,7 +622,6 @@ static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)

BUG_ON(range_has_overlap(mm, vma->vm_start, vma->vm_end, &prev));
vma_mt_store(mm, vma);
- __vma_link_list(mm, vma, prev);
mm->map_count++;
}

@@ -681,13 +678,8 @@ inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,
}

/* Expanding over the next vma */
- if (remove_next) {
- /* Remove from mm linked list - also updates highest_vm_end */
- __vma_unlink_list(mm, next);
-
- if (file)
- __remove_shared_vm_struct(next, file, mapping);
-
+ if (remove_next && file) {
+ __remove_shared_vm_struct(next, file, mapping);
} else if (!next) {
mm->highest_vm_end = vm_end_gap(vma);
}
@@ -896,10 +888,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
flush_dcache_mmap_unlock(mapping);
}

- if (remove_next) {
- __vma_unlink_list(mm, next);
- if (file)
- __remove_shared_vm_struct(next, file, mapping);
+ if (remove_next && file) {
+ __remove_shared_vm_struct(next, file, mapping);
} else if (insert) {
/*
* split_vma has split insert from vma, and needs
@@ -3124,7 +3114,6 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct **brkvma,
if (!prev)
prev = mas_prev(mas, 0);

- __vma_link_list(mm, vma, prev);
mm->map_count++;
*brkvma = vma;
out:
diff --git a/mm/nommu.c b/mm/nommu.c
index 0eea24df1bd5..916038bafc65 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -592,7 +592,6 @@ static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)
mas_reset(&mas);
/* add the VMA to the tree */
vma_mas_store(vma, &mas);
- __vma_link_list(mm, vma, prev);
}

/*
@@ -617,7 +616,6 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)

/* remove from the MM's tree and list */
vma_mas_remove(vma, &mas);
- __vma_unlink_list(vma->vm_mm, vma);
}

/*
@@ -1430,7 +1428,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, struct list
return -EINVAL;
if (end == vma->vm_end)
goto erase_whole_vma;
- vma = vma->vm_next;
+ vma = vma_next(mm, vma);
} while (vma);
return -EINVAL;
} else {
@@ -1488,7 +1486,7 @@ void exit_mmap(struct mm_struct *mm)
mm->total_vm = 0;

while ((vma = mm->mmap)) {
- mm->mmap = vma->vm_next;
+ mm->mmap = vma_next(mm, vma);
delete_vma_from_mm(vma);
delete_vma(mm, vma);
cond_resched();
diff --git a/mm/util.c b/mm/util.c
index 35deaa0ccac5..3619b2529e51 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -271,46 +271,6 @@ void *memdup_user_nul(const void __user *src, size_t len)
}
EXPORT_SYMBOL(memdup_user_nul);

-void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
- struct vm_area_struct *prev)
-{
- struct vm_area_struct *next;
-
- vma->vm_prev = prev;
- if (prev) {
- next = prev->vm_next;
- prev->vm_next = vma;
- } else {
- next = mm->mmap;
- mm->mmap = vma;
- }
- vma->vm_next = next;
- if (next)
- next->vm_prev = vma;
- else
- mm->highest_vm_end = vm_end_gap(vma);
-}
-
-void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma)
-{
- struct vm_area_struct *prev, *next;
-
- next = vma->vm_next;
- prev = vma->vm_prev;
- if (prev)
- prev->vm_next = next;
- else
- mm->mmap = next;
- if (next)
- next->vm_prev = prev;
- else {
- if (prev)
- mm->highest_vm_end = vm_end_gap(prev);
- else
- mm->highest_vm_end = 0;
- }
-}
-
/* Check if the vma is being used as a stack by this task */
int vma_is_stack_for_current(struct vm_area_struct *vma)
{
--
2.30.2

2021-04-28 20:31:33

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 87/94] mm/swapfile: Use maple tree iterator instead of vma linked list

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/swapfile.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 149e77454e3c..69003ab63e64 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2116,17 +2116,24 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type,
{
struct vm_area_struct *vma;
int ret = 0;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

mmap_read_lock(mm);
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ mas_lock(&mas);
+ mas_for_each(&mas, vma, ULONG_MAX) {
if (vma->anon_vma) {
ret = unuse_vma(vma, type, frontswap,
fs_pages_to_unuse);
if (ret)
break;
}
+
+ mas_unlock(&mas);
+ mas_pause(&mas);
cond_resched();
+ mas_lock(&mas);
}
+ mas_unlock(&mas);
mmap_read_unlock(mm);
return ret;
}
--
2.30.2

2021-04-28 20:31:44

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 47/94] mm/mmap: Add do_mas_munmap() and wrapper for __do_munmap()

To avoid extra tree work, it is necessary to support passing in a maple state
to key functions. Start this work with __do_munmap().
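
For illustration only (not part of this patch): once the maple state is in the
signature, a caller that already holds the mmap write lock can set the state up
once and reuse it instead of re-walking the tree per operation. Here mm, start,
len, uf and error are assumed caller context.

	MA_STATE(mas, &mm->mm_mt, start, start);

	/* One state; do_mas_munmap() aligns len, walks, and removes the range. */
	error = do_mas_munmap(&mas, mm, start, len, &uf, false);
	if (error)
		return error;

	/* The same mas can now be reused for follow-up stores near start. */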

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mmap.c | 107 ++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 75 insertions(+), 32 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 0106b5accd7c..6fa93606e62b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2413,34 +2413,24 @@ static inline int unlock_range(struct vm_area_struct *start,

return count;
}
-/* Munmap is split into 2 main parts -- this part which finds
- * what needs doing, and the areas themselves, which do the
- * work. This now handles partial unmappings.
- * Jeremy Fitzhardinge <[email protected]>
+
+/* do_mas_align_munmap() - munmap the aligned region from @start to @end.
+ *
+ * @mas: The maple_state, ideally set up to alter the correct tree location.
+ * @vma: The starting vm_area_struct
+ * @mm: The mm_struct
+ * @start: The aligned start address to munmap.
+ * @end: The aligned end address to munmap.
+ * @uf: The userfaultfd list_head
+ * @downgrade: Set to true to attempt a write downgrade of the mmap_sem
+ *
+ */
-int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
- struct list_head *uf, bool downgrade)
+static int do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
+ struct mm_struct *mm, unsigned long start, unsigned long end,
+ struct list_head *uf, bool downgrade)
{
- unsigned long end;
- struct vm_area_struct *vma, *prev, *last;
- MA_STATE(mas, &mm->mm_mt, start, start);
-
- if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start)
- return -EINVAL;
-
- end = start + PAGE_ALIGN(len);
- if (end == start)
- return -EINVAL;
-
- /* arch_unmap() might do unmaps itself. */
- arch_unmap(mm, start, end);
-
- /* Find the first overlapping VMA */
- vma = mas_find(&mas, end - 1);
- if (!vma)
- return 0;
-
- mas.last = end - 1;
+ struct vm_area_struct *prev, *last;
/* we have start < vma->vm_end */

/*
@@ -2465,8 +2455,8 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
return error;
prev = vma;
vma = vma_next(mm, prev);
- mas.index = start;
- mas_reset(&mas);
+ mas->index = start;
+ mas_reset(mas);
} else {
prev = vma->vm_prev;
}
@@ -2482,7 +2472,7 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
if (error)
return error;
vma = vma_next(mm, prev);
- mas_reset(&mas);
+ mas_reset(mas);
}


@@ -2509,7 +2499,7 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
*/
mm->map_count -= unlock_range(vma, &last, end);
/* Drop removed area from the tree */
- mas_store_gfp(&mas, NULL, GFP_KERNEL);
+ mas_store_gfp(mas, NULL, GFP_KERNEL);

/* Detach vmas from the MM linked list */
vma->vm_prev = NULL;
@@ -2546,6 +2536,59 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
return downgrade ? 1 : 0;
}

+/*
+ * do_mas_munmap() - munmap a given range.
+ * @mas: The maple state
+ * @mm: The mm_struct
+ * @start: The start address to munmap
+ * @len: The length of the range to munmap
+ * @uf: The userfaultfd list_head
+ * @downgrade: set to true if the user wants to attempt to write_downgrade the
+ * mmap_sem
+ *
+ * This function takes a @mas that is in the correct state to remove the
+ * mapping(s). The @len will be aligned and any arch_unmap work will be
+ * performed.
+ */
+int do_mas_munmap(struct ma_state *mas, struct mm_struct *mm,
+ unsigned long start, size_t len, struct list_head *uf,
+ bool downgrade)
+{
+ unsigned long end;
+ struct vm_area_struct *vma;
+
+ if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start)
+ return -EINVAL;
+
+ end = start + PAGE_ALIGN(len);
+ if (end == start)
+ return -EINVAL;
+
+ /* arch_unmap() might do unmaps itself. */
+ arch_unmap(mm, start, end);
+
+ /* Find the first overlapping VMA */
+ vma = mas_find(mas, end - 1);
+ if (!vma)
+ return 0;
+
+ mas->last = end - 1;
+ return do_mas_align_munmap(mas, vma, mm, start, end, uf, downgrade);
+}
+
+int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
+ struct list_head *uf, bool downgrade)
+{
+ MA_STATE(mas, &mm->mm_mt, start, start);
+ return do_mas_munmap(&mas, mm, start, len, uf, downgrade);
+}
+
+/* do_munmap() - Wrapper function for non-maple tree aware do_munmap() calls.
+ * @mm: The mm_struct
+ * @start: The start address to munmap
+ * @len: The length to be munmapped.
+ * @uf: The userfaultfd list_head
+ */
int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
struct list_head *uf)
{
@@ -2583,7 +2626,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
}

/* Unmap any existing mapping in the area */
- if (do_munmap(mm, addr, len, uf))
+ if (do_mas_munmap(&mas, mm, addr, len, uf, false))
return -ENOMEM;

/*
@@ -2937,7 +2980,7 @@ static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
mas_set(mas, newbrk);
if (vma->vm_start != newbrk)
mas_reset(mas); // cause a re-walk for the first overlap.
- ret = __do_munmap(mm, newbrk, oldbrk - newbrk, uf, true);
+ ret = do_mas_munmap(mas, mm, newbrk, oldbrk-newbrk, uf, true);
goto munmap_full_vma;
}

--
2.30.2

2021-04-28 20:32:01

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 44/94] mm: Remove vmacache

The maple tree is able to find a VMA quick enough to no longer need the
vma cache. Remove the vmacache to reduce work in keeping it up to date
and code complexity.

Signed-off-by: Liam R. Howlett <[email protected]>
---
arch/s390/configs/debug_defconfig | 1 -
fs/exec.c | 3 -
fs/proc/task_mmu.c | 1 -
include/linux/mm_types.h | 1 -
include/linux/mm_types_task.h | 5 --
include/linux/sched.h | 1 -
include/linux/vm_event_item.h | 4 -
include/linux/vmacache.h | 28 -------
include/linux/vmstat.h | 6 --
kernel/debug/debug_core.c | 12 ---
kernel/fork.c | 5 --
lib/Kconfig.debug | 10 ---
mm/Makefile | 2 +-
mm/debug.c | 4 +-
mm/mmap.c | 17 -----
mm/nommu.c | 31 +-------
mm/vmacache.c | 117 ------------------------------
mm/vmstat.c | 4 -
18 files changed, 6 insertions(+), 246 deletions(-)
delete mode 100644 include/linux/vmacache.h
delete mode 100644 mm/vmacache.c

diff --git a/arch/s390/configs/debug_defconfig b/arch/s390/configs/debug_defconfig
index 6422618a4f75..a7aed6dbc06e 100644
--- a/arch/s390/configs/debug_defconfig
+++ b/arch/s390/configs/debug_defconfig
@@ -790,7 +790,6 @@ CONFIG_SLUB_DEBUG_ON=y
CONFIG_SLUB_STATS=y
CONFIG_DEBUG_STACK_USAGE=y
CONFIG_DEBUG_VM=y
-CONFIG_DEBUG_VM_VMACACHE=y
CONFIG_DEBUG_VM_PGFLAGS=y
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_MEMORY_NOTIFIER_ERROR_INJECT=m
diff --git a/fs/exec.c b/fs/exec.c
index 18594f11c31f..3d3f7d46137c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -28,7 +28,6 @@
#include <linux/file.h>
#include <linux/fdtable.h>
#include <linux/mm.h>
-#include <linux/vmacache.h>
#include <linux/stat.h>
#include <linux/fcntl.h>
#include <linux/swap.h>
@@ -1020,8 +1019,6 @@ static int exec_mmap(struct mm_struct *mm)
activate_mm(active_mm, mm);
if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
local_irq_enable();
- tsk->mm->vmacache_seqnum = 0;
- vmacache_flush(tsk);
task_unlock(tsk);
if (old_mm) {
mmap_read_unlock(old_mm);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fc9784544b24..503e1355cf6e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1,6 +1,5 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/pagewalk.h>
-#include <linux/vmacache.h>
#include <linux/hugetlb.h>
#include <linux/huge_mm.h>
#include <linux/mount.h>
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 41551bfa6ce0..304692ada024 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -380,7 +380,6 @@ struct mm_struct {
struct {
struct vm_area_struct *mmap; /* list of VMAs */
struct maple_tree mm_mt;
- u64 vmacache_seqnum; /* per-thread vmacache */
#ifdef CONFIG_MMU
unsigned long (*get_unmapped_area) (struct file *filp,
unsigned long addr, unsigned long len,
diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
index c1bc6731125c..33c9fa4d4f66 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -31,11 +31,6 @@
#define VMACACHE_SIZE (1U << VMACACHE_BITS)
#define VMACACHE_MASK (VMACACHE_SIZE - 1)

-struct vmacache {
- u64 seqnum;
- struct vm_area_struct *vmas[VMACACHE_SIZE];
-};
-
/*
* When updating this, please also update struct resident_page_types[] in
* kernel/fork.c
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8d5264b18cb6..e85fcd3ef86a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -772,7 +772,6 @@ struct task_struct {
struct mm_struct *active_mm;

/* Per-thread vma caching: */
- struct vmacache vmacache;

#ifdef SPLIT_RSS_COUNTING
struct task_rss_stat rss_stat;
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index ae0dd1948c2b..cd3ff075470b 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -117,10 +117,6 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
NR_TLB_LOCAL_FLUSH_ALL,
NR_TLB_LOCAL_FLUSH_ONE,
#endif /* CONFIG_DEBUG_TLBFLUSH */
-#ifdef CONFIG_DEBUG_VM_VMACACHE
- VMACACHE_FIND_CALLS,
- VMACACHE_FIND_HITS,
-#endif
#ifdef CONFIG_SWAP
SWAP_RA,
SWAP_RA_HIT,
diff --git a/include/linux/vmacache.h b/include/linux/vmacache.h
deleted file mode 100644
index 6fce268a4588..000000000000
--- a/include/linux/vmacache.h
+++ /dev/null
@@ -1,28 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __LINUX_VMACACHE_H
-#define __LINUX_VMACACHE_H
-
-#include <linux/sched.h>
-#include <linux/mm.h>
-
-static inline void vmacache_flush(struct task_struct *tsk)
-{
- memset(tsk->vmacache.vmas, 0, sizeof(tsk->vmacache.vmas));
-}
-
-extern void vmacache_update(unsigned long addr, struct vm_area_struct *newvma);
-extern struct vm_area_struct *vmacache_find(struct mm_struct *mm,
- unsigned long addr);
-
-#ifndef CONFIG_MMU
-extern struct vm_area_struct *vmacache_find_exact(struct mm_struct *mm,
- unsigned long start,
- unsigned long end);
-#endif
-
-static inline void vmacache_invalidate(struct mm_struct *mm)
-{
- mm->vmacache_seqnum++;
-}
-
-#endif /* __LINUX_VMACACHE_H */
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 3299cd69e4ca..0517f3b123ea 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -125,12 +125,6 @@ static inline void vm_events_fold_cpu(int cpu)
#define count_vm_tlb_events(x, y) do { (void)(y); } while (0)
#endif

-#ifdef CONFIG_DEBUG_VM_VMACACHE
-#define count_vm_vmacache_event(x) count_vm_event(x)
-#else
-#define count_vm_vmacache_event(x) do {} while (0)
-#endif
-
#define __count_zid_vm_events(item, zid, delta) \
__count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta)

diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
index 4708aec492df..1bd0bb76ed2c 100644
--- a/kernel/debug/debug_core.c
+++ b/kernel/debug/debug_core.c
@@ -53,7 +53,6 @@
#include <linux/pid.h>
#include <linux/smp.h>
#include <linux/mm.h>
-#include <linux/vmacache.h>
#include <linux/rcupdate.h>
#include <linux/irq.h>

@@ -285,17 +284,6 @@ static void kgdb_flush_swbreak_addr(unsigned long addr)
if (!CACHE_FLUSH_IS_SAFE)
return;

- if (current->mm) {
- int i;
-
- for (i = 0; i < VMACACHE_SIZE; i++) {
- if (!current->vmacache.vmas[i])
- continue;
- flush_cache_range(current->vmacache.vmas[i],
- addr, addr + BREAK_INSTR_SIZE);
- }
- }
-
/* Force flush instruction cache if it was outside the mm */
flush_icache_range(addr, addr + BREAK_INSTR_SIZE);
}
diff --git a/kernel/fork.c b/kernel/fork.c
index 83afd3007a2b..fe0922f75cc5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -42,7 +42,6 @@
#include <linux/mmu_notifier.h>
#include <linux/fs.h>
#include <linux/mm.h>
-#include <linux/vmacache.h>
#include <linux/nsproxy.h>
#include <linux/capability.h>
#include <linux/cpu.h>
@@ -1027,7 +1026,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
{
mm->mmap = NULL;
mt_init_flags(&mm->mm_mt, MAPLE_ALLOC_RANGE);
- mm->vmacache_seqnum = 0;
atomic_set(&mm->mm_users, 1);
atomic_set(&mm->mm_count, 1);
seqcount_init(&mm->write_protect_seq);
@@ -1425,9 +1423,6 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
if (!oldmm)
return 0;

- /* initialize the new vmacache entries */
- vmacache_flush(tsk);
-
if (clone_flags & CLONE_VM) {
mmget(oldmm);
mm = oldmm;
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index df977009425e..2328b8aa1f54 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -789,16 +789,6 @@ config DEBUG_VM

If unsure, say N.

-config DEBUG_VM_VMACACHE
- bool "Debug VMA caching"
- depends on DEBUG_VM
- help
- Enable this to turn on VMA caching debug information. Doing so
- can cause significant overhead, so only enable it in non-production
- environments.
-
- If unsure, say N.
-
config DEBUG_VM_RB
bool "Debug VM red-black trees"
depends on DEBUG_VM
diff --git a/mm/Makefile b/mm/Makefile
index a9ad6122d468..a061cf7fd591 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -50,7 +50,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
readahead.o swap.o truncate.o vmscan.o shmem.o \
util.o mmzone.o vmstat.o backing-dev.o \
mm_init.o percpu.o slab_common.o \
- compaction.o vmacache.o \
+ compaction.o \
interval_tree.o list_lru.o workingset.o \
debug.o gup.o mmap_lock.o $(mmu-y)

diff --git a/mm/debug.c b/mm/debug.c
index 0bdda8407f71..f382d319722a 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -214,7 +214,7 @@ EXPORT_SYMBOL(dump_vma);

void dump_mm(const struct mm_struct *mm)
{
- pr_emerg("mm %px mmap %px seqnum %llu task_size %lu\n"
+ pr_emerg("mm %px mmap %px task_size %lu\n"
#ifdef CONFIG_MMU
"get_unmapped_area %px\n"
#endif
@@ -242,7 +242,7 @@ void dump_mm(const struct mm_struct *mm)
"tlb_flush_pending %d\n"
"def_flags: %#lx(%pGv)\n",

- mm, mm->mmap, (long long) mm->vmacache_seqnum, mm->task_size,
+ mm, mm->mmap, mm->task_size,
#ifdef CONFIG_MMU
mm->get_unmapped_area,
#endif
diff --git a/mm/mmap.c b/mm/mmap.c
index b730b57e47c9..10c42a41e023 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -13,7 +13,6 @@
#include <linux/slab.h>
#include <linux/backing-dev.h>
#include <linux/mm.h>
-#include <linux/vmacache.h>
#include <linux/shm.h>
#include <linux/mman.h>
#include <linux/pagemap.h>
@@ -686,9 +685,6 @@ inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,
/* Remove from mm linked list - also updates highest_vm_end */
__vma_unlink_list(mm, next);

- /* Kill the cache */
- vmacache_invalidate(mm);
-
if (file)
__remove_shared_vm_struct(next, file, mapping);

@@ -902,8 +898,6 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,

if (remove_next) {
__vma_unlink_list(mm, next);
- /* Kill the cache */
- vmacache_invalidate(mm);
if (file)
__remove_shared_vm_struct(next, file, mapping);
} else if (insert) {
@@ -2188,16 +2182,9 @@ struct vm_area_struct *find_vma_intersection(struct mm_struct *mm,
struct vm_area_struct *vma;
MA_STATE(mas, &mm->mm_mt, start_addr, start_addr);

- /* Check the cache first. */
- vma = vmacache_find(mm, start_addr);
- if (likely(vma))
- return vma;
-
rcu_read_lock();
vma = mas_find(&mas, end_addr - 1);
rcu_read_unlock();
- if (vma)
- vmacache_update(mas.index, vma);

return vma;
}
@@ -2590,9 +2577,6 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
mm->highest_vm_end = prev ? vm_end_gap(prev) : 0;
tail_vma->vm_next = NULL;

- /* Kill the cache */
- vmacache_invalidate(mm);
-
/*
* Do not downgrade mmap_lock if we are next to VM_GROWSDOWN or
* VM_GROWSUP VMA. Such VMAs can change their size under
@@ -2973,7 +2957,6 @@ static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
if (vma_mas_remove(&unmap, mas))
goto mas_store_fail;

- vmacache_invalidate(vma->vm_mm);
if (vma->anon_vma) {
anon_vma_interval_tree_post_update_vma(vma);
anon_vma_unlock_write(vma->anon_vma);
diff --git a/mm/nommu.c b/mm/nommu.c
index c410f99203fb..0eea24df1bd5 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -19,7 +19,6 @@
#include <linux/export.h>
#include <linux/mm.h>
#include <linux/sched/mm.h>
-#include <linux/vmacache.h>
#include <linux/mman.h>
#include <linux/swap.h>
#include <linux/file.h>
@@ -601,22 +600,12 @@ static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)
*/
static void delete_vma_from_mm(struct vm_area_struct *vma)
{
- int i;
- struct address_space *mapping;
- struct mm_struct *mm = vma->vm_mm;
- struct task_struct *curr = current;
MA_STATE(mas, &vma->vm_mm->mm_mt, 0, 0);

- mm->map_count--;
- for (i = 0; i < VMACACHE_SIZE; i++) {
- /* if the vma is cached, invalidate the entire cache */
- if (curr->vmacache.vmas[i] == vma) {
- vmacache_invalidate(mm);
- break;
- }
- }
+ vma->vm_mm->map_count--;
/* remove the VMA from the mapping */
if (vma->vm_file) {
+ struct address_space *mapping;
mapping = vma->vm_file->f_mapping;

i_mmap_lock_write(mapping);
@@ -628,7 +617,7 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)

/* remove from the MM's tree and list */
vma_mas_remove(vma, &mas);
- __vma_unlink_list(mm, vma);
+ __vma_unlink_list(vma->vm_mm, vma);
}

/*
@@ -653,18 +642,10 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
struct vm_area_struct *vma;
MA_STATE(mas, &mm->mm_mt, addr, addr);

- /* check the cache first */
- vma = vmacache_find(mm, addr);
- if (likely(vma))
- return vma;
-
rcu_read_lock();
vma = mas_walk(&mas);
rcu_read_unlock();

- if (vma)
- vmacache_update(addr, vma);
-
return vma;
}
EXPORT_SYMBOL(find_vma);
@@ -699,11 +680,6 @@ static struct vm_area_struct *find_vma_exact(struct mm_struct *mm,
unsigned long end = addr + len;
MA_STATE(mas, &mm->mm_mt, addr, addr);

- /* check the cache first */
- vma = vmacache_find_exact(mm, addr, end);
- if (vma)
- return vma;
-
rcu_read_lock();
vma = mas_walk(&mas);
rcu_read_unlock();
@@ -714,7 +690,6 @@ static struct vm_area_struct *find_vma_exact(struct mm_struct *mm,
if (vma->vm_end != end)
return NULL;

- vmacache_update(addr, vma);
return vma;
}

diff --git a/mm/vmacache.c b/mm/vmacache.c
deleted file mode 100644
index 01a6e6688ec1..000000000000
--- a/mm/vmacache.c
+++ /dev/null
@@ -1,117 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- * Copyright (C) 2014 Davidlohr Bueso.
- */
-#include <linux/sched/signal.h>
-#include <linux/sched/task.h>
-#include <linux/mm.h>
-#include <linux/vmacache.h>
-
-/*
- * Hash based on the pmd of addr if configured with MMU, which provides a good
- * hit rate for workloads with spatial locality. Otherwise, use pages.
- */
-#ifdef CONFIG_MMU
-#define VMACACHE_SHIFT PMD_SHIFT
-#else
-#define VMACACHE_SHIFT PAGE_SHIFT
-#endif
-#define VMACACHE_HASH(addr) ((addr >> VMACACHE_SHIFT) & VMACACHE_MASK)
-
-/*
- * This task may be accessing a foreign mm via (for example)
- * get_user_pages()->find_vma(). The vmacache is task-local and this
- * task's vmacache pertains to a different mm (ie, its own). There is
- * nothing we can do here.
- *
- * Also handle the case where a kernel thread has adopted this mm via
- * kthread_use_mm(). That kernel thread's vmacache is not applicable to this mm.
- */
-static inline bool vmacache_valid_mm(struct mm_struct *mm)
-{
- return current->mm == mm && !(current->flags & PF_KTHREAD);
-}
-
-void vmacache_update(unsigned long addr, struct vm_area_struct *newvma)
-{
- if (vmacache_valid_mm(newvma->vm_mm))
- current->vmacache.vmas[VMACACHE_HASH(addr)] = newvma;
-}
-
-static bool vmacache_valid(struct mm_struct *mm)
-{
- struct task_struct *curr;
-
- if (!vmacache_valid_mm(mm))
- return false;
-
- curr = current;
- if (mm->vmacache_seqnum != curr->vmacache.seqnum) {
- /*
- * First attempt will always be invalid, initialize
- * the new cache for this task here.
- */
- curr->vmacache.seqnum = mm->vmacache_seqnum;
- vmacache_flush(curr);
- return false;
- }
- return true;
-}
-
-struct vm_area_struct *vmacache_find(struct mm_struct *mm, unsigned long addr)
-{
- int idx = VMACACHE_HASH(addr);
- int i;
-
- count_vm_vmacache_event(VMACACHE_FIND_CALLS);
-
- if (!vmacache_valid(mm))
- return NULL;
-
- for (i = 0; i < VMACACHE_SIZE; i++) {
- struct vm_area_struct *vma = current->vmacache.vmas[idx];
-
- if (vma) {
-#ifdef CONFIG_DEBUG_VM_VMACACHE
- if (WARN_ON_ONCE(vma->vm_mm != mm))
- break;
-#endif
- if (vma->vm_start <= addr && vma->vm_end > addr) {
- count_vm_vmacache_event(VMACACHE_FIND_HITS);
- return vma;
- }
- }
- if (++idx == VMACACHE_SIZE)
- idx = 0;
- }
-
- return NULL;
-}
-
-#ifndef CONFIG_MMU
-struct vm_area_struct *vmacache_find_exact(struct mm_struct *mm,
- unsigned long start,
- unsigned long end)
-{
- int idx = VMACACHE_HASH(start);
- int i;
-
- count_vm_vmacache_event(VMACACHE_FIND_CALLS);
-
- if (!vmacache_valid(mm))
- return NULL;
-
- for (i = 0; i < VMACACHE_SIZE; i++) {
- struct vm_area_struct *vma = current->vmacache.vmas[idx];
-
- if (vma && vma->vm_start == start && vma->vm_end == end) {
- count_vm_vmacache_event(VMACACHE_FIND_HITS);
- return vma;
- }
- if (++idx == VMACACHE_SIZE)
- idx = 0;
- }
-
- return NULL;
-}
-#endif
diff --git a/mm/vmstat.c b/mm/vmstat.c
index cccee36b289c..37bf2fef2cee 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1361,10 +1361,6 @@ const char * const vmstat_text[] = {
"nr_tlb_local_flush_one",
#endif /* CONFIG_DEBUG_TLBFLUSH */

-#ifdef CONFIG_DEBUG_VM_VMACACHE
- "vmacache_find_calls",
- "vmacache_find_hits",
-#endif
#ifdef CONFIG_SWAP
"swap_ra",
"swap_ra_hit",
--
2.30.2

2021-04-28 20:32:24

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 54/94] arch/parisc: Remove mmap linked list from kernel/cache

Start using the maple tree
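
The conversion follows the same pattern as the other linked-list removals in
this series; roughly (sketch only, matching the locking used in the diff
below):

	MA_STATE(mas, &mm->mm_mt, 0, 0);

	rcu_read_lock();
	mas_for_each(&mas, vma, ULONG_MAX) {
		/* per-VMA work, exactly as in the old vm_next walk */
	}
	rcu_read_unlock();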

Signed-off-by: Liam R. Howlett <[email protected]>
---
arch/parisc/kernel/cache.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/parisc/kernel/cache.c b/arch/parisc/kernel/cache.c
index 86a1a63563fd..bc7bffed24ba 100644
--- a/arch/parisc/kernel/cache.c
+++ b/arch/parisc/kernel/cache.c
@@ -520,9 +520,13 @@ static inline unsigned long mm_total_size(struct mm_struct *mm)
{
struct vm_area_struct *vma;
unsigned long usize = 0;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

- for (vma = mm->mmap; vma; vma = vma->vm_next)
+ rcu_read_lock();
+ mas_for_each(&mas, vma, ULONG_MAX)
usize += vma->vm_end - vma->vm_start;
+ rcu_read_unlock();
+
return usize;
}

@@ -548,6 +552,7 @@ void flush_cache_mm(struct mm_struct *mm)
{
struct vm_area_struct *vma;
pgd_t *pgd;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

/* Flushing the whole cache on each cpu takes forever on
rp3440, etc. So, avoid it if the mm isn't too big. */
@@ -560,17 +565,20 @@ void flush_cache_mm(struct mm_struct *mm)
}

if (mm->context == mfsp(3)) {
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ rcu_read_lock();
+ mas_for_each(&mas, vma, ULONG_MAX) {
flush_user_dcache_range_asm(vma->vm_start, vma->vm_end);
if (vma->vm_flags & VM_EXEC)
flush_user_icache_range_asm(vma->vm_start, vma->vm_end);
flush_tlb_range(vma, vma->vm_start, vma->vm_end);
}
+ rcu_read_unlock();
return;
}

pgd = mm->pgd;
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ rcu_read_lock();
+ mas_for_each(&mas, vma, ULONG_MAX) {
unsigned long addr;

for (addr = vma->vm_start; addr < vma->vm_end;
@@ -590,6 +598,7 @@ void flush_cache_mm(struct mm_struct *mm)
}
}
}
+ rcu_read_unlock();
}

void flush_cache_range(struct vm_area_struct *vma,
--
2.30.2

2021-04-28 20:32:57

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 58/94] arch/x86: Use maple tree iterators for vdso/vma

Signed-off-by: Liam R. Howlett <[email protected]>
---
arch/x86/entry/vdso/vma.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 235a5794296a..c0b160a9585d 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -128,15 +128,19 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
struct mm_struct *mm = task->mm;
struct vm_area_struct *vma;

+ MA_STATE(mas, &mm->mm_mt, 0, 0);
+
mmap_read_lock(mm);
+ rcu_read_lock();

- for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ mas_for_each(&mas, vma, ULONG_MAX) {
unsigned long size = vma->vm_end - vma->vm_start;

if (vma_is_special_mapping(vma, &vvar_mapping))
zap_page_range(vma, vma->vm_start, size);
}

+ rcu_read_unlock();
mmap_read_unlock(mm);
return 0;
}
@@ -354,6 +358,7 @@ int map_vdso_once(const struct vdso_image *image, unsigned long addr)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

mmap_write_lock(mm);
/*
@@ -363,13 +368,16 @@ int map_vdso_once(const struct vdso_image *image, unsigned long addr)
* We could search vma near context.vdso, but it's a slowpath,
* so let's explicitly check all VMAs to be completely sure.
*/
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ mas_lock(&mas);
+ mas_for_each(&mas, vma, ULONG_MAX) {
if (vma_is_special_mapping(vma, &vdso_mapping) ||
vma_is_special_mapping(vma, &vvar_mapping)) {
+ mas_unlock(&mas);
mmap_write_unlock(mm);
return -EEXIST;
}
}
+ mas_unlock(&mas);
mmap_write_unlock(mm);

return map_vdso(image, addr);
--
2.30.2

2021-04-28 20:33:23

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 51/94] mmap: make remove_vma_list() inline

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mmap.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 3b1a9f6bc39b..a8e4f836b167 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2274,7 +2274,8 @@ EXPORT_SYMBOL_GPL(find_extend_vma);
*
* Called with the mm semaphore held.
*/
-static void remove_vma_list(struct mm_struct *mm, struct vm_area_struct *vma)
+static inline void remove_vma_list(struct mm_struct *mm,
+ struct vm_area_struct *vma)
{
unsigned long nr_accounted = 0;

--
2.30.2

2021-04-28 20:33:24

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 53/94] arch/arm64: Remove mmap linked list from vdso.

Start using the maple tree

Signed-off-by: Liam R. Howlett <[email protected]>
---
arch/arm64/kernel/vdso.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index a61fc4f989b3..57ea81fbe04b 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -136,10 +136,12 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
{
struct mm_struct *mm = task->mm;
struct vm_area_struct *vma;
+ MA_STATE(mas, &mm->mm_mt, 0, 0);

mmap_read_lock(mm);

- for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ rcu_read_lock();
+ mas_for_each(&mas, vma, ULONG_MAX) {
unsigned long size = vma->vm_end - vma->vm_start;

if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA64].dm))
@@ -149,6 +151,7 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
zap_page_range(vma, vma->vm_start, size);
#endif
}
+ rcu_read_unlock();

mmap_read_unlock(mm);
return 0;
--
2.30.2

2021-04-28 20:34:03

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH 94/94] mm: Move mas locking outside of munmap() path.

Now that there is a split variant that allows splitting to use a maple state,
move the locks to a more logical position.
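
Sketch of the resulting convention (illustrative, not from the diff): the
caller takes the tree lock for the whole operation, nested inside the mmap
write lock, and vma_mas_store()/vma_mas_remove() no longer lock internally.
Error and downgrade handling are omitted; mm, start, len, uf and ret are
assumed caller context.

	MA_STATE(mas, &mm->mm_mt, start, start);

	mmap_write_lock(mm);
	mas_lock(&mas);
	ret = do_mas_munmap(&mas, mm, start, len, &uf, false);
	mas_unlock(&mas);
	mmap_write_unlock(mm);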

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/internal.h | 4 ---
mm/mmap.c | 81 +++++++++++++++++++++++++++++++++------------------
mm/nommu.c | 4 +++
3 files changed, 56 insertions(+), 33 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 0fb161ee7f73..68888d4d9cb3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -367,9 +367,7 @@ static inline int vma_mas_store(struct vm_area_struct *vma, struct ma_state *mas

mas->index = vma->vm_start;
mas->last = vma->vm_end - 1;
- mas_lock(mas);
ret = mas_store_gfp(mas, vma, GFP_KERNEL);
- mas_unlock(mas);
return ret;
}

@@ -388,9 +386,7 @@ static inline int vma_mas_remove(struct vm_area_struct *vma, struct ma_state *ma

mas->index = vma->vm_start;
mas->last = vma->vm_end - 1;
- mas_lock(mas);
ret = mas_store_gfp(mas, NULL, GFP_KERNEL);
- mas_unlock(mas);
return ret;
}

diff --git a/mm/mmap.c b/mm/mmap.c
index 5335bd72bda3..a0a4d1c4ca15 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -239,6 +239,7 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
goto success;
}

+ mas_lock(&mas);
mas_set(&mas, newbrk);
brkvma = mas_walk(&mas);
if (brkvma) { // munmap necessary, there is something at newbrk.
@@ -289,19 +290,21 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
goto out;

mm->brk = brk;
-
success:
populate = newbrk > oldbrk && (mm->def_flags & VM_LOCKED) != 0;
if (downgraded)
mmap_read_unlock(mm);
- else
+ else {
+ mas_unlock(&mas);
mmap_write_unlock(mm);
+ }
userfaultfd_unmap_complete(mm, &uf);
if (populate)
mm_populate_vma(brkvma, oldbrk, newbrk);
return brk;

out:
+ mas_unlock(&mas);
mmap_write_unlock(mm);
return origbrk;
}
@@ -501,7 +504,9 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
{
MA_STATE(mas, &mm->mm_mt, vma->vm_start, vma->vm_end - 1);

+ mas_lock(&mas);
vma_mas_link(mm, vma, &mas);
+ mas_unlock(&mas);
}

/*
@@ -2442,8 +2447,6 @@ static inline void detach_range(struct mm_struct *mm, struct ma_state *mas,
do {
count++;
*vma = mas_prev(mas, start);
- BUG_ON((*vma)->vm_start < start);
- BUG_ON((*vma)->vm_end > end + 1);
vma_mas_store(*vma, dst);
if ((*vma)->vm_flags & VM_LOCKED) {
mm->locked_vm -= vma_pages(*vma);
@@ -2548,14 +2551,12 @@ static int do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
}

/* Point of no return */
- mas_lock(mas);
if (next)
max = next->vm_start;

mtree_init(&mt_detach, MAPLE_ALLOC_RANGE);
dst.tree = &mt_detach;
detach_range(mm, mas, &dst, &vma);
- mas_unlock(mas);

/*
* Do not downgrade mmap_lock if we are next to VM_GROWSDOWN or
@@ -2567,8 +2568,10 @@ static int do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
downgrade = false;
else if (prev && (prev->vm_flags & VM_GROWSUP))
downgrade = false;
- else
+ else {
+ mas_unlock(mas);
mmap_write_downgrade(mm);
+ }
}

/* Unmap the region */
@@ -2634,7 +2637,9 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
int ret;
MA_STATE(mas, &mm->mm_mt, start, start);

+ mas_lock(&mas);
ret = do_mas_munmap(&mas, mm, start, len, uf, false);
+ mas_unlock(&mas);
return ret;
}

@@ -2651,11 +2656,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long merge_start = addr, merge_end = end;
unsigned long max = USER_PGTABLES_CEILING;
pgoff_t vm_pgoff;
- int error;
+ int error = -ENOMEM;
struct ma_state ma_prev, tmp;
MA_STATE(mas, &mm->mm_mt, addr, end - 1);


+ mas_lock(&mas);
/* Check against address space limit. */
if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
unsigned long nr_pages;
@@ -2668,23 +2674,21 @@ unsigned long mmap_region(struct file *file, unsigned long addr,

if (!may_expand_vm(mm, vm_flags,
(len >> PAGE_SHIFT) - nr_pages))
- return -ENOMEM;
+ goto no_mem;
}

validate_mm(mm);
/* Unmap any existing mapping in the area */
- if (do_mas_munmap(&mas, mm, addr, len, uf, false)) {
- return -ENOMEM;
- }
+ if (do_mas_munmap(&mas, mm, addr, len, uf, false))
+ goto no_mem;

/*
* Private writable mapping: check memory availability
*/
if (accountable_mapping(file, vm_flags)) {
charged = len >> PAGE_SHIFT;
- if (security_vm_enough_memory_mm(mm, charged)) {
- return -ENOMEM;
- }
+ if (security_vm_enough_memory_mm(mm, charged))
+ goto no_mem;
vm_flags |= VM_ACCOUNT;
}

@@ -2735,10 +2739,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
* not unmapped, but the maps are removed from the list.
*/
vma = vm_area_alloc(mm);
- if (!vma) {
- error = -ENOMEM;
+ if (!vma)
goto unacct_error;
- }

vma->vm_start = addr;
vma->vm_end = end;
@@ -2863,6 +2865,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vma->vm_flags |= VM_SOFTDIRTY;
vma_set_page_prot(vma);
validate_mm(mm);
+ mas_unlock(&mas);
return addr;

unmap_and_free_vma:
@@ -2883,6 +2886,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
unacct_error:
if (charged)
vm_unacct_memory(charged);
+no_mem:
+ mas_unlock(&mas);
return error;
}

@@ -2895,6 +2900,7 @@ static int __vm_munmap(unsigned long start, size_t len, bool downgrade)

if (mmap_write_lock_killable(mm))
return -EINTR;
+ mas_lock(&mas);
ret = do_mas_munmap(&mas, mm, start, len, &uf, downgrade);
/*
* Returning 1 indicates mmap_lock is downgraded.
@@ -2904,8 +2910,10 @@ static int __vm_munmap(unsigned long start, size_t len, bool downgrade)
if (ret == 1) {
mmap_read_unlock(mm);
ret = 0;
- } else
+ } else {
+ mas_unlock(&mas);
mmap_write_unlock(mm);
+ }

userfaultfd_unmap_complete(mm, &uf);
return ret;
@@ -2957,6 +2965,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
if (mmap_write_lock_killable(mm))
return -EINTR;

+ rcu_read_lock();
mas_set(&mas, start);
vma = mas_walk(&mas);

@@ -3005,6 +3014,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
prot, flags, pgoff, &populate, NULL);
fput(file);
out:
+ rcu_read_unlock();
mmap_write_unlock(mm);
if (populate)
mm_populate(ret, populate);
@@ -3021,7 +3031,8 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
* @oldbrk: The end of the address to unmap
* @uf: The userfaultfd list_head
*
- * Returns: 0 on success.
+ * Returns: 0 on success, 1 on success and downgraded write lock, negative
+ * otherwise.
* unmaps a partial VMA mapping. Does not handle alignment, downgrades lock if
* possible.
*/
@@ -3083,6 +3094,7 @@ static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
munlock_vma_pages_range(&unmap, newbrk, oldbrk);
}

+ mas_unlock(mas);
mmap_write_downgrade(mm);
unmap_region(mm, &unmap, mas, newbrk, oldbrk, vma,
next ? next->vm_start : 0);
@@ -3165,13 +3177,10 @@ static int do_brk_flags(struct ma_state *mas, struct ma_state *ma_prev,
anon_vma_lock_write(vma->anon_vma);
anon_vma_interval_tree_pre_update_vma(vma);
}
- mas_lock(ma_prev);
vma->vm_end = addr + len;
vma->vm_flags |= VM_SOFTDIRTY;
- if (mas_store_gfp(ma_prev, vma, GFP_KERNEL)) {
- mas_unlock(ma_prev);
+ if (mas_store_gfp(ma_prev, vma, GFP_KERNEL))
goto mas_mod_fail;
- }

if (vma->anon_vma) {
anon_vma_interval_tree_post_update_vma(vma);
@@ -3242,10 +3251,12 @@ int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags)
if (mmap_write_lock_killable(mm))
return -EINTR;

+ mas_lock(&mas);
// This vma left intentionally blank.
mas_walk(&mas);
ret = do_brk_flags(&mas, &mas, &vma, addr, len, flags);
populate = ((mm->def_flags & VM_LOCKED) != 0);
+ mas_unlock(&mas);
mmap_write_unlock(mm);
if (populate && !ret)
mm_populate_vma(vma, addr, addr + len);
@@ -3307,9 +3318,10 @@ void exit_mmap(struct mm_struct *mm)

arch_exit_mmap(mm);

+ mas_lock(&mas);
vma = mas_find(&mas, ULONG_MAX);
if (!vma) { /* Can happen if dup_mmap() received an OOM */
- rcu_read_unlock();
+ mas_unlock(&mas);
return;
}

@@ -3322,6 +3334,7 @@ void exit_mmap(struct mm_struct *mm)
unmap_vmas(&tlb, vma, &mas, 0, -1);
free_pgtables(&tlb, &mas2, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
tlb_finish_mmu(&tlb);
+ mas_unlock(&mas);

/*
* Walk the list again, actually closing and freeing it,
@@ -3346,12 +3359,16 @@ void exit_mmap(struct mm_struct *mm)
*/
int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
{
- if (find_vma_intersection(mm, vma->vm_start, vma->vm_end))
- return -ENOMEM;
+ MA_STATE(mas, &mm->mm_mt, vma->vm_start, vma->vm_end - 1);
+
+ mas_lock(&mas);
+ if (mas_find(&mas, vma->vm_end - 1))
+ goto no_mem;

if ((vma->vm_flags & VM_ACCOUNT) &&
security_vm_enough_memory_mm(mm, vma_pages(vma)))
- return -ENOMEM;
+ goto no_mem;
+

/*
* The vm_pgoff of a purely anonymous vma should be irrelevant
@@ -3370,8 +3387,14 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT;
}

- vma_link(mm, vma);
+ mas_reset(&mas);
+ vma_mas_link(mm, vma, &mas);
+ mas_unlock(&mas);
return 0;
+
+no_mem:
+ mas_unlock(&mas);
+ return -ENOMEM;
}

/*
diff --git a/mm/nommu.c b/mm/nommu.c
index a99e276445ce..65eee2770625 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -571,6 +571,7 @@ static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)

BUG_ON(!vma->vm_region);

+ mas_lock(&mas);
mm->map_count++;
printk("mm at %u\n", mm->map_count);
vma->vm_mm = mm;
@@ -592,6 +593,7 @@ static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)
mas_reset(&mas);
/* add the VMA to the tree */
vma_mas_store(vma, &mas);
+ mas_unlock(&mas);
}

/*
@@ -601,6 +603,7 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
{
MA_STATE(mas, &vma->vm_mm->mm_mt, 0, 0);

+ mas_lock(&mas);
vma->vm_mm->map_count--;
/* remove the VMA from the mapping */
if (vma->vm_file) {
@@ -616,6 +619,7 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)

/* remove from the MM's tree and list */
vma_mas_remove(vma, &mas);
+ mas_unlock(&mas);
}

/*
--
2.30.2

2021-05-01 05:23:30

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [PATCH 38/94] mm/gup: Add mm_populate_vma() for use when the vma is known

On Wed, Apr 28, 2021 at 03:36:08PM +0000, Liam Howlett wrote:
> When a vma is known, avoid calling mm_populate to search for the vma to
> populate.
>
> Signed-off-by: Liam R. Howlett <[email protected]>
> ---
> mm/gup.c | 20 ++++++++++++++++++++
> mm/internal.h | 4 ++++
> 2 files changed, 24 insertions(+)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index c3a17b189064..48fe98ab0729 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1468,6 +1468,26 @@ long populate_vma_page_range(struct vm_area_struct *vma,
> NULL, NULL, locked);
> }
>
> +/*
> + * mm_populate_vma() - Populate a single range in a single vma.
> + * @vma: The vma to populate.
> + * @start: The start address to populate
> + * @end: The end address to stop populating
> + *
> + * Note: Ignores errors.
> + */
> +void mm_populate_vma(struct vm_area_struct *vma, unsigned long start,
> + unsigned long end)
> +{
> + struct mm_struct *mm = current->mm;
> + int locked = 1;
> +
> + mmap_read_lock(mm);
> + populate_vma_page_range(vma, start, end, &locked);
> + if (locked)
> + mmap_read_unlock(mm);
> +}
> +

This seems like a nonsensical API at first glance - VMAs that are found
in the vma tree might be modified, merged, split, or freed at any time
if the mmap lock is not held, so the API can not be safely used. I think
this applies to maple tree vmas just as much as it did for rbtree vmas ?

--
Michel "walken" Lespinasse

2021-05-01 06:15:15

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [PATCH 94/94] mm: Move mas locking outside of munmap() path.

On Wed, Apr 28, 2021 at 03:36:32PM +0000, Liam Howlett wrote:
> Now that there is a split variant that allows splitting to use a maple state,
> move the locks to a more logical position.

In this patch set, is the maple tree lock ever held outside of code
sections already protected by the mmap write lock ?

--
Michel "walken" Lespinasse

2021-05-03 18:10:52

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 38/94] mm/gup: Add mm_populate_vma() for use when the vma is known

On Mon, May 03, 2021 at 03:53:58PM +0000, Liam Howlett wrote:
> * Michel Lespinasse <[email protected]> [210501 01:13]:
> > On Wed, Apr 28, 2021 at 03:36:08PM +0000, Liam Howlett wrote:
> > > When a vma is known, avoid calling mm_populate to search for the vma to
> > > populate.
> > >
> > > Signed-off-by: Liam R. Howlett <[email protected]>
> > > ---
> > > mm/gup.c | 20 ++++++++++++++++++++
> > > mm/internal.h | 4 ++++
> > > 2 files changed, 24 insertions(+)
> > >
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index c3a17b189064..48fe98ab0729 100644
> > > --- a/mm/gup.c
> > > +++ b/mm/gup.c
> > > @@ -1468,6 +1468,26 @@ long populate_vma_page_range(struct vm_area_struct *vma,
> > > NULL, NULL, locked);
> > > }
> > >
> > > +/*
> > > + * mm_populate_vma() - Populate a single range in a single vma.
> > > + * @vma: The vma to populate.
> > > + * @start: The start address to populate
> > > + * @end: The end address to stop populating
> > > + *
> > > + * Note: Ignores errors.
> > > + */
> > > +void mm_populate_vma(struct vm_area_struct *vma, unsigned long start,
> > > + unsigned long end)
> > > +{
> > > + struct mm_struct *mm = current->mm;
> > > + int locked = 1;
> > > +
> > > + mmap_read_lock(mm);
> > > + populate_vma_page_range(vma, start, end, &locked);
> > > + if (locked)
> > > + mmap_read_unlock(mm);
> > > +}
> > > +
> >
> > This seems like a nonsensical API at first glance - VMAs that are found
> > in the vma tree might be modified, merged, split, or freed at any time
> > if the mmap lock is not held, so the API can not be safely used. I think
> > this applies to maple tree vmas just as much as it did for rbtree vmas ?
>
> This is correct - it cannot be used without having the mmap_sem lock.
> This is a new internal mm code API and is used to avoid callers that use
> mm_populate() on a range that is known to be in a single VMA and already
> have that VMA. So instead of re-walking the tree to re-find the VMAs,
> this function can be used with the known VMA and range.
>
> It is used as described in patch 39 and 40 of this series.

In patch 39, what you do is:

1 Take the mmap_sem for write
2 do stuff
3 Drop the mmap_sem
4 Call mm_populate_vma() with the vma, which takes the mmap_sem
for read

The problem is that between 3 & 4, a racing thread might cause us to free
the vma and so we've now passed a bogus pointer into mm_populate_vma().

What we need instead is to downgrade the mmap_sem from write to read at
step 3, so the vma is guaranteed to still be good.
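
i.e. something along these lines (rough sketch only; error handling omitted,
and vma/start/end are whatever was set up in step 2):

	int locked = 1;

	mmap_write_lock(mm);
	/* step 2: set up the vma ... */
	mmap_write_downgrade(mm);	/* step 3: vma stays valid under the read lock */
	populate_vma_page_range(vma, start, end, &locked);	/* step 4 */
	if (locked)
		mmap_read_unlock(mm);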

2021-05-03 18:10:53

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 94/94] mm: Move mas locking outside of munmap() path.

* Michel Lespinasse <[email protected]> [210501 02:13]:
> On Wed, Apr 28, 2021 at 03:36:32PM +0000, Liam Howlett wrote:
> > Now that there is a split variant that allows splitting to use a maple state,
> > move the locks to a more logical position.
>
> In this patch set, is the maple tree lock ever held outside of code
> sections already protected by the mmap write lock ?


No, the maple tree lock is currently a subset of the mmap write lock.

Thanks,
Liam

2021-05-03 18:11:36

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 38/94] mm/gup: Add mm_populate_vma() for use when the vma is known

* Michel Lespinasse <[email protected]> [210501 01:13]:
> On Wed, Apr 28, 2021 at 03:36:08PM +0000, Liam Howlett wrote:
> > When a vma is known, avoid calling mm_populate to search for the vma to
> > populate.
> >
> > Signed-off-by: Liam R. Howlett <[email protected]>
> > ---
> > mm/gup.c | 20 ++++++++++++++++++++
> > mm/internal.h | 4 ++++
> > 2 files changed, 24 insertions(+)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index c3a17b189064..48fe98ab0729 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1468,6 +1468,26 @@ long populate_vma_page_range(struct vm_area_struct *vma,
> > NULL, NULL, locked);
> > }
> >
> > +/*
> > + * mm_populate_vma() - Populate a single range in a single vma.
> > + * @vma: The vma to populate.
> > + * @start: The start address to populate
> > + * @end: The end address to stop populating
> > + *
> > + * Note: Ignores errors.
> > + */
> > +void mm_populate_vma(struct vm_area_struct *vma, unsigned long start,
> > + unsigned long end)
> > +{
> > + struct mm_struct *mm = current->mm;
> > + int locked = 1;
> > +
> > + mmap_read_lock(mm);
> > + populate_vma_page_range(vma, start, end, &locked);
> > + if (locked)
> > + mmap_read_unlock(mm);
> > +}
> > +
>
> This seems like a nonsensical API at first glance - VMAs that are found
> in the vma tree might be modified, merged, split, or freed at any time
> if the mmap lock is not held, so the API can not be safely used. I think
> this applies to maple tree vmas just as much as it did for rbtree vmas ?

This is correct - it cannot be used without having the mmap_sem lock.
This is a new internal mm API for callers that would otherwise use
mm_populate() on a range known to lie within a single VMA that they
already hold. Instead of re-walking the tree to re-find that VMA, this
function can be used with the known VMA and range.

It is used as described in patch 39 and 40 of this series.

Thanks,
Liam

2021-05-03 23:07:27

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 38/94] mm/gup: Add mm_populate_vma() for use when the vma is known

* Matthew Wilcox <[email protected]> [210503 12:02]:
> On Mon, May 03, 2021 at 03:53:58PM +0000, Liam Howlett wrote:
> > * Michel Lespinasse <[email protected]> [210501 01:13]:
> > > On Wed, Apr 28, 2021 at 03:36:08PM +0000, Liam Howlett wrote:
> > > > When a vma is known, avoid calling mm_populate to search for the vma to
> > > > populate.
> > > >
> > > > Signed-off-by: Liam R. Howlett <[email protected]>
> > > > ---
> > > > mm/gup.c | 20 ++++++++++++++++++++
> > > > mm/internal.h | 4 ++++
> > > > 2 files changed, 24 insertions(+)
> > > >
> > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > index c3a17b189064..48fe98ab0729 100644
> > > > --- a/mm/gup.c
> > > > +++ b/mm/gup.c
> > > > @@ -1468,6 +1468,26 @@ long populate_vma_page_range(struct vm_area_struct *vma,
> > > > NULL, NULL, locked);
> > > > }
> > > >
> > > > +/*
> > > > + * mm_populate_vma() - Populate a single range in a single vma.
> > > > + * @vma: The vma to populate.
> > > > + * @start: The start address to populate
> > > > + * @end: The end address to stop populating
> > > > + *
> > > > + * Note: Ignores errors.
> > > > + */
> > > > +void mm_populate_vma(struct vm_area_struct *vma, unsigned long start,
> > > > + unsigned long end)
> > > > +{
> > > > + struct mm_struct *mm = current->mm;
> > > > + int locked = 1;
> > > > +
> > > > + mmap_read_lock(mm);
> > > > + populate_vma_page_range(vma, start, end, &locked);
> > > > + if (locked)
> > > > + mmap_read_unlock(mm);
> > > > +}
> > > > +
> > >
> > > This seems like a nonsensical API at first glance - VMAs that are found
> > > in the vma tree might be modified, merged, split, or freed at any time
> > > if the mmap lock is not held, so the API can not be safely used. I think
> > > this applies to maple tree vmas just as much as it did for rbtree vmas ?
> >
> > This is correct - it cannot be used without having the mmap_sem lock.
> > This is a new internal mm API for callers that would otherwise use
> > mm_populate() on a range known to lie within a single VMA that they
> > already hold. Instead of re-walking the tree to re-find that VMA, this
> > function can be used with the known VMA and range.
> >
> > It is used as described in patch 39 and 40 of this series.
>
> In patch 39, what you do is:
>
> 1 Take the mmap_sem for write
> 2 do stuff
> 3 Drop the mmap_sem
> 4 Call mm_populate_vm() with the vma, which takes the mmap_sem
> for read
>
> The problem is that between 3 & 4, a racing thread might cause us to free
> > the vma and so we've now passed a bogus pointer into mm_populate_vma().
>
> What we need instead is to downgrade the mmap_sem from write to read at
> step 3, so the vma is guaranteed to still be good.

Thank you. I will remove these patches from the series and work on this
idea.

Regards,
Liam

2021-05-28 17:32:54

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 21/94] radix tree test suite: Enhancements for Maple Tree

On Wed, Apr 28, 2021 at 8:36 AM Liam Howlett <[email protected]> wrote:
>

I know you have v2 for the first part of this patchset; I'm just going
over the whole thing... There should be some description here of what
the new struct member and new function are for. Ideally you would also
split it in two because it introduces two seemingly independent
additions: non_kernel and kmem_cache_get_alloc.

> Signed-off-by: Liam R. Howlett <[email protected]>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> ---
> tools/testing/radix-tree/linux.c | 16 +++++++++++++++-
> tools/testing/radix-tree/linux/kernel.h | 1 +
> 2 files changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/radix-tree/linux.c b/tools/testing/radix-tree/linux.c
> index 2d9c59df60de..93f7de81fbe8 100644
> --- a/tools/testing/radix-tree/linux.c
> +++ b/tools/testing/radix-tree/linux.c
> @@ -24,15 +24,28 @@ struct kmem_cache {
> int nr_objs;
> void *objs;
> void (*ctor)(void *);
> + unsigned int non_kernel;
> };
>
> +void kmem_cache_set_non_kernel(struct kmem_cache *cachep, unsigned int val)
> +{
> + cachep->non_kernel = val;
> +}
> +
> +unsigned long kmem_cache_get_alloc(struct kmem_cache *cachep)
> +{
> + return cachep->size * nr_allocated;

IIUC nr_allocated is incremented/decremented every time memory is
allocated/freed from *any* kmem_cache. Each kmem_cache has its own
size. So, nr_allocated counts allocated objects of potentially
different sizes. If that is so then I'm unclear what the result of
this multiplication would represent.

> +}
> void *kmem_cache_alloc(struct kmem_cache *cachep, int gfp)
> {
> void *p;
>
> - if (!(gfp & __GFP_DIRECT_RECLAIM))
> + if (!(gfp & __GFP_DIRECT_RECLAIM) && !cachep->non_kernel)
> return NULL;
>
> + if (!(gfp & __GFP_DIRECT_RECLAIM))
> + cachep->non_kernel--;
> +
> pthread_mutex_lock(&cachep->lock);
> if (cachep->nr_objs) {
> struct radix_tree_node *node = cachep->objs;
> @@ -116,5 +129,6 @@ kmem_cache_create(const char *name, unsigned int size, unsigned int align,
> ret->nr_objs = 0;
> ret->objs = NULL;
> ret->ctor = ctor;
> + ret->non_kernel = 0;
> return ret;
> }
> diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/radix-tree/linux/kernel.h
> index 39867fd80c8f..c5c9d05f29da 100644
> --- a/tools/testing/radix-tree/linux/kernel.h
> +++ b/tools/testing/radix-tree/linux/kernel.h
> @@ -14,6 +14,7 @@
> #include "../../../include/linux/kconfig.h"
>
> #define printk printf
> +#define pr_err printk
> #define pr_info printk
> #define pr_debug printk
> #define pr_cont printk
> --
> 2.30.2

2021-05-28 17:59:23

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 23/94] radix tree test suite: Add support for kmem_cache_free_bulk

On Wed, Apr 28, 2021 at 8:36 AM Liam Howlett <[email protected]> wrote:
>
> Signed-off-by: Liam R. Howlett <[email protected]>
> ---
> tools/testing/radix-tree/linux.c | 9 +++++++++
> tools/testing/radix-tree/linux/slab.h | 1 +
> 2 files changed, 10 insertions(+)
>
> diff --git a/tools/testing/radix-tree/linux.c b/tools/testing/radix-tree/linux.c
> index 93f7de81fbe8..380bbc0a48d6 100644
> --- a/tools/testing/radix-tree/linux.c
> +++ b/tools/testing/radix-tree/linux.c
> @@ -91,6 +91,15 @@ void kmem_cache_free(struct kmem_cache *cachep, void *objp)
> pthread_mutex_unlock(&cachep->lock);
> }
>
> +void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list)
> +{
> + if (kmalloc_verbose)
> + printk("Bulk free %p[0-%lu]\n", list, size - 1);

nit: Printing the address of the "list" is meaningless IMHO unless you
output its value in kmem_cache_alloc_bulk, which you do not.
I would also suggest combining the patch introducing
kmem_cache_alloc_bulk with this one since they seem to be
complementary.

> +
> + for (int i = 0; i < size; i++)
> + kmem_cache_free(cachep, list[i]);
> +}
> +
> void *kmalloc(size_t size, gfp_t gfp)
> {
> void *ret;
> diff --git a/tools/testing/radix-tree/linux/slab.h b/tools/testing/radix-tree/linux/slab.h
> index 2958830ce4d7..53b79c15b3a2 100644
> --- a/tools/testing/radix-tree/linux/slab.h
> +++ b/tools/testing/radix-tree/linux/slab.h
> @@ -24,4 +24,5 @@ struct kmem_cache *kmem_cache_create(const char *name, unsigned int size,
> unsigned int align, unsigned int flags,
> void (*ctor)(void *));
>
> +void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t, void **);
> #endif /* SLAB_H */
> --
> 2.30.2
>

2021-05-28 18:18:44

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 24/94] radix tree test suite: Add kmem_cache_alloc_bulk() support

On Wed, Apr 28, 2021 at 8:36 AM Liam Howlett <[email protected]> wrote:
>
> Signed-off-by: Liam R. Howlett <[email protected]>
> ---
> tools/testing/radix-tree/linux.c | 51 +++++++++++++++++++++++++++
> tools/testing/radix-tree/linux/slab.h | 1 +
> 2 files changed, 52 insertions(+)
>
> diff --git a/tools/testing/radix-tree/linux.c b/tools/testing/radix-tree/linux.c
> index 380bbc0a48d6..fb19a40ebb46 100644
> --- a/tools/testing/radix-tree/linux.c
> +++ b/tools/testing/radix-tree/linux.c
> @@ -99,6 +99,57 @@ void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list)
> for (int i = 0; i < size; i++)
> kmem_cache_free(cachep, list[i]);
> }
> +int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
> + void **p)
> +{
> + size_t i;
> +
> + if (kmalloc_verbose)
> + printk("Bulk alloc %lu\n", size);
> +
> + if (!(gfp & __GFP_DIRECT_RECLAIM) && cachep->non_kernel < size)
> + return 0;
> +
> + if (!(gfp & __GFP_DIRECT_RECLAIM))
> + cachep->non_kernel -= size;
> +
> + pthread_mutex_lock(&cachep->lock);
> + if (cachep->nr_objs >= size) {
> + struct radix_tree_node *node = cachep->objs;
> +

I don't think the loop below is correct because "node" is not being
changed on each iteration:

> + for (i = 0; i < size; i++) {
> + cachep->nr_objs--;
> + cachep->objs = node->parent;

In the above assignment cachep->objs will be assigned the same value
on all iterations.

> + p[i] = cachep->objs;

p[0] should point to the node, however here it would point to the node->parent.

> + }
> + pthread_mutex_unlock(&cachep->lock);
> + node->parent = NULL;

here you terminated the original cachep->objs which is not even inside
the "p" list at this point (it was skipped).

> + } else {
> + pthread_mutex_unlock(&cachep->lock);
> + for (i = 0; i < size; i++) {
> + if (cachep->align) {
> + posix_memalign(&p[i], cachep->align,
> + cachep->size * size);
> + } else {
> + p[i] = malloc(cachep->size * size);
> + }
> + if (cachep->ctor)
> + cachep->ctor(p[i]);
> + else if (gfp & __GFP_ZERO)
> + memset(p[i], 0, cachep->size);
> + }
> + }
> +
> + for (i = 0; i < size; i++) {
> + uatomic_inc(&nr_allocated);
> + uatomic_inc(&nr_tallocated);

I don't see nr_tallocated even in linux-next branch. Was it introduced
in one of the previous patches and I missed it?

> + if (kmalloc_verbose)
> + printf("Allocating %p from slab\n", p[i]);
> + }
> +
> + return size;
> +}
> +
>
> void *kmalloc(size_t size, gfp_t gfp)
> {
> diff --git a/tools/testing/radix-tree/linux/slab.h b/tools/testing/radix-tree/linux/slab.h
> index 53b79c15b3a2..ba42b8cc11d0 100644
> --- a/tools/testing/radix-tree/linux/slab.h
> +++ b/tools/testing/radix-tree/linux/slab.h
> @@ -25,4 +25,5 @@ struct kmem_cache *kmem_cache_create(const char *name, unsigned int size,
> void (*ctor)(void *));
>
> void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t, void **);
> +int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t, size_t, void **);
> #endif /* SLAB_H */
> --
> 2.30.2

2021-05-28 18:56:49

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 21/94] radix tree test suite: Enhancements for Maple Tree

* Suren Baghdasaryan <[email protected]> [210528 13:31]:
> On Wed, Apr 28, 2021 at 8:36 AM Liam Howlett <[email protected]> wrote:
> >
>
> I know you have v2 for the first part of this patchset; I'm just going
> over the whole thing... There should be some description here of what
> the new struct member and new function are for. Ideally you would also
> split it in two because it introduces two seemingly independent
> additions: non_kernel and kmem_cache_get_alloc.

Your comments are still valid and appreciated.

I did add a description to the patch:
--------------------------------
radix tree test suite: Add kmem_cache enhancements and pr_err

Add kmem_cache_set_non_kernel(), a mechanism to allow a certain number
of kmem_cache_alloc requests to succeed even when GFP_KERNEL is not set
in the flags.

Add kmem_cache_get_alloc() to see the size of the allocated kmem_cache.

Add a define of pr_err to printk.
--------------------------------

I did group these two changes together as they were both affecting
kmem_cache. I will reorganize them into separate commits.
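
As a rough usage sketch (the cache pointer and counts are made up, and
GFP_ATOMIC only stands in for any mask without __GFP_DIRECT_RECLAIM):

	kmem_cache_set_non_kernel(test_cachep, 2);
	p1 = kmem_cache_alloc(test_cachep, GFP_ATOMIC);	/* succeeds, 2 -> 1 */
	p2 = kmem_cache_alloc(test_cachep, GFP_ATOMIC);	/* succeeds, 1 -> 0 */
	p3 = kmem_cache_alloc(test_cachep, GFP_ATOMIC);	/* NULL again */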

>
> > Signed-off-by: Liam R. Howlett <[email protected]>
> > Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> > ---
> > tools/testing/radix-tree/linux.c | 16 +++++++++++++++-
> > tools/testing/radix-tree/linux/kernel.h | 1 +
> > 2 files changed, 16 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/testing/radix-tree/linux.c b/tools/testing/radix-tree/linux.c
> > index 2d9c59df60de..93f7de81fbe8 100644
> > --- a/tools/testing/radix-tree/linux.c
> > +++ b/tools/testing/radix-tree/linux.c
> > @@ -24,15 +24,28 @@ struct kmem_cache {
> > int nr_objs;
> > void *objs;
> > void (*ctor)(void *);
> > + unsigned int non_kernel;
> > };
> >
> > +void kmem_cache_set_non_kernel(struct kmem_cache *cachep, unsigned int val)
> > +{
> > + cachep->non_kernel = val;
> > +}
> > +
> > +unsigned long kmem_cache_get_alloc(struct kmem_cache *cachep)
> > +{
> > + return cachep->size * nr_allocated;
>
> IIUC nr_allocated is incremented/decremented every time memory is
> allocated/freed from *any* kmem_cache. Each kmem_cache has its own
> size. So, nr_allocated counts allocated objects of potentially
> different sizes. If that is so then I'm unclear what the result of
> this multiplication would represent.

This is intended to be used only when testing with a single kmem_cache,
so it hasn't been an issue. Having this counter live outside the
kmem_cache struct allows checking whether any allocations remain beyond
the scope of the kmem_cache (i.e. in threads). I think putting it in the
struct would cause issues with the IDR testing. I could add a separate
per-cache counter and increment both together, but nr_allocated already
existed for the IDR testing and I didn't need that additional
functionality for my testing. I should at least add a comment about this
limitation, though.
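
A sketch of that single-cache assumption (the cache pointer name is
hypothetical):

	/*
	 * Only meaningful while test_cachep is the sole cache allocating;
	 * a second active cache would skew nr_allocated.
	 */
	unsigned long bytes = kmem_cache_get_alloc(test_cachep);

	if (bytes)
		pr_err("%lu bytes of nodes still allocated\n", bytes);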

>
> > +}
> > void *kmem_cache_alloc(struct kmem_cache *cachep, int gfp)
> > {
> > void *p;
> >
> > - if (!(gfp & __GFP_DIRECT_RECLAIM))
> > + if (!(gfp & __GFP_DIRECT_RECLAIM) && !cachep->non_kernel)
> > return NULL;
> >
> > + if (!(gfp & __GFP_DIRECT_RECLAIM))
> > + cachep->non_kernel--;
> > +
> > pthread_mutex_lock(&cachep->lock);
> > if (cachep->nr_objs) {
> > struct radix_tree_node *node = cachep->objs;
> > @@ -116,5 +129,6 @@ kmem_cache_create(const char *name, unsigned int size, unsigned int align,
> > ret->nr_objs = 0;
> > ret->objs = NULL;
> > ret->ctor = ctor;
> > + ret->non_kernel = 0;
> > return ret;
> > }
> > diff --git a/tools/testing/radix-tree/linux/kernel.h b/tools/testing/radix-tree/linux/kernel.h
> > index 39867fd80c8f..c5c9d05f29da 100644
> > --- a/tools/testing/radix-tree/linux/kernel.h
> > +++ b/tools/testing/radix-tree/linux/kernel.h
> > @@ -14,6 +14,7 @@
> > #include "../../../include/linux/kconfig.h"
> >
> > #define printk printf
> > +#define pr_err printk
> > #define pr_info printk
> > #define pr_debug printk
> > #define pr_cont printk
> > --
> > 2.30.2

2021-05-28 19:09:59

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 23/94] radix tree test suite: Add support for kmem_cache_free_bulk

* Suren Baghdasaryan <[email protected]> [210528 13:55]:
> On Wed, Apr 28, 2021 at 8:36 AM Liam Howlett <[email protected]> wrote:
> >
> > Signed-off-by: Liam R. Howlett <[email protected]>
> > ---
> > tools/testing/radix-tree/linux.c | 9 +++++++++
> > tools/testing/radix-tree/linux/slab.h | 1 +
> > 2 files changed, 10 insertions(+)
> >
> > diff --git a/tools/testing/radix-tree/linux.c b/tools/testing/radix-tree/linux.c
> > index 93f7de81fbe8..380bbc0a48d6 100644
> > --- a/tools/testing/radix-tree/linux.c
> > +++ b/tools/testing/radix-tree/linux.c
> > @@ -91,6 +91,15 @@ void kmem_cache_free(struct kmem_cache *cachep, void *objp)
> > pthread_mutex_unlock(&cachep->lock);
> > }
> >
> > +void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list)
> > +{
> > + if (kmalloc_verbose)
> > + printk("Bulk free %p[0-%lu]\n", list, size - 1);
>
> nit: Printing the address of the "list" is meaningless IMHO unless you
> output its value in kmem_cache_alloc_bulk, which you do not.

The address has been rather useful for my testing when combined with
how the list is created and LSAN_OPTIONS="report_objects=1". This
information is of interest when a test fails, at which point the tree is
dumped. Combined with the list head and the report_objects output, I am
able to deduce whether the list holds too many entries or too few, which
operation caused the issue, and which calculation is of interest.

Adding the alloc_bulk counterpart is not very useful because the
prediction of how many nodes are necessary is the worst-case, so the
head of the list is almost never used and the request size is already
known. Adding that print is just noise for my use case.

> I would also suggest combining the patch introducing
> kmem_cache_alloc_bulk with this one since they seem to be
> complementary.

Yes, I agree. I noticed this and fixed it in v2.

>
> > +
> > + for (int i = 0; i < size; i++)
> > + kmem_cache_free(cachep, list[i]);
> > +}
> > +
> > void *kmalloc(size_t size, gfp_t gfp)
> > {
> > void *ret;
> > diff --git a/tools/testing/radix-tree/linux/slab.h b/tools/testing/radix-tree/linux/slab.h
> > index 2958830ce4d7..53b79c15b3a2 100644
> > --- a/tools/testing/radix-tree/linux/slab.h
> > +++ b/tools/testing/radix-tree/linux/slab.h
> > @@ -24,4 +24,5 @@ struct kmem_cache *kmem_cache_create(const char *name, unsigned int size,
> > unsigned int align, unsigned int flags,
> > void (*ctor)(void *));
> >
> > +void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t, void **);
> > #endif /* SLAB_H */
> > --
> > 2.30.2
> >

2021-05-28 19:34:07

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 24/94] radix tree test suite: Add kmem_cache_alloc_bulk() support

* Suren Baghdasaryan <[email protected]> [210528 14:17]:
> On Wed, Apr 28, 2021 at 8:36 AM Liam Howlett <[email protected]> wrote:
> >
> > Signed-off-by: Liam R. Howlett <[email protected]>
> > ---
> > tools/testing/radix-tree/linux.c | 51 +++++++++++++++++++++++++++
> > tools/testing/radix-tree/linux/slab.h | 1 +
> > 2 files changed, 52 insertions(+)
> >
> > diff --git a/tools/testing/radix-tree/linux.c b/tools/testing/radix-tree/linux.c
> > index 380bbc0a48d6..fb19a40ebb46 100644
> > --- a/tools/testing/radix-tree/linux.c
> > +++ b/tools/testing/radix-tree/linux.c
> > @@ -99,6 +99,57 @@ void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list)
> > for (int i = 0; i < size; i++)
> > kmem_cache_free(cachep, list[i]);
> > }
> > +int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
> > + void **p)
> > +{
> > + size_t i;
> > +
> > + if (kmalloc_verbose)
> > + printk("Bulk alloc %lu\n", size);
> > +
> > + if (!(gfp & __GFP_DIRECT_RECLAIM) && cachep->non_kernel < size)
> > + return 0;
> > +
> > + if (!(gfp & __GFP_DIRECT_RECLAIM))
> > + cachep->non_kernel -= size;
> > +
> > + pthread_mutex_lock(&cachep->lock);
> > + if (cachep->nr_objs >= size) {
> > + struct radix_tree_node *node = cachep->objs;
> > +
>
> I don't think the loop below is correct because "node" is not being
> changed on each iteration:
>
> > + for (i = 0; i < size; i++) {
> > + cachep->nr_objs--;
> > + cachep->objs = node->parent;
>
> In the above assignment cachep->objs will be assigned the same value
> on all iterations.
>
> > + p[i] = cachep->objs;
>
> p[0] should point to the node, however here it would point to the node->parent.
>
> > + }
> > + pthread_mutex_unlock(&cachep->lock);
> > + node->parent = NULL;
>
> here you terminated the original cachep->objs which is not even inside
> the "p" list at this point (it was skipped).

I just verified that this code wasn't hit in my current test code. I
will test and fix this. Good catch.
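
One way the loop could be rewritten so that each iteration pops the
current list head (a sketch, not the final fix):

	pthread_mutex_lock(&cachep->lock);
	if (cachep->nr_objs >= size) {
		for (i = 0; i < size; i++) {
			struct radix_tree_node *node = cachep->objs;

			cachep->nr_objs--;
			cachep->objs = node->parent;	/* advance the head */
			p[i] = node;			/* hand out the old head */
			node->parent = NULL;		/* detach it */
		}
		pthread_mutex_unlock(&cachep->lock);
	}
	/* else: fall back to the malloc() path as in the posted patch */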

>
> > + } else {
> > + pthread_mutex_unlock(&cachep->lock);
> > + for (i = 0; i < size; i++) {
> > + if (cachep->align) {
> > + posix_memalign(&p[i], cachep->align,
> > + cachep->size * size);
> > + } else {
> > + p[i] = malloc(cachep->size * size);
> > + }
> > + if (cachep->ctor)
> > + cachep->ctor(p[i]);
> > + else if (gfp & __GFP_ZERO)
> > + memset(p[i], 0, cachep->size);
> > + }
> > + }
> > +
> > + for (i = 0; i < size; i++) {
> > + uatomic_inc(&nr_allocated);
> > + uatomic_inc(&nr_tallocated);
>
> I don't see nr_tallocated even in linux-next branch. Was it introduced
> in one of the previous patches and I missed it?

It was introduced with the maple tree itself. I will spin this off as
its own patch with the same edits as nr_allocated.

>
> > + if (kmalloc_verbose)
> > + printf("Allocating %p from slab\n", p[i]);
> > + }
> > +
> > + return size;
> > +}
> > +
> >
> > void *kmalloc(size_t size, gfp_t gfp)
> > {
> > diff --git a/tools/testing/radix-tree/linux/slab.h b/tools/testing/radix-tree/linux/slab.h
> > index 53b79c15b3a2..ba42b8cc11d0 100644
> > --- a/tools/testing/radix-tree/linux/slab.h
> > +++ b/tools/testing/radix-tree/linux/slab.h
> > @@ -25,4 +25,5 @@ struct kmem_cache *kmem_cache_create(const char *name, unsigned int size,
> > void (*ctor)(void *));
> >
> > void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t, void **);
> > +int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t, size_t, void **);
> > #endif /* SLAB_H */
> > --
> > 2.30.2

2021-05-29 00:44:28

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 32/94] kernel/fork: Convert dup_mmap to use maple tree

On Wed, Apr 28, 2021 at 8:36 AM Liam Howlett <[email protected]> wrote:
>
> Use the maple tree iterator to duplicate the mm_struct trees.
>
> Signed-off-by: Liam R. Howlett <[email protected]>
> ---
> include/linux/mm.h | 2 --
> include/linux/sched/mm.h | 3 +++
> kernel/fork.c | 24 +++++++++++++++++++-----
> mm/mmap.c | 4 ----
> 4 files changed, 22 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e89bacfa9145..7f7dff6ad884 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2498,8 +2498,6 @@ extern bool arch_has_descending_max_zone_pfns(void);
> /* nommu.c */
> extern atomic_long_t mmap_pages_allocated;
> extern int nommu_shrink_inode_mappings(struct inode *, size_t, size_t);
> -/* maple_tree */
> -void vma_store(struct mm_struct *mm, struct vm_area_struct *vma);
>
> /* interval_tree.c */
> void vma_interval_tree_insert(struct vm_area_struct *node,
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index e24b1fe348e3..76cab3aea6ab 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -8,6 +8,7 @@
> #include <linux/mm_types.h>
> #include <linux/gfp.h>
> #include <linux/sync_core.h>
> +#include <linux/maple_tree.h>
>
> /*
> * Routines for handling mm_structs
> @@ -67,11 +68,13 @@ static inline void mmdrop(struct mm_struct *mm)
> */
> static inline void mmget(struct mm_struct *mm)
> {
> + mt_set_in_rcu(&mm->mm_mt);
> atomic_inc(&mm->mm_users);
> }
>
> static inline bool mmget_not_zero(struct mm_struct *mm)
> {
> + mt_set_in_rcu(&mm->mm_mt);

Should you be calling mt_set_in_rcu() if atomic_inc_not_zero() failed?
I don't think mmput() is called after mmget_not_zero() fails, so
mt_clear_in_rcu() will never be called.
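
One possible ordering that avoids flipping the tree into RCU mode for an
mm whose user count already hit zero (a sketch only; whether this
ordering has other implications is exactly the question):

static inline bool mmget_not_zero(struct mm_struct *mm)
{
	if (!atomic_inc_not_zero(&mm->mm_users))
		return false;
	mt_set_in_rcu(&mm->mm_mt);
	return true;
}
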

> return atomic_inc_not_zero(&mm->mm_users);
> }
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c37abaf28eb9..832416ff613e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -477,7 +477,9 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
> struct rb_node **rb_link, *rb_parent;
> int retval;
> - unsigned long charge;
> + unsigned long charge = 0;
> + MA_STATE(old_mas, &oldmm->mm_mt, 0, 0);
> + MA_STATE(mas, &mm->mm_mt, 0, 0);
> LIST_HEAD(uf);
>
> uprobe_start_dup_mmap();
> @@ -511,7 +513,13 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> goto out;
>
> prev = NULL;
> - for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
> +
> + retval = mas_entry_count(&mas, oldmm->map_count);
> + if (retval)
> + goto fail_nomem;
> +
> + rcu_read_lock();
> + mas_for_each(&old_mas, mpnt, ULONG_MAX) {
> struct file *file;
>
> if (mpnt->vm_flags & VM_DONTCOPY) {
> @@ -525,7 +533,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> */
> if (fatal_signal_pending(current)) {
> retval = -EINTR;
> - goto out;
> + goto loop_out;
> }
> if (mpnt->vm_flags & VM_ACCOUNT) {
> unsigned long len = vma_pages(mpnt);
> @@ -594,7 +602,9 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> rb_parent = &tmp->vm_rb;
>
> /* Link the vma into the MT */
> - vma_store(mm, tmp);
> + mas.index = tmp->vm_start;
> + mas.last = tmp->vm_end - 1;
> + mas_store(&mas, tmp);
>
> mm->map_count++;
> if (!(tmp->vm_flags & VM_WIPEONFORK))
> @@ -604,14 +614,17 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> tmp->vm_ops->open(tmp);
>
> if (retval)
> - goto out;
> + goto loop_out;
> }
> /* a new mm has just been created */
> retval = arch_dup_mmap(oldmm, mm);
> +loop_out:
> out:
> + rcu_read_unlock();
> mmap_write_unlock(mm);
> flush_tlb_mm(oldmm);
> mmap_write_unlock(oldmm);
> + mas_destroy(&mas);
> dup_userfaultfd_complete(&uf);
> fail_uprobe_end:
> uprobe_end_dup_mmap();
> @@ -1092,6 +1105,7 @@ static inline void __mmput(struct mm_struct *mm)
> {
> VM_BUG_ON(atomic_read(&mm->mm_users));
>
> + mt_clear_in_rcu(&mm->mm_mt);
> uprobe_clear_state(mm);
> exit_aio(mm);
> ksm_exit(mm);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 929c2f9eb3f5..1bd43f4db28e 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -780,10 +780,6 @@ static inline void vma_mt_store(struct mm_struct *mm, struct vm_area_struct *vma
> GFP_KERNEL);
> }
>
> -void vma_store(struct mm_struct *mm, struct vm_area_struct *vma) {
> - vma_mt_store(mm, vma);
> -}
> -
> static void
> __vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
> struct vm_area_struct *prev, struct rb_node **rb_link,
> --
> 2.30.2

2021-05-29 01:29:07

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH 33/94] mm: Remove rb tree.

On Wed, Apr 28, 2021 at 8:36 AM Liam Howlett <[email protected]> wrote:
>
> Remove the RB tree and start using the maple tree for vm_area_struct
> tracking.
>
> Drop validate_mm() calls in expand_upwards() and expand_downwards() as
> the lock is not held.
>
> Signed-off-by: Liam R. Howlett <[email protected]>
> ---
> arch/x86/kernel/tboot.c | 1 -
> drivers/firmware/efi/efi.c | 1 -
> fs/proc/task_nommu.c | 55 ++--
> include/linux/mm.h | 4 +-
> include/linux/mm_types.h | 26 +-
> kernel/fork.c | 8 -
> mm/init-mm.c | 2 -
> mm/mmap.c | 525 ++++++++-----------------------------
> mm/nommu.c | 96 +++----
> mm/util.c | 8 +
> 10 files changed, 185 insertions(+), 541 deletions(-)
>
> diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
> index 6f978f722dff..121f28bb2209 100644
> --- a/arch/x86/kernel/tboot.c
> +++ b/arch/x86/kernel/tboot.c
> @@ -97,7 +97,6 @@ void __init tboot_probe(void)
>
> static pgd_t *tboot_pg_dir;
> static struct mm_struct tboot_mm = {
> - .mm_rb = RB_ROOT,
> .mm_mt = MTREE_INIT(mm_mt, MAPLE_ALLOC_RANGE),
> .pgd = swapper_pg_dir,
> .mm_users = ATOMIC_INIT(2),
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index 271ae8c7bb07..8aaeaa824576 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -54,7 +54,6 @@ static unsigned long __initdata mem_reserve = EFI_INVALID_TABLE_ADDR;
> static unsigned long __initdata rt_prop = EFI_INVALID_TABLE_ADDR;
>
> struct mm_struct efi_mm = {
> - .mm_rb = RB_ROOT,
> .mm_mt = MTREE_INIT(mm_mt, MAPLE_ALLOC_RANGE),
> .mm_users = ATOMIC_INIT(2),
> .mm_count = ATOMIC_INIT(1),
> diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
> index a6d21fc0033c..8691a1216d1c 100644
> --- a/fs/proc/task_nommu.c
> +++ b/fs/proc/task_nommu.c
> @@ -22,15 +22,13 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> {
> struct vm_area_struct *vma;
> struct vm_region *region;
> - struct rb_node *p;
> unsigned long bytes = 0, sbytes = 0, slack = 0, size;
> -
> - mmap_read_lock(mm);
> - for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) {
> - vma = rb_entry(p, struct vm_area_struct, vm_rb);
> + MA_STATE(mas, &mm->mm_mt, 0, 0);
>
> + mmap_read_lock(mm);
> + rcu_read_lock();
> + mas_for_each(&mas, vma, ULONG_MAX) {
> bytes += kobjsize(vma);
> -
> region = vma->vm_region;
> if (region) {
> size = kobjsize(region);
> @@ -53,7 +51,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> sbytes += kobjsize(mm);
> else
> bytes += kobjsize(mm);
> -
> +
> if (current->fs && current->fs->users > 1)
> sbytes += kobjsize(current->fs);
> else
> @@ -77,20 +75,21 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> "Shared:\t%8lu bytes\n",
> bytes, slack, sbytes);
>
> + rcu_read_unlock();
> mmap_read_unlock(mm);
> }
>
> unsigned long task_vsize(struct mm_struct *mm)
> {
> struct vm_area_struct *vma;
> - struct rb_node *p;
> unsigned long vsize = 0;
> + MA_STATE(mas, &mm->mm_mt, 0, 0);
>
> mmap_read_lock(mm);
> - for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) {
> - vma = rb_entry(p, struct vm_area_struct, vm_rb);
> + rcu_read_lock();
> + mas_for_each(&mas, vma, ULONG_MAX)
> vsize += vma->vm_end - vma->vm_start;
> - }
> + rcu_read_unlock();
> mmap_read_unlock(mm);
> return vsize;
> }
> @@ -101,12 +100,12 @@ unsigned long task_statm(struct mm_struct *mm,
> {
> struct vm_area_struct *vma;
> struct vm_region *region;
> - struct rb_node *p;
> unsigned long size = kobjsize(mm);
> + MA_STATE(mas, &mm->mm_mt, 0, 0);
>
> mmap_read_lock(mm);
> - for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) {
> - vma = rb_entry(p, struct vm_area_struct, vm_rb);
> + rcu_read_lock();
> + mas_for_each(&mas, vma, ULONG_MAX) {
> size += kobjsize(vma);
> region = vma->vm_region;
> if (region) {
> @@ -119,6 +118,7 @@ unsigned long task_statm(struct mm_struct *mm,
> >> PAGE_SHIFT;
> *data = (PAGE_ALIGN(mm->start_stack) - (mm->start_data & PAGE_MASK))
> >> PAGE_SHIFT;
> + rcu_read_unlock();
> mmap_read_unlock(mm);
> size >>= PAGE_SHIFT;
> size += *text + *data;
> @@ -190,17 +190,20 @@ static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma)
> */
> static int show_map(struct seq_file *m, void *_p)
> {
> - struct rb_node *p = _p;
> -
> - return nommu_vma_show(m, rb_entry(p, struct vm_area_struct, vm_rb));
> + return nommu_vma_show(m, _p);
> }
>
> static void *m_start(struct seq_file *m, loff_t *pos)
> {
> struct proc_maps_private *priv = m->private;
> struct mm_struct *mm;
> - struct rb_node *p;
> - loff_t n = *pos;
> + struct vm_area_struct *vma;
> + unsigned long addr = *pos;
> + MA_STATE(mas, &priv->mm->mm_mt, addr, addr);
> +
> + /* See m_next(). Zero at the start or after lseek. */
> + if (addr == -1UL)
> + return NULL;
>
> /* pin the task and mm whilst we play with them */
> priv->task = get_proc_task(priv->inode);
> @@ -216,14 +219,12 @@ static void *m_start(struct seq_file *m, loff_t *pos)
> return ERR_PTR(-EINTR);
> }
>
> - /* start from the Nth VMA */
> - for (p = rb_first(&mm->mm_rb); p; p = rb_next(p))
> - if (n-- == 0)
> - return p;
> + /* start the next element from addr */
> + vma = mas_find(&mas, ULONG_MAX);
>
> mmap_read_unlock(mm);
> mmput(mm);
> - return NULL;
> + return vma;
> }
>
> static void m_stop(struct seq_file *m, void *_vml)
> @@ -242,10 +243,10 @@ static void m_stop(struct seq_file *m, void *_vml)
>
> static void *m_next(struct seq_file *m, void *_p, loff_t *pos)
> {
> - struct rb_node *p = _p;
> + struct vm_area_struct *vma = _p;
>
> - (*pos)++;
> - return p ? rb_next(p) : NULL;
> + *pos = vma->vm_end;
> + return vma->vm_next;
> }
>
> static const struct seq_operations proc_pid_maps_ops = {
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7f7dff6ad884..146976070fed 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2553,8 +2553,6 @@ extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
> extern int split_vma(struct mm_struct *, struct vm_area_struct *,
> unsigned long addr, int new_below);
> extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
> -extern void __vma_link_rb(struct mm_struct *, struct vm_area_struct *,
> - struct rb_node **, struct rb_node *);
> extern void unlink_file_vma(struct vm_area_struct *);
> extern struct vm_area_struct *copy_vma(struct vm_area_struct **,
> unsigned long addr, unsigned long len, pgoff_t pgoff,
> @@ -2699,7 +2697,7 @@ static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * m
> static inline
> struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
> {
> - return find_vma_intersection(mm, addr, addr + 1);
> + return mtree_load(&mm->mm_mt, addr);
> }
>
> static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 51733fc44daf..41551bfa6ce0 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -311,19 +311,6 @@ struct vm_area_struct {
>
> /* linked list of VM areas per task, sorted by address */
> struct vm_area_struct *vm_next, *vm_prev;
> -
> - struct rb_node vm_rb;
> -
> - /*
> - * Largest free memory gap in bytes to the left of this VMA.
> - * Either between this VMA and vma->vm_prev, or between one of the
> - * VMAs below us in the VMA rbtree and its ->vm_prev. This helps
> - * get_unmapped_area find a free area of the right size.
> - */
> - unsigned long rb_subtree_gap;
> -
> - /* Second cache line starts here. */
> -
> struct mm_struct *vm_mm; /* The address space we belong to. */
>
> /*
> @@ -333,6 +320,12 @@ struct vm_area_struct {
> pgprot_t vm_page_prot;
> unsigned long vm_flags; /* Flags, see mm.h. */
>
> + /* Information about our backing store: */
> + unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
> + * units
> + */
> + /* Second cache line starts here. */
> + struct file *vm_file; /* File we map to (can be NULL). */
> /*
> * For areas with an address space and backing store,
> * linkage into the address_space->i_mmap interval tree.
> @@ -351,16 +344,14 @@ struct vm_area_struct {
> struct list_head anon_vma_chain; /* Serialized by mmap_lock &
> * page_table_lock */
> struct anon_vma *anon_vma; /* Serialized by page_table_lock */
> + /* Third cache line starts here. */
>
> /* Function pointers to deal with this struct. */
> const struct vm_operations_struct *vm_ops;
>
> - /* Information about our backing store: */
> - unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
> - units */
> - struct file * vm_file; /* File we map to (can be NULL). */
> void * vm_private_data; /* was vm_pte (shared mem) */
>
> +
> #ifdef CONFIG_SWAP
> atomic_long_t swap_readahead_info;
> #endif
> @@ -389,7 +380,6 @@ struct mm_struct {
> struct {
> struct vm_area_struct *mmap; /* list of VMAs */
> struct maple_tree mm_mt;
> - struct rb_root mm_rb;
> u64 vmacache_seqnum; /* per-thread vmacache */
> #ifdef CONFIG_MMU
> unsigned long (*get_unmapped_area) (struct file *filp,
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 832416ff613e..83afd3007a2b 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -475,7 +475,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> struct mm_struct *oldmm)
> {
> struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
> - struct rb_node **rb_link, *rb_parent;
> int retval;
> unsigned long charge = 0;
> MA_STATE(old_mas, &oldmm->mm_mt, 0, 0);
> @@ -502,8 +501,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> mm->exec_vm = oldmm->exec_vm;
> mm->stack_vm = oldmm->stack_vm;
>
> - rb_link = &mm->mm_rb.rb_node;
> - rb_parent = NULL;
> pprev = &mm->mmap;
> retval = ksm_fork(mm, oldmm);
> if (retval)
> @@ -597,10 +594,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> tmp->vm_prev = prev;
> prev = tmp;
>
> - __vma_link_rb(mm, tmp, rb_link, rb_parent);
> - rb_link = &tmp->vm_rb.rb_right;
> - rb_parent = &tmp->vm_rb;
> -
> /* Link the vma into the MT */
> mas.index = tmp->vm_start;
> mas.last = tmp->vm_end - 1;
> @@ -1033,7 +1026,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> struct user_namespace *user_ns)
> {
> mm->mmap = NULL;
> - mm->mm_rb = RB_ROOT;
> mt_init_flags(&mm->mm_mt, MAPLE_ALLOC_RANGE);
> mm->vmacache_seqnum = 0;
> atomic_set(&mm->mm_users, 1);
> diff --git a/mm/init-mm.c b/mm/init-mm.c
> index 2014d4b82294..04bbe5172b72 100644
> --- a/mm/init-mm.c
> +++ b/mm/init-mm.c
> @@ -1,6 +1,5 @@
> // SPDX-License-Identifier: GPL-2.0
> #include <linux/mm_types.h>
> -#include <linux/rbtree.h>
> #include <linux/maple_tree.h>
> #include <linux/rwsem.h>
> #include <linux/spinlock.h>
> @@ -28,7 +27,6 @@
> * and size this cpu_bitmask to NR_CPUS.
> */
> struct mm_struct init_mm = {
> - .mm_rb = RB_ROOT,
> .mm_mt = MTREE_INIT(mm_mt, MAPLE_ALLOC_RANGE),
> .pgd = swapper_pg_dir,
> .mm_users = ATOMIC_INIT(2),
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 1bd43f4db28e..7747047c4cbe 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -38,7 +38,6 @@
> #include <linux/audit.h>
> #include <linux/khugepaged.h>
> #include <linux/uprobes.h>
> -#include <linux/rbtree_augmented.h>
> #include <linux/notifier.h>
> #include <linux/memory.h>
> #include <linux/printk.h>
> @@ -290,93 +289,6 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
> return origbrk;
> }
>
> -static inline unsigned long vma_compute_gap(struct vm_area_struct *vma)
> -{
> - unsigned long gap, prev_end;
> -
> - /*
> - * Note: in the rare case of a VM_GROWSDOWN above a VM_GROWSUP, we
> - * allow two stack_guard_gaps between them here, and when choosing
> - * an unmapped area; whereas when expanding we only require one.
> - * That's a little inconsistent, but keeps the code here simpler.
> - */
> - gap = vm_start_gap(vma);
> - if (vma->vm_prev) {
> - prev_end = vm_end_gap(vma->vm_prev);
> - if (gap > prev_end)
> - gap -= prev_end;
> - else
> - gap = 0;
> - }
> - return gap;
> -}
> -
> -#ifdef CONFIG_DEBUG_VM_RB
> -static unsigned long vma_compute_subtree_gap(struct vm_area_struct *vma)
> -{
> - unsigned long max = vma_compute_gap(vma), subtree_gap;
> - if (vma->vm_rb.rb_left) {
> - subtree_gap = rb_entry(vma->vm_rb.rb_left,
> - struct vm_area_struct, vm_rb)->rb_subtree_gap;
> - if (subtree_gap > max)
> - max = subtree_gap;
> - }
> - if (vma->vm_rb.rb_right) {
> - subtree_gap = rb_entry(vma->vm_rb.rb_right,
> - struct vm_area_struct, vm_rb)->rb_subtree_gap;
> - if (subtree_gap > max)
> - max = subtree_gap;
> - }
> - return max;
> -}
> -
> -static int browse_rb(struct mm_struct *mm)
> -{
> - struct rb_root *root = &mm->mm_rb;
> - int i = 0, j, bug = 0;
> - struct rb_node *nd, *pn = NULL;
> - unsigned long prev = 0, pend = 0;
> -
> - for (nd = rb_first(root); nd; nd = rb_next(nd)) {
> - struct vm_area_struct *vma;
> - vma = rb_entry(nd, struct vm_area_struct, vm_rb);
> - if (vma->vm_start < prev) {
> - pr_emerg("vm_start %lx < prev %lx\n",
> - vma->vm_start, prev);
> - bug = 1;
> - }
> - if (vma->vm_start < pend) {
> - pr_emerg("vm_start %lx < pend %lx\n",
> - vma->vm_start, pend);
> - bug = 1;
> - }
> - if (vma->vm_start > vma->vm_end) {
> - pr_emerg("vm_start %lx > vm_end %lx\n",
> - vma->vm_start, vma->vm_end);
> - bug = 1;
> - }
> - spin_lock(&mm->page_table_lock);
> - if (vma->rb_subtree_gap != vma_compute_subtree_gap(vma)) {
> - pr_emerg("free gap %lx, correct %lx\n",
> - vma->rb_subtree_gap,
> - vma_compute_subtree_gap(vma));
> - bug = 1;
> - }
> - spin_unlock(&mm->page_table_lock);
> - i++;
> - pn = nd;
> - prev = vma->vm_start;
> - pend = vma->vm_end;
> - }
> - j = 0;
> - for (nd = pn; nd; nd = rb_prev(nd))
> - j++;
> - if (i != j) {
> - pr_emerg("backwards %d, forwards %d\n", j, i);
> - bug = 1;
> - }
> - return bug ? -1 : i;
> -}
> #if defined(CONFIG_DEBUG_MAPLE_TREE)
> extern void mt_validate(struct maple_tree *mt);
> extern void mt_dump(const struct maple_tree *mt);
> @@ -405,17 +317,25 @@ static void validate_mm_mt(struct mm_struct *mm)
> dump_stack();
> #ifdef CONFIG_DEBUG_VM
> dump_vma(vma_mt);
> - pr_emerg("and next in rb\n");
> + pr_emerg("and vm_next\n");
> dump_vma(vma->vm_next);
> -#endif
> +#endif // CONFIG_DEBUG_VM
> pr_emerg("mt piv: %px %lu - %lu\n", vma_mt,
> mas.index, mas.last);
> pr_emerg("mt vma: %px %lu - %lu\n", vma_mt,
> vma_mt->vm_start, vma_mt->vm_end);
> - pr_emerg("rb vma: %px %lu - %lu\n", vma,
> + if (vma->vm_prev) {
> + pr_emerg("ll prev: %px %lu - %lu\n",
> + vma->vm_prev, vma->vm_prev->vm_start,
> + vma->vm_prev->vm_end);
> + }
> + pr_emerg("ll vma: %px %lu - %lu\n", vma,
> vma->vm_start, vma->vm_end);
> - pr_emerg("rb->next = %px %lu - %lu\n", vma->vm_next,
> - vma->vm_next->vm_start, vma->vm_next->vm_end);
> + if (vma->vm_next) {
> + pr_emerg("ll next: %px %lu - %lu\n",
> + vma->vm_next, vma->vm_next->vm_start,
> + vma->vm_next->vm_end);
> + }
>
> mt_dump(mas.tree);
> if (vma_mt->vm_end != mas.last + 1) {
> @@ -441,21 +361,6 @@ static void validate_mm_mt(struct mm_struct *mm)
> rcu_read_unlock();
> mt_validate(&mm->mm_mt);
> }
> -#else
> -#define validate_mm_mt(root) do { } while (0)
> -#endif
> -static void validate_mm_rb(struct rb_root *root, struct vm_area_struct *ignore)
> -{
> - struct rb_node *nd;
> -
> - for (nd = rb_first(root); nd; nd = rb_next(nd)) {
> - struct vm_area_struct *vma;
> - vma = rb_entry(nd, struct vm_area_struct, vm_rb);
> - VM_BUG_ON_VMA(vma != ignore &&
> - vma->rb_subtree_gap != vma_compute_subtree_gap(vma),
> - vma);
> - }
> -}
>
> static void validate_mm(struct mm_struct *mm)
> {
> @@ -464,6 +369,8 @@ static void validate_mm(struct mm_struct *mm)
> unsigned long highest_address = 0;
> struct vm_area_struct *vma = mm->mmap;
>
> + validate_mm_mt(mm);
> +
> while (vma) {
> struct anon_vma *anon_vma = vma->anon_vma;
> struct anon_vma_chain *avc;
> @@ -488,80 +395,13 @@ static void validate_mm(struct mm_struct *mm)
> mm->highest_vm_end, highest_address);
> bug = 1;
> }
> - i = browse_rb(mm);
> - if (i != mm->map_count) {
> - if (i != -1)
> - pr_emerg("map_count %d rb %d\n", mm->map_count, i);
> - bug = 1;
> - }
> VM_BUG_ON_MM(bug, mm);
> }
> -#else
> -#define validate_mm_rb(root, ignore) do { } while (0)
> +
> +#else // !CONFIG_DEBUG_MAPLE_TREE
> #define validate_mm_mt(root) do { } while (0)
> #define validate_mm(mm) do { } while (0)
> -#endif
> -
> -RB_DECLARE_CALLBACKS_MAX(static, vma_gap_callbacks,
> - struct vm_area_struct, vm_rb,
> - unsigned long, rb_subtree_gap, vma_compute_gap)
> -
> -/*
> - * Update augmented rbtree rb_subtree_gap values after vma->vm_start or
> - * vma->vm_prev->vm_end values changed, without modifying the vma's position
> - * in the rbtree.
> - */
> -static void vma_gap_update(struct vm_area_struct *vma)
> -{
> - /*
> - * As it turns out, RB_DECLARE_CALLBACKS_MAX() already created
> - * a callback function that does exactly what we want.
> - */
> - vma_gap_callbacks_propagate(&vma->vm_rb, NULL);
> -}
> -
> -static inline void vma_rb_insert(struct vm_area_struct *vma,
> - struct rb_root *root)
> -{
> - /* All rb_subtree_gap values must be consistent prior to insertion */
> - validate_mm_rb(root, NULL);
> -
> - rb_insert_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
> -}
> -
> -static void __vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root)
> -{
> - /*
> - * Note rb_erase_augmented is a fairly large inline function,
> - * so make sure we instantiate it only once with our desired
> - * augmented rbtree callbacks.
> - */
> - rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
> -}
> -
> -static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
> - struct rb_root *root,
> - struct vm_area_struct *ignore)
> -{
> - /*
> - * All rb_subtree_gap values must be consistent prior to erase,
> - * with the possible exception of
> - *
> - * a. the "next" vma being erased if next->vm_start was reduced in
> - * __vma_adjust() -> __vma_unlink()
> - * b. the vma being erased in detach_vmas_to_be_unmapped() ->
> - * vma_rb_erase()
> - */
> - validate_mm_rb(root, ignore);
> -
> - __vma_rb_erase(vma, root);
> -}
> -
> -static __always_inline void vma_rb_erase(struct vm_area_struct *vma,
> - struct rb_root *root)
> -{
> - vma_rb_erase_ignore(vma, root, vma);
> -}
> +#endif // CONFIG_DEBUG_MAPLE_TREE
>
> /*
> * vma has some anon_vma assigned, and is already inserted on that
> @@ -595,38 +435,26 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
> anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
> }
>
> -static int find_vma_links(struct mm_struct *mm, unsigned long addr,
> - unsigned long end, struct vm_area_struct **pprev,
> - struct rb_node ***rb_link, struct rb_node **rb_parent)
> +/* Private
> + * range_has_overlap() - Check the @start - @end range for overlapping VMAs and
> + * sets up a pointer to the previous VMA
> + *
> + * @mm - the mm struct
> + * @start - the start address of the range
> + * @end - the end address of the range
> + * @pprev - the pointer to the pointer of the previous VMA
> + *
> + * Returns: True if there is an overlapping VMA, false otherwise
> + */
> +static bool range_has_overlap(struct mm_struct *mm, unsigned long start,
> + unsigned long end, struct vm_area_struct **pprev)
> {
> - struct rb_node **__rb_link, *__rb_parent, *rb_prev;
> -
> - __rb_link = &mm->mm_rb.rb_node;
> - rb_prev = __rb_parent = NULL;
> + struct vm_area_struct *existing;
>
> - while (*__rb_link) {
> - struct vm_area_struct *vma_tmp;
> -
> - __rb_parent = *__rb_link;
> - vma_tmp = rb_entry(__rb_parent, struct vm_area_struct, vm_rb);
> -
> - if (vma_tmp->vm_end > addr) {
> - /* Fail if an existing vma overlaps the area */
> - if (vma_tmp->vm_start < end)
> - return -ENOMEM;
> - __rb_link = &__rb_parent->rb_left;
> - } else {
> - rb_prev = __rb_parent;
> - __rb_link = &__rb_parent->rb_right;
> - }
> - }
> -
> - *pprev = NULL;
> - if (rb_prev)
> - *pprev = rb_entry(rb_prev, struct vm_area_struct, vm_rb);
> - *rb_link = __rb_link;
> - *rb_parent = __rb_parent;
> - return 0;
> + MA_STATE(mas, &mm->mm_mt, start, start);
> + existing = mas_find(&mas, end - 1);
> + *pprev = mas_prev(&mas, 0);
> + return existing ? true : false;
> }
>
> /*
> @@ -653,8 +481,6 @@ static inline struct vm_area_struct *vma_next(struct mm_struct *mm,
> * @start: The start of the range.
> * @len: The length of the range.
> * @pprev: pointer to the pointer that will be set to previous vm_area_struct
> - * @rb_link: the rb_node
> - * @rb_parent: the parent rb_node
> *
> * Find all the vm_area_struct that overlap from @start to
> * @end and munmap them. Set @pprev to the previous vm_area_struct.
> @@ -663,76 +489,41 @@ static inline struct vm_area_struct *vma_next(struct mm_struct *mm,
> */
> static inline int
> munmap_vma_range(struct mm_struct *mm, unsigned long start, unsigned long len,
> - struct vm_area_struct **pprev, struct rb_node ***link,
> - struct rb_node **parent, struct list_head *uf)
> + struct vm_area_struct **pprev, struct list_head *uf)
> {
> -
> - while (find_vma_links(mm, start, start + len, pprev, link, parent))
> + // Needs optimization.
> + while (range_has_overlap(mm, start, start + len, pprev))
> if (do_munmap(mm, start, len, uf))
> return -ENOMEM;
> -
> return 0;
> }
> static unsigned long count_vma_pages_range(struct mm_struct *mm,
> unsigned long addr, unsigned long end)
> {
> unsigned long nr_pages = 0;
> - unsigned long nr_mt_pages = 0;
> struct vm_area_struct *vma;
> + unsigned long vm_start, vm_end;
> + MA_STATE(mas, &mm->mm_mt, addr, addr);
>
> - /* Find first overlapping mapping */
> - vma = find_vma_intersection(mm, addr, end);
> + /* Find first overlaping mapping */

nit: I think the original comment was correct.

> + vma = mas_find(&mas, end - 1);
> if (!vma)
> return 0;
>
> - nr_pages = (min(end, vma->vm_end) -
> - max(addr, vma->vm_start)) >> PAGE_SHIFT;
> + vm_start = vma->vm_start;
> + vm_end = vma->vm_end;
> + nr_pages = (min(end, vm_end) - max(addr, vm_start)) >> PAGE_SHIFT;
>
> /* Iterate over the rest of the overlaps */
> - for (vma = vma->vm_next; vma; vma = vma->vm_next) {
> - unsigned long overlap_len;
> -
> - if (vma->vm_start > end)
> - break;
> -
> - overlap_len = min(end, vma->vm_end) - vma->vm_start;
> - nr_pages += overlap_len >> PAGE_SHIFT;
> + mas_for_each(&mas, vma, end) {
> + vm_start = vma->vm_start;
> + vm_end = vma->vm_end;
> + nr_pages += (min(end, vm_end) - vm_start) >> PAGE_SHIFT;
> }
>
> - mt_for_each(&mm->mm_mt, vma, addr, end) {
> - nr_mt_pages +=
> - (min(end, vma->vm_end) - vma->vm_start) >> PAGE_SHIFT;
> - }
> -
> - VM_BUG_ON_MM(nr_pages != nr_mt_pages, mm);
> -
> return nr_pages;
> }
>
> -void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
> - struct rb_node **rb_link, struct rb_node *rb_parent)
> -{
> - /* Update tracking information for the gap following the new vma. */
> - if (vma->vm_next)
> - vma_gap_update(vma->vm_next);
> - else
> - mm->highest_vm_end = vm_end_gap(vma);
> -
> - /*
> - * vma->vm_prev wasn't known when we followed the rbtree to find the
> - * correct insertion point for that vma. As a result, we could not
> - * update the vma vm_rb parents rb_subtree_gap values on the way down.
> - * So, we first insert the vma with a zero rb_subtree_gap value
> - * (to be consistent with what we did on the way down), and then
> - * immediately update the gap to the correct value. Finally we
> - * rebalance the rbtree after all augmented values have been set.
> - */
> - rb_link_node(&vma->vm_rb, rb_parent, rb_link);
> - vma->rb_subtree_gap = 0;
> - vma_gap_update(vma);
> - vma_rb_insert(vma, &mm->mm_rb);
> -}
> -
> static void __vma_link_file(struct vm_area_struct *vma)
> {
> struct file *file;
> @@ -780,19 +571,8 @@ static inline void vma_mt_store(struct mm_struct *mm, struct vm_area_struct *vma
> GFP_KERNEL);
> }
>
> -static void
> -__vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
> - struct vm_area_struct *prev, struct rb_node **rb_link,
> - struct rb_node *rb_parent)
> -{
> - vma_mt_store(mm, vma);
> - __vma_link_list(mm, vma, prev);
> - __vma_link_rb(mm, vma, rb_link, rb_parent);
> -}
> -
> static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
> - struct vm_area_struct *prev, struct rb_node **rb_link,
> - struct rb_node *rb_parent)
> + struct vm_area_struct *prev)
> {
> struct address_space *mapping = NULL;
>
> @@ -801,7 +581,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
> i_mmap_lock_write(mapping);
> }
>
> - __vma_link(mm, vma, prev, rb_link, rb_parent);
> + vma_mt_store(mm, vma);
> + __vma_link_list(mm, vma, prev);
> __vma_link_file(vma);
>
> if (mapping)
> @@ -813,30 +594,18 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
>
> /*
> * Helper for vma_adjust() in the split_vma insert case: insert a vma into the
> - * mm's list and rbtree. It has already been inserted into the interval tree.
> + * mm's list and the mm tree. It has already been inserted into the interval tree.
> */
> static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
> {
> struct vm_area_struct *prev;
> - struct rb_node **rb_link, *rb_parent;
>
> - if (find_vma_links(mm, vma->vm_start, vma->vm_end,
> - &prev, &rb_link, &rb_parent))
> - BUG();
> - __vma_link(mm, vma, prev, rb_link, rb_parent);
> + BUG_ON(range_has_overlap(mm, vma->vm_start, vma->vm_end, &prev));
> + vma_mt_store(mm, vma);
> + __vma_link_list(mm, vma, prev);
> mm->map_count++;
> }
>
> -static __always_inline void __vma_unlink(struct mm_struct *mm,
> - struct vm_area_struct *vma,
> - struct vm_area_struct *ignore)
> -{
> - vma_rb_erase_ignore(vma, &mm->mm_rb, ignore);
> - __vma_unlink_list(mm, vma);
> - /* Kill the cache */
> - vmacache_invalidate(mm);
> -}
> -
> /*
> * We cannot adjust vm_start, vm_end, vm_pgoff fields of a vma that
> * is already present in an i_mmap tree without adjusting the tree.
> @@ -854,13 +623,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> struct rb_root_cached *root = NULL;
> struct anon_vma *anon_vma = NULL;
> struct file *file = vma->vm_file;
> - bool start_changed = false, end_changed = false;
> + bool vma_changed = false;
> long adjust_next = 0;
> int remove_next = 0;
>
> - validate_mm(mm);
> - validate_mm_mt(mm);
> -
> if (next && !insert) {
> struct vm_area_struct *exporter = NULL, *importer = NULL;
>
> @@ -986,21 +752,23 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> }
>
> if (start != vma->vm_start) {
> - unsigned long old_start = vma->vm_start;
> + if (vma->vm_start < start)
> + vma_mt_szero(mm, vma->vm_start, start);
> + else
> + vma_changed = true;
> vma->vm_start = start;
> - if (old_start < start)
> - vma_mt_szero(mm, old_start, start);
> - start_changed = true;
> }
> if (end != vma->vm_end) {
> - unsigned long old_end = vma->vm_end;
> + if (vma->vm_end > end)
> + vma_mt_szero(mm, end, vma->vm_end);
> + else
> + vma_changed = true;
> vma->vm_end = end;
> - if (old_end > end)
> - vma_mt_szero(mm, end, old_end);
> - end_changed = true;
> + if (!next)
> + mm->highest_vm_end = vm_end_gap(vma);
> }
>
> - if (end_changed || start_changed)
> + if (vma_changed)
> vma_mt_store(mm, vma);
>
> vma->vm_pgoff = pgoff;
> @@ -1018,25 +786,9 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> }
>
> if (remove_next) {
> - /*
> - * vma_merge has merged next into vma, and needs
> - * us to remove next before dropping the locks.
> - * Since we have expanded over this vma, the maple tree will
> - * have overwritten by storing the value
> - */
> - if (remove_next != 3)
> - __vma_unlink(mm, next, next);
> - else
> - /*
> - * vma is not before next if they've been
> - * swapped.
> - *
> - * pre-swap() next->vm_start was reduced so
> - * tell validate_mm_rb to ignore pre-swap()
> - * "next" (which is stored in post-swap()
> - * "vma").
> - */
> - __vma_unlink(mm, next, vma);
> + __vma_unlink_list(mm, next);
> + /* Kill the cache */
> + vmacache_invalidate(mm);
> if (file)
> __remove_shared_vm_struct(next, file, mapping);
> } else if (insert) {
> @@ -1046,15 +798,6 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> * (it may either follow vma or precede it).
> */
> __insert_vm_struct(mm, insert);
> - } else {
> - if (start_changed)
> - vma_gap_update(vma);
> - if (end_changed) {
> - if (!next)
> - mm->highest_vm_end = vm_end_gap(vma);
> - else if (!adjust_next)
> - vma_gap_update(next);
> - }
> }
>
> if (anon_vma) {
> @@ -1112,10 +855,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> remove_next = 1;
> end = next->vm_end;
> goto again;
> - }
> - else if (next)
> - vma_gap_update(next);
> - else {
> + } else if (!next) {
> /*
> * If remove_next == 2 we obviously can't
> * reach this path.
> @@ -1142,8 +882,6 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> uprobe_mmap(insert);
>
> validate_mm(mm);
> - validate_mm_mt(mm);
> -
> return 0;
> }
>
> @@ -1290,7 +1028,6 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> struct vm_area_struct *area, *next;
> int err;
>
> - validate_mm_mt(mm);
> /*
> * We later require that vma->vm_flags == vm_flags,
> * so this tests vma->vm_flags & VM_SPECIAL, too.
> @@ -1366,7 +1103,6 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> khugepaged_enter_vma_merge(area, vm_flags);
> return area;
> }
> - validate_mm_mt(mm);
>
> return NULL;
> }
> @@ -1536,6 +1272,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> vm_flags_t vm_flags;
> int pkey = 0;
>
> + validate_mm(mm);
> *populate = 0;
>
> if (!len)
> @@ -1856,10 +1593,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> struct mm_struct *mm = current->mm;
> struct vm_area_struct *vma, *prev, *merge;
> int error;
> - struct rb_node **rb_link, *rb_parent;
> unsigned long charged = 0;
>
> - validate_mm_mt(mm);
> /* Check against address space limit. */
> if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
> unsigned long nr_pages;
> @@ -1875,8 +1610,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> return -ENOMEM;
> }
>
> - /* Clear old maps, set up prev, rb_link, rb_parent, and uf */
> - if (munmap_vma_range(mm, addr, len, &prev, &rb_link, &rb_parent, uf))
> + /* Clear old maps, set up prev and uf */
> + if (munmap_vma_range(mm, addr, len, &prev, uf))
> return -ENOMEM;
> /*
> * Private writable mapping: check memory availability
> @@ -1984,7 +1719,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> goto free_vma;
> }
>
> - vma_link(mm, vma, prev, rb_link, rb_parent);
> + vma_link(mm, vma, prev);
> /* Once vma denies write, undo our temporary denial count */
> if (file) {
> unmap_writable:
> @@ -2021,7 +1756,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>
> vma_set_page_prot(vma);
>
> - validate_mm_mt(mm);
> return addr;
>
> unmap_and_free_vma:
> @@ -2041,7 +1775,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> unacct_error:
> if (charged)
> vm_unacct_memory(charged);
> - validate_mm_mt(mm);
> return error;
> }
>
> @@ -2324,9 +2057,6 @@ find_vma_prev(struct mm_struct *mm, unsigned long addr,
>
> rcu_read_lock();
> vma = mas_find(&mas, ULONG_MAX);
> - if (!vma)
> - mas_reset(&mas);
> -
> *pprev = mas_prev(&mas, 0);
> rcu_read_unlock();
> return vma;
> @@ -2390,7 +2120,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> unsigned long gap_addr;
> int error = 0;
>
> - validate_mm_mt(mm);
> if (!(vma->vm_flags & VM_GROWSUP))
> return -EFAULT;
>
> @@ -2437,15 +2166,13 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> error = acct_stack_growth(vma, size, grow);
> if (!error) {
> /*
> - * vma_gap_update() doesn't support concurrent
> - * updates, but we only hold a shared mmap_lock
> - * lock here, so we need to protect against
> - * concurrent vma expansions.
> - * anon_vma_lock_write() doesn't help here, as
> - * we don't guarantee that all growable vmas
> - * in a mm share the same root anon vma.
> - * So, we reuse mm->page_table_lock to guard
> - * against concurrent vma expansions.
> + * We only hold a shared mmap_lock lock here, so
> + * we need to protect against concurrent vma
> + * expansions. anon_vma_lock_write() doesn't
> + * help here, as we don't guarantee that all
> + * growable vmas in a mm share the same root
> + * anon vma. So, we reuse mm->page_table_lock
> + * to guard against concurrent vma expansions.
> */
> spin_lock(&mm->page_table_lock);
> if (vma->vm_flags & VM_LOCKED)
> @@ -2453,10 +2180,9 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> vm_stat_account(mm, vma->vm_flags, grow);
> anon_vma_interval_tree_pre_update_vma(vma);
> vma->vm_end = address;
> + vma_mt_store(mm, vma);
> anon_vma_interval_tree_post_update_vma(vma);
> - if (vma->vm_next)
> - vma_gap_update(vma->vm_next);
> - else
> + if (!vma->vm_next)
> mm->highest_vm_end = vm_end_gap(vma);
> spin_unlock(&mm->page_table_lock);
>
> @@ -2466,8 +2192,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> }
> anon_vma_unlock_write(vma->anon_vma);
> khugepaged_enter_vma_merge(vma, vma->vm_flags);
> - validate_mm(mm);
> - validate_mm_mt(mm);
> return error;
> }
> #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
> @@ -2475,14 +2199,12 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> /*
> * vma is the first one with address < vma->vm_start. Have to extend vma.
> */
> -int expand_downwards(struct vm_area_struct *vma,
> - unsigned long address)
> +int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> {
> struct mm_struct *mm = vma->vm_mm;
> struct vm_area_struct *prev;
> int error = 0;
>
> - validate_mm(mm);
> address &= PAGE_MASK;
> if (address < mmap_min_addr)
> return -EPERM;
> @@ -2519,15 +2241,13 @@ int expand_downwards(struct vm_area_struct *vma,
> error = acct_stack_growth(vma, size, grow);
> if (!error) {
> /*
> - * vma_gap_update() doesn't support concurrent
> - * updates, but we only hold a shared mmap_lock
> - * lock here, so we need to protect against
> - * concurrent vma expansions.
> - * anon_vma_lock_write() doesn't help here, as
> - * we don't guarantee that all growable vmas
> - * in a mm share the same root anon vma.
> - * So, we reuse mm->page_table_lock to guard
> - * against concurrent vma expansions.
> + * We only hold a shared mmap_lock lock here, so
> + * we need to protect against concurrent vma
> + * expansions. anon_vma_lock_write() doesn't
> + * help here, as we don't guarantee that all
> + * growable vmas in a mm share the same root
> + * anon vma. So, we reuse mm->page_table_lock
> + * to guard against concurrent vma expansions.
> */
> spin_lock(&mm->page_table_lock);
> if (vma->vm_flags & VM_LOCKED)
> @@ -2539,7 +2259,6 @@ int expand_downwards(struct vm_area_struct *vma,
> /* Overwrite old entry in mtree. */
> vma_mt_store(mm, vma);
> anon_vma_interval_tree_post_update_vma(vma);
> - vma_gap_update(vma);
> spin_unlock(&mm->page_table_lock);
>
> perf_event_mmap(vma);
> @@ -2548,7 +2267,6 @@ int expand_downwards(struct vm_area_struct *vma,
> }
> anon_vma_unlock_write(vma->anon_vma);
> khugepaged_enter_vma_merge(vma, vma->vm_flags);
> - validate_mm(mm);
> return error;
> }
>
> @@ -2681,16 +2399,14 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
> vma->vm_prev = NULL;
> vma_mt_szero(mm, vma->vm_start, end);
> do {
> - vma_rb_erase(vma, &mm->mm_rb);
> mm->map_count--;
> tail_vma = vma;
> vma = vma->vm_next;
> } while (vma && vma->vm_start < end);
> *insertion_point = vma;
> - if (vma) {
> + if (vma)
> vma->vm_prev = prev;
> - vma_gap_update(vma);
> - } else
> + else
> mm->highest_vm_end = prev ? vm_end_gap(prev) : 0;
> tail_vma->vm_next = NULL;
>
> @@ -2821,11 +2537,7 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
> if (len == 0)
> return -EINVAL;
>
> - /*
> - * arch_unmap() might do unmaps itself. It must be called
> - * and finish any rbtree manipulation before this code
> - * runs and also starts to manipulate the rbtree.
> - */
> + /* arch_unmap() might do unmaps itself. */
> arch_unmap(mm, start, end);
>
> /* Find the first overlapping VMA */
> @@ -2833,7 +2545,7 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
> if (!vma)
> return 0;
> prev = vma->vm_prev;
> - /* we have start < vma->vm_end */
> + /* we have start < vma->vm_end */
>
> /* if it doesn't overlap, we have nothing.. */
> if (vma->vm_start >= end)
> @@ -2893,7 +2605,7 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
> if (mm->locked_vm)
> unlock_range(vma, end);
>
> - /* Detach vmas from rbtree */
> + /* Detach vmas from the MM linked list and remove from the mm tree*/
> if (!detach_vmas_to_be_unmapped(mm, vma, prev, end))
> downgrade = false;
>
> @@ -3041,11 +2753,11 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
> * anonymous maps. eventually we may be able to do some
> * brk-specific accounting here.
> */
> -static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long flags, struct list_head *uf)
> +static int do_brk_flags(unsigned long addr, unsigned long len,
> + unsigned long flags, struct list_head *uf)
> {
> struct mm_struct *mm = current->mm;
> struct vm_area_struct *vma, *prev;
> - struct rb_node **rb_link, *rb_parent;
> pgoff_t pgoff = addr >> PAGE_SHIFT;
> int error;
> unsigned long mapped_addr;
> @@ -3064,8 +2776,8 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
> if (error)
> return error;
>
> - /* Clear old maps, set up prev, rb_link, rb_parent, and uf */
> - if (munmap_vma_range(mm, addr, len, &prev, &rb_link, &rb_parent, uf))
> + /* Clear old maps, set up prev and uf */
> + if (munmap_vma_range(mm, addr, len, &prev, uf))
> return -ENOMEM;
>
> /* Check against address space limits *after* clearing old maps... */
> @@ -3099,7 +2811,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
> vma->vm_pgoff = pgoff;
> vma->vm_flags = flags;
> vma->vm_page_prot = vm_get_page_prot(flags);
> - vma_link(mm, vma, prev, rb_link, rb_parent);
> + vma_link(mm, vma, prev);
> out:
> perf_event_mmap(vma);
> mm->total_vm += len >> PAGE_SHIFT;
> @@ -3219,26 +2931,10 @@ void exit_mmap(struct mm_struct *mm)
> int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
> {
> struct vm_area_struct *prev;
> - struct rb_node **rb_link, *rb_parent;
> - unsigned long start = vma->vm_start;
> - struct vm_area_struct *overlap = NULL;
>
> - if (find_vma_links(mm, vma->vm_start, vma->vm_end,
> - &prev, &rb_link, &rb_parent))
> + if (range_has_overlap(mm, vma->vm_start, vma->vm_end, &prev))
> return -ENOMEM;
>
> - overlap = mt_find(&mm->mm_mt, &start, vma->vm_end - 1);
> - if (overlap) {
> -
> - pr_err("Found vma ending at %lu\n", start - 1);
> - pr_err("vma : %lu => %lu-%lu\n", (unsigned long)overlap,
> - overlap->vm_start, overlap->vm_end - 1);
> -#if defined(CONFIG_DEBUG_MAPLE_TREE)
> - mt_dump(&mm->mm_mt);
> -#endif
> - BUG();
> - }
> -
> if ((vma->vm_flags & VM_ACCOUNT) &&
> security_vm_enough_memory_mm(mm, vma_pages(vma)))
> return -ENOMEM;
> @@ -3260,7 +2956,7 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
> vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT;
> }
>
> - vma_link(mm, vma, prev, rb_link, rb_parent);
> + vma_link(mm, vma, prev);
> return 0;
> }
>
> @@ -3276,9 +2972,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> unsigned long vma_start = vma->vm_start;
> struct mm_struct *mm = vma->vm_mm;
> struct vm_area_struct *new_vma, *prev;
> - struct rb_node **rb_link, *rb_parent;
> bool faulted_in_anon_vma = true;
> - unsigned long index = addr;
>
> validate_mm_mt(mm);
> /*
> @@ -3290,10 +2984,9 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> faulted_in_anon_vma = false;
> }
>
> - if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
> + if (range_has_overlap(mm, addr, addr + len, &prev))
> return NULL; /* should never get here */
> - if (mt_find(&mm->mm_mt, &index, addr+len - 1))
> - BUG();
> +
> new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
> vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
> vma->vm_userfaultfd_ctx);
> @@ -3334,7 +3027,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> get_file(new_vma->vm_file);
> if (new_vma->vm_ops && new_vma->vm_ops->open)
> new_vma->vm_ops->open(new_vma);
> - vma_link(mm, new_vma, prev, rb_link, rb_parent);
> + vma_link(mm, new_vma, prev);
> *need_rmap_locks = false;
> }
> validate_mm_mt(mm);
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 8848cf7cb7c1..c410f99203fb 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -566,13 +566,14 @@ static void put_nommu_region(struct vm_region *region)
> */
> static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)
> {
> - struct vm_area_struct *pvma, *prev;
> struct address_space *mapping;
> - struct rb_node **p, *parent, *rb_prev;
> + struct vm_area_struct *prev;
> + MA_STATE(mas, &mm->mm_mt, vma->vm_start, vma->vm_end);
>
> BUG_ON(!vma->vm_region);
>
> mm->map_count++;
> + printk("mm at %u\n", mm->map_count);

I think this was added while debugging and should be removed?

> vma->vm_mm = mm;
>
> /* add the VMA to the mapping */
> @@ -586,42 +587,12 @@ static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)
> i_mmap_unlock_write(mapping);
> }
>
> + rcu_read_lock();
> + prev = mas_prev(&mas, 0);
> + rcu_read_unlock();
> + mas_reset(&mas);
> /* add the VMA to the tree */
> - parent = rb_prev = NULL;
> - p = &mm->mm_rb.rb_node;
> - while (*p) {
> - parent = *p;
> - pvma = rb_entry(parent, struct vm_area_struct, vm_rb);
> -
> - /* sort by: start addr, end addr, VMA struct addr in that order
> - * (the latter is necessary as we may get identical VMAs) */
> - if (vma->vm_start < pvma->vm_start)
> - p = &(*p)->rb_left;
> - else if (vma->vm_start > pvma->vm_start) {
> - rb_prev = parent;
> - p = &(*p)->rb_right;
> - } else if (vma->vm_end < pvma->vm_end)
> - p = &(*p)->rb_left;
> - else if (vma->vm_end > pvma->vm_end) {
> - rb_prev = parent;
> - p = &(*p)->rb_right;
> - } else if (vma < pvma)
> - p = &(*p)->rb_left;
> - else if (vma > pvma) {
> - rb_prev = parent;
> - p = &(*p)->rb_right;
> - } else
> - BUG();
> - }
> -
> - rb_link_node(&vma->vm_rb, parent, p);
> - rb_insert_color(&vma->vm_rb, &mm->mm_rb);
> -
> - /* add VMA to the VMA list also */
> - prev = NULL;
> - if (rb_prev)
> - prev = rb_entry(rb_prev, struct vm_area_struct, vm_rb);
> -
> + vma_mas_store(vma, &mas);
> __vma_link_list(mm, vma, prev);
> }
>
> @@ -634,6 +605,7 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
> struct address_space *mapping;
> struct mm_struct *mm = vma->vm_mm;
> struct task_struct *curr = current;
> + MA_STATE(mas, &vma->vm_mm->mm_mt, 0, 0);
>
> mm->map_count--;
> for (i = 0; i < VMACACHE_SIZE; i++) {
> @@ -643,7 +615,6 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
> break;
> }
> }
> -
> /* remove the VMA from the mapping */
> if (vma->vm_file) {
> mapping = vma->vm_file->f_mapping;
> @@ -656,8 +627,7 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
> }
>
> /* remove from the MM's tree and list */
> - rb_erase(&vma->vm_rb, &mm->mm_rb);
> -
> + vma_mas_remove(vma, &mas);
> __vma_unlink_list(mm, vma);
> }
>
> @@ -681,24 +651,21 @@ static void delete_vma(struct mm_struct *mm, struct vm_area_struct *vma)
> struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> {
> struct vm_area_struct *vma;
> + MA_STATE(mas, &mm->mm_mt, addr, addr);
>
> /* check the cache first */
> vma = vmacache_find(mm, addr);
> if (likely(vma))
> return vma;
>
> - /* trawl the list (there may be multiple mappings in which addr
> - * resides) */
> - for (vma = mm->mmap; vma; vma = vma->vm_next) {
> - if (vma->vm_start > addr)
> - return NULL;
> - if (vma->vm_end > addr) {
> - vmacache_update(addr, vma);
> - return vma;
> - }
> - }
> + rcu_read_lock();
> + vma = mas_walk(&mas);
> + rcu_read_unlock();
>
> - return NULL;
> + if (vma)
> + vmacache_update(addr, vma);
> +
> + return vma;
> }
> EXPORT_SYMBOL(find_vma);
>
> @@ -730,26 +697,25 @@ static struct vm_area_struct *find_vma_exact(struct mm_struct *mm,
> {
> struct vm_area_struct *vma;
> unsigned long end = addr + len;
> + MA_STATE(mas, &mm->mm_mt, addr, addr);
>
> /* check the cache first */
> vma = vmacache_find_exact(mm, addr, end);
> if (vma)
> return vma;
>
> - /* trawl the list (there may be multiple mappings in which addr
> - * resides) */
> - for (vma = mm->mmap; vma; vma = vma->vm_next) {
> - if (vma->vm_start < addr)
> - continue;
> - if (vma->vm_start > addr)
> - return NULL;
> - if (vma->vm_end == end) {
> - vmacache_update(addr, vma);
> - return vma;
> - }
> - }
> -
> - return NULL;
> + rcu_read_lock();
> + vma = mas_walk(&mas);
> + rcu_read_unlock();
> + if (!vma)
> + return NULL;
> + if (vma->vm_start != addr)
> + return NULL;
> + if (vma->vm_end != end)
> + return NULL;
> +
> + vmacache_update(addr, vma);
> + return vma;
> }
>
> /*
> diff --git a/mm/util.c b/mm/util.c
> index 0b6dd9d81da7..35deaa0ccac5 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -287,6 +287,8 @@ void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
> vma->vm_next = next;
> if (next)
> next->vm_prev = vma;
> + else
> + mm->highest_vm_end = vm_end_gap(vma);
> }
>
> void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma)
> @@ -301,6 +303,12 @@ void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma)
> mm->mmap = next;
> if (next)
> next->vm_prev = prev;
> + else {
> + if (prev)
> + mm->highest_vm_end = vm_end_gap(prev);
> + else
> + mm->highest_vm_end = 0;
> + }
> }
>
> /* Check if the vma is being used as a stack by this task */
> --
> 2.30.2

2021-05-31 14:06:09

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 32/94] kernel/fork: Convert dup_mmap to use maple tree

* Suren Baghdasaryan <[email protected]> [210528 20:42]:
> On Wed, Apr 28, 2021 at 8:36 AM Liam Howlett <[email protected]> wrote:
> >
> > Use the maple tree iterator to duplicate the mm_struct trees.
> >
> > Signed-off-by: Liam R. Howlett <[email protected]>
> > ---
> > include/linux/mm.h | 2 --
> > include/linux/sched/mm.h | 3 +++
> > kernel/fork.c | 24 +++++++++++++++++++-----
> > mm/mmap.c | 4 ----
> > 4 files changed, 22 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index e89bacfa9145..7f7dff6ad884 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2498,8 +2498,6 @@ extern bool arch_has_descending_max_zone_pfns(void);
> > /* nommu.c */
> > extern atomic_long_t mmap_pages_allocated;
> > extern int nommu_shrink_inode_mappings(struct inode *, size_t, size_t);
> > -/* maple_tree */
> > -void vma_store(struct mm_struct *mm, struct vm_area_struct *vma);
> >
> > /* interval_tree.c */
> > void vma_interval_tree_insert(struct vm_area_struct *node,
> > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > index e24b1fe348e3..76cab3aea6ab 100644
> > --- a/include/linux/sched/mm.h
> > +++ b/include/linux/sched/mm.h
> > @@ -8,6 +8,7 @@
> > #include <linux/mm_types.h>
> > #include <linux/gfp.h>
> > #include <linux/sync_core.h>
> > +#include <linux/maple_tree.h>
> >
> > /*
> > * Routines for handling mm_structs
> > @@ -67,11 +68,13 @@ static inline void mmdrop(struct mm_struct *mm)
> > */
> > static inline void mmget(struct mm_struct *mm)
> > {
> > + mt_set_in_rcu(&mm->mm_mt);
> > atomic_inc(&mm->mm_users);
> > }
> >
> > static inline bool mmget_not_zero(struct mm_struct *mm)
> > {
> > + mt_set_in_rcu(&mm->mm_mt);
>
> Should you be calling mt_set_in_rcu() if atomic_inc_not_zero() failed?
> I don't think mmput() is called after mmget_not_zero() fails, so
> mt_clear_in_rcu() will never be called.

Good catch, but leaving it as is will be faster; the downside is the
possibility of re-entering RCU mode if there is a race during teardown.
Entering RCU mode during teardown means that any nodes which have not
already been freed would remain for an RCU cycle before being freed. I
don't think it is worth adding a check every time this is called for
such a low payoff. I should probably add a comment about this, though.
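
For reference, the check-first ordering being suggested would be
something like the sketch below (illustration only, against the hunk
above; the extra branch is exactly the cost I don't think is
justified):

static inline bool mmget_not_zero(struct mm_struct *mm)
{
	/*
	 * Hypothetical ordering (not what the patch does): bump the
	 * refcount first and only switch the tree into RCU mode when
	 * the mm is actually still live.  That avoids leaving mm_mt
	 * in RCU mode after a lost race with __mmput(), at the cost
	 * of a branch on every call.
	 */
	if (!atomic_inc_not_zero(&mm->mm_users))
		return false;
	mt_set_in_rcu(&mm->mm_mt);
	return true;
}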

>
> > return atomic_inc_not_zero(&mm->mm_users);
> > }
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index c37abaf28eb9..832416ff613e 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -477,7 +477,9 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
> > struct rb_node **rb_link, *rb_parent;
> > int retval;
> > - unsigned long charge;
> > + unsigned long charge = 0;
> > + MA_STATE(old_mas, &oldmm->mm_mt, 0, 0);
> > + MA_STATE(mas, &mm->mm_mt, 0, 0);
> > LIST_HEAD(uf);
> >
> > uprobe_start_dup_mmap();
> > @@ -511,7 +513,13 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > goto out;
> >
> > prev = NULL;
> > - for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
> > +
> > + retval = mas_entry_count(&mas, oldmm->map_count);
> > + if (retval)
> > + goto fail_nomem;
> > +
> > + rcu_read_lock();
> > + mas_for_each(&old_mas, mpnt, ULONG_MAX) {
> > struct file *file;
> >
> > if (mpnt->vm_flags & VM_DONTCOPY) {
> > @@ -525,7 +533,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > */
> > if (fatal_signal_pending(current)) {
> > retval = -EINTR;
> > - goto out;
> > + goto loop_out;
> > }
> > if (mpnt->vm_flags & VM_ACCOUNT) {
> > unsigned long len = vma_pages(mpnt);
> > @@ -594,7 +602,9 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > rb_parent = &tmp->vm_rb;
> >
> > /* Link the vma into the MT */
> > - vma_store(mm, tmp);
> > + mas.index = tmp->vm_start;
> > + mas.last = tmp->vm_end - 1;
> > + mas_store(&mas, tmp);
> >
> > mm->map_count++;
> > if (!(tmp->vm_flags & VM_WIPEONFORK))
> > @@ -604,14 +614,17 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > tmp->vm_ops->open(tmp);
> >
> > if (retval)
> > - goto out;
> > + goto loop_out;
> > }
> > /* a new mm has just been created */
> > retval = arch_dup_mmap(oldmm, mm);
> > +loop_out:
> > out:
> > + rcu_read_unlock();
> > mmap_write_unlock(mm);
> > flush_tlb_mm(oldmm);
> > mmap_write_unlock(oldmm);
> > + mas_destroy(&mas);
> > dup_userfaultfd_complete(&uf);
> > fail_uprobe_end:
> > uprobe_end_dup_mmap();
> > @@ -1092,6 +1105,7 @@ static inline void __mmput(struct mm_struct *mm)
> > {
> > VM_BUG_ON(atomic_read(&mm->mm_users));
> >
> > + mt_clear_in_rcu(&mm->mm_mt);
> > uprobe_clear_state(mm);
> > exit_aio(mm);
> > ksm_exit(mm);
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 929c2f9eb3f5..1bd43f4db28e 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -780,10 +780,6 @@ static inline void vma_mt_store(struct mm_struct *mm, struct vm_area_struct *vma
> > GFP_KERNEL);
> > }
> >
> > -void vma_store(struct mm_struct *mm, struct vm_area_struct *vma) {
> > - vma_mt_store(mm, vma);
> > -}
> > -
> > static void
> > __vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
> > struct vm_area_struct *prev, struct rb_node **rb_link,
> > --
> > 2.30.2

2021-05-31 14:10:05

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 33/94] mm: Remove rb tree.

* Suren Baghdasaryan <[email protected]> [210528 21:26]:
> On Wed, Apr 28, 2021 at 8:36 AM Liam Howlett <[email protected]> wrote:
> >
> > Remove the RB tree and start using the maple tree for vm_area_struct
> > tracking.
> >
> > Drop validate_mm() calls in expand_upwards() and expand_downwards() as
> > the lock is not held.
> >
> > Signed-off-by: Liam R. Howlett <[email protected]>
> > ---
> > arch/x86/kernel/tboot.c | 1 -
> > drivers/firmware/efi/efi.c | 1 -
> > fs/proc/task_nommu.c | 55 ++--
> > include/linux/mm.h | 4 +-
> > include/linux/mm_types.h | 26 +-
> > kernel/fork.c | 8 -
> > mm/init-mm.c | 2 -
> > mm/mmap.c | 525 ++++++++-----------------------------
> > mm/nommu.c | 96 +++----
> > mm/util.c | 8 +
> > 10 files changed, 185 insertions(+), 541 deletions(-)
> >
> > diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
> > index 6f978f722dff..121f28bb2209 100644
> > --- a/arch/x86/kernel/tboot.c
> > +++ b/arch/x86/kernel/tboot.c
> > @@ -97,7 +97,6 @@ void __init tboot_probe(void)
> >
> > static pgd_t *tboot_pg_dir;
> > static struct mm_struct tboot_mm = {
> > - .mm_rb = RB_ROOT,
> > .mm_mt = MTREE_INIT(mm_mt, MAPLE_ALLOC_RANGE),
> > .pgd = swapper_pg_dir,
> > .mm_users = ATOMIC_INIT(2),
> > diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> > index 271ae8c7bb07..8aaeaa824576 100644
> > --- a/drivers/firmware/efi/efi.c
> > +++ b/drivers/firmware/efi/efi.c
> > @@ -54,7 +54,6 @@ static unsigned long __initdata mem_reserve = EFI_INVALID_TABLE_ADDR;
> > static unsigned long __initdata rt_prop = EFI_INVALID_TABLE_ADDR;
> >
> > struct mm_struct efi_mm = {
> > - .mm_rb = RB_ROOT,
> > .mm_mt = MTREE_INIT(mm_mt, MAPLE_ALLOC_RANGE),
> > .mm_users = ATOMIC_INIT(2),
> > .mm_count = ATOMIC_INIT(1),
> > diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
> > index a6d21fc0033c..8691a1216d1c 100644
> > --- a/fs/proc/task_nommu.c
> > +++ b/fs/proc/task_nommu.c
> > @@ -22,15 +22,13 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> > {
> > struct vm_area_struct *vma;
> > struct vm_region *region;
> > - struct rb_node *p;
> > unsigned long bytes = 0, sbytes = 0, slack = 0, size;
> > -
> > - mmap_read_lock(mm);
> > - for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) {
> > - vma = rb_entry(p, struct vm_area_struct, vm_rb);
> > + MA_STATE(mas, &mm->mm_mt, 0, 0);
> >
> > + mmap_read_lock(mm);
> > + rcu_read_lock();
> > + mas_for_each(&mas, vma, ULONG_MAX) {
> > bytes += kobjsize(vma);
> > -
> > region = vma->vm_region;
> > if (region) {
> > size = kobjsize(region);
> > @@ -53,7 +51,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> > sbytes += kobjsize(mm);
> > else
> > bytes += kobjsize(mm);
> > -
> > +
> > if (current->fs && current->fs->users > 1)
> > sbytes += kobjsize(current->fs);
> > else
> > @@ -77,20 +75,21 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> > "Shared:\t%8lu bytes\n",
> > bytes, slack, sbytes);
> >
> > + rcu_read_unlock();
> > mmap_read_unlock(mm);
> > }
> >
> > unsigned long task_vsize(struct mm_struct *mm)
> > {
> > struct vm_area_struct *vma;
> > - struct rb_node *p;
> > unsigned long vsize = 0;
> > + MA_STATE(mas, &mm->mm_mt, 0, 0);
> >
> > mmap_read_lock(mm);
> > - for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) {
> > - vma = rb_entry(p, struct vm_area_struct, vm_rb);
> > + rcu_read_lock();
> > + mas_for_each(&mas, vma, ULONG_MAX)
> > vsize += vma->vm_end - vma->vm_start;
> > - }
> > + rcu_read_unlock();
> > mmap_read_unlock(mm);
> > return vsize;
> > }
> > @@ -101,12 +100,12 @@ unsigned long task_statm(struct mm_struct *mm,
> > {
> > struct vm_area_struct *vma;
> > struct vm_region *region;
> > - struct rb_node *p;
> > unsigned long size = kobjsize(mm);
> > + MA_STATE(mas, &mm->mm_mt, 0, 0);
> >
> > mmap_read_lock(mm);
> > - for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) {
> > - vma = rb_entry(p, struct vm_area_struct, vm_rb);
> > + rcu_read_lock();
> > + mas_for_each(&mas, vma, ULONG_MAX) {
> > size += kobjsize(vma);
> > region = vma->vm_region;
> > if (region) {
> > @@ -119,6 +118,7 @@ unsigned long task_statm(struct mm_struct *mm,
> > >> PAGE_SHIFT;
> > *data = (PAGE_ALIGN(mm->start_stack) - (mm->start_data & PAGE_MASK))
> > >> PAGE_SHIFT;
> > + rcu_read_unlock();
> > mmap_read_unlock(mm);
> > size >>= PAGE_SHIFT;
> > size += *text + *data;
> > @@ -190,17 +190,20 @@ static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma)
> > */
> > static int show_map(struct seq_file *m, void *_p)
> > {
> > - struct rb_node *p = _p;
> > -
> > - return nommu_vma_show(m, rb_entry(p, struct vm_area_struct, vm_rb));
> > + return nommu_vma_show(m, _p);
> > }
> >
> > static void *m_start(struct seq_file *m, loff_t *pos)
> > {
> > struct proc_maps_private *priv = m->private;
> > struct mm_struct *mm;
> > - struct rb_node *p;
> > - loff_t n = *pos;
> > + struct vm_area_struct *vma;
> > + unsigned long addr = *pos;
> > + MA_STATE(mas, &priv->mm->mm_mt, addr, addr);
> > +
> > + /* See m_next(). Zero at the start or after lseek. */
> > + if (addr == -1UL)
> > + return NULL;
> >
> > /* pin the task and mm whilst we play with them */
> > priv->task = get_proc_task(priv->inode);
> > @@ -216,14 +219,12 @@ static void *m_start(struct seq_file *m, loff_t *pos)
> > return ERR_PTR(-EINTR);
> > }
> >
> > - /* start from the Nth VMA */
> > - for (p = rb_first(&mm->mm_rb); p; p = rb_next(p))
> > - if (n-- == 0)
> > - return p;
> > + /* start the next element from addr */
> > + vma = mas_find(&mas, ULONG_MAX);
> >
> > mmap_read_unlock(mm);
> > mmput(mm);
> > - return NULL;
> > + return vma;
> > }
> >
> > static void m_stop(struct seq_file *m, void *_vml)
> > @@ -242,10 +243,10 @@ static void m_stop(struct seq_file *m, void *_vml)
> >
> > static void *m_next(struct seq_file *m, void *_p, loff_t *pos)
> > {
> > - struct rb_node *p = _p;
> > + struct vm_area_struct *vma = _p;
> >
> > - (*pos)++;
> > - return p ? rb_next(p) : NULL;
> > + *pos = vma->vm_end;
> > + return vma->vm_next;
> > }
> >
> > static const struct seq_operations proc_pid_maps_ops = {
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 7f7dff6ad884..146976070fed 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2553,8 +2553,6 @@ extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
> > extern int split_vma(struct mm_struct *, struct vm_area_struct *,
> > unsigned long addr, int new_below);
> > extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
> > -extern void __vma_link_rb(struct mm_struct *, struct vm_area_struct *,
> > - struct rb_node **, struct rb_node *);
> > extern void unlink_file_vma(struct vm_area_struct *);
> > extern struct vm_area_struct *copy_vma(struct vm_area_struct **,
> > unsigned long addr, unsigned long len, pgoff_t pgoff,
> > @@ -2699,7 +2697,7 @@ static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * m
> > static inline
> > struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
> > {
> > - return find_vma_intersection(mm, addr, addr + 1);
> > + return mtree_load(&mm->mm_mt, addr);
> > }
> >
> > static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 51733fc44daf..41551bfa6ce0 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -311,19 +311,6 @@ struct vm_area_struct {
> >
> > /* linked list of VM areas per task, sorted by address */
> > struct vm_area_struct *vm_next, *vm_prev;
> > -
> > - struct rb_node vm_rb;
> > -
> > - /*
> > - * Largest free memory gap in bytes to the left of this VMA.
> > - * Either between this VMA and vma->vm_prev, or between one of the
> > - * VMAs below us in the VMA rbtree and its ->vm_prev. This helps
> > - * get_unmapped_area find a free area of the right size.
> > - */
> > - unsigned long rb_subtree_gap;
> > -
> > - /* Second cache line starts here. */
> > -
> > struct mm_struct *vm_mm; /* The address space we belong to. */
> >
> > /*
> > @@ -333,6 +320,12 @@ struct vm_area_struct {
> > pgprot_t vm_page_prot;
> > unsigned long vm_flags; /* Flags, see mm.h. */
> >
> > + /* Information about our backing store: */
> > + unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
> > + * units
> > + */
> > + /* Second cache line starts here. */
> > + struct file *vm_file; /* File we map to (can be NULL). */
> > /*
> > * For areas with an address space and backing store,
> > * linkage into the address_space->i_mmap interval tree.
> > @@ -351,16 +344,14 @@ struct vm_area_struct {
> > struct list_head anon_vma_chain; /* Serialized by mmap_lock &
> > * page_table_lock */
> > struct anon_vma *anon_vma; /* Serialized by page_table_lock */
> > + /* Third cache line starts here. */
> >
> > /* Function pointers to deal with this struct. */
> > const struct vm_operations_struct *vm_ops;
> >
> > - /* Information about our backing store: */
> > - unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
> > - units */
> > - struct file * vm_file; /* File we map to (can be NULL). */
> > void * vm_private_data; /* was vm_pte (shared mem) */
> >
> > +
> > #ifdef CONFIG_SWAP
> > atomic_long_t swap_readahead_info;
> > #endif
> > @@ -389,7 +380,6 @@ struct mm_struct {
> > struct {
> > struct vm_area_struct *mmap; /* list of VMAs */
> > struct maple_tree mm_mt;
> > - struct rb_root mm_rb;
> > u64 vmacache_seqnum; /* per-thread vmacache */
> > #ifdef CONFIG_MMU
> > unsigned long (*get_unmapped_area) (struct file *filp,
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 832416ff613e..83afd3007a2b 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -475,7 +475,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > struct mm_struct *oldmm)
> > {
> > struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
> > - struct rb_node **rb_link, *rb_parent;
> > int retval;
> > unsigned long charge = 0;
> > MA_STATE(old_mas, &oldmm->mm_mt, 0, 0);
> > @@ -502,8 +501,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > mm->exec_vm = oldmm->exec_vm;
> > mm->stack_vm = oldmm->stack_vm;
> >
> > - rb_link = &mm->mm_rb.rb_node;
> > - rb_parent = NULL;
> > pprev = &mm->mmap;
> > retval = ksm_fork(mm, oldmm);
> > if (retval)
> > @@ -597,10 +594,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > tmp->vm_prev = prev;
> > prev = tmp;
> >
> > - __vma_link_rb(mm, tmp, rb_link, rb_parent);
> > - rb_link = &tmp->vm_rb.rb_right;
> > - rb_parent = &tmp->vm_rb;
> > -
> > /* Link the vma into the MT */
> > mas.index = tmp->vm_start;
> > mas.last = tmp->vm_end - 1;
> > @@ -1033,7 +1026,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> > struct user_namespace *user_ns)
> > {
> > mm->mmap = NULL;
> > - mm->mm_rb = RB_ROOT;
> > mt_init_flags(&mm->mm_mt, MAPLE_ALLOC_RANGE);
> > mm->vmacache_seqnum = 0;
> > atomic_set(&mm->mm_users, 1);
> > diff --git a/mm/init-mm.c b/mm/init-mm.c
> > index 2014d4b82294..04bbe5172b72 100644
> > --- a/mm/init-mm.c
> > +++ b/mm/init-mm.c
> > @@ -1,6 +1,5 @@
> > // SPDX-License-Identifier: GPL-2.0
> > #include <linux/mm_types.h>
> > -#include <linux/rbtree.h>
> > #include <linux/maple_tree.h>
> > #include <linux/rwsem.h>
> > #include <linux/spinlock.h>
> > @@ -28,7 +27,6 @@
> > * and size this cpu_bitmask to NR_CPUS.
> > */
> > struct mm_struct init_mm = {
> > - .mm_rb = RB_ROOT,
> > .mm_mt = MTREE_INIT(mm_mt, MAPLE_ALLOC_RANGE),
> > .pgd = swapper_pg_dir,
> > .mm_users = ATOMIC_INIT(2),
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 1bd43f4db28e..7747047c4cbe 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -38,7 +38,6 @@
> > #include <linux/audit.h>
> > #include <linux/khugepaged.h>
> > #include <linux/uprobes.h>
> > -#include <linux/rbtree_augmented.h>
> > #include <linux/notifier.h>
> > #include <linux/memory.h>
> > #include <linux/printk.h>
> > @@ -290,93 +289,6 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
> > return origbrk;
> > }
> >
> > -static inline unsigned long vma_compute_gap(struct vm_area_struct *vma)
> > -{
> > - unsigned long gap, prev_end;
> > -
> > - /*
> > - * Note: in the rare case of a VM_GROWSDOWN above a VM_GROWSUP, we
> > - * allow two stack_guard_gaps between them here, and when choosing
> > - * an unmapped area; whereas when expanding we only require one.
> > - * That's a little inconsistent, but keeps the code here simpler.
> > - */
> > - gap = vm_start_gap(vma);
> > - if (vma->vm_prev) {
> > - prev_end = vm_end_gap(vma->vm_prev);
> > - if (gap > prev_end)
> > - gap -= prev_end;
> > - else
> > - gap = 0;
> > - }
> > - return gap;
> > -}
> > -
> > -#ifdef CONFIG_DEBUG_VM_RB
> > -static unsigned long vma_compute_subtree_gap(struct vm_area_struct *vma)
> > -{
> > - unsigned long max = vma_compute_gap(vma), subtree_gap;
> > - if (vma->vm_rb.rb_left) {
> > - subtree_gap = rb_entry(vma->vm_rb.rb_left,
> > - struct vm_area_struct, vm_rb)->rb_subtree_gap;
> > - if (subtree_gap > max)
> > - max = subtree_gap;
> > - }
> > - if (vma->vm_rb.rb_right) {
> > - subtree_gap = rb_entry(vma->vm_rb.rb_right,
> > - struct vm_area_struct, vm_rb)->rb_subtree_gap;
> > - if (subtree_gap > max)
> > - max = subtree_gap;
> > - }
> > - return max;
> > -}
> > -
> > -static int browse_rb(struct mm_struct *mm)
> > -{
> > - struct rb_root *root = &mm->mm_rb;
> > - int i = 0, j, bug = 0;
> > - struct rb_node *nd, *pn = NULL;
> > - unsigned long prev = 0, pend = 0;
> > -
> > - for (nd = rb_first(root); nd; nd = rb_next(nd)) {
> > - struct vm_area_struct *vma;
> > - vma = rb_entry(nd, struct vm_area_struct, vm_rb);
> > - if (vma->vm_start < prev) {
> > - pr_emerg("vm_start %lx < prev %lx\n",
> > - vma->vm_start, prev);
> > - bug = 1;
> > - }
> > - if (vma->vm_start < pend) {
> > - pr_emerg("vm_start %lx < pend %lx\n",
> > - vma->vm_start, pend);
> > - bug = 1;
> > - }
> > - if (vma->vm_start > vma->vm_end) {
> > - pr_emerg("vm_start %lx > vm_end %lx\n",
> > - vma->vm_start, vma->vm_end);
> > - bug = 1;
> > - }
> > - spin_lock(&mm->page_table_lock);
> > - if (vma->rb_subtree_gap != vma_compute_subtree_gap(vma)) {
> > - pr_emerg("free gap %lx, correct %lx\n",
> > - vma->rb_subtree_gap,
> > - vma_compute_subtree_gap(vma));
> > - bug = 1;
> > - }
> > - spin_unlock(&mm->page_table_lock);
> > - i++;
> > - pn = nd;
> > - prev = vma->vm_start;
> > - pend = vma->vm_end;
> > - }
> > - j = 0;
> > - for (nd = pn; nd; nd = rb_prev(nd))
> > - j++;
> > - if (i != j) {
> > - pr_emerg("backwards %d, forwards %d\n", j, i);
> > - bug = 1;
> > - }
> > - return bug ? -1 : i;
> > -}
> > #if defined(CONFIG_DEBUG_MAPLE_TREE)
> > extern void mt_validate(struct maple_tree *mt);
> > extern void mt_dump(const struct maple_tree *mt);
> > @@ -405,17 +317,25 @@ static void validate_mm_mt(struct mm_struct *mm)
> > dump_stack();
> > #ifdef CONFIG_DEBUG_VM
> > dump_vma(vma_mt);
> > - pr_emerg("and next in rb\n");
> > + pr_emerg("and vm_next\n");
> > dump_vma(vma->vm_next);
> > -#endif
> > +#endif // CONFIG_DEBUG_VM
> > pr_emerg("mt piv: %px %lu - %lu\n", vma_mt,
> > mas.index, mas.last);
> > pr_emerg("mt vma: %px %lu - %lu\n", vma_mt,
> > vma_mt->vm_start, vma_mt->vm_end);
> > - pr_emerg("rb vma: %px %lu - %lu\n", vma,
> > + if (vma->vm_prev) {
> > + pr_emerg("ll prev: %px %lu - %lu\n",
> > + vma->vm_prev, vma->vm_prev->vm_start,
> > + vma->vm_prev->vm_end);
> > + }
> > + pr_emerg("ll vma: %px %lu - %lu\n", vma,
> > vma->vm_start, vma->vm_end);
> > - pr_emerg("rb->next = %px %lu - %lu\n", vma->vm_next,
> > - vma->vm_next->vm_start, vma->vm_next->vm_end);
> > + if (vma->vm_next) {
> > + pr_emerg("ll next: %px %lu - %lu\n",
> > + vma->vm_next, vma->vm_next->vm_start,
> > + vma->vm_next->vm_end);
> > + }
> >
> > mt_dump(mas.tree);
> > if (vma_mt->vm_end != mas.last + 1) {
> > @@ -441,21 +361,6 @@ static void validate_mm_mt(struct mm_struct *mm)
> > rcu_read_unlock();
> > mt_validate(&mm->mm_mt);
> > }
> > -#else
> > -#define validate_mm_mt(root) do { } while (0)
> > -#endif
> > -static void validate_mm_rb(struct rb_root *root, struct vm_area_struct *ignore)
> > -{
> > - struct rb_node *nd;
> > -
> > - for (nd = rb_first(root); nd; nd = rb_next(nd)) {
> > - struct vm_area_struct *vma;
> > - vma = rb_entry(nd, struct vm_area_struct, vm_rb);
> > - VM_BUG_ON_VMA(vma != ignore &&
> > - vma->rb_subtree_gap != vma_compute_subtree_gap(vma),
> > - vma);
> > - }
> > -}
> >
> > static void validate_mm(struct mm_struct *mm)
> > {
> > @@ -464,6 +369,8 @@ static void validate_mm(struct mm_struct *mm)
> > unsigned long highest_address = 0;
> > struct vm_area_struct *vma = mm->mmap;
> >
> > + validate_mm_mt(mm);
> > +
> > while (vma) {
> > struct anon_vma *anon_vma = vma->anon_vma;
> > struct anon_vma_chain *avc;
> > @@ -488,80 +395,13 @@ static void validate_mm(struct mm_struct *mm)
> > mm->highest_vm_end, highest_address);
> > bug = 1;
> > }
> > - i = browse_rb(mm);
> > - if (i != mm->map_count) {
> > - if (i != -1)
> > - pr_emerg("map_count %d rb %d\n", mm->map_count, i);
> > - bug = 1;
> > - }
> > VM_BUG_ON_MM(bug, mm);
> > }
> > -#else
> > -#define validate_mm_rb(root, ignore) do { } while (0)
> > +
> > +#else // !CONFIG_DEBUG_MAPLE_TREE
> > #define validate_mm_mt(root) do { } while (0)
> > #define validate_mm(mm) do { } while (0)
> > -#endif
> > -
> > -RB_DECLARE_CALLBACKS_MAX(static, vma_gap_callbacks,
> > - struct vm_area_struct, vm_rb,
> > - unsigned long, rb_subtree_gap, vma_compute_gap)
> > -
> > -/*
> > - * Update augmented rbtree rb_subtree_gap values after vma->vm_start or
> > - * vma->vm_prev->vm_end values changed, without modifying the vma's position
> > - * in the rbtree.
> > - */
> > -static void vma_gap_update(struct vm_area_struct *vma)
> > -{
> > - /*
> > - * As it turns out, RB_DECLARE_CALLBACKS_MAX() already created
> > - * a callback function that does exactly what we want.
> > - */
> > - vma_gap_callbacks_propagate(&vma->vm_rb, NULL);
> > -}
> > -
> > -static inline void vma_rb_insert(struct vm_area_struct *vma,
> > - struct rb_root *root)
> > -{
> > - /* All rb_subtree_gap values must be consistent prior to insertion */
> > - validate_mm_rb(root, NULL);
> > -
> > - rb_insert_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
> > -}
> > -
> > -static void __vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root)
> > -{
> > - /*
> > - * Note rb_erase_augmented is a fairly large inline function,
> > - * so make sure we instantiate it only once with our desired
> > - * augmented rbtree callbacks.
> > - */
> > - rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
> > -}
> > -
> > -static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
> > - struct rb_root *root,
> > - struct vm_area_struct *ignore)
> > -{
> > - /*
> > - * All rb_subtree_gap values must be consistent prior to erase,
> > - * with the possible exception of
> > - *
> > - * a. the "next" vma being erased if next->vm_start was reduced in
> > - * __vma_adjust() -> __vma_unlink()
> > - * b. the vma being erased in detach_vmas_to_be_unmapped() ->
> > - * vma_rb_erase()
> > - */
> > - validate_mm_rb(root, ignore);
> > -
> > - __vma_rb_erase(vma, root);
> > -}
> > -
> > -static __always_inline void vma_rb_erase(struct vm_area_struct *vma,
> > - struct rb_root *root)
> > -{
> > - vma_rb_erase_ignore(vma, root, vma);
> > -}
> > +#endif // CONFIG_DEBUG_MAPLE_TREE
> >
> > /*
> > * vma has some anon_vma assigned, and is already inserted on that
> > @@ -595,38 +435,26 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
> > anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
> > }
> >
> > -static int find_vma_links(struct mm_struct *mm, unsigned long addr,
> > - unsigned long end, struct vm_area_struct **pprev,
> > - struct rb_node ***rb_link, struct rb_node **rb_parent)
> > +/* Private
> > + * range_has_overlap() - Check the @start - @end range for overlapping VMAs and
> > + * sets up a pointer to the previous VMA
> > + *
> > + * @mm - the mm struct
> > + * @start - the start address of the range
> > + * @end - the end address of the range
> > + * @pprev - the pointer to the pointer of the previous VMA
> > + *
> > + * Returns: True if there is an overlapping VMA, false otherwise
> > + */
> > +static bool range_has_overlap(struct mm_struct *mm, unsigned long start,
> > + unsigned long end, struct vm_area_struct **pprev)
> > {
> > - struct rb_node **__rb_link, *__rb_parent, *rb_prev;
> > -
> > - __rb_link = &mm->mm_rb.rb_node;
> > - rb_prev = __rb_parent = NULL;
> > + struct vm_area_struct *existing;
> >
> > - while (*__rb_link) {
> > - struct vm_area_struct *vma_tmp;
> > -
> > - __rb_parent = *__rb_link;
> > - vma_tmp = rb_entry(__rb_parent, struct vm_area_struct, vm_rb);
> > -
> > - if (vma_tmp->vm_end > addr) {
> > - /* Fail if an existing vma overlaps the area */
> > - if (vma_tmp->vm_start < end)
> > - return -ENOMEM;
> > - __rb_link = &__rb_parent->rb_left;
> > - } else {
> > - rb_prev = __rb_parent;
> > - __rb_link = &__rb_parent->rb_right;
> > - }
> > - }
> > -
> > - *pprev = NULL;
> > - if (rb_prev)
> > - *pprev = rb_entry(rb_prev, struct vm_area_struct, vm_rb);
> > - *rb_link = __rb_link;
> > - *rb_parent = __rb_parent;
> > - return 0;
> > + MA_STATE(mas, &mm->mm_mt, start, start);
> > + existing = mas_find(&mas, end - 1);
> > + *pprev = mas_prev(&mas, 0);
> > + return existing ? true : false;
> > }
> >
> > /*
> > @@ -653,8 +481,6 @@ static inline struct vm_area_struct *vma_next(struct mm_struct *mm,
> > * @start: The start of the range.
> > * @len: The length of the range.
> > * @pprev: pointer to the pointer that will be set to previous vm_area_struct
> > - * @rb_link: the rb_node
> > - * @rb_parent: the parent rb_node
> > *
> > * Find all the vm_area_struct that overlap from @start to
> > * @end and munmap them. Set @pprev to the previous vm_area_struct.
> > @@ -663,76 +489,41 @@ static inline struct vm_area_struct *vma_next(struct mm_struct *mm,
> > */
> > static inline int
> > munmap_vma_range(struct mm_struct *mm, unsigned long start, unsigned long len,
> > - struct vm_area_struct **pprev, struct rb_node ***link,
> > - struct rb_node **parent, struct list_head *uf)
> > + struct vm_area_struct **pprev, struct list_head *uf)
> > {
> > -
> > - while (find_vma_links(mm, start, start + len, pprev, link, parent))
> > + // Needs optimization.
> > + while (range_has_overlap(mm, start, start + len, pprev))
> > if (do_munmap(mm, start, len, uf))
> > return -ENOMEM;
> > -
> > return 0;
> > }
> > static unsigned long count_vma_pages_range(struct mm_struct *mm,
> > unsigned long addr, unsigned long end)
> > {
> > unsigned long nr_pages = 0;
> > - unsigned long nr_mt_pages = 0;
> > struct vm_area_struct *vma;
> > + unsigned long vm_start, vm_end;
> > + MA_STATE(mas, &mm->mm_mt, addr, addr);
> >
> > - /* Find first overlapping mapping */
> > - vma = find_vma_intersection(mm, addr, end);
> > + /* Find first overlaping mapping */
>
> nit: I think the original comment was correct.

Yes. Not sure how this happened.

>
> > + vma = mas_find(&mas, end - 1);
> > if (!vma)
> > return 0;
> >
> > - nr_pages = (min(end, vma->vm_end) -
> > - max(addr, vma->vm_start)) >> PAGE_SHIFT;
> > + vm_start = vma->vm_start;
> > + vm_end = vma->vm_end;
> > + nr_pages = (min(end, vm_end) - max(addr, vm_start)) >> PAGE_SHIFT;
> >
> > /* Iterate over the rest of the overlaps */
> > - for (vma = vma->vm_next; vma; vma = vma->vm_next) {
> > - unsigned long overlap_len;
> > -
> > - if (vma->vm_start > end)
> > - break;
> > -
> > - overlap_len = min(end, vma->vm_end) - vma->vm_start;
> > - nr_pages += overlap_len >> PAGE_SHIFT;
> > + mas_for_each(&mas, vma, end) {
> > + vm_start = vma->vm_start;
> > + vm_end = vma->vm_end;
> > + nr_pages += (min(end, vm_end) - vm_start) >> PAGE_SHIFT;
> > }
> >
> > - mt_for_each(&mm->mm_mt, vma, addr, end) {
> > - nr_mt_pages +=
> > - (min(end, vma->vm_end) - vma->vm_start) >> PAGE_SHIFT;
> > - }
> > -
> > - VM_BUG_ON_MM(nr_pages != nr_mt_pages, mm);
> > -
> > return nr_pages;
> > }
> >
> > -void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
> > - struct rb_node **rb_link, struct rb_node *rb_parent)
> > -{
> > - /* Update tracking information for the gap following the new vma. */
> > - if (vma->vm_next)
> > - vma_gap_update(vma->vm_next);
> > - else
> > - mm->highest_vm_end = vm_end_gap(vma);
> > -
> > - /*
> > - * vma->vm_prev wasn't known when we followed the rbtree to find the
> > - * correct insertion point for that vma. As a result, we could not
> > - * update the vma vm_rb parents rb_subtree_gap values on the way down.
> > - * So, we first insert the vma with a zero rb_subtree_gap value
> > - * (to be consistent with what we did on the way down), and then
> > - * immediately update the gap to the correct value. Finally we
> > - * rebalance the rbtree after all augmented values have been set.
> > - */
> > - rb_link_node(&vma->vm_rb, rb_parent, rb_link);
> > - vma->rb_subtree_gap = 0;
> > - vma_gap_update(vma);
> > - vma_rb_insert(vma, &mm->mm_rb);
> > -}
> > -
> > static void __vma_link_file(struct vm_area_struct *vma)
> > {
> > struct file *file;
> > @@ -780,19 +571,8 @@ static inline void vma_mt_store(struct mm_struct *mm, struct vm_area_struct *vma
> > GFP_KERNEL);
> > }
> >
> > -static void
> > -__vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
> > - struct vm_area_struct *prev, struct rb_node **rb_link,
> > - struct rb_node *rb_parent)
> > -{
> > - vma_mt_store(mm, vma);
> > - __vma_link_list(mm, vma, prev);
> > - __vma_link_rb(mm, vma, rb_link, rb_parent);
> > -}
> > -
> > static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
> > - struct vm_area_struct *prev, struct rb_node **rb_link,
> > - struct rb_node *rb_parent)
> > + struct vm_area_struct *prev)
> > {
> > struct address_space *mapping = NULL;
> >
> > @@ -801,7 +581,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
> > i_mmap_lock_write(mapping);
> > }
> >
> > - __vma_link(mm, vma, prev, rb_link, rb_parent);
> > + vma_mt_store(mm, vma);
> > + __vma_link_list(mm, vma, prev);
> > __vma_link_file(vma);
> >
> > if (mapping)
> > @@ -813,30 +594,18 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
> >
> > /*
> > * Helper for vma_adjust() in the split_vma insert case: insert a vma into the
> > - * mm's list and rbtree. It has already been inserted into the interval tree.
> > + * mm's list and the mm tree. It has already been inserted into the interval tree.
> > */
> > static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
> > {
> > struct vm_area_struct *prev;
> > - struct rb_node **rb_link, *rb_parent;
> >
> > - if (find_vma_links(mm, vma->vm_start, vma->vm_end,
> > - &prev, &rb_link, &rb_parent))
> > - BUG();
> > - __vma_link(mm, vma, prev, rb_link, rb_parent);
> > + BUG_ON(range_has_overlap(mm, vma->vm_start, vma->vm_end, &prev));
> > + vma_mt_store(mm, vma);
> > + __vma_link_list(mm, vma, prev);
> > mm->map_count++;
> > }
> >
> > -static __always_inline void __vma_unlink(struct mm_struct *mm,
> > - struct vm_area_struct *vma,
> > - struct vm_area_struct *ignore)
> > -{
> > - vma_rb_erase_ignore(vma, &mm->mm_rb, ignore);
> > - __vma_unlink_list(mm, vma);
> > - /* Kill the cache */
> > - vmacache_invalidate(mm);
> > -}
> > -
> > /*
> > * We cannot adjust vm_start, vm_end, vm_pgoff fields of a vma that
> > * is already present in an i_mmap tree without adjusting the tree.
> > @@ -854,13 +623,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> > struct rb_root_cached *root = NULL;
> > struct anon_vma *anon_vma = NULL;
> > struct file *file = vma->vm_file;
> > - bool start_changed = false, end_changed = false;
> > + bool vma_changed = false;
> > long adjust_next = 0;
> > int remove_next = 0;
> >
> > - validate_mm(mm);
> > - validate_mm_mt(mm);
> > -
> > if (next && !insert) {
> > struct vm_area_struct *exporter = NULL, *importer = NULL;
> >
> > @@ -986,21 +752,23 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> > }
> >
> > if (start != vma->vm_start) {
> > - unsigned long old_start = vma->vm_start;
> > + if (vma->vm_start < start)
> > + vma_mt_szero(mm, vma->vm_start, start);
> > + else
> > + vma_changed = true;
> > vma->vm_start = start;
> > - if (old_start < start)
> > - vma_mt_szero(mm, old_start, start);
> > - start_changed = true;
> > }
> > if (end != vma->vm_end) {
> > - unsigned long old_end = vma->vm_end;
> > + if (vma->vm_end > end)
> > + vma_mt_szero(mm, end, vma->vm_end);
> > + else
> > + vma_changed = true;
> > vma->vm_end = end;
> > - if (old_end > end)
> > - vma_mt_szero(mm, end, old_end);
> > - end_changed = true;
> > + if (!next)
> > + mm->highest_vm_end = vm_end_gap(vma);
> > }
> >
> > - if (end_changed || start_changed)
> > + if (vma_changed)
> > vma_mt_store(mm, vma);
> >
> > vma->vm_pgoff = pgoff;
> > @@ -1018,25 +786,9 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> > }
> >
> > if (remove_next) {
> > - /*
> > - * vma_merge has merged next into vma, and needs
> > - * us to remove next before dropping the locks.
> > - * Since we have expanded over this vma, the maple tree will
> > - * have overwritten by storing the value
> > - */
> > - if (remove_next != 3)
> > - __vma_unlink(mm, next, next);
> > - else
> > - /*
> > - * vma is not before next if they've been
> > - * swapped.
> > - *
> > - * pre-swap() next->vm_start was reduced so
> > - * tell validate_mm_rb to ignore pre-swap()
> > - * "next" (which is stored in post-swap()
> > - * "vma").
> > - */
> > - __vma_unlink(mm, next, vma);
> > + __vma_unlink_list(mm, next);
> > + /* Kill the cache */
> > + vmacache_invalidate(mm);
> > if (file)
> > __remove_shared_vm_struct(next, file, mapping);
> > } else if (insert) {
> > @@ -1046,15 +798,6 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> > * (it may either follow vma or precede it).
> > */
> > __insert_vm_struct(mm, insert);
> > - } else {
> > - if (start_changed)
> > - vma_gap_update(vma);
> > - if (end_changed) {
> > - if (!next)
> > - mm->highest_vm_end = vm_end_gap(vma);
> > - else if (!adjust_next)
> > - vma_gap_update(next);
> > - }
> > }
> >
> > if (anon_vma) {
> > @@ -1112,10 +855,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> > remove_next = 1;
> > end = next->vm_end;
> > goto again;
> > - }
> > - else if (next)
> > - vma_gap_update(next);
> > - else {
> > + } else if (!next) {
> > /*
> > * If remove_next == 2 we obviously can't
> > * reach this path.
> > @@ -1142,8 +882,6 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> > uprobe_mmap(insert);
> >
> > validate_mm(mm);
> > - validate_mm_mt(mm);
> > -
> > return 0;
> > }
> >
> > @@ -1290,7 +1028,6 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> > struct vm_area_struct *area, *next;
> > int err;
> >
> > - validate_mm_mt(mm);
> > /*
> > * We later require that vma->vm_flags == vm_flags,
> > * so this tests vma->vm_flags & VM_SPECIAL, too.
> > @@ -1366,7 +1103,6 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> > khugepaged_enter_vma_merge(area, vm_flags);
> > return area;
> > }
> > - validate_mm_mt(mm);
> >
> > return NULL;
> > }
> > @@ -1536,6 +1272,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> > vm_flags_t vm_flags;
> > int pkey = 0;
> >
> > + validate_mm(mm);
> > *populate = 0;
> >
> > if (!len)
> > @@ -1856,10 +1593,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > struct mm_struct *mm = current->mm;
> > struct vm_area_struct *vma, *prev, *merge;
> > int error;
> > - struct rb_node **rb_link, *rb_parent;
> > unsigned long charged = 0;
> >
> > - validate_mm_mt(mm);
> > /* Check against address space limit. */
> > if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
> > unsigned long nr_pages;
> > @@ -1875,8 +1610,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > return -ENOMEM;
> > }
> >
> > - /* Clear old maps, set up prev, rb_link, rb_parent, and uf */
> > - if (munmap_vma_range(mm, addr, len, &prev, &rb_link, &rb_parent, uf))
> > + /* Clear old maps, set up prev and uf */
> > + if (munmap_vma_range(mm, addr, len, &prev, uf))
> > return -ENOMEM;
> > /*
> > * Private writable mapping: check memory availability
> > @@ -1984,7 +1719,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > goto free_vma;
> > }
> >
> > - vma_link(mm, vma, prev, rb_link, rb_parent);
> > + vma_link(mm, vma, prev);
> > /* Once vma denies write, undo our temporary denial count */
> > if (file) {
> > unmap_writable:
> > @@ -2021,7 +1756,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >
> > vma_set_page_prot(vma);
> >
> > - validate_mm_mt(mm);
> > return addr;
> >
> > unmap_and_free_vma:
> > @@ -2041,7 +1775,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > unacct_error:
> > if (charged)
> > vm_unacct_memory(charged);
> > - validate_mm_mt(mm);
> > return error;
> > }
> >
> > @@ -2324,9 +2057,6 @@ find_vma_prev(struct mm_struct *mm, unsigned long addr,
> >
> > rcu_read_lock();
> > vma = mas_find(&mas, ULONG_MAX);
> > - if (!vma)
> > - mas_reset(&mas);
> > -
> > *pprev = mas_prev(&mas, 0);
> > rcu_read_unlock();
> > return vma;
> > @@ -2390,7 +2120,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> > unsigned long gap_addr;
> > int error = 0;
> >
> > - validate_mm_mt(mm);
> > if (!(vma->vm_flags & VM_GROWSUP))
> > return -EFAULT;
> >
> > @@ -2437,15 +2166,13 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> > error = acct_stack_growth(vma, size, grow);
> > if (!error) {
> > /*
> > - * vma_gap_update() doesn't support concurrent
> > - * updates, but we only hold a shared mmap_lock
> > - * lock here, so we need to protect against
> > - * concurrent vma expansions.
> > - * anon_vma_lock_write() doesn't help here, as
> > - * we don't guarantee that all growable vmas
> > - * in a mm share the same root anon vma.
> > - * So, we reuse mm->page_table_lock to guard
> > - * against concurrent vma expansions.
> > + * We only hold a shared mmap_lock lock here, so
> > + * we need to protect against concurrent vma
> > + * expansions. anon_vma_lock_write() doesn't
> > + * help here, as we don't guarantee that all
> > + * growable vmas in a mm share the same root
> > + * anon vma. So, we reuse mm->page_table_lock
> > + * to guard against concurrent vma expansions.
> > */
> > spin_lock(&mm->page_table_lock);
> > if (vma->vm_flags & VM_LOCKED)
> > @@ -2453,10 +2180,9 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> > vm_stat_account(mm, vma->vm_flags, grow);
> > anon_vma_interval_tree_pre_update_vma(vma);
> > vma->vm_end = address;
> > + vma_mt_store(mm, vma);
> > anon_vma_interval_tree_post_update_vma(vma);
> > - if (vma->vm_next)
> > - vma_gap_update(vma->vm_next);
> > - else
> > + if (!vma->vm_next)
> > mm->highest_vm_end = vm_end_gap(vma);
> > spin_unlock(&mm->page_table_lock);
> >
> > @@ -2466,8 +2192,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> > }
> > anon_vma_unlock_write(vma->anon_vma);
> > khugepaged_enter_vma_merge(vma, vma->vm_flags);
> > - validate_mm(mm);
> > - validate_mm_mt(mm);
> > return error;
> > }
> > #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
> > @@ -2475,14 +2199,12 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> > /*
> > * vma is the first one with address < vma->vm_start. Have to extend vma.
> > */
> > -int expand_downwards(struct vm_area_struct *vma,
> > - unsigned long address)
> > +int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> > {
> > struct mm_struct *mm = vma->vm_mm;
> > struct vm_area_struct *prev;
> > int error = 0;
> >
> > - validate_mm(mm);
> > address &= PAGE_MASK;
> > if (address < mmap_min_addr)
> > return -EPERM;
> > @@ -2519,15 +2241,13 @@ int expand_downwards(struct vm_area_struct *vma,
> > error = acct_stack_growth(vma, size, grow);
> > if (!error) {
> > /*
> > - * vma_gap_update() doesn't support concurrent
> > - * updates, but we only hold a shared mmap_lock
> > - * lock here, so we need to protect against
> > - * concurrent vma expansions.
> > - * anon_vma_lock_write() doesn't help here, as
> > - * we don't guarantee that all growable vmas
> > - * in a mm share the same root anon vma.
> > - * So, we reuse mm->page_table_lock to guard
> > - * against concurrent vma expansions.
> > + * We only hold a shared mmap_lock lock here, so
> > + * we need to protect against concurrent vma
> > + * expansions. anon_vma_lock_write() doesn't
> > + * help here, as we don't guarantee that all
> > + * growable vmas in a mm share the same root
> > + * anon vma. So, we reuse mm->page_table_lock
> > + * to guard against concurrent vma expansions.
> > */
> > spin_lock(&mm->page_table_lock);
> > if (vma->vm_flags & VM_LOCKED)
> > @@ -2539,7 +2259,6 @@ int expand_downwards(struct vm_area_struct *vma,
> > /* Overwrite old entry in mtree. */
> > vma_mt_store(mm, vma);
> > anon_vma_interval_tree_post_update_vma(vma);
> > - vma_gap_update(vma);
> > spin_unlock(&mm->page_table_lock);
> >
> > perf_event_mmap(vma);
> > @@ -2548,7 +2267,6 @@ int expand_downwards(struct vm_area_struct *vma,
> > }
> > anon_vma_unlock_write(vma->anon_vma);
> > khugepaged_enter_vma_merge(vma, vma->vm_flags);
> > - validate_mm(mm);
> > return error;
> > }
> >
> > @@ -2681,16 +2399,14 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
> > vma->vm_prev = NULL;
> > vma_mt_szero(mm, vma->vm_start, end);
> > do {
> > - vma_rb_erase(vma, &mm->mm_rb);
> > mm->map_count--;
> > tail_vma = vma;
> > vma = vma->vm_next;
> > } while (vma && vma->vm_start < end);
> > *insertion_point = vma;
> > - if (vma) {
> > + if (vma)
> > vma->vm_prev = prev;
> > - vma_gap_update(vma);
> > - } else
> > + else
> > mm->highest_vm_end = prev ? vm_end_gap(prev) : 0;
> > tail_vma->vm_next = NULL;
> >
> > @@ -2821,11 +2537,7 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
> > if (len == 0)
> > return -EINVAL;
> >
> > - /*
> > - * arch_unmap() might do unmaps itself. It must be called
> > - * and finish any rbtree manipulation before this code
> > - * runs and also starts to manipulate the rbtree.
> > - */
> > + /* arch_unmap() might do unmaps itself. */
> > arch_unmap(mm, start, end);
> >
> > /* Find the first overlapping VMA */
> > @@ -2833,7 +2545,7 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
> > if (!vma)
> > return 0;
> > prev = vma->vm_prev;
> > - /* we have start < vma->vm_end */
> > + /* we have start < vma->vm_end */
> >
> > /* if it doesn't overlap, we have nothing.. */
> > if (vma->vm_start >= end)
> > @@ -2893,7 +2605,7 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
> > if (mm->locked_vm)
> > unlock_range(vma, end);
> >
> > - /* Detach vmas from rbtree */
> > + /* Detach vmas from the MM linked list and remove from the mm tree */
> > if (!detach_vmas_to_be_unmapped(mm, vma, prev, end))
> > downgrade = false;
> >
> > @@ -3041,11 +2753,11 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
> > * anonymous maps. eventually we may be able to do some
> > * brk-specific accounting here.
> > */
> > -static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long flags, struct list_head *uf)
> > +static int do_brk_flags(unsigned long addr, unsigned long len,
> > + unsigned long flags, struct list_head *uf)
> > {
> > struct mm_struct *mm = current->mm;
> > struct vm_area_struct *vma, *prev;
> > - struct rb_node **rb_link, *rb_parent;
> > pgoff_t pgoff = addr >> PAGE_SHIFT;
> > int error;
> > unsigned long mapped_addr;
> > @@ -3064,8 +2776,8 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
> > if (error)
> > return error;
> >
> > - /* Clear old maps, set up prev, rb_link, rb_parent, and uf */
> > - if (munmap_vma_range(mm, addr, len, &prev, &rb_link, &rb_parent, uf))
> > + /* Clear old maps, set up prev and uf */
> > + if (munmap_vma_range(mm, addr, len, &prev, uf))
> > return -ENOMEM;
> >
> > /* Check against address space limits *after* clearing old maps... */
> > @@ -3099,7 +2811,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
> > vma->vm_pgoff = pgoff;
> > vma->vm_flags = flags;
> > vma->vm_page_prot = vm_get_page_prot(flags);
> > - vma_link(mm, vma, prev, rb_link, rb_parent);
> > + vma_link(mm, vma, prev);
> > out:
> > perf_event_mmap(vma);
> > mm->total_vm += len >> PAGE_SHIFT;
> > @@ -3219,26 +2931,10 @@ void exit_mmap(struct mm_struct *mm)
> > int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
> > {
> > struct vm_area_struct *prev;
> > - struct rb_node **rb_link, *rb_parent;
> > - unsigned long start = vma->vm_start;
> > - struct vm_area_struct *overlap = NULL;
> >
> > - if (find_vma_links(mm, vma->vm_start, vma->vm_end,
> > - &prev, &rb_link, &rb_parent))
> > + if (range_has_overlap(mm, vma->vm_start, vma->vm_end, &prev))
> > return -ENOMEM;
> >
> > - overlap = mt_find(&mm->mm_mt, &start, vma->vm_end - 1);
> > - if (overlap) {
> > -
> > - pr_err("Found vma ending at %lu\n", start - 1);
> > - pr_err("vma : %lu => %lu-%lu\n", (unsigned long)overlap,
> > - overlap->vm_start, overlap->vm_end - 1);
> > -#if defined(CONFIG_DEBUG_MAPLE_TREE)
> > - mt_dump(&mm->mm_mt);
> > -#endif
> > - BUG();
> > - }
> > -
> > if ((vma->vm_flags & VM_ACCOUNT) &&
> > security_vm_enough_memory_mm(mm, vma_pages(vma)))
> > return -ENOMEM;
> > @@ -3260,7 +2956,7 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
> > vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT;
> > }
> >
> > - vma_link(mm, vma, prev, rb_link, rb_parent);
> > + vma_link(mm, vma, prev);
> > return 0;
> > }
> >
> > @@ -3276,9 +2972,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> > unsigned long vma_start = vma->vm_start;
> > struct mm_struct *mm = vma->vm_mm;
> > struct vm_area_struct *new_vma, *prev;
> > - struct rb_node **rb_link, *rb_parent;
> > bool faulted_in_anon_vma = true;
> > - unsigned long index = addr;
> >
> > validate_mm_mt(mm);
> > /*
> > @@ -3290,10 +2984,9 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> > faulted_in_anon_vma = false;
> > }
> >
> > - if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
> > + if (range_has_overlap(mm, addr, addr + len, &prev))
> > return NULL; /* should never get here */
> > - if (mt_find(&mm->mm_mt, &index, addr+len - 1))
> > - BUG();
> > +
> > new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
> > vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
> > vma->vm_userfaultfd_ctx);
> > @@ -3334,7 +3027,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> > get_file(new_vma->vm_file);
> > if (new_vma->vm_ops && new_vma->vm_ops->open)
> > new_vma->vm_ops->open(new_vma);
> > - vma_link(mm, new_vma, prev, rb_link, rb_parent);
> > + vma_link(mm, new_vma, prev);
> > *need_rmap_locks = false;
> > }
> > validate_mm_mt(mm);
> > diff --git a/mm/nommu.c b/mm/nommu.c
> > index 8848cf7cb7c1..c410f99203fb 100644
> > --- a/mm/nommu.c
> > +++ b/mm/nommu.c
> > @@ -566,13 +566,14 @@ static void put_nommu_region(struct vm_region *region)
> > */
> > static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)
> > {
> > - struct vm_area_struct *pvma, *prev;
> > struct address_space *mapping;
> > - struct rb_node **p, *parent, *rb_prev;
> > + struct vm_area_struct *prev;
> > + MA_STATE(mas, &mm->mm_mt, vma->vm_start, vma->vm_end);
> >
> > BUG_ON(!vma->vm_region);
> >
> > mm->map_count++;
> > + printk("mm at %u\n", mm->map_count);
>
> I think this was added while debugging and should be removed?

Yes, thank you.
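
Will drop that leftover printk() in the next revision; a minimal sketch of
what the start of that hunk looks like once the debugging line is gone
(nothing else in add_vma_to_mm() changes):

	mm->map_count++;
	vma->vm_mm = mm;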

>
> > vma->vm_mm = mm;
> >
> > /* add the VMA to the mapping */
> > @@ -586,42 +587,12 @@ static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)
> > i_mmap_unlock_write(mapping);
> > }
> >
> > + rcu_read_lock();
> > + prev = mas_prev(&mas, 0);
> > + rcu_read_unlock();
> > + mas_reset(&mas);
> > /* add the VMA to the tree */
> > - parent = rb_prev = NULL;
> > - p = &mm->mm_rb.rb_node;
> > - while (*p) {
> > - parent = *p;
> > - pvma = rb_entry(parent, struct vm_area_struct, vm_rb);
> > -
> > - /* sort by: start addr, end addr, VMA struct addr in that order
> > - * (the latter is necessary as we may get identical VMAs) */
> > - if (vma->vm_start < pvma->vm_start)
> > - p = &(*p)->rb_left;
> > - else if (vma->vm_start > pvma->vm_start) {
> > - rb_prev = parent;
> > - p = &(*p)->rb_right;
> > - } else if (vma->vm_end < pvma->vm_end)
> > - p = &(*p)->rb_left;
> > - else if (vma->vm_end > pvma->vm_end) {
> > - rb_prev = parent;
> > - p = &(*p)->rb_right;
> > - } else if (vma < pvma)
> > - p = &(*p)->rb_left;
> > - else if (vma > pvma) {
> > - rb_prev = parent;
> > - p = &(*p)->rb_right;
> > - } else
> > - BUG();
> > - }
> > -
> > - rb_link_node(&vma->vm_rb, parent, p);
> > - rb_insert_color(&vma->vm_rb, &mm->mm_rb);
> > -
> > - /* add VMA to the VMA list also */
> > - prev = NULL;
> > - if (rb_prev)
> > - prev = rb_entry(rb_prev, struct vm_area_struct, vm_rb);
> > -
> > + vma_mas_store(vma, &mas);
> > __vma_link_list(mm, vma, prev);
> > }
> >
> > @@ -634,6 +605,7 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
> > struct address_space *mapping;
> > struct mm_struct *mm = vma->vm_mm;
> > struct task_struct *curr = current;
> > + MA_STATE(mas, &vma->vm_mm->mm_mt, 0, 0);
> >
> > mm->map_count--;
> > for (i = 0; i < VMACACHE_SIZE; i++) {
> > @@ -643,7 +615,6 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
> > break;
> > }
> > }
> > -
> > /* remove the VMA from the mapping */
> > if (vma->vm_file) {
> > mapping = vma->vm_file->f_mapping;
> > @@ -656,8 +627,7 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
> > }
> >
> > /* remove from the MM's tree and list */
> > - rb_erase(&vma->vm_rb, &mm->mm_rb);
> > -
> > + vma_mas_remove(vma, &mas);
> > __vma_unlink_list(mm, vma);
> > }
> >
> > @@ -681,24 +651,21 @@ static void delete_vma(struct mm_struct *mm, struct vm_area_struct *vma)
> > struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> > {
> > struct vm_area_struct *vma;
> > + MA_STATE(mas, &mm->mm_mt, addr, addr);
> >
> > /* check the cache first */
> > vma = vmacache_find(mm, addr);
> > if (likely(vma))
> > return vma;
> >
> > - /* trawl the list (there may be multiple mappings in which addr
> > - * resides) */
> > - for (vma = mm->mmap; vma; vma = vma->vm_next) {
> > - if (vma->vm_start > addr)
> > - return NULL;
> > - if (vma->vm_end > addr) {
> > - vmacache_update(addr, vma);
> > - return vma;
> > - }
> > - }
> > + rcu_read_lock();
> > + vma = mas_walk(&mas);
> > + rcu_read_unlock();
> >
> > - return NULL;
> > + if (vma)
> > + vmacache_update(addr, vma);
> > +
> > + return vma;
> > }
> > EXPORT_SYMBOL(find_vma);
> >
> > @@ -730,26 +697,25 @@ static struct vm_area_struct *find_vma_exact(struct mm_struct *mm,
> > {
> > struct vm_area_struct *vma;
> > unsigned long end = addr + len;
> > + MA_STATE(mas, &mm->mm_mt, addr, addr);
> >
> > /* check the cache first */
> > vma = vmacache_find_exact(mm, addr, end);
> > if (vma)
> > return vma;
> >
> > - /* trawl the list (there may be multiple mappings in which addr
> > - * resides) */
> > - for (vma = mm->mmap; vma; vma = vma->vm_next) {
> > - if (vma->vm_start < addr)
> > - continue;
> > - if (vma->vm_start > addr)
> > - return NULL;
> > - if (vma->vm_end == end) {
> > - vmacache_update(addr, vma);
> > - return vma;
> > - }
> > - }
> > -
> > - return NULL;
> > + rcu_read_lock();
> > + vma = mas_walk(&mas);
> > + rcu_read_unlock();
> > + if (!vma)
> > + return NULL;
> > + if (vma->vm_start != addr)
> > + return NULL;
> > + if (vma->vm_end != end)
> > + return NULL;
> > +
> > + vmacache_update(addr, vma);
> > + return vma;
> > }
> >
> > /*
> > diff --git a/mm/util.c b/mm/util.c
> > index 0b6dd9d81da7..35deaa0ccac5 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -287,6 +287,8 @@ void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
> > vma->vm_next = next;
> > if (next)
> > next->vm_prev = vma;
> > + else
> > + mm->highest_vm_end = vm_end_gap(vma);
> > }
> >
> > void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma)
> > @@ -301,6 +303,12 @@ void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma)
> > mm->mmap = next;
> > if (next)
> > next->vm_prev = prev;
> > + else {
> > + if (prev)
> > + mm->highest_vm_end = vm_end_gap(prev);
> > + else
> > + mm->highest_vm_end = 0;
> > + }
> > }
> >
> > /* Check if the vma is being used as a stack by this task */
> > --
> > 2.30.2