2022-02-15 15:27:36

by Liam R. Howlett

Subject: [PATCH v6 00/71] Introducing the Maple Tree

The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently. There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface. The first user that is covered in this
patch set is the vm_area_struct, where three data structures are
replaced by the maple tree: the augmented rbtree, the vma cache, and the
linked list of VMAs in the mm_struct. The long term goal is to reduce
or remove the mmap_sem contention.

The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes. With the increased branching factor, it is significantly shorter than
the rbtree so it has fewer cache misses. The removal of the linked list
between subsequent entries also reduces the cache misses and the need to pull
in the previous and next VMA during many tree alterations.
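
For a feel of the interface, a minimal usage sketch follows (illustration
only: the function name is made up for this cover letter, and the calls
shown are the external API as added by the maple tree patch; see
include/linux/maple_tree.h in the series for the exact signatures, which
may differ between revisions):

	#include <linux/kernel.h>
	#include <linux/maple_tree.h>

	static DEFINE_MTREE(example_mt);
	static int example_value = 42;

	/* Sketch: store a range, look it up, then walk all entries. */
	static void maple_tree_example(void)
	{
		void *entry;
		MA_STATE(mas, &example_mt, 0, 0);

		/* Associate the range [0x1000, 0x1fff] with a value. */
		if (mtree_store_range(&example_mt, 0x1000, 0x1fff,
				      &example_value, GFP_KERNEL))
			return;

		/* Look up whichever entry covers index 0x1800. */
		entry = mtree_load(&example_mt, 0x1800);

		/* Readers may walk the tree under RCU. */
		rcu_read_lock();
		mas_for_each(&mas, entry, ULONG_MAX)
			pr_info("entry %p spans [%lu, %lu]\n",
				entry, mas.index, mas.last);
		rcu_read_unlock();
	}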

This patch is based on v5.17-rc4

git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214

v6 changes:
- Added a patch for xarray test code which should be dropped once the
upstream fix lands - Thanks Matthew Wilcox
- Fixed issue with maple state index/last setting when the tree is just
a pointer - Thanks David Howells
- Changed internal RCU handling to not check flags more than once
- Fixed mas_prev() underflow issue - Thanks David Howells
- Fixed returns on mas_prev()/mas_next() when there is only a value at
0 index - Thanks David Howells
- Fixed mas_find_rev() running past minimum value - Thanks David Howells
- Fixed testing code function rename - Thanks Mark Hemment
- Reworked brk() and vm_brk_flags() as suggested - Thanks Vlastimil
Babka
- Documentation fixes - Thanks Mike Rapoport
- Separated test code from tree code - Thanks Vlastimil Babka
- Moved maple_tree_init() call into the maple tree patch for other users
- Thanks David Howells
- Fixed copyright date - Thanks Mike Rapoport
- Whitespace fixes in comments, reduced changes in other locations.
- Added missing kdocs - Thanks Mike Rapoport
- Fixed exit_mmap comment - Thanks Mark Hemment
- Removed RCU tracking as there is an issue with atomic increments of
mmget() - Thanks Mark Hemment for initial issue report that allowed me
to discover this. RCU was not being used, so it is disabled for VMA
tracking for now.
- Removed unnecessary assignment in mtree_range_walk() - Thanks JaeJoon
Jung
- Dropped ma_xa_benchmark from testing Makefile - Thanks JaeJoon Jung
- Fixed leaf gap calculation in rare case of underflow
- Fixed mmap_region() bug on merging of prev
- Fixed mmap_region() bug on khugepaged_enter_vma_merge()
- Added test cases for mas_find_rev(), mas_prev(), mas_next(), and
mas_root_expand() to the test suite.
- Updated config options and ifdefs to allow other maple tree users to
debug maple tree without debugging the VM maple tree.

v5: https://lore.kernel.org/linux-mm/20220203172051.i2jnhnkudzssdsxg@revolver/T/
v4: https://lore.kernel.org/linux-mm/[email protected]/t/
v3: https://lore.kernel.org/linux-mm/[email protected]/
v2: https://lore.kernel.org/linux-mm/[email protected]/
v1: https://lore.kernel.org/linux-mm/[email protected]/


Performance on a 144 core x86:

It is important to note that the code is still using the mmap_sem;
performance seems fairly similar on real-world workloads, while there is
more variation in micro-benchmarks.

kernbench (first column baseline, second column maple tree): increased
system time, less user time:
Amean user-2 885.34 ( 0.00%) 886.07 * -0.08%*
Amean syst-2 161.95 ( 0.00%) 168.19 * -3.85%*
Amean elsp-2 530.06 ( 0.00%) 532.96 * -0.55%*
Amean user-4 908.58 ( 0.00%) 908.88 * -0.03%*
Amean syst-4 167.56 ( 0.00%) 173.72 * -3.68%*
Amean elsp-4 277.21 ( 0.00%) 277.38 * -0.06%*
Amean user-8 961.84 ( 0.00%) 962.45 * -0.06%*
Amean syst-8 176.40 ( 0.00%) 183.43 * -3.99%*
Amean elsp-8 150.59 ( 0.00%) 151.21 * -0.41%*
Amean user-16 1040.15 ( 0.00%) 1039.89 * 0.02%*
Amean syst-16 188.19 ( 0.00%) 193.81 * -2.99%*
Amean elsp-16 86.85 ( 0.00%) 86.32 * 0.61%*
Amean user-32 1240.46 ( 0.00%) 1233.93 * 0.53%*
Amean syst-32 217.15 ( 0.00%) 222.99 * -2.69%*
Amean elsp-32 55.16 ( 0.00%) 55.09 * 0.12%*
Amean user-64 1241.17 ( 0.00%) 1234.26 * 0.56%*
Amean syst-64 215.11 ( 0.00%) 220.76 * -2.63%*
Amean elsp-64 32.88 ( 0.00%) 33.72 * -2.57%*
Amean user-128 1613.09 ( 0.00%) 1609.63 * 0.21%*
Amean syst-128 267.10 ( 0.00%) 276.72 * -3.60%*
Amean elsp-128 25.80 ( 0.00%) 26.09 * -1.10%*

Mixed Hmean results:
- freqmine-medium -3.50% to +12.82%
- malloc1-processes: -14.50% to +6.53%
- signal1-processes -1.87% to +14.02%
- page_fault3-threads -6.46% to +26.15%
- pthread_mutex1-threads -16.81% to +28.63%

The following micro-benchmarks show a decrease in Hmean performance:
- brk1-processes -37.42% to -44.16%
- malloc1-threads -18.27% to -23.08%

Modifications are more expensive so the micro-benchmarks that write but
do not use the data will be negatively affected.

Patch organization:
Patch 1 adds a missing lock to avoid an assert issue when using a vma iterator.

Patch 2 is an xarray fix for bitmap header changes; it will be dropped
once the pending upstream fix lands.

Patches 3 to 7 are radix tree test suite additions for maple tree
support.

Patch 8 adds the maple tree.
Patch 9 adds the maple tree test code.

Patches 10-19 are the removal of the rbtree from the mm_struct. This
now includes the introduction of the vma iterator.

Patch 20 optimizes __vma_adjust() for the maple tree.

Patches 21-27 are the removal of the vmacache from the kernel.

Patches 28-31 are internal mm changes for efficiencies.

Patches 32-69 are the removal of the VMA linked list.

Patches 70 and 71 are a small cleanup from the removal of the vma linked list.

Liam R. Howlett (61):
xarray: Fix bitmap breakage
radix tree test suite: Add pr_err define
radix tree test suite: Add kmem_cache_set_non_kernel()
radix tree test suite: Add allocation counts and size to kmem_cache
radix tree test suite: Add support for slab bulk APIs
radix tree test suite: Add lockdep_is_held to header
Maple Tree: Add new data structure
lib/test_maple_tree: Add testing for maple tree
mm: Start tracking VMAs with maple tree
mm/mmap: Use the maple tree in find_vma() instead of the rbtree.
mm/mmap: Use the maple tree for find_vma_prev() instead of the rbtree
mm/mmap: Use maple tree for unmapped_area{_topdown}
kernel/fork: Use maple tree for dup_mmap() during forking
mm: Remove rb tree.
mmap: Change zeroing of maple tree in __vma_adjust()
xen: Use vma_lookup() in privcmd_ioctl_mmap()
mm: Optimize find_exact_vma() to use vma_lookup()
mm/khugepaged: Optimize collapse_pte_mapped_thp() by using
vma_lookup()
mm/mmap: Change do_brk_flags() to expand existing VMA and add
do_brk_munmap()
mm: Use maple tree operations for find_vma_intersection()
mm/mmap: Use advanced maple tree API for mmap_region()
mm: Remove vmacache
mm/mmap: Move mmap_region() below do_munmap()
mm/mmap: Reorganize munmap to use maple states
mm/mmap: Change do_brk_munmap() to use do_mas_align_munmap()
arm64: Remove mmap linked list from vdso
parisc: Remove mmap linked list from cache handling
powerpc: Remove mmap linked list walks
s390: Remove vma linked list walks
x86: Remove vma linked list walks
xtensa: Remove vma linked list walks
cxl: Remove vma linked list walk
optee: Remove vma linked list walk
um: Remove vma linked list walk
binfmt_elf: Remove vma linked list walk
exec: Use VMA iterator instead of linked list
fs/proc/base: Use maple tree iterators in place of linked list
userfaultfd: Use maple tree iterator to iterate VMAs
ipc/shm: Use VMA iterator instead of linked list
acct: Use VMA iterator instead of linked list
perf: Use VMA iterator
sched: Use maple tree iterator to walk VMAs
fork: Use VMA iterator
bpf: Remove VMA linked list
mm/gup: Use maple tree navigation instead of linked list
mm/khugepaged: Stop using vma linked list
mm/ksm: Use vma iterators instead of vma linked list
mm/madvise: Use vma_find() instead of vma linked list
mm/memcontrol: Stop using mm->highest_vm_end
mm/mempolicy: Use vma iterator & maple state instead of vma linked
list
mm/mlock: Use vma iterator and instead of vma linked list
mm/mprotect: Use maple tree navigation instead of vma linked list
mm/mremap: Use vma_find_intersection() instead of vma linked list
mm/msync: Use vma_find() instead of vma linked list
mm/oom_kill: Use maple tree iterators instead of vma linked list
mm/pagewalk: Use vma_find() instead of vma linked list
mm/swapfile: Use vma iterator instead of vma linked list
riscv: Use vma iterator for vdso
mm: Remove the vma linked list
mm/mmap: Drop range_has_overlap() function
mm/mmap.c: Pass in mapping to __vma_link_file()

Matthew Wilcox (Oracle) (10):
binfmt_elf: Take the mmap lock when walking the VMA list
mm: Add VMA iterator
mmap: Use the VMA iterator in count_vma_pages_range()
damon: Convert __damon_va_three_regions to use the VMA iterator
proc: Remove VMA rbtree use from nommu
mm: Convert vma_lookup() to use mtree_load()
coredump: Remove vma linked list walk
fs/proc/task_mmu: Stop using linked list and highest_vm_end
i915: Use the VMA iterator
nommu: Remove uses of VMA linked list

Documentation/core-api/index.rst | 1 +
Documentation/core-api/maple_tree.rst | 218 +
MAINTAINERS | 12 +
arch/arm64/kernel/vdso.c | 3 +-
arch/parisc/kernel/cache.c | 9 +-
arch/powerpc/kernel/vdso.c | 6 +-
arch/powerpc/mm/book3s32/tlb.c | 11 +-
arch/powerpc/mm/book3s64/subpage_prot.c | 13 +-
arch/riscv/kernel/vdso.c | 3 +-
arch/s390/kernel/vdso.c | 3 +-
arch/s390/mm/gmap.c | 6 +-
arch/um/kernel/tlb.c | 14 +-
arch/x86/entry/vdso/vma.c | 9 +-
arch/x86/kernel/tboot.c | 2 +-
arch/xtensa/kernel/syscall.c | 18 +-
drivers/firmware/efi/efi.c | 2 +-
drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 14 +-
drivers/misc/cxl/fault.c | 45 +-
drivers/tee/optee/call.c | 18 +-
drivers/xen/privcmd.c | 2 +-
fs/binfmt_elf.c | 6 +-
fs/coredump.c | 33 +-
fs/exec.c | 12 +-
fs/proc/base.c | 5 +-
fs/proc/internal.h | 2 +-
fs/proc/task_mmu.c | 74 +-
fs/proc/task_nommu.c | 45 +-
fs/userfaultfd.c | 49 +-
include/linux/maple_tree.h | 683 +
include/linux/mm.h | 77 +-
include/linux/mm_types.h | 43 +-
include/linux/mm_types_task.h | 12 -
include/linux/sched.h | 1 -
include/linux/userfaultfd_k.h | 7 +-
include/linux/vm_event_item.h | 4 -
include/linux/vmacache.h | 28 -
include/linux/vmstat.h | 6 -
include/linux/xarray.h | 1 +
include/trace/events/maple_tree.h | 123 +
include/trace/events/mmap.h | 71 +
init/main.c | 2 +
ipc/shm.c | 21 +-
kernel/acct.c | 11 +-
kernel/bpf/task_iter.c | 10 +-
kernel/debug/debug_core.c | 12 -
kernel/events/core.c | 3 +-
kernel/events/uprobes.c | 9 +-
kernel/fork.c | 58 +-
kernel/sched/fair.c | 10 +-
lib/Kconfig.debug | 18 +-
lib/Makefile | 3 +-
lib/maple_tree.c | 6967 +++
lib/test_maple_tree.c | 37398 ++++++++++++++++
mm/Makefile | 2 +-
mm/damon/vaddr-test.h | 37 +-
mm/damon/vaddr.c | 53 +-
mm/debug.c | 14 +-
mm/gup.c | 9 +-
mm/huge_memory.c | 4 +-
mm/init-mm.c | 4 +-
mm/internal.h | 78 +-
mm/khugepaged.c | 13 +-
mm/ksm.c | 18 +-
mm/madvise.c | 2 +-
mm/memcontrol.c | 6 +-
mm/memory.c | 33 +-
mm/mempolicy.c | 58 +-
mm/mlock.c | 34 +-
mm/mmap.c | 2086 +-
mm/mprotect.c | 7 +-
mm/mremap.c | 22 +-
mm/msync.c | 2 +-
mm/nommu.c | 127 +-
mm/oom_kill.c | 3 +-
mm/pagewalk.c | 2 +-
mm/swapfile.c | 4 +-
mm/util.c | 32 -
mm/vmacache.c | 117 -
mm/vmstat.c | 4 -
tools/testing/radix-tree/.gitignore | 2 +
tools/testing/radix-tree/Makefile | 9 +-
tools/testing/radix-tree/generated/autoconf.h | 1 +
tools/testing/radix-tree/linux.c | 160 +-
tools/testing/radix-tree/linux/kernel.h | 1 +
tools/testing/radix-tree/linux/lockdep.h | 2 +
tools/testing/radix-tree/linux/maple_tree.h | 7 +
tools/testing/radix-tree/linux/slab.h | 4 +
tools/testing/radix-tree/maple.c | 59 +
.../radix-tree/trace/events/maple_tree.h | 3 +
89 files changed, 47390 insertions(+), 1842 deletions(-)
create mode 100644 Documentation/core-api/maple_tree.rst
create mode 100644 include/linux/maple_tree.h
delete mode 100644 include/linux/vmacache.h
create mode 100644 include/trace/events/maple_tree.h
create mode 100644 lib/maple_tree.c
create mode 100644 lib/test_maple_tree.c
delete mode 100644 mm/vmacache.c
create mode 100644 tools/testing/radix-tree/linux/maple_tree.h
create mode 100644 tools/testing/radix-tree/maple.c
create mode 100644 tools/testing/radix-tree/trace/events/maple_tree.h

--
2.34.1


2022-02-16 07:39:48

by Liam R. Howlett

Subject: [PATCH v6 01/71] binfmt_elf: Take the mmap lock when walking the VMA list

From: "Matthew Wilcox (Oracle)" <[email protected]>

I'm not sure if the VMA list can change under us, but dump_vma_snapshot()
is very careful to take the mmap_lock in write mode. We only need to
take it in read mode here as we do not care if the size of the stack
VMA changes underneath us.

If it can be changed underneath us, this is a potential use-after-free
for a multithreaded process which is dumping core.

Fixes: 2aa362c49c31 ("coredump: extend core dump note section to contain file names of mapped files")
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
---
fs/binfmt_elf.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 9e11e6f13e83..f15efed29d3d 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1651,6 +1651,7 @@ static int fill_files_note(struct memelfnote *note)
name_base = name_curpos = ((char *)data) + names_ofs;
remaining = size - names_ofs;
count = 0;
+ mmap_read_lock(mm);
for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
struct file *file;
const char *filename;
@@ -1661,6 +1662,7 @@ static int fill_files_note(struct memelfnote *note)
filename = file_path(file, name_curpos, remaining);
if (IS_ERR(filename)) {
if (PTR_ERR(filename) == -ENAMETOOLONG) {
+ mmap_read_unlock(mm);
kvfree(data);
size = size * 5 / 4;
goto alloc;
@@ -1680,6 +1682,7 @@ static int fill_files_note(struct memelfnote *note)
*start_end_ofs++ = vma->vm_pgoff;
count++;
}
+ mmap_read_unlock(mm);

/* Now we know exact count of files, can store it */
data[0] = count;
--
2.34.1

2022-02-17 00:37:32

by Andrew Morton

Subject: Re: [PATCH v6 00/71] Introducing the Maple Tree

On Tue, 15 Feb 2022 14:37:44 +0000 Liam Howlett <[email protected]> wrote:

> The maple tree is an RCU-safe range based B-tree designed to use modern
> processor cache efficiently. There are a number of places in the kernel
> that a non-overlapping range-based tree would be beneficial, especially
> one with a simple interface. The first user that is covered in this
> patch set is the vm_area_struct, where three data structures are
> replaced by the maple tree: the augmented rbtree, the vma cache, and the
> linked list of VMAs in the mm_struct. The long term goal is to reduce
> or remove the mmap_sem contention.

Has a path been chosen which gets us from here to significant reduction
in mmap_lock overhead?

If so, what's the plan and what must be done?

Thanks.

2022-02-17 02:43:04

by Matthew Wilcox

Subject: Re: [PATCH v6 00/71] Introducing the Maple Tree

On Wed, Feb 16, 2022 at 11:47:00AM -0800, Andrew Morton wrote:
> On Tue, 15 Feb 2022 14:37:44 +0000 Liam Howlett <[email protected]> wrote:
>
> > The maple tree is an RCU-safe range based B-tree designed to use modern
> > processor cache efficiently. There are a number of places in the kernel
> > that a non-overlapping range-based tree would be beneficial, especially
> > one with a simple interface. The first user that is covered in this
> > patch set is the vm_area_struct, where three data structures are
> > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > linked list of VMAs in the mm_struct. The long term goal is to reduce
> > or remove the mmap_sem contention.
>
> Has a path been chosen which gets us from here to significant reduction
> in mmap_lock overhead?
>
> If so, what's the plan and what must be done?

I would say there are still competing ideas for how that is to be done.

First, the Maple Tree is independent (to a certain extent) of all the
approaches to reducing mmap_lock overhead. It's a better data structure
for VMA lookup than the rbtree. Set against that, it has higher overhead
for modifications. That means that benchmarks which primarily measure
modification overhead see worse performance. We believe this is not
representative of real workloads, and indeed we see ~parity on workloads
like git and make -jN.

The advantage to this patch set is that we get rid of the vmacache
entirely, we get rid of the doubly-linked list chaining VMAs together
and we get rid of the rbtree usage. And we add a new generic data
structure that hopefully I can point people towards instead of using
the XArray inappropriately.

Both of the approaches to avoiding taking the mmap_lock in the fault path
will benefit from using the maple tree for VMAs. Fewer cache misses and
RCU safety are things we all want. The maple tree will also benefit
if/when either approach is merged; because modifications to the maple
tree are more heavyweight than the rbtree, they block readers for longer.


The SPF approach is further along. It has been tested in the past with
the maple tree, but not in the most recent iteration of either. It
doesn't rely on the maple tree, but it does benefit.

The RCU-freeing-of-VMAs approach that I favour has not received much
attention. I've been busy with other things ;-) It offers much more
opportunity for performance improvement, but will take longer to arrive.

2022-02-24 00:30:46

by Mel Gorman

Subject: Re: [PATCH v6 00/71] Introducing the Maple Tree

On Wed, Feb 16, 2022 at 08:24:34PM +0000, Matthew Wilcox wrote:
> On Wed, Feb 16, 2022 at 11:47:00AM -0800, Andrew Morton wrote:
> > On Tue, 15 Feb 2022 14:37:44 +0000 Liam Howlett <[email protected]> wrote:
> >
> > > The maple tree is an RCU-safe range based B-tree designed to use modern
> > > processor cache efficiently. There are a number of places in the kernel
> > > that a non-overlapping range-based tree would be beneficial, especially
> > > one with a simple interface. The first user that is covered in this
> > > patch set is the vm_area_struct, where three data structures are
> > > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > > linked list of VMAs in the mm_struct. The long term goal is to reduce
> > > or remove the mmap_sem contention.
> >
> > Has a path been chosen which gets us from here to significant reduction
> > in mmap_lock overhead?
> >
> > If so, what's the plan and what must be done?
>
> I would say there are still competing ideas for how that is to be done.
>
> First, the Maple Tree is independent (to a certain extent) of all the
> approaches to reducing mmap_lock overhead. It's a better data structure
> for VMA lookup than the rbtree. Set against that, it has higher overhead
> for modifications. That means that benchmarks which primarily measure
> modification overhead see worse performance. We believe this is not
> representative of real workloads, and indeed we see ~parity on workloads
> like git and make -jN.
>

I'm way behind and only got around to looking at SPF properly today. Maple
is working its way up my list and I need to gather new data but I did
have old data queued from a time when I thought I would get around to
maple tree soon. The big result that stood was was brk performance from
will-it-scale but for processes specifically

wis-malloc
5.17.0-rc3 5.17.0-rc3
vanilla mm-maplevma-v5r1
Hmean brk1-processes-2 5265361.84 ( 0.00%) 3489097.63 * -33.73%*
Hmean brk1-processes-5 12270382.36 ( 0.00%) 8136542.13 * -33.69%*
Hmean brk1-processes-8 19560585.95 ( 0.00%) 12941094.22 * -33.84%*
Hmean brk1-processes-12 27328267.18 ( 0.00%) 18148145.64 * -33.59%*
Hmean brk1-processes-21 39599378.04 ( 0.00%) 26188063.96 * -33.87%*
Hmean brk1-processes-30 59880571.20 ( 0.00%) 39653658.00 * -33.78%*
Hmean brk1-processes-48 72900335.59 ( 0.00%) 49044880.84 * -32.72%*
Hmean brk1-processes-79 72388042.42 ( 0.00%) 48416196.93 * -33.12%*
Hmean brk1-processes-110 72333206.18 ( 0.00%) 48437222.12 * -33.04%*
Hmean brk1-processes-141 72443841.68 ( 0.00%) 48490158.48 * -33.07%*
Hmean brk1-processes-172 72187451.54 ( 0.00%) 48480696.77 * -32.84%*
Hmean brk1-processes-203 72575945.95 ( 0.00%) 48503896.03 * -33.17%*
Hmean brk1-processes-234 67509812.97 ( 0.00%) 48466347.95 * -28.21%*
Hmean brk1-processes-265 72379447.48 ( 0.00%) 48449330.06 * -33.06%*
Hmean brk1-processes-296 72241861.59 ( 0.00%) 48303854.85 * -33.14%*
Hmean brk1-processes-320 72183320.07 ( 0.00%) 48253708.21 * -33.15%*
Hmean brk1-threads-2 1352297.67 ( 0.00%) 2963406.42 * 119.14%*
Hmean brk1-threads-5 814358.55 ( 0.00%) 1233604.84 * 51.48%*
Hmean brk1-threads-8 849900.52 ( 0.00%) 1241170.76 * 46.04%*
Hmean brk1-threads-12 804143.24 ( 0.00%) 1232451.60 * 53.26%*
Hmean brk1-threads-21 672680.28 ( 0.00%) 1001257.38 * 48.85%*
Hmean brk1-threads-30 443221.73 ( 0.00%) 713592.66 * 61.00%*
Hmean brk1-threads-48 397283.95 ( 0.00%) 591349.83 * 48.85%*
Hmean brk1-threads-79 167479.04 ( 0.00%) 548681.03 * 227.61%*
Hmean brk1-threads-110 154108.98 ( 0.00%) 207349.67 * 34.55%*
Hmean brk1-threads-141 166712.91 ( 0.00%) 217314.50 * 30.35%*
Hmean brk1-threads-172 155023.61 ( 0.00%) 210318.59 * 35.67%*
Hmean brk1-threads-203 166943.60 ( 0.00%) 227892.26 * 36.51%*

The mmtests configuration this was based on no longer exists but the
equivalent should be

configs/config-workload-will-it-scale-pf-processes
configs/config-workload-will-it-scale-pf-threads

It shouldn't be too hard to pull out the brk1 test on its own and see
what it's doing in profiles.

brk_test from aim9 also suffers, so this may be brk-specific.

Other workloads looked OK; kernel building took a small -1.5%
performance hit on some machines, but not enough that I would scream.
Others looked mostly neutral but it was only a provisional sniff test
before I even considered looking at 70 patches :O

--
Mel Gorman
SUSE Labs

2022-02-24 01:04:28

by Matthew Wilcox

Subject: Re: [PATCH v6 00/71] Introducing the Maple Tree

On Wed, Feb 23, 2022 at 04:35:09PM +0000, Mel Gorman wrote:
> On Wed, Feb 16, 2022 at 08:24:34PM +0000, Matthew Wilcox wrote:
> > On Wed, Feb 16, 2022 at 11:47:00AM -0800, Andrew Morton wrote:
> > > On Tue, 15 Feb 2022 14:37:44 +0000 Liam Howlett <[email protected]> wrote:
> > >
> > > > The maple tree is an RCU-safe range based B-tree designed to use modern
> > > > processor cache efficiently. There are a number of places in the kernel
> > > > that a non-overlapping range-based tree would be beneficial, especially
> > > > one with a simple interface. The first user that is covered in this
> > > > patch set is the vm_area_struct, where three data structures are
> > > > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > > > linked list of VMAs in the mm_struct. The long term goal is to reduce
> > > > or remove the mmap_sem contention.
> > >
> > > Has a path been chosen which gets us from here to significant reduction
> > > in mmap_lock overhead?
> > >
> > > If so, what's the plan and what must be done?
> >
> > I would say there are still competing ideas for how that is to be done.
> >
> > First, the Maple Tree is independent (to a certain extent) of all the
> > approaches to reducing mmap_lock overhead. It's a better data structure
> > for VMA lookup than the rbtree. Set against that, it has higher overhead
> > for modifications. That means that benchmarks which primarily measure
> > modification overhead see worse performance. We believe this is not
> > representative of real workloads, and indeed we see ~parity on workloads
> > like git and make -jN.
>
> I'm way behind and only got around to looking at SPF properly today. Maple
> is working its way up my list and I need to gather new data but I did
> have old data queued from a time when I thought I would get around to
> maple tree soon. The big result that stood out was brk performance from
> will-it-scale, for processes specifically:

Yup, we know about the brk1 regression. It's a really unrealistic
benchmark, which is why we added brk2. To quote the commit message:

Linux has this horrendously complicated anon_vma structure that you don't
care about, but the upshot is that after calling fork(), each process
that calls brk() gets a _new_ VMA created. That is, after calling brk()
the first time, the process address space looks like this:

557777fab000-557777ff0000 rw-p 00000000 00:00 0 [heap]
557777ff0000-557777ff1000 rw-p 00000000 00:00 0 [heap]

so what brk1 is actually testing is how long it takes to create & destroy
a new VMA. This does not match what most programs do -- most will call
exec() which resets the anon_vma structures and starts each program off
with its own heap. And if you do have a multi-process program which
uses brk(), chances are it doesn't just oscillate between zero and one
extra page of heap compared to its parent.

A better test starts out by allocating one page on the heap and then
throbs between one and two pages instead of throbbing between zero and
one page. That means we're actually testing expanding and contracting
the heap instead of creating and destroying a new heap.

For realism, I wanted to add actually accessing the memory in the new
heap, but that doesn't work for the threaded case -- another thread
might remove the memory you just allocated while you're allocating it.
Threaded programs give each thread its own heap anyway, so this is
kind of a pointless syscall to ask about its threaded scalability.

Anyway, here's brk2.c. It is not very different from brk1.c, but the
performance results are quite different (actually worse by about 10-15%).
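
[brk2.c itself is not reproduced in this archive. As a rough sketch of the
throb described above -- assuming the usual will-it-scale testcase() shape,
and explicitly not the actual brk2.c -- the core loop would look something
like the following.]

	#define _DEFAULT_SOURCE	/* for brk()/sbrk() prototypes */
	#include <assert.h>
	#include <unistd.h>

	#define PAGE_SIZE 4096UL

	/* Sketch: throb the heap between one and two extra pages, so the
	 * benchmark expands/contracts an existing VMA instead of creating
	 * and destroying one on every iteration. */
	void testcase(unsigned long long *iterations, unsigned long nr)
	{
		char *base = sbrk(0);

		(void)nr;	/* task count in will-it-scale; unused here */

		/* Start with one extra page so the heap VMA already exists. */
		assert(brk(base + PAGE_SIZE) == 0);

		while (1) {
			assert(brk(base + 2 * PAGE_SIZE) == 0);
			assert(brk(base + PAGE_SIZE) == 0);

			(*iterations) += 2;
		}
	}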

2022-02-25 07:40:15

by Qian Cai

Subject: Re: [PATCH v6 00/71] Introducing the Maple Tree

On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote:
> The maple tree is an RCU-safe range based B-tree designed to use modern
> processor cache efficiently. There are a number of places in the kernel
> that a non-overlapping range-based tree would be beneficial, especially
> one with a simple interface. The first user that is covered in this
> patch set is the vm_area_struct, where three data structures are
> replaced by the maple tree: the augmented rbtree, the vma cache, and the
> linked list of VMAs in the mm_struct. The long term goal is to reduce
> or remove the mmap_sem contention.
>
> The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> nodes. With the increased branching factor, it is significantly shorter than
> the rbtree so it has fewer cache misses. The removal of the linked list
> between subsequent entries also reduces the cache misses and the need to pull
> in the previous and next VMA during many tree alterations.
>
> This patch is based on v5.17-rc4
>
> git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214

Just a heads-up. I noticed a use-after-free in today's linux-next below. I
am running out of time to fully triage this, but I noticed this is the only
series (not sure which revision) that has heavily touched mm/mmap.c there
recently.

BUG: KASAN: use-after-free in move_vma.isra.0
Read of size 8 at addr ffff0009ce752ac8 by task systemd-logind/1280

CPU: 21 PID: 1280 Comm: systemd-logind Not tainted 5.17.0-rc5-next-20220223-dirty #242
Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
Call trace:
dump_backtrace
show_stack
dump_stack_lvl
print_address_description.constprop.0
kasan_report
__asan_report_load8_noabort
move_vma.isra.0
move_vma at /usr/src/linux-next/mm/mremap.c:714
__do_sys_mremap
__arm64_sys_mremap
invoke_syscall
el0_svc_common.constprop.0
do_el0_svc
el0_svc
el0t_64_sync_handler
el0t_64_sync

Allocated by task 1280:
kasan_save_stack
__kasan_slab_alloc
slab_post_alloc_hook
kmem_cache_alloc
vm_area_alloc
vm_area_alloc at /usr/src/linux-next/kernel/fork.c:455
mmap_region
mmap_region at /usr/src/linux-next/mm/mmap.c:2585
do_mmap
vm_mmap_pgoff
ksys_mmap_pgoff
__arm64_sys_mmap
invoke_syscall
el0_svc_common.constprop.0
do_el0_svc
el0_svc
el0t_64_sync_handler
el0t_64_sync

Freed by task 1280:
kasan_save_stack
kasan_set_track
kasan_set_free_info
__kasan_slab_free
kmem_cache_free
vm_area_free
remove_vma
remove_vma at /usr/src/linux-next/mm/mmap.c:187
do_mas_align_munmap.isra.0
remove_mt at /usr/src/linux-next/mm/mmap.c:2176
(inlined by) do_mas_align_munmap at /usr/src/linux-next/mm/mmap.c:2437
do_mas_munmap
do_mas_munmap at /usr/src/linux-next/mm/mmap.c:2483
do_munmap
move_vma.isra.0
__do_sys_mremap
__arm64_sys_mremap
invoke_syscall
el0_svc_common.constprop.0
do_el0_svc
el0_svc
el0t_64_sync_handler
el0t_64_sync

The buggy address belongs to the object at ffff0009ce752aa8
which belongs to the cache vm_area_struct of size 144
The buggy address is located 32 bytes inside of
144-byte region [ffff0009ce752aa8, ffff0009ce752b38)
The buggy address belongs to the page:
page:fffffc002739d400 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xa4e750
head:fffffc002739d400 order:2 compound_mapcount:0 compound_pincount:0
memcg:ffff0009ce456e01
flags: 0xbfffc0000010200(slab|head|node=0|zone=2|lastcpupid=0xffff)
raw: 0bfffc0000010200 fffffc002739f108 fffffc00273a3b08 ffff000800f53380
raw: 0000000000000000 0000000000210021 00000001ffffffff ffff0009ce456e01
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff0009ce752980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff0009ce752a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff0009ce752a80: fc fc fc fc fc fa fb fb fb fb fb fb fb fb fb fb
^
ffff0009ce752b00: fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc
ffff0009ce752b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

2022-02-25 22:16:15

by Liam R. Howlett

Subject: Re: [PATCH v6 00/71] Introducing the Maple Tree

* Qian Cai <[email protected]> [220224 22:49]:
> On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote:
> > The maple tree is an RCU-safe range based B-tree designed to use modern
> > processor cache efficiently. There are a number of places in the kernel
> > that a non-overlapping range-based tree would be beneficial, especially
> > one with a simple interface. The first user that is covered in this
> > patch set is the vm_area_struct, where three data structures are
> > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > linked list of VMAs in the mm_struct. The long term goal is to reduce
> > or remove the mmap_sem contention.
> >
> > The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> > nodes. With the increased branching factor, it is significantly shorter than
> > the rbtree so it has fewer cache misses. The removal of the linked list
> > between subsequent entries also reduces the cache misses and the need to pull
> > in the previous and next VMA during many tree alterations.
> >
> > This patch is based on v5.17-rc4
> >
> > git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214
>
> Just a heads-up. I noticed a use-after-free in today's linux-next below. I
> am running out of time to fully triage this, but I noticed this is the only
> series (not sure which revision) that has heavily touched mm/mmap.c there
> recently.
>
> BUG: KASAN: use-after-free in move_vma.isra.0
> Read of size 8 at addr ffff0009ce752ac8 by task systemd-logind/1280
>
> CPU: 21 PID: 1280 Comm: systemd-logind Not tainted 5.17.0-rc5-next-20220223-dirty #242
> Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
> Call trace:
> dump_backtrace
> show_stack
> dump_stack_lvl
> print_address_description.constprop.0
> kasan_report
> __asan_report_load8_noabort
> move_vma.isra.0
> move_vma at /usr/src/linux-next/mm/mremap.c:714
> __do_sys_mremap
> __arm64_sys_mremap
> invoke_syscall
> el0_svc_common.constprop.0
> do_el0_svc
> el0_svc
> el0t_64_sync_handler
> el0t_64_sync
>
> Allocated by task 1280:
> kasan_save_stack
> __kasan_slab_alloc
> slab_post_alloc_hook
> kmem_cache_alloc
> vm_area_alloc
> vm_area_alloc at /usr/src/linux-next/kernel/fork.c:455
> mmap_region
> mmap_region at /usr/src/linux-next/mm/mmap.c:2585
> do_mmap
> vm_mmap_pgoff
> ksys_mmap_pgoff
> __arm64_sys_mmap
> invoke_syscall
> el0_svc_common.constprop.0
> do_el0_svc
> el0_svc
> el0t_64_sync_handler
> el0t_64_sync
>
> Freed by task 1280:
> kasan_save_stack
> kasan_set_track
> kasan_set_free_info
> __kasan_slab_free
> kmem_cache_free
> vm_area_free
> remove_vma
> remove_vma at /usr/src/linux-next/mm/mmap.c:187
> do_mas_align_munmap.isra.0
> remove_mt at /usr/src/linux-next/mm/mmap.c:2176
> (inlined by) do_mas_align_munmap at /usr/src/linux-next/mm/mmap.c:2437
> do_mas_munmap
> do_mas_munmap at /usr/src/linux-next/mm/mmap.c:2483
> do_munmap
> move_vma.isra.0
> __do_sys_mremap
> __arm64_sys_mremap
> invoke_syscall
> el0_svc_common.constprop.0
> do_el0_svc
> el0_svc
> el0t_64_sync_handler
> el0t_64_sync
>
> The buggy address belongs to the object at ffff0009ce752aa8
> which belongs to the cache vm_area_struct of size 144
> The buggy address is located 32 bytes inside of
> 144-byte region [ffff0009ce752aa8, ffff0009ce752b38)
> The buggy address belongs to the page:
> page:fffffc002739d400 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xa4e750
> head:fffffc002739d400 order:2 compound_mapcount:0 compound_pincount:0
> memcg:ffff0009ce456e01
> flags: 0xbfffc0000010200(slab|head|node=0|zone=2|lastcpupid=0xffff)
> raw: 0bfffc0000010200 fffffc002739f108 fffffc00273a3b08 ffff000800f53380
> raw: 0000000000000000 0000000000210021 00000001ffffffff ffff0009ce456e01
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
> ffff0009ce752980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> ffff0009ce752a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> >ffff0009ce752a80: fc fc fc fc fc fa fb fb fb fb fb fb fb fb fb fb
> ^
> ffff0009ce752b00: fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc
> ffff0009ce752b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

Thank you for the report. I will try to reproduce it with a kvm here.

Cheers,
Liam

2022-02-25 23:23:12

by Liam R. Howlett

Subject: Re: [PATCH v6 00/71] Introducing the Maple Tree

* Liam Howlett <[email protected]> [220225 14:09]:
> * Qian Cai <[email protected]> [220224 22:49]:
> > On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote:
> > > The maple tree is an RCU-safe range based B-tree designed to use modern
> > > processor cache efficiently. There are a number of places in the kernel
> > > that a non-overlapping range-based tree would be beneficial, especially
> > > one with a simple interface. The first user that is covered in this
> > > patch set is the vm_area_struct, where three data structures are
> > > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > > linked list of VMAs in the mm_struct. The long term goal is to reduce
> > > or remove the mmap_sem contention.
> > >
> > > The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> > > nodes. With the increased branching factor, it is significantly shorter than
> > > the rbtree so it has fewer cache misses. The removal of the linked list
> > > between subsequent entries also reduces the cache misses and the need to pull
> > > in the previous and next VMA during many tree alterations.
> > >
> > > This patch is based on v5.17-rc4
> > >
> > > git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214
> >
> > Just a heads-up. I noticed a use-after-free in today's linux-next below. I
> > am running out of time to fully triage this, but I noticed this is the only
> > series (not sure which revision) that has heavily touched mm/mmap.c there
> > recently.
> >
> > BUG: KASAN: use-after-free in move_vma.isra.0
> > Read of size 8 at addr ffff0009ce752ac8 by task systemd-logind/1280
> >
> > CPU: 21 PID: 1280 Comm: systemd-logind Not tainted 5.17.0-rc5-next-20220223-dirty #242
> > Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
> > Call trace:
> > dump_backtrace
> > show_stack
> > dump_stack_lvl
> > print_address_description.constprop.0
> > kasan_report
> > __asan_report_load8_noabort
> > move_vma.isra.0
> > move_vma at /usr/src/linux-next/mm/mremap.c:714
> > __do_sys_mremap
> > __arm64_sys_mremap
> > invoke_syscall
> > el0_svc_common.constprop.0
> > do_el0_svc
> > el0_svc
> > el0t_64_sync_handler
> > el0t_64_sync
> >
> > Allocated by task 1280:
> > kasan_save_stack
> > __kasan_slab_alloc
> > slab_post_alloc_hook
> > kmem_cache_alloc
> > vm_area_alloc
> > vm_area_alloc at /usr/src/linux-next/kernel/fork.c:455
> > mmap_region
> > mmap_region at /usr/src/linux-next/mm/mmap.c:2585
> > do_mmap
> > vm_mmap_pgoff
> > ksys_mmap_pgoff
> > __arm64_sys_mmap
> > invoke_syscall
> > el0_svc_common.constprop.0
> > do_el0_svc
> > el0_svc
> > el0t_64_sync_handler
> > el0t_64_sync
> >
> > Freed by task 1280:
> > kasan_save_stack
> > kasan_set_track
> > kasan_set_free_info
> > __kasan_slab_free
> > kmem_cache_free
> > vm_area_free
> > remove_vma
> > remove_vma at /usr/src/linux-next/mm/mmap.c:187
> > do_mas_align_munmap.isra.0
> > remove_mt at /usr/src/linux-next/mm/mmap.c:2176
> > (inlined by) do_mas_align_munmap at /usr/src/linux-next/mm/mmap.c:2437
> > do_mas_munmap
> > do_mas_munmap at /usr/src/linux-next/mm/mmap.c:2483
> > do_munmap
> > move_vma.isra.0
> > __do_sys_mremap
> > __arm64_sys_mremap
> > invoke_syscall
> > el0_svc_common.constprop.0
> > do_el0_svc
> > el0_svc
> > el0t_64_sync_handler
> > el0t_64_sync
> >
> > The buggy address belongs to the object at ffff0009ce752aa8
> > which belongs to the cache vm_area_struct of size 144
> > The buggy address is located 32 bytes inside of
> > 144-byte region [ffff0009ce752aa8, ffff0009ce752b38)
> > The buggy address belongs to the page:
> > page:fffffc002739d400 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xa4e750
> > head:fffffc002739d400 order:2 compound_mapcount:0 compound_pincount:0
> > memcg:ffff0009ce456e01
> > flags: 0xbfffc0000010200(slab|head|node=0|zone=2|lastcpupid=0xffff)
> > raw: 0bfffc0000010200 fffffc002739f108 fffffc00273a3b08 ffff000800f53380
> > raw: 0000000000000000 0000000000210021 00000001ffffffff ffff0009ce456e01
> > page dumped because: kasan: bad access detected
> >
> > Memory state around the buggy address:
> > ffff0009ce752980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> > ffff0009ce752a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> > >ffff0009ce752a80: fc fc fc fc fc fa fb fb fb fb fb fb fb fb fb fb
> > ^
> > ffff0009ce752b00: fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc
> > ffff0009ce752b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>
> Thank you for the report. I will try to reproduce it with a kvm here.
>

I just booted an arm64 VM with my build and kasan enabled with no issue.
Could you please send me your config file for the build?

Thanks,
Liam


2022-02-26 01:27:57

by Qian Cai

Subject: Re: [PATCH v6 00/71] Introducing the Maple Tree

On Fri, Feb 25, 2022 at 08:23:41PM +0000, Liam Howlett wrote:
> I just booted an arm64 VM with my build and kasan enabled with no issue.
> Could you please send me your config file for the build?

On linux-next, I just do:

$ make ARCH=arm64 defconfig debug.config [1]

Then, I just generate some memory pressure to drive the system into
swapping/the OOM killer and trigger it.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/kernel/configs/debug.config

2022-02-26 01:34:41

by Nathan Chancellor

Subject: Re: [PATCH v6 00/71] Introducing the Maple Tree

On Fri, Feb 25, 2022 at 03:46:52PM -0500, Qian Cai wrote:
> On Fri, Feb 25, 2022 at 08:23:41PM +0000, Liam Howlett wrote:
> > I just booted an arm64 VM with my build and kasan enabled with no issue.
> > Could you please send me your config file for the build?
>
> On linux-next, I just do:
>
> $ make ARCH=arm64 defconfig debug.config [1]
>
> Then, I just generate some memory pressure to drive the system into
> swapping/the OOM killer and trigger it.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/kernel/configs/debug.config

Is the stacktrace [1] related to the conflict that Mark encountered [2]
while merging the maple and folio trees? Booting a next-20220223 kernel
on my Raspberry Pi 3 and 4 shows constant NULL pointer dereferences
(just ARCH=arm and ARCH=arm64 defconfigs) and reverting the folio and
maple tree merges makes everything work properly again.

[1]: https://lore.kernel.org/r/YhhRrBpXTFolUAKi@qian/
[2]: https://lore.kernel.org/r/[email protected]/

Cheers,
Nathan

2022-02-26 02:43:48

by Liam R. Howlett

Subject: Re: [PATCH v6 00/71] Introducing the Maple Tree

* Nathan Chancellor <[email protected]> [220225 18:00]:
> On Fri, Feb 25, 2022 at 03:46:52PM -0500, Qian Cai wrote:
> > On Fri, Feb 25, 2022 at 08:23:41PM +0000, Liam Howlett wrote:
> > > I just booted an arm64 VM with my build and kasan enabled with no issue.
> > > Could you please send me your config file for the build?
> >
> > On linux-next, I just do:
> >
> > $ make ARCH=arm64 defconfig debug.config [1]
> >
> > Then, I just generate some memory pressure to drive the system into
> > swapping/the OOM killer and trigger it.
> >
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/kernel/configs/debug.config
>
> Is the stacktrace [1] related to the conflict that Mark encountered [2]
> while merging the maple and folio trees? Booting a next-20220223 kernel
> on my Raspberry Pi 3 and 4 shows constant NULL pointer dereferences
> (just ARCH=arm and ARCH=arm64 defconfigs) and reverting the folio and
> maple tree merges makes everything work properly again.
>
> [1]: https://lore.kernel.org/r/YhhRrBpXTFolUAKi@qian/
> [2]: https://lore.kernel.org/r/[email protected]/

Maybe? I'm trying to figure out why it's having issues... I've not been
able to reproduce it with just my maple tree branch. Steven Rostedt
found a bad commit that has been fixed in next-20220224, I believe [1].
It might be best to try next-20220225 and see if you have better luck,
if that's an option.

[1]:
https://lore.kernel.org/linux-fsdevel/[email protected]/T/#t


Thanks,
Liam

2022-02-26 23:20:37

by Hugh Dickins

Subject: Re: [PATCH v6 00/71] Introducing the Maple Tree

On Sat, 26 Feb 2022, Liam Howlett wrote:
> * Nathan Chancellor <[email protected]> [220225 18:00]:
> > On Fri, Feb 25, 2022 at 03:46:52PM -0500, Qian Cai wrote:
> > > On Fri, Feb 25, 2022 at 08:23:41PM +0000, Liam Howlett wrote:
> > > > I just booted an arm64 VM with my build and kasan enabled with no issue.
> > > > Could you please send me your config file for the build?
> > >
> > > On linux-next, I just do:
> > >
> > > $ make ARCH=arm64 defconfig debug.config [1]
> > >
> > > Then, I just generate some memory pressure to drive the system into
> > > swapping/the OOM killer and trigger it.
> > >
> > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/kernel/configs/debug.config
> >
> > Is the stacktrace [1] related to the conflict that Mark encountered [2]
> > while merging the maple and folio trees? Booting a next-20220223 kernel
> > on my Raspberry Pi 3 and 4 shows constant NULL pointer dereferences
> > (just ARCH=arm and ARCH=arm64 defconfigs) and reverting the folio and
> > maple tree merges makes everything work properly again.
> >
> > [1]: https://lore.kernel.org/r/YhhRrBpXTFolUAKi@qian/
> > [2]: https://lore.kernel.org/r/[email protected]/
>
> Maybe? I'm trying to figure out why it's having issues... I've not been
> able to reproduce it with just my maple tree branch. Steven Rostedt
> found a bad commit that has been fixed in next-20220224, I believe [1].
> It might be best to try next-20220225 and see if you have better luck,
> if that's an option.
>
> [1]:
> https://lore.kernel.org/linux-fsdevel/[email protected]/T/#t

Hi Liam, I think I have the beginnings of an answer for Qian's issue,
details below; but first off, I'd better make my own opinion clear.

I think this series, however good it may be so far, is undertested and
not approaching ready for 5.18, and should not have been pushed via
git tree into 5.17-rc-late-stage linux-next, subverting Andrew's mmotm.

I believe Stephen and Mark should drop it from linux-next for now,
while linux-next stabilizes for 5.18, while you work on testing and
fixing, and resubmit rebased to 5.18-rc1 or -rc2 when you're ready.

Qian's issue, BUG: KASAN: use-after-free in move_vma.isra.0,
with stacktrace implicating mremap.

I don't have KASAN on, but my config on this laptop does have
CONFIG_SLAB=y CONFIG_DEBUG_SLAB=y, and reported "Slab corruption"
during bootup of mmotm 2022-02-24-22-38 (plus Steven Rostedt's fix
to vfs_statx() which you mention in [1], missed from mmotm - not
related to your series of course, but essential for booting here).
Previous mmotm 2022-02-23-21-20 (plus vfs_statx() fix) was okay,
but did not contain the Maple Tree series.

The "vm_area" "Slab corruption" "Single bit error detected" 6b->7b
report is enough to indicate that the VM_ACCOUNT bit gets set in a
freed vma's vm_flags. And looking through "|= VM_ACCOUNT"s, my
suspicion fell on mm/mremap.c's move_vma(). Which I now see is
implicated in Qian's report too.

mremap move's VM_ACCOUNT accounting is difficult: not losing what's
already charged in case the move fails, accounting the extra without
overcharging, in the face of vmas being split and merged: difficult.

And there's an assumption documented in (now) do_mas_align_munmap()
"Note: mremap's move_vma VM_ACCOUNT handling assumes a partially
unmapped vm_area_struct will remain in use". My suspicion (not
verified) is that the maple tree changes are now violating that;
and I doubt that fixing it will be easy (I'm not going to try) -
doable, sure, but needs careful thought. (The way move_vma() masks
out VM_ACCOUNT in a live vma, then adds it back later: implications
for vma merging; all under mmap lock of course, but still awkward.)
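
[A paraphrased, schematic sketch of the pre-maple move_vma() pattern
referred to above -- not a literal excerpt from mm/mremap.c:]

	if (vm_flags & VM_ACCOUNT) {
		/* Hide VM_ACCOUNT so the munmap below does not uncharge it. */
		vma->vm_flags &= ~VM_ACCOUNT;
		excess = vma->vm_end - vma->vm_start - old_len;
		if (old_addr > vma->vm_start &&
		    old_addr + old_len < vma->vm_end)
			split = 1;
	}

	/* ... move the page tables, then unmap the old range ... */
	do_munmap(mm, old_addr, old_len, uf_unmap);

	/* Restore VM_ACCOUNT on whatever piece(s) of the old vma remain.
	 * This assumes the partially unmapped vm_area_struct is still the
	 * one left in the tree -- the assumption questioned above. */
	if (vma) {
		vma->vm_flags |= VM_ACCOUNT;
		if (split)
			vma->vm_next->vm_flags |= VM_ACCOUNT;
	}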

Though I did partially verify it, by commenting out the VM_ACCOUNT
adjustments in move_vma(), and booting several times without any
"Slab corruption". And did also kind-of verify it by booting with
#define VM_ACCOUNT 0: I was interested to see if that bit corruption
could account for all the other bugs I get, but sadly no.

Initially I was building without CONFIG_DEBUG_VM_MAPLE_TREE=y and
CONFIG_DEBUG_MAPLE_TREE=y, but have now switched them on. Hit bugs
without them and with them, but now they're on, validate_mm() often
catches something (whether it is correct to complain, I haven't
investigated, but I assume so since the debug option is there,
and problems are seen without it).

I say "often": it's very erratic. Once, a machine booted with mem=1G
ran kernel builds successfully swapping for 4.5 hours before faulting
in __nr_to_section in virt_to_folio ... while doing a __vma_adjust() -
you'll ask me for a full stacktrace, and I'll answer sorry, too many,
please try for yourself. Another time, for 1.5 hours before hitting
the BUG_ON(is_migration_entry(entry) && !PageLocked(p)) in
pfn_swap_entry_to_page() - suggesting anon_vma locking had not been
right while doing page migration (I was exercising THPs a lot).
But now, can I even get it to complete the boot sequence?

(I happened to be sampling /proc/meminfo during those successful
runs: looking afterwards at those samples, I see Committed_AS growing
steadily; whereas samples saved from pre-maple runs were not: that
would correspond to VM_ACCOUNT vm_enough_memory() charges leaking.)

You're having difficulty reproducing: I suggest trying with different
mem= on the boot command line (some of my loads I run with mem=700M,
some mem=1G, some with the workstation mem 8G) - I get the impression
that different mem= jumbles up memory allocations differently, so
what's harmless in one case is quickly harmful in another.
Or try changing between SLAB and SLUB.

One other thing: I haven't studied the source much, but didn't like
that rwsem_acquire(&mm->mmap_lock.dep_map, 0, 0, _THIS_IP_) hack in
exit_mmap(): sneaked into "mm: Start tracking VMAs with maple tree",
it reverts Suren's 64591e8605d6 "mm: protect free_pgtables with
mmap_lock write lock in exit_mmap", without explanating how it becomes
safe with maple tree (I imagine it's not). That would have to be a
separate, justified patch if it goes forward. (The nearby conflict
resolutions in mmotm and next are not quite right there, some stuff
my mm/munlock series deleted has resurfaced: but it's harmless, and
not worth worrying about if maple tree is dropped from linux-next.)

Hugh

2022-02-27 02:26:09

by Vasily Gorbik

Subject: Re: [PATCH v6 00/71] Introducing the Maple Tree

On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote:
> The maple tree is an RCU-safe range based B-tree designed to use modern
> processor cache efficiently. There are a number of places in the kernel
> that a non-overlapping range-based tree would be beneficial, especially
> one with a simple interface. The first user that is covered in this
> patch set is the vm_area_struct, where three data structures are
> replaced by the maple tree: the augmented rbtree, the vma cache, and the
> linked list of VMAs in the mm_struct. The long term goal is to reduce
> or remove the mmap_sem contention.
>
> The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> nodes. With the increased branching factor, it is significantly shorter than
> the rbtree so it has fewer cache misses. The removal of the linked list
> between subsequent entries also reduces the cache misses and the need to pull
> in the previous and next VMA during many tree alterations.
>
> This patch is based on v5.17-rc4
>
> git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214

Hi Liam,

this patch series completely breaks s390. Besides the endianness issue
in maple trees reported here:

https://lore.kernel.org/all/[email protected]/

and with this endianness issue fixed (fixup from here ^^^) we still get numerous
KASAN reports starting with

[PATCH v6 10/71] mm: Start tracking VMAs with maple tree

mostly looking like this:

BUG: KASAN: use-after-free in mas_descend_adopt+0x113a/0x1408
Read of size 8 at addr 00000000b1d4e900 by task cat/610

CPU: 34 PID: 610 Comm: cat Tainted: G W 5.17.0-rc4-91313-g592c3b299aad #1
Hardware name: IBM 3906 M04 704 (LPAR)
Call Trace:
[<000000000234647a>] dump_stack_lvl+0xfa/0x150
[<0000000002328154>] print_address_description.constprop.0+0x64/0x340
[<00000000008f5ba6>] kasan_report+0x13e/0x1a8
[<00000000017f0c12>] mas_descend_adopt+0x113a/0x1408
[<00000000018076ec>] mas_spanning_rebalance.isra.0+0x5164/0x6a20
[<000000000180a71e>] mas_wr_spanning_store.isra.0+0x476/0xbc8
[<00000000018189bc>] mas_store_gfp+0xd4/0x188
[<0000000000827bae>] vma_mt_szero+0x146/0x368
[<000000000082e990>] __do_munmap+0x340/0xe20
[<000000000082f580>] __vm_munmap+0x110/0x1e8
[<000000000082f76e>] __s390x_sys_munmap+0x6e/0x90
[<000000000010dc6c>] do_syscall+0x22c/0x328
[<000000000234d3d2>] __do_syscall+0x9a/0xf8
[<0000000002374ed2>] system_call+0x82/0xb0
INFO: lockdep is turned off.

Allocated by task 610:
kasan_save_stack+0x34/0x58
__kasan_slab_alloc+0x84/0xa8
kmem_cache_alloc+0x20c/0x520
mas_alloc_nodes+0x26a/0x4c8
mas_split.isra.0+0x2aa/0x1418
mas_wr_modify+0x3fa/0xd28
mas_store_gfp+0xd4/0x188
vma_store+0x17a/0x3d8
vma_link+0xac/0x798
mmap_region+0xa5a/0x10b8
do_mmap+0x7c2/0xa90
vm_mmap_pgoff+0x186/0x250
ksys_mmap_pgoff+0x334/0x400
__s390x_sys_old_mmap+0xf4/0x130
do_syscall+0x22c/0x328
__do_syscall+0x9a/0xf8
system_call+0x82/0xb0
Freed by task 610:
kasan_save_stack+0x34/0x58
kasan_set_track+0x36/0x48
kasan_set_free_info+0x34/0x58
____kasan_slab_free+0x11c/0x188
__kasan_slab_free+0x24/0x30
kmem_cache_free_bulk.part.0+0xec/0x538
mas_destroy+0x2e4/0x710
mas_store_gfp+0xf4/0x188
vma_mt_szero+0x146/0x368
__vma_adjust+0x155a/0x2188
__split_vma+0x228/0x450
mprotect_fixup+0x4f2/0x618
do_mprotect_pkey.constprop.0+0x328/0x600
__s390x_sys_mprotect+0x8e/0xb8
do_syscall+0x22c/0x328
__do_syscall+0x9a/0xf8
system_call+0x82/0xb0

The buggy address belongs to the object at 00000000b1d4e900
which belongs to the cache maple_node of size 256
The buggy address is located 0 bytes inside of
256-byte region [00000000b1d4e900, 00000000b1d4ea00)
The buggy address belongs to the page:
page:0000400002c75200 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xb1d48
head:0000400002c75200 order:3 compound_mapcount:0 compound_pincount:0
flags: 0x3ffff00000010200(slab|head|node=0|zone=1|lastcpupid=0x1ffff)
raw: 3ffff00000010200 0000400002c91e08 000040000263b608 000000008009ed00
raw: 0000000000000000 0020004000000000 ffffffff00000001 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
00000000b1d4e800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
00000000b1d4e880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>00000000b1d4e900: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
00000000b1d4e980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
00000000b1d4ea00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

with eventual crash.

How come none of the s390 architecture maintainers nor the s390 mailing
list were on CC for changes which affect s390 to this extent? None of
this has apparently been tested on s390 or any other big-endian system.
A shoulder tap to give it a try would have been helpful.

Now we are just starting to look at the problems. Until the issues are
resolved, this patch series has to be dropped from linux-next.

2022-02-27 20:57:04

by Hugh Dickins

Subject: Re: [PATCH v6 00/71] Introducing the Maple Tree

On Sat, 26 Feb 2022, Hugh Dickins wrote:
>
> I say "often": it's very erratic. Once, a machine booted with mem=1G
> ran kernel builds successfully swapping for 4.5 hours before faulting
> in __nr_to_section in virt_to_folio ... while doing a __vma_adjust() -
> you'll ask me for a full stacktrace, and I'll answer sorry, too many,
> please try for yourself. Another time, for 1.5 hours before hitting
> the BUG_ON(is_migration_entry(entry) && !PageLocked(p)) in
> pfn_swap_entry_to_page() - suggesting anon_vma locking had not been
> right while doing page migration (I was exercising THPs a lot).
> But now, can I even get it to complete the boot sequence?

Sorry, I never mentioned the architecture: x86_64 throughout for me.

2022-02-28 12:50:01

by Qian Cai

Subject: Re: [PATCH v6 00/71] Introducing the Maple Tree

On Sat, Feb 26, 2022 at 01:58:09AM +0000, Liam Howlett wrote:
> Maybe? I'm trying to figure out why it's having issues... I've not been
> able to reproduce it with just my maple tree branch. Steven Rostedt
> found a bad commit that has been fixed in next-20220224, I believe [1].
> It might be best to try next-20220225 and see if you have better luck,
> if that's an option.

Same thing for next-20220225. I just need to boot it and it will crash
shortly after.

BUG: KASAN: use-after-free in move_vma.isra.0
Read of size 8 at addr ffff00088f74c7e8 by task apt-check/2331

CPU: 2 PID: 2331 Comm: apt-check Tainted: G W 5.17.0-rc5-next-20220225 #245
Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
Call trace:
dump_backtrace
show_stack
dump_stack_lvl
print_address_description.constprop.0
kasan_report
__asan_report_load8_noabort
move_vma.isra.0
__do_sys_mremap
__arm64_sys_mremap
invoke_syscall
el0_svc_common.constprop.0
do_el0_svc
el0_svc
el0t_64_sync_handler
el0t_64_sync

Allocated by task 2331:
kasan_save_stack
__kasan_slab_alloc
slab_post_alloc_hook
kmem_cache_alloc
vm_area_dup
__split_vma
do_mas_align_munmap.isra.0
do_mas_munmap
__vm_munmap
__arm64_sys_munmap
invoke_syscall
el0_svc_common.constprop.0
do_el0_svc
el0_svc
el0t_64_sync_handler
el0t_64_sync

Freed by task 2331:
kasan_save_stack
kasan_set_track
kasan_set_free_info
__kasan_slab_free
kmem_cache_free
vm_area_free
remove_vma
do_mas_align_munmap.isra.0
do_mas_munmap
do_munmap
move_vma.isra.0
__do_sys_mremap
__arm64_sys_mremap
invoke_syscall
el0_svc_common.constprop.0
do_el0_svc
el0_svc
el0t_64_sync_handler
el0t_64_sync

The buggy address belongs to the object at ffff00088f74c7c8
which belongs to the cache vm_area_struct of size 144
The buggy address is located 32 bytes inside of
144-byte region [ffff00088f74c7c8, ffff00088f74c858)
The buggy address belongs to the physical page:
page:fffffc00223dd300 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x90f74c
head:fffffc00223dd300 order:2 compound_mapcount:0 compound_pincount:0
memcg:ffff000884a75001
flags: 0xbfffc0000010200(slab|head|node=0|zone=2|lastcpupid=0xffff)
raw: 0bfffc0000010200 fffffc00223dd008 fffffc0021e6bc08 ffff000800f5b380
raw: 0000000000000000 0000000000210021 00000001ffffffff ffff000884a75001
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff00088f74c680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff00088f74c700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff00088f74c780: fc fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb
^
ffff00088f74c800: fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc
ffff00088f74c880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Disabling lock debugging due to kernel taint
Unable to handle kernel paging request at virtual address dfff800000000004
KASAN: null-ptr-deref in range [0x0000000000000020-0x0000000000000027]
Mem abort info:
ESR = 0x96000004
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x04: level 0 translation fault
Data abort info:
ISV = 0, ISS = 0x00000004
CM = 0, WnR = 0
[dfff800000000004] address between user and kernel address ranges
Internal error: Oops: 96000004 [#1] PREEMPT SMP
Modules linked in: ipmi_devintf ipmi_msghandler cppc_cpufreq drm ip_tables x_tables ipv6 btrfs blake2b_generic libcrc32c xor xor_neon raid6_pq zstd_compress dm_mod crct10dif_ce nvme mlx5_core mpt3sas nvme_core raid_class
CPU: 2 PID: 2331 Comm: apt-check Tainted: G B W 5.17.0-rc5-next-20220225 #245
Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : move_vma.isra.0
lr : move_vma.isra.0
sp : ffff800016df7ad0
x29: ffff800016df7ad0 x28: ffff00088c7ba6c8 x27: 0000000000000001
x26: 0000ffffad180000 x25: ffff00088bea1c40 x24: ffff00088bea1dc8
x23: ffff00088c988040 x22: 0000000001800000 x21: 0000ffffab82d000
x20: 0000000000000000 x19: 1ffff00002dbef6c x18: 0000000000000066
x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
x14: 0000000000000000 x13: 0000000000001490 x12: 0000000000000001
x11: 0000000000000009 x10: 0000000000000000 x9 : ffff800008445544
x8 : 000000003fffffff x7 : ffff80000928fbc4 x6 : 00000000f1f1f1f1
x5 : dfff800000000000 x4 : ffff700002dbef18 x3 : 0000000041b58ab3
x2 : 0000000000000004 x1 : dfff800000000000 x0 : 0000000000000020
Call trace:
move_vma.isra.0
__do_sys_mremap
__arm64_sys_mremap
invoke_syscall
el0_svc_common.constprop.0
do_el0_svc
el0_svc
el0t_64_sync_handler
el0t_64_sync
Code: 91008000 d2d00001 f2fbffe1 d343fc02 (38e16841)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Oops: Fatal exception
SMP: stopping secondary CPUs
Kernel Offset: 0x160000 from 0xffff800008000000
PHYS_OFFSET: 0x80000000
CPU features: 0x00,00000942,40000846
Memory Limit: none
---[ end Kernel panic - not syncing: Oops: Fatal exception ]---
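
For readers trying to follow the KASAN report above: the allocation and
free stacks both run through do_mas_align_munmap(), the free happens under
do_munmap() called from move_vma() itself, and move_vma() then still reads
the object. One plausible reading, not confirmed anywhere in this thread,
is the classic pattern in which a caller caches a pointer across a call
that can free the object behind it. A minimal user-space C sketch of that
pattern, with entirely made-up names, is:

/* Hypothetical illustration of the pattern suggested by the alloc/free
 * traces above: a caller caches a pointer, a callee frees the object as
 * a side effect, and the caller then reads through the stale pointer.
 * Names (fake_vma, unmap_range, relocate_mapping) are invented for this
 * sketch and are not kernel symbols. */
#include <stdio.h>
#include <stdlib.h>

struct fake_vma {
	unsigned long start;
	unsigned long end;
};

/* Stands in for the unmap path: may free a mapping that matches. */
static void unmap_range(struct fake_vma **table, unsigned long start)
{
	if (table[0] && table[0]->start == start) {
		free(table[0]);		/* object is gone... */
		table[0] = NULL;
	}
}

/* Stands in for the caller: keeps its own copy of the pointer. */
static void relocate_mapping(struct fake_vma **table, unsigned long start)
{
	struct fake_vma *vma = table[0];	/* cached before the unmap */

	unmap_range(table, start);
	/* ...but the cached pointer is still dereferenced (use-after-free,
	 * which ASan/KASAN would flag as a read of freed memory). */
	printf("old end: %lu\n", vma->end);
}

int main(void)
{
	struct fake_vma *table[1] = { malloc(sizeof(struct fake_vma)) };

	table[0]->start = 0;
	table[0]->end = 4096;
	relocate_mapping(table, 0);
	return 0;
}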

2022-02-28 17:39:32

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH v6 00/71] Introducing the Maple Tree

* Hugh Dickins <[email protected]> [220226 18:20]:
> On Sat, 26 Feb 2022, Liam Howlett wrote:
> > * Nathan Chancellor <[email protected]> [220225 18:00]:
> > > On Fri, Feb 25, 2022 at 03:46:52PM -0500, Qian Cai wrote:
> > > > On Fri, Feb 25, 2022 at 08:23:41PM +0000, Liam Howlett wrote:
> > > > > I just booted an arm64 VM with my build and kasan enabled with no issue.
> > > > > Could you please send me your config file for the build?
> > > >
> > > > On linux-next, I just do:
> > > >
> > > > $ make ARCH=arm64 defconfig debug.config [1]
> > > >
> > > > Then, I just generate some memory pressure to push the system into
> > > > swapping/the OOM killer and trigger it.
> > > >
> > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/kernel/configs/debug.config
> > >
> > > Is the stacktrace [1] related to the conflict that Mark encountered [2]
> > > while merging the maple and folio trees? Booting a next-20220223 kernel
> > > on my Raspberry Pi 3 and 4 shows constant NULL pointer dereferences
> > > (just ARCH=arm and ARCH=arm64 defconfigs) and reverting the folio and
> > > maple tree merges makes everything work properly again.
> > >
> > > [1]: https://lore.kernel.org/r/YhhRrBpXTFolUAKi@qian/
> > > [2]: https://lore.kernel.org/r/[email protected]/
> >
> > Maybe? I'm trying to figure out why it's having issues. I've not been
> > able to reproduce it with just my maple tree branch. Steven Rostedt
> > found a bad commit that has been fixed in next-20220224, I believe
> > [1]. It might be best to try next-20220225 and see if you have better
> > luck, if that's an option.
> >
> > [1]:
> > https://lore.kernel.org/linux-fsdevel/[email protected]/T/#t
>
> Hi Liam, I think I have the beginnings of an answer for Qian's issue,
> details below; but first off, I'd better make my own opinion clear.
>
> I think this series, however good it may be so far, is undertested and
> not approaching ready for 5.18, and should not have been pushed via
> git tree into 5.17-rc-late-stage linux-next, subverting Andrew's mmotm.
>
> I believe Stephen and Mark should drop it from linux-next for now,
> while linux-next stabilizes for 5.18, while you work on testing and
> fixing, and resubmit rebased to 5.18-rc1 or -rc2 when you're ready.

Hugh,

Thank you for your testing, review, and thoughts on the maple tree. I
agree with your findings and see the error, although I still have
difficulty reproducing it on amd64 and arm64 VMs.

Stephen, please drop the maple tree from linux-next.


Thank you,
Liam
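
For anyone who wants to try the reproduction recipe Qian quotes above
(debug.config on top of the defconfig, then memory pressure), here is a
minimal sketch of one pressure generator. It is an illustration under
assumptions, not the workload actually used in this thread: the idea is
simply to dirty more anonymous memory than the machine has available (for
example on a machine booted with mem=1G, as in Hugh's setup) so that
swapping and eventually the OOM killer kick in, while the repeated
mmap/munmap cycles also exercise the VMA paths.

/* Hypothetical memory-pressure generator; adjust TOTAL_BYTES to exceed
 * the RAM of the target machine. */
#include <sys/mman.h>
#include <unistd.h>

#define TOTAL_BYTES	(2UL << 30)	/* 2 GiB; pick a value larger than RAM */

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);

	for (;;) {
		/* One large anonymous mapping, dirtied page by page: once it
		 * exceeds free RAM the kernel has to start swapping (or the
		 * OOM killer fires), which is the pressure being described. */
		char *p = mmap(NULL, TOTAL_BYTES, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		for (unsigned long off = 0; off < TOTAL_BYTES; off += page)
			p[off] = 1;
		munmap(p, TOTAL_BYTES);	/* unmap and repeat for sustained churn */
	}
	return 0;
}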

2022-02-28 17:56:36

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH v6 00/71] Introducing the Maple Tree

* Vasily Gorbik <[email protected]> [220226 21:23]:
> On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote:
> > The maple tree is an RCU-safe range based B-tree designed to use modern
> > processor cache efficiently. There are a number of places in the kernel
> > that a non-overlapping range-based tree would be beneficial, especially
> > one with a simple interface. The first user that is covered in this
> > patch set is the vm_area_struct, where three data structures are
> > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > linked list of VMAs in the mm_struct. The long term goal is to reduce
> > or remove the mmap_sem contention.
> >
> > The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> > nodes. With the increased branching factor, it is significantly shorter than
> > the rbtree so it has fewer cache misses. The removal of the linked list
> > between subsequent entries also reduces the cache misses and the need to pull
> > in the previous and next VMA during many tree alterations.
> >
> > This patch is based on v5.17-rc4
> >
> > git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214
>
> Hi Liam,
>
> this patch series completely breaks s390. Besides the endianness issue
> in maple trees reported here:
>
> https://lore.kernel.org/all/[email protected]/
>
> and with this endianness issue fixed (fixup from here ^^^), we still get
> numerous KASAN reports, starting with
>
> [PATCH v6 10/71] mm: Start tracking VMAs with maple tree

Thanks for looking at this. That commit had two issues, which I was
debugging on parisc for most of last week. The first is that I was not
actually writing the stack expansion to the maple tree in that commit;
that write was added in a later commit. The second issue, which was most
likely causing your crashes, was the gap calculation in
unmapped_area_topdown().
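
To make that second issue more concrete: a top-down search has to find the
highest gap between existing mappings that is large enough for the
request, clamped to the allowed limits, and an error in that gap
arithmetic can return an address that overlaps a neighbouring mapping.
The following is a minimal user-space sketch of that kind of search; it is
illustrative only, with made-up names, and is not the kernel's
unmapped_area_topdown().

/* Illustrative top-down gap search over non-overlapping [start, end)
 * ranges sorted by address. Not the kernel implementation; it only shows
 * the clamping and gap arithmetic that a bug in this area gets wrong. */
#include <stdio.h>

struct range { unsigned long start, end; };

/* Returns the start of the highest gap of at least `len` bytes within
 * [low_limit, high_limit), or 0 if no such gap exists. */
static unsigned long find_gap_topdown(const struct range *maps, int n,
				      unsigned long len,
				      unsigned long low_limit,
				      unsigned long high_limit)
{
	unsigned long gap_end = high_limit;

	/* Walk mappings from highest to lowest; i == -1 is the region
	 * below the lowest mapping. */
	for (int i = n - 1; i >= -1; i--) {
		unsigned long gap_start = (i >= 0) ? maps[i].end : low_limit;

		if (gap_start < low_limit)
			gap_start = low_limit;
		if (gap_end > gap_start && gap_end - gap_start >= len)
			return gap_end - len;	/* highest fitting gap */
		if (i >= 0 && maps[i].start < gap_end)
			gap_end = maps[i].start;
	}
	return 0;
}

int main(void)
{
	const struct range maps[] = {
		{ 0x10000, 0x20000 },
		{ 0x40000, 0x50000 },
	};

	/* Expect 0x58000: the top of the gap between 0x50000 and 0x60000. */
	printf("gap at %#lx\n",
	       find_gap_topdown(maps, 2, 0x8000, 0x1000, 0x60000));
	return 0;
}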

With the maple tree (3dea5db7fbb5), I am able to boot parisc and s390
virtual machines, although the issue which Hugh discusses in detail
still exists [1]. The latest maple tree can be found on GitHub [2]. I
am still investigating why the heap changed location on parisc, to make
sure it is not an issue.

1: https://lore.kernel.org/all/[email protected]/
2: https://github.com/oracle/linux-uek/tree/maple/mainline

Regards,
Liam