Previous versions:
v2: https://lore.kernel.org/lkml/[email protected]/
v1: https://lore.kernel.org/all/[email protected]/
RFC: https://lore.kernel.org/all/[email protected]/
LWN article describing the feature:
https://lwn.net/Articles/906852/
The per-VMA locks idea was discussed during the SPF [1] discussion at LSF/MM
last year [2], which concluded with the suggestion that "a reader/writer
semaphore could be put into the VMA itself; that would have the effect of
using the VMA as a sort of range lock. There would still be contention at
the VMA level, but it would be an improvement." This patchset implements
that suggested approach.
When handling page faults we look up the VMA that contains the faulting
page under RCU protection and try to acquire its lock. If that fails we
fall back to using mmap_lock, similar to how SPF handled this situation.
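As a rough sketch (illustrative only; lock_vma_under_rcu(), vma_end_read()
and FAULT_FLAG_VMA_LOCK are introduced by later patches in this series, and
the exact arch-specific code differs), the fault path becomes:

    vma = lock_vma_under_rcu(mm, address);  /* RCU lookup + per-VMA read-trylock */
    if (vma) {
            fault = handle_mm_fault(vma, address,
                                    flags | FAULT_FLAG_VMA_LOCK, regs);
            vma_end_read(vma);
            if (!(fault & VM_FAULT_RETRY))
                    goto done;              /* handled without taking mmap_lock */
    }
    /* fall back to the traditional mmap_lock read path */
    mmap_read_lock(mm);
    ...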
One notable way the implementation deviates from the proposal is the way
VMAs are write-locked. During some mm updates, multiple VMAs need to be
locked until the end of the update (e.g. vma_merge, split_vma, etc).
Tracking all the locked VMAs, avoiding recursive locks and figuring out when
it's safe to unlock previously locked VMAs would make the code more
complex. So, instead of the usual lock/unlock pattern, the proposed
solution marks a VMA as locked and provides an efficient way to:
1. Identify locked VMAs.
2. Unlock all locked VMAs in bulk.
We also postpone unlocking the locked VMAs until the end of the update,
when we do mmap_write_unlock. Potentially this keeps a VMA locked for
longer than is absolutely necessary but it results in a big reduction of
code complexity.
Write-locking a VMA is done using two sequence numbers - one in the
vm_area_struct and one in the mm_struct. A VMA is considered write-locked
when these sequence numbers are equal. To write-lock a VMA we set the
sequence number in the vm_area_struct to be equal to the sequence number in
the mm_struct. To unlock all VMAs we increment the mm_struct's seq number.
This allows for an efficient way to track locked VMAs and to drop the locks
on all VMAs at the end of the update.
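In simplified form (a sketch only; the real helpers vma_start_read(),
vma_start_write() and vma_end_write_all() added later in the series also use
vma->lock and the proper memory ordering):

    /* write-lock ("mark locked") a VMA; mmap_lock is held for write */
    vma->vm_lock_seq = mm->mm_lock_seq;

    /* page fault path: the read-lock attempt fails if the VMA is marked */
    if (vma->vm_lock_seq == mm->mm_lock_seq)
            return false;   /* write-locked, fall back to mmap_lock */

    /* mmap_write_unlock time: unlock all write-locked VMAs in bulk */
    mm->mm_lock_seq++;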
The patchset implements per-VMA locking only for anonymous pages which
are not in swap and avoids userfaultfds, as their implementation is more
complex. Additional support for file-backed page faults, swapped and
userfaultfd pages can be added incrementally.
Performance benchmarks show similar, although slightly smaller, benefits
than with the SPF patchset (~75% of the SPF benefits). Still, with its lower
complexity this approach might be more desirable.
Since the RFC was posted in September 2022, two separate Google teams outside
of Android evaluated the patchset and confirmed positive results. Here are
the known use cases where per-VMA locks show benefits:
Android:
Launch times of apps with a high number of threads (~100) improve by up to
20%. Each thread mmaps several areas upon startup (stack, thread-local
storage (TLS), thread signal stack, indirect ref table), which requires
taking mmap_lock in write mode. Page faults take mmap_lock in read mode.
During app launch, both thread creation and the page faults establishing the
active working set happen in parallel, and that causes lock contention
between mm writers and readers even if updates and page faults are
happening in different VMAs. Per-vma locks prevent this contention by
providing a more granular lock.
Google Fibers:
We have several dynamically sized thread pools that spawn new threads
under increased load and reduce their number when idling. For example,
Google's in-process scheduling/threading framework, UMCG/Fibers, is backed
by such a thread pool. When idling, only a small number of idle worker
threads are available; when a spike of incoming requests arrives, each
request is handled in its own "fiber", which is a work item posted onto a
UMCG worker thread; quite often these spikes lead to a number of new
threads spawning. Each new thread needs to allocate and register an RSEQ
section in its TLS, then register itself with the kernel as a UMCG worker
thread, and only after that can it be considered by the in-process
UMCG/Fiber scheduler as available to do useful work. In short, during an
incoming workload spike new threads have to be spawned, and they perform
several syscalls (RSEQ registration, UMCG worker registration, memory
allocations) before they can actually start doing useful work. Removing
any bottlenecks on this thread startup path will greatly improve our
services' latencies when faced with request/workload spikes.
At high scale, mmap_lock contention during thread creation and stack page
faults leads to user-visible multi-second serving latencies in a similar
pattern to Android app startup. The per-VMA locking patchset has been run
successfully in limited experiments with user-facing production workloads.
In these experiments, we observed that the peak thread creation rate was
high enough that thread creation was no longer a bottleneck.
TCP zerocopy receive:
From the point of view of TCP zerocopy receive, the per-vma lock patch is
massively beneficial.
In today's implementation, a process with N threads where N - 1 are
performing zerocopy receive and 1 thread is performing madvise() with the
write lock taken (e.g. it needs to change vm_flags) will result in all N - 1
receive threads blocking until the madvise is done. Conversely, in a busy
process receiving a lot of data, an madvise operation that does need to
take the mmap lock in write mode will need to wait for all of the receives
to be done - a lose:lose proposition. Per-VMA locking, by definition,
_removes_ this source of contention entirely.
There are other benefits for receive as well, chiefly a reduction in
cacheline bouncing across receiving threads for locking/unlocking the
single mmap lock. On an RPC style synthetic workload with 4KB RPCs:
1a) The find+lock+unlock VMA path in the base case, without the per-vma
lock patchset, is about 0.7% of cycles as measured by perf.
1b) mmap_read_lock + mmap_read_unlock in the base case is about 0.5%
cycles overall - most of this is within the TCP read hotpath (a small
fraction is 'other' usage in the system).
2a) The find+lock+unlock VMA path, with the per-vma patchset and a trivial
patch written to take advantage of it in TCP, is about 0.4% of cycles
(down from 0.7% above)
2b) mmap_read_lock + mmap_read_unlock in the per-vma patchset is < 0.1%
cycles and is out of the TCP read hotpath entirely (down from 0.5% before,
the remaining usage is the 'other' usage in the system).
So, in addition to entirely removing an onerous source of contention, it
also reduces the CPU cycles of TCP receive zerocopy by about 0.5%+
(compared to overall cycles in perf) for the 'small' RPC scenario.
The patchset structure is:
0001-0008: Enable maple-tree RCU mode
0009-0033: Main per-vma locks patchset
0034-0035: Performance optimizations
Changes since v2:
- Add Patch 07/35 lockdep warning fix in maple-tree, per Liam Howlett
- Introduce vma->detached flag to optimize lock_vma_under_rcu to address
pft-threads regression reported by Punit Agrawal
- Remove mm->mm_users check in do_page_fault (which was special-casing
single-threaded processes and was affecting overall performance) to
address pft-threads regression reported by Punit Agrawal
The patchset applies cleanly over mm-unstable branch.
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lwn.net/Articles/893906/
[3] https://lore.kernel.org/all/[email protected]/
[4] https://lore.kernel.org/all/[email protected]/
[5] https://lore.kernel.org/all/[email protected]/
Laurent Dufour (1):
powerpc/mm: try VMA lock-based page fault handling first
Liam Howlett (4):
maple_tree: Be more cautious about dead nodes
maple_tree: Detect dead nodes in mas_start()
maple_tree: Fix freeing of nodes in rcu mode
maple_tree: remove extra smp_wmb() from mas_dead_leaves()
Liam R. Howlett (4):
maple_tree: Fix write memory barrier of nodes once dead for RCU mode
maple_tree: Add smp_rmb() to dead node detection
maple_tree: Add RCU lock checking to rcu callback functions
mm: Enable maple tree RCU mode by default.
Michel Lespinasse (1):
mm: rcu safe VMA freeing
Suren Baghdasaryan (25):
mm: introduce CONFIG_PER_VMA_LOCK
mm: move mmap_lock assert function definitions
mm: add per-VMA lock and helper functions to control it
mm: mark VMA as being written when changing vm_flags
mm/mmap: move VMA locking before vma_adjust_trans_huge call
mm/khugepaged: write-lock VMA while collapsing a huge page
mm/mmap: write-lock VMAs before merging, splitting or expanding them
mm/mmap: write-lock VMA before shrinking or expanding it
mm/mremap: write-lock VMA while remapping it to a new address range
mm: write-lock VMAs before removing them from VMA tree
mm: conditionally write-lock VMA in free_pgtables
mm/mmap: write-lock adjacent VMAs if they can grow into unmapped area
kernel/fork: assert no VMA readers during its destruction
mm/mmap: prevent pagefault handler from racing with mmu_notifier
registration
mm: introduce vma detached flag
mm: introduce lock_vma_under_rcu to be used from arch-specific code
mm: fall back to mmap_lock if vma->anon_vma is not yet set
mm: add FAULT_FLAG_VMA_LOCK flag
mm: prevent do_swap_page from handling page faults under VMA lock
mm: prevent userfaults to be handled under per-vma lock
mm: introduce per-VMA lock statistics
x86/mm: try VMA lock-based page fault handling first
arm64/mm: try VMA lock-based page fault handling first
mm/mmap: free vm_area_struct without call_rcu in exit_mmap
mm: separate vma->lock from vm_area_struct
arch/arm64/Kconfig | 1 +
arch/arm64/mm/fault.c | 36 ++++
arch/powerpc/mm/fault.c | 41 ++++
arch/powerpc/platforms/powernv/Kconfig | 1 +
arch/powerpc/platforms/pseries/Kconfig | 1 +
arch/x86/Kconfig | 1 +
arch/x86/mm/fault.c | 36 ++++
include/linux/mm.h | 108 +++++++++-
include/linux/mm_types.h | 32 ++-
include/linux/mmap_lock.h | 37 ++--
include/linux/vm_event_item.h | 6 +
include/linux/vmstat.h | 6 +
kernel/fork.c | 99 +++++++--
lib/maple_tree.c | 269 +++++++++++++++++--------
mm/Kconfig | 12 ++
mm/Kconfig.debug | 6 +
mm/init-mm.c | 3 +
mm/internal.h | 2 +-
mm/khugepaged.c | 5 +
mm/memory.c | 72 ++++++-
mm/mmap.c | 72 +++++--
mm/mremap.c | 1 +
mm/nommu.c | 5 +
mm/rmap.c | 31 +--
mm/vmstat.c | 6 +
tools/testing/radix-tree/maple.c | 16 ++
26 files changed, 754 insertions(+), 151 deletions(-)
--
2.39.1
From: Liam Howlett <[email protected]>
When initially starting a search, the root node may already be in the
process of being replaced in RCU mode. Detect and restart the walk if
this is the case. This is necessary for the RCU mode of the maple tree.
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
lib/maple_tree.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index cc356b8369ad..089cd76ec379 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -1360,12 +1360,16 @@ static inline struct maple_enode *mas_start(struct ma_state *mas)
mas->max = ULONG_MAX;
mas->depth = 0;
+retry:
root = mas_root(mas);
/* Tree with nodes */
if (likely(xa_is_node(root))) {
mas->depth = 1;
mas->node = mte_safe_root(root);
mas->offset = 0;
+ if (mte_dead_node(mas->node))
+ goto retry;
+
return NULL;
}
--
2.39.1
From: Liam Howlett <[email protected]>
The walk to destroy the nodes was not always setting the node type and
would result in a destroy method potentially using the values as nodes.
Avoid this by setting the correct node types. This is necessary for the
RCU mode of the maple tree.
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
lib/maple_tree.c | 73 ++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 62 insertions(+), 11 deletions(-)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 089cd76ec379..44d6ce30b28e 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -902,6 +902,44 @@ static inline void ma_set_meta(struct maple_node *mn, enum maple_type mt,
meta->end = end;
}
+/*
+ * mas_clear_meta() - clear the metadata information of a node, if it exists
+ * @mas: The maple state
+ * @mn: The maple node
+ * @mt: The maple node type
+ * @offset: The offset of the highest sub-gap in this node.
+ * @end: The end of the data in this node.
+ */
+static inline void mas_clear_meta(struct ma_state *mas, struct maple_node *mn,
+ enum maple_type mt)
+{
+ struct maple_metadata *meta;
+ unsigned long *pivots;
+ void __rcu **slots;
+ void *next;
+
+ switch (mt) {
+ case maple_range_64:
+ pivots = mn->mr64.pivot;
+ if (unlikely(pivots[MAPLE_RANGE64_SLOTS - 2])) {
+ slots = mn->mr64.slot;
+ next = mas_slot_locked(mas, slots,
+ MAPLE_RANGE64_SLOTS - 1);
+ if (unlikely((mte_to_node(next) && mte_node_type(next))))
+ return; /* The last slot is a node, no metadata */
+ }
+ fallthrough;
+ case maple_arange_64:
+ meta = ma_meta(mn, mt);
+ break;
+ default:
+ return;
+ }
+
+ meta->gap = 0;
+ meta->end = 0;
+}
+
/*
* ma_meta_end() - Get the data end of a node from the metadata
* @mn: The maple node
@@ -5455,20 +5493,22 @@ static inline int mas_rev_alloc(struct ma_state *mas, unsigned long min,
* mas_dead_leaves() - Mark all leaves of a node as dead.
* @mas: The maple state
* @slots: Pointer to the slot array
+ * @type: The maple node type
*
* Must hold the write lock.
*
* Return: The number of leaves marked as dead.
*/
static inline
-unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots)
+unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots,
+ enum maple_type mt)
{
struct maple_node *node;
enum maple_type type;
void *entry;
int offset;
- for (offset = 0; offset < mt_slot_count(mas->node); offset++) {
+ for (offset = 0; offset < mt_slots[mt]; offset++) {
entry = mas_slot_locked(mas, slots, offset);
type = mte_node_type(entry);
node = mte_to_node(entry);
@@ -5487,14 +5527,13 @@ unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots)
static void __rcu **mas_dead_walk(struct ma_state *mas, unsigned char offset)
{
- struct maple_node *node, *next;
+ struct maple_node *next;
void __rcu **slots = NULL;
next = mas_mn(mas);
do {
- mas->node = ma_enode_ptr(next);
- node = mas_mn(mas);
- slots = ma_slots(node, node->type);
+ mas->node = mt_mk_node(next, next->type);
+ slots = ma_slots(next, next->type);
next = mas_slot_locked(mas, slots, offset);
offset = 0;
} while (!ma_is_leaf(next->type));
@@ -5558,11 +5597,14 @@ static inline void __rcu **mas_destroy_descend(struct ma_state *mas,
node = mas_mn(mas);
slots = ma_slots(node, mte_node_type(mas->node));
next = mas_slot_locked(mas, slots, 0);
- if ((mte_dead_node(next)))
+ if ((mte_dead_node(next))) {
+ mte_to_node(next)->type = mte_node_type(next);
next = mas_slot_locked(mas, slots, 1);
+ }
mte_set_node_dead(mas->node);
node->type = mte_node_type(mas->node);
+ mas_clear_meta(mas, node, node->type);
node->piv_parent = prev;
node->parent_slot = offset;
offset = 0;
@@ -5582,13 +5624,18 @@ static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags,
MA_STATE(mas, &mt, 0, 0);
- if (mte_is_leaf(enode))
+ mas.node = enode;
+ if (mte_is_leaf(enode)) {
+ node->type = mte_node_type(enode);
goto free_leaf;
+ }
+ ma_flags &= ~MT_FLAGS_LOCK_MASK;
mt_init_flags(&mt, ma_flags);
mas_lock(&mas);
- mas.node = start = enode;
+ mte_to_node(enode)->ma_flags = ma_flags;
+ start = enode;
slots = mas_destroy_descend(&mas, start, 0);
node = mas_mn(&mas);
do {
@@ -5596,7 +5643,8 @@ static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags,
unsigned char offset;
struct maple_enode *parent, *tmp;
- node->slot_len = mas_dead_leaves(&mas, slots);
+ node->type = mte_node_type(mas.node);
+ node->slot_len = mas_dead_leaves(&mas, slots, node->type);
if (free)
mt_free_bulk(node->slot_len, slots);
offset = node->parent_slot + 1;
@@ -5620,7 +5668,8 @@ static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags,
} while (start != mas.node);
node = mas_mn(&mas);
- node->slot_len = mas_dead_leaves(&mas, slots);
+ node->type = mte_node_type(mas.node);
+ node->slot_len = mas_dead_leaves(&mas, slots, node->type);
if (free)
mt_free_bulk(node->slot_len, slots);
@@ -5630,6 +5679,8 @@ static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags,
free_leaf:
if (free)
mt_free_rcu(&node->rcu);
+ else
+ mas_clear_meta(&mas, node, node->type);
}
/*
--
2.39.1
From: Liam Howlett <[email protected]>
The call to mte_set_node_dead() before the smp_wmb() already issues an
smp_wmb(), so the extra barrier is not needed. This is an optimization for
the RCU mode of the maple tree.
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
lib/maple_tree.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 44d6ce30b28e..3d5ab02f981a 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5517,7 +5517,6 @@ unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots,
break;
mte_set_node_dead(entry);
- smp_wmb(); /* Needed for RCU */
node->type = type;
rcu_assign_pointer(slots[offset], node);
}
--
2.39.1
From: "Liam R. Howlett" <[email protected]>
During the development of the maple tree, the strategy of freeing
multiple nodes changed and, in the process, the pivots were reused to
store pointers to dead nodes. To ensure the readers see accurate
pivots, the writers need to mark the nodes as dead and call smp_wmb() to
ensure any readers can identify the node as dead before using the pivot
values.
There were two places where the old method of marking the node as dead
without smp_wmb() was being used, which resulted in RCU readers seeing
the wrong pivot value before seeing that the node was dead. Fix this race
condition by using mte_set_node_dead(), which includes the smp_wmb() call
needed to close the race.
Add a WARN_ON() to ma_free_rcu() to check that all nodes being freed are
marked as dead, ensuring there are no other call paths besides the two
updated ones.
This is necessary for the RCU mode of the maple tree.
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
lib/maple_tree.c | 7 +++++--
tools/testing/radix-tree/maple.c | 16 ++++++++++++++++
2 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 3d5ab02f981a..6b6eddadd9d2 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -185,7 +185,7 @@ static void mt_free_rcu(struct rcu_head *head)
*/
static void ma_free_rcu(struct maple_node *node)
{
- node->parent = ma_parent_ptr(node);
+ WARN_ON(node->parent != ma_parent_ptr(node));
call_rcu(&node->rcu, mt_free_rcu);
}
@@ -1778,8 +1778,10 @@ static inline void mas_replace(struct ma_state *mas, bool advanced)
rcu_assign_pointer(slots[offset], mas->node);
}
- if (!advanced)
+ if (!advanced) {
+ mte_set_node_dead(old_enode);
mas_free(mas, old_enode);
+ }
}
/*
@@ -4218,6 +4220,7 @@ static inline bool mas_wr_node_store(struct ma_wr_state *wr_mas)
done:
mas_leaf_set_meta(mas, newnode, dst_pivots, maple_leaf_64, new_end);
if (in_rcu) {
+ mte_set_node_dead(mas->node);
mas->node = mt_mk_node(newnode, wr_mas->type);
mas_replace(mas, false);
} else {
diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index 958ee9bdb316..4c89ff333f6f 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -108,6 +108,7 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, mn->slot[1] != NULL);
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
mas.node = MAS_START;
mas_nomem(&mas, GFP_KERNEL);
@@ -160,6 +161,7 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, mas_allocated(&mas) != i);
MT_BUG_ON(mt, !mn);
MT_BUG_ON(mt, not_empty(mn));
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
@@ -192,6 +194,7 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, not_empty(mn));
MT_BUG_ON(mt, mas_allocated(&mas) != i - 1);
MT_BUG_ON(mt, !mn);
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
@@ -210,6 +213,7 @@ static noinline void check_new_node(struct maple_tree *mt)
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, not_empty(mn));
MT_BUG_ON(mt, mas_allocated(&mas) != j - 1);
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
@@ -233,6 +237,7 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, mas_allocated(&mas) != i - j);
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, not_empty(mn));
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
MT_BUG_ON(mt, mas_allocated(&mas) != i - j - 1);
}
@@ -269,6 +274,7 @@ static noinline void check_new_node(struct maple_tree *mt)
mn = mas_pop_node(&mas); /* get the next node. */
MT_BUG_ON(mt, mn == NULL);
MT_BUG_ON(mt, not_empty(mn));
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
@@ -294,6 +300,7 @@ static noinline void check_new_node(struct maple_tree *mt)
mn = mas_pop_node(&mas2); /* get the next node. */
MT_BUG_ON(mt, mn == NULL);
MT_BUG_ON(mt, not_empty(mn));
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
MT_BUG_ON(mt, mas_allocated(&mas2) != 0);
@@ -334,10 +341,12 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 2);
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, not_empty(mn));
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
for (i = 1; i <= MAPLE_ALLOC_SLOTS + 1; i++) {
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, not_empty(mn));
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
@@ -375,6 +384,7 @@ static noinline void check_new_node(struct maple_tree *mt)
mas_node_count(&mas, i); /* Request */
mas_nomem(&mas, GFP_KERNEL); /* Fill request */
mn = mas_pop_node(&mas); /* get the next node. */
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
mas_destroy(&mas);
@@ -382,10 +392,13 @@ static noinline void check_new_node(struct maple_tree *mt)
mas_node_count(&mas, i); /* Request */
mas_nomem(&mas, GFP_KERNEL); /* Fill request */
mn = mas_pop_node(&mas); /* get the next node. */
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
mn = mas_pop_node(&mas); /* get the next node. */
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
mn = mas_pop_node(&mas); /* get the next node. */
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
mas_destroy(&mas);
}
@@ -35369,6 +35382,7 @@ static noinline void check_prealloc(struct maple_tree *mt)
MT_BUG_ON(mt, allocated != 1 + height * 3);
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1);
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0);
mas_destroy(&mas);
@@ -35386,6 +35400,7 @@ static noinline void check_prealloc(struct maple_tree *mt)
mas_destroy(&mas);
allocated = mas_allocated(&mas);
MT_BUG_ON(mt, allocated != 0);
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0);
@@ -35756,6 +35771,7 @@ void farmer_tests(void)
tree.ma_root = mt_mk_node(node, maple_leaf_64);
mt_dump(&tree);
+ node->parent = ma_parent_ptr(node);
ma_free_rcu(node);
/* Check things that will make lockdep angry */
--
2.39.1
From: "Liam R. Howlett" <[email protected]>
Add an smp_rmb() before reading the parent pointer to ensure that
anything read from the node prior to the parent pointer hasn't been
reordered ahead of this check.
This is necessary for the RCU mode of the maple tree.
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
lib/maple_tree.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 6b6eddadd9d2..8ad2d1669fad 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -539,9 +539,11 @@ static inline struct maple_node *mte_parent(const struct maple_enode *enode)
*/
static inline bool ma_dead_node(const struct maple_node *node)
{
- struct maple_node *parent = (void *)((unsigned long)
- node->parent & ~MAPLE_NODE_MASK);
+ struct maple_node *parent;
+ /* Do not reorder reads from the node prior to the parent check */
+ smp_rmb();
+ parent = (void *)((unsigned long) node->parent & ~MAPLE_NODE_MASK);
return (parent == node);
}
@@ -556,6 +558,8 @@ static inline bool mte_dead_node(const struct maple_enode *enode)
struct maple_node *parent, *node;
node = mte_to_node(enode);
+ /* Do not reorder reads from the node prior to the parent check */
+ smp_rmb();
parent = mte_parent(enode);
return (parent == node);
}
--
2.39.1
From: "Liam R. Howlett" <[email protected]>
Dereferencing RCU objects within the RCU callback without the RCU check
has caused lockdep to complain. Fix the RCU dereferencing by using the
RCU callback lock to ensure the operation is safe.
Also stop creating a new lock to use for dereferencing during
destruction of the tree or subtree. Instead, pass through a pointer to
the tree whose lock is held for RCU dereference checking.
It also does not make sense to use the maple state in the freeing
scenario as the tree walk is a special case where the tree no longer has
the normal encodings and parent pointers.
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Reported-by: Suren Baghdasaryan <[email protected]>
Signed-off-by: Liam R. Howlett <[email protected]>
---
lib/maple_tree.c | 188 ++++++++++++++++++++++++-----------------------
1 file changed, 96 insertions(+), 92 deletions(-)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 8ad2d1669fad..2be86368237d 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -824,6 +824,11 @@ static inline void *mt_slot(const struct maple_tree *mt,
return rcu_dereference_check(slots[offset], mt_locked(mt));
}
+static inline void *mt_slot_locked(struct maple_tree *mt, void __rcu **slots,
+ unsigned char offset)
+{
+ return rcu_dereference_protected(slots[offset], mt_locked(mt));
+}
/*
* mas_slot_locked() - Get the slot value when holding the maple tree lock.
* @mas: The maple state
@@ -835,7 +840,7 @@ static inline void *mt_slot(const struct maple_tree *mt,
static inline void *mas_slot_locked(struct ma_state *mas, void __rcu **slots,
unsigned char offset)
{
- return rcu_dereference_protected(slots[offset], mt_locked(mas->tree));
+ return mt_slot_locked(mas->tree, slots, offset);
}
/*
@@ -907,34 +912,35 @@ static inline void ma_set_meta(struct maple_node *mn, enum maple_type mt,
}
/*
- * mas_clear_meta() - clear the metadata information of a node, if it exists
- * @mas: The maple state
+ * mt_clear_meta() - clear the metadata information of a node, if it exists
+ * @mt: The maple tree
* @mn: The maple node
- * @mt: The maple node type
+ * @type: The maple node type
* @offset: The offset of the highest sub-gap in this node.
* @end: The end of the data in this node.
*/
-static inline void mas_clear_meta(struct ma_state *mas, struct maple_node *mn,
- enum maple_type mt)
+static inline void mt_clear_meta(struct maple_tree *mt, struct maple_node *mn,
+ enum maple_type type)
{
struct maple_metadata *meta;
unsigned long *pivots;
void __rcu **slots;
void *next;
- switch (mt) {
+ switch (type) {
case maple_range_64:
pivots = mn->mr64.pivot;
if (unlikely(pivots[MAPLE_RANGE64_SLOTS - 2])) {
slots = mn->mr64.slot;
- next = mas_slot_locked(mas, slots,
- MAPLE_RANGE64_SLOTS - 1);
- if (unlikely((mte_to_node(next) && mte_node_type(next))))
- return; /* The last slot is a node, no metadata */
+ next = mt_slot_locked(mt, slots,
+ MAPLE_RANGE64_SLOTS - 1);
+ if (unlikely((mte_to_node(next) &&
+ mte_node_type(next))))
+ return; /* no metadata, could be node */
}
fallthrough;
case maple_arange_64:
- meta = ma_meta(mn, mt);
+ meta = ma_meta(mn, type);
break;
default:
return;
@@ -5497,7 +5503,7 @@ static inline int mas_rev_alloc(struct ma_state *mas, unsigned long min,
}
/*
- * mas_dead_leaves() - Mark all leaves of a node as dead.
+ * mte_dead_leaves() - Mark all leaves of a node as dead.
* @mas: The maple state
* @slots: Pointer to the slot array
* @type: The maple node type
@@ -5507,16 +5513,16 @@ static inline int mas_rev_alloc(struct ma_state *mas, unsigned long min,
* Return: The number of leaves marked as dead.
*/
static inline
-unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots,
- enum maple_type mt)
+unsigned char mte_dead_leaves(struct maple_enode *enode, struct maple_tree *mt,
+ void __rcu **slots)
{
struct maple_node *node;
enum maple_type type;
void *entry;
int offset;
- for (offset = 0; offset < mt_slots[mt]; offset++) {
- entry = mas_slot_locked(mas, slots, offset);
+ for (offset = 0; offset < mt_slot_count(enode); offset++) {
+ entry = mt_slot(mt, slots, offset);
type = mte_node_type(entry);
node = mte_to_node(entry);
/* Use both node and type to catch LE & BE metadata */
@@ -5531,162 +5537,160 @@ unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots,
return offset;
}
-static void __rcu **mas_dead_walk(struct ma_state *mas, unsigned char offset)
+/**
+ * mte_dead_walk() - Walk down a dead tree to just before the leaves
+ * @enode: The maple encoded node
+ * @offset: The starting offset
+ *
+ * Note: This can only be used from the RCU callback context.
+ */
+static void __rcu **mte_dead_walk(struct maple_enode **enode, unsigned char offset)
{
- struct maple_node *next;
+ struct maple_node *node, *next;
void __rcu **slots = NULL;
- next = mas_mn(mas);
+ next = mte_to_node(*enode);
do {
- mas->node = mt_mk_node(next, next->type);
- slots = ma_slots(next, next->type);
- next = mas_slot_locked(mas, slots, offset);
+ *enode = ma_enode_ptr(next);
+ node = mte_to_node(*enode);
+ slots = ma_slots(node, node->type);
+ next = rcu_dereference_protected(slots[offset],
+ lock_is_held(&rcu_callback_map));
offset = 0;
} while (!ma_is_leaf(next->type));
return slots;
}
+/**
+ * mt_free_walk() - Walk & free a tree in the RCU callback context
+ * @head: The RCU head that's within the node.
+ *
+ * Note: This can only be used from the RCU callback context.
+ */
static void mt_free_walk(struct rcu_head *head)
{
void __rcu **slots;
struct maple_node *node, *start;
- struct maple_tree mt;
+ struct maple_enode *enode;
unsigned char offset;
enum maple_type type;
- MA_STATE(mas, &mt, 0, 0);
node = container_of(head, struct maple_node, rcu);
if (ma_is_leaf(node->type))
goto free_leaf;
- mt_init_flags(&mt, node->ma_flags);
- mas_lock(&mas);
start = node;
- mas.node = mt_mk_node(node, node->type);
- slots = mas_dead_walk(&mas, 0);
- node = mas_mn(&mas);
+ enode = mt_mk_node(node, node->type);
+ slots = mte_dead_walk(&enode, 0);
+ node = mte_to_node(enode);
do {
mt_free_bulk(node->slot_len, slots);
offset = node->parent_slot + 1;
- mas.node = node->piv_parent;
- if (mas_mn(&mas) == node)
- goto start_slots_free;
-
- type = mte_node_type(mas.node);
- slots = ma_slots(mte_to_node(mas.node), type);
- if ((offset < mt_slots[type]) && (slots[offset]))
- slots = mas_dead_walk(&mas, offset);
-
- node = mas_mn(&mas);
+ enode = node->piv_parent;
+ if (mte_to_node(enode) == node)
+ goto free_leaf;
+
+ type = mte_node_type(enode);
+ slots = ma_slots(mte_to_node(enode), type);
+ if ((offset < mt_slots[type]) &&
+ rcu_dereference_protected(slots[offset],
+ lock_is_held(&rcu_callback_map)))
+ slots = mte_dead_walk(&enode, offset);
+ node = mte_to_node(enode);
} while ((node != start) || (node->slot_len < offset));
slots = ma_slots(node, node->type);
mt_free_bulk(node->slot_len, slots);
-start_slots_free:
- mas_unlock(&mas);
free_leaf:
mt_free_rcu(&node->rcu);
}
-static inline void __rcu **mas_destroy_descend(struct ma_state *mas,
- struct maple_enode *prev, unsigned char offset)
+static inline void __rcu **mte_destroy_descend(struct maple_enode **enode,
+ struct maple_tree *mt, struct maple_enode *prev, unsigned char offset)
{
struct maple_node *node;
- struct maple_enode *next = mas->node;
+ struct maple_enode *next = *enode;
void __rcu **slots = NULL;
+ enum maple_type type;
+ unsigned char next_offset = 0;
do {
- mas->node = next;
- node = mas_mn(mas);
- slots = ma_slots(node, mte_node_type(mas->node));
- next = mas_slot_locked(mas, slots, 0);
- if ((mte_dead_node(next))) {
- mte_to_node(next)->type = mte_node_type(next);
- next = mas_slot_locked(mas, slots, 1);
- }
+ *enode = next;
+ node = mte_to_node(*enode);
+ type = mte_node_type(*enode);
+ slots = ma_slots(node, type);
+ next = mt_slot_locked(mt, slots, next_offset);
+ if ((mte_dead_node(next)))
+ next = mt_slot_locked(mt, slots, ++next_offset);
- mte_set_node_dead(mas->node);
- node->type = mte_node_type(mas->node);
- mas_clear_meta(mas, node, node->type);
+ mte_set_node_dead(*enode);
+ node->type = type;
node->piv_parent = prev;
node->parent_slot = offset;
- offset = 0;
- prev = mas->node;
+ offset = next_offset;
+ next_offset = 0;
+ prev = *enode;
} while (!mte_is_leaf(next));
return slots;
}
-static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags,
+static void mt_destroy_walk(struct maple_enode *enode, struct maple_tree *mt,
bool free)
{
void __rcu **slots;
struct maple_node *node = mte_to_node(enode);
struct maple_enode *start;
- struct maple_tree mt;
-
- MA_STATE(mas, &mt, 0, 0);
- mas.node = enode;
if (mte_is_leaf(enode)) {
node->type = mte_node_type(enode);
goto free_leaf;
}
- ma_flags &= ~MT_FLAGS_LOCK_MASK;
- mt_init_flags(&mt, ma_flags);
- mas_lock(&mas);
-
- mte_to_node(enode)->ma_flags = ma_flags;
start = enode;
- slots = mas_destroy_descend(&mas, start, 0);
- node = mas_mn(&mas);
+ slots = mte_destroy_descend(&enode, mt, start, 0);
+ node = mte_to_node(enode); // Updated in the above call.
do {
enum maple_type type;
unsigned char offset;
struct maple_enode *parent, *tmp;
- node->type = mte_node_type(mas.node);
- node->slot_len = mas_dead_leaves(&mas, slots, node->type);
+ node->slot_len = mte_dead_leaves(enode, mt, slots);
if (free)
mt_free_bulk(node->slot_len, slots);
offset = node->parent_slot + 1;
- mas.node = node->piv_parent;
- if (mas_mn(&mas) == node)
- goto start_slots_free;
+ enode = node->piv_parent;
+ if (mte_to_node(enode) == node)
+ goto free_leaf;
- type = mte_node_type(mas.node);
- slots = ma_slots(mte_to_node(mas.node), type);
+ type = mte_node_type(enode);
+ slots = ma_slots(mte_to_node(enode), type);
if (offset >= mt_slots[type])
goto next;
- tmp = mas_slot_locked(&mas, slots, offset);
+ tmp = mt_slot_locked(mt, slots, offset);
if (mte_node_type(tmp) && mte_to_node(tmp)) {
- parent = mas.node;
- mas.node = tmp;
- slots = mas_destroy_descend(&mas, parent, offset);
+ parent = enode;
+ enode = tmp;
+ slots = mte_destroy_descend(&enode, mt, parent, offset);
}
next:
- node = mas_mn(&mas);
- } while (start != mas.node);
+ node = mte_to_node(enode);
+ } while (start != enode);
- node = mas_mn(&mas);
- node->type = mte_node_type(mas.node);
- node->slot_len = mas_dead_leaves(&mas, slots, node->type);
+ node = mte_to_node(enode);
+ node->slot_len = mte_dead_leaves(enode, mt, slots);
if (free)
mt_free_bulk(node->slot_len, slots);
-start_slots_free:
- mas_unlock(&mas);
-
free_leaf:
if (free)
mt_free_rcu(&node->rcu);
else
- mas_clear_meta(&mas, node, node->type);
+ mt_clear_meta(mt, node, node->type);
}
/*
@@ -5702,10 +5706,10 @@ static inline void mte_destroy_walk(struct maple_enode *enode,
struct maple_node *node = mte_to_node(enode);
if (mt_in_rcu(mt)) {
- mt_destroy_walk(enode, mt->ma_flags, false);
+ mt_destroy_walk(enode, mt, false);
call_rcu(&node->rcu, mt_free_walk);
} else {
- mt_destroy_walk(enode, mt->ma_flags, true);
+ mt_destroy_walk(enode, mt, true);
}
}
--
2.39.1
From: "Liam R. Howlett" <[email protected]>
Use the maple tree in RCU mode for VMA tracking. This is necessary for
the use of per-VMA locking. RCU mode is enabled by default but disabled
when exiting an mm and for the new tree during a fork.
Also enable RCU for the tree used in munmap operations to ensure the
nodes remain valid for readers.
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm_types.h | 3 ++-
kernel/fork.c | 3 +++
mm/mmap.c | 4 +++-
3 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3cd60243c625..3fd5305dbbf9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -768,7 +768,8 @@ struct mm_struct {
unsigned long cpu_bitmap[];
};
-#define MM_MT_FLAGS (MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN)
+#define MM_MT_FLAGS (MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN | \
+ MT_FLAGS_USE_RCU)
extern struct mm_struct init_mm;
/* Pointer magic because the dynamic array size confuses some compilers. */
diff --git a/kernel/fork.c b/kernel/fork.c
index 5e3029ea8e1e..5f23d5e03362 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -617,6 +617,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
if (retval)
goto out;
+ mt_clear_in_rcu(vmi.mas.tree);
for_each_vma(old_vmi, mpnt) {
struct file *file;
@@ -700,6 +701,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
retval = arch_dup_mmap(oldmm, mm);
loop_out:
vma_iter_free(&vmi);
+ if (!retval)
+ mt_set_in_rcu(vmi.mas.tree);
out:
mmap_write_unlock(mm);
flush_tlb_mm(oldmm);
diff --git a/mm/mmap.c b/mm/mmap.c
index 20f21f0949dd..06c0f6686de8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2277,7 +2277,8 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
int count = 0;
int error = -ENOMEM;
MA_STATE(mas_detach, &mt_detach, 0, 0);
- mt_init_flags(&mt_detach, MT_FLAGS_LOCK_EXTERN);
+ mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags &
+ (MT_FLAGS_LOCK_MASK | MT_FLAGS_USE_RCU));
mt_set_external_lock(&mt_detach, &mm->mmap_lock);
/*
@@ -3042,6 +3043,7 @@ void exit_mmap(struct mm_struct *mm)
*/
set_bit(MMF_OOM_SKIP, &mm->flags);
mmap_write_lock(mm);
+ mt_clear_in_rcu(&mm->mm_mt);
free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
USER_PGTABLES_CEILING);
tlb_finish_mmu(&tlb);
--
2.39.1
This configuration variable will be used to build the support for VMA
locking during page fault handling.
This is enabled on supported architectures with SMP and MMU set.
The architecture support is needed since the page fault handler is called
from the architecture's page faulting code which needs modifications to
handle faults under VMA lock.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/Kconfig | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/mm/Kconfig b/mm/Kconfig
index ca98b2072df5..2e4a7e61768a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1211,6 +1211,18 @@ config LRU_GEN_STATS
This option has a per-memcg and per-node memory overhead.
# }
+config ARCH_SUPPORTS_PER_VMA_LOCK
+ def_bool n
+
+config PER_VMA_LOCK
+ def_bool y
+ depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
+ help
+ Allow per-vma locking during page fault handling.
+
+ This feature allows locking each virtual memory area separately when
+ handling page faults instead of taking mmap_lock.
+
source "mm/damon/Kconfig"
endmenu
--
2.39.1
From: Michel Lespinasse <[email protected]>
This prepares for page fault handling under the VMA lock, looking up VMAs
under the protection of an RCU read lock instead of the usual mmap read lock.
Signed-off-by: Michel Lespinasse <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm_types.h | 13 ++++++++++---
kernel/fork.c | 20 +++++++++++++++++++-
2 files changed, 29 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3fd5305dbbf9..fb4e2afad787 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -480,9 +480,16 @@ struct anon_vma_name {
struct vm_area_struct {
/* The first cache line has the info for VMA tree walking. */
- unsigned long vm_start; /* Our start address within vm_mm. */
- unsigned long vm_end; /* The first byte after our end address
- within vm_mm. */
+ union {
+ struct {
+ /* VMA covers [vm_start; vm_end) addresses within mm */
+ unsigned long vm_start;
+ unsigned long vm_end;
+ };
+#ifdef CONFIG_PER_VMA_LOCK
+ struct rcu_head vm_rcu; /* Used for deferred freeing. */
+#endif
+ };
struct mm_struct *vm_mm; /* The address space we belong to. */
diff --git a/kernel/fork.c b/kernel/fork.c
index 5f23d5e03362..314d51eb91da 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -479,12 +479,30 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
return new;
}
-void vm_area_free(struct vm_area_struct *vma)
+static void __vm_area_free(struct vm_area_struct *vma)
{
free_anon_vma_name(vma);
kmem_cache_free(vm_area_cachep, vma);
}
+#ifdef CONFIG_PER_VMA_LOCK
+static void vm_area_free_rcu_cb(struct rcu_head *head)
+{
+ struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
+ vm_rcu);
+ __vm_area_free(vma);
+}
+#endif
+
+void vm_area_free(struct vm_area_struct *vma)
+{
+#ifdef CONFIG_PER_VMA_LOCK
+ call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
+#else
+ __vm_area_free(vma);
+#endif
+}
+
static void account_kernel_stack(struct task_struct *tsk, int account)
{
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
--
2.39.1
Move mmap_lock assert function definitions up so that they can be used
by other mmap_lock routines.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mmap_lock.h | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 96e113e23d04..e49ba91bb1f0 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -60,6 +60,18 @@ static inline void __mmap_lock_trace_released(struct mm_struct *mm, bool write)
#endif /* CONFIG_TRACING */
+static inline void mmap_assert_locked(struct mm_struct *mm)
+{
+ lockdep_assert_held(&mm->mmap_lock);
+ VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
+}
+
+static inline void mmap_assert_write_locked(struct mm_struct *mm)
+{
+ lockdep_assert_held_write(&mm->mmap_lock);
+ VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
+}
+
static inline void mmap_init_lock(struct mm_struct *mm)
{
init_rwsem(&mm->mmap_lock);
@@ -150,18 +162,6 @@ static inline void mmap_read_unlock_non_owner(struct mm_struct *mm)
up_read_non_owner(&mm->mmap_lock);
}
-static inline void mmap_assert_locked(struct mm_struct *mm)
-{
- lockdep_assert_held(&mm->mmap_lock);
- VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
-}
-
-static inline void mmap_assert_write_locked(struct mm_struct *mm)
-{
- lockdep_assert_held_write(&mm->mmap_lock);
- VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
-}
-
static inline int mmap_lock_is_contended(struct mm_struct *mm)
{
return rwsem_is_contended(&mm->mmap_lock);
--
2.39.1
Introduce per-VMA locking. The lock implementation relies on
per-vma and per-mm sequence counters to note exclusive locking:
- read lock - (implemented by vma_start_read) requires the vma
(vm_lock_seq) and mm (mm_lock_seq) sequence counters to differ.
If they match then there must be a vma exclusive lock held somewhere.
- read unlock - (implemented by vma_end_read) is a trivial vma->lock
unlock.
- write lock - (vma_start_write) requires the mmap_lock to be held
exclusively and the current mm counter is assigned to the vma counter.
This will allow multiple vmas to be locked under a single mmap_lock
write lock (e.g. during vma merging). The vma counter is modified
under exclusive vma lock.
- write unlock - (vma_end_write_all) is a batch release of all vma
locks held. It doesn't pair with a specific vma_start_write! It is
done before exclusive mmap_lock is released by incrementing mm
sequence counter (mm_lock_seq).
- write downgrade - if the mmap_lock is downgraded to the read lock, all
vma write locks are released as well (effectively the same as write
unlock).
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 82 +++++++++++++++++++++++++++++++++++++++
include/linux/mm_types.h | 8 ++++
include/linux/mmap_lock.h | 13 +++++++
kernel/fork.c | 4 ++
mm/init-mm.c | 3 ++
5 files changed, 110 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2992a2d55aee..a056ee170e34 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -623,6 +623,87 @@ struct vm_operations_struct {
unsigned long addr);
};
+#ifdef CONFIG_PER_VMA_LOCK
+static inline void vma_init_lock(struct vm_area_struct *vma)
+{
+ init_rwsem(&vma->lock);
+ vma->vm_lock_seq = -1;
+}
+
+/*
+ * Try to read-lock a vma. The function is allowed to occasionally yield false
+ * locked result to avoid performance overhead, in which case we fall back to
+ * using mmap_lock. The function should never yield false unlocked result.
+ */
+static inline bool vma_start_read(struct vm_area_struct *vma)
+{
+ /* Check before locking. A race might cause false locked result. */
+ if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
+ return false;
+
+ if (unlikely(down_read_trylock(&vma->lock) == 0))
+ return false;
+
+ /*
+ * Overflow might produce false locked result.
+ * False unlocked result is impossible because we modify and check
+ * vma->vm_lock_seq under vma->lock protection and mm->mm_lock_seq
+ * modification invalidates all existing locks.
+ */
+ if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
+ up_read(&vma->lock);
+ return false;
+ }
+ return true;
+}
+
+static inline void vma_end_read(struct vm_area_struct *vma)
+{
+ rcu_read_lock(); /* keeps vma alive till the end of up_read */
+ up_read(&vma->lock);
+ rcu_read_unlock();
+}
+
+static inline void vma_start_write(struct vm_area_struct *vma)
+{
+ int mm_lock_seq;
+
+ mmap_assert_write_locked(vma->vm_mm);
+
+ /*
+ * current task is holding mmap_write_lock, both vma->vm_lock_seq and
+ * mm->mm_lock_seq can't be concurrently modified.
+ */
+ mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
+ if (vma->vm_lock_seq == mm_lock_seq)
+ return;
+
+ down_write(&vma->lock);
+ vma->vm_lock_seq = mm_lock_seq;
+ up_write(&vma->lock);
+}
+
+static inline void vma_assert_write_locked(struct vm_area_struct *vma)
+{
+ mmap_assert_write_locked(vma->vm_mm);
+ /*
+ * current task is holding mmap_write_lock, both vma->vm_lock_seq and
+ * mm->mm_lock_seq can't be concurrently modified.
+ */
+ VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+static inline void vma_init_lock(struct vm_area_struct *vma) {}
+static inline bool vma_start_read(struct vm_area_struct *vma)
+ { return false; }
+static inline void vma_end_read(struct vm_area_struct *vma) {}
+static inline void vma_start_write(struct vm_area_struct *vma) {}
+static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
+
+#endif /* CONFIG_PER_VMA_LOCK */
+
static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
{
static const struct vm_operations_struct dummy_vm_ops = {};
@@ -631,6 +712,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
+ vma_init_lock(vma);
}
/* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fb4e2afad787..e268723eaf44 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -508,6 +508,11 @@ struct vm_area_struct {
vm_flags_t __private __vm_flags;
};
+#ifdef CONFIG_PER_VMA_LOCK
+ int vm_lock_seq;
+ struct rw_semaphore lock;
+#endif
+
/*
* For areas with an address space and backing store,
* linkage into the address_space->i_mmap interval tree.
@@ -633,6 +638,9 @@ struct mm_struct {
* init_mm.mmlist, and are protected
* by mmlist_lock
*/
+#ifdef CONFIG_PER_VMA_LOCK
+ int mm_lock_seq;
+#endif
unsigned long hiwater_rss; /* High-watermark of RSS usage */
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index e49ba91bb1f0..aab8f1b28d26 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -72,6 +72,17 @@ static inline void mmap_assert_write_locked(struct mm_struct *mm)
VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
}
+#ifdef CONFIG_PER_VMA_LOCK
+static inline void vma_end_write_all(struct mm_struct *mm)
+{
+ mmap_assert_write_locked(mm);
+ /* No races during update due to exclusive mmap_lock being held */
+ WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
+}
+#else
+static inline void vma_end_write_all(struct mm_struct *mm) {}
+#endif
+
static inline void mmap_init_lock(struct mm_struct *mm)
{
init_rwsem(&mm->mmap_lock);
@@ -114,12 +125,14 @@ static inline bool mmap_write_trylock(struct mm_struct *mm)
static inline void mmap_write_unlock(struct mm_struct *mm)
{
__mmap_lock_trace_released(mm, true);
+ vma_end_write_all(mm);
up_write(&mm->mmap_lock);
}
static inline void mmap_write_downgrade(struct mm_struct *mm)
{
__mmap_lock_trace_acquire_returned(mm, false, true);
+ vma_end_write_all(mm);
downgrade_write(&mm->mmap_lock);
}
diff --git a/kernel/fork.c b/kernel/fork.c
index 314d51eb91da..9141427a98b2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
*/
data_race(memcpy(new, orig, sizeof(*new)));
INIT_LIST_HEAD(&new->anon_vma_chain);
+ vma_init_lock(new);
dup_anon_vma_name(orig, new);
}
return new;
@@ -1147,6 +1148,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
seqcount_init(&mm->write_protect_seq);
mmap_init_lock(mm);
INIT_LIST_HEAD(&mm->mmlist);
+#ifdef CONFIG_PER_VMA_LOCK
+ mm->mm_lock_seq = 0;
+#endif
mm_pgtables_bytes_init(mm);
mm->map_count = 0;
mm->locked_vm = 0;
diff --git a/mm/init-mm.c b/mm/init-mm.c
index c9327abb771c..33269314e060 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -37,6 +37,9 @@ struct mm_struct init_mm = {
.page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
.arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
+#ifdef CONFIG_PER_VMA_LOCK
+ .mm_lock_seq = 0,
+#endif
.user_ns = &init_user_ns,
.cpu_bitmap = CPU_BITS_NONE,
#ifdef CONFIG_IOMMU_SVA
--
2.39.1
Updates to vm_flags have to be done with the VMA marked as being written to
prevent concurrent page faults or other modifications.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a056ee170e34..f4f702224ec5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -726,28 +726,28 @@ static inline void vm_flags_init(struct vm_area_struct *vma,
static inline void vm_flags_reset(struct vm_area_struct *vma,
vm_flags_t flags)
{
- mmap_assert_write_locked(vma->vm_mm);
+ vma_start_write(vma);
vm_flags_init(vma, flags);
}
static inline void vm_flags_reset_once(struct vm_area_struct *vma,
vm_flags_t flags)
{
- mmap_assert_write_locked(vma->vm_mm);
+ vma_start_write(vma);
WRITE_ONCE(ACCESS_PRIVATE(vma, __vm_flags), flags);
}
static inline void vm_flags_set(struct vm_area_struct *vma,
vm_flags_t flags)
{
- mmap_assert_write_locked(vma->vm_mm);
+ vma_start_write(vma);
ACCESS_PRIVATE(vma, __vm_flags) |= flags;
}
static inline void vm_flags_clear(struct vm_area_struct *vma,
vm_flags_t flags)
{
- mmap_assert_write_locked(vma->vm_mm);
+ vma_start_write(vma);
ACCESS_PRIVATE(vma, __vm_flags) &= ~flags;
}
@@ -768,7 +768,7 @@ static inline void __vm_flags_mod(struct vm_area_struct *vma,
static inline void vm_flags_mod(struct vm_area_struct *vma,
vm_flags_t set, vm_flags_t clear)
{
- mmap_assert_write_locked(vma->vm_mm);
+ vma_start_write(vma);
__vm_flags_mod(vma, set, clear);
}
--
2.39.1
vma_adjust_trans_huge() modifies the VMA and such modifications should
be done after the VMA is marked as being written. Therefore, move the VMA
flag modifications before vma_adjust_trans_huge() so that the VMA is marked
before all these modifications.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/mmap.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 06c0f6686de8..c5f2ddf17b87 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2910,11 +2910,12 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (vma_iter_prealloc(vmi))
goto unacct_fail;
+ /* Set flags first to implicitly lock the VMA before updates */
+ vm_flags_set(vma, VM_SOFTDIRTY);
vma_adjust_trans_huge(vma, vma->vm_start, addr + len, 0);
init_vma_prep(&vp, vma);
vma_prepare(&vp);
vma->vm_end = addr + len;
- vm_flags_set(vma, VM_SOFTDIRTY);
vma_iter_store(vmi, vma);
vma_complete(&vp, vmi, mm);
--
2.39.1
Protect the VMA from concurrent page fault handlers while collapsing a huge
page. The page fault handler needs a stable PMD to use the PTL and relies on
the per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
not be detected by a page fault handler without proper locking.
Before this patch, page tables can be walked under any one of the
mmap_lock, the mapping lock, and the anon_vma lock; so when khugepaged
unlinks and frees page tables, it must ensure that all of those either
are locked or don't exist. This patch adds a fourth lock under which
page tables can be traversed, and so khugepaged must also lock out that
one.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/khugepaged.c | 5 +++++
mm/rmap.c | 31 ++++++++++++++++---------------
2 files changed, 21 insertions(+), 15 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 8dbc39896811..d9376acf1fbe 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1140,6 +1140,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
if (result != SCAN_SUCCEED)
goto out_up_write;
+ vma_start_write(vma);
anon_vma_lock_write(vma->anon_vma);
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
@@ -1607,6 +1608,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
goto drop_hpage;
}
+ /* Lock the vma before taking i_mmap and page table locks */
+ vma_start_write(vma);
+
/*
* We need to lock the mapping so that from here on, only GUP-fast and
* hardware page walks can access the parts of the page tables that
@@ -1812,6 +1816,7 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
result = SCAN_PTE_UFFD_WP;
goto unlock_next;
}
+ vma_start_write(vma);
collapse_and_free_pmd(mm, vma, addr, pmd);
if (!cc->is_khugepaged && is_target)
result = set_huge_pmd(vma, addr, pmd, hpage);
diff --git a/mm/rmap.c b/mm/rmap.c
index 15ae24585fc4..8e1a2ad9ca53 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -25,21 +25,22 @@
* mapping->invalidate_lock (in filemap_fault)
* page->flags PG_locked (lock_page)
* hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
- * mapping->i_mmap_rwsem
- * anon_vma->rwsem
- * mm->page_table_lock or pte_lock
- * swap_lock (in swap_duplicate, swap_info_get)
- * mmlist_lock (in mmput, drain_mmlist and others)
- * mapping->private_lock (in block_dirty_folio)
- * folio_lock_memcg move_lock (in block_dirty_folio)
- * i_pages lock (widely used)
- * lruvec->lru_lock (in folio_lruvec_lock_irq)
- * inode->i_lock (in set_page_dirty's __mark_inode_dirty)
- * bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
- * sb_lock (within inode_lock in fs/fs-writeback.c)
- * i_pages lock (widely used, in set_page_dirty,
- * in arch-dependent flush_dcache_mmap_lock,
- * within bdi.wb->list_lock in __sync_single_inode)
+ * vma_start_write
+ * mapping->i_mmap_rwsem
+ * anon_vma->rwsem
+ * mm->page_table_lock or pte_lock
+ * swap_lock (in swap_duplicate, swap_info_get)
+ * mmlist_lock (in mmput, drain_mmlist and others)
+ * mapping->private_lock (in block_dirty_folio)
+ * folio_lock_memcg move_lock (in block_dirty_folio)
+ * i_pages lock (widely used)
+ * lruvec->lru_lock (in folio_lruvec_lock_irq)
+ * inode->i_lock (in set_page_dirty's __mark_inode_dirty)
+ * bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
+ * sb_lock (within inode_lock in fs/fs-writeback.c)
+ * i_pages lock (widely used, in set_page_dirty,
+ * in arch-dependent flush_dcache_mmap_lock,
+ * within bdi.wb->list_lock in __sync_single_inode)
*
* anon_vma->rwsem,mapping->i_mmap_rwsem (memory_failure, collect_procs_anon)
* ->tasklist_lock
--
2.39.1
Write-locking VMAs before isolating them ensures that page fault
handlers don't operate on isolated VMAs.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/mmap.c | 1 +
mm/nommu.c | 5 +++++
2 files changed, 6 insertions(+)
diff --git a/mm/mmap.c b/mm/mmap.c
index 0eaa3d1a6cd1..aaa359a147b2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2261,6 +2261,7 @@ int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
static inline int munmap_sidetree(struct vm_area_struct *vma,
struct ma_state *mas_detach)
{
+ vma_start_write(vma);
mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);
if (mas_store_gfp(mas_detach, vma, GFP_KERNEL))
return -ENOMEM;
diff --git a/mm/nommu.c b/mm/nommu.c
index 57ba243c6a37..2ab162d773e2 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -588,6 +588,7 @@ static int delete_vma_from_mm(struct vm_area_struct *vma)
current->pid);
return -ENOMEM;
}
+ vma_start_write(vma);
cleanup_vma_from_mm(vma);
/* remove from the MM's tree and list */
@@ -1519,6 +1520,10 @@ void exit_mmap(struct mm_struct *mm)
*/
mmap_write_lock(mm);
for_each_vma(vmi, vma) {
+ /*
+ * No need to lock VMA because this is the only mm user and no
+ * page fault handlers can race with it.
+ */
cleanup_vma_from_mm(vma);
delete_vma(mm, vma);
cond_resched();
--
2.39.1
vma_expand and vma_shrink change VMA boundaries. Expansion might also
result in freeing of an adjacent VMA. Write-lock affected VMAs to prevent
concurrent page faults.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/mmap.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/mm/mmap.c b/mm/mmap.c
index ec2f8d0af280..f079e5bbcd57 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -674,6 +674,9 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
ret = dup_anon_vma(vma, next);
if (ret)
return ret;
+
+ /* Lock the VMA before removing it */
+ vma_start_write(next);
}
init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL);
@@ -686,6 +689,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (vma_iter_prealloc(vmi))
goto nomem;
+ vma_start_write(vma);
vma_adjust_trans_huge(vma, start, end, 0);
/* VMA iterator points to previous, so set to start if necessary */
if (vma_iter_addr(vmi) != start)
@@ -725,6 +729,7 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (vma_iter_prealloc(vmi))
return -ENOMEM;
+ vma_start_write(vma);
init_vma_prep(&vp, vma);
vma_adjust_trans_huge(vma, start, end, 0);
vma_prepare(&vp);
--
2.39.1
Decisions about whether VMAs can be merged, split or expanded must be
made while the VMAs are protected from changes which can affect that
decision. For example, vma_merge uses vma->anon_vma when deciding
whether the VMA can be merged, while the page fault handler changes
vma->anon_vma during a COW operation.
Write-lock all VMAs which might be affected by a merge or split operation
before deciding how such operations should be performed.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/mmap.c | 23 ++++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index c5f2ddf17b87..ec2f8d0af280 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -269,8 +269,11 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
*/
vma_iter_init(&vmi, mm, oldbrk);
next = vma_find(&vmi, newbrk + PAGE_SIZE + stack_guard_gap);
- if (next && newbrk + PAGE_SIZE > vm_start_gap(next))
- goto out;
+ if (next) {
+ vma_start_write(next);
+ if (newbrk + PAGE_SIZE > vm_start_gap(next))
+ goto out;
+ }
brkvma = vma_prev_limit(&vmi, mm->start_brk);
/* Ok, looks good - let it rip. */
@@ -912,10 +915,17 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
if (vm_flags & VM_SPECIAL)
return NULL;
+ if (prev)
+ vma_start_write(prev);
next = find_vma(mm, prev ? prev->vm_end : 0);
+ if (next)
+ vma_start_write(next);
mid = next;
- if (next && next->vm_end == end) /* cases 6, 7, 8 */
+ if (next && next->vm_end == end) { /* cases 6, 7, 8 */
next = find_vma(mm, next->vm_end);
+ if (next)
+ vma_start_write(next);
+ }
/* verify some invariant that must be enforced by the caller */
VM_WARN_ON(prev && addr <= prev->vm_start);
@@ -2163,6 +2173,7 @@ int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
WARN_ON(vma->vm_start >= addr);
WARN_ON(vma->vm_end <= addr);
+ vma_start_write(vma);
if (vma->vm_ops && vma->vm_ops->may_split) {
err = vma->vm_ops->may_split(vma, addr);
if (err)
@@ -2518,6 +2529,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
/* Attempt to expand an old mapping */
/* Check next */
+ if (next)
+ vma_start_write(next);
if (next && next->vm_start == end && !vma_policy(next) &&
can_vma_merge_before(next, vm_flags, NULL, file, pgoff+pglen,
NULL_VM_UFFD_CTX, NULL)) {
@@ -2527,6 +2540,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
}
/* Check prev */
+ if (prev)
+ vma_start_write(prev);
if (prev && prev->vm_end == addr && !vma_policy(prev) &&
(vma ? can_vma_merge_after(prev, vm_flags, vma->anon_vma, file,
pgoff, vma->vm_userfaultfd_ctx, NULL) :
@@ -2900,6 +2915,8 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
return -ENOMEM;
+ if (vma)
+ vma_start_write(vma);
/*
* Expand the existing vma if possible; Note that singular lists do not
* occur after forking, so the expand will only happen on new VMAs.
--
2.39.1
Write-lock the VMA before copying it and write-lock the new VMA that
copy_vma produces.
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Laurent Dufour <[email protected]>
---
mm/mmap.c | 1 +
mm/mremap.c | 1 +
2 files changed, 2 insertions(+)
diff --git a/mm/mmap.c b/mm/mmap.c
index f079e5bbcd57..0eaa3d1a6cd1 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3202,6 +3202,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
get_file(new_vma->vm_file);
if (new_vma->vm_ops && new_vma->vm_ops->open)
new_vma->vm_ops->open(new_vma);
+ vma_start_write(new_vma);
if (vma_link(mm, new_vma))
goto out_vma_link;
*need_rmap_locks = false;
diff --git a/mm/mremap.c b/mm/mremap.c
index 411a85682b58..dd541e59edda 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -623,6 +623,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
return -ENOMEM;
}
+ vma_start_write(vma);
new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
&need_rmap_locks);
--
2.39.1
Normally free_pgtables needs to lock the affected VMAs, except for the
case when the VMAs were already isolated under a VMA write-lock. munmap()
does just that, isolating them while holding the appropriate locks and
then downgrading mmap_lock and dropping the per-VMA locks before freeing
the page tables.
Add a parameter to free_pgtables to handle this scenario.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/internal.h | 2 +-
mm/memory.c | 6 +++++-
mm/mmap.c | 5 +++--
3 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index fc01fd092ea5..400c302fbf47 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -105,7 +105,7 @@ void folio_activate(struct folio *folio);
void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
struct vm_area_struct *start_vma, unsigned long floor,
- unsigned long ceiling);
+ unsigned long ceiling, bool mm_wr_locked);
void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
struct zap_details;
diff --git a/mm/memory.c b/mm/memory.c
index f456f3b5049c..8177c59ffd2d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -348,7 +348,7 @@ void free_pgd_range(struct mmu_gather *tlb,
void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
struct vm_area_struct *vma, unsigned long floor,
- unsigned long ceiling)
+ unsigned long ceiling, bool mm_wr_locked)
{
MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
@@ -366,6 +366,8 @@ void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
* Hide vma from rmap and truncate_pagecache before freeing
* pgtables
*/
+ if (mm_wr_locked)
+ vma_start_write(vma);
unlink_anon_vmas(vma);
unlink_file_vma(vma);
@@ -380,6 +382,8 @@ void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
&& !is_vm_hugetlb_page(next)) {
vma = next;
next = mas_find(&mas, ceiling - 1);
+ if (mm_wr_locked)
+ vma_start_write(vma);
unlink_anon_vmas(vma);
unlink_file_vma(vma);
}
diff --git a/mm/mmap.c b/mm/mmap.c
index aaa359a147b2..118b2246bba9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2157,7 +2157,8 @@ static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
update_hiwater_rss(mm);
unmap_vmas(&tlb, mt, vma, start, end, mm_wr_locked);
free_pgtables(&tlb, mt, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
- next ? next->vm_start : USER_PGTABLES_CEILING);
+ next ? next->vm_start : USER_PGTABLES_CEILING,
+ mm_wr_locked);
tlb_finish_mmu(&tlb);
}
@@ -3069,7 +3070,7 @@ void exit_mmap(struct mm_struct *mm)
mmap_write_lock(mm);
mt_clear_in_rcu(&mm->mm_mt);
free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
- USER_PGTABLES_CEILING);
+ USER_PGTABLES_CEILING, true);
tlb_finish_mmu(&tlb);
/*
--
2.39.1
While unmapping VMAs, adjacent VMAs might be able to grow into the area
being unmapped. In such cases write-lock adjacent VMAs to prevent this
growth.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/mmap.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 118b2246bba9..00f8c5798936 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2399,11 +2399,13 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
* down_read(mmap_lock) and collide with the VMA we are about to unmap.
*/
if (downgrade) {
- if (next && (next->vm_flags & VM_GROWSDOWN))
+ if (next && (next->vm_flags & VM_GROWSDOWN)) {
+ vma_start_write(next);
downgrade = false;
- else if (prev && (prev->vm_flags & VM_GROWSUP))
+ } else if (prev && (prev->vm_flags & VM_GROWSUP)) {
+ vma_start_write(prev);
downgrade = false;
- else
+ } else
mmap_write_downgrade(mm);
}
--
2.39.1
Assert that there are no holders of the VMA lock for reading when the VMA
is about to be destroyed.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
kernel/fork.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/fork.c b/kernel/fork.c
index 9141427a98b2..a08cc0e2bfde 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -491,6 +491,9 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
{
struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
vm_rcu);
+
+ /* The vma should not be locked while being destroyed. */
+ VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock), vma);
__vm_area_free(vma);
}
#endif
--
2.39.1
Page fault handlers might need to fire MMU notifications while a new
notifier is being registered. Modify mm_take_all_locks to write-lock all
VMAs so that page fault handlers, which hold VMA locks, cannot race with
notifier registration. VMAs are locked before i_mmap_rwsem and anon_vma
to keep the same locking order as in the page fault handlers.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/mmap.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/mm/mmap.c b/mm/mmap.c
index 00f8c5798936..801608726be8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3501,6 +3501,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
* of mm/rmap.c:
* - all hugetlbfs_i_mmap_rwsem_key locks (aka mapping->i_mmap_rwsem for
* hugetlb mapping);
+ * - all vmas marked locked
* - all i_mmap_rwsem locks;
* - all anon_vma->rwseml
*
@@ -3523,6 +3524,13 @@ int mm_take_all_locks(struct mm_struct *mm)
mutex_lock(&mm_all_locks_mutex);
+ mas_for_each(&mas, vma, ULONG_MAX) {
+ if (signal_pending(current))
+ goto out_unlock;
+ vma_start_write(vma);
+ }
+
+ mas_set(&mas, 0);
mas_for_each(&mas, vma, ULONG_MAX) {
if (signal_pending(current))
goto out_unlock;
@@ -3612,6 +3620,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
if (vma->vm_file && vma->vm_file->f_mapping)
vm_unlock_mapping(vma->vm_file->f_mapping);
}
+ vma_end_write_all(mm);
mutex_unlock(&mm_all_locks_mutex);
}
--
2.39.1
The per-VMA locking mechanism searches for a VMA under RCU protection
and then, after locking it, has to ensure the VMA was not removed from
the VMA tree after it was found. To make this check efficient, introduce
a vma->detached flag to mark VMAs which were removed from the VMA tree.
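To illustrate, a rough sketch of the intended reader-side check
(illustrative only: the 'fallback' and 'retry' labels are placeholders,
and the real check is added to lock_vma_under_rcu() later in the series):

    /* Under RCU, after walking the tree and read-locking the VMA: */
    if (!vma_start_read(vma))
            goto fallback;          /* contended - fall back to mmap_lock */
    if (vma->detached) {
            /* The VMA was isolated after we found it - redo the lookup */
            vma_end_read(vma);
            goto retry;
    }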
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 11 +++++++++++
include/linux/mm_types.h | 3 +++
mm/mmap.c | 2 ++
3 files changed, 16 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f4f702224ec5..3f98344f829c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -693,6 +693,14 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
}
+static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
+{
+ /* When detaching vma should be write-locked */
+ if (detached)
+ vma_assert_write_locked(vma);
+ vma->detached = detached;
+}
+
#else /* CONFIG_PER_VMA_LOCK */
static inline void vma_init_lock(struct vm_area_struct *vma) {}
@@ -701,6 +709,8 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
static inline void vma_end_read(struct vm_area_struct *vma) {}
static inline void vma_start_write(struct vm_area_struct *vma) {}
static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
+static inline void vma_mark_detached(struct vm_area_struct *vma,
+ bool detached) {}
#endif /* CONFIG_PER_VMA_LOCK */
@@ -712,6 +722,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
+ vma_mark_detached(vma, false);
vma_init_lock(vma);
}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e268723eaf44..939f4f5a1115 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -511,6 +511,9 @@ struct vm_area_struct {
#ifdef CONFIG_PER_VMA_LOCK
int vm_lock_seq;
struct rw_semaphore lock;
+
+ /* Flag to indicate areas detached from the mm->mm_mt tree */
+ bool detached;
#endif
/*
diff --git a/mm/mmap.c b/mm/mmap.c
index 801608726be8..adf40177e68f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -593,6 +593,7 @@ static inline void vma_complete(struct vma_prepare *vp,
if (vp->remove) {
again:
+ vma_mark_detached(vp->remove, true);
if (vp->file) {
uprobe_munmap(vp->remove, vp->remove->vm_start,
vp->remove->vm_end);
@@ -2267,6 +2268,7 @@ static inline int munmap_sidetree(struct vm_area_struct *vma,
if (mas_store_gfp(mas_detach, vma, GFP_KERNEL))
return -ENOMEM;
+ vma_mark_detached(vma, true);
if (vma->vm_flags & VM_LOCKED)
vma->vm_mm->locked_vm -= vma_pages(vma);
--
2.39.1
Introduce the lock_vma_under_rcu function to look up and lock a VMA during
page fault handling. When the VMA is not found, can't be locked, or
changes after being locked, the function returns NULL. The lookup is
performed under RCU protection to prevent the found VMA from being
destroyed before the VMA lock is acquired. VMA lock statistics are updated
according to the results.
For now only anonymous VMAs can be searched this way; in all other cases
the function returns NULL.
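Roughly, an architecture's page fault handler is expected to use it as
sketched below (illustrative only: the arch wiring and the
FAULT_FLAG_VMA_LOCK flag are added by later patches in this series, and
'lock_mmap' is a placeholder label for the existing mmap_lock path):

    vma = lock_vma_under_rcu(mm, address);
    if (!vma)
            goto lock_mmap;         /* fall back to mmap_lock-based handling */
    /* arch-specific access checks; on failure call vma_end_read() and fall back */
    fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
    vma_end_read(vma);
    if (fault & VM_FAULT_RETRY)
            goto lock_mmap;         /* could not complete under the VMA lock */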
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 3 +++
mm/memory.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 49 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3f98344f829c..36172bb38e6b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -701,6 +701,9 @@ static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
vma->detached = detached;
}
+struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
+ unsigned long address);
+
#else /* CONFIG_PER_VMA_LOCK */
static inline void vma_init_lock(struct vm_area_struct *vma) {}
diff --git a/mm/memory.c b/mm/memory.c
index 8177c59ffd2d..5e1c124552a1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5220,6 +5220,52 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
}
EXPORT_SYMBOL_GPL(handle_mm_fault);
+#ifdef CONFIG_PER_VMA_LOCK
+/*
+ * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
+ * stable and not isolated. If the VMA is not found or is being modified the
+ * function returns NULL.
+ */
+struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
+ unsigned long address)
+{
+ MA_STATE(mas, &mm->mm_mt, address, address);
+ struct vm_area_struct *vma;
+
+ rcu_read_lock();
+retry:
+ vma = mas_walk(&mas);
+ if (!vma)
+ goto inval;
+
+ /* Only anonymous vmas are supported for now */
+ if (!vma_is_anonymous(vma))
+ goto inval;
+
+ if (!vma_start_read(vma))
+ goto inval;
+
+ /* Check since vm_start/vm_end might change before we lock the VMA */
+ if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
+ vma_end_read(vma);
+ goto inval;
+ }
+
+ /* Check if the VMA got isolated after we found it */
+ if (vma->detached) {
+ vma_end_read(vma);
+ /* The area was replaced with another one */
+ goto retry;
+ }
+
+ rcu_read_unlock();
+ return vma;
+inval:
+ rcu_read_unlock();
+ return NULL;
+}
+#endif /* CONFIG_PER_VMA_LOCK */
+
#ifndef __PAGETABLE_P4D_FOLDED
/*
* Allocate p4d page table.
--
2.39.1
When vma->anon_vma is not set, the page fault handler will set it either
by reusing the anon_vma of an adjacent VMA, if the VMAs are compatible,
or by allocating a new one. find_mergeable_anon_vma() walks the VMA tree
to find a compatible adjacent VMA, which requires not only the faulting
VMA to be stable but also the tree structure and the other VMAs inside
that tree. Therefore locking just the faulting VMA is not enough for this
search.
Fall back to taking mmap_lock when vma->anon_vma is not set. This
situation happens only on the first page fault and should not affect
overall performance.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/memory.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index 5e1c124552a1..13369ff15ec1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5242,6 +5242,10 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
if (!vma_is_anonymous(vma))
goto inval;
+ /* find_mergeable_anon_vma uses adjacent vmas which are not locked */
+ if (!vma->anon_vma)
+ goto inval;
+
if (!vma_start_read(vma))
goto inval;
--
2.39.1
Add a new flag to distinguish page faults handled under protection of
per-vma lock.
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Laurent Dufour <[email protected]>
---
include/linux/mm.h | 3 ++-
include/linux/mm_types.h | 1 +
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 36172bb38e6b..3c9167529417 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -478,7 +478,8 @@ static inline bool fault_flag_allow_retry_first(enum fault_flag flags)
{ FAULT_FLAG_USER, "USER" }, \
{ FAULT_FLAG_REMOTE, "REMOTE" }, \
{ FAULT_FLAG_INSTRUCTION, "INSTRUCTION" }, \
- { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }
+ { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }, \
+ { FAULT_FLAG_VMA_LOCK, "VMA_LOCK" }
/*
* vm_fault is filled by the pagefault handler and passed to the vma's
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 939f4f5a1115..212e7f923a69 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1056,6 +1056,7 @@ enum fault_flag {
FAULT_FLAG_INTERRUPTIBLE = 1 << 9,
FAULT_FLAG_UNSHARE = 1 << 10,
FAULT_FLAG_ORIG_PTE_VALID = 1 << 11,
+ FAULT_FLAG_VMA_LOCK = 1 << 12,
};
typedef unsigned int __bitwise zap_flags_t;
--
2.39.1
Due to the possibility of do_swap_page dropping mmap_lock, abort fault
handling under VMA lock and retry holding mmap_lock. This can be handled
more gracefully in the future.
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Laurent Dufour <[email protected]>
---
mm/memory.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index 13369ff15ec1..555612d153ad 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3688,6 +3688,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (!pte_unmap_same(vmf))
goto out;
+ if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+ ret = VM_FAULT_RETRY;
+ goto out;
+ }
+
entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry))) {
if (is_migration_entry(entry)) {
--
2.39.1
Due to the possibility of handle_userfault dropping mmap_lock, avoid fault
handling under VMA lock and retry holding mmap_lock. This can be handled
more gracefully in the future.
Signed-off-by: Suren Baghdasaryan <[email protected]>
Suggested-by: Peter Xu <[email protected]>
---
mm/memory.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index 555612d153ad..751aebc1b29f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5254,6 +5254,15 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
if (!vma_start_read(vma))
goto inval;
+ /*
+ * Due to the possibility of userfault handler dropping mmap_lock, avoid
+ * it for now and fall back to page fault handling under mmap_lock.
+ */
+ if (userfaultfd_armed(vma)) {
+ vma_end_read(vma);
+ goto inval;
+ }
+
/* Check since vm_start/vm_end might change before we lock the VMA */
if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
vma_end_read(vma);
--
2.39.1
Add a new CONFIG_PER_VMA_LOCK_STATS config option to dump extra
statistics about page fault handling under VMA lock.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/vm_event_item.h | 6 ++++++
include/linux/vmstat.h | 6 ++++++
mm/Kconfig.debug | 6 ++++++
mm/memory.c | 2 ++
mm/vmstat.c | 6 ++++++
5 files changed, 26 insertions(+)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 7f5d1caf5890..8abfa1240040 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -149,6 +149,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#ifdef CONFIG_X86
DIRECT_MAP_LEVEL2_SPLIT,
DIRECT_MAP_LEVEL3_SPLIT,
+#endif
+#ifdef CONFIG_PER_VMA_LOCK_STATS
+ VMA_LOCK_SUCCESS,
+ VMA_LOCK_ABORT,
+ VMA_LOCK_RETRY,
+ VMA_LOCK_MISS,
#endif
NR_VM_EVENT_ITEMS
};
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 19cf5b6892ce..fed855bae6d8 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -125,6 +125,12 @@ static inline void vm_events_fold_cpu(int cpu)
#define count_vm_tlb_events(x, y) do { (void)(y); } while (0)
#endif
+#ifdef CONFIG_PER_VMA_LOCK_STATS
+#define count_vm_vma_lock_event(x) count_vm_event(x)
+#else
+#define count_vm_vma_lock_event(x) do {} while (0)
+#endif
+
#define __count_zid_vm_events(item, zid, delta) \
__count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta)
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index c3547a373c9c..4965a7333a3f 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -279,3 +279,9 @@ config DEBUG_KMEMLEAK_AUTO_SCAN
If unsure, say Y.
+config PER_VMA_LOCK_STATS
+ bool "Statistics for per-vma locks"
+ depends on PER_VMA_LOCK
+ default y
+ help
+ Statistics for per-vma locks.
diff --git a/mm/memory.c b/mm/memory.c
index 751aebc1b29f..94194a45ffa7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5272,6 +5272,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
/* Check if the VMA got isolated after we found it */
if (vma->detached) {
vma_end_read(vma);
+ count_vm_vma_lock_event(VMA_LOCK_MISS);
/* The area was replaced with another one */
goto retry;
}
@@ -5280,6 +5281,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
return vma;
inval:
rcu_read_unlock();
+ count_vm_vma_lock_event(VMA_LOCK_ABORT);
return NULL;
}
#endif /* CONFIG_PER_VMA_LOCK */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1ea6a5ce1c41..4f1089a1860e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1399,6 +1399,12 @@ const char * const vmstat_text[] = {
"direct_map_level2_splits",
"direct_map_level3_splits",
#endif
+#ifdef CONFIG_PER_VMA_LOCK_STATS
+ "vma_lock_success",
+ "vma_lock_abort",
+ "vma_lock_retry",
+ "vma_lock_miss",
+#endif
#endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
};
#endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
--
2.39.1
Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/mm/fault.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 37 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3604074a878b..3647f7bdb110 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -27,6 +27,7 @@ config X86_64
# Options that are inherently 64-bit kernel only:
select ARCH_HAS_GIGANTIC_PAGE
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
+ select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_USE_CMPXCHG_LOCKREF
select HAVE_ARCH_SOFT_DIRTY
select MODULES_USE_ELF_RELA
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a498ae1fbe66..e4399983c50c 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -19,6 +19,7 @@
#include <linux/uaccess.h> /* faulthandler_disabled() */
#include <linux/efi.h> /* efi_crash_gracefully_on_page_fault()*/
#include <linux/mm_types.h>
+#include <linux/mm.h> /* lock_vma_under_rcu() */
#include <asm/cpufeature.h> /* boot_cpu_has, ... */
#include <asm/traps.h> /* dotraplinkage, ... */
@@ -1333,6 +1334,38 @@ void do_user_addr_fault(struct pt_regs *regs,
}
#endif
+#ifdef CONFIG_PER_VMA_LOCK
+ if (!(flags & FAULT_FLAG_USER))
+ goto lock_mmap;
+
+ vma = lock_vma_under_rcu(mm, address);
+ if (!vma)
+ goto lock_mmap;
+
+ if (unlikely(access_error(error_code, vma))) {
+ vma_end_read(vma);
+ goto lock_mmap;
+ }
+ fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
+ vma_end_read(vma);
+
+ if (!(fault & VM_FAULT_RETRY)) {
+ count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+ goto done;
+ }
+ count_vm_vma_lock_event(VMA_LOCK_RETRY);
+
+ /* Quick path to respond to signals */
+ if (fault_signal_pending(fault, regs)) {
+ if (!user_mode(regs))
+ kernelmode_fixup_or_oops(regs, error_code, address,
+ SIGBUS, BUS_ADRERR,
+ ARCH_DEFAULT_PKEY);
+ return;
+ }
+lock_mmap:
+#endif /* CONFIG_PER_VMA_LOCK */
+
/*
* Kernel-mode access to the user address space should only occur
* on well-defined single instructions listed in the exception
@@ -1433,6 +1466,9 @@ void do_user_addr_fault(struct pt_regs *regs,
}
mmap_read_unlock(mm);
+#ifdef CONFIG_PER_VMA_LOCK
+done:
+#endif
if (likely(!(fault & VM_FAULT_ERROR)))
return;
--
2.39.1
Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/mm/fault.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 37 insertions(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c5ccca26a408..9f2c0e352da3 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -95,6 +95,7 @@ config ARM64
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_SUPPORTS_PAGE_TABLE_CHECK
+ select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
select ARCH_WANT_DEFAULT_BPF_JIT
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index f4cb0f85ccf4..9e0db5c387e3 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -535,6 +535,9 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
unsigned long vm_flags;
unsigned int mm_flags = FAULT_FLAG_DEFAULT;
unsigned long addr = untagged_addr(far);
+#ifdef CONFIG_PER_VMA_LOCK
+ struct vm_area_struct *vma;
+#endif
if (kprobe_page_fault(regs, esr))
return 0;
@@ -585,6 +588,36 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
+#ifdef CONFIG_PER_VMA_LOCK
+ if (!(mm_flags & FAULT_FLAG_USER))
+ goto lock_mmap;
+
+ vma = lock_vma_under_rcu(mm, addr);
+ if (!vma)
+ goto lock_mmap;
+
+ if (!(vma->vm_flags & vm_flags)) {
+ vma_end_read(vma);
+ goto lock_mmap;
+ }
+ fault = handle_mm_fault(vma, addr & PAGE_MASK,
+ mm_flags | FAULT_FLAG_VMA_LOCK, regs);
+ vma_end_read(vma);
+
+ if (!(fault & VM_FAULT_RETRY)) {
+ count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+ goto done;
+ }
+ count_vm_vma_lock_event(VMA_LOCK_RETRY);
+
+ /* Quick path to respond to signals */
+ if (fault_signal_pending(fault, regs)) {
+ if (!user_mode(regs))
+ goto no_context;
+ return 0;
+ }
+lock_mmap:
+#endif /* CONFIG_PER_VMA_LOCK */
/*
* As per x86, we may deadlock here. However, since the kernel only
* validly references user space from well defined areas of the code,
@@ -628,6 +661,9 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
}
mmap_read_unlock(mm);
+#ifdef CONFIG_PER_VMA_LOCK
+done:
+#endif
/*
* Handle the "normal" (no error) case first.
*/
--
2.39.1
From: Laurent Dufour <[email protected]>
Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.
Copied from "x86/mm: try VMA lock-based page fault handling first"
Signed-off-by: Laurent Dufour <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
arch/powerpc/mm/fault.c | 41 ++++++++++++++++++++++++++
arch/powerpc/platforms/powernv/Kconfig | 1 +
arch/powerpc/platforms/pseries/Kconfig | 1 +
3 files changed, 43 insertions(+)
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 2bef19cc1b98..c7ae86b04b8a 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -469,6 +469,44 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
if (is_exec)
flags |= FAULT_FLAG_INSTRUCTION;
+#ifdef CONFIG_PER_VMA_LOCK
+ if (!(flags & FAULT_FLAG_USER))
+ goto lock_mmap;
+
+ vma = lock_vma_under_rcu(mm, address);
+ if (!vma)
+ goto lock_mmap;
+
+ if (unlikely(access_pkey_error(is_write, is_exec,
+ (error_code & DSISR_KEYFAULT), vma))) {
+ int rc = bad_access_pkey(regs, address, vma);
+
+ vma_end_read(vma);
+ return rc;
+ }
+
+ if (unlikely(access_error(is_write, is_exec, vma))) {
+ int rc = bad_access(regs, address);
+
+ vma_end_read(vma);
+ return rc;
+ }
+
+ fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
+ vma_end_read(vma);
+
+ if (!(fault & VM_FAULT_RETRY)) {
+ count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+ goto done;
+ }
+ count_vm_vma_lock_event(VMA_LOCK_RETRY);
+
+ if (fault_signal_pending(fault, regs))
+ return user_mode(regs) ? 0 : SIGBUS;
+
+lock_mmap:
+#endif /* CONFIG_PER_VMA_LOCK */
+
/* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
* kernel and should generate an OOPS. Unfortunately, in the case of an
@@ -545,6 +583,9 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
mmap_read_unlock(current->mm);
+#ifdef CONFIG_PER_VMA_LOCK
+done:
+#endif
if (unlikely(fault & VM_FAULT_ERROR))
return mm_fault_error(regs, address, fault);
diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
index ae248a161b43..70a46acc70d6 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -16,6 +16,7 @@ config PPC_POWERNV
select PPC_DOORBELL
select MMU_NOTIFIER
select FORCE_SMP
+ select ARCH_SUPPORTS_PER_VMA_LOCK
default y
config OPAL_PRD
diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
index a3b4d99567cb..e036a04ff1ca 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -21,6 +21,7 @@ config PPC_PSERIES
select HOTPLUG_CPU
select FORCE_SMP
select SWIOTLB
+ select ARCH_SUPPORTS_PER_VMA_LOCK
default y
config PARAVIRT
--
2.39.1
call_rcu() can take a long time when callback offloading is enabled.
Its use in vm_area_free() can cause regressions in the exit path when
multiple VMAs are being freed.
Because exit_mmap() is called only after the last mm user drops its
refcount, page fault handlers can't be racing with it. Any other
possible users, such as the oom-reaper or process_mrelease, are already
synchronized using mmap_lock. Therefore exit_mmap() can free VMAs
directly, without the use of call_rcu().
Expose __vm_area_free() and use it from exit_mmap() to avoid possible
call_rcu() floods and the performance regressions caused by them.
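For contrast, a rough sketch of the two freeing paths after this change
(the call_rcu()-based vm_area_free() comes from an earlier patch in this
series and is paraphrased here):

    /* Regular path: the VMA may still be reachable by lock_vma_under_rcu(),
     * so freeing is deferred past an RCU grace period. */
    void vm_area_free(struct vm_area_struct *vma)
    {
            call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
    }

    /* exit_mmap() path: the last mm user is gone and no readers can race,
     * so remove_vma(vma, true) ends up calling __vm_area_free(vma) directly. */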
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 2 ++
kernel/fork.c | 2 +-
mm/mmap.c | 11 +++++++----
3 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3c9167529417..cedef02dfd2b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -256,6 +256,8 @@ void setup_initial_init_mm(void *start_code, void *end_code,
struct vm_area_struct *vm_area_alloc(struct mm_struct *);
struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
void vm_area_free(struct vm_area_struct *);
+/* Use only if VMA has no other users */
+void __vm_area_free(struct vm_area_struct *vma);
#ifndef CONFIG_MMU
extern struct rb_root nommu_region_tree;
diff --git a/kernel/fork.c b/kernel/fork.c
index a08cc0e2bfde..d0999de82f94 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -480,7 +480,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
return new;
}
-static void __vm_area_free(struct vm_area_struct *vma)
+void __vm_area_free(struct vm_area_struct *vma)
{
free_anon_vma_name(vma);
kmem_cache_free(vm_area_cachep, vma);
diff --git a/mm/mmap.c b/mm/mmap.c
index adf40177e68f..d847be615720 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -133,7 +133,7 @@ void unlink_file_vma(struct vm_area_struct *vma)
/*
* Close a vm structure and free it.
*/
-static void remove_vma(struct vm_area_struct *vma)
+static void remove_vma(struct vm_area_struct *vma, bool unreachable)
{
might_sleep();
if (vma->vm_ops && vma->vm_ops->close)
@@ -141,7 +141,10 @@ static void remove_vma(struct vm_area_struct *vma)
if (vma->vm_file)
fput(vma->vm_file);
mpol_put(vma_policy(vma));
- vm_area_free(vma);
+ if (unreachable)
+ __vm_area_free(vma);
+ else
+ vm_area_free(vma);
}
static inline struct vm_area_struct *vma_prev_limit(struct vma_iterator *vmi,
@@ -2135,7 +2138,7 @@ static inline void remove_mt(struct mm_struct *mm, struct ma_state *mas)
if (vma->vm_flags & VM_ACCOUNT)
nr_accounted += nrpages;
vm_stat_account(mm, vma->vm_flags, -nrpages);
- remove_vma(vma);
+ remove_vma(vma, false);
}
vm_unacct_memory(nr_accounted);
validate_mm(mm);
@@ -3085,7 +3088,7 @@ void exit_mmap(struct mm_struct *mm)
do {
if (vma->vm_flags & VM_ACCOUNT)
nr_accounted += vma_pages(vma);
- remove_vma(vma);
+ remove_vma(vma, true);
count++;
cond_resched();
} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
--
2.39.1
vma->lock being part of the vm_area_struct causes a performance regression
during page faults: under contention its count and owner fields are
constantly updated, and having other parts of vm_area_struct that are used
during page fault handling next to them causes constant cache line
bouncing. Fix that by moving the lock outside of the vm_area_struct.
All attempts to keep vma->lock inside vm_area_struct in a separate
cache line still produce a performance regression, especially on NUMA
machines. The smallest regression was achieved when the lock was placed
in the fourth cache line, but that bloats vm_area_struct to 256 bytes.
Considering the performance and memory impact, a separately allocated lock
looks like the best option. It increases the memory footprint of each VMA,
but that can be optimized later if the new size causes issues.
Note that after this change vma_init() does not allocate or initialize
vma->lock anymore. A number of drivers allocate a pseudo VMA on the stack,
but they never use the VMA's lock, so it does not need to be allocated.
Future drivers which might need the VMA lock should use
vm_area_alloc()/vm_area_free() to allocate the VMA.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 23 ++++++-------
include/linux/mm_types.h | 6 +++-
kernel/fork.c | 73 ++++++++++++++++++++++++++++++++--------
3 files changed, 74 insertions(+), 28 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cedef02dfd2b..96b18ef3bfa3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -627,12 +627,6 @@ struct vm_operations_struct {
};
#ifdef CONFIG_PER_VMA_LOCK
-static inline void vma_init_lock(struct vm_area_struct *vma)
-{
- init_rwsem(&vma->lock);
- vma->vm_lock_seq = -1;
-}
-
/*
* Try to read-lock a vma. The function is allowed to occasionally yield false
* locked result to avoid performance overhead, in which case we fall back to
@@ -644,17 +638,17 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
return false;
- if (unlikely(down_read_trylock(&vma->lock) == 0))
+ if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
return false;
/*
* Overflow might produce false locked result.
* False unlocked result is impossible because we modify and check
- * vma->vm_lock_seq under vma->lock protection and mm->mm_lock_seq
+ * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
* modification invalidates all existing locks.
*/
if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
- up_read(&vma->lock);
+ up_read(&vma->vm_lock->lock);
return false;
}
return true;
@@ -663,7 +657,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
static inline void vma_end_read(struct vm_area_struct *vma)
{
rcu_read_lock(); /* keeps vma alive till the end of up_read */
- up_read(&vma->lock);
+ up_read(&vma->vm_lock->lock);
rcu_read_unlock();
}
@@ -681,9 +675,9 @@ static inline void vma_start_write(struct vm_area_struct *vma)
if (vma->vm_lock_seq == mm_lock_seq)
return;
- down_write(&vma->lock);
+ down_write(&vma->vm_lock->lock);
vma->vm_lock_seq = mm_lock_seq;
- up_write(&vma->lock);
+ up_write(&vma->vm_lock->lock);
}
static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -720,6 +714,10 @@ static inline void vma_mark_detached(struct vm_area_struct *vma,
#endif /* CONFIG_PER_VMA_LOCK */
+/*
+ * WARNING: vma_init does not initialize vma->vm_lock.
+ * Use vm_area_alloc()/vm_area_free() if vma needs locking.
+ */
static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
{
static const struct vm_operations_struct dummy_vm_ops = {};
@@ -729,7 +727,6 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_ops = &dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
vma_mark_detached(vma, false);
- vma_init_lock(vma);
}
/* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 212e7f923a69..30d4f867ae56 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -471,6 +471,10 @@ struct anon_vma_name {
char name[];
};
+struct vma_lock {
+ struct rw_semaphore lock;
+};
+
/*
* This struct describes a virtual memory area. There is one of these
* per VM-area/task. A VM area is any part of the process virtual memory
@@ -510,7 +514,7 @@ struct vm_area_struct {
#ifdef CONFIG_PER_VMA_LOCK
int vm_lock_seq;
- struct rw_semaphore lock;
+ struct vma_lock *vm_lock;
/* Flag to indicate areas detached from the mm->mm_mt tree */
bool detached;
diff --git a/kernel/fork.c b/kernel/fork.c
index d0999de82f94..a152804faa14 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -451,13 +451,49 @@ static struct kmem_cache *vm_area_cachep;
/* SLAB cache for mm_struct structures (tsk->mm) */
static struct kmem_cache *mm_cachep;
+#ifdef CONFIG_PER_VMA_LOCK
+
+/* SLAB cache for vm_area_struct.lock */
+static struct kmem_cache *vma_lock_cachep;
+
+static bool vma_lock_alloc(struct vm_area_struct *vma)
+{
+ vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
+ if (!vma->vm_lock)
+ return false;
+
+ init_rwsem(&vma->vm_lock->lock);
+ vma->vm_lock_seq = -1;
+
+ return true;
+}
+
+static inline void vma_lock_free(struct vm_area_struct *vma)
+{
+ kmem_cache_free(vma_lock_cachep, vma->vm_lock);
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+static inline bool vma_lock_alloc(struct vm_area_struct *vma) { return true; }
+static inline void vma_lock_free(struct vm_area_struct *vma) {}
+
+#endif /* CONFIG_PER_VMA_LOCK */
+
struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
{
struct vm_area_struct *vma;
vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
- if (vma)
- vma_init(vma, mm);
+ if (!vma)
+ return NULL;
+
+ vma_init(vma, mm);
+ if (!vma_lock_alloc(vma)) {
+ kmem_cache_free(vm_area_cachep, vma);
+ return NULL;
+ }
+
return vma;
}
@@ -465,24 +501,30 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
{
struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
- if (new) {
- ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
- ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
- /*
- * orig->shared.rb may be modified concurrently, but the clone
- * will be reinitialized.
- */
- data_race(memcpy(new, orig, sizeof(*new)));
- INIT_LIST_HEAD(&new->anon_vma_chain);
- vma_init_lock(new);
- dup_anon_vma_name(orig, new);
+ if (!new)
+ return NULL;
+
+ ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
+ ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
+ /*
+ * orig->shared.rb may be modified concurrently, but the clone
+ * will be reinitialized.
+ */
+ data_race(memcpy(new, orig, sizeof(*new)));
+ if (!vma_lock_alloc(new)) {
+ kmem_cache_free(vm_area_cachep, new);
+ return NULL;
}
+ INIT_LIST_HEAD(&new->anon_vma_chain);
+ dup_anon_vma_name(orig, new);
+
return new;
}
void __vm_area_free(struct vm_area_struct *vma)
{
free_anon_vma_name(vma);
+ vma_lock_free(vma);
kmem_cache_free(vm_area_cachep, vma);
}
@@ -493,7 +535,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
vm_rcu);
/* The vma should not be locked while being destroyed. */
- VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock), vma);
+ VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock), vma);
__vm_area_free(vma);
}
#endif
@@ -3089,6 +3131,9 @@ void __init proc_caches_init(void)
NULL);
vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
+#ifdef CONFIG_PER_VMA_LOCK
+ vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
+#endif
mmap_init();
nsproxy_cache_init();
}
--
2.39.1
First, sorry I didn't see this before v3..
* Suren Baghdasaryan <[email protected]> [230216 00:18]:
> While unmapping VMAs, adjacent VMAs might be able to grow into the area
> being unmapped. In such cases write-lock adjacent VMAs to prevent this
> growth.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> mm/mmap.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 118b2246bba9..00f8c5798936 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2399,11 +2399,13 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> * down_read(mmap_lock) and collide with the VMA we are about to unmap.
> */
> if (downgrade) {
> - if (next && (next->vm_flags & VM_GROWSDOWN))
> + if (next && (next->vm_flags & VM_GROWSDOWN)) {
> + vma_start_write(next);
> downgrade = false;
If the mmap write lock is insufficient to protect us from next/prev
modifications then we need to move *most* of this block above the maple
tree write operation, otherwise we have a race here. When I say most, I
mean everything besides the call to mmap_write_downgrade() needs to be
moved.
If the mmap write lock is sufficient to protect us from next/prev
modifications then we don't need to write lock the vmas themselves.
I believe this is for expand_stack() protection, so I believe it's okay
to not vma write lock these vmas.. I don't think there are other areas
where we can modify the vmas without holding the mmap lock, but others
on the CC list please chime in if I've forgotten something.
So, if I am correct, then you shouldn't lock next/prev and allow the
vma locking fault method on these vmas. This will work because
lock_vma_under_rcu() uses mas_walk() on the faulting address. That is,
your lock_vma_under_rcu() will fail to find anything that needs to be
grown and go back to mmap lock protection. As it is written today, the
vma locking fault handler will fail and we will wait for the mmap lock
to be released even when the vma isn't going to expand.
> - else if (prev && (prev->vm_flags & VM_GROWSUP))
> + } else if (prev && (prev->vm_flags & VM_GROWSUP)) {
> + vma_start_write(prev);
> downgrade = false;
> - else
> + } else
> mmap_write_downgrade(mm);
> }
>
> --
> 2.39.1
On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote:
> When vma->anon_vma is not set, page fault handler will set it by either
> reusing anon_vma of an adjacent VMA if VMAs are compatible or by
> allocating a new one. find_mergeable_anon_vma() walks VMA tree to find
> a compatible adjacent VMA and that requires not only the faulting VMA
> to be stable but also the tree structure and other VMAs inside that tree.
> Therefore locking just the faulting VMA is not enough for this search.
> Fall back to taking mmap_lock when vma->anon_vma is not set. This
> situation happens only on the first page fault and should not affect
> overall performance.
I think I asked this before, but don't remember getting an answer.
Why do we defer setting anon_vma to the first fault? Why don't we
set it up at mmap time?
On Thu, Feb 16, 2023 at 7:34 AM Liam R. Howlett <[email protected]> wrote:
>
>
> First, sorry I didn't see this before v3..
Feedback at any time is highly appreciated!
>
> * Suren Baghdasaryan <[email protected]> [230216 00:18]:
> > While unmapping VMAs, adjacent VMAs might be able to grow into the area
> > being unmapped. In such cases write-lock adjacent VMAs to prevent this
> > growth.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > mm/mmap.c | 8 +++++---
> > 1 file changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 118b2246bba9..00f8c5798936 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2399,11 +2399,13 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > * down_read(mmap_lock) and collide with the VMA we are about to unmap.
> > */
> > if (downgrade) {
> > - if (next && (next->vm_flags & VM_GROWSDOWN))
> > + if (next && (next->vm_flags & VM_GROWSDOWN)) {
> > + vma_start_write(next);
> > downgrade = false;
>
> If the mmap write lock is insufficient to protect us from next/prev
> modifications then we need to move *most* of this block above the maple
> tree write operation, otherwise we have a race here. When I say most, I
> mean everything besides the call to mmap_write_downgrade() needs to be
> moved.
Which prior maple tree write operation are you referring to? I see
__split_vma() and munmap_sidetree() which both already lock the VMAs
they operate on, so page faults can't happen in those VMAs.
>
> If the mmap write lock is sufficient to protect us from next/prev
> modifications then we don't need to write lock the vmas themselves.
mmap write lock is not sufficient because with per-VMA locks we do not
take mmap lock at all.
>
> I believe this is for expand_stack() protection, so I believe it's okay
> to not vma write lock these vmas.. I don't think there are other areas
> where we can modify the vmas without holding the mmap lock, but others
> on the CC list please chime in if I've forgotten something.
>
> So, if I am correct, then you shouldn't lock next/prev and allow the
> vma locking fault method on these vmas. This will work because
> lock_vma_under_rcu() uses mas_walk() on the faulting address. That is,
> your lock_vma_under_rcu() will fail to find anything that needs to be
> grown and go back to mmap lock protection. As it is written today, the
> vma locking fault handler will fail and we will wait for the mmap lock
> to be released even when the vma isn't going to expand.
So, let's consider a case when the next VMA is not being removed (so
it was neither removed nor locked by munmap_sidetree()) and it is
found by lock_vma_under_rcu() in the page fault handling path. Page
fault handler can now expand it and push into the area we are
unmapping in unmap_region(). That is the race I'm trying to prevent
here by locking the next/prev VMAs which can be expanded before
unmap_region() unmaps them. Am I missing something?
>
>
> > - else if (prev && (prev->vm_flags & VM_GROWSUP))
> > + } else if (prev && (prev->vm_flags & VM_GROWSUP)) {
> > + vma_start_write(prev);
> > downgrade = false;
> > - else
> > + } else
> > mmap_write_downgrade(mm);
> > }
> >
> > --
> > 2.39.1
>
On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox <[email protected]> wrote:
>
> On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote:
> > When vma->anon_vma is not set, page fault handler will set it by either
> > reusing anon_vma of an adjacent VMA if VMAs are compatible or by
> > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find
> > a compatible adjacent VMA and that requires not only the faulting VMA
> > to be stable but also the tree structure and other VMAs inside that tree.
> > Therefore locking just the faulting VMA is not enough for this search.
> > Fall back to taking mmap_lock when vma->anon_vma is not set. This
> > situation happens only on the first page fault and should not affect
> > overall performance.
>
> I think I asked this before, but don't remember getting an aswer.
> Why do we defer setting anon_vma to the first fault? Why don't we
> set it up at mmap time?
Yeah, I remember that conversation; Matthew and I could not find the
definitive answer at the time. I'll look into that again or maybe
someone can answer it here.
In the end rather than changing that logic I decided to skip
vma->anon_vma==NULL cases because I measured them being less than
0.01% of all page faults, so ROI from changing that would be quite
low. But I agree that the logic is weird and maybe we can improve
that. I will have to review that again when I'm working on eliminating
all these special cases we skip, like swap/userfaults/etc.
On Thu, Feb 16, 2023 at 11:43 AM Suren Baghdasaryan <[email protected]> wrote:
>
> On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox <[email protected]> wrote:
> >
> > On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote:
> > > When vma->anon_vma is not set, page fault handler will set it by either
> > > reusing anon_vma of an adjacent VMA if VMAs are compatible or by
> > > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find
> > > a compatible adjacent VMA and that requires not only the faulting VMA
> > > to be stable but also the tree structure and other VMAs inside that tree.
> > > Therefore locking just the faulting VMA is not enough for this search.
> > > Fall back to taking mmap_lock when vma->anon_vma is not set. This
> > > situation happens only on the first page fault and should not affect
> > > overall performance.
> >
> > I think I asked this before, but don't remember getting an aswer.
> > Why do we defer setting anon_vma to the first fault? Why don't we
> > set it up at mmap time?
>
> Yeah, I remember that conversation Matthew and I could not find the
> definitive answer at the time. I'll look into that again or maybe
> someone can answer it here.
After looking into it again I'm still under the impression that
vma->anon_vma is populated lazily (during the first page fault rather
than at mmap time) to avoid doing extra work for areas which are never
faulted. Though I might be missing some important detail here.
>
> In the end rather than changing that logic I decided to skip
> vma->anon_vma==NULL cases because I measured them being less than
> 0.01% of all page faults, so ROI from changing that would be quite
> low. But I agree that the logic is weird and maybe we can improve
> that. I will have to review that again when I'm working on eliminating
> all these special cases we skip, like swap/userfaults/etc.
On Fri, Feb 17, 2023 at 11:15 AM Suren Baghdasaryan <[email protected]> wrote:
>
> On Thu, Feb 16, 2023 at 11:43 AM Suren Baghdasaryan <[email protected]> wrote:
> >
> > On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox <[email protected]> wrote:
> > >
> > > On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote:
> > > > When vma->anon_vma is not set, page fault handler will set it by either
> > > > reusing anon_vma of an adjacent VMA if VMAs are compatible or by
> > > > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find
> > > > a compatible adjacent VMA and that requires not only the faulting VMA
> > > > to be stable but also the tree structure and other VMAs inside that tree.
> > > > Therefore locking just the faulting VMA is not enough for this search.
> > > > Fall back to taking mmap_lock when vma->anon_vma is not set. This
> > > > situation happens only on the first page fault and should not affect
> > > > overall performance.
> > >
> > > I think I asked this before, but don't remember getting an aswer.
> > > Why do we defer setting anon_vma to the first fault? Why don't we
> > > set it up at mmap time?
> >
> > Yeah, I remember that conversation Matthew and I could not find the
> > definitive answer at the time. I'll look into that again or maybe
> > someone can answer it here.
>
> After looking into it again I'm still under the impression that
> vma->anon_vma is populated lazily (during the first page fault rather
> than at mmap time) to avoid doing extra work for areas which are never
> faulted. Though I might be missing some important detail here.
I think this is because the kernel cannot merge VMAs that have
different anon_vmas?
Enabling lazy population of anon_vma could potentially increase the
chances of merging VMAs.
> > In the end rather than changing that logic I decided to skip
> > vma->anon_vma==NULL cases because I measured them being less than
> > 0.01% of all page faults, so ROI from changing that would be quite
> > low. But I agree that the logic is weird and maybe we can improve
> > that. I will have to review that again when I'm working on eliminating
> > all these special cases we skip, like swap/userfaults/etc.
* Suren Baghdasaryan <[email protected]> [230216 14:36]:
> On Thu, Feb 16, 2023 at 7:34 AM Liam R. Howlett <[email protected]> wrote:
> >
> >
> > First, sorry I didn't see this before v3..
>
> Feedback at any time is highly appreciated!
>
> >
> > * Suren Baghdasaryan <[email protected]> [230216 00:18]:
> > > While unmapping VMAs, adjacent VMAs might be able to grow into the area
> > > being unmapped. In such cases write-lock adjacent VMAs to prevent this
> > > growth.
> > >
> > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > ---
> > > mm/mmap.c | 8 +++++---
> > > 1 file changed, 5 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 118b2246bba9..00f8c5798936 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -2399,11 +2399,13 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > * down_read(mmap_lock) and collide with the VMA we are about to unmap.
> > > */
> > > if (downgrade) {
> > > - if (next && (next->vm_flags & VM_GROWSDOWN))
> > > + if (next && (next->vm_flags & VM_GROWSDOWN)) {
> > > + vma_start_write(next);
> > > downgrade = false;
> >
> > If the mmap write lock is insufficient to protect us from next/prev
> > modifications then we need to move *most* of this block above the maple
> > tree write operation, otherwise we have a race here. When I say most, I
> > mean everything besides the call to mmap_write_downgrade() needs to be
> > moved.
>
> Which prior maple tree write operation are you referring to? I see
> __split_vma() and munmap_sidetree() which both already lock the VMAs
> they operate on, so page faults can't happen in those VMAs.
The write that removes the VMAs from the maple tree a few lines above..
/* Point of no return */
If the mmap lock is not sufficient, then we need to move the
vma_start_write() of prev/next to above the call to
vma_iter_clear_gfp() in do_vmi_align_munmap().
But I still think it IS enough.
>
> >
> > If the mmap write lock is sufficient to protect us from next/prev
> > modifications then we don't need to write lock the vmas themselves.
>
> mmap write lock is not sufficient because with per-VMA locks we do not
> take mmap lock at all.
Understood, but it also does not expand VMAs.
>
> >
> > I believe this is for expand_stack() protection, so I believe it's okay
> > to not vma write lock these vmas.. I don't think there are other areas
> > where we can modify the vmas without holding the mmap lock, but others
> > on the CC list please chime in if I've forgotten something.
> >
> > So, if I am correct, then you shouldn't lock next/prev and allow the
> > vma locking fault method on these vmas. This will work because
> > lock_vma_under_rcu() uses mas_walk() on the faulting address. That is,
> > your lock_vma_under_rcu() will fail to find anything that needs to be
> > grown and go back to mmap lock protection. As it is written today, the
> > vma locking fault handler will fail and we will wait for the mmap lock
> > to be released even when the vma isn't going to expand.
>
> So, let's consider a case when the next VMA is not being removed (so
> it was neither removed nor locked by munmap_sidetree()) and it is
> found by lock_vma_under_rcu() in the page fault handling path.
By this point next VMA is either NULL or outside the munmap area, so
what you said here is always true.
>Page
> fault handler can now expand it and push into the area we are
> unmapping in unmap_region(). That is the race I'm trying to prevent
> here by locking the next/prev VMAs which can be expanded before
> unmap_region() unmaps them. Am I missing something?
Yes, I think the part you are missing (or I am missing..) is that
expand_stack() will never be called without the mmap lock. We don't use
the vma locking to expand the stack.
...
On Fri, Feb 17, 2023 at 6:51 AM Liam R. Howlett <[email protected]> wrote:
>
> * Suren Baghdasaryan <[email protected]> [230216 14:36]:
> > On Thu, Feb 16, 2023 at 7:34 AM Liam R. Howlett <[email protected]> wrote:
> > >
> > >
> > > First, sorry I didn't see this before v3..
> >
> > Feedback at any time is highly appreciated!
> >
> > >
> > > * Suren Baghdasaryan <[email protected]> [230216 00:18]:
> > > > While unmapping VMAs, adjacent VMAs might be able to grow into the area
> > > > being unmapped. In such cases write-lock adjacent VMAs to prevent this
> > > > growth.
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > > ---
> > > > mm/mmap.c | 8 +++++---
> > > > 1 file changed, 5 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index 118b2246bba9..00f8c5798936 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -2399,11 +2399,13 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > > * down_read(mmap_lock) and collide with the VMA we are about to unmap.
> > > > */
> > > > if (downgrade) {
> > > > - if (next && (next->vm_flags & VM_GROWSDOWN))
> > > > + if (next && (next->vm_flags & VM_GROWSDOWN)) {
> > > > + vma_start_write(next);
> > > > downgrade = false;
> > >
> > > If the mmap write lock is insufficient to protect us from next/prev
> > > modifications then we need to move *most* of this block above the maple
> > > tree write operation, otherwise we have a race here. When I say most, I
> > > mean everything besides the call to mmap_write_downgrade() needs to be
> > > moved.
> >
> > Which prior maple tree write operation are you referring to? I see
> > __split_vma() and munmap_sidetree() which both already lock the VMAs
> > they operate on, so page faults can't happen in those VMAs.
>
> The write that removes the VMAs from the maple tree a few lines above..
> /* Point of no return */
>
> If the mmap lock is not sufficient, then we need to move the
> vma_start_write() of prev/next to above the call to
> vma_iter_clear_gfp() in do_vmi_align_munmap().
>
> But I still think it IS enough.
>
> >
> > >
> > > If the mmap write lock is sufficient to protect us from next/prev
> > > modifications then we don't need to write lock the vmas themselves.
> >
> > mmap write lock is not sufficient because with per-VMA locks we do not
> > take mmap lock at all.
>
> Understood, but it also does not expand VMAs.
>
> >
> > >
> > > I believe this is for expand_stack() protection, so I believe it's okay
> > > to not vma write lock these vmas.. I don't think there are other areas
> > > where we can modify the vmas without holding the mmap lock, but others
> > > on the CC list please chime in if I've forgotten something.
> > >
> > > So, if I am correct, then you shouldn't lock next/prev and allow the
> > > vma locking fault method on these vmas. This will work because
> > > lock_vma_under_rcu() uses mas_walk() on the faulting address. That is,
> > > your lock_vma_under_rcu() will fail to find anything that needs to be
> > > grown and go back to mmap lock protection. As it is written today, the
> > > vma locking fault handler will fail and we will wait for the mmap lock
> > > to be released even when the vma isn't going to expand.
> >
> > So, let's consider a case when the next VMA is not being removed (so
> > it was neither removed nor locked by munmap_sidetree()) and it is
> > found by lock_vma_under_rcu() in the page fault handling path.
>
> By this point next VMA is either NULL or outside the munmap area, so
> what you said here is always true.
>
> >Page
> > fault handler can now expand it and push into the area we are
> > unmapping in unmap_region(). That is the race I'm trying to prevent
> > here by locking the next/prev VMAs which can be expanded before
> > unmap_region() unmaps them. Am I missing something?
>
> Yes, I think the part you are missing (or I am missing..) is that
> expand_stack() will never be called without the mmap lock. We don't use
> the vma locking to expand the stack.
Ah, yes, you are absolutely right. I missed that when the VMA expands
as a result of a page fault, lock_vma_under_rcu() can't find the
faulting VMA (the fault is outside of the area and hence the need to
expand) and will fall back to mmap read locking. Since
do_vmi_align_munmap() holds the mmap write lock and does not downgrade
it, the race will be avoided and expansion will wait until we drop the
mmap write lock.
Good catch Liam! We can drop this patch completely from the series.
Thanks,
Suren.
>
> ...
>
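To make the fallback concrete, here is a rough sketch of the fault path being described. Helper names such as lock_vma_under_rcu(), vma_end_read() and FAULT_FLAG_VMA_LOCK come from this series, but the exact arch fault-handler code differs, so treat this as an illustration only:

	vma = lock_vma_under_rcu(mm, address);
	if (!vma)
		goto fallback;		/* includes the would-be expand_stack() case */

	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
	vma_end_read(vma);
	if (!(fault & VM_FAULT_RETRY))
		return fault;

fallback:
	mmap_read_lock(mm);
	/* regular fault handling; stack expansion only ever happens here,
	 * serialized against do_vmi_align_munmap() holding the write lock */

Because the per-VMA walk only looks up the faulting address, a fault just below a VM_GROWSDOWN stack finds no VMA and ends up on the mmap_lock path, which is why write-locking next/prev turns out to be unnecessary.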
On Thu, Feb 16, 2023 at 06:14:59PM -0800, Suren Baghdasaryan wrote:
> On Thu, Feb 16, 2023 at 11:43 AM Suren Baghdasaryan <[email protected]> wrote:
> >
> > On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox <[email protected]> wrote:
> > >
> > > On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote:
> > > > When vma->anon_vma is not set, page fault handler will set it by either
> > > > reusing anon_vma of an adjacent VMA if VMAs are compatible or by
> > > > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find
> > > > a compatible adjacent VMA and that requires not only the faulting VMA
> > > > to be stable but also the tree structure and other VMAs inside that tree.
> > > > Therefore locking just the faulting VMA is not enough for this search.
> > > > Fall back to taking mmap_lock when vma->anon_vma is not set. This
> > > > situation happens only on the first page fault and should not affect
> > > > overall performance.
> > >
> > > I think I asked this before, but don't remember getting an answer.
> > > Why do we defer setting anon_vma to the first fault? Why don't we
> > > set it up at mmap time?
> >
> > Yeah, I remember that conversation Matthew and I could not find the
> > definitive answer at the time. I'll look into that again or maybe
> > someone can answer it here.
>
> After looking into it again I'm still under the impression that
> vma->anon_vma is populated lazily (during the first page fault rather
> than at mmap time) to avoid doing extra work for areas which are never
> faulted. Though I might be missing some important detail here.
How often does userspace call mmap() and then _never_ fault on it?
I appreciate that userspace might mmap() gigabytes of address space and
then only end up using a small amount of it, so populating it lazily
makes sense. But creating a region and never faulting on it? The only
use-case I can think of is loading shared libraries:
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
(...)
mmap(NULL, 1970000, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0ce612e000
mmap(0x7f0ce6154000, 1396736, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x26000) = 0x7f0ce6154000
mmap(0x7f0ce62a9000, 339968, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17b000) = 0x7f0ce62a9000
mmap(0x7f0ce62fc000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1ce000) = 0x7f0ce62fc000
mmap(0x7f0ce6302000, 53072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0ce6302000
but that's a file-backed VMA, not an anon VMA.
On Fri, Feb 17, 2023 at 8:05 AM Matthew Wilcox <[email protected]> wrote:
>
> On Thu, Feb 16, 2023 at 06:14:59PM -0800, Suren Baghdasaryan wrote:
> > On Thu, Feb 16, 2023 at 11:43 AM Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox <[email protected]> wrote:
> > > >
> > > > On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote:
> > > > > When vma->anon_vma is not set, page fault handler will set it by either
> > > > > reusing anon_vma of an adjacent VMA if VMAs are compatible or by
> > > > > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find
> > > > > a compatible adjacent VMA and that requires not only the faulting VMA
> > > > > to be stable but also the tree structure and other VMAs inside that tree.
> > > > > Therefore locking just the faulting VMA is not enough for this search.
> > > > > Fall back to taking mmap_lock when vma->anon_vma is not set. This
> > > > > situation happens only on the first page fault and should not affect
> > > > > overall performance.
> > > >
> > > > I think I asked this before, but don't remember getting an answer.
> > > > Why do we defer setting anon_vma to the first fault? Why don't we
> > > > set it up at mmap time?
> > >
> > > Yeah, I remember that conversation Matthew and I could not find the
> > > definitive answer at the time. I'll look into that again or maybe
> > > someone can answer it here.
> >
> > After looking into it again I'm still under the impression that
> > vma->anon_vma is populated lazily (during the first page fault rather
> > than at mmap time) to avoid doing extra work for areas which are never
> > faulted. Though I might be missing some important detail here.
>
> How often does userspace call mmap() and then _never_ fault on it?
> I appreciate that userspace might mmap() gigabytes of address space and
> then only end up using a small amount of it, so populating it lazily
> makes sense. But creating a region and never faulting on it? The only
> use-case I can think of is loading shared libraries:
>
> openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
> (...)
> mmap(NULL, 1970000, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0ce612e000
> mmap(0x7f0ce6154000, 1396736, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x26000) = 0x7f0ce6154000
> mmap(0x7f0ce62a9000, 339968, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17b000) = 0x7f0ce62a9000
> mmap(0x7f0ce62fc000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1ce000) = 0x7f0ce62fc000
> mmap(0x7f0ce6302000, 53072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0ce6302000
>
> but that's a file-backed VMA, not an anon VMA.
Might the case of dup_mmap() during fork() be the reason why a VMA in
the child process might never be used while the parent uses it (or vice
versa)? Again, I'm not sure this is the reason, but I can find no other
good explanation.
On Fri, Feb 17, 2023 at 2:21 AM Hyeonggon Yoo <[email protected]> wrote:
>
> On Fri, Feb 17, 2023 at 11:15 AM Suren Baghdasaryan <[email protected]> wrote:
> >
> > On Thu, Feb 16, 2023 at 11:43 AM Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox <[email protected]> wrote:
> > > >
> > > > On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote:
> > > > > When vma->anon_vma is not set, page fault handler will set it by either
> > > > > reusing anon_vma of an adjacent VMA if VMAs are compatible or by
> > > > > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find
> > > > > a compatible adjacent VMA and that requires not only the faulting VMA
> > > > > to be stable but also the tree structure and other VMAs inside that tree.
> > > > > Therefore locking just the faulting VMA is not enough for this search.
> > > > > Fall back to taking mmap_lock when vma->anon_vma is not set. This
> > > > > situation happens only on the first page fault and should not affect
> > > > > overall performance.
> > > >
> > > > I think I asked this before, but don't remember getting an answer.
> > > > Why do we defer setting anon_vma to the first fault? Why don't we
> > > > set it up at mmap time?
> > >
> > > Yeah, I remember that conversation Matthew and I could not find the
> > > definitive answer at the time. I'll look into that again or maybe
> > > someone can answer it here.
> >
> > After looking into it again I'm still under the impression that
> > vma->anon_vma is populated lazily (during the first page fault rather
> > than at mmap time) to avoid doing extra work for areas which are never
> > faulted. Though I might be missing some important detail here.
>
> I think this is because the kernel cannot merge VMAs that have
> different anon_vmas?
>
> Enabling lazy population of anon_vma could potentially increase the
> chances of merging VMAs.
Hmm. Do you have a clear explanation why merging chances increase this
way? A couple of possibilities I can think of would be:
1. If after mmap'ing a VMA and before faulting the first page into it
we often change something that affects anon_vma_compatible() decision,
like vm_policy;
2. When mmap'ing VMAs we do not map them consecutively but the final
arrangement is actually contiguous.
Don't think either of those cases would be very representative of a
usual case but maybe I'm wrong or there is another reason?
>
> > > In the end rather than changing that logic I decided to skip
> > > vma->anon_vma==NULL cases because I measured them being less than
> > > 0.01% of all page faults, so ROI from changing that would be quite
> > > low. But I agree that the logic is weird and maybe we can improve
> > > that. I will have to review that again when I'm working on eliminating
> > > all these special cases we skip, like swap/userfaults/etc.
>
On Fri, Feb 17, 2023 at 08:13:01AM -0800, Suren Baghdasaryan wrote:
> On Fri, Feb 17, 2023 at 2:21 AM Hyeonggon Yoo <[email protected]> wrote:
> >
> > On Fri, Feb 17, 2023 at 11:15 AM Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > On Thu, Feb 16, 2023 at 11:43 AM Suren Baghdasaryan <[email protected]> wrote:
> > > >
> > > > On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox <[email protected]> wrote:
> > > > >
> > > > > On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote:
> > > > > > When vma->anon_vma is not set, page fault handler will set it by either
> > > > > > reusing anon_vma of an adjacent VMA if VMAs are compatible or by
> > > > > > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find
> > > > > > a compatible adjacent VMA and that requires not only the faulting VMA
> > > > > > to be stable but also the tree structure and other VMAs inside that tree.
> > > > > > Therefore locking just the faulting VMA is not enough for this search.
> > > > > > Fall back to taking mmap_lock when vma->anon_vma is not set. This
> > > > > > situation happens only on the first page fault and should not affect
> > > > > > overall performance.
> > > > >
> > > > > I think I asked this before, but don't remember getting an answer.
> > > > > Why do we defer setting anon_vma to the first fault? Why don't we
> > > > > set it up at mmap time?
> > > >
> > > > Yeah, I remember that conversation Matthew and I could not find the
> > > > definitive answer at the time. I'll look into that again or maybe
> > > > someone can answer it here.
> > >
> > > After looking into it again I'm still under the impression that
> > > vma->anon_vma is populated lazily (during the first page fault rather
> > > than at mmap time) to avoid doing extra work for areas which are never
> > > faulted. Though I might be missing some important detail here.
> >
> > I think this is because the kernel cannot merge VMAs that have
> > different anon_vmas?
> >
> > Enabling lazy population of anon_vma could potentially increase the
> > chances of merging VMAs.
>
> Hmm. Do you have a clear explanation why merging chances increase this
> way? A couple of possibilities I can think of would be:
> 1. If after mmap'ing a VMA and before faulting the first page into it
> we often change something that affects anon_vma_compatible() decision,
> like vm_policy;
> 2. When mmap'ing VMAs we do not map them consecutively but the final
> arrangement is actually contiguous.
>
> Don't think either of those cases would be very representative of a
> usual case but maybe I'm wrong or there is another reason?
Ok. I agree it does not represent common cases.
Hmm then I wonder how it went from the initial approach of "allocate anon_vma
objects only via fork()" [1] to "populate anon_vma at page faults". [2] [3]
Maybe Hugh, Andrea or Andrew have opinions?
[1] anon_vma RFC2, lore.kernel.org
https://lore.kernel.org/lkml/[email protected]
[2] The status of object-based reverse mapping, LWN.net
https://lwn.net/Articles/85908
[3] rmap 39 add anon_vma rmap
https://gitlab.com/hyeyoo/linux-historical/-/commit/8aa3448cabdfca146aa3fd36e852d0209fb2276a
>
> >
> > > > In the end rather than changing that logic I decided to skip
> > > > vma->anon_vma==NULL cases because I measured them being less than
> > > > 0.01% of all page faults, so ROI from changing that would be quite
> > > > low. But I agree that the logic is weird and maybe we can improve
> > > > that. I will have to review that again when I'm working on eliminating
> > > > all these special cases we skip, like swap/userfaults/etc.
> >
On Wed, Feb 15, 2023 at 09:17:31PM -0800, Suren Baghdasaryan wrote:
> Decisions about whether VMAs can be merged, split or expanded must be
> made while VMAs are protected from the changes which can affect that
> decision. For example, merge_vma uses vma->anon_vma in its decision
Did you mean vma_merge()?
> whether the VMA can be merged. Meanwhile, page fault handler changes
> vma->anon_vma during COW operation.
> Write-lock all VMAs which might be affected by a merge or split operation
> before making decision how such operations should be performed.
>
It doesn't make sense (to me) to update vma->anon_vma during COW fault.
AFAIK children's vma->anon_vma is allocated during fork() and
page->anon_vma is updated on COW to reduce rmap walking because it's now
unshared, no?
As patch 26 just falls back to mmap_lock if no anon_vma is set,
I think we can assume nothing updates vma->anon_vma (and its interval
tree) if we are holding mmap_lock in write mode.
Or am I missing something?
--
Regards,
Hyeonggon
On Thu, Feb 23, 2023 at 02:51:27PM +0000, Hyeonggon Yoo wrote:
> On Wed, Feb 15, 2023 at 09:17:31PM -0800, Suren Baghdasaryan wrote:
> > Decisions about whether VMAs can be merged, split or expanded must be
> > made while VMAs are protected from the changes which can affect that
> > decision. For example, merge_vma uses vma->anon_vma in its decision
>
> Did you mean vma_merge()?
>
> > whether the VMA can be merged. Meanwhile, page fault handler changes
> > vma->anon_vma during COW operation.
> > Write-lock all VMAs which might be affected by a merge or split operation
> > before making decision how such operations should be performed.
> >
>
> It doesn't make sense (to me) to update vma->anon_vma during COW fault.
>
> AFAIK children's vma->anon_vma is allocated during fork() and
> page->anon_vma is updated on COW to reduce rmap walking because it's now
> unshared, no?
>
> As patch 26 just falls back to mmap_lock if no anon_vma is set,
> I think we can assume nothing updates vma->anon_vma (and its interval
> tree) if we are holding mmap_lock in write mode.
*Not interval tree, as it can be modified by other processes*
>
> Or am I missing something?
>
> --
> Regards,
> Hyeonggon
>
On Thu, Feb 23, 2023 at 6:51 AM Hyeonggon Yoo <[email protected]> wrote:
>
> On Wed, Feb 15, 2023 at 09:17:31PM -0800, Suren Baghdasaryan wrote:
> > Decisions about whether VMAs can be merged, split or expanded must be
> > made while VMAs are protected from the changes which can affect that
> > decision. For example, merge_vma uses vma->anon_vma in its decision
>
> Did you mean vma_merge()?
Correct.
>
> > whether the VMA can be merged. Meanwhile, page fault handler changes
> > vma->anon_vma during COW operation.
> > Write-lock all VMAs which might be affected by a merge or split operation
> > before making decision how such operations should be performed.
> >
>
> It doesn't make sense (to me) to update vma->anon_vma during COW fault.
>
> AFAIK children's vma->anon_vma is allocated during fork() and
> page->anon_vma is updated on COW to reduce rmap walking because it's now
> unshared, no?
>
> As patch 26 just falls back to mmap_lock if no anon_vma is set,
> I think we can assume nothing updates vma->anon_vma (and its interval
> tree) if we are holding mmap_lock in write mode.
>
> Or am I missing something?
No, I think you are right. Patch 26 was added in later versions of
this patchset; at the time this patch was written, vma->anon_vma
could change during page fault handling while holding only the per-VMA
lock.
So, this patch can be simplified. We still want to prevent concurrent
page faults while a VMA is being merged or split (simply because the per-VMA
lock that the page fault handler holds might become the wrong one if the
VMA is split or merged from under it) but the timing of taking per-VMA
lock does not have to be *before* can_vma_merge_{before|after}. Does
that make sense?
>
> --
> Regards,
> Hyeonggon
>
* Suren Baghdasaryan <[email protected]> [230216 00:18]:
> Page fault handlers might need to fire MMU notifications while a new
> notifier is being registered. Modify mm_take_all_locks to write-lock all
> VMAs and prevent this race with page fault handlers that would hold VMA
> locks. VMAs are locked before i_mmap_rwsem and anon_vma to keep the same
> locking order as in page fault handlers.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> mm/mmap.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 00f8c5798936..801608726be8 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3501,6 +3501,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
> * of mm/rmap.c:
> * - all hugetlbfs_i_mmap_rwsem_key locks (aka mapping->i_mmap_rwsem for
> * hugetlb mapping);
> + * - all vmas marked locked
> * - all i_mmap_rwsem locks;
> * - all anon_vma->rwseml
> *
> @@ -3523,6 +3524,13 @@ int mm_take_all_locks(struct mm_struct *mm)
>
> mutex_lock(&mm_all_locks_mutex);
>
> + mas_for_each(&mas, vma, ULONG_MAX) {
> + if (signal_pending(current))
> + goto out_unlock;
> + vma_start_write(vma);
> + }
> +
> + mas_set(&mas, 0);
> mas_for_each(&mas, vma, ULONG_MAX) {
> if (signal_pending(current))
> goto out_unlock;
Do we need a vma_end_write_all(mm) in the out_unlock unrolling?
Also, does this need to honour the strict locking order that we have to
add an entire new loop? This function is...suboptimal today, but if we
could get away with not looping through every VMA for a 4th time, that
would be nice.
> @@ -3612,6 +3620,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
> if (vma->vm_file && vma->vm_file->f_mapping)
> vm_unlock_mapping(vma->vm_file->f_mapping);
> }
> + vma_end_write_all(mm);
>
> mutex_unlock(&mm_all_locks_mutex);
> }
> --
> 2.39.1
>
Can we change this to active/inactive? I think there is potential for
reusing this functionality to even larger degrees and that name would
fit better and would still make sense in this context.
ie: vma_mark_active() and vma_mark_inactive() ?
* Suren Baghdasaryan <[email protected]> [230216 00:18]:
> Per-vma locking mechanism will search for VMA under RCU protection and
> then after locking it, has to ensure it was not removed from the VMA
> tree after we found it. To make this check efficient, introduce a
> vma->detached flag to mark VMAs which were removed from the VMA tree.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> include/linux/mm.h | 11 +++++++++++
> include/linux/mm_types.h | 3 +++
> mm/mmap.c | 2 ++
> 3 files changed, 16 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f4f702224ec5..3f98344f829c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -693,6 +693,14 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
> }
>
> +static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
> +{
> + /* When detaching vma should be write-locked */
> + if (detached)
> + vma_assert_write_locked(vma);
> + vma->detached = detached;
> +}
> +
> #else /* CONFIG_PER_VMA_LOCK */
>
> static inline void vma_init_lock(struct vm_area_struct *vma) {}
> @@ -701,6 +709,8 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> static inline void vma_end_read(struct vm_area_struct *vma) {}
> static inline void vma_start_write(struct vm_area_struct *vma) {}
> static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
> +static inline void vma_mark_detached(struct vm_area_struct *vma,
> + bool detached) {}
>
> #endif /* CONFIG_PER_VMA_LOCK */
>
> @@ -712,6 +722,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> vma->vm_mm = mm;
> vma->vm_ops = &dummy_vm_ops;
> INIT_LIST_HEAD(&vma->anon_vma_chain);
> + vma_mark_detached(vma, false);
> vma_init_lock(vma);
> }
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index e268723eaf44..939f4f5a1115 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -511,6 +511,9 @@ struct vm_area_struct {
> #ifdef CONFIG_PER_VMA_LOCK
> int vm_lock_seq;
> struct rw_semaphore lock;
> +
> + /* Flag to indicate areas detached from the mm->mm_mt tree */
> + bool detached;
> #endif
>
> /*
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 801608726be8..adf40177e68f 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -593,6 +593,7 @@ static inline void vma_complete(struct vma_prepare *vp,
>
> if (vp->remove) {
> again:
> + vma_mark_detached(vp->remove, true);
> if (vp->file) {
> uprobe_munmap(vp->remove, vp->remove->vm_start,
> vp->remove->vm_end);
> @@ -2267,6 +2268,7 @@ static inline int munmap_sidetree(struct vm_area_struct *vma,
> if (mas_store_gfp(mas_detach, vma, GFP_KERNEL))
> return -ENOMEM;
>
> + vma_mark_detached(vma, true);
> if (vma->vm_flags & VM_LOCKED)
> vma->vm_mm->locked_vm -= vma_pages(vma);
>
> --
> 2.39.1
>
Reviewed-by: Liam R. Howlett <[email protected]>
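For context, the consumer side of the flag looks roughly like the following sketch. It assumes a maple-tree state set up for mm->mm_mt at the faulting address; the real lock_vma_under_rcu() in this series differs in detail:

	rcu_read_lock();
	vma = mas_walk(&mas);
	if (vma && vma_start_read(vma)) {
		/* Raced with an update? The VMA may have been detached or
		 * changed after we found it but before we got the lock. */
		if (vma->detached ||
		    address < vma->vm_start || address >= vma->vm_end) {
			vma_end_read(vma);
			vma = NULL;	/* caller falls back to mmap_lock */
		}
	}
	rcu_read_unlock();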
* Suren Baghdasaryan <[email protected]> [230216 00:18]:
> vma_expand and vma_shrink change VMA boundaries. Expansion might also
> result in freeing of an adjacent VMA. Write-lock affected VMAs to prevent
> concurrent page faults.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> mm/mmap.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index ec2f8d0af280..f079e5bbcd57 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -674,6 +674,9 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> ret = dup_anon_vma(vma, next);
> if (ret)
> return ret;
> +
> + /* Lock the VMA before removing it */
> + vma_start_write(next);
> }
>
> init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL);
> @@ -686,6 +689,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> if (vma_iter_prealloc(vmi))
> goto nomem;
>
> + vma_start_write(vma);
> vma_adjust_trans_huge(vma, start, end, 0);
> /* VMA iterator points to previous, so set to start if necessary */
> if (vma_iter_addr(vmi) != start)
> @@ -725,6 +729,7 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> if (vma_iter_prealloc(vmi))
> return -ENOMEM;
>
> + vma_start_write(vma);
> init_vma_prep(&vp, vma);
> vma_adjust_trans_huge(vma, start, end, 0);
> vma_prepare(&vp);
> --
> 2.39.1
>
Wait, I figured a better place to do this.
init_multi_vma_prep() should vma_start_write() on any VMA that is passed
in.. that way we catch any modifications here & in vma_merge(), which I
think is missed in this patch set?
* Liam R. Howlett <[email protected]> [230223 15:20]:
> Reviewed-by: Liam R. Howlett <[email protected]>
>
> * Suren Baghdasaryan <[email protected]> [230216 00:18]:
> > vma_expand and vma_shrink change VMA boundaries. Expansion might also
> > result in freeing of an adjacent VMA. Write-lock affected VMAs to prevent
> > concurrent page faults.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > mm/mmap.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index ec2f8d0af280..f079e5bbcd57 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -674,6 +674,9 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > ret = dup_anon_vma(vma, next);
> > if (ret)
> > return ret;
> > +
> > + /* Lock the VMA before removing it */
> > + vma_start_write(next);
> > }
> >
> > init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL);
> > @@ -686,6 +689,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > if (vma_iter_prealloc(vmi))
> > goto nomem;
> >
> > + vma_start_write(vma);
> > vma_adjust_trans_huge(vma, start, end, 0);
> > /* VMA iterator points to previous, so set to start if necessary */
> > if (vma_iter_addr(vmi) != start)
> > @@ -725,6 +729,7 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > if (vma_iter_prealloc(vmi))
> > return -ENOMEM;
> >
> > + vma_start_write(vma);
> > init_vma_prep(&vp, vma);
> > vma_adjust_trans_huge(vma, start, end, 0);
> > vma_prepare(&vp);
> > --
> > 2.39.1
> >
On Thu, Feb 23, 2023 at 12:06 PM Liam R. Howlett
<[email protected]> wrote:
>
> * Suren Baghdasaryan <[email protected]> [230216 00:18]:
> > Page fault handlers might need to fire MMU notifications while a new
> > notifier is being registered. Modify mm_take_all_locks to write-lock all
> > VMAs and prevent this race with page fault handlers that would hold VMA
> > locks. VMAs are locked before i_mmap_rwsem and anon_vma to keep the same
> > locking order as in page fault handlers.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > mm/mmap.c | 9 +++++++++
> > 1 file changed, 9 insertions(+)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 00f8c5798936..801608726be8 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3501,6 +3501,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
> > * of mm/rmap.c:
> > * - all hugetlbfs_i_mmap_rwsem_key locks (aka mapping->i_mmap_rwsem for
> > * hugetlb mapping);
> > + * - all vmas marked locked
> > * - all i_mmap_rwsem locks;
> > * - all anon_vma->rwseml
> > *
> > @@ -3523,6 +3524,13 @@ int mm_take_all_locks(struct mm_struct *mm)
> >
> > mutex_lock(&mm_all_locks_mutex);
> >
> > + mas_for_each(&mas, vma, ULONG_MAX) {
> > + if (signal_pending(current))
> > + goto out_unlock;
> > + vma_start_write(vma);
> > + }
> > +
> > + mas_set(&mas, 0);
> > mas_for_each(&mas, vma, ULONG_MAX) {
> > if (signal_pending(current))
> > goto out_unlock;
>
> Do we need a vma_end_write_all(mm) in the out_unlock unrolling?
We can't really do that because some VMAs might have been locked
before mm_take_all_locks() got called. So, we will have to wait until
mmap write lock is dropped and vma_end_write_all() is called from
there. Getting a signal while executing mm_take_all_locks() is
probably not too common and won't pose a performance issue.
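In other words, the failure path relies on the caller, which in a simplified sketch (not the actual mmu_notifier registration code) looks like:

	int ret;

	mmap_write_lock(mm);
	ret = mm_take_all_locks(mm);	/* returns -EINTR if a signal is pending */
	if (!ret) {
		/* ... register the notifier, etc. ... */
		mm_drop_all_locks(mm);
	}
	/* Dropping the mmap write lock performs vma_end_write_all(), which
	 * also releases any VMAs write-locked by a failed mm_take_all_locks() */
	mmap_write_unlock(mm);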
>
> Also, does this need to honour the strict locking order that we have to
> add an entire new loop? This function is...suboptimal today, but if we
> could get away with not looping through every VMA for a 4th time, that
> would be nice.
That's what I used to do until Jann pointed out the locking order
requirement to avoid deadlocks in here:
https://lore.kernel.org/all/CAG48ez3EAai=1ghnCMF6xcgUebQRm-u2xhwcpYsfP9=r=oVXig@mail.gmail.com/.
>
> > @@ -3612,6 +3620,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
> > if (vma->vm_file && vma->vm_file->f_mapping)
> > vm_unlock_mapping(vma->vm_file->f_mapping);
> > }
> > + vma_end_write_all(mm);
> >
> > mutex_unlock(&mm_all_locks_mutex);
> > }
> > --
> > 2.39.1
> >
>
On Thu, Feb 23, 2023 at 12:08 PM Liam R. Howlett
<[email protected]> wrote:
>
>
> Can we change this to active/inactive? I think there is potential for
> reusing this functionality to even larger degrees and that name would
> fit better and would still make sense in this context.
>
> ie: vma_mark_active() and vma_mark_inactive() ?
Those names sound too generic (not obvious what active/inactive
means), while detached/isolated I think is more clear and specific.
Does not really make a huge difference to me but maybe you can come up
with better naming that addresses my concern and meets your usecase?
>
> * Suren Baghdasaryan <[email protected]> [230216 00:18]:
> > Per-vma locking mechanism will search for VMA under RCU protection and
> > then after locking it, has to ensure it was not removed from the VMA
> > tree after we found it. To make this check efficient, introduce a
> > vma->detached flag to mark VMAs which were removed from the VMA tree.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > include/linux/mm.h | 11 +++++++++++
> > include/linux/mm_types.h | 3 +++
> > mm/mmap.c | 2 ++
> > 3 files changed, 16 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index f4f702224ec5..3f98344f829c 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -693,6 +693,14 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
> > }
> >
> > +static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
> > +{
> > + /* When detaching vma should be write-locked */
> > + if (detached)
> > + vma_assert_write_locked(vma);
> > + vma->detached = detached;
> > +}
> > +
> > #else /* CONFIG_PER_VMA_LOCK */
> >
> > static inline void vma_init_lock(struct vm_area_struct *vma) {}
> > @@ -701,6 +709,8 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
> > static inline void vma_end_read(struct vm_area_struct *vma) {}
> > static inline void vma_start_write(struct vm_area_struct *vma) {}
> > static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
> > +static inline void vma_mark_detached(struct vm_area_struct *vma,
> > + bool detached) {}
> >
> > #endif /* CONFIG_PER_VMA_LOCK */
> >
> > @@ -712,6 +722,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> > vma->vm_mm = mm;
> > vma->vm_ops = &dummy_vm_ops;
> > INIT_LIST_HEAD(&vma->anon_vma_chain);
> > + vma_mark_detached(vma, false);
> > vma_init_lock(vma);
> > }
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index e268723eaf44..939f4f5a1115 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -511,6 +511,9 @@ struct vm_area_struct {
> > #ifdef CONFIG_PER_VMA_LOCK
> > int vm_lock_seq;
> > struct rw_semaphore lock;
> > +
> > + /* Flag to indicate areas detached from the mm->mm_mt tree */
> > + bool detached;
> > #endif
> >
> > /*
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 801608726be8..adf40177e68f 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -593,6 +593,7 @@ static inline void vma_complete(struct vma_prepare *vp,
> >
> > if (vp->remove) {
> > again:
> > + vma_mark_detached(vp->remove, true);
> > if (vp->file) {
> > uprobe_munmap(vp->remove, vp->remove->vm_start,
> > vp->remove->vm_end);
> > @@ -2267,6 +2268,7 @@ static inline int munmap_sidetree(struct vm_area_struct *vma,
> > if (mas_store_gfp(mas_detach, vma, GFP_KERNEL))
> > return -ENOMEM;
> >
> > + vma_mark_detached(vma, true);
> > if (vma->vm_flags & VM_LOCKED)
> > vma->vm_mm->locked_vm -= vma_pages(vma);
> >
> > --
> > 2.39.1
> >
>
On Thu, Feb 23, 2023 at 12:28 PM Liam R. Howlett
<[email protected]> wrote:
>
>
> Wait, I figured a better place to do this.
>
> init_multi_vma_prep() should vma_start_write() on any VMA that is passed
> in.. that way we catch any modifications here & in vma_merge(), which I
> think is missed in this patch set?
Hmm. That looks like a good idea but in that case, why not do the
locking inside vma_prepare() itself? From the description of that
function it sounds like it was designed to acquire locks before VMA
modifications, so would be the ideal location for doing that. WDYT?
The only concern is vma_adjust_trans_huge() being called before
vma_prepare() but I *think* that's safe because
vma_adjust_trans_huge() does its modifications after acquiring PTL
lock, which page fault handlers also have to take. Does that sound
right?
>
>
> * Liam R. Howlett <[email protected]> [230223 15:20]:
> > Reviewed-by: Liam R. Howlett <[email protected]>
> >
> > * Suren Baghdasaryan <[email protected]> [230216 00:18]:
> > > vma_expand and vma_shrink change VMA boundaries. Expansion might also
> > > result in freeing of an adjacent VMA. Write-lock affected VMAs to prevent
> > > concurrent page faults.
> > >
> > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > ---
> > > mm/mmap.c | 5 +++++
> > > 1 file changed, 5 insertions(+)
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index ec2f8d0af280..f079e5bbcd57 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -674,6 +674,9 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > ret = dup_anon_vma(vma, next);
> > > if (ret)
> > > return ret;
> > > +
> > > + /* Lock the VMA before removing it */
> > > + vma_start_write(next);
> > > }
> > >
> > > init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL);
> > > @@ -686,6 +689,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > if (vma_iter_prealloc(vmi))
> > > goto nomem;
> > >
> > > + vma_start_write(vma);
> > > vma_adjust_trans_huge(vma, start, end, 0);
> > > /* VMA iterator points to previous, so set to start if necessary */
> > > if (vma_iter_addr(vmi) != start)
> > > @@ -725,6 +729,7 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > if (vma_iter_prealloc(vmi))
> > > return -ENOMEM;
> > >
> > > + vma_start_write(vma);
> > > init_vma_prep(&vp, vma);
> > > vma_adjust_trans_huge(vma, start, end, 0);
> > > vma_prepare(&vp);
> > > --
> > > 2.39.1
> > >
>
* Suren Baghdasaryan <[email protected]> [230223 16:16]:
> On Thu, Feb 23, 2023 at 12:28 PM Liam R. Howlett
> <[email protected]> wrote:
> >
> >
> > Wait, I figured a better place to do this.
> >
> > init_multi_vma_prep() should vma_start_write() on any VMA that is passed
> > in.. that way we catch any modifications here & in vma_merge(), which I
> > think is missed in this patch set?
>
> Hmm. That looks like a good idea but in that case, why not do the
> locking inside vma_prepare() itself? From the description of that
> function it sounds like it was designed to acquire locks before VMA
> modifications, so would be the ideal location for doing that. WDYT?
That might be even better. I think it will result in even less code.
There is also a vma_complete() which might work to call
vma_end_write_all() as well?
> The only concern is vma_adjust_trans_huge() being called before
> vma_prepare() but I *think* that's safe because
> vma_adjust_trans_huge() does its modifications after acquiring PTL
> lock, which page fault handlers also have to take. Does that sound
> right?
I am not sure. We are certainly safe the way it is, and the PTL has to
be safe for concurrent faults.. but this could alter the walk to a page
table while that walk is occurring and I don't think that happens today.
It might be best to leave the locking order the way you have it, unless
someone can tell us it's safe?
We could pass through the three extra variables that are needed to move
the vma_adjust_trans_huge() call within that function as well? This
would have the added benefit of having all locking grouped in the one
location, but the argument list would be getting long, however we could
use the struct.
remove & remove2 should be detached in vma_prepare() or
vma_complete() as well?
>
> >
> >
> > * Liam R. Howlett <[email protected]> [230223 15:20]:
> > > Reviewed-by: Liam R. Howlett <[email protected]>
> > >
> > > * Suren Baghdasaryan <[email protected]> [230216 00:18]:
> > > > vma_expand and vma_shrink change VMA boundaries. Expansion might also
> > > > result in freeing of an adjacent VMA. Write-lock affected VMAs to prevent
> > > > concurrent page faults.
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > > ---
> > > > mm/mmap.c | 5 +++++
> > > > 1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index ec2f8d0af280..f079e5bbcd57 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -674,6 +674,9 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > > ret = dup_anon_vma(vma, next);
> > > > if (ret)
> > > > return ret;
> > > > +
> > > > + /* Lock the VMA before removing it */
> > > > + vma_start_write(next);
> > > > }
> > > >
> > > > init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL);
> > > > @@ -686,6 +689,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > > if (vma_iter_prealloc(vmi))
> > > > goto nomem;
> > > >
> > > > + vma_start_write(vma);
> > > > vma_adjust_trans_huge(vma, start, end, 0);
> > > > /* VMA iterator points to previous, so set to start if necessary */
> > > > if (vma_iter_addr(vmi) != start)
> > > > @@ -725,6 +729,7 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > > if (vma_iter_prealloc(vmi))
> > > > return -ENOMEM;
> > > >
> > > > + vma_start_write(vma);
> > > > init_vma_prep(&vp, vma);
> > > > vma_adjust_trans_huge(vma, start, end, 0);
> > > > vma_prepare(&vp);
> > > > --
> > > > 2.39.1
> > > >
> >
On Thu, Feb 23, 2023 at 5:46 PM Liam R. Howlett <[email protected]> wrote:
>
> * Suren Baghdasaryan <[email protected]> [230223 16:16]:
> > On Thu, Feb 23, 2023 at 12:28 PM Liam R. Howlett
> > <[email protected]> wrote:
> > >
> > >
> > > Wait, I figured a better place to do this.
> > >
> > > init_multi_vma_prep() should vma_start_write() on any VMA that is passed
> > > in.. that way we catch any modifications here & in vma_merge(), which I
> > > think is missed in this patch set?
> >
> > Hmm. That looks like a good idea but in that case, why not do the
> > locking inside vma_prepare() itself? From the description of that
> > function it sounds like it was designed to acquire locks before VMA
> > modifications, so would be the ideal location for doing that. WDYT?
>
> That might be even better. I think it will result in even less code.
Yes.
>
> There is also a vma_complete() which might work to call
> vma_end_write_all() as well?
If there are other VMAs already locked before vma_prepare() then we
would unlock them too. Safer to just let mmap_unlock do
vma_end_write_all().
>
> > The only concern is vma_adjust_trans_huge() being called before
> > vma_prepare() but I *think* that's safe because
> > vma_adjust_trans_huge() does its modifications after acquiring PTL
> > lock, which page fault handlers also have to take. Does that sound
> > right?
>
> I am not sure. We are certainly safe the way it is, and the PTL has to
> be safe for concurrent faults.. but this could alter the walk to a page
> table while that walk is occurring and I don't think that happens today.
>
> It might be best to leave the locking order the way you have it, unless
> someone can tell us it's safe?
Yes, I have the same feelings about changing this.
>
> We could pass through the three extra variables that are needed to move
> the vma_adjust_trans_huge() call within that function as well? This
> would have the added benefit of having all locking grouped in the one
> location, but the argument list would be getting long, however we could
> use the struct.
Any issues if I change the order to have vma_prepare() called always
before vma_adjust_trans_huge()? That way the VMA will always be locked
before vma_adjust_trans_huge() executes and we don't need any
additional arguments.
>
> remove & remove2 should be detached in vma_prepare() or
> vma_complete() as well?
They are marked detached in vma_complete() (see
https://lore.kernel.org/all/[email protected]/)
and that should be enough. We should be safe as long as we mark them
detached before unlocking the VMA.
>
> >
> > >
> > >
> > > * Liam R. Howlett <[email protected]> [230223 15:20]:
> > > > Reviewed-by: Liam R. Howlett <[email protected]>
> > > >
> > > > * Suren Baghdasaryan <[email protected]> [230216 00:18]:
> > > > > vma_expand and vma_shrink change VMA boundaries. Expansion might also
> > > > > result in freeing of an adjacent VMA. Write-lock affected VMAs to prevent
> > > > > concurrent page faults.
> > > > >
> > > > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > > > ---
> > > > > mm/mmap.c | 5 +++++
> > > > > 1 file changed, 5 insertions(+)
> > > > >
> > > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > > index ec2f8d0af280..f079e5bbcd57 100644
> > > > > --- a/mm/mmap.c
> > > > > +++ b/mm/mmap.c
> > > > > @@ -674,6 +674,9 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > > > ret = dup_anon_vma(vma, next);
> > > > > if (ret)
> > > > > return ret;
> > > > > +
> > > > > + /* Lock the VMA before removing it */
> > > > > + vma_start_write(next);
> > > > > }
> > > > >
> > > > > init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL);
> > > > > @@ -686,6 +689,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > > > if (vma_iter_prealloc(vmi))
> > > > > goto nomem;
> > > > >
> > > > > + vma_start_write(vma);
> > > > > vma_adjust_trans_huge(vma, start, end, 0);
> > > > > /* VMA iterator points to previous, so set to start if necessary */
> > > > > if (vma_iter_addr(vmi) != start)
> > > > > @@ -725,6 +729,7 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > > > if (vma_iter_prealloc(vmi))
> > > > > return -ENOMEM;
> > > > >
> > > > > + vma_start_write(vma);
> > > > > init_vma_prep(&vp, vma);
> > > > > vma_adjust_trans_huge(vma, start, end, 0);
> > > > > vma_prepare(&vp);
> > > > > --
> > > > > 2.39.1
> > > > >
> > >
Here are some measurements from a Pixel 7 Pro that's running a kernel either with the Per-VMA locks patchset or without.
If there's interest I can provide results of other specific apps as well.
Results are from consecutive cold app launches issued with "am start" command spawning the main activity of Slack Android app.
https://docs.google.com/spreadsheets/d/1ktujfcyDmIJoQMWsoizGIE-0A_jMS9VMw_seehUY9s0/edit?usp=sharing
There's quite a noticeable improvement, as can be seen in the graph. The results were reproducible.
If you have any questions feel free to ask.
* Suren Baghdasaryan <[email protected]> [230223 21:06]:
> On Thu, Feb 23, 2023 at 5:46 PM Liam R. Howlett <[email protected]> wrote:
> >
> > * Suren Baghdasaryan <[email protected]> [230223 16:16]:
> > > On Thu, Feb 23, 2023 at 12:28 PM Liam R. Howlett
> > > <[email protected]> wrote:
> > > >
> > > >
> > > > Wait, I figured a better place to do this.
> > > >
> > > > init_multi_vma_prep() should vma_start_write() on any VMA that is passed
> > > > in.. that way we catch any modifications here & in vma_merge(), which I
> > > > think is missed in this patch set?
> > >
> > > Hmm. That looks like a good idea but in that case, why not do the
> > > locking inside vma_prepare() itself? From the description of that
> > > function it sounds like it was designed to acquire locks before VMA
> > > modifications, so would be the ideal location for doing that. WDYT?
> >
> > That might be even better. I think it will result in even less code.
>
> Yes.
>
> >
> > There is also a vma_complete() which might work to call
> > vma_end_write_all() as well?
>
> If there are other VMAs already locked before vma_prepare() then we
> would unlock them too. Safer to just let mmap_unlock do
> vma_end_write_all().
>
> >
> > > The only concern is vma_adjust_trans_huge() being called before
> > > vma_prepare() but I *think* that's safe because
> > > vma_adjust_trans_huge() does its modifications after acquiring PTL
> > > lock, which page fault handlers also have to take. Does that sound
> > > right?
> >
> > I am not sure. We are certainly safe the way it is, and the PTL has to
> > be safe for concurrent faults.. but this could alter the walk to a page
> > table while that walk is occurring and I don't think that happens today.
> >
> > It might be best to leave the locking order the way you have it, unless
> > someone can tell us it's safe?
>
> Yes, I have the same feelings about changing this.
>
> >
> > We could pass through the three extra variables that are needed to move
> > the vma_adjust_trans_huge() call within that function as well? This
> > would have the added benefit of having all locking grouped in the one
> > location, but the argument list would be getting long, however we could
> > use the struct.
>
> Any issues if I change the order to have vma_prepare() called always
> before vma_adjust_trans_huge()? That way the VMA will always be locked
> before vma_adjust_trans_huge() executes and we don't need any
> additional arguments.
I preserved the locking order from __vma_adjust() to ensure there were no
issues.
I am not sure but, looking through the page table information [1], it
seems that vma_adjust_trans_huge() uses the pmd lock, which is part of
the split page table lock. According to the comment in rmap, it should
be fine to reverse the ordering here.
Instead of:
mmap_lock()
vma_adjust_trans_huge()
pte_lock
pte_unlock
vma_prepare()
mapping->i_mmap_rwsem lock
anon_vma->rwsem lock
<changes to tree/VMAs>
vma_complete()
anon_vma->rwsem unlock
mapping->i_mmap_rwsem unlock
mmap_unlock()
---------
We would have:
mmap_lock()
vma_prepare()
mapping->i_mmap_rwsem lock
anon_vma->rwsem lock
vma_adjust_trans_huge()
pte_lock
pte_unlock
<changes to tree/VMAs>
vma_complete()
anon_vma->rwsem unlock
mapping->i_mmap_rwsem unlock
mmap_unlock()
Essentially, increasing the nesting of the pte lock, but not violating
the ordering.
1. https://docs.kernel.org/mm/split_page_table_lock.html
>
> >
> > remove & remove2 should be detached in vma_prepare() or
> > vma_complete() as well?
>
> They are marked detached in vma_complete() (see
> https://lore.kernel.org/all/[email protected]/)
> and that should be enough. We should be safe as long as we mark them
> detached before unlocking the VMA.
>
Right, Thanks.
...
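For reference, the direction agreed on above would look roughly like the sketch below. The vma_prepare field names (vma, adj_next, insert, remove, remove2) are assumptions based on init_vma_prep()/init_multi_vma_prep(), not the final patch:

	static inline void vma_prepare(struct vma_prepare *vp)
	{
		/* Write-lock every VMA involved in the change before taking
		 * the i_mmap_rwsem/anon_vma locks below */
		vma_start_write(vp->vma);
		if (vp->adj_next)
			vma_start_write(vp->adj_next);
		if (vp->insert)
			vma_start_write(vp->insert);
		if (vp->remove)
			vma_start_write(vp->remove);
		if (vp->remove2)
			vma_start_write(vp->remove2);

		/* ... existing i_mmap_rwsem and anon_vma locking follows ... */
	}

Callers would then invoke vma_prepare() ahead of vma_adjust_trans_huge(), matching the reordered sequence shown above, so no extra arguments are needed.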
On Fri, Feb 24, 2023 at 8:14 AM Liam R. Howlett <[email protected]> wrote:
>
> * Suren Baghdasaryan <[email protected]> [230223 21:06]:
> > On Thu, Feb 23, 2023 at 5:46 PM Liam R. Howlett <[email protected]> wrote:
> > >
> > > * Suren Baghdasaryan <[email protected]> [230223 16:16]:
> > > > On Thu, Feb 23, 2023 at 12:28 PM Liam R. Howlett
> > > > <[email protected]> wrote:
> > > > >
> > > > >
> > > > > Wait, I figured a better place to do this.
> > > > >
> > > > > init_multi_vma_prep() should vma_start_write() on any VMA that is passed
> > > > > in.. that way we catch any modifications here & in vma_merge(), which I
> > > > > think is missed in this patch set?
> > > >
> > > > Hmm. That looks like a good idea but in that case, why not do the
> > > > locking inside vma_prepare() itself? From the description of that
> > > > function it sounds like it was designed to acquire locks before VMA
> > > > modifications, so would be the ideal location for doing that. WDYT?
> > >
> > > That might be even better. I think it will result in even less code.
> >
> > Yes.
> >
> > >
> > > There is also a vma_complete() which might work to call
> > > vma_end_write_all() as well?
> >
> > If there are other VMAs already locked before vma_prepare() then we
> > would unlock them too. Safer to just let mmap_unlock do
> > vma_end_write_all().
> >
> > >
> > > > The only concern is vma_adjust_trans_huge() being called before
> > > > vma_prepare() but I *think* that's safe because
> > > > vma_adjust_trans_huge() does its modifications after acquiring PTL
> > > > lock, which page fault handlers also have to take. Does that sound
> > > > right?
> > >
> > > I am not sure. We are certainly safe the way it is, and the PTL has to
> > > be safe for concurrent faults.. but this could alter the walk to a page
> > > table while that walk is occurring and I don't think that happens today.
> > >
> > > It might be best to leave the locking order the way you have it, unless
> > > someone can tell us it's safe?
> >
> > Yes, I have the same feelings about changing this.
> >
> > >
> > > We could pass through the three extra variables that are needed to move
> > > the vma_adjust_trans_huge() call within that function as well? This
> > > would have the added benefit of having all locking grouped in the one
> > > location, but the argument list would be getting long, however we could
> > > use the struct.
> >
> > Any issues if I change the order to have vma_prepare() called always
> > before vma_adjust_trans_huge()? That way the VMA will always be locked
> > before vma_adjust_trans_huge() executes and we don't need any
> > additional arguments.
>
> I preserved the locking order from __vma_adjust() to ensure there were no
> issues.
>
> I am not sure but, looking through the page table information [1], it
> seems that vma_adjust_trans_huge() uses the pmd lock, which is part of
> the split page table lock. According to the comment in rmap, it should
> be fine to reverse the ordering here.
>
> Instead of:
>
> mmap_lock()
> vma_adjust_trans_huge()
> pte_lock
> pte_unlock
>
> vma_prepare()
> mapping->i_mmap_rwsem lock
> anon_vma->rwsem lock
>
> <changes to tree/VMAs>
>
> vma_complete()
> anon_vma->rwsem unlock
> mapping->i_mmap_rwsem unlock
>
> mmap_unlock()
>
> ---------
>
> We would have:
>
> mmap_lock()
> vma_prepare()
> mapping->i_mmap_rwsem lock
> anon_vma->rwsem lock
>
> vma_adjust_trans_huge()
> pte_lock
> pte_unlock
>
> <changes to tree/VMAs>
>
> vma_complete()
> anon_vma->rwsem unlock
> mapping->i_mmap_rwsem unlock
>
> mmap_unlock()
>
>
> Essentially, increasing the nesting of the pte lock, but not violating
> the ordering.
>
> 1. https://docs.kernel.org/mm/split_page_table_lock.html
Thanks for the confirmation, Liam. I'll make the changes and test over
the weekend. If everything's still fine, I will post the next version
with these and other requested changes on Monday.
>
> >
> > >
> > > remove & remove2 should be detached in vma_prepare() or
> > > vma_complete() as well?
> >
> > They are marked detached in vma_complete() (see
> > https://lore.kernel.org/all/[email protected]/)
> > and that should be enough. We should be safe as long as we mark them
> > detached before unlocking the VMA.
> >
>
> Right, Thanks.
>
> ...
>
On Fri, 24 Feb 2023, freak07 wrote:
>Here are some measurements from a Pixel 7 Pro that's running a kernel either with the Per-VMA locks patchset or without.
>If there's interest I can provide results of other specific apps as well.
>
>Results are from consecutive cold app launches issued with "am start" command spawning the main activity of Slack Android app.
>
>https://docs.google.com/spreadsheets/d/1ktujfcyDmIJoQMWsoizGIE-0A_jMS9VMw_seehUY9s0/edit?usp=sharing
>
>There's quite a noticeable improvement, as can be seen in the graph. The results were reproducible.
Thanks for sharing. I am not surprised - after all, per-vma locks narrow some of the performance gaps
between vanilla and speculative pfs, with less complexity (albeit this is now a 35 patch series :).
Thanks,
Davidlohr
On Mon, Feb 27, 2023 at 9:19 AM Davidlohr Bueso <[email protected]> wrote:
>
> On Fri, 24 Feb 2023, freak07 wrote:
>
> >Here are some measurements from a Pixel 7 Pro that's running a kernel either with the Per-VMA locks patchset or without.
> >If there's interest I can provide results of other specific apps as well.
> >
> >Results are from consecutive cold app launches issued with "am start" command spawning the main activity of Slack Android app.
> >
> >https://docs.google.com/spreadsheets/d/1ktujfcyDmIJoQMWsoizGIE-0A_jMS9VMw_seehUY9s0/edit?usp=sharing
> >
> >There's quite a noticeable improvement, as can be seen in the graph. The results were reproducible.
>
> Thanks for sharing. I am not surprised - after all, per-vma locks narrow some of the performance gaps
> between vanilla and speculative pfs, with less complexity (albeit this is now a 35 patch series :).
Yes, depending on the specific app we would expect some launch time
improvement (in this case the average improvement is 7%). Thanks for
sharing the numbers!
I'll be posting v4 today, which is a 33 patch series now, so a bit better :)
Thanks,
Suren.
>
> Thanks,
> Davidlohr
>
On Fri, Feb 24, 2023 at 8:19 AM Suren Baghdasaryan <[email protected]> wrote:
>
> On Fri, Feb 24, 2023 at 8:14 AM Liam R. Howlett <[email protected]> wrote:
> >
> > * Suren Baghdasaryan <[email protected]> [230223 21:06]:
> > > On Thu, Feb 23, 2023 at 5:46 PM Liam R. Howlett <[email protected]> wrote:
> > > >
> > > > * Suren Baghdasaryan <[email protected]> [230223 16:16]:
> > > > > On Thu, Feb 23, 2023 at 12:28 PM Liam R. Howlett
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > >
> > > > > > Wait, I figured a better place to do this.
> > > > > >
> > > > > > init_multi_vma_prep() should vma_start_write() on any VMA that is passed
> > > > > > in.. that way we catch any modifications here & in vma_merge(), which I
> > > > > > think is missed in this patch set?
> > > > >
> > > > > Hmm. That looks like a good idea but in that case, why not do the
> > > > > locking inside vma_prepare() itself? From the description of that
> > > > > function it sounds like it was designed to acquire locks before VMA
> > > > > modifications, so it would be the ideal location for doing that. WDYT?
> > > >
> > > > That might be even better. I think it will result in even less code.
> > >
> > > Yes.
> > >
> > > >
> > > > There is also a vma_complete() which might work to call
> > > > vma_end_write_all() as well?
> > >
> > > If there are other VMAs already locked before vma_prepare() then we
> > > would unlock them too. Safer to just let mmap_unlock do
> > > vma_end_write_all().
> > >
> > > >
> > > > > The only concern is vma_adjust_trans_huge() being called before
> > > > > vma_prepare() but I *think* that's safe because
> > > > > vma_adjust_trans_huge() does its modifications after acquiring PTL
> > > > > lock, which page fault handlers also have to take. Does that sound
> > > > > right?
> > > >
> > > > I am not sure. We are certainly safe the way it is, and the PTL has to
> > > > be safe for concurrent faults.. but this could alter the walk to a page
> > > > table while that walk is occurring and I don't think that happens today.
> > > >
> > > > It might be best to leave the locking order the way you have it, unless
> > > > someone can tell us it's safe?
> > >
> > > Yes, I have the same feelings about changing this.
> > >
> > > >
> > > > We could pass through the three extra variables that are needed to move
> > > > the vma_adjust_trans_huge() call within that function as well? This
> > > > would have the added benefit of having all locking grouped in the one
> > > > location, but the argument list would be getting long; however, we could
> > > > use the struct.
> > >
> > > Any issues if I change the order to have vma_prepare() called always
> > > before vma_adjust_trans_huge()? That way the VMA will always be locked
> > > before vma_adjust_trans_huge() executes and we don't need any
> > > additional arguments.
> >
> > I preserved the locking order from __vma_adjust() to ensure there were no
> > issues.
> >
> > I am not sure, but looking through the page table information [1], it
> > seems that vma_adjust_trans_huge() uses the pmd lock, which is part of
> > the split page table lock. According to the comment in rmap, it should
> > be fine to reverse the ordering here.
> >
> > Instead of:
> >
> > mmap_lock()
> > vma_adjust_trans_huge()
> > pte_lock
> > pte_unlock
> >
> > vma_prepare()
> > mapping->i_mmap_rwsem lock
> > anon_vma->rwsem lock
> >
> > <changes to tree/VMAs>
> >
> > vma_complete()
> > anon_vma->rwsem unlock
> > mapping->i_mmap_rwsem unlock
> >
> > mmap_unlock()
> >
> > ---------
> >
> > We would have:
> >
> > mmap_lock()
> > vma_prepare()
> > mapping->i_mmap_rwsem lock
> > anon_vma->rwsem lock
> >
> > vma_adjust_trans_huge()
> > pte_lock
> > pte_unlock
> >
> > <changes to tree/VMAs>
> >
> > vma_complete()
> > anon_vma->rwsem unlock
> > mapping->i_mmap_rwsem unlock
> >
> > mmap_unlock()
> >
> >
> > Essentially, increasing the nesting of the pte lock, but not violating
> > the ordering.
> >
> > 1. https://docs.kernel.org/mm/split_page_table_lock.html
>
> Thanks for the confirmation, Liam. I'll make the changes and test over
> the weekend. If everything's still fine, I will post the next version
> with these and other requested changes on Monday.
Weekend testing results look good with all the changes. I'll post v4 shortly.
Thanks,
Suren.
>
> >
> > >
> > > >
> > > > remove & remove2 should be detached in vma_prepare() or
> > > > vma_complete() as well?
> > >
> > > They are marked detached in vma_complete() (see
> > > https://lore.kernel.org/all/[email protected]/)
> > > and that should be enough. We should be safe as long as we mark them
> > > detached before unlocking the VMA.
> > >
> >
> > Right, Thanks.
> >
> > ...
> >
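Pulling the points of this sub-thread together, a sketch of the resulting split of responsibilities follows, assuming kernel mm context; the struct vma_prepare field names and helper signatures shown here are assumptions based on the discussion above, not code copied from the series:

/* Sketch only: simplified bodies, not the actual patch. */
static void vma_prepare_sketch(struct vma_prepare *vp)
{
	/* Write-lock every VMA involved in the update before modifying it. */
	if (vp->vma)
		vma_start_write(vp->vma);
	if (vp->adj_next)
		vma_start_write(vp->adj_next);
	if (vp->insert)
		vma_start_write(vp->insert);
	if (vp->remove)
		vma_start_write(vp->remove);
	if (vp->remove2)
		vma_start_write(vp->remove2);

	/* ... then take i_mmap_rwsem and anon_vma->rwsem as before ... */
}

static void vma_complete_sketch(struct vma_prepare *vp)
{
	/* ... drop anon_vma->rwsem and i_mmap_rwsem as before ... */

	/*
	 * Mark removed VMAs as detached while they are still write-locked,
	 * so a concurrent lock-less page-fault lookup cannot start using
	 * them once the rmap locks are gone.
	 */
	if (vp->remove)
		vma_mark_detached(vp->remove, true);
	if (vp->remove2)
		vma_mark_detached(vp->remove2, true);

	/*
	 * The per-VMA write locks are intentionally not released here;
	 * vma_end_write_all() drops them in bulk from mmap_write_unlock().
	 */
}

Grouping the write-locking in vma_prepare() catches vma_merge() and the other callers of vma_prepare() without threading extra arguments through them, and deferring the unlock to mmap_write_unlock() avoids having to track which VMAs are already locked.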
On Fri, Feb 17, 2023 at 08:10:35AM -0800, Suren Baghdasaryan wrote:
> On Fri, Feb 17, 2023 at 8:05 AM Matthew Wilcox <[email protected]> wrote:
> >
> > On Thu, Feb 16, 2023 at 06:14:59PM -0800, Suren Baghdasaryan wrote:
> > > On Thu, Feb 16, 2023 at 11:43 AM Suren Baghdasaryan <[email protected]> wrote:
> > > >
> > > > On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox <[email protected]> wrote:
> > > > >
> > > > > On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote:
> > > > > > When vma->anon_vma is not set, page fault handler will set it by either
> > > > > > reusing anon_vma of an adjacent VMA if VMAs are compatible or by
> > > > > > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find
> > > > > > a compatible adjacent VMA and that requires not only the faulting VMA
> > > > > > to be stable but also the tree structure and other VMAs inside that tree.
> > > > > > Therefore locking just the faulting VMA is not enough for this search.
> > > > > > Fall back to taking mmap_lock when vma->anon_vma is not set. This
> > > > > > situation happens only on the first page fault and should not affect
> > > > > > overall performance.
> > > > >
> > > > > I think I asked this before, but don't remember getting an answer.
> > > > > Why do we defer setting anon_vma to the first fault? Why don't we
> > > > > set it up at mmap time?
> > > >
> > > > Yeah, I remember that conversation, Matthew, and I could not find the
> > > > definitive answer at the time. I'll look into that again or maybe
> > > > someone can answer it here.
> > >
> > > After looking into it again I'm still under the impression that
> > > vma->anon_vma is populated lazily (during the first page fault rather
> > > than at mmap time) to avoid doing extra work for areas which are never
> > > faulted. Though I might be missing some important detail here.
> >
> > How often does userspace call mmap() and then _never_ fault on it?
> > I appreciate that userspace might mmap() gigabytes of address space and
> > then only end up using a small amount of it, so populating it lazily
> > makes sense. But creating a region and never faulting on it? The only
> > use-case I can think of is loading shared libraries:
> >
> > openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
> > (...)
> > mmap(NULL, 1970000, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0ce612e000
> > mmap(0x7f0ce6154000, 1396736, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x26000) = 0x7f0ce6154000
> > mmap(0x7f0ce62a9000, 339968, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17b000) = 0x7f0ce62a9000
> > mmap(0x7f0ce62fc000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1ce000) = 0x7f0ce62fc000
> > mmap(0x7f0ce6302000, 53072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0ce6302000
> >
> > but that's a file-backed VMA, not an anon VMA.
>
> Might the case of dup_mmap() while forking be the reason why a VMA in
> the child process might never be used while the parent uses it (or vice
> versa)? Again, I'm not sure this is the reason, but I can find no other
> good explanation.
I found an explanation! Well, a partial one. If we MAP_PRIVATE a file
mapping (like, er those ones up there) and only take read faults on it,
we can postpone allocation of the anon_vma indefinitely. But once we
take a write fault in that VMA, we need to allocate an anon_vma for it
so that we can track the anonymous pages that have been allocated to
satisfy the copy-on-write (see do_cow_fault()).
However, I think in that case, we could probably skip the
find_mergeable_anon_vma() step. We don't today; we check whether
a->vm_file == b->vm_file in anon_vma_compatible, but I wonder if that
triggers often.
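To make the fallback in the quoted commit message concrete, here is a rough sketch of the lookup path, assuming kernel mm context. The function below is illustrative rather than the real lookup from the series, and it omits the additional re-checks the real code performs after taking the per-VMA lock:

/* Sketch only: a simplified version of the lookup described above. */
static struct vm_area_struct *lock_vma_under_rcu_sketch(struct mm_struct *mm,
							 unsigned long address)
{
	MA_STATE(mas, &mm->mm_mt, address, address);
	struct vm_area_struct *vma;

	rcu_read_lock();
	vma = mas_walk(&mas);
	if (!vma)
		goto inval;

	/* Only anonymous VMAs are handled under the per-VMA lock for now. */
	if (!vma_is_anonymous(vma))
		goto inval;

	/*
	 * No anon_vma assigned yet: setting one up may call
	 * find_mergeable_anon_vma(), which inspects neighbouring VMAs that
	 * are not locked here, so fall back to the mmap_lock path.
	 */
	if (!vma->anon_vma)
		goto inval;

	/* Per-VMA read lock; fails if a writer currently holds the VMA. */
	if (!vma_start_read(vma))
		goto inval;

	rcu_read_unlock();
	return vma;

inval:
	rcu_read_unlock();
	return NULL;	/* caller handles the fault under mmap_lock instead */
}

Returning NULL simply sends the fault down the existing mmap_lock path, which is why the extra cost is only paid until anon_vma has been populated for the VMA.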