2023-02-27 17:36:44

by Suren Baghdasaryan

Subject: [PATCH v4 00/33] Per-VMA locks

Previous versions:
v3: https://lore.kernel.org/all/[email protected]/
v2: https://lore.kernel.org/lkml/[email protected]/
v1: https://lore.kernel.org/all/[email protected]/
RFC: https://lore.kernel.org/all/[email protected]/

LWN article describing the feature:
https://lwn.net/Articles/906852/

The per-VMA locks idea was discussed during the SPF [1] discussion at
LSF/MM last year [2], which concluded with the suggestion that “a
reader/writer semaphore could be put into the VMA itself; that would have
the effect of using the VMA as a sort of range lock. There would still be
contention at the VMA level, but it would be an improvement.” This
patchset implements that suggested approach.

When handling page faults, we look up the VMA that contains the faulting
page under RCU protection and try to acquire its lock. If that fails, we
fall back to using mmap_lock, similar to how SPF handled this situation.
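
For illustration, the arch fault handlers converted later in the series
follow roughly this pattern (a simplified sketch; access checks,
accounting and error paths are omitted):

	/* Try the per-VMA lock first; fall back to mmap_lock on failure */
	vma = lock_vma_under_rcu(mm, address);
	if (vma) {
		fault = handle_mm_fault(vma, address,
					flags | FAULT_FLAG_VMA_LOCK, regs);
		vma_end_read(vma);
		if (!(fault & VM_FAULT_RETRY))
			goto done;	/* handled without taking mmap_lock */
	}
	/* Slow path: classic fault handling under mmap_lock */
	mmap_read_lock(mm);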

One notable way the implementation deviates from the proposal is the way
VMAs are write-locked. During some mm updates, multiple VMAs need to be
locked until the end of the update (e.g. vma_merge, split_vma, etc.).
Tracking all the locked VMAs, avoiding recursive locks, and figuring out
when it's safe to unlock previously locked VMAs would make the code more
complex. So, instead of the usual lock/unlock pattern, the proposed
solution marks a VMA as locked and provides an efficient way to:
1. Identify locked VMAs.
2. Unlock all locked VMAs in bulk.
We also postpone unlocking the locked VMAs until the end of the update,
when we do mmap_write_unlock. Potentially this keeps a VMA locked for
longer than is absolutely necessary but it results in a big reduction of
code complexity.
Write-locking a VMA is done using two sequence numbers - one in the
vm_area_struct and one in the mm_struct. The VMA is considered
write-locked when these sequence numbers are equal. To write-lock a VMA
we set the sequence number in vm_area_struct to be equal to the sequence
number in mm_struct. To unlock all VMAs we increment mm_struct's sequence
number. This allows for an efficient way to track write-locked VMAs and
to drop the locks on all of them at the end of the update.
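
In condensed form (a sketch of the helpers added by patch 12 of this
series; the full versions carry additional assertions and memory-ordering
comments):

	/* Write-lock (mark) a VMA; mmap_lock must be held for write */
	static inline void vma_start_write(struct vm_area_struct *vma)
	{
		int mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);

		if (vma->vm_lock_seq == mm_lock_seq)
			return;		/* already locked in this update */

		down_write(&vma->lock);	/* waits for existing readers */
		vma->vm_lock_seq = mm_lock_seq;
		up_write(&vma->lock);
	}

	/* Unlock every write-locked VMA at once, before mmap_write_unlock */
	static inline void vma_end_write_all(struct mm_struct *mm)
	{
		WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
	}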

The patchset implements per-VMA locking only for anonymous pages which
are not in swap and avoids userfaultfd, as its implementation is more
complex. Additional support for file-backed page faults, swapped-out
pages and userfaultfd can be added incrementally.

Performance benchmarks show benefits similar to, although slightly
smaller than, those of the SPF patchset (~75% of SPF's benefits). Still,
with its lower complexity, this approach might be more desirable.

Since the RFC was posted in September 2022, two separate Google teams
outside of Android evaluated the patchset and confirmed positive results.
Here are the known use cases where per-VMA locks show benefits:

Android:
Apps with a high number of threads (~100) see launch times improve by up
to 20%. Each thread mmaps several areas upon startup (stack and
thread-local storage (TLS), thread signal stack, indirect ref table),
which requires taking mmap_lock in write mode. Page faults take
mmap_lock in read mode. During app launch, both thread creation and the
page faults establishing the active working set happen in parallel, and
that causes lock contention between mm writers and readers even if the
updates and page faults happen in different VMAs. Per-VMA locks prevent
this contention by providing a more granular lock.

Google Fibers:
We have several dynamically sized thread pools that spawn new threads
under increased load and reduce their number when idling. For example,
Google's in-process scheduling/threading framework, UMCG/Fibers, is backed
by such a thread pool. When idling, only a small number of idle worker
threads are available; when a spike of incoming requests arrives, each
request is handled in its own "fiber", which is a work item posted onto a
UMCG worker thread; quite often these spikes lead to a number of new
threads spawning. Each new thread needs to allocate and register an RSEQ
section in its TLS, then register itself with the kernel as a UMCG worker
thread, and only then can it be considered by the in-process UMCG/Fiber
scheduler as available to do useful work. In short, during an
incoming workload spike new threads have to be spawned, and they perform
several syscalls (RSEQ registration, UMCG worker registration, memory
allocations) before they can actually start doing useful work. Removing
any bottlenecks on this thread startup path will greatly improve our
services' latencies when faced with request/workload spikes.
At high scale, mmap_lock contention during thread creation and stack page
faults leads to user-visible multi-second serving latencies in a pattern
similar to Android app startup. The per-VMA locking patchset has been run
successfully in limited experiments with user-facing production workloads.
In these experiments, we observed that the peak thread creation rate was
high enough that thread creation was no longer a bottleneck.

TCP zerocopy receive:
From the point of view of TCP zerocopy receive, the per-vma lock patch is
massively beneficial.
In today's implementation, a process with N threads, where N - 1 are
performing zerocopy receive and 1 thread is performing madvise() with the
write lock taken (e.g. it needs to change vm_flags), will see all N - 1
receive threads block until the madvise is done. Conversely, in a busy
process receiving a lot of data, a madvise operation that does need to
take the mmap lock in write mode will need to wait for all of the receives
to be done - a lose:lose proposition. Per-VMA locking _removes_ this
source of contention entirely, by definition.
There are other benefits for receive as well, chiefly a reduction in
cacheline bouncing across receiving threads for locking/unlocking the
single mmap lock. On an RPC style synthetic workload with 4KB RPCs:
1a) The find+lock+unlock VMA path in the base case, without the per-vma
lock patchset, is about 0.7% of cycles as measured by perf.
1b) mmap_read_lock + mmap_read_unlock in the base case is about 0.5%
cycles overall - most of this is within the TCP read hotpath (a small
fraction is 'other' usage in the system).
2a) The find+lock+unlock VMA path, with the per-vma patchset and a trivial
patch written to take advantage of it in TCP, is about 0.4% of cycles
(down from 0.7% above).
2b) mmap_read_lock + mmap_read_unlock in the per-vma patchset is < 0.1%
cycles and is out of the TCP read hotpath entirely (down from 0.5% before,
the remaining usage is the 'other' usage in the system).
So, in addition to entirely removing an onerous source of contention, it
also reduces the CPU cycles of TCP receive zerocopy by about 0.5%+
(compared to overall cycles in perf) for the 'small' RPC scenario.

The patchset structure is:
0001-0008: Enable maple-tree RCU mode
0009-0031: Main per-vma locks patchset
0032-0033: Performance optimizations

Changes since v3:
- Changed patch [3] to move vma_prepare before vma_adjust_trans_huge
- Dropped patch [4] from the set as unnecessary, per Hyeonggon Yoo
- Changed patch [5] to do VMA locking inside vma_prepare, per Liam Howlett
- Dropped patch [6] from the set as unnecessary, per Liam Howlett

[1] https://lore.kernel.org/all/[email protected]/
[2] https://lwn.net/Articles/893906/
[3] https://lore.kernel.org/all/[email protected]/
[4] https://lore.kernel.org/all/[email protected]/
[5] https://lore.kernel.org/all/[email protected]/
[6] https://lore.kernel.org/all/[email protected]/

The patchset applies cleanly over mm-unstable branch.

Laurent Dufour (1):
powerpc/mm: try VMA lock-based page fault handling first

Liam Howlett (4):
maple_tree: Be more cautious about dead nodes
maple_tree: Detect dead nodes in mas_start()
maple_tree: Fix freeing of nodes in rcu mode
maple_tree: remove extra smp_wmb() from mas_dead_leaves()

Liam R. Howlett (4):
maple_tree: Fix write memory barrier of nodes once dead for RCU mode
maple_tree: Add smp_rmb() to dead node detection
maple_tree: Add RCU lock checking to rcu callback functions
mm: Enable maple tree RCU mode by default.

Michel Lespinasse (1):
mm: rcu safe VMA freeing

Suren Baghdasaryan (23):
mm: introduce CONFIG_PER_VMA_LOCK
mm: move mmap_lock assert function definitions
mm: add per-VMA lock and helper functions to control it
mm: mark VMA as being written when changing vm_flags
mm/mmap: move vma_prepare before vma_adjust_trans_huge
mm/khugepaged: write-lock VMA while collapsing a huge page
mm/mmap: write-lock VMAs in vma_prepare before modifying them
mm/mremap: write-lock VMA while remapping it to a new address range
mm: write-lock VMAs before removing them from VMA tree
mm: conditionally write-lock VMA in free_pgtables
kernel/fork: assert no VMA readers during its destruction
mm/mmap: prevent pagefault handler from racing with mmu_notifier
registration
mm: introduce vma detached flag
mm: introduce lock_vma_under_rcu to be used from arch-specific code
mm: fall back to mmap_lock if vma->anon_vma is not yet set
mm: add FAULT_FLAG_VMA_LOCK flag
mm: prevent do_swap_page from handling page faults under VMA lock
mm: prevent userfaults to be handled under per-vma lock
mm: introduce per-VMA lock statistics
x86/mm: try VMA lock-based page fault handling first
arm64/mm: try VMA lock-based page fault handling first
mm/mmap: free vm_area_struct without call_rcu in exit_mmap
mm: separate vma->lock from vm_area_struct

arch/arm64/Kconfig | 1 +
arch/arm64/mm/fault.c | 36 ++++
arch/powerpc/mm/fault.c | 41 ++++
arch/powerpc/platforms/powernv/Kconfig | 1 +
arch/powerpc/platforms/pseries/Kconfig | 1 +
arch/x86/Kconfig | 1 +
arch/x86/mm/fault.c | 36 ++++
include/linux/mm.h | 108 +++++++++-
include/linux/mm_types.h | 32 ++-
include/linux/mmap_lock.h | 37 ++--
include/linux/vm_event_item.h | 6 +
include/linux/vmstat.h | 6 +
kernel/fork.c | 99 +++++++--
lib/maple_tree.c | 269 +++++++++++++++++--------
mm/Kconfig | 12 ++
mm/Kconfig.debug | 6 +
mm/init-mm.c | 3 +
mm/internal.h | 2 +-
mm/khugepaged.c | 5 +
mm/memory.c | 72 ++++++-
mm/mmap.c | 53 +++--
mm/mremap.c | 1 +
mm/nommu.c | 5 +
mm/rmap.c | 31 +--
mm/vmstat.c | 6 +
tools/testing/radix-tree/maple.c | 16 ++
26 files changed, 737 insertions(+), 149 deletions(-)

--
2.39.2.722.g9855ee24e9-goog



2023-02-27 17:36:48

by Suren Baghdasaryan

Subject: [PATCH v4 01/33] maple_tree: Be more cautious about dead nodes

From: Liam Howlett <[email protected]>

ma_pivots() and ma_data_end() may be called with a dead node. Ensure
that the node isn't dead before using the returned values.

This is necessary for RCU mode of the maple tree.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
lib/maple_tree.c | 52 +++++++++++++++++++++++++++++++++++++++---------
1 file changed, 43 insertions(+), 9 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 646297cae5d1..cc356b8369ad 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -544,6 +544,7 @@ static inline bool ma_dead_node(const struct maple_node *node)

return (parent == node);
}
+
/*
* mte_dead_node() - check if the @enode is dead.
* @enode: The encoded maple node
@@ -625,6 +626,8 @@ static inline unsigned int mas_alloc_req(const struct ma_state *mas)
* @node - the maple node
* @type - the node type
*
+ * In the event of a dead node, this array may be %NULL
+ *
* Return: A pointer to the maple node pivots
*/
static inline unsigned long *ma_pivots(struct maple_node *node,
@@ -1096,8 +1099,11 @@ static int mas_ascend(struct ma_state *mas)
a_type = mas_parent_enum(mas, p_enode);
a_node = mte_parent(p_enode);
a_slot = mte_parent_slot(p_enode);
- pivots = ma_pivots(a_node, a_type);
a_enode = mt_mk_node(a_node, a_type);
+ pivots = ma_pivots(a_node, a_type);
+
+ if (unlikely(ma_dead_node(a_node)))
+ return 1;

if (!set_min && a_slot) {
set_min = true;
@@ -1401,6 +1407,9 @@ static inline unsigned char ma_data_end(struct maple_node *node,
{
unsigned char offset;

+ if (!pivots)
+ return 0;
+
if (type == maple_arange_64)
return ma_meta_end(node, type);

@@ -1436,6 +1445,9 @@ static inline unsigned char mas_data_end(struct ma_state *mas)
return ma_meta_end(node, type);

pivots = ma_pivots(node, type);
+ if (unlikely(ma_dead_node(node)))
+ return 0;
+
offset = mt_pivots[type] - 1;
if (likely(!pivots[offset]))
return ma_meta_end(node, type);
@@ -4505,6 +4517,9 @@ static inline int mas_prev_node(struct ma_state *mas, unsigned long min)
node = mas_mn(mas);
slots = ma_slots(node, mt);
pivots = ma_pivots(node, mt);
+ if (unlikely(ma_dead_node(node)))
+ return 1;
+
mas->max = pivots[offset];
if (offset)
mas->min = pivots[offset - 1] + 1;
@@ -4526,6 +4541,9 @@ static inline int mas_prev_node(struct ma_state *mas, unsigned long min)
slots = ma_slots(node, mt);
pivots = ma_pivots(node, mt);
offset = ma_data_end(node, mt, pivots, mas->max);
+ if (unlikely(ma_dead_node(node)))
+ return 1;
+
if (offset)
mas->min = pivots[offset - 1] + 1;

@@ -4574,6 +4592,7 @@ static inline int mas_next_node(struct ma_state *mas, struct maple_node *node,
struct maple_enode *enode;
int level = 0;
unsigned char offset;
+ unsigned char node_end;
enum maple_type mt;
void __rcu **slots;

@@ -4597,7 +4616,11 @@ static inline int mas_next_node(struct ma_state *mas, struct maple_node *node,
node = mas_mn(mas);
mt = mte_node_type(mas->node);
pivots = ma_pivots(node, mt);
- } while (unlikely(offset == ma_data_end(node, mt, pivots, mas->max)));
+ node_end = ma_data_end(node, mt, pivots, mas->max);
+ if (unlikely(ma_dead_node(node)))
+ return 1;
+
+ } while (unlikely(offset == node_end));

slots = ma_slots(node, mt);
pivot = mas_safe_pivot(mas, pivots, ++offset, mt);
@@ -4613,6 +4636,9 @@ static inline int mas_next_node(struct ma_state *mas, struct maple_node *node,
mt = mte_node_type(mas->node);
slots = ma_slots(node, mt);
pivots = ma_pivots(node, mt);
+ if (unlikely(ma_dead_node(node)))
+ return 1;
+
offset = 0;
pivot = pivots[0];
}
@@ -4659,11 +4685,14 @@ static inline void *mas_next_nentry(struct ma_state *mas,
return NULL;
}

- pivots = ma_pivots(node, type);
slots = ma_slots(node, type);
- mas->index = mas_safe_min(mas, pivots, mas->offset);
+ pivots = ma_pivots(node, type);
count = ma_data_end(node, type, pivots, mas->max);
- if (ma_dead_node(node))
+ if (unlikely(ma_dead_node(node)))
+ return NULL;
+
+ mas->index = mas_safe_min(mas, pivots, mas->offset);
+ if (unlikely(ma_dead_node(node)))
return NULL;

if (mas->index > max)
@@ -4817,6 +4846,11 @@ static inline void *mas_prev_nentry(struct ma_state *mas, unsigned long limit,

slots = ma_slots(mn, mt);
pivots = ma_pivots(mn, mt);
+ if (unlikely(ma_dead_node(mn))) {
+ mas_rewalk(mas, index);
+ goto retry;
+ }
+
if (offset == mt_pivots[mt])
pivot = mas->max;
else
@@ -6631,11 +6665,11 @@ static inline void *mas_first_entry(struct ma_state *mas, struct maple_node *mn,
while (likely(!ma_is_leaf(mt))) {
MT_BUG_ON(mas->tree, mte_dead_node(mas->node));
slots = ma_slots(mn, mt);
- pivots = ma_pivots(mn, mt);
- max = pivots[0];
entry = mas_slot(mas, slots, 0);
+ pivots = ma_pivots(mn, mt);
if (unlikely(ma_dead_node(mn)))
return NULL;
+ max = pivots[0];
mas->node = entry;
mn = mas_mn(mas);
mt = mte_node_type(mas->node);
@@ -6655,13 +6689,13 @@ static inline void *mas_first_entry(struct ma_state *mas, struct maple_node *mn,
if (likely(entry))
return entry;

- pivots = ma_pivots(mn, mt);
- mas->index = pivots[0] + 1;
mas->offset = 1;
entry = mas_slot(mas, slots, 1);
+ pivots = ma_pivots(mn, mt);
if (unlikely(ma_dead_node(mn)))
return NULL;

+ mas->index = pivots[0] + 1;
if (mas->index > limit)
goto none;

--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:36:51

by Suren Baghdasaryan

Subject: [PATCH v4 02/33] maple_tree: Detect dead nodes in mas_start()

From: Liam Howlett <[email protected]>

When initially starting a search, the root node may already be in the
process of being replaced in RCU mode. Detect and restart the walk if
this is the case. This is necessary for RCU mode of the maple tree.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
lib/maple_tree.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index cc356b8369ad..089cd76ec379 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -1360,12 +1360,16 @@ static inline struct maple_enode *mas_start(struct ma_state *mas)
mas->max = ULONG_MAX;
mas->depth = 0;

+retry:
root = mas_root(mas);
/* Tree with nodes */
if (likely(xa_is_node(root))) {
mas->depth = 1;
mas->node = mte_safe_root(root);
mas->offset = 0;
+ if (mte_dead_node(mas->node))
+ goto retry;
+
return NULL;
}

--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:36:59

by Suren Baghdasaryan

Subject: [PATCH v4 03/33] maple_tree: Fix freeing of nodes in rcu mode

From: Liam Howlett <[email protected]>

The walk to destroy the nodes was not always setting the node type, which
could result in the destroy method treating the values as nodes.
Avoid this by setting the correct node types. This is necessary for the
RCU mode of the maple tree.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
lib/maple_tree.c | 73 ++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 62 insertions(+), 11 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 089cd76ec379..44d6ce30b28e 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -902,6 +902,44 @@ static inline void ma_set_meta(struct maple_node *mn, enum maple_type mt,
meta->end = end;
}

+/*
+ * mas_clear_meta() - clear the metadata information of a node, if it exists
+ * @mas: The maple state
+ * @mn: The maple node
+ * @mt: The maple node type
+ * @offset: The offset of the highest sub-gap in this node.
+ * @end: The end of the data in this node.
+ */
+static inline void mas_clear_meta(struct ma_state *mas, struct maple_node *mn,
+ enum maple_type mt)
+{
+ struct maple_metadata *meta;
+ unsigned long *pivots;
+ void __rcu **slots;
+ void *next;
+
+ switch (mt) {
+ case maple_range_64:
+ pivots = mn->mr64.pivot;
+ if (unlikely(pivots[MAPLE_RANGE64_SLOTS - 2])) {
+ slots = mn->mr64.slot;
+ next = mas_slot_locked(mas, slots,
+ MAPLE_RANGE64_SLOTS - 1);
+ if (unlikely((mte_to_node(next) && mte_node_type(next))))
+ return; /* The last slot is a node, no metadata */
+ }
+ fallthrough;
+ case maple_arange_64:
+ meta = ma_meta(mn, mt);
+ break;
+ default:
+ return;
+ }
+
+ meta->gap = 0;
+ meta->end = 0;
+}
+
/*
* ma_meta_end() - Get the data end of a node from the metadata
* @mn: The maple node
@@ -5455,20 +5493,22 @@ static inline int mas_rev_alloc(struct ma_state *mas, unsigned long min,
* mas_dead_leaves() - Mark all leaves of a node as dead.
* @mas: The maple state
* @slots: Pointer to the slot array
+ * @type: The maple node type
*
* Must hold the write lock.
*
* Return: The number of leaves marked as dead.
*/
static inline
-unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots)
+unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots,
+ enum maple_type mt)
{
struct maple_node *node;
enum maple_type type;
void *entry;
int offset;

- for (offset = 0; offset < mt_slot_count(mas->node); offset++) {
+ for (offset = 0; offset < mt_slots[mt]; offset++) {
entry = mas_slot_locked(mas, slots, offset);
type = mte_node_type(entry);
node = mte_to_node(entry);
@@ -5487,14 +5527,13 @@ unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots)

static void __rcu **mas_dead_walk(struct ma_state *mas, unsigned char offset)
{
- struct maple_node *node, *next;
+ struct maple_node *next;
void __rcu **slots = NULL;

next = mas_mn(mas);
do {
- mas->node = ma_enode_ptr(next);
- node = mas_mn(mas);
- slots = ma_slots(node, node->type);
+ mas->node = mt_mk_node(next, next->type);
+ slots = ma_slots(next, next->type);
next = mas_slot_locked(mas, slots, offset);
offset = 0;
} while (!ma_is_leaf(next->type));
@@ -5558,11 +5597,14 @@ static inline void __rcu **mas_destroy_descend(struct ma_state *mas,
node = mas_mn(mas);
slots = ma_slots(node, mte_node_type(mas->node));
next = mas_slot_locked(mas, slots, 0);
- if ((mte_dead_node(next)))
+ if ((mte_dead_node(next))) {
+ mte_to_node(next)->type = mte_node_type(next);
next = mas_slot_locked(mas, slots, 1);
+ }

mte_set_node_dead(mas->node);
node->type = mte_node_type(mas->node);
+ mas_clear_meta(mas, node, node->type);
node->piv_parent = prev;
node->parent_slot = offset;
offset = 0;
@@ -5582,13 +5624,18 @@ static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags,

MA_STATE(mas, &mt, 0, 0);

- if (mte_is_leaf(enode))
+ mas.node = enode;
+ if (mte_is_leaf(enode)) {
+ node->type = mte_node_type(enode);
goto free_leaf;
+ }

+ ma_flags &= ~MT_FLAGS_LOCK_MASK;
mt_init_flags(&mt, ma_flags);
mas_lock(&mas);

- mas.node = start = enode;
+ mte_to_node(enode)->ma_flags = ma_flags;
+ start = enode;
slots = mas_destroy_descend(&mas, start, 0);
node = mas_mn(&mas);
do {
@@ -5596,7 +5643,8 @@ static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags,
unsigned char offset;
struct maple_enode *parent, *tmp;

- node->slot_len = mas_dead_leaves(&mas, slots);
+ node->type = mte_node_type(mas.node);
+ node->slot_len = mas_dead_leaves(&mas, slots, node->type);
if (free)
mt_free_bulk(node->slot_len, slots);
offset = node->parent_slot + 1;
@@ -5620,7 +5668,8 @@ static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags,
} while (start != mas.node);

node = mas_mn(&mas);
- node->slot_len = mas_dead_leaves(&mas, slots);
+ node->type = mte_node_type(mas.node);
+ node->slot_len = mas_dead_leaves(&mas, slots, node->type);
if (free)
mt_free_bulk(node->slot_len, slots);

@@ -5630,6 +5679,8 @@ static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags,
free_leaf:
if (free)
mt_free_rcu(&node->rcu);
+ else
+ mas_clear_meta(&mas, node, node->type);
}

/*
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:37:02

by Suren Baghdasaryan

Subject: [PATCH v4 04/33] maple_tree: remove extra smp_wmb() from mas_dead_leaves()

From: Liam Howlett <[email protected]>

mte_set_dead_node() already calls smp_wmb(), so the extra smp_wmb() that
follows it is not needed. This is an optimization for the RCU mode of the
maple tree.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
lib/maple_tree.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 44d6ce30b28e..3d5ab02f981a 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5517,7 +5517,6 @@ unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots,
break;

mte_set_node_dead(entry);
- smp_wmb(); /* Needed for RCU */
node->type = type;
rcu_assign_pointer(slots[offset], node);
}
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:37:05

by Suren Baghdasaryan

Subject: [PATCH v4 05/33] maple_tree: Fix write memory barrier of nodes once dead for RCU mode

From: "Liam R. Howlett" <[email protected]>

During the development of the maple tree, the strategy of freeing
multiple nodes changed and, in the process, the pivots were reused to
store pointers to dead nodes. To ensure the readers see accurate
pivots, the writers need to mark the nodes as dead and call smp_wmb() to
ensure any readers can identify the node as dead before using the pivot
values.

There were two places where the old method of marking the node as dead
without smp_wmb() was being used, which resulted in RCU readers seeing
the wrong pivot value before seeing the node was dead. Fix this race
condition by using mte_set_node_dead() which has the smp_wmb() call to
ensure the race is closed.

Add a WARN_ON() to the ma_free_rcu() call to check that all nodes being
freed are marked as dead, ensuring there are no other call paths besides
the two updated ones.

This is necessary for the RCU mode of the maple tree.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
lib/maple_tree.c | 7 +++++--
tools/testing/radix-tree/maple.c | 16 ++++++++++++++++
2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 3d5ab02f981a..6b6eddadd9d2 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -185,7 +185,7 @@ static void mt_free_rcu(struct rcu_head *head)
*/
static void ma_free_rcu(struct maple_node *node)
{
- node->parent = ma_parent_ptr(node);
+ WARN_ON(node->parent != ma_parent_ptr(node));
call_rcu(&node->rcu, mt_free_rcu);
}

@@ -1778,8 +1778,10 @@ static inline void mas_replace(struct ma_state *mas, bool advanced)
rcu_assign_pointer(slots[offset], mas->node);
}

- if (!advanced)
+ if (!advanced) {
+ mte_set_node_dead(old_enode);
mas_free(mas, old_enode);
+ }
}

/*
@@ -4218,6 +4220,7 @@ static inline bool mas_wr_node_store(struct ma_wr_state *wr_mas)
done:
mas_leaf_set_meta(mas, newnode, dst_pivots, maple_leaf_64, new_end);
if (in_rcu) {
+ mte_set_node_dead(mas->node);
mas->node = mt_mk_node(newnode, wr_mas->type);
mas_replace(mas, false);
} else {
diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index 958ee9bdb316..4c89ff333f6f 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -108,6 +108,7 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, mn->slot[1] != NULL);
MT_BUG_ON(mt, mas_allocated(&mas) != 0);

+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
mas.node = MAS_START;
mas_nomem(&mas, GFP_KERNEL);
@@ -160,6 +161,7 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, mas_allocated(&mas) != i);
MT_BUG_ON(mt, !mn);
MT_BUG_ON(mt, not_empty(mn));
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}

@@ -192,6 +194,7 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, not_empty(mn));
MT_BUG_ON(mt, mas_allocated(&mas) != i - 1);
MT_BUG_ON(mt, !mn);
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}

@@ -210,6 +213,7 @@ static noinline void check_new_node(struct maple_tree *mt)
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, not_empty(mn));
MT_BUG_ON(mt, mas_allocated(&mas) != j - 1);
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
@@ -233,6 +237,7 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, mas_allocated(&mas) != i - j);
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, not_empty(mn));
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
MT_BUG_ON(mt, mas_allocated(&mas) != i - j - 1);
}
@@ -269,6 +274,7 @@ static noinline void check_new_node(struct maple_tree *mt)
mn = mas_pop_node(&mas); /* get the next node. */
MT_BUG_ON(mt, mn == NULL);
MT_BUG_ON(mt, not_empty(mn));
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
@@ -294,6 +300,7 @@ static noinline void check_new_node(struct maple_tree *mt)
mn = mas_pop_node(&mas2); /* get the next node. */
MT_BUG_ON(mt, mn == NULL);
MT_BUG_ON(mt, not_empty(mn));
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
MT_BUG_ON(mt, mas_allocated(&mas2) != 0);
@@ -334,10 +341,12 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 2);
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, not_empty(mn));
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
for (i = 1; i <= MAPLE_ALLOC_SLOTS + 1; i++) {
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, not_empty(mn));
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
@@ -375,6 +384,7 @@ static noinline void check_new_node(struct maple_tree *mt)
mas_node_count(&mas, i); /* Request */
mas_nomem(&mas, GFP_KERNEL); /* Fill request */
mn = mas_pop_node(&mas); /* get the next node. */
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
mas_destroy(&mas);

@@ -382,10 +392,13 @@ static noinline void check_new_node(struct maple_tree *mt)
mas_node_count(&mas, i); /* Request */
mas_nomem(&mas, GFP_KERNEL); /* Fill request */
mn = mas_pop_node(&mas); /* get the next node. */
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
mn = mas_pop_node(&mas); /* get the next node. */
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
mn = mas_pop_node(&mas); /* get the next node. */
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
mas_destroy(&mas);
}
@@ -35369,6 +35382,7 @@ static noinline void check_prealloc(struct maple_tree *mt)
MT_BUG_ON(mt, allocated != 1 + height * 3);
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1);
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0);
mas_destroy(&mas);
@@ -35386,6 +35400,7 @@ static noinline void check_prealloc(struct maple_tree *mt)
mas_destroy(&mas);
allocated = mas_allocated(&mas);
MT_BUG_ON(mt, allocated != 0);
+ mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);

MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0);
@@ -35756,6 +35771,7 @@ void farmer_tests(void)
tree.ma_root = mt_mk_node(node, maple_leaf_64);
mt_dump(&tree);

+ node->parent = ma_parent_ptr(node);
ma_free_rcu(node);

/* Check things that will make lockdep angry */
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:37:15

by Suren Baghdasaryan

Subject: [PATCH v4 06/33] maple_tree: Add smp_rmb() to dead node detection

From: "Liam R. Howlett" <[email protected]>

Add an smp_rmb() before reading the parent pointer to ensure that
anything read from the node prior to the parent pointer hasn't been
reordered ahead of this check.

This is necessary for RCU mode.
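
The reader-side pattern these barriers pair with (drawn from the
dead-node checks added earlier in the series) is, in sketch form:

	pivots = ma_pivots(node, mt);		/* read node contents first */
	if (unlikely(ma_dead_node(node)))	/* smp_rmb(), then parent check */
		return 1;			/* stale node, caller retries */
	/* contents read above are now known to be from a live node */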

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
lib/maple_tree.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 6b6eddadd9d2..8ad2d1669fad 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -539,9 +539,11 @@ static inline struct maple_node *mte_parent(const struct maple_enode *enode)
*/
static inline bool ma_dead_node(const struct maple_node *node)
{
- struct maple_node *parent = (void *)((unsigned long)
- node->parent & ~MAPLE_NODE_MASK);
+ struct maple_node *parent;

+ /* Do not reorder reads from the node prior to the parent check */
+ smp_rmb();
+ parent = (void *)((unsigned long) node->parent & ~MAPLE_NODE_MASK);
return (parent == node);
}

@@ -556,6 +558,8 @@ static inline bool mte_dead_node(const struct maple_enode *enode)
struct maple_node *parent, *node;

node = mte_to_node(enode);
+ /* Do not reorder reads from the node prior to the parent check */
+ smp_rmb();
parent = mte_parent(enode);
return (parent == node);
}
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:37:19

by Suren Baghdasaryan

Subject: [PATCH v4 07/33] maple_tree: Add RCU lock checking to rcu callback functions

From: "Liam R. Howlett" <[email protected]>

Dereferencing RCU objects within the RCU callback without the RCU check
has caused lockdep to complain. Fix the RCU dereferencing by using the
RCU callback lock to ensure the operation is safe.

Also stop creating a new lock to use for dereferencing during
destruction of the tree or subtree. Instead, pass through a pointer to
the tree that has the lock that is held for RCU dereferencing checking.
It also does not make sense to use the maple state in the freeing
scenario as the tree walk is a special case where the tree no longer has
the normal encodings and parent pointers.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Reported-by: Suren Baghdasaryan <[email protected]>
Signed-off-by: Liam R. Howlett <[email protected]>
---
lib/maple_tree.c | 188 ++++++++++++++++++++++++-----------------------
1 file changed, 96 insertions(+), 92 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 8ad2d1669fad..2be86368237d 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -824,6 +824,11 @@ static inline void *mt_slot(const struct maple_tree *mt,
return rcu_dereference_check(slots[offset], mt_locked(mt));
}

+static inline void *mt_slot_locked(struct maple_tree *mt, void __rcu **slots,
+ unsigned char offset)
+{
+ return rcu_dereference_protected(slots[offset], mt_locked(mt));
+}
/*
* mas_slot_locked() - Get the slot value when holding the maple tree lock.
* @mas: The maple state
@@ -835,7 +840,7 @@ static inline void *mt_slot(const struct maple_tree *mt,
static inline void *mas_slot_locked(struct ma_state *mas, void __rcu **slots,
unsigned char offset)
{
- return rcu_dereference_protected(slots[offset], mt_locked(mas->tree));
+ return mt_slot_locked(mas->tree, slots, offset);
}

/*
@@ -907,34 +912,35 @@ static inline void ma_set_meta(struct maple_node *mn, enum maple_type mt,
}

/*
- * mas_clear_meta() - clear the metadata information of a node, if it exists
- * @mas: The maple state
+ * mt_clear_meta() - clear the metadata information of a node, if it exists
+ * @mt: The maple tree
* @mn: The maple node
- * @mt: The maple node type
+ * @type: The maple node type
* @offset: The offset of the highest sub-gap in this node.
* @end: The end of the data in this node.
*/
-static inline void mas_clear_meta(struct ma_state *mas, struct maple_node *mn,
- enum maple_type mt)
+static inline void mt_clear_meta(struct maple_tree *mt, struct maple_node *mn,
+ enum maple_type type)
{
struct maple_metadata *meta;
unsigned long *pivots;
void __rcu **slots;
void *next;

- switch (mt) {
+ switch (type) {
case maple_range_64:
pivots = mn->mr64.pivot;
if (unlikely(pivots[MAPLE_RANGE64_SLOTS - 2])) {
slots = mn->mr64.slot;
- next = mas_slot_locked(mas, slots,
- MAPLE_RANGE64_SLOTS - 1);
- if (unlikely((mte_to_node(next) && mte_node_type(next))))
- return; /* The last slot is a node, no metadata */
+ next = mt_slot_locked(mt, slots,
+ MAPLE_RANGE64_SLOTS - 1);
+ if (unlikely((mte_to_node(next) &&
+ mte_node_type(next))))
+ return; /* no metadata, could be node */
}
fallthrough;
case maple_arange_64:
- meta = ma_meta(mn, mt);
+ meta = ma_meta(mn, type);
break;
default:
return;
@@ -5497,7 +5503,7 @@ static inline int mas_rev_alloc(struct ma_state *mas, unsigned long min,
}

/*
- * mas_dead_leaves() - Mark all leaves of a node as dead.
+ * mte_dead_leaves() - Mark all leaves of a node as dead.
* @mas: The maple state
* @slots: Pointer to the slot array
* @type: The maple node type
@@ -5507,16 +5513,16 @@ static inline int mas_rev_alloc(struct ma_state *mas, unsigned long min,
* Return: The number of leaves marked as dead.
*/
static inline
-unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots,
- enum maple_type mt)
+unsigned char mte_dead_leaves(struct maple_enode *enode, struct maple_tree *mt,
+ void __rcu **slots)
{
struct maple_node *node;
enum maple_type type;
void *entry;
int offset;

- for (offset = 0; offset < mt_slots[mt]; offset++) {
- entry = mas_slot_locked(mas, slots, offset);
+ for (offset = 0; offset < mt_slot_count(enode); offset++) {
+ entry = mt_slot(mt, slots, offset);
type = mte_node_type(entry);
node = mte_to_node(entry);
/* Use both node and type to catch LE & BE metadata */
@@ -5531,162 +5537,160 @@ unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots,
return offset;
}

-static void __rcu **mas_dead_walk(struct ma_state *mas, unsigned char offset)
+/**
+ * mte_dead_walk() - Walk down a dead tree to just before the leaves
+ * @enode: The maple encoded node
+ * @offset: The starting offset
+ *
+ * Note: This can only be used from the RCU callback context.
+ */
+static void __rcu **mte_dead_walk(struct maple_enode **enode, unsigned char offset)
{
- struct maple_node *next;
+ struct maple_node *node, *next;
void __rcu **slots = NULL;

- next = mas_mn(mas);
+ next = mte_to_node(*enode);
do {
- mas->node = mt_mk_node(next, next->type);
- slots = ma_slots(next, next->type);
- next = mas_slot_locked(mas, slots, offset);
+ *enode = ma_enode_ptr(next);
+ node = mte_to_node(*enode);
+ slots = ma_slots(node, node->type);
+ next = rcu_dereference_protected(slots[offset],
+ lock_is_held(&rcu_callback_map));
offset = 0;
} while (!ma_is_leaf(next->type));

return slots;
}

+/**
+ * mt_free_walk() - Walk & free a tree in the RCU callback context
+ * @head: The RCU head that's within the node.
+ *
+ * Note: This can only be used from the RCU callback context.
+ */
static void mt_free_walk(struct rcu_head *head)
{
void __rcu **slots;
struct maple_node *node, *start;
- struct maple_tree mt;
+ struct maple_enode *enode;
unsigned char offset;
enum maple_type type;
- MA_STATE(mas, &mt, 0, 0);

node = container_of(head, struct maple_node, rcu);

if (ma_is_leaf(node->type))
goto free_leaf;

- mt_init_flags(&mt, node->ma_flags);
- mas_lock(&mas);
start = node;
- mas.node = mt_mk_node(node, node->type);
- slots = mas_dead_walk(&mas, 0);
- node = mas_mn(&mas);
+ enode = mt_mk_node(node, node->type);
+ slots = mte_dead_walk(&enode, 0);
+ node = mte_to_node(enode);
do {
mt_free_bulk(node->slot_len, slots);
offset = node->parent_slot + 1;
- mas.node = node->piv_parent;
- if (mas_mn(&mas) == node)
- goto start_slots_free;
-
- type = mte_node_type(mas.node);
- slots = ma_slots(mte_to_node(mas.node), type);
- if ((offset < mt_slots[type]) && (slots[offset]))
- slots = mas_dead_walk(&mas, offset);
-
- node = mas_mn(&mas);
+ enode = node->piv_parent;
+ if (mte_to_node(enode) == node)
+ goto free_leaf;
+
+ type = mte_node_type(enode);
+ slots = ma_slots(mte_to_node(enode), type);
+ if ((offset < mt_slots[type]) &&
+ rcu_dereference_protected(slots[offset],
+ lock_is_held(&rcu_callback_map)))
+ slots = mte_dead_walk(&enode, offset);
+ node = mte_to_node(enode);
} while ((node != start) || (node->slot_len < offset));

slots = ma_slots(node, node->type);
mt_free_bulk(node->slot_len, slots);

-start_slots_free:
- mas_unlock(&mas);
free_leaf:
mt_free_rcu(&node->rcu);
}

-static inline void __rcu **mas_destroy_descend(struct ma_state *mas,
- struct maple_enode *prev, unsigned char offset)
+static inline void __rcu **mte_destroy_descend(struct maple_enode **enode,
+ struct maple_tree *mt, struct maple_enode *prev, unsigned char offset)
{
struct maple_node *node;
- struct maple_enode *next = mas->node;
+ struct maple_enode *next = *enode;
void __rcu **slots = NULL;
+ enum maple_type type;
+ unsigned char next_offset = 0;

do {
- mas->node = next;
- node = mas_mn(mas);
- slots = ma_slots(node, mte_node_type(mas->node));
- next = mas_slot_locked(mas, slots, 0);
- if ((mte_dead_node(next))) {
- mte_to_node(next)->type = mte_node_type(next);
- next = mas_slot_locked(mas, slots, 1);
- }
+ *enode = next;
+ node = mte_to_node(*enode);
+ type = mte_node_type(*enode);
+ slots = ma_slots(node, type);
+ next = mt_slot_locked(mt, slots, next_offset);
+ if ((mte_dead_node(next)))
+ next = mt_slot_locked(mt, slots, ++next_offset);

- mte_set_node_dead(mas->node);
- node->type = mte_node_type(mas->node);
- mas_clear_meta(mas, node, node->type);
+ mte_set_node_dead(*enode);
+ node->type = type;
node->piv_parent = prev;
node->parent_slot = offset;
- offset = 0;
- prev = mas->node;
+ offset = next_offset;
+ next_offset = 0;
+ prev = *enode;
} while (!mte_is_leaf(next));

return slots;
}

-static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags,
+static void mt_destroy_walk(struct maple_enode *enode, struct maple_tree *mt,
bool free)
{
void __rcu **slots;
struct maple_node *node = mte_to_node(enode);
struct maple_enode *start;
- struct maple_tree mt;
-
- MA_STATE(mas, &mt, 0, 0);

- mas.node = enode;
if (mte_is_leaf(enode)) {
node->type = mte_node_type(enode);
goto free_leaf;
}

- ma_flags &= ~MT_FLAGS_LOCK_MASK;
- mt_init_flags(&mt, ma_flags);
- mas_lock(&mas);
-
- mte_to_node(enode)->ma_flags = ma_flags;
start = enode;
- slots = mas_destroy_descend(&mas, start, 0);
- node = mas_mn(&mas);
+ slots = mte_destroy_descend(&enode, mt, start, 0);
+ node = mte_to_node(enode); // Updated in the above call.
do {
enum maple_type type;
unsigned char offset;
struct maple_enode *parent, *tmp;

- node->type = mte_node_type(mas.node);
- node->slot_len = mas_dead_leaves(&mas, slots, node->type);
+ node->slot_len = mte_dead_leaves(enode, mt, slots);
if (free)
mt_free_bulk(node->slot_len, slots);
offset = node->parent_slot + 1;
- mas.node = node->piv_parent;
- if (mas_mn(&mas) == node)
- goto start_slots_free;
+ enode = node->piv_parent;
+ if (mte_to_node(enode) == node)
+ goto free_leaf;

- type = mte_node_type(mas.node);
- slots = ma_slots(mte_to_node(mas.node), type);
+ type = mte_node_type(enode);
+ slots = ma_slots(mte_to_node(enode), type);
if (offset >= mt_slots[type])
goto next;

- tmp = mas_slot_locked(&mas, slots, offset);
+ tmp = mt_slot_locked(mt, slots, offset);
if (mte_node_type(tmp) && mte_to_node(tmp)) {
- parent = mas.node;
- mas.node = tmp;
- slots = mas_destroy_descend(&mas, parent, offset);
+ parent = enode;
+ enode = tmp;
+ slots = mte_destroy_descend(&enode, mt, parent, offset);
}
next:
- node = mas_mn(&mas);
- } while (start != mas.node);
+ node = mte_to_node(enode);
+ } while (start != enode);

- node = mas_mn(&mas);
- node->type = mte_node_type(mas.node);
- node->slot_len = mas_dead_leaves(&mas, slots, node->type);
+ node = mte_to_node(enode);
+ node->slot_len = mte_dead_leaves(enode, mt, slots);
if (free)
mt_free_bulk(node->slot_len, slots);

-start_slots_free:
- mas_unlock(&mas);
-
free_leaf:
if (free)
mt_free_rcu(&node->rcu);
else
- mas_clear_meta(&mas, node, node->type);
+ mt_clear_meta(mt, node, node->type);
}

/*
@@ -5702,10 +5706,10 @@ static inline void mte_destroy_walk(struct maple_enode *enode,
struct maple_node *node = mte_to_node(enode);

if (mt_in_rcu(mt)) {
- mt_destroy_walk(enode, mt->ma_flags, false);
+ mt_destroy_walk(enode, mt, false);
call_rcu(&node->rcu, mt_free_walk);
} else {
- mt_destroy_walk(enode, mt->ma_flags, true);
+ mt_destroy_walk(enode, mt, true);
}
}

--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:37:21

by Suren Baghdasaryan

Subject: [PATCH v4 08/33] mm: Enable maple tree RCU mode by default.

From: "Liam R. Howlett" <[email protected]>

Use the maple tree in RCU mode for VMA tracking. This is necessary for
the use of per-VMA locking. RCU mode is enabled by default but disabled
when exiting an mm and for the new tree during a fork.

Also enable RCU for the tree used in munmap operations to ensure the
nodes remain valid for readers.

Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm_types.h | 3 ++-
kernel/fork.c | 3 +++
mm/mmap.c | 4 +++-
3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 417d25c6a262..22b2ac82bffd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -779,7 +779,8 @@ struct mm_struct {
unsigned long cpu_bitmap[];
};

-#define MM_MT_FLAGS (MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN)
+#define MM_MT_FLAGS (MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN | \
+ MT_FLAGS_USE_RCU)
extern struct mm_struct init_mm;

/* Pointer magic because the dynamic array size confuses some compilers. */
diff --git a/kernel/fork.c b/kernel/fork.c
index 0cbfdc4b509e..abfcf95734c7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -617,6 +617,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
if (retval)
goto out;

+ mt_clear_in_rcu(vmi.mas.tree);
for_each_vma(old_vmi, mpnt) {
struct file *file;

@@ -700,6 +701,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
retval = arch_dup_mmap(oldmm, mm);
loop_out:
vma_iter_free(&vmi);
+ if (!retval)
+ mt_set_in_rcu(vmi.mas.tree);
out:
mmap_write_unlock(mm);
flush_tlb_mm(oldmm);
diff --git a/mm/mmap.c b/mm/mmap.c
index 740b54be3ed4..c234443ee24c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2277,7 +2277,8 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
int count = 0;
int error = -ENOMEM;
MA_STATE(mas_detach, &mt_detach, 0, 0);
- mt_init_flags(&mt_detach, MT_FLAGS_LOCK_EXTERN);
+ mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags &
+ (MT_FLAGS_LOCK_MASK | MT_FLAGS_USE_RCU));
mt_set_external_lock(&mt_detach, &mm->mmap_lock);

/*
@@ -3042,6 +3043,7 @@ void exit_mmap(struct mm_struct *mm)
*/
set_bit(MMF_OOM_SKIP, &mm->flags);
mmap_write_lock(mm);
+ mt_clear_in_rcu(&mm->mm_mt);
free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
USER_PGTABLES_CEILING);
tlb_finish_mmu(&tlb);
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:37:33

by Suren Baghdasaryan

Subject: [PATCH v4 09/33] mm: introduce CONFIG_PER_VMA_LOCK

This configuration variable will be used to build the support for VMA
locking during page fault handling.

This is enabled on supported architectures with SMP and MMU set.

The architecture support is needed since the page fault handler is called
from the architecture's page faulting code, which needs modifications to
handle faults under VMA lock.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/Kconfig | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index ca98b2072df5..2e4a7e61768a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1211,6 +1211,18 @@ config LRU_GEN_STATS
This option has a per-memcg and per-node memory overhead.
# }

+config ARCH_SUPPORTS_PER_VMA_LOCK
+ def_bool n
+
+config PER_VMA_LOCK
+ def_bool y
+ depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
+ help
+ Allow per-vma locking during page fault handling.
+
+ This feature allows locking each virtual memory area separately when
+ handling page faults instead of taking mmap_lock.
+
source "mm/damon/Kconfig"

endmenu
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:37:36

by Suren Baghdasaryan

Subject: [PATCH v4 10/33] mm: rcu safe VMA freeing

From: Michel Lespinasse <[email protected]>

This prepares for page fault handling under VMA lock, looking up VMAs
under the protection of an rcu read lock instead of the usual mmap read
lock.

Signed-off-by: Michel Lespinasse <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm_types.h | 13 ++++++++++---
kernel/fork.c | 20 +++++++++++++++++++-
2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 22b2ac82bffd..64a6b3f6b74f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -480,9 +480,16 @@ struct anon_vma_name {
struct vm_area_struct {
/* The first cache line has the info for VMA tree walking. */

- unsigned long vm_start; /* Our start address within vm_mm. */
- unsigned long vm_end; /* The first byte after our end address
- within vm_mm. */
+ union {
+ struct {
+ /* VMA covers [vm_start; vm_end) addresses within mm */
+ unsigned long vm_start;
+ unsigned long vm_end;
+ };
+#ifdef CONFIG_PER_VMA_LOCK
+ struct rcu_head vm_rcu; /* Used for deferred freeing. */
+#endif
+ };

struct mm_struct *vm_mm; /* The address space we belong to. */

diff --git a/kernel/fork.c b/kernel/fork.c
index abfcf95734c7..a63b739aeca9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -479,12 +479,30 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
return new;
}

-void vm_area_free(struct vm_area_struct *vma)
+static void __vm_area_free(struct vm_area_struct *vma)
{
free_anon_vma_name(vma);
kmem_cache_free(vm_area_cachep, vma);
}

+#ifdef CONFIG_PER_VMA_LOCK
+static void vm_area_free_rcu_cb(struct rcu_head *head)
+{
+ struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
+ vm_rcu);
+ __vm_area_free(vma);
+}
+#endif
+
+void vm_area_free(struct vm_area_struct *vma)
+{
+#ifdef CONFIG_PER_VMA_LOCK
+ call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
+#else
+ __vm_area_free(vma);
+#endif
+}
+
static void account_kernel_stack(struct task_struct *tsk, int account)
{
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:37:47

by Suren Baghdasaryan

Subject: [PATCH v4 11/33] mm: move mmap_lock assert function definitions

Move mmap_lock assert function definitions up so that they can be used
by other mmap_lock routines.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mmap_lock.h | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 96e113e23d04..e49ba91bb1f0 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -60,6 +60,18 @@ static inline void __mmap_lock_trace_released(struct mm_struct *mm, bool write)

#endif /* CONFIG_TRACING */

+static inline void mmap_assert_locked(struct mm_struct *mm)
+{
+ lockdep_assert_held(&mm->mmap_lock);
+ VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
+}
+
+static inline void mmap_assert_write_locked(struct mm_struct *mm)
+{
+ lockdep_assert_held_write(&mm->mmap_lock);
+ VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
+}
+
static inline void mmap_init_lock(struct mm_struct *mm)
{
init_rwsem(&mm->mmap_lock);
@@ -150,18 +162,6 @@ static inline void mmap_read_unlock_non_owner(struct mm_struct *mm)
up_read_non_owner(&mm->mmap_lock);
}

-static inline void mmap_assert_locked(struct mm_struct *mm)
-{
- lockdep_assert_held(&mm->mmap_lock);
- VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
-}
-
-static inline void mmap_assert_write_locked(struct mm_struct *mm)
-{
- lockdep_assert_held_write(&mm->mmap_lock);
- VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
-}
-
static inline int mmap_lock_is_contended(struct mm_struct *mm)
{
return rwsem_is_contended(&mm->mmap_lock);
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:37:52

by Suren Baghdasaryan

Subject: [PATCH v4 12/33] mm: add per-VMA lock and helper functions to control it

Introduce per-VMA locking. The lock implementation relies on per-vma
and per-mm sequence counters to note exclusive locking:
- read lock - (implemented by vma_start_read) requires the vma
(vm_lock_seq) and mm (mm_lock_seq) sequence counters to differ.
If they match then there must be a vma exclusive lock held somewhere.
- read unlock - (implemented by vma_end_read) is a trivial vma->lock
unlock.
- write lock - (vma_start_write) requires the mmap_lock to be held
exclusively and the current mm counter is assigned to the vma counter.
This will allow multiple vmas to be locked under a single mmap_lock
write lock (e.g. during vma merging). The vma counter is modified
under exclusive vma lock.
- write unlock - (vma_end_write_all) is a batch release of all vma
locks held. It doesn't pair with a specific vma_start_write! It is
done before exclusive mmap_lock is released by incrementing mm
sequence counter (mm_lock_seq).
- write downgrade - if the mmap_lock is downgraded to the read lock, all
vma write locks are released as well (effectively the same as write
unlock).
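
For context, a reader such as the page fault path is expected to use
these primitives roughly as follows (a simplified sketch of the
lock_vma_under_rcu helper introduced later in the series):

	rcu_read_lock();
	vma = mas_walk(&mas);			/* maple tree VMA lookup */
	if (vma && vma_start_read(vma)) {
		rcu_read_unlock();
		/* ... handle the fault under the per-VMA read lock ... */
		vma_end_read(vma);
	} else {
		rcu_read_unlock();
		/* fall back to taking mmap_lock */
	}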

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 82 +++++++++++++++++++++++++++++++++++++++
include/linux/mm_types.h | 8 ++++
include/linux/mmap_lock.h | 13 +++++++
kernel/fork.c | 4 ++
mm/init-mm.c | 3 ++
5 files changed, 110 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1f79667824eb..bbad5d4fa81b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -623,6 +623,87 @@ struct vm_operations_struct {
unsigned long addr);
};

+#ifdef CONFIG_PER_VMA_LOCK
+static inline void vma_init_lock(struct vm_area_struct *vma)
+{
+ init_rwsem(&vma->lock);
+ vma->vm_lock_seq = -1;
+}
+
+/*
+ * Try to read-lock a vma. The function is allowed to occasionally yield false
+ * locked result to avoid performance overhead, in which case we fall back to
+ * using mmap_lock. The function should never yield false unlocked result.
+ */
+static inline bool vma_start_read(struct vm_area_struct *vma)
+{
+ /* Check before locking. A race might cause false locked result. */
+ if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
+ return false;
+
+ if (unlikely(down_read_trylock(&vma->lock) == 0))
+ return false;
+
+ /*
+ * Overflow might produce false locked result.
+ * False unlocked result is impossible because we modify and check
+ * vma->vm_lock_seq under vma->lock protection and mm->mm_lock_seq
+ * modification invalidates all existing locks.
+ */
+ if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
+ up_read(&vma->lock);
+ return false;
+ }
+ return true;
+}
+
+static inline void vma_end_read(struct vm_area_struct *vma)
+{
+ rcu_read_lock(); /* keeps vma alive till the end of up_read */
+ up_read(&vma->lock);
+ rcu_read_unlock();
+}
+
+static inline void vma_start_write(struct vm_area_struct *vma)
+{
+ int mm_lock_seq;
+
+ mmap_assert_write_locked(vma->vm_mm);
+
+ /*
+ * current task is holding mmap_write_lock, both vma->vm_lock_seq and
+ * mm->mm_lock_seq can't be concurrently modified.
+ */
+ mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
+ if (vma->vm_lock_seq == mm_lock_seq)
+ return;
+
+ down_write(&vma->lock);
+ vma->vm_lock_seq = mm_lock_seq;
+ up_write(&vma->lock);
+}
+
+static inline void vma_assert_write_locked(struct vm_area_struct *vma)
+{
+ mmap_assert_write_locked(vma->vm_mm);
+ /*
+ * current task is holding mmap_write_lock, both vma->vm_lock_seq and
+ * mm->mm_lock_seq can't be concurrently modified.
+ */
+ VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+static inline void vma_init_lock(struct vm_area_struct *vma) {}
+static inline bool vma_start_read(struct vm_area_struct *vma)
+ { return false; }
+static inline void vma_end_read(struct vm_area_struct *vma) {}
+static inline void vma_start_write(struct vm_area_struct *vma) {}
+static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
+
+#endif /* CONFIG_PER_VMA_LOCK */
+
static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
{
static const struct vm_operations_struct dummy_vm_ops = {};
@@ -631,6 +712,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
+ vma_init_lock(vma);
}

/* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 64a6b3f6b74f..a4e7493bacd7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -508,6 +508,11 @@ struct vm_area_struct {
vm_flags_t __private __vm_flags;
};

+#ifdef CONFIG_PER_VMA_LOCK
+ int vm_lock_seq;
+ struct rw_semaphore lock;
+#endif
+
/*
* For areas with an address space and backing store,
* linkage into the address_space->i_mmap interval tree.
@@ -644,6 +649,9 @@ struct mm_struct {
* init_mm.mmlist, and are protected
* by mmlist_lock
*/
+#ifdef CONFIG_PER_VMA_LOCK
+ int mm_lock_seq;
+#endif


unsigned long hiwater_rss; /* High-watermark of RSS usage */
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index e49ba91bb1f0..aab8f1b28d26 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -72,6 +72,17 @@ static inline void mmap_assert_write_locked(struct mm_struct *mm)
VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
}

+#ifdef CONFIG_PER_VMA_LOCK
+static inline void vma_end_write_all(struct mm_struct *mm)
+{
+ mmap_assert_write_locked(mm);
+ /* No races during update due to exclusive mmap_lock being held */
+ WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
+}
+#else
+static inline void vma_end_write_all(struct mm_struct *mm) {}
+#endif
+
static inline void mmap_init_lock(struct mm_struct *mm)
{
init_rwsem(&mm->mmap_lock);
@@ -114,12 +125,14 @@ static inline bool mmap_write_trylock(struct mm_struct *mm)
static inline void mmap_write_unlock(struct mm_struct *mm)
{
__mmap_lock_trace_released(mm, true);
+ vma_end_write_all(mm);
up_write(&mm->mmap_lock);
}

static inline void mmap_write_downgrade(struct mm_struct *mm)
{
__mmap_lock_trace_acquire_returned(mm, false, true);
+ vma_end_write_all(mm);
downgrade_write(&mm->mmap_lock);
}

diff --git a/kernel/fork.c b/kernel/fork.c
index a63b739aeca9..e1dd79c7738c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
*/
data_race(memcpy(new, orig, sizeof(*new)));
INIT_LIST_HEAD(&new->anon_vma_chain);
+ vma_init_lock(new);
dup_anon_vma_name(orig, new);
}
return new;
@@ -1216,6 +1217,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
seqcount_init(&mm->write_protect_seq);
mmap_init_lock(mm);
INIT_LIST_HEAD(&mm->mmlist);
+#ifdef CONFIG_PER_VMA_LOCK
+ mm->mm_lock_seq = 0;
+#endif
mm_pgtables_bytes_init(mm);
mm->map_count = 0;
mm->locked_vm = 0;
diff --git a/mm/init-mm.c b/mm/init-mm.c
index c9327abb771c..33269314e060 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -37,6 +37,9 @@ struct mm_struct init_mm = {
.page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
.arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
+#ifdef CONFIG_PER_VMA_LOCK
+ .mm_lock_seq = 0,
+#endif
.user_ns = &init_user_ns,
.cpu_bitmap = CPU_BITS_NONE,
#ifdef CONFIG_IOMMU_SVA
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:38:02

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v4 13/33] mm: mark VMA as being written when changing vm_flags

Updates to vm_flags have to be done with the VMA marked as being
written, to prevent concurrent page faults or other modifications.
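
As an illustration, a minimal caller sketch (the flag helper and the
lock calls are the real interfaces; the function itself is made up for
the example):

	static void example_set_softdirty(struct mm_struct *mm,
					  struct vm_area_struct *vma)
	{
		mmap_write_lock(mm);
		/* vm_flags_set() now write-locks the VMA internally */
		vm_flags_set(vma, VM_SOFTDIRTY);
		/* the per-VMA locks taken above are dropped in bulk by
		 * vma_end_write_all() inside mmap_write_unlock() */
		mmap_write_unlock(mm);
	}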

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bbad5d4fa81b..3d5e8666892d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -726,28 +726,28 @@ static inline void vm_flags_init(struct vm_area_struct *vma,
static inline void vm_flags_reset(struct vm_area_struct *vma,
vm_flags_t flags)
{
- mmap_assert_write_locked(vma->vm_mm);
+ vma_start_write(vma);
vm_flags_init(vma, flags);
}

static inline void vm_flags_reset_once(struct vm_area_struct *vma,
vm_flags_t flags)
{
- mmap_assert_write_locked(vma->vm_mm);
+ vma_start_write(vma);
WRITE_ONCE(ACCESS_PRIVATE(vma, __vm_flags), flags);
}

static inline void vm_flags_set(struct vm_area_struct *vma,
vm_flags_t flags)
{
- mmap_assert_write_locked(vma->vm_mm);
+ vma_start_write(vma);
ACCESS_PRIVATE(vma, __vm_flags) |= flags;
}

static inline void vm_flags_clear(struct vm_area_struct *vma,
vm_flags_t flags)
{
- mmap_assert_write_locked(vma->vm_mm);
+ vma_start_write(vma);
ACCESS_PRIVATE(vma, __vm_flags) &= ~flags;
}

@@ -768,7 +768,7 @@ static inline void __vm_flags_mod(struct vm_area_struct *vma,
static inline void vm_flags_mod(struct vm_area_struct *vma,
vm_flags_t set, vm_flags_t clear)
{
- mmap_assert_write_locked(vma->vm_mm);
+ vma_start_write(vma);
__vm_flags_mod(vma, set, clear);
}

--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:38:06

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v4 14/33] mm/mmap: move vma_prepare before vma_adjust_trans_huge

vma_prepare() acquires all locks required before VMA modifications.
Move vma_prepare() before vma_adjust_trans_huge() so that the VMA is
write-locked before any modification.
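
A condensed sketch of the resulting sequence in the mm/mmap.c helpers
(the function names are real, everything between them is elided):

	init_vma_prep(&vp, vma);
	vma_prepare(&vp);		/* VMA(s) are write-locked first */
	vma_adjust_trans_huge(vma, start, end, 0);
					/* THP PMD manipulation now happens
					 * with the VMA already locked */
	/* ... update vm_start/vm_end/vm_pgoff and the maple tree ... */
	vma_complete(&vp, vmi, vma->vm_mm);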

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/mmap.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index c234443ee24c..92893d86c0af 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -683,12 +683,12 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (vma_iter_prealloc(vmi))
goto nomem;

+ vma_prepare(&vp);
vma_adjust_trans_huge(vma, start, end, 0);
/* VMA iterator points to previous, so set to start if necessary */
if (vma_iter_addr(vmi) != start)
vma_iter_set(vmi, start);

- vma_prepare(&vp);
vma->vm_start = start;
vma->vm_end = end;
vma->vm_pgoff = pgoff;
@@ -723,8 +723,8 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
return -ENOMEM;

init_vma_prep(&vp, vma);
- vma_adjust_trans_huge(vma, start, end, 0);
vma_prepare(&vp);
+ vma_adjust_trans_huge(vma, start, end, 0);

if (vma->vm_start < start)
vma_iter_clear(vmi, vma->vm_start, start);
@@ -994,12 +994,12 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
if (vma_iter_prealloc(vmi))
return NULL;

- vma_adjust_trans_huge(vma, vma_start, vma_end, adj_next);
init_multi_vma_prep(&vp, vma, adjust, remove, remove2);
VM_WARN_ON(vp.anon_vma && adjust && adjust->anon_vma &&
vp.anon_vma != adjust->anon_vma);

vma_prepare(&vp);
+ vma_adjust_trans_huge(vma, vma_start, vma_end, adj_next);
if (vma_start < vma->vm_start || vma_end > vma->vm_end)
vma_expanded = true;

@@ -2198,10 +2198,10 @@ int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (new->vm_ops && new->vm_ops->open)
new->vm_ops->open(new);

- vma_adjust_trans_huge(vma, vma->vm_start, addr, 0);
init_vma_prep(&vp, vma);
vp.insert = new;
vma_prepare(&vp);
+ vma_adjust_trans_huge(vma, vma->vm_start, addr, 0);

if (new_below) {
vma->vm_start = addr;
@@ -2910,9 +2910,9 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (vma_iter_prealloc(vmi))
goto unacct_fail;

- vma_adjust_trans_huge(vma, vma->vm_start, addr + len, 0);
init_vma_prep(&vp, vma);
vma_prepare(&vp);
+ vma_adjust_trans_huge(vma, vma->vm_start, addr + len, 0);
vma->vm_end = addr + len;
vm_flags_set(vma, VM_SOFTDIRTY);
vma_iter_store(vmi, vma);
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:38:09

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v4 15/33] mm/khugepaged: write-lock VMA while collapsing a huge page

Protect the VMA from a concurrent page fault handler while collapsing a
huge page. The page fault handler needs a stable PMD to use the PTL and
relies on the per-VMA lock to prevent concurrent PMD changes.
pmdp_collapse_flush(), set_huge_pmd() and collapse_and_free_pmd() can
modify a PMD, which would go undetected by a page fault handler without
proper locking.

Before this patch, page tables can be walked under any one of the
mmap_lock, the mapping lock, and the anon_vma lock; so when khugepaged
unlinks and frees page tables, it must ensure that all of those either
are locked or don't exist. This patch adds a fourth lock under which
page tables can be traversed, and so khugepaged must also lock out that
one.
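
A condensed ordering sketch of the collapse path (matching the
collapse_huge_page() hunk below; intermediate steps elided):

	mmap_write_lock(mm);			/* pre-existing */
	vma_start_write(vma);			/* new: the fourth lock */
	anon_vma_lock_write(vma->anon_vma);
	/* ... pmdp_collapse_flush(), page table teardown under PTL ... */
	anon_vma_unlock_write(vma->anon_vma);
	mmap_write_unlock(mm);			/* also unlocks all VMAs */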

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/khugepaged.c | 5 +++++
mm/rmap.c | 31 ++++++++++++++++---------------
2 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 941d1c7ea910..c64e01f03f27 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1147,6 +1147,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
if (result != SCAN_SUCCEED)
goto out_up_write;

+ vma_start_write(vma);
anon_vma_lock_write(vma->anon_vma);

mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
@@ -1614,6 +1615,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
goto drop_hpage;
}

+ /* Lock the vma before taking i_mmap and page table locks */
+ vma_start_write(vma);
+
/*
* We need to lock the mapping so that from here on, only GUP-fast and
* hardware page walks can access the parts of the page tables that
@@ -1819,6 +1823,7 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
result = SCAN_PTE_UFFD_WP;
goto unlock_next;
}
+ vma_start_write(vma);
collapse_and_free_pmd(mm, vma, addr, pmd);
if (!cc->is_khugepaged && is_target)
result = set_huge_pmd(vma, addr, pmd, hpage);
diff --git a/mm/rmap.c b/mm/rmap.c
index 8632e02661ac..cfdaa56cad3e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -25,21 +25,22 @@
* mapping->invalidate_lock (in filemap_fault)
* page->flags PG_locked (lock_page)
* hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
- * mapping->i_mmap_rwsem
- * anon_vma->rwsem
- * mm->page_table_lock or pte_lock
- * swap_lock (in swap_duplicate, swap_info_get)
- * mmlist_lock (in mmput, drain_mmlist and others)
- * mapping->private_lock (in block_dirty_folio)
- * folio_lock_memcg move_lock (in block_dirty_folio)
- * i_pages lock (widely used)
- * lruvec->lru_lock (in folio_lruvec_lock_irq)
- * inode->i_lock (in set_page_dirty's __mark_inode_dirty)
- * bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
- * sb_lock (within inode_lock in fs/fs-writeback.c)
- * i_pages lock (widely used, in set_page_dirty,
- * in arch-dependent flush_dcache_mmap_lock,
- * within bdi.wb->list_lock in __sync_single_inode)
+ * vma_start_write
+ * mapping->i_mmap_rwsem
+ * anon_vma->rwsem
+ * mm->page_table_lock or pte_lock
+ * swap_lock (in swap_duplicate, swap_info_get)
+ * mmlist_lock (in mmput, drain_mmlist and others)
+ * mapping->private_lock (in block_dirty_folio)
+ * folio_lock_memcg move_lock (in block_dirty_folio)
+ * i_pages lock (widely used)
+ * lruvec->lru_lock (in folio_lruvec_lock_irq)
+ * inode->i_lock (in set_page_dirty's __mark_inode_dirty)
+ * bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
+ * sb_lock (within inode_lock in fs/fs-writeback.c)
+ * i_pages lock (widely used, in set_page_dirty,
+ * in arch-dependent flush_dcache_mmap_lock,
+ * within bdi.wb->list_lock in __sync_single_inode)
*
* anon_vma->rwsem,mapping->i_mmap_rwsem (memory_failure, collect_procs_anon)
* ->tasklist_lock
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:38:22

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v4 16/33] mm/mmap: write-lock VMAs in vma_prepare before modifying them

Write-lock all VMAs which might be affected by a merge, split, expand
or shrink operation. All these operations use vma_prepare() before
making the modifications, therefore it provides a centralized place to
perform VMA locking.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/mmap.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index 92893d86c0af..e73fbb84ce12 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -502,6 +502,16 @@ static inline void init_vma_prep(struct vma_prepare *vp,
*/
static inline void vma_prepare(struct vma_prepare *vp)
{
+ if (vp->vma)
+ vma_start_write(vp->vma);
+ if (vp->adj_next)
+ vma_start_write(vp->adj_next);
+ /* vp->insert is always a newly created VMA, no need for locking */
+ if (vp->remove)
+ vma_start_write(vp->remove);
+ if (vp->remove2)
+ vma_start_write(vp->remove2);
+
if (vp->file) {
uprobe_munmap(vp->vma, vp->vma->vm_start, vp->vma->vm_end);

--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:38:26

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v4 17/33] mm/mremap: write-lock VMA while remapping it to a new address range

Write-lock the VMA before copying it, and write-lock the new VMA that
copy_vma() produces.

Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Laurent Dufour <[email protected]>
---
mm/mmap.c | 1 +
mm/mremap.c | 1 +
2 files changed, 2 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index e73fbb84ce12..1f42b9a52b9b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3189,6 +3189,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
get_file(new_vma->vm_file);
if (new_vma->vm_ops && new_vma->vm_ops->open)
new_vma->vm_ops->open(new_vma);
+ vma_start_write(new_vma);
if (vma_link(mm, new_vma))
goto out_vma_link;
*need_rmap_locks = false;
diff --git a/mm/mremap.c b/mm/mremap.c
index 1ddf7beb62e9..327c38eb132e 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -623,6 +623,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
return -ENOMEM;
}

+ vma_start_write(vma);
new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
&need_rmap_locks);
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:38:29

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v4 18/33] mm: write-lock VMAs before removing them from VMA tree

Write-locking VMAs before isolating them ensures that page fault
handlers don't operate on isolated VMAs.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/mmap.c | 1 +
mm/nommu.c | 5 +++++
2 files changed, 6 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index 1f42b9a52b9b..f7ed357056c4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2255,6 +2255,7 @@ int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
static inline int munmap_sidetree(struct vm_area_struct *vma,
struct ma_state *mas_detach)
{
+ vma_start_write(vma);
mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);
if (mas_store_gfp(mas_detach, vma, GFP_KERNEL))
return -ENOMEM;
diff --git a/mm/nommu.c b/mm/nommu.c
index 57ba243c6a37..2ab162d773e2 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -588,6 +588,7 @@ static int delete_vma_from_mm(struct vm_area_struct *vma)
current->pid);
return -ENOMEM;
}
+ vma_start_write(vma);
cleanup_vma_from_mm(vma);

/* remove from the MM's tree and list */
@@ -1519,6 +1520,10 @@ void exit_mmap(struct mm_struct *mm)
*/
mmap_write_lock(mm);
for_each_vma(vmi, vma) {
+ /*
+ * No need to lock VMA because this is the only mm user and no
+	 * page fault handler can race with it.
+ */
cleanup_vma_from_mm(vma);
delete_vma(mm, vma);
cond_resched();
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:38:32

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v4 19/33] mm: conditionally write-lock VMA in free_pgtables

Normally free_pgtables needs to lock affected VMAs except for the case
when VMAs were isolated under VMA write-lock. munmap() does just that,
isolating VMAs while holding the appropriate locks and then downgrading
mmap_lock and dropping per-VMA locks before freeing page tables.
Add a parameter to free_pgtables to handle this scenario.
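
A sketch of the munmap sequence that makes mm_wr_locked=false safe
(a hypothetical condensation of the munmap path, not actual code):

	mmap_write_lock(mm);
	vma_start_write(vma);		/* isolate under per-VMA write-lock */
	/* ... remove the VMA from the maple tree ... */
	mmap_write_downgrade(mm);	/* vma_end_write_all() drops VMA locks */
	/* free_pgtables() then runs with mm_wr_locked == false and skips
	 * vma_start_write(): the VMAs can no longer be looked up anyway */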

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/internal.h | 2 +-
mm/memory.c | 6 +++++-
mm/mmap.c | 5 +++--
3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 08ce56dbb1d9..fce94775819c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -105,7 +105,7 @@ void folio_activate(struct folio *folio);

void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
struct vm_area_struct *start_vma, unsigned long floor,
- unsigned long ceiling);
+ unsigned long ceiling, bool mm_wr_locked);
void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);

struct zap_details;
diff --git a/mm/memory.c b/mm/memory.c
index bfa3100ec5a3..f7f412833e42 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -348,7 +348,7 @@ void free_pgd_range(struct mmu_gather *tlb,

void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
struct vm_area_struct *vma, unsigned long floor,
- unsigned long ceiling)
+ unsigned long ceiling, bool mm_wr_locked)
{
MA_STATE(mas, mt, vma->vm_end, vma->vm_end);

@@ -366,6 +366,8 @@ void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
* Hide vma from rmap and truncate_pagecache before freeing
* pgtables
*/
+ if (mm_wr_locked)
+ vma_start_write(vma);
unlink_anon_vmas(vma);
unlink_file_vma(vma);

@@ -380,6 +382,8 @@ void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
&& !is_vm_hugetlb_page(next)) {
vma = next;
next = mas_find(&mas, ceiling - 1);
+ if (mm_wr_locked)
+ vma_start_write(vma);
unlink_anon_vmas(vma);
unlink_file_vma(vma);
}
diff --git a/mm/mmap.c b/mm/mmap.c
index f7ed357056c4..ec745586785c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2152,7 +2152,8 @@ static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
update_hiwater_rss(mm);
unmap_vmas(&tlb, mt, vma, start, end, mm_wr_locked);
free_pgtables(&tlb, mt, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
- next ? next->vm_start : USER_PGTABLES_CEILING);
+ next ? next->vm_start : USER_PGTABLES_CEILING,
+ mm_wr_locked);
tlb_finish_mmu(&tlb);
}

@@ -3056,7 +3057,7 @@ void exit_mmap(struct mm_struct *mm)
mmap_write_lock(mm);
mt_clear_in_rcu(&mm->mm_mt);
free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
- USER_PGTABLES_CEILING);
+ USER_PGTABLES_CEILING, true);
tlb_finish_mmu(&tlb);

/*
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:38:35

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v4 20/33] kernel/fork: assert no VMA readers during its destruction

Assert there are no holders of VMA lock for reading when it is about to be
destroyed.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
kernel/fork.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index e1dd79c7738c..bdb55f25895d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -491,6 +491,9 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
{
struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
vm_rcu);
+
+ /* The vma should not be locked while being destroyed. */
+ VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock), vma);
__vm_area_free(vma);
}
#endif
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:38:39

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v4 21/33] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration

Page fault handlers might need to fire MMU notifications while a new
notifier is being registered. Modify mm_take_all_locks to write-lock all
VMAs and prevent this race with page fault handlers that would hold VMA
locks. VMAs are locked before i_mmap_rwsem and anon_vma to keep the same
locking order as in page fault handlers.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/mmap.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index ec745586785c..b947d82e8522 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3486,6 +3486,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
* of mm/rmap.c:
* - all hugetlbfs_i_mmap_rwsem_key locks (aka mapping->i_mmap_rwsem for
* hugetlb mapping);
+ * - all vmas marked locked
* - all i_mmap_rwsem locks;
* - all anon_vma->rwseml
*
@@ -3508,6 +3509,13 @@ int mm_take_all_locks(struct mm_struct *mm)

mutex_lock(&mm_all_locks_mutex);

+ mas_for_each(&mas, vma, ULONG_MAX) {
+ if (signal_pending(current))
+ goto out_unlock;
+ vma_start_write(vma);
+ }
+
+ mas_set(&mas, 0);
mas_for_each(&mas, vma, ULONG_MAX) {
if (signal_pending(current))
goto out_unlock;
@@ -3597,6 +3605,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
if (vma->vm_file && vma->vm_file->f_mapping)
vm_unlock_mapping(vma->vm_file->f_mapping);
}
+ vma_end_write_all(mm);

mutex_unlock(&mm_all_locks_mutex);
}
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:38:46

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v4 23/33] mm: introduce lock_vma_under_rcu to be used from arch-specific code

Introduce the lock_vma_under_rcu function to look up and lock a VMA
during page fault handling. When the VMA is not found, can't be locked,
or changes after being locked, the function returns NULL. The lookup is
performed under RCU protection to prevent the found VMA from being
destroyed before the VMA lock is acquired. VMA lock statistics are
updated according to the results.
For now only anonymous VMAs can be searched this way. In other cases the
function returns NULL.
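
The expected arch-side usage pattern, sketched (the x86, arm64 and
powerpc patches later in this series follow this shape;
FAULT_FLAG_VMA_LOCK is introduced in a later patch):

	vma = lock_vma_under_rcu(mm, address);
	if (!vma)
		goto lock_mmap;		/* fall back to the mmap_lock path */
	fault = handle_mm_fault(vma, address,
				flags | FAULT_FLAG_VMA_LOCK, regs);
	vma_end_read(vma);
	if (!(fault & VM_FAULT_RETRY))
		goto done;		/* handled without taking mmap_lock */
	/* otherwise retry under mmap_lock as before */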

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 3 +++
mm/memory.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 49 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 895bb3950e8a..46d2db743b1a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -701,6 +701,9 @@ static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
vma->detached = detached;
}

+struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
+ unsigned long address);
+
#else /* CONFIG_PER_VMA_LOCK */

static inline void vma_init_lock(struct vm_area_struct *vma) {}
diff --git a/mm/memory.c b/mm/memory.c
index f7f412833e42..bda4c1a991f0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5221,6 +5221,52 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
}
EXPORT_SYMBOL_GPL(handle_mm_fault);

+#ifdef CONFIG_PER_VMA_LOCK
+/*
+ * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
+ * stable and not isolated. If the VMA is not found or is being modified the
+ * function returns NULL.
+ */
+struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
+ unsigned long address)
+{
+ MA_STATE(mas, &mm->mm_mt, address, address);
+ struct vm_area_struct *vma;
+
+ rcu_read_lock();
+retry:
+ vma = mas_walk(&mas);
+ if (!vma)
+ goto inval;
+
+ /* Only anonymous vmas are supported for now */
+ if (!vma_is_anonymous(vma))
+ goto inval;
+
+ if (!vma_start_read(vma))
+ goto inval;
+
+ /* Check since vm_start/vm_end might change before we lock the VMA */
+ if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
+ vma_end_read(vma);
+ goto inval;
+ }
+
+ /* Check if the VMA got isolated after we found it */
+ if (vma->detached) {
+ vma_end_read(vma);
+ /* The area was replaced with another one */
+ goto retry;
+ }
+
+ rcu_read_unlock();
+ return vma;
+inval:
+ rcu_read_unlock();
+ return NULL;
+}
+#endif /* CONFIG_PER_VMA_LOCK */
+
#ifndef __PAGETABLE_P4D_FOLDED
/*
* Allocate p4d page table.
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:38:49

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v4 22/33] mm: introduce vma detached flag

The per-VMA locking mechanism searches for a VMA under RCU protection
and, after locking it, has to ensure the VMA was not removed from the
VMA tree in the meantime. To make this check efficient, introduce a
vma->detached flag to mark VMAs which were removed from the VMA tree.
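
A sketch of the race the flag closes (the reader side is
lock_vma_under_rcu() from patch 23):

	page fault:  vma = mas_walk(&mas)
	munmap:      vma_start_write(vma)
	munmap:      vma_mark_detached(vma, true)
	munmap:      /* VMA is removed from the tree */
	munmap:      vma_end_write_all()	/* in mmap_write_unlock() */
	page fault:  vma_start_read(vma)	/* trylock succeeds */
	page fault:  if (vma->detached)		/* stale VMA is caught */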

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 11 +++++++++++
include/linux/mm_types.h | 3 +++
mm/mmap.c | 2 ++
3 files changed, 16 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3d5e8666892d..895bb3950e8a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -693,6 +693,14 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma);
}

+static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
+{
+ /* When detaching vma should be write-locked */
+ if (detached)
+ vma_assert_write_locked(vma);
+ vma->detached = detached;
+}
+
#else /* CONFIG_PER_VMA_LOCK */

static inline void vma_init_lock(struct vm_area_struct *vma) {}
@@ -701,6 +709,8 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
static inline void vma_end_read(struct vm_area_struct *vma) {}
static inline void vma_start_write(struct vm_area_struct *vma) {}
static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
+static inline void vma_mark_detached(struct vm_area_struct *vma,
+ bool detached) {}

#endif /* CONFIG_PER_VMA_LOCK */

@@ -712,6 +722,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
+ vma_mark_detached(vma, false);
vma_init_lock(vma);
}

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a4e7493bacd7..45a219d33c6b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -511,6 +511,9 @@ struct vm_area_struct {
#ifdef CONFIG_PER_VMA_LOCK
int vm_lock_seq;
struct rw_semaphore lock;
+
+ /* Flag to indicate areas detached from the mm->mm_mt tree */
+ bool detached;
#endif

/*
diff --git a/mm/mmap.c b/mm/mmap.c
index b947d82e8522..df13c33498db 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -600,6 +600,7 @@ static inline void vma_complete(struct vma_prepare *vp,

if (vp->remove) {
again:
+ vma_mark_detached(vp->remove, true);
if (vp->file) {
uprobe_munmap(vp->remove, vp->remove->vm_start,
vp->remove->vm_end);
@@ -2261,6 +2262,7 @@ static inline int munmap_sidetree(struct vm_area_struct *vma,
if (mas_store_gfp(mas_detach, vma, GFP_KERNEL))
return -ENOMEM;

+ vma_mark_detached(vma, true);
if (vma->vm_flags & VM_LOCKED)
vma->vm_mm->locked_vm -= vma_pages(vma);

--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:38:51

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v4 24/33] mm: fall back to mmap_lock if vma->anon_vma is not yet set

When vma->anon_vma is not set, the page fault handler will set it by
either reusing the anon_vma of an adjacent VMA if the VMAs are
compatible or by allocating a new one. find_mergeable_anon_vma() walks
the VMA tree to find a compatible adjacent VMA, and that requires not
only the faulting VMA to be stable but also the tree structure and
other VMAs inside that tree. Therefore locking just the faulting VMA is
not enough for this search. Fall back to taking mmap_lock when
vma->anon_vma is not set. This situation happens only on the first page
fault and should not affect overall performance.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/memory.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index bda4c1a991f0..8855846a361b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5243,6 +5243,10 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
if (!vma_is_anonymous(vma))
goto inval;

+ /* find_mergeable_anon_vma uses adjacent vmas which are not locked */
+ if (!vma->anon_vma)
+ goto inval;
+
if (!vma_start_read(vma))
goto inval;

--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:38:54

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v4 25/33] mm: add FAULT_FLAG_VMA_LOCK flag

Add a new flag to distinguish page faults handled under the protection
of a per-VMA lock.

Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Laurent Dufour <[email protected]>
---
include/linux/mm.h | 3 ++-
include/linux/mm_types.h | 1 +
2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46d2db743b1a..d07ac923333f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -478,7 +478,8 @@ static inline bool fault_flag_allow_retry_first(enum fault_flag flags)
{ FAULT_FLAG_USER, "USER" }, \
{ FAULT_FLAG_REMOTE, "REMOTE" }, \
{ FAULT_FLAG_INSTRUCTION, "INSTRUCTION" }, \
- { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }
+ { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }, \
+ { FAULT_FLAG_VMA_LOCK, "VMA_LOCK" }

/*
* vm_fault is filled by the pagefault handler and passed to the vma's
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 45a219d33c6b..6768533a6b7c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1097,6 +1097,7 @@ enum fault_flag {
FAULT_FLAG_INTERRUPTIBLE = 1 << 9,
FAULT_FLAG_UNSHARE = 1 << 10,
FAULT_FLAG_ORIG_PTE_VALID = 1 << 11,
+ FAULT_FLAG_VMA_LOCK = 1 << 12,
};

typedef unsigned int __bitwise zap_flags_t;
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:38:58

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v4 26/33] mm: prevent do_swap_page from handling page faults under VMA lock

Due to the possibility of do_swap_page dropping mmap_lock, abort fault
handling under VMA lock and retry holding mmap_lock. This can be handled
more gracefully in the future.

Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Laurent Dufour <[email protected]>
---
mm/memory.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 8855846a361b..af3c2c59cd11 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3689,6 +3689,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (!pte_unmap_same(vmf))
goto out;

+ if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+ ret = VM_FAULT_RETRY;
+ goto out;
+ }
+
entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry))) {
if (is_migration_entry(entry)) {
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:39:01

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v4 27/33] mm: prevent userfaults from being handled under per-vma lock

Due to the possibility of handle_userfault dropping mmap_lock, avoid fault
handling under VMA lock and retry holding mmap_lock. This can be handled
more gracefully in the future.

Signed-off-by: Suren Baghdasaryan <[email protected]>
Suggested-by: Peter Xu <[email protected]>
---
mm/memory.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index af3c2c59cd11..f734f80d28ca 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5255,6 +5255,15 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
if (!vma_start_read(vma))
goto inval;

+ /*
+ * Due to the possibility of userfault handler dropping mmap_lock, avoid
+ * it for now and fall back to page fault handling under mmap_lock.
+ */
+ if (userfaultfd_armed(vma)) {
+ vma_end_read(vma);
+ goto inval;
+ }
+
/* Check since vm_start/vm_end might change before we lock the VMA */
if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
vma_end_read(vma);
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:39:10

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v4 29/33] x86/mm: try VMA lock-based page fault handling first

Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/mm/fault.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 37 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a825bf031f49..df21fba77db1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -27,6 +27,7 @@ config X86_64
# Options that are inherently 64-bit kernel only:
select ARCH_HAS_GIGANTIC_PAGE
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
+ select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_USE_CMPXCHG_LOCKREF
select HAVE_ARCH_SOFT_DIRTY
select MODULES_USE_ELF_RELA
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a498ae1fbe66..e4399983c50c 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -19,6 +19,7 @@
#include <linux/uaccess.h> /* faulthandler_disabled() */
#include <linux/efi.h> /* efi_crash_gracefully_on_page_fault()*/
#include <linux/mm_types.h>
+#include <linux/mm.h>			/* lock_vma_under_rcu() */

#include <asm/cpufeature.h> /* boot_cpu_has, ... */
#include <asm/traps.h> /* dotraplinkage, ... */
@@ -1333,6 +1334,38 @@ void do_user_addr_fault(struct pt_regs *regs,
}
#endif

+#ifdef CONFIG_PER_VMA_LOCK
+ if (!(flags & FAULT_FLAG_USER))
+ goto lock_mmap;
+
+ vma = lock_vma_under_rcu(mm, address);
+ if (!vma)
+ goto lock_mmap;
+
+ if (unlikely(access_error(error_code, vma))) {
+ vma_end_read(vma);
+ goto lock_mmap;
+ }
+ fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
+ vma_end_read(vma);
+
+ if (!(fault & VM_FAULT_RETRY)) {
+ count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+ goto done;
+ }
+ count_vm_vma_lock_event(VMA_LOCK_RETRY);
+
+ /* Quick path to respond to signals */
+ if (fault_signal_pending(fault, regs)) {
+ if (!user_mode(regs))
+ kernelmode_fixup_or_oops(regs, error_code, address,
+ SIGBUS, BUS_ADRERR,
+ ARCH_DEFAULT_PKEY);
+ return;
+ }
+lock_mmap:
+#endif /* CONFIG_PER_VMA_LOCK */
+
/*
* Kernel-mode access to the user address space should only occur
* on well-defined single instructions listed in the exception
@@ -1433,6 +1466,9 @@ void do_user_addr_fault(struct pt_regs *regs,
}

mmap_read_unlock(mm);
+#ifdef CONFIG_PER_VMA_LOCK
+done:
+#endif
if (likely(!(fault & VM_FAULT_ERROR)))
return;

--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:39:14

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v4 28/33] mm: introduce per-VMA lock statistics

Add a new CONFIG_PER_VMA_LOCK_STATS config option to dump extra
statistics about handling page faults under VMA lock. The new counters
appear in /proc/vmstat as vma_lock_success, vma_lock_abort,
vma_lock_retry and vma_lock_miss.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/vm_event_item.h | 6 ++++++
include/linux/vmstat.h | 6 ++++++
mm/Kconfig.debug | 6 ++++++
mm/memory.c | 2 ++
mm/vmstat.c | 6 ++++++
5 files changed, 26 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 7f5d1caf5890..8abfa1240040 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -149,6 +149,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#ifdef CONFIG_X86
DIRECT_MAP_LEVEL2_SPLIT,
DIRECT_MAP_LEVEL3_SPLIT,
+#endif
+#ifdef CONFIG_PER_VMA_LOCK_STATS
+ VMA_LOCK_SUCCESS,
+ VMA_LOCK_ABORT,
+ VMA_LOCK_RETRY,
+ VMA_LOCK_MISS,
#endif
NR_VM_EVENT_ITEMS
};
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 19cf5b6892ce..fed855bae6d8 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -125,6 +125,12 @@ static inline void vm_events_fold_cpu(int cpu)
#define count_vm_tlb_events(x, y) do { (void)(y); } while (0)
#endif

+#ifdef CONFIG_PER_VMA_LOCK_STATS
+#define count_vm_vma_lock_event(x) count_vm_event(x)
+#else
+#define count_vm_vma_lock_event(x) do {} while (0)
+#endif
+
#define __count_zid_vm_events(item, zid, delta) \
__count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta)

diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index c3547a373c9c..4965a7333a3f 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -279,3 +279,9 @@ config DEBUG_KMEMLEAK_AUTO_SCAN

If unsure, say Y.

+config PER_VMA_LOCK_STATS
+ bool "Statistics for per-vma locks"
+ depends on PER_VMA_LOCK
+ default y
+ help
+ Statistics for per-vma locks.
diff --git a/mm/memory.c b/mm/memory.c
index f734f80d28ca..255b2f4fdd4a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5273,6 +5273,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
/* Check if the VMA got isolated after we found it */
if (vma->detached) {
vma_end_read(vma);
+ count_vm_vma_lock_event(VMA_LOCK_MISS);
/* The area was replaced with another one */
goto retry;
}
@@ -5281,6 +5282,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
return vma;
inval:
rcu_read_unlock();
+ count_vm_vma_lock_event(VMA_LOCK_ABORT);
return NULL;
}
#endif /* CONFIG_PER_VMA_LOCK */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1ea6a5ce1c41..4f1089a1860e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1399,6 +1399,12 @@ const char * const vmstat_text[] = {
"direct_map_level2_splits",
"direct_map_level3_splits",
#endif
+#ifdef CONFIG_PER_VMA_LOCK_STATS
+ "vma_lock_success",
+ "vma_lock_abort",
+ "vma_lock_retry",
+ "vma_lock_miss",
+#endif
#endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
};
#endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:39:17

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v4 31/33] powerpc/mm: try VMA lock-based page fault handling first

From: Laurent Dufour <[email protected]>

Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.
Copied from "x86/mm: try VMA lock-based page fault handling first"

Signed-off-by: Laurent Dufour <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
arch/powerpc/mm/fault.c | 41 ++++++++++++++++++++++++++
arch/powerpc/platforms/powernv/Kconfig | 1 +
arch/powerpc/platforms/pseries/Kconfig | 1 +
3 files changed, 43 insertions(+)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 2bef19cc1b98..c7ae86b04b8a 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -469,6 +469,44 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
if (is_exec)
flags |= FAULT_FLAG_INSTRUCTION;

+#ifdef CONFIG_PER_VMA_LOCK
+ if (!(flags & FAULT_FLAG_USER))
+ goto lock_mmap;
+
+ vma = lock_vma_under_rcu(mm, address);
+ if (!vma)
+ goto lock_mmap;
+
+ if (unlikely(access_pkey_error(is_write, is_exec,
+ (error_code & DSISR_KEYFAULT), vma))) {
+ int rc = bad_access_pkey(regs, address, vma);
+
+ vma_end_read(vma);
+ return rc;
+ }
+
+ if (unlikely(access_error(is_write, is_exec, vma))) {
+ int rc = bad_access(regs, address);
+
+ vma_end_read(vma);
+ return rc;
+ }
+
+ fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
+ vma_end_read(vma);
+
+ if (!(fault & VM_FAULT_RETRY)) {
+ count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+ goto done;
+ }
+ count_vm_vma_lock_event(VMA_LOCK_RETRY);
+
+ if (fault_signal_pending(fault, regs))
+ return user_mode(regs) ? 0 : SIGBUS;
+
+lock_mmap:
+#endif /* CONFIG_PER_VMA_LOCK */
+
/* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
* kernel and should generate an OOPS. Unfortunately, in the case of an
@@ -545,6 +583,9 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,

mmap_read_unlock(current->mm);

+#ifdef CONFIG_PER_VMA_LOCK
+done:
+#endif
if (unlikely(fault & VM_FAULT_ERROR))
return mm_fault_error(regs, address, fault);

diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
index ae248a161b43..70a46acc70d6 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -16,6 +16,7 @@ config PPC_POWERNV
select PPC_DOORBELL
select MMU_NOTIFIER
select FORCE_SMP
+ select ARCH_SUPPORTS_PER_VMA_LOCK
default y

config OPAL_PRD
diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
index b481c5c8bae1..9c205fe0e619 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -21,6 +21,7 @@ config PPC_PSERIES
select HOTPLUG_CPU
select FORCE_SMP
select SWIOTLB
+ select ARCH_SUPPORTS_PER_VMA_LOCK
default y

config PARAVIRT
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:39:20

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v4 30/33] arm64/mm: try VMA lock-based page fault handling first

Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/mm/fault.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 37 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 27b2592698b0..412207d789b1 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -95,6 +95,7 @@ config ARM64
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_SUPPORTS_PAGE_TABLE_CHECK
+ select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
select ARCH_WANT_DEFAULT_BPF_JIT
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index f4cb0f85ccf4..9e0db5c387e3 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -535,6 +535,9 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
unsigned long vm_flags;
unsigned int mm_flags = FAULT_FLAG_DEFAULT;
unsigned long addr = untagged_addr(far);
+#ifdef CONFIG_PER_VMA_LOCK
+ struct vm_area_struct *vma;
+#endif

if (kprobe_page_fault(regs, esr))
return 0;
@@ -585,6 +588,36 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,

perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);

+#ifdef CONFIG_PER_VMA_LOCK
+ if (!(mm_flags & FAULT_FLAG_USER))
+ goto lock_mmap;
+
+ vma = lock_vma_under_rcu(mm, addr);
+ if (!vma)
+ goto lock_mmap;
+
+ if (!(vma->vm_flags & vm_flags)) {
+ vma_end_read(vma);
+ goto lock_mmap;
+ }
+ fault = handle_mm_fault(vma, addr & PAGE_MASK,
+ mm_flags | FAULT_FLAG_VMA_LOCK, regs);
+ vma_end_read(vma);
+
+ if (!(fault & VM_FAULT_RETRY)) {
+ count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+ goto done;
+ }
+ count_vm_vma_lock_event(VMA_LOCK_RETRY);
+
+ /* Quick path to respond to signals */
+ if (fault_signal_pending(fault, regs)) {
+ if (!user_mode(regs))
+ goto no_context;
+ return 0;
+ }
+lock_mmap:
+#endif /* CONFIG_PER_VMA_LOCK */
/*
* As per x86, we may deadlock here. However, since the kernel only
* validly references user space from well defined areas of the code,
@@ -628,6 +661,9 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
}
mmap_read_unlock(mm);

+#ifdef CONFIG_PER_VMA_LOCK
+done:
+#endif
/*
* Handle the "normal" (no error) case first.
*/
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:39:23

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v4 32/33] mm/mmap: free vm_area_struct without call_rcu in exit_mmap

call_rcu() can take a long time when callback offloading is enabled.
Its use in vm_area_free() can cause regressions in the exit path when
multiple VMAs are being freed.
Because exit_mmap() is called only after the last mm user drops its
refcount, the page fault handlers can't be racing with it. Any other
possible users like the oom-reaper or process_mrelease are already
synchronized using mmap_lock. Therefore exit_mmap() can free VMAs
directly, without the use of call_rcu().
Expose __vm_area_free() and use it from exit_mmap() to avoid possible
call_rcu() floods and the performance regressions they cause.

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 2 ++
kernel/fork.c | 2 +-
mm/mmap.c | 11 +++++++----
3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d07ac923333f..5e142bfe7a58 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -256,6 +256,8 @@ void setup_initial_init_mm(void *start_code, void *end_code,
struct vm_area_struct *vm_area_alloc(struct mm_struct *);
struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
void vm_area_free(struct vm_area_struct *);
+/* Use only if VMA has no other users */
+void __vm_area_free(struct vm_area_struct *vma);

#ifndef CONFIG_MMU
extern struct rb_root nommu_region_tree;
diff --git a/kernel/fork.c b/kernel/fork.c
index bdb55f25895d..ad37f1d0c5ab 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -480,7 +480,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
return new;
}

-static void __vm_area_free(struct vm_area_struct *vma)
+void __vm_area_free(struct vm_area_struct *vma)
{
free_anon_vma_name(vma);
kmem_cache_free(vm_area_cachep, vma);
diff --git a/mm/mmap.c b/mm/mmap.c
index df13c33498db..0cd3714c2182 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -133,7 +133,7 @@ void unlink_file_vma(struct vm_area_struct *vma)
/*
* Close a vm structure and free it.
*/
-static void remove_vma(struct vm_area_struct *vma)
+static void remove_vma(struct vm_area_struct *vma, bool unreachable)
{
might_sleep();
if (vma->vm_ops && vma->vm_ops->close)
@@ -141,7 +141,10 @@ static void remove_vma(struct vm_area_struct *vma)
if (vma->vm_file)
fput(vma->vm_file);
mpol_put(vma_policy(vma));
- vm_area_free(vma);
+ if (unreachable)
+ __vm_area_free(vma);
+ else
+ vm_area_free(vma);
}

static inline struct vm_area_struct *vma_prev_limit(struct vma_iterator *vmi,
@@ -2130,7 +2133,7 @@ static inline void remove_mt(struct mm_struct *mm, struct ma_state *mas)
if (vma->vm_flags & VM_ACCOUNT)
nr_accounted += nrpages;
vm_stat_account(mm, vma->vm_flags, -nrpages);
- remove_vma(vma);
+ remove_vma(vma, false);
}
vm_unacct_memory(nr_accounted);
validate_mm(mm);
@@ -3070,7 +3073,7 @@ void exit_mmap(struct mm_struct *mm)
do {
if (vma->vm_flags & VM_ACCOUNT)
nr_accounted += vma_pages(vma);
- remove_vma(vma);
+ remove_vma(vma, true);
count++;
cond_resched();
} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
--
2.39.2.722.g9855ee24e9-goog


2023-02-27 17:39:27

by Suren Baghdasaryan

[permalink] [raw]
Subject: [PATCH v4 33/33] mm: separate vma->lock from vm_area_struct

vma->lock being part of the vm_area_struct causes a performance
regression during page faults: under contention its count and owner
fields are constantly updated, and having other parts of
vm_area_struct used during page fault handling next to them causes
constant cache line bouncing. Fix that by moving the lock outside of
the vm_area_struct.
All attempts to keep vma->lock inside vm_area_struct in a separate
cache line still produce a performance regression, especially on NUMA
machines. The smallest regression was achieved when the lock is placed
in the fourth cache line, but that bloats vm_area_struct to 256 bytes.
Considering the performance and memory impact, a separate lock looks
like the best option. It increases the memory footprint of each VMA
but that can be optimized later if the new size causes issues.
Note that after this change vma_init() does not allocate or
initialize vma->lock anymore. A number of drivers allocate a pseudo
VMA on the stack but they never use the VMA's lock, therefore it does
not need to be allocated. The future drivers which might need the VMA
lock should use vm_area_alloc()/vm_area_free() to allocate the VMA.
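
The practical consequence for on-stack pseudo VMAs, sketched (the
helpers are real, the usage is illustrative):

	struct vm_area_struct pseudo_vma;
	struct vm_area_struct *vma;

	vma_init(&pseudo_vma, mm);	/* OK, but vm_lock is NOT set up:
					 * never lock or link this VMA */
	vma = vm_area_alloc(mm);	/* allocates and initializes vm_lock
					 * from the new vma_lock_cachep */
	/* ... use the VMA ... */
	vm_area_free(vma);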

Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 23 ++++++-------
include/linux/mm_types.h | 6 +++-
kernel/fork.c | 73 ++++++++++++++++++++++++++++++++--------
3 files changed, 74 insertions(+), 28 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5e142bfe7a58..3d4bb18dfcb7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -627,12 +627,6 @@ struct vm_operations_struct {
};

#ifdef CONFIG_PER_VMA_LOCK
-static inline void vma_init_lock(struct vm_area_struct *vma)
-{
- init_rwsem(&vma->lock);
- vma->vm_lock_seq = -1;
-}
-
/*
* Try to read-lock a vma. The function is allowed to occasionally yield false
* locked result to avoid performance overhead, in which case we fall back to
@@ -644,17 +638,17 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
return false;

- if (unlikely(down_read_trylock(&vma->lock) == 0))
+ if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
return false;

/*
* Overflow might produce false locked result.
* False unlocked result is impossible because we modify and check
- * vma->vm_lock_seq under vma->lock protection and mm->mm_lock_seq
+ * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
* modification invalidates all existing locks.
*/
if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
- up_read(&vma->lock);
+ up_read(&vma->vm_lock->lock);
return false;
}
return true;
@@ -663,7 +657,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma)
static inline void vma_end_read(struct vm_area_struct *vma)
{
rcu_read_lock(); /* keeps vma alive till the end of up_read */
- up_read(&vma->lock);
+ up_read(&vma->vm_lock->lock);
rcu_read_unlock();
}

@@ -681,9 +675,9 @@ static inline void vma_start_write(struct vm_area_struct *vma)
if (vma->vm_lock_seq == mm_lock_seq)
return;

- down_write(&vma->lock);
+ down_write(&vma->vm_lock->lock);
vma->vm_lock_seq = mm_lock_seq;
- up_write(&vma->lock);
+ up_write(&vma->vm_lock->lock);
}

static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -720,6 +714,10 @@ static inline void vma_mark_detached(struct vm_area_struct *vma,

#endif /* CONFIG_PER_VMA_LOCK */

+/*
+ * WARNING: vma_init does not initialize vma->vm_lock.
+ * Use vm_area_alloc()/vm_area_free() if vma needs locking.
+ */
static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
{
static const struct vm_operations_struct dummy_vm_ops = {};
@@ -729,7 +727,6 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_ops = &dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
vma_mark_detached(vma, false);
- vma_init_lock(vma);
}

/* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6768533a6b7c..89bbf7d8a312 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -471,6 +471,10 @@ struct anon_vma_name {
char name[];
};

+struct vma_lock {
+ struct rw_semaphore lock;
+};
+
/*
* This struct describes a virtual memory area. There is one of these
* per VM-area/task. A VM area is any part of the process virtual memory
@@ -510,7 +514,7 @@ struct vm_area_struct {

#ifdef CONFIG_PER_VMA_LOCK
int vm_lock_seq;
- struct rw_semaphore lock;
+ struct vma_lock *vm_lock;

/* Flag to indicate areas detached from the mm->mm_mt tree */
bool detached;
diff --git a/kernel/fork.c b/kernel/fork.c
index ad37f1d0c5ab..75792157f51a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -451,13 +451,49 @@ static struct kmem_cache *vm_area_cachep;
/* SLAB cache for mm_struct structures (tsk->mm) */
static struct kmem_cache *mm_cachep;

+#ifdef CONFIG_PER_VMA_LOCK
+
+/* SLAB cache for vm_area_struct.lock */
+static struct kmem_cache *vma_lock_cachep;
+
+static bool vma_lock_alloc(struct vm_area_struct *vma)
+{
+ vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
+ if (!vma->vm_lock)
+ return false;
+
+ init_rwsem(&vma->vm_lock->lock);
+ vma->vm_lock_seq = -1;
+
+ return true;
+}
+
+static inline void vma_lock_free(struct vm_area_struct *vma)
+{
+ kmem_cache_free(vma_lock_cachep, vma->vm_lock);
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+static inline bool vma_lock_alloc(struct vm_area_struct *vma) { return true; }
+static inline void vma_lock_free(struct vm_area_struct *vma) {}
+
+#endif /* CONFIG_PER_VMA_LOCK */
+
struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
{
struct vm_area_struct *vma;

vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
- if (vma)
- vma_init(vma, mm);
+ if (!vma)
+ return NULL;
+
+ vma_init(vma, mm);
+ if (!vma_lock_alloc(vma)) {
+ kmem_cache_free(vm_area_cachep, vma);
+ return NULL;
+ }
+
return vma;
}

@@ -465,24 +501,30 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
{
struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);

- if (new) {
- ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
- ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
- /*
- * orig->shared.rb may be modified concurrently, but the clone
- * will be reinitialized.
- */
- data_race(memcpy(new, orig, sizeof(*new)));
- INIT_LIST_HEAD(&new->anon_vma_chain);
- vma_init_lock(new);
- dup_anon_vma_name(orig, new);
+ if (!new)
+ return NULL;
+
+ ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
+ ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
+ /*
+ * orig->shared.rb may be modified concurrently, but the clone
+ * will be reinitialized.
+ */
+ data_race(memcpy(new, orig, sizeof(*new)));
+ if (!vma_lock_alloc(new)) {
+ kmem_cache_free(vm_area_cachep, new);
+ return NULL;
}
+ INIT_LIST_HEAD(&new->anon_vma_chain);
+ dup_anon_vma_name(orig, new);
+
return new;
}

void __vm_area_free(struct vm_area_struct *vma)
{
free_anon_vma_name(vma);
+ vma_lock_free(vma);
kmem_cache_free(vm_area_cachep, vma);
}

@@ -493,7 +535,7 @@ static void vm_area_free_rcu_cb(struct rcu_head *head)
vm_rcu);

/* The vma should not be locked while being destroyed. */
- VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock), vma);
+ VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock), vma);
__vm_area_free(vma);
}
#endif
@@ -3160,6 +3202,9 @@ void __init proc_caches_init(void)
NULL);

vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
+#ifdef CONFIG_PER_VMA_LOCK
+ vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
+#endif
mmap_init();
nsproxy_cache_init();
}
--
2.39.2.722.g9855ee24e9-goog


2023-03-01 07:02:08

by Hyeonggon Yoo

[permalink] [raw]
Subject: Re: [PATCH v4 17/33] mm/mremap: write-lock VMA while remapping it to a new address range

On Mon, Feb 27, 2023 at 09:36:16AM -0800, Suren Baghdasaryan wrote:
> Write-lock VMA as locked before copying it and when copy_vma produces
> a new VMA.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> Reviewed-by: Laurent Dufour <[email protected]>
> ---
> mm/mmap.c | 1 +
> mm/mremap.c | 1 +
> 2 files changed, 2 insertions(+)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index e73fbb84ce12..1f42b9a52b9b 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3189,6 +3189,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> get_file(new_vma->vm_file);
> if (new_vma->vm_ops && new_vma->vm_ops->open)
> new_vma->vm_ops->open(new_vma);
> + vma_start_write(new_vma);

Oh, it's to prevent handling page faults during move_page_tables().


> if (vma_link(mm, new_vma))
> goto out_vma_link;
> *need_rmap_locks = false;
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 1ddf7beb62e9..327c38eb132e 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -623,6 +623,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
> return -ENOMEM;
> }
>
> + vma_start_write(vma);
> new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
> new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
> &need_rmap_locks);
> --
> 2.39.2.722.g9855ee24e9-goog

Looks good to me.

Reviewed-by: Hyeonggon Yoo <[email protected]>

>
>

2023-03-01 07:43:54

by Hyeonggon Yoo

[permalink] [raw]
Subject: Re: [PATCH v4 18/33] mm: write-lock VMAs before removing them from VMA tree

On Mon, Feb 27, 2023 at 09:36:17AM -0800, Suren Baghdasaryan wrote:
> Write-locking VMAs before isolating them ensures that page fault
> handlers don't operate on isolated VMAs.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> mm/mmap.c | 1 +
> mm/nommu.c | 5 +++++
> 2 files changed, 6 insertions(+)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 1f42b9a52b9b..f7ed357056c4 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2255,6 +2255,7 @@ int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> static inline int munmap_sidetree(struct vm_area_struct *vma,
> struct ma_state *mas_detach)
> {
> + vma_start_write(vma);
> mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);

I may be missing something, but have few questions:

1) Why does a writer need to both write-lock a VMA and mark the VMA detached
when unmapping it, isn't it enough to just only write-lock a VMA?

2) as VMAs that are going to be removed are already locked in vma_prepare(),
so I think this hunk could be dropped?

> if (mas_store_gfp(mas_detach, vma, GFP_KERNEL))
> return -ENOMEM;
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 57ba243c6a37..2ab162d773e2 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -588,6 +588,7 @@ static int delete_vma_from_mm(struct vm_area_struct *vma)
> current->pid);
> return -ENOMEM;
> }
> + vma_start_write(vma);
> cleanup_vma_from_mm(vma);

3) I think this hunk could be dropped as Per-VMA lock depends on MMU anyway.

Thanks,
Hyeonggon

>
> /* remove from the MM's tree and list */
> @@ -1519,6 +1520,10 @@ void exit_mmap(struct mm_struct *mm)
> */
> mmap_write_lock(mm);
> for_each_vma(vmi, vma) {
> + /*
> + * No need to lock VMA because this is the only mm user and no
> + * page fault handled can race with it.
> + */
> cleanup_vma_from_mm(vma);
> delete_vma(mm, vma);
> cond_resched();
> --
> 2.39.2.722.g9855ee24e9-goog
>
>

2023-03-01 07:57:09

by Hyeonggon Yoo

[permalink] [raw]
Subject: Re: [PATCH v4 18/33] mm: write-lock VMAs before removing them from VMA tree

On Wed, Mar 01, 2023 at 07:43:33AM +0000, Hyeonggon Yoo wrote:
> On Mon, Feb 27, 2023 at 09:36:17AM -0800, Suren Baghdasaryan wrote:
> > Write-locking VMAs before isolating them ensures that page fault
> > handlers don't operate on isolated VMAs.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > mm/mmap.c | 1 +
> > mm/nommu.c | 5 +++++
> > 2 files changed, 6 insertions(+)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 1f42b9a52b9b..f7ed357056c4 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2255,6 +2255,7 @@ int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > static inline int munmap_sidetree(struct vm_area_struct *vma,
> > struct ma_state *mas_detach)
> > {
> > + vma_start_write(vma);
> > mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);
>
> I may be missing something, but have few questions:
>
> 1) Why does a writer need to both write-lock a VMA and mark the VMA detached
> when unmapping it, isn't it enough to just only write-lock a VMA?
>
> 2) as VMAs that are going to be removed are already locked in vma_prepare(),
> so I think this hunk could be dropped?

After sending this just realized that I did not consider simple munmap case :)
But I still think 1) and 3) are valid question.

>
> > if (mas_store_gfp(mas_detach, vma, GFP_KERNEL))
> > return -ENOMEM;
> > diff --git a/mm/nommu.c b/mm/nommu.c
> > index 57ba243c6a37..2ab162d773e2 100644
> > --- a/mm/nommu.c
> > +++ b/mm/nommu.c
> > @@ -588,6 +588,7 @@ static int delete_vma_from_mm(struct vm_area_struct *vma)
> > current->pid);
> > return -ENOMEM;
> > }
> > + vma_start_write(vma);
> > cleanup_vma_from_mm(vma);
>
> 3) I think this hunk could be dropped as Per-VMA lock depends on MMU anyway.
>
> Thanks,
> Hyeonggon
>
> >
> > /* remove from the MM's tree and list */
> > @@ -1519,6 +1520,10 @@ void exit_mmap(struct mm_struct *mm)
> > */
> > mmap_write_lock(mm);
> > for_each_vma(vmi, vma) {
> > + /*
> > + * No need to lock VMA because this is the only mm user and no
> > + * page fault handled can race with it.
> > + */
> > cleanup_vma_from_mm(vma);
> > delete_vma(mm, vma);
> > cond_resched();
> > --
> > 2.39.2.722.g9855ee24e9-goog
> >
> >
>

2023-03-01 09:54:42

by Hyeonggon Yoo

[permalink] [raw]
Subject: Re: [PATCH v4 24/33] mm: fall back to mmap_lock if vma->anon_vma is not yet set

On Mon, Feb 27, 2023 at 09:36:23AM -0800, Suren Baghdasaryan wrote:
> When vma->anon_vma is not set, page fault handler will set it by either
> reusing anon_vma of an adjacent VMA if VMAs are compatible or by
> allocating a new one. find_mergeable_anon_vma() walks VMA tree to find
> a compatible adjacent VMA and that requires not only the faulting VMA
> to be stable but also the tree structure and other VMAs inside that tree.
> Therefore locking just the faulting VMA is not enough for this search.
> Fall back to taking mmap_lock when vma->anon_vma is not set. This
> situation happens only on the first page fault and should not affect
> overall performance.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> mm/memory.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index bda4c1a991f0..8855846a361b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5243,6 +5243,10 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> if (!vma_is_anonymous(vma))
> goto inval;
>
> + /* find_mergeable_anon_vma uses adjacent vmas which are not locked */
> + if (!vma->anon_vma)
> + goto inval;
> +
> if (!vma_start_read(vma))
> goto inval;

Reviewed-by: Hyeonggon Yoo <[email protected]>

> --
> 2.39.2.722.g9855ee24e9-goog
>
>

2023-03-01 18:35:05

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v4 18/33] mm: write-lock VMAs before removing them from VMA tree

On Tue, Feb 28, 2023 at 11:57 PM Hyeonggon Yoo <[email protected]> wrote:
>
> On Wed, Mar 01, 2023 at 07:43:33AM +0000, Hyeonggon Yoo wrote:
> > On Mon, Feb 27, 2023 at 09:36:17AM -0800, Suren Baghdasaryan wrote:
> > > Write-locking VMAs before isolating them ensures that page fault
> > > handlers don't operate on isolated VMAs.
> > >
> > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > ---
> > > mm/mmap.c | 1 +
> > > mm/nommu.c | 5 +++++
> > > 2 files changed, 6 insertions(+)
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 1f42b9a52b9b..f7ed357056c4 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -2255,6 +2255,7 @@ int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > static inline int munmap_sidetree(struct vm_area_struct *vma,
> > > struct ma_state *mas_detach)
> > > {
> > > + vma_start_write(vma);
> > > mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);
> >
> > I may be missing something, but have a few questions:
> >
> > 1) Why does a writer need to both write-lock a VMA and mark the VMA detached
> > when unmapping it? Isn't it enough to just write-lock the VMA?

We need to mark the VMA detached to avoid handling a page fault in a
detached VMA. The possible scenario is:

    page fault path                     munmap path
    ---------------                     -----------
    lock_vma_under_rcu
      vma = mas_walk(&mas)
                                        munmap_sidetree
                                          vma_start_write(vma)
                                          mas_store_gfp() // remove VMA from the tree
                                          vma_end_write_all()
      vma_start_read(vma)
      // we locked the VMA but it is not part of the tree anymore.

So, marking the VMA locked before vma_end_write_all() and checking
vma->detached after vma_start_read() helps us avoid handling faults in
the detached VMA.
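
For reference, here is a condensed sketch of the reader side of this
check. The helper names and the vma->detached field are the patchset's,
but the body below is a simplified illustration rather than the exact
kernel code:

	struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
						  unsigned long address)
	{
		MA_STATE(mas, &mm->mm_mt, address, address);
		struct vm_area_struct *vma;

		rcu_read_lock();
		vma = mas_walk(&mas);
		if (!vma)
			goto inval;

		/* Fails if the VMA is currently write-locked. */
		if (!vma_start_read(vma))
			goto inval;

		/*
		 * The VMA may have been detached and then unlocked via
		 * vma_end_write_all() between mas_walk() and
		 * vma_start_read() above; this check closes that window.
		 */
		if (vma->detached) {
			vma_end_read(vma);
			goto inval;
		}

		rcu_read_unlock();
		return vma;	/* returned read-locked */
	inval:
		rcu_read_unlock();
		return NULL;	/* caller falls back to mmap_lock */
	}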


> >
> > 2) As VMAs that are going to be removed are already locked in vma_prepare(),
> > I think this hunk could be dropped?
>
> After sending this I just realized that I did not consider the simple munmap case :)
> But I still think 1) and 3) are valid questions.
>
> >
> > > if (mas_store_gfp(mas_detach, vma, GFP_KERNEL))
> > > return -ENOMEM;
> > > diff --git a/mm/nommu.c b/mm/nommu.c
> > > index 57ba243c6a37..2ab162d773e2 100644
> > > --- a/mm/nommu.c
> > > +++ b/mm/nommu.c
> > > @@ -588,6 +588,7 @@ static int delete_vma_from_mm(struct vm_area_struct *vma)
> > > current->pid);
> > > return -ENOMEM;
> > > }
> > > + vma_start_write(vma);
> > > cleanup_vma_from_mm(vma);
> >
> > 3) I think this hunk could be dropped as Per-VMA lock depends on MMU anyway.

Ah, yes, you are right. We can safely remove the changes in nommu.c.
Andrew, should I post a fixup or can you make the removal directly in
mm-unstable?
Thanks,
Suren.

> >
> > Thanks,
> > Hyeonggon
> >
> > >
> > > /* remove from the MM's tree and list */
> > > @@ -1519,6 +1520,10 @@ void exit_mmap(struct mm_struct *mm)
> > > */
> > > mmap_write_lock(mm);
> > > for_each_vma(vmi, vma) {
> > > + /*
> > > + * No need to lock VMA because this is the only mm user and no
> > > + * page fault handlers can race with it.
> > > + */
> > > cleanup_vma_from_mm(vma);
> > > delete_vma(mm, vma);
> > > cond_resched();
> > > --
> > > 2.39.2.722.g9855ee24e9-goog
> > >
> > >
> >
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>

2023-03-01 18:43:04

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v4 18/33] mm: write-lock VMAs before removing them from VMA tree

On Wed, Mar 1, 2023 at 10:34 AM Suren Baghdasaryan <[email protected]> wrote:
>
> On Tue, Feb 28, 2023 at 11:57 PM Hyeonggon Yoo <[email protected]> wrote:
> >
> > On Wed, Mar 01, 2023 at 07:43:33AM +0000, Hyeonggon Yoo wrote:
> > > On Mon, Feb 27, 2023 at 09:36:17AM -0800, Suren Baghdasaryan wrote:
> > > > Write-locking VMAs before isolating them ensures that page fault
> > > > handlers don't operate on isolated VMAs.
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > > ---
> > > > mm/mmap.c | 1 +
> > > > mm/nommu.c | 5 +++++
> > > > 2 files changed, 6 insertions(+)
> > > >
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index 1f42b9a52b9b..f7ed357056c4 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -2255,6 +2255,7 @@ int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > > static inline int munmap_sidetree(struct vm_area_struct *vma,
> > > > struct ma_state *mas_detach)
> > > > {
> > > > + vma_start_write(vma);
> > > > mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);
> > >
> > > I may be missing something, but have a few questions:
> > >
> > > 1) Why does a writer need to both write-lock a VMA and mark the VMA detached
> > > when unmapping it? Isn't it enough to just write-lock the VMA?
>
> We need to mark the VMA detached to avoid handling a page fault in a
> detached VMA. The possible scenario is:
>
>     page fault path                     munmap path
>     ---------------                     -----------
>     lock_vma_under_rcu
>       vma = mas_walk(&mas)
>                                         munmap_sidetree
>                                           vma_start_write(vma)
>                                           mas_store_gfp() // remove VMA from the tree
>                                           vma_end_write_all()
>       vma_start_read(vma)
>       // we locked the VMA but it is not part of the tree anymore.
>
> So, marking the VMA locked before vma_end_write_all() and checking

Sorry, I should have said "marking the VMA *detached* before
vma_end_write_all() and checking vma->detached after vma_start_read()
helps us avoid handling faults in the detached VMA."

> vma->detached after vma_start_read() helps us avoid handling faults in
> the detached VMA.
>
>
> > >
> > > 2) As VMAs that are going to be removed are already locked in vma_prepare(),
> > > I think this hunk could be dropped?
> >
> > After sending this I just realized that I did not consider the simple munmap case :)
> > But I still think 1) and 3) are valid questions.
> >
> > >
> > > > if (mas_store_gfp(mas_detach, vma, GFP_KERNEL))
> > > > return -ENOMEM;
> > > > diff --git a/mm/nommu.c b/mm/nommu.c
> > > > index 57ba243c6a37..2ab162d773e2 100644
> > > > --- a/mm/nommu.c
> > > > +++ b/mm/nommu.c
> > > > @@ -588,6 +588,7 @@ static int delete_vma_from_mm(struct vm_area_struct *vma)
> > > > current->pid);
> > > > return -ENOMEM;
> > > > }
> > > > + vma_start_write(vma);
> > > > cleanup_vma_from_mm(vma);
> > >
> > > 3) I think this hunk could be dropped as Per-VMA lock depends on MMU anyway.
>
> Ah, yes, you are right. We can safely remove the changes in nommu.c.
> Andrew, should I post a fixup or can you make the removal directly in
> mm-unstable?
> Thanks,
> Suren.
>
> > >
> > > Thanks,
> > > Hyeonggon
> > >
> > > >
> > > > /* remove from the MM's tree and list */
> > > > @@ -1519,6 +1520,10 @@ void exit_mmap(struct mm_struct *mm)
> > > > */
> > > > mmap_write_lock(mm);
> > > > for_each_vma(vmi, vma) {
> > > > + /*
> > > > + * No need to lock VMA because this is the only mm user and no
> > > > + * page fault handlers can race with it.
> > > > + */
> > > > cleanup_vma_from_mm(vma);
> > > > delete_vma(mm, vma);
> > > > cond_resched();
> > > > --
> > > > 2.39.2.722.g9855ee24e9-goog
> > > >
> > > >
> > >
> >
> > --
> > To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> >

2023-03-01 19:07:46

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v4 18/33] mm: write-lock VMAs before removing them from VMA tree

On Wed, Mar 1, 2023 at 10:34 AM Suren Baghdasaryan <[email protected]> wrote:
>
> On Tue, Feb 28, 2023 at 11:57 PM Hyeonggon Yoo <[email protected]> wrote:
> >
> > On Wed, Mar 01, 2023 at 07:43:33AM +0000, Hyeonggon Yoo wrote:
> > > On Mon, Feb 27, 2023 at 09:36:17AM -0800, Suren Baghdasaryan wrote:
> > > > Write-locking VMAs before isolating them ensures that page fault
> > > > handlers don't operate on isolated VMAs.
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > > ---
> > > > mm/mmap.c | 1 +
> > > > mm/nommu.c | 5 +++++
> > > > 2 files changed, 6 insertions(+)
> > > >
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index 1f42b9a52b9b..f7ed357056c4 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -2255,6 +2255,7 @@ int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > > static inline int munmap_sidetree(struct vm_area_struct *vma,
> > > > struct ma_state *mas_detach)
> > > > {
> > > > + vma_start_write(vma);
> > > > mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);
> > >
> > > I may be missing something, but have a few questions:
> > >
> > > 1) Why does a writer need to both write-lock a VMA and mark the VMA detached
> > > when unmapping it? Isn't it enough to just write-lock the VMA?
>
> We need to mark the VMA detached to avoid handling a page fault in a
> detached VMA. The possible scenario is:
>
>     page fault path                     munmap path
>     ---------------                     -----------
>     lock_vma_under_rcu
>       vma = mas_walk(&mas)
>                                         munmap_sidetree
>                                           vma_start_write(vma)
>                                           mas_store_gfp() // remove VMA from the tree
>                                           vma_end_write_all()
>       vma_start_read(vma)
>       // we locked the VMA but it is not part of the tree anymore.
>
> So, marking the VMA locked before vma_end_write_all() and checking
> vma->detached after vma_start_read() helps us avoid handling faults in
> the detached VMA.
>
>
> > >
> > > 2) As VMAs that are going to be removed are already locked in vma_prepare(),
> > > I think this hunk could be dropped?
> >
> > After sending this I just realized that I did not consider the simple munmap case :)
> > But I still think 1) and 3) are valid questions.
> >
> > >
> > > > if (mas_store_gfp(mas_detach, vma, GFP_KERNEL))
> > > > return -ENOMEM;
> > > > diff --git a/mm/nommu.c b/mm/nommu.c
> > > > index 57ba243c6a37..2ab162d773e2 100644
> > > > --- a/mm/nommu.c
> > > > +++ b/mm/nommu.c
> > > > @@ -588,6 +588,7 @@ static int delete_vma_from_mm(struct vm_area_struct *vma)
> > > > current->pid);
> > > > return -ENOMEM;
> > > > }
> > > > + vma_start_write(vma);
> > > > cleanup_vma_from_mm(vma);
> > >
> > > 3) I think this hunk could be dropped as Per-VMA lock depends on MMU anyway.
>
> Ah, yes, you are right. We can safely remove the changes in nommu.c.
> Andrew, should I post a fixup or can you make the removal directly in
> mm-unstable?

I went ahead and posted the fixup for this at:
https://lore.kernel.org/all/[email protected]/

> Thanks,
> Suren.
>
> > >
> > > Thanks,
> > > Hyeonggon
> > >
> > > >
> > > > /* remove from the MM's tree and list */
> > > > @@ -1519,6 +1520,10 @@ void exit_mmap(struct mm_struct *mm)
> > > > */
> > > > mmap_write_lock(mm);
> > > > for_each_vma(vmi, vma) {
> > > > + /*
> > > > + * No need to lock VMA because this is the only mm user and no
> > > > + * page fault handlers can race with it.
> > > > + */
> > > > cleanup_vma_from_mm(vma);
> > > > delete_vma(mm, vma);
> > > > cond_resched();
> > > > --
> > > > 2.39.2.722.g9855ee24e9-goog
> > > >
> > > >
> > >
> >
> > --
> > To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> >

2023-03-02 00:54:09

by Hyeonggon Yoo

[permalink] [raw]
Subject: Re: [PATCH v4 18/33] mm: write-lock VMAs before removing them from VMA tree

On Wed, Mar 01, 2023 at 10:42:48AM -0800, Suren Baghdasaryan wrote:
> On Wed, Mar 1, 2023 at 10:34 AM Suren Baghdasaryan <[email protected]> wrote:
> >
> > On Tue, Feb 28, 2023 at 11:57 PM Hyeonggon Yoo <[email protected]> wrote:
> > >
> > > On Wed, Mar 01, 2023 at 07:43:33AM +0000, Hyeonggon Yoo wrote:
> > > > On Mon, Feb 27, 2023 at 09:36:17AM -0800, Suren Baghdasaryan wrote:
> > > > > Write-locking VMAs before isolating them ensures that page fault
> > > > > handlers don't operate on isolated VMAs.
> > > > >
> > > > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > > > ---
> > > > > mm/mmap.c | 1 +
> > > > > mm/nommu.c | 5 +++++
> > > > > 2 files changed, 6 insertions(+)
> > > > >
> > > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > > index 1f42b9a52b9b..f7ed357056c4 100644
> > > > > --- a/mm/mmap.c
> > > > > +++ b/mm/mmap.c
> > > > > @@ -2255,6 +2255,7 @@ int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > > > static inline int munmap_sidetree(struct vm_area_struct *vma,
> > > > > struct ma_state *mas_detach)
> > > > > {
> > > > > + vma_start_write(vma);
> > > > > mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);
> > > >
> > > > I may be missing something, but have a few questions:
> > > >
> > > > 1) Why does a writer need to both write-lock a VMA and mark the VMA detached
> > > > when unmapping it? Isn't it enough to just write-lock the VMA?
> >
> > We need to mark the VMA detached to avoid handling a page fault in a
> > detached VMA. The possible scenario is:
> >
> >     page fault path                     munmap path
> >     ---------------                     -----------
> >     lock_vma_under_rcu
> >       vma = mas_walk(&mas)
> >                                         munmap_sidetree
> >                                           vma_start_write(vma)
> >                                           mas_store_gfp() // remove VMA from the tree
> >                                           vma_end_write_all()
> >       vma_start_read(vma)
> >       // we locked the VMA but it is not part of the tree anymore.
> >
> > So, marking the VMA locked before vma_end_write_all() and checking
>
> Sorry, I should have said "marking the VMA *detached* before
> vma_end_write_all() and checking vma->detached after vma_start_read()
> helps us avoid handling faults in the detached VMA."
>
> > vma->detached after vma_start_read() helps us avoid handling faults in
> > the detached VMA.

Thank you for the explanation, that makes sense!

By the way, if there are no 32-bit users of the per-VMA lock (are there?),
the "detached" bool could be a VMA flag (i.e. making it depend on 64BIT
and selecting ARCH_USES_HIGH_VMA_FLAGS).
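
For illustration only, the idea would look roughly like the sketch
below; VM_DETACHED and its bit choice are hypothetical names invented
here, not something the patchset defines (vma_mark_detached() and
is_vma_detached() are the patchset's helpers):

	#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_64BIT)
	/* Hypothetical: reuse a spare high vm_flags bit for "detached" */
	#define VM_DETACHED	VM_HIGH_ARCH_4	/* hypothetical bit choice */

	static inline void vma_mark_detached(struct vm_area_struct *vma,
					     bool detached)
	{
		if (detached)
			vm_flags_set(vma, VM_DETACHED);
		else
			vm_flags_clear(vma, VM_DETACHED);
	}

	static inline bool is_vma_detached(struct vm_area_struct *vma)
	{
		return !!(vma->vm_flags & VM_DETACHED);
	}
	#endif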

Thanks,
Hyeonggon


2023-03-02 02:22:11

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v4 18/33] mm: write-lock VMAs before removing them from VMA tree

On Wed, Mar 1, 2023 at 4:54 PM Hyeonggon Yoo <[email protected]> wrote:
>
> On Wed, Mar 01, 2023 at 10:42:48AM -0800, Suren Baghdasaryan wrote:
> > On Wed, Mar 1, 2023 at 10:34 AM Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > On Tue, Feb 28, 2023 at 11:57 PM Hyeonggon Yoo <[email protected]> wrote:
> > > >
> > > > On Wed, Mar 01, 2023 at 07:43:33AM +0000, Hyeonggon Yoo wrote:
> > > > > On Mon, Feb 27, 2023 at 09:36:17AM -0800, Suren Baghdasaryan wrote:
> > > > > > Write-locking VMAs before isolating them ensures that page fault
> > > > > > handlers don't operate on isolated VMAs.
> > > > > >
> > > > > > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > > > > > ---
> > > > > > mm/mmap.c | 1 +
> > > > > > mm/nommu.c | 5 +++++
> > > > > > 2 files changed, 6 insertions(+)
> > > > > >
> > > > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > > > index 1f42b9a52b9b..f7ed357056c4 100644
> > > > > > --- a/mm/mmap.c
> > > > > > +++ b/mm/mmap.c
> > > > > > @@ -2255,6 +2255,7 @@ int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > > > > static inline int munmap_sidetree(struct vm_area_struct *vma,
> > > > > > struct ma_state *mas_detach)
> > > > > > {
> > > > > > + vma_start_write(vma);
> > > > > > mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);
> > > > >
> > > > > I may be missing something, but have a few questions:
> > > > >
> > > > > 1) Why does a writer need to both write-lock a VMA and mark the VMA detached
> > > > > when unmapping it? Isn't it enough to just write-lock the VMA?
> > >
> > > We need to mark the VMA detached to avoid handling a page fault in a
> > > detached VMA. The possible scenario is:
> > >
> > >     page fault path                     munmap path
> > >     ---------------                     -----------
> > >     lock_vma_under_rcu
> > >       vma = mas_walk(&mas)
> > >                                         munmap_sidetree
> > >                                           vma_start_write(vma)
> > >                                           mas_store_gfp() // remove VMA from the tree
> > >                                           vma_end_write_all()
> > >       vma_start_read(vma)
> > >       // we locked the VMA but it is not part of the tree anymore.
> > >
> > > So, marking the VMA locked before vma_end_write_all() and checking
> >
> > Sorry, I should have said "marking the VMA *detached* before
> > vma_end_write_all() and checking vma->detached after vma_start_read()
> > helps us avoid handling faults in the detached VMA."
> >
> > > vma->detached after vma_start_read() helps us avoid handling faults in
> > > the detached VMA.
>
> Thank you for the explanation, that makes sense!
>
> By the way, if there are no 32-bit users of the per-VMA lock (are there?),
> the "detached" bool could be a VMA flag (i.e. making it depend on 64BIT
> and selecting ARCH_USES_HIGH_VMA_FLAGS).

Yeah, I thought about it but didn't want to make assumptions about
potential users just yet. Besides, I heard there are attempts to make
vm_flags always 64-bit (I think Matthew mentioned that to me once).
If that happens, we won't need any dependencies here. Either way, this
conversion into a flag can be done as an additional optimization later
on. I prefer to keep the main patchset as simple as possible for now.
Thanks,
Suren.

>
> Thanks,
> Hyeonggon
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>

2023-03-06 15:59:01

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH] powerpc/mm: fix mmap_lock bad unlock

When a page fault is handled while holding the per-VMA lock,
bad_access_pkey() and bad_access() should not be called because they
assume the mmap_lock is held.
In case a bad access is detected, fall back to the default path,
grabbing the mmap_lock to handle the fault and report the error.

Fixes: 169db3bb4609 ("powerc/mm: try VMA lock-based page fault handling first")
Reported-by: Sachin Sant <[email protected]>
Link: https://lore.kernel.org/linux-mm/[email protected]
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Laurent Dufour <[email protected]>
---
arch/powerpc/mm/fault.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index c7ae86b04b8a..e191b3ebd8d6 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -479,17 +479,13 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,

if (unlikely(access_pkey_error(is_write, is_exec,
(error_code & DSISR_KEYFAULT), vma))) {
- int rc = bad_access_pkey(regs, address, vma);
-
vma_end_read(vma);
- return rc;
+ goto lock_mmap;
}

if (unlikely(access_error(is_write, is_exec, vma))) {
- int rc = bad_access(regs, address);
-
vma_end_read(vma);
- return rc;
+ goto lock_mmap;
}

fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
--
2.39.2


2023-03-06 20:25:47

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v4 31/33] powerc/mm: try VMA lock-based page fault handling first

On Mon, Feb 27, 2023 at 9:37 AM Suren Baghdasaryan <[email protected]> wrote:
>
> From: Laurent Dufour <[email protected]>
>
> Attempt VMA lock-based page fault handling first, and fall back to the
> existing mmap_lock-based handling if that fails.
> Copied from "x86/mm: try VMA lock-based page fault handling first"

Hi Andrew,
Laurent posted a fix for this patch at
https://lore.kernel.org/all/[email protected]/.
Could you please squash the fix into this patch?
Thanks,
Suren.

>
> Signed-off-by: Laurent Dufour <[email protected]>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> arch/powerpc/mm/fault.c | 41 ++++++++++++++++++++++++++
> arch/powerpc/platforms/powernv/Kconfig | 1 +
> arch/powerpc/platforms/pseries/Kconfig | 1 +
> 3 files changed, 43 insertions(+)
>
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 2bef19cc1b98..c7ae86b04b8a 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -469,6 +469,44 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
> if (is_exec)
> flags |= FAULT_FLAG_INSTRUCTION;
>
> +#ifdef CONFIG_PER_VMA_LOCK
> + if (!(flags & FAULT_FLAG_USER))
> + goto lock_mmap;
> +
> + vma = lock_vma_under_rcu(mm, address);
> + if (!vma)
> + goto lock_mmap;
> +
> + if (unlikely(access_pkey_error(is_write, is_exec,
> + (error_code & DSISR_KEYFAULT), vma))) {
> + int rc = bad_access_pkey(regs, address, vma);
> +
> + vma_end_read(vma);
> + return rc;
> + }
> +
> + if (unlikely(access_error(is_write, is_exec, vma))) {
> + int rc = bad_access(regs, address);
> +
> + vma_end_read(vma);
> + return rc;
> + }
> +
> + fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
> + vma_end_read(vma);
> +
> + if (!(fault & VM_FAULT_RETRY)) {
> + count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
> + goto done;
> + }
> + count_vm_vma_lock_event(VMA_LOCK_RETRY);
> +
> + if (fault_signal_pending(fault, regs))
> + return user_mode(regs) ? 0 : SIGBUS;
> +
> +lock_mmap:
> +#endif /* CONFIG_PER_VMA_LOCK */
> +
> /* When running in the kernel we expect faults to occur only to
> * addresses in user space. All other faults represent errors in the
> * kernel and should generate an OOPS. Unfortunately, in the case of an
> @@ -545,6 +583,9 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
>
> mmap_read_unlock(current->mm);
>
> +#ifdef CONFIG_PER_VMA_LOCK
> +done:
> +#endif
> if (unlikely(fault & VM_FAULT_ERROR))
> return mm_fault_error(regs, address, fault);
>
> diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
> index ae248a161b43..70a46acc70d6 100644
> --- a/arch/powerpc/platforms/powernv/Kconfig
> +++ b/arch/powerpc/platforms/powernv/Kconfig
> @@ -16,6 +16,7 @@ config PPC_POWERNV
> select PPC_DOORBELL
> select MMU_NOTIFIER
> select FORCE_SMP
> + select ARCH_SUPPORTS_PER_VMA_LOCK
> default y
>
> config OPAL_PRD
> diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
> index b481c5c8bae1..9c205fe0e619 100644
> --- a/arch/powerpc/platforms/pseries/Kconfig
> +++ b/arch/powerpc/platforms/pseries/Kconfig
> @@ -21,6 +21,7 @@ config PPC_PSERIES
> select HOTPLUG_CPU
> select FORCE_SMP
> select SWIOTLB
> + select ARCH_SUPPORTS_PER_VMA_LOCK
> default y
>
> config PARAVIRT
> --
> 2.39.2.722.g9855ee24e9-goog
>

2023-06-29 15:30:17

by Jiri Slaby

[permalink] [raw]
Subject: Re: [PATCH v4 29/33] x86/mm: try VMA lock-based page fault handling first

Hi,

On 27. 02. 23, 18:36, Suren Baghdasaryan wrote:
> Attempt VMA lock-based page fault handling first, and fall back to the
> existing mmap_lock-based handling if that fails.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> arch/x86/Kconfig | 1 +
> arch/x86/mm/fault.c | 36 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 37 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index a825bf031f49..df21fba77db1 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -27,6 +27,7 @@ config X86_64
> # Options that are inherently 64-bit kernel only:
> select ARCH_HAS_GIGANTIC_PAGE
> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
> + select ARCH_SUPPORTS_PER_VMA_LOCK
> select ARCH_USE_CMPXCHG_LOCKREF
> select HAVE_ARCH_SOFT_DIRTY
> select MODULES_USE_ELF_RELA
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index a498ae1fbe66..e4399983c50c 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -19,6 +19,7 @@
> #include <linux/uaccess.h> /* faulthandler_disabled() */
> #include <linux/efi.h> /* efi_crash_gracefully_on_page_fault()*/
> #include <linux/mm_types.h>
> +#include <linux/mm.h> /* find_and_lock_vma() */
>
> #include <asm/cpufeature.h> /* boot_cpu_has, ... */
> #include <asm/traps.h> /* dotraplinkage, ... */
> @@ -1333,6 +1334,38 @@ void do_user_addr_fault(struct pt_regs *regs,
> }
> #endif
>
> +#ifdef CONFIG_PER_VMA_LOCK
> + if (!(flags & FAULT_FLAG_USER))
> + goto lock_mmap;
> +
> + vma = lock_vma_under_rcu(mm, address);
> + if (!vma)
> + goto lock_mmap;
> +
> + if (unlikely(access_error(error_code, vma))) {
> + vma_end_read(vma);
> + goto lock_mmap;
> + }
> + fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
> + vma_end_read(vma);
> +
> + if (!(fault & VM_FAULT_RETRY)) {
> + count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
> + goto done;
> + }
> + count_vm_vma_lock_event(VMA_LOCK_RETRY);

This is apparently not strong enough as it causes go build failures like:

[ 409s] strconv
[ 409s] releasep: m=0x579e2000 m->p=0x5781c600 p->m=0x0 p->status=2
[ 409s] fatal error: releasep: invalid p state
[ 409s]

[ 325s] hash/adler32
[ 325s] hash/crc32
[ 325s] cmd/internal/codesign
[ 336s] fatal error: runtime: out of memory

There are many kinds of similar errors. It happens in only 1-3 out of
20 builds.

If I revert the commit on top of 6.4, they all disappear. Any idea?

The downstream report:
https://bugzilla.suse.com/show_bug.cgi?id=1212775

> +
> + /* Quick path to respond to signals */
> + if (fault_signal_pending(fault, regs)) {
> + if (!user_mode(regs))
> + kernelmode_fixup_or_oops(regs, error_code, address,
> + SIGBUS, BUS_ADRERR,
> + ARCH_DEFAULT_PKEY);
> + return;
> + }
> +lock_mmap:
> +#endif /* CONFIG_PER_VMA_LOCK */
> +
> /*
> * Kernel-mode access to the user address space should only occur
> * on well-defined single instructions listed in the exception
> @@ -1433,6 +1466,9 @@ void do_user_addr_fault(struct pt_regs *regs,
> }
>
> mmap_read_unlock(mm);
> +#ifdef CONFIG_PER_VMA_LOCK
> +done:
> +#endif
> if (likely(!(fault & VM_FAULT_ERROR)))
> return;
>

thanks,
--
js
suse labs


2023-06-29 15:57:59

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v4 29/33] x86/mm: try VMA lock-based page fault handling first

On Thu, Jun 29, 2023 at 7:40 AM Jiri Slaby <[email protected]> wrote:
>
> Hi,
>
> On 27. 02. 23, 18:36, Suren Baghdasaryan wrote:
> > Attempt VMA lock-based page fault handling first, and fall back to the
> > existing mmap_lock-based handling if that fails.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > arch/x86/Kconfig | 1 +
> > arch/x86/mm/fault.c | 36 ++++++++++++++++++++++++++++++++++++
> > 2 files changed, 37 insertions(+)
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index a825bf031f49..df21fba77db1 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -27,6 +27,7 @@ config X86_64
> > # Options that are inherently 64-bit kernel only:
> > select ARCH_HAS_GIGANTIC_PAGE
> > select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
> > + select ARCH_SUPPORTS_PER_VMA_LOCK
> > select ARCH_USE_CMPXCHG_LOCKREF
> > select HAVE_ARCH_SOFT_DIRTY
> > select MODULES_USE_ELF_RELA
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index a498ae1fbe66..e4399983c50c 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -19,6 +19,7 @@
> > #include <linux/uaccess.h> /* faulthandler_disabled() */
> > #include <linux/efi.h> /* efi_crash_gracefully_on_page_fault()*/
> > #include <linux/mm_types.h>
> > +#include <linux/mm.h> /* find_and_lock_vma() */
> >
> > #include <asm/cpufeature.h> /* boot_cpu_has, ... */
> > #include <asm/traps.h> /* dotraplinkage, ... */
> > @@ -1333,6 +1334,38 @@ void do_user_addr_fault(struct pt_regs *regs,
> > }
> > #endif
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > + if (!(flags & FAULT_FLAG_USER))
> > + goto lock_mmap;
> > +
> > + vma = lock_vma_under_rcu(mm, address);
> > + if (!vma)
> > + goto lock_mmap;
> > +
> > + if (unlikely(access_error(error_code, vma))) {
> > + vma_end_read(vma);
> > + goto lock_mmap;
> > + }
> > + fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
> > + vma_end_read(vma);
> > +
> > + if (!(fault & VM_FAULT_RETRY)) {
> > + count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
> > + goto done;
> > + }
> > + count_vm_vma_lock_event(VMA_LOCK_RETRY);
>
> This is apparently not strong enough as it causes go build failures like:
>
> [ 409s] strconv
> [ 409s] releasep: m=0x579e2000 m->p=0x5781c600 p->m=0x0 p->status=2
> [ 409s] fatal error: releasep: invalid p state
> [ 409s]
>
> [ 325s] hash/adler32
> [ 325s] hash/crc32
> [ 325s] cmd/internal/codesign
> [ 336s] fatal error: runtime: out of memory

Hi Jiri,
Thanks for reporting! I'm not familiar with go builds. Could you
please explain the error to me or point me to some documentation to
decipher that error?
Thanks,
Suren.

>
> There are many kinds of similar errors. It happens in only 1-3 out of
> 20 builds.
>
> If I revert the commit on top of 6.4, they all disappear. Any idea?
>
> The downstream report:
> https://bugzilla.suse.com/show_bug.cgi?id=1212775
>
> > +
> > + /* Quick path to respond to signals */
> > + if (fault_signal_pending(fault, regs)) {
> > + if (!user_mode(regs))
> > + kernelmode_fixup_or_oops(regs, error_code, address,
> > + SIGBUS, BUS_ADRERR,
> > + ARCH_DEFAULT_PKEY);
> > + return;
> > + }
> > +lock_mmap:
> > +#endif /* CONFIG_PER_VMA_LOCK */
> > +
> > /*
> > * Kernel-mode access to the user address space should only occur
> > * on well-defined single instructions listed in the exception
> > @@ -1433,6 +1466,9 @@ void do_user_addr_fault(struct pt_regs *regs,
> > }
> >
> > mmap_read_unlock(mm);
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +done:
> > +#endif
> > if (likely(!(fault & VM_FAULT_ERROR)))
> > return;
> >
>
> thanks,
> --
> js
> suse labs
>

Subject: Re: [PATCH v4 29/33] x86/mm: try VMA lock-based page fault handling first

[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

On 29.06.23 16:40, Jiri Slaby wrote:
>
> On 27. 02. 23, 18:36, Suren Baghdasaryan wrote:
>> Attempt VMA lock-based page fault handling first, and fall back to the
>> existing mmap_lock-based handling if that fails.
> [...]
>> +    fault = handle_mm_fault(vma, address, flags |
>> FAULT_FLAG_VMA_LOCK, regs);
>> +    vma_end_read(vma);
>> +
>> +    if (!(fault & VM_FAULT_RETRY)) {
>> +        count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
>> +        goto done;
>> +    }
>> +    count_vm_vma_lock_event(VMA_LOCK_RETRY);
>
> This is apparently not strong enough as it causes go build failures like:
>
> [  409s] strconv
> [  409s] releasep: m=0x579e2000 m->p=0x5781c600 p->m=0x0 p->status=2
> [  409s] fatal error: releasep: invalid p state
> [  409s]
>
> [  325s] hash/adler32
> [  325s] hash/crc32
> [  325s] cmd/internal/codesign
> [  336s] fatal error: runtime: out of memory
>
> There are many kinds of similar errors. It happens in only 1-3 out of
> 20 builds.
>
> If I revert the commit on top of 6.4, they all disappear. Any idea?
>
> The downstream report:
> https://bugzilla.suse.com/show_bug.cgi?id=1212775
> [...]

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced 0bff0aaea03e2a3ed6bfa3021
https://bugzilla.suse.com/show_bug.cgi?id=1212775
#regzbot title mm: failures when building go in 1-3 out of 20 builds
#regzbot ignore-activity

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

2023-06-30 07:07:55

by Jiri Slaby

[permalink] [raw]
Subject: Re: [PATCH v4 29/33] x86/mm: try VMA lock-based page fault handling first

On 29. 06. 23, 17:30, Suren Baghdasaryan wrote:
> On Thu, Jun 29, 2023 at 7:40 AM Jiri Slaby <[email protected]> wrote:
>>
>> Hi,
>>
>> On 27. 02. 23, 18:36, Suren Baghdasaryan wrote:
>>> Attempt VMA lock-based page fault handling first, and fall back to the
>>> existing mmap_lock-based handling if that fails.
>>>
>>> Signed-off-by: Suren Baghdasaryan <[email protected]>
>>> ---
>>> arch/x86/Kconfig | 1 +
>>> arch/x86/mm/fault.c | 36 ++++++++++++++++++++++++++++++++++++
>>> 2 files changed, 37 insertions(+)
>>>
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index a825bf031f49..df21fba77db1 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -27,6 +27,7 @@ config X86_64
>>> # Options that are inherently 64-bit kernel only:
>>> select ARCH_HAS_GIGANTIC_PAGE
>>> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
>>> + select ARCH_SUPPORTS_PER_VMA_LOCK
>>> select ARCH_USE_CMPXCHG_LOCKREF
>>> select HAVE_ARCH_SOFT_DIRTY
>>> select MODULES_USE_ELF_RELA
>>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>>> index a498ae1fbe66..e4399983c50c 100644
>>> --- a/arch/x86/mm/fault.c
>>> +++ b/arch/x86/mm/fault.c
>>> @@ -19,6 +19,7 @@
>>> #include <linux/uaccess.h> /* faulthandler_disabled() */
>>> #include <linux/efi.h> /* efi_crash_gracefully_on_page_fault()*/
>>> #include <linux/mm_types.h>
>>> +#include <linux/mm.h> /* find_and_lock_vma() */
>>>
>>> #include <asm/cpufeature.h> /* boot_cpu_has, ... */
>>> #include <asm/traps.h> /* dotraplinkage, ... */
>>> @@ -1333,6 +1334,38 @@ void do_user_addr_fault(struct pt_regs *regs,
>>> }
>>> #endif
>>>
>>> +#ifdef CONFIG_PER_VMA_LOCK
>>> + if (!(flags & FAULT_FLAG_USER))
>>> + goto lock_mmap;
>>> +
>>> + vma = lock_vma_under_rcu(mm, address);
>>> + if (!vma)
>>> + goto lock_mmap;
>>> +
>>> + if (unlikely(access_error(error_code, vma))) {
>>> + vma_end_read(vma);
>>> + goto lock_mmap;
>>> + }
>>> + fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
>>> + vma_end_read(vma);
>>> +
>>> + if (!(fault & VM_FAULT_RETRY)) {
>>> + count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
>>> + goto done;
>>> + }
>>> + count_vm_vma_lock_event(VMA_LOCK_RETRY);
>>
>> This is apparently not strong enough as it causes go build failures like:
>>
>> [ 409s] strconv
>> [ 409s] releasep: m=0x579e2000 m->p=0x5781c600 p->m=0x0 p->status=2
>> [ 409s] fatal error: releasep: invalid p state
>> [ 409s]
>>
>> [ 325s] hash/adler32
>> [ 325s] hash/crc32
>> [ 325s] cmd/internal/codesign
>> [ 336s] fatal error: runtime: out of memory
>
> Hi Jiri,
> Thanks for reporting! I'm not familiar with go builds. Could you
> please explain the error to me or point me to some documentation to
> decipher that error?

Sorry, we are in the same boat -- me neither. It only popped up in our
(openSUSE) build system and I only tracked it down by bisection. Let me
know if I can try something (like a patch or gathering some debug info).

thanks,
--
js
suse labs


2023-06-30 08:36:32

by Jiri Slaby

[permalink] [raw]
Subject: Re: [PATCH v4 29/33] x86/mm: try VMA lock-based page fault handling first

On 30. 06. 23, 8:35, Jiri Slaby wrote:
> On 29. 06. 23, 17:30, Suren Baghdasaryan wrote:
>> On Thu, Jun 29, 2023 at 7:40 AM Jiri Slaby <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> On 27. 02. 23, 18:36, Suren Baghdasaryan wrote:
>>>> Attempt VMA lock-based page fault handling first, and fall back to the
>>>> existing mmap_lock-based handling if that fails.
>>>>
>>>> Signed-off-by: Suren Baghdasaryan <[email protected]>
>>>> ---
>>>>    arch/x86/Kconfig    |  1 +
>>>>    arch/x86/mm/fault.c | 36 ++++++++++++++++++++++++++++++++++++
>>>>    2 files changed, 37 insertions(+)
>>>>
>>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>>> index a825bf031f49..df21fba77db1 100644
>>>> --- a/arch/x86/Kconfig
>>>> +++ b/arch/x86/Kconfig
>>>> @@ -27,6 +27,7 @@ config X86_64
>>>>        # Options that are inherently 64-bit kernel only:
>>>>        select ARCH_HAS_GIGANTIC_PAGE
>>>>        select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
>>>> +     select ARCH_SUPPORTS_PER_VMA_LOCK
>>>>        select ARCH_USE_CMPXCHG_LOCKREF
>>>>        select HAVE_ARCH_SOFT_DIRTY
>>>>        select MODULES_USE_ELF_RELA
>>>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>>>> index a498ae1fbe66..e4399983c50c 100644
>>>> --- a/arch/x86/mm/fault.c
>>>> +++ b/arch/x86/mm/fault.c
>>>> @@ -19,6 +19,7 @@
>>>>    #include <linux/uaccess.h>          /*
>>>> faulthandler_disabled()      */
>>>>    #include <linux/efi.h>                      /*
>>>> efi_crash_gracefully_on_page_fault()*/
>>>>    #include <linux/mm_types.h>
>>>> +#include <linux/mm.h>                        /* find_and_lock_vma() */
>>>>
>>>>    #include <asm/cpufeature.h>         /* boot_cpu_has,
>>>> ...            */
>>>>    #include <asm/traps.h>                      /* dotraplinkage,
>>>> ...           */
>>>> @@ -1333,6 +1334,38 @@ void do_user_addr_fault(struct pt_regs *regs,
>>>>        }
>>>>    #endif
>>>>
>>>> +#ifdef CONFIG_PER_VMA_LOCK
>>>> +     if (!(flags & FAULT_FLAG_USER))
>>>> +             goto lock_mmap;
>>>> +
>>>> +     vma = lock_vma_under_rcu(mm, address);
>>>> +     if (!vma)
>>>> +             goto lock_mmap;
>>>> +
>>>> +     if (unlikely(access_error(error_code, vma))) {
>>>> +             vma_end_read(vma);
>>>> +             goto lock_mmap;
>>>> +     }
>>>> +     fault = handle_mm_fault(vma, address, flags |
>>>> FAULT_FLAG_VMA_LOCK, regs);
>>>> +     vma_end_read(vma);
>>>> +
>>>> +     if (!(fault & VM_FAULT_RETRY)) {
>>>> +             count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
>>>> +             goto done;
>>>> +     }
>>>> +     count_vm_vma_lock_event(VMA_LOCK_RETRY);
>>>
>>> This is apparently not strong enough as it causes go build failures
>>> like:
>>>
>>> [  409s] strconv
>>> [  409s] releasep: m=0x579e2000 m->p=0x5781c600 p->m=0x0 p->status=2
>>> [  409s] fatal error: releasep: invalid p state
>>> [  409s]
>>>
>>> [  325s] hash/adler32
>>> [  325s] hash/crc32
>>> [  325s] cmd/internal/codesign
>>> [  336s] fatal error: runtime: out of memory
>>
>> Hi Jiri,
>> Thanks for reporting! I'm not familiar with go builds. Could you
>> please explain the error to me or point me to some documentation to
>> decipher that error?
>
> Sorry, we are in the same boat -- me neither. It only popped up in our
> (openSUSE) build system and I only tracked it down by bisection. Let me
> know if I can try something (like a patch or gathering some debug info).

FWIW, a failed build log:
https://decibel.fi.muni.cz/~xslaby/n/vma/log.txt

and a strace for it:
https://decibel.fi.muni.cz/~xslaby/n/vma/strace.txt

An excerpt from the log:

[ 55s] runtime: marked free object in span 0x7fca6824bec8,
elemsize=192 freeindex=0 (bad use of unsafe.Pointer? try -d=checkptr)
[ 55s] 0xc0002f2000 alloc marked
[ 55s] 0xc0002f20c0 alloc marked
[ 55s] 0xc0002f2180 alloc marked
[ 55s] 0xc0002f2240 free unmarked
[ 55s] 0xc0002f2300 alloc marked
[ 55s] 0xc0002f23c0 alloc marked
[ 55s] 0xc0002f2480 alloc marked
[ 55s] 0xc0002f2540 alloc marked
[ 55s] 0xc0002f2600 alloc marked
[ 55s] 0xc0002f26c0 alloc marked
[ 55s] 0xc0002f2780 alloc marked
[ 55s] 0xc0002f2840 alloc marked
[ 55s] 0xc0002f2900 alloc marked
[ 55s] 0xc0002f29c0 free unmarked
[ 55s] 0xc0002f2a80 alloc marked
[ 55s] 0xc0002f2b40 alloc marked
[ 55s] 0xc0002f2c00 alloc marked
[ 55s] 0xc0002f2cc0 alloc marked
[ 55s] 0xc0002f2d80 alloc marked
[ 55s] 0xc0002f2e40 alloc marked
[ 55s] 0xc0002f2f00 alloc marked
[ 55s] 0xc0002f2fc0 alloc marked
[ 55s] 0xc0002f3080 alloc marked
[ 55s] 0xc0002f3140 alloc marked
[ 55s] 0xc0002f3200 alloc marked
[ 55s] 0xc0002f32c0 alloc marked
[ 55s] 0xc0002f3380 free unmarked
[ 55s] 0xc0002f3440 free marked zombie


An excerpt from strace:
> 2348
clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
child_tid=0x7fcaa6a1b990, parent_tid=0x7fcaa6a1b990, exit_signal=0,
stack=0x7fcaa621b000, stack_size=0x7ffe00, tls=0x7fcaa6a1b6c0} =>
{parent_tid=[2350]}, 88) = 2350

> 2348
clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
child_tid=0x7fcaa5882990, parent_tid=0x7fcaa5882990, exit_signal=0,
stack=0x7fcaa5082000, stack_size=0x7ffe00, tls=0x7fcaa58826c0} =>
{parent_tid=[2351]}, 88) = 2351
> 2350 <... clone3 resumed> => {parent_tid=[2372]}, 88) = 2372
> 2351 <... clone3 resumed> => {parent_tid=[2354]}, 88) = 2354
> 2351 <... clone3 resumed> => {parent_tid=[2357]}, 88) = 2357
> 2354 <... clone3 resumed> => {parent_tid=[2355]}, 88) = 2355
> 2355 <... clone3 resumed> => {parent_tid=[2370]}, 88) = 2370
> 2370 mmap(NULL, 262144, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
> 2370 <... mmap resumed>) = 0x7fca68249000
> 2372 <... clone3 resumed> => {parent_tid=[2384]}, 88) = 2384
> 2384 <... clone3 resumed> => {parent_tid=[2388]}, 88) = 2388
> 2388 <... clone3 resumed> => {parent_tid=[2392]}, 88) = 2392
> 2392 <... clone3 resumed> => {parent_tid=[2395]}, 88) = 2395
> 2395 write(2, "runtime: marked free object in s"..., 36 <unfinished ...>

I.e. IIUC, all are threads (CLONE_VM), and thread 2370 mapped ANON
0x7fca68249000 - 0x7fca6827ffff, and the go runtime in thread 2395
thinks, for some reason, that 0x7fca6824bec8 in that region is "bad".

> thanks,
--
js
suse labs


2023-06-30 09:07:32

by Jiri Slaby

[permalink] [raw]
Subject: Re: [PATCH v4 29/33] x86/mm: try VMA lock-based page fault handling first

On 30. 06. 23, 10:28, Jiri Slaby wrote:
> > 2348
> clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0x7fcaa5882990, parent_tid=0x7fcaa5882990, exit_signal=0, stack=0x7fcaa5082000, stack_size=0x7ffe00, tls=0x7fcaa58826c0} => {parent_tid=[2351]}, 88) = 2351
> > 2350  <... clone3 resumed> => {parent_tid=[2372]}, 88) = 2372
> > 2351  <... clone3 resumed> => {parent_tid=[2354]}, 88) = 2354
> > 2351  <... clone3 resumed> => {parent_tid=[2357]}, 88) = 2357
> > 2354  <... clone3 resumed> => {parent_tid=[2355]}, 88) = 2355
> > 2355  <... clone3 resumed> => {parent_tid=[2370]}, 88) = 2370
> > 2370  mmap(NULL, 262144, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
> > 2370  <... mmap resumed>)               = 0x7fca68249000
> > 2372  <... clone3 resumed> => {parent_tid=[2384]}, 88) = 2384
> > 2384  <... clone3 resumed> => {parent_tid=[2388]}, 88) = 2388
> > 2388  <... clone3 resumed> => {parent_tid=[2392]}, 88) = 2392
> > 2392  <... clone3 resumed> => {parent_tid=[2395]}, 88) = 2395
> > 2395  write(2, "runtime: marked free object in s"..., 36 <unfinished
> ...>
>
> I.e. IIUC, all are threads (CLONE_VM), and thread 2370 mapped ANON
> 0x7fca68249000 - 0x7fca6827ffff, and the go runtime in thread 2395
> thinks, for some reason, that 0x7fca6824bec8 in that region is "bad".

As was pointed out to me, this might as well be a failure of go's
inter-thread communication (or similar). It might just be more exposed
now with VMA-based locks, as they allow more parallelism.

There are older, hard-to-reproduce bugs in go with similar symptoms (we
sometimes see this error now too):
https://github.com/golang/go/issues/15246

Or this 2016 bug is a red herring. Hard to tell...

>> thanks,
--
js
suse labs


2023-06-30 17:51:16

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v4 29/33] x86/mm: try VMA lock-based page fault handling first

On Fri, Jun 30, 2023 at 1:43 AM Jiri Slaby <[email protected]> wrote:
>
> On 30. 06. 23, 10:28, Jiri Slaby wrote:
> > > 2348
> > clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0x7fcaa5882990, parent_tid=0x7fcaa5882990, exit_signal=0, stack=0x7fcaa5082000, stack_size=0x7ffe00, tls=0x7fcaa58826c0} => {parent_tid=[2351]}, 88) = 2351
> > > 2350 <... clone3 resumed> => {parent_tid=[2372]}, 88) = 2372
> > > 2351 <... clone3 resumed> => {parent_tid=[2354]}, 88) = 2354
> > > 2351 <... clone3 resumed> => {parent_tid=[2357]}, 88) = 2357
> > > 2354 <... clone3 resumed> => {parent_tid=[2355]}, 88) = 2355
> > > 2355 <... clone3 resumed> => {parent_tid=[2370]}, 88) = 2370
> > > 2370 mmap(NULL, 262144, PROT_READ|PROT_WRITE,
> > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
> > > 2370 <... mmap resumed>) = 0x7fca68249000
> > > 2372 <... clone3 resumed> => {parent_tid=[2384]}, 88) = 2384
> > > 2384 <... clone3 resumed> => {parent_tid=[2388]}, 88) = 2388
> > > 2388 <... clone3 resumed> => {parent_tid=[2392]}, 88) = 2392
> > > 2392 <... clone3 resumed> => {parent_tid=[2395]}, 88) = 2395
> > > 2395 write(2, "runtime: marked free object in s"..., 36 <unfinished
> > ...>
> >
> > I.e. IIUC, all are threads (CLONE_VM), and thread 2370 mapped ANON
> > 0x7fca68249000 - 0x7fca6827ffff, and the go runtime in thread 2395
> > thinks, for some reason, that 0x7fca6824bec8 in that region is "bad".

Thanks for the analysis Jiri.
Is it possible from these logs to identify whether 2370 finished the
mmap operation before 2395 tried to access 0x7fca6824bec8? That access
has to happen only after mmap finishes mapping the region.

>
> As was pointed out to me, this might as well be a failure of go's
> inter-thread communication (or similar). It might just be more exposed
> now with VMA-based locks, as they allow more parallelism.

Yes, with multithreaded processes like these, where threads are mapping
and accessing memory areas, per-VMA locks should allow for greater
parallelism. So, if there is a race like the one I asked about above,
it might become more pronounced with per-VMA locks.

I'll double check the code, but from the kernel's POV mmap would take
the mmap_lock for write and then lock the VMA for write. That should
prevent any page fault handlers from accessing this VMA in parallel
until the writer releases the locks. The page fault path would find the
VMA without any lock and then try to read-lock that VMA. If that fails
it will fall back to mmap_lock. So, if the writer started first and
obtained the VMA lock, the reader will fall back to mmap_lock and will
block until the writer releases the mmap_lock. If the reader got the
VMA read lock first, then the writer will block while obtaining the
VMA's write lock. However, in your scenario the reader (page fault)
might be getting here before the writer (mmap) and, upon not finding
the VMA it is looking for, it will fail.
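
To make the ordering above concrete, here is a condensed sketch of the
writer-side helpers (the names are the patchset's; the bodies are
simplified illustrations, not the verbatim kernel code):

	/* Writer side: always runs under mmap_write_lock(mm). */
	static inline void vma_start_write(struct vm_area_struct *vma)
	{
		mmap_assert_write_locked(vma->vm_mm);

		/* Already write-locked during this mm update? */
		if (vma->vm_lock_seq == vma->vm_mm->mm_lock_seq)
			return;

		/* Waits for all current vma_start_read() holders to drop out. */
		down_write(&vma->lock);
		vma->vm_lock_seq = vma->vm_mm->mm_lock_seq;
		up_write(&vma->lock);
	}

	/* Called from mmap_write_unlock(): unlocks all locked VMAs at once. */
	static inline void vma_end_write_all(struct mm_struct *mm)
	{
		mmap_assert_write_locked(mm);
		/* No races: mmap_lock is still held for write here. */
		WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
	}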
Please let me know if you can verify this scenario.
Thanks,
Suren.

>
> There are older, hard-to-reproduce bugs in go with similar symptoms (we
> sometimes see this error now too):
> https://github.com/golang/go/issues/15246
>
> Or this 2016 bug is a red herring. Hard to tell...
>
> >> thanks,
> --
> js
> suse labs
>

Subject: Re: [PATCH v4 29/33] x86/mm: try VMA lock-based page fault handling first

On 29.06.23 16:40, Jiri Slaby wrote:
> On 27. 02. 23, 18:36, Suren Baghdasaryan wrote:
>> Attempt VMA lock-based page fault handling first, and fall back to the
>> existing mmap_lock-based handling if that fails.
>>
>> Signed-off-by: Suren Baghdasaryan <[email protected]>
>> ---
>>   arch/x86/Kconfig    |  1 +
>>   arch/x86/mm/fault.c | 36 ++++++++++++++++++++++++++++++++++++
>>   2 files changed, 37 insertions(+)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index a825bf031f49..df21fba77db1 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -27,6 +27,7 @@ config X86_64
>>       # Options that are inherently 64-bit kernel only:
>>       select ARCH_HAS_GIGANTIC_PAGE
>>       select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
>> +    select ARCH_SUPPORTS_PER_VMA_LOCK
>>       select ARCH_USE_CMPXCHG_LOCKREF
>>       select HAVE_ARCH_SOFT_DIRTY
>>       select MODULES_USE_ELF_RELA
>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>> index a498ae1fbe66..e4399983c50c 100644
>> --- a/arch/x86/mm/fault.c
>> +++ b/arch/x86/mm/fault.c
>> @@ -19,6 +19,7 @@
>>   #include <linux/uaccess.h>        /* faulthandler_disabled()    */
>>   #include <linux/efi.h>            /*
>> efi_crash_gracefully_on_page_fault()*/
>>   #include <linux/mm_types.h>
>> +#include <linux/mm.h>            /* find_and_lock_vma() */
>>     #include <asm/cpufeature.h>        /* boot_cpu_has, ...        */
>>   #include <asm/traps.h>            /* dotraplinkage, ...        */
>> @@ -1333,6 +1334,38 @@ void do_user_addr_fault(struct pt_regs *regs,
>>       }
>>   #endif
>>   +#ifdef CONFIG_PER_VMA_LOCK
>> +    if (!(flags & FAULT_FLAG_USER))
>> +        goto lock_mmap;
>> +
>> +    vma = lock_vma_under_rcu(mm, address);
>> +    if (!vma)
>> +        goto lock_mmap;
>> +
>> +    if (unlikely(access_error(error_code, vma))) {
>> +        vma_end_read(vma);
>> +        goto lock_mmap;
>> +    }
>> +    fault = handle_mm_fault(vma, address, flags |
>> FAULT_FLAG_VMA_LOCK, regs);
>> +    vma_end_read(vma);
>> +
>> +    if (!(fault & VM_FAULT_RETRY)) {
>> +        count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
>> +        goto done;
>> +    }
>> +    count_vm_vma_lock_event(VMA_LOCK_RETRY);
>
> This is apparently not strong enough as it causes go build failures like:

TWIMC & for the record: there is another report about trouble caused by
this change; for details see

https://bugzilla.kernel.org/show_bug.cgi?id=217624

And a "forward to devs and lists" thread about that report:

https://lore.kernel.org/all/[email protected]/

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

> [  409s] strconv
> [  409s] releasep: m=0x579e2000 m->p=0x5781c600 p->m=0x0 p->status=2
> [  409s] fatal error: releasep: invalid p state
> [  409s]
>
> [  325s] hash/adler32
> [  325s] hash/crc32
> [  325s] cmd/internal/codesign
> [  336s] fatal error: runtime: out of memory
>
> There are many kinds of similar errors. It happens in only 1-3 out of
> 20 builds.
>
> If I revert the commit on top of 6.4, they all disappear. Any idea?
>
> The downstream report:
> https://bugzilla.suse.com/show_bug.cgi?id=1212775
>
>> +
>> +    /* Quick path to respond to signals */
>> +    if (fault_signal_pending(fault, regs)) {
>> +        if (!user_mode(regs))
>> +            kernelmode_fixup_or_oops(regs, error_code, address,
>> +                         SIGBUS, BUS_ADRERR,
>> +                         ARCH_DEFAULT_PKEY);
>> +        return;
>> +    }
>> +lock_mmap:
>> +#endif /* CONFIG_PER_VMA_LOCK */
>> +
>>       /*
>>        * Kernel-mode access to the user address space should only occur
>>        * on well-defined single instructions listed in the exception
>> @@ -1433,6 +1466,9 @@ void do_user_addr_fault(struct pt_regs *regs,
>>       }
>>         mmap_read_unlock(mm);
>> +#ifdef CONFIG_PER_VMA_LOCK
>> +done:
>> +#endif
>>       if (likely(!(fault & VM_FAULT_ERROR)))
>>           return;
>>  
>
> thanks,

2023-07-03 10:55:51

by Jiri Slaby

[permalink] [raw]
Subject: Re: [PATCH v4 29/33] x86/mm: try VMA lock-based page fault handling first

Cc Jacob Young (from kernel bugzilla)

On 30. 06. 23, 19:40, Suren Baghdasaryan wrote:
> On Fri, Jun 30, 2023 at 1:43 AM Jiri Slaby <[email protected]> wrote:
>>
>> On 30. 06. 23, 10:28, Jiri Slaby wrote:
>>> > 2348
>>> clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0x7fcaa5882990, parent_tid=0x7fcaa5882990, exit_signal=0, stack=0x7fcaa5082000, stack_size=0x7ffe00, tls=0x7fcaa58826c0} => {parent_tid=[2351]}, 88) = 2351
>>> > 2350 <... clone3 resumed> => {parent_tid=[2372]}, 88) = 2372
>>> > 2351 <... clone3 resumed> => {parent_tid=[2354]}, 88) = 2354
>>> > 2351 <... clone3 resumed> => {parent_tid=[2357]}, 88) = 2357
>>> > 2354 <... clone3 resumed> => {parent_tid=[2355]}, 88) = 2355
>>> > 2355 <... clone3 resumed> => {parent_tid=[2370]}, 88) = 2370
>>> > 2370 mmap(NULL, 262144, PROT_READ|PROT_WRITE,
>>> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
>>> > 2370 <... mmap resumed>) = 0x7fca68249000
>>> > 2372 <... clone3 resumed> => {parent_tid=[2384]}, 88) = 2384
>>> > 2384 <... clone3 resumed> => {parent_tid=[2388]}, 88) = 2388
>>> > 2388 <... clone3 resumed> => {parent_tid=[2392]}, 88) = 2392
>>> > 2392 <... clone3 resumed> => {parent_tid=[2395]}, 88) = 2395
>>> > 2395 write(2, "runtime: marked free object in s"..., 36 <unfinished
>>> ...>
>>>
>>> I.e. IIUC, all are threads (CLONE_VM), and thread 2370 mapped ANON
>>> 0x7fca68249000 - 0x7fca6827ffff, and the go runtime in thread 2395
>>> thinks, for some reason, that 0x7fca6824bec8 in that region is "bad".
>
> Thanks for the analysis Jiri.
> Is it possible from these logs to identify whether 2370 finished the
> mmap operation before 2395 tried to access 0x7fca6824bec8? That access
> has to happen only after mmap finishes mapping the region.

Hi,

it's hard to tell, but I assume so.

For now, forget about this overly complicated, hard-to-reproduce go
case and concentrate on the very nice reduced testcase in:
https://bugzilla.kernel.org/show_bug.cgi?id=217624
;)

FWIW, I can reproduce using the test case too.

thanks,
--
js
suse labs


2023-07-11 10:57:20

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v4 00/33] Per-VMA locks

On 7/11/23 12:35, Leon Romanovsky wrote:
>
> On Mon, Feb 27, 2023 at 09:35:59AM -0800, Suren Baghdasaryan wrote:
>
> <...>
>
>> Laurent Dufour (1):
>> powerc/mm: try VMA lock-based page fault handling first
>
> Hi,
>
> > This series, and specifically the commit above, broke docker on PPC.
> > It causes the docker service to get stuck while trying to activate.
> > Reverting this commit allows us to use docker again.

Hi,

there have been follow-up fixes that are part of 6.4.3 stable (also
6.5-rc1). Does that version work for you?

Vlastimil

> [user@ppc-135-3-200-205 ~]# sudo systemctl status docker
> ● docker.service - Docker Application Container Engine
> Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
> Active: activating (start) since Mon 2023-06-26 14:47:07 IDT; 3h 50min ago
> TriggeredBy: ● docker.socket
> Docs: https://docs.docker.com
> Main PID: 276555 (dockerd)
> Memory: 44.2M
> CGroup: /system.slice/docker.service
> └─ 276555 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
>
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129383166+03:00" level=info msg="Graph migration to content-addressability took 0.00 se>
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129666160+03:00" level=warning msg="Your kernel does not support cgroup cfs period"
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129684117+03:00" level=warning msg="Your kernel does not support cgroup cfs quotas"
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129697085+03:00" level=warning msg="Your kernel does not support cgroup rt period"
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129711513+03:00" level=warning msg="Your kernel does not support cgroup rt runtime"
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129720656+03:00" level=warning msg="Unable to find blkio cgroup in mounts"
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129805617+03:00" level=warning msg="mountpoint for pids not found"
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.130199070+03:00" level=info msg="Loading containers: start."
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.132688568+03:00" level=warning msg="Running modprobe bridge br_netfilter failed with me>
> Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.271014050+03:00" level=info msg="Default bridge (docker0) is assigned with an IP addres>
>
> Python script which we used for bisect:
>
> import subprocess
> import time
> import sys
>
>
> def run_command(cmd):
>     print('running:', cmd)
>
>     p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
>
>     try:
>         stdout, stderr = p.communicate(timeout=30)
>
>     except subprocess.TimeoutExpired:
>         return True
>
>     print(stdout.decode())
>     print(stderr.decode())
>     print('rc:', p.returncode)
>
>     return False
>
>
> def main():
>     commands = [
>         'sudo systemctl stop docker',
>         'sudo systemctl status docker',
>         'sudo systemctl is-active docker',
>         'sudo systemctl start docker',
>         'sudo systemctl status docker',
>     ]
>
>     for i in range(1000):
>         title = f'Try no. {i + 1}'
>         print('*' * 50, title, '*' * 50)
>
>         for cmd in commands:
>             if run_command(cmd):
>                 print(f'Reproduced on try no. {i + 1}!')
>                 print(f'"{cmd}" is stuck!')
>
>                 return 1
>
>         print('\n')
>         time.sleep(30)
>     return 0
>
> if __name__ == '__main__':
>     sys.exit(main())
>
> Thanks


2023-07-11 11:11:18

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH v4 00/33] Per-VMA locks


On Mon, Feb 27, 2023 at 09:35:59AM -0800, Suren Baghdasaryan wrote:

<...>

> Laurent Dufour (1):
> powerc/mm: try VMA lock-based page fault handling first

Hi,

This series, and specifically the commit above, broke docker on PPC.
It causes the docker service to get stuck while trying to activate.
Reverting this commit allows us to use docker again.

[user@ppc-135-3-200-205 ~]# sudo systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: activating (start) since Mon 2023-06-26 14:47:07 IDT; 3h 50min ago
TriggeredBy: ● docker.socket
Docs: https://docs.docker.com
Main PID: 276555 (dockerd)
Memory: 44.2M
CGroup: /system.slice/docker.service
└─ 276555 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129383166+03:00" level=info msg="Graph migration to content-addressability took 0.00 se>
Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129666160+03:00" level=warning msg="Your kernel does not support cgroup cfs period"
Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129684117+03:00" level=warning msg="Your kernel does not support cgroup cfs quotas"
Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129697085+03:00" level=warning msg="Your kernel does not support cgroup rt period"
Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129711513+03:00" level=warning msg="Your kernel does not support cgroup rt runtime"
Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129720656+03:00" level=warning msg="Unable to find blkio cgroup in mounts"
Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.129805617+03:00" level=warning msg="mountpoint for pids not found"
Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.130199070+03:00" level=info msg="Loading containers: start."
Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.132688568+03:00" level=warning msg="Running modprobe bridge br_netfilter failed with me>
Jun 26 14:47:07 ppc-135-3-200-205 dockerd[276555]: time="2023-06-26T14:47:07.271014050+03:00" level=info msg="Default bridge (docker0) is assigned with an IP addres>

The Python script we used for the bisect:

import subprocess
import time
import sys


def run_command(cmd):
    print('running:', cmd)

    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    try:
        stdout, stderr = p.communicate(timeout=30)

    except subprocess.TimeoutExpired:
        # The command did not finish within 30 seconds: treat it as stuck.
        return True

    print(stdout.decode())
    print(stderr.decode())
    print('rc:', p.returncode)

    return False


def main():
    # Cycle the docker service and watch for any command that hangs.
    commands = [
        'sudo systemctl stop docker',
        'sudo systemctl status docker',
        'sudo systemctl is-active docker',
        'sudo systemctl start docker',
        'sudo systemctl status docker',
    ]

    for i in range(1000):
        title = f'Try no. {i + 1}'
        print('*' * 50, title, '*' * 50)

        for cmd in commands:
            if run_command(cmd):
                print(f'Reproduced on try no. {i + 1}!')
                print(f'"{cmd}" is stuck!')

                return 1

        print('\n')
        time.sleep(30)

    return 0


if __name__ == '__main__':
    sys.exit(main())

Thanks

2023-07-11 11:29:11

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH v4 00/33] Per-VMA locks

On Tue, Jul 11, 2023 at 12:39:34PM +0200, Vlastimil Babka wrote:
> On 7/11/23 12:35, Leon Romanovsky wrote:
> >
> > On Mon, Feb 27, 2023 at 09:35:59AM -0800, Suren Baghdasaryan wrote:
> >
> > <...>
> >
> >> Laurent Dufour (1):
> >> powerc/mm: try VMA lock-based page fault handling first
> >
> > Hi,
> >
> > This series, and specifically the commit above, broke docker on PPC.
> > It causes the docker service to get stuck while trying to activate.
> > Reverting this commit allows us to use docker again.
>
> Hi,
>
> there have been follow-up fixes that are part of 6.4.3 stable (also
> 6.5-rc1). Does that version work for you?

I'll recheck it again on a clean system, but for the record:
1. We are running 6.5-rc1 kernels.
2. PPC doesn't compile for us on -rc1 without this fix.
https://lore.kernel.org/all/20230629124500.1.I55e2f4e7903d686c4484cb23c033c6a9e1a9d4c4@changeid/
3. I didn't see anything relevant in -rc1 with "git log arch/powerpc/mm/fault.c".

Do you have in mind anything specific to check?

Thanks

2023-07-11 11:29:21

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH v4 00/33] Per-VMA locks

On Tue, Jul 11, 2023 at 02:01:41PM +0300, Leon Romanovsky wrote:
> On Tue, Jul 11, 2023 at 12:39:34PM +0200, Vlastimil Babka wrote:
> > On 7/11/23 12:35, Leon Romanovsky wrote:
> > >
> > > On Mon, Feb 27, 2023 at 09:35:59AM -0800, Suren Baghdasaryan wrote:
> > >
> > > <...>
> > >
> > >> Laurent Dufour (1):
> > >> powerc/mm: try VMA lock-based page fault handling first
> > >
> > > Hi,
> > >
> > > This series, and specifically the commit above, broke docker on PPC.
> > > It causes the docker service to get stuck while trying to activate.
> > > Reverting this commit allows us to use docker again.
> >
> > Hi,
> >
> > there have been follow-up fixes that are part of 6.4.3 stable (also
> > 6.5-rc1). Does that version work for you?
>
> I'll recheck it again on a clean system, but for the record:
> 1. We are running 6.5-rc1 kernels.
> 2. PPC doesn't compile for us on -rc1 without this fix.
> https://lore.kernel.org/all/20230629124500.1.I55e2f4e7903d686c4484cb23c033c6a9e1a9d4c4@changeid/

Ohh, I see it in -rc1, let's recheck.

> 3. I didn't see anything relevant in -rc1 with "git log arch/powerpc/mm/fault.c".
>
> Do you have in mind anything specific to check?
>
> Thanks
>

2023-07-11 17:03:28

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v4 00/33] Per-VMA locks

On Tue, Jul 11, 2023 at 4:09 AM Leon Romanovsky <[email protected]> wrote:
>
> On Tue, Jul 11, 2023 at 02:01:41PM +0300, Leon Romanovsky wrote:
> > On Tue, Jul 11, 2023 at 12:39:34PM +0200, Vlastimil Babka wrote:
> > > On 7/11/23 12:35, Leon Romanovsky wrote:
> > > >
> > > > On Mon, Feb 27, 2023 at 09:35:59AM -0800, Suren Baghdasaryan wrote:
> > > >
> > > > <...>
> > > >
> > > >> Laurent Dufour (1):
> > > >> powerc/mm: try VMA lock-based page fault handling first
> > > >
> > > > Hi,
> > > >
> > > > This series, and specifically the commit above, broke docker on PPC.
> > > > It causes the docker service to get stuck while trying to activate.
> > > > Reverting this commit allows us to use docker again.
> > >
> > > Hi,
> > >
> > > there have been follow-up fixes that are part of 6.4.3 stable (also
> > > 6.5-rc1). Does that version work for you?
> >
> > I'll recheck it again on a clean system, but for the record:
> > 1. We are running 6.5-rc1 kernels.
> > 2. PPC doesn't compile for us on -rc1 without this fix.
> > https://lore.kernel.org/all/20230629124500.1.I55e2f4e7903d686c4484cb23c033c6a9e1a9d4c4@changeid/
>
> Ohh, I see it in -rc1, let's recheck.

Hi Leon,
Please let us know how it goes.

>
> > 3. I didn't see anything relevant in -rc1 with "git log arch/powerpc/mm/fault.c".

The fixes Vlastimil was referring to are not in fault.c; they are in
the core mm and fork code. More specifically, check that these patches
exist in the branch you are testing (a quick way to do that is
sketched after the list):

mm: lock newly mapped VMA with corrected ordering
fork: lock VMAs of the parent process when forking
mm: lock newly mapped VMA which can be modified after it becomes visible
mm: lock a vma before stack expansion
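
For example, a minimal sketch of such a check, run from inside the
kernel tree (it assumes the branch under test is currently checked
out; adjust the commit range if needed):

for p in \
    "mm: lock newly mapped VMA with corrected ordering" \
    "fork: lock VMAs of the parent process when forking" \
    "mm: lock newly mapped VMA which can be modified after it becomes visible" \
    "mm: lock a vma before stack expansion"
do
    # Match the subject as a fixed string; any hit means the patch is present.
    git log --oneline --fixed-strings --grep="$p" | grep -q . \
        && echo "found:   $p" || echo "MISSING: $p"
done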

Thanks,
Suren.

> >
> > Do you have in mind anything specific to check?
> >
> > Thanks
> >

2023-07-11 17:41:51

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH v4 00/33] Per-VMA locks

On Tue, Jul 11, 2023 at 09:35:13AM -0700, Suren Baghdasaryan wrote:
> On Tue, Jul 11, 2023 at 4:09 AM Leon Romanovsky <[email protected]> wrote:
> >
> > On Tue, Jul 11, 2023 at 02:01:41PM +0300, Leon Romanovsky wrote:
> > > On Tue, Jul 11, 2023 at 12:39:34PM +0200, Vlastimil Babka wrote:
> > > > On 7/11/23 12:35, Leon Romanovsky wrote:
> > > > >
> > > > > On Mon, Feb 27, 2023 at 09:35:59AM -0800, Suren Baghdasaryan wrote:
> > > > >
> > > > > <...>
> > > > >
> > > > >> Laurent Dufour (1):
> > > > >> powerc/mm: try VMA lock-based page fault handling first
> > > > >
> > > > > Hi,
> > > > >
> > > > > This series, and specifically the commit above, broke docker on PPC.
> > > > > It causes the docker service to get stuck while trying to activate.
> > > > > Reverting this commit allows us to use docker again.
> > > >
> > > > Hi,
> > > >
> > > > there have been follow-up fixes that are part of 6.4.3 stable (also
> > > > 6.5-rc1). Does that version work for you?
> > >
> > > I'll recheck it again on a clean system, but for the record:
> > > 1. We are running 6.5-rc1 kernels.
> > > 2. PPC doesn't compile for us on -rc1 without this fix.
> > > https://lore.kernel.org/all/20230629124500.1.I55e2f4e7903d686c4484cb23c033c6a9e1a9d4c4@changeid/
> >
> > Ohh, I see it in -rc1, let's recheck.
>
> Hi Leon,
> Please let us know how it goes.

Once we rebuilt a clean -rc1, docker worked for us.
Sorry for the noise.

>
> >
> > > 3. I didn't see anything relevant in -rc1 with "git log arch/powerpc/mm/fault.c".
>
> The fixes Vlastimil was referring to are not in fault.c; they are in
> the core mm and fork code. More specifically, check that these patches
> exist in the branch you are testing:
>
> mm: lock newly mapped VMA with corrected ordering
> fork: lock VMAs of the parent process when forking
> mm: lock newly mapped VMA which can be modified after it becomes visible
> mm: lock a vma before stack expansion

Thanks

>
> Thanks,
> Suren.
>
> > >
> > > Do you have in mind anything specific to check?
> > >
> > > Thanks
> > >