2023-08-01 23:22:28

by Suren Baghdasaryan

Subject: [PATCH v2 0/6] make vma locking more obvious

During recent vma locking patch reviews Linus and Jann Horn noted a number
of issues with vma locking and suggested improvements:

1. walk_page_range() does not have the ability to write-lock a vma during the
walk when it is done under mmap_write_lock, for example in s390_reset_cmma().

2. Vma locking is hidden inside vm_flags modifiers and is hard to follow.
The suggestion is to change vm_flags_reset{_once} to assert that the vma is
write-locked and to require explicit locking by the caller.

3. Same issue with vma_prepare() hiding vma locking.

4. In userfaultfd, vm_flags are modified after vma->vm_userfaultfd_ctx, and
page faults can operate on a context while it is being changed.

5. do_brk_flags() and __install_special_mapping() do not lock a newly
created vma before adding it into the mm. While not strictly a problem,
this is fragile if the vma is modified after insertion, as in the
mmap_region() case which was recently fixed. The suggestion is to always lock
a new vma before inserting it and making it visible to page faults.

6. vma_assert_write_locked() for CONFIG_PER_VMA_LOCK=n would benefit from
being mmap_assert_write_locked() instead of a no-op; then any place which
operates on a vma and calls mmap_assert_write_locked() can be converted
into vma_assert_write_locked() (see the sketch after this list).
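
To make items 2 and 6 concrete, here is a minimal sketch of the end state,
condensed from patches 2/6 and 4/6 below:

  /* patch 2/6: the CONFIG_PER_VMA_LOCK=n stub asserts on the mmap lock
   * instead of being a no-op */
  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
	{ mmap_assert_write_locked(vma->vm_mm); }

  /* patch 4/6: vm_flags_reset() only asserts, so callers write-lock the
   * vma explicitly before resetting flags */
  vma_start_write(vma);
  vm_flags_reset(vma, new_flags);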

I CC'ed stable only on the first patch because the others are cleanups and the
bug in userfaultfd does not affect stable (lock_vma_under_rcu prevents
uffds from being handled under vma lock protection). However, I would be
happy if the whole series were merged into stable 6.4 since it makes vma
locking more maintainable.

The patches apply cleanly over Linus' ToT and will conflict when applied
over mm-unstable due to missing [1]. The conflict can be easily resolved
by ignoring the conflicting deletions, but it is probably simpler to take [1]
into mm-unstable and avoid the later conflict.

[1] commit 6c21e066f925 ("mm/mempolicy: Take VMA lock before replacing policy")

Changes since v1:
- replaced the walk_page_range() parameter with mm_walk_ops.walk_lock,
per Linus
- introduced a page_walk_lock enum to allow different locking modes
during a walk, per Linus (see the sketch below)
- added Liam's Reviewed-by
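
Patch 1/6 is not quoted in this thread, so the following is only a rough
sketch of the walk_lock interface mentioned above; the enum values and the
callback name are assumptions based on how the merged series looks and may
differ from this posting:

  /* include/linux/pagewalk.h (assumed shape) */
  enum page_walk_lock {
	/* mmap_lock is held for read to stabilize the vma tree */
	PGWALK_RDLOCK = 0,
	/* each vma is write-locked during the walk */
	PGWALK_WRLOCK = 1,
	/* each vma is expected to be already write-locked */
	PGWALK_WRLOCK_VERIFY = 2,
  };

  /* a walker running under mmap_write_lock, e.g. s390_reset_cmma(), would
   * then request vma write-locking via its ops (callback name hypothetical) */
  static const struct mm_walk_ops reset_cmma_walk_ops = {
	.pte_entry	= reset_cmma_pte_entry,
	.walk_lock	= PGWALK_WRLOCK,
  };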

Suren Baghdasaryan (6):
mm: enable page walking API to lock vmas during the walk
mm: for !CONFIG_PER_VMA_LOCK equate write lock assertion for vma and
mmap
mm: replace mmap with vma write lock assertions when operating on a
vma
mm: lock vma explicitly before doing vm_flags_reset and
vm_flags_reset_once
mm: always lock new vma before inserting into vma tree
mm: move vma locking out of vma_prepare

arch/powerpc/kvm/book3s_hv_uvmem.c | 1 +
arch/powerpc/mm/book3s64/subpage_prot.c | 1 +
arch/riscv/mm/pageattr.c | 1 +
arch/s390/mm/gmap.c | 5 ++++
drivers/infiniband/hw/hfi1/file_ops.c | 1 +
fs/proc/task_mmu.c | 5 ++++
fs/userfaultfd.c | 6 +++++
include/linux/mm.h | 13 ++++++---
include/linux/pagewalk.h | 11 ++++++++
mm/damon/vaddr.c | 2 ++
mm/hmm.c | 1 +
mm/hugetlb.c | 2 +-
mm/khugepaged.c | 5 ++--
mm/ksm.c | 25 ++++++++++-------
mm/madvise.c | 8 +++---
mm/memcontrol.c | 2 ++
mm/memory-failure.c | 1 +
mm/memory.c | 2 +-
mm/mempolicy.c | 22 +++++++++------
mm/migrate_device.c | 1 +
mm/mincore.c | 1 +
mm/mlock.c | 4 ++-
mm/mmap.c | 29 +++++++++++++-------
mm/mprotect.c | 2 ++
mm/pagewalk.c | 36 ++++++++++++++++++++++---
mm/vmscan.c | 1 +
26 files changed, 146 insertions(+), 42 deletions(-)

--
2.41.0.585.gd2178a4bd4-goog



2023-08-01 23:26:02

by Suren Baghdasaryan

Subject: [PATCH v2 2/6] mm: for !CONFIG_PER_VMA_LOCK equate write lock assertion for vma and mmap

When CONFIG_PER_VMA_LOCK=n, vma_assert_write_locked() should be equivalent
to mmap_assert_write_locked().

Suggested-by: Jann Horn <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 406ab9ea818f..262b5f44101d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -750,7 +750,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
static inline void vma_start_write(struct vm_area_struct *vma) {}
static inline bool vma_try_start_write(struct vm_area_struct *vma)
{ return true; }
-static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
+static inline void vma_assert_write_locked(struct vm_area_struct *vma)
+ { mmap_assert_write_locked(vma->vm_mm); }
static inline void vma_mark_detached(struct vm_area_struct *vma,
bool detached) {}

--
2.41.0.585.gd2178a4bd4-goog


2023-08-01 23:27:23

by Suren Baghdasaryan

Subject: [PATCH v2 6/6] mm: move vma locking out of vma_prepare

vma_prepare() is currently the central place where vmas are being locked
before vma_complete() applies changes to them. While this is convenient,
it also obscures vma locking and makes it hard to follow the locking rules.
Move vma locking out of vma_prepare() and take vma locks explicitly at the
locations where vmas are being modified.
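
As a minimal sketch of the resulting pattern (condensed from the vma_shrink()
hunk below), the caller now takes the vma write lock itself before preparing
the change:

  vma_start_write(vma);		/* explicit; was hidden inside vma_prepare() */

  init_vma_prep(&vp, vma);
  vma_prepare(&vp);		/* no longer takes any vma locks */
  vma_adjust_trans_huge(vma, start, end, 0);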

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
---
mm/mmap.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 850a39dee075..e59d83cb1d7a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -476,16 +476,6 @@ static inline void init_vma_prep(struct vma_prepare *vp,
*/
static inline void vma_prepare(struct vma_prepare *vp)
{
- vma_start_write(vp->vma);
- if (vp->adj_next)
- vma_start_write(vp->adj_next);
- if (vp->insert)
- vma_start_write(vp->insert);
- if (vp->remove)
- vma_start_write(vp->remove);
- if (vp->remove2)
- vma_start_write(vp->remove2);
-
if (vp->file) {
uprobe_munmap(vp->vma, vp->vma->vm_start, vp->vma->vm_end);

@@ -650,6 +640,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
bool remove_next = false;
struct vma_prepare vp;

+ vma_start_write(vma);
if (next && (vma != next) && (end == next->vm_end)) {
int ret;

@@ -657,6 +648,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
ret = dup_anon_vma(vma, next);
if (ret)
return ret;
+ vma_start_write(next);
}

init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL);
@@ -708,6 +700,8 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (vma_iter_prealloc(vmi))
return -ENOMEM;

+ vma_start_write(vma);
+
init_vma_prep(&vp, vma);
vma_prepare(&vp);
vma_adjust_trans_huge(vma, start, end, 0);
@@ -946,10 +940,12 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
/* Can we merge both the predecessor and the successor? */
if (merge_prev && merge_next &&
is_mergeable_anon_vma(prev->anon_vma, next->anon_vma, NULL)) {
+ vma_start_write(next);
remove = next; /* case 1 */
vma_end = next->vm_end;
err = dup_anon_vma(prev, next);
if (curr) { /* case 6 */
+ vma_start_write(curr);
remove = curr;
remove2 = next;
if (!next->anon_vma)
@@ -958,6 +954,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
} else if (merge_prev) { /* case 2 */
if (curr) {
err = dup_anon_vma(prev, curr);
+ vma_start_write(curr);
if (end == curr->vm_end) { /* case 7 */
remove = curr;
} else { /* case 5 */
@@ -969,6 +966,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
res = next;
if (prev && addr < prev->vm_end) { /* case 4 */
vma_end = addr;
+ vma_start_write(next);
adjust = next;
adj_start = -(prev->vm_end - addr);
err = dup_anon_vma(next, prev);
@@ -983,6 +981,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
vma_pgoff = next->vm_pgoff - pglen;
if (curr) { /* case 8 */
vma_pgoff = curr->vm_pgoff;
+ vma_start_write(curr);
remove = curr;
err = dup_anon_vma(next, curr);
}
@@ -996,6 +995,8 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
if (vma_iter_prealloc(vmi))
return NULL;

+ vma_start_write(vma);
+
init_multi_vma_prep(&vp, vma, adjust, remove, remove2);
VM_WARN_ON(vp.anon_vma && adjust && adjust->anon_vma &&
vp.anon_vma != adjust->anon_vma);
@@ -2373,6 +2374,9 @@ int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (new->vm_ops && new->vm_ops->open)
new->vm_ops->open(new);

+ vma_start_write(vma);
+ vma_start_write(new);
+
init_vma_prep(&vp, vma);
vp.insert = new;
vma_prepare(&vp);
@@ -3078,6 +3082,8 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (vma_iter_prealloc(vmi))
goto unacct_fail;

+ vma_start_write(vma);
+
init_vma_prep(&vp, vma);
vma_prepare(&vp);
vma_adjust_trans_huge(vma, vma->vm_start, addr + len, 0);
--
2.41.0.585.gd2178a4bd4-goog


2023-08-01 23:31:08

by Suren Baghdasaryan

Subject: [PATCH v2 5/6] mm: always lock new vma before inserting into vma tree

While it's not strictly necessary to lock a newly created vma before
adding it into the vma tree (as long as no further changes are performed
on it), it seems like a good policy to lock it and prevent accidental
changes after it becomes visible to page faults. Lock the vma before
adding it into the vma tree.
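
A minimal sketch of the policy, condensed from the do_brk_flags() hunk below:
the new vma is write-locked right before it is stored in the tree and becomes
reachable by page faults:

  vm_flags_init(vma, flags);
  vma->vm_page_prot = vm_get_page_prot(flags);
  vma_start_write(vma);		/* lock before the vma becomes visible */
  if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
	goto mas_store_fail;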

Suggested-by: Jann Horn <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/mmap.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 3937479d0e07..850a39dee075 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -412,6 +412,8 @@ static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
if (vma_iter_prealloc(&vmi))
return -ENOMEM;

+ vma_start_write(vma);
+
if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
i_mmap_lock_write(mapping);
@@ -477,7 +479,8 @@ static inline void vma_prepare(struct vma_prepare *vp)
vma_start_write(vp->vma);
if (vp->adj_next)
vma_start_write(vp->adj_next);
- /* vp->insert is always a newly created VMA, no need for locking */
+ if (vp->insert)
+ vma_start_write(vp->insert);
if (vp->remove)
vma_start_write(vp->remove);
if (vp->remove2)
@@ -3098,6 +3101,7 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
vma->vm_pgoff = addr >> PAGE_SHIFT;
vm_flags_init(vma, flags);
vma->vm_page_prot = vm_get_page_prot(flags);
+ vma_start_write(vma);
if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
goto mas_store_fail;

@@ -3345,7 +3349,6 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
get_file(new_vma->vm_file);
if (new_vma->vm_ops && new_vma->vm_ops->open)
new_vma->vm_ops->open(new_vma);
- vma_start_write(new_vma);
if (vma_link(mm, new_vma))
goto out_vma_link;
*need_rmap_locks = false;
--
2.41.0.585.gd2178a4bd4-goog


2023-08-01 23:35:03

by Suren Baghdasaryan

Subject: [PATCH v2 4/6] mm: lock vma explicitly before doing vm_flags_reset and vm_flags_reset_once

Implicit vma locking inside vm_flags_reset() and vm_flags_reset_once() is
not obvious and makes it hard to understand where vma locking is happening.
Also, in some cases (like in dup_userfaultfd()) the vma should be locked
earlier than the vm_flags modification. To make locking more visible, change
these functions to assert that the vma write lock is taken and explicitly
lock the vma beforehand. Fix the userfaultfd functions which should lock the
vma earlier.
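
A minimal before/after sketch of the calling convention this patch switches
to, condensed from the include/linux/mm.h and mm/madvise.c hunks below:

  /* before: vm_flags_reset() called vma_start_write() internally */
  vm_flags_reset(vma, new_flags);

  /* after: the caller locks explicitly and vm_flags_reset() only asserts */
  vma_start_write(vma);
  vm_flags_reset(vma, new_flags);	/* does vma_assert_write_locked(vma) */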

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
arch/powerpc/kvm/book3s_hv_uvmem.c | 1 +
drivers/infiniband/hw/hfi1/file_ops.c | 1 +
fs/userfaultfd.c | 6 ++++++
include/linux/mm.h | 10 +++++++---
mm/madvise.c | 5 ++---
mm/mlock.c | 3 ++-
mm/mprotect.c | 1 +
7 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 709ebd578394..e2d6f9327f77 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -410,6 +410,7 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
ret = H_STATE;
break;
}
+ vma_start_write(vma);
/* Copy vm_flags to avoid partial modifications in ksm_madvise */
vm_flags = vma->vm_flags;
ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c
index a5ab22cedd41..5920bfc1e1c5 100644
--- a/drivers/infiniband/hw/hfi1/file_ops.c
+++ b/drivers/infiniband/hw/hfi1/file_ops.c
@@ -344,6 +344,7 @@ static int hfi1_file_mmap(struct file *fp, struct vm_area_struct *vma)
goto done;
}

+ vma_start_write(vma);
/*
* vm_pgoff is used as a buffer selector cookie. Always mmap from
* the beginning.
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 7cecd49e078b..6cde95533dcd 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -667,6 +667,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
mmap_write_lock(mm);
for_each_vma(vmi, vma) {
if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
+ vma_start_write(vma);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
userfaultfd_set_vm_flags(vma,
vma->vm_flags & ~__VM_UFFD_FLAGS);
@@ -702,6 +703,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)

octx = vma->vm_userfaultfd_ctx.ctx;
if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
+ vma_start_write(vma);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
userfaultfd_set_vm_flags(vma, vma->vm_flags & ~__VM_UFFD_FLAGS);
return 0;
@@ -783,6 +785,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
atomic_inc(&ctx->mmap_changing);
} else {
/* Drop uffd context if remap feature not enabled */
+ vma_start_write(vma);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
userfaultfd_set_vm_flags(vma, vma->vm_flags & ~__VM_UFFD_FLAGS);
}
@@ -940,6 +943,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
prev = vma;
}

+ vma_start_write(vma);
userfaultfd_set_vm_flags(vma, new_flags);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
}
@@ -1502,6 +1506,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
* the next vma was merged into the current one and
* the current one has not been updated yet.
*/
+ vma_start_write(vma);
userfaultfd_set_vm_flags(vma, new_flags);
vma->vm_userfaultfd_ctx.ctx = ctx;

@@ -1685,6 +1690,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
* the next vma was merged into the current one and
* the current one has not been updated yet.
*/
+ vma_start_write(vma);
userfaultfd_set_vm_flags(vma, new_flags);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 262b5f44101d..2c720c9bb1ae 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -780,18 +780,22 @@ static inline void vm_flags_init(struct vm_area_struct *vma,
ACCESS_PRIVATE(vma, __vm_flags) = flags;
}

-/* Use when VMA is part of the VMA tree and modifications need coordination */
+/*
+ * Use when VMA is part of the VMA tree and modifications need coordination
+ * Note: vm_flags_reset and vm_flags_reset_once do not lock the vma and
+ * it should be locked explicitly beforehand.
+ */
static inline void vm_flags_reset(struct vm_area_struct *vma,
vm_flags_t flags)
{
- vma_start_write(vma);
+ vma_assert_write_locked(vma);
vm_flags_init(vma, flags);
}

static inline void vm_flags_reset_once(struct vm_area_struct *vma,
vm_flags_t flags)
{
- vma_start_write(vma);
+ vma_assert_write_locked(vma);
WRITE_ONCE(ACCESS_PRIVATE(vma, __vm_flags), flags);
}

diff --git a/mm/madvise.c b/mm/madvise.c
index bfe0e06427bd..507b1d299fec 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -173,9 +173,8 @@ static int madvise_update_vma(struct vm_area_struct *vma,
}

success:
- /*
- * vm_flags is protected by the mmap_lock held in write mode.
- */
+ /* vm_flags is protected by the mmap_lock held in write mode. */
+ vma_start_write(vma);
vm_flags_reset(vma, new_flags);
if (!vma->vm_file || vma_is_anon_shmem(vma)) {
error = replace_anon_vma_name(vma, anon_name);
diff --git a/mm/mlock.c b/mm/mlock.c
index 479e09d0994c..06bdfab83b58 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -387,6 +387,7 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
*/
if (newflags & VM_LOCKED)
newflags |= VM_IO;
+ vma_start_write(vma);
vm_flags_reset_once(vma, newflags);

lru_add_drain();
@@ -461,9 +462,9 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
* It's okay if try_to_unmap_one unmaps a page just after we
* set VM_LOCKED, populate_vma_page_range will bring it back.
*/
-
if ((newflags & VM_LOCKED) && (oldflags & VM_LOCKED)) {
/* No work to do, and mlocking twice would be wrong */
+ vma_start_write(vma);
vm_flags_reset(vma, newflags);
} else {
mlock_vma_pages_range(vma, start, end, newflags);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 3aef1340533a..362e190a8f81 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -657,6 +657,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
* vm_flags and vm_page_prot are protected by the mmap_lock
* held in write mode.
*/
+ vma_start_write(vma);
vm_flags_reset(vma, newflags);
if (vma_wants_manual_pte_write_upgrade(vma))
mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
--
2.41.0.585.gd2178a4bd4-goog


2023-08-02 00:09:39

by Suren Baghdasaryan

Subject: [PATCH v2 3/6] mm: replace mmap with vma write lock assertions when operating on a vma

Vma write lock assertion always includes mmap write lock assertion and
additional vma lock checks when per-VMA locks are enabled. Replace
weaker mmap_assert_write_locked() assertions with stronger
vma_assert_write_locked() ones when we are operating on a vma which
is expected to be locked.
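
A minimal sketch of the conversion, using a hypothetical helper name; with
CONFIG_PER_VMA_LOCK=n the two assertions are equivalent after patch 2/6, so
the replacement is never weaker:

  /* helper operating on a vma its callers must have write-locked */
  static void frob_vma(struct vm_area_struct *vma)
  {
	/* before: only proves the mmap_lock is held for write */
	/* mmap_assert_write_locked(vma->vm_mm); */

	/* after: with CONFIG_PER_VMA_LOCK=y this also checks that this
	 * particular vma has been write-locked */
	vma_assert_write_locked(vma);

	/* ... modify the vma ... */
  }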

Suggested-by: Jann Horn <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/hugetlb.c | 2 +-
mm/khugepaged.c | 5 +++--
mm/memory.c | 2 +-
3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 64a3239b6407..1d871a1167d8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5028,7 +5028,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
src_vma->vm_start,
src_vma->vm_end);
mmu_notifier_invalidate_range_start(&range);
- mmap_assert_write_locked(src);
+ vma_assert_write_locked(src_vma);
raw_write_seqcount_begin(&src->write_protect_seq);
} else {
/*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 78c8d5d8b628..1e43a56fba31 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1495,7 +1495,7 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
};

VM_BUG_ON(!PageTransHuge(hpage));
- mmap_assert_write_locked(vma->vm_mm);
+ vma_assert_write_locked(vma);

if (do_set_pmd(&vmf, hpage))
return SCAN_FAIL;
@@ -1525,7 +1525,7 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
pmd_t pmd;
struct mmu_notifier_range range;

- mmap_assert_write_locked(mm);
+ vma_assert_write_locked(vma);
if (vma->vm_file)
lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem);
/*
@@ -1570,6 +1570,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
int count = 0, result = SCAN_FAIL;
int i;

+ /* Ensure vma can't change, it will be locked below after checks */
mmap_assert_write_locked(mm);

/* Fast check before locking page if already PMD-mapped */
diff --git a/mm/memory.c b/mm/memory.c
index 603b2f419948..652d99b9858a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1312,7 +1312,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
* Use the raw variant of the seqcount_t write API to avoid
* lockdep complaining about preemptibility.
*/
- mmap_assert_write_locked(src_mm);
+ vma_assert_write_locked(src_vma);
raw_write_seqcount_begin(&src_mm->write_protect_seq);
}

--
2.41.0.585.gd2178a4bd4-goog


2023-08-02 17:30:24

by Liam R. Howlett

Subject: Re: [PATCH v2 6/6] mm: move vma locking out of vma_prepare

* Suren Baghdasaryan <[email protected]> [230801 18:07]:
> vma_prepare() is currently the central place where vmas are being locked
> before vma_complete() applies changes to them. While this is convenient,
> it also obscures vma locking and makes it hard to follow the locking rules.
> Move vma locking out of vma_prepare() and take vma locks explicitly at the
> locations where vmas are being modified.
>
> Suggested-by: Linus Torvalds <[email protected]>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> Reviewed-by: Liam R. Howlett <[email protected]>
> ---
> mm/mmap.c | 26 ++++++++++++++++----------
> 1 file changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 850a39dee075..e59d83cb1d7a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -476,16 +476,6 @@ static inline void init_vma_prep(struct vma_prepare *vp,
> */
> static inline void vma_prepare(struct vma_prepare *vp)
> {
> - vma_start_write(vp->vma);
> - if (vp->adj_next)
> - vma_start_write(vp->adj_next);
> - if (vp->insert)
> - vma_start_write(vp->insert);
> - if (vp->remove)
> - vma_start_write(vp->remove);
> - if (vp->remove2)
> - vma_start_write(vp->remove2);
> -
> if (vp->file) {
> uprobe_munmap(vp->vma, vp->vma->vm_start, vp->vma->vm_end);
>
> @@ -650,6 +640,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> bool remove_next = false;
> struct vma_prepare vp;
>
> + vma_start_write(vma);

This lock made me think about the lock in dup_anon_vma(). The only
reason we need to dup the anon vma is because the VMA _will_ be
modified. So if we are to remove next, the dup_anon_vma() call may lock
vma again. This happens multiple times through this patch set.

If we lock both dst and src VMAs before calling dup_anon_vma(), then I
think it will become clearer in the code below which VMAs are locked.
We could remove the lock in dup_anon_vma() (and replace it with an
assert?), and drop the vma locking of "vma".

I think this would address the confusion of the locking below that I
experienced and avoid calling vma_start_write() multiple times for
the same vma. I mean, having the vma_start_write(vma) obscures which
VMA is locked as well.
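
Roughly, the suggested shape (hypothetical, shown for the vma_expand() case
quoted below) would be:

  vma_start_write(vma);			/* dst */
  vma_start_write(next);		/* src */
  ret = dup_anon_vma(vma, next);	/* would only assert, not lock */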


> if (next && (vma != next) && (end == next->vm_end)) {
> int ret;
>
> @@ -657,6 +648,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> ret = dup_anon_vma(vma, next);
> if (ret)
> return ret;
> + vma_start_write(next);
> }
>
> init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL);
> @@ -708,6 +700,8 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> if (vma_iter_prealloc(vmi))
> return -ENOMEM;
>
> + vma_start_write(vma);
> +
> init_vma_prep(&vp, vma);
> vma_prepare(&vp);
> vma_adjust_trans_huge(vma, start, end, 0);
> @@ -946,10 +940,12 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
> /* Can we merge both the predecessor and the successor? */
> if (merge_prev && merge_next &&
> is_mergeable_anon_vma(prev->anon_vma, next->anon_vma, NULL)) {
> + vma_start_write(next);
> remove = next; /* case 1 */
> vma_end = next->vm_end;
> err = dup_anon_vma(prev, next);
> if (curr) { /* case 6 */
> + vma_start_write(curr);
> remove = curr;
> remove2 = next;
> if (!next->anon_vma)
> @@ -958,6 +954,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
> } else if (merge_prev) { /* case 2 */
> if (curr) {
> err = dup_anon_vma(prev, curr);
> + vma_start_write(curr);
> if (end == curr->vm_end) { /* case 7 */
> remove = curr;
> } else { /* case 5 */
> @@ -969,6 +966,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
> res = next;
> if (prev && addr < prev->vm_end) { /* case 4 */
> vma_end = addr;
> + vma_start_write(next);
> adjust = next;
> adj_start = -(prev->vm_end - addr);
> err = dup_anon_vma(next, prev);
> @@ -983,6 +981,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
> vma_pgoff = next->vm_pgoff - pglen;
> if (curr) { /* case 8 */
> vma_pgoff = curr->vm_pgoff;
> + vma_start_write(curr);
> remove = curr;
> err = dup_anon_vma(next, curr);
> }
> @@ -996,6 +995,8 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
> if (vma_iter_prealloc(vmi))
> return NULL;
>
> + vma_start_write(vma);
> +
> init_multi_vma_prep(&vp, vma, adjust, remove, remove2);
> VM_WARN_ON(vp.anon_vma && adjust && adjust->anon_vma &&
> vp.anon_vma != adjust->anon_vma);
> @@ -2373,6 +2374,9 @@ int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> if (new->vm_ops && new->vm_ops->open)
> new->vm_ops->open(new);
>
> + vma_start_write(vma);
> + vma_start_write(new);
> +
> init_vma_prep(&vp, vma);
> vp.insert = new;
> vma_prepare(&vp);
> @@ -3078,6 +3082,8 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> if (vma_iter_prealloc(vmi))
> goto unacct_fail;
>
> + vma_start_write(vma);
> +
> init_vma_prep(&vp, vma);
> vma_prepare(&vp);
> vma_adjust_trans_huge(vma, vma->vm_start, addr + len, 0);
> --
> 2.41.0.585.gd2178a4bd4-goog
>

2023-08-02 17:31:12

by Liam R. Howlett

Subject: Re: [PATCH v2 2/6] mm: for !CONFIG_PER_VMA_LOCK equate write lock assertion for vma and mmap

* Suren Baghdasaryan <[email protected]> [230801 18:07]:
> When CONFIG_PER_VMA_LOCK=n, vma_assert_write_locked() should be equivalent
> to mmap_assert_write_locked().
>
> Suggested-by: Jann Horn <[email protected]>
> Signed-off-by: Suren Baghdasaryan <[email protected]>

Reviewed-by: Liam R. Howlett <[email protected]>

> ---
> include/linux/mm.h | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 406ab9ea818f..262b5f44101d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -750,7 +750,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
> static inline void vma_start_write(struct vm_area_struct *vma) {}
> static inline bool vma_try_start_write(struct vm_area_struct *vma)
> { return true; }
> -static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
> +static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> + { mmap_assert_write_locked(vma->vm_mm); }
> static inline void vma_mark_detached(struct vm_area_struct *vma,
> bool detached) {}
>
> --
> 2.41.0.585.gd2178a4bd4-goog
>

2023-08-02 17:32:24

by Liam R. Howlett

Subject: Re: [PATCH v2 4/6] mm: lock vma explicitly before doing vm_flags_reset and vm_flags_reset_once

* Suren Baghdasaryan <[email protected]> [230801 18:07]:
> Implicit vma locking inside vm_flags_reset() and vm_flags_reset_once() is
> not obvious and makes it hard to understand where vma locking is happening.
> Also in some cases (like in dup_userfaultfd()) vma should be locked earlier
> than vma_flags modification. To make locking more visible, change these
> functions to assert that the vma write lock is taken and explicitly lock
> the vma beforehand. Fix userfaultfd functions which should lock the vma
> earlier.
>
> Suggested-by: Linus Torvalds <[email protected]>
> Signed-off-by: Suren Baghdasaryan <[email protected]>

Reviewed-by: Liam R. Howlett <[email protected]>

> ---
> arch/powerpc/kvm/book3s_hv_uvmem.c | 1 +
> drivers/infiniband/hw/hfi1/file_ops.c | 1 +
> fs/userfaultfd.c | 6 ++++++
> include/linux/mm.h | 10 +++++++---
> mm/madvise.c | 5 ++---
> mm/mlock.c | 3 ++-
> mm/mprotect.c | 1 +
> 7 files changed, 20 insertions(+), 7 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 709ebd578394..e2d6f9327f77 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -410,6 +410,7 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
> ret = H_STATE;
> break;
> }
> + vma_start_write(vma);
> /* Copy vm_flags to avoid partial modifications in ksm_madvise */
> vm_flags = vma->vm_flags;
> ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
> diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c
> index a5ab22cedd41..5920bfc1e1c5 100644
> --- a/drivers/infiniband/hw/hfi1/file_ops.c
> +++ b/drivers/infiniband/hw/hfi1/file_ops.c
> @@ -344,6 +344,7 @@ static int hfi1_file_mmap(struct file *fp, struct vm_area_struct *vma)
> goto done;
> }
>
> + vma_start_write(vma);
> /*
> * vm_pgoff is used as a buffer selector cookie. Always mmap from
> * the beginning.
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 7cecd49e078b..6cde95533dcd 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -667,6 +667,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
> mmap_write_lock(mm);
> for_each_vma(vmi, vma) {
> if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
> + vma_start_write(vma);
> vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> userfaultfd_set_vm_flags(vma,
> vma->vm_flags & ~__VM_UFFD_FLAGS);
> @@ -702,6 +703,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
>
> octx = vma->vm_userfaultfd_ctx.ctx;
> if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
> + vma_start_write(vma);
> vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> userfaultfd_set_vm_flags(vma, vma->vm_flags & ~__VM_UFFD_FLAGS);
> return 0;
> @@ -783,6 +785,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
> atomic_inc(&ctx->mmap_changing);
> } else {
> /* Drop uffd context if remap feature not enabled */
> + vma_start_write(vma);
> vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> userfaultfd_set_vm_flags(vma, vma->vm_flags & ~__VM_UFFD_FLAGS);
> }
> @@ -940,6 +943,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
> prev = vma;
> }
>
> + vma_start_write(vma);
> userfaultfd_set_vm_flags(vma, new_flags);
> vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> }
> @@ -1502,6 +1506,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> * the next vma was merged into the current one and
> * the current one has not been updated yet.
> */
> + vma_start_write(vma);
> userfaultfd_set_vm_flags(vma, new_flags);
> vma->vm_userfaultfd_ctx.ctx = ctx;
>
> @@ -1685,6 +1690,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> * the next vma was merged into the current one and
> * the current one has not been updated yet.
> */
> + vma_start_write(vma);
> userfaultfd_set_vm_flags(vma, new_flags);
> vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 262b5f44101d..2c720c9bb1ae 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -780,18 +780,22 @@ static inline void vm_flags_init(struct vm_area_struct *vma,
> ACCESS_PRIVATE(vma, __vm_flags) = flags;
> }
>
> -/* Use when VMA is part of the VMA tree and modifications need coordination */
> +/*
> + * Use when VMA is part of the VMA tree and modifications need coordination
> + * Note: vm_flags_reset and vm_flags_reset_once do not lock the vma and
> + * it should be locked explicitly beforehand.
> + */
> static inline void vm_flags_reset(struct vm_area_struct *vma,
> vm_flags_t flags)
> {
> - vma_start_write(vma);
> + vma_assert_write_locked(vma);
> vm_flags_init(vma, flags);
> }
>
> static inline void vm_flags_reset_once(struct vm_area_struct *vma,
> vm_flags_t flags)
> {
> - vma_start_write(vma);
> + vma_assert_write_locked(vma);
> WRITE_ONCE(ACCESS_PRIVATE(vma, __vm_flags), flags);
> }
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index bfe0e06427bd..507b1d299fec 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -173,9 +173,8 @@ static int madvise_update_vma(struct vm_area_struct *vma,
> }
>
> success:
> - /*
> - * vm_flags is protected by the mmap_lock held in write mode.
> - */
> + /* vm_flags is protected by the mmap_lock held in write mode. */
> + vma_start_write(vma);
> vm_flags_reset(vma, new_flags);
> if (!vma->vm_file || vma_is_anon_shmem(vma)) {
> error = replace_anon_vma_name(vma, anon_name);
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 479e09d0994c..06bdfab83b58 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -387,6 +387,7 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
> */
> if (newflags & VM_LOCKED)
> newflags |= VM_IO;
> + vma_start_write(vma);
> vm_flags_reset_once(vma, newflags);
>
> lru_add_drain();
> @@ -461,9 +462,9 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
> * It's okay if try_to_unmap_one unmaps a page just after we
> * set VM_LOCKED, populate_vma_page_range will bring it back.
> */
> -
> if ((newflags & VM_LOCKED) && (oldflags & VM_LOCKED)) {
> /* No work to do, and mlocking twice would be wrong */
> + vma_start_write(vma);
> vm_flags_reset(vma, newflags);
> } else {
> mlock_vma_pages_range(vma, start, end, newflags);
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 3aef1340533a..362e190a8f81 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -657,6 +657,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
> * vm_flags and vm_page_prot are protected by the mmap_lock
> * held in write mode.
> */
> + vma_start_write(vma);
> vm_flags_reset(vma, newflags);
> if (vma_wants_manual_pte_write_upgrade(vma))
> mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
> --
> 2.41.0.585.gd2178a4bd4-goog
>

2023-08-02 17:44:54

by Liam R. Howlett

Subject: Re: [PATCH v2 5/6] mm: always lock new vma before inserting into vma tree

* Suren Baghdasaryan <[email protected]> [230801 18:07]:
> While it's not strictly necessary to lock a newly created vma before
> adding it into the vma tree (as long as no further changes are performed
> to it), it seems like a good policy to lock it and prevent accidental
> changes after it becomes visible to the page faults. Lock the vma before
> adding it into the vma tree.
>
> Suggested-by: Jann Horn <[email protected]>
> Signed-off-by: Suren Baghdasaryan <[email protected]>

Reviewed-by: Liam R. Howlett <[email protected]>

> ---
> mm/mmap.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 3937479d0e07..850a39dee075 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -412,6 +412,8 @@ static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
> if (vma_iter_prealloc(&vmi))
> return -ENOMEM;
>
> + vma_start_write(vma);
> +
> if (vma->vm_file) {
> mapping = vma->vm_file->f_mapping;
> i_mmap_lock_write(mapping);
> @@ -477,7 +479,8 @@ static inline void vma_prepare(struct vma_prepare *vp)
> vma_start_write(vp->vma);
> if (vp->adj_next)
> vma_start_write(vp->adj_next);
> - /* vp->insert is always a newly created VMA, no need for locking */
> + if (vp->insert)
> + vma_start_write(vp->insert);
> if (vp->remove)
> vma_start_write(vp->remove);
> if (vp->remove2)
> @@ -3098,6 +3101,7 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> vma->vm_pgoff = addr >> PAGE_SHIFT;
> vm_flags_init(vma, flags);
> vma->vm_page_prot = vm_get_page_prot(flags);
> + vma_start_write(vma);
> if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
> goto mas_store_fail;
>
> @@ -3345,7 +3349,6 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> get_file(new_vma->vm_file);
> if (new_vma->vm_ops && new_vma->vm_ops->open)
> new_vma->vm_ops->open(new_vma);
> - vma_start_write(new_vma);
> if (vma_link(mm, new_vma))
> goto out_vma_link;
> *need_rmap_locks = false;
> --
> 2.41.0.585.gd2178a4bd4-goog
>

2023-08-02 18:00:19

by Suren Baghdasaryan

Subject: Re: [PATCH v2 6/6] mm: move vma locking out of vma_prepare

On Wed, Aug 2, 2023 at 9:59 AM Liam R. Howlett <[email protected]> wrote:
>
> * Suren Baghdasaryan <[email protected]> [230801 18:07]:
> > vma_prepare() is currently the central place where vmas are being locked
> > before vma_complete() applies changes to them. While this is convenient,
> > it also obscures vma locking and makes it hard to follow the locking rules.
> > Move vma locking out of vma_prepare() and take vma locks explicitly at the
> > locations where vmas are being modified.
> >
> > Suggested-by: Linus Torvalds <[email protected]>
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > Reviewed-by: Liam R. Howlett <[email protected]>
> > ---
> > mm/mmap.c | 26 ++++++++++++++++----------
> > 1 file changed, 16 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 850a39dee075..e59d83cb1d7a 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -476,16 +476,6 @@ static inline void init_vma_prep(struct vma_prepare *vp,
> > */
> > static inline void vma_prepare(struct vma_prepare *vp)
> > {
> > - vma_start_write(vp->vma);
> > - if (vp->adj_next)
> > - vma_start_write(vp->adj_next);
> > - if (vp->insert)
> > - vma_start_write(vp->insert);
> > - if (vp->remove)
> > - vma_start_write(vp->remove);
> > - if (vp->remove2)
> > - vma_start_write(vp->remove2);
> > -
> > if (vp->file) {
> > uprobe_munmap(vp->vma, vp->vma->vm_start, vp->vma->vm_end);
> >
> > @@ -650,6 +640,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > bool remove_next = false;
> > struct vma_prepare vp;
> >
> > + vma_start_write(vma);
>
> This lock made me think about the lock in dup_anon_vma().. the only
> reason we need to dup the anon vma is because the VMA _will_ be
> modified.. So if we are to remove next the dup_anon_vma() call may lock
> vma again. This happens multiple times through this patch set.
>
> If we lock both dst and src VMAs before calling dup_anon_vma, then I
> think it will become more clear in the code below which VMAs are locked.
> We could remove the lock in the dup_anon_vma() (and replace with an
> assert?), and drop the vma locking of "vma".
>
> I think this would address the confusion of the locking below that I
> experienced and avoid calling the vma_start_write() multiple times for
> the same vma. I mean, having the vma_start_write(vma) obscures which
> VMA is locked as well.

Agree, it would result in easier-to-understand locking code. I'll
move the lock outside of dup_anon_vma() and replace it with an
assert. Thanks!

>
>
> > if (next && (vma != next) && (end == next->vm_end)) {
> > int ret;
> >
> > @@ -657,6 +648,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > ret = dup_anon_vma(vma, next);
> > if (ret)
> > return ret;
> > + vma_start_write(next);
> > }
> >
> > init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL);
> > @@ -708,6 +700,8 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > if (vma_iter_prealloc(vmi))
> > return -ENOMEM;
> >
> > + vma_start_write(vma);
> > +
> > init_vma_prep(&vp, vma);
> > vma_prepare(&vp);
> > vma_adjust_trans_huge(vma, start, end, 0);
> > @@ -946,10 +940,12 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
> > /* Can we merge both the predecessor and the successor? */
> > if (merge_prev && merge_next &&
> > is_mergeable_anon_vma(prev->anon_vma, next->anon_vma, NULL)) {
> > + vma_start_write(next);
> > remove = next; /* case 1 */
> > vma_end = next->vm_end;
> > err = dup_anon_vma(prev, next);
> > if (curr) { /* case 6 */
> > + vma_start_write(curr);
> > remove = curr;
> > remove2 = next;
> > if (!next->anon_vma)
> > @@ -958,6 +954,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
> > } else if (merge_prev) { /* case 2 */
> > if (curr) {
> > err = dup_anon_vma(prev, curr);
> > + vma_start_write(curr);
> > if (end == curr->vm_end) { /* case 7 */
> > remove = curr;
> > } else { /* case 5 */
> > @@ -969,6 +966,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
> > res = next;
> > if (prev && addr < prev->vm_end) { /* case 4 */
> > vma_end = addr;
> > + vma_start_write(next);
> > adjust = next;
> > adj_start = -(prev->vm_end - addr);
> > err = dup_anon_vma(next, prev);
> > @@ -983,6 +981,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
> > vma_pgoff = next->vm_pgoff - pglen;
> > if (curr) { /* case 8 */
> > vma_pgoff = curr->vm_pgoff;
> > + vma_start_write(curr);
> > remove = curr;
> > err = dup_anon_vma(next, curr);
> > }
> > @@ -996,6 +995,8 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
> > if (vma_iter_prealloc(vmi))
> > return NULL;
> >
> > + vma_start_write(vma);
> > +
> > init_multi_vma_prep(&vp, vma, adjust, remove, remove2);
> > VM_WARN_ON(vp.anon_vma && adjust && adjust->anon_vma &&
> > vp.anon_vma != adjust->anon_vma);
> > @@ -2373,6 +2374,9 @@ int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > if (new->vm_ops && new->vm_ops->open)
> > new->vm_ops->open(new);
> >
> > + vma_start_write(vma);
> > + vma_start_write(new);
> > +
> > init_vma_prep(&vp, vma);
> > vp.insert = new;
> > vma_prepare(&vp);
> > @@ -3078,6 +3082,8 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > if (vma_iter_prealloc(vmi))
> > goto unacct_fail;
> >
> > + vma_start_write(vma);
> > +
> > init_vma_prep(&vp, vma);
> > vma_prepare(&vp);
> > vma_adjust_trans_huge(vma, vma->vm_start, addr + len, 0);
> > --
> > 2.41.0.585.gd2178a4bd4-goog
> >

2023-08-02 18:11:50

by Liam R. Howlett

Subject: Re: [PATCH v2 3/6] mm: replace mmap with vma write lock assertions when operating on a vma

* Suren Baghdasaryan <[email protected]> [230801 18:07]:
> Vma write lock assertion always includes mmap write lock assertion and
> additional vma lock checks when per-VMA locks are enabled. Replace
> weaker mmap_assert_write_locked() assertions with stronger
> vma_assert_write_locked() ones when we are operating on a vma which
> is expected to be locked.
>
> Suggested-by: Jann Horn <[email protected]>
> Signed-off-by: Suren Baghdasaryan <[email protected]>

Reviewed-by: Liam R. Howlett <[email protected]>

> ---
> mm/hugetlb.c | 2 +-
> mm/khugepaged.c | 5 +++--
> mm/memory.c | 2 +-
> 3 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 64a3239b6407..1d871a1167d8 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5028,7 +5028,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> src_vma->vm_start,
> src_vma->vm_end);
> mmu_notifier_invalidate_range_start(&range);
> - mmap_assert_write_locked(src);
> + vma_assert_write_locked(src_vma);
> raw_write_seqcount_begin(&src->write_protect_seq);
> } else {
> /*
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 78c8d5d8b628..1e43a56fba31 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1495,7 +1495,7 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
> };
>
> VM_BUG_ON(!PageTransHuge(hpage));
> - mmap_assert_write_locked(vma->vm_mm);
> + vma_assert_write_locked(vma);
>
> if (do_set_pmd(&vmf, hpage))
> return SCAN_FAIL;
> @@ -1525,7 +1525,7 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
> pmd_t pmd;
> struct mmu_notifier_range range;
>
> - mmap_assert_write_locked(mm);
> + vma_assert_write_locked(vma);
> if (vma->vm_file)
> lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem);
> /*
> @@ -1570,6 +1570,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> int count = 0, result = SCAN_FAIL;
> int i;
>
> + /* Ensure vma can't change, it will be locked below after checks */
> mmap_assert_write_locked(mm);
>
> /* Fast check before locking page if already PMD-mapped */
> diff --git a/mm/memory.c b/mm/memory.c
> index 603b2f419948..652d99b9858a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1312,7 +1312,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> * Use the raw variant of the seqcount_t write API to avoid
> * lockdep complaining about preemptibility.
> */
> - mmap_assert_write_locked(src_mm);
> + vma_assert_write_locked(src_vma);
> raw_write_seqcount_begin(&src_mm->write_protect_seq);
> }
>
> --
> 2.41.0.585.gd2178a4bd4-goog
>

2023-08-02 19:36:04

by Linus Torvalds

Subject: Re: [PATCH v2 4/6] mm: lock vma explicitly before doing vm_flags_reset and vm_flags_reset_once

On Wed, 2 Aug 2023 at 11:09, Suren Baghdasaryan <[email protected]> wrote:
>
> Ok, IOW the vma would be already locked before mmap() is called...

Yup.

> Just to confirm, you are suggesting to remove vma_start_write() call
> from hfi1_file_mmap() and let vm_flags_reset() generate an assertion
> if it's ever called with an unlocked vma, correct?

Correct.

Linus

2023-08-02 19:36:11

by Suren Baghdasaryan

Subject: Re: [PATCH v2 4/6] mm: lock vma explicitly before doing vm_flags_reset and vm_flags_reset_once

On Wed, Aug 2, 2023 at 10:49 AM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, 1 Aug 2023 at 15:07, Suren Baghdasaryan <[email protected]> wrote:
> >
> > To make locking more visible, change these
> > functions to assert that the vma write lock is taken and explicitly lock
> > the vma beforehand.
>
> So I obviously think this is a good change, but the fact that it
> touched driver files makes me go "we're still doing something wrong".
>
> I'm not super-happy with hfi1_file_mmap() doing something like
> vma_start_write(), in that I *really* don't think drivers should ever
> have to think about issues like this.
>
> And I think it's unnecessary. This is the mmap op in the
> hfi1_file_ops, and I think that any actual mmap() code had _better_
> had locked the new vma before asking any driver to set things up (and
> the assert would catch it if somebody didn't).
>
> I realize that it doesn't hurt in a technical sense, but I think
> having drivers call these VM-internal subtle locking functions does
> hurt in a maintenance sense, so we should make sure to not have it.

Ok, IOW the vma would be already locked before mmap() is called...
Just to confirm, you are suggesting to remove vma_start_write() call
from hfi1_file_mmap() and let vm_flags_reset() generate an assertion
if it's ever called with an unlocked vma, correct?

>
> Linus

2023-08-02 19:49:41

by Linus Torvalds

Subject: Re: [PATCH v2 4/6] mm: lock vma explicitly before doing vm_flags_reset and vm_flags_reset_once

On Tue, 1 Aug 2023 at 15:07, Suren Baghdasaryan <[email protected]> wrote:
>
> To make locking more visible, change these
> functions to assert that the vma write lock is taken and explicitly lock
> the vma beforehand.

So I obviously think this is a good change, but the fact that it
touched driver files makes me go "we're still doing something wrong".

I'm not super-happy with hfi1_file_mmap() doing something like
vma_start_write(), in that I *really* don't think drivers should ever
have to think about issues like this.

And I think it's unnecessary. This is the mmap op in the
hfi1_file_ops, and I think that any actual mmap() code had _better_
have locked the new vma before asking any driver to set things up (and
the assert would catch it if somebody didn't).

I realize that it doesn't hurt in a technical sense, but I think
having drivers call these VM-internal subtle locking functions does
hurt in a maintenance sense, so we should make sure to not have it.

Linus

2023-08-02 20:54:23

by Suren Baghdasaryan

Subject: Re: [PATCH v2 4/6] mm: lock vma explicitly before doing vm_flags_reset and vm_flags_reset_once

On Wed, Aug 2, 2023 at 11:54 AM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 2 Aug 2023 at 11:09, Suren Baghdasaryan <[email protected]> wrote:
> >
> > Ok, IOW the vma would be already locked before mmap() is called...
>
> Yup.
>
> > Just to confirm, you are suggesting to remove vma_start_write() call
> > from hfi1_file_mmap() and let vm_flags_reset() generate an assertion
> > if it's ever called with an unlocked vma, correct?
>
> Correct.

Got it. Will update in the next version. Thanks!

>
> Linus
>

2023-08-03 18:42:55

by Suren Baghdasaryan

Subject: Re: [PATCH v2 0/6] make vma locking more obvious

On Tue, Aug 1, 2023 at 3:07 PM Suren Baghdasaryan <[email protected]> wrote:
>
> During recent vma locking patch reviews Linus and Jann Horn noted a number
> of issues with vma locking and suggested improvements:
>
> 1. walk_page_range() does not have ability to write-lock a vma during the
> walk when it's done under mmap_write_lock. For example s390_reset_cmma().
>
> 2. Vma locking is hidden inside vm_flags modifiers and is hard to follow.
> Suggestion is to change vm_flags_reset{_once} to assert that vma is
> write-locked and require an explicit locking.
>
> 3. Same issue with vma_prepare() hiding vma locking.
>
> 4. In userfaultfd vm_flags are modified after vma->vm_userfaultfd_ctx and
> page faults can operate on a context while it's changed.
>
> 5. do_brk_flags() and __install_special_mapping() not locking a newly
> created vma before adding it into the mm. While not strictly a problem,
> this is fragile if vma is modified after insertion, as in the
> mmap_region() case which was recently fixed. Suggestion is to always lock
> a new vma before inserting it and making it visible to page faults.
>
> 6. vma_assert_write_locked() for CONFIG_PER_VMA_LOCK=n would benefit from
> being mmap_assert_write_locked() instead of no-op and then any place which
> operates on a vma and calls mmap_assert_write_locked() can be converted
> into vma_assert_write_locked().
>
> I CC'ed stable only on the first patch because others are cleanups and the
> bug in userfaultfd does not affect stable (lock_vma_under_rcu prevents
> uffds from being handled under vma lock protection). However I would be
> happy if the whole series is merged into stable 6.4 since it makes vma
> locking more maintainable.
>
> The patches apply cleanly over Linus' ToT and will conflict when applied
> over mm-unstable due to missing [1]. The conflict can be easily resolved
> by ignoring conflicting deletions but probably simpler to take [1] into
> mm-unstable and avoid later conflict.
>
> [1] commit 6c21e066f925 ("mm/mempolicy: Take VMA lock before replacing policy")
>
> Changes since v1:
> - replace walk_page_range() parameter with mm_walk_ops.walk_lock,
> per Linus
> - introduced page_walk_lock enum to allow different locking modes
> during a walk, per Linus
> - added Liam's Reviewed-by

v3 is posted at
https://lore.kernel.org/all/[email protected]/

>
> Suren Baghdasaryan (6):
> mm: enable page walking API to lock vmas during the walk
> mm: for !CONFIG_PER_VMA_LOCK equate write lock assertion for vma and
> mmap
> mm: replace mmap with vma write lock assertions when operating on a
> vma
> mm: lock vma explicitly before doing vm_flags_reset and
> vm_flags_reset_once
> mm: always lock new vma before inserting into vma tree
> mm: move vma locking out of vma_prepare
>
> arch/powerpc/kvm/book3s_hv_uvmem.c | 1 +
> arch/powerpc/mm/book3s64/subpage_prot.c | 1 +
> arch/riscv/mm/pageattr.c | 1 +
> arch/s390/mm/gmap.c | 5 ++++
> drivers/infiniband/hw/hfi1/file_ops.c | 1 +
> fs/proc/task_mmu.c | 5 ++++
> fs/userfaultfd.c | 6 +++++
> include/linux/mm.h | 13 ++++++---
> include/linux/pagewalk.h | 11 ++++++++
> mm/damon/vaddr.c | 2 ++
> mm/hmm.c | 1 +
> mm/hugetlb.c | 2 +-
> mm/khugepaged.c | 5 ++--
> mm/ksm.c | 25 ++++++++++-------
> mm/madvise.c | 8 +++---
> mm/memcontrol.c | 2 ++
> mm/memory-failure.c | 1 +
> mm/memory.c | 2 +-
> mm/mempolicy.c | 22 +++++++++------
> mm/migrate_device.c | 1 +
> mm/mincore.c | 1 +
> mm/mlock.c | 4 ++-
> mm/mmap.c | 29 +++++++++++++-------
> mm/mprotect.c | 2 ++
> mm/pagewalk.c | 36 ++++++++++++++++++++++---
> mm/vmscan.c | 1 +
> 26 files changed, 146 insertions(+), 42 deletions(-)
>
> --
> 2.41.0.585.gd2178a4bd4-goog
>