Date: Thu, 11 Apr 2024 17:12:02 -0400
From: Peter Xu
To: "Liam R. Howlett", linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Andrew Morton, Matthew Wilcox, Suren Baghdasaryan, Lokesh Gidra,
	Alistair Popple
Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks
References: <20240410170621.2011171-1-peterx@redhat.com>
 <20240411171319.almhz23xulg4f7op@revolver>
In-Reply-To: <20240411171319.almhz23xulg4f7op@revolver>

Hey, Liam,

Thanks a lot for the comments.

On Thu, Apr 11, 2024 at 01:13:19PM -0400, Liam R. Howlett wrote:
> * Peter Xu [240410 13:06]:
> > anon_vma is a tricky object in the context of per-vma lock, because it's
> > racy to modify it in that context, and the mmap lock is needed if it's
> > not stable yet.
>
> How is it racy in your view?  There isn't a way to get in a situation
> where its state isn't certain, is there?

My apologies for the confusing wording.  AFAICT we're on the same page
all over the place on this matter, so I will amend the wording if I
repost, and skip many of the comments below where I'm confident we're
aligned on the understanding.
And actually, I'll attach one new version right away that should remove
all the bad side effects and hopefully only bring good ones; see below.

> > So far there are three places that sanity check anon_vma for that:
> >
> > - lock_vma_under_rcu(): this is the major entrance of per-vma lock, where
> >   we have taken care of anon memory v.s. potential anon_vma allocations.
>
> Well, not exactly.  Single vma faults vs potential modifications beyond
> one vma (including uses of adjacent vmas with anon_vmas)

(this is also a wording issue; I'll fix it)

> > - lock_vma(): even if it looks as generic as an API can be, it's only
> >   used in the userfaultfd context to leverage per-vma locks.  It does an
> >   extra check over MAP_PRIVATE file mappings for the same anon_vma issue.
>
> This static function is in mm/userfaultfd so, yes, it's not generic -
> the name is generic, but I didn't see a reason to hold up the patch for
> a static name that is descriptive.  It's static so I'm not concerned
> about usage growing.
>
> I would expect such a check will eventually need to be moved to common
> code, and then potentially change the name to something more
> descriptive.  This seems premature with a single user though.
>
> > - vmf_anon_prepare(): it works for private file mapping faults just like
> >   what lock_vma() wanted to cover above.  One trivial difference is that
> >   in some extreme corner cases the fault handler will still allow a
> >   per-vma fault to happen, like a READ on a privately mapped file.
>
> What is happening here is that we are detecting when the virtual memory
> space is stable vs when the vma is stable.  In some cases, when we need
> to check prev/next, then we need to make sure the virtual memory space
> is stable.  In other cases, the vma is all that matters to the operation
> and so we can continue without stopping anyone else from modifying the
> virtual memory space - that is, we are allowing writes in other areas.
>
> > The question is whether it was intended to be made this complicated.
> > Per my question in the thread, it is not intended, and Suren also seems
> > to agree [1].
> >
> > So the trivial side effects of such a patch are:
> >
> > - We may do slightly better on the first WRITE of a private file mapping,
> >   because we can retry earlier (in lock_vma_under_rcu(), rather than
> >   vmf_anon_prepare() later).
> >
> > - We may always use the mmap lock for the initial READs on private file
> >   mappings, while before this patch it _can_ (only when no WRITE ever
> >   happened... but that doesn't make much sense for a MAP_PRIVATE..) do
> >   the read fault with per-vma lock.
> >
> > Then note that right after any WRITE the anon_vma will be stabilized,
> > after which there will be no difference.
>
> In my view, the anon_vma is always stable.  During a write it is
> initialised.
>
> > And I believe that should be the majority of cases too; I also tried to
> > run a program, writing to MAP_PRIVATE file memory (that I pre-heated in
> > the page cache), and I could hardly measure a difference in performance.
> >
> > Let's simply ignore all those trivial corner cases and unify the
> > anon_vma check from three places into one.  I also didn't check the
> > rest of the users of lock_vma_under_rcu(), where in a !fault context it
> > could even fix something that used to race with private file mappings,
> > but I didn't check further.
>
> This change will increase mmap semaphore contention.
> I'd like to move to a more common structured layout, but I think we need
> to find a way to do this without failing lock_vma_under_rcu() more
> frequently.  In fact, we need to be looking for ways to make it fail
> less.
>
> The UFFD code is more strict on what is acceptable for a per-vma lock
> [1].  This user has a restriction that will decrease the benefits of the
> per-vma lock, but we shouldn't make this case the common one.
>
> As I'm sure you are aware, the page fault path is the most common path
> for using per-vma locks and should minimize taking the mmap lock as much
> as possible.
>
> I don't think we should increase the lock contention to simplify the
> code.

Right.  I believe the major issue with this patch right now is that it
can reduce some performance.  I didn't worry about that, but it seems to
be a concern to many already.  It could be that I was simply too bold
with such an attempt.

However, note that what I want to do (put all the anon_vma magic
together) should be an orthogonal issue vs. the perf degradation above.
So please feel free to have a look at the version below; I believe it
won't degrade anything, but AFAIU it should speed up two things, even if
only trivially:

- A WRITE on a private file mapping will now retry earlier than before

- A READ on anon zero pages will now be allowed under per-vma lock

Neither of the above would be allowed without the patch below, and the
patch below should also put all the anon_vma checks together.
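For easier comparison, here is roughly what the three scattered checks
look like today, trimmed down from the hunks the patch below removes
(illustrative excerpts only, not compilable as-is):

	/* (1) mm/memory.c, lock_vma_under_rcu(): anonymous memory only */
	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
		goto inval_end_read;

	/* (2) mm/userfaultfd.c, lock_vma(): adds the private file case */
	if (!(vma->vm_flags & VM_SHARED) && unlikely(!vma->anon_vma))
		vma_end_read(vma);

	/* (3) mm/memory.c, vmf_anon_prepare(): catches the rest, but late */
	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
		vma_end_read(vma);
		return VM_FAULT_RETRY;
	}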
I didn't even compile test it yet, but let me send it out first to see
whether any of you have comments, so take these as super-rfc versions.

Note that I attached patch 1, a hugetlb change that patch 2 will need;
otherwise we can hit issues with hugetlb private mappings.  Patch 1 alone
will need some justification too, but IMHO we can ignore it for now and
just look at patch 2.  I attached patch 1 just in case it's useful to
reference.

Thanks,

===8<===
From fed3c10ef87a9614a328edbfd7d4b7047d3f5e57 Mon Sep 17 00:00:00 2001
From: Peter Xu
Date: Thu, 11 Apr 2024 16:23:50 -0400
Subject: [PATCH 1/2] mm/hugetlb: Match behavior on read-only private mappings

The default behavior for reading a private file mapping should be to
fill in the page cache and map the page read-only.  Hugetlb didn't
behave like that; make it behave the same.

Signed-off-by: Peter Xu
---
 mm/hugetlb.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a8536349de13..bc3c97c476d2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6199,6 +6199,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 	struct folio *folio;
 	pte_t new_pte;
 	bool new_folio, new_pagecache_folio = false;
+	bool is_write = vmf->flags & FAULT_FLAG_WRITE;
 	u32 hash = hugetlb_fault_mutex_hash(mapping, vmf->pgoff);
 
 	/*
@@ -6276,7 +6277,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 		__folio_mark_uptodate(folio);
 		new_folio = true;
 
-		if (vma->vm_flags & VM_MAYSHARE) {
+		if (!is_write || vma->vm_flags & VM_MAYSHARE) {
 			int err = hugetlb_add_to_page_cache(folio, mapping,
 							    vmf->pgoff);
 			if (err) {
@@ -6294,6 +6295,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 			}
 			new_pagecache_folio = true;
 		} else {
+			/* Write on a private mapping */
 			folio_lock(folio);
 
 			ret = vmf_anon_prepare(vmf);
@@ -6333,7 +6335,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 	 * any allocations necessary to record that reservation occur outside
 	 * the spinlock.
 	 */
-	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
+	if (is_write && !(vma->vm_flags & VM_SHARED)) {
 		if (vma_needs_reservation(h, vma, vmf->address) < 0) {
 			ret = VM_FAULT_OOM;
 			goto backout_unlocked;
-- 
2.44.0
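(Aside, not part of the series: to illustrate the behavior patch 1
changes, here is a minimal userspace sketch of the scenario it targets.
It assumes hugetlbfs is mounted at /dev/hugepages with at least one free
2M hugepage; the path and size are illustrative only.)

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 2UL << 20;		/* one 2M hugepage */
		int fd = open("/dev/hugepages/test", O_CREAT | O_RDWR, 0600);
		char *p;

		if (fd < 0)
			return 1;
		if (ftruncate(fd, len))
			return 1;
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
		if (p == MAP_FAILED)
			return 1;
		/*
		 * The first access is a READ: with patch 1, hugetlb_no_page()
		 * adds the folio to the page cache and maps it read-only,
		 * matching regular private file mappings.
		 */
		printf("%d\n", p[0]);
		/* Only a later WRITE does CoW and then needs the anon_vma. */
		p[0] = 1;
		munmap(p, len);
		close(fd);
		return 0;
	}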
From b9b57fb580992115ae1258fc1ecd8586a5bd70d2 Mon Sep 17 00:00:00 2001
From: Peter Xu
Date: Thu, 4 Apr 2024 16:05:55 -0400
Subject: [PATCH 2/2] mm: Always sanity check anon_vma first for per-vma locks

anon_vma is a tricky object in the context of per-vma locks, because
sometimes it is required later for mapping the pages, while it's not safe
to prepare the anon_vma under a per-vma lock: we can try to access the
anon_vma of adjacent VMAs, on which we don't hold any lock.

So far there are three places where per-vma lock code checks for the
existence of the anon_vma:

1) lock_vma_under_rcu(): this is the major entrance of per-vma lock,
   where we have taken care of anon memory.

2) lock_vma(): even if it looks as generic as an API can be, it's only
   used in the userfaultfd context to leverage per-vma locks.  It does an
   extra check over MAP_PRIVATE file mappings for the same anon_vma
   issue.  Note that this only applies to the dest vma, not the src vma,
   as we don't ask for the src vma's anon_vma even though it should be
   present regardless.

3) vmf_anon_prepare(): it works for private file mapping faults just like
   what lock_vma() wanted to cover above.  However, this check is done
   only afterwards, deeper in the fault stack.

The question is whether it was intended to be this complicated.  For
example, why don't we also check the anon_vma for anonymous memory later,
when preparing the anon_vma, the way we do it late for file memory?
AFAICT there's nothing special about file memory in this case.

To unify the anon_vma checks, always do it right after we take the
per-vma lock.  We should have enough information already at that point,
and the check doesn't need to be postponed only for file memory.

The only missing information here is whether the caller of the per-vma
rcu lock will install a writable mapping or not.  That info is now passed
into lock_vma_under_rcu() as a boolean.

So the trivial (all good) side effects of such a patch are:

- We may do slightly better on the first WRITE of a private file mapping,
  because we can retry earlier (in lock_vma_under_rcu(), rather than
  vmf_anon_prepare() later).

- We may start to allow per-vma page faults on the zero page even for
  anonymous memory, while prior to this patch we didn't allow that even
  though installing the zero page does not require the anon_vma's
  existence.

I still left a WARN_ON_ONCE() in vmf_anon_prepare() to double check we're
all good.

Cc: Matthew Wilcox
Cc: Suren Baghdasaryan
Cc: Lokesh Gidra
Cc: Liam R. Howlett
Cc: Alistair Popple
Signed-off-by: Peter Xu
---
 arch/arm/mm/fault.c     |  2 +-
 arch/arm64/mm/fault.c   |  2 +-
 arch/powerpc/mm/fault.c |  2 +-
 arch/riscv/mm/fault.c   |  2 +-
 arch/s390/mm/fault.c    |  2 +-
 arch/x86/mm/fault.c     |  2 +-
 include/linux/mm.h      |  7 ++++---
 mm/memory.c             | 46 +++++++++++++++++++++++++++++++----------
 mm/userfaultfd.c        | 21 +++++++------------
 net/ipv4/tcp.c          |  3 ++-
 10 files changed, 55 insertions(+), 34 deletions(-)

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 5c4b417e24f9..20c21ecec25c 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -288,7 +288,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
-	vma = lock_vma_under_rcu(mm, addr);
+	vma = lock_vma_under_rcu(mm, addr, flags & FAULT_FLAG_WRITE);
 	if (!vma)
 		goto lock_mmap;
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 405f9aa831bd..4be946ca80bd 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -566,7 +566,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 	if (!(mm_flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
-	vma = lock_vma_under_rcu(mm, addr);
+	vma = lock_vma_under_rcu(mm, addr, mm_flags & FAULT_FLAG_WRITE);
 	if (!vma)
 		goto lock_mmap;
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 215690452495..7a2adf40d65d 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -480,7 +480,7 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
-	vma = lock_vma_under_rcu(mm, address);
+	vma = lock_vma_under_rcu(mm, address, is_write);
 	if (!vma)
 		goto lock_mmap;
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 5224f3733802..17d8b011105e 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -286,7 +286,7 @@ void handle_page_fault(struct pt_regs *regs)
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
-	vma = lock_vma_under_rcu(mm, addr);
+	vma = lock_vma_under_rcu(mm, addr, flags & FAULT_FLAG_WRITE);
 	if (!vma)
 		goto lock_mmap;
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 162ca2576fd4..8b53eefb947a 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -320,7 +320,7 @@ static void do_exception(struct pt_regs *regs, int access)
 		flags |= FAULT_FLAG_WRITE;
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
-	vma = lock_vma_under_rcu(mm, address);
+	vma = lock_vma_under_rcu(mm, address, is_write);
 	if (!vma)
 		goto lock_mmap;
 	if (!(vma->vm_flags & access)) {
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 67b18adc75dd..2cf38befaf45 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1356,7 +1356,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	if (!(flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
-	vma = lock_vma_under_rcu(mm, address);
+	vma = lock_vma_under_rcu(mm, address, flags & FAULT_FLAG_WRITE);
 	if (!vma)
 		goto lock_mmap;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 07c73451d42f..b3a088ce584d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -777,7 +777,8 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
 }
 
 struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
-					  unsigned long address);
+					  unsigned long address,
+					  bool writable);
 
 #else /* CONFIG_PER_VMA_LOCK */
 
@@ -790,8 +791,8 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 static inline void vma_mark_detached(struct vm_area_struct *vma,
 				     bool detached) {}
 
-static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
-		unsigned long address)
+static inline struct vm_area_struct *lock_vma_under_rcu(
+		struct mm_struct *mm, unsigned long address, bool writable)
 {
 	return NULL;
 }
diff --git a/mm/memory.c b/mm/memory.c
index b6fa5146b260..e8168ebc8095 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3218,10 +3218,8 @@ vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
 
 	if (likely(vma->anon_vma))
 		return 0;
-	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
-		vma_end_read(vma);
-		return VM_FAULT_RETRY;
-	}
+	/* We shouldn't try a per-vma fault at all if anon_vma isn't solid */
+	WARN_ON_ONCE(vmf->flags & FAULT_FLAG_VMA_LOCK);
 	if (__anon_vma_prepare(vma))
 		return VM_FAULT_OOM;
 	return 0;
@@ -5842,13 +5840,38 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
 #endif
 
 #ifdef CONFIG_PER_VMA_LOCK
-/*
- * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
- * stable and not isolated. If the VMA is not found or is being modified the
- * function returns NULL.
+
+bool vma_needs_stable_anon_vma(struct vm_area_struct *vma, bool writable)
+{
+	/*
+	 * If we are to install a read-only mapping, we should never need
+	 * an anon_vma later.  For anon, it should be the zero page.  For
+	 * file, it should normally be a page cache page, except for special
+	 * mappings (e.g. tcp zerocopy provides its own read-only pages).
+	 */
+	if (!writable)
+		return false;
+
+	/* Anonymous writable mappings will always ask for an anon_vma */
+	if (vma_is_anonymous(vma))
+		return true;
+
+	/* For file memory, we only need the anon_vma if it's private */
+	return !(vma->vm_flags & VM_SHARED);
+}
+
+/**
+ * lock_vma_under_rcu() - Lookup and lock a VMA under RCU protection.
+ * @mm: target mm struct
+ * @address: target virtual address
+ * @writable: whether we may inject a writable mapping with the VMA
+ *
+ * Returned VMA is guaranteed to be stable and not isolated.  If the VMA
+ * is not found or is being modified, the function returns NULL.
  */
 struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
-					  unsigned long address)
+					  unsigned long address,
+					  bool writable)
 {
 	MA_STATE(mas, &mm->mm_mt, address, address);
 	struct vm_area_struct *vma;
@@ -5866,9 +5889,10 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	 * find_mergeable_anon_vma uses adjacent vmas which are not locked.
 	 * This check must happen after vma_start_read(); otherwise, a
 	 * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
-	 * from its anon_vma.
+	 * from its anon_vma.  This applies to both anon and private file maps.
 	 */
-	if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
+	if (unlikely(vma_needs_stable_anon_vma(vma, writable) &&
+		     !vma->anon_vma))
 		goto inval_end_read;
 
 	/* Check since vm_start/vm_end might change before we lock the VMA */
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index f6267afe65d1..b518c111e37a 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -71,18 +71,10 @@ static struct vm_area_struct *lock_vma(struct mm_struct *mm,
 {
 	struct vm_area_struct *vma;
 
-	vma = lock_vma_under_rcu(mm, address);
-	if (vma) {
-		/*
-		 * lock_vma_under_rcu() only checks anon_vma for private
-		 * anonymous mappings.  But we need to ensure it is assigned
-		 * in private file-backed vmas as well.
-		 */
-		if (!(vma->vm_flags & VM_SHARED) && unlikely(!vma->anon_vma))
-			vma_end_read(vma);
-		else
-			return vma;
-	}
+	/* This is the dst vma, we always need the anon_vma */
+	vma = lock_vma_under_rcu(mm, address, true);
+	if (vma)
+		return vma;
 
 	mmap_read_lock(mm);
 	vma = find_vma_and_prepare_anon(mm, address);
@@ -1462,8 +1454,11 @@ static int uffd_move_lock(struct mm_struct *mm,
 	 *	vma_start_read(src_vma)
 	 *	mmap_read_lock(mm)
	 *	vma_start_write(dst_vma)
+	 *
+	 * Note that here we won't touch the src vma's anon_vma.  It should
+	 * be there, but here we don't need to ask for it.
 	 */
-	*src_vmap = lock_vma_under_rcu(mm, src_start);
+	*src_vmap = lock_vma_under_rcu(mm, src_start, false);
 	if (likely(*src_vmap))
 		return 0;
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e767721b3a58..5940847b616c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2053,7 +2053,8 @@ static struct vm_area_struct *find_tcp_vma(struct mm_struct *mm,
 					    unsigned long address,
 					    bool *mmap_locked)
 {
-	struct vm_area_struct *vma = lock_vma_under_rcu(mm, address);
+	/* TCP zerocopy pages are always read-only, see tcp_mmap() */
+	struct vm_area_struct *vma = lock_vma_under_rcu(mm, address, false);
 
 	if (vma) {
 		if (vma->vm_ops != &tcp_vm_ops) {
-- 
2.44.0

-- 
Peter Xu