2024-04-10 17:06:38

by Peter Xu

Subject: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

anon_vma is a tricky object in the context of per-vma lock, because it's
racy to modify it in that context and mmap lock is needed if it's not
stable yet.

So far there are three places that sanity check anon_vma for that:

- lock_vma_under_rcu(): this is the major entry point of per-vma lock, where
we have taken care of anonymous memory vs. potential anon_vma allocations.

- lock_vma(): even if it looks like a generic API, it's only used in the
userfaultfd context to leverage per-vma locks. It does an extra check
on MAP_PRIVATE file mappings for the same anon_vma issue.

- vmf_anon_prepare(): it works for private file mapping faults, just like
what lock_vma() wanted to cover above. One trivial difference is that in
some extreme corner cases the fault handler will still allow a per-vma
fault to happen, like a READ on a privately mapped file.

The question is whether it's intended to be this complicated. Per my
question in the thread, it is not intended, and Suren also seems to agree [1].

So the trivial side effects of such a patch are:

- We may do slightly better on the first WRITE of a private file mapping,
because we can retry earlier (in lock_vma_under_rcu(), rather than
vmf_anon_prepare() later).

- We may always use mmap lock for the initial READs on private file
mappings, while before this patch it _can_ (only when no WRITE ever
happened... but that doesn't make much sense for a MAP_PRIVATE..) do the
read fault with a per-vma lock.

Note that right after any WRITE the anon_vma will be stabilized, and then
there will be no difference. I believe that should cover the majority of
cases too; I also tried running a program writing to MAP_PRIVATE file
memory (which I pre-heated in the page cache) and I could hardly measure a
difference in performance.
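
For illustration, a rough sketch of that kind of test is below (this is
not the exact program used; the file path is a hypothetical placeholder):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical test file, assumed to exist and fit in memory */
	int fd = open("/tmp/testfile", O_RDONLY);
	struct stat st;
	struct timespec t0, t1;
	char *map, buf[4096];
	long i, pgsz = sysconf(_SC_PAGESIZE);

	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;
	/* Pre-heat the page cache with a plain read pass */
	while (read(fd, buf, sizeof(buf)) > 0)
		;
	map = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	/* First WRITE to each page: one CoW fault per page */
	for (i = 0; i < st.st_size; i += pgsz)
		map[i] = 1;
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("write faults: %.3f ms\n",
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);
	return 0;
}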

Let's simply ignore all those trivial corner cases and unify the anon_vma
check from three places into one. I didn't audit the remaining users of
lock_vma_under_rcu(); in a !fault context this could even fix something
that used to race with private file mappings, but I didn't check further.

I still left a WARN_ON_ONCE() in vmf_anon_prepare() to double check we're
all good.

[1] https://lore.kernel.org/r/CAJuCfpGj5xk-NxSwW6Mt8NGZcV9N__8zVPMGXDPAYKMcN9=Oig@mail.gmail.com

Cc: Matthew Wilcox <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Lokesh Gidra <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Alistair Popple <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/memory.c | 10 ++++------
mm/userfaultfd.c | 13 ++-----------
2 files changed, 6 insertions(+), 17 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 78422d1c7381..4e2a9c4d9776 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3219,10 +3219,8 @@ vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)

if (likely(vma->anon_vma))
return 0;
- if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
- vma_end_read(vma);
- return VM_FAULT_RETRY;
- }
+ /* We shouldn't try a per-vma fault at all if anon_vma isn't solid */
+ WARN_ON_ONCE(vmf->flags & FAULT_FLAG_VMA_LOCK);
if (__anon_vma_prepare(vma))
return VM_FAULT_OOM;
return 0;
@@ -5826,9 +5824,9 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
* find_mergeable_anon_vma uses adjacent vmas which are not locked.
* This check must happen after vma_start_read(); otherwise, a
* concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
- * from its anon_vma.
+ * from its anon_vma. This applies to both anon and private file maps.
*/
- if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
+ if (unlikely(!(vma->vm_flags & VM_SHARED) && !vma->anon_vma))
goto inval_end_read;

/* Check since vm_start/vm_end might change before we lock the VMA */
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index f6267afe65d1..61f21da77dcd 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -72,17 +72,8 @@ static struct vm_area_struct *lock_vma(struct mm_struct *mm,
struct vm_area_struct *vma;

vma = lock_vma_under_rcu(mm, address);
- if (vma) {
- /*
- * lock_vma_under_rcu() only checks anon_vma for private
- * anonymous mappings. But we need to ensure it is assigned in
- * private file-backed vmas as well.
- */
- if (!(vma->vm_flags & VM_SHARED) && unlikely(!vma->anon_vma))
- vma_end_read(vma);
- else
- return vma;
- }
+ if (vma)
+ return vma;

mmap_read_lock(mm);
vma = find_vma_and_prepare_anon(mm, address);
--
2.44.0



2024-04-10 20:27:08

by Matthew Wilcox

Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Wed, Apr 10, 2024 at 01:06:21PM -0400, Peter Xu wrote:
> anon_vma is a tricky object in the context of per-vma lock, because it's
> racy to modify it in that context and mmap lock is needed if it's not
> stable yet.

I object to this commit message. First, it's not a "sanity check". It's
a check to see if we already have an anon VMA. Second, it's not "racy
to modify it" at all. The problem is that we need to look at other
VMAs, for which we do not hold the lock.

> So the trivial side effect of such patch is:
>
> - We may do slightly better on the first WRITE of a private file mapping,
> because we can retry earlier (in lock_vma_under_rcu(), rather than
> vmf_anon_prepare() later).
>
> - We may always use mmap lock for the initial READs on a private file
> mappings, while before this patch it _can_ (only when no WRITE ever
> happened... but it doesn't make much sense for a MAP_PRIVATE..) do the
> read fault with per-vma lock.

But that's a super common path! Look at 'cat /proc/self/maps'. All
your program text (including libraries) is mapped PRIVATE, and never
written to (except by ptrace, I guess).

NAK this patch.


2024-04-10 20:44:05

by Peter Xu

Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Wed, Apr 10, 2024 at 09:26:45PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 10, 2024 at 01:06:21PM -0400, Peter Xu wrote:
> > anon_vma is a tricky object in the context of per-vma lock, because it's
> > racy to modify it in that context and mmap lock is needed if it's not
> > stable yet.
>
> I object to this commit message. First, it's not a "sanity check". It's
> a check to see if we already have an anon VMA. Second, it's not "racy
> to modify it" at all. The problem is that we need to look at other
> VMAs, for which we do not hold the lock.

For that "do not hold locks" part, isn't that "racy"?

If it's racy in that case, can I still word it as "racy to modify"? We
can't modify it because it's racy to read the other vmas.

For "sanity check"... well, that falls into this category for me, but I'm
not a native speaker. So I am open to any rewording of any of the above.

>
> > So the trivial side effect of such patch is:
> >
> > - We may do slightly better on the first WRITE of a private file mapping,
> > because we can retry earlier (in lock_vma_under_rcu(), rather than
> > vmf_anon_prepare() later).
> >
> > - We may always use mmap lock for the initial READs on a private file
> > mappings, while before this patch it _can_ (only when no WRITE ever
> > happened... but it doesn't make much sense for a MAP_PRIVATE..) do the
> > read fault with per-vma lock.
>
> But that's a super common path! Look at 'cat /proc/self/maps'. All
> your program text (including libraries) is mapped PRIVATE, and never
> written to (except by ptrace, I guess).
>
> NAK this patch.

We're talking about any vma that will first benefit from a per-vma lock
here, right?

I think it should only be relevant to some major VMA, or a bunch of VMAs,
that a userspace maps explicitly; then, IIUC, the goal is to reduce the
cache bouncing of the lock when it used to be per-mm, by replacing it with
a finer-grained lock. It doesn't sound right that these libraries even
fall into this category, as they should just get loaded soon enough when
the program starts.

IOW, my understanding is that such normal vmas, or simple programs, don't
benefit that much from the per-vma lock; we take either the per-vma read
lock or the mmap read lock, and I would expect similar performance when
such cache bouncing isn't heavy.

I can do some tests later today or tomorrow. Any suggestion you have on
amplifying such effect that you have concern with?

--
Peter Xu


2024-04-10 21:10:59

by Matthew Wilcox

Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Wed, Apr 10, 2024 at 04:43:51PM -0400, Peter Xu wrote:
> On Wed, Apr 10, 2024 at 09:26:45PM +0100, Matthew Wilcox wrote:
> > On Wed, Apr 10, 2024 at 01:06:21PM -0400, Peter Xu wrote:
> > > anon_vma is a tricky object in the context of per-vma lock, because it's
> > > racy to modify it in that context and mmap lock is needed if it's not
> > > stable yet.
> >
> > I object to this commit message. First, it's not a "sanity check". It's
> > a check to see if we already have an anon VMA. Second, it's not "racy
> > to modify it" at all. The problem is that we need to look at other
> > VMAs, for which we do not hold the lock.
>
> For that "do not hold locks" part, isn't that "racy"?

No.

> > > - We may always use mmap lock for the initial READs on a private file
> > > mappings, while before this patch it _can_ (only when no WRITE ever
> > > happened... but it doesn't make much sense for a MAP_PRIVATE..) do the
> > > read fault with per-vma lock.
> >
> > But that's a super common path! Look at 'cat /proc/self/maps'. All
> > your program text (including libraries) is mapped PRIVATE, and never
> > written to (except by ptrace, I guess).
> >
> > NAK this patch.
>
> We're talking about any vma that will first benefit from a per-vma lock
> here, right?
>
> I think it should be only relevant to some major VMA or bunch of VMAs that
> an userspace maps explicitly, then iiuc the goal is we want to reduce the
> cache bouncing of the lock when it used to be per-mm, by replacing it with
> a finer lock. It doesn't sound right that these libraries even fall into
> this category as they should just get loaded soon enough when the program
> starts.
>
> IOW, my understanding is that per-vma lock doesn't benefit from such normal
> vmas or simple programs that much; we take either per-vma read lock, or
> mmap read lock, and I would expect similar performance when such cache
> bouncing isn't heavy.
>
> I can do some tests later today or tomorrow. Any suggestion you have on
> amplifying such effect that you have concern with?

8 socket NUMA system, 800MB text segment, 10,000 threads. No, I'm not
joking, that's a real customer workload.

2024-04-10 21:23:33

by Peter Xu

Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Wed, Apr 10, 2024 at 10:10:45PM +0100, Matthew Wilcox wrote:
> > I can do some tests later today or tomorrow. Any suggestion you have on
> > amplifying such effect that you have concern with?
>
> 8 socket NUMA system, 800MB text segment, 10,000 threads. No, I'm not
> joking, that's a real customer workload.

Well, I believe you, but even with this, that's a total of 800MB of memory
on a giant monster system... probably just to fault in once.

And even before we get into the details.. we're talking about such a giant
program running across hundreds of cores with hundreds of MB of text,
then... hasn't the program developer already considered mlockall() at the
entry of the program? Wouldn't that already be greatly beneficial with
whatever granularity of locks a future fault would take?
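
For reference, what I mean is as trivial as doing something like this at
startup (illustrative sketch only):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	/*
	 * Lock and fault in everything mapped now and anything mapped
	 * later, so steady-state execution takes no page faults at all,
	 * regardless of which lock the fault path would have used.
	 */
	if (mlockall(MCL_CURRENT | MCL_FUTURE))
		perror("mlockall");

	/* ... rest of the program ... */
	return 0;
}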

--
Peter Xu


2024-04-11 00:28:09

by Peter Xu

Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Thu, Apr 11, 2024 at 12:59:09AM +0100, Matthew Wilcox wrote:
> On Wed, Apr 10, 2024 at 05:23:18PM -0400, Peter Xu wrote:
> > On Wed, Apr 10, 2024 at 10:10:45PM +0100, Matthew Wilcox wrote:
> > > > I can do some tests later today or tomorrow. Any suggestion you have on
> > > > amplifying such effect that you have concern with?
> > >
> > > 8 socket NUMA system, 800MB text segment, 10,000 threads. No, I'm not
> > > joking, that's a real customer workload.
> >
> > Well, I believe you, but even with this, that's a total of 800MB memory on
> > a giant moster system... probably just to fault in once.
> >
> > And even before we talk about that into details.. we're talking about such
> > giant program running acorss hundreds of cores with hundreds of MB text,
> > then... hasn't the program developer already considered mlockall() at the
> > entry of the program? Wouldn't that greatly beneficial already with
> > whatever granule of locks that a future fault would take?
>
> I don't care what your theory is, or even what your benchmarking shows.
> I had basically the inverse of this patch, and my customer's workload
> showed significant improvement as a result. Data talks, bullshit walks.
> Your patch is NAKed and will remain NAKed.

Either would you tell me your workload, I may try it.

Or, please explain why it helps? If such huge library is in a single VMA,
I don't see why per-vma lock is better than mmap lock. If the text is
combined with multiple vmas, it should only help when each core faults at
least on different vmas, not the same.

Would you go either way, please?

For now my patch got strongly NACKed without yet a real proof. If that
really matters, I am happy to learn, and I agree this patch shouldn't go in
if that's provided. Otherwise I am not convinced. If you think data
talks, I'm happy to try any workload that I am in access with, then we
compare data.

--
Peter Xu


2024-04-11 04:14:23

by Matthew Wilcox

Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Wed, Apr 10, 2024 at 05:23:18PM -0400, Peter Xu wrote:
> On Wed, Apr 10, 2024 at 10:10:45PM +0100, Matthew Wilcox wrote:
> > > I can do some tests later today or tomorrow. Any suggestion you have on
> > > amplifying such effect that you have concern with?
> >
> > 8 socket NUMA system, 800MB text segment, 10,000 threads. No, I'm not
> > joking, that's a real customer workload.
>
> Well, I believe you, but even with this, that's a total of 800MB memory on
> a giant moster system... probably just to fault in once.
>
> And even before we talk about that into details.. we're talking about such
> giant program running acorss hundreds of cores with hundreds of MB text,
> then... hasn't the program developer already considered mlockall() at the
> entry of the program? Wouldn't that greatly beneficial already with
> whatever granule of locks that a future fault would take?

I don't care what your theory is, or even what your benchmarking shows.
I had basically the inverse of this patch, and my customer's workload
showed significant improvement as a result. Data talks, bullshit walks.
Your patch is NAKed and will remain NAKed.

2024-04-11 14:51:05

by Matthew Wilcox

Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Wed, Apr 10, 2024 at 08:20:04PM -0400, Peter Xu wrote:
> On Thu, Apr 11, 2024 at 12:59:09AM +0100, Matthew Wilcox wrote:
> > On Wed, Apr 10, 2024 at 05:23:18PM -0400, Peter Xu wrote:
> > > On Wed, Apr 10, 2024 at 10:10:45PM +0100, Matthew Wilcox wrote:
> > > > > I can do some tests later today or tomorrow. Any suggestion you have on
> > > > > amplifying such effect that you have concern with?
> > > >
> > > > 8 socket NUMA system, 800MB text segment, 10,000 threads. No, I'm not
> > > > joking, that's a real customer workload.
> > >
> > > Well, I believe you, but even with this, that's a total of 800MB memory on
> > > a giant moster system... probably just to fault in once.
> > >
> > > And even before we talk about that into details.. we're talking about such
> > > giant program running acorss hundreds of cores with hundreds of MB text,
> > > then... hasn't the program developer already considered mlockall() at the
> > > entry of the program? Wouldn't that greatly beneficial already with
> > > whatever granule of locks that a future fault would take?
> >
> > I don't care what your theory is, or even what your benchmarking shows.
> > I had basically the inverse of this patch, and my customer's workload
> > showed significant improvement as a result. Data talks, bullshit walks.
> > Your patch is NAKed and will remain NAKed.
>
> Either would you tell me your workload, I may try it.
>
> Or, please explain why it helps? If such huge library is in a single VMA,
> I don't see why per-vma lock is better than mmap lock. If the text is
> combined with multiple vmas, it should only help when each core faults at
> least on different vmas, not the same.

Oh, you really don't understand. The mmap_lock is catastrophically
overloaded. Before the per-VMA lock, every page fault took it for read,
and every call to mmap() took it for write. Because our rwsems are
fair, once one thread has called mmap() it waits for all existing page
faults to complete _and_ blocks all page faults from starting until
it has completed. That's a huge source of unexpected latency for any
multithreaded application.
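
As a minimal standalone illustration of that pattern (a hypothetical toy
test, not the customer workload): have threads fault on a large mapping
while another thread keeps calling mmap()/munmap(), each of which takes
mmap_lock for write:

#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SIZE	(256UL << 20)
#define NR_FAULTERS	8

static char *region;
static volatile int stop;

/* Each first touch is a page fault: mmap_lock for read (or, with
 * per-VMA locks, only the VMA read lock). */
static void *faulter(void *arg)
{
	long pgsz = sysconf(_SC_PAGESIZE);
	unsigned long off;

	for (off = 0; off < MAP_SIZE; off += pgsz)
		*(volatile char *)(region + off);
	return NULL;
}

/* Each mmap()/munmap() takes mmap_lock for write; a fair rwsem then
 * queues new readers (page faults) behind the writer. */
static void *mapper(void *arg)
{
	while (!stop) {
		void *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p != MAP_FAILED)
			munmap(p, 1 << 20);
	}
	return NULL;
}

int main(void)
{
	pthread_t th[NR_FAULTERS], mt;
	int i;

	region = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED)
		return 1;
	pthread_create(&mt, NULL, mapper, NULL);
	for (i = 0; i < NR_FAULTERS; i++)
		pthread_create(&th[i], NULL, faulter, NULL);
	for (i = 0; i < NR_FAULTERS; i++)
		pthread_join(th[i], NULL);
	stop = 1;
	pthread_join(mt, NULL);
	return 0;
}

Timing the fault loop with and without the mapper thread running shows the
latency the write-locked periods inject into every faulting thread.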

Anything we can do to avoid taking the mmap_sem, even for read, helps any
multithreaded workload. Your suggestion that "this is rare, it doesn't
matter" shows that you don't get it. That you haven't found a workload
where you can measure it shows that your testing is inadequate.

Yes, there's added complexity with the per-VMA locks. But we need it for
good performance. Throwing away performance on a very small reduction
in complexity is a terrible trade-off.

2024-04-11 16:05:59

by Peter Xu

Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Thu, Apr 11, 2024 at 03:50:54PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 10, 2024 at 08:20:04PM -0400, Peter Xu wrote:
> > On Thu, Apr 11, 2024 at 12:59:09AM +0100, Matthew Wilcox wrote:
> > > On Wed, Apr 10, 2024 at 05:23:18PM -0400, Peter Xu wrote:
> > > > On Wed, Apr 10, 2024 at 10:10:45PM +0100, Matthew Wilcox wrote:
> > > > > > I can do some tests later today or tomorrow. Any suggestion you have on
> > > > > > amplifying such effect that you have concern with?
> > > > >
> > > > > 8 socket NUMA system, 800MB text segment, 10,000 threads. No, I'm not
> > > > > joking, that's a real customer workload.
> > > >
> > > > Well, I believe you, but even with this, that's a total of 800MB memory on
> > > > a giant moster system... probably just to fault in once.
> > > >
> > > > And even before we talk about that into details.. we're talking about such
> > > > giant program running acorss hundreds of cores with hundreds of MB text,
> > > > then... hasn't the program developer already considered mlockall() at the
> > > > entry of the program? Wouldn't that greatly beneficial already with
> > > > whatever granule of locks that a future fault would take?
> > >
> > > I don't care what your theory is, or even what your benchmarking shows.
> > > I had basically the inverse of this patch, and my customer's workload
> > > showed significant improvement as a result. Data talks, bullshit walks.
> > > Your patch is NAKed and will remain NAKed.
> >
> > Either would you tell me your workload, I may try it.
> >
> > Or, please explain why it helps? If such huge library is in a single VMA,
> > I don't see why per-vma lock is better than mmap lock. If the text is
> > combined with multiple vmas, it should only help when each core faults at
> > least on different vmas, not the same.
>
> Oh, you really don't understand. The mmap_lock is catastrophically
> overloaded. Before the per-VMA lock, every page fault took it for read,
> and every call to mmap() took it for write. Because our rwsems are
> fair, once one thread has called mmap() it waits for all existing page
> faults to complete _and_ blocks all page faults from starting until
> it has completed. That's a huge source of unexpected latency for any
> multithreaded application.
>
> Anything we can do to avoid taking the mmap_sem, even for read, helps any
> multithreaded workload. Your suggestion that "this is rare, it doesn't
> matter" shows that you don't get it. That you haven't found a workload
> where you can measure it shows that your testing is inadequate.
>
> Yes, there's added complexity with the per-VMA locks. But we need it for
> good performance. Throwing away performance on a very small reduction
> in complexity is a terrible trade-off.

Yes, this is a technical discussion, and such comments are what I'm looking
for. Thank you.

What I am not sure about so far is whether the performance degradation you
are worried about for "this small corner case" even exists.

Let's first check when those vmf_anon_prepare() lines were introduced:

commit 17c05f18e54158a3eed0c22c85b7a756b63dcc01
Author: Suren Baghdasaryan <[email protected]>
Date: Mon Feb 27 09:36:25 2023 -0800

mm: prevent do_swap_page from handling page faults under VMA lock

It didn't seem like the late anon_vma check was planned for performance
reasons.

To figure these out, let me ask some more questions.

1) When you said "you ran a customer workload, and that greatly improved
performance", are you comparing between:

- with/without file-typed per-vma lock, or,
- with/without this patch?

Note that I'm hopefully not touching the fact that file per-vma locking
should work like before for the majority of cases. And I'm surprised to
see your comment, because I didn't expect this was even measured before.

To ask in another way: was it your intention to check anon_vma late for
private file mappings when working on file support for per-vma lock?

If the answer is yes, I'd say please provide some documentation patch to
support such behavior. You can stop my patch from getting merged now,
but it's never clear whether someone else will see this and post it
again. If it wasn't obvious to Suren, who introduced the per-vma lock [1],
then I won't be surprised if it's unknown to most developers on the list.

2) What happens if lock_vma_under_rcu() keeps spreading its usage?

Now it's already spread over to the uffd world. Uffd has that special
check to make sure file private mappings are fine in lock_vma().

If it keeps going like that, I won't be surprised to see future users
of lock_vma_under_rcu() forget about private file mappings, and that can
cause hard-to-debug issues. Even if lock_vma_under_rcu() is exactly
what you're looking for, I think we may need a lock_vma_under_rcu_fault();
then lock_vma_under_rcu() should cover private file mappings to make
sure future users aren't easily exposed to hard-to-debug issues.
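
To illustrate (a rough, untested sketch; the name is just a placeholder):
keep today's relaxed lookup for the fault path only, and make the generic
helper strict about private mappings without an anon_vma:

/* Hypothetical: today's relaxed lookup, kept for the fault path only */
struct vm_area_struct *lock_vma_under_rcu_fault(struct mm_struct *mm,
						unsigned long address);

struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
					  unsigned long address)
{
	struct vm_area_struct *vma = lock_vma_under_rcu_fault(mm, address);

	/*
	 * Non-fault users get the strict check: any private mapping
	 * (anon or file) must already have its anon_vma set up.
	 */
	if (vma && !(vma->vm_flags & VM_SHARED) && unlikely(!vma->anon_vma)) {
		vma_end_read(vma);
		return NULL;
	}
	return vma;
}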

[1] https://lore.kernel.org/r/CAJuCfpGj5xk-NxSwW6Mt8NGZcV9N__8zVPMGXDPAYKMcN9=Oig@mail.gmail.com

Thanks,

--
Peter Xu


2024-04-11 17:10:24

by Suren Baghdasaryan

Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Wed, Apr 10, 2024 at 1:26 PM Matthew Wilcox <[email protected]> wrote:
>
> On Wed, Apr 10, 2024 at 01:06:21PM -0400, Peter Xu wrote:
> > anon_vma is a tricky object in the context of per-vma lock, because it's
> > racy to modify it in that context and mmap lock is needed if it's not
> > stable yet.
>
> I object to this commit message. First, it's not a "sanity check". It's
> a check to see if we already have an anon VMA. Second, it's not "racy
> to modify it" at all. The problem is that we need to look at other
> VMAs, for which we do not hold the lock.
>
> > So the trivial side effect of such patch is:
> >
> > - We may do slightly better on the first WRITE of a private file mapping,
> > because we can retry earlier (in lock_vma_under_rcu(), rather than
> > vmf_anon_prepare() later).
> >
> > - We may always use mmap lock for the initial READs on a private file
> > mappings, while before this patch it _can_ (only when no WRITE ever
> > happened... but it doesn't make much sense for a MAP_PRIVATE..) do the
> > read fault with per-vma lock.
>
> But that's a super common path! Look at 'cat /proc/self/maps'. All
> your program text (including libraries) is mapped PRIVATE, and never
> written to (except by ptrace, I guess).

Uh, indeed I didn't realize this would be the side-effect from this
early check. And that's exactly why I wanted Matthew's input on this
in [1].

>
> NAK this patch.
>

2024-04-11 18:16:02

by Liam R. Howlett

Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

* Peter Xu <[email protected]> [240410 13:06]:
> anon_vma is a tricky object in the context of per-vma lock, because it's
> racy to modify it in that context and mmap lock is needed if it's not
> stable yet.

How is it racy in your view? There isn't a way to get in a situation
where its state isn't certain, is there?

>
> So far there are three places that sanity checks anon_vma for that:
>
> - lock_vma_under_rcu(): this is the major entrance of per-vma lock, where
> we have taken care of anon memory v.s. potential anon_vma allocations.

Well, not exactly. Single vma faults vs potential modifications beyond
one vma (including uses of adjacent vmas with anon_vmas)

>
> - lock_vma(): even if it looks so generic as an API, it's only used in
> userfaultfd context to leverage per-vma locks. It does extra check
> over MAP_PRIVATE file mappings for the same anon_vma issue.

This static function is in mm/userfaultfd so, yes, it's not generic -
the name is generic, but I didn't see a reason to hold up the patch for
a static name that is descriptive. It's static so I'm not concerned
about usage growing.

I would expect such a check will eventually need to be moved to common
code, and then potentially change the name to something more
descriptive. This seems premature with a single user though.

>
> - vmf_anon_prepare(): it works for private file mapping faults just like
> what lock_vma() wanted to cover above. One trivial difference is in
> some extremely corner case, the fault handler will still allow per-vma
> fault to happen, like a READ on a privately mapped file.

What is happening here is that we are detecting when the virtual memory
space is stable vs when the vma is stable. In some cases, when we need
to check prev/next, then we need to make sure the virtual memory space
is stable. In other cases, the vma is all that matters to the operation
and so we can continue without stopping anyone else from modifying the
virtual memory space - that is, we are allowing writes in other areas.

>
> The question is whether that's intended to make it as complicated. Per my
> question in the thread, it is not intended, and Suren also seems to agree [1].
>
> So the trivial side effect of such patch is:
>
> - We may do slightly better on the first WRITE of a private file mapping,
> because we can retry earlier (in lock_vma_under_rcu(), rather than
> vmf_anon_prepare() later).
>
> - We may always use mmap lock for the initial READs on a private file
> mappings, while before this patch it _can_ (only when no WRITE ever
> happened... but it doesn't make much sense for a MAP_PRIVATE..) do the
> read fault with per-vma lock.
>
> Then noted that right after any WRITE the anon_vma will be stablized, then
> there will be no difference.

In my view, the anon_vma is always stable. During a write it is
initialised.

>And I believe that should be the majority
> cases too; I also did try to run a program, writting to MAP_PRIVATE file
> memory (that I pre-headed in the page cache) and I can hardly measure a
> difference in performance.
>
> Let's simply ignore all those trivial corner cases and unify the anon_vma
> check from three places into one. I also didn't check the rest users of
> lock_vma_under_rcu(), where in a !fault context it could even fix something
> that used to race with private file mappings but I didn't check further.

This change will increase mmap semaphore contention. I'd like to move
to a more common structured layout, but I think we need to find a way to
do this without failing the lock_vma_under_rcu() more frequently. In
fact, we need to be looking for ways to make it fail less.

The UFFD code is more strict on what is acceptable for a per-vma lock
[1]. This user has a restriction that will decrease the benefits of the
per-vma lock, but we shouldn't make this case the common one.

As I'm sure you are aware, the page fault path is the most common path
for using per-vma locks and should minimize taking the mmap lock as much
as possible.

I don't think we should increase the lock contention to simplify the
code.

Thanks,
Liam

[1] https://lore.kernel.org/lkml/CAG48ez0AdTijvuh0xueg_spwNE9tVcPuvqT9WpvmtiNNudQFMw@mail.gmail.com/

2024-04-11 21:17:34

by Peter Xu

Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

Hey, Liam,

Thanks a lot for the comments.

On Thu, Apr 11, 2024 at 01:13:19PM -0400, Liam R. Howlett wrote:
> * Peter Xu <[email protected]> [240410 13:06]:
> > anon_vma is a tricky object in the context of per-vma lock, because it's
> > racy to modify it in that context and mmap lock is needed if it's not
> > stable yet.
>
> How is it racy in your view? There isn't a way to get in a situation
> where its state isn't certain, is there?

My apologies for the confusing wordings. AFAICT we're on the same page
throughout on this matter, so I will amend these wordings if I repost, and
skip many of the comments below where I'm confident we're aligned on the
understanding.

And actually, I'll attach one new version right here that should remove
all the bad side effects and hopefully only bring good ones; see below.

>
> >
> > So far there are three places that sanity checks anon_vma for that:
> >
> > - lock_vma_under_rcu(): this is the major entrance of per-vma lock, where
> > we have taken care of anon memory v.s. potential anon_vma allocations.
>
> Well, not exactly. Single vma faults vs potential modifications beyond
> one vma (including uses of adjacent vmas with anon_vmas)

(this is also the wording issue; I'll fix it)

>
> >
> > - lock_vma(): even if it looks so generic as an API, it's only used in
> > userfaultfd context to leverage per-vma locks. It does extra check
> > over MAP_PRIVATE file mappings for the same anon_vma issue.
>
> This static function is in mm/userfaultfd so, yes, it's not generic -
> the name is generic, but I didn't see a reason to hold up the patch for
> a static name that is descriptive. It's static so I'm not concerned
> about usage growing.
>
> I would expect such a check will eventually need to be moved to common
> code, and then potentially change the name to something more
> descriptive. This seems premature with a single user though.
>
> >
> > - vmf_anon_prepare(): it works for private file mapping faults just like
> > what lock_vma() wanted to cover above. One trivial difference is in
> > some extremely corner case, the fault handler will still allow per-vma
> > fault to happen, like a READ on a privately mapped file.
>
> What is happening here is that we are detecting when the virtual memory
> space is stable vs when the vma is stable. In some cases, when we need
> to check prev/next, then we need to make sure the virtual memory space
> is stable. In other cases, the vma is all that matters to the operation
> and so we can continue without stopping anyone else from modifying the
> virtual memory space - that is, we are allowing writes in other areas.
>
> >
> > The question is whether that's intended to make it as complicated. Per my
> > question in the thread, it is not intended, and Suren also seems to agree [1].
> >
> > So the trivial side effect of such patch is:
> >
> > - We may do slightly better on the first WRITE of a private file mapping,
> > because we can retry earlier (in lock_vma_under_rcu(), rather than
> > vmf_anon_prepare() later).
> >
> > - We may always use mmap lock for the initial READs on a private file
> > mappings, while before this patch it _can_ (only when no WRITE ever
> > happened... but it doesn't make much sense for a MAP_PRIVATE..) do the
> > read fault with per-vma lock.
> >
> > Then noted that right after any WRITE the anon_vma will be stablized, then
> > there will be no difference.
>
> In my view, the anon_vma is always stable. During a write it is
> initialised.
>
> >And I believe that should be the majority
> > cases too; I also did try to run a program, writting to MAP_PRIVATE file
> > memory (that I pre-headed in the page cache) and I can hardly measure a
> > difference in performance.
> >
> > Let's simply ignore all those trivial corner cases and unify the anon_vma
> > check from three places into one. I also didn't check the rest users of
> > lock_vma_under_rcu(), where in a !fault context it could even fix something
> > that used to race with private file mappings but I didn't check further.
>
> This change will increase mmap semaphore contention. I'd like to move
> to a more common structured layout, but I think we need to find a way to
> do this without failing the lock_vma_under_rcu() more frequently. In
> fact, we need to be looking for ways to make it fail less.
>
> The UFFD code is more strict on what is acceptable for a per-vma lock
> [1]. This user has a restriction that will decrease the benefits of the
> per-vma lock, but we shouldn't make this case the common one.
>
> As I'm sure you are aware, the page fault path is the most common path
> for using per-vma locks and should minimize taking the mmap lock as much
> as possible.
>
> I don't think we should increase the lock contention to simplify the
> code.

Right. I believe the major issue right now with this patch is that it can
reduce performance in some cases. I wasn't worried about that, but it
seems it's already a concern to many. It could be that I am just too bold
in such an attempt.

However, note that what I want to do (put all the anon_vma magic together)
should be an issue orthogonal to the perf degradation above. So please
feel free to have a look below; I believe it won't degrade anything, and
AFAIU it should speed up two things, even if only trivially:

- The WRITE on private file mapping will now retry earlier than before
- The READ on anon zero pages will now allow per-vma lock

Neither of the above is allowed without the patch below. And the patch
below should also put all the anon_vma checks together.

I haven't even compile-tested it yet, but let me send it out first to see
whether any of you have comments, so take it as a super-RFC version.

Note that I attached patch 1, a hugetlb change, which will be needed by
patch 2; otherwise we can hit issues with hugetlb private mappings.
Patch 1 alone will need some justification too, but IMHO we can ignore it
for now and just look at patch 2. I attached patch 1 just in case it's
a good reference.

Thanks,

===8<===
From fed3c10ef87a9614a328edbfd7d4b7047d3f5e57 Mon Sep 17 00:00:00 2001
From: Peter Xu <[email protected]>
Date: Thu, 11 Apr 2024 16:23:50 -0400
Subject: [PATCH 1/2] mm/hugetlb: Match behavior on read-only private mappings

The default behavior for reading a private file mapping should be to fill
in the page cache and map the page read-only. Hugetlb didn't behave like
that. Make it behave the same.

Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a8536349de13..bc3c97c476d2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6199,6 +6199,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
struct folio *folio;
pte_t new_pte;
bool new_folio, new_pagecache_folio = false;
+ bool is_write = vmf->flags & FAULT_FLAG_WRITE;
u32 hash = hugetlb_fault_mutex_hash(mapping, vmf->pgoff);

/*
@@ -6276,7 +6277,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
__folio_mark_uptodate(folio);
new_folio = true;

- if (vma->vm_flags & VM_MAYSHARE) {
+ if (!is_write || (vma->vm_flags & VM_MAYSHARE)) {
int err = hugetlb_add_to_page_cache(folio, mapping,
vmf->pgoff);
if (err) {
@@ -6294,6 +6295,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
}
new_pagecache_folio = true;
} else {
+ /* Write on a private mapping */
folio_lock(folio);

ret = vmf_anon_prepare(vmf);
@@ -6333,7 +6335,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
* any allocations necessary to record that reservation occur outside
* the spinlock.
*/
- if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
+ if (is_write && !(vma->vm_flags & VM_SHARED)) {
if (vma_needs_reservation(h, vma, vmf->address) < 0) {
ret = VM_FAULT_OOM;
goto backout_unlocked;
--
2.44.0

From b9b57fb580992115ae1258fc1ecd8586a5bd70d2 Mon Sep 17 00:00:00 2001
From: Peter Xu <[email protected]>
Date: Thu, 4 Apr 2024 16:05:55 -0400
Subject: [PATCH 2/2] mm: Always sanity check anon_vma first for per-vma locks

anon_vma is a tricky object in the context of per-vma lock, because
sometimes it is required later for mapping the pages, while it's not safe
to prepare anon_vma under the per-vma lock: preparing it may need to look
at the anon_vma of adjacent VMAs, for which we don't hold any lock.

So far there are three places where the per-vma lock path checks for the
existence of anon_vma:

1) lock_vma_under_rcu(): this is the major entry point of per-vma lock,
where we have taken care of anonymous memory.

2) lock_vma(): even if it looks like a generic API, it's only used in the
userfaultfd context to leverage per-vma locks. It does an extra check on
MAP_PRIVATE file mappings for the same anon_vma issue. Note that this
only applies to the dst vma, not the src vma, as we don't ask for the src
vma's anon_vma even though it should be present regardless.

3) vmf_anon_prepare(): it works for private file mapping faults, just like
what lock_vma() wanted to cover above. However, this check is done only
later, deeper in the fault stack.

The question is whether it's intended to be this complicated. For
example, why don't we also check anon_vma for anonymous memory later, when
preparing the anon_vma, the way we do it late for file memory? AFAICT
there's nothing special about file memory in this case.

To unify anon_vma checks, do it always right after we get the per-vma lock.
We should have enough information already at this point, and it doesn't
need to be postponed for file-only. The only missing information here is
whether the caller of the per-vma rcu lock will install a writable mapping
or not. That info is passed into lock_vma_under_rcu() with a boolean now.

So the trivial (all good) side effects of such a patch are:

- We may do slightly better on the first WRITE of a private file mapping,
because we can retry earlier (in lock_vma_under_rcu(), rather than
vmf_anon_prepare() later).

- We may start to allow per-vma page faults on zero pages even for
anonymous memory, while prior to this patch we didn't allow that to
happen even though installation of the zero page does not require
anon_vma's existence.

I still left a WARN_ON_ONCE() in vmf_anon_prepare() to double check we're
all good.

Cc: Matthew Wilcox <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Lokesh Gidra <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Alistair Popple <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
arch/arm/mm/fault.c | 2 +-
arch/arm64/mm/fault.c | 2 +-
arch/powerpc/mm/fault.c | 2 +-
arch/riscv/mm/fault.c | 2 +-
arch/s390/mm/fault.c | 2 +-
arch/x86/mm/fault.c | 2 +-
include/linux/mm.h | 7 ++++---
mm/memory.c | 46 +++++++++++++++++++++++++++++++----------
mm/userfaultfd.c | 21 +++++++------------
net/ipv4/tcp.c | 3 ++-
10 files changed, 55 insertions(+), 34 deletions(-)

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 5c4b417e24f9..20c21ecec25c 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -288,7 +288,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
if (!(flags & FAULT_FLAG_USER))
goto lock_mmap;

- vma = lock_vma_under_rcu(mm, addr);
+ vma = lock_vma_under_rcu(mm, addr, flags & FAULT_FLAG_WRITE);
if (!vma)
goto lock_mmap;

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 405f9aa831bd..4be946ca80bd 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -566,7 +566,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
if (!(mm_flags & FAULT_FLAG_USER))
goto lock_mmap;

- vma = lock_vma_under_rcu(mm, addr);
+ vma = lock_vma_under_rcu(mm, addr, mm_flags & FAULT_FLAG_WRITE);
if (!vma)
goto lock_mmap;

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 215690452495..7a2adf40d65d 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -480,7 +480,7 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
if (!(flags & FAULT_FLAG_USER))
goto lock_mmap;

- vma = lock_vma_under_rcu(mm, address);
+ vma = lock_vma_under_rcu(mm, address, is_write);
if (!vma)
goto lock_mmap;

diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 5224f3733802..17d8b011105e 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -286,7 +286,7 @@ void handle_page_fault(struct pt_regs *regs)
if (!(flags & FAULT_FLAG_USER))
goto lock_mmap;

- vma = lock_vma_under_rcu(mm, addr);
+ vma = lock_vma_under_rcu(mm, addr, flags & FAULT_FLAG_WRITE);
if (!vma)
goto lock_mmap;

diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 162ca2576fd4..8b53eefb947a 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -320,7 +320,7 @@ static void do_exception(struct pt_regs *regs, int access)
flags |= FAULT_FLAG_WRITE;
if (!(flags & FAULT_FLAG_USER))
goto lock_mmap;
- vma = lock_vma_under_rcu(mm, address);
+ vma = lock_vma_under_rcu(mm, address, is_write);
if (!vma)
goto lock_mmap;
if (!(vma->vm_flags & access)) {
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 67b18adc75dd..2cf38befaf45 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1356,7 +1356,7 @@ void do_user_addr_fault(struct pt_regs *regs,
if (!(flags & FAULT_FLAG_USER))
goto lock_mmap;

- vma = lock_vma_under_rcu(mm, address);
+ vma = lock_vma_under_rcu(mm, address, flags & FAULT_FLAG_WRITE);
if (!vma)
goto lock_mmap;

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 07c73451d42f..b3a088ce584d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -777,7 +777,8 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
}

struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
- unsigned long address);
+ unsigned long address,
+ bool writable);

#else /* CONFIG_PER_VMA_LOCK */

@@ -790,8 +791,8 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
static inline void vma_mark_detached(struct vm_area_struct *vma,
bool detached) {}

-static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
- unsigned long address)
+static inline struct vm_area_struct *lock_vma_under_rcu(
+ struct mm_struct *mm, unsigned long address, bool writable)
{
return NULL;
}
diff --git a/mm/memory.c b/mm/memory.c
index b6fa5146b260..e8168ebc8095 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3218,10 +3218,8 @@ vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)

if (likely(vma->anon_vma))
return 0;
- if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
- vma_end_read(vma);
- return VM_FAULT_RETRY;
- }
+ /* We shouldn't try a per-vma fault at all if anon_vma isn't solid */
+ WARN_ON_ONCE(vmf->flags & FAULT_FLAG_VMA_LOCK);
if (__anon_vma_prepare(vma))
return VM_FAULT_OOM;
return 0;
@@ -5842,13 +5840,38 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
#endif

#ifdef CONFIG_PER_VMA_LOCK
-/*
- * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
- * stable and not isolated. If the VMA is not found or is being modified the
- * function returns NULL.
+
+static bool vma_needs_stable_anon_vma(struct vm_area_struct *vma, bool writable)
+{
+ /*
+ * If we are to install a read-only mapping, we should never need an
+ * anon_vma later. For anon, it should be the zero page. For file,
+ * it should normally be a page cache page, except for special mappings
+ * (e.g. tcp zerocopy provides its own read-only pages).
+ */
+ if (!writable)
+ return false;
+
+ /* Anonymous writable mappings always will ask for anon_vma */
+ if (vma_is_anonymous(vma))
+ return true;
+
+ /* For file memory, only need anon_vma if it's private */
+ return !(vma->vm_flags & VM_SHARED);
+}
+
+/**
+ * lock_vma_under_rcu() - Lookup and lock a VMA under RCU protection.
+ * @mm: target mm struct
+ * @address: target virtual address
+ * @writable: whether we may install a writable mapping in the VMA
+ *
+ * Returned VMA is guaranteed to be stable and not isolated. If the VMA is
+ * not found or is being modified the function returns NULL.
*/
struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
- unsigned long address)
+ unsigned long address,
+ bool writable)
{
MA_STATE(mas, &mm->mm_mt, address, address);
struct vm_area_struct *vma;
@@ -5866,9 +5889,10 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
* find_mergeable_anon_vma uses adjacent vmas which are not locked.
* This check must happen after vma_start_read(); otherwise, a
* concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
- * from its anon_vma.
+ * from its anon_vma. This applies to both anon and private file maps.
*/
- if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
+ if (unlikely(vma_needs_stable_anon_vma(vma, writable) &&
+ !vma->anon_vma))
goto inval_end_read;

/* Check since vm_start/vm_end might change before we lock the VMA */
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index f6267afe65d1..b518c111e37a 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -71,18 +71,10 @@ static struct vm_area_struct *lock_vma(struct mm_struct *mm,
{
struct vm_area_struct *vma;

- vma = lock_vma_under_rcu(mm, address);
- if (vma) {
- /*
- * lock_vma_under_rcu() only checks anon_vma for private
- * anonymous mappings. But we need to ensure it is assigned in
- * private file-backed vmas as well.
- */
- if (!(vma->vm_flags & VM_SHARED) && unlikely(!vma->anon_vma))
- vma_end_read(vma);
- else
- return vma;
- }
+ /* This is the dst vma, we always need the anon_vma */
+ vma = lock_vma_under_rcu(mm, address, true);
+ if (vma)
+ return vma;

mmap_read_lock(mm);
vma = find_vma_and_prepare_anon(mm, address);
@@ -1462,8 +1454,11 @@ static int uffd_move_lock(struct mm_struct *mm,
* vma_start_read(src_vma)
* mmap_read_lock(mm)
* vma_start_write(dst_vma)
+ *
+ * Note that here we won't touch src vma's anon_vma. It should be
+ * there, but here we don't need to ask for it.
*/
- *src_vmap = lock_vma_under_rcu(mm, src_start);
+ *src_vmap = lock_vma_under_rcu(mm, src_start, false);
if (likely(*src_vmap))
return 0;

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e767721b3a58..5940847b616c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2053,7 +2053,8 @@ static struct vm_area_struct *find_tcp_vma(struct mm_struct *mm,
unsigned long address,
bool *mmap_locked)
{
- struct vm_area_struct *vma = lock_vma_under_rcu(mm, address);
+ /* TCP zerocopy pages are always read-only, see tcp_mmap() */
+ struct vm_area_struct *vma = lock_vma_under_rcu(mm, address, false);

if (vma) {
if (vma->vm_ops != &tcp_vm_ops) {
--
2.44.0


--
Peter Xu


2024-04-11 21:48:51

by Peter Xu

Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Thu, Apr 11, 2024 at 10:27:56PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 11, 2024 at 05:12:02PM -0400, Peter Xu wrote:
> > The question is whether that's intended to make it as complicated. For
> > example, why don't we check anon_vma for anonymous too later when prepare
> > anon_vma, however we do it late for file memory. AFAICT there's nothing
> > special with file memory in this case.
>
> Yes, it's absolutely intended. If anything, anon memory is the special
> case that checks up-front.
>
> Congratulations on adding additional instructions to the common case.
> I don't understand why you persist with your nonsense. Please stop.

How many instructions it takes for a late RETRY for WRITEs to private file
mappings, fallback to mmap_sem?

Did you even finish reading the patch at all?

--
Peter Xu


2024-04-11 22:22:28

by Matthew Wilcox

Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Thu, Apr 11, 2024 at 11:34:44AM -0400, Peter Xu wrote:
> If the answer is yes, I'd say please provide some document patch to
> support such behavior, you can stop my patch from getting merged now,
> but it's never clear whether someone else will see this and post it
> again. If it wasn't obviously to Suren who introduced per-vma lock [1],
> then I won't be surprised it's unknown to most developers on the list.

Anyone touching this path should have a good idea about what is and is
not the common case. Your confidence greatly exceeded your ability
here. I will not be submitting such a patch.

2024-04-12 01:21:32

by Matthew Wilcox

Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Thu, Apr 11, 2024 at 05:12:02PM -0400, Peter Xu wrote:
> The question is whether that's intended to make it as complicated. For
> example, why don't we check anon_vma for anonymous too later when prepare
> anon_vma, however we do it late for file memory. AFAICT there's nothing
> special with file memory in this case.

Yes, it's absolutely intended. If anything, anon memory is the special
case that checks up-front.

Congratulations on adding additional instructions to the common case.
I don't understand why you persist with your nonsense. Please stop.

2024-04-12 02:08:36

by Matthew Wilcox

Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Thu, Apr 11, 2024 at 05:46:45PM -0400, Peter Xu wrote:
> On Thu, Apr 11, 2024 at 10:27:56PM +0100, Matthew Wilcox wrote:
> > On Thu, Apr 11, 2024 at 05:12:02PM -0400, Peter Xu wrote:
> > > The question is whether that's intended to make it as complicated. For
> > > example, why don't we check anon_vma for anonymous too later when prepare
> > > anon_vma, however we do it late for file memory. AFAICT there's nothing
> > > special with file memory in this case.
> >
> > Yes, it's absolutely intended. If anything, anon memory is the special
> > case that checks up-front.
> >
> > Congratulations on adding additional instructions to the common case.
> > I don't understand why you persist with your nonsense. Please stop.
>
> How many instructions it takes for a late RETRY for WRITEs to private file
> mappings, fallback to mmap_sem?

Doesn't matter. That happens _once_ per VMA, and it's dwarfed by the
cost of allocating and initialising the COWed page. You're adding
instructions to every single page fault. I'm not happy that we had to
add extra instructions to the fault path for single-threaded programs,
but we at least had the justification that we were improving scalability
on large systems. Your excuse is "it makes the code cleaner". And
honestly, I don't think it even does that.

> Did you even finish reading the patch at all?

Yes, I read the whole thing. It's garbage.

2024-04-12 05:41:42

by Matthew Wilcox

Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Thu, Apr 11, 2024 at 11:02:32PM +0100, Matthew Wilcox wrote:
> > How many instructions it takes for a late RETRY for WRITEs to private file
> > mappings, fallback to mmap_sem?
>
> Doesn't matter. That happens _once_ per VMA, and it's dwarfed by the
> cost of allocating and initialising the COWed page. You're adding
> instructions to every single page fault. I'm not happy that we had to
> add extra instructions to the fault path for single-threaded programs,
> but we at least had the justification that we were improving scalability
> on large systems. Your excuse is "it makes the code cleaner". And
> honestly, I don't think it even does that.

Suren, what would you think to this?

diff --git a/mm/memory.c b/mm/memory.c
index 6e2fe960473d..e495adcbe968 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5821,15 +5821,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
if (!vma_start_read(vma))
goto inval;

- /*
- * find_mergeable_anon_vma uses adjacent vmas which are not locked.
- * This check must happen after vma_start_read(); otherwise, a
- * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
- * from its anon_vma.
- */
- if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
- goto inval_end_read;
-
/* Check since vm_start/vm_end might change before we lock the VMA */
if (unlikely(address < vma->vm_start || address >= vma->vm_end))
goto inval_end_read;

That takes a few insns out of the page fault path (good!) at the cost
of one extra trip around the fault handler for the first fault on an
anon vma. It makes the file & anon paths more similar to each other
(good!)

We'd need some data to be sure it's really a win, but less code is
always good.

We could even eagerly initialise vma->anon_vma for anon vmas. I don't
know why we don't do that.
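
If we did, it could presumably be as simple as something like this
somewhere in the mmap() path once the new vma is in place (hypothetical
placement, untested):

	/*
	 * Hypothetical: eagerly allocate the anon_vma for writable
	 * anonymous vmas instead of waiting for the first fault.
	 */
	if (vma_is_anonymous(vma) && (vma->vm_flags & VM_WRITE) &&
	    anon_vma_prepare(vma))
		return -ENOMEM;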

2024-04-12 12:41:05

by Peter Xu

Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Fri, Apr 12, 2024 at 04:14:16AM +0100, Matthew Wilcox wrote:
> On Thu, Apr 11, 2024 at 11:02:32PM +0100, Matthew Wilcox wrote:
> > > How many instructions it takes for a late RETRY for WRITEs to private file
> > > mappings, fallback to mmap_sem?
> >
> > Doesn't matter. That happens _once_ per VMA, and it's dwarfed by the
> > cost of allocating and initialising the COWed page. You're adding
> > instructions to every single page fault. I'm not happy that we had to
> > add extra instructions to the fault path for single-threaded programs,
> > but we at least had the justification that we were improving scalability
> > on large systems. Your excuse is "it makes the code cleaner". And
> > honestly, I don't think it even does that.
>
> Suren, what would you think to this?
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 6e2fe960473d..e495adcbe968 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5821,15 +5821,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> if (!vma_start_read(vma))
> goto inval;
>
> - /*
> - * find_mergeable_anon_vma uses adjacent vmas which are not locked.
> - * This check must happen after vma_start_read(); otherwise, a
> - * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
> - * from its anon_vma.
> - */
> - if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> - goto inval_end_read;
> -
> /* Check since vm_start/vm_end might change before we lock the VMA */
> if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> goto inval_end_read;
>
> That takes a few insns out of the page fault path (good!) at the cost
> of one extra trip around the fault handler for the first fault on an
> anon vma. It makes the file & anon paths more similar to each other
> (good!)
>
> We'd need some data to be sure it's really a win, but less code is
> always good.

You at least need two things:

(1) don't throw away Jann's comment so easily

(2) have a look at whether anon memory has the fallback at all yet

Maybe someone can already comment in a harsh way on this one, but no, I'm
not going to be like that.

I still don't understand why you dislike avoiding the fallback entirely
when we can; the flags I checked were all in hot cache anyway, I think.

And since I've also had enough of how you comment in your previous
replies, I'll leave the rest of the comments for others.

--
Peter Xu


2024-04-12 12:47:16

by Suren Baghdasaryan

Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Thu, Apr 11, 2024 at 8:14 PM Matthew Wilcox <[email protected]> wrote:
>
> On Thu, Apr 11, 2024 at 11:02:32PM +0100, Matthew Wilcox wrote:
> > > How many instructions it takes for a late RETRY for WRITEs to private file
> > > mappings, fallback to mmap_sem?
> >
> > Doesn't matter. That happens _once_ per VMA, and it's dwarfed by the
> > cost of allocating and initialising the COWed page. You're adding
> > instructions to every single page fault. I'm not happy that we had to
> > add extra instructions to the fault path for single-threaded programs,
> > but we at least had the justification that we were improving scalability
> > on large systems. Your excuse is "it makes the code cleaner". And
> > honestly, I don't think it even does that.
>
> Suren, what would you think to this?
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 6e2fe960473d..e495adcbe968 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5821,15 +5821,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> if (!vma_start_read(vma))
> goto inval;
>
> - /*
> - * find_mergeable_anon_vma uses adjacent vmas which are not locked.
> - * This check must happen after vma_start_read(); otherwise, a
> - * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
> - * from its anon_vma.
> - */
> - if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> - goto inval_end_read;
> -
> /* Check since vm_start/vm_end might change before we lock the VMA */
> if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> goto inval_end_read;
>
> That takes a few insns out of the page fault path (good!) at the cost
> of one extra trip around the fault handler for the first fault on an
> anon vma. It makes the file & anon paths more similar to each other
> (good!)

I see what you mean. The impact would depend on the workload, but in my
earlier tests while developing per-VMA locks, on average fewer than 1% of
faults were for anonymous pages with vma->anon_vma == NULL. I recorded
that after using my desktop for a day or so and running a series of
benchmark tests. Again, that number might be drastically different on
other workloads.

About the code, I'll take a closer look once I'm back from vacation
this weekend but I think you will also have to modify
do_anonymous_page() to use vmf_anon_prepare() instead of
anon_vma_prepare().

>
> We'd need some data to be sure it's really a win, but less code is
> always good.
>
> We could even eagerly initialise vma->anon_vma for anon vmas. I don't
> know why we don't do that.

You found the answer to that question a long time ago and IIRC it was
because in many cases we end up not needing to set vma->anon_vma at
all. So, this is an optimization to try avoiding extra operations
whenever we can. I'll try to find your comment on this.
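
For reference, the lazy initialisation being described is the small
anon_vma_prepare() wrapper, roughly the following (paraphrased from
include/linux/rmap.h); the expensive __anon_vma_prepare() path only runs
on the first fault that actually needs an anon_vma:

static inline int anon_vma_prepare(struct vm_area_struct *vma)
{
	if (likely(vma->anon_vma))	/* common case: already initialised */
		return 0;

	return __anon_vma_prepare(vma);	/* first write fault: allocate or share one */
}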

2024-04-12 13:07:08

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Fri, Apr 12, 2024 at 5:39 AM Peter Xu <[email protected]> wrote:
>
> On Fri, Apr 12, 2024 at 04:14:16AM +0100, Matthew Wilcox wrote:
> > On Thu, Apr 11, 2024 at 11:02:32PM +0100, Matthew Wilcox wrote:
> > > > How many instructions it takes for a late RETRY for WRITEs to private file
> > > > mappings, fallback to mmap_sem?
> > >
> > > Doesn't matter. That happens _once_ per VMA, and it's dwarfed by the
> > > cost of allocating and initialising the COWed page. You're adding
> > > instructions to every single page fault. I'm not happy that we had to
> > > add extra instructions to the fault path for single-threaded programs,
> > > but we at least had the justification that we were improving scalability
> > > on large systems. Your excuse is "it makes the code cleaner". And
> > > honestly, I don't think it even does that.
> >
> > Suren, what would you think to this?
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 6e2fe960473d..e495adcbe968 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5821,15 +5821,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > if (!vma_start_read(vma))
> > goto inval;
> >
> > - /*
> > - * find_mergeable_anon_vma uses adjacent vmas which are not locked.
> > - * This check must happen after vma_start_read(); otherwise, a
> > - * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
> > - * from its anon_vma.
> > - */
> > - if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> > - goto inval_end_read;
> > -
> > /* Check since vm_start/vm_end might change before we lock the VMA */
> > if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> > goto inval_end_read;
> >
> > That takes a few insns out of the page fault path (good!) at the cost
> > of one extra trip around the fault handler for the first fault on an
> > anon vma. It makes the file & anon paths more similar to each other
> > (good!)
> >
> > We'd need some data to be sure it's really a win, but less code is
> > always good.
>
> You at least need two things:
>
> (1) don't throw away Jann's comment so easily

I agree, if we make this change we should keep this comment and maybe
move it into vmf_anon_prepare()

>
> (2) have a look at whether anon memory has the fallback at all yet

Yeah, I think do_anonymous_page() will have to change as I mentioned
in the previous reply.

>
> Someone could already comment on this one in a harsh way, but no, I'm
> not going to be like that.
>
> I still don't understand why you object so strongly to not falling back
> at all when we can; the flags I checked should all be in hot cache anyway.
>
> And since I've also had enough of how you comment in your previous
> replies, I'll leave the rest of the comments to others.

FWIW, I fully accept the blame for not seeing that private file mapping
read-case regression. In retrospect this should have been obvious...
but hindsight is always 20/20.

>
> --
> Peter Xu
>

2024-04-12 13:33:04

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Fri, Apr 12, 2024 at 05:46:52AM -0700, Suren Baghdasaryan wrote:
> On Thu, Apr 11, 2024 at 8:14 PM Matthew Wilcox <[email protected]> wrote:
> About the code, I'll take a closer look once I'm back from vacation
> this weekend but I think you will also have to modify
> do_anonymous_page() to use vmf_anon_prepare() instead of
> anon_vma_prepare().

Ah yes. Also do_huge_pmd_anonymous_page(). And we should do this:

+++ b/mm/rmap.c
@@ -182,8 +182,6 @@ static void anon_vma_chain_link(struct vm_area_struct *vma,
* for the new allocation. At the same time, we do not want
* to do any locking for the common case of already having
* an anon_vma.
- *
- * This must be called with the mmap_lock held for reading.
*/
int __anon_vma_prepare(struct vm_area_struct *vma)
{
@@ -191,6 +189,7 @@ int __anon_vma_prepare(struct vm_area_struct *vma)
struct anon_vma *anon_vma, *allocated;
struct anon_vma_chain *avc;

+ mmap_assert_locked(mm);
might_sleep();

avc = anon_vma_chain_alloc(GFP_KERNEL);
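
For context, mmap_assert_locked() is roughly the following (paraphrased
from include/linux/mmap_lock.h; check your tree for the exact
definition), so the new assertion only fires when lockdep or
CONFIG_DEBUG_VM can actually verify the rwsem state:

static inline void mmap_assert_locked(struct mm_struct *mm)
{
	lockdep_assert_held(&mm->mmap_lock);
	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
}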

> > We could even eagerly initialise vma->anon_vma for anon vmas. I don't
> > know why we don't do that.
>
> You found the answer to that question a long time ago and IIRC it was
> because in many cases we end up not needing to set vma->anon_vma at
> all. So, this is an optimization to try avoiding extra operations
> whenever we can. I'll try to find your comment on this.

I thought it was file VMAs for which I found the answer to that question?

2024-04-12 16:55:49

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Fri, Apr 12, 2024 at 09:53:29AM -0500, Suren Baghdasaryan wrote:
> Unless vmf_anon_prepare() already explains why vma->anon_vma poses a
> problem for per-vma locks, we should have an explanation there. This
> comment would serve that purpose IMO.

I'll do you one better; here's some nice kernel-doc for
vmf_anon_prepare():

commit f89a1cd17f13
Author: Matthew Wilcox (Oracle) <[email protected]>
Date: Fri Apr 12 10:41:02 2024 -0400

mm: Delay the check for a NULL anon_vma

Instead of checking the anon_vma early in the fault path where all page
faults pay the cost, delay it until we know we're going to need the
anon_vma to be filled in. This will have a slight negative effect on the
first fault in an anonymous VMA, but it shortens every other page fault.
It also makes the code slightly cleaner as the anon and file backed
fault handling look more similar.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d8d2ed80b0bf..718f91f74a48 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1057,11 +1057,13 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
gfp_t gfp;
struct folio *folio;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+ vm_fault_t ret;

if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
return VM_FAULT_FALLBACK;
- if (unlikely(anon_vma_prepare(vma)))
- return VM_FAULT_OOM;
+ ret = vmf_anon_prepare(vmf);
+ if (ret)
+ return ret;
khugepaged_enter_vma(vma, vma->vm_flags);

if (!(vmf->flags & FAULT_FLAG_WRITE) &&
diff --git a/mm/memory.c b/mm/memory.c
index 6e2fe960473d..46b509c3bbc1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3213,6 +3213,21 @@ static inline vm_fault_t vmf_can_call_fault(const struct vm_fault *vmf)
return VM_FAULT_RETRY;
}

+/**
+ * vmf_anon_prepare - Prepare to handle an anonymous fault.
+ * @vmf: The vm_fault descriptor passed from the fault handler.
+ *
+ * When preparing to insert an anonymous page into a VMA from a
+ * fault handler, call this function rather than anon_vma_prepare().
+ * If this vma does not already have an associated anon_vma and we are
+ * only protected by the per-VMA lock, the caller must retry with the
+ * mmap_lock held. __anon_vma_prepare() will look at adjacent VMAs to
+ * determine if this VMA can share its anon_vma, and that's not safe to
+ * do with only the per-VMA lock held for this VMA.
+ *
+ * Return: 0 if fault handling can proceed. Any other value should be
+ * returned to the caller.
+ */
vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
@@ -4437,8 +4452,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
}

/* Allocate our own private page. */
- if (unlikely(anon_vma_prepare(vma)))
- goto oom;
+ ret = vmf_anon_prepare(vmf);
+ if (ret)
+ return ret;
/* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
folio = alloc_anon_folio(vmf);
if (IS_ERR(folio))
@@ -5821,15 +5837,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
if (!vma_start_read(vma))
goto inval;

- /*
- * find_mergeable_anon_vma uses adjacent vmas which are not locked.
- * This check must happen after vma_start_read(); otherwise, a
- * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
- * from its anon_vma.
- */
- if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
- goto inval_end_read;
-
/* Check since vm_start/vm_end might change before we lock the VMA */
if (unlikely(address < vma->vm_start || address >= vma->vm_end))
goto inval_end_read;

2024-04-12 17:22:06

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Fri, Apr 12, 2024 at 06:06:46AM -0700, Suren Baghdasaryan wrote:
> > On Fri, Apr 12, 2024 at 04:14:16AM +0100, Matthew Wilcox wrote:
> > > - /*
> > > - * find_mergeable_anon_vma uses adjacent vmas which are not locked.
> > > - * This check must happen after vma_start_read(); otherwise, a
> > > - * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
> > > - * from its anon_vma.
> > > - */
> > > - if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> > > - goto inval_end_read;
> > > -
> > > /* Check since vm_start/vm_end might change before we lock the VMA */
> > > if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> > > goto inval_end_read;
> > >
> > > That takes a few insns out of the page fault path (good!) at the cost
> > > of one extra trip around the fault handler for the first fault on an
> > > anon vma. It makes the file & anon paths more similar to each other
> > > (good!)
> > >
> > > We'd need some data to be sure it's really a win, but less code is
> > > always good.
>
> I agree, if we make this change we should keep this comment and maybe
> move it into vmf_anon_prepare()

Most of the comment makes no sense if you move it out of
lock_vma_under_rcu(). It's justifying where it needs to be in that
function. If it's no longer in that function, there's not much left of
the comment. What part do you think is valuable and needs to be retained?


2024-04-12 18:39:53

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Fri, Apr 12, 2024 at 04:19:27PM +0100, Matthew Wilcox wrote:
> On Fri, Apr 12, 2024 at 09:53:29AM -0500, Suren Baghdasaryan wrote:
> > Unless vmf_anon_prepare() already explains why vma->anon_vma poses a
> > problem for per-vma locks, we should have an explanation there. This
> > comment would serve that purpose IMO.
>
> I'll do you one better; here's some nice kernel-doc for
> vmf_anon_prepare():

And here's a followup patch to fix some minor issues in uffd.

- Rename lock_vma() to uffd_lock_vma() because it really is uffd
specific.
- Remove comment referencing unlock_vma() which doesn't exist.
- Fix the comment about lock_vma_under_rcu() which I just made
incorrect.

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index f6267afe65d1..a32171158c38 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -56,17 +56,16 @@ struct vm_area_struct *find_vma_and_prepare_anon(struct mm_struct *mm,

#ifdef CONFIG_PER_VMA_LOCK
/*
- * lock_vma() - Lookup and lock vma corresponding to @address.
+ * uffd_lock_vma() - Lookup and lock vma corresponding to @address.
* @mm: mm to search vma in.
* @address: address that the vma should contain.
*
- * Should be called without holding mmap_lock. vma should be unlocked after use
- * with unlock_vma().
+ * Should be called without holding mmap_lock.
*
* Return: A locked vma containing @address, -ENOENT if no vma is found, or
* -ENOMEM if anon_vma couldn't be allocated.
*/
-static struct vm_area_struct *lock_vma(struct mm_struct *mm,
+static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
unsigned long address)
{
struct vm_area_struct *vma;
@@ -74,9 +73,8 @@ static struct vm_area_struct *lock_vma(struct mm_struct *mm,
vma = lock_vma_under_rcu(mm, address);
if (vma) {
/*
- * lock_vma_under_rcu() only checks anon_vma for private
- * anonymous mappings. But we need to ensure it is assigned in
- * private file-backed vmas as well.
+ * We know we're going to need to use anon_vma, so check
+ * that early.
*/
if (!(vma->vm_flags & VM_SHARED) && unlikely(!vma->anon_vma))
vma_end_read(vma);
@@ -107,7 +105,7 @@ static struct vm_area_struct *uffd_mfill_lock(struct mm_struct *dst_mm,
{
struct vm_area_struct *dst_vma;

- dst_vma = lock_vma(dst_mm, dst_start);
+ dst_vma = uffd_lock_vma(dst_mm, dst_start);
if (IS_ERR(dst_vma) || validate_dst_vma(dst_vma, dst_start + len))
return dst_vma;

@@ -1437,7 +1435,7 @@ static int uffd_move_lock(struct mm_struct *mm,
struct vm_area_struct *vma;
int err;

- vma = lock_vma(mm, dst_start);
+ vma = uffd_lock_vma(mm, dst_start);
if (IS_ERR(vma))
return PTR_ERR(vma);

@@ -1452,7 +1450,7 @@ static int uffd_move_lock(struct mm_struct *mm,
}

/*
- * Using lock_vma() to get src_vma can lead to following deadlock:
+ * Using uffd_lock_vma() to get src_vma can lead to following deadlock:
*
* Thread1 Thread2
* ------- -------
@@ -1474,7 +1472,7 @@ static int uffd_move_lock(struct mm_struct *mm,
err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
if (!err) {
/*
- * See comment in lock_vma() as to why not using
+ * See comment in uffd_lock_vma() as to why not using
* vma_start_read() here.
*/
down_read(&(*dst_vmap)->vm_lock->lock);

2024-04-13 21:42:17

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Fri, Apr 12, 2024 at 8:19 AM Matthew Wilcox <[email protected]> wrote:
>
> On Fri, Apr 12, 2024 at 09:53:29AM -0500, Suren Baghdasaryan wrote:
> > Unless vmf_anon_prepare() already explains why vma->anon_vma poses a
> > problem for per-vma locks, we should have an explanation there. This
> > comment would serve that purpose IMO.
>
> I'll do you one better; here's some nice kernel-doc for
> vmf_anon_prepare():
>
> commit f89a1cd17f13
> Author: Matthew Wilcox (Oracle) <[email protected]>
> Date: Fri Apr 12 10:41:02 2024 -0400
>
> mm: Delay the check for a NULL anon_vma
>
> Instead of checking the anon_vma early in the fault path where all page
> faults pay the cost, delay it until we know we're going to need the
> anon_vma to be filled in. This will have a slight negative effect on the
> first fault in an anonymous VMA, but it shortens every other page fault.
> It also makes the code slightly cleaner as the anon and file backed
> fault handling look more similar.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

The patch looks good to me, but I would like to run benchmark tests to
see how it affects different workloads (both positive and negative
effects). I'll try to do that this coming week and will report back if I
find any considerable difference. For now:

Reviewed-by: Suren Baghdasaryan <[email protected]>

Also we should ensure that this patch goes together with the next one
adjusting related code in uffd. It would be better to resend them as
one patchset to avoid confusion.

>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d8d2ed80b0bf..718f91f74a48 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1057,11 +1057,13 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> gfp_t gfp;
> struct folio *folio;
> unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> + vm_fault_t ret;
>
> if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
> return VM_FAULT_FALLBACK;
> - if (unlikely(anon_vma_prepare(vma)))
> - return VM_FAULT_OOM;
> + ret = vmf_anon_prepare(vmf);
> + if (ret)
> + return ret;
> khugepaged_enter_vma(vma, vma->vm_flags);
>
> if (!(vmf->flags & FAULT_FLAG_WRITE) &&
> diff --git a/mm/memory.c b/mm/memory.c
> index 6e2fe960473d..46b509c3bbc1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3213,6 +3213,21 @@ static inline vm_fault_t vmf_can_call_fault(const struct vm_fault *vmf)
> return VM_FAULT_RETRY;
> }
>
> +/**
> + * vmf_anon_prepare - Prepare to handle an anonymous fault.
> + * @vmf: The vm_fault descriptor passed from the fault handler.
> + *
> + * When preparing to insert an anonymous page into a VMA from a
> + * fault handler, call this function rather than anon_vma_prepare().
> + * If this vma does not already have an associated anon_vma and we are
> + * only protected by the per-VMA lock, the caller must retry with the
> + * mmap_lock held. __anon_vma_prepare() will look at adjacent VMAs to
> + * determine if this VMA can share its anon_vma, and that's not safe to
> + * do with only the per-VMA lock held for this VMA.
> + *
> + * Return: 0 if fault handling can proceed. Any other value should be
> + * returned to the caller.
> + */
> vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> @@ -4437,8 +4452,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> }
>
> /* Allocate our own private page. */
> - if (unlikely(anon_vma_prepare(vma)))
> - goto oom;
> + ret = vmf_anon_prepare(vmf);
> + if (ret)
> + return ret;
> /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
> folio = alloc_anon_folio(vmf);
> if (IS_ERR(folio))
> @@ -5821,15 +5837,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> if (!vma_start_read(vma))
> goto inval;
>
> - /*
> - * find_mergeable_anon_vma uses adjacent vmas which are not locked.
> - * This check must happen after vma_start_read(); otherwise, a
> - * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
> - * from its anon_vma.
> - */
> - if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> - goto inval_end_read;
> -
> /* Check since vm_start/vm_end might change before we lock the VMA */
> if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> goto inval_end_read;

2024-04-13 21:47:22

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Fri, Apr 12, 2024 at 8:31 AM Matthew Wilcox <[email protected]> wrote:
>
> On Fri, Apr 12, 2024 at 04:19:27PM +0100, Matthew Wilcox wrote:
> > On Fri, Apr 12, 2024 at 09:53:29AM -0500, Suren Baghdasaryan wrote:
> > > Unless vmf_anon_prepare() already explains why vma->anon_vma poses a
> > > problem for per-vma locks, we should have an explanation there. This
> > > comment would serve that purpose IMO.
> >
> > I'll do you one better; here's some nice kernel-doc for
> > vmf_anon_prepare():
>
> And here's a followup patch to fix some minor issues in uffd.
>
> - Rename lock_vma() to uffd_lock_vma() because it really is uffd
> specific.

I'm planning to expand the scope of lock_vma() and reuse it for
/proc/pid/maps reading under per-VMA locks. No objection to renaming
it for now but I'll likely rename it back later once it's used in more
places.

> - Remove comment referencing unlock_vma() which doesn't exist.
> - Fix the comment about lock_vma_under_rcu() which I just made
> incorrect.
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index f6267afe65d1..a32171158c38 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -56,17 +56,16 @@ struct vm_area_struct *find_vma_and_prepare_anon(struct mm_struct *mm,
>
> #ifdef CONFIG_PER_VMA_LOCK
> /*
> - * lock_vma() - Lookup and lock vma corresponding to @address.
> + * uffd_lock_vma() - Lookup and lock vma corresponding to @address.
> * @mm: mm to search vma in.
> * @address: address that the vma should contain.
> *
> - * Should be called without holding mmap_lock. vma should be unlocked after use
> - * with unlock_vma().
> + * Should be called without holding mmap_lock.
> *
> * Return: A locked vma containing @address, -ENOENT if no vma is found, or
> * -ENOMEM if anon_vma couldn't be allocated.
> */
> -static struct vm_area_struct *lock_vma(struct mm_struct *mm,
> +static struct vm_area_struct *uffd_lock_vma(struct mm_struct *mm,
> unsigned long address)
> {
> struct vm_area_struct *vma;
> @@ -74,9 +73,8 @@ static struct vm_area_struct *lock_vma(struct mm_struct *mm,
> vma = lock_vma_under_rcu(mm, address);
> if (vma) {
> /*
> - * lock_vma_under_rcu() only checks anon_vma for private
> - * anonymous mappings. But we need to ensure it is assigned in
> - * private file-backed vmas as well.
> + * We know we're going to need to use anon_vma, so check
> + * that early.
> */
> if (!(vma->vm_flags & VM_SHARED) && unlikely(!vma->anon_vma))
> vma_end_read(vma);
> @@ -107,7 +105,7 @@ static struct vm_area_struct *uffd_mfill_lock(struct mm_struct *dst_mm,
> {
> struct vm_area_struct *dst_vma;
>
> - dst_vma = lock_vma(dst_mm, dst_start);
> + dst_vma = uffd_lock_vma(dst_mm, dst_start);
> if (IS_ERR(dst_vma) || validate_dst_vma(dst_vma, dst_start + len))
> return dst_vma;
>
> @@ -1437,7 +1435,7 @@ static int uffd_move_lock(struct mm_struct *mm,
> struct vm_area_struct *vma;
> int err;
>
> - vma = lock_vma(mm, dst_start);
> + vma = uffd_lock_vma(mm, dst_start);
> if (IS_ERR(vma))
> return PTR_ERR(vma);
>
> @@ -1452,7 +1450,7 @@ static int uffd_move_lock(struct mm_struct *mm,
> }
>
> /*
> - * Using lock_vma() to get src_vma can lead to following deadlock:
> + * Using uffd_lock_vma() to get src_vma can lead to following deadlock:
> *
> * Thread1 Thread2
> * ------- -------
> @@ -1474,7 +1472,7 @@ static int uffd_move_lock(struct mm_struct *mm,
> err = find_vmas_mm_locked(mm, dst_start, src_start, dst_vmap, src_vmap);
> if (!err) {
> /*
> - * See comment in lock_vma() as to why not using
> + * See comment in uffd_lock_vma() as to why not using
> * vma_start_read() here.
> */
> down_read(&(*dst_vmap)->vm_lock->lock);

2024-04-13 22:46:47

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Sat, Apr 13, 2024 at 02:41:55PM -0700, Suren Baghdasaryan wrote:
> The patch looks good to me but I would like to run benchmark tests to
> see how it affects different workloads (both positive and negative
> effects). I'll try to do that in this coming week and will report if I
> find any considerable difference. For now:
>
> Reviewed-by: Suren Baghdasaryan <[email protected]>
>
> Also we should ensure that this patch goes together with the next one
> adjusting related code in uffd. It would be better to resend them as
> one patchset to avoid confusion.

I've put them here:

http://git.infradead.org/?p=users/willy/pagecache.git;a=shortlog;h=refs/heads/vma-lock

0day has already run build tests, and I'm assured that they'll run perf
tests in the next few days.

2024-04-13 22:52:43

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Sat, Apr 13, 2024 at 02:46:56PM -0700, Suren Baghdasaryan wrote:
> On Fri, Apr 12, 2024 at 8:31 AM Matthew Wilcox <[email protected]> wrote:
> > - Rename lock_vma() to uffd_lock_vma() because it really is uffd
> > specific.
>
> I'm planning to expand the scope of lock_vma() and reuse it for
> /proc/pid/maps reading under per-VMA locks. No objection to renaming
> it for now but I'll likely rename it back later once it's used in more
> places.

That would seem like a mistake. The uffd lock_vma() will create an
anon_vma for VMAs that don't have one, and you wouldn't want that.
It seems to me that lock_vma_under_rcu() does everything you want except
the fallback to mmap_read_lock(). And I'm not sure there's a good way
to package that up ... indeed, I don't see why you'd want the "take
the mmap_lock, look up the VMA, drop the mmap read lock" part at all --
once you've got the mmap_lock, just hold it until you're done.
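
A minimal sketch of that pattern for a reader such as /proc/pid/maps
(read_lock_vma() is a hypothetical name, not code from any patch in this
thread): try the per-VMA lock first, and on failure take the mmap read
lock and keep it for the rest of the walk.

static struct vm_area_struct *read_lock_vma(struct mm_struct *mm,
					    unsigned long addr,
					    bool *mmap_locked)
{
	struct vm_area_struct *vma;

	*mmap_locked = false;
	vma = lock_vma_under_rcu(mm, addr);	/* fast path: per-VMA lock */
	if (vma)
		return vma;

	mmap_read_lock(mm);			/* fallback: keep until done */
	*mmap_locked = true;
	return vma_lookup(mm, addr);
}

The caller would then release either the per-VMA lock with vma_end_read()
or the mmap lock with mmap_read_unlock(), depending on *mmap_locked.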


2024-04-15 15:59:35

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Fri, Apr 12, 2024 at 8:19 AM Matthew Wilcox <[email protected]> wrote:
>
> On Fri, Apr 12, 2024 at 09:53:29AM -0500, Suren Baghdasaryan wrote:
> > Unless vmf_anon_prepare() already explains why vma->anon_vma poses a
> > problem for per-vma locks, we should have an explanation there. This
> > comment would serve that purpose IMO.
>
> I'll do you one better; here's some nice kernel-doc for
> vmf_anon_prepare():

I was looking at find_tcp_vma(), which seems to be the only other
place where lock_vma_under_rcu() is currently used. I think it's used
there only for file-backed pages, so I don't think your change affects
that use case, but it makes me think we should have some kind of
warning for future users of lock_vma_under_rcu()... Maybe your addition
of mmap_assert_locked() inside __anon_vma_prepare() is enough. Please
don't forget to include that assertion in your final patch.


>
> commit f89a1cd17f13
> Author: Matthew Wilcox (Oracle) <[email protected]>
> Date: Fri Apr 12 10:41:02 2024 -0400
>
> mm: Delay the check for a NULL anon_vma
>
> Instead of checking the anon_vma early in the fault path where all page
> faults pay the cost, delay it until we know we're going to need the
> anon_vma to be filled in. This will have a slight negative effect on the
> first fault in an anonymous VMA, but it shortens every other page fault.
> It also makes the code slightly cleaner as the anon and file backed
> fault handling look more similar.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d8d2ed80b0bf..718f91f74a48 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1057,11 +1057,13 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> gfp_t gfp;
> struct folio *folio;
> unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> + vm_fault_t ret;
>
> if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
> return VM_FAULT_FALLBACK;
> - if (unlikely(anon_vma_prepare(vma)))
> - return VM_FAULT_OOM;
> + ret = vmf_anon_prepare(vmf);
> + if (ret)
> + return ret;
> khugepaged_enter_vma(vma, vma->vm_flags);
>
> if (!(vmf->flags & FAULT_FLAG_WRITE) &&
> diff --git a/mm/memory.c b/mm/memory.c
> index 6e2fe960473d..46b509c3bbc1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3213,6 +3213,21 @@ static inline vm_fault_t vmf_can_call_fault(const struct vm_fault *vmf)
> return VM_FAULT_RETRY;
> }
>
> +/**
> + * vmf_anon_prepare - Prepare to handle an anonymous fault.
> + * @vmf: The vm_fault descriptor passed from the fault handler.
> + *
> + * When preparing to insert an anonymous page into a VMA from a
> + * fault handler, call this function rather than anon_vma_prepare().
> + * If this vma does not already have an associated anon_vma and we are
> + * only protected by the per-VMA lock, the caller must retry with the
> + * mmap_lock held. __anon_vma_prepare() will look at adjacent VMAs to
> + * determine if this VMA can share its anon_vma, and that's not safe to
> + * do with only the per-VMA lock held for this VMA.
> + *
> + * Return: 0 if fault handling can proceed. Any other value should be
> + * returned to the caller.
> + */
> vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> @@ -4437,8 +4452,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> }
>
> /* Allocate our own private page. */
> - if (unlikely(anon_vma_prepare(vma)))
> - goto oom;
> + ret = vmf_anon_prepare(vmf);
> + if (ret)
> + return ret;
> /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
> folio = alloc_anon_folio(vmf);
> if (IS_ERR(folio))
> @@ -5821,15 +5837,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> if (!vma_start_read(vma))
> goto inval;
>
> - /*
> - * find_mergeable_anon_vma uses adjacent vmas which are not locked.
> - * This check must happen after vma_start_read(); otherwise, a
> - * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
> - * from its anon_vma.
> - */
> - if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> - goto inval_end_read;
> -
> /* Check since vm_start/vm_end might change before we lock the VMA */
> if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> goto inval_end_read;

2024-04-15 16:27:07

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Mon, Apr 15, 2024 at 09:19:06AM -0700, Suren Baghdasaryan wrote:
> On Mon, Apr 15, 2024 at 9:13 AM Matthew Wilcox <[email protected]> wrote:
> > The tcp vma is not file backed, but I'm pretty sure that COW is not
> > something they want, so there's never an anon_vma. It's for pages
> > that contain received TCP packets; ie it's mmaped TCP.
>
> I was following
> tcp_zerocopy_receive()->tcp_zerocopy_vm_insert_batch()->vm_insert_pages()->insert_page_in_batch_locked()->validate_page_before_insert()
> which errors out for PageAnon(page). So, I assumed this path works on
> file-backed pages but I'm not familiar with this code at all.

It turns out there are pages which are neither file nor anon ;-)
I have a partial list here:

https://kernelnewbies.org/MatthewWilcox/Memdescs

but I don't even have TCP on that list. I haven't looked into what they
need -- I don't think they need mapping, index, etc, but I need to
figure that out.

2024-04-15 19:02:18

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Mon, Apr 15, 2024 at 08:58:28AM -0700, Suren Baghdasaryan wrote:
> On Fri, Apr 12, 2024 at 8:19 AM Matthew Wilcox <[email protected]> wrote:
> >
> > On Fri, Apr 12, 2024 at 09:53:29AM -0500, Suren Baghdasaryan wrote:
> > > Unless vmf_anon_prepare() already explains why vma->anon_vma poses a
> > > problem for per-vma locks, we should have an explanation there. This
> > > comment would serve that purpose IMO.
> >
> > I'll do you one better; here's some nice kernel-doc for
> > vmf_anon_prepare():
>
> I was looking at find_tcp_vma(), which seems to be the only other
> place where lock_vma_under_rcu() is currently used. I think it's used
> there only for file-backed pages, so I don't think your change affects
> that use case, but it makes me think we should have some kind of
> warning for future users of lock_vma_under_rcu()... Maybe your addition
> of mmap_assert_locked() inside __anon_vma_prepare() is enough. Please
> don't forget to include that assertion in your final patch.

That's patch 1/3 on the git branch I pointed you to.

The tcp vma is not file backed, but I'm pretty sure that COW is not
something they want, so there's never an anon_vma. It's for pages
that contain received TCP packets; ie it's mmaped TCP.

2024-04-26 14:01:04

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Fri, Apr 12, 2024 at 04:14:16AM +0100, Matthew Wilcox wrote:
> Suren, what would you think to this?
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 6e2fe960473d..e495adcbe968 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5821,15 +5821,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> if (!vma_start_read(vma))
> goto inval;
>
> - /*
> - * find_mergeable_anon_vma uses adjacent vmas which are not locked.
> - * This check must happen after vma_start_read(); otherwise, a
> - * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
> - * from its anon_vma.
> - */
> - if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> - goto inval_end_read;
> -
> /* Check since vm_start/vm_end might change before we lock the VMA */
> if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> goto inval_end_read;
>
> That takes a few insns out of the page fault path (good!) at the cost
> of one extra trip around the fault handler for the first fault on an
> anon vma. It makes the file & anon paths more similar to each other
> (good!)
>
> We'd need some data to be sure it's really a win, but less code is
> always good.

Intel's 0day got back to me with data and it's ridiculously good.
Headline figure: over 3x throughput improvement with vm-scalability
https://lore.kernel.org/all/[email protected]/

I can't see why it's that good. It shouldn't be that good. I'm
seeing big numbers here:

4366 ± 2% +565.6% 29061 perf-stat.overall.cycles-between-cache-misses

and the code being deleted is only checking vma->vm_ops and
vma->anon_vma. Surely that cache line is referenced so frequently
during pagefault that deleting a reference here will make no difference
at all?

We've clearly got an inlining change. viz:

72.57 -72.6 0.00 perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.do_access
73.28 -72.6 0.70 perf-profile.calltrace.cycles-pp.asm_exc_page_fault.do_access
72.55 -72.5 0.00 perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
69.93 -69.9 0.00 perf-profile.calltrace.cycles-pp.lock_mm_and_find_vma.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
69.12 -69.1 0.00 perf-profile.calltrace.cycles-pp.down_read_killable.lock_mm_and_find_vma.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
68.78 -68.8 0.00 perf-profile.calltrace.cycles-pp.rwsem_down_read_slowpath.down_read_killable.lock_mm_and_find_vma.do_user_addr_fault.exc_page_fault
65.78 -65.8 0.00 perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.rwsem_down_read_slowpath.down_read_killable.lock_mm_and_find_vma.do_user_addr_fault
65.43 -65.4 0.00 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.rwsem_down_read_slowpath.down_read_killable.lock_mm_and_find_vma

11.22 +86.5 97.68 perf-profile.calltrace.cycles-pp.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
11.14 +86.5 97.66 perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64
3.17 ± 2% +94.0 97.12 perf-profile.calltrace.cycles-pp.osq_lock.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff
3.45 ± 2% +94.1 97.59 perf-profile.calltrace.cycles-pp.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
0.00 +98.2 98.15 perf-profile.calltrace.cycles-pp.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +98.2 98.16 perf-profile.calltrace.cycles-pp.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe

so maybe the compiler has been able to eliminate some loads from
contended cachelines?

703147 -87.6% 87147 ± 2% perf-stat.ps.context-switches
663.67 ± 5% +7551.9% 50783 vm-scalability.time.involuntary_context_switches
1.105e+08 -86.7% 14697764 ± 2% vm-scalability.time.voluntary_context_switches

indicates to me that we're taking the mmap rwsem far less often (those
would be accounted as voluntary context switches).

So maybe the cache miss reduction is a consequence of just running for
longer before being preempted.

I still don't understand why we have to take the mmap_sem less often.
Is there perhaps a VMA for which we have a NULL vm_ops, but don't set
an anon_vma on a page fault?

2024-04-26 15:33:10

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

* Suren Baghdasaryan <[email protected]> [240426 11:08]:
> On Fri, Apr 26, 2024 at 7:00 AM Matthew Wilcox <[email protected]> wrote:
> >
> > On Fri, Apr 12, 2024 at 04:14:16AM +0100, Matthew Wilcox wrote:
> > > Suren, what would you think to this?
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 6e2fe960473d..e495adcbe968 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -5821,15 +5821,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > > if (!vma_start_read(vma))
> > > goto inval;
> > >
> > > - /*
> > > - * find_mergeable_anon_vma uses adjacent vmas which are not locked.
> > > - * This check must happen after vma_start_read(); otherwise, a
> > > - * concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
> > > - * from its anon_vma.
> > > - */
> > > - if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
> > > - goto inval_end_read;
> > > -
> > > /* Check since vm_start/vm_end might change before we lock the VMA */
> > > if (unlikely(address < vma->vm_start || address >= vma->vm_end))
> > > goto inval_end_read;
> > >
> > > That takes a few insns out of the page fault path (good!) at the cost
> > > of one extra trip around the fault handler for the first fault on an
> > > anon vma. It makes the file & anon paths more similar to each other
> > > (good!)
> > >
> > > We'd need some data to be sure it's really a win, but less code is
> > > always good.
> >
> > Intel's 0day got back to me with data and it's ridiculously good.
> > Headline figure: over 3x throughput improvement with vm-scalability
> > https://lore.kernel.org/all/[email protected]/
> >
> > I can't see why it's that good. It shouldn't be that good. I'm
> > seeing big numbers here:
> >
> > 4366 ± 2% +565.6% 29061 perf-stat.overall.cycles-between-cache-misses
> >
> > and the code being deleted is only checking vma->vm_ops and
> > vma->anon_vma. Surely that cache line is referenced so frequently
> > during pagefault that deleting a reference here will make no difference
> > at all?
>
> That indeed looks overly good. Sorry, I didn't have a chance to run
> the benchmarks on my side yet because of the ongoing Android bootcamp
> this week.
>
> >
> > We've clearly got an inlining change. viz:
> >
> > 72.57 -72.6 0.00 perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.do_access
> > 73.28 -72.6 0.70 perf-profile.calltrace.cycles-pp.asm_exc_page_fault.do_access
> > 72.55 -72.5 0.00 perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
> > 69.93 -69.9 0.00 perf-profile.calltrace.cycles-pp.lock_mm_and_find_vma.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
> > 69.12 -69.1 0.00 perf-profile.calltrace.cycles-pp.down_read_killable.lock_mm_and_find_vma.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
> > 68.78 -68.8 0.00 perf-profile.calltrace.cycles-pp.rwsem_down_read_slowpath.down_read_killable.lock_mm_and_find_vma.do_user_addr_fault.exc_page_fault
> > 65.78 -65.8 0.00 perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.rwsem_down_read_slowpath.down_read_killable.lock_mm_and_find_vma.do_user_addr_fault
> > 65.43 -65.4 0.00 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.rwsem_down_read_slowpath.down_read_killable.lock_mm_and_find_vma
> >
> > 11.22 +86.5 97.68 perf-profile.calltrace.cycles-pp.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
> > 11.14 +86.5 97.66 perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64
> > 3.17 ± 2% +94.0 97.12 perf-profile.calltrace.cycles-pp.osq_lock.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff
> > 3.45 ± 2% +94.1 97.59 perf-profile.calltrace.cycles-pp.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
> > 0.00 +98.2 98.15 perf-profile.calltrace.cycles-pp.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
> > 0.00 +98.2 98.16 perf-profile.calltrace.cycles-pp.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
> >
> > so maybe the compiler has been able to eliminate some loads from
> > contended cachelines?
> >
> > 703147 -87.6% 87147 ± 2% perf-stat.ps.context-switches
> > 663.67 ± 5% +7551.9% 50783 vm-scalability.time.involuntary_context_switches
> > 1.105e+08 -86.7% 14697764 ± 2% vm-scalability.time.voluntary_context_switches
> >
> > indicates to me that we're taking the mmap rwsem far less often (those
> > would be accounted as voluntary context switches).
> >
> > So maybe the cache miss reduction is a consequence of just running for
> > longer before being preempted.
> >
> > I still don't understand why we have to take the mmap_sem less often.
> > Is there perhaps a VMA for which we have a NULL vm_ops, but don't set
> > an anon_vma on a page fault?
>
> I think the only path in either do_anonymous_page() or
> do_huge_pmd_anonymous_page() that skips calling anon_vma_prepare() is
> the "Use the zero-page for reads" here:
> https://elixir.bootlin.com/linux/latest/source/mm/memory.c#L4265. I
> didn't look into this particular benchmark yet but will try it out
> once I have some time to benchmark your change.
>

This test is read-only:
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-small-allocs-mt

And the zero-page looks to be what's going on here to me as well.

Would such a change have an impact on people who "fault in" the memory
instead of asking for it to be populated through an API?

Thanks,
Liam

2024-04-26 15:50:31

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Fri, Apr 26, 2024 at 08:32:06AM -0700, Suren Baghdasaryan wrote:
> On Fri, Apr 26, 2024 at 8:28 AM Matthew Wilcox <[email protected]> wrote:
> > > I think the only path in either do_anonymous_page() or
> > > do_huge_pmd_anonymous_page() that skips calling anon_vma_prepare() is
> > > the "Use the zero-page for reads" here:
> > > https://elixir.bootlin.com/linux/latest/source/mm/memory.c#L4265. I
> > > didn't look into this particular benchmark yet but will try it out
> > > once I have some time to benchmark your change.
> >
> > Yes, Liam and I had just brainstormed that as being a plausible
> > explanation too. I don't know how frequent it is to use anon memory
> > read-only. Presumably it must happen often enough that we've bothered
> > to implement the zero-page optimisation. But probably not nearly as
> > often as this benchmark makes it happen ;-)
>
> I also wonder if some of this improvement can be attributed to the
> last patch in your series
> (https://lore.kernel.org/all/[email protected]/).
> I assume it was included in the 0day testing?

Patch 4 was where I expected to see the improvement too. But I think
what's going on is that this benchmark evaded all our hard work on
page fault scalability. Because it's read-only, it never assigned an
anon_vma and so all its page faults fell back to taking the mmap_sem.
So patch 4 will have no effect on this benchmark.

The report from 0day is pretty clear: they bisected the performance
improvement to patch 2.

2024-04-26 17:53:03

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: Always sanity check anon_vma first for per-vma locks

On Fri, Apr 26, 2024 at 08:07:45AM -0700, Suren Baghdasaryan wrote:
> On Fri, Apr 26, 2024 at 7:00 AM Matthew Wilcox <[email protected]> wrote:
> > Intel's 0day got back to me with data and it's ridiculously good.
> > Headline figure: over 3x throughput improvement with vm-scalability
> > https://lore.kernel.org/all/[email protected]/
> >
> > I can't see why it's that good. It shouldn't be that good. I'm
> > seeing big numbers here:
> >
> > 4366 ± 2% +565.6% 29061 perf-stat.overall.cycles-between-cache-misses
> >
> > and the code being deleted is only checking vma->vm_ops and
> > vma->anon_vma. Surely that cache line is referenced so frequently
> > during pagefault that deleting a reference here will make no difference
> > at all?
>
> That indeed looks overly good. Sorry, I didn't have a chance to run
> the benchmarks on my side yet because of the ongoing Android bootcamp
> this week.

No problem. Darn work getting in the way of having fun ;-)

> > I still don't understand why we have to take the mmap_sem less often.
> > Is there perhaps a VMA for which we have a NULL vm_ops, but don't set
> > an anon_vma on a page fault?
>
> I think the only path in either do_anonymous_page() or
> do_huge_pmd_anonymous_page() that skips calling anon_vma_prepare() is
> the "Use the zero-page for reads" here:
> https://elixir.bootlin.com/linux/latest/source/mm/memory.c#L4265. I
> didn't look into this particular benchmark yet but will try it out
> once I have some time to benchmark your change.

Yes, Liam and I had just brainstormed that as being a plausible
explanation too. I don't know how frequent it is to use anon memory
read-only. Presumably it must happen often enough that we've bothered
to implement the zero-page optimisation. But probably not nearly as
often as this benchmark makes it happen ;-)
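
For reference, the "zero-page for reads" path being referred to is
roughly the following (paraphrased and abridged from do_anonymous_page()
in mm/memory.c, not verbatim): a read fault on anonymous memory installs
the shared zero page and never calls anon_vma_prepare(), so a read-only
anonymous workload keeps vma->anon_vma == NULL and, before this series,
every such fault fell back to the mmap lock.

	/* Use the zero-page for reads */
	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
	    !mm_forbids_zeropage(vma->vm_mm)) {
		entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
					      vma->vm_page_prot));
		/* ... install the PTE; no anon_vma needed for the zero page ... */
		goto setpte;
	}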