Hello mm folks,
I have a few questions about the current status of mmap_lock scalability.
=============================================================
What is currently causing the kernel to use mmap_lock to protect the maple tree?
=============================================================
I understand that the long-term goal is to remove the need for mmap_lock in readers
while traversing the maple tree, using techniques such as RCU or SPF.
What is the biggest obstacle preventing this from being achieved at this time?
==================================================
How does the maple tree provide RCU-safe manipulation of VMAs?
==================================================
Is it similar to the approach suggested in the RCUVM paper (replacing the original
root node with a new root node that shares most of its nodes and deferring
the freeing of stale nodes using RCU)?
I'm having difficulty understanding the design of the maple tree in this regard.
[RCUVM paper] https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf
Thank you for your time.
---
Hyeonggon
Hi Hyeonggon,
On Wed, Dec 28, 2022 at 4:49 AM Hyeonggon Yoo <[email protected]> wrote:
>
> Hello mm folks,
>
> I have a few questions about the current status of mmap_lock scalability.
>
> =============================================================
> What is currently causing the kernel to use mmap_lock to protect the maple tree?
> =============================================================
>
> I understand that the long-term goal is to remove the need for mmap_lock in readers
> while traversing the maple tree, using techniques such as RCU or SPF.
> What is the biggest obstacle preventing this from being achieved at this time?
Maple tree has an RCU mode which does not need mmap_lock for
traversal. Liam and I were testing it recently and Liam fixed a number
of issues to enable it. It seems stable now and the fixes are
incorporated into the "per-vma locks" patchset which I prepared in
this branch: https://github.com/surenbaghdasaryan/linux/tree/per_vma_lock.
I haven't posted this patchset upstream yet but it's pretty much ready
to go. I'm planning to post it in early January.
Thanks,
Suren.
>
> ==================================================
> How does the maple tree provide RCU-safe manipulation of VMAs?
> ==================================================
>
> Is it similar to the approach suggested in the RCUVM paper (replacing the original
> root node with a new root node that shares most of its nodes and deferring
> the freeing of stale nodes using RCU)?
>
> I'm having difficulty understanding the design of the maple tree in this regard.
>
> [RCUVM paper] https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf
>
> Thank you for your time.
>
> ---
> Hyeonggon
On Wed, Dec 28, 2022 at 09:48:51PM +0900, Hyeonggon Yoo wrote:
> Hello mm folks,
>
> I have a few questions about the current status of mmap_lock scalability.
>
> =============================================================
> What is currently causing the kernel to use mmap_lock to protect the maple tree?
> =============================================================
>
> I understand that the long-term goal is to remove the need for mmap_lock in readers
> while traversing the maple tree, using techniques such as RCU or SPF.
> What is the biggest obstacle preventing this from being achieved at this time?
The long term goal is even larger than this. Ideally, the VMA tree
would be protected by a spinlock rather than a mutex. That turned out
to be too large a change for the moment (and isn't all that important
compared to enabling RCU readers).
> ==================================================
> How does the maple tree provide RCU-safe manipulation of VMAs?
> ==================================================
>
> Is it similar to the approach suggested in the RCUVM paper (replacing the original
> root node with a new root node that shares most of its nodes and deferring
> the freeing of stale nodes using RCU)?
>
> I'm having difficulty understanding the design of the maple tree in this regard.
>
> [RCUVM paper] https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf
While I've read the RCUVM paper, I wouldn't say it was particularly an
inspiration. The Maple Tree is independent of the VM; it's a general
purpose B-tree. As with any B-tree, when modifying a node, we don't
touch nodes that we don't need to touch. As with any RCU data structure,
we defer freeing things while RCU readers might still have a reference
to them.
We don't necessarily go all the way to the root node when modifying a
leaf node. For example, if we have this structure:
Root: Node A, 4000, Node B
Node A: p1, 50, p2, 100, p3, 150, p4, 200, NULL, 250, p6, 1000, p7
Node B: p8, 4050, p9, 4100, p10, 4150, p11, 4200, NULL, 4250, p13
and we replace p4 with a NULL over the whole range from 150-199,
we construct a new Node A2 that contains:
Node A2: p1, 50, p2, 100, p3, 150, NULL, 250, p6, 1000, p7
and we simply write A2 over the entry in Root. Then we mark Node A as
dead and RCU-free Node A. There's no need to replace Root as stores
to a pointer are atomic. If we need to rebalance between Node A and
Node B, we will need to create a new Root (as well as both A and B),
mark all of them as dead and RCU-free them.
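In code, that replace-and-free pattern looks roughly like this (a
simplified sketch with made-up names, not the actual maple tree
implementation):

	struct node {
		struct rcu_head rcu;
		bool dead;
		unsigned long pivots[15];
		void __rcu *slots[16];	/* child nodes or leaf entries */
	};

	static void replace_child(struct node *parent, int slot,
				  struct node *old, struct node *new)
	{
		/* A single pointer store is atomic, so readers see
		 * either the old child or the new one, never a
		 * half-updated node. */
		rcu_assign_pointer(parent->slots[slot], new);

		/* Mark the old node dead so a racing reader that is
		 * still inside it can notice and restart its walk. */
		WRITE_ONCE(old->dead, true);

		/* Defer the actual freeing until all current readers
		 * are done. */
		kfree_rcu(old, rcu);
	}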
On Wed, Dec 28, 2022 at 09:10:20AM -0800, Suren Baghdasaryan wrote:
> Hi Hyeonggon,
>
> On Wed, Dec 28, 2022 at 4:49 AM Hyeonggon Yoo <[email protected]> wrote:
> >
> > Hello mm folks,
> >
> > I have a few questions about the current status of mmap_lock scalability.
> >
> > =============================================================
> > What is currently causing the kernel to use mmap_lock to protect the maple tree?
> > =============================================================
> >
> > I understand that the long-term goal is to remove the need for mmap_lock in readers
> > while traversing the maple tree, using techniques such as RCU or SPF.
> > What is the biggest obstacle preventing this from being achieved at this time?
>
> Maple tree has an RCU mode which does not need mmap_lock for
> traversal. Liam and I were testing it recently and Liam fixed a number
> of issues to enable it. It seems stable now and the fixes are
> incorporated into the "per-vma locks" patchset which I prepared in
> this branch: https://github.com/surenbaghdasaryan/linux/tree/per_vma_lock.
Thank you for the link. I didn't realize how far the discussion had
progressed. Let me check if I understand correctly:

To allow page faults on non-overlapping ranges to proceed while writers
are performing VMA operations, per-VMA locking moves readers from the
mmap_lock to a per-VMA lock during page faults.

While the maple tree traversal itself is done without locking, readers
must take the VMA lock in read mode within the RCU read-side critical
section (or fall back to retrying under mmap_lock if that fails) to
handle the page fault. This ensures that readers never race with
writers for access to the same VMA.
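In rough pseudo-C (the VMA lock helpers are names I made up, not from
the patchset):

	rcu_read_lock();
	vma = mas_walk(&mas);		/* lockless maple tree walk */
	if (!vma || !vma_read_trylock(vma)) {
		rcu_read_unlock();
		goto fallback;		/* retry under mmap_lock */
	}
	rcu_read_unlock();
	/* the VMA is now stable against writers */
	ret = handle_mm_fault(vma, addr, flags, regs);
	vma_read_unlock(vma);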
Am I getting it right?
> I haven't posted this patchset upstream yet but it's pretty much ready
> to go. I'm planning to post it in early January.
Looking forward to that,
thank you for working on this.
--
Thanks,
Hyeonggon
On Wed, Dec 28, 2022 at 08:50:36PM +0000, Matthew Wilcox wrote:
> On Wed, Dec 28, 2022 at 09:48:51PM +0900, Hyeonggon Yoo wrote:
> > Hello mm folks,
> >
> > I have a few questions about the current status of mmap_lock scalability.
> >
> > =============================================================
> > What is currently causing the kernel to use mmap_lock to protect the maple tree?
> > =============================================================
> >
> > I understand that the long-term goal is to remove the need for mmap_lock in readers
> > while traversing the maple tree, using techniques such as RCU or SPF.
> > What is the biggest obstacle preventing this from being achieved at this time?
>
> The long term goal is even larger than this. Ideally, the VMA tree
> would be protected by a spinlock rather than a mutex.
You mean replacing the mmap_lock rwsem with a spinlock?
How would that be possible if readers can take it during a page fault?
> That turned out
> to be too large a change for the moment (and isn't all that important
> compared to enabling RCU readers).
Yeah, better to take one step at a time.
>
> > ==================================================
> > How does the maple tree provide RCU-safe manipulation of VMAs?
> > ==================================================
> >
> > Is it similar to the approach suggested in the RCUVM paper (replacing the original
> > root node with a new root node that shares most of its nodes and deferring
> > the freeing of stale nodes using RCU)?
> >
> > I'm having difficulty understanding the design of the maple tree in this regard.
> >
> > [RCUVM paper] https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf
>
> While I've read the RCUVM paper, I wouldn't say it was particularly an
> inspiration. The Maple Tree is independent of the VM; it's a general
> purpose B-tree.
My intention was to ask how readers synchronize with other VMA
operations after traversing the tree under RCU. (Because it's
unreasonable to handle the entire page fault in an RCU read-side
critical section.)

Per-VMA locks seem to solve this by taking the VMA lock in read mode
within the RCU read-side critical section.
> As with any B-tree, when modifying a node, we don't
> touch nodes that we don't need to touch. As with any RCU data structure,
> we defer freeing things while RCU readers might still have a reference
> to them.
>
> We don't necessarily go all the way to the root node when modifying a
> leaf node. For example, if we have this structure:
>
> Root: Node A, 4000, Node B
> Node A: p1, 50, p2, 100, p3, 150, p4, 200, NULL, 250, p6, 1000, p7
> Node B: p8, 4050, p9, 4100, p10, 4150, p11, 4200, NULL, 4250, p13
>
> and we replace p4 with a NULL over the whole range from 150-199,
> we construct a new Node A2 that contains:
>
> Node A2: p1, 50, p2, 100, p3, 150, NULL, 250, p6, 1000, p7
>
> and we simply write A2 over the entry in Root. Then we mark Node A as
> dead and RCU-free Node A. There's no need to replace Root as stores
> to a pointer are atomic.
Thank you for explaining things in an easy and intuitive way.
Okay, I see that it's not a big problem to update values in a B-tree
in an RCU-safe way.
> If we need to rebalance between Node A and
> Node B, we will need to create a new Root (as well as both A and B),
> mark all of them as dead and RCU-free them.
--
Thanks,
Hyeonggon
On Thu, Dec 29, 2022 at 11:22:28PM +0900, Hyeonggon Yoo wrote:
> On Wed, Dec 28, 2022 at 08:50:36PM +0000, Matthew Wilcox wrote:
> > The long term goal is even larger than this. Ideally, the VMA tree
> > would be protected by a spinlock rather than a mutex.
>
> You mean replacing the mmap_lock rwsem with a spinlock?
> How would that be possible if readers can take it during a page fault?
The mmap_lock is taken for many, many things. So the plan was to
have a spinlock in the maple tree (indeed, there's still one there;
it's just in a union with the lockdep_map_p). VMA readers would walk
the tree protected only by RCU; VMA writers would take the spinlock
while modifying the tree. The work Suren, Liam & I are engaged in
still uses the mmap semaphore for writers, but we do walk the tree
under RCU protection.
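For reference, the root of the tree looks roughly like this (from
memory, so check include/linux/maple_tree.h for the authoritative
definition):

	struct maple_tree {
		union {
			spinlock_t	ma_lock;
			lockdep_map_p	ma_external_lock;
		};
		void __rcu	*ma_root;
		unsigned int	ma_flags;
	};

MT_FLAGS_LOCK_EXTERN selects the external-lock flavour; the mm's tree
uses that, since VMA writers are still serialised by mmap_lock.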
> > While I've read the RCUVM paper, I wouldn't say it was particularly an
> > inspiration. The Maple Tree is independent of the VM; it's a general
> > purpose B-tree.
>
> My intention was to ask how readers synchronize with other VMA
> operations after traversing the tree under RCU. (Because it's
> unreasonable to handle the entire page fault in an RCU read-side
> critical section.)
>
> Per-VMA locks seem to solve this by taking the VMA lock in read mode
> within the RCU read-side critical section.
Right, but it's a little more complex than that. The real "lock" on
the VMA is actually a sequence count. https://lwn.net/Articles/906852/
does a good job of explaining it, but the VMA lock is really there as
a convenient way for the writer to wait for readers to be sufficiently
"finished" with handling the page fault that any conflicting changes
will be correctly retired.
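From memory, the reader side in Suren's patchset looks something like
this (a simplified sketch; don't take the names as exact):

	static bool vma_start_read(struct vm_area_struct *vma)
	{
		/* A VMA whose sequence number matches the mm-wide one
		 * is currently write-locked; the reader falls back. */
		if (READ_ONCE(vma->vm_lock_seq) ==
		    READ_ONCE(vma->vm_mm->mm_lock_seq))
			return false;

		if (!down_read_trylock(&vma->vm_lock))
			return false;

		/* Recheck under the lock; a writer may have raced us. */
		if (vma->vm_lock_seq == vma->vm_mm->mm_lock_seq) {
			up_read(&vma->vm_lock);
			return false;
		}
		return true;
	}

The writer side marks every VMA it modifies by copying the mm-wide
sequence number into it (under mmap_lock held for write), and uses the
VMA rwsem only to wait out readers that are still inside.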
https://www.infradead.org/~willy/linux/store-free-page-faults.html
outlines how I intend to proceed from Suren's current scheme (where
RCU is only used to protect the tree walk) to using RCU for the
entire page fault.
On Thu, Dec 29, 2022 at 9:10 AM Lorenzo Stoakes <[email protected]> wrote:
>
> On Thu, Dec 29, 2022 at 04:51:37PM +0000, Matthew Wilcox wrote:
> > The mmap_lock is taken for many, many things. [snip]
>
> I am currently describing the use of this lock (for 6.0) in the book and it is
> striking just how broadly it's used. I'm diagramming it out for 'core' users,
> i.e. non-driver and non-some other things, but even constraining that leaves a
> HUGE number of users. I've also documented the 'unexpected' uses of the
> page_table_lock, which seems to have been significantly improved over time but
> still a few cases remain!
>
> Am happy to give you (+ anybody else on MAINTAINERS list) an early copy of the
> relevant bit (once I've finished the diagrams anyway) if that'd be helpful!
Yes please, that would be interesting.
>
> Now if you guys could stop obsoleting my work that'd be great ;)
On Thu, Dec 29, 2022 at 05:10:28PM +0000, Lorenzo Stoakes wrote:
> On Thu, Dec 29, 2022 at 04:51:37PM +0000, Matthew Wilcox wrote:
> > The mmap_lock is taken for many, many things. [snip]
>
> I am currently describing the use of this lock (for 6.0) in the book and it is
> striking just how broadly it's used. I'm diagramming it out for 'core' users,
> i.e. non-driver and non-some other things, but even constraining that leaves a
> HUGE number of users.
I fear this would be overwhelming. I don't think anybody would disagree
that the mmap_lock needs to be split up like the BKL was, but we didn't
do that by diagramming it out. Instead, we introduced new smaller locks
that protected much better-defined things until eventually we were able
to kill the BKL entirely.
That's what I'm trying to do here -- there is one well-defined thing
that the maple tree lock will protect, and that's the structure of the
maple tree. It doesn't protect the data pointed to by the pointers
stored in the tree, just the maple tree itself.
> I've also documented the 'unexpected' uses of the
> page_table_lock, which seems to have been significantly improved over time but
> still a few cases remain!
Now, I think this is useful. There's probably few enough abuses of the
PTL that my brain can wrap itself around which ones are legitimate and
then deal with the inappropriate ones.
> Am happy to give you (+ anybody else on MAINTAINERS list) an early copy of the
> relevant bit (once I've finished the diagrams anyway) if that'd be helpful!
I'm definitely interested in the PTL. Thank you for the offer!
> Now if you guys could stop obsoleting my work that'd be great ;)
Never! How else will you get interest in the Second Edition Covering
Linux 7.0? ;-)
On Thu, Dec 29, 2022 at 04:51:37PM +0000, Matthew Wilcox wrote:
> The mmap_lock is taken for many, many things. [snip]
I am currently describing the use of this lock (for 6.0) in the book and it is
striking just how broadly it's used. I'm diagramming it out for 'core' users,
i.e. non-driver and non-some other things, but even constraining that leaves a
HUGE number of users. I've also documented the 'unexpected' uses of the
page_table_lock, which seems to have been significantly improved over time but
still a few cases remain!
Am happy to give you (+ anybody else on MAINTAINERS list) an early copy of the
relevant bit (once I've finished the diagrams anyway) if that'd be helpful!
Now if you guys could stop obsoleting my work that'd be great ;)
On Thu, Dec 29, 2022 at 04:51:37PM +0000, Matthew Wilcox wrote:
> On Thu, Dec 29, 2022 at 11:22:28PM +0900, Hyeonggon Yoo wrote:
> > On Wed, Dec 28, 2022 at 08:50:36PM +0000, Matthew Wilcox wrote:
> > > The long term goal is even larger than this. Ideally, the VMA tree
> > > would be protected by a spinlock rather than a mutex.
> >
> > You mean replacing the mmap_lock rwsem with a spinlock?
> > How would that be possible if readers can take it during a page fault?
>
> The mmap_lock is taken for many, many things. So the plan was to
> have a spinlock in the maple tree (indeed, there's still one there;
> it's just in a union with the lockdep_map_p). VMA readers would walk
> the tree protected only by RCU; VMA writers would take the spinlock
> while modifying the tree. The work Suren, Liam & I are engaged in
> still uses the mmap semaphore for writers, but we do walk the tree
> under RCU protection.
>
Thanks, I get it. So it's about reducing the overhead of modifying the
maple tree.
> > > While I've read the RCUVM paper, I wouldn't say it was particularly an
> > > inspiration. The Maple Tree is independent of the VM; it's a general
> > > purpose B-tree.
> >
> > My intention was to ask how readers synchronize with other VMA
> > operations after traversing the tree under RCU. (Because it's
> > unreasonable to handle the entire page fault in an RCU read-side
> > critical section.)
> >
> > Per-VMA locks seem to solve this by taking the VMA lock in read mode
> > within the RCU read-side critical section.
>
> Right, but it's a little more complex than that. The real "lock" on
> the VMA is actually a sequence count. https://lwn.net/Articles/906852/
> does a good job of explaining it, but the VMA lock is really there as
> a convenient way for the writer to wait for readers to be sufficiently
> "finished" with handling the page fault that any conflicting changes
> will be correctly retired.
Oh, thanks, nice article!
> https://www.infradead.org/~willy/linux/store-free-page-faults.html
> outlines how I intend to proceed from Suren's current scheme (where
> RCU is only used to protect the tree walk) to using RCU for the
> entire page fault.
Thank you for sharing your outline.
Okay, so the planned scheme is:
1. Try to process entire page fault under RCU protection
- if failed, goto 2. if succeeded, goto 4.
2. Fall back to Suren's scheme (try to take VMA rwsem)
- if failed, goto 3. if succeeded, goto 4.
3. Fall back to mmap_lock
- goto 4.
4. Finish page fault.
To implement step 1, __p*d_alloc() needs to take gfp flags so that it
does not sleep in an RCU read-side critical section.

What about introducing a PF_MEMALLOC_NOWAIT process flag that forces
GFP_NOWAIT | __GFP_NOWARN, similar to PF_MEMALLOC_NO{FS,IO}, looking
like this? It would be less churn.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 853d08f7562b..77b88f30523b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1725,7 +1725,7 @@ extern struct pid *cad_pid;
 #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
 #define PF__HOLE__00004000	0x00004000
 #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
-#define PF__HOLE__00010000	0x00010000
+#define PF_MEMALLOC_NOWAIT	0x00010000	/* All allocation requests will force GFP_NOWAIT | __GFP_NOWARN */
 #define PF_KSWAPD		0x00020000	/* I am kswapd */
 #define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
 #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 2a243616f222..4a1196646951 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -204,7 +204,8 @@ static inline gfp_t current_gfp_context(gfp_t flags)
 {
 	unsigned int pflags = READ_ONCE(current->flags);
 
-	if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_PIN))) {
+	if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS
+				| PF_MEMALLOC_PIN | PF_MEMALLOC_NOWAIT))) {
 		/*
 		 * NOIO implies both NOIO and NOFS and it is a weaker context
 		 * so always make sure it makes precedence
@@ -216,6 +217,8 @@ static inline gfp_t current_gfp_context(gfp_t flags)
 		if (pflags & PF_MEMALLOC_PIN)
 			flags &= ~__GFP_MOVABLE;
 
+		if (pflags & PF_MEMALLOC_NOWAIT)
+			flags = GFP_NOWAIT | __GFP_NOWARN;
 	}
 	return flags;
 }
@@ -305,6 +308,18 @@ static inline void memalloc_noio_restore(unsigned int flags)
 	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
+static inline unsigned int memalloc_nowait_save(void)
+{
+	unsigned int flags = current->flags & PF_MEMALLOC_NOWAIT;
+	current->flags |= PF_MEMALLOC_NOWAIT;
+	return flags;
+}
+
+static inline void memalloc_nowait_restore(unsigned int flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_NOWAIT) | flags;
+}
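The fault path could then wrap its speculative part like this
(hypothetical caller, just to show the intent; the fault helper is a
made-up name):

	static vm_fault_t speculative_fault(struct mm_struct *mm,
					    unsigned long addr)
	{
		unsigned int nowait_flags;
		vm_fault_t ret;

		rcu_read_lock();
		nowait_flags = memalloc_nowait_save();
		/* everything below now allocates GFP_NOWAIT only */
		ret = handle_fault_under_rcu(mm, addr);
		memalloc_nowait_restore(nowait_flags);
		rcu_read_unlock();
		return ret;
	}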
--
Thanks,
Hyeonggon
On Mon, Jan 02, 2023 at 09:04:12PM +0900, Hyeonggon Yoo wrote:
> > https://www.infradead.org/~willy/linux/store-free-page-faults.html
> > outlines how I intend to proceed from Suren's current scheme (where
> > RCU is only used to protect the tree walk) to using RCU for the
> > entire page fault.
>
> Thank you for sharing your outline.
> Okay, so the planned scheme is:
>
> 1. Try to process entire page fault under RCU protection
> - if failed, goto 2. if succeeded, goto 4.
>
> 2. Fall back to Suren's scheme (try to take VMA rwsem)
> - if failed, goto 3. if succeeded, goto 4.
Right. The question is whether to restart the page fault under Suren's
scheme, or just grab the VMA rwsem and continue. Experimentation
needed.
It's also worth noting that Michel has an alternative proposal, which
is to drop out of RCU protection before trying to allocate memory, then
re-enter RCU mode and check the sequence count hasn't changed on the
entire MM. His proposal has the advantage of not trying to allocate
memory while holding the RCU read lock, but the disadvantage of having
to retry the page fault if anyone has called mmap() or munmap(). Which
alternative is better is going to depend on the workload; do we see more
calls to mmap()/munmap(), or do we need to enter page reclaim more often?
I think they're largely equivalent performance-wise in the fast path.
Another metric to consider is code complexity; he thinks his method
is easier to understand and I think mine is easier. To be expected,
I suppose ;-)
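In outline, Michel's scheme would be something like this (hypothetical
code, only to contrast the two approaches; mm_seq and the helpers are
made up):

	static vm_fault_t do_fault_rcu(struct mm_struct *mm,
				       unsigned long addr)
	{
		unsigned long seq;
		vm_fault_t ret;

	retry:
		rcu_read_lock();
		seq = READ_ONCE(mm->mm_seq);
		ret = handle_fault_nowait(mm, addr);
		rcu_read_unlock();

		if (ret != VM_FAULT_OOM)
			return ret;

		/* Drop out of RCU to allocate; sleeping is fine here. */
		if (!fault_prealloc(mm, addr))
			return VM_FAULT_OOM;

		/* Re-enter RCU and check that no mmap()/munmap() ran
		 * in the meantime; if one did, retry from scratch. */
		rcu_read_lock();
		if (READ_ONCE(mm->mm_seq) != seq) {
			rcu_read_unlock();
			goto retry;
		}
		ret = handle_fault_nowait(mm, addr); /* uses the prealloc */
		rcu_read_unlock();
		return ret;
	}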
> 3. Fall back to mmap_lock
> - goto 4.
>
> 4. Finish page fault.
>
> To implement step 1, __p*d_alloc() needs to take gfp flags so that it
> does not sleep in an RCU read-side critical section.
>
> What about introducing a PF_MEMALLOC_NOWAIT process flag that forces
> GFP_NOWAIT | __GFP_NOWARN, similar to PF_MEMALLOC_NO{FS,IO}, looking
> like this? It would be less churn.
Certainly less churn, but also far more risky. All of a sudden,
codepaths which used to always succeed will now start failing, and
either there aren't checks for memory allocation failures or those
paths have never been tested before.
On Mon, Jan 02, 2023 at 02:37:02PM +0000, Matthew Wilcox wrote:
> On Mon, Jan 02, 2023 at 09:04:12PM +0900, Hyeonggon Yoo wrote:
> > > https://www.infradead.org/~willy/linux/store-free-page-faults.html
> > > outlines how I intend to proceed from Suren's current scheme (where
> > > RCU is only used to protect the tree walk) to using RCU for the
> > > entire page fault.
> >
> > Thank you for sharing your outline.
> > Okay, so the planned scheme is:
> >
> > 1. Try to process entire page fault under RCU protection
> > - if failed, goto 2. if succeeded, goto 4.
> >
> > 2. Fall back to Suren's scheme (try to take VMA rwsem)
> > - if failed, goto 3. if succeeded, goto 4.
>
> Right. The question is whether to restart the page fault under Suren's
> scheme, or just grab the VMA rwsem and continue. Experimentation
> needed.
>
> It's also worth noting that Michel has an alternative proposal, which
> is to drop out of RCU protection before trying to allocate memory, then
> re-enter RCU mode and check the sequence count hasn't changed on the
> entire MM. His proposal has the advantage of not trying to allocate
> memory while holding the RCU read lock, but the disadvantage of having
> to retry the page fault if anyone has called mmap() or munmap(). Which
> alternative is better is going to depend on the workload; do we see more
> calls to mmap()/munmap(), or do we need to enter page reclaim more often?
> I think they're largely equivalent performance-wise in the fast path.
> Another metric to consider is code complexity; he thinks his method
> is easier to understand and I think mine is easier. To be expected,
> I suppose ;-)
I'm planning to suggest a cooperative project to my colleagues
that would involve making __p?d_alloc() take gfp flags.
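(Hypothetically, something like the following signature change, with
callers plumbing GFP_KERNEL or GFP_NOWAIT through as appropriate:

	int __pud_alloc(struct mm_struct *mm, p4d_t *p4d,
			unsigned long address, gfp_t gfp);
	int __pmd_alloc(struct mm_struct *mm, pud_t *pud,
			unsigned long address, gfp_t gfp);

The gfp parameter is the new part; current kernels hard-code the
allocation flags internally.)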
Wondering if there was any progress or conclusion made on which
approach is better for full RCU page faults, or was there another
solution proposed?
Asking this because I don't want to waste my time if the approach
has been abandoned.
Regards,
Hyeonggon
> > 3. Fall back to mmap_lock
> > - goto 4.
> >
> > 4. Finish page fault.
On Mon, Feb 20, 2023 at 02:26:49PM +0000, Hyeonggon Yoo wrote:
> On Mon, Jan 02, 2023 at 02:37:02PM +0000, Matthew Wilcox wrote:
> > On Mon, Jan 02, 2023 at 09:04:12PM +0900, Hyeonggon Yoo wrote:
> > > > https://www.infradead.org/~willy/linux/store-free-page-faults.html
> > > > outlines how I intend to proceed from Suren's current scheme (where
> > > > RCU is only used to protect the tree walk) to using RCU for the
> > > > entire page fault.
> > >
> > > Thank you for sharing your outline.
> > > Okay, so the planned scheme is:
> > >
> > > 1. Try to process entire page fault under RCU protection
> > > - if failed, goto 2. if succeeded, goto 4.
> > >
> > > 2. Fall back to Suren's scheme (try to take VMA rwsem)
> > > - if failed, goto 3. if succeeded, goto 4.
> >
> > Right. The question is whether to restart the page fault under Suren's
> > scheme, or just grab the VMA rwsem and continue. Experimentation
> > needed.
> >
> > It's also worth noting that Michel has an alternative proposal, which
> > is to drop out of RCU protection before trying to allocate memory, then
> > re-enter RCU mode and check the sequence count hasn't changed on the
> > entire MM. His proposal has the advantage of not trying to allocate
> > memory while holding the RCU read lock, but the disadvantage of having
> > to retry the page fault if anyone has called mmap() or munmap(). Which
> > alternative is better is going to depend on the workload; do we see more
> > calls to mmap()/munmap(), or do we need to enter page reclaim more often?
> > I think they're largely equivalent performance-wise in the fast path.
> > Another metric to consider is code complexity; he thinks his method
> > is easier to understand and I think mine is easier. To be expected,
> > I suppose ;-)
>
> I'm planning to suggest a cooperative project to my colleagues
> that would involve making __p?d_alloc() take gfp flags.
>
> Wondering if there was any progress or conclusion made on which
> approach is better for full RCU page faults, or was there another
> solution proposed?
>
> Asking this because I don't want to waste my time if the approach
> has been abandoned.
Thanks for checking, but nobody's made any progress on this, that I know
of.
(The __p?d_alloc() approach may also be useful to support vmalloc()
with flags that aren't GFP_KERNEL compatible)
On Mon, Feb 20, 2023 at 02:43:23PM +0000, Matthew Wilcox wrote:
> On Mon, Feb 20, 2023 at 02:26:49PM +0000, Hyeonggon Yoo wrote:
> > On Mon, Jan 02, 2023 at 02:37:02PM +0000, Matthew Wilcox wrote:
> > > On Mon, Jan 02, 2023 at 09:04:12PM +0900, Hyeonggon Yoo wrote:
> > > > > https://www.infradead.org/~willy/linux/store-free-page-faults.html
> > > > > outlines how I intend to proceed from Suren's current scheme (where
> > > > > RCU is only used to protect the tree walk) to using RCU for the
> > > > > entire page fault.
> > > >
> > > > Thank you for sharing your outline.
> > > > Okay, so the planned scheme is:
> > > >
> > > > 1. Try to process entire page fault under RCU protection
> > > > - if failed, goto 2. if succeeded, goto 4.
> > > >
> > > > 2. Fall back to Suren's scheme (try to take VMA rwsem)
> > > > - if failed, goto 3. if succeeded, goto 4.
> > >
> > > Right. The question is whether to restart the page fault under Suren's
> > > scheme, or just grab the VMA rwsem and continue. Experimentation
> > > needed.
> > >
> > > It's also worth noting that Michel has an alternative proposal, which
> > > is to drop out of RCU protection before trying to allocate memory, then
> > > re-enter RCU mode and check the sequence count hasn't changed on the
> > > entire MM. His proposal has the advantage of not trying to allocate
> > > memory while holding the RCU read lock, but the disadvantage of having
> > > to retry the page fault if anyone has called mmap() or munmap(). Which
> > > alternative is better is going to depend on the workload; do we see more
> > > calls to mmap()/munmap(), or do we need to enter page reclaim more often?
> > > I think they're largely equivalent performance-wise in the fast path.
> > > Another metric to consider is code complexity; he thinks his method
> > > is easier to understand and I think mine is easier. To be expected,
> > > I suppose ;-)
> >
> > I'm planning to suggest a cooperative project to my colleagues
> > that would involve making __p?d_alloc() take gfp flags.
> >
> > Wondering if there was any progress or conclusion made on which
> > approach is better for full RCU page faults, or was there another
> > solution proposed?
> >
> > Asking this because I don't want to waste my time if the approach
> > has been abandoned.
>
> Thanks for checking, but nobody's made any progress on this, that I know
> of.
Thanks for the confirmation. Then I think it's still worth trying.
> (The __p?d_alloc() approach may also be useful to support vmalloc()
> with flags that aren't GFP_KERNEL compatible)
Are there any possible users of that? It sounds like someone trying to
call __vmalloc() in interrupt context or in an RCU read-side critical
section?