2022-05-04 03:26:16

by Paul E. McKenney

[permalink] [raw]
Subject: Memory allocation on speculative fastpaths

Hello!

Just following up from off-list discussions yesterday.

The requirements to allocate on an RCU-protected speculative fastpath
seem to be as follows:

1. Never sleep.
2. Never reclaim.
3. Leave emergency pools alone.

Any others?

If those rules suffice, and if my understanding of the GFP flags is
correct (ha!!!), then the following GFP flags should cover this:

__GFP_NOMEMALLOC | __GFP_NOWARN

Or is this just a fancy way of always returning NULL or some such? ;-)

Thanx, Paul


2022-05-04 03:58:08

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: Memory allocation on speculative fastpaths

On Tue, May 3, 2022 at 11:28 AM Matthew Wilcox <[email protected]> wrote:
>
> On Tue, May 03, 2022 at 09:39:05AM -0700, Paul E. McKenney wrote:
> > On Tue, May 03, 2022 at 06:04:13PM +0200, Michal Hocko wrote:
> > > On Tue 03-05-22 08:59:13, Paul E. McKenney wrote:
> > > > Hello!
> > > >
> > > > Just following up from off-list discussions yesterday.
> > > >
> > > > The requirements to allocate on an RCU-protected speculative fastpath
> > > > seem to be as follows:
> > > >
> > > > 1. Never sleep.
> > > > 2. Never reclaim.
> > > > 3. Leave emergency pools alone.
> > > >
> > > > Any others?
> > > >
> > > > If those rules suffice, and if my understanding of the GFP flags is
> > > > correct (ha!!!), then the following GFP flags should cover this:
> > > >
> > > > __GFP_NOMEMALLOC | __GFP_NOWARN
> > >
> > > GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN
> >
> > Ah, good point on GFP_NOWAIT, thank you!
>
> Johannes (I think it was?) made the point to me that if we have another
> task very slowly freeing memory, a task in this path can take advantage
> of that other task's hard work and never go into reclaim. So the
> approach we should take is:
>
> p4d_alloc(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> pud_alloc(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> pmd_alloc(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
>
> if (failure) {
> rcu_read_unlock();
> do_reclaim();
> return FAULT_FLAG_RETRY;
> }
>
> ... but all this is now moot since the approach we agreed to yesterday
> is:

I think the discussion was about the above approach and Johannes
suggested to fallback to the normal pagefault handling with mmap_lock
locked if PMD does not exist. Please correct me if I misunderstood
here.

>
> rcu_read_lock();
> vma = vma_lookup();
> if (down_read_trylock(&vma->sem)) {
> rcu_read_unlock();
> } else {
> rcu_read_unlock();
> mmap_read_lock(mm);
> vma = vma_lookup();
> down_read(&vma->sem);
> }
>
> ... and we then execute the page table allocation under the protection of
> the vma->sem.
>
> At least, that's what I think we agreed to yesterday.

Honestly, I don't remember discussing vma->sem at all.
My understanding is that two approaches were differing by the section
covered by rcu_read_lock/rcu_read_unlock. The solution that you
suggested would handle pagefault completely under RCU as long as it's
possible and would fallback to the mmap_lock if it's impossible, while
Michel's implementation was taking rcu_read_lock/rcu_read_unlock for
smaller sections and would use vma->seq_number to detect any vma
changes between these sections. Your suggested approach sounds simpler
and the way I understood the comments is that we should give it a try.
Did I miss anything?
Thanks,
Suren.



>

2022-05-04 05:44:08

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Memory allocation on speculative fastpaths

On Tue, May 03, 2022 at 06:04:13PM +0200, Michal Hocko wrote:
> On Tue 03-05-22 08:59:13, Paul E. McKenney wrote:
> > Hello!
> >
> > Just following up from off-list discussions yesterday.
> >
> > The requirements to allocate on an RCU-protected speculative fastpath
> > seem to be as follows:
> >
> > 1. Never sleep.
> > 2. Never reclaim.
> > 3. Leave emergency pools alone.
> >
> > Any others?
> >
> > If those rules suffice, and if my understanding of the GFP flags is
> > correct (ha!!!), then the following GFP flags should cover this:
> >
> > __GFP_NOMEMALLOC | __GFP_NOWARN
>
> GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN

Ah, good point on GFP_NOWAIT, thank you!

> > Or is this just a fancy way of always returning NULL or some such? ;-)
>
> It could fail quite easily. We would also want to guarantee (by
> documenting I guess) that the page allocator never does anything that
> would depend or invoke rcu_synchronize or something like that.

The GPF_NOWAIT should rule out synchronize_rcu() and similar, correct?

> I believe this is the case currently.

Here is hoping! ;-)

Thanx, Paul

2022-05-04 09:33:21

by Michel Lespinasse

[permalink] [raw]
Subject: Re: Memory allocation on speculative fastpaths

(for context, this came up during a discussion of speculative page
faults implementation details)

On Tue, May 3, 2022 at 11:28 AM Matthew Wilcox <[email protected]> wrote:
> Johannes (I think it was?) made the point to me that if we have another
> task very slowly freeing memory, a task in this path can take advantage
> of that other task's hard work and never go into reclaim. So the
> approach we should take is:
>
> p4d_alloc(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> pud_alloc(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> pmd_alloc(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
>
> if (failure) {
> rcu_read_unlock();
> do_reclaim();
> return FAULT_FLAG_RETRY;
> }

I don't think this works. The problem with allocating page tables is
not just that it may break an rcu-locked code section; you also need
the code inserting the new page tables into the mm's page table tree
to synchronize with any munmap() that may be concurrently running. RCU
isn't sufficient here, and we would need a proper lock when wiring new
page tables (current code relies on mmap lock for this).

> ... but all this is now moot since the approach we agreed to yesterday
> is:
>
> rcu_read_lock();
> vma = vma_lookup();
> if (down_read_trylock(&vma->sem)) {
> rcu_read_unlock();
> } else {
> rcu_read_unlock();
> mmap_read_lock(mm);
> vma = vma_lookup();
> down_read(&vma->sem);
> }
>
> ... and we then execute the page table allocation under the protection of
> the vma->sem.
>
> At least, that's what I think we agreed to yesterday.

I don't remember discussing any of this yesterday. As I remember it,
the discussion was about having one large RCU section vs several small
ones linked by sequence count checks to verify the validity of the vma
at the start of each RCU section.

2022-05-04 11:32:57

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Memory allocation on speculative fastpaths

On Tue, May 03, 2022 at 09:39:05AM -0700, Paul E. McKenney wrote:
> On Tue, May 03, 2022 at 06:04:13PM +0200, Michal Hocko wrote:
> > On Tue 03-05-22 08:59:13, Paul E. McKenney wrote:
> > > Hello!
> > >
> > > Just following up from off-list discussions yesterday.
> > >
> > > The requirements to allocate on an RCU-protected speculative fastpath
> > > seem to be as follows:
> > >
> > > 1. Never sleep.
> > > 2. Never reclaim.
> > > 3. Leave emergency pools alone.
> > >
> > > Any others?
> > >
> > > If those rules suffice, and if my understanding of the GFP flags is
> > > correct (ha!!!), then the following GFP flags should cover this:
> > >
> > > __GFP_NOMEMALLOC | __GFP_NOWARN
> >
> > GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN
>
> Ah, good point on GFP_NOWAIT, thank you!

Johannes (I think it was?) made the point to me that if we have another
task very slowly freeing memory, a task in this path can take advantage
of that other task's hard work and never go into reclaim. So the
approach we should take is:

p4d_alloc(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
pud_alloc(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
pmd_alloc(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);

if (failure) {
rcu_read_unlock();
do_reclaim();
return FAULT_FLAG_RETRY;
}

... but all this is now moot since the approach we agreed to yesterday
is:

rcu_read_lock();
vma = vma_lookup();
if (down_read_trylock(&vma->sem)) {
rcu_read_unlock();
} else {
rcu_read_unlock();
mmap_read_lock(mm);
vma = vma_lookup();
down_read(&vma->sem);
}

... and we then execute the page table allocation under the protection of
the vma->sem.

At least, that's what I think we agreed to yesterday.

2022-05-04 15:20:51

by Michal Hocko

[permalink] [raw]
Subject: Re: Memory allocation on speculative fastpaths

On Tue 03-05-22 08:59:13, Paul E. McKenney wrote:
> Hello!
>
> Just following up from off-list discussions yesterday.
>
> The requirements to allocate on an RCU-protected speculative fastpath
> seem to be as follows:
>
> 1. Never sleep.
> 2. Never reclaim.
> 3. Leave emergency pools alone.
>
> Any others?
>
> If those rules suffice, and if my understanding of the GFP flags is
> correct (ha!!!), then the following GFP flags should cover this:
>
> __GFP_NOMEMALLOC | __GFP_NOWARN

GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN

> Or is this just a fancy way of always returning NULL or some such? ;-)

It could fail quite easily. We would also want to guarantee (by
documenting I guess) that the page allocator never does anything that
would depend or invoke rcu_synchronize or something like that.

I believe this is the case currently.
--
Michal Hocko
SUSE Labs

2022-05-04 20:39:28

by Johannes Weiner

[permalink] [raw]
Subject: Re: Memory allocation on speculative fastpaths

On Tue, May 03, 2022 at 04:15:46PM -0700, Suren Baghdasaryan wrote:
> On Tue, May 3, 2022 at 11:28 AM Matthew Wilcox <[email protected]> wrote:
> >
> > On Tue, May 03, 2022 at 09:39:05AM -0700, Paul E. McKenney wrote:
> > > On Tue, May 03, 2022 at 06:04:13PM +0200, Michal Hocko wrote:
> > > > On Tue 03-05-22 08:59:13, Paul E. McKenney wrote:
> > > > > Hello!
> > > > >
> > > > > Just following up from off-list discussions yesterday.
> > > > >
> > > > > The requirements to allocate on an RCU-protected speculative fastpath
> > > > > seem to be as follows:
> > > > >
> > > > > 1. Never sleep.
> > > > > 2. Never reclaim.
> > > > > 3. Leave emergency pools alone.
> > > > >
> > > > > Any others?
> > > > >
> > > > > If those rules suffice, and if my understanding of the GFP flags is
> > > > > correct (ha!!!), then the following GFP flags should cover this:
> > > > >
> > > > > __GFP_NOMEMALLOC | __GFP_NOWARN
> > > >
> > > > GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN
> > >
> > > Ah, good point on GFP_NOWAIT, thank you!
> >
> > Johannes (I think it was?) made the point to me that if we have another
> > task very slowly freeing memory, a task in this path can take advantage
> > of that other task's hard work and never go into reclaim. So the
> > approach we should take is:

Right, GFP_NOWAIT can starve out other allocations. It can clear out
the freelists without the burden of having to do reclaim like
everybody else wanting memory during a shortage. Including GFP_KERNEL.

In smaller doses and/or for privileged purposes (e.g. single-argument
kfree_rcu ;)), those allocations are fine. But because the context is
page tables specifically, it would mean that userspace could trigger a
large number of those and DOS other applications and the kernel.

> > p4d_alloc(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > pud_alloc(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > pmd_alloc(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> >
> > if (failure) {
> > rcu_read_unlock();
> > do_reclaim();
> > return FAULT_FLAG_RETRY;
> > }
> >
> > ... but all this is now moot since the approach we agreed to yesterday
> > is:
>
> I think the discussion was about the above approach and Johannes
> suggested to fallback to the normal pagefault handling with mmap_lock
> locked if PMD does not exist. Please correct me if I misunderstood
> here.

Yeah. Either way works, as long as the task is held accountable.

2022-05-04 22:31:00

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Memory allocation on speculative fastpaths

On Wed, May 04, 2022 at 01:20:39AM -0700, Michel Lespinasse wrote:
> (for context, this came up during a discussion of speculative page
> faults implementation details)
>
> On Tue, May 3, 2022 at 11:28 AM Matthew Wilcox <[email protected]> wrote:
> > Johannes (I think it was?) made the point to me that if we have another
> > task very slowly freeing memory, a task in this path can take advantage
> > of that other task's hard work and never go into reclaim. So the
> > approach we should take is:
> >
> > p4d_alloc(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > pud_alloc(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > pmd_alloc(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> >
> > if (failure) {
> > rcu_read_unlock();
> > do_reclaim();
> > return FAULT_FLAG_RETRY;
> > }
>
> I don't think this works. The problem with allocating page tables is
> not just that it may break an rcu-locked code section; you also need
> the code inserting the new page tables into the mm's page table tree
> to synchronize with any munmap() that may be concurrently running. RCU
> isn't sufficient here, and we would need a proper lock when wiring new
> page tables (current code relies on mmap lock for this).

Hmm, so what you're saying is that we could see:

CPU 0 CPU 1

rcu_read_lock()
p4d_alloc()
pud_alloc()
mmap_write_lock
pmd_free
pud_free
p4d_free
mmap_write_unlock
pmd_alloc()

... and now CPU 0 has stored a newly allocated PMD into a PUD that will
be freed by RCU.

Good catch, but fortunately we don't have to solve it right now because
we decided on Monday to not pursue this approach.

> > ... but all this is now moot since the approach we agreed to yesterday
> > is:
> >
> > rcu_read_lock();
> > vma = vma_lookup();
> > if (down_read_trylock(&vma->sem)) {
> > rcu_read_unlock();
> > } else {
> > rcu_read_unlock();
> > mmap_read_lock(mm);
> > vma = vma_lookup();
> > down_read(&vma->sem);
> > }
> >
> > ... and we then execute the page table allocation under the protection of
> > the vma->sem.
> >
> > At least, that's what I think we agreed to yesterday.
>
> I don't remember discussing any of this yesterday. As I remember it,
> the discussion was about having one large RCU section vs several small
> ones linked by sequence count checks to verify the validity of the vma
> at the start of each RCU section.

That was indeed what we started out discussing, but Michal and others
wanted to explore locking the VMA as a first step. This approach
doesn't suffer from the same problem as the write side is:

mmap_write_lock()
vma = vma_lookup();
down_write(&vma->sem);
... tear down page tables ...
up_write(&vma->sem);
rcu_free(vma);
mmap_write_unlock();

So when tearing down the page tables, if the fault is on this VMA, it
will block until the fault has released the VMA read lock. If it's on
another VMA, the page table teardown will not tear down any page tables
which are partially covered by any VMA.

Or have I missed something else?

2022-05-25 15:12:30

by Vlastimil Babka

[permalink] [raw]
Subject: Re: Memory allocation on speculative fastpaths

On 5/4/22 18:23, Johannes Weiner wrote:
> On Tue, May 03, 2022 at 04:15:46PM -0700, Suren Baghdasaryan wrote:
>> On Tue, May 3, 2022 at 11:28 AM Matthew Wilcox <[email protected]> wrote:
>>>
>>> On Tue, May 03, 2022 at 09:39:05AM -0700, Paul E. McKenney wrote:
>>>> On Tue, May 03, 2022 at 06:04:13PM +0200, Michal Hocko wrote:
>>>>> On Tue 03-05-22 08:59:13, Paul E. McKenney wrote:
>>>>>> Hello!
>>>>>>
>>>>>> Just following up from off-list discussions yesterday.
>>>>>>
>>>>>> The requirements to allocate on an RCU-protected speculative fastpath
>>>>>> seem to be as follows:
>>>>>>
>>>>>> 1. Never sleep.
>>>>>> 2. Never reclaim.
>>>>>> 3. Leave emergency pools alone.
>>>>>>
>>>>>> Any others?
>>>>>>
>>>>>> If those rules suffice, and if my understanding of the GFP flags is
>>>>>> correct (ha!!!), then the following GFP flags should cover this:
>>>>>>
>>>>>> __GFP_NOMEMALLOC | __GFP_NOWARN
>>>>>
>>>>> GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN
>>>>
>>>> Ah, good point on GFP_NOWAIT, thank you!
>>>
>>> Johannes (I think it was?) made the point to me that if we have another
>>> task very slowly freeing memory, a task in this path can take advantage
>>> of that other task's hard work and never go into reclaim. So the
>>> approach we should take is:
>
> Right, GFP_NOWAIT can starve out other allocations. It can clear out
> the freelists without the burden of having to do reclaim like
> everybody else wanting memory during a shortage. Including GFP_KERNEL.

FTR, I wonder if this is really true, given the suggested fallback. With
GFP_NOWAIT, you can either see memory (in all applicable zones) as

a) above low_watermark, just go ahead and allocate, as GFP_KERNEL would
b) between min and low watermark, wake up kswapd and allocate, as
GFP_KERNEL would
c) below min watermark, the most interesting. GFP_KERNEL fallbacks to
reclaim. If the GFP_NOWAIT path's fallback also includes reclaim, as
suggested in this thread, how is it really different from GFP_KERNEL?

So am I missing something or is GFP_NOWAIT fastpath with an immediate
fallback that includes reclaim (and not just a retry loop) fundamentally
not different from GFP_KERNEL, regardless of how often we attempt it?

> In smaller doses and/or for privileged purposes (e.g. single-argument
> kfree_rcu ;)), those allocations are fine. But because the context is
> page tables specifically, it would mean that userspace could trigger a
> large number of those and DOS other applications and the kernel.
>
>>> p4d_alloc(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
>>> pud_alloc(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
>>> pmd_alloc(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
>>>
>>> if (failure) {
>>> rcu_read_unlock();
>>> do_reclaim();
>>> return FAULT_FLAG_RETRY;
>>> }
>>>
>>> ... but all this is now moot since the approach we agreed to yesterday
>>> is:
>>
>> I think the discussion was about the above approach and Johannes
>> suggested to fallback to the normal pagefault handling with mmap_lock
>> locked if PMD does not exist. Please correct me if I misunderstood
>> here.
>
> Yeah. Either way works, as long as the task is held accountable.
>


2022-05-25 20:28:33

by Johannes Weiner

[permalink] [raw]
Subject: Re: Memory allocation on speculative fastpaths

On Tue, May 24, 2022 at 10:37:15PM +0200, Vlastimil Babka wrote:
> On 5/4/22 18:23, Johannes Weiner wrote:
> > On Tue, May 03, 2022 at 04:15:46PM -0700, Suren Baghdasaryan wrote:
> >> On Tue, May 3, 2022 at 11:28 AM Matthew Wilcox <[email protected]> wrote:
> >>>
> >>> On Tue, May 03, 2022 at 09:39:05AM -0700, Paul E. McKenney wrote:
> >>>> On Tue, May 03, 2022 at 06:04:13PM +0200, Michal Hocko wrote:
> >>>>> On Tue 03-05-22 08:59:13, Paul E. McKenney wrote:
> >>>>>> Hello!
> >>>>>>
> >>>>>> Just following up from off-list discussions yesterday.
> >>>>>>
> >>>>>> The requirements to allocate on an RCU-protected speculative fastpath
> >>>>>> seem to be as follows:
> >>>>>>
> >>>>>> 1. Never sleep.
> >>>>>> 2. Never reclaim.
> >>>>>> 3. Leave emergency pools alone.
> >>>>>>
> >>>>>> Any others?
> >>>>>>
> >>>>>> If those rules suffice, and if my understanding of the GFP flags is
> >>>>>> correct (ha!!!), then the following GFP flags should cover this:
> >>>>>>
> >>>>>> __GFP_NOMEMALLOC | __GFP_NOWARN
> >>>>>
> >>>>> GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN
> >>>>
> >>>> Ah, good point on GFP_NOWAIT, thank you!
> >>>
> >>> Johannes (I think it was?) made the point to me that if we have another
> >>> task very slowly freeing memory, a task in this path can take advantage
> >>> of that other task's hard work and never go into reclaim. So the
> >>> approach we should take is:
> >
> > Right, GFP_NOWAIT can starve out other allocations. It can clear out
> > the freelists without the burden of having to do reclaim like
> > everybody else wanting memory during a shortage. Including GFP_KERNEL.
>
> FTR, I wonder if this is really true, given the suggested fallback.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^

IIRC adding this fallback was the conclusion of the in-person
discussion. Above I just tried to summarize for the record the
original concern that led to it. I could have been more clear.

Your analysis is dead on, of course.