2020-09-21 21:19:39

by Peter Xu

Subject: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

(Commit message collected from Jason Gunthorpe)

Reduce the chance of false positives from page_maybe_dma_pinned() by keeping
track of whether the mm_struct has ever been used with pin_user_pages().
mm_structs that have never been passed to pin_user_pages() cannot have a
positive page_maybe_dma_pinned() by definition. This allows cases that might
drive up the page ref_count to avoid any penalty from handling dma_pinned
pages.

Due to complexities with unpinning, this trivial version is a permanent sticky
bit; future work will be needed to make this a counter.
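
(Illustrative sketch only, not part of this patch: a later consumer is
expected to gate the expensive handling on the new field, roughly like

	if (READ_ONCE(src_mm->has_pinned) && page_maybe_dma_pinned(page)) {
		/* treat the page as possibly DMA-pinned, e.g. copy it
		 * rather than sharing it write-protected */
	}

so that mm_structs which never pinned anything skip the dma_pinned
handling entirely.)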

Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/mm_types.h | 10 ++++++++++
kernel/fork.c | 1 +
mm/gup.c | 6 ++++++
3 files changed, 17 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 496c3ff97cce..6f291f8b74c6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -441,6 +441,16 @@ struct mm_struct {
 #endif
 		int map_count;			/* number of VMAs */
 
+		/**
+		 * @has_pinned: Whether this mm has pinned any pages. This can
+		 * either be replaced in the future by @pinned_vm when it
+		 * becomes stable, or grow into a counter on its own. We're
+		 * aggressive on this bit now - even if the pinned pages were
+		 * unpinned later on, we'll still keep this bit set for the
+		 * lifecycle of this mm just for simplicity.
+		 */
+		int has_pinned;
+
 		spinlock_t page_table_lock; /* Protects page tables and some
 					     * counters
 					     */
diff --git a/kernel/fork.c b/kernel/fork.c
index 49677d668de4..7237d418e7b5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1011,6 +1011,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm_pgtables_bytes_init(mm);
 	mm->map_count = 0;
 	mm->locked_vm = 0;
+	mm->has_pinned = 0;
 	atomic64_set(&mm->pinned_vm, 0);
 	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
 	spin_lock_init(&mm->page_table_lock);
diff --git a/mm/gup.c b/mm/gup.c
index e5739a1974d5..2d9019bf1773 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1255,6 +1255,9 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm,
 		BUG_ON(*locked != 1);
 	}
 
+	if (flags & FOLL_PIN)
+		WRITE_ONCE(mm->has_pinned, 1);
+
 	/*
 	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
 	 * is to set FOLL_GET if the caller wants pages[] filled in (but has
@@ -2660,6 +2663,9 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
 				       FOLL_FAST_ONLY)))
 		return -EINVAL;
 
+	if (gup_flags & FOLL_PIN)
+		WRITE_ONCE(current->mm->has_pinned, 1);
+
 	if (!(gup_flags & FOLL_FAST_ONLY))
 		might_lock_read(&current->mm->mmap_lock);

--
2.26.2


2020-09-21 21:46:53

by Jann Horn

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 21, 2020 at 11:17 PM Peter Xu <[email protected]> wrote:
>
> (Commit message collected from Jason Gunthorpe)
>
> Reduce the chance of false positive from page_maybe_dma_pinned() by keeping
> track if the mm_struct has ever been used with pin_user_pages(). mm_structs
> that have never been passed to pin_user_pages() cannot have a positive
> page_maybe_dma_pinned() by definition.

There are some caveats here, right? E.g. this isn't necessarily true
for pagecache pages, I think?

2020-09-21 23:50:31

by Peter Xu

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 21, 2020 at 11:43:38PM +0200, Jann Horn wrote:
> On Mon, Sep 21, 2020 at 11:17 PM Peter Xu <[email protected]> wrote:
> >
> > (Commit message collected from Jason Gunthorpe)
> >
> > Reduce the chance of false positive from page_maybe_dma_pinned() by keeping
> > track if the mm_struct has ever been used with pin_user_pages(). mm_structs
> > that have never been passed to pin_user_pages() cannot have a positive
> > page_maybe_dma_pinned() by definition.
>
> There are some caveats here, right? E.g. this isn't necessarily true
> for pagecache pages, I think?

Sorry I didn't follow here. Could you help explain with some details?

Thanks,

--
Peter Xu

2020-09-22 00:22:05

by Jann Horn

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 12:30 AM Peter Xu <[email protected]> wrote:
> On Mon, Sep 21, 2020 at 11:43:38PM +0200, Jann Horn wrote:
> > On Mon, Sep 21, 2020 at 11:17 PM Peter Xu <[email protected]> wrote:
> > > (Commit message collected from Jason Gunthorpe)
> > >
> > > Reduce the chance of false positive from page_maybe_dma_pinned() by keeping
> > > track if the mm_struct has ever been used with pin_user_pages(). mm_structs
> > > that have never been passed to pin_user_pages() cannot have a positive
> > > page_maybe_dma_pinned() by definition.
> >
> > There are some caveats here, right? E.g. this isn't necessarily true
> > for pagecache pages, I think?
>
> Sorry I didn't follow here. Could you help explain with some details?

The commit message says "mm_structs that have never been passed to
pin_user_pages() cannot have a positive page_maybe_dma_pinned() by
definition"; but that is not true for pages which may also be mapped
in a second mm and may have been passed to pin_user_pages() through
that second mm (meaning they must be writable over there and not
shared with us via CoW).

For example:

Process A:

fd_a = open("/foo/bar", O_RDWR);
mapping_a = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, fd_a, 0);
pin_user_pages(mapping_a, 1, ...);

Process B:

fd_b = open("/foo/bar", O_RDONLY);
mapping_b = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd_b, 0);
*(volatile char *)mapping_b;

At this point, process B has never called pin_user_pages(), but
page_maybe_dma_pinned() on the page at mapping_b would return true.


I don't think this is a problem for the use of page_maybe_dma_pinned()
in fork(), but I do think that the commit message is not entirely
correct.

2020-09-22 00:59:42

by John Hubbard

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On 9/21/20 2:17 PM, Peter Xu wrote:
> (Commit message collected from Jason Gunthorpe)
>
> Reduce the chance of false positive from page_maybe_dma_pinned() by keeping

Not yet, it doesn't. :) More:

> track if the mm_struct has ever been used with pin_user_pages(). mm_structs
> that have never been passed to pin_user_pages() cannot have a positive
> page_maybe_dma_pinned() by definition. This allows cases that might drive up
> the page ref_count to avoid any penalty from handling dma_pinned pages.
>
> Due to complexities with unpining this trivial version is a permanent sticky
> bit, future work will be needed to make this a counter.

How about this instead:

Subsequent patches intend to reduce the chance of false positives from
page_maybe_dma_pinned(), by also considering whether or not a page has
ever been part of an mm struct that has ever had pin_user_pages*()
applied to any of its pages.

In order to allow that, provide a boolean value (even though it's not
implemented exactly as a boolean type) within the mm struct, that is
simply set once and never cleared. This will suffice for an early, rough
implementation that fixes a few problems.

Future work is planned, to provide a more sophisticated solution, likely
involving a counter, and *not* involving something that is set and never
cleared.

>
> Suggested-by: Jason Gunthorpe <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/linux/mm_types.h | 10 ++++++++++
> kernel/fork.c | 1 +
> mm/gup.c | 6 ++++++
> 3 files changed, 17 insertions(+)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 496c3ff97cce..6f291f8b74c6 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -441,6 +441,16 @@ struct mm_struct {
> #endif
> int map_count; /* number of VMAs */
>
> + /**
> + * @has_pinned: Whether this mm has pinned any pages. This can
> + * be either replaced in the future by @pinned_vm when it
> + * becomes stable, or grow into a counter on its own. We're
> + * aggresive on this bit now - even if the pinned pages were
> + * unpinned later on, we'll still keep this bit set for the
> + * lifecycle of this mm just for simplicity.
> + */
> + int has_pinned;

I think this would be elegant as an atomic_t, and using atomic_set() and
atomic_read(), which seem even more self-documenting that what you have here.

But it's admittedly a cosmetic point, combined with my perennial fear that
I'm missing something when I look at a READ_ONCE()/WRITE_ONCE() pair. :)

It's completely OK to just ignore this comment, but I didn't want to completely
miss the opportunity to make it a tiny bit cleaner to the reader.
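
Roughly, an untested sketch of that alternative (reusing the names from
this patch; illustrative only, not a replacement patch):

	/* in struct mm_struct */
	atomic_t has_pinned;

	/* set side, e.g. in __get_user_pages_locked() */
	if (flags & FOLL_PIN)
		atomic_set(&mm->has_pinned, 1);

	/* read side, e.g. in a later fork()/copy consumer */
	if (atomic_read(&src_mm->has_pinned) && page_maybe_dma_pinned(page))
		/* ... */;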

thanks,
--
John Hubbard
NVIDIA

2020-09-22 00:59:45

by John Hubbard

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On 9/21/20 4:53 PM, John Hubbard wrote:
> On 9/21/20 2:17 PM, Peter Xu wrote:
>> (Commit message collected from Jason Gunthorpe)
>>
>> Reduce the chance of false positive from page_maybe_dma_pinned() by keeping
>
> Not yet, it doesn't. :)  More:
>
>> track if the mm_struct has ever been used with pin_user_pages(). mm_structs
>> that have never been passed to pin_user_pages() cannot have a positive
>> page_maybe_dma_pinned() by definition. This allows cases that might drive up
>> the page ref_count to avoid any penalty from handling dma_pinned pages.
>>
>> Due to complexities with unpining this trivial version is a permanent sticky
>> bit, future work will be needed to make this a counter.
>
> How about this instead:
>
> Subsequent patches intend to reduce the chance of false positives from
> page_maybe_dma_pinned(), by also considering whether or not a page has
> even been part of an mm struct that has ever had pin_user_pages*()


arggh, correction: please make that:

"...whether or not a page is part of an mm struct that...".

(Present tense.) Otherwise, people start wondering about the checkered past
of a page's past lives, and it badly distracts from the main point here. :)


thanks,
--
John Hubbard
NVIDIA

2020-09-22 11:55:56

by Jason Gunthorpe

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 12:47:11AM +0200, Jann Horn wrote:
> On Tue, Sep 22, 2020 at 12:30 AM Peter Xu <[email protected]> wrote:
> > On Mon, Sep 21, 2020 at 11:43:38PM +0200, Jann Horn wrote:
> > > On Mon, Sep 21, 2020 at 11:17 PM Peter Xu <[email protected]> wrote:
> > > > (Commit message collected from Jason Gunthorpe)
> > > >
> > > > Reduce the chance of false positive from page_maybe_dma_pinned() by keeping
> > > > track if the mm_struct has ever been used with pin_user_pages(). mm_structs
> > > > that have never been passed to pin_user_pages() cannot have a positive
> > > > page_maybe_dma_pinned() by definition.
> > >
> > > There are some caveats here, right? E.g. this isn't necessarily true
> > > for pagecache pages, I think?
> >
> > Sorry I didn't follow here. Could you help explain with some details?
>
> The commit message says "mm_structs that have never been passed to
> pin_user_pages() cannot have a positive page_maybe_dma_pinned() by
> definition"; but that is not true for pages which may also be mapped
> in a second mm and may have been passed to pin_user_pages() through
> that second mm (meaning they must be writable over there and not
> shared with us via CoW).

The message does need a few more words to explain this trick can only
be used with COW'able pages.

> Process A:
>
> fd_a = open("/foo/bar", O_RDWR);
> mapping_a = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, fd_a, 0);
> pin_user_pages(mapping_a, 1, ...);
>
> Process B:
>
> fd_b = open("/foo/bar", O_RDONLY);
> mapping_b = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd_b, 0);
> *(volatile char *)mapping_b;
>
> At this point, process B has never called pin_user_pages(), but
> page_maybe_dma_pinned() on the page at mapping_b would return true.

My expectation is the pin_user_pages() should have already broken the
COW for the MAP_PRIVATE, so process B should not have a
page_maybe_dma_pinned()

Jason

2020-09-22 14:31:51

by Peter Xu

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 08:54:36AM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 22, 2020 at 12:47:11AM +0200, Jann Horn wrote:
> > On Tue, Sep 22, 2020 at 12:30 AM Peter Xu <[email protected]> wrote:
> > > On Mon, Sep 21, 2020 at 11:43:38PM +0200, Jann Horn wrote:
> > > > On Mon, Sep 21, 2020 at 11:17 PM Peter Xu <[email protected]> wrote:
> > > > > (Commit message collected from Jason Gunthorpe)
> > > > >
> > > > > Reduce the chance of false positive from page_maybe_dma_pinned() by keeping
> > > > > track if the mm_struct has ever been used with pin_user_pages(). mm_structs
> > > > > that have never been passed to pin_user_pages() cannot have a positive
> > > > > page_maybe_dma_pinned() by definition.
> > > >
> > > > There are some caveats here, right? E.g. this isn't necessarily true
> > > > for pagecache pages, I think?
> > >
> > > Sorry I didn't follow here. Could you help explain with some details?
> >
> > The commit message says "mm_structs that have never been passed to
> > pin_user_pages() cannot have a positive page_maybe_dma_pinned() by
> > definition"; but that is not true for pages which may also be mapped
> > in a second mm and may have been passed to pin_user_pages() through
> > that second mm (meaning they must be writable over there and not
> > shared with us via CoW).
>
> The message does need a few more words to explain this trick can only
> be used with COW'able pages.
>
> > Process A:
> >
> > fd_a = open("/foo/bar", O_RDWR);
> > mapping_a = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, fd_a, 0);
> > pin_user_pages(mapping_a, 1, ...);
> >
> > Process B:
> >
> > fd_b = open("/foo/bar", O_RDONLY);
> > mapping_b = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd_b, 0);
> > *(volatile char *)mapping_b;
> >
> > At this point, process B has never called pin_user_pages(), but
> > page_maybe_dma_pinned() on the page at mapping_b would return true.
>
> My expectation is the pin_user_pages() should have already broken the
> COW for the MAP_PRIVATE, so process B should not have a
> page_maybe_dma_pinned()

When process B maps with PROT_READ only (w/o PROT_WRITE) then it seems the same
page will be mapped.

I think I get the point from Jann now. Maybe it's easier if I just remove the
whole "mm_structs that have never been passed to pin_user_pages() cannot have a
positive page_maybe_dma_pinned() by definition" sentence if it's misleading,
because the rest seems clear enough about what this new field is used for.

Thanks,

--
Peter Xu

2020-09-22 15:19:38

by Peter Xu

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 21, 2020 at 04:53:38PM -0700, John Hubbard wrote:
> On 9/21/20 2:17 PM, Peter Xu wrote:
> > (Commit message collected from Jason Gunthorpe)
> >
> > Reduce the chance of false positive from page_maybe_dma_pinned() by keeping
>
> Not yet, it doesn't. :) More:
>
> > track if the mm_struct has ever been used with pin_user_pages(). mm_structs
> > that have never been passed to pin_user_pages() cannot have a positive
> > page_maybe_dma_pinned() by definition. This allows cases that might drive up
> > the page ref_count to avoid any penalty from handling dma_pinned pages.
> >
> > Due to complexities with unpining this trivial version is a permanent sticky
> > bit, future work will be needed to make this a counter.
>
> How about this instead:
>
> Subsequent patches intend to reduce the chance of false positives from
> page_maybe_dma_pinned(), by also considering whether or not a page has
> even been part of an mm struct that has ever had pin_user_pages*()
> applied to any of its pages.
>
> In order to allow that, provide a boolean value (even though it's not
> implemented exactly as a boolean type) within the mm struct, that is
> simply set once and never cleared. This will suffice for an early, rough
> implementation that fixes a few problems.
>
> Future work is planned, to provide a more sophisticated solution, likely
> involving a counter, and *not* involving something that is set and never
> cleared.

This looks good, thanks. Though I think Jason's version is good too (as long
as we remove the confusing sentence, the one starting with "mm_structs that
have never been passed... "). Before I drop Jason's version, I'd better figure
out what major thing we missed so that maybe we can add another paragraph.
E.g., "future work will be needed to make this a counter" already means
"involving a counter, and *not* involving something that is set and never
cleared" to me... because otherwise it wouldn't be called a counter.

>
> >
> > Suggested-by: Jason Gunthorpe <[email protected]>
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > include/linux/mm_types.h | 10 ++++++++++
> > kernel/fork.c | 1 +
> > mm/gup.c | 6 ++++++
> > 3 files changed, 17 insertions(+)
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 496c3ff97cce..6f291f8b74c6 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -441,6 +441,16 @@ struct mm_struct {
> > #endif
> > int map_count; /* number of VMAs */
> > + /**
> > + * @has_pinned: Whether this mm has pinned any pages. This can
> > + * be either replaced in the future by @pinned_vm when it
> > + * becomes stable, or grow into a counter on its own. We're
> > + * aggresive on this bit now - even if the pinned pages were
> > + * unpinned later on, we'll still keep this bit set for the
> > + * lifecycle of this mm just for simplicity.
> > + */
> > + int has_pinned;
>
> I think this would be elegant as an atomic_t, and using atomic_set() and
> atomic_read(), which seem even more self-documenting that what you have here.
>
> But it's admittedly a cosmetic point, combined with my perennial fear that
> I'm missing something when I look at a READ_ONCE()/WRITE_ONCE() pair. :)

Yeah but I hope I'm using it right.. :) I used READ_ONCE/WRITE_ONCE explicitly
because I think they're cheaper than atomic operations, (which will, iiuc, lock
the bus).

>
> It's completely OK to just ignore this comment, but I didn't want to completely
> miss the opportunity to make it a tiny bit cleaner to the reader.

This can always become an atomic in the future, or am I wrong? Actually if
we're going to the counter way I feel like it's a must.

Thanks,

--
Peter Xu

2020-09-22 15:57:41

by Jason Gunthorpe

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 10:28:02AM -0400, Peter Xu wrote:
> On Tue, Sep 22, 2020 at 08:54:36AM -0300, Jason Gunthorpe wrote:
> > On Tue, Sep 22, 2020 at 12:47:11AM +0200, Jann Horn wrote:
> > > On Tue, Sep 22, 2020 at 12:30 AM Peter Xu <[email protected]> wrote:
> > > > On Mon, Sep 21, 2020 at 11:43:38PM +0200, Jann Horn wrote:
> > > > > On Mon, Sep 21, 2020 at 11:17 PM Peter Xu <[email protected]> wrote:
> > > > > > (Commit message collected from Jason Gunthorpe)
> > > > > >
> > > > > > Reduce the chance of false positive from page_maybe_dma_pinned() by keeping
> > > > > > track if the mm_struct has ever been used with pin_user_pages(). mm_structs
> > > > > > that have never been passed to pin_user_pages() cannot have a positive
> > > > > > page_maybe_dma_pinned() by definition.
> > > > >
> > > > > There are some caveats here, right? E.g. this isn't necessarily true
> > > > > for pagecache pages, I think?
> > > >
> > > > Sorry I didn't follow here. Could you help explain with some details?
> > >
> > > The commit message says "mm_structs that have never been passed to
> > > pin_user_pages() cannot have a positive page_maybe_dma_pinned() by
> > > definition"; but that is not true for pages which may also be mapped
> > > in a second mm and may have been passed to pin_user_pages() through
> > > that second mm (meaning they must be writable over there and not
> > > shared with us via CoW).
> >
> > The message does need a few more words to explain this trick can only
> > be used with COW'able pages.
> >
> > > Process A:
> > >
> > > fd_a = open("/foo/bar", O_RDWR);
> > > mapping_a = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, fd_a, 0);
> > > pin_user_pages(mapping_a, 1, ...);
> > >
> > > Process B:
> > >
> > > fd_b = open("/foo/bar", O_RDONLY);
> > > mapping_b = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd_b, 0);
> > > *(volatile char *)mapping_b;
> > >
> > > At this point, process B has never called pin_user_pages(), but
> > > page_maybe_dma_pinned() on the page at mapping_b would return true.
> >
> > My expectation is the pin_user_pages() should have already broken the
> > COW for the MAP_PRIVATE, so process B should not have a
> > page_maybe_dma_pinned()
>
> When process B maps with PROT_READ only (w/o PROT_WRITE) then it seems the same
> page will be mapped.

I thought MAP_PRIVATE without PROT_WRITE was nonsensical, it only has
meaning for writes initiated by the mapping. MAP_SHARED/PROT_READ is
the same behavior on Linux, IIRC.

But, yes, you certainly can end up with B having
page_maybe_dma_pinned() pages in shared VMA, just not in COW'able
mappings.
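
For instance (illustrative only, riffing on Jann's example), B sees such a
page simply by sharing it rather than COWing it:

	/* process B, mapping the same pagecache page that A pinned: */
	int fd_b = open("/foo/bar", O_RDONLY);
	char *mapping_b = mmap(NULL, 0x1000, PROT_READ, MAP_SHARED, fd_b, 0);
	*(volatile char *)mapping_b;	/* same page, so page_maybe_dma_pinned() is true */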

> I think I get the point from Jann now. Maybe it's easier I just remove the
> whole "mm_structs that have never been passed to pin_user_pages() cannot have a
> positive page_maybe_dma_pinned() by definition" sentence if that's misleading,
> because the rest seem to be clear enough on what this new field is used for.

"for COW" I think is still the important detail here, see for instance
my remark on the PUD/PMD splitting where it is necessary to test for
cow before using this.

Perhaps we should call it "has_pinned_for_cow" to place emphasis on
this detail? Due to the shared pages issue it really doesn't have any
broader utility, e.g. for file-backed pages or otherwise.

Jason

2020-09-22 16:14:22

by Jason Gunthorpe

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 11:17:36AM -0400, Peter Xu wrote:

> > But it's admittedly a cosmetic point, combined with my perennial fear that
> > I'm missing something when I look at a READ_ONCE()/WRITE_ONCE() pair. :)
>
> Yeah but I hope I'm using it right.. :) I used READ_ONCE/WRITE_ONCE explicitly
> because I think they're cheaper than atomic operations, (which will, iiuc, lock
> the bus).

It is worth thinking a bit about racing fork with
pin_user_pages(). The desired outcome is:

If fork wins the page is write protected, and pin_user_pages_fast()
will COW it.

If pin_user_pages_fast() wins then fork must see the READ_ONCE and
the pin.

As get_user_pages_fast() is lockless it looks like the ordering has to
be like this:

  pin_user_pages_fast()                  fork()
   atomic_set(has_pinned, 1);
   [..]
   atomic_add(page->_refcount)
   ordered check write protect()
                                          ordered set write protect()
                                          atomic_read(page->_refcount)
                                          atomic_read(has_pinned)

Such that in all the degenerate racy cases the outcome is that both
sides COW, never neither.

Thus I think it does have to be atomics purely from an ordering
perspective: observing an increased _refcount requires that has_pinned
!= 0 if we are pinning.

So, to make this 100% this ordering will need to be touched up too.

Jason

2020-09-22 16:33:38

by Linus Torvalds

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 8:56 AM Jason Gunthorpe <[email protected]> wrote:
>
> I thought MAP_PRIVATE without PROT_WRITE was nonsensical,

MAP_PRIVATE without PROT_WRITE is very common.

It's what happens for every executable mapping, for example.

And no, it's not the same as MAP_SHARED, for a couple of simple
reasons. It does end up being similar for all the *normal* cases, but
there are various cases where it isn't.

- mprotect() and friends. MAP_PRIVATE is fixed, but it might have
been writable in the past, and it might become writable in the future.

- breakpoints and ptrace. This will punch through even a non-writable
mapping and force a COW (since that's the whole point: executables are
not writable, but to do a SW breakpoint you have to write to the page)

So no, MAP_PRIVATE is not nonsensical without PROT_WRITE, and it's not
even remotely unusual.

Linus

2020-09-22 17:56:17

by Peter Xu

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 01:10:46PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 22, 2020 at 11:17:36AM -0400, Peter Xu wrote:
>
> > > But it's admittedly a cosmetic point, combined with my perennial fear that
> > > I'm missing something when I look at a READ_ONCE()/WRITE_ONCE() pair. :)
> >
> > Yeah but I hope I'm using it right.. :) I used READ_ONCE/WRITE_ONCE explicitly
> > because I think they're cheaper than atomic operations, (which will, iiuc, lock
> > the bus).
>
> It is worth thinking a bit about racing fork with
> pin_user_pages(). The desired outcome is:
>
> If fork wins the page is write protected, and pin_user_pages_fast()
> will COW it.
>
> If pin_user_pages_fast() wins then fork must see the READ_ONCE and
> the pin.
>
> As get_user_pages_fast() is lockless it looks like the ordering has to
> be like this:
>
> pin_user_pages_fast() fork()
> atomic_set(has_pinned, 1);
> [..]
> atomic_add(page->_refcount)
> ordered check write protect()
> ordered set write protect()
> atomic_read(page->_refcount)
> atomic_read(has_pinned)
>
> Such that in all the degenerate racy cases the outcome is that both
> sides COW, never neither.
>
> Thus I think it does have to be atomics purely from an ordering
> perspective, observing an increased _refcount requires that has_pinned
> != 0 if we are pinning.
>
> So, to make this 100% this ordering will need to be touched up too.

Thanks for spotting this. So something like below should work, right?

diff --git a/mm/memory.c b/mm/memory.c
index 8f3521be80ca..6591f3f33299 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -888,8 +888,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		 * Because we'll need to release the locks before doing cow,
 		 * pass this work to upper layer.
 		 */
-		if (READ_ONCE(src_mm->has_pinned) && wp &&
-		    page_maybe_dma_pinned(page)) {
+		if (wp && page_maybe_dma_pinned(page) &&
+		    READ_ONCE(src_mm->has_pinned)) {
 			/* We've got the page already; we're safe */
 			data->cow_old_page = page;
 			data->cow_oldpte = *src_pte;

I can also add some more comment to emphasize this.

I think the WRITE_ONCE/READ_ONCE can actually be kept, because atomic ops
should contain proper memory barriers already, so the memory access order
should be guaranteed (e.g., atomic_add() will have an implicit wmb(); rmb() for
the other side). However maybe it's even simpler to change has_pinned into an
atomic as John suggested. Thanks,

--
Peter Xu

2020-09-22 18:03:45

by John Hubbard

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On 9/22/20 8:17 AM, Peter Xu wrote:
> On Mon, Sep 21, 2020 at 04:53:38PM -0700, John Hubbard wrote:
>> On 9/21/20 2:17 PM, Peter Xu wrote:
>>> (Commit message collected from Jason Gunthorpe)
>>>
>>> Reduce the chance of false positive from page_maybe_dma_pinned() by keeping
>>
>> Not yet, it doesn't. :) More:
>>
>>> track if the mm_struct has ever been used with pin_user_pages(). mm_structs
>>> that have never been passed to pin_user_pages() cannot have a positive
>>> page_maybe_dma_pinned() by definition. This allows cases that might drive up
>>> the page ref_count to avoid any penalty from handling dma_pinned pages.
>>>
>>> Due to complexities with unpining this trivial version is a permanent sticky
>>> bit, future work will be needed to make this a counter.
>>
>> How about this instead:
>>
>> Subsequent patches intend to reduce the chance of false positives from
>> page_maybe_dma_pinned(), by also considering whether or not a page has
>> even been part of an mm struct that has ever had pin_user_pages*()
>> applied to any of its pages.
>>
>> In order to allow that, provide a boolean value (even though it's not
>> implemented exactly as a boolean type) within the mm struct, that is
>> simply set once and never cleared. This will suffice for an early, rough
>> implementation that fixes a few problems.
>>
>> Future work is planned, to provide a more sophisticated solution, likely
>> involving a counter, and *not* involving something that is set and never
>> cleared.
>
> This looks good, thanks. Though I think Jason's version is good too (as long
> as we remove the confusing sentence, that's the one starting with "mm_structs
> that have never been passed... "). Before I drop Jason's version, I think I'd
> better figure out what's the major thing we missed so that maybe we can add
> another paragraph. E.g., "future work will be needed to make this a counter"
> already means "involving a counter, and *not* involving something that is set
> and never cleared" to me... Because otherwise it won't be called a counter..
>

That's just a bit of harmless redundancy, intended to help clarify where this
is going. But if the redundancy isn't actually helping, you could simply
truncate it to the first half of the sentence, like this:

"Future work is planned, to provide a more sophisticated solution, likely
involving a counter."


thanks,
--
John Hubbard
NVIDIA

2020-09-22 18:17:24

by Peter Xu

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 11:02:03AM -0700, John Hubbard wrote:
> On 9/22/20 8:17 AM, Peter Xu wrote:
> > On Mon, Sep 21, 2020 at 04:53:38PM -0700, John Hubbard wrote:
> > > On 9/21/20 2:17 PM, Peter Xu wrote:
> > > > (Commit message collected from Jason Gunthorpe)
> > > >
> > > > Reduce the chance of false positive from page_maybe_dma_pinned() by keeping
> > >
> > > Not yet, it doesn't. :) More:
> > >
> > > > track if the mm_struct has ever been used with pin_user_pages(). mm_structs
> > > > that have never been passed to pin_user_pages() cannot have a positive
> > > > page_maybe_dma_pinned() by definition. This allows cases that might drive up
> > > > the page ref_count to avoid any penalty from handling dma_pinned pages.
> > > >
> > > > Due to complexities with unpining this trivial version is a permanent sticky
> > > > bit, future work will be needed to make this a counter.
> > >
> > > How about this instead:
> > >
> > > Subsequent patches intend to reduce the chance of false positives from
> > > page_maybe_dma_pinned(), by also considering whether or not a page has
> > > even been part of an mm struct that has ever had pin_user_pages*()
> > > applied to any of its pages.
> > >
> > > In order to allow that, provide a boolean value (even though it's not
> > > implemented exactly as a boolean type) within the mm struct, that is
> > > simply set once and never cleared. This will suffice for an early, rough
> > > implementation that fixes a few problems.
> > >
> > > Future work is planned, to provide a more sophisticated solution, likely
> > > involving a counter, and *not* involving something that is set and never
> > > cleared.
> >
> > This looks good, thanks. Though I think Jason's version is good too (as long
> > as we remove the confusing sentence, that's the one starting with "mm_structs
> > that have never been passed... "). Before I drop Jason's version, I think I'd
> > better figure out what's the major thing we missed so that maybe we can add
> > another paragraph. E.g., "future work will be needed to make this a counter"
> > already means "involving a counter, and *not* involving something that is set
> > and never cleared" to me... Because otherwise it won't be called a counter..
> >
>
> That's just a bit of harmless redundancy, intended to help clarify where this
> is going. But if the redundancy isn't actually helping, you could simply
> truncate it to the first half of the sentence, like this:
>
> "Future work is planned, to provide a more sophisticated solution, likely
> involving a counter."

Will do. Thanks.

--
Peter Xu

2020-09-22 19:12:44

by John Hubbard

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On 9/22/20 8:17 AM, Peter Xu wrote:
> On Mon, Sep 21, 2020 at 04:53:38PM -0700, John Hubbard wrote:
>> On 9/21/20 2:17 PM, Peter Xu wrote:
...
>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>> index 496c3ff97cce..6f291f8b74c6 100644
>>> --- a/include/linux/mm_types.h
>>> +++ b/include/linux/mm_types.h
>>> @@ -441,6 +441,16 @@ struct mm_struct {
>>> #endif
>>> int map_count; /* number of VMAs */
>>> + /**
>>> + * @has_pinned: Whether this mm has pinned any pages. This can
>>> + * be either replaced in the future by @pinned_vm when it
>>> + * becomes stable, or grow into a counter on its own. We're
>>> + * aggresive on this bit now - even if the pinned pages were
>>> + * unpinned later on, we'll still keep this bit set for the
>>> + * lifecycle of this mm just for simplicity.
>>> + */
>>> + int has_pinned;
>>
>> I think this would be elegant as an atomic_t, and using atomic_set() and
>> atomic_read(), which seem even more self-documenting that what you have here.
>>
>> But it's admittedly a cosmetic point, combined with my perennial fear that
>> I'm missing something when I look at a READ_ONCE()/WRITE_ONCE() pair. :)
>
> Yeah but I hope I'm using it right.. :) I used READ_ONCE/WRITE_ONCE explicitly
> because I think they're cheaper than atomic operations, (which will, iiuc, lock
> the bus).
>

The "cheaper" argument vanishes if the longer-term plan is to use atomic ops.
Given the intent of this patchset, simpler is better, and "simpler that has the
same performance as the longer term solution" is definitely OK.

>>
>> It's completely OK to just ignore this comment, but I didn't want to completely
>> miss the opportunity to make it a tiny bit cleaner to the reader.
>
> This can always become an atomic in the future, or am I wrong? Actually if
> we're going to the counter way I feel like it's a must.
>

Yes it can change. But you can get the simplicity and clarity now, rather
than waiting.

thanks,
--
John Hubbard
NVIDIA

2020-09-22 19:13:33

by Jason Gunthorpe

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 01:54:15PM -0400, Peter Xu wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index 8f3521be80ca..6591f3f33299 100644
> +++ b/mm/memory.c
> @@ -888,8 +888,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> * Because we'll need to release the locks before doing cow,
> * pass this work to upper layer.
> */
> - if (READ_ONCE(src_mm->has_pinned) && wp &&
> - page_maybe_dma_pinned(page)) {
> + if (wp && page_maybe_dma_pinned(page) &&
> + READ_ONCE(src_mm->has_pinned)) {
> /* We've got the page already; we're safe */
> data->cow_old_page = page;
> data->cow_oldpte = *src_pte;
>
> I can also add some more comment to emphasize this.

It is not just that, but the ptep_set_wrprotect() has to be done
earlier.

Otherwise it races like:

  pin_user_pages_fast()                  fork()
   atomic_set(has_pinned, 1);
   [..]
                                          atomic_read(page->_refcount) //false
                                          // skipped atomic_read(has_pinned)
   atomic_add(page->_refcount)
   ordered check write protect()
                                          ordered set write protect()

And now we have a write protect on a DMA pinned page, which breaks the
invariant we are trying to create.

The best algorithm I've thought of is something like:

  pte_map_lock()
  if (page) {
      if (wp) {
           ptep_set_wrprotect()
           /* Order with try_grab_compound_head(), either we see
            * page_maybe_dma_pinned(), or they see the wrprotect */
           get_page();

           if (page_maybe_dma_pinned() && READ_ONCE(src_mm->has_pinned)) {
                put_page();
                ptep_clear_wrprotect()

                // do copy
                return
           }
      } else {
           get_page();
      }
      page_dup_rmap()
  }
  pte_unmap_lock()

Then the do_wp_page() path would have to detect that the page is not
write protected under the pte lock inside the fault handler and just
do nothing. Ie the set/clear could be visible to the CPU and trigger a
spurious fault, but never trigger a COW.

Thus 'wp' becomes a 'lock' that prevents GUP from returning this page.

Very tricky, deserves a huge comment near the ptep_clear_wrprotect()

Consider the above algorithm beside the gup_fast() algorithm:

	if (!pte_access_permitted(pte, flags & FOLL_WRITE))
		goto pte_unmap;
	[..]
	head = try_grab_compound_head(page, 1, flags);
	if (!head)
		goto pte_unmap;
	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
		put_compound_head(head, 1, flags);
		goto pte_unmap;
	}

That last *ptep will check that the WP is not set after making
page_maybe_dma_pinned() true.

It still looks reasonable, the extra work is still just the additional
atomic in page_maybe_dma_pinned(), just everything else has to be very
carefully sequenced due to unlocked page table accessors.

> I think the WRITE_ONCE/READ_ONCE can actually be kept, because atomic ops
> should contain proper memory barriers already so the memory access orders
> should be guaranteed

I always have to carefully check ORDERING in
Documentation/atomic_t.txt when asking those questions..

It seems very subtle to me, but yes, try_grab_compound_head() and
page_maybe_dma_pinned() are already paired ordering barriers, so both
the pte_val() on the GUP side and the READ_ONCE(has_pinned) look OK.

Jason

2020-09-23 01:02:09

by Peter Xu

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 04:11:16PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 22, 2020 at 01:54:15PM -0400, Peter Xu wrote:
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8f3521be80ca..6591f3f33299 100644
> > +++ b/mm/memory.c
> > @@ -888,8 +888,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > * Because we'll need to release the locks before doing cow,
> > * pass this work to upper layer.
> > */
> > - if (READ_ONCE(src_mm->has_pinned) && wp &&
> > - page_maybe_dma_pinned(page)) {
> > + if (wp && page_maybe_dma_pinned(page) &&
> > + READ_ONCE(src_mm->has_pinned)) {
> > /* We've got the page already; we're safe */
> > data->cow_old_page = page;
> > data->cow_oldpte = *src_pte;
> >
> > I can also add some more comment to emphasize this.
>
> It is not just that, but the ptep_set_wrprotect() has to be done
> earlier.

Now I understand your point, I think.. So I guess it's not only about
has_pinned, but it should be a race between the fast-gup and the fork() code,
even if has_pinned is always set.

>
> Otherwise it races like:
>
> pin_user_pages_fast() fork()
> atomic_set(has_pinned, 1);
> [..]
> atomic_read(page->_refcount) //false
> // skipped atomic_read(has_pinned)
> atomic_add(page->_refcount)
> ordered check write protect()
> ordered set write protect()
>
> And now have a write protect on a DMA pinned page, which is the
> invarient we are trying to create.
>
> The best algorithm I've thought of is something like:
>
> pte_map_lock()
> if (page) {
> if (wp) {
> ptep_set_wrprotect()
> /* Order with try_grab_compound_head(), either we see
> * page_maybe_dma_pinned(), or they see the wrprotect */
> get_page();

Is this get_page() a must to be after ptep_set_wrprotect() explicitly? IIUC
what we need is to order ptep_set_wrprotect() and page_maybe_dma_pinned() here.
E.g., would a "mb()" work?

Another thing is, do we need similar thing for e.g. gup_pte_range(), so that
to guarantee ordering of try_grab_compound_head() and the pte change check?

>
> if (page_maybe_dma_pinned() && READ_ONCE(src_mm->has_pinned)) {
> put_page();
> ptep_clear_wrprotect()
>
> // do copy
> return
> }
> } else {
> get_page();
> }
> page_dup_rmap()
> pte_unmap_lock()
>
> Then the do_wp_page() path would have to detect that the page is not
> write protected under the pte lock inside the fault handler and just
> do nothing.

Yes, iiuc do_wp_page() should be able to handle spurious write page faults like
this already, as below:

	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
	spin_lock(vmf->ptl);
	...
	if (vmf->flags & FAULT_FLAG_WRITE) {
		if (!pte_write(entry))
			return do_wp_page(vmf);
		entry = pte_mkdirty(entry);
	}

So when spin_lock() returns:

- When it's a real cow (not pinned pages; we write-protected it and it keeps
write-protected), we should do cow here as usual.

- When it's a fake cow (pinned pages), the write bit should have been
recovered before the page table lock is released, and we'll skip do_wp_page()
and retry the page fault immediately.

> Ie the set/clear could be visible to the CPU and trigger a
> spurious fault, but never trigger a COW.
>
> Thus 'wp' becomes a 'lock' that prevents GUP from returning this page.

Another question is, how about read fast-gup for pinning? Because we can't use
the write-protect mechanism to block a read gup. I remember we've discussed
similar things and iirc your point is "pinned pages should always be with
WRITE". However now I still doubt it... Because I feel like read gup is still
legal (as I mentioned previously - when device purely writes to the page and
the processor only reads from it).

>
> Very tricky, deserves a huge comment near the ptep_clear_wrprotect()
>
> Consider the above algorithm beside the gup_fast() algorithm:
>
> if (!pte_access_permitted(pte, flags & FOLL_WRITE))
> goto pte_unmap;
> [..]
> head = try_grab_compound_head(page, 1, flags);
> if (!head)
> goto pte_unmap;
> if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> put_compound_head(head, 1, flags);
> goto pte_unmap;
>
> That last *ptep will check that the WP is not set after making
> page_maybe_dma_pinned() true.
>
> It still looks reasonable, the extra work is still just the additional
> atomic in page_maybe_dma_pinned(), just everything else has to be very
> carefully sequenced due to unlocked page table accessors.

Tricky! I'm still thinking about an easier way, but no real clue so far.
Hopefully we'll figure out something solid soon.

Thanks,

--
Peter Xu

2020-09-23 13:13:06

by Peter Xu

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 08:27:35PM -0400, Peter Xu wrote:
> On Tue, Sep 22, 2020 at 04:11:16PM -0300, Jason Gunthorpe wrote:
> > On Tue, Sep 22, 2020 at 01:54:15PM -0400, Peter Xu wrote:
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 8f3521be80ca..6591f3f33299 100644
> > > +++ b/mm/memory.c
> > > @@ -888,8 +888,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > * Because we'll need to release the locks before doing cow,
> > > * pass this work to upper layer.
> > > */
> > > - if (READ_ONCE(src_mm->has_pinned) && wp &&
> > > - page_maybe_dma_pinned(page)) {
> > > + if (wp && page_maybe_dma_pinned(page) &&
> > > + READ_ONCE(src_mm->has_pinned)) {
> > > /* We've got the page already; we're safe */
> > > data->cow_old_page = page;
> > > data->cow_oldpte = *src_pte;
> > >
> > > I can also add some more comment to emphasize this.
> >
> > It is not just that, but the ptep_set_wrprotect() has to be done
> > earlier.
>
> Now I understand your point, I think.. So I guess it's not only about
> has_pinned, but it should be a race between the fast-gup and the fork() code,
> even if has_pinned is always set.
>
> >
> > Otherwise it races like:
> >
> > pin_user_pages_fast() fork()
> > atomic_set(has_pinned, 1);
> > [..]
> > atomic_read(page->_refcount) //false
> > // skipped atomic_read(has_pinned)
> > atomic_add(page->_refcount)
> > ordered check write protect()
> > ordered set write protect()
> >
> > And now have a write protect on a DMA pinned page, which is the
> > invarient we are trying to create.
> >
> > The best algorithm I've thought of is something like:
> >
> > pte_map_lock()
> > if (page) {
> > if (wp) {
> > ptep_set_wrprotect()
> > /* Order with try_grab_compound_head(), either we see
> > * page_maybe_dma_pinned(), or they see the wrprotect */
> > get_page();
>
> Is this get_page() a must to be after ptep_set_wrprotect() explicitly? IIUC
> what we need is to order ptep_set_wrprotect() and page_maybe_dma_pinned() here.
> E.g., would a "mb()" work?
>
> Another thing is, do we need similar thing for e.g. gup_pte_range(), so that
> to guarantee ordering of try_grab_compound_head() and the pte change check?
>
> >
> > if (page_maybe_dma_pinned() && READ_ONCE(src_mm->has_pinned)) {
> > put_page();
> > ptep_clear_wrprotect()
> >
> > // do copy
> > return
> > }
> > } else {
> > get_page();
> > }
> > page_dup_rmap()
> > pte_unmap_lock()
> >
> > Then the do_wp_page() path would have to detect that the page is not
> > write protected under the pte lock inside the fault handler and just
> > do nothing.
>
> Yes, iiuc do_wp_page() should be able to handle spurious write page faults like
> this already, as below:
>
> vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> spin_lock(vmf->ptl);
> ...
> if (vmf->flags & FAULT_FLAG_WRITE) {
> if (!pte_write(entry))
> return do_wp_page(vmf);
> entry = pte_mkdirty(entry);
> }
>
> So when spin_lock() returns:
>
> - When it's a real cow (not pinned pages; we write-protected it and it keeps
> write-protected), we should do cow here as usual.
>
> - When it's a fake cow (pinned pages), the write bit should have been
> recovered before the page table lock released, and we'll skip do_wp_page()
> and retry the page fault immediately.
>
> > Ie the set/clear could be visible to the CPU and trigger a
> > spurious fault, but never trigger a COW.
> >
> > Thus 'wp' becomes a 'lock' that prevents GUP from returning this page.
>
> Another question is, how about read fast-gup for pinning? Because we can't use
> the write-protect mechanism to block a read gup. I remember we've discussed
> similar things and iirc your point is "pinned pages should always be with
> WRITE". However now I still doubt it... Because I feel like read gup is still
> legal (as I mentioned previously - when device purely writes to the page and
> the processor only reads from it).
>
> >
> > Very tricky, deserves a huge comment near the ptep_clear_wrprotect()
> >
> > Consider the above algorithm beside the gup_fast() algorithm:
> >
> > if (!pte_access_permitted(pte, flags & FOLL_WRITE))
> > goto pte_unmap;
> > [..]
> > head = try_grab_compound_head(page, 1, flags);
> > if (!head)
> > goto pte_unmap;
> > if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> > put_compound_head(head, 1, flags);
> > goto pte_unmap;
> >
> > That last *ptep will check that the WP is not set after making
> > page_maybe_dma_pinned() true.
> >
> > It still looks reasonable, the extra work is still just the additional
> > atomic in page_maybe_dma_pinned(), just everything else has to be very
> > carefully sequenced due to unlocked page table accessors.
>
> Tricky! I'm still thinking about some easier way but no much clue so far.
> Hopefully we'll figure out something solid soon.

Hmm, how about something like below? Would this be acceptable?

------8<--------
diff --git a/mm/gup.c b/mm/gup.c
index 2d9019bf1773..698bc2b520ac 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2136,6 +2136,18 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 	struct dev_pagemap *pgmap = NULL;
 	int nr_start = *nr, ret = 0;
 	pte_t *ptep, *ptem;
+	spinlock_t *ptl = NULL;
+
+	/*
+	 * More strict with FOLL_PIN, otherwise it could race with fork(). The
+	 * page table lock guarantees that fork() will capture all the pinned
+	 * pages when dup_mm() and do proper page copy on them.
+	 */
+	if (flags & FOLL_PIN) {
+		ptl = pte_lockptr(mm, pmd);
+		if (!spin_trylock(ptl))
+			return 0;
+	}
 
 	ptem = ptep = pte_offset_map(&pmd, addr);
 	do {
@@ -2200,6 +2212,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 	ret = 1;
 
 pte_unmap:
+	if (ptl)
+		spin_unlock(ptl);
 	if (pgmap)
 		put_dev_pagemap(pgmap);
 	pte_unmap(ptem);
------8<--------

Both of the solutions would fail some fast-gups that might have succeeded in
the past. The latter might even fail more often (because the pmd lock is
definitely coarser than a single pte wrprotect); however afaict it's still a
very, very corner case, as it's fast-gup+FOLL_PIN+lockfail (and, not to
mention, fast-gup should be allowed to fail anyway).

To confirm it can fail, I also checked that we have only one caller of
pin_user_pages_fast_only(), which is i915_gem_userptr_get_pages():

	if (mm == current->mm) {
		pvec = kvmalloc_array(num_pages, sizeof(struct page *),
				      GFP_KERNEL |
				      __GFP_NORETRY |
				      __GFP_NOWARN);
		if (pvec) {
			/* defer to worker if malloc fails */
			if (!i915_gem_object_is_readonly(obj))
				gup_flags |= FOLL_WRITE;
			pinned = pin_user_pages_fast_only(obj->userptr.ptr,
							  num_pages, gup_flags,
							  pvec);
		}
	}

So it looks like it can fall back to something slower too, even if it's purely
unlucky. So it looks safe so far for either solution above.

--
Peter Xu

2020-09-23 14:23:56

by Jan Kara

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Wed 23-09-20 09:10:43, Peter Xu wrote:
> On Tue, Sep 22, 2020 at 08:27:35PM -0400, Peter Xu wrote:
> > On Tue, Sep 22, 2020 at 04:11:16PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Sep 22, 2020 at 01:54:15PM -0400, Peter Xu wrote:
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index 8f3521be80ca..6591f3f33299 100644
> > > > +++ b/mm/memory.c
> > > > @@ -888,8 +888,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > > * Because we'll need to release the locks before doing cow,
> > > > * pass this work to upper layer.
> > > > */
> > > > - if (READ_ONCE(src_mm->has_pinned) && wp &&
> > > > - page_maybe_dma_pinned(page)) {
> > > > + if (wp && page_maybe_dma_pinned(page) &&
> > > > + READ_ONCE(src_mm->has_pinned)) {
> > > > /* We've got the page already; we're safe */
> > > > data->cow_old_page = page;
> > > > data->cow_oldpte = *src_pte;
> > > >
> > > > I can also add some more comment to emphasize this.
> > >
> > > It is not just that, but the ptep_set_wrprotect() has to be done
> > > earlier.
> >
> > Now I understand your point, I think.. So I guess it's not only about
> > has_pinned, but it should be a race between the fast-gup and the fork() code,
> > even if has_pinned is always set.
> >
> > >
> > > Otherwise it races like:
> > >
> > > pin_user_pages_fast() fork()
> > > atomic_set(has_pinned, 1);
> > > [..]
> > > atomic_read(page->_refcount) //false
> > > // skipped atomic_read(has_pinned)
> > > atomic_add(page->_refcount)
> > > ordered check write protect()
> > > ordered set write protect()
> > >
> > > And now have a write protect on a DMA pinned page, which is the
> > > invarient we are trying to create.
> > >
> > > The best algorithm I've thought of is something like:
> > >
> > > pte_map_lock()
> > > if (page) {
> > > if (wp) {
> > > ptep_set_wrprotect()
> > > /* Order with try_grab_compound_head(), either we see
> > > * page_maybe_dma_pinned(), or they see the wrprotect */
> > > get_page();
> >
> > Is this get_page() a must to be after ptep_set_wrprotect() explicitly? IIUC
> > what we need is to order ptep_set_wrprotect() and page_maybe_dma_pinned() here.
> > E.g., would a "mb()" work?
> >
> > Another thing is, do we need similar thing for e.g. gup_pte_range(), so that
> > to guarantee ordering of try_grab_compound_head() and the pte change check?
> >
> > >
> > > if (page_maybe_dma_pinned() && READ_ONCE(src_mm->has_pinned)) {
> > > put_page();
> > > ptep_clear_wrprotect()
> > >
> > > // do copy
> > > return
> > > }
> > > } else {
> > > get_page();
> > > }
> > > page_dup_rmap()
> > > pte_unmap_lock()
> > >
> > > Then the do_wp_page() path would have to detect that the page is not
> > > write protected under the pte lock inside the fault handler and just
> > > do nothing.
> >
> > Yes, iiuc do_wp_page() should be able to handle spurious write page faults like
> > this already, as below:
> >
> > vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> > spin_lock(vmf->ptl);
> > ...
> > if (vmf->flags & FAULT_FLAG_WRITE) {
> > if (!pte_write(entry))
> > return do_wp_page(vmf);
> > entry = pte_mkdirty(entry);
> > }
> >
> > So when spin_lock() returns:
> >
> > - When it's a real cow (not pinned pages; we write-protected it and it keeps
> > write-protected), we should do cow here as usual.
> >
> > - When it's a fake cow (pinned pages), the write bit should have been
> > recovered before the page table lock released, and we'll skip do_wp_page()
> > and retry the page fault immediately.
> >
> > > Ie the set/clear could be visible to the CPU and trigger a
> > > spurious fault, but never trigger a COW.
> > >
> > > Thus 'wp' becomes a 'lock' that prevents GUP from returning this page.
> >
> > Another question is, how about read fast-gup for pinning? Because we can't use
> > the write-protect mechanism to block a read gup. I remember we've discussed
> > similar things and iirc your point is "pinned pages should always be with
> > WRITE". However now I still doubt it... Because I feel like read gup is still
> > legal (as I mentioned previously - when device purely writes to the page and
> > the processor only reads from it).
> >
> > >
> > > Very tricky, deserves a huge comment near the ptep_clear_wrprotect()
> > >
> > > Consider the above algorithm beside the gup_fast() algorithm:
> > >
> > > if (!pte_access_permitted(pte, flags & FOLL_WRITE))
> > > goto pte_unmap;
> > > [..]
> > > head = try_grab_compound_head(page, 1, flags);
> > > if (!head)
> > > goto pte_unmap;
> > > if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> > > put_compound_head(head, 1, flags);
> > > goto pte_unmap;
> > >
> > > That last *ptep will check that the WP is not set after making
> > > page_maybe_dma_pinned() true.
> > >
> > > It still looks reasonable, the extra work is still just the additional
> > > atomic in page_maybe_dma_pinned(), just everything else has to be very
> > > carefully sequenced due to unlocked page table accessors.
> >
> > Tricky! I'm still thinking about some easier way but no much clue so far.
> > Hopefully we'll figure out something solid soon.
>
> Hmm, how about something like below? Would this be acceptable?
>
> ------8<--------
> diff --git a/mm/gup.c b/mm/gup.c
> index 2d9019bf1773..698bc2b520ac 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2136,6 +2136,18 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> struct dev_pagemap *pgmap = NULL;
> int nr_start = *nr, ret = 0;
> pte_t *ptep, *ptem;
> + spinlock_t *ptl = NULL;
> +
> + /*
> + * More strict with FOLL_PIN, otherwise it could race with fork(). The
> + * page table lock guarantees that fork() will capture all the pinned
> + * pages when dup_mm() and do proper page copy on them.
> + */
> + if (flags & FOLL_PIN) {
> + ptl = pte_lockptr(mm, pmd);
> + if (!spin_trylock(ptl))
> + return 0;
> + }

I'd hate to take a spinlock in the GUP-fast path. Also I don't think this is
quite correct because GUP-fast-only can be called from interrupt context
and page table locks are not interrupt safe. That being said, I don't see
what's wrong with the solution Jason proposed of first setting the writeprotect
and then checking page_maybe_dma_pinned() during fork(). That should work
just fine AFAICT... BTW note that the GUP-fast code (and this is deliberate
because e.g. DAX depends on it) first updates page->_refcount and then
rechecks that the PTE didn't change, and the page->_refcount update is actually
done using atomic_add_unless() so that it cannot be reordered wrt the PTE
check. So the fork() code only needs to add barriers to pair with this.
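
E.g. something like the below on the fork() side (just a sketch of what I
mean, not a tested patch; field and helper names as in this series):

	/* copy_one_pte(): write protect first, then sample the pin state */
	ptep_set_wrprotect(src_mm, addr, src_pte);
	smp_mb();	/* pairs with the refcount bump + PTE recheck in gup_pte_range() */
	if (READ_ONCE(src_mm->has_pinned) && page_maybe_dma_pinned(page)) {
		/* page may be DMA-pinned: copy it instead of sharing it */
	}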

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-09-23 17:09:30

by Jason Gunthorpe

Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Tue, Sep 22, 2020 at 08:27:35PM -0400, Peter Xu wrote:
> On Tue, Sep 22, 2020 at 04:11:16PM -0300, Jason Gunthorpe wrote:
> > On Tue, Sep 22, 2020 at 01:54:15PM -0400, Peter Xu wrote:
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 8f3521be80ca..6591f3f33299 100644
> > > +++ b/mm/memory.c
> > > @@ -888,8 +888,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > * Because we'll need to release the locks before doing cow,
> > > * pass this work to upper layer.
> > > */
> > > - if (READ_ONCE(src_mm->has_pinned) && wp &&
> > > - page_maybe_dma_pinned(page)) {
> > > + if (wp && page_maybe_dma_pinned(page) &&
> > > + READ_ONCE(src_mm->has_pinned)) {
> > > /* We've got the page already; we're safe */
> > > data->cow_old_page = page;
> > > data->cow_oldpte = *src_pte;
> > >
> > > I can also add some more comment to emphasize this.
> >
> > It is not just that, but the ptep_set_wrprotect() has to be done
> > earlier.
>
> Now I understand your point, I think.. So I guess it's not only about
> has_pinned, but it should be a race between the fast-gup and the fork() code,
> even if has_pinned is always set.

Yes

> > The best algorithm I've thought of is something like:
> >
> > pte_map_lock()
> > if (page) {
> > if (wp) {
> > ptep_set_wrprotect()
> > /* Order with try_grab_compound_head(), either we see
> > * page_maybe_dma_pinned(), or they see the wrprotect */
> > get_page();
>
> Is this get_page() a must to be after ptep_set_wrprotect()
> explicitly?

No, just before page_maybe_dma_pinned()

> IIUC what we need is to order ptep_set_wrprotect() and
> page_maybe_dma_pinned() here. E.g., would a "mb()" work?

mb() is not needed because page_maybe_dma_pinned() has an atomic
barrier too. I like to see get_page() followed immediately by
page_maybe_dma_pinned() since they are accessing the same atomic and
could be fused together someday

> Another thing is, do we need similar thing for e.g. gup_pte_range(), so that
> to guarantee ordering of try_grab_compound_head() and the pte change
> check?

gup_pte_range() is as I quoted? The gup slow path ends up in
follow_page_pte() which uses the pte lock so is OK.
>
> Another question is, how about read fast-gup for pinning? Because we can't use
> the write-protect mechanism to block a read gup. I remember we've discussed
> similar things and iirc your point is "pinned pages should always be with
> WRITE". However now I still doubt it... Because I feel like read gup is still
> legal (as I mentioned previously - when device purely writes to the page and
> the processor only reads from it).

We need a definition for what FOLL_PIN means. After this work on fork
I propose that FOLL_PIN means:

The page is in-use for DMA and the CPU PTE should not be changed
without explicit involvement of the application (eg via mmap/munmap)

If GUP encounters a read-only page during FOLL_PIN the behavior should
depend on what the fault handler would do. If the fault handler would
trigger COW and replace the PTE then it violates the above. GUP should
do the COW before pinning.

If the fault handler would SIGSEGV then GUP can keep the read-only
page and allow !FOLL_WRITE access. The PTE should not be replaced for
other reasons (though I think there is work there too).

For COW related issues the idea is the mm_struct doing the pin will
never trigger a COW. When other processes hit the COW they copy the
page into their mm and don't touch the source MM's PTE.

Today we do this roughly with FOLL_FORCE and FOLL_WRITE in the users,
but a more nuanced version and documentation would be much clearer.

Unfortunately just doing simple read GUP potentially exposes things to
various COW related data corruption races.

This is a discussion beyond this series though..

Jason

2020-09-23 17:13:33

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Wed, Sep 23, 2020 at 04:20:03PM +0200, Jan Kara wrote:

> I'd hate to take spinlock in the GUP-fast path. Also I don't think this is
> quite correct because GUP-fast-only can be called from interrupt context
> and page table locks are not interrupt safe.

Yes, IIRC, that is a key element of GUP-fast. Was it something to do
with futexes?

> and then checking page_maybe_dma_pinned() during fork(). That should work
> just fine AFAICT... BTW note that GUP-fast code is (and this is deliberate
> because e.g. DAX depends on this) first updating page->_refcount and then
> rechecking that the PTE didn't change, and the page->_refcount update is actually
> done using atomic_add_unless() so that it cannot be reordered wrt the PTE
> check. So the fork() code only needs to add barriers to pair with this.

It is not just DAX, everything needs this check.

After the page is pinned it is prevented from being freed and
recycled. After GUP has the pin it must check that the PTE still
points at the same page, otherwise it might have pinned a page that is
already freed - and that would be a use-after-free issue.

Jason

2020-09-24 13:42:54

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Wed 23-09-20 14:12:07, Jason Gunthorpe wrote:
> On Wed, Sep 23, 2020 at 04:20:03PM +0200, Jan Kara wrote:
>
> > I'd hate to take spinlock in the GUP-fast path. Also I don't think this is
> > quite correct because GUP-fast-only can be called from interrupt context
> > and page table locks are not interrupt safe.
>
> Yes, IIRC, that is a key element of GUP-fast. Was it something to do
> with futexes?

Honestly, I'm not sure.

> > and then checking page_maybe_dma_pinned() during fork(). That should work
> > just fine AFAICT... BTW note that GUP-fast code is (and this is deliberate
> > because e.g. DAX depends on this) first updating page->_refcount and then
> > rechecking that the PTE didn't change, and the page->_refcount update is actually
> > done using atomic_add_unless() so that it cannot be reordered wrt the PTE
> > check. So the fork() code only needs to add barriers to pair with this.
>
> It is not just DAX, everything needs this check.
>
> After the page is pinned it is prevented from being freed and
> recycled. After GUP has the pin it must check that the PTE still
> points at the same page, otherwise it might have pinned a page that is
> already freed - and that would be a use-after-free issue.

I don't think a page use-after-free is really the reason - we add page
reference through page_ref_add_unless(page, x, 0) - i.e., it will fail for
already freed page. It's more about being able to make sure the page is not
accessible anymore - and for that, modifying the pte and then checking the page
refcount is a *reliable* way to synchronize with GUP-fast...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-09-24 14:04:14

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Thu, Sep 24, 2020 at 09:44:09AM +0200, Jan Kara wrote:
> > After the page is pinned it is prevented from being freed and
> > recycled. After GUP has the pin it must check that the PTE still
> > points at the same page, otherwise it might have pinned a page that is
> > already freed - and that would be a use-after-free issue.
>
> I don't think a page use-after-free is really the reason - we add page
> reference through page_ref_add_unless(page, x, 0) - i.e., it will fail for
> already freed page.

I mean, the page could have been freed and already reallocated with a
positive refcount, so the add_unless check isn't protective.

The add_unless prevents the page from being freed. The 2nd pte read
ensures it wasn't already freed/reassigned before the pin.

If something drives the page refcount to zero then it is already
synchronized with GUP fast because of the atomic add_unless, no need
to re-check the pte for that case?? But I don't know what the DAX case
is you mentioned.

Jason

2020-09-24 14:38:11

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Wed, Sep 23, 2020 at 02:07:59PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 22, 2020 at 08:27:35PM -0400, Peter Xu wrote:
> > On Tue, Sep 22, 2020 at 04:11:16PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Sep 22, 2020 at 01:54:15PM -0400, Peter Xu wrote:
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index 8f3521be80ca..6591f3f33299 100644
> > > > +++ b/mm/memory.c
> > > > @@ -888,8 +888,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > > * Because we'll need to release the locks before doing cow,
> > > > * pass this work to upper layer.
> > > > */
> > > > - if (READ_ONCE(src_mm->has_pinned) && wp &&
> > > > - page_maybe_dma_pinned(page)) {
> > > > + if (wp && page_maybe_dma_pinned(page) &&
> > > > + READ_ONCE(src_mm->has_pinned)) {
> > > > /* We've got the page already; we're safe */
> > > > data->cow_old_page = page;
> > > > data->cow_oldpte = *src_pte;
> > > >
> > > > I can also add some more comment to emphasize this.
> > >
> > > It is not just that, but the ptep_set_wrprotect() has to be done
> > > earlier.
> >
> > Now I understand your point, I think.. So I guess it's not only about
> > has_pinned, but it should be a race between the fast-gup and the fork() code,
> > even if has_pinned is always set.
>
> Yes
>
> > > The best algorithm I've thought of is something like:
> > >
> > > pte_map_lock()
> > > if (page) {
> > > if (wp) {
> > > ptep_set_wrprotect()
> > > /* Order with try_grab_compound_head(), either we see
> > > * page_maybe_dma_pinned(), or they see the wrprotect */
> > > get_page();
> >
> > Is this get_page() a must to be after ptep_set_wrprotect()
> > explicitly?
>
> No, just before page_maybe_dma_pinned()
>
> > IIUC what we need is to order ptep_set_wrprotect() and
> > page_maybe_dma_pinned() here. E.g., would a "mb()" work?
>
> mb() is not needed because page_maybe_dma_pinned() has an atomic
> barrier too. I like to see get_page() followed immediately by
> page_maybe_dma_pinned() since they are accessing the same atomic and
> could be fused together someday

If so, I'd hope you won't disagree that I still move the get_page() out of the
"if (wp)". Not only it's a shared operation no matter whether "if (wp)" or
not, but I'm afraid it would confuse future readers on a special ordering on
the get_page() and the wrprotect(), especially with the comment above.
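
Concretely, I mean something like below (illustrative fragment only, not the
actual patch):

        if (is_cow_mapping(vm_flags) && pte_write(pte)) {
                ptep_set_wrprotect(src_mm, addr, src_pte);
                pte = pte_wrprotect(pte);
        }
        get_page(page);         /* shared by both the common and the cow path */
        /*
         * Keep the check below after the wrprotect above: either fast-gup
         * sees the write-protected pte and bails out, or we see its pin
         * here.  (page_maybe_dma_pinned() touches the same atomic as
         * get_page() and could be fused with it some day.)
         */
        if (wp && page_maybe_dma_pinned(page) && READ_ONCE(src_mm->has_pinned)) {
                /* ... pass the cow work to the upper layer ... */
        }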

>
> > Another thing is, do we need similar thing for e.g. gup_pte_range(), so that
> > to guarantee ordering of try_grab_compound_head() and the pte change
> > check?
>
> gup_pte_range() is as I quoted? The gup slow path ends up in
> follow_page_pte() which uses the pte lock so is OK.
> >
> > Another question is, how about read fast-gup for pinning? Because we can't use
> > the write-protect mechanism to block a read gup. I remember we've discussed
> > similar things and iirc your point is "pinned pages should always be with
> > WRITE". However now I still doubt it... Because I feel like read gup is still
> > legal (as I mentioned previously - when device purely writes to the page and
> > the processor only reads from it).
>
> We need a definition for what FOLL_PIN means. After this work on fork
> I propose that FOLL_PIN means:
>
> The page is in-use for DMA and the CPU PTE should not be changed
> without explicit involvement of the application (eg via mmap/munmap)
>
> If GUP encounters a read-only page during FOLL_PIN the behavior should
> depend on what the fault handler would do. If the fault handler would
> trigger COW and replace the PTE then it violates the above. GUP should
> do the COW before pinning.
>
> If the fault handler would SIGSEGV then GUP can keep the read-only
> page and allow !FOLL_WRITE access. The PTE should not be replaced for
> other reasons (though I think there is work there too).
>
> For COW related issues the idea is the mm_struct doing the pin will
> never trigger a COW. When other processes hit the COW they copy the
> page into their mm and don't touch the source MM's PTE.
>
> Today we do this roughly with FOLL_FORCE and FOLL_WRITE in the users,
> but a more nuanced version and documentation would be much clearer.
>
> Unfortunately just doing simple read GUP potentially exposes things to
> various COW related data corruption races.
>
> This is a discussion beyond this series though..

Yes. It's kind of related here on whether we can still use wrprotect() to
guard against fast-gup, though. So my understanding is that we should still at
least need the other patch [1] that I proposed in the other thread to force
break-cow for read-only gups (that patch is not only for fast-gup, of course).

But I agree that should be another bigger topic. I hope we don't need to pick
that patch up someday by another dma report on read-only pinned pages...

Regarding the solution here, I think we can also cover read-only fast-gup too
in the future - IIUC what we need to do is to make it pte_protnone() instead of
pte_wrprotect(), then in the fault handler we should identify this special
pte_protnone() against numa balancing (change_prot_numa()). I think it should
work fine too, iiuc, because I don't think we should migrate a page at all if
it's pinned for any reason...

So I think I'll focus on the wrprotect() solution for now. Thanks!

[1] https://lore.kernel.org/lkml/20200915151746.GB2949@xz-x1/

--
Peter Xu

2020-09-24 14:47:55

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Thu 24-09-20 11:02:37, Jason Gunthorpe wrote:
> On Thu, Sep 24, 2020 at 09:44:09AM +0200, Jan Kara wrote:
> > > After the page is pinned it is prevented from being freed and
> > > recycled. After GUP has the pin it must check that the PTE still
> > > points at the same page, otherwise it might have pinned a page that is
> > > already freed - and that would be a use-after-free issue.
> >
> > I don't think a page use-after-free is really the reason - we add page
> > reference through page_ref_add_unless(page, x, 0) - i.e., it will fail for
> > already freed page.
>
> I mean, the page could have been freed and already reallocated with a
> positive refcount, so the add_unless check isn't protective.
>
> The add_unless prevents the page from being freed. The 2nd pte read
> ensures it wasn't already freed/reassigned before the pin.

Ah, right!

> If something drives the page refcount to zero then it is already
> synchronized with GUP fast because of the atomic add_unless, no need
> to re-check the pte for that case?? But I don't know what the DAX case
> is you mentioned.

DAX needs to make sure no new references (including GUP-fast) can be created
for a page before truncating page from a file and freeing it.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-09-24 16:55:38

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Thu, Sep 24, 2020 at 10:35:17AM -0400, Peter Xu wrote:

> If so, I'd hope you won't disagree that I still move the get_page() out of the
> "if (wp)". Not only it's a shared operation no matter whether "if (wp)" or
> not, but I'm afraid it would confuse future readers on a special ordering on
> the get_page() and the wrprotect(), especially with the comment above.

Sure, you could add a comment before the page_maybe_dma_pinned that it
could be fused with get_page()

> Yes. It's kind of related here on whether we can still use wrprotect() to
> guard against fast-gup, though. So my understanding is that we should still at
> least need the other patch [1] that I proposed in the other thread to force
> break-cow for read-only gups (that patch is not only for fast-gup, of course).

Probably, I haven't intensively studied that patch, and it should go
along with edits to some of the callers..

> But I agree that should be another bigger topic. I hope we don't need to pick
> that patch up someday by another dma report on read-only pinned pages...

In RDMA we found long ago that read only pins don't work well, I think
most other places are likely the same - the problems are easy enough
to hit. Something like your COW break patch on read is really needed
to allow read-only GUP.

> Regarding the solution here, I think we can also cover read-only fast-gup too
> in the future - IIUC what we need to do is to make it pte_protnone() instead of
> pte_wrprotect(), then in the fault handler we should identify this special
> pte_protnone() against numa balancing (change_prot_numa()). I think it should
> work fine too, iiuc, because I don't think we should migrate a page at all if
> it's pinned for any reason...

With your COW breaking patch the read only fast-gup should break the
COW because of the write protect, just like for the write side. Not
seeing why we need to do something more?

Jason

2020-09-24 17:57:39

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Thu, Sep 24, 2020 at 01:51:52PM -0300, Jason Gunthorpe wrote:
> > Regarding the solution here, I think we can also cover read-only fast-gup too
> > in the future - IIUC what we need to do is to make it pte_protnone() instead of
> > pte_wrprotect(), then in the fault handler we should identify this special
> > pte_protnone() against numa balancing (change_prot_numa()). I think it should
> > work fine too, iiuc, because I don't think we should migrate a page at all if
> > it's pinned for any reason...

[1]

>
> With your COW breaking patch the read only fast-gup should break the
> COW because of the write protect, just like for the write side. Not
> seeing why we need to do something more?

Consider this sequence, in which a parent process manages to fork() a child:

buf = malloc();
// RDONLY gup
pin_user_pages(buf, !WRITE);
// pte of buf duplicated on both sides
fork();
mprotect(buf, WRITE);
*buf = 1;
// buf page replaced as cow triggered

Currently when fork() we'll happily share a pinned read-only page with the
child by copying the pte directly. However it also means that, starting from
this point, the child starts to share this pinned page with the parent. Then
if we can somehow trigger a "page unshare"/"cow", a problem could occur.

In this case I'm using cow (triggered by another mprotect()). However I'm not
sure whether this is the only way to replace the pinned page for the parent.

As a summary: imho the important thing is we should not allow any kind of
sharing of any dma page, even if it's pinned for read.

If my above understanding is correct, [1] above may provide a solution for us
(in the future) when we want to block read-only fast-gup in this patch too, just
like how we do that using wrprotect().

--
Peter Xu

2020-09-24 18:16:43

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Thu, Sep 24, 2020 at 01:55:31PM -0400, Peter Xu wrote:
> On Thu, Sep 24, 2020 at 01:51:52PM -0300, Jason Gunthorpe wrote:
> > > Regarding the solution here, I think we can also cover read-only fast-gup too
> > > in the future - IIUC what we need to do is to make it pte_protnone() instead of
> > > pte_wrprotect(), then in the fault handler we should identify this special
> > > pte_protnone() against numa balancing (change_prot_numa()). I think it should
> > > work fine too, iiuc, because I don't think we should migrate a page at all if
> > > it's pinned for any reason...
>
> [1]
>
> >
> > With your COW breaking patch the read only fast-gup should break the
> > COW because of the write protect, just like for the write side. Not
> > seeing why we need to do something more?
>
> Consider this sequence, in which a parent process manages to fork() a child:
>
> buf = malloc();
> // RDONLY gup
> pin_user_pages(buf, !WRITE);
> // pte of buf duplicated on both sides
> fork();
> mprotect(buf, WRITE);
> *buf = 1;
> // buf page replaced as cow triggered
>
> Currently when fork() we'll happily share a pinned read-only page with the
> child by copying the pte directly.

Why? This series prevents that, the page will be maybe_dma_pinned, so
fork() will copy it.

> As a summary: imho the important thing is we should not allow any kind of
> sharing of any dma page, even if it's pinned for read.

Any sharing that results in COW. MAP_SHARED is fine, for instance

My feeling is that for READ, when FOLL_PIN is used, GUP_fast will go to the
slow path any time it sees a read-only page.

The slow path will determine if it is read-only because it could be
COW'd or read-only for some other reason
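
Roughly (sketch only):

        /* in gup_pte_range(): */
        if ((flags & FOLL_PIN) && !pte_write(pte))
                goto pte_unmap;         /* let the slow path break COW or fail cleanly */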

Jason

2020-09-24 18:35:59

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Thu, Sep 24, 2020 at 03:15:01PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 24, 2020 at 01:55:31PM -0400, Peter Xu wrote:
> > On Thu, Sep 24, 2020 at 01:51:52PM -0300, Jason Gunthorpe wrote:
> > > > Regarding the solution here, I think we can also cover read-only fast-gup too
> > > > in the future - IIUC what we need to do is to make it pte_protnone() instead of
> > > > pte_wrprotect(), then in the fault handler we should identify this special
> > > > pte_protnone() against numa balancing (change_prot_numa()). I think it should
> > > > work fine too, iiuc, because I don't think we should migrate a page at all if
> > > > it's pinned for any reason...
> >
> > [1]
> >
> > >
> > > With your COW breaking patch the read only fast-gup should break the
> > > COW because of the write protect, just like for the write side. Not
> > > seeing why we need to do something more?
> >
> > Consider this sequence, in which a parent process manages to fork() a child:
> >
> > buf = malloc();

Sorry! I think I missed something like:

mprotect(buf, !WRITE);

Here.

> > // RDONLY gup
> > pin_user_pages(buf, !WRITE);
> > // pte of buf duplicated on both sides
> > fork();
> > mprotect(buf, WRITE);
> > *buf = 1;
> > // buf page replaced as cow triggered
> >
> > Currently when fork() we'll happily share a pinned read-only page with the
> > child by copying the pte directly.
>
> Why? This series prevents that, the page will be maybe_dma_pinned, so
> fork() will copy it.

With the extra mprotect(!WRITE), I think we'll see a !pte_write() entry. Then
it'll not go into maybe_dma_pinned() at all since cow==false.
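
To make it concrete, the whole sequence as a (contrived) userspace program -
the pin itself is only a placeholder comment here, standing in for whatever
driver path ends up calling pin_user_pages() on the buffer:

#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        size_t len = 4096;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        pid_t pid;

        if (buf == MAP_FAILED)
                return 1;
        memset(buf, 0, len);                    /* fault the anon page in */

        mprotect(buf, len, PROT_READ);          /* pte is now !pte_write() */

        /* driver pins here: pin_user_pages(buf, ..., !FOLL_WRITE) */

        pid = fork();                           /* cow==false, so the pte (and the
                                                   pinned page) is simply shared */
        if (pid == 0) {
                sleep(1);
                _exit(0);
        }

        mprotect(buf, len, PROT_READ | PROT_WRITE);
        buf[0] = 1;                             /* wp fault copies the page: the parent
                                                   now maps a fresh page while DMA keeps
                                                   using the old, still-pinned one */
        waitpid(pid, NULL, 0);
        return 0;
}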

>
> > As a summary: imho the important thing is we should not allow any kind of
> > sharing of any dma page, even if it's pinned for read.
>
> Any sharing that results in COW. MAP_SHARED is fine, for instance

Oh right, MAP_SHARED is definitely special.

Thanks,

--
Peter Xu

2020-09-24 18:41:55

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Thu, Sep 24, 2020 at 02:34:18PM -0400, Peter Xu wrote:

> > > // RDONLY gup
> > > pin_user_pages(buf, !WRITE);
> > > // pte of buf duplicated on both sides
> > > fork();
> > > mprotect(buf, WRITE);
> > > *buf = 1;
> > > // buf page replaced as cow triggered
> > >
> > > Currently when fork() we'll happily share a pinned read-only page with the
> > > child by copying the pte directly.
> >
> > Why? This series prevents that, the page will be maybe_dma_pinned, so
> > fork() will copy it.
>
> With the extra mprotect(!WRITE), I think we'll see a !pte_write() entry. Then
> it'll not go into maybe_dma_pinned() at all since cow==false.

Hum that seems like a problem in this patch, we still need to do the
DMA pinned logic even if the pte is already write protected.

Jason

2020-09-24 21:31:41

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Thu, Sep 24, 2020 at 03:39:53PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 24, 2020 at 02:34:18PM -0400, Peter Xu wrote:
>
> > > > // RDONLY gup
> > > > pin_user_pages(buf, !WRITE);
> > > > // pte of buf duplicated on both sides
> > > > fork();
> > > > mprotect(buf, WRITE);
> > > > *buf = 1;
> > > > // buf page replaced as cow triggered
> > > >
> > > > Currently when fork() we'll happily share a pinned read-only page with the
> > > > child by copying the pte directly.
> > >
> > > Why? This series prevents that, the page will be maybe_dma_pinned, so
> > > fork() will copy it.
> >
> > With the extra mprotect(!WRITE), I think we'll see a !pte_write() entry. Then
> > it'll not go into maybe_dma_pinned() at all since cow==false.
>
> Hum that seems like a problem in this patch, we still need to do the
> DMA pinned logic even if the pte is already write protected.

Yes I agree. I'll take care of that in the next version too.

--
Peter Xu

2020-09-25 20:35:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Thu, Sep 24, 2020 at 2:30 PM Peter Xu <[email protected]> wrote:
>
> > >
> > > With the extra mprotect(!WRITE), I think we'll see a !pte_write() entry. Then
> > > it'll not go into maybe_dma_pinned() at all since cow==false.
> >
> > Hum that seems like a problem in this patch, we still need to do the
> > DMA pinned logic even if the pte is already write protected.
>
> Yes I agree. I'll take care of that in the next version too.

You people seem to be worrying too much about crazy use cases.

The fact is, if people do pinning, they had better be careful
afterwards. I agree that marking things MADV_DONTFORK may not be
great, and there may be apps that do it. But honestly, if people then
do mprotect() to make a VM non-writable after pinning a page for
writing (and before the IO has completed), such an app only has itself
to blame.

So I don't think this issue is even worth worrying about. At some
point, when apps do broken things, the kernel says "you broke it, you
get to keep both pieces". Not "Oh, you're doing unreasonable things,
let me help you".

This has dragged out a lot longer than I hoped it would, and I think
it's been over-complicated.

In fact, looking at this all, I'm starting to think that we don't
actually even need the mm_struct.has_pinned logic, because we can work
with something much simpler: the page mapping count.

A pinned page will have the page count increased by
GUP_PIN_COUNTING_BIAS, and my worry was that this would be ambiguous
with the traditional "fork a lot" UNIX style behavior. And that
traditional case is obviously one of the cases we very much don't want
to slow down.

But a pinned page has _another_ thing that is special about it: the
pinning action broke COW.

So I think we can simply add a

if (page_mapcount(page) != 1)
return false;

to page_maybe_dma_pinned(), and that very naturally protects against
the "is the page count perhaps elevated due to a lot of forking?"

Because pinning forces the mapcount to 1, and while it is pinned,
nothing else should possibly increase it - since the only thing that
would increase it is fork, and the whole point is that we won't be
doing that "page_dup_rmap()" for this page (which is what increases
the mapcount).

So we actually already have a very nice flag for "this page isn't
duplicated by forking".
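
IOW, something like this (just a sketch - the rest of the helper stays as-is):

static inline bool page_maybe_dma_pinned(struct page *page)
{
        /*
         * Pinning broke COW, and fork() refuses to page_dup_rmap() a
         * pinned page, so a mapcount other than 1 means the elevated
         * refcount is just "somebody forked a lot", not a pin.
         */
        if (page_mapcount(page) != 1)
                return false;

        /* ... the existing GUP_PIN_COUNTING_BIAS refcount check ... */
        return page_ref_count(compound_head(page)) >= GUP_PIN_COUNTING_BIAS;
}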

And if we keep the existing early "ptep_set_wrprotect()", we also know
that we cannot be racing with another thread that is pinning at the
same time, because the fast-gup code won't be touching a read-only
pte.

So we'll just have to mark it writable again before we release the
page table lock, and we avoid that race too.
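
So the per-pte fork path would look something like this (again just a sketch,
not the actual patch):

        /* under the pte lock, for a writable pte in a COW mapping: */
        ptep_set_wrprotect(src_mm, addr, src_pte);      /* fast-gup won't touch a r/o pte */

        if (page_maybe_dma_pinned(page)) {
                /* ... allocate a new page, copy, and map it in the child ... */
                /*
                 * The parent keeps the pinned page and must not COW it
                 * later, so flip its pte back to writable before the pte
                 * lock is dropped.
                 */
                set_pte_at(src_mm, addr, src_pte, pte_mkwrite(*src_pte));
        } else {
                /* ordinary COW sharing: both sides keep the write-protected pte */
        }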

And honestly, since this is all getting fairly late in the rc, and it
took longer than I thought, I think we should do the GFP_ATOMIC
approach for now - not great, but since it only triggers for this case
that really should never happen anyway, I think it's probably the best
thing for 5.9, and we can improve on things later.

Linus

2020-09-25 21:09:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Fri, Sep 25, 2020 at 12:56 PM Linus Torvalds
<[email protected]> wrote:
>
> And honestly, since this is all getting fairly late in the rc, and it
> took longer than I thought, I think we should do the GFP_ATOMIC
> approach for now - not great, but since it only triggers for this case
> that really should never happen anyway, I think it's probably the best
> thing for 5.9, and we can improve on things later.

I'm not super-happy with this patch, but I'm throwing it out anyway, in case

(a) somebody can test it - I don't have any test cases

(b) somebody can find issues and improve on it

but it's the simplest patch I can come up with for the small-page case.

I have *NOT* tested it. I have tried to think about it, and there are
more lines of comments than there are lines of code, but that only
means that if I didn't think about some case, it's neither in the
comments nor in the code.

I'm happy to take Peter's series too, this is more of an alternative
simplified version to keep the discussion going.

Hmm? What did I miss?

Linus


Attachments:
patch (5.50 kB)

2020-09-25 21:14:54

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Fri, Sep 25, 2020 at 12:56:05PM -0700, Linus Torvalds wrote:
> So I think we can simply add a
>
> if (page_mapcount(page) != 1)
> return false;
>
> to page_maybe_dma_pinned(), and that very naturally protects against
> the "is the page count perhaps elevated due to a lot of forking?"

How about the MAP_SHARED case where the page is pinned by some process but also
shared (so mapcount can be >1)?

> And honestly, since this is all getting fairly late in the rc, and it
> took longer than I thought, I think we should do the GFP_ATOMIC
> approach for now - not great, but since it only triggers for this case
> that really should never happen anyway, I think it's probably the best
> thing for 5.9, and we can improve on things later.

Sorry for that. Maybe I should have moved even faster.

Would the ATOMIC version always work? I mean, I thought it could fail anytime,
so any fork() can start to fail for the tests too.

PS. I do plan to post a GFP_KERNEL version soon today, no matter for this
release or the next one.

--
Peter Xu

2020-09-25 22:21:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Fri, Sep 25, 2020 at 2:13 PM Peter Xu <[email protected]> wrote:
>
> On Fri, Sep 25, 2020 at 12:56:05PM -0700, Linus Torvalds wrote:
> > So I think we can simply add a
> >
> > if (page_mapcount(page) != 1)
> > return false;
> >
> > to page_maybe_dma_pinned(), and that very naturally protects against
> > the "is the page count perhaps elevated due to a lot of forking?"
>
> How about the MAP_SHARED case where the page is pinned by some process but also
> shared (so mapcount can be >1)?

MAP_SHARED doesn't matter, since it's not getting COW'ed anyway, and
we keep the page around regardless.

So MAP_SHARED is the easy case. We'll never get to any of this code,
because is_cow_mapping() won't be true.

You can still screw up MAP_SHARED if you do a truncate() on the
underlying file or something like that, but that most *definitely*
falls under the "you only have yourself to blame" heading.

> Would the ATOMIC version always work? I mean, I thought it could fail anytime,
> so any fork() can start to fail for the tests too.

Sure. I'm not really happy about GFP_ATOMIC, but I suspect it works in practice.

Honestly, if somebody first pins megabytes of memory, and then does a
fork(), they are doing some seriously odd and wrong things. So I think
this should be a "we will try to handle it gracefully, but your load
is broken" case.

I am still inclined to add some kind of warning to this case, but I'm
also a bit on the fence wrt the whole "convenience" issue - for some
very occasional use it's probably convenient to not have to worry
about this in user space.

Actually, what I'm even less happy about than the GFP_ATOMIC is how
much annoying boilerplate just "map anonymous page" required, with
the whole cgroup_charge, throttle, anon_rmap, lru_cache_add thing.
Looking at that patch, it all looks _fairly_ simple, but there's a lot
of details that got duplicated from the pte_none() new-page-fault case
(and that the do_cow_page() case also shares).

I understand why it happens, and there's not *that* many cases, but it
made me go "ouch, this is a lot of small details, maybe I missed
some", and I got the feeling that I should try to re-organize a few
helper functions to avoid duplicating the same basic code over and
over again.

But I decided that I wanted to keep the patch minimal and as focused
as possible, so I didn't actually do that. But we clearly have decades
of adding rules that just makes even something as "simple" as "add a
new page to a VM" fairly complex.

Also, to avoid making the patch bigger, I skipped your "pass
destination vma around" patch. I think it's the right thing
conceptually, but everything I looked at also screamed "we don't
actually care about the differences" to me.

I dunno. I'm conflicted. This really _feels_ to me like "we're so
close to just fixing this once and for all", but then I also go "maybe
we should just revert everything and do this for 5.10".

Except "reverting everything" is sadly really really problematic too.
It will fix the rdma issue, but in this case "everything" goes all the
way back to "uhhuh, we have a security issue with COW going the wrong
way". Otherwise I'd have gone that way two weeks ago already.

Linus

2020-09-26 00:46:00

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Fri, Sep 25, 2020 at 02:06:59PM -0700, Linus Torvalds wrote:
> On Fri, Sep 25, 2020 at 12:56 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > And honestly, since this is all getting fairly late in the rc, and it
> > took longer than I thought, I think we should do the GFP_ATOMIC
> > approach for now - not great, but since it only triggers for this case
> > that really should never happen anyway, I think it's probably the best
> > thing for 5.9, and we can improve on things later.
>
> I'm not super-happy with this patch, but I'm throwing it out anyway, in case
>
> (a) somebody can test it - I don't have any test cases

It looks like it will work and resolve the RDMA case that triggered
this discussion. I will send it to our testers, should hear back
around Monday.

They previously said Peter's v1 patch worked, expecting the same here,
unless something unexpected hits the extra pre-conditions.

Though, we might hit the THP case and find it fails...

> (b) somebody can find issues and improve on it

The THP hunks from Peter's series looked pretty straightforward, I'd
include at least the PMD one.

As a tiny optimization, the preconditions in copy_normal_page() could
order the atomics last to try and reduce the atomics done per fork.

> I'm happy to take Peter's series too, this is more of an alternative
> simplified version to keep the discussion going.

I don't completely grok the consequences of the anon_vma check. We
can exclude file backed mappings as they are broken for pinning
anyhow, so what is left that could be MAP_PRIVATE of a non-anon_vma?

Feels obscure, probably OK. If something does break userspace could
use MAP_SHARED and be fixed.

Otherwise, I do prefer Peter's version because of the GFP_KERNEL. To
touch on your other email..

It was my hope we could move away from the "This should never
happen". From a RDMA POV this idea was sort of managable years ago,
but now I have folks writing data science/ML software in Python that
deep under the libraries use RDMA and has pinned pages. It was a
Python program that detected this regression.

Having all that "just work" regardless of what foolish stuff happens
in the Python layer is very appealing.

Jason

2020-09-26 01:19:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Fri, Sep 25, 2020 at 5:41 PM Jason Gunthorpe <[email protected]> wrote:
>
> I don't completely grok the consequences of the anon_vma check. We
> can exclude file backed mappings as they are broken for pinning
> anyhow, so what is left that could be MAP_PRIVATE of a non-anon_vma?

It really shouldn't ever happen.

The only way a COW vma can have a NULL anon_vma should be if it has no
pages mapped at all.

Technically I think it could happen if you only ever mapped the
special zero page in there (but that shouldn't then get to the
"vm_normal_page()").

> Otherwise, I do prefer Peter's version because of the GFP_KERNEL. To
> touch on your other email..

Yeah, no, I just hadn't seen a new version, so I started getting antsy
and that's when I decided to see what a minimal patch looks like.

I think that over the weekend I'll do Peter's version but with the
"page_mapcount() == 1" check, because I'm starting to like that
better than the mm->has_pinned.

Comments on that plan?

Linus

2020-09-26 22:30:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Fri, Sep 25, 2020 at 6:15 PM Linus Torvalds
<[email protected]> wrote:
>
> I think that over the weekend I'll do Peter's version but with the
> "page_mapcount() == 1" check, because I'm starting to like that
> better than the mm->has_pinned.

Actually, after the first read-through, I feel like I'll just apply
the series as-is.

But I'll look at it some more, and do another read-through and make
the final decision tomorrow.

If anybody has any concerns about the v2 patch series from Peter, holler.

Linus

2020-09-27 00:46:40

by Chen, Rong A

[permalink] [raw]
Subject: [mm] 698ac7610f: will-it-scale.per_thread_ops 8.2% improvement

Greeting,

FYI, we noticed a 8.2% improvement of will-it-scale.per_thread_ops due to commit:


commit: 698ac7610f7928ddfa44a0736e89d776579d8b82 ("[PATCH 1/5] mm: Introduce mm_struct.has_pinned")
url: https://github.com/0day-ci/linux/commits/Peter-Xu/mm-Break-COW-for-pinned-pages-during-fork/20200922-052211
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git bcf876870b95592b52519ed4aafcf9d95999bc9c

in testcase: will-it-scale
on test machine: 96 threads Intel(R) Xeon(R) CPU @ 2.30GHz with 128G memory
with following parameters:

nr_task: 100%
mode: thread
test: mmap2
cpufreq_governor: performance
ucode: 0x4002f01

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale





Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
gcc-9/performance/x86_64-rhel-8.3/thread/100%/debian-10.4-x86_64-20200603.cgz/lkp-csl-2sp4/mmap2/will-it-scale/0x4002f01

commit:
v5.8
698ac7610f ("mm: Introduce mm_struct.has_pinned")

v5.8 698ac7610f7928ddfa44a0736e8
---------------- ---------------------------
%stddev %change %stddev
\ | \
2003 +8.2% 2168 will-it-scale.per_thread_ops
192350 +8.3% 208245 will-it-scale.workload
2643 ± 33% -36.3% 1683 meminfo.Active(file)
3.88 ± 2% -0.6 3.25 ± 2% mpstat.cpu.all.idle%
0.00 ± 3% -0.0 0.00 ± 8% mpstat.cpu.all.iowait%
307629 ± 3% +10.5% 340075 ± 3% numa-numastat.node0.local_node
15503 ± 60% +60.2% 24839 ± 25% numa-numastat.node1.other_node
161670 -58.7% 66739 ± 4% vmstat.system.cs
209406 -3.9% 201176 vmstat.system.in
364.00 ± 23% -53.8% 168.00 slabinfo.pid_namespace.active_objs
364.00 ± 23% -53.8% 168.00 slabinfo.pid_namespace.num_objs
985.50 ± 11% +14.6% 1129 ± 8% slabinfo.task_group.active_objs
985.50 ± 11% +14.6% 1129 ± 8% slabinfo.task_group.num_objs
660.25 ± 33% -36.3% 420.25 proc-vmstat.nr_active_file
302055 +2.4% 309211 proc-vmstat.nr_file_pages
281010 +2.6% 288385 proc-vmstat.nr_unevictable
660.25 ± 33% -36.3% 420.25 proc-vmstat.nr_zone_active_file
281010 +2.6% 288385 proc-vmstat.nr_zone_unevictable
20640832 ± 16% +40.0% 28888446 ± 6% cpuidle.C1.time
1743036 ± 6% -54.5% 792376 ± 4% cpuidle.C1.usage
5.048e+08 ± 54% -98.3% 8642335 ± 2% cpuidle.C6.time
706531 ± 51% -94.9% 36224 cpuidle.C6.usage
38313880 -56.5% 16666274 ± 5% cpuidle.POLL.time
18289550 -59.1% 7488947 ± 5% cpuidle.POLL.usage
302.94 ± 5% -32.9% 203.13 ± 6% sched_debug.cfs_rq:/.exec_clock.stddev
31707 ± 6% -43.6% 17867 ± 6% sched_debug.cfs_rq:/.min_vruntime.stddev
0.77 ± 22% +41.1% 1.09 ± 4% sched_debug.cfs_rq:/.nr_spread_over.avg
163292 ± 15% -75.8% 39543 ± 24% sched_debug.cfs_rq:/.spread0.avg
220569 ± 13% -65.9% 75287 ± 16% sched_debug.cfs_rq:/.spread0.max
-1903 +1952.7% -39073 sched_debug.cfs_rq:/.spread0.min
31680 ± 6% -43.7% 17850 ± 6% sched_debug.cfs_rq:/.spread0.stddev
698820 ± 2% -28.4% 500526 ± 3% sched_debug.cpu.avg_idle.avg
1100275 ± 3% -7.2% 1020875 ± 3% sched_debug.cpu.avg_idle.max
250869 -58.4% 104239 ± 4% sched_debug.cpu.nr_switches.avg
766741 ± 25% -64.8% 269919 ± 10% sched_debug.cpu.nr_switches.max
111347 ± 16% -50.8% 54786 ± 11% sched_debug.cpu.nr_switches.min
108077 ± 11% -67.3% 35316 ± 8% sched_debug.cpu.nr_switches.stddev
262769 -59.1% 107346 ± 4% sched_debug.cpu.sched_count.avg
800567 ± 25% -65.9% 272755 ± 10% sched_debug.cpu.sched_count.max
115870 ± 15% -51.6% 56034 ± 11% sched_debug.cpu.sched_count.min
112678 ± 11% -67.8% 36309 ± 9% sched_debug.cpu.sched_count.stddev
122760 -59.2% 50082 ± 4% sched_debug.cpu.sched_goidle.avg
372289 ± 25% -65.9% 126854 ± 10% sched_debug.cpu.sched_goidle.max
53911 ± 15% -51.7% 26040 ± 11% sched_debug.cpu.sched_goidle.min
52986 ± 12% -67.7% 17092 ± 9% sched_debug.cpu.sched_goidle.stddev
138914 -59.1% 56816 ± 4% sched_debug.cpu.ttwu_count.avg
168089 -57.1% 72151 ± 4% sched_debug.cpu.ttwu_count.max
123046 -61.1% 47837 ± 4% sched_debug.cpu.ttwu_count.min
11828 ± 15% -39.8% 7114 ± 4% sched_debug.cpu.ttwu_count.stddev
3050 ± 6% -34.3% 2004 ± 6% sched_debug.cpu.ttwu_local.max
383.97 ± 6% -30.1% 268.54 ± 12% sched_debug.cpu.ttwu_local.stddev
51.21 -0.4 50.79 perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64
51.24 -0.4 50.82 perf-profile.calltrace.cycles-pp.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.84 ± 8% -0.1 0.76 ± 4% perf-profile.calltrace.cycles-pp.secondary_startup_64
0.83 ± 8% -0.1 0.74 ± 4% perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64
0.83 ± 8% -0.1 0.74 ± 4% perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64
0.83 ± 8% -0.1 0.74 ± 4% perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64
47.27 +0.5 47.79 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__munmap
47.27 +0.5 47.79 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
47.26 +0.5 47.78 perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
47.25 +0.5 47.78 perf-profile.calltrace.cycles-pp.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
47.28 +0.5 47.81 perf-profile.calltrace.cycles-pp.__munmap
46.75 +0.5 47.30 perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap.do_syscall_64
46.24 +0.5 46.79 perf-profile.calltrace.cycles-pp.osq_lock.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.__vm_munmap
46.77 +0.6 47.33 perf-profile.calltrace.cycles-pp.down_write_killable.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
46.70 +0.6 47.26 perf-profile.calltrace.cycles-pp.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
0.84 ± 8% -0.1 0.76 ± 4% perf-profile.children.cycles-pp.secondary_startup_64
0.84 ± 8% -0.1 0.76 ± 4% perf-profile.children.cycles-pp.cpu_startup_entry
0.84 ± 8% -0.1 0.76 ± 4% perf-profile.children.cycles-pp.do_idle
0.83 ± 8% -0.1 0.74 ± 4% perf-profile.children.cycles-pp.start_secondary
0.11 ± 4% -0.1 0.04 ± 57% perf-profile.children.cycles-pp.__sched_text_start
0.14 ± 3% -0.1 0.08 ± 6% perf-profile.children.cycles-pp.rwsem_wake
0.10 ± 4% -0.0 0.06 ± 9% perf-profile.children.cycles-pp.wake_up_q
0.10 ± 4% -0.0 0.05 perf-profile.children.cycles-pp.try_to_wake_up
0.07 ± 7% -0.0 0.05 perf-profile.children.cycles-pp._raw_spin_lock_irqsave
0.24 ± 2% +0.0 0.26 perf-profile.children.cycles-pp.mmap_region
0.42 ± 2% +0.0 0.45 perf-profile.children.cycles-pp.do_mmap
0.67 +0.0 0.71 ± 2% perf-profile.children.cycles-pp.rwsem_spin_on_owner
97.96 +0.1 98.09 perf-profile.children.cycles-pp.rwsem_down_write_slowpath
98.02 +0.1 98.14 perf-profile.children.cycles-pp.down_write_killable
97.86 +0.2 98.02 perf-profile.children.cycles-pp.rwsem_optimistic_spin
96.90 +0.2 97.07 perf-profile.children.cycles-pp.osq_lock
47.26 +0.5 47.78 perf-profile.children.cycles-pp.__x64_sys_munmap
47.28 +0.5 47.81 perf-profile.children.cycles-pp.__munmap
47.25 +0.5 47.78 perf-profile.children.cycles-pp.__vm_munmap
0.24 -0.0 0.19 ± 4% perf-profile.self.cycles-pp.rwsem_optimistic_spin
0.06 -0.0 0.03 ±100% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.09 +0.0 0.10 perf-profile.self.cycles-pp.find_vma
0.65 +0.0 0.70 ± 2% perf-profile.self.cycles-pp.rwsem_spin_on_owner
96.31 +0.2 96.53 perf-profile.self.cycles-pp.osq_lock
0.66 -21.0% 0.52 ± 2% perf-stat.i.MPKI
0.11 +0.0 0.11 perf-stat.i.branch-miss-rate%
14774053 +4.3% 15410076 perf-stat.i.branch-misses
41.17 +0.9 42.09 perf-stat.i.cache-miss-rate%
19999033 -18.7% 16264313 ± 3% perf-stat.i.cache-misses
48696879 -20.4% 38779964 ± 2% perf-stat.i.cache-references
162553 -58.9% 66790 ± 4% perf-stat.i.context-switches
157.36 -3.9% 151.20 perf-stat.i.cpu-migrations
14271 +23.9% 17676 ± 3% perf-stat.i.cycles-between-cache-misses
0.00 ± 8% +0.0 0.00 ± 4% perf-stat.i.dTLB-load-miss-rate%
155048 ± 5% +46.3% 226826 ± 4% perf-stat.i.dTLB-load-misses
0.00 ± 2% +0.0 0.01 perf-stat.i.dTLB-store-miss-rate%
16761 +19.2% 19972 ± 2% perf-stat.i.dTLB-store-misses
4.118e+08 -15.7% 3.47e+08 perf-stat.i.dTLB-stores
93.99 +1.4 95.42 perf-stat.i.iTLB-load-miss-rate%
4103535 ± 3% -6.7% 3827881 perf-stat.i.iTLB-load-misses
263239 ± 4% -28.6% 188037 ± 2% perf-stat.i.iTLB-loads
18498 ± 3% +7.5% 19885 perf-stat.i.instructions-per-iTLB-miss
0.31 +226.8% 1.01 ± 3% perf-stat.i.metric.K/sec
88.51 +1.8 90.30 perf-stat.i.node-load-miss-rate%
8104644 -13.5% 7009101 perf-stat.i.node-load-misses
1047577 -28.3% 751122 perf-stat.i.node-loads
3521008 -13.1% 3058055 perf-stat.i.node-store-misses
0.64 -20.6% 0.51 ± 2% perf-stat.overall.MPKI
0.10 +0.0 0.10 perf-stat.overall.branch-miss-rate%
41.07 +0.9 41.92 perf-stat.overall.cache-miss-rate%
14273 +23.7% 17662 ± 3% perf-stat.overall.cycles-between-cache-misses
0.00 ± 5% +0.0 0.00 ± 5% perf-stat.overall.dTLB-load-miss-rate%
0.00 ± 2% +0.0 0.01 ± 2% perf-stat.overall.dTLB-store-miss-rate%
93.95 +1.3 95.29 perf-stat.overall.iTLB-load-miss-rate%
18507 ± 3% +7.5% 19890 perf-stat.overall.instructions-per-iTLB-miss
88.55 +1.8 90.32 perf-stat.overall.node-load-miss-rate%
1.191e+08 -7.0% 1.107e+08 perf-stat.overall.path-length
14739033 +4.2% 15364552 perf-stat.ps.branch-misses
19930054 -18.7% 16209879 ± 3% perf-stat.ps.cache-misses
48535919 -20.3% 38662093 ± 2% perf-stat.ps.cache-references
161841 -58.9% 66477 ± 4% perf-stat.ps.context-switches
157.06 -3.9% 150.89 perf-stat.ps.cpu-migrations
154906 ± 5% +46.1% 226269 ± 5% perf-stat.ps.dTLB-load-misses
16730 +19.1% 19927 ± 2% perf-stat.ps.dTLB-store-misses
4.104e+08 -15.7% 3.459e+08 perf-stat.ps.dTLB-stores
4089203 ± 3% -6.7% 3814320 perf-stat.ps.iTLB-load-misses
263134 ± 4% -28.4% 188441 ± 2% perf-stat.ps.iTLB-loads
8076488 -13.5% 6984923 perf-stat.ps.node-load-misses
1044002 -28.3% 748425 perf-stat.ps.node-loads
3509068 -13.1% 3048190 perf-stat.ps.node-store-misses
2270 ±170% -98.9% 24.50 ±166% interrupts.36:PCI-MSI.31981569-edge.i40e-eth0-TxRx-0
3730764 ± 2% -62.0% 1416300 ± 4% interrupts.CAL:Function_call_interrupts
2270 ±170% -98.9% 24.25 ±168% interrupts.CPU0.36:PCI-MSI.31981569-edge.i40e-eth0-TxRx-0
111116 ± 29% -67.4% 36227 ± 8% interrupts.CPU0.CAL:Function_call_interrupts
11091 ± 38% -62.2% 4195 ± 12% interrupts.CPU0.RES:Rescheduling_interrupts
48001 ± 26% -53.7% 22228 ± 14% interrupts.CPU1.CAL:Function_call_interrupts
4189 ± 29% -43.2% 2378 ± 5% interrupts.CPU1.RES:Rescheduling_interrupts
34024 ± 32% -44.7% 18804 ± 4% interrupts.CPU10.CAL:Function_call_interrupts
26171 ± 19% -26.1% 19334 ± 3% interrupts.CPU11.CAL:Function_call_interrupts
2370 ± 13% -20.8% 1876 ± 18% interrupts.CPU11.RES:Rescheduling_interrupts
30568 ± 23% -45.9% 16537 ± 4% interrupts.CPU12.CAL:Function_call_interrupts
3094 ± 27% -46.9% 1643 ± 8% interrupts.CPU12.RES:Rescheduling_interrupts
514.50 ± 10% +35.6% 697.75 ± 16% interrupts.CPU12.TLB:TLB_shootdowns
29819 ± 32% -41.7% 17393 interrupts.CPU13.CAL:Function_call_interrupts
35364 ± 38% -49.6% 17819 ± 9% interrupts.CPU14.CAL:Function_call_interrupts
694.75 ± 11% +23.2% 856.00 ± 9% interrupts.CPU14.TLB:TLB_shootdowns
38361 ± 48% -49.8% 19273 ± 9% interrupts.CPU16.CAL:Function_call_interrupts
30069 ± 23% -31.6% 20559 ± 3% interrupts.CPU17.CAL:Function_call_interrupts
809.75 ± 7% -19.5% 651.75 ± 13% interrupts.CPU17.TLB:TLB_shootdowns
28245 ± 29% -33.1% 18894 ± 6% interrupts.CPU18.CAL:Function_call_interrupts
33560 ± 23% -48.2% 17369 ± 6% interrupts.CPU19.CAL:Function_call_interrupts
2863 ± 19% -40.3% 1709 ± 13% interrupts.CPU19.RES:Rescheduling_interrupts
47118 ± 32% -55.7% 20868 ± 3% interrupts.CPU2.CAL:Function_call_interrupts
3897 ± 38% -48.9% 1991 ± 7% interrupts.CPU2.RES:Rescheduling_interrupts
34735 ± 29% -41.7% 20246 ± 9% interrupts.CPU20.CAL:Function_call_interrupts
37232 ± 23% -46.6% 19883 ± 12% interrupts.CPU21.CAL:Function_call_interrupts
32345 ± 16% -38.6% 19845 ± 6% interrupts.CPU22.CAL:Function_call_interrupts
34083 ± 22% -43.4% 19301 ± 6% interrupts.CPU23.CAL:Function_call_interrupts
61308 ± 16% -76.3% 14529 ± 13% interrupts.CPU24.CAL:Function_call_interrupts
6610 ± 26% -76.3% 1568 ± 11% interrupts.CPU24.RES:Rescheduling_interrupts
51384 ± 32% -75.0% 12848 ± 12% interrupts.CPU25.CAL:Function_call_interrupts
4643 ± 39% -70.6% 1366 ± 9% interrupts.CPU25.RES:Rescheduling_interrupts
48788 ± 25% -71.7% 13826 ± 17% interrupts.CPU26.CAL:Function_call_interrupts
4076 ± 32% -70.5% 1203 ± 18% interrupts.CPU26.RES:Rescheduling_interrupts
45702 ± 14% -70.7% 13369 ± 12% interrupts.CPU27.CAL:Function_call_interrupts
3614 ± 21% -67.3% 1180 ± 16% interrupts.CPU27.RES:Rescheduling_interrupts
51216 ± 14% -71.6% 14546 ± 15% interrupts.CPU28.CAL:Function_call_interrupts
4395 ± 24% -67.9% 1410 ± 21% interrupts.CPU28.RES:Rescheduling_interrupts
614.25 ± 18% +33.3% 818.75 ± 17% interrupts.CPU28.TLB:TLB_shootdowns
44945 ± 23% -66.5% 15059 ± 14% interrupts.CPU29.CAL:Function_call_interrupts
3994 ± 34% -68.2% 1271 ± 10% interrupts.CPU29.RES:Rescheduling_interrupts
39154 ± 24% -41.6% 22857 ± 6% interrupts.CPU3.CAL:Function_call_interrupts
45674 ± 11% -68.3% 14470 ± 8% interrupts.CPU30.CAL:Function_call_interrupts
4097 ± 23% -68.8% 1278 ± 10% interrupts.CPU30.RES:Rescheduling_interrupts
51890 ± 13% -72.6% 14227 ± 16% interrupts.CPU31.CAL:Function_call_interrupts
4557 ± 26% -71.4% 1305 ± 21% interrupts.CPU31.RES:Rescheduling_interrupts
41324 ± 23% -76.0% 9933 ± 11% interrupts.CPU32.CAL:Function_call_interrupts
3284 ± 33% -73.4% 873.75 ± 15% interrupts.CPU32.RES:Rescheduling_interrupts
39758 ± 31% -74.5% 10120 ± 17% interrupts.CPU33.CAL:Function_call_interrupts
3373 ± 42% -74.2% 869.00 ± 15% interrupts.CPU33.RES:Rescheduling_interrupts
513.00 ± 27% +46.0% 748.75 ± 16% interrupts.CPU33.TLB:TLB_shootdowns
40015 ± 14% -72.8% 10885 ± 8% interrupts.CPU34.CAL:Function_call_interrupts
3402 ± 25% -68.2% 1080 ± 13% interrupts.CPU34.RES:Rescheduling_interrupts
635.25 ± 22% +49.3% 948.25 ± 13% interrupts.CPU34.TLB:TLB_shootdowns
45251 ± 19% -75.2% 11204 ± 17% interrupts.CPU35.CAL:Function_call_interrupts
3731 ± 31% -73.4% 992.50 ± 20% interrupts.CPU35.RES:Rescheduling_interrupts
43390 ± 11% -78.3% 9434 ± 15% interrupts.CPU36.CAL:Function_call_interrupts
3536 ± 23% -77.3% 803.75 ± 14% interrupts.CPU36.RES:Rescheduling_interrupts
39820 ± 11% -75.9% 9613 ± 10% interrupts.CPU37.CAL:Function_call_interrupts
2987 ± 21% -70.8% 873.25 ± 9% interrupts.CPU37.RES:Rescheduling_interrupts
42969 ± 32% -76.6% 10068 ± 17% interrupts.CPU38.CAL:Function_call_interrupts
3202 ± 36% -74.4% 818.50 ± 27% interrupts.CPU38.RES:Rescheduling_interrupts
35571 ± 16% -72.4% 9822 ± 9% interrupts.CPU39.CAL:Function_call_interrupts
2986 ± 24% -73.9% 778.75 ± 15% interrupts.CPU39.RES:Rescheduling_interrupts
45001 ± 21% -48.2% 23317 ± 7% interrupts.CPU4.CAL:Function_call_interrupts
3689 ± 24% -43.6% 2080 ± 2% interrupts.CPU4.RES:Rescheduling_interrupts
39302 ± 21% -73.4% 10463 ± 8% interrupts.CPU40.CAL:Function_call_interrupts
2968 ± 32% -71.4% 848.50 ± 18% interrupts.CPU40.RES:Rescheduling_interrupts
40826 ± 19% -75.3% 10070 ± 10% interrupts.CPU41.CAL:Function_call_interrupts
3321 ± 30% -70.9% 967.25 ± 10% interrupts.CPU41.RES:Rescheduling_interrupts
700.25 ± 18% +26.2% 883.75 ± 17% interrupts.CPU41.TLB:TLB_shootdowns
35368 ± 11% -73.7% 9308 ± 15% interrupts.CPU42.CAL:Function_call_interrupts
2839 ± 12% -70.3% 844.50 ± 13% interrupts.CPU42.RES:Rescheduling_interrupts
45459 ± 25% -78.7% 9687 ± 11% interrupts.CPU43.CAL:Function_call_interrupts
3703 ± 29% -74.1% 959.50 ± 16% interrupts.CPU43.RES:Rescheduling_interrupts
41495 ± 15% -77.1% 9522 ± 12% interrupts.CPU44.CAL:Function_call_interrupts
3153 ± 28% -75.0% 789.75 ± 15% interrupts.CPU44.RES:Rescheduling_interrupts
38501 ± 26% -72.5% 10601 ± 14% interrupts.CPU45.CAL:Function_call_interrupts
3024 ± 38% -73.8% 791.00 ± 19% interrupts.CPU45.RES:Rescheduling_interrupts
39083 ± 35% -73.6% 10323 ± 18% interrupts.CPU46.CAL:Function_call_interrupts
3173 ± 37% -73.9% 829.75 ± 24% interrupts.CPU46.RES:Rescheduling_interrupts
44486 ± 20% -75.3% 10968 ± 15% interrupts.CPU47.CAL:Function_call_interrupts
3773 ± 34% -76.4% 890.25 ± 24% interrupts.CPU47.RES:Rescheduling_interrupts
34967 ± 42% -51.0% 17117 ± 10% interrupts.CPU48.CAL:Function_call_interrupts
31969 ± 38% -51.7% 15432 ± 12% interrupts.CPU49.CAL:Function_call_interrupts
33786 ± 12% -29.2% 23918 ± 5% interrupts.CPU5.CAL:Function_call_interrupts
3014 ± 16% -33.2% 2012 ± 9% interrupts.CPU5.RES:Rescheduling_interrupts
30514 ± 29% -46.4% 16343 ± 6% interrupts.CPU51.CAL:Function_call_interrupts
34448 ± 26% -48.7% 17686 ± 4% interrupts.CPU52.CAL:Function_call_interrupts
2811 ± 34% -42.0% 1631 ± 4% interrupts.CPU52.RES:Rescheduling_interrupts
30848 ± 33% -47.9% 16059 ± 3% interrupts.CPU54.CAL:Function_call_interrupts
31017 ± 22% -52.7% 14676 ± 7% interrupts.CPU55.CAL:Function_call_interrupts
2501 ± 41% -41.8% 1455 ± 10% interrupts.CPU55.RES:Rescheduling_interrupts
28249 ± 23% -46.9% 14997 ± 10% interrupts.CPU56.CAL:Function_call_interrupts
2113 ± 18% -36.0% 1352 ± 15% interrupts.CPU56.RES:Rescheduling_interrupts
27658 ± 20% -49.3% 14034 ± 3% interrupts.CPU57.CAL:Function_call_interrupts
26559 ± 34% -40.6% 15778 ± 11% interrupts.CPU58.CAL:Function_call_interrupts
27984 ± 27% -39.9% 16815 ± 12% interrupts.CPU59.CAL:Function_call_interrupts
35098 ± 33% -37.5% 21921 interrupts.CPU6.CAL:Function_call_interrupts
3073 ± 37% -40.6% 1825 ± 8% interrupts.CPU6.RES:Rescheduling_interrupts
29248 ± 38% -48.2% 15149 ± 6% interrupts.CPU60.CAL:Function_call_interrupts
30880 ± 33% -52.3% 14722 ± 10% interrupts.CPU61.CAL:Function_call_interrupts
31218 ± 43% -51.5% 15152 ± 3% interrupts.CPU62.CAL:Function_call_interrupts
29210 ± 40% -46.5% 15627 ± 6% interrupts.CPU63.CAL:Function_call_interrupts
26813 ± 15% -39.0% 16343 ± 13% interrupts.CPU64.CAL:Function_call_interrupts
24791 ± 14% -32.6% 16704 ± 10% interrupts.CPU67.CAL:Function_call_interrupts
29638 ± 33% -42.9% 16914 ± 9% interrupts.CPU68.CAL:Function_call_interrupts
36247 ± 33% -48.5% 18670 ± 8% interrupts.CPU69.CAL:Function_call_interrupts
30379 ± 24% -30.6% 21096 ± 5% interrupts.CPU7.CAL:Function_call_interrupts
3027 ± 25% -32.0% 2059 ± 3% interrupts.CPU7.RES:Rescheduling_interrupts
31064 ± 25% -42.8% 17774 interrupts.CPU70.CAL:Function_call_interrupts
52949 ± 14% -77.0% 12198 ± 10% interrupts.CPU72.CAL:Function_call_interrupts
4057 ± 23% -75.7% 985.00 ± 8% interrupts.CPU72.RES:Rescheduling_interrupts
42694 ± 22% -73.6% 11281 ± 16% interrupts.CPU73.CAL:Function_call_interrupts
3318 ± 39% -74.3% 851.75 ± 16% interrupts.CPU73.RES:Rescheduling_interrupts
49143 ± 24% -76.1% 11756 ± 12% interrupts.CPU74.CAL:Function_call_interrupts
3606 ± 31% -73.8% 946.00 ± 12% interrupts.CPU74.RES:Rescheduling_interrupts
50587 ± 24% -72.5% 13930 ± 20% interrupts.CPU75.CAL:Function_call_interrupts
3655 ± 36% -71.1% 1056 ± 12% interrupts.CPU75.RES:Rescheduling_interrupts
57791 ± 21% -78.4% 12488 ± 11% interrupts.CPU76.CAL:Function_call_interrupts
5109 ± 37% -79.7% 1037 ± 20% interrupts.CPU76.RES:Rescheduling_interrupts
52455 ± 26% -75.4% 12922 ± 5% interrupts.CPU77.CAL:Function_call_interrupts
3997 ± 37% -73.9% 1043 ± 14% interrupts.CPU77.RES:Rescheduling_interrupts
49188 ± 21% -74.5% 12521 ± 10% interrupts.CPU78.CAL:Function_call_interrupts
3867 ± 42% -74.5% 986.25 ± 18% interrupts.CPU78.RES:Rescheduling_interrupts
45517 ± 19% -72.6% 12484 ± 19% interrupts.CPU79.CAL:Function_call_interrupts
3369 ± 34% -71.9% 946.25 ± 20% interrupts.CPU79.RES:Rescheduling_interrupts
30702 ± 22% -39.9% 18462 ± 9% interrupts.CPU8.CAL:Function_call_interrupts
2580 ± 28% -39.9% 1550 ± 10% interrupts.CPU8.RES:Rescheduling_interrupts
35561 ± 30% -69.8% 10728 ± 12% interrupts.CPU80.CAL:Function_call_interrupts
2675 ± 44% -70.6% 787.50 ± 7% interrupts.CPU80.RES:Rescheduling_interrupts
38762 ± 33% -73.3% 10349 ± 18% interrupts.CPU81.CAL:Function_call_interrupts
2892 ± 48% -70.5% 853.25 ± 20% interrupts.CPU81.RES:Rescheduling_interrupts
46500 ± 39% -80.2% 9203 ± 6% interrupts.CPU82.CAL:Function_call_interrupts
3726 ± 41% -83.3% 622.75 ± 12% interrupts.CPU82.RES:Rescheduling_interrupts
42125 ± 25% -76.0% 10103 ± 7% interrupts.CPU83.CAL:Function_call_interrupts
3275 ± 40% -75.4% 804.50 ± 6% interrupts.CPU83.RES:Rescheduling_interrupts
37359 ± 28% -74.7% 9436 ± 7% interrupts.CPU84.CAL:Function_call_interrupts
2762 ± 45% -71.1% 797.50 ± 17% interrupts.CPU84.RES:Rescheduling_interrupts
38900 ± 13% -76.2% 9272 ± 8% interrupts.CPU85.CAL:Function_call_interrupts
2704 ± 27% -77.0% 622.25 ± 10% interrupts.CPU85.RES:Rescheduling_interrupts
40662 ± 24% -77.2% 9274 ± 14% interrupts.CPU86.CAL:Function_call_interrupts
3139 ± 39% -79.5% 643.00 ± 28% interrupts.CPU86.RES:Rescheduling_interrupts
33538 ± 23% -71.7% 9484 ± 14% interrupts.CPU87.CAL:Function_call_interrupts
2406 ± 40% -73.5% 638.25 ± 21% interrupts.CPU87.RES:Rescheduling_interrupts
36240 ± 26% -73.8% 9499 ± 10% interrupts.CPU88.CAL:Function_call_interrupts
2450 ± 39% -70.5% 721.75 ± 5% interrupts.CPU88.RES:Rescheduling_interrupts
41267 ± 29% -77.1% 9463 ± 11% interrupts.CPU89.CAL:Function_call_interrupts
3286 ± 34% -73.2% 879.50 ± 17% interrupts.CPU89.RES:Rescheduling_interrupts
36038 ± 18% -50.6% 17796 ± 3% interrupts.CPU9.CAL:Function_call_interrupts
3140 ± 28% -48.3% 1622 ± 9% interrupts.CPU9.RES:Rescheduling_interrupts
38534 ± 25% -77.5% 8675 ± 9% interrupts.CPU90.CAL:Function_call_interrupts
3008 ± 27% -79.1% 629.50 ± 16% interrupts.CPU90.RES:Rescheduling_interrupts
38422 ± 29% -77.2% 8741 ± 14% interrupts.CPU91.CAL:Function_call_interrupts
3095 ± 51% -78.7% 658.75 ± 14% interrupts.CPU91.RES:Rescheduling_interrupts
38120 ± 45% -73.6% 10059 ± 10% interrupts.CPU92.CAL:Function_call_interrupts
2711 ± 61% -73.3% 722.75 ± 10% interrupts.CPU92.RES:Rescheduling_interrupts
37155 ± 19% -74.1% 9628 ± 12% interrupts.CPU93.CAL:Function_call_interrupts
2724 ± 32% -70.4% 806.00 ± 32% interrupts.CPU93.RES:Rescheduling_interrupts
43458 ± 15% -77.1% 9936 ± 10% interrupts.CPU94.CAL:Function_call_interrupts
2832 ± 25% -76.8% 655.75 ± 18% interrupts.CPU94.RES:Rescheduling_interrupts
54226 ± 22% -76.8% 12596 ± 17% interrupts.CPU95.CAL:Function_call_interrupts
4437 ± 37% -80.6% 860.50 ± 11% interrupts.CPU95.RES:Rescheduling_interrupts
302853 ± 2% -58.2% 126676 ± 2% interrupts.RES:Rescheduling_interrupts



will-it-scale.per_thread_ops

2300 +--------------------------------------------------------------------+
| |
2250 |-+ O O O O O |
2200 |-+ O O O O |
| O O O O O O O O |
2150 |-O O O O O O |
| O O O O |
2100 |-+ + |
| : + + +.. .+. |
2050 |.+ : + .+.. + : : + +. +.. + |
2000 |-+..+ + + : : + + + |
| : : + + |
1950 |-+ +.+.. +..+ |
| + |
1900 +--------------------------------------------------------------------+


[*] bisect-good sample
[O] bisect-bad sample



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Rong Chen



2020-09-27 06:25:29

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Sat, Sep 26, 2020 at 03:28:32PM -0700, Linus Torvalds wrote:
> On Fri, Sep 25, 2020 at 6:15 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > I think that over the weekend I'll do Peter's version but with the
> > "page_mapcount() == 1" check, because I'm starting to like that
> > better than the mm->has_pinned.
>
> Actually, after the first read-through, I feel like I'll just apply
> the series as-is.
>
> But I'll look at it some more, and do another read-through and make
> the final decision tomorrow.
>
> If anybody has any concerns about the v2 patch series from Peter, holler.

We won't be able to test the series till Tuesday due to religious holidays.

Thanks

>
> Linus

2020-09-27 18:21:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Sat, Sep 26, 2020 at 11:24 PM Leon Romanovsky <[email protected]> wrote:
>
> We won't be able to test the series till Tuesday due to religious holidays.

That's fine. I've merged the series up, and it will be in rc7 later
today, and with an rc8 next week we'll have two weeks to find any
issues.

I did edit Peter's patch 3/4 quite a bit:

- remove the pte_mkyoung(), because there's no _access_ at fork()
time (so it's very different in that respect to the fault case)

- added the lru_cache_add_inactive_or_unevictable() call that I think
is needed and Peter's patch was missing

- split up the "copy page" into its own function kind of like I had
done for my minimal patch

- changed the comments around a bit (mostly due to that split-up
changing some of the flow)

but it should be otherwise the same and I think Peter will still
recognize it as his patch (and I left it with his authorship and
sign-off).

Anyway, it's merged locally in my tree, I'll do some local testing
(and I have a few other pull requests), but I'll push out the end
result soonish assuming nothing shows up (and considering that I don't
have any pinning loads, I seriously doubt it will, since I won't see
any of the interesting cases).

So if we can get the actual rdma cases tested early next week, we'll
be in good shape, I think.

Btw, I'm not convinced about the whole "turn the pte read-only and
then back". If the fork races with another thread doing a pinning
fast-GUP on another CPU, there are memory ordering issues etc too.
That's not necessarily visible on x86 (the "turn read-only" being a
locked op will force serialization), but it all looks dodgy as heck.

I'm also not convinced we really want to support that kind of insane
racy load, but whatever. I left the code in place, but it's something
where we might want to rethink things.

Linus

2020-09-27 18:47:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Sun, Sep 27, 2020 at 11:16 AM Linus Torvalds
<[email protected]> wrote:
>
> Btw, I'm not convinced about the whole "turn the pte read-only and
> then back". If the fork races with another thread doing a pinning
> fast-GUP on another CPU, there are memory ordering issues etc too.
> That's not necessarily visible on x86 (the "turn read-only" being a
> locked op will force serialization), but it all looks dodgy as heck.

.. looking at it more, I also think it could possibly lose the dirty
bit for the case where another CPU did a HW dirty/accessed bit update
in between the original read of the pte, and then us writing back the
writable pte again.

Us holding the page table lock means that no _software_ accesses will
happen to the PTE, but dirty/accessed bits can be modified by hardware
despite the lock.

That is, of course, a completely crazy case, and I think that since we
only do this for a COW mapping, and only do the PTE changes if the pte
was writable, the pte will always have been dirty already.

So I don't think it's an _actual_ bug, but it's another "this looks
dodgy as heck" marker. It may _work_, but it sure ain't pretty.

But despite having looked at this quite a bit, I don't see anything
that looks actively wrong, so I think the series is fine. This is more
of a note for people to perhaps think about.

Linus

2020-09-28 12:51:22

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Sun, Sep 27, 2020 at 11:45:30AM -0700, Linus Torvalds wrote:
> On Sun, Sep 27, 2020 at 11:16 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > Btw, I'm not convinced about the whole "turn the pte read-only and
> > then back". If the fork races with another thread doing a pinning
> > fast-GUP on another CPU, there are memory ordering issues etc too.
> > That's not necessarily visible on x86 (the "turn read-only" being a
> > locked op will force serialization), but it all looks dodgy as heck.

Oh. Yes, looking again, the atomics in the final arrangement of
copy_present_page() aren't going to be strong enough to order this.

Sorry for missing that; I wasn't able to look very deeply over the weekend.

Not seeing an obvious option besides adding a smp_mb() before
page_maybe_dma_pinned() as Peter once suggested.
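
A minimal sketch of the ordering in question (purely illustrative - the
variable names are just what that copy path would plausibly have in scope,
not the actual copy_present_pte() body):

    ptep_set_wrprotect(src_mm, addr, src_pte);
    /*
     * Order the pte write against the pin-count read below; this pairs
     * with fast-gup bumping the pin count and then re-checking the pte.
     */
    smp_mb();
    if (page_maybe_dma_pinned(page)) {
        /* a pin raced in (or was already there): copy the page instead */
    }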

> .. looking at it more, I also think it could possibly lose the dirty
> bit for the case where another CPU did a HW dirty/accessed bit update
> in between the original read of the pte, and then us writing back the
> writable pte again.

Ah, I see:

set_pte_at(src_mm, addr, src_pte, pte);

wants to be some arch specific single bit ptep_clear_wrprotect()..

Thanks,
Jason

2020-09-28 16:21:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 5:49 AM Jason Gunthorpe <[email protected]> wrote:
>
> Not seeing an obvious option besides adding a smp_mb() before
> page_maybe_dma_pinned() as Peter once suggested.

That is going to be prohibitively expensive - needing it for each pte
whether it's pinned or not.

I really think the better option is a "don't do that then". This has
_never_ worked before either except by pure luck.

I also doubt anybody does it. Forking with threads is a bad idea to
begin with. Doing so while pinning pages even more so.

Linus

2020-09-28 17:17:29

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Sun, Sep 27, 2020 at 11:16:34AM -0700, Linus Torvalds wrote:
> - split up the "copy page" into its own function kind of like I had
> done for my minimal patch

I didn't do that mainly because of the wrprotect() (for the fast-gup race),
which can potentially be kept if it's a normal cow. Maybe we'd add one more
comment above the caller of copy_present_page() (even if we already have a
"NOTE!" section inside the helper), since it changes *src_pte and that's hard
to notice from the function name "copy_present_page()". Not that important,
though.

Thanks for doing the changes. I went over the whole patch and it looks indeed
cleaner than before (I also didn't spot anything wrong either).

Regarding the other, even rarer "hardware race on dirty/access bits" - maybe we
can simply always set both of those bits when the page copy happens and we want
to recover the ptes? I also agree it's trivial enough that maybe we don't
even need to care about it.
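
For example (just a sketch of the idea; the variable names follow the
discussion above, not the actual patch):

    /*
     * When putting the parent's writable pte back after copying the
     * page, set dirty+young unconditionally so a hardware A/D update
     * that raced with the wrprotect can't be lost.
     */
    pte = pte_mkyoung(pte_mkdirty(pte));
    set_pte_at(src_mm, addr, src_pte, pte);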

--
Peter Xu

2020-09-28 17:25:12

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 09:17:09AM -0700, Linus Torvalds wrote:
> On Mon, Sep 28, 2020 at 5:49 AM Jason Gunthorpe <[email protected]> wrote:
> >
> > Not seeing an obvious option besides adding a smp_mb() before
> > page_maybe_dma_pinned() as Peter once suggested.
>
> That is going to be prohibitively expensive - needing it for each pte
> whether it's pinned or not.
>
> I really think the better option is a "don't do that then". This has
> _never_ worked before either except by pure luck.

Yes... Actually I am also thinking about the complete solution to cover
read-only fast-gups too, but now I start to doubt this, at least for the fork()
path. E.g. if we'd finally like to use pte_protnone() to replace the current
pte_wrprotect(), we'll be able to also block the read gups, but we'll suffer
the same degradation on normal fork()s, or even more. Seems unacceptable.

The other question is, whether we should emphasize and document somewhere that
MADV_DONTFORK is still (and should always be) the preferred way, because
changes like this series can potentially encourage the other way.

--
Peter Xu

2020-09-28 17:56:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 10:23 AM Peter Xu <[email protected]> wrote:
>
> Yes... Actually I am also thinking about the complete solution to cover
> read-only fast-gups too, but now I start to doubt this, at least for the fork()
> path. E.g. if we'd finally like to use pte_protnone() to replace the current
> pte_wrprotect(), we'll be able to also block the read gups, but we'll suffer
> the same degradation on normal fork()s, or even more. Seems unacceptable.

So I think the real question about pinned read gups is what semantics
they should have.

Because honestly, I think we have two options:

- the current "it gets a shared copy from the page tables"

- the "this is an exclusive pin, and it _will_ follow the source VM
changes, and never break"

because honestly, if we get a shared copy at the time of the pinning
(like we do now), then "fork()" is entirely immaterial. The fork() can
have happened ages ago, that page is shared with other processes, and
any process writing to it - including very much the pinning one -
will cause a copy-on-write and get a copy of the page.

IOW, the current - and past - semantics for read pinning is that you
get a copy of the page, but any changes made by the pinning process
may OR MAY NOT show up in your pinned copy.

Again: doing a concurrent fork() is entirely immaterial, because the
page can have been made a read-only COW page by _previous_ fork()
calls (or KSM logic or whatever).

In other words: read pinning gets a page efficiently, but there is
zero guarantee of any future coherence with the process doing
subsequent writes.

That has always been the semantics, and FOLL_PIN didn't change that at
all. You may have had things that worked almost by accident (ie you
had made the page private by writing to it after the fork, so the read
pinning _effectively_ gave you a page that was coherent), but even
that was always accidental rather than anything else. Afaik it could
easily be broken by KSM, for example.

In other words, a read pin isn't really any different from a read GUP.
You get a reference to a page that is valid at the time of the page
lookup, and absolutely nothing more.

Now, the alternative is to make a read pin have the same guarantees as
a write pin, and say "this will stay attached to this MM until unmap
or unpin".

But honestly, that is largely going to _be_ the same as a write pin,
because it absolutely needs to do a page COW at the time of the
pinning to get that initial exclusive guarantee in the first place.
Without that initial exclusivity, you cannot avoid future COW events
breaking the wrong way.

So I think the "you get a reference to the page at the time of the
pin, and the page _may_ or may not change under you if the original
process writes to it" are really the only relevant semantics. Because
if you need those exclusive semantics, you might as well just use a
write pin.

The downside of a write pin is that it not only makes that page
exclusive, it also (a) marks it dirty and (b) requires write access.
That can matter particularly for shared mappings. So if you know
you're doing the pin on a shared mmap, then a read pin is the right
thing, because the page will stay around - not because of the VM it
happens in, but because of the underlying file mapping!

See the difference?

> The other question is, whether we should emphasize and document somewhere that
> MADV_DONTFORK is still (and should always be) the preferred way, because
> changes like this series can potentially encourage the other way.

I really suspect that the concurrent fork() case is fundamentally hard
to handle.

Is it impossible? No. Even without any real locking, we could change
the code to do a seqcount_t, for example. The fastgup code wouldn't
take a lock, but it would just fail and fall back to the slow code if
the sequence count fails.

So the copy_page_range() code would do a write count around the copy:

write_seqcount_begin(&mm->seq);
.. do the copy ..
write_seqcount_end(&mm->seq);

and the fast-gup code would do a

seq = raw_read_seqcount(&mm->seq);
if (seq & 1)
return -EAGAIN;

at the top, and do a

if (__read_seqcount_t_retry(&mm->seq, seq)) {
.. Uhhuh, that failed, drop the ref to the page again ..
return -EAGAIN;
}

after getting the pin reference.

We could make this conditional on FOLL_PIN, or maybe even a new flag
("FOLL_FORK_CONSISTENT").

So I think we can serialize with fork() without serializing each and every PTE.

If we want to and really need to.

Hmm?

Linus

2020-09-28 18:42:11

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 10:54:28AM -0700, Linus Torvalds wrote:
> On Mon, Sep 28, 2020 at 10:23 AM Peter Xu <[email protected]> wrote:
> >
> > Yes... Actually I am also thinking about the complete solution to cover
> > read-only fast-gups too, but now I start to doubt this, at least for the fork()
> > path. E.g. if we'd finally like to use pte_protnone() to replace the current
> > pte_wrprotect(), we'll be able to also block the read gups, but we'll suffer
> > the same degradation on normal fork()s, or even more. Seems unacceptable.
>
> So I think the real question about pinned read gups is what semantics
> they should have.
>
> Because honestly, I think we have two options:
>
> - the current "it gets a shared copy from the page tables"
>
> - the "this is an exclusive pin, and it _will_ follow the source VM
> changes, and never break"

The requirement depends on what the driver is doing. Consider a simple
ring-to-device scheme:

ring = mmap()
pin_user_pages(FOLL_LONGTERM)
ring[0] = [..]
trigger_kernel()

Sort of like io_uring. Here the kernel will pin the zero page and will
never see any user modifications to the buffer. We must do #2.

While something like read(O_DIRECT) which only needs the buffer to be
stable during a system call would be fine with #1 (and data
incoherence in general)

> In other words, a read pin isn't really any different from a read GUP.
> You get a reference to a page that is valid at the time of the page
> lookup, and absolutely nothing more.

Yes, so far all this new pin stuff has really focused on the write
side - the motivating issue was the set_page_dirty() oops

> But honestly, that is largely going to _be_ the same as a write pin,
> because it absolutely needs to do a page COW at the time of the
> pinning to get that initial exclusive guarantee in the first place.
> Without that initial exclusivity, you cannot avoid future COW events
> breaking the wrong way.

Right, I see places using FOLL_WRITE when they only need read. It is a
bit ugly, IMHO.

> The downside of a write pin is that it not only makes that page
> exclusive, it also (a) marks it dirty and (b) requires write access.

RDMA adds FOLL_FORCE because people complained they couldn't do
read-only transfers from .rodata - uglier still :\

> So the copy_page_range() code would do a write count around the copy:
>
> write_seqcount_begin(&mm->seq);
> .. do the copy ..
> write_seqcount_end(&mm->seq);

All of gup_fast and copy_mm could be wrapped in a seqcount so that
gup_fast always goes to the slow path if fork is concurrent.

That doesn't sound too expensive and avoids all the problems you
pointed with the WP scheme.

As you say fork & pin & threads is already questionable, so an
unconditional slow path on race should be OK.
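
Roughly like this, where mm->seq is a hypothetical new seqcount_t and the
call sites are only indicative (a sketch, not a tested patch):

    /* fork() side, around the page table copy in dup_mmap(): */
    write_seqcount_begin(&mm->seq);
    retval = copy_page_range(dst_mm, src_mm, vma);
    write_seqcount_end(&mm->seq);

    /* gup_fast side: */
    seq = raw_read_seqcount(&mm->seq);
    if (seq & 1)
        goto slow_path;             /* fork in progress, don't even try */
    /* ... lockless walk, grabbing pin references ... */
    if (read_seqcount_retry(&mm->seq, seq)) {
        unpin_user_pages(pages, nr_pinned);
        goto slow_path;             /* raced with fork, redo it slowly */
    }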

> If we want to and really need to.

I would like to see some reasonable definition for the
read-side. Having drivers do FOLL_FORCE | FOLL_WRITE for read is just
confusing and weird for a driver-facing API.

It may also help focus the remaining discussion for solving
set_page_dirty() if pin_user_pages() had a solid definition.

I prefer the version where read pin and write pin are symmetric. The
PTE in the MM should not change once pinned.

This is useful and if something only needs the original GUP semantics
then GUP is still there.

Jason

2020-09-28 19:32:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 11:39 AM Jason Gunthorpe <[email protected]> wrote:
>
> I prefer the version where read pin and write pin are symmetric. The
> PTE in the MM should not change once pinned.

The thing is, I don't really see how to do that.

Right now the write pin fastpath part depends on the PTE being
writable. That implies "this VM has access to this page".

For a read pin there simply is no other way to do it.

So we'd basically say "fast read pin only works on writable pages",
and then we'd have to go to the slow path if it isn't dirty and
writable.

And the slow path would then do whatever COW is required, but it
wouldn't mark the result dirty (and in the case of a shared mapping,
couldn't mark it writable).

So a read pin action would basically never work for the fast-path for
a few cases, notably a shared read-only mapping - because we could
never mark it in the page tables as "fast pin accessible"

See the problem? A read-only pin is fundamentally different from a
write one, because a write one has that fundamental mark of "I have
private access to this page" in ways a read one simply does not.

So we could make the requirement be that a pinned page is either

(a) from a shared mapping (so the pinning depends on the page cache
association). But we can't test this in the fast path.

or

(b) for a private mapping we require page_mapcount() == 1 and that
it's writable.

but since (a) requires the mapping type, we can't check in the fast
path - we only have the PTE and the page. So the fast-path can only
"emulate" it by that "writable", which is a proper subset of (a) or
(b), but it's not something that is in any way guaranteed.

End result: FOLL_PIN would really only work on private pages, and only
if you don't want to share with the page cache.

And it would basically have no advantages over a writable FOLL_PIN. It
would break the association with any backing store for private pages,
because otherwise it can't follow future writes.

Linus

2020-09-28 19:45:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 11:39 AM Jason Gunthorpe <[email protected]> wrote:
>
> All of gup_fast and copy_mm could be wrapped in a seqcount so that
> gup_fast always goes to the slow path if fork is concurrent.
>
> That doesn't sound too expensive and avoids all the problems you
> pointed with the WP scheme.

Ok, I'll start by just removing the "write protect early trick". It
really doesn't work reliably anyway due to memory ordering, and while
I think the dirty bit is ok (and we could probably also set it
unconditionally to make _sure_ it's not dropped like Peter says) it
just makes me convinced it's the wrong approach.

Fixing it at a per-pte level is too expensive, and yeah, if we really
really care about the fork consistency, the sequence count approach
should be much simpler and more obvious.

So I'll do the pte wrprotect/restore removal. Anybody willing to do
and test the sequence count approach?

Linus

2020-09-28 19:51:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 12:36 PM Linus Torvalds
<[email protected]> wrote:
>
> So I'll do the pte wrprotect/restore removal. Anybody willing to do
> and test the sequence count approach?

So the wrprotect removal is trivial, with most of it being about the comments.

However, when I look at this, I am - once again - tempted to just add a

if (__page_mapcount(page) > 1)
return 1;

there too. Because we know it's a private mapping (shared mappings we
checked for with the "is_cow_mapping()" earlier), and the only case we
really care about is the one where the page is only mapped in the
current mm (because that's what a write pinning will have done, and as
mentioned, a read pinning doesn't do anything wrt fork() right now
anyway).

So if it's mapped in another mm, the COW clearly hasn't been broken by
a pin, and a read pinned page had already gone through a fork.
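
In context, the checks being described would be roughly along these lines (an
illustration of the idea only, not the attached patch):

    /* We only get here for private (COW) mappings. */
    if (likely(!READ_ONCE(src_mm->has_pinned)))
        return 1;       /* nothing was ever pinned: share as usual */
    if (__page_mapcount(page) > 1)
        return 1;       /* mapped in another mm too: no pin broke this COW */
    if (!page_maybe_dma_pinned(page))
        return 1;
    /* pinned and exclusive to this mm: give the child its own copy */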

But the more I look at this code, the more I go "ok, I want somebody
to actually test this with the rdma case".

So I'll attach my suggested patch, but I won't actually commit it. I'd
really like to have this tested, possibly _together_ with the sequence
count addition..

Linus



2020-09-28 23:30:54

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 12:50:03PM -0700, Linus Torvalds wrote:

> But the more I look at this code, the more I go "ok, I want somebody
> to actually test this with the rdma case".

I suspect it will be OK as our immediate test suite that touches this
will be using malloc memory.

It seems some mmap trick will be needed to get higher map counts?

> So I'll attach my suggested patch, but I won't actually commit it. I'd
> really like to have this tested, possibly _together_ with the sequence
> count addition..

Ok, we will do this once the holidays are over. If Peter doesn't take
it I can look at the seqcount too.

Thanks,
Jason

2020-09-29 00:01:41

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 12:29:55PM -0700, Linus Torvalds wrote:
> So a read pin action would basically never work for the fast-path for
> a few cases, notably a shared read-only mapping - because we could
> never mark it in the page tables as "fast pin accessible"

Agree, I was assuming we'd lose more of the fast path to create this
thing. It would only still be fast if the pages are already writable.

I strongly suspect the case of DMA'ing actual read-only data is the
minority here; the usual case is probably filling a writable buffer
with something interesting and then triggering the DMA. The DMA just
happens to be a read from the driver's point of view, so the driver
doesn't set FOLL_WRITE.

Looking at the FOLL_LONGTERM users, which should be the banner use case
for this, there are very few that do a read pin and use fast.

> And it would basically have no advantages over a writable FOLL_PIN. It
> would break the association with any backing store for private pages,
> because otherwise it can't follow future writes.

Yes, I wasn't clear enough, I'm looking at this from a driver API
perspective. We have this API

pin_user_pages(FOLL_LONGTERM | FOLL_WRITE)

Which now has no decoherence issues with the MM. If the driver
naturally wants to do read-only access it might be tempted to do:

pin_user_pages(FOLL_LONGTERM)

Which is now NOT the same thing and brings all these really surprising
mm coherence issues back.

The driver author might discover this in testing, then be tempted to
hardwire 'FOLL_LONGTERM | FOLL_WRITE'. Now their uAPI is broken for
things that are actually read-only like .rodata.

If they discover this then they add a FOLL_FORCE to the mix.

When someone comes along to read this later it is a big leap to see
pin_user_pages(FOLL_LONGTERM | FOLL_FORCE | FOLL_WRITE)

and realize this is code for "read only mapping". At least it took me
a while to decipher it the first time I saw it.

I think this is really hard to use and ugly. My thinking has been to
just stick:

if (flags & FOLL_LONGTERM)
flags |= FOLL_FORCE | FOLL_WRITE

In pin_user_pages(). It would make the driver API cleaner. If we can
do a bit better somehow by not COW'ing for certain VMA's as you
explained then all the better, but not my primary goal..
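
Concretely, something like this at the top of pin_user_pages() (signature as
in the 5.9-era code; just a sketch of the suggestion, not a real patch):

    long pin_user_pages(unsigned long start, unsigned long nr_pages,
                        unsigned int gup_flags, struct page **pages,
                        struct vm_area_struct **vmas)
    {
        /*
         * Long-term pins want a page that is exclusively ours, so
         * break the COW up front even for read-only users.
         */
        if (gup_flags & FOLL_LONGTERM)
            gup_flags |= FOLL_FORCE | FOLL_WRITE;

        .. the existing FOLL_PIN sanity checks and gup call ..
    }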

Basically, I think if a driver is using FOLL_LONGTERM | FOLL_PIN we
should guarantee that driver a consistent MM and take the gup_fast
performance hit to do it.

AFAICT the giant whack of other cases not using FOLL_LONGTERM really
shouldn't care about read-decoherence. For those cases the user should
really not be racing write's with data under read-only pin, and the
new COW logic looks like it solves the other issues with this.

I know Jann/John have been careful to not have special behaviors for
the DMA case, but I think it makes sense here. It is actually different.

Jason

2020-09-29 00:22:06

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On 9/28/20 4:57 PM, Jason Gunthorpe wrote:
> On Mon, Sep 28, 2020 at 12:29:55PM -0700, Linus Torvalds wrote:
...
> I think this is really hard to use and ugly. My thinking has been to
> just stick:
>
> if (flags & FOLL_LONGTERM)
> flags |= FOLL_FORCE | FOLL_WRITE
>
> In pin_user_pages(). It would make the driver API cleaner. If we can

+1, yes. The other choices so far are, as you say, really difficult to figure
out.

> do a bit better somehow by not COW'ing for certain VMA's as you
> explained then all the better, but not my primary goal..
>
> Basically, I think if a driver is using FOLL_LONGTERM | FOLL_PIN we
> should guarentee that driver a consistent MM and take the gup_fast
> performance hit to do it.
>
> AFAICT the giant wack of other cases not using FOLL_LONGTERM really
> shouldn't care about read-decoherence. For those cases the user should
> really not be racing write's with data under read-only pin, and the
> new COW logic looks like it solves the other issues with this.

I hope this doesn't kill the seqcount() idea, though. That was my favorite
part of the discussion, because it neatly separates out the two racing domains
(fork, gup/pup) and allows easy reasoning about them--without really impacting
performance.

Truly elegant. We should go there.

>
> I know Jann/John have been careful to not have special behaviors for
> the DMA case, but I think it makes sense here. It is actually different.
>

I think that makes sense. Everyone knew that DMA/FOLL_LONGTERM call sites
were at least potentially special, despite the spirited debates in at least
two conferences about the meaning and implications of "long term". :)

And here we are seeing an example of such a special case, which I think is
natural enough.


thanks,
--
John Hubbard
NVIDIA

2020-09-29 00:32:47

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 07:51:07PM -0300, Jason Gunthorpe wrote:
> > So I'll attach my suggested patch, but I won't actually commit it. I'd
> > really like to have this tested, possibly _together_ with the sequence
> > count addition..
>
> Ok, we will do this once the holidays are over. If Peter doesn't take
> it I can look at the seqcount too.

Please feel free to. I can definitely test against the vfio test I used to
run, though I assume the rdma test would be more ideal. Just let me know if
there's anything else I can help with.

Thanks,

--
Peter Xu

2020-10-08 06:17:26

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned

On Mon, Sep 28, 2020 at 12:50:03PM -0700, Linus Torvalds wrote:
> On Mon, Sep 28, 2020 at 12:36 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > So I'll do the pte wrprotect/restore removal. Anybody willing to do
> > and test the sequence count approach?
>
> So the wrprotect removal is trivial, with most of it being about the comments.
>
> However, when I look at this, I am - once again - tempted to just add a
>
> if (__page_mapcount(page) > 1)
> return 1;
>
> there too. Because we know it's a private mapping (shared mappings we
> checked for with the "is_cow_mapping()" earlier), and the only case we
> really care about is the one where the page is only mapped in the
> current mm (because that's what a write pinning will have done, and as
> mentioned, a read pinning doesn't do anything wrt fork() right now
> anyway).
>
> So if it's mapped in another mm, the COW clearly hasn't been broken by
> a pin, and a read pinned page had already gone through a fork.
>
> But the more I look at this code, the more I go "ok, I want somebody
> to actually test this with the rdma case".
>
> So I'll attach my suggested patch, but I won't actually commit it. I'd
> really like to have this tested, possibly _together_ with the sequence
> count addition..

Hi Linus,

We tested the suggested patch over the last two weeks in our nightly
regressions and didn't see any new failures. It looks like it is safe to use,
but it would be better to take the patch during/after the merge window to
minimize the risk of delaying v5.9.

Thanks

>
> Linus