LinuxLists.cc - [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow

2023-07-27 22:19:39

Subject: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

This is my proposal on how to handle the fallout of 474098edac26
("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") where I
accidentially missed that follow_page() and smaps implicitly kept the
FOLL_NUMA flag clear by *not* setting it if FOLL_FORCE is absent, to
not trigger faults on PROT_NONE-mapped PTEs.

(maybe it's just me who considers that confusing)

Patch #1 is the original fix proposal, which patch #3 cleans up. Patch #2
is another fix for the issue on the follow_page() level pointed out by
Peter. Patch #4 documents the FOLL_FORCE situation.

Peter prefers a revert of that commit [1], I disagree and am still happy to
see FOLL_NUMA gone that implicitly relied on FOLL_FORCE.

An alternative might be to use an internal FOLL_PROTNONE or
FOLL_NO_PROTNONE flag in patch #3, not so sure about that.

Did a quick sanity test, will do more testing tomorrow.

[1] https://lkml.kernel.org/r/ZMK+jSDgOmJKySTr@x1n

Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: liubo <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>

David Hildenbrand (3):
mm/gup: Make follow_page() succeed again on PROT_NONE PTEs/PMDs
smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd()
mm/gup: document FOLL_FORCE behavior

liubo (1):
smaps: Fix the abnormal memory statistics obtained through
/proc/pid/smaps

fs/proc/task_mmu.c | 3 +--
include/linux/mm_types.h | 25 ++++++++++++++++++++++++-
mm/gup.c | 10 +++++++++-
3 files changed, 34 insertions(+), 4 deletions(-)

--
2.41.0

2023-07-27 22:22:44

by David Hildenbrand

[permalink] [raw]

Subject: [PATCH v1 4/4] mm/gup: document FOLL_FORCE behavior

As suggested by Peter, let's document FOLL_FORCE handling and make it
clear that without FOLL_FORCE, we will always trigger NUMA-hinting
faults when stumbling over a PROT_NONE-mapped PTE.

Also add a comment regarding follow_page() and its interaction with
FOLL_FORCE.

Let's place the doc next to the definition, where it certainly can't be
missed.

Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/mm_types.h | 25 ++++++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2fa6fcc740a1..96cf78686c29 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1243,7 +1243,30 @@ enum {
FOLL_GET = 1 << 1,
/* give error on hole if it would be zero */
FOLL_DUMP = 1 << 2,
- /* get_user_pages read/write w/o permission */
+ /*
+ * Make get_user_pages() and friends ignore some VMA+PTE permissions.
+ *
+ * This flag should primarily only be used by ptrace and some
+ * GUP-internal functionality, such as for mlock handling.
+ *
+ * Without this flag, these functions always trigger page faults
+ * (such as NUMA hinting faults) when stumbling over a
+ * PROT_NONE-mapped PTE.
+ *
+ * !FOLL_WRITE: succeed even if the PTE is PROT_NONE
+ * * Rejected if the VMA is currently not readable and it cannot
+ * become readable
+ *
+ * FOLL_WRITE: succeed even if the PTE is not writable.
+ * * Rejected if the VMA is currently not writable and
+ * * it is a hugetlb mapping
+ * * it is not a COW mapping that could become writable
+ *
+ * Note: follow_page() does not accept FOLL_FORCE. Historically,
+ * follow_page() behaved similar to FOLL_FORCE without FOLL_WRITE:
+ * succeed even if the PTE is PROT_NONE and FOLL_WRITE is not set.
+ * However, VMA permissions are not checked.
+ */
FOLL_FORCE = 1 << 3,
/*
* if a disk transfer is needed, start the IO and return without waiting
--
2.41.0

2023-07-28 17:14:18

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

On Thu, 27 Jul 2023 at 14:28, David Hildenbrand <[email protected]> wrote:
>
> This is my proposal on how to handle the fallout of 474098edac26
> ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") where I
> accidentially missed that follow_page() and smaps implicitly kept the
> FOLL_NUMA flag clear by *not* setting it if FOLL_FORCE is absent, to
> not trigger faults on PROT_NONE-mapped PTEs.

Ugh.

I hate how it uses FOLL_FORCE that is inherently scary.

Why do we have that "gup_can_follow_protnone()" logic AT ALL?

Couldn't we just get rid of that disgusting thing, and just say that
GUP (and follow_page()) always just ignores NUMA hinting, and always
just follows protnone?

We literally used to have this:

if (!(gup_flags & FOLL_FORCE))
gup_flags |= FOLL_NUMA;

ie we *always* set FOLL_NUMA for any sane situation. FOLL_FORCE should
be the rare crazy case.

The original reason for not setting FOLL_NUMA all the time is
documented in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
page faults from gup/gup_fast") from way back in 2012:

* If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
* would be called on PROT_NONE ranges. We must never invoke
* handle_mm_fault on PROT_NONE ranges or the NUMA hinting
* page faults would unprotect the PROT_NONE ranges if
* _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
* bitflag. So to avoid that, don't set FOLL_NUMA if
* FOLL_FORCE is set.

but I don't think the original reason for this is *true* any more.

Because then two years later in 2014, in commit c46a7c817e66 ("x86:
define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
Mel made the code able to distinguish between PROT_NONE and NUMA
pages, and he changed the comment above too.

But I get the very very strong feeling that instead of changing the
comment, he should have actually removed the comment and the code.

So I get the strong feeling that all these FOLL_NUMA games should just
go away. You removed the FOLL_NUMA bit, but you replaced it with using
FOLL_FORCE.

So rather than make this all even more opaque and make it even harder
to figure out why we have that code in the first place, I think it
should all just be removed.

The original reason for FOLL_NUMA simply does not exist any more. We
know exactly when a page is marked for NUMA faulting, and we should
simply *ignore* it for GUP and follow_page().

I think we should treat a NUMA-faulting page as just being present
(and not NUMA-fault it).

Am I missing something?

Linus

2023-07-28 17:54:20

by David Hildenbrand

[permalink] [raw]

Subject: Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

On 28.07.23 18:18, Linus Torvalds wrote:
> On Thu, 27 Jul 2023 at 14:28, David Hildenbrand <[email protected]> wrote:
>>
>> This is my proposal on how to handle the fallout of 474098edac26
>> ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") where I
>> accidentially missed that follow_page() and smaps implicitly kept the
>> FOLL_NUMA flag clear by *not* setting it if FOLL_FORCE is absent, to
>> not trigger faults on PROT_NONE-mapped PTEs.
>
> Ugh.

I was hoping for that reaction, with the assumption that we would get
something cleaner :)

>
> I hate how it uses FOLL_FORCE that is inherently scary.

I hate FOLL_FORCE, but I hate FOLL_NUMA even more, because to me it
is FOLL_FORCE in disguise (currently and before 474098edac26, if
FOLL_FORCE is set, FOLL_NUMA won't be set and the other way around).

>
> Why do we have that "gup_can_follow_protnone()" logic AT ALL?

That's what I was hoping for.

>
> Couldn't we just get rid of that disgusting thing, and just say that
> GUP (and follow_page()) always just ignores NUMA hinting, and always
> just follows protnone?
>
> We literally used to have this:
>
> if (!(gup_flags & FOLL_FORCE))
> gup_flags |= FOLL_NUMA;
>
> ie we *always* set FOLL_NUMA for any sane situation. FOLL_FORCE should
> be the rare crazy case.

Yes, but my point would be that we now spell that "rare crazy case"
out for follow_page().

If you're talking about patch #1, I agree, therefore patch #3 to
avoid all that nasty FOLL_FORCE handling in GUP callers.

But yeah, if we can avoid all that, great.

>
> The original reason for not setting FOLL_NUMA all the time is
> documented in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
> page faults from gup/gup_fast") from way back in 2012:
>
> * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
> * would be called on PROT_NONE ranges. We must never invoke
> * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
> * page faults would unprotect the PROT_NONE ranges if
> * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
> * bitflag. So to avoid that, don't set FOLL_NUMA if
> * FOLL_FORCE is set.

In handle_mm_fault(), we never call do_numa_page() if
!vma_is_accessible(). Same for do_huge_pmd_numa_page().

So, if we would ever end up triggering a page fault on
mprotect(PROT_NONE) ranges (i.e., via FOLL_FORCE), we
would simply do nothing.

At least that's the hope, I'll take a closer look just to make
sure we're good on all call paths.

>
> but I don't think the original reason for this is *true* any more.
>
> Because then two years later in 2014, in commit c46a7c817e66 ("x86:
> define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
> Mel made the code able to distinguish between PROT_NONE and NUMA
> pages, and he changed the comment above too.

CCing Mel.

I remember that pte_protnone() can only distinguished between
NUMA vs. actual mprotect(PROT_NONE) by looking at the VMA -- vma_is_accessible().

Indeed, include/linux/pgtable.h:

/*
* Technically a PTE can be PROTNONE even when not doing NUMA balancing but
* the only case the kernel cares is for NUMA balancing and is only ever set
* when the VMA is accessible. For PROT_NONE VMAs, the PTEs are not marked
* _PAGE_PROTNONE so by default, implement the helper as "always no". It
* is the responsibility of the caller to distinguish between PROT_NONE
* protections and NUMA hinting fault protections.
*/

>
> But I get the very very strong feeling that instead of changing the
> comment, he should have actually removed the comment and the code.
>
> So I get the strong feeling that all these FOLL_NUMA games should just
> go away. You removed the FOLL_NUMA bit, but you replaced it with using
> FOLL_FORCE.
>
> So rather than make this all even more opaque and make it even harder
> to figure out why we have that code in the first place, I think it
> should all just be removed.

At least to me, spelling FOLL_FORCE in follow_page() now out is much
less opaque then getting that implied by lack of FOLL_NUMA.

>
> The original reason for FOLL_NUMA simply does not exist any more. We
> know exactly when a page is marked for NUMA faulting, and we should
> simply *ignore* it for GUP and follow_page().
>
> I think we should treat a NUMA-faulting page as just being present
> (and not NUMA-fault it).
>
> Am I missing something?

There was the case for "FOLL_PIN represents application behavior and
should trigger NUMA faults", but I guess that can be ignored.

But it would be much better to just remove all that if we can.

Let me look into some details.

Thanks Linus!

--
Cheers,

David / dhildenb

2023-07-28 18:39:30

by David Hildenbrand

[permalink] [raw]

Subject: Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

[...]

>
> There was the case for "FOLL_PIN represents application behavior and
> should trigger NUMA faults", but I guess that can be ignored.
>
> But it would be much better to just remove all that if we can.
>
> Let me look into some details.

I just stumbled over the comment from Mel in follow_trans_huge_pmd():

/* Full NUMA hinting faults to serialise migration in fault paths */

It dates back to

commit 2b4847e73004c10ae6666c2e27b5c5430aed8698
Author: Mel Gorman <[email protected]>
Date: Wed Dec 18 17:08:32 2013 -0800

mm: numa: serialise parallel get_user_page against THP migration

Base pages are unmapped and flushed from cache and TLB during normal
page migration and replaced with a migration entry that causes any
parallel NUMA hinting fault or gup to block until migration completes.

THP does not unmap pages due to a lack of support for migration entries
at a PMD level. This allows races with get_user_pages and
get_user_pages_fast which commit 3f926ab945b6 ("mm: Close races between
THP migration and PMD numa clearing") made worse by introducing a
pmd_clear_flush().

This patch forces get_user_page (fast and normal) on a pmd_numa page to
go through the slow get_user_page path where it will serialise against
THP migration and properly account for the NUMA hinting fault. On the
migration side the page table lock is taken for each PTE update.

We nowadays do have migration entries at PMD level -- and setting FOLL_FORCE
could similarly trigger such a race.

So I suspect we're good.

--
Cheers,

David / dhildenb

2023-07-28 21:23:20

by Peter Xu

[permalink] [raw]

Subject: Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

Hi, Linus,

On Fri, Jul 28, 2023 at 09:18:45AM -0700, Linus Torvalds wrote:
> The original reason for FOLL_NUMA simply does not exist any more. We
> know exactly when a page is marked for NUMA faulting, and we should
> simply *ignore* it for GUP and follow_page().
>
> I think we should treat a NUMA-faulting page as just being present
> (and not NUMA-fault it).

But then does it means that any gup-only user will have numa balancing
completely disabled? Since as long as the page will only be accessed by
GUP, the numa balancing will never trigger anyway.. I think KVM is
manipulating guest pages just like that. Not sure whether it means it'll
void the whole numa effort there.

If we allow that GUP from happening (taking protnone as present) I assume
it'll also stop any further numa balancing on this very page to trigger
too, because even if some page fault handler triggered on this protnone
page later that is not GUP anymore, when it wants to migrate the page to
the other numa node it'll see someone is holding a reference on it already,
and then we should give up the balancing.

So to me FOLL_NUMA (or any identifier like it.. just to describe the
caller's need; some may really just want to fetch the pfn/page) still makes
sense. But maybe I totally misunderstood above..

Thanks,

--
Peter Xu

2023-07-28 22:05:46

by David Hildenbrand

[permalink] [raw]

Subject: Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

On 28.07.23 19:30, David Hildenbrand wrote:
> On 28.07.23 18:18, Linus Torvalds wrote:
>> On Thu, 27 Jul 2023 at 14:28, David Hildenbrand <[email protected]> wrote:
>>>
>>> This is my proposal on how to handle the fallout of 474098edac26
>>> ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") where I
>>> accidentially missed that follow_page() and smaps implicitly kept the
>>> FOLL_NUMA flag clear by *not* setting it if FOLL_FORCE is absent, to
>>> not trigger faults on PROT_NONE-mapped PTEs.
>>
>> Ugh.
>
> I was hoping for that reaction, with the assumption that we would get
> something cleaner :)
>
>>
>> I hate how it uses FOLL_FORCE that is inherently scary.
>
> I hate FOLL_FORCE, but I hate FOLL_NUMA even more, because to me it
> is FOLL_FORCE in disguise (currently and before 474098edac26, if
> FOLL_FORCE is set, FOLL_NUMA won't be set and the other way around).
>
>>
>> Why do we have that "gup_can_follow_protnone()" logic AT ALL?
>
> That's what I was hoping for.
>
>>
>> Couldn't we just get rid of that disgusting thing, and just say that
>> GUP (and follow_page()) always just ignores NUMA hinting, and always
>> just follows protnone?
>>
>> We literally used to have this:
>>
>> if (!(gup_flags & FOLL_FORCE))
>> gup_flags |= FOLL_NUMA;
>>
>> ie we *always* set FOLL_NUMA for any sane situation. FOLL_FORCE should
>> be the rare crazy case.
>
> Yes, but my point would be that we now spell that "rare crazy case"
> out for follow_page().
>
> If you're talking about patch #1, I agree, therefore patch #3 to
> avoid all that nasty FOLL_FORCE handling in GUP callers.
>
> But yeah, if we can avoid all that, great.
>
>>
>> The original reason for not setting FOLL_NUMA all the time is
>> documented in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
>> page faults from gup/gup_fast") from way back in 2012:
>>
>> * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
>> * would be called on PROT_NONE ranges. We must never invoke
>> * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
>> * page faults would unprotect the PROT_NONE ranges if
>> * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
>> * bitflag. So to avoid that, don't set FOLL_NUMA if
>> * FOLL_FORCE is set.
>
>
> In handle_mm_fault(), we never call do_numa_page() if
> !vma_is_accessible(). Same for do_huge_pmd_numa_page().
>
> So, if we would ever end up triggering a page fault on
> mprotect(PROT_NONE) ranges (i.e., via FOLL_FORCE), we
> would simply do nothing.
>
> At least that's the hope, I'll take a closer look just to make
> sure we're good on all call paths.
>
>>
>> but I don't think the original reason for this is *true* any more.
>>
>> Because then two years later in 2014, in commit c46a7c817e66 ("x86:
>> define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
>> Mel made the code able to distinguish between PROT_NONE and NUMA
>> pages, and he changed the comment above too.
>
> CCing Mel.
>
> I remember that pte_protnone() can only distinguished between
> NUMA vs. actual mprotect(PROT_NONE) by looking at the VMA -- vma_is_accessible().
>
> Indeed, include/linux/pgtable.h:
>
> /*
> * Technically a PTE can be PROTNONE even when not doing NUMA balancing but
> * the only case the kernel cares is for NUMA balancing and is only ever set
> * when the VMA is accessible. For PROT_NONE VMAs, the PTEs are not marked
> * _PAGE_PROTNONE so by default, implement the helper as "always no". It
> * is the responsibility of the caller to distinguish between PROT_NONE
> * protections and NUMA hinting fault protections.
> */
>
>>
>> But I get the very very strong feeling that instead of changing the
>> comment, he should have actually removed the comment and the code.
>>
>> So I get the strong feeling that all these FOLL_NUMA games should just
>> go away. You removed the FOLL_NUMA bit, but you replaced it with using
>> FOLL_FORCE.
>>
>> So rather than make this all even more opaque and make it even harder
>> to figure out why we have that code in the first place, I think it
>> should all just be removed.
>
> At least to me, spelling FOLL_FORCE in follow_page() now out is much
> less opaque then getting that implied by lack of FOLL_NUMA.
>
>>
>> The original reason for FOLL_NUMA simply does not exist any more. We
>> know exactly when a page is marked for NUMA faulting, and we should
>> simply *ignore* it for GUP and follow_page().
>>
>> I think we should treat a NUMA-faulting page as just being present
>> (and not NUMA-fault it).
>>
>> Am I missing something?
>
> There was the case for "FOLL_PIN represents application behavior and
> should trigger NUMA faults", but I guess that can be ignored.

Re-reading commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page
faults from gup/gup_fast"), it actually does spell out an important case
that we should handle:

"KVM secondary MMU page faults will trigger the NUMA hinting page
faults through gup_fast -> get_user_pages -> follow_page ->
handle_mm_fault."

That is still the case today (and important). Not triggering NUMA
hinting faults would degrade KVM.

Hmm. So three alternatives I see:

1) Use FOLL_FORCE in follow_page() to unconditionally disable protnone
checks. Alternatively, have an internal FOLL_NO_PROTNONE flag if we
don't like that.

2) Revert the commit and reintroduce unconditional FOLL_NUMA without
FOLL_FORCE.

3) Have a FOLL_NUMA that callers like KVM can pass.

Thoughts?

--
Cheers,

David / dhildenb

2023-07-28 22:27:58

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

On Fri, 28 Jul 2023 at 12:39, Peter Xu <[email protected]> wrote:
>
> But then does it means that any gup-only user will have numa balancing
> completely disabled?

Why would we ever care about a GUP-only user?

Who knows where the actual access is coming from? It might be some
device that is on a different node entirely.

And even if the access is local from the CPU, it

(a) might have happened after we moved somewhere else

(b) who cares about the extra possible NUMA overhead when we just
wasted *thousands* of cycles on GUP?

So NUMA balancing really doesn't seem to make sense for GUP anyway as
far as I can see.

Now, the other side of the same thing is that (a) NUMA faulting should
be fairly rare and (b) once you do GUP, who cares anyway, so you can
also argue that "once you do GUP you might as well NUMA-fault, because
performance simply isn't an issue".

But I really think the real argument is "once you do GUP, numa
faulting is just crazy".

I think what happened is

- the GUP code couldn't tell NUMA and actual PROTNONE apart

- so the GUP code would punch through PROTNONE even when it shouldn't

- so people added FOLL_NUMA to say "I don't want you to punch
through, I want the NUMA fault"

- but then FOLL_FORCE ends up meaning that you actually *do* want to
punch through - regardless of NUMA or not - and now the two got tied
together, and we end up with nonsensical garbage like

if (!(gup_flags & FOLL_FORCE))
gup_flags |= FOLL_NUMA;

to say "oh, actually, to avoid punching through when we shouldn't,
we should NUMA fault".

so we ended up with that case where even if YOU DIDN'T CARE AT ALL,
you got FOLL_NUMA just so that you wouldn't punch through.

And now we're in the situation that we've confused FOLL_FORCE and
FOLL_NUMA, even though they have absolutely *nothing* to do with each
other, except for a random implementation detail about punching
through incorrectly that isn't even relevant any more.

I really think FOLL_NUMA should just go away. And that FOLL_FORCE
replacement for it is just wrong. If you *don't* do something without
FOLL_FORCE, you damn well shouldn't do it just because FOLL_FORCE is
set.

The *only* semantic meaning FOLL_FORCE should have is that it
overrides the vma protections for debuggers (in a very limited
manner). It should *not* affect any NUMA faulting logic in any way,
shape, or form.

Linus

2023-08-02 12:11:52

by Mel Gorman

[permalink] [raw]

Subject: Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout

On Fri, Jul 28, 2023 at 07:30:33PM +0200, David Hildenbrand wrote:
> On 28.07.23 18:18, Linus Torvalds wrote:
> > On Thu, 27 Jul 2023 at 14:28, David Hildenbrand <[email protected]> wrote:
> > >
> > > This is my proposal on how to handle the fallout of 474098edac26
> > > ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") where I
> > > accidentially missed that follow_page() and smaps implicitly kept the
> > > FOLL_NUMA flag clear by *not* setting it if FOLL_FORCE is absent, to
> > > not trigger faults on PROT_NONE-mapped PTEs.
> >
> > Ugh.
>
> I was hoping for that reaction, with the assumption that we would get
> something cleaner :)
>
> >
> > I hate how it uses FOLL_FORCE that is inherently scary.
>
> I hate FOLL_FORCE, but I hate FOLL_NUMA even more, because to me it
> is FOLL_FORCE in disguise (currently and before 474098edac26, if
> FOLL_FORCE is set, FOLL_NUMA won't be set and the other way around).
>

FOLL_NUMA being conflated with FOLL_FORCE is almost certainly a historical
accident.

> >
> > Why do we have that "gup_can_follow_protnone()" logic AT ALL?
>
> That's what I was hoping for.
>
> >
> > Couldn't we just get rid of that disgusting thing, and just say that
> > GUP (and follow_page()) always just ignores NUMA hinting, and always
> > just follows protnone?
> >
> > We literally used to have this:
> >
> > if (!(gup_flags & FOLL_FORCE))
> > gup_flags |= FOLL_NUMA;
> >
> > ie we *always* set FOLL_NUMA for any sane situation. FOLL_FORCE should
> > be the rare crazy case.
>
> Yes, but my point would be that we now spell that "rare crazy case"
> out for follow_page().
>
> If you're talking about patch #1, I agree, therefore patch #3 to
> avoid all that nasty FOLL_FORCE handling in GUP callers.
>
> But yeah, if we can avoid all that, great.
>
> >
> > The original reason for not setting FOLL_NUMA all the time is
> > documented in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
> > page faults from gup/gup_fast") from way back in 2012:
> >
> > * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
> > * would be called on PROT_NONE ranges. We must never invoke
> > * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
> > * page faults would unprotect the PROT_NONE ranges if
> > * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
> > * bitflag. So to avoid that, don't set FOLL_NUMA if
> > * FOLL_FORCE is set.
>
>
> In handle_mm_fault(), we never call do_numa_page() if
> !vma_is_accessible(). Same for do_huge_pmd_numa_page().
>
> So, if we would ever end up triggering a page fault on
> mprotect(PROT_NONE) ranges (i.e., via FOLL_FORCE), we
> would simply do nothing.
>
> At least that's the hope, I'll take a closer look just to make
> sure we're good on all call paths.
>
> >
> > but I don't think the original reason for this is *true* any more.
> >
> > Because then two years later in 2014, in commit c46a7c817e66 ("x86:
> > define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
> > Mel made the code able to distinguish between PROT_NONE and NUMA
> > pages, and he changed the comment above too.
>
> CCing Mel.
>
> I remember that pte_protnone() can only distinguished between
> NUMA vs. actual mprotect(PROT_NONE) by looking at the VMA -- vma_is_accessible().
>

Ok, as usual, I'm far behind and this thread massive but I'll respond
to this part before trying to digest the history of this and the current
implementation.

To the best of my recollection, FOLL_NUMA used to be a correctness issue
but that should no longer true. Initially, it was to prevent mixing up
"PROT_NONE" that was for NUMA hinting and "PROT_NONE" due to VMA
protections. Now the bits are different so this case should be
avoidable.

Later it was still a different correctness issue because PMD migration had
a hacky implementation without migration entries and a GUP could find a
page that was being collapsed and had to be serialised. That should also
now be avoidable.

At some point, FOLL_FORCE and FOLL_NUMA got conflated but they really should
not be related even if they are by accident. FOLL_FORCE (e.g. ptrace)
may have to process the fault and make the page resident and accessible
regardless of any other consequences. FOLL_NUMA ideally should be much
more specific. If the calling context only cares about the struct page
(e.g. smaps) then it's ok to get a reference to the page. If necessary,
it could clear the protection and lose the hinting fault although it's less
than ideal. Just needing the struct page for informational purposes though
should not be treated as a NUMA hinting fault because it has nothing to
do with the tasks memory reference behaviour.

A variant of FOLL_NUMA (FOLL_NUMA_HINT?) may still be required to indicate
the calling context is accessing the page for reasons that are equivalent
to a real memory access from a CPU related to the task mapping the page.
I didn't check but KVM may be an example of this when dealing with some MMU
faults as the page is being looked up on behalf of the task and presumably
from the same CPU the task was running run. Something like reading smaps
only needs the struct page but it should not be treated as a NUMA hinting
fault as the access has nothing to do with the task mapping the page.

> > The original reason for FOLL_NUMA simply does not exist any more. We
> > know exactly when a page is marked for NUMA faulting, and we should
> > simply *ignore* it for GUP and follow_page().
> >

I'd be wary of completely ignoring it if there is any known calling context
that is equivalent to a memory access and the hinting fault should be
processed -- KVM may be an example.

--
Mel Gorman
SUSE Labs