LinuxLists.cc - [PATCH] smaps: Fix the abnormal memory statistics obtained through /proc/pid/smaps

2023-07-26 07:59:45

Subject: [PATCH] smaps: Fix the abnormal memory statistics obtained through /proc/pid/smaps

In commit 474098edac26 ("mm/gup: replace FOLL_NUMA by
gup_can_follow_protnone()"), FOLL_NUMA was removed and replaced by
the gup_can_follow_protnone interface.

However, for the case where the user-mode process uses transparent
huge pages, when analyzing the memory usage through
/proc/pid/smaps_rollup, the obtained memory usage is not consistent
with the RSS in /proc/pid/status.

Related examples are as follows:
cat /proc/15427/status
VmRSS: 20973024 kB
RssAnon: 20971616 kB
RssFile: 1408 kB
RssShmem: 0 kB

cat /proc/15427/smaps_rollup
00400000-7ffcc372d000 ---p 00000000 00:00 0 [rollup]
Rss: 14419432 kB
Pss: 14418079 kB
Pss_Dirty: 14418016 kB
Pss_Anon: 14418016 kB
Pss_File: 63 kB
Pss_Shmem: 0 kB
Anonymous: 14418016 kB
LazyFree: 0 kB
AnonHugePages: 14417920 kB

The root cause is that the traversal In the page table, the number of
pages obtained by smaps_pmd_entry does not include the pages
corresponding to PROTNONE,resulting in a different situation.

Therefore, when obtaining pages through the follow_trans_huge_pmd
interface, add the FOLL_FORCE flag to count the pages corresponding to
PROTNONE to solve the above problem.

Signed-off-by: liubo <[email protected]>
Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()")
---
fs/proc/task_mmu.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c1e6531cb02a..ed08f9b869e2 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -571,8 +571,10 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
bool migration = false;

if (pmd_present(*pmd)) {
- /* FOLL_DUMP will return -EFAULT on huge zero page */
- page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
+ /* FOLL_DUMP will return -EFAULT on huge zero page
+ * FOLL_FORCE follow a PROT_NONE mapped page
+ */
+ page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP | FOLL_FORCE);
} else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
swp_entry_t entry = pmd_to_swp_entry(*pmd);

--
2.27.0

2023-07-27 11:42:11

by David Hildenbrand

[permalink] [raw]

Subject: Re: [PATCH] smaps: Fix the abnormal memory statistics obtained through /proc/pid/smaps

On Wed, Jul 26, 2023 at 9:40 AM liubo <[email protected]> wrote:
>
> In commit 474098edac26 ("mm/gup: replace FOLL_NUMA by
> gup_can_follow_protnone()"), FOLL_NUMA was removed and replaced by
> the gup_can_follow_protnone interface.
>
> However, for the case where the user-mode process uses transparent
> huge pages, when analyzing the memory usage through
> /proc/pid/smaps_rollup, the obtained memory usage is not consistent
> with the RSS in /proc/pid/status.
>
> Related examples are as follows:
> cat /proc/15427/status
> VmRSS: 20973024 kB
> RssAnon: 20971616 kB
> RssFile: 1408 kB
> RssShmem: 0 kB
>
> cat /proc/15427/smaps_rollup
> 00400000-7ffcc372d000 ---p 00000000 00:00 0 [rollup]
> Rss: 14419432 kB
> Pss: 14418079 kB
> Pss_Dirty: 14418016 kB
> Pss_Anon: 14418016 kB
> Pss_File: 63 kB
> Pss_Shmem: 0 kB
> Anonymous: 14418016 kB
> LazyFree: 0 kB
> AnonHugePages: 14417920 kB
>
> The root cause is that the traversal In the page table, the number of
> pages obtained by smaps_pmd_entry does not include the pages
> corresponding to PROTNONE,resulting in a different situation.
>

Thanks for reporting and debugging!

> Therefore, when obtaining pages through the follow_trans_huge_pmd
> interface, add the FOLL_FORCE flag to count the pages corresponding to
> PROTNONE to solve the above problem.
>

We really want to avoid the usage of FOLL_FORCE, and ideally limit it
to ptrace only.

> Signed-off-by: liubo <[email protected]>
> Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()")
> ---
> fs/proc/task_mmu.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index c1e6531cb02a..ed08f9b869e2 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -571,8 +571,10 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
> bool migration = false;
>
> if (pmd_present(*pmd)) {
> - /* FOLL_DUMP will return -EFAULT on huge zero page */
> - page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
> + /* FOLL_DUMP will return -EFAULT on huge zero page
> + * FOLL_FORCE follow a PROT_NONE mapped page
> + */
> + page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP | FOLL_FORCE);
> } else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
> swp_entry_t entry = pmd_to_swp_entry(*pmd);

Might do as an easy fix. But we really should get rid of that
absolutely disgusting usage of follow_trans_huge_pmd().

We don't need 99% of what follow_trans_huge_pmd() does here.

Would the following also fix your issue?

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 507cd4e59d07..fc744964816e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -587,8 +587,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
bool migration = false;

if (pmd_present(*pmd)) {
- /* FOLL_DUMP will return -EFAULT on huge zero page */
- page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
+ page = vm_normal_page_pmd(vma, addr, *pmd);
} else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
swp_entry_t entry = pmd_to_swp_entry(*pmd);

It also skips the shared zeropage and pmd_devmap(),

Otherwise, a simple pmd_page(*pmd) + is_huge_zero_pmd(*pmd) check will do, but I
suspect vm_normal_page_pmd() might be what we actually want to have here.

Because smaps_pte_entry() properly checks for vm_normal_page().

2023-07-27 13:51:08

by Peter Xu

[permalink] [raw]

Subject: Re: [PATCH] smaps: Fix the abnormal memory statistics obtained through /proc/pid/smaps

On Thu, Jul 27, 2023 at 01:37:06PM +0200, David Hildenbrand wrote:
> On Wed, Jul 26, 2023 at 9:40 AM liubo <[email protected]> wrote:
> >
> > In commit 474098edac26 ("mm/gup: replace FOLL_NUMA by
> > gup_can_follow_protnone()"), FOLL_NUMA was removed and replaced by
> > the gup_can_follow_protnone interface.
> >
> > However, for the case where the user-mode process uses transparent
> > huge pages, when analyzing the memory usage through
> > /proc/pid/smaps_rollup, the obtained memory usage is not consistent
> > with the RSS in /proc/pid/status.
> >
> > Related examples are as follows:
> > cat /proc/15427/status
> > VmRSS: 20973024 kB
> > RssAnon: 20971616 kB
> > RssFile: 1408 kB
> > RssShmem: 0 kB
> >
> > cat /proc/15427/smaps_rollup
> > 00400000-7ffcc372d000 ---p 00000000 00:00 0 [rollup]
> > Rss: 14419432 kB
> > Pss: 14418079 kB
> > Pss_Dirty: 14418016 kB
> > Pss_Anon: 14418016 kB
> > Pss_File: 63 kB
> > Pss_Shmem: 0 kB
> > Anonymous: 14418016 kB
> > LazyFree: 0 kB
> > AnonHugePages: 14417920 kB
> >
> > The root cause is that the traversal In the page table, the number of
> > pages obtained by smaps_pmd_entry does not include the pages
> > corresponding to PROTNONE,resulting in a different situation.
> >
>
> Thanks for reporting and debugging!
>
> > Therefore, when obtaining pages through the follow_trans_huge_pmd
> > interface, add the FOLL_FORCE flag to count the pages corresponding to
> > PROTNONE to solve the above problem.
> >
>
> We really want to avoid the usage of FOLL_FORCE, and ideally limit it
> to ptrace only.

Fundamentally when removing FOLL_NUMA we did already assumed !FORCE is
FOLL_NUMA. It means to me after the removal it's not possible to say in a
gup walker that "it's not FORCEd, but I don't want to trigger NUMA but just
get the page".

Is that what we want? Shall we document that in FOLL_FORCE if we intended
to enforce numa balancing as long as !FORCE?

>
> > Signed-off-by: liubo <[email protected]>
> > Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()")
> > ---
> > fs/proc/task_mmu.c | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index c1e6531cb02a..ed08f9b869e2 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -571,8 +571,10 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
> > bool migration = false;
> >
> > if (pmd_present(*pmd)) {
> > - /* FOLL_DUMP will return -EFAULT on huge zero page */
> > - page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
> > + /* FOLL_DUMP will return -EFAULT on huge zero page
> > + * FOLL_FORCE follow a PROT_NONE mapped page
> > + */
> > + page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP | FOLL_FORCE);
> > } else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
> > swp_entry_t entry = pmd_to_swp_entry(*pmd);
>
> Might do as an easy fix. But we really should get rid of that
> absolutely disgusting usage of follow_trans_huge_pmd().
>
> We don't need 99% of what follow_trans_huge_pmd() does here.
>
> Would the following also fix your issue?
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 507cd4e59d07..fc744964816e 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -587,8 +587,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
> bool migration = false;
>
> if (pmd_present(*pmd)) {
> - /* FOLL_DUMP will return -EFAULT on huge zero page */
> - page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
> + page = vm_normal_page_pmd(vma, addr, *pmd);
> } else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
> swp_entry_t entry = pmd_to_swp_entry(*pmd);
>
> It also skips the shared zeropage and pmd_devmap(),
>
> Otherwise, a simple pmd_page(*pmd) + is_huge_zero_pmd(*pmd) check will do, but I
> suspect vm_normal_page_pmd() might be what we actually want to have here.
>
> Because smaps_pte_entry() properly checks for vm_normal_page().

There're indeed some very trivial detail in vm_normal_page_pmd() that's
different, but maybe not so relevant. E.g.,

if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
return -ENOMEM;

if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
return -EREMOTEIO;

I'm not sure whether the p2pdma page would matter in any form here. E.g.,
whether it can be mapped privately.

--
Peter Xu

2023-07-27 14:00:08

by David Hildenbrand

[permalink] [raw]

Subject: Re: [PATCH] smaps: Fix the abnormal memory statistics obtained through /proc/pid/smaps

>>> Therefore, when obtaining pages through the follow_trans_huge_pmd
>>> interface, add the FOLL_FORCE flag to count the pages corresponding to
>>> PROTNONE to solve the above problem.
>>>
>>
>> We really want to avoid the usage of FOLL_FORCE, and ideally limit it
>> to ptrace only.
>
> Fundamentally when removing FOLL_NUMA we did already assumed !FORCE is
> FOLL_NUMA. It means to me after the removal it's not possible to say in a
> gup walker that "it's not FORCEd, but I don't want to trigger NUMA but just
> get the page".
>
> Is that what we want? Shall we document that in FOLL_FORCE if we intended
> to enforce numa balancing as long as !FORCE?

That was the idea, yes. I could have sworn we had that at least in some
patch description.

Back then, I played with special-casing on gup_can_follow_protnone() on
FOLL_GET | FOLL_PIN. But it's all just best guesses.

Can always be added if deemed necessary and worth it.

Here, it's simply an abuse of that GUP function that I wasn't aware of
-- otherwise I'd have removed that before hand.

>
>>
>>> Signed-off-by: liubo <[email protected]>
>>> Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()")
>>> ---
>>> fs/proc/task_mmu.c | 6 ++++--
>>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>> index c1e6531cb02a..ed08f9b869e2 100644
>>> --- a/fs/proc/task_mmu.c
>>> +++ b/fs/proc/task_mmu.c
>>> @@ -571,8 +571,10 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
>>> bool migration = false;
>>>
>>> if (pmd_present(*pmd)) {
>>> - /* FOLL_DUMP will return -EFAULT on huge zero page */
>>> - page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
>>> + /* FOLL_DUMP will return -EFAULT on huge zero page
>>> + * FOLL_FORCE follow a PROT_NONE mapped page
>>> + */
>>> + page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP | FOLL_FORCE);
>>> } else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
>>> swp_entry_t entry = pmd_to_swp_entry(*pmd);
>>
>> Might do as an easy fix. But we really should get rid of that
>> absolutely disgusting usage of follow_trans_huge_pmd().
>>
>> We don't need 99% of what follow_trans_huge_pmd() does here.
>>
>> Would the following also fix your issue?
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 507cd4e59d07..fc744964816e 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -587,8 +587,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
>> bool migration = false;
>>
>> if (pmd_present(*pmd)) {
>> - /* FOLL_DUMP will return -EFAULT on huge zero page */
>> - page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
>> + page = vm_normal_page_pmd(vma, addr, *pmd);
>> } else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
>> swp_entry_t entry = pmd_to_swp_entry(*pmd);
>>
>> It also skips the shared zeropage and pmd_devmap(),
>>
>> Otherwise, a simple pmd_page(*pmd) + is_huge_zero_pmd(*pmd) check will do, but I
>> suspect vm_normal_page_pmd() might be what we actually want to have here.
>>
>> Because smaps_pte_entry() properly checks for vm_normal_page().
>
> There're indeed some very trivial detail in vm_normal_page_pmd() that's
> different, but maybe not so relevant. E.g.,
>
> if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
> return -ENOMEM;

Note that we're not even passing FOLL_GET | FOLL_PIN. Because we're not
actually doing GUP. So the refcount is not that relevant.

>
> if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
> return -EREMOTEIO;
>
> I'm not sure whether the p2pdma page would matter in any form here. E.g.,
> whether it can be mapped privately.

Good point, but I don't think that people messing with GUP even imagined
that we would call that function from a !GUP place.

This was wrong from the very start. If we're not in GUP, we shouldn't
call GUP functions.

--
Cheers,

David / dhildenb

2023-07-27 16:17:21

by Peter Xu

[permalink] [raw]

Subject: Re: [PATCH] smaps: Fix the abnormal memory statistics obtained through /proc/pid/smaps

On Thu, Jul 27, 2023 at 03:28:49PM +0200, David Hildenbrand wrote:
> > > > Therefore, when obtaining pages through the follow_trans_huge_pmd
> > > > interface, add the FOLL_FORCE flag to count the pages corresponding to
> > > > PROTNONE to solve the above problem.
> > > >
> > >
> > > We really want to avoid the usage of FOLL_FORCE, and ideally limit it
> > > to ptrace only.
> >
> > Fundamentally when removing FOLL_NUMA we did already assumed !FORCE is
> > FOLL_NUMA. It means to me after the removal it's not possible to say in a
> > gup walker that "it's not FORCEd, but I don't want to trigger NUMA but just
> > get the page".
> >
> > Is that what we want? Shall we document that in FOLL_FORCE if we intended
> > to enforce numa balancing as long as !FORCE?
>
> That was the idea, yes. I could have sworn we had that at least in some
> patch description.
>
> Back then, I played with special-casing on gup_can_follow_protnone() on
> FOLL_GET | FOLL_PIN. But it's all just best guesses.
>
> Can always be added if deemed necessary and worth it.
>
> Here, it's simply an abuse of that GUP function that I wasn't aware of --
> otherwise I'd have removed that before hand.
>
> >
> > >
> > > > Signed-off-by: liubo <[email protected]>
> > > > Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()")
> > > > ---
> > > > fs/proc/task_mmu.c | 6 ++++--
> > > > 1 file changed, 4 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > index c1e6531cb02a..ed08f9b869e2 100644
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > > @@ -571,8 +571,10 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
> > > > bool migration = false;
> > > >
> > > > if (pmd_present(*pmd)) {
> > > > - /* FOLL_DUMP will return -EFAULT on huge zero page */
> > > > - page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
> > > > + /* FOLL_DUMP will return -EFAULT on huge zero page
> > > > + * FOLL_FORCE follow a PROT_NONE mapped page
> > > > + */
> > > > + page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP | FOLL_FORCE);
> > > > } else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
> > > > swp_entry_t entry = pmd_to_swp_entry(*pmd);
> > >
> > > Might do as an easy fix. But we really should get rid of that
> > > absolutely disgusting usage of follow_trans_huge_pmd().
> > >
> > > We don't need 99% of what follow_trans_huge_pmd() does here.
> > >
> > > Would the following also fix your issue?
> > >
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index 507cd4e59d07..fc744964816e 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -587,8 +587,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
> > > bool migration = false;
> > >
> > > if (pmd_present(*pmd)) {
> > > - /* FOLL_DUMP will return -EFAULT on huge zero page */
> > > - page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
> > > + page = vm_normal_page_pmd(vma, addr, *pmd);
> > > } else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
> > > swp_entry_t entry = pmd_to_swp_entry(*pmd);
> > >
> > > It also skips the shared zeropage and pmd_devmap(),
> > >
> > > Otherwise, a simple pmd_page(*pmd) + is_huge_zero_pmd(*pmd) check will do, but I
> > > suspect vm_normal_page_pmd() might be what we actually want to have here.
> > >
> > > Because smaps_pte_entry() properly checks for vm_normal_page().
> >
> > There're indeed some very trivial detail in vm_normal_page_pmd() that's
> > different, but maybe not so relevant. E.g.,
> >
> > if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
> > return -ENOMEM;
>
> Note that we're not even passing FOLL_GET | FOLL_PIN. Because we're not
> actually doing GUP. So the refcount is not that relevant.
>
> >
> > if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
> > return -EREMOTEIO;
> >
> > I'm not sure whether the p2pdma page would matter in any form here. E.g.,
> > whether it can be mapped privately.
>
> Good point, but I don't think that people messing with GUP even imagined
> that we would call that function from a !GUP place.
>
> This was wrong from the very start. If we're not in GUP, we shouldn't call
> GUP functions.

My understanding is !GET && !PIN is also called gup.. otherwise we don't
need GET and it can just be always implied.

The other proof is try_grab_page() doesn't fail hard on !GET && !PIN. So I
don't know whether that's "wrong" to be used..

Back to the topic: I'd say either of the patches look good to solve the
problem. If p2pdma pages are mapped as PFNMAP/MIXEDMAP (?), I guess
vm_normal_page_pmd() proposed here will also work on it, so nothing I see
wrong on 2nd one yet.

It looks nicer indeed to not have FOLL_FORCE here, but it also makes me
just wonder whether we should document NUMA behavior for FOLL_* somewhere,
because we have an implication right now on !FOLL_FORCE over NUMA, which is
not obvious to me..

And to look more over that aspect, see follow_page(): previously we can
follow a page for protnone (as it never applies FOLL_NUMA) but now it won't
(it never applies FOLL_FORCE, either, so it seems "accidentally" implies
FOLL_NUMA now). Not sure whether it's intended, though..

--
Peter Xu

2023-07-27 18:28:52

by David Hildenbrand

[permalink] [raw]

Subject: Re: [PATCH] smaps: Fix the abnormal memory statistics obtained through /proc/pid/smaps

>>
>> This was wrong from the very start. If we're not in GUP, we shouldn't call
>> GUP functions.
>
> My understanding is !GET && !PIN is also called gup.. otherwise we don't
> need GET and it can just be always implied.

That's not the point. The point is that _arbitrary_ code shouldn't call
into GUP internal helper functions, where they bypass, for example, any
sanity checks.

>
> The other proof is try_grab_page() doesn't fail hard on !GET && !PIN. So I
> don't know whether that's "wrong" to be used..
>

To me, that is arbitrary code using a GUP internal helper and,
therefore, wrong.

> Back to the topic: I'd say either of the patches look good to solve the
> problem. If p2pdma pages are mapped as PFNMAP/MIXEDMAP (?), I guess
> vm_normal_page_pmd() proposed here will also work on it, so nothing I see
> wrong on 2nd one yet.
>
> It looks nicer indeed to not have FOLL_FORCE here, but it also makes me
> just wonder whether we should document NUMA behavior for FOLL_* somewhere,
> because we have an implication right now on !FOLL_FORCE over NUMA, which is
> not obvious to me..

Yes, we probably should. For get_use_pages() and friends that behavior
was always like that and it makes sense: usually it represent
application behavior.

>
> And to look more over that aspect, see follow_page(): previously we can
> follow a page for protnone (as it never applies FOLL_NUMA) but now it won't
> (it never applies FOLL_FORCE, either, so it seems "accidentally" implies
> FOLL_NUMA now). Not sure whether it's intended, though..

That was certainly an oversight, thanks for spotting that. That patch
was not supposed to change semantics:

diff --git a/mm/gup.c b/mm/gup.c
index 76d222ccc3ff..ac926e19ff72 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -851,6 +851,13 @@ struct page *follow_page(struct vm_area_struct
*vma, unsigned long address,
if (WARN_ON_ONCE(foll_flags & FOLL_PIN))
return NULL;

+ /*
+ * In contrast to get_user_pages() and friends, we don't want to
+ * fail if the PTE is PROT_NONE: see gup_can_follow_protnone().
+ */
+ if (!(foll_flags & FOLL_WRITE))
+ foll_flags |= FOLL_FORCE;
+
page = follow_page_mask(vma, address, foll_flags, &ctx);
if (ctx.pgmap)
put_dev_pagemap(ctx.pgmap);

--
Cheers,

David / dhildenb

2023-07-27 20:31:55

by Peter Xu

[permalink] [raw]

Subject: Re: [PATCH] smaps: Fix the abnormal memory statistics obtained through /proc/pid/smaps

On Thu, Jul 27, 2023 at 07:27:02PM +0200, David Hildenbrand wrote:
> > >
> > > This was wrong from the very start. If we're not in GUP, we shouldn't call
> > > GUP functions.
> >
> > My understanding is !GET && !PIN is also called gup.. otherwise we don't
> > need GET and it can just be always implied.
>
> That's not the point. The point is that _arbitrary_ code shouldn't call into
> GUP internal helper functions, where they bypass, for example, any sanity
> checks.

What's the sanity checks that you're referring to?

>
> >
> > The other proof is try_grab_page() doesn't fail hard on !GET && !PIN. So I
> > don't know whether that's "wrong" to be used..
> >
>
> To me, that is arbitrary code using a GUP internal helper and, therefore,
> wrong.
>
> > Back to the topic: I'd say either of the patches look good to solve the
> > problem. If p2pdma pages are mapped as PFNMAP/MIXEDMAP (?), I guess
> > vm_normal_page_pmd() proposed here will also work on it, so nothing I see
> > wrong on 2nd one yet.
> >
> > It looks nicer indeed to not have FOLL_FORCE here, but it also makes me
> > just wonder whether we should document NUMA behavior for FOLL_* somewhere,
> > because we have an implication right now on !FOLL_FORCE over NUMA, which is
> > not obvious to me..
>
> Yes, we probably should. For get_use_pages() and friends that behavior was
> always like that and it makes sense: usually it represent application
> behavior.
>
> >
> > And to look more over that aspect, see follow_page(): previously we can
> > follow a page for protnone (as it never applies FOLL_NUMA) but now it won't
> > (it never applies FOLL_FORCE, either, so it seems "accidentally" implies
> > FOLL_NUMA now). Not sure whether it's intended, though..
>
> That was certainly an oversight, thanks for spotting that. That patch was
> not supposed to change semantics:
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 76d222ccc3ff..ac926e19ff72 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -851,6 +851,13 @@ struct page *follow_page(struct vm_area_struct *vma,
> unsigned long address,
> if (WARN_ON_ONCE(foll_flags & FOLL_PIN))
> return NULL;
>
> + /*
> + * In contrast to get_user_pages() and friends, we don't want to
> + * fail if the PTE is PROT_NONE: see gup_can_follow_protnone().
> + */
> + if (!(foll_flags & FOLL_WRITE))
> + foll_flags |= FOLL_FORCE;
> +
> page = follow_page_mask(vma, address, foll_flags, &ctx);
> if (ctx.pgmap)
> put_dev_pagemap(ctx.pgmap);

This seems to be slightly against your other solution though for smaps,
where we want to avoid abusing FOLL_FORCE.. isn't it..

Why read only? That'll always attach FOLL_FORCE to all follow page call
sites indeed for now, but just curious - logically "I want to fetch the
page even if protnone" is orthogonal to do with write permission here to
me.

I still worry about further abuse of FOLL_FORCE, I believe you also worry
that so you proposed the other way for the smaps issue.

Do you think we can just revive FOLL_NUMA? That'll be very clear to me
from that aspect that we do still have valid use cases for it.

The very least is if with above we should really document FOLL_FORCE - we
should mention NUMA effects. But that's ... really confusing. Thinking
about that I personally prefer a revival of FOLL_NUMA, then smaps issue all
go away.

Thanks,

--
Peter Xu

2023-07-27 20:47:22

by David Hildenbrand

[permalink] [raw]

Subject: Re: [PATCH] smaps: Fix the abnormal memory statistics obtained through /proc/pid/smaps

On 27.07.23 20:59, Peter Xu wrote:
> On Thu, Jul 27, 2023 at 07:27:02PM +0200, David Hildenbrand wrote:
>>>>
>>>> This was wrong from the very start. If we're not in GUP, we shouldn't call
>>>> GUP functions.
>>>
>>> My understanding is !GET && !PIN is also called gup.. otherwise we don't
>>> need GET and it can just be always implied.
>>
>> That's not the point. The point is that _arbitrary_ code shouldn't call into
>> GUP internal helper functions, where they bypass, for example, any sanity
>> checks.
>
> What's the sanity checks that you're referring to?
>

For example in follow_page()

if (vma_is_secretmem(vma))
return NULL;

if (WARN_ON_ONCE(foll_flags & FOLL_PIN))
return NULL;

Maybe you can elaborate why you think we should *not* be using
vm_normal_page_pmd() and instead some arbitrary GUP internal helper? I
don't get it.

>>
>>>
>>> The other proof is try_grab_page() doesn't fail hard on !GET && !PIN. So I
>>> don't know whether that's "wrong" to be used..
>>>
>>
>> To me, that is arbitrary code using a GUP internal helper and, therefore,
>> wrong.
>>
>>> Back to the topic: I'd say either of the patches look good to solve the
>>> problem. If p2pdma pages are mapped as PFNMAP/MIXEDMAP (?), I guess
>>> vm_normal_page_pmd() proposed here will also work on it, so nothing I see
>>> wrong on 2nd one yet.
>>>
>>> It looks nicer indeed to not have FOLL_FORCE here, but it also makes me
>>> just wonder whether we should document NUMA behavior for FOLL_* somewhere,
>>> because we have an implication right now on !FOLL_FORCE over NUMA, which is
>>> not obvious to me..
>>
>> Yes, we probably should. For get_use_pages() and friends that behavior was
>> always like that and it makes sense: usually it represent application
>> behavior.
>>
>>>
>>> And to look more over that aspect, see follow_page(): previously we can
>>> follow a page for protnone (as it never applies FOLL_NUMA) but now it won't
>>> (it never applies FOLL_FORCE, either, so it seems "accidentally" implies
>>> FOLL_NUMA now). Not sure whether it's intended, though..
>>
>> That was certainly an oversight, thanks for spotting that. That patch was
>> not supposed to change semantics:
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 76d222ccc3ff..ac926e19ff72 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -851,6 +851,13 @@ struct page *follow_page(struct vm_area_struct *vma,
>> unsigned long address,
>> if (WARN_ON_ONCE(foll_flags & FOLL_PIN))
>> return NULL;
>>
>> + /*
>> + * In contrast to get_user_pages() and friends, we don't want to
>> + * fail if the PTE is PROT_NONE: see gup_can_follow_protnone().
>> + */
>> + if (!(foll_flags & FOLL_WRITE))
>> + foll_flags |= FOLL_FORCE;
>> +
>> page = follow_page_mask(vma, address, foll_flags, &ctx);
>> if (ctx.pgmap)
>> put_dev_pagemap(ctx.pgmap);
>
> This seems to be slightly against your other solution though for smaps,
> where we want to avoid abusing FOLL_FORCE.. isn't it..

This is GUP internal, not some arbitrary code, so to me a *completely*
different discussion.

>
> Why read only? That'll always attach FOLL_FORCE to all follow page call
> sites indeed for now, but just curious - logically "I want to fetch the
> page even if protnone" is orthogonal to do with write permission here to
> me.

Historical these were not the semantics, so I won't change them.

FOLL_FORCE | FOLL_WRITE always had a special taste to it (COW ...).

>
> I still worry about further abuse of FOLL_FORCE, I believe you also worry
> that so you proposed the other way for the smaps issue.
>
> Do you think we can just revive FOLL_NUMA? That'll be very clear to me
> from that aspect that we do still have valid use cases for it.

FOLL_NUMA naming was nowadays wrong to begin with (not to mention,
confusing a we learned). There are other reasons why we have PROT_NONE
-- mprotect(), for example.

We could have a flag that goes the other way around:
FOLL_IGNORE_PROTNONE ... which surprisingly then ends up being exactly
what FOLL_FORCE means without FOLL_WRITE, and what this patch does.

Does that make sense to you?

>
> The very least is if with above we should really document FOLL_FORCE - we
> should mention NUMA effects. But that's ... really confusing. Thinking
> about that I personally prefer a revival of FOLL_NUMA, then smaps issue all
> go away.

smaps needs to be changed in any case IMHO. And I'm absolutely not in
favor of revicing FOLL_NUMA.

--
Cheers,

David / dhildenb

2023-07-27 21:06:42

by Peter Xu

[permalink] [raw]

Subject: Re: [PATCH] smaps: Fix the abnormal memory statistics obtained through /proc/pid/smaps

On Thu, Jul 27, 2023 at 09:17:45PM +0200, David Hildenbrand wrote:
> On 27.07.23 20:59, Peter Xu wrote:
> > On Thu, Jul 27, 2023 at 07:27:02PM +0200, David Hildenbrand wrote:
> > > > >
> > > > > This was wrong from the very start. If we're not in GUP, we shouldn't call
> > > > > GUP functions.
> > > >
> > > > My understanding is !GET && !PIN is also called gup.. otherwise we don't
> > > > need GET and it can just be always implied.
> > >
> > > That's not the point. The point is that _arbitrary_ code shouldn't call into
> > > GUP internal helper functions, where they bypass, for example, any sanity
> > > checks.
> >
> > What's the sanity checks that you're referring to?
> >
>
> For example in follow_page()
>
> if (vma_is_secretmem(vma))
> return NULL;
>
> if (WARN_ON_ONCE(foll_flags & FOLL_PIN))
> return NULL;
>
>
> Maybe you can elaborate why you think we should *not* be using
> vm_normal_page_pmd() and instead some arbitrary GUP internal helper? I don't
> get it.

Because the old code was written like that?

You're proposing to change it here. Again, I'm fine with the change, but
please don't ask me to justify why the original code is fine.. because I
simply don't see anything majorly wrong with either, it's the change that
needs justification, not keeping it as-is (since Kirill wrote it in 2014).

Well.. I feel like this becomes less helpful to discuss. let's try to move
on.

>
> > >
> > > >
> > > > The other proof is try_grab_page() doesn't fail hard on !GET && !PIN. So I
> > > > don't know whether that's "wrong" to be used..
> > > >
> > >
> > > To me, that is arbitrary code using a GUP internal helper and, therefore,
> > > wrong.
> > >
> > > > Back to the topic: I'd say either of the patches look good to solve the
> > > > problem. If p2pdma pages are mapped as PFNMAP/MIXEDMAP (?), I guess
> > > > vm_normal_page_pmd() proposed here will also work on it, so nothing I see
> > > > wrong on 2nd one yet.
> > > >
> > > > It looks nicer indeed to not have FOLL_FORCE here, but it also makes me
> > > > just wonder whether we should document NUMA behavior for FOLL_* somewhere,
> > > > because we have an implication right now on !FOLL_FORCE over NUMA, which is
> > > > not obvious to me..
> > >
> > > Yes, we probably should. For get_use_pages() and friends that behavior was
> > > always like that and it makes sense: usually it represent application
> > > behavior.
> > >
> > > >
> > > > And to look more over that aspect, see follow_page(): previously we can
> > > > follow a page for protnone (as it never applies FOLL_NUMA) but now it won't
> > > > (it never applies FOLL_FORCE, either, so it seems "accidentally" implies
> > > > FOLL_NUMA now). Not sure whether it's intended, though..
> > >
> > > That was certainly an oversight, thanks for spotting that. That patch was
> > > not supposed to change semantics:
> > >
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index 76d222ccc3ff..ac926e19ff72 100644
> > > --- a/mm/gup.c
> > > +++ b/mm/gup.c
> > > @@ -851,6 +851,13 @@ struct page *follow_page(struct vm_area_struct *vma,
> > > unsigned long address,
> > > if (WARN_ON_ONCE(foll_flags & FOLL_PIN))
> > > return NULL;
> > >
> > > + /*
> > > + * In contrast to get_user_pages() and friends, we don't want to
> > > + * fail if the PTE is PROT_NONE: see gup_can_follow_protnone().
> > > + */
> > > + if (!(foll_flags & FOLL_WRITE))
> > > + foll_flags |= FOLL_FORCE;
> > > +
> > > page = follow_page_mask(vma, address, foll_flags, &ctx);
> > > if (ctx.pgmap)
> > > put_dev_pagemap(ctx.pgmap);
> >
> > This seems to be slightly against your other solution though for smaps,
> > where we want to avoid abusing FOLL_FORCE.. isn't it..
>
> This is GUP internal, not some arbitrary code, so to me a *completely*
> different discussion.
>
> >
> > Why read only? That'll always attach FOLL_FORCE to all follow page call
> > sites indeed for now, but just curious - logically "I want to fetch the
> > page even if protnone" is orthogonal to do with write permission here to
> > me.
>
> Historical these were not the semantics, so I won't change them.
>
> FOLL_FORCE | FOLL_WRITE always had a special taste to it (COW ...).
>
> >
> > I still worry about further abuse of FOLL_FORCE, I believe you also worry
> > that so you proposed the other way for the smaps issue.
> >
> > Do you think we can just revive FOLL_NUMA? That'll be very clear to me
> > from that aspect that we do still have valid use cases for it.
>
> FOLL_NUMA naming was nowadays wrong to begin with (not to mention, confusing
> a we learned). There are other reasons why we have PROT_NONE -- mprotect(),
> for example.

It doesn't really violate with the name, IMHO - protnone can be either numa
hint or PROT_NONE for real. As long as we return NULL for a FOLL_NUMA
request we're achieving the goal we want - we guarantee a NUMA balancing to
trigger with when FOLL_NUMA provided. It doesn't need to guarantee
anything else, afaiu. The final check relies in vma_is_accessible() in the
fault paths anyway. So I don't blame the old name that much.

>
> We could have a flag that goes the other way around: FOLL_IGNORE_PROTNONE
> ... which surprisingly then ends up being exactly what FOLL_FORCE means
> without FOLL_WRITE, and what this patch does.
>
> Does that make sense to you?
>
>
> >
> > The very least is if with above we should really document FOLL_FORCE - we
> > should mention NUMA effects. But that's ... really confusing. Thinking
> > about that I personally prefer a revival of FOLL_NUMA, then smaps issue all
> > go away.
>
> smaps needs to be changed in any case IMHO. And I'm absolutely not in favor
> of revicing FOLL_NUMA.

As stated above, to me FOLL_NUMA is all fine and clear. If you think
having a flag back for protnone is worthwhile no matter as-is (FOLL_NUMA)
or with reverted meaning, then that sounds all fine to me. Maybe the old
name at least makes old developers know what's that.

I don't have a strong opinion on names though; mostly never had.

Thanks,

--
Peter Xu

2023-07-27 21:14:44

by David Hildenbrand

[permalink] [raw]

Subject: Re: [PATCH] smaps: Fix the abnormal memory statistics obtained through /proc/pid/smaps

On 27.07.23 22:30, Peter Xu wrote:
> On Thu, Jul 27, 2023 at 09:17:45PM +0200, David Hildenbrand wrote:
>> On 27.07.23 20:59, Peter Xu wrote:
>>> On Thu, Jul 27, 2023 at 07:27:02PM +0200, David Hildenbrand wrote:
>>>>>>
>>>>>> This was wrong from the very start. If we're not in GUP, we shouldn't call
>>>>>> GUP functions.
>>>>>
>>>>> My understanding is !GET && !PIN is also called gup.. otherwise we don't
>>>>> need GET and it can just be always implied.
>>>>
>>>> That's not the point. The point is that _arbitrary_ code shouldn't call into
>>>> GUP internal helper functions, where they bypass, for example, any sanity
>>>> checks.
>>>
>>> What's the sanity checks that you're referring to?
>>>
>>
>> For example in follow_page()
>>
>> if (vma_is_secretmem(vma))
>> return NULL;
>>
>> if (WARN_ON_ONCE(foll_flags & FOLL_PIN))
>> return NULL;
>>
>>
>> Maybe you can elaborate why you think we should *not* be using
>> vm_normal_page_pmd() and instead some arbitrary GUP internal helper? I don't
>> get it.
>
> Because the old code was written like that?

And it's not 2014 anymore. Nowadays we do have the right helper in place.

[...]

>> FOLL_NUMA naming was nowadays wrong to begin with (not to mention, confusing
>> a we learned). There are other reasons why we have PROT_NONE -- mprotect(),
>> for example.
>
> It doesn't really violate with the name, IMHO - protnone can be either numa
> hint or PROT_NONE for real. As long as we return NULL for a FOLL_NUMA
> request we're achieving the goal we want - we guarantee a NUMA balancing to
> trigger with when FOLL_NUMA provided. It doesn't need to guarantee
> anything else, afaiu. The final check relies in vma_is_accessible() in the
> fault paths anyway. So I don't blame the old name that much.

IMHO, the name FOLL_NUMA made sense when it still was called pte_numa.

>
>>
>> We could have a flag that goes the other way around: FOLL_IGNORE_PROTNONE
>> ... which surprisingly then ends up being exactly what FOLL_FORCE means
>> without FOLL_WRITE, and what this patch does.
>>
>> Does that make sense to you?
>>
>>
>>>
>>> The very least is if with above we should really document FOLL_FORCE - we
>>> should mention NUMA effects. But that's ... really confusing. Thinking
>>> about that I personally prefer a revival of FOLL_NUMA, then smaps issue all
>>> go away.
>>
>> smaps needs to be changed in any case IMHO. And I'm absolutely not in favor
>> of revicing FOLL_NUMA.
>
> As stated above, to me FOLL_NUMA is all fine and clear. If you think
> having a flag back for protnone is worthwhile no matter as-is (FOLL_NUMA)
> or with reverted meaning, then that sounds all fine to me. Maybe the old
> name at least makes old developers know what's that.
>
> I don't have a strong opinion on names though; mostly never had.

I'll avoid new FOLL_ flags first and post my proposal. If many people
are unhappy with that approach, we can revert the commit and call it a day.

Thanks!

--
Cheers,

David / dhildenb