2017-03-01 14:13:54

by Michal Hocko

Subject: Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps

On Fri 24-02-17 13:31:49, Shaohua Li wrote:
> show MADV_FREE pages info of each vma in smaps. The interface is for
> diagnosis or monitoring purposes; userspace could use it to understand
> what happens in the application. Since userspace can dirty MADV_FREE
> pages without the kernel noticing, this interface is the only place we
> can get accurate accounting info about MADV_FREE pages.

I have just got to test this patchset and noticed something that was a
bit surprising

madvise(mmap(len), len, MADV_FREE)
Size: 102400 kB
Rss: 102400 kB
Pss: 102400 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 102400 kB
Private_Dirty: 0 kB
Referenced: 0 kB
Anonymous: 102400 kB
LazyFree: 102368 kB

It took me some time to realize that LazyFree is not accurate because
there are still pages on the per-cpu lru_lazyfree_pvecs. I believe this
is an implementation detail which shouldn't be visible to the userspace.
Should we simply drain the pagevec? A crude way would be to simply
lru_add_drain_all after we are done with the given range. We can also
make this lru_lazyfree_pvecs specific but I am not sure this is worth
the additional code.
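
For reference, a minimal reproducer along the lines of the test above looks
roughly like this (illustrative only; the 100 MB size matches the output
above, where the 32 kB gap is presumably 8 pages, at 4 kB each, still parked
in per-cpu pagevecs):

    /*
     * Map and touch anonymous memory, mark it MADV_FREE, then print the
     * LazyFree lines from /proc/self/smaps.  Error handling omitted.
     */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MADV_FREE
    #define MADV_FREE 8	/* uapi value, in case libc headers lack it */
    #endif

    int main(void)
    {
            size_t len = 100 << 20;		/* 100 MB, as in the output above */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            char line[256];
            FILE *f;

            memset(p, 1, len);		/* fault the pages in */
            madvise(p, len, MADV_FREE);

            f = fopen("/proc/self/smaps", "r");
            while (fgets(line, sizeof(line), f))
                    if (strstr(line, "LazyFree"))
                            fputs(line, stdout);
            fclose(f);
            return 0;
    }
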
---
diff --git a/mm/madvise.c b/mm/madvise.c
index dc5927c812d3..d2c318db16c9 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -474,7 +474,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
madvise_free_page_range(&tlb, vma, start, end);
mmu_notifier_invalidate_range_end(mm, start, end);
tlb_finish_mmu(&tlb, start, end);
-
+ lru_add_drain_all();
return 0;
}

--
Michal Hocko
SUSE Labs


2017-03-01 17:50:35

by Michal Hocko

Subject: Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps

On Wed 01-03-17 09:37:10, Shaohua Li wrote:
> On Wed, Mar 01, 2017 at 02:36:24PM +0100, Michal Hocko wrote:
> > On Fri 24-02-17 13:31:49, Shaohua Li wrote:
> > > show MADV_FREE pages info of each vma in smaps. The interface is for
> > > diagnosis or monitoring purposes; userspace could use it to understand
> > > what happens in the application. Since userspace can dirty MADV_FREE
> > > pages without the kernel noticing, this interface is the only place we
> > > can get accurate accounting info about MADV_FREE pages.
> >
> > I have just got to test this patchset and noticed something that was a
> > bit surprising
> >
> > madvise(mmap(len), len, MADV_FREE)
> > Size: 102400 kB
> > Rss: 102400 kB
> > Pss: 102400 kB
> > Shared_Clean: 0 kB
> > Shared_Dirty: 0 kB
> > Private_Clean: 102400 kB
> > Private_Dirty: 0 kB
> > Referenced: 0 kB
> > Anonymous: 102400 kB
> > LazyFree: 102368 kB
> >
> > It took me some time to realize that LazyFree is not accurate because
> > there are still pages on the per-cpu lru_lazyfree_pvecs. I believe this
> > is an implementation detail which shouldn't be visible to the userspace.
> > Should we simply drain the pagevec? A crude way would be to simply
> > lru_add_drain_all after we are done with the given range. We can also
> > make this lru_lazyfree_pvecs specific but I am not sure this is worth
> > the additional code.
>
> Minchan's original patch included a drain of the pvec. I discarded it because I
> think it's not worth the effort. There isn't much memory in the per-cpu vecs.

but multiply that by the number of CPUs.

> Like you said, I doubt this is noticeable to userspace.

maybe I wasn't clear enough. I've noticed and I expect others would as
well. We really shouldn't leak implementation details like that. So I
_believe_ this should be fixed. Draining all pagevecs is rather coarse
but it is the simplest thing to do. If you do not want to fold this
into the original patch I can send a standalone one. Or do you have any
concerns about draining?
--
Michal Hocko
SUSE Labs

2017-03-01 18:13:59

by Shaohua Li

Subject: Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps

On Wed, Mar 01, 2017 at 02:36:24PM +0100, Michal Hocko wrote:
> On Fri 24-02-17 13:31:49, Shaohua Li wrote:
> > show MADV_FREE pages info of each vma in smaps. The interface is for
> > diagnosis or monitoring purposes; userspace could use it to understand
> > what happens in the application. Since userspace can dirty MADV_FREE
> > pages without the kernel noticing, this interface is the only place we
> > can get accurate accounting info about MADV_FREE pages.
>
> I have just got to test this patchset and noticed something that was a
> bit surprising
>
> madvise(mmap(len), len, MADV_FREE)
> Size: 102400 kB
> Rss: 102400 kB
> Pss: 102400 kB
> Shared_Clean: 0 kB
> Shared_Dirty: 0 kB
> Private_Clean: 102400 kB
> Private_Dirty: 0 kB
> Referenced: 0 kB
> Anonymous: 102400 kB
> LazyFree: 102368 kB
>
> It took me some time to realize that LazyFree is not accurate because
> there are still pages on the per-cpu lru_lazyfree_pvecs. I believe this
> is an implementation detail which shouldn't be visible to the userspace.
> Should we simply drain the pagevec? A crude way would be to simply
> lru_add_drain_all after we are done with the given range. We can also
> make this lru_lazyfree_pvecs specific but I am not sure this is worth
> the additional code.

Minchan's original patch included a drain of the pvec. I discarded it because I
think it's not worth the effort. There isn't much memory in the per-cpu vecs.
Like you said, I doubt this is noticeable to userspace.

Thanks,
Shaohua


> ---
> diff --git a/mm/madvise.c b/mm/madvise.c
> index dc5927c812d3..d2c318db16c9 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -474,7 +474,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> madvise_free_page_range(&tlb, vma, start, end);
> mmu_notifier_invalidate_range_end(mm, start, end);
> tlb_finish_mmu(&tlb, start, end);
> -
> + lru_add_drain_all();
> return 0;
> }
>
> --
> Michal Hocko
> SUSE Labs

2017-03-01 18:57:42

by Michal Hocko

Subject: Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps

On Wed 01-03-17 13:31:49, Johannes Weiner wrote:
> On Wed, Mar 01, 2017 at 02:36:24PM +0100, Michal Hocko wrote:
> > On Fri 24-02-17 13:31:49, Shaohua Li wrote:
> > > show MADV_FREE pages info of each vma in smaps. The interface is for
> > > diagnosis or monitoring purposes; userspace could use it to understand
> > > what happens in the application. Since userspace can dirty MADV_FREE
> > > pages without the kernel noticing, this interface is the only place we
> > > can get accurate accounting info about MADV_FREE pages.
> >
> > I have just got to test this patchset and noticed something that was a
> > bit surprising
> >
> > madvise(mmap(len), len, MADV_FREE)
> > Size: 102400 kB
> > Rss: 102400 kB
> > Pss: 102400 kB
> > Shared_Clean: 0 kB
> > Shared_Dirty: 0 kB
> > Private_Clean: 102400 kB
> > Private_Dirty: 0 kB
> > Referenced: 0 kB
> > Anonymous: 102400 kB
> > LazyFree: 102368 kB
> >
> > It took me some time to realize that LazyFree is not accurate because
> > there are still pages on the per-cpu lru_lazyfree_pvecs. I believe this
> > is an implementation detail which shouldn't be visible to the userspace.
> > Should we simply drain the pagevec? A crude way would be to simply
> > lru_add_drain_all after we are done with the given range. We can also
> > make this lru_lazyfree_pvecs specific but I am not sure this is worth
> > the additional code.
> > ---
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index dc5927c812d3..d2c318db16c9 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -474,7 +474,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> > madvise_free_page_range(&tlb, vma, start, end);
> > mmu_notifier_invalidate_range_end(mm, start, end);
> > tlb_finish_mmu(&tlb, start, end);
> > -
> > + lru_add_drain_all();
>
> A full drain on all CPUs is very expensive and IMO not justified for
> some per-cpu fuzz factor in the stats. I'd take hampering the stats
> over hampering the syscall any day; only a subset of MADV_FREE users
> will look at the stats.
>
> And while the aggregate error can be large on machines with many CPUs
> (notably the machines on which you absolutely don't want to send IPIs
> to all cores each time a thread madvises some pages!),

I am not sure I understand. Where would we trigger IPIs?
lru_add_drain_all relies on workqueues.

> the pages of a
> single process are not likely to be spread out across more than a few
> CPUs.

Then we can simply flush only lru_lazyfree_pvecs, which should reduce
the unrelated noise from other pagevecs.

> The error when reading a specific smaps should be completely ok.
>
> In numbers: even if your process is madvising from 16 different CPUs,
> the error in its smaps file will peak at 896K in the worst case. That
> level of concurrency tends to come with much bigger memory quantities
> for that amount of error to matter.

It is still an unexpected behavior IMHO and an implementation detail
which leaks to the userspace.

> IMO this is a non-issue.

I will not insist if there is a general consensus on this and it is a
documented behavior, though.

--
Michal Hocko
SUSE Labs

2017-03-01 20:24:50

by Johannes Weiner

Subject: Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps

On Wed, Mar 01, 2017 at 02:36:24PM +0100, Michal Hocko wrote:
> On Fri 24-02-17 13:31:49, Shaohua Li wrote:
> > show MADV_FREE pages info of each vma in smaps. The interface is for
> > diagnosis or monitoring purposes; userspace could use it to understand
> > what happens in the application. Since userspace can dirty MADV_FREE
> > pages without the kernel noticing, this interface is the only place we
> > can get accurate accounting info about MADV_FREE pages.
>
> I have just got to test this patchset and noticed something that was a
> bit surprising
>
> madvise(mmap(len), len, MADV_FREE)
> Size: 102400 kB
> Rss: 102400 kB
> Pss: 102400 kB
> Shared_Clean: 0 kB
> Shared_Dirty: 0 kB
> Private_Clean: 102400 kB
> Private_Dirty: 0 kB
> Referenced: 0 kB
> Anonymous: 102400 kB
> LazyFree: 102368 kB
>
> It took me some time to realize that LazyFree is not accurate because
> there are still pages on the per-cpu lru_lazyfree_pvecs. I believe this
> is an implementation detail which shouldn't be visible to the userspace.
> Should we simply drain the pagevec? A crude way would be to simply
> lru_add_drain_all after we are done with the given range. We can also
> make this lru_lazyfree_pvecs specific but I am not sure this is worth
> the additional code.
> ---
> diff --git a/mm/madvise.c b/mm/madvise.c
> index dc5927c812d3..d2c318db16c9 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -474,7 +474,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> madvise_free_page_range(&tlb, vma, start, end);
> mmu_notifier_invalidate_range_end(mm, start, end);
> tlb_finish_mmu(&tlb, start, end);
> -
> + lru_add_drain_all();

A full drain on all CPUs is very expensive and IMO not justified for
some per-cpu fuzz factor in the stats. I'd take hampering the stats
over hampering the syscall any day; only a subset of MADV_FREE users
will look at the stats.

And while the aggregate error can be large on machines with many CPUs
(notably the machines on which you absolutely don't want to send IPIs
to all cores each time a thread madvises some pages!), the pages of a
single process are not likely to be spread out across more than a few
CPUs. The error when reading a specific smaps should be completely ok.

In numbers: even if your process is madvising from 16 different CPUs,
the error in its smaps file will peak at 896K in the worst case. That
level of concurrency tends to come with much bigger memory quantities
for that amount of error to matter.
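
(For what it's worth, that worst case matches the pagevec capacity, assuming
PAGEVEC_SIZE of 14 and 4 kB base pages:

    16 CPUs * 14 pages/pagevec * 4 kB/page = 896 kB

that can be sitting in the per-cpu lazyfree pagevecs at any one time.)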

IMO this is a non-issue.

2017-03-01 21:17:43

by Shaohua Li

Subject: Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps

On Wed, Mar 01, 2017 at 06:49:56PM +0100, Michal Hocko wrote:
> On Wed 01-03-17 09:37:10, Shaohua Li wrote:
> > On Wed, Mar 01, 2017 at 02:36:24PM +0100, Michal Hocko wrote:
> > > On Fri 24-02-17 13:31:49, Shaohua Li wrote:
> > > > show MADV_FREE pages info of each vma in smaps. The interface is for
> > > > diagnosis or monitoring purposes; userspace could use it to understand
> > > > what happens in the application. Since userspace can dirty MADV_FREE
> > > > pages without the kernel noticing, this interface is the only place we
> > > > can get accurate accounting info about MADV_FREE pages.
> > >
> > > I have just got to test this patchset and noticed something that was a
> > > bit surprising
> > >
> > > madvise(mmap(len), len, MADV_FREE)
> > > Size: 102400 kB
> > > Rss: 102400 kB
> > > Pss: 102400 kB
> > > Shared_Clean: 0 kB
> > > Shared_Dirty: 0 kB
> > > Private_Clean: 102400 kB
> > > Private_Dirty: 0 kB
> > > Referenced: 0 kB
> > > Anonymous: 102400 kB
> > > LazyFree: 102368 kB
> > >
> > > It took me some time to realize that LazyFree is not accurate because
> > > there are still pages on the per-cpu lru_lazyfree_pvecs. I believe this
> > > is an implementation detail which shouldn't be visible to the userspace.
> > > Should we simply drain the pagevec? A crude way would be to simply
> > > lru_add_drain_all after we are done with the given range. We can also
> > > make this lru_lazyfree_pvecs specific but I am not sure this is worth
> > > the additional code.
> >
> > Minchan's original patch included a drain of the pvec. I discarded it because I
> > think it's not worth the effort. There isn't much memory in the per-cpu vecs.
>
> but multiply that by the number of CPUs.
>
> > Like you said, I doubt this is noticeable to userspace.
>
> maybe I wasn't clear enough. I've noticed and I expect others would as
> well. We really shouldn't leak implementation details like that. So I
> _believe_ this should be fixed. Draining all pagevecs is rather coarse
> but it is the simplest thing to do. If you do not want to fold this
> into the original patch I can send a standalone one. Or do you have any
> concerns about draining?

No, no objection at all. I just doubt it's worth it. It looks like nobody has
complained about a similar issue. For example, deactivate_file_page does a
similar thing, so the smaps 'Referenced' value could be inaccurate.

Thanks,
Shaohua

2017-03-02 07:55:02

by Minchan Kim

Subject: Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps

On Wed, Mar 01, 2017 at 07:57:35PM +0100, Michal Hocko wrote:
> On Wed 01-03-17 13:31:49, Johannes Weiner wrote:
> > On Wed, Mar 01, 2017 at 02:36:24PM +0100, Michal Hocko wrote:
> > > On Fri 24-02-17 13:31:49, Shaohua Li wrote:
> > > > show MADV_FREE pages info of each vma in smaps. The interface is for
> > > > diagnosis or monitoring purposes; userspace could use it to understand
> > > > what happens in the application. Since userspace can dirty MADV_FREE
> > > > pages without the kernel noticing, this interface is the only place we
> > > > can get accurate accounting info about MADV_FREE pages.
> > >
> > > I have just got to test this patchset and noticed something that was a
> > > bit surprising
> > >
> > > madvise(mmap(len), len, MADV_FREE)
> > > Size: 102400 kB
> > > Rss: 102400 kB
> > > Pss: 102400 kB
> > > Shared_Clean: 0 kB
> > > Shared_Dirty: 0 kB
> > > Private_Clean: 102400 kB
> > > Private_Dirty: 0 kB
> > > Referenced: 0 kB
> > > Anonymous: 102400 kB
> > > LazyFree: 102368 kB
> > >
> > > It took me some time to realize that LazyFree is not accurate because
> > > there are still pages on the per-cpu lru_lazyfree_pvecs. I believe this
> > > is an implementation detail which shouldn't be visible to the userspace.
> > > Should we simply drain the pagevec? A crude way would be to simply
> > > lru_add_drain_all after we are done with the given range. We can also
> > > make this lru_lazyfree_pvecs specific but I am not sure this is worth
> > > the additional code.
> > > ---
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index dc5927c812d3..d2c318db16c9 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -474,7 +474,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> > > madvise_free_page_range(&tlb, vma, start, end);
> > > mmu_notifier_invalidate_range_end(mm, start, end);
> > > tlb_finish_mmu(&tlb, start, end);
> > > -
> > > + lru_add_drain_all();
> >
> > A full drain on all CPUs is very expensive and IMO not justified for
> > some per-cpu fuzz factor in the stats. I'd take hampering the stats
> > over hampering the syscall any day; only a subset of MADV_FREE users
> > will look at the stats.
> >
> > And while the aggregate error can be large on machines with many CPUs
> > (notably the machines on which you absolutely don't want to send IPIs
> > to all cores each time a thread madvises some pages!),
>
> I am not sure I understand. Where would we trigger IPIs?
> lru_add_drain_all relies on workqueues.
>
> > the pages of a
> > single process are not likely to be spread out across more than a few
> > CPUs.
>
> Then we can simply flush only lru_lazyfree_pvecs, which should reduce
> the unrelated noise from other pagevecs.
>
> > The error when reading a specific smaps should be completely ok.
> >
> > In numbers: even if your process is madvising from 16 different CPUs,
> > the error in its smaps file will peak at 896K in the worst case. That
> > level of concurrency tends to come with much bigger memory quantities
> > for that amount of error to matter.
>
> It is still an unexpected behavior IMHO and an implementation detail
> which leaks to the userspace.
>
> > IMO this is a non-issue.
>
> I will not insist if there is a general consensus on this and it is a
> documented behavior, though.

We cannot guarantee that even with draining, because madvise_free can
easily miss some of the pages under several conditions.
First of all, userspace can never know how many of the pages are mapped
in there at the moment. Also, a page in the range can be swapped out,
be under migration, fail trylock_page, and so on.

2017-03-02 14:25:10

by Johannes Weiner

Subject: Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps

On Wed, Mar 01, 2017 at 07:57:35PM +0100, Michal Hocko wrote:
> On Wed 01-03-17 13:31:49, Johannes Weiner wrote:
> > On Wed, Mar 01, 2017 at 02:36:24PM +0100, Michal Hocko wrote:
> > > @@ -474,7 +474,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> > > madvise_free_page_range(&tlb, vma, start, end);
> > > mmu_notifier_invalidate_range_end(mm, start, end);
> > > tlb_finish_mmu(&tlb, start, end);
> > > -
> > > + lru_add_drain_all();
> >
> > A full drain on all CPUs is very expensive and IMO not justified for
> > some per-cpu fuzz factor in the stats. I'd take hampering the stats
> > over hampering the syscall any day; only a subset of MADV_FREE users
> > will look at the stats.
> >
> > And while the aggregate error can be large on machines with many CPUs
> > (notably the machines on which you absolutely don't want to send IPIs
> > to all cores each time a thread madvises some pages!),
>
> I am not sure I understand. Where would we trigger IPIs?
> lru_add_drain_all relies on workqueues.

Brainfart on my end, s,IPIs,sync work items,.

That doesn't change my point, though. These things are expensive, and
we had scalability issues with them in the past. See for example
4dd72b4a47a5 ("mm: fadvise: avoid expensive remote LRU cache draining
after FADV_DONTNEED").

> > the pages of a
> > single process are not likely to be spread out across more than a few
> > CPUs.
>
> Then we can simply flush only lru_lazyfree_pvecs, which should reduce
> the unrelated noise from other pagevecs.

The problem isn't flushing other pagevecs once we're already scheduled
on a CPU; the problem is scheduling work on all CPUs and then waiting
for completion.

> > The error when reading a specific smaps should be completely ok.
> >
> > In numbers: even if your process is madvising from 16 different CPUs,
> > the error in its smaps file will peak at 896K in the worst case. That
> > level of concurrency tends to come with much bigger memory quantities
> > for that amount of error to matter.
>
> It is still an unexpected behavior IMHO and an implementation detail
> which leaks to the userspace.

We have per-cpu fuzz in every single vmstat counter. Look at
calculate_normal_threshold() in vmstat.c and the sample thresholds for
when per-cpu deltas are flushed. In the vast majority of machines, the
per-cpu error in these counters is much higher than what we get with
pagevecs holding back a few pages.

It's not that I think you're wrong: it *is* an implementation detail.
But we take a bit of incoherency from batching all over the place, so
it's a little odd to take a stand over this particular instance of it
- whether demanding that it'd be fixed, or be documented, which would
only suggest to users that this is special when it really isn't etc.

2017-03-02 16:32:22

by Michal Hocko

Subject: Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps

On Thu 02-03-17 09:01:01, Johannes Weiner wrote:
> On Wed, Mar 01, 2017 at 07:57:35PM +0100, Michal Hocko wrote:
> > On Wed 01-03-17 13:31:49, Johannes Weiner wrote:
[...]
> > > The error when reading a specific smaps should be completely ok.
> > >
> > > In numbers: even if your process is madvising from 16 different CPUs,
> > > the error in its smaps file will peak at 896K in the worst case. That
> > > level of concurrency tends to come with much bigger memory quantities
> > > for that amount of error to matter.
> >
> > It is still an unexpected behavior IMHO and an implementation detail
> > which leaks to the userspace.
>
> We have per-cpu fuzz in every single vmstat counter. Look at
> calculate_normal_threshold() in vmstat.c and the sample thresholds for
> when per-cpu deltas are flushed. In the vast majority of machines, the
> per-cpu error in these counters is much higher than what we get with
> pagevecs holding back a few pages.

Yes, but vmstat counters have a different use case AFAIK. You mostly look
at those when debugging or watching the system. /proc/<pid>/smaps is
quite often used to gather per-task metrics which are then used for some
decision making, so it should be less fuzzy if that is possible.

> It's not that I think you're wrong: it *is* an implementation detail.
> But we take a bit of incoherency from batching all over the place, so
> it's a little odd to take a stand over this particular instance of it
> - whether demanding that it'd be fixed, or be documented, which would
> only suggest to users that this is special when it really isn't etc.

I am not aware of other counter printed in smaps that would suffer from
the same problem, but I haven't checked too deeply so I might be wrong.

Anyway it seems that I am alone in my position so I will not insist.
If we have any bug report then we can still fix it.

--
Michal Hocko
SUSE Labs

2017-03-04 00:11:10

by Andrew Morton

Subject: Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps

On Thu, 2 Mar 2017 17:30:54 +0100 Michal Hocko <[email protected]> wrote:

> > It's not that I think you're wrong: it *is* an implementation detail.
> > But we take a bit of incoherency from batching all over the place, so
> > it's a little odd to take a stand over this particular instance of it
> > - whether demanding that it'd be fixed, or be documented, which would
> > only suggest to users that this is special when it really isn't etc.
>
> I am not aware of other counter printed in smaps that would suffer from
> the same problem, but I haven't checked too deeply so I might be wrong.
>
> Anyway it seems that I am alone in my position so I will not insist.
> If we have any bug report then we can still fix it.

A single lru_add_drain_all() right at the top level (in smaps_show()?)
won't kill us and should significantly improve this issue. And it
might accidentally make some of the other smaps statistics more
accurate as well.
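
Something along those lines, presumably (a rough sketch only; the hook point in
fs/proc/task_mmu.c and its signature are guesses, and since the show callback
runs once per vma, a real change would more likely drain once per read):

    static int show_smap(struct seq_file *m, void *v, int is_pid)
    {
            /*
             * Drain the per-cpu pagevecs (the lazyfree one included) so that
             * LazyFree, and possibly other counters, also see pages that are
             * still parked in them.
             */
            lru_add_drain_all();

            /* ... existing page-table walk and seq_printf of the counters ... */
            return 0;
    }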

If not, can we please have a nice comment somewhere appropriate which
explains why LazyFree is inaccurate and why we chose to leave it that
way?

2017-03-07 12:18:43

by Michal Hocko

Subject: Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps

On Fri 03-03-17 16:10:27, Andrew Morton wrote:
> On Thu, 2 Mar 2017 17:30:54 +0100 Michal Hocko <[email protected]> wrote:
>
> > > It's not that I think you're wrong: it *is* an implementation detail.
> > > But we take a bit of incoherency from batching all over the place, so
> > > it's a little odd to take a stand over this particular instance of it
> > > - whether demanding that it'd be fixed, or be documented, which would
> > > only suggest to users that this is special when it really isn't etc.
> >
> > I am not aware of other counter printed in smaps that would suffer from
> > the same problem, but I haven't checked too deeply so I might be wrong.
> >
> > Anyway it seems that I am alone in my position so I will not insist.
> > If we have any bug report then we can still fix it.
>
> A single lru_add_drain_all() right at the top level (in smaps_show()?)
> won't kill us

I do not think we want to put the lru_add_drain_all cost on a random
process reading /proc/<pid>/smaps. If anything, the one which does the
madvise should be doing this.

> and should significantly improve this issue. And it
> might accidentally make some of the other smaps statistics more
> accurate as well.
>
> If not, can we please have a nice comment somewhere appropriate which
> explains why LazyFree is inaccurate and why we chose to leave it that
> way?

Yeah, I would appreciate the comment more. What about the following
---
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 45853e116eef..0b58b317dc76 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -444,7 +444,9 @@ a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
and a page is modified, the file page is replaced by a private anonymous copy.
"LazyFree" shows the amount of memory which is marked by madvise(MADV_FREE).
The memory isn't freed immediately with madvise(). It's freed in memory
-pressure if the memory is clean.
+pressure if the memory is clean. Please note that the printed value might
+be lower than the real value due to optimizations used in the current
+implementation. If this is not desirable please file a bug report.
"AnonHugePages" shows the ammount of memory backed by transparent hugepage.
"ShmemPmdMapped" shows the ammount of shared (shmem/tmpfs) memory backed by
huge pages.

--
Michal Hocko
SUSE Labs

2017-03-08 01:22:11

by Andrew Morton

Subject: Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps

On Tue, 7 Mar 2017 11:05:45 +0100 Michal Hocko <[email protected]> wrote:

> On Fri 03-03-17 16:10:27, Andrew Morton wrote:
> > On Thu, 2 Mar 2017 17:30:54 +0100 Michal Hocko <[email protected]> wrote:
> >
> > > > It's not that I think you're wrong: it *is* an implementation detail.
> > > > But we take a bit of incoherency from batching all over the place, so
> > > > it's a little odd to take a stand over this particular instance of it
> > > > - whether demanding that it'd be fixed, or be documented, which would
> > > > only suggest to users that this is special when it really isn't etc.
> > >
> > > I am not aware of other counter printed in smaps that would suffer from
> > > the same problem, but I haven't checked too deeply so I might be wrong.
> > >
> > > Anyway it seems that I am alone in my position so I will not insist.
> > > If we have any bug report then we can still fix it.
> >
> > A single lru_add_drain_all() right at the top level (in smaps_show()?)
> > won't kill us
>
> I do not think we want to put the lru_add_drain_all cost on a random
> process reading /proc/<pid>/smaps.

Why not? It's that process which is calling for the work to be done.

> If anything, the one which does the
> madvise should be doing this.

But it would be silly to do extra work in madvise() if nobody will be
reading smaps for the next two months.

How much work is it anyway? What would be the relative impact upon a
smaps read?


2017-03-08 05:51:21

by Minchan Kim

Subject: Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps

Hi Andrew,

On Tue, Mar 07, 2017 at 02:43:38PM -0800, Andrew Morton wrote:
> On Tue, 7 Mar 2017 11:05:45 +0100 Michal Hocko <[email protected]> wrote:
>
> > On Fri 03-03-17 16:10:27, Andrew Morton wrote:
> > > On Thu, 2 Mar 2017 17:30:54 +0100 Michal Hocko <[email protected]> wrote:
> > >
> > > > > It's not that I think you're wrong: it *is* an implementation detail.
> > > > > But we take a bit of incoherency from batching all over the place, so
> > > > > it's a little odd to take a stand over this particular instance of it
> > > > > - whether demanding that it'd be fixed, or be documented, which would
> > > > > only suggest to users that this is special when it really isn't etc.
> > > >
> > > > I am not aware of other counter printed in smaps that would suffer from
> > > > the same problem, but I haven't checked too deeply so I might be wrong.
> > > >
> > > > Anyway it seems that I am alone in my position so I will not insist.
> > > > If we have any bug report then we can still fix it.
> > >
> > > A single lru_add_drain_all() right at the top level (in smaps_show()?)
> > > won't kill us
> >
> > I do not think we want to put the lru_add_drain_all cost on a random
> > process reading /proc/<pid>/smaps.
>
> Why not? It's that process which is calling for the work to be done.
>
> > If anything, the one which does the
> > madvise should be doing this.
>
> But it would be silly to do extra work in madvise() if nobody will be
> reading smaps for the next two months.
>
> How much work is it anyway? What would be the relative impact upon a
> smaps read?

I would agree only if the draining guaranteed that all of the mapped pages in
the range could be marked lazyfree. However, that's not true, because there
are a few pieces of logic in madvise_free_pte_range that skip the page marking.

So my conclusion is that draining helps a bit but guarantees nothing.
In that case, IMHO, let's not go to the effort of making it better.

Thanks.