2015-07-27 23:26:58

by Mike Kravetz

Subject: hugetlb pages not accounted for in rss

I started looking at the hugetlb self tests. The test hugetlbfstest
expects hugetlb pages to be accounted for in rss. However, there is
no code in the kernel to do this accounting.

It looks like there was an effort to add the accounting back in 2013.
The test program made it into tree, but the accounting code did not.

The easiest way to resolve this issue would be to remove the test and
perhaps document that hugetlb pages are not accounted for in rss.
However, it does seem like a big oversight that hugetlb pages are not
accounted for in rss. From a quick scan of the code it appears THP
pages are properly accounted for.

Thoughts?
--
Mike Kravetz


2015-07-28 18:32:53

by Jörn Engel

Subject: Re: hugetlb pages not accounted for in rss

On Mon, Jul 27, 2015 at 04:26:47PM -0700, Mike Kravetz wrote:
> I started looking at the hugetlb self tests. The test hugetlbfstest
> expects hugetlb pages to be accounted for in rss. However, there is
> no code in the kernel to do this accounting.
>
> It looks like there was an effort to add the accounting back in 2013.
> The test program made it into tree, but the accounting code did not.

My apologies. Upstream work always gets axed first when I run out of
time - which happens more often than not.

> The easiest way to resolve this issue would be to remove the test and
> perhaps document that hugetlb pages are not accounted for in rss.
> However, it does seem like a big oversight that hugetlb pages are not
> accounted for in rss. From a quick scan of the code it appears THP
> pages are properly accounted for.
>
> Thoughts?

Unsurprisingly I agree that hugepages should count towards rss. Keeping
the test in keeps us honest. Actually fixing the issue would make us
honest and correct.

Increasingly we have tiny processes (by rss) that actually consume large
fractions of total memory. Makes rss somewhat useless as a measure of
anything.

Jörn

--
Consensus is no proof!
-- John Naisbitt

2015-07-28 21:15:46

by Mike Kravetz

Subject: Re: hugetlb pages not accounted for in rss

On 07/28/2015 11:32 AM, Jörn Engel wrote:
> On Mon, Jul 27, 2015 at 04:26:47PM -0700, Mike Kravetz wrote:
>> I started looking at the hugetlb self tests. The test hugetlbfstest
>> expects hugetlb pages to be accounted for in rss. However, there is
>> no code in the kernel to do this accounting.
>>
>> It looks like there was an effort to add the accounting back in 2013.
>> The test program made it into tree, but the accounting code did not.
>
> My apologies. Upstream work always gets axed first when I run out of
> time - which happens more often than not.

No worries, I just noticed the inconsistency between the test program and
the lack of supporting code in the kernel.

>> The easiest way to resolve this issue would be to remove the test and
>> perhaps document that hugetlb pages are not accounted for in rss.
>> However, it does seem like a big oversight that hugetlb pages are not
>> accounted for in rss. From a quick scan of the code it appears THP
>> pages are properly accounted for.
>>
>> Thoughts?
>
> Unsurprisingly I agree that hugepages should count towards rss. Keeping
> the test in keeps us honest. Actually fixing the issue would make us
> honest and correct.
>
> Increasingly we have tiny processes (by rss) that actually consume large
> fractions of total memory. Makes rss somewhat useless as a measure of
> anything.

I'll take a look at what it would take to get the accounting in place.
--
Mike Kravetz

>
> Jörn
>
> --
> Consensus is no proof!
> -- John Naisbitt
>

2015-07-28 22:15:21

by David Rientjes

Subject: Re: hugetlb pages not accounted for in rss

On Tue, 28 Jul 2015, Mike Kravetz wrote:

> > > The easiest way to resolve this issue would be to remove the test and
> > > perhaps document that hugetlb pages are not accounted for in rss.
> > > However, it does seem like a big oversight that hugetlb pages are not
> > > accounted for in rss. From a quick scan of the code it appears THP
> > > pages are properly accounted for.
> > >
> > > Thoughts?
> >
> > Unsurprisingly I agree that hugepages should count towards rss. Keeping
> > the test in keeps us honest. Actually fixing the issue would make us
> > honest and correct.
> >
> > Increasingly we have tiny processes (by rss) that actually consume large
> > fractions of total memory. Makes rss somewhat useless as a measure of
> > anything.
>
> I'll take a look at what it would take to get the accounting in place.

I'm not sure that I would agree that accounting hugetlb pages in rss would
always be appropriate.

For reserved hugetlb pages, not surplus, the hugetlb pages are always
resident even when unmapped. Unmapping the memory is not going to cause
them to be freed. That's different from thp where the hugepages are
actually freed when you do munmap().

The oom killer looks at rss as the metric to determine which process to
kill that will result in a large amount of memory freeing. If hugetlb
pages are accounted in rss, this may lead to unnecessary killing since
little memory may be freed as a result.

For that reason, we've added hugetlb statistics to the oom killer output
since we've been left wondering in the past where all the memory on the
system went :)

We also have a separate hugetlb cgroup that tracks hugetlb memory usage
rather than memcg.

Starting to account hugetlb pages in rss may lead to breakage in userspace
and I would agree with your earlier suggestion that just removing any test
for rss would be appropriate.

2015-07-28 22:27:01

by Jörn Engel

Subject: Re: hugetlb pages not accounted for in rss

On Tue, Jul 28, 2015 at 03:15:17PM -0700, David Rientjes wrote:
>
> Starting to account hugetlb pages in rss may lead to breakage in userspace
> and I would agree with your earlier suggestion that just removing any test
> for rss would be appropriate.

What would you propose for me then? I have 80% RAM or more in reserved
hugepages. OOM-killer is not a concern, as it panics the system - the
alternatives were almost universally silly and we didn't want to deal
with system in unpredictable states. But knowing how much memory is
used by which process is a concern. And if you only tell me about the
small (and continuously shrinking) portion, I essentially fly blind.

That is not a case of "may lead to breakage", it _is_ broken.

Ideally we would have fixed this in 2002 when hugetlbfs was introduced.
By now we might have to introduce a new field, rss_including_hugepages
or whatever. Then we have to update tools like top etc. to use the new
field when appropriate. No fun, but might be necessary.

If we can get away with including hugepages in rss and fixing the OOM
killer to be less silly, I would strongly prefer that. But I don't know
how much of a mess we are already in.

Jörn

--
Time? What's that? Time is only worth what you do with it.
-- Theo de Raadt

2015-07-28 23:30:22

by David Rientjes

Subject: Re: hugetlb pages not accounted for in rss

On Tue, 28 Jul 2015, Jörn Engel wrote:

> What would you propose for me then? I have 80% RAM or more in reserved
> hugepages. OOM-killer is not a concern, as it panics the system - the
> alternatives were almost universally silly and we didn't want to deal
> with system in unpredictable states. But knowing how much memory is
> used by which process is a concern. And if you only tell me about the
> small (and continuously shrinking) portion, I essentially fly blind.
>
> That is not a case of "may lead to breakage", it _is_ broken.
>
> Ideally we would have fixed this in 2002 when hugetlbfs was introduced.
> By now we might have to introduce a new field, rss_including_hugepages
> or whatever. Then we have to update tools like top etc. to use the new
> field when appropriate. No fun, but might be necessary.
>
> If we can get away with including hugepages in rss and fixing the OOM
> killer to be less silly, I would strongly prefer that. But I don't know
> how much of a mess we are already in.
>

It's not only the oom killer, I don't believe hugetlb pages are accounted
to the "rss" in memcg. They use the hugetlb_cgroup for that. Starting to
account for them in existing memcg deployments would cause them to hit
their memory limits much earlier. The "rss_huge" field in memcg only
represents transparent hugepages.

I agree with your comment that having done this when hugetlbfs was
introduced would have been optimal.

It's always difficult to add a new class of memory to an existing metric
("new" here because it's currently unaccounted).

If we can add yet another process metric to track hugetlbfs memory mapped,
then the test could be converted to use that. I'm not sure if the
justification would be strong enough, but you could try.

2015-07-29 00:53:38

by Jörn Engel

Subject: Re: hugetlb pages not accounted for in rss

On Tue, Jul 28, 2015 at 04:30:19PM -0700, David Rientjes wrote:
>
> It's not only the oom killer, I don't believe hugetlb pages are accounted
> to the "rss" in memcg. They use the hugetlb_cgroup for that. Starting to
> account for them in existing memcg deployments would cause them to hit
> their memory limits much earlier. The "rss_huge" field in memcg only
> represents transparent hugepages.
>
> I agree with your comment that having done this when hugetlbfs was
> introduced would have been optimal.
>
> It's always difficult to add a new class of memory to an existing metric
> ("new" here because it's currently unaccounted).
>
> If we can add yet another process metric to track hugetlbfs memory mapped,
> then the test could be converted to use that. I'm not sure if the
> justification would be strong enough, but you could try.

Well, we definitely need something. Having a 100GB process show 3GB of
rss is not very useful. How would we notice a memory leak if it only
affects hugepages, for example?

Jörn

--
The object-oriented version of 'Spaghetti code' is, of course, 'Lasagna code'.
(Too many layers).
-- Roberto Waltman.

2015-07-29 19:08:10

by David Rientjes

Subject: Re: hugetlb pages not accounted for in rss

On Tue, 28 Jul 2015, Jörn Engel wrote:

> Well, we definitely need something. Having a 100GB process show 3GB of
> rss is not very useful. How would we notice a memory leak if it only
> affects hugepages, for example?
>

Since the hugetlb pool is a global resource, it would also be helpful to
determine if a process is mapping more than expected. You can't do that
just by adding a huge rss metric, however: if you have 2MB and 1GB
hugepages configured you wouldn't know if a process was mapping 512 2MB
hugepages or 1 1GB hugepage.

That's the purpose of hugetlb_cgroup, after all, and it supports usage
counters for all hstates. The test could be converted to use that to
measure usage if configured in the kernel.

Beyond that, I'm not sure how a per-hstate rss metric would be exported to
userspace in a clean way and other ways of obtaining the same data are
possible with hugetlb_cgroup. I'm not sure how successful you'd be in
arguing that we need separate rss counters for it.
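
For illustration, a minimal userspace sketch of reading those per-hstate usage
counters from a hugetlb cgroup (cgroup v1 interface; the mount point, the group
name "mygroup", and the probed page sizes below are assumptions, not part of
the discussion above):

/*
 * Sketch: read per-hstate usage from a hugetlb cgroup (cgroup v1).
 * The mount point and the group name "mygroup" are made up; which
 * hugetlb.<size>.usage_in_bytes files exist depends on the hstates
 * configured in the running kernel.
 */
#include <stdio.h>

int main(void)
{
	const char *files[] = {
		"/sys/fs/cgroup/hugetlb/mygroup/hugetlb.2MB.usage_in_bytes",
		"/sys/fs/cgroup/hugetlb/mygroup/hugetlb.1GB.usage_in_bytes",
	};
	int i;

	for (i = 0; i < 2; i++) {
		unsigned long long bytes = 0;
		FILE *f = fopen(files[i], "r");

		if (!f)
			continue;	/* hstate not configured or no such group */
		if (fscanf(f, "%llu", &bytes) == 1)
			printf("%s: %llu kB\n", files[i], bytes >> 10);
		fclose(f);
	}
	return 0;
}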

2015-07-29 23:21:09

by Mike Kravetz

Subject: Re: hugetlb pages not accounted for in rss

On 07/29/2015 12:08 PM, David Rientjes wrote:
> On Tue, 28 Jul 2015, Jörn Engel wrote:
>
>> Well, we definitely need something. Having a 100GB process show 3GB of
>> rss is not very useful. How would we notice a memory leak if it only
>> affects hugepages, for example?
>>
>
> Since the hugetlb pool is a global resource, it would also be helpful to
> determine if a process is mapping more than expected. You can't do that
> just by adding a huge rss metric, however: if you have 2MB and 1GB
> hugepages configured you wouldn't know if a process was mapping 512 2MB
> hugepages or 1 1GB hugepage.
>
> That's the purpose of hugetlb_cgroup, after all, and it supports usage
> counters for all hstates. The test could be converted to use that to
> measure usage if configured in the kernel.
>
> Beyond that, I'm not sure how a per-hstate rss metric would be exported to
> userspace in a clean way and other ways of obtaining the same data are
> possible with hugetlb_cgroup. I'm not sure how successful you'd be in
> arguing that we need separate rss counters for it.

If I want to track hugetlb usage on a per-task basis, do I then need to
create one cgroup per task?

For example, suppose I have many tasks using hugetlb and the global pool
is getting low on free pages. It might be useful to know which tasks are
using hugetlb pages, and how many they are using.

I don't actually have this need (I think), but it appears to be what
Jörn is asking for.
--
Mike Kravetz

2015-07-30 21:34:19

by Jörn Engel

Subject: Re: hugetlb pages not accounted for in rss

On Wed, Jul 29, 2015 at 04:20:59PM -0700, Mike Kravetz wrote:
> >
> >Since the hugetlb pool is a global resource, it would also be helpful to
> >determine if a process is mapping more than expected. You can't do that
> >just by adding a huge rss metric, however: if you have 2MB and 1GB
> >hugepages configured you wouldn't know if a process was mapping 512 2MB
> >hugepages or 1 1GB hugepage.

Fair, although I believe 1GB hugepages are overrated. If you assume
that per-page overhead is independent of page size (not quite true, but
close enough), going from 1% small pages to 0.8% small pages will
improve performance as much as going from 99% 2MB pages to 99% 1GB
pages.
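
(A rough illustration of that assumption, using the 100GB process figure
mentioned earlier in the thread: at 1% small pages, 1GB of 4KB pages is about
262k mappings while 99GB of 2MB pages is about 51k, so the small pages dominate
the page count. Shrinking the small-page share from 1% to 0.8% removes about
52k mappings; converting the 99GB from 2MB to 1GB pages removes about 51k.
Under the stated per-page-overhead assumption, the two wins are roughly equal.)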

> >That's the purpose of hugetlb_cgroup, after all, and it supports usage
> >counters for all hstates. The test could be converted to use that to
> >measure usage if configured in the kernel.
> >
> >Beyond that, I'm not sure how a per-hstate rss metric would be exported to
> >userspace in a clean way and other ways of obtaining the same data are
> >possible with hugetlb_cgroup. I'm not sure how successful you'd be in
> >arguing that we need separate rss counters for it.
>
> If I want to track hugetlb usage on a per-task basis, do I then need to
> create one cgroup per task?
>
> For example, suppose I have many tasks using hugetlb and the global pool
> is getting low on free pages. It might be useful to know which tasks are
> using hugetlb pages, and how many they are using.
>
> I don't actually have this need (I think), but it appears to be what
> Jörn is asking for.

Maybe some background is useful. I would absolutely love to use
transparent hugepages. They are absolutely perfect in every respect,
except for performance. With transparent hugepages we get higher
latencies. Small pages are unacceptable, so we are forced to use
non-transparent hugepages.

The part of our system that uses small pages is pretty much constant,
while total system memory follows Moore's law. When possible we even
try to shrink that part. Hugepages already dominate today and things
will get worse.

But otherwise we have all the problems that others also have. There are
memory leaks and we would like to know how much memory each process
actually uses. Most people use rss, while we have nothing good. And I
am not sure if cgroup is the correct answer for essentially fixing a
regression introduced in 2002.

Jörn

--
You cannot suppose that Moliere ever troubled himself to be original in the
matter of ideas. You cannot suppose that the stories he tells in his plays
have never been told before. They were culled, as you very well know.
-- Andre-Louis Moreau in Scarabouche

2015-07-31 21:09:31

by David Rientjes

Subject: Re: hugetlb pages not accounted for in rss

On Thu, 30 Jul 2015, Jörn Engel wrote:

> > If I want to track hugetlb usage on a per-task basis, do I then need to
> > create one cgroup per task?
> >

I think this would only be used for debugging or testing, but if you have
root and are trying to organize processes into a hugetlb_cgroup hierarchy,
presumably you would just look at smaps and find each thread's hugetlb
memory usage and not bother.

> Maybe some background is useful. I would absolutely love to use
> transparent hugepages. They are absolutely perfect in every respect,
> except for performance. With transparent hugepages we get higher
> latencies. Small pages are unacceptable, so we are forced to use
> non-transparent hugepages.
>

Believe me, we are on the same page that way :) We still deploy
configurations with hugetlb memory because we need to meet certain
allocation requirements and it is only possible to do at boot.

With regard to the performance of thp, I can think of two things that are
affecting you:

- allocation cost

Async memory compaction in the page fault path for thp memory is very
lightweight and it happily falls back to using small pages instead.
Memory compaction is always being improved upon and there is on-going
work to do memory compaction both periodically and in the background to
keep fragmentation low. The ultimate goal would be to remove async
compaction entirely from the thp page fault path and rely on
improvements to memory compaction such that we have a great allocation
success rate and less cost when we fail.

- NUMA cost

Until very recently, thp pages could easily be allocated remotely
instead of small pages locally. That has since been improved and we
only allocate thp locally and then fallback to small pages locally
first. Khugepaged can still migrate memory remotely, but it will
allocate the hugepage on the node where the majority of smallpages
are from.

> The part of our system that uses small pages is pretty much constant,
> while total system memory follows Moore's law. When possible we even
> try to shrink that part. Hugepages already dominate today and things
> will get worse.
>

I wrote a patchset, hugepages overcommit, that allows unmapped hugetlb
pages to be freed in oom conditions before calling the oom killer up to a
certain threshold and then kick off a background thread to try to
reallocate them. The idea is to keep the hugetlb pool as large as
possible up to oom and then only reclaim what is needed and then try to
reallocate them. Not sure if it would help your particular usecase or
not.

2015-08-04 02:56:57

by Naoya Horiguchi

Subject: Re: hugetlb pages not accounted for in rss

On Wed, Jul 29, 2015 at 04:20:59PM -0700, Mike Kravetz wrote:
> On 07/29/2015 12:08 PM, David Rientjes wrote:
> >On Tue, 28 Jul 2015, Jörn Engel wrote:
> >
> >>Well, we definitely need something. Having a 100GB process show 3GB of
> >>rss is not very useful. How would we notice a memory leak if it only
> >>affects hugepages, for example?
> >>
> >
> >Since the hugetlb pool is a global resource, it would also be helpful to
> >determine if a process is mapping more than expected. You can't do that
> >just by adding a huge rss metric, however: if you have 2MB and 1GB
> >hugepages configured you wouldn't know if a process was mapping 512 2MB
> >hugepages or 1 1GB hugepage.
> >
> >That's the purpose of hugetlb_cgroup, after all, and it supports usage
> >counters for all hstates. The test could be converted to use that to
> >measure usage if configured in the kernel.
> >
> >Beyond that, I'm not sure how a per-hstate rss metric would be exported to
> >userspace in a clean way and other ways of obtaining the same data are
> >possible with hugetlb_cgroup. I'm not sure how successful you'd be in
> >arguing that we need separate rss counters for it.
>
> If I want to track hugetlb usage on a per-task basis, do I then need to
> create one cgroup per task?
>
> For example, suppose I have many tasks using hugetlb and the global pool
> is getting low on free pages. It might be useful to know which tasks are
> using hugetlb pages, and how many they are using.
>
> I don't actually have this need (I think), but it appears to be what
> Jörn is asking for.

One possible way to get a hugetlb metric on a per-task basis is to walk the
page table via /proc/pid/pagemap and count page flags for each mapped page
(we can easily do this with tools/vm/page-types.c, like "page-types -p <PID>
-b huge"). This is obviously slower than storing the counter as in-kernel
data and just exporting it, but might be useful in some situations.

Thanks,
Naoya Horiguchi
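
For illustration, a minimal userspace sketch of the pagemap approach described
above. It relies on the documented pagemap/kpageflags layout (present bit 63,
PFN in bits 0-54, KPF_HUGE at bit 17), needs root to read /proc/kpageflags,
and the read_u64() helper plus the address-range arguments (taken from
/proc/<pid>/maps) are this sketch's own:

/*
 * Count hugetlb-backed base pages in a virtual address range of another
 * process by cross-checking /proc/<pid>/pagemap against /proc/kpageflags,
 * roughly what "page-types -p <PID> -b huge" does.
 */
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define PM_PRESENT	(1ULL << 63)
#define PM_PFN_MASK	((1ULL << 55) - 1)	/* PFN lives in bits 0-54 */
#define KPF_HUGE	(1ULL << 17)		/* part of a hugetlb page */

static uint64_t read_u64(int fd, uint64_t index)
{
	uint64_t val = 0;

	if (pread(fd, &val, sizeof(val), index * sizeof(val)) != sizeof(val))
		return 0;
	return val;
}

int main(int argc, char **argv)
{
	char path[64];
	uint64_t start, end, addr, nhuge = 0;
	long psize = sysconf(_SC_PAGESIZE);
	int pagemap, kpageflags;

	if (argc != 4) {
		fprintf(stderr, "usage: %s <pid> <start> <end>\n", argv[0]);
		return 1;
	}
	start = strtoull(argv[2], NULL, 0);
	end = strtoull(argv[3], NULL, 0);

	snprintf(path, sizeof(path), "/proc/%s/pagemap", argv[1]);
	pagemap = open(path, O_RDONLY);
	kpageflags = open("/proc/kpageflags", O_RDONLY);
	if (pagemap < 0 || kpageflags < 0) {
		perror("open");
		return 1;
	}

	for (addr = start; addr < end; addr += psize) {
		uint64_t ent = read_u64(pagemap, addr / psize);

		if (!(ent & PM_PRESENT))
			continue;
		if (read_u64(kpageflags, ent & PM_PFN_MASK) & KPF_HUGE)
			nhuge++;
	}
	printf("%" PRIu64 " hugetlb-backed base pages in range\n", nhuge);
	close(pagemap);
	close(kpageflags);
	return 0;
}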

2015-08-04 05:18:27

by Naoya Horiguchi

Subject: [PATCH] smaps: fill missing fields for vma(VM_HUGETLB)

On Tue, Aug 04, 2015 at 02:55:30AM +0000, Naoya Horiguchi wrote:
> On Wed, Jul 29, 2015 at 04:20:59PM -0700, Mike Kravetz wrote:
> > On 07/29/2015 12:08 PM, David Rientjes wrote:
> > >On Tue, 28 Jul 2015, Jörn Engel wrote:
> > >
> > >>Well, we definitely need something. Having a 100GB process show 3GB of
> > >>rss is not very useful. How would we notice a memory leak if it only
> > >>affects hugepages, for example?
> > >>
> > >
> > >Since the hugetlb pool is a global resource, it would also be helpful to
> > >determine if a process is mapping more than expected. You can't do that
> > >just by adding a huge rss metric, however: if you have 2MB and 1GB
> > >hugepages configured you wouldn't know if a process was mapping 512 2MB
> > >hugepages or 1 1GB hugepage.
> > >
> > >That's the purpose of hugetlb_cgroup, after all, and it supports usage
> > >counters for all hstates. The test could be converted to use that to
> > >measure usage if configured in the kernel.
> > >
> > >Beyond that, I'm not sure how a per-hstate rss metric would be exported to
> > >userspace in a clean way and other ways of obtaining the same data are
> > >possible with hugetlb_cgroup. I'm not sure how successful you'd be in
> > >arguing that we need separate rss counters for it.
> >
> > If I want to track hugetlb usage on a per-task basis, do I then need to
> > create one cgroup per task?
> >
> > For example, suppose I have many tasks using hugetlb and the global pool
> > is getting low on free pages. It might be useful to know which tasks are
> > using hugetlb pages, and how many they are using.
> >
> > I don't actually have this need (I think), but it appears to be what
> > Jörn is asking for.
>
> One possible way to get a hugetlb metric on a per-task basis is to walk the
> page table via /proc/pid/pagemap and count page flags for each mapped page
> (we can easily do this with tools/vm/page-types.c, like "page-types -p <PID>
> -b huge"). This is obviously slower than storing the counter as in-kernel
> data and just exporting it, but might be useful in some situations.

BTW, currently smaps doesn't report any meaningful info for vma(VM_HUGETLB).
I wrote the following patch, which hopefully is helpful for your purpose.

Thanks,
Naoya Horiguchi

---
From: Naoya Horiguchi <[email protected]>
Subject: [PATCH] smaps: fill missing fields for vma(VM_HUGETLB)

Currently smaps reports many zero fields for vma(VM_HUGETLB), which is
inconvenient when we want to know per-task or per-vma base hugetlb usage.
This patch enables these fields by introducing smaps_hugetlb_range().

before patch:

Size: 20480 kB
Rss: 0 kB
Pss: 0 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 0 kB
Anonymous: 0 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 2048 kB
MMUPageSize: 2048 kB
Locked: 0 kB
VmFlags: rd wr mr mw me de ht

after patch:

Size: 20480 kB
Rss: 18432 kB
Pss: 18432 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 18432 kB
Referenced: 18432 kB
Anonymous: 18432 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 2048 kB
MMUPageSize: 2048 kB
Locked: 0 kB
VmFlags: rd wr mr mw me de ht

Signed-off-by: Naoya Horiguchi <[email protected]>
---
fs/proc/task_mmu.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ca1e091881d4..c7218603306d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -610,12 +610,39 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
seq_putc(m, '\n');
}

+#ifdef CONFIG_HUGETLB_PAGE
+static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+ struct mem_size_stats *mss = walk->private;
+ struct vm_area_struct *vma = walk->vma;
+ struct page *page = NULL;
+
+ if (pte_present(*pte)) {
+ page = vm_normal_page(vma, addr, *pte);
+ } else if (is_swap_pte(*pte)) {
+ swp_entry_t swpent = pte_to_swp_entry(*pte);
+
+ if (is_migration_entry(swpent))
+ page = migration_entry_to_page(swpent);
+ }
+ if (page)
+ smaps_account(mss, page, huge_page_size(hstate_vma(vma)),
+ pte_young(*pte), pte_dirty(*pte));
+ return 0;
+}
+#endif /* HUGETLB_PAGE */
+
static int show_smap(struct seq_file *m, void *v, int is_pid)
{
struct vm_area_struct *vma = v;
struct mem_size_stats mss;
struct mm_walk smaps_walk = {
.pmd_entry = smaps_pte_range,
+#ifdef CONFIG_HUGETLB_PAGE
+ .hugetlb_entry = smaps_hugetlb_range,
+#endif
.mm = vma->vm_mm,
.private = &mss,
};
--
2.4.3

2015-08-04 18:22:03

by Jörn Engel

Subject: Re: [PATCH] smaps: fill missing fields for vma(VM_HUGETLB)

On Tue, Aug 04, 2015 at 05:13:39AM +0000, Naoya Horiguchi wrote:
> On Tue, Aug 04, 2015 at 02:55:30AM +0000, Naoya Horiguchi wrote:
> >
> > One possible way to get a hugetlb metric on a per-task basis is to walk the
> > page table via /proc/pid/pagemap and count page flags for each mapped page
> > (we can easily do this with tools/vm/page-types.c, like "page-types -p <PID>
> > -b huge"). This is obviously slower than storing the counter as in-kernel
> > data and just exporting it, but might be useful in some situations.

Maybe. The current situation is a mess and I don't know the best way
out of it yet.

> BTW, currently smaps doesn't report any meaningful info for vma(VM_HUGETLB).
> I wrote the following patch, which hopefully is helpful for your purpose.
>
> Thanks,
> Naoya Horiguchi
>
> ---
> From: Naoya Horiguchi <[email protected]>
> Subject: [PATCH] smaps: fill missing fields for vma(VM_HUGETLB)
>
> Currently smaps reports many zero fields for vma(VM_HUGETLB), which is
> inconvenient when we want to know per-task or per-vma base hugetlb usage.
> This patch enables these fields by introducing smaps_hugetlb_range().
>
> before patch:
>
> Size: 20480 kB
> Rss: 0 kB
> Pss: 0 kB
> Shared_Clean: 0 kB
> Shared_Dirty: 0 kB
> Private_Clean: 0 kB
> Private_Dirty: 0 kB
> Referenced: 0 kB
> Anonymous: 0 kB
> AnonHugePages: 0 kB
> Swap: 0 kB
> KernelPageSize: 2048 kB
> MMUPageSize: 2048 kB
> Locked: 0 kB
> VmFlags: rd wr mr mw me de ht
>
> after patch:
>
> Size: 20480 kB
> Rss: 18432 kB
> Pss: 18432 kB
> Shared_Clean: 0 kB
> Shared_Dirty: 0 kB
> Private_Clean: 0 kB
> Private_Dirty: 18432 kB
> Referenced: 18432 kB
> Anonymous: 18432 kB
> AnonHugePages: 0 kB
> Swap: 0 kB
> KernelPageSize: 2048 kB
> MMUPageSize: 2048 kB
> Locked: 0 kB
> VmFlags: rd wr mr mw me de ht

Nice!

> Signed-off-by: Naoya Horiguchi <[email protected]>
> ---
> fs/proc/task_mmu.c | 27 +++++++++++++++++++++++++++
> 1 file changed, 27 insertions(+)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index ca1e091881d4..c7218603306d 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -610,12 +610,39 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
> seq_putc(m, '\n');
> }
>
> +#ifdef CONFIG_HUGETLB_PAGE
> +static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> + unsigned long addr, unsigned long end,
> + struct mm_walk *walk)
> +{
> + struct mem_size_stats *mss = walk->private;
> + struct vm_area_struct *vma = walk->vma;
> + struct page *page = NULL;
> +
> + if (pte_present(*pte)) {
> + page = vm_normal_page(vma, addr, *pte);
> + } else if (is_swap_pte(*pte)) {
> + swp_entry_t swpent = pte_to_swp_entry(*pte);
> +
> + if (is_migration_entry(swpent))
> + page = migration_entry_to_page(swpent);
> + }
> + if (page)
> + smaps_account(mss, page, huge_page_size(hstate_vma(vma)),
> + pte_young(*pte), pte_dirty(*pte));
> + return 0;
> +}
> +#endif /* HUGETLB_PAGE */
> +
> static int show_smap(struct seq_file *m, void *v, int is_pid)
> {
> struct vm_area_struct *vma = v;
> struct mem_size_stats mss;
> struct mm_walk smaps_walk = {
> .pmd_entry = smaps_pte_range,
> +#ifdef CONFIG_HUGETLB_PAGE
> + .hugetlb_entry = smaps_hugetlb_range,
> +#endif

Not too fond of the #ifdef. But I won't blame you, as there already is
an example of the same and - worse - a contradicting example that
unconditionally assigns and moved the #ifdef elsewhere.

Hugetlb is the unloved stepchild with 13 years of neglect and
half-measures. It shows.

Patch looks good to me.

Acked-by: Jörn Engel <[email protected]>

Jörn

--
Functionality is an asset, but code is a liability.
--Ted Dziuba

2015-08-06 02:18:48

by David Rientjes

Subject: Re: [PATCH] smaps: fill missing fields for vma(VM_HUGETLB)

On Tue, 4 Aug 2015, Jörn Engel wrote:

> > From: Naoya Horiguchi <[email protected]>
> > Subject: [PATCH] smaps: fill missing fields for vma(VM_HUGETLB)
> >
> > Currently smaps reports many zero fields for vma(VM_HUGETLB), which is
> > inconvenient when we want to know per-task or per-vma base hugetlb usage.
> > This patch enables these fields by introducing smaps_hugetlb_range().
> >
> > before patch:
> >
> > Size: 20480 kB
> > Rss: 0 kB
> > Pss: 0 kB
> > Shared_Clean: 0 kB
> > Shared_Dirty: 0 kB
> > Private_Clean: 0 kB
> > Private_Dirty: 0 kB
> > Referenced: 0 kB
> > Anonymous: 0 kB
> > AnonHugePages: 0 kB
> > Swap: 0 kB
> > KernelPageSize: 2048 kB
> > MMUPageSize: 2048 kB
> > Locked: 0 kB
> > VmFlags: rd wr mr mw me de ht
> >
> > after patch:
> >
> > Size: 20480 kB
> > Rss: 18432 kB
> > Pss: 18432 kB
> > Shared_Clean: 0 kB
> > Shared_Dirty: 0 kB
> > Private_Clean: 0 kB
> > Private_Dirty: 18432 kB
> > Referenced: 18432 kB
> > Anonymous: 18432 kB
> > AnonHugePages: 0 kB
> > Swap: 0 kB
> > KernelPageSize: 2048 kB
> > MMUPageSize: 2048 kB
> > Locked: 0 kB
> > VmFlags: rd wr mr mw me de ht
>
> Nice!
>

Hmm, wouldn't this be confusing since VmRSS in /proc/pid/status doesn't
match the rss shown in smaps, since hugetlb mappings aren't accounted in
get_mm_rss()?

Not sure this is a good idea, I think consistency amongst rss values would
be more important.

2015-08-06 07:45:50

by Naoya Horiguchi

Subject: Re: [PATCH] smaps: fill missing fields for vma(VM_HUGETLB)

On Wed, Aug 05, 2015 at 07:18:44PM -0700, David Rientjes wrote:
...
> Hmm, wouldn't this be confusing since VmRSS in /proc/pid/status doesn't
> match the rss shown in smaps, since hugetlb mappings aren't accounted in
> get_mm_rss()?
>
> Not sure this is a good idea, I think consistency amongst rss values would
> be more important.

Right, so one option is making get_mm_rss() count hugetlb, but that could
make oom/memcg less efficient or broken, as you stated in a previous email.
So another option would be to add a "VmHugetlbRSS:" field in /proc/pid/status?

Thanks,
Naoya Horiguchi

2015-08-07 07:25:42

by Naoya Horiguchi

Subject: [PATCH v2 0/2] hugetlb: display per-process/per-vma usage

I wrote patches to export hugetlb usage info via /proc/pid/{smaps,status}.
In this version, I added patch 2 for /proc/pid/status to deal with the
inconsistency concern from David (thanks for the comment).

Thanks,
Naoya Horiguchi
---
Summary:

Naoya Horiguchi (2):
smaps: fill missing fields for vma(VM_HUGETLB)
mm: hugetlb: add VmHugetlbRSS: field in /proc/pid/status

fs/proc/task_mmu.c | 32 +++++++++++++++++++++++++++++++-
include/linux/hugetlb.h | 18 ++++++++++++++++++
include/linux/mm.h | 3 +++
include/linux/mm_types.h | 3 +++
mm/hugetlb.c | 9 +++++++++
mm/memory.c | 4 ++--
mm/rmap.c | 4 +++-
7 files changed, 69 insertions(+), 4 deletions(-)

2015-08-07 07:25:48

by Naoya Horiguchi

Subject: [PATCH v2 1/2] smaps: fill missing fields for vma(VM_HUGETLB)

Currently smaps reports many zero fields for vma(VM_HUGETLB), which is
inconvenient when we want to know per-task or per-vma base hugetlb usage.
This patch enables these fields by introducing smaps_hugetlb_range().

before patch:

Size: 20480 kB
Rss: 0 kB
Pss: 0 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 0 kB
Anonymous: 0 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 2048 kB
MMUPageSize: 2048 kB
Locked: 0 kB
VmFlags: rd wr mr mw me de ht

after patch:

Size: 20480 kB
Rss: 18432 kB
Pss: 18432 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 18432 kB
Referenced: 18432 kB
Anonymous: 18432 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 2048 kB
MMUPageSize: 2048 kB
Locked: 0 kB
VmFlags: rd wr mr mw me de ht

Signed-off-by: Naoya Horiguchi <[email protected]>
Acked-by: Jörn Engel <[email protected]>
---
fs/proc/task_mmu.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)

diff --git v4.2-rc4.orig/fs/proc/task_mmu.c v4.2-rc4/fs/proc/task_mmu.c
index ca1e091881d4..c7218603306d 100644
--- v4.2-rc4.orig/fs/proc/task_mmu.c
+++ v4.2-rc4/fs/proc/task_mmu.c
@@ -610,12 +610,39 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
seq_putc(m, '\n');
}

+#ifdef CONFIG_HUGETLB_PAGE
+static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+ struct mem_size_stats *mss = walk->private;
+ struct vm_area_struct *vma = walk->vma;
+ struct page *page = NULL;
+
+ if (pte_present(*pte)) {
+ page = vm_normal_page(vma, addr, *pte);
+ } else if (is_swap_pte(*pte)) {
+ swp_entry_t swpent = pte_to_swp_entry(*pte);
+
+ if (is_migration_entry(swpent))
+ page = migration_entry_to_page(swpent);
+ }
+ if (page)
+ smaps_account(mss, page, huge_page_size(hstate_vma(vma)),
+ pte_young(*pte), pte_dirty(*pte));
+ return 0;
+}
+#endif /* HUGETLB_PAGE */
+
static int show_smap(struct seq_file *m, void *v, int is_pid)
{
struct vm_area_struct *vma = v;
struct mem_size_stats mss;
struct mm_walk smaps_walk = {
.pmd_entry = smaps_pte_range,
+#ifdef CONFIG_HUGETLB_PAGE
+ .hugetlb_entry = smaps_hugetlb_range,
+#endif
.mm = vma->vm_mm,
.private = &mss,
};
--
2.4.3

2015-08-07 07:25:40

by Naoya Horiguchi

Subject: [PATCH v2 2/2] mm: hugetlb: add VmHugetlbRSS: field in /proc/pid/status

Currently there's no easy way to get per-process usage of hugetlb pages, which
is inconvenient because applications which use hugetlb typically want to control
their processes on the basis of how much memory (including hugetlb) they use.
So this patch simply provides easy access to the info via /proc/pid/status.

This patch shouldn't change the OOM behavior (so hugetlb usage is ignored as
is now,) which I guess is fine until we have some strong reason to do it.

Signed-off-by: Naoya Horiguchi <[email protected]>
---
fs/proc/task_mmu.c | 5 ++++-
include/linux/hugetlb.h | 18 ++++++++++++++++++
include/linux/mm.h | 3 +++
include/linux/mm_types.h | 3 +++
mm/hugetlb.c | 9 +++++++++
mm/memory.c | 4 ++--
mm/rmap.c | 4 +++-
7 files changed, 42 insertions(+), 4 deletions(-)

diff --git v4.2-rc4.orig/fs/proc/task_mmu.c v4.2-rc4/fs/proc/task_mmu.c
index c7218603306d..f181f56fcce2 100644
--- v4.2-rc4.orig/fs/proc/task_mmu.c
+++ v4.2-rc4/fs/proc/task_mmu.c
@@ -22,7 +22,7 @@
void task_mem(struct seq_file *m, struct mm_struct *mm)
{
unsigned long data, text, lib, swap, ptes, pmds;
- unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
+ unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss, hugetlb_rss;

/*
* Note: to minimize their overhead, mm maintains hiwater_vm and
@@ -37,6 +37,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
hiwater_rss = total_rss = get_mm_rss(mm);
if (hiwater_rss < mm->hiwater_rss)
hiwater_rss = mm->hiwater_rss;
+ hugetlb_rss = get_hugetlb_rss(mm);

data = mm->total_vm - mm->shared_vm - mm->stack_vm;
text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
@@ -51,6 +52,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
"VmPin:\t%8lu kB\n"
"VmHWM:\t%8lu kB\n"
"VmRSS:\t%8lu kB\n"
+ "VmHugetlbRSS:\t%8lu kB\n"
"VmData:\t%8lu kB\n"
"VmStk:\t%8lu kB\n"
"VmExe:\t%8lu kB\n"
@@ -64,6 +66,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
mm->pinned_vm << (PAGE_SHIFT-10),
hiwater_rss << (PAGE_SHIFT-10),
total_rss << (PAGE_SHIFT-10),
+ hugetlb_rss << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
ptes >> 10,
diff --git v4.2-rc4.orig/include/linux/hugetlb.h v4.2-rc4/include/linux/hugetlb.h
index d891f949466a..6319df124e68 100644
--- v4.2-rc4.orig/include/linux/hugetlb.h
+++ v4.2-rc4/include/linux/hugetlb.h
@@ -469,6 +469,21 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
#define hugepages_supported() (HPAGE_SHIFT != 0)
#endif

+/*
+ * These simple wrappers are to hide MM_HUGETLBPAGES from outside hugetlbfs
+ * subsystem. The counter MM_HUGETLBPAGES is maintained in page unit basis,
+ * so it changes by 512 for example if a 2MB hugepage is mapped or unmapped.
+ */
+static inline int get_hugetlb_rss(struct mm_struct *mm)
+{
+ return get_mm_counter(mm, MM_HUGETLBPAGES);
+}
+
+static inline void mod_hugetlb_rss(struct mm_struct *mm, long value)
+{
+ add_mm_counter(mm, MM_HUGETLBPAGES, value);
+}
+
#else /* CONFIG_HUGETLB_PAGE */
struct hstate {};
#define alloc_huge_page_node(h, nid) NULL
@@ -504,6 +519,9 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
{
return &mm->page_table_lock;
}
+
+#define get_hugetlb_rss(mm) 0
+#define mod_hugetlb_rss(mm, value) do {} while (0)
#endif /* CONFIG_HUGETLB_PAGE */

static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git v4.2-rc4.orig/include/linux/mm.h v4.2-rc4/include/linux/mm.h
index 2e872f92dbac..9218a8856483 100644
--- v4.2-rc4.orig/include/linux/mm.h
+++ v4.2-rc4/include/linux/mm.h
@@ -1355,6 +1355,9 @@ static inline void sync_mm_rss(struct mm_struct *mm)
}
#endif

+extern inline void init_rss_vec(int *rss);
+extern inline void add_mm_rss_vec(struct mm_struct *mm, int *rss);
+
int vma_wants_writenotify(struct vm_area_struct *vma);

extern pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
diff --git v4.2-rc4.orig/include/linux/mm_types.h v4.2-rc4/include/linux/mm_types.h
index 0038ac7466fd..887b43ba5a18 100644
--- v4.2-rc4.orig/include/linux/mm_types.h
+++ v4.2-rc4/include/linux/mm_types.h
@@ -348,6 +348,9 @@ enum {
MM_FILEPAGES,
MM_ANONPAGES,
MM_SWAPENTS,
+#ifdef CONFIG_HUGETLB_PAGE
+ MM_HUGETLBPAGES,
+#endif
NR_MM_COUNTERS
};

diff --git v4.2-rc4.orig/mm/hugetlb.c v4.2-rc4/mm/hugetlb.c
index a8c3087089d8..12e5e7d3b60f 100644
--- v4.2-rc4.orig/mm/hugetlb.c
+++ v4.2-rc4/mm/hugetlb.c
@@ -2743,9 +2743,11 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */
int ret = 0;
+ int rss[NR_MM_COUNTERS];

cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;

+ init_rss_vec(rss);
mmun_start = vma->vm_start;
mmun_end = vma->vm_end;
if (cow)
@@ -2797,6 +2799,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
get_page(ptepage);
page_dup_rmap(ptepage);
set_huge_pte_at(dst, addr, dst_pte, entry);
+ rss[MM_HUGETLBPAGES] += pages_per_huge_page(h);
}
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
@@ -2805,6 +2808,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
if (cow)
mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);

+ add_mm_rss_vec(dst, rss);
return ret;
}

@@ -2823,6 +2827,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
unsigned long sz = huge_page_size(h);
const unsigned long mmun_start = start; /* For mmu_notifiers */
const unsigned long mmun_end = end; /* For mmu_notifiers */
+ int rss[NR_MM_COUNTERS];

WARN_ON(!is_vm_hugetlb_page(vma));
BUG_ON(start & ~huge_page_mask(h));
@@ -2832,6 +2837,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
address = start;
again:
+ init_rss_vec(rss);
for (; address < end; address += sz) {
ptep = huge_pte_offset(mm, address);
if (!ptep)
@@ -2877,6 +2883,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
if (huge_pte_dirty(pte))
set_page_dirty(page);

+ rss[MM_HUGETLBPAGES] -= pages_per_huge_page(h);
page_remove_rmap(page);
force_flush = !__tlb_remove_page(tlb, page);
if (force_flush) {
@@ -2892,6 +2899,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
unlock:
spin_unlock(ptl);
}
+ add_mm_rss_vec(mm, rss);
/*
* mmu_gather ran out of room to batch pages, we break out of
* the PTE lock to avoid doing the potential expensive TLB invalidate
@@ -3261,6 +3269,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
&& (vma->vm_flags & VM_SHARED)));
set_huge_pte_at(mm, address, ptep, new_pte);

+ mod_hugetlb_rss(mm, pages_per_huge_page(h));
if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
/* Optimization, do the COW without a second fault */
ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page, ptl);
diff --git v4.2-rc4.orig/mm/memory.c v4.2-rc4/mm/memory.c
index 388dcf9aa283..e09b53da2733 100644
--- v4.2-rc4.orig/mm/memory.c
+++ v4.2-rc4/mm/memory.c
@@ -620,12 +620,12 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
return 0;
}

-static inline void init_rss_vec(int *rss)
+inline void init_rss_vec(int *rss)
{
memset(rss, 0, sizeof(int) * NR_MM_COUNTERS);
}

-static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
+inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
{
int i;

diff --git v4.2-rc4.orig/mm/rmap.c v4.2-rc4/mm/rmap.c
index 171b68768df1..78e77b0ea3c3 100644
--- v4.2-rc4.orig/mm/rmap.c
+++ v4.2-rc4/mm/rmap.c
@@ -1230,7 +1230,9 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
update_hiwater_rss(mm);

if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
- if (!PageHuge(page)) {
+ if (PageHuge(page)) {
+ mod_hugetlb_rss(mm, -(1 << compound_order(page)));
+ } else {
if (PageAnon(page))
dec_mm_counter(mm, MM_ANONPAGES);
else
--
2.4.3
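
For completeness, a sketch of how userspace might consume the proposed field.
It assumes a kernel with the patch above applied; mainline kernels at this
point do not print a VmHugetlbRSS: line, so the loop simply finds nothing:

/* Read the proposed VmHugetlbRSS: field from /proc/self/status. */
#include <stdio.h>

int main(void)
{
	char line[256];
	unsigned long kb = 0;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "VmHugetlbRSS: %lu kB", &kb) == 1) {
			printf("hugetlb RSS: %lu kB\n", kb);
			break;
		}
	}
	fclose(f);
	return 0;
}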

2015-08-07 22:51:06

by Andrew Morton

Subject: Re: [PATCH v2 1/2] smaps: fill missing fields for vma(VM_HUGETLB)

On Fri, 7 Aug 2015 07:24:50 +0000 Naoya Horiguchi <[email protected]> wrote:

> Currently smaps reports many zero fields for vma(VM_HUGETLB), which is
> inconvenient when we want to know per-task or per-vma base hugetlb usage.
> This patch enables these fields by introducing smaps_hugetlb_range().
>
> before patch:
>
> Size: 20480 kB
> Rss: 0 kB
> Pss: 0 kB
> Shared_Clean: 0 kB
> Shared_Dirty: 0 kB
> Private_Clean: 0 kB
> Private_Dirty: 0 kB
> Referenced: 0 kB
> Anonymous: 0 kB
> AnonHugePages: 0 kB
> Swap: 0 kB
> KernelPageSize: 2048 kB
> MMUPageSize: 2048 kB
> Locked: 0 kB
> VmFlags: rd wr mr mw me de ht
>
> after patch:
>
> Size: 20480 kB
> Rss: 18432 kB
> Pss: 18432 kB
> Shared_Clean: 0 kB
> Shared_Dirty: 0 kB
> Private_Clean: 0 kB
> Private_Dirty: 18432 kB
> Referenced: 18432 kB
> Anonymous: 18432 kB
> AnonHugePages: 0 kB
> Swap: 0 kB
> KernelPageSize: 2048 kB
> MMUPageSize: 2048 kB
> Locked: 0 kB
> VmFlags: rd wr mr mw me de ht
>

Is there something we can say about this in
Documentation/filesystems/proc.txt?

2015-08-07 22:55:39

by Andrew Morton

Subject: Re: [PATCH v2 2/2] mm: hugetlb: add VmHugetlbRSS: field in /proc/pid/status

On Fri, 7 Aug 2015 07:24:50 +0000 Naoya Horiguchi <[email protected]> wrote:

> Currently there's no easy way to get per-process usage of hugetlb pages, which
> is inconvenient because applications which use hugetlb typically want to control
> their processes on the basis of how much memory (including hugetlb) they use.
> So this patch simply provides easy access to the info via /proc/pid/status.
>
> This patch shouldn't change the OOM behavior (so hugetlb usage is ignored as
> is now,) which I guess is fine until we have some strong reason to do it.
>

A procfs change triggers a documentation change. Always, please.
Documentation/filesystems/proc.txt is the place.

>
> ...
>
> @@ -504,6 +519,9 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
> {
> return &mm->page_table_lock;
> }
> +
> +#define get_hugetlb_rss(mm) 0
> +#define mod_hugetlb_rss(mm, value) do {} while (0)

I don't think these have to be macros? inline functions are nicer in
several ways: more readable, more likely to be documented, can prevent
unused variable warnings.

> #endif /* CONFIG_HUGETLB_PAGE */
>
> static inline spinlock_t *huge_pte_lock(struct hstate *h,
>
> ...
>
> --- v4.2-rc4.orig/mm/memory.c
> +++ v4.2-rc4/mm/memory.c
> @@ -620,12 +620,12 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
> return 0;
> }
>
> -static inline void init_rss_vec(int *rss)
> +inline void init_rss_vec(int *rss)
> {
> memset(rss, 0, sizeof(int) * NR_MM_COUNTERS);
> }
>
> -static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
> +inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
> {
> int i;

The inlines are a bit odd, but this does save ~10 bytes in memory.o for
some reason.

2015-08-10 00:49:03

by Naoya Horiguchi

Subject: [PATCH v3 1/3] smaps: fill missing fields for vma(VM_HUGETLB)

Currently smaps reports many zero fields for vma(VM_HUGETLB), which is
inconvenient when we want to know per-task or per-vma base hugetlb usage.
This patch enables these fields by introducing smaps_hugetlb_range().

before patch:

Size: 20480 kB
Rss: 0 kB
Pss: 0 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 0 kB
Anonymous: 0 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 2048 kB
MMUPageSize: 2048 kB
Locked: 0 kB
VmFlags: rd wr mr mw me de ht

after patch:

Size: 20480 kB
Rss: 18432 kB
Pss: 18432 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 18432 kB
Referenced: 18432 kB
Anonymous: 18432 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 2048 kB
MMUPageSize: 2048 kB
Locked: 0 kB
VmFlags: rd wr mr mw me de ht

Signed-off-by: Naoya Horiguchi <[email protected]>
Acked-by: Jörn Engel <[email protected]>
---
fs/proc/task_mmu.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)

diff --git v4.2-rc4.orig/fs/proc/task_mmu.c v4.2-rc4/fs/proc/task_mmu.c
index ca1e091881d4..c7218603306d 100644
--- v4.2-rc4.orig/fs/proc/task_mmu.c
+++ v4.2-rc4/fs/proc/task_mmu.c
@@ -610,12 +610,39 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
seq_putc(m, '\n');
}

+#ifdef CONFIG_HUGETLB_PAGE
+static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+ struct mem_size_stats *mss = walk->private;
+ struct vm_area_struct *vma = walk->vma;
+ struct page *page = NULL;
+
+ if (pte_present(*pte)) {
+ page = vm_normal_page(vma, addr, *pte);
+ } else if (is_swap_pte(*pte)) {
+ swp_entry_t swpent = pte_to_swp_entry(*pte);
+
+ if (is_migration_entry(swpent))
+ page = migration_entry_to_page(swpent);
+ }
+ if (page)
+ smaps_account(mss, page, huge_page_size(hstate_vma(vma)),
+ pte_young(*pte), pte_dirty(*pte));
+ return 0;
+}
+#endif /* HUGETLB_PAGE */
+
static int show_smap(struct seq_file *m, void *v, int is_pid)
{
struct vm_area_struct *vma = v;
struct mem_size_stats mss;
struct mm_walk smaps_walk = {
.pmd_entry = smaps_pte_range,
+#ifdef CONFIG_HUGETLB_PAGE
+ .hugetlb_entry = smaps_hugetlb_range,
+#endif
.mm = vma->vm_mm,
.private = &mss,
};
--
2.4.3

2015-08-10 00:49:14

by Naoya Horiguchi

Subject: Re: [PATCH v2 2/2] mm: hugetlb: add VmHugetlbRSS: field in /proc/pid/status

On Fri, Aug 07, 2015 at 03:55:37PM -0700, Andrew Morton wrote:
> On Fri, 7 Aug 2015 07:24:50 +0000 Naoya Horiguchi <[email protected]> wrote:
>
> > Currently there's no easy way to get per-process usage of hugetlb pages, which
> > is inconvenient because applications which use hugetlb typically want to control
> > their processes on the basis of how much memory (including hugetlb) they use.
> > So this patch simply provides easy access to the info via /proc/pid/status.
> >
> > This patch shouldn't change the OOM behavior (so hugetlb usage is ignored as
> > is now,) which I guess is fine until we have some strong reason to do it.
> >
>
> A procfs change triggers a documentation change. Always, please.
> Documentation/filesystems/proc.txt is the place.

OK, I'll do this.

> >
> > ...
> >
> > @@ -504,6 +519,9 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
> > {
> > return &mm->page_table_lock;
> > }
> > +
> > +#define get_hugetlb_rss(mm) 0
> > +#define mod_hugetlb_rss(mm, value) do {} while (0)
>
> I don't think these have to be macros? inline functions are nicer in
> several ways: more readable, more likely to be documented, can prevent
> unused variable warnings.

Right, I'll use inline functions.

> > #endif /* CONFIG_HUGETLB_PAGE */
> >
> > static inline spinlock_t *huge_pte_lock(struct hstate *h,
> >
> > ...
> >
> > --- v4.2-rc4.orig/mm/memory.c
> > +++ v4.2-rc4/mm/memory.c
> > @@ -620,12 +620,12 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
> > return 0;
> > }
> >
> > -static inline void init_rss_vec(int *rss)
> > +inline void init_rss_vec(int *rss)
> > {
> > memset(rss, 0, sizeof(int) * NR_MM_COUNTERS);
> > }
> >
> > -static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
> > +inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
> > {
> > int i;
>
> The inlines are a bit odd, but this does save ~10 bytes in memory.o for
> some reason.

so I'll keep going with this.

Thanks,
Naoya Horiguchi

2015-08-10 00:49:11

by Naoya Horiguchi

Subject: [PATCH v3 3/3] Documentation/filesystems/proc.txt: document hugetlb RSS

/proc/PID/{status,smaps} is aware of hugetlb RSS now, so let's document it.

Signed-off-by: Naoya Horiguchi <[email protected]>
---
Documentation/filesystems/proc.txt | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git v4.2-rc4.orig/Documentation/filesystems/proc.txt v4.2-rc4/Documentation/filesystems/proc.txt
index 6f7fafde0884..cb8565e150ed 100644
--- v4.2-rc4.orig/Documentation/filesystems/proc.txt
+++ v4.2-rc4/Documentation/filesystems/proc.txt
@@ -168,6 +168,7 @@ For example, to get the status information of a process, all you have to do is
VmLck: 0 kB
VmHWM: 476 kB
VmRSS: 476 kB
+ VmHugetlbRSS: 0 kB
VmData: 156 kB
VmStk: 88 kB
VmExe: 68 kB
@@ -230,6 +231,7 @@ Table 1-2: Contents of the status files (as of 4.1)
VmLck locked memory size
VmHWM peak resident set size ("high water mark")
VmRSS size of memory portions
+ VmHugetlbRSS size of hugetlb memory portions
VmData size of data, stack, and text segments
VmStk size of data, stack, and text segments
VmExe size of text segment
@@ -440,8 +442,12 @@ indicates the amount of memory currently marked as referenced or accessed.
"Anonymous" shows the amount of memory that does not belong to any file. Even
a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
and a page is modified, the file page is replaced by a private anonymous copy.
-"Swap" shows how much would-be-anonymous memory is also used, but out on
-swap.
+"Swap" shows how much would-be-anonymous memory is also used, but out on swap.
+Since 4.3, "RSS" contains the amount of mappings for hugetlb pages. Although
+RSS of hugetlb mappings is maintained separately from normal mappings
+(displayed in "VmHugetlbRSS" field of /proc/PID/status,) /proc/PID/smaps shows
+both mappings in "RSS" field. Userspace applications clearly distinguish the
+type of mapping with 'ht' flag in "VmFlags" field.

"VmFlags" field deserves a separate description. This member represents the kernel
flags associated with the particular virtual memory area in two letter encoded
--
2.4.3
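
A sketch of the userspace side that documentation describes: summing hugetlb
RSS for a process from /proc/<pid>/smaps by keying off the "ht" flag in
VmFlags. The parsing is deliberately simplified and assumes the post-patch
smaps output shown earlier in the thread, where hugetlb VMAs report a
non-zero Rss:

/* Sum Rss of hugetlb mappings ("ht" in VmFlags) from /proc/<pid>/smaps. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	unsigned long rss_kb = 0, vma_rss_kb = 0, total_kb = 0;
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/%s/smaps", argv[1]);
	f = fopen(path, "r");
	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "Rss: %lu kB", &rss_kb) == 1) {
			vma_rss_kb = rss_kb;	/* remember this VMA's Rss */
		} else if (!strncmp(line, "VmFlags:", 8)) {
			/* VmFlags: is the last line of each VMA's entry */
			if (strstr(line, " ht"))
				total_kb += vma_rss_kb;
			vma_rss_kb = 0;
		}
	}
	fclose(f);
	printf("hugetlb RSS: %lu kB\n", total_kb);
	return 0;
}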

2015-08-10 00:49:06

by Naoya Horiguchi

Subject: [PATCH v3 2/3] mm: hugetlb: add VmHugetlbRSS: field in /proc/pid/status

Currently there's no easy way to get per-process usage of hugetlb pages, which
is inconvenient because applications which use hugetlb typically want to control
their processes on the basis of how much memory (including hugetlb) they use.
So this patch simply provides easy access to the info via /proc/pid/status.

This patch shouldn't change the OOM behavior (so hugetlb usage is ignored as
is now,) which I guess is fine until we have some strong reason to do it.

Signed-off-by: Naoya Horiguchi <[email protected]>
---
v2 -> v3:
- use inline functions instead of macros for !CONFIG_HUGETLB_PAGE
---
fs/proc/task_mmu.c | 5 ++++-
include/linux/hugetlb.h | 24 ++++++++++++++++++++++++
include/linux/mm.h | 3 +++
include/linux/mm_types.h | 3 +++
mm/hugetlb.c | 9 +++++++++
mm/memory.c | 4 ++--
mm/rmap.c | 4 +++-
7 files changed, 48 insertions(+), 4 deletions(-)

diff --git v4.2-rc4.orig/fs/proc/task_mmu.c v4.2-rc4/fs/proc/task_mmu.c
index c7218603306d..f181f56fcce2 100644
--- v4.2-rc4.orig/fs/proc/task_mmu.c
+++ v4.2-rc4/fs/proc/task_mmu.c
@@ -22,7 +22,7 @@
void task_mem(struct seq_file *m, struct mm_struct *mm)
{
unsigned long data, text, lib, swap, ptes, pmds;
- unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
+ unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss, hugetlb_rss;

/*
* Note: to minimize their overhead, mm maintains hiwater_vm and
@@ -37,6 +37,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
hiwater_rss = total_rss = get_mm_rss(mm);
if (hiwater_rss < mm->hiwater_rss)
hiwater_rss = mm->hiwater_rss;
+ hugetlb_rss = get_hugetlb_rss(mm);

data = mm->total_vm - mm->shared_vm - mm->stack_vm;
text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
@@ -51,6 +52,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
"VmPin:\t%8lu kB\n"
"VmHWM:\t%8lu kB\n"
"VmRSS:\t%8lu kB\n"
+ "VmHugetlbRSS:\t%8lu kB\n"
"VmData:\t%8lu kB\n"
"VmStk:\t%8lu kB\n"
"VmExe:\t%8lu kB\n"
@@ -64,6 +66,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
mm->pinned_vm << (PAGE_SHIFT-10),
hiwater_rss << (PAGE_SHIFT-10),
total_rss << (PAGE_SHIFT-10),
+ hugetlb_rss << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
ptes >> 10,
diff --git v4.2-rc4.orig/include/linux/hugetlb.h v4.2-rc4/include/linux/hugetlb.h
index d891f949466a..fa956d94cacb 100644
--- v4.2-rc4.orig/include/linux/hugetlb.h
+++ v4.2-rc4/include/linux/hugetlb.h
@@ -469,6 +469,21 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
#define hugepages_supported() (HPAGE_SHIFT != 0)
#endif

+/*
+ * These simple wrappers are to hide MM_HUGETLBPAGES from outside hugetlbfs
+ * subsystem. The counter MM_HUGETLBPAGES is maintained in page unit basis,
+ * so it changes by 512 for example if a 2MB hugepage is mapped or unmapped.
+ */
+static inline int get_hugetlb_rss(struct mm_struct *mm)
+{
+ return get_mm_counter(mm, MM_HUGETLBPAGES);
+}
+
+static inline void mod_hugetlb_rss(struct mm_struct *mm, long value)
+{
+ add_mm_counter(mm, MM_HUGETLBPAGES, value);
+}
+
#else /* CONFIG_HUGETLB_PAGE */
struct hstate {};
#define alloc_huge_page_node(h, nid) NULL
@@ -504,6 +519,15 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
{
return &mm->page_table_lock;
}
+
+static inline int get_hugetlb_rss(struct mm_struct *mm)
+{
+ return 0;
+}
+
+static inline void mod_hugetlb_rss(struct mm_struct *mm, long value)
+{
+}
#endif /* CONFIG_HUGETLB_PAGE */

static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git v4.2-rc4.orig/include/linux/mm.h v4.2-rc4/include/linux/mm.h
index 2e872f92dbac..9218a8856483 100644
--- v4.2-rc4.orig/include/linux/mm.h
+++ v4.2-rc4/include/linux/mm.h
@@ -1355,6 +1355,9 @@ static inline void sync_mm_rss(struct mm_struct *mm)
}
#endif

+extern inline void init_rss_vec(int *rss);
+extern inline void add_mm_rss_vec(struct mm_struct *mm, int *rss);
+
int vma_wants_writenotify(struct vm_area_struct *vma);

extern pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
diff --git v4.2-rc4.orig/include/linux/mm_types.h v4.2-rc4/include/linux/mm_types.h
index 0038ac7466fd..887b43ba5a18 100644
--- v4.2-rc4.orig/include/linux/mm_types.h
+++ v4.2-rc4/include/linux/mm_types.h
@@ -348,6 +348,9 @@ enum {
MM_FILEPAGES,
MM_ANONPAGES,
MM_SWAPENTS,
+#ifdef CONFIG_HUGETLB_PAGE
+ MM_HUGETLBPAGES,
+#endif
NR_MM_COUNTERS
};

diff --git v4.2-rc4.orig/mm/hugetlb.c v4.2-rc4/mm/hugetlb.c
index a8c3087089d8..12e5e7d3b60f 100644
--- v4.2-rc4.orig/mm/hugetlb.c
+++ v4.2-rc4/mm/hugetlb.c
@@ -2743,9 +2743,11 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */
int ret = 0;
+ int rss[NR_MM_COUNTERS];

cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;

+ init_rss_vec(rss);
mmun_start = vma->vm_start;
mmun_end = vma->vm_end;
if (cow)
@@ -2797,6 +2799,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
get_page(ptepage);
page_dup_rmap(ptepage);
set_huge_pte_at(dst, addr, dst_pte, entry);
+ rss[MM_HUGETLBPAGES] += pages_per_huge_page(h);
}
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
@@ -2805,6 +2808,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
if (cow)
mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);

+ add_mm_rss_vec(dst, rss);
return ret;
}

@@ -2823,6 +2827,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
unsigned long sz = huge_page_size(h);
const unsigned long mmun_start = start; /* For mmu_notifiers */
const unsigned long mmun_end = end; /* For mmu_notifiers */
+ int rss[NR_MM_COUNTERS];

WARN_ON(!is_vm_hugetlb_page(vma));
BUG_ON(start & ~huge_page_mask(h));
@@ -2832,6 +2837,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
address = start;
again:
+ init_rss_vec(rss);
for (; address < end; address += sz) {
ptep = huge_pte_offset(mm, address);
if (!ptep)
@@ -2877,6 +2883,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
if (huge_pte_dirty(pte))
set_page_dirty(page);

+ rss[MM_HUGETLBPAGES] -= pages_per_huge_page(h);
page_remove_rmap(page);
force_flush = !__tlb_remove_page(tlb, page);
if (force_flush) {
@@ -2892,6 +2899,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
unlock:
spin_unlock(ptl);
}
+ add_mm_rss_vec(mm, rss);
/*
* mmu_gather ran out of room to batch pages, we break out of
* the PTE lock to avoid doing the potential expensive TLB invalidate
@@ -3261,6 +3269,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
&& (vma->vm_flags & VM_SHARED)));
set_huge_pte_at(mm, address, ptep, new_pte);

+ mod_hugetlb_rss(mm, pages_per_huge_page(h));
if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
/* Optimization, do the COW without a second fault */
ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page, ptl);
diff --git v4.2-rc4.orig/mm/memory.c v4.2-rc4/mm/memory.c
index 388dcf9aa283..e09b53da2733 100644
--- v4.2-rc4.orig/mm/memory.c
+++ v4.2-rc4/mm/memory.c
@@ -620,12 +620,12 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
return 0;
}

-static inline void init_rss_vec(int *rss)
+inline void init_rss_vec(int *rss)
{
memset(rss, 0, sizeof(int) * NR_MM_COUNTERS);
}

-static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
+inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
{
int i;

diff --git v4.2-rc4.orig/mm/rmap.c v4.2-rc4/mm/rmap.c
index 171b68768df1..78e77b0ea3c3 100644
--- v4.2-rc4.orig/mm/rmap.c
+++ v4.2-rc4/mm/rmap.c
@@ -1230,7 +1230,9 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
update_hiwater_rss(mm);

if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
- if (!PageHuge(page)) {
+ if (PageHuge(page)) {
+ mod_hugetlb_rss(mm, -(1 << compound_order(page)));
+ } else {
if (PageAnon(page))
dec_mm_counter(mm, MM_ANONPAGES);
else
--
2.4.3

2015-08-10 01:17:51

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] mm: hugetlb: add VmHugetlbRSS: field in /proc/pid/status

> @@ -504,6 +519,15 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
> {
> return &mm->page_table_lock;
> }
> +
> +static inline get_hugetlb_rss(struct mm_struct *mm)
> +{
> + return 0;
> +}
> +
> +static inline mod_hugetlb_rss(struct mm_struct *mm, long value)
> +{
> +}
> #endif /* CONFIG_HUGETLB_PAGE */
>
> static inline spinlock_t *huge_pte_lock(struct hstate *h,

sorry for my silly mistake, this doesn't build.
Let me resend this.

Thanks,
Naoya Horiguchi
---
From: Naoya Horiguchi <[email protected]>
Subject: [PATCH] mm: hugetlb: add VmHugetlbRSS: field in /proc/pid/status

Currently there's no easy way to get per-process usage of hugetlb pages, which
is inconvenient because applications which use hugetlb typically want to control
their processes on the basis of how much memory (including hugetlb) they use.
So this patch simply provides easy access to the info via /proc/pid/status.

This patch shouldn't change the OOM behavior (hugetlb usage is still ignored,
as it is now), which I guess is fine until we have some strong reason to change it.

Signed-off-by: Naoya Horiguchi <[email protected]>
---
v2 -> v3:
- use inline functions instead of macros for !CONFIG_HUGETLB_PAGE
---
fs/proc/task_mmu.c | 5 ++++-
include/linux/hugetlb.h | 24 ++++++++++++++++++++++++
include/linux/mm.h | 3 +++
include/linux/mm_types.h | 3 +++
mm/hugetlb.c | 9 +++++++++
mm/memory.c | 4 ++--
mm/rmap.c | 4 +++-
7 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c7218603306d..f181f56fcce2 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -22,7 +22,7 @@
void task_mem(struct seq_file *m, struct mm_struct *mm)
{
unsigned long data, text, lib, swap, ptes, pmds;
- unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
+ unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss, hugetlb_rss;

/*
* Note: to minimize their overhead, mm maintains hiwater_vm and
@@ -37,6 +37,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
hiwater_rss = total_rss = get_mm_rss(mm);
if (hiwater_rss < mm->hiwater_rss)
hiwater_rss = mm->hiwater_rss;
+ hugetlb_rss = get_hugetlb_rss(mm);

data = mm->total_vm - mm->shared_vm - mm->stack_vm;
text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
@@ -51,6 +52,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
"VmPin:\t%8lu kB\n"
"VmHWM:\t%8lu kB\n"
"VmRSS:\t%8lu kB\n"
+ "VmHugetlbRSS:\t%8lu kB\n"
"VmData:\t%8lu kB\n"
"VmStk:\t%8lu kB\n"
"VmExe:\t%8lu kB\n"
@@ -64,6 +66,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
mm->pinned_vm << (PAGE_SHIFT-10),
hiwater_rss << (PAGE_SHIFT-10),
total_rss << (PAGE_SHIFT-10),
+ hugetlb_rss << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
ptes >> 10,
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d891f949466a..8c72e935bda8 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -469,6 +469,21 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
#define hugepages_supported() (HPAGE_SHIFT != 0)
#endif

+/*
+ * These simple wrappers hide MM_HUGETLBPAGES from outside the hugetlbfs
+ * subsystem. The counter MM_HUGETLBPAGES is maintained on a per-page basis,
+ * so it changes by 512, for example, when a 2MB hugepage is mapped or unmapped.
+ */
+static inline int get_hugetlb_rss(struct mm_struct *mm)
+{
+ return get_mm_counter(mm, MM_HUGETLBPAGES);
+}
+
+static inline void mod_hugetlb_rss(struct mm_struct *mm, long value)
+{
+ add_mm_counter(mm, MM_HUGETLBPAGES, value);
+}
+
#else /* CONFIG_HUGETLB_PAGE */
struct hstate {};
#define alloc_huge_page_node(h, nid) NULL
@@ -504,6 +519,15 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
{
return &mm->page_table_lock;
}
+
+static inline int get_hugetlb_rss(struct mm_struct *mm)
+{
+ return 0;
+}
+
+static inline void mod_hugetlb_rss(struct mm_struct *mm, long value)
+{
+}
#endif /* CONFIG_HUGETLB_PAGE */

static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e872f92dbac..9218a8856483 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1355,6 +1355,9 @@ static inline void sync_mm_rss(struct mm_struct *mm)
}
#endif

+extern inline void init_rss_vec(int *rss);
+extern inline void add_mm_rss_vec(struct mm_struct *mm, int *rss);
+
int vma_wants_writenotify(struct vm_area_struct *vma);

extern pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0038ac7466fd..887b43ba5a18 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -348,6 +348,9 @@ enum {
MM_FILEPAGES,
MM_ANONPAGES,
MM_SWAPENTS,
+#ifdef CONFIG_HUGETLB_PAGE
+ MM_HUGETLBPAGES,
+#endif
NR_MM_COUNTERS
};

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a8c3087089d8..12e5e7d3b60f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2743,9 +2743,11 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */
int ret = 0;
+ int rss[NR_MM_COUNTERS];

cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;

+ init_rss_vec(rss);
mmun_start = vma->vm_start;
mmun_end = vma->vm_end;
if (cow)
@@ -2797,6 +2799,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
get_page(ptepage);
page_dup_rmap(ptepage);
set_huge_pte_at(dst, addr, dst_pte, entry);
+ rss[MM_HUGETLBPAGES] += pages_per_huge_page(h);
}
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
@@ -2805,6 +2808,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
if (cow)
mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);

+ add_mm_rss_vec(dst, rss);
return ret;
}

@@ -2823,6 +2827,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
unsigned long sz = huge_page_size(h);
const unsigned long mmun_start = start; /* For mmu_notifiers */
const unsigned long mmun_end = end; /* For mmu_notifiers */
+ int rss[NR_MM_COUNTERS];

WARN_ON(!is_vm_hugetlb_page(vma));
BUG_ON(start & ~huge_page_mask(h));
@@ -2832,6 +2837,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
address = start;
again:
+ init_rss_vec(rss);
for (; address < end; address += sz) {
ptep = huge_pte_offset(mm, address);
if (!ptep)
@@ -2877,6 +2883,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
if (huge_pte_dirty(pte))
set_page_dirty(page);

+ rss[MM_HUGETLBPAGES] -= pages_per_huge_page(h);
page_remove_rmap(page);
force_flush = !__tlb_remove_page(tlb, page);
if (force_flush) {
@@ -2892,6 +2899,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
unlock:
spin_unlock(ptl);
}
+ add_mm_rss_vec(mm, rss);
/*
* mmu_gather ran out of room to batch pages, we break out of
* the PTE lock to avoid doing the potential expensive TLB invalidate
@@ -3261,6 +3269,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
&& (vma->vm_flags & VM_SHARED)));
set_huge_pte_at(mm, address, ptep, new_pte);

+ mod_hugetlb_rss(mm, pages_per_huge_page(h));
if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
/* Optimization, do the COW without a second fault */
ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page, ptl);
diff --git a/mm/memory.c b/mm/memory.c
index 388dcf9aa283..e09b53da2733 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -620,12 +620,12 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
return 0;
}

-static inline void init_rss_vec(int *rss)
+inline void init_rss_vec(int *rss)
{
memset(rss, 0, sizeof(int) * NR_MM_COUNTERS);
}

-static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
+inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
{
int i;

diff --git a/mm/rmap.c b/mm/rmap.c
index 171b68768df1..78e77b0ea3c3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1230,7 +1230,9 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
update_hiwater_rss(mm);

if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
- if (!PageHuge(page)) {
+ if (PageHuge(page)) {
+ mod_hugetlb_rss(mm, -(1 << compound_order(page)));
+ } else {
if (PageAnon(page))
dec_mm_counter(mm, MM_ANONPAGES);
else
--
2.4.3



2015-08-11 00:37:57

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] smaps: fill missing fields for vma(VM_HUGETLB)

On Fri, 7 Aug 2015, Naoya Horiguchi wrote:

> Currently smaps reports many zero fields for vma(VM_HUGETLB), which is
> inconvenient when we want to know per-task or per-vma base hugetlb usage.
> This patch enables these fields by introducing smaps_hugetlb_range().
>
> before patch:
>
> Size: 20480 kB
> Rss: 0 kB
> Pss: 0 kB
> Shared_Clean: 0 kB
> Shared_Dirty: 0 kB
> Private_Clean: 0 kB
> Private_Dirty: 0 kB
> Referenced: 0 kB
> Anonymous: 0 kB
> AnonHugePages: 0 kB
> Swap: 0 kB
> KernelPageSize: 2048 kB
> MMUPageSize: 2048 kB
> Locked: 0 kB
> VmFlags: rd wr mr mw me de ht
>
> after patch:
>
> Size: 20480 kB
> Rss: 18432 kB
> Pss: 18432 kB
> Shared_Clean: 0 kB
> Shared_Dirty: 0 kB
> Private_Clean: 0 kB
> Private_Dirty: 18432 kB
> Referenced: 18432 kB
> Anonymous: 18432 kB
> AnonHugePages: 0 kB
> Swap: 0 kB
> KernelPageSize: 2048 kB
> MMUPageSize: 2048 kB
> Locked: 0 kB
> VmFlags: rd wr mr mw me de ht
>

I think this will lead to breakage, unfortunately, specifically for users
who are concerned with resource management.

An example: we use memcg hierarchies to charge memory for individual jobs,
specific users, and system overhead. Memcg is a cgroup, so this is done
for an aggregate of processes, and we often have to monitor their memory
usage. Each process isn't assigned to its own memcg, and I don't believe
common users of memcg assign individual processes to their own memcgs.

When a memcg is out of memory, we need to track the memory usage of
processes attached to its memcg hierarchy to determine what is unexpected,
either as a result of a new rollout or because of a memory leak. To do
that, we use the rss exported by smaps that is now changed with this
patch. By using smaps rather than /proc/pid/status, we can report where
memory usage is unexpected.
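
As an illustration of how directly such tooling depends on the current
semantics, a minimal sketch (hypothetical code, not from this thread) of that
kind of monitoring pass just sums the "Rss:" lines of /proc/<pid>/smaps;
folding hugetlb into that field would silently change what the total means:

/* Hypothetical sketch: sum the "Rss:" fields of /proc/<pid>/smaps, in kB,
 * for each pid given on the command line. */
#include <stdio.h>

static long smaps_rss_kb(const char *pid)
{
    char path[64], line[256];
    long total = 0, val;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%s/smaps", pid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "Rss: %ld kB", &val) == 1)
            total += val;
    fclose(f);
    return total;
}

int main(int argc, char **argv)
{
    int i;

    for (i = 1; i < argc; i++)
        printf("%s: %ld kB rss\n", argv[i], smaps_rss_kb(argv[i]));
    return 0;
}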

This would cause our process that manages all memcgs on our systems to
break. Perhaps I haven't been as convincing in my previous messages of
this, but it's quite an obvious userspace regression.

This memory was not included in rss originally because memory in the
hugetlb persistent pool is always resident. Unmapping the memory does not
free memory. For this reason, hugetlb memory has always been treated as
its own type of memory.
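
The persistent-pool point can be seen directly in /proc/meminfo; below is a
minimal sketch (hypothetical, assuming a 2MB default hugepage size and at
least one page in the pool) that maps and unmaps one hugepage --
HugePages_Total never changes, because unmapping only returns the page to the
reserved pool:

/* Hypothetical demo: watch the hugetlb pool counters around mmap/munmap of
 * one anonymous MAP_HUGETLB page. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (2UL << 20)    /* one 2MB hugepage */

static void show_pool(const char *when)
{
    char line[128];
    FILE *f = fopen("/proc/meminfo", "r");

    printf("--- %s ---\n", when);
    if (!f)
        return;
    while (fgets(line, sizeof(line), f))
        if (!strncmp(line, "HugePages_", 10))
            fputs(line, stdout);
    fclose(f);
}

int main(void)
{
    char *p;

    show_pool("before mmap");
    p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    p[0] = 1;    /* fault the hugepage in */
    show_pool("while mapped");
    munmap(p, LEN);
    show_pool("after munmap");
    return 0;
}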

It would have been arguable back when hugetlbfs was introduced whether it
should be included. I'm afraid the ship has sailed on that since a decade
has passed, and it would cause userspace to break if existing metrics are
used that already have clearly defined semantics.

2015-08-11 00:44:57

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] Documentation/filesystems/proc.txt: document hugetlb RSS

On Mon, 10 Aug 2015, Naoya Horiguchi wrote:

> diff --git v4.2-rc4.orig/Documentation/filesystems/proc.txt v4.2-rc4/Documentation/filesystems/proc.txt
> index 6f7fafde0884..cb8565e150ed 100644
> --- v4.2-rc4.orig/Documentation/filesystems/proc.txt
> +++ v4.2-rc4/Documentation/filesystems/proc.txt
> @@ -168,6 +168,7 @@ For example, to get the status information of a process, all you have to do is
> VmLck: 0 kB
> VmHWM: 476 kB
> VmRSS: 476 kB
> + VmHugetlbRSS: 0 kB
> VmData: 156 kB
> VmStk: 88 kB
> VmExe: 68 kB
> @@ -230,6 +231,7 @@ Table 1-2: Contents of the status files (as of 4.1)
> VmLck locked memory size
> VmHWM peak resident set size ("high water mark")
> VmRSS size of memory portions
> + VmHugetlbRSS size of hugetlb memory portions
> VmData size of data, stack, and text segments
> VmStk size of data, stack, and text segments
> VmExe size of text segment
> @@ -440,8 +442,12 @@ indicates the amount of memory currently marked as referenced or accessed.
> "Anonymous" shows the amount of memory that does not belong to any file. Even
> a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
> and a page is modified, the file page is replaced by a private anonymous copy.
> -"Swap" shows how much would-be-anonymous memory is also used, but out on
> -swap.
> +"Swap" shows how much would-be-anonymous memory is also used, but out on swap.
> +Since 4.3, "RSS" contains the amount of mappings for hugetlb pages. Although
> +RSS of hugetlb mappings is maintained separately from normal mappings
> +(displayed in "VmHugetlbRSS" field of /proc/PID/status,) /proc/PID/smaps shows
> +both mappings in "RSS" field. Userspace applications clearly distinguish the
> +type of mapping with 'ht' flag in "VmFlags" field.
>
> "VmFlags" field deserves a separate description. This member represents the kernel
> flags associated with the particular virtual memory area in two letter encoded

My objection to adding hugetlb memory to the RSS field of /proc/pid/smaps
still stands and can be addressed in the thread of the first patch. Since
this includes wording that describes that change, then the objection would
also cover that.

With regard to adding VmHugetlbRSS, I think the change is fine, and I
appreciate that you call it VmHugetlbRSS and not VmHugeRSS since that
would be confused with thp.

My only concern regarding VmHugetlbRSS would be extendability and whether
we will eventually, or even today, want to differentiate between various
hugetlb page sizes. For example, if 1GB hugetlb pages on x86 are a
precious resource, then how do I determine which process has mapped it
rather than 512 2MB hugetlb pages?

2015-08-11 23:39:17

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] smaps: fill missing fields for vma(VM_HUGETLB)

On Mon, Aug 10, 2015 at 05:37:54PM -0700, David Rientjes wrote:
> On Fri, 7 Aug 2015, Naoya Horiguchi wrote:
>
> > Currently smaps reports many zero fields for vma(VM_HUGETLB), which is
> > inconvenient when we want to know per-task or per-vma base hugetlb usage.
> > This patch enables these fields by introducing smaps_hugetlb_range().
> >
> > before patch:
> >
> > Size: 20480 kB
> > Rss: 0 kB
> > Pss: 0 kB
> > Shared_Clean: 0 kB
> > Shared_Dirty: 0 kB
> > Private_Clean: 0 kB
> > Private_Dirty: 0 kB
> > Referenced: 0 kB
> > Anonymous: 0 kB
> > AnonHugePages: 0 kB
> > Swap: 0 kB
> > KernelPageSize: 2048 kB
> > MMUPageSize: 2048 kB
> > Locked: 0 kB
> > VmFlags: rd wr mr mw me de ht
> >
> > after patch:
> >
> > Size: 20480 kB
> > Rss: 18432 kB
> > Pss: 18432 kB
> > Shared_Clean: 0 kB
> > Shared_Dirty: 0 kB
> > Private_Clean: 0 kB
> > Private_Dirty: 18432 kB
> > Referenced: 18432 kB
> > Anonymous: 18432 kB
> > AnonHugePages: 0 kB
> > Swap: 0 kB
> > KernelPageSize: 2048 kB
> > MMUPageSize: 2048 kB
> > Locked: 0 kB
> > VmFlags: rd wr mr mw me de ht
> >
>
> I think this will lead to breakage, unfortunately, specifically for users
> who are concerned with resource management.
>
> An example: we use memcg hierarchies to charge memory for individual jobs,
> specific users, and system overhead. Memcg is a cgroup, so this is done
> for an aggregate of processes, and we often have to monitor their memory
> usage. Each process isn't assigned to its own memcg, and I don't believe
> common users of memcg assign individual processes to their own memcgs.
>
> When a memcg is out of memory, we need to track the memory usage of
> processes attached to its memcg hierarchy to determine what is unexpected,
> either as a result of a new rollout or because of a memory leak. To do
> that, we use the rss exported by smaps that is now changed with this
> patch. By using smaps rather than /proc/pid/status, we can report where
> memory usage is unexpected.
>
> This would cause our process that manages all memcgs on our systems to
> break. Perhaps I haven't been as convincing in my previous messages of
> this, but it's quite an obvious userspace regression.

OK, this version assumes that userspace distinguishes vma(VM_HUGETLB) via the
"VmFlags" field, which is unrealistic. So I'll keep all existing fields
untouched and introduce a separate field for hugetlb usage info.

> This memory was not included in rss originally because memory in the
> hugetlb persistent pool is always resident. Unmapping the memory does not
> free memory. For this reason, hugetlb memory has always been treated as
> its own type of memory.

Right, so it might be better not to use the word "RSS" for hugetlb, maybe
something like "HugetlbPages:" seems better to me.

Thanks,
Naoya Horiguchi

> It would have been arguable back when hugetlbfs was introduced whether it
> should be included. I'm afraid the ship has sailed on that since a decade
> has passed, and it would cause userspace to break if existing metrics are
> used that already have clearly defined semantics.

2015-08-11 23:49:03

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] smaps: fill missing fields for vma(VM_HUGETLB)

On Tue, 11 Aug 2015, Naoya Horiguchi wrote:

> > This memory was not included in rss originally because memory in the
> > hugetlb persistent pool is always resident. Unmapping the memory does not
> > free memory. For this reason, hugetlb memory has always been treated as
> > its own type of memory.
>
> Right, so it might be better not to use the word "RSS" for hugetlb, maybe
> something like "HugetlbPages:" seems better to me.
>

When the pagesize is also specified, as it is in smaps, I think this would
be the best option. Note that we can't distinguish between variable
hugetlb sizes with VmFlags alone.

2015-08-12 00:04:48

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] Documentation/filesystems/proc.txt: document hugetlb RSS

On Mon, Aug 10, 2015 at 05:44:54PM -0700, David Rientjes wrote:
> On Mon, 10 Aug 2015, Naoya Horiguchi wrote:
>
> > diff --git v4.2-rc4.orig/Documentation/filesystems/proc.txt v4.2-rc4/Documentation/filesystems/proc.txt
> > index 6f7fafde0884..cb8565e150ed 100644
> > --- v4.2-rc4.orig/Documentation/filesystems/proc.txt
> > +++ v4.2-rc4/Documentation/filesystems/proc.txt
> > @@ -168,6 +168,7 @@ For example, to get the status information of a process, all you have to do is
> > VmLck: 0 kB
> > VmHWM: 476 kB
> > VmRSS: 476 kB
> > + VmHugetlbRSS: 0 kB
> > VmData: 156 kB
> > VmStk: 88 kB
> > VmExe: 68 kB
> > @@ -230,6 +231,7 @@ Table 1-2: Contents of the status files (as of 4.1)
> > VmLck locked memory size
> > VmHWM peak resident set size ("high water mark")
> > VmRSS size of memory portions
> > + VmHugetlbRSS size of hugetlb memory portions
> > VmData size of data, stack, and text segments
> > VmStk size of data, stack, and text segments
> > VmExe size of text segment
> > @@ -440,8 +442,12 @@ indicates the amount of memory currently marked as referenced or accessed.
> > "Anonymous" shows the amount of memory that does not belong to any file. Even
> > a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
> > and a page is modified, the file page is replaced by a private anonymous copy.
> > -"Swap" shows how much would-be-anonymous memory is also used, but out on
> > -swap.
> > +"Swap" shows how much would-be-anonymous memory is also used, but out on swap.
> > +Since 4.3, "RSS" contains the amount of mappings for hugetlb pages. Although
> > +RSS of hugetlb mappings is maintained separately from normal mappings
> > +(displayed in "VmHugetlbRSS" field of /proc/PID/status,) /proc/PID/smaps shows
> > +both mappings in "RSS" field. Userspace applications clearly distinguish the
> > +type of mapping with 'ht' flag in "VmFlags" field.
> >
> > "VmFlags" field deserves a separate description. This member represents the kernel
> > flags associated with the particular virtual memory area in two letter encoded
>
> My objection to adding hugetlb memory to the RSS field of /proc/pid/smaps
> still stands and can be addressed in the thread of the first patch. Since
> this includes wording that describes that change, then the objection would
> also cover that.

OK, I'll update this in accordance with the change on the first patch.

> With regard to adding VmHugetlbRSS, I think the change is fine, and I
> appreciate that you call it VmHugetlbRSS and not VmHugeRSS since that
> would be confused with thp.

I plan to rename the field; the new name will/should avoid any confusion
between thp and hugetlb.

> My only concern regarding VmHugetlbRSS would be extendability and whether
> we will eventually, or even today, want to differentiate between various
> hugetlb page sizes. For example, if 1GB hugetlb pages on x86 are a
> precious resource, then how do I determine which process has mapped it
> rather than 512 2MB hugetlb pages?

"KernelPageSize" field in /proc/PID/smaps is aware of hugetlb page sizes,
so I expected userspace to detect the size itself. But /proc/PID/status shows
only proccess-wide info, so userspace applications must read both of these
files to know the usage per hugepage size, which might be inconvenient.
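
A minimal sketch of that userspace side (hypothetical, not part of the
patches): walk /proc/PID/smaps, treat vmas whose VmFlags line carries 'ht' as
hugetlb, and bucket their mapped Size by KernelPageSize:

/* Hypothetical sketch: per-hugepage-size mapped size of one process, derived
 * from /proc/PID/smaps alone.  Size (not Rss) is used because Rss is 0 for
 * hugetlb vmas without these patches. */
#include <stdio.h>
#include <string.h>

#define MAX_SIZES 8

int main(int argc, char **argv)
{
    char path[64], line[256];
    unsigned long size_kb = 0, kps_kb = 0;
    unsigned long sizes[MAX_SIZES] = {0}, mapped[MAX_SIZES] = {0};
    int nsizes = 0, i;
    FILE *f;

    if (argc != 2)
        return 1;
    snprintf(path, sizeof(path), "/proc/%s/smaps", argv[1]);
    f = fopen(path, "r");
    if (!f)
        return 1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "Size: %lu kB", &size_kb) == 1)
            continue;
        if (sscanf(line, "KernelPageSize: %lu kB", &kps_kb) == 1)
            continue;
        if (strncmp(line, "VmFlags:", 8))    /* last line of each vma entry */
            continue;
        if (!strstr(line, " ht"))            /* not a hugetlb vma */
            continue;
        for (i = 0; i < nsizes && sizes[i] != kps_kb; i++)
            ;
        if (i == nsizes && nsizes < MAX_SIZES)
            sizes[nsizes++] = kps_kb;
        if (i < nsizes)
            mapped[i] += size_kb;
    }
    fclose(f);
    for (i = 0; i < nsizes; i++)
        printf("%lukB hugepages: %lu kB mapped\n", sizes[i], mapped[i]);
    return 0;
}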

One idea is to show the new field like "VmHugetlbRSS: 2x2048kB 1x1GB" for
both /proc/PID/{status,smaps}, which passes the full hugetlb info in a
single line and so is easier to parse and process. Or, since the other fields
are shown in "kB", "VmHugetlbRSS: 1052672 kB (2x2048kB 1x1GB)" is possible for
human readability.

Thank you very much for the feedback, I'll repost soon, but any additional
comment is appreciated.

Naoya

2015-08-12 07:48:46

by Naoya Horiguchi

[permalink] [raw]
Subject: [PATCH v4 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps

Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which
is inconvenient when we want to know per-task or per-vma base hugetlb usage.
To solve this, this patch adds a new line for hugetlb usage like below:

Size: 20480 kB
Rss: 0 kB
Pss: 0 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 0 kB
Anonymous: 0 kB
AnonHugePages: 0 kB
HugetlbPages: 18432 kB
Swap: 0 kB
KernelPageSize: 2048 kB
MMUPageSize: 2048 kB
Locked: 0 kB
VmFlags: rd wr mr mw me de ht

Signed-off-by: Naoya Horiguchi <[email protected]>
---
v3 -> v4:
- suspend Acked-by tag because v3->v4 change is not trivial
- I stated in previous discussion that HugetlbPages line can contain page
size info, but that's not necessary because we already have KernelPageSize
info.
- merged documentation update, where the current documentation doesn't mention
AnonHugePages, so it's also added.
---
Documentation/filesystems/proc.txt | 7 +++++--
fs/proc/task_mmu.c | 29 +++++++++++++++++++++++++++++
2 files changed, 34 insertions(+), 2 deletions(-)

diff --git v4.2-rc4.orig/Documentation/filesystems/proc.txt v4.2-rc4/Documentation/filesystems/proc.txt
index 6f7fafde0884..22e40211ef64 100644
--- v4.2-rc4.orig/Documentation/filesystems/proc.txt
+++ v4.2-rc4/Documentation/filesystems/proc.txt
@@ -423,6 +423,8 @@ Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 892 kB
Anonymous: 0 kB
+AnonHugePages: 0 kB
+HugetlbPages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
@@ -440,8 +442,9 @@ indicates the amount of memory currently marked as referenced or accessed.
"Anonymous" shows the amount of memory that does not belong to any file. Even
a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
and a page is modified, the file page is replaced by a private anonymous copy.
-"Swap" shows how much would-be-anonymous memory is also used, but out on
-swap.
+"AnonHugePages" shows the ammount of memory backed by transparent hugepage.
+"HugetlbPages" shows the ammount of memory backed by hugetlbfs page.
+"Swap" shows how much would-be-anonymous memory is also used, but out on swap.

"VmFlags" field deserves a separate description. This member represents the kernel
flags associated with the particular virtual memory area in two letter encoded
diff --git v4.2-rc4.orig/fs/proc/task_mmu.c v4.2-rc4/fs/proc/task_mmu.c
index ca1e091881d4..2c37938b82ee 100644
--- v4.2-rc4.orig/fs/proc/task_mmu.c
+++ v4.2-rc4/fs/proc/task_mmu.c
@@ -445,6 +445,7 @@ struct mem_size_stats {
unsigned long anonymous;
unsigned long anonymous_thp;
unsigned long swap;
+ unsigned long hugetlb;
u64 pss;
};

@@ -610,12 +611,38 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
seq_putc(m, '\n');
}

+#ifdef CONFIG_HUGETLB_PAGE
+static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+ struct mem_size_stats *mss = walk->private;
+ struct vm_area_struct *vma = walk->vma;
+ struct page *page = NULL;
+
+ if (pte_present(*pte)) {
+ page = vm_normal_page(vma, addr, *pte);
+ } else if (is_swap_pte(*pte)) {
+ swp_entry_t swpent = pte_to_swp_entry(*pte);
+
+ if (is_migration_entry(swpent))
+ page = migration_entry_to_page(swpent);
+ }
+ if (page)
+ mss->hugetlb += huge_page_size(hstate_vma(vma));
+ return 0;
+}
+#endif /* HUGETLB_PAGE */
+
static int show_smap(struct seq_file *m, void *v, int is_pid)
{
struct vm_area_struct *vma = v;
struct mem_size_stats mss;
struct mm_walk smaps_walk = {
.pmd_entry = smaps_pte_range,
+#ifdef CONFIG_HUGETLB_PAGE
+ .hugetlb_entry = smaps_hugetlb_range,
+#endif
.mm = vma->vm_mm,
.private = &mss,
};
@@ -637,6 +664,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
"Referenced: %8lu kB\n"
"Anonymous: %8lu kB\n"
"AnonHugePages: %8lu kB\n"
+ "HugetlbPages: %8lu kB\n"
"Swap: %8lu kB\n"
"KernelPageSize: %8lu kB\n"
"MMUPageSize: %8lu kB\n"
@@ -651,6 +679,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
mss.referenced >> 10,
mss.anonymous >> 10,
mss.anonymous_thp >> 10,
+ mss.hugetlb >> 10,
mss.swap >> 10,
vma_kernel_pagesize(vma) >> 10,
vma_mmu_pagesize(vma) >> 10,
--
2.4.3

2015-08-12 07:48:45

by Naoya Horiguchi

[permalink] [raw]
Subject: [PATCH v4 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

Currently there's no easy way to get per-process usage of hugetlb pages, which
is inconvenient because userspace applications which use hugetlb typically want
to control their processes on the basis of how much memory (including hugetlb)
they use. So this patch simply provides easy access to the info via
/proc/PID/status.

With this patch, for example, /proc/PID/status shows a line like this:

HugetlbPages: 20480 kB (10x2048kB)

If your system supports and enables multiple hugepage sizes, the line looks
like this:

HugetlbPages: 1069056 kB (1x1048576kB 10x2048kB)

, so you can easily see how many hugepages of each page size are used by a
process.

Signed-off-by: Naoya Horiguchi <[email protected]>
---
v3 -> v4:
- rename field (VmHugetlbRSS is not the best name)
- introduce struct hugetlb_usage in struct mm_struct (no invasion to struct
mm_rss_stat)
- introduce hugetlb_report_usage()
- merged documentation update

v2 -> v3:
- use inline functions instead of macros for !CONFIG_HUGETLB_PAGE
---
Documentation/filesystems/proc.txt | 3 +++
fs/proc/task_mmu.c | 1 +
include/linux/hugetlb.h | 20 ++++++++++++++++++++
include/linux/mm_types.h | 10 ++++++++++
mm/hugetlb.c | 27 +++++++++++++++++++++++++++
mm/rmap.c | 4 +++-
6 files changed, 64 insertions(+), 1 deletion(-)

diff --git v4.2-rc4.orig/Documentation/filesystems/proc.txt v4.2-rc4/Documentation/filesystems/proc.txt
index 22e40211ef64..e92a4a91fc99 100644
--- v4.2-rc4.orig/Documentation/filesystems/proc.txt
+++ v4.2-rc4/Documentation/filesystems/proc.txt
@@ -174,6 +174,7 @@ For example, to get the status information of a process, all you have to do is
VmLib: 1412 kB
VmPTE: 20 kb
VmSwap: 0 kB
+ HugetlbPages: 0 kB (0x2048kB)
Threads: 1
SigQ: 0/28578
SigPnd: 0000000000000000
@@ -237,6 +238,8 @@ Table 1-2: Contents of the status files (as of 4.1)
VmPTE size of page table entries
VmPMD size of second level page tables
VmSwap size of swap usage (the number of referred swapents)
+ HugetlbPages size of hugetlb memory portions (with additional info
+ about number of mapped hugepages for each page size)
Threads number of threads
SigQ number of signals queued/max. number for queue
SigPnd bitmap of pending signals for the thread
diff --git v4.2-rc4.orig/fs/proc/task_mmu.c v4.2-rc4/fs/proc/task_mmu.c
index 2c37938b82ee..b3cf7fa9ef6c 100644
--- v4.2-rc4.orig/fs/proc/task_mmu.c
+++ v4.2-rc4/fs/proc/task_mmu.c
@@ -69,6 +69,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
ptes >> 10,
pmds >> 10,
swap << (PAGE_SHIFT-10));
+ hugetlb_report_usage(m, mm);
}

unsigned long task_vsize(struct mm_struct *mm)
diff --git v4.2-rc4.orig/include/linux/hugetlb.h v4.2-rc4/include/linux/hugetlb.h
index d891f949466a..64aa4db01f48 100644
--- v4.2-rc4.orig/include/linux/hugetlb.h
+++ v4.2-rc4/include/linux/hugetlb.h
@@ -469,6 +469,18 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
#define hugepages_supported() (HPAGE_SHIFT != 0)
#endif

+void hugetlb_report_usage(struct seq_file *m, struct mm_struct *mm);
+
+static inline void inc_hugetlb_count(struct mm_struct *mm, struct hstate *h)
+{
+ atomic_long_inc(&mm->hugetlb_usage.count[hstate_index(h)]);
+}
+
+static inline void dec_hugetlb_count(struct mm_struct *mm, struct hstate *h)
+{
+ atomic_long_dec(&mm->hugetlb_usage.count[hstate_index(h)]);
+}
+
#else /* CONFIG_HUGETLB_PAGE */
struct hstate {};
#define alloc_huge_page_node(h, nid) NULL
@@ -504,6 +516,14 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
{
return &mm->page_table_lock;
}
+
+static inline void hugetlb_report_usage(struct seq_file *f, struct mm_struct *m)
+{
+}
+
+static inline void dec_hugetlb_count(struct mm_struct *mm, struct hstate *h)
+{
+}
#endif /* CONFIG_HUGETLB_PAGE */

static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git v4.2-rc4.orig/include/linux/mm_types.h v4.2-rc4/include/linux/mm_types.h
index 0038ac7466fd..e95c5fe1eb7d 100644
--- v4.2-rc4.orig/include/linux/mm_types.h
+++ v4.2-rc4/include/linux/mm_types.h
@@ -364,6 +364,12 @@ struct mm_rss_stat {
atomic_long_t count[NR_MM_COUNTERS];
};

+#ifdef CONFIG_HUGETLB_PAGE
+struct hugetlb_usage {
+ atomic_long_t count[HUGE_MAX_HSTATE];
+};
+#endif
+
struct kioctx_table;
struct mm_struct {
struct vm_area_struct *mmap; /* list of VMAs */
@@ -484,6 +490,10 @@ struct mm_struct {
/* address of the bounds directory */
void __user *bd_addr;
#endif
+
+#ifdef CONFIG_HUGETLB_PAGE
+ struct hugetlb_usage hugetlb_usage;
+#endif
};

static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git v4.2-rc4.orig/mm/hugetlb.c v4.2-rc4/mm/hugetlb.c
index a8c3087089d8..b890431d0763 100644
--- v4.2-rc4.orig/mm/hugetlb.c
+++ v4.2-rc4/mm/hugetlb.c
@@ -2562,6 +2562,30 @@ void hugetlb_show_meminfo(void)
1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
}

+void hugetlb_report_usage(struct seq_file *m, struct mm_struct *mm)
+{
+ int i;
+ unsigned long total_usage = 0;
+
+ for (i = 0; i < HUGE_MAX_HSTATE; i++) {
+ total_usage += atomic_long_read(&mm->hugetlb_usage.count[i]) *
+ (huge_page_size(&hstates[i]) >> 10);
+ }
+
+ seq_printf(m, "HugetlbPages:\t%8lu kB (", total_usage);
+ for (i = 0; i < HUGE_MAX_HSTATE; i++) {
+ if (huge_page_order(&hstates[i]) == 0)
+ break;
+ if (i > 0)
+ seq_puts(m, " ");
+
+ seq_printf(m, "%ldx%dkB",
+ atomic_long_read(&mm->hugetlb_usage.count[i]),
+ huge_page_size(&hstates[i]) >> 10);
+ }
+ seq_puts(m, ")\n");
+}
+
/* Return the number pages of memory we physically have, in PAGE_SIZE units. */
unsigned long hugetlb_total_pages(void)
{
@@ -2797,6 +2821,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
get_page(ptepage);
page_dup_rmap(ptepage);
set_huge_pte_at(dst, addr, dst_pte, entry);
+ inc_hugetlb_count(dst, h);
}
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
@@ -2877,6 +2902,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
if (huge_pte_dirty(pte))
set_page_dirty(page);

+ dec_hugetlb_count(mm, h);
page_remove_rmap(page);
force_flush = !__tlb_remove_page(tlb, page);
if (force_flush) {
@@ -3261,6 +3287,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
&& (vma->vm_flags & VM_SHARED)));
set_huge_pte_at(mm, address, ptep, new_pte);

+ inc_hugetlb_count(mm, h);
if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
/* Optimization, do the COW without a second fault */
ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page, ptl);
diff --git v4.2-rc4.orig/mm/rmap.c v4.2-rc4/mm/rmap.c
index 171b68768df1..b33278bc4ddb 100644
--- v4.2-rc4.orig/mm/rmap.c
+++ v4.2-rc4/mm/rmap.c
@@ -1230,7 +1230,9 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
update_hiwater_rss(mm);

if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
- if (!PageHuge(page)) {
+ if (PageHuge(page)) {
+ dec_hugetlb_count(mm, page_hstate(page));
+ } else {
if (PageAnon(page))
dec_mm_counter(mm, MM_ANONPAGES);
else
--
2.4.3

2015-08-12 20:26:04

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps

On Wed, 12 Aug 2015, Naoya Horiguchi wrote:

> Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which
> is inconvenient when we want to know per-task or per-vma base hugetlb usage.
> To solve this, this patch adds a new line for hugetlb usage like below:
>
> Size: 20480 kB
> Rss: 0 kB
> Pss: 0 kB
> Shared_Clean: 0 kB
> Shared_Dirty: 0 kB
> Private_Clean: 0 kB
> Private_Dirty: 0 kB
> Referenced: 0 kB
> Anonymous: 0 kB
> AnonHugePages: 0 kB
> HugetlbPages: 18432 kB
> Swap: 0 kB
> KernelPageSize: 2048 kB
> MMUPageSize: 2048 kB
> Locked: 0 kB
> VmFlags: rd wr mr mw me de ht
>
> Signed-off-by: Naoya Horiguchi <[email protected]>

Acked-by: David Rientjes <[email protected]>

2015-08-12 20:30:31

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

On Wed, 12 Aug 2015, Naoya Horiguchi wrote:

> Currently there's no easy way to get per-process usage of hugetlb pages, which
> is inconvenient because userspace applications which use hugetlb typically want
> to control their processes on the basis of how much memory (including hugetlb)
> they use. So this patch simply provides easy access to the info via
> /proc/PID/status.
>
> With this patch, for example, /proc/PID/status shows a line like this:
>
> HugetlbPages: 20480 kB (10x2048kB)
>
> If your system supports and enables multiple hugepage sizes, the line looks
> like this:
>
> HugetlbPages: 1069056 kB (1x1048576kB 10x2048kB)
>
> , so you can easily know how many hugepages in which pagesize are used by a
> process.
>
> Signed-off-by: Naoya Horiguchi <[email protected]>

I'm happy with this and thanks very much for going the extra mile and
breaking the usage down by hstate size.

I'd be interested in the comments of others, though, to see if there is
any reservation about the hstate size breakdown. It may actually find no
current customer who is interested in parsing it. (If we keep it, I would
suggest the 'x' change to '*' similar to per-order breakdowns in
show_mem()). It may also be possible to add it later if a definitive
usecase is presented.
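
For what it's worth, a minimal sketch of such a parser (hypothetical, and
assuming the '*' separator suggested above):

/* Hypothetical parser for the proposed /proc/PID/status line, e.g.
 *   HugetlbPages:    1069056 kB (1*1048576kB 10*2048kB)  */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char path[64], line[512];
    unsigned long total_kb, count, size_kb;
    char *p;
    FILE *f;

    if (argc != 2)
        return 1;
    snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);
    f = fopen(path, "r");
    if (!f)
        return 1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "HugetlbPages: %lu kB", &total_kb) != 1)
            continue;
        printf("hugetlb total: %lu kB\n", total_kb);
        p = strchr(line, '(');
        while (p) {
            p++;    /* step past '(' or the space between entries */
            if (sscanf(p, "%lu*%lukB", &count, &size_kb) == 2)
                printf("  %lu pages of %lu kB\n", count, size_kb);
            p = strchr(p, ' ');
        }
        break;
    }
    fclose(f);
    return 0;
}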

But overall I'm very happy with the new addition and think it's a good
solution to the problem.

2015-08-13 00:47:41

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

On Wed, Aug 12, 2015 at 01:30:27PM -0700, David Rientjes wrote:
> On Wed, 12 Aug 2015, Naoya Horiguchi wrote:
>
> > Currently there's no easy way to get per-process usage of hugetlb pages, which
> > is inconvenient because userspace applications which use hugetlb typically want
> > to control their processes on the basis of how much memory (including hugetlb)
> > they use. So this patch simply provides easy access to the info via
> > /proc/PID/status.
> >
> > With this patch, for example, /proc/PID/status shows a line like this:
> >
> > HugetlbPages: 20480 kB (10x2048kB)
> >
> > If your system supports and enables multiple hugepage sizes, the line looks
> > like this:
> >
> > HugetlbPages: 1069056 kB (1x1048576kB 10x2048kB)
> >
> > , so you can easily know how many hugepages in which pagesize are used by a
> > process.
> >
> > Signed-off-by: Naoya Horiguchi <[email protected]>
>
> I'm happy with this and thanks very much for going the extra mile and
> breaking the usage down by hstate size.

Great to hear that.

> I'd be interested in the comments of others, though, to see if there is
> any reservation about the hstate size breakdown. It may actually find no
> current customer who is interested in parsing it. (If we keep it, I would
> suggest the 'x' change to '*' similar to per-order breakdowns in
> show_mem()). It may also be possible to add it later if a definitive
> usecase is presented.

I'm fine to change to '*'.

Thanks,
Naoya Horiguchi
---
From: Naoya Horiguchi <[email protected]>
Subject: [PATCH] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

Currently there's no easy way to get per-process usage of hugetlb pages, which
is inconvenient because userspace applications which use hugetlb typically want
to control their processes on the basis of how much memory (including hugetlb)
they use. So this patch simply provides easy access to the info via
/proc/PID/status.

With this patch, for example, /proc/PID/status shows a line like this:

HugetlbPages: 20480 kB (10*2048kB)

If your system supports and enables multiple hugepage sizes, the line looks
like this:

HugetlbPages: 1069056 kB (1*1048576kB 10*2048kB)

, so you can easily see how many hugepages of each page size are used by a
process.

Signed-off-by: Naoya Horiguchi <[email protected]>
---
Documentation/filesystems/proc.txt | 3 +++
fs/proc/task_mmu.c | 1 +
include/linux/hugetlb.h | 20 ++++++++++++++++++++
include/linux/mm_types.h | 10 ++++++++++
mm/hugetlb.c | 27 +++++++++++++++++++++++++++
mm/rmap.c | 4 +++-
6 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 22e40211ef64..f561fc46e41b 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -174,6 +174,7 @@ For example, to get the status information of a process, all you have to do is
VmLib: 1412 kB
VmPTE: 20 kb
VmSwap: 0 kB
+ HugetlbPages: 0 kB (0*2048kB)
Threads: 1
SigQ: 0/28578
SigPnd: 0000000000000000
@@ -237,6 +238,8 @@ Table 1-2: Contents of the status files (as of 4.1)
VmPTE size of page table entries
VmPMD size of second level page tables
VmSwap size of swap usage (the number of referred swapents)
+ HugetlbPages size of hugetlb memory portions (with additional info
+ about number of mapped hugepages for each page size)
Threads number of threads
SigQ number of signals queued/max. number for queue
SigPnd bitmap of pending signals for the thread
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2c37938b82ee..b3cf7fa9ef6c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -69,6 +69,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
ptes >> 10,
pmds >> 10,
swap << (PAGE_SHIFT-10));
+ hugetlb_report_usage(m, mm);
}

unsigned long task_vsize(struct mm_struct *mm)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d891f949466a..64aa4db01f48 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -469,6 +469,18 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
#define hugepages_supported() (HPAGE_SHIFT != 0)
#endif

+void hugetlb_report_usage(struct seq_file *m, struct mm_struct *mm);
+
+static inline void inc_hugetlb_count(struct mm_struct *mm, struct hstate *h)
+{
+ atomic_long_inc(&mm->hugetlb_usage.count[hstate_index(h)]);
+}
+
+static inline void dec_hugetlb_count(struct mm_struct *mm, struct hstate *h)
+{
+ atomic_long_dec(&mm->hugetlb_usage.count[hstate_index(h)]);
+}
+
#else /* CONFIG_HUGETLB_PAGE */
struct hstate {};
#define alloc_huge_page_node(h, nid) NULL
@@ -504,6 +516,14 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
{
return &mm->page_table_lock;
}
+
+static inline void hugetlb_report_usage(struct seq_file *f, struct mm_struct *m)
+{
+}
+
+static inline void dec_hugetlb_count(struct mm_struct *mm, struct hstate *h)
+{
+}
#endif /* CONFIG_HUGETLB_PAGE */

static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0038ac7466fd..e95c5fe1eb7d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -364,6 +364,12 @@ struct mm_rss_stat {
atomic_long_t count[NR_MM_COUNTERS];
};

+#ifdef CONFIG_HUGETLB_PAGE
+struct hugetlb_usage {
+ atomic_long_t count[HUGE_MAX_HSTATE];
+};
+#endif
+
struct kioctx_table;
struct mm_struct {
struct vm_area_struct *mmap; /* list of VMAs */
@@ -484,6 +490,10 @@ struct mm_struct {
/* address of the bounds directory */
void __user *bd_addr;
#endif
+
+#ifdef CONFIG_HUGETLB_PAGE
+ struct hugetlb_usage hugetlb_usage;
+#endif
};

static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a8c3087089d8..2338c9713b7a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2562,6 +2562,30 @@ void hugetlb_show_meminfo(void)
1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
}

+void hugetlb_report_usage(struct seq_file *m, struct mm_struct *mm)
+{
+ int i;
+ unsigned long total_usage = 0;
+
+ for (i = 0; i < HUGE_MAX_HSTATE; i++) {
+ total_usage += atomic_long_read(&mm->hugetlb_usage.count[i]) *
+ (huge_page_size(&hstates[i]) >> 10);
+ }
+
+ seq_printf(m, "HugetlbPages:\t%8lu kB (", total_usage);
+ for (i = 0; i < HUGE_MAX_HSTATE; i++) {
+ if (huge_page_order(&hstates[i]) == 0)
+ break;
+ if (i > 0)
+ seq_puts(m, " ");
+
+ seq_printf(m, "%ld*%dkB",
+ atomic_long_read(&mm->hugetlb_usage.count[i]),
+ huge_page_size(&hstates[i]) >> 10);
+ }
+ seq_puts(m, ")\n");
+}
+
/* Return the number pages of memory we physically have, in PAGE_SIZE units. */
unsigned long hugetlb_total_pages(void)
{
@@ -2797,6 +2821,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
get_page(ptepage);
page_dup_rmap(ptepage);
set_huge_pte_at(dst, addr, dst_pte, entry);
+ inc_hugetlb_count(dst, h);
}
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
@@ -2877,6 +2902,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
if (huge_pte_dirty(pte))
set_page_dirty(page);

+ dec_hugetlb_count(mm, h);
page_remove_rmap(page);
force_flush = !__tlb_remove_page(tlb, page);
if (force_flush) {
@@ -3261,6 +3287,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
&& (vma->vm_flags & VM_SHARED)));
set_huge_pte_at(mm, address, ptep, new_pte);

+ inc_hugetlb_count(mm, h);
if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
/* Optimization, do the COW without a second fault */
ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page, ptl);
diff --git a/mm/rmap.c b/mm/rmap.c
index 171b68768df1..b33278bc4ddb 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1230,7 +1230,9 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
update_hiwater_rss(mm);

if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
- if (!PageHuge(page)) {
+ if (PageHuge(page)) {
+ dec_hugetlb_count(mm, page_hstate(page));
+ } else {
if (PageAnon(page))
dec_mm_counter(mm, MM_ANONPAGES);
else
--
2.4.3

2015-08-13 21:13:38

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

On Wed, Aug 12, 2015 at 01:30:27PM -0700, David Rientjes wrote:
>
> I'd be interested in the comments of others, though, to see if there is
> any reservation about the hstate size breakdown. It may actually find no
> current customer who is interested in parsing it. (If we keep it, I would
> suggest the 'x' change to '*' similar to per-order breakdowns in
> show_mem()). It may also be possible to add it later if a definitive
> usecase is presented.

I have no interest in parsing the size breakdown today. I might change
my mind tomorrow and having the extra information hurts very little, so
I won't argue against it either.

> But overall I'm very happy with the new addition and think it's a good
> solution to the problem.

Agreed. Thank you!

Jörn

--
One of the painful things about our time is that those who feel certainty
are stupid, and those with any imagination and understanding are filled
with doubt and indecision.
-- Bertrand Russell

2015-08-13 21:14:22

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

On Thu, Aug 13, 2015 at 12:45:33AM +0000, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <[email protected]>
> Subject: [PATCH] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status
>
> Currently there's no easy way to get per-process usage of hugetlb pages, which
> is inconvenient because userspace applications which use hugetlb typically want
> to control their processes on the basis of how much memory (including hugetlb)
> they use. So this patch simply provides easy access to the info via
> /proc/PID/status.
>
> With this patch, for example, /proc/PID/status shows a line like this:
>
> HugetlbPages: 20480 kB (10*2048kB)
>
> If your system supports and enables multiple hugepage sizes, the line looks
> like this:
>
> HugetlbPages: 1069056 kB (1*1048576kB 10*2048kB)
>
> , so you can easily know how many hugepages in which pagesize are used by a
> process.
>
> Signed-off-by: Naoya Horiguchi <[email protected]>

Acked-by: Joern Engel <[email protected]>

Jörn

--
Computer system analysis is like child-rearing; you can do grievous damage,
but you cannot ensure success."
-- Tom DeMarco

2015-08-13 21:14:40

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps

On Wed, Aug 12, 2015 at 01:25:59PM -0700, David Rientjes wrote:
> On Wed, 12 Aug 2015, Naoya Horiguchi wrote:
>
> > Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which
> > is inconvenient when we want to know per-task or per-vma base hugetlb usage.
> > To solve this, this patch adds a new line for hugetlb usage like below:
> >
> > Size: 20480 kB
> > Rss: 0 kB
> > Pss: 0 kB
> > Shared_Clean: 0 kB
> > Shared_Dirty: 0 kB
> > Private_Clean: 0 kB
> > Private_Dirty: 0 kB
> > Referenced: 0 kB
> > Anonymous: 0 kB
> > AnonHugePages: 0 kB
> > HugetlbPages: 18432 kB
> > Swap: 0 kB
> > KernelPageSize: 2048 kB
> > MMUPageSize: 2048 kB
> > Locked: 0 kB
> > VmFlags: rd wr mr mw me de ht
> >
> > Signed-off-by: Naoya Horiguchi <[email protected]>
>
> Acked-by: David Rientjes <[email protected]>

Acked-by: Joern Engel <[email protected]>

Jörn

--
One of my most productive days was throwing away 1000 lines of code.
-- Ken Thompson.

2015-08-17 21:28:16

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

On Wed, 12 Aug 2015, David Rientjes wrote:

> I'm happy with this and thanks very much for going the extra mile and
> breaking the usage down by hstate size.
>
> I'd be interested in the comments of others, though, to see if there is
> any reservation about the hstate size breakdown. It may actually find no
> current customer who is interested in parsing it. (If we keep it, I would
> suggest the 'x' change to '*' similar to per-order breakdowns in
> show_mem()). It may also be possible to add it later if a definitive
> usecase is presented.
>
> But overall I'm very happy with the new addition and think it's a good
> solution to the problem.
>

No objections from anybody else, so

Acked-by: David Rientjes <[email protected]>

Thanks Naoya!

2015-08-20 08:43:49

by Naoya Horiguchi

[permalink] [raw]
Subject: [PATCH v5 0/2] hugetlb: display per-process/per-vma usage

The previous version had build issues on some architectures, because it
required moving the definition of HUGE_MAX_HSTATE across header files
in order to embed the new data structure struct hugetlb_usage into struct
mm_struct. This was a hard problem to solve, so I took another approach
in this version: I add just a pointer (struct hugetlb_usage *) to
struct mm_struct and dynamically allocate and link it.
This makes the changes larger, but avoids the build issues.
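
To make the idea concrete, here is a very rough kernel-style sketch of what
the pointer-based layout might look like. Names, signatures and call sites
are guesses for illustration only (the changelog below mentions a
hugetlb_fork() hook); the real code is in the patches that follow:

/* Sketch only -- a guess at the shape of the approach, not the actual v5
 * code.  struct mm_struct carries just an opaque pointer, so mm_types.h no
 * longer needs HUGE_MAX_HSTATE in scope; the counter array lives in a
 * separately allocated object. */
struct hugetlb_usage {
    atomic_long_t count[HUGE_MAX_HSTATE];
};

/* in struct mm_struct (include/linux/mm_types.h):
 *     struct hugetlb_usage *hugetlb_usage;
 */

static inline void inc_hugetlb_count(struct mm_struct *mm, struct hstate *h)
{
    if (mm->hugetlb_usage)
        atomic_long_inc(&mm->hugetlb_usage->count[hstate_index(h)]);
}

/* hypothetical hook on the fork path (kernel/fork.c), giving the child its
 * own zeroed counter array */
static int hugetlb_fork(struct mm_struct *new_mm)
{
    new_mm->hugetlb_usage = kzalloc(sizeof(*new_mm->hugetlb_usage), GFP_KERNEL);
    return new_mm->hugetlb_usage ? 0 : -ENOMEM;
}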

Thanks,
Naoya Horiguchi
---
Summary:

Naoya Horiguchi (2):
mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps
mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

Documentation/filesystems/proc.txt | 10 +++++++--
fs/hugetlbfs/inode.c | 12 ++++++++++
fs/proc/task_mmu.c | 30 +++++++++++++++++++++++++
include/linux/hugetlb.h | 36 +++++++++++++++++++++++++++++
include/linux/mm_types.h | 7 ++++++
kernel/fork.c | 3 +++
mm/hugetlb.c | 46 ++++++++++++++++++++++++++++++++++++++
mm/mmap.c | 1 +
mm/rmap.c | 4 +++-
9 files changed, 146 insertions(+), 3 deletions(-)

2015-08-20 08:49:28

by Naoya Horiguchi

[permalink] [raw]
Subject: [PATCH v5 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps

Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which
is inconvenient when we want to know per-task or per-vma base hugetlb usage.
To solve this, this patch adds a new line for hugetlb usage like below:

Size: 20480 kB
Rss: 0 kB
Pss: 0 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 0 kB
Anonymous: 0 kB
AnonHugePages: 0 kB
HugetlbPages: 18432 kB
Swap: 0 kB
KernelPageSize: 2048 kB
MMUPageSize: 2048 kB
Locked: 0 kB
VmFlags: rd wr mr mw me de ht

Signed-off-by: Naoya Horiguchi <[email protected]>
Acked-by: Joern Engel <[email protected]>
Acked-by: David Rientjes <[email protected]>
---
v3 -> v4:
- suspend Acked-by tag because v3->v4 change is not trivial
- I stated in previous discussion that HugetlbPages line can contain page
size info, but that's not necessary because we already have KernelPageSize
info.
- merged documentation update, where the current documentation doesn't mention
AnonHugePages, so it's also added.
---
Documentation/filesystems/proc.txt | 7 +++++--
fs/proc/task_mmu.c | 29 +++++++++++++++++++++++++++++
2 files changed, 34 insertions(+), 2 deletions(-)

diff --git v4.2-rc4/Documentation/filesystems/proc.txt v4.2-rc4_patched/Documentation/filesystems/proc.txt
index 6f7fafde0884..22e40211ef64 100644
--- v4.2-rc4/Documentation/filesystems/proc.txt
+++ v4.2-rc4_patched/Documentation/filesystems/proc.txt
@@ -423,6 +423,8 @@ Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 892 kB
Anonymous: 0 kB
+AnonHugePages: 0 kB
+HugetlbPages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
@@ -440,8 +442,9 @@ indicates the amount of memory currently marked as referenced or accessed.
"Anonymous" shows the amount of memory that does not belong to any file. Even
a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
and a page is modified, the file page is replaced by a private anonymous copy.
-"Swap" shows how much would-be-anonymous memory is also used, but out on
-swap.
+"AnonHugePages" shows the ammount of memory backed by transparent hugepage.
+"HugetlbPages" shows the ammount of memory backed by hugetlbfs page.
+"Swap" shows how much would-be-anonymous memory is also used, but out on swap.

"VmFlags" field deserves a separate description. This member represents the kernel
flags associated with the particular virtual memory area in two letter encoded
diff --git v4.2-rc4/fs/proc/task_mmu.c v4.2-rc4_patched/fs/proc/task_mmu.c
index ca1e091881d4..2c37938b82ee 100644
--- v4.2-rc4/fs/proc/task_mmu.c
+++ v4.2-rc4_patched/fs/proc/task_mmu.c
@@ -445,6 +445,7 @@ struct mem_size_stats {
unsigned long anonymous;
unsigned long anonymous_thp;
unsigned long swap;
+ unsigned long hugetlb;
u64 pss;
};

@@ -610,12 +611,38 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
seq_putc(m, '\n');
}

+#ifdef CONFIG_HUGETLB_PAGE
+static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+ struct mem_size_stats *mss = walk->private;
+ struct vm_area_struct *vma = walk->vma;
+ struct page *page = NULL;
+
+ if (pte_present(*pte)) {
+ page = vm_normal_page(vma, addr, *pte);
+ } else if (is_swap_pte(*pte)) {
+ swp_entry_t swpent = pte_to_swp_entry(*pte);
+
+ if (is_migration_entry(swpent))
+ page = migration_entry_to_page(swpent);
+ }
+ if (page)
+ mss->hugetlb += huge_page_size(hstate_vma(vma));
+ return 0;
+}
+#endif /* HUGETLB_PAGE */
+
static int show_smap(struct seq_file *m, void *v, int is_pid)
{
struct vm_area_struct *vma = v;
struct mem_size_stats mss;
struct mm_walk smaps_walk = {
.pmd_entry = smaps_pte_range,
+#ifdef CONFIG_HUGETLB_PAGE
+ .hugetlb_entry = smaps_hugetlb_range,
+#endif
.mm = vma->vm_mm,
.private = &mss,
};
@@ -637,6 +664,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
"Referenced: %8lu kB\n"
"Anonymous: %8lu kB\n"
"AnonHugePages: %8lu kB\n"
+ "HugetlbPages: %8lu kB\n"
"Swap: %8lu kB\n"
"KernelPageSize: %8lu kB\n"
"MMUPageSize: %8lu kB\n"
@@ -651,6 +679,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
mss.referenced >> 10,
mss.anonymous >> 10,
mss.anonymous_thp >> 10,
+ mss.hugetlb >> 10,
mss.swap >> 10,
vma_kernel_pagesize(vma) >> 10,
vma_mmu_pagesize(vma) >> 10,
--
2.4.3

2015-08-20 08:43:57

by Naoya Horiguchi

[permalink] [raw]
Subject: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

Currently there's no easy way to get per-process usage of hugetlb pages, which
is inconvenient because userspace applications which use hugetlb typically want
to control their processes on the basis of how much memory (including hugetlb)
they use. So this patch simply provides easy access to the info via
/proc/PID/status.

With this patch, for example, /proc/PID/status shows a line like this:

HugetlbPages: 20480 kB (10*2048kB)

If your system supports and enables multiple hugepage sizes, the line looks
like this:

HugetlbPages: 1069056 kB (1*1048576kB 10*2048kB)

, so you can easily see how many hugepages of each page size a process uses.
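
For clarity, the kB total is just the sum of count * pagesize over the
entries in parentheses; a trivial check of the example above, using only
the numbers already shown:

/* Sanity check of the example line: 1 page of 1048576 kB plus
 * 10 pages of 2048 kB gives the 1069056 kB total. */
#include <stdio.h>

int main(void)
{
	unsigned long total = 1 * 1048576UL + 10 * 2048UL;

	printf("HugetlbPages: %lu kB (1*1048576kB 10*2048kB)\n", total);
	return 0;
}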

Signed-off-by: Naoya Horiguchi <[email protected]>
Acked-by: Joern Engel <[email protected]>
Acked-by: David Rientjes <[email protected]>
---
v4 -> v5:
- add (struct hugetlb_usage *) to struct mm_struct
- use %lu instead of %d for seq_printf()
- introduce hugetlb_fork

v3 -> v4:
- rename field (VmHugetlbRSS is not the best name)
- introduce struct hugetlb_usage in struct mm_struct (no invasion to struct
mm_rss_stat)
- introduce hugetlb_report_usage()
- merged documentation update

v2 -> v3:
- use inline functions instead of macros for !CONFIG_HUGETLB_PAGE
---
Documentation/filesystems/proc.txt | 3 +++
fs/hugetlbfs/inode.c | 12 ++++++++++
fs/proc/task_mmu.c | 1 +
include/linux/hugetlb.h | 36 +++++++++++++++++++++++++++++
include/linux/mm_types.h | 7 ++++++
kernel/fork.c | 3 +++
mm/hugetlb.c | 46 ++++++++++++++++++++++++++++++++++++++
mm/mmap.c | 1 +
mm/rmap.c | 4 +++-
9 files changed, 112 insertions(+), 1 deletion(-)

diff --git v4.2-rc4/Documentation/filesystems/proc.txt v4.2-rc4_patched/Documentation/filesystems/proc.txt
index 22e40211ef64..f561fc46e41b 100644
--- v4.2-rc4/Documentation/filesystems/proc.txt
+++ v4.2-rc4_patched/Documentation/filesystems/proc.txt
@@ -174,6 +174,7 @@ For example, to get the status information of a process, all you have to do is
VmLib: 1412 kB
VmPTE: 20 kb
VmSwap: 0 kB
+ HugetlbPages: 0 kB (0*2048kB)
Threads: 1
SigQ: 0/28578
SigPnd: 0000000000000000
@@ -237,6 +238,8 @@ Table 1-2: Contents of the status files (as of 4.1)
VmPTE size of page table entries
VmPMD size of second level page tables
VmSwap size of swap usage (the number of referred swapents)
+ HugetlbPages size of hugetlb memory portions (with additional info
+ about number of mapped hugepages for each page size)
Threads number of threads
SigQ number of signals queued/max. number for queue
SigPnd bitmap of pending signals for the thread
diff --git v4.2-rc4/fs/hugetlbfs/inode.c v4.2-rc4_patched/fs/hugetlbfs/inode.c
index 0cf74df68617..bf6ea2645d35 100644
--- v4.2-rc4/fs/hugetlbfs/inode.c
+++ v4.2-rc4_patched/fs/hugetlbfs/inode.c
@@ -115,6 +115,13 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
return -EINVAL;

+ if (!vma->vm_mm->hugetlb_usage) {
+ vma->vm_mm->hugetlb_usage = kzalloc(sizeof(struct hugetlb_usage),
+ GFP_KERNEL);
+ if (!vma->vm_mm->hugetlb_usage)
+ return -ENOMEM;
+ }
+
vma_len = (loff_t)(vma->vm_end - vma->vm_start);

mutex_lock(&inode->i_mutex);
@@ -138,6 +145,11 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
return ret;
}

+void exit_hugetlb_mmap(struct mm_struct *mm)
+{
+ kfree(mm->hugetlb_usage);
+}
+
/*
* Called under down_write(mmap_sem).
*/
diff --git v4.2-rc4/fs/proc/task_mmu.c v4.2-rc4_patched/fs/proc/task_mmu.c
index 2c37938b82ee..b3cf7fa9ef6c 100644
--- v4.2-rc4/fs/proc/task_mmu.c
+++ v4.2-rc4_patched/fs/proc/task_mmu.c
@@ -69,6 +69,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
ptes >> 10,
pmds >> 10,
swap << (PAGE_SHIFT-10));
+ hugetlb_report_usage(m, mm);
}

unsigned long task_vsize(struct mm_struct *mm)
diff --git v4.2-rc4/include/linux/hugetlb.h v4.2-rc4_patched/include/linux/hugetlb.h
index d891f949466a..db642ad0b847 100644
--- v4.2-rc4/include/linux/hugetlb.h
+++ v4.2-rc4_patched/include/linux/hugetlb.h
@@ -469,6 +469,25 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
#define hugepages_supported() (HPAGE_SHIFT != 0)
#endif

+struct hugetlb_usage {
+ atomic_long_t count[HUGE_MAX_HSTATE];
+};
+
+void hugetlb_report_usage(struct seq_file *m, struct mm_struct *mm);
+void exit_hugetlb_mmap(struct mm_struct *mm);
+int hugetlb_fork(struct mm_struct *new, struct mm_struct *old);
+
+static inline void inc_hugetlb_count(struct mm_struct *mm, struct hstate *h)
+{
+ VM_BUG_ON_MM(!mm->hugetlb_usage, mm);
+ atomic_long_inc(&mm->hugetlb_usage->count[hstate_index(h)]);
+}
+
+static inline void dec_hugetlb_count(struct mm_struct *mm, struct hstate *h)
+{
+ VM_BUG_ON_MM(!mm->hugetlb_usage, mm);
+ atomic_long_dec(&mm->hugetlb_usage->count[hstate_index(h)]);
+}
#else /* CONFIG_HUGETLB_PAGE */
struct hstate {};
#define alloc_huge_page_node(h, nid) NULL
@@ -504,6 +523,23 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
{
return &mm->page_table_lock;
}
+
+static inline void hugetlb_report_usage(struct seq_file *f, struct mm_struct *m)
+{
+}
+
+static inline void exit_hugetlb_mmap(struct mm_struct *mm)
+{
+}
+
+static inline int hugetlb_fork(struct mm_struct *new, struct mm_struct *old)
+{
+ return 0;
+}
+
+static inline void dec_hugetlb_count(struct mm_struct *mm, struct hstate *h)
+{
+}
#endif /* CONFIG_HUGETLB_PAGE */

static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git v4.2-rc4/include/linux/mm_types.h v4.2-rc4_patched/include/linux/mm_types.h
index 0038ac7466fd..851e964ee8d6 100644
--- v4.2-rc4/include/linux/mm_types.h
+++ v4.2-rc4_patched/include/linux/mm_types.h
@@ -364,6 +364,10 @@ struct mm_rss_stat {
atomic_long_t count[NR_MM_COUNTERS];
};

+#ifdef CONFIG_HUGETLB_PAGE
+struct hugetlb_usage;
+#endif
+
struct kioctx_table;
struct mm_struct {
struct vm_area_struct *mmap; /* list of VMAs */
@@ -484,6 +488,9 @@ struct mm_struct {
/* address of the bounds directory */
void __user *bd_addr;
#endif
+#ifdef CONFIG_HUGETLB_PAGE
+ struct hugetlb_usage *hugetlb_usage;
+#endif
};

static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git v4.2-rc4/kernel/fork.c v4.2-rc4_patched/kernel/fork.c
index dbd9b8d7b7cc..d43baa91d48c 100644
--- v4.2-rc4/kernel/fork.c
+++ v4.2-rc4_patched/kernel/fork.c
@@ -425,6 +425,9 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
retval = khugepaged_fork(mm, oldmm);
if (retval)
goto out;
+ retval = hugetlb_fork(mm, oldmm);
+ if (retval)
+ goto out;

prev = NULL;
for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
diff --git v4.2-rc4/mm/hugetlb.c v4.2-rc4_patched/mm/hugetlb.c
index a8c3087089d8..3aa8c7919364 100644
--- v4.2-rc4/mm/hugetlb.c
+++ v4.2-rc4_patched/mm/hugetlb.c
@@ -2562,6 +2562,49 @@ void hugetlb_show_meminfo(void)
1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
}

+static unsigned long mm_hstate_usage(struct mm_struct *mm, int hs_idx)
+{
+ if (!mm->hugetlb_usage)
+ return 0;
+ return atomic_long_read(&mm->hugetlb_usage->count[hs_idx]);
+}
+
+void hugetlb_report_usage(struct seq_file *m, struct mm_struct *mm)
+{
+ int i;
+ unsigned long total_usage = 0;
+
+ for (i = 0; i < HUGE_MAX_HSTATE; i++) {
+ total_usage += mm_hstate_usage(mm, i) *
+ (huge_page_size(&hstates[i]) >> 10);
+ }
+
+ seq_printf(m, "HugetlbPages:\t%8lu kB (", total_usage);
+ for (i = 0; i < HUGE_MAX_HSTATE; i++) {
+ if (huge_page_order(&hstates[i]) == 0)
+ break;
+ if (i > 0)
+ seq_puts(m, " ");
+
+ seq_printf(m, "%ld*%lukB", mm_hstate_usage(mm, i),
+ huge_page_size(&hstates[i]) >> 10);
+ }
+ seq_puts(m, ")\n");
+}
+
+int hugetlb_fork(struct mm_struct *new, struct mm_struct *old)
+{
+ if (old->hugetlb_usage) {
+ new->hugetlb_usage = kmalloc(sizeof(struct hugetlb_usage),
+ GFP_KERNEL);
+ if (!new->hugetlb_usage)
+ return -ENOMEM;
+ memcpy(new->hugetlb_usage, old->hugetlb_usage,
+ sizeof(struct hugetlb_usage));
+ }
+ return 0;
+}
+
/* Return the number pages of memory we physically have, in PAGE_SIZE units. */
unsigned long hugetlb_total_pages(void)
{
@@ -2797,6 +2840,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
get_page(ptepage);
page_dup_rmap(ptepage);
set_huge_pte_at(dst, addr, dst_pte, entry);
+ inc_hugetlb_count(dst, h);
}
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
@@ -2877,6 +2921,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
if (huge_pte_dirty(pte))
set_page_dirty(page);

+ dec_hugetlb_count(mm, h);
page_remove_rmap(page);
force_flush = !__tlb_remove_page(tlb, page);
if (force_flush) {
@@ -3261,6 +3306,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
&& (vma->vm_flags & VM_SHARED)));
set_huge_pte_at(mm, address, ptep, new_pte);

+ inc_hugetlb_count(mm, h);
if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
/* Optimization, do the COW without a second fault */
ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page, ptl);
diff --git v4.2-rc4/mm/mmap.c v4.2-rc4_patched/mm/mmap.c
index aa632ade2be7..9d9562bc79a8 100644
--- v4.2-rc4/mm/mmap.c
+++ v4.2-rc4_patched/mm/mmap.c
@@ -2847,6 +2847,7 @@ void exit_mmap(struct mm_struct *mm)
nr_accounted += vma_pages(vma);
vma = remove_vma(vma);
}
+ exit_hugetlb_mmap(mm);
vm_unacct_memory(nr_accounted);
}

diff --git v4.2-rc4/mm/rmap.c v4.2-rc4_patched/mm/rmap.c
index 171b68768df1..b33278bc4ddb 100644
--- v4.2-rc4/mm/rmap.c
+++ v4.2-rc4_patched/mm/rmap.c
@@ -1230,7 +1230,9 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
update_hiwater_rss(mm);

if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
- if (!PageHuge(page)) {
+ if (PageHuge(page)) {
+ dec_hugetlb_count(mm, page_hstate(page));
+ } else {
if (PageAnon(page))
dec_mm_counter(mm, MM_ANONPAGES);
else
--
2.4.3

2015-08-20 10:49:33

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps

On Thu 20-08-15 08:26:26, Naoya Horiguchi wrote:
> Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which
> is inconvenient when we want to know hugetlb usage on a per-task or per-vma basis.
> To solve this, this patch adds a new line for hugetlb usage like below:
>
> Size: 20480 kB
> Rss: 0 kB
> Pss: 0 kB
> Shared_Clean: 0 kB
> Shared_Dirty: 0 kB
> Private_Clean: 0 kB
> Private_Dirty: 0 kB
> Referenced: 0 kB
> Anonymous: 0 kB
> AnonHugePages: 0 kB
> HugetlbPages: 18432 kB
> Swap: 0 kB
> KernelPageSize: 2048 kB
> MMUPageSize: 2048 kB
> Locked: 0 kB
> VmFlags: rd wr mr mw me de ht

I have only now got to this thread. This is indeed very helpful. I would
just suggest to update Documentation/filesystems/proc.txt to be explicit
that Rss: doesn't count hugetlb pages for historical reasons.

> Signed-off-by: Naoya Horiguchi <[email protected]>
> Acked-by: Joern Engel <[email protected]>
> Acked-by: David Rientjes <[email protected]>

Acked-by: Michal Hocko <[email protected]>

> ---
> v3 -> v4:
> - suspend Acked-by tag because v3->v4 change is not trivial
> - I stated in previous discussion that HugetlbPages line can contain page
> size info, but that's not necessary because we already have KernelPageSize
> info.
> - merged documentation update, where the current documentation doesn't mention
> AnonHugePages, so it's also added.
> ---
> Documentation/filesystems/proc.txt | 7 +++++--
> fs/proc/task_mmu.c | 29 +++++++++++++++++++++++++++++
> 2 files changed, 34 insertions(+), 2 deletions(-)
>
> diff --git v4.2-rc4/Documentation/filesystems/proc.txt v4.2-rc4_patched/Documentation/filesystems/proc.txt
> index 6f7fafde0884..22e40211ef64 100644
> --- v4.2-rc4/Documentation/filesystems/proc.txt
> +++ v4.2-rc4_patched/Documentation/filesystems/proc.txt
> @@ -423,6 +423,8 @@ Private_Clean: 0 kB
> Private_Dirty: 0 kB
> Referenced: 892 kB
> Anonymous: 0 kB
> +AnonHugePages: 0 kB
> +HugetlbPages: 0 kB
> Swap: 0 kB
> KernelPageSize: 4 kB
> MMUPageSize: 4 kB
> @@ -440,8 +442,9 @@ indicates the amount of memory currently marked as referenced or accessed.
> "Anonymous" shows the amount of memory that does not belong to any file. Even
> a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
> and a page is modified, the file page is replaced by a private anonymous copy.
> -"Swap" shows how much would-be-anonymous memory is also used, but out on
> -swap.
> +"AnonHugePages" shows the ammount of memory backed by transparent hugepage.
> +"HugetlbPages" shows the ammount of memory backed by hugetlbfs page.
> +"Swap" shows how much would-be-anonymous memory is also used, but out on swap.
>
> "VmFlags" field deserves a separate description. This member represents the kernel
> flags associated with the particular virtual memory area in two letter encoded
> diff --git v4.2-rc4/fs/proc/task_mmu.c v4.2-rc4_patched/fs/proc/task_mmu.c
> index ca1e091881d4..2c37938b82ee 100644
> --- v4.2-rc4/fs/proc/task_mmu.c
> +++ v4.2-rc4_patched/fs/proc/task_mmu.c
> @@ -445,6 +445,7 @@ struct mem_size_stats {
> unsigned long anonymous;
> unsigned long anonymous_thp;
> unsigned long swap;
> + unsigned long hugetlb;
> u64 pss;
> };
>
> @@ -610,12 +611,38 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
> seq_putc(m, '\n');
> }
>
> +#ifdef CONFIG_HUGETLB_PAGE
> +static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> + unsigned long addr, unsigned long end,
> + struct mm_walk *walk)
> +{
> + struct mem_size_stats *mss = walk->private;
> + struct vm_area_struct *vma = walk->vma;
> + struct page *page = NULL;
> +
> + if (pte_present(*pte)) {
> + page = vm_normal_page(vma, addr, *pte);
> + } else if (is_swap_pte(*pte)) {
> + swp_entry_t swpent = pte_to_swp_entry(*pte);
> +
> + if (is_migration_entry(swpent))
> + page = migration_entry_to_page(swpent);
> + }
> + if (page)
> + mss->hugetlb += huge_page_size(hstate_vma(vma));
> + return 0;
> +}
> +#endif /* HUGETLB_PAGE */
> +
> static int show_smap(struct seq_file *m, void *v, int is_pid)
> {
> struct vm_area_struct *vma = v;
> struct mem_size_stats mss;
> struct mm_walk smaps_walk = {
> .pmd_entry = smaps_pte_range,
> +#ifdef CONFIG_HUGETLB_PAGE
> + .hugetlb_entry = smaps_hugetlb_range,
> +#endif
> .mm = vma->vm_mm,
> .private = &mss,
> };
> @@ -637,6 +664,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
> "Referenced: %8lu kB\n"
> "Anonymous: %8lu kB\n"
> "AnonHugePages: %8lu kB\n"
> + "HugetlbPages: %8lu kB\n"
> "Swap: %8lu kB\n"
> "KernelPageSize: %8lu kB\n"
> "MMUPageSize: %8lu kB\n"
> @@ -651,6 +679,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
> mss.referenced >> 10,
> mss.anonymous >> 10,
> mss.anonymous_thp >> 10,
> + mss.hugetlb >> 10,
> mss.swap >> 10,
> vma_kernel_pagesize(vma) >> 10,
> vma_mmu_pagesize(vma) >> 10,
> --
> 2.4.3

--
Michal Hocko
SUSE Labs

2015-08-20 11:00:10

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

On Thu 20-08-15 08:26:27, Naoya Horiguchi wrote:
> Currently there's no easy way to get per-process usage of hugetlb pages,

Is this really the case after your previous patch? You have both
HugetlbPages and KernelPageSize which should be sufficient no?

Reading a single file is, of course, easier but is it really worth the
additional code? I haven't really looked at the patch so I might be
missing something but what would be an advantage over reading
/proc/<pid>/smaps and extracting the information from there?

[...]
> Documentation/filesystems/proc.txt | 3 +++
> fs/hugetlbfs/inode.c | 12 ++++++++++
> fs/proc/task_mmu.c | 1 +
> include/linux/hugetlb.h | 36 +++++++++++++++++++++++++++++
> include/linux/mm_types.h | 7 ++++++
> kernel/fork.c | 3 +++
> mm/hugetlb.c | 46 ++++++++++++++++++++++++++++++++++++++
> mm/mmap.c | 1 +
> mm/rmap.c | 4 +++-
> 9 files changed, 112 insertions(+), 1 deletion(-)
[...]
--
Michal Hocko
SUSE Labs

2015-08-20 19:50:04

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

On Thu, 20 Aug 2015, Michal Hocko wrote:

> On Thu 20-08-15 08:26:27, Naoya Horiguchi wrote:
> > Currently there's no easy way to get per-process usage of hugetlb pages,
>
> Is this really the case after your previous patch? You have both
> HugetlbPages and KernelPageSize which should be sufficient no?
>
> Reading a single file is, of course, easier but is it really worth the
> additional code? I haven't really looked at the patch so I might be
> missing something but what would be an advantage over reading
> /proc/<pid>/smaps and extracting the information from there?
>

/proc/pid/smaps requires root, /proc/pid/status doesn't.

2015-08-20 23:22:24

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps

On Thu, Aug 20, 2015 at 12:49:29PM +0200, Michal Hocko wrote:
> On Thu 20-08-15 08:26:26, Naoya Horiguchi wrote:
> > Currently /proc/PID/smaps provides no usage info for vma(VM_HUGETLB), which
> > is inconvenient when we want to know hugetlb usage on a per-task or per-vma basis.
> > To solve this, this patch adds a new line for hugetlb usage like below:
> >
> > Size: 20480 kB
> > Rss: 0 kB
> > Pss: 0 kB
> > Shared_Clean: 0 kB
> > Shared_Dirty: 0 kB
> > Private_Clean: 0 kB
> > Private_Dirty: 0 kB
> > Referenced: 0 kB
> > Anonymous: 0 kB
> > AnonHugePages: 0 kB
> > HugetlbPages: 18432 kB
> > Swap: 0 kB
> > KernelPageSize: 2048 kB
> > MMUPageSize: 2048 kB
> > Locked: 0 kB
> > VmFlags: rd wr mr mw me de ht
>
> I have only now got to this thread. This is indeed very helpful. I would
> just suggest to update Documentation/filesystems/proc.txt to be explicit
> that Rss: doesn't count hugetlb pages for historical reasons.

I agree, I want the following diff to be folded into this patch.

>
> > Signed-off-by: Naoya Horiguchi <[email protected]>
> > Acked-by: Joern Engel <[email protected]>
> > Acked-by: David Rientjes <[email protected]>
>
> Acked-by: Michal Hocko <[email protected]>

Thank you.
Naoya Horiguchi
---
From: Naoya Horiguchi <[email protected]>
Date: Fri, 21 Aug 2015 08:13:31 +0900
Subject: [PATCH] Documentation/filesystems/proc.txt: give additional comment
about hugetlb usage

---
Documentation/filesystems/proc.txt | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index f561fc46e41b..b775b6faaeda 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -446,7 +446,8 @@ indicates the amount of memory currently marked as referenced or accessed.
a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
and a page is modified, the file page is replaced by a private anonymous copy.
"AnonHugePages" shows the ammount of memory backed by transparent hugepage.
-"HugetlbPages" shows the ammount of memory backed by hugetlbfs page.
+"HugetlbPages" shows the ammount of memory backed by hugetlbfs page (which is
+not counted in "Rss" or "Pss" field for historical reasons.)
"Swap" shows how much would-be-anonymous memory is also used, but out on swap.

"VmFlags" field deserves a separate description. This member represents the kernel
--
2.4.3

2015-08-20 23:38:12

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

On Thu, Aug 20, 2015 at 01:00:05PM +0200, Michal Hocko wrote:
> On Thu 20-08-15 08:26:27, Naoya Horiguchi wrote:
> > Currently there's no easy way to get per-process usage of hugetlb pages,
>
> Is this really the case after your previous patch? You have both
> HugetlbPages and KernelPageSize which should be sufficient no?

We can calculate it from that info, so saying "no easy way" was incorrect :(

> Reading a single file is, of course, easier but is it really worth the
> additional code? I haven't really looked at the patch so I might be
> missing something but what would be an advantage over reading
> /proc/<pid>/smaps and extracting the information from there?

My first idea was just "users should feel it useful", but permission as David
commented sounds a good technical reason to me.

Thanks,
Naoya Horiguchi

2015-08-21 06:32:38

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

On Thu 20-08-15 12:49:59, David Rientjes wrote:
> On Thu, 20 Aug 2015, Michal Hocko wrote:
>
> > On Thu 20-08-15 08:26:27, Naoya Horiguchi wrote:
> > > Currently there's no easy way to get per-process usage of hugetlb pages,
> >
> > Is this really the case after your previous patch? You have both
> > HugetlbPages and KernelPageSize which should be sufficient no?
> >
> > Reading a single file is, of course, easier but is it really worth the
> > additional code? I haven't really looked at the patch so I might be
> > missing something but what would be an advantage over reading
> > /proc/<pid>/smaps and extracting the information from there?
> >
>
> /proc/pid/smaps requires root, /proc/pid/status doesn't.

Both mmotm and linus tree have
REG("smaps", S_IRUGO, proc_pid_smaps_operations),

and opening the file requires PTRACE_MODE_READ. So I do not see any
requirement for root here. Or did you mean that you need root to examine
all processes? That would be true, but I am wondering why a regular user
would be interested in these break-out numbers. Hugetlb management sounds
pretty much like an administrative or very specialized thing.

From my understanding of the discussion there is no usecase to have this
information world readable. Is this correct?
--
Michal Hocko
SUSE Labs

2015-08-21 06:33:07

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/smaps

On Thu 20-08-15 23:20:12, Naoya Horiguchi wrote:
[...]
> From: Naoya Horiguchi <[email protected]>
> Date: Fri, 21 Aug 2015 08:13:31 +0900
> Subject: [PATCH] Documentation/filesystems/proc.txt: give additional comment
> about hugetlb usage
>
> ---
> Documentation/filesystems/proc.txt | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index f561fc46e41b..b775b6faaeda 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -446,7 +446,8 @@ indicates the amount of memory currently marked as referenced or accessed.
> a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
> and a page is modified, the file page is replaced by a private anonymous copy.
> "AnonHugePages" shows the ammount of memory backed by transparent hugepage.
> -"HugetlbPages" shows the ammount of memory backed by hugetlbfs page.
> +"HugetlbPages" shows the ammount of memory backed by hugetlbfs page (which is
> +not counted in "Rss" or "Pss" field for historical reasons.)
> "Swap" shows how much would-be-anonymous memory is also used, but out on swap.
>
> "VmFlags" field deserves a separate description. This member represents the kernel

Thank you!
--
Michal Hocko
SUSE Labs

2015-08-21 06:53:25

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

On Thu 20-08-15 23:34:51, Naoya Horiguchi wrote:
[...]
> > Reading a single file is, of course, easier but is it really worth the
> > additional code? I haven't really looked at the patch so I might be
> > missing something but what would be an advantage over reading
> > /proc/<pid>/smaps and extracting the information from there?
>
> My first idea was just "users should feel it useful", but permission as David
> commented sounds a good technical reason to me.

9 files changed, 112 insertions(+), 1 deletion(-)

is quite a lot especially when it touches hot paths like fork so it
better should have a good usecase. I have already asked in the other
email but is actually anybody requesting this? Nice to have is not
a good justification IMO.
--
Michal Hocko
SUSE Labs

2015-08-21 16:30:38

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

On Fri, Aug 21, 2015 at 08:53:21AM +0200, Michal Hocko wrote:
> On Thu 20-08-15 23:34:51, Naoya Horiguchi wrote:
> [...]
> > > Reading a single file is, of course, easier but is it really worth the
> > > additional code? I haven't really looked at the patch so I might be
> > > missing something but what would be an advantage over reading
> > > /proc/<pid>/smaps and extracting the information from there?
> >
> > My first idea was just "users should feel it useful", but permission as David
> > commented sounds a good technical reason to me.
>
> 9 files changed, 112 insertions(+), 1 deletion(-)
>
> is quite a lot especially when it touches hot paths like fork so it
> better should have a good usecase. I have already asked in the other
> email but is actually anybody requesting this? Nice to have is not
> a good justification IMO.

I need some way to judge the real rss of a process, including huge
pages. No strong opinion on implementation details, but something is
clearly needed.

If you have processes with 99% huge pages, you are currently reduced to
guesswork.

Jörn

--
Journalism is printing what someone else does not want printed;
everything else is public relations.
-- George Orwell

2015-08-21 16:39:04

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

On Fri, Aug 21, 2015 at 08:32:33AM +0200, Michal Hocko wrote:
>
> Both mmotm and linus tree have
> REG("smaps", S_IRUGO, proc_pid_smaps_operations),
>
> and opening the file requires PTRACE_MODE_READ. So I do not see any
> requirement for root here. Or did you mean that you need root to examine
> all processes? That would be true, but I am wondering why a regular user
> would be interested in these break-out numbers. Hugetlb management sounds
> pretty much like an administrative or very specialized thing.
>
> From my understanding of the discussion there is no usecase to have this
> information world readable. Is this correct?

Well, tools like top currently display rss. Once we have some
interface, I would like a version of top that displays the true rss
including hugepages (hrss maybe?).

If we make such a tool impossible today, someone will complain about it
in the future and we will have created a new mess for ourselves. I think it is
trouble enough to deal with the old one.
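
A minimal sketch of what such a tool could compute, assuming the
/proc/<pid>/status format from patch 2/2 ("hrss" is only a hypothetical
name here):

/* Sketch: "true rss" = VmRSS + HugetlbPages, both taken from
 * /proc/<pid>/status (HugetlbPages as added by patch 2/2). */
#include <stdio.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	unsigned long rss = 0, hugetlb = 0, kb;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/status",
		 argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "VmRSS: %lu kB", &kb) == 1)
			rss = kb;
		else if (sscanf(line, "HugetlbPages: %lu kB", &kb) == 1)
			hugetlb = kb;
	}
	fclose(f);
	printf("hrss: %lu kB\n", rss + hugetlb);
	return 0;
}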

Jörn

--
Denying any reality for any laudable political goal is a bad strategy.
When the facts come out, the discovery of the facts will undermine the
laudable political goals.
-- Jared Diamond

2015-08-24 08:51:31

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

On Fri 21-08-15 09:30:33, Jörn Engel wrote:
> On Fri, Aug 21, 2015 at 08:53:21AM +0200, Michal Hocko wrote:
> > On Thu 20-08-15 23:34:51, Naoya Horiguchi wrote:
> > [...]
> > > > Reading a single file is, of course, easier but is it really worth the
> > > > additional code? I haven't really looked at the patch so I might be
> > > > missing something but what would be an advantage over reading
> > > > /proc/<pid>/smaps and extracting the information from there?
> > >
> > > My first idea was just "users should feel it useful", but permission as David
> > > commented sounds a good technical reason to me.
> >
> > 9 files changed, 112 insertions(+), 1 deletion(-)
> >
> > is quite a lot especially when it touches hot paths like fork so it
> > better should have a good usecase. I have already asked in the other
> > email but is actually anybody requesting this? Nice to have is not
> > a good justification IMO.
>
> I need some way to judge the real rss of a process, including huge
> pages. No strong opinion on implementation details, but something is
> clearly needed.

The current implementation makes me worry. Is the per hstate break down
really needed? The implementation would be much more easier without it.

> If you have processes with 99% huge pages, you are currently reduced to
> guesswork.

If you have 99% of hugetlb pages then your load is rather specific and I
would argue that /proc/<pid>/smaps (after patch 1) is a much better way to
get what you want.

--
Michal Hocko
SUSE Labs

2015-08-25 23:23:38

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

On Mon, 24 Aug 2015, Michal Hocko wrote:

> The current implementation makes me worry. Is the per hstate break down
> really needed? The implementation would be much more easier without it.
>

Yes, it's needed. It provides a complete picture of what statically
reserved hugepages are in use and we're not going to change the
implementation when it is needed to differentiate between variable hugetlb
page sizes that risk breaking existing userspace parsers.

> If you have 99% of hugetlb pages then your load is rather specific and I
> would argue that /proc/<pid>/smaps (after patch 1) is a much better way to
> get what you want.
>

Some distributions change the permissions of smaps, as already stated, for
pretty clear security reasons since it can be used to defeat existing
protection. There's no reason why hugetlb page usage should not be
exported in the same manner and location as memory usage.

Sheesh.

2015-08-26 06:38:18

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

On Tue 25-08-15 16:23:34, David Rientjes wrote:
> On Mon, 24 Aug 2015, Michal Hocko wrote:
>
> > The current implementation makes me worry. Is the per hstate break down
> > really needed? The implementation would be much more easier without it.
> >
>
> Yes, it's needed. It provides a complete picture of what statically
> reserved hugepages are in use and we're not going to change the
> implementation when it is needed to differentiate between variable hugetlb
> page sizes that risk breaking existing userspace parsers.

I thought the purpose was to give the amount of hugetlb based
resident memory. At least this is what Jörn was asking for AFAIU.
/proc/<pid>/status should be as lightweight as possible. The current
implementation is quite heavy as already pointed out. So I am really
curious whether this is _really_ needed. I haven't heard about a real
usecase except for top displaying HRss which doesn't need the break
down values. You have brought that up already
http://marc.info/?l=linux-mm&m=143941143109335&w=2 and nobody actually
asked for it. "I do not mind having it" is not an argument for inclusion
especially when the implementation is more costly and touches hot paths.

> > If you have 99% of hugetlb pages then your load is rather specific and I
> > would argue that /proc/<pid>/smaps (after patch 1) is a much better way to
> > get what you want.
>
> Some distributions change the permissions of smaps, as already stated, for
> pretty clear security reasons since it can be used to defeat existing
> protection. There's no reason why hugetlb page usage should not be
> exported in the same manner and location as memory usage.

/proc/<pid>/status provides only per-memory-type break down information
(locked, data, stack, etc.). Different hugetlb sizes are still
hugetlb memory. So I am not sure I understand your argument here.
--
Michal Hocko
SUSE Labs

2015-08-26 22:02:52

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

On Wed, 26 Aug 2015, Michal Hocko wrote:

> I thought the purpose was to give the amount of hugetlb based
> resident memory.

Persistent hugetlb memory is always resident, the goal is to show what is
currently mapped.

> At least this is what Jörn was asking for AFAIU.
> /proc/<pid>/status should be as lightweight as possible. The current
> implementation is quite heavy as already pointed out. So I am really
> curious whether this is _really_ needed. I haven't heard about a real
> usecase except for top displaying HRss which doesn't need the break
> down values. You have brought that up already
> http://marc.info/?l=linux-mm&m=143941143109335&w=2 and nobody actually
> asked for it. "I do not mind having it" is not an argument for inclusion
> especially when the implementation is more costly and touches hot paths.
>

It iterates over HUGE_MAX_HSTATE and reads atomic usage counters twice.
On x86, HUGE_MAX_HSTATE == 2. I don't consider that to be expensive.

If you are concerned about the memory allocation of struct hugetlb_usage,
it could easily be embedded directly in struct mm_struct.
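
For comparison, a sketch of the embedded variant suggested here (not the
posted patch; the cover letter dropped this form because it needs
HUGE_MAX_HSTATE visible wherever struct mm_struct is defined, which
caused cross-architecture build problems):

/* Sketch only: embedding avoids the lazy allocation in
 * hugetlbfs_file_mmap() and the kfree() in exit_mmap(), at the cost
 * of exposing HUGE_MAX_HSTATE to <linux/mm_types.h>. */
struct mm_struct {
	/* ... existing fields ... */
#ifdef CONFIG_HUGETLB_PAGE
	struct hugetlb_usage hugetlb_usage;	/* embedded, always present */
#endif
};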

2015-08-27 06:48:23

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

On Wed 26-08-15 15:02:49, David Rientjes wrote:
> On Wed, 26 Aug 2015, Michal Hocko wrote:
>
> > I thought the purpose was to give the amount of hugetlb based
> > resident memory.
>
> Persistent hugetlb memory is always resident, the goal is to show what is
> currently mapped.
>
> > At least this is what Jörn was asking for AFAIU.
> > /proc/<pid>/status should be as lightweight as possible. The current
> > implementation is quite heavy as already pointed out. So I am really
> > curious whether this is _really_ needed. I haven't heard about a real
> > usecase except for top displaying HRss which doesn't need the break
> > down values. You have brought that up already
> > http://marc.info/?l=linux-mm&m=143941143109335&w=2 and nobody actually
> > asked for it. "I do not mind having it" is not an argument for inclusion
> > especially when the implementation is more costly and touches hot paths.
> >
>
> It iterates over HUGE_MAX_HSTATE and reads atomic usage counters twice.

I am not worried about /proc/<pid>/status read path. That one is indeed
trivial.

> On x86, HUGE_MAX_HSTATE == 2. I don't consider that to be expensive.
>
> If you are concerned about the memory allocation of struct hugetlb_usage,
> it could easily be embedded directly in struct mm_struct.

Yes I am concerned about that and
9 files changed, 112 insertions(+), 1 deletion(-)
for something that is even not clear to be really required. And I still
haven't heard any strong usecase to justify it.

Can we go with the single and much simpler cumulative number first and
only add the break down list if it is _really_ required? We can even
document that a future version of /proc/<pid>/status might add
additional information, to prepare all the parsers to be more careful.
--
Michal Hocko
SUSE Labs

2015-08-27 17:23:56

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

On Thu, Aug 27, 2015 at 08:48:18AM +0200, Michal Hocko wrote:
>
> > On x86, HUGE_MAX_HSTATE == 2. I don't consider that to be expensive.
> >
> > If you are concerned about the memory allocation of struct hugetlb_usage,
> > it could easily be embedded directly in struct mm_struct.
>
> Yes I am concerned about that and
> 9 files changed, 112 insertions(+), 1 deletion(-)
> for something that is even not clear to be really required. And I still
> haven't heard any strong usecase to justify it.
>
> Can we go with the single and much simpler cumulative number first and
> only add the break down list if it is _really_ required? We can even
> document that a future version of /proc/<pid>/status might add
> additional information, to prepare all the parsers to be more careful.

I don't care much which way we decide. But I find your reasoning a bit
worrying. If someone asks for a by-size breakup of hugepages in a few
years, you might have existing binaries that depend on the _absence_ of
those extra characters on the line.

Compare:
HugetlbPages: 18432 kB
HugetlbPages: 1069056 kB (1*1048576kB 10*2048kB)

Once someone has written a script that greps for 'HugetlbPages:.*kB$',
you have lost the option of adding anything else to the line. You have
created yet another ABI compatibility headache today in order to save
112 lines of code.

That may be a worthwhile tradeoff, I don't know. But at least I realize
there is a cost, while you seem to ignore that component. There is
value in not painting yourself into a corner.
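
One way to avoid that particular trap is to parse only the leading total
and tolerate anything after the unit; a userspace sketch against the two
example lines above:

/* Sketch of a forward-compatible parser: read only the leading total
 * and ignore whatever follows "kB", so an appended per-size breakdown
 * would not break it. */
#include <stdio.h>

int main(void)
{
	const char *examples[] = {
		"HugetlbPages:      18432 kB",
		"HugetlbPages:    1069056 kB (1*1048576kB 10*2048kB)",
	};
	unsigned long kb;
	unsigned int i;

	for (i = 0; i < sizeof(examples) / sizeof(examples[0]); i++)
		if (sscanf(examples[i], "HugetlbPages: %lu kB", &kb) == 1)
			printf("parsed %lu kB\n", kb);
	return 0;
}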

Jörn

--
A quarrel is quickly settled when deserted by one party; there is
no battle unless there be two.
-- Seneca

2015-08-27 20:44:12

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

On Thu, 27 Aug 2015, Jörn Engel wrote:

> On Thu, Aug 27, 2015 at 08:48:18AM +0200, Michal Hocko wrote:
> > Can we go with the single and much simpler cumulative number first and
> > only add the break down list if it is _really_ required? We can even
> > document that the future version of /proc/<pid>/status might add an
> > additional information to prepare all the parsers to be more careful.
>
> I don't care much which way we decide. But I find your reasoning a bit
> worrying. If someone asks for a by-size breakup of hugepages in a few
> years, you might have existing binaries that depend on the _absence_ of
> those extra characters on the line.
>
> Compare:
> HugetlbPages: 18432 kB
> HugetlbPages: 1069056 kB (1*1048576kB 10*2048kB)
>
> Once someone has written a script that greps for 'HugetlbPages:.*kB$',
> you have lost the option of adding anything else to the line. You have
> created yet another ABI compatibility headache today in order to save
> 112 lines of code.
>

This is exactly the concern that I have brought up in this thread. We
have no other way to sanely export the breakdown in hugepage size without
new fields being added later with the hstate size being embedded in the
name itself.

I agree with the code as it stands in -mm today and I'm thankful to Naoya
that a long-term maintainable API has been established. Respectfully, I
have no idea why we are still talking about this and I'm not going to be
responding further unless something changes in -mm.

2015-08-31 09:12:41

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v5 2/2] mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status

On Thu 27-08-15 10:23:51, Jörn Engel wrote:
> On Thu, Aug 27, 2015 at 08:48:18AM +0200, Michal Hocko wrote:
> >
> > > On x86, HUGE_MAX_HSTATE == 2. I don't consider that to be expensive.
> > >
> > > If you are concerned about the memory allocation of struct hugetlb_usage,
> > > it could easily be embedded directly in struct mm_struct.
> >
> > Yes I am concerned about that and
> > 9 files changed, 112 insertions(+), 1 deletion(-)
> > for something that is even not clear to be really required. And I still
> > haven't heard any strong usecase to justify it.
> >
> > Can we go with the single and much simpler cumulative number first and
> > only add the break down list if it is _really_ required? We can even
> > document that a future version of /proc/<pid>/status might add
> > additional information, to prepare all the parsers to be more careful.
>
> I don't care much which way we decide. But I find your reasoning a bit
> worrying. If someone asks for a by-size breakup of hugepages in a few
> years, you might have existing binaries that depend on the _absence_ of
> those extra characters on the line.
>
> Compare:
> HugetlbPages: 18432 kB
> HugetlbPages: 1069056 kB (1*1048576kB 10*2048kB)
>
> Once someone has written a script that greps for 'HugetlbPages:.*kB$',
> you have lost the option of adding anything else to the line.

If you think that an explicit note in the documentation is not
sufficient, then I believe we can still handle it in a backward-compatible
way, e.g. with separate entries for each existing hugetlb page size:
HugetlbPages: 1069056 kB
Hugetlb2MPages: 20480 kB
Hugetlb1GPages: 1048576 kB

or something similar. I would even argue this would be slightly easier
to parse. So it is not like we would be locked into anything.

> You have
> created yet another ABI compatibility headache today in order to save
> 112 lines of code.
>
> That may be a worthwhile tradeoff, I don't know. But at least I realize
> there is a cost, while you seem to ignore that component. There is
> value in not painting yourself into a corner.

My primary point was that we are adding code for a feature nobody
actually asked for, just because somebody might ask for it in the future.
--
Michal Hocko
SUSE Labs