2016-11-29 20:17:15

by Dave Hansen

[permalink] [raw]
Subject: [PATCH] proc: mm: export PTE sizes directly in smaps (v3)


Andrew, you can drop proc-mm-export-pte-sizes-directly-in-smaps-v2.patch,
and replace it with this.

Changes from v2:
* Do not assume (wrongly) that smaps_hugetlb_range() always uses
PUDs. (Thanks for pointing this out, Vlastimil). Also handle
hstates that are not exactly at PMD/PUD sizes.

Changes from v1:
* Do one 'Pte' line per pte size instead of mashing on one line
* Use PMD_SIZE for pmds instead of PAGE_SIZE, whoops
* Wrote some Documentation/

--

/proc/$pid/smaps has a number of fields that are intended to imply the
kinds of PTEs used to map memory. "AnonHugePages" obviously tells you
how many PMDs are being used. "MMUPageSize" along with the "Hugetlb"
fields tells you how many PTEs you have for a huge page.

The current mechanisms work fine when we have one or two page sizes.
But, they start to get a bit muddled when we mix page sizes inside
one VMA. For instance, the DAX folks were proposing adding a set of
fields like:

DevicePages:
DeviceHugePages:
DeviceGiganticPages:
DeviceGinormousPages:

to unmuddle things when page sizes get mixed. That's fine, but
it does require userspace know the mapping from our various
arbitrary names to hardware page sizes on each architecture and
kernel configuration. That seems rather suboptimal.

What folks really want is to know how much memory is mapped with
each page size. How about we just do *that* instead?

Patch attached. Seems harmless enough. Seems to compile on a
bunch of random architectures. Makes smaps look like this:

Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
Ptes@4kB: 32 kB
Ptes@2MB: 2048 kB

The format I used here should be unlikely to break smaps parsers
unless they're looking for "kB" and now match the 'Ptes@4kB' instead
of the one at the end of the line.

Note: hugetlbfs PTEs are unusual. We can have more than one "pte_t"
for each hugetlbfs "page". arm64 has this configuration, and probably
others. The code should now handle when an hstate's size is not equal
to one of the page table entry sizes. For instance, it assumes that
hstates between PMD_SIZE and PUD_SIZE are made up of multiple PMDs
and prints them as such.

I've tested this on x86 with normal 4k ptes, anonymous huge pages,
1G hugetlbfs and 2M hugetlbfs pages.

1. I'd like to thank Dan Williams for showing me a mirror as I
complained about the bozo that introduced 'AnonHugePages'.

Cc: Christoph Hellwig <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: [email protected]
Cc: [email protected]

---

b/Documentation/filesystems/proc.txt | 6 +
b/fs/proc/task_mmu.c | 106 ++++++++++++++++++++++++++++++++++-
b/mm/hugetlb.c | 11 +++
3 files changed, 121 insertions(+), 2 deletions(-)

diff -puN fs/proc/task_mmu.c~smaps-pte-sizes fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c~smaps-pte-sizes 2016-11-28 08:48:10.899054714 -0800
+++ b/fs/proc/task_mmu.c 2016-11-29 10:31:55.259790185 -0800
@@ -445,6 +445,9 @@ struct mem_size_stats {
unsigned long swap;
unsigned long shared_hugetlb;
unsigned long private_hugetlb;
+ unsigned long rss_pte;
+ unsigned long rss_pmd;
+ unsigned long rss_pud;
u64 pss;
u64 swap_pss;
bool check_shmem_swap;
@@ -519,6 +522,7 @@ static void smaps_pte_entry(pte_t *pte,

if (pte_present(*pte)) {
page = vm_normal_page(vma, addr, *pte);
+ mss->rss_pte += PAGE_SIZE;
} else if (is_swap_pte(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);

@@ -578,6 +582,7 @@ static void smaps_pmd_entry(pmd_t *pmd,
/* pass */;
else
VM_BUG_ON_PAGE(1, page);
+ mss->rss_pmd += PMD_SIZE;
smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd));
}
#else
@@ -684,6 +689,30 @@ static void show_smap_vma_flags(struct s
}

#ifdef CONFIG_HUGETLB_PAGE
+/*
+ * Most architectures have a 1:1 mapping of PTEs to hugetlb page
+ * sizes, but there are some outliers like arm64 that use
+ * multiple hardware PTEs to make a hugetlb "page". Do not
+ * assume that every 'hpage_size' falls exactly at a page table
+ * size boundary. Instead, accept arbitrary 'hpage_size's and
+ * assume they are made up of the next-smallest size. We do not
+ * handle PGD-sized hpages and hugetlb_add_hstate() will WARN()
+ * if it sees one.
+ *
+ * Note also that the page walker code only calls us once per
+ * huge 'struct page', *not* once per PTE in the page tables.
+ */
+static void smaps_hugetlb_present_hpage(struct mem_size_stats *mss,
+ unsigned long hpage_size)
+{
+ if (hpage_size >= PUD_SIZE)
+ mss->rss_pud += hpage_size;
+ else if (hpage_size >= PMD_SIZE)
+ mss->rss_pmd += hpage_size;
+ else
+ mss->rss_pte += hpage_size;
+}
+
static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
unsigned long addr, unsigned long end,
struct mm_walk *walk)
@@ -702,11 +731,14 @@ static int smaps_hugetlb_range(pte_t *pt
}
if (page) {
int mapcount = page_mapcount(page);
+ unsigned long hpage_size = huge_page_size(hstate_vma(vma));
+
+ smaps_hugetlb_present_hpage(mss, hpage_size);

if (mapcount >= 2)
- mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
+ mss->shared_hugetlb += hpage_size;
else
- mss->private_hugetlb += huge_page_size(hstate_vma(vma));
+ mss->private_hugetlb += hpage_size;
}
return 0;
}
@@ -716,6 +748,75 @@ void __weak arch_show_smap(struct seq_fi
{
}

+/*
+ * What units should we use for a given number? We want
+ * 2048 to be 2k, so we return 'k'. 1048576 should be
+ * 1M, so we return 'M'.
+ */
+static char size_unit(unsigned long long nr)
+{
+ /*
+ * This ' ' might look a bit goofy in the output. But, why
+ * bother doing anything. Do we even have a <1k page size?
+ */
+ if (nr < (1ULL<<10))
+ return ' ';
+ if (nr < (1ULL<<20))
+ return 'k';
+ if (nr < (1ULL<<30))
+ return 'M';
+ if (nr < (1ULL<<40))
+ return 'G';
+ if (nr < (1ULL<<50))
+ return 'T';
+ if (nr < (1ULL<<60))
+ return 'P';
+ return 'E';
+}
+
+/*
+ * How should we shift down a given number to scale it
+ * to the units we are printing? We want 2048 to be 2k,
+ * so we want it shifted down by 10. 1048576 should be
+ * 1M, so we want it shifted down by 20.
+ */
+static int size_shift(unsigned long long nr)
+{
+ if (nr < (1ULL<<10))
+ return 0;
+ if (nr < (1ULL<<20))
+ return 10;
+ if (nr < (1ULL<<30))
+ return 20;
+ if (nr < (1ULL<<40))
+ return 30;
+ if (nr < (1ULL<<50))
+ return 40;
+ if (nr < (1ULL<<60))
+ return 50;
+ return 60;
+}
+
+static void show_one_smap_pte(struct seq_file *m, unsigned long bytes_rss,
+ unsigned long pte_size)
+{
+ seq_printf(m, "Ptes@%ld%cB: %8lu kB\n",
+ pte_size >> size_shift(pte_size),
+ size_unit(pte_size),
+ bytes_rss >> 10);
+}
+
+static void show_smap_ptes(struct seq_file *m, struct mem_size_stats *mss)
+{
+ /* Only print the entries for page sizes present in the VMA */
+ if (mss->rss_pte)
+ show_one_smap_pte(m, mss->rss_pte, PAGE_SIZE);
+ if (mss->rss_pmd)
+ show_one_smap_pte(m, mss->rss_pmd, PMD_SIZE);
+ if (mss->rss_pud)
+ show_one_smap_pte(m, mss->rss_pud, PUD_SIZE);
+}
+
static int show_smap(struct seq_file *m, void *v, int is_pid)
{
struct vm_area_struct *vma = v;
@@ -799,6 +900,7 @@ static int show_smap(struct seq_file *m,
(vma->vm_flags & VM_LOCKED) ?
(unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);

+ show_smap_ptes(m, &mss);
arch_show_smap(m, vma);
show_smap_vma_flags(m, vma);
m_cache_vma(m, vma);
diff -puN Documentation/filesystems/proc.txt~smaps-pte-sizes Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt~smaps-pte-sizes 2016-11-28 08:48:10.903054895 -0800
+++ b/Documentation/filesystems/proc.txt 2016-11-28 08:48:10.909055167 -0800
@@ -418,6 +418,9 @@ SwapPss: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
+Ptes@4kB: 4 kB
+Ptes@2MB: 8192 kB
+
VmFlags: rd ex mr mw me dw

the first of these lines shows the same information as is displayed for the
@@ -450,6 +453,9 @@ replaced by copy-on-write) part of the u
"SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this
does not take into account swapped out page of underlying shmem objects.
"Locked" indicates whether the mapping is locked in memory or not.
+"Ptes@..." lines show how many page table entries are currently in place and
+pointing to memory. There is an entry for each size present in the hardware
+page tables for this mapping.

"VmFlags" field deserves a separate description. This member represents the kernel
flags associated with the particular virtual memory area in two letter encoded
diff -puN mm/hugetlb.c~smaps-pte-sizes mm/hugetlb.c
--- a/mm/hugetlb.c~smaps-pte-sizes 2016-11-28 14:21:37.555519365 -0800
+++ b/mm/hugetlb.c 2016-11-28 14:28:49.186234688 -0800
@@ -2763,6 +2763,17 @@ void __init hugetlb_add_hstate(unsigned
huge_page_size(h)/1024);

parsed_hstate = h;
+
+ /*
+ * PGD_SIZE isn't widely made available by architectures,
+ * so use PUD_SIZE*PTRS_PER_PUD as a substitute.
+ *
+ * Check for sizes that might be mapped by a PGD. There
+ * are none of these known today, but be on the lookout.
+ * If this trips, we will need to update the mss->rss_*
+ * code in fs/proc/task_mmu.c.
+ */
+ WARN_ON_ONCE((PAGE_SIZE << order) >= PUD_SIZE * PTRS_PER_PUD);
}

static int __init hugetlb_nrpages_setup(char *s)
_


2016-12-01 12:21:45

by Vlastimil Babka

Subject: Re: [PATCH] proc: mm: export PTE sizes directly in smaps (v3)

On 11/29/2016 09:17 PM, Dave Hansen wrote:
> Andrew, you can drop proc-mm-export-pte-sizes-directly-in-smaps-v2.patch,
> and replace it with this.
>
> Changes from v2:
> * Do not assume (wrongly) that smaps_hugetlb_range() always uses
> PUDs. (Thanks for pointing this out, Vlastimil). Also handle
> hstates that are not exactly at PMD/PUD sizes.
>
> Changes from v1:
> * Do one 'Pte' line per pte size instead of mashing on one line
> * Use PMD_SIZE for pmds instead of PAGE_SIZE, whoops
> * Wrote some Documentation/
>
> --
>
> /proc/$pid/smaps has a number of fields that are intended to imply the
> kinds of PTEs used to map memory. "AnonHugePages" obviously tells you
> how many PMDs are being used. "MMUPageSize" along with the "Hugetlb"
> fields tells you how many PTEs you have for a huge page.
>
> The current mechanisms work fine when we have one or two page sizes.
> But, they start to get a bit muddled when we mix page sizes inside
> one VMA. For instance, the DAX folks were proposing adding a set of
> fields like:
>
> DevicePages:
> DeviceHugePages:
> DeviceGiganticPages:
> DeviceGinormousPages:
>
> to unmuddle things when page sizes get mixed. That's fine, but
> it does require userspace know the mapping from our various
> arbitrary names to hardware page sizes on each architecture and
> kernel configuration. That seems rather suboptimal.
>
> What folks really want is to know how much memory is mapped with
> each page size. How about we just do *that* instead?
>
> Patch attached. Seems harmless enough. Seems to compile on a
> bunch of random architectures. Makes smaps look like this:
>
> Private_Hugetlb: 0 kB
> Swap: 0 kB
> SwapPss: 0 kB
> KernelPageSize: 4 kB
> MMUPageSize: 4 kB
> Locked: 0 kB
> Ptes@4kB: 32 kB
> Ptes@2MB: 2048 kB
>
> The format I used here should be unlikely to break smaps parsers
> unless they're looking for "kB" and now match the 'Ptes@4kB' instead
> of the one at the end of the line.
>
> Note: hugetlbfs PTEs are unusual. We can have more than one "pte_t"
> for each hugetlbfs "page". arm64 has this configuration, and probably
> others. The code should now handle when an hstate's size is not equal
> to one of the page table entry sizes. For instance, it assumes that
> hstates between PMD_SIZE and PUD_SIZE are made up of multiple PMDs
> and prints them as such.
>
> I've tested this on x86 with normal 4k ptes, anonymous huge pages,
> 1G hugetlbfs and 2M hugetlbfs pages.
>
> 1. I'd like to thank Dan Williams for showing me a mirror as I
> complained about the bozo that introduced 'AnonHugePages'.
>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Anshuman Khandual <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: [email protected]
> Cc: [email protected]

Acked-by: Vlastimil Babka <[email protected]>

2016-12-01 14:50:43

by Andy Shevchenko

Subject: Re: [PATCH] proc: mm: export PTE sizes directly in smaps (v3)

On Tue, Nov 29, 2016 at 10:17 PM, Dave Hansen <[email protected]> wrote:
>
> Andrew, you can drop proc-mm-export-pte-sizes-directly-in-smaps-v2.patch,
> and replace it with this.

You added a warning and it immediately appears:


[ 0.402603] ------------[ cut here ]------------
[ 0.402844] WARNING: CPU: 0 PID: 1 at /home/andy/prj/linux-netboot/mm/hugetlb.c:2918 hugetlb_add_hstate+0x143/0x14b
[ 0.403042] Modules linked in:
[ 0.403233] CPU: 0 PID: 1 Comm: swapper Not tainted 4.9.0-rc7-next-20161201+ #1
[ 0.403499] Call Trace:
[ 0.403677] dump_stack+0x16/0x1d
[ 0.404081] __warn+0xd1/0xf0
[ 0.404289] ? hugetlb_add_hstate+0x143/0x14b
[ 0.404491] warn_slowpath_null+0x25/0x30
[ 0.404695] hugetlb_add_hstate+0x143/0x14b
[ 0.404908] hugetlb_init+0x79/0x3af
[ 0.405249] ? wake_up_process+0xf/0x20
[ 0.405450] ? kcompactd_run+0x50/0x90
[ 0.405638] ? compact_zone+0x7c0/0x7c0
[ 0.405842] ? hugetlb_add_hstate+0x14b/0x14b
[ 0.406082] do_one_initcall+0x2f/0x160
[ 0.406286] ? repair_env_string+0x12/0x54
[ 0.406482] ? parse_args+0x2a1/0x5a0
[ 0.406684] ? __dquot_free_space+0xa0/0x2d0
[ 0.406892] ? kernel_init_freeable+0xe4/0x18a
[ 0.407088] kernel_init_freeable+0x107/0x18a
[ 0.407303] ? rest_init+0x60/0x60
[ 0.407496] kernel_init+0xb/0x100
[ 0.407703] ? schedule_tail_wrapper+0x9/0x10
[ 0.408099] ret_from_fork+0x19/0x30
[ 0.408302] ---[ end trace 601ba77b9b62b9d7 ]---
[ 0.408481] HugeTLB registered 4 MB page size, pre-allocated 0 pages


Quark SoC here.

Besides that, see below.

> +/*
> + * What units should we use for a given number? We want
> + * 2048 to be 2k, so we return 'k'. 1048576 should be
> + * 1M, so we return 'M'.
> + */
> +static char size_unit(unsigned long long nr)
> +{
> + /*
> + * This ' ' might look a bit goofy in the output. But, why
> + * bother doing anything. Do we even have a <1k page size?
> + */
> + if (nr < (1ULL<<10))
> + return ' ';
> + if (nr < (1ULL<<20))
> + return 'k';
> + if (nr < (1ULL<<30))
> + return 'M';
> + if (nr < (1ULL<<40))
> + return 'G';
> + if (nr < (1ULL<<50))
> + return 'T';
> + if (nr < (1ULL<<60))
> + return 'P';
> + return 'E';
> +}
> +
> +/*
> + * How should we shift down a given number to scale it
> + * to the units we are printing? We want 2048 to be 2k,
> + * so we want it shifted down by 10. 1048576 should be
> + * 1M, so we want it shifted down by 20.
> + */
> +static int size_shift(unsigned long long nr)
> +{
> + if (nr < (1ULL<<10))
> + return 0;
> + if (nr < (1ULL<<20))
> + return 10;
> + if (nr < (1ULL<<30))
> + return 20;
> + if (nr < (1ULL<<40))
> + return 30;
> + if (nr < (1ULL<<50))
> + return 40;
> + if (nr < (1ULL<<60))
> + return 50;
> + return 60;
> +}
> +

A new copy of string_get_size()?

Something similar was discussed for EFI stuff like half a year ago?

> +static void show_one_smap_pte(struct seq_file *m, unsigned long bytes_rss,
> + unsigned long pte_size)
> +{
> + seq_printf(m, "Ptes@%ld%cB: %8lu kB\n",
> + pte_size >> size_shift(pte_size),
> + size_unit(pte_size),
> + bytes_rss >> 10);
> +}


> + /*
> + * PGD_SIZE isn't widely made available by architectures,
> + * so use PUD_SIZE*PTRS_PER_PUD as a substitute.
> + *
> + * Check for sizes that might be mapped by a PGD. There
> + * are none of these known today, but be on the lookout.
> + * If this trips, we will need to update the mss->rss_*
> + * code in fs/proc/task_mmu.c.
> + */
> + WARN_ON_ONCE((PAGE_SIZE << order) >= PUD_SIZE * PTRS_PER_PUD);

This is what I got.

--
With Best Regards,
Andy Shevchenko

2016-12-02 15:43:54

by Aneesh Kumar K.V

Subject: Re: [PATCH] proc: mm: export PTE sizes directly in smaps (v3)

Dave Hansen <[email protected]> writes:

> Andrew, you can drop proc-mm-export-pte-sizes-directly-in-smaps-v2.patch,
> and replace it with this.
>
.....

> diff -puN mm/hugetlb.c~smaps-pte-sizes mm/hugetlb.c
> --- a/mm/hugetlb.c~smaps-pte-sizes 2016-11-28 14:21:37.555519365 -0800
> +++ b/mm/hugetlb.c 2016-11-28 14:28:49.186234688 -0800
> @@ -2763,6 +2763,17 @@ void __init hugetlb_add_hstate(unsigned
> huge_page_size(h)/1024);
>
> parsed_hstate = h;
> +
> + /*
> + * PGD_SIZE isn't widely made available by architectures,
> + * so use PUD_SIZE*PTRS_PER_PUD as a substitute.
> + *
> + * Check for sizes that might be mapped by a PGD. There
> + * are none of these known today, but be on the lookout.
> + * If this trips, we will need to update the mss->rss_*
> + * code in fs/proc/task_mmu.c.
> + */
> + WARN_ON_ONCE((PAGE_SIZE << order) >= PUD_SIZE * PTRS_PER_PUD);
> }

This will trip for the ppc64 16GB hugepage.

On ppc64, the 16G pages are mapped at the pgd level.

-aneesh

2016-12-02 18:04:12

by Aneesh Kumar K.V

Subject: Re: [PATCH] proc: mm: export PTE sizes directly in smaps (v3)

Dave Hansen <[email protected]> writes:


> #ifdef CONFIG_HUGETLB_PAGE
> +/*
> + * Most architectures have a 1:1 mapping of PTEs to hugetlb page
> + * sizes, but there are some outliers like arm64 that use
> + * multiple hardware PTEs to make a hugetlb "page". Do not
> + * assume that every 'hpage_size' falls exactly at a page table
> + * size boundary. Instead, accept arbitrary 'hpage_size's and
> + * assume they are made up of the next-smallest size. We do not
> + * handle PGD-sized hpages and hugetlb_add_hstate() will WARN()
> + * if it sees one.
> + *
> + * Note also that the page walker code only calls us once per
> + * huge 'struct page', *not* once per PTE in the page tables.
> + */
> +static void smaps_hugetlb_present_hpage(struct mem_size_stats *mss,
> + unsigned long hpage_size)
> +{
> + if (hpage_size >= PUD_SIZE)
> + mss->rss_pud += hpage_size;
> + else if (hpage_size >= PMD_SIZE)
> + mss->rss_pmd += hpage_size;
> + else
> + mss->rss_pte += hpage_size;
> +}

Some powerpc platforms have multiple page table entries mapping the same
hugepage, and on others we have a page table entry pointing to something
called a hugepage directory, which maps a set of hugepages. So I am not
sure the above will work for all of those?

Also, do we derive the Ptes@<size> values correctly there?

-aneesh

2016-12-09 23:10:08

by Dave Hansen

Subject: Re: [PATCH] proc: mm: export PTE sizes directly in smaps (v3)

On 12/01/2016 06:50 AM, Andy Shevchenko wrote:
>> > +static int size_shift(unsigned long long nr)
>> > +{
>> > + if (nr < (1ULL<<10))
>> > + return 0;
>> > + if (nr < (1ULL<<20))
>> > + return 10;
>> > + if (nr < (1ULL<<30))
>> > + return 20;
>> > + if (nr < (1ULL<<40))
>> > + return 30;
>> > + if (nr < (1ULL<<50))
>> > + return 40;
>> > + if (nr < (1ULL<<60))
>> > + return 50;
>> > + return 60;
>> > +}
>> > +
> A new copy of string_get_size()?

Not really. That prints to a buffer, so we'll need to allocate stack
space for a buffer, which we also have to size properly. We also want
to be consistent with other parts of smaps that mean kB==1024 bytes, so
we want string_get_size()'s STRING_UNITS_10 strings, but
STRING_UNITS_2's divisor.

Also, guaranteeing that we have a power-of-2 'block size' lets us cheat
and do things much faster than using real division. Not that it
matters, but we could do it thousands of times for a large smaps file.

Being defined locally, this stuff also gets inlined pretty aggressively.

Given all that, I'm not sure I want to modify string_get_size() to do
exactly what we need here.