2008-10-16 15:58:45

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 0/2] Report the size of pages backing VMAs in /proc V3

The following two patches add support for printing the size of pages used by
the kernel and the MMU to back VMAs. This can be used by a user to verify
that a hugepage-aware application is using the expected page sizes.

The first patch prints the size of page used by the kernel when allocating
pages for a VMA in /proc/pid/smaps. The second patch reports on
the size of page used by the MMU as it can differ - for example on POWER
using 64K as a base pagesize on older processors.

Changelog since V2
o Drop changes to /proc/pid/maps - could not get agreement and it affects
procps. Patch to procps was posted but fell into silence. Dropping
patch as smaps gives the necessary information, just with a bit more
legwork by the user
o Drop redundant VM_BUG_ON (Alexey)

Changelog since V1
o Fix build failure on !CONFIG_HUGETLB_PAGE
o Uninline helper functions
o Distinguish between base pagesize and MMU pagesize

arch/powerpc/include/asm/hugetlb.h | 6 ++++++
arch/powerpc/mm/hugetlbpage.c | 7 +++++++
fs/proc/task_mmu.c | 8 ++++++--
include/linux/hugetlb.h | 6 ++++++
mm/hugetlb.c | 29 +++++++++++++++++++++++++++++
5 files changed, 54 insertions(+), 2 deletions(-)


2008-10-16 15:58:58

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 1/2] Subject: [PATCH] Report the pagesize backing a VMA in /proc/pid/smaps

It is useful to verify a hugepage-aware application is using the expected
pagesizes for its memory regions. This patch creates an entry called
KernelPageSize in /proc/pid/smaps that is the size of page used by the
kernel to back a VMA. The entry is not called PageSize as it is possible
the MMU uses a different size. This extension should not break any sensible
parser that skips lines containing unrecognised information.

Signed-off-by: Mel Gorman <[email protected]>
---
fs/proc/task_mmu.c | 6 ++++--
include/linux/hugetlb.h | 3 +++
mm/hugetlb.c | 16 ++++++++++++++++
3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4806830..71c9868 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -391,7 +391,8 @@ static int show_smap(struct seq_file *m, void *v)
"Private_Clean: %8lu kB\n"
"Private_Dirty: %8lu kB\n"
"Referenced: %8lu kB\n"
- "Swap: %8lu kB\n",
+ "Swap: %8lu kB\n"
+ "KernelPageSize: %8lu kB\n",
(vma->vm_end - vma->vm_start) >> 10,
mss.resident >> 10,
(unsigned long)(mss.pss >> (10 + PSS_SHIFT)),
@@ -400,7 +401,8 @@ static int show_smap(struct seq_file *m, void *v)
mss.private_clean >> 10,
mss.private_dirty >> 10,
mss.referenced >> 10,
- mss.swap >> 10);
+ mss.swap >> 10,
+ vma_kernel_pagesize(vma) >> 10);

return ret;
}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 32e0ef0..ace04a7 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -231,6 +231,8 @@ static inline unsigned long huge_page_size(struct hstate *h)
return (unsigned long)PAGE_SIZE << h->order;
}

+extern unsigned long vma_kernel_pagesize(struct vm_area_struct *vma);
+
static inline unsigned long huge_page_mask(struct hstate *h)
{
return h->mask;
@@ -271,6 +273,7 @@ struct hstate {};
#define hstate_inode(i) NULL
#define huge_page_size(h) PAGE_SIZE
#define huge_page_mask(h) PAGE_MASK
+#define vma_kernel_pagesize(v) PAGE_SIZE
#define huge_page_order(h) 0
#define huge_page_shift(h) PAGE_SHIFT
static inline unsigned int pages_per_huge_page(struct hstate *h)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 67a7119..7cb27ec 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -219,6 +219,22 @@ static pgoff_t vma_hugecache_offset(struct hstate *h,
}

/*
+ * Return the size of the pages allocated when backing a VMA. In the majority
+ * cases this will be same size as used by the page table entries.
+ */
+unsigned long vma_kernel_pagesize(struct vm_area_struct *vma)
+{
+ struct hstate *hstate;
+
+ if (!is_vm_hugetlb_page(vma))
+ return PAGE_SIZE;
+
+ hstate = hstate_vma(vma);
+
+ return 1UL << (hstate->order + PAGE_SHIFT);
+}
+
+/*
* Flags for MAP_PRIVATE reservations. These are stored in the bottom
* bits of the reservation map pointer, which are always clear due to
* alignment.
--
1.5.6.5

2008-10-16 15:59:25

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 2/2] Report the MMU pagesize in /proc/pid/smaps

The KernelPageSize entry in /proc/pid/smaps is the pagesize used by the
kernel to back a VMA. This matches the size used by the MMU in the majority
of cases. However, one counter-example occurs on PPC64 kernels whereby
a kernel using 64K as a base pagesize may still use 4K pages for the MMU
on older processor. To distinguish, this patch reports MMUPageSize as the
pagesize used by the MMU in /proc/pid/smaps.

Signed-off-by: Mel Gorman <[email protected]>
---
arch/powerpc/include/asm/hugetlb.h | 6 ++++++
arch/powerpc/mm/hugetlbpage.c | 7 +++++++
fs/proc/task_mmu.c | 6 ++++--
include/linux/hugetlb.h | 3 +++
mm/hugetlb.c | 13 +++++++++++++
5 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h
index 26f0d0a..2655146 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -18,6 +18,12 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
pte_t *ptep);

/*
+ * The version of vma_mmu_pagesize() in arch/powerpc/mm/hugetlbpage.c needs
+ * to override the version in mm/hugetlb.c
+ */
+#define vma_mmu_pagesize vma_mmu_pagesize
+
+/*
* If the arch doesn't supply something else, assume that hugepage
* size aligned regions are ok without further preparation.
*/
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index a117024..edc0c69 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -510,6 +510,13 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1, 0);
}

+unsigned long vma_mmu_pagesize(struct vm_area_struct *vma)
+{
+ unsigned int psize = get_slice_psize(vma->vm_mm, vma->vm_start);
+
+ return 1UL << mmu_psize_to_shift(psize);
+}
+
/*
* Called by asm hashtable.S for doing lazy icache flush
*/
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 71c9868..3517892 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -392,7 +392,8 @@ static int show_smap(struct seq_file *m, void *v)
"Private_Dirty: %8lu kB\n"
"Referenced: %8lu kB\n"
"Swap: %8lu kB\n"
- "KernelPageSize: %8lu kB\n",
+ "KernelPageSize: %8lu kB\n"
+ "MMUPageSize: %8lu kB\n",
(vma->vm_end - vma->vm_start) >> 10,
mss.resident >> 10,
(unsigned long)(mss.pss >> (10 + PSS_SHIFT)),
@@ -402,7 +403,8 @@ static int show_smap(struct seq_file *m, void *v)
mss.private_dirty >> 10,
mss.referenced >> 10,
mss.swap >> 10,
- vma_kernel_pagesize(vma) >> 10);
+ vma_kernel_pagesize(vma) >> 10,
+ vma_mmu_pagesize(vma) >> 10);

return ret;
}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ace04a7..5056021 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -233,6 +233,8 @@ static inline unsigned long huge_page_size(struct hstate *h)

extern unsigned long vma_kernel_pagesize(struct vm_area_struct *vma);

+extern unsigned long vma_mmu_pagesize(struct vm_area_struct *vma);
+
static inline unsigned long huge_page_mask(struct hstate *h)
{
return h->mask;
@@ -274,6 +276,7 @@ struct hstate {};
#define huge_page_size(h) PAGE_SIZE
#define huge_page_mask(h) PAGE_MASK
#define vma_kernel_pagesize(v) PAGE_SIZE
+#define vma_mmu_pagesize(v) PAGE_SIZE
#define huge_page_order(h) 0
#define huge_page_shift(h) PAGE_SHIFT
static inline unsigned int pages_per_huge_page(struct hstate *h)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7cb27ec..fee3d1d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -235,6 +235,19 @@ unsigned long vma_kernel_pagesize(struct vm_area_struct *vma)
}

/*
+ * Return the page size being used by the MMU to back a VMA. In the majority
+ * of cases, the page size used by the kernel matches the MMU size. On
+ * architectures where it differs, an architecture-specific version of this
+ * function is required.
+ */
+#ifndef vma_mmu_pagesize
+unsigned long vma_mmu_pagesize(struct vm_area_struct *vma)
+{
+ return vma_kernel_pagesize(vma);
+}
+#endif
+
+/*
* Flags for MAP_PRIVATE reservations. These are stored in the bottom
* bits of the reservation map pointer, which are always clear due to
* alignment.
--
1.5.6.5

2008-10-16 16:25:59

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 1/2] Subject: [PATCH] Report the pagesize backing a VMA in /proc/pid/smaps

Hi

> It is useful to verify a hugepage-aware application is using the expected
> pagesizes for its memory regions. This patch creates an entry called
> KernelPageSize in /proc/pid/smaps that is the size of page used by the
> kernel to back a VMA. The entry is not called PageSize as it is possible
> the MMU uses a different size. This extension should not break any sensible
> parser that skips lines containing unrecognised information.
>
> Signed-off-by: Mel Gorman <[email protected]>

ack.

2008-10-20 09:18:42

by Albert Cahalan

[permalink] [raw]
Subject: Re: [PATCH 0/2] Report the size of pages backing VMAs in /proc V3

Adding " (hpagesize=4096kB)" onto the end of a filename is as vile
as adding " (deleted)" onto the end. If anything is going to change
in this area, it should be the elimination of " (deleted)". These
tags are perfectly legitimate in filenames.

Looping on stat() while chopping off suspected tags is dreadful.
Besides just being gross, it's slow.

gdb will tolerate up to 7 flags, procps will tolerate up to 31 flags,
and both will tolerate anything without a '/' before the filename.

Obviously, every author of a /proc-based tool has been forced to
take a random guess at the ABI. The /proc/*/smaps is so gross and
that I put off writing a parser for years.

What you can probably get away with:

After the "rwxp" stuff you can add 3 more flags. (gdb limit)
You could use 'L' for locked pages, 'R' for swap reservation,
and 'D' for deleted files. It's probably much better to save
space though, since gdb will crash if you add too many flags.
Three characters can be 18 bits if you base-64 encode them,
being careful to avoid the '/' character. (adding 0x30 works)

Right before the filename, you can add anything except a '/'.
You could add a few columns of numbers or a second flags field.

Not that it matters on such a slow-ass file format, but you
can make parsing faster if you encode the page size in one byte.
Simply add 0x30 to the page shift, then print that byte. Note that
this would let you cram the page size into the flags field.

BTW, I'm thinking that the /proc/*/*maps files fail when the
lines exceed 4096 bytes. The pathname may legitimately be that
long, plus it can be backslash escaped, plus there is all the
junk on the beginning.

2008-10-20 10:06:55

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/2] Report the size of pages backing VMAs in /proc V3

On (20/10/08 05:18), Albert Cahalan didst pronounce:
> Adding " (hpagesize=4096kB)" onto the end of a filename is as vile
> as adding " (deleted)" onto the end. If anything is going to change
> in this area, it should be the elimination of " (deleted)". These
> tags are perfectly legitimate in filenames.
>

I dropped that change altogether in the last series because of concerns like
this. It did have the potential to grow to something weird looking.

> Looping on stat() while chopping off suspected tags is dreadful.
> Besides just being gross, it's slow.
>

You're probably right. It's a bit weird that it's what you have to do to
figure out if the file in /proc/PID/maps is really there or not.

> gdb will tolerate up to 7 flags, procps will tolerate up to 31 flags,
> and both will tolerate anything without a '/' before the filename.
>

Understood.

> Obviously, every author of a /proc-based tool has been forced to
> take a random guess at the ABI. The /proc/*/smaps is so gross and
> that I put off writing a parser for years.
>

I intend to take a stab at it for the purposes of teaching pmap to print
the pagesizes if the smaps change gets picked up.

> What you can probably get away with:
>
> After the "rwxp" stuff you can add 3 more flags. (gdb limit)
> You could use 'L' for locked pages, 'R' for swap reservation,
> and 'D' for deleted files. It's probably much better to save
> space though, since gdb will crash if you add too many flags.
> Three characters can be 18 bits if you base-64 encode them,
> being careful to avoid the '/' character. (adding 0x30 works)
>

Ok, noted in case I ever decide to tackle the (deleted) removal. It's
not something I feel strongly about though.

> Right before the filename, you can add anything except a '/'.
> You could add a few columns of numbers or a second flags field.
>

My fear was about parsers that hard-coded what number field stored the
filename. If a column was added for pagesize for example, then parsers
would think the pagesize was the filename.

> Not that it matters on such a slow-ass file format, but you
> can make parsing faster if you encode the page size in one byte.
> Simply add 0x30 to the page shift, then print that byte. Note that
> this would let you cram the page size into the flags field.
>

Now, that is an interested idea, albeit it's not one that is easily
human-readable and would need a second parser like pmap but that's ok. If
parsing smaps turns into a total pain in the ass or the performance overhead
of calculating PSS when reading the pagesize becomes a problem, then I'll
try this option. Thanks a lot for that idea.

> BTW, I'm thinking that the /proc/*/*maps files fail when the
> lines exceed 4096 bytes. The pathname may legitimately be that
> long, plus it can be backslash escaped, plus there is all the
> junk on the beginning.
>

Yes. While it's unlikely to be exceeded, a file could be 4096 bytes long
and the other fields will then cause a problem. It was because of things
like this, I was ok with dropping the idea of adding (attribute[=value])
from the end of the filename.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2008-10-20 18:07:40

by Albert Cahalan

[permalink] [raw]
Subject: Re: [PATCH 0/2] Report the size of pages backing VMAs in /proc V3

On Mon, Oct 20, 2008 at 6:06 AM, Mel Gorman <[email protected]> wrote:
> On (20/10/08 05:18), Albert Cahalan didst pronounce:

>> Looping on stat() while chopping off suspected tags is dreadful.
>> Besides just being gross, it's slow.
>
> You're probably right. It's a bit weird that it's what you have to do to
> figure out if the file in /proc/PID/maps is really there or not.

Actually you can't do this, because of directory permissions.

>> Obviously, every author of a /proc-based tool has been forced to
>> take a random guess at the ABI. The /proc/*/smaps is so gross and
>> that I put off writing a parser for years.
>
> I intend to take a stab at it for the purposes of teaching pmap to print
> the pagesizes if the smaps change gets picked up.

FYI, "KernelPageSize" is at least unique under the perfect
hash function I'm using to parse the damn smaps file.

hash = ( ( (s[8]&15) + (s[1]&15) ) ^ (s[0]&3) ) & 31;

I have to wonder if we'll be getting mixed page sizes
within a single mapping, making such info unusable.

>> Right before the filename, you can add anything except a '/'.
>> You could add a few columns of numbers or a second flags field.
>
> My fear was about parsers that hard-coded what number field stored the
> filename. If a column was added for pagesize for example, then parsers
> would think the pagesize was the filename.

It's possible. Every parser I've examined does strchr()
or similar to find that '/' character.

Maybe try some dummy patches in a linux-next kernel?
Give each one a month. You could do "xyz" concatenated
to the flags, a second "rwx" concatenated to the flags,
a single column of "0" before the filename, and several
columns of "parsertest" before the filename.

> Now, that is an interested idea, albeit it's not one that is easily
> human-readable and would need a second parser like pmap but that's ok. If
> parsing smaps turns into a total pain in the ass

I assure you that parsing smaps is a total pain in the ass,
especially if you want tolerable performance. Something
like "top" is not viable if it performs like a Python script.

>> BTW, I'm thinking that the /proc/*/*maps files fail when the
>> lines exceed 4096 bytes. The pathname may legitimately be that
>> long, plus it can be backslash escaped, plus there is all the
>> junk on the beginning.
>
> Yes. While it's unlikely to be exceeded, a file could be 4096 bytes long
> and the other fields will then cause a problem. It was because of things
> like this, I was ok with dropping the idea of adding (attribute[=value])
> from the end of the filename.

"unlikely" is not something one should trust. I think you
can even get a name longer than 4096 bytes if you make
directories relative to the current directory and keep
changing directories as you make the directories.
Then double that with backslashes becoming \\ or
newlines becoming \n (must be escaped) in the output.

I think /proc/*/maps has been broken ever since it was
converted to seq_file, and maybe ever since it got filenames.
Prior to the filenames, lines were fixed-width records.

2008-10-22 09:44:34

by mel

[permalink] [raw]
Subject: Re: [PATCH 0/2] Report the size of pages backing VMAs in /proc V3

On Mon, Oct 20, 2008 at 02:07:28PM -0400, Albert Cahalan wrote:
> On Mon, Oct 20, 2008 at 6:06 AM, Mel Gorman <[email protected]> wrote:
> > On (20/10/08 05:18), Albert Cahalan didst pronounce:
>
> >> Looping on stat() while chopping off suspected tags is dreadful.
> >> Besides just being gross, it's slow.
> >
> > You're probably right. It's a bit weird that it's what you have to do to
> > figure out if the file in /proc/PID/maps is really there or not.
>
> Actually you can't do this, because of directory permissions.
>

Good point.

> >> Obviously, every author of a /proc-based tool has been forced to
> >> take a random guess at the ABI. The /proc/*/smaps is so gross and
> >> that I put off writing a parser for years.
> >
> > I intend to take a stab at it for the purposes of teaching pmap to print
> > the pagesizes if the smaps change gets picked up.
>
> FYI, "KernelPageSize" is at least unique under the perfect
> hash function I'm using to parse the damn smaps file.
>
> hash = ( ( (s[8]&15) + (s[1]&15) ) ^ (s[0]&3) ) & 31;
>

Good to know.

> I have to wonder if we'll be getting mixed page sizes
> within a single mapping, making such info unusable.
>

It's not planned right now, but even if it is, KernelPageSize would
remain as the intended page size. VMAs would either split around each
mixed page size in which case there will be separate VMAs or an
additional field will be added that indicates what number of each
pagesize makes up the mapping.

> >> Right before the filename, you can add anything except a '/'.
> >> You could add a few columns of numbers or a second flags field.
> >
> > My fear was about parsers that hard-coded what number field stored the
> > filename. If a column was added for pagesize for example, then parsers
> > would think the pagesize was the filename.
>
> It's possible. Every parser I've examined does strchr()
> or similar to find that '/' character.
>

I might be the only criminal. A mucky shell script used awk to display
field X and everything past it to find the filename. A more rational
person would have used strchr or found the first / with cut or similar.

> Maybe try some dummy patches in a linux-next kernel?
> Give each one a month. You could do "xyz" concatenated
> to the flags, a second "rwx" concatenated to the flags,
> a single column of "0" before the filename, and several
> columns of "parsertest" before the filename.
>

That sounds reasonable.

> > Now, that is an interested idea, albeit it's not one that is easily
> > human-readable and would need a second parser like pmap but that's ok. If
> > parsing smaps turns into a total pain in the ass
>
> I assure you that parsing smaps is a total pain in the ass,
> especially if you want tolerable performance. Something
> like "top" is not viable if it performs like a Python script.
>

I had assumed that smaps + performance were mutually exclusive because
of the PSS calculation and any active monitoring from something like top
would blow bigtime. That's why I tried modifying maps as well.

> >> BTW, I'm thinking that the /proc/*/*maps files fail when the
> >> lines exceed 4096 bytes. The pathname may legitimately be that
> >> long, plus it can be backslash escaped, plus there is all the
> >> junk on the beginning.
> >
> > Yes. While it's unlikely to be exceeded, a file could be 4096 bytes long
> > and the other fields will then cause a problem. It was because of things
> > like this, I was ok with dropping the idea of adding (attribute[=value])
> > from the end of the filename.
>
> "unlikely" is not something one should trust. I think you
> can even get a name longer than 4096 bytes if you make
> directories relative to the current directory and keep
> changing directories as you make the directories.
> Then double that with backslashes becoming \\ or
> newlines becoming \n (must be escaped) in the output.
>
> I think /proc/*/maps has been broken ever since it was
> converted to seq_file, and maybe ever since it got filenames.
> Prior to the filenames, lines were fixed-width records.
>

You could be right. Only one way to find out for sure really.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab