2008-02-20 13:57:48

by Hans Rosenfeld

Subject: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

The current code for /proc/pid/pagemap does not work with huge pages (on
x86). It makes no distinction between a normal pmd and a huge page pmd
and tries to parse the contents of the huge page as ptes. Another
problem is that there is no way to get information about the page size a
specific mapping uses.

Also, the way the "not present" and "swap" bits are currently encoded in
the returned pfn isn't very clean, especially if this interface is going
to be extended.

I propose to change /proc/pid/pagemap to return a pseudo-pte instead of
just a raw pfn. The pseudo-pte will contain:

- 58 bits for the physical address of the first byte in the page; even
fewer bits would probably suffice for quite a while

- 4 bits for the page size, with 0 meaning native page size (4k on x86,
8k on alpha, ...) and values 1-15 being specific to the architecture
(I used 1 for 2M, 2 for 4M and 3 for 1G for x86)

- a "swap" bit indicating that a not present page is paged out, with the
physical address field containing page file number and block number
just like before

- a "present" bit just like in a real pte

By shortening the field for the physical address, some more interesting
information could be included, such as read/write permissions. The page
size could also be returned directly; 6 bits would suffice to express
any page shift on a 64-bit system, but I found the encoded page size
more useful for my specific use case.


The attached patch changes the /proc/pid/pagemap code to use such a
pseudo-pte. The huge page handling is currently limited to 2M/4M pages
on x86, 1G pages will need some more work. To keep the simple mapping of
virtual addresses to file index intact, any huge page pseudo-pte is
replicated in the user buffer to map the equivalent range of small
pages.

Note that I had to move the pmd_pfn() macro from asm-x86/pgtable_64.h to
asm-x86/pgtable.h, as it applies to both 32-bit and 64-bit x86.

Other architectures will probably need other changes to support huge
pages and return the page size.

I think that the definition of the pseudo-pte structure and the page
size codes should be made available through a header file, but I didn't
do this for now.

Signed-off-by: Hans Rosenfeld <[email protected]>

---
fs/proc/task_mmu.c | 68 +++++++++++++++++++++++++++++------------
include/asm-x86/pgtable.h | 2 +
include/asm-x86/pgtable_64.h | 1 -
3 files changed, 50 insertions(+), 21 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 49958cf..58af588 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -527,16 +527,23 @@ struct pagemapread {
char __user *out, *end;
};

-#define PM_ENTRY_BYTES sizeof(u64)
-#define PM_RESERVED_BITS 3
-#define PM_RESERVED_OFFSET (64 - PM_RESERVED_BITS)
-#define PM_RESERVED_MASK (((1LL<<PM_RESERVED_BITS)-1) << PM_RESERVED_OFFSET)
-#define PM_SPECIAL(nr) (((nr) << PM_RESERVED_OFFSET) | PM_RESERVED_MASK)
-#define PM_NOT_PRESENT PM_SPECIAL(1LL)
-#define PM_SWAP PM_SPECIAL(2LL)
-#define PM_END_OF_BUFFER 1
-
-static int add_to_pagemap(unsigned long addr, u64 pfn,
+struct ppte {
+ uint64_t paddr:58;
+ uint64_t psize:4;
+ uint64_t swap:1;
+ uint64_t present:1;
+};
+
+#ifdef CONFIG_X86
+#define PM_PSIZE_1G 3
+#define PM_PSIZE_4M 2
+#define PM_PSIZE_2M 1
+#endif
+
+#define PM_ENTRY_BYTES sizeof(struct ppte)
+#define PM_END_OF_BUFFER 1
+
+static int add_to_pagemap(unsigned long addr, struct ppte ppte,
struct pagemapread *pm)
{
/*
@@ -545,13 +552,13 @@ static int add_to_pagemap(unsigned long addr, u64 pfn,
* the pfn.
*/
if (pm->out + PM_ENTRY_BYTES >= pm->end) {
- if (copy_to_user(pm->out, &pfn, pm->end - pm->out))
+ if (copy_to_user(pm->out, &ppte, pm->end - pm->out))
return -EFAULT;
pm->out = pm->end;
return PM_END_OF_BUFFER;
}

- if (put_user(pfn, pm->out))
+ if (copy_to_user(pm->out, &ppte, sizeof(ppte)))
return -EFAULT;
pm->out += PM_ENTRY_BYTES;
return 0;
@@ -564,7 +571,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
unsigned long addr;
int err = 0;
for (addr = start; addr < end; addr += PAGE_SIZE) {
- err = add_to_pagemap(addr, PM_NOT_PRESENT, pm);
+ err = add_to_pagemap(addr, (struct ppte) {0, 0, 0, 0}, pm);
if (err)
break;
}
@@ -574,7 +581,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
u64 swap_pte_to_pagemap_entry(pte_t pte)
{
swp_entry_t e = pte_to_swp_entry(pte);
- return PM_SWAP | swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
+ return swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
}

static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
@@ -584,16 +591,37 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
pte_t *pte;
int err = 0;

+#ifdef CONFIG_X86
+ if (pmd_huge(*pmd)) {
+ struct ppte ppte = {
+ .paddr = pmd_pfn(*pmd) << PAGE_SHIFT,
+ .psize = (HPAGE_SHIFT == 22 ?
+ PM_PSIZE_4M : PM_PSIZE_2M),
+ .swap = 0,
+ .present = 1,
+ };
+
+ for(; addr != end; addr += PAGE_SIZE) {
+ err = add_to_pagemap(addr, ppte, pm);
+ if (err)
+ return err;
+ }
+ } else
+#endif
for (; addr != end; addr += PAGE_SIZE) {
- u64 pfn = PM_NOT_PRESENT;
+ struct ppte ppte = { 0, 0, 0, 0};
+
pte = pte_offset_map(pmd, addr);
- if (is_swap_pte(*pte))
- pfn = swap_pte_to_pagemap_entry(*pte);
- else if (pte_present(*pte))
- pfn = pte_pfn(*pte);
+ if (is_swap_pte(*pte)) {
+ ppte.swap = 1;
+ ppte.paddr = swap_pte_to_pagemap_entry(*pte);
+ } else if (pte_present(*pte)) {
+ ppte.present = 1;
+ ppte.paddr = pte_pfn(*pte) << PAGE_SHIFT;
+ }
/* unmap so we're not in atomic when we copy to userspace */
pte_unmap(pte);
- err = add_to_pagemap(addr, pfn, pm);
+ err = add_to_pagemap(addr, ppte, pm);
if (err)
return err;
}
diff --git a/include/asm-x86/pgtable.h b/include/asm-x86/pgtable.h
index 174b877..76bc8a8 100644
--- a/include/asm-x86/pgtable.h
+++ b/include/asm-x86/pgtable.h
@@ -181,6 +181,8 @@ static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot)
pgprot_val(pgprot)) & __supported_pte_mask);
}

+#define pmd_pfn(x) ((pmd_val(x) & __PHYSICAL_MASK) >> PAGE_SHIFT)
+
static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
{
pteval_t val = pte_val(pte);
diff --git a/include/asm-x86/pgtable_64.h b/include/asm-x86/pgtable_64.h
index 02bd4aa..094a538 100644
--- a/include/asm-x86/pgtable_64.h
+++ b/include/asm-x86/pgtable_64.h
@@ -216,7 +216,6 @@ static inline int pud_large(pud_t pte)
#define pmd_none(x) (!pmd_val(x))
#define pmd_present(x) (pmd_val(x) & _PAGE_PRESENT)
#define pfn_pmd(nr,prot) (__pmd(((nr) << PAGE_SHIFT) | pgprot_val(prot)))
-#define pmd_pfn(x) ((pmd_val(x) & __PHYSICAL_MASK) >> PAGE_SHIFT)

#define pte_to_pgoff(pte) ((pte_val(pte) & PHYSICAL_PAGE_MASK) >> PAGE_SHIFT)
#define pgoff_to_pte(off) ((pte_t) { .pte = ((off) << PAGE_SHIFT) | _PAGE_FILE })
--
1.5.3.7

--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown


2008-02-23 02:18:52

by Matt Mackall

Subject: Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

(sorry for the delay, travelling)

On Wed, 2008-02-20 at 14:57 +0100, Hans Rosenfeld wrote:
> The current code for /proc/pid/pagemap does not work with huge pages (on
> x86). The code will make no difference between a normal pmd and a huge
> page pmd, trying to parse the contents of the huge page as ptes. Another
> problem is that there is no way to get information about the page size a
> specific mapping uses.
>
> Also, the current way the "not present" and "swap" bits are encoded in
> the returned pfn isn't very clean, especially not if this interface is
> going to be extended.

Fair.

> I propose to change /proc/pid/pagemap to return a pseudo-pte instead of
> just a raw pfn. The pseudo-pte will contain:
>
> - 58 bits for the physical address of the first byte in the page, even
> less bits would probably be sufficient for quite a while
>
> - 4 bits for the page size, with 0 meaning native page size (4k on x86,
> 8k on alpha, ...) and values 1-15 being specific to the architecture
> (I used 1 for 2M, 2 for 4M and 3 for 1G for x86)
>
> - a "swap" bit indicating that a not present page is paged out, with the
> physical address field containing page file number and block number
> just like before
>
> - a "present" bit just like in a real pte

This is ok-ish, but I can't say I like it much. Especially the page size
field.

But I don't really have many ideas here. Perhaps having a bit saying
"this entry is really a continuation of the previous one". Then any page
size can be trivially represented. This might also make the code on both
sides simpler?

> By shortening the field for the physical address, some more interesting
> information could be included, like read/write permissions and the like.
> The page size could also be returned directly, 6 bits could be used to
> express any page shift in a 64 bit system, but I found the encoded page
> size more useful for my specific use case.
>
>
> The attached patch changes the /proc/pid/pagemap code to use such a
> pseudo-pte. The huge page handling is currently limited to 2M/4M pages
> on x86, 1G pages will need some more work. To keep the simple mapping of
> virtual addresses to file index intact, any huge page pseudo-pte is
> replicated in the user buffer to map the equivalent range of small
> pages.
>
> Note that I had to move the pmd_pfn() macro from asm-x86/pgtable_64.h to
> asm-x86/pgtable.h, it applies to both 32 bit and 64 bit x86.
>
> Other architectures will probably need other changes to support huge
> pages and return the page size.
>
> I think that the definition of the pseudo-pte structure and the page
> size codes should be made available through a header file, but I didn't
> do this for now.
>
> Signed-Off-By: Hans Rosenfeld <[email protected]>
>
> ---
> fs/proc/task_mmu.c | 68 +++++++++++++++++++++++++++++------------
> include/asm-x86/pgtable.h | 2 +
> include/asm-x86/pgtable_64.h | 1 -
> 3 files changed, 50 insertions(+), 21 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 49958cf..58af588 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -527,16 +527,23 @@ struct pagemapread {
> char __user *out, *end;
> };
>
> -#define PM_ENTRY_BYTES sizeof(u64)
> -#define PM_RESERVED_BITS 3
> -#define PM_RESERVED_OFFSET (64 - PM_RESERVED_BITS)
> -#define PM_RESERVED_MASK (((1LL<<PM_RESERVED_BITS)-1) << PM_RESERVED_OFFSET)
> -#define PM_SPECIAL(nr) (((nr) << PM_RESERVED_OFFSET) | PM_RESERVED_MASK)
> -#define PM_NOT_PRESENT PM_SPECIAL(1LL)
> -#define PM_SWAP PM_SPECIAL(2LL)
> -#define PM_END_OF_BUFFER 1
> -
> -static int add_to_pagemap(unsigned long addr, u64 pfn,
> +struct ppte {
> + uint64_t paddr:58;
> + uint64_t psize:4;
> + uint64_t swap:1;
> + uint64_t present:1;
> +};
> +
> +#ifdef CONFIG_X86
> +#define PM_PSIZE_1G 3
> +#define PM_PSIZE_4M 2
> +#define PM_PSIZE_2M 1
> +#endif
> +
> +#define PM_ENTRY_BYTES sizeof(struct ppte)
> +#define PM_END_OF_BUFFER 1
> +
> +static int add_to_pagemap(unsigned long addr, struct ppte ppte,
> struct pagemapread *pm)
> {
> /*
> @@ -545,13 +552,13 @@ static int add_to_pagemap(unsigned long addr, u64 pfn,
> * the pfn.
> */
> if (pm->out + PM_ENTRY_BYTES >= pm->end) {
> - if (copy_to_user(pm->out, &pfn, pm->end - pm->out))
> + if (copy_to_user(pm->out, &ppte, pm->end - pm->out))
> return -EFAULT;
> pm->out = pm->end;
> return PM_END_OF_BUFFER;
> }
>
> - if (put_user(pfn, pm->out))
> + if (copy_to_user(pm->out, &ppte, sizeof(ppte)))
> return -EFAULT;
> pm->out += PM_ENTRY_BYTES;
> return 0;
> @@ -564,7 +571,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
> unsigned long addr;
> int err = 0;
> for (addr = start; addr < end; addr += PAGE_SIZE) {
> - err = add_to_pagemap(addr, PM_NOT_PRESENT, pm);
> + err = add_to_pagemap(addr, (struct ppte) {0, 0, 0, 0}, pm);
> if (err)
> break;
> }
> @@ -574,7 +581,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
> u64 swap_pte_to_pagemap_entry(pte_t pte)
> {
> swp_entry_t e = pte_to_swp_entry(pte);
> - return PM_SWAP | swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
> + return swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
> }
>
> static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> @@ -584,16 +591,37 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> pte_t *pte;
> int err = 0;
>
> +#ifdef CONFIG_X86
> + if (pmd_huge(*pmd)) {
> + struct ppte ppte = {
> + .paddr = pmd_pfn(*pmd) << PAGE_SHIFT,
> + .psize = (HPAGE_SHIFT == 22 ?
> + PM_PSIZE_4M : PM_PSIZE_2M),
> + .swap = 0,
> + .present = 1,
> + };
> +
> + for(; addr != end; addr += PAGE_SIZE) {
> + err = add_to_pagemap(addr, ppte, pm);
> + if (err)
> + return err;
> + }
> + } else
> +#endif
> for (; addr != end; addr += PAGE_SIZE) {
> - u64 pfn = PM_NOT_PRESENT;
> + struct ppte ppte = { 0, 0, 0, 0};
> +
> pte = pte_offset_map(pmd, addr);
> - if (is_swap_pte(*pte))
> - pfn = swap_pte_to_pagemap_entry(*pte);
> - else if (pte_present(*pte))
> - pfn = pte_pfn(*pte);
> + if (is_swap_pte(*pte)) {
> + ppte.swap = 1;
> + ppte.paddr = swap_pte_to_pagemap_entry(*pte);
> + } else if (pte_present(*pte)) {
> + ppte.present = 1;
> + ppte.paddr = pte_pfn(*pte) << PAGE_SHIFT;
> + }
> /* unmap so we're not in atomic when we copy to userspace */
> pte_unmap(pte);
> - err = add_to_pagemap(addr, pfn, pm);
> + err = add_to_pagemap(addr, ppte, pm);
> if (err)
> return err;
> }
> diff --git a/include/asm-x86/pgtable.h b/include/asm-x86/pgtable.h
> index 174b877..76bc8a8 100644
> --- a/include/asm-x86/pgtable.h
> +++ b/include/asm-x86/pgtable.h
> @@ -181,6 +181,8 @@ static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot)
> pgprot_val(pgprot)) & __supported_pte_mask);
> }
>
> +#define pmd_pfn(x) ((pmd_val(x) & __PHYSICAL_MASK) >> PAGE_SHIFT)
> +
> static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
> {
> pteval_t val = pte_val(pte);
> diff --git a/include/asm-x86/pgtable_64.h b/include/asm-x86/pgtable_64.h
> index 02bd4aa..094a538 100644
> --- a/include/asm-x86/pgtable_64.h
> +++ b/include/asm-x86/pgtable_64.h
> @@ -216,7 +216,6 @@ static inline int pud_large(pud_t pte)
> #define pmd_none(x) (!pmd_val(x))
> #define pmd_present(x) (pmd_val(x) & _PAGE_PRESENT)
> #define pfn_pmd(nr,prot) (__pmd(((nr) << PAGE_SHIFT) | pgprot_val(prot)))
> -#define pmd_pfn(x) ((pmd_val(x) & __PHYSICAL_MASK) >> PAGE_SHIFT)
>
> #define pte_to_pgoff(pte) ((pte_val(pte) & PHYSICAL_PAGE_MASK) >> PAGE_SHIFT)
> #define pgoff_to_pte(off) ((pte_t) { .pte = ((off) << PAGE_SHIFT) | _PAGE_FILE })
> --
> 1.5.3.7
>
--
Mathematics is the supreme nostalgia of our time.

2008-02-23 08:19:30

by Andrew Morton

Subject: Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

On Wed, 20 Feb 2008 14:57:43 +0100 "Hans Rosenfeld" <[email protected]> wrote:

> The current code for /proc/pid/pagemap does not work with huge pages (on
> x86). The code will make no difference between a normal pmd and a huge
> page pmd, trying to parse the contents of the huge page as ptes. Another
> problem is that there is no way to get information about the page size a
> specific mapping uses.
>
> Also, the current way the "not present" and "swap" bits are encoded in
> the returned pfn isn't very clean, especially not if this interface is
> going to be extended.
>
> I propose to change /proc/pid/pagemap to return a pseudo-pte instead of
> just a raw pfn. The pseudo-pte will contain:
>
> - 58 bits for the physical address of the first byte in the page, even
> less bits would probably be sufficient for quite a while
>
> - 4 bits for the page size, with 0 meaning native page size (4k on x86,
> 8k on alpha, ...) and values 1-15 being specific to the architecture
> (I used 1 for 2M, 2 for 4M and 3 for 1G for x86)
>
> - a "swap" bit indicating that a not present page is paged out, with the
> physical address field containing page file number and block number
> just like before
>
> - a "present" bit just like in a real pte
>
> By shortening the field for the physical address, some more interesting
> information could be included, like read/write permissions and the like.
> The page size could also be returned directly, 6 bits could be used to
> express any page shift in a 64 bit system, but I found the encoded page
> size more useful for my specific use case.
>
>
> The attached patch changes the /proc/pid/pagemap code to use such a
> pseudo-pte. The huge page handling is currently limited to 2M/4M pages
> on x86, 1G pages will need some more work. To keep the simple mapping of
> virtual addresses to file index intact, any huge page pseudo-pte is
> replicated in the user buffer to map the equivalent range of small
> pages.
>
> Note that I had to move the pmd_pfn() macro from asm-x86/pgtable_64.h to
> asm-x86/pgtable.h, it applies to both 32 bit and 64 bit x86.
>
> Other architectures will probably need other changes to support huge
> pages and return the page size.
>
> I think that the definition of the pseudo-pte structure and the page
> size codes should be made available through a header file, but I didn't
> do this for now.
>

If we're going to do this, we need to do it *fast*. Once 2.6.25 goes out
our hands are tied.

That means talking with the maintainers of other hugepage-capable
architectures.

> +struct ppte {
> + uint64_t paddr:58;
> + uint64_t psize:4;
> + uint64_t swap:1;
> + uint64_t present:1;
> +};

This is part of the exported kernel interface and hence should be in a
header somewhere, shouldn't it? The old stuff should have been too.

u64 is a bit more conventional than uint64_t, and if we move this to a
userspace-visible header then __u64 is the type to use, I think. Although
one would expect uint64_t to be OK as well.

> +#ifdef CONFIG_X86
> +#define PM_PSIZE_1G 3
> +#define PM_PSIZE_4M 2
> +#define PM_PSIZE_2M 1
> +#endif

No, we should factor this correctly and get the CONFIG_X86 stuff out of here.

> +#define PM_ENTRY_BYTES sizeof(struct ppte)
> +#define PM_END_OF_BUFFER 1
> +
> +static int add_to_pagemap(unsigned long addr, struct ppte ppte,
> struct pagemapread *pm)
> {
> /*
> @@ -545,13 +552,13 @@ static int add_to_pagemap(unsigned long addr, u64 pfn,
> * the pfn.
> */
> if (pm->out + PM_ENTRY_BYTES >= pm->end) {
> - if (copy_to_user(pm->out, &pfn, pm->end - pm->out))
> + if (copy_to_user(pm->out, &ppte, pm->end - pm->out))
> return -EFAULT;
> pm->out = pm->end;
> return PM_END_OF_BUFFER;
> }
>
> - if (put_user(pfn, pm->out))
> + if (copy_to_user(pm->out, &ppte, sizeof(ppte)))
> return -EFAULT;
> pm->out += PM_ENTRY_BYTES;
> return 0;
> @@ -564,7 +571,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
> unsigned long addr;
> int err = 0;
> for (addr = start; addr < end; addr += PAGE_SIZE) {
> - err = add_to_pagemap(addr, PM_NOT_PRESENT, pm);
> + err = add_to_pagemap(addr, (struct ppte) {0, 0, 0, 0}, pm);
> if (err)
> break;
> }
> @@ -574,7 +581,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
> u64 swap_pte_to_pagemap_entry(pte_t pte)
> {
> swp_entry_t e = pte_to_swp_entry(pte);
> - return PM_SWAP | swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
> + return swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
> }
>
> static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> @@ -584,16 +591,37 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> pte_t *pte;
> int err = 0;
>
> +#ifdef CONFIG_X86
> + if (pmd_huge(*pmd)) {
> + struct ppte ppte = {
> + .paddr = pmd_pfn(*pmd) << PAGE_SHIFT,
> + .psize = (HPAGE_SHIFT == 22 ?
> + PM_PSIZE_4M : PM_PSIZE_2M),
> + .swap = 0,
> + .present = 1,
> + };
> +
> + for(; addr != end; addr += PAGE_SIZE) {
> + err = add_to_pagemap(addr, ppte, pm);
> + if (err)
> + return err;
> + }
> + } else
> +#endif

You can remove the ifdefs here and then for each hugepage-capable
architecture add a

struct ppte pmd_to_ppte(pmd_t *pmd);

and call that. Something like that.

ppte isn't a very well-chosen name, especially if we export this to
userspace.

> for (; addr != end; addr += PAGE_SIZE) {
> - u64 pfn = PM_NOT_PRESENT;
> + struct ppte ppte = { 0, 0, 0, 0};
> +
> pte = pte_offset_map(pmd, addr);
> - if (is_swap_pte(*pte))
> - pfn = swap_pte_to_pagemap_entry(*pte);
> - else if (pte_present(*pte))
> - pfn = pte_pfn(*pte);
> + if (is_swap_pte(*pte)) {
> + ppte.swap = 1;
> + ppte.paddr = swap_pte_to_pagemap_entry(*pte);
> + } else if (pte_present(*pte)) {
> + ppte.present = 1;
> + ppte.paddr = pte_pfn(*pte) << PAGE_SHIFT;
> + }
> /* unmap so we're not in atomic when we copy to userspace */
> pte_unmap(pte);
> - err = add_to_pagemap(addr, pfn, pm);
> + err = add_to_pagemap(addr, ppte, pm);
> if (err)
> return err;
> }

Matt? Help?

2008-02-23 15:25:42

by Matt Mackall

Subject: Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size


On Sat, 2008-02-23 at 00:06 -0800, Andrew Morton wrote:
> On Wed, 20 Feb 2008 14:57:43 +0100 "Hans Rosenfeld" <[email protected]> wrote:
>
> > The current code for /proc/pid/pagemap does not work with huge pages (on
> > x86). The code will make no difference between a normal pmd and a huge
> > page pmd, trying to parse the contents of the huge page as ptes. Another
> > problem is that there is no way to get information about the page size a
> > specific mapping uses.
> >
> > Also, the current way the "not present" and "swap" bits are encoded in
> > the returned pfn isn't very clean, especially not if this interface is
> > going to be extended.
> >
> > I propose to change /proc/pid/pagemap to return a pseudo-pte instead of
> > just a raw pfn. The pseudo-pte will contain:
> >
> > - 58 bits for the physical address of the first byte in the page, even
> > less bits would probably be sufficient for quite a while
> >
> > - 4 bits for the page size, with 0 meaning native page size (4k on x86,
> > 8k on alpha, ...) and values 1-15 being specific to the architecture
> > (I used 1 for 2M, 2 for 4M and 3 for 1G for x86)
> >
> > - a "swap" bit indicating that a not present page is paged out, with the
> > physical address field containing page file number and block number
> > just like before
> >
> > - a "present" bit just like in a real pte
> >
> > By shortening the field for the physical address, some more interesting
> > information could be included, like read/write permissions and the like.
> > The page size could also be returned directly, 6 bits could be used to
> > express any page shift in a 64 bit system, but I found the encoded page
> > size more useful for my specific use case.
> >
> >
> > The attached patch changes the /proc/pid/pagemap code to use such a
> > pseudo-pte. The huge page handling is currently limited to 2M/4M pages
> > on x86, 1G pages will need some more work. To keep the simple mapping of
> > virtual addresses to file index intact, any huge page pseudo-pte is
> > replicated in the user buffer to map the equivalent range of small
> > pages.
> >
> > Note that I had to move the pmd_pfn() macro from asm-x86/pgtable_64.h to
> > asm-x86/pgtable.h, it applies to both 32 bit and 64 bit x86.
> >
> > Other architectures will probably need other changes to support huge
> > pages and return the page size.
> >
> > I think that the definition of the pseudo-pte structure and the page
> > size codes should be made available through a header file, but I didn't
> > do this for now.
> >
>
> If we're going to do this, we need to do it *fast*. Once 2.6.25 goes out
> our hands are tied.
>
> That means talking with the maintainers of other hugepage-capable
> architectures.
>
> > +struct ppte {
> > + uint64_t paddr:58;
> > + uint64_t psize:4;
> > + uint64_t swap:1;
> > + uint64_t present:1;
> > +};
>
> This is part of the exported kernel interface and hence should be in a
> header somewhere, shouldn't it? The old stuff should have been too.

I think we're better off not using bitfields here.

> u64 is a bit more conventional than uint64_t, and if we move this to a
> userspace-visible header then __u64 is the type to use, I think. Although
> one would expect uint64_t to be OK as well.
>
> > +#ifdef CONFIG_X86
> > +#define PM_PSIZE_1G 3
> > +#define PM_PSIZE_4M 2
> > +#define PM_PSIZE_2M 1
> > +#endif
>
> No, we should factor this correctly and get the CONFIG_X86 stuff out of here.

Perhaps my "continuation bit" idea.

> Matt? Help?

Did my previous message make it out? This is probably my last message
for 24+ hours.

--
Mathematics is the supreme nostalgia of our time.

2008-02-23 18:31:56

by Dave Hansen

Subject: Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

On Sat, 2008-02-23 at 10:18 +0800, Matt Mackall wrote:
> Another
> > problem is that there is no way to get information about the page size a
> > specific mapping uses.

Is this true generically, or just with pagemap? It seems like we should
have a way to tell that a particular mapping is of large pages. I'm
cc'ing a few folks who might know.

> > Also, the current way the "not present" and "swap" bits are encoded in
> > the returned pfn isn't very clean, especially not if this interface is
> > going to be extended.
>
> Fair.

Yup.

> > I propose to change /proc/pid/pagemap to return a pseudo-pte instead of
> > just a raw pfn. The pseudo-pte will contain:
> >
> > - 58 bits for the physical address of the first byte in the page, even
> > less bits would probably be sufficient for quite a while

Well, whether we use a physical address of the first byte of the page or
a pfn doesn't really matter. It just boils down to whether we use low
or high bits for the magic. :)

> > - 4 bits for the page size, with 0 meaning native page size (4k on x86,
> > 8k on alpha, ...) and values 1-15 being specific to the architecture
> > (I used 1 for 2M, 2 for 4M and 3 for 1G for x86)

"Native page size" probably a bad idea. ppc64 can use 64k or 4k for its
"native" page size and has 16MB large pages (as well as some others).
To make it even more confusing, you can have a 64k kernel page size with
4k mmu mappings!

That said, this is a decent idea as long as we know that nobody will
ever have more than 16 page sizes.

> > - a "swap" bit indicating that a not present page is paged out, with the
> > physical address field containing page file number and block number
> > just like before
> >
> > - a "present" bit just like in a real pte
>
> This is ok-ish, but I can't say I like it much. Especially the page size
> field.
>
> But I don't really have many ideas here. Perhaps having a bit saying
> "this entry is really a continuation of the previous one". Then any page
> size can be trivially represented. This might also make the code on both
> sides simpler?

Yeah, it could just be a special flag plus a mask or offset showing how
many entries to back up to find the actual mapping. If each huge page
entry just had something along the lines of:

PAGEMAP_HUGE_PAGE_BIT | HPAGE_MASK

You can see its a huge mapping from the bit, and you can go find the
physical page by applying HPAGE_MASK to your current position in the
pagemap.

> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 49958cf..58af588 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -527,16 +527,23 @@ struct pagemapread {
> > char __user *out, *end;
> > };
> >
> > -#define PM_ENTRY_BYTES sizeof(u64)
> > -#define PM_RESERVED_BITS 3
> > -#define PM_RESERVED_OFFSET (64 - PM_RESERVED_BITS)
> > -#define PM_RESERVED_MASK (((1LL<<PM_RESERVED_BITS)-1) << PM_RESERVED_OFFSET)
> > -#define PM_SPECIAL(nr) (((nr) << PM_RESERVED_OFFSET) | PM_RESERVED_MASK)
> > -#define PM_NOT_PRESENT PM_SPECIAL(1LL)
> > -#define PM_SWAP PM_SPECIAL(2LL)
> > -#define PM_END_OF_BUFFER 1
> > -
> > -static int add_to_pagemap(unsigned long addr, u64 pfn,
> > +struct ppte {
> > + uint64_t paddr:58;
> > + uint64_t psize:4;
> > + uint64_t swap:1;
> > + uint64_t present:1;
> > +};

It'd be nice to keep the current convention, which is to stay away from
bitfields.

> > +#ifdef CONFIG_X86
> > +#define PM_PSIZE_1G 3
> > +#define PM_PSIZE_4M 2
> > +#define PM_PSIZE_2M 1
> > +#endif

I do think this may get goofy in the future, especially for those
architectures which don't have page sizes tied to Linux pagetables.
Tomorrow, you might end up with:

> > +#ifdef CONFIG_FUNNYARCH
> > +#define PM_PSIZE_64M 4
> > +#define PM_PSIZE_1G 3
> > +#define PM_PSIZE_4M 2
> > +#define PM_PSIZE_2M 1
> > +#endif

> > +#define PM_ENTRY_BYTES sizeof(struct ppte)
> > +#define PM_END_OF_BUFFER 1
> > +
> > +static int add_to_pagemap(unsigned long addr, struct ppte ppte,
> > struct pagemapread *pm)
> > {
> > /*
> > @@ -545,13 +552,13 @@ static int add_to_pagemap(unsigned long addr, u64 pfn,
> > * the pfn.
> > */
> > if (pm->out + PM_ENTRY_BYTES >= pm->end) {
> > - if (copy_to_user(pm->out, &pfn, pm->end - pm->out))
> > + if (copy_to_user(pm->out, &ppte, pm->end - pm->out))
> > return -EFAULT;
> > pm->out = pm->end;
> > return PM_END_OF_BUFFER;
> > }
> >
> > - if (put_user(pfn, pm->out))
> > + if (copy_to_user(pm->out, &ppte, sizeof(ppte)))
> > return -EFAULT;
> > pm->out += PM_ENTRY_BYTES;
> > return 0;
> > @@ -564,7 +571,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
> > unsigned long addr;
> > int err = 0;
> > for (addr = start; addr < end; addr += PAGE_SIZE) {
> > - err = add_to_pagemap(addr, PM_NOT_PRESENT, pm);
> > + err = add_to_pagemap(addr, (struct ppte) {0, 0, 0, 0}, pm);
> > if (err)
> > break;
> > }
> > @@ -574,7 +581,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
> > u64 swap_pte_to_pagemap_entry(pte_t pte)
> > {
> > swp_entry_t e = pte_to_swp_entry(pte);
> > - return PM_SWAP | swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
> > + return swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
> > }
> >
> > static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> > @@ -584,16 +591,37 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> > pte_t *pte;
> > int err = 0;
> >
> > +#ifdef CONFIG_X86
> > + if (pmd_huge(*pmd)) {
> > + struct ppte ppte = {
> > + .paddr = pmd_pfn(*pmd) << PAGE_SHIFT,
> > + .psize = (HPAGE_SHIFT == 22 ?
> > + PM_PSIZE_4M : PM_PSIZE_2M),
> > + .swap = 0,
> > + .present = 1,
> > + };
> > +
> > + for(; addr != end; addr += PAGE_SIZE) {
> > + err = add_to_pagemap(addr, ppte, pm);
> > + if (err)
> > + return err;
> > + }
> > + } else
> > +#endif

It's great to make this support huge pages, but things like this
probably need their own function. Putting an #ifdef in the middle of
here makes it a lot harder to read. Just think of when powerpc, ia64
and x86_64 get their grubby mitts in here. ;)

-- Dave

2008-02-25 12:13:05

by Hans Rosenfeld

Subject: Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

On Sat, Feb 23, 2008 at 10:31:01AM -0800, Dave Hansen wrote:
> > > - 4 bits for the page size, with 0 meaning native page size (4k on x86,
> > > 8k on alpha, ...) and values 1-15 being specific to the architecture
> > > (I used 1 for 2M, 2 for 4M and 3 for 1G for x86)
>
> "Native page size" probably a bad idea. ppc64 can use 64k or 4k for its
> "native" page size and has 16MB large pages (as well as some others).
> To make it even more confusing, you can have a 64k kernel page size with
> 4k mmu mappings!
>
> That said, this is a decent idea as long as we know that nobody will
> ever have more than 16 page sizes.

A better way to encode the page size, then, would be to return the page
shift. This would need 6 bits instead of 4, but that would be enough for
any 64-bit architecture.


> > This is ok-ish, but I can't say I like it much. Especially the page size
> > field.
> >
> > But I don't really have many ideas here. Perhaps having a bit saying
> > "this entry is really a continuation of the previous one". Then any page
> > size can be trivially represented. This might also make the code on both
> > sides simpler?

I don't like the idea of parsing thousands of entries just to find out
that I'm using a huge page. It would be much better to just get the page
size one way or the other in the first entry one reads.


> > > -static int add_to_pagemap(unsigned long addr, u64 pfn,
> > > +struct ppte {
> > > + uint64_t paddr:58;
> > > + uint64_t psize:4;
> > > + uint64_t swap:1;
> > > + uint64_t present:1;
> > > +};
>
> It'd be nice to keep the current convention, which is to stay away from
> bitfields.

I like them; they make the code much more readable.


> > > +#ifdef CONFIG_X86
> > > + if (pmd_huge(*pmd)) {
> > > + struct ppte ppte = {
> > > + .paddr = pmd_pfn(*pmd) << PAGE_SHIFT,
> > > + .psize = (HPAGE_SHIFT == 22 ?
> > > + PM_PSIZE_4M : PM_PSIZE_2M),
> > > + .swap = 0,
> > > + .present = 1,
> > > + };
> > > +
> > > + for(; addr != end; addr += PAGE_SIZE) {
> > > + err = add_to_pagemap(addr, ppte, pm);
> > > + if (err)
> > > + return err;
> > > + }
> > > + } else
> > > +#endif
>
> It's great to make this support huge pages, but things like this
> probably need their own function. Putting an #ifdef in the middle of
> here makes it a lot harder to read. Just think of when powerpc, ia64
> and x86_64 get their grubby mitts in here. ;)

AFAIK the way huge pages are used on x86 differs greatly from other
architectures. While on x86 the address translation stops at some upper
table for a huge page, other architectures encode the page size in the
pte (at least the ones I looked at did). So pmd_huge() (and soon
pud_huge()) are very x86-specific and just return 0 on other archs, and
this code would be optimized away for them. All that would be necessary
for other archs is to eventually get the page size from the pte and put
it in the psize field.

The #ifdef could go away if pmd_pfn() was defined as 0 for !x86, it
wouldn't make sense to use it anyway.



--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

2008-02-25 18:39:26

by Dave Hansen

Subject: Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size


On Mon, 2008-02-25 at 13:09 +0100, Hans Rosenfeld wrote:
> On Sat, Feb 23, 2008 at 10:31:01AM -0800, Dave Hansen wrote:
> > > > - 4 bits for the page size, with 0 meaning native page size (4k on x86,
> > > > 8k on alpha, ...) and values 1-15 being specific to the architecture
> > > > (I used 1 for 2M, 2 for 4M and 3 for 1G for x86)
> >
> > "Native page size" probably a bad idea. ppc64 can use 64k or 4k for its
> > "native" page size and has 16MB large pages (as well as some others).
> > To make it even more confusing, you can have a 64k kernel page size with
> > 4k mmu mappings!
> >
> > That said, this is a decent idea as long as we know that nobody will
> > ever have more than 16 page sizes.
>
> Then a better way to encode the page size would be returning the page
> shift. This would need 6 bits instead of 4, but it would probably be
> enough for any 64 bit architecture.

That's a good point.

> > > This is ok-ish, but I can't say I like it much. Especially the page size
> > > field.
> > >
> > > But I don't really have many ideas here. Perhaps having a bit saying
> > > "this entry is really a continuation of the previous one". Then any page
> > > size can be trivially represented. This might also make the code on both
> > > sides simpler?
>
> I don't like the idea of parsing thousands of entries just to find out
> that I'm using a huge page. It would be much better to just get the page
> size one way or the other in the first entry one reads.

Did you read my suggestion? We use one bit in the pte to specify that
it's a large page mapping, then specify a mask to apply to the address to
get the *first* mapping of the large page, where you'll find the actual
physical address. That keeps us from having to worry about specifying
*both* the page size and the pfn in the same pte.

> > > > +#ifdef CONFIG_X86
> > > > + if (pmd_huge(*pmd)) {
> > > > + struct ppte ppte = {
> > > > + .paddr = pmd_pfn(*pmd) << PAGE_SHIFT,
> > > > + .psize = (HPAGE_SHIFT == 22 ?
> > > > + PM_PSIZE_4M : PM_PSIZE_2M),
> > > > + .swap = 0,
> > > > + .present = 1,
> > > > + };
> > > > +
> > > > + for(; addr != end; addr += PAGE_SIZE) {
> > > > + err = add_to_pagemap(addr, ppte, pm);
> > > > + if (err)
> > > > + return err;
> > > > + }
> > > > + } else
> > > > +#endif
> >
> > It's great to make this support huge pages, but things like this
> > probably need their own function. Putting an #ifdef in the middle of
> > here makes it a lot harder to read. Just think of when powerpc, ia64
> > and x86_64 get their grubby mitts in here. ;)
>
> AFAIK the way huge pages are used on x86 differs much from other
> architectures. While on x86 the address translation stops at some upper
> table for a huge page, other architectures encode the page size in the
> pte (at least the ones I looked at did). So pmd_huge() (and soon
> pud_huge()) are very x86-specific and return just 0 on other archs,

I'm just asking that you please put this in a nice helper function to
hide it from the poor powerpc/ia64/mips... guys that don't want to see
x86 code cluttering up otherwise generic functions. Yes, the compiler
optimizes it away, but it still has a cost to my eyeballs. :)

I eagerly await your next patch!

-- Dave

2008-02-26 20:34:25

by Hans Rosenfeld

Subject: Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

On Mon, Feb 25, 2008 at 10:39:10AM -0800, Dave Hansen wrote:
> Did you read my suggestion? We use one bit in the pte to specify that
> it's a large page mapping, then specify a mask to apply to the address to
> get the *first* mapping of the large page, where you'll find the actual
> physical address. That keeps us from having to worry about specifying
> *both* the page size and the pfn in the same pte.

I did read it, but I don't see the point. I think we have enough bits in
that pseudo pte to include both the page size and the address/pfn.

> I'm just asking that you please put this in a nice helper function to
> hide it from the poor powerpc/ia64/mips... guys that don't want to see
> x86 code cluttering up otherwise generic functions. Yes, the compiler
> optimizes it away, but it still has a cost to my eyeballs. :)
>
> I eagerly await your next patch!

Ok, here it comes. What I changed is:
- added x86-specific inline function pmd_to_ppte() to asm-x86/pgtable.h
and a dummy definition to asm-generic/pgtable.h
- added a function add_huge_to_pagemap to fs/proc/task_mmu.c that is
completely independent of the huge page size
- use pshift instead of psize in pseudo pte, had to shorten paddr field
to 56 bits which is probably still enough for quite some time
- renamed the struct ppte to struct pagemap_ppte, making it clearer what
it is actually used for. Since I wasn't sure where to put it, I put
the definition in a new header linux/pagemap_ppte.h. I guess that it
will be put somewhere else; linux/pagemap.h or asm-generic/pgtable.h
are what I had in mind, but using one of those created more problems
than I wanted to deal with.



Signed-off-by: Hans Rosenfeld <[email protected]>

---
fs/proc/task_mmu.c | 54 +++++++++++++++++++++++++---------------
include/asm-generic/pgtable.h | 6 ++++
include/asm-x86/pgtable.h | 16 ++++++++++++
include/asm-x86/pgtable_64.h | 1 -
include/linux/pagemap_ppte.h | 17 +++++++++++++
5 files changed, 73 insertions(+), 21 deletions(-)
create mode 100644 include/linux/pagemap_ppte.h

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 49958cf..b3f07f6 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -527,16 +527,11 @@ struct pagemapread {
char __user *out, *end;
};

-#define PM_ENTRY_BYTES sizeof(u64)
-#define PM_RESERVED_BITS 3
-#define PM_RESERVED_OFFSET (64 - PM_RESERVED_BITS)
-#define PM_RESERVED_MASK (((1LL<<PM_RESERVED_BITS)-1) << PM_RESERVED_OFFSET)
-#define PM_SPECIAL(nr) (((nr) << PM_RESERVED_OFFSET) | PM_RESERVED_MASK)
-#define PM_NOT_PRESENT PM_SPECIAL(1LL)
-#define PM_SWAP PM_SPECIAL(2LL)
-#define PM_END_OF_BUFFER 1
-
-static int add_to_pagemap(unsigned long addr, u64 pfn,
+#define PM_NOT_PRESENT ((struct pagemap_ppte) {0, 0, 0, 0})
+#define PM_ENTRY_BYTES sizeof(struct pagemap_ppte)
+#define PM_END_OF_BUFFER 1
+
+static int add_to_pagemap(unsigned long addr, struct pagemap_ppte ppte,
struct pagemapread *pm)
{
/*
@@ -545,18 +540,30 @@ static int add_to_pagemap(unsigned long addr, u64 pfn,
* the pfn.
*/
if (pm->out + PM_ENTRY_BYTES >= pm->end) {
- if (copy_to_user(pm->out, &pfn, pm->end - pm->out))
+ if (copy_to_user(pm->out, &ppte, pm->end - pm->out))
return -EFAULT;
pm->out = pm->end;
return PM_END_OF_BUFFER;
}

- if (put_user(pfn, pm->out))
+ if (copy_to_user(pm->out, &ppte, sizeof(ppte)))
return -EFAULT;
pm->out += PM_ENTRY_BYTES;
return 0;
}

+static int add_huge_to_pagemap(unsigned long addr, unsigned long end,
+ struct pagemap_ppte ppte, struct pagemapread *pm)
+{
+ int err;
+
+ for (; addr != end; addr += PAGE_SIZE) {
+ err = add_to_pagemap(addr, ppte, pm);
+ if (err)
+ return err;
+ }
+ return 0;
+}
+
static int pagemap_pte_hole(unsigned long start, unsigned long end,
void *private)
{
@@ -574,7 +581,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
u64 swap_pte_to_pagemap_entry(pte_t pte)
{
swp_entry_t e = pte_to_swp_entry(pte);
- return PM_SWAP | swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
+ return swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
}

static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
@@ -584,16 +591,23 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
pte_t *pte;
int err = 0;

- for (; addr != end; addr += PAGE_SIZE) {
- u64 pfn = PM_NOT_PRESENT;
+ if (pmd_huge(*pmd))
+ err = add_huge_to_pagemap(addr, end, pmd_to_ppte(pmd), pm);
+ else for (; addr != end; addr += PAGE_SIZE) {
+ struct pagemap_ppte ppte = { 0, 0, 0, 0};
+
pte = pte_offset_map(pmd, addr);
- if (is_swap_pte(*pte))
- pfn = swap_pte_to_pagemap_entry(*pte);
- else if (pte_present(*pte))
- pfn = pte_pfn(*pte);
+ if (is_swap_pte(*pte)) {
+ ppte.swap = 1;
+ ppte.paddr = swap_pte_to_pagemap_entry(*pte);
+ } else if (pte_present(*pte)) {
+ ppte.present = 1;
+ ppte.pshift = PAGE_SHIFT;
+ ppte.paddr = pte_pfn(*pte) << PAGE_SHIFT;
+ }
/* unmap so we're not in atomic when we copy to userspace */
pte_unmap(pte);
- err = add_to_pagemap(addr, pfn, pm);
+ err = add_to_pagemap(addr, ppte, pm);
if (err)
return err;
}
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 44ef329..d7df89d 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -195,6 +195,12 @@ static inline int pmd_none_or_clear_bad(pmd_t *pmd)
}
return 0;
}
+
+/* dummy for !x86 */
+#ifndef pmd_to_ppte
+#define pmd_to_ppte(x) ((struct pagemap_ppte) {0, 0, 0, 0})
+#endif
+
#endif /* CONFIG_MMU */

/*
diff --git a/include/asm-x86/pgtable.h b/include/asm-x86/pgtable.h
index 174b877..7a446e0 100644
--- a/include/asm-x86/pgtable.h
+++ b/include/asm-x86/pgtable.h
@@ -181,6 +181,8 @@ static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot)
pgprot_val(pgprot)) & __supported_pte_mask);
}

+#define pmd_pfn(x) ((pmd_val(x) & __PHYSICAL_MASK) >> PAGE_SHIFT)
+
static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
{
pteval_t val = pte_val(pte);
@@ -242,6 +244,20 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)

#ifndef __ASSEMBLY__

+#include <linux/pagemap_ppte.h>
+
+static inline struct pagemap_ppte pmd_to_ppte(pmd_t *pmd)
+{
+ struct pagemap_ppte ppte = {
+ .paddr = pmd_pfn(*pmd) << PAGE_SHIFT,
+ .pshift = HPAGE_SHIFT,
+ .swap = 0,
+ .present = 1,
+ };
+
+ return ppte;
+}
+
enum {
PG_LEVEL_NONE,
PG_LEVEL_4K,
diff --git a/include/asm-x86/pgtable_64.h b/include/asm-x86/pgtable_64.h
index 02bd4aa..094a538 100644
--- a/include/asm-x86/pgtable_64.h
+++ b/include/asm-x86/pgtable_64.h
@@ -216,7 +216,6 @@ static inline int pud_large(pud_t pte)
#define pmd_none(x) (!pmd_val(x))
#define pmd_present(x) (pmd_val(x) & _PAGE_PRESENT)
#define pfn_pmd(nr,prot) (__pmd(((nr) << PAGE_SHIFT) | pgprot_val(prot)))
-#define pmd_pfn(x) ((pmd_val(x) & __PHYSICAL_MASK) >> PAGE_SHIFT)

#define pte_to_pgoff(pte) ((pte_val(pte) & PHYSICAL_PAGE_MASK) >> PAGE_SHIFT)
#define pgoff_to_pte(off) ((pte_t) { .pte = ((off) << PAGE_SHIFT) | _PAGE_FILE })
diff --git a/include/linux/pagemap_ppte.h b/include/linux/pagemap_ppte.h
new file mode 100644
index 0000000..4d1256f
--- /dev/null
+++ b/include/linux/pagemap_ppte.h
@@ -0,0 +1,17 @@
+#ifndef _LINUX_PAGEMAP_PPTE_H
+#define _LINUX_PAGEMAP_PPTE_H
+
+#include <linux/types.h>
+
+/*
+ * structure used by /proc/pid/pagemap to describe mappings
+ */
+
+struct pagemap_ppte {
+ __u64 paddr:56;
+ __u64 pshift:6;
+ __u64 swap:1;
+ __u64 present:1;
+};
+
+#endif /* _LINUX_PAGEMAP_PPTE_H */
--
1.5.3.7

2008-02-27 17:44:23

by Dave Hansen

Subject: Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

On Tue, 2008-02-26 at 21:25 +0100, Hans Rosenfeld wrote:
> On Mon, Feb 25, 2008 at 10:39:10AM -0800, Dave Hansen wrote:
> > Did you read my suggestion? We use one bit in the pte to specify that
> > it's a large page mapping, then specify a mask to apply to the address to
> > get the *first* mapping of the large page, where you'll find the actual
> > physical address. That keeps us from having to worry about specifying
> > *both* the page size and the pfn in the same pte.
>
> I did read it, but I don't see the point. I think we have enough bits in
> that pseudo pte to include both the page size and the address/pfn.

I'm just worried that once we establish the format, we can't really
change it. We have enough room in the pseudo-pte now, but what happens
when the next group of people pop up that want something else from this
interface. Right now we have normal memory, swap, and hugetlb pages.

What if people want migration ptes marked next? I'm not sure those fit
into what you have here.

It all fits today, I'm just worried about tomorrow. :(

> +static int add_huge_to_pagemap(unsigned long addr, unsigned long end,
> + struct pagemap_ppte ppte, struct pagemapread *pm)
> +{
> + int err;
> +
> + for (; addr != end; addr += PAGE_SIZE) {
> + err = add_to_pagemap(addr, ppte, pm);
> + if (err)
> + return err;
> + }
> +}
> +
> static int pagemap_pte_hole(unsigned long start, unsigned long end,
> void *private)
> {
> @@ -574,7 +581,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
> u64 swap_pte_to_pagemap_entry(pte_t pte)
> {
> swp_entry_t e = pte_to_swp_entry(pte);
> - return PM_SWAP | swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
> + return swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
> }

Is there any way to do unions of bitfields? It seems a bit silly that
we have this bitfield, and then subdivide the bitfield for the swap
entries.

> static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> @@ -584,16 +591,23 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> pte_t *pte;
> int err = 0;
>
> - for (; addr != end; addr += PAGE_SIZE) {
> - u64 pfn = PM_NOT_PRESENT;
> + if (pmd_huge(*pmd))
> + add_huge_to_pagemap(addr, end, pmd_to_ppte(pmd), pm);

Could you make this work with other architectures' large pages as well?
I'd hate to leave ia64, MIPS and powerpc out in the cold. powerpc at
least has large pmds, it just doesn't really expose them to generic
code.

> + else for (; addr != end; addr += PAGE_SIZE) {
> + struct pagemap_ppte ppte = { 0, 0, 0, 0};

Didn't you define a macro for this above? Can you re-use it?

> pte = pte_offset_map(pmd, addr);
> - if (is_swap_pte(*pte))
> - pfn = swap_pte_to_pagemap_entry(*pte);
> - else if (pte_present(*pte))
> - pfn = pte_pfn(*pte);
> + if (is_swap_pte(*pte)) {
> + ppte.swap = 1;
> + ppte.paddr = swap_pte_to_pagemap_entry(*pte);
> + } else if (pte_present(*pte)) {
> + ppte.present = 1;
> + ppte.pshift = PAGE_SHIFT;
> + ppte.paddr = pte_pfn(*pte) << PAGE_SHIFT;
> + }

Why do we bother wasting space in paddr by shifting up the physical
address? We know the bottom PAGE_SHIFT bits are empty, so doesn't this
just waste them?

The bitfields are nice, and I do see they've spread to generic code.
So, I won't object to them, but please do double-check that they don't
cause any problems, especially with compilers that you might not be
using.

> /* unmap so we're not in atomic when we copy to userspace */
> pte_unmap(pte);
> - err = add_to_pagemap(addr, pfn, pm);
> + err = add_to_pagemap(addr, ppte, pm);
> if (err)
> return err;
> }
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 44ef329..d7df89d 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -195,6 +195,12 @@ static inline int pmd_none_or_clear_bad(pmd_t *pmd)
> }
> return 0;
> }
> +
> +/* dummy for !x86 */
> +#ifndef pmd_to_ppte
> +#define pmd_to_ppte(x) ((struct pagemap_ppte) {0, 0, 0, 0})
> +#endif

I'm really not a fan of the #ifndef style for these headers. I think it
makes it really hard to figure out where definitions come from.

I do think it would be best to keep all the ppte stuff isolated to the
pagemap files as much as humanly possible. There's not much of a reason
to pollute these generic (and already full) headers with our /proc
crap. :)

> #endif /* CONFIG_MMU */
>
> /*
> diff --git a/include/asm-x86/pgtable.h b/include/asm-x86/pgtable.h
> index 174b877..7a446e0 100644
> --- a/include/asm-x86/pgtable.h
> +++ b/include/asm-x86/pgtable.h
> @@ -181,6 +181,8 @@ static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot)
> pgprot_val(pgprot)) & __supported_pte_mask);
> }
>
> +#define pmd_pfn(x) ((pmd_val(x) & __PHYSICAL_MASK) >> PAGE_SHIFT)
> +
> static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
> {
> pteval_t val = pte_val(pte);
> @@ -242,6 +244,20 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
>
> #ifndef __ASSEMBLY__
>
> +#include <linux/pagemap_ppte.h>
> +
> +static inline struct pagemap_ppte pmd_to_ppte(pmd_t *pmd)
> +{
> + struct pagemap_ppte ppte = {
> + .paddr = pmd_pfn(*pmd) << PAGE_SHIFT,
> + .pshift = HPAGE_SHIFT,
> + .swap = 0,
> + .present = 1,
> + };
> +
> + return ppte;
> +}

Could you investigate this a bit on the other architectures and perhaps
code up something that will at least compile on the others and not just
punt? I just want to make sure that this approach can be extended to
them easily and we don't have to rewrite it. :)

My only other concern is that we're still wobbling on this interface and
it's about to get mainline-released. Should we turn it off in mainline
so that we don't establish an ABI that we know we're going to change
shortly anyway?

-- Dave

2008-02-28 12:02:53

by Hans Rosenfeld

Subject: Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

On Wed, Feb 27, 2008 at 09:44:04AM -0800, Dave Hansen wrote:
> On Tue, 2008-02-26 at 21:25 +0100, Hans Rosenfeld wrote:
> I'm just worried that once we establish the format, we can't really
> change it. We have enough room in the pseudo-pte now, but what happens
> when the next group of people pop up that want something else from this
> interface. Right now we have normal memory, swap, and hugetlb pages.
>
> What if people want migration ptes marked next? I'm not sure those fit
> into what you have here.
>
> It all fits today, I'm just worried about tomorrow. :(

We could change the interface to return just a pfn (which is aligned to
the pshift returned), as it was before. That would free up some bits
that we could reserve for future use.

> > @@ -574,7 +581,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
> > u64 swap_pte_to_pagemap_entry(pte_t pte)
> > {
> > swp_entry_t e = pte_to_swp_entry(pte);
> > - return PM_SWAP | swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
> > + return swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
> > }
>
> Is there any way to do unions of bitfields? It seems a bit silly that
> we have this bitfield, and then subdivide the bitfield for the swap
> entries.

Having a union of bitfields is allowed, but having a union in a
struct of bitfields or vice-versa will probably cause the compiler not
to put all of this together in a single 64 bit entity.

This whole swap thing still needs some thought. The swap file offset
can take 59 bits, so there is a possibility that this will break once
someone uses a really huge swap file. I doubt that this will happen, but
that doesn't mean it can't happen. Maybe there should be some completely
different interface for the swap stuff, like /proc/pid/swapmap or
something like that.

> > static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> > @@ -584,16 +591,23 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> > pte_t *pte;
> > int err = 0;
> >
> > - for (; addr != end; addr += PAGE_SIZE) {
> > - u64 pfn = PM_NOT_PRESENT;
> > + if (pmd_huge(*pmd))
> > + add_huge_to_pagemap(addr, end, pmd_to_ppte(pmd), pm);
>
> Could you make this work with other architectures' large pages as well?
> I'd hate to leave ia64, MIPS and powerpc out in the cold. powerpc at
> least has large pmds, it just doesn't really expose them to generic
> code.

Well, if some powerpc guy would implement pmd_huge() and pmd_pfn() for
powerpc, the x86-specific pmd_to_ppte() wouldn't be that x86-specific any
more. I didn't know there were huge pmds on powerpc, as pmd_huge() is
defined as zero for everything but x86.

Does it have huge puds as well? Once we support 1G pages for x86 a new
function has to be added to this file to handle that special case, too.

> > + else for (; addr != end; addr += PAGE_SIZE) {
> > + struct pagemap_ppte ppte = { 0, 0, 0, 0};
>
> Didn't you define a macro for this above? Can you re-use it?

Good point.

> > pte = pte_offset_map(pmd, addr);
> > - if (is_swap_pte(*pte))
> > - pfn = swap_pte_to_pagemap_entry(*pte);
> > - else if (pte_present(*pte))
> > - pfn = pte_pfn(*pte);
> > + if (is_swap_pte(*pte)) {
> > + ppte.swap = 1;
> > + ppte.paddr = swap_pte_to_pagemap_entry(*pte);
> > + } else if (pte_present(*pte)) {
> > + ppte.present = 1;
> > + ppte.pshift = PAGE_SHIFT;
> > + ppte.paddr = pte_pfn(*pte) << PAGE_SHIFT;
> > + }

This is the place where those architectures that define the page size in
the pte should test for a huge page and put the correct page size in the
pshift field. I looked at some of them and did not find a function or a
macro to do this test, no generic one and no arch-dependent one.

> Why do we bother wasting space in paddr by shifting up the physical
> address? We know the bottom PAGE_SHIFT bits are empty, so doesn't this
> just waste them?

As I said above, we could just use a raw pfn as the interface did
before.

> The bitfields are nice, and I do see they've spread to generic code.
> So, I won't object to them, but please do double-check that they don't
> cause any problems, especially with compilers that you might not be
> using.

The standard says the ordering of bitfields is "implementation defined".
I'm currently unsure whether this means the implementation of a machine
or of the compiler. In the latter case, using a different compiler for
a user space program than the one that was used to compile the kernel
could create problems.

> > diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> > index 44ef329..d7df89d 100644
> > --- a/include/asm-generic/pgtable.h
> > +++ b/include/asm-generic/pgtable.h
> > @@ -195,6 +195,12 @@ static inline int pmd_none_or_clear_bad(pmd_t *pmd)
> > }
> > return 0;
> > }
> > +
> > +/* dummy for !x86 */
> > +#ifndef pmd_to_ppte
> > +#define pmd_to_ppte(x) ((struct pagemap_ppte) {0, 0, 0, 0})
> > +#endif
>
> I'm really not a fan of the #ifndef style for these headers. I think it
> makes it really hard to figure out where definitions come from.
>
> I do think it would be best to keep all the ppte stuff isolated to the
> pagemap files as much as humanly possible. There's not much of a reason
> to pollute these generic (and already full) headers with our /proc
> crap. :)

If you have an idea where to put it, speak up. I thought asm/pgtable.h
would be the right place, but maybe we should put all this in
linux/pagemap_ppte.h (or whatever it will eventually be called, or
wherever this stuff will eventually end up) and put a nasty #ifdef
CONFIG_X86 around it.

> > +#include <linux/pagemap_ppte.h>
> > +
> > +static inline struct pagemap_ppte pmd_to_ppte(pmd_t *pmd)
> > +{
> > + struct pagemap_ppte ppte = {
> > + .paddr = pmd_pfn(*pmd) << PAGE_SHIFT,
> > + .pshift = HPAGE_SHIFT,
> > + .swap = 0,
> > + .present = 1,
> > + };
> > +
> > + return ppte;
> > +}
>
> Could you investigate this a bit on the other architectures and perhaps
> code up something that will at least compile on the others and not just
> punt? I just want to make sure that this approach can be extended to
> them easily and we don't have to rewrite it. :)

AFAIK this pmd stuff is a special case for x86, at least for huge pages
that are used by user programs. Other architectures encode the page size
in the pte, but I don't have the time to learn how exactly this is done
there and write up code that will work for another 4 or 5 platforms.
Those who are responsible for the memory management on the other archs
should come up with a mechanism to find out whether a pte is for a huge
page or not. Adding code to use that would be fairly trivial, then.

> My only other concern is that we're still wobbling on this interface and
> it's about to get mainline-released. Should we turn it off in mainline
> so that we don't establish an ABI that we know we're going to change
> shortly anyway?

That might be a good idea.

Hans


--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

2008-02-29 01:05:04

by Dave Hansen

Subject: Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size


On Thu, 2008-02-28 at 13:00 +0100, Hans Rosenfeld wrote:
> On Wed, Feb 27, 2008 at 09:44:04AM -0800, Dave Hansen wrote:
> > On Tue, 2008-02-26 at 21:25 +0100, Hans Rosenfeld wrote:
> > I'm just worried that once we establish the format, we can't really
> > change it. We have enough room in the pseudo-pte now, but what happens
> > when the next group of people pop up that want something else from this
> > interface. Right now we have normal memory, swap, and hugetlb pages.
> >
> > What if people want migration ptes marked next? I'm not sure those fit
> > into what you have here.
> >
> > It all fits today, I'm just worried about tomorrow. :(
>
> We could change the interface to return just a pfn (which is aligned to
> the pshift returned), as it was before. That would free up some bits
> that we could reserve for future use.

Yeah, I think we should do that. No reason to zero-pad it.

> > > @@ -574,7 +581,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
> > > u64 swap_pte_to_pagemap_entry(pte_t pte)
> > > {
> > > swp_entry_t e = pte_to_swp_entry(pte);
> > > - return PM_SWAP | swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
> > > + return swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
> > > }
> >
> > Is there any way to do unions of bitfields? It seems a bit silly that
> > we have this bitfield, and then subdivide the bitfield for the swap
> > entries.
>
> Having a union of bitfields is allowed, but having a union in a
> struct of bitfields or vice-versa will probably cause the compiler not
> to put all of this together in a single 64 bit entity.
>
> This whole swap thing still needs some thought. The swap file offset
> can take 59 bits, so there is a possibilty that this will break once
> someone uses a really huge swap file. I doubt that this will happen, but
> that doesn't mean it can't happen. Maybe there should be some completely
> different interface for the swap stuff, like /proc/pid/swapmap or
> something like that.

I wouldn't worry about overflowing it. I think there are plenty of
block layer things that will break, first. :)

> > > static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> > > @@ -584,16 +591,23 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> > > pte_t *pte;
> > > int err = 0;
> > >
> > > - for (; addr != end; addr += PAGE_SIZE) {
> > > - u64 pfn = PM_NOT_PRESENT;
> > > + if (pmd_huge(*pmd))
> > > + add_huge_to_pagemap(addr, end, pmd_to_ppte(pmd), pm);
> >
> > Could you make this work with other architectures' large pages as well?
> > I'd hate to leave ia64, MIPS and powerpc out in the cold. powerpc at
> > least has large pmds, it just doesn't really expose them to generic
> > code.
>
> Well, if some powerpc guy would implement pmd_huge() and pmd_pfn() for
> powerpc, the x86-specific pmd_to_ppte() wouldn't be that x86-specific any
> more. I didn't know there were huge pmds on powerpc, as pmd_huge() is
> defined as zero for everything but x86.

OK, I'm now convinced that doing this with pmds is actually completely
wrong. :(

Take a look at this code from mm/hugetlb.c:

for (address = start; address < end; address += HPAGE_SIZE) {
	ptep = huge_pte_offset(mm, address);
	if (!ptep)
		continue;

	if (huge_pmd_unshare(mm, &address, ptep))
		continue;

	pte = huge_ptep_get_and_clear(mm, address, ptep);
	if (pte_none(pte))
		continue;

	page = pte_page(pte);
	if (pte_dirty(pte))
		set_page_dirty(page);
	list_add(&page->lru, &page_list);
}

The arch code is completely responsible for taking the mm and address
and giving you back a pte that you can do pte_page() on. This is a
nice, arch-abstracted interface that everybody can use regardless of how
their arch actually does it internally.

The only issue is that this is *after* the code has decided that a
particular virtual area is for huge pages. The best arch-generic
interface I know for that is: is_vm_hugetlb_page(), but that is
VMA-based. Perhaps we should change the pagemap walk to pass the VMA
around.

We should probably use a similar approach in the pagemap code. Or, if
we're really smart, we can actually share that code with the hugetlb
code.

> Does it have huge puds as well? Once we support 1G pages for x86 a new
> function has to be added to this file to handle that special case, too.

Yes, it does (or will soon have) huge puds. But, they're nicely wrapped
up in that pte_t interface I showed above like the rest of the large
pages.

> > > pte = pte_offset_map(pmd, addr);
> > > - if (is_swap_pte(*pte))
> > > - pfn = swap_pte_to_pagemap_entry(*pte);
> > > - else if (pte_present(*pte))
> > > - pfn = pte_pfn(*pte);
> > > + if (is_swap_pte(*pte)) {
> > > + ppte.swap = 1;
> > > + ppte.paddr = swap_pte_to_pagemap_entry(*pte);
> > > + } else if (pte_present(*pte)) {
> > > + ppte.present = 1;
> > > + ppte.pshift = PAGE_SHIFT;
> > > + ppte.paddr = pte_pfn(*pte) << PAGE_SHIFT;
> > > + }
>
> This is the place where those architectures that define the page size in
> the pte should test for a huge page and put the correct page size in the
> pshift field. I looked at some of them and did not find a function or a
> macro to do this test, no generic one and no arch-dependent one.

To test a pte for its huge page size? Well, for now, we only support
one huge page size at a time, and that's HPAGE_SIZE.

> > The bitfields are nice, and I do see they've spread to generic code.
> > So, I won't object to them, but please do double-check that they don't
> > cause any problems, especially with compilers that you might not be
> > using.
>
> The standard says the ordering of bitfields is "implementation defined".
> I'm currently unsure whether this means the implementation of a machine
> or of the compiler. In the latter case, using a different compiler for
> a user space program than the one that was used to compile the kernel
> could create problems.

I'd hate to be the first ones to depend on a bitfield for a
user<->kernel interface. Can you look around for precedent in this
area, or convert them back?

-- Dave

2008-02-29 21:41:38

by Matt Mackall

Subject: Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

(I've been mostly just reading along with this thread, as I haven't
spent much time investigating huge page handling in general)

On Thu, 2008-02-28 at 16:49 -0800, Dave Hansen wrote:
> On Thu, 2008-02-28 at 13:00 +0100, Hans Rosenfeld wrote:
> > On Wed, Feb 27, 2008 at 09:44:04AM -0800, Dave Hansen wrote:
> > > On Tue, 2008-02-26 at 21:25 +0100, Hans Rosenfeld wrote:
> > > I'm just worried that once we establish the format, we can't really
> > > change it. We have enough room in the pseudo-pte now, but what happens
> > > when the next group of people pop up that want something else from this
> > > interface. Right now we have normal memory, swap, and hugetlb pages.
> > >
> > > What if people want migration ptes marked next? I'm not sure those fit
> > > into what you have here.
> > >
> > > It all fits today, I'm just worried about tomorrow. :(
> >
> > We could change the interface to return just a pfn (which is aligned to
> > the pshift returned), as it was before. That would free up some bits
> > that we could reserve for future use.
>
> Yeah, I think we should do that. No reason to zero-pad it.

Indeed.

> > > > @@ -574,7 +581,7 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
> > > >  u64 swap_pte_to_pagemap_entry(pte_t pte)
> > > >  {
> > > >  	swp_entry_t e = pte_to_swp_entry(pte);
> > > > -	return PM_SWAP | swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
> > > > +	return swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
> > > >  }
> > >
> > > Is there any way to do unions of bitfields? It seems a bit silly that
> > > we have this bitfield, and then subdivide the bitfield for the swap
> > > entries.
> >
> > Having a union of bitfields is allowed, but having a union in a
> > struct of bitfields or vice-versa will probably cause the compiler not
> > to put all of this together in a single 64 bit entity.
> >
> > This whole swap thing still needs some thought. The swap file offset
> > can take 59 bits, so there is a possibility that this will break once
> > someone uses a really huge swap file. I doubt that this will happen, but
> > that doesn't mean it can't happen. Maybe there should be some completely
> > different interface for the swap stuff, like /proc/pid/swapmap or
> > something like that.
>
> I wouldn't worry about overflowing it. I think there are plenty of
> block layer things that will break, first. :)

I expect the trend for swap is that it'll be a rather small multiple of
total memory size for the foreseeable future.

> > > >  static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> > > > @@ -584,16 +591,23 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> > > >  	pte_t *pte;
> > > >  	int err = 0;
> > > >
> > > > -	for (; addr != end; addr += PAGE_SIZE) {
> > > > -		u64 pfn = PM_NOT_PRESENT;
> > > > +	if (pmd_huge(*pmd))
> > > > +		add_huge_to_pagemap(addr, end, pmd_to_ppte(pmd), pm);
> > >
> > > Could you make this work with other architectures' large pages as well?
> > > I'd hate to leave ia64, MIPS and powerpc out in the cold. powerpc at
> > > least has large pmds, it just doesn't really expose them to generic
> > > code.
> >
> > Well, if some powerpc guy would implement pmd_huge() and pmd_pfn() for
> > powerpc, the x86 specific pmd_to_ppte() won't be that x86 specific no
> > more. I didn't know there were huge pmds on powerpc, as pmd_huge() is
> > defined as zero for everything but x86.
>
> OK, I'm now convinced that doing this with pmds is actually completely
> wrong. :(
>
> Take a look at this code from mm/hugetlb.c:
>
> for (address = start; address < end; address += HPAGE_SIZE) {
> 	ptep = huge_pte_offset(mm, address);
> 	if (!ptep)
> 		continue;
>
> 	if (huge_pmd_unshare(mm, &address, ptep))
> 		continue;
>
> 	pte = huge_ptep_get_and_clear(mm, address, ptep);
> 	if (pte_none(pte))
> 		continue;
>
> 	page = pte_page(pte);
> 	if (pte_dirty(pte))
> 		set_page_dirty(page);
> 	list_add(&page->lru, &page_list);
> }
>
> The arch code is completely responsible for taking the mm and address
> and giving you back a pte that you can do pte_page() on. This is a
> nice, arch-abstracted interface that everybody can use regardless of how
> their arch actually does it internally.
>
> The only issue is that this is *after* the code has decided that a
> particular virtual area is for huge pages. The best arch-generic
> interface I know for that is: is_vm_hugetlb_page(), but that is
> VMA-based. Perhaps we should change the pagemap walk to pass the VMA
> around.

I'd rather avoid that. Requiring a VMA to poke at these things shouldn't
-really- be necessary.

> We should probably use a similar approach in the pagemap code. Or, if
> we're really smart, we can actually share that code with the hugetlb
> code.
>
> > Does it have huge puds as well? Once we support 1G pages for x86 a new
> > function has to be added to this file to handle that special case, too.
>
> Yes, it does (or will soon have) huge puds. But, they're nicely wrapped
> up in that pte_t interface I showed above like the rest of the large
> pages.
>
> > > >  		pte = pte_offset_map(pmd, addr);
> > > > -		if (is_swap_pte(*pte))
> > > > -			pfn = swap_pte_to_pagemap_entry(*pte);
> > > > -		else if (pte_present(*pte))
> > > > -			pfn = pte_pfn(*pte);
> > > > +		if (is_swap_pte(*pte)) {
> > > > +			ppte.swap = 1;
> > > > +			ppte.paddr = swap_pte_to_pagemap_entry(*pte);
> > > > +		} else if (pte_present(*pte)) {
> > > > +			ppte.present = 1;
> > > > +			ppte.pshift = PAGE_SHIFT;
> > > > +			ppte.paddr = pte_pfn(*pte) << PAGE_SHIFT;
> > > > +		}
> >
> > This is the place where those architectures that define the page size in
> > the pte should test for a huge page and put the correct page size in the
> > pshift field. I looked at some of them and did not find a function or a
> > macro to do this test, no generic one and no arch-dependent one.
>
> To test a pte for its huge page size? Well, for now, we only support
> one huge page size at a time, and that's HPAGE_SIZE.
>
> > > The bitfields are nice, and I do see they've spread to generic code.
> > > So, I won't object to them, but please do double-check that they don't
> > > cause any problems, especially with compilers that you might not be
> > > using.
> >
> > The standard says the ordering of bitfields is "implementation defined".
> > I'm currently unsure whether this means the implementation of a machine
> > or of the compiler. In the latter case, using a different compiler for
> > a user space program than the one that was used to compile the kernel
> > could create problems.
>
> I'd hate to be the first ones to depend on a bitfield for a
> user<->kernel interface. Can you look around for precedent in this
> area, or convert them back?
>
> -- Dave
--
Mathematics is the supreme nostalgia of our time.

2008-02-29 22:30:50

by Dave Hansen

Subject: Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

On Thu, 2008-02-28 at 17:15 -0800, Matt Mackall wrote:
> > The only issue is that this is *after* the code has decided that a
> > particular virtual area is for huge pages. The best arch-generic
> > interface I know for that is: is_vm_hugetlb_page(), but that is
> > VMA-based. Perhaps we should change the pagemap walk to pass the VMA
> > around.
>
> I'd rather avoid that. Requiring a VMA to poke at these things shouldn't
> -really- be necessary.

Yeah, this is strictly true. But, it also assumes that we have all of
the data about where large pages are based on the contents of the
pagetables alone. It's possible that we can derive this, or code up a
bunch of arch-specific functions to do this, but it certainly doesn't
exist today. I'm just not keen on going and learning about how each and
every architecture encodes hugetlb information in their pagetables. :(

The fact is that we treat pagetables in hugetlb areas completely
differently than normal pages. All of the generic functions that deal
with pagetables magically punt over to the hugetlb code when they see a
hugetlb VMA.

I think although it isn't strictly necessary, it is going to save a ton
of work to just pass the VMAs around.

-- Dave

2008-02-29 22:47:15

by Matt Mackall

Subject: Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size


On Fri, 2008-02-29 at 14:30 -0800, Dave Hansen wrote:
> On Thu, 2008-02-28 at 17:15 -0800, Matt Mackall wrote:
> > > The only issue is that this is *after* the code has decided that a
> > > particular virtual area is for huge pages. The best arch-generic
> > > interface I know for that is: is_vm_hugetlb_page(), but that is
> > > VMA-based. Perhaps we should change the pagemap walk to pass the VMA
> > > around.
> >
> > I'd rather avoid that. Requiring a VMA to poke at these things shouldn't
> > -really- be necessary.
>
> Yeah, this is strictly true. But, it also assumes that we have all of
> the data about where large pages are based on the contents of the
> pagetables alone. It's possible that we can derive this, or code up a
> bunch of arch-specific functions to do this, but it certainly doesn't
> exist today. I'm just not keen on going and learning about how each and
> every architecture encodes hugetlb information in their pagetables. :(
>
> The fact is that we treat pagetables in hugetlb areas completely
> differently than normal pages. All of the generic functions that deal
> with pagetables magically punt over to the hugetlb code when they see a
> hugetlb VMA.
>
> I think although it isn't strictly necessary, it is going to save a ton
> of work to just pass the VMAs around.

Well in any case, step 1 should be to finalize the interface. Let's
start with a single patch that does nothing but move the exported bits
around and makes room for the page shift bits. Are we agreed on what
that should look like now?

ps: my laptop died on Tuesday, so I'll be struggling to even read email
until its replacement arrives.

--
Mathematics is the supreme nostalgia of our time.

2008-03-05 16:08:05

by Hans Rosenfeld

Subject: Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

On Fri, Feb 29, 2008 at 04:46:12PM -0600, Matt Mackall wrote:
> Well in any case, step 1 should be to finalize the interface. Let's
> start with a single patch that does nothing but move the exported bits
> around and makes room for the page shift bits. Are we agreed on what
> that should look like now?

Attached is a patch that changes the current pagemap code just to use
the pseudo-pte, which can now also include swap information. The
pseudo-pte is made available through the linux/pagemap_ppte.h header.
The x86-specific huge page support has been left out for now, so the
interface will return wrong information for huge pages.

There are a few reserved bits in the pte part of the union that could be
used to extend the pfn or to add more status bits in the future (a ronly
bit, indicating that a page is mapped read-only, has been added, but is
currently not set anywhere). The fields common to both bitfields in the
union are the present and swap bits and the page shift.



Signed-off-by: Hans Rosenfeld <[email protected]>

---
fs/proc/task_mmu.c | 41 ++++++++++++++++--------------------
include/linux/pagemap_ppte.h | 47 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 65 insertions(+), 23 deletions(-)
create mode 100644 include/linux/pagemap_ppte.h

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 49958cf..958a37a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -5,6 +5,7 @@
 #include <linux/highmem.h>
 #include <linux/ptrace.h>
 #include <linux/pagemap.h>
+#include <linux/pagemap_ppte.h>
 #include <linux/ptrace.h>
 #include <linux/mempolicy.h>
 #include <linux/swap.h>
@@ -527,16 +528,10 @@ struct pagemapread {
 	char __user *out, *end;
 };

-#define PM_ENTRY_BYTES sizeof(u64)
-#define PM_RESERVED_BITS 3
-#define PM_RESERVED_OFFSET (64 - PM_RESERVED_BITS)
-#define PM_RESERVED_MASK (((1LL<<PM_RESERVED_BITS)-1) << PM_RESERVED_OFFSET)
-#define PM_SPECIAL(nr) (((nr) << PM_RESERVED_OFFSET) | PM_RESERVED_MASK)
-#define PM_NOT_PRESENT PM_SPECIAL(1LL)
-#define PM_SWAP PM_SPECIAL(2LL)
-#define PM_END_OF_BUFFER 1
+#define PM_ENTRY_BYTES sizeof(pagemap_ppte_t)
+#define PM_END_OF_BUFFER 1

-static int add_to_pagemap(unsigned long addr, u64 pfn,
+static int add_to_pagemap(unsigned long addr, pagemap_ppte_t ppte,
 			  struct pagemapread *pm)
 {
 	/*
@@ -545,13 +540,13 @@ static int add_to_pagemap(unsigned long addr, u64 pfn,
 	 * the pfn.
 	 */
 	if (pm->out + PM_ENTRY_BYTES >= pm->end) {
-		if (copy_to_user(pm->out, &pfn, pm->end - pm->out))
+		if (copy_to_user(pm->out, &ppte, pm->end - pm->out))
 			return -EFAULT;
 		pm->out = pm->end;
 		return PM_END_OF_BUFFER;
 	}

-	if (put_user(pfn, pm->out))
+	if (copy_to_user(pm->out, &ppte, sizeof(ppte)))
 		return -EFAULT;
 	pm->out += PM_ENTRY_BYTES;
 	return 0;
@@ -571,12 +566,6 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
 	return err;
 }

-u64 swap_pte_to_pagemap_entry(pte_t pte)
-{
-	swp_entry_t e = pte_to_swp_entry(pte);
-	return PM_SWAP | swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
-}
-
 static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			     void *private)
 {
@@ -585,15 +574,21 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	int err = 0;

 	for (; addr != end; addr += PAGE_SIZE) {
-		u64 pfn = PM_NOT_PRESENT;
+		pagemap_ppte_t ppte = PM_NOT_PRESENT;
+
 		pte = pte_offset_map(pmd, addr);
-		if (is_swap_pte(*pte))
-			pfn = swap_pte_to_pagemap_entry(*pte);
-		else if (pte_present(*pte))
-			pfn = pte_pfn(*pte);
+		if (is_swap_pte(*pte)) {
+			swp_entry_t e = pte_to_swp_entry(*pte);
+			ppte.pm_swap = 1;
+			ppte.pm_type = swp_type(e);
+			ppte.pm_offset = swp_offset(e);
+		} else if (pte_present(*pte)) {
+			ppte.pm_present = 1;
+			ppte.pm_frame = pte_pfn(*pte);
+		}
 		/* unmap so we're not in atomic when we copy to userspace */
 		pte_unmap(pte);
-		err = add_to_pagemap(addr, pfn, pm);
+		err = add_to_pagemap(addr, ppte, pm);
 		if (err)
 			return err;
 	}
diff --git a/include/linux/pagemap_ppte.h b/include/linux/pagemap_ppte.h
new file mode 100644
index 0000000..a983500
--- /dev/null
+++ b/include/linux/pagemap_ppte.h
@@ -0,0 +1,47 @@
+#ifndef _LINUX_PAGEMAP_PPTE_H
+#define _LINUX_PAGEMAP_PPTE_H
+
+#include <linux/types.h>
+
+/*
+ * structure used by /proc/pid/pagemap to describe mappings
+ */
+
+union pagemap_ppte {
+	struct pagemap_ppte_pte {
+		__u64 _shift:6;
+		__u64 _frame:40;
+		__u64 _resvd:15;
+		__u64 _ronly:1;
+		__u64 _swap:1;
+		__u64 _present:1;
+	} pte;
+
+	struct pagemap_ppte_swp {
+		__u64 _shift:6;
+		__u64 _offset:51;
+		__u64 _type:5;
+		__u64 _swap:1;
+		__u64 _present:1;
+	} swp;
+
+	__u64 val;
+};
+
+#define pm_shift	pte._shift
+#define pm_frame	pte._frame
+#define pm_offset	swp._offset
+#define pm_type		swp._type
+#define pm_ronly	pte._ronly
+#define pm_swap		pte._swap
+#define pm_present	pte._present
+
+typedef union pagemap_ppte pagemap_ppte_t;
+
+#define PM_NOT_PRESENT	((pagemap_ppte_t) { .pm_present = 0, \
+					    .pm_swap = 0,    \
+					    .pm_offset = 0,  \
+					    .pm_type = 0,    \
+					    .pm_shift = PAGE_SHIFT})
+
+#endif /* _LINUX_PAGEMAP_PPTE_H */
--
1.5.3.7

2008-03-05 16:33:10

by Dave Hansen

Subject: Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size


On Wed, 2008-03-05 at 17:00 +0100, Hans Rosenfeld wrote:
>
> There are a few reserved bits in the pte part of the union that could
> be used to extend the pfn or add more status bits in future (a ronly bit,

Were you able to find any other instances of these bitfields being used
for kernel<->user communication? I thought you were going to check to
see if we could depend on the layout, or if it changed between
compilers.

-- Dave

2008-03-05 17:25:09

by Hans Rosenfeld

Subject: Re: [RFC][PATCH] make /proc/pid/pagemap work with huge pages and return page size

On Wed, Mar 05, 2008 at 08:32:46AM -0800, Dave Hansen wrote:
> Were you able to find any other instances of these bitfields being used
> for kernel<->user communication?

I did not look at that.

> I thought you were going to check to
> see if we could depend on the layout, or if it changed between
> compilers.

But I looked at the ABI specification for i386, powerpc and sparc, where
the layout is defined for those architectures. Therefore I assume that
we can depend on the bitfield layout, as long as we use compilers that
adhere to the ABI specification of an architecture.


--
%SYSTEM-F-ANARCHISM, The operating system has been overthrown