2006-03-22 06:37:54

by Chris Wright

Subject: [RFC PATCH 30/35] Add generic_page_range() function

Add a new mm function generic_page_range() which applies a given
function to every pte in a given virtual address range in a given mm
structure. This is a generic alternative to cut-and-pasting the Linux
idiomatic pagetable walking code in every place that a sequence of
PTEs must be accessed.

Although this interface is intended to be useful in a wide range of
situations, it is currently used specifically by several Xen
subsystems, for example: to ensure that pagetables have been allocated
for a virtual address range, and to construct batched special
pagetable update requests to map I/O memory (in ioremap()).

Signed-off-by: Ian Pratt <[email protected]>
Signed-off-by: Christian Limpach <[email protected]>
Signed-off-by: Chris Wright <[email protected]>
---
include/linux/mm.h | 5 ++
mm/memory.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 95 insertions(+)

--- xen-subarch-2.6.orig/include/linux/mm.h
+++ xen-subarch-2.6/include/linux/mm.h
@@ -1013,6 +1013,11 @@ struct page *follow_page(struct vm_area_
 #define FOLL_GET	0x04	/* do get_page on page */
 #define FOLL_ANON	0x08	/* give ZERO_PAGE if no pgtable */
 
+typedef int (*pte_fn_t)(pte_t *pte, struct page *pte_page, unsigned long addr,
+			void *data);
+extern int generic_page_range(struct mm_struct *mm, unsigned long address,
+			      unsigned long size, pte_fn_t fn, void *data);
+
 #ifdef CONFIG_PROC_FS
 void vm_stat_account(struct mm_struct *, unsigned long, struct file *, long);
 #else
--- xen-subarch-2.6.orig/mm/memory.c
+++ xen-subarch-2.6/mm/memory.c
@@ -1359,6 +1359,96 @@ int remap_pfn_range(struct vm_area_struc
 }
 EXPORT_SYMBOL(remap_pfn_range);
 
+static inline int generic_pte_range(struct mm_struct *mm, pmd_t *pmd,
+				    unsigned long addr, unsigned long end,
+				    pte_fn_t fn, void *data)
+{
+	pte_t *pte;
+	int err;
+	struct page *pte_page;
+
+	pte = (mm == &init_mm) ?
+		pte_alloc_kernel(pmd, addr) :
+		pte_alloc_map(mm, pmd, addr);
+	if (!pte)
+		return -ENOMEM;
+
+	pte_page = pmd_page(*pmd);
+
+	do {
+		err = fn(pte, pte_page, addr, data);
+		if (err)
+			break;
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+
+	if (mm != &init_mm)
+		pte_unmap(pte-1);
+	return err;
+}
+
+static inline int generic_pmd_range(struct mm_struct *mm, pud_t *pud,
+				    unsigned long addr, unsigned long end,
+				    pte_fn_t fn, void *data)
+{
+	pmd_t *pmd;
+	unsigned long next;
+	int err;
+
+	pmd = pmd_alloc(mm, pud, addr);
+	if (!pmd)
+		return -ENOMEM;
+	do {
+		next = pmd_addr_end(addr, end);
+		err = generic_pte_range(mm, pmd, addr, next, fn, data);
+		if (err)
+			break;
+	} while (pmd++, addr = next, addr != end);
+	return err;
+}
+
+static inline int generic_pud_range(struct mm_struct *mm, pgd_t *pgd,
+				    unsigned long addr, unsigned long end,
+				    pte_fn_t fn, void *data)
+{
+	pud_t *pud;
+	unsigned long next;
+	int err;
+
+	pud = pud_alloc(mm, pgd, addr);
+	if (!pud)
+		return -ENOMEM;
+	do {
+		next = pud_addr_end(addr, end);
+		err = generic_pmd_range(mm, pud, addr, next, fn, data);
+		if (err)
+			break;
+	} while (pud++, addr = next, addr != end);
+	return err;
+}
+
+/*
+ * Scan a region of virtual memory, filling in page tables as necessary
+ * and calling a provided function on each leaf page table.
+ */
+int generic_page_range(struct mm_struct *mm, unsigned long addr,
+		       unsigned long size, pte_fn_t fn, void *data)
+{
+	pgd_t *pgd;
+	unsigned long next;
+	unsigned long end = addr + size;
+	int err;
+
+	BUG_ON(addr >= end);
+	pgd = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		err = generic_pud_range(mm, pgd, addr, next, fn, data);
+		if (err)
+			break;
+	} while (pgd++, addr = next, addr != end);
+	return err;
+}
+
 /*
  * handle_pte_fault chooses page fault handler according to an entry
  * which was read non-atomically. Before making any commitment, on

--
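
For readers who want to experiment with the callback-driven walk outside the
kernel, here is a minimal userspace sketch of the same loop structure. It is
only a toy model under stated assumptions: a fixed two-level table, page-aligned
ranges, and no allocation or locking. Every toy_* name is hypothetical and is
not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, not kernel code: a two-level table covering 16 "pages". */
#define TOY_PAGE_SIZE   4096UL
#define TOY_PTRS_PER_PT 4UL	/* entries per leaf table (assumed) */
#define TOY_PTRS_PER_PD 4UL	/* leaf tables per directory (assumed) */
#define TOY_PD_SPAN     (TOY_PAGE_SIZE * TOY_PTRS_PER_PT)

typedef unsigned long toy_pte_t;

/* Same shape as pte_fn_t in the patch, minus the struct page argument. */
typedef int (*toy_pte_fn_t)(toy_pte_t *pte, unsigned long addr, void *data);

struct toy_mm {
	toy_pte_t tables[TOY_PTRS_PER_PD][TOY_PTRS_PER_PT];
};

/* End of the current directory entry's span, clamped to 'end'. */
static unsigned long toy_pd_addr_end(unsigned long addr, unsigned long end)
{
	unsigned long boundary = (addr + TOY_PD_SPAN) & ~(TOY_PD_SPAN - 1);

	return boundary < end ? boundary : end;
}

/* Leaf level: call fn on every entry in [addr, end), stop on first error. */
static int toy_pte_range(toy_pte_t *table, unsigned long addr,
			 unsigned long end, toy_pte_fn_t fn, void *data)
{
	toy_pte_t *pte = &table[(addr / TOY_PAGE_SIZE) % TOY_PTRS_PER_PT];
	int err;

	do {
		err = fn(pte, addr, data);
		if (err)
			break;
	} while (pte++, addr += TOY_PAGE_SIZE, addr != end);
	return err;
}

/* Top level: split the range at directory boundaries and descend. */
int toy_apply_to_range(struct toy_mm *mm, unsigned long addr,
		       unsigned long size, toy_pte_fn_t fn, void *data)
{
	unsigned long end = addr + size;
	unsigned long next;
	int err;

	assert(addr < end);	/* plays the role of BUG_ON(addr >= end) */
	assert(addr % TOY_PAGE_SIZE == 0 && size % TOY_PAGE_SIZE == 0);
	assert(end <= TOY_PD_SPAN * TOY_PTRS_PER_PD);
	do {
		next = toy_pd_addr_end(addr, end);
		err = toy_pte_range(mm->tables[addr / TOY_PD_SPAN],
				    addr, next, fn, data);
		if (err)
			break;
	} while (addr = next, addr != end);
	return err;
}

/* Example callback: store the page number in each entry and count calls. */
int toy_fill(toy_pte_t *pte, unsigned long addr, void *data)
{
	*pte = addr / TOY_PAGE_SIZE;
	(*(int *)data)++;
	return 0;
}
```

Calling toy_apply_to_range(&mm, 2 * TOY_PAGE_SIZE, 5 * TOY_PAGE_SIZE, toy_fill,
&calls) visits pages 2 through 6, crossing one directory boundary on the way,
which is the case the per-level "next" computation exists to handle.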


2006-03-22 08:51:39

by Zachary Amsden

Subject: Re: [RFC PATCH 30/35] Add generic_page_range() function

Chris Wright wrote:
> Add a new mm function generic_page_range() which applies a given
> function to every pte in a given virtual address range in a given mm
> structure. This is a generic alternative to cut-and-pasting the Linux
> idiomatic pagetable walking code in every place that a sequence of
> PTEs must be accessed.
>
> Although this interface is intended to be useful in a wide range of
> situations, it is currently used specifically by several Xen
> subsystems, for example: to ensure that pagetables have been allocated
> for a virtual address range, and to construct batched special
> pagetable update requests to map I/O memory (in ioremap()).
>

This interface is great, and highly useful. But it doesn't seem to be
able to work on native hardware, as it doesn't support large pages.

Zach
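
To make the large-page concern concrete: a directory entry that maps a large
page has no leaf table behind it, so a walker that descends unconditionally
would misinterpret it. The thread does not say how this was to be addressed;
the userspace sketch below shows just one possible policy (refuse with an
error). The flag bits, struct layout, and toy_* names are all assumptions made
for illustration, and unlike the patch this toy skips non-present entries
instead of allocating tables for them.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_ENTRIES 4		/* entries per leaf table (assumed) */
#define TOY_PRESENT 0x1UL
#define TOY_LARGE   0x2UL	/* hypothetical "maps a large page" bit */

struct toy_dir_entry {
	unsigned long flags;
	unsigned long *leaf_table;	/* valid only when TOY_LARGE is clear */
};

typedef int (*toy_leaf_fn_t)(unsigned long *pte, void *data);

/*
 * Walk every leaf entry below 'dir'. A large-page directory entry is
 * rejected with -EINVAL instead of being treated as a table pointer.
 */
int toy_walk_dir(struct toy_dir_entry *dir, size_t ndirs,
		 toy_leaf_fn_t fn, void *data)
{
	size_t d, i;

	for (d = 0; d < ndirs; d++) {
		if (!(dir[d].flags & TOY_PRESENT))
			continue;
		if (dir[d].flags & TOY_LARGE)
			return -EINVAL;
		for (i = 0; i < TOY_ENTRIES; i++) {
			int err = fn(&dir[d].leaf_table[i], data);

			if (err)
				return err;
		}
	}
	return 0;
}

/* Example callback: count the leaf entries visited. */
int toy_count(unsigned long *pte, void *data)
{
	(void)pte;
	(*(int *)data)++;
	return 0;
}
```

Other policies are equally plausible, for example skipping the large entry or
handing it to a separate callback; which one suits real callers is exactly the
open question raised above.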

2006-03-22 09:27:43

by Keir Fraser

Subject: Re: [Xen-devel] Re: [RFC PATCH 30/35] Add generic_page_range() function


On 22 Mar 2006, at 08:51, Zachary Amsden wrote:

>> Although this interface is intended to be useful in a wide range of
>> situations, it is currently used specifically by several Xen
>> subsystems, for example: to ensure that pagetables have been allocated
>> for a virtual address range, and to construct batched special
>> pagetable update requests to map I/O memory (in ioremap()).
>>
>
> This interface is great, and highly useful. But it doesn't seem to be
> able to work on native hardware, as it doesn't support large pages.

Ah, good point. I'll add that.

-- Keir

2006-03-22 12:01:43

by Nick Piggin

Subject: Re: [RFC PATCH 30/35] Add generic_page_range() function

Chris Wright wrote:
> Add a new mm function generic_page_range() which applies a given
> function to every pte in a given virtual address range in a given mm
> structure. This is a generic alternative to cut-and-pasting the Linux
> idiomatic pagetable walking code in every place that a sequence of
> PTEs must be accessed.
>
> Although this interface is intended to be useful in a wide range of
> situations, it is currently used specifically by several Xen
> subsystems, for example: to ensure that pagetables have been allocated
> for a virtual address range, and to construct batched special
> pagetable update requests to map I/O memory (in ioremap()).
>

I raised the idea when we were tossing around ideas for the page
table walking crapectomy. Of course it was rejected due to use of
the indirect function; however, I guess it makes sense for code
outside mm/.

Couple of issues with the current code though:

firstly, the name.

secondly, I think you confuse our (confusing) terminology: the page
that holds pte_ts is not the pte_page, the pte_page is the page that
a pte points to

lastly, you don't allow any control over the type of pages that are
walked: this could well be unusably slow for some cases. At least
you should probably design the interface so we can iterate over
present, not present, all, etc so it becomes widely usable. Normally
I'd say to wait until users come up but in this case the function
isn't a speed demon anyway, and you also don't want to give people
any excuses not to use it.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-22 14:33:10

by Keir Fraser

Subject: Re: [RFC PATCH 30/35] Add generic_page_range() function


On 22 Mar 2006, at 11:21, Nick Piggin wrote:

> Couple of issues with the current code though:
>
> firstly, the name.

Okay, can you suggest a better one? That's the best I could come up
with that wasn't long winded. :-)

> secondly, I think you confuse our (confusing) terminology: the page
> that holds pte_ts is not the pte_page, the pte_page is the page that
> a pte points to

What should we call it? Essentially we want to be able to get the
physical address of a PTE in some cases, and passing struct page
pointer seemed the best way to be able to derive that. I can rename it
to something else vaguely plausible if the only problem is the semantic
clash with Linux's idiomatic use of pte_page.

> lastly, you don't allow any control over the type of pages that are
> walked: this could well be unusably slow for some cases. At least
> you should probably design the interface so we can iterate over
> present, not present, all, etc so it becomes widely usable. Normally
> I'd say to wait until users come up but in this case the function
> isn't a speed demon anyway, and you also don't want to give people
> any excuses not to use it.

You mean iterate only over PTEs that are already present, or only those
that were *not* previously present, or all (present and non-present)?
Is that really useful? If so then yes, it's not hard to add.

Thanks,
Keir
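
The three walk modes discussed above (present only, not present only, all)
could be expressed as a mode argument to the walker. The following userspace
sketch shows the idea over a flat array of entries; the enum, the present bit,
and the toy_* names are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PTE_PRESENT 0x1UL	/* assumed "present" bit */

enum toy_walk_mode {
	TOY_WALK_PRESENT,	/* only entries that are present */
	TOY_WALK_NOT_PRESENT,	/* only entries that are not present */
	TOY_WALK_ALL		/* every entry in the range */
};

typedef int (*toy_fn_t)(unsigned long *pte, size_t index, void *data);

/* Apply fn to the entries selected by 'mode'; stop on the first error. */
int toy_walk(unsigned long *ptes, size_t n, enum toy_walk_mode mode,
	     toy_fn_t fn, void *data)
{
	size_t i;

	for (i = 0; i < n; i++) {
		int present = (ptes[i] & TOY_PTE_PRESENT) != 0;
		int err;

		if (mode == TOY_WALK_PRESENT && !present)
			continue;
		if (mode == TOY_WALK_NOT_PRESENT && present)
			continue;
		err = fn(&ptes[i], i, data);
		if (err)
			return err;
	}
	return 0;
}

/* Example callback: add up the indices of the entries visited. */
int toy_sum_index(unsigned long *pte, size_t index, void *data)
{
	(void)pte;
	*(size_t *)data += index;
	return 0;
}
```

Filtering inside the walker saves an indirect call for every entry the caller
does not care about, which speaks to the speed concern about the indirect
function raised earlier in the thread.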

2006-03-22 15:35:41

by Keir Fraser

Subject: Re: [RFC PATCH 30/35] Add generic_page_range() function


On 22 Mar 2006, at 14:33, Keir Fraser wrote:

> Okay, can you suggest a better one? That's the best I could come up
> with that wasn't long winded.

How about apply_to_page_range()?

>
>> secondly, I think you confuse our (confusing) terminology: the page
>> that holds pte_ts is not the pte_page, the pte_page is the page that
>> a pte points to
>
> What should we call it? Essentially we want to be able to get the
> physical address of a PTE in some cases, and passing struct page
> pointer seemed the best way to be able to derive that. I can rename it
> to something else vaguely plausible if the only problem is the
> semantic clash with Linux's idiomatic use of pte_page.

Looks like pmd_page is correct?

-- Keir

2006-03-23 00:16:01

by Nick Piggin

Subject: Re: [RFC PATCH 30/35] Add generic_page_range() function

Keir Fraser wrote:
>
> On 22 Mar 2006, at 14:33, Keir Fraser wrote:
>
>> Okay, can you suggest a better one? That's the best I could come up
>> with that wasn't long winded.
>
>
> How about apply_to_page_range()?
>

That would be better.

>>
>>> secondly, I think you confuse our (confusing) terminology: the page
>>> that holds pte_ts is not the pte_page, the pte_page is the page that
>>> a pte points to
>>
>>
>> What should we call it? Essentially we want to be able to get the
>> physical address of a PTE in some cases, and passing struct page
>> pointer seemed the best way to be able to derive that. I can rename it
>> to something else vaguely plausible if the only problem is the
>> semantic clash with Linux's idiomatic use of pte_page.
>
>
> Looks like pmd_page is correct?
>

Yes... although maybe you could just pass the 'pmd_t *'? That's
what a lot of the mm/memory.c code does.

--
SUSE Labs, Novell Inc.
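
As a rough illustration of why a pointer to the pmd entry could be enough for
a callback that needs the physical address of a PTE: the entry already holds
the base address of the page that contains the PTEs. The userspace sketch
below assumes a simplified layout in which flag bits live only in the low 12
bits of the entry; real architectures differ, and the toy_* names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_MASK  (~((UINT64_C(1) << TOY_PAGE_SHIFT) - 1))

typedef struct { uint64_t val; } toy_pmd_t;	/* table base | low flag bits */
typedef struct { uint64_t val; } toy_pte_t;

/*
 * Physical address of the PTE slot that maps 'addr', derived from the
 * pmd entry alone. 'ptrs_per_pte' is the number of PTEs per table page.
 */
uint64_t toy_pte_phys(const toy_pmd_t *pmd, unsigned long addr,
		      unsigned int ptrs_per_pte)
{
	uint64_t table_base = pmd->val & TOY_PAGE_MASK;
	uint64_t index = (addr >> TOY_PAGE_SHIFT) % ptrs_per_pte;

	return table_base + index * sizeof(toy_pte_t);
}
```

With a table at 0x5000 and 512 entries per table, the slot for virtual address
0x403000 is entry 3, so the sketch yields 0x5000 + 3 * 8 = 0x5018.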

2006-03-23 00:26:41

by Nick Piggin

Subject: Re: [RFC PATCH 30/35] Add generic_page_range() function

Keir Fraser wrote:

>> lastly, you don't allow any control over the type of pages that are
>> walked: this could well be unusably slow for some cases. At least
>> you should probably design the interface so we can iterate over
>> present, not present, all, etc so it becomes widely usable. Normally
>> I'd say to wait until users come up but in this case the function
>> isn't a speed demon anyway, and you also don't want to give people
>> any excuses not to use it.
>
>
> You mean iterate only over PTEs that are already present, or only those
> that were *not* previously present, or all (present and non-present)? Is
> that really useful? If so then yes, it's not hard to add.
>

Yes. If you look at all the code that walks pagetables (including those
that just operate on a single pte) in arch/ and other places, there is
a fair diversity. A function called generic_page_range() would surely be
able to replace all those ;)

--
SUSE Labs, Novell Inc.