2008-11-13 17:19:57

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: [PATCH 1/2] mm: implement remap_pfn_range with apply_to_page_range

remap_pte_range() just wants to apply a function over a range of ptes
corresponding to a virtual address range. That's exactly what
apply_to_page_range() does, so use it.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
mm/memory.c | 92 ++++++++++++-----------------------------------------------
1 file changed, 20 insertions(+), 72 deletions(-)

===================================================================
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1472,69 +1472,20 @@
}
EXPORT_SYMBOL(vm_insert_mixed);

-/*
- * maps a range of physical memory into the requested pages. the old
- * mappings are removed. any references to nonexistent pages results
- * in null mappings (currently treated as "copy-on-access")
- */
-static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
- unsigned long addr, unsigned long end,
- unsigned long pfn, pgprot_t prot)
+struct remap_data {
+ struct mm_struct *mm;
+ unsigned long pfn;
+ pgprot_t prot;
+};
+
+static int remap_area_pte_fn(pte_t *ptep, pgtable_t token,
+ unsigned long addr, void *data)
{
- pte_t *pte;
- spinlock_t *ptl;
+ struct remap_data *rmd = data;
+ pte_t pte = pte_mkspecial(pfn_pte(rmd->pfn++, rmd->prot));

- pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
- if (!pte)
- return -ENOMEM;
- arch_enter_lazy_mmu_mode();
- do {
- BUG_ON(!pte_none(*pte));
- set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
- pfn++;
- } while (pte++, addr += PAGE_SIZE, addr != end);
- arch_leave_lazy_mmu_mode();
- pte_unmap_unlock(pte - 1, ptl);
- return 0;
-}
+ set_pte_at(rmd->mm, addr, ptep, pte);

-static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
- unsigned long addr, unsigned long end,
- unsigned long pfn, pgprot_t prot)
-{
- pmd_t *pmd;
- unsigned long next;
-
- pfn -= addr >> PAGE_SHIFT;
- pmd = pmd_alloc(mm, pud, addr);
- if (!pmd)
- return -ENOMEM;
- do {
- next = pmd_addr_end(addr, end);
- if (remap_pte_range(mm, pmd, addr, next,
- pfn + (addr >> PAGE_SHIFT), prot))
- return -ENOMEM;
- } while (pmd++, addr = next, addr != end);
- return 0;
-}
-
-static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
- unsigned long addr, unsigned long end,
- unsigned long pfn, pgprot_t prot)
-{
- pud_t *pud;
- unsigned long next;
-
- pfn -= addr >> PAGE_SHIFT;
- pud = pud_alloc(mm, pgd, addr);
- if (!pud)
- return -ENOMEM;
- do {
- next = pud_addr_end(addr, end);
- if (remap_pmd_range(mm, pud, addr, next,
- pfn + (addr >> PAGE_SHIFT), prot))
- return -ENOMEM;
- } while (pud++, addr = next, addr != end);
return 0;
}

@@ -1551,10 +1502,9 @@
int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t prot)
{
- pgd_t *pgd;
- unsigned long next;
unsigned long end = addr + PAGE_ALIGN(size);
struct mm_struct *mm = vma->vm_mm;
+ struct remap_data rmd;
int err;

/*
@@ -1584,16 +1534,14 @@
vma->vm_flags |= VM_IO | VM_RESERVED | VM_PFNMAP;

BUG_ON(addr >= end);
- pfn -= addr >> PAGE_SHIFT;
- pgd = pgd_offset(mm, addr);
- flush_cache_range(vma, addr, end);
- do {
- next = pgd_addr_end(addr, end);
- err = remap_pud_range(mm, pgd, addr, next,
- pfn + (addr >> PAGE_SHIFT), prot);
- if (err)
- break;
- } while (pgd++, addr = next, addr != end);
+
+ rmd.mm = mm;
+ rmd.pfn = pfn;
+ rmd.prot = prot;
+
+ err = apply_to_page_range(mm, addr, end - addr,
+ remap_area_pte_fn, &rmd);
+
return err;
}
EXPORT_SYMBOL(remap_pfn_range);


2008-11-13 19:55:11

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: implement remap_pfn_range with apply_to_page_range

Hi Jeremy,

On Thu, Nov 13, 2008 at 09:19:45AM -0800, Jeremy Fitzhardinge wrote:
> remap_pte_range() just wants to apply a function over a range of ptes
> corresponding to a virtual address range. That's exactly what
> apply_to_page_range() does, so use it.
>
> Signed-off-by: Jeremy Fitzhardinge <[email protected]>
> ---
> mm/memory.c | 92
> ++++++++++++-----------------------------------------------
> 1 file changed, 20 insertions(+), 72 deletions(-)
>
> ===================================================================
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1472,69 +1472,20 @@
> }
> EXPORT_SYMBOL(vm_insert_mixed);
>
> -/*
> - * maps a range of physical memory into the requested pages. the old
> - * mappings are removed. any references to nonexistent pages results
> - * in null mappings (currently treated as "copy-on-access")
> - */
> -static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
> - unsigned long addr, unsigned long end,
> - unsigned long pfn, pgprot_t prot)
> +struct remap_data {
> + struct mm_struct *mm;
> + unsigned long pfn;
> + pgprot_t prot;
> +};
> +
> +static int remap_area_pte_fn(pte_t *ptep, pgtable_t token,
> + unsigned long addr, void *data)
> {
> - pte_t *pte;
> - spinlock_t *ptl;
> + struct remap_data *rmd = data;
> + pte_t pte = pte_mkspecial(pfn_pte(rmd->pfn++, rmd->prot));
>
> - pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
> - if (!pte)
> - return -ENOMEM;
> - arch_enter_lazy_mmu_mode();
> - do {
> - BUG_ON(!pte_none(*pte));

Dropping by intention?

> - set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
> - pfn++;
> - } while (pte++, addr += PAGE_SIZE, addr != end);
> - arch_leave_lazy_mmu_mode();
> - pte_unmap_unlock(pte - 1, ptl);
> - return 0;
> -}
> + set_pte_at(rmd->mm, addr, ptep, pte);
>
> -static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
> - unsigned long addr, unsigned long end,
> - unsigned long pfn, pgprot_t prot)
> -{
> - pmd_t *pmd;
> - unsigned long next;
> -
> - pfn -= addr >> PAGE_SHIFT;
> - pmd = pmd_alloc(mm, pud, addr);
> - if (!pmd)
> - return -ENOMEM;
> - do {
> - next = pmd_addr_end(addr, end);
> - if (remap_pte_range(mm, pmd, addr, next,
> - pfn + (addr >> PAGE_SHIFT), prot))
> - return -ENOMEM;
> - } while (pmd++, addr = next, addr != end);
> - return 0;
> -}
> -
> -static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
> - unsigned long addr, unsigned long end,
> - unsigned long pfn, pgprot_t prot)
> -{
> - pud_t *pud;
> - unsigned long next;
> -
> - pfn -= addr >> PAGE_SHIFT;
> - pud = pud_alloc(mm, pgd, addr);
> - if (!pud)
> - return -ENOMEM;
> - do {
> - next = pud_addr_end(addr, end);
> - if (remap_pmd_range(mm, pud, addr, next,
> - pfn + (addr >> PAGE_SHIFT), prot))
> - return -ENOMEM;
> - } while (pud++, addr = next, addr != end);
> return 0;
> }
>
> @@ -1551,10 +1502,9 @@
> int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
> unsigned long pfn, unsigned long size, pgprot_t prot)
> {
> - pgd_t *pgd;
> - unsigned long next;
> unsigned long end = addr + PAGE_ALIGN(size);
> struct mm_struct *mm = vma->vm_mm;
> + struct remap_data rmd;
> int err;
>
> /*
> @@ -1584,16 +1534,14 @@
> vma->vm_flags |= VM_IO | VM_RESERVED | VM_PFNMAP;
>
> BUG_ON(addr >= end);
> - pfn -= addr >> PAGE_SHIFT;
> - pgd = pgd_offset(mm, addr);
> - flush_cache_range(vma, addr, end);

Was the flushing redundant? I can't spot it reappearing anywhere.

Hannes

2008-11-13 20:12:28

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: implement remap_pfn_range with apply_to_page_range

Johannes Weiner wrote:
>> - do {
>> - BUG_ON(!pte_none(*pte));
>>
>
> Dropping by intention?
>

Hm, I couldn't really see the point. But I didn't really want to
introduce any functional changes with this patch, so I'll add it back.

>> BUG_ON(addr >= end);
>> - pfn -= addr >> PAGE_SHIFT;
>> - pgd = pgd_offset(mm, addr);
>> - flush_cache_range(vma, addr, end);
>>
>
> Was the flushing redundant? I can't spot it reappearing anywhere.

I guess it's needed for virtually indexed cache architectures; I'll add
it back.

Thanks for reviewing.

J

2008-11-13 20:13:46

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: [PATCH 3/2] mm/remap_pfn_range: restore missing flush

Restore the cache flush and BUG_ON removed in the conversion to using
apply_to_page_range().

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: Johannes Weiner <[email protected]>
---
mm/memory.c | 4 ++++
1 file changed, 4 insertions(+)

===================================================================
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1484,6 +1484,8 @@
struct remap_data *rmd = data;
pte_t pte = pte_mkspecial(pfn_pte(rmd->pfn++, rmd->prot));

+ BUG_ON(!pte_none(*ptep));
+
set_pte_at(rmd->mm, addr, ptep, pte);

return 0;
@@ -1535,6 +1537,8 @@

BUG_ON(addr >= end);

+ flush_cache_range(vma, addr, end);
+
rmd.mm = mm;
rmd.pfn = pfn;
rmd.prot = prot;

2008-11-14 02:20:23

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: implement remap_pfn_range with apply_to_page_range

On Friday 14 November 2008 04:19, Jeremy Fitzhardinge wrote:
> remap_pte_range() just wants to apply a function over a range of ptes
> corresponding to a virtual address range. That's exactly what
> apply_to_page_range() does, so use it.

This isn't performance critical to anyone?

I see DRM, IB, GRU, other media and video drivers use it.

It IS exactly what apply_to_page_range does, I grant you. But so does
our traditional set of nested loops. So is there any particular reason
to change it? You're not planning to change fork/exit next, are you? :)


> Signed-off-by: Jeremy Fitzhardinge <[email protected]>
> ---
> mm/memory.c | 92
> ++++++++++++----------------------------------------------- 1 file changed,
> 20 insertions(+), 72 deletions(-)
>
> ===================================================================
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1472,69 +1472,20 @@
> }
> EXPORT_SYMBOL(vm_insert_mixed);
>
> -/*
> - * maps a range of physical memory into the requested pages. the old
> - * mappings are removed. any references to nonexistent pages results
> - * in null mappings (currently treated as "copy-on-access")
> - */
> -static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
> - unsigned long addr, unsigned long end,
> - unsigned long pfn, pgprot_t prot)
> +struct remap_data {
> + struct mm_struct *mm;
> + unsigned long pfn;
> + pgprot_t prot;
> +};
> +
> +static int remap_area_pte_fn(pte_t *ptep, pgtable_t token,
> + unsigned long addr, void *data)
> {
> - pte_t *pte;
> - spinlock_t *ptl;
> + struct remap_data *rmd = data;
> + pte_t pte = pte_mkspecial(pfn_pte(rmd->pfn++, rmd->prot));
>
> - pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
> - if (!pte)
> - return -ENOMEM;
> - arch_enter_lazy_mmu_mode();
> - do {
> - BUG_ON(!pte_none(*pte));
> - set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
> - pfn++;
> - } while (pte++, addr += PAGE_SIZE, addr != end);
> - arch_leave_lazy_mmu_mode();
> - pte_unmap_unlock(pte - 1, ptl);
> - return 0;
> -}
> + set_pte_at(rmd->mm, addr, ptep, pte);
>
> -static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
> - unsigned long addr, unsigned long end,
> - unsigned long pfn, pgprot_t prot)
> -{
> - pmd_t *pmd;
> - unsigned long next;
> -
> - pfn -= addr >> PAGE_SHIFT;
> - pmd = pmd_alloc(mm, pud, addr);
> - if (!pmd)
> - return -ENOMEM;
> - do {
> - next = pmd_addr_end(addr, end);
> - if (remap_pte_range(mm, pmd, addr, next,
> - pfn + (addr >> PAGE_SHIFT), prot))
> - return -ENOMEM;
> - } while (pmd++, addr = next, addr != end);
> - return 0;
> -}
> -
> -static inline int remap_pud_range(struct mm_struct *mm, pgd_t *pgd,
> - unsigned long addr, unsigned long end,
> - unsigned long pfn, pgprot_t prot)
> -{
> - pud_t *pud;
> - unsigned long next;
> -
> - pfn -= addr >> PAGE_SHIFT;
> - pud = pud_alloc(mm, pgd, addr);
> - if (!pud)
> - return -ENOMEM;
> - do {
> - next = pud_addr_end(addr, end);
> - if (remap_pmd_range(mm, pud, addr, next,
> - pfn + (addr >> PAGE_SHIFT), prot))
> - return -ENOMEM;
> - } while (pud++, addr = next, addr != end);
> return 0;
> }
>
> @@ -1551,10 +1502,9 @@
> int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
> unsigned long pfn, unsigned long size, pgprot_t prot)
> {
> - pgd_t *pgd;
> - unsigned long next;
> unsigned long end = addr + PAGE_ALIGN(size);
> struct mm_struct *mm = vma->vm_mm;
> + struct remap_data rmd;
> int err;
>
> /*
> @@ -1584,16 +1534,14 @@
> vma->vm_flags |= VM_IO | VM_RESERVED | VM_PFNMAP;
>
> BUG_ON(addr >= end);
> - pfn -= addr >> PAGE_SHIFT;
> - pgd = pgd_offset(mm, addr);
> - flush_cache_range(vma, addr, end);
> - do {
> - next = pgd_addr_end(addr, end);
> - err = remap_pud_range(mm, pgd, addr, next,
> - pfn + (addr >> PAGE_SHIFT), prot);
> - if (err)
> - break;
> - } while (pgd++, addr = next, addr != end);
> +
> + rmd.mm = mm;
> + rmd.pfn = pfn;
> + rmd.prot = prot;
> +
> + err = apply_to_page_range(mm, addr, end - addr,
> + remap_area_pte_fn, &rmd);
> +
> return err;
> }
> EXPORT_SYMBOL(remap_pfn_range);

2008-11-14 02:56:19

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: implement remap_pfn_range with apply_to_page_range

Nick Piggin wrote:
>
> This isn't performance critical to anyone?
>

The only difference should be between having the specialized code and an
indirect function call, no?

> I see DRM, IB, GRU, other media and video drivers use it.
>
> It IS exactly what apply_to_page_range does, I grant you. But so does
> our traditional set of nested loops. So is there any particular reason
> to change it? You're not planning to change fork/exit next, are you? :)
>

No ;) But I need to have a more Xen-specific version of
remap_pfn_range, and I wanted to 1) have the two versions look as
similar as possible, and 2) not have a pile of duplicate code.

J

2008-11-14 03:17:54

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: implement remap_pfn_range with apply_to_page_range

On Friday 14 November 2008 13:56, Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:
> > This isn't performance critical to anyone?
>
> The only difference should be between having the specialized code and an
> indirect function call, no?

Indirect function call per pte. It's going to be slower surely.


> > I see DRM, IB, GRU, other media and video drivers use it.
> >
> > It IS exactly what apply_to_page_range does, I grant you. But so does
> > our traditional set of nested loops. So is there any particular reason
> > to change it? You're not planning to change fork/exit next, are you? :)
>
> No ;) But I need to have a more Xen-specific version of
> remap_pfn_range, and I wanted to 1) have the two versions look as
> similar as possible, and 2) not have a pile of duplicate code.

It is accepted practice to (carefully) duplicate the page table walking
functions in memory management code. I don't think that's a problem,
there is already so many instances of them (just be sure to stick to
exactly the same form and variable names, and any update or bugfix to
any of them is trivially applicable to all).

2008-11-14 05:23:14

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: implement remap_pfn_range with apply_to_page_range

Nick Piggin wrote:
> On Friday 14 November 2008 13:56, Jeremy Fitzhardinge wrote:
>
>> Nick Piggin wrote:
>>
>>> This isn't performance critical to anyone?
>>>
>> The only difference should be between having the specialized code and an
>> indirect function call, no?
>>
>
> Indirect function call per pte. It's going to be slower surely.
>

Yes, though changing the calling convention to handle (up to) a whole
page worth of ptes in one call would be fairly simple I think.

> It is accepted practice to (carefully) duplicate the page table walking
> functions in memory management code. I don't think that's a problem,
> there is already so many instances of them (just be sure to stick to
> exactly the same form and variable names, and any update or bugfix to
> any of them is trivially applicable to all).
>

I think that's pretty awful practice, frankly, and I'd much prefer there
to be a single iterator function which everyone uses. The open-coded
iterators everywhere just makes it completely impractical to even think
about other kinds of pagetable structures. (Of course we have at least
two "general purpose" pagetable walkers now...)

J

2008-11-14 07:35:37

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: implement remap_pfn_range with apply_to_page_range

On Friday 14 November 2008 16:22, Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:
> > On Friday 14 November 2008 13:56, Jeremy Fitzhardinge wrote:
> >> Nick Piggin wrote:
> >>> This isn't performance critical to anyone?
> >>
> >> The only difference should be between having the specialized code and an
> >> indirect function call, no?
> >
> > Indirect function call per pte. It's going to be slower surely.
>
> Yes, though changing the calling convention to handle (up to) a whole
> page worth of ptes in one call would be fairly simple I think.

Yep. And leaving it alone is even simpler and still faster :)


> > It is accepted practice to (carefully) duplicate the page table walking
> > functions in memory management code. I don't think that's a problem,
> > there is already so many instances of them (just be sure to stick to
> > exactly the same form and variable names, and any update or bugfix to
> > any of them is trivially applicable to all).
>
> I think that's pretty awful practice, frankly, and I'd much prefer there
> to be a single iterator function which everyone uses.

I think it's pretty nice. It means you can make the loops fairly
optimal even if they might have slightly different requirements
(different arguments, latency breaks, copy_page_range etc).


> The open-coded
> iterators everywhere just makes it completely impractical to even think
> about other kinds of pagetable structures. (Of course we have at least
> two "general purpose" pagetable walkers now...)

I think that's being way over dramatic. When switching to a
different page table structure, I assure you that copying and
pasting your new walking algorithm a few times will be the least
of your worries :)

It's not meant to be pluggable. Actually this came up last I think
when the UNSW wanted to add page table accessors to abstract this.
They came up with a good set of things, but in the end you can't
justify slowing things down in these paths unless you actually have
a replacement page table structure that gets you a *net win*. So
far, I haven't heard from them again.

No, adding a cycle here or an indirect function call there IMO is
not acceptable in core mm/ code without a good reason.

2008-11-14 18:04:25

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: implement remap_pfn_range with apply_to_page_range

Nick Piggin wrote:
> No, adding a cycle here or an indirect function call there IMO is
> not acceptable in core mm/ code without a good reason.
>

<shrug> OK.

J

2008-11-15 09:30:16

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: implement remap_pfn_range with apply_to_page_range

On Fri, 14 Nov 2008, Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:
> > No, adding a cycle here or an indirect function call there IMO is
> > not acceptable in core mm/ code without a good reason.
>
> <shrug> OK.

I'm with Nick on this: admittedly remap_pfn_range() is a borderline
case (since it has no latency breaks at present), but it is a core
mm function, and I'd prefer we leave it as is unless good reason.

So, no hurry, but I'd prefer

mm-implement-remap_pfn_range-with-apply_to_page_range.patch
mm-remap_pfn_range-restore-missing-flush.patch

to be removed from mmotm - and don't I deserve that,
just for actually reading the mm-commits boilerplate ;-?

Hugh

2008-11-17 03:03:41

by Peter Chubb

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: implement remap_pfn_range with apply_to_page_range

>>>>> "Nick" == Nick Piggin <[email protected]> writes:


Nick> It's not meant to be pluggable. Actually this came up last I
Nick> think when the UNSW wanted to add page table accessors to
Nick> abstract this. They came up with a good set of things, but in
Nick> the end you can't justify slowing things down in these paths
Nick> unless you actually have a replacement page table structure that
Nick> gets you a *net win*. So far, I haven't heard from them again.

We tried hard. The slowdown wasn't all that much on system
benchmarks, but you could see it on the microbenchmarks. And it made
stuff a LOT cleaner.

We wanted it to put super-page friendly pagetables in for
architectures not wedded to the 3/4 level x86 pagetable hardware
format.

The people I had working on this left (finished a masters,
finished a PhD, contract ran out and no more money), and I haven't had
the manpower to maintain the patchset, especially after the negative
responses we got from linux-MM.

--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au ERTOS within National ICT Australia