This patch adds a return value to the existing arch_free_page function
that indicates whether the normal free routine still has work to
do. The only architecture that currently uses arch_free_page is arch
'um'. arch-xen needs this for 'foreign pages' - pages that don't
belong to the page allocator but are instead managed by custom
allocators. Unlike with PG_reserved, we still need the normal reference
counting; we just need to control how the page is finally freed.
Signed-off-by: [email protected]
---
diff -ur linux-2.6.9/arch/um/kernel/physmem.c linux-2.6.9-new/arch/um/kernel/physmem.c
--- linux-2.6.9/arch/um/kernel/physmem.c 2004-10-18 22:54:37.000000000 +0100
+++ linux-2.6.9-new/arch/um/kernel/physmem.c 2004-11-16 17:37:32.919951978 +0000
@@ -225,7 +225,7 @@
EXPORT_SYMBOL(physmem_remove_mapping);
EXPORT_SYMBOL(physmem_subst_mapping);
-void arch_free_page(struct page *page, int order)
+int arch_free_page(struct page *page, int order)
{
void *virt;
int i;
@@ -234,6 +234,8 @@
virt = __va(page_to_phys(page + i));
physmem_remove_mapping(virt);
}
+
+ return 0;
}
int is_remapped(void *virt)
diff -ur linux-2.6.9/include/asm-um/page.h linux-2.6.9-new/include/asm-um/page.h
--- linux-2.6.9/include/asm-um/page.h 2004-10-18 22:55:36.000000000 +0100
+++ linux-2.6.9-new/include/asm-um/page.h 2004-11-16 17:37:55.471701151 +0000
@@ -46,7 +46,7 @@
extern struct page *arch_validate(struct page *page, int mask, int order);
#define HAVE_ARCH_VALIDATE
-extern void arch_free_page(struct page *page, int order);
+extern int arch_free_page(struct page *page, int order);
#define HAVE_ARCH_FREE_PAGE
#endif
diff -ur linux-2.6.9/include/linux/gfp.h linux-2.6.9-new/include/linux/gfp.h
--- linux-2.6.9/include/linux/gfp.h 2004-10-18 22:53:44.000000000 +0100
+++ linux-2.6.9-new/include/linux/gfp.h 2004-11-16 17:41:25.825723812 +0000
@@ -74,8 +74,16 @@
* optimized to &contig_page_data at compile-time.
*/
+/*
+ * If arch_free_page returns non-zero then the generic free_page code can
+ * immediately bail: the arch-specific function has done all the work.
+ */
#ifndef HAVE_ARCH_FREE_PAGE
-static inline void arch_free_page(struct page *page, int order) { }
+static inline int arch_free_page(struct page *page, int order)
+{
+ /* Generic free_page must do the work. */
+ return 0;
+}
#endif
extern struct page *
diff -ur linux-2.6.9/mm/page_alloc.c linux-2.6.9-new/mm/page_alloc.c
--- linux-2.6.9/mm/page_alloc.c 2004-10-18 22:53:11.000000000 +0100
+++ linux-2.6.9-new/mm/page_alloc.c 2004-11-16 17:42:23.793227175 +0000
@@ -275,7 +275,8 @@
LIST_HEAD(list);
int i;
- arch_free_page(page, order);
+ if (arch_free_page(page, order))
+ return;
mod_page_state(pgfree, 1 << order);
for (i = 0 ; i < (1 << order) ; ++i)
@@ -505,7 +506,8 @@
struct per_cpu_pages *pcp;
unsigned long flags;
- arch_free_page(page, 0);
+ if (arch_free_page(page, 0))
+ return;
kernel_map_pages(page, 1, 0);
inc_page_state(pgfree);
On Wed, 2004-11-17 at 15:48, Ian Pratt wrote:
> This patch adds a return value to the existing arch_free_page function
> that indicates whether the normal free routine still has work to
> do. The only architecture that currently uses arch_free_page is arch
> 'um'. arch-xen needs this for 'foreign pages' - pages that don't
> belong to the page allocator but are instead managed by custom
> allocators.
But, you're modifying page allocator functions to do this. Why would
you call __free_pages_ok() on a page that didn't belong to the page
allocator?
-- Dave
> On Wed, 2004-11-17 at 15:48, Ian Pratt wrote:
> > This patch adds a return value to the existing arch_free_page function
> > that indicates whether the normal free routine still has work to
> > do. The only architecture that currently uses arch_free_page is arch
> > 'um'. arch-xen needs this for 'foreign pages' - pages that don't
> > belong to the page allocator but are instead managed by custom
> > allocators.
>
> But, you're modifying page allocator functions to do this. Why would
> you call __free_pages_ok() on a page that didn't belong to the page
> allocator?
Pages that have been allocated by our custom allocators get passed
into standard linux subsystems where we get no control over how
they're freed. We want the normal page ref counting etc to happen
as per normal; we just want to intercept the final free so that
we can return it to our allocator rather than the standard one.
We use custom allocators in a number of places, most notably for
the pages storing the packet data fragments that are pointed to
by skbs. This enables us to provide guest virtual machines with
zero-copy access to the network, which is a big performance win.
The existing arch_free_page mechanism very nearly does what we
want, we just need to add the 'early exit'.
Thanks,
Ian
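For illustration, here is a minimal sketch of the kind of arch-xen
implementation being described above. It is not the actual Xen code:
PageForeign() and xen_foreign_page_free() are hypothetical names for the
foreign-page test and the custom-allocator hand-back.

int arch_free_page(struct page *page, int order)
{
	/* Ordinary pages: let the generic buddy free do all the work. */
	if (!PageForeign(page))
		return 0;

	/*
	 * Foreign page: it never belonged to this kernel's buddy
	 * allocator, so hand it back to its real owner and tell the
	 * generic free path to bail out early.
	 */
	BUG_ON(order != 0);	/* foreign pages are single frames here */
	xen_foreign_page_free(page);
	return 1;
}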
On Wed, 2004-11-17 at 17:19, Ian Pratt wrote:
> > On Wed, 2004-11-17 at 15:48, Ian Pratt wrote:
> > > This patch adds a return value to the existing arch_free_page function
> > > that indicates whether the normal free routine still has work to
> > > do. The only architecture that currently uses arch_free_page is arch
> > > 'um'. arch-xen needs this for 'foreign pages' - pages that don't
> > > belong to the page allocator but are instead managed by custom
> > > allocators.
> >
> > But, you're modifying page allocator functions to do this. Why would
> > you call __free_pages_ok() on a page that didn't belong to the page
> > allocator?
>
> Pages that have been allocated by our custom allocators get passed
> into standard linux subsystems where we get no control over how
> they're freed.
OK, then how are they allocated? Are they only allocated by your
special drivers or in your architecture?
I think allowing this is weird:
foo = special_xen_malloc();
bar = kmalloc();
kfree(foo);
kfree(bar);
Shouldn't it be
special_xen_free(foo);?
BTW, your arch-specific definition of PG_foreign is a little goofy. If
you need the bit, then put it in page-flags.h now. The memory hotplug
people have an evil plan to consume all unused bits in page->flags
(determined at compile-time) and that little gem will certainly break
it.
It's great that you contained stuff to your architecture, but little
bits like the page-flags thing don't make me think you've done it for
real, only hacked around it :)
-- Dave
[email protected] said:
> Pages that have been allocated by our custom allocators get passed
> into standard linux subsystems where we get no control over how
> they're freed. We want the normal page ref counting etc to happen as
> per normal; we just want to intercept the final free so that we can
> return it to our allocator rather than the standard one.
I have to agree with Dave - this is just a weird solution. I added
arch_free_page to do arch-specific, invisible-to-the-generic-kernel things.
My intent may not be the be-all and end-all for this, but I think the semantics
you want to add to it are not that reasonable.
My gut reaction (without knowing your problem in any detail) would be that
you need to add some more structure to whatever mechanism you have
so that the pages land in your allocator automatically, like a slab or a new
zone or something.
Jeff
Jeff Dike <[email protected]> wrote:
>
> [email protected] said:
> > Pages that have been allocated by our custom allocators get passed
> > into standard linux subsystems where we get no control over how
> > they're freed. We want the normal page ref counting etc to happen as
> > per normal; we just want to intercept the final free so that we can
> > return it to our allocator rather than the standard one.
>
> I have to agree with Dave - this is just a weird solution. I added
> arch_free_page to do arch-specific, invisible-to-the-generic-kernel things.
> My intent may not be the be-all and end-all for this, but I think the semantics
> you want to add to it are not that reasonable.
>
> My gut reaction (without knowing your problem in any detail) would be that
> you need to add some more structure to whatever mechanism you have
> so that the pages land in your allocator automatically, like a slab or a new
> zone or something.
>
I can't immediately think of any way of doing that.
One could perhaps hide it with
#ifdef CONFIG_XEN
#define __free_pages_ok xen_free_page
#else
#define __free_pages_ok everybody_else_free_page
#endif
and rename __free_pages_ok() to everybody_else_free_page().
But heck - why bother? The current patch adds just one line of code in one
place, and the compiler will toss it away anyway for all but xen and um.
I think we can live with that.
[email protected] said:
> But heck - why bother? The current patch adds just one line of code
> in one place, and the compiler will toss it away anyway for all but
> xen and um.
Agree, but on a conceptual level, this is allowing the arch to swipe pages
arbitrarily out from under the nose of the generic page allocator. Plus, the
code that's being bypassed has to be implemented in some form in the arch.
That's not my concern, or yours, necessarily, but it does suggest that the
interface is being abused.
Jeff
Jeff Dike <[email protected]> wrote:
>
> [email protected] said:
> > But heck - why bother? The current patch adds just one line of code
> > in one place, and the compiler will toss it away anyway for all but
> > xen and um.
>
> Agree, but on a conceptual level, this is allowing the arch to swipe pages
> arbitrarily out from under the nose of the generic page allocator. Plus, the
> code that's being bypassed has to be implemented in some form in the arch.
> That's not my concern, or yours, necessarily, but it does suggest that the
> interface is being abused.
>
True. But we can take a look at that once we've seen the rest of the xen
patches. I wouldn't be merging up any of these patches until we've seen
the whole submission.
One tiny suggestion...
Ian Pratt wrote:
> -void arch_free_page(struct page *page, int order)
> +int arch_free_page(struct page *page, int order)
How about just changing that to...
void __arch_free_page(struct page *page, int order)
... and leave the rest of the function alone. Then:
> -extern void arch_free_page(struct page *page, int order);
> +extern int arch_free_page(struct page *page, int order);
Do...
extern void __arch_free_page(struct page *page, int order);
#define arch_free_page(page, order) (__arch_free_page((page), (order)), 0)
That way the compiler can omit the "if (...) return" even on UML.
-Mitch
> > Pages that have been allocated by our custom allocators get passed
> > into standard linux subsystems where we get no control over how
> > they're freed.
>
> OK, then how are they allocated? Are they only allocated by your
> special drivers or in your architecture?
Yes, they are only allocated within our own code. I use "allocation"
loosely here -- in the case of our network backend driver, the page
fragments that we pass into the network subsystem are actually
temporary mappings of other domains' RAM -- not allocated from this
kernel's heap at all.
> I think allowing this is weird:
>
> > foo = special_xen_malloc();
> bar = kmalloc();
> kfree(foo);
> kfree(bar);
>
> Shouldn't it be
>
> special_xen_free(foo);?
We don't hook into the slab allocator, but into the page_alloc zoned
buddy allocator: so in your example replace kmalloc() with
get_free_page(), and kfree() with free_page(). The point is that we
can't always use 'special_xen_free_page()' because the page's refcnt
may fall to zero within a generic Linux subsystem (e.g., the network
stack). So we need to be able to hook free_page().
It would be great if we could have a "destructor hook" per page that
points at the relevant heap-free function to use, but that'd never fly
here. This is the best we can do with absolutely minimal source
impact, and no run-time impact on other arches.
> BTW, your arch-specific definition of PG_foreign is a little goofy. If
> you need the bit, then put it in page-flags.h now. The memory hotplug
> people have an evil plan to consume all unused bits in page->flags
> (determined at compile-time) and that little gem will certainly break
> it.
The whole of our foreign_page.h was originally in page-flags.h. Do you think the
whole lot should go back, or just the bit definition?
> It's great that you contained stuff to your architecture, but little
> bits like the page-flags thing don't make me think you've done it for
> real, only hacked around it :)
We hacked around it in this case because the PageForeign stuff was
entirely within page_alloc.c and page-flags.h. Since only Xen used it
we were asked if we could have less non-arch source impact and this is
what we came up with. :-) Point taken on PG_foreign though.
-- Keir
Keir Fraser <[email protected]> wrote:
>
> PG_foreign
Is Xen using PG_arch_1? If not, it can be used for this.
> Is Xen using PG_arch_1? If not, it can be used for this.
No, we're not. This sounds like an excellent solution.
We'll resync all the patches with the latest snapshot and then re-submit
the lot.
Sorry about the tabs-to-spaces screwup. We forwarded the patches around
to so many different people for comment before sending them to lkml that
somewhere along the line something bad happened.
Ian
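For illustration, reusing PG_arch_1 might look roughly like this. The
PageForeign-style helper names are hypothetical, simply following the
pattern of the other Page* macros in include/linux/page-flags.h:

/* In an arch-xen header, after including <linux/page-flags.h>: */
#define PG_foreign		PG_arch_1	/* reuse the arch-private flag bit */

#define PageForeign(page)	test_bit(PG_foreign, &(page)->flags)
#define SetPageForeign(page)	set_bit(PG_foreign, &(page)->flags)
#define ClearPageForeign(page)	clear_bit(PG_foreign, &(page)->flags)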
"Ian Pratt" <[email protected]> wrote:
>
> We forwarded the patches around
> to so many different people for comment before sending them to lkml that
> somewhere along the line something bad happened.
Just send 'em to linux-kernel first-up and cc everyone else. That way you
avoid duplication of effort and everyone is on the same page.
I'm still struggling to understand the rationale behind the mem.c change
btw. io_remap_page_range() _is_ remap_page_range() (or, now,
remap_pfn_range()) on x86. So whatever the patch is supposed to be doing,
it's a no-op.
On Thu, Nov 18, 2004 at 02:14:19AM -0800, Andrew Morton wrote:
> Just send 'em to linux-kernel first-up and cc everyone else. That way you
> avoid duplication of effort and everyone is on the same page.
> I'm still struggling to understand the rationale behind the mem.c change
> btw. io_remap_page_range() _is_ remap_page_range() (or, now,
> remap_pfn_range()) on x86. So whatever the patch is supposed to be doing,
> it's a no-op.
Sorry for stepping in so late. io_remap_page_range() has an
architecture-specific calling convention. It can't be used in code
shared across architectures where the calling conventions differ until that's
resolved (which I intend to do, though I'm not going to stop anyone
from doing it before I get around to it).
As drivers/char/mem.c is shared across all architectures, using
io_remap_page_range() there now is unsound and some other method to
accomplish whatever it does beyond remap_pfn_range() must be found
unless the io_remap_page_range() calling convention is resolved first.
-- wli
> "Ian Pratt" <[email protected]> wrote:
> >
> > We forwarded the patches around
> > to so many different people for comment before sending them to lkml that
> > somewhere along the line something bad happened.
>
> Just send 'em to linux-kernel first-up and cc everyone else. That way you
> avoid duplication of effort and everyone is on the same page.
>
> I'm still struggling to understand the rationale behind the mem.c change
> btw. io_remap_page_range() _is_ remap_page_range() (or, now,
> remap_pfn_range()) on x86. So whatever the patch is supposed to be doing,
> it's a no-op.
We need to sync up to the current BK tree and see what needs to be
done. If remap_pfn_range() is arch-dep and only used in contexts where
you want to remap real physical address ranges (not "kernel physical"
address ranges) then we can reimplement remap_pfn_range() and remove
that CONFIG_XEN part of mem.c.
-- Keir
"Ian Pratt" <[email protected]> wrote:
>> Just send 'em to linux-kernel first-up and cc everyone else. That way you
>> avoid duplication of effort and everyone is on the same page.
>> I'm still struggling to understand the rationale behind the mem.c change
>> btw. io_remap_page_range() _is_ remap_page_range() (or, now,
>> remap_pfn_range()) on x86. So whatever the patch is supposed to be doing,
>> it's a no-op.
On Thu, Nov 18, 2004 at 10:18:28AM +0000, Keir Fraser wrote:
> We need to sync up to the current BK tree and see what needs to be
> done. If remap_pfn_range() is arch-dep and only used in contexts where
> you want to remap real physical address ranges (not "kernel physical"
> address ranges) then we can reimplement remap_pfn_range() and remove
> that CONFIG_XEN part of mem.c.
It is not so. It uses pfn_to_page() etc. which means it must be aligned
with kernel physical addresses. Xen's requirement is highly unusual. It
may unfortunately merit another #ifdef in drivers/char/mem.c, as using
io_remap_page_range() there without one will break some architectures.
-- wli
> On Thu, Nov 18, 2004 at 02:14:19AM -0800, Andrew Morton wrote:
> > Just send 'em to linux-kernel first-up and cc everyone else. That way you
> > avoid duplication of effort and everyone is on the same page.
> > I'm still struggling to understand the rationale behind the mem.c change
> > btw. io_remap_page_range() _is_ remap_page_range() (or, now,
> > remap_pfn_range()) on x86. So whatever the patch is supposed to be doing,
> > it's a no-op.
>
> Sorry for stepping in so late. io_remap_page_range() has an
> architecture-specific calling convention. It can't be used in code
> shared across architectures where the calling conventions differ until that's
> resolved (which I intend to do, though I'm not going to stop anyone
> from doing it before I get around to it).
Having pulled the latest snapshot, it's good to see that
remap_pfn_range has cleaned things up a bit. However, it doesn't
solve our problem.
In arch xen we need to use a different function for mapping MMIO
or BIOS pages, which is the /dev/mem behaviour we need to
support.
I'm not sure we can do this without changing the call in mem.c,
at least not without adding an extra hook inside remap_pfn_range
that allows us to use an alternative to set_pte e.g. slow_set_pte
that tries to figure out whether the pfn is real memory or
not. Personally I think the mem.c #ifdef is cleaner and more
robust.
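For illustration, the #ifdef being discussed for mmap_mem() in
drivers/char/mem.c might look roughly like the sketch below. This is not
the actual Xen patch; it assumes the five-argument io_remap_page_range()
convention that i386 used at the time and omits the caching/protection
checks the real function performs.

static int mmap_mem(struct file *file, struct vm_area_struct *vma)
{
#if defined(CONFIG_XEN)
	/*
	 * MMIO and BIOS frames are not covered by this kernel's
	 * pseudo-physical memory map, so the arch-specific mapping
	 * routine must be used for /dev/mem.
	 */
	if (io_remap_page_range(vma, vma->vm_start,
				vma->vm_pgoff << PAGE_SHIFT,
				vma->vm_end - vma->vm_start,
				vma->vm_page_prot))
		return -EAGAIN;
#else
	if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
			    vma->vm_end - vma->vm_start,
			    vma->vm_page_prot))
		return -EAGAIN;
#endif
	return 0;
}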
I'm not sure I understand the issue about io_remap_page_range
having an architecture-specific calling convention. Please can
you enlighten me.
Thanks,
Ian
On Thu, Nov 18, 2004 at 12:51:12PM +0000, Ian Pratt wrote:
> Having pulled the latest snapshot, it's good to see that
> remap_pfn_range has cleaned things up a bit. However, it doesn't
> solve our problem.
> In arch xen we need to use a different function for mapping MMIO
> or BIOS pages, which is the /dev/mem behaviour we need to
> support.
> I'm not sure we can do this without changing the call in mem.c,
> at least not without adding an extra hook inside remap_pfn_range
> that allows us to use an alternative to set_pte e.g. slow_set_pte
> that tries to figure out whether the pfn is real memory or
> not. Personally I think the mem.c #ifdef is cleaner and more
> robust.
> I'm not sure I understand the issue about io_remap_page_range
> having an architecture-specific calling convention. Please can
> you enlighten me.
On some architectures it takes 5 arguments, and on others, 6.
It won't compile everywhere without an #ifdef.
-- wli
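For reference, the two conventions looked roughly like this at the time,
with sparc/sparc64 being the six-argument case:

/* i386 and most other architectures: five arguments, essentially an
 * alias for remap_page_range()/remap_pfn_range(). */
int io_remap_page_range(struct vm_area_struct *vma, unsigned long from,
			unsigned long phys_addr, unsigned long size,
			pgprot_t prot);

/* sparc/sparc64: six arguments, the extra 'space' selecting which I/O
 * address space the offset refers to. */
int io_remap_page_range(struct vm_area_struct *vma, unsigned long from,
			unsigned long offset, unsigned long size,
			pgprot_t prot, int space);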