2007-02-16 04:50:41

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: [patch 12/21] Xen-paravirt: Allocate and free vmalloc areas

Allocate/destroy a 'vmalloc' VM area: alloc_vm_area and free_vm_area
The alloc function ensures that page tables are constructed for the
region of kernel virtual address space and mapped into init_mm.

Lock an area so that PTEs are accessible in the current address space:
lock_vm_area and unlock_vm_area. The lock function prevents context
switches to a lazy mm that doesn't have the area mapped into its page
tables. It also ensures that the page tables are mapped into the
current mm by causing the page fault handler to copy the page
directory pointers from init_mm into the current mm.

These functions are not particularly Xen-specific, so they're put into
mm/vmalloc.c.

Signed-off-by: Ian Pratt <[email protected]>
Signed-off-by: Christian Limpach <[email protected]>
Signed-off-by: Chris Wright <[email protected]>
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: "Jan Beulich" <[email protected]>

--
include/linux/vmalloc.h | 8 +++++
mm/vmalloc.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 70 insertions(+)

===================================================================
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -68,6 +68,14 @@ extern int map_vm_area(struct vm_struct
struct page ***pages);
extern void unmap_vm_area(struct vm_struct *area);

+/* Allocate/destroy a 'vmalloc' VM area. */
+extern struct vm_struct *alloc_vm_area(unsigned long size);
+extern void free_vm_area(struct vm_struct *area);
+
+/* Lock an area so that PTEs are accessible in the current address space. */
+extern void lock_vm_area(struct vm_struct *area);
+extern void unlock_vm_area(struct vm_struct *area);
+
/*
* Internals. Dont't use..
*/
===================================================================
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -747,3 +747,65 @@ out_einval_locked:
}
EXPORT_SYMBOL(remap_vmalloc_range);

+static int f(pte_t *pte, struct page *pmd_page, unsigned long addr, void *data)
+{
+ /* apply_to_page_range() does all the hard work. */
+ return 0;
+}
+
+struct vm_struct *alloc_vm_area(unsigned long size)
+{
+ struct vm_struct *area;
+
+ area = get_vm_area(size, VM_IOREMAP);
+ if (area == NULL)
+ return NULL;
+
+ /*
+ * This ensures that page tables are constructed for this region
+ * of kernel virtual address space and mapped into init_mm.
+ */
+ if (apply_to_page_range(&init_mm, (unsigned long)area->addr,
+ area->size, f, NULL)) {
+ free_vm_area(area);
+ return NULL;
+ }
+
+ return area;
+}
+EXPORT_SYMBOL_GPL(alloc_vm_area);
+
+void free_vm_area(struct vm_struct *area)
+{
+ struct vm_struct *ret;
+ ret = remove_vm_area(area->addr);
+ BUG_ON(ret != area);
+ kfree(area);
+}
+EXPORT_SYMBOL_GPL(free_vm_area);
+
+void lock_vm_area(struct vm_struct *area)
+{
+ unsigned long i;
+ char c;
+
+ /*
+ * Prevent context switch to a lazy mm that doesn't have this area
+ * mapped into its page tables.
+ */
+ preempt_disable();
+
+ /*
+ * Ensure that the page tables are mapped into the current mm. The
+ * page-fault path will copy the page directory pointers from init_mm.
+ */
+ for (i = 0; i < area->size; i += PAGE_SIZE)
+ (void)__get_user(c, (char __user *)area->addr + i);
+}
+EXPORT_SYMBOL_GPL(lock_vm_area);
+
+void unlock_vm_area(struct vm_struct *area)
+{
+ preempt_enable();
+}
+EXPORT_SYMBOL_GPL(unlock_vm_area);

--


2007-02-16 06:44:56

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 12/21] Xen-paravirt: Allocate and free vmalloc areas

On Thu, 15 Feb 2007 18:25:01 -0800 Jeremy Fitzhardinge <[email protected]> wrote:

> +void lock_vm_area(struct vm_struct *area)
> +{
> + unsigned long i;
> + char c;
> +
> + /*
> + * Prevent context switch to a lazy mm that doesn't have this area
> + * mapped into its page tables.
> + */
> + preempt_disable();
> +
> + /*
> + * Ensure that the page tables are mapped into the current mm. The
> + * page-fault path will copy the page directory pointers from init_mm.
> + */
> + for (i = 0; i < area->size; i += PAGE_SIZE)
> + (void)__get_user(c, (char __user *)area->addr + i);
> +}
> +EXPORT_SYMBOL_GPL(lock_vm_area);

This won't work when CONFIG_PREEMPT=y. The pagefault handler will see
in_atomic() and will scram.

(pet-peeve-from-someone-who-remembers-fortran: the reader expects the
variable `i' to be signed. signed int really)

2007-02-16 07:08:05

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [patch 12/21] Xen-paravirt: Allocate and free vmalloc areas

Andrew Morton wrote:
> This won't work when CONFIG_PREEMPT=y. The pagefault handler will see
> in_atomic() and will scram.
>

Is there some other way to get the pagetable populated for the address
range?

J

2007-02-16 07:25:57

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 12/21] Xen-paravirt: Allocate and free vmalloc areas

On Thu, 15 Feb 2007 23:08:02 -0800 Jeremy Fitzhardinge <[email protected]> wrote:

> Andrew Morton wrote:
> > This won't work when CONFIG_PREEMPT=y. The pagefault handler will see
> > in_atomic() and will scram.
> >
>
> Is there some other way to get the pagetable populated for the address
> range?
>

If you really need to run atomically, that gets ugly. Even of one were to
run handle_mm_fault() by hand, it still needs to allocate memory.

Two ugly options might be:

a) touch all the pages, then go atomic, then touch them all again. If
one of them faults (ie: you raced with swapout) then go back and try
again. Obviously susceptible to livelocking.

b) Do get_user_pages() against all the pages, then go atomic, then do
put_page() against them all. Of course, they can immediately get
swapped out.

But that function's already racy against swapout and I guess it works OK.
I don't have clue what it is actually trying to do, so I'm guessing madly
here.

2007-02-16 07:30:57

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [patch 12/21] Xen-paravirt: Allocate and free vmalloc areas

Andrew Morton wrote:
> On Thu, 15 Feb 2007 23:08:02 -0800 Jeremy Fitzhardinge <[email protected]> wrote:
>
>
>> Andrew Morton wrote:
>>
>>> This won't work when CONFIG_PREEMPT=y. The pagefault handler will see
>>> in_atomic() and will scram.
>>>
>>>
>> Is there some other way to get the pagetable populated for the address
>> range?
>>
>>
>
> If you really need to run atomically, that gets ugly. Even of one were to
> run handle_mm_fault() by hand, it still needs to allocate memory.
>
> Two ugly options might be:
>
> a) touch all the pages, then go atomic, then touch them all again. If
> one of them faults (ie: you raced with swapout) then go back and try
> again. Obviously susceptible to livelocking.
>
> b) Do get_user_pages() against all the pages, then go atomic, then do
> put_page() against them all. Of course, they can immediately get
> swapped out.
>
> But that function's already racy against swapout and I guess it works OK.
> I don't have clue what it is actually trying to do, so I'm guessing madly
> here.
>

It's for populating the pagetable in a vmalloc area. There's magic in
the fault handler to synchronize the vmalloc mappings between different
process's kernel mappings, so if the mapping isn't currently present, it
will fault and create the appropriate mapping. It's not operating on
swappable user memory, so swapping isn't an issue; but if the fault
handler exits immediately with preempt disabled, then there's a problem.

(Even though we don't support preempt, I've been generally coding in a
preempt-safe way, just so that the code looks "normal". And maybe we
will support preempt at some point.)

J

2007-02-16 07:49:05

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 12/21] Xen-paravirt: Allocate and free vmalloc areas

Andrew Morton wrote:
> On Thu, 15 Feb 2007 23:08:02 -0800 Jeremy Fitzhardinge <[email protected]> wrote:
>
>
>>Andrew Morton wrote:
>>
>>>This won't work when CONFIG_PREEMPT=y. The pagefault handler will see
>>>in_atomic() and will scram.
>>>
>>
>>Is there some other way to get the pagetable populated for the address
>>range?
>>
>
>
> If you really need to run atomically, that gets ugly. Even of one were to
> run handle_mm_fault() by hand, it still needs to allocate memory.
>
> Two ugly options might be:
>
> a) touch all the pages, then go atomic, then touch them all again. If
> one of them faults (ie: you raced with swapout) then go back and try
> again. Obviously susceptible to livelocking.
>
> b) Do get_user_pages() against all the pages, then go atomic, then do
> put_page() against them all. Of course, they can immediately get
> swapped out.
>
> But that function's already racy against swapout and I guess it works OK.
> I don't have clue what it is actually trying to do, so I'm guessing madly
> here.

Well its only operating on kernel pages, and against a vmalloc region.
So it would be immune to any sort of unmapping or swapout.

Andrew's option a) should work. What's this for, and how is it not Xen
specific if nothing else in the tree needs such a weird hack?

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2007-02-16 07:50:40

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 12/21] Xen-paravirt: Allocate and free vmalloc areas

On Thu, 15 Feb 2007 23:30:57 -0800 Jeremy Fitzhardinge <[email protected]> wrote:

> > If you really need to run atomically, that gets ugly. Even of one were to
> > run handle_mm_fault() by hand, it still needs to allocate memory.
> >
> > Two ugly options might be:
> >
> > a) touch all the pages, then go atomic, then touch them all again. If
> > one of them faults (ie: you raced with swapout) then go back and try
> > again. Obviously susceptible to livelocking.
> >
> > b) Do get_user_pages() against all the pages, then go atomic, then do
> > put_page() against them all. Of course, they can immediately get
> > swapped out.
> >
> > But that function's already racy against swapout and I guess it works OK.
> > I don't have clue what it is actually trying to do, so I'm guessing madly
> > here.
> >
>
> It's for populating the pagetable in a vmalloc area. There's magic in
> the fault handler to synchronize the vmalloc mappings between different
> process's kernel mappings, so if the mapping isn't currently present, it
> will fault and create the appropriate mapping. It's not operating on
> swappable user memory, so swapping isn't an issue; but if the fault
> handler exits immediately with preempt disabled, then there's a problem.
>

oh, I see. The vmalloc fault can run atomically. In fact it can run at
hard iRQ. So no probs (apart from the fact that it required an email
dialogue to work this out rather than reading the code, but I do go on).

2007-02-16 09:18:46

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 12/21] Xen-paravirt: Allocate and free vmalloc areas

> It's for populating the pagetable in a vmalloc area. There's magic in

If the lazy setup doesn't work for you you can always call vmalloc_sync()
early.

-Andi

2007-02-16 11:10:14

by Keir Fraser

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [patch 12/21] Xen-paravirt: Allocate and free vmalloc areas




On 16/2/07 09:18, "Andi Kleen" <[email protected]> wrote:

>> It's for populating the pagetable in a vmalloc area. There's magic in
>
> If the lazy setup doesn't work for you you can always call vmalloc_sync()
> early.

vmalloc_sync_all()? That's a great idea. We can put that in alloc_vm_area()
and kill {lock,unlock}_vm_area() which is a nice interface simplification.

Thanks!
Keir

2007-02-16 11:34:39

by Andi Kleen

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [patch 12/21] Xen-paravirt: Allocate and free vmalloc areas

On Friday 16 February 2007 12:10, Keir Fraser wrote:
>
> On 16/2/07 09:18, "Andi Kleen" <[email protected]> wrote:
>
> >> It's for populating the pagetable in a vmalloc area. There's magic in
> >
> > If the lazy setup doesn't work for you you can always call vmalloc_sync()
> > early.
>
> vmalloc_sync_all()?

Yes.

Credit goes to Jan for writing it originally.

-Andi

2007-02-16 16:46:45

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [patch 12/21] Xen-paravirt: Allocate and free vmalloc areas

Andi Kleen wrote:
>> It's for populating the pagetable in a vmalloc area. There's magic in
>>
>
> If the lazy setup doesn't work for you you can always call vmalloc_sync()
> early.
>

Yes, that would work. Unfortunately that's i386 arch-specific, whereas
the rest of this code is generic. I guess I could just move it all to
arch/i386/mm.

J

2007-02-16 17:10:33

by Keir Fraser

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [patch 12/21] Xen-paravirt: Allocate and free vmalloc areas




On 16/2/07 16:46, "Jeremy Fitzhardinge" <[email protected]> wrote:

> Yes, that would work. Unfortunately that's i386 arch-specific, whereas
> the rest of this code is generic. I guess I could just move it all to
> arch/i386/mm.

This whole thing isn't an issue on ia64 (they no-op lock_vm_area) and
powerpc doesn't use any of the Xen driver code at this time.
vmalloc_sync_all is supported by both i386 and x86_64, so we can make the
call conditional on CONFIG_X86 so that ia64 will continue to build. This is
what I've done in xen-unstable.

-- Keir

2007-02-16 17:12:36

by Keir Fraser

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [patch 12/21] Xen-paravirt: Allocate and free vmalloc areas

On 16/2/07 17:10, "Keir Fraser" <[email protected]> wrote:

> On 16/2/07 16:46, "Jeremy Fitzhardinge" <[email protected]> wrote:
>
>> Yes, that would work. Unfortunately that's i386 arch-specific, whereas
>> the rest of this code is generic. I guess I could just move it all to
>> arch/i386/mm.
>
> This whole thing isn't an issue on ia64 (they no-op lock_vm_area) and
> powerpc doesn't use any of the Xen driver code at this time.
> vmalloc_sync_all is supported by both i386 and x86_64, so we can make the
> call conditional on CONFIG_X86 so that ia64 will continue to build. This is
> what I've done in xen-unstable.


In fact that file is only built for i386 and x86_64, so there really is no
problem with using vmalloc_sync_all() directly and without ifdef.

-- Keir


2007-02-16 17:27:05

by Hollis Blanchard

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [patch 12/21] Xen-paravirt: Allocate and free vmalloc areas

On Fri, 2007-02-16 at 17:10 +0000, Keir Fraser wrote:
>
>
> On 16/2/07 16:46, "Jeremy Fitzhardinge" <[email protected]> wrote:
>
> > Yes, that would work. Unfortunately that's i386 arch-specific, whereas
> > the rest of this code is generic. I guess I could just move it all to
> > arch/i386/mm.
>
> This whole thing isn't an issue on ia64 (they no-op lock_vm_area) and
> powerpc doesn't use any of the Xen driver code at this time.

Not sure what you mean? PowerPC uses pretty much all of the Xen driver
code: event channels, blkfront/back, netfront/back, console, etc.

--
Hollis Blanchard
IBM Linux Technology Center

2007-02-16 17:27:42

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [patch 12/21] Xen-paravirt: Allocate and free vmalloc areas

Keir Fraser wrote:
> On 16/2/07 17:10, "Keir Fraser" <[email protected]> wrote:
>
>
>> On 16/2/07 16:46, "Jeremy Fitzhardinge" <[email protected]> wrote:
>>
>>
>>> Yes, that would work. Unfortunately that's i386 arch-specific, whereas
>>> the rest of this code is generic. I guess I could just move it all to
>>> arch/i386/mm.
>>>
>> This whole thing isn't an issue on ia64 (they no-op lock_vm_area) and
>> powerpc doesn't use any of the Xen driver code at this time.
>> vmalloc_sync_all is supported by both i386 and x86_64, so we can make the
>> call conditional on CONFIG_X86 so that ia64 will continue to build. This is
>> what I've done in xen-unstable.
>>
>
>
> In fact that file is only built for i386 and x86_64, so there really is no
> problem with using vmalloc_sync_all() directly and without ifdef.
>

I had moved it to mm/vmalloc.c in response to previous review comments
(namely, its not Xen specific, so it shouldn't live in the Xen part of
the tree).

J

2007-02-16 19:07:07

by Keir Fraser

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [patch 12/21] Xen-paravirt: Allocate and free vmalloc areas

On 16/2/07 17:27, "Jeremy Fitzhardinge" <[email protected]> wrote:

> In fact that file is only built for i386 and x86_64, so there really is no
>> problem with using vmalloc_sync_all() directly and without ifdef.
>>
>
> I had moved it to mm/vmalloc.c in response to previous review comments
> (namely, its not Xen specific, so it shouldn't live in the Xen part of
> the tree).

Then the call will have to be CONFIG_X86. I hadn't realised powerpc were
also using lock_vm_area. However I suspect that the x86 issue that those
functions were written doesn't even exist on powerpc, or any other non-x86
architecture.

I guess we should change the name of alloc_vm_area, by the way. Perhaps
alloc_vm_area_sync() or similar, to give some hint of what it's doing and
how it differs from other vmalloc-area functions?

-- Keir

2007-02-16 19:20:59

by Keir Fraser

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [patch 12/21] Xen-paravirt: Allocate and free vmalloc areas

On 16/2/07 19:06, "Keir Fraser" <[email protected]> wrote:

>> I had moved it to mm/vmalloc.c in response to previous review comments
>> (namely, its not Xen specific, so it shouldn't live in the Xen part of
>> the tree).
>
> Then the call will have to be CONFIG_X86. I hadn't realised powerpc were
> also using lock_vm_area. However I suspect that the x86 issue that those
> functions were written doesn't even exist on powerpc, or any other non-x86
> architecture.

Hmmm... Actually looks like a bunch of architectures do lazy sync of the
vmalloc area, although neither ia64 nor powerpc does so. However, all
current users of the alloc_vm_area() function would be okay since none of
the other lazy-syncing architectures are supported by Xen.

-- Keir


2007-02-16 19:26:37

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [patch 12/21] Xen-paravirt: Allocate and free vmalloc areas

Keir Fraser wrote:
> Hmmm... Actually looks like a bunch of architectures do lazy sync of the
> vmalloc area, although neither ia64 nor powerpc does so. However, all
> current users of the alloc_vm_area() function would be okay since none of
> the other lazy-syncing architectures are supported by Xen.
>

Well, assuming that alloc_vm_area() has some non-Xen use, the right
thing is for archs to export vmalloc_sync_all(), and just use that from
common code.

J

2007-02-16 23:31:06

by Keir Fraser

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [patch 12/21] Xen-paravirt: Allocate and free vmalloc areas

On 16/2/07 19:26, "Jeremy Fitzhardinge" <[email protected]> wrote:

> Keir Fraser wrote:
>> Hmmm... Actually looks like a bunch of architectures do lazy sync of the
>> vmalloc area, although neither ia64 nor powerpc does so. However, all
>> current users of the alloc_vm_area() function would be okay since none of
>> the other lazy-syncing architectures are supported by Xen.
>>
>
> Well, assuming that alloc_vm_area() has some non-Xen use, the right
> thing is for archs to export vmalloc_sync_all(), and just use that from
> common code.

It has no other users right now and get_vm_area_sync() would be a
better-named and more generically useful function than alloc_vm_area(). But
yes, to be done properly it does require vmalloc_sync_all() to be defined by
all architectures (even if that's BUG() and implement-properly-on-demand).

get_vm_area_sync(), partnered with existing remove_vm_area(), just seems
much smaller and neater than adding four new functions with a more complex
usage: alloc_vm_area, {lock,unlock}_vm_area, and free_vm_area. Maybe keeping
free_vm_area() too makes sense as its interface is more neatly symmetrical
to that of get_vm_area().

-- Keir


2007-02-16 23:42:25

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Xen-devel] Re: [patch 12/21] Xen-paravirt: Allocate and free vmalloc areas

Keir Fraser wrote:
> It has no other users right now and get_vm_area_sync() would be a
> better-named and more generically useful function than alloc_vm_area().
I'm thinking "reserve" might be a better term; "get" generally has the
suggestion of a refcount.

> get_vm_area_sync(), partnered with existing remove_vm_area(), just seems
> much smaller and neater than adding four new functions with a more complex
> usage: alloc_vm_area, {lock,unlock}_vm_area, and free_vm_area. Maybe keeping
> free_vm_area() too makes sense as its interface is more neatly symmetrical
> to that of get_vm_area().

I've already killed the lock/unlock functions. I'll come up with
something for the get/allocate/reserve and free functions.

J