2008-11-07 20:39:45

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 0/4] Fix vmalloc regression

Nick,

This is the whole set of patches I was talking about.
Patch 3 is the one that in fact fixes the problem
Patches 1 and 2 are debugging aids I made use of, and could be possibly
useful to others
Patch 4 removes guard pages entirely for non-debug kernels, as we have already
previously discussed.

Hope it's all fine.


2008-11-07 20:39:59

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 1/4] don't call __vmalloc from other vmap internal functions

If we do that, output of files like /proc/vmallocinfo
will show things like "vmalloc_32", "vmalloc_user", or
whomever the caller was as the caller. This info is not
as useful as the real caller of the allocation.

So, proposal is to call __vmalloc_node node directly, with
matching parameters to save the caller information

Signed-off-by: Glauber Costa <[email protected]>
---
mm/vmalloc.c | 12 ++++++++----
1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 0365369..95856d1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1343,7 +1343,8 @@ void *vmalloc_user(unsigned long size)
struct vm_struct *area;
void *ret;

- ret = __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
+ ret = __vmalloc_node(size, GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO,
+ PAGE_KERNEL, -1, __builtin_return_address(0));
if (ret) {
area = find_vm_area(ret);
area->flags |= VM_USERMAP;
@@ -1388,7 +1389,8 @@ EXPORT_SYMBOL(vmalloc_node);

void *vmalloc_exec(unsigned long size)
{
- return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC);
+ return __vmalloc_node(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC,
+ -1, __builtin_return_address(0));
}

#if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
@@ -1408,7 +1410,8 @@ void *vmalloc_exec(unsigned long size)
*/
void *vmalloc_32(unsigned long size)
{
- return __vmalloc(size, GFP_VMALLOC32, PAGE_KERNEL);
+ return __vmalloc_node(size, GFP_VMALLOC32, PAGE_KERNEL,
+ -1, __builtin_return_address(0));
}
EXPORT_SYMBOL(vmalloc_32);

@@ -1424,7 +1427,8 @@ void *vmalloc_32_user(unsigned long size)
struct vm_struct *area;
void *ret;

- ret = __vmalloc(size, GFP_VMALLOC32 | __GFP_ZERO, PAGE_KERNEL);
+ ret = __vmalloc_node(size, GFP_VMALLOC32 | __GFP_ZERO, PAGE_KERNEL,
+ -1, __builtin_return_address(0));
if (ret) {
area = find_vm_area(ret);
area->flags |= VM_USERMAP;
--
1.5.6.5

2008-11-07 20:40:30

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 2/4] show size of failing allocation

if we can't service a vmalloc allocation, show size of
the allocation that actually failed. Useful for
debugging.

Signed-off-by: Glauber Costa <[email protected]>
---
mm/vmalloc.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 95856d1..7db493d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -381,8 +381,8 @@ found:
goto retry;
}
if (printk_ratelimit())
- printk(KERN_WARNING "vmap allocation failed: "
- "use vmalloc=<size> to increase size.\n");
+ printk(KERN_WARNING "vmap allocation for size %d failed: "
+ "use vmalloc=<size> to increase size.\n", size);
return ERR_PTR(-EBUSY);
}

--
1.5.6.5

2008-11-07 20:40:51

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 3/4] restart search at beggining of vmalloc address

Current vmalloc restart search for a free area in case we
can't find one. The reason is there are areas which are lazily
freed, and could be possibly freed now. However, current implementation
start searching the tree from the last failing address, which is
pretty much by definition at the end of address space. So, we fail.

The proposal of this patch is to restart the search from the beginning
of the requested vstart address. This fixes the regression in running
KVM virtual machines for me, described in
http://lkml.org/lkml/2008/10/28/349, caused by commit
db64fe02258f1507e13fe5212a989922323685ce.

Signed-off-by: Glauber Costa <[email protected]>
CC: Nick Piggin <[email protected]>
---
mm/vmalloc.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 7db493d..6fe2003 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -378,6 +378,7 @@ found:
if (!purged) {
purge_vmap_area_lazy();
purged = 1;
+ addr = ALIGN(vstart, align);
goto retry;
}
if (printk_ratelimit())
--
1.5.6.5

2008-11-07 20:41:13

by Glauber Costa

[permalink] [raw]
Subject: [PATCH 4/4] Do not use guard pages in non-debug kernels

In mm/vmalloc.c, make usage of guard pages dependant
on CONFIG_DEBUG_PAGEALLOC.

Signed-off-by: Glauber Costa <[email protected]>
---
mm/vmalloc.c | 25 +++++++++++++++----------
1 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6fe2003..ed73c6f 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -28,6 +28,11 @@
#include <asm/uaccess.h>
#include <asm/tlbflush.h>

+#ifdef CONFIG_DEBUG_PAGEALLOC
+#define GUARD_PAGE_SIZE PAGE_SIZE
+#else
+#define GUARD_PAGE_SIZE 0
+#endif

/*** Page table manipulation functions ***/

@@ -363,7 +368,7 @@ retry:
}

while (addr + size >= first->va_start && addr + size <= vend) {
- addr = ALIGN(first->va_end + PAGE_SIZE, align);
+ addr = ALIGN(first->va_end, align);

n = rb_next(&first->rb_node);
if (n)
@@ -954,7 +959,7 @@ void unmap_kernel_range(unsigned long addr, unsigned long size)
int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages)
{
unsigned long addr = (unsigned long)area->addr;
- unsigned long end = addr + area->size - PAGE_SIZE;
+ unsigned long end = addr + area->size - GUARD_PAGE_SIZE;
int err;

err = vmap_page_range(addr, end, prot, *pages);
@@ -1003,7 +1008,7 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
/*
* We always allocate a guard page.
*/
- size += PAGE_SIZE;
+ size += GUARD_PAGE_SIZE;

va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
if (IS_ERR(va)) {
@@ -1098,7 +1103,7 @@ struct vm_struct *remove_vm_area(const void *addr)
struct vm_struct *vm = va->private;
struct vm_struct *tmp, **p;
free_unmap_vmap_area(va);
- vm->size -= PAGE_SIZE;
+ vm->size -= GUARD_PAGE_SIZE;

write_lock(&vmlist_lock);
for (p = &vmlist; (tmp = *p) != vm; p = &tmp->next)
@@ -1226,7 +1231,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
struct page **pages;
unsigned int nr_pages, array_size, i;

- nr_pages = (area->size - PAGE_SIZE) >> PAGE_SHIFT;
+ nr_pages = (area->size - GUARD_PAGE_SIZE) >> PAGE_SHIFT;
array_size = (nr_pages * sizeof(struct page *));

area->nr_pages = nr_pages;
@@ -1451,7 +1456,7 @@ long vread(char *buf, char *addr, unsigned long count)
read_lock(&vmlist_lock);
for (tmp = vmlist; tmp; tmp = tmp->next) {
vaddr = (char *) tmp->addr;
- if (addr >= vaddr + tmp->size - PAGE_SIZE)
+ if (addr >= vaddr + tmp->size - GUARD_PAGE_SIZE)
continue;
while (addr < vaddr) {
if (count == 0)
@@ -1461,7 +1466,7 @@ long vread(char *buf, char *addr, unsigned long count)
addr++;
count--;
}
- n = vaddr + tmp->size - PAGE_SIZE - addr;
+ n = vaddr + tmp->size - GUARD_PAGE_SIZE - addr;
do {
if (count == 0)
goto finished;
@@ -1489,7 +1494,7 @@ long vwrite(char *buf, char *addr, unsigned long count)
read_lock(&vmlist_lock);
for (tmp = vmlist; tmp; tmp = tmp->next) {
vaddr = (char *) tmp->addr;
- if (addr >= vaddr + tmp->size - PAGE_SIZE)
+ if (addr >= vaddr + tmp->size - GUARD_PAGE_SIZE)
continue;
while (addr < vaddr) {
if (count == 0)
@@ -1498,7 +1503,7 @@ long vwrite(char *buf, char *addr, unsigned long count)
addr++;
count--;
}
- n = vaddr + tmp->size - PAGE_SIZE - addr;
+ n = vaddr + tmp->size - GUARD_PAGE_SIZE - addr;
do {
if (count == 0)
goto finished;
@@ -1544,7 +1549,7 @@ int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
if (!(area->flags & VM_USERMAP))
return -EINVAL;

- if (usize + (pgoff << PAGE_SHIFT) > area->size - PAGE_SIZE)
+ if (usize + (pgoff << PAGE_SHIFT) > area->size - GUARD_PAGE_SIZE)
return -EINVAL;

addr += pgoff << PAGE_SHIFT;
--
1.5.6.5

2008-11-07 21:56:36

by walt

[permalink] [raw]
Subject: Re: [PATCH 0/4] Fix vmalloc regression

Glauber Costa wrote:
> Nick,
>
> This is the whole set of patches I was talking about.
> Patch 3 is the one that in fact fixes the problem...

Yep, patch 3 works for me, thanks. Only the 32-bit kernel
seems to need the patch, FWIW.

2008-11-08 00:58:43

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 0/4] Fix vmalloc regression

On Fri, Nov 07, 2008 at 08:35:50PM -0200, Glauber Costa wrote:
> Nick,
>
> This is the whole set of patches I was talking about.
> Patch 3 is the one that in fact fixes the problem
> Patches 1 and 2 are debugging aids I made use of, and could be possibly
> useful to others
> Patch 4 removes guard pages entirely for non-debug kernels, as we have already
> previously discussed.
>
> Hope it's all fine.

OK, these all look good, but I may only push 3/4 for Linus in this round,
along with some of the changes from my patch that you tested as well.

With the DEBUG_PAGEALLOC case, I have been thinking that we perhaps should
turn off the lazy unmapping optimisation as well, so it catches use
after free similarly to the page allocator... but probably it is a good
idea at least to avoid the double-guard page for 2.6.28?

Anyway thanks for these, I'll send them up to Andrew/Linus and cc you.

2008-11-08 02:11:47

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 0/4] Fix vmalloc regression

On Sat, Nov 08, 2008 at 01:58:32AM +0100, Nick Piggin wrote:
> On Fri, Nov 07, 2008 at 08:35:50PM -0200, Glauber Costa wrote:
> > Nick,
> >
> > This is the whole set of patches I was talking about.
> > Patch 3 is the one that in fact fixes the problem
> > Patches 1 and 2 are debugging aids I made use of, and could be possibly
> > useful to others
> > Patch 4 removes guard pages entirely for non-debug kernels, as we have already
> > previously discussed.
> >
> > Hope it's all fine.
>
> OK, these all look good, but I may only push 3/4 for Linus in this round,
> along with some of the changes from my patch that you tested as well.

Makes total sense.
>
> With the DEBUG_PAGEALLOC case, I have been thinking that we perhaps should
> turn off the lazy unmapping optimisation as well, so it catches use
> after free similarly to the page allocator... but probably it is a good
> idea at least to avoid the double-guard page for 2.6.28?

Makes sense. Maybe poisoning after free would also be useful?

2008-11-08 02:54:36

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 0/4] Fix vmalloc regression

On Saturday 08 November 2008 13:13, Glauber Costa wrote:
> On Sat, Nov 08, 2008 at 01:58:32AM +0100, Nick Piggin wrote:
> > On Fri, Nov 07, 2008 at 08:35:50PM -0200, Glauber Costa wrote:
> > > Nick,
> > >
> > > This is the whole set of patches I was talking about.
> > > Patch 3 is the one that in fact fixes the problem
> > > Patches 1 and 2 are debugging aids I made use of, and could be possibly
> > > useful to others
> > > Patch 4 removes guard pages entirely for non-debug kernels, as we have
> > > already previously discussed.
> > >
> > > Hope it's all fine.
> >
> > OK, these all look good, but I may only push 3/4 for Linus in this round,
> > along with some of the changes from my patch that you tested as well.
>
> Makes total sense.

OK, sent. Thanks again.


> > With the DEBUG_PAGEALLOC case, I have been thinking that we perhaps
> > should turn off the lazy unmapping optimisation as well, so it catches
> > use after free similarly to the page allocator... but probably it is a
> > good idea at least to avoid the double-guard page for 2.6.28?
>
> Makes sense. Maybe poisoning after free would also be useful?

It's a problem because we're only dealing with virtual address, rather
than real memory. So we don't really have anything to poison (we don't
know what the caller will do with the memory). I guess it would be
possible to poison in the page allocator or in vfree, but.... probably
not worthwhile (after the immediate-unmap debug option).