2008-10-28 20:58:58

by Glauber Costa

Subject: [PATCH] regression: vmalloc easily fail.

Commit db64fe02258f1507e13fe5212a989922323685ce broke
KVM (the symptom) for me. The cause is that vmalloc
allocations fail, despite the fact that /proc/meminfo
shows plenty of vmalloc space available.

After some investigation, it seems to me that the current
way to compute the next addr in the rb-tree traversal
leaves a spare page between each allocation. After a few
allocations, regardless of their size, we run out of vmalloc
space.

Signed-off-by: Glauber Costa <[email protected]>
Cc: Jeremy Fitzhardinge <[email protected]>
Cc: Krzysztof Helt <[email protected]>
---
mm/vmalloc.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 0365369..a33b0d1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -363,7 +363,7 @@ retry:
}

while (addr + size >= first->va_start && addr + size <= vend) {
- addr = ALIGN(first->va_end + PAGE_SIZE, align);
+ addr = ALIGN(first->va_end, align);

n = rb_next(&first->rb_node);
if (n)
--
1.5.6.5
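
For readers following along, the arithmetic of the change can be reproduced in isolation. Below is a minimal user-space sketch (kernel names are reused purely for illustration; PAGE_SIZE and the sample address are assumed values) of the extra page the old computation skips past every existing area:

#include <stdio.h>

#define PAGE_SIZE 4096UL
/* Round x up to the next multiple of a; a must be a power of two. */
#define ALIGN(x, a) (((x) + ((a) - 1)) & ~((a) - 1))

int main(void)
{
	unsigned long va_end = 0xf7c00000UL;	/* end of some existing vmap area */
	unsigned long align  = PAGE_SIZE;

	/* Old computation: always starts one full page past the area. */
	unsigned long old_addr = ALIGN(va_end + PAGE_SIZE, align);
	/* Patched computation: starts right at the area's end. */
	unsigned long new_addr = ALIGN(va_end, align);

	printf("old: %#lx  new: %#lx  extra gap: %lu bytes\n",
	       old_addr, new_addr, old_addr - new_addr);
	return 0;
}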


2008-10-28 21:03:29

by Avi Kivity

Subject: Re: [PATCH] regression: vmalloc easily fail.

Glauber Costa wrote:
> Commit db64fe02258f1507e13fe5212a989922323685ce broke
> KVM (the symptom) for me. The cause is that vmalloc
> allocations fail, despite the fact that /proc/meminfo
> shows plenty of vmalloc space available.
>
> After some investigation, it seems to me that the current
> way to compute the next addr in the rb-tree traversal
> leaves a spare page between each allocation. After a few
> allocations, regardless of their size, we run out of vmalloc
> space.
>
>
> while (addr + size >= first->va_start && addr + size <= vend) {
> - addr = ALIGN(first->va_end + PAGE_SIZE, align);
> + addr = ALIGN(first->va_end, align);
>
> n = rb_next(&first->rb_node);
> if (n)
>

I'm guessing that the missing comment explains that this is intentional,
to trap buffer overflows?

(okay that was a cheap shot. I don't comment nearly enough either)

Even if you leave a page between allocations, I don't see how you can
fail a one page allocation, unless you've allocated at least N/2 pages
(where N is the size of the vmalloc space in pages).

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
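
A back-of-the-envelope version of this bound (a standalone sketch with assumed numbers, not kernel code): with one guard page per area, each one-page allocation consumes two pages of address space, so exhaustion takes roughly N/2 allocations; with the two-page gap discussed later in the thread, roughly N/3:

#include <stdio.h>

int main(void)
{
	/* ~120 MB of vmalloc space with 4 KB pages, roughly the
	 * VmallocTotal reported later in the thread. */
	unsigned long total_pages = (120UL << 20) / 4096;

	/* One-page allocations, each padded by guard page(s). */
	printf("1-page guard: exhausted after ~%lu allocations\n",
	       total_pages / 2);
	printf("2-page guard: exhausted after ~%lu allocations\n",
	       total_pages / 3);
	return 0;
}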

2008-10-28 21:08:25

by Glauber Costa

Subject: Re: [PATCH] regression: vmalloc easily fail.

On Tue, Oct 28, 2008 at 11:03:22PM +0200, Avi Kivity wrote:
> Glauber Costa wrote:
>> Commit db64fe02258f1507e13fe5212a989922323685ce broke
>> KVM (the symptom) for me. The cause is that vmalloc
>> allocations fail, despite the fact that /proc/meminfo
>> shows plenty of vmalloc space available.
>>
>> After some investigation, it seems to me that the current
>> way to compute the next addr in the rb-tree traversal
>> leaves a spare page between each allocation. After a few
>> allocations, regardless of their size, we run out of vmalloc
>> space.
>>
>> while (addr + size >= first->va_start && addr + size <= vend) {
>> - addr = ALIGN(first->va_end + PAGE_SIZE, align);
>> + addr = ALIGN(first->va_end, align);
>> n = rb_next(&first->rb_node);
>> if (n)
>>
>
> I'm guessing that the missing comment explains that this is intentional,
> to trap buffer overflows?
>
> (okay that was a cheap shot. I don't comment nearly enough either)
>
> Even if you leave a page between allocations, I don't see how you can
> fail a one page allocation, unless you've allocated at least N/2 pages
> (where N is the size of the vmalloc space in pages).

I'm hoping Nick will comment on it. I might well be wrong,
but it nicely fixes the problem for me. And actually, you don't need
"at least N/2 pages": the size of the allocations hardly matters, just
the number of allocations we did. Since kvm does some small
vmalloc allocations, that may be why we trigger it.

2008-10-28 21:22:32

by Roland Dreier

Subject: Re: [PATCH] regression: vmalloc easily fail.

> I'm guessing that the missing comment explains that this is
> intentional, to trap buffer overflows?

Actually, speaking of comments, it's interesting that
__get_vm_area_node() -- which is called from vmalloc() -- does:

/*
* We always allocate a guard page.
*/
size += PAGE_SIZE;

va = alloc_vmap_area(size, align, start, end, node, gfp_mask);

and alloc_vmap_area() adds another PAGE_SIZE, as the original email
pointed out:

while (addr + size >= first->va_start && addr + size <= vend) {
addr = ALIGN(first->va_end + PAGE_SIZE, align);

I wonder if the double padding is causing a problem when things get too
fragmented?

- R.
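
Putting the two excerpts side by side makes the double padding explicit (a condensed paraphrase of the call chain as quoted above, not a verbatim kernel excerpt):

/* __get_vm_area_node(), reached from vmalloc(): */
size += PAGE_SIZE;			/* guard page number one */
va = alloc_vmap_area(size, align, start, end, node, gfp_mask);

/* alloc_vmap_area(), scanning the rb-tree for a hole: */
addr = ALIGN(first->va_end + PAGE_SIZE, align);	/* page number two */

So each vmalloc'ed area owns one trailing guard page, and the allocator then refuses to place the next area in the page after that one.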

2008-10-28 21:23:08

by Matias Zabaljauregui

Subject: Re: [PATCH] regression: vmalloc easily fail.

hello,

>> I'm guessing that the missing comment explains that this is intentional,
>> to trap buffer overflows?

yes, IIRC the pages between vmalloc areas are there for safety reasons.
(like the interval inserted before the first area, defined by VMALLOC_OFFSET)

regards
Matias

2008-10-28 21:43:10

by Arjan van de Ven

Subject: Re: [PATCH] regression: vmalloc easily fail.

On Tue, 28 Oct 2008 14:22:16 -0700
Roland Dreier <[email protected]> wrote:

> > I'm guessing that the missing comment explains that this is
> > intentional, to trap buffer overflows?
>
> Actually, speaking of comments, it's interesting that
> __get_vm_area_node() -- which is called from vmalloc() -- does:
>
> /*
> * We always allocate a guard page.
> */
> size += PAGE_SIZE;
>
> va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
>
> and alloc_vmap_area() adds another PAGE_SIZE, as the original email
> pointed out:
>
> while (addr + size >= first->va_start && addr + size <= vend) {
> addr = ALIGN(first->va_end + PAGE_SIZE, align);
>
> I wonder if the double padding is causing a problem when things get
> too fragmented?

I suspect it's a case of off-by-one... ALIGN() might round down, and
the "+ (PAGE_SIZE-1)" was there to make it round up.
Except for that missing -1 ...

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-10-28 22:03:33

by Roland Dreier

Subject: Re: [PATCH] regression: vmalloc easily fail.

> I suspect it's a case of off-by-one... ALIGN() might round down, and
> the "+ (PAGE_SIZE-1)" was there to make it round up.
> Except for that missing -1 ...

ALIGN() has always rounded up, at least back to 2.4.
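
For reference, the kernel's ALIGN() is the usual power-of-two round-up idiom; written out (the kernel spells it via __ALIGN_MASK) it is:

/* Round x up to the next multiple of a; a must be a power of two.
 * ALIGN(0x1001, 0x1000) == 0x2000 and ALIGN(0x1000, 0x1000) == 0x1000,
 * so an already-aligned value is never moved down. */
#define ALIGN(x, a) (((x) + ((a) - 1)) & ~((a) - 1))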

2008-10-28 23:29:58

by Nick Piggin

Subject: Re: [PATCH] regression: vmalloc easily fail.

On Tue, Oct 28, 2008 at 08:55:13PM -0200, Glauber Costa wrote:
> Commit db64fe02258f1507e13fe5212a989922323685ce broke
> KVM (the symptom) for me. The cause is that vmalloc
> allocations fail, despite the fact that /proc/meminfo
> shows plenty of vmalloc space available.
>
> After some investigation, it seems to me that the current
> way to compute the next addr in the rb-tree traversal
> leaves a spare page between each allocation. After a few
> allocations, regardless of their size, we run out of vmalloc
> space.

Right... that was to add a guard page like the old vmalloc allocator.
vmallocs still add their extra page too, so most of them will have
a 2 page guard area, but I didn't think this would hurt significantly.

I'm not against the patch, but I wonder exactly what is filling it up
and how? (can you look at the vmalloc proc function to find out?)

>
> Signed-off-by: Glauber Costa <[email protected]>
> Cc: Jeremy Fitzhardinge <[email protected]>
> Cc: Krzysztof Helt <[email protected]>
> ---
> mm/vmalloc.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 0365369..a33b0d1 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -363,7 +363,7 @@ retry:
> }
>
> while (addr + size >= first->va_start && addr + size <= vend) {
> - addr = ALIGN(first->va_end + PAGE_SIZE, align);
> + addr = ALIGN(first->va_end, align);
>
> n = rb_next(&first->rb_node);
> if (n)
> --
> 1.5.6.5

2008-10-29 06:28:21

by Avi Kivity

Subject: Re: [PATCH] regression: vmalloc easily fail.

Nick Piggin wrote:
> Right... that was to add a guard page like the old vmalloc allocator.
> vmallocs still add their extra page too, so most of them will have
> a 2 page guard area, but I didn't think this would hurt significantly.
>
> I'm not against the patch, but I wonder exactly what is filling it up
> and how? (can you look at the vmalloc proc function to find out?)

Maybe we're allocating two guard pages, but freeing only one?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2008-10-29 09:47:40

by Glauber Costa

Subject: Re: [PATCH] regression: vmalloc easily fail.

On Wed, Oct 29, 2008 at 12:29:44AM +0100, Nick Piggin wrote:
> On Tue, Oct 28, 2008 at 08:55:13PM -0200, Glauber Costa wrote:
> > Commit db64fe02258f1507e13fe5212a989922323685ce broke
> > KVM (the symptom) for me. The cause is that vmalloc
> > allocations fail, despite the fact that /proc/meminfo
> > shows plenty of vmalloc space available.
> >
> > After some investigation, it seems to me that the current
> > way to compute the next addr in the rb-tree traversal
> > leaves a spare page between each allocation. After a few
> > allocations, regardless of their size, we run out of vmalloc
> > space.
>
> Right... that was to add a guard page like the old vmalloc allocator.
> vmallocs still add their extra page too, so most of them will have
> a 2 page guard area, but I didn't think this would hurt significantly.
>
> I'm not against the patch, but I wonder exactly what is filling it up
> and how? (can you look at the vmalloc proc function to find out?)
attached.

>
> >
> > Signed-off-by: Glauber Costa <[email protected]>
> > Cc: Jeremy Fitzhardinge <[email protected]>
> > Cc: Krzysztof Helt <[email protected]>
> > ---
> > mm/vmalloc.c | 2 +-
> > 1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 0365369..a33b0d1 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -363,7 +363,7 @@ retry:
> > }
> >
> > while (addr + size >= first->va_start && addr + size <= vend) {
> > - addr = ALIGN(first->va_end + PAGE_SIZE, align);
> > + addr = ALIGN(first->va_end, align);
> >
> > n = rb_next(&first->rb_node);
> > if (n)
> > --
> > 1.5.6.5


Attachments:
vmalloc-fails (10.47 kB)

2008-10-29 10:11:59

by Nick Piggin

Subject: Re: [PATCH] regression: vmalloc easily fail.

On Wed, Oct 29, 2008 at 07:48:56AM -0200, Glauber Costa wrote:

> 0xf7bfe000-0xf7c00000 8192 hpet_enable+0x2d/0x279 phys=fed00000 ioremap
> 0xf7c02000-0xf7c04000 8192 acpi_os_map_memory+0x11/0x1a phys=7fed1000 ioremap
> 0xf7c06000-0xf7c08000 8192 acpi_os_map_memory+0x11/0x1a phys=7fef2000 ioremap
> 0xf7c0a000-0xf7c0c000 8192 acpi_os_map_memory+0x11/0x1a phys=7fef2000 ioremap
> 0xf7c0d000-0xf7c0f000 8192 __kvm_set_memory_region+0x164/0x355 [kvm] pages=1 vmalloc
> 0xf7c10000-0xf7c1f000 61440 acpi_os_map_memory+0x11/0x1a phys=7fed1000 ioremap
> 0xf7c20000-0xf7c22000 8192 acpi_os_map_memory+0x11/0x1a phys=7fef2000 ioremap
> 0xf7c24000-0xf7c27000 12288 acpi_os_map_memory+0x11/0x1a phys=7fef2000 ioremap
> 0xf7c28000-0xf7c2a000 8192 acpi_os_map_memory+0x11/0x1a phys=7fede000 ioremap
> 0xf7c2c000-0xf7c2e000 8192 acpi_os_map_memory+0x11/0x1a phys=7fef4000 ioremap
> 0xf7c30000-0xf7c32000 8192 acpi_os_map_memory+0x11/0x1a phys=7fede000 ioremap
> 0xf7c34000-0xf7c36000 8192 acpi_os_map_memory+0x11/0x1a phys=7fef4000 ioremap
> 0xf7c38000-0xf7c3a000 8192 acpi_os_map_memory+0x11/0x1a phys=7fed1000 ioremap
> 0xf7c3c000-0xf7c3e000 8192 acpi_os_map_memory+0x11/0x1a phys=7fede000 ioremap
> 0xf7c40000-0xf7c42000 8192 acpi_os_map_memory+0x11/0x1a phys=7fede000 ioremap
> 0xf7c44000-0xf7c46000 8192 acpi_os_map_memory+0x11/0x1a phys=7fede000 ioremap
> 0xf7c48000-0xf7c4a000 8192 acpi_os_map_memory+0x11/0x1a phys=7fede000 ioremap
> 0xf7c4b000-0xf7c57000 49152 zisofs_init+0xd/0x1c pages=11 vmalloc
> 0xf7c58000-0xf7c5a000 8192 yenta_probe+0x123/0x63f phys=e4300000 ioremap
> 0xf7c5b000-0xf7c5e000 12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
> 0xf7c61000-0xf7c65000 16384 vmalloc_exec+0x12/0x14 pages=3 vmalloc
> 0xf7c67000-0xf7c71000 40960 vmalloc_exec+0x12/0x14 pages=9 vmalloc
> 0xf7c72000-0xf7c74000 8192 usb_hcd_pci_probe+0x132/0x277 phys=ee404000 ioremap
> 0xf7c75000-0xf7c7c000 28672 vmalloc_exec+0x12/0x14 pages=6 vmalloc
> 0xf7c7d000-0xf7c86000 36864 vmalloc_exec+0x12/0x14 pages=8 vmalloc
> 0xf7c87000-0xf7c89000 8192 __kvm_set_memory_region+0x1df/0x355 [kvm] pages=1 vmalloc
> 0xf7c8a000-0xf7c91000 28672 vmalloc_exec+0x12/0x14 pages=6 vmalloc
> 0xf7c92000-0xf7c97000 20480 vmalloc_exec+0x12/0x14 pages=4 vmalloc
> 0xf7c9b000-0xf7c9f000 16384 vmalloc_exec+0x12/0x14 pages=3 vmalloc
> 0xf7ca0000-0xf7ca5000 20480 vmalloc_exec+0x12/0x14 pages=4 vmalloc
> 0xf7ca6000-0xf7cab000 20480 vmalloc_exec+0x12/0x14 pages=4 vmalloc
> 0xf7cac000-0xf7caf000 12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
> 0xf7cb0000-0xf7cb2000 8192 iTCO_wdt_probe+0xbe/0x281 [iTCO_wdt] phys=fed1f000 ioremap
> 0xf7cb3000-0xf7cbf000 49152 vmalloc_exec+0x12/0x14 pages=11 vmalloc
> 0xf7cc0000-0xf7cce000 57344 vmalloc_exec+0x12/0x14 pages=13 vmalloc
> 0xf7ccf000-0xf7cd2000 12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
> 0xf7cd3000-0xf7cd5000 8192 __kvm_set_memory_region+0x164/0x355 [kvm] pages=1 vmalloc
> 0xf7cd6000-0xf7cdc000 24576 vmalloc_exec+0x12/0x14 pages=5 vmalloc
> 0xf7cdd000-0xf7ce0000 12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
> 0xf7ce1000-0xf7ce4000 12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
> 0xf7ce6000-0xf7d02000 114688 vmalloc_exec+0x12/0x14 pages=27 vmalloc
> 0xf7d03000-0xf7d08000 20480 vmalloc_exec+0x12/0x14 pages=4 vmalloc
> 0xf7d09000-0xf7d0b000 8192 __kvm_set_memory_region+0x1df/0x355 [kvm] pages=1 vmalloc
> 0xf7d0c000-0xf7d12000 24576 vmalloc_exec+0x12/0x14 pages=5 vmalloc
> 0xf7d13000-0xf7d24000 69632 snd_malloc_sgbuf_pages+0x190/0x1ca [snd_page_alloc] vmap
> 0xf7d25000-0xf7d2a000 20480 drm_ht_create+0x69/0xa5 [drm] pages=4 vmalloc
> 0xf7d2d000-0xf7d2f000 8192 dm_vcalloc+0x24/0x4c [dm_mod] pages=1 vmalloc
> 0xf7d30000-0xf7d50000 131072 vmalloc_exec+0x12/0x14 pages=31 vmalloc
> 0xf7d51000-0xf7d54000 12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
> 0xf7d56000-0xf7d5a000 16384 vmalloc_exec+0x12/0x14 pages=3 vmalloc
> 0xf7d5b000-0xf7d5e000 12288 vmalloc_user+0x13/0x38 pages=2 vmalloc user
> 0xf7d5f000-0xf7d67000 32768 vmalloc_exec+0x12/0x14 pages=7 vmalloc
> 0xf7d68000-0xf7d79000 69632 snd_malloc_sgbuf_pages+0x190/0x1ca [snd_page_alloc] vmap
> 0xf7d7a000-0xf7d86000 49152 vmalloc_exec+0x12/0x14 pages=11 vmalloc
> 0xf7d87000-0xf7d8e000 28672 vmalloc_exec+0x12/0x14 pages=6 vmalloc
> 0xf7d8f000-0xf7d91000 8192 __kvm_set_memory_region+0x164/0x355 [kvm] pages=1 vmalloc
> 0xf7d92000-0xf7d94000 8192 dm_vcalloc+0x24/0x4c [dm_mod] pages=1 vmalloc
> 0xf7d96000-0xf7dba000 147456 vmalloc_exec+0x12/0x14 pages=35 vmalloc
> 0xf7dbb000-0xf7dc0000 20480 vmalloc_exec+0x12/0x14 pages=4 vmalloc
> 0xf7dc1000-0xf7dc4000 12288 vmalloc_32+0x12/0x14 pages=2 vmalloc
> 0xf7dc7000-0xf7dc9000 8192 dm_vcalloc+0x24/0x4c [dm_mod] pages=1 vmalloc
> 0xf7dcb000-0xf7dd4000 36864 vmalloc_exec+0x12/0x14 pages=8 vmalloc
> 0xf7dd5000-0xf7dd7000 8192 __kvm_set_memory_region+0x1df/0x355 [kvm] pages=1 vmalloc
> 0xf7dd8000-0xf7dda000 8192 pci_iomap+0x72/0x7b phys=ee404000 ioremap
> 0xf7ddb000-0xf7dec000 69632 snd_malloc_sgbuf_pages+0x190/0x1ca [snd_page_alloc] vmap
> 0xf7ded000-0xf7e0e000 135168 vmalloc_exec+0x12/0x14 pages=32 vmalloc
> 0xf7e0f000-0xf7e23000 81920 vmalloc_exec+0x12/0x14 pages=19 vmalloc
> 0xf7e24000-0xf7e2a000 24576 vmalloc_exec+0x12/0x14 pages=5 vmalloc
> 0xf7e2b000-0xf7e2f000 16384 vmalloc_exec+0x12/0x14 pages=3 vmalloc
> 0xf7e30000-0xf7e34000 16384 vmalloc_exec+0x12/0x14 pages=3 vmalloc
> 0xf7e36000-0xf7e45000 61440 vmalloc_exec+0x12/0x14 pages=14 vmalloc
> 0xf7e46000-0xf7e68000 139264 vmalloc_exec+0x12/0x14 pages=33 vmalloc
> 0xf7e6d000-0xf7e6f000 8192 __kvm_set_memory_region+0x164/0x355 [kvm] pages=1 vmalloc
> 0xf7e70000-0xf7e79000 36864 ioremap_wc+0x21/0x23 phys=dbff8000 ioremap
> 0xf7e80000-0xf7e91000 69632 drm_addmap_core+0x16d/0x4d5 [drm] phys=ee100000 ioremap
> 0xf7e92000-0xf7ea0000 57344 vmalloc_exec+0x12/0x14 pages=13 vmalloc
> 0xf7ea1000-0xf7ea6000 20480 vmalloc_exec+0x12/0x14 pages=4 vmalloc
> 0xf7ec8000-0xf7ecf000 28672 vmalloc_exec+0x12/0x14 pages=6 vmalloc
> 0xf7ed0000-0xf7f06000 221184 vmalloc_exec+0x12/0x14 pages=53 vmalloc
> 0xf7f07000-0xf7f0e000 28672 vmalloc_exec+0x12/0x14 pages=6 vmalloc
> 0xf7f0f000-0xf7f28000 102400 vmalloc_exec+0x12/0x14 pages=24 vmalloc
> 0xf7f29000-0xf7f2f000 24576 vmalloc_exec+0x12/0x14 pages=5 vmalloc
> 0xf7f3c000-0xf7f3e000 8192 __kvm_set_memory_region+0x1df/0x355 [kvm] pages=1 vmalloc
> 0xf7f3f000-0xf7f42000 12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
> 0xf7f43000-0xf7f4d000 40960 vmalloc_exec+0x12/0x14 pages=9 vmalloc
> 0xf7f4e000-0xf7f50000 8192 __kvm_set_memory_region+0x164/0x355 [kvm] pages=1 vmalloc
> 0xf7f51000-0xf7f57000 24576 vmalloc_exec+0x12/0x14 pages=5 vmalloc
> 0xf7f59000-0xf7f67000 57344 vmalloc_exec+0x12/0x14 pages=13 vmalloc
> 0xf7faa000-0xf7fbc000 73728 vmalloc_exec+0x12/0x14 pages=17 vmalloc
> 0xf802f000-0xf8032000 12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
> 0xf8033000-0xf8039000 24576 vmalloc_exec+0x12/0x14 pages=5 vmalloc
> 0xf803a000-0xf803c000 8192 __kvm_set_memory_region+0x1df/0x355 [kvm] pages=1 vmalloc
> 0xf803d000-0xf8043000 24576 vmalloc_exec+0x12/0x14 pages=5 vmalloc
> 0xf8044000-0xf8046000 8192 __kvm_set_memory_region+0x164/0x355 [kvm] pages=1 vmalloc
> 0xf8047000-0xf8053000 49152 vmalloc_exec+0x12/0x14 pages=11 vmalloc
> 0xf8054000-0xf8058000 16384 vmalloc_exec+0x12/0x14 pages=3 vmalloc
> 0xf8059000-0xf805b000 8192 __kvm_set_memory_region+0x1df/0x355 [kvm] pages=1 vmalloc
> 0xf805c000-0xf805e000 8192 pci_iomap+0x72/0x7b phys=edf00000 ioremap
> 0xf8062000-0xf8066000 16384 vmalloc_exec+0x12/0x14 pages=3 vmalloc
> 0xf8069000-0xf806b000 8192 dm_vcalloc+0x24/0x4c [dm_mod] pages=1 vmalloc
> 0xf80c0000-0xf81b9000 1019904 sys_swapon+0x481/0xaa0 pages=248 vmalloc
> 0xf81e3000-0xf81e8000 20480 vmalloc_exec+0x12/0x14 pages=4 vmalloc
> 0xf8237000-0xf823c000 20480 vmalloc_exec+0x12/0x14 pages=4 vmalloc
> 0xf8275000-0xf827a000 20480 vmalloc_exec+0x12/0x14 pages=4 vmalloc
> 0xf8451000-0xf845f000 57344 vmalloc_exec+0x12/0x14 pages=13 vmalloc
> 0xf848a000-0xf848d000 12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
> 0xf84ba000-0xf84bd000 12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
> 0xf84be000-0xf84c2000 16384 vmalloc_exec+0x12/0x14 pages=3 vmalloc
> 0xf84f1000-0xf84f4000 12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
> 0xf84f5000-0xf84f8000 12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
> 0xf8502000-0xf8510000 57344 vmalloc_exec+0x12/0x14 pages=13 vmalloc
> 0xf8511000-0xf851f000 57344 vmalloc_exec+0x12/0x14 pages=13 vmalloc
> 0xf8523000-0xf852a000 28672 vmalloc_exec+0x12/0x14 pages=6 vmalloc
> 0xf8534000-0xf8538000 16384 vmalloc_exec+0x12/0x14 pages=3 vmalloc
> 0xf8539000-0xf853b000 8192 __kvm_set_memory_region+0x1df/0x355 [kvm] pages=1 vmalloc
> 0xf853c000-0xf8545000 36864 vmalloc_exec+0x12/0x14 pages=8 vmalloc
> 0xf8546000-0xf8550000 40960 vmalloc_exec+0x12/0x14 pages=9 vmalloc
> 0xf8551000-0xf8554000 12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
> 0xf8555000-0xf8561000 49152 vmalloc_exec+0x12/0x14 pages=11 vmalloc
> 0xf856c000-0xf856f000 12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
> 0xf8578000-0xf857d000 20480 azx_probe+0x32b/0xa0a [snd_hda_intel] phys=ee400000 ioremap
> 0xf8594000-0xf85c1000 184320 vmalloc_exec+0x12/0x14 pages=44 vmalloc
> 0xf85db000-0xf85df000 16384 vmalloc_exec+0x12/0x14 pages=3 vmalloc
> 0xf85f4000-0xf85f6000 8192 __kvm_set_memory_region+0x255/0x355 [kvm] pages=1 vmalloc
> 0xf85f7000-0xf85fa000 12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
> 0xf85fb000-0xf8620000 151552 vmalloc_exec+0x12/0x14 pages=36 vmalloc
> 0xf8624000-0xf8628000 16384 vmalloc_exec+0x12/0x14 pages=3 vmalloc
> 0xf8629000-0xf8689000 393216 vmalloc_exec+0x12/0x14 pages=95 vmalloc
> 0xf868a000-0xf8690000 24576 vmalloc_exec+0x12/0x14 pages=5 vmalloc
> 0xf8691000-0xf8695000 16384 vmalloc_exec+0x12/0x14 pages=3 vmalloc
> 0xf8699000-0xf869c000 12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
> 0xf86a5000-0xf86a9000 16384 vmalloc_exec+0x12/0x14 pages=3 vmalloc
> 0xf86c4000-0xf86c7000 12288 vmalloc_exec+0x12/0x14 pages=2 vmalloc
> 0xf87a6000-0xf87af000 36864 vmalloc_exec+0x12/0x14 pages=8 vmalloc
> 0xf87dd000-0xf87e2000 20480 vmalloc_exec+0x12/0x14 pages=4 vmalloc
> 0xf8815000-0xf881c000 28672 vmalloc_exec+0x12/0x14 pages=6 vmalloc
> 0xf89b8000-0xf89d2000 106496 vmalloc_exec+0x12/0x14 pages=25 vmalloc
> 0xf8a00000-0xf8a21000 135168 e1000_probe+0x1a5/0xbb3 [e1000e] phys=ee000000 ioremap
> 0xf8a22000-0xf9223000 8392704 vmalloc_32+0x12/0x14 pages=2048 vmalloc vpages
> 0xf97a4000-0xf97a7000 12288 e1000e_setup_tx_resources+0x1d/0xbf [e1000e] pages=2 vmalloc
> 0xf97a8000-0xf97ab000 12288 e1000e_setup_rx_resources+0x1d/0xfa [e1000e] pages=2 vmalloc
> 0xf97ac000-0xf98ad000 1052672 __kvm_set_memory_region+0x164/0x355 [kvm] pages=256 vmalloc
> 0xf98ae000-0xf98b1000 12288 __kvm_set_memory_region+0x1df/0x355 [kvm] pages=2 vmalloc
> 0xf98b2000-0xf98b7000 20480 __kvm_set_memory_region+0x164/0x355 [kvm] pages=4 vmalloc

Hmm, spanning <30MB of memory... how much vmalloc space do you have?

Anyway, there are quite a lot of little allocations there, and things are
getting a bit fragmented. Why don't we just drop the guard pages
completely (maybe turn them on with a debug option)? I don't think I have
ever seen a problem caught by them... not to say they can't catch problems,
but those problems could equally appear in linear KVA too (or between guard
pages).
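
A sketch of what that suggestion could look like, gating the guard on an existing debug option (VMAP_GUARD_PAGES is a hypothetical knob for illustration, not what was merged):

#ifdef CONFIG_DEBUG_PAGEALLOC
#define VMAP_GUARD_PAGES 1	/* keep a redzone so overflows fault */
#else
#define VMAP_GUARD_PAGES 0	/* hypothetical: no guard in production */
#endif

/* __get_vm_area_node() would then do: */
size += VMAP_GUARD_PAGES * PAGE_SIZE;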


2008-10-29 10:30:05

by Avi Kivity

Subject: Re: [PATCH] regression: vmalloc easily fail.

Nick Piggin wrote:
> Hmm, spanning <30MB of memory... how much vmalloc space do you have?
>
>

From the original report:

> VmallocTotal: 122880 kB
> VmallocUsed: 15184 kB
> VmallocChunk: 83764 kB

So it seems there's quite a bit of free space.

Chunk is the largest free contiguous region, right? If so, it seems the
problem is unrelated to guard pages; instead, the search isn't finding a
1-page area (with two guard pages) for some reason, even though lots of
free space is available.


--
error compiling committee.c: too many arguments to function

2008-10-29 10:43:46

by Nick Piggin

Subject: Re: [PATCH] regression: vmalloc easily fail.

On Wed, Oct 29, 2008 at 12:29:40PM +0200, Avi Kivity wrote:
> Nick Piggin wrote:
> >Hmm, spanning <30MB of memory... how much vmalloc space do you have?
> >
> >
>
> From the original report:
>
> >VmallocTotal: 122880 kB
> >VmallocUsed: 15184 kB
> >VmallocChunk: 83764 kB
>
> So it seems there's quite a bit of free space.
>
> Chunk is the largest free contiguous region, right? If so, it seems the

Yes.


> problem is unrelated to guard pages; instead, the search isn't finding a
> 1-page area (with two guard pages) for some reason, even though lots of
> free space is available.

Hmm. The free area search could be buggy...

2008-10-29 22:06:20

by Glauber Costa

Subject: Re: [PATCH] regression: vmalloc easily fail.

On Wed, Oct 29, 2008 at 11:43:33AM +0100, Nick Piggin wrote:
> On Wed, Oct 29, 2008 at 12:29:40PM +0200, Avi Kivity wrote:
> > Nick Piggin wrote:
> > >Hmm, spanning <30MB of memory... how much vmalloc space do you have?
> > >
> > >
> >
> > From the original report:
> >
> > >VmallocTotal: 122880 kB
> > >VmallocUsed: 15184 kB
> > >VmallocChunk: 83764 kB
> >
> > So it seems there's quite a bit of free space.
> >
> > Chunk is the largest free contiguous region, right? If so, it seems the
>
> Yes.
>
>
> > problem is unrelated to guard pages; instead, the search isn't finding a
> > 1-page area (with two guard pages) for some reason, even though lots of
> > free space is available.
>
> Hmm. The free area search could be buggy...
Do you want me to grab any specific info from it? Or should I just
hack at it randomly? I'll probably have some time for that tomorrow.

2008-10-30 01:53:48

by Nick Piggin

Subject: Re: [PATCH] regression: vmalloc easily fail.

On Wed, Oct 29, 2008 at 08:07:37PM -0200, Glauber Costa wrote:
> On Wed, Oct 29, 2008 at 11:43:33AM +0100, Nick Piggin wrote:
> > On Wed, Oct 29, 2008 at 12:29:40PM +0200, Avi Kivity wrote:
> > > Nick Piggin wrote:
> > > >Hmm, spanning <30MB of memory... how much vmalloc space do you have?
> > > >
> > > >
> > >
> > > From the original report:
> > >
> > > >VmallocTotal: 122880 kB
> > > >VmallocUsed: 15184 kB
> > > >VmallocChunk: 83764 kB
> > >
> > > So it seems there's quite a bit of free space.
> > >
> > > Chunk is the largest free contiguous region, right? If so, it seems the
> >
> > Yes.
> >
> >
> > > problem is unrelated to guard pages; instead, the search isn't finding a
> > > 1-page area (with two guard pages) for some reason, even though lots of
> > > free space is available.
> >
> > Hmm. The free area search could be buggy...
> Do you want me to grab any specific info from it? Or should I just
> hack at it randomly? I'll probably have some time for that tomorrow.

I can take a look at it, in the next day or two. If you hack at it, and
find the problem I wouldn't be upset ;)

2008-10-30 04:49:59

by Nick Piggin

Subject: Re: [PATCH] regression: vmalloc easily fail.

On Wed, Oct 29, 2008 at 08:07:37PM -0200, Glauber Costa wrote:
> On Wed, Oct 29, 2008 at 11:43:33AM +0100, Nick Piggin wrote:
> > On Wed, Oct 29, 2008 at 12:29:40PM +0200, Avi Kivity wrote:
> > > Nick Piggin wrote:
> > > >Hmm, spanning <30MB of memory... how much vmalloc space do you have?
> > > >
> > > >
> > >
> > > From the original report:
> > >
> > > >VmallocTotal: 122880 kB
> > > >VmallocUsed: 15184 kB
> > > >VmallocChunk: 83764 kB
> > >
> > > So it seems there's quite a bit of free space.
> > >
> > > Chunk is the largest free contiguous region, right? If so, it seems the
> >
> > Yes.
> >
> >
> > > problem is unrelated to guard pages; instead, the search isn't finding a
> > > 1-page area (with two guard pages) for some reason, even though lots of
> > > free space is available.
> >
> > Hmm. The free area search could be buggy...
> Do you want me to grab any specific info from it? Or should I just
> hack at it randomly? I'll probably have some time for that tomorrow.

I took a bit of a look. Does this help you at all?

I still think we should get rid of the guard pages in non-debug kernels
completely, but hopefully this will fix your problems?
--

- Fix off-by-one bug in the KVA allocator that can leave gaps
- An initial vmalloc failure should start off a synchronous flush of lazy
areas, in case someone is already in the process of flushing them.
- Purge lock can be a mutex so we can sleep while that's going on.

Signed-off-by: Nick Piggin <[email protected]>
---
Index: linux-2.6/mm/vmalloc.c
===================================================================
--- linux-2.6.orig/mm/vmalloc.c
+++ linux-2.6/mm/vmalloc.c
@@ -14,6 +14,7 @@
#include <linux/highmem.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
+#include <linux/mutex.h>
#include <linux/interrupt.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
@@ -362,7 +363,7 @@ retry:
goto found;
}

- while (addr + size >= first->va_start && addr + size <= vend) {
+ while (addr + size > first->va_start && addr + size <= vend) {
addr = ALIGN(first->va_end + PAGE_SIZE, align);

n = rb_next(&first->rb_node);
@@ -472,7 +473,7 @@ static atomic_t vmap_lazy_nr = ATOMIC_IN
static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
int sync, int force_flush)
{
- static DEFINE_SPINLOCK(purge_lock);
+ static DEFINE_MUTEX(purge_lock);
LIST_HEAD(valist);
struct vmap_area *va;
int nr = 0;
@@ -483,10 +484,10 @@ static void __purge_vmap_area_lazy(unsig
* the case that isn't actually used at the moment anyway.
*/
if (!sync && !force_flush) {
- if (!spin_trylock(&purge_lock))
+ if (!mutex_trylock(&purge_lock))
return;
} else
- spin_lock(&purge_lock);
+ mutex_lock(&purge_lock);

rcu_read_lock();
list_for_each_entry_rcu(va, &vmap_area_list, list) {
@@ -518,7 +519,18 @@ static void __purge_vmap_area_lazy(unsig
__free_vmap_area(va);
spin_unlock(&vmap_area_lock);
}
- spin_unlock(&purge_lock);
+ mutex_unlock(&purge_lock);
+}
+
+/*
+ * Kick off a purge of the outstanding lazy areas. Don't bother if somebody
+ * is already purging.
+ */
+static void try_purge_vmap_area_lazy(void)
+{
+ unsigned long start = ULONG_MAX, end = 0;
+
+ __purge_vmap_area_lazy(&start, &end, 0, 0);
}

/*
@@ -528,7 +540,7 @@ static void purge_vmap_area_lazy(void)
{
unsigned long start = ULONG_MAX, end = 0;

- __purge_vmap_area_lazy(&start, &end, 0, 0);
+ __purge_vmap_area_lazy(&start, &end, 1, 0);
}

/*
@@ -539,7 +551,7 @@ static void free_unmap_vmap_area(struct
va->flags |= VM_LAZY_FREE;
atomic_add((va->va_end - va->va_start) >> PAGE_SHIFT, &vmap_lazy_nr);
if (unlikely(atomic_read(&vmap_lazy_nr) > lazy_max_pages()))
- purge_vmap_area_lazy();
+ try_purge_vmap_area_lazy();
}

static struct vmap_area *find_vmap_area(unsigned long addr)
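
The first hunk is the crux. With concrete (assumed) numbers it is easy to see why the old ">=" test walks straight past a hole that fits exactly:

/*
 * Say the allocator is considering addr = 0xf7c00000 for a one-page
 * (size = 0x1000) request, and the next existing area has
 * first->va_start = 0xf7c01000.  The candidate range is the half-open
 * interval [0xf7c00000, 0xf7c01000), which fits in front of the
 * existing area exactly.
 *
 *   old test: addr + size >= first->va_start
 *             0xf7c01000 >= 0xf7c01000  -> true: the hole is skipped
 *   new test: addr + size >  first->va_start
 *             0xf7c01000 >  0xf7c01000  -> false: the hole is used
 */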

2008-10-30 11:27:37

by Glauber Costa

Subject: Re: [PATCH] regression: vmalloc easily fail.

On Thu, Oct 30, 2008 at 05:49:41AM +0100, Nick Piggin wrote:
> On Wed, Oct 29, 2008 at 08:07:37PM -0200, Glauber Costa wrote:
> > On Wed, Oct 29, 2008 at 11:43:33AM +0100, Nick Piggin wrote:
> > > On Wed, Oct 29, 2008 at 12:29:40PM +0200, Avi Kivity wrote:
> > > > Nick Piggin wrote:
> > > > >Hmm, spanning <30MB of memory... how much vmalloc space do you have?
> > > > >
> > > > >
> > > >
> > > > From the original report:
> > > >
> > > > >VmallocTotal: 122880 kB
> > > > >VmallocUsed: 15184 kB
> > > > >VmallocChunk: 83764 kB
> > > >
> > > > So it seems there's quite a bit of free space.
> > > >
> > > > Chunk is the largest free contiguous region, right? If so, it seems the
> > >
> > > Yes.
> > >
> > >
> > > > problem is unrelated to guard pages; instead, the search isn't finding a
> > > > 1-page area (with two guard pages) for some reason, even though lots of
> > > > free space is available.
> > >
> > > Hmm. The free area search could be buggy...
> > Do you want me to grab any specific info from it? Or should I just
> > hack at it randomly? I'll probably have some time for that tomorrow.
>
> I took a bit of a look. Does this help you at all?
>
> I still think we should get rid of the guard pages in non-debug kernels
> completely, but hopefully this will fix your problems?
Unfortunately, it doesn't.
The problem still happens in a kernel with this patch.

> --
>
> - Fix off-by-one bug in the KVA allocator that can leave gaps
> - An initial vmalloc failure should start off a synchronous flush of lazy
> areas, in case someone is already in the process of flushing them.
> - Purge lock can be a mutex so we can sleep while that's going on.
>
> Signed-off-by: Nick Piggin <[email protected]>
> ---
> Index: linux-2.6/mm/vmalloc.c
> ===================================================================
> --- linux-2.6.orig/mm/vmalloc.c
> +++ linux-2.6/mm/vmalloc.c
> @@ -14,6 +14,7 @@
> #include <linux/highmem.h>
> #include <linux/slab.h>
> #include <linux/spinlock.h>
> +#include <linux/mutex.h>
> #include <linux/interrupt.h>
> #include <linux/proc_fs.h>
> #include <linux/seq_file.h>
> @@ -362,7 +363,7 @@ retry:
> goto found;
> }
>
> - while (addr + size >= first->va_start && addr + size <= vend) {
> + while (addr + size > first->va_start && addr + size <= vend) {
> addr = ALIGN(first->va_end + PAGE_SIZE, align);
>
> n = rb_next(&first->rb_node);
> @@ -472,7 +473,7 @@ static atomic_t vmap_lazy_nr = ATOMIC_IN
> static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
> int sync, int force_flush)
> {
> - static DEFINE_SPINLOCK(purge_lock);
> + static DEFINE_MUTEX(purge_lock);
> LIST_HEAD(valist);
> struct vmap_area *va;
> int nr = 0;
> @@ -483,10 +484,10 @@ static void __purge_vmap_area_lazy(unsig
> * the case that isn't actually used at the moment anyway.
> */
> if (!sync && !force_flush) {
> - if (!spin_trylock(&purge_lock))
> + if (!mutex_trylock(&purge_lock))
> return;
> } else
> - spin_lock(&purge_lock);
> + mutex_lock(&purge_lock);
>
> rcu_read_lock();
> list_for_each_entry_rcu(va, &vmap_area_list, list) {
> @@ -518,7 +519,18 @@ static void __purge_vmap_area_lazy(unsig
> __free_vmap_area(va);
> spin_unlock(&vmap_area_lock);
> }
> - spin_unlock(&purge_lock);
> + mutex_unlock(&purge_lock);
> +}
> +
> +/*
> + * Kick off a purge of the outstanding lazy areas. Don't bother if somebody
> + * is already purging.
> + */
> +static void try_purge_vmap_area_lazy(void)
> +{
> + unsigned long start = ULONG_MAX, end = 0;
> +
> + __purge_vmap_area_lazy(&start, &end, 0, 0);
> }
>
> /*
> @@ -528,7 +540,7 @@ static void purge_vmap_area_lazy(void)
> {
> unsigned long start = ULONG_MAX, end = 0;
>
> - __purge_vmap_area_lazy(&start, &end, 0, 0);
> + __purge_vmap_area_lazy(&start, &end, 1, 0);
> }
>
> /*
> @@ -539,7 +551,7 @@ static void free_unmap_vmap_area(struct
> va->flags |= VM_LAZY_FREE;
> atomic_add((va->va_end - va->va_start) >> PAGE_SHIFT, &vmap_lazy_nr);
> if (unlikely(atomic_read(&vmap_lazy_nr) > lazy_max_pages()))
> - purge_vmap_area_lazy();
> + try_purge_vmap_area_lazy();
> }
>
> static struct vmap_area *find_vmap_area(unsigned long addr)

2008-10-30 16:47:24

by Matt Mackall

Subject: Re: [PATCH] regression: vmalloc easily fail.

On Thu, 2008-10-30 at 05:49 +0100, Nick Piggin wrote:
> I still think we should get rid of the guard pages in non-debug kernels
> completely,

For what it's worth, I agree.

--
Mathematics is the supreme nostalgia of our time.

2008-10-30 18:03:34

by Glauber Costa

Subject: Re: [PATCH] regression: vmalloc easily fail.

On Thu, Oct 30, 2008 at 11:46:02AM -0500, Matt Mackall wrote:
> On Thu, 2008-10-30 at 05:49 +0100, Nick Piggin wrote:
> > I still think we should get rid of the guard pages in non-debug kernels
> > completely,
>
> For what it's worth, I agree.
Do we want any specific option, or is DEBUG_VM enough?

>
> --
> Mathematics is the supreme nostalgia of our time.
>

2008-10-31 03:00:22

by Nick Piggin

Subject: Re: [PATCH] regression: vmalloc easily fail.

On Thu, Oct 30, 2008 at 04:04:43PM -0200, Glauber Costa wrote:
> On Thu, Oct 30, 2008 at 11:46:02AM -0500, Matt Mackall wrote:
> > On Thu, 2008-10-30 at 05:49 +0100, Nick Piggin wrote:
> > > I still think we should get rid of the guard pages in non-debug kernels
> > > completely,
> >
> > For what it's worth, I agree.
> Do we want any specific option, or is DEBUG_VM enough?

I'd almost say DEBUG_PAGEALLOC, which could also disable lazy unmapping
in order to catch use-after-free better. Or just a different option
entirely.

DEBUG_VM has, so far, been kept to relatively cheap tests that are not
much pain to turn on, and shouldn't result in much behavioural change
of algorithms/data structures AFAIKS. Which can be a good thing.

2008-10-31 07:17:11

by Nick Piggin

Subject: Re: [PATCH] regression: vmalloc easily fail.

On Thu, Oct 30, 2008 at 09:28:54AM -0200, Glauber Costa wrote:
> On Thu, Oct 30, 2008 at 05:49:41AM +0100, Nick Piggin wrote:
> > On Wed, Oct 29, 2008 at 08:07:37PM -0200, Glauber Costa wrote:
> > > On Wed, Oct 29, 2008 at 11:43:33AM +0100, Nick Piggin wrote:
> > > > On Wed, Oct 29, 2008 at 12:29:40PM +0200, Avi Kivity wrote:
> > > > > Nick Piggin wrote:
> > > > > >Hmm, spanning <30MB of memory... how much vmalloc space do you have?
> > > > > >
> > > > > >
> > > > >
> > > > > From the original report:
> > > > >
> > > > > >VmallocTotal: 122880 kB
> > > > > >VmallocUsed: 15184 kB
> > > > > >VmallocChunk: 83764 kB
> > > > >
> > > > > So it seems there's quite a bit of free space.
> > > > >
> > > > > Chunk is the largest free contiguous region, right? If so, it seems the
> > > >
> > > > Yes.
> > > >
> > > >
> > > > > problem is unrelated to guard pages; instead, the search isn't finding a
> > > > > 1-page area (with two guard pages) for some reason, even though lots of
> > > > > free space is available.
> > > >
> > > > Hmm. The free area search could be buggy...
> > > Do you want me to grab any specific info from it? Or should I just
> > > hack at it randomly? I'll probably have some time for that tomorrow.
> >
> > I took a bit of a look. Does this help you at all?
> >
> > I still think we should get rid of the guard pages in non-debug kernels
> > completely, but hopefully this will fix your problems?
> Unfortunately, it doesn't.
> The problem still happens in a kernel with this patch.

That's weird. Any chance you could dump a list of all the vmap area start
and end addresses and their flags before returning failure?


2008-11-04 17:49:59

by Glauber Costa

Subject: Re: [PATCH] regression: vmalloc easily fail.

On Fri, Oct 31, 2008 at 08:16:44AM +0100, Nick Piggin wrote:
> On Thu, Oct 30, 2008 at 09:28:54AM -0200, Glauber Costa wrote:
> > On Thu, Oct 30, 2008 at 05:49:41AM +0100, Nick Piggin wrote:
> > > On Wed, Oct 29, 2008 at 08:07:37PM -0200, Glauber Costa wrote:
> > > > On Wed, Oct 29, 2008 at 11:43:33AM +0100, Nick Piggin wrote:
> > > > > On Wed, Oct 29, 2008 at 12:29:40PM +0200, Avi Kivity wrote:
> > > > > > Nick Piggin wrote:
> > > > > > >Hmm, spanning <30MB of memory... how much vmalloc space do you have?
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > From the original report:
> > > > > >
> > > > > > >VmallocTotal: 122880 kB
> > > > > > >VmallocUsed: 15184 kB
> > > > > > >VmallocChunk: 83764 kB
> > > > > >
> > > > > > So it seems there's quite a bit of free space.
> > > > > >
> > > > > > Chunk is the largest free contiguous region, right? If so, it seems the
> > > > >
> > > > > Yes.
> > > > >
> > > > >
> > > > > > problem is unrelated to guard pages; instead, the search isn't finding a
> > > > > > 1-page area (with two guard pages) for some reason, even though lots of
> > > > > > free space is available.
> > > > >
> > > > > Hmm. The free area search could be buggy...
> > > > Do you want me to grab any specific info from it? Or should I just
> > > > hack at it randomly? I'll probably have some time for that tomorrow.
> > >
> > > I took a bit of a look. Does this help you at all?
> > >
> > > I still think we should get rid of the guard pages in non-debug kernels
> > > completely, but hopefully this will fix your problems?
> > Unfortunately, it doesn't.
> > The problem still happens in a kernel with this patch.
>
> That's weird. Any chance you could dump a list of all the vmap area start
> and end addresses and their flags before returning failure?

By the way, a slightly modified version of your patch, without this snippet:

@@ -362,7 +363,7 @@ retry:
goto found;
}

- while (addr + size >= first->va_start && addr + size <= vend) {
+ while (addr + size > first->va_start && addr + size <= vend) {
addr = ALIGN(first->va_end + PAGE_SIZE, align);

n = rb_next(&first->rb_node);


WFM nicely so far.

I'm attaching /proc/vmallocinfo during kvm execution.


Attachments:
vmalloc.works (10.78 kB)

2008-11-05 00:20:30

by Glauber Costa

Subject: Re: [PATCH] regression: vmalloc easily fail.

On Fri, Oct 31, 2008 at 08:16:44AM +0100, Nick Piggin wrote:
> On Thu, Oct 30, 2008 at 09:28:54AM -0200, Glauber Costa wrote:
> > On Thu, Oct 30, 2008 at 05:49:41AM +0100, Nick Piggin wrote:
> > > On Wed, Oct 29, 2008 at 08:07:37PM -0200, Glauber Costa wrote:
> > > > On Wed, Oct 29, 2008 at 11:43:33AM +0100, Nick Piggin wrote:
> > > > > On Wed, Oct 29, 2008 at 12:29:40PM +0200, Avi Kivity wrote:
> > > > > > Nick Piggin wrote:
> > > > > > >Hmm, spanning <30MB of memory... how much vmalloc space do you have?
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > From the original report:
> > > > > >
> > > > > > >VmallocTotal: 122880 kB
> > > > > > >VmallocUsed: 15184 kB
> > > > > > >VmallocChunk: 83764 kB
> > > > > >
> > > > > > So it seems there's quite a bit of free space.
> > > > > >
> > > > > > Chunk is the largest free contiguous region, right? If so, it seems the
> > > > >
> > > > > Yes.
> > > > >
> > > > >
> > > > > > problem is unrelated to guard pages; instead, the search isn't finding a
> > > > > > 1-page area (with two guard pages) for some reason, even though lots of
> > > > > > free space is available.
> > > > >
> > > > > Hmm. The free area search could be buggy...
> > > > Do you want me to grab any specific info from it? Or should I just
> > > > hack at it randomly? I'll probably have some time for that tomorrow.
> > >
> > > I took a bit of a look. Does this help you at all?
> > >
> > > I still think we should get rid of the guard pages in non-debug kernels
> > > completely, but hopefully this will fix your problems?
> > Unfortunately, it doesn't.
> > The problem still happens in a kernel with this patch.
>
> That's weird. Any chance you could dump a list of all the vmap area start
> and end addresses and their flags before returning failure?

I said it worked with a single change. Shame on me, I was testing the wrong kernel;
it does not work at all. I'll debug it more tomorrow.

2008-11-07 20:36:22

by Glauber Costa

Subject: Re: [PATCH] regression: vmalloc easily fail.

On Thu, Oct 30, 2008 at 05:49:41AM +0100, Nick Piggin wrote:
> On Wed, Oct 29, 2008 at 08:07:37PM -0200, Glauber Costa wrote:
> > On Wed, Oct 29, 2008 at 11:43:33AM +0100, Nick Piggin wrote:
> > > On Wed, Oct 29, 2008 at 12:29:40PM +0200, Avi Kivity wrote:
> > > > Nick Piggin wrote:
> > > > >Hmm, spanning <30MB of memory... how much vmalloc space do you have?
> > > > >
> > > > >
> > > >
> > > > From the original report:
> > > >
> > > > >VmallocTotal: 122880 kB
> > > > >VmallocUsed: 15184 kB
> > > > >VmallocChunk: 83764 kB
> > > >
> > > > So it seems there's quite a bit of free space.
> > > >
> > > > Chunk is the largest free contiguous region, right? If so, it seems the
> > >
> > > Yes.
> > >
> > >
> > > > problem is unrelated to guard pages; instead, the search isn't finding a
> > > > 1-page area (with two guard pages) for some reason, even though lots of
> > > > free space is available.
> > >
> > > Hmm. The free area search could be buggy...
> > Do you want me to grab any specific info from it? Or should I just
> > hack at it randomly? I'll probably have some time for that tomorrow.
>
> I took a bit of a look. Does this help you at all?
>
> I still think we should get rid of the guard pages in non-debug kernels
> completely, but hopefully this will fix your problems?
> --
>
> - Fix off-by-one bug in the KVA allocator that can leave gaps
> - An initial vmalloc failure should start off a synchronous flush of lazy
> areas, in case someone is already in the process of flushing them.
> - Purge lock can be a mutex so we can sleep while that's going on.
>
> Signed-off-by: Nick Piggin <[email protected]>
Tested-by: Glauber Costa <[email protected]>
> ---
> Index: linux-2.6/mm/vmalloc.c
> ===================================================================
> --- linux-2.6.orig/mm/vmalloc.c
> +++ linux-2.6/mm/vmalloc.c
> @@ -14,6 +14,7 @@
> #include <linux/highmem.h>
> #include <linux/slab.h>
> #include <linux/spinlock.h>
> +#include <linux/mutex.h>
> #include <linux/interrupt.h>
> #include <linux/proc_fs.h>
> #include <linux/seq_file.h>
> @@ -362,7 +363,7 @@ retry:
> goto found;
> }
>
> - while (addr + size >= first->va_start && addr + size <= vend) {
> + while (addr + size > first->va_start && addr + size <= vend) {
> addr = ALIGN(first->va_end + PAGE_SIZE, align);
>
> n = rb_next(&first->rb_node);
> @@ -472,7 +473,7 @@ static atomic_t vmap_lazy_nr = ATOMIC_IN
> static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
> int sync, int force_flush)
> {
> - static DEFINE_SPINLOCK(purge_lock);
> + static DEFINE_MUTEX(purge_lock);
> LIST_HEAD(valist);
> struct vmap_area *va;
> int nr = 0;
> @@ -483,10 +484,10 @@ static void __purge_vmap_area_lazy(unsig
> * the case that isn't actually used at the moment anyway.
> */
> if (!sync && !force_flush) {
> - if (!spin_trylock(&purge_lock))
> + if (!mutex_trylock(&purge_lock))
> return;
> } else
> - spin_lock(&purge_lock);
> + mutex_lock(&purge_lock);
>
> rcu_read_lock();
> list_for_each_entry_rcu(va, &vmap_area_list, list) {
> @@ -518,7 +519,18 @@ static void __purge_vmap_area_lazy(unsig
> __free_vmap_area(va);
> spin_unlock(&vmap_area_lock);
> }
> - spin_unlock(&purge_lock);
> + mutex_unlock(&purge_lock);
> +}
> +
> +/*
> + * Kick off a purge of the outstanding lazy areas. Don't bother if somebody
> + * is already purging.
> + */
> +static void try_purge_vmap_area_lazy(void)
> +{
> + unsigned long start = ULONG_MAX, end = 0;
> +
> + __purge_vmap_area_lazy(&start, &end, 0, 0);
> }
>
> /*
> @@ -528,7 +540,7 @@ static void purge_vmap_area_lazy(void)
> {
> unsigned long start = ULONG_MAX, end = 0;
>
> - __purge_vmap_area_lazy(&start, &end, 0, 0);
> + __purge_vmap_area_lazy(&start, &end, 1, 0);
> }
>
> /*
> @@ -539,7 +551,7 @@ static void free_unmap_vmap_area(struct
> va->flags |= VM_LAZY_FREE;
> atomic_add((va->va_end - va->va_start) >> PAGE_SHIFT, &vmap_lazy_nr);
> if (unlikely(atomic_read(&vmap_lazy_nr) > lazy_max_pages()))
> - purge_vmap_area_lazy();
> + try_purge_vmap_area_lazy();
> }
>
> static struct vmap_area *find_vmap_area(unsigned long addr)