2006-02-05 17:05:47

by Andi Kleen

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Sunday 05 February 2006 17:36, Bharata B Rao wrote:
> Hi,
>
> I am seeing a kernel crash with 2.6.16-rc1 and rc2 but not on any
> 2.6.15 kernels (rc and 2.6.15.2). Arch is x86_64.
>
> The kernel crashes when I run an application which does:
> - mmap (0, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS)
> - mbind the memory to the 1st node with policy MPOL_BIND
> - write to that memory
>
> The crash time log on 2.6.16-rc2 looks like this:
>
> Unable to handle kernel NULL pointer dereference at 0000000000000008 RIP:
> <ffffffff801614df>{__rmqueue+63}

There's another report of it. The boot logs seem ok, so I guess
mbind broke somehow. I suppose it's related to the mempolicy changes
that went into 2.6.16-rc1. I'll try to take a look tomorrow if
Christoph doesn't beat me to it.

OOM with mbind seems to have broken also - it oopses too.

-Andi


2006-02-06 16:11:37

by Christoph Lameter

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Sun, 5 Feb 2006, Andi Kleen wrote:

> > The kernel crashes when I run an application which does:
> > - mmap (0, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS)
> > - mbind the memory to the 1st node with policy MPOL_BIND
> > - write to that memory

Tried the following code on rc1 and rc2 and it worked fine on ia64:

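#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <numaif.h>

int main(int argc, char *argv[])
{
	char *p;
	unsigned long nodes = 0x01;	/* node mask: bit 0 selects node 0 */

	/* anonymous private mapping; no page is allocated until first touch */
	p = mmap(0, 32768, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
	/* restrict allocations for this range to node 0 */
	mbind(p, 32768, MPOL_BIND, &nodes, 64, 0);
	/* the write fault triggers the actual page allocation */
	p[34] = 89;
	return 0;
}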

2006-02-06 18:12:45

by Andi Kleen

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Monday 06 February 2006 17:11, Christoph Lameter wrote:
> On Sun, 5 Feb 2006, Andi Kleen wrote:
>
> > > The kernel crashes when I run an application which does:
> > > - mmap (0, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS)
> > > - mbind the memory to the 1st node with policy MPOL_BIND
> > > - write to that memory
>
> Tried the following code on rc1 and rc2 and it worked fine on ia64:

Perhaps it depends on if the node has enough memory free or not?
I assume if the zonelist has some issue but the first entry is ok
it will only cause problems when the allocation has to go off node
(it shouldn't actually go off node with that policy of course,
but with a full free local node that code path is never triggered)

-Andi

2006-02-06 18:25:21

by Christoph Lameter

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Mon, 6 Feb 2006, Andi Kleen wrote:

> > Tried the following code on rc1 and rc2 and it worked fine on ia64:
>
> Perhaps it depends on if the node has enough memory free or not?
> I assume if the zonelist has some issue but the first entry is ok
> it will only cause problems when the allocation has to go off node
> (it shouldn't actually go off node with that policy of course,

If node 0 is exhausted then you have an OOM situation.

> but with a full free local node that code path is never triggered)

Want me to test the OOM path for mbind?

2006-02-06 18:36:55

by Andi Kleen

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Monday 06 February 2006 19:25, Christoph Lameter wrote:
> On Mon, 6 Feb 2006, Andi Kleen wrote:
>
> > > Tried the following code on rc1 and rc2 and it worked fine on ia64:
> >
> > Perhaps it depends on if the node has enough memory free or not?
> > I assume if the zonelist has some issue but the first entry is ok
> > it will only cause problems when the allocation has to go off node
> > (it shouldn't actually go off node with that policy of course,
>
> If node 0 is exhausted then you have an OOM situation.

No - it could just need to free some cleanable pages first. That's
a long way before going OOM.

> > but with a full free local node that code path is never triggered)
>
> Want me to test the OOM path for mbind?

I already know it oopses - someone else reported that. If you feel
motivated feel free to fix.

-Andi

2006-02-06 18:45:22

by Christoph Lameter

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Mon, 6 Feb 2006, Andi Kleen wrote:

> > If node 0 is exhausted then you have an OOM situation.
>
> No - it could just need to free some cleanable pages first. That's
> a long way before going OOM.

Then node 0 still has memory available. So you suspect zone_reclaim?

> > > but with a full free local node that code path is never triggered)
> >
> > Want me to test the OOM path for mbind?
> I already know it oopses - someone else reported that. If you feel
> motivated feel free to fix.

We also have a minor issue with huge pages. If the pools are exhausted
then the kernel will terminate the application with Bus Error.

2006-02-06 18:55:38

by Andi Kleen

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Monday 06 February 2006 19:45, Christoph Lameter wrote:
> On Mon, 6 Feb 2006, Andi Kleen wrote:
>
> > > If node 0 is exhausted then you have an OOM situation.
> >
> > No - it could just need to free some cleanable pages first. That's
> > a long way before going OOM.
>
> Then node 0 still has memory available. So you suspect zone_reclaim?

Either zone reclaim or the first entry in the zonelist is ok, but it's
not correctly terminated or something like that so it causes
problems when the kernel looks for the second (just speculating here,
I don't know if that is the problem).
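
The termination concern can be illustrated with a small userspace model. This is only a sketch: `mock_zone` and `first_zone_with_room` are hypothetical names, not kernel code. It shows a walk over a NULL-terminated array of zone pointers, similar in spirit to how the allocator steps through zonelist->zones. The walk never looks past the first entry as long as that entry can satisfy the request, so a problem with the terminator would only show up once the first zone runs short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct zone; only what the walk needs. */
struct mock_zone {
	const char *name;
	unsigned long free_pages;
};

/*
 * Walk a NULL-terminated array of zone pointers and return the first
 * zone that can satisfy a request for `want` pages, or NULL when the
 * end of the list is reached. Without the NULL terminator the loop
 * would run off the end of the array.
 */
struct mock_zone *first_zone_with_room(struct mock_zone **zones,
				       unsigned long want)
{
	struct mock_zone **z;

	assert(zones != NULL);
	for (z = zones; *z != NULL; z++) {
		if ((*z)->free_pages >= want)
			return *z;
	}
	return NULL;
}
```

With a first entry that has room, the function returns before it ever reads the second slot, which matches the suspicion that a mostly free local node would hide the bug.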

> > > > but with a full free local node that code path is never triggered)
> > >
> > > Want me to test the OOM path for mbind?
> > I already know it oopses - someone else reported that. If you feel
> > motivated feel free to fix.
>
> We also have a minor issue with huge pages. If the pools are exhausted
> then the kernel will terminate the application with Bus Error.

That is what prereservation was supposed to prevent. I remember there
were endless discussions when this all was originally implemented long
ago (in the version that never got merged).

Basically there were two approaches:
- Do strict overcommit checking at mmap with prereservation (that was
what the old Intel/SGI patch did)

- The hackish way I implemented in SLES9: just check at mmap time
if there are enough pages but don't prereserve anything. That was
more of an 80% solution with races, but seemed to fix the problem well enough
that people in the field didn't really complain. The advantage was that
it was much simpler code.
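
The difference between the two approaches can be sketched in a few lines of userspace C. Everything here is a hypothetical model (`mock_pool`, `map_strict`, `map_check_only`, `fault_one`); it is not taken from either patch. It only illustrates why checking without reserving leaves a race: two mappings can both pass the check, and the shortfall is discovered at fault time.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a huge page pool. */
struct mock_pool {
	unsigned long free;	/* pages not yet handed out */
	unsigned long reserved;	/* pages promised to mappings, not yet faulted */
};

/* Approach 1: strict accounting. The mapping succeeds only if it can reserve. */
bool map_strict(struct mock_pool *p, unsigned long pages)
{
	if (p->free - p->reserved < pages)
		return false;
	p->reserved += pages;
	return true;
}

/* Approach 2: check only. Nothing is set aside for the mapping. */
bool map_check_only(const struct mock_pool *p, unsigned long pages)
{
	return p->free >= pages;
}

/*
 * Fault in one page. `had_reservation` says whether the mapping went
 * through map_strict(). A false return models the Bus Error case.
 */
bool fault_one(struct mock_pool *p, bool had_reservation)
{
	if (had_reservation) {
		assert(p->reserved > 0 && p->free > 0);
		p->reserved--;
		p->free--;
		return true;
	}
	if (p->free == p->reserved)
		return false;
	p->free--;
	return true;
}
```

In this model the strict variant refuses the second mapping up front, while the check-only variant accepts both and fails one of them later, on first touch.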

-Andi

2006-02-06 19:22:08

by Christoph Lameter

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Mon, 6 Feb 2006, Andi Kleen wrote:

> That is what prereservation was supposed to prevent. I remember there
> were endless discussions when this all was originally implemented long
> ago (in the version that never got merged).

But the reservation does not consider cpusets and memory policies, right?
It surely must fail if one restricts allocation to one node and then we run
out of memory. That was the testcase that showed the Bus Error.

2006-02-07 05:55:19

by Bharata B Rao

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Mon, Feb 06, 2006 at 07:55:18PM +0100, Andi Kleen wrote:
> On Monday 06 February 2006 19:45, Christoph Lameter wrote:
> > On Mon, 6 Feb 2006, Andi Kleen wrote:
> >
> > > > If node 0 is exhausted then you have an OOM situation.
> > >
> > > No - it could just need to free some cleanable pages first. That's
> > > a long way before going OOM.
> >
> > Then node 0 still has memory available. So you suspect zone_reclaim?
>
> Either zone reclaim or the first entry in the zonelist is ok, but it's
> not correctly terminated or something like that so it causes
> problems when the kernel looks for the second (just speculating here,
> I don't know if that is the problem)
>

I can still crash my x86_64 box with Christoph's program.

The meminfo in my case looks like this just before I execute the
program.

llm07:~ # cat /sys/devices/system/node/node0/meminfo

Node 0 MemTotal: 3095532 kB
Node 0 MemFree: 2960972 kB
Node 0 MemUsed: 134560 kB
Node 0 Active: 19752 kB
Node 0 Inactive: 14908 kB
Node 0 HighTotal: 0 kB
Node 0 HighFree: 0 kB
Node 0 LowTotal: 3095532 kB
Node 0 LowFree: 2960972 kB
Node 0 Dirty: 0 kB
Node 0 Writeback: 576 kB
Node 0 Mapped: 0 kB
Node 0 Slab: 24200 kB
Node 0 HugePages_Total: 0
Node 0 HugePages_Free: 0
llm07:~ # cat /sys/devices/system/node/node1/meminfo

Node 1 MemTotal: 2002368 kB
Node 1 MemFree: 1964464 kB
Node 1 MemUsed: 37904 kB
Node 1 Active: 10608 kB
Node 1 Inactive: 3056 kB
Node 1 HighTotal: 0 kB
Node 1 HighFree: 0 kB
Node 1 LowTotal: 2002368 kB
Node 1 LowFree: 1964464 kB
Node 1 Dirty: 1164 kB
Node 1 Writeback: 0 kB
Node 1 Mapped: 43064 kB
Node 1 Slab: 9648 kB
Node 1 HugePages_Total: 0
Node 1 HugePages_Free: 0

I was trying to bind the memory to node 0, which still has enough
free memory.

Not sure if this helps, but I have some more debug data.
When the kernel (2.6.16-rc1) oopses at page_alloc.c, line 556
(list_del(&page->lru)), some of the variables in __rmqueue look like this
at the time of the crash:

page = 0xffffffffffffffd8
&page->lru = 0000000000000000
zone = 0xffff81000000e700
zone->name Normal
current_order 0
area->nr_free 0

Regards,
Bharata.
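
The two reported pointer values are related by simple arithmetic, which the sketch below reproduces. It assumes, based only on the numbers in the report, that the lru member sits 0x28 bytes into struct page and that page was computed from a list pointer with list_entry()-style subtraction; `page_model`, `list_head_model`, `page_from_lru` and `prev_member_addr` are hypothetical names. Under that assumption a list pointer of 0 yields page = 0xffffffffffffffd8, and the prev member of a list_head at address 0 lives at 0x8, which is the faulting address in the original oops. This is an inference from the numbers, not a confirmed diagnosis.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit pointers modeled as plain integers so the arithmetic is well defined. */
struct list_head_model {
	uint64_t next;
	uint64_t prev;
};

/*
 * Hypothetical layout: 0x28 bytes of leading fields, then lru.
 * The offset is inferred from the report: 0 - 0xffffffffffffffd8 = 0x28.
 */
struct page_model {
	uint8_t leading_fields[0x28];
	struct list_head_model lru;
};

static_assert(offsetof(struct page_model, lru) == 0x28,
	      "model assumes lru at offset 0x28");

/* list_entry()-style arithmetic: address of lru -> address of the page. */
uint64_t page_from_lru(uint64_t lru_addr)
{
	return lru_addr - offsetof(struct page_model, lru);
}

/* Address touched when the prev member of a list_head is accessed. */
uint64_t prev_member_addr(uint64_t lru_addr)
{
	return lru_addr + offsetof(struct list_head_model, prev);
}
```

Feeding 0 into both helpers gives the page value from the debug data and the address from the oops message.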

2006-02-07 16:50:06

by Christoph Lameter

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Tue, 7 Feb 2006, Bharata B Rao wrote:

> I can still crash my x86_64 box with Christoph's program.

So it looks like the problem is arch specific. Test program runs fine on
ia64.

> page = 0xffffffffffffffd8
> &page->lru = 0000000000000000

Yup lru field overwritten as I thought.

2006-02-07 23:36:45

by Andi Kleen

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Wednesday 08 February 2006 00:27, Ray Bryant wrote:
> On Tuesday 07 February 2006 10:49, Christoph Lameter wrote:
> > On Tue, 7 Feb 2006, Bharata B Rao wrote:
> > > I can still crash my x86_64 box with Christoph's program.
> >
> > So it looks like the problem is arch specific. Test program runs fine on
> > ia64.
> >
> > > page = 0xffffffffffffffd8
> > > &page->lru = 0000000000000000
> >
> > Yup lru field overwritten as I thought.
>
> For what it is worth:
>
> Christoph's test program runs fine on my 32 GB, 4 socket, 8 core Opteron 64

Opteron 64? A new exciting upcoming product? @)

> box with 2.6.16-rc1.

Yes it also works on my test box and also some other simple tests with MPOL_BIND.
But we had similar reports on two different systems, so there's very likely a problem.
Just need to reproduce it somehow.

-Andi

2006-02-07 23:27:42

by Ray Bryant

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Tuesday 07 February 2006 10:49, Christoph Lameter wrote:
> On Tue, 7 Feb 2006, Bharata B Rao wrote:
> > I can still crash my x86_64 box with Christoph's program.
>
> So it looks like the problem is arch specific. Test program runs fine on
> ia64.
>
> > page = 0xffffffffffffffd8
> > &page->lru = 0000000000000000
>
> Yup lru field overwritten as I thought.

For what it is worth:

Christoph's test program runs fine on my 32 GB, 4 socket, 8 core Opteron 64
box with 2.6.16-rc1.
--
Ray Bryant
AMD Performance Labs Austin, Tx
512-602-0038 (o) 512-507-7807 (c)

2006-02-08 12:05:23

by Bharata B Rao

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Wed, Feb 08, 2006 at 12:36:30AM +0100, Andi Kleen wrote:
> On Wednesday 08 February 2006 00:27, Ray Bryant wrote:
> > On Tuesday 07 February 2006 10:49, Christoph Lameter wrote:
> > > On Tue, 7 Feb 2006, Bharata B Rao wrote:
> > > > I can still crash my x86_64 box with Christoph's program.
> > >
> > > So it looks like the problem is arch specific. Test program runs fine on
> > > ia64.
> > >
> > > > page = 0xffffffffffffffd8
> > > > &page->lru = 0000000000000000
> > >
> > > Yup lru field overwritten as I thought.
> >
> > For what it is worth:
> >
> > Christoph's test program runs fine on my 32 GB, 4 socket, 8 core Opteron 64
>
> > Opteron 64? A new exciting upcoming product? @)
>
> > box with 2.6.16-rc1.
>
> Yes it also works on my test box and also some other simple tests with MPOL_BIND.
> But we had similar reports on two different systems, so there's very likely a problem.
> Just need to reproduce it somehow.
>

I believe I understand why I am seeing this problem with my setup.

The zones in my machine look like this:

On node 0 totalpages: 773791
DMA zone: 2151 pages, LIFO batch:0
DMA32 zone: 771640 pages, LIFO batch:31
Normal zone: 0 pages, LIFO batch:0
HighMem zone: 0 pages, LIFO batch:0
On node 1 totalpages: 500592
DMA zone: 0 pages, LIFO batch:0
DMA32 zone: 242032 pages, LIFO batch:31
Normal zone: 258560 pages, LIFO batch:31
HighMem zone: 0 pages, LIFO batch:0

So it can be seen that the node 0 has only DMA and DMA32 zones while
node 1 has only DMA32 and Normal zones.

The current mempolicy code assumes that the highest zone(policy_zone) that
comes under the memory policy is valid (by which I mean zone->present_pages
is non-zero) for all nodes, which is not true in my case. In this case
the policy_zone gets set to ZONE_NORMAL (highest zone here).

When mbind'ing to node 0, bind_zonelist()(and subsequent functions) binds
the ZONE_NORMAL zone to vma->vm_policy. During the write fault, the allocator
is asked to allocate from a non-existent ZONE_NORMAL zone for node 0. This
I believe is causing the oops I am seeing. It is still not clear to me
why the allocator doesn't gracefully fail allocations from a zone which
has zone->present_pages=0.

This whole problem wasn't seen on 2.6.15.2 because bind_zonelist()
actually makes sure that the zone it is binding to has a non-zero
zone->present_pages.

Regards,
Bharata.
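
The analysis above can be modeled in userspace with the page counts from the boot log. This is a sketch under stated assumptions: `mock_zone`, `mock_node`, `highest_populated_zone` and `bind_zonelist_model` are hypothetical, and treating policy_zone as "the highest zone populated on any node" is a simplification of what the message describes. The `skip_empty` flag switches between the 2.6.16-rc behaviour described here (the policy zone is taken unconditionally) and the 2.6.15-style check on present_pages.

```c
#include <assert.h>
#include <stddef.h>

enum { Z_DMA, Z_DMA32, Z_NORMAL, Z_HIGHMEM, NR_ZONES };
#define NR_NODES 2

struct mock_zone { unsigned long present_pages; };
struct mock_node { struct mock_zone zones[NR_ZONES]; };

/* Page counts taken from the boot log in the message above. */
struct mock_node nodes[NR_NODES] = {
	{ .zones = { [Z_DMA] = { 2151 }, [Z_DMA32] = { 771640 } } },
	{ .zones = { [Z_DMA32] = { 242032 }, [Z_NORMAL] = { 258560 } } },
};

/* Assumed model of policy_zone: highest zone with pages on any node. */
int highest_populated_zone(void)
{
	int best = 0;

	for (int nd = 0; nd < NR_NODES; nd++)
		for (int k = 0; k < NR_ZONES; k++)
			if (nodes[nd].zones[k].present_pages && k > best)
				best = k;
	return best;
}

/*
 * Fill `out` (room for NR_NODES + 1 entries) with the policy zone of
 * every node in `nodemask` and NULL-terminate it. With skip_empty set,
 * zones without present pages are left out. Returns the zone count.
 */
size_t bind_zonelist_model(unsigned long nodemask, int policy_zone,
			   int skip_empty, struct mock_zone **out)
{
	size_t num = 0;

	assert(policy_zone >= 0 && policy_zone < NR_ZONES);
	for (int nd = 0; nd < NR_NODES; nd++) {
		struct mock_zone *zone;

		if (!(nodemask & (1UL << nd)))
			continue;
		zone = &nodes[nd].zones[policy_zone];
		if (skip_empty && zone->present_pages == 0)
			continue;
		out[num++] = zone;
	}
	out[num] = NULL;
	return num;
}
```

Binding to node 0 alone without the check puts the empty Normal zone of node 0 into the list; with the check the list for node 0 comes out empty, which is a separate case the caller has to handle.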

2006-02-08 15:42:46

by Christoph Lameter

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Wed, 8 Feb 2006, Bharata B Rao wrote:

> The zones in my machine look like this:
>
> On node 0 totalpages: 773791
> DMA zone: 2151 pages, LIFO batch:0
> DMA32 zone: 771640 pages, LIFO batch:31
> Normal zone: 0 pages, LIFO batch:0
> HighMem zone: 0 pages, LIFO batch:0
> On node 1 totalpages: 500592
> DMA zone: 0 pages, LIFO batch:0
> DMA32 zone: 242032 pages, LIFO batch:31
> Normal zone: 258560 pages, LIFO batch:31
> HighMem zone: 0 pages, LIFO batch:0
>
> So it can be seen that the node 0 has only DMA and DMA32 zones while
> node 1 has only DMA32 and Normal zones.

Uhh... That's a rather asymmetric arrangement.

> The current mempolicy code assumes that the highest zone(policy_zone) that
> comes under the memory policy is valid (by which I mean zone->present_pages
> is non-zero) for all nodes, which is not true in my case. In this case
> the policy_zone gets set to ZONE_NORMAL (highest zone here).

Right.

> When mbind'ing to node 0, bind_zonelist()(and subsequent functions) binds
> the ZONE_NORMAL zone to vma->vm_policy. During the write fault, the allocator
> is asked to allocate from a non-existent ZONE_NORMAL zone for node 0. This
> I believe is causing the oops I am seeing. It is still not clear to me
> why doesn't the allocator fail the allocations from a zone which has
> zone->present_pages=0 gracefully.

Hmm....

> This whole problem wasn't seen on 2.6.15.2 because, bind_zonelist()
> actually makes sure that the zone it is binding to has a non-zero
> zone->present_pages.

Correct there was a loop in bind_zonelist that I moved to the zone
initialization to simplify it.

However, this has implications for policy_zone. This variable should store
the zone that policies apply to. However, in your case this zone will vary
which may lead to all sorts of weird behavior even if we fix
bind_zonelist. To which zone does policy apply? ZONE_NORMAL or ZONE_DMA32?

2006-02-08 15:45:44

by Andi Kleen

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Wednesday 08 February 2006 16:42, Christoph Lameter wrote:

> However, this has implications for policy_zone. This variable should store
> the zone that policies apply to. However, in your case this zone will vary
> which may lead to all sorts of weird behavior even if we fix
> bind_zonelist. To which zone does policy apply? ZONE_NORMAL or ZONE_DMA32?

It really needs to apply to both (currently you can't police 4GB of your
memory on x86-64) But I haven't worked out a good design how to implement it yet.

-Andi



2006-02-08 15:59:30

by Christoph Lameter

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Wed, 8 Feb 2006, Andi Kleen wrote:

> On Wednesday 08 February 2006 16:42, Christoph Lameter wrote:
>
> > However, this has implications for policy_zone. This variable should store
> > the zone that policies apply to. However, in your case this zone will vary
> > which may lead to all sorts of weird behavior even if we fix
> > bind_zonelist. To which zone does policy apply? ZONE_NORMAL or ZONE_DMA32?
>
> It really needs to apply to both (currently you can't police 4GB of your
> memory on x86-64) But I haven't worked out a good design how to implement it yet.

So a provisional solution would be to simply ignore empty zones in
bind_zonelist? Or fall back to earlier zones (which includes unpolicied
zones in the bind zone list?)

Index: linux-2.6.16-rc2/mm/mempolicy.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/mempolicy.c 2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/mm/mempolicy.c 2006-02-08 07:55:29.000000000 -0800
@@ -143,8 +143,12 @@ static struct zonelist *bind_zonelist(no
if (!zl)
return NULL;
num = 0;
- for_each_node_mask(nd, *nodes)
- zl->zones[num++] = &NODE_DATA(nd)->node_zones[policy_zone];
+ for_each_node_mask(nd, *nodes) {
+ struct zone *zone = &NODE_DATA(nd)->node_zones[policy_zone];
+
+ if (zone->present_pages)
+ zl->zones[num++] = &NODE_DATA(nd)->node_zones[policy_zone];
+ }
zl->zones[num] = NULL;
return zl;
}

2006-02-08 16:07:27

by Andi Kleen

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Wednesday 08 February 2006 16:59, Christoph Lameter wrote:
> On Wed, 8 Feb 2006, Andi Kleen wrote:
>
> > On Wednesday 08 February 2006 16:42, Christoph Lameter wrote:
> >
> > > However, this has implications for policy_zone. This variable should store
> > > the zone that policies apply to. However, in your case this zone will vary
> > > which may lead to all sorts of weird behavior even if we fix
> > > bind_zonelist. To which zone does policy apply? ZONE_NORMAL or ZONE_DMA32?
> >
> > It really needs to apply to both (currently you can't police 4GB of your
> > memory on x86-64) But I haven't worked out a good design how to implement it yet.
>
> So a provisional solution would be to simply ignore empty zones in
> bind_zonelist?

That would likely prevent the crash yes (Bharata can you test?)

But of course it still has the problem of a lot of memory being unpolicied
on machines with >4GB if there's both DMA32 and NORMAL.

> Or fall back to earlier zones (which includes unpolicied
> zones in the bind zone list?)

Or that.

Thanks,
-Andi


2006-02-08 16:20:35

by Christoph Lameter

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Wed, 8 Feb 2006, Andi Kleen wrote:

> > So a provisional solution would be to simply ignore empty zones in
> > bind_zonelist?
>
> That would likely prevent the crash yes (Bharata can you test?)
>
> But of course it still has the problem of a lot of memory being unpolicied
> on machines with >4GB if there's both DMA32 and NORMAL.

The fix could result in a zonelist with no zones. So we can answer one
question in __alloc_pages().

Index: linux-2.6.16-rc2/mm/page_alloc.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/page_alloc.c 2006-02-08 00:05:09.000000000 -0800
+++ linux-2.6.16-rc2/mm/page_alloc.c 2006-02-08 08:18:59.000000000 -0800
@@ -913,7 +913,7 @@ restart:
z = zonelist->zones; /* the list of zones suitable for gfp_mask */

if (unlikely(*z == NULL)) {
- /* Should this ever happen?? */
+ /* May occur if MPOL_BIND results in an empty zonelist */
return NULL;
}

2006-02-08 16:28:34

by Andi Kleen

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Wednesday 08 February 2006 17:20, Christoph Lameter wrote:
> On Wed, 8 Feb 2006, Andi Kleen wrote:
>
> > > So a provisional solution would be to simply ignore empty zones in
> > > bind_zonelist?
> >
> > That would likely prevent the crash yes (Bharata can you test?)
> >
> > But of course it still has the problem of a lot of memory being unpolicied
> > on machines with >4GB if there's both DMA32 and NORMAL.
>
> The fix could result in a zonelist with no zones. So we can answer one
> question in __alloc_pages().

I don't think it can happen - at least one zone <= policy-zone has to
have memory otherwise the machine wouldn't work at all.

-Andi

2006-02-08 16:51:43

by Christoph Lameter

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Wed, 8 Feb 2006, Andi Kleen wrote:

> > The fix could result in a zonelist with no zones. So we can answer one
> > question in __alloc_pages().
>
> I don't think it can happen - at least one zone <= policy-zone has to
> have memory otherwise the machine wouldn't work at all.

One could bind to a nodeset that contains a single node. If that node has
no memory in the policy zone then the zonelist generated by
bind_zonelist will be empty.

2006-02-09 04:35:04

by Bharata B Rao

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Wed, Feb 08, 2006 at 05:06:26PM +0100, Andi Kleen wrote:
> On Wednesday 08 February 2006 16:59, Christoph Lameter wrote:
> > On Wed, 8 Feb 2006, Andi Kleen wrote:
> >
> > > On Wednesday 08 February 2006 16:42, Christoph Lameter wrote:
> > >
> > > > However, this has implications for policy_zone. This variable should store
> > > > the zone that policies apply to. However, in your case this zone will vary
> > > > which may lead to all sorts of weird behavior even if we fix
> > > > bind_zonelist. To which zone does policy apply? ZONE_NORMAL or ZONE_DMA32?
> > >
> > > It really needs to apply to both (currently you can't police 4GB of your
> > > memory on x86-64) But I haven't worked out a good design how to implement it yet.
> >
> > So a provisional solution would be to simply ignore empty zones in
> > bind_zonelist?
>
> That would likely prevent the crash yes (Bharata can you test?)

With this solution, the kernel doesn't crash, but the application does.

Shouldn't we fail mbind if we can't bind any zones ?
Something like this...


Signed-off-by: Bharata B Rao <[email protected]>

--- linux-2.6.16-rc2/mm/mempolicy.c.orig 2006-02-09 01:34:37.000000000 -0800
+++ linux-2.6.16-rc2/mm/mempolicy.c 2006-02-09 01:39:32.000000000 -0800
@@ -143,8 +143,18 @@
if (!zl)
return NULL;
num = 0;
- for_each_node_mask(nd, *nodes)
- zl->zones[num++] = &NODE_DATA(nd)->node_zones[policy_zone];
+ for_each_node_mask(nd, *nodes) {
+ struct zone *zone = &NODE_DATA(nd)->node_zones[policy_zone];
+
+ if (zone->present_pages)
+ zl->zones[num++] = zone;
+ }
+
+ if (!num) {
+ /* failed to bind even a single zone */
+ kfree(zl);
+ return NULL;
+ }
zl->zones[num] = NULL;
return zl;
}

>
> But of course it still has the problem of a lot of memory being unpolicied
> on machines with >4GB if there's both DMA32 and NORMAL.
>
> > Or fall back to earlier zones (which includes unpolicied
> > zones in the bind zone list?)
>

Does it make sense to have a separate policy_zone for each node so that we
have at least one (highest) zone in a node which comes under memory policy?

Regards,
Bharata.

2006-02-09 10:02:07

by Andi Kleen

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Thursday 09 February 2006 05:39, Bharata B Rao wrote:
> On Wed, Feb 08, 2006 at 05:06:26PM +0100, Andi Kleen wrote:
> > On Wednesday 08 February 2006 16:59, Christoph Lameter wrote:
> > > On Wed, 8 Feb 2006, Andi Kleen wrote:
> > >
> > > > On Wednesday 08 February 2006 16:42, Christoph Lameter wrote:
> > > >
> > > > > However, this has implications for policy_zone. This variable should store
> > > > > the zone that policies apply to. However, in your case this zone will vary
> > > > > which may lead to all sorts of weird behavior even if we fix
> > > > > bind_zonelist. To which zone does policy apply? ZONE_NORMAL or ZONE_DMA32?
> > > >
> > > > It really needs to apply to both (currently you can't police 4GB of your
> > > > memory on x86-64) But I haven't worked out a good design how to implement it yet.
> > >
> > > So a provisional solution would be to simply ignore empty zones in
> > > bind_zonelist?
> >
> > That would likely prevent the crash yes (Bharata can you test?)
>
> With this solution, the kernel doesn't crash, but the application does.
>
> Shouldn't we fail mbind if we can't bind any zones ?

We really need to fix this properly to support both zones in mbind.




> Does it make sense to have a separate policy_zone for each node so that we
> have at least one (highest) zone in a node which comes under memory policy?

That wouldn't solve the problem. The problem is that the mempolicy needs
at least two zonelists to handle all types of allocations (that is why
I added the concept of policy zone in the first place - to avoid the need
for multilevel zonelists in the policies).

Or maybe it's better to just not do any policy for GFP_DMA32
allocations and always use the highest zonelist. I guess they're somewhat
rare anyway and the policy will rarely succeed.

-Andi

2006-02-14 19:33:14

by Christoph Lameter

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

I just took another look at this issue and I cannot see anything wrong. An
empty zone should be ignored by the page allocator since nr_free == 0. My
patch should not be needed.

Could you get us the contents of the struct zone that the page allocator
is trying to get memory from?

2006-02-15 05:41:37

by Bharata B Rao

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Tue, Feb 14, 2006 at 11:33:00AM -0800, Christoph Lameter wrote:
> I just took another look at this issue and I cannot see anything wrong. An
> empty zone should be ignored by the page allocator since nr_free == 0. My
> patch should not be needed.

There is a check for list_empty(&area->free_list) in __rmqueue(), which
I think is one of the points in the page allocator where the emptiness of
the free_area list is checked. The current zone(when the crash happens)
bypasses this test leading to this crash.

>
> Could you get us the contents of the struct zone that the page allocator
> is trying to get memory from?

The zone looks like this:

crash> p *(struct zone *)0xffff81000000e700
$1 = {
free_pages = 0,
pages_min = 0,
pages_low = 0,
pages_high = 0,
lowmem_reserve = {0, 0, 0, 0},
pageset = {0xffff81000c013740, 0xffff81013fc42f40, 0xffffffff8071d600,
0xffffffff8071d680, 0xffffffff8071d700, 0xffffffff8071d780,
0xffffffff8071d800, 0xffffffff8071d880, 0xffffffff8071d900,
0xffffffff8071d980, 0xffffffff8071da00, 0xffffffff8071da80,
0xffffffff8071db00, 0xffffffff8071db80, 0xffffffff8071dc00,
0xffffffff8071dc80, 0xffffffff8071dd00, 0xffffffff8071dd80,
0xffffffff8071de00, 0xffffffff8071de80, 0xffffffff8071df00,
0xffffffff8071df80, 0xffffffff8071e000, 0xffffffff8071e080,
0xffffffff8071e100, 0xffffffff8071e180, 0xffffffff8071e200,
0xffffffff8071e280, 0xffffffff8071e300, 0xffffffff8071e380,
0xffffffff8071e400, 0xffffffff8071e480},
lock = {
raw_lock = {
slock = 0
},
break_lock = 1
},
free_area = {{
free_list = {
next = 0x0,
prev = 0x0
},
nr_free = 0
}, {
free_list = {
next = 0x0,
prev = 0x0
},
nr_free = 0
}, {
free_list = {
next = 0x0,
prev = 0x0
},
nr_free = 0
}, {
free_list = {
next = 0x0,
prev = 0x0
},
nr_free = 0
}, {
free_list = {
next = 0x0,
prev = 0x0
},
nr_free = 0
}, {
free_list = {
next = 0x0,
prev = 0x0
},
nr_free = 0
}, {
free_list = {
next = 0x0,
prev = 0x0
},
nr_free = 0
}, {
free_list = {
next = 0x0,
prev = 0x0
},
nr_free = 0
}, {
free_list = {
next = 0x0,
prev = 0x0
},
nr_free = 0
}, {
free_list = {
next = 0x0,
prev = 0x0
},
nr_free = 0
}, {
free_list = {
next = 0x0,
prev = 0x0
},
nr_free = 0
}},
_pad1_ = {
x = 0xffff81000000e980 "\001"
},
lru_lock = {
raw_lock = {
slock = 1
},
break_lock = 0
},
active_list = {
next = 0xffff81000000e988,
prev = 0xffff81000000e988
},
inactive_list = {
next = 0xffff81000000e998,
prev = 0xffff81000000e998
},
nr_scan_active = 0,
nr_scan_inactive = 0,
nr_active = 0,
nr_inactive = 0,
pages_scanned = 0,
all_unreclaimable = 0,
reclaim_in_progress = {
counter = 0
},
last_unsuccessful_zone_reclaim = 0,
temp_priority = 12,
prev_priority = 12,
_pad2_ = {
x = 0xffff81000000ea00 ""
},
wait_table = 0x0,
wait_table_size = 0,
wait_table_bits = 0,
zone_pgdat = 0xffff81000000e000,
zone_mem_map = 0x0,
zone_start_pfn = 0,
spanned_pages = 0,
present_pages = 0,
name = 0xffffffff804a858c "Normal"
}

Regards,
Bharata.

2006-02-15 10:33:36

by Bharata B Rao

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Wed, Feb 15, 2006 at 11:16:20AM +0530, Bharata B Rao wrote:
> On Tue, Feb 14, 2006 at 11:33:00AM -0800, Christoph Lameter wrote:
> > I just took another look at this issue and I cannot see anything wrong. An
> > empty zone should be ignored by the page allocator since nr_free == 0. My
> > patch should not be needed.
>
> There is a check for list_empty(&area->free_list) in __rmqueue(), which
> I think is one of the points in the page allocator where the emptiness of
> the free_area list is checked. The current zone(when the crash happens)
> bypasses this test leading to this crash.
>

We don't initialize the free_area list for all zones. Instead,
free_area_init_core() does that only for zones which are non-empty.

But in __rmqueue(), we depend on these free_area lists to be initialized
correctly for all zones, which is not true in the present case we
are discussing.

I think we either need to initialize free_area lists for all zones
or check for !zone->free_area->nr_free in __rmqueue().
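To illustrate why a zeroed free_list slips past the list_empty() check, here is a minimal userspace sketch. It assumes list_empty() compares head->next against the head itself, as the kernel's list.h does; struct list_head is re-declared locally, and free_list_usable() is a hypothetical helper that only shows the kind of extra check proposed above. A head with next == 0x0 (as in the dump) is reported as non-empty, so the allocator would follow a NULL entry; reading ->prev of that entry touches offset 8 on a 64-bit build, which would be consistent with the fault address in the original oops.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* An initialized empty list points back at its own head. */
static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Same comparison the kernel helper performs: empty means next == head. */
static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

/*
 * Hypothetical guard (not from the thread's patch): only use a free list
 * when the area reports free pages and the head was really initialized.
 */
static int free_list_usable(const struct list_head *head, unsigned long nr_free)
{
	assert(head != NULL);
	return nr_free != 0 && head->next != NULL && !list_empty(head);
}
```

With a zeroed head, list_empty() returns 0 (not empty) while free_list_usable() returns 0 (do not use), which is the difference between the two behaviours discussed here.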

Even with this, mbind still needs to be fixed. Even though it
can't get a conforming zone in the node (MPOL_BIND case), right now,
it goes ahead with the "bind"ing of the memory area. This causes the
application to crash (assuming we have fixed the __rmqueue kernel crash)
(I haven't yet figured out exactly why the application dies.)

Regards,
Bharata.

2006-02-15 11:22:05

by Andi Kleen

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Wednesday 15 February 2006 11:38, Bharata B Rao wrote:

>
> Even with this, mbind still needs to be fixed. Even though it
> can't get a conforming zone in the node (MPOL_BIND case),

It should just use lower zones then (e.g. if no ZONE_NORMAL
use ZONE_DMA32). Yes, that needs to be fixed.

How about the appended patch? Does it fix the problem for you?

-Andi

Handle all and empty zones when setting up custom zonelists for mbind

The memory allocator doesn't like empty zones (which have an
uninitialized freelist), so an x86-64 system with a node that lies
entirely in the GFP_DMA32 range would crash on mbind.

Fix that up by putting all possible zones as fallback into the zonelist
and skipping the empty ones.

In fact the code always allocated enough space for all zones,
but only used it for the highest. This change just uses all the
memory that was allocated before.

This should work fine for now, but whoever implements node hot removal
needs to fix this somewhere else too (or make sure the zone data structures
themselves never go away, only their memory).

Signed-off-by: Andi Kleen <[email protected]>

Index: linux/mm/mempolicy.c
===================================================================
--- linux.orig/mm/mempolicy.c
+++ linux/mm/mempolicy.c
@@ -132,19 +132,29 @@ static int mpol_check_policy(int mode, n
}
return nodes_subset(*nodes, node_online_map) ? 0 : -EINVAL;
}
+
/* Generate a custom zonelist for the BIND policy. */
static struct zonelist *bind_zonelist(nodemask_t *nodes)
{
struct zonelist *zl;
- int num, max, nd;
+ int num, max, nd, k;

max = 1 + MAX_NR_ZONES * nodes_weight(*nodes);
- zl = kmalloc(sizeof(void *) * max, GFP_KERNEL);
+ zl = kmalloc(sizeof(struct zone *) * max, GFP_KERNEL);
if (!zl)
return NULL;
num = 0;
- for_each_node_mask(nd, *nodes)
- zl->zones[num++] = &NODE_DATA(nd)->node_zones[policy_zone];
+ /* First put in the highest zones from all nodes, then all the next
+ lower zones etc. Avoid empty zones because the memory allocator
+ doesn't like them. If you implement node hot removal you
+ have to fix that. */
+ for (k = policy_zone; k >= 0; k--) {
+ for_each_node_mask(nd, *nodes) {
+ struct zone *z = &NODE_DATA(nd)->node_zones[k];
+ if (z->present_pages > 0)
+ zl->zones[num++] = z;
+ }
+ }
zl->zones[num] = NULL;
return zl;
}
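The ordering the patch produces can be tried outside the kernel with a small model. This is only a sketch under stated assumptions: struct zone and struct node are cut-down stand-ins, MAX_NR_ZONES is set to 4 for illustration, nodes are a plain array rather than a nodemask, and build_bind_zonelist() is a hypothetical name, not the kernel function.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NR_ZONES 4	/* assumed value, for the sketch only */

/* Cut-down stand-ins for the kernel structures. */
struct zone {
	unsigned long present_pages;
};

struct node {
	struct zone node_zones[MAX_NR_ZONES];
};

/*
 * Fill out[] with the populated zones of all nodes, highest zone index
 * first, skipping empty zones. out[] must have room for
 * nr_nodes * MAX_NR_ZONES + 1 entries. Returns the number of zones stored;
 * the list is NULL terminated.
 */
static size_t build_bind_zonelist(struct node *nodes, size_t nr_nodes,
				  int policy_zone, struct zone **out)
{
	size_t num = 0;

	assert(policy_zone >= 0 && policy_zone < MAX_NR_ZONES);
	for (int k = policy_zone; k >= 0; k--) {
		for (size_t nd = 0; nd < nr_nodes; nd++) {
			struct zone *z = &nodes[nd].node_zones[k];

			if (z->present_pages > 0)
				out[num++] = z;
		}
	}
	out[num] = NULL;
	return num;
}
```

For example, with node 0 populated in zones 0 and 2 and node 1 populated only in zone 1, a policy_zone of 2 yields node 0 zone 2, then node 1 zone 1, then node 0 zone 0. The node with no memory in the highest zone still contributes a fallback entry instead of an empty zone.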

2006-02-15 18:10:49

by Christoph Lameter

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Wed, 15 Feb 2006, Bharata B Rao wrote:

> We don't initialize the free_area list for all zones. Instead,
> free_area_init_core() does that only for zones which are non-empty.

Right.

> But in __rmqueue(), we depend on these free_area lists to be initialized
> correctly for all zones, which is not true in the present case we
> are discussing.

> I think we either need to initialize free_area lists for all zones
> or check for !zone->free_area->nr_free in __rmqueue().

Or we can initialize all pcp to contain empty lists for zones without
pages.

> Even with this, mbind still needs to be fixed. Even though it
> can't get a conforming zone in the node (MPOL_BIND case), right now,
> it goes ahead with the "bind"ing of the memory area. This causes the
> application to crash (assuming we have fixed the __rmqueue kernel crash)
> (I haven't yet figured out exactly why the application dies.)

The application crashes because of an OOM.

2006-02-15 18:14:55

by Christoph Lameter

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Wed, 15 Feb 2006, Andi Kleen wrote:

> How about the appended patch? Does it fix the problem for you?

I think we still need to address the issue of being able to crash
the page allocator if an empty zone is in the zonelist.

> This should work fine for now, but whoever implements node hot removal
> needs to fix this somewhere else too (or make sure the zone data structures
> themselves never go away, only their memory).

Yup. Simply initializing the pcp structures with empty lists should
suffice though.

2006-02-16 05:14:07

by Bharata B Rao

Subject: Re: [discuss] mmap, mbind and write to mmap'ed memory crashes 2.6.16-rc1[2] on 2 node X86_64

On Wed, Feb 15, 2006 at 12:21:53PM +0100, Andi Kleen wrote:
> On Wednesday 15 February 2006 11:38, Bharata B Rao wrote:
>
> >
> > Even with this, mbind still needs to be fixed. Even though it
> > can't get a conforming zone in the node (MPOL_BIND case),
>
> It should just use lower zones then (e.g. if no ZONE_NORMAL
> use ZONE_DMA32). Yes, that needs to be fixed.
>
> How about the appended patch? Does it fix the problem for you?
>

Yes, this fixes the problem. With this patch, neither the kernel nor
the application crashes.

Regards,
Bharata.