2006-11-15 18:33:40

by Christian Krafft

Subject: [patch 0/2] fix bugs while booting on NUMA system where some nodes have no mem

Hi,

The following patches fix two problems that showed up
while booting a NUMA system where memory was limited to the first node.
Please cc me for comments as I am not subscribed.

cheers,
Christian

PS: sorry for resending it, I didn't cc myself, and wasn't able to reply to this note.


2006-11-15 18:35:29

by Christian Krafft

Subject: [patch 1/2] fix call to alloc_bootmem after bootmem has been freed

In some cases alloc_bootmem is called after the bootmem pages have
been freed. This happens because the condition SYSTEM_BOOTING is
still true after bootmem has been freed.

Signed-off-by: Christian Krafft <[email protected]>

Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -1931,7 +1931,7 @@ int zone_wait_table_init(struct zone *zo
 	alloc_size = zone->wait_table_hash_nr_entries
 					* sizeof(wait_queue_head_t);
 
-	if (system_state == SYSTEM_BOOTING) {
+	if (!slab_is_available()) {
 		zone->wait_table = (wait_queue_head_t *)
 			alloc_bootmem_node(pgdat, alloc_size);
 	} else {

2006-11-15 18:37:29

by Christian Krafft

Subject: [patch 2/2] enables booting a NUMA system where some nodes have no memory

When booting a NUMA system with nodes that have no memory (e.g. by limiting memory),
bootmem_alloc_core tried to find pages in an uninitialized bootmem_map.
This caused a NULL pointer access.
This fix adds a check so that NULL is returned instead.
That enables the caller (bootmem_alloc_nopanic)
to allocate memory on another node without a panic.

Signed-off-by: Christian Krafft <[email protected]>

Index: linux/mm/bootmem.c
===================================================================
--- linux.orig/mm/bootmem.c
+++ linux/mm/bootmem.c
@@ -196,6 +196,10 @@ __alloc_bootmem_core(struct bootmem_data
 	if (limit && bdata->node_boot_start >= limit)
 		return NULL;
 
+	/* on nodes without memory - bootmem_map is NULL */
+	if(!bdata->node_bootmem_map)
+		return NULL;
+
 	end_pfn = bdata->node_low_pfn;
 	limit = PFN_DOWN(limit);
 	if (limit && end_pfn > limit)
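
The effect of the added check can be sketched in user space. This is only a toy model under stated assumptions, not the kernel code: `struct toy_bootmem`, its `bytes_left` field, and both `toy_*` functions are hypothetical names, and `malloc()` stands in for the real bitmap search.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for struct bootmem_data; bytes_left is a hypothetical field. */
struct toy_bootmem {
	void *node_bootmem_map;	/* NULL on a node that has no memory */
	size_t bytes_left;	/* capacity still available on this node */
};

/* Models the patched __alloc_bootmem_core(): bail out before touching
 * a bitmap that was never set up. */
static void *toy_alloc_core(struct toy_bootmem *bdata, size_t size)
{
	assert(bdata != NULL);
	if (!bdata->node_bootmem_map)
		return NULL;
	if (size > bdata->bytes_left)
		return NULL;
	bdata->bytes_left -= size;
	return malloc(size);
}

/* Models a nopanic-style caller: try the preferred node first, then the
 * following nodes, wrapping around. Stores the serving node in *served_by,
 * or -1 if no node could satisfy the request. */
static void *toy_alloc_nopanic(struct toy_bootmem *nodes, int nr_nodes,
			       int preferred, size_t size, int *served_by)
{
	for (int i = 0; i < nr_nodes; i++) {
		int nid = (preferred + i) % nr_nodes;
		void *p = toy_alloc_core(&nodes[nid], size);

		if (p) {
			*served_by = nid;
			return p;
		}
	}
	*served_by = -1;
	return NULL;
}
```

With node 1 modelled as memoryless (`node_bootmem_map == NULL`), a request that prefers node 1 is served from node 0 instead of crashing, which is the behaviour the patch description asks for.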

2006-11-15 21:25:28

by Christoph Lameter

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

On Wed, 15 Nov 2006, Christian Krafft wrote:

> When booting a NUMA system with nodes that have no memory (eg by limiting memory),
> bootmem_alloc_core tried to find pages in an uninitialized bootmem_map.

Why should we support nodes with no memory? If a node has no memory then
its processors and other resources need to be attached to the nearest node
with memory.

AFAICT The primary role of a node is to manage memory.

2006-11-15 21:58:55

by Jack Steiner

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

On Wed, Nov 15, 2006 at 01:24:55PM -0800, Christoph Lameter wrote:
> On Wed, 15 Nov 2006, Christian Krafft wrote:
>
> > When booting a NUMA system with nodes that have no memory (eg by limiting memory),
> > bootmem_alloc_core tried to find pages in an uninitialized bootmem_map.
>
> Why should we support nodes with no memory? If a node has no memory then
> its processors and other resources need to be attached to the nearest node
> with memory.
>
> AFAICT The primary role of a node is to manage memory.
>

SGI has nodes that have neither memory nor cpus. These are
IO nodes. Think of them as ordinary nodes that have had the
cpu's & DIMMs removed. Only the IO buses remain.

IO nodes have the same NUMA properties as regular nodes.
They are connected via the numalink fabric, they should be described
in the SLIT table, they should be identified in proximity_domains, etc.

A lot of the core infrastructure is currently missing that is required
to describe IO nodes as regular nodes, but in principle, I don't
see anything wrong with nodes w/o memory.


It is also possible to disable the DIMMs on a node that actually has
cpus & memory. I suspect this doesn't work, but I see no reason that you
should HAVE to disable the cpus on nodes that have had the DIMMs disabled.
Our BIOS currently provides the capability to disable DIMMs. The BIOS has
a hack to automatically disable cpus if all DIMMs have been disabled.
This hack was required for several reasons, one of which was that Linux
does not support nodes with cpus & no memory.



-- jack

2006-11-15 22:05:48

by Martin Bligh

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

Christoph Lameter wrote:
> On Wed, 15 Nov 2006, Christian Krafft wrote:
>
>> When booting a NUMA system with nodes that have no memory (eg by limiting memory),
>> bootmem_alloc_core tried to find pages in an uninitialized bootmem_map.
>
> Why should we support nodes with no memory? If a node has no memory then
> its processors and other resources need to be attached to the nearest node
> with memory.
>
> AFAICT The primary role of a node is to manage memory.

A node is an arbitrary container object containing one or more of:

CPUs
Memory
IO bus

It does not have to contain memory.

M.

2006-11-15 22:41:10

by Christoph Lameter

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

On Wed, 15 Nov 2006, Jack Steiner wrote:

> A lot of the core infrastructure is currently missing that is required
> to describe IO nodes as regular nodes, but in principle, I don't
> see anything wrong with nodes w/o memory.

Every processor has a local node on which it runs. The kernel places
memory used by the processor on the local node. Even if we allow
nodes without memory: We still need to associate a "local" node to the
processor. If that is across some NUMA interlink then it is going to be
slower but it will work.

AFAIK It seems to be better to explicitly associate a memory node with a
processor during bootup in arch code.

Various kernel optimizations rely on local memory. Would we create
a special case here of a pglist_data structure without a zones structure?

It seems that the contents of pglist_data are targeted to a memory node.
If we do not have a pglist_data structure then the node would not exist
for the kernel.

What would the benefit or difference be of having nodes without memory?

2006-11-15 22:42:14

by Christoph Lameter

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

On Wed, 15 Nov 2006, Martin Bligh wrote:

> A node is an arbitrary container object containing one or more of:
>
> CPUs
> Memory
> IO bus
>
> It does not have to contain memory.

I have never seen a node on Linux without memory. I have seen nodes
without processors and without I/O but not without memory. This seems to be
something new?

2006-11-15 22:43:52

by Martin Bligh

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

Christoph Lameter wrote:
> On Wed, 15 Nov 2006, Jack Steiner wrote:
>
>> A lot of the core infrastructure is currently missing that is required
>> to describe IO nodes as regular nodes, but in principle, I don't
>> see anything wrong with nodes w/o memory.
>
> Every processor has a local node on which it runs. The kernel places
> memory used by the processor on the local node. Even if we allow
> nodes without memory: We still need to associate a "local" node to the
> processor. If that is across some NUMA interlink then it is going to be
> slower but it will work.
>
> AFAIK It seems to be better to explicitly associate a memory node with a
> processor during bootup in arch code.
>
> Various kernel optimizations rely on local memory. Would we create
> a special case here of a pglist_data structure without a zones structure?
>
> It seems that the contents of pglist_data are targeted to a memory node.
> If we do not have a pglist_data structure then the node would not exist
> for the kernel.
>
> What would the benefit or difference be of having nodes without memory?

Some nodes really don't have memory. Either because it's been
deconfigured, or because it was never there in the first place.
We shouldn't need to kludge that.

All we need is an appropriate zonelist for each node, pointing to
the memory it should be accessing.

M.

2006-11-15 22:46:05

by Martin Bligh

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

Christoph Lameter wrote:
> On Wed, 15 Nov 2006, Martin Bligh wrote:
>
>> A node is an arbitrary container object containing one or more of:
>>
>> CPUs
>> Memory
>> IO bus
>>
>> It does not have to contain memory.
>
> I have never seen a node on Linux without memory. I have seen nodes
> without processors and without I/O but not without memory.This seems to be
> something new?

A node was always defined that way. Search back a few years in the lkml
archives. We may be finding bugs in the implementation, but the
definition has not changed.

Supposing we hot-unplugged all the memory in a node? Or, as seems to have
happened in this instance, booted with mem=, cutting out the memory on that
node.

M.

2006-11-15 22:51:57

by Christoph Lameter

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

On Wed, 15 Nov 2006, Martin Bligh wrote:

> Supposing we hot-unplugged all the memory in a node? Or seems to have
> happened in this instance is boot with mem=, cutting out memory on that
> node.

So a node with no memory has a pglist_data structure but no zones? Or empty
zones?

2006-11-15 22:53:14

by Christoph Lameter

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

On Wed, 15 Nov 2006, Martin Bligh wrote:

> All we need is an appropriate zonelist for each node, pointing to
> the memory it should be accessing.

But there is no memory on the node. Does the zonelist contain the zones of
the node without memory or not? We simply fall back each allocation to the
next node as if the node was overflowing?

2006-11-16 00:26:21

by Arnd Bergmann

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

On Wednesday 15 November 2006 23:41, Christoph Lameter wrote:
> On Wed, 15 Nov 2006, Martin Bligh wrote:
> > A node is an arbitrary container object containing one or more of:
> >
> > CPUs
> > Memory
> > IO bus

+ SPUs on a Cell processor

> > It does not have to contain memory.
>
> I have never seen a node on Linux without memory. I have seen nodes
> without processors and without I/O but not without memory.This seems to be
> something new?

In this particular case, we have a dual-socket Cell/B.E. blade server,
where each of the two CPU-socket/south-bridge/memory combinations is
treated as a separate node. The two points that make this tricky
are:

- we want to be able to boot with the 'mem=512M' option, which effectively
disables the memory on the second node (each node has 512MiB).
- Each node has 8 SPUs, all of which we want to use. In order to use an
SPU, we call __add_pages to register the local memory on it, so we have
struct page pointers we can hand out to user mappings with ->nopage().

The __add_pages call needs to do node local allocations (there are
probably more allocations that have the same problem, but this is the
first one that crashes), which oops when there is no memory registered
at all for that node, instead of returning an error or falling back
on a non-local allocation.

Arnd <><

2006-11-16 00:45:00

by Jesper Juhl

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

On 15/11/06, Christoph Lameter <[email protected]> wrote:
> On Wed, 15 Nov 2006, Martin Bligh wrote:
>
> > A node is an arbitrary container object containing one or more of:
> >
> > CPUs
> > Memory
> > IO bus
> >
> > It does not have to contain memory.
>
> I have never seen a node on Linux without memory. I have seen nodes
> without processors and without I/O but not without memory.This seems to be
> something new?
>
What about SMP Opteron boards that have RAM slots for each CPU?
With two (or more) CPUs and only memory slots populated for one of
them, wouldn't that count as multiple NUMA nodes but only one of them
with memory?
That would seem to be a pretty common thing that could happen.

--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2006-11-16 00:46:11

by Christoph Lameter

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

On Thu, 16 Nov 2006, Arnd Bergmann wrote:

> - we want to be able to boot with the 'mem=512M' option, which effectively
> disables the memory on the second node (each node has 512MiB).
> - Each node has 8 SPUs, all of which we want to use. In order to use an
> SPU, we call __add_pages to register the local memory on it, so we have
> struct page pointers we can hand out to user mappings with ->nopage().

This is more like the bringup of a processor right? You need
to have the memory online before the processor is brought up otherwise
the slab cannot properly allocate its structures on the node when the
per node portion is brought up. The page allocator has similar issues.

2006-11-16 00:46:55

by Christoph Lameter

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

On Thu, 16 Nov 2006, Jesper Juhl wrote:

> What about SMP Opteron boards that have RAM slots for each CPU?
> With two (or more) CPU's and only memory slots populated for one of
> them, wouldn't that count as multiple NUMA nodes but only one of them
> with memory?
> That would seem to be a pretty common thing that could happen.

I think so far we have handled these as two processors on one node.

2006-11-16 00:50:59

by Kamezawa Hiroyuki

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

On Wed, 15 Nov 2006 14:52:43 -0800 (PST)
Christoph Lameter <[email protected]> wrote:

> On Wed, 15 Nov 2006, Martin Bligh wrote:
>
> > All we need is an appropriate zonelist for each node, pointing to
> > the memory it should be accessing.
>
> But there is no memory on the node. Does the zonelist contain the zones of
> the node without memory or not? We simply fall back each allocation to the
> next node as if the node was overflowing?
>
Yes, just fall back.
The zonelist[] doesn't contain empty zones.

-Kame

2006-11-16 00:56:23

by Kamezawa Hiroyuki

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

On Wed, 15 Nov 2006 14:51:26 -0800 (PST)
Christoph Lameter <[email protected]> wrote:

> On Wed, 15 Nov 2006, Martin Bligh wrote:
>
> > Supposing we hot-unplugged all the memory in a node? Or seems to have
> > happened in this instance is boot with mem=, cutting out memory on that
> > node.
>
> So a node with no memory has a pgdat_list structure but no zones? Or empty
> zones?
>

The node just has empty zones. The pgdat/per-cpu area is allocated on another
(nearest) node.

I hear some vendor's machine has this configuration (ia64, maybe SGI or HP):

Node0: CPUx0 + XXXGb memory
Node1: CPUx2 + 16MB memory
Node2: CPUx2 + 16MB memory

The memory of Node1 and Node2 is trimmed at boot by GRANULE alignment.
Then the final view is:
Node0 : memory-only-node
Node1 : cpu-only-node
Node2 : cpu-only-node.
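
The trimming step can be illustrated with a little arithmetic. This is a sketch under assumptions: the thread gives neither the granule size nor the physical addresses, so the helper name `usable_after_trim` and all numbers used with it are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MB (1024ULL * 1024ULL)

/* Bytes left after shrinking [start, end) to whole, aligned granules.
 * granule must be a non-zero power of two. */
static uint64_t usable_after_trim(uint64_t start, uint64_t end,
				  uint64_t granule)
{
	assert(granule != 0 && (granule & (granule - 1)) == 0);

	uint64_t first = (start + granule - 1) & ~(granule - 1); /* round up */
	uint64_t last = end & ~(granule - 1);			  /* round down */

	return last > first ? last - first : 0;
}
```

Under these assumptions, a 16MB range is lost entirely when the granule is larger than the range, or when the granule is 16MB but the range is not aligned to it. Either case would leave the node with cpus and no usable memory.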

-Kame

2006-11-16 00:58:30

by Christoph Lameter

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

On Thu, 16 Nov 2006, KAMEZAWA Hiroyuki wrote:

> > But there is no memory on the node. Does the zonelist contain the zones of
> > the node without memory or not? We simply fall back each allocation to the
> > next node as if the node was overflowing?
> yes. just fallback.

Ok, so we got a useless pglist_data struct and the struct zone contains a
zonelist that does not include the zone.

numa_node_id() points to this and we always get allocations redirected to
other nodes. The slab duplicates its per node structures on the fallback
node.

> The zonelist[] donen't contain empty-zone.

So we will never encounter that zone except when going to the
pglist_data struct through numa_node_id()?

2006-11-16 01:14:12

by Kamezawa Hiroyuki

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

On Wed, 15 Nov 2006 16:57:56 -0800 (PST)
Christoph Lameter <[email protected]> wrote:
> numa_node_id() points to this and we always get allocations redirected to
> other nodes. The slab duplicates its per node structures on the fallback
> node.
>
> > The zonelist[] donen't contain empty-zone.
>
> So we will never encounter that zone except when going to the
> pglist_data struct through numa_node_id()?
>
Some pgdat/zone scanning code will access it.
See: for_each_zone() and populated_zone().

AFAIK, in the 2.6.9 era (i.e. RHEL4), cpus on a memoryless node were moved to
the nearest node, and there was no useless pgdat.

Now there are memoryless nodes. Cpus on a memoryless node sit on a pgdat
with empty zones. I think this is much simpler than remapping.
I also think cpus on a memoryless node share something (FSB, switch, etc.),
so tying cpus to a memoryless node may have some benefit.

-Kame

2006-11-16 01:22:58

by Yasunori Goto

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

> I hear some vender's machine has this configuration. (ia64, maybe SGI or HP)
>
> Node0: CPUx0 + XXXGb memory
> Node1: CPUx2 + 16MB memory
> Node2: CPUx2 + 16MB memory
>
> memory of Node1 and Node2 is tirmmed at boot by GRANULE alignment.
> Then, final view is
> Node0 : memory-only-node
> Node1 : cpu-only-node
> Node2 : cpu-only-node.

IIRC, this is an HP box. It uses memory interleaving among nodes.

Bye.
--
Yasunori Goto


2006-11-16 01:35:51

by Jack Steiner

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

On Wed, Nov 15, 2006 at 02:40:36PM -0800, Christoph Lameter wrote:
> On Wed, 15 Nov 2006, Jack Steiner wrote:
>
> > A lot of the core infrastructure is currently missing that is required
> > to describe IO nodes as regular nodes, but in principle, I don't
> > see anything wrong with nodes w/o memory.
>
> Every processor has a local node on which it runs. The kernel places
> memory used by the processor on the local node. Even if we allow
> nodes without memory: We still need to associate a "local" node to the
> processor. If that is across some NUMA interlink then it is going to be
> slower but it will work.

True.

>
> AFAIK It seems to be better to explicitly associate a memory node with a
> processor during bootup in arch code.
>
> Various kernel optimizations rely on local memory. Would we create
> a special case here of a pglist_data structure without a zones structure?
>
> It seems that the contents of pglist_data are targeted to a memory node.
> If we do not have a pglist_data structure then the node would not exist
> for the kernel.
>
> What would the benefit or difference be of having nodes without memory?

I doubt that there is a demand for systems with memoryless nodes. However, if the
DIMM(s) on a node fail, I think the system may perform better
with the cpus on the node enabled than it will if they have to be
disabled.



-- jack

2006-11-16 01:57:37

by Christoph Lameter

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

On Wed, 15 Nov 2006, Jack Steiner wrote:

> I doubt that there is a demand for systems with memoryless nodes. However, if the
> DIMM(s) on a node fails, I think the system may perform better
> with the cpus on the node enabled than it will if they have to be
> disabled.

Right now we do not have the capability to remove memory from a node while
the system is running.

If the DIMMs have failed and we boot up and the system finds out that
there is no memory on that node then the cpus can be remapped to
the next memory node. That is better than having lots of useless
structures allocated.

2006-11-16 02:01:37

by Martin Bligh

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

Christoph Lameter wrote:
> On Wed, 15 Nov 2006, Martin Bligh wrote:
>
>> All we need is an appropriate zonelist for each node, pointing to
>> the memory it should be accessing.
>
> But there is no memory on the node. Does the zonelist contain the zones of
> the node without memory or not? We simply fall back each allocation to the
> next node as if the node was overflowing?

Sure. There's no point in putting an empty zone in the zonelist.
We should just skip anything where present_pages is zero.
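
A minimal sketch of that rule, assuming one zone per node for brevity. `struct toy_zone` and `toy_build_zonelist` are hypothetical names, and this is not the kernel's zonelist builder.

```c
#include <assert.h>
#include <stddef.h>

/* Toy zone: just the node it belongs to and how many pages it has. */
struct toy_zone {
	int nid;
	unsigned long present_pages;
};

/* Walk the nodes in the given preference order and collect pointers to
 * the zones that actually have pages. Returns the number of entries
 * written to out[], which must have room for nr pointers. */
static size_t toy_build_zonelist(struct toy_zone *zones, const int *order,
				 size_t nr, struct toy_zone **out)
{
	size_t n = 0;

	assert(out != NULL);
	for (size_t i = 0; i < nr; i++) {
		struct toy_zone *z = &zones[order[i]];

		if (z->present_pages == 0)
			continue;	/* empty zone: never listed */
		out[n++] = z;
	}
	return n;
}
```

For a memoryless node 1 whose preference order is "node 1, then node 0", the resulting list holds only node 0's zone, so every allocation from node 1 falls back without any special casing at allocation time.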

M.


2006-11-16 02:09:26

by Martin Bligh

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

Christoph Lameter wrote:
> On Wed, 15 Nov 2006, Jack Steiner wrote:
>
>> I doubt that there is a demand for systems with memoryless nodes. However, if the
>> DIMM(s) on a node fails, I think the system may perform better
>> with the cpus on the node enabled than it will if they have to be
>> disabled.
>
> Right now we do not have the capability to remove memory from a node while
> the system is running.
>
> If the DIMMs have failed and we boot up and the systems finds out that
> there is no memory on that node then the cpus can be remapped to
> the next memory node. That is better than having lots of useless
> structures allocated.

A node without memory is a node without memory. Simply remapping the
cpus to another node and pretending the world is different does not
make much sense.

Is there some fundamental problem you see with dealing with the nodes
as is? Doesn't seem that hard to me. I'm not asking you to put the
effort in to fixing it, just if you see some fundamental reason why
it can't be fixed?

M.

2006-11-16 02:35:33

by Christoph Lameter

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

On Wed, 15 Nov 2006, Martin Bligh wrote:

> A node without memory is a node without memory. Simply remapping the
> cpus to another node and pretending the world is different does not
> make much sense.

It avoids overhead both in terms of memory and processing in the kernel
and it seems that is the way we have traditionally dealt with the issue?

Nodes without memory require the VM to allocate memory from different
nodes in order to build up management structures for the node (these
are useless since the node has no memory, caches will be split etc etc).

The cpus will always fall back to the next node anyway since
their zonelist begins with a zone in a node that has memory.

> Is there some fundamental problem you see with dealing with the nodes
> as is? Doesn't seem that hard to me. I'm not asking you to put the
> effort in to fixing it, just if you see some fundamental reason why
> it can't be fixed?

I am not sure how memoryless nodes would affect various subsystems. And it
seems that this patch only fixes the first issue that they found (?). If
we go down this route then we may have to add more special casing to the
VM in order to cleanly handle memoryless nodes.

But maybe someone else already has experience with memoryless nodes?

2006-11-16 03:28:44

by Jack Steiner

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

On Wed, Nov 15, 2006 at 05:57:27PM -0800, Christoph Lameter wrote:
> On Wed, 15 Nov 2006, Jack Steiner wrote:
>
> > I doubt that there is a demand for systems with memoryless nodes. However, if the
> > DIMM(s) on a node fails, I think the system may perform better
> > with the cpus on the node enabled than it will if they have to be
> > disabled.
>
> Right now we do not have the capability to remove memory from a node while
> the system is running.

I know. I'm referring to a DIMM that fails power-on diags or one
that is explicitly disabled from the system controller.

Clearly a reboot is required in both cases, but the end result is
a node with cpus and no memory. As I said earlier, the PROM (for several
reasons) automatically disables the cpus on nodes w/o memory.

2006-11-16 13:08:37

by Arnd Bergmann

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

On Thursday 16 November 2006 01:45, Christoph Lameter wrote:
> On Thu, 16 Nov 2006, Arnd Bergmann wrote:
>
> > - we want to be able to boot with the 'mem=512M' option, which effectively
> >   disables the memory on the second node (each node has 512MiB).
> > - Each node has 8 SPUs, all of which we want to use. In order to use an
> >   SPU, we call __add_pages to register the local memory on it, so we have
> >   struct page pointers we can hand out to user mappings with ->nopage().
>
> This is more like the bringup of a processor right? You need
> to have the memory online before the processor is brought up otherwise
> the slab cannot properly allocate its structures on the node when the
> per node portion is brought up. The page allocator has similar issues.

No, that's not really the issue here. The memory we're trying to add to the
mem_map can not be used for kernel allocations at all and is never entered
into the buddy allocator. It can only be used for applications running on
an SPU itself.

So the problem is not the order in which we do things, but the fact that
the node data structure has not been initialized, and never will be, when
we add the SPU to the node.

Arnd <><

2006-11-16 15:21:37

by Lee Schermerhorn

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

On Wed, 2006-11-15 at 14:41 -0800, Christoph Lameter wrote:
> On Wed, 15 Nov 2006, Martin Bligh wrote:
>
> > A node is an arbitrary container object containing one or more of:
> >
> > CPUs
> > Memory
> > IO bus
> >
> > It does not have to contain memory.
>
> I have never seen a node on Linux without memory. I have seen nodes
> without processors and without I/O but not without memory.This seems to be
> something new?

I sent this out earlier in response to another message from Christoph
regarding nodes w/o memory. Don't know if it made it...

>On Fri, 2006-11-10 at 10:16 -0800, Christoph Lameter wrote:
>> On Wed, 8 Nov 2006, KAMEZAWA Hiroyuki wrote:
>>
>> > I wonder there are no code for creating NODE_DATA() for device-only-node.
>>
>> On IA64 we remap nodes with no memory / cpus to the nearest node with
>> memory. I think that is sufficient.

I don't think this happens anymore. Back in the ~2.6.5 days, when we
would configure our numa platforms with 100% of memory interleaved [in
hardware at cache line granularity], the cpus would move to the
interleaved "pseudo-node" and the memoryless nodes would be removed.
numactl --hardware would show something like this:

# uname -r
2.6.5-7.244-default
# numactl --hardware
available: 1 nodes (0-0)
node 0 size: 65443 MB
node 0 free: 64506 MB

I started seeing different behavior about the time SPARSEMEM went in.
Now, with a 2.6.16 base kernel [same platform, hardware interleaved
memory], I see:

# uname -r
2.6.16.21-0.8-default
# numactl --hardware
available: 5 nodes (0-4)
node 0 size: 0 MB
node 0 free: 0 MB
node 1 size: 0 MB
node 1 free: 0 MB
node 2 size: 0 MB
node 2 free: 0 MB
node 3 size: 0 MB
node 3 free: 0 MB
node 4 size: 65439 MB
node 4 free: 64492 MB
node distances:
node   0   1   2   3   4
  0:  10  17  17  17  14
  1:  17  10  17  17  14
  2:  17  17  10  17  14
  3:  17  17  17  10  14
  4:  14  14  14  14  10

[Aside: The firmware/SLIT says that the interleaved memory is closer to
all nodes than other nodes' memory. This has interesting implications
for the "overflow" zone lists...]
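
To make the aside concrete, here is a sketch that orders nodes by the distances printed above. It is a deliberately simplified model: the sort-by-distance rule and the `fallback_order` helper are assumptions for illustration, and the kernel's real fallback ordering takes more into account than raw distance.

```c
#include <assert.h>

#define NR_NODES 5

/* Distance table copied from the numactl output above. */
static const int node_distance[NR_NODES][NR_NODES] = {
	{10, 17, 17, 17, 14},
	{17, 10, 17, 17, 14},
	{17, 17, 10, 17, 14},
	{17, 17, 17, 10, 14},
	{14, 14, 14, 14, 10},
};

/* Fill order[] with all nodes sorted by distance from node `from`,
 * nearest first. Insertion sort is stable, so nodes at equal distance
 * stay in ascending node-number order. */
static void fallback_order(int from, int order[NR_NODES])
{
	assert(from >= 0 && from < NR_NODES);

	for (int i = 0; i < NR_NODES; i++)
		order[i] = i;

	for (int i = 1; i < NR_NODES; i++) {
		int n = order[i];
		int j = i - 1;

		while (j >= 0 &&
		       node_distance[from][order[j]] > node_distance[from][n]) {
			order[j + 1] = order[j];
			j--;
		}
		order[j + 1] = n;
	}
}
```

For nodes 0 to 3 the second entry is always node 4, the interleaved pseudo-node, ahead of every other cpu node. In this model the memoryless cpu nodes therefore overflow to node 4 first.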

Lee


2006-11-16 15:51:22

by Martin Bligh

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

Christian Krafft wrote:
> On Wed, 15 Nov 2006 16:57:56 -0800 (PST)
> Christoph Lameter <[email protected]> wrote:
>
>> On Thu, 16 Nov 2006, KAMEZAWA Hiroyuki wrote:
>>
>>>> But there is no memory on the node. Does the zonelist contain the zones of
>>>> the node without memory or not? We simply fall back each allocation to the
>>>> next node as if the node was overflowing?
>>> yes. just fallback.
>> Ok, so we got a useless pglist_data struct and the struct zone contains a
>> zonelist that does not include the zone.
>
> Okay, I slowly understand what you are talking about.
> I just tried a "numactl --cpunodebind 1 --membind 1 true" which hit an uninitialized zone in slab_node:
>
> return zone_to_nid(policy->v.zonelist->zones[0]);
>
> I also still don't know if it makes sense to have memoryless nodes, but supporting it does.
> So what would be reasonable: to have empty zonelists for those nodes, or to check if zonelists are uninitialized?

You don't want empty zonelists on a node containing CPUs, else it won't
know where to allocate from. You just want to make sure that the zones
in that node (if they exist) are not contained in *anyone's* zonelist.

M.

2006-11-16 18:46:52

by Christoph Lameter

Subject: Re: [patch 2/2] enables booting a NUMA system where some nodes have no memory

On Thu, 16 Nov 2006, Christian Krafft wrote:

> Okay, I slowly understand what you are talking about.
> I just tried a "numactl --cpunodebind 1 --membind 1 true" which hit an uninitialized zone in slab_node:
>
> return zone_to_nid(policy->v.zonelist->zones[0]);

I think the above should work fine and give the expected OOM since the
node has no memory.

The zone struct should redirect via the zonelist to nodes that have
memory for allocations that are not bound to a single node.

> I also still don't know if it makes sense to have memoryless nodes, but supporting it does.
> So what would be reasonable: to have empty zonelists for those nodes, or to check if zonelists are uninitialized?

zonelists of those nodes should contain a list of fallback zones with
available memory.

2006-11-21 16:57:08

by Andrew Morton

Subject: Re: [patch 1/2] fix call to alloc_bootmem after bootmem has been freed

On Wed, 15 Nov 2006 19:32:38 +0100
Christian Krafft <[email protected]> wrote:

> In some cases it might happen, that alloc_bootmem is beeing called
> after bootmem pages have been freed. This is, because the condition
> SYSTEM_BOOTING is still true after bootmem has been freed.
>
> Signed-off-by: Christian Krafft <[email protected]>
>
> Index: linux/mm/page_alloc.c
> ===================================================================
> --- linux.orig/mm/page_alloc.c
> +++ linux/mm/page_alloc.c
> @@ -1931,7 +1931,7 @@ int zone_wait_table_init(struct zone *zo
> alloc_size = zone->wait_table_hash_nr_entries
> * sizeof(wait_queue_head_t);
>
> - if (system_state == SYSTEM_BOOTING) {
> + if (!slab_is_available()) {
> zone->wait_table = (wait_queue_head_t *)
> alloc_bootmem_node(pgdat, alloc_size);
> } else {

I don't think that slab_is_available() is an appropriate way of working out
if we can call vmalloc().

Also, a more complete description of the problem is needed, please. Which
caller is incorrectly allocating bootmem?

2006-11-21 18:27:10

by Andrew Morton

Subject: Re: [patch 1/2] fix call to alloc_bootmem after bootmem has been freed

On Tue, 21 Nov 2006 19:02:13 +0100
Christian Krafft <[email protected]> wrote:

> > > Index: linux/mm/page_alloc.c
> > > ===================================================================
> > > --- linux.orig/mm/page_alloc.c
> > > +++ linux/mm/page_alloc.c
> > > @@ -1931,7 +1931,7 @@ int zone_wait_table_init(struct zone *zo
> > > alloc_size = zone->wait_table_hash_nr_entries
> > > * sizeof(wait_queue_head_t);
> > >
> > > - if (system_state == SYSTEM_BOOTING) {
> > > + if (!slab_is_available()) {
> > > zone->wait_table = (wait_queue_head_t *)
> > > alloc_bootmem_node(pgdat, alloc_size);
> > > } else {
> >
> > I don't think that slab_is_available() is an appropriate way of working out
> > if we can call vmalloc().
>
> Afaik slab_is_available() is the generic replacement for mem_init_done, which exists only on powerpc.
> If thats not appropriate, I dont know why. However, SYSTEM_BOOTING is definitively wrong.

slab is a very different thing from vmalloc. One could easily envisage
situations (now or in the future) in which slab is ready, but vmalloc is
not (more likely vice versa).

It'd be better to add a new vmalloc_is_available. (Just an int - no need
for a helper function).

2006-11-22 09:23:54

by Arnd Bergmann

Subject: Re: [patch 1/2] fix call to alloc_bootmem after bootmem has been freed

On Tuesday 21 November 2006 19:26, Andrew Morton wrote:
> slab is a very different thing from vmalloc.  One could easily envisage
> situations (now or in the future) in which slab is ready, but vmalloc is
> not (more likely vice versa).
>
> It'd be better to add a new vmalloc_is_available.  (Just an int - no need
> for a helper function).

In the time line, we currently have

start_kernel()
    ...
    setup_arch()
        init_bootmem()          # alloc_bootmem starts working
        ...
        paging_init()           # needed for vmalloc
    ...                         #
    mem_init()
        free_all_bootmem()      # alloc_bootmem stops working, alloc_pages
                                # starts working
    kmem_cache_init()           # kmalloc and vmalloc start working
    ...
    system_state = SYSTEM_RUNNING

The one interesting point here is where you have to transition between
calling alloc_bootmem and calling the regular allocator functions.
Maybe calling it slab_is_available() was not the best choice for a name,
but I don't see a point in having different names for essentially the
same question, "bootmem or not bootmem". The powerpc platform has an
integer variable called 'mem_init_done', which expresses this well
IMHO, but it's currently not portable.

Checking for SYSTEM_RUNNING is obviously the wrong choice, since it is
set at a very late point in bootup, long after bootmem is gone.
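
The difference between the two checks can be modelled with a small phase enum. This is a sketch only: the `toy_*` names and the three helper functions are hypothetical, and the phases are taken directly from the timeline above rather than from the kernel source.

```c
#include <assert.h>

/* Boot phases from the timeline above, in order. */
enum toy_phase {
	TOY_BEFORE_BOOTMEM,	/* before init_bootmem() */
	TOY_BOOTMEM,		/* after init_bootmem() */
	TOY_PAGE_ALLOC,		/* after free_all_bootmem() */
	TOY_SLAB,		/* after kmem_cache_init() */
	TOY_RUNNING		/* system_state = SYSTEM_RUNNING */
};

enum toy_allocator { TOY_USE_BOOTMEM, TOY_USE_REGULAR };

/* Old check: everything before SYSTEM_RUNNING picks bootmem. */
static enum toy_allocator pick_by_system_state(enum toy_phase p)
{
	assert(p <= TOY_RUNNING);
	return p < TOY_RUNNING ? TOY_USE_BOOTMEM : TOY_USE_REGULAR;
}

/* Check from patch 1/2, modelled on slab_is_available(). */
static enum toy_allocator pick_by_slab(enum toy_phase p)
{
	assert(p <= TOY_RUNNING);
	return p >= TOY_SLAB ? TOY_USE_REGULAR : TOY_USE_BOOTMEM;
}

/* Per the timeline, bootmem only works between init_bootmem() and
 * free_all_bootmem(). */
static int bootmem_works(enum toy_phase p)
{
	assert(p <= TOY_RUNNING);
	return p == TOY_BOOTMEM;
}
```

In this model the old check still picks bootmem in the `TOY_SLAB` phase, where bootmem is already gone, while the slab-based check picks the regular allocator. The model also shows a remaining gap: in the `TOY_PAGE_ALLOC` window the slab-based check still selects bootmem, so whether anything allocates in that window matters for either choice of name.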

Arnd <><