2007-02-13 06:58:28

by Kamezawa Hiroyuki

Subject: [RFC] [PATCH] more support for memory-less-node.

In my last posting, the mempolicy-fix-for-memory-less-node patch, there was a
discussion about what the definition of a "node" should be.
I found there is no consensus, but I want to go ahead.
Before posting the patch again, I'd like to discuss this more.

-Kame

In my understanding, a "node" is a block of cpus, memory, and devices,
and there could be cpu-only nodes, memory-only nodes, device-only nodes...

There will be discussion. IMHO, to represent hardware configuration
as it is, this definition is very natural and flexible.
(And because my work is memory-hotplug, this definition fits me.)

"Don't support such crazy configurations" is one opinion.
I hear x86_64 doesn't support it and defines a node as a block of memory;
it remaps cpus on memory-less nodes to other nodes.
I know ia64 allows memory-less nodes. (I don't know about ppc.)
It works well on my box (and HP's box).

If there is a memory-less node, code which walks all nodes that have memory
should check NODE_DATA(nid)->present_pages.

But the following is a somewhat heavy operation:

	xxxxx
	for_each_online_node(nid) {
		if (!node_present_pages(nid))
			continue;
		xxxxx
	}

This patch adds a new node mask, "node_memory_online_map", for nodes
which have memory.

for_each_node_mask(nid, node_memory_online_map) walks all nodes that have memory.
This mask is updated by node-hotplug operations.

Signed-Off-By: KAMEZAWA Hiroyuki <[email protected]>

Index: linux-2.6.20/include/linux/nodemask.h
===================================================================
--- linux-2.6.20.orig/include/linux/nodemask.h 2007-02-07 17:25:54.000000000 +0900
+++ linux-2.6.20/include/linux/nodemask.h 2007-02-13 15:31:33.000000000 +0900
@@ -344,6 +344,8 @@

extern nodemask_t node_online_map;
extern nodemask_t node_possible_map;
+/* online nodes which have memory */
+extern nodemask_t node_memory_online_map;

#if MAX_NUMNODES > 1
#define num_online_nodes() nodes_weight(node_online_map)
Index: linux-2.6.20/mm/page_alloc.c
===================================================================
--- linux-2.6.20.orig/mm/page_alloc.c 2007-02-07 17:25:54.000000000 +0900
+++ linux-2.6.20/mm/page_alloc.c 2007-02-13 15:54:04.000000000 +0900
@@ -54,6 +54,9 @@
EXPORT_SYMBOL(node_online_map);
nodemask_t node_possible_map __read_mostly = NODE_MASK_ALL;
EXPORT_SYMBOL(node_possible_map);
+nodemask_t node_memory_online_map __read_mostly = { { [0] = 1UL } };
+EXPORT_SYMBOL(node_memory_online_map);
+
unsigned long totalram_pages __read_mostly;
unsigned long totalreserve_pages __read_mostly;
long nr_swap_pages;
@@ -1805,6 +1808,16 @@
}
}

+static void __meminit fixup_memory_online_nodes(void)
+{
+	int nid;
+	nodes_clear(node_memory_online_map);
+	for_each_online_node(nid) {
+		if (node_present_pages(nid))
+			node_set(nid, node_memory_online_map);
+	}
+}
+
#else /* CONFIG_NUMA */

static void __meminit build_zonelists(pg_data_t *pgdat)
@@ -1851,6 +1864,10 @@
pgdat->node_zonelists[i].zlcache_ptr = NULL;
}

+static void fixup_memory_online_nodes(void)
+{
+	return;
+}
#endif /* CONFIG_NUMA */

/* return values int ....just for stop_machine_run() */
@@ -1862,6 +1879,7 @@
		build_zonelists(NODE_DATA(nid));
		build_zonelist_cache(NODE_DATA(nid));
	}
+	fixup_memory_online_nodes();
	return 0;
}
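
A userspace sketch of the idea in the patch above, not kernel code: it models the node masks as plain bitmasks and rebuilds a "nodes with memory" mask in the same way fixup_memory_online_nodes() does. All `_model` names, the 8-node limit, and the per-node page array are assumptions made for illustration only.

```c
#include <assert.h>

#define MAX_NUMNODES_MODEL 8   /* assumption: small fixed node count */

/* Userspace stand-ins for nodemask_t and the per-node page counts. */
typedef unsigned long nodemask_model_t;

static nodemask_model_t node_online_model;
static nodemask_model_t node_memory_online_model;
static unsigned long present_pages_model[MAX_NUMNODES_MODEL];

/* Test whether bit nid is set in the given mask. */
static int node_isset_model(int nid, nodemask_model_t mask)
{
    assert(nid >= 0 && nid < MAX_NUMNODES_MODEL);
    return (int)((mask >> nid) & 1UL);
}

/* Bring a node online with the given number of present pages (may be 0). */
static void online_node_model(int nid, unsigned long pages)
{
    assert(nid >= 0 && nid < MAX_NUMNODES_MODEL);
    node_online_model |= 1UL << nid;
    present_pages_model[nid] = pages;
}

/* Rebuild the memory mask: clear it, then set every online node with pages. */
static void fixup_memory_online_nodes_model(void)
{
    int nid;

    node_memory_online_model = 0;
    for (nid = 0; nid < MAX_NUMNODES_MODEL; nid++) {
        if (node_isset_model(nid, node_online_model) &&
            present_pages_model[nid])
            node_memory_online_model |= 1UL << nid;
    }
}

/* Count nodes with memory by consulting only the precomputed mask. */
static int count_memory_nodes_model(void)
{
    int nid, count = 0;

    for (nid = 0; nid < MAX_NUMNODES_MODEL; nid++)
        if (node_isset_model(nid, node_memory_online_model))
            count++;
    return count;
}
```

The point of the precomputed mask is that callers no longer need to look at each node's page count; they only test bits that were set once, at hotplug time.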




2007-02-13 08:33:04

by Andi Kleen

Subject: Re: [RFC] [PATCH] more support for memory-less-node.


> In my understanding, a "node" is a block of cpu, memory, devices.
> and there could be cpu-only-node, memory-only-node, device-only-node...

The trouble with this is that you'll need to harden large parts
of code against these. Especially a NULL pgdat is something quite
dangerous. You could make it a dummy empty pgdat, but just assigning it
nearby seems easier.

-Andi

2007-02-13 08:40:53

by Kamezawa Hiroyuki

Subject: Re: [RFC] [PATCH] more support for memory-less-node.

On Tue, 13 Feb 2007 09:29:49 +0100
Andi Kleen <[email protected]> wrote:

>
> > In my understanding, a "node" is a block of cpu, memory, devices.
> > and there could be cpu-only-node, memory-only-node, device-only-node...
>
> The trouble with this is that you'll need to harden large parts
> of code against these. Especially a NULL pgdat is something quite
> dangerous. You could make it a dummy empty pgdat, but just assigning it
> nearby seems easier.

Ah... it seems I didn't explain enough.

Currently, a memory-less node has its own pgdat, for its own zonelist.
Every *online* node has its own NODE_DATA(nid).

NODE_DATA(nid) is always a valid pointer if a node is online.
NODE_DATA(nid)->present_pages can be 0 even if a node is online;
I call this a memory-less node.

Thanks,
-Kame
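
A small userspace model of the distinction drawn in this message, offered as a sketch under assumptions rather than the kernel's data structures: every node has a real descriptor (so lookups never yield NULL), and "memory-less" only means that the page counts are zero. The struct layouts, the `_model` names, and the array sizes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NUMNODES_MODEL 4   /* assumption: tiny machine for the model */
#define MAX_NR_ZONES_MODEL 3

struct zone_model {
    unsigned long present_pages;
};

struct pgdat_model {
    struct zone_model node_zones[MAX_NR_ZONES_MODEL];
    unsigned long node_present_pages;
};

/* One statically allocated descriptor per node: never NULL, possibly empty. */
static struct pgdat_model node_data_table_model[MAX_NUMNODES_MODEL];

static struct pgdat_model *node_data_model(int nid)
{
    assert(nid >= 0 && nid < MAX_NUMNODES_MODEL);
    return &node_data_table_model[nid];
}

/* A zone is populated when it has at least one present page. */
static int populated_zone_model(const struct zone_model *zone)
{
    return zone->present_pages != 0;
}

/* Add pages to one zone of a node and keep the node total in sync. */
static void add_pages_model(int nid, int zone_idx, unsigned long pages)
{
    struct pgdat_model *pgdat = node_data_model(nid);

    assert(zone_idx >= 0 && zone_idx < MAX_NR_ZONES_MODEL);
    pgdat->node_zones[zone_idx].present_pages += pages;
    pgdat->node_present_pages += pages;
}

/* Number of populated zones on a node; 0 for a memory-less node. */
static int count_populated_zones_model(int nid)
{
    const struct pgdat_model *pgdat = node_data_model(nid);
    int i, count = 0;

    for (i = 0; i < MAX_NR_ZONES_MODEL; i++)
        if (populated_zone_model(&pgdat->node_zones[i]))
            count++;
    return count;
}
```

In this model a caller never needs a NULL test on the node descriptor; it only needs to cope with a node whose zones are all unpopulated.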


2007-02-13 17:10:46

by Martin Bligh

Subject: Re: [RFC] [PATCH] more support for memory-less-node.

KAMEZAWA Hiroyuki wrote:
> In my last posting, the mempolicy-fix-for-memory-less-node patch, there was a
> discussion about what the definition of a "node" should be.
> I found there is no consensus, but I want to go ahead.
> Before posting the patch again, I'd like to discuss this more.
>
> -Kame
>
> In my understanding, a "node" is a block of cpu, memory, devices.
> and there could be cpu-only-node, memory-only-node, device-only-node...
>
> There will be discussion. IMHO, to represent hardware configuration
> as it is, this definition is very natural and flexible.
> (And because my work is memory-hotplug, this definition fits me.)
>
> "Don't support such crazy configurations" is one opinion.
> I hear x86_64 doesn't support it and defines a node as a block of memory;
> it remaps cpus on memory-less nodes to other nodes.
> I know ia64 allows memory-less nodes. (I don't know about ppc.)
> It works well on my box (and HP's box).

It doesn't make much sense for an architecture independent structure to
be "defined" in different ways by specific architectures. "not
supported" or "currently broken" might be a better description.

Your description of the node is correct, it's an arbitrary container of
one or more resources. Not only is this definition flexible, it's also
very useful, for memory hotplug, odd types of NUMA boxes, etc.

M.

2007-02-13 17:24:40

by Christoph Lameter

Subject: Re: [RFC] [PATCH] more support for memory-less-node.

On Tue, 13 Feb 2007, Andi Kleen wrote:

> The trouble with this is that you'll need to harden large parts
> of code against these. Especially a NULL pgdat is something quite
> dangerous. You could make it a dummy empty pgdat, but just assigning it
> nearby seems easier.

Plus there is the issue of having a pgdat but without any valid zone in
it. This is what triggered Kame-san's recent bug.

2007-02-13 17:26:19

by Christoph Lameter

Subject: Re: [RFC] [PATCH] more support for memory-less-node.

On Tue, 13 Feb 2007, KAMEZAWA Hiroyuki wrote:

> NODE_DATA(nid) is always a valid pointer if a node is online.
> NODE_DATA(nid)->present_pages can be 0 even if a node is online;
> I call this a memory-less node.

Yes but the pgdat will have no valid zone in it. That is new.


2007-02-13 17:49:37

by Andi Kleen

Subject: Re: [RFC] [PATCH] more support for memory-less-node.


> Your description of the node is correct, it's an arbitrary container of
> one or more resources. Not only is this definition flexible, it's also
> very useful, for memory hotplug, odd types of NUMA boxes, etc.

I must disagree here. Special cases are always dangerous especially
if they are hard to regression test. I made this discovery the hard
way on x86-64 ... It's best to eliminate them in the first place,
otherwise they will later come back and bite you when you don't expect it.

Adding NULL tests all over mm for this would seem like a clear case
of this to me.

-Andi

2007-02-13 18:03:32

by Christoph Lameter

Subject: Re: [RFC] [PATCH] more support for memory-less-node.

On Tue, 13 Feb 2007, Andi Kleen wrote:

> Adding NULL tests all over mm for this would seem like a clear case
> of this to me.

Maybe there is an alternative. We are free to number the nodes, right?
How about requiring the low node numbers to have memory and the high ones
to have none?

F.e. have a boundary like

nr_mem_nodes ?

Everything above nr_mem_nodes has no memory and cannot be specified in a
nodemask. Those nodes would not be visible to user space via memory
policies and page migration. So the core mempolicy logic could be left
untouched.

The nodes above nr_mem_nodes would exist purely within the kernel. They
would have proximity information (which can be used to determine
neighboring memory; this is more flexible than the current attachment
to one fixed memory node), but those node numbers could not be specified in
node masks for any memory operations. This would then allow memory-less nodes
with I/O or cpus. The user would not be aware of these.
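
One way to picture this proposal, as a hedged userspace sketch and not an implementation of anything in the kernel: renumber the nodes so that those with memory come first, and reject any user-supplied node mask that names a node at or above the boundary. The function names, the `_model` suffix, and the use of a single unsigned long as the mask are assumptions.

```c
#include <assert.h>
#include <errno.h>

/*
 * Assign new node ids so that nodes with memory get 0..k-1 and
 * memory-less nodes get k..n-1. Returns k, the nr_mem_nodes boundary.
 */
static int renumber_nodes_model(const unsigned long pages[], int n,
                                int new_id[])
{
    int nid, next = 0, nr_mem;

    assert(n >= 0);
    for (nid = 0; nid < n; nid++)
        if (pages[nid])
            new_id[nid] = next++;
    nr_mem = next;
    for (nid = 0; nid < n; nid++)
        if (!pages[nid])
            new_id[nid] = next++;
    return nr_mem;
}

/*
 * Accept a user node mask only if it is non-empty and names no node
 * at or above the boundary. Returns 0 on success, -EINVAL otherwise.
 */
static int validate_user_nodemask_model(unsigned long mask, int nr_mem)
{
    unsigned long allowed;

    assert(nr_mem >= 0 && nr_mem < (int)(8 * sizeof(unsigned long)));
    allowed = (1UL << nr_mem) - 1;
    if (mask == 0 || (mask & ~allowed))
        return -EINVAL;
    return 0;
}
```

With this layout the check against memory-less nodes becomes a single comparison against the boundary, which is why the core mempolicy logic could stay unchanged under the proposal.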

2007-02-13 18:12:57

by Martin Bligh

Subject: Re: [RFC] [PATCH] more support for memory-less-node.

Andi Kleen wrote:
>> Your description of the node is correct, it's an arbitrary container of
>> one or more resources. Not only is this definition flexible, it's also
>> very useful, for memory hotplug, odd types of NUMA boxes, etc.
>
> I must disagree here. Special cases are always dangerous especially
> if they are hard to regression test. I made this discovery the hard
> way on x86-64 ... It's best to eliminate them in the first place,
> otherwise they will later come back and bite you when you don't expect it.
>
> Adding NULL tests all over mm for this would seem like a clear case
> of this to me.

I wasn't suggesting having NULL pointers for pgdats, if that's what you
mean. Just nodes with no memory in them, the pgdat would still be there.
pgdat = struct node, except everything's badly named.


2007-02-13 18:17:46

by Martin Bligh

Subject: Re: [RFC] [PATCH] more support for memory-less-node.

Christoph Lameter wrote:
> On Tue, 13 Feb 2007, Andi Kleen wrote:
>
>> Adding NULL tests all over mm for this would seem like a clear case
>> of this to me.
>
> Maybe there is an alternative. We are free to number the nodes, right?
> How about requiring the low node numbers to have memory and the high ones
> to have none?
>
> F.e. have a boundary like
>
> nr_mem_nodes ?
>
> Everything above nr_mem_nodes has no memory and cannot be specified in a
> nodemask. Those nodes would not be visible to user space via memory
> policies and page migration. So the core mempolicy logic could be left
> untouched.
>
> The nodes above nr_mem_nodes would exist purely within the kernel. They
> would have proximity information (which can be used to determine
> neighboring memory; this is more flexible than the current attachment
> to one fixed memory node), but those node numbers could not be specified in
> node masks for any memory operations. This would then allow memory-less nodes
> with I/O or cpus. The user would not be aware of these.

What's wrong with just setting the existing counters like
node_spanned_pages / node_present_pages to zero?

M.

2007-02-13 18:18:52

by Andi Kleen

Subject: Re: [RFC] [PATCH] more support for memory-less-node.


> I wasn't suggesting having NULL pointers for pgdats, if that's what you
> mean.

That is what started the original thread at least. Can happen on some
ia64 platforms.

> Just nodes with no memory in them, the pgdat would still be there.
> pgdat = struct node, except everything's badly named.

Ok those can happen even on x86-64, mostly because it's possible
to fill up a node early during boot up with bootmem and then
it's effectively empty.

[there is even still an open bug when this happens on node 0]

Handling out of memory here of course has to be always done.

Just NULL pointers in core data structures are evil. But I'm glad we
agree here.

Now whether it's better to set up an empty node or use a nearby node
for a memory-less cpu can be further discussed. I still think
I lean towards the latter.

-Andi

2007-02-13 18:28:15

by Martin Bligh

Subject: Re: [RFC] [PATCH] more support for memory-less-node.

Andi Kleen wrote:
>> I wasn't suggesting having NULL pointers for pgdats, if that's what you
>> mean.
>
> That is what started the original thread at least. Can happen on some
> ia64 platforms.

OK, that does seem kind of ugly.

>> Just nodes with no memory in them, the pgdat would still be there.
>> pgdat = struct node, except everything's badly named.
>
> Ok those can happen even on x86-64, mostly because it's possible
> to fill up a node early during boot up with bootmem and then
> it's effectively empty.
>
> [there is even still an open bug when this happens on node 0]
>
> Handling out of memory here of course has to be always done.

Yup, if we just set the "size" of the node to zero, it seems
like a natural degenerate case that should be handled anyway.

> Just NULL pointers in core data structures are evil. But I'm glad we
> agree here.
>
> Now whether it's better to set up an empty node or use a nearby node
> for a memory-less cpu can be further discussed. I still think
> I lean towards the latter.

Just seems kind of ugly and unnecessary, particularly if that
memory-less cpu (or IO node) is equidistant from one or more
memory-possessing nodes. As long as their zonelist is set up
correctly, it should all work fine without that, right?

build_zonelists_node already checks populated_zone() so it looks
like it's all set up for that already ...

M.
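
The argument above can be illustrated with a userspace sketch, under stated assumptions and not taken from page_alloc.c: a memory-less node gets a fallback list of other nodes ordered by distance, and nodes without memory are skipped while the list is built. The distance table, the struct, and all `_model` names are hypothetical; ties are broken by lower node id purely for determinism.

```c
#include <assert.h>

#define MAX_NUMNODES_MODEL 4   /* assumption: tiny machine for the model */

struct node_model {
    unsigned long present_pages;
};

/* A node is usable as an allocation target only if it has pages. */
static int populated_node_model(const struct node_model *node)
{
    return node->present_pages != 0;
}

/*
 * Build the fallback order for node `self`: populated nodes sorted by
 * increasing distance from `self`. Unpopulated nodes (including `self`
 * when it is memory-less) are left out. Returns the number of entries.
 */
static int build_fallback_model(int self,
                                const struct node_model nodes[MAX_NUMNODES_MODEL],
                                int distance[MAX_NUMNODES_MODEL][MAX_NUMNODES_MODEL],
                                int out[MAX_NUMNODES_MODEL])
{
    int used[MAX_NUMNODES_MODEL] = { 0 };
    int count = 0;

    assert(self >= 0 && self < MAX_NUMNODES_MODEL);
    for (;;) {
        int best = -1, nid;

        /* Pick the nearest populated node that is not yet listed. */
        for (nid = 0; nid < MAX_NUMNODES_MODEL; nid++) {
            if (used[nid] || !populated_node_model(&nodes[nid]))
                continue;
            if (best < 0 || distance[self][nid] < distance[self][best])
                best = nid;
        }
        if (best < 0)
            break;
        used[best] = 1;
        out[count++] = best;
    }
    return count;
}
```

In this model a memory-less node that is equidistant from two memory nodes simply lists both, rather than being attached to one of them.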

2007-02-13 18:51:08

by Christoph Lameter

Subject: Re: [RFC] [PATCH] more support for memory-less-node.

On Tue, 13 Feb 2007, Martin J. Bligh wrote:

> What's wrong with just setting the existing counters like
> node_spanned_pages / node_present_pages to zero?

Will this fix the breakage that Kame-san saw?

2007-02-13 18:52:01

by Bob Picco

Subject: Re: [RFC] [PATCH] more support for memory-less-node.

Andi Kleen wrote: [Tue Feb 13 2007, 01:18:45PM EST]
>
> > I wasn't suggesting having NULL pointers for pgdats, if that's what you
> > mean.
>
> That is what started the original thread at least. Can happen on some
> ia64 platforms.
I don't believe there is a NULL pgdat. The code for memory-less nodes in
ia64 discontig.c allocates the memory-less node's pgdat from the best
memory node candidate. If there is a NULL pgdat, then it's a bug. Instead,
memory-less nodes simply don't have any present pages.

I thought the bug was that the process wanted to bind to just one
memory-less node, and MPOL_BIND didn't handle that correctly by returning
an error to the process.

bob
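
The behaviour described above (a bind request that names only memory-less nodes should fail with an error) might look like the following in a userspace model. This is a sketch under assumptions, not the mempolicy code: the function name, the bitmask representation, and the choice to silently drop memory-less nodes from an otherwise valid mask are all hypothetical.

```c
#include <assert.h>
#include <errno.h>

/*
 * Check a requested bind mask against the mask of nodes that have memory.
 * On success, store the usable subset in *effective and return 0.
 * If no requested node has memory, leave *effective alone, return -EINVAL.
 */
static int check_bind_mask_model(unsigned long requested,
                                 unsigned long memory_nodes,
                                 unsigned long *effective)
{
    unsigned long usable = requested & memory_nodes;

    assert(effective != 0);
    if (!usable)
        return -EINVAL;
    *effective = usable;
    return 0;
}
```
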
>
> > Just nodes with no memory in them, the pgdat would still be there.
> > pgdat = struct node, except everything's badly named.
>
> Ok those can happen even on x86-64, mostly because it's possible
> to fill up a node early during boot up with bootmem and then
> it's effectively empty.
>
> [there is even still an open bug when this happens on node 0]
>
> Handling out of memory here of course has to be always done.
>
> Just NULL pointers in core data structures are evil. But I'm glad we
> agree here.
>
> Now whether it's better to set up an empty node or use a nearby node
> for a memory-less cpu can be further discussed. I still think
> I lean towards the latter.
>
> -Andi

2007-02-14 00:13:40

by Kamezawa Hiroyuki

Subject: Re: [RFC] [PATCH] more support for memory-less-node.

On Tue, 13 Feb 2007 09:25:00 -0800 (PST)
Christoph Lameter <[email protected]> wrote:

> On Tue, 13 Feb 2007, KAMEZAWA Hiroyuki wrote:
>
> > NODE_DATA(nid) is always a valid pointer if a node is online.
> > NODE_DATA(nid)->present_pages can be 0 even if a node is online;
> > I call this a memory-less node.
>
> Yes but the pgdat will have no valid zone in it. That is new.
>
We have the populated_zone() macro for checking that.

(I noticed node-hotplug can create a memory-less zone until memory is
onlined by the user.)


-Kame

2007-02-14 00:21:00

by Kamezawa Hiroyuki

Subject: Re: [RFC] [PATCH] more support for memory-less-node.

On Tue, 13 Feb 2007 10:50:53 -0800 (PST)
Christoph Lameter <[email protected]> wrote:

> On Tue, 13 Feb 2007, Martin J. Bligh wrote:
>
> > What's wrong with just setting the existing counters like
> > node_spanned_pages / node_present_pages to zero?
>
> Will this fix the breakage that Kame-san saw?
>

Now, a memory-less node's present_pages and spanned_pages are zero,
and its zones' present_pages are zero, too.

We added the populated_zone(zone) macro. It checks whether a zone has pages or not.
(see build_zonelist in page_alloc.c)

-Kame

2007-02-15 12:20:11

by Bodo Eggert

Subject: Re: [RFC] [PATCH] more support for memory-less-node.

Andi Kleen <[email protected]> wrote:

> Now whether it's better to set up an empty node or use a nearby node
> for a memory-less cpu can be further discussed. I still think
> I lean towards the latter.

Worst case: Only slot 0 is used. Plug your memoryless CPU card into the last
slot before you plug the CPU+mem card into the last-1 slot.
--
W.I.N.D.O.W.S.:
Wireless Intelligent Neohuman Designed for Observation and Worldwide Sabotage
-- http://www.brunching.com/toys/toy-cyborger.html