2007-02-06 11:23:49

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

current mempolicy just checks whether a node is online or not.
If there is memory-less-node, mempolicy's target node can be
invalid.
This patch adds a check whether a node has memory or not.

This is an back-trace in which a program uses MPOL_MBIND just includes
memory-less-node.

== backtrace from crash (linux-2.6.20) ==
#0 [BSP:e000000121f412d8] schedule at a00000010061ccc0
#1 [BSP:e000000121f41280] rwsem_down_failed_common at a000000100290490
#2 [BSP:e000000121f41260] rwsem_down_read_failed at a000000100620d30
#3 [BSP:e000000121f41240] down_read at a0000001000b01a0
#4 [BSP:e000000121f411e8] ia64_do_page_fault at a000000100625710
#5 [BSP:e000000121f411e8] ia64_leave_kernel at a00000010000c660
EFRAME: e000000121f47100
B0: a00000010013cc40 CR_IIP: a00000010012aa30
CR_IPSR: 0000101008022018 CR_IFS: 8000000000000205
AR_PFS: 0000000000000309 AR_RSC: 0000000000000003
AR_UNAT: 0000000000000000 AR_RNAT: 0000000000000000
AR_CCV: 0000000000000000 AR_FPSR: 0009804c8a70033f
LOADRS: 0000000000000000 AR_BSPSTORE: 0000000000000000
B6: a00000010003f040 B7: a00000010000ccd0
PR: 000000000055a9a5 R1: a000000100d5a5b0
R2: e00000010c50df7c R3: 0000000000000030
R8: 0000000000000000 R9: e00000011dc52930
R10: e00000011dc52928 R11: e00000010c50df80
R12: e000000121f472c0 R13: e000000121f40000
R14: 0000000000000002 R15: 000000003fffff00
R16: 0000000010400000 R17: e000000121f40000
R18: a000000100b5a9d0 R19: e000000121f40018
R20: e000000121f40c84 R21: 0000000000000000
R22: e000000121f47330 R23: e000000121f47334
R24: e000000121f40b88 R25: e000000121f47340
R26: e000000121f47334 R27: 0000000000000000
R28: 0000000000000000 R29: e000000121f47338
R30: 000000007fffffff R31: a000000100b5b5e0
F6: 1003eccd55056199632ec F7: 1003e9e3779b97f4a7c16
F8: 1003e0a00000010001422 F9: 1003e000000000fa00000
F10: 1003e000000003b9aca00 F11: 1003e431bde82d7b634db
#6 [BSP:e000000121f411c0] slab_node at a00000010012aa30
#7 [BSP:e000000121f41190] alternate_node_alloc at a00000010013cc40
#8 [BSP:e000000121f41160] kmem_cache_alloc at a00000010013dc40
#9 [BSP:e000000121f41100] desc_prologue at a00000010003ee00
#10 [BSP:e000000121f410c0] unw_decode_r2 at a00000010003f0c0
#11 [BSP:e000000121f41068] find_save_locs at a00000010003fbf0
#12 [BSP:e000000121f41038] unw_init_frame_info at a000000100040900
#13 [BSP:e000000121f41010] unw_init_running at a00000010000ccf0
==
This panic(hang) was found by a numa test-set on a system with 3 nodes, where
node(2) was memory-less-node.

This means an access to NULL,here.
==
unsigned slab_node(struct mempolicy *policy)
{
case MPOL_BIND:
/*
* Follow bind policy behavior and start allocation at the
* first node.
*/
return zone_to_nid(policy->v.zonelist->zones[0]);
}
==
length of this zonelist was 0.
It seems fixing a NULL access here is also O.K.
This patch is just an idea.

Signed-Off-By: KAMEZAWA Hiroyuki <[email protected]>



Index: linux-2.6.20/mm/mempolicy.c
===================================================================
--- linux-2.6.20.orig/mm/mempolicy.c 2007-02-06 20:02:31.000000000 +0900
+++ linux-2.6.20/mm/mempolicy.c 2007-02-06 20:09:47.000000000 +0900
@@ -116,6 +116,8 @@
static int mpol_check_policy(int mode, nodemask_t *nodes)
{
int empty = nodes_empty(*nodes);
+ int nd;
+ nodemask_t node_with_memory;

switch (mode) {
case MPOL_DEFAULT:
@@ -130,7 +132,12 @@
return -EINVAL;
break;
}
- return nodes_subset(*nodes, node_online_map) ? 0 : -EINVAL;
+ nodes_clear(node_with_memory);
+ for_each_online_node(nd) {
+ if (node_present_pages(nd))
+ node_set(nd, node_with_memory);
+ }
+ return nodes_subset(*nodes, node_with_memory) ? 0 : -EINVAL;
}

/* Generate a custom zonelist for the BIND policy. */


2007-02-06 17:27:07

by Christoph Lameter

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

On Tue, 6 Feb 2007, KAMEZAWA Hiroyuki wrote:

> This means an access to NULL,here.
> ==
> unsigned slab_node(struct mempolicy *policy)
> {
> case MPOL_BIND:
> /*
> * Follow bind policy behavior and start allocation at the
> * first node.
> */
> return zone_to_nid(policy->v.zonelist->zones[0]);
> }
> ==
> length of this zonelist was 0.
> It seems fixing a NULL access here is also O.K.
> This patch is just an idea.

Hmmm... Remove the node from the node_online_map instead?

> }
> - return nodes_subset(*nodes, node_online_map) ? 0 : -EINVAL;
> + nodes_clear(node_with_memory);
> + for_each_online_node(nd) {
> + if (node_present_pages(nd))
> + node_set(nd, node_with_memory);
> + }
> + return nodes_subset(*nodes, node_with_memory) ? 0 : -EINVAL;
> }

Uhh. No loops here please. If we really need this then we need to have
some sort of set operation here.


2007-02-07 01:14:13

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

On Tue, 6 Feb 2007 09:26:53 -0800 (PST)
Christoph Lameter <[email protected]> wrote:

> On Tue, 6 Feb 2007, KAMEZAWA Hiroyuki wrote:
>
> > This means an access to NULL,here.
> > ==
> > unsigned slab_node(struct mempolicy *policy)
> > {
> > case MPOL_BIND:
> > /*
> > * Follow bind policy behavior and start allocation at the
> > * first node.
> > */
> > return zone_to_nid(policy->v.zonelist->zones[0]);
> > }
> > ==
> > length of this zonelist was 0.
> > It seems fixing a NULL access here is also O.K.
> > This patch is just an idea.
>
> Hmmm... Remove the node from the node_online_map instead?
>
Changing defintion of node_online_map is harmfil. (there are cpu-only-nodes.)
How about adding nodemask for nodes equips memory ?

===
There are memory-less-nodes (i.e. cpu only node.) on some systems.
mempolicy, which requires nodemask as its arg, should compare user's nodemask
with nodemask for node-with-memory-mask instead of node_online_map.

This patch adds node_with_memory_map and rewrite mempolicy's node_online_map
to node_with_memory_map.
(This patch supports node-hot-add.)

tested on ia64 NUMA
- 3nodes, node(2) is memory-less
- 3nodes, all nodes have memory.


Signed-Off-By: KAMEZAWA Hiroyuki <[email protected]>

Index: linux-2.6.20/include/linux/nodemask.h
===================================================================
--- linux-2.6.20.orig/include/linux/nodemask.h 2007-02-05 03:44:54.000000000 +0900
+++ linux-2.6.20/include/linux/nodemask.h 2007-02-07 10:03:53.000000000 +0900
@@ -344,6 +344,7 @@

extern nodemask_t node_online_map;
extern nodemask_t node_possible_map;
+extern nodemask_t node_with_memory_map;

#if MAX_NUMNODES > 1
#define num_online_nodes() nodes_weight(node_online_map)
Index: linux-2.6.20/mm/page_alloc.c
===================================================================
--- linux-2.6.20.orig/mm/page_alloc.c 2007-02-05 03:44:54.000000000 +0900
+++ linux-2.6.20/mm/page_alloc.c 2007-02-07 09:48:35.000000000 +0900
@@ -59,6 +59,8 @@
long nr_swap_pages;
int percpu_pagelist_fraction;

+nodemask_t node_with_memory_map = NODE_MASK_NONE;
+
static void __free_pages_ok(struct page *page, unsigned int order);

/*
@@ -1861,6 +1863,8 @@
for_each_online_node(nid) {
build_zonelists(NODE_DATA(nid));
build_zonelist_cache(NODE_DATA(nid));
+ if (node_present_pages(nid))
+ node_set(nid, node_with_memory_map);
}
return 0;
}
Index: linux-2.6.20/mm/mempolicy.c
===================================================================
--- linux-2.6.20.orig/mm/mempolicy.c 2007-02-07 09:29:01.000000000 +0900
+++ linux-2.6.20/mm/mempolicy.c 2007-02-07 09:50:51.000000000 +0900
@@ -130,7 +130,7 @@
return -EINVAL;
break;
}
- return nodes_subset(*nodes, node_online_map) ? 0 : -EINVAL;
+ return nodes_subset(*nodes, node_with_memory_map) ? 0 : -EINVAL;
}

/* Generate a custom zonelist for the BIND policy. */
@@ -500,7 +500,7 @@
case MPOL_PREFERRED:
/* or use current node instead of online map? */
if (p->v.preferred_node < 0)
- *nodes = node_online_map;
+ *nodes = node_with_memory_map;
else
node_set(p->v.preferred_node, *nodes);
break;
@@ -1612,7 +1612,7 @@
/* Set interleaving policy for system init. This way not all
the data structures allocated at system boot end up in node zero. */

- if (do_set_mempolicy(MPOL_INTERLEAVE, &node_online_map))
+ if (do_set_mempolicy(MPOL_INTERLEAVE, &node_with_memory_map))
printk("numa_policy_init: interleaving failed\n");
}












2007-02-07 08:04:48

by Christoph Lameter

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

On Wed, 7 Feb 2007, KAMEZAWA Hiroyuki wrote:

> > Hmmm... Remove the node from the node_online_map instead?
> >
> Changing defintion of node_online_map is harmfil. (there are cpu-only-nodes.)
> How about adding nodemask for nodes equips memory ?

Ok that is better but...

Would it be possible to attach the cpus to the
next nodes with memory and mark the node offline? That way we could avoid
another mask that we constantly have to check?

Or fix the location where the error occurred to be able to tolerate a node
with no zones?

2007-02-07 08:37:05

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

On Wed, 7 Feb 2007 00:04:41 -0800 (PST)
Christoph Lameter <[email protected]> wrote:

> On Wed, 7 Feb 2007, KAMEZAWA Hiroyuki wrote:
>
> > > Hmmm... Remove the node from the node_online_map instead?
> > >
> > Changing defintion of node_online_map is harmfil. (there are cpu-only-nodes.)
> > How about adding nodemask for nodes equips memory ?
>
> Ok that is better but...
>
> Would it be possible to attach the cpus to the
> next nodes with memory and mark the node offline? That way we could avoid
> another mask that we constantly have to check?
>
Added ia64 list to CC.
I know ia64 kernel did what you say in old days (I know RHEL4/2.6.9 kernel does it).
Someone changed it and created cpu-only-node for some purpose, I don't know why.


> Or fix the location where the error occurred to be able to tolerate a node
> with no zones?
>
Hmm,
In this case, MPOL_MBIND, the user requests to allocate memory from specified nodes.
I think it's better to tell him "you can't do that" than silently allocating memory
from other places.

-Kame





2007-02-07 09:19:51

by Andi Kleen

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

KAMEZAWA Hiroyuki <[email protected]> writes:

> current mempolicy just checks whether a node is online or not.
> If there is memory-less-node, mempolicy's target node can be
> invalid.
> This patch adds a check whether a node has memory or not.

IMHO there shouldn't be any memory less nodes. The architecture code
should not create them. The CPU should be assigned to a nearby node instead.
At least x86-64 ensures that.

-Andi

2007-02-07 09:20:11

by Andi Kleen

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

Christoph Lameter <[email protected]> writes:
>
> Would it be possible to attach the cpus to the
> next nodes with memory and mark the node offline? That way we could avoid
> another mask that we constantly have to check?

That is what x86-64 does and I believe is the right solution to this.

-Andi

2007-02-07 10:08:30

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

On 07 Feb 2007 11:20:06 +0100
Andi Kleen <[email protected]> wrote:

> KAMEZAWA Hiroyuki <[email protected]> writes:
>
> > current mempolicy just checks whether a node is online or not.
> > If there is memory-less-node, mempolicy's target node can be
> > invalid.
> > This patch adds a check whether a node has memory or not.
>
> IMHO there shouldn't be any memory less nodes. The architecture code
> should not create them. The CPU should be assigned to a nearby node instead.
> At least x86-64 ensures that.
>
AFAIK, ia64 creates nodes just depends on SRAT's possible resource information.
Then, ia64 can create cpu-memory-less-node(node with no available resource.).
(*)I don't like this.

If we don't allow memory-less-node, we may have to add several codes for cpu-hot-add.
cpus should be moved to nearby node at hotadd .
And node-hot-add have to care that cpus mustn't be added before memory, cpu-driven
node-hot-add will never occur. (ACPI's 'container' device spec can't guaranntee this.)

-Kame



2007-02-07 10:19:10

by Andi Kleen

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node


> AFAIK, ia64 creates nodes just depends on SRAT's possible resource information.
> Then, ia64 can create cpu-memory-less-node(node with no available resource.).
> (*)I don't like this.
>
> If we don't allow memory-less-node, we may have to add several codes for cpu-hot-add.
> cpus should be moved to nearby node at hotadd .
> And node-hot-add have to care that cpus mustn't be added before memory, cpu-driven
> node-hot-add will never occur. (ACPI's 'container' device spec can't guaranntee this.)

You can also alias node numbers to solve this: just point multiple node numbers
to the same pgdat. For a memory less node this would be a nearby one.

-Andi


2007-02-07 10:38:03

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

On Wed, 7 Feb 2007 11:19:02 +0100
Andi Kleen <[email protected]> wrote:

>
> > AFAIK, ia64 creates nodes just depends on SRAT's possible resource information.
> > Then, ia64 can create cpu-memory-less-node(node with no available resource.).
> > (*)I don't like this.
> >
> > If we don't allow memory-less-node, we may have to add several codes for cpu-hot-add.
> > cpus should be moved to nearby node at hotadd .
> > And node-hot-add have to care that cpus mustn't be added before memory, cpu-driven
> > node-hot-add will never occur. (ACPI's 'container' device spec can't guaranntee this.)
>
> You can also alias node numbers to solve this: just point multiple node numbers
> to the same pgdat. For a memory less node this would be a nearby one.
>
Hmm, interesting...the 'alias' means follwing ?
==
NODE_DATA(A) = pgdat_for_A
NODE_DATA(B) = pgdat_for_A // B is memory-less.
- NODE_DATA(B) is valid but B is not online.
==
looks complicated..and we have to care /sys/devices/system/node handling.

-Kame

2007-02-07 10:41:31

by Andi Kleen

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

On Wednesday 07 February 2007 11:37, KAMEZAWA Hiroyuki wrote:
> On Wed, 7 Feb 2007 11:19:02 +0100
> Andi Kleen <[email protected]> wrote:
>
> >
> > > AFAIK, ia64 creates nodes just depends on SRAT's possible resource information.
> > > Then, ia64 can create cpu-memory-less-node(node with no available resource.).
> > > (*)I don't like this.
> > >
> > > If we don't allow memory-less-node, we may have to add several codes for cpu-hot-add.
> > > cpus should be moved to nearby node at hotadd .
> > > And node-hot-add have to care that cpus mustn't be added before memory, cpu-driven
> > > node-hot-add will never occur. (ACPI's 'container' device spec can't guaranntee this.)
> >
> > You can also alias node numbers to solve this: just point multiple node numbers
> > to the same pgdat. For a memory less node this would be a nearby one.
> >
> Hmm, interesting...the 'alias' means follwing ?

Yes.

> NODE_DATA(A) = pgdat_for_A
> NODE_DATA(B) = pgdat_for_A // B is memory-less.
> - NODE_DATA(B) is valid but B is not online.

Well it is online because A is. For all practical purposes
it is A, just under a different name.

> ==
> looks complicated..and we have to care /sys/devices/system/node handling.

x86-64 used to do that when it still only did 1:1 cpu<->memory mappings.
I don't remember any problems with it.

-Andi

2007-02-07 10:49:57

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

On Wed, 7 Feb 2007 11:41:25 +0100
Andi Kleen <[email protected]> wrote:

> On Wednesday 07 February 2007 11:37, KAMEZAWA Hiroyuki wrote:
> > On Wed, 7 Feb 2007 11:19:02 +0100
> > Andi Kleen <[email protected]> wrote:
> > > You can also alias node numbers to solve this: just point multiple node numbers
> > > to the same pgdat. For a memory less node this would be a nearby one.
> > >
> > Hmm, interesting...the 'alias' means follwing ?
>
> Yes.
>
> > NODE_DATA(A) = pgdat_for_A
> > NODE_DATA(B) = pgdat_for_A // B is memory-less.
> > - NODE_DATA(B) is valid but B is not online.
>
> Well it is online because A is. For all practical purposes
> it is A, just under a different name.
>
> > ==
> > looks complicated..and we have to care /sys/devices/system/node handling.
>
> x86-64 used to do that when it still only did 1:1 cpu<->memory mappings.
> I don't remember any problems with it.
>

How for_each_online_node(nid) works ? it can handle alias-nid ?

==
for_each_online_node(nid) {
pgdat = NODE_DATA(nid);
==
This code never accesses pgdat_for_A twice ?

Thanks,
-Kame


2007-02-07 11:32:48

by Andi Kleen

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node


> How for_each_online_node(nid) works ? it can handle alias-nid ?
>
> ==
> for_each_online_node(nid) {
> pgdat = NODE_DATA(nid);
> ==
> This code never accesses pgdat_for_A twice ?

It would. If there's a problem it could be changed to walk the pgdat lists instead,
but at least when x86-64 still did it there wasn't.

-Andi

2007-02-07 12:27:59

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

On Wed, 7 Feb 2007 12:32:36 +0100
Andi Kleen <[email protected]> wrote:

>
> > How for_each_online_node(nid) works ? it can handle alias-nid ?
> >
> > ==
> > for_each_online_node(nid) {
> > pgdat = NODE_DATA(nid);
> > ==
> > This code never accesses pgdat_for_A twice ?
>
> It would. If there's a problem it could be changed to walk the pgdat lists instead,
> but at least when x86-64 still did it there wasn't.
>

Thank you for discussion.

I'll try another hot-fix just for this NULL pointer access problem,
which is not invasive and not costly.

Thanks,
-Kame


2007-02-07 14:03:44

by Christoph Lameter

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

On Wed, 7 Feb 2007, Andi Kleen wrote:

> IMHO there shouldn't be any memory less nodes. The architecture code
> should not create them. The CPU should be assigned to a nearby node instead.
> At least x86-64 ensures that.

Yes I wish we would do it that way on all platforms. SGI's SN2 does that
too.



2007-02-07 14:06:07

by Christoph Lameter

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

On Wed, 7 Feb 2007, KAMEZAWA Hiroyuki wrote:

> > IMHO there shouldn't be any memory less nodes. The architecture code
> > should not create them. The CPU should be assigned to a nearby node instead.
> > At least x86-64 ensures that.
> >
> AFAIK, ia64 creates nodes just depends on SRAT's possible resource information.
> Then, ia64 can create cpu-memory-less-node(node with no available resource.).
> (*)I don't like this.

I think that is only true for !SN2 platforms? Could we fix this?

> If we don't allow memory-less-node, we may have to add several codes for cpu-hot-add.
> cpus should be moved to nearby node at hotadd .
> And node-hot-add have to care that cpus mustn't be added before memory, cpu-driven
> node-hot-add will never occur. (ACPI's 'container' device spec can't guaranntee this.)

Well you could bring down the cpu and bring it up again? This would also
assure the best placement of the runtime structures for node?

2007-02-07 15:28:39

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

On Wed, 7 Feb 2007 06:05:56 -0800 (PST)
Christoph Lameter <[email protected]> wrote:

> On Wed, 7 Feb 2007, KAMEZAWA Hiroyuki wrote:
>
> > > IMHO there shouldn't be any memory less nodes. The architecture code
> > > should not create them. The CPU should be assigned to a nearby node instead.
> > > At least x86-64 ensures that.
> > >
> > AFAIK, ia64 creates nodes just depends on SRAT's possible resource information.
> > Then, ia64 can create cpu-memory-less-node(node with no available resource.).
> > (*)I don't like this.
>
> I think that is only true for !SN2 platforms? Could we fix this?
>
AFAIK, some vendor(HP?) has following configraion
- node0 .... cpu only node
- node1 .... cpu only node
- node2 .... memory only node.
This is because of their memory-interleave technique.

Our 64cpu socket NUMA system also has a config
- node0 cpu+memory node
- node 1 - 7 cpu only node.
for deviding scheduler domain.(old kernel had problem with big-sched-domain)

To fix memory-less-node, we have to test the performance of
"very-big-scheduler-domain" and to define the rule for cpu-hot-add, as
"a new cpu will be added to the most nearby node"
(node-hot-add will have to add some hook..)

I don't know someone who created memory-less-node in past may have some other issues.

There may be some complicated topology system with complicated PXM map.


> > If we don't allow memory-less-node, we may have to add several codes for cpu-hot-add.
> > cpus should be moved to nearby node at hotadd .
> > And node-hot-add have to care that cpus mustn't be added before memory, cpu-driven
> > node-hot-add will never occur. (ACPI's 'container' device spec can't guaranntee this.)
>
> Well you could bring down the cpu and bring it up again? This would also
> assure the best placement of the runtime structures for node?
>
cpu-to-node relationship is fixed in the early stage of cpu hotplug.
I'm not sure we can bring down/up cpu again in clean way. After a cpu is added,
the kernel losts its original PXM value now.

about runtime structures:
The runtime structure placement for a hot-added-node is another issue here.
I and Goto-san have a plan for optimized placement of structures and will
try when we can do. (We are now assgined to RHEL5 stabilization tasks...)

Moving per-cpu-area at hotadd does not look easy.
IMHO, maybe we have to use stop_machine_run() to move it.

Anyway, I'll post an another *easy* patch just for fix the NULL pointer access.
please review.

Thanks,
-Kame






2007-02-07 16:23:42

by Andrew Morton

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

On 07 Feb 2007 11:20:06 +0100 Andi Kleen <[email protected]> wrote:

> KAMEZAWA Hiroyuki <[email protected]> writes:
>
> > current mempolicy just checks whether a node is online or not.
> > If there is memory-less-node, mempolicy's target node can be
> > invalid.
> > This patch adds a check whether a node has memory or not.
>
> IMHO there shouldn't be any memory less nodes. The architecture code
> should not create them. The CPU should be assigned to a nearby node instead.

umm, why?

A node which has CPUs and no memory is obviously physically possible and
isn't a completely insane thing for a user to do. I'd have thought that
the kernel should be able to cleanly and clearly handle it, and to
accurately present the machine's topology to the user without us having to
go adding falsehoods like this?

> At least x86-64 ensures that.
>
> -Andi

2007-02-07 16:51:12

by Andi Kleen

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

On Wednesday 07 February 2007 17:23, Andrew Morton wrote:
> On 07 Feb 2007 11:20:06 +0100 Andi Kleen <[email protected]> wrote:
>
> > KAMEZAWA Hiroyuki <[email protected]> writes:
> >
> > > current mempolicy just checks whether a node is online or not.
> > > If there is memory-less-node, mempolicy's target node can be
> > > invalid.
> > > This patch adds a check whether a node has memory or not.
> >
> > IMHO there shouldn't be any memory less nodes. The architecture code
> > should not create them. The CPU should be assigned to a nearby node instead.
>
> umm, why?
>
> A node which has CPUs and no memory is obviously physically possible and
> isn't a completely insane thing for a user to do. I'd have thought that
> the kernel should be able to cleanly and clearly handle it,

It doesn't.

> and to
> accurately present the machine's topology to the user without us having to
> go adding falsehoods like this?

a node is a piece of memory. Without memory it doesn't make sense.

-Andi

2007-02-07 17:43:53

by Andrew Morton

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

On Wed, 7 Feb 2007 17:50:55 +0100 Andi Kleen <[email protected]> wrote:

> On Wednesday 07 February 2007 17:23, Andrew Morton wrote:
> > On 07 Feb 2007 11:20:06 +0100 Andi Kleen <[email protected]> wrote:
> >
> > > KAMEZAWA Hiroyuki <[email protected]> writes:
> > >
> > > > current mempolicy just checks whether a node is online or not.
> > > > If there is memory-less-node, mempolicy's target node can be
> > > > invalid.
> > > > This patch adds a check whether a node has memory or not.
> > >
> > > IMHO there shouldn't be any memory less nodes. The architecture code
> > > should not create them. The CPU should be assigned to a nearby node instead.
> >
> > umm, why?
> >
> > A node which has CPUs and no memory is obviously physically possible and
> > isn't a completely insane thing for a user to do. I'd have thought that
> > the kernel should be able to cleanly and clearly handle it,
>
> It doesn't.

Fix it?

> > and to
> > accurately present the machine's topology to the user without us having to
> > go adding falsehoods like this?
>
> a node is a piece of memory. Without memory it doesn't make sense.

Who said? I can pick up a piece of circuitry which has four CPUs and no
RAM, wave it about then stick it in a computer. The kernel is just wrong,
surely?

2007-02-07 18:15:43

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH] FS : Speedup rw_verify_area()

--- linux-2.6.20/fs/read_write.c 2007-02-07 18:21:59.000000000 +0100
+++ linux-2.6.20-ed/fs/read_write.c 2007-02-07 18:23:41.000000000 +0100
@@ -197,13 +197,13 @@ int rw_verify_area(int read_write, struc
struct inode *inode;
loff_t pos;

+ inode = file->f_path.dentry->d_inode;
if (unlikely((ssize_t) count < 0))
goto Einval;
pos = *ppos;
if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
goto Einval;

- inode = file->f_path.dentry->d_inode;
if (unlikely(inode->i_flock && MANDATORY_LOCK(inode))) {
int retval = locks_mandatory_area(
read_write == READ ? FLOCK_VERIFY_READ : FLOCK_VERIFY_WRITE,


Attachments:
(No filename) (1.11 kB)
rw_verify_area.patch (628.00 B)
Download all attachments

2007-02-08 00:38:39

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

On Wed, 7 Feb 2007 09:43:44 -0800
Andrew Morton <[email protected]> wrote:

> > > and to
> > > accurately present the machine's topology to the user without us having to
> > > go adding falsehoods like this?
> >
> > a node is a piece of memory. Without memory it doesn't make sense.
>
> Who said? I can pick up a piece of circuitry which has four CPUs and no
> RAM, wave it about then stick it in a computer. The kernel is just wrong,
> surely?
>
As far as I remember, this was the first e-mail which tells me that there are
- cpu only node
- device only node
- memory only node
during node-hot-plug discussion.

http://www.gelato.unsw.edu.au/archives/linux-ia64/0405/9679.html

Thanks,
-Kame

2007-02-08 11:49:06

by Bob Picco

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

Hiroyuki KAMEZAWA wrote: [Wed Feb 07 2007, 03:36:47AM EST]
> On Wed, 7 Feb 2007 00:04:41 -0800 (PST)
> Christoph Lameter <[email protected]> wrote:
>
> > On Wed, 7 Feb 2007, KAMEZAWA Hiroyuki wrote:
> >
> > > > Hmmm... Remove the node from the node_online_map instead?
> > > >
> > > Changing defintion of node_online_map is harmfil. (there are cpu-only-nodes.)
> > > How about adding nodemask for nodes equips memory ?
> >
> > Ok that is better but...
> >
> > Would it be possible to attach the cpus to the
> > next nodes with memory and mark the node offline? That way we could avoid
> > another mask that we constantly have to check?
> >
> Added ia64 list to CC.
> I know ia64 kernel did what you say in old days (I know RHEL4/2.6.9 kernel does it).
> Someone changed it and created cpu-only-node for some purpose, I don't know why.
That was me. It will probably be later today or Friday before I've had
time to review the code. For reference look for string memory_less in
arch/ia64/mm/discontig.c.

The short story is HP ships NUMA boxes with interleaved memory only by
default which is represented by a single memory only node. Originally all
the CPU nodes where assigned to the memory node. The code was very
complicated and incorrect to me. Subsequently, and what we have now, the CPU
only nodes are revealed and the memory only node too. I do believe that a
cpu only nodes should be possible but now there seems to be a new issue.

bob
>
>
> > Or fix the location where the error occurred to be able to tolerate a node
> > with no zones?
> >
> Hmm,
> In this case, MPOL_MBIND, the user requests to allocate memory from specified nodes.
> I think it's better to tell him "you can't do that" than silently allocating memory
> from other places.
>
> -Kame
>
>
>
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2007-02-08 19:09:53

by Christoph Lameter

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

On Wed, 7 Feb 2007, Andrew Morton wrote:

> > > A node which has CPUs and no memory is obviously physically possible and
> > > isn't a completely insane thing for a user to do. I'd have thought that
> > > the kernel should be able to cleanly and clearly handle it,
> >
> > It doesn't.
>
> Fix it?

Why? It is a bad idea.

> > > and to
> > > accurately present the machine's topology to the user without us having to
> > > go adding falsehoods like this?
> >
> > a node is a piece of memory. Without memory it doesn't make sense.
>
> Who said? I can pick up a piece of circuitry which has four CPUs and no
> RAM, wave it about then stick it in a computer. The kernel is just wrong,
> surely?

Surely your computer has some memory so attach it to that memory (which
in a NUMA system would be one or the other node).

Cpu only "nodes" would mean that all memory would be off node. Meaning
whatever interconnect one has would be heavily used. Operating system and
application performance will suffer.


2007-02-08 19:26:38

by Andrew Morton

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

On Thu, 8 Feb 2007 11:09:40 -0800 (PST) Christoph Lameter <[email protected]> wrote:

> > > > and to
> > > > accurately present the machine's topology to the user without us having to
> > > > go adding falsehoods like this?
> > >
> > > a node is a piece of memory. Without memory it doesn't make sense.
> >
> > Who said? I can pick up a piece of circuitry which has four CPUs and no
> > RAM, wave it about then stick it in a computer. The kernel is just wrong,
> > surely?
>
> Surely your computer has some memory so attach it to that memory (which
> in a NUMA system would be one or the other node).

"attach it". But it _isn't_ attached. There is no memory on this node.
We seem to be saying that we should misrepresent the physical topology
because the kernel doesn't handle it appropriately.

> Cpu only "nodes" would mean that all memory would be off node. Meaning
> whatever interconnect one has would be heavily used. Operating system and
> application performance will suffer.

>From this a logical step would be to change the kernel to refuse to bring
memoryless nodes online at all.

If that's not an approproate solution, then there must be a legtimate
reason for using memoryless nodes.

Which is it?

2007-02-08 19:35:45

by Christoph Lameter

[permalink] [raw]
Subject: Re: [2.6.20][PATCH] fix mempolicy error check on a system with memory-less-node

On Thu, 8 Feb 2007, Andrew Morton wrote:

> > Surely your computer has some memory so attach it to that memory (which
> > in a NUMA system would be one or the other node).
>
> "attach it". But it _isn't_ attached. There is no memory on this node.
> We seem to be saying that we should misrepresent the physical topology
> because the kernel doesn't handle it appropriately.

I think we are have a disagreement on what node means. Andi is saying that
a node is bound to memory. The affinity of that chunk of memory has to be
considered by the Kernel. If there is no memory then there is no memory
management issue. This memory may have assigned I/O devices and cpus that
use that memory for their operations. I think this is the working
definition of the kernel.

You are saying a node means something that is connected to a NUMA
interlink that may have cpus / memory / IO devices.

It seems that both definitions have been used.

> > Cpu only "nodes" would mean that all memory would be off node. Meaning
> > whatever interconnect one has would be heavily used. Operating system and
> > application performance will suffer.
>
> >From this a logical step would be to change the kernel to refuse to bring
> memoryless nodes online at all.

This is the case.

Arch code typically ignores memory less nodes and assigns processors and
I/O devices with no memory to the next node. F.e. the SGI I/O "node" only
exists on the arch level. They are mapped to the next memory node by
arch coide and the locality information returned to the Kernel for devices
on that I/O node is the next memory node.