2009-09-21 16:10:29

by Mel Gorman

[permalink] [raw]
Subject: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

Currently SLQB is not allowed to be configured on PPC and S390 machines as
CPUs can belong to memoryless nodes. SLQB does not deal with this very well
and crashes reliably.

These patches fix the problem on PPC64 and it appears to be fairly stable.
At least, basic actions that were previously silently halting the machine
complete successfully. There might still be per-cpu problems, as Sachin
reported stability problems on this machine that did not depend on SLQB.

Patch 1 notes that the per-node hack in SLQB only works if every node in
the system has a CPU of the same ID. If this is not the case,
the per-node areas are not necessarily allocated. This fix only
applies to ppc64. It's possible that s390 needs a similar hack. The
alternative is to statically allocate the per-node structures, but
this is sub-optimal in terms of both performance and memory usage.

Patch 2 notes that on memoryless configurations, memory is always freed
remotely, but allocations always go to the local lists and fall back to
the page allocator on failure. This is effectively a memory leak. This
patch records in kmem_cache_cpu what node it considers local: either
the real local node or the closest node available.

Patch 3 allows SLQB to be configured on PPC again. It's not enabled on
S390 because I can't test for sure on a suitable configuration there.

This is not ready for merging just yet.

It needs a sign-off from the powerpc side because it's now potentially
allocating more memory (Ben?). An alternative to this patch is in V1 that
statically declares the per-node structures, but this is potentially
sub-optimal from a performance and memory utilisation perspective.

From an SLQB side, how does patch 2 now look from a potential list-corruption
point of view (Christoph, Nick, Pekka?). Certainly this version seems a
lot more sensible than the patch in V1 because the per-cpu list is now
always being used for pages from the closest node.

It would also be nice if the S390 guys could retest with SLQB to see
if special action with respect to per-cpu areas is still needed.

 arch/powerpc/kernel/setup_64.c |   20 ++++++++++++++++++++
 include/linux/slqb_def.h       |    3 +++
 init/Kconfig                   |    2 +-
 mm/slqb.c                      |   23 +++++++++++++++++------
4 files changed, 41 insertions(+), 7 deletions(-)


2009-09-21 16:10:27

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 1/3] powerpc: Allocate per-cpu areas for node IDs for SLQB to use as per-node areas

SLQB uses DEFINE_PER_CPU to define per-node areas. An implicit
assumption is made that all valid node IDs will have matching valid CPU
ids. In memoryless configurations, it is possible to have a node ID with
no CPU having the same ID. When this happens, a per-cpu area is not
created and the value of paca[cpu].data_offset is some random value.
This is later dereferenced and the system crashes after accessing some
invalid address.

This patch hacks powerpc to allocate per-cpu areas for node IDs that
have no corresponding CPU id. This gets around the immediate problem but
it should be discussed if there is a requirement for a DEFINE_PER_NODE
and how it should be implemented.

Signed-off-by: Mel Gorman <[email protected]>
---
arch/powerpc/kernel/setup_64.c | 20 ++++++++++++++++++++
1 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 1f68160..a5f52d4 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -588,6 +588,26 @@ void __init setup_per_cpu_areas(void)
 		paca[i].data_offset = ptr - __per_cpu_start;
 		memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
 	}
+#ifdef CONFIG_SLQB
+	/*
+	 * SLQB abuses DEFINE_PER_CPU to setup a per-node area. This trick
+	 * assumes that every node ID will have a CPU of that ID to match.
+	 * On systems with memoryless nodes, this may not hold true. Hence,
+	 * we take a second pass initialising a "per-cpu" area for node-ids
+	 * that SLQB can use
+	 */
+	for_each_node_state(i, N_NORMAL_MEMORY) {
+
+		/* Skip node IDs that a valid CPU id exists for */
+		if (paca[i].data_offset)
+			continue;
+
+		ptr = alloc_bootmem_pages_node(NODE_DATA(cpu_to_node(i)), size);
+
+		paca[i].data_offset = ptr - __per_cpu_start;
+		memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
+	}
+#endif /* CONFIG_SLQB */
 }
 #endif

--
1.6.3.3
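
As context for what the hack above leans on: ppc64 reaches a per-cpu
variable by adding paca[id].data_offset to the symbol's address. A rough,
illustrative sketch, not the kernel's actual accessor macros:

	/*
	 * Illustrative sketch only, not the kernel's real per-cpu macros.
	 * An id's area is found by relocating the symbol address by that
	 * id's data_offset.  After the patch above, node IDs without a
	 * matching CPU ID also get a valid data_offset, so SLQB's
	 * per-node lookups no longer dereference a random offset.
	 */
	#define per_id_ptr(var, id) \
		((typeof(&(var)))((char *)&(var) + paca[(id)].data_offset))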

2009-09-21 16:10:31

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 2/3] slqb: Record what node is local to a kmem_cache_cpu

When freeing a page, SLQB checks whether the page belongs to the local node.
If it does not, it is considered a remote free. On the allocation side, it
always checks the local lists and if they are empty, the page allocator
is called. On memoryless configurations, this is effectively a memory
leak and the machine quickly kills itself in an OOM storm.

This patch records what node ID is closest to a CPU and considers that to be
the local node. As the management structure for the CPU is always allocated
from the closest node, the node the CPU structure resides on is considered
"local".

Signed-off-by: Mel Gorman <[email protected]>
---
 include/linux/slqb_def.h |    3 +++
 mm/slqb.c                |   23 +++++++++++++++++------
2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/linux/slqb_def.h b/include/linux/slqb_def.h
index 1243dda..2ccbe7e 100644
--- a/include/linux/slqb_def.h
+++ b/include/linux/slqb_def.h
@@ -101,6 +101,9 @@ struct kmem_cache_cpu {
 	struct kmem_cache_list list;	/* List for node-local slabs */
 	unsigned int colour_next;	/* Next colour offset to use */
 
+	/* local_nid will be numa_node_id() except when memoryless */
+	unsigned int local_nid;
+
 #ifdef CONFIG_SMP
 	/*
 	 * rlist is a list of objects that don't fit on list.freelist (ie.
diff --git a/mm/slqb.c b/mm/slqb.c
index 4ca85e2..1846480 100644
--- a/mm/slqb.c
+++ b/mm/slqb.c
@@ -1375,7 +1375,7 @@ static noinline void *__slab_alloc_page(struct kmem_cache *s,
 	if (unlikely(!page))
 		return page;
 
-	if (!NUMA_BUILD || likely(slqb_page_to_nid(page) == numa_node_id())) {
+	if (!NUMA_BUILD || likely(slqb_page_to_nid(page) == c->local_nid)) {
 		struct kmem_cache_cpu *c;
 		int cpu = smp_processor_id();

@@ -1501,15 +1501,16 @@ static __always_inline void *__slab_alloc(struct kmem_cache *s,
 	struct kmem_cache_cpu *c;
 	struct kmem_cache_list *l;
 
+	c = get_cpu_slab(s, smp_processor_id());
+	VM_BUG_ON(!c);
+
 #ifdef CONFIG_NUMA
-	if (unlikely(node != -1) && unlikely(node != numa_node_id())) {
+	if (unlikely(node != -1) && unlikely(node != c->local_nid)) {
 try_remote:
 		return __remote_slab_alloc(s, gfpflags, node);
 	}
 #endif
 
-	c = get_cpu_slab(s, smp_processor_id());
-	VM_BUG_ON(!c);
 	l = &c->list;
 	object = __cache_list_get_object(s, l);
 	if (unlikely(!object)) {
@@ -1518,7 +1519,7 @@ try_remote:
 		object = __slab_alloc_page(s, gfpflags, node);
 #ifdef CONFIG_NUMA
 		if (unlikely(!object)) {
-			node = numa_node_id();
+			node = c->local_nid;
 			goto try_remote;
 		}
 #endif
@@ -1733,7 +1734,7 @@ static __always_inline void __slab_free(struct kmem_cache *s,
 	slqb_stat_inc(l, FREE);
 
 	if (!NUMA_BUILD || !slab_numa(s) ||
-			likely(slqb_page_to_nid(page) == numa_node_id())) {
+			likely(slqb_page_to_nid(page) == c->local_nid)) {
 		/*
 		 * Freeing fastpath. Collects all local-node objects, not
 		 * just those allocated from our per-CPU list. This allows
@@ -1928,6 +1929,16 @@ static void init_kmem_cache_cpu(struct kmem_cache *s,
 	c->rlist.tail = NULL;
 	c->remote_cache_list = NULL;
 #endif
+
+	/*
+	 * Determine the node local to this CPU. Ordinarily this
+	 * would be cpu_to_node(), but for memoryless nodes that is
+	 * not the best value. Instead, take the NUMA node that the
+	 * kmem_cache_cpu is allocated from as the best guess for
+	 * local, because it will match what the page allocator
+	 * thinks is the most local
+	 */
+	c->local_nid = page_to_nid(virt_to_page((unsigned long)c & PAGE_MASK));
 }

#ifdef CONFIG_NUMA
--
1.6.3.3
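
As a condensed restatement of the init_kmem_cache_cpu() trick above (the
helper name here is invented for illustration and is not part of the patch):

	/*
	 * Illustrative helper only; the name is invented.  The
	 * kmem_cache_cpu structure was allocated on the node nearest to
	 * the CPU, so the node backing its own page is the best guess
	 * for "local", even when cpu_to_node() names a memoryless node.
	 */
	static unsigned int guess_local_nid(struct kmem_cache_cpu *c)
	{
		struct page *page;

		page = virt_to_page((void *)((unsigned long)c & PAGE_MASK));
		return page_to_nid(page);
	}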

2009-09-21 16:10:42

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 3/3] slqb: Allow SLQB to be used on PPC

SLQB was disabled on PPC as it would stab itself in the face when running
on machines with CPUs on memoryless nodes. As those configurations should
now work, allow SLQB to be configured again on PPC.

Signed-off-by: Mel Gorman <[email protected]>
---
init/Kconfig | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index adc10ab..8f55fde 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1033,7 +1033,7 @@ config SLUB
 
 config SLQB
 	bool "SLQB (Queued allocator)"
-	depends on !PPC && !S390
+	depends on !S390
 	help
 	  SLQB is a proposed new slab allocator.

--
1.6.3.3

2009-09-21 17:17:47

by Daniel Walker

[permalink] [raw]
Subject: Re: [PATCH 1/3] powerpc: Allocate per-cpu areas for node IDs for SLQB to use as per-node areas

On Mon, 2009-09-21 at 17:10 +0100, Mel Gorman wrote:
> SLQB uses DEFINE_PER_CPU to define per-node areas. An implicit
> assumption is made that all valid node IDs will have matching valid CPU
> ids. In memoryless configurations, it is possible to have a node ID with
> no CPU having the same ID. When this happens, a per-cpu area is not
> created and the value of paca[cpu].data_offset is some random value.
> This is later dereferenced and the system crashes after accessing some
> invalid address.
>
> This patch hacks powerpc to allocate per-cpu areas for node IDs that
> have no corresponding CPU id. This gets around the immediate problem but
> it should be discussed if there is a requirement for a DEFINE_PER_NODE
> and how it should be implemented.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> arch/powerpc/kernel/setup_64.c | 20 ++++++++++++++++++++
> 1 files changed, 20 insertions(+), 0 deletions(-)
>
> diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> index 1f68160..a5f52d4 100644
> --- a/arch/powerpc/kernel/setup_64.c
> +++ b/arch/powerpc/kernel/setup_64.c
> @@ -588,6 +588,26 @@ void __init setup_per_cpu_areas(void)
> paca[i].data_offset = ptr - __per_cpu_start;
> memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
> }
> +#ifdef CONFIG_SLQB
> + /*
> + * SLQB abuses DEFINE_PER_CPU to setup a per-node area. This trick
> + * assumes that every node ID will have a CPU of that ID to match.
> + * On systems with memoryless nodes, this may not hold true. Hence,
> + * we take a second pass initialising a "per-cpu" area for node-ids
> + * that SLQB can use
> + */

Very trivial, but there's a little trailing whitespace in the first line
of the comment (checkpatch warns on it.) You also spelled initializing
wrong.

Daniel

2009-09-21 17:24:19

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH 1/3] powerpc: Allocate per-cpu areas for node IDs for SLQB to use as per-node areas

On Mon, 21 Sep 2009 10:17:52 -0700 Daniel Walker wrote:

> On Mon, 2009-09-21 at 17:10 +0100, Mel Gorman wrote:
> > SLQB uses DEFINE_PER_CPU to define per-node areas. An implicit
> > assumption is made that all valid node IDs will have matching valid CPU
> > ids. In memoryless configurations, it is possible to have a node ID with
> > no CPU having the same ID. When this happens, a per-cpu area is not
> > created and the value of paca[cpu].data_offset is some random value.
> > This is later dereferenced and the system crashes after accessing some
> > invalid address.
> >
> > This patch hacks powerpc to allocate per-cpu areas for node IDs that
> > have no corresponding CPU id. This gets around the immediate problem but
> > it should be discussed if there is a requirement for a DEFINE_PER_NODE
> > and how it should be implemented.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > arch/powerpc/kernel/setup_64.c | 20 ++++++++++++++++++++
> > 1 files changed, 20 insertions(+), 0 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> > index 1f68160..a5f52d4 100644
> > --- a/arch/powerpc/kernel/setup_64.c
> > +++ b/arch/powerpc/kernel/setup_64.c
> > @@ -588,6 +588,26 @@ void __init setup_per_cpu_areas(void)
> > paca[i].data_offset = ptr - __per_cpu_start;
> > memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
> > }
> > +#ifdef CONFIG_SLQB
> > + /*
> > + * SLQB abuses DEFINE_PER_CPU to setup a per-node area. This trick
> > + * assumes that every node ID will have a CPU of that ID to match.
> > + * On systems with memoryless nodes, this may not hold true. Hence,
> > + * we take a second pass initialising a "per-cpu" area for node-ids
> > + * that SLQB can use
> > + */
>
> Very trivial, but there's a little trailing whitespace in the first line
> of the comment (checkpatch warns on it.) You also spelled initializing
> wrong.

re: spelling. Not really. Think internationally.

---
~Randy
LPC 2009, Sept. 23-25, Portland, Oregon
http://linuxplumbersconf.org/2009/

2009-09-21 17:29:35

by Daniel Walker

[permalink] [raw]
Subject: Re: [PATCH 1/3] powerpc: Allocate per-cpu areas for node IDs for SLQB to use as per-node areas

On Mon, 2009-09-21 at 10:24 -0700, Randy Dunlap wrote:
> On Mon, 21 Sep 2009 10:17:52 -0700 Daniel Walker wrote:
>
> > On Mon, 2009-09-21 at 17:10 +0100, Mel Gorman wrote:
> > > SLQB uses DEFINE_PER_CPU to define per-node areas. An implicit
> > > assumption is made that all valid node IDs will have matching valid CPU
> > > ids. In memoryless configurations, it is possible to have a node ID with
> > > no CPU having the same ID. When this happens, a per-cpu area is not
> > > created and the value of paca[cpu].data_offset is some random value.
> > > This is later dereferenced and the system crashes after accessing some
> > > invalid address.
> > >
> > > This patch hacks powerpc to allocate per-cpu areas for node IDs that
> > > have no corresponding CPU id. This gets around the immediate problem but
> > > it should be discussed if there is a requirement for a DEFINE_PER_NODE
> > > and how it should be implemented.
> > >
> > > Signed-off-by: Mel Gorman <[email protected]>
> > > ---
> > > arch/powerpc/kernel/setup_64.c | 20 ++++++++++++++++++++
> > > 1 files changed, 20 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> > > index 1f68160..a5f52d4 100644
> > > --- a/arch/powerpc/kernel/setup_64.c
> > > +++ b/arch/powerpc/kernel/setup_64.c
> > > @@ -588,6 +588,26 @@ void __init setup_per_cpu_areas(void)
> > > paca[i].data_offset = ptr - __per_cpu_start;
> > > memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
> > > }
> > > +#ifdef CONFIG_SLQB
> > > + /*
> > > + * SLQB abuses DEFINE_PER_CPU to setup a per-node area. This trick
> > > + * assumes that every node ID will have a CPU of that ID to match.
> > > + * On systems with memoryless nodes, this may not hold true. Hence,
> > > + * we take a second pass initialising a "per-cpu" area for node-ids
> > > + * that SLQB can use
> > > + */
> >
> > Very trivial, but there's a little trailing whitespace in the first line
> > of the comment (checkpatch warns on it.) You also spelled initializing
> > wrong.
>
> re: spelling. Not really. Think internationally.

Yeah, I realized that after I sent it .. So misspelled in the American
sense I guess.

Daniel

2009-09-21 17:42:20

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/3] powerpc: Allocate per-cpu areas for node IDs for SLQB to use as per-node areas

On Mon, Sep 21, 2009 at 10:17:52AM -0700, Daniel Walker wrote:
> On Mon, 2009-09-21 at 17:10 +0100, Mel Gorman wrote:
> > SLQB uses DEFINE_PER_CPU to define per-node areas. An implicit
> > assumption is made that all valid node IDs will have matching valid CPU
> > ids. In memoryless configurations, it is possible to have a node ID with
> > no CPU having the same ID. When this happens, a per-cpu area is not
> > created and the value of paca[cpu].data_offset is some random value.
> > This is later dereferenced and the system crashes after accessing some
> > invalid address.
> >
> > This patch hacks powerpc to allocate per-cpu areas for node IDs that
> > have no corresponding CPU id. This gets around the immediate problem but
> > it should be discussed if there is a requirement for a DEFINE_PER_NODE
> > and how it should be implemented.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > arch/powerpc/kernel/setup_64.c | 20 ++++++++++++++++++++
> > 1 files changed, 20 insertions(+), 0 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> > index 1f68160..a5f52d4 100644
> > --- a/arch/powerpc/kernel/setup_64.c
> > +++ b/arch/powerpc/kernel/setup_64.c
> > @@ -588,6 +588,26 @@ void __init setup_per_cpu_areas(void)
> > paca[i].data_offset = ptr - __per_cpu_start;
> > memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
> > }
> > +#ifdef CONFIG_SLQB
> > + /*
> > + * SLQB abuses DEFINE_PER_CPU to setup a per-node area. This trick
> > + * assumes that every node ID will have a CPU of that ID to match.
> > + * On systems with memoryless nodes, this may not hold true. Hence,
> > + * we take a second pass initialising a "per-cpu" area for node-ids
> > + * that SLQB can use
> > + */
>
> Very trivial, but there's a little trailing whitespace in the first line
> of the comment (checkpatch warns on it.)

D'oh, will clean it up in the next revision if the substance of the
patch is agreed upon.

> You also spelled initializing wrong.

It's spelt correctly for British English.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-09-21 17:46:54

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

On Mon, Sep 21, 2009 at 05:10:23PM +0100, Mel Gorman wrote:
> Currently SLQB is not allowed to be configured on PPC and S390 machines as
> CPUs can belong to memoryless nodes. SLQB does not deal with this very well
> and crashes reliably.
>
> These patches fix the problem on PPC64 and it appears to be fairly stable.
> At least, basic actions that were previously silently halting the machine
> complete successfully.

I spoke too soon. Stress tests result in application failure, with nothing
in dmesg even with the patches applied, so it looks like patch 2 is still the
wrong way to fix the OOM-kill storm.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-09-21 18:02:37

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

Let's just keep SLQB back until the basic issues with memoryless nodes are
resolved. There does not seem to be an easy way to deal with this. Some
thought needs to go into how memoryless node handling relates to per-cpu
lists and locking. List handling issues need to be addressed before SLQB
can work reliably. The same issues can surface on x86 platforms with weird
NUMA memory setups.

Or just allow SLQB for !NUMA configurations and merge it now.

2009-09-21 18:12:37

by Pekka Enberg

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

On Mon, Sep 21, 2009 at 8:54 PM, Christoph Lameter
<[email protected]> wrote:
> Let's just keep SLQB back until the basic issues with memoryless nodes are
> resolved. There does not seem to be an easy way to deal with this. Some
> thought needs to go into how memoryless node handling relates to per-cpu
> lists and locking. List handling issues need to be addressed before SLQB
> can work reliably. The same issues can surface on x86 platforms with weird
> NUMA memory setups.
>
> Or just allow SLQB for !NUMA configurations and merge it now.

I'm holding on to it until the issues are resolved.

2009-09-21 18:07:35

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

On Mon, Sep 21, 2009 at 01:54:12PM -0400, Christoph Lameter wrote:
> Let's just keep SLQB back until the basic issues with memoryless nodes are
> resolved.

It's not even super-clear that the memoryless nodes issues are entirely
related to SLQB. Sachin for example says that there was a stall issue
with memoryless nodes that could be triggered without SLQB. Sachin, is
that still accurate?

If so, it's possible that SLQB somehow exacerbates the problem in some
unknown fashion.

> There does not seem to be an easy way to deal with this. Some
> thought needs to go into how memoryless node handling relates to per-cpu
> lists and locking. List handling issues need to be addressed before SLQB
> can work reliably. The same issues can surface on x86 platforms with weird
> NUMA memory setups.
>

Can you spot if there is something fundamentally wrong with patch 2? I.e. what
is wrong with treating the closest node as local instead of only the
truly local node?

> Or just allow SLQB for !NUMA configurations and merge it now.
>

Forcing SLQB !NUMA will not rattle out any existing list issues
unfortunately :(.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-09-21 18:21:12

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

On Mon, 21 Sep 2009, Mel Gorman wrote:
> Can you spot if there is something fundamentally wrong with patch 2? I.e. what
> is wrong with treating the closest node as local instead of only the
> truly local node?

Depends on the way locking is done for percpu queues (likely lockless).
A misidentification of the numa locality of an object may result in locks
not being taken that should have been taken.

> > Or just allow SLQB for !NUMA configurations and merge it now.
> >
>
> Forcing SLQB !NUMA will not rattle out any existing list issues
> unfortunately :(.

But it will make SLQB work right in permitted configurations. The NUMA
issues can then be fixed later upstream.
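
To make the hazard concrete: a rough sketch of the pattern Christoph is
describing. slqb_page_to_nid() appears in the patches above, but the list
helpers here are invented for illustration and are not SLQB's actual symbols:

	/*
	 * Sketch only; the list helpers are invented names.  The node
	 * test picks between a lockless per-cpu fastpath and a locked
	 * remote path.  If "local" is misidentified, a genuinely remote
	 * object takes the lockless path and the lists are silently
	 * corrupted.
	 */
	static void sketch_slab_free(struct kmem_cache_cpu *c,
				     struct slqb_page *page, void *object)
	{
		if (slqb_page_to_nid(page) == c->local_nid) {
			/* lockless: only this CPU ever touches this list */
			local_list_add(&c->list, object);
		} else {
			/* remote: other CPUs drain this list, so lock it */
			spin_lock(remote_list_lock(page));
			remote_list_add(page, object);
			spin_unlock(remote_list_lock(page));
		}
	}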

2009-09-22 00:01:27

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

On Mon, 2009-09-21 at 17:10 +0100, Mel Gorman wrote:
>
> It needs a sign-off from the powerpc side because it's now potentially
> allocating more memory (Ben?). An alternative to this patch is in V1 that
> statically declares the per-node structures, but this is potentially
> sub-optimal from a performance and memory utilisation perspective.

So if I understand correctly, we have a problem with both cpu-less and
memory-less nodes. Interesting setups :-)

I have no strong objection to allocating the per-cpu data for
the cpu-less nodes. However, I wonder if we should do that a bit more
nicely, maybe with some kind of "adjusted" cpu_possible_mask() (could be
something like cpu_node_valid_mask or similar) to be used by percpu.

Mostly because it would be nice to have built-in debug features in
per-cpu, and in that case it would need some way to know a valid
number from an invalid one. Either that, or just keep track of the
mask of cpus that had percpu data allocated to them.

Cheers,
Ben.
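
A minimal sketch of what that might look like; cpu_node_valid_mask is Ben's
proposed name, not an existing kernel symbol:

	/*
	 * Minimal sketch, assuming the proposed name: track every id
	 * (real CPU or hijacked node id) that actually received a
	 * per-cpu area, so percpu debug code can tell a valid id from
	 * an invalid one.
	 */
	static DECLARE_BITMAP(cpu_node_valid_mask, NR_CPUS);

	static void __init mark_percpu_id_valid(int id)
	{
		set_bit(id, cpu_node_valid_mask);
	}

	static bool percpu_id_valid(int id)
	{
		return test_bit(id, cpu_node_valid_mask);
	}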

2009-09-22 00:02:39

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/3] powerpc: Allocate per-cpu areas for node IDs for SLQB to use as per-node areas

Hello,

Mel Gorman wrote:
> diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> index 1f68160..a5f52d4 100644
> --- a/arch/powerpc/kernel/setup_64.c
> +++ b/arch/powerpc/kernel/setup_64.c
> @@ -588,6 +588,26 @@ void __init setup_per_cpu_areas(void)
> paca[i].data_offset = ptr - __per_cpu_start;
> memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
> }
> +#ifdef CONFIG_SLQB
> + /*
> + * SLQB abuses DEFINE_PER_CPU to setup a per-node area. This trick
> + * assumes that every node ID will have a CPU of that ID to match.
> + * On systems with memoryless nodes, this may not hold true. Hence,
> + * we take a second pass initialising a "per-cpu" area for node-ids
> + * that SLQB can use
> + */
> + for_each_node_state(i, N_NORMAL_MEMORY) {
> +
> + /* Skip node IDs that a valid CPU id exists for */
> + if (paca[i].data_offset)
> + continue;
> +
> + ptr = alloc_bootmem_pages_node(NODE_DATA(cpu_to_node(i)), size);
> +
> + paca[i].data_offset = ptr - __per_cpu_start;
> + memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
> + }
> +#endif /* CONFIG_SLQB */
> }
> #endif

Eh... I don't know. This seems too hacky to me. Why not just
allocate a pointer array of MAX_NUMNODES and allocate per-node memory
there? This will be slightly more expensive but I doubt it will be
noticeable. The only extra overhead is the cacheline footprint for the
extra array.

Thanks.

--
tejun
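
A sketch of the shape Tejun is suggesting, with invented names and assuming
boot-time allocation:

	/*
	 * Sketch of the MAX_NUMNODES alternative: a plain pointer array
	 * with per-node memory allocated behind it, instead of abusing
	 * the per-cpu machinery.  Array and function names are invented.
	 */
	static void *slqb_node_data[MAX_NUMNODES];

	static void __init slqb_alloc_node_areas(size_t size)
	{
		int nid;

		for_each_node_state(nid, N_NORMAL_MEMORY)
			slqb_node_data[nid] =
				alloc_bootmem_pages_node(NODE_DATA(nid), size);
	}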

2009-09-22 00:19:19

by David Rientjes

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

On Tue, 22 Sep 2009, Benjamin Herrenschmidt wrote:

> So if I understand correctly, we have a problem with both cpu-less and
> memory-less nodes. Interesting setups :-)
>

I agree with Christoph that we need to resolve the larger issue of
memoryless nodes in the kernel, and the result of that work will most
likely become the basis from which the slqb fixes originate.

I disagree that we need kernel support for memoryless nodes on x86 and
probably on all architectures period. "NUMA nodes" will always contain
memory by definition and I think hijacking the node abstraction away from
representing anything but memory affinity is wrong in the interest of a
long-term maintainable kernel and will continue to cause issues such as
this in other subsystems.

I do understand the asymmetries of these machines, including the ppc that
is triggering this particular hang with slqb. But I believe the support
can be implemented in a different way: I would offer an alternative
representation based entirely on node distances. This would isolate each
region of memory that has varying affinity to cpus, pci busses, etc., into
nodes and then report a distance, whether local or remote, to other nodes
much in the way the ACPI specification does with proximity domains.

Using node distances instead of memoryless nodes would still be able to
represent all asymmetric machines that currently benefit from the support
by binding devices to memory regions to which they have the closest
affinity and then reporting relative distances to other nodes via
node_distance().

2009-09-22 05:03:14

by Sachin Sant

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

Mel Gorman wrote:
> On Mon, Sep 21, 2009 at 01:54:12PM -0400, Christoph Lameter wrote:
>
>> Let's just keep SLQB back until the basic issues with memoryless nodes are
>> resolved.
>>
>
> It's not even super-clear that the memoryless nodes issues are entirely
> related to SLQB. Sachin for example says that there was a stall issue
> with memoryless nodes that could be triggered without SLQB. Sachin, is
> that still accurate?
>
I think there are two different problems that we are dealing with.

The first one is SLQB not working on a ppc64 box, which seems to be specific
to only one machine; I haven't seen it on other Power boxes. The patches
that you have posted seem to allow the box to boot, but eventually it hits
the stall issue (related to the percpu dynamic allocator not working on
ppc64), which is the second problem we are dealing with.

The stall issue seems to be much more critical, as it is affecting almost
all of the Power boxes that I have tested (4 in all).
This issue is seen with Linus' tree as well and was first seen with
2.6.31-git5 (0cb583fd..)

The stall issue was reported here:
http://lists.ozlabs.org/pipermail/linuxppc-dev/2009-September/075791.html

Thanks
-Sachin


> If so, it's possible that SLQB somehow exacerbates the problem in some
> unknown fashion.
>
>
>> There does not seem to be an easy way to deal with this. Some
>> thought needs to go into how memoryless node handling relates to per-cpu
>> lists and locking. List handling issues need to be addressed before SLQB
>> can work reliably. The same issues can surface on x86 platforms with weird
>> NUMA memory setups.
>>
>>
>
> Can you spot if there is something fundamentally wrong with patch 2? I.e. what
> is wrong with treating the closest node as local instead of only the
> truly local node?
>
>
>> Or just allow SLQB for !NUMA configurations and merge it now.
>>
>>
>
> Forcing SLQB !NUMA will not rattle out any existing list issues
> unfortunately :(.
>
>


--

---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------

2009-09-22 06:34:32

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

On Mon, 21 Sep 2009, David Rientjes wrote:

> I disagree that we need kernel support for memoryless nodes on x86 and
> probably on all architectures period. "NUMA nodes" will always contain
> memory by definition and I think hijacking the node abstraction away from
> representing anything but memory affinity is wrong in the interest of a
> long-term maintainable kernel and will continue to cause issues such as
> this in other subsystems.

Amen. Sadly my past opinions on this did not seem convincing enough.

> I do understand the asymmetries of these machines, including the ppc that
> is triggering this particular hang with slqb. But I believe the support
> can be implemented in a different way: I would offer an alternative
> representation based entirely on node distances. This would isolate each
> region of memory that has varying affinity to cpus, pci busses, etc., into
> nodes and then report a distance, whether local or remote, to other nodes
> much in the way the ACPI specification does with proximity domains.

Good idea.

> Using node distances instead of memoryless nodes would still be able to
> represent all asymmetric machines that currently benefit from the support
> by binding devices to memory regions to which they have the closest
> affinity and then reporting relative distances to other nodes via
> node_distance().

How would you deal with a memoryless node that has, let's say, 4 processors
and some I/O devices? Now the memory policy is round-robin and there are 4
nodes at the same distance with 4G memory each. Does one of the nodes now
become privileged under your plan? How do you equally use memory from all
these nodes?

2009-09-22 07:59:26

by David Rientjes

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

On Tue, 22 Sep 2009, Christoph Lameter wrote:

> How would you deal with a memoryless node that has, let's say, 4 processors
> and some I/O devices? Now the memory policy is round-robin and there are 4
> nodes at the same distance with 4G memory each. Does one of the nodes now
> become privileged under your plan? How do you equally use memory from all
> these nodes?
>

If the distance between the memoryless node with the cpus/devices and all
4G nodes is the same, then this is UMA and no abstraction is necessary:
there's no reason to support interleaving of memory allocations amongst
four different regions of memory if there's no difference in latencies to
those regions.

It is possible, however, to have a system configured in such a way that
representing all devices, including memory, at a single level of
abstraction isn't possible. An example is a four cpu system where cpus
0-1 have local distance to all memory and cpus 2-3 have remote distance.

A solution would be to abstract everything into "system localities" like
the ACPI specification does. These localities in my plan are slightly
different, though: they are limited to only a single class of device.

A locality is simply an aggregate of a particular type of device; a device
is bound to a locality if it shares the same proximity as all other
devices in that locality to all other localities. In other words, the
previous example would have two cpu localities: one with cpus 0-1 and one
with cpus 2-3. If cpu 0 had a different proximity than cpu 1 to a pci
bus, however, there would be three cpu localities.

The equivalent of proximity domains then describes the distance between
all localities; these distances need not be symmetric, as it is possible for
distance in one direction to be different from the opposite direction,
just as ACPI pxm's allow.

A "node" in this plan is simply a system locality consisting of memory.

For subsystems such as slab allocators, all we require is cpu_to_node()
tables which would map cpu localities to nodes and describe them in terms
of local or remote distance (or whatever the SLIT says, if provided). All
present day information can still be represented in this model, we've just
added additional layers of abstraction internally.
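
As a concrete sketch of the kind of cpu_to_node() mapping being described,
derived from node_distance() alone (the helper name is invented):

	/*
	 * Illustrative only: map a locality to the nearest node that
	 * actually has memory, using node_distance().  Not an existing
	 * kernel symbol.
	 */
	static int nearest_memory_node(int from)
	{
		int nid, best = -1, best_dist = INT_MAX;

		for_each_node_state(nid, N_NORMAL_MEMORY) {
			int dist = node_distance(from, nid);

			if (dist < best_dist) {
				best_dist = dist;
				best = nid;
			}
		}
		return best;	/* -1 if no node with memory exists */
	}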

2009-09-22 08:13:29

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

On Tue, 2009-09-22 at 00:59 -0700, David Rientjes wrote:
> The equivalent of proximity domains then describes the distance between
> all localities; these distances need not be symmetric, as it is possible for
> distance in one direction to be different from the opposite direction,
> just as ACPI pxm's allow.
>
> A "node" in this plan is simply a system locality consisting of memory.
>
> For subsystems such as slab allocators, all we require is cpu_to_node()
> tables which would map cpu localities to nodes and describe them in terms
> of local or remote distance (or whatever the SLIT says, if provided). All
> present day information can still be represented in this model, we've just
> added additional layers of abstraction internally.

While I like the idea of NUMA nodes being strictly memory and everything
else being expressed by distances, we'll have to clean up quite a few
corners with skeletons in various states of decomposition waiting for
us there.

For example, we have code here or there that (ab)uses the NUMA node
information to link devices with their iommu, that sort of thing. I.e., a
hard dependency which isn't really related to a concept of distance to
any memory.

At least on powerpc, nowadays, I can pretty much make everything
fall back to some representation in the device-tree though, thus it
shouldn't be -that- hard to fix, I suppose.

Cheers,
Ben.

2009-09-22 08:45:06

by David Rientjes

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

On Tue, 22 Sep 2009, Benjamin Herrenschmidt wrote:

> While I like the idea of NUMA nodes being strictly memory and everything
> else being expressed by distances, we'll have to clean up quite a few
> corners with skeletons in various states of decomposition waiting for
> us there.
>

Agreed, it's invasive.

> For example, we have code here or there that (ab)uses the NUMA node
> information to link devices with their iommu, that sort of thing. I.e., a
> hard dependency which isn't really related to a concept of distance to
> any memory.
>

ACPI's SLIT uses a distance of 0xff to specify that one locality is
unreachable from another. We could easily adopt that convention.

> At least on powerpc, nowadays, I can pretty much make everything
> fall back to some representation in the device-tree though, thus it
> shouldn't be -that- hard to fix, I suppose.
>

Cool, that's encouraging.

I really think that this type of abstraction would make things simpler in
the long term. For example, I just finished fixing a bug in tip where
cpumask_of_pcibus() wasn't returning cpu_all_mask for busses without any
affinity on x86. This was a consequence of cpumask_of_pcibus() being
forced to rely on pcibus_to_node() since there is no other abstraction
available. For busses without affinity to any specific cpus, the
implementation had relied on returning the mapping's default node of -1 to
represent all cpus. That type of complexity could easily be avoided if
the bus was isolated into its own locality and the mapping to all cpu
localities was of local distance.

2009-09-22 09:30:49

by Heiko Carstens

[permalink] [raw]
Subject: Re: [PATCH 3/3] slqb: Allow SLQB to be used on PPC

On Mon, Sep 21, 2009 at 05:10:26PM +0100, Mel Gorman wrote:
> SLQB was disabled on PPC as it would stab itself in the face when running
> on machines with CPUs on memoryless nodes. As those configurations should
> now work, allow SLQB to be configured again on PPC.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> init/Kconfig | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/init/Kconfig b/init/Kconfig
> index adc10ab..8f55fde 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1033,7 +1033,7 @@ config SLUB
>
> config SLQB
> bool "SLQB (Queued allocator)"
> - depends on !PPC && !S390
> + depends on !S390

You can remove S390 from the list independently of this patch set.
As already mentioned, SLQB works again on s390 and whatever caused the
bug I reported a few weeks back is gone.

2009-09-22 09:32:18

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/3] powerpc: Allocate per-cpu areas for node IDs for SLQB to use as per-node areas

On Tue, Sep 22, 2009 at 09:01:55AM +0900, Tejun Heo wrote:
> Mel Gorman wrote:
> > diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> > index 1f68160..a5f52d4 100644
> > --- a/arch/powerpc/kernel/setup_64.c
> > +++ b/arch/powerpc/kernel/setup_64.c
> > @@ -588,6 +588,26 @@ void __init setup_per_cpu_areas(void)
> > paca[i].data_offset = ptr - __per_cpu_start;
> > memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
> > }
> > +#ifdef CONFIG_SLQB
> > + /*
> > + * SLQB abuses DEFINE_PER_CPU to setup a per-node area. This trick
> > > + * assumes that every node ID will have a CPU of that ID to match.
> > + * On systems with memoryless nodes, this may not hold true. Hence,
> > + * we take a second pass initialising a "per-cpu" area for node-ids
> > + * that SLQB can use
> > + */
> > + for_each_node_state(i, N_NORMAL_MEMORY) {
> > +
> > + /* Skip node IDs that a valid CPU id exists for */
> > + if (paca[i].data_offset)
> > + continue;
> > +
> > + ptr = alloc_bootmem_pages_node(NODE_DATA(cpu_to_node(i)), size);
> > +
> > + paca[i].data_offset = ptr - __per_cpu_start;
> > + memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
> > + }
> > +#endif /* CONFIG_SLQB */
> > }
> > #endif
>
> Eh... I don't know. This seems too hacky to me.

I've come around to this opinion as well. There are probably too many
other architectures and corners that get this wrong, and a more
fundamental fix is needed for SLQB to be able to use this hack.

> Why not just
> allocate a pointer array of MAX_NUMNODES and allocate per-node memory
> there?

That's what patch 1 from V1 did. I'll make it Patch 1 for V3.

> This will be slightly more expensive but I doubt it will be
> noticeable. The only extra overhead is the cacheline footprint for the
> extra array.
>

I'll compare the vmlinux binaries to quantify the exact penalty but basically,
I don't think it can be avoided at this point.

Thanks.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-09-22 09:32:44

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 3/3] slqb: Allow SLQB to be used on PPC

On Tue, Sep 22, 2009 at 11:30:23AM +0200, Heiko Carstens wrote:
> On Mon, Sep 21, 2009 at 05:10:26PM +0100, Mel Gorman wrote:
> > SLQB was disabled on PPC as it would stab itself in the face when running
> > on machines with CPUs on memoryless nodes. As those configurations should
> > now work, allow SLQB to be configured again on PPC.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > init/Kconfig | 2 +-
> > 1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/init/Kconfig b/init/Kconfig
> > index adc10ab..8f55fde 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -1033,7 +1033,7 @@ config SLUB
> >
> > config SLQB
> > bool "SLQB (Queued allocator)"
> > - depends on !PPC && !S390
> > + depends on !S390
>
> You can remove S390 from the list independently of this patch set.
> As already mentioned, SLQB works again on s390 and whatever caused the
> bug I reported a few weeks back is gone.
>

Nice one. Thanks.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-09-22 10:05:35

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

On Mon, Sep 21, 2009 at 02:17:40PM -0400, Christoph Lameter wrote:
> On Mon, 21 Sep 2009, Mel Gorman wrote:
> > Can you spot if there is something fundamentally wrong with patch 2? I.e. what
> > is wrong with treating the closest node as local instead of only the
> > truly local node?
>
> Depends on the way locking is done for percpu queues (likely lockless).
> A misidentification of the numa locality of an object may result in locks
> not being taken that should have been taken.
>

Ok, I'll continue looking from that perspective and see what comes out.
I've spotted a few possible anomalies which I'll stick into a separate
patch.

> > > Or just allow SLQB for !NUMA configurations and merge it now.
> > >
> >
> > Forcing SLQB !NUMA will not rattle out any existing list issues
> > unfortunately :(.
>
> But it will make SLQB work right in permitted configurations. The NUMA
> issues can then be fixed later upstream.
>

I'm going to punt the decision on this one to Pekka or Nick. My feeling
is to leave it enabled for NUMA so it can be identified if it gets fixed
for some other reason - e.g. if the stalls are due to a per-cpu problem, as
stated by Sachin, and SLQB happens to exacerbate the problem.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-09-22 10:07:02

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

On Tue, Sep 22, 2009 at 10:33:11AM +0530, Sachin Sant wrote:
> Mel Gorman wrote:
>> On Mon, Sep 21, 2009 at 01:54:12PM -0400, Christoph Lameter wrote:
>>
>>> Let's just keep SLQB back until the basic issues with memoryless nodes are
>>> resolved.
>>>
>>
>> It's not even super-clear that the memoryless nodes issues are entirely
>> related to SLQB. Sachin for example says that there was a stall issue
>> with memoryless nodes that could be triggered without SLQB. Sachin, is
>> that still accurate?
>>
>
> I think there are two different problems that we are dealing with.
>

Agreed.

> The first one is SLQB not working on a ppc64 box, which seems to be specific
> to only one machine; I haven't seen it on other Power boxes. The patches
> that you have posted seem to allow the box to boot, but eventually it hits
> the stall issue (related to the percpu dynamic allocator not working on
> ppc64), which is the second problem we are dealing with.
>
> The stall issue seems to be much more critical, as it is affecting almost
> all of the Power boxes that I have tested (4 in all).
> This issue is seen with Linus' tree as well and was first seen with
> 2.6.31-git5 (0cb583fd..)
>
> The stall issue was reported here:
> http://lists.ozlabs.org/pipermail/linuxppc-dev/2009-September/075791.html
>

Because of this, I'd like to try and get SLQB as far as it can go for the
moment and then treat this problem from scratch. I'd rather not treat
multiple different bugs as if they were one problem when they probably
are not.

Thanks

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-09-22 10:21:17

by Pekka Enberg

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

Hi Mel,

On Tue, 2009-09-22 at 11:05 +0100, Mel Gorman wrote:
> I'm going to punt the decision on this one to Pekka or Nick. My feeling
> is to leave it enabled for NUMA so it can be identified if it gets fixed
> for some other reason - e.g. if the stalls are due to a per-cpu problem, as
> stated by Sachin, and SLQB happens to exacerbate the problem.

Can I have a tested patch that uses MAX_NUMNODES to allocate the
structs, please? We can convert SLQB over to the per-cpu allocator if the
memoryless node issue is resolved.

Pekka

2009-09-22 10:24:12

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

On Tue, Sep 22, 2009 at 01:21:15PM +0300, Pekka Enberg wrote:
> Hi Mel,
>
> On Tue, 2009-09-22 at 11:05 +0100, Mel Gorman wrote:
> > I'm going to punt the decision on this one to Pekka or Nick. My feeling
> > is to leave it enabled for NUMA so it can be identified if it gets fixed
> > for some other reason - e.g. if the stalls are due to a per-cpu problem, as
> > stated by Sachin, and SLQB happens to exacerbate the problem.
>
> Can I have a tested patch that uses MAX_NUMNODES to allocate the
> structs, please? We can convert SLQB over to the per-cpu allocator if the
> memoryless node issue is resolved.
>

Patch set in the process of testing. Should have something in a few
hours.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-09-22 12:55:42

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

On Tue, Sep 22, 2009 at 10:33:11AM +0530, Sachin Sant wrote:
> Mel Gorman wrote:
>> On Mon, Sep 21, 2009 at 01:54:12PM -0400, Christoph Lameter wrote:
>>
>>> Let's just keep SLQB back until the basic issues with memoryless nodes are
>>> resolved.
>>>
>>
>> It's not even super-clear that the memoryless nodes issues are entirely
>> related to SLQB. Sachin for example says that there was a stall issue
>> with memoryless nodes that could be triggered without SLQB. Sachin, is
>> that still accurate?
>>
> I think there are two different problems that we are dealing with.
>
> The first one is SLQB not working on a ppc64 box, which seems to be specific
> to only one machine; I haven't seen it on other Power boxes. The patches
> that you have posted seem to allow the box to boot, but eventually it hits
> the stall issue (related to the percpu dynamic allocator not working on
> ppc64), which is the second problem we are dealing with.
>

Ok, I've sent out V3 of this. It's only a partial fix but it's about as
far as it can be brought until the other difficulties are resolved.

> The stall issue seems to be much more critical, as it is affecting almost
> all of the Power boxes that I have tested (4 in all).
> This issue is seen with Linus' tree as well and was first seen with
> 2.6.31-git5 (0cb583fd..)
>
> The stall issue was reported here:
> http://lists.ozlabs.org/pipermail/linuxppc-dev/2009-September/075791.html
>

Can you bisect this please?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-09-22 13:05:11

by Sachin Sant

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

Mel Gorman wrote:
> On Tue, Sep 22, 2009 at 10:33:11AM +0530, Sachin Sant wrote:
>
>> Mel Gorman wrote:
>>
>>> On Mon, Sep 21, 2009 at 01:54:12PM -0400, Christoph Lameter wrote:
>>>
>>>
>>>> Let's just keep SLQB back until the basic issues with memoryless nodes are
>>>> resolved.
>>>>
>>>>
>>> It's not even super-clear that the memoryless nodes issues are entirely
>>> related to SLQB. Sachin for example says that there was a stall issue
>>> with memoryless nodes that could be triggered without SLQB. Sachin, is
>>> that still accurate?
>>>
>>>
>> I think there are two different problems that we are dealing with.
>>
>> The first one is SLQB not working on a ppc64 box, which seems to be specific
>> to only one machine; I haven't seen it on other Power boxes. The patches
>> that you have posted seem to allow the box to boot, but eventually it hits
>> the stall issue (related to the percpu dynamic allocator not working on
>> ppc64), which is the second problem we are dealing with.
>>
>>
>
> Ok, I've sent out V3 of this. It's only a partial fix but it's about as
> far as it can be brought until the other difficulties are resolved.
>
Thanks Mel.

>
>> The stall issue seems to be much more critical, as it is affecting almost
>> all of the Power boxes that I have tested (4 in all).
>> This issue is seen with Linus' tree as well and was first seen with
>> 2.6.31-git5 (0cb583fd..)
>>
>> The stall issue was reported here:
>> http://lists.ozlabs.org/pipermail/linuxppc-dev/2009-September/075791.html
>>
>>
>
> Can you bisect this please?
>
The problem seems to have been introduced with
commit ada3fa15057205b7d3f727bba5cd26b5912e350f.

Specifically this patch:
powerpc64: convert to dynamic percpu allocator

If I revert this patch I am able to boot the latest git
on a powerpc box.

Thanks
-Sachin

--

---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------

2009-09-22 13:20:15

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

On Tue, Sep 22, 2009 at 06:35:05PM +0530, Sachin Sant wrote:
> Mel Gorman wrote:
>> On Tue, Sep 22, 2009 at 10:33:11AM +0530, Sachin Sant wrote:
>>
>>> Mel Gorman wrote:
>>>
>>>> On Mon, Sep 21, 2009 at 01:54:12PM -0400, Christoph Lameter wrote:
>>>>
>>>>> Let's just keep SLQB back until the basic issues with memoryless nodes are
>>>>> resolved.
>>>>>
>>>> It's not even super-clear that the memoryless nodes issues are entirely
>>>> related to SLQB. Sachin for example says that there was a stall issue
>>>> with memoryless nodes that could be triggered without SLQB. Sachin, is
>>>> that still accurate?
>>>>
>>> I think there are two different problems that we are dealing with.
>>>
>>> The first one is SLQB not working on a ppc64 box, which seems to be specific
>>> to only one machine; I haven't seen it on other Power boxes. The patches
>>> that you have posted seem to allow the box to boot, but eventually it hits
>>> the stall issue (related to the percpu dynamic allocator not working on
>>> ppc64), which is the second problem we are dealing with.
>>>
>>>
>>
>> Ok, I've sent out V3 of this. It's only a partial fix but it's about as
>> far as it can be brought until the other difficulties are resolved.
>>
> Thanks Mel.
>
>>
>>> The stall issue seems to be much more critical, as it is affecting almost
>>> all of the Power boxes that I have tested (4 in all).
>>> This issue is seen with Linus' tree as well and was first seen with
>>> 2.6.31-git5 (0cb583fd..)
>>>
>>> The stall issue was reported here:
>>> http://lists.ozlabs.org/pipermail/linuxppc-dev/2009-September/075791.html
>>>
>>>
>>
>> Can you bisect this please?
>>
> The problem seems to have been introduced with
> commit ada3fa15057205b7d3f727bba5cd26b5912e350f.
>
> Specifically this patch: powerpc64: convert to dynamic percpu allocator
>

That is commit c2a7e818019f20a5cf7fb26a6eb59e212e6c0cd8.

> If I revert this patch I am able to boot the latest git
> on a powerpc box.
>

Ok, this is then independent of the corruption I was seeing, which I now
believe lies entirely within SLQB's handling of remote nodes. My tree was
based on 2.6.31 with SLQB applied on top; this patch went in after the
baseline I was working from.

Tejun, have you looked at that stall problem? Sorry if you have already,
I haven't dug around for related threads.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-09-22 13:38:12

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

On Tue, Sep 22, 2009 at 09:29:44PM +0800, ???? wrote:
> Dear all,
> I want to ask a question about kernel customization. Our product is an
> embedded system, such as an ADSL modem (home gateway), and we use Linux
> 2.6.22.15. Our current kernel config builds an image of 800KB. How can I
> change the config to reduce the kernel size? I want to get a smaller
> size, like 500KB.
> But our product is a network device, so some network protocols cannot be
> removed from the kernel. Below is our config; can you all give me
> suggestions? Thanks very much!
>

*blinks*

This is not a sensible thread to ask the question on.

I haven't looked at your config to see how it might be stripped down, but
I would suggest using scripts/bloat-o-meter to get a per-subsystem
breakdown of what is making up the bulk of your kernel image.
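
For example, running "scripts/bloat-o-meter vmlinux.old vmlinux.new" against
two builds lists which symbols were added, removed, grew or shrank, and the
net change in size.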

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-09-22 15:26:44

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

On Tue, Sep 22, 2009 at 10:00:03AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2009-09-21 at 17:10 +0100, Mel Gorman wrote:
> >
> > It needs a sign-off from the powerpc side because it's now potentially
> > allocating more memory (Ben?). An alternative to this patch is in V1 that
> > statically declares the per-node structures, but this is potentially
> > sub-optimal from a performance and memory utilisation perspective.
>
> So if I understand correctly, we have a problem with both cpu-less and
> memory-less nodes. Interesting setups :-)
>

Memoryless anyway, because of the per-node trick. I'm not aware of
cpuless problems, but I suspect cpuless nodes exist on more than ppc64.

> I have no strong objection on the allocating of the per-cpu data for
> the cpu-less nodes. However, I wonder if we should do that a bit more
> nicely, maybe with some kind of "adjusted" cpu_possible_mask() (could be
> something like cpu_node_valid_mask or similar) to be used by percpu.
>

That would be reasonable if per-node data becomes more popular, although
with only SLQB depending on per-node data, it's hard to justify unless
SLQB *really* benefits from per-node.

Lumping the per-cpu allocator to cover per-cpu and per-node feels a bit
confusing. Maybe it would have been easier if there simply were never
memoryless nodes and cpus were always mapped to their closest, instead of
their local, node. There likely would be a few corner cases though and memory
hotplug would add to the mess. I haven't been able to decide on a sensible
way forward that doesn't involve a number of invasive changes.

> Mostly because it would be nice to have built-in debug features in
> per-cpu and in that case, it would need some way to know a valid
> number from an invalid one). Either that or just keep track of the
> mask of cpus that had percpu data allocated to them
>

The former would be nice; the latter would be a requirement if per-node data
was pursued.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-09-22 17:31:56

by David Rientjes

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

On Tue, 22 Sep 2009, Mel Gorman wrote:

> Lumping the per-cpu allocator to cover per-cpu and per-node feels a bit
> confusing. Maybe it would have been easier if there simply were never
> memoryless nodes and cpus were always mapped to their closest, instead of
> their local, node. There likely would be a few corner cases though and memory
> hotplug would add to the mess. I haven't been able to decide on a sensible
> way forward that doesn't involve a number of invasive changes.
>

I strongly agree with removing memoryless node support from the kernel,
but I don't think we should substitute it with a multiple cpu binding
approach because it doesn't export the true physical topology of the
system.

If we treat all cpus equally with respect to a region of memory when one
happens to be more remote, userspace can no longer use sched_setaffinity()
for latency-sensitive apps, nor can it correctly interleave with other
nodes. Reduced to its simplest form, a machine that now has a single node
because memoryless nodes have been obsoleted would incorrectly report that
all cpus are true siblings.

While memoryless nodes are inconvenient for the implementation, they do
have the benefit of being able to represent the actual physical topology;
binding cpus to their nearest node, even though it may not be local,
would not.

It's important to accurately represent the physical topology, and that can
be done with the device class abstraction model as I described in
http://lkml.org/lkml/2009/9/22/97, in which a node is only a locality of a
particular type.

2009-09-22 23:20:32

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Fix SLQB on memoryless configurations V2

On Tue, 22 Sep 2009, ?? wrote:

> 800KB. How can I change the config to reduce the kernel size? I want to
> get a smaller size, like 500KB.


500KB? That may be a tough call.

> CONFIG_NETFILTER=y

Can you drop this one?

Is CONFIG_EMBEDDED set? Maybe I skipped it.