2015-07-02 23:02:12

by Nishanth Aravamudan

Subject: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot

Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we
have an ordering issue during boot with early calls to cpu_to_node().
The value returned by those calls now depends on the per-cpu area being
set up, but that is not guaranteed to be the case during boot. Instead,
we need to add an early_cpu_to_node() which doesn't use the per-CPU area
and call that from certain spots that are known to invoke cpu_to_node()
before the per-CPU areas are configured.

On an example 2-node NUMA system with the following topology:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 2029 MB
node 0 free: 1753 MB
node 1 cpus: 4 5 6 7
node 1 size: 2045 MB
node 1 free: 1945 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

we currently emit at boot:

[ 0.000000] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7

After this commit, we correctly emit:

[ 0.000000] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7

Signed-off-by: Nishanth Aravamudan <[email protected]>

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 5f1048e..f2c4c89 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -39,6 +39,8 @@ static inline int pcibus_to_node(struct pci_bus *bus)
extern int __node_distance(int, int);
#define node_distance(a, b) __node_distance(a, b)

+extern int early_cpu_to_node(int);
+
extern void __init dump_numa_cpu_topology(void);

extern int sysfs_add_device_to_node(struct device *dev, int nid);
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index c69671c..23a2cf3 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -715,8 +715,8 @@ void __init setup_arch(char **cmdline_p)

static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align)
{
- return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size, align,
- __pa(MAX_DMA_ADDRESS));
+ return __alloc_bootmem_node(NODE_DATA(early_cpu_to_node(cpu)), size,
+ align, __pa(MAX_DMA_ADDRESS));
}

static void __init pcpu_fc_free(void *ptr, size_t size)
@@ -726,7 +726,7 @@ static void __init pcpu_fc_free(void *ptr, size_t size)

static int pcpu_cpu_distance(unsigned int from, unsigned int to)
{
- if (cpu_to_node(from) == cpu_to_node(to))
+ if (early_cpu_to_node(from) == early_cpu_to_node(to))
return LOCAL_DISTANCE;
else
return REMOTE_DISTANCE;
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 5e80621..9ffabf4 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -157,6 +157,11 @@ static void map_cpu_to_node(int cpu, int node)
cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
}

+int early_cpu_to_node(int cpu)
+{
+ return numa_cpu_lookup_table[cpu];
+}
+
#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
static void unmap_cpu_from_node(unsigned long cpu)
{


2015-07-02 23:03:45

by Nishanth Aravamudan

Subject: [RFC PATCH 2/2] powerpc/smp: use early_cpu_to_node() instead of direct references to numa_cpu_lookup_table

A simple move to a wrapper function around numa_cpu_lookup_table, now
that powerpc has the early_cpu_to_node() API.

Signed-off-by: Nishanth Aravamudan <[email protected]>

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ec9ec20..7bf333b 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -381,9 +381,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
* numa_node_id() works after this.
*/
if (cpu_present(cpu)) {
- set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]);
+ set_cpu_numa_node(cpu, early_cpu_to_node(cpu));
set_cpu_numa_mem(cpu,
- local_memory_node(numa_cpu_lookup_table[cpu]));
+ local_memory_node(early_cpu_to_node(cpu)));
}
}

@@ -400,7 +400,7 @@ void smp_prepare_boot_cpu(void)
#ifdef CONFIG_PPC64
paca[boot_cpuid].__current = current;
#endif
- set_numa_node(numa_cpu_lookup_table[boot_cpuid]);
+ set_numa_node(early_cpu_to_node(boot_cpuid));
current_set[boot_cpuid] = task_thread_info(current);
}

2015-07-08 04:01:06

by Michael Ellerman

Subject: Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot

On Thu, 2015-07-02 at 23:02:02 UTC, Nishanth Aravamudan wrote:
> Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we
> have an ordering issue during boot with early calls to cpu_to_node().

"now that .." implies we changed something and broke this. What commit was
it that changed the behaviour?

> The value returned by those calls now depends on the per-cpu area being
> set up, but that is not guaranteed to be the case during boot. Instead,
> we need to add an early_cpu_to_node() which doesn't use the per-CPU area
> and call that from certain spots that are known to invoke cpu_to_node()
> before the per-CPU areas are configured.
>
> On an example 2-node NUMA system with the following topology:
>
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3
> node 0 size: 2029 MB
> node 0 free: 1753 MB
> node 1 cpus: 4 5 6 7
> node 1 size: 2045 MB
> node 1 free: 1945 MB
> node distances:
> node 0 1
> 0: 10 40
> 1: 40 10
>
> we currently emit at boot:
>
> [ 0.000000] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7
>
> After this commit, we correctly emit:
>
> [ 0.000000] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7


So it looks fairly sane, and I guess it's a bug fix.

But I'm a bit reluctant to put it in straight away without some time in next.

It looks like the symptom is that the per-cpu areas are all allocated on node
0, is that all that goes wrong?

cheers

2015-07-08 23:16:46

by Nishanth Aravamudan

Subject: Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot

On 08.07.2015 [14:00:56 +1000], Michael Ellerman wrote:
> On Thu, 2015-07-02 at 23:02:02 UTC, Nishanth Aravamudan wrote:
> > Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we
> > have an ordering issue during boot with early calls to cpu_to_node().
>
> "now that .." implies we changed something and broke this. What commit was
> it that changed the behaviour?

Well, that's something I'm still trying to unearth. In the commits
before and after adding USE_PERCPU_NUMA_NODE_ID (8c272261194d
"powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"), the dmesg reports:

pcpu-alloc: [0] 0 1 2 3 4 5 6 7

At least prior to 8c272261194d, this might have been due to the old
powerpc-specific cpu_to_node():

static inline int cpu_to_node(int cpu)
{
int nid;

nid = numa_cpu_lookup_table[cpu];

/*
 * During early boot, the numa-cpu lookup table might not have been
 * setup for all CPUs yet. In such cases, default to node 0.
 */
return (nid < 0) ? 0 : nid;
}

which might imply that no one cares or that simply no one noticed.

> > The value returned by those calls now depends on the per-cpu area being
> > set up, but that is not guaranteed to be the case during boot. Instead,
> > we need to add an early_cpu_to_node() which doesn't use the per-CPU area
> > and call that from certain spots that are known to invoke cpu_to_node()
> > before the per-CPU areas are configured.
> >
> > On an example 2-node NUMA system with the following topology:
> >
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3
> > node 0 size: 2029 MB
> > node 0 free: 1753 MB
> > node 1 cpus: 4 5 6 7
> > node 1 size: 2045 MB
> > node 1 free: 1945 MB
> > node distances:
> > node 0 1
> > 0: 10 40
> > 1: 40 10
> >
> > we currently emit at boot:
> >
> > [ 0.000000] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7
> >
> > After this commit, we correctly emit:
> >
> > [ 0.000000] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7
>
>
> So it looks fairly sane, and I guess it's a bug fix.
>
> But I'm a bit reluctant to put it in straight away without some time in next.

I'm fine with that -- it could use some more extensive testing,
admittedly (I only have been able to verify the pcpu areas are being
correctly allocated on the right node so far).

I still need to test with hotplug and things like that. Hence the RFC.

> It looks like the symptom is that the per-cpu areas are all allocated on node
> 0, is that all that goes wrong?

Yes, that's the symptom. I cc'd a few folks to see if they could help
indicate the performance implications of such a setup -- sorry, I should
have been more explicit about that.

Thanks,
Nish

2015-07-09 01:22:14

by David Rientjes

Subject: Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot

On Thu, 2 Jul 2015, Nishanth Aravamudan wrote:

> Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we
> have an ordering issue during boot with early calls to cpu_to_node().
> The value returned by those calls now depends on the per-cpu area being
> set up, but that is not guaranteed to be the case during boot. Instead,
> we need to add an early_cpu_to_node() which doesn't use the per-CPU area
> and call that from certain spots that are known to invoke cpu_to_node()
> before the per-CPU areas are configured.
>
> On an example 2-node NUMA system with the following topology:
>
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3
> node 0 size: 2029 MB
> node 0 free: 1753 MB
> node 1 cpus: 4 5 6 7
> node 1 size: 2045 MB
> node 1 free: 1945 MB
> node distances:
> node 0 1
> 0: 10 40
> 1: 40 10
>
> we currently emit at boot:
>
> [ 0.000000] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7
>
> After this commit, we correctly emit:
>
> [ 0.000000] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7
>
> Signed-off-by: Nishanth Aravamudan <[email protected]>
>
> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
> index 5f1048e..f2c4c89 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -39,6 +39,8 @@ static inline int pcibus_to_node(struct pci_bus *bus)
> extern int __node_distance(int, int);
> #define node_distance(a, b) __node_distance(a, b)
>
> +extern int early_cpu_to_node(int);
> +
> extern void __init dump_numa_cpu_topology(void);
>
> extern int sysfs_add_device_to_node(struct device *dev, int nid);
> diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> index c69671c..23a2cf3 100644
> --- a/arch/powerpc/kernel/setup_64.c
> +++ b/arch/powerpc/kernel/setup_64.c
> @@ -715,8 +715,8 @@ void __init setup_arch(char **cmdline_p)
>
> static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align)
> {
> - return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size, align,
> - __pa(MAX_DMA_ADDRESS));
> + return __alloc_bootmem_node(NODE_DATA(early_cpu_to_node(cpu)), size,
> + align, __pa(MAX_DMA_ADDRESS));
> }
>
> static void __init pcpu_fc_free(void *ptr, size_t size)
> @@ -726,7 +726,7 @@ static void __init pcpu_fc_free(void *ptr, size_t size)
>
> static int pcpu_cpu_distance(unsigned int from, unsigned int to)
> {
> - if (cpu_to_node(from) == cpu_to_node(to))
> + if (early_cpu_to_node(from) == early_cpu_to_node(to))
> return LOCAL_DISTANCE;
> else
> return REMOTE_DISTANCE;
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 5e80621..9ffabf4 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -157,6 +157,11 @@ static void map_cpu_to_node(int cpu, int node)
> cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
> }
>
> +int early_cpu_to_node(int cpu)
> +{
> + return numa_cpu_lookup_table[cpu];
> +}
> +
> #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
> static void unmap_cpu_from_node(unsigned long cpu)
> {
>
>

early_cpu_to_node() looks like it's begging to be __init since we
shouldn't have a need to reference numa_cpu_lookup_table after boot, and
that appears like it can be done if pcpu_cpu_distance() is made __init in
this patch and smp_prepare_boot_cpu() is made __init in the next patch.
So I think this is fine, but those functions and things like
reset_numa_cpu_lookup_table() should be in init.text.

After the percpu areas are initialized and cpu_to_node() is correct, it
would be really nice to be able to make numa_cpu_lookup_table[] be
__initdata since it shouldn't be necessary anymore. That probably has cpu
callbacks that need to be modified to no longer look at
numa_cpu_lookup_table[] or pass the value in, but it would make it much
cleaner. Then nobody will have to worry about figuring out whether
early_cpu_to_node() or cpu_to_node() is the right one to call.

2015-07-09 01:24:49

by David Rientjes

Subject: Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot

On Wed, 8 Jul 2015, Nishanth Aravamudan wrote:

> > It looks like the symptom is that the per-cpu areas are all allocated on node
> > 0, is that all that goes wrong?
>
> Yes, that's the symptom. I cc'd a few folks to see if they could help
> indicate the performance implications of such a setup -- sorry, I should
> have been more explicit about that.
>

Yeah, not sure it's really a bugfix but rather a performance optimization
since cpu_to_node() with CONFIG_USE_PERCPU_NUMA_NODE_ID is only about
performance.

2015-07-09 01:25:21

by David Rientjes

Subject: Re: [RFC PATCH 2/2] powerpc/smp: use early_cpu_to_node() instead of direct references to numa_cpu_lookup_table

On Thu, 2 Jul 2015, Nishanth Aravamudan wrote:

> A simple move to a wrapper function around numa_cpu_lookup_table, now
> that powerpc has the early_cpu_to_node() API.
>
> Signed-off-by: Nishanth Aravamudan <[email protected]>

When early_cpu_to_node() is __init:

Acked-by: David Rientjes <[email protected]>

2015-07-10 16:16:00

by Nishanth Aravamudan

Subject: Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot

On 08.07.2015 [16:16:23 -0700], Nishanth Aravamudan wrote:
> On 08.07.2015 [14:00:56 +1000], Michael Ellerman wrote:
> > On Thu, 2015-07-02 at 23:02:02 UTC, Nishanth Aravamudan wrote:
> > > Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we
> > > have an ordering issue during boot with early calls to cpu_to_node().
> >
> > "now that .." implies we changed something and broke this. What commit was
> > it that changed the behaviour?
>
> Well, that's something I'm trying to still unearth. In the commits
> before and after adding USE_PERCPU_NUMA_NODE_ID (8c272261194d
> "powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"), the dmesg reports:
>
> pcpu-alloc: [0] 0 1 2 3 4 5 6 7

Ok, I did a bisection, and it seems like prior to commit
1a4d76076cda69b0abf15463a8cebc172406da25 ("percpu: implement
asynchronous chunk population"), we emitted the above, e.g.:

pcpu-alloc: [0] 0 1 2 3 4 5 6 7

And after that commit, we emitted:

pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7

I'm not exactly sure why that changed, but I'm still
reading/understanding the commit. Tejun might be able to explain.

Tejun, for reference, I noticed on Power systems since the
above-mentioned commit, pcpu-alloc is not reflecting the topology of the
system correctly -- that is, the pcpu areas are all on node 0
unconditionally (based upon pcpu-alloc's output). Prior to that, there
was just one group, it seems like, which completely ignored the NUMA
topology.

Is this just an ordering thing that changed with the introduction of the
async code?

Thanks,
Nish

2015-07-10 16:25:45

by Nishanth Aravamudan

Subject: Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot

On 08.07.2015 [18:22:09 -0700], David Rientjes wrote:
> On Thu, 2 Jul 2015, Nishanth Aravamudan wrote:
>
> > Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we
> > have an ordering issue during boot with early calls to cpu_to_node().
> > The value returned by those calls now depends on the per-cpu area being
> > set up, but that is not guaranteed to be the case during boot. Instead,
> > we need to add an early_cpu_to_node() which doesn't use the per-CPU area
> > and call that from certain spots that are known to invoke cpu_to_node()
> > before the per-CPU areas are configured.
> >
> > On an example 2-node NUMA system with the following topology:
> >
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3
> > node 0 size: 2029 MB
> > node 0 free: 1753 MB
> > node 1 cpus: 4 5 6 7
> > node 1 size: 2045 MB
> > node 1 free: 1945 MB
> > node distances:
> > node 0 1
> > 0: 10 40
> > 1: 40 10
> >
> > we currently emit at boot:
> >
> > [ 0.000000] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7
> >
> > After this commit, we correctly emit:
> >
> > [ 0.000000] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7
> >
> > Signed-off-by: Nishanth Aravamudan <[email protected]>
> >
> > diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
> > index 5f1048e..f2c4c89 100644
> > --- a/arch/powerpc/include/asm/topology.h
> > +++ b/arch/powerpc/include/asm/topology.h
> > @@ -39,6 +39,8 @@ static inline int pcibus_to_node(struct pci_bus *bus)
> > extern int __node_distance(int, int);
> > #define node_distance(a, b) __node_distance(a, b)
> >
> > +extern int early_cpu_to_node(int);
> > +
> > extern void __init dump_numa_cpu_topology(void);
> >
> > extern int sysfs_add_device_to_node(struct device *dev, int nid);
> > diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> > index c69671c..23a2cf3 100644
> > --- a/arch/powerpc/kernel/setup_64.c
> > +++ b/arch/powerpc/kernel/setup_64.c
> > @@ -715,8 +715,8 @@ void __init setup_arch(char **cmdline_p)
> >
> > static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align)
> > {
> > - return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size, align,
> > - __pa(MAX_DMA_ADDRESS));
> > + return __alloc_bootmem_node(NODE_DATA(early_cpu_to_node(cpu)), size,
> > + align, __pa(MAX_DMA_ADDRESS));
> > }
> >
> > static void __init pcpu_fc_free(void *ptr, size_t size)
> > @@ -726,7 +726,7 @@ static void __init pcpu_fc_free(void *ptr, size_t size)
> >
> > static int pcpu_cpu_distance(unsigned int from, unsigned int to)
> > {
> > - if (cpu_to_node(from) == cpu_to_node(to))
> > + if (early_cpu_to_node(from) == early_cpu_to_node(to))
> > return LOCAL_DISTANCE;
> > else
> > return REMOTE_DISTANCE;
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 5e80621..9ffabf4 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -157,6 +157,11 @@ static void map_cpu_to_node(int cpu, int node)
> > cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
> > }
> >
> > +int early_cpu_to_node(int cpu)
> > +{
> > + return numa_cpu_lookup_table[cpu];
> > +}
> > +
> > #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
> > static void unmap_cpu_from_node(unsigned long cpu)
> > {
> >
> >
>
> early_cpu_to_node() looks like it's begging to be __init since we
> shouldn't have a need to reference numa_cpu_lookup_table after boot, and
> that appears like it can be done if pcpu_cpu_distance() is made __init in
> this patch and smp_prepare_boot_cpu() is made __init in the next patch.
> So I think this is fine, but those functions and things like
> reset_numa_cpu_lookup_table() should be in init.text.

Yep, that makes total sense!

> After the percpu areas are initialized and cpu_to_node() is correct, it
> would be really nice to be able to make numa_cpu_lookup_table[] be
> __initdata since it shouldn't be necessary anymore. That probably has cpu
> callbacks that need to be modified to no longer look at
> numa_cpu_lookup_table[] or pass the value in, but it would make it much
> cleaner. Then nobody will have to worry about figuring out whether
> early_cpu_to_node() or cpu_to_node() is the right one to call.

When I worked on the original pcpu patches for power, I wanted to do
this, but got myself confused and never came back to it. Thank you for
suggesting it and I'll work on it soon.

-Nish

2015-07-14 21:31:24

by David Rientjes

Subject: Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot

On Fri, 10 Jul 2015, Nishanth Aravamudan wrote:

> > After the percpu areas are initialized and cpu_to_node() is correct, it
> > would be really nice to be able to make numa_cpu_lookup_table[] be
> > __initdata since it shouldn't be necessary anymore. That probably has cpu
> > callbacks that need to be modified to no longer look at
> > numa_cpu_lookup_table[] or pass the value in, but it would make it much
> > cleaner. Then nobody will have to worry about figuring out whether
> > early_cpu_to_node() or cpu_to_node() is the right one to call.
>
> When I worked on the original pcpu patches for power, I wanted to do
> this, but got myself confused and never came back to it. Thank you for
> suggesting it and I'll work on it soon.
>

Great, thanks for taking it on! I have powerpc machines so I can test
this and try to help where possible.

2015-07-15 00:22:26

by Michael Ellerman

Subject: Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot

On Wed, 2015-07-08 at 16:16 -0700, Nishanth Aravamudan wrote:
> On 08.07.2015 [14:00:56 +1000], Michael Ellerman wrote:
> > On Thu, 2015-07-02 at 23:02:02 UTC, Nishanth Aravamudan wrote:
> > >
> > > we currently emit at boot:
> > >
> > > [ 0.000000] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7
> > >
> > > After this commit, we correctly emit:
> > >
> > > [ 0.000000] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7
> >
> >
> > So it looks fairly sane, and I guess it's a bug fix.
> >
> > But I'm a bit reluctant to put it in straight away without some time in next.
>
> I'm fine with that -- it could use some more extensive testing,
> admittedly (I only have been able to verify the pcpu areas are being
> correctly allocated on the right node so far).
>
> I still need to test with hotplug and things like that. Hence the RFC.
>
> > It looks like the symptom is that the per-cpu areas are all allocated on node
> > 0, is that all that goes wrong?
>
> Yes, that's the symptom. I cc'd a few folks to see if they could help
> indicate the performance implications of such a setup -- sorry, I should
> have been more explicit about that.

OK cool. I'm happy to put it in next if you send a non-RFC version.

cheers

2015-07-15 20:35:24

by Tejun Heo

Subject: Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot

Hello,

On Thu, Jul 02, 2015 at 04:02:02PM -0700, Nishanth Aravamudan wrote:
> we currently emit at boot:
>
> [ 0.000000] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7
>
> After this commit, we correctly emit:
>
> [ 0.000000] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7

JFYI, the numbers in the brackets aren't NUMA node numbers but percpu
allocation group numbers and they're not split according to nodes but
percpu allocation units. In both cases, there are two units each
serving 0-3 and 4-7. In the above case, because it wasn't being fed
the correct NUMA information, both got assigned to the same group. In
the latter, they got assigned to different ones but even then if the
group numbers match NUMA node numbers, that's just a coincidence.

Thanks.

--
tejun

2015-07-15 20:37:43

by Tejun Heo

Subject: Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot

Hello,

On Fri, Jul 10, 2015 at 09:15:47AM -0700, Nishanth Aravamudan wrote:
> On 08.07.2015 [16:16:23 -0700], Nishanth Aravamudan wrote:
> > On 08.07.2015 [14:00:56 +1000], Michael Ellerman wrote:
> > > On Thu, 2015-07-02 at 23:02:02 UTC, Nishanth Aravamudan wrote:
> > > > Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we
> > > > have an ordering issue during boot with early calls to cpu_to_node().
> > >
> > > "now that .." implies we changed something and broke this. What commit was
> > > it that changed the behaviour?
> >
> > Well, that's something I'm trying to still unearth. In the commits
> > before and after adding USE_PERCPU_NUMA_NODE_ID (8c272261194d
> > "powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"), the dmesg reports:
> >
> > pcpu-alloc: [0] 0 1 2 3 4 5 6 7
>
> Ok, I did a bisection, and it seems like prior to commit
> 1a4d76076cda69b0abf15463a8cebc172406da25 ("percpu: implement
> asynchronous chunk population"), we emitted the above, e.g.:
>
> pcpu-alloc: [0] 0 1 2 3 4 5 6 7
>
> And after that commit, we emitted:
>
> pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7
>
> I'm not exactly sure why that changed, but I'm still
> reading/understanding the commit. Tejun might be able to explain.
>
> Tejun, for reference, I noticed on Power systems since the
> above-mentioned commit, pcpu-alloc is not reflecting the topology of the
> system correctly -- that is, the pcpu areas are all on node 0
> unconditionally (based upon pcpu-alloc's output). Prior to that, there
> was just one group, it seems like, which completely ignored the NUMA
> topology.
>
> Is this just an ordering thing that changed with the introduction of the
> async code?

It's just each unit growing and the percpu allocator deciding to split
them into separate allocation units. Before, it was serving all cpus
in a single alloc unit, as they looked like they belonged to the same
NUMA node and were small enough to fit into one alloc unit. In the
latter case, the async one added more reserve space, so the allocator
decided to split them into two alloc units while assigning them to the
same group, as the NUMA info still wasn't there.

Thanks.

--
tejun

2015-07-15 22:43:58

by Nishanth Aravamudan

Subject: Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot

On 15.07.2015 [16:35:16 -0400], Tejun Heo wrote:
> Hello,
>
> On Thu, Jul 02, 2015 at 04:02:02PM -0700, Nishanth Aravamudan wrote:
> > we currently emit at boot:
> >
> > [ 0.000000] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7
> >
> > After this commit, we correctly emit:
> >
> > [ 0.000000] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7
>
> JFYI, the numbers in the brackets aren't NUMA node numbers but percpu
> allocation group numbers and they're not split according to nodes but
> percpu allocation units. In both cases, there are two units each
> serving 0-3 and 4-7. In the above case, because it wasn't being fed
> the correct NUMA information, both got assigned to the same group. In
> the latter, they got assigned to different ones but even then if the
> group numbers match NUMA node numbers, that's just a coincidence.

Ok, thank you for clarifying! From a correctness perspective, even if
the numbers don't match NUMA nodes, should we expect the grouping to be
split along NUMA topology?

-Nish

2015-07-15 22:48:03

by Tejun Heo

Subject: Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot

Hello,

On Wed, Jul 15, 2015 at 03:43:51PM -0700, Nishanth Aravamudan wrote:
> Ok, thank you for clarifying! From a correctness perspective, even if
> the numbers don't match NUMA nodes, should we expect the grouping to be
> split along NUMA topology?

Yeap, the groups get formed according to the node distances. Nodes
which are not at LOCAL_DISTANCE are always put in different groups.

Thanks.

--
tejun