2015-04-08 16:59:31

by Konstantin Khlebnikov

Subject: [PATCH] of: return NUMA_NO_NODE from fallback of_node_to_nid()

Node 0 might be offline, just like any other NUMA node;
in that case the kernel cannot handle the memory allocation and crashes.

Signed-off-by: Konstantin Khlebnikov <[email protected]>
Fixes: 0c3f061c195c ("of: implement of_node_to_nid as a weak function")
---
drivers/of/base.c | 2 +-
include/linux/of.h | 5 ++++-
2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/of/base.c b/drivers/of/base.c
index 8f165b112e03..51f4bd16e613 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -89,7 +89,7 @@ EXPORT_SYMBOL(of_n_size_cells);
#ifdef CONFIG_NUMA
int __weak of_node_to_nid(struct device_node *np)
{
- return numa_node_id();
+ return NUMA_NO_NODE;
}
#endif

diff --git a/include/linux/of.h b/include/linux/of.h
index dfde07e77a63..78a04ee85a9c 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -623,7 +623,10 @@ static inline const char *of_prop_next_string(struct property *prop,
#if defined(CONFIG_OF) && defined(CONFIG_NUMA)
extern int of_node_to_nid(struct device_node *np);
#else
-static inline int of_node_to_nid(struct device_node *device) { return 0; }
+static inline int of_node_to_nid(struct device_node *device)
+{
+ return NUMA_NO_NODE;
+}
#endif

static inline struct device_node *of_find_matching_node(


2015-04-08 17:04:15

by Konstantin Khlebnikov

Subject: Re: [PATCH] of: return NUMA_NO_NODE from fallback of_node_to_nid()

On 08.04.2015 19:59, Konstantin Khlebnikov wrote:
> Node 0 might be offline as well as any other numa node,
> in this case kernel cannot handle memory allocation and crashes.

Example:

[ 0.027133] ------------[ cut here ]------------
[ 0.027938] kernel BUG at include/linux/gfp.h:322!
[ 0.028000] invalid opcode: 0000 [#1] SMP
[ 0.028000] Modules linked in:
[ 0.028000] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.0.0-rc7 #12
[ 0.028000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org
04/01/2014
[ 0.028000] task: ffff88007d3f8000 ti: ffff88007d3dc000 task.ti:
ffff88007d3dc000
[ 0.028000] RIP: 0010:[<ffffffff8118574c>] [<ffffffff8118574c>]
new_slab+0x30c/0x3c0
[ 0.028000] RSP: 0000:ffff88007d3dfc28 EFLAGS: 00010246
[ 0.028000] RAX: 0000000000000000 RBX: ffff88007d001800 RCX:
0000000000000001
[ 0.028000] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
00000000002012d0
[ 0.028000] RBP: ffff88007d3dfc58 R08: 0000000000000000 R09:
0000000000000000
[ 0.028000] R10: 0000000000000001 R11: ffff88007d02fe40 R12:
00000000000000d0
[ 0.028000] R13: 00000000000000c0 R14: 0000000000000015 R15:
0000000000000000
[ 0.028000] FS: 0000000000000000(0000) GS:ffff88007fc00000(0000)
knlGS:0000000000000000
[ 0.028000] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 0.028000] CR2: 00000000ffffffff CR3: 0000000001e0e000 CR4:
00000000000006f0
[ 0.028000] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 0.028000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 0.028000] Stack:
[ 0.028000] 0000000000000000 ffff88007fc175d0 ffffea0001f40bc0
00000000000000c0
[ 0.028000] ffff88007d001800 00000000000080d0 ffff88007d3dfd48
ffffffff8192da27
[ 0.028000] 000000000000000d ffffffff81e27038 0000000000000000
0000000000000000
[ 0.028000] Call Trace:
[ 0.028000] [<ffffffff8192da27>] __slab_alloc+0x3df/0x55d
[ 0.028000] [<ffffffff8109a92b>] ? __lock_acquire+0xc1b/0x1f40
[ 0.028000] [<ffffffff810b1f2c>] ? __irq_domain_add+0x3c/0xe0
[ 0.028000] [<ffffffff810998f5>] ? trace_hardirqs_on_caller+0x105/0x1d0
[ 0.028000] [<ffffffff813631ab>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 0.028000] [<ffffffff811890ab>] __kmalloc_node+0xab/0x210
[ 0.028000] [<ffffffff810394df>] ? ioapic_read_entry+0x1f/0x50
[ 0.028000] [<ffffffff810b1f2c>] ? __irq_domain_add+0x3c/0xe0
[ 0.028000] [<ffffffff810b1f2c>] __irq_domain_add+0x3c/0xe0
[ 0.028000] [<ffffffff81039e0e>] mp_irqdomain_create+0x9e/0x120
[ 0.028000] [<ffffffff81f22d49>] setup_IO_APIC+0x6b/0x798
[ 0.028000] [<ffffffff810398a5>] ? clear_IO_APIC+0x45/0x70
[ 0.028000] [<ffffffff81f21f01>] apic_bsp_setup+0x87/0x96
[ 0.028000] [<ffffffff81f1fdb4>] native_smp_prepare_cpus+0x237/0x275
[ 0.028000] [<ffffffff81f131b7>] kernel_init_freeable+0x120/0x265
[ 0.028000] [<ffffffff819271f9>] ? kernel_init+0x9/0xf0
[ 0.028000] [<ffffffff819271f0>] ? rest_init+0x130/0x130
[ 0.028000] [<ffffffff819271f9>] kernel_init+0x9/0xf0
[ 0.028000] [<ffffffff8193b958>] ret_from_fork+0x58/0x90
[ 0.028000] [<ffffffff819271f0>] ? rest_init+0x130/0x130
[ 0.028000] Code: 6b b6 ff ff 49 89 c5 e9 ce fd ff ff 31 c0 90 e9 74
ff ff ff 49 c7 04 04 00 00 00 00 e9 05 ff ff ff 4c 89 e7 ff d0 e9 d9 fe
ff ff <0f> 0b 4c 8b 73 38 44 89 e7 81 cf 00 00 20 00 4c 89 f6 48 c1 ee
[ 0.028000] RIP [<ffffffff8118574c>] new_slab+0x30c/0x3c0
[ 0.028000] RSP <ffff88007d3dfc28>
[ 0.028039] ---[ end trace f03690e70d7e4be6 ]---


>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> Fixes: 0c3f061c195c ("of: implement of_node_to_nid as a weak function")
> ---
> drivers/of/base.c | 2 +-
> include/linux/of.h | 5 ++++-
> 2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index 8f165b112e03..51f4bd16e613 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -89,7 +89,7 @@ EXPORT_SYMBOL(of_n_size_cells);
> #ifdef CONFIG_NUMA
> int __weak of_node_to_nid(struct device_node *np)
> {
> - return numa_node_id();
> + return NUMA_NO_NODE;
> }
> #endif
>
> diff --git a/include/linux/of.h b/include/linux/of.h
> index dfde07e77a63..78a04ee85a9c 100644
> --- a/include/linux/of.h
> +++ b/include/linux/of.h
> @@ -623,7 +623,10 @@ static inline const char *of_prop_next_string(struct property *prop,
> #if defined(CONFIG_OF) && defined(CONFIG_NUMA)
> extern int of_node_to_nid(struct device_node *np);
> #else
> -static inline int of_node_to_nid(struct device_node *device) { return 0; }
> +static inline int of_node_to_nid(struct device_node *device)
> +{
> + return NUMA_NO_NODE;
> +}
> #endif
>
> static inline struct device_node *of_find_matching_node(
>


--
Konstantin

2015-04-08 23:07:51

by Nishanth Aravamudan

Subject: Re: [PATCH] of: return NUMA_NO_NODE from fallback of_node_to_nid()

On 08.04.2015 [20:04:04 +0300], Konstantin Khlebnikov wrote:
> On 08.04.2015 19:59, Konstantin Khlebnikov wrote:
> >Node 0 might be offline as well as any other numa node,
> >in this case kernel cannot handle memory allocation and crashes.

Isn't the bug that numa_node_id() returned an offline node? That
shouldn't happen.

#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
...
#ifndef numa_node_id
/* Returns the number of the current Node. */
static inline int numa_node_id(void)
{
return raw_cpu_read(numa_node);
}
#endif
...
#else /* !CONFIG_USE_PERCPU_NUMA_NODE_ID */

/* Returns the number of the current Node. */
#ifndef numa_node_id
static inline int numa_node_id(void)
{
return cpu_to_node(raw_smp_processor_id());
}
#endif
...

So that's either the per-cpu numa_node value, right? Or the result of
cpu_to_node on the current processor.

> Example:
>
> [ 0.027133] ------------[ cut here ]------------
> [ 0.027938] kernel BUG at include/linux/gfp.h:322!

This is

VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));

in

alloc_pages_exact_node().

And based on the trace below, that's

__slab_alloc -> alloc

alloc_pages_exact_node
<- alloc_slab_page
<- allocate_slab
<- new_slab
<- new_slab_objects
< __slab_alloc?

which is just passing the node value down, right? Which I think was
from:

domain = kzalloc_node(sizeof(*domain) + (sizeof(unsigned int) * size),
GFP_KERNEL, of_node_to_nid(of_node));

?
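
For context, that slub/gfp path in 4.0 looks roughly like this (simplified
sketch, not verbatim mainline code):

/* mm/slub.c (sketch): the node id is passed straight down */
static inline struct page *alloc_slab_page(struct kmem_cache *s, gfp_t flags,
		int node, struct kmem_cache_order_objects oo)
{
	if (node == NUMA_NO_NODE)
		return alloc_pages(flags, oo_order(oo));	/* "no preference" */
	return alloc_pages_exact_node(node, flags, oo_order(oo));
}

/* include/linux/gfp.h (sketch): an explicit but offline nid trips this */
static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
		unsigned int order)
{
	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}

So an explicit nid (like the 0 from the of.h fallback of_node_to_nid()) gets
checked against node_online(), while NUMA_NO_NODE never reaches that check.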


What platform is this on, looks to be x86? qemu emulation of a
pathological topology? What was the topology?

Note that there is a ton of code that seems to assume node 0 is online.
I started working on removing this assumption myself and it just led
down a rathole (on power, we always have node 0 online, even if it is
memoryless and cpuless, as a result).

I am guessing this is just happening early in boot before the per-cpu
areas are setup? That's why (I think) x86 has the early_cpu_to_node()
function...

Or do you not have CONFIG_OF set? So isn't the only change necessary to
the include file, and it should just return first_online_node rather
than 0?

Ah and there's more of those node 0 assumptions :)

#define first_online_node 0
#define first_memory_node 0

if MAX_NUMNODES == 1...
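
With MAX_NUMNODES > 1, nodemask.h picks the first set bit instead (roughly):

#define first_online_node	first_node(node_states[N_ONLINE])
#define first_memory_node	first_node(node_states[N_MEMORY])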

-Nish

2015-04-08 23:13:16

by Julian Calaby

Subject: Re: [PATCH] of: return NUMA_NO_NODE from fallback of_node_to_nid()

Hi Konstantin,

On Thu, Apr 9, 2015 at 3:04 AM, Konstantin Khlebnikov
<[email protected]> wrote:
> On 08.04.2015 19:59, Konstantin Khlebnikov wrote:
>>
>> Node 0 might be offline as well as any other numa node,
>> in this case kernel cannot handle memory allocation and crashes.
>
>
> Example:
>
> [ 0.027133] ------------[ cut here ]------------
> [ 0.027938] kernel BUG at include/linux/gfp.h:322!
> [ 0.028000] invalid opcode: 0000 [#1] SMP
> [ 0.028000] Modules linked in:
> [ 0.028000] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.0.0-rc7 #12
> [ 0.028000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
> [ 0.028000] task: ffff88007d3f8000 ti: ffff88007d3dc000 task.ti:
> ffff88007d3dc000
> [ 0.028000] RIP: 0010:[<ffffffff8118574c>] [<ffffffff8118574c>]
> new_slab+0x30c/0x3c0
> [ 0.028000] RSP: 0000:ffff88007d3dfc28 EFLAGS: 00010246
> [ 0.028000] RAX: 0000000000000000 RBX: ffff88007d001800 RCX:
> 0000000000000001
> [ 0.028000] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> 00000000002012d0
> [ 0.028000] RBP: ffff88007d3dfc58 R08: 0000000000000000 R09:
> 0000000000000000
> [ 0.028000] R10: 0000000000000001 R11: ffff88007d02fe40 R12:
> 00000000000000d0
> [ 0.028000] R13: 00000000000000c0 R14: 0000000000000015 R15:
> 0000000000000000
> [ 0.028000] FS: 0000000000000000(0000) GS:ffff88007fc00000(0000)
> knlGS:0000000000000000
> [ 0.028000] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 0.028000] CR2: 00000000ffffffff CR3: 0000000001e0e000 CR4:
> 00000000000006f0
> [ 0.028000] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 0.028000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [ 0.028000] Stack:
> [ 0.028000] 0000000000000000 ffff88007fc175d0 ffffea0001f40bc0
> 00000000000000c0
> [ 0.028000] ffff88007d001800 00000000000080d0 ffff88007d3dfd48
> ffffffff8192da27
> [ 0.028000] 000000000000000d ffffffff81e27038 0000000000000000
> 0000000000000000
> [ 0.028000] Call Trace:
> [ 0.028000] [<ffffffff8192da27>] __slab_alloc+0x3df/0x55d
> [ 0.028000] [<ffffffff8109a92b>] ? __lock_acquire+0xc1b/0x1f40
> [ 0.028000] [<ffffffff810b1f2c>] ? __irq_domain_add+0x3c/0xe0
> [ 0.028000] [<ffffffff810998f5>] ? trace_hardirqs_on_caller+0x105/0x1d0
> [ 0.028000] [<ffffffff813631ab>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [ 0.028000] [<ffffffff811890ab>] __kmalloc_node+0xab/0x210
> [ 0.028000] [<ffffffff810394df>] ? ioapic_read_entry+0x1f/0x50
> [ 0.028000] [<ffffffff810b1f2c>] ? __irq_domain_add+0x3c/0xe0
> [ 0.028000] [<ffffffff810b1f2c>] __irq_domain_add+0x3c/0xe0
> [ 0.028000] [<ffffffff81039e0e>] mp_irqdomain_create+0x9e/0x120
> [ 0.028000] [<ffffffff81f22d49>] setup_IO_APIC+0x6b/0x798
> [ 0.028000] [<ffffffff810398a5>] ? clear_IO_APIC+0x45/0x70
> [ 0.028000] [<ffffffff81f21f01>] apic_bsp_setup+0x87/0x96
> [ 0.028000] [<ffffffff81f1fdb4>] native_smp_prepare_cpus+0x237/0x275
> [ 0.028000] [<ffffffff81f131b7>] kernel_init_freeable+0x120/0x265
> [ 0.028000] [<ffffffff819271f9>] ? kernel_init+0x9/0xf0
> [ 0.028000] [<ffffffff819271f0>] ? rest_init+0x130/0x130
> [ 0.028000] [<ffffffff819271f9>] kernel_init+0x9/0xf0
> [ 0.028000] [<ffffffff8193b958>] ret_from_fork+0x58/0x90
> [ 0.028000] [<ffffffff819271f0>] ? rest_init+0x130/0x130
> [ 0.028000] Code: 6b b6 ff ff 49 89 c5 e9 ce fd ff ff 31 c0 90 e9 74 ff
> ff ff 49 c7 04 04 00 00 00 00 e9 05 ff ff ff 4c 89 e7 ff d0 e9 d9 fe ff ff
> <0f> 0b 4c 8b 73 38 44 89 e7 81 cf 00 00 20 00 4c 89 f6 48 c1 ee
> [ 0.028000] RIP [<ffffffff8118574c>] new_slab+0x30c/0x3c0
> [ 0.028000] RSP <ffff88007d3dfc28>
> [ 0.028039] ---[ end trace f03690e70d7e4be6 ]---

Shouldn't this be in the commit message?

Thanks,

--
Julian Calaby

Email: [email protected]
Profile: http://www.google.com/profiles/julian.calaby/

2015-04-09 04:27:35

by Konstantin Khlebnikov

Subject: Re: [PATCH] of: return NUMA_NO_NODE from fallback of_node_to_nid()

On Thu, Apr 9, 2015 at 2:07 AM, Nishanth Aravamudan
<[email protected]> wrote:
> On 08.04.2015 [20:04:04 +0300], Konstantin Khlebnikov wrote:
>> On 08.04.2015 19:59, Konstantin Khlebnikov wrote:
>> >Node 0 might be offline as well as any other numa node,
>> >in this case kernel cannot handle memory allocation and crashes.
>
> Isn't the bug that numa_node_id() returned an offline node? That
> shouldn't happen.

The offline node 0 came from the static-inline copy of that function in of.h.
I've patched the weak function as well to keep them consistent.

>
> #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
> ...
> #ifndef numa_node_id
> /* Returns the number of the current Node. */
> static inline int numa_node_id(void)
> {
> return raw_cpu_read(numa_node);
> }
> #endif
> ...
> #else /* !CONFIG_USE_PERCPU_NUMA_NODE_ID */
>
> /* Returns the number of the current Node. */
> #ifndef numa_node_id
> static inline int numa_node_id(void)
> {
> return cpu_to_node(raw_smp_processor_id());
> }
> #endif
> ...
>
> So that's either the per-cpu numa_node value, right? Or the result of
> cpu_to_node on the current processor.
>
>> Example:
>>
>> [ 0.027133] ------------[ cut here ]------------
>> [ 0.027938] kernel BUG at include/linux/gfp.h:322!
>
> This is
>
> VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
>
> in
>
> alloc_pages_exact_node().
>
> And based on the trace below, that's
>
> __slab_alloc -> alloc
>
> alloc_pages_exact_node
> <- alloc_slab_page
> <- allocate_slab
> <- new_slab
> <- new_slab_objects
> < __slab_alloc?
>
> which is just passing the node value down, right? Which I think was
> from:
>
> domain = kzalloc_node(sizeof(*domain) + (sizeof(unsigned int) * size),
> GFP_KERNEL, of_node_to_nid(of_node));
>
> ?
>
>
> What platform is this on, looks to be x86? qemu emulation of a
> pathological topology? What was the topology?

qemu x86_64, 2 cpus, 2 NUMA nodes, all memory in the second one.
I've slightly patched qemu to allow that setup (qemu hardcodes 1Mb
of memory connected to node 0). And I've found an unrelated bug --
if a NUMA node has less than 4Mb of RAM then the kernel crashes even
earlier, because the NUMA code ignores that node
but the buddy allocator still tries to use its pages.

>
> Note that there is a ton of code that seems to assume node 0 is online.
> I started working on removing this assumption myself and it just led
> down a rathole (on power, we always have node 0 online, even if it is
> memoryless and cpuless, as a result).
>
> I am guessing this is just happening early in boot before the per-cpu
> areas are setup? That's why (I think) x86 has the early_cpu_to_node()
> function...
>
> Or do you not have CONFIG_OF set? So isn't the only change necessary to
> the include file, and it should just return first_online_node rather
> than 0?
>
> Ah and there's more of those node 0 assumptions :)

That was x86, where there is no CONFIG_OF at all.

I don't know what's wrong with that machine, but ACPI reports the
cpus and memory of node 0 as connected to node 1, and everything
seemed to work fine until the latest upgrade -- it seems the buggy
static-inline of_node_to_nid was introduced in 3.13, but the x86 ioapic
only uses it during early allocations since 3.17. The machine owner
tells me that 3.15 worked fine.

>
> #define first_online_node 0
> #define first_memory_node 0
>
> if MAX_NUMODES == 1...
>
> -Nish
>

2015-04-09 04:35:41

by Konstantin Khlebnikov

Subject: Re: [PATCH] of: return NUMA_NO_NODE from fallback of_node_to_nid()

On Thu, Apr 9, 2015 at 2:12 AM, Julian Calaby <[email protected]> wrote:
> Hi Konstantin,
>
> On Thu, Apr 9, 2015 at 3:04 AM, Konstantin Khlebnikov
> <[email protected]> wrote:
>> On 08.04.2015 19:59, Konstantin Khlebnikov wrote:
>>>
>>> Node 0 might be offline as well as any other numa node,
>>> in this case kernel cannot handle memory allocation and crashes.
>>
>>
>> Example:
>>
>> [ 0.027133] ------------[ cut here ]------------
>> [ 0.027938] kernel BUG at include/linux/gfp.h:322!
>> [ 0.028000] invalid opcode: 0000 [#1] SMP
>> [ 0.028000] Modules linked in:
>> [ 0.028000] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.0.0-rc7 #12
>> [ 0.028000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
>> [ 0.028000] task: ffff88007d3f8000 ti: ffff88007d3dc000 task.ti:
>> ffff88007d3dc000
>> [ 0.028000] RIP: 0010:[<ffffffff8118574c>] [<ffffffff8118574c>]
>> new_slab+0x30c/0x3c0
>> [ 0.028000] RSP: 0000:ffff88007d3dfc28 EFLAGS: 00010246
>> [ 0.028000] RAX: 0000000000000000 RBX: ffff88007d001800 RCX:
>> 0000000000000001
>> [ 0.028000] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
>> 00000000002012d0
>> [ 0.028000] RBP: ffff88007d3dfc58 R08: 0000000000000000 R09:
>> 0000000000000000
>> [ 0.028000] R10: 0000000000000001 R11: ffff88007d02fe40 R12:
>> 00000000000000d0
>> [ 0.028000] R13: 00000000000000c0 R14: 0000000000000015 R15:
>> 0000000000000000
>> [ 0.028000] FS: 0000000000000000(0000) GS:ffff88007fc00000(0000)
>> knlGS:0000000000000000
>> [ 0.028000] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [ 0.028000] CR2: 00000000ffffffff CR3: 0000000001e0e000 CR4:
>> 00000000000006f0
>> [ 0.028000] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> [ 0.028000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>> 0000000000000400
>> [ 0.028000] Stack:
>> [ 0.028000] 0000000000000000 ffff88007fc175d0 ffffea0001f40bc0
>> 00000000000000c0
>> [ 0.028000] ffff88007d001800 00000000000080d0 ffff88007d3dfd48
>> ffffffff8192da27
>> [ 0.028000] 000000000000000d ffffffff81e27038 0000000000000000
>> 0000000000000000
>> [ 0.028000] Call Trace:
>> [ 0.028000] [<ffffffff8192da27>] __slab_alloc+0x3df/0x55d
>> [ 0.028000] [<ffffffff8109a92b>] ? __lock_acquire+0xc1b/0x1f40
>> [ 0.028000] [<ffffffff810b1f2c>] ? __irq_domain_add+0x3c/0xe0
>> [ 0.028000] [<ffffffff810998f5>] ? trace_hardirqs_on_caller+0x105/0x1d0
>> [ 0.028000] [<ffffffff813631ab>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>> [ 0.028000] [<ffffffff811890ab>] __kmalloc_node+0xab/0x210
>> [ 0.028000] [<ffffffff810394df>] ? ioapic_read_entry+0x1f/0x50
>> [ 0.028000] [<ffffffff810b1f2c>] ? __irq_domain_add+0x3c/0xe0
>> [ 0.028000] [<ffffffff810b1f2c>] __irq_domain_add+0x3c/0xe0
>> [ 0.028000] [<ffffffff81039e0e>] mp_irqdomain_create+0x9e/0x120
>> [ 0.028000] [<ffffffff81f22d49>] setup_IO_APIC+0x6b/0x798
>> [ 0.028000] [<ffffffff810398a5>] ? clear_IO_APIC+0x45/0x70
>> [ 0.028000] [<ffffffff81f21f01>] apic_bsp_setup+0x87/0x96
>> [ 0.028000] [<ffffffff81f1fdb4>] native_smp_prepare_cpus+0x237/0x275
>> [ 0.028000] [<ffffffff81f131b7>] kernel_init_freeable+0x120/0x265
>> [ 0.028000] [<ffffffff819271f9>] ? kernel_init+0x9/0xf0
>> [ 0.028000] [<ffffffff819271f0>] ? rest_init+0x130/0x130
>> [ 0.028000] [<ffffffff819271f9>] kernel_init+0x9/0xf0
>> [ 0.028000] [<ffffffff8193b958>] ret_from_fork+0x58/0x90
>> [ 0.028000] [<ffffffff819271f0>] ? rest_init+0x130/0x130
>> [ 0.028000] Code: 6b b6 ff ff 49 89 c5 e9 ce fd ff ff 31 c0 90 e9 74 ff
>> ff ff 49 c7 04 04 00 00 00 00 e9 05 ff ff ff 4c 89 e7 ff d0 e9 d9 fe ff ff
>> <0f> 0b 4c 8b 73 38 44 89 e7 81 cf 00 00 20 00 4c 89 f6 48 c1 ee
>> [ 0.028000] RIP [<ffffffff8118574c>] new_slab+0x30c/0x3c0
>> [ 0.028000] RSP <ffff88007d3dfc28>
>> [ 0.028039] ---[ end trace f03690e70d7e4be6 ]---
>
> Shouldn't this be in the commit message?

I don't think this will help anybody; the kernel crashes here
only on a rare hardware setup. This stack came from a specially
patched qemu (because a normal one cannot configure a memory-less node 0).

>
> Thanks,
>
> --
> Julian Calaby
>
> Email: [email protected]
> Profile: http://www.google.com/profiles/julian.calaby/
>

2015-04-09 22:58:26

by Nishanth Aravamudan

Subject: Re: [PATCH] of: return NUMA_NO_NODE from fallback of_node_to_nid()

On 09.04.2015 [07:27:28 +0300], Konstantin Khlebnikov wrote:
> On Thu, Apr 9, 2015 at 2:07 AM, Nishanth Aravamudan
> <[email protected]> wrote:
> > On 08.04.2015 [20:04:04 +0300], Konstantin Khlebnikov wrote:
> >> On 08.04.2015 19:59, Konstantin Khlebnikov wrote:
> >> >Node 0 might be offline as well as any other numa node,
> >> >in this case kernel cannot handle memory allocation and crashes.
> >
> > Isn't the bug that numa_node_id() returned an offline node? That
> > shouldn't happen.
>
> Offline node 0 came from static-inline copy of that function from of.h
> I've patched weak function for keeping consistency.

Got it, that's not necessarily clear in the original commit message.

> > #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
> > ...
> > #ifndef numa_node_id
> > /* Returns the number of the current Node. */
> > static inline int numa_node_id(void)
> > {
> > return raw_cpu_read(numa_node);
> > }
> > #endif
> > ...
> > #else /* !CONFIG_USE_PERCPU_NUMA_NODE_ID */
> >
> > /* Returns the number of the current Node. */
> > #ifndef numa_node_id
> > static inline int numa_node_id(void)
> > {
> > return cpu_to_node(raw_smp_processor_id());
> > }
> > #endif
> > ...
> >
> > So that's either the per-cpu numa_node value, right? Or the result of
> > cpu_to_node on the current processor.
> >
> >> Example:
> >>
> >> [ 0.027133] ------------[ cut here ]------------
> >> [ 0.027938] kernel BUG at include/linux/gfp.h:322!
> >
> > This is
> >
> > VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
> >
> > in
> >
> > alloc_pages_exact_node().
> >
> > And based on the trace below, that's
> >
> > __slab_alloc -> alloc
> >
> > alloc_pages_exact_node
> > <- alloc_slab_page
> > <- allocate_slab
> > <- new_slab
> > <- new_slab_objects
> > < __slab_alloc?
> >
> > which is just passing the node value down, right? Which I think was
> > from:
> >
> > domain = kzalloc_node(sizeof(*domain) + (sizeof(unsigned int) * size),
> > GFP_KERNEL, of_node_to_nid(of_node));
> >
> > ?
> >
> >
> > What platform is this on, looks to be x86? qemu emulation of a
> > pathological topology? What was the topology?
>
> qemu x86_64, 2 cpu, 2 numa nodes, all memory in second.

Ok, this worked before? That is, this is a regression?

> I've slightly patched it to allow that setup (in qemu hardcoded 1Mb
> of memory connected to node 0) And i've found unrelated bug --
> if numa node has less that 4Mb ram then kernel crashes even
> earlier because numa code ignores that node
> but buddy allocator still tries to use that pages.

So this isn't an actually supported topology by qemu?

> > Note that there is a ton of code that seems to assume node 0 is online.
> > I started working on removing this assumption myself and it just led
> > down a rathole (on power, we always have node 0 online, even if it is
> > memoryless and cpuless, as a result).
> >
> > I am guessing this is just happening early in boot before the per-cpu
> > areas are setup? That's why (I think) x86 has the early_cpu_to_node()
> > function...
> >
> > Or do you not have CONFIG_OF set? So isn't the only change necessary to
> > the include file, and it should just return first_online_node rather
> > than 0?
> >
> > Ah and there's more of those node 0 assumptions :)
>
> That was x86 where is no CONFIG_OF at all.
>
> I don't know what's wrong with that machine but ACPI reports that
> cpus and memory from node 0 as connected to node 1 and everything
> seems worked fine until lates upgrade -- seems like buggy static-inline
> of_node_to_nid was intoduced in 3.13 but x86 ioapic uses it during
> early allocations only in since 3.17. Machine owner teells that 3.15
> worked fine.

So, this was a qemu emulation of this actual physical machine without a
node 0?

As I mentioned, there are lots of node 0 assumptions through the kernel.
You might run into more issues at runtime.

-Nish

2015-04-10 11:37:28

by Konstantin Khlebnikov

Subject: Re: [PATCH] of: return NUMA_NO_NODE from fallback of_node_to_nid()

On 10.04.2015 01:58, Nishanth Aravamudan wrote:
> On 09.04.2015 [07:27:28 +0300], Konstantin Khlebnikov wrote:
>> On Thu, Apr 9, 2015 at 2:07 AM, Nishanth Aravamudan
>> <[email protected]> wrote:
>>> On 08.04.2015 [20:04:04 +0300], Konstantin Khlebnikov wrote:
>>>> On 08.04.2015 19:59, Konstantin Khlebnikov wrote:
>>>>> Node 0 might be offline as well as any other numa node,
>>>>> in this case kernel cannot handle memory allocation and crashes.
>>>
>>> Isn't the bug that numa_node_id() returned an offline node? That
>>> shouldn't happen.
>>
>> Offline node 0 came from static-inline copy of that function from of.h
>> I've patched weak function for keeping consistency.
>
> Got it, that's not necessarily clear in the original commit message.

Sorry.

>
>>> #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
>>> ...
>>> #ifndef numa_node_id
>>> /* Returns the number of the current Node. */
>>> static inline int numa_node_id(void)
>>> {
>>> return raw_cpu_read(numa_node);
>>> }
>>> #endif
>>> ...
>>> #else /* !CONFIG_USE_PERCPU_NUMA_NODE_ID */
>>>
>>> /* Returns the number of the current Node. */
>>> #ifndef numa_node_id
>>> static inline int numa_node_id(void)
>>> {
>>> return cpu_to_node(raw_smp_processor_id());
>>> }
>>> #endif
>>> ...
>>>
>>> So that's either the per-cpu numa_node value, right? Or the result of
>>> cpu_to_node on the current processor.
>>>
>>>> Example:
>>>>
>>>> [ 0.027133] ------------[ cut here ]------------
>>>> [ 0.027938] kernel BUG at include/linux/gfp.h:322!
>>>
>>> This is
>>>
>>> VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
>>>
>>> in
>>>
>>> alloc_pages_exact_node().
>>>
>>> And based on the trace below, that's
>>>
>>> __slab_alloc -> alloc
>>>
>>> alloc_pages_exact_node
>>> <- alloc_slab_page
>>> <- allocate_slab
>>> <- new_slab
>>> <- new_slab_objects
>>> < __slab_alloc?
>>>
>>> which is just passing the node value down, right? Which I think was
>>> from:
>>>
>>> domain = kzalloc_node(sizeof(*domain) + (sizeof(unsigned int) * size),
>>> GFP_KERNEL, of_node_to_nid(of_node));
>>>
>>> ?
>>>
>>>
>>> What platform is this on, looks to be x86? qemu emulation of a
>>> pathological topology? What was the topology?
>>
>> qemu x86_64, 2 cpu, 2 numa nodes, all memory in second.
>
> Ok, this worked before? That is, this is a regression?

Seems like that worked before 3.17, where the
bug was exposed by commit 44767bfaaed782d6d635ecbb13f3980041e6f33e
("x86, irq: Enhance mp_register_ioapic() to support irqdomain");
this is the first usage of *irq_domain_add*() in x86.

>
>> I've slightly patched it to allow that setup (in qemu hardcoded 1Mb
>> of memory connected to node 0) And i've found unrelated bug --
>> if numa node has less that 4Mb ram then kernel crashes even
>> earlier because numa code ignores that node
>> but buddy allocator still tries to use that pages.
>
> So this isn't an actually supported topology by qemu?

Qemu easily creates memoryless NUMA nodes, but node 0 has a hardcoded 1Mb
of RAM. This seems like a legacy prop for DOS-era software.

>
>>> Note that there is a ton of code that seems to assume node 0 is online.
>>> I started working on removing this assumption myself and it just led
>>> down a rathole (on power, we always have node 0 online, even if it is
>>> memoryless and cpuless, as a result).
>>>
>>> I am guessing this is just happening early in boot before the per-cpu
>>> areas are setup? That's why (I think) x86 has the early_cpu_to_node()
>>> function...
>>>
>>> Or do you not have CONFIG_OF set? So isn't the only change necessary to
>>> the include file, and it should just return first_online_node rather
>>> than 0?
>>>
>>> Ah and there's more of those node 0 assumptions :)
>>
>> That was x86 where is no CONFIG_OF at all.
>>
>> I don't know what's wrong with that machine but ACPI reports that
>> cpus and memory from node 0 as connected to node 1 and everything
>> seems worked fine until lates upgrade -- seems like buggy static-inline
>> of_node_to_nid was intoduced in 3.13 but x86 ioapic uses it during
>> early allocations only in since 3.17. Machine owner teells that 3.15
>> worked fine.
>
> So, this was a qemu emulation of this actual physical machine without a
> node 0?

Yep. I also have a crash from a real machine, but that stacktrace is messy
because CONFIG_DEBUG_VM wasn't enabled and the kernel crashed inside the
buddy allocator when it tried to touch an unallocated NUMA node structure.

>
> As I mentioned, there are lots of node 0 assumptions through the kernel.
> You might run into more issues at runtime.

I think it's possible to trigger a kernel crash for any memoryless NUMA
node (not just node 0) if some device (like the ioapic in my case) points to
it in its ACPI tables. At runtime, NUMA affinity configured by the user is
usually validated by the kernel, while numbers from firmware might be
used without proper validation.

Anyway, it seems at least one x86 machine works fine without memory in
node 0.

--
Konstantin

2015-04-10 19:48:47

by Nishanth Aravamudan

Subject: Re: [PATCH] of: return NUMA_NO_NODE from fallback of_node_to_nid()

On 10.04.2015 [14:37:19 +0300], Konstantin Khlebnikov wrote:
> On 10.04.2015 01:58, Nishanth Aravamudan wrote:
> >On 09.04.2015 [07:27:28 +0300], Konstantin Khlebnikov wrote:
> >>On Thu, Apr 9, 2015 at 2:07 AM, Nishanth Aravamudan
> >><[email protected]> wrote:
> >>>On 08.04.2015 [20:04:04 +0300], Konstantin Khlebnikov wrote:
> >>>>On 08.04.2015 19:59, Konstantin Khlebnikov wrote:
> >>>>>Node 0 might be offline as well as any other numa node,
> >>>>>in this case kernel cannot handle memory allocation and crashes.
> >>>
> >>>Isn't the bug that numa_node_id() returned an offline node? That
> >>>shouldn't happen.
> >>
> >>Offline node 0 came from static-inline copy of that function from of.h
> >>I've patched weak function for keeping consistency.
> >
> >Got it, that's not necessarily clear in the original commit message.
>
> Sorry.
>
> >
> >>>#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
> >>>...
> >>>#ifndef numa_node_id
> >>>/* Returns the number of the current Node. */
> >>>static inline int numa_node_id(void)
> >>>{
> >>> return raw_cpu_read(numa_node);
> >>>}
> >>>#endif
> >>>...
> >>>#else /* !CONFIG_USE_PERCPU_NUMA_NODE_ID */
> >>>
> >>>/* Returns the number of the current Node. */
> >>>#ifndef numa_node_id
> >>>static inline int numa_node_id(void)
> >>>{
> >>> return cpu_to_node(raw_smp_processor_id());
> >>>}
> >>>#endif
> >>>...
> >>>
> >>>So that's either the per-cpu numa_node value, right? Or the result of
> >>>cpu_to_node on the current processor.
> >>>
> >>>>Example:
> >>>>
> >>>>[ 0.027133] ------------[ cut here ]------------
> >>>>[ 0.027938] kernel BUG at include/linux/gfp.h:322!
> >>>
> >>>This is
> >>>
> >>>VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
> >>>
> >>>in
> >>>
> >>>alloc_pages_exact_node().
> >>>
> >>>And based on the trace below, that's
> >>>
> >>>__slab_alloc -> alloc
> >>>
> >>>alloc_pages_exact_node
> >>> <- alloc_slab_page
> >>> <- allocate_slab
> >>> <- new_slab
> >>> <- new_slab_objects
> >>> < __slab_alloc?
> >>>
> >>>which is just passing the node value down, right? Which I think was
> >>>from:
> >>>
> >>> domain = kzalloc_node(sizeof(*domain) + (sizeof(unsigned int) * size),
> >>> GFP_KERNEL, of_node_to_nid(of_node));
> >>>
> >>>?
> >>>
> >>>
> >>>What platform is this on, looks to be x86? qemu emulation of a
> >>>pathological topology? What was the topology?
> >>
> >>qemu x86_64, 2 cpu, 2 numa nodes, all memory in second.
> >
> >Ok, this worked before? That is, this is a regression?
>
> Seems like that worked before 3.17 where
> bug was exposed by commit 44767bfaaed782d6d635ecbb13f3980041e6f33e
> (x86, irq: Enhance mp_register_ioapic() to support irqdomain)
> this is first usage of *irq_domain_add*() in x86.

Ok.

> >> I've slightly patched it to allow that setup (in qemu hardcoded 1Mb
> >>of memory connected to node 0) And i've found unrelated bug --
> >>if numa node has less that 4Mb ram then kernel crashes even
> >>earlier because numa code ignores that node
> >>but buddy allocator still tries to use that pages.
> >
> >So this isn't an actually supported topology by qemu?
>
> Qemu easily created memoryless numa nodes but node 0 have hardcoded
> 1Mb of ram. This seems like legacy prop for DOS era software.

Well, the problem is that x86 doesn't support memoryless nodes.

git grep MEMORYLESS_NODES
arch/ia64/Kconfig:config HAVE_MEMORYLESS_NODES
arch/powerpc/Kconfig:config HAVE_MEMORYLESS_NODES

> >>>Note that there is a ton of code that seems to assume node 0 is online.
> >>>I started working on removing this assumption myself and it just led
> >>>down a rathole (on power, we always have node 0 online, even if it is
> >>>memoryless and cpuless, as a result).
> >>>
> >>>I am guessing this is just happening early in boot before the per-cpu
> >>>areas are setup? That's why (I think) x86 has the early_cpu_to_node()
> >>>function...
> >>>
> >>>Or do you not have CONFIG_OF set? So isn't the only change necessary to
> >>>the include file, and it should just return first_online_node rather
> >>>than 0?
> >>>
> >>>Ah and there's more of those node 0 assumptions :)
> >>
> >>That was x86 where is no CONFIG_OF at all.
> >>
> >>I don't know what's wrong with that machine but ACPI reports that
> >>cpus and memory from node 0 as connected to node 1 and everything
> >>seems worked fine until lates upgrade -- seems like buggy static-inline
> >>of_node_to_nid was intoduced in 3.13 but x86 ioapic uses it during
> >>early allocations only in since 3.17. Machine owner teells that 3.15
> >>worked fine.
> >
> >So, this was a qemu emulation of this actual physical machine without a
> >node 0?
>
> Yep. Also I have crash from real machine but that stacktrace is messy
> because CONFIG_DEBUG_VM wasn't enabled and kernel crashed inside
> buddy allocator when tried to touch unallocated numa node structure.
>
> >
> >As I mentioned, there are lots of node 0 assumptions through the kernel.
> >You might run into more issues at runtime.
>
> I think it's possible to trigger kernel crash for any memoryless numa
> node (not just for 0) if some device (like ioapic in my case) points to
> it in its acpi tables. In runtime numa affinity configured by user
> usually validated by the kernel, while numbers from firmware might
> be used without proper validation.
>
> Anyway seems like at least one x86 machines works fine without
> memory in node 0.

You're going to run into more issues, without adding proper memoryless
node support, I think.

-Nish

2015-04-13 13:22:52

by Rob Herring

Subject: Re: [PATCH] of: return NUMA_NO_NODE from fallback of_node_to_nid()

On Wed, Apr 8, 2015 at 11:59 AM, Konstantin Khlebnikov
<[email protected]> wrote:
> Node 0 might be offline as well as any other numa node,
> in this case kernel cannot handle memory allocation and crashes.
>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> Fixes: 0c3f061c195c ("of: implement of_node_to_nid as a weak function")
> ---
> drivers/of/base.c | 2 +-
> include/linux/of.h | 5 ++++-
> 2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index 8f165b112e03..51f4bd16e613 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -89,7 +89,7 @@ EXPORT_SYMBOL(of_n_size_cells);
> #ifdef CONFIG_NUMA
> int __weak of_node_to_nid(struct device_node *np)
> {
> - return numa_node_id();
> + return NUMA_NO_NODE;

This is going to break any NUMA machine that enables OF and expects
the weak function to work.

Rob

> }
> #endif
>
> diff --git a/include/linux/of.h b/include/linux/of.h
> index dfde07e77a63..78a04ee85a9c 100644
> --- a/include/linux/of.h
> +++ b/include/linux/of.h
> @@ -623,7 +623,10 @@ static inline const char *of_prop_next_string(struct property *prop,
> #if defined(CONFIG_OF) && defined(CONFIG_NUMA)
> extern int of_node_to_nid(struct device_node *np);
> #else
> -static inline int of_node_to_nid(struct device_node *device) { return 0; }
> +static inline int of_node_to_nid(struct device_node *device)
> +{
> + return NUMA_NO_NODE;
> +}
> #endif
>
> static inline struct device_node *of_find_matching_node(
>

2015-04-13 13:38:55

by Konstantin Khlebnikov

Subject: Re: [PATCH] of: return NUMA_NO_NODE from fallback of_node_to_nid()

On 13.04.2015 16:22, Rob Herring wrote:
> On Wed, Apr 8, 2015 at 11:59 AM, Konstantin Khlebnikov
> <[email protected]> wrote:
>> Node 0 might be offline as well as any other numa node,
>> in this case kernel cannot handle memory allocation and crashes.
>>
>> Signed-off-by: Konstantin Khlebnikov <[email protected]>
>> Fixes: 0c3f061c195c ("of: implement of_node_to_nid as a weak function")
>> ---
>> drivers/of/base.c | 2 +-
>> include/linux/of.h | 5 ++++-
>> 2 files changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/of/base.c b/drivers/of/base.c
>> index 8f165b112e03..51f4bd16e613 100644
>> --- a/drivers/of/base.c
>> +++ b/drivers/of/base.c
>> @@ -89,7 +89,7 @@ EXPORT_SYMBOL(of_n_size_cells);
>> #ifdef CONFIG_NUMA
>> int __weak of_node_to_nid(struct device_node *np)
>> {
>> - return numa_node_id();
>> + return NUMA_NO_NODE;
>
> This is going to break any NUMA machine that enables OF and expects
> the weak function to work.

Why? NUMA_NO_NODE == -1 -- this is the standard "no-affinity" signal.
As I see it, the powerpc/sparc versions of of_node_to_nid return -1 if they
cannot find out which node should be used.
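
The generic page allocator already treats a negative nid as "use the current
node"; include/linux/gfp.h has roughly:

static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
					unsigned int order)
{
	/* Unknown node is current node */
	if (nid < 0)
		nid = numa_node_id();

	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}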

>
> Rob
>
>> }
>> #endif
>>
>> diff --git a/include/linux/of.h b/include/linux/of.h
>> index dfde07e77a63..78a04ee85a9c 100644
>> --- a/include/linux/of.h
>> +++ b/include/linux/of.h
>> @@ -623,7 +623,10 @@ static inline const char *of_prop_next_string(struct property *prop,
>> #if defined(CONFIG_OF) && defined(CONFIG_NUMA)
>> extern int of_node_to_nid(struct device_node *np);
>> #else
>> -static inline int of_node_to_nid(struct device_node *device) { return 0; }
>> +static inline int of_node_to_nid(struct device_node *device)
>> +{
>> + return NUMA_NO_NODE;
>> +}
>> #endif
>>
>> static inline struct device_node *of_find_matching_node(
>>


--
Konstantin

2015-04-13 16:49:58

by Rob Herring

Subject: Re: [PATCH] of: return NUMA_NO_NODE from fallback of_node_to_nid()

On Mon, Apr 13, 2015 at 8:38 AM, Konstantin Khlebnikov
<[email protected]> wrote:
> On 13.04.2015 16:22, Rob Herring wrote:
>>
>> On Wed, Apr 8, 2015 at 11:59 AM, Konstantin Khlebnikov
>> <[email protected]> wrote:
>>>
>>> Node 0 might be offline as well as any other numa node,
>>> in this case kernel cannot handle memory allocation and crashes.
>>>
>>> Signed-off-by: Konstantin Khlebnikov <[email protected]>
>>> Fixes: 0c3f061c195c ("of: implement of_node_to_nid as a weak function")
>>> ---
>>> drivers/of/base.c | 2 +-
>>> include/linux/of.h | 5 ++++-
>>> 2 files changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/of/base.c b/drivers/of/base.c
>>> index 8f165b112e03..51f4bd16e613 100644
>>> --- a/drivers/of/base.c
>>> +++ b/drivers/of/base.c
>>> @@ -89,7 +89,7 @@ EXPORT_SYMBOL(of_n_size_cells);
>>> #ifdef CONFIG_NUMA
>>> int __weak of_node_to_nid(struct device_node *np)
>>> {
>>> - return numa_node_id();
>>> + return NUMA_NO_NODE;
>>
>>
>> This is going to break any NUMA machine that enables OF and expects
>> the weak function to work.
>
>
> Why? NUMA_NO_NODE == -1 -- this's standard "no-affinity" signal.
> As I see powerpc/sparc versions of of_node_to_nid returns -1 if they
> cannot find out which node should be used.

Ah, I was thinking those platforms were relying on the default
implementation. I guess any real NUMA support is going to need to
override this function. The arm64 patch series does that as well. We
need to be sure this change is correct for metag which appears to be
the only other OF enabled platform with NUMA support.

In that case, then there is little reason to keep the inline and we
can just always enable the weak function (with your change). It is
slightly less optimal, but the few callers hardly appear to be hot
paths.
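
Just to illustrate the shape of that (a hypothetical follow-up sketch, not an
actual patch): include/linux/of.h would keep only the CONFIG_OF switch, and the
weak function in drivers/of/base.c (now returning NUMA_NO_NODE) would become
the single fallback that NUMA arches override:

#ifdef CONFIG_OF
extern int of_node_to_nid(struct device_node *np);
#else
static inline int of_node_to_nid(struct device_node *np)
{
	return NUMA_NO_NODE;
}
#endif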

Rob

2015-04-29 08:30:26

by Konstantin Khlebnikov

Subject: Re: [PATCH] of: return NUMA_NO_NODE from fallback of_node_to_nid()

[email protected]
[email protected]

Here is the proposed fix:
https://www.mail-archive.com/[email protected]/msg864009.html

It returns NUMA_NO_NODE from both the static-inline (CONFIG_OF=n) and the weak
version of of_node_to_nid(). This change might affect the few arches which
have CONFIG_OF=y but don't implement of_node_to_nid() (i.e. depend
on the default behavior of the weak function). It seems this is only metag.

From the mm/ point of view, returning NUMA_NO_NODE is the right choice when
code has no idea which NUMA node should be used -- the memory allocation
functions then choose the current NUMA node (but they might use any).
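
E.g. a caller can pass the result straight into the node-aware allocators;
NUMA_NO_NODE just means "no preference" (hypothetical usage sketch, names made
up):

	int nid = of_node_to_nid(np);			/* may be NUMA_NO_NODE */
	buf = kzalloc_node(sizeof(*buf), GFP_KERNEL, nid);	/* behaves like plain kzalloc() then */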

On 29.04.2015 04:11, [email protected] wrote:
> When we tested the cpu and memory hotplug feature on a server with x86
> architecture and kernel 4.0-rc4, we met a similar problem.
>
> The situation is that when the memory in node 0 is offline, the system goes
> down during booting.
>
> Following is the bug information:
> [ 0.335176] BUG: unable to handle kernel paging request at
> 0000000000001b08
> [ 0.342164] IP: [<ffffffff81182587>] __alloc_pages_nodemask+0xb7/0x940
> [ 0.348706] PGD 0
> [ 0.350735] Oops: 0000 [#1] SMP
> [ 0.353993] Modules linked in:
> [ 0.357063] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.0.0-rc4 #1
> [ 0.363232] Hardware name: Inspur TS860/TS860, BIOS TS860_2.0.0
> 2015/03/24
> [ 0.370095] task: ffff88085b1e0000 ti: ffff88085b1e8000 task.ti:
> ffff88085b1e8000
> [ 0.377564] RIP: 0010:[<ffffffff81182587>] [<ffffffff81182587>]
> __alloc_pages_nodemask+0xb7/0x940
> [ 0.386524] RSP: 0000:ffff88085b1ebac8 EFLAGS: 00010246
> [ 0.391828] RAX: 0000000000001b00 RBX: 0000000000000010 RCX:
> 0000000000000000
> [ 0.398953] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> 00000000002052d0
> [ 0.406075] RBP: ffff88085b1ebbb8 R08: ffff88085b13fec0 R09:
> 000000005b13fe01
> [ 0.413198] R10: ffff88085e807300 R11: ffffffff810d4bc1 R12:
> 000000000001002a
> [ 0.420321] R13: 00000000002052d0 R14: 0000000000000001 R15:
> 00000000000040d0
> [ 0.427446] FS: 0000000000000000(0000) GS:ffff88085ee00000(0000)
> knlGS:0000000000000000
> [ 0.435522] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 0.441259] CR2: 0000000000001b08 CR3: 00000000019ae000 CR4:
> 00000000001406f0
> [ 0.448382] Stack:
> [ 0.450392] ffff88085b1e0000 0000000000000400 ffff88085b1effff
> ffff88085b1ebb68
> [ 0.457846] 000000000000007b ffff88085b12d140 ffff88085b249000
> 000000000000007b
> [ 0.465298] ffff88085b1ebb28 ffffffff81af2900 0000000000000000
> 002052d05b12d140
> [ 0.472750] Call Trace:
> [ 0.475206] [<ffffffff811d27b3>] ? deactivate_slab+0x383/0x400
> [ 0.481123] [<ffffffff811d3947>] new_slab+0xa7/0x460
> [ 0.486174] [<ffffffff816789e5>] __slab_alloc+0x310/0x470
> [ 0.491655] [<ffffffff8105304f>] ? dmar_msi_set_affinity+0x8f/0xc0
> [ 0.497921] [<ffffffff810d4bc1>] ? __irq_domain_add+0x41/0x100
> [ 0.503838] [<ffffffff810d0fee>] ? irq_do_set_affinity+0x5e/0x70
> [ 0.509920] [<ffffffff811d571d>] __kmalloc_node+0xad/0x2e0
> [ 0.515483] [<ffffffff810d4bc1>] ? __irq_domain_add+0x41/0x100
> [ 0.521392] [<ffffffff810d4bc1>] __irq_domain_add+0x41/0x100
> [ 0.527133] [<ffffffff8105102e>] mp_irqdomain_create+0x9e/0x120
> [ 0.533140] [<ffffffff81b2fb14>] setup_IO_APIC+0x64/0x1be
> [ 0.538622] [<ffffffff81b2e226>] apic_bsp_setup+0xa2/0xae
> [ 0.544099] [<ffffffff81b2bc70>] native_smp_prepare_cpus+0x267/0x2b2
> [ 0.550531] [<ffffffff81b1927b>] kernel_init_freeable+0xf2/0x253
> [ 0.556625] [<ffffffff8166b960>] ? rest_init+0x80/0x80
> [ 0.561845] [<ffffffff8166b96e>] kernel_init+0xe/0xf0
> [ 0.566979] [<ffffffff81681bd8>] ret_from_fork+0x58/0x90
> [ 0.572374] [<ffffffff8166b960>] ? rest_init+0x80/0x80
> [ 0.577591] Code: 30 97 00 89 45 bc 83 e1 0f b8 22 01 32 01 01 c9 d3
> f8 83 e0 03 89 9d 6c ff ff ff 83 e3 10 89 45 c0 0f 85 6d 01 00 00 48 8b
> 45 88 <48> 83 78 08 00 0f 84 51 01 00 00 b8 01 00 00 00 44 89 f1 d3 e0
> [ 0.597537] RIP [<ffffffff81182587>] __alloc_pages_nodemask+0xb7/0x940
> [ 0.604158] RSP <ffff88085b1ebac8>
> [ 0.607643] CR2: 0000000000001b08
> [ 0.610962] ---[ end trace 0a600c0841386992 ]---
> [ 0.615573] Kernel panic - not syncing: Fatal exception
> [ 0.620792] ---[ end Kernel panic - not syncing: Fatal exception
> *From:* Rob Herring <[email protected]>
> *Date:* 2015-04-14 00:49
> *To:* Konstantin Khlebnikov <[email protected]>
> *CC:* Grant Likely <[email protected]>; [email protected]; Rob Herring <[email protected]>; [email protected]; [email protected]; [email protected]; linuxppc-dev <[email protected]>
> *Subject:* Re: [PATCH] of: return NUMA_NO_NODE from fallback of_node_to_nid()
> On Mon, Apr 13, 2015 at 8:38 AM, Konstantin Khlebnikov
> <[email protected]> wrote:
> > On 13.04.2015 16:22, Rob Herring wrote:
> >>
> >> On Wed, Apr 8, 2015 at 11:59 AM, Konstantin Khlebnikov
> >> <[email protected]> wrote:
> >>>
> >>> Node 0 might be offline as well as any other numa node,
> >>> in this case kernel cannot handle memory allocation and crashes.
> >>>
> >>> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> >>> Fixes: 0c3f061c195c ("of: implement of_node_to_nid as a weak function")
> >>> ---
> >>> drivers/of/base.c | 2 +-
> >>> include/linux/of.h | 5 ++++-
> >>> 2 files changed, 5 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/of/base.c b/drivers/of/base.c
> >>> index 8f165b112e03..51f4bd16e613 100644
> >>> --- a/drivers/of/base.c
> >>> +++ b/drivers/of/base.c
> >>> @@ -89,7 +89,7 @@ EXPORT_SYMBOL(of_n_size_cells);
> >>> #ifdef CONFIG_NUMA
> >>> int __weak of_node_to_nid(struct device_node *np)
> >>> {
> >>> - return numa_node_id();
> >>> + return NUMA_NO_NODE;
> >>
> >>
> >> This is going to break any NUMA machine that enables OF and expects
> >> the weak function to work.
> >
> >
> > Why? NUMA_NO_NODE == -1 -- this's standard "no-affinity" signal.
> > As I see powerpc/sparc versions of of_node_to_nid returns -1 if they
> > cannot find out which node should be used.
> Ah, I was thinking those platforms were relying on the default
> implementation. I guess any real NUMA support is going to need to
> override this function. The arm64 patch series does that as well. We
> need to be sure this change is correct for metag which appears to be
> the only other OF enabled platform with NUMA support.
> In that case, then there is little reason to keep the inline and we
> can just always enable the weak function (with your change). It is
> slightly less optimal, but the few callers hardly appear to be hot
> paths.
> Rob


--
Konstantin

2015-06-04 08:25:14

by Grant Likely

Subject: Re: [PATCH] of: return NUMA_NO_NODE from fallback of_node_to_nid()

On Mon, 13 Apr 2015 11:49:31 -0500, Rob Herring <[email protected]> wrote:
> On Mon, Apr 13, 2015 at 8:38 AM, Konstantin Khlebnikov
> <[email protected]> wrote:
> > On 13.04.2015 16:22, Rob Herring wrote:
> >>
> >> On Wed, Apr 8, 2015 at 11:59 AM, Konstantin Khlebnikov
> >> <[email protected]> wrote:
> >>>
> >>> Node 0 might be offline as well as any other numa node,
> >>> in this case kernel cannot handle memory allocation and crashes.
> >>>
> >>> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> >>> Fixes: 0c3f061c195c ("of: implement of_node_to_nid as a weak function")
> >>> ---
> >>> drivers/of/base.c | 2 +-
> >>> include/linux/of.h | 5 ++++-
> >>> 2 files changed, 5 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/of/base.c b/drivers/of/base.c
> >>> index 8f165b112e03..51f4bd16e613 100644
> >>> --- a/drivers/of/base.c
> >>> +++ b/drivers/of/base.c
> >>> @@ -89,7 +89,7 @@ EXPORT_SYMBOL(of_n_size_cells);
> >>> #ifdef CONFIG_NUMA
> >>> int __weak of_node_to_nid(struct device_node *np)
> >>> {
> >>> - return numa_node_id();
> >>> + return NUMA_NO_NODE;
> >>
> >>
> >> This is going to break any NUMA machine that enables OF and expects
> >> the weak function to work.
> >
> >
> > Why? NUMA_NO_NODE == -1 -- this's standard "no-affinity" signal.
> > As I see powerpc/sparc versions of of_node_to_nid returns -1 if they
> > cannot find out which node should be used.
>
> Ah, I was thinking those platforms were relying on the default
> implementation. I guess any real NUMA support is going to need to
> override this function. The arm64 patch series does that as well. We
> need to be sure this change is correct for metag which appears to be
> the only other OF enabled platform with NUMA support.
>
> In that case, then there is little reason to keep the inline and we
> can just always enable the weak function (with your change). It is
> slightly less optimal, but the few callers hardly appear to be hot
> paths.

Sounds like you're in agreement with this patch then? Shall I apply it?

g.