2016-10-25 21:31:14

by David Daney

Subject: [PATCH 0/2] arm64, numa: Fix OOPS with numa=off

From: David Daney <[email protected]>

We get an OOPS in the arm64 kernel on NUMA systems when numa=off is
passed on the command line.

Fix it by returning NUMA_NO_NODE from of_node_to_nid when numa=off.

David Daney (2):
of, numa: Add function to disable of_node_to_nid().
arm64, numa: Force of_node_to_nid to return NUMA_NO_NODE when
numa=off.

arch/arm64/mm/numa.c | 5 ++++-
drivers/of/of_numa.c | 15 +++++++++++++++
include/linux/of.h | 2 ++
3 files changed, 21 insertions(+), 1 deletion(-)

--
1.8.3.1


2016-10-25 21:31:18

by David Daney

Subject: [PATCH 1/2] of, numa: Add function to disable of_node_to_nid().

From: David Daney <[email protected]>

On arm64 NUMA kernels we can pass "numa=off" on the command line to
disable NUMA. A side effect of this is that kmalloc_node() calls to
non-zero nodes will crash the system with an OOPS:

[ 0.000000] [<fffffc00081bba84>] __alloc_pages_nodemask+0xa4/0xe68
[ 0.000000] [<fffffc00082163a8>] new_slab+0xd0/0x57c
[ 0.000000] [<fffffc000821879c>] ___slab_alloc+0x2e4/0x514
[ 0.000000] [<fffffc000823882c>] __slab_alloc+0x48/0x58
[ 0.000000] [<fffffc00082195a0>] __kmalloc_node+0xd0/0x2e0
[ 0.000000] [<fffffc00081119b8>] __irq_domain_add+0x7c/0x164
[ 0.000000] [<fffffc0008b75d30>] its_probe+0x784/0x81c
[ 0.000000] [<fffffc0008b75e10>] its_init+0x48/0x1b0
.
.
.

This is caused by code like this in kernel/irq/irqdomain.c

domain = kzalloc_node(sizeof(*domain) + (sizeof(unsigned int) * size),
GFP_KERNEL, of_node_to_nid(of_node));

When NUMA is disabled, the concept of a node is really undefined, so
of_node_to_nid() should unconditionally return NUMA_NO_NODE.

Add __of_force_no_numa() to allow of_node_to_nid() to be forced to
return NUMA_NO_NODE.

The follow on patch will call this new function from the arm64 numa
code.

Reported-by: Gilbert Netzer <[email protected]>
Signed-off-by: David Daney <[email protected]>
---
drivers/of/of_numa.c | 15 +++++++++++++++
include/linux/of.h | 2 ++
2 files changed, 17 insertions(+)

diff --git a/drivers/of/of_numa.c b/drivers/of/of_numa.c
index f63d4b0d..2212299 100644
--- a/drivers/of/of_numa.c
+++ b/drivers/of/of_numa.c
@@ -150,12 +150,27 @@ static int __init of_numa_parse_distance_map(void)
return ret;
}

+static bool of_force_no_numa;
+
+void __of_force_no_numa(void)
+{
+ of_force_no_numa = true;
+}
+
int of_node_to_nid(struct device_node *device)
{
struct device_node *np;
u32 nid;
int r = -ENODATA;

+ /*
+ * If NUMA forced off, nodes are meaningless. Return
+ * NUMA_NO_NODE so that any node specific memory allocations
+ * can succeed from the default pool.
+ */
+ if (of_force_no_numa)
+ return NUMA_NO_NODE;
+
np = of_node_get(device);

while (np) {
diff --git a/include/linux/of.h b/include/linux/of.h
index 299aeb1..6f6244e 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -850,11 +850,13 @@ static inline void of_property_clear_flag(struct property *p, unsigned long flag

#if defined(CONFIG_OF) && defined(CONFIG_NUMA)
extern int of_node_to_nid(struct device_node *np);
+extern void __of_force_no_numa(void);
#else
static inline int of_node_to_nid(struct device_node *device)
{
return NUMA_NO_NODE;
}
+static inline void __of_force_no_numa(void) { /* Empty */ }
#endif

#ifdef CONFIG_OF_NUMA
--
1.8.3.1

2016-10-25 21:31:38

by David Daney

Subject: [PATCH 2/2] arm64, numa: Force of_node_to_nid to return NUMA_NO_NODE when numa=off.

From: David Daney <[email protected]>

When "numa=off" is passed on the command line, of_node_to_nid() still
returns the node number (which can be greater than zero). However, in
this case all the memory is associated with the dummy node zero. This
causes OOPS in kernel/irq/irqdomain.c:

domain = kzalloc_node(sizeof(*domain) + (sizeof(unsigned int) * size),
GFP_KERNEL, of_node_to_nid(of_node));
...

which in my case then caused the kernel to OOPS for the IRQ controller
on node 1:

[ 0.000000] [<fffffc00081bba84>] __alloc_pages_nodemask+0xa4/0xe68
[ 0.000000] [<fffffc00082163a8>] new_slab+0xd0/0x57c
[ 0.000000] [<fffffc000821879c>] ___slab_alloc+0x2e4/0x514
[ 0.000000] [<fffffc000823882c>] __slab_alloc+0x48/0x58
[ 0.000000] [<fffffc00082195a0>] __kmalloc_node+0xd0/0x2e0
[ 0.000000] [<fffffc00081119b8>] __irq_domain_add+0x7c/0x164
[ 0.000000] [<fffffc0008b75d30>] its_probe+0x784/0x81c
[ 0.000000] [<fffffc0008b75e10>] its_init+0x48/0x1b0

Fix by forcing of_node_to_nid() to return NUMA_NO_NODE when numa=off.
The kmalloc_node() family is perfectly happy when the node is
specified as NUMA_NO_NODE.

Reported-by: Gilbert Netzer <[email protected]>
Signed-off-by: David Daney <[email protected]>
---
arch/arm64/mm/numa.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index 778a985..6d34ebb 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -41,8 +41,10 @@ static __init int numa_parse_early_param(char *opt)
{
if (!opt)
return -EINVAL;
- if (!strncmp(opt, "off", 3))
+ if (!strncmp(opt, "off", 3)) {
+ __of_force_no_numa();
numa_off = true;
+ }

return 0;
}
@@ -432,6 +434,7 @@ static int __init dummy_numa_init(void)
return ret;
}

+ __of_force_no_numa();
numa_off = true;
return 0;
}
--
1.8.3.1

2016-10-26 13:43:08

by Robert Richter

Subject: Re: [PATCH 1/2] of, numa: Add function to disable of_node_to_nid().

On 25.10.16 14:31:00, David Daney wrote:
> From: David Daney <[email protected]>
>
> On arm64 NUMA kernels we can pass "numa=off" on the command line to
> disable NUMA. A side effect of this is that kmalloc_node() calls to
> non-zero nodes will crash the system with an OOPS:
>
> [ 0.000000] [<fffffc00081bba84>] __alloc_pages_nodemask+0xa4/0xe68
> [ 0.000000] [<fffffc00082163a8>] new_slab+0xd0/0x57c
> [ 0.000000] [<fffffc000821879c>] ___slab_alloc+0x2e4/0x514
> [ 0.000000] [<fffffc000823882c>] __slab_alloc+0x48/0x58
> [ 0.000000] [<fffffc00082195a0>] __kmalloc_node+0xd0/0x2e0
> [ 0.000000] [<fffffc00081119b8>] __irq_domain_add+0x7c/0x164
> [ 0.000000] [<fffffc0008b75d30>] its_probe+0x784/0x81c
> [ 0.000000] [<fffffc0008b75e10>] its_init+0x48/0x1b0
> .
> .
> .
>
> This is caused by code like this in kernel/irq/irqdomain.c
>
> domain = kzalloc_node(sizeof(*domain) + (sizeof(unsigned int) * size),
> GFP_KERNEL, of_node_to_nid(of_node));
>
> When NUMA is disabled, the concept of a node is really undefined, so
> of_node_to_nid() should unconditionally return NUMA_NO_NODE.
>
> Add __of_force_no_numa() to allow of_node_to_nid() to be forced to
> return NUMA_NO_NODE.
>
> The follow on patch will call this new function from the arm64 numa
> code.

Didn't that work before? numa=off just maps all mem to node 0. If mem
allocation is requested for another node it should just fall back to a
node with mem (node 0 then). I suspect there is something wrong with
the page initialization, see:

http://www.spinics.net/lists/arm-kernel/msg535191.html
https://bugzilla.redhat.com/show_bug.cgi?id=1387793

What is the complete oops?

So I think k*alloc_node() must be able to handle requests to
non-existing nodes. Otherwise your fix is incomplete, assume a failed
of_numa_init() causing a dummy init but still some devices reporting a
node.

-Robert

2016-10-26 21:33:28

by David Daney

Subject: Re: [PATCH 1/2] of, numa: Add function to disable of_node_to_nid().

On 10/26/2016 06:43 AM, Robert Richter wrote:
> On 25.10.16 14:31:00, David Daney wrote:
>> From: David Daney <[email protected]>
>>
>> On arm64 NUMA kernels we can pass "numa=off" on the command line to
>> disable NUMA. A side effect of this is that kmalloc_node() calls to
>> non-zero nodes will crash the system with an OOPS:
>>
>> [ 0.000000] [<fffffc00081bba84>] __alloc_pages_nodemask+0xa4/0xe68
>> [ 0.000000] [<fffffc00082163a8>] new_slab+0xd0/0x57c
>> [ 0.000000] [<fffffc000821879c>] ___slab_alloc+0x2e4/0x514
>> [ 0.000000] [<fffffc000823882c>] __slab_alloc+0x48/0x58
>> [ 0.000000] [<fffffc00082195a0>] __kmalloc_node+0xd0/0x2e0
>> [ 0.000000] [<fffffc00081119b8>] __irq_domain_add+0x7c/0x164
>> [ 0.000000] [<fffffc0008b75d30>] its_probe+0x784/0x81c
>> [ 0.000000] [<fffffc0008b75e10>] its_init+0x48/0x1b0
>> .
>> .
>> .
>>
>> This is caused by code like this in kernel/irq/irqdomain.c
>>
>> domain = kzalloc_node(sizeof(*domain) + (sizeof(unsigned int) * size),
>> GFP_KERNEL, of_node_to_nid(of_node));
>>
>> When NUMA is disabled, the concept of a node is really undefined, so
>> of_node_to_nid() should unconditionally return NUMA_NO_NODE.
>>
>> Add __of_force_no_numa() to allow of_node_to_nid() to be forced to
>> return NUMA_NO_NODE.
>>
>> The follow on patch will call this new function from the arm64 numa
>> code.
>
> Didn't that work before?

I am fairly certain that it used to work.

> numa=off just maps all mem to node 0.

Yes, that is the current behavior.

> If mem
> allocation is requested for another node it should just fall back to a
> node with mem (node 0 then).

This is the root of the problem. The ITS code is allocating memory. It
calls of_node_to_nid() to determine which node it resides on. The
answer in the failing case is node-1. Since we have mapped all the
memory to node-0 the __kmalloc_node(..., 1) call fails with the OOPS shown.

It could be that __kmalloc_node() used to allocate memory on a node
other than the requested node if the request couldn't be met. But in
v4.8 and later it produces that OOPS.

If you pass a node containing free memory or NUMA_NO_NODE to
__kmalloc_node(), the allocation succeeds.
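
The behavior described above can be modeled in user space. This is only a sketch, not kernel code: node_table, force_no_numa, model_node_to_nid() and model_alloc_on_node() are hypothetical names, and the assumption that a node other than 0 has no per-node structure when numa=off is an inference from the NULL pointer dereference in the OOPS, not something confirmed here.

```c
#include <assert.h>
#include <stddef.h>

#define NUMA_NO_NODE (-1)
#define MAX_NODES 4

/* Hypothetical stand-in for the kernel's per-node data. */
struct node_data {
	int has_memory;
};

/* Assumption: with numa=off only node 0 is set up; other slots stay NULL. */
static struct node_data node0 = { .has_memory = 1 };
static struct node_data *node_table[MAX_NODES] = { &node0 };

/* Hypothetical flag playing the role of of_force_no_numa in the patch. */
static int force_no_numa;

/* Model of of_node_to_nid(); firmware_nid is what the device tree reports. */
static int model_node_to_nid(int firmware_nid)
{
	if (force_no_numa)
		return NUMA_NO_NODE;
	return firmware_nid;
}

/*
 * Model of a node-aware allocation. Returns 1 on success and 0 in the
 * case where the real allocator would touch a missing node structure.
 */
static int model_alloc_on_node(int nid)
{
	if (nid == NUMA_NO_NODE)
		nid = 0;	/* no preference: use the only populated node */
	assert(nid >= 0 && nid < MAX_NODES);
	if (node_table[nid] == NULL)
		return 0;
	return node_table[nid]->has_memory;
}
```

With force_no_numa clear, a device that reports node 1 takes the failing path; with it set, the same request resolves to NUMA_NO_NODE and is served from node 0.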

When we first did these patches, I advocated removing the numa=off
feature, and requiring people to install usable firmware on their
systems. That was rejected on the grounds that not everybody has the
ability to change their firmware and we would like to allow NUMA kernels
to run on systems with defective firmware by supplying this command line
parameter. Now that I have seen requests from the wild for this, I
think it is a good idea to allow numa=off to be used to work around this
bad firmware.

The change in this patch set is fairly small, and seems to get the job
done. An alternative would be to change __kmalloc_node() to ignore the
node parameter if the request cannot be made, but I assume that there
were good reasons to have the current behavior, so that would be a much
more complicated change to make.
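
For comparison, the alternative mentioned above (ignoring the node parameter when it cannot be honored) might look roughly like the following user-space sketch. It is not a proposal for the real allocator; node_has_memory and pick_node_with_fallback() are hypothetical names, and the simple linear scan is an assumption made only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define NUMA_NO_NODE (-1)
#define MAX_NODES 4

/* Hypothetical per-node state: nonzero means the node has usable memory. */
static int node_has_memory[MAX_NODES] = { 1, 0, 0, 0 };

/*
 * Pick the node to allocate from. Honors nid when that node has memory,
 * otherwise scans for any node that does. Returns NUMA_NO_NODE when no
 * node has memory at all.
 */
static int pick_node_with_fallback(int nid)
{
	int i;

	if (nid >= 0 && nid < MAX_NODES && node_has_memory[nid])
		return nid;

	for (i = 0; i < MAX_NODES; i++)
		if (node_has_memory[i])
			return i;

	return NUMA_NO_NODE;
}
```

In this model a request for node 1 quietly lands on node 0, which is the fallback behavior Robert describes; the patch set instead avoids issuing the node 1 request in the first place.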



> I suspect there is something wrong with
> the page initialization, see:
>
> http://www.spinics.net/lists/arm-kernel/msg535191.html
> https://bugzilla.redhat.com/show_bug.cgi?id=1387793
>
> What is the complete oops?
>
> So I think k*alloc_node() must be able to handle requests to
> non-existing nodes. Otherwise your fix is incomplete, assume a failed
> of_numa_init() causing a dummy init but still some devices reporting a
> node.

.
.
.
EFI stub: Booting Linux Kernel...
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services and installing virtual address map...
[ 0.000000] Booting Linux on physical CPU 0x0
[ 0.000000] Linux version 4.8.0-rc8-dd ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #29 SMP Tue Sep 27 15:50:35 PDT 2016
[ 0.000000] Boot CPU: AArch64 Processor [431f0a10]
[ 0.000000] NUMA turned off
[ 0.000000] earlycon: pl11 at MMIO 0x000087e024000000 (options '')
[ 0.000000] bootconsole [pl11] enabled
[ 0.000000] efi: Getting EFI parameters from FDT:
[ 0.000000] efi: EFI v2.40 by Cavium Thunder cn88xx EFI jenkins_weekly_build_40-0-ga1f880f Sep 13 2016 17:05:35
[ 0.000000] efi: ACPI=0xfffff000 ACPI 2.0=0xfffff014 SMBIOS 3.0=0x10ffafcf000
[ 0.000000] cma: Reserved 512 MiB at 0x00000000c0000000
[ 0.000000] NUMA disabled
[ 0.000000] NUMA: Faking a node at [mem 0x0000000000000000-0x0000010fffffffff]
[ 0.000000] NUMA: Adding memblock [0x1400000 - 0xfffdffff] on node 0
[ 0.000000] NUMA: Adding memblock [0xfffe0000 - 0xffffffff] on node 0
[ 0.000000] NUMA: Adding memblock [0x100000000 - 0xfffffffff] on node 0
[ 0.000000] NUMA: Adding memblock [0x10000400000 - 0x10ffa38ffff] on node 0
[ 0.000000] NUMA: Adding memblock [0x10ffa390000 - 0x10ffa41ffff] on node 0
[ 0.000000] NUMA: Adding memblock [0x10ffa420000 - 0x10ffaeaffff] on node 0
[ 0.000000] NUMA: Adding memblock [0x10ffaeb0000 - 0x10ffaffffff] on node 0
[ 0.000000] NUMA: Adding memblock [0x10ffb000000 - 0x10ffffaffff] on node 0
[ 0.000000] NUMA: Adding memblock [0x10ffffb0000 - 0x10fffffffff] on node 0
[ 0.000000] NUMA: Initmem setup node 0 [mem 0x01400000-0x10fffffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x10ffffae480-0x10ffffaff7f]
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000001400000-0x00000000ffffffff]
[ 0.000000] Normal [mem 0x0000000100000000-0x0000010fffffffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000001400000-0x00000000fffdffff]
[ 0.000000] node 0: [mem 0x00000000fffe0000-0x00000000ffffffff]
[ 0.000000] node 0: [mem 0x0000000100000000-0x0000000fffffffff]
[ 0.000000] node 0: [mem 0x0000010000400000-0x0000010ffa38ffff]
[ 0.000000] node 0: [mem 0x0000010ffa390000-0x0000010ffa41ffff]
[ 0.000000] node 0: [mem 0x0000010ffa420000-0x0000010ffaeaffff]
[ 0.000000] node 0: [mem 0x0000010ffaeb0000-0x0000010ffaffffff]
[ 0.000000] node 0: [mem 0x0000010ffb000000-0x0000010ffffaffff]
[ 0.000000] node 0: [mem 0x0000010ffffb0000-0x0000010fffffffff]
[ 0.000000] Initmem setup node 0 [mem 0x0000000001400000-0x0000010fffffffff]
[ 0.000000] psci: probing for conduit method from DT.
[ 0.000000] psci: PSCIv0.2 detected in firmware.
[ 0.000000] psci: Using standard PSCI v0.2 function IDs
[ 0.000000] psci: Trusted OS resident on physical CPU 0x0
[ 0.000000] percpu: Embedded 3 pages/cpu @ffffff0ff6900000 s116736 r8192 d71680 u196608
[ 0.000000] Detected VIPT I-cache on CPU0
[ 0.000000] CPU features: enabling workaround for Cavium erratum 27456
[ 0.000000] Built 1 zonelists in Node order, mobility grouping on. Total pages: 2094720
[ 0.000000] Policy zone: Normal
[ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-4.8.0-rc8-dd root=/dev/mapper/rhel-root ro crashkernel=auto rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap LANG=en_US.UTF-8 numa=off console=ttyAMA0,115200n8 earlycon=pl011,0x87e024000000
[ 0.000000] log_buf_len individual max cpu contribution: 4096 bytes
[ 0.000000] log_buf_len total cpu_extra contributions: 389120 bytes
[ 0.000000] log_buf_len min size: 524288 bytes
[ 0.000000] log_buf_len: 1048576 bytes
[ 0.000000] early log buf free: 519176(99%)
[ 0.000000] PID hash table entries: 4096 (order: -1, 32768 bytes)
[ 0.000000] software IO TLB [mem 0xfbfd0000-0xfffd0000] (64MB) mapped at [fffffe00fbfd0000-fffffe00fffcffff]
[ 0.000000] Memory: 133391936K/134193152K available (7356K kernel code, 1359K rwdata, 3392K rodata, 1216K init, 6799K bss, 276928K reserved, 524288K cma-reserved)
[ 0.000000] Virtual kernel memory layout:
[ 0.000000] modules : 0xfffffc0000000000 - 0xfffffc0008000000 ( 128 MB)
[ 0.000000] vmalloc : 0xfffffc0008000000 - 0xfffffdff5fff0000 ( 2045 GB)
[ 0.000000] .text : 0xfffffc0008080000 - 0xfffffc00087b0000 ( 7360 KB)
[ 0.000000] .rodata : 0xfffffc00087b0000 - 0xfffffc0008b10000 ( 3456 KB)
[ 0.000000] .init : 0xfffffc0008b10000 - 0xfffffc0008c40000 ( 1216 KB)
[ 0.000000] .data : 0xfffffc0008c40000 - 0xfffffc0008d93e00 ( 1360 KB)
[ 0.000000] .bss : 0xfffffc0008d93e00 - 0xfffffc0009437d48 ( 6800 KB)
[ 0.000000] fixed : 0xfffffdff7e7d0000 - 0xfffffdff7ec00000 ( 4288 KB)
[ 0.000000] PCI I/O : 0xfffffdff7ee00000 - 0xfffffdff7fe00000 ( 16 MB)
[ 0.000000] vmemmap : 0xfffffdff80000000 - 0xfffffe0000000000 ( 2 GB maximum)
[ 0.000000] 0xfffffdff80005000 - 0xfffffdffc4000000 ( 1087 MB actual)
[ 0.000000] memory : 0xfffffe0001400000 - 0xffffff1000000000 (1114092 MB)
[ 0.000000] SLUB: HWalign=128, Order=0-3, MinObjects=0, CPUs=96, Nodes=1
[ 0.000000] Hierarchical RCU implementation.
[ 0.000000] Build-time adjustment of leaf fanout to 64.
[ 0.000000] RCU restricting CPUs from NR_CPUS=4096 to nr_cpu_ids=96.
[ 0.000000] RCU: Adjusting geometry for rcu_fanout_leaf=64, nr_cpu_ids=96
[ 0.000000] NR_IRQS:64 nr_irqs:64 0
[ 0.000000] GICv3: GIC: Using split EOI/Deactivate mode
[ 0.000000] ITS: /interrupt-controller@801000000000/gic-its@801000020000
[ 0.000000] ITS@0x0000801000020000: allocated 2097152 Devices @10001000000 (flat, esz 8, psz 64K, shr 1)
[ 0.000000] ITS: /interrupt-controller@801000000000/gic-its@901000020000
[ 0.000000] ITS@0x0000901000020000: allocated 2097152 Devices @10002000000 (flat, esz 8, psz 64K, shr 1)
[ 0.000000] Unable to handle kernel NULL pointer dereference at virtual address 00001680
[ 0.000000] pgd = fffffc0009470000
[ 0.000000] [00001680] *pgd=0000010ffff90003, *pud=0000010ffff90003, *pmd=0000010ffff90003, *pte=0000000000000000
[ 0.000000] Internal error: Oops: 96000006 [#1] SMP
[ 0.000000] Modules linked in:
[ 0.000000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.0-rc8-dd #29
[ 0.000000] Hardware name: Cavium ThunderX CN88XX board (DT)
[ 0.000000] task: fffffc0008c71c80 task.stack: fffffc0008c40000
[ 0.000000] PC is at __alloc_pages_nodemask+0xa4/0xe68
[ 0.000000] LR is at __alloc_pages_nodemask+0x38/0xe68
[ 0.000000] pc : [<fffffc00081c8950>] lr : [<fffffc00081c88e4>]
pstate: 600000c5
[ 0.000000] sp : fffffc0008c43880
[ 0.000000] x29: fffffc0008c43880 x28: ffffff000041fc00
[ 0.000000] x27: 0000000000201200 x26: 0000000000000000
[ 0.000000] x25: 0000000000000001 x24: 0000000000001680
[ 0.000000] x23: 0000000000201200 x22: fffffc0008c439c8
[ 0.000000] x21: fffffc0008c63000 x20: 0000000000201200
[ 0.000000] x19: 0000000000000000 x18: 0000000000000070
[ 0.000000] x17: 0000000000000008 x16: 0000000000000000
[ 0.000000] x15: 0000000000000000 x14: 2820303030303030
[ 0.000000] x13: 3230303031402073 x12: 6563697665442032
[ 0.000000] x11: 0000000000000020 x10: fffffc0009334000
[ 0.000000] x9 : 0000000001bfff3f x8 : 7f7f7f7f7f7f7f7f
[ 0.000000] x7 : 0000000001210111 x6 : fffffdffc00010a0
[ 0.000000] x5 : 0000000000000000 x4 : 0000000000000000
[ 0.000000] x3 : 0000000000000000 x2 : 0000000000000000
[ 0.000000] x1 : 0000000000000000 x0 : fffffc0008c63bb0
[ 0.000000]
[ 0.000000] Process swapper/0 (pid: 0, stack limit = 0xfffffc0008c40020)
[ 0.000000] Stack: (0xfffffc0008c43880 to 0xfffffc0008c44000)
[ 0.000000] 3880: fffffc0008c439f0 fffffc000821fa70 ffffff000041fc00
0000000000000200
[ 0.000000] 38a0: fffffc0008115374 0000000000000000 0000000000000000
0000000000000001
[ 0.000000] 38c0: 0000000000000000 0000000000000000 0000000000201200
ffffff000041fc00
[ 0.000000] 38e0: fffffc0008c43960 fffffc000810bc20 fffffc0008c43960
fffffc0008c43960
[ 0.000000] 3900: fffffc0008c43930 00000000ffffffd0 fffffc0008c43960
fffffc0008c43960
[ 0.000000] 3920: fffffc0008c43930 00000000ffffffd0 fffffc0008c43970
fffffc0008221658
[ 0.000000] 3940: 7f7f7f7f7f7f7f7f 0000000000000002 0101010101010101
0000000000000020
[ 0.000000] 3960: fffffc0008c43a70 fffffc0008221c04 0000000000000001
00000000024080c0
[ 0.000000] 3980: fffffc0008115374 fffffc0008bf8648 0000000000001000
0000000000000000
[ 0.000000] 39a0: ffffff000041fc00 0000000000000001 ffffff0ff691e840
ffffff000041fc00
[ 0.000000] 39c0: ffffff0ff691e840 0000000000001680 0000000000000000
0000000000000000
[ 0.000000] 39e0: 0000000100000000 0000000000000000 fffffc0008c43a70
fffffc0008221e24
[ 0.000000] 3a00: 0000000000000001 00000000024080c0 fffffc0008115374
fffffc0008bf8648
[ 0.000000] 3a20: 0000000000001000 0000000000000000 0000000000000000
0000000000000001
[ 0.000000] 3a40: ffffff0ff691e840 ffffff000041fc00 fffffc000928a1e8
024080c000000006
[ 0.000000] 3a60: fffffc0008ca6a38 000000000000005c fffffc0008c43b90
fffffc0008239498
[ 0.000000] 3a80: 00000000000000c0 ffffff000041fc00 ffffff0000424f00
0000000000000070
[ 0.000000] 3aa0: 0000000000000001 fffffc0008115374 ffffff000041fc00
fffffc00093f1000
[ 0.000000] 3ac0: ffffff0002000000 ffffff0000433000 fffffc0008c43bd0
fffffc0008a308f0
[ 0.000000] 3ae0: 0000000000010000 0000020000000000 0000000000000000
0000000000000001
[ 0.000000] 3b00: fffffc0008c43b30 fffffc000861f07c fffffc000941efc0
00000000000000c0
[ 0.000000] 3b20: ffffff0ffff44e60 00000000000000c0 fffffc0008c43b70
fffffc000861f234
[ 0.000000] 3b40: ffffff0ffff44e60 0000000000000004 ffffff0ffff44e60
fffffc0008c43c70
[ 0.000000] 3b60: 0000000000000000 fffffc0008a74460 fffffc0008c43ba0
fffffc000861f3fc
[ 0.000000] 3b80: fffffc0008c43ba0 fffffc00083ca55c fffffc0008c43bd0
fffffc0008222c20
[ 0.000000] 3ba0: ffffff000041fc00 00000000024080c0 ffffff0ff691e840
fffffc0008115374
[ 0.000000] 3bc0: 0000000000000001 00000000024080c0 fffffc0008c43c20
fffffc0008115374
[ 0.000000] 3be0: 0000000000000070 ffffff0ffff44e80 ffffff0ffff44e60
0000000000000000
[ 0.000000] 3c00: fffffc0008849a18 ffffffffffffffff 0000000000000000
ffffff0000433000
[ 0.000000] 3c20: fffffc0008c43c80 fffffc0008b461dc ffffff0000424e80
2800000000000000
[ 0.000000] 3c40: 0000000000010000 0000020000000000 0000000000000000
0000000000000400
[ 0.000000] 3c60: 0000000000000400 ffffff00004330f8 0000000000000001
ffffff0ffffabe00
[ 0.000000] 3c80: fffffc0008c43dc0 fffffc0008b462bc fffffc0008d33488
fffffc0008d33000
[ 0.000000] 3ca0: ffffff0ffff44e60 fffffc0008c6c840 ffffff0000424b00
ffffff0000424880
[ 0.000000] 3cc0: 0000000000000002 0000000000000000 0000000001bae074
0000000001f1001c
[ 0.000000] 3ce0: 0000000000000000 fffffc0008a30890 ffffff0000424b00
fffffc0008849940
[ 0.000000] 3d00: ffffff0000433020 fffffc0008a308f0 ffffff0000433008
ffffff0ffff44e60
[ 0.000000] 3d20: fffffc000ac00000 0000000000000008 0000000000000001
8107000000000000
[ 0.000000] 3d40: 00000000000000c0 0000000001000000 00000008fff44e60
0000010002000000
[ 0.000000] 3d60: 0000000000000100 81070000000000ff fffffc0008c43dc0
0000000008b462cc
[ 0.000000] 3d80: 0000901000020000 000090100021ffff ffffff0ffff44f08
0000000000000200
[ 0.000000] 3da0: 0000000000000000 0000000000000000 0000000000000000
0000000000000000
[ 0.000000] 3dc0: fffffc0008c43e10 fffffc0008b4543c fffffc0008c6c828
fffffc0008d32000
[ 0.000000] 3de0: fffffc0008c6c000 ffffff0ffff44470 fffffc0008849000
ffffff0000424880
[ 0.000000] 3e00: fffffc0008c43e10 fffffc0008b45420 fffffc0008c43e60
fffffc0008b456bc
[ 0.000000] 3e20: 0000000000000002 0000000000000003 0000000000000030
ffffff0000424880
[ 0.000000] 3e40: ffffff0ffff44470 0000000000000000 0000000000000018
fffffc0008000000
[ 0.000000] 3e60: fffffc0008c43f00 fffffc0008b5aec8 ffffff0000424700
fffffc0008c43f60
[ 0.000000] 3e80: fffffc0008c43f60 0000000000000000 fffffc0008c43f70
fffffc0008d92000
[ 0.000000] 3ea0: fffffc0008a734e0 fffffc0008a734b8 fffffc0008c43f00
0000000208b5ae3c
[ 0.000000] 3ec0: 0000000000000000 00009010805fffff ffffff0ffff44518
0000000000000200
[ 0.000000] 3ee0: 0000000000000000 0000000000000000 0000000000000000
0000000000000000
[ 0.000000] 3f00: fffffc0008c43f80 fffffc0008b43f9c fffffc0008c60000
fffffc0008b66628
[ 0.000000] 3f20: fffffc0008b66628 fffffc0008dc0000 fffffc0008c60000
ffffff0ffffac580
[ 0.000000] 3f40: 0000000002840000 0000000002870000 0000000000000020
0000000000000000
[ 0.000000] 3f60: fffffc0008c43f60 fffffc0008c43f60 fffffc0008c43f70
fffffc0008c43f70
[ 0.000000] 3f80: fffffc0008c43f90 fffffc0008b12d60 fffffc0008c43fa0
fffffc0008b10a3c
[ 0.000000] 3fa0: 0000000000000000 fffffc0008b101c4 0000010ff7a35218
0000000000000e12
[ 0.000000] 3fc0: 0000000021200000 0000000030d00980 0000000000000000
0000000001400000
[ 0.000000] 3fe0: 0000000000000000 fffffc0008b66628 0000000000000000
0000000000000000
[ 0.000000] Call trace:
[ 0.000000] Exception stack(0xfffffc0008c436b0 to 0xfffffc0008c437e0)
[ 0.000000] 36a0: 0000000000000000
0000040000000000
[ 0.000000] 36c0: fffffc0008c43880 fffffc00081c8950 ffffff0ffffaf180
0000000000000003
[ 0.000000] 36e0: fffffc0008c63000 00000000ffffffff 0000000000000001
0000000000000000
[ 0.000000] 3700: fffffc0008c43720 fffffc00081e25cc 0000000000000000
0000000001bfff3f
[ 0.000000] 3720: fffffc0008c43750 fffffc00081c8454 0000000000000012
0000000000000000
[ 0.000000] 3740: fffffffffffffff8 0000000000000012 fffffc0008c63bb0
0000000000000000
[ 0.000000] 3760: 0000000000000000 0000000000000000 0000000000000000
0000000000000000
[ 0.000000] 3780: fffffdffc00010a0 0000000001210111 7f7f7f7f7f7f7f7f
0000000001bfff3f
[ 0.000000] 37a0: fffffc0009334000 0000000000000020 6563697665442032
3230303031402073
[ 0.000000] 37c0: 2820303030303030 0000000000000000 0000000000000000
0000000000000008
[ 0.000000] [<fffffc00081c8950>] __alloc_pages_nodemask+0xa4/0xe68
[ 0.000000] [<fffffc000821fa70>] new_slab+0xd0/0x564
[ 0.000000] [<fffffc0008221e24>] ___slab_alloc+0x2e4/0x514
[ 0.000000] [<fffffc0008239498>] __slab_alloc+0x48/0x58
[ 0.000000] [<fffffc0008222c20>] __kmalloc_node+0xd0/0x2dc
[ 0.000000] [<fffffc0008115374>] __irq_domain_add+0x7c/0x164
[ 0.000000] [<fffffc0008b461dc>] its_probe+0x784/0x81c
[ 0.000000] [<fffffc0008b462bc>] its_init+0x48/0x1b0
[ 0.000000] [<fffffc0008b4543c>] gic_init_bases+0x228/0x360
[ 0.000000] [<fffffc0008b456bc>] gic_of_init+0x148/0x1cc
[ 0.000000] [<fffffc0008b5aec8>] of_irq_init+0x184/0x298
[ 0.000000] [<fffffc0008b43f9c>] irqchip_init+0x14/0x38
[ 0.000000] [<fffffc0008b12d60>] init_IRQ+0xc/0x30
[ 0.000000] [<fffffc0008b10a3c>] start_kernel+0x240/0x3b8
[ 0.000000] [<fffffc0008b101c4>] __primary_switched+0x30/0x6c
[ 0.000000] Code: 912ec2a0 b9403809 0a0902fb 37b007db (f9400300)
[ 0.000000] ---[ end trace 0000000000000000 ]---
[ 0.000000] Kernel panic - not syncing: Fatal exception
[ 0.000000] ---[ end Kernel panic - not syncing: Fatal exception


Same thing on v4.8.x and v4.9-rc?




>
> -Robert
>

2016-10-26 22:23:09

by Robert Richter

Subject: Re: [PATCH 1/2] of, numa: Add function to disable of_node_to_nid().

There has been some significant rework around
__alloc_pages_nodemask(), adding Mel and linux-mm.

-Robert

On 26.10.16 10:00:02, David Daney wrote:
> On 10/26/2016 06:43 AM, Robert Richter wrote:
> >On 25.10.16 14:31:00, David Daney wrote:
> >>From: David Daney <[email protected]>
> >>
> >>On arm64 NUMA kernels we can pass "numa=off" on the command line to
> >>disable NUMA. A side effect of this is that kmalloc_node() calls to
> >>non-zero nodes will crash the system with an OOPS:
> >>
> >>[ 0.000000] [<fffffc00081bba84>] __alloc_pages_nodemask+0xa4/0xe68
> >>[ 0.000000] [<fffffc00082163a8>] new_slab+0xd0/0x57c
> >>[ 0.000000] [<fffffc000821879c>] ___slab_alloc+0x2e4/0x514
> >>[ 0.000000] [<fffffc000823882c>] __slab_alloc+0x48/0x58
> >>[ 0.000000] [<fffffc00082195a0>] __kmalloc_node+0xd0/0x2e0
> >>[ 0.000000] [<fffffc00081119b8>] __irq_domain_add+0x7c/0x164
> >>[ 0.000000] [<fffffc0008b75d30>] its_probe+0x784/0x81c
> >>[ 0.000000] [<fffffc0008b75e10>] its_init+0x48/0x1b0
> >>.
> >>.
> >>.
> >>
> >>This is caused by code like this in kernel/irq/irqdomain.c
> >>
> >> domain = kzalloc_node(sizeof(*domain) + (sizeof(unsigned int) * size),
> >> GFP_KERNEL, of_node_to_nid(of_node));
> >>
> >>When NUMA is disabled, the concept of a node is really undefined, so
> >>of_node_to_nid() should unconditionally return NUMA_NO_NODE.
> >>
> >>Add __of_force_no_numa() to allow of_node_to_nid() to be forced to
> >>return NUMA_NO_NODE.
> >>
> >>The follow on patch will call this new function from the arm64 numa
> >>code.
> >
> >Didn't that work before?
>
> I am fairly certain that it used to work.
>
> >numa=off just maps all mem to node 0.
>
> Yes, that is the current behavior.
>
> >If mem
> >allocation is requested for another node it should just fall back to a
> >node with mem (node 0 then).
>
> This is the root of the problem. The ITS code is allocating memory. It
> calls of_node_to_nid() to determine which node it resides on. The answer in
> the failing case is node-1. Since we have mapped all the memory to node-0
> the __kmalloc_node(..., 1) call fails with the OOPS shown.
>
> It could be that __kmalloc_node() used to allocate memory on a node other
> than the requested node if the request couldn't be met. But in v4.8 and
> later it produces that OOPS.
>
> If you pass a node containing free memory or NUMA_NO_NODE to
> __kmalloc_node(), the allocation succeeds.
>
> When we first did these patches, I advocated removing the numa=off feature,
> and requiring people to install usable firmware on their systems. That was
> rejected on the grounds that not everybody has the ability to change their
> firmware and we would like to allow NUMA kernels to run on systems with
> defective firmware by supplying this command line parameter. Now that I
> have seen requests from the wild for this, I think it is a good idea to
> allow numa=off to be used to work around this bad firmware.
>
> The change in this patch set is fairly small, and seems to get the job done.
> An alternative would be to change __kmalloc_node() to ignore the node
> parameter if the request cannot be made, but I assume that there were good
> reasons to have the current behavior, so that would be a much more
> complicated change to make.
>
>
>
> >I suspect there is something wrong with
> >the page initialization, see:
> >
> > http://www.spinics.net/lists/arm-kernel/msg535191.html
> > https://bugzilla.redhat.com/show_bug.cgi?id=1387793
> >
> >What is the complete oops?
> >
> >So I think k*alloc_node() must be able to handle requests to
> >non-existing nodes. Otherwise your fix is incomplete, assume a failed
> >of_numa_init() causing a dummy init but still some devices reporting a
> >node.
>
> .
> .
> .
> EFI stub: Booting Linux Kernel...
> EFI stub: Using DTB from configuration table
> EFI stub: Exiting boot services and installing virtual address map...
> [ 0.000000] Booting Linux on physical CPU 0x0
> [ 0.000000] Linux version 4.8.0-rc8-dd ([email protected])
> (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #29 SMP Tue Sep 27
> 15:50:35 PDT 2016
> [ 0.000000] Boot CPU: AArch64 Processor [431f0a10]
> [ 0.000000] NUMA turned off
> [ 0.000000] earlycon: pl11 at MMIO 0x000087e024000000 (options '')
> [ 0.000000] bootconsole [pl11] enabled
> [ 0.000000] efi: Getting EFI parameters from FDT:
> [ 0.000000] efi: EFI v2.40 by Cavium Thunder cn88xx EFI
> jenkins_weekly_build_40-0-ga1f880f Sep 13 2016 17:05:35
> [ 0.000000] efi: ACPI=0xfffff000 ACPI 2.0=0xfffff014 SMBIOS
> 3.0=0x10ffafcf000
> [ 0.000000] cma: Reserved 512 MiB at 0x00000000c0000000
> [ 0.000000] NUMA disabled
> [ 0.000000] NUMA: Faking a node at [mem
> 0x0000000000000000-0x0000010fffffffff]
> [ 0.000000] NUMA: Adding memblock [0x1400000 - 0xfffdffff] on node 0
> [ 0.000000] NUMA: Adding memblock [0xfffe0000 - 0xffffffff] on node 0
> [ 0.000000] NUMA: Adding memblock [0x100000000 - 0xfffffffff] on node 0
> [ 0.000000] NUMA: Adding memblock [0x10000400000 - 0x10ffa38ffff] on node
> 0
> [ 0.000000] NUMA: Adding memblock [0x10ffa390000 - 0x10ffa41ffff] on node
> 0
> [ 0.000000] NUMA: Adding memblock [0x10ffa420000 - 0x10ffaeaffff] on node
> 0
> [ 0.000000] NUMA: Adding memblock [0x10ffaeb0000 - 0x10ffaffffff] on node
> 0
> [ 0.000000] NUMA: Adding memblock [0x10ffb000000 - 0x10ffffaffff] on node
> 0
> [ 0.000000] NUMA: Adding memblock [0x10ffffb0000 - 0x10fffffffff] on node
> 0
> [ 0.000000] NUMA: Initmem setup node 0 [mem 0x01400000-0x10fffffffff]
> [ 0.000000] NUMA: NODE_DATA [mem 0x10ffffae480-0x10ffffaff7f]
> [ 0.000000] Zone ranges:
> [ 0.000000] DMA [mem 0x0000000001400000-0x00000000ffffffff]
> [ 0.000000] Normal [mem 0x0000000100000000-0x0000010fffffffff]
> [ 0.000000] Movable zone start for each node
> [ 0.000000] Early memory node ranges
> [ 0.000000] node 0: [mem 0x0000000001400000-0x00000000fffdffff]
> [ 0.000000] node 0: [mem 0x00000000fffe0000-0x00000000ffffffff]
> [ 0.000000] node 0: [mem 0x0000000100000000-0x0000000fffffffff]
> [ 0.000000] node 0: [mem 0x0000010000400000-0x0000010ffa38ffff]
> [ 0.000000] node 0: [mem 0x0000010ffa390000-0x0000010ffa41ffff]
> [ 0.000000] node 0: [mem 0x0000010ffa420000-0x0000010ffaeaffff]
> [ 0.000000] node 0: [mem 0x0000010ffaeb0000-0x0000010ffaffffff]
> [ 0.000000] node 0: [mem 0x0000010ffb000000-0x0000010ffffaffff]
> [ 0.000000] node 0: [mem 0x0000010ffffb0000-0x0000010fffffffff]
> [ 0.000000] Initmem setup node 0 [mem 0x0000000001400000-0x0000010fffffffff]
> [ 0.000000] psci: probing for conduit method from DT.
> [ 0.000000] psci: PSCIv0.2 detected in firmware.
> [ 0.000000] psci: Using standard PSCI v0.2 function IDs
> [ 0.000000] psci: Trusted OS resident on physical CPU 0x0
> [ 0.000000] percpu: Embedded 3 pages/cpu @ffffff0ff6900000 s116736 r8192 d71680 u196608
> [ 0.000000] Detected VIPT I-cache on CPU0
> [ 0.000000] CPU features: enabling workaround for Cavium erratum 27456
> [ 0.000000] Built 1 zonelists in Node order, mobility grouping on. Total pages: 2094720
> [ 0.000000] Policy zone: Normal
> [ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-4.8.0-rc8-dd root=/dev/mapper/rhel-root ro crashkernel=auto rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap LANG=en_US.UTF-8 numa=off console=ttyAMA0,115200n8 earlycon=pl011,0x87e024000000
> [ 0.000000] log_buf_len individual max cpu contribution: 4096 bytes
> [ 0.000000] log_buf_len total cpu_extra contributions: 389120 bytes
> [ 0.000000] log_buf_len min size: 524288 bytes
> [ 0.000000] log_buf_len: 1048576 bytes
> [ 0.000000] early log buf free: 519176(99%)
> [ 0.000000] PID hash table entries: 4096 (order: -1, 32768 bytes)
> [ 0.000000] software IO TLB [mem 0xfbfd0000-0xfffd0000] (64MB) mapped at [fffffe00fbfd0000-fffffe00fffcffff]
> [ 0.000000] Memory: 133391936K/134193152K available (7356K kernel code, 1359K rwdata, 3392K rodata, 1216K init, 6799K bss, 276928K reserved, 524288K cma-reserved)
> [ 0.000000] Virtual kernel memory layout:
> [ 0.000000] modules : 0xfffffc0000000000 - 0xfffffc0008000000 ( 128 MB)
> [ 0.000000] vmalloc : 0xfffffc0008000000 - 0xfffffdff5fff0000 ( 2045 GB)
> [ 0.000000] .text : 0xfffffc0008080000 - 0xfffffc00087b0000 ( 7360 KB)
> [ 0.000000] .rodata : 0xfffffc00087b0000 - 0xfffffc0008b10000 ( 3456 KB)
> [ 0.000000] .init : 0xfffffc0008b10000 - 0xfffffc0008c40000 ( 1216 KB)
> [ 0.000000] .data : 0xfffffc0008c40000 - 0xfffffc0008d93e00 ( 1360 KB)
> [ 0.000000] .bss : 0xfffffc0008d93e00 - 0xfffffc0009437d48 ( 6800 KB)
> [ 0.000000] fixed : 0xfffffdff7e7d0000 - 0xfffffdff7ec00000 ( 4288 KB)
> [ 0.000000] PCI I/O : 0xfffffdff7ee00000 - 0xfffffdff7fe00000 ( 16 MB)
> [ 0.000000] vmemmap : 0xfffffdff80000000 - 0xfffffe0000000000 ( 2 GB maximum)
> [ 0.000000] 0xfffffdff80005000 - 0xfffffdffc4000000 ( 1087 MB actual)
> [ 0.000000] memory : 0xfffffe0001400000 - 0xffffff1000000000 (1114092 MB)
> [ 0.000000] SLUB: HWalign=128, Order=0-3, MinObjects=0, CPUs=96, Nodes=1
> [ 0.000000] Hierarchical RCU implementation.
> [ 0.000000] Build-time adjustment of leaf fanout to 64.
> [ 0.000000] RCU restricting CPUs from NR_CPUS=4096 to nr_cpu_ids=96.
> [ 0.000000] RCU: Adjusting geometry for rcu_fanout_leaf=64, nr_cpu_ids=96
> [ 0.000000] NR_IRQS:64 nr_irqs:64 0
> [ 0.000000] GICv3: GIC: Using split EOI/Deactivate mode
> [ 0.000000] ITS: /interrupt-controller@801000000000/gic-its@801000020000
> [ 0.000000] ITS@0x0000801000020000: allocated 2097152 Devices @10001000000 (flat, esz 8, psz 64K, shr 1)
> [ 0.000000] ITS: /interrupt-controller@801000000000/gic-its@901000020000
> [ 0.000000] ITS@0x0000901000020000: allocated 2097152 Devices @10002000000 (flat, esz 8, psz 64K, shr 1)
> [ 0.000000] Unable to handle kernel NULL pointer dereference at virtual address 00001680
> [ 0.000000] pgd = fffffc0009470000
> [ 0.000000] [00001680] *pgd=0000010ffff90003, *pud=0000010ffff90003, *pmd=0000010ffff90003, *pte=0000000000000000
> [ 0.000000] Internal error: Oops: 96000006 [#1] SMP
> [ 0.000000] Modules linked in:
> [ 0.000000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.0-rc8-dd #29
> [ 0.000000] Hardware name: Cavium ThunderX CN88XX board (DT)
> [ 0.000000] task: fffffc0008c71c80 task.stack: fffffc0008c40000
> [ 0.000000] PC is at __alloc_pages_nodemask+0xa4/0xe68
> [ 0.000000] LR is at __alloc_pages_nodemask+0x38/0xe68
> [ 0.000000] pc : [<fffffc00081c8950>] lr : [<fffffc00081c88e4>] pstate: 600000c5
> [ 0.000000] sp : fffffc0008c43880
> [ 0.000000] x29: fffffc0008c43880 x28: ffffff000041fc00
> [ 0.000000] x27: 0000000000201200 x26: 0000000000000000
> [ 0.000000] x25: 0000000000000001 x24: 0000000000001680
> [ 0.000000] x23: 0000000000201200 x22: fffffc0008c439c8
> [ 0.000000] x21: fffffc0008c63000 x20: 0000000000201200
> [ 0.000000] x19: 0000000000000000 x18: 0000000000000070
> [ 0.000000] x17: 0000000000000008 x16: 0000000000000000
> [ 0.000000] x15: 0000000000000000 x14: 2820303030303030
> [ 0.000000] x13: 3230303031402073 x12: 6563697665442032
> [ 0.000000] x11: 0000000000000020 x10: fffffc0009334000
> [ 0.000000] x9 : 0000000001bfff3f x8 : 7f7f7f7f7f7f7f7f
> [ 0.000000] x7 : 0000000001210111 x6 : fffffdffc00010a0
> [ 0.000000] x5 : 0000000000000000 x4 : 0000000000000000
> [ 0.000000] x3 : 0000000000000000 x2 : 0000000000000000
> [ 0.000000] x1 : 0000000000000000 x0 : fffffc0008c63bb0
> [ 0.000000]
> [ 0.000000] Process swapper/0 (pid: 0, stack limit = 0xfffffc0008c40020)
> [ 0.000000] Stack: (0xfffffc0008c43880 to 0xfffffc0008c44000)
> [ 0.000000] 3880: fffffc0008c439f0 fffffc000821fa70 ffffff000041fc00 0000000000000200
> [ 0.000000] 38a0: fffffc0008115374 0000000000000000 0000000000000000 0000000000000001
> [ 0.000000] 38c0: 0000000000000000 0000000000000000 0000000000201200 ffffff000041fc00
> [ 0.000000] 38e0: fffffc0008c43960 fffffc000810bc20 fffffc0008c43960 fffffc0008c43960
> [ 0.000000] 3900: fffffc0008c43930 00000000ffffffd0 fffffc0008c43960 fffffc0008c43960
> [ 0.000000] 3920: fffffc0008c43930 00000000ffffffd0 fffffc0008c43970 fffffc0008221658
> [ 0.000000] 3940: 7f7f7f7f7f7f7f7f 0000000000000002 0101010101010101 0000000000000020
> [ 0.000000] 3960: fffffc0008c43a70 fffffc0008221c04 0000000000000001 00000000024080c0
> [ 0.000000] 3980: fffffc0008115374 fffffc0008bf8648 0000000000001000 0000000000000000
> [ 0.000000] 39a0: ffffff000041fc00 0000000000000001 ffffff0ff691e840 ffffff000041fc00
> [ 0.000000] 39c0: ffffff0ff691e840 0000000000001680 0000000000000000 0000000000000000
> [ 0.000000] 39e0: 0000000100000000 0000000000000000 fffffc0008c43a70 fffffc0008221e24
> [ 0.000000] 3a00: 0000000000000001 00000000024080c0 fffffc0008115374 fffffc0008bf8648
> [ 0.000000] 3a20: 0000000000001000 0000000000000000 0000000000000000 0000000000000001
> [ 0.000000] 3a40: ffffff0ff691e840 ffffff000041fc00 fffffc000928a1e8 024080c000000006
> [ 0.000000] 3a60: fffffc0008ca6a38 000000000000005c fffffc0008c43b90 fffffc0008239498
> [ 0.000000] 3a80: 00000000000000c0 ffffff000041fc00 ffffff0000424f00 0000000000000070
> [ 0.000000] 3aa0: 0000000000000001 fffffc0008115374 ffffff000041fc00 fffffc00093f1000
> [ 0.000000] 3ac0: ffffff0002000000 ffffff0000433000 fffffc0008c43bd0 fffffc0008a308f0
> [ 0.000000] 3ae0: 0000000000010000 0000020000000000 0000000000000000 0000000000000001
> [ 0.000000] 3b00: fffffc0008c43b30 fffffc000861f07c fffffc000941efc0 00000000000000c0
> [ 0.000000] 3b20: ffffff0ffff44e60 00000000000000c0 fffffc0008c43b70 fffffc000861f234
> [ 0.000000] 3b40: ffffff0ffff44e60 0000000000000004 ffffff0ffff44e60 fffffc0008c43c70
> [ 0.000000] 3b60: 0000000000000000 fffffc0008a74460 fffffc0008c43ba0 fffffc000861f3fc
> [ 0.000000] 3b80: fffffc0008c43ba0 fffffc00083ca55c fffffc0008c43bd0 fffffc0008222c20
> [ 0.000000] 3ba0: ffffff000041fc00 00000000024080c0 ffffff0ff691e840 fffffc0008115374
> [ 0.000000] 3bc0: 0000000000000001 00000000024080c0 fffffc0008c43c20 fffffc0008115374
> [ 0.000000] 3be0: 0000000000000070 ffffff0ffff44e80 ffffff0ffff44e60 0000000000000000
> [ 0.000000] 3c00: fffffc0008849a18 ffffffffffffffff 0000000000000000 ffffff0000433000
> [ 0.000000] 3c20: fffffc0008c43c80 fffffc0008b461dc ffffff0000424e80 2800000000000000
> [ 0.000000] 3c40: 0000000000010000 0000020000000000 0000000000000000 0000000000000400
> [ 0.000000] 3c60: 0000000000000400 ffffff00004330f8 0000000000000001 ffffff0ffffabe00
> [ 0.000000] 3c80: fffffc0008c43dc0 fffffc0008b462bc fffffc0008d33488 fffffc0008d33000
> [ 0.000000] 3ca0: ffffff0ffff44e60 fffffc0008c6c840 ffffff0000424b00 ffffff0000424880
> [ 0.000000] 3cc0: 0000000000000002 0000000000000000 0000000001bae074 0000000001f1001c
> [ 0.000000] 3ce0: 0000000000000000 fffffc0008a30890 ffffff0000424b00 fffffc0008849940
> [ 0.000000] 3d00: ffffff0000433020 fffffc0008a308f0 ffffff0000433008 ffffff0ffff44e60
> [ 0.000000] 3d20: fffffc000ac00000 0000000000000008 0000000000000001 8107000000000000
> [ 0.000000] 3d40: 00000000000000c0 0000000001000000 00000008fff44e60 0000010002000000
> [ 0.000000] 3d60: 0000000000000100 81070000000000ff fffffc0008c43dc0 0000000008b462cc
> [ 0.000000] 3d80: 0000901000020000 000090100021ffff ffffff0ffff44f08 0000000000000200
> [ 0.000000] 3da0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [ 0.000000] 3dc0: fffffc0008c43e10 fffffc0008b4543c fffffc0008c6c828 fffffc0008d32000
> [ 0.000000] 3de0: fffffc0008c6c000 ffffff0ffff44470 fffffc0008849000 ffffff0000424880
> [ 0.000000] 3e00: fffffc0008c43e10 fffffc0008b45420 fffffc0008c43e60 fffffc0008b456bc
> [ 0.000000] 3e20: 0000000000000002 0000000000000003 0000000000000030 ffffff0000424880
> [ 0.000000] 3e40: ffffff0ffff44470 0000000000000000 0000000000000018 fffffc0008000000
> [ 0.000000] 3e60: fffffc0008c43f00 fffffc0008b5aec8 ffffff0000424700 fffffc0008c43f60
> [ 0.000000] 3e80: fffffc0008c43f60 0000000000000000 fffffc0008c43f70 fffffc0008d92000
> [ 0.000000] 3ea0: fffffc0008a734e0 fffffc0008a734b8 fffffc0008c43f00 0000000208b5ae3c
> [ 0.000000] 3ec0: 0000000000000000 00009010805fffff ffffff0ffff44518 0000000000000200
> [ 0.000000] 3ee0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [ 0.000000] 3f00: fffffc0008c43f80 fffffc0008b43f9c fffffc0008c60000 fffffc0008b66628
> [ 0.000000] 3f20: fffffc0008b66628 fffffc0008dc0000 fffffc0008c60000 ffffff0ffffac580
> [ 0.000000] 3f40: 0000000002840000 0000000002870000 0000000000000020 0000000000000000
> [ 0.000000] 3f60: fffffc0008c43f60 fffffc0008c43f60 fffffc0008c43f70 fffffc0008c43f70
> [ 0.000000] 3f80: fffffc0008c43f90 fffffc0008b12d60 fffffc0008c43fa0 fffffc0008b10a3c
> [ 0.000000] 3fa0: 0000000000000000 fffffc0008b101c4 0000010ff7a35218 0000000000000e12
> [ 0.000000] 3fc0: 0000000021200000 0000000030d00980 0000000000000000 0000000001400000
> [ 0.000000] 3fe0: 0000000000000000 fffffc0008b66628 0000000000000000 0000000000000000
> [ 0.000000] Call trace:
> [ 0.000000] Exception stack(0xfffffc0008c436b0 to 0xfffffc0008c437e0)
> [ 0.000000] 36a0: 0000000000000000 0000040000000000
> [ 0.000000] 36c0: fffffc0008c43880 fffffc00081c8950 ffffff0ffffaf180 0000000000000003
> [ 0.000000] 36e0: fffffc0008c63000 00000000ffffffff 0000000000000001 0000000000000000
> [ 0.000000] 3700: fffffc0008c43720 fffffc00081e25cc 0000000000000000 0000000001bfff3f
> [ 0.000000] 3720: fffffc0008c43750 fffffc00081c8454 0000000000000012 0000000000000000
> [ 0.000000] 3740: fffffffffffffff8 0000000000000012 fffffc0008c63bb0 0000000000000000
> [ 0.000000] 3760: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [ 0.000000] 3780: fffffdffc00010a0 0000000001210111 7f7f7f7f7f7f7f7f 0000000001bfff3f
> [ 0.000000] 37a0: fffffc0009334000 0000000000000020 6563697665442032 3230303031402073
> [ 0.000000] 37c0: 2820303030303030 0000000000000000 0000000000000000 0000000000000008
> [ 0.000000] [<fffffc00081c8950>] __alloc_pages_nodemask+0xa4/0xe68
> [ 0.000000] [<fffffc000821fa70>] new_slab+0xd0/0x564
> [ 0.000000] [<fffffc0008221e24>] ___slab_alloc+0x2e4/0x514
> [ 0.000000] [<fffffc0008239498>] __slab_alloc+0x48/0x58
> [ 0.000000] [<fffffc0008222c20>] __kmalloc_node+0xd0/0x2dc
> [ 0.000000] [<fffffc0008115374>] __irq_domain_add+0x7c/0x164
> [ 0.000000] [<fffffc0008b461dc>] its_probe+0x784/0x81c
> [ 0.000000] [<fffffc0008b462bc>] its_init+0x48/0x1b0
> [ 0.000000] [<fffffc0008b4543c>] gic_init_bases+0x228/0x360
> [ 0.000000] [<fffffc0008b456bc>] gic_of_init+0x148/0x1cc
> [ 0.000000] [<fffffc0008b5aec8>] of_irq_init+0x184/0x298
> [ 0.000000] [<fffffc0008b43f9c>] irqchip_init+0x14/0x38
> [ 0.000000] [<fffffc0008b12d60>] init_IRQ+0xc/0x30
> [ 0.000000] [<fffffc0008b10a3c>] start_kernel+0x240/0x3b8
> [ 0.000000] [<fffffc0008b101c4>] __primary_switched+0x30/0x6c
> [ 0.000000] Code: 912ec2a0 b9403809 0a0902fb 37b007db (f9400300)
> [ 0.000000] ---[ end trace 0000000000000000 ]---
> [ 0.000000] Kernel panic - not syncing: Fatal exception
> [ 0.000000] ---[ end Kernel panic - not syncing: Fatal exception
>
>
> Same thing on v4.8.x and v4.9-rc?
>
>
>
>
> >
> >-Robert
> >
>
>
> _______________________________________________
> linux-arm-kernel mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

2016-10-28 01:51:51

by Zhen Lei

Subject: Re: [PATCH 1/2] of, numa: Add function to disable of_node_to_nid().



On 2016/10/27 1:00, David Daney wrote:
> On 10/26/2016 06:43 AM, Robert Richter wrote:
>> On 25.10.16 14:31:00, David Daney wrote:
>>> From: David Daney <[email protected]>
>>>
>>> On arm64 NUMA kernels we can pass "numa=off" on the command line to
>>> disable NUMA. A side effect of this is that kmalloc_node() calls to
>>> non-zero nodes will crash the system with an OOPS:
>>>
>>> [ 0.000000] [<fffffc00081bba84>] __alloc_pages_nodemask+0xa4/0xe68
>>> [ 0.000000] [<fffffc00082163a8>] new_slab+0xd0/0x57c
>>> [ 0.000000] [<fffffc000821879c>] ___slab_alloc+0x2e4/0x514
>>> [ 0.000000] [<fffffc000823882c>] __slab_alloc+0x48/0x58
>>> [ 0.000000] [<fffffc00082195a0>] __kmalloc_node+0xd0/0x2e0
>>> [ 0.000000] [<fffffc00081119b8>] __irq_domain_add+0x7c/0x164
>>> [ 0.000000] [<fffffc0008b75d30>] its_probe+0x784/0x81c
>>> [ 0.000000] [<fffffc0008b75e10>] its_init+0x48/0x1b0
>>> .
>>> .
>>> .
>>>
>>> This is caused by code like this in kernel/irq/irqdomain.c
>>>
>>> domain = kzalloc_node(sizeof(*domain) + (sizeof(unsigned int) * size),
>>> GFP_KERNEL, of_node_to_nid(of_node));
>>>
>>> When NUMA is disabled, the concept of a node is really undefined, so
>>> of_node_to_nid() should unconditionally return NUMA_NO_NODE.
>>>
>>> Add __of_force_no_numa() to allow of_node_to_nid() to be forced to
>>> return NUMA_NO_NODE.
>>>
>>> The follow on patch will call this new function from the arm64 numa
>>> code.
>>
>> Didn't that work before?
>
> I am fairly certain that it used to work.
>
>> numa=off just maps all mem to node 0.
>
> Yes, that is the current behavior.
It only deals with the CPU nodes, but I think you have since added "numa-node-id" to a peripheral device node (maybe the ITS).
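
As an illustration only, here is a simplified userspace model of that lookup. It is not the kernel code: struct model_node and the model_* functions are made-up names for this sketch. A device node that carries its own "numa-node-id" reports that node even when all memory has been mapped to node 0, unless the lookup is forced to return NUMA_NO_NODE, which is what the patch proposes:

```c
#include <stdbool.h>
#include <stddef.h>

#define NUMA_NO_NODE (-1)

/* Hypothetical stand-in for struct device_node; only what the sketch needs. */
struct model_node {
	const struct model_node *parent;
	bool has_nid;	/* node carries a "numa-node-id" property */
	int nid;	/* value of that property, if present */
};

/* Models the flag that the arm64 code would set for numa=off. */
static bool model_force_no_numa;

static void model_set_force_no_numa(bool on)
{
	model_force_no_numa = on;
}

/*
 * Walk from the device towards the root and return the first
 * "numa-node-id" found, or NUMA_NO_NODE if none exists or if
 * NUMA has been forced off.
 */
static int model_node_to_nid(const struct model_node *np)
{
	if (model_force_no_numa)
		return NUMA_NO_NODE;

	for (; np != NULL; np = np->parent)
		if (np->has_nid)
			return np->nid;

	return NUMA_NO_NODE;
}
```

With the flag clear, an ITS-like node tagged with node 1 yields 1, which is the value that ends up in the failing allocation; with the flag set, every lookup yields NUMA_NO_NODE.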

>
>> If mem
>> allocation is requested for another node it should just fall back to a
>> node with mem (node 0 then).
>
> This is the root of the problem. The ITS code is allocating memory. It calls of_node_to_nid() to determine which node it resides on. The answer in the failing case is node-1. Since we have mapped all the memory to node-0 the __kmalloc_node(..., 1) call fails with the OOPS shown.
>
> It could be that __kmalloc_node() used to allocate memory on a node other than the requested node if the request couldn't be met. But in v4.8 and later it produces that OOPS.
>
> If you pass a node containing free memory or NUMA_NO_NODE to __kmalloc_node(), the allocation succeeds.
>
> When we first did these patches, I advocated removing the numa=off feature, and requiring people to install usable firmware on their systems. That was rejected on the grounds that not everybody has the ability to change their firmware and we would like to allow NUMA kernels to run on systems with defective firmware by supplying this command line parameter. Now that I have seen requests from the wild for this, I think it is a good idea to allow numa=off to be used to work around this bad firmware.
>
> The change in this patch set is fairly small, and seems to get the job done. An alternative would be to change __kmalloc_node() to ignore the node parameter if the request cannot be made, but I assume that there were good reasons to have the current behavior, so that would be a much more complicated change to make.
>
>
>
>> I suspect there is something wrong with
>> the page initialization, see:
>>
>> http://www.spinics.net/lists/arm-kernel/msg535191.html
>> https://bugzilla.redhat.com/show_bug.cgi?id=1387793
>>
>> What is the complete oops?
>>
>> So I think k*alloc_node() must be able to handle requests to
>> non-existing nodes. Otherwise your fix is incomplete, assume a failed
>> of_numa_init() causing a dummy init but still some devices reporting a
>> node.
>
> .
> .
> .
>
>
> Same thing on v4.8.x and v4.9-rc?
>
>
>
>
>>
>> -Robert
>>
>
>

2016-10-28 10:19:08

by Will Deacon

Subject: Re: [PATCH 1/2] of, numa: Add function to disable of_node_to_nid().

On Tue, Oct 25, 2016 at 02:31:00PM -0700, David Daney wrote:
> From: David Daney <[email protected]>
>
> On arm64 NUMA kernels we can pass "numa=off" on the command line to
> disable NUMA. A side effect of this is that kmalloc_node() calls to
> non-zero nodes will crash the system with an OOPS:
>
> [ 0.000000] [<fffffc00081bba84>] __alloc_pages_nodemask+0xa4/0xe68
> [ 0.000000] [<fffffc00082163a8>] new_slab+0xd0/0x57c
> [ 0.000000] [<fffffc000821879c>] ___slab_alloc+0x2e4/0x514
> [ 0.000000] [<fffffc000823882c>] __slab_alloc+0x48/0x58
> [ 0.000000] [<fffffc00082195a0>] __kmalloc_node+0xd0/0x2e0
> [ 0.000000] [<fffffc00081119b8>] __irq_domain_add+0x7c/0x164
> [ 0.000000] [<fffffc0008b75d30>] its_probe+0x784/0x81c
> [ 0.000000] [<fffffc0008b75e10>] its_init+0x48/0x1b0
> .
> .
> .
>
> This is caused by code like this in kernel/irq/irqdomain.c
>
> domain = kzalloc_node(sizeof(*domain) + (sizeof(unsigned int) * size),
> GFP_KERNEL, of_node_to_nid(of_node));
>
> When NUMA is disabled, the concept of a node is really undefined, so
> of_node_to_nid() should unconditionally return NUMA_NO_NODE.
>
> Add __of_force_no_numa() to allow of_node_to_nid() to be forced to
> return NUMA_NO_NODE.
>
> The follow on patch will call this new function from the arm64 numa
> code.
>
> Reported-by: Gilbert Netzer <[email protected]>
> Signed-off-by: David Daney <[email protected]>
> ---
> drivers/of/of_numa.c | 15 +++++++++++++++
> include/linux/of.h | 2 ++
> 2 files changed, 17 insertions(+)
>
> diff --git a/drivers/of/of_numa.c b/drivers/of/of_numa.c
> index f63d4b0d..2212299 100644
> --- a/drivers/of/of_numa.c
> +++ b/drivers/of/of_numa.c
> @@ -150,12 +150,27 @@ static int __init of_numa_parse_distance_map(void)
>  	return ret;
>  }
>
> +static bool of_force_no_numa;
> +
> +void __of_force_no_numa(void)
> +{
> +	of_force_no_numa = true;
> +}
> +
>  int of_node_to_nid(struct device_node *device)
>  {
>  	struct device_node *np;
>  	u32 nid;
>  	int r = -ENODATA;
>
> +	/*
> +	 * If NUMA forced off, nodes are meaningless. Return
> +	 * NUMA_NO_NODE so that any node specific memory allocations
> +	 * can succeed from the default pool.
> +	 */
> +	if (of_force_no_numa)
> +		return NUMA_NO_NODE;

Why don't you just check if the nid you get back from the device is set in
numa_nodes_parsed and return NUMA_NO_NODE if not?

Will

2016-10-28 17:02:43

by David Daney

Subject: Re: [PATCH 1/2] of, numa: Add function to disable of_node_to_nid().

On 10/28/2016 03:19 AM, Will Deacon wrote:
> On Tue, Oct 25, 2016 at 02:31:00PM -0700, David Daney wrote:
>> From: David Daney <[email protected]>
>>
>> On arm64 NUMA kernels we can pass "numa=off" on the command line to
>> disable NUMA. A side effect of this is that kmalloc_node() calls to
>> non-zero nodes will crash the system with an OOPS:
>>
>> [ 0.000000] [<fffffc00081bba84>] __alloc_pages_nodemask+0xa4/0xe68
>> [ 0.000000] [<fffffc00082163a8>] new_slab+0xd0/0x57c
>> [ 0.000000] [<fffffc000821879c>] ___slab_alloc+0x2e4/0x514
>> [ 0.000000] [<fffffc000823882c>] __slab_alloc+0x48/0x58
>> [ 0.000000] [<fffffc00082195a0>] __kmalloc_node+0xd0/0x2e0
>> [ 0.000000] [<fffffc00081119b8>] __irq_domain_add+0x7c/0x164
>> [ 0.000000] [<fffffc0008b75d30>] its_probe+0x784/0x81c
>> [ 0.000000] [<fffffc0008b75e10>] its_init+0x48/0x1b0
>> .
>> .
>> .
>>
>> This is caused by code like this in kernel/irq/irqdomain.c
>>
>> domain = kzalloc_node(sizeof(*domain) + (sizeof(unsigned int) * size),
>> GFP_KERNEL, of_node_to_nid(of_node));
>>
>> When NUMA is disabled, the concept of a node is really undefined, so
>> of_node_to_nid() should unconditionally return NUMA_NO_NODE.
>>
>> Add __of_force_no_numa() to allow of_node_to_nid() to be forced to
>> return NUMA_NO_NODE.
>>
>> The follow on patch will call this new function from the arm64 numa
>> code.
>>
>> Reported-by: Gilbert Netzer <[email protected]>
>> Signed-off-by: David Daney <[email protected]>
>> ---
>> drivers/of/of_numa.c | 15 +++++++++++++++
>> include/linux/of.h | 2 ++
>> 2 files changed, 17 insertions(+)
>>
>> diff --git a/drivers/of/of_numa.c b/drivers/of/of_numa.c
>> index f63d4b0d..2212299 100644
>> --- a/drivers/of/of_numa.c
>> +++ b/drivers/of/of_numa.c
>> @@ -150,12 +150,27 @@ static int __init of_numa_parse_distance_map(void)
>>  	return ret;
>>  }
>>
>> +static bool of_force_no_numa;
>> +
>> +void __of_force_no_numa(void)
>> +{
>> +	of_force_no_numa = true;
>> +}
>> +
>>  int of_node_to_nid(struct device_node *device)
>>  {
>>  	struct device_node *np;
>>  	u32 nid;
>>  	int r = -ENODATA;
>>
>> +	/*
>> +	 * If NUMA forced off, nodes are meaningless. Return
>> +	 * NUMA_NO_NODE so that any node specific memory allocations
>> +	 * can succeed from the default pool.
>> +	 */
>> +	if (of_force_no_numa)
>> +		return NUMA_NO_NODE;
>
> Why don't you just check if the nid you get back from the device is set in
> numa_nodes_parsed and return NUMA_NO_NODE if not?

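numa_nodes_parsed is __initdata, so it is discarded once init finishes,
while of_node_to_nid() is not an __init function and can be called after
that. Perhaps node_possible_map would be better.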

I will try that.

David.


>
> Will
>
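
For readers following the thread, the alternative discussed above (return
NUMA_NO_NODE whenever the nid read from the device tree is not a possible
node) can be modelled in userspace roughly as follows. This is only a
sketch under stated assumptions, not the patch that was eventually posted:
possible_map is a plain 64-bit stand-in for the kernel's node_possible_map,
and node_set_possible(), node_is_possible() and filter_nid() are
hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define NUMA_NO_NODE	(-1)
#define MAX_NUMNODES	64	/* assumption: fits in one 64-bit word */

/* Stand-in for node_possible_map; the kernel uses a nodemask_t. */
static unsigned long long possible_map;

/* Hypothetical helper: mark a node as possible. */
static void node_set_possible(int nid)
{
	assert(nid >= 0 && nid < MAX_NUMNODES);
	possible_map |= 1ULL << nid;
}

/* Hypothetical helper: is nid within range and set in the map? */
static bool node_is_possible(int nid)
{
	return nid >= 0 && nid < MAX_NUMNODES &&
	       ((possible_map >> nid) & 1ULL);
}

/*
 * Hypothetical filter for the nid found in a "numa-node-id" property:
 * pass it through only if the node is possible, otherwise fall back
 * to NUMA_NO_NODE so the allocator picks a default node.
 */
static int filter_nid(int fw_nid)
{
	return node_is_possible(fw_nid) ? fw_nid : NUMA_NO_NODE;
}
```

With only node 0 marked possible, as would be expected with numa=off,
filter_nid(0) yields 0 and filter_nid(1) yields NUMA_NO_NODE, so a device
described as sitting on node 1 no longer drives an allocation at a node
that was never set up.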