When allocating the pg_data in alloc_node_data(), we try to allocate from the
local node first and then from any node. If the second attempt also fails,
there is no memory available on any node.
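For reference, a paraphrased sketch of the fallback being described (based on
the 4.11-era memblock API; not a verbatim copy of the upstream code):

	/* Try node-local memory first, then fall back to any node. */
	nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
	if (!nd_pa) {
		nd_pa = __memblock_alloc_base(nd_size, SMP_CACHE_BYTES,
					      MEMBLOCK_ALLOC_ACCESSIBLE);
		if (!nd_pa) {
			/* Both attempts failed: no node has nd_size free bytes. */
			pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
			       nd_size, nid);
			return;
		}
	}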
This patch fixes the error message and corrects one typo.
Signed-off-by: Wei Yang <[email protected]>
---
v2:
* also print the original nid in the error message, based on Borislav's
comment
---
arch/x86/mm/numa.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 12dcad7297a5..ac632e5397aa 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -201,7 +201,7 @@ static void __init alloc_node_data(int nid)
nd_pa = __memblock_alloc_base(nd_size, SMP_CACHE_BYTES,
MEMBLOCK_ALLOC_ACCESSIBLE);
if (!nd_pa) {
- pr_err("Cannot find %zu bytes in node %d\n",
+ pr_err("Cannot find %zu bytes in any node(initial node: %d)\n",
nd_size, nid);
return;
}
@@ -225,7 +225,7 @@ static void __init alloc_node_data(int nid)
* numa_cleanup_meminfo - Cleanup a numa_meminfo
* @mi: numa_meminfo to clean up
*
- * Sanitize @mi by merging and removing unncessary memblks. Also check for
+ * Sanitize @mi by merging and removing unnecessary memblks. Also check for
* conflicts and clear unused memblks.
*
* RETURNS:
--
2.11.0
numa_nodemask_from_meminfo() sets node bits according to a numa_meminfo.
Its only two callers use it to set bits in a copy of numa_nodes_parsed based
on numa_meminfo. With the current code paths, the node information in
numa_meminfo is already a subset of numa_nodes_parsed, so setting the bits
again is unnecessary.
The following code path analysis shows that the node information in
numa_meminfo is a subset of numa_nodes_parsed:
x86_numa_init()
  numa_init()
    Case 1:
    acpi_numa_init()
      acpi_parse_memory_affinity()
        numa_add_memblk()
        node_set(numa_nodes_parsed)
      acpi_parse_slit()
        numa_nodemask_from_meminfo()

    Case 2:
    amd_numa_init()
      numa_add_memblk()
      node_set(numa_nodes_parsed)

    Case 3:
    dummy_numa_init()
      node_set(numa_nodes_parsed)
      numa_add_memblk()

    numa_register_memblks()
      numa_nodemask_from_meminfo()
From the code path analysis above, we can see that each time a memblk is
added, the corresponding bit is set in numa_nodes_parsed. It is therefore not
necessary to set it again in numa_nodemask_from_meminfo() for a copy of
numa_nodes_parsed.
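Case 3 is the simplest illustration of the pattern; paraphrased (not a
verbatim copy of the upstream code), dummy_numa_init() does essentially:

	static int __init dummy_numa_init(void)
	{
		/* The single fake node is marked as parsed ... */
		node_set(0, numa_nodes_parsed);
		/* ... and the same node gets its memory recorded in numa_meminfo. */
		numa_add_memblk(0, 0, PFN_PHYS(max_pfn));

		return 0;
	}

The ACPI and AMD paths follow the same pattern with their own node ids, so
along these paths numa_meminfo never names a node that is missing from
numa_nodes_parsed.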
This patch removes numa_nodemask_from_meminfo().
Signed-off-by: Wei Yang <[email protected]>
---
arch/x86/mm/numa.c | 21 +--------------------
1 file changed, 1 insertion(+), 20 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index ac632e5397aa..5ecc5a745c51 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -314,20 +314,6 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
return 0;
}
-/*
- * Set nodes, which have memory in @mi, in *@nodemask.
- */
-static void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
- const struct numa_meminfo *mi)
-{
- int i;
-
- for (i = 0; i < ARRAY_SIZE(mi->blk); i++)
- if (mi->blk[i].start != mi->blk[i].end &&
- mi->blk[i].nid != NUMA_NO_NODE)
- node_set(mi->blk[i].nid, *nodemask);
-}
-
/**
* numa_reset_distance - Reset NUMA distance table
*
@@ -347,16 +333,12 @@ void __init numa_reset_distance(void)
static int __init numa_alloc_distance(void)
{
- nodemask_t nodes_parsed;
size_t size;
int i, j, cnt = 0;
u64 phys;
/* size the new table and allocate it */
- nodes_parsed = numa_nodes_parsed;
- numa_nodemask_from_meminfo(&nodes_parsed, &numa_meminfo);
-
- for_each_node_mask(i, nodes_parsed)
+ for_each_node_mask(i, numa_nodes_parsed)
cnt = i;
cnt++;
size = cnt * cnt * sizeof(numa_distance[0]);
@@ -535,7 +517,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
/* Account for nodes with cpus and no memory */
node_possible_map = numa_nodes_parsed;
- numa_nodemask_from_meminfo(&node_possible_map, mi);
if (WARN_ON(nodes_empty(node_possible_map)))
return -EINVAL;
--
2.11.0
Commit-ID: 43dac8f6a74c9811454f4efbe52b48f7a802c277
Gitweb: http://git.kernel.org/tip/43dac8f6a74c9811454f4efbe52b48f7a802c277
Author: Wei Yang <[email protected]>
AuthorDate: Tue, 14 Mar 2017 11:08:00 +0800
Committer: Thomas Gleixner <[email protected]>
CommitDate: Mon, 3 Apr 2017 11:54:37 +0200
x86/mm/numa: Improve alloc_node_data() error path message
alloc_node_data() tries to allocate from the local node first and, if
that attempt fails, falls back to any node. Improve the error message to
include the initial node, for easier debugging.
Fix a typo in the comments, while at it.
Signed-off-by: Wei Yang <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Massage commit message. ]
Signed-off-by: Borislav Petkov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/mm/numa.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 12dcad7..93671d8 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -201,7 +201,7 @@ static void __init alloc_node_data(int nid)
nd_pa = __memblock_alloc_base(nd_size, SMP_CACHE_BYTES,
MEMBLOCK_ALLOC_ACCESSIBLE);
if (!nd_pa) {
- pr_err("Cannot find %zu bytes in node %d\n",
+ pr_err("Cannot find %zu bytes in any node (initial node: %d)\n",
nd_size, nid);
return;
}
@@ -225,7 +225,7 @@ static void __init alloc_node_data(int nid)
* numa_cleanup_meminfo - Cleanup a numa_meminfo
* @mi: numa_meminfo to clean up
*
- * Sanitize @mi by merging and removing unncessary memblks. Also check for
+ * Sanitize @mi by merging and removing unnecessary memblks. Also check for
* conflicts and clear unused memblks.
*
* RETURNS:
Commit-ID: 474aeffd88b87746a75583f356183d5c6caa4213
Gitweb: http://git.kernel.org/tip/474aeffd88b87746a75583f356183d5c6caa4213
Author: Wei Yang <[email protected]>
AuthorDate: Tue, 14 Mar 2017 11:08:01 +0800
Committer: Thomas Gleixner <[email protected]>
CommitDate: Mon, 3 Apr 2017 11:54:37 +0200
x86/mm/numa: Remove numa_nodemask_from_meminfo()
numa_nodemask_from_meminfo() generates a nodemask of nodes which have
memory according to a meminfo descriptor.
The two callsites of that function both set bits in copies of the
numa_nodes_parsed nodemask. In both cases, the information in the supplied
numa_meminfo is a subset of numa_nodes_parsed. So setting those bits
again is not really necessary.
Here are the three call paths which show that the supplied numa_meminfo
argument describes memory regions in nodes which are already in
numa_nodes_parsed:
x86_numa_init()
  numa_init()
    Case 1:
    acpi_numa_init()
      acpi_parse_memory_affinity()
        numa_add_memblk()
        node_set(numa_nodes_parsed)
      acpi_parse_slit()
        acpi_numa_slit_init()
          numa_set_distance()
            numa_alloc_distance()
              numa_nodemask_from_meminfo()

    Case 2:
    amd_numa_init()
      numa_add_memblk()
      node_set(numa_nodes_parsed)

    Case 3:
    dummy_numa_init()
      node_set(numa_nodes_parsed)
      numa_add_memblk()

    numa_register_memblks()
      numa_nodemask_from_meminfo()
Thus, in all three cases, the respective bit in numa_nodes_parsed is
set, which means it is not necessary to set it again in a copy of
numa_nodes_parsed.
So remove that function.
Signed-off-by: Wei Yang <[email protected]>
Cc: x86-ml <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Heavily massage commit message. ]
Signed-off-by: Borislav Petkov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/mm/numa.c | 21 +--------------------
1 file changed, 1 insertion(+), 20 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 93671d8..175f54a 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -314,20 +314,6 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
return 0;
}
-/*
- * Set nodes, which have memory in @mi, in *@nodemask.
- */
-static void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
- const struct numa_meminfo *mi)
-{
- int i;
-
- for (i = 0; i < ARRAY_SIZE(mi->blk); i++)
- if (mi->blk[i].start != mi->blk[i].end &&
- mi->blk[i].nid != NUMA_NO_NODE)
- node_set(mi->blk[i].nid, *nodemask);
-}
-
/**
* numa_reset_distance - Reset NUMA distance table
*
@@ -347,16 +333,12 @@ void __init numa_reset_distance(void)
static int __init numa_alloc_distance(void)
{
- nodemask_t nodes_parsed;
size_t size;
int i, j, cnt = 0;
u64 phys;
/* size the new table and allocate it */
- nodes_parsed = numa_nodes_parsed;
- numa_nodemask_from_meminfo(&nodes_parsed, &numa_meminfo);
-
- for_each_node_mask(i, nodes_parsed)
+ for_each_node_mask(i, numa_nodes_parsed)
cnt = i;
cnt++;
size = cnt * cnt * sizeof(numa_distance[0]);
@@ -535,7 +517,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
/* Account for nodes with cpus and no memory */
node_possible_map = numa_nodes_parsed;
- numa_nodemask_from_meminfo(&node_possible_map, mi);
if (WARN_ON(nodes_empty(node_possible_map)))
return -EINVAL;
On Tue, Mar 14, 2017 at 11:08:01AM +0800, Wei Yang wrote:
> numa_nodemask_from_meminfo() is called to set bit according to
> numa_meminfo. While the only two places for this call is used to set proper
> bit to a copy of numa_nodes_parsed from numa_meminfo. With current code
> path, those numa node information in numa_meminfo is a subset of
> numa_nodes_parsed. So it is not necessary to set the bits again.
>
> The following is a code path analysis to prove the numa node information in
> numa_meminfo is a subset of numa_nodes_parsed.
>
> x86_numa_init()
> numa_init()
> Case 1
> acpi_numa_init()
> acpi_parse_memory_affinity()
> numa_add_memblk()
> node_set(numa_nodes_parsed)
> acpi_parse_slit()
> numa_nodemask_from_meminfo()
>
> Case 2
> amd_numa_init()
> numa_add_memblk()
> node_set(numa_nodes_parsed)
>
> Case 3
> dummy_numa_init()
> node_set(numa_nodes_parsed)
> numa_add_memblk()
>
> numa_register_memblks()
> numa_nodemask_from_meminfo()
>
> From the code path analysis, we can see each time a memblk is added, the
> proper bit is set in numa_nodes_parsed, which means it is not necessary to
> set it again in numa_nodemask_from_meminfo() for a copy of
> numa_nodes_parsed.
>
> This patch removes numa_nodemask_from_meminfo().
>
> Signed-off-by: Wei Yang <[email protected]>
I've got the crash below on master/tip. Reverting the patch helps.
================================================================================
UBSAN: Undefined behaviour in /home/kas/linux/la57/mm/sparse.c:336:9
member access within null pointer of type 'struct pglist_data'
CPU: 0 PID: 0 Comm: swapper Not tainted 4.11.0-rc5-00604-gf03eaf0479bc #5084
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack+0x84/0xb8
ubsan_epilogue+0x12/0x3f
__ubsan_handle_type_mismatch+0x80/0x1a0
sparse_early_usemaps_alloc_node+0x45/0x1b0
alloc_usemap_and_memmap+0x37b/0x390
? alloc_usemap_and_memmap+0x390/0x390
? memblock_virt_alloc_try_nid+0xa4/0xb7
? 0xffffffff81000000
sparse_init+0x5e/0x31a
? 0xffffffff81000000
? 0xffffffff81000000
paging_init+0x18/0x27
setup_arch+0xc92/0xe67
? early_idt_handler_array+0x120/0x120
start_kernel+0x63/0x4e3
x86_64_start_reservations+0x2a/0x2c
x86_64_start_kernel+0x171/0x180
secondary_startup_64+0x9f/0x9f
================================================================================
BUG: unable to handle kernel paging request at 0000000000021a40
IP: sparse_early_usemaps_alloc_node+0x45/0x1b0
PGD 0
P4D 0
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 0 PID: 0 Comm: swapper Not tainted 4.11.0-rc5-00604-gf03eaf0479bc #5084
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
task: ffffffff82615280 task.stack: ffffffff82600000
RIP: 0010:sparse_early_usemaps_alloc_node+0x45/0x1b0
RSP: 0000:ffffffff82603cf8 EFLAGS: 00010082
RAX: 0000000000000002 RBX: 0000000000000800 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 0000000000000002 RDI: 0000000000000001
RBP: ffffffff82603d40 R08: 3d3d3d3d3d3d3d3d R09: 3d3d3d3d3d3d3d3d
R10: 000000000401b000 R11: 3d3d3d3d3d3d3d3d R12: 0000000000000050
R13: ffff88087bbdf000 R14: 0000000000000000 R15: 0000000000000050
FS: 0000000000000000(0000) GS:ffffffff8333f000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000021a40 CR3: 000000000540a000 CR4: 00000000000006b0
Call Trace:
alloc_usemap_and_memmap+0x37b/0x390
? alloc_usemap_and_memmap+0x390/0x390
? memblock_virt_alloc_try_nid+0xa4/0xb7
? 0xffffffff81000000
sparse_init+0x5e/0x31a
? 0xffffffff81000000
? 0xffffffff81000000
paging_init+0x18/0x27
setup_arch+0xc92/0xe67
? early_idt_handler_array+0x120/0x120
start_kernel+0x63/0x4e3
x86_64_start_reservations+0x2a/0x2c
x86_64_start_kernel+0x171/0x180
secondary_startup_64+0x9f/0x9f
Code: c1 e3 05 48 83 ec 20 48 89 55 c0 e8 1b 36 f1 fd 4e 8b 34 f5 00 95 31 83 4d 85 f6 75 0e 31 f6 48 c7 c7 e0 92 96 82 e8 e0 bf 5b fe <45> 8b 86 40 1a 02 00 31 c9 31 d2 31 f6 48 89 df e8 b9 e4 ff ff
RIP: sparse_early_usemaps_alloc_node+0x45/0x1b0 RSP: ffffffff82603cf8
CR2: 0000000000021a40
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Fatal exception
--
Kirill A. Shutemov
On Thu, Apr 06, 2017 at 03:44:59PM +0300, Kirill A. Shutemov wrote:
> I've got the crash below on master/tip. Reverting the patch helps.
>
> ================================================================================
> UBSAN: Undefined behaviour in /home/kas/linux/la57/mm/sparse.c:336:9
> member access within null pointer of type 'struct pglist_data'
> CPU: 0 PID: 0 Comm: swapper Not tainted 4.11.0-rc5-00604-gf03eaf0479bc #5084
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Oh, qemu, how do you trigger this exactly? .config and qemu cmdline pls?
Alternatively, can you run this debug diff and give me the output?
I'd like to know what is happening and how did I miss that during
review.
Thanks.
---
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 29bfcb42c4f5..e20101fed1d9 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -517,11 +517,19 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
/* Account for nodes with cpus and no memory */
node_possible_map = numa_nodes_parsed;
+
+ pr_info("%s: numa_nodes_parsed: %*pbl\n",
+ __func__, nodemask_pr_args(&numa_nodes_parsed));
+
if (WARN_ON(nodes_empty(node_possible_map)))
return -EINVAL;
for (i = 0; i < mi->nr_blks; i++) {
struct numa_memblk *mb = &mi->blk[i];
+
+ if (mb->nid != NUMA_NO_NODE)
+ pr_info("%s: nid: %d\n", __func__, mb->nid);
+
memblock_set_node(mb->start, mb->end - mb->start,
&memblock.memory, mb->nid);
}
diff --git a/mm/sparse.c b/mm/sparse.c
index db6bf3c97ea2..1f4cb635a111 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -333,6 +333,7 @@ static unsigned long * __init
sparse_early_usemaps_alloc_pgdat_section(struct pglist_data *pgdat,
unsigned long size)
{
+ pr_info("%s: node_id: %d\n", __func__, pgdat->node_id);
return memblock_virt_alloc_node_nopanic(size, pgdat->node_id);
}
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On Thu, Apr 06, 2017 at 04:59:37PM +0200, Borislav Petkov wrote:
> On Thu, Apr 06, 2017 at 03:44:59PM +0300, Kirill A. Shutemov wrote:
> > I've got the crash below on master/tip. Reverting the patch helps.
> >
> > ================================================================================
> > UBSAN: Undefined behaviour in /home/kas/linux/la57/mm/sparse.c:336:9
> > member access within null pointer of type 'struct pglist_data'
> > CPU: 0 PID: 0 Comm: swapper Not tainted 4.11.0-rc5-00604-gf03eaf0479bc #5084
> > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>
> Oh, qemu, how do you trigger this exactly? .config and qemu cmdline pls?
>
> Alternatively, can you run this debug diff and give me the output?
>
> I'd like to know what is happening and how did I miss that during
> review.
>
> Thanks.
>
> ---
>
>
>
> Oh, qemu, how do you trigger this exactly? .config and qemu cmdline pls?
qemu-system-x86_64 \
-machine "type=q35,accel=kvm:tcg" \
-cpu "kvm64" \
-smp "8" \
-m "32G" \
-chardev "stdio,mux=on,id=stdio,signal=off" \
-mon "chardev=stdio,mode=readline,default" \
-device "isa-serial,chardev=stdio" \
-kernel "/home/kas/var/linus/arch/x86/boot/bzImage" \
-nographic \
-append "console=ttyS0 numa=fake=4 earlyprintk=ttyS0" \
#
Config is attached.
Looks like fake numa is the key.
> Alternatively, can you run this debug diff and give me the output?
>
> I'd like to know what is happening and how did I miss that during review.
...
NUMA: Warning: node ids are out of bound, from=0 to=1 distance=20
[ 0.000000] numa_register_memblks: numa_nodes_parsed: 0
numa_register_memblks: nid: 0
numa_register_memblks: nid: 1
numa_register_memblks: nid: 2
numa_register_memblks: nid: 3
NODE_DATA(0) allocated [mem 0x27ffde000-0x27fffffff]
kvm-clock: Using msrs 4b564d01 and 4b564d00
kvm-clock: cpu 0, msr 8:7bfe0001, primary cpu clock
kvm-clock: using sched offset of 828966599 cycles
clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
sparse_early_usemaps_alloc_pgdat_section: node_id: 0
================================================================================
UBSAN: Undefined behaviour in /home/kas/linux/x86-gup/mm/sparse.c:336:2
member access within null pointer of type 'struct pglist_data'
CPU: 0 PID: 0 Comm: swapper Not tainted 4.11.0-rc5-00604-gf03eaf0479bc-dirty #5093
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0x84/0xb8
ubsan_epilogue+0x12/0x3f
__ubsan_handle_type_mismatch+0x80/0x1a0
sparse_early_usemaps_alloc_node+0x45/0x1ca
alloc_usemap_and_memmap+0x37b/0x390
? alloc_usemap_and_memmap+0x390/0x390
? memblock_virt_alloc_try_nid+0xa4/0xb7
? 0xffffffff81000000
sparse_init+0x5e/0x31a
? 0xffffffff81000000
? 0xffffffff81000000
paging_init+0x18/0x27
setup_arch+0xc92/0xe67
? early_idt_handler_array+0x120/0x120
start_kernel+0x63/0x4e3
x86_64_start_reservations+0x2a/0x2c
x86_64_start_kernel+0x171/0x180
secondary_startup_64+0x9f/0x9f
================================================================================
BUG: unable to handle kernel paging request at 0000000000021a40
IP: sparse_early_usemaps_alloc_node+0x45/0x1ca
PGD 0
P4D 0
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 0 PID: 0 Comm: swapper Not tainted 4.11.0-rc5-00604-gf03eaf0479bc-dirty #5093
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
task: ffffffff82615280 task.stack: ffffffff82600000
RIP: 0010:sparse_early_usemaps_alloc_node+0x45/0x1ca
RSP: 0000:ffffffff82603cf8 EFLAGS: 00010082
RAX: 0000000000000002 RBX: 0000000000000800 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 0000000000000002 RDI: 0000000000000001
RBP: ffffffff82603d40 R08: 3d3d3d3d3d3d3d3d R09: 3d3d3d3d3d3d3d3d
R10: 000000000401b000 R11: 3d3d3d3d3d3d3d3d R12: 0000000000000050
R13: ffff88087bbdf000 R14: 0000000000000000 R15: 0000000000000050
FS: 0000000000000000(0000) GS:ffffffff83340000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000021a40 CR3: 000000000260a000 CR4: 00000000000006b0
Call Trace:
alloc_usemap_and_memmap+0x37b/0x390
? alloc_usemap_and_memmap+0x390/0x390
? memblock_virt_alloc_try_nid+0xa4/0xb7
? 0xffffffff81000000
sparse_init+0x5e/0x31a
? 0xffffffff81000000
? 0xffffffff81000000
paging_init+0x18/0x27
setup_arch+0xc92/0xe67
? early_idt_handler_array+0x120/0x120
start_kernel+0x63/0x4e3
x86_64_start_reservations+0x2a/0x2c
x86_64_start_kernel+0x171/0x180
secondary_startup_64+0x9f/0x9f
Code: c1 e3 05 48 83 ec 20 48 89 55 c0 e8 b6 25 f1 fd 4e 8b 34 f5 c0 a4 31 83 4d 85 f6 75 0e 31 f6 48 c7 c7 e0 92 96 82 e8 7b af 5b fe <41> 8b 96 40 1a 02 00 48 c7 c6 e0 3d 24 82 48 c7 c7 3c 11 43 82
RIP: sparse_early_usemaps_alloc_node+0x45/0x1ca RSP: ffffffff82603cf8
CR2: 0000000000021a40
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Fatal exception
---[ end Kernel panic - not syncing: Fatal exception
--
Kirill A. Shutemov
On Thu, Apr 06, 2017 at 06:42:16PM +0300, Kirill A. Shutemov wrote:
> Config is attached.
Thanks!
> Looks like fake numa is the key.
...
> NUMA: Warning: node ids are out of bound, from=0 to=1 distance=20 [ 0.000000] numa_register_memblks: numa_nodes_parsed: 0
> numa_register_memblks: nid: 0
> numa_register_memblks: nid: 1
> numa_register_memblks: nid: 2
> numa_register_memblks: nid: 3
Yeah, the fake numa code calls emu_setup_memblk() and that doesn't
mark the fake numa nodes in numa_nodes_parsed. And since that
"cleanup", which opened more work than it saved (btw, this is the
last time I'm looking at crap like that), got rid of the "enlarging"
of the node mask to the actual node count, *that* now blows up with
numa_nodes_parsed having only node 0 in there.
Long story short, something as trivial as this helps here:
---
diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index a8f90ce3dedf..60d82b8baaa3 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -70,6 +70,9 @@ static int __init emu_setup_memblk(struct numa_meminfo *ei,
printk(KERN_INFO "Faking node %d at [mem %#018Lx-%#018Lx] (%LuMB)\n",
nid, eb->start, eb->end - 1, (eb->end - eb->start) >> 20);
+
+ node_set(nid, numa_nodes_parsed);
+
return 0;
}
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On Thu, Apr 06, 2017 at 08:01:13PM +0200, Borislav Petkov wrote:
> On Thu, Apr 06, 2017 at 06:42:16PM +0300, Kirill A. Shutemov wrote:
> > Config is attached.
>
> Thanks!
>
> > Looks like fake numa is the key.
>
> ...
>
> > NUMA: Warning: node ids are out of bound, from=0 to=1 distance=20 [ 0.000000] numa_register_memblks: numa_nodes_parsed: 0
> > numa_register_memblks: nid: 0
> > numa_register_memblks: nid: 1
> > numa_register_memblks: nid: 2
> > numa_register_memblks: nid: 3
>
> Yeah, the fake numa thing calls emu_setup_memblk() and that doesn't
> set numa_nodes_parsed to the number of fake numa nodes. And since with
> that "cleanup" which opened more work than it saved (btw, this is the
> last time I'm looking at crap like that) we got rid of the "enlarging"
> of the node mask to the actual nodes count and *that* blows up with
> numa_nodes_parsed having only node 0 in there.
>
> Long story short, something as trivial as this helps here:
Yep. Works for me.
Reported-and-tested-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
On Thu, Apr 06, 2017 at 09:21:47PM +0300, Kirill A. Shutemov wrote:
> > Long story short, something as trivial as this helps here:
>
> Yep. Works for me.
>
> Reported-and-tested-by: Kirill A. Shutemov <[email protected]>
Thanks.
Now, I'd really like to have more test coverage and be sure this
"cleanup" doesn't break anything else so Wei, please grab tip/master,
apply the oneliner from two messages ago, take Kirill's qemu cmdline
and run all fake numa scenarios you can think of to make sure your
cleanup doesn't break anything else.
Qemu can emulate real numa too, for example you can boot with:
-smp 64 \
-numa node,nodeid=0,cpus=1-8 \
-numa node,nodeid=1,cpus=9-16 \
-numa node,nodeid=2,cpus=17-24 \
-numa node,nodeid=3,cpus=25-32 \
-numa node,nodeid=4,cpus=0 \
-numa node,nodeid=4,cpus=33-39 \
-numa node,nodeid=5,cpus=40-47 \
-numa node,nodeid=6,cpus=48-55 \
-numa node,nodeid=7,cpus=56-63
after configuring the kernel accordingly.
Then, test baremetal too.
numa_emulation() should give you an idea about possible options
numa=fake takes. Documentation/x86/x86_64/boot-options.txt has some
(all?) too.
Thanks.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On Fri, Apr 7, 2017 at 2:48 AM, Borislav Petkov <[email protected]> wrote:
> On Thu, Apr 06, 2017 at 09:21:47PM +0300, Kirill A. Shutemov wrote:
>> > Long story short, something as trivial as this helps here:
>>
>> Yep. Works for me.
>>
>> Reported-and-tested-by: Kirill A. Shutemov <[email protected]>
>
> Thanks.
>
> Now, I'd really like to have more test coverage and be sure this
> "cleanup" doesn't break anything else so Wei, please grab tip/master,
> apply the oneliner from two messages ago, take Kirill's qemu cmdline
> and run all fake numa scenarios you can think of to make sure your
> cleanup doesn't break anything else.
>
Oops, sorry for introducing the regression with my cleanup.
I hadn't noticed the "numa=fake" kernel command line option, which
I think is the cause of the crash.
So from my understanding, I am going to do these tests:
1. all fake numa scenarios with Kirill's qemu command line
2. real numa scenarios with the following qemu command options
3. baremetal
One more question: on the baremetal machine, I can't change the
numa configuration, so there would be only one case. Do you have
some specific requirement?
Well, if I missed something, just let me know :-)
> Qemu can emulate real numa too, for example you can boot with:
>
> -smp 64 \
> -numa node,nodeid=0,cpus=1-8 \
> -numa node,nodeid=1,cpus=9-16 \
> -numa node,nodeid=2,cpus=17-24 \
> -numa node,nodeid=3,cpus=25-32 \
> -numa node,nodeid=4,cpus=0 \
> -numa node,nodeid=4,cpus=33-39 \
> -numa node,nodeid=5,cpus=40-47 \
> -numa node,nodeid=6,cpus=48-55 \
> -numa node,nodeid=7,cpus=56-63
>
> after configuring the kernel accordingly.
>
> Then, test baremetal too.
>
> numa_emulation() should give you an idea about possible options
> numa=fake takes. Documentation/x86/x86_64/boot-options.txt has some
> (all?) too.
>
> Thanks.
>
> --
> Regards/Gruss,
> Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.
On Sun, Apr 09, 2017 at 11:12:14AM +0800, Wei Yang wrote:
> Oops, sorry to bring in the regression with my cleanup.
> I haven't noticed there is a kernel command line "numa=fake", which
> is the cause of the crash I think.
Of course it is, didn't you see my debugging upthread?
> So from my understanding, I am going to do these tests:
>
> 1. all fake numa scenarios with Kirill's qemu command line
It is enough if you boot the kernel with "numa=fake..."
> 2. Real numa scenarios with following qemu command option
Not qemu command option but a kernel cmdline option.
> 3. Baremetal
>
> One more question, on the baremetal machine, I can't change the
> numa configuration, so there would be only one case. Do you have
> some specific requirement?
numa=fake on baremetal too.
> Well, if I missed something, just let me know :-)
>
> > Qemu can emulate real numa too, for example you can boot with:
> >
> > -smp 64 \
> > -numa node,nodeid=0,cpus=1-8 \
> > -numa node,nodeid=1,cpus=9-16 \
> > -numa node,nodeid=2,cpus=17-24 \
> > -numa node,nodeid=3,cpus=25-32 \
> > -numa node,nodeid=4,cpus=0 \
> > -numa node,nodeid=4,cpus=33-39 \
> > -numa node,nodeid=5,cpus=40-47 \
> > -numa node,nodeid=6,cpus=48-55 \
> > -numa node,nodeid=7,cpus=56-63
Also, do this in kvm. kvm can emulate a lot of numa configurations, do
experiment with those too.
Basically, try to break your "cleanup". Stuff one should do for every
patch one sends anyway.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On Mon, Apr 10, 2017 at 02:43:20PM +0200, Borislav Petkov wrote:
>On Sun, Apr 09, 2017 at 11:12:14AM +0800, Wei Yang wrote:
>> Oops, sorry to bring in the regression with my cleanup.
>> I haven't noticed there is a kernel command line "numa=fake", which
>> is the cause of the crash I think.
>
>Of course it is, didn't you see my debugging upthread?
>
>> So from my understanding, I am going to do these tests:
>>
>> 1. all fake numa scenarios with Kirill's qemu command line
>
>It is enough if you boot the kernel with "numa=fake..."
>
>> 2. Real numa scenarios with following qemu command option
>
>Not qemu command option but a kernel cmdline option.
>
>> 3. Baremetal
>>
>> One more question, on the baremetal machine, I can't change the
>> numa configuration, so there would be only one case. Do you have
>> some specific requirement?
>
>numa=fake on baremetal too.
>
>> Well, if I missed something, just let me know :-)
>>
>> > Qemu can emulate real numa too, for example you can boot with:
>> >
>> > -smp 64 \
>> > -numa node,nodeid=0,cpus=1-8 \
>> > -numa node,nodeid=1,cpus=9-16 \
>> > -numa node,nodeid=2,cpus=17-24 \
>> > -numa node,nodeid=3,cpus=25-32 \
>> > -numa node,nodeid=4,cpus=0 \
>> > -numa node,nodeid=4,cpus=33-39 \
>> > -numa node,nodeid=5,cpus=40-47 \
>> > -numa node,nodeid=6,cpus=48-55 \
>> > -numa node,nodeid=7,cpus=56-63
>
>Also, do this in kvm. kvm can emulate a lot of numa configurations, do
>experiment with those too.
>
>Basically, try to break your "cleanup". Stuff one should do for every
>patch one sends anyway.
Hi, Borislav
I have tried several fake numa test combinations. The results look good.
A test result marked P (Passed) means the system boots up and a simple
kernel build test succeeds.
# test matrix and result
## Qemu
With qemu, I have tried phys_node in {1, 4} and emu_node in {0, 2, 4, 8}:
+----------------+--------+--------+
| phys_node | 1 | 4 |
|emu_node | | |
+----------------+--------+--------+
| 0 | P | P |
+----------------+--------+--------+
| 2 | P | P |
+----------------+--------+--------+
| 4 | P | P |
+----------------+--------+--------+
| 8 | P | P |
+----------------+--------+--------+
phys_node is emulated with qemu command line:
"-numa node,nodeid=0,cpus=1-2 -numa node,nodeid=1,cpus=3-4 -numa
node,nodeid=2,cpus=0 -numa node,nodeid=2,cpus=5 -numa
node,nodeid=3,cpus=6-7"
emu_node is emulated with kernel command line:
"numa=fake=N"
## Baremetal
My machine has only one numa node, so I could only verify phys_node = 1.
+----------------+--------+
| phys_node | 1 |
|emu_node | |
+----------------+--------+
| 0 | P |
+----------------+--------+
| 2 | P |
+----------------+--------+
| 4 | P |
+----------------+--------+
| 8 | P |
+----------------+--------+
emu_node is emulated with kernel command line:
"numa=fake=N"
# Other things I observed
Generally, everything looks good in the qemu guest, while there are two things I
saw on the baremetal machine.
First, I want to emphasize that I saw the same behavior with and without my
"cleanup".
## only 3 nodes when fake=4
[ 0.000000] Faking a node at [mem 0x0000000000000000-0x000000022f5fffff]
[ 0.000000] Faking node 0 at [mem 0x0000000000000000-0x000000007fffffff] (2048MB)
[ 0.000000] Faking node 1 at [mem 0x0000000080000000-0x0000000133ffffff] (2880MB)
[ 0.000000] Faking node 2 at [mem 0x0000000134000000-0x000000022f5fffff] (4022MB)
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009cfff]
[ 0.000000] node 0: [mem 0x0000000000100000-0x000000007fffffff]
[ 0.000000] node 1: [mem 0x0000000080000000-0x00000000ba5b1fff]
[ 0.000000] node 1: [mem 0x00000000ba5b9000-0x00000000bad8dfff]
[ 0.000000] node 1: [mem 0x00000000bafb6000-0x00000000ca8a1fff]
[ 0.000000] node 1: [mem 0x00000000ca93a000-0x00000000ca977fff]
[ 0.000000] node 1: [mem 0x00000000cafff000-0x00000000caffffff]
[ 0.000000] node 1: [mem 0x0000000100000000-0x0000000133ffffff]
[ 0.000000] node 2: [mem 0x0000000134000000-0x000000022f5fffff]
## some warnings
I don't see these two warnings without "numa=fake=N".
[ 0.004000] sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
[ 0.004000] ------------[ cut here ]------------
[ 0.004000] WARNING: CPU: 1 PID: 0 at arch/x86/kernel/smpboot.c:424 topology_sane.isra.5+0x6c/0x70
[ 8.594469] sysfs: cannot create duplicate filename '/devices/platform/coretemp.0/hwmon/hwmon2/temp2_label'
[ 8.594478] ------------[ cut here ]------------
[ 8.594482] WARNING: CPU: 4 PID: 34 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x56/0x70
# Some thoughts on the code
After going through numa_emulation(), I suggest rebuilding numa_nodes_parsed
from the emulated nodes, instead of setting numa_nodes_parsed directly in
emu_setup_memblk(); see the sketch below.
Two cases come to mind which are not handled well by setting the bits directly:
1. split_nodes_size_interleave()/split_nodes_interleave() may fail, or a
   later step may fail.
2. fewer fake nodes than physical nodes may be created.
Both of these can leave numa_nodes_parsed inaccurate, so I have a patch to
rebuild it from the emulated node info.
Will send it soon.
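A hypothetical sketch of the idea (not the actual patch; the function name
numa_emu_update_nodes_parsed() is made up for illustration):

	/*
	 * Hypothetical sketch: once the emulated numa_meminfo has been built,
	 * rebuild numa_nodes_parsed from it instead of setting bits one by
	 * one in emu_setup_memblk().
	 */
	static void __init numa_emu_update_nodes_parsed(const struct numa_meminfo *ei)
	{
		int i;

		nodes_clear(numa_nodes_parsed);
		for (i = 0; i < ei->nr_blks; i++)
			if (ei->blk[i].start != ei->blk[i].end &&
			    ei->blk[i].nid != NUMA_NO_NODE)
				node_set(ei->blk[i].nid, numa_nodes_parsed);
	}

That way numa_nodes_parsed reflects exactly the nodes that survived emulation,
even if one of the split helpers fails or fewer fake nodes than physical nodes
end up being created.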
--
Wei Yang
Help you, Help me