2019-07-05 04:18:29

by Pingfan Liu

Subject: [PATCH 1/2] x86/numa: carve node online semantics out of alloc_node_data()

A node is considered online if it has either memory or cpus online. But
there is a requirement to instantiate a pglist_data for a node which has
neither cpus nor memory online (refer to [2/2]).

So carve the online semantics out of alloc_node_data(), and call
node_set_online() at the places where either memory or a cpu comes online.

Signed-off-by: Pingfan Liu <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Barret Rhoden <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
arch/x86/mm/numa.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e6dad60..b48d507 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -213,8 +213,6 @@ static void __init alloc_node_data(int nid)

node_data[nid] = nd;
memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
-
- node_set_online(nid);
}

/**
@@ -589,6 +587,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
continue;

alloc_node_data(nid);
+ node_set_online(nid);
}

/* Dump memblock with node info and return. */
@@ -760,8 +759,10 @@ void __init init_cpu_to_node(void)
if (node == NUMA_NO_NODE)
continue;

- if (!node_online(node))
+ if (!node_online(node)) {
init_memory_less_node(node);
+ node_set_online(node);
+ }

numa_set_node(cpu, node);
}
--
2.7.5


2019-07-05 04:19:55

by Pingfan Liu

Subject: [PATCH 2/2] x86/numa: instance all parsed numa node

I hit a bug on an AMD machine, with the kexec -l nr_cpus=4 option. The nr_cpus
option is used to speed up the kdump process, so it is not a rare case.

It turns out that some pgdat instances are not set up when nr_cpus is
specified, e.g. on x86 they are not initialized by
init_cpu_to_node()->init_memory_less_node(). But the device->numa_node info
is used as the preferred_nid param for __alloc_pages_nodemask(), which
causes a NULL dereference in ac->zonelist = node_zonelist(preferred_nid,
gfp_mask);
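
(For reference, node_zonelist() is roughly the following; a sketch based on
include/linux/gfp.h of that era, so details may differ:)

	static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
	{
		/* NODE_DATA(nid) is NULL when the node's pgdat was never
		 * instantiated, so this computes a small offset from NULL
		 * and the caller's read faults, which matches the low
		 * fault address (0x2088) in the oops below. */
		return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
	}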

Although this bug was detected on x86, it should affect all archs: on a
machine with a memory-less numa node, if nr_cpus prevents the node from
being instantiated and a device on that node tries to allocate memory
using its device->numa_node info, the same crash occurs.

This patch fixes the problem by instantiating all parsed numa nodes on x86.
(For more detail, please refer to sections I and II.)

I. Notes about the crashing info:
-1 kexec -l with nr_cpus=4
-2 system info
NUMA node0 CPU(s): 0,8,16,24
NUMA node1 CPU(s): 2,10,18,26
NUMA node2 CPU(s): 4,12,20,28
NUMA node3 CPU(s): 6,14,22,30
NUMA node4 CPU(s): 1,9,17,25
NUMA node5 CPU(s): 3,11,19,27
NUMA node6 CPU(s): 5,13,21,29
NUMA node7 CPU(s): 7,15,23,31
-3 panic stack
[...]
[ 5.721547] atomic64_test: passed for x86-64 platform with CX8 and with SSE
[ 5.729187] pcieport 0000:00:01.1: Signaling PME with IRQ 34
[ 5.735187] pcieport 0000:00:01.2: Signaling PME with IRQ 35
[ 5.741168] pcieport 0000:00:01.3: Signaling PME with IRQ 36
[ 5.747189] pcieport 0000:00:07.1: Signaling PME with IRQ 37
[ 5.754061] pcieport 0000:00:08.1: Signaling PME with IRQ 39
[ 5.760727] pcieport 0000:20:07.1: Signaling PME with IRQ 40
[ 5.766955] pcieport 0000:20:08.1: Signaling PME with IRQ 42
[ 5.772742] BUG: unable to handle kernel paging request at 0000000000002088
[ 5.773618] PGD 0 P4D 0
[ 5.773618] Oops: 0000 [#1] SMP NOPTI
[ 5.773618] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc1+ #3
[ 5.773618] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 06/29/2018
[ 5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
[ 5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da c1 ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 <3b> 77 08 0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89 e1 44 89 e6 89
[ 5.773618] RSP: 0018:ffffaa600005fb20 EFLAGS: 00010246
[ 5.773618] RAX: 0000000000000000 RBX: 00000000006012c0 RCX: 0000000000000000
[ 5.773618] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
[ 5.773618] RBP: 00000000006012c0 R08: 0000000000000000 R09: 0000000000000002
[ 5.773618] R10: 00000000006080c0 R11: 0000000000000002 R12: 0000000000000000
[ 5.773618] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000002
[ 5.773618] FS: 0000000000000000(0000) GS:ffff8c69afe00000(0000) knlGS:0000000000000000
[ 5.773618] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5.773618] CR2: 0000000000002088 CR3: 000000087e00a000 CR4: 00000000003406e0
[ 5.773618] Call Trace:
[ 5.773618] new_slab+0xa9/0x570
[ 5.773618] ___slab_alloc+0x375/0x540
[ 5.773618] ? pinctrl_bind_pins+0x2b/0x2a0
[ 5.773618] __slab_alloc+0x1c/0x38
[ 5.773618] __kmalloc_node_track_caller+0xc8/0x270
[ 5.773618] ? pinctrl_bind_pins+0x2b/0x2a0
[ 5.773618] devm_kmalloc+0x28/0x60
[ 5.773618] pinctrl_bind_pins+0x2b/0x2a0
[ 5.773618] really_probe+0x73/0x420
[ 5.773618] driver_probe_device+0x115/0x130
[ 5.773618] __driver_attach+0x103/0x110
[ 5.773618] ? driver_probe_device+0x130/0x130
[ 5.773618] bus_for_each_dev+0x67/0xc0
[ 5.773618] ? klist_add_tail+0x3b/0x70
[ 5.773618] bus_add_driver+0x41/0x260
[ 5.773618] ? pcie_port_setup+0x4d/0x4d
[ 5.773618] driver_register+0x5b/0xe0
[ 5.773618] ? pcie_port_setup+0x4d/0x4d
[ 5.773618] do_one_initcall+0x4e/0x1d4
[ 5.773618] ? init_setup+0x25/0x28
[ 5.773618] kernel_init_freeable+0x1c1/0x26e
[ 5.773618] ? loglevel+0x5b/0x5b
[ 5.773618] ? rest_init+0xb0/0xb0
[ 5.773618] kernel_init+0xa/0x110
[ 5.773618] ret_from_fork+0x22/0x40
[ 5.773618] Modules linked in:
[ 5.773618] CR2: 0000000000002088
[ 5.773618] ---[ end trace 1030c9120a03d081 ]---
[...]

-4 other notes about the reproduction of this bug:
On my test machine, this bug is masked by commit 0d76bcc960e6 ("Revert
"ACPI/PCI: Pay attention to device-specific _PXM node values""), but the
crash caused by dev->numa_node can still be triggered via other paths.

II. History

My original attempt was [1], which took the approach of deferring the
instantiation of offline nodes.

Later Michal suggested a fix [2], which only considers nodes with memory as
online. Besides fixing this bug, that patch also aimed at excluding
memory-less nodes as candidates when iterating the zones. It is a pity that
this method conflicts with the scheduler code, which assumes that nodes
with cpus are online too. You can find the breakage with "git grep
for_each_online_node | grep sched" or in the discussion at the tail of [3].

Since Michal has no time to continue on this issue, I have picked it up
again. This patch drops the change to the "node online" definition in [2],
i.e. a node is still considered online if it has either cpus or memory, and
keeps the main idea of [2] of initializing all parsed nodes on x86. Other
archs will need extra dedicated effort.

[1]: https://patchwork.kernel.org/patch/10738733/
[2]: https://lkml.org/lkml/2019/2/13/253
[3]: https://lore.kernel.org/lkml/[email protected]/T/

Signed-off-by: Pingfan Liu <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Qian Cai <[email protected]>
Cc: Barret Rhoden <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
arch/x86/mm/numa.c | 17 ++++++++++++-----
mm/page_alloc.c | 11 ++++++++---
2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index b48d507..5f5b558 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -732,6 +732,15 @@ static void __init init_memory_less_node(int nid)
*/
}

+static void __init init_parsed_rest_node(void)
+{
+ int node;
+
+ for_each_node_mask(node, node_possible_map)
+ if (!node_online(node))
+ init_memory_less_node(node);
+}
+
/*
* Setup early cpu_to_node.
*
@@ -752,6 +761,7 @@ void __init init_cpu_to_node(void)
u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);

BUG_ON(cpu_to_apicid == NULL);
+ init_parsed_rest_node();

for_each_possible_cpu(cpu) {
int node = numa_cpu_node(cpu);
@@ -759,11 +769,8 @@ void __init init_cpu_to_node(void)
if (node == NUMA_NO_NODE)
continue;

- if (!node_online(node)) {
- init_memory_less_node(node);
- node_set_online(node);
- }
-
+ if (!node_online(node))
+ node_set_online(node);
numa_set_node(cpu, node);
}
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d66bc8a..5d8db00 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5662,10 +5662,15 @@ static void __build_all_zonelists(void *data)
if (self && !node_online(self->node_id)) {
build_zonelists(self);
} else {
- for_each_online_node(nid) {
+ /* In rare cases, node_zonelist() can hit an offline node */
+ for_each_node(nid) {
pg_data_t *pgdat = NODE_DATA(nid);
-
- build_zonelists(pgdat);
+ /*
+ * This check can be removed on archs which instantiate
+ * all possible nodes.
+ */
+ if (pgdat)
+ build_zonelists(pgdat);
}

#ifdef CONFIG_HAVE_MEMORYLESS_NODES
--
2.7.5

2019-07-07 19:45:59

by Thomas Gleixner

Subject: Re: [PATCH 2/2] x86/numa: instance all parsed numa node

On Fri, 5 Jul 2019, Pingfan Liu wrote:

> I hit a bug on an AMD machine, with the kexec -l nr_cpus=4 option. The nr_cpus
> option is used to speed up the kdump process, so it is not a rare case.

But fundamentally wrong, really.

The rest of the CPUs are in a half baked state and any broadcast event,
e.g. MCE or a stray IPI, will result in an undiagnosable crash.

Thanks,

tglx


2019-07-08 12:23:10

by Pingfan Liu

Subject: Re: [PATCH 2/2] x86/numa: instance all parsed numa node

On Mon, Jul 8, 2019 at 3:44 AM Thomas Gleixner <[email protected]> wrote:
>
> On Fri, 5 Jul 2019, Pingfan Liu wrote:
>
> > I hit a bug on an AMD machine, with the kexec -l nr_cpus=4 option. The nr_cpus
> > option is used to speed up the kdump process, so it is not a rare case.
>
> But fundamentally wrong, really.
>
> The rest of the CPUs are in a half baked state and any broadcast event,
> e.g. MCE or a stray IPI, will result in an undiagnosable crash.
I would appreciate it if you could say more about this. I tried to figure
out your point, but failed.

For "a half baked state", I think you concern about LAPIC state, and I
expand this point like the following:

For IPI: when the capture kernel's BSP is up, the rest of the cpus still
loop inside crash_nmi_callback(), so there is no way to issue a new IPI
from these cpus. Also we call disable_local_APIC(), which effectively
prevents the LAPIC from responding to IPIs, except NMI/INIT/SIPI, which
will not occur in the crash case.
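
(A minimal sketch of that stop loop, paraphrased from
arch/x86/kernel/reboot.c of that era; simplified, details may differ:)

	static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
	{
		...
		crash_save_cpu(regs, cpu);
		disable_local_APIC();	/* stop reacting to plain IPIs */
		atomic_dec(&waiting_for_crash_ipi);
		/* park here forever; only NMI/INIT/SIPI can disturb us */
		halt();
		for (;;)
			cpu_relax();
		return NMI_HANDLED;
	}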

For MCE, I am not sure whether it can be broadcast between cpus, but to
my understanding it cannot. Is that a problem then?

From another viewpoint, is there any difference between nr_cpus=1 and
nr_cpus>1 in the crashing case? If a stray IPI raises an issue for
nr_cpus>1, it does so for nr_cpus=1.

Thanks,
Pingfan

2019-07-08 14:04:48

by Thomas Gleixner

Subject: Re: [PATCH 2/2] x86/numa: instance all parsed numa node

On Mon, 8 Jul 2019, Pingfan Liu wrote:
> On Mon, Jul 8, 2019 at 3:44 AM Thomas Gleixner <[email protected]> wrote:
> >
> > On Fri, 5 Jul 2019, Pingfan Liu wrote:
> >
> > > I hit a bug on an AMD machine, with the kexec -l nr_cpus=4 option. The nr_cpus
> > > option is used to speed up the kdump process, so it is not a rare case.
> >
> > But fundamentally wrong, really.
> >
> > The rest of the CPUs are in a half baked state and any broadcast event,
> > e.g. MCE or a stray IPI, will result in an undiagnosable crash.
> I would appreciate it if you could say more about this. I tried to figure
> out your point, but failed.
>
> For "a half baked state", I think you concern about LAPIC state, and I
> expand this point like the following:

It's not only the APIC state. It's the state of the CPUs in general.

> For IPI: when the capture kernel's BSP is up, the rest of the cpus still
> loop inside crash_nmi_callback(), so there is no way to issue a new IPI
> from these cpus. Also we call disable_local_APIC(), which effectively
> prevents the LAPIC from responding to IPIs, except NMI/INIT/SIPI, which
> will not occur in the crash case.

Fair enough for the IPI case.

> For MCE, I am not sure whether it can be broadcast between cpus, but to
> my understanding it cannot. Is that a problem then?

It can and it does.

That's the whole point why we bring up all CPUs in the 'nosmt' case and
shut the siblings down again after setting CR4.MCE. Actually that's in fact
a 'let's hope no MCE hits before that happened' approach, but that's all we
can do.

If we don't do that then the MCE broadcast can hit a CPU which has some
firmware initialized state. The result can be a full system lockup, triple
fault etc.

So when the MCE hits a CPU which is still in the crashed kernel lala state,
then all hell breaks loose.

> From another viewpoint, is there any difference between nr_cpus=1 and
> nr_cpus>1 in the crashing case? If a stray IPI raises an issue for
> nr_cpus>1, it does so for nr_cpus=1.

Anything less than the actual number of present CPUs is problematic unless
you use the 'let's hope nothing happens' approach. We could add an option
to stop the bringup at the early online state similar to what we do for
'nosmt'.

Thanks,

tglx

2019-07-08 22:46:19

by Thomas Gleixner

Subject: Re: [PATCH 2/2] x86/numa: instance all parsed numa node

On Mon, 8 Jul 2019, Andy Lutomirski wrote:
> > On Jul 8, 2019, at 3:35 AM, Thomas Gleixner <[email protected]> wrote:
> > Anything less than the actual number of present CPUs is problematic unless
> > you use the 'let's hope nothing happens' approach. We could add an option
> > to stop the bringup at the early online state similar to what we do for
> > 'nosmt'.
> >
> How about we change nr_cpus to do that instead so we never have to have
> this conversation again?

Actually not the worst idea. We have all the infrastructure already.

Thanks,

tglx

2019-07-08 22:46:45

by Andy Lutomirski

Subject: Re: [PATCH 2/2] x86/numa: instance all parsed numa node



> On Jul 8, 2019, at 3:35 AM, Thomas Gleixner <[email protected]> wrote:
>
>> On Mon, 8 Jul 2019, Pingfan Liu wrote:
>>> On Mon, Jul 8, 2019 at 3:44 AM Thomas Gleixner <[email protected]> wrote:
>>>
>>>> On Fri, 5 Jul 2019, Pingfan Liu wrote:
>>>>
>>>> I hit a bug on an AMD machine, with kexec -l nr_cpus=4 option. nr_cpus option
>>>> is used to speed up kdump process, so it is not a rare case.
>>>
>>> But fundamentally wrong, really.
>>>
>>> The rest of the CPUs are in a half baken state and any broadcast event,
>>> e.g. MCE or a stray IPI, will result in a undiagnosable crash.
>> Very appreciate if you can pay more word on it? I tried to figure out
>> your point, but fail.
>>
>> For "a half baked state", I think you concern about LAPIC state, and I
>> expand this point like the following:
>
> It's not only the APIC state. It's the state of the CPUs in general.
>
>> For IPI: when capture kernel BSP is up, the rest cpus are still loop
>> inside crash_nmi_callback(), so there is no way to eject new IPI from
>> these cpu. Also we disable_local_APIC(), which effectively prevent the
>> LAPIC from responding to IPI, except NMI/INIT/SIPI, which will not
>> occur in crash case.
>
> Fair enough for the IPI case.
>
>> For MCE, I am not sure whether it can broadcast or not between cpus,
>> but as my understanding, it can not. Then is it a problem?
>
> It can and it does.
>
> That's the whole point why we bring up all CPUs in the 'nosmt' case and
> shut the siblings down again after setting CR4.MCE. Actually that's in fact
> a 'let's hope no MCE hits before that happened' approach, but that's all we
> can do.
>
> If we don't do that then the MCE broadcast can hit a CPU which has some
> firmware initialized state. The result can be a full system lockup, triple
> fault etc.
>
> So when the MCE hits a CPU which is still in the crashed kernel lala state,
> then all hell breaks loose.
>
>> From another view point, is there any difference between nr_cpus=1 and
>> nr_cpus> 1 in crashing case? If stray IPI raises issue to nr_cpus>1,
>> it does for nr_cpus=1.
>
> Anything less than the actual number of present CPUs is problematic unless
> you use the 'let's hope nothing happens' approach. We could add an option
> to stop the bringup at the early online state similar to what we do for
> 'nosmt'.
>
>

How about we change nr_cpus to do that instead so we never have to have this conversation again?

2019-07-09 04:21:29

by Pingfan Liu

Subject: Re: [PATCH 2/2] x86/numa: instance all parsed numa node

On Mon, Jul 8, 2019 at 5:35 PM Thomas Gleixner <[email protected]> wrote:
>
> On Mon, 8 Jul 2019, Pingfan Liu wrote:
> > On Mon, Jul 8, 2019 at 3:44 AM Thomas Gleixner <[email protected]> wrote:
> > >
> > > On Fri, 5 Jul 2019, Pingfan Liu wrote:
> > >
> > > > I hit a bug on an AMD machine, with the kexec -l nr_cpus=4 option. The nr_cpus
> > > > option is used to speed up the kdump process, so it is not a rare case.
> > >
> > > But fundamentally wrong, really.
> > >
> > > The rest of the CPUs are in a half baked state and any broadcast event,
> > > e.g. MCE or a stray IPI, will result in an undiagnosable crash.
> > I would appreciate it if you could say more about this. I tried to figure
> > out your point, but failed.
> >
> > For "a half baked state", I think you concern about LAPIC state, and I
> > expand this point like the following:
>
> It's not only the APIC state. It's the state of the CPUs in general.
For other states, "kexec -l" acts as a kind of boot loader, and the boot
cpu complies with the kernel boot-up provisions. As for the rest of the
APs, they are parked in a loop before receiving the INIT IPI. The
remaining steps are then the same as a normal SMP boot.

>
> > For IPI: when the capture kernel's BSP is up, the rest of the cpus still
> > loop inside crash_nmi_callback(), so there is no way to issue a new IPI
> > from these cpus. Also we call disable_local_APIC(), which effectively
> > prevents the LAPIC from responding to IPIs, except NMI/INIT/SIPI, which
> > will not occur in the crash case.
>
> Fair enough for the IPI case.
>
> > For MCE, I am not sure whether it can be broadcast between cpus, but to
> > my understanding it cannot. Is that a problem then?
>
> It can and it does.
>
> That's the whole point why we bring up all CPUs in the 'nosmt' case and
> shut the siblings down again after setting CR4.MCE. Actually that's in fact
> a 'let's hope no MCE hits before that happened' approach, but that's all we
> can do.
>
> If we don't do that then the MCE broadcast can hit a CPU which has some
> firmware initialized state. The result can be a full system lockup, triple
> fault etc.
>
> So when the MCE hits a CPU which is still in the crashed kernel lala state,
> then all hell breaks loose.
Thank you for the comprehensive explanation. With your guidance, I now
have a full understanding of the issue.

But when I tried to add something to enable CR4.MCE in
crash_nmi_callback(), I realized that it is not doable in some cases (if
crashed, we will not ask an offline SMT cpu to come online), and it is
also needless: "kexec -l/-p" takes advantage of the cpu state in the
first kernel, where every logical cpu has CR4.MCE=1.

So kexec is exempt from this bug if the first kernel has already done so.
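
(Where the first kernel does that, as far as I can tell; a sketch
paraphrased from arch/x86/kernel/cpu/mce/core.c, not verbatim:)

	static void __mcheck_cpu_init_generic(void)
	{
		...
		/* every cpu booted by the first kernel gets CR4.MCE set */
		cr4_set_bits(X86_CR4_MCE);
	}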
>
> > From another viewpoint, is there any difference between nr_cpus=1 and
> > nr_cpus>1 in the crashing case? If a stray IPI raises an issue for
> > nr_cpus>1, it does so for nr_cpus=1.
>
> Anything less than the actual number of present CPUs is problematic unless
> you use the 'let's hope nothing happens' approach. We could add an option
> to stop the bringup at the early online state similar to what we do for
> 'nosmt'.
Yes, we should do something about the nr_cpus param for the first kernel.

Thanks,
Pingfan

2019-07-09 04:27:41

by Pingfan Liu

Subject: Re: [PATCH 2/2] x86/numa: instance all parsed numa node

On Tue, Jul 9, 2019 at 1:53 AM Andy Lutomirski <[email protected]> wrote:
>
>
>
> > On Jul 8, 2019, at 3:35 AM, Thomas Gleixner <[email protected]> wrote:
> >
> >> On Mon, 8 Jul 2019, Pingfan Liu wrote:
> >>> On Mon, Jul 8, 2019 at 3:44 AM Thomas Gleixner <[email protected]> wrote:
> >>>
> >>>> On Fri, 5 Jul 2019, Pingfan Liu wrote:
> >>>>
> >>>> I hit a bug on an AMD machine, with kexec -l nr_cpus=4 option. nr_cpus option
> >>>> is used to speed up kdump process, so it is not a rare case.
> >>>
> >>> But fundamentally wrong, really.
> >>>
> >>> The rest of the CPUs are in a half baken state and any broadcast event,
> >>> e.g. MCE or a stray IPI, will result in a undiagnosable crash.
> >> Very appreciate if you can pay more word on it? I tried to figure out
> >> your point, but fail.
> >>
> >> For "a half baked state", I think you concern about LAPIC state, and I
> >> expand this point like the following:
> >
> > It's not only the APIC state. It's the state of the CPUs in general.
> >
> >> For IPI: when capture kernel BSP is up, the rest cpus are still loop
> >> inside crash_nmi_callback(), so there is no way to eject new IPI from
> >> these cpu. Also we disable_local_APIC(), which effectively prevent the
> >> LAPIC from responding to IPI, except NMI/INIT/SIPI, which will not
> >> occur in crash case.
> >
> > Fair enough for the IPI case.
> >
> >> For MCE, I am not sure whether it can broadcast or not between cpus,
> >> but as my understanding, it can not. Then is it a problem?
> >
> > It can and it does.
> >
> > That's the whole point why we bring up all CPUs in the 'nosmt' case and
> > shut the siblings down again after setting CR4.MCE. Actually that's in fact
> > a 'let's hope no MCE hits before that happened' approach, but that's all we
> > can do.
> >
> > If we don't do that then the MCE broadcast can hit a CPU which has some
> > firmware initialized state. The result can be a full system lockup, triple
> > fault etc.
> >
> > So when the MCE hits a CPU which is still in the crashed kernel lala state,
> > then all hell breaks loose.
> >
> >> From another view point, is there any difference between nr_cpus=1 and
> >> nr_cpus> 1 in crashing case? If stray IPI raises issue to nr_cpus>1,
> >> it does for nr_cpus=1.
> >
> > Anything less than the actual number of present CPUs is problematic unless
> > you use the 'let's hope nothing happens' approach. We could add an option
> > to stop the bringup at the early online state similar to what we do for
> > 'nosmt'.
> >
> >
>
> How about we change nr_cpus to do that instead so we never have to have this conversation again?
Are you interested in implementing this?

Thanks,
Pingfan

2019-07-09 06:21:21

by Thomas Gleixner

Subject: Re: [PATCH 2/2] x86/numa: instance all parsed numa node

On Tue, 9 Jul 2019, Pingfan Liu wrote:
> On Mon, Jul 8, 2019 at 5:35 PM Thomas Gleixner <[email protected]> wrote:
> > It can and it does.
> >
> > That's the whole point why we bring up all CPUs in the 'nosmt' case and
> > shut the siblings down again after setting CR4.MCE. Actually that's in fact
> > a 'let's hope no MCE hits before that happened' approach, but that's all we
> > can do.
> >
> > If we don't do that then the MCE broadcast can hit a CPU which has some
> > firmware initialized state. The result can be a full system lockup, triple
> > fault etc.
> >
> > So when the MCE hits a CPU which is still in the crashed kernel lala state,
> > then all hell breaks loose.
> Thank you for the comprehensive explanation. With your guidance, I now
> have a full understanding of the issue.
>
> But when I tried to add something to enable CR4.MCE in
> crash_nmi_callback(), I realized that it is not doable in some cases (if
> crashed, we will not ask an offline SMT cpu to come online), and it is
> also needless: "kexec -l/-p" takes advantage of the cpu state in the
> first kernel, where every logical cpu has CR4.MCE=1.
>
> So kexec is exempt from this bug if the first kernel has already done so.

No. If the MCE broadcast is handled by a CPU which is stuck in the old
kernel stop loop, then it will execute on the old kernel and eventually run
into the memory corruption which crashed the old one.

Thanks,

tglx

2019-07-09 07:26:02

by Pingfan Liu

Subject: Re: [PATCH 2/2] x86/numa: instance all parsed numa node

On Tue, Jul 9, 2019 at 2:12 PM Thomas Gleixner <[email protected]> wrote:
>
> On Tue, 9 Jul 2019, Pingfan Liu wrote:
> > On Mon, Jul 8, 2019 at 5:35 PM Thomas Gleixner <[email protected]> wrote:
> > > It can and it does.
> > >
> > > That's the whole point why we bring up all CPUs in the 'nosmt' case and
> > > shut the siblings down again after setting CR4.MCE. Actually that's in fact
> > > a 'let's hope no MCE hits before that happened' approach, but that's all we
> > > can do.
> > >
> > > If we don't do that then the MCE broadcast can hit a CPU which has some
> > > firmware initialized state. The result can be a full system lockup, triple
> > > fault etc.
> > >
> > > So when the MCE hits a CPU which is still in the crashed kernel lala state,
> > > then all hell breaks loose.
> > Thank you for the comprehensive explanation. With your guidance, I now
> > have a full understanding of the issue.
> >
> > But when I tried to add something to enable CR4.MCE in
> > crash_nmi_callback(), I realized that it is not doable in some cases (if
> > crashed, we will not ask an offline SMT cpu to come online), and it is
> > also needless: "kexec -l/-p" takes advantage of the cpu state in the
> > first kernel, where every logical cpu has CR4.MCE=1.
> >
> > So kexec is exempt from this bug if the first kernel has already done so.
>
> No. If the MCE broadcast is handled by a CPU which is stuck in the old
> kernel stop loop, then it will execute on the old kernel and eventually run
> into the memory corruption which crashed the old one.
>
Yes, you are right. A stuck cpu may execute the old do_machine_check()
code. But I just found out that we have
do_machine_check()->__mc_check_crashing_cpu() to guard against this case.

And I think the MCE issue with nr_cpus is not closely related to this
series; it can be a separate issue. My question is whether Andy will take
it; if not, I am glad to do it.

Thanks and regards,
Pingfan

2019-07-09 13:34:52

by Andy Lutomirski

Subject: Re: [PATCH 2/2] x86/numa: instance all parsed numa node



> On Jul 9, 2019, at 1:24 AM, Pingfan Liu <[email protected]> wrote:
>
>> On Tue, Jul 9, 2019 at 2:12 PM Thomas Gleixner <[email protected]> wrote:
>>
>>> On Tue, 9 Jul 2019, Pingfan Liu wrote:
>>>> On Mon, Jul 8, 2019 at 5:35 PM Thomas Gleixner <[email protected]> wrote:
>>>> It can and it does.
>>>>
>>>> That's the whole point why we bring up all CPUs in the 'nosmt' case and
>>>> shut the siblings down again after setting CR4.MCE. Actually that's in fact
>>>> a 'let's hope no MCE hits before that happened' approach, but that's all we
>>>> can do.
>>>>
>>>> If we don't do that then the MCE broadcast can hit a CPU which has some
>>>> firmware initialized state. The result can be a full system lockup, triple
>>>> fault etc.
>>>>
>>>> So when the MCE hits a CPU which is still in the crashed kernel lala state,
>>>> then all hell breaks lose.
>>> Thank you for the comprehensive explain. With your guide, now, I have
>>> a full understanding of the issue.
>>>
>>> But when I tried to add something to enable CR4.MCE in
>>> crash_nmi_callback(), I realized that it is undo-able in some case (if
>>> crashed, we will not ask an offline smt cpu to online), also it is
>>> needless. "kexec -l/-p" takes the advantage of the cpu state in the
>>> first kernel, where all logical cpu has CR4.MCE=1.
>>>
>>> So kexec is exempt from this bug if the first kernel already do it.
>>
>> No. If the MCE broadcast is handled by a CPU which is stuck in the old
>> kernel stop loop, then it will execute on the old kernel and eventually run
>> into the memory corruption which crashed the old one.
>>
> Yes, you are right. A stuck cpu may execute the old do_machine_check()
> code. But I just found out that we have
> do_machine_check()->__mc_check_crashing_cpu() to guard against this case.
>
> And I think the MCE issue with nr_cpus is not closely related to this
> series; it can be a separate issue. My question is whether Andy will take
> it; if not, I am glad to do it.
>
>

Go for it. I’m not familiar enough with the SMP boot stuff that I would be able to do it any faster than you. I’ll gladly help review it.

2019-07-10 08:41:49

by Pingfan Liu

Subject: Re: [PATCH 2/2] x86/numa: instance all parsed numa node

On Tue, Jul 9, 2019 at 9:34 PM Andy Lutomirski <[email protected]> wrote:
>
>
>
> > On Jul 9, 2019, at 1:24 AM, Pingfan Liu <[email protected]> wrote:
> >
> >> On Tue, Jul 9, 2019 at 2:12 PM Thomas Gleixner <[email protected]> wrote:
> >>
> >>> On Tue, 9 Jul 2019, Pingfan Liu wrote:
> >>>> On Mon, Jul 8, 2019 at 5:35 PM Thomas Gleixner <[email protected]> wrote:
> >>>> It can and it does.
> >>>>
> >>>> That's the whole point why we bring up all CPUs in the 'nosmt' case and
> >>>> shut the siblings down again after setting CR4.MCE. Actually that's in fact
> >>>> a 'let's hope no MCE hits before that happened' approach, but that's all we
> >>>> can do.
> >>>>
> >>>> If we don't do that then the MCE broadcast can hit a CPU which has some
> >>>> firmware initialized state. The result can be a full system lockup, triple
> >>>> fault etc.
> >>>>
> >>>> So when the MCE hits a CPU which is still in the crashed kernel lala state,
> >>>> then all hell breaks lose.
> >>> Thank you for the comprehensive explain. With your guide, now, I have
> >>> a full understanding of the issue.
> >>>
> >>> But when I tried to add something to enable CR4.MCE in
> >>> crash_nmi_callback(), I realized that it is undo-able in some case (if
> >>> crashed, we will not ask an offline smt cpu to online), also it is
> >>> needless. "kexec -l/-p" takes the advantage of the cpu state in the
> >>> first kernel, where all logical cpu has CR4.MCE=1.
> >>>
> >>> So kexec is exempt from this bug if the first kernel already do it.
> >>
> >> No. If the MCE broadcast is handled by a CPU which is stuck in the old
> >> kernel stop loop, then it will execute on the old kernel and eventually run
> >> into the memory corruption which crashed the old one.
> >>
> > Yes, you are right. A stuck cpu may execute the old do_machine_check()
> > code. But I just found out that we have
> > do_machine_check()->__mc_check_crashing_cpu() to guard against this case.
> >
> > And I think the MCE issue with nr_cpus is not closely related to this
> > series; it can be a separate issue. My question is whether Andy will take
> > it; if not, I am glad to do it.
> >
> >
>
> Go for it. I’m not familiar enough with the SMP boot stuff that I would be able to do it any faster than you. I’ll gladly help review it.
I have sent out a patch to fix maxcpus: "[PATCH] smp: force all cpu to
boot once under maxcpus option".
But for the nr_cpus case, I think things will not be so easy due to the
percpu areas, and it may take quite a different approach.

Thanks,
Pingfan

2019-07-10 11:41:58

by Thomas Gleixner

Subject: Re: [PATCH 2/2] x86/numa: instance all parsed numa node

On Wed, 10 Jul 2019, Pingfan Liu wrote:
> On Tue, Jul 9, 2019 at 9:34 PM Andy Lutomirski <[email protected]> wrote:
> >
> > Go for it. I’m not familiar enough with the SMP boot stuff that I would
> > be able to do it any faster than you. I’ll gladly help review it.
>
> I have sent out a patch to fix maxcpus: "[PATCH] smp: force all cpu to
> boot once under maxcpus option".
>
> But for the nr_cpus case, I think things will not be so easy due to the
> percpu areas, and it may take quite a different approach.

No.

It's the same problem and it's broken in the same way as maxcpus on x86. So
nr_cpus on x86 has to do:

	if (nr_cpus < num_present_cpus()) {
		pr_info(....);
		max_cpus = nr_cpus;
		nr_cpus = num_present_cpus();
	}

or something like that.

Stop making extra cases which are pointlessly different. X86 boot is a
trainwreck in hardware, so no magic software can fix it.

All you can do is pray that it reaches the point where all present CPUs
have been at least minimally initialized.

Thanks,

tglx