A NUMA system may return a node which is not online.
For example, a NUMA node:
1) without memory
2) NR_CPUS is very small, and the CPUs on the node are not brought up
In this situation, we use NUMA_NO_NODE to avoid an oops.
[ 25.732905] Unable to handle kernel NULL pointer dereference at virtual address 00001988
[ 25.740982] Mem abort info:
[ 25.743762] ESR = 0x96000005
[ 25.746803] Exception class = DABT (current EL), IL = 32 bits
[ 25.752711] SET = 0, FnV = 0
[ 25.755751] EA = 0, S1PTW = 0
[ 25.758878] Data abort info:
[ 25.761745] ISV = 0, ISS = 0x00000005
[ 25.765568] CM = 0, WnR = 0
[ 25.768521] [0000000000001988] user address but active_mm is swapper
[ 25.774861] Internal error: Oops: 96000005 [#1] SMP
[ 25.779724] Modules linked in:
[ 25.782768] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.17.0-rc6-mpam+ #115
[ 25.789714] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 EC UEFI Nemo 2.0 RC0 - B305 05/28/2018
[ 25.798831] pstate: 80c00009 (Nzcv daif +PAN +UAO)
[ 25.803612] pc : __alloc_pages_nodemask+0xf0/0xe70
[ 25.808389] lr : __alloc_pages_nodemask+0x184/0xe70
[ 25.813252] sp : ffff00000996f660
[ 25.816553] x29: ffff00000996f660 x28: 0000000000000000
[ 25.821852] x27: 00000000014012c0 x26: 0000000000000000
[ 25.827150] x25: 0000000000000003 x24: ffff000008099eac
[ 25.832449] x23: 0000000000400000 x22: 0000000000000000
[ 25.837747] x21: 0000000000000001 x20: 0000000000000000
[ 25.843045] x19: 0000000000400000 x18: 0000000000010e00
[ 25.848343] x17: 000000000437f790 x16: 0000000000000020
[ 25.853641] x15: 0000000000000000 x14: 6549435020524541
[ 25.858939] x13: 20454d502067756c x12: 0000000000000000
[ 25.864237] x11: ffff00000996f6f0 x10: 0000000000000006
[ 25.869536] x9 : 00000000000012a4 x8 : ffff8023c000ff90
[ 25.874834] x7 : 0000000000000000 x6 : ffff000008d73c08
[ 25.880132] x5 : 0000000000000000 x4 : 0000000000000081
[ 25.885430] x3 : 0000000000000000 x2 : 0000000000000000
[ 25.890728] x1 : 0000000000000001 x0 : 0000000000001980
[ 25.896027] Process swapper/0 (pid: 1, stack limit = 0x (ptrval))
[ 25.902712] Call trace:
[ 25.905146] __alloc_pages_nodemask+0xf0/0xe70
[ 25.909577] allocate_slab+0x94/0x590
[ 25.913225] new_slab+0x68/0xc8
[ 25.916353] ___slab_alloc+0x444/0x4f8
[ 25.920088] __slab_alloc+0x50/0x68
[ 25.923562] kmem_cache_alloc_node_trace+0xe8/0x230
[ 25.928426] pci_acpi_scan_root+0x94/0x278
[ 25.932510] acpi_pci_root_add+0x228/0x4b0
[ 25.936593] acpi_bus_attach+0x10c/0x218
[ 25.940501] acpi_bus_attach+0xac/0x218
[ 25.944323] acpi_bus_attach+0xac/0x218
[ 25.948144] acpi_bus_scan+0x5c/0xc0
[ 25.951708] acpi_scan_init+0xf8/0x254
[ 25.955443] acpi_init+0x310/0x37c
[ 25.958831] do_one_initcall+0x54/0x208
[ 25.962653] kernel_init_freeable+0x244/0x340
[ 25.966999] kernel_init+0x18/0x118
[ 25.970474] ret_from_fork+0x10/0x1c
[ 25.974036] Code: 7100047f 321902a4 1a950095 b5000602 (b9400803)
[ 25.980162] ---[ end trace 64f0893eb21ec283 ]---
[ 25.984765] Kernel panic - not syncing: Fatal exception
Xie XiuQi (2):
arm64: avoid alloc memory on offline node
drivers: check numa node's online status in dev_to_node
arch/arm64/kernel/pci.c | 3 +++
include/linux/device.h | 7 ++++++-
2 files changed, 9 insertions(+), 1 deletion(-)
--
1.8.3.1
If dev->numa_node is not available (or offline), we should
return NUMA_NO_NODE to prevent allocating memory on offline
nodes, which could cause an oops.
For example, a NUMA node:
1) without memory
2) NR_CPUS is very small, and the CPUs on the node are not brought up
[ 27.851041] Unable to handle kernel NULL pointer dereference at virtual address 00001988
[ 27.859128] Mem abort info:
[ 27.861908] ESR = 0x96000005
[ 27.864949] Exception class = DABT (current EL), IL = 32 bits
[ 27.870860] SET = 0, FnV = 0
[ 27.873900] EA = 0, S1PTW = 0
[ 27.877029] Data abort info:
[ 27.879895] ISV = 0, ISS = 0x00000005
[ 27.883716] CM = 0, WnR = 0
[ 27.886673] [0000000000001988] user address but active_mm is swapper
[ 27.893012] Internal error: Oops: 96000005 [#1] SMP
[ 27.897876] Modules linked in:
[ 27.900919] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.17.0-rc6-mpam+ #116
[ 27.907865] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 EC UEFI Nemo 2.0 RC0 - B306 05/28/2018
[ 27.916983] pstate: 80c00009 (Nzcv daif +PAN +UAO)
[ 27.921763] pc : __alloc_pages_nodemask+0xf0/0xe70
[ 27.926540] lr : __alloc_pages_nodemask+0x184/0xe70
[ 27.931403] sp : ffff00000996f7e0
[ 27.934704] x29: ffff00000996f7e0 x28: ffff000008cb10a0
[ 27.940003] x27: 00000000014012c0 x26: 0000000000000000
[ 27.945301] x25: 0000000000000003 x24: ffff0000085bbc14
[ 27.950600] x23: 0000000000400000 x22: 0000000000000000
[ 27.955898] x21: 0000000000000001 x20: 0000000000000000
[ 27.961196] x19: 0000000000400000 x18: 0000000000000f00
[ 27.966494] x17: 00000000003bff88 x16: 0000000000000020
[ 27.971792] x15: 000000000000003b x14: ffffffffffffffff
[ 27.977090] x13: ffffffffffff0000 x12: 0000000000000030
[ 27.982388] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f
[ 27.987686] x9 : 2e64716e622e7364 x8 : 7f7f7f7f7f7f7f7f
[ 27.992984] x7 : 0000000000000000 x6 : ffff000008d73c08
[ 27.998282] x5 : 0000000000000000 x4 : 0000000000000081
[ 28.003580] x3 : 0000000000000000 x2 : 0000000000000000
[ 28.008878] x1 : 0000000000000001 x0 : 0000000000001980
[ 28.014177] Process swapper/0 (pid: 1, stack limit = 0x (ptrval))
[ 28.020863] Call trace:
[ 28.023296] __alloc_pages_nodemask+0xf0/0xe70
[ 28.027727] allocate_slab+0x94/0x590
[ 28.031374] new_slab+0x68/0xc8
[ 28.034502] ___slab_alloc+0x444/0x4f8
[ 28.038237] __slab_alloc+0x50/0x68
[ 28.041713] __kmalloc_node_track_caller+0x100/0x320
[ 28.046664] devm_kmalloc+0x3c/0x90
[ 28.050139] pinctrl_bind_pins+0x4c/0x298
[ 28.054135] driver_probe_device+0xb4/0x4a0
[ 28.058305] __driver_attach+0x124/0x128
[ 28.062213] bus_for_each_dev+0x78/0xe0
[ 28.066035] driver_attach+0x30/0x40
[ 28.069597] bus_add_driver+0x248/0x2b8
[ 28.073419] driver_register+0x68/0x100
[ 28.077242] __pci_register_driver+0x64/0x78
[ 28.081500] pcie_portdrv_init+0x44/0x4c
[ 28.085410] do_one_initcall+0x54/0x208
[ 28.089232] kernel_init_freeable+0x244/0x340
[ 28.093577] kernel_init+0x18/0x118
[ 28.097052] ret_from_fork+0x10/0x1c
[ 28.100614] Code: 7100047f 321902a4 1a950095 b5000602 (b9400803)
[ 28.106740] ---[ end trace e32df44e6e1c3a4b ]---
Signed-off-by: Xie XiuQi <[email protected]>
Tested-by: Huiqiang Wang <[email protected]>
Cc: Hanjun Guo <[email protected]>
Cc: Tomasz Nowicki <[email protected]>
Cc: Xishi Qiu <[email protected]>
---
include/linux/device.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/include/linux/device.h b/include/linux/device.h
index 4779569..2a4fb08 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1017,7 +1017,12 @@ extern __printf(2, 3)
 #ifdef CONFIG_NUMA
 static inline int dev_to_node(struct device *dev)
 {
-	return dev->numa_node;
+	int node = dev->numa_node;
+
+	if (unlikely(node != NUMA_NO_NODE && !node_online(node)))
+		return NUMA_NO_NODE;
+
+	return node;
 }
 static inline void set_dev_node(struct device *dev, int node)
 {
--
1.8.3.1
A NUMA system may return a node which is not online.
For example, a NUMA node:
1) without memory
2) NR_CPUS is very small, and the CPUs on the node are not brought up
In this situation, we use NUMA_NO_NODE to avoid an oops.
[ 25.732905] Unable to handle kernel NULL pointer dereference at virtual address 00001988
[ 25.740982] Mem abort info:
[ 25.743762] ESR = 0x96000005
[ 25.746803] Exception class = DABT (current EL), IL = 32 bits
[ 25.752711] SET = 0, FnV = 0
[ 25.755751] EA = 0, S1PTW = 0
[ 25.758878] Data abort info:
[ 25.761745] ISV = 0, ISS = 0x00000005
[ 25.765568] CM = 0, WnR = 0
[ 25.768521] [0000000000001988] user address but active_mm is swapper
[ 25.774861] Internal error: Oops: 96000005 [#1] SMP
[ 25.779724] Modules linked in:
[ 25.782768] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.17.0-rc6-mpam+ #115
[ 25.789714] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 EC UEFI Nemo 2.0 RC0 - B305 05/28/2018
[ 25.798831] pstate: 80c00009 (Nzcv daif +PAN +UAO)
[ 25.803612] pc : __alloc_pages_nodemask+0xf0/0xe70
[ 25.808389] lr : __alloc_pages_nodemask+0x184/0xe70
[ 25.813252] sp : ffff00000996f660
[ 25.816553] x29: ffff00000996f660 x28: 0000000000000000
[ 25.821852] x27: 00000000014012c0 x26: 0000000000000000
[ 25.827150] x25: 0000000000000003 x24: ffff000008099eac
[ 25.832449] x23: 0000000000400000 x22: 0000000000000000
[ 25.837747] x21: 0000000000000001 x20: 0000000000000000
[ 25.843045] x19: 0000000000400000 x18: 0000000000010e00
[ 25.848343] x17: 000000000437f790 x16: 0000000000000020
[ 25.853641] x15: 0000000000000000 x14: 6549435020524541
[ 25.858939] x13: 20454d502067756c x12: 0000000000000000
[ 25.864237] x11: ffff00000996f6f0 x10: 0000000000000006
[ 25.869536] x9 : 00000000000012a4 x8 : ffff8023c000ff90
[ 25.874834] x7 : 0000000000000000 x6 : ffff000008d73c08
[ 25.880132] x5 : 0000000000000000 x4 : 0000000000000081
[ 25.885430] x3 : 0000000000000000 x2 : 0000000000000000
[ 25.890728] x1 : 0000000000000001 x0 : 0000000000001980
[ 25.896027] Process swapper/0 (pid: 1, stack limit = 0x (ptrval))
[ 25.902712] Call trace:
[ 25.905146] __alloc_pages_nodemask+0xf0/0xe70
[ 25.909577] allocate_slab+0x94/0x590
[ 25.913225] new_slab+0x68/0xc8
[ 25.916353] ___slab_alloc+0x444/0x4f8
[ 25.920088] __slab_alloc+0x50/0x68
[ 25.923562] kmem_cache_alloc_node_trace+0xe8/0x230
[ 25.928426] pci_acpi_scan_root+0x94/0x278
[ 25.932510] acpi_pci_root_add+0x228/0x4b0
[ 25.936593] acpi_bus_attach+0x10c/0x218
[ 25.940501] acpi_bus_attach+0xac/0x218
[ 25.944323] acpi_bus_attach+0xac/0x218
[ 25.948144] acpi_bus_scan+0x5c/0xc0
[ 25.951708] acpi_scan_init+0xf8/0x254
[ 25.955443] acpi_init+0x310/0x37c
[ 25.958831] do_one_initcall+0x54/0x208
[ 25.962653] kernel_init_freeable+0x244/0x340
[ 25.966999] kernel_init+0x18/0x118
[ 25.970474] ret_from_fork+0x10/0x1c
[ 25.974036] Code: 7100047f 321902a4 1a950095 b5000602 (b9400803)
[ 25.980162] ---[ end trace 64f0893eb21ec283 ]---
[ 25.984765] Kernel panic - not syncing: Fatal exception
Signed-off-by: Xie XiuQi <[email protected]>
Tested-by: Huiqiang Wang <[email protected]>
Cc: Hanjun Guo <[email protected]>
Cc: Tomasz Nowicki <[email protected]>
Cc: Xishi Qiu <[email protected]>
---
arch/arm64/kernel/pci.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
index 0e2ea1c..e17cc45 100644
--- a/arch/arm64/kernel/pci.c
+++ b/arch/arm64/kernel/pci.c
@@ -170,6 +170,9 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
 	struct pci_bus *bus, *child;
 	struct acpi_pci_root_ops *root_ops;
 
+	if (node != NUMA_NO_NODE && !node_online(node))
+		node = NUMA_NO_NODE;
+
 	ri = kzalloc_node(sizeof(*ri), GFP_KERNEL, node);
 	if (!ri)
 		return NULL;
--
1.8.3.1
Hi Xiuqi,
On 2018/5/31 20:14, Xie XiuQi wrote:
> A numa system may return node which is not online.
> For example, a numa node:
> 1) without memory
> 2) NR_CPUS is very small, and the cpus on the node are not brought up
I think adding more detailed info will make this easier to understand:
- A NUMA node will be built if CPUs and/or memory are valid on this NUMA node;
- But if we boot the system with a memory-less node and with CONFIG_NR_CPUS
less than the number of CPUs in SRAT (for example, 64 CPUs total across 4 NUMA
nodes, 16 CPUs on each NUMA node, booted with CONFIG_NR_CPUS=48), then we will
not build a NUMA node for node 3; but with devices on that NUMA node, allocating
memory will panic because NUMA node 3 is not a valid node.
>
> In this situation, we use NUMA_NO_NODE to avoid oops.
[snip]
>
> Xie XiuQi (2):
> arm64: avoid alloc memory on offline node
> drivers: check numa node's online status in dev_to_node
I think we are still missing devices like the SMMU and ITS, so how about
checking whether the NUMA node is online in the core memory allocation
path, such as kmalloc_node()?
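As an illustration of that suggestion, a minimal sketch of such a guard at
a single core choke point could look like the following (the helper name
and placement are assumptions, not a posted patch):

/*
 * Hypothetical helper for a core allocation path, sketching the
 * suggestion above; name and placement are illustrative only.
 */
static inline int sanitize_alloc_node(int nid)
{
	/* Fall back to the default node selection for offline nodes. */
	if (nid != NUMA_NO_NODE && !node_online(nid))
		return NUMA_NO_NODE;

	return nid;
}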
Thanks
Hanjun
On Thu, May 31, 2018 at 08:14:38PM +0800, Xie XiuQi wrote:
> A numa system may return node which is not online.
> For example, a numa node:
> 1) without memory
> 2) NR_CPUS is very small, and the cpus on the node are not brought up
>
> In this situation, we use NUMA_NO_NODE to avoid oops.
>
[snip]
>
> Signed-off-by: Xie XiuQi <[email protected]>
> Tested-by: Huiqiang Wang <[email protected]>
> Cc: Hanjun Guo <[email protected]>
> Cc: Tomasz Nowicki <[email protected]>
> Cc: Xishi Qiu <[email protected]>
> ---
> arch/arm64/kernel/pci.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
> index 0e2ea1c..e17cc45 100644
> --- a/arch/arm64/kernel/pci.c
> +++ b/arch/arm64/kernel/pci.c
> @@ -170,6 +170,9 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
> struct pci_bus *bus, *child;
> struct acpi_pci_root_ops *root_ops;
>
> + if (node != NUMA_NO_NODE && !node_online(node))
> + node = NUMA_NO_NODE;
> +
This really feels like a bodge, but it does appear to be what other
architectures do, so:
Acked-by: Will Deacon <[email protected]>
Will
[+cc akpm, linux-mm, linux-pci]
On Wed, Jun 6, 2018 at 10:44 AM Will Deacon <[email protected]> wrote:
>
> On Thu, May 31, 2018 at 08:14:38PM +0800, Xie XiuQi wrote:
> > A numa system may return node which is not online.
> > For example, a numa node:
> > 1) without memory
> > 2) NR_CPUS is very small, and the cpus on the node are not brought up
> >
> > In this situation, we use NUMA_NO_NODE to avoid oops.
> >
[snip]
> >
> > Signed-off-by: Xie XiuQi <[email protected]>
> > Tested-by: Huiqiang Wang <[email protected]>
> > Cc: Hanjun Guo <[email protected]>
> > Cc: Tomasz Nowicki <[email protected]>
> > Cc: Xishi Qiu <[email protected]>
> > ---
> > arch/arm64/kernel/pci.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
> > index 0e2ea1c..e17cc45 100644
> > --- a/arch/arm64/kernel/pci.c
> > +++ b/arch/arm64/kernel/pci.c
> > @@ -170,6 +170,9 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
> > struct pci_bus *bus, *child;
> > struct acpi_pci_root_ops *root_ops;
> >
> > + if (node != NUMA_NO_NODE && !node_online(node))
> > + node = NUMA_NO_NODE;
> > +
>
> This really feels like a bodge, but it does appear to be what other
> architectures do, so:
>
> Acked-by: Will Deacon <[email protected]>
I agree, this doesn't feel like something we should be avoiding in the
caller of kzalloc_node().
I would not expect kzalloc_node() to return memory that's offline, no
matter what node we told it to allocate from. I could imagine it
returning failure, or returning memory from a node that *is* online,
but returning a pointer to offline memory seems broken.
Are we putting memory that's offline in the free list? I don't know
where to look to figure this out.
Bjorn
On Wed 06-06-18 15:39:34, Bjorn Helgaas wrote:
> [+cc akpm, linux-mm, linux-pci]
>
> On Wed, Jun 6, 2018 at 10:44 AM Will Deacon <[email protected]> wrote:
> >
> > On Thu, May 31, 2018 at 08:14:38PM +0800, Xie XiuQi wrote:
> > > A numa system may return node which is not online.
> > > For example, a numa node:
> > > 1) without memory
> > > 2) NR_CPUS is very small, and the cpus on the node are not brought up
> > >
> > > In this situation, we use NUMA_NO_NODE to avoid oops.
> > >
[snip]
> > >
> > > Signed-off-by: Xie XiuQi <[email protected]>
> > > Tested-by: Huiqiang Wang <[email protected]>
> > > Cc: Hanjun Guo <[email protected]>
> > > Cc: Tomasz Nowicki <[email protected]>
> > > Cc: Xishi Qiu <[email protected]>
> > > ---
> > > arch/arm64/kernel/pci.c | 3 +++
> > > 1 file changed, 3 insertions(+)
> > >
> > > diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
> > > index 0e2ea1c..e17cc45 100644
> > > --- a/arch/arm64/kernel/pci.c
> > > +++ b/arch/arm64/kernel/pci.c
> > > @@ -170,6 +170,9 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
> > > struct pci_bus *bus, *child;
> > > struct acpi_pci_root_ops *root_ops;
> > >
> > > + if (node != NUMA_NO_NODE && !node_online(node))
> > > + node = NUMA_NO_NODE;
> > > +
> >
> > This really feels like a bodge, but it does appear to be what other
> > architectures do, so:
> >
> > Acked-by: Will Deacon <[email protected]>
>
> I agree, this doesn't feel like something we should be avoiding in the
> caller of kzalloc_node().
>
> I would not expect kzalloc_node() to return memory that's offline, no
> matter what node we told it to allocate from. I could imagine it
> returning failure, or returning memory from a node that *is* online,
> but returning a pointer to offline memory seems broken.
>
> Are we putting memory that's offline in the free list? I don't know
> where to look to figure this out.
I am not sure I have the full context but pci_acpi_scan_root calls
kzalloc_node(sizeof(*info), GFP_KERNEL, node)
and that should fall back to whatever node is online. An offline node
shouldn't keep any pages behind. So there must be something else going
on here and the patch is not the right way to handle it. What does
faddr2line __alloc_pages_nodemask+0xf0 tell on this kernel?
--
Michal Hocko
SUSE Labs
On 2018/6/7 18:55, Michal Hocko wrote:
> On Wed 06-06-18 15:39:34, Bjorn Helgaas wrote:
>> [+cc akpm, linux-mm, linux-pci]
>>
>> On Wed, Jun 6, 2018 at 10:44 AM Will Deacon <[email protected]> wrote:
>>>
>>> On Thu, May 31, 2018 at 08:14:38PM +0800, Xie XiuQi wrote:
>>>> A numa system may return node which is not online.
>>>> For example, a numa node:
>>>> 1) without memory
>>>> 2) NR_CPUS is very small, and the cpus on the node are not brought up
>>>>
>>>> In this situation, we use NUMA_NO_NODE to avoid oops.
>>>>
[snip]
>>>>
>>>> Signed-off-by: Xie XiuQi <[email protected]>
>>>> Tested-by: Huiqiang Wang <[email protected]>
>>>> Cc: Hanjun Guo <[email protected]>
>>>> Cc: Tomasz Nowicki <[email protected]>
>>>> Cc: Xishi Qiu <[email protected]>
>>>> ---
>>>> arch/arm64/kernel/pci.c | 3 +++
>>>> 1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
>>>> index 0e2ea1c..e17cc45 100644
>>>> --- a/arch/arm64/kernel/pci.c
>>>> +++ b/arch/arm64/kernel/pci.c
>>>> @@ -170,6 +170,9 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
>>>> struct pci_bus *bus, *child;
>>>> struct acpi_pci_root_ops *root_ops;
>>>>
>>>> + if (node != NUMA_NO_NODE && !node_online(node))
>>>> + node = NUMA_NO_NODE;
>>>> +
>>>
>>> This really feels like a bodge, but it does appear to be what other
>>> architectures do, so:
>>>
>>> Acked-by: Will Deacon <[email protected]>
>>
>> I agree, this doesn't feel like something we should be avoiding in the
>> caller of kzalloc_node().
>>
>> I would not expect kzalloc_node() to return memory that's offline, no
>> matter what node we told it to allocate from. I could imagine it
>> returning failure, or returning memory from a node that *is* online,
>> but returning a pointer to offline memory seems broken.
>>
>> Are we putting memory that's offline in the free list? I don't know
>> where to look to figure this out.
>
> I am not sure I have the full context but pci_acpi_scan_root calls
> kzalloc_node(sizeof(*info), GFP_KERNEL, node)
> and that should fall back to whatever node that is online. Offline node
> shouldn't keep any pages behind. So there must be something else going
> on here and the patch is not the right way to handle it. What does
> faddr2line __alloc_pages_nodemask+0xf0 tells on this kernel?
The whole context is:
The system is booted with a NUMA node that has no memory attached to it
(a memory-less NUMA node), and with NR_CPUS less than the number of CPUs
presented in MADT, so the CPUs on this memory-less node are not brought up
and this NUMA node will not be online (but SRAT presents this NUMA node);
Devices attached to this NUMA node, such as the PCI host bridge, still
return the valid NUMA node via _PXM, but that valid NUMA node is not
online, which leads to this issue.
Thanks
Hanjun
>
On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
> On 2018/6/7 18:55, Michal Hocko wrote:
[...]
> > I am not sure I have the full context but pci_acpi_scan_root calls
> > kzalloc_node(sizeof(*info), GFP_KERNEL, node)
> > and that should fall back to whatever node that is online. Offline node
> > shouldn't keep any pages behind. So there must be something else going
> > on here and the patch is not the right way to handle it. What does
> > faddr2line __alloc_pages_nodemask+0xf0 tells on this kernel?
>
> The whole context is:
>
> The system is booted with a NUMA node has no memory attaching to it
> (memory-less NUMA node), also with NR_CPUS less than CPUs presented
> in MADT, so CPUs on this memory-less node are not brought up, and
> this NUMA node will not be online (but SRAT presents this NUMA node);
>
> Devices attaching to this NUMA node such as PCI host bridge still
> return the valid NUMA node via _PXM, but actually that valid NUMA node
> is not online which lead to this issue.
But we should have other NUMA nodes on the zonelists so the allocator
should fall back to another node. If the zonelist is not initialized
properly, though, then this can indeed show up as a problem. Knowing
which exact place has blown up would help get a better picture...
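For reference, the zonelist lookup on this allocation path goes through
node_zonelist() (4.17-era include/linux/gfp.h), which dereferences
NODE_DATA(nid); for a node that was never initialized, NODE_DATA(nid) is
NULL on arm64, which would be consistent with the NULL dereference at a
small offset seen in the trace above (an inference, not confirmed in this
thread):

static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
	/* NODE_DATA(nid) is NULL for a never-initialized node, so this
	 * dereference faults at a small offset into struct pglist_data. */
	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}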
--
Michal Hocko
SUSE Labs
Hi Michal,
On 2018/6/7 20:21, Michal Hocko wrote:
> On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
>> On 2018/6/7 18:55, Michal Hocko wrote:
> [...]
>>> I am not sure I have the full context but pci_acpi_scan_root calls
>>> kzalloc_node(sizeof(*info), GFP_KERNEL, node)
>>> and that should fall back to whatever node that is online. Offline node
>>> shouldn't keep any pages behind. So there must be something else going
>>> on here and the patch is not the right way to handle it. What does
>>> faddr2line __alloc_pages_nodemask+0xf0 tells on this kernel?
>>
>> The whole context is:
>>
>> The system is booted with a NUMA node has no memory attaching to it
>> (memory-less NUMA node), also with NR_CPUS less than CPUs presented
>> in MADT, so CPUs on this memory-less node are not brought up, and
>> this NUMA node will not be online (but SRAT presents this NUMA node);
>>
>> Devices attaching to this NUMA node such as PCI host bridge still
>> return the valid NUMA node via _PXM, but actually that valid NUMA node
>> is not online which lead to this issue.
>
> But we should have other numa nodes on the zonelists so the allocator
> should fall back to other node. If the zonelist is not intiailized
> properly, though, then this can indeed show up as a problem. Knowing
> which exact place has blown up would help get a better picture...
>
I specified a non-existent node to allocate memory using kzalloc_node(),
and got the following error message.
I also found out there is just a VM_WARN_ON, but it does not prevent the
memory allocation from continuing.
This nid would be used to access NODE_DATA(nid), so if nid is invalid,
it would cause an oops here.
/*
 * Allocate pages, preferring the node given as nid. The node must be valid and
 * online. For more general interface, see alloc_pages_node().
 */
static inline struct page *
__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
{
	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
	VM_WARN_ON(!node_online(nid));

	return __alloc_pages(gfp_mask, order, nid);
}
(I wrote a kernel module to allocate memory on a non-existent node using kzalloc_node().)
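A hedged reconstruction of such a test module (the actual module was not
posted; the init function name is taken from the trace below, everything
else is an assumption):

#include <linux/module.h>
#include <linux/slab.h>

static int __init mpidr_init(void)
{
	/* Node 3 does not exist on the test machine (illustrative value). */
	void *p = kzalloc_node(64, GFP_KERNEL, 3);

	kfree(p);	/* not reached if the allocation oopses */
	return 0;
}
module_init(mpidr_init);

MODULE_LICENSE("GPL");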
[ 120.061693] WARNING: CPU: 6 PID: 3966 at ./include/linux/gfp.h:467 allocate_slab+0x5fd/0x7e0
[ 120.070095] Modules linked in: bench(OE+) nls_utf8 isofs loop xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter dm_mirror dm_region_hash dm_log dm_mod intel_rapl skx_edac nfit vfat libnvdimm fat x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass iTCO_wdt crct10dif_pclmul iTCO_vendor_support crc32_pclmul ghash_clmulni_intel ses pcbc enclosure aesni_intel scsi_transport_sas crypto_simd cryptd sg glue_helper ipmi_si joydev mei_me i2c_i801 ipmi_devintf ioatdma shpchp pcspkr ipmi_msghandler mei dca i2c_core lpc_ich acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables
[ 120.140992] ext4 mbcache jbd2 sd_mod crc32c_intel i40e ahci libahci megaraid_sas libata
[ 120.149053] CPU: 6 PID: 3966 Comm: insmod Tainted: G OE 4.17.0-rc2-RHEL74+ #5
[ 120.157369] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.62 03/26/2018
[ 120.164303] RIP: 0010:allocate_slab+0x5fd/0x7e0
[ 120.168817] RSP: 0018:ffff881196947af0 EFLAGS: 00010246
[ 120.174022] RAX: 0000000000000000 RBX: 00000000014012c0 RCX: ffffffffb4bc8173
[ 120.181126] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8817aefa7868
[ 120.188233] RBP: 00000000014000c0 R08: ffffed02f5df4f0e R09: ffffed02f5df4f0e
[ 120.195338] R10: ffffed02f5df4f0d R11: ffff8817aefa786f R12: 0000000000000055
[ 120.202444] R13: 0000000000000003 R14: ffff880107c0f800 R15: 0000000000000000
[ 120.209550] FS: 00007f6935d8c740(0000) GS:ffff8817aef80000(0000) knlGS:0000000000000000
[ 120.217606] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 120.223330] CR2: 0000000000c21b88 CR3: 0000001197fd0006 CR4: 00000000007606e0
[ 120.230435] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 120.237541] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 120.244646] PKRU: 55555554
[ 120.247346] Call Trace:
[ 120.249791] ? __kasan_slab_free+0xff/0x150
[ 120.253960] ? mpidr_init+0x20/0x30 [bench]
[ 120.258129] new_slab+0x3d/0x90
[ 120.261262] ___slab_alloc+0x371/0x640
[ 120.265002] ? __wake_up_common+0x8a/0x150
[ 120.269085] ? mpidr_init+0x20/0x30 [bench]
[ 120.273254] ? mpidr_init+0x20/0x30 [bench]
[ 120.277423] __slab_alloc+0x40/0x66
[ 120.280901] kmem_cache_alloc_node_trace+0xbc/0x270
[ 120.285762] ? mpidr_init+0x20/0x30 [bench]
[ 120.289931] ? 0xffffffffc0740000
[ 120.293236] mpidr_init+0x20/0x30 [bench]
[ 120.297236] do_one_initcall+0x4b/0x1f5
[ 120.301062] ? do_init_module+0x22/0x233
[ 120.304972] ? kmem_cache_alloc_trace+0xfe/0x220
[ 120.309571] ? do_init_module+0x22/0x233
[ 120.313481] do_init_module+0x77/0x233
[ 120.317218] load_module+0x21ea/0x2960
[ 120.320955] ? m_show+0x1d0/0x1d0
[ 120.324264] ? security_capable+0x39/0x50
[ 120.328261] __do_sys_finit_module+0x94/0xe0
[ 120.332516] do_syscall_64+0x55/0x180
[ 120.336171] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 120.341203] RIP: 0033:0x7f69352627f9
[ 120.344767] RSP: 002b:00007ffd7d73f718 EFLAGS: 00000206 ORIG_RAX: 0000000000000139
[ 120.352305] RAX: ffffffffffffffda RBX: 0000000000c201d0 RCX: 00007f69352627f9
[ 120.359411] RDX: 0000000000000000 RSI: 000000000041a2d8 RDI: 0000000000000003
[ 120.366517] RBP: 000000000041a2d8 R08: 0000000000000000 R09: 00007ffd7d73f8b8
[ 120.373622] R10: 0000000000000003 R11: 0000000000000206 R12: 0000000000000000
[ 120.380727] R13: 0000000000c20130 R14: 0000000000000000 R15: 0000000000000000
[ 120.387833] Code: 4b e8 ac 97 eb ff e9 e1 fc ff ff 89 de 89 ef e8 7a 35 ff ff 49 89 c7 4d 85 ff 74 71 0f 1f 44 00 00 e9 f1 fa ff ff e8 cf 54 00 00 <0f> 0b 90 e9 c4 fa ff ff 45 89 e8 b9 b1 05 00 00 48 c7 c2 10 79
[ 120.406620] ---[ end trace 89f801c36550734e ]---
[ 120.411234] BUG: unable to handle kernel paging request at 0000000000002088
[ 120.418168] PGD 8000001197c75067 P4D 8000001197c75067 PUD 119858f067 PMD 0
[ 120.425103] Oops: 0000 [#1] SMP KASAN PTI
[ 120.429097] Modules linked in: bench(OE+) nls_utf8 isofs loop xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter dm_mirror dm_region_hash dm_log dm_mod intel_rapl skx_edac nfit vfat libnvdimm fat x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass iTCO_wdt crct10dif_pclmul iTCO_vendor_support crc32_pclmul ghash_clmulni_intel ses pcbc enclosure aesni_intel scsi_transport_sas crypto_simd cryptd sg glue_helper ipmi_si joydev mei_me i2c_i801 ipmi_devintf ioatdma shpchp pcspkr ipmi_msghandler mei dca i2c_core lpc_ich acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables
[ 120.499986] ext4 mbcache jbd2 sd_mod crc32c_intel i40e ahci libahci megaraid_sas libata
[ 120.508045] CPU: 6 PID: 3966 Comm: insmod Tainted: G W OE 4.17.0-rc2-RHEL74+ #5
[ 120.516359] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.62 03/26/2018
[ 120.523296] RIP: 0010:__alloc_pages_nodemask+0x10d/0x2c0
[ 120.528586] RSP: 0018:ffff881196947a90 EFLAGS: 00010246
[ 120.533790] RAX: 0000000000000001 RBX: 00000000014012c0 RCX: 0000000000000000
[ 120.540895] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
[ 120.548000] RBP: 00000000014012c0 R08: ffffed0233ccb8f4 R09: ffffed0233ccb8f4
[ 120.555105] R10: ffffed0233ccb8f3 R11: ffff88119e65c79f R12: 0000000000000000
[ 120.562210] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[ 120.569316] FS: 00007f6935d8c740(0000) GS:ffff8817aef80000(0000) knlGS:0000000000000000
[ 120.577374] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 120.583095] CR2: 0000000000002088 CR3: 0000001197fd0006 CR4: 00000000007606e0
[ 120.590200] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 120.597307] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 120.604412] PKRU: 55555554
[ 120.607111] Call Trace:
[ 120.609554] allocate_slab+0xd8/0x7e0
[ 120.613205] ? __kasan_slab_free+0xff/0x150
[ 120.617376] ? mpidr_init+0x20/0x30 [bench]
[ 120.621545] new_slab+0x3d/0x90
[ 120.624678] ___slab_alloc+0x371/0x640
[ 120.628415] ? __wake_up_common+0x8a/0x150
[ 120.632498] ? mpidr_init+0x20/0x30 [bench]
[ 120.636667] ? mpidr_init+0x20/0x30 [bench]
[ 120.640836] __slab_alloc+0x40/0x66
[ 120.644315] kmem_cache_alloc_node_trace+0xbc/0x270
[ 120.649175] ? mpidr_init+0x20/0x30 [bench]
[ 120.653343] ? 0xffffffffc0740000
[ 120.656649] mpidr_init+0x20/0x30 [bench]
[ 120.660645] do_one_initcall+0x4b/0x1f5
[ 120.664469] ? do_init_module+0x22/0x233
[ 120.668379] ? kmem_cache_alloc_trace+0xfe/0x220
[ 120.672978] ? do_init_module+0x22/0x233
[ 120.676887] do_init_module+0x77/0x233
[ 120.680624] load_module+0x21ea/0x2960
[ 120.684360] ? m_show+0x1d0/0x1d0
[ 120.687667] ? security_capable+0x39/0x50
[ 120.691663] __do_sys_finit_module+0x94/0xe0
[ 120.695920] do_syscall_64+0x55/0x180
[ 120.699571] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 120.704603] RIP: 0033:0x7f69352627f9
[ 120.708166] RSP: 002b:00007ffd7d73f718 EFLAGS: 00000206 ORIG_RAX: 0000000000000139
[ 120.715704] RAX: ffffffffffffffda RBX: 0000000000c201d0 RCX: 00007f69352627f9
[ 120.722808] RDX: 0000000000000000 RSI: 000000000041a2d8 RDI: 0000000000000003
[ 120.729913] RBP: 000000000041a2d8 R08: 0000000000000000 R09: 00007ffd7d73f8b8
[ 120.737019] R10: 0000000000000003 R11: 0000000000000206 R12: 0000000000000000
[ 120.744123] R13: 0000000000c20130 R14: 0000000000000000 R15: 0000000000000000
[ 120.751230] Code: 89 c6 74 0d e8 55 ab 5e 00 8b 74 24 1c 48 8b 3c 24 48 8b 54 24 08 89 d9 c1 e9 17 83 e1 01 48 85 d2 88 4c 24 20 0f 85 25 01 00 00 <3b> 77 08 0f 82 1c 01 00 00 48 89 f8 44 89 ea 48 89 e1 44 89 e6
[ 120.770020] RIP: __alloc_pages_nodemask+0x10d/0x2c0 RSP: ffff881196947a90
[ 120.776780] CR2: 0000000000002088
[ 120.780116] ---[ end trace 89f801c36550734f ]---
[ 120.978922] Kernel panic - not syncing: Fatal exception
[ 120.984186] Kernel Offset: 0x33800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 121.209501] ---[ end Kernel panic - not syncing: Fatal exception ]---
--
Thanks,
Xie XiuQi
On Mon 11-06-18 11:23:18, Xie XiuQi wrote:
> Hi Michal,
>
> On 2018/6/7 20:21, Michal Hocko wrote:
> > On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
> >> On 2018/6/7 18:55, Michal Hocko wrote:
> > [...]
> >>> I am not sure I have the full context but pci_acpi_scan_root calls
> >>> kzalloc_node(sizeof(*info), GFP_KERNEL, node)
> >>> and that should fall back to whatever node that is online. Offline node
> >>> shouldn't keep any pages behind. So there must be something else going
> >>> on here and the patch is not the right way to handle it. What does
> >>> faddr2line __alloc_pages_nodemask+0xf0 tells on this kernel?
> >>
> >> The whole context is:
> >>
> >> The system is booted with a NUMA node has no memory attaching to it
> >> (memory-less NUMA node), also with NR_CPUS less than CPUs presented
> >> in MADT, so CPUs on this memory-less node are not brought up, and
> >> this NUMA node will not be online (but SRAT presents this NUMA node);
> >>
> >> Devices attaching to this NUMA node such as PCI host bridge still
> >> return the valid NUMA node via _PXM, but actually that valid NUMA node
> >> is not online which lead to this issue.
> >
> > But we should have other numa nodes on the zonelists so the allocator
> > should fall back to other node. If the zonelist is not intiailized
> > properly, though, then this can indeed show up as a problem. Knowing
> > which exact place has blown up would help get a better picture...
> >
>
> I specific a non-exist node to allocate memory using kzalloc_node,
> and got this following error message.
>
> And I found out there is just a VM_WARN, but it does not prevent the memory
> allocation continue.
>
> This nid would be use to access NODE_DADA(nid), so if nid is invalid,
> it would cause oops here.
>
> 459 /*
> 460 * Allocate pages, preferring the node given as nid. The node must be valid and
> 461 * online. For more general interface, see alloc_pages_node().
> 462 */
> 463 static inline struct page *
> 464 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
> 465 {
> 466 VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> 467 VM_WARN_ON(!node_online(nid));
> 468
> 469 return __alloc_pages(gfp_mask, order, nid);
> 470 }
> 471
>
> (I wrote a ko, to allocate memory on a non-exist node using kzalloc_node().)
OK, so this is artificially broken code, right? You shouldn't get a
non-existent node via standard APIs AFAICS. The original report was
about an existing node which is offline AFAIU. That would be a different
case. If I am missing something and there are legitimate users that try
to allocate from non-existing nodes then we should handle that in
node_zonelist.
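If such users turned out to be legitimate, a sketch of handling it in
node_zonelist() itself might look like this (purely hypothetical, not
something proposed or posted in this thread, and ignoring header
dependencies):

static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
	/* Hypothetical guard: redirect invalid/offline nodes to the
	 * nearest node with memory instead of dereferencing NULL. */
	if (unlikely(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid)))
		nid = numa_mem_id();

	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}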
[...]
--
Michal Hocko
SUSE Labs
Hi Michal,
On 2018/6/11 16:52, Michal Hocko wrote:
> On Mon 11-06-18 11:23:18, Xie XiuQi wrote:
>> Hi Michal,
>>
>> On 2018/6/7 20:21, Michal Hocko wrote:
>>> On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
>>>> On 2018/6/7 18:55, Michal Hocko wrote:
>>> [...]
>>>>> I am not sure I have the full context but pci_acpi_scan_root calls
>>>>> kzalloc_node(sizeof(*info), GFP_KERNEL, node)
>>>>> and that should fall back to whatever node that is online. Offline node
>>>>> shouldn't keep any pages behind. So there must be something else going
>>>>> on here and the patch is not the right way to handle it. What does
>>>>> faddr2line __alloc_pages_nodemask+0xf0 tells on this kernel?
>>>>
>>>> The whole context is:
>>>>
>>>> The system is booted with a NUMA node has no memory attaching to it
>>>> (memory-less NUMA node), also with NR_CPUS less than CPUs presented
>>>> in MADT, so CPUs on this memory-less node are not brought up, and
>>>> this NUMA node will not be online (but SRAT presents this NUMA node);
>>>>
>>>> Devices attaching to this NUMA node such as PCI host bridge still
>>>> return the valid NUMA node via _PXM, but actually that valid NUMA node
>>>> is not online which lead to this issue.
>>>
>>> But we should have other numa nodes on the zonelists so the allocator
>>> should fall back to other node. If the zonelist is not intiailized
>>> properly, though, then this can indeed show up as a problem. Knowing
>>> which exact place has blown up would help get a better picture...
>>>
>>
>> I specific a non-exist node to allocate memory using kzalloc_node,
>> and got this following error message.
>>
>> And I found out there is just a VM_WARN, but it does not prevent the memory
>> allocation continue.
>>
>> This nid would be use to access NODE_DADA(nid), so if nid is invalid,
>> it would cause oops here.
>>
>> 459 /*
>> 460 * Allocate pages, preferring the node given as nid. The node must be valid and
>> 461 * online. For more general interface, see alloc_pages_node().
>> 462 */
>> 463 static inline struct page *
>> 464 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>> 465 {
>> 466 VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>> 467 VM_WARN_ON(!node_online(nid));
>> 468
>> 469 return __alloc_pages(gfp_mask, order, nid);
>> 470 }
>> 471
>>
>> (I wrote a ko, to allocate memory on a non-exist node using kzalloc_node().)
>
> OK, so this is an artificialy broken code, right. You shouldn't get a
> non-existent node via standard APIs AFAICS. The original report was
> about an existing node which is offline AFAIU. That would be a different
> case. If I am missing something and there are legitimate users that try
> to allocate from non-existing nodes then we should handle that in
> node_zonelist.
I think Hanjun's comments may help to understand this question:
- A NUMA node will be built if CPUs and/or memory are valid on this NUMA node;
- But if we boot the system with a memory-less node and with CONFIG_NR_CPUS
less than the number of CPUs in SRAT (for example, 64 CPUs total across 4 NUMA
nodes, 16 CPUs on each NUMA node, booted with CONFIG_NR_CPUS=48), then we will
not build a NUMA node for node 3; but with devices on that NUMA node, allocating
memory will panic because NUMA node 3 is not a valid node.
I triggered this BUG on an arm64 platform, and I found that a similar bug has
been fixed on the x86 platform, so I sent a similar patch for this bug.
Or could we consider fixing it in the mm subsystem?
From b755de8dfdfef97effaa91379ffafcb81f4d62a1 Mon Sep 17 00:00:00 2001
From: Yinghai Lu <[email protected]>
Date: Wed, 20 Feb 2008 12:41:52 -0800
Subject: [PATCH] x86: make dev_to_node return online node
a numa system (with multi HT chains) may return node without ram. Aka it
is not online. Try to get an online node, otherwise return -1.
Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/pci/acpi.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
index d95de2f..ea8685f 100644
--- a/arch/x86/pci/acpi.c
+++ b/arch/x86/pci/acpi.c
@@ -172,6 +172,9 @@ struct pci_bus * __devinit pci_acpi_scan_root(struct acpi_device *device, int do
 		set_mp_bus_to_node(busnum, node);
 	else
 		node = get_mp_bus_to_node(busnum);
+
+	if (node != -1 && !node_online(node))
+		node = -1;
 #endif
 
 	/* Allocate per-root-bus (not per bus) arch-specific data.
--
1.8.3.1
>
> [...]
>
--
Thanks,
Xie XiuQi
On Mon, Jun 11, 2018 at 08:32:10PM +0800, Xie XiuQi wrote:
> Hi Michal,
>
> On 2018/6/11 16:52, Michal Hocko wrote:
> > On Mon 11-06-18 11:23:18, Xie XiuQi wrote:
> >> Hi Michal,
> >>
> >> On 2018/6/7 20:21, Michal Hocko wrote:
> >>> On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
> >>>> On 2018/6/7 18:55, Michal Hocko wrote:
> >>> [...]
> >>>>> I am not sure I have the full context but pci_acpi_scan_root calls
> >>>>> kzalloc_node(sizeof(*info), GFP_KERNEL, node)
> >>>>> and that should fall back to whatever node that is online. Offline node
> >>>>> shouldn't keep any pages behind. So there must be something else going
> >>>>> on here and the patch is not the right way to handle it. What does
> >>>>> faddr2line __alloc_pages_nodemask+0xf0 tells on this kernel?
> >>>>
> >>>> The whole context is:
> >>>>
> >>>> The system is booted with a NUMA node has no memory attaching to it
> >>>> (memory-less NUMA node), also with NR_CPUS less than CPUs presented
> >>>> in MADT, so CPUs on this memory-less node are not brought up, and
> >>>> this NUMA node will not be online (but SRAT presents this NUMA node);
> >>>>
> >>>> Devices attaching to this NUMA node such as PCI host bridge still
> >>>> return the valid NUMA node via _PXM, but actually that valid NUMA node
> >>>> is not online which lead to this issue.
> >>>
> >>> But we should have other numa nodes on the zonelists so the allocator
> >>> should fall back to other node. If the zonelist is not intiailized
> >>> properly, though, then this can indeed show up as a problem. Knowing
> >>> which exact place has blown up would help get a better picture...
> >>>
> >>
> >> I specific a non-exist node to allocate memory using kzalloc_node,
> >> and got this following error message.
> >>
> >> And I found out there is just a VM_WARN, but it does not prevent the memory
> >> allocation continue.
> >>
> >> This nid would be use to access NODE_DADA(nid), so if nid is invalid,
> >> it would cause oops here.
> >>
> >> 459 /*
> >> 460 * Allocate pages, preferring the node given as nid. The node must be valid and
> >> 461 * online. For more general interface, see alloc_pages_node().
> >> 462 */
> >> 463 static inline struct page *
> >> 464 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
> >> 465 {
> >> 466 VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> >> 467 VM_WARN_ON(!node_online(nid));
> >> 468
> >> 469 return __alloc_pages(gfp_mask, order, nid);
> >> 470 }
> >> 471
> >>
> >> (I wrote a ko, to allocate memory on a non-exist node using kzalloc_node().)
> >
> > OK, so this is an artificialy broken code, right. You shouldn't get a
> > non-existent node via standard APIs AFAICS. The original report was
> > about an existing node which is offline AFAIU. That would be a different
> > case. If I am missing something and there are legitimate users that try
> > to allocate from non-existing nodes then we should handle that in
> > node_zonelist.
>
> I think hanjun's comments may help to understood this question:
> - NUMA node will be built if CPUs and (or) memory are valid on this NUMA
> node;
>
> - But if we boot the system with memory-less node and also with
> CONFIG_NR_CPUS less than CPUs in SRAT, for example, 64 CPUs total with 4
> NUMA nodes, 16 CPUs on each NUMA node, if we boot with
> CONFIG_NR_CPUS=48, then we will not built numa node for node 3, but with
> devices on that numa node, alloc memory will be panic because NUMA node
> 3 is not a valid node.
>
> I triggered this BUG on arm64 platform, and I found a similar bug has
> been fixed on x86 platform. So I sent a similar patch for this bug.
>
> Or, could we consider to fix it in the mm subsystem?
The patch below (b755de8dfdfe) seems like totally the wrong direction.
I don't think we want every caller of kzalloc_node() to have to check
for node_online().
Why would memory on an off-line node even be in the allocation pool?
I wouldn't expect that memory to be put in the pool until the node
comes online and the memory is accessible, so this sounds like some
kind of setup issue.
But I'm definitely not an mm person.
> From b755de8dfdfef97effaa91379ffafcb81f4d62a1 Mon Sep 17 00:00:00 2001
> From: Yinghai Lu <[email protected]>
> Date: Wed, 20 Feb 2008 12:41:52 -0800
> Subject: [PATCH] x86: make dev_to_node return online node
>
> a numa system (with multi HT chains) may return node without ram. Aka it
> is not online. Try to get an online node, otherwise return -1.
>
> Signed-off-by: Yinghai Lu <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> arch/x86/pci/acpi.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
> index d95de2f..ea8685f 100644
> --- a/arch/x86/pci/acpi.c
> +++ b/arch/x86/pci/acpi.c
> @@ -172,6 +172,9 @@ struct pci_bus * __devinit pci_acpi_scan_root(struct acpi_device *device, int do
> set_mp_bus_to_node(busnum, node);
> else
> node = get_mp_bus_to_node(busnum);
> +
> + if (node != -1 && !node_online(node))
> + node = -1;
> #endif
>
> /* Allocate per-root-bus (not per bus) arch-specific data.
On Mon 11-06-18 08:43:03, Bjorn Helgaas wrote:
> On Mon, Jun 11, 2018 at 08:32:10PM +0800, Xie XiuQi wrote:
> > Hi Michal,
> >
> > On 2018/6/11 16:52, Michal Hocko wrote:
> > > On Mon 11-06-18 11:23:18, Xie XiuQi wrote:
> > >> Hi Michal,
> > >>
> > >> On 2018/6/7 20:21, Michal Hocko wrote:
> > >>> On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
> > >>>> On 2018/6/7 18:55, Michal Hocko wrote:
> > >>> [...]
> > >>>>> I am not sure I have the full context but pci_acpi_scan_root calls
> > >>>>> kzalloc_node(sizeof(*info), GFP_KERNEL, node)
> > >>>>> and that should fall back to whatever node that is online. Offline node
> > >>>>> shouldn't keep any pages behind. So there must be something else going
> > >>>>> on here and the patch is not the right way to handle it. What does
> > >>>>> faddr2line __alloc_pages_nodemask+0xf0 tells on this kernel?
> > >>>>
> > >>>> The whole context is:
> > >>>>
> > >>>> The system is booted with a NUMA node that has no memory attached to
> > >>>> it (a memory-less NUMA node), and with NR_CPUS less than the number
> > >>>> of CPUs presented in MADT, so the CPUs on this memory-less node are
> > >>>> not brought up, and this NUMA node will not be online (but SRAT
> > >>>> presents this NUMA node);
> > >>>>
> > >>>> Devices attached to this NUMA node, such as the PCI host bridge,
> > >>>> still return a valid NUMA node via _PXM, but that valid NUMA node is
> > >>>> not online, which leads to this issue.
> > >>>
> > >>> But we should have other NUMA nodes on the zonelists so the allocator
> > >>> should fall back to another node. If the zonelist is not initialized
> > >>> properly, though, then this can indeed show up as a problem. Knowing
> > >>> which exact place has blown up would help get a better picture...
> > >>>
> > >>
> > >> I specified a non-existent node to allocate memory from using
> > >> kzalloc_node, and got the following error message.
> > >>
> > >> And I found out there is just a VM_WARN, but it does not prevent the
> > >> memory allocation from continuing.
> > >>
> > >> This nid would be used to access NODE_DATA(nid), so if nid is invalid,
> > >> it would cause an oops here.
> > >>
> > >> 459 /*
> > >> 460 * Allocate pages, preferring the node given as nid. The node must be valid and
> > >> 461 * online. For more general interface, see alloc_pages_node().
> > >> 462 */
> > >> 463 static inline struct page *
> > >> 464 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
> > >> 465 {
> > >> 466 VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> > >> 467 VM_WARN_ON(!node_online(nid));
> > >> 468
> > >> 469 return __alloc_pages(gfp_mask, order, nid);
> > >> 470 }
> > >> 471
> > >>
> > >> (I wrote a kernel module to allocate memory on a non-existent node using kzalloc_node().)
> > >
> > > OK, so this is artificially broken code, right. You shouldn't get a
> > > non-existent node via standard APIs AFAICS. The original report was
> > > about an existing node which is offline AFAIU. That would be a different
> > > case. If I am missing something and there are legitimate users that try
> > > to allocate from non-existing nodes then we should handle that in
> > > node_zonelist.
> >
> > I think Hanjun's comments may help to understand this question:
> > - A NUMA node will be built if CPUs and/or memory are valid on this
> > NUMA node;
> >
> > - But if we boot the system with a memory-less node and also with
> > CONFIG_NR_CPUS less than the number of CPUs in SRAT (for example, 64
> > CPUs total with 4 NUMA nodes, 16 CPUs on each NUMA node, booted with
> > CONFIG_NR_CPUS=48), then we will not build a NUMA node for node 3; but
> > with devices on that NUMA node, allocating memory will panic because
> > NUMA node 3 is not a valid node.
Hmm, but this is not a memory-less node. It sounds like a misconfigured
kernel or broken initialization to me. Each CPU should have a fallback
NUMA node to be used.
> > I triggered this BUG on an arm64 platform, and I found that a similar
> > bug had been fixed on the x86 platform. So I sent a similar patch for
> > this bug.
> >
> > Or, could we consider fixing it in the mm subsystem?
>
> The patch below (b755de8dfdfe) seems like totally the wrong direction.
> I don't think we want every caller of kzalloc_node() to have to check
> for node_online().
absolutely.
> Why would memory on an off-line node even be in the allocation pool?
> I wouldn't expect that memory to be put in the pool until the node
> comes online and the memory is accessible, so this sounds like some
> kind of setup issue.
>
> But I'm definitely not an mm person.
Well, the standard way to handle memory-less NUMA nodes is to simply
fall back to the closest NUMA node. We even have an API for that
(numa_mem_id).
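(For illustration, a minimal sketch of using that API; numa_mem_id() is
from mainline include/linux/topology.h, while the function around it is
hypothetical:)

	#include <linux/slab.h>
	#include <linux/topology.h>

	/*
	 * numa_node_id() may name a memoryless node, whereas numa_mem_id()
	 * names the nearest node that actually has memory, so the latter
	 * is the safer target when the result feeds an allocation.
	 */
	static void *alloc_near_current_cpu(size_t size)
	{
		return kzalloc_node(size, GFP_KERNEL, numa_mem_id());
	}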
> > From b755de8dfdfef97effaa91379ffafcb81f4d62a1 Mon Sep 17 00:00:00 2001
> > From: Yinghai Lu <[email protected]>
> > Date: Wed, 20 Feb 2008 12:41:52 -0800
> > Subject: [PATCH] x86: make dev_to_node return online node
> >
> > a numa system (with multi HT chains) may return node without ram. Aka it
> > is not online. Try to get an online node, otherwise return -1.
> >
> > Signed-off-by: Yinghai Lu <[email protected]>
> > Signed-off-by: Ingo Molnar <[email protected]>
> > Signed-off-by: Thomas Gleixner <[email protected]>
> > ---
> > arch/x86/pci/acpi.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
> > index d95de2f..ea8685f 100644
> > --- a/arch/x86/pci/acpi.c
> > +++ b/arch/x86/pci/acpi.c
> > @@ -172,6 +172,9 @@ struct pci_bus * __devinit pci_acpi_scan_root(struct acpi_device *device, int do
> > set_mp_bus_to_node(busnum, node);
> > else
> > node = get_mp_bus_to_node(busnum);
> > +
> > + if (node != -1 && !node_online(node))
> > + node = -1;
> > #endif
> >
> > /* Allocate per-root-bus (not per bus) arch-specific data.
--
Michal Hocko
SUSE Labs
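(As an aside, a hypothetical sketch of the kind of test module Xie
mentions above; the hardcoded node id 3 and the function name are
invented for illustration, this is not the actual module:)

	#include <linux/module.h>
	#include <linux/slab.h>

	/* Ask for memory from a node id that was never created. */
	static int __init bad_nid_init(void)
	{
		void *p = kzalloc_node(64, GFP_KERNEL, 3);

		kfree(p);	/* on an affected kernel we oops before this */
		return 0;
	}
	module_init(bad_nid_init);

	MODULE_LICENSE("GPL");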
Michal Hocko <[email protected]> writes:
> On Mon 11-06-18 08:43:03, Bjorn Helgaas wrote:
>> On Mon, Jun 11, 2018 at 08:32:10PM +0800, Xie XiuQi wrote:
>> > Hi Michal,
>> >
>> > On 2018/6/11 16:52, Michal Hocko wrote:
>> > > On Mon 11-06-18 11:23:18, Xie XiuQi wrote:
>> > >> Hi Michal,
>> > >>
>> > >> On 2018/6/7 20:21, Michal Hocko wrote:
>> > >>> On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
>> > >>>> On 2018/6/7 18:55, Michal Hocko wrote:
>> > >>> [...]
>> > >>>>> I am not sure I have the full context but pci_acpi_scan_root calls
>> > >>>>> kzalloc_node(sizeof(*info), GFP_KERNEL, node)
>> > >>>>> and that should fall back to whatever node is online. An offline node
>> > >>>>> shouldn't keep any pages behind. So there must be something else going
>> > >>>>> on here and the patch is not the right way to handle it. What does
>> > >>>>> faddr2line __alloc_pages_nodemask+0xf0 tell us on this kernel?
>> > >>>>
>> > >>>> The whole context is:
>> > >>>>
>> > >>>> The system is booted with a NUMA node that has no memory attached to
>> > >>>> it (a memory-less NUMA node), and with NR_CPUS less than the number
>> > >>>> of CPUs presented in MADT, so the CPUs on this memory-less node are
>> > >>>> not brought up, and this NUMA node will not be online (but SRAT
>> > >>>> presents this NUMA node);
>> > >>>>
>> > >>>> Devices attached to this NUMA node, such as the PCI host bridge,
>> > >>>> still return a valid NUMA node via _PXM, but that valid NUMA node is
>> > >>>> not online, which leads to this issue.
>> > >>>
>> > >>> But we should have other NUMA nodes on the zonelists so the allocator
>> > >>> should fall back to another node. If the zonelist is not initialized
>> > >>> properly, though, then this can indeed show up as a problem. Knowing
>> > >>> which exact place has blown up would help get a better picture...
>> > >>>
>> > >>
>> > >> I specified a non-existent node to allocate memory from using
>> > >> kzalloc_node, and got the following error message.
>> > >>
>> > >> And I found out there is just a VM_WARN, but it does not prevent the
>> > >> memory allocation from continuing.
>> > >>
>> > >> This nid would be used to access NODE_DATA(nid), so if nid is invalid,
>> > >> it would cause an oops here.
>> > >>
>> > >> 459 /*
>> > >> 460 * Allocate pages, preferring the node given as nid. The node must be valid and
>> > >> 461 * online. For more general interface, see alloc_pages_node().
>> > >> 462 */
>> > >> 463 static inline struct page *
>> > >> 464 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>> > >> 465 {
>> > >> 466 VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>> > >> 467 VM_WARN_ON(!node_online(nid));
>> > >> 468
>> > >> 469 return __alloc_pages(gfp_mask, order, nid);
>> > >> 470 }
>> > >> 471
>> > >>
>> > >> (I wrote a kernel module to allocate memory on a non-existent node using kzalloc_node().)
>> > >
>> > > OK, so this is artificially broken code, right. You shouldn't get a
>> > > non-existent node via standard APIs AFAICS. The original report was
>> > > about an existing node which is offline AFAIU. That would be a different
>> > > case. If I am missing something and there are legitimate users that try
>> > > to allocate from non-existing nodes then we should handle that in
>> > > node_zonelist.
>> >
>> > I think Hanjun's comments may help to understand this question:
>> > - A NUMA node will be built if CPUs and/or memory are valid on this
>> > NUMA node;
>> >
>> > - But if we boot the system with a memory-less node and also with
>> > CONFIG_NR_CPUS less than the number of CPUs in SRAT (for example, 64
>> > CPUs total with 4 NUMA nodes, 16 CPUs on each NUMA node, booted with
>> > CONFIG_NR_CPUS=48), then we will not build a NUMA node for node 3; but
>> > with devices on that NUMA node, allocating memory will panic because
>> > NUMA node 3 is not a valid node.
>
> Hmm, but this is not a memory-less node. It sounds like a misconfigured
> kernel or broken initialization to me. Each CPU should have a fallback
> NUMA node to be used.
>
>> > I triggered this BUG on an arm64 platform, and I found that a similar
>> > bug had been fixed on the x86 platform. So I sent a similar patch for
>> > this bug.
>> >
>> > Or, could we consider fixing it in the mm subsystem?
>>
>> The patch below (b755de8dfdfe) seems like totally the wrong direction.
>> I don't think we want every caller of kzalloc_node() to have to check
>> for node_online().
>
> absolutely.
>
>> Why would memory on an off-line node even be in the allocation pool?
>> I wouldn't expect that memory to be put in the pool until the node
>> comes online and the memory is accessible, so this sounds like some
>> kind of setup issue.
>>
>> But I'm definitely not an mm person.
>
> Well, the standard way to handle memory-less NUMA nodes is to simply
> fall back to the closest NUMA node. We even have an API for that
> (numa_mem_id).
CONFIG_HAVE_MEMORYLESS_NODES is not enabled on arm64, which means we end
up returning the original node in the fallback path.
Xie, does the below patch help? I can submit a proper patch if this
fixes the issue for you.
-- >8 --
Subject: [PATCH] arm64/numa: Enable memoryless numa nodes
Signed-off-by: Punit Agrawal <[email protected]>
---
arch/arm64/Kconfig | 4 ++++
arch/arm64/mm/numa.c | 2 ++
2 files changed, 6 insertions(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index eb2cf4938f6d..5317e9aa93ab 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -756,6 +756,10 @@ config USE_PERCPU_NUMA_NODE_ID
def_bool y
depends on NUMA
+config HAVE_MEMORYLESS_NODES
+ def_bool y
+ depends on NUMA
+
config HAVE_SETUP_PER_CPU_AREA
def_bool y
depends on NUMA
diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index dad128ba98bf..c699dcfe93de 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -73,6 +73,8 @@ EXPORT_SYMBOL(cpumask_of_node);
static void map_cpu_to_node(unsigned int cpu, int nid)
{
set_cpu_numa_node(cpu, nid);
+ set_numa_mem(local_memory_node(nid));
+
if (nid >= 0)
cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
}
--
2.17.0
On Tue 12-06-18 16:08:03, Punit Agrawal wrote:
> Michal Hocko <[email protected]> writes:
[...]
> > Well, the standard way to handle memory-less NUMA nodes is to simply
> > fall back to the closest NUMA node. We even have an API for that
> > (numa_mem_id).
>
> CONFIG_HAVE_MEMORYLESS_NODES is not enabled on arm64, which means we end
> up returning the original node in the fallback path.
Yes, this makes more sense.
--
Michal Hocko
SUSE Labs
Punit Agrawal <[email protected]> writes:
[...]
>
> CONFIG_HAVE_MEMORYLESS_NODES is not enabled on arm64, which means we end
> up returning the original node in the fallback path.
>
> Xie, does the below patch help? I can submit a proper patch if this
> fixes the issue for you.
>
> -- >8 --
> Subject: [PATCH] arm64/numa: Enable memoryless numa nodes
>
> Signed-off-by: Punit Agrawal <[email protected]>
> ---
> arch/arm64/Kconfig | 4 ++++
> arch/arm64/mm/numa.c | 2 ++
> 2 files changed, 6 insertions(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index eb2cf4938f6d..5317e9aa93ab 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -756,6 +756,10 @@ config USE_PERCPU_NUMA_NODE_ID
> def_bool y
> depends on NUMA
>
> +config HAVE_MEMORYLESS_NODES
> + def_bool y
> + depends on NUMA
> +
> config HAVE_SETUP_PER_CPU_AREA
> def_bool y
> depends on NUMA
> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> index dad128ba98bf..c699dcfe93de 100644
> --- a/arch/arm64/mm/numa.c
> +++ b/arch/arm64/mm/numa.c
> @@ -73,6 +73,8 @@ EXPORT_SYMBOL(cpumask_of_node);
> static void map_cpu_to_node(unsigned int cpu, int nid)
> {
> set_cpu_numa_node(cpu, nid);
> + set_numa_mem(local_memory_node(nid));
Argh, this should be
set_cpu_numa_mem(cpu, local_memory_node(nid));
There is no guarantee that map_cpu_to_node() will be called on the
local cpu.
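For clarity, the function with that correction applied would read:

	static void map_cpu_to_node(unsigned int cpu, int nid)
	{
		set_cpu_numa_node(cpu, nid);
		set_cpu_numa_mem(cpu, local_memory_node(nid));

		if (nid >= 0)
			cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
	}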
Hanjun, Xie - can you try with the update please?
Thanks,
Punit
> +
> if (nid >= 0)
> cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
> }
Hi Punit,
On 2018/6/14 1:39, Punit Agrawal wrote:
> Punit Agrawal <[email protected]> writes:
>
>
> [...]
>
>>
>> CONFIG_HAVE_MEMORYLESS_NODES is not enabled on arm64, which means we end
>> up returning the original node in the fallback path.
>>
>> Xie, does the below patch help? I can submit a proper patch if this
>> fixes the issue for you.
>>
>> -- >8 --
>> Subject: [PATCH] arm64/numa: Enable memoryless numa nodes
>>
>> Signed-off-by: Punit Agrawal <[email protected]>
>> ---
>> arch/arm64/Kconfig | 4 ++++
>> arch/arm64/mm/numa.c | 2 ++
>> 2 files changed, 6 insertions(+)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index eb2cf4938f6d..5317e9aa93ab 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -756,6 +756,10 @@ config USE_PERCPU_NUMA_NODE_ID
>> def_bool y
>> depends on NUMA
>>
>> +config HAVE_MEMORYLESS_NODES
>> + def_bool y
>> + depends on NUMA
>> +
>> config HAVE_SETUP_PER_CPU_AREA
>> def_bool y
>> depends on NUMA
>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>> index dad128ba98bf..c699dcfe93de 100644
>> --- a/arch/arm64/mm/numa.c
>> +++ b/arch/arm64/mm/numa.c
>> @@ -73,6 +73,8 @@ EXPORT_SYMBOL(cpumask_of_node);
>> static void map_cpu_to_node(unsigned int cpu, int nid)
>> {
>> set_cpu_numa_node(cpu, nid);
>> + set_numa_mem(local_memory_node(nid));
>
> Argh, this should be
>
> set_cpu_numa_mem(cpu, local_memory_node(nid));
>
> There is no guarantee that map_cpu_to_node() will be called on the
> local cpu.
>
> Hanjun, Xie - can you try with the update please?
Thanks for looking into this; we will try this tomorrow
(the hardware is occupied now) and update here.
Thanks
Hanjun
Hi Punit,
On 2018/6/14 1:39, Punit Agrawal wrote:
> Punit Agrawal <[email protected]> writes:
>
>
> [...]
>
>>
>> CONFIG_HAVE_MEMORYLESS_NODES is not enabled on arm64, which means we end
>> up returning the original node in the fallback path.
>>
>> Xie, does the below patch help? I can submit a proper patch if this
>> fixes the issue for you.
>>
>> -- >8 --
>> Subject: [PATCH] arm64/numa: Enable memoryless numa nodes
>>
>> Signed-off-by: Punit Agrawal <[email protected]>
>> ---
>> arch/arm64/Kconfig | 4 ++++
>> arch/arm64/mm/numa.c | 2 ++
>> 2 files changed, 6 insertions(+)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index eb2cf4938f6d..5317e9aa93ab 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -756,6 +756,10 @@ config USE_PERCPU_NUMA_NODE_ID
>> def_bool y
>> depends on NUMA
>>
>> +config HAVE_MEMORYLESS_NODES
>> + def_bool y
>> + depends on NUMA
>> +
>> config HAVE_SETUP_PER_CPU_AREA
>> def_bool y
>> depends on NUMA
>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>> index dad128ba98bf..c699dcfe93de 100644
>> --- a/arch/arm64/mm/numa.c
>> +++ b/arch/arm64/mm/numa.c
>> @@ -73,6 +73,8 @@ EXPORT_SYMBOL(cpumask_of_node);
>> static void map_cpu_to_node(unsigned int cpu, int nid)
>> {
>> set_cpu_numa_node(cpu, nid);
>> + set_numa_mem(local_memory_node(nid));
>
> Argh, this should be
>
> set_cpu_numa_mem(cpu, local_memory_node(nid));
>
> There is no guarantee that map_cpu_to_node() will be called on the
> local cpu.
>
> Hanjun, Xie - can you try with the update please?
I've tested this patch, but it does not help.
The boot message is attached.
I tested on an arm board with 128 cores and 4 NUMA nodes, but I set CONFIG_NR_CPUS=72.
Then node 3 is not created, because node 3 has no memory and no CPUs.
But some PCI devices may be related to node 3, as set in the ACPI table.
165 /* Interface called from ACPI code to setup PCI host controller */
166 struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
167 {
168 int node = acpi_get_node(root->device->handle);
169 struct acpi_pci_generic_root_info *ri;
170 struct pci_bus *bus, *child;
171 struct acpi_pci_root_ops *root_ops;
172
// this node may not have been created.
177 ri = kzalloc_node(sizeof(*ri), GFP_KERNEL, node);
178 if (!ri)
179 return NULL;
180
181 root_ops = kzalloc_node(sizeof(*root_ops), GFP_KERNEL, node);
182 if (!root_ops) {
183 kfree(ri);
184 return NULL;
185 }
186
187 ri->cfg = pci_acpi_setup_ecam_mapping(root);
188 if (!ri->cfg) {
189 kfree(ri);
190 kfree(root_ops);
191 return NULL;
192 }
>
> Thanks,
> Punit
>
>> +
>> if (nid >= 0)
>> cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
>> }
>
> .
>
--
Thanks,
Xie XiuQi
On Tue 19-06-18 20:03:07, Xie XiuQi wrote:
[...]
> I tested on an arm board with 128 cores and 4 NUMA nodes, but I set CONFIG_NR_CPUS=72.
> Then node 3 is not created, because node 3 has no memory and no CPUs.
> But some PCI devices may be related to node 3, as set in the ACPI table.
Could you double check that zonelists for node 3 are generated
correctly?
--
Michal Hocko
SUSE Labs
Hi Michal,
On 2018/6/19 20:07, Michal Hocko wrote:
> On Tue 19-06-18 20:03:07, Xie XiuQi wrote:
> [...]
>> I tested on an arm board with 128 cores and 4 NUMA nodes, but I set CONFIG_NR_CPUS=72.
>> Then node 3 is not created, because node 3 has no memory and no CPUs.
>> But some PCI devices may be related to node 3, as set in the ACPI table.
>
> Could you double check that zonelists for node 3 are generated
> correctly?
>
Zonelists for node 3 are not created at all.
The kernel parses the SRAT table to create node info, but in this case
the SRAT table is not parsed completely. Only the first 72 entries are parsed.
By then we have not yet seen node 3's information in the SRAT table, because cpu_to_node_map[72] is too small.
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30000 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30001 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30002 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30003 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30100 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30101 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30102 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30103 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30200 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30201 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30202 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30203 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30300 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30301 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30302 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30303 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30400 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30401 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30402 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30403 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30500 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30501 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30502 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30503 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30600 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30601 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30602 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30603 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30700 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30701 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30702 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30703 -> Node 0
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10000 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10001 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10002 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10003 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10100 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10101 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10102 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10103 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10200 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10201 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10202 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10203 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10300 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10301 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10302 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10303 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10400 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10401 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10402 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10403 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10500 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10501 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10502 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10503 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10600 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10601 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10602 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10603 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10700 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10701 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10702 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10703 -> Node 1
[ 0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70000 -> Node 2
[ 0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70001 -> Node 2
[ 0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70002 -> Node 2
[ 0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70003 -> Node 2
[ 0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70100 -> Node 2
[ 0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70101 -> Node 2
[ 0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70102 -> Node 2
[ 0.000000] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x70103 -> Node 2
[ 0.000000] ACPI: NUMA: SRAT: cpu_to_node_map[72] is too small, may not be able to use all cpus
[ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x2080000000-0x23ffffffff]
[ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
[ 0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x402000000000-0x4023ffffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x23ffffe780-0x23ffffffff]
[ 0.000000] NUMA: Initmem setup node 1 [<memory-less node>]
[ 0.000000] NUMA: NODE_DATA [mem 0x4023fffed780-0x4023fffeefff]
[ 0.000000] NUMA: NODE_DATA(1) on node 2
[ 0.000000] NUMA: NODE_DATA [mem 0x4023fffebf00-0x4023fffed77f]
[ 0.000000] Zone ranges:
[ 0.000000] DMA32 [mem 0x0000000000000000-0x00000000ffffffff]
[ 0.000000] Normal [mem 0x0000000100000000-0x00004023ffffffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000000000000-0x000000003942ffff]
[ 0.000000] node 0: [mem 0x0000000039430000-0x000000003956ffff]
[ 0.000000] node 0: [mem 0x0000000039570000-0x000000003963ffff]
[ 0.000000] node 0: [mem 0x0000000039640000-0x00000000396fffff]
[ 0.000000] node 0: [mem 0x0000000039700000-0x000000003971ffff]
[ 0.000000] node 0: [mem 0x0000000039720000-0x0000000039b6ffff]
[ 0.000000] node 0: [mem 0x0000000039b70000-0x000000003eb5ffff]
[ 0.000000] node 0: [mem 0x000000003eb60000-0x000000003eb8ffff]
[ 0.000000] node 0: [mem 0x000000003eb90000-0x000000003fbfffff]
[ 0.000000] node 0: [mem 0x0000002080000000-0x00000023ffffffff]
[ 0.000000] node 2: [mem 0x0000402000000000-0x00004023ffffffff]
[ 0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x00000023ffffffff]
[ 0.000000] Could not find start_pfn for node 1
[ 0.000000] Initmem setup node 1 [mem 0x0000000000000000-0x0000000000000000]
[ 0.000000] Initmem setup node 2 [mem 0x0000402000000000-0x00004023ffffffff]
--
Thanks,
Xie XiuQi
Michal Hocko <[email protected]> writes:
> On Tue 19-06-18 20:03:07, Xie XiuQi wrote:
> [...]
>> I tested on an arm board with 128 cores and 4 NUMA nodes, but I set CONFIG_NR_CPUS=72.
>> Then node 3 is not created, because node 3 has no memory and no CPUs.
>> But some PCI devices may be related to node 3, as set in the ACPI table.
>
> Could you double check that zonelists for node 3 are generated
> correctly?
The cpus in node 3 aren't onlined and there's no memory attached - I
suspect that no zonelists are built for this node.
We skip creating a node if the number of SRAT entries parsed exceeds
NR_CPUS [0]. This in turn prevents onlining the NUMA node, and so no
zonelists will be created for it.
I think the problem will go away if the cpus are restricted via the
kernel command line by setting nr_cpus.
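(That is, keeping the compile-time CONFIG_NR_CPUS large and booting
with e.g. nr_cpus=72 on the kernel command line; the value 72 here is
just the one from Xie's setup.)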
Xie, can you try the below patch on top of the one enabling memoryless
nodes? I'm not sure this is the right solution but at least it'll
confirm the problem.
Thanks,
Punit
[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/acpi_numa.c?h=v4.18-rc1#n73
-- >8 --
diff --git a/arch/arm64/kernel/acpi_numa.c b/arch/arm64/kernel/acpi_numa.c
index d190a7b231bf..fea0f7164f1a 100644
--- a/arch/arm64/kernel/acpi_numa.c
+++ b/arch/arm64/kernel/acpi_numa.c
@@ -70,11 +70,9 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
if (!(pa->flags & ACPI_SRAT_GICC_ENABLED))
return;
- if (cpus_in_srat >= NR_CPUS) {
+ if (cpus_in_srat >= NR_CPUS)
pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
NR_CPUS);
- return;
- }
pxm = pa->proximity_domain;
node = acpi_map_pxm_to_node(pxm);
On Tue, Jun 19, 2018 at 01:52:16PM +0100, Punit Agrawal wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Tue 19-06-18 20:03:07, Xie XiuQi wrote:
> > [...]
> >> I tested on an arm board with 128 cores and 4 NUMA nodes, but I set CONFIG_NR_CPUS=72.
> >> Then node 3 is not created, because node 3 has no memory and no CPUs.
> >> But some PCI devices may be related to node 3, as set in the ACPI table.
> >
> > Could you double check that zonelists for node 3 are generated
> > correctly?
>
> The cpus in node 3 aren't onlined and there's no memory attached - I
> suspect that no zonelists are built for this node.
>
> We skip creating a node if the number of SRAT entries parsed exceeds
> NR_CPUS [0]. This in turn prevents onlining the NUMA node, and so no
> zonelists will be created for it.
>
> I think the problem will go away if the cpus are restricted via the
> kernel command line by setting nr_cpus.
>
> Xie, can you try the below patch on top of the one enabling memoryless
> nodes? I'm not sure this is the right solution but at least it'll
> confirm the problem.
This issue looks familiar (or at least related):
git log d3bd058826aa
The reason why the NR_CPUS guard is there is to avoid overflowing
the early_node_cpu_hwid array. IA64 does something different in
that respect compared to x86; we have to have a look into this.
Regardless, AFAICS the proximity-domain-to-node mappings should not
depend on CONFIG_NR_CPUS; it seems that there is something wrong with
that in the ARM64 ACPI SRAT parsing.
Lorenzo
>
> Thanks,
> Punit
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/acpi_numa.c?h=v4.18-rc1#n73
>
> -- >8 --
> diff --git a/arch/arm64/kernel/acpi_numa.c b/arch/arm64/kernel/acpi_numa.c
> index d190a7b231bf..fea0f7164f1a 100644
> --- a/arch/arm64/kernel/acpi_numa.c
> +++ b/arch/arm64/kernel/acpi_numa.c
> @@ -70,11 +70,9 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
> if (!(pa->flags & ACPI_SRAT_GICC_ENABLED))
> return;
>
> - if (cpus_in_srat >= NR_CPUS) {
> + if (cpus_in_srat >= NR_CPUS)
> pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
> NR_CPUS);
> - return;
> - }
>
> pxm = pa->proximity_domain;
> node = acpi_map_pxm_to_node(pxm);
Lorenzo Pieralisi <[email protected]> writes:
> On Tue, Jun 19, 2018 at 01:52:16PM +0100, Punit Agrawal wrote:
>> Michal Hocko <[email protected]> writes:
>>
>> > On Tue 19-06-18 20:03:07, Xie XiuQi wrote:
>> > [...]
>> >> I tested on an arm board with 128 cores and 4 NUMA nodes, but I set CONFIG_NR_CPUS=72.
>> >> Then node 3 is not created, because node 3 has no memory and no CPUs.
>> >> But some PCI devices may be related to node 3, as set in the ACPI table.
>> >
>> > Could you double check that zonelists for node 3 are generated
>> > correctly?
>>
>> The cpus in node 3 aren't onlined and there's no memory attached - I
>> suspect that no zonelists are built for this node.
>>
>> We skip creating a node if the number of SRAT entries parsed exceeds
>> NR_CPUS [0]. This in turn prevents onlining the NUMA node, and so no
>> zonelists will be created for it.
>>
>> I think the problem will go away if the cpus are restricted via the
>> kernel command line by setting nr_cpus.
>>
>> Xie, can you try the below patch on top of the one enabling memoryless
>> nodes? I'm not sure this is the right solution but at least it'll
>> confirm the problem.
>
> This issue looks familiar (or at least related):
>
> git log d3bd058826aa
Indeed. Thanks for digging into this.
>
> The reason why the NR_CPUS guard is there is to avoid overflowing
> the early_node_cpu_hwid array.
Ah right... I missed that. The below patch is definitely not what we
want.
> IA64 does something different in
> that respect compared to x86; we have to have a look into this.
>
> Regardless, AFAICS the proximity-domain-to-node mappings should not
> depend on CONFIG_NR_CPUS; it seems that there is something wrong with
> that in the ARM64 ACPI SRAT parsing.
Not only in SRAT parsing - it looks like there is a similar restriction
while parsing the ACPI MADT in acpi_map_gic_cpu_interface().
The incomplete parsing introduces a dependency on the ordering of
entries being aligned between SRAT and MADT when NR_CPUS is
restricted. We want to parse the entire table in both cases so that the
code is robust to reordering of entries.
In terms of $SUBJECT, I wonder if it's worth taking the original patch
as a temporary fix (it'll also be easier to backport) while we work on
fixing these other issues and enabling memoryless nodes.
On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
[...]
> In terms of $SUBJECT, I wonder if it's worth taking the original patch
> as a temporary fix (it'll also be easier to backport) while we work on
> fixing these other issues and enabling memoryless nodes.
Well, x86 already does that but copying this antipattern is not really
nice. So it is good as a quick fix but it would be definitely much
better to have a robust fix. Who knows how many other places might hit
this. You certainly do not want to add a hack like this all over...
--
Michal Hocko
SUSE Labs
Michal Hocko <[email protected]> writes:
> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
> [...]
>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>> as a temporary fix (it'll also be easier to backport) while we work on
>> fixing these other issues and enabling memoryless nodes.
>
> Well, x86 already does that but copying this antipattern is not really
> nice. So it is good as a quick fix but it would be definitely much
> better to have a robust fix. Who knows how many other places might hit
> this. You certainly do not want to add a hack like this all over...
Completely agree! I was only suggesting it as a temporary measure,
especially as it looked like a proper fix might be invasive.
Another fix might be to change the node specific allocation to node
agnostic allocations. It isn't clear why the allocation is being
requested from a specific node. I think Lorenzo suggested this in one of
the threads.
I've started putting together a set fixing the issues identified in this
thread. It should give a better idea on the best course of action.
On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
> > [...]
> >> In terms of $SUBJECT, I wonder if it's worth taking the original patch
> >> as a temporary fix (it'll also be easier to backport) while we work on
> >> fixing these other issues and enabling memoryless nodes.
> >
> > Well, x86 already does that but copying this antipattern is not really
> > nice. So it is good as a quick fix but it would be definitely much
> > better to have a robust fix. Who knows how many other places might hit
> > this. You certainly do not want to add a hack like this all over...
>
> Completely agree! I was only suggesting it as a temporary measure,
> especially as it looked like a proper fix might be invasive.
>
> Another fix might be to change the node specific allocation to node
> agnostic allocations. It isn't clear why the allocation is being
> requested from a specific node. I think Lorenzo suggested this in one of
> the threads.
I think that code was just copypasted but it is better to fix the
underlying issue.
> I've started putting together a set fixing the issues identified in this
> thread. It should give a better idea on the best course of action.
On ACPI ARM64, this diff should do if I read the code correctly; it
should be (famous last words) just a matter of mapping PXMs to nodes for
every SRAT GICC entry. Feel free to pick it up if it works.
Yes, we can take the original patch just because it is safer for an -rc
cycle, even though, if the patch below does the job, delaying the fix
for a couple of -rcs (to get it tested across ACPI ARM64 NUMA platforms)
is not a disaster.
Lorenzo
-- >8 --
diff --git a/arch/arm64/kernel/acpi_numa.c b/arch/arm64/kernel/acpi_numa.c
index d190a7b231bf..877b268ef9fa 100644
--- a/arch/arm64/kernel/acpi_numa.c
+++ b/arch/arm64/kernel/acpi_numa.c
@@ -70,12 +70,6 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
if (!(pa->flags & ACPI_SRAT_GICC_ENABLED))
return;
- if (cpus_in_srat >= NR_CPUS) {
- pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
- NR_CPUS);
- return;
- }
-
pxm = pa->proximity_domain;
node = acpi_map_pxm_to_node(pxm);
@@ -85,6 +79,14 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
return;
}
+ node_set(node, numa_nodes_parsed);
+
+ if (cpus_in_srat >= NR_CPUS) {
+ pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
+ NR_CPUS);
+ return;
+ }
+
mpidr = acpi_map_madt_entry(pa->acpi_processor_uid);
if (mpidr == PHYS_CPUID_INVALID) {
pr_err("SRAT: PXM %d with ACPI ID %d has no valid MPIDR in MADT\n",
@@ -95,7 +97,6 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
early_node_cpu_hwid[cpus_in_srat].node_id = node;
early_node_cpu_hwid[cpus_in_srat].cpu_hwid = mpidr;
- node_set(node, numa_nodes_parsed);
cpus_in_srat++;
pr_info("SRAT: PXM %d -> MPIDR 0x%Lx -> Node %d\n",
pxm, mpidr, node);
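(The effect of the reordering above: acpi_map_pxm_to_node() and
node_set() now run for every SRAT GICC entry, so a node referenced only
by devices is still recorded in numa_nodes_parsed and gets zonelists
built, while the NR_CPUS check keeps guarding the early_node_cpu_hwid[]
array below it.)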
Hi Lorenzo, Punit,
On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
>> Michal Hocko <[email protected]> writes:
>>
>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
>>> [...]
>>>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>>>> as a temporary fix (it'll also be easier to backport) while we work on
>>>> fixing these other issues and enabling memoryless nodes.
>>>
>>> Well, x86 already does that but copying this antipattern is not really
>>> nice. So it is good as a quick fix but it would be definitely much
>>> better to have a robust fix. Who knows how many other places might hit
>>> this. You certainly do not want to add a hack like this all over...
>>
>> Completely agree! I was only suggesting it as a temporary measure,
>> especially as it looked like a proper fix might be invasive.
>>
>> Another fix might be to change the node specific allocation to node
>> agnostic allocations. It isn't clear why the allocation is being
>> requested from a specific node. I think Lorenzo suggested this in one of
>> the threads.
>
> I think that code was just copypasted but it is better to fix the
> underlying issue.
>
>> I've started putting together a set fixing the issues identified in this
>> thread. It should give a better idea on the best course of action.
>
> On ACPI ARM64, this diff should do if I read the code correctly; it
> should be (famous last words) just a matter of mapping PXMs to nodes for
> every SRAT GICC entry. Feel free to pick it up if it works.
>
> Yes, we can take the original patch just because it is safer for an -rc
> cycle, even though, if the patch below does the job, delaying the fix
> for a couple of -rcs (to get it tested across ACPI ARM64 NUMA platforms)
> is not a disaster.
I tested this patch on my arm board; it works.
--
Thanks,
Xie XiuQi
>
> Lorenzo
>
> -- >8 --
> diff --git a/arch/arm64/kernel/acpi_numa.c b/arch/arm64/kernel/acpi_numa.c
> index d190a7b231bf..877b268ef9fa 100644
> --- a/arch/arm64/kernel/acpi_numa.c
> +++ b/arch/arm64/kernel/acpi_numa.c
> @@ -70,12 +70,6 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
> if (!(pa->flags & ACPI_SRAT_GICC_ENABLED))
> return;
>
> - if (cpus_in_srat >= NR_CPUS) {
> - pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
> - NR_CPUS);
> - return;
> - }
> -
> pxm = pa->proximity_domain;
> node = acpi_map_pxm_to_node(pxm);
>
> @@ -85,6 +79,14 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
> return;
> }
>
> + node_set(node, numa_nodes_parsed);
> +
> + if (cpus_in_srat >= NR_CPUS) {
> + pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
> + NR_CPUS);
> + return;
> + }
> +
> mpidr = acpi_map_madt_entry(pa->acpi_processor_uid);
> if (mpidr == PHYS_CPUID_INVALID) {
> pr_err("SRAT: PXM %d with ACPI ID %d has no valid MPIDR in MADT\n",
> @@ -95,7 +97,6 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
>
> early_node_cpu_hwid[cpus_in_srat].node_id = node;
> early_node_cpu_hwid[cpus_in_srat].cpu_hwid = mpidr;
> - node_set(node, numa_nodes_parsed);
> cpus_in_srat++;
> pr_info("SRAT: PXM %d -> MPIDR 0x%Lx -> Node %d\n",
> pxm, mpidr, node);
>
> .
>
Xie XiuQi <[email protected]> writes:
> Hi Lorenzo, Punit,
>
>
> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
>>> Michal Hocko <[email protected]> writes:
>>>
>>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
>>>> [...]
>>>>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>>>>> as a temporary fix (it'll also be easier to backport) while we work on
>>>>> fixing these other issues and enabling memoryless nodes.
>>>>
>>>> Well, x86 already does that but copying this antipattern is not really
>>>> nice. So it is good as a quick fix but it would be definitely much
>>>> better to have a robust fix. Who knows how many other places might hit
>>>> this. You certainly do not want to add a hack like this all over...
>>>
>>> Completely agree! I was only suggesting it as a temporary measure,
>>> especially as it looked like a proper fix might be invasive.
>>>
>>> Another fix might be to change the node specific allocation to node
>>> agnostic allocations. It isn't clear why the allocation is being
>>> requested from a specific node. I think Lorenzo suggested this in one of
>>> the threads.
>>
>> I think that code was just copypasted but it is better to fix the
>> underlying issue.
>>
>>> I've started putting together a set fixing the issues identified in this
>>> thread. It should give a better idea on the best course of action.
>>
>> On ACPI ARM64, this diff should do if I read the code correctly; it
>> should be (famous last words) just a matter of mapping PXMs to nodes for
>> every SRAT GICC entry. Feel free to pick it up if it works.
>>
>> Yes, we can take the original patch just because it is safer for an -rc
>> cycle, even though, if the patch below does the job, delaying the fix
>> for a couple of -rcs (to get it tested across ACPI ARM64 NUMA platforms)
>> is not a disaster.
>
> I tested this patch on my arm board; it works.
I am assuming you tried the patch without enabling support for
memory-less nodes.
The patch de-couples the onlining of numa nodes (as parsed from SRAT)
from the NR_CPUS restriction. When it comes to building zonelists, the
node referenced by the PCI controller also has zonelists initialised.
So it looks like a fallback node is set up even if we don't have
memory-less nodes enabled. I need to stare some more at the code to see
why we need memory-less nodes at all then ...
On 2018/6/20 19:51, Punit Agrawal wrote:
> Xie XiuQi <[email protected]> writes:
>
>> Hi Lorenzo, Punit,
>>
>>
>> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
>>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
>>>> Michal Hocko <[email protected]> writes:
>>>>
>>>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
>>>>> [...]
>>>>>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>>>>>> as a temporary fix (it'll also be easier to backport) while we work on
>>>>>> fixing these other issues and enabling memoryless nodes.
>>>>>
>>>>> Well, x86 already does that but copying this antipattern is not really
>>>>> nice. So it is good as a quick fix but it would be definitely much
>>>>> better to have a robust fix. Who knows how many other places might hit
>>>>> this. You certainly do not want to add a hack like this all over...
>>>>
>>>> Completely agree! I was only suggesting it as a temporary measure,
>>>> especially as it looked like a proper fix might be invasive.
>>>>
>>>> Another fix might be to change the node specific allocation to node
>>>> agnostic allocations. It isn't clear why the allocation is being
>>>> requested from a specific node. I think Lorenzo suggested this in one of
>>>> the threads.
>>>
>>> I think that code was just copypasted but it is better to fix the
>>> underlying issue.
>>>
>>>> I've started putting together a set fixing the issues identified in this
>>>> thread. It should give a better idea on the best course of action.
>>>
>>> On ACPI ARM64, this diff should do if I read the code correctly; it
>>> should be (famous last words) just a matter of mapping PXMs to nodes for
>>> every SRAT GICC entry. Feel free to pick it up if it works.
>>>
>>> Yes, we can take the original patch just because it is safer for an -rc
>>> cycle, even though, if the patch below does the job, delaying the fix
>>> for a couple of -rcs (to get it tested across ACPI ARM64 NUMA platforms)
>>> is not a disaster.
>>
>> I tested this patch on my arm board; it works.
>
> I am assuming you tried the patch without enabling support for
> memory-less nodes.
>
> The patch de-couples the onlining of numa nodes (as parsed from SRAT)
> from the NR_CPUS restriction. When it comes to building zonelists, the
> node referenced by the PCI controller also has zonelists initialised.
>
> So it looks like a fallback node is set up even if we don't have
> memory-less nodes enabled. I need to stare some more at the code to see
> why we need memory-less nodes at all then ...
Yes, please. From my limited MM knowledge, zonelists should not be
initialised if there is no CPU and no memory on a node; correct me if
I'm wrong.
Thanks
Hanjun
On Fri 22-06-18 16:58:05, Hanjun Guo wrote:
> On 2018/6/20 19:51, Punit Agrawal wrote:
> > Xie XiuQi <[email protected]> writes:
> >
> >> Hi Lorenzo, Punit,
> >>
> >>
> >> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
> >>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
> >>>> Michal Hocko <[email protected]> writes:
> >>>>
> >>>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
> >>>>> [...]
> >>>>>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
> >>>>>> as a temporary fix (it'll also be easier to backport) while we work on
> >>>>>> fixing these other issues and enabling memoryless nodes.
> >>>>>
> >>>>> Well, x86 already does that but copying this antipattern is not really
> >>>>> nice. So it is good as a quick fix but it would be definitely much
> >>>>> better to have a robust fix. Who knows how many other places might hit
> >>>>> this. You certainly do not want to add a hack like this all over...
> >>>>
> >>>> Completely agree! I was only suggesting it as a temporary measure,
> >>>> especially as it looked like a proper fix might be invasive.
> >>>>
> >>>> Another fix might be to change the node specific allocation to node
> >>>> agnostic allocations. It isn't clear why the allocation is being
> >>>> requested from a specific node. I think Lorenzo suggested this in one of
> >>>> the threads.
> >>>
> >>> I think that code was just copypasted but it is better to fix the
> >>> underlying issue.
> >>>
> >>>> I've started putting together a set fixing the issues identified in this
> >>>> thread. It should give a better idea on the best course of action.
> >>>
> >>> On ACPI ARM64, this diff should do if I read the code correctly; it
> >>> should be (famous last words) just a matter of mapping PXMs to nodes for
> >>> every SRAT GICC entry. Feel free to pick it up if it works.
> >>>
> >>> Yes, we can take the original patch just because it is safer for an -rc
> >>> cycle, even though, if the patch below does the job, delaying the fix
> >>> for a couple of -rcs (to get it tested across ACPI ARM64 NUMA platforms)
> >>> is not a disaster.
> >>
> >> I tested this patch on my arm board; it works.
> >
> > I am assuming you tried the patch without enabling support for
> > memory-less nodes.
> >
> > The patch de-couples the onlining of numa nodes (as parsed from SRAT)
> > from the NR_CPUS restriction. When it comes to building zonelists, the
> > node referenced by the PCI controller also has zonelists initialised.
> >
> > So it looks like a fallback node is set up even if we don't have
> > memory-less nodes enabled. I need to stare some more at the code to see
> > why we need memory-less nodes at all then ...
>
> Yes, please. From my limited MM knowledge, zonelists should not be
> initialised if there is no CPU and no memory on a node; correct me if
> I'm wrong.
Well, as long as there is code which can explicitly ask for a specific
node, then it is safer to have zonelists configured. Otherwise you just
force callers to add hacks and figure out the proper placement there.
Zonelists should be cheap to configure for all possible nodes. It's not
like we are talking about a huge amount of resources.
--
Michal Hocko
SUSE Labs
Michal Hocko <[email protected]> writes:
> On Fri 22-06-18 16:58:05, Hanjun Guo wrote:
>> On 2018/6/20 19:51, Punit Agrawal wrote:
>> > Xie XiuQi <[email protected]> writes:
>> >
>> >> Hi Lorenzo, Punit,
>> >>
>> >>
>> >> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
>> >>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
>> >>>> Michal Hocko <[email protected]> writes:
>> >>>>
>> >>>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
>> >>>>> [...]
>> >>>>>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>> >>>>>> as a temporary fix (it'll also be easier to backport) while we work on
>> >>>>>> fixing these other issues and enabling memoryless nodes.
>> >>>>>
>> >>>>> Well, x86 already does that but copying this antipattern is not really
>> >>>>> nice. So it is good as a quick fix but it would be definitely much
>> >>>>> better to have a robust fix. Who knows how many other places might hit
>> >>>>> this. You certainly do not want to add a hack like this all over...
>> >>>>
>> >>>> Completely agree! I was only suggesting it as a temporary measure,
>> >>>> especially as it looked like a proper fix might be invasive.
>> >>>>
>> >>>> Another fix might be to change the node specific allocation to node
>> >>>> agnostic allocations. It isn't clear why the allocation is being
>> >>>> requested from a specific node. I think Lorenzo suggested this in one of
>> >>>> the threads.
>> >>>
>> >>> I think that code was just copypasted but it is better to fix the
>> >>> underlying issue.
>> >>>
>> >>>> I've started putting together a set fixing the issues identified in this
>> >>>> thread. It should give a better idea on the best course of action.
>> >>>
>> >>> On ACPI ARM64, this diff should do if I read the code correctly; it
>> >>> should be (famous last words) just a matter of mapping PXMs to nodes for
>> >>> every SRAT GICC entry. Feel free to pick it up if it works.
>> >>>
>> >>> Yes, we can take the original patch just because it is safer for an -rc
>> >>> cycle, even though, if the patch below does the job, delaying the fix
>> >>> for a couple of -rcs (to get it tested across ACPI ARM64 NUMA platforms)
>> >>> is not a disaster.
>> >>
>> >> I tested this patch on my arm board; it works.
>> >
>> > I am assuming you tried the patch without enabling support for
>> > memory-less nodes.
>> >
>> > The patch de-couples the onlining of numa nodes (as parsed from SRAT)
>> > from the NR_CPUS restriction. When it comes to building zonelists, the
>> > node referenced by the PCI controller also has zonelists initialised.
>> >
>> > So it looks like a fallback node is set up even if we don't have
>> > memory-less nodes enabled. I need to stare some more at the code to see
>> > why we need memory-less nodes at all then ...
>>
>> Yes, please. From my limited MM knowledge, zonelists should not be
>> initialised if there is no CPU and no memory on a node; correct me if
>> I'm wrong.
>
> Well, as long as there is code which can explicitly ask for a specific
> node, then it is safer to have zonelists configured. Otherwise you just
> force callers to add hacks and figure out the proper placement there.
> Zonelists should be cheap to configure for all possible nodes. It's not
> like we are talking about a huge amount of resources.
I agree. The current problem stems from not configuring the zonelists
for nodes that don't have onlined CPUs or memory. Lorenzo's patch fixes
the configuration of such nodes.
For allocation requests targeting memory-less nodes, the allocator will
take the slow path and fall back to one of the other nodes based on the
zonelists.
I'm not sure how common such allocations are but I'll work on enabling
CONFIG_HAVE_MEMORYLESS_NODES on top of Lorenzo's patch. AIUI, this
config improves the fallback mechanism by starting the search from a
near-by node with memory.
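(A hypothetical illustration of that distinction, using the
cpu_to_node()/cpu_to_mem() accessors from mainline; the node numbers
are invented:)

	/*
	 * A CPU sitting on memoryless node 3, whose nearest memory is on
	 * node 2: the topology is preserved, while allocations start from
	 * a node that actually has memory.
	 */
	int nid = cpu_to_node(cpu);	/* 3: the CPU's topological node */
	int mem = cpu_to_mem(cpu);	/* 2: nearest node with memory   */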
On Fri, 22 Jun 2018 11:24:38 +0100
Punit Agrawal <[email protected]> wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Fri 22-06-18 16:58:05, Hanjun Guo wrote:
> >> On 2018/6/20 19:51, Punit Agrawal wrote:
> >> > Xie XiuQi <[email protected]> writes:
> >> >
> >> >> Hi Lorenzo, Punit,
> >> >>
> >> >>
> >> >> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
> >> >>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
> >> >>>> Michal Hocko <[email protected]> writes:
> >> >>>>
> >> >>>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
> >> >>>>> [...]
> >> >>>>>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
> >> >>>>>> as a temporary fix (it'll also be easier to backport) while we work on
> >> >>>>>> fixing these other issues and enabling memoryless nodes.
> >> >>>>>
> >> >>>>> Well, x86 already does that but copying this antipattern is not really
> >> >>>>> nice. So it is good as a quick fix but it would be definitely much
> >> >>>>> better to have a robust fix. Who knows how many other places might hit
> >> >>>>> this. You certainly do not want to add a hack like this all over...
> >> >>>>
> >> >>>> Completely agree! I was only suggesting it as a temporary measure,
> >> >>>> especially as it looked like a proper fix might be invasive.
> >> >>>>
> >> >>>> Another fix might be to change the node specific allocation to node
> >> >>>> agnostic allocations. It isn't clear why the allocation is being
> >> >>>> requested from a specific node. I think Lorenzo suggested this in one of
> >> >>>> the threads.
> >> >>>
> >> >>> I think that code was just copypasted but it is better to fix the
> >> >>> underlying issue.
> >> >>>
> >> >>>> I've started putting together a set fixing the issues identified in this
> >> >>>> thread. It should give a better idea on the best course of action.
> >> >>>
> >> >>> On ACPI ARM64, this diff should do if I read the code correctly; it
> >> >>> should be (famous last words) just a matter of mapping PXMs to nodes for
> >> >>> every SRAT GICC entry. Feel free to pick it up if it works.
> >> >>>
> >> >>> Yes, we can take the original patch just because it is safer for an -rc
> >> >>> cycle, even though, if the patch below does the job, delaying the fix
> >> >>> for a couple of -rcs (to get it tested across ACPI ARM64 NUMA platforms)
> >> >>> is not a disaster.
> >> >>
> >> >> I tested this patch on my arm board; it works.
> >> >
> >> > I am assuming you tried the patch without enabling support for
> >> > memory-less nodes.
> >> >
> >> > The patch de-couples the onlining of numa nodes (as parsed from SRAT)
> >> > from the NR_CPUS restriction. When it comes to building zonelists, the
> >> > node referenced by the PCI controller also has zonelists initialised.
> >> >
> >> > So it looks like a fallback node is set up even if we don't have
> >> > memory-less nodes enabled. I need to stare some more at the code to see
> >> > why we need memory-less nodes at all then ...
> >>
> >> Yes, please. From my limited MM knowledge, zonelists should not be
> >> initialised if there is no CPU and no memory on a node; correct me if
> >> I'm wrong.
> >
> > Well, as long as there is code which can explicitly ask for a specific
> > node, then it is safer to have zonelists configured. Otherwise you just
> > force callers to add hacks and figure out the proper placement there.
> > Zonelists should be cheap to configure for all possible nodes. It's not
> > like we are talking about a huge amount of resources.
>
> I agree. The current problem stems from not configuring the zonelists
> for nodes that don't have onlined CPUs or memory. Lorenzo's patch fixes
> the configuration of such nodes.
>
> For allocation requests targeting memory-less nodes, the allocator will
> take the slow path and fall back to one of the other nodes based on the
> zonelists.
>
> I'm not sure how common such allocations are but I'll work on enabling
> CONFIG_HAVE_MEMORYLESS_NODES on top of Lorenzo's patch. AIUI, this
> config improves the fallback mechanism by starting the search from a
> near-by node with memory.
I'll test it when back in the office, but I had a similar issue with
memory-only nodes when I moved the SRAT listing for cpus from the
4th node to the 3rd node to fake some memory I could hot unplug.
This gave a memory-only node for the last node on the system.
When I instead moved cpus from the 3rd node to the 4th (so the node
with only memory was now in the middle), everything worked.
Was odd, and I'd been meaning to chase it down but hadn't gotten to it
yet. If I get time I'll put together some test firmwares and see if there
are any other nasty corner cases we aren't handling.
Jonathan
Jonathan Cameron <[email protected]> writes:
[...]
>
> I'll test it when back in the office, but I had a similar issue with
> memory-only nodes when I moved the SRAT listing for cpus from the
> 4th node to the 3rd node to fake some memory I could hot unplug.
> This gave a memory-only node for the last node on the system.
>
> When I instead moved cpus from the 3rd node to the 4th (so the node
> with only memory was now in the middle), everything worked.
>
> Was odd, and I'd been meaning to chase it down but hadn't gotten to it
> yet. If I get time I'll put together some test firmwares and see if there
> are any other nasty corner cases we aren't handling.
If you get a chance, it'd be really helpful to test reversing the
ordering of entries in the SRAT and booting with a restricted
NR_CPUS.
This issue was found through code inspection.
Please make sure to use the updated patch from Lorenzo for your
tests[0].
[0] https://marc.info/?l=linux-acpi&m=152998665713983&w=2
>
> Jonathan
>