2014-07-11 07:35:16

by Jiang Liu

Subject: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

We previously posted a patch to fix a crash caused by memoryless
nodes on x86 platforms; please refer to
http://comments.gmane.org/gmane.linux.kernel/1687425

As David Rientjes suggested, the most suitable fix is to use
cpu_to_mem() rather than cpu_to_node() in the callers. This patchset
implements that suggestion.

Patches 1-26 prepare for enabling memoryless nodes on x86 platforms by
replacing cpu_to_node()/numa_node_id() with cpu_to_mem()/numa_mem_id().
Patches 27-29 enable support for memoryless nodes on x86 platforms.
Patch 30 onlines the NUMA node earlier during CPU hot-addition.

This patchset fixes the issue mentioned by Mike Galbraith, where CPUs
are associated with the wrong node after memory is added to a
memoryless node.

With memoryless node support enabled, the system hardware topology is
reported correctly for nodes without memory installed:
root@bkd01sdp:~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
node 0 size: 15725 MB
node 0 free: 15129 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15862 MB
node 1 free: 15627 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10
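
To illustrate what the output above implies for allocations, here is a
minimal userspace sketch (not kernel code) that picks, for each node, the
nearest node that has memory. The node sizes and distances are copied from
the numactl output; nearest_mem_node() is a hypothetical helper, and
breaking ties toward the lowest node id is an assumption of this sketch
rather than the kernel's actual fallback order.

```c
#include <assert.h>
#include <limits.h>

#define NR_NODES 4

/* Node sizes in MB, copied from the numactl output above. */
static const unsigned int node_size_mb[NR_NODES] = { 15725, 15862, 0, 0 };

/* Node distance table, copied from the numactl output above. */
static const int node_distance[NR_NODES][NR_NODES] = {
	{ 10, 21, 21, 21 },
	{ 21, 10, 21, 21 },
	{ 21, 21, 10, 21 },
	{ 21, 21, 21, 10 },
};

/*
 * Hypothetical userspace model of the cpu_to_mem() idea: return the
 * closest node that has memory.  Ties go to the lowest node id, which
 * is an assumption of this sketch.  Returns -1 if no node has memory.
 */
int nearest_mem_node(int node)
{
	int best = -1, best_dist = INT_MAX;

	assert(node >= 0 && node < NR_NODES);
	for (int n = 0; n < NR_NODES; n++) {
		if (node_size_mb[n] == 0)
			continue;	/* memoryless: skip as a candidate */
		if (node_distance[node][n] < best_dist) {
			best_dist = node_distance[node][n];
			best = n;
		}
	}
	return best;
}
```

With this table, nodes 0 and 1 resolve to themselves, while nodes 2 and 3
both resolve to node 0, because every remote node is at distance 21 and
node 0 is the first candidate with memory.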

With memoryless node support enabled, CPUs are correctly associated with
node 2 after memory is hot-added to node 2:
root@bkd01sdp:/sys/devices/system/node/node2# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
node 0 size: 15725 MB
node 0 free: 14872 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15862 MB
node 1 free: 15641 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
node 2 size: 128 MB
node 2 free: 127 MB
node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10
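
One patch in this series is titled "mm: Update _mem_id_[] for every
possible CPU when memory configuration changes". As a rough userspace
model of why such a refresh is needed, the sketch below caches a
CPU-to-memory-node table and rebuilds it when memory is hot-added. All
names here (mem_id[], cpu_home_node(), update_mem_ids(), hot_add_memory(),
cpu_mem_id()) are hypothetical; the CPU layout and distances follow the
numactl output above, and lowest-node-id tie-breaking is an assumption.

```c
#include <assert.h>
#include <limits.h>

#define NR_NODES 4
#define NR_CPUS  120

/* Node sizes in MB; nodes 2 and 3 start out memoryless. */
static unsigned int node_size_mb[NR_NODES] = { 15725, 15862, 0, 0 };

/* Cached CPU -> memory node table (a stand-in for the idea of _mem_id_[]). */
static int mem_id[NR_CPUS];

/* Home node of a CPU, matching the CPU lists in the output above. */
int cpu_home_node(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return (cpu % 60) / 15;
}

/* Distance between nodes: 10 if local, 21 if remote, as reported above. */
static int node_distance(int a, int b)
{
	return a == b ? 10 : 21;
}

/* Recompute the cached memory node of every possible CPU. */
void update_mem_ids(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		int home = cpu_home_node(cpu);
		int best = -1, best_dist = INT_MAX;

		for (int n = 0; n < NR_NODES; n++) {
			if (node_size_mb[n] && node_distance(home, n) < best_dist) {
				best_dist = node_distance(home, n);
				best = n;
			}
		}
		mem_id[cpu] = best;
	}
}

/* Model memory hot-addition: grow the node, then refresh every CPU. */
void hot_add_memory(int node, unsigned int mb)
{
	assert(node >= 0 && node < NR_NODES);
	node_size_mb[node] += mb;
	update_mem_ids();
}

/* Look up the cached memory node of a CPU. */
int cpu_mem_id(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return mem_id[cpu];
}
```

In this model, after an initial update_mem_ids(), CPU 30 maps to node 0;
once hot_add_memory(2, 128) has run, it maps to node 2, while CPU 45 on the
still-memoryless node 3 keeps mapping to node 0.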

The patchset is based on the latest mainline kernel and has been
tested on a 4-socket Intel platform with CPU/memory hot-addition
capability.

Comments are welcome!

Jiang Liu (30):
  mm, kernel: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, sched: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, net: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, netfilter: Use cpu_to_mem()/numa_mem_id() to support memoryless
    node
  mm, perf: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, tracing: Use cpu_to_mem()/numa_mem_id() to support memoryless
    node
  mm: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, thp: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, memcg: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, xfrm: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, char/mspec.c: Use cpu_to_mem()/numa_mem_id() to support
    memoryless node
  mm, IB/qib: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, i40e: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, i40evf: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, igb: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, ixgbe: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, intel_powerclamp: Use cpu_to_mem()/numa_mem_id() to support
    memoryless node
  mm, bnx2fc: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, bnx2i: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, fcoe: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, irqchip: Use cpu_to_mem()/numa_mem_id() to support memoryless
    node
  mm, of: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, x86: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, x86/platform/uv: Use cpu_to_mem()/numa_mem_id() to support
    memoryless node
  mm, x86, kvm: Use cpu_to_mem()/numa_mem_id() to support memoryless
    node
  mm, x86, perf: Use cpu_to_mem()/numa_mem_id() to support memoryless
    node
  x86, numa: Kill useless code to improve code readability
  mm: Update _mem_id_[] for every possible CPU when memory
    configuration changes
  mm, x86: Enable memoryless node support to better support CPU/memory
    hotplug
  x86, NUMA: Online node earlier when doing CPU hot-addition

arch/x86/Kconfig | 3 ++
arch/x86/kernel/acpi/boot.c | 6 ++-
arch/x86/kernel/apic/io_apic.c | 10 ++---
arch/x86/kernel/cpu/perf_event_amd.c | 2 +-
arch/x86/kernel/cpu/perf_event_amd_uncore.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 6 +--
arch/x86/kernel/cpu/perf_event_intel_rapl.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel_uncore.c | 2 +-
arch/x86/kernel/devicetree.c | 2 +-
arch/x86/kernel/irq_32.c | 4 +-
arch/x86/kernel/smpboot.c | 2 +
arch/x86/kvm/vmx.c | 2 +-
arch/x86/mm/numa.c | 52 +++++++++++++++++--------
arch/x86/platform/uv/tlb_uv.c | 2 +-
arch/x86/platform/uv/uv_nmi.c | 3 +-
arch/x86/platform/uv/uv_time.c | 2 +-
drivers/char/mspec.c | 2 +-
drivers/infiniband/hw/qib/qib_file_ops.c | 4 +-
drivers/infiniband/hw/qib/qib_init.c | 2 +-
drivers/irqchip/irq-clps711x.c | 2 +-
drivers/irqchip/irq-gic.c | 2 +-
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 2 +-
drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 2 +-
drivers/net/ethernet/intel/igb/igb_main.c | 4 +-
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 +-
drivers/of/base.c | 2 +-
drivers/scsi/bnx2fc/bnx2fc_fcoe.c | 2 +-
drivers/scsi/bnx2i/bnx2i_init.c | 2 +-
drivers/scsi/fcoe/fcoe.c | 2 +-
drivers/thermal/intel_powerclamp.c | 4 +-
include/linux/gfp.h | 6 +--
kernel/events/callchain.c | 2 +-
kernel/events/core.c | 2 +-
kernel/events/ring_buffer.c | 2 +-
kernel/rcu/rcutorture.c | 2 +-
kernel/sched/core.c | 8 ++--
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 4 +-
kernel/sched/rt.c | 6 +--
kernel/smp.c | 2 +-
kernel/smpboot.c | 2 +-
kernel/taskstats.c | 2 +-
kernel/timer.c | 2 +-
kernel/trace/ring_buffer.c | 12 +++---
kernel/trace/trace_uprobe.c | 2 +-
mm/huge_memory.c | 6 +--
mm/memcontrol.c | 2 +-
mm/memory.c | 2 +-
mm/page_alloc.c | 10 ++---
mm/percpu-vm.c | 2 +-
mm/vmalloc.c | 2 +-
net/core/dev.c | 6 +--
net/core/flow.c | 2 +-
net/core/pktgen.c | 10 ++---
net/core/sysctl_net_core.c | 2 +-
net/netfilter/x_tables.c | 8 ++--
net/xfrm/xfrm_ipcomp.c | 2 +-
58 files changed, 139 insertions(+), 111 deletions(-)

--
1.7.10.4


2014-07-11 07:35:24

by Jiang Liu

Subject: [RFC Patch V1 01/30] mm, kernel: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which can later cause a system failure
or panic when kmalloc_node() and friends are called with the returned
node id. Use cpu_to_mem()/numa_mem_id() instead to get the nearest node
with memory for the given/current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
kernel/rcu/rcutorture.c | 2 +-
kernel/smp.c | 2 +-
kernel/smpboot.c | 2 +-
kernel/taskstats.c | 2 +-
kernel/timer.c | 2 +-
5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 7fa34f86e5ba..f593762d3214 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -1209,7 +1209,7 @@ static int rcutorture_booster_init(int cpu)
mutex_lock(&boost_mutex);
VERBOSE_TOROUT_STRING("Creating rcu_torture_boost task");
boost_tasks[cpu] = kthread_create_on_node(rcu_torture_boost, NULL,
- cpu_to_node(cpu),
+ cpu_to_mem(cpu),
"rcu_torture_boost");
if (IS_ERR(boost_tasks[cpu])) {
retval = PTR_ERR(boost_tasks[cpu]);
diff --git a/kernel/smp.c b/kernel/smp.c
index 80c33f8de14f..2f3b84aef159 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -41,7 +41,7 @@ hotplug_cfd(struct notifier_block *nfb, unsigned long action, void *hcpu)
case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:
if (!zalloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL,
- cpu_to_node(cpu)))
+ cpu_to_mem(cpu)))
return notifier_from_errno(-ENOMEM);
cfd->csd = alloc_percpu(struct call_single_data);
if (!cfd->csd) {
diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index eb89e1807408..9c08e68e48a9 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -171,7 +171,7 @@ __smpboot_create_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
if (tsk)
return 0;

- td = kzalloc_node(sizeof(*td), GFP_KERNEL, cpu_to_node(cpu));
+ td = kzalloc_node(sizeof(*td), GFP_KERNEL, cpu_to_mem(cpu));
if (!td)
return -ENOMEM;
td->cpu = cpu;
diff --git a/kernel/taskstats.c b/kernel/taskstats.c
index 13d2f7cd65db..cf5cba1e7fbe 100644
--- a/kernel/taskstats.c
+++ b/kernel/taskstats.c
@@ -304,7 +304,7 @@ static int add_del_listener(pid_t pid, const struct cpumask *mask, int isadd)
if (isadd == REGISTER) {
for_each_cpu(cpu, mask) {
s = kmalloc_node(sizeof(struct listener),
- GFP_KERNEL, cpu_to_node(cpu));
+ GFP_KERNEL, cpu_to_mem(cpu));
if (!s) {
ret = -ENOMEM;
goto cleanup;
diff --git a/kernel/timer.c b/kernel/timer.c
index 3bb01a323b2a..5831a38b5681 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1546,7 +1546,7 @@ static int init_timers_cpu(int cpu)
* The APs use this path later in boot
*/
base = kzalloc_node(sizeof(*base), GFP_KERNEL,
- cpu_to_node(cpu));
+ cpu_to_mem(cpu));
if (!base)
return -ENOMEM;

--
1.7.10.4
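
The changelog above states that, with CONFIG_HAVE_MEMORYLESS_NODES
disabled, cpu_to_mem() is the same as cpu_to_node(). A minimal userspace
sketch of that compile-time relationship follows. It is not the kernel
implementation: HAVE_MEMORYLESS_NODES is a hypothetical stand-in macro, the
four-CPU topology is made up, and the functions merely borrow the kernel
helpers' names.

```c
#include <assert.h>

#define NR_CPUS 4

/* Home node of each CPU (made-up example topology for this sketch). */
static const int cpu_node[NR_CPUS] = { 0, 0, 1, 1 };

/* Userspace stand-in: node the CPU physically belongs to. */
int cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpu_node[cpu];
}

#ifdef HAVE_MEMORYLESS_NODES
/*
 * Built with -DHAVE_MEMORYLESS_NODES: node 1 is assumed to be
 * memoryless, so its CPUs are mapped to node 0 for allocations.
 */
static const int cpu_mem[NR_CPUS] = { 0, 0, 0, 0 };

int cpu_to_mem(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpu_mem[cpu];
}
#else
/* Without memoryless node support the two lookups are identical. */
int cpu_to_mem(int cpu)
{
	return cpu_to_node(cpu);
}
#endif
```

Because callers only ever see cpu_to_mem(), the same call site works in
both configurations, which is why the conversion in this patch is expected
to be a no-op on kernels built without memoryless node support.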

2014-07-11 07:35:46

by Jiang Liu

Subject: [RFC Patch V1 06/30] mm, tracing: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which can later cause a system failure
or panic when kmalloc_node() and friends are called with the returned
node id. Use cpu_to_mem()/numa_mem_id() instead to get the nearest node
with memory for the given/current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
kernel/trace/ring_buffer.c | 12 ++++++------
kernel/trace/trace_uprobe.c | 2 +-
2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 7c56c3d06943..38c51583f968 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1124,13 +1124,13 @@ static int __rb_allocate_pages(int nr_pages, struct list_head *pages, int cpu)
*/
bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
GFP_KERNEL | __GFP_NORETRY,
- cpu_to_node(cpu));
+ cpu_to_mem(cpu));
if (!bpage)
goto free_pages;

list_add(&bpage->list, pages);

- page = alloc_pages_node(cpu_to_node(cpu),
+ page = alloc_pages_node(cpu_to_mem(cpu),
GFP_KERNEL | __GFP_NORETRY, 0);
if (!page)
goto free_pages;
@@ -1183,7 +1183,7 @@ rb_allocate_cpu_buffer(struct ring_buffer *buffer, int nr_pages, int cpu)
int ret;

cpu_buffer = kzalloc_node(ALIGN(sizeof(*cpu_buffer), cache_line_size()),
- GFP_KERNEL, cpu_to_node(cpu));
+ GFP_KERNEL, cpu_to_mem(cpu));
if (!cpu_buffer)
return NULL;

@@ -1198,14 +1198,14 @@ rb_allocate_cpu_buffer(struct ring_buffer *buffer, int nr_pages, int cpu)
init_waitqueue_head(&cpu_buffer->irq_work.waiters);

bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
- GFP_KERNEL, cpu_to_node(cpu));
+ GFP_KERNEL, cpu_to_mem(cpu));
if (!bpage)
goto fail_free_buffer;

rb_check_bpage(cpu_buffer, bpage);

cpu_buffer->reader_page = bpage;
- page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0);
+ page = alloc_pages_node(cpu_to_mem(cpu), GFP_KERNEL, 0);
if (!page)
goto fail_free_reader;
bpage->page = page_address(page);
@@ -4378,7 +4378,7 @@ void *ring_buffer_alloc_read_page(struct ring_buffer *buffer, int cpu)
struct buffer_data_page *bpage;
struct page *page;

- page = alloc_pages_node(cpu_to_node(cpu),
+ page = alloc_pages_node(cpu_to_mem(cpu),
GFP_KERNEL | __GFP_NORETRY, 0);
if (!page)
return NULL;
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 3c9b97e6b1f4..e585fb67472b 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -692,7 +692,7 @@ static int uprobe_buffer_init(void)
return -ENOMEM;

for_each_possible_cpu(cpu) {
- struct page *p = alloc_pages_node(cpu_to_node(cpu),
+ struct page *p = alloc_pages_node(cpu_to_mem(cpu),
GFP_KERNEL, 0);
if (p == NULL) {
err_cpu = cpu;
--
1.7.10.4

2014-07-11 07:35:39

by Jiang Liu

Subject: [RFC Patch V1 04/30] mm, netfilter: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which can later cause a system failure
or panic when kmalloc_node() and friends are called with the returned
node id. Use cpu_to_mem()/numa_mem_id() instead to get the nearest node
with memory for the given/current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
net/netfilter/x_tables.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 227aa11e8409..6e7d4bc81422 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -692,10 +692,10 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
if (size <= PAGE_SIZE)
newinfo->entries[cpu] = kmalloc_node(size,
GFP_KERNEL,
- cpu_to_node(cpu));
+ cpu_to_mem(cpu));
else
newinfo->entries[cpu] = vmalloc_node(size,
- cpu_to_node(cpu));
+ cpu_to_mem(cpu));

if (newinfo->entries[cpu] == NULL) {
xt_free_table_info(newinfo);
@@ -801,10 +801,10 @@ static int xt_jumpstack_alloc(struct xt_table_info *i)
for_each_possible_cpu(cpu) {
if (size > PAGE_SIZE)
i->jumpstack[cpu] = vmalloc_node(size,
- cpu_to_node(cpu));
+ cpu_to_mem(cpu));
else
i->jumpstack[cpu] = kmalloc_node(size,
- GFP_KERNEL, cpu_to_node(cpu));
+ GFP_KERNEL, cpu_to_mem(cpu));
if (i->jumpstack[cpu] == NULL)
/*
* Freeing will be done later on by the callers. The
--
1.7.10.4

2014-07-11 07:35:52

by Jiang Liu

Subject: [RFC Patch V1 07/30] mm: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which can later cause a system failure
or panic when kmalloc_node() and friends are called with the returned
node id. Use cpu_to_mem()/numa_mem_id() instead to get the nearest node
with memory for the given/current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
include/linux/gfp.h | 6 +++---
mm/memory.c | 2 +-
mm/percpu-vm.c | 2 +-
mm/vmalloc.c | 2 +-
4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 6eb1fb37de9a..56dd2043f510 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -314,7 +314,7 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
{
/* Unknown node is current node */
if (nid < 0)
- nid = numa_node_id();
+ nid = numa_mem_id();

return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}
@@ -340,13 +340,13 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
int node);
#else
#define alloc_pages(gfp_mask, order) \
- alloc_pages_node(numa_node_id(), gfp_mask, order)
+ alloc_pages_node(numa_mem_id(), gfp_mask, order)
#define alloc_pages_vma(gfp_mask, order, vma, addr, node) \
alloc_pages(gfp_mask, order)
#endif
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
#define alloc_page_vma(gfp_mask, vma, addr) \
- alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id())
+ alloc_pages_vma(gfp_mask, 0, vma, addr, numa_mem_id())
#define alloc_page_vma_node(gfp_mask, vma, addr, node) \
alloc_pages_vma(gfp_mask, 0, vma, addr, node)

diff --git a/mm/memory.c b/mm/memory.c
index d67fd9fcf1f2..f434d2692f70 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3074,7 +3074,7 @@ static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
get_page(page);

count_vm_numa_event(NUMA_HINT_FAULTS);
- if (page_nid == numa_node_id()) {
+ if (page_nid == numa_mem_id()) {
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
*flags |= TNF_FAULT_LOCAL;
}
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 3707c71ae4cd..a20b8f7d0dd0 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -115,7 +115,7 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
for (i = page_start; i < page_end; i++) {
struct page **pagep = &pages[pcpu_page_idx(cpu, i)];

- *pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
+ *pagep = alloc_pages_node(cpu_to_mem(cpu), gfp, 0);
if (!*pagep) {
pcpu_free_pages(chunk, pages, populated,
page_start, page_end);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f64632b67196..c06f90641916 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -800,7 +800,7 @@ static struct vmap_block *new_vmap_block(gfp_t gfp_mask)
unsigned long vb_idx;
int node, err;

- node = numa_node_id();
+ node = numa_mem_id();

vb = kmalloc_node(sizeof(struct vmap_block),
gfp_mask & GFP_RECLAIM_MASK, node);
--
1.7.10.4
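
The include/linux/gfp.h hunk above changes what an "unknown" (negative)
node id resolves to in alloc_pages_node(). The following userspace sketch
contrasts the two behaviours under an assumed scenario where the current
CPU sits on memoryless node 2 and its nearest memory is on node 0.
resolve_nid_old() and resolve_nid_new() are hypothetical helpers, and
numa_node_id()/numa_mem_id() here are simple stand-ins that only borrow the
kernel names.

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)

/* Assumed scenario for this sketch, not read from real hardware. */
static int current_node = 2;		/* node of the running CPU (memoryless) */
static int current_mem_node = 0;	/* nearest node that has memory */

/* Userspace stand-ins for the kernel helpers of the same name. */
int numa_node_id(void) { return current_node; }
int numa_mem_id(void)  { return current_mem_node; }

/* Before the patch: a negative nid falls back to the CPU's own node. */
int resolve_nid_old(int nid)
{
	return nid < 0 ? numa_node_id() : nid;
}

/* After the patch: a negative nid falls back to a node with memory. */
int resolve_nid_new(int nid)
{
	return nid < 0 ? numa_mem_id() : nid;
}
```

In this scenario resolve_nid_old(NUMA_NO_NODE) yields the memoryless node 2,
whereas resolve_nid_new(NUMA_NO_NODE) yields node 0; an explicitly requested
node id is passed through unchanged in both versions.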

2014-07-11 07:36:00

by Jiang Liu

Subject: [RFC Patch V1 08/30] mm, thp: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which can later cause a system failure
or panic when kmalloc_node() and friends are called with the returned
node id. Use cpu_to_mem()/numa_mem_id() instead to get the nearest node
with memory for the given/current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
mm/huge_memory.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 33514d88fef9..3307dd840873 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -822,7 +822,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;
}
page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr, numa_node_id(), 0);
+ vma, haddr, numa_mem_id(), 0);
if (unlikely(!page)) {
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
@@ -1111,7 +1111,7 @@ alloc:
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow())
new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr, numa_node_id(), 0);
+ vma, haddr, numa_mem_id(), 0);
else
new_page = NULL;

@@ -1255,7 +1255,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct anon_vma *anon_vma = NULL;
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
- int page_nid = -1, this_nid = numa_node_id();
+ int page_nid = -1, this_nid = numa_mem_id();
int target_nid, last_cpupid = -1;
bool page_locked;
bool migrated = false;
--
1.7.10.4

2014-07-11 07:36:04

by Jiang Liu

Subject: [RFC Patch V1 09/30] mm, memcg: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which can later cause a system failure
or panic when kmalloc_node() and friends are called with the returned
node id. Use cpu_to_mem()/numa_mem_id() instead to get the nearest node
with memory for the given/current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
mm/memcontrol.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a2c7bcb0e6eb..d6c4b7255ca9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1933,7 +1933,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
* we use curret node.
*/
if (unlikely(node == MAX_NUMNODES))
- node = numa_node_id();
+ node = numa_mem_id();

memcg->last_scanned_node = node;
return node;
--
1.7.10.4

2014-07-11 07:36:12

by Jiang Liu

Subject: [RFC Patch V1 10/30] mm, xfrm: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which can later cause a system failure
or panic when kmalloc_node() and friends are called with the returned
node id. Use cpu_to_mem()/numa_mem_id() instead to get the nearest node
with memory for the given/current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
net/xfrm/xfrm_ipcomp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_ipcomp.c b/net/xfrm/xfrm_ipcomp.c
index ccfdc7115a83..129f469ae75d 100644
--- a/net/xfrm/xfrm_ipcomp.c
+++ b/net/xfrm/xfrm_ipcomp.c
@@ -235,7 +235,7 @@ static void * __percpu *ipcomp_alloc_scratches(void)
for_each_possible_cpu(i) {
void *scratch;

- scratch = vmalloc_node(IPCOMP_SCRATCH_SIZE, cpu_to_node(i));
+ scratch = vmalloc_node(IPCOMP_SCRATCH_SIZE, cpu_to_mem(i));
if (!scratch)
return NULL;
*per_cpu_ptr(scratches, i) = scratch;
--
1.7.10.4

2014-07-11 07:36:24

by Jiang Liu

Subject: [RFC Patch V1 14/30] mm, i40evf: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which can later cause a system failure
or panic when kmalloc_node() and friends are called with the returned
node id. Use cpu_to_mem()/numa_mem_id() instead to get the nearest node
with memory for the given/current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 48ebb6cd69f2..5c057ae21c22 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -877,7 +877,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
unsigned int total_rx_bytes = 0, total_rx_packets = 0;
u16 rx_packet_len, rx_header_len, rx_sph, rx_hbo;
u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
- const int current_node = numa_node_id();
+ const int current_node = numa_mem_id();
struct i40e_vsi *vsi = rx_ring->vsi;
u16 i = rx_ring->next_to_clean;
union i40e_rx_desc *rx_desc;
--
1.7.10.4

2014-07-11 07:36:31

by Jiang Liu

Subject: [RFC Patch V1 15/30] mm, igb: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which can later cause a system failure
or panic when kmalloc_node() and friends are called with the returned
node id. Use cpu_to_mem()/numa_mem_id() instead to get the nearest node
with memory for the given/current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
drivers/net/ethernet/intel/igb/igb_main.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index f145adbb55ac..2b74bffa5648 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6518,7 +6518,7 @@ static bool igb_can_reuse_rx_page(struct igb_rx_buffer *rx_buffer,
unsigned int truesize)
{
/* avoid re-using remote pages */
- if (unlikely(page_to_nid(page) != numa_node_id()))
+ if (unlikely(page_to_nid(page) != numa_mem_id()))
return false;

#if (PAGE_SIZE < 8192)
@@ -6588,7 +6588,7 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
memcpy(__skb_put(skb, size), va, ALIGN(size, sizeof(long)));

/* we can reuse buffer as-is, just make sure it is local */
- if (likely(page_to_nid(page) == numa_node_id()))
+ if (likely(page_to_nid(page) == numa_mem_id()))
return true;

/* this page cannot be reused so discard it */
--
1.7.10.4

2014-07-11 07:36:36

by Jiang Liu

Subject: [RFC Patch V1 16/30] mm, ixgbe: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which can later cause a system failure
or panic when kmalloc_node() and friends are called with the returned
node id. Use cpu_to_mem()/numa_mem_id() instead to get the nearest node
with memory for the given/current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index f5aa3311ea28..46dc083573ea 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1962,7 +1962,7 @@ static bool ixgbe_add_rx_frag(struct ixgbe_ring *rx_ring,
memcpy(__skb_put(skb, size), va, ALIGN(size, sizeof(long)));

/* we can reuse buffer as-is, just make sure it is local */
- if (likely(page_to_nid(page) == numa_node_id()))
+ if (likely(page_to_nid(page) == numa_mem_id()))
return true;

/* this page cannot be reused so discard it */
@@ -1974,7 +1974,7 @@ static bool ixgbe_add_rx_frag(struct ixgbe_ring *rx_ring,
rx_buffer->page_offset, size, truesize);

/* avoid re-using remote pages */
- if (unlikely(page_to_nid(page) != numa_node_id()))
+ if (unlikely(page_to_nid(page) != numa_mem_id()))
return false;

#if (PAGE_SIZE < 8192)
--
1.7.10.4
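
The igb and ixgbe hunks above change the "is this page local?" test used
when deciding whether an RX page can be reused. A page cannot reside on a
node that has no memory, so on a CPU of a memoryless node a comparison
against numa_node_id() can never succeed. The userspace sketch below models
this under an assumed scenario (CPU on memoryless node 2, nearest memory on
node 0); struct page_model and both helpers are hypothetical and only
mirror the shape of the driver checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for a page: only the node holding its memory. */
struct page_model {
	int nid;
};

/* Assumed scenario for this sketch, not read from real hardware. */
static const int this_node = 2;		/* CPU's node, which has no memory */
static const int this_mem_node = 0;	/* nearest node that has memory */

/* Old check: compare the page's node against the CPU's own node. */
bool can_reuse_old(const struct page_model *page)
{
	assert(page);
	return page->nid == this_node;
}

/* New check: compare against the nearest node that has memory. */
bool can_reuse_new(const struct page_model *page)
{
	assert(page);
	return page->nid == this_mem_node;
}
```

For a page allocated from node 0, can_reuse_old() reports it as remote and
the buffer would be discarded every time, while can_reuse_new() treats it
as local, so the reuse path stays available on such CPUs.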

2014-07-11 07:36:52

by Jiang Liu

Subject: [RFC Patch V1 21/30] mm, irqchip: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which can later cause a system failure
or panic when kmalloc_node() and friends are called with the returned
node id. Use cpu_to_mem()/numa_mem_id() instead to get the nearest node
with memory for the given/current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
drivers/irqchip/irq-clps711x.c | 2 +-
drivers/irqchip/irq-gic.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/irqchip/irq-clps711x.c b/drivers/irqchip/irq-clps711x.c
index 33340dc97d1d..b0acf8b32a1a 100644
--- a/drivers/irqchip/irq-clps711x.c
+++ b/drivers/irqchip/irq-clps711x.c
@@ -186,7 +186,7 @@ static int __init _clps711x_intc_init(struct device_node *np,
writel_relaxed(0, clps711x_intc->intmr[1]);
writel_relaxed(0, clps711x_intc->intmr[2]);

- err = irq_alloc_descs(-1, 0, ARRAY_SIZE(clps711x_irqs), numa_node_id());
+ err = irq_alloc_descs(-1, 0, ARRAY_SIZE(clps711x_irqs), numa_mem_id());
if (IS_ERR_VALUE(err))
goto out_iounmap;

diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
index 7e11c9d6ae8c..a7e6c043d823 100644
--- a/drivers/irqchip/irq-gic.c
+++ b/drivers/irqchip/irq-gic.c
@@ -1005,7 +1005,7 @@ void __init gic_init_bases(unsigned int gic_nr, int irq_start,
if (of_property_read_u32(node, "arm,routable-irqs",
&nr_routable_irqs)) {
irq_base = irq_alloc_descs(irq_start, 16, gic_irqs,
- numa_node_id());
+ numa_mem_id());
if (IS_ERR_VALUE(irq_base)) {
WARN(1, "Cannot allocate irq_descs @ IRQ%d, assuming pre-allocated\n",
irq_start);
--
1.7.10.4

2014-07-11 07:37:00

by Jiang Liu

Subject: [RFC Patch V1 22/30] mm, of: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which can later cause a system failure
or panic when kmalloc_node() and friends are called with the returned
node id. Use cpu_to_mem()/numa_mem_id() instead to get the nearest node
with memory for the given/current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
drivers/of/base.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/of/base.c b/drivers/of/base.c
index b9864806e9b8..40d4772973ad 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -85,7 +85,7 @@ EXPORT_SYMBOL(of_n_size_cells);
#ifdef CONFIG_NUMA
int __weak of_node_to_nid(struct device_node *np)
{
- return numa_node_id();
+ return numa_mem_id();
}
#endif

--
1.7.10.4

2014-07-11 07:37:41

by Jiang Liu

Subject: [RFC Patch V1 25/30] mm, x86, kvm: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which can later cause a system failure
or panic when kmalloc_node() and friends are called with the returned
node id. Use cpu_to_mem()/numa_mem_id() instead to get the nearest node
with memory for the given/current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
arch/x86/kvm/vmx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 801332edefc3..beb7c6d5d51b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2964,7 +2964,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)

static struct vmcs *alloc_vmcs_cpu(int cpu)
{
- int node = cpu_to_node(cpu);
+ int node = cpu_to_mem(cpu);
struct page *pages;
struct vmcs *vmcs;

--
1.7.10.4

2014-07-11 07:37:46

by Jiang Liu

Subject: [RFC Patch V1 26/30] mm, x86, perf: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which can later cause a system failure
or panic when kmalloc_node() and friends are called with the returned
node id. Use cpu_to_mem()/numa_mem_id() instead to get the nearest node
with memory for the given/current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
arch/x86/kernel/cpu/perf_event_amd.c | 2 +-
arch/x86/kernel/cpu/perf_event_amd_uncore.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel_ds.c | 6 +++---
arch/x86/kernel/cpu/perf_event_intel_rapl.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel_uncore.c | 2 +-
6 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index beeb7cc07044..ee5120ce3e98 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -347,7 +347,7 @@ static struct amd_nb *amd_alloc_nb(int cpu)
struct amd_nb *nb;
int i;

- nb = kzalloc_node(sizeof(struct amd_nb), GFP_KERNEL, cpu_to_node(cpu));
+ nb = kzalloc_node(sizeof(struct amd_nb), GFP_KERNEL, cpu_to_mem(cpu));
if (!nb)
return NULL;

diff --git a/arch/x86/kernel/cpu/perf_event_amd_uncore.c b/arch/x86/kernel/cpu/perf_event_amd_uncore.c
index 3bbdf4cd38b9..1a7f4129bf4c 100644
--- a/arch/x86/kernel/cpu/perf_event_amd_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_amd_uncore.c
@@ -291,7 +291,7 @@ static struct pmu amd_l2_pmu = {
static struct amd_uncore *amd_uncore_alloc(unsigned int cpu)
{
return kzalloc_node(sizeof(struct amd_uncore), GFP_KERNEL,
- cpu_to_node(cpu));
+ cpu_to_mem(cpu));
}

static void amd_uncore_cpu_up_prepare(unsigned int cpu)
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index adb02aa62af5..4f48d1bb7608 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1957,7 +1957,7 @@ struct intel_shared_regs *allocate_shared_regs(int cpu)
int i;

regs = kzalloc_node(sizeof(struct intel_shared_regs),
- GFP_KERNEL, cpu_to_node(cpu));
+ GFP_KERNEL, cpu_to_mem(cpu));
if (regs) {
/*
* initialize the locks to keep lockdep happy
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 980970cb744d..bb0327411bf1 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -250,7 +250,7 @@ static DEFINE_PER_CPU(void *, insn_buffer);
static int alloc_pebs_buffer(int cpu)
{
struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
- int node = cpu_to_node(cpu);
+ int node = cpu_to_mem(cpu);
int max, thresh = 1; /* always use a single PEBS record */
void *buffer, *ibuffer;

@@ -304,7 +304,7 @@ static void release_pebs_buffer(int cpu)
static int alloc_bts_buffer(int cpu)
{
struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
- int node = cpu_to_node(cpu);
+ int node = cpu_to_mem(cpu);
int max, thresh;
void *buffer;

@@ -341,7 +341,7 @@ static void release_bts_buffer(int cpu)

static int alloc_ds_buffer(int cpu)
{
- int node = cpu_to_node(cpu);
+ int node = cpu_to_mem(cpu);
struct debug_store *ds;

ds = kzalloc_node(sizeof(*ds), GFP_KERNEL, node);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
index 619f7699487a..9df1ec3b505d 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_rapl.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
@@ -547,7 +547,7 @@ static int rapl_cpu_prepare(int cpu)
if (rdmsrl_safe(MSR_RAPL_POWER_UNIT, &msr_rapl_power_unit_bits))
return -1;

- pmu = kzalloc_node(sizeof(*pmu), GFP_KERNEL, cpu_to_node(cpu));
+ pmu = kzalloc_node(sizeof(*pmu), GFP_KERNEL, cpu_to_mem(cpu));
if (!pmu)
return -1;

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index 65bbbea38b9c..4b77ba4b4e36 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -4011,7 +4011,7 @@ static int uncore_cpu_prepare(int cpu, int phys_id)
if (pmu->func_id < 0)
pmu->func_id = j;

- box = uncore_alloc_box(type, cpu_to_node(cpu));
+ box = uncore_alloc_box(type, cpu_to_mem(cpu));
if (!box)
return -ENOMEM;

--
1.7.10.4

2014-07-11 07:37:54

by Jiang Liu

Subject: [RFC Patch V1 27/30] x86, numa: Kill useless code to improve code readability

According to the x86 boot sequence, early_cpu_to_node() always returns
NUMA_NO_NODE when called from numa_init(). So remove the useless code
to improve readability.

The related code sequence is shown below. x86_cpu_to_node_map is not set
until step 2, so it still holds the default value (NUMA_NO_NODE) when
accessed at step 1.

start_kernel()
    setup_arch()
        initmem_init()
            x86_numa_init()
                numa_init()
                    early_cpu_to_node()
1)                      return early_per_cpu_ptr(x86_cpu_to_node_map)[cpu];
        acpi_boot_init();
        sfi_init()
        x86_dtb_init()
            generic_processor_info()
                early_per_cpu(x86_cpu_to_apicid, cpu) = apicid;
        init_cpu_to_node()
            numa_set_node(cpu, node);
2)              per_cpu(x86_cpu_to_node_map, cpu) = node;

    rest_init()
        kernel_init()
            smp_init()
                native_cpu_up()
                    start_secondary()
                        numa_set_node()
                            per_cpu(x86_cpu_to_node_map, cpu) = node;
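
The ordering argument above can be checked with a toy model. The
sketch below is a userspace illustration, not kernel code: the array
stands in for x86_cpu_to_node_map, and the model_*() helpers and the
two-node layout are assumptions made for the example.

```c
#include <assert.h>

#define NR_CPUS       4
#define NUMA_NO_NODE (-1)

/* Stand-in for x86_cpu_to_node_map; every entry starts as NUMA_NO_NODE. */
static int cpu_to_node_map[NR_CPUS] = {
	NUMA_NO_NODE, NUMA_NO_NODE, NUMA_NO_NODE, NUMA_NO_NODE
};

static int model_early_cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpu_to_node_map[cpu];
}

/* Step 1: what numa_init() sees. Counts CPUs that already have a node. */
static int model_count_assigned(void)
{
	int cpu, assigned = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (model_early_cpu_to_node(cpu) != NUMA_NO_NODE)
			assigned++;
	return assigned;
}

/* Step 2: init_cpu_to_node() runs later and fills in the map. */
static void model_init_cpu_to_node(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		cpu_to_node_map[cpu] = cpu % 2;	/* hypothetical two-node layout */
}
```

Because step 1 always runs before step 2, the count seen at step 1 is
zero, which is why the removed checks could never take effect.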

Signed-off-by: Jiang Liu <[email protected]>
---
arch/x86/mm/numa.c | 10 ----------
1 file changed, 10 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index a32b706c401a..eec4f6c322bb 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -545,8 +545,6 @@ static void __init numa_init_array(void)

rr = first_node(node_online_map);
for (i = 0; i < nr_cpu_ids; i++) {
- if (early_cpu_to_node(i) != NUMA_NO_NODE)
- continue;
numa_set_node(i, rr);
rr = next_node(rr, node_online_map);
if (rr == MAX_NUMNODES)
@@ -633,14 +631,6 @@ static int __init numa_init(int (*init_func)(void))
if (ret < 0)
return ret;

- for (i = 0; i < nr_cpu_ids; i++) {
- int nid = early_cpu_to_node(i);
-
- if (nid == NUMA_NO_NODE)
- continue;
- if (!node_online(nid))
- numa_clear_node(i);
- }
numa_init_array();

/*
--
1.7.10.4

2014-07-11 07:38:19

by Jiang Liu

Subject: [RFC Patch V1 24/30] mm, x86/platform/uv: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which later causes a system failure/panic
when kmalloc_node() and friends are called with the returned node ID.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the given/current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
arch/x86/platform/uv/tlb_uv.c | 2 +-
arch/x86/platform/uv/uv_nmi.c | 3 ++-
arch/x86/platform/uv/uv_time.c | 2 +-
3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/platform/uv/tlb_uv.c b/arch/x86/platform/uv/tlb_uv.c
index dfe605ac1bcd..4612b4396004 100644
--- a/arch/x86/platform/uv/tlb_uv.c
+++ b/arch/x86/platform/uv/tlb_uv.c
@@ -2116,7 +2116,7 @@ static int __init uv_bau_init(void)

for_each_possible_cpu(cur_cpu) {
mask = &per_cpu(uv_flush_tlb_mask, cur_cpu);
- zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cur_cpu));
+ zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_mem(cur_cpu));
}

nuvhubs = uv_num_possible_blades();
diff --git a/arch/x86/platform/uv/uv_nmi.c b/arch/x86/platform/uv/uv_nmi.c
index c89c93320c12..d17758215a61 100644
--- a/arch/x86/platform/uv/uv_nmi.c
+++ b/arch/x86/platform/uv/uv_nmi.c
@@ -715,7 +715,8 @@ void uv_nmi_setup(void)
nid = cpu_to_node(cpu);
if (uv_hub_nmi_list[nid] == NULL) {
uv_hub_nmi_list[nid] = kzalloc_node(size,
- GFP_KERNEL, nid);
+ GFP_KERNEL,
+ cpu_to_mem(cpu));
BUG_ON(!uv_hub_nmi_list[nid]);
raw_spin_lock_init(&(uv_hub_nmi_list[nid]->nmi_lock));
atomic_set(&uv_hub_nmi_list[nid]->cpu_owner, -1);
diff --git a/arch/x86/platform/uv/uv_time.c b/arch/x86/platform/uv/uv_time.c
index 5c86786bbfd2..c369fb2eb7d3 100644
--- a/arch/x86/platform/uv/uv_time.c
+++ b/arch/x86/platform/uv/uv_time.c
@@ -164,7 +164,7 @@ static __init int uv_rtc_allocate_timers(void)
return -ENOMEM;

for_each_present_cpu(cpu) {
- int nid = cpu_to_node(cpu);
+ int nid = cpu_to_mem(cpu);
int bid = uv_cpu_to_blade_id(cpu);
int bcpu = uv_cpu_hub_info(cpu)->blade_processor_id;
struct uv_rtc_timer_head *head = blade_info[bid];
--
1.7.10.4

2014-07-11 07:38:24

by Jiang Liu

Subject: [RFC Patch V1 28/30] mm: Update _mem_id_[] for every possible CPU when memory configuration changes

The current kernel only updates _mem_id_[cpu] for online CPUs when the
memory configuration changes. So the kernel may allocate memory from a
remote node for a CPU that is still absent or offline, even if the
node associated with the CPU has already been onlined. This patch tries
to improve performance by updating _mem_id_[cpu] for every possible CPU
when the memory configuration changes, so that the kernel can always
allocate from the local node once the node is onlined.

We check node_online(cpu_to_node(cpu)) because:
1) local_memory_node(nid) needs to access NODE_DATA(nid)
2) try_offline_node(nid) just zeroes out NODE_DATA(nid) instead of freeing it
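
To make the behavioural change concrete, here is a minimal userspace
sketch of the old and new update rules. Everything in it is assumed for
illustration: the tables, the model_*() names, and the simplification
that every online node has memory, so the local memory node of a node
is the node itself.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS       4
#define NR_NODES      2
#define NUMA_NO_NODE (-1)

/* Hypothetical state: CPUs 2 and 3 are possible but offline. */
static const int  cpu_node[NR_CPUS]        = { 0, 0, 1, 1 };
static const bool cpu_is_online[NR_CPUS]   = { true, true, false, false };
static const bool node_is_online[NR_NODES] = { true, true };

/* Stand-in for _mem_id_[]; CPUs 2 and 3 hold a stale remote node. */
static int numa_mem[NR_CPUS] = { 0, 0, 0, 0 };

/* Simplification: each online node has memory of its own. */
static int model_local_memory_node(int nid)
{
	assert(nid >= 0 && nid < NR_NODES);
	return nid;
}

/* Old rule: only online CPUs are refreshed. */
static void model_update_old(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_is_online[cpu])
			numa_mem[cpu] = model_local_memory_node(cpu_node[cpu]);
}

/* New rule: every possible CPU is refreshed if its node is online. */
static void model_update_new(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (node_is_online[cpu_node[cpu]])
			numa_mem[cpu] = model_local_memory_node(cpu_node[cpu]);
		else
			numa_mem[cpu] = NUMA_NO_NODE;
	}
}
```

Under the old rule the offline CPU 2 keeps pointing at node 0; under the
new rule it is switched to its local node 1 as soon as that node is online.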

Signed-off-by: Jiang Liu <[email protected]>
---
mm/page_alloc.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0ea758b898fd..de86e941ed57 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3844,13 +3844,13 @@ static int __build_all_zonelists(void *data)
/*
* We now know the "local memory node" for each node--
* i.e., the node of the first zone in the generic zonelist.
- * Set up numa_mem percpu variable for on-line cpus. During
- * boot, only the boot cpu should be on-line; we'll init the
- * secondary cpus' numa_mem as they come on-line. During
- * node/memory hotplug, we'll fixup all on-line cpus.
+ * Set up numa_mem percpu variable for all possible cpus
+ * if associated node has been onlined.
*/
- if (cpu_online(cpu))
+ if (node_online(cpu_to_node(cpu)))
set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
+ else
+ set_cpu_numa_mem(cpu, NUMA_NO_NODE);
#endif
}

--
1.7.10.4

2014-07-11 07:38:30

by Jiang Liu

Subject: [RFC Patch V1 29/30] mm, x86: Enable memoryless node support to better support CPU/memory hotplug

With the current implementation, all CPUs within a NUMA node are
associated with another NUMA node if the node has no memory installed.

For example, on a four-node system, the CPUs on nodes 2 and 3 are
associated with node 0 when no memory is installed on nodes 2 and 3,
which may confuse users.
root@bkd01sdp:~# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 0 size: 15602 MB
node 0 free: 15014 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15985 MB
node 1 free: 15686 MB
node distances:
node 0 1
0: 10 21
1: 21 10

Worse still, the CPU affinity relationship is not fixed even after
memory has been added to those nodes. After memory hot-addition to
node 2, the CPUs on node 2 are still associated with node 0. This may
cause sub-optimal performance.
root@bkd01sdp:/sys/devices/system/node/node2# numactl --hardware
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 0 size: 15602 MB
node 0 free: 14743 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15985 MB
node 1 free: 15715 MB
node 2 cpus:
node 2 size: 128 MB
node 2 free: 128 MB
node distances:
node 0 1 2
0: 10 21 21
1: 21 10 21
2: 21 21 10

With memoryless node support enabled, the kernel correctly reports the
system hardware topology for nodes without memory installed.
root@bkd01sdp:~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
node 0 size: 15725 MB
node 0 free: 15129 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15862 MB
node 1 free: 15627 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node 0 1 2 3
0: 10 21 21 21
1: 21 10 21 21
2: 21 21 10 21
3: 21 21 21 10

With memoryless node support enabled, CPUs are correctly associated with
node 2 after memory hot-addition to node 2.
root@bkd01sdp:/sys/devices/system/node/node2# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
node 0 size: 15725 MB
node 0 free: 14872 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15862 MB
node 1 free: 15641 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
node 2 size: 128 MB
node 2 free: 127 MB
node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node 0 1 2 3
0: 10 21 21 21
1: 21 10 21 21
2: 21 21 10 21
3: 21 21 21 10
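
As a rough illustration of how this patch's numa_init_array() change
spreads otherwise unassigned CPUs over online nodes that have memory,
the following userspace sketch mirrors the round-robin loop. The
tables and the model_*() helpers are hypothetical stand-ins for
node_online_map, node_spanned_pages(), first_node() and next_node();
sizes are taken from the output above and given in MB rather than pages.

```c
#include <assert.h>

#define MAX_NUMNODES 4
#define NR_CPUS      6

/* All four nodes are online, but nodes 2 and 3 have no memory. */
static const int node_online[MAX_NUMNODES] = { 1, 1, 1, 1 };
static const unsigned long node_spanned[MAX_NUMNODES] = { 15725, 15862, 0, 0 };
static int cpu_node[NR_CPUS];

static int model_next_node(int n)
{
	for (n++; n < MAX_NUMNODES; n++)
		if (node_online[n])
			return n;
	return MAX_NUMNODES;
}

static int model_first_node(void)
{
	return model_next_node(-1);
}

static int model_any_memory(void)
{
	int n;

	for (n = 0; n < MAX_NUMNODES; n++)
		if (node_online[n] && node_spanned[n])
			return 1;
	return 0;
}

static void model_numa_init_array(void)
{
	int i, rr = MAX_NUMNODES;

	/* Without at least one node with memory the search would not end. */
	assert(model_any_memory());

	for (i = 0; i < NR_CPUS; i++) {
		/* Search for an online node with memory, wrapping around. */
		do {
			if (rr != MAX_NUMNODES)
				rr = model_next_node(rr);
			if (rr == MAX_NUMNODES)
				rr = model_first_node();
		} while (!node_spanned[rr]);

		cpu_node[i] = rr;
	}
}
```

With this layout the CPUs alternate between nodes 0 and 1, and the
memoryless nodes 2 and 3 are never chosen as a fallback.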

Signed-off-by: Jiang Liu <[email protected]>
---
arch/x86/Kconfig | 3 +++
arch/x86/kernel/acpi/boot.c | 5 ++++-
arch/x86/kernel/smpboot.c | 2 ++
arch/x86/mm/numa.c | 42 +++++++++++++++++++++++++++++++++++-------
4 files changed, 44 insertions(+), 8 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a8f749ef0fdc..f35b25b88625 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1887,6 +1887,9 @@ config USE_PERCPU_NUMA_NODE_ID
def_bool y
depends on NUMA

+config HAVE_MEMORYLESS_NODES
+ def_bool NUMA
+
config ARCH_ENABLE_SPLIT_PMD_PTLOCK
def_bool y
depends on X86_64 || X86_PAE
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 86281ffb96d6..3b5641703a49 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -612,6 +612,8 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
if (nid != -1) {
set_apicid_to_node(physid, nid);
numa_set_node(cpu, nid);
+ if (node_online(nid))
+ set_cpu_numa_mem(cpu, local_memory_node(nid));
}
#endif
}
@@ -644,9 +646,10 @@ int acpi_unmap_lsapic(int cpu)
{
#ifdef CONFIG_ACPI_NUMA
set_apicid_to_node(per_cpu(x86_cpu_to_apicid, cpu), NUMA_NO_NODE);
+ set_cpu_numa_mem(cpu, NUMA_NO_NODE);
#endif

- per_cpu(x86_cpu_to_apicid, cpu) = -1;
+ per_cpu(x86_cpu_to_apicid, cpu) = BAD_APICID;
set_cpu_present(cpu, false);
num_processors--;

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 5492798930ef..4a5437989ffe 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -162,6 +162,8 @@ static void smp_callin(void)
__func__, cpuid);
}

+ set_numa_mem(local_memory_node(cpu_to_node(cpuid)));
+
/*
* the boot CPU has finished the init stage and is spinning
* on callin_map until we finish. We are free to set up this
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index eec4f6c322bb..0d17c05480d2 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -22,6 +22,7 @@

int __initdata numa_off;
nodemask_t numa_nodes_parsed __initdata;
+static nodemask_t numa_nodes_empty __initdata;

struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
EXPORT_SYMBOL(node_data);
@@ -523,8 +524,12 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
end = max(mi->blk[i].end, end);
}

- if (start < end)
+ if (start < end) {
setup_node_data(nid, start, end);
+ } else if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
+ setup_node_data(nid, 0, 0);
+ node_set(nid, numa_nodes_empty);
+ }
}

/* Dump memblock with node info and return. */
@@ -541,14 +546,18 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
*/
static void __init numa_init_array(void)
{
- int rr, i;
+ int i, rr = MAX_NUMNODES;

- rr = first_node(node_online_map);
for (i = 0; i < nr_cpu_ids; i++) {
+ /* Search for an onlined node with memory */
+ do {
+ if (rr != MAX_NUMNODES)
+ rr = next_node(rr, node_online_map);
+ if (rr == MAX_NUMNODES)
+ rr = first_node(node_online_map);
+ } while (!node_spanned_pages(rr));
+
numa_set_node(i, rr);
- rr = next_node(rr, node_online_map);
- if (rr == MAX_NUMNODES)
- rr = first_node(node_online_map);
}
}

@@ -694,9 +703,12 @@ static __init int find_near_online_node(int node)
{
int n, val;
int min_val = INT_MAX;
- int best_node = -1;
+ int best_node = NUMA_NO_NODE;

for_each_online_node(n) {
+ if (!node_spanned_pages(n))
+ continue;
+
val = node_distance(node, n);

if (val < min_val) {
@@ -737,6 +749,22 @@ void __init init_cpu_to_node(void)
if (!node_online(node))
node = find_near_online_node(node);
numa_set_node(cpu, node);
+ if (node_spanned_pages(node))
+ set_cpu_numa_mem(cpu, node);
+ if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES))
+ node_clear(node, numa_nodes_empty);
+ }
+
+ /* Destroy empty nodes */
+ if (IS_ENABLED(CONFIG_HAVE_MEMORYLESS_NODES)) {
+ int nid;
+ const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
+
+ for_each_node_mask(nid, numa_nodes_empty) {
+ node_set_offline(nid);
+ memblock_free(__pa(node_data[nid]), nd_size);
+ node_data[nid] = NULL;
+ }
}
}

--
1.7.10.4

2014-07-11 07:38:47

by Jiang Liu

Subject: [RFC Patch V1 30/30] x86, NUMA: Online node earlier when doing CPU hot-addition

With the typical CPU hot-addition flow on x86, PCI host bridges embedded
in a physical processor are always associated with NUMA_NO_NODE, which
may cause sub-optimal performance.
1) Handle CPU hot-addition notification
    acpi_processor_add()
        acpi_processor_get_info()
            acpi_processor_hotadd_init()
                acpi_map_lsapic()
1.a)                acpi_map_cpu2node()

2) Handle PCI host bridge hot-addition notification
    acpi_pci_root_add()
        pci_acpi_scan_root()
2.a)        if (node != NUMA_NO_NODE && !node_online(node)) node = NUMA_NO_NODE;

3) Handle memory hot-addition notification
    acpi_memory_device_add()
        acpi_memory_enable_device()
            add_memory()
3.a)            node_set_online();

4) Online CPUs through sysfs interfaces
    cpu_subsys_online()
        cpu_up()
            try_online_node()
4.a)            node_set_online();

So the associated node is always offline at step 2.a, because it is not
onlined until step 3.a or 4.a.

We could improve performance by onlining the node at step 1.a. This
change also makes the code symmetric. Nodes are always created when
handling CPU/memory hot-addition events instead of when handling user
requests from sysfs interfaces, and are destroyed when handling
CPU/memory hot-removal events.
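
The effect of moving the onlining to step 1.a can be sketched in
userspace as follows. This is an assumption-laden model, not the kernel
code: node_online[] stands in for the online node mask, and the two
model_*() functions only mimic the node handling of acpi_map_cpu2node()
and the check at step 2.a.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES      4
#define NUMA_NO_NODE (-1)

/* Hypothetical state: nodes 2 and 3 are not online yet. */
static bool node_online[NR_NODES] = { true, true, false, false };

/* Step 1.a: map the CPU to its node, optionally onlining the node. */
static void model_acpi_map_cpu2node(int nid, bool online_early)
{
	assert(nid >= 0 && nid < NR_NODES);
	if (online_early)
		node_online[nid] = true;	/* stands in for try_online_node(nid) */
}

/* Step 2.a: the host bridge keeps its node only if the node is online. */
static int model_pci_root_node(int nid)
{
	if (nid != NUMA_NO_NODE && !node_online[nid])
		nid = NUMA_NO_NODE;
	return nid;
}
```

Without early onlining the host bridge on node 2 ends up with
NUMA_NO_NODE; with it, the bridge keeps node 2.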

It also closes a race window caused by kmalloc_node(cpu_to_node(cpu)),
which may cause a system panic as shown below.
[ 3663.324476] BUG: unable to handle kernel paging request at 0000000000001f08
[ 3663.332348] IP: [<ffffffff81172219>] __alloc_pages_nodemask+0xb9/0x2d0
[ 3663.339719] PGD 82fe10067 PUD 82ebef067 PMD 0
[ 3663.344773] Oops: 0000 [#1] SMP
[ 3663.348455] Modules linked in: shpchp gpio_ich x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd microcode joydev sb_edac edac_core lpc_ich ipmi_si tpm_tis ipmi_msghandler ioatdma wmi acpi_pad mac_hid lp parport ixgbe isci mpt2sas dca ahci ptp libsas libahci raid_class pps_core scsi_transport_sas mdio hid_generic usbhid hid
[ 3663.394393] CPU: 61 PID: 2416 Comm: cron Tainted: G W 3.14.0-rc5+ #21
[ 3663.402643] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRIVTIN1.86B.0047.F03.1403031049 03/03/2014
[ 3663.414299] task: ffff88082fe54b00 ti: ffff880845fba000 task.ti: ffff880845fba000
[ 3663.422741] RIP: 0010:[<ffffffff81172219>] [<ffffffff81172219>] __alloc_pages_nodemask+0xb9/0x2d0
[ 3663.432857] RSP: 0018:ffff880845fbbcd0 EFLAGS: 00010246
[ 3663.439265] RAX: 0000000000001f00 RBX: 0000000000000000 RCX: 0000000000000000
[ 3663.447291] RDX: 0000000000000000 RSI: 0000000000000a8d RDI: ffffffff81a8d950
[ 3663.455318] RBP: ffff880845fbbd58 R08: ffff880823293400 R09: 0000000000000001
[ 3663.463345] R10: 0000000000000001 R11: 0000000000000000 R12: 00000000002052d0
[ 3663.471363] R13: ffff880854c07600 R14: 0000000000000002 R15: 0000000000000000
[ 3663.479389] FS: 00007f2e8b99e800(0000) GS:ffff88105a400000(0000) knlGS:0000000000000000
[ 3663.488514] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3663.495018] CR2: 0000000000001f08 CR3: 00000008237b1000 CR4: 00000000001407e0
[ 3663.503476] Stack:
[ 3663.505757] ffffffff811bd74d ffff880854c01d98 ffff880854c01df0 ffff880854c01dd0
[ 3663.514167] 00000003208ca420 000000075a5d84d0 ffff88082fe54b00 ffffffff811bb35f
[ 3663.522567] ffff880854c07600 0000000000000003 0000000000001f00 ffff880845fbbd48
[ 3663.530976] Call Trace:
[ 3663.533753] [<ffffffff811bd74d>] ? deactivate_slab+0x41d/0x4f0
[ 3663.540421] [<ffffffff811bb35f>] ? new_slab+0x3f/0x2d0
[ 3663.546307] [<ffffffff811bb3c5>] new_slab+0xa5/0x2d0
[ 3663.552001] [<ffffffff81768c97>] __slab_alloc+0x35d/0x54a
[ 3663.558185] [<ffffffff810a4845>] ? local_clock+0x25/0x30
[ 3663.564686] [<ffffffff8177a34c>] ? __do_page_fault+0x4ec/0x5e0
[ 3663.571356] [<ffffffff810b0054>] ? alloc_fair_sched_group+0xc4/0x190
[ 3663.578609] [<ffffffff810c77f1>] ? __raw_spin_lock_init+0x21/0x60
[ 3663.585570] [<ffffffff811be476>] kmem_cache_alloc_node_trace+0xa6/0x1d0
[ 3663.593112] [<ffffffff810b0054>] ? alloc_fair_sched_group+0xc4/0x190
[ 3663.600363] [<ffffffff810b0054>] alloc_fair_sched_group+0xc4/0x190
[ 3663.607423] [<ffffffff810a359f>] sched_create_group+0x3f/0x80
[ 3663.613994] [<ffffffff810b611f>] sched_autogroup_create_attach+0x3f/0x1b0
[ 3663.621732] [<ffffffff8108258a>] sys_setsid+0xea/0x110
[ 3663.628020] [<ffffffff8177f42d>] system_call_fastpath+0x1a/0x1f
[ 3663.634780] Code: 00 44 89 e7 e8 b9 f8 f4 ff 41 f6 c4 10 74 18 31 d2 be 8d 0a 00 00 48 c7 c7 50 d9 a8 81 e8 70 6a f2 ff e8 db dd 5f 00 48 8b 45 c8 <48> 83 78 08 00 0f 84 b5 01 00 00 48 83 c0 08 44 89 75 c0 4d 89
[ 3663.657032] RIP [<ffffffff81172219>] __alloc_pages_nodemask+0xb9/0x2d0
[ 3663.664491] RSP <ffff880845fbbcd0>
[ 3663.668429] CR2: 0000000000001f08
[ 3663.672659] ---[ end trace df13f08ed9de18ad ]---

Signed-off-by: Jiang Liu <[email protected]>
---
arch/x86/kernel/acpi/boot.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 3b5641703a49..00c2ed507460 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -611,6 +611,7 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
nid = acpi_get_node(handle);
if (nid != -1) {
set_apicid_to_node(physid, nid);
+ try_online_node(nid);
numa_set_node(cpu, nid);
if (node_online(nid))
set_cpu_numa_mem(cpu, local_memory_node(nid));
--
1.7.10.4

2014-07-11 07:39:43

by Jiang Liu

Subject: [RFC Patch V1 23/30] mm, x86: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which later causes a system failure/panic
when kmalloc_node() and friends are called with the returned node ID.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the given/current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
arch/x86/kernel/apic/io_apic.c | 10 +++++-----
arch/x86/kernel/devicetree.c | 2 +-
arch/x86/kernel/irq_32.c | 4 ++--
3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 81e08eff05ee..7cb3d58b11e8 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -204,7 +204,7 @@ int __init arch_early_irq_init(void)

cfg = irq_cfgx;
count = ARRAY_SIZE(irq_cfgx);
- node = cpu_to_node(0);
+ node = cpu_to_mem(0);

for (i = 0; i < count; i++) {
irq_set_chip_data(i, &cfg[i]);
@@ -1348,7 +1348,7 @@ static bool __init io_apic_pin_not_connected(int idx, int ioapic_idx, int pin)

static void __init __io_apic_setup_irqs(unsigned int ioapic_idx)
{
- int idx, node = cpu_to_node(0);
+ int idx, node = cpu_to_mem(0);
struct io_apic_irq_attr attr;
unsigned int pin, irq;

@@ -1394,7 +1394,7 @@ static void __init setup_IO_APIC_irqs(void)
*/
void setup_IO_APIC_irq_extra(u32 gsi)
{
- int ioapic_idx = 0, pin, idx, irq, node = cpu_to_node(0);
+ int ioapic_idx = 0, pin, idx, irq, node = cpu_to_mem(0);
struct io_apic_irq_attr attr;

/*
@@ -2662,7 +2662,7 @@ int timer_through_8259 __initdata;
static inline void __init check_timer(void)
{
struct irq_cfg *cfg = irq_get_chip_data(0);
- int node = cpu_to_node(0);
+ int node = cpu_to_mem(0);
int apic1, pin1, apic2, pin2;
unsigned long flags;
int no_pin1 = 0;
@@ -3387,7 +3387,7 @@ int io_apic_set_pci_routing(struct device *dev, int irq,
return -EINVAL;
}

- node = dev ? dev_to_node(dev) : cpu_to_node(0);
+ node = dev ? dev_to_node(dev) : cpu_to_mem(0);

return io_apic_setup_irq_pin_once(irq, node, irq_attr);
}
diff --git a/arch/x86/kernel/devicetree.c b/arch/x86/kernel/devicetree.c
index 7db54b5d5f86..289762f4ea06 100644
--- a/arch/x86/kernel/devicetree.c
+++ b/arch/x86/kernel/devicetree.c
@@ -295,7 +295,7 @@ static int ioapic_xlate(struct irq_domain *domain,
set_io_apic_irq_attr(&attr, idx, line, it->trigger, it->polarity);

rc = io_apic_setup_irq_pin_once(irq_find_mapping(domain, line),
- cpu_to_node(0), &attr);
+ cpu_to_mem(0), &attr);
if (rc)
return rc;

diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
index 63ce838e5a54..425bb4b1110a 100644
--- a/arch/x86/kernel/irq_32.c
+++ b/arch/x86/kernel/irq_32.c
@@ -128,12 +128,12 @@ void irq_ctx_init(int cpu)
if (per_cpu(hardirq_stack, cpu))
return;

- irqstk = page_address(alloc_pages_node(cpu_to_node(cpu),
+ irqstk = page_address(alloc_pages_node(cpu_to_mem(cpu),
THREADINFO_GFP,
THREAD_SIZE_ORDER));
per_cpu(hardirq_stack, cpu) = irqstk;

- irqstk = page_address(alloc_pages_node(cpu_to_node(cpu),
+ irqstk = page_address(alloc_pages_node(cpu_to_mem(cpu),
THREADINFO_GFP,
THREAD_SIZE_ORDER));
per_cpu(softirq_stack, cpu) = irqstk;
--
1.7.10.4

2014-07-11 07:36:49

by Jiang Liu

Subject: [RFC Patch V1 19/30] mm, bnx2i: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which later causes a system failure/panic
when kmalloc_node() and friends are called with the returned node ID.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the given/current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
drivers/scsi/bnx2i/bnx2i_init.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/bnx2i/bnx2i_init.c b/drivers/scsi/bnx2i/bnx2i_init.c
index 80c03b452d61..f67a5a63134e 100644
--- a/drivers/scsi/bnx2i/bnx2i_init.c
+++ b/drivers/scsi/bnx2i/bnx2i_init.c
@@ -423,7 +423,7 @@ static void bnx2i_percpu_thread_create(unsigned int cpu)
p = &per_cpu(bnx2i_percpu, cpu);

thread = kthread_create_on_node(bnx2i_percpu_io_thread, (void *)p,
- cpu_to_node(cpu),
+ cpu_to_mem(cpu),
"bnx2i_thread/%d", cpu);
/* bind thread to the cpu */
if (likely(!IS_ERR(thread))) {
--
1.7.10.4

2014-07-11 07:36:47

by Jiang Liu

Subject: [RFC Patch V1 18/30] mm, bnx2fc: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which later causes a system failure/panic
when kmalloc_node() and friends are called with the returned node ID.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the given/current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
drivers/scsi/bnx2fc/bnx2fc_fcoe.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
index 785d0d71781e..144534a51cbb 100644
--- a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
+++ b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
@@ -2453,7 +2453,7 @@ static void bnx2fc_percpu_thread_create(unsigned int cpu)
p = &per_cpu(bnx2fc_percpu, cpu);

thread = kthread_create_on_node(bnx2fc_percpu_io_thread,
- (void *)p, cpu_to_node(cpu),
+ (void *)p, cpu_to_mem(cpu),
"bnx2fc_thread/%d", cpu);
/* bind thread to the cpu */
if (likely(!IS_ERR(thread))) {
--
1.7.10.4

2014-07-11 07:43:17

by Jiang Liu

Subject: [RFC Patch V1 20/30] mm, fcoe: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which later causes a system failure/panic
when kmalloc_node() and friends are called with the returned node ID.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the given/current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
drivers/scsi/fcoe/fcoe.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/fcoe/fcoe.c b/drivers/scsi/fcoe/fcoe.c
index 00ee0ed642aa..779a7af0e410 100644
--- a/drivers/scsi/fcoe/fcoe.c
+++ b/drivers/scsi/fcoe/fcoe.c
@@ -1257,7 +1257,7 @@ static void fcoe_percpu_thread_create(unsigned int cpu)
p = &per_cpu(fcoe_percpu, cpu);

thread = kthread_create_on_node(fcoe_percpu_receive_thread,
- (void *)p, cpu_to_node(cpu),
+ (void *)p, cpu_to_mem(cpu),
"fcoethread/%d", cpu);

if (likely(!IS_ERR(thread))) {
--
1.7.10.4

2014-07-11 07:45:46

by Jiang Liu

Subject: [RFC Patch V1 17/30] mm, intel_powerclamp: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which later causes a system failure/panic
when kmalloc_node() and friends are called with the returned node ID.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the given/current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
drivers/thermal/intel_powerclamp.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
index 95cb7fc20e17..9d9be8cd1b50 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -531,7 +531,7 @@ static int start_power_clamp(void)

thread = kthread_create_on_node(clamp_thread,
(void *) cpu,
- cpu_to_node(cpu),
+ cpu_to_mem(cpu),
"kidle_inject/%ld", cpu);
/* bind to cpu here */
if (likely(!IS_ERR(thread))) {
@@ -582,7 +582,7 @@ static int powerclamp_cpu_callback(struct notifier_block *nfb,
case CPU_ONLINE:
thread = kthread_create_on_node(clamp_thread,
(void *) cpu,
- cpu_to_node(cpu),
+ cpu_to_mem(cpu),
"kidle_inject/%lu", cpu);
if (likely(!IS_ERR(thread))) {
kthread_bind(thread, cpu);
--
1.7.10.4

2014-07-11 07:45:44

by Paolo Bonzini

Subject: Re: [RFC Patch V1 25/30] mm, x86, kvm: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On 11/07/2014 09:37, Jiang Liu wrote:
> When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
> may return a node without memory, and later cause system failure/panic
> when calling kmalloc_node() and friends with returned node id.
> So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
> memory for the/current cpu.
>
> If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
> is the same as cpu_to_node()/numa_node_id().
>
> Signed-off-by: Jiang Liu <[email protected]>
> ---
> arch/x86/kvm/vmx.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 801332edefc3..beb7c6d5d51b 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2964,7 +2964,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>
> static struct vmcs *alloc_vmcs_cpu(int cpu)
> {
> - int node = cpu_to_node(cpu);
> + int node = cpu_to_mem(cpu);
> struct page *pages;
> struct vmcs *vmcs;
>
>

Acked-by: Paolo Bonzini <[email protected]>

2014-07-11 07:36:21

by Jiang Liu

Subject: [RFC Patch V1 13/30] mm, i40e: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, which later causes a system failure/panic
when kmalloc_node() and friends are called with the returned node ID.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the given/current CPU.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e49f31dbd5d8..e9f6f9efd944 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1342,7 +1342,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
unsigned int total_rx_bytes = 0, total_rx_packets = 0;
u16 rx_packet_len, rx_header_len, rx_sph, rx_hbo;
u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
- const int current_node = numa_node_id();
+ const int current_node = numa_mem_id();
struct i40e_vsi *vsi = rx_ring->vsi;
u16 i = rx_ring->next_to_clean;
union i40e_rx_desc *rx_desc;
--
1.7.10.4

2014-07-11 07:48:29

by Jiang Liu

Subject: [RFC Patch V1 12/30] mm, IB/qib: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
drivers/infiniband/hw/qib/qib_file_ops.c | 4 ++--
drivers/infiniband/hw/qib/qib_init.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_file_ops.c b/drivers/infiniband/hw/qib/qib_file_ops.c
index b15e34eeef68..55540295e0e3 100644
--- a/drivers/infiniband/hw/qib/qib_file_ops.c
+++ b/drivers/infiniband/hw/qib/qib_file_ops.c
@@ -1312,8 +1312,8 @@ static int setup_ctxt(struct qib_pportdata *ppd, int ctxt,
assign_ctxt_affinity(fp, dd);

numa_id = qib_numa_aware ? ((fd->rec_cpu_num != -1) ?
- cpu_to_node(fd->rec_cpu_num) :
- numa_node_id()) : dd->assigned_node_id;
+ cpu_to_mem(fd->rec_cpu_num) : numa_mem_id()) :
+ dd->assigned_node_id;

rcd = qib_create_ctxtdata(ppd, ctxt, numa_id);

diff --git a/drivers/infiniband/hw/qib/qib_init.c b/drivers/infiniband/hw/qib/qib_init.c
index 8d3c78ddc906..85ff56ad1075 100644
--- a/drivers/infiniband/hw/qib/qib_init.c
+++ b/drivers/infiniband/hw/qib/qib_init.c
@@ -133,7 +133,7 @@ int qib_create_ctxts(struct qib_devdata *dd)
int local_node_id = pcibus_to_node(dd->pcidev->bus);

if (local_node_id < 0)
- local_node_id = numa_node_id();
+ local_node_id = numa_mem_id();
dd->assigned_node_id = local_node_id;

/*
--
1.7.10.4

2014-07-11 07:36:10

by Jiang Liu

Subject: [RFC Patch V1 11/30] mm, char/mspec.c: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
drivers/char/mspec.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/char/mspec.c b/drivers/char/mspec.c
index f1d7fa45c275..20e893cde9fd 100644
--- a/drivers/char/mspec.c
+++ b/drivers/char/mspec.c
@@ -206,7 +206,7 @@ mspec_fault(struct vm_area_struct *vma, struct vm_fault *vmf)

maddr = (volatile unsigned long) vdata->maddr[index];
if (maddr == 0) {
- maddr = uncached_alloc_page(numa_node_id(), 1);
+ maddr = uncached_alloc_page(numa_mem_id(), 1);
if (maddr == 0)
return VM_FAULT_OOM;

--
1.7.10.4

2014-07-11 07:49:50

by Jiang Liu

Subject: [RFC Patch V1 05/30] mm, perf: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
kernel/events/callchain.c | 2 +-
kernel/events/core.c | 2 +-
kernel/events/ring_buffer.c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 97b67df8fbfe..09f470a9262e 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -77,7 +77,7 @@ static int alloc_callchain_buffers(void)

for_each_possible_cpu(cpu) {
entries->cpu_entries[cpu] = kmalloc_node(size, GFP_KERNEL,
- cpu_to_node(cpu));
+ cpu_to_mem(cpu));
if (!entries->cpu_entries[cpu])
goto fail;
}
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a33d9a2bcbd7..bb1a5f326309 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7911,7 +7911,7 @@ static void perf_event_init_cpu(int cpu)
if (swhash->hlist_refcount > 0) {
struct swevent_hlist *hlist;

- hlist = kzalloc_node(sizeof(*hlist), GFP_KERNEL, cpu_to_node(cpu));
+ hlist = kzalloc_node(sizeof(*hlist), GFP_KERNEL, cpu_to_mem(cpu));
WARN_ON(!hlist);
rcu_assign_pointer(swhash->swevent_hlist, hlist);
}
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 146a5792b1d2..22128f58aa0b 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -265,7 +265,7 @@ static void *perf_mmap_alloc_page(int cpu)
struct page *page;
int node;

- node = (cpu == -1) ? cpu : cpu_to_node(cpu);
+ node = (cpu == -1) ? NUMA_NO_NODE : cpu_to_mem(cpu);
page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
if (!page)
return NULL;
--
1.7.10.4

2014-07-11 07:35:35

by Jiang Liu

Subject: [RFC Patch V1 03/30] mm, net: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
net/core/dev.c | 6 +++---
net/core/flow.c | 2 +-
net/core/pktgen.c | 10 +++++-----
net/core/sysctl_net_core.c | 2 +-
4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 30eedf677913..e4c1e84374b7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1910,7 +1910,7 @@ static struct xps_map *expand_xps_map(struct xps_map *map,

/* Need to allocate new map to store queue on this CPU's map */
new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
- cpu_to_node(cpu));
+ cpu_to_mem(cpu));
if (!new_map)
return NULL;

@@ -1973,8 +1973,8 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
map->queues[map->len++] = index;
#ifdef CONFIG_NUMA
if (numa_node_id == -2)
- numa_node_id = cpu_to_node(cpu);
- else if (numa_node_id != cpu_to_node(cpu))
+ numa_node_id = cpu_to_mem(cpu);
+ else if (numa_node_id != cpu_to_mem(cpu))
numa_node_id = -1;
#endif
} else if (dev_maps) {
diff --git a/net/core/flow.c b/net/core/flow.c
index a0348fde1fdf..4139dbb50cc0 100644
--- a/net/core/flow.c
+++ b/net/core/flow.c
@@ -396,7 +396,7 @@ static int flow_cache_cpu_prepare(struct flow_cache *fc, int cpu)
size_t sz = sizeof(struct hlist_head) * flow_cache_hash_size(fc);

if (!fcp->hash_table) {
- fcp->hash_table = kzalloc_node(sz, GFP_KERNEL, cpu_to_node(cpu));
+ fcp->hash_table = kzalloc_node(sz, GFP_KERNEL, cpu_to_mem(cpu));
if (!fcp->hash_table) {
pr_err("NET: failed to allocate flow cache sz %zu\n", sz);
return -ENOMEM;
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index fc17a9d309ac..45d18f88dce4 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -2653,7 +2653,7 @@ static void pktgen_finalize_skb(struct pktgen_dev *pkt_dev, struct sk_buff *skb,
(datalen/frags) : PAGE_SIZE;
while (datalen > 0) {
if (unlikely(!pkt_dev->page)) {
- int node = numa_node_id();
+ int node = numa_mem_id();

if (pkt_dev->node >= 0 && (pkt_dev->flags & F_NODE))
node = pkt_dev->node;
@@ -2698,7 +2698,7 @@ static struct sk_buff *pktgen_alloc_skb(struct net_device *dev,
pkt_dev->pkt_overhead;

if (pkt_dev->flags & F_NODE) {
- int node = pkt_dev->node >= 0 ? pkt_dev->node : numa_node_id();
+ int node = pkt_dev->node >= 0 ? pkt_dev->node : numa_mem_id();

skb = __alloc_skb(NET_SKB_PAD + size, GFP_NOWAIT, 0, node);
if (likely(skb)) {
@@ -3533,7 +3533,7 @@ static int pktgen_add_device(struct pktgen_thread *t, const char *ifname)
{
struct pktgen_dev *pkt_dev;
int err;
- int node = cpu_to_node(t->cpu);
+ int node = cpu_to_mem(t->cpu);

/* We don't allow a device to be on several threads */

@@ -3621,7 +3621,7 @@ static int __net_init pktgen_create_thread(int cpu, struct pktgen_net *pn)
struct task_struct *p;

t = kzalloc_node(sizeof(struct pktgen_thread), GFP_KERNEL,
- cpu_to_node(cpu));
+ cpu_to_mem(cpu));
if (!t) {
pr_err("ERROR: out of memory, can't create new thread\n");
return -ENOMEM;
@@ -3637,7 +3637,7 @@ static int __net_init pktgen_create_thread(int cpu, struct pktgen_net *pn)

p = kthread_create_on_node(pktgen_thread_worker,
t,
- cpu_to_node(cpu),
+ cpu_to_mem(cpu),
"kpktgend_%d", cpu);
if (IS_ERR(p)) {
pr_err("kernel_thread() failed for cpu %d\n", t->cpu);
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index cf9cd13509a7..1375447b833e 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -123,7 +123,7 @@ static int flow_limit_cpu_sysctl(struct ctl_table *table, int write,
kfree(cur);
} else if (!cur && cpumask_test_cpu(i, mask)) {
cur = kzalloc_node(len, GFP_KERNEL,
- cpu_to_node(i));
+ cpu_to_mem(i));
if (!cur) {
/* not unwinding previous changes */
ret = -ENOMEM;
--
1.7.10.4

2014-07-11 07:50:39

by Jiang Liu

Subject: [RFC Patch V1 02/30] mm, sched: Use cpu_to_mem()/numa_mem_id() to support memoryless node

When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
may return a node without memory, and later cause system failure/panic
when calling kmalloc_node() and friends with returned node id.
So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
memory for the/current cpu.

If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
is the same as cpu_to_node()/numa_node_id().

Signed-off-by: Jiang Liu <[email protected]>
---
kernel/sched/core.c | 8 ++++----
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 4 ++--
kernel/sched/rt.c | 6 +++---
4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3bdf01b494fe..27e3af246310 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5743,7 +5743,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
continue;

sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
- GFP_KERNEL, cpu_to_node(cpu));
+ GFP_KERNEL, cpu_to_mem(cpu));

if (!sg)
goto fail;
@@ -6397,14 +6397,14 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
struct sched_group_capacity *sgc;

sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(),
- GFP_KERNEL, cpu_to_node(j));
+ GFP_KERNEL, cpu_to_mem(j));
if (!sd)
return -ENOMEM;

*per_cpu_ptr(sdd->sd, j) = sd;

sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
- GFP_KERNEL, cpu_to_node(j));
+ GFP_KERNEL, cpu_to_mem(j));
if (!sg)
return -ENOMEM;

@@ -6413,7 +6413,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
*per_cpu_ptr(sdd->sg, j) = sg;

sgc = kzalloc_node(sizeof(struct sched_group_capacity) + cpumask_size(),
- GFP_KERNEL, cpu_to_node(j));
+ GFP_KERNEL, cpu_to_mem(j));
if (!sgc)
return -ENOMEM;

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fc4f98b1258f..95104d363a8c 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1559,7 +1559,7 @@ void init_sched_dl_class(void)

for_each_possible_cpu(i)
zalloc_cpumask_var_node(&per_cpu(local_cpu_mask_dl, i),
- GFP_KERNEL, cpu_to_node(i));
+ GFP_KERNEL, cpu_to_mem(i));
}

#endif /* CONFIG_SMP */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea7d3335e1f..26e75b8a52e6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7611,12 +7611,12 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)

for_each_possible_cpu(i) {
cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
- GFP_KERNEL, cpu_to_node(i));
+ GFP_KERNEL, cpu_to_mem(i));
if (!cfs_rq)
goto err;

se = kzalloc_node(sizeof(struct sched_entity),
- GFP_KERNEL, cpu_to_node(i));
+ GFP_KERNEL, cpu_to_mem(i));
if (!se)
goto err_free_rq;

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a49083192c64..88d1315c6223 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -184,12 +184,12 @@ int alloc_rt_sched_group(struct task_group *tg, struct task_group *parent)

for_each_possible_cpu(i) {
rt_rq = kzalloc_node(sizeof(struct rt_rq),
- GFP_KERNEL, cpu_to_node(i));
+ GFP_KERNEL, cpu_to_mem(i));
if (!rt_rq)
goto err;

rt_se = kzalloc_node(sizeof(struct sched_rt_entity),
- GFP_KERNEL, cpu_to_node(i));
+ GFP_KERNEL, cpu_to_mem(i));
if (!rt_se)
goto err_free_rq;

@@ -1945,7 +1945,7 @@ void __init init_sched_rt_class(void)

for_each_possible_cpu(i) {
zalloc_cpumask_var_node(&per_cpu(local_cpu_mask, i),
- GFP_KERNEL, cpu_to_node(i));
+ GFP_KERNEL, cpu_to_mem(i));
}
}
#endif /* CONFIG_SMP */
--
1.7.10.4

2014-07-11 08:30:05

by Peter Zijlstra

Subject: Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> Any comments are welcomed!

Why would anybody _ever_ have a memoryless node? That's ridiculous.

Subject: Re: [RFC Patch V1 07/30] mm: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On Fri, 11 Jul 2014, Jiang Liu wrote:

> If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
> is the same as cpu_to_node()/numa_node_id().

Reviewed-by: Christoph Lameter <[email protected]>

2014-07-11 14:42:11

by Tejun Heo

Subject: Re: [RFC Patch V1 07/30] mm: Use cpu_to_mem()/numa_mem_id() to support memoryless node

Hello,

On Fri, Jul 11, 2014 at 03:37:24PM +0800, Jiang Liu wrote:
> When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
> may return a node without memory, and later cause system failure/panic
> when calling kmalloc_node() and friends with returned node id.

The patch itself looks okay to me but is this the right way to handle
this? Can't we just let the allocators fall back to the nearest node
with memory? Why do we need to impose this awareness of memory-less
node on all the users?

Thanks.

--
tejun

Subject: Re: [RFC Patch V1 07/30] mm: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On Fri, 11 Jul 2014, Tejun Heo wrote:

> Hello,
>
> On Fri, Jul 11, 2014 at 03:37:24PM +0800, Jiang Liu wrote:
> > When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
> > may return a node without memory, and later cause system failure/panic
> > when calling kmalloc_node() and friends with returned node id.
>
> The patch itself looks okay to me but is this the right way to handle
> this? Can't we just let the allocators fall back to the nearest node
> with memory? Why do we need to impose this awareness of memory-less
> node on all the users?

Allocators typically fall back but they won't in some cases if you say
that you want memory from a particular node. A GFP_THISNODE would force a
failure of the alloc. In other cases it should fall back. I am not sure
that all allocations obey these conventions though.
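
The behaviour described above (fall back by default, hard failure when the caller insists on one node) can be sketched as a userspace model. Everything here is an assumption made for illustration: model_alloc_page(), the free page counts and the fallback table are hypothetical, and the this_node_only flag merely stands in for GFP_THISNODE.

```c
#include <stdbool.h>

#define NR_NODES 3
#define NO_NODE  (-1)

/* Hypothetical free page counts; node 1 is memoryless. */
static int free_pages[NR_NODES] = { 4, 0, 4 };

/* Hypothetical fallback order per node, nearest first (the node itself leads). */
static const int fallback[NR_NODES][NR_NODES] = {
	{ 0, 1, 2 },
	{ 1, 0, 2 },
	{ 2, 1, 0 },
};

/*
 * Model of a node-aware page allocation.
 * this_node_only plays the role of GFP_THISNODE: no fallback is permitted.
 * Returns the node that satisfied the request, or NO_NODE on failure.
 */
static int model_alloc_page(int node, bool this_node_only)
{
	for (int i = 0; i < NR_NODES; i++) {
		int candidate = fallback[node][i];

		if (free_pages[candidate] > 0) {
			free_pages[candidate]--;
			return candidate;
		}
		if (this_node_only)
			break;	/* only the requested node itself may be tried */
	}
	return NO_NODE;
}
```

In this model a request for the memoryless node 1 fails outright when this_node_only is set, and is served by a neighbouring node when it is not.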

2014-07-11 15:14:17

by Paul E. McKenney

Subject: Re: [RFC Patch V1 01/30] mm, kernel: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On Fri, Jul 11, 2014 at 03:37:18PM +0800, Jiang Liu wrote:
> When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
> may return a node without memory, and later cause system failure/panic
> when calling kmalloc_node() and friends with returned node id.
> So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
> memory for the/current cpu.
>
> If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
> is the same as cpu_to_node()/numa_node_id().
>
> Signed-off-by: Jiang Liu <[email protected]>

For the rcutorture piece:

Acked-by: Paul E. McKenney <[email protected]>

Or if you separate the kernel/rcu/rcutorture.c portion into a separate
patch, I will queue it separately.

Thanx, Paul

> ---
> kernel/rcu/rcutorture.c | 2 +-
> kernel/smp.c | 2 +-
> kernel/smpboot.c | 2 +-
> kernel/taskstats.c | 2 +-
> kernel/timer.c | 2 +-
> 5 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
> index 7fa34f86e5ba..f593762d3214 100644
> --- a/kernel/rcu/rcutorture.c
> +++ b/kernel/rcu/rcutorture.c
> @@ -1209,7 +1209,7 @@ static int rcutorture_booster_init(int cpu)
> mutex_lock(&boost_mutex);
> VERBOSE_TOROUT_STRING("Creating rcu_torture_boost task");
> boost_tasks[cpu] = kthread_create_on_node(rcu_torture_boost, NULL,
> - cpu_to_node(cpu),
> + cpu_to_mem(cpu),
> "rcu_torture_boost");
> if (IS_ERR(boost_tasks[cpu])) {
> retval = PTR_ERR(boost_tasks[cpu]);
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 80c33f8de14f..2f3b84aef159 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -41,7 +41,7 @@ hotplug_cfd(struct notifier_block *nfb, unsigned long action, void *hcpu)
> case CPU_UP_PREPARE:
> case CPU_UP_PREPARE_FROZEN:
> if (!zalloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL,
> - cpu_to_node(cpu)))
> + cpu_to_mem(cpu)))
> return notifier_from_errno(-ENOMEM);
> cfd->csd = alloc_percpu(struct call_single_data);
> if (!cfd->csd) {
> diff --git a/kernel/smpboot.c b/kernel/smpboot.c
> index eb89e1807408..9c08e68e48a9 100644
> --- a/kernel/smpboot.c
> +++ b/kernel/smpboot.c
> @@ -171,7 +171,7 @@ __smpboot_create_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
> if (tsk)
> return 0;
>
> - td = kzalloc_node(sizeof(*td), GFP_KERNEL, cpu_to_node(cpu));
> + td = kzalloc_node(sizeof(*td), GFP_KERNEL, cpu_to_mem(cpu));
> if (!td)
> return -ENOMEM;
> td->cpu = cpu;
> diff --git a/kernel/taskstats.c b/kernel/taskstats.c
> index 13d2f7cd65db..cf5cba1e7fbe 100644
> --- a/kernel/taskstats.c
> +++ b/kernel/taskstats.c
> @@ -304,7 +304,7 @@ static int add_del_listener(pid_t pid, const struct cpumask *mask, int isadd)
> if (isadd == REGISTER) {
> for_each_cpu(cpu, mask) {
> s = kmalloc_node(sizeof(struct listener),
> - GFP_KERNEL, cpu_to_node(cpu));
> + GFP_KERNEL, cpu_to_mem(cpu));
> if (!s) {
> ret = -ENOMEM;
> goto cleanup;
> diff --git a/kernel/timer.c b/kernel/timer.c
> index 3bb01a323b2a..5831a38b5681 100644
> --- a/kernel/timer.c
> +++ b/kernel/timer.c
> @@ -1546,7 +1546,7 @@ static int init_timers_cpu(int cpu)
> * The APs use this path later in boot
> */
> base = kzalloc_node(sizeof(*base), GFP_KERNEL,
> - cpu_to_node(cpu));
> + cpu_to_mem(cpu));
> if (!base)
> return -ENOMEM;
>
> --
> 1.7.10.4
>

2014-07-11 15:22:03

by Tejun Heo

Subject: Re: [RFC Patch V1 07/30] mm: Use cpu_to_mem()/numa_mem_id() to support memoryless node

Hello,

On Fri, Jul 11, 2014 at 10:13:57AM -0500, Christoph Lameter wrote:
> Allocators typically fall back but they won't in some cases if you say
> that you want memory from a particular node. A GFP_THISNODE would force a
> failure of the alloc. In other cases it should fall back. I am not sure
> that all allocations obey these conventions though.

But, GFP_THISNODE + numa_mem_id() is identical to numa_node_id() +
nearest node with memory fallback. Is there any case where the user
would actually want to always fail if it's on the memless node?

Even if that's the case, there's no reason to burden everyone with
this distinction. Most users just wanna say "I'm on this node.
Please allocate considering that". There's nothing wrong with using
numa_node_id() for that.

Thanks.

--
tejun

2014-07-11 15:33:10

by Tejun Heo

Subject: Re: [RFC Patch V1 07/30] mm: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On Fri, Jul 11, 2014 at 11:21:56AM -0400, Tejun Heo wrote:
> Even if that's the case, there's no reason to burden everyone with
> this distinction. Most users just wanna say "I'm on this node.
> Please allocate considering that". There's nothing wrong with using
> numa_node_id() for that.

Also, this is minor but don't we also lose fallback information by
doing this from the caller? Please consider the following topology
where each hop is the same distance.

A - B - X - C - D

Where X is the memless node. numa_mem_id() on X would return either B
or C, right? If B or C can't satisfy the allocation, the allocator
would fall back to A from B and to D from C, both of which aren't optimal.
It should first fall back to C or B respectively, which the allocator
can't do anymore because the information is lost when the caller side
performs numa_mem_id().

Seems pretty misguided to me.

Thanks.

--
tejun
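
The lost-information argument for the A - B - X - C - D topology can be checked with a small userspace model. This is a sketch, not allocator code: build_fallback_order(), the hop-count distance and the tie-breaking rule (lower index first) are assumptions made for illustration.

```c
#include <stdlib.h>

/* Nodes on a line, one hop apart: A - B - X - C - D. X has no memory. */
enum { NODE_A, NODE_B, NODE_X, NODE_C, NODE_D, NR_NODES };

static const int has_memory[NR_NODES] = { 1, 1, 0, 1, 1 };

/* Distance in hops between two nodes on the line. */
static int hops(int from, int to)
{
	return abs(from - to);
}

/*
 * Fill order[] with all nodes that have memory, nearest to origin first.
 * Assumption: nodes at equal distance keep ascending index order.
 * Returns the number of entries written.
 */
static int build_fallback_order(int origin, int order[NR_NODES])
{
	int count = 0;

	for (int n = 0; n < NR_NODES; n++)
		if (has_memory[n])
			order[count++] = n;

	/* Stable insertion sort by hop count from origin. */
	for (int i = 1; i < count; i++) {
		int node = order[i];
		int j = i - 1;

		while (j >= 0 && hops(origin, order[j]) > hops(origin, node)) {
			order[j + 1] = order[j];
			j--;
		}
		order[j + 1] = node;
	}
	return count;
}
```

With X as the origin, the model yields the order B, C, A, D. Once the caller has already translated X to B, the order becomes B, A, C, D, so an exhausted B falls back to A (two hops from X) before C (one hop from X).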

2014-07-11 15:33:36

by Greg Kroah-Hartman

Subject: Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
> On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> > Any comments are welcomed!
>
> Why would anybody _ever_ have a memoryless node? That's ridiculous.

I'm with Peter here, why would this be a situation that we should even
support? Are there machines out there shipping like this?

greg k-h

Subject: Re: [RFC Patch V1 07/30] mm: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On Fri, 11 Jul 2014, Tejun Heo wrote:

> On Fri, Jul 11, 2014 at 11:21:56AM -0400, Tejun Heo wrote:
> > Even if that's the case, there's no reason to burden everyone with
> > this distinction. Most users just wanna say "I'm on this node.
> > Please allocate considering that". There's nothing wrong with using
> > numa_node_id() for that.
>
> Also, this is minor but don't we also lose fallback information by
> doing this from the caller? Please consider the following topology
> where each hop is the same distance.
>
> A - B - X - C - D
>
> Where X is the memless node. numa_mem_id() on X would return either B
> or C, right? If B or C can't satisfy the allocation, the allocator
> would fall back to A from B and to D from C, both of which aren't optimal.
> It should first fall back to C or B respectively, which the allocator
> can't do anymore because the information is lost when the caller side
> performs numa_mem_id().

True, but the advantage is that numa_mem_id() allows the use of a
consistent sort of "local" node, which increases allocator performance due
to the ability to cache objects from that node.

> Seems pretty misguided to me.

IMHO the whole concept of a memoryless node looks pretty misguided to me.

2014-07-11 15:58:44

by Tejun Heo

Subject: Re: [RFC Patch V1 07/30] mm: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On Fri, Jul 11, 2014 at 10:55:59AM -0500, Christoph Lameter wrote:
> > Where X is the memless node. numa_mem_id() on X would return either B
> > or C, right? If B or C can't satisfy the allocation, the allocator
> > would fall back to A from B and to D from C, both of which aren't optimal.
> > It should first fall back to C or B respectively, which the allocator
> > can't do anymore because the information is lost when the caller side
> > performs numa_mem_id().
>
> True, but the advantage is that numa_mem_id() allows the use of a
> consistent sort of "local" node, which increases allocator performance due
> to the ability to cache objects from that node.

But the allocator can do the mapping the same. I really don't see why
we'd push the distinction to the individual users.

Thanks.

--
tejun

Subject: Re: [RFC Patch V1 07/30] mm: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On Fri, 11 Jul 2014, Tejun Heo wrote:

> Hello,
>
> On Fri, Jul 11, 2014 at 10:13:57AM -0500, Christoph Lameter wrote:
> > Allocators typically fall back but they won't in some cases if you say
> > that you want memory from a particular node. A GFP_THISNODE would force a
> > failure of the alloc. In other cases it should fall back. I am not sure
> > that all allocations obey these conventions though.
>
> But, GFP_THISNODE + numa_mem_id() is identical to numa_node_id() +
> nearest node with memory fallback. Is there any case where the user
> would actually want to always fail if it's on the memless node?

GFP_THISNODE allocations must fail if there is no memory available on
the node. No fallback allowed.

If the allocator performs caching for a particular node (like SLAB) then
the allocator *cannot* accept memory from another node and the alloc via
the page allocator must fail so that the allocator can then pick another
node for keeping track of the allocations.

> Even if that's the case, there's no reason to burden everyone with
> this distinction. Most users just wanna say "I'm on this node.
> Please allocate considering that". There's nothing wrong with using
> numa_node_id() for that.

Well yes that speaks for this patch.

2014-07-11 16:01:58

by Tejun Heo

Subject: Re: [RFC Patch V1 07/30] mm: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On Fri, Jul 11, 2014 at 10:58:52AM -0500, Christoph Lameter wrote:
> > But, GFP_THISNODE + numa_mem_id() is identical to numa_node_id() +
> > nearest node with memory fallback. Is there any case where the user
> > would actually want to always fail if it's on the memless node?
>
> GFP_THISNODE allocations must fail if there is no memory available on
> the node. No fallback allowed.

I don't know. The intention is that the caller wants something on
this node, or else the caller will fail or fall back on its own, right? For
most use cases just considering the nearest memory node as "local" for
memless nodes should work and serve the intentions of the users close
enough. Whether that'd be better or we'd be better off with something
else depends on the details for sure.

Thanks.

--
tejun

Subject: Re: [RFC Patch V1 07/30] mm: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On Fri, 11 Jul 2014, Tejun Heo wrote:

> On Fri, Jul 11, 2014 at 10:55:59AM -0500, Christoph Lameter wrote:
> > > Where X is the memless node. numa_mem_id() on X would return either B
> > > or C, right? If B or C can't satisfy the allocation, the allocator
> > > would fall back to A from B and to D from C, both of which aren't optimal.
> > > It should first fall back to C or B respectively, which the allocator
> > > can't do anymore because the information is lost when the caller side
> > > performs numa_mem_id().
> >
> > True, but the advantage is that numa_mem_id() allows the use of a
> > consistent sort of "local" node, which increases allocator performance due
> > to the ability to cache objects from that node.
>
> But the allocator can do the mapping the same. I really don't see why
> we'd push the distinction to the individual users.

The "users" (I guess you mean general kernel code/drivers) can use various
memory allocators which will do the right thing internally regarding
GFP_THISNODE. They do not need to worry too much about this unless there
are reasons beyond optimizing NUMA placement to need memory from a
particular node (e.g. a device that requires memory from a NUMA node that
is local to the PCI bus where the hardware resides).

Subject: Re: [RFC Patch V1 07/30] mm: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On Fri, 11 Jul 2014, Tejun Heo wrote:

> On Fri, Jul 11, 2014 at 10:58:52AM -0500, Christoph Lameter wrote:
> > > But, GFP_THISNODE + numa_mem_id() is identical to numa_node_id() +
> > > nearest node with memory fallback. Is there any case where the user
> > > would actually want to always fail if it's on the memless node?
> >
> > GFP_THISNODE allocations must fail if there is no memory available on
> > the node. No fallback allowed.
>
> I don't know. The intention is that the caller wants something on
> this node, or else the caller will fail or fall back on its own, right? For
> most use cases just considering the nearest memory node as "local" for
> memless nodes should work and serve the intentions of the users close
> enough. Whether that'd be better or we'd be better off with something
> else depends on the details for sure.

Yes that works. But if we want a consistent node to allocate from (and
avoid the fallbacks) then we need this patch. I think this is up to those
needing memoryless nodes to figure out what semantics they need.

2014-07-11 16:24:59

by Tejun Heo

Subject: Re: [RFC Patch V1 07/30] mm: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On Fri, Jul 11, 2014 at 11:19:14AM -0500, Christoph Lameter wrote:
> Yes that works. But if we want a consistent node to allocate from (and
> avoid the fallbacks) then we need this patch. I think this is up to those
> needing memoryless nodes to figure out what semantics they need.

I'm not following what you're saying. Are you saying that we need to
spread numa_mem_id() all over the place for GFP_THISNODE users on
memless nodes? There aren't that many users of GFP_THISNODE.
Wouldn't it make far more sense to just change them? Or just
introduce a new GFP flag GFP_CLOSE_OR_BUST which allows falling back
to the nearest local node for memless nodes. There's no reason to
leak this information outside allocator proper.

Thanks.

--
tejun

Subject: Re: [RFC Patch V1 07/30] mm: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On Fri, 11 Jul 2014, Tejun Heo wrote:

> On Fri, Jul 11, 2014 at 11:19:14AM -0500, Christoph Lameter wrote:
> > Yes that works. But if we want a consistent node to allocate from (and
> > avoid the fallbacks) then we need this patch. I think this is up to those
> > needing memoryless nodes to figure out what semantics they need.
>
> I'm not following what you're saying. Are you saying that we need to
> spread numa_mem_id() all over the place for GFP_THISNODE users on
> memless nodes? There aren't that many users of GFP_THISNODE.

GFP_THISNODE is mostly used by allocators that need memory from specific
nodes. The use of numa_mem_id() there is useful because one will not
get any memory at all when attempting to allocate from a memoryless
node using GFP_THISNODE.

I meant that relying on fallback to the neighboring nodes, using
numa_node_id() without GFP_THISNODE, is one approach; it may prevent memory
allocators from caching objects for that node because every allocation may
choose a different neighboring node. The other is the use of
numa_mem_id(), which will always use a specific node and avoid fallback to
a different node.

The choice is up to those having an interest in memoryless nodes, which
again I find a pretty strange thing to have. It has already proven itself
difficult to maintain in the kernel, given the notion of memory
nodes that should have memory but surprisingly have none. Then there are
the esoteric fallback conditions and special cases introduced. It's a mess.

The best solution may be to just get rid of the whole thing and require
all processors to have a node with memory that is local to them. Current
"memoryless" hardware can simply decide on bootup to pick a memory node
that is local and thus we do not have to deal with it in the core.
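
The distinction drawn above can be illustrated with a small user-space
model in C. This is only a sketch, not kernel code: the four-node layout
follows the numactl output in the cover letter (nodes 2 and 3 without
memory), while the model_* names and the "nearest node with memory" table
are assumptions made up for the illustration.

```c
#include <stdbool.h>

#define MODEL_NR_NODES 4
#define MODEL_NO_NODE  (-1)

/* Assumed layout: nodes 0 and 1 have memory, nodes 2 and 3 do not. */
static const bool model_node_has_memory[MODEL_NR_NODES] = {
    true, true, false, false
};

/*
 * Assumed "nearest node with memory" per node. This plays the role that
 * numa_mem_id()/cpu_to_mem() play in the discussion; the values are invented.
 */
static const int model_mem_node[MODEL_NR_NODES] = { 0, 1, 0, 1 };

/* Strict request (GFP_THISNODE-like): no fallback, a memoryless node fails. */
static int model_alloc_thisnode(int nid)
{
    return model_node_has_memory[nid] ? nid : MODEL_NO_NODE;
}

/* Relaxed request: served from the node itself or from its fallback node. */
static int model_alloc_fallback(int nid)
{
    return model_node_has_memory[nid] ? nid : model_mem_node[nid];
}
```

In this model a strict request for node 2 fails, while a strict request for
model_mem_node[2] succeeds on node 0, which is the case for using
numa_mem_id() together with GFP_THISNODE.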

2014-07-11 18:28:19

by Tejun Heo

Subject: Re: [RFC Patch V1 07/30] mm: Use cpu_to_mem()/numa_mem_id() to support memoryless node

Hello,

On Fri, Jul 11, 2014 at 12:29:30PM -0500, Christoph Lameter wrote:
> GFP_THISNODE is mostly used by allocators that need memory from specific
> nodes. The use of numa_mem_id() there is useful because one will not
> get any memory at all when attempting to allocate from a memoryless
> node using GFP_THISNODE.

As long as it's in allocator proper, it doesn't matter all that much
but the changes are clearly not contained, are they?

Also, unless this is done where the falling back is actually
happening, numa_mem_id() seems like the wrong interface because you
end up losing information of the originating node. Given that this
isn't a wide spread use case, maybe we can do with something like
numa_mem_id() as a compromise but if we're doing that let's at least
make it clear that it's something ugly (give it an ugly name, not
something as generic as numa_mem_id()) and not expose it outside
allocators.

Thanks.

--
tejun

Subject: Re: [RFC Patch V1 07/30] mm: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On Fri, 11 Jul 2014, Tejun Heo wrote:

> On Fri, Jul 11, 2014 at 12:29:30PM -0500, Christoph Lameter wrote:
> > GFP_THISNODE is mostly used by allocators that need memory from specific
> > nodes. The use of numa_mem_id() there is useful because one will not
> > get any memory at all when attempting to allocate from a memoryless
> > node using GFP_THISNODE.
>
> As long as it's in allocator proper, it doesn't matter all that much
> but the changes are clearly not contained, are they?

Well there is a proliferation of memory allocators recently. NUMA is often
a second thought in those.

2014-07-11 20:02:29

by Dave Hansen

Subject: Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

On 07/11/2014 08:33 AM, Greg KH wrote:
> On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
>> On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
>>> Any comments are welcomed!
>>
>> Why would anybody _ever_ have a memoryless node? That's ridiculous.
> I'm with Peter here, why would this be a situation that we should even
> support? Are there machines out there shipping like this?

This is orthogonal to the problem Jiang Liu is solving, but...

The IBM guys have been hitting the CPU-less and memoryless node issues
forever, but that's mostly because their (traditional) hypervisor had
good NUMA support and ran multi-node guests.

I've never seen it in practice on x86 mostly because the hypervisors
don't have good NUMA support. I honestly think this is something x86 is
going to have to handle eventually anyway. It's essentially a resource
fragmentation problem, and there are going to be times where a guest
needs to be spun up and the hypervisor has nodes with either no spare memory
or no spare CPUs.

The hypervisor has 3 choices in this case:
1. Lie about the NUMA layout
2. Waste the resources
3. Tell the guest how it's actually arranged

2014-07-11 20:21:23

by Andi Kleen

Subject: Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

Greg KH <[email protected]> writes:

> On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
>> On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
>> > Any comments are welcomed!
>>
>> Why would anybody _ever_ have a memoryless node? That's ridiculous.
>
> I'm with Peter here, why would this be a situation that we should even
> support? Are there machines out there shipping like this?

We've always had memoryless nodes.

A classic case in the old days was a two socket system where someone
didn't populate any DIMMs on the second socket.

There are other cases too.

-Andi

--
[email protected] -- Speaking for myself only

2014-07-11 20:51:19

by Peter Zijlstra

Subject: Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

On Fri, Jul 11, 2014 at 01:20:51PM -0700, Andi Kleen wrote:
> Greg KH <[email protected]> writes:
>
> > On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
> >> On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> >> > Any comments are welcomed!
> >>
> >> Why would anybody _ever_ have a memoryless node? That's ridiculous.
> >
> > I'm with Peter here, why would this be a situation that we should even
> > support? Are there machines out there shipping like this?
>
> We've always had memoryless nodes.
>
> A classic case in the old days was a two socket system where someone
> didn't populate any DIMMs on the second socket.

That's an obvious "don't do that then" case. It's silly.

> There are other cases too.

Are there any sane ones?

2014-07-11 21:58:42

by Andi Kleen

Subject: Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

On Fri, Jul 11, 2014 at 10:51:06PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 11, 2014 at 01:20:51PM -0700, Andi Kleen wrote:
> > Greg KH <[email protected]> writes:
> >
> > > On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
> > >> On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> > >> > Any comments are welcomed!
> > >>
> > >> Why would anybody _ever_ have a memoryless node? That's ridiculous.
> > >
> > > I'm with Peter here, why would this be a situation that we should even
> > > support? Are there machines out there shipping like this?
> >
> > We've always had memoryless nodes.
> >
> > A classic case in the old days was a two socket system where someone
> > didn't populate any DIMMs on the second socket.
>
> That's an obvious "don't do that then" case. It's silly.

True. We should recommend that anyone running Linux will email you
for approval of their configuration first.


> > There are other cases too.
>
> Are there any sane ones?

Yes.

-Andi

2014-07-11 22:40:50

by Jiri Kosina

Subject: Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

On Fri, 11 Jul 2014, Greg KH wrote:

> > On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> > > Any comments are welcomed!
> >
> > Why would anybody _ever_ have a memoryless node? That's ridiculous.
>
> I'm with Peter here, why would this be a situation that we should even
> support? Are there machines out there shipping like this?

I am pretty sure I've seen ppc64 machine with memoryless NUMA node.

--
Jiri Kosina
SUSE Labs

2014-07-11 23:52:13

by H. Peter Anvin

Subject: Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

On 07/11/2014 01:20 PM, Andi Kleen wrote:
> Greg KH <[email protected]> writes:
>
>> On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
>>> On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
>>>> Any comments are welcomed!
>>>
>>> Why would anybody _ever_ have a memoryless node? That's ridiculous.
>>
>> I'm with Peter here, why would this be a situation that we should even
>> support? Are there machines out there shipping like this?
>
> We've always had memoryless nodes.
>
> A classic case in the old days was a two socket system where someone
> didn't populate any DIMMs on the second socket.
>
> There are other cases too.
>

Yes, like a node controller-based system where the system can be
populated with either memory cards or CPU cards, for example. Now you
can have both memoryless nodes and memory-only nodes...

Memory-only nodes also happen in real life. In some cases they are done
by permanently putting low-frequency CPUs to sleep for their memory
controllers.

-hpa

2014-07-12 12:32:10

by Jens Axboe

Subject: Re: [RFC Patch V1 01/30] mm, kernel: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On 2014-07-11 09:37, Jiang Liu wrote:
> When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
> may return a node without memory, and later cause system failure/panic
> when calling kmalloc_node() and friends with returned node id.
> So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
> memory for the/current cpu.
>
> If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
> is the same as cpu_to_node()/numa_node_id().

I think blk-mq requires some of the same help, as do other places in the
block layer. I'll take a look at that.

As for your smp.c bits here:

Acked-by: Jens Axboe <[email protected]>

--
Jens Axboe

2014-07-15 01:18:13

by David Rientjes

Subject: Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

On Fri, 11 Jul 2014, Peter Zijlstra wrote:

> > There are other cases too.
>
> Are there any sane ones?
>

They are specifically allowed by the ACPI specification to be able to
include only cpus, I/O, networking cards, etc.

2014-07-15 01:20:00

by David Rientjes

Subject: Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

On Sat, 12 Jul 2014, Jiri Kosina wrote:

> I am pretty sure I've seen ppc64 machine with memoryless NUMA node.
>

Yes, Nishanth Aravamudan (now cc'd) has been working diligently on the
problems that have been encountered, including problems in generic kernel
code, on powerpc with memoryless nodes.

2014-07-18 07:36:22

by Michal Hocko

Subject: Re: [RFC Patch V1 09/30] mm, memcg: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On Fri 11-07-14 15:37:26, Jiang Liu wrote:
> When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
> may return a node without memory, and later cause system failure/panic
> when calling kmalloc_node() and friends with returned node id.
> So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
> memory for the/current cpu.
>
> If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
> is the same as cpu_to_node()/numa_node_id().

The change makes a difference only for really tiny memcgs. If we really
have all pages on unevictable list or anon with no swap allowed and that
is the reason why no node is set in scan_nodes mask then reclaiming
memoryless node or any arbitrary close one doesn't make any difference.
The current memcg might not have any memory on that node at all.

So the change doesn't make any practical difference and the changelog is
misleading.

> Signed-off-by: Jiang Liu <[email protected]>
> ---
> mm/memcontrol.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a2c7bcb0e6eb..d6c4b7255ca9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1933,7 +1933,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
> * we use curret node.
> */
> if (unlikely(node == MAX_NUMNODES))
> - node = numa_node_id();
> + node = numa_mem_id();
>
> memcg->last_scanned_node = node;
> return node;
> --
> 1.7.10.4
>

--
Michal Hocko
SUSE Labs

2014-07-18 12:41:03

by Jason Cooper

Subject: Re: [RFC Patch V1 21/30] mm, irqchip: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On Fri, Jul 11, 2014 at 03:37:38PM +0800, Jiang Liu wrote:
> When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
> may return a node without memory, and later cause system failure/panic
> when calling kmalloc_node() and friends with returned node id.
> So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
> memory for the/current cpu.
>
> If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
> is the same as cpu_to_node()/numa_node_id().
>
> Signed-off-by: Jiang Liu <[email protected]>
> ---
> drivers/irqchip/irq-clps711x.c | 2 +-
> drivers/irqchip/irq-gic.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)

Do you have anything depending on this? Can I apply it to irqchip? If
you need to keep it with other changes,

Acked-by: Jason Cooper <[email protected]>

But please do let me know if I can take it.

thx,

Jason.

2014-07-21 17:16:11

by Nishanth Aravamudan

Subject: Re: [RFC Patch V1 01/30] mm, kernel: Use cpu_to_mem()/numa_mem_id() to support memoryless node

Hi Paul,

On 11.07.2014 [08:14:05 -0700], Paul E. McKenney wrote:
> On Fri, Jul 11, 2014 at 03:37:18PM +0800, Jiang Liu wrote:
> > When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
> > may return a node without memory, and later cause system failure/panic
> > when calling kmalloc_node() and friends with returned node id.
> > So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
> > memory for the/current cpu.
> >
> > If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
> > is the same as cpu_to_node()/numa_node_id().
> >
> > Signed-off-by: Jiang Liu <[email protected]>
>
> For the rcutorture piece:
>
> Acked-by: Paul E. McKenney <[email protected]>
>
> Or if you separate the kernel/rcu/rcutorture.c portion into a separate
> patch, I will queue it separately.

Just FYI, based upon a separate discussion with Tejun and others, it
seems to be preferred to avoid the proliferation of cpu_to_mem
throughout the kernel blindly. For kthread_create_on_node(), I'm going
to try and fix the underlying issue and so you, as the caller, should
still specify the NUMA node you are running the kthread on
(cpu_to_node), not where you expect the memory to come from
(cpu_to_mem).

Thanks,
Nish

2014-07-21 17:23:41

by Nishanth Aravamudan

Subject: Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

Hi Jiang,

On 11.07.2014 [15:37:17 +0800], Jiang Liu wrote:
> Previously we have posted a patch fix a memory crash issue caused by
> memoryless node on x86 platforms, please refer to
> http://comments.gmane.org/gmane.linux.kernel/1687425
>
> As suggested by David Rientjes, the most suitable fix for the issue
> should be to use cpu_to_mem() rather than cpu_to_node() in the caller.
> So this is the patchset according to David's suggestion.

Hrm, that is initially what David said, but then later on in the thread,
he specifically says he doesn't think memoryless nodes are the problem.
It seems like the issue is the order of onlining of resources on a
specific x86 platform?

Memoryless nodes in and of themselves don't cause the kernel to crash.
powerpc boots with them (both previously without
CONFIG_HAVE_MEMORYLESS_NODES and now with it) and is functional,
although it does lead to some performance issues I'm hoping to resolve.
In fact, David specifically says that the kernel crash you triggered
makes sense as cpu_to_node() points to an offline node?

In any case, a blind s/cpu_to_node/cpu_to_mem/ is not always correct.
There is a semantic difference and in some cases the allocator already
does the right thing under the covers (falls back to nearest node) and in some
cases it doesn't.

Thanks,
Nish

2014-07-21 17:33:52

by Paul E. McKenney

Subject: Re: [RFC Patch V1 01/30] mm, kernel: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On Mon, Jul 21, 2014 at 10:15:27AM -0700, Nishanth Aravamudan wrote:
> Hi Paul,
>
> On 11.07.2014 [08:14:05 -0700], Paul E. McKenney wrote:
> > On Fri, Jul 11, 2014 at 03:37:18PM +0800, Jiang Liu wrote:
> > > When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
> > > may return a node without memory, and later cause system failure/panic
> > > when calling kmalloc_node() and friends with returned node id.
> > > So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
> > > memory for the/current cpu.
> > >
> > > If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
> > > is the same as cpu_to_node()/numa_node_id().
> > >
> > > Signed-off-by: Jiang Liu <[email protected]>
> >
> > For the rcutorture piece:
> >
> > Acked-by: Paul E. McKenney <[email protected]>
> >
> > Or if you separate the kernel/rcu/rcutorture.c portion into a separate
> > patch, I will queue it separately.
>
> Just FYI, based upon a separate discussion with Tejun and others, it
> seems to be preferred to avoid the proliferation of cpu_to_mem
> throughout the kernel blindly. For kthread_create_on_node(), I'm going
> to try and fix the underlying issue and so you, as the caller, should
> still specify the NUMA node you are running the kthread on
> (cpu_to_node), not where you expect the memory to come from
> (cpu_to_mem).

Even better!!! ;-)

Thanx, Paul

2014-07-21 17:38:44

by Nishanth Aravamudan

Subject: Re: [RFC Patch V1 17/30] mm, intel_powerclamp: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On 11.07.2014 [15:37:34 +0800], Jiang Liu wrote:
> When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
> may return a node without memory, and later cause system failure/panic
> when calling kmalloc_node() and friends with returned node id.
> So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
> memory for the/current cpu.

You used the same changelog for all of the patches, it seems. But the
interface below (kthread_create_on_node) doesn't go into kmalloc_node?

kthread_create_on_node eventually sets the value used by
tsk_fork_get_node(), which is used by alloc_task_struct_node() and
alloc_thread_info_node(). The first uses kmem_cache_alloc_node() and the
second, depending on the relative sizes of THREAD_SIZE and PAGE_SIZE
uses either alloc_kmem_pages_node() or kmem_cache_alloc_node().
kmem_cache_alloc_node() goes into the appropriate slab allocator which
on SLUB for instance, goes down into __alloc_pages_nodemask. But no
failure occurs when memoryless nodes are present, you just get memory
that is remote from the node specified? Similarly,
alloc_kmem_pages_node() calls into __alloc_pages with an appropriate
node_zonelist, which should provide for the correct fallback based upon
NUMA topology?

What system failure/panic did you see that is resolved by this patch?

> If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
> is the same as cpu_to_node()/numa_node_id().
>
> Signed-off-by: Jiang Liu <[email protected]>
> ---
> drivers/thermal/intel_powerclamp.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> index 95cb7fc20e17..9d9be8cd1b50 100644
> --- a/drivers/thermal/intel_powerclamp.c
> +++ b/drivers/thermal/intel_powerclamp.c
> @@ -531,7 +531,7 @@ static int start_power_clamp(void)
>
> thread = kthread_create_on_node(clamp_thread,
> (void *) cpu,
> - cpu_to_node(cpu),
> + cpu_to_mem(cpu),

As Tejun has pointed out elsewhere, we lose context here about the
original node we were running on. That information is relevant for a few
reasons:

1) In the underlying allocator, we might not have memory *right now* to
satisfy a request, which, say, causes us to deactivate a slab
(CONFIG_SLUB). But that condition may be relieved in the future and we
want to use the correct node again then.

2) For topologies that are symmetrical around a memoryless node, we
could lose the correct fallback information when we specify a nearest
neighbor with memory.

Thanks,
Nish
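
Point 1 above can be sketched with a user-space model in C. It is a toy
under stated assumptions, not the SLUB or page allocator code: the
sketch_* names and the per-node fallback lists are hypothetical. It
contrasts a caller that remembers an early "nearest node with memory"
answer with a caller that keeps passing the originating node and lets the
allocator resolve it at allocation time.

```c
#include <stdbool.h>

#define SKETCH_NR_NODES 4

/* Mutable so the sketch can model memory hot-add; nodes 2 and 3 start empty. */
static bool sketch_has_memory[SKETCH_NR_NODES] = { true, true, false, false };

/* Assumed fallback order per node: the node itself first, then neighbours. */
static const int sketch_fallback[SKETCH_NR_NODES][SKETCH_NR_NODES] = {
    { 0, 1, 2, 3 },
    { 1, 0, 3, 2 },
    { 2, 3, 0, 1 },
    { 3, 2, 1, 0 },
};

/* Resolve a requested node at allocation time by walking its fallback list. */
static int sketch_resolve(int requested)
{
    for (int i = 0; i < SKETCH_NR_NODES; i++) {
        int nid = sketch_fallback[requested][i];

        if (sketch_has_memory[nid])
            return nid;
    }
    return -1; /* no node has memory */
}

/* Model a memory hot-add event on the given node. */
static void sketch_hot_add_memory(int nid)
{
    sketch_has_memory[nid] = true;
}
```

If a caller stores sketch_resolve(2) while node 2 is empty, it keeps
getting node 0 even after sketch_hot_add_memory(2). A caller that keeps
asking for node 2 is served locally as soon as memory appears, which is
the information lost by translating the node at the call site.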

2014-07-21 17:42:06

by Tony Luck

Subject: Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
<[email protected]> wrote:
> It seems like the issue is the order of onlining of resources on a
> specific x86 platform?

Yes. When we online a node the BIOS hits us with some ACPI hotplug events:

First: Here are some new cpus
Next: Here is some new memory
Last: Here are some new I/O things (PCIe root ports, PCIe devices,
IOAPICs, IOMMUs, ...)

So there is a period where the node is memoryless - although that will generally
be resolved when the memory hot plug event arrives ... that isn't guaranteed to
occur (there might not be any memory on the node, or what memory there is
may have failed self-test and been disabled).

-Tony

2014-07-21 17:42:34

by Nishanth Aravamudan

Subject: Re: [RFC Patch V1 15/30] mm, igb: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On 11.07.2014 [15:37:32 +0800], Jiang Liu wrote:
> When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
> may return a node without memory, and later cause system failure/panic
> when calling kmalloc_node() and friends with returned node id.
> So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
> memory for the/current cpu.
>
> If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
> is the same as cpu_to_node()/numa_node_id().
>
> Signed-off-by: Jiang Liu <[email protected]>
> ---
> drivers/net/ethernet/intel/igb/igb_main.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> index f145adbb55ac..2b74bffa5648 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -6518,7 +6518,7 @@ static bool igb_can_reuse_rx_page(struct igb_rx_buffer *rx_buffer,
> unsigned int truesize)
> {
> /* avoid re-using remote pages */
> - if (unlikely(page_to_nid(page) != numa_node_id()))
> + if (unlikely(page_to_nid(page) != numa_mem_id()))
> return false;
>
> #if (PAGE_SIZE < 8192)
> @@ -6588,7 +6588,7 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
> memcpy(__skb_put(skb, size), va, ALIGN(size, sizeof(long)));
>
> /* we can reuse buffer as-is, just make sure it is local */
> - if (likely(page_to_nid(page) == numa_node_id()))
> + if (likely(page_to_nid(page) == numa_mem_id()))
> return true;
>
> /* this page cannot be reused so discard it */

This doesn't seem to have anything to do with crashes or errors?

The original code is checking if the NUMA node of a page is remote to
the NUMA node current is running on. Your change makes it check if the
NUMA node of a page is not equal to the nearest NUMA node with memory.
That's not necessarily local, though, which seems to be the whole
point. In this case, perhaps the driver author doesn't want to reuse the
memory at all for performance reasons? In any case, I don't think this
patch has appropriate justification.

Thanks,
Nish

2014-07-21 17:48:05

by Nishanth Aravamudan

Subject: Re: [RFC Patch V1 28/30] mm: Update _mem_id_[] for every possible CPU when memory configuration changes

On 11.07.2014 [15:37:45 +0800], Jiang Liu wrote:
> Current kernel only updates _mem_id_[cpu] for onlined CPUs when memory
> configuration changes. So kernel may allocate memory from remote node
> for a CPU if the CPU is still in absent or offline state even if the
> node associated with the CPU has already been onlined.

This just sounds like the topology information is being updated at the
wrong place/time? That is, the memory is online, the CPU is being
brought online, but isn't associated with any node?

> This patch tries to improve performance by updating _mem_id_[cpu] for
> each possible CPU when memory configuration changes, thus kernel could
> always allocate from local node once the node is onlined.

Ok, what is the impact? Do you actually see better performance?

> We check node_online(cpu_to_node(cpu)) because:
> 1) local_memory_node(nid) needs to access NODE_DATA(nid)
> 2) try_offline_node(nid) just zeroes out NODE_DATA(nid) instead of free it
>
> Signed-off-by: Jiang Liu <[email protected]>
> ---
> mm/page_alloc.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0ea758b898fd..de86e941ed57 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3844,13 +3844,13 @@ static int __build_all_zonelists(void *data)
> /*
> * We now know the "local memory node" for each node--
> * i.e., the node of the first zone in the generic zonelist.
> - * Set up numa_mem percpu variable for on-line cpus. During
> - * boot, only the boot cpu should be on-line; we'll init the
> - * secondary cpus' numa_mem as they come on-line. During
> - * node/memory hotplug, we'll fixup all on-line cpus.
> + * Set up numa_mem percpu variable for all possible cpus
> + * if associated node has been onlined.
> */
> - if (cpu_online(cpu))
> + if (node_online(cpu_to_node(cpu)))
> set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
> + else
> + set_cpu_numa_mem(cpu, NUMA_NO_NODE);
> #endif
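
The behavioural difference between the removed and added lines in the
hunk above can be modelled in user space. This is a sketch under
assumptions, not the code in mm/page_alloc.c: the ex_* names, the
two-node layout, and the stand-in for local_memory_node() are all
hypothetical.

```c
#include <stdbool.h>

#define EX_NR_CPUS  4
#define EX_NR_NODES 2
#define EX_NO_NODE  (-1)

/* Assumed topology: CPUs 0-1 on node 0 (online), CPUs 2-3 on node 1 (not yet online). */
static const int  ex_cpu_node[EX_NR_CPUS]     = { 0, 0, 1, 1 };
static const bool ex_cpu_online[EX_NR_CPUS]   = { true, true, false, false };
static const bool ex_node_online[EX_NR_NODES] = { true, true };

/* Zero-initialised, standing in for a stale value that points at node 0. */
static int ex_cpu_mem[EX_NR_CPUS];

/* Stand-in for local_memory_node(); here every online node has memory. */
static int ex_local_memory_node(int nid)
{
    return nid;
}

/* Rule in the removed lines: only online CPUs are refreshed. */
static void ex_update_online_cpus(void)
{
    for (int cpu = 0; cpu < EX_NR_CPUS; cpu++)
        if (ex_cpu_online[cpu])
            ex_cpu_mem[cpu] = ex_local_memory_node(ex_cpu_node[cpu]);
}

/* Rule in the added lines: every possible CPU whose node is online. */
static void ex_update_possible_cpus(void)
{
    for (int cpu = 0; cpu < EX_NR_CPUS; cpu++) {
        int nid = ex_cpu_node[cpu];

        ex_cpu_mem[cpu] = ex_node_online[nid] ?
                          ex_local_memory_node(nid) : EX_NO_NODE;
    }
}
```

With the first rule, the offline CPUs 2 and 3 keep their stale entry for
node 0; with the second rule they are pointed at node 1 before they come
online. Whether that produces a measurable gain is the open question
raised in the reply.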

2014-07-21 17:52:55

by Nishanth Aravamudan

Subject: Re: [RFC Patch V1 22/30] mm, of: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On 11.07.2014 [15:37:39 +0800], Jiang Liu wrote:
> When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
> may return a node without memory, and later cause system failure/panic
> when calling kmalloc_node() and friends with returned node id.
> So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
> memory for the/current cpu.
>
> If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
> is the same as cpu_to_node()/numa_node_id().
>
> Signed-off-by: Jiang Liu <[email protected]>
> ---
> drivers/of/base.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index b9864806e9b8..40d4772973ad 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -85,7 +85,7 @@ EXPORT_SYMBOL(of_n_size_cells);
> #ifdef CONFIG_NUMA
> int __weak of_node_to_nid(struct device_node *np)
> {
> - return numa_node_id();
> + return numa_mem_id();
> }
> #endif

Um, NAK. of_node_to_nid() returns the NUMA node ID for a given device
tree node. The default should be the physically local NUMA node, not the
nearest memory-containing node.

I think the general direction of this patchset is good -- what NUMA
information do we actually care about at each callsite. But the execution
is blind and doesn't consider at all what the code is actually doing.
The changelogs are all identical and don't actually provide any
information about what errors this (or any) specific patch is
resolving.

Thanks,
Nish

2014-07-21 17:57:47

by Nishanth Aravamudan

Subject: Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

On 21.07.2014 [10:41:59 -0700], Tony Luck wrote:
> On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
> <[email protected]> wrote:
> > It seems like the issue is the order of onlining of resources on a
> > specific x86 platform?
>
> Yes. When we online a node the BIOS hits us with some ACPI hotplug events:
>
> First: Here are some new cpus

Ok, so during this period, you might get some remote allocations. Do you
know the topology of these CPUs? That is, do they belong to a
(soon-to-exist) NUMA node? Can you online that currently offline NUMA
node at this point (so that NODE_DATA() resolves, etc.)?

> Next: Here is some new memory

And then update the NUMA topology at this point? That is,
set_cpu_numa_node/mem as appropriate so the underlying allocators do the
right thing?

> Last; Here are some new I/O things (PCIe root ports, PCIe devices,
> IOAPICs, IOMMUs, ...)
>
> So there is a period where the node is memoryless - although that will
> generally be resolved when the memory hot plug event arrives ... that
> isn't guaranteed to occur (there might not be any memory on the node,
> or what memory there is may have failed self-test and been disabled).

Right, but the allocator(s) generally does the right thing already in
the face of memoryless nodes -- they fall back to the nearest node. That
leads to poor performance, but is functional. Based upon the previous
thread Jiang pointed to, it seems like the real issue here isn't that
the node is memoryless, but that it's not even online yet? So NODE_DATA
access crashes?

Thanks,
Nish
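
The difference between a node that is online but memoryless and a node
that is not online at all can be sketched in user-space C. This is a toy
model, not the kernel's NODE_DATA() implementation: the toy_* names are
hypothetical, and a NULL table entry stands in for per-node data that
does not exist yet.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_NODES 4

struct toy_pgdat {
    int  nid;
    bool has_memory;
};

static struct toy_pgdat toy_pgdat_storage[TOY_NR_NODES];

/* NULL entry: the node is not online, so its per-node data is absent. */
static struct toy_pgdat *toy_node_data[TOY_NR_NODES];

enum toy_state { TOY_OFFLINE, TOY_MEMORYLESS, TOY_HAS_MEMORY };

/* Bring a node online (or update it), with or without memory. */
static void toy_online_node(int nid, bool has_memory)
{
    toy_pgdat_storage[nid].nid = nid;
    toy_pgdat_storage[nid].has_memory = has_memory;
    toy_node_data[nid] = &toy_pgdat_storage[nid];
}

/* Classify a node; the NULL check is what an unguarded caller would skip. */
static enum toy_state toy_node_state(int nid)
{
    struct toy_pgdat *pgdat = toy_node_data[nid];

    if (pgdat == NULL)
        return TOY_OFFLINE; /* dereferencing here is the crash case */
    return pgdat->has_memory ? TOY_HAS_MEMORY : TOY_MEMORYLESS;
}
```

In the model, only the TOY_OFFLINE case involves a pointer that must not
be dereferenced; a TOY_MEMORYLESS node is safe to inspect and merely
forces allocations to fall back elsewhere.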

2014-07-21 20:06:34

by Peter Zijlstra

Subject: Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

On Mon, Jul 21, 2014 at 10:41:59AM -0700, Tony Luck wrote:
> On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
> <[email protected]> wrote:
> > It seems like the issue is the order of onlining of resources on a
> > specific x86 platform?
>
> Yes. When we online a node the BIOS hits us with some ACPI hotplug events:
>
> First: Here are some new cpus
> Next: Here is some new memory
> Last; Here are some new I/O things (PCIe root ports, PCIe devices,
> IOAPICs, IOMMUs, ...)
>
> So there is a period where the node is memoryless - although that will generally
> be resolved when the memory hot plug event arrives ... that isn't guaranteed to
> occur (there might not be any memory on the node, or what memory there is
> may have failed self-test and been disabled).

Right, but we could 'easily' capture that in arch code and make it look
like it was done in a 'sane' order. No need to wreck the rest of the
kernel to support this particular BIOS fuckup.

2014-07-21 21:09:13

by Nishanth Aravamudan

Subject: Re: [RFC Patch V1 15/30] mm, igb: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On 21.07.2014 [12:53:33 -0700], Alexander Duyck wrote:
> I do agree the description should probably be changed. There shouldn't be
> any panics involved, only a performance impact as it will be reallocating
> always if it is on a node with no memory.

Yep, thanks for the review.

> My intention on this was to make certain that the memory used is from the
> closest node possible. As such I believe this change likely honours that.

Absolutely, just wanted to make it explicit that it's not a functional
fix, just a performance fix (presuming this shows up at all on systems
that have memoryless NUMA nodes).

I'd suggest an update to the comments, as well.

Thanks,
Nish
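
The performance effect discussed above can be approximated with a small
user-space model in C. It is a sketch under assumptions and not the igb
driver logic: demo_count_reallocs() is a hypothetical helper, and it
assumes every fresh page comes from a single node, alloc_nid.

```c
/*
 * Count how many of nr_packets force a fresh page because the current
 * page is not on the node the reuse check compares against.
 *
 * compare_nid: node used by the reuse check (numa_node_id()-like or
 *              numa_mem_id()-like, depending on the variant modelled)
 * alloc_nid:   node that actually supplies pages for this CPU
 */
static int demo_count_reallocs(int compare_nid, int alloc_nid, int nr_packets)
{
    int page_nid = alloc_nid; /* initial receive page */
    int reallocs = 0;

    for (int i = 0; i < nr_packets; i++) {
        if (page_nid != compare_nid) {
            /* Page treated as remote: discard it and allocate again. */
            page_nid = alloc_nid;
            reallocs++;
        }
    }
    return reallocs;
}
```

For a CPU on memoryless node 2 whose pages come from node 0, comparing
against node 2 reallocates on every packet, while comparing against node
0 never does. That matches the description of the change as a
performance fix and not a functional one.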

2014-07-23 03:17:02

by Jiang Liu

Subject: Re: [RFC Patch V1 07/30] mm: Use cpu_to_mem()/numa_mem_id() to support memoryless node

Hi Tejun and Christoph,
Thanks for your suggestions and discussion. Tejun makes a
good point about hiding the memoryless node interface from normal
slab users. I will rework the patch set to go in that direction.
Regards!
Gerry

On 2014/7/12 3:11, Christoph Lameter wrote:
> On Fri, 11 Jul 2014, Tejun Heo wrote:
>
>> On Fri, Jul 11, 2014 at 12:29:30PM -0500, Christoph Lameter wrote:
>>> GFP_THISNODE is mostly used by allocators that need memory from specific
>>> nodes. The use of numa_mem_id() there is useful because one will not
>>> get any memory at all when attempting to allocate from a memoryless
>>> node using GFP_THISNODE.
>>
>> As long as it's in allocator proper, it doesn't matter all that much
>> but the changes are clearly not contained, are they?
>
> Well there is a proliferation of memory allocators recently. NUMA is often
> a second thought in those.
>

2014-07-23 03:18:24

by Jiang Liu

Subject: Re: [RFC Patch V1 09/30] mm, memcg: Use cpu_to_mem()/numa_mem_id() to support memoryless node

Hi Michal,
Thanks for your comments! As discussed, we will
rework the patch set in another direction to hide memoryless
node from normal slab users.
Regards!
Gerry

On 2014/7/18 15:36, Michal Hocko wrote:
> On Fri 11-07-14 15:37:26, Jiang Liu wrote:
>> When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
>> may return a node without memory, and later cause system failure/panic
>> when calling kmalloc_node() and friends with returned node id.
>> So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
>> memory for the/current cpu.
>>
>> If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
>> is the same as cpu_to_node()/numa_node_id().
>
> The change makes a difference only for really tiny memcgs. If we really
> have all pages on unevictable list or anon with no swap allowed and that
> is the reason why no node is set in scan_nodes mask then reclaiming
> memoryless node or any arbitrary close one doesn't make any difference.
> The current memcg might not have any memory on that node at all.
>
> So the change doesn't make any practical difference and the changelog is
> misleading.
>
>> Signed-off-by: Jiang Liu <[email protected]>
>> ---
>> mm/memcontrol.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index a2c7bcb0e6eb..d6c4b7255ca9 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1933,7 +1933,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
>> * we use curret node.
>> */
>> if (unlikely(node == MAX_NUMNODES))
>> - node = numa_node_id();
>> + node = numa_mem_id();
>>
>> memcg->last_scanned_node = node;
>> return node;
>> --
>> 1.7.10.4
>>
>

2014-07-23 03:21:01

by Jiang Liu

Subject: Re: [RFC Patch V1 15/30] mm, igb: Use cpu_to_mem()/numa_mem_id() to support memoryless node

Hi Nishanth and Alexander,
Thanks for the review; I will update the comments
in the next version.
Regards!
Gerry

On 2014/7/22 5:09, Nishanth Aravamudan wrote:
> On 21.07.2014 [12:53:33 -0700], Alexander Duyck wrote:
>> I do agree the description should probably be changed. There shouldn't be
>> any panics involved, only a performance impact as it will be reallocating
>> always if it is on a node with no memory.
>
> Yep, thanks for the review.
>
>> My intention on this was to make certain that the memory used is from the
>> closest node possible. As such I believe this change likely honours that.
>
> Absolutely, just wanted to make it explicit that it's not a functional
> fix, just a performance fix (presuming this shows up at all on systems
> that have memoryless NUMA nodes).
>
> I'd suggest an update to the comments, as well.
>
> Thanks,
> Nish
>
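
The point above, that this is a performance fix rather than a functional one, can be illustrated with a small userspace model. It assumes the driver decides whether to recycle a receive page by comparing the page's node with the CPU's node; the struct and function names are hypothetical and are not the igb driver's own.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU view: the CPU's own node and its nearest node with memory. */
struct cpu_numa_view {
	int node;       /* what numa_node_id() would report */
	int mem_node;   /* what numa_mem_id() would report */
};

/* Reuse test comparing the page's node against the CPU's own node. */
static bool can_reuse_by_node(const struct cpu_numa_view *v, int page_nid)
{
	assert(v);
	return page_nid == v->node;
}

/* Reuse test comparing the page's node against the nearest node with memory. */
static bool can_reuse_by_mem(const struct cpu_numa_view *v, int page_nid)
{
	assert(v);
	return page_nid == v->mem_node;
}
```

For a CPU on memoryless node 2 whose nearest memory is on node 1, no page can ever live on node 2, so the first test never succeeds and every buffer is reallocated. The second test accepts pages from node 1. Nothing crashes in either case.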

2014-07-23 03:48:11

by Jiang Liu

Subject: Re: [RFC Patch V1 21/30] mm, irqchip: Use cpu_to_mem()/numa_mem_id() to support memoryless node

Hi Jason,
Thanks for your review. According to review comments,
we need to rework the patch set in another direction and will
give up this patch.
Regards!
Gerry

On 2014/7/18 20:40, Jason Cooper wrote:
> On Fri, Jul 11, 2014 at 03:37:38PM +0800, Jiang Liu wrote:
>> When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
>> may return a node without memory, and later cause system failure/panic
>> when calling kmalloc_node() and friends with returned node id.
>> So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
>> memory for the/current cpu.
>>
>> If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
>> is the same as cpu_to_node()/numa_node_id().
>>
>> Signed-off-by: Jiang Liu <[email protected]>
>> ---
>> drivers/irqchip/irq-clps711x.c | 2 +-
>> drivers/irqchip/irq-gic.c | 2 +-
>> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> Do you have anything depending on this? Can I apply it to irqchip? If
> you need to keep it with other changes,
>
> Acked-by: Jason Cooper <[email protected]>
>
> But please do let me know if I can take it.
>
> thx,
>
> Jason.
>

2014-07-23 08:16:23

by Jiang Liu

Subject: Re: [RFC Patch V1 28/30] mm: Update _mem_id_[] for every possible CPU when memory configuration changes



On 2014/7/22 1:47, Nishanth Aravamudan wrote:
> On 11.07.2014 [15:37:45 +0800], Jiang Liu wrote:
>> Current kernel only updates _mem_id_[cpu] for onlined CPUs when memory
>> configuration changes. So kernel may allocate memory from remote node
>> for a CPU if the CPU is still in absent or offline state even if the
>> node associated with the CPU has already been onlined.
>
> This just sounds like the topology information is being updated at the
> wrong place/time? That is, the memory is online, the CPU is being
> brought online, but isn't associated with any node?
Hi Nishanth,
Yes, that's the case.

>
>> This patch tries to improve performance by updating _mem_id_[cpu] for
>> each possible CPU when memory configuration changes, thus kernel could
>> always allocate from local node once the node is onlined.
>
> Ok, what is the impact? Do you actually see better performance?
There is no real data to support this yet, only code analysis.
Regards!
Gerry
>
>> We check node_online(cpu_to_node(cpu)) because:
>> 1) local_memory_node(nid) needs to access NODE_DATA(nid)
>> 2) try_offline_node(nid) just zeroes out NODE_DATA(nid) instead of freeing it
>>
>> Signed-off-by: Jiang Liu <[email protected]>
>> ---
>> mm/page_alloc.c | 10 +++++-----
>> 1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 0ea758b898fd..de86e941ed57 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3844,13 +3844,13 @@ static int __build_all_zonelists(void *data)
>> /*
>> * We now know the "local memory node" for each node--
>> * i.e., the node of the first zone in the generic zonelist.
>> - * Set up numa_mem percpu variable for on-line cpus. During
>> - * boot, only the boot cpu should be on-line; we'll init the
>> - * secondary cpus' numa_mem as they come on-line. During
>> - * node/memory hotplug, we'll fixup all on-line cpus.
>> + * Set up numa_mem percpu variable for all possible cpus
>> + * if associated node has been onlined.
>> */
>> - if (cpu_online(cpu))
>> + if (node_online(cpu_to_node(cpu)))
>> set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
>> + else
>> + set_cpu_numa_mem(cpu, NUMA_NO_NODE);
>> #endif
>
>
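
The patched loop quoted above can be modelled in userspace as follows. This is only a sketch: the arrays, the `_model` helper and the `NO_NODE` constant are stand-ins invented for illustration, and an online node is assumed to be its own nearest memory node.

```c
#include <assert.h>

#define NR_CPUS_MODEL   4
#define NR_NODES_MODEL  2
#define NO_NODE         (-1)   /* stands in for NUMA_NO_NODE */

/* Hypothetical state: the node of each possible CPU, and each node's status. */
static const int cpu_node[NR_CPUS_MODEL] = { 0, 0, 1, 1 };
static int node_is_online[NR_NODES_MODEL] = { 1, 0 };
static int cpu_mem_node[NR_CPUS_MODEL];

/* Simplification: an online node is treated as its own nearest memory node. */
static int local_memory_node_model(int nid)
{
	return nid;
}

/* Mirrors the patched loop: walk all possible CPUs, not only online ones. */
static void update_mem_ids(void)
{
	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
		int nid = cpu_node[cpu];

		assert(nid >= 0 && nid < NR_NODES_MODEL);
		if (node_is_online[nid])
			cpu_mem_node[cpu] = local_memory_node_model(nid);
		else
			cpu_mem_node[cpu] = NO_NODE;
	}
}
```

Calling update_mem_ids() again after setting node_is_online[1] gives CPUs 2 and 3 a valid memory node, even though the model never marks them online.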

2014-07-23 08:21:32

by Jiang Liu

Subject: Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms



On 2014/7/22 1:57, Nishanth Aravamudan wrote:
> On 21.07.2014 [10:41:59 -0700], Tony Luck wrote:
>> On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
>> <[email protected]> wrote:
>>> It seems like the issue is the order of onlining of resources on a
>>> specific x86 platform?
>>
>> Yes. When we online a node the BIOS hits us with some ACPI hotplug events:
>>
>> First: Here are some new cpus
>
> Ok, so during this period, you might get some remote allocations. Do you
> know the topology of these CPUs? That is they belong to a
> (soon-to-exist) NUMA node? Can you online that currently offline NUMA
> node at this point (so that NODE_DATA() resolves, etc.)?
Hi Nishanth,
We have a method to get the NUMA information about the CPU, and
patch "[RFC Patch V1 30/30] x86, NUMA: Online node earlier when doing
CPU hot-addition" tries to solve this issue by onlining the NUMA node
as early as possible. We are in fact trying to enable memoryless nodes,
as you have suggested.

Regards!
Gerry

>
>> Next: Here is some new memory
>
> And then update the NUMA topology at this point? That is,
> set_cpu_numa_node/mem as appropriate so the underlying allocators do the
> right thing?
>
>> Last: Here are some new I/O things (PCIe root ports, PCIe devices,
>> IOAPICs, IOMMUs, ...)
>>
>> So there is a period where the node is memoryless - although that will
>> generally be resolved when the memory hot plug event arrives ... that
>> isn't guaranteed to occur (there might not be any memory on the node,
>> or what memory there is may have failed self-test and been disabled).
>
> Right, but the allocator(s) generally does the right thing already in
> the face of memoryless nodes -- they fallback to the nearest node. That
> leads to poor performance, but is functional. Based upon the previous
> thread Jiang pointed to, it seems like the real issue here isn't that
> the node is memoryless, but that it's not even online yet? So NODE_DATA
> access crashes?
>
> Thanks,
> Nish
>
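
The event ordering described above (CPUs first, then memory, then I/O) can be written as a tiny state model. It is a sketch under stated assumptions: the enum names and the online_early flag are hypothetical, with the flag standing for the idea of onlining the node as soon as its CPUs are announced.

```c
#include <assert.h>

enum hotplug_event { EV_CPUS, EV_MEMORY, EV_IO };
enum node_state { NODE_OFFLINE, NODE_MEMORYLESS, NODE_HAS_MEMORY };

/*
 * Hypothetical model of one node's state after a hotplug event.
 * online_early models onlining the node when its CPUs show up.
 */
static enum node_state next_state(enum node_state cur, enum hotplug_event ev,
				  int online_early)
{
	assert(cur <= NODE_HAS_MEMORY);

	switch (ev) {
	case EV_CPUS:
		if (cur == NODE_OFFLINE && online_early)
			return NODE_MEMORYLESS;
		return cur;
	case EV_MEMORY:
		return NODE_HAS_MEMORY;   /* memory hot-add onlines the node */
	case EV_IO:
	default:
		return cur;               /* I/O events do not change the node */
	}
}
```

Without the flag, the node stays offline from the CPU event until memory arrives, which is the window in which a NODE_DATA() access is unsafe. With it, the node is online but memoryless during that window, and it stays that way if no memory ever arrives.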

2014-07-24 23:26:22

by Nishanth Aravamudan

Subject: Re: [RFC Patch V1 29/30] mm, x86: Enable memoryless node support to better support CPU/memory hotplug

On 11.07.2014 [15:37:46 +0800], Jiang Liu wrote:
> With current implementation, all CPUs within a NUMA node will be
> associated with another NUMA node if the node has no memory installed.

<snip>

> ---
> arch/x86/Kconfig | 3 +++
> arch/x86/kernel/acpi/boot.c | 5 ++++-
> arch/x86/kernel/smpboot.c | 2 ++
> arch/x86/mm/numa.c | 42 +++++++++++++++++++++++++++++++++++-------
> 4 files changed, 44 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index a8f749ef0fdc..f35b25b88625 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1887,6 +1887,9 @@ config USE_PERCPU_NUMA_NODE_ID
> def_bool y
> depends on NUMA
>
> +config HAVE_MEMORYLESS_NODES
> + def_bool NUMA
> +
> config ARCH_ENABLE_SPLIT_PMD_PTLOCK
> def_bool y
> depends on X86_64 || X86_PAE
> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
> index 86281ffb96d6..3b5641703a49 100644
> --- a/arch/x86/kernel/acpi/boot.c
> +++ b/arch/x86/kernel/acpi/boot.c
> @@ -612,6 +612,8 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
> if (nid != -1) {
> set_apicid_to_node(physid, nid);
> numa_set_node(cpu, nid);
> + if (node_online(nid))
> + set_cpu_numa_mem(cpu, local_memory_node(nid));

How common is it for this method to be called for a CPU on an offline
node? Aren't you fixing this in the next patch (so maybe the order
should be changed?)?

> }
> #endif
> }
> @@ -644,9 +646,10 @@ int acpi_unmap_lsapic(int cpu)
> {
> #ifdef CONFIG_ACPI_NUMA
> set_apicid_to_node(per_cpu(x86_cpu_to_apicid, cpu), NUMA_NO_NODE);
> + set_cpu_numa_mem(cpu, NUMA_NO_NODE);
> #endif
>
> - per_cpu(x86_cpu_to_apicid, cpu) = -1;
> + per_cpu(x86_cpu_to_apicid, cpu) = BAD_APICID;

I think this is an unrelated change?

> set_cpu_present(cpu, false);
> num_processors--;
>
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 5492798930ef..4a5437989ffe 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -162,6 +162,8 @@ static void smp_callin(void)
> __func__, cpuid);
> }
>
> + set_numa_mem(local_memory_node(cpu_to_node(cpuid)));
> +

Note that you might hit the same issue I reported on powerpc, if
smp_callin() is part of smp_init(). The waitqueue initialization code
depends on cpu_to_node() [and eventually cpu_to_mem()] to be initialized
quite early.

Thanks,
Nish

2014-07-24 23:30:44

by Nishanth Aravamudan

Subject: Re: [RFC Patch V1 30/30] x86, NUMA: Online node earlier when doing CPU hot-addition

On 11.07.2014 [15:37:47 +0800], Jiang Liu wrote:
> With typical CPU hot-addition flow on x86, PCI host bridges embedded
> in physical processor are always associated with NUMA_NO_NODE, which
> may cause sub-optimal performance.
> 1) Handle CPU hot-addition notification
> acpi_processor_add()
> acpi_processor_get_info()
> acpi_processor_hotadd_init()
> acpi_map_lsapic()
> 1.a) acpi_map_cpu2node()
>
> 2) Handle PCI host bridge hot-addition notification
> acpi_pci_root_add()
> pci_acpi_scan_root()
> 2.a) if (node != NUMA_NO_NODE && !node_online(node)) node = NUMA_NO_NODE;
>
> 3) Handle memory hot-addition notification
> acpi_memory_device_add()
> acpi_memory_enable_device()
> add_memory()
> 3.a) node_set_online();
>
> 4) Online CPUs through sysfs interfaces
> cpu_subsys_online()
> cpu_up()
> try_online_node()
> 4.a) node_set_online();
>
> So associated node is always in offline state because it is not onlined
> until step 3.a or 4.a.
>
> We could improve performance by onlining the node at step 1.a. This change
> also makes the code symmetric. Nodes are always created when handling
> CPU/memory hot-addition events instead of handling user requests from
> sysfs interfaces, and are destroyed when handling CPU/memory hot-removal
> events.

It seems like this patch has little to nothing to do with the rest of
the series and can be sent on its own?

> It also closes a race window caused by kmalloc_node(cpu_to_node(cpu)),

To be clear, the race is that on some x86 platforms, there is a period
of time where a node ID returned by cpu_to_node() is offline.

<snip>

> Signed-off-by: Jiang Liu <[email protected]>
> ---
> arch/x86/kernel/acpi/boot.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
> index 3b5641703a49..00c2ed507460 100644
> --- a/arch/x86/kernel/acpi/boot.c
> +++ b/arch/x86/kernel/acpi/boot.c
> @@ -611,6 +611,7 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
> nid = acpi_get_node(handle);
> if (nid != -1) {
> set_apicid_to_node(physid, nid);
> + try_online_node(nid);

try_online_node() seems like it can fail? I assume it's a pretty rare
case, but should the return code be checked?

If it does fail, it seems like there are pretty serious problems and we
shouldn't be onlining this CPU, etc.?

> numa_set_node(cpu, nid);
> if (node_online(nid))
> set_cpu_numa_mem(cpu, local_memory_node(nid));

Which means you can remove this check presuming try_online_node()
returned 0.

Thanks,
Nish
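
The two review comments above (check the return code, then drop the node_online() test) could look roughly like the userspace sketch below. The stub follows the contract of try_online_node(), which returns 0 on success and a negative errno on failure; everything else, including the int return type of the mapping function and the fail_online knob, is a hypothetical stand-in. The sketch also assumes the freshly onlined node is its own nearest memory node.

```c
#include <assert.h>
#include <errno.h>

#define NO_NODE (-1)            /* stands in for NUMA_NO_NODE */

static int node_is_online[4];
static int cpu_node[8];
static int cpu_mem_node[8];
static int fail_online;         /* knob: make the stub below fail */

/* Stub with the same contract as try_online_node(): 0 or -errno. */
static int try_online_node_stub(int nid)
{
	if (fail_online)
		return -ENOMEM;
	node_is_online[nid] = 1;
	return 0;
}

/* Map a CPU to its node, propagating a failure to online the node. */
static int map_cpu_to_node(int cpu, int nid)
{
	int ret;

	assert(cpu >= 0 && cpu < 8);
	if (nid == NO_NODE)
		return 0;               /* no affinity information: nothing to do */
	assert(nid >= 0 && nid < 4);

	ret = try_online_node_stub(nid);
	if (ret)
		return ret;             /* caller should abort the CPU hot-add */

	cpu_node[cpu] = nid;
	cpu_mem_node[cpu] = nid;        /* node is known online: no extra check */
	return 0;
}
```

In the quoted patch acpi_map_cpu2node() returns void, so propagating the error this way would also mean changing its callers.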

2014-07-24 23:32:44

by Nishanth Aravamudan

Subject: Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

On 23.07.2014 [16:20:24 +0800], Jiang Liu wrote:
>
>
> On 2014/7/22 1:57, Nishanth Aravamudan wrote:
> > On 21.07.2014 [10:41:59 -0700], Tony Luck wrote:
> >> On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
> >> <[email protected]> wrote:
> >>> It seems like the issue is the order of onlining of resources on a
> >>> specific x86 platform?
> >>
> >> Yes. When we online a node the BIOS hits us with some ACPI hotplug events:
> >>
> >> First: Here are some new cpus
> >
> > Ok, so during this period, you might get some remote allocations. Do you
> > know the topology of these CPUs? That is they belong to a
> > (soon-to-exist) NUMA node? Can you online that currently offline NUMA
> > node at this point (so that NODE_DATA() resolves, etc.)?
> Hi Nishanth,
> We have a method to get the NUMA information about the CPU, and
> patch "[RFC Patch V1 30/30] x86, NUMA: Online node earlier when doing
> CPU hot-addition" tries to solve this issue by onlining the NUMA node
> as early as possible. We are in fact trying to enable memoryless nodes,
> as you have suggested.

Ok, it seems like you have two sets of patches then? One is to fix the
NUMA information timing (30/30 only). The rest of the patches are
general discussions about where cpu_to_mem() might be used instead of
cpu_to_node(). However, based upon Tejun's feedback, it seems like
rather than force all callers to use cpu_to_mem(), we should be looking
at the core VM to ensure fallback is occurring appropriately when
memoryless nodes are present.

Do you have a specific situation, once you've applied 30/30, where
kmalloc_node() leads to an Oops?

Thanks,
Nish

2014-07-25 01:41:55

by Jiang Liu

Subject: Re: [RFC Patch V1 29/30] mm, x86: Enable memoryless node support to better support CPU/memory hotplug



On 2014/7/25 7:26, Nishanth Aravamudan wrote:
> On 11.07.2014 [15:37:46 +0800], Jiang Liu wrote:
>> With current implementation, all CPUs within a NUMA node will be
>> associated with another NUMA node if the node has no memory installed.
>
> <snip>
>
>> ---
>> arch/x86/Kconfig | 3 +++
>> arch/x86/kernel/acpi/boot.c | 5 ++++-
>> arch/x86/kernel/smpboot.c | 2 ++
>> arch/x86/mm/numa.c | 42 +++++++++++++++++++++++++++++++++++-------
>> 4 files changed, 44 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index a8f749ef0fdc..f35b25b88625 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -1887,6 +1887,9 @@ config USE_PERCPU_NUMA_NODE_ID
>> def_bool y
>> depends on NUMA
>>
>> +config HAVE_MEMORYLESS_NODES
>> + def_bool NUMA
>> +
>> config ARCH_ENABLE_SPLIT_PMD_PTLOCK
>> def_bool y
>> depends on X86_64 || X86_PAE
>> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
>> index 86281ffb96d6..3b5641703a49 100644
>> --- a/arch/x86/kernel/acpi/boot.c
>> +++ b/arch/x86/kernel/acpi/boot.c
>> @@ -612,6 +612,8 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
>> if (nid != -1) {
>> set_apicid_to_node(physid, nid);
>> numa_set_node(cpu, nid);
>> + if (node_online(nid))
>> + set_cpu_numa_mem(cpu, local_memory_node(nid));
>
> How common is it for this method to be called for a CPU on an offline
> node? Aren't you fixing this in the next patch (so maybe the order
> should be changed?)?
Hi Nishanth,
With physical CPU hot-addition, as opposed to logical CPU onlining through
sysfs, the node is always in the offline state at this point.
In v2, I have reordered the patch set so that patch 30 goes first.

>
>> }
>> #endif
>> }
>> @@ -644,9 +646,10 @@ int acpi_unmap_lsapic(int cpu)
>> {
>> #ifdef CONFIG_ACPI_NUMA
>> set_apicid_to_node(per_cpu(x86_cpu_to_apicid, cpu), NUMA_NO_NODE);
>> + set_cpu_numa_mem(cpu, NUMA_NO_NODE);
>> #endif
>>
>> - per_cpu(x86_cpu_to_apicid, cpu) = -1;
>> + per_cpu(x86_cpu_to_apicid, cpu) = BAD_APICID;
>
> I think this is an unrelated change?
Thanks for the reminder; it's unrelated to memoryless node support.

>
>> set_cpu_present(cpu, false);
>> num_processors--;
>>
>> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
>> index 5492798930ef..4a5437989ffe 100644
>> --- a/arch/x86/kernel/smpboot.c
>> +++ b/arch/x86/kernel/smpboot.c
>> @@ -162,6 +162,8 @@ static void smp_callin(void)
>> __func__, cpuid);
>> }
>>
>> + set_numa_mem(local_memory_node(cpu_to_node(cpuid)));
>> +
>
> Note that you might hit the same issue I reported on powerpc, if
> smp_callin() is part of smp_init(). The waitqueue initialization code
> depends on cpu_to_node() [and eventually cpu_to_mem()] to be initialized
> quite early.
Thanks for the reminder. Patches 29 and 30 together will set up the
cpu_to_mem() array when enumerating CPUs for hot-add events, so it should
be ready for use when onlining those CPUs.

Regards!
Gerry
>
> Thanks,
> Nish
>

2014-07-25 01:43:27

by Jiang Liu

Subject: Re: [RFC Patch V1 30/30] x86, NUMA: Online node earlier when doing CPU hot-addition



On 2014/7/25 7:30, Nishanth Aravamudan wrote:
> On 11.07.2014 [15:37:47 +0800], Jiang Liu wrote:
>> With typical CPU hot-addition flow on x86, PCI host bridges embedded
>> in physical processor are always associated with NUMA_NO_NODE, which
>> may cause sub-optimal performance.
>> 1) Handle CPU hot-addition notification
>> acpi_processor_add()
>> acpi_processor_get_info()
>> acpi_processor_hotadd_init()
>> acpi_map_lsapic()
>> 1.a) acpi_map_cpu2node()
>>
>> 2) Handle PCI host bridge hot-addition notification
>> acpi_pci_root_add()
>> pci_acpi_scan_root()
>> 2.a) if (node != NUMA_NO_NODE && !node_online(node)) node = NUMA_NO_NODE;
>>
>> 3) Handle memory hot-addition notification
>> acpi_memory_device_add()
>> acpi_memory_enable_device()
>> add_memory()
>> 3.a) node_set_online();
>>
>> 4) Online CPUs through sysfs interfaces
>> cpu_subsys_online()
>> cpu_up()
>> try_online_node()
>> 4.a) node_set_online();
>>
>> So associated node is always in offline state because it is not onlined
>> until step 3.a or 4.a.
>>
>> We could improve performance by onlining the node at step 1.a. This change
>> also makes the code symmetric. Nodes are always created when handling
>> CPU/memory hot-addition events instead of handling user requests from
>> sysfs interfaces, and are destroyed when handling CPU/memory hot-removal
>> events.
>
> It seems like this patch has little to nothing to do with the rest of
> the series and can be sent on its own?
>
>> It also closes a race window caused by kmalloc_node(cpu_to_node(cpu)),
>
> To be clear, the race is that on some x86 platforms, there is a period
> of time where a node ID returned by cpu_to_node() is offline.
>
> <snip>
>
>> Signed-off-by: Jiang Liu <[email protected]>
>> ---
>> arch/x86/kernel/acpi/boot.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
>> index 3b5641703a49..00c2ed507460 100644
>> --- a/arch/x86/kernel/acpi/boot.c
>> +++ b/arch/x86/kernel/acpi/boot.c
>> @@ -611,6 +611,7 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
>> nid = acpi_get_node(handle);
>> if (nid != -1) {
>> set_apicid_to_node(physid, nid);
>> + try_online_node(nid);
>
> try_online_node() seems like it can fail? I assume it's a pretty rare
> case, but should the return code be checked?
Good suggestion, I should split out this patch to fix the crash.

>
> If it does fail, it seems like there are pretty serious problems and we
> shouldn't be onlining this CPU, etc.?
>
>> numa_set_node(cpu, nid);
>> if (node_online(nid))
>> set_cpu_numa_mem(cpu, local_memory_node(nid));
>
> Which means you can remove this check presuming try_online_node()
> returned 0.
Yes, that's true.

>
> Thanks,
> Nish
>

2014-07-25 01:44:51

by Jiang Liu

Subject: Re: [RFC Patch V1 30/30] x86, NUMA: Online node earlier when doing CPU hot-addition



On 2014/7/25 7:30, Nishanth Aravamudan wrote:
> On 11.07.2014 [15:37:47 +0800], Jiang Liu wrote:
>> With typical CPU hot-addition flow on x86, PCI host bridges embedded
>> in physical processor are always associated with NUMA_NO_NODE, which
>> may cause sub-optimal performance.
>> 1) Handle CPU hot-addition notification
>> acpi_processor_add()
>> acpi_processor_get_info()
>> acpi_processor_hotadd_init()
>> acpi_map_lsapic()
>> 1.a) acpi_map_cpu2node()
>>
>> 2) Handle PCI host bridge hot-addition notification
>> acpi_pci_root_add()
>> pci_acpi_scan_root()
>> 2.a) if (node != NUMA_NO_NODE && !node_online(node)) node = NUMA_NO_NODE;
>>
>> 3) Handle memory hot-addition notification
>> acpi_memory_device_add()
>> acpi_memory_enable_device()
>> add_memory()
>> 3.a) node_set_online();
>>
>> 4) Online CPUs through sysfs interfaces
>> cpu_subsys_online()
>> cpu_up()
>> try_online_node()
>> 4.a) node_set_online();
>>
>> So associated node is always in offline state because it is not onlined
>> until step 3.a or 4.a.
>>
>> We could improve performance by onlining the node at step 1.a. This change
>> also makes the code symmetric. Nodes are always created when handling
>> CPU/memory hot-addition events instead of handling user requests from
>> sysfs interfaces, and are destroyed when handling CPU/memory hot-removal
>> events.
>
> It seems like this patch has little to nothing to do with the rest of
> the series and can be sent on its own?
>
>> It also closes a race window caused by kmalloc_node(cpu_to_node(cpu)),
>
> To be clear, the race is that on some x86 platforms, there is a period
> of time where a node ID returned by cpu_to_node() is offline.
>
> <snip>
>
>> Signed-off-by: Jiang Liu <[email protected]>
>> ---
>> arch/x86/kernel/acpi/boot.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
>> index 3b5641703a49..00c2ed507460 100644
>> --- a/arch/x86/kernel/acpi/boot.c
>> +++ b/arch/x86/kernel/acpi/boot.c
>> @@ -611,6 +611,7 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
>> nid = acpi_get_node(handle);
>> if (nid != -1) {
>> set_apicid_to_node(physid, nid);
>> + try_online_node(nid);
>
> try_online_node() seems like it can fail? I assume it's a pretty rare
> case, but should the return code be checked?
>
> If it does fail, it seems like there are pretty serious problems and we
> shouldn't be onlining this CPU, etc.?
>
>> numa_set_node(cpu, nid);
>> if (node_online(nid))
>> set_cpu_numa_mem(cpu, local_memory_node(nid));
>
> Which means you can remove this check presuming try_online_node()
> returned 0.
Good suggestion, will try to enhance the error handling path.

>
> Thanks,
> Nish
>

2014-07-25 01:50:10

by Jiang Liu

Subject: Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms



On 2014/7/25 7:32, Nishanth Aravamudan wrote:
> On 23.07.2014 [16:20:24 +0800], Jiang Liu wrote:
>>
>>
>> On 2014/7/22 1:57, Nishanth Aravamudan wrote:
>>> On 21.07.2014 [10:41:59 -0700], Tony Luck wrote:
>>>> On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
>>>> <[email protected]> wrote:
>>>>> It seems like the issue is the order of onlining of resources on a
>>>>> specific x86 platform?
>>>>
>>>> Yes. When we online a node the BIOS hits us with some ACPI hotplug events:
>>>>
>>>> First: Here are some new cpus
>>>
>>> Ok, so during this period, you might get some remote allocations. Do you
>>> know the topology of these CPUs? That is they belong to a
>>> (soon-to-exist) NUMA node? Can you online that currently offline NUMA
>>> node at this point (so that NODE_DATA() resolves, etc.)?
>> Hi Nishanth,
>> We have a method to get the NUMA information about the CPU, and
>> patch "[RFC Patch V1 30/30] x86, NUMA: Online node earlier when doing
>> CPU hot-addition" tries to solve this issue by onlining the NUMA node
>> as early as possible. We are in fact trying to enable memoryless nodes,
>> as you have suggested.
>
> Ok, it seems like you have two sets of patches then? One is to fix the
> NUMA information timing (30/30 only). The rest of the patches are
> general discussions about where cpu_to_mem() might be used instead of
> cpu_to_node(). However, based upon Tejun's feedback, it seems like
> rather than force all callers to use cpu_to_mem(), we should be looking
> at the core VM to ensure fallback is occurring appropriately when
> memoryless nodes are present.
>
> Do you have a specific situation, once you've applied 30/30, where
> kmalloc_node() leads to an Oops?
Hi Nishanth,
After following the two threads related to support of memoryless
nodes and digging into more code, I realized the first version of my
patch set is overkill. As Tejun has pointed out, we shouldn't expose the
details of memoryless nodes to normal users, but there are still some
special users who need the details. So I have tried to summarize it as:
1) Arch code should online corresponding NUMA node before onlining any
CPU or memory, otherwise it may cause invalid memory access when
accessing NODE_DATA(nid).
2) For normal memory allocations without __GFP_THISNODE setting in the
gfp_flags, we should prefer numa_node_id()/cpu_to_node() instead of
numa_mem_id()/cpu_to_mem() because the latter loses hardware topology
information as pointed out by Tejun:
A - B - X - C - D
Where X is the memless node. numa_mem_id() on X would return
either B or C, right? If B or C can't satisfy the allocation,
the allocator would fallback to A from B and D for C, both of
which aren't optimal. It should first fall back to C or B
respectively, which the allocator can't do anymore because the
information is lost when the caller side performs numa_mem_id().
3) For memory allocation with __GFP_THISNODE setting in gfp_flags,
numa_node_id()/cpu_to_node() should be used if caller only wants to
allocate from local memory, otherwise numa_mem_id()/cpu_to_mem()
should be used if caller wants to allocate from the nearest node.
4) numa_mem_id()/cpu_to_mem() should be used if caller wants to check
whether a page is allocated from the nearest node.

My v2 patch set is based on the above rules.
Any suggestions here?
Regards!
Gerry

>
> Thanks,
> Nish
>
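
Tejun's A - B - X - C - D example from rule 2 can be checked with a small userspace model. It is a sketch only: hop count on the line stands in for the real node distance, ties go to the lower node id, and every name below is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_NODES 5              /* A, B, X, C, D laid out in a line */
enum { A, B, X, C, D };

/* Hypothetical model: X is the only node without memory. */
static const int has_memory[NR_NODES] = { 1, 1, 0, 1, 1 };

/* Hop count on the line topology stands in for the node distance. */
static int node_distance_model(int from, int to)
{
	return abs(from - to);
}

/*
 * Fill order[] with the memory nodes sorted by distance from 'start',
 * nearest first; ties go to the lower node id. Returns the entry count.
 */
static int build_fallback_order(int start, int order[NR_NODES])
{
	int used[NR_NODES] = { 0 };
	int count = 0;

	assert(start >= 0 && start < NR_NODES);
	for (;;) {
		int best = -1;

		for (int n = 0; n < NR_NODES; n++) {
			if (!has_memory[n] || used[n])
				continue;
			if (best < 0 || node_distance_model(start, n) <
					node_distance_model(start, best))
				best = n;
		}
		if (best < 0)
			break;          /* every memory node has been placed */
		used[best] = 1;
		order[count++] = best;
	}
	return count;
}
```

Starting from X, the order is B, C, A, D. If the caller has already collapsed X to B, the order becomes B, A, C, D, so the allocator tries A (two hops from X) before C (one hop from X). That is the topology information lost by calling numa_mem_id() on the caller side.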

2014-07-28 13:30:56

by Grant Likely

Subject: Re: [RFC Patch V1 22/30] mm, of: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On Mon, 21 Jul 2014 10:52:41 -0700, Nishanth Aravamudan <[email protected]> wrote:
> On 11.07.2014 [15:37:39 +0800], Jiang Liu wrote:
> > When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
> > may return a node without memory, and later cause system failure/panic
> > when calling kmalloc_node() and friends with returned node id.
> > So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
> > memory for the/current cpu.
> >
> > If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
> > is the same as cpu_to_node()/numa_node_id().
> >
> > Signed-off-by: Jiang Liu <[email protected]>
> > ---
> > drivers/of/base.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/of/base.c b/drivers/of/base.c
> > index b9864806e9b8..40d4772973ad 100644
> > --- a/drivers/of/base.c
> > +++ b/drivers/of/base.c
> > @@ -85,7 +85,7 @@ EXPORT_SYMBOL(of_n_size_cells);
> > #ifdef CONFIG_NUMA
> > int __weak of_node_to_nid(struct device_node *np)
> > {
> > - return numa_node_id();
> > + return numa_mem_id();
> > }
> > #endif
>
> Um, NAK. of_node_to_nid() returns the NUMA node ID for a given device
> tree node. The default should be the physically local NUMA node, not the
> nearest memory-containing node.

That description doesn't match the code. This patch only changes the
default implementation of of_node_to_nid() which doesn't take the device
node into account *at all* when returning a node ID. Just look at the
diff.

I think this patch is correct, and it doesn't affect the override
versions provided by powerpc and sparc.

g.

>
> I think the general direction of this patchset is good -- what NUMA
> information do we actually care about at each callsite. But the execution
> is blind and doesn't consider at all what the code is actually doing.
> The changelogs are all identical and don't actually provide any
> information about what errors this (or any) specific patch are
> resolving.
>
> Thanks,
> Nish
>

2014-07-28 19:26:26

by Nishanth Aravamudan

Subject: Re: [RFC Patch V1 22/30] mm, of: Use cpu_to_mem()/numa_mem_id() to support memoryless node

On 28.07.2014 [07:30:40 -0600], Grant Likely wrote:
> On Mon, 21 Jul 2014 10:52:41 -0700, Nishanth Aravamudan <[email protected]> wrote:
> > On 11.07.2014 [15:37:39 +0800], Jiang Liu wrote:
> > > When CONFIG_HAVE_MEMORYLESS_NODES is enabled, cpu_to_node()/numa_node_id()
> > > may return a node without memory, and later cause system failure/panic
> > > when calling kmalloc_node() and friends with returned node id.
> > > So use cpu_to_mem()/numa_mem_id() instead to get the nearest node with
> > > memory for the/current cpu.
> > >
> > > If CONFIG_HAVE_MEMORYLESS_NODES is disabled, cpu_to_mem()/numa_mem_id()
> > > is the same as cpu_to_node()/numa_node_id().
> > >
> > > Signed-off-by: Jiang Liu <[email protected]>
> > > ---
> > > drivers/of/base.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/of/base.c b/drivers/of/base.c
> > > index b9864806e9b8..40d4772973ad 100644
> > > --- a/drivers/of/base.c
> > > +++ b/drivers/of/base.c
> > > @@ -85,7 +85,7 @@ EXPORT_SYMBOL(of_n_size_cells);
> > > #ifdef CONFIG_NUMA
> > > int __weak of_node_to_nid(struct device_node *np)
> > > {
> > > - return numa_node_id();
> > > + return numa_mem_id();
> > > }
> > > #endif
> >
> > Um, NAK. of_node_to_nid() returns the NUMA node ID for a given device
> > tree node. The default should be the physically local NUMA node, not the
> > nearest memory-containing node.
>
> That description doesn't match the code. This patch only changes the
> default implementation of of_node_to_nid() which doesn't take the device
> node into account *at all* when returning a node ID. Just look at the
> diff.

I meant that of_node_to_nid() seems to be used throughout the call-sites
to indicate caller locality. We want to keep using cpu_to_node() there,
and fallback appropriately in the MM (when allocations occur offnode due
to memoryless nodes), not indicate memory-specific topology in the caller
itself. There was a long thread between Tejun and me that
discussed what we are trying for: https://lkml.org/lkml/2014/7/18/278

I understand that the code unconditionally returns current's NUMA node
ID right now (ignoring the device node). That seems correct, to me, for
something like:

of_device_add:
/* device_add will assume that this device is on the same node as
* the parent. If there is no parent defined, set the node
* explicitly */
if (!ofdev->dev.parent)
set_dev_node(&ofdev->dev, of_node_to_nid(ofdev->dev.of_node));

I don't think we want the default implementation to set the NUMA node of
a dev to the nearest NUMA node with memory?

> I think this patch is correct, and it doesn't affect the override
> versions provided by powerpc and sparc.

Yes, agreed, so maybe it doesn't matter. I guess my point was simply
that it only seems reasonable to change callers of cpu_to_node() that
aren't in the core MM to cpu_to_mem() if they care about memoryless
nodes explicitly. I don't think the OF code does, so I don't think it
should change.

Sorry for my premature NAK and lack of clarity in my explanation.

-Nish