2015-07-07 09:30:12

by Tang Chen

Subject: [PATCH 0/5] Make cpuid <-> nodeid mapping persistent.

[Problem]

The cpuid <-> nodeid mapping is first established at boot time, and the workqueue
subsystem caches it in wq_numa_possible_cpumask in wq_numa_init().

When a node is onlined/offlined, cpuid <-> nodeid mappings are established/destroyed.
That means the cpuid <-> nodeid mapping will change whenever node hotplug happens.
But workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid mapping in the beginning:

Node | CPU
------------------------
node 0 | 0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

If we hot-remove node2 and node3, it becomes:

Node | CPU
------------------------
node 0 | 0-14, 60-74
node 1 | 15-29, 75-89

And if we then hot-add node4 and node5, it becomes:

Node | CPU
------------------------
node 0 | 0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and so on for
the other remapped cpus.

When an unbound worker pool is initialized, if its cpumask fits inside one node's
cpumask, pool->node is set to that node, and memory used by this pool will also
be allocated on that node:

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
{
    ...
    /* if cpumask is contained inside a NUMA node, we belong to that node */
    if (wq_numa_enabled) {
        for_each_node(node) {
            if (cpumask_subset(pool->attrs->cpumask,
                               wq_numa_possible_cpumask[node])) {
                pool->node = node;
                break;
            }
        }
    }
    ...
}

Since wq_numa_possible_cpumask is not updated, pool->node can be set to an
offline node, which will lead to memory allocation failures like this:

SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
node 0: slabs: 6172, objs: 259224, free: 245741
node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
  |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
    struct worker *worker;

    worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, using the stale node.

    ......

    return worker;
}

[Solution]

To fix this problem, we establish the cpuid <-> nodeid mapping for all possible
cpus at boot time, and keep it unchanged afterwards. According to init_cpu_to_node(),
the cpuid <-> nodeid mapping is built from the apicid <-> nodeid and cpuid <-> apicid
mappings. So the key point is obtaining all cpus' apicids.

An apicid can be obtained via the _MAT (Multiple APIC Table Entry) method or read
from the MADT (Multiple APIC Description Table). So we do the job in the following steps:

1. Enable the apic registration flow to handle both enabled and disabled cpus.
This is done by introducing an extra parameter to generic_processor_info() to let
the caller control whether disabled cpus are ignored.

2. Introduce a new array storing the cpuid <-> apicid mapping for all possible cpus,
and modify the way cpuids are allocated. Establish the cpuid <-> apicid mapping for
all possible cpus when registering the local apics, and store it in the new array.

3. Enable the _MAT and MADT related APIs to return non-present or disabled cpus'
apicids. This is also done by introducing an extra parameter to these APIs to let
the caller control whether disabled cpus are ignored.

4. Establish the cpuid <-> nodeid mapping for all possible cpus.
This is done via an additional ACPI namespace walk for processors.
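
The net effect is that, once an apicid has been assigned a cpuid, a later hot-add
of the same processor reuses it (and hence its nodeid). A minimal sketch of the
intended reuse, assuming the apicid_to_cpuid[] array introduced in patch 3
(illustrative only, not the exact code):

    /* Sketch: return a previously assigned cpuid, or -1 if never seen. */
    static int cpuid_of(int apicid)
    {
        int i;

        for (i = 0; i < nr_cpu_ids; i++)
            if (apicid_to_cpuid[i] == apicid)
                return i;

        return -1;
    }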


For previous discussion, please refer to:
https://lkml.org/lkml/2015/2/27/145
https://lkml.org/lkml/2015/3/25/989
https://lkml.org/lkml/2015/5/14/244


Gu Zheng (5):
x86, gfp: Cache best near node for memory allocation.
x86, acpi, cpu-hotplug: Enable acpi to register all possible cpus at
boot time.
x86, acpi, cpu-hotplug: Introduce apicid_to_cpuid[] array to store
persistent cpuid <-> apicid mapping.
x86, acpi, cpu-hotplug: Enable MADT APIs to return disabled apicid.
x86, acpi, cpu-hotplug: Set persistent cpuid <-> nodeid mapping when
booting.

arch/ia64/kernel/acpi.c | 2 +-
arch/x86/include/asm/mpspec.h | 1 +
arch/x86/include/asm/topology.h | 2 +
arch/x86/kernel/acpi/boot.c | 8 +--
arch/x86/kernel/apic/apic.c | 71 ++++++++++++++++++++---
arch/x86/mm/numa.c | 57 ++++++++++++-------
drivers/acpi/acpi_processor.c | 5 +-
drivers/acpi/bus.c | 3 +
drivers/acpi/processor_core.c | 122 +++++++++++++++++++++++++++++++++-------
include/linux/acpi.h | 2 +
include/linux/gfp.h | 12 +++-
11 files changed, 227 insertions(+), 58 deletions(-)

--
1.9.3


2015-07-07 09:31:18

by Tang Chen

Subject: [PATCH 1/5] x86, gfp: Cache best near node for memory allocation.

From: Gu Zheng <[email protected]>

In the current code, init_cpu_to_node() maps each possible cpu to the best near
online node if the node the cpu resides in is offline:

init_cpu_to_node()
{
    ......
    for_each_possible_cpu(cpu) {
        ......
        if (!node_online(node))
            node = find_near_online_node(node);
        numa_set_node(cpu, node);
    }
}

Why doing this is to prevent memory allocation failure if the cpu is
online but there is no memory on that node.

But since cpuid <-> nodeid mapping will fix after this patch-set, doing
so in the initialization phase makes no sense any more. The best near online
node for each cpu should be cached somewhere.

In this patch, a per-cpu cache named x86_cpu_to_near_online_node is introduced
to store this info, and alloc_pages_node() and alloc_pages_exact_node() are made
to fall back to it when the requested node is offline.
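
For example, with the hot-remove scenario from the cover letter, an allocation
that names the now-offline node2 (illustrative call, not taken from the patch):

    page = alloc_pages_node(2, GFP_KERNEL, 0);

would fall back to the cached near online node instead of failing.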


Signed-off-by: Gu Zheng <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/include/asm/topology.h | 2 ++
arch/x86/mm/numa.c | 57 ++++++++++++++++++++++++++---------------
include/linux/gfp.h | 12 ++++++++-
3 files changed, 50 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 0fb4648..e3e22b2 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -82,6 +82,8 @@ static inline const struct cpumask *cpumask_of_node(int node)
}
#endif

+extern int get_near_online_node(int node);
+
extern void setup_node_to_cpumask_map(void);

/*
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4053bb5..13bd0d7 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -69,6 +69,7 @@ int numa_cpu_node(int cpu)
return NUMA_NO_NODE;
}

+cpumask_t node_to_cpuid_mask_map[MAX_NUMNODES];
cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
EXPORT_SYMBOL(node_to_cpumask_map);

@@ -78,6 +79,31 @@ EXPORT_SYMBOL(node_to_cpumask_map);
DEFINE_EARLY_PER_CPU(int, x86_cpu_to_node_map, NUMA_NO_NODE);
EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_node_map);

+/*
+ * Map cpu index to the best near online node. The best near online node
+ * is the backup node for memory allocation on offline node.
+ */
+DEFINE_PER_CPU(int, x86_cpu_to_near_online_node);
+EXPORT_PER_CPU_SYMBOL(x86_cpu_to_near_online_node);
+
+static int find_near_online_node(int node)
+{
+ int n, val;
+ int min_val = INT_MAX;
+ int best_node = -1;
+
+ for_each_online_node(n) {
+ val = node_distance(node, n);
+
+ if (val < min_val) {
+ min_val = val;
+ best_node = n;
+ }
+ }
+
+ return best_node;
+}
+
void numa_set_node(int cpu, int node)
{
int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map);
@@ -95,7 +121,11 @@ void numa_set_node(int cpu, int node)
return;
}
#endif
+
+ per_cpu(x86_cpu_to_near_online_node, cpu) =
+ find_near_online_node(numa_cpu_node(cpu));
per_cpu(x86_cpu_to_node_map, cpu) = node;
+ cpumask_set_cpu(cpu, &node_to_cpuid_mask_map[numa_cpu_node(cpu)]);

set_cpu_numa_node(cpu, node);
}
@@ -105,6 +135,13 @@ void numa_clear_node(int cpu)
numa_set_node(cpu, NUMA_NO_NODE);
}

+int get_near_online_node(int node)
+{
+ return per_cpu(x86_cpu_to_near_online_node,
+ cpumask_first(&node_to_cpuid_mask_map[node]));
+}
+EXPORT_SYMBOL(get_near_online_node);
+
/*
* Allocate node_to_cpumask_map based on number of available nodes
* Requires node_possible_map to be valid.
@@ -702,24 +739,6 @@ void __init x86_numa_init(void)
numa_init(dummy_numa_init);
}

-static __init int find_near_online_node(int node)
-{
- int n, val;
- int min_val = INT_MAX;
- int best_node = -1;
-
- for_each_online_node(n) {
- val = node_distance(node, n);
-
- if (val < min_val) {
- min_val = val;
- best_node = n;
- }
- }
-
- return best_node;
-}
-
/*
* Setup early cpu_to_node.
*
@@ -746,8 +765,6 @@ void __init init_cpu_to_node(void)

if (node == NUMA_NO_NODE)
continue;
- if (!node_online(node))
- node = find_near_online_node(node);
numa_set_node(cpu, node);
}
}
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 6ba7cf2..4a18b21 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -307,13 +307,23 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
if (nid < 0)
nid = numa_node_id();

+#if IS_ENABLED(CONFIG_X86) && IS_ENABLED(CONFIG_NUMA)
+ if (!node_online(nid))
+ nid = get_near_online_node(nid);
+#endif
+
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}

static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
- VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
+ VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
+
+#if IS_ENABLED(CONFIG_X86) && IS_ENABLED(CONFIG_NUMA)
+ if (!node_online(nid))
+ nid = get_near_online_node(nid);
+#endif

return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}
--
1.9.3

2015-07-07 09:30:25

by Tang Chen

Subject: [PATCH 2/5] x86, acpi, cpu-hotplug: Enable acpi to register all possible cpus at boot time.

From: Gu Zheng <[email protected]>

[Problem]

The cpuid <-> nodeid mapping is first established at boot time, and the workqueue
subsystem caches it in wq_numa_possible_cpumask in wq_numa_init().

When a node is onlined/offlined, cpuid <-> nodeid mappings are established/destroyed.
That means the cpuid <-> nodeid mapping will change whenever node hotplug happens.
But workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid mapping in the beginning:

Node | CPU
------------------------
node 0 | 0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

If we hot-remove node2 and node3, it becomes:

Node | CPU
------------------------
node 0 | 0-14, 60-74
node 1 | 15-29, 75-89

And if we then hot-add node4 and node5, it becomes:

Node | CPU
------------------------
node 0 | 0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and so on for
the other remapped cpus.

When an unbound worker pool is initialized, if its cpumask fits inside one node's
cpumask, pool->node is set to that node, and memory used by this pool will also
be allocated on that node:

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
{
    ...
    /* if cpumask is contained inside a NUMA node, we belong to that node */
    if (wq_numa_enabled) {
        for_each_node(node) {
            if (cpumask_subset(pool->attrs->cpumask,
                               wq_numa_possible_cpumask[node])) {
                pool->node = node;
                break;
            }
        }
    }
    ...
}

Since wq_numa_possible_cpumask is not updated, pool->node can be set to an
offline node, which will lead to memory allocation failures like this:

SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
node 0: slabs: 6172, objs: 259224, free: 245741
node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
  |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
    struct worker *worker;

    worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, using the stale node.

    ......

    return worker;
}

[Solution]

To fix this problem, we establish the cpuid <-> nodeid mapping for all possible
cpus at boot time, and keep it unchanged afterwards. According to init_cpu_to_node(),
the cpuid <-> nodeid mapping is built from the apicid <-> nodeid and cpuid <-> apicid
mappings. So the key point is obtaining all cpus' apicids.

An apicid can be obtained via the _MAT (Multiple APIC Table Entry) method or read
from the MADT (Multiple APIC Description Table). So we do the job in the following steps:

1. Enable the apic registration flow to handle both enabled and disabled cpus.
This is done by introducing an extra parameter to generic_processor_info() to let
the caller control whether disabled cpus are ignored.

2. Introduce a new array storing the cpuid <-> apicid mapping for all possible cpus,
and modify the way cpuids are allocated. Establish the cpuid <-> apicid mapping for
all possible cpus when registering the local apics, and store it in the new array.

3. Enable the _MAT and MADT related APIs to return non-present or disabled cpus'
apicids. This is also done by introducing an extra parameter to these APIs to let
the caller control whether disabled cpus are ignored.

4. Establish the cpuid <-> nodeid mapping for all possible cpus.
This is done via an additional ACPI namespace walk for processors.

This patch finishes step 1.


Signed-off-by: Gu Zheng <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/kernel/apic/apic.c | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index dcb5285..a9c9830 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1977,7 +1977,7 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
}

-int generic_processor_info(int apicid, int version)
+static int __generic_processor_info(int apicid, int version, bool enabled)
{
int cpu, max = nr_cpu_ids;
bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
@@ -2011,7 +2011,8 @@ int generic_processor_info(int apicid, int version)
" Processor %d/0x%x ignored.\n",
thiscpu, apicid);

- disabled_cpus++;
+ if (enabled)
+ disabled_cpus++;
return -ENODEV;
}

@@ -2028,7 +2029,8 @@ int generic_processor_info(int apicid, int version)
" reached. Keeping one slot for boot cpu."
" Processor %d/0x%x ignored.\n", max, thiscpu, apicid);

- disabled_cpus++;
+ if (enabled)
+ disabled_cpus++;
return -ENODEV;
}

@@ -2039,11 +2041,14 @@ int generic_processor_info(int apicid, int version)
"ACPI: NR_CPUS/possible_cpus limit of %i reached."
" Processor %d/0x%x ignored.\n", max, thiscpu, apicid);

- disabled_cpus++;
+ if (enabled)
+ disabled_cpus++;
return -EINVAL;
}

- num_processors++;
+ if (enabled)
+ num_processors++;
+
if (apicid == boot_cpu_physical_apicid) {
/*
* x86_bios_cpu_apicid is required to have processors listed
@@ -2071,7 +2076,8 @@ int generic_processor_info(int apicid, int version)
apic_version[boot_cpu_physical_apicid], cpu, version);
}

- physid_set(apicid, phys_cpu_present_map);
+ if (enabled)
+ physid_set(apicid, phys_cpu_present_map);
if (apicid > max_physical_apicid)
max_physical_apicid = apicid;

@@ -2084,11 +2090,17 @@ int generic_processor_info(int apicid, int version)
apic->x86_32_early_logical_apicid(cpu);
#endif
set_cpu_possible(cpu, true);
- set_cpu_present(cpu, true);
+ if (enabled)
+ set_cpu_present(cpu, true);

return cpu;
}

+int generic_processor_info(int apicid, int version)
+{
+ return __generic_processor_info(apicid, version, true);
+}
+
int hard_smp_processor_id(void)
{
return read_apic_id();
--
1.9.3

2015-07-07 09:30:32

by Tang Chen

Subject: [PATCH 3/5] x86, acpi, cpu-hotplug: Introduce apicid_to_cpuid[] array to store persistent cpuid <-> apicid mapping.

From: Gu Zheng <[email protected]>

In this patch, we introduce a new static array named apicid_to_cpuid[],
which is large enough to store info for all possible cpus.

And then, we modify the way cpuids are allocated. Currently,
generic_processor_info() simply picks the next unused cpuid, which is also why
the cpuid <-> nodeid mapping changes with node hotplug.

After this patch, we find the next unused cpuid, map it to an apicid,
and store the mapping in apicid_to_cpuid[], so that cpuid <-> apicid
mapping will be persistent.

And finally we will use this array to make cpuid <-> nodeid persistent.

The cpuid <-> apicid mapping is established at local apic registration time,
but non-present or disabled cpus have been ignored so far.

In this patch, we establish the cpuid <-> apicid mapping for all possible cpus
when registering the local apics.


Signed-off-by: Gu Zheng <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/include/asm/mpspec.h | 1 +
arch/x86/kernel/acpi/boot.c | 6 ++----
arch/x86/kernel/apic/apic.c | 47 ++++++++++++++++++++++++++++++++++++++++---
3 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index b07233b..db902d8 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -86,6 +86,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
#endif

int generic_processor_info(int apicid, int version);
+int __generic_processor_info(int apicid, int version, bool enabled);

#define PHYSID_ARRAY_SIZE BITS_TO_LONGS(MAX_LOCAL_APIC)

diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index e49ee24..bcc85b2 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -174,15 +174,13 @@ static int acpi_register_lapic(int id, u8 enabled)
return -EINVAL;
}

- if (!enabled) {
+ if (!enabled)
++disabled_cpus;
- return -EINVAL;
- }

if (boot_cpu_physical_apicid != -1U)
ver = apic_version[boot_cpu_physical_apicid];

- return generic_processor_info(id, ver);
+ return __generic_processor_info(id, ver, enabled);
}

static int __init
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index a9c9830..c744ffb 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1977,7 +1977,38 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
}

-static int __generic_processor_info(int apicid, int version, bool enabled)
+/*
+ * Logic cpu number(cpuid) to local APIC id persistent mappings.
+ * Do not clear the mapping even if cpu is hot-removed.
+ */
+static int apicid_to_cpuid[] = {
+ [0 ... NR_CPUS - 1] = -1,
+};
+
+/*
+ * Internal cpu id bits, set the bit once cpu present, and never clear it.
+ */
+static cpumask_t cpuid_mask = CPU_MASK_NONE;
+
+static int get_cpuid(int apicid)
+{
+ int free_id, i;
+
+ free_id = cpumask_next_zero(-1, &cpuid_mask);
+ if (free_id >= nr_cpu_ids)
+ return -1;
+
+ for (i = 0; i < free_id; i++)
+ if (apicid_to_cpuid[i] == apicid)
+ return i;
+
+ apicid_to_cpuid[free_id] = apicid;
+ cpumask_set_cpu(free_id, &cpuid_mask);
+
+ return free_id;
+}
+
+int __generic_processor_info(int apicid, int version, bool enabled)
{
int cpu, max = nr_cpu_ids;
bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
@@ -2058,8 +2089,18 @@ static int __generic_processor_info(int apicid, int version, bool enabled)
* for BSP.
*/
cpu = 0;
- } else
- cpu = cpumask_next_zero(-1, cpu_present_mask);
+ } else {
+ cpu = get_cpuid(apicid);
+ if (cpu < 0) {
+ int thiscpu = max + disabled_cpus;
+
+ pr_warning(" Processor %d/0x%x ignored.\n",
+ thiscpu, apicid);
+ if (enabled)
+ disabled_cpus++;
+ return -EINVAL;
+ }
+ }

/*
* Validate version
--
1.9.3

2015-07-07 09:31:12

by Tang Chen

Subject: [PATCH 4/5] x86, acpi, cpu-hotplug: Enable MADT APIs to return disabled apicid.

From: Gu Zheng <[email protected]>

All processors' apicids can be obtained via the ACPI _MAT method or from the MADT.
The current code ignores disabled processors and returns -ENODEV.

This patch adds a new parameter to the MADT APIs so that the caller is able to
control whether disabled processors are ignored.
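
For example, a caller that wants disabled processors' apicids too can then do
(as patch 5 in this series does):

    phys_id = __acpi_get_phys_id(handle, type, acpi_id, false);

while existing callers keep the old behavior via acpi_get_phys_id().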


Signed-off-by: Gu Zheng <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
---
drivers/acpi/acpi_processor.c | 5 +++-
drivers/acpi/processor_core.c | 57 +++++++++++++++++++++++++++----------------
2 files changed, 40 insertions(+), 22 deletions(-)

diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
index 92a5f73..338c71a 100644
--- a/drivers/acpi/acpi_processor.c
+++ b/drivers/acpi/acpi_processor.c
@@ -282,8 +282,11 @@ static int acpi_processor_get_info(struct acpi_device *device)
* Extra Processor objects may be enumerated on MP systems with
* less than the max # of CPUs. They should be ignored _iff
* they are physically not present.
+ *
+ * NOTE: Even if the processor has a cpuid, it may not present because
+ * cpuid <-> apicid mapping is persistent now.
*/
- if (invalid_logical_cpuid(pr->id)) {
+ if (invalid_logical_cpuid(pr->id) || !cpu_present(pr->id)) {
int ret = acpi_processor_hotadd_init(pr);
if (ret)
return ret;
diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c
index 33a38d6..824b98b 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -32,12 +32,12 @@ static struct acpi_table_madt *get_madt_table(void)
}

static int map_lapic_id(struct acpi_subtable_header *entry,
- u32 acpi_id, phys_cpuid_t *apic_id)
+ u32 acpi_id, phys_cpuid_t *apic_id, bool ignore_disabled)
{
struct acpi_madt_local_apic *lapic =
container_of(entry, struct acpi_madt_local_apic, header);

- if (!(lapic->lapic_flags & ACPI_MADT_ENABLED))
+ if (ignore_disabled && !(lapic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;

if (lapic->processor_id != acpi_id)
@@ -48,12 +48,13 @@ static int map_lapic_id(struct acpi_subtable_header *entry,
}

static int map_x2apic_id(struct acpi_subtable_header *entry,
- int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id)
+ int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id,
+ bool ignore_disabled)
{
struct acpi_madt_local_x2apic *apic =
container_of(entry, struct acpi_madt_local_x2apic, header);

- if (!(apic->lapic_flags & ACPI_MADT_ENABLED))
+ if (ignore_disabled && !(apic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;

if (device_declaration && (apic->uid == acpi_id)) {
@@ -65,12 +66,13 @@ static int map_x2apic_id(struct acpi_subtable_header *entry,
}

static int map_lsapic_id(struct acpi_subtable_header *entry,
- int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id)
+ int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id,
+ bool ignore_disabled)
{
struct acpi_madt_local_sapic *lsapic =
container_of(entry, struct acpi_madt_local_sapic, header);

- if (!(lsapic->lapic_flags & ACPI_MADT_ENABLED))
+ if (ignore_disabled && !(lsapic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;

if (device_declaration) {
@@ -87,12 +89,13 @@ static int map_lsapic_id(struct acpi_subtable_header *entry,
* Retrieve the ARM CPU physical identifier (MPIDR)
*/
static int map_gicc_mpidr(struct acpi_subtable_header *entry,
- int device_declaration, u32 acpi_id, phys_cpuid_t *mpidr)
+ int device_declaration, u32 acpi_id, phys_cpuid_t *mpidr,
+ bool ignore_disabled)
{
struct acpi_madt_generic_interrupt *gicc =
container_of(entry, struct acpi_madt_generic_interrupt, header);

- if (!(gicc->flags & ACPI_MADT_ENABLED))
+ if (ignore_disabled && !(gicc->flags & ACPI_MADT_ENABLED))
return -ENODEV;

/* device_declaration means Device object in DSDT, in the
@@ -108,7 +111,7 @@ static int map_gicc_mpidr(struct acpi_subtable_header *entry,
return -EINVAL;
}

-static phys_cpuid_t map_madt_entry(int type, u32 acpi_id)
+static phys_cpuid_t map_madt_entry(int type, u32 acpi_id, bool ignore_disabled)
{
unsigned long madt_end, entry;
phys_cpuid_t phys_id = PHYS_CPUID_INVALID; /* CPU hardware ID */
@@ -128,16 +131,20 @@ static phys_cpuid_t map_madt_entry(int type, u32 acpi_id)
struct acpi_subtable_header *header =
(struct acpi_subtable_header *)entry;
if (header->type == ACPI_MADT_TYPE_LOCAL_APIC) {
- if (!map_lapic_id(header, acpi_id, &phys_id))
+ if (!map_lapic_id(header, acpi_id, &phys_id,
+ ignore_disabled))
break;
} else if (header->type == ACPI_MADT_TYPE_LOCAL_X2APIC) {
- if (!map_x2apic_id(header, type, acpi_id, &phys_id))
+ if (!map_x2apic_id(header, type, acpi_id, &phys_id,
+ ignore_disabled))
break;
} else if (header->type == ACPI_MADT_TYPE_LOCAL_SAPIC) {
- if (!map_lsapic_id(header, type, acpi_id, &phys_id))
+ if (!map_lsapic_id(header, type, acpi_id, &phys_id,
+ ignore_disabled))
break;
} else if (header->type == ACPI_MADT_TYPE_GENERIC_INTERRUPT) {
- if (!map_gicc_mpidr(header, type, acpi_id, &phys_id))
+ if (!map_gicc_mpidr(header, type, acpi_id, &phys_id,
+ ignore_disabled))
break;
}
entry += header->length;
@@ -145,7 +152,8 @@ static phys_cpuid_t map_madt_entry(int type, u32 acpi_id)
return phys_id;
}

-static phys_cpuid_t map_mat_entry(acpi_handle handle, int type, u32 acpi_id)
+static phys_cpuid_t map_mat_entry(acpi_handle handle, int type, u32 acpi_id,
+ bool ignore_disabled)
{
struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL };
union acpi_object *obj;
@@ -166,30 +174,37 @@ static phys_cpuid_t map_mat_entry(acpi_handle handle, int type, u32 acpi_id)

header = (struct acpi_subtable_header *)obj->buffer.pointer;
if (header->type == ACPI_MADT_TYPE_LOCAL_APIC)
- map_lapic_id(header, acpi_id, &phys_id);
+ map_lapic_id(header, acpi_id, &phys_id, ignore_disabled);
else if (header->type == ACPI_MADT_TYPE_LOCAL_SAPIC)
- map_lsapic_id(header, type, acpi_id, &phys_id);
+ map_lsapic_id(header, type, acpi_id, &phys_id, ignore_disabled);
else if (header->type == ACPI_MADT_TYPE_LOCAL_X2APIC)
- map_x2apic_id(header, type, acpi_id, &phys_id);
+ map_x2apic_id(header, type, acpi_id, &phys_id, ignore_disabled);
else if (header->type == ACPI_MADT_TYPE_GENERIC_INTERRUPT)
- map_gicc_mpidr(header, type, acpi_id, &phys_id);
+ map_gicc_mpidr(header, type, acpi_id, &phys_id,
+ ignore_disabled);

exit:
kfree(buffer.pointer);
return phys_id;
}

-phys_cpuid_t acpi_get_phys_id(acpi_handle handle, int type, u32 acpi_id)
+static phys_cpuid_t __acpi_get_phys_id(acpi_handle handle, int type,
+ u32 acpi_id, bool ignore_disabled)
{
phys_cpuid_t phys_id;

- phys_id = map_mat_entry(handle, type, acpi_id);
+ phys_id = map_mat_entry(handle, type, acpi_id, ignore_disabled);
if (invalid_phys_cpuid(phys_id))
- phys_id = map_madt_entry(type, acpi_id);
+ phys_id = map_madt_entry(type, acpi_id, ignore_disabled);

return phys_id;
}

+phys_cpuid_t acpi_get_phys_id(acpi_handle handle, int type, u32 acpi_id)
+{
+ return __acpi_get_phys_id(handle, type, acpi_id, true);
+}
+
int acpi_map_cpuid(phys_cpuid_t phys_id, u32 acpi_id)
{
#ifdef CONFIG_SMP
--
1.9.3

2015-07-07 09:30:51

by Tang Chen

Subject: [PATCH 5/5] x86, acpi, cpu-hotplug: Set persistent cpuid <-> nodeid mapping when booting.

From: Gu Zheng <[email protected]>

This patch sets up the persistent cpuid <-> nodeid mapping for all enabled and
disabled processors at boot time via an additional ACPI namespace walk for
processors.


Signed-off-by: Gu Zheng <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
---
arch/ia64/kernel/acpi.c | 2 +-
arch/x86/kernel/acpi/boot.c | 2 +-
drivers/acpi/bus.c | 3 ++
drivers/acpi/processor_core.c | 65 +++++++++++++++++++++++++++++++++++++++++++
include/linux/acpi.h | 2 ++
5 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index b1698bc..7db5563 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -796,7 +796,7 @@ int acpi_isa_irq_to_gsi(unsigned isa_irq, u32 *gsi)
* ACPI based hotplug CPU support
*/
#ifdef CONFIG_ACPI_HOTPLUG_CPU
-static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
{
#ifdef CONFIG_ACPI_NUMA
/*
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index bcc85b2..b9a1aa1 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -695,7 +695,7 @@ static void __init acpi_set_irq_model_ioapic(void)
#ifdef CONFIG_ACPI_HOTPLUG_CPU
#include <acpi/processor.h>

-static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
{
#ifdef CONFIG_ACPI_NUMA
int nid;
diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index 513e7230e..fd03885 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -700,6 +700,9 @@ static int __init acpi_init(void)
acpi_debugfs_init();
acpi_sleep_proc_init();
acpi_wakeup_device_init();
+#ifdef CONFIG_ACPI_HOTPLUG_CPU
+ acpi_set_processor_mapping();
+#endif
return 0;
}

diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c
index 824b98b..45580ff 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -261,6 +261,71 @@ int acpi_get_cpuid(acpi_handle handle, int type, u32 acpi_id)
}
EXPORT_SYMBOL_GPL(acpi_get_cpuid);

+#ifdef CONFIG_ACPI_HOTPLUG_CPU
+static bool map_processor(acpi_handle handle, int *phys_id, int *cpuid)
+{
+ int type;
+ u32 acpi_id;
+ acpi_status status;
+ acpi_object_type acpi_type;
+ unsigned long long tmp;
+ union acpi_object object = { 0 };
+ struct acpi_buffer buffer = { sizeof(union acpi_object), &object };
+
+ status = acpi_get_type(handle, &acpi_type);
+ if (ACPI_FAILURE(status))
+ return false;
+
+ switch (acpi_type) {
+ case ACPI_TYPE_PROCESSOR:
+ status = acpi_evaluate_object(handle, NULL, NULL, &buffer);
+ if (ACPI_FAILURE(status))
+ return false;
+ acpi_id = object.processor.proc_id;
+ break;
+ case ACPI_TYPE_DEVICE:
+ status = acpi_evaluate_integer(handle, "_UID", NULL, &tmp);
+ if (ACPI_FAILURE(status))
+ return false;
+ acpi_id = tmp;
+ break;
+ default:
+ return false;
+ }
+
+ type = (acpi_type == ACPI_TYPE_DEVICE) ? 1 : 0;
+
+ *phys_id = __acpi_get_phys_id(handle, type, acpi_id, false);
+ *cpuid = acpi_map_cpuid(*phys_id, acpi_id);
+ if (*cpuid == -1)
+ return false;
+
+ return true;
+}
+
+static acpi_status __init
+set_processor_node_mapping(acpi_handle handle, u32 lvl, void *context,
+ void **rv)
+{
+ u32 apic_id;
+ int cpu_id;
+
+ if (!map_processor(handle, &apic_id, &cpu_id))
+ return AE_ERROR;
+
+ acpi_map_cpu2node(handle, cpu_id, apic_id);
+ return AE_OK;
+}
+
+void __init acpi_set_processor_mapping(void)
+{
+ /* Set persistent cpu <-> node mapping for all processors. */
+ acpi_walk_namespace(ACPI_TYPE_PROCESSOR, ACPI_ROOT_OBJECT,
+ ACPI_UINT32_MAX, set_processor_node_mapping,
+ NULL, NULL, NULL);
+}
+#endif
+
#ifdef CONFIG_ACPI_HOTPLUG_IOAPIC
static int get_ioapic_id(struct acpi_subtable_header *entry, u32 gsi_base,
u64 *phys_addr, int *ioapic_id)
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 1618cdf..fe3bd4b 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -172,6 +172,8 @@ static inline bool invalid_phys_cpuid(phys_cpuid_t phys_id)
/* Arch dependent functions for cpu hotplug support */
int acpi_map_cpu(acpi_handle handle, phys_cpuid_t physid, int *pcpu);
int acpi_unmap_cpu(int cpu);
+void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid);
+void __init acpi_set_processor_mapping(void);
#endif /* CONFIG_ACPI_HOTPLUG_CPU */

#ifdef CONFIG_ACPI_HOTPLUG_IOAPIC
--
1.9.3

2015-07-07 11:14:28

by Mika Penttilä

Subject: Re: [PATCH 3/5] x86, acpi, cpu-hotplug: Introduce apicid_to_cpuid[] array to store persistent cpuid <-> apicid mapping.

I think you forgot to reserve CPU 0 for the BSP in the cpuid mask.
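
(e.g. something along the lines of

    cpumask_set_cpu(0, &cpuid_mask);

when the boot cpu is registered; otherwise get_cpuid() can hand out id 0 to a
second processor. Illustrative only.)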

--Mika

On Tue, Jul 7, 2015 at 12:30 PM, Tang Chen <[email protected]> wrote:
> From: Gu Zheng <[email protected]>
>
> In this patch, we introduce a new static array named apicid_to_cpuid[],
> which is large enough to store info for all possible cpus.
>
> And then, we modify the cpuid calculation. In generic_processor_info(),
> it simply finds the next unused cpuid. And it is also why the cpuid <-> nodeid
> mapping changes with node hotplug.
>
> After this patch, we find the next unused cpuid, map it to an apicid,
> and store the mapping in apicid_to_cpuid[], so that cpuid <-> apicid
> mapping will be persistent.
>
> And finally we will use this array to make cpuid <-> nodeid persistent.
>
> cpuid <-> apicid mapping is established at local apic registeration time.
> But non-present or disabled cpus are ignored.
>
> In this patch, we establish all possible cpuid <-> apicid mapping when
> registering local apic.
>
>
> Signed-off-by: Gu Zheng <[email protected]>
> Signed-off-by: Tang Chen <[email protected]>
> ---
> arch/x86/include/asm/mpspec.h | 1 +
> arch/x86/kernel/acpi/boot.c | 6 ++----
> arch/x86/kernel/apic/apic.c | 47 ++++++++++++++++++++++++++++++++++++++++---
> 3 files changed, 47 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
> index b07233b..db902d8 100644
> --- a/arch/x86/include/asm/mpspec.h
> +++ b/arch/x86/include/asm/mpspec.h
> @@ -86,6 +86,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
> #endif
>
> int generic_processor_info(int apicid, int version);
> +int __generic_processor_info(int apicid, int version, bool enabled);
>
> #define PHYSID_ARRAY_SIZE BITS_TO_LONGS(MAX_LOCAL_APIC)
>
> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
> index e49ee24..bcc85b2 100644
> --- a/arch/x86/kernel/acpi/boot.c
> +++ b/arch/x86/kernel/acpi/boot.c
> @@ -174,15 +174,13 @@ static int acpi_register_lapic(int id, u8 enabled)
> return -EINVAL;
> }
>
> - if (!enabled) {
> + if (!enabled)
> ++disabled_cpus;
> - return -EINVAL;
> - }
>
> if (boot_cpu_physical_apicid != -1U)
> ver = apic_version[boot_cpu_physical_apicid];
>
> - return generic_processor_info(id, ver);
> + return __generic_processor_info(id, ver, enabled);
> }
>
> static int __init
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index a9c9830..c744ffb 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -1977,7 +1977,38 @@ void disconnect_bsp_APIC(int virt_wire_setup)
> apic_write(APIC_LVT1, value);
> }
>
> -static int __generic_processor_info(int apicid, int version, bool enabled)
> +/*
> + * Logic cpu number(cpuid) to local APIC id persistent mappings.
> + * Do not clear the mapping even if cpu is hot-removed.
> + */
> +static int apicid_to_cpuid[] = {
> + [0 ... NR_CPUS - 1] = -1,
> +};
> +
> +/*
> + * Internal cpu id bits, set the bit once cpu present, and never clear it.
> + */
> +static cpumask_t cpuid_mask = CPU_MASK_NONE;
> +
> +static int get_cpuid(int apicid)
> +{
> + int free_id, i;
> +
> + free_id = cpumask_next_zero(-1, &cpuid_mask);
> + if (free_id >= nr_cpu_ids)
> + return -1;
> +
> + for (i = 0; i < free_id; i++)
> + if (apicid_to_cpuid[i] == apicid)
> + return i;
> +
> + apicid_to_cpuid[free_id] = apicid;
> + cpumask_set_cpu(free_id, &cpuid_mask);
> +
> + return free_id;
> +}
> +
> +int __generic_processor_info(int apicid, int version, bool enabled)
> {
> int cpu, max = nr_cpu_ids;
> bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
> @@ -2058,8 +2089,18 @@ static int __generic_processor_info(int apicid, int version, bool enabled)
> * for BSP.
> */
> cpu = 0;
> - } else
> - cpu = cpumask_next_zero(-1, cpu_present_mask);
> + } else {
> + cpu = get_cpuid(apicid);
> + if (cpu < 0) {
> + int thiscpu = max + disabled_cpus;
> +
> + pr_warning(" Processor %d/0x%x ignored.\n",
> + thiscpu, apicid);
> + if (enabled)
> + disabled_cpus++;
> + return -EINVAL;
> + }
> + }
>
> /*
> * Validate version
> --
> 1.9.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2015-07-15 03:33:19

by Tang Chen

Subject: Re: [PATCH 3/5] x86, acpi, cpu-hotplug: Introduce apicid_to_cpuid[] array to store persistent cpuid <-> apicid mapping.

Hi Mika,

On 07/07/2015 07:14 PM, Mika Penttilä wrote:
> I think you forgot to reserve CPU 0 for BSP in cpuid mask.

Sorry for the late reply.

I'm not familiar with the BSP. Do you mean that in get_cpuid()
I should reserve cpuid 0 for physical cpu0, the BSP?

Would you please share more details?

Thanks.

>
> --Mika
>
> On Tue, Jul 7, 2015 at 12:30 PM, Tang Chen <[email protected]> wrote:
>> From: Gu Zheng <[email protected]>
>>
>> In this patch, we introduce a new static array named apicid_to_cpuid[],
>> which is large enough to store info for all possible cpus.
>>
>> And then, we modify the cpuid calculation. In generic_processor_info(),
>> it simply finds the next unused cpuid. And it is also why the cpuid <-> nodeid
>> mapping changes with node hotplug.
>>
>> After this patch, we find the next unused cpuid, map it to an apicid,
>> and store the mapping in apicid_to_cpuid[], so that cpuid <-> apicid
>> mapping will be persistent.
>>
>> And finally we will use this array to make cpuid <-> nodeid persistent.
>>
>> cpuid <-> apicid mapping is established at local apic registeration time.
>> But non-present or disabled cpus are ignored.
>>
>> In this patch, we establish all possible cpuid <-> apicid mapping when
>> registering local apic.
>>
>>
>> Signed-off-by: Gu Zheng <[email protected]>
>> Signed-off-by: Tang Chen <[email protected]>
>> ---
>> arch/x86/include/asm/mpspec.h | 1 +
>> arch/x86/kernel/acpi/boot.c | 6 ++----
>> arch/x86/kernel/apic/apic.c | 47 ++++++++++++++++++++++++++++++++++++++++---
>> 3 files changed, 47 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
>> index b07233b..db902d8 100644
>> --- a/arch/x86/include/asm/mpspec.h
>> +++ b/arch/x86/include/asm/mpspec.h
>> @@ -86,6 +86,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
>> #endif
>>
>> int generic_processor_info(int apicid, int version);
>> +int __generic_processor_info(int apicid, int version, bool enabled);
>>
>> #define PHYSID_ARRAY_SIZE BITS_TO_LONGS(MAX_LOCAL_APIC)
>>
>> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
>> index e49ee24..bcc85b2 100644
>> --- a/arch/x86/kernel/acpi/boot.c
>> +++ b/arch/x86/kernel/acpi/boot.c
>> @@ -174,15 +174,13 @@ static int acpi_register_lapic(int id, u8 enabled)
>> return -EINVAL;
>> }
>>
>> - if (!enabled) {
>> + if (!enabled)
>> ++disabled_cpus;
>> - return -EINVAL;
>> - }
>>
>> if (boot_cpu_physical_apicid != -1U)
>> ver = apic_version[boot_cpu_physical_apicid];
>>
>> - return generic_processor_info(id, ver);
>> + return __generic_processor_info(id, ver, enabled);
>> }
>>
>> static int __init
>> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
>> index a9c9830..c744ffb 100644
>> --- a/arch/x86/kernel/apic/apic.c
>> +++ b/arch/x86/kernel/apic/apic.c
>> @@ -1977,7 +1977,38 @@ void disconnect_bsp_APIC(int virt_wire_setup)
>> apic_write(APIC_LVT1, value);
>> }
>>
>> -static int __generic_processor_info(int apicid, int version, bool enabled)
>> +/*
>> + * Logic cpu number(cpuid) to local APIC id persistent mappings.
>> + * Do not clear the mapping even if cpu is hot-removed.
>> + */
>> +static int apicid_to_cpuid[] = {
>> + [0 ... NR_CPUS - 1] = -1,
>> +};
>> +
>> +/*
>> + * Internal cpu id bits, set the bit once cpu present, and never clear it.
>> + */
>> +static cpumask_t cpuid_mask = CPU_MASK_NONE;
>> +
>> +static int get_cpuid(int apicid)
>> +{
>> + int free_id, i;
>> +
>> + free_id = cpumask_next_zero(-1, &cpuid_mask);
>> + if (free_id >= nr_cpu_ids)
>> + return -1;
>> +
>> + for (i = 0; i < free_id; i++)
>> + if (apicid_to_cpuid[i] == apicid)
>> + return i;
>> +
>> + apicid_to_cpuid[free_id] = apicid;
>> + cpumask_set_cpu(free_id, &cpuid_mask);
>> +
>> + return free_id;
>> +}
>> +
>> +int __generic_processor_info(int apicid, int version, bool enabled)
>> {
>> int cpu, max = nr_cpu_ids;
>> bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
>> @@ -2058,8 +2089,18 @@ static int __generic_processor_info(int apicid, int version, bool enabled)
>> * for BSP.
>> */
>> cpu = 0;
>> - } else
>> - cpu = cpumask_next_zero(-1, cpu_present_mask);
>> + } else {
>> + cpu = get_cpuid(apicid);
>> + if (cpu < 0) {
>> + int thiscpu = max + disabled_cpus;
>> +
>> + pr_warning(" Processor %d/0x%x ignored.\n",
>> + thiscpu, apicid);
>> + if (enabled)
>> + disabled_cpus++;
>> + return -EINVAL;
>> + }
>> + }
>>
>> /*
>> * Validate version
>> --
>> 1.9.3
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
> .
>

2015-07-15 05:35:10

by Jiang Liu

Subject: Re: [PATCH 3/5] x86, acpi, cpu-hotplug: Introduce apicid_to_cpuid[] array to store persistent cpuid <-> apicid mapping.

On 2015/7/15 11:33, Tang Chen wrote:
> Hi Mika,
>
> On 07/07/2015 07:14 PM, Mika Penttilä wrote:
>> I think you forgot to reserve CPU 0 for BSP in cpuid mask.
>
> Sorry for the late reply.
>
> I'm not familiar with BSP. Do you mean in get_cpuid(),
> I should reserve 0 for physical cpu0 in BSP ?
>
> Would you please share more detail ?

BSP stands for "Bootstrap Processor". In other words,
the BSP is CPU0.

2015-07-15 06:25:33

by Tang Chen

Subject: Re: [PATCH 3/5] x86, acpi, cpu-hotplug: Introduce apicid_to_cpuid[] array to store persistent cpuid <-> apicid mapping.


On 07/15/2015 01:35 PM, Jiang Liu wrote:
> On 2015/7/15 11:33, Tang Chen wrote:
>> Hi Mika,
>>
>> On 07/07/2015 07:14 PM, Mika Penttilä wrote:
>>> I think you forgot to reserve CPU 0 for BSP in cpuid mask.
>> Sorry for the late reply.
>>
>> I'm not familiar with BSP. Do you mean in get_cpuid(),
>> I should reserve 0 for physical cpu0 in BSP ?
>>
>> Would you please share more detail ?
> BSP stands for "Bootstrapping Processor". In other word,
> BSP is CPU0.
> .
>

Ha, how foolish I am.

And yes, cpuid 0 is not reserved in cpuid_mask when apicid ==
boot_cpu_physical_apicid turns out to be true.

Will update the patch.

Thanks. :)

2015-07-15 21:48:10

by Tejun Heo

Subject: Re: [PATCH 1/5] x86, gfp: Cache best near node for memory allocation.

Hello,

On Tue, Jul 07, 2015 at 05:30:21PM +0800, Tang Chen wrote:
...
> Why doing this is to prevent memory allocation failure if the cpu is

"The reason for doing this ..."

> online but there is no memory on that node.
>
> But since cpuid <-> nodeid mapping will fix after this patch-set, doing

"But since cpuid <-> nodeid mapping is planned to be made static, ..."

> so in initialization pharse makes no sense any more. The best near online
> node for each cpu should be cached somewhere.

I'm not really following. Is this because the now offline node can
later come online and we'd have to break the constant mapping
invariant if we update the mapping later? If so, it'd be nice to
spell that out.

> void numa_set_node(int cpu, int node)
> {
> int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map);
> @@ -95,7 +121,11 @@ void numa_set_node(int cpu, int node)
> return;
> }
> #endif
> +
> + per_cpu(x86_cpu_to_near_online_node, cpu) =
> + find_near_online_node(numa_cpu_node(cpu));
> per_cpu(x86_cpu_to_node_map, cpu) = node;
> + cpumask_set_cpu(cpu, &node_to_cpuid_mask_map[numa_cpu_node(cpu)]);
>
> set_cpu_numa_node(cpu, node);
> }
> @@ -105,6 +135,13 @@ void numa_clear_node(int cpu)
> numa_set_node(cpu, NUMA_NO_NODE);
> }
>
> +int get_near_online_node(int node)
> +{
> + return per_cpu(x86_cpu_to_near_online_node,
> + cpumask_first(&node_to_cpuid_mask_map[node]));
> +}
> +EXPORT_SYMBOL(get_near_online_node);

Umm... this function is sitting on a fairly hot path and scanning a
cpumask each time. Why not just build a numa node -> numa node array?
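
Something like this, completely untested:

    static int node_to_near_node_map[MAX_NUMNODES];

    int get_near_online_node(int node)
    {
        return node_to_near_node_map[node];
    }

with the array rebuilt from find_near_online_node() whenever a node comes or goes.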

> @@ -702,24 +739,6 @@ void __init x86_numa_init(void)
> numa_init(dummy_numa_init);
> }
>
> -static __init int find_near_online_node(int node)
> -{
> - int n, val;
> - int min_val = INT_MAX;
> - int best_node = -1;
> -
> - for_each_online_node(n) {
> - val = node_distance(node, n);
> -
> - if (val < min_val) {
> - min_val = val;
> - best_node = n;
> - }
> - }
> -
> - return best_node;
> -}

It's usually better to not mix code movement with actual changes.

> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 6ba7cf2..4a18b21 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -307,13 +307,23 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
> if (nid < 0)
> nid = numa_node_id();
>
> +#if IS_ENABLED(CONFIG_X86) && IS_ENABLED(CONFIG_NUMA)
> + if (!node_online(nid))
> + nid = get_near_online_node(nid);
> +#endif

Can you please introduce a wrapper function to do the above so that we
don't open code ifdefs?
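
Something like the following, name invented:

    #if IS_ENABLED(CONFIG_X86) && IS_ENABLED(CONFIG_NUMA)
    static inline int numa_fallback_nid(int nid)
    {
        return node_online(nid) ? nid : get_near_online_node(nid);
    }
    #else
    static inline int numa_fallback_nid(int nid) { return nid; }
    #endif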

> +
> return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> }
>
> static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
> unsigned int order)
> {
> - VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
> + VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> +
> +#if IS_ENABLED(CONFIG_X86) && IS_ENABLED(CONFIG_NUMA)
> + if (!node_online(nid))
> + nid = get_near_online_node(nid);
> +#endif
>
> return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> }

Ditto. Also, what are the synchronization rules for NUMA node
on/offlining? If you end up updating the mapping later, how would
that be synchronized against the above usages?

Thanks.

--
tejun

2015-07-15 22:02:46

by Tejun Heo

Subject: Re: [PATCH 3/5] x86, acpi, cpu-hotplug: Introduce apicid_to_cpuid[] array to store persistent cpuid <-> apicid mapping.

Hello,

On Tue, Jul 07, 2015 at 05:30:23PM +0800, Tang Chen wrote:
> From: Gu Zheng <[email protected]>

It would be a good idea to briefly describe what the overall goal is
and why we want that.

> In this patch, we introduce a new static array named apicid_to_cpuid[],
> which is large enough to store info for all possible cpus.
>
> And then, we modify the cpuid calculation. In generic_processor_info(),
> it simply finds the next unused cpuid. And it is also why the cpuid <-> nodeid
> mapping changes with node hotplug.
>
> After this patch, we find the next unused cpuid, map it to an apicid,
> and store the mapping in apicid_to_cpuid[], so that cpuid <-> apicid
> mapping will be persistent.
>
> And finally we will use this array to make cpuid <-> nodeid persistent.
>
> cpuid <-> apicid mapping is established at local apic registeration time.
> But non-present or disabled cpus are ignored.
>
> In this patch, we establish all possible cpuid <-> apicid mapping when
> registering local apic.
>
>
> Signed-off-by: Gu Zheng <[email protected]>
> Signed-off-by: Tang Chen <[email protected]>
> ---
...
> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
> index e49ee24..bcc85b2 100644
> --- a/arch/x86/kernel/acpi/boot.c
> +++ b/arch/x86/kernel/acpi/boot.c
> @@ -174,15 +174,13 @@ static int acpi_register_lapic(int id, u8 enabled)
> return -EINVAL;
> }
>
> - if (!enabled) {
> + if (!enabled)
> ++disabled_cpus;
> - return -EINVAL;
> - }
>
> if (boot_cpu_physical_apicid != -1U)
> ver = apic_version[boot_cpu_physical_apicid];
>
> - return generic_processor_info(id, ver);
> + return __generic_processor_info(id, ver, enabled);
> }
>
> static int __init
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index a9c9830..c744ffb 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -1977,7 +1977,38 @@ void disconnect_bsp_APIC(int virt_wire_setup)
> apic_write(APIC_LVT1, value);
> }
>
> -static int __generic_processor_info(int apicid, int version, bool enabled)
> +/*
> + * Logic cpu number(cpuid) to local APIC id persistent mappings.

Logical

Also, isn't it the other way around?

> + * Do not clear the mapping even if cpu is hot-removed.
> + */
> +static int apicid_to_cpuid[] = {
> + [0 ... NR_CPUS - 1] = -1,
> +};
> +
> +/*
> + * Internal cpu id bits, set the bit once cpu present, and never clear it.
> + */
> +static cpumask_t cpuid_mask = CPU_MASK_NONE;
> +
> +static int get_cpuid(int apicid)
> +{
> + int free_id, i;
> +
> + free_id = cpumask_next_zero(-1, &cpuid_mask);
> + if (free_id >= nr_cpu_ids)
> + return -1;
> +
> + for (i = 0; i < free_id; i++)
> + if (apicid_to_cpuid[i] == apicid)
> + return i;
> +
> + apicid_to_cpuid[free_id] = apicid;
> + cpumask_set_cpu(free_id, &cpuid_mask);
> +
> + return free_id;

Why can't this function simply test whether apicid_to_cpuid[] is -1 or
not? Also, why does it need cpuid_mask? Isn't it just giving out cpu
id numbers sequentially?
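
IOW, something like this (untested sketch):

    static int get_cpuid(int apicid)
    {
        int i;

        for (i = 0; i < nr_cpu_ids; i++) {
            if (apicid_to_cpuid[i] == apicid)
                return i;        /* seen before, reuse the old id */
            if (apicid_to_cpuid[i] == -1) {
                /* ids are sequential, so the first -1 is the next free one */
                apicid_to_cpuid[i] = apicid;
                return i;
            }
        }
        return -1;
    }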

> +}
> +
> +int __generic_processor_info(int apicid, int version, bool enabled)
> {
> int cpu, max = nr_cpu_ids;
> bool boot_cpu_detected = physid_isset(boot_cpu_physical_apicid,
> @@ -2058,8 +2089,18 @@ static int __generic_processor_info(int apicid, int version, bool enabled)
> * for BSP.
> */
> cpu = 0;
> - } else
> - cpu = cpumask_next_zero(-1, cpu_present_mask);
> + } else {
> + cpu = get_cpuid(apicid);
> + if (cpu < 0) {
> + int thiscpu = max + disabled_cpus;
> +
> + pr_warning(" Processor %d/0x%x ignored.\n",
> + thiscpu, apicid);

Given that the only failing condition is that there are more possible cpus
than nr_cpu_ids, it might make more sense to emit this warning once in
get_cpuid().

Also, wouldn't it make more sense / safer to allocate all online cpus
first and then go through possible cpus?

Thanks.

--
tejun

2015-07-15 22:06:30

by Tejun Heo

Subject: Re: [PATCH 4/5] x86, acpi, cpu-hotplug: Enable MADT APIs to return disabled apicid.

Hello,

On Tue, Jul 07, 2015 at 05:30:24PM +0800, Tang Chen wrote:
> From: Gu Zheng <[email protected]>
>
> All processors' apicids can be obtained by _MAT method or from MADT in ACPI.
> The current code ignores disabled processors and returns -ENODEV.
>
> After this patch, a new parameter will be added to MADT APIs so that caller
> is able to control if disabled processors are ignored.

This describes what the patch does but doesn't really explain what the
patch is trying to achieve.

> @@ -282,8 +282,11 @@ static int acpi_processor_get_info(struct acpi_device *device)
> * Extra Processor objects may be enumerated on MP systems with
> * less than the max # of CPUs. They should be ignored _iff
> * they are physically not present.
> + *
> + * NOTE: Even if the processor has a cpuid, it may not present because
^
be
> + * cpuid <-> apicid mapping is persistent now.

Saying "now" is kinda weird as this is how the code is gonna be
forever.

Thanks.

--
tejun

2015-07-15 22:13:49

by Tejun Heo

Subject: Re: [PATCH 0/5] Make cpuid <-> nodeid mapping persistent.

Hello,

On Tue, Jul 07, 2015 at 05:30:20PM +0800, Tang Chen wrote:
> [Solution]
>
> To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
> cpus at boot time, and make it invariable. And according to init_cpu_to_node(),
> cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> apicid
> mapping. So the key point is obtaining all cpus' apicid.
>
> apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
> MADT (Multiple APIC Description Table). So we finish the job in the following steps:
>
> 1. Enable apic registeration flow to handle both enabled and disabled cpus.
> This is done by introducing an extra parameter to generic_processor_info to let the
> caller control if disabled cpus are ignored.
>
> 2. Introduce a new array storing all possible cpuid <-> apicid mapping. And also modify
> the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping when
> registering local apic. Store the mapping in the array introduced above.
>
> 4. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' apicid.
> This is also done by introducing an extra parameter to these apis to let the caller
> control if disabled cpus are ignored.
>
> 5. Establish all possible cpuid <-> nodeid mapping.
> This is done via an additional acpi namespace walk for processors.

Hmmm... given that we probably want to allocate lower ids to the
online cpus, as otherwise we can end up failing to bring existing cpus
online because NR_CPUS is lower than the number of possible cpus, I
wonder whether doing this lazily could be better / easier. e.g. just
remember the mapping as cpus come online. When a new cpu comes up,
look up whether it came up before. If so, use the ids from the last
time. If not, allocate new ones. I think that would be less amount
of change but does require updating the mapping dynamically.

Thanks.

--
tejun

2015-07-23 04:44:06

by Tang Chen

Subject: Re: [PATCH 0/5] Make cpuid <-> nodeid mapping persistent.


On 07/16/2015 06:13 AM, Tejun Heo wrote:
> Hello,
>
> On Tue, Jul 07, 2015 at 05:30:20PM +0800, Tang Chen wrote:
>> [Solution]
>>
>> To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
>> cpus at boot time, and make it invariable. And according to init_cpu_to_node(),
>> cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> apicid
>> mapping. So the key point is obtaining all cpus' apicid.
>>
>> apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
>> MADT (Multiple APIC Description Table). So we finish the job in the following steps:
>>
>> 1. Enable apic registeration flow to handle both enabled and disabled cpus.
>> This is done by introducing an extra parameter to generic_processor_info to let the
>> caller control if disabled cpus are ignored.
>>
>> 2. Introduce a new array storing all possible cpuid <-> apicid mapping. And also modify
>> the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping when
>> registering local apic. Store the mapping in the array introduced above.
>>
>> 4. Enable _MAT and MADT relative apis to return non-presnet or disabled cpus' apicid.
>> This is also done by introducing an extra parameter to these apis to let the caller
>> control if disabled cpus are ignored.
>>
>> 5. Establish all possible cpuid <-> nodeid mapping.
>> This is done via an additional acpi namespace walk for processors.
> Hmmm... given that we probably want to allocate lower ids to the
> online cpus, as otherwise we can end up failing to bring existing cpus
> online because NR_CPUS is lower than the number of possible cpus, I
> wonder whether doing this lazily could be better / easier. e.g. just
> remember the mapping as cpus come online. When a new cpu comes up,
> look up whether it came up before. If so, use the ids from the last
> time. If not, allocate new ones. I think that would be less amount
> of change but does require updating the mapping dynamically.

Hi TJ,

Allocating a cpuid when a new cpu comes up and reusing that cpuid when it
comes up again is possible. But I'm not quite sure it would be a smaller
modification, because we still need an array or bitmap or something to keep
the mapping, and to select backup nodes for cpus on memory-less nodes when
allocating memory.

I can post a set of patches for this idea, and then we can see which one
is better.

Thanks. :)

2015-07-23 18:32:34

by Tejun Heo

Subject: Re: [PATCH 0/5] Make cpuid <-> nodeid mapping persistent.

Hello, Tang.

On Thu, Jul 23, 2015 at 12:44:53PM +0800, Tang Chen wrote:
> Allocating cpuid when a new cpu comes up and reusing the cpuid when it
> comes up again is possible. But I'm not quite sure if it will be less
> modification
> because we still need an array or bit map or something to keep the mapping,
> and select backup nodes for cpus on memory-less nodes when allocating
> memory.
>
> I can post a set of patches for this idea. And then we can see which one is
> better.

I suspect the difference could be that in the current code the users
(workqueue) can remain the same while if we do it lazily there
probably needs to be a way to poke it. As long as the IDs are
allocated to the online CPUs first, I think pre-allocating everything
should be fine too.

Thanks.

--
tejun

2015-08-04 03:37:53

by Tang Chen

Subject: Re: [PATCH 1/5] x86, gfp: Cache best near node for memory allocation.

Hi TJ,

Sorry for the late reply.

On 07/16/2015 05:48 AM, Tejun Heo wrote:
> ......
> so in initialization pharse makes no sense any more. The best near online
> node for each cpu should be cached somewhere.
> I'm not really following. Is this because the now offline node can
> later come online and we'd have to break the constant mapping
> invariant if we update the mapping later? If so, it'd be nice to
> spell that out.

Yes. Will document this in the next version.

>> ......
>>
>> +int get_near_online_node(int node)
>> +{
>> + return per_cpu(x86_cpu_to_near_online_node,
>> + cpumask_first(&node_to_cpuid_mask_map[node]));
>> +}
>> +EXPORT_SYMBOL(get_near_online_node);
> Umm... this function is sitting on a fairly hot path and scanning a
> cpumask each time. Why not just build a numa node -> numa node array?

Indeed. Will avoid to scan a cpumask.

> ......
>
>>
>> static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
>> unsigned int order)
>> {
>> - VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
>> + VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>> +
>> +#if IS_ENABLED(CONFIG_X86) && IS_ENABLED(CONFIG_NUMA)
>> + if (!node_online(nid))
>> + nid = get_near_online_node(nid);
>> +#endif
>>
>> return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
>> }
> Ditto. Also, what's the synchronization rules for NUMA node
> on/offlining. If you end up updating the mapping later, how would
> that be synchronized against the above usages?

I think the near online node map should be updated when node online/offline
happens. But regarding this, I think the current numa code has a little
problem.

As you know, firmware info binds a set of CPUs and memory to a node. But
at boot time, if the node has no memory (a memory-less node), it won't
be online. The CPUs on that node are still available, though, and are bound
to the nearest online node. (Here, I mean numa_set_node(cpu, node).)

Why does the kernel do this? I think it is to ensure that memory allocation
can succeed when calling functions like alloc_pages_node() and
alloc_pages_exact_node(). For these two functions, any CPU should be bound
to a node that has memory, so that memory allocation can succeed.
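
For reference, the boot-time binding is roughly the following (simplified
from init_cpu_to_node() in arch/x86/mm/numa.c of that era):

void __init init_cpu_to_node(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		int node = numa_cpu_node(cpu);

		if (node == NUMA_NO_NODE)
			continue;

		/* A memory-less node is offline at boot: bind the cpu
		 * to the nearest online node instead. */
		if (!node_online(node))
			node = find_near_online_node(node);

		numa_set_node(cpu, node);
	}
}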

That means, for a memory-less node at boot time, the CPUs on the node are
online, but the node itself is not.

That also means "the node is online" equals "the node has memory". Actually,
a lot of code in the kernel relies on this rule.


But,
1) in cpu_up(), the kernel tries to online the node, and it doesn't check
whether the node has memory.
2) in try_offline_node(), it offlines the CPUs first, and then the memory.

This behavior looks a little weird, or let's say ambiguous. A NUMA node
consists of CPUs and memory, so if the CPUs are online, the node should
be online.


And also, the main purpose of this patch-set is to make the cpuid <-> nodeid
mapping persistent. After this patch-set, alloc_pages_node() and
alloc_pages_exact_node() won't depend on the cpuid <-> nodeid mapping any
more. So the node should be online if the CPUs on it are online. Otherwise,
we cannot set up the interfaces of the CPUs under /sys.


Unfortunately, since I don't have a machine with a memory-less node, I
cannot reproduce the problem right now.

How do you think the node online behavior should be changed?

Thanks.

2015-08-04 08:06:06

by Jiang Liu

[permalink] [raw]
Subject: Re: [PATCH 1/5] x86, gfp: Cache best near node for memory allocation.

On 2015/8/4 11:36, Tang Chen wrote:
> ......
>
> But,
> 1) in cpu_up(), the kernel tries to online the node, and it doesn't check
> whether the node has memory.
> 2) in try_offline_node(), it offlines the CPUs first, and then the memory.
>
> This behavior looks a little weird, or let's say ambiguous. A NUMA node
> consists of CPUs and memory, so if the CPUs are online, the node should
> be online.
Hi Chen,
	I have posted a patch set to enable memoryless nodes on x86, and
will repost it for review. :) Hope it helps to solve this issue.
Thanks!
Gerry

2015-08-04 08:25:45

by Tang Chen

[permalink] [raw]
Subject: Re: [PATCH 1/5] x86, gfp: Cache best near node for memory allocation.


On 08/04/2015 04:05 PM, Jiang Liu wrote:
> ......
>
>> But,
>> 1) in cpu_up(), the kernel tries to online the node, and it doesn't check
>> whether the node has memory.
>> 2) in try_offline_node(), it offlines the CPUs first, and then the memory.
>>
>> This behavior looks a little weird, or let's say ambiguous. A NUMA node
>> consists of CPUs and memory, so if the CPUs are online, the node should
>> be online.
> Hi Chen,
> 	I have posted a patch set to enable memoryless nodes on x86, and
> will repost it for review. :) Hope it helps to solve this issue.
> Thanks!
> Gerry

Sure. Looking forward to the latest patches. :)

Thanks.

2015-08-09 06:17:05

by Tang Chen

[permalink] [raw]
Subject: Re: [PATCH 1/5] x86, gfp: Cache best near node for memory allocation.

Hi Liu,

Have you posted your new patches?
(I mean the memory-less node support patches.)

If you are going to post them, please cc me.

And BTW, how did you reproduce the memory-less node problem?
Do you have a real memory-less node on your machine?

Thanks. :)

On 08/04/2015 04:05 PM, Jiang Liu wrote:
> ......
> Hi Chen,
> 	I have posted a patch set to enable memoryless nodes on x86, and
> will repost it for review. :) Hope it helps to solve this issue.
> Thanks!
> Gerry

2015-08-12 01:53:57

by Jiang Liu

[permalink] [raw]
Subject: Re: [PATCH 1/5] x86, gfp: Cache best near node for memory allocation.

On 2015/8/9 14:15, Tang Chen wrote:
> Hi Liu,
>
> Have you posted your new patches?
> (I mean the memory-less node support patches.)
Hi Chen,
	I have rebased my patches to v4.2-rc4, but unfortunately they
break. There seem to have been some changes in x86 NUMA support since
3.17, and I need some time to figure them out.

>
> If you are going to post them, please cc me.
Sure.

>
> And BTW, how did you reproduce the memory-less node problem?
> Do you have a real memory-less node on your machine?
Yes, we have a system with memoryless nodes.
Thanks!
Gerry