2012-05-09 14:30:18

by Peter Zijlstra

Subject: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support

Commit-ID: cb83b629bae0327cf9f44f096adc38d150ceb913
Gitweb: http://git.kernel.org/tip/cb83b629bae0327cf9f44f096adc38d150ceb913
Author: Peter Zijlstra <[email protected]>
AuthorDate: Tue, 17 Apr 2012 15:49:36 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 9 May 2012 15:00:55 +0200

sched/numa: Rewrite the CONFIG_NUMA sched domain support

The current code groups up to 16 nodes in a level and then puts an
ALLNODES domain spanning the entire tree on top of that. This doesn't
reflect the actual NUMA topology, and especially for the smaller
not-fully-connected machines out there today this can make a difference.

Therefore, build a proper NUMA topology based on node_distance().

Since there are no fixed NUMA layers anymore, the static SD_NODE_INIT
and SD_ALLNODES_INIT initializers are no longer usable; the new code
constructs something similar and scales some values on the number of
CPUs in the domain and/or the node_distance() ratio.

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: David Howells <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Matt Turner <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Paul Mundt <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: [email protected]
Cc: Tony Luck <[email protected]>
Cc: [email protected]
Cc: Dimitri Sivanich <[email protected]>
Cc: Greg Pearson <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/ia64/include/asm/topology.h | 25 ---
arch/mips/include/asm/mach-ip27/topology.h | 17 --
arch/powerpc/include/asm/topology.h | 36 ----
arch/sh/include/asm/topology.h | 25 ---
arch/sparc/include/asm/topology_64.h | 19 --
arch/tile/include/asm/topology.h | 26 ---
arch/x86/include/asm/topology.h | 38 ----
include/linux/topology.h | 37 ----
kernel/sched/core.c | 280 ++++++++++++++++++----------
9 files changed, 185 insertions(+), 318 deletions(-)

diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index 09f6467..a2496e4 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -70,31 +70,6 @@ void build_cpu_to_node_map(void);
.nr_balance_failed = 0, \
}

-/* sched_domains SD_NODE_INIT for IA64 NUMA machines */
-#define SD_NODE_INIT (struct sched_domain) { \
- .parent = NULL, \
- .child = NULL, \
- .groups = NULL, \
- .min_interval = 8, \
- .max_interval = 8*(min(num_online_cpus(), 32U)), \
- .busy_factor = 64, \
- .imbalance_pct = 125, \
- .cache_nice_tries = 2, \
- .busy_idx = 3, \
- .idle_idx = 2, \
- .newidle_idx = 0, \
- .wake_idx = 0, \
- .forkexec_idx = 0, \
- .flags = SD_LOAD_BALANCE \
- | SD_BALANCE_NEWIDLE \
- | SD_BALANCE_EXEC \
- | SD_BALANCE_FORK \
- | SD_SERIALIZE, \
- .last_balance = jiffies, \
- .balance_interval = 64, \
- .nr_balance_failed = 0, \
-}
-
#endif /* CONFIG_NUMA */

#ifdef CONFIG_SMP
diff --git a/arch/mips/include/asm/mach-ip27/topology.h b/arch/mips/include/asm/mach-ip27/topology.h
index 1b1a7d1..b2cf641 100644
--- a/arch/mips/include/asm/mach-ip27/topology.h
+++ b/arch/mips/include/asm/mach-ip27/topology.h
@@ -36,23 +36,6 @@ extern unsigned char __node_distances[MAX_COMPACT_NODES][MAX_COMPACT_NODES];

#define node_distance(from, to) (__node_distances[(from)][(to)])

-/* sched_domains SD_NODE_INIT for SGI IP27 machines */
-#define SD_NODE_INIT (struct sched_domain) { \
- .parent = NULL, \
- .child = NULL, \
- .groups = NULL, \
- .min_interval = 8, \
- .max_interval = 32, \
- .busy_factor = 32, \
- .imbalance_pct = 125, \
- .cache_nice_tries = 1, \
- .flags = SD_LOAD_BALANCE | \
- SD_BALANCE_EXEC, \
- .last_balance = jiffies, \
- .balance_interval = 1, \
- .nr_balance_failed = 0, \
-}
-
#include <asm-generic/topology.h>

#endif /* _ASM_MACH_TOPOLOGY_H */
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index c971858..852ed1b 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -18,12 +18,6 @@ struct device_node;
*/
#define RECLAIM_DISTANCE 10

-/*
- * Avoid creating an extra level of balancing (SD_ALLNODES) on the largest
- * POWER7 boxes which have a maximum of 32 nodes.
- */
-#define SD_NODES_PER_DOMAIN 32
-
#include <asm/mmzone.h>

static inline int cpu_to_node(int cpu)
@@ -51,36 +45,6 @@ static inline int pcibus_to_node(struct pci_bus *bus)
cpu_all_mask : \
cpumask_of_node(pcibus_to_node(bus)))

-/* sched_domains SD_NODE_INIT for PPC64 machines */
-#define SD_NODE_INIT (struct sched_domain) { \
- .min_interval = 8, \
- .max_interval = 32, \
- .busy_factor = 32, \
- .imbalance_pct = 125, \
- .cache_nice_tries = 1, \
- .busy_idx = 3, \
- .idle_idx = 1, \
- .newidle_idx = 0, \
- .wake_idx = 0, \
- .forkexec_idx = 0, \
- \
- .flags = 1*SD_LOAD_BALANCE \
- | 0*SD_BALANCE_NEWIDLE \
- | 1*SD_BALANCE_EXEC \
- | 1*SD_BALANCE_FORK \
- | 0*SD_BALANCE_WAKE \
- | 1*SD_WAKE_AFFINE \
- | 0*SD_PREFER_LOCAL \
- | 0*SD_SHARE_CPUPOWER \
- | 0*SD_POWERSAVINGS_BALANCE \
- | 0*SD_SHARE_PKG_RESOURCES \
- | 1*SD_SERIALIZE \
- | 0*SD_PREFER_SIBLING \
- , \
- .last_balance = jiffies, \
- .balance_interval = 1, \
-}
-
extern int __node_distance(int, int);
#define node_distance(a, b) __node_distance(a, b)

diff --git a/arch/sh/include/asm/topology.h b/arch/sh/include/asm/topology.h
index 88e7340..b0a282d 100644
--- a/arch/sh/include/asm/topology.h
+++ b/arch/sh/include/asm/topology.h
@@ -3,31 +3,6 @@

#ifdef CONFIG_NUMA

-/* sched_domains SD_NODE_INIT for sh machines */
-#define SD_NODE_INIT (struct sched_domain) { \
- .parent = NULL, \
- .child = NULL, \
- .groups = NULL, \
- .min_interval = 8, \
- .max_interval = 32, \
- .busy_factor = 32, \
- .imbalance_pct = 125, \
- .cache_nice_tries = 2, \
- .busy_idx = 3, \
- .idle_idx = 2, \
- .newidle_idx = 0, \
- .wake_idx = 0, \
- .forkexec_idx = 0, \
- .flags = SD_LOAD_BALANCE \
- | SD_BALANCE_FORK \
- | SD_BALANCE_EXEC \
- | SD_BALANCE_NEWIDLE \
- | SD_SERIALIZE, \
- .last_balance = jiffies, \
- .balance_interval = 1, \
- .nr_balance_failed = 0, \
-}
-
#define cpu_to_node(cpu) ((void)(cpu),0)
#define parent_node(node) ((void)(node),0)

diff --git a/arch/sparc/include/asm/topology_64.h b/arch/sparc/include/asm/topology_64.h
index 8b9c556..1754390 100644
--- a/arch/sparc/include/asm/topology_64.h
+++ b/arch/sparc/include/asm/topology_64.h
@@ -31,25 +31,6 @@ static inline int pcibus_to_node(struct pci_bus *pbus)
cpu_all_mask : \
cpumask_of_node(pcibus_to_node(bus)))

-#define SD_NODE_INIT (struct sched_domain) { \
- .min_interval = 8, \
- .max_interval = 32, \
- .busy_factor = 32, \
- .imbalance_pct = 125, \
- .cache_nice_tries = 2, \
- .busy_idx = 3, \
- .idle_idx = 2, \
- .newidle_idx = 0, \
- .wake_idx = 0, \
- .forkexec_idx = 0, \
- .flags = SD_LOAD_BALANCE \
- | SD_BALANCE_FORK \
- | SD_BALANCE_EXEC \
- | SD_SERIALIZE, \
- .last_balance = jiffies, \
- .balance_interval = 1, \
-}
-
#else /* CONFIG_NUMA */

#include <asm-generic/topology.h>
diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h
index 6fdd0c8..7a7ce39 100644
--- a/arch/tile/include/asm/topology.h
+++ b/arch/tile/include/asm/topology.h
@@ -78,32 +78,6 @@ static inline const struct cpumask *cpumask_of_node(int node)
.balance_interval = 32, \
}

-/* sched_domains SD_NODE_INIT for TILE architecture */
-#define SD_NODE_INIT (struct sched_domain) { \
- .min_interval = 16, \
- .max_interval = 512, \
- .busy_factor = 32, \
- .imbalance_pct = 125, \
- .cache_nice_tries = 1, \
- .busy_idx = 3, \
- .idle_idx = 1, \
- .newidle_idx = 2, \
- .wake_idx = 1, \
- .flags = 1*SD_LOAD_BALANCE \
- | 1*SD_BALANCE_NEWIDLE \
- | 1*SD_BALANCE_EXEC \
- | 1*SD_BALANCE_FORK \
- | 0*SD_BALANCE_WAKE \
- | 0*SD_WAKE_AFFINE \
- | 0*SD_PREFER_LOCAL \
- | 0*SD_SHARE_CPUPOWER \
- | 0*SD_SHARE_PKG_RESOURCES \
- | 1*SD_SERIALIZE \
- , \
- .last_balance = jiffies, \
- .balance_interval = 128, \
-}
-
/* By definition, we create nodes based on online memory. */
#define node_has_online_mem(nid) 1

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index b9676ae..095b215 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -92,44 +92,6 @@ extern void setup_node_to_cpumask_map(void);

#define pcibus_to_node(bus) __pcibus_to_node(bus)

-#ifdef CONFIG_X86_32
-# define SD_CACHE_NICE_TRIES 1
-# define SD_IDLE_IDX 1
-#else
-# define SD_CACHE_NICE_TRIES 2
-# define SD_IDLE_IDX 2
-#endif
-
-/* sched_domains SD_NODE_INIT for NUMA machines */
-#define SD_NODE_INIT (struct sched_domain) { \
- .min_interval = 8, \
- .max_interval = 32, \
- .busy_factor = 32, \
- .imbalance_pct = 125, \
- .cache_nice_tries = SD_CACHE_NICE_TRIES, \
- .busy_idx = 3, \
- .idle_idx = SD_IDLE_IDX, \
- .newidle_idx = 0, \
- .wake_idx = 0, \
- .forkexec_idx = 0, \
- \
- .flags = 1*SD_LOAD_BALANCE \
- | 1*SD_BALANCE_NEWIDLE \
- | 1*SD_BALANCE_EXEC \
- | 1*SD_BALANCE_FORK \
- | 0*SD_BALANCE_WAKE \
- | 1*SD_WAKE_AFFINE \
- | 0*SD_PREFER_LOCAL \
- | 0*SD_SHARE_CPUPOWER \
- | 0*SD_POWERSAVINGS_BALANCE \
- | 0*SD_SHARE_PKG_RESOURCES \
- | 1*SD_SERIALIZE \
- | 0*SD_PREFER_SIBLING \
- , \
- .last_balance = jiffies, \
- .balance_interval = 1, \
-}
-
extern int __node_distance(int, int);
#define node_distance(a, b) __node_distance(a, b)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index e26db03..4f59bf3 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -70,7 +70,6 @@ int arch_update_cpu_topology(void);
* Below are the 3 major initializers used in building sched_domains:
* SD_SIBLING_INIT, for SMT domains
* SD_CPU_INIT, for SMP domains
- * SD_NODE_INIT, for NUMA domains
*
* Any architecture that cares to do any tuning to these values should do so
* by defining their own arch-specific initializer in include/asm/topology.h.
@@ -176,48 +175,12 @@ int arch_update_cpu_topology(void);
}
#endif

-/* sched_domains SD_ALLNODES_INIT for NUMA machines */
-#define SD_ALLNODES_INIT (struct sched_domain) { \
- .min_interval = 64, \
- .max_interval = 64*num_online_cpus(), \
- .busy_factor = 128, \
- .imbalance_pct = 133, \
- .cache_nice_tries = 1, \
- .busy_idx = 3, \
- .idle_idx = 3, \
- .flags = 1*SD_LOAD_BALANCE \
- | 1*SD_BALANCE_NEWIDLE \
- | 0*SD_BALANCE_EXEC \
- | 0*SD_BALANCE_FORK \
- | 0*SD_BALANCE_WAKE \
- | 0*SD_WAKE_AFFINE \
- | 0*SD_SHARE_CPUPOWER \
- | 0*SD_POWERSAVINGS_BALANCE \
- | 0*SD_SHARE_PKG_RESOURCES \
- | 1*SD_SERIALIZE \
- | 0*SD_PREFER_SIBLING \
- , \
- .last_balance = jiffies, \
- .balance_interval = 64, \
-}
-
-#ifndef SD_NODES_PER_DOMAIN
-#define SD_NODES_PER_DOMAIN 16
-#endif
-
#ifdef CONFIG_SCHED_BOOK
#ifndef SD_BOOK_INIT
#error Please define an appropriate SD_BOOK_INIT in include/asm/topology.h!!!
#endif
#endif /* CONFIG_SCHED_BOOK */

-#ifdef CONFIG_NUMA
-#ifndef SD_NODE_INIT
-#error Please define an appropriate SD_NODE_INIT in include/asm/topology.h!!!
-#endif
-
-#endif /* CONFIG_NUMA */
-
#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
DECLARE_PER_CPU(int, numa_node);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6001e5c..b4f2096 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5560,7 +5560,8 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
break;
}

- if (cpumask_intersects(groupmask, sched_group_cpus(group))) {
+ if (!(sd->flags & SD_OVERLAP) &&
+ cpumask_intersects(groupmask, sched_group_cpus(group))) {
printk(KERN_CONT "\n");
printk(KERN_ERR "ERROR: repeated CPUs\n");
break;
@@ -5898,92 +5899,6 @@ static int __init isolated_cpu_setup(char *str)

__setup("isolcpus=", isolated_cpu_setup);

-#ifdef CONFIG_NUMA
-
-/**
- * find_next_best_node - find the next node to include in a sched_domain
- * @node: node whose sched_domain we're building
- * @used_nodes: nodes already in the sched_domain
- *
- * Find the next node to include in a given scheduling domain. Simply
- * finds the closest node not already in the @used_nodes map.
- *
- * Should use nodemask_t.
- */
-static int find_next_best_node(int node, nodemask_t *used_nodes)
-{
- int i, n, val, min_val, best_node = -1;
-
- min_val = INT_MAX;
-
- for (i = 0; i < nr_node_ids; i++) {
- /* Start at @node */
- n = (node + i) % nr_node_ids;
-
- if (!nr_cpus_node(n))
- continue;
-
- /* Skip already used nodes */
- if (node_isset(n, *used_nodes))
- continue;
-
- /* Simple min distance search */
- val = node_distance(node, n);
-
- if (val < min_val) {
- min_val = val;
- best_node = n;
- }
- }
-
- if (best_node != -1)
- node_set(best_node, *used_nodes);
- return best_node;
-}
-
-/**
- * sched_domain_node_span - get a cpumask for a node's sched_domain
- * @node: node whose cpumask we're constructing
- * @span: resulting cpumask
- *
- * Given a node, construct a good cpumask for its sched_domain to span. It
- * should be one that prevents unnecessary balancing, but also spreads tasks
- * out optimally.
- */
-static void sched_domain_node_span(int node, struct cpumask *span)
-{
- nodemask_t used_nodes;
- int i;
-
- cpumask_clear(span);
- nodes_clear(used_nodes);
-
- cpumask_or(span, span, cpumask_of_node(node));
- node_set(node, used_nodes);
-
- for (i = 1; i < SD_NODES_PER_DOMAIN; i++) {
- int next_node = find_next_best_node(node, &used_nodes);
- if (next_node < 0)
- break;
- cpumask_or(span, span, cpumask_of_node(next_node));
- }
-}
-
-static const struct cpumask *cpu_node_mask(int cpu)
-{
- lockdep_assert_held(&sched_domains_mutex);
-
- sched_domain_node_span(cpu_to_node(cpu), sched_domains_tmpmask);
-
- return sched_domains_tmpmask;
-}
-
-static const struct cpumask *cpu_allnodes_mask(int cpu)
-{
- return cpu_possible_mask;
-}
-#endif /* CONFIG_NUMA */
-
static const struct cpumask *cpu_cpu_mask(int cpu)
{
return cpumask_of_node(cpu_to_node(cpu));
@@ -6020,6 +5935,7 @@ struct sched_domain_topology_level {
sched_domain_init_f init;
sched_domain_mask_f mask;
int flags;
+ int numa_level;
struct sd_data data;
};

@@ -6213,10 +6129,6 @@ sd_init_##type(struct sched_domain_topology_level *tl, int cpu) \
}

SD_INIT_FUNC(CPU)
-#ifdef CONFIG_NUMA
- SD_INIT_FUNC(ALLNODES)
- SD_INIT_FUNC(NODE)
-#endif
#ifdef CONFIG_SCHED_SMT
SD_INIT_FUNC(SIBLING)
#endif
@@ -6338,15 +6250,191 @@ static struct sched_domain_topology_level default_topology[] = {
{ sd_init_BOOK, cpu_book_mask, },
#endif
{ sd_init_CPU, cpu_cpu_mask, },
-#ifdef CONFIG_NUMA
- { sd_init_NODE, cpu_node_mask, SDTL_OVERLAP, },
- { sd_init_ALLNODES, cpu_allnodes_mask, },
-#endif
{ NULL, },
};

static struct sched_domain_topology_level *sched_domain_topology = default_topology;

+#ifdef CONFIG_NUMA
+
+static int sched_domains_numa_levels;
+static int sched_domains_numa_scale;
+static int *sched_domains_numa_distance;
+static struct cpumask ***sched_domains_numa_masks;
+static int sched_domains_curr_level;
+
+static inline unsigned long numa_scale(unsigned long x, int level)
+{
+ return x * sched_domains_numa_distance[level] / sched_domains_numa_scale;
+}
+
+static inline int sd_local_flags(int level)
+{
+ if (sched_domains_numa_distance[level] > REMOTE_DISTANCE)
+ return 0;
+
+ return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;
+}
+
+static struct sched_domain *
+sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
+{
+ struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
+ int level = tl->numa_level;
+ int sd_weight = cpumask_weight(
+ sched_domains_numa_masks[level][cpu_to_node(cpu)]);
+
+ *sd = (struct sched_domain){
+ .min_interval = sd_weight,
+ .max_interval = 2*sd_weight,
+ .busy_factor = 32,
+ .imbalance_pct = 100 + numa_scale(25, level),
+ .cache_nice_tries = 2,
+ .busy_idx = 3,
+ .idle_idx = 2,
+ .newidle_idx = 0,
+ .wake_idx = 0,
+ .forkexec_idx = 0,
+
+ .flags = 1*SD_LOAD_BALANCE
+ | 1*SD_BALANCE_NEWIDLE
+ | 0*SD_BALANCE_EXEC
+ | 0*SD_BALANCE_FORK
+ | 0*SD_BALANCE_WAKE
+ | 0*SD_WAKE_AFFINE
+ | 0*SD_PREFER_LOCAL
+ | 0*SD_SHARE_CPUPOWER
+ | 0*SD_POWERSAVINGS_BALANCE
+ | 0*SD_SHARE_PKG_RESOURCES
+ | 1*SD_SERIALIZE
+ | 0*SD_PREFER_SIBLING
+ | sd_local_flags(level)
+ ,
+ .last_balance = jiffies,
+ .balance_interval = sd_weight,
+ };
+ SD_INIT_NAME(sd, NUMA);
+ sd->private = &tl->data;
+
+ /*
+ * Ugly hack to pass state to sd_numa_mask()...
+ */
+ sched_domains_curr_level = tl->numa_level;
+
+ return sd;
+}
+
+static const struct cpumask *sd_numa_mask(int cpu)
+{
+ return sched_domains_numa_masks[sched_domains_curr_level][cpu_to_node(cpu)];
+}
+
+static void sched_init_numa(void)
+{
+ int next_distance, curr_distance = node_distance(0, 0);
+ struct sched_domain_topology_level *tl;
+ int level = 0;
+ int i, j, k;
+
+ sched_domains_numa_scale = curr_distance;
+ sched_domains_numa_distance = kzalloc(sizeof(int) * nr_node_ids, GFP_KERNEL);
+ if (!sched_domains_numa_distance)
+ return;
+
+ /*
+ * O(nr_nodes^2) deduplicating selection sort -- in order to find the
+ * unique distances in the node_distance() table.
+ *
+ * Assumes node_distance(0,j) includes all distances in
+ * node_distance(i,j) in order to avoid cubic time.
+ *
+ * XXX: could be optimized to O(n log n) by using sort()
+ */
+ next_distance = curr_distance;
+ for (i = 0; i < nr_node_ids; i++) {
+ for (j = 0; j < nr_node_ids; j++) {
+ int distance = node_distance(0, j);
+ if (distance > curr_distance &&
+ (distance < next_distance ||
+ next_distance == curr_distance))
+ next_distance = distance;
+ }
+ if (next_distance != curr_distance) {
+ sched_domains_numa_distance[level++] = next_distance;
+ sched_domains_numa_levels = level;
+ curr_distance = next_distance;
+ } else break;
+ }
+ /*
+ * 'level' contains the number of unique distances, excluding the
+ * identity distance node_distance(i,i).
+ *
+ * The sched_domains_numa_distance[] array includes the actual distance
+ * numbers.
+ */
+
+ sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
+ if (!sched_domains_numa_masks)
+ return;
+
+ /*
+ * Now for each level, construct a mask per node which contains all
+ * cpus of nodes that are that many hops away from us.
+ */
+ for (i = 0; i < level; i++) {
+ sched_domains_numa_masks[i] =
+ kzalloc(nr_node_ids * sizeof(void *), GFP_KERNEL);
+ if (!sched_domains_numa_masks[i])
+ return;
+
+ for (j = 0; j < nr_node_ids; j++) {
+ struct cpumask *mask = kzalloc_node(cpumask_size(), GFP_KERNEL, j);
+ if (!mask)
+ return;
+
+ sched_domains_numa_masks[i][j] = mask;
+
+ for (k = 0; k < nr_node_ids; k++) {
+ if (node_distance(cpu_to_node(j), k) >
+ sched_domains_numa_distance[i])
+ continue;
+
+ cpumask_or(mask, mask, cpumask_of_node(k));
+ }
+ }
+ }
+
+ tl = kzalloc((ARRAY_SIZE(default_topology) + level) *
+ sizeof(struct sched_domain_topology_level), GFP_KERNEL);
+ if (!tl)
+ return;
+
+ /*
+ * Copy the default topology bits..
+ */
+ for (i = 0; default_topology[i].init; i++)
+ tl[i] = default_topology[i];
+
+ /*
+ * .. and append 'j' levels of NUMA goodness.
+ */
+ for (j = 0; j < level; i++, j++) {
+ tl[i] = (struct sched_domain_topology_level){
+ .init = sd_numa_init,
+ .mask = sd_numa_mask,
+ .flags = SDTL_OVERLAP,
+ .numa_level = j,
+ };
+ }
+
+ sched_domain_topology = tl;
+}
+#else
+static inline void sched_init_numa(void)
+{
+}
+#endif /* CONFIG_NUMA */
+
static int __sdt_alloc(const struct cpumask *cpu_map)
{
struct sched_domain_topology_level *tl;
@@ -6840,6 +6928,8 @@ void __init sched_init_smp(void)
alloc_cpumask_var(&non_isolated_cpus, GFP_KERNEL);
alloc_cpumask_var(&fallback_doms, GFP_KERNEL);

+ sched_init_numa();
+
get_online_cpus();
mutex_lock(&sched_domains_mutex);
init_sched_domains(cpu_active_mask);
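
To make the new code easier to follow, here is a minimal stand-alone sketch of the deduplicating distance scan at the heart of sched_init_numa() above. The 4-node distance table is made up for illustration and stands in for the kernel's node_distance(); the kernel operates the same way on the firmware-provided SLIT values.

/* Stand-alone sketch of the unique-distance scan in sched_init_numa().
 * The 4-node table below is a made-up example; the kernel reads the
 * real values via node_distance(). Builds with any C99 compiler. */
#include <stdio.h>

#define NR_NODES 4

static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 15, 20, 20 },
	{ 15, 10, 20, 20 },
	{ 20, 20, 10, 15 },
	{ 20, 20, 15, 10 },
};

int main(void)
{
	int curr = dist[0][0];		/* identity distance, here 10 */
	int levels[NR_NODES];
	int nlevels = 0;

	/* Repeatedly pick the smallest distance larger than the current
	 * one; like the kernel code, only row 0 is scanned, relying on
	 * the assumption that row 0 contains every distance. */
	for (;;) {
		int next = curr;

		for (int j = 0; j < NR_NODES; j++) {
			int d = dist[0][j];

			if (d > curr && (d < next || next == curr))
				next = d;
		}
		if (next == curr)
			break;
		levels[nlevels++] = next;
		curr = next;
	}

	for (int i = 0; i < nlevels; i++)
		printf("level %d: distance %d\n", i, levels[i]);

	return 0;	/* prints levels 15 and 20 for this table */
}

Each level's distance then also feeds numa_scale() in sd_numa_init() above, so e.g. imbalance_pct becomes 100 + 25 * distance / local_distance and remoter levels tolerate proportionally more imbalance.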


2012-05-10 17:30:17

by Yinghai Lu

Subject: Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support

On Wed, May 9, 2012 at 7:29 AM, tip-bot for Peter Zijlstra
<[email protected]> wrote:
> Commit-ID:  cb83b629bae0327cf9f44f096adc38d150ceb913
> Gitweb:     http://git.kernel.org/tip/cb83b629bae0327cf9f44f096adc38d150ceb913
> Author:     Peter Zijlstra <[email protected]>
> AuthorDate: Tue, 17 Apr 2012 15:49:36 +0200
> Committer:  Ingo Molnar <[email protected]>
> CommitDate: Wed, 9 May 2012 15:00:55 +0200
>
> sched/numa: Rewrite the CONFIG_NUMA sched domain support
>
> The current code groups up to 16 nodes in a level and then puts an
> ALLNODES domain spanning the entire tree on top of that. This doesn't
> reflect the actual NUMA topology, and especially for the smaller
> not-fully-connected machines out there today this can make a difference.
>
> Therefore, build a proper NUMA topology based on node_distance().
>
> Since there are no fixed NUMA layers anymore, the static SD_NODE_INIT
> and SD_ALLNODES_INIT initializers are no longer usable; the new code
> constructs something similar and scales some values on the number of
> CPUs in the domain and/or the node_distance() ratio.
>


Not sure if this one or some other change is related...

got this from 8 socket Nehalem-ex box.

[ 25.549259] mtrr_aps_init() done
[ 25.554298] ------------[ cut here ]------------
[ 25.554549] WARNING: at kernel/sched/core.c:6086 build_sched_domains+0x1a9/0x2d0()
[ 25.565131] Hardware name: unknown
[ 25.565318] Modules linked in:
[ 25.584922] Pid: 1, comm: swapper/0 Not tainted 3.4.0-rc6-yh-03548-gecc3211-dirty #312
[ 25.585308] Call Trace:
[ 25.585464] [<ffffffff8106a7d1>] warn_slowpath_common+0x83/0x9b
[ 25.605128] [<ffffffff8106a803>] warn_slowpath_null+0x1a/0x1c
[ 25.624828] [<ffffffff81097628>] build_sched_domains+0x1a9/0x2d0
[ 25.625154] [<ffffffff8113db34>] ? __kmalloc+0x82/0x15c
[ 25.644820] [<ffffffff828e9151>] sched_init_smp+0x7f/0x194
[ 25.645080] [<ffffffff828d0fdc>] kernel_init+0xa7/0x19f
[ 25.664792] [<ffffffff81dd0954>] kernel_thread_helper+0x4/0x10
[ 25.665094] [<ffffffff81dc8a59>] ? retint_restore_args+0xe/0xe
[ 25.684762] [<ffffffff828d0f35>] ? do_initcalls+0xc9/0xc9
[ 25.685019] [<ffffffff81dd0950>] ? gs_change+0xb/0xb
[ 25.704713] ---[ end trace 5003353dd8ff0030 ]---
[ 25.704967] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 25.724721] IP: [<ffffffff813cf408>] __bitmap_weight+0x1a/0x67
[ 25.725011] PGD 0
[ 25.725107] Oops: 0000 [#1] SMP
[ 25.749960] CPU 0
[ 25.750088] Modules linked in:
[ 25.750224]
[ 25.750301] Pid: 1, comm: swapper/0 Tainted: G W 3.4.0-rc6-yh-03548-gecc3211-dirty #312 Oracle Corporation unknown /
[ 25.765035] RIP: 0010:[<ffffffff813cf408>] [<ffffffff813cf408>] __bitmap_weight+0x1a/0x67
[ 25.784842] RSP: 0018:ffff8810374c1e70 EFLAGS: 00010206
[ 25.804557] RAX: 0000000000000003 RBX: 000000000000007f RCX: 0000000000000003
[ 25.804940] RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000000000020
[ 25.824665] RBP: ffff8810374c1e70 R08: 0000000000000020 R09: 0000000000000000
[ 25.844504] R10: 0000000000000000 R11: 0000000000000082 R12: ffff8880373bcfc0
[ 25.844882] R13: 0000000000000000 R14: ffff8880373eae00 R15: fffffffffffffc08
[ 25.864512] FS: 0000000000000000(0000) GS:ffff88103de00000(0000) knlGS:0000000000000000
[ 25.884400] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 25.884695] CR2: 0000000000000020 CR3: 00000000025af000 CR4: 00000000000007f0
[ 25.904389] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 25.904753] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 25.924501] Process swapper/0 (pid: 1, threadinfo ffff8810374c0000, task ffff8810374b8000)
[ 25.944730] Stack:
[ 25.944856] ffff8810374c1ee0 ffffffff81097636 ffff8810374c1ed0 ffffffff8113db34
[ 25.964506] 2222222222222222 ffff8880373ebe00 00000000001d6828 ffff88803706a000
[ 25.964870] ffff8810374b85c8 ffffffff829c73f8 ffff8810374b85c8 00000000000000ff
[ 25.984495] Call Trace:
[ 25.984624] [<ffffffff81097636>] build_sched_domains+0x1b7/0x2d0
[ 26.004343] [<ffffffff8113db34>] ? __kmalloc+0x82/0x15c
[ 26.004607] [<ffffffff828e9151>] sched_init_smp+0x7f/0x194
[ 26.024288] [<ffffffff828d0fdc>] kernel_init+0xa7/0x19f
[ 26.024560] [<ffffffff81dd0954>] kernel_thread_helper+0x4/0x10
[ 26.044222] [<ffffffff81dc8a59>] ? retint_restore_args+0xe/0xe
[ 26.044539] [<ffffffff828d0f35>] ? do_initcalls+0xc9/0xc9
[ 26.064134] [<ffffffff81dd0950>] ? gs_change+0xb/0xb
[ 26.064410] Code: 48 8b 0c d6 48 89 0c d7 48 ff c2 39 d0 7f f1 5d c3 89 f0 b9 40 00 00 00 55 99 49 89 f8 45 31 c9 f7 f9 48 89 e5 31 d2 89 c1 eb 0f <49> 8b 3c d0 48 ff c2 f3 48 0f b8 c7 41 01 c1 39 d1 7f ed 45 31
[ 26.104070] RIP [<ffffffff813cf408>] __bitmap_weight+0x1a/0x67
[ 26.123783] RSP <ffff8810374c1e70>
[ 26.123947] CR2: 0000000000000020
[ 26.124143] ---[ end trace 5003353dd8ff0031 ]---
[ 26.143813] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009

2012-05-10 17:45:05

by Peter Zijlstra

Subject: Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support

On Thu, 2012-05-10 at 10:30 -0700, Yinghai Lu wrote:
> Not sure if this one or some other change is related...
>
> got this from 8 socket Nehalem-ex box.
>
> [ 25.549259] mtrr_aps_init() done
> [ 25.554298] ------------[ cut here ]------------
> [ 25.554549] WARNING: at kernel/sched/core.c:6086 build_sched_domains+0x1a9/0x2d0()

Oops... could you get me the output of:

cat /sys/devices/system/node/node*/distance

for that machine? I'll see if I can reproduce using numa=fake.

2012-05-10 17:54:23

by Yinghai Lu

Subject: Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support

On Thu, May 10, 2012 at 10:44 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2012-05-10 at 10:30 -0700, Yinghai Lu wrote:
>> Not sure if this one or some other change is related...
>>
>> got this from 8 socket Nehalem-ex box.
>>
>> [   25.549259] mtrr_aps_init() done
>> [   25.554298] ------------[ cut here ]------------
>> [   25.554549] WARNING: at kernel/sched/core.c:6086 build_sched_domains+0x1a9/0x2d0()
>
> Oops... could you get me the output of:
>
>  cat /sys/devices/system/node/node*/distance
>
> for that machine? I'll see if I can reproduce using numa=fake.

[ 0.000000] ACPI: SLIT: nodes = 8
[ 0.000000] 10 15 20 15 15 20 20 20
[ 0.000000] 15 10 15 20 20 15 20 20
[ 0.000000] 20 15 10 15 20 20 15 20
[ 0.000000] 15 20 15 10 20 20 20 15
[ 0.000000] 15 20 20 20 10 15 15 20
[ 0.000000] 20 15 20 20 15 10 20 15
[ 0.000000] 20 20 15 20 15 20 10 15
[ 0.000000] 20 20 20 15 20 15 15 10


[root@yhlu-pc2 ~]# cat /sys/devices/system/node/node*/distance
10 15 15 20 15 20 20 20
15 10 20 15 20 15 20 20
15 20 10 15 20 20 15 20
20 15 15 10 20 20 20 15
15 20 20 20 10 15 20 15
20 15 20 20 15 10 15 20
20 20 15 20 20 15 10 15
20 20 20 15 15 20 15 10
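
Plugging the sysfs table above into the rewritten code gives two NUMA levels, for distances 15 and 20. A small stand-alone sketch (the table hard-coded, and bitmasks of node ids standing in for the kernel's per-node cpumasks; note it indexes the table node-to-node, as the follow-up fix later in this thread does) shows the shape of the resulting domains: at distance 15 each node spans only itself plus its three nearest neighbours, while at distance 20 every mask covers all eight nodes.

/* Sketch: per-level node masks from Yinghai's sysfs distance table.
 * Bitmasks of node ids stand in for cpumasks; illustration only. */
#include <stdio.h>

#define N 8

static const int dist[N][N] = {
	{ 10, 15, 15, 20, 15, 20, 20, 20 },
	{ 15, 10, 20, 15, 20, 15, 20, 20 },
	{ 15, 20, 10, 15, 20, 20, 15, 20 },
	{ 20, 15, 15, 10, 20, 20, 20, 15 },
	{ 15, 20, 20, 20, 10, 15, 20, 15 },
	{ 20, 15, 20, 20, 15, 10, 15, 20 },
	{ 20, 20, 15, 20, 20, 15, 10, 15 },
	{ 20, 20, 20, 15, 15, 20, 15, 10 },
};

static const int level_dist[] = { 15, 20 };	/* unique non-identity distances */

int main(void)
{
	for (int i = 0; i < 2; i++) {
		for (int j = 0; j < N; j++) {
			unsigned int mask = 0;

			/* Node k is in node j's level-i mask iff it is no
			 * further away than this level's distance. */
			for (int k = 0; k < N; k++)
				if (dist[j][k] <= level_dist[i])
					mask |= 1u << k;

			printf("level %d node %d: 0x%02x\n", i, j, mask);
		}
	}

	return 0;	/* e.g. level 0 node 0: 0x17; all level 1 masks: 0xff */
}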

2012-05-24 21:23:52

by Tony Luck

Subject: Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support

On Wed, May 9, 2012 at 7:29 AM, tip-bot for Peter Zijlstra
<[email protected]> wrote:
> Commit-ID:  cb83b629bae0327cf9f44f096adc38d150ceb913
> Gitweb:     http://git.kernel.org/tip/cb83b629bae0327cf9f44f096adc38d150ceb913
> Author:     Peter Zijlstra <[email protected]>
> AuthorDate: Tue, 17 Apr 2012 15:49:36 +0200
> Committer:  Ingo Molnar <[email protected]>
> CommitDate: Wed, 9 May 2012 15:00:55 +0200
>
> sched/numa: Rewrite the CONFIG_NUMA sched domain support

This is upstream in Linus' tree now - and seems to be the cause of
an ia64 boot failure. The zonelist that arrives at __alloc_pages_nodemask
is garbage. Changing both the kzalloc_node() calls in sched_init_numa()
into plain kzalloc() calls seems to fix things. So it looks like we are trying
to allocate on a node before the node has been fully set up.

Call Trace:
[<a0000001000165e0>] show_stack+0x80/0xa0
sp=e000000301b7f6f0 bsp=e000000301b71348
[<a000000100016c40>] show_regs+0x640/0x920
sp=e000000301b7f8c0 bsp=e000000301b712f0
[<a0000001000417f0>] die+0x190/0x2c0
sp=e000000301b7f8d0 bsp=e000000301b712b0
[<a000000100074a90>] ia64_do_page_fault+0x6b0/0xac0
sp=e000000301b7f8d0 bsp=e000000301b71258
[<a00000010000c100>] ia64_native_leave_kernel+0x0/0x270
sp=e000000301b7f960 bsp=e000000301b71258
[<a00000010016b3a0>] __alloc_pages_nodemask+0x140/0xce0
sp=e000000301b7fb30 bsp=e000000301b710f0
[<a0000001001ec970>] allocate_slab+0x130/0x3c0
sp=e000000301b7fb50 bsp=e000000301b71098
[<a0000001001ecc40>] new_slab+0x40/0x680
sp=e000000301b7fb50 bsp=e000000301b71040
[<a0000001001ed960>] __slab_alloc+0x6e0/0x8e0
sp=e000000301b7fb50 bsp=e000000301b70fa8
[<a0000001001ef9a0>] kmem_cache_alloc_node+0xc0/0x3a0
sp=e000000301b7fb90 bsp=e000000301b70f70
[<a0000001000df8a0>] sched_init_numa+0x360/0x780
sp=e000000301b7fb90 bsp=e000000301b70ed0
[<a000000100d6be80>] sched_init_smp+0x30/0x300
sp=e000000301b7fbb0 bsp=e000000301b70eb0
[<a000000100d50760>] kernel_init+0x230/0x340
sp=e000000301b7fdb0 bsp=e000000301b70e88
[<a0000001000145f0>] kernel_thread_helper+0x30/0x60
sp=e000000301b7fe30 bsp=e000000301b70e60
[<a00000010000a0c0>] start_kernel_thread+0x20/0x40
sp=e000000301b7fe30 bsp=e000000301b70e60
Disabling lock debugging due to kernel taint

-Tony

2012-05-25 07:31:41

by Peter Zijlstra

Subject: Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support

On Thu, 2012-05-24 at 14:23 -0700, Tony Luck wrote:
> Changing both the kzalloc_node() calls in sched_init_numa()
> into plain kzalloc() calls seems to fix things. So it looks like we are trying
> to allocate on a node before the node has been fully set up.

Right... and it's not too important either, so let's just use regular
allocations.

That said, I can only find the one kzalloc_node() call in sched_init_numa()


---
Subject: sched: Don't try allocating memory from offline nodes
From: Peter Zijlstra <[email protected]>
Date: Fri May 25 09:26:43 CEST 2012

Allocators don't appreciate it when you try and allocate memory from
offline nodes.

Reported-by: Tony Luck <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/sched/core.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/sched/core.c
===================================================================
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -6449,7 +6449,7 @@ static void sched_init_numa(void)
return;

for (j = 0; j < nr_node_ids; j++) {
- struct cpumask *mask = kzalloc_node(cpumask_size(), GFP_KERNEL, j);
+ struct cpumask *mask = kzalloc(cpumask_size(), GFP_KERNEL);
if (!mask)
return;
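
For context on the failure mode: this early in boot a node's page allocator state may not be initialized yet, so a node-targeted allocation can hand __alloc_pages_nodemask a garbage zonelist, which is what Tony's trace shows. A purely hypothetical defensive variant (not the applied fix, which simply drops the node preference) would check the node first -- though even node_online() is an assumption here, since a node can conceivably be marked online before its zonelists are usable at this point in boot, which is presumably why plain kzalloc() won:

/* Hypothetical guarded variant -- shown only to illustrate the
 * failure mode; the patch above just uses plain kzalloc() instead. */
static struct cpumask *sched_numa_alloc_mask(int node)
{
	if (node_online(node))
		return kzalloc_node(cpumask_size(), GFP_KERNEL, node);

	return kzalloc(cpumask_size(), GFP_KERNEL);
}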


2012-05-25 14:24:58

by Tony Luck

Subject: Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support

On Fri, May 25, 2012 at 12:31 AM, Peter Zijlstra <[email protected]> wrote:
> Right... and it's not too important either, so let's just use regular
> allocations.

Thanks.

> That said, I can only find the one kzalloc_node() call in sched_init_numa()

Doh - I must have searched for the next one, and not noticed that
I had skipped into a different function.

-Tony

2012-05-25 16:26:42

by Tony Luck

Subject: Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support

On Fri, May 25, 2012 at 7:24 AM, Tony Luck <[email protected]> wrote:
>> That said, I can only find the one kzalloc_node() call in sched_init_numa()

Just to complete the loop - your patch is good ... it isn't necessary to
also change another random kzalloc_node() in an unrelated function
that just happens to be where "n" in vi jumps to :-)

Tested-by: Tony Luck <[email protected]>

2012-05-29 00:19:44

by Anton Blanchard

Subject: Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support


Hi Peter,

We have a number of ppc64 boxes that are hitting this and have
verified that the patch fixes it.

Tested-by: Anton Blanchard <[email protected]>

Thanks!
Anton

---
Subject: sched: Don't try allocating memory from offline nodes
From: Peter Zijlstra <[email protected]>
Date: Fri May 25 09:26:43 CEST 2012

Allocators don't appreciate it when you try and allocate memory from
offline nodes.

Reported-by: Tony Luck <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/sched/core.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/sched/core.c
===================================================================
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -6449,7 +6449,7 @@ static void sched_init_numa(void)
return;

for (j = 0; j < nr_node_ids; j++) {
- struct cpumask *mask = kzalloc_node(cpumask_size(), GFP_KERNEL, j);
+ struct cpumask *mask = kzalloc(cpumask_size(), GFP_KERNEL);
if (!mask)
return;


2012-05-29 00:32:31

by Jiang Liu

Subject: Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support

Hi Yinghai,
Does this patch fix your issue? https://lkml.org/lkml/2012/5/9/183.
I have encountered a similar issue on an IA64 platform and the patch above
works around it. But the root cause is a BIOS bug: the order of CPUs
in the MADT table doesn't conform to the ACPI specification and the first CPU
in the MADT is not the BSP, which breaks some assumptions of the boot code
and causes the core dump.
Thanks!

On 05/11/2012 01:54 AM, Yinghai Lu wrote:
> On Thu, May 10, 2012 at 10:44 AM, Peter Zijlstra <[email protected]> wrote:
>> On Thu, 2012-05-10 at 10:30 -0700, Yinghai Lu wrote:
>>> Not sure if this one or some other change is related...
>>>
>>> got this from 8 socket Nehalem-ex box.
>>>
>>> [ 25.549259] mtrr_aps_init() done
>>> [ 25.554298] ------------[ cut here ]------------
>>> [ 25.554549] WARNING: at kernel/sched/core.c:6086 build_sched_domains+0x1a9/0x2d0()
>>
>> Oops... could you get me the output of:
>>
>> cat /sys/devices/system/node/node*/distance
>>
>> for that machine? I'll see if I can reproduce using numa=fake.
>
> [ 0.000000] ACPI: SLIT: nodes = 8
> [ 0.000000] 10 15 20 15 15 20 20 20
> [ 0.000000] 15 10 15 20 20 15 20 20
> [ 0.000000] 20 15 10 15 20 20 15 20
> [ 0.000000] 15 20 15 10 20 20 20 15
> [ 0.000000] 15 20 20 20 10 15 15 20
> [ 0.000000] 20 15 20 20 15 10 20 15
> [ 0.000000] 20 20 15 20 15 20 10 15
> [ 0.000000] 20 20 20 15 20 15 15 10
>
>
> [root@yhlu-pc2 ~]# cat /sys/devices/system/node/node*/distance
> 10 15 15 20 15 20 20 20
> 15 10 20 15 20 15 20 20
> 15 20 10 15 20 20 15 20
> 20 15 15 10 20 20 20 15
> 15 20 20 20 10 15 20 15
> 20 15 20 20 15 10 15 20
> 20 20 15 20 20 15 10 15
> 20 20 20 15 15 20 15 10

2012-05-29 12:14:44

by Peter Zijlstra

Subject: Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support

On Tue, 2012-05-29 at 08:32 +0800, Jiang Liu wrote:
> Does this patch fix your issue? https://lkml.org/lkml/2012/5/9/183.
> I have encountered a similar issue on an IA64 platform and the patch above
> works around it. But the root cause is a BIOS bug: the order of CPUs
> in the MADT table doesn't conform to the ACPI specification and the first CPU
> in the MADT is not the BSP, which breaks some assumptions of the boot code
> and causes the core dump.

Is it the IA64 arch code that contains those false assumptions, or is it
generic (sched) code? Especially in the latter case I'd be
very interested to hear where these are so we can fix them.

2012-05-29 17:12:45

by Yinghai Lu

Subject: Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support

On Mon, May 28, 2012 at 5:32 PM, Jiang Liu <[email protected]> wrote:
> Hi Yinghai,
>        Does this patch fix your issue? https://lkml.org/lkml/2012/5/9/183.
> I have encountered a similar issue on an IA64 platform and the patch above
> works around it. But the root cause is a BIOS bug: the order of CPUs
> in the MADT table doesn't conform to the ACPI specification and the first CPU
> in the MADT is not the BSP, which breaks some assumptions of the boot code
> and causes the core dump.

Yes, with another patch from PeterZ.

---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6396,8 +6396,7 @@ static void sched_init_numa(void)
sched_domains_numa_masks[i][j] = mask;

for (k = 0; k < nr_node_ids; k++) {
- if (node_distance(cpu_to_node(j), k) >
- sched_domains_numa_distance[i])
+ if (node_distance(j, k) > sched_domains_numa_distance[i])
continue;

cpumask_or(mask, mask, cpumask_of_node(k));
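
To spell out what that second patch fixes: in the mask-building loop both j and k are node ids, but the original code passed j through cpu_to_node(), treating a node id as a CPU id. On machines where CPU and node numbering don't line up, the masks were therefore built from the wrong rows of the distance table. A reformatted paraphrase of the corrected loop (same logic as the diff above):

/* j and k both iterate node ids, so node_distance() is applied to
 * them directly; no cpu_to_node() translation is involved. */
for (k = 0; k < nr_node_ids; k++) {
	if (node_distance(j, k) > sched_domains_numa_distance[i])
		continue;	/* node k is beyond this level's reach */

	cpumask_or(mask, mask, cpumask_of_node(k));
}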

2012-06-05 07:16:21

by Alex Shi

Subject: Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support

LKP performance testing sets 'mem=2g' for some benchmarks; that cmdline
hit a kernel panic in __alloc_pages_nodemask on 3.5-rc1, and this patch
fixes it.
Thanks!

Reported-and-tested-by: [email protected]


On Fri, May 25, 2012 at 3:31 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2012-05-24 at 14:23 -0700, Tony Luck wrote:
>> Changing both the kzalloc_node() calls in sched_init_numa()
>> into plain kzalloc() calls seems to fix things. So it looks like we are trying
>> to allocate on a node before the node has been fully set up.
>
> Right... and it's not too important either, so let's just use regular
> allocations.
>
> That said, I can only find the one kzalloc_node() call in sched_init_numa()
>
>
> ---
> Subject: sched: Don't try allocating memory from offline nodes
> From: Peter Zijlstra <[email protected]>
> Date: Fri May 25 09:26:43 CEST 2012
>
> Allocators don't appreciate it when you try and allocate memory from
> offline nodes.
>
> Reported-by: Tony Luck <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
>  kernel/sched/core.c |    6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> Index: linux-2.6/kernel/sched/core.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched/core.c
> +++ linux-2.6/kernel/sched/core.c
> @@ -6449,7 +6449,7 @@ static void sched_init_numa(void)
> 			return;
>
> 		for (j = 0; j < nr_node_ids; j++) {
> -			struct cpumask *mask = kzalloc_node(cpumask_size(), GFP_KERNEL, j);
> +			struct cpumask *mask = kzalloc(cpumask_size(), GFP_KERNEL);
> 			if (!mask)
> 				return;
>

2012-06-06 07:43:32

by Alex Shi

Subject: Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support

> +	/*
> +	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
> +	 * unique distances in the node_distance() table.
> +	 *
> +	 * Assumes node_distance(0,j) includes all distances in
> +	 * node_distance(i,j) in order to avoid cubic time.

Curious about other platforms' node_distance numbers. Actually, this
assumption is right for the Intel platforms I have seen, but it does not
match the ACPI 5.0 spec (acpispec50.pdf):

Table 6-152 Example Relative Distances Between Proximity Domains
Proximity Domain 0 1 2 3
0 10 15 20 18
1 15 10 16 24
2 20 16 10 12
3 18 24 12 10


Alex

2012-06-06 09:15:59

by Peter Zijlstra

Subject: Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support

On Wed, 2012-06-06 at 15:43 +0800, Alex Shi wrote:
> > + /*
> > + * O(nr_nodes^2) deduplicating selection sort -- in order to find the
> > + * unique distances in the node_distance() table.
> > + *
> > + * Assumes node_distance(0,j) includes all distances in
> > + * node_distance(i,j) in order to avoid cubic time.
>
> Curious about other platforms' node_distance numbers. Actually, this
> assumption is right for the Intel platforms I have seen, but it does not
> match the ACPI 5.0 spec (acpispec50.pdf):
>
> Table 6-152 Example Relative Distances Between Proximity Domains
> Proximity Domain 0 1 2 3
> 0 10 15 20 18
> 1 15 10 16 24
> 2 20 16 10 12
> 3 18 24 12 10

Yes, I know it's allowed, I just haven't seen it in practice.

I've got a patch that validates this assumption if you boot with
"sched_debug". If we ever run into such a setup we might need to fix
this -- it shouldn't be too hard, just expensive.
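
Peter's validation patch isn't included in this thread. A minimal sketch of such a check (hypothetical, and deliberately the 'expensive' cubic walk he alludes to) would verify that every distance in the table also shows up in row 0, since that is exactly what sched_init_numa() assumes. With Alex's ACPI example above, row 0 contains only {10, 15, 18, 20}, so the distances 12, 16 and 24 found elsewhere in the table would be missed:

/* Hypothetical sanity check for the row-0 assumption in
 * sched_init_numa(): every node_distance(i, j) must also occur as
 * node_distance(0, k) for some k, or a NUMA level will be missed. */
static bool sched_numa_row0_covers_all_distances(void)
{
	int i, j, k;

	for (i = 0; i < nr_node_ids; i++) {
		for (j = 0; j < nr_node_ids; j++) {
			bool found = false;

			for (k = 0; k < nr_node_ids; k++) {
				if (node_distance(0, k) == node_distance(i, j)) {
					found = true;
					break;
				}
			}
			if (!found)
				return false;
		}
	}

	return true;
}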

2012-06-07 00:36:48

by Alex Shi

Subject: Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support

On 06/06/2012 05:15 PM, Peter Zijlstra wrote:

> On Wed, 2012-06-06 at 15:43 +0800, Alex Shi wrote:
>>> + /*
>>> + * O(nr_nodes^2) deduplicating selection sort -- in order to find the
>>> + * unique distances in the node_distance() table.
>>> + *
>>> + * Assumes node_distance(0,j) includes all distances in
>>> + * node_distance(i,j) in order to avoid cubic time.
>>
>> Curious about other platforms' node_distance numbers. Actually, this
>> assumption is right for the Intel platforms I have seen, but it does not
>> match the ACPI 5.0 spec (acpispec50.pdf):
>>
>> Table 6-152 Example Relative Distances Between Proximity Domains
>> Proximity Domain 0 1 2 3
>> 0 10 15 20 18
>> 1 15 10 16 24
>> 2 20 16 10 12
>> 3 18 24 12 10
>
> Yes, I know it's allowed, I just haven't seen it in practice.


I see. Thanks.

>
> I've got a patch that validates this assumption if you boot with
> "sched_debug". If we ever run into such a setup we might need to fix
> this -- it shouldn't be too hard, just expensive.


Sure.