This code is in no way complete. But since I brought it up in the
"cpusets - big numa cpu and memory placement" thread, I figure the code
needs to be posted.
The basic idea is as follows:
1) Rip out sched_groups and move them into the sched_domains.
2) Add some reference counting, and eventually locking, to
sched_domains.
3) Rewrite & simplify the way sched_domains are built and linked into a
cohesive tree.
This should allow us to support hotplug more easily, simply removing the
domain belonging to the going-away CPU, rather than throwing away the
whole domain tree and rebuilding from scratch. This should also allow
us to support multiple, independent (ie: no shared root) domain trees
which will facilitate isolated CPU groups and exclusive domains. I also
hope this will allow us to leverage the existing topology infrastructure
to build domains that closely resemble the physical structure of the
machine automagically, thus making supporting interesting NUMA machines
and SMT machines easier.
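In outline (just distilling the struct change from the patch below), each
sched_domain becomes a node in a refcounted tree instead of pointing at a
ring of sched_groups:

	struct sched_domain {
		struct sched_domain *parent;	/* NULL at the root of a tree */
		struct list_head children;	/* this domain's child domains */
		struct list_head peers;		/* links us into parent->children */
		cpumask_t span;			/* all CPUs covered by this domain */
		atomic_t refcnt;		/* references to this domain */
		unsigned long cpu_power;	/* aggregate power of the span */
		/* ...balancing parameters unchanged... */
	};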
This patch is just a snapshot in the middle of development, so there are
certainly some uglies & bugs that will get fixed. That said, any
comments about the general design are strongly encouraged. Heck, any
feedback at all is welcome! :)
Patch against 2.6.9-rc3-mm2.
[mcd@arrakis mcd]$ diffstat ~/linux/patches/sched_domains/mcd-sched_domains-changes.patch
include/asm-i386/topology.h | 5
include/asm-ia64/topology.h | 5
include/asm-ppc64/topology.h | 5
include/asm-x86_64/topology.h | 5
include/linux/sched.h | 21 -
include/linux/topology.h | 10
kernel/sched.c | 710 +++++++++++++++++++-----------------------
7 files changed, 360 insertions(+), 401 deletions(-)
-Matt
diff -Nurp --exclude-from=/home/mcd/.dontdiff linux-2.6.9-rc3-mm2/include/asm-i386/topology.h linux-2.6.9-rc3-mm2+mcd_sd/include/asm-i386/topology.h
--- linux-2.6.9-rc3-mm2/include/asm-i386/topology.h 2004-10-04 14:38:56.000000000 -0700
+++ linux-2.6.9-rc3-mm2+mcd_sd/include/asm-i386/topology.h 2004-10-05 16:34:35.000000000 -0700
@@ -74,9 +74,10 @@ static inline cpumask_t pcibus_to_cpumas
/* sched_domains SD_NODE_INIT for NUMAQ machines */
#define SD_NODE_INIT (struct sched_domain) { \
- .span = CPU_MASK_NONE, \
.parent = NULL, \
- .groups = NULL, \
+ .span = CPU_MASK_NONE, \
+ .refcnt = ATOMIC_INIT(0), \
+ .cpu_power = 0, \
.min_interval = 8, \
.max_interval = 32, \
.busy_factor = 32, \
diff -Nurp --exclude-from=/home/mcd/.dontdiff linux-2.6.9-rc3-mm2/include/asm-ia64/topology.h linux-2.6.9-rc3-mm2+mcd_sd/include/asm-ia64/topology.h
--- linux-2.6.9-rc3-mm2/include/asm-ia64/topology.h 2004-10-04 14:38:56.000000000 -0700
+++ linux-2.6.9-rc3-mm2+mcd_sd/include/asm-ia64/topology.h 2004-10-05 16:36:57.000000000 -0700
@@ -47,9 +47,10 @@ void build_cpu_to_node_map(void);
/* sched_domains SD_NODE_INIT for IA64 NUMA machines */
#define SD_NODE_INIT (struct sched_domain) { \
- .span = CPU_MASK_NONE, \
.parent = NULL, \
- .groups = NULL, \
+ .span = CPU_MASK_NONE, \
+ .refcnt = ATOMIC_INIT(0), \
+ .cpu_power = 0, \
.min_interval = 80, \
.max_interval = 320, \
.busy_factor = 320, \
diff -Nurp --exclude-from=/home/mcd/.dontdiff linux-2.6.9-rc3-mm2/include/asm-ppc64/topology.h linux-2.6.9-rc3-mm2+mcd_sd/include/asm-ppc64/topology.h
--- linux-2.6.9-rc3-mm2/include/asm-ppc64/topology.h 2004-10-04 14:38:56.000000000 -0700
+++ linux-2.6.9-rc3-mm2+mcd_sd/include/asm-ppc64/topology.h 2004-10-05 16:35:55.000000000 -0700
@@ -42,9 +42,10 @@ static inline int node_to_first_cpu(int
/* sched_domains SD_NODE_INIT for PPC64 machines */
#define SD_NODE_INIT (struct sched_domain) { \
- .span = CPU_MASK_NONE, \
.parent = NULL, \
- .groups = NULL, \
+ .span = CPU_MASK_NONE, \
+ .refcnt = ATOMIC_INIT(0), \
+ .cpu_power = 0, \
.min_interval = 8, \
.max_interval = 32, \
.busy_factor = 32, \
diff -Nurp --exclude-from=/home/mcd/.dontdiff linux-2.6.9-rc3-mm2/include/asm-x86_64/topology.h linux-2.6.9-rc3-mm2+mcd_sd/include/asm-x86_64/topology.h
--- linux-2.6.9-rc3-mm2/include/asm-x86_64/topology.h 2004-10-04 14:38:57.000000000 -0700
+++ linux-2.6.9-rc3-mm2+mcd_sd/include/asm-x86_64/topology.h 2004-10-05 16:35:48.000000000 -0700
@@ -37,9 +37,10 @@ static inline cpumask_t __pcibus_to_cpum
#ifdef CONFIG_NUMA
/* sched_domains SD_NODE_INIT for x86_64 machines */
#define SD_NODE_INIT (struct sched_domain) { \
- .span = CPU_MASK_NONE, \
.parent = NULL, \
- .groups = NULL, \
+ .span = CPU_MASK_NONE, \
+ .refcnt = ATOMIC_INIT(0), \
+ .cpu_power = 0, \
.min_interval = 8, \
.max_interval = 32, \
.busy_factor = 32, \
diff -Nurp --exclude-from=/home/mcd/.dontdiff linux-2.6.9-rc3-mm2/include/linux/sched.h linux-2.6.9-rc3-mm2+mcd_sd/include/linux/sched.h
--- linux-2.6.9-rc3-mm2/include/linux/sched.h 2004-10-04 14:38:59.000000000 -0700
+++ linux-2.6.9-rc3-mm2+mcd_sd/include/linux/sched.h 2004-10-05 16:33:08.000000000 -0700
@@ -414,22 +414,17 @@ enum idle_type
#define SD_WAKE_BALANCE 32 /* Perform balancing at task wakeup */
#define SD_SHARE_CPUPOWER 64 /* Domain members share cpu power */
-struct sched_group {
- struct sched_group *next; /* Must be a circular list */
- cpumask_t cpumask;
-
- /*
- * CPU power of this group, SCHED_LOAD_SCALE being max power for a
- * single CPU. This is read only (except for setup, hotplug CPU).
- */
- unsigned long cpu_power;
-};
-
struct sched_domain {
/* These fields must be setup */
struct sched_domain *parent; /* top domain must be null terminated */
- struct sched_group *groups; /* the balancing groups of the domain */
+ struct list_head children; /* link to this domain's children */
+ struct list_head peers; /* peer balancing domains of this domain */
cpumask_t span; /* span of all CPUs in this domain */
+ atomic_t refcnt; /* number of references to this domain */
+ unsigned long cpu_power; /* CPU power of this domain,
+ * SCHED_LOAD_SCALE being max power for
+ * a single CPU. This is read only (except
+ * for setup, hotplug CPU). */
unsigned long min_interval; /* Minimum balance interval ms */
unsigned long max_interval; /* Maximum balance interval ms */
unsigned int busy_factor; /* less balancing by factor if busy */
@@ -465,8 +460,6 @@ struct sched_domain {
#ifdef ARCH_HAS_SCHED_DOMAIN
/* Useful helpers that arch setup code may use. Defined in kernel/sched.c */
extern cpumask_t cpu_isolated_map;
-extern void init_sched_build_groups(struct sched_group groups[],
- cpumask_t span, int (*group_fn)(int cpu));
extern void cpu_attach_domain(struct sched_domain *sd, int cpu);
#endif /* ARCH_HAS_SCHED_DOMAIN */
#endif /* CONFIG_SMP */
diff -Nurp --exclude-from=/home/mcd/.dontdiff linux-2.6.9-rc3-mm2/include/linux/topology.h linux-2.6.9-rc3-mm2+mcd_sd/include/linux/topology.h
--- linux-2.6.9-rc3-mm2/include/linux/topology.h 2004-10-04 14:38:59.000000000 -0700
+++ linux-2.6.9-rc3-mm2+mcd_sd/include/linux/topology.h 2004-10-05 16:34:07.000000000 -0700
@@ -80,9 +80,10 @@ static inline int __next_node_with_cpus(
/* Common values for SMT siblings */
#ifndef SD_SIBLING_INIT
#define SD_SIBLING_INIT (struct sched_domain) { \
- .span = CPU_MASK_NONE, \
.parent = NULL, \
- .groups = NULL, \
+ .span = CPU_MASK_NONE, \
+ .refcnt = ATOMIC_INIT(0), \
+ .cpu_power = 0, \
.min_interval = 1, \
.max_interval = 2, \
.busy_factor = 8, \
@@ -106,9 +107,10 @@ static inline int __next_node_with_cpus(
/* Common values for CPUs */
#ifndef SD_CPU_INIT
#define SD_CPU_INIT (struct sched_domain) { \
- .span = CPU_MASK_NONE, \
.parent = NULL, \
- .groups = NULL, \
+ .span = CPU_MASK_NONE, \
+ .refcnt = ATOMIC_INIT(0), \
+ .cpu_power = 0, \
.min_interval = 1, \
.max_interval = 4, \
.busy_factor = 64, \
diff -Nurp --exclude-from=/home/mcd/.dontdiff linux-2.6.9-rc3-mm2/kernel/sched.c linux-2.6.9-rc3-mm2+mcd_sd/kernel/sched.c
--- linux-2.6.9-rc3-mm2/kernel/sched.c 2004-10-04 14:39:00.000000000 -0700
+++ linux-2.6.9-rc3-mm2+mcd_sd/kernel/sched.c 2004-10-05 17:19:17.000000000 -0700
@@ -2209,32 +2209,35 @@ out:
}
/*
- * find_busiest_group finds and returns the busiest CPU group within the
+ * find_busiest_child finds and returns the busiest child domain of the given
* domain. It calculates and returns the number of tasks which should be
* moved to restore balance via the imbalance parameter.
*/
-static struct sched_group *
-find_busiest_group(struct sched_domain *sd, int this_cpu,
+static struct sched_domain *
+find_busiest_child(struct sched_domain *sd, int cpu,
unsigned long *imbalance, enum idle_type idle)
{
- struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
- unsigned long max_load, avg_load, total_load, this_load, total_pwr;
+ struct sched_domain *busiest_domain, *local_domain, *domain;
+ unsigned long max_load, avg_load, total_load, local_load, total_pwr;
- max_load = this_load = total_load = total_pwr = 0;
-
- do {
+ max_load = local_load = total_load = total_pwr = 0;
+ busiest_domain = local_domain = NULL;
+ list_for_each_entry(domain, &sd->children, peers) {
unsigned long load;
- int local_group;
int i, nr_cpus = 0;
- local_group = cpu_isset(this_cpu, group->cpumask);
+ if (unlikely(cpus_empty(domain->span)))
+ continue;
+
+ if (cpu_isset(cpu, domain->span))
+ local_domain = domain;
- /* Tally up the load of all CPUs in the group */
+ /* Tally up the load of all CPUs in the domain */
avg_load = 0;
- for_each_cpu_mask(i, group->cpumask) {
+ for_each_cpu_mask(i, domain->span) {
/* Bias balancing toward cpus of our domain */
- if (local_group)
+ if (domain == local_domain)
load = target_load(i);
else
load = source_load(i);
@@ -2243,34 +2246,27 @@ find_busiest_group(struct sched_domain *
avg_load += load;
}
- if (!nr_cpus)
- goto nextgroup;
-
total_load += avg_load;
- total_pwr += group->cpu_power;
+ total_pwr += domain->cpu_power;
- /* Adjust by relative CPU power of the group */
- avg_load = (avg_load * SCHED_LOAD_SCALE) / group->cpu_power;
+ /* Adjust by relative CPU power of the domain */
+ avg_load = (avg_load * SCHED_LOAD_SCALE) / domain->cpu_power;
- if (local_group) {
- this_load = avg_load;
- this = group;
- goto nextgroup;
+ if (domain == local_domain) {
+ local_load = avg_load;
} else if (avg_load > max_load) {
max_load = avg_load;
- busiest = group;
+ busiest_domain = domain;
}
-nextgroup:
- group = group->next;
- } while (group != sd->groups);
+ }
- if (!busiest || this_load >= max_load)
+ if (!busiest_domain || local_load >= max_load)
goto out_balanced;
avg_load = (SCHED_LOAD_SCALE * total_load) / total_pwr;
- if (this_load >= avg_load ||
- 100*max_load <= sd->imbalance_pct*this_load)
+ if (local_load >= avg_load ||
+ 100*max_load <= sd->imbalance_pct*local_load)
goto out_balanced;
/*
@@ -2284,19 +2280,19 @@ nextgroup:
* by pulling tasks to us. Be careful of negative numbers as they'll
* appear as very large values with unsigned longs.
*/
- *imbalance = min(max_load - avg_load, avg_load - this_load);
+ *imbalance = min(max_load - avg_load, avg_load - local_load);
/* How much load to actually move to equalise the imbalance */
- *imbalance = (*imbalance * min(busiest->cpu_power, this->cpu_power))
+ *imbalance = (*imbalance * min(busiest_domain->cpu_power, local_domain->cpu_power))
/ SCHED_LOAD_SCALE;
if (*imbalance < SCHED_LOAD_SCALE - 1) {
unsigned long pwr_now = 0, pwr_move = 0;
unsigned long tmp;
- if (max_load - this_load >= SCHED_LOAD_SCALE*2) {
+ if (max_load - local_load >= SCHED_LOAD_SCALE*2) {
*imbalance = 1;
- return busiest;
+ return busiest_domain;
}
/*
@@ -2305,21 +2301,21 @@ nextgroup:
* moving them.
*/
- pwr_now += busiest->cpu_power*min(SCHED_LOAD_SCALE, max_load);
- pwr_now += this->cpu_power*min(SCHED_LOAD_SCALE, this_load);
+ pwr_now += busiest_domain->cpu_power*min(SCHED_LOAD_SCALE, max_load);
+ pwr_now += local_domain->cpu_power*min(SCHED_LOAD_SCALE, local_load);
pwr_now /= SCHED_LOAD_SCALE;
/* Amount of load we'd subtract */
- tmp = SCHED_LOAD_SCALE*SCHED_LOAD_SCALE/busiest->cpu_power;
+ tmp = SCHED_LOAD_SCALE*SCHED_LOAD_SCALE/busiest_domain->cpu_power;
if (max_load > tmp)
- pwr_move += busiest->cpu_power*min(SCHED_LOAD_SCALE,
+ pwr_move += busiest_domain->cpu_power*min(SCHED_LOAD_SCALE,
max_load - tmp);
/* Amount of load we'd add */
- tmp = SCHED_LOAD_SCALE*SCHED_LOAD_SCALE/this->cpu_power;
+ tmp = SCHED_LOAD_SCALE*SCHED_LOAD_SCALE/local_domain->cpu_power;
if (max_load < tmp)
tmp = max_load;
- pwr_move += this->cpu_power*min(SCHED_LOAD_SCALE, this_load + tmp);
+ pwr_move += local_domain->cpu_power*min(SCHED_LOAD_SCALE, local_load + tmp);
pwr_move /= SCHED_LOAD_SCALE;
/* Move if we gain another 8th of a CPU worth of throughput */
@@ -2327,19 +2323,19 @@ nextgroup:
goto out_balanced;
*imbalance = 1;
- return busiest;
+ return busiest_domain;
}
/* Get rid of the scaling factor, rounding down as we divide */
*imbalance = (*imbalance + 1) / SCHED_LOAD_SCALE;
- return busiest;
+ return busiest_domain;
out_balanced:
- if (busiest && (idle == NEWLY_IDLE ||
+ if (busiest_domain && (idle == NEWLY_IDLE ||
(idle == SCHED_IDLE && max_load > SCHED_LOAD_SCALE)) ) {
*imbalance = 1;
- return busiest;
+ return busiest_domain;
}
*imbalance = 0;
@@ -2347,15 +2343,15 @@ out_balanced:
}
/*
- * find_busiest_queue - find the busiest runqueue among the cpus in group.
+ * find_busiest_queue - find the busiest runqueue among the cpus in @domain
*/
-static runqueue_t *find_busiest_queue(const struct sched_group *group)
+static runqueue_t *find_busiest_queue(const struct sched_domain *domain)
{
unsigned long load, max_load = 0;
runqueue_t *busiest = NULL;
int i;
- for_each_cpu_mask(i, group->cpumask) {
+ for_each_cpu_mask(i, domain->span) {
load = source_load(i);
if (load > max_load) {
@@ -2368,30 +2364,30 @@ static runqueue_t *find_busiest_queue(co
}
/*
- * Check this_cpu to ensure it is balanced within domain. Attempt to move
+ * Check @cpu to ensure it is balanced within @sd. Attempt to move
* tasks if there is an imbalance.
*
- * Called with this_rq unlocked.
+ * Called with @rq unlocked.
*/
-static int load_balance(int this_cpu, runqueue_t *this_rq,
- struct sched_domain *sd, enum idle_type idle)
+static int load_balance(int cpu, runqueue_t *rq, struct sched_domain *sd,
+ enum idle_type idle)
{
- struct sched_group *group;
- runqueue_t *busiest;
+ struct sched_domain *domain;
+ runqueue_t *busiest_rq;
unsigned long imbalance;
int nr_moved;
- spin_lock(&this_rq->lock);
+ spin_lock(&rq->lock);
schedstat_inc(sd, lb_cnt[idle]);
- group = find_busiest_group(sd, this_cpu, &imbalance, idle);
- if (!group) {
+ domain = find_busiest_child(sd, cpu, &imbalance, idle);
+ if (!domain) {
schedstat_inc(sd, lb_nobusyg[idle]);
goto out_balanced;
}
- busiest = find_busiest_queue(group);
- if (!busiest) {
+ busiest_rq = find_busiest_queue(domain);
+ if (!busiest_rq) {
schedstat_inc(sd, lb_nobusyq[idle]);
goto out_balanced;
}
@@ -2401,7 +2397,7 @@ static int load_balance(int this_cpu, ru
* balancing is inherently racy and statistical,
* it could happen in theory.
*/
- if (unlikely(busiest == this_rq)) {
+ if (unlikely(busiest_rq == rq)) {
WARN_ON(1);
goto out_balanced;
}
@@ -2409,19 +2405,18 @@ static int load_balance(int this_cpu, ru
schedstat_add(sd, lb_imbalance[idle], imbalance);
nr_moved = 0;
- if (busiest->nr_running > 1) {
+ if (busiest_rq->nr_running > 1) {
/*
- * Attempt to move tasks. If find_busiest_group has found
- * an imbalance but busiest->nr_running <= 1, the group is
+ * Attempt to move tasks. If find_busiest_child has found
+ * an imbalance but busiest_rq->nr_running <= 1, the domain is
* still unbalanced. nr_moved simply stays zero, so it is
* correctly treated as an imbalance.
*/
- double_lock_balance(this_rq, busiest);
- nr_moved = move_tasks(this_rq, this_cpu, busiest,
- imbalance, sd, idle);
- spin_unlock(&busiest->lock);
+ double_lock_balance(rq, busiest_rq);
+ nr_moved = move_tasks(rq, cpu, busiest_rq, imbalance, sd, idle);
+ spin_unlock(&busiest_rq->lock);
}
- spin_unlock(&this_rq->lock);
+ spin_unlock(&rq->lock);
if (!nr_moved) {
schedstat_inc(sd, lb_failed[idle]);
@@ -2430,15 +2425,15 @@ static int load_balance(int this_cpu, ru
if (unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2)) {
int wake = 0;
- spin_lock(&busiest->lock);
- if (!busiest->active_balance) {
- busiest->active_balance = 1;
- busiest->push_cpu = this_cpu;
+ spin_lock(&busiest_rq->lock);
+ if (!busiest_rq->active_balance) {
+ busiest_rq->active_balance = 1;
+ busiest_rq->push_cpu = cpu;
wake = 1;
}
- spin_unlock(&busiest->lock);
+ spin_unlock(&busiest_rq->lock);
if (wake)
- wake_up_process(busiest->migration_thread);
+ wake_up_process(busiest_rq->migration_thread);
/*
* We've kicked active balancing, reset the failure
@@ -2455,7 +2450,7 @@ static int load_balance(int this_cpu, ru
return nr_moved;
out_balanced:
- spin_unlock(&this_rq->lock);
+ spin_unlock(&rq->lock);
/* tune up the balancing interval */
if (sd->balance_interval < sd->max_interval)
@@ -2465,59 +2460,57 @@ out_balanced:
}
/*
- * Check this_cpu to ensure it is balanced within domain. Attempt to move
+ * Check @cpu to ensure it is balanced within @sd. Attempt to move
* tasks if there is an imbalance.
*
- * Called from schedule when this_rq is about to become idle (NEWLY_IDLE).
- * this_rq is locked.
+ * Called from schedule() when @rq is about to become idle (NEWLY_IDLE).
+ * @rq is locked.
*/
-static int load_balance_newidle(int this_cpu, runqueue_t *this_rq,
- struct sched_domain *sd)
+static int load_balance_newidle(int cpu, runqueue_t *rq, struct sched_domain *sd)
{
- struct sched_group *group;
- runqueue_t *busiest = NULL;
+ struct sched_domain *domain;
+ runqueue_t *busiest_rq = NULL;
unsigned long imbalance;
int nr_moved = 0;
schedstat_inc(sd, lb_cnt[NEWLY_IDLE]);
- group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE);
- if (!group) {
+ domain = find_busiest_child(sd, cpu, &imbalance, NEWLY_IDLE);
+ if (!domain) {
schedstat_inc(sd, lb_nobusyg[NEWLY_IDLE]);
goto out;
}
- busiest = find_busiest_queue(group);
- if (!busiest || busiest == this_rq) {
+ busiest_rq = find_busiest_queue(domain);
+ if (!busiest_rq || busiest_rq == rq) {
schedstat_inc(sd, lb_nobusyq[NEWLY_IDLE]);
goto out;
}
/* Attempt to move tasks */
- double_lock_balance(this_rq, busiest);
+ double_lock_balance(rq, busiest_rq);
schedstat_add(sd, lb_imbalance[NEWLY_IDLE], imbalance);
- nr_moved = move_tasks(this_rq, this_cpu, busiest,
- imbalance, sd, NEWLY_IDLE);
+ nr_moved = move_tasks(rq, cpu, busiest_rq, imbalance, sd, NEWLY_IDLE);
if (!nr_moved)
schedstat_inc(sd, lb_failed[NEWLY_IDLE]);
- spin_unlock(&busiest->lock);
+ spin_unlock(&busiest_rq->lock);
out:
return nr_moved;
}
/*
- * idle_balance is called by schedule() if this_cpu is about to become
+ * idle_balance is called by schedule() if @cpu is about to become
* idle. Attempts to pull tasks from other CPUs.
*/
-static inline void idle_balance(int this_cpu, runqueue_t *this_rq)
+static inline void idle_balance(int cpu, runqueue_t *rq)
{
struct sched_domain *sd;
- for_each_domain(this_cpu, sd) {
+ for_each_domain(cpu, sd) {
if (sd->flags & SD_BALANCE_NEWIDLE) {
- if (load_balance_newidle(this_cpu, this_rq, sd)) {
+ if (load_balance_newidle(cpu, rq, sd)) {
/* We've pulled tasks over so stop searching */
break;
}
@@ -2531,40 +2524,37 @@ static inline void idle_balance(int this
* running on each physical CPU where possible, and not have a physical /
* logical imbalance.
*
- * Called with busiest locked.
+ * Called with busiest_rq locked.
*/
-static void active_load_balance(runqueue_t *busiest, int busiest_cpu)
+static void active_load_balance(runqueue_t *busiest_rq, int busiest_cpu)
{
- struct sched_domain *sd;
- struct sched_group *group, *busy_group;
+ struct sched_domain *balance_domain, *domain, *busiest_domain;
int i;
- schedstat_inc(busiest, alb_cnt);
- if (busiest->nr_running <= 1)
+ schedstat_inc(busiest_rq, alb_cnt);
+ if (busiest_rq->nr_running <= 1)
return;
- for_each_domain(busiest_cpu, sd)
- if (cpu_isset(busiest->push_cpu, sd->span))
+ for_each_domain(busiest_cpu, balance_domain)
+ if (cpu_isset(busiest_rq->push_cpu, balance_domain->span))
break;
- if (!sd)
+ if (!balance_domain)
return;
- group = sd->groups;
- while (!cpu_isset(busiest_cpu, group->cpumask))
- group = group->next;
- busy_group = group;
-
- group = sd->groups;
- do {
+ list_for_each_entry(busiest_domain, &balance_domain->children, peers) {
+ if (cpu_isset(busiest_cpu, busiest_domain->span))
+ break;
+ }
+ list_for_each_entry(domain, &balance_domain->children, peers) {
runqueue_t *rq;
int push_cpu = 0;
- if (group == busy_group)
- goto next_group;
+ if (domain == busiest_domain)
+ continue;
- for_each_cpu_mask(i, group->cpumask) {
+ for_each_cpu_mask(i, domain->span) {
if (!idle_cpu(i))
- goto next_group;
+ continue;
push_cpu = i;
}
@@ -2576,19 +2566,17 @@ static void active_load_balance(runqueue
* it can trigger.. Reported by Bjorn Helgaas on a
* 128-cpu setup.
*/
- if (unlikely(busiest == rq))
- goto next_group;
- double_lock_balance(busiest, rq);
- if (move_tasks(rq, push_cpu, busiest, 1, sd, SCHED_IDLE)) {
- schedstat_inc(busiest, alb_lost);
+ if (unlikely(busiest_rq == rq))
+ continue;
+ double_lock_balance(busiest_rq, rq);
+ if (move_tasks(rq, push_cpu, busiest_rq, 1, balance_domain, SCHED_IDLE)) {
+ schedstat_inc(busiest_rq, alb_lost);
schedstat_inc(rq, alb_gained);
} else {
- schedstat_inc(busiest, alb_failed);
+ schedstat_inc(busiest_rq, alb_failed);
}
spin_unlock(&rq->lock);
-next_group:
- group = group->next;
- } while (group != sd->groups);
+ }
}
/*
@@ -5024,7 +5012,7 @@ static int migration_call(struct notifie
}
spin_unlock_irq(&rq->lock);
break;
-#endif
+#endif /* CONFIG_HOTPLUG_CPU */
}
return NOTIFY_OK;
}
@@ -5049,37 +5037,6 @@ int __init migration_init(void)
#endif
#ifdef CONFIG_SMP
-/*
- * Attach the domain 'sd' to 'cpu' as its base domain. Callers must
- * hold the hotplug lock.
- */
-void __devinit cpu_attach_domain(struct sched_domain *sd, int cpu)
-{
- migration_req_t req;
- unsigned long flags;
- runqueue_t *rq = cpu_rq(cpu);
- int local = 1;
-
- spin_lock_irqsave(&rq->lock, flags);
-
- if (cpu == smp_processor_id() || !cpu_online(cpu)) {
- rq->sd = sd;
- } else {
- init_completion(&req.done);
- req.type = REQ_SET_DOMAIN;
- req.sd = sd;
- list_add(&req.list, &rq->migration_queue);
- local = 0;
- }
-
- spin_unlock_irqrestore(&rq->lock, flags);
-
- if (!local) {
- wake_up_process(rq->migration_thread);
- wait_for_completion(&req.done);
- }
-}
-
/* cpus with isolated domains */
cpumask_t __devinitdata cpu_isolated_map = CPU_MASK_NONE;
@@ -5094,245 +5051,257 @@ static int __init isolated_cpu_setup(cha
cpu_set(ints[i], cpu_isolated_map);
return 1;
}
-__setup ("isolcpus=", isolated_cpu_setup);
+__setup("isolcpus=", isolated_cpu_setup);
/*
- * init_sched_build_groups takes an array of groups, the cpumask we wish
- * to span, and a pointer to a function which identifies what group a CPU
- * belongs to. The return value of group_fn must be a valid index into the
- * groups[] array, and must be >= 0 and < NR_CPUS (due to the fact that we
- * keep track of groups covered with a cpumask_t).
- *
- * init_sched_build_groups will build a circular linked list of the groups
- * covered by the given span, and will set each group's ->cpumask correctly,
- * and ->cpu_power to 0.
+ * Calculate and set the cpu_power field of sched_domain @sd
*/
-void __devinit init_sched_build_groups(struct sched_group groups[],
- cpumask_t span, int (*group_fn)(int cpu))
+static void __devinit set_cpu_power(struct sched_domain *sd)
{
- struct sched_group *first = NULL, *last = NULL;
- cpumask_t covered = CPU_MASK_NONE;
- int i;
-
- for_each_cpu_mask(i, span) {
- int group = group_fn(i);
- struct sched_group *sg = &groups[group];
- int j;
-
- if (cpu_isset(i, covered))
- continue;
-
- sg->cpumask = CPU_MASK_NONE;
- sg->cpu_power = 0;
-
- for_each_cpu_mask(j, span) {
- if (group_fn(j) != group)
- continue;
-
- cpu_set(j, covered);
- cpu_set(j, sg->cpumask);
- }
- if (!first)
- first = sg;
- if (last)
- last->next = sg;
- last = sg;
+ if (list_empty(&sd->children)) {
+ /* Leaf domain. Power = 1 + ((#cpus-1) * per_cpu_gain) */
+ sd->cpu_power = 1 +
+ ((cpus_weight(sd->span) - 1) * sd->per_cpu_gain);
+ } else {
+ /* Interior domain. Power = sum of children's power. */
+ struct sched_domain *child_domain;
+ sd->cpu_power = 0;
+ list_for_each_entry(child_domain, &sd->children, peers)
+ sd->cpu_power += child_domain->cpu_power;
}
- last->next = first;
}
+/* What type of sched_domain are we creating? */
+enum sd_type {
+ SD_SYSTEM = 0,
+ SD_NODE = 1,
+ SD_SIBLING = 2,
+};
-#ifdef ARCH_HAS_SCHED_DOMAIN
-extern void __devinit arch_init_sched_domains(void);
-extern void __devinit arch_destroy_sched_domains(void);
+/*
+ * Allocates a sched_domain, initializes it as @type, parents it to @parent,
+ * and links it into the list of its peers (if any).
+ *
+ * All the ugly ifdef'ing in this function is because I didn't want to rename
+ * and modify the SD_*_INIT functions in this patch; doing so would make it bigger and more
+ * confusing. Arch-specific initializers are what is necessary to fix this mess,
+ * but I'm not including that in this patch. Another reason for this confusion
+ * is that the domain initializers are named for what the domains are _composed of_
+ * rather than what type of domains they are. Ex. SD_NODE_INIT is the initializer
+ * for a domain composed of NUMA nodes, rather than a domain that _is_ a NUMA node.
+ */
+struct sched_domain * __devinit create_domain(struct sched_domain *parent,
+ enum sd_type type)
+{
+ struct sched_domain *domain;
+
+ domain = (struct sched_domain *)kmalloc(sizeof(struct sched_domain), GFP_KERNEL);
+ if (!domain) {
+ printk("Couldn't kmalloc sched_domain!\n");
+ return NULL;
+ }
+ switch (type) {
+ case SD_SYSTEM:
+#ifdef CONFIG_NUMA
+/* Initializing system domain for NUMA machine: contains nodes */
+ *domain = SD_NODE_INIT;
#else
#ifdef CONFIG_SCHED_SMT
-static DEFINE_PER_CPU(struct sched_domain, cpu_domains);
-static struct sched_group sched_group_cpus[NR_CPUS];
-static int __devinit cpu_to_cpu_group(int cpu)
-{
- return cpu;
-}
+/* Initializing system domain for SMP + SMT machine: contains SMT siblings */
+ *domain = SD_SIBLING_INIT;
+#else
+/* Initializing system domain for Flat SMP machine: contains CPUs */
+ *domain = SD_CPU_INIT;
#endif
-
-static DEFINE_PER_CPU(struct sched_domain, phys_domains);
-static struct sched_group sched_group_phys[NR_CPUS];
-static int __devinit cpu_to_phys_group(int cpu)
-{
+#endif
+ break;
+#ifdef CONFIG_NUMA
+ case SD_NODE:
#ifdef CONFIG_SCHED_SMT
- return first_cpu(cpu_sibling_map[cpu]);
+/* Initializing node domain for NUMA + SMT machine: contains SMT siblings */
+ *domain = SD_SIBLING_INIT;
#else
- return cpu;
+/* Initializing node domain for Flat NUMA machine: contains CPUs */
+ *domain = SD_CPU_INIT;
+#endif
+ break;
#endif
+#ifdef CONFIG_SCHED_SMT
+ case SD_SIBLING:
+/* Initializing sibling domain: contains CPUs */
+ *domain = SD_CPU_INIT;
+ break;
+#endif
+ default:
+ printk("Invalid sd_type!\n");
+ kfree(domain);
+ return NULL;
+ }
+ INIT_LIST_HEAD(&domain->children);
+ INIT_LIST_HEAD(&domain->peers);
+ /* MCD - Need locking here? */
+ if (parent) {
+ domain->parent = parent;
+ atomic_inc(&domain->parent->refcnt);
+ list_add_tail(&domain->peers, &domain->parent->children);
+ }
+ return domain;
}
-#ifdef CONFIG_NUMA
-
-static DEFINE_PER_CPU(struct sched_domain, node_domains);
-static struct sched_group sched_group_nodes[MAX_NUMNODES];
-static int __devinit cpu_to_node_group(int cpu)
+/*
+ * Unlinks and destroys sched_domain @sd
+ */
+void __devinit destroy_domain(struct sched_domain *sd)
{
- return cpu_to_node(cpu);
+ if (!sd || !atomic_dec_and_test(&sd->refcnt))
+ return;
+ kfree(sd);
}
-#endif
/*
- * Set up scheduler domains and groups. Callers must hold the hotplug lock.
+ * Attach the domain @sd to @cpu as its base domain.
+ * Callers must hold the hotplug lock.
*/
-static void __devinit arch_init_sched_domains(void)
+void __devinit cpu_attach_domain(struct sched_domain *sd, int cpu)
{
- int i;
- cpumask_t cpu_default_map;
-
- /*
- * Setup mask for cpus without special case scheduling requirements.
- * For now this just excludes isolated cpus, but could be used to
- * exclude other special cases in the future.
- */
- cpus_complement(cpu_default_map, cpu_isolated_map);
- cpus_and(cpu_default_map, cpu_default_map, cpu_online_map);
-
- /*
- * Set up domains. Isolated domains just stay on the dummy domain.
- */
- for_each_cpu_mask(i, cpu_default_map) {
- int group;
- struct sched_domain *sd = NULL, *p;
- cpumask_t nodemask = node_to_cpumask(cpu_to_node(i));
+ struct sched_domain *old_sd, *domain;
+ migration_req_t req;
+ unsigned long flags;
+ runqueue_t *rq = cpu_rq(cpu);
+ int local = 1;
- cpus_and(nodemask, nodemask, cpu_default_map);
+ /* Add @cpu to domain's span & compute its cpu_power */
+ for (domain = sd; domain; domain = domain->parent) {
+ /* MCD - Should add locking here. Lock each domain while modifying */
+ if (cpu_isset(cpu, domain->span))
+ continue;
+ cpu_set(cpu, domain->span);
+ set_cpu_power(domain);
+ /* MCD - Unlock here */
+ }
-#ifdef CONFIG_NUMA
- sd = &per_cpu(node_domains, i);
- group = cpu_to_node_group(i);
- *sd = SD_NODE_INIT;
- sd->span = cpu_default_map;
- sd->groups = &sched_group_nodes[group];
-#endif
+ spin_lock_irqsave(&rq->lock, flags);
- p = sd;
- sd = &per_cpu(phys_domains, i);
- group = cpu_to_phys_group(i);
- *sd = SD_CPU_INIT;
- sd->span = nodemask;
- sd->parent = p;
- sd->groups = &sched_group_phys[group];
+ old_sd = rq->sd;
-#ifdef CONFIG_SCHED_SMT
- p = sd;
- sd = &per_cpu(cpu_domains, i);
- group = cpu_to_cpu_group(i);
- *sd = SD_SIBLING_INIT;
- sd->span = cpu_sibling_map[i];
- cpus_and(sd->span, sd->span, cpu_default_map);
- sd->parent = p;
- sd->groups = &sched_group_cpus[group];
-#endif
+ if (cpu == smp_processor_id() || !cpu_online(cpu)) {
+ rq->sd = sd;
+ } else {
+ init_completion(&req.done);
+ req.type = REQ_SET_DOMAIN;
+ req.sd = sd;
+ list_add(&req.list, &rq->migration_queue);
+ local = 0;
}
-#ifdef CONFIG_SCHED_SMT
- /* Set up CPU (sibling) groups */
- for_each_online_cpu(i) {
- cpumask_t this_sibling_map = cpu_sibling_map[i];
- cpus_and(this_sibling_map, this_sibling_map, cpu_default_map);
- if (i != first_cpu(this_sibling_map))
- continue;
+ spin_unlock_irqrestore(&rq->lock, flags);
- init_sched_build_groups(sched_group_cpus, this_sibling_map,
- &cpu_to_cpu_group);
+ if (!local) {
+ wake_up_process(rq->migration_thread);
+ wait_for_completion(&req.done);
}
-#endif
- /* Set up physical groups */
- for (i = 0; i < MAX_NUMNODES; i++) {
- cpumask_t nodemask = node_to_cpumask(i);
+ destroy_domain(old_sd);
+}
- cpus_and(nodemask, nodemask, cpu_default_map);
- if (cpus_empty(nodemask))
- continue;
+#ifndef ARCH_HAS_SCHED_DOMAIN
+/*
+ * Set up scheduler domains. Callers must hold the hotplug lock.
+ */
+static void __devinit arch_init_sched_domains(void)
+{
+ struct sched_domain *sys_domain;
+ struct sched_domain *node_domains[MAX_NUMNODES];
+ cpumask_t cpu_usable_map;
+ int i;
- init_sched_build_groups(sched_group_phys, nodemask,
- &cpu_to_phys_group);
- }
+ /*
+ * Setup mask for cpus without special case scheduling requirements.
+ * For now this just excludes isolated cpus, but could be used to
+ * exclude other special cases in the future.
+ */
+ cpus_complement(cpu_usable_map, cpu_isolated_map);
+ cpus_and(cpu_usable_map, cpu_usable_map, cpu_online_map);
+ sys_domain = create_domain(NULL, SD_SYSTEM);
+ BUG_ON(!sys_domain);
+
+/* MCD - Replace these #ifdef's with runtime variables? ie: numa_enabled, smt_enabled... */
#ifdef CONFIG_NUMA
- /* Set up node groups */
- init_sched_build_groups(sched_group_nodes, cpu_default_map,
- &cpu_to_node_group);
-#endif
+ for_each_node(i) {
+ cpumask_t nodemask;
+ nodemask = node_to_cpumask(i);
+ if (!cpus_empty(nodemask))
+ node_domains[i] = create_domain(sys_domain, SD_NODE);
+ }
+#else /* !CONFIG_NUMA */
+ node_domains[0] = sys_domain;
+#endif /* CONFIG_NUMA */
- /* Calculate CPU power for physical packages and nodes */
- for_each_cpu_mask(i, cpu_default_map) {
- int power;
- struct sched_domain *sd;
+ for_each_cpu_mask(i, cpu_usable_map) {
#ifdef CONFIG_SCHED_SMT
- sd = &per_cpu(cpu_domains, i);
- power = SCHED_LOAD_SCALE;
- sd->groups->cpu_power = power;
-#endif
-
- sd = &per_cpu(phys_domains, i);
- power = SCHED_LOAD_SCALE + SCHED_LOAD_SCALE *
- (cpus_weight(sd->groups->cpumask)-1) / 10;
- sd->groups->cpu_power = power;
+ struct sched_domain *sibling_domain;
+ cpumask_t sibling_map = cpu_sibling_map[i];
+ int j;
-#ifdef CONFIG_NUMA
- if (i == first_cpu(sd->groups->cpumask)) {
- /* Only add "power" once for each physical package. */
- sd = &per_cpu(node_domains, i);
- sd->groups->cpu_power += power;
- }
-#endif
- }
+ cpus_and(sibling_map, sibling_map, cpu_usable_map);
+ if (i != first_cpu(sibling_map))
+ continue;
- /* Attach the domains */
- for_each_online_cpu(i) {
- struct sched_domain *sd;
-#ifdef CONFIG_SCHED_SMT
- sd = &per_cpu(cpu_domains, i);
-#else
- sd = &per_cpu(phys_domains, i);
-#endif
- cpu_attach_domain(sd, i);
+ sibling_domain = create_domain(node_domains[cpu_to_node(i)], SD_SIBLING);
+ for_each_cpu_mask(j, sibling_map)
+ cpu_attach_domain(sibling_domain, j);
+#else /* !CONFIG_SCHED_SMT */
+ cpu_attach_domain(node_domains[cpu_to_node(i)], i);
+#endif /* CONFIG_SCHED_SMT */
}
}
#ifdef CONFIG_HOTPLUG_CPU
static void __devinit arch_destroy_sched_domains(void)
{
- /* Do nothing: everything is statically allocated. */
+ /* MCD - Write this function */
}
#endif
-#endif /* ARCH_HAS_SCHED_DOMAIN */
+#else /* ARCH_HAS_SCHED_DOMAIN */
+extern void __devinit arch_init_sched_domains(void);
+#ifdef CONFIG_HOTPLUG_CPU
+extern void __devinit arch_destroy_sched_domains(void);
+#endif
+#endif /* !ARCH_HAS_SCHED_DOMAIN */
-#undef SCHED_DOMAIN_DEBUG
+#define SCHED_DOMAIN_DEBUG
#ifdef SCHED_DOMAIN_DEBUG
-static void sched_domain_debug(void)
+#define SD_DEBUG KERN_ALERT
+static void __devinit sched_domain_debug(void)
{
- int i;
+ int cpu;
- for_each_online_cpu(i) {
- runqueue_t *rq = cpu_rq(i);
- struct sched_domain *sd;
+ for (cpu = 0; cpu < NR_CPUS; cpu++) {
+ runqueue_t *rq = cpu_rq(cpu);
+ struct sched_domain *sd = NULL;
int level = 0;
-
- sd = rq->sd;
-
- printk(KERN_DEBUG "CPU%d:\n", i);
+
+ if (rq)
+ sd = rq->sd;
+
+ printk(SD_DEBUG "CPU%d:", cpu);
+ if (!sd) {
+ printk(" not online.\n");
+ continue;
+ }
+ printk("\n");
do {
- int j;
- char str[NR_CPUS];
- struct sched_group *group = sd->groups;
+ struct sched_domain *child;
cpumask_t groupmask;
+ char str[NR_CPUS];
+ int i;
- cpumask_scnprintf(str, NR_CPUS, sd->span);
- cpus_clear(groupmask);
-
- printk(KERN_DEBUG);
- for (j = 0; j < level + 1; j++)
+ printk(SD_DEBUG);
+ for (i = 0; i < level + 1; i++)
printk(" ");
printk("domain %d: ", level);
@@ -5344,71 +5313,62 @@ static void sched_domain_debug(void)
break;
}
- printk("span %s\n", str);
+ cpumask_scnprintf(str, NR_CPUS, sd->span);
+ printk("span %s, cpu_power %ld\n", str, sd->cpu_power);
- if (!cpu_isset(i, sd->span))
- printk(KERN_DEBUG "ERROR domain->span does not contain CPU%d\n", i);
- if (!cpu_isset(i, group->cpumask))
- printk(KERN_DEBUG "ERROR domain->groups does not contain CPU%d\n", i);
- if (!group->cpu_power)
- printk(KERN_DEBUG "ERROR domain->cpu_power not set\n");
+ if (!cpu_isset(cpu, sd->span))
+ printk(SD_DEBUG "ERROR domain->span does not contain CPU%d\n", cpu);
- printk(KERN_DEBUG);
- for (j = 0; j < level + 2; j++)
+ cpus_clear(groupmask);
+ printk(SD_DEBUG);
+ for (i = 0; i < level + 1; i++)
printk(" ");
- printk("groups:");
- do {
- if (!group) {
- printk(" ERROR: NULL");
- break;
+ if (list_empty(&sd->children)) {
+ printk("Leaf domain, no children\n");
+ } else {
+ printk("Interior domain, children's spans:");
+ list_for_each_entry(child, &sd->children, peers) {
+ if (cpus_empty(child->span))
+ printk(" ERROR child has empty span:");
+
+ if (cpus_intersects(groupmask, child->span))
+ printk(" ERROR repeated CPUs:");
+
+ cpumask_scnprintf(str, NR_CPUS, child->span);
+ printk(" %s", str);
+ cpus_or(groupmask, groupmask, child->span);
}
+ printk("\n");
- if (!cpus_weight(group->cpumask))
- printk(" ERROR empty group:");
-
- if (cpus_intersects(groupmask, group->cpumask))
- printk(" ERROR repeated CPUs:");
-
- cpus_or(groupmask, groupmask, group->cpumask);
-
- cpumask_scnprintf(str, NR_CPUS, group->cpumask);
- printk(" %s", str);
-
- group = group->next;
- } while (group != sd->groups);
- printk("\n");
-
- if (!cpus_equal(sd->span, groupmask))
- printk(KERN_DEBUG "ERROR groups don't span domain->span\n");
+ if (!cpus_equal(sd->span, groupmask))
+ printk(SD_DEBUG "ERROR children don't span domain->span\n");
+ }
level++;
sd = sd->parent;
if (sd) {
if (!cpus_subset(groupmask, sd->span))
- printk(KERN_DEBUG "ERROR parent span is not a superset of domain->span\n");
+ printk(SD_DEBUG "ERROR parent span is not a superset of domain->span\n");
}
-
} while (sd);
}
}
-#else
+#else /* !SCHED_DOMAIN_DEBUG */
#define sched_domain_debug() {}
-#endif
+#endif /* SCHED_DOMAIN_DEBUG */
-#ifdef CONFIG_SMP
/*
* Initial dummy domain for early boot and for hotplug cpu. Being static,
* it is initialized to zero, so all balancing flags are cleared which is
* what we want.
*/
static struct sched_domain sched_domain_dummy;
-#endif
#ifdef CONFIG_HOTPLUG_CPU
/*
* Force a reinitialization of the sched domains hierarchy. The domains
- * and groups cannot be updated in place without racing with the balancing
+ * cannot be updated in place without racing with the balancing
* code, so we temporarily attach all running cpus to a "dummy" domain
* which will prevent rebalancing while the sched domains are recalculated.
*/
@@ -5444,7 +5404,7 @@ static int update_sched_domains(struct n
return NOTIFY_OK;
}
-#endif
+#endif /* CONFIG_HOTPLUG_CPU */
void __init sched_init_smp(void)
{
@@ -5455,7 +5415,7 @@ void __init sched_init_smp(void)
/* XXX: Theoretical race here - CPU may be hotplugged now */
hotcpu_notifier(update_sched_domains, 0);
}
-#else
+#else /* !CONFIG_SMP */
void __init sched_init_smp(void)
{
}
Matthew Dobson wrote:
> This code is in no way complete. But since I brought it up in the
> "cpusets - big numa cpu and memory placement" thread, I figure the code
> needs to be posted.
>
> The basic idea is as follows:
>
> 1) Rip out sched_groups and move them into the sched_domains.
> 2) Add some reference counting, and eventually locking, to
> sched_domains.
> 3) Rewrite & simplify the way sched_domains are built and linked into a
> cohesive tree.
>
OK. I'm not sure that I like the direction, but... (I haven't looked
too closely at it).
> This should allow us to support hotplug more easily, simply removing the
> domain belonging to the going-away CPU, rather than throwing away the
> whole domain tree and rebuilding from scratch.
Although what we have in -mm now should support CPU hotplug just fine.
The hotplug guys really seem not to care how disruptive a hotplug
operation is.
> This should also allow
> us to support multiple, independent (ie: no shared root) domain trees
> which will facilitate isolated CPU groups and exclusive domains. I also
Hmm, what was my word for them... yeah, disjoint. We can do that now,
see isolcpus= for a subset of the functionality you want (doing larger
exclusive sets would probably just require we run the setup code once
for each exclusive set we want to build).
> hope this will allow us to leverage the existing topology infrastructure
> to build domains that closely resemble the physical structure of the
> machine automagically, thus making supporting interesting NUMA machines
> and SMT machines easier.
>
> This patch is just a snapshot in the middle of development, so there are
> certainly some uglies & bugs that will get fixed. That said, any
> comments about the general design are strongly encouraged. Heck, any
> feedback at all is welcome! :)
>
> Patch against 2.6.9-rc3-mm2.
This is what I did in my first (that nobody ever saw) implementation of
sched domains. Ie. no sched_groups, just use sched_domains as the balancing
object... I'm not sure this works too well.
For example, your bottom level domain is going to basically be a redundant,
single CPU on most topologies, isn't it?
Also, how will you do overlapping domains that SGI want to do (see
arch/ia64/kernel/domain.c in -mm kernels)?
node2 wants to balance between node0, node1, itself, node3, node4.
node4 wants to balance between node2, node3, itself, node5, node6.
etc.
I think your lists will get tangled, no?
> ... thus making supporting interesting NUMA machines
> and SMT machines easier.
Would you be so kind as to elaborate on the SMT part?
Marc
Marc wrote:
> > ... thus making supporting interesting NUMA machines
> > and SMT machines easier.
>
> Would you be so kind as to elaborate on the SMT part?
Sure - easy - I mistyped "SMP". Even on a system as simple as
a dual processor, tightly coupled HPC applications, using one
thread per processor, run much better if indeed they are placed
one thread to a processor, allowing for genuine parallelism.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373
Never mind my s/SMT/SMP/ post ... please pretend you never saw it ;)
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373
On Wednesday, October 6, 2004 7:13 pm, Nick Piggin wrote:
> Hmm, what was my word for them... yeah, disjoint. We can do that now,
> see isolcpus= for a subset of the functionality you want (doing larger
> exclusive sets would probably just require we run the setup code once
> for each exclusive set we want to build).
Yeah, and unfortunately since I added the code for overlapping domains w/o
adding a top level domain at the same time, we have disjoint domains by
default on large systems.
> Also, how will you do overlapping domains that SGI want to do (see
> arch/ia64/kernel/domain.c in -mm kernels)?
>
> node2 wants to balance between node0, node1, itself, node3, node4.
> node4 wants to balance between node2, node3, itself, node5, node6.
> etc.
>
> I think your lists will get tangled, no?
Yeah, but overlapping domains aren't a requirement. In fact, making the
scheduling domains dynamically configurable is probably a *much* better
route, since I doubt that some default overlap setup will be optimal for many
workloads (that doesn't mean we shouldn't have good defaults though). Being
able to configure the rebalance and tick rates of the various domains would
also be a good thing (the defaults could be keyed off of the number of CPUs
and/or nodes in the domain).
Thanks,
Jesse
On Wed, 2004-10-06 at 19:13, Nick Piggin wrote:
> Matthew Dobson wrote:
> > This code is in no way complete. But since I brought it up in the
> > "cpusets - big numa cpu and memory placement" thread, I figure the code
> > needs to be posted.
> >
> > The basic idea is as follows:
> >
> > 1) Rip out sched_groups and move them into the sched_domains.
> > 2) Add some reference counting, and eventually locking, to
> > sched_domains.
> > 3) Rewrite & simplify the way sched_domains are built and linked into a
> > cohesive tree.
> >
>
> OK. I'm not sure that I like the direction, but... (I haven't looked
> too closely at it).
The patch is made somewhat larger by a lot of variable renaming because
of the removal of sched_groups. A lot of s/group/domain/. The vast
majority of the changes are in a rewrite of arch_init_sched_domains &
its assorted helpers.
> > This should allow us to support hotplug more easily, simply removing the
> > domain belonging to the going-away CPU, rather than throwing away the
> > whole domain tree and rebuilding from scratch.
>
> Although what we have in -mm now should support CPU hotplug just fine.
> The hotplug guys really seem not to care how disruptive a hotplug
> operation is.
I wasn't trying to imply that CPU hotplug isn't supported right now.
But it is currently a very disruptive operation, throwing away the
entire sched_domains & sched_groups tree and then rebuilding it from
scratch just to remove a single CPU! I also understand that this is
supposed to be a rare event (CPU hotplug), but that doesn't mean it
*has* to be a slow, disruptive event. :)
> > This should also allow
> > us to support multiple, independent (ie: no shared root) domain trees
> > which will facilitate isolated CPU groups and exclusive domains. I also
>
> Hmm, what was my word for them... yeah, disjoint. We can do that now,
> see isolcpus= for a subset of the functionality you want (doing larger
> exclusive sets would probably just require we run the setup code once
> for each exclusive set we want to build).
The current code doesn't, to my knowledge, support multiple isolated
domains. You can set up a single 'isolated' group with boot time
options, but you can't set up *multiple* isolated groups, nor is there
the ability to do any partitioning/isolation at runtime. This was more
of the motivation for my code than the hotplug simplification. That was
more of a side-benefit.
> > hope this will allow us to leverage the existing topology infrastructure
> > to build domains that closely resemble the physical structure of the
> > machine automagically, thus making supporting interesting NUMA machines
> > and SMT machines easier.
> >
> > This patch is just a snapshot in the middle of development, so there are
> > certainly some uglies & bugs that will get fixed. That said, any
> > comments about the general design are strongly encouraged. Heck, any
> > feedback at all is welcome! :)
> >
> > Patch against 2.6.9-rc3-mm2.
>
> This is what I did in my first (that nobody ever saw) implementation of
> sched domains. Ie. no sched_groups, just use sched_domains as the balancing
> object... I'm not sure this works too well.
>
> For example, your bottom level domain is going to basically be a redundant,
> single CPU on most topologies, isn't it?
>
> Also, how will you do overlapping domains that SGI want to do (see
> arch/ia64/kernel/domain.c in -mm kernels)?
>
> node2 wants to balance between node0, node1, itself, node3, node4.
> node4 wants to balance between node2, node3, itself, node5, node6.
> etc.
>
> I think your lists will get tangled, no?
Yes. I have to put my thinking cap on snug, but I don't think my
version would support this kind of setup. It sounds, from Jesse's
follow-up to your mail, like this is not a requirement, though. I'll
take a closer look at the IA64 code and see if it would be supported or
if I could make some small changes to support it.
Thanks for the feedback!!
-Matt
On Wed, 2004-10-06 at 21:12, Marc E. Fiuczynski wrote:
> > ... thus making supporting interesting NUMA machines
> > and SMT machines easier.
>
> Would you be so kind as to elaborate on the SMT part?
>
> Marc
Contrary to Paul's posting that no one saw, SMT is not a typo. ;) What
I was trying to get at is that there are already several differing
implementations of SMT (Simultaneous Multi-Threading) with different
names and different characteristics. Right now, they're all kind of
handled the same. In the future, however, I see each architecture
defining their own SD_SIBLING_INIT for sibling domains, allowing their
own cache timings, balancing rates, etc. I feel that it would be easier
to support potentially complicated and/or dynamic sibling 'CPU'
relationships with my patch. We've already run into some issues with
hotplugging the siblings of 'real' CPUs on/off, and how the current
sched_domains handles that. Currently, as the code is static and based
on config options rather than runtime variables, it tends to leave a
single CPU in its own domain, balancing amongst itself with no sibling
(b/c it's been hotplugged off and CONFIG_SCHED_SMT is on). My code was
written with dynamic runtime changes in mind to prevent these kinds of
suboptimal situations.
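As a purely hypothetical sketch of that direction, an architecture could
supply its own sibling-domain initializer in its asm/topology.h, using this
patch's field layout (the tuning values below are illustrative guesses, not
measured numbers):

	/* Hypothetical arch-specific sibling-domain initializer. */
	#define SD_SIBLING_INIT (struct sched_domain) {	\
		.parent			= NULL,			\
		.span			= CPU_MASK_NONE,	\
		.refcnt			= ATOMIC_INIT(0),	\
		.cpu_power		= 0,			\
		.min_interval		= 1,			\
		.max_interval		= 2,			\
		.busy_factor		= 8,			\
		.imbalance_pct		= 110,			\
		.per_cpu_gain		= 15,	/* illustrative SMT gain */ \
		.flags			= SD_LOAD_BALANCE	\
					| SD_BALANCE_NEWIDLE	\
					| SD_BALANCE_EXEC	\
					| SD_WAKE_AFFINE	\
					| SD_WAKE_IDLE		\
					| SD_SHARE_CPUPOWER,	\
		.last_balance		= jiffies,		\
		.balance_interval	= 1,			\
	}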
-Matt
On Wed, 2004-10-06 at 19:13, Nick Piggin wrote:
> This is what I did in my first (that nobody ever saw) implementation of
> sched domains. Ie. no sched_groups, just use sched_domains as the balancing
> object... I'm not sure this works too well.
>
> For example, your bottom level domain is going to basically be a redundant,
> single CPU on most topologies, isn't it?
I forgot to respond to this part in my last mail... :(
My benchmarks haven't shown any real deviation in performance from stock
-mm. Granted, my benchmarking has been very limited, pretty much just
running kernbench on a few different machines with different configs
(ie: SMT, SMP & NUMA on/off). A performance no-op is exactly what I was
hoping for, though. I don't really expect these changes to be either a
performance win or loss, but a functionality improvement.
The patch is pretty dense because of all the renaming, but there are no
single-CPU domains. The lowest-level domain would be:
1) Node domains, for NUMA w/ SMP
2) Sibling domains, for SMT or NUMA w/ SMT
3) System domain, for flat SMP
The pseudocode version of my arch_init_sched_domains() looks like:
cpu_usable_map = cpu_online_map & ~cpu_isolated_map
create system domain;
if NUMA
	for_each_node()
		create node domain, parented to system domain;
for_each_cpu(cpu_usable_map)
	if SCHED_SMT
		create sibling domain, parented to node domain;
		attach sibling cpu to its domain;
	else
		attach cpu to either its node domain or system domain;
So there shouldn't be any redundant CPU domains.
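For a flat SMP box, for example, the whole setup collapses to a couple of
calls against the patch's API (just a sketch; error handling elided, and
cpu_usable_map computed as above):

	struct sched_domain *sys_domain = create_domain(NULL, SD_SYSTEM);
	int i;

	BUG_ON(!sys_domain);
	for_each_cpu_mask(i, cpu_usable_map)
		/* grows sys_domain->span and recomputes its cpu_power */
		cpu_attach_domain(sys_domain, i);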
-Matt
Matthew Dobson wrote:
>On Wed, 2004-10-06 at 19:13, Nick Piggin wrote:
>
>>Matthew Dobson wrote:
>>
>
>>>This should allow us to support hotplug more easily, simply removing the
>>>domain belonging to the going-away CPU, rather than throwing away the
>>>whole domain tree and rebuilding from scratch.
>>>
>>Although what we have in -mm now should support CPU hotplug just fine.
>>The hotplug guys really seem not to care how disruptive a hotplug
>>operation is.
>>
>
>I wasn't trying to imply that CPU hotplug isn't supported right now.
>But it is currently a very disruptive operation, throwing away the
>entire sched_domains & sched_groups tree and then rebuilding it from
>scratch just to remove a single CPU! I also understand that this is
>supposed to be a rare event (CPU hotplug), but that doesn't mean it
>*has* to be a slow, disruptive event. :)
>
>
Well no... but it already is disruptive :)
>
>>> This should also allow
>>>us to support multiple, independent (ie: no shared root) domain trees
>>>which will facilitate isolated CPU groups and exclusive domains. I also
>>>
>>Hmm, what was my word for them... yeah, disjoint. We can do that now,
>>see isolcpus= for a subset of the functionality you want (doing larger
>>exclusive sets would probably just require we run the setup code once
>>for each exclusive set we want to build).
>>
>
>The current code doesn't, to my knowledge support multiple isolated
>domains. You can set up a single 'isolated' group with boot time
>options, but you can't set up *multiple* isolated groups, nor is there
>the ability to do any partitioning/isolation at runtime. This was more
>of the motivation for my code than the hotplug simplification. That was
>more of a side-benefit.
>
>
No, the isolcpus= option allows you to set up n *single CPU* isolated
domains. You currently can't setup isolated groups with multiple CPUs
in them, no. You can't do runtime partitioning either.
I think both would be pretty trivial to do with the current
code, though.
>
>>>hope this will allow us to leverage the existing topology infrastructure
>>>to build domains that closely resemble the physical structure of the
>>>machine automagically, thus making supporting interesting NUMA machines
>>>and SMT machines easier.
>>>
>>>This patch is just a snapshot in the middle of development, so there are
>>>certainly some uglies & bugs that will get fixed. That said, any
>>>comments about the general design are strongly encouraged. Heck, any
>>>feedback at all is welcome! :)
>>>
>>>Patch against 2.6.9-rc3-mm2.
>>>
>>This is what I did in my first (that nobody ever saw) implementation of
>>sched domains. Ie. no sched_groups, just use sched_domains as the balancing
>>object... I'm not sure this works too well.
>>
>>For example, your bottom level domain is going to basically be a redundant,
>>single CPU on most topologies, isn't it?
>>
>>Also, how will you do overlapping domains that SGI want to do (see
>>arch/ia64/kernel/domain.c in -mm kernels)?
>>
>>node2 wants to balance between node0, node1, itself, node3, node4.
>>node4 wants to balance between node2, node3, itself, node5, node6.
>>etc.
>>
>>I think your lists will get tangled, no?
>>
>
>Yes. I have to put my thinking cap on snug, but I don't think my
>version would support this kind of setup. It sounds, from Jesse's
>follow up to your mail, that this is not a requirement, though. I'll
>take a closer look at the IA64 code and see if it would be supported or
>if I could make some small changes to support it.
>
>
I think they might find that it will be a requirement. If not now, then soon.
Your periodic balancing happens from the timer interrupt as you know...
that means pulling a cacheline off every CPU.
But anyway..
>Thanks for the feedback!!
>
OK... I still don't know exactly how your system is an improvement over what
we have, but I'll try to be open-minded :)
Hi,
From: Jesse Barnes <[email protected]>
Subject: [Lse-tech] Re: [RFC PATCH] scheduler: Dynamic sched_domains
Date: Thu, 7 Oct 2004 10:01:07 -0700
> On Wednesday, October 6, 2004 7:13 pm, Nick Piggin wrote:
> > Hmm, what was my word for them... yeah, disjoint. We can do that now,
> > see isolcpus= for a subset of the functionality you want (doing larger
> > exclusive sets would probably just require we run the setup code once
> > for each exclusive set we want to build).
>
> Yeah, and unfortunately since I added the code for overlapping domains w/o
> adding a top level domain at the same time, we have disjoint domains by
> default on large systems.
Yup, if SD_NODES_PER_DOMAIN is set to 4, our 32-way TX-7 has
two disjoint domains ;(
(though the current default is 6 for ia64...)
I think the default configuration of the scheduler domains should be
as identical to its real hardware topology as possible, and should
modify the default only when necessary (e.g. for Altix).
Right now with the sched domain scheduler, we can only set up the
domain hierarchy statically at boot time, which makes it harder to
find the optimal domain topology/parameters. The dynamic patch
makes it easier to modify the default configuration.
If the scheduler gains more dynamic configurability like what Jesse
said, it adds more flexibility for runtime optimization and seems
like the way to go. I'm not sure runtime configurability of domain topology
is necessary for all users, but it's extremely useful for developers.
I'll look into Matt's patch further.
> > Also, how will you do overlapping domains that SGI want to do (see
> > arch/ia64/kernel/domain.c in -mm kernels)?
> >
> > node2 wants to balance between node0, node1, itself, node3, node4.
> > node4 wants to balance between node2, node3, itself, node5, node6.
> > etc.
> >
> > I think your lists will get tangled, no?
>
> Yeah, but overlapping domains aren't a requirement. In fact, making the
> scheduling domains dynamically configurable is probably a *much* better
> route, since I doubt that some default overlap setup will be optimal for many
> workloads (that doesn't mean we shouldn't have good defaults though). Being
> able to configure the rebalance and tick rates of the various domains would
> also be a good thing (the defaults could be keyed off of the number of CPUs
> and/or nodes in the domain).
---
Takayoshi Kochi
Takayoshi Kochi wrote:
> Yup, if SD_NODES_PER_DOMAIN is set to 4, our 32-way TX-7 have
> two disjoint domains ;(
> (though the current default is 6 for ia64...)
>
> I think the default configuration of the scheduler domains should be
> as identical to its real hardware topology as possible, and should
> modify the default only when necessary (e.g. for Altix).
>
That is the idea. Unfortunately the ia64 modifications are ia64-wide.
I don't think it should be too hard to make it sn2-only.
> Right now with the sched domain scheduler, we have to setup the
> domain hierarcy only at boot time statically, which makes it harder to
> find the optimal domain topology/parameter. The dynamic patch
> makes it easier to modify the default configuration.
>
No, you don't have to. If you have a look at the work in -mm, basically
the whole thing gets recreated on every hotplug operation. It would be
trivial to modify some parameters then reinit the domains in the same
way.
N disjoint domains can be trivially handled by making N passes over
the init code, each using a different set of CPUs as its
"cpu_possible_map". This can easily be done dynamically by using
the above method.
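Something like this sketch, say (build_domains_for_span() is a hypothetical
hook standing in for the existing setup code, not anything in -mm today):

	/*
	 * Hypothetical: build one independent domain tree per exclusive
	 * set by running the existing setup code once per set, with the
	 * set's cpumask standing in for cpu_possible_map.
	 */
	static void build_domains_for_span(cpumask_t span)
	{
		/* body of arch_init_sched_domains(), restricted to @span */
	}

	static void build_all_partitions(cpumask_t *sets, int nr_sets)
	{
		int i;

		for (i = 0; i < nr_sets; i++)
			build_domains_for_span(sets[i]);
	}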
> If the scheduler gains more dynamic configurability like what Jesse
> said, it adds more flexibility for runtime optimization and seems
> a way to go. I'm not sure runtime configurability of domain topology
> is necessary for all users, but it's extremely useful for developers.
>
That would be nice. The patch queue is pretty well clogged up at the
moment, so I'm not going to look at the scheduler again until all the
patches from -mm get into 2.6.10-bk.
Hi Matthew,
On Thursday 07 October 2004 02:51, Matthew Dobson wrote:
> 1) Rip out sched_groups and move them into the sched_domains.
> 2) Add some reference counting, and eventually locking, to
> sched_domains.
> 3) Rewrite & simplify the way sched_domains are built and linked into a
> cohesive tree.
>
> This should allow us to support hotplug more easily, simply removing the
> domain belonging to the going-away CPU, rather than throwing away the
> whole domain tree and rebuilding from scratch. This should also allow
> us to support multiple, independent (ie: no shared root) domain trees
> which will facilitate isolated CPU groups and exclusive domains. I also
> hope this will allow us to leverage the existing topology infrastructure
> to build domains that closely resemble the physical structure of the
> machine automagically, thus making supporting interesting NUMA machines
> and SMT machines easier.
more flexibility in building the sched_domains is badly needed, so
your effort towards providing this is a step in the right direction.
I'm not sure yet whether your big change is really (and already) a
simplification, but what you described sounded to me like getting the
chance to configure the sched_domains at runtime, dynamically, from
user space. I didn't notice any user interface in your patch, or
perhaps I overlooked it. Could you please describe the API you had in
mind for that?
Regards,
Erich
Erich Focht wrote:
> more flexibility in building the sched_domains is badly needed, so
> your effort towards providing this is a step in the right direction.
> I'm not sure yet whether your big change is really (and already) a
> simplification, but what you described sounded to me like getting the
> chance to configure the sched_domains at runtime, dynamically, from
> user space. I didn't notice any user interface in your patch, or
> perhaps I overlooked it. Could you please describe the API you had in
> mind for that?
>
OK, what we have in -mm is already close to what we need to do
dynamic building. But let's explore the other topic. User interface.
First of all, I think it may be easiest to allow the user to specify
which cpus belong to which exclusive domains, and have them otherwise
built in the shape of the underlying topology. So for example if your
domains look like this (excuse the crappy ascii art):
0 1 2 3 4 5 6 7
--- --- --- --- <- domain 0
| | | |
------ ------ <- domain 1
| |
---------- <- domain 2 (global)
And so you want to make a partition with CPUs {0,1,2,4,5}, and {3,6,7}
for some crazy reason, the new domains would look like this:
0 1 2 4 5 3 6 7
--- - --- - --- <- 0
| | | | |
----- - - - <- 1
| | | |
------- ----- <- 2 (global, partitioned)
Agreed? You don't need to get fancier than that, do you?
Then how to input the partitions... you could have a sysfs entry that
takes the complete partition info in the form:
0,1,2,3 4,5,6 7,8 ...
Pretty dumb and simple.
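For what it's worth, parsing that format is straightforward; an
illustrative user-space C fragment (not from any patch in this thread)
that splits such a string into partitions and cpus:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
        char input[] = "0,1,2,3 4,5,6 7";      /* example partition string */
        char *part, *save1;
        int n = 0;

        /* partitions are space-separated... */
        for (part = strtok_r(input, " ", &save1); part;
             part = strtok_r(NULL, " ", &save1)) {
                char *cpu, *save2;

                printf("partition %d:", n++);
                /* ...and each one is a comma-separated cpu list */
                for (cpu = strtok_r(part, ",", &save2); cpu;
                     cpu = strtok_r(NULL, ",", &save2))
                        printf(" cpu%ld", strtol(cpu, NULL, 10));
                printf("\n");
        }
        return 0;
}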
Nick Piggin wrote:
> Erich Focht wrote:
>
>> more flexibility in building the sched_domains is badly needed, so
>> your effort towards providing this is a step in the right direction.
>> I'm not sure yet whether your big change is really (and already) a
>> simplification, but what you described sounded to me like getting the
>> chance to configure the sched_domains at runtime, dynamically, from
>> user space. I didn't notice any user interface in your patch, or
>> perhaps I overlooked it. Could you please describe the API you had in
>> mind for that?
>>
>
> OK, what we have in -mm is already close to what we need to do
> dynamic building. But let's explore the other topic. User interface.
>
> First of all, I think it may be easiest to allow the user to specify
> which cpus belong to which exclusive domains, and have them otherwise
> built in the shape of the underlying topology. So for example if your
> domains look like this (excuse the crappy ascii art):
>
> 0 1 2 3 4 5 6 7
> --- --- --- --- <- domain 0
> | | | |
> ------ ------ <- domain 1
> | |
> ---------- <- domain 2 (global)
>
> And so you want to make a partition with CPUs {0,1,2,4,5}, and {3,6,7}
> for some crazy reason, the new domains would look like this:
>
> 0 1 2 4 5 3 6 7
> --- - --- - --- <- 0
> | | | | |
> ----- - - - <- 1
> | | | |
> ------- ----- <- 2 (global, partitioned)
>
> Agreed? You don't need to get fancier than that, do you?
>
> Then how to input the partitions... you could have a sysfs entry that
> takes the complete partition info in the form:
>
> 0,1,2,3 4,5,6 7,8 ...
>
> Pretty dumb and simple.
>
Agreed; what we are thinking is that the CKRM API can be used for that.
Each domain is a class built of resources (cpus, mem).
You use the config interface of CKRM to specify which cpus/mem belong
to the class. The underlying controller verifies it.
For a first approximation, classes that have config constraints
specified this way will not be allowed to set shares. In sched_domain
terms this means that if a sched_domain cannot be balanced with its
siblings, then it forms an exclusive domain. Under the exclusive
class one can continue with the hierarchy, which will allow share settings.
So from an API standpoint this certainly looks feasible, maybe even clean.
On Thursday, October 7, 2004 11:08 pm, Nick Piggin wrote:
> Takayoshi Kochi wrote:
> > Yup, if SD_NODES_PER_DOMAIN is set to 4, our 32-way TX-7 has
> > two disjoint domains ;(
> > (though the current default is 6 for ia64...)
> >
> > I think the default configuration of the scheduler domains should
> > match the real hardware topology as closely as possible, and we should
> > modify the default only when necessary (e.g. for Altix).
>
> That is the idea. Unfortunately the ia64 modifications are ia64 wide.
> I don't think it should be too hard to make it sn2 only.
The NEC and Altix machines both use a SLIT table to describe the machine
layout, so it should be possible to build them correctly w/o special-case
code (I hope). The question is how big to make them, but if that's
runtime-changeable, then it's no big deal. Like I said, the main thing
missing from my changes is a system-wide domain, but I think John has
some ideas about that.
Jesse
On Fri, 2004-10-08 at 03:14, Erich Focht wrote:
> Hi Matthew,
Hi Erich!
> On Thursday 07 October 2004 02:51, Matthew Dobson wrote:
> > 1) Rip out sched_groups and move them into the sched_domains.
> > 2) Add some reference counting, and eventually locking, to
> > sched_domains.
> > 3) Rewrite & simplify the way sched_domains are built and linked into a
> > cohesive tree.
> >
> > This should allow us to support hotplug more easily, simply removing the
> > domain belonging to the going-away CPU, rather than throwing away the
> > whole domain tree and rebuilding from scratch. This should also allow
> > us to support multiple, independent (ie: no shared root) domain trees
> > which will facilitate isolated CPU groups and exclusive domains. I also
> > hope this will allow us to leverage the existing topology infrastructure
> > to build domains that closely resemble the physical structure of the
> > machine automagically, thus making supporting interesting NUMA machines
> > and SMT machines easier.
>
> more flexibility in building the sched_domains is badly needed, so
> your effort towards providing this is a step in the right direction.
> I'm not sure yet whether your big change is really (and already) a
> simplification, but what you described sounded to me like getting the
> chance to configure the sched_domains at runtime, dynamically, from
> user space. I didn't notice any user interface in your patch, or
> perhaps I overlooked it. Could you please describe the API you had in
> mind for that?
You are correct: you didn't notice any user API in the patch because
it wasn't there. The idea I had for the API would be along the lines
of the current cpusets/CKRM interface: a hierarchical filesystem where
you can do operations to create/modify/remove sched_domains. Something
like:
cd /dev/sched_domains/sys_domain
mkdir node0
mkdir node1
mkdir node2
mkdir node3
cd node0
echo 0-3 > cpus
cd ../node1
echo 4-7 > cpus
cd ../node2
echo 8-11 > cpus
cd ../node3
echo 12-15 > cpus
This creates a simple setup of 4 nodes with 4 cpus each. It's a trivial
example, because this is the kind of thing that would be set up by
default at boot time. I really like the interface that Paul came up
with for cpusets, and I think that the interface we eventually settle on
should be along those lines. Hopefully it can be shared with the
interface CKRM uses, to avoid too much interface bloat. I think that we
can probably get the two mechanisms to share a common interface.
-Matt
On Fri, 2004-10-08 at 03:40, Nick Piggin wrote:
> And so you want to make a partition with CPUs {0,1,2,4,5}, and {3,6,7}
> for some crazy reason, the new domains would look like this:
>
> 0 1 2 4 5 3 6 7
> --- - --- - --- <- 0
> | | | | |
> ----- - - - <- 1
> | | | |
> ------- ----- <- 2 (global, partitioned)
>
> Agreed? You don't need to get fancier than that, do you?
>
> Then how to input the partitions... you could have a sysfs entry that
> takes the complete partition info in the form:
>
> 0,1,2,3 4,5,6 7,8 ...
>
> Pretty dumb and simple.
How do we describe the levels other than the first? We'd either need
to:
1) come up with a language to describe the full tree. For your example
I quoted above:
echo "0,1,2,4,5 3,6 7,8;0,1,2 4,5 3 6,7;0,1 2 4,5 3 6,7" > partitions
2) have multiple files:
echo "0,1,2,4,5 3,6,7" > level2
echo "0,1,2 4,5 3 6,7" > level1
echo "0,1 2 4,5 3 6,7" > level0
3) Or do it hierarchically as Paul implemented in cpusets, and as I
described in an earlier mail:
mkdir level2
echo "0,1,2,4,5 3,6,7" > level2/partitions
mkdir level1
echo "0,1,2 4,5 3 6,7" > level1/partitions
mkdir level0
echo "0,1 2 4,5 3 6,7" > level0/partitions
I personally like the hierarchical idea. Machine topologies tend to
look tree-like, and every useful sched_domain layout I've ever seen has
been tree-like. I think our interface should match that.
-Matt
Matthew Dobson wrote:
> On Fri, 2004-10-08 at 03:40, Nick Piggin wrote:
>
>>And so you want to make a partition with CPUs {0,1,2,4,5}, and {3,6,7}
>>for some crazy reason, the new domains would look like this:
>>
>>0 1 2 4 5 3 6 7
>>--- - --- - --- <- 0
>> | | | | |
>> ----- - - - <- 1
>> | | | |
>> ------- ----- <- 2 (global, partitioned)
>>
>>Agreed? You don't need to get fancier than that, do you?
>>
>>Then how to input the partitions... you could have a sysfs entry that
>>takes the complete partition info in the form:
>>
>>0,1,2,3 4,5,6 7,8 ...
>>
>>Pretty dumb and simple.
>
>
> How do we describe the levels other than the first? We'd either need
> to:
> 1) come up with a language to describe the full tree. For your example
> I quoted above:
> echo "0,1,2,4,5 3,6 7,8;0,1,2 4,5 3 6,7;0,1 2 4,5 3 6,7" > partitions
I think the idea was that the full hierarchy was (automatically) derived
from the partition in a way that best matched the physical layout of the
machine?
>
> 2) have multiple files:
> echo "0,1,2,4,5 3,6,7" > level2
> echo "0,1,2 4,5 3 6,7" > level1
> echo "0,1 2 4,5 3 6,7" > level0
>
> 3) Or do it hierarchically as Paul implemented in cpusets, and as I
> described in an earlier mail:
> mkdir level2
> echo "0,1,2,4,5 3,6,7" > level2/partitions
> mkdir level1
> echo "0,1,2 4,5 3 6,7" > level1/partitions
> mkdir level0
> echo "0,1 2 4,5 3 6,7" > level0/partitions
>
> I personally like the hierarchical idea. Machine topologies tend to
> look tree-like, and every useful sched_domain layout I've ever seen has
> been tree-like. I think our interface should match that.
>
> -Matt
--
Peter Williams [email protected]
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
On Fri, 2004-10-08 at 08:50, Hubertus Franke wrote:
> Nick Piggin wrote:
> > Agreed? You don't need to get fancier than that, do you?
> >
> > Then how to input the partitions... you could have a sysfs entry that
> > takes the complete partition info in the form:
> >
> > 0,1,2,3 4,5,6 7,8 ...
> >
> > Pretty dumb and simple.
>
> Agreed; what we are thinking is that the CKRM API can be used for that.
> Each domain is a class built of resources (cpus, mem).
> You use the config interface of CKRM to specify which cpus/mem belong
> to the class. The underlying controller verifies it.
>
> For a first approximation, classes that have config constraints
> specified this way will not be allowed to set shares. In sched_domain
> terms this means that if a sched_domain cannot be balanced with its
> siblings, then it forms an exclusive domain. Under the exclusive
> class one can continue with the hierarchy, which will allow share settings.
>
> So from an API standpoint this certainly looks feasible, maybe even clean.
For anyone who doesn't grok CKRM-speak, as I didn't until some recent
conversations with Hubertus, I think this roughly translates into the
following:
"CKRM & sched_domains should be integrated and include management of
both CPU & memory resources. We will use the CKRM rcfs filesystem API
to set up both sched_domains and memory allocation constraints.
sched_domains that are marked as exclusive, ie: their parent domains do
not have the SD_LOAD_BALANCE flag set, will not have their computing
power shared outside that domain via CKRM. This will create 'compute
pools' composed of the exclusive domains in the system. CKRM will
ensure that 'shares' are correctly handled within exclusive (aka
isolated) domains, and that 'shares' don't include compute resources not
available in the sched_domain in question."
Is that a fairly accurate translation for those not initiated in the
ways of CKRM, Hubertus?
-Matt
On Friday 08 October 2004 12:40, Nick Piggin wrote:
> Erich Focht wrote:
> > Could you please describe the API you had in mind for that?
> >
>
> First of all, I think it may be easiest to allow the user to specify
> which cpus belong to which exclusive domains, and have them otherwise
> built in the shape of the underlying topology. So for example if your
> domains look like this (excuse the crappy ascii art):
>
> 0 1 2 3 4 5 6 7
> --- --- --- --- <- domain 0
> | | | |
> ------ ------ <- domain 1
> | |
> ---------- <- domain 2 (global)
>
> And so you want to make a partition with CPUs {0,1,2,4,5}, and {3,6,7}
> for some crazy reason, the new domains would look like this:
>
> 0 1 2 4 5 3 6 7
> --- - --- - --- <- 0
> | | | | |
> ----- - - - <- 1
> | | | |
> ------- ----- <- 2 (global, partitioned)
>
> Agreed? You don't need to get fancier than that, do you?
>
> Then how to input the partitions... you could have a sysfs entry that
> takes the complete partition info in the form:
>
> 0,1,2,3 4,5,6 7,8 ...
>
> Pretty dumb and simple.
Hmmm, this is unusable as long as you don't tell me how to create new
levels and link them together. Adding CPUs is the simplest
part. I'm with Matt here, the filesystem approach is the most
elegant. On the other hand something simple for the start wouldn't be
bad. It would show immediately whether Matt's or your way of dealing
with domains is better suited for relinking and reorganising the
domains structure dynamically. Functionality could be something like:
- list domains
- create domain
- add child domains
- link in parent domain
We're building this from bottom (cpus) up and need to take care of the
unlinking of the global domain when inserting something. But otherwise
this could be sufficient.
Regards,
Erich
On Fri, 2004-10-08 at 14:56, Peter Williams wrote:
> > How do we describe the levels other than the first? We'd either need
> > to:
> > 1) come up with a language to describe the full tree. For your example
> > I quoted above:
> > echo "0,1,2,4,5 3,6,7;0,1,2 4,5 3 6,7;0,1 2 4,5 3 6,7" > partitions
>
> I think the idea was that the full hierarchy was (automatically) derived
> from the partition in a way that best matched the physical layout of the
> machine?
Absolutely. I mentioned that in a different response in this thread.
The default behavior on boot-up should be for
arch_init_sched_domains() to build a sched_domains hierarchy that
mirrors the physical layout of the machine, for maximal scheduling
efficiency for general computing. If the users/admin of the machine
want to configure it for particular types of workload, then they can
do that through the soon-to-be-figured-out API to rebuild the
sched_domains hierarchy in their own image and likeness...
-Matt
Hi Matt,
I reply to this post because it has more examples ;-)
On Friday 08 October 2004 20:54, Matthew Dobson wrote:
> On Fri, 2004-10-08 at 03:40, Nick Piggin wrote:
> > And so you want to make a partition with CPUs {0,1,2,4,5}, and {3,6,7}
> > for some crazy reason, the new domains would look like this:
> >
> > 0 1 2 4 5 3 6 7
> > --- - --- - --- <- 0
> > | | | | |
> > ----- - - - <- 1
> > | | | |
> > ------- ----- <- 2 (global, partitioned)
> >
> > Agreed? You don't need to get fancier than that, do you?
> >
> > Then how to input the partitions... you could have a sysfs entry that
> > takes the complete partition info in the form:
> >
> > 0,1,2,3 4,5,6 7,8 ...
> >
> > Pretty dumb and simple.
>
> How do we describe the levels other than the first? We'd either need
> to:
> 1) come up with a language to describe the full tree. For your example
> I quoted above:
> echo "0,1,2,4,5 3,6 7,8;0,1,2 4,5 3 6,7;0,1 2 4,5 3 6,7" > partitions
This is not nice, especially because changes cannot be made
gradually. You need to put in the whole tree each time you change
something, and concurrent processes modifying the structure need
additional mechanisms for synchronization.
> 2) have multiple files:
> echo "0,1,2,4,5 3,6,7" > level2
> echo "0,1,2 4,5 3 6,7" > level1
> echo "0,1 2 4,5 3 6,7" > level0
You'd also need a way to create levels. This one is hard to "parse",
and again there is no protection for one's "own" domains, and maybe
trouble with synchronization.
> 3) Or do it hierarchically as Paul implemented in cpusets, and as I
> described in an earlier mail:
> mkdir level2
> echo "0,1,2,4,5 3,6,7" > level2/partitions
> mkdir level1
> echo "0,1,2 4,5 3 6,7" > level1/partitions
> mkdir level0
> echo "0,1 2 4,5 3 6,7" > level0/partitions
>
> I personally like the hierarchical idea. Machine topologies tend to
> look tree-like, and every useful sched_domain layout I've ever seen has
> been tree-like. I think our interface should match that.
I like the hierarchical idea, too. The natural way to build it would
be by starting from the cpus and going up. This tree stands on its
leaves... and I'm not sure how to express that in a filesystem.
How about this:
If you go from the top (global) domain down, the default could be to
have a directory:
sched_domains/
global/
cpu1
cpu2
...
The cpuX files would always exist. All you'd need to do would be to
create subdirectories and move the cpuX files from one to the
other. You should be able to remove directories if they are empty.
# cd sched_domains/global
# mkdir node1 node2
# mv cpu1 cpu2 node1
# mv cpu3 cpu4 node2
sched_domains/
global/
node1/
cpu1
cpu2
node2/
cpu3
cpu4
or disconnected domains:
# cd sched_domains
# mkdir interactive batch
# mv global/cpu1 global/cpu2 interactive
# mv global/cpu3 global/cpu4 batch
# rmdir global
sched_domains/
interactive/
cpu1
cpu2
batch/
cpu3
cpu4
Regards,
Erich
Erich Focht wrote:
>>I personally like the hierarchical idea. Machine topologies tend to
>>look tree-like, and every useful sched_domain layout I've ever seen has
>>been tree-like. I think our interface should match that.
>
>
> I like the hierarchical idea, too. The natural way to build it would
> be by starting from the cpus and going up. This tree stands on its
> leaves... and I'm not sure how to express that in a filesystem.
>
Why would you ever want to play around with the internals of the
thing though? Provided you have a way to create exclusive sets of
CPUs, when would you care about doing more?
On Fri, 2004-10-08 at 15:51, Erich Focht wrote:
> Hmmm, this is unusable as long as you don't tell me how to create new
> levels and link them together. Adding CPUs is the simplest
> part. I'm with Matt here, the filesystem approach is the most
> elegant. On the other hand something simple for the start wouldn't be
> bad. It would show immediately whether Matt's or your way of dealing
> with domains is better suited for relinking and reorganising the
> domains structure dynamically. Functionality could be something like:
> - list domains
> - create domain
> - add child domains
> - link in parent domain
> We're building this from bottom (cpus) up and need to take care of the
> unlinking of the global domain when inserting something. But otherwise
> this could be sufficient.
>
> Regards,
> Erich
I personally like to think of it from the top down. The internal API I
came up with looks like:
create_domain(parent_domain, type);
destroy_domain(domain);
add_cpu_to_domain(cpu, domain);
So you basically build your domain from the top down, from your 1 or
more top-level domains, down to your lowest level domains. You then add
cpus (1 or more per domain) to the leaf domains in the tree you built.
Those cpus cascade up the tree, and the whole tree knows exactly which
cpus are contained in each domain in it.
I think these are the three main functions you need to construct pretty
much any conceivable, useful sched_domains hierarchy.
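As a rough illustration, those three calls might compose like this (the
domain type tags SD_SYSTEM/SD_NODE and the wrapper function below are
assumptions; only the three prototypes above come from the proposal):

/*
 * Illustrative: build a 2-node, 4-cpu tree top-down with the proposed
 * API.  SD_SYSTEM and SD_NODE are hypothetical type tags.
 */
static void build_example_tree(void)
{
        struct sched_domain *sys, *node0, *node1;

        sys   = create_domain(NULL, SD_SYSTEM); /* top-level domain */
        node0 = create_domain(sys, SD_NODE);    /* children of sys */
        node1 = create_domain(sys, SD_NODE);

        add_cpu_to_domain(0, node0);    /* cpus cascade up the tree, */
        add_cpu_to_domain(1, node0);    /* so sys ends up spanning 0-3 */
        add_cpu_to_domain(2, node1);
        add_cpu_to_domain(3, node1);
}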
-Matt
On Saturday 09 October 2004 01:50, Nick Piggin wrote:
> Erich Focht wrote:
>
> >>I personally like the hierarchical idea. Machine topologies tend to
> >>look tree-like, and every useful sched_domain layout I've ever seen has
> >>been tree-like. I think our interface should match that.
> >
> >
> > I like the hierarchical idea, too. The natural way to build it would
> > be by starting from the cpus and going up. This tree stands on its
> > leaves... and I'm not sure how to express that in a filesystem.
> >
>
> Why would you ever want to play around with the internals of the
> thing though? Provided you have a way to create exclusive sets of
> CPUs, when would you care about doing more?
Three reasons come immediately to my mind:
- Move the sched domains setup out of the kernel into user
space. With my proposal of a filesystem with directory operations only
(just moving cpuX virtual files around), the boot setup would just be:
global/
cpu1
cpu2
...
The rest could be done in user space, in a very machine- and
load-specific way. This way the kernel scheduler wouldn't need to
struggle to keep up with the characteristics of new machines as they
appear on the radar.
- I sometimes want to create/destroy isolated partitions at a high rate
(through a batch scheduler), and a reasonable API enables me to keep
the domains consistent at all times.
- Flexibility of isolated partitions is a bare necessity. If you
simply divide your system into an interactive and a batch partition,
you'd certainly want to decrease the size of the interactive partition
during the night without rebooting the machine...
Regards,
Erich
On Saturday 09 October 2004 03:05, Matthew Dobson wrote:
> On Fri, 2004-10-08 at 15:51, Erich Focht wrote:
> > We're building this from bottom (cpus) up and need to take care of the
> > unlinking of the global domain when inserting something. But otherwise
> > this could be sufficient.
>
> I personally like to think of it from the top down. The internal API I
> came up with looks like:
>
> create_domain(parent_domain, type);
> destroy_domain(domain);
> add_cpu_to_domain(cpu, domain);
>
> So you basically build your domain from the top down, from your 1 or
> more top-level domains, down to your lowest level domains. You then add
> cpus (1 or more per domain) to the leaf domains in the tree you built.
> Those cpus cascade up the tree, and the whole tree knows exactly which
> cpus are contained in each domain in it.
>
> I think these are the three main functions you need to construct pretty
> much any conceivable, useful sched_domains hierarchy.
I'd suggest adding:
reparent_domain(domain, new_parent_domain);
When I said that the domains tree is standing on its leaves I meant
that the core components are the CPUs. Or the Nodes, if you already
have them. Or some supernodes, if you already have them. In a "normal"
filesystem you have the root directory, create subdirectories and
create files in them. Here you already have the files but not the
structure (or the simplest possible structure).
Anyhow, the 4-command API could well be the guts of the
directory-operations API which I proposed.
Regards,
Erich
On Sun, 2004-10-10 at 05:45, Erich Focht wrote:
> On Saturday 09 October 2004 03:05, Matthew Dobson wrote:
> > I personally like to think of it from the top down. The internal API I
> > came up with looks like:
> >
> > create_domain(parent_domain, type);
> > destroy_domain(domain);
> > add_cpu_to_domain(cpu, domain);
> I'd suggest adding:
> reparent_domain(domain, new_parent_domain);
>
> When I said that the domains tree is standing on its leaves I meant
> that the core components are the CPUs. Or the Nodes, if you already
> have them. Or some supernodes, if you already have them. In a "normal"
> filesystem you have the root directory, create subdirectories and
> create files in them. Here you already have the files but not the
> structure (or the simplest possible structure).
>
> Anyhow, the 4 command API can well be the guts of the directory
> operations API which I proposed.
>
> Regards,
> Erich
I like that suggestion. Paul has been sucked away to other work,
giving me a chance to work on my code, so I will be focusing on
getting the cpusets/CKRM-style interface working with my sched_domains
API. The reparent_domain() suggestion also makes sense with the 'mv'
command, in regard to the filesystem model that cpusets/CKRM
currently use.
-Matt
Here's an attempt at dynamic sched domains aka isolated cpusets
o This functionality is on top of CPUSETs and provides a way to
completely isolate any set of CPUs dynamically.
o There is a new cpu_isolated flag that allows users to convert
an exclusive cpuset to an isolated one.
o The isolated CPUs are part of their own sched domain.
This ensures that the rebalance code works within the domain, and
prevents the overhead of a cpu trying to pull tasks only to find
that the tasks' cpus_allowed masks do not allow them to be pulled.
However, it does not kick existing processes off the isolated domain.
o There is very little code change in the scheduler sched domain
code. Most of it is just splitting up the arch_init_sched_domains
code so that it can be called dynamically instead of only at boot time.
It has only one API, which takes the map of all affected cpus
and the two new domains to be built:
rebuild_sched_domains(cpumask_t change_map, cpumask_t span1, cpumask_t span2)
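As a hedged illustration of invoking that entry point (the wrapper
function below is hypothetical; the cpumask helpers are the standard
kernel ones), splitting an 8-cpu box into two isolated 4-cpu domains
might look like:

#include <linux/cpumask.h>

/* Illustrative only: carve cpus 0-3 and 4-7 into separate domains. */
static void split_in_two_example(void)
{
        cpumask_t span1 = CPU_MASK_NONE, span2 = CPU_MASK_NONE, change;
        int i;

        for (i = 0; i < 4; i++)
                cpu_set(i, span1);
        for (i = 4; i < 8; i++)
                cpu_set(i, span2);

        /* change_map is the union of the two new spans */
        cpus_or(change, span1, span2);
        rebuild_sched_domains(change, span1, span2);
}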
There are some things that may/will change:
o This has been tested only on x86 [8-way -> 4-way with HT]. It still
needs work on other arches.
o I didn't get a chance to look at Nick Piggin's RCU sched domains code
yet, but I know there would be changes here because of that...
o This does not support CPU hotplug as yet.
o Making a cpuset isolated manipulates its parent's cpus_allowed
mask. When viewed from userspace this is represented as follows:
[root@llm11 cpusets] cat cpus
0-3[4-7]
This indicates that CPUs 4-7 are isolated and are part of one or more
child cpusets.
Appreciate any feedback.
Patch against linux-2.6.12-rc1-mm1.
include/linux/init.h | 2
include/linux/sched.h | 1
kernel/cpuset.c | 141 ++++++++++++++++++++++++++++++++++++++++++++++++--
kernel/sched.c | 109 +++++++++++++++++++++++++-------------
4 files changed, 213 insertions(+), 40 deletions(-)
-Dinakar
Dinakar Guniguntala wrote:
> Here's an attempt at dynamic sched domains aka isolated cpusets
>
Very good, I was wondering when someone would try to implement this ;)
It needs some work. A few initial comments on the kernel/sched.c change
- sorry, don't have too much time right now...
> --- linux-2.6.12-rc1-mm1.orig/kernel/sched.c 2005-04-18 00:46:40.000000000 +0530
> +++ linux-2.6.12-rc1-mm1/kernel/sched.c 2005-04-18 00:47:55.000000000 +0530
> @@ -4895,40 +4895,41 @@ static void check_sibling_maps(void)
> }
> #endif
>
> -/*
> - * Set up scheduler domains and groups. Callers must hold the hotplug lock.
> - */
> -static void __devinit arch_init_sched_domains(void)
> +static void attach_domains(cpumask_t cpu_map)
> {
This shouldn't be needed. There should probably just be one place that
attaches all domains. It is a bit difficult to explain what I mean when
you have 2 such places below.
[...]
> +void rebuild_sched_domains(cpumask_t change_map, cpumask_t span1, cpumask_t span2)
> +{
Interface isn't bad. It would seem to be able to handle everything, but
I think it can be made a bit simpler.
fn_name(cpumask_t span1, cpumask_t span2)
Yeah? The change_map is implicitly the union of the 2 spans. Also I don't
really like the name. It doesn't rebuild so much as split (or join). I
can't think of anything good off the top of my head.
> + unsigned long flags;
> + int i;
> +
> + local_irq_save(flags);
> +
> + for_each_cpu_mask(i, change_map)
> + spin_lock(&cpu_rq(i)->lock);
> +
Locking is wrong. And it has changed again in the latest -mm kernel.
Please diff against that.
> + if (!cpus_empty(span1))
> + build_sched_domains(span1);
> + if (!cpus_empty(span2))
> + build_sched_domains(span2);
> +
You also can't do this - you have to 'offline' the domains first before
building new ones. See the CPU hotplug code that handles this.
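A sketch of the ordering being described, modelled on the hotplug path
(cpu_attach_domain() and sched_domain_dummy appear in the diff quoted
further down; the wrapper and the synchronization step are
assumptions):

/*
 * Illustrative: park every affected cpu on the dummy domain before
 * freeing/rebuilding, so no cpu balances over a dying domain.
 */
static void offline_then_rebuild(cpumask_t change_map,
                                 cpumask_t span1, cpumask_t span2)
{
        int i;

        for_each_cpu_mask(i, change_map)
                cpu_attach_domain(&sched_domain_dummy, i);

        /* ... wait for in-flight balancing to drain (locking/RCU) ... */

        if (!cpus_empty(span1))
                build_sched_domains(span1);
        if (!cpus_empty(span2))
                build_sched_domains(span2);
}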
[...]
> @@ -5046,13 +5082,13 @@ static int update_sched_domains(struct n
> unsigned long action, void *hcpu)
> {
> int i;
> + cpumask_t temp_map, hotcpu = cpumask_of_cpu((long)hcpu);
>
> switch (action) {
> case CPU_UP_PREPARE:
> case CPU_DOWN_PREPARE:
> - for_each_online_cpu(i)
> - cpu_attach_domain(&sched_domain_dummy, i);
> - arch_destroy_sched_domains();
> + cpus_andnot(temp_map, cpu_online_map, hotcpu);
> + rebuild_sched_domains(cpu_online_map, temp_map, CPU_MASK_NONE);
This makes a hotplug event destroy your nicely set up isolated domains,
doesn't it?
This looks like the most difficult problem to overcome. It needs some
external information to redo the cpuset splits at cpu hotplug time.
Probably a hotplug handler in the cpusets code might be the best way
to do that.
--
SUSE Labs, Novell Inc.
Hmmm ... interesting patch. My reaction to the changes in
kernel/cpuset.c is complicated:
* I'm supposed to be on vacation the rest of this month,
so trying (entirely unsuccessfully so far) not to think
about this.
* This is perhaps the first non-trivial cpuset patch to come
in the last many months from someone other than Simon or
myself - welcome.
* Some coding style and comment details will need work.
* The conceptual model for how to represent this in cpusets
needs some work.
Let me do two things in this reply. First I'll just shoot off,
shotgun style, the nit-picking coding and comment details that
I notice in a scan of the patch. Then I will step back to a
discussion of the conceptual model. I suspect that by the time
we nail the conceptual model, the code will be sufficiently
rewritten that most of the coding and comment nits will no
longer apply anyway.
But, since nit-picking is easier than real thinking ...
* I'd probably ditch the all_cpus() macro, on the
concern that it obfuscates more than it helps.
* The need for _both_ a per-cpuset flag 'CS_CPU_ISOLATED'
and another per-cpuset mask 'isolated_map' concerns me.
I guess that the isolated_map is just a cache of the
set of CPUs isolated in child cpusets, not an independently
settable mask, but it needs to be clearly marked as such
if so.
* Some code lines go past column 80.
* The name 'isolated' probably won't work. There is already
a boottime option "isolcpus=..." for 'isolated' cpus which
is (I think ?) rather different. Perhaps a better name will
fall out of the conceptual discussion, below.
* The change to the output format of the special cpuset file
'cpus', to look like '0-3[4-7]' bothers me in a couple of
ways. It complicates the format from being a simple list.
And it means that the output format is not the same as the
input format (you can't just write back what you read from
such a file anymore).
* Several comments start with the word 'Set', as in:
Set isolated ON on a non exclusive cpuset
Such wording suggests to me that something is being set,
some bit or value changed or turned on. But in each case,
you are just testing for some condition that will return
or error out. Some phrasing such as "If ..." or other
conditional would be clearer.
* The update_sched_domains() routine is complicated, and
hence a primary clue that the conceptual model is not
clean yet.
* None of this was explained in Documentation/cpusets.txt.
* Too bad that cpuset_common_file_write() has to have special
logic for this isolated case. The other flag settings just
turn on and off the associated bit, and don't trigger any
kernel code to adapt to new cpu or memory settings. We
should make an exception to that behaviour only if we must,
and then we must be explicit about the exception.
Ok - enough nits.
Now, onto the real stuff.
This same issue, in a strange way, comes up on the memory side,
as well as on the cpu side.
First, let me verify one thing. I understand that the _key_
purpose of your patch is not so much to isolate cpus, as it
is to allow for structuring scheduling domains to align with
cpuset boundaries. I understand real isolated cpus to be ones
that don't have a scheduling domain (have only the dummy one),
as requested by the "isolcpus=..." boot flag.
The following code snippet from kernel/sched.c is what I derive
this understanding from:
===
static void __devinit arch_init_sched_domains(void)
{
        ...
        /*
         * Setup mask for cpus without special case scheduling requirements.
         * For now this just excludes isolated cpus, but could be used to
         * exclude other special cases in the future.
         */
        cpus_complement(cpu_default_map, cpu_isolated_map);
        cpus_and(cpu_default_map, cpu_default_map, cpu_online_map);

        /*
         * Set up domains. Isolated domains just stay on the dummy domain.
         */
        for_each_cpu_mask(i, cpu_default_map) {
        ...
===
Second, let me describe how this same issue shows up on the
memory side.
Let's say, for example, someone has partitioned a large
system (100's of cpus and nodes) into two major halves
using cpusets, each half being used by a different
organization.
On one of the halves, they are running a large scientific
program that works on a huge data set that just fits in
the memory available on that half, and they are running
a set of related tools that run different passes over
that data.
Some of these tools might take several cpus, running
parallel threads, and using a little more data shared
by the threads in that tool. Each of these tools might
get its own cpuset, a child (subset) of the big cpuset
that defines the half of the system that this large
scientific program is running within.
The big dataset has to be constrained to the big cpuset
(that half of the system). The smaller tools have to
be constrained to their child cpusets, both for memory
and scheduling.
The individual threads of a tool should probably be
placed using the set_mempolicy and sched_setaffinity
calls, from within the tool. But the tool placement
typically needs to be done from the outside, which
placement cpusets handles better.
This results in some 'memory domains', which are
larger than a leaf node cpuset, smaller than the
entire system, and which will constrain some memory
allocations. In this example, the half of the
system holding the big data set is a memory domain.
These 'memory domains' can be naturally defined by
the memory nodes contained in selected cpusets.
===
Looking at this mathematically, as a hierarchy of nested sets
and subsets, I think we have the same problem, on both the cpu
and memory side.
In both cases, we have an intermediate degree of partitioning
of a system, neither at the most detailed leaf cpuset, nor at
the all encompassing top cpuset. And in both cases, we want
to partition the system, along cpuset boundaries.
Here I use "partition" in the mathematical sense:
===============================================================
A partition of a set X is a set of nonempty subsets of X such
that every element x in X is in exactly one of these subsets.
Equivalently, a set P of subsets of X, is a partition of X if
1. No element of P is empty.
2. The union of the elements of P is equal to X. (We say the
elements of P cover X.)
3. The intersection of any two elements of P is empty. (We say
the elements of P are pairwise disjoint.)
http://www.absoluteastronomy.com/encyclopedia/p/pa/partition_of_a_set.htm
===============================================================
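Restated over cpumasks, an illustrative check (not from any patch in
this thread) that a proposed set of subsets really partitions a span:

#include <linux/cpumask.h>

/* Return 1 iff sub[0..n-1] is a mathematical partition of span. */
static int is_partition(cpumask_t span, cpumask_t *sub, int n)
{
        cpumask_t seen = CPU_MASK_NONE, overlap;
        int i;

        for (i = 0; i < n; i++) {
                if (cpus_empty(sub[i]))         /* 1. no subset is empty */
                        return 0;
                cpus_and(overlap, seen, sub[i]);
                if (!cpus_empty(overlap))       /* 3. pairwise disjoint */
                        return 0;
                cpus_or(seen, seen, sub[i]);
        }
        return cpus_equal(seen, span);          /* 2. the subsets cover span */
}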
In the case of cpus, we really do prefer the partitions to be
disjoint, because it would be better not to confuse the domain
scheduler with overlapping domains.
In the case of memory, we technically probably don't _have_ to
keep the partitions disjoint. I doubt that the page allocator
(mm/page_alloc.c:__alloc_pages()) really cares. It will strive
valiantly to satisfy the memory request from any of the zones
(each node specific) in the list passed into it.
But for the purposes of providing a clear conceptual model to
our users, I think it is best that we impose this constraint on
the memory side as well as on the cpu side. And I don't think
it will deprive users of any useful configuration alternatives
that they will really miss. Indeed, the typical user will be
striving to use this mechanism to separate different demands
for memory - to isolate them onto different nodes, in your
sense of the word 'isolate'.
So, what we want, I claim, is two partitions of the system.
1) A partition of cpus.
2) A partition of memory nodes.
I mean 'partition' in the above mathematical sense, with the
one additional constraint:
* Each subset in both these partitions corresponds to
some cpuset.
That is, for the partition of cpus, for each subset of the
partition there is a cpuset having exactly the same cpus
as that subset, no more, no less.
Similarly for the partition of memory nodes.
At any point in time, there would be exactly one such
partitioning of cpus, and one of memory nodes, on the system.
For the cpu case, we would provide a scheduler domain for each
subset of the cpu partitioning.
For the memory case, we would constrain a given allocation
request to either the current tasks cpuset, or to the containing
subset of the memory node partition, depending on per-cpuset
options which will need to be developed in future patches
that will enable marking either GFP_KERNEL allocations, or
allocations for a named shared memory region (mapped file or
such, not anonymous) to be constrained not by the current tasks
cpuset, but by the encompassing subset of the current partition
of memory nodes - (2) above.
Observe that:
* We can specify whether a given cpusets cpus define one of the
subsets of the systems partitioning of cpus, in (1) above,
using a per-cpuset boolean flag.
* We can similarly specify whether a given cpusets memory nodes
define one of the subsets of the systems partitioning of memory
nodes, in (2) above, using one more per-cpuset boolean flag.
* We cannot, however, do all this correctly just by manipulating
each cpuset in isolation, with per-cpuset atomic operations. Or
at least it _seems_ that we cannot. Keep reading;
I will find a way.
As you discovered in some of the more complex code in your
update_sched_domains() method, we are dealing with system
wide properties here. The above mathematical properties of a
partition must be followed. If we only have atomic operations on
individual cpusets, then it would _seem_ that more or less any
possible change in the partition subsets will require that we
go through an intermediate state that is illegal. For example,
to move a cpu from one subset to another, it would seem that
we must pass through an intermediate state where it is either
in both subsets, or in neither.
So we require a way for the user to tell us which of the several
cpusets in the system define the current partitioning of cpus,
as will be used to determine scheduler domains, and we require
a way for the user to tell us which of the several cpusets in
the system define the current partitioning of memory nodes,
as will be used to determine where specified memory allocations
will be constrained, when they are allowed to escape the cpuset
of the allocating task.
In both these cases, we must handle the case that the user
didn't follow the properties of a partition (perhaps the subsets
overlap, or don't cover), and return an error without making
a change.
In both of these cases, the user must pass in a selection of
cpusets, some specified subset of all the cpusets on a system,
which the user wants to define the partition of the cpus or
memory nodes on the system, henceforth.
Well, not actually system wide. If the user has rights to modify
some existing cpuset Foo in the system, and if the current cpu or
memory partition of the system exactly aligns with that cpuset
Foo (all subsets of the current cpu or memory partition of the
system are either entirely within, or entirely outside) then the
user could be allowed to redefine the partition subsets within
Foo to another that also aligned with Foo. Perhaps the user
could choose two child cpusets of Foo to define the partitions
subsets, and then later revert to having just the cpuset Foo
define them.
This leads to a possible interface. For each of cpus and
memory, add four per-cpuset control files. Let me take the
cpu case first.
Add the per-cpuset control files:
* domain_cpu_current # readonly boolean
* domain_cpu_pending # read/write boolean
* domain_cpu_rebuild # write only trigger
* domain_cpu_error # read only - last error msg
To rebuild the cpu partitioning below a given cpuset Foo,
the user would:
1) Write 0 or 1 to the domain_cpu_pending file of
each cpuset Foo and below, so that just the cpusets
whose cpus were desired to define the partition
subsets (and hence have dedicated scheduler domains)
had the value '1' in this file.
2) Write a 1 to the domain_cpu_rebuild trigger file
of cpuset Foo.
3) If the write succeeded, the scheduler domains within
the set of cpus in Foo were rebuilt, at that time.
4) If the write failed, read the domain_cpu_error file
for an explanation.
If cpuset Foo aligns with the current system cpu partition, and
if the cpus of the cpusets marked domain_cpu_pending below Foo
define a proper partition of the cpus in Foo, then the write
will succeed, updating the values of the domain_cpu_current
control files for Foo and below to the values that were in
the domain_cpu_pending files, and provoking a rebuild of the
scheduler domains below Foo.
Otherwise the write will fail, and an error message explaining
the problem made available in domain_cpu_error for subsequent
reading. Just setting errno would be insufficient in this
case, as the possible reasons for error are too complex to be
adequately described that way.
Similarly for memory, add the per-cpuset control files:
* domain_mem_current # readonly boolean
* domain_mem_pending # read/write boolean
* domain_mem_rebuild # write only trigger
* domain_mem_error # read only - last error msg
Note, as a detail, that there is no interaction of this domain
feature with the cpu_exclusive or mem_exclusive feature.
This is good. The exclusive feature is of narrow usefulness,
and attempting to integrate it into this domain feature will
cause more grief than benefit.
Also note that adding or removing a cpu from a cpuset that has
its domain_cpu_current flag set true must fail, and similarly
for domain_mem_current.
There are likely (hopefully ;) other possible API's that
accomplish the same thing. But in the process of describing
this API, I hope I have touched on some of the properties that
cpuset domains for cpu and memory should have.
The above scheme should significantly reduce the number of
special cases in the update_sched_domains() routine (which I
would rename to update_cpu_domains, alongside another one to be
provided later, update_mem_domains). These new update routines
will verify that all the preconditions are met, tear down all
the cpu or mem domains within the scope of the specified cpuset,
and rebuild them according to the partition defined by the
domain_*_pending flags on the descendant cpusets. It's the
same complete rebuild of the partitioning of some subtree,
each time, without all the special cases for incrementally
adding and removing cpus or mems from this or that. Complex
nested if-else-if-else logic is a breeding ground for bugs --
good riddance.
As stated above, there is a single system wide partition of
cpus, and another of mems. I suspect we should consider finding
a way to nest partitions. My (shaky) understanding of what
Nick is doing with scheduler domains is that for the biggest of
systems, we will probably want little scheduler domains inside
bigger ones.
However, if we thought we could avoid, or at least delay
consideration of nested partitions, that would be nice.
This thing is already abstract enough to puzzle many users,
without adding that elaboration.
There -- what do you think of this alternative?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
On Mon, 2005-04-18 at 22:54 -0700, Paul Jackson wrote:
> Now, onto the real stuff.
>
> This same issue, in a strange way, comes up on the memory side,
> as well as on the cpu side.
>
> First, let me verify one thing. I understand that the _key_
> purpose of your patch is not so much to isolate cpus, as it
> is to allow for structuring scheduling domains to align with
> cpuset boundaries. I understand real isolated cpus to be ones
> that don't have a scheduling domain (have only the dummy one),
> as requested by the "isolcpus=..." boot flag.
>
Yes.
> The following code snippet from kernel/sched.c is what I derive
> this understanding from:
>
Correct. A better name than 'isolated cpusets' may be
'partitioned cpusets' or some such.
On the other hand, it is more or less equivalent to a single
isolated CPU. Instead of an isolated cpu, you have an isolated
cpuset.
Though I imagine this becomes a complete superset of the
isolcpus= functionality, and it would actually be easier to
manage a single isolated CPU and its associated processes with
the cpusets interfaces after this.
> In both cases, we have an intermediate degree of partitioning
> of a system, neither at the most detailed leaf cpuset, nor at
> the all encompassing top cpuset. And in both cases, we want
> to partition the system, along cpuset boundaries.
>
Yep. This sched-domains partitioning only works when you have
two or more completely disjoint top-level cpusets. That is,
you effectively partition the CPUs.
It doesn't work if you have *most* jobs bound to either
{0, 1, 2, 3} or {4, 5, 6, 7} but one which should be allowed
to use any CPU from 0-7.
> Here I use "partition" in the mathematical sense:
>
> ===============================================================
> A partition of a set X is a set of nonempty subsets of X such
> that every element x in X is in exactly one of these subsets.
>
> Equivalently, a set P of subsets of X, is a partition of X if
>
> 1. No element of P is empty.
> 2. The union of the elements of P is equal to X. (We say the
> elements of P cover X.)
> 3. The intersection of any two elements of P is empty. (We say
> the elements of P are pairwise disjoint.)
>
> http://www.absoluteastronomy.com/encyclopedia/p/pa/partition_of_a_set.htm
> ===============================================================
>
> In the case of cpus, we really do prefer the partitions to be
> disjoint, because it would be better not to confuse the domain
> scheduler with overlapping domains.
>
Yes. The domain scheduler can't handle this at all; it would
have to fall back on cpus_allowed, which in turn can create
big problems for multiprocessor balancing.
> For the cpu case, we would provide a scheduler domain for each
> subset of the cpu partitioning.
>
Yes.
[snip the rest, which I didn't finish reading :P]
From what I gather, this partitioning does not exactly fit
the cpusets architecture, because with cpusets you are specifying
on which cpus a set of tasks can run, not dividing the whole system.
Basically for the sched-domains code to be happy, there should be
some top level entity (whether it be cpusets or something else) which
records your current partitioning (the default being one set,
containing all cpus).
> As stated above, there is a single system wide partition of
> cpus, and another of mems. I suspect we should consider finding
> a way to nest partitions. My (shaky) understanding of what
> Nick is doing with scheduler domains is that for the biggest of
> systems, we will probably want little scheduler domains inside
> bigger ones.
The sched-domains setup code will take care of all that for you
already. It won't know or care about the partitions. If you
partition a 64-way system into 2 32-ways, the domain setup code
will just think it is setting up a 32-way system.
Don't worry about the sched-domains side of things at all, that's
pretty easy. Basically you just have to know that it has the
capability to partition the system in an arbitrary disjoint set
of sets of cpus.
If you can make use of that, then we're in business ;)
--
SUSE Labs, Novell Inc.
Nick wrote:
> Basically you just have to know that it has the
> capability to partition the system in an arbitrary disjoint set
> of sets of cpus.
>
> If you can make use of that, then we're in business ;)
You read fast ;)
So you do _not_ want to consider nested sched domains, just disjoint
ones. Good.
> From what I gather, this partitioning does not exactly fit
> the cpusets architecture, because with cpusets you are specifying
> on which cpus a set of tasks can run, not dividing the whole system.
My evil scheme, and Dinakar's as well, is to provide a way for the user
to designate _some_ of their cpusets as also defining the partition that
controls which cpus are in each sched domain, and so dividing the
system.
"partition" == "an arbitrary disjoint set of sets of cpus"
This fits naturally with the way people use cpusets anyway. They divide
up the system along boundaries that are natural topologically and that
provide a good fit for their jobs, and hope that the kernel will adapt
to such localized placement. They then throw a few more nested (smaller)
cpusets at the problem, to deal with various special needs. If we can
provide them with a means to tell us which of their cpusets define the
natural partitioning of their system, for the job mix and hardware
topology they have, then all is well.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
On Mon, 2005-04-18 at 23:59 -0700, Paul Jackson wrote:
> Nick wrote:
> > Basically you just have to know that it has the
> > capability to partition the system in an arbitrary disjoint set
> > of sets of cpus.
> >
> > If you can make use of that, then we're in business ;)
>
> You read fast ;)
>
> So you do _not_ want to consider nested sched domains, just disjoint
> ones. Good.
>
You don't either? Good. :)
>
> > From what I gather, this partitioning does not exactly fit
> > the cpusets architecture, because with cpusets you are specifying
> > on which cpus a set of tasks can run, not dividing the whole system.
>
> My evil scheme, and Dinakar's as well, is to provide a way for the user
> to designate _some_ of their cpusets as also defining the partition that
> controls which cpus are in each sched domain, and so dividing the
> system.
>
> "partition" == "an arbitrary disjoint set of sets of cpus"
>
That would make sense. I'm not familiar with the workings of cpusets,
but that would require every task to be assigned to one of these
sets (or a subset within it), yes?
> This fits naturally with the way people use cpusets anyway. They divide
> up the system along boundaries that are natural topologically and that
> provide a good fit for their jobs, and hope that the kernel will adapt
> to such localized placement. They then throw a few more nested (smaller)
> cpusets at the problem, to deal with various special needs. If we can
> provide them with a means to tell us which of their cpusets define the
> natural partitioning of their system, for the job mix and hardware
> topology they have, then all is well.
>
Sounds like a good fit then. I'll touch up the sched-domains side of
the equation when I get some time hopefully this week or next.
--
SUSE Labs, Novell Inc.
Nick wrote:
> It doesn't work if you have *most* jobs bound to either
> {0, 1, 2, 3} or {4, 5, 6, 7} but one which should be allowed
> to use any CPU from 0-7.
How bad does it not work?
My understanding is that Dinakar's patch did _not_ drive tasks out of
larger cpusets that included two or more of what he called isolated
cpusets (I call them cpuset domains).
For example:
System starts up with 8 CPUs and all tasks (except for
a few kernel per-cpu daemons) in the root cpuset, able
to run on CPUs 0-7.
Two cpusets, Alpha and Beta are created, where Alpha
has CPUs 0-3, and Beta has CPUs 4-7.
Anytime someone logs in, their login shell and all
they run from it are placed in one of Alpha or Beta.
The main spawning daemons, such as inetd and cron,
are placed in one of Alpha or Beta.
Only a few daemons that don't do much are left in the
root cpuset, able to run across 0-7.
If we tried to partition the sched domains with Alpha and Beta as
separate domains, what kind of pain do these few daemons in
the root cpuset, on CPUs 0-7, cause?
If the pain is too intolerable, then I'd guess we'd not only have to
purge _all_ tasks from any cpusets superior to the ones determining the
domain partitioning, but we'd also have to invent yet one more
boolean flag attribute for any such superior cpusets, to mark them as
_not_ able to allow a task to be attached to them. And we'd have to
refine the hotplug co-existence logic in cpusets, which currently bumps
a task up to its parent cpuset when all the cpus in its current cpuset
are hot-unplugged, to also rebuild the sched domains to some legal
configuration if the parent cpuset was not allowed to have any tasks
attached.
I'd rather not go there, unless push comes to shove. How hard are
you pushing?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
> > So you do _not_ want to consider nested sched domains, just disjoint
> > ones. Good.
> >
>
> You don't either? Good. :)
From the point of view of cpusets, I'd rather not think
about nested sched domains, for now at least.
But my scheduler savvy colleagues on the big SGI boxes
may well have ambitions here. I can't speak for them.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
Nick wrote:
> That would make sense. I'm not familiar with the workings of cpusets,
> but that would require every task to be assigned to one of these
> sets (or a subset within it), yes?
That's the rub, as I noted a couple of messages ago, while you
were writing this message.
It doesn't require every task to be in one of these sets, or a subset.
Tasks could be in some multiple-domain superset, unless that is so
painful that we have to add mechanisms to cpusets to prohibit it.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
On Tue, Apr 19, 2005 at 09:44:06AM +1000, Nick Piggin wrote:
> Very good, I was wondering when someone would try to implement this ;)
Thank you for the feedback !
> >-static void __devinit arch_init_sched_domains(void)
> >+static void attach_domains(cpumask_t cpu_map)
> > {
>
> This shouldn't be needed. There should probably just be one place that
> attaches all domains. It is a bit difficult to explain what I mean when
> you have 2 such places below.
>
Can you explain a bit more? I'm not sure I understand what you mean.
> Interface isn't bad. It would seem to be able to handle everything, but
> I think it can be made a bit simpler.
>
> fn_name(cpumask_t span1, cpumask_t span2)
>
> Yeah? The change_map is implicitly the union of the 2 spans. Also I don't
> really like the name. It doesn't rebuild so much as split (or join). I
> can't think of anything good off the top of my head.
Yeah, agreed. It kinda lived on from earlier versions I had.
>
> >+ unsigned long flags;
> >+ int i;
> >+
> >+ local_irq_save(flags);
> >+
> >+ for_each_cpu_mask(i, change_map)
> >+ spin_lock(&cpu_rq(i)->lock);
> >+
>
> Locking is wrong. And it has changed again in the latest -mm kernel.
> Please diff against that.
>
I haven't looked at the RCU sched domain changes as yet, but I put this in
to address some problems I noticed during stress testing.
Basically, with the current hotplug code, it is possible to have a scenario
like this:
rebuild domains                  load balance
       |                              |
       |                    take existing sd pointer
       |                              |
attach to dummy domain                |
       |                    loop thro sched groups
change sched group info               |
                            access invalid pointer and panic
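In code terms, here is a sketch of how the rebuild side can close that
window (illustrative only: the helper name is invented, and the -mm tree
layers RCU-style synchronization on top of this before old groups are
freed):

/*
 * Sketch: point every affected runqueue at the dummy domain *under*
 * the runqueue locks, before any sched_group data is touched, so a
 * concurrent load balancer can never walk a half-rewritten group list.
 * cpu_rq() and struct sched_domain are the 2.6-era scheduler's;
 * quiesce_domains() is a hypothetical helper.
 */
static void quiesce_domains(cpumask_t change_map,
			    struct sched_domain *dummy_sd)
{
	unsigned long flags;
	int i;

	local_irq_save(flags);
	for_each_cpu_mask(i, change_map)
		spin_lock(&cpu_rq(i)->lock);

	for_each_cpu_mask(i, change_map)
		cpu_rq(i)->sd = dummy_sd;	/* no balancing via old tree */

	for_each_cpu_mask(i, change_map)
		spin_unlock(&cpu_rq(i)->lock);
	local_irq_restore(flags);

	/* only after this (plus a grace period, in the -mm variant) is it
	 * safe to rewrite sched_groups and attach the new domains */
}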
> >+ if (!cpus_empty(span1))
> >+ build_sched_domains(span1);
> >+ if (!cpus_empty(span2))
> >+ build_sched_domains(span2);
> >+
>
> You also can't do this - you have to 'offline' the domains first before
> building new ones. See the CPU hotplug code that handles this.
>
If by "offline" you mean attaching to the dummy domain, see above.
> This makes a hotplug event destroy your nicely set up isolated domains,
> doesn't it?
>
> This looks like the most difficult problem to overcome. It needs some
> external information to redo the cpuset splits at cpu hotplug time.
> Probably a hotplug handler in the cpusets code might be the best way
> to do that.
Yes, I am aware of this. What I have in mind is for the scheduler's
hotplug code to call into the cpusets code. That call will simply
return (say, 1) when cpusets is not compiled in, and the sched code
can continue to do what it is doing right now; otherwise the cpusets
code will find the leaf cpuset that contains the hotplugged cpu and
rebuild the domains accordingly. However, the question still remains
as to how cpusets should handle this hotplugged cpu.
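Schematically, such a hook might be declared like this (the name
cpuset_handle_cpuhp() is invented here for illustration; it is not in
the patch):

/*
 * Hypothetical call-out from the scheduler's hotplug path: ask cpusets
 * to rebuild the domain partition around the hotplugged cpu.  Returns
 * nonzero when cpusets is not configured, so the scheduler falls back
 * to its current full-rebuild behaviour.
 */
#ifdef CONFIG_CPUSETS
extern int cpuset_handle_cpuhp(int cpu);  /* 0: leaf found, domains rebuilt */
#else
static inline int cpuset_handle_cpuhp(int cpu)
{
	return 1;			  /* no cpusets: caller handles it */
}
#endif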
-Dinakar
On Tue, 2005-04-19 at 00:19 -0700, Paul Jackson wrote:
> Nick wrote:
> > It doesn't work if you have *most* jobs bound to either
> > {0, 1, 2, 3} or {4, 5, 6, 7} but one which should be allowed
> > to use any CPU from 0-7.
>
> How bad does it not work?
>
> My understanding is that Dinakar's patch did _not_ drive tasks out of
> larger cpusets that included two or more of what he called isolated
> cpusets, I call cpuset domains.
>
> For example:
>
> System starts up with 8 CPUs and all tasks (except for
> a few kernel per-cpu daemons) in the root cpuset, able
> to run on CPUs 0-7.
>
> Two cpusets, Alpha and Beta are created, where Alpha
> has CPUs 0-3, and Beta has CPUs 4-7.
>
> Anytime someone logs in, their login shell and all
> they run from it are placed in one of Alpha or Beta.
> The main spawning daemons, such as inetd and cron,
> are placed in one of Alpha or Beta.
>
> Only a few daemons that don't do much are left in the
> root cpuset, able to run across 0-7.
>
> If we tried to partition the sched domains with Alpha and Beta as
> separate domains, what kind of pain do these few daemons in
> the root cpuset, on CPUs 0-7, cause?
>
They don't cause any pain for the scheduler. They will be *in* some
pain because they can't escape from the domain in which they have been
placed (unless you do a set_cpus_allowed thingy).
So, e.g., inetd might start up a million blahd servers, but they'll
all be stuck in Alpha even if Beta is completely idle.
> If the pain is too intolerable, then I'd guess not only do we have to
> purge any cpusets superior to the ones determining the domain
> partitioning of _all_ tasks, but we'd also have to invent yet one more
> boolean flag attribute for any such superior cpusets, to mark them as
> _not_ able to allow a task to be attached to them. And we'd have to
> refine the hotplug co-existance logic in cpusets, which currently bumps
> a task up to its parent cpuset when all the cpus in its current cpuset
> are hot unplugged, to also rebuild the sched domains to some legal
> configuration, if the parent cpuset was not allowed to have any tasks
> attached.
>
> I'd rather not go there, unless push comes to shove. How hard are
> you pushing?
>
Well the scheduler simply can't handle it, so it is not so much a
matter of pushing - you simply can't use partitioned domains and
meaningfully have a cpuset above them.
--
SUSE Labs, Novell Inc.
On Mon, 18 Apr 2005, Paul Jackson wrote:
> Hmmm ... interesting patch. My reaction to the changes in
> kernel/cpuset.c are complicated:
>
> * I'm supposed to be on vacation the rest of this month,
> so trying (entirely unsuccessfully so far) not to think
> about this.
> * This is perhaps the first non-trivial cpuset patch to come
> in the last many months from someone other than Simon or
> myself - welcome.
I'm glad to see this happening.
> This leads to a possible interface. For each of cpus and
> memory, add four per-cpuset control files. Let me take the
> cpu case first.
>
> Add the per-cpuset control files:
> * domain_cpu_current # readonly boolean
> * domain_cpu_pending # read/write boolean
> * domain_cpu_rebuild # write only trigger
> * domain_cpu_error # read only - last error msg
> 4) If the write failed, read the domain_cpu_error file
> for an explanation.
> Otherwise the write will fail, and an error message explaining
> the problem made available in domain_cpu_error for subsequent
> reading. Just setting errno would be insufficient in this
> case, as the possible reasons for error are too complex to be
> adequately described that way.
I guess we hit a limit of the filesystem-interface approach here.
Are the possible failure reasons really that complex?
Is such an error reporting scheme already in use in the kernel?
I find the two-files approach a bit disturbing -- we have no guarantee
that the error we read is the error we produced. If this is only to get a
hint, OK.
On the other hand, there's also no guarantee that what we are triggering
by writing in domain_cpu_rebuild is what we have set up by writing in
domain_cpu_pending. User applications will need a bit of self-discipline.
> The above scheme should significantly reduce the number of
> special cases in the update_sched_domains() routine (which I
> would rename to update_cpu_domains, alongside another one to be
> provided later, update_mem_domains.) These new update routines
> will verify that all the preconditions are met, tear down all
> the cpu or mem domains within the scope of the specified cpuset,
> and rebuild them according to the partition defined by the
> pending_*_domain flags on the descendent cpusets. It's the
> same complete rebuild of the partitioning of some subtree,
> each time, without all the special cases for incrementally
> adding and removing cpus or mems from this or that. Complex
> nested if-else-if-else logic is a breeding ground for bugs --
> good riddance.
Oh yes.
There's already a good bunch of if-then-else logic in the cpusets because
of the different flags that can apply. We don't need more.
> There -- what do you think of this alternative?
Most of all, that you write mails faster than I am able to read them, so I
might have missed something. But so far I like your proposal.
Simon.
On Mon, Apr 18, 2005 at 10:54:27PM -0700, Paul Jackson wrote:
> Hmmm ... interesting patch. My reaction to the changes in
> kernel/cpuset.c are complicated:
Thanks Paul for taking time off your vacation to reply to this.
I was expecting to see one of your huge mails but this has
exceeded all my expectations :)
> * I'd probably ditch the all_cpus() macro, on the
> concern that it obfuscates more than it helps.
> * The need for _both_ a per-cpuset flag 'CS_CPU_ISOLATED'
> and another per-cpuset mask 'isolated_map' concerns me.
> I guess that the isolated_map is just a cache of the
> set of CPUs isolated in child cpusets, not an independently
> settable mask, but it needs to be clearly marked as such
> if so.
Currently the isolated_map is read-only, as you have guessed.
I did think of letting the user add cpus to this map from the
cpus_allowed mask, but thought the current approach made more sense.
> * Some code lines go past column 80.
I need to set my vi to wrap at 80 columns...
> * The name 'isolated' probably won't work. There is already
> a boottime option "isolcpus=..." for 'isolated' cpus which
> is (I think ?) rather different. Perhaps a better name will
> fall out of the conceptual discussion, below.
I was hoping that by the time we are done with this, we would
be able to completely get rid of the isolcpus= option. For that,
of course, we need to be able to build domains that don't run
load balancing.
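One plausible mechanism for that - a sketch only, not something the
patch does today - is to clear the existing SD_LOAD_BALANCE flag on a
domain built for such a cpuset, since rebalance_tick() tests that flag
before calling load_balance():

/* Sketch: a domain spanning a "no load balance" cpuset could simply
 * have SD_LOAD_BALANCE (an existing 2.6 sched_domain flag) cleared,
 * making the periodic balancer skip it entirely. */
static void domain_disable_balancing(struct sched_domain *sd)
{
	sd->flags &= ~SD_LOAD_BALANCE;
}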
> * The change to the output format of the special cpuset file
> 'cpus', to look like '0-3[4-7]' bothers me in a couple of
> ways. It complicates the format from being a simple list.
> And it means that the output format is not the same as the
> input format (you can't just write back what you read from
> such a file anymore).
As I had said in my earlier mail, this was just one way of
representing what I call isolated cpus. The other was to expose
isolated_map to userspace and move cpus between cpus_allowed
and isolated_map.
> * Several comments start with the word 'Set', as in:
> Set isolated ON on a non exclusive cpuset
> Such wording suggests to me that something is being set,
> some bit or value changed or turned on. But in each case,
> you are just testing for some condition that will return
> or error out. Some phrasing such as "If ..." or other
> conditional would be clearer.
The wording was from the user's point of view, for what
action was being done; I guess I'll change that.
> * The update_sched_domains() routine is complicated, and
> hence a primary clue that the conceptual model is not
> clean yet.
It is complicated because it has to handle all of the different
possible actions that the user can initiate. It can be simplified
if we have stricter rules about what the user can/cannot do
w.r.t. isolated cpusets.
> * None of this was explained in Documentation/cpusets.txt.
Yes, I plan to add the documentation shortly.
> * Too bad that cpuset_common_file_write() has to have special
> logic for this isolated case. The other flag settings just
> turn on and off the associated bit, and don't trigger any
> kernel code to adapt to new cpu or memory settings. We
> should make an exception to that behaviour only if we must,
> and then we must be explicit about the exception.
See my notes on isolated_map above
> First, let me verify one thing. I understand that the _key_
> purpose of your patch is not so much to isolate cpus, as it
> is to allow for structuring scheduling domains to align with
> cpuset boundaries. I understand real isolated cpus to be ones
> that don't have a scheduling domain (have only the dummy one),
> as requested by the "isolcpus=..." boot flag.
Not really. Isolated cpusets allow you to do a soft-partition
of the system, and it would make sense to continue to have load
balancing within these partitions. I would think that not having
load balancing should be one of the options available.
>
> Second, let me describe how this same issue shows up on the
> memory side.
>
...snip...
>
>
> In the case of cpus, we really do prefer the partitions to be
> disjoint, because it would be better not to confuse the domain
> scheduler with overlapping domains.
Absolutely. One of the problems I had was to map the flat, disjoint
hierarchy of sched domains to the tree-like hierarchy of cpusets.
>
> In the case of memory, we technically probably don't _have_ to
> keep the partitions disjoint. I doubt that the page allocator
> (mm/page_alloc.c:__alloc_pages()) really cares. It will strive
> valiantly to satisfy the memory request from any of the zones
> (each node specific) in the list passed into it.
>
I must confess that I haven't looked at the memory side all that much,
having more interest in trying to build soft-partitioning of the CPUs.
> But for the purposes of providing a clear conceptual model to
> our users, I think it is best that we impose this constraint on
> the memory side as well as on the cpu side. And I don't think
> it will deprive users of any useful configuration alternatives
> that they will really miss. Indeed, the typical user will be
> striving to use this mechanism to separate different demands
> for memory - to isolate them on to different nodes in your
> sense of the word isolate.
>
[...Big snip of new model...]
OK, I need to spend more time on your model, Paul, but my first
guess is that it doesn't seem to be very intuitive and seems
to make things very complex from the user's perspective. However, as
I said, I need to understand your model a bit more before I
comment on it.
>
> However, if we thought we could avoid, or at least delay
> consideration of nested partitions, that would be nice.
> This thing is already abstract enough to puzzle many users,
> without adding that elaboration.
Nested sched domains are going to be nasty and I am not
at all for them. Moreover, I think it makes more sense
to have a flat hierarchy of sched domains.
-Dinakar
On Tue, Apr 19, 2005 at 04:19:35PM +1000, Nick Piggin wrote:
[...Snip...]
> Though I imagine this becomes a complete superset of the
> isolcpus= functionality, and it would actually be easier to
> manage a single isolated CPU and its associated processes with
> the cpusets interfaces after this.
That is the idea, though I think we need to be able to
provide users the option of not doing load balancing within a
sched domain.
> It doesn't work if you have *most* jobs bound to either
> {0, 1, 2, 3} or {4, 5, 6, 7} but one which should be allowed
> to use any CPU from 0-7.
That is the current definition of cpu_exclusive on cpusets.
I initially thought of attaching exclusive cpusets to sched domains,
but that would not work because of this reason
> >
> > In the case of cpus, we really do prefer the partitions to be
> > disjoint, because it would be better not to confuse the domain
> > scheduler with overlapping domains.
> >
>
> Yes. The domain scheduler can't handle this at all, it would
> have to fall back on cpus_allowed, which in turn can create
> big problems for multiprocessor balancing.
>
I agree
> From what I gather, this partitioning does not exactly fit
> the cpusets architecture. Because with cpusets you are specifying
> on what cpus can a set of tasks run, not dividing the whole system.
Since isolated cpusets are trying to partition the system, this
can be restricted to only the first level of cpusets. Keeping in mind
that we have a flat sched domain hierarchy, I think this would
probably simplify the update_sched_domains function quite a bit.
Also, I think we can add further restrictions in terms of not being able
to change (add/remove) cpus within an isolated cpuset. Instead one would
have to tear down an existing cpuset and make a new one with the
required configuration. That would simplify things even further.
> The sched-domains setup code will take care of all that for you
> already. It won't know or care about the partitions. If you
> partition a 64-way system into 2 32-ways, the domain setup code
> will just think it is setting up a 32-way system.
>
> Don't worry about the sched-domains side of things at all, that's
> pretty easy. Basically you just have to know that it has the
> capability to partition the system in an arbitrary disjoint set
> of sets of cpus.
And maybe also have a flag that says whether to have load balancing
in this domain or not
-Dinakar
Dinakar, replying to Nick:
> > It doesn't work if you have *most* jobs bound to either
> > {0, 1, 2, 3} or {4, 5, 6, 7} but one which should be allowed
> > to use any CPU from 0-7.
>
> That is the current definition of cpu_exclusive on cpusets.
> I initially thought of attaching exclusive cpusets to sched domains,
> but that would not work because of this reason
I can't make any sense of this reply, Dinakar.
You say "_That_" is the current definition of cpu_exclusive -- I have no
idea what "_That_" refers to. I see nothing in what Nick wrote that has
anything much to do with the definition of cpu_exclusive. If a cpuset
is marked cpu_exclusive, it means that the kernel will not allow any of
its siblings to have overlapping CPUs. It doesn't mean that its parent
can't overlap CPUs -- indeed it's parent must contain a superset of all
the CPUs in a cpu_exclusive cpuset and its siblings. It doesn't mean
that there cannot be tasks attached to each of the cpu_exclusive cpuset,
its siblings and its parent.
You say "attaching exclusive cpusets to sched domains ... would not work
because of this reason." I have no idea what "this reason" is.
I am pretty sure of a couple of things:
* Your understanding of "cpu_exclusive" is not the same as mine.
* We want to avoid any dependency on "cpu_exclusive" here.
> Since isolated cpusets are trying to partition the system, this
> can be restricted to only the first level of cpusets.
I do not think such a restriction is a good idea. For example, lets say
our 8 CPU system has the following cpusets:
/ # 0-7
/Alpha # 0-3
/Alpha/phi # 0-1
/Alpha/chi # 2-3
/Beta # 4-7
Then I see no problem with cpusets /Alpha/phi, /Alpha/chi and /Beta
being the isolated cpusets, with corresponding scheduler domains. But
phi and chi are not "first level cpusets." If we require a partition
(disjoint cover) of the CPUs in the system, then enforce exactly that.
Do not confuse a rough approximation with a simplified model.
> Also, I think we can add further restrictions in terms of not being able
> to change (add/remove) cpus within an isolated cpuset.
My approach agrees on this restriction. Earlier I wrote:
> Also note that adding or removing a cpu from a cpuset that has
> its domain_cpu_current flag set true must fail, and similarly
> for domain_mem_current.
This restriction is required in my approach because the CPUs in the
domain_cpu_current cpusets (the isolated CPUs, in your terms) form a
partition (disjoint cover) of the CPUs in the system, which property
would be violated immediately if any CPU were added or removed from any
cpuset defining the partition.
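Stated as code, the invariant is a cheap cpumask check (a sketch: the
array is a schematic stand-in for walking the cpusets selected to
define the partition):

#include <linux/cpumask.h>

/* Sketch: the masks are pairwise disjoint and together cover exactly
 * the parent's CPUs -- i.e. they form a partition (disjoint cover). */
static int is_partition(const cpumask_t *parent,
			const cpumask_t masks[], int n)
{
	cpumask_t seen = CPU_MASK_NONE;
	int i;

	for (i = 0; i < n; i++) {
		if (cpus_intersects(seen, masks[i]))
			return 0;		/* two subsets overlap */
		cpus_or(seen, seen, masks[i]);
	}
	return cpus_equal(seen, *parent);	/* union must cover exactly */
}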
> Instead one would
> have to tear down an existing cpuset and make a new one with the
> required configuration. That would simplify things even further
You've just come close to describing the approach that it took me
"several more" words to describe. Though one doesn't need to tear down
or make any new cpusets; rather one atomically selects a new set of
cpusets to define the partition.
If one had to tear down and remake cpusets to change the partition, then
one would be in trouble -- it would be difficult to provide an API that
allowed doing that atomically. If it's not atomic, then we have illegal
intermediate states, where one cpuset is gone and the new one has not
arrived, and our partition of the cpusets in the system no longer covers
the system ("our cover is blown", as they say in undercover police
work.)
> And maybe also have a flag that says whether to have load balancing
> in this domain or not
It's probably too early to think about that.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
Simon wrote:
> I guess we hit a limit of the filesystem-interface approach here.
> Are the possible failure reasons really that complex?
Given the amount of head scratching my proposal has provoked
so far, they might be that complex, yes. Failure reasons
include:
* The cpuset Foo whose domain_cpu_rebuild file we wrote does
not align with the current partition of CPUs on the system
(align: every subset of the partition is either within or
outside the CPUs of Foo)
* The cpusets Foo and its descendents which are marked with
a true domain_cpu_pending do not form a partition of the
CPUs in Foo. This could be either because two of these
cpusets have overlapping CPUs, or because the union of all
the CPUs in these cpusets doesn't cover.
* The usual other reasons such as lacking write permission.
> If this is only to get a hint, OK.
Yes - it would be a hint. The official explanation would be the
errno setting on the failed write. The hint, written to the
domain_cpu_error file, could actually state which two cpusets
had overlapping CPUs, or which CPUs in Foo were not covered by
the union of the CPUs in the marked descendent cpusets.
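For illustration, a userspace caller of this proposed (not yet
existing) interface might look like the following; both file names are
from the proposal above, and nothing here exists in any current kernel:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch: trigger the rebuild, and on failure fetch the
 * human-readable hint from domain_cpu_error. */
int rebuild_cpu_domains(const char *cpuset_dir)
{
	char path[256], hint[256];
	ssize_t n;
	int fd;

	snprintf(path, sizeof(path), "%s/domain_cpu_rebuild", cpuset_dir);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	if (write(fd, "1", 1) == 1) {
		close(fd);
		return 0;		/* new partition is in effect */
	}
	close(fd);			/* write failed: go get the hint */

	snprintf(path, sizeof(path), "%s/domain_cpu_error", cpuset_dir);
	fd = open(path, O_RDONLY);
	if (fd >= 0) {
		n = read(fd, hint, sizeof(hint) - 1);
		if (n > 0) {
			hint[n] = '\0';
			fprintf(stderr, "domain rebuild failed: %s\n", hint);
		}
		close(fd);
	}
	return -1;
}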
Yes - it is pushing the limits of available mechanisms. Though I don't
offhand see where the filesystem-interface approach is to blame here.
Can you describe any other approach that would provide such a similarly
useful error explanation in a less unusual fashion?
> Is such an error reporting scheme already in use in the kernel?
I don't think so.
> On the other hand, there's also no guarantee that what we are triggering
> by writing in domain_cpu_rebuild is what we have set up by writing in
> domain_cpu_pending. User applications will need a bit of self-discipline.
True.
To preserve the invariant that the CPUs in the selected cpusets form a
partition (disjoint cover) of the system's CPUs, we either need to
provide an atomic operation that allows passing in a selection of
cpusets, or we need to provide a sequence of operations that essentially
drive a little finite state machine, building up a description of the
new state while the old state remains in place, until the final trigger
is fired.
This suggests what the primary alternative to my proposed API would be,
and that would be an interface that let one pass in a list of cpusets,
requesting that the partition below the specified cpuset subtree Foo be
completely and atomically rebuilt, to be that defined by the list of
cpusets, with the set of CPUs in each of these cpusets defining one
subset of the partition.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
Dinakar wrote:
> I was hoping that by the time we are done with this, we would
> be able to completely get rid of the isolcpus= option.
I won't miss it. Though, since it's in the main line kernel,
do you need to mark it deprecated for a while first?
> For that,
> of course, we need to be able to build domains that don't run
> load balancing
Ah - so that's what these isolcpus are - ones not load balanced?
This was never clear to me.
> The wording [/* Set ... */ ] was from the user's point of view,
> for what action was being done; I guess I'll change that
Ok - at least now I can read and understand the comments, knowing this.
The other comments in cpuset.c don't follow this convention, of speaking
in the "user's voice", but rather speak in the "responding systems
voice." Best to remain consistent in this matter.
> It is complicated because it has to handle all of the different
> possible actions that the user can initiate. It can be simplified
> if we have stricter rules about what the user can/cannot do
> w.r.t. isolated cpusets
It is complicated because you are trying to do a complex state change
one step at a time, without a precise statement (at least, not that I
saw) of what the invariants are, and without atomic operations that
preserve those invariants.
> > First, let me verify one thing. I understand that the _key_
> > purpose of your patch is not so much to isolate cpus, as it
> > is to allow for structuring scheduling domains to align with
> > cpuset boundaries. I understand real isolated cpus to be ones
> > that don't have a scheduling domain (have only the dummy one),
> > as requested by the "isolcpus=..." boot flag.
>
> Not really. Isolated cpusets allow you to do a soft-partition
> of the system, and it would make sense to continue to have load
> balancing within these partitions. I would think that not having
> load balancing should be one of the options available
Ok ... then is it correct to say that your purpose is to partition
the system's CPUs into subsets, such that for each subset, either
there is a scheduler domain for exactly the CPUs in that subset,
or none of the CPUs in the subset are in any scheduler domain?
> I must confess that I haven't looked at the memory side all that much,
> having more interest in trying to build soft-partitioning of the CPUs
This is an understandable focus of interest. Just know that one of the
sanity tests I will apply to a solution for CPUs is whether there is a
corresponding solution for Memory Nodes, using much the same principles,
invariants and conventions.
> OK, I need to spend more time on your model, Paul, but my first
> guess is that it doesn't seem to be very intuitive and seems
> to make things very complex from the user's perspective. However, as
> I said, I need to understand your model a bit more before I
> comment on it
Well ... I can't claim that my approach is simple. It does have a
clearly defined (well, clear to me ;) mathematical model, with some
invariants that are always preserved in what user space sees, with
atomic operations for changing from one legal state to the next.
The primary invariant is that the sets of CPUs in the cpusets
marked domain_cpu_current form a partition (disjoint covering)
of the CPUs in the system.
What are your invariants, and how can you assure yourself and us
that your code preserves these invariants?
Also, I don't know that the sequence of user operations required
by my interface is that much worse than yours. Let's take an
example, and compare what the user would have to do.
Let's say we have the following cpusets on our 8 CPU system:
/ # CPUs 0-7
/Alpha # CPUs 0-3
/Alpha/phi # CPUs 0-1
/Alpha/chi # CPUs 2-3
/Beta # CPUs 4-7
Let's say we currently have three scheduler domains, for three isolated
(in your terms) cpusets: /Alpha/phi, /Alpha/chi and /Beta.
Let's say we want to change the configuration to have just two scheduler
domains (two isolated cpusets): /Alpha and /Beta.
A user of my API would do the operations:
echo 1 > /Alpha/domain_cpu_pending
echo 1 > /Beta/domain_cpu_pending
echo 0 > /Alpha/phi/domain_cpu_pending
echo 0 > /Alpha/chi/domain_cpu_pending
echo 1 > /domain_cpu_rebuild
The domain_cpu_current state would not change until the final write
(echo) above, at which time the cpuset_sem lock would be taken, and the
system would, atomically to all viewing tasks, change from having the
three cpusets /Alpha/phi, /Alpha/chi and /Beta marked with a true
domain_cpu_current, to having the two cpusets /Alpha and /Beta so
marked.
The alternative API, which I didn't explore, could do this in one step
by writing the new list of cpusets defining the partition, doing the
rough equivalent (need nul separators, not space separators) of:
echo /Alpha /Beta > /list_cpu_subdomains
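Since a shell echo cannot easily produce the nul separators, a real
user of this unexplored alternative would need something like the
following C fragment (the file and its location under /dev/cpuset are
hypothetical):

#include <fcntl.h>
#include <unistd.h>

/* Sketch: atomically propose a new partition as a nul-separated list
 * of cpuset paths, in one write. */
int set_cpu_subdomains(void)
{
	/* sizeof(list) counts the compiler's trailing nul, so the kernel
	 * sees "/Alpha\0/Beta\0" as a single buffer. */
	static const char list[] = "/Alpha\0/Beta";
	int fd = open("/dev/cpuset/list_cpu_subdomains", O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, list, sizeof(list)) != (ssize_t)sizeof(list)) {
		close(fd);
		return -1;
	}
	return close(fd);
}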
How does this play out in your interface? Are you convinced that
your invariants are preserved at all times, to all users? Can
you present a convincing argument to others that this is so?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
Nick wrote:
> Well the scheduler simply can't handle it, so it is not so much a
> matter of pushing - you simply can't use partitioned domains and
> meaningfully have a cpuset above them.
Translating that into cpuset-speak, I think what you mean is that I
can't have partitioned sched domains and have a task attached to a
cpuset above them, if it matters to me that the task can actually use
all the CPUs in its larger cpuset.
But what you actually said was that I cannot have a cpuset above them.
I certainly _can_ have a cpuset above the cpusets that define the
partitioned domains. I _have_ to have that, or toss the entire
hierarchical cpuset design. The top cpuset encompasses all the CPUs on
the system, and is above all others.
Let's see if the following example helps clear up these confusions.
Let's say we started out as one big happy family, with a single top
cpuset, and a single sched domain, each encompassing the entire machine.
All tasks are attached to that cpuset and load balanced and scheduled in
that sched domain. Any task can be run anywhere.
Then some yahoo comes along and decides to complicate things. They
create my two cpusets Alpha and Beta, each covering half the system.
They create two partitioned sched domains corresponding to Alpha and
Beta, respectively. They move almost every task into one of Alpha or
Beta, expecting hence forth that each such moved task will only run on
whichever half of the system it was placed in. For instance, if they
moved init into Alpha, that means they _want_ the init task to be
constrained to the Alpha half of the system, even if every CPU in Beta
has been idle for the last 5 hours.
So far, all fine and dandy.
But they leave behind a few tasks still attached to the top cpuset, with
those tasks cpus_allowed still allowing any CPU in the system. They
actually don't give a rat's patootie about these few tasks, because they
consume less than 10 seconds each per day, and so long as they are
allowed their few CPU cycles when they want them, all is well. They
could have moved these tasks as well into Alpha or Beta, but they wanted
to be annoying and see if they could concoct a test case that would
break something here. Or maybe they were just forgetful.
What breaks? You seem to be telling me that this is verboten, but I
don't see yet where the problem is.
My timid guess is that about all that breaks is that each of these stray
tasks will be forever after stuck in which ever one of Alpha or Beta it
happened to be in at the point of the Great Divide. If say one of these
tasks happened to be on the Beta side at that point, the Beta domain
scheduler will never let an Alpha CPU see that task, leaving the task to
only ever be picked up by a Beta CPU (even though the task's cpuset and
cpus_allowed would have allowed an Alpha CPU, in theory).
Translating this back into a language my users might speak, I guess
this means I tell them:
* No scheduling or load balancing is done across partitioned scheduler domains.
* Even if one such domain is hugely oversubscribed, and another totally
idle, no task in one will run in the other. If that's what you want,
then go for it.
* Tasks left attached to cpusets higher up in the hierarchy don't get
moved or load balanced between partitioned sched domains below their cpuset.
They will get stuck in one of the domains, willy-nilly. So if it matters
to you in the slightest which of the partitions a task runs in, attach
it appropriately, to one of the cpusets that define the partitioned
scheduler domains, or below.
In short, perhaps you were trying to make my life, or at least my efforts
to understand this, simple, by telling me that I simply can't have any
cpusets above partitioned sched domains. The literal translation of that
into cpuset-speak throws out the entire cpuset architecture. So I have to
push back and figure out in more detail what really matters here.
Am I anywhere close?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
Dinakar wrote:
> Also, I think we can add further restrictions in terms of not being able
> to change (add/remove) cpus within an isolated cpuset. Instead one would
> have to tear down an existing cpuset and make a new one with the
> required configuration. That would simplify things even further
My earlier reply to this missed the mark a little.
Instead what I would say is this. If one wants to move a CPU from one
cpuset to another, where these two cpusets are not in the same
partitioned scheduler domain, then one first has to collapse the
scheduler domain partitions so that both cpusets _are_ in the same
partitioned scheduler domain. Then one can move the CPU between the two
cpusets, and reestablish the more fine grained partitioned scheduler
domain structure that isolates these two cpusets into different
partitioned scheduler domains.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
On Tue, Apr 19, 2005 at 10:23:48AM -0700, Paul Jackson wrote:
>
> How does this play out in your interface? Are you convinced that
> your invariants are preserved at all times, to all users? Can
> you present a convincing argument to others that this is so?
Let me give an example of how the current version of isolated cpusets can
be used and hopefully clarify my approach.
Consider a system with 8 cpus that needs to run a mix of workloads.
One set of applications have low latency requirements and another
set have a mixed workload. The administrator decides to allot
2 cpus to the low latency application and the rest to other apps.
To do this, he creates two cpusets
(All cpusets are considered to be exclusive for this discussion)
cpuset            cpus   isolated   cpus_allowed   isolated_map
top               0-7    1          0-7            0
top/lowlat        0-1    0          0-1            0
top/others        2-7    0          2-7            0
He now wants to partition the system along these lines as he wants
to isolate lowlat from the rest of the system to ensure that
a. No tasks from the parent cpuset (top_cpuset in this case)
use these cpus
b. load balance does not run across all cpus 0-7
He does this by
cd /mount-point/lowlat
/bin/echo 1 > cpu_isolated
Internally it takes cpuset_sem, does some sanity checks, and ensures
that these cpus are not visible to any other cpuset, including its parent
(by removing these cpus from its parent's cpus_allowed mask and adding
them to its parent's isolated_map), and then calls the sched code to
partition the system as
[0-1] [2-7]
The internal state of the data structures is as follows:
cpuset            cpus   isolated   cpus_allowed   isolated_map
top               0-7    1          2-7            0-1
top/lowlat        0-1    1          0-1            0
top/others        2-7    0          2-7            0
-------------------------------------------------------
The administrator now wants to further partition the "others" cpuset into
a cpu intensive application and a batch one
cpuset            cpus   isolated   cpus_allowed   isolated_map
top               0-7    1          2-7            0-1
top/lowlat        0-1    1          0-1            0
top/others        2-7    0          2-7            0
top/others/cint   2-3    0          2-3            0
top/others/batch  4-7    0          4-7            0
If now the administrator wants to isolate the cint cpuset...
cd /mount-point/others
/bin/echo 1 > cpu_isolated
(At this point no new sched domains are built
as there exists a sched domain which exactly
matches the cpus in the "others" cpuset.)
cd /mount-point/others/cint
/bin/echo 1 > cpu_isolated
At this point cpus from the "others" cpuset are also taken away from its
parent's cpus_allowed mask and put into the parent's isolated_map. This means
that the parent's cpus_allowed mask is empty. This now results in
partitioning the "others" cpuset and building two new sched domains as follows
[2-3] [4-7]
Notice that cpus 0-1, having already been isolated, are not affected
by this operation.
cpuset            cpus   isolated   cpus_allowed   isolated_map
top               0-7    1          0              0-7
top/lowlat        0-1    1          0-1            0
top/others        2-7    1          4-7            2-3
top/others/cint   2-3    1          2-3            0
top/others/batch  4-7    0          4-7            0
-------------------------------------------------------
The admin now wants to run more applications in the cint cpuset
and decides to borrow a couple of cpus from the batch cpuset
He removes cpus 4-5 from batch and adds them to cint
cpuset            cpus   isolated   cpus_allowed   isolated_map
top               0-7    1          0              0-7
top/lowlat        0-1    1          0-1            0
top/others        2-7    1          6-7            2-5
top/others/cint   2-5    1          2-5            0
top/others/batch  6-7    0          6-7            0
As cint is already isolated, adding cpus causes it to rebuild all cpus
covered by its cpus_allowed and its parent's cpus_allowed, so the new
sched domains will look as follows
[2-5] [6-7]
cpus 0-1 are of course still not affected.
Similarly the admin can remove cpus from cint, which will
result in the domains being rebuilt to what was before
[2-3] [4-7]
-------------------------------------------------------
Hope this clears up my approach. Also note that we still need to take care
of the cpu hotplug case, where any random cpu can be removed and added back,
and this code needs to take care of rebuilding the appropriate sched domains.
-Dinakar
On Tue, Apr 19, 2005 at 08:26:39AM -0700, Paul Jackson wrote:
> * Your understanding of "cpu_exclusive" is not the same as mine.
Sorry for creating confusion with what I said earlier; I do understand
exactly what cpu_exclusive means. It's just that when I started
working on this (a long time ago) I had a different notion, and that is
what I was referring to. I probably should never have brought that up.
>
> > Since isolated cpusets are trying to partition the system, this
> > can be restricted to only the first level of cpusets.
>
> I do not think such a restriction is a good idea. For example, lets say
> our 8 CPU system has the following cpusets:
>
And my current implementation has no such restriction; I was only
suggesting that to simplify the code.
>
> > Also, I think we can add further restrictions in terms of not being able
> > to change (add/remove) cpus within an isolated cpuset.
>
> My approach agrees on this restriction. Earlier I wrote:
> > Also note that adding or removing a cpu from a cpuset that has
> > its domain_cpu_current flag set true must fail, and similarly
> > for domain_mem_current.
>
> This restriction is required in my approach because the CPUs in the
> domain_cpu_current cpusets (the isolated CPUs, in your terms) form a
> partition (disjoint cover) of the CPUs in the system, which property
> would be violated immediately if any CPU were added or removed from any
> cpuset defining the partition.
See my other note explaining how things work currently. I do feel that
this restriction is not good
-Dinakar
Earlier, I wrote to Dinakar:
> What are your invariants, and how can you assure yourself and us
> that your code preserves these invariants?
I repeat that question.
===
On my first reading of your example, I see the following.
It is sinking into my dense skull more than it had before that your
patch changes the meaning of the cpuset field 'cpus_allowed', to only
include the cpus not in isolated children. However there are other uses
of the 'cpus_allowed' field in the cpuset code that are not changed, and
comments and documentation describing this field that are not changed.
I suspect this is an incomplete change.
You don't actually state it that I noticed, but the main point of your
example seems to be that you support incrementally moving individual
cpus between cpusets, without the constraint that both cpusets be in the
same subset of the partition (the same isolation group). So you can
move a cpu in and out of an isolated group without tearing the
group down first, only to rebuild it after.
To do this, you've added new semantics to some of the operations to
write the 'cpus' special file of a cpuset, if and only if that cpuset is
marked isolated, which involves changing some other masks. These new
semantics are something along the lines of "adding a cpu here implies
removing it from there." This presumably allows you to move cpus into,
out of, or between isolated cpusets, while preserving the essential
property of a partition - that it is a disjoint covering.
> He removes cpus 4-5 from batch and adds them to cint
Could you spell out the exact steps the user would take, for this part
of your example? What does the user do, what does the kernel do in
response, and what state the cpusets end up in, after each action of the
user?
===
So far, to be honest, I am finding your patch to be rather frustrating.
Perhaps the essential reason is this. The interface that cpusets
presents in the cpuset file system, mounted at /dev/cpuset, is not in my
intentions primarily a human interface. It is primarily a programmatic
interface.
As such, there is a high premium on clarity of design, consistency of
behaviour and absence of side effects. Each operation should do one
thing, clearly defined, changing only what is operated on, preserving
clearly spelled out invariants.
If it takes three steps instead of one to accomplish a typical task,
that's fine. The programs that layer on top of /dev/cpuset don't mind
doing three things to get one thing done. But such programs are a pain
in the backside to program correctly if the effects of each operation
are not clearly defined, not focused on the obvious object being
operated on, or not precisely consistent with an overriding model.
This patch seems to add side effects and change the meanings of
things, doing so with only the most minimal mention in the description,
without clearly and consistently spelling out the new mental model, and
without uniformly changing all uses, comments and documentation to fit
the new model.
This cpuset facility is also a less commonly used kernel facility, and
changes to cpusets, outside of a few key hooks in the scheduler and
allocator, are not performance critical. This means that there is a
premium in keeping the kernel code minimal, leaving as many details as
practical to userland. This patch seems to increase the kernel text
size, for an ia64 SN2 build using gcc 3.2.3 of a 2.6.12-rc1-mm4 tree I
had at hand, _just_ for the cpuset.c changes, from 23071 bytes to 28999.
That's over a 25% increase in the kernel text size of the file
kernel/cpuset.o, just for this feature. That's too much, in my view.
I don't know yet if the ability to move cpus between isolated sched
domains without tearing them down and rebuilding them, is a critical
feature for you or not. You have not been clear on what are the
essential requirements of this feature. I don't even know for sure yet
that this is the one key feature in your view that separates your
proposal from the variations I explored.
But if this is for you the critical feature that your proposal has, and
mine lack, then I'd like to see if there is a way to do it without
implicit side effects, without messing with the semantics of what's
there now, and with significantly fewer bytes of kernel text space. And
I'd like to see if we can have uniform and precisely spelled out
semantics, in the code, comments and documentation, with any changes to
the current semantics made everywhere, uniformly.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
On Wed, Apr 20, 2005 at 12:09:46PM -0700, Paul Jackson wrote:
> Earlier, I wrote to Dinakar:
> > What are your invariants, and how can you assure yourself and us
> > that your code preserves these invariants?
Ok, let me begin at the beginning and attempt to define what I am
doing here:
1. I need a method to isolate a random set of cpus in such a way that
only the set of processes that are specifically assigned can
make use of these CPUs
2. I need to ensure that the sched load balance code does not pull
any tasks other than the assigned ones onto these cpus
3. I need to be able to create multiple such groupings of cpus
that are disjoint from the rest and run only specified tasks
4. I need a user interface to specify which random set of cpus
form such a grouping of disjoint cpus
5. I need to be able to dynamically create and destroy these
grouping of disjoint cpus
6. I need to be able to add/remove cpus to/from this grouping
Now if you try to fit these requirements onto cpusets, keeping in mind
that it already has a user interface and some of the framework
required to create disjoint groupings of cpus:
1. An exclusive cpuset ensures that the cpus it has are disjoint from
all other cpusets except its parent and children
2. So now I need a way to disassociate the cpus of an exclusive
cpuset from its parent, so that this set of cpus are truly
disjoint from the rest of the system.
3. After I have done (2) above, I now need to build two set of sched
domains corresponding to the cpus of this exclusive cpuset and the
remaining cpus of its parent
4. Ensure that the current rules of non-isolated cpusets are all
preserved such that if this feature is not used, all other features
work as before
This is exactly what I have tried to do (a condensed code sketch of the
mask bookkeeping follows this list).
1. Maintain a flag to indicate whether a cpuset is isolated
2. Maintain an isolated_map for every cpuset. This contains a cache of
all cpus associated with isolated children
3. To isolate a cpuset x, x has to be an exclusive cpuset and its
parent has to be an isolated cpuset
4. On isolating a cpuset by issuing
/bin/echo 1 > cpu_isolated
It ensures that the conditions in (3) are satisfied and then removes the
cpus of the current cpuset from the parent's cpus_allowed mask. (It also
puts the cpus of the current cpuset into the isolated_map of its parent.)
This ensures that only the current cpuset and its children will have
access to the now isolated cpus.
It also rebuilds the sched domains into two new domains consisting of
a. All cpus in the parent->cpus_allowed
b. All cpus in current->cpus_allowed
5. Similarly, on setting isolated off on an isolated cpuset (or on doing
an rmdir on an isolated cpuset), it adds all of the cpus of the current
cpuset into its parent cpuset's cpus_allowed mask and removes them from
its parent's isolated_map.
This ensures that all of the cpus in the current cpuset are now
visible to the parent cpuset.
It now rebuilds only one sched domain consisting of all of the cpus
in its parent's cpus_allowed mask.
6. You can also modify the cpus present in an isolated cpuset x provided
that x does not have any children that are also isolated.
7. On adding or removing cpus from an isolated cpuset that does not
have any isolated children, it reworks the parent cpuset's
cpus_allowed and isolated_map masks and rebuilds the sched domains
appropriately
8. Since the function update_cpu_domains, which does all of the
above updates to the parent cpuset's masks, is always called with
cpuset_sem held, it ensures that all these changes are atomic.
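Condensing steps 4 and 5 into code, the mask bookkeeping amounts to the
following (a sketch: the helper names are mine, and the patch does this
inside update_cpu_domains() with cpuset_sem held):

/* Sketch of the isolate / un-isolate transfer between a cpuset and its
 * parent.  Caller holds cpuset_sem, making the paired updates atomic
 * with respect to all other cpuset operations. */
static void isolate_cpus(struct cpuset *cs)
{
	struct cpuset *par = cs->parent;

	/* hide cs's cpus from the parent ... */
	cpus_andnot(par->cpus_allowed, par->cpus_allowed, cs->cpus_allowed);
	/* ... and record them as isolated on cs's behalf */
	cpus_or(par->isolated_map, par->isolated_map, cs->cpus_allowed);
	/* then rebuild two domains: par->cpus_allowed and cs->cpus_allowed */
}

static void unisolate_cpus(struct cpuset *cs)
{
	struct cpuset *par = cs->parent;

	cpus_or(par->cpus_allowed, par->cpus_allowed, cs->cpus_allowed);
	cpus_andnot(par->isolated_map, par->isolated_map, cs->cpus_allowed);
	/* then rebuild one domain covering par->cpus_allowed */
}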
> > He removes cpus 4-5 from batch and adds them to cint
>
> Could you spell out the exact steps the user would take, for this part
> of your example? What does the user do, what does the kernel do in
> response, and what state the cpusets end up in, after each action of the
> user?
cpuset            cpus   isolated   cpus_allowed   isolated_map
top               0-7    1          0              0-7
top/lowlat        0-1    1          0-1            0
top/others        2-7    1          4-7            2-3
top/others/cint   2-3    1          2-3            0
top/others/batch  4-7    0          4-7            0
At this point to remove cpus 4-5 from batch and add them to cint, the admin
would do the following steps
# Remove cpus 4-5 from batch
# batch is not an isolated cpuset and hence this step
# has no other implications
/bin/echo 6-7 > /top/others/batch/cpus
cpuset            cpus   isolated   cpus_allowed   isolated_map
top               0-7    1          0              0-7
top/lowlat        0-1    1          0-1            0
top/others        2-7    1          4-7            2-3
top/others/cint   2-3    1          2-3            0
top/others/batch  6-7    0          6-7            0
# Add cpus 4-5 to cint along with existing cpus 2-3
/bin/echo 2-5 > /top/others/cint/cpus
cpuset            cpus   isolated   cpus_allowed   isolated_map
top               0-7    1          0              0-7
top/lowlat        0-1    1          0-1            0
top/others        2-7    1          6-7            2-5
top/others/cint   2-5    1          2-5            0
top/others/batch  6-7    0          6-7            0
As you can see there are no "side effects" here. All of these are legitimate
operations and work the same even in the current cpusets code as in mainline.
(Except of course the isolation part.)
Hope this helps in clarifying all your questions.
However, after taking into account all of your comments so far, I have
reworked my patch and reduced and simplified it quite a bit. I have
maintained all of the functionality that I have described so far.
(Adding one restriction, viz. that you can modify the cpus present in
an isolated cpuset x only if x does not have any children that are
also isolated.)
I'll send that in a new mail.
Thanks for all your comments and review so far
-Dinakar
Based on Paul's feedback, I have simplified and cleaned up the
code quite a bit.
o I have taken care of most of the nits, except for the output
format change for cpusets with isolated children.
o Also, most of my documentation has been part of my earlier mails
and I have not yet added it to cpusets.txt.
o I still haven't looked at the memory side of things.
o Most of the changes are in the cpusets code and almost none
in the sched code. (I'll do that next week)
o Hopefully my earlier mails regarding the design have clarified
many of the questions that were raised
So here goes version 0.2
-rw-r--r-- 1 root root 16548 Apr 21 20:54 cpuset.o.orig
-rw-r--r-- 1 root root 17548 Apr 21 22:09 cpuset.o.sd-v0.2
Around a 6% increase in the kernel text size of cpuset.o.
include/linux/init.h | 2
include/linux/sched.h | 1
kernel/cpuset.c | 153 +++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sched.c | 111 ++++++++++++++++++++++++------------
4 files changed, 216 insertions(+), 51 deletions(-)
A few code details (still working on a more substantive reply):
+ /* An isolated cpuset has to be exclusive */
+ if ((is_cpu_isolated(trial) && !is_cpu_exclusive(cur))
+ || (!is_cpu_exclusive(trial) && is_cpu_isolated(cur)))
+ return -EINVAL;
Is the above code equivalent to what the comment states:
if (is_cpu_isolated(trial) <= is_cpu_exclusive(trial))
return -EINVAL;
+ t = old_parent = *par;
+ cpus_or(all_map, cs->cpus_allowed, cs->isolated_map);
+
+ /* If cpuset empty or top_cpuset, return */
+ if (cpus_empty(all_map) || par == NULL)
+ return;
If the (par == NULL) check succeeds, then perhaps the earlier (*par)
dereference will have oopsed first?
+ struct cpuset *par = cs->parent, t, old_parent;
Looks like 't' was chosen to be a one-char variable name, to keep some
lines below within 80 columns. I'd do the same myself. But this leaves
a non-symmetrical naming pattern for the new and old parent cpuset values.
Perhaps the following would work better?
struct cpuset *parptr;
struct cpuset o, n; /* old and new parent cpuset values */
+static void update_cpu_domains(struct cpuset *cs, cpumask_t old_map)
Could old_map be passed as a (const cpumask_t *)? The stack space of
this code, just for cpumask_t's (see the old and new above) is getting
large for (really) big systems.
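That is, roughly:

	/* Sketch of the suggested signature: a cpumask_t is NR_CPUS bits
	 * wide, so passing by value copies the whole mask onto the stack. */
	static void update_cpu_domains(struct cpuset *cs,
				       const cpumask_t *old_map);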
+ /* Make the change */
+ par->cpus_allowed = t.cpus_allowed;
+ par->isolated_map = t.isolated_map;
Why don't you need to propagate this change upward to the parent's
cpus_allowed and isolated_map? If a parent's isolated_map grows (or
shrinks), doesn't that affect every ancestor, all the way to the top
cpuset?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
Dinakar wrote:
> Ok, Let me begin at the beginning and attempt to define what I am
> doing here
The statement of requirements and approach helps. Thank you.
And the comments in the code patch are much easier for me
to understand. Thanks.
Let me step back and consider where we are here.
I've not been entirely happy with the cpu_exclusive (and mem_exclusive)
properties. They were easy to code, and they require only looking at
one's siblings and parent, but they don't provide all that people usually
want, which is system-wide exclusivity, because they don't exclude tasks
in one's parent (or more remote ancestor) cpusets from stealing resources.
I take your isolated cpusets as a reasonable attempt to provide what's
really wanted. I had avoided simple, system-wide exclusivity because
I really wanted cpusets to be hierarchical. One should be able to
subdivide and manage one subtree of the cpuset hierarchy, oblivious
to what someone else is doing with a disjoint subtree. Your work shows
how to provide a stronger form of isolation (exclusivity) without
abandoning the hierarchical structure.
There are three directions we could go from here. I am not yet decided
between them:
1) Remove cpu and mem exclusive flags - they are of limited use.
2) Leave code as is.
3) Extend the exclusive capability to include isolation from parents,
along the lines of your patch.
If I was redoing cpusets from scratch, I might not include the exclusive
feature at all - not sure. But it's cheap, at least in terms of code,
and of some use to some users. So I would choose (2) over (1), given
where we are now. The main cost at present of the exclusive flags is
the cost in understanding - they tend to confuse people at first glance,
due to their somewhat unusual approach.
If we go with (3), then I'd like to consider the overall design of this
a bit more. Your patch, as is common for patches, attempts to work within
the current framework, minimizing change. Better to take a step back and
consider what would have been the best design as if the past didn't matter,
then with that clearly in mind, ask how best to get there from here.
I don't think we would have both isolated and exclusive flags, in the
'ideal design.' The exclusive flags are essentially half (or a third)
of what's needed, and the isolated flags and masks the rest of it.
Essentially, your patch replaces the single set of CPUs in a cpuset
with three, related sets:
A] the set of all CPUs managed by that cpuset
B] the set of CPUs allowed to tasks attached to that cpuset
C] the set of CPUs isolated for the dedicated use of some descendent
Sets [B] and [C] form a partition of [A] -- their intersection is empty,
and their union is [A].
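In cpumask terms, that relation is (a sketch; the patch computes [A] as
a union on the fly, as in the cpus_or() quoted earlier, rather than
storing it):

/* [A] = [B] | [C], with [B] & [C] empty */
static void cpuset_all_cpus(const struct cpuset *cs, cpumask_t *all_map)
{
	BUG_ON(cpus_intersects(cs->cpus_allowed, cs->isolated_map));
	cpus_or(*all_map, cs->cpus_allowed, cs->isolated_map);
}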
Your current presentation of these sets of CPUs shows set [B] in the
cpus file, followed by set [C] in brackets, if I am recalling correctly.
This format changes the format of the current cpus_allowed file, and it
violates the preference for a single value or vector per file. I would
like to consider alternatives.
Your code automatically updates [C] if the child cpuset adds or removes
CPUs from those it manages in isolation (though I am not sure that your
code manages this change all the way back up the hierarchy to the top
cpuset, and I am wondering if perhaps your code should be doing this, as
noted in my detailed comments on your patch earlier today.)
I'd be tempted, if taking this approach (3) to consider a couple of
alternatives.
As I spelled out a few days ago, one could mark some cpusets that form a
partition of the system's CPUs, for the purposes of establishing isolated
scheduler domains, without requiring the above three related sets per
cpuset instead of one. I am still unsure how much of your motivation is
the need to make the scheduler more efficient by establishing useful
isolated sched domains, and how much is the need to keep the usage of
CPUs by various jobs isolated, even from tasks attached to parent cpusets.
One can obtain the job isolation just in user code - if you don't want a
task to use a parent cpuset's access to your isolated cpuset, then simply
don't attach a task to the parent cpusets. I do not understand yet how
strong your requirement is to have the _kernel_ enforce that there are
not tasks in a parent cpuset which could intrude on the non-isolated
resources of a child. I provide (non open source) user level tools to
my users which enable them to conveniently ensure that there are no such
unwanted tasks, so they don't have a problem with a parent cpuset's CPUs
overlapping a cpuset that they are using for an isolated job. Perhaps I
could persuade my employer that it would be appropriate to open source
these tools.
In any case, going (3) would result in _one_ attribute, not two (both
exclusive and isolated, with overlapping semantics, which is confusing.)
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
> Is the above code equivalant to what the comment states:
>
> if (is_cpu_isolated(trial) <= is_cpu_exclusive(trial))
> return -EINVAL;
I think I got that backwards. How about:
/* An isolated cpuset has to be exclusive */
if (!(is_cpu_isolated(trial) <= is_cpu_exclusive(trial)))
return -EINVAL;
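For 0/1-valued predicates, "A implies B" is !A || B, which is just
A <= B, so the test rejects exactly the isolated-but-not-exclusive
case; an equivalent spelling would be:

	/* equivalent: fail only when isolated is set without exclusive */
	if (is_cpu_isolated(trial) && !is_cpu_exclusive(trial))
		return -EINVAL;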
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
Dinakar's patch contains:
+ /* Make the change */
+ par->cpus_allowed = t.cpus_allowed;
+ par->isolated_map = t.isolated_map;
Doesn't the above make changes to the parent cpus_allowed without
calling validate_change()? Couldn't we do nasty things like
empty that cpus_allowed, leaving tasks in that cpuset starved
(or testing the last chance code that scans up the cpuset
hierarchy looking for a non-empty cpus_allowed)?
What prevents all the immediate children of the top cpuset from
using up all the cpus as isolated cpus, leaving the top cpuset
cpus_allowed empty, which fails even that last chance check,
going to the really really last chance code, allowing any online
cpu to tasks in that cpuset?
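(For reference, the "last chance" scan being referred to is the guard
in kernel/cpuset.c; roughly paraphrased:)

/* Paraphrased sketch of guarantee_online_cpus() in kernel/cpuset.c:
 * walk up the hierarchy until some ancestor still has online cpus;
 * if even the top cpuset has none, fall back to all online cpus. */
static void guarantee_online_cpus(const struct cpuset *cs, cpumask_t *pmask)
{
	while (cs && !cpus_intersects(cs->cpus_allowed, cpu_online_map))
		cs = cs->parent;
	if (cs)
		cpus_and(*pmask, cs->cpus_allowed, cpu_online_map);
	else
		*pmask = cpu_online_map;  /* the really really last chance */
}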
These questions are in addition to my earlier question:
Why don't you need to propagate this change upward to
the parent's cpus_allowed and isolated_map? If a parent's
isolated_map grows (or shrinks), doesn't that affect every
ancestor, all the way to the top cpuset?
I am unable to tell, just from code reading, whether this code
has adequately worked through the details involved in properly
handling nested changes.
I am unable to build or test this on ia64, because you have code
such as the rebuild_sched_domains() routine, that is in the
'#else' half of a very large "#ifdef ARCH_HAS_SCHED_DOMAIN -
#else - #endif" section of kernel/sched.c, and ia64 arch (and
only that arch, so far as I know) defines ARCH_HAS_SCHED_DOMAIN,
so doesn't see this '#else' half.
+ /*
+ * If current isolated cpuset has isolated children
+ * disallow changes to cpu mask
+ */
+ if (!cpus_empty(cs->isolated_map))
+ return -EBUSY;
1) spacing - there are 8 spaces, not a tab, on two of the lines above.
2) I can't tell yet - but I am curious as to whether the above restriction
prohibiting cpu mask changes to a cpuset with isolated children
might be a bit draconian.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
On Fri, Apr 22, 2005 at 02:26:18PM -0700, Paul Jackson wrote:
> 3) Extend the exclusive capability to include isolation from parents,
> along the lines of your patch.
This was precisely the design that I first came up with not so long ago,
but never posted, the reason being that I thought all parties involved
had already agreed to this design for some reason (unknown to me) that
was already discussed in detail during the last flurry of emails.
Now that you have asked this question and actually said that this would
probably be a better design, I wholeheartedly agree, and what's more,
I already have most of the code required. In fact, here it is.
I think I'll redo the patch and post it for review shortly.
-Dinakar
(Warning: this has all the warts that have previously been pointed
out, and more.)
Dinakar wrote:
> cpuset    cpus    isolated    cpus_allowed    isolated_map
> top       0-7     1           0               0-7
The top cpuset holds the kernel threads that are pinned to a particular
cpu or node. It's not right that their cpuset's cpus_allowed is empty,
which is what I guess the "0" in the cpus_allowed column above means.
(Even if the "0" means CPU 0, that still conflicts with kernel threads
on CPUs 1-7.)
We might get away with it on cpus, because we don't change the task's
cpus_allowed to match the cpuset's cpus_allowed (we don't call
set_cpus_allowed from kernel/cpuset.c) _except_ when someone rebinds
that task to its cpuset by writing its pid into the cpuset's tasks file.
So for as long as no one tries to rebind the per-cpu or per-node
kernel threads, no one will notice that they are in a cpuset with an
empty cpus_allowed.
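For reference, the rebind path I mean is roughly this fragment of
attach_task() in kernel/cpuset.c (paraphrased from memory - details
may differ):
	/*
	 * The task's cpus_allowed is only synced to its cpuset
	 * when its pid is written into the cpuset's "tasks" file.
	 */
	task_lock(tsk);
	tsk->cpuset = cs;
	task_unlock(tsk);
	set_cpus_allowed(tsk, cs->cpus_allowed);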
This won't even work that well on the memory side, where we resync
a task with its cpuset anytime that a task goes to allocate memory
(if it can WAIT and it is not in interrupt) and we notice that someone
has bumped the mems_generation for its cpuset.
In other words, I strongly suspect that:
1) The top cpuset should allow all cpus, all memory nodes.
2) The way to assure that one task can't have its cpu or memory stolen
by another is to put the other tasks in cpusets that don't overlap.
3) The wrong way to assure this is by refusing to have any other cpusets
that have overlapping cpus_allowed or mems_allowed.
4) There are some tasks that _do_ need to run on the same cpus as
the tasks you would assign to isolated cpusets. These kernel threads,
such as the migration and ksoftirqd threads, must be set up
well before any user code runs that could configure job specific isolated
cpusets, so these tasks need a cpuset to run in that can be created
during system boot, before init (pid == 1) starts up. This cpuset
is the top cpuset.
My users are successfully managing what tasks can use what cpu or memory
resources by controlling which tasks are in which cpusets. They do not
require the ability to disable allowed cpus or memory nodes in other cpusets
to do this. It is not entirely clear to me that they even require the
minimal cpu_exclusive/mem_exclusive facility that is there now.
I don't understand why what's there now isn't sufficient. I don't see
that this patch provides any capability that you can't get just by
properly placing tasks in cpusets that have the desired cpus and nodes.
This patch leaves the per-cpu kernel threads with no cpuset that allows
what they need, and it complicates the semantics of things, in ways that
I still don't entirely understand.
Earlier you wrote:
> 1. I need a method to isolate a random set of cpus in such a way that
> only the set of processes that are specifically assigned can
> make use of these CPUs
I don't see why you need this. Nor do I think it is possible.
You don't need to isolate a set of cpus; you need to isolate a set of
processes. So long as you can create non-overlapping cpusets, and
assign processes to them, I don't see where it matters that you cannot
prohibit the creation of overlapping cpusets, or in the case of the top
cpuset, why it matters that you cannot _disallow_ allowed cpus
or memory nodes in existing cpusets.
And this is not possible because at least the kernel per-cpu threads
_do_ need to run on each cpu in the system, including those cpus you
would isolate.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
A few days ago, Nick wrote:
> Well the scheduler simply can't handle it, so it is not so much a
> matter of pushing - you simply can't use partitioned domains and
> meaningfully have a cpuset above them.
And I (pj) replied:
> Translating that into cpuset-speak, I think what you mean is ...
I then went on to ask some questions. I haven't seen a reply.
I probably wrote too many words, and you had more pressing matters
to deal with. Which is fine.
Let's make this simpler.
Ignore cpusets -- let's just talk about a task's cpus_allowed value,
and scheduler domains. Think of cpusets as just a strange way of
setting a task's cpus_allowed value.
Question:
What happens if we have say two isolated scheduler domains
on a system, covering say two halves of the system, and
some task has its cpus_allowed set to allow _all_ CPUs?
What kind of pain does that cause? I'm hoping you will say that
the only pain it causes is that the task will only run on one
half of the system, even if the other half is idle. And that
so long as I don't mind that, it's no problem to do this.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
On Sat, Apr 23, 2005 at 03:30:59PM -0700, Paul Jackson wrote:
> The top cpuset holds the kernel threads that are pinned to a particular
> cpu or node. It's not right that their cpuset's cpus_allowed is empty,
> which is what I guess the "0" in the cpus_allowed column above means.
> (Even if the "0" means CPU 0, that still conflicts with kernel threads
> on CPUs 1-7.)
Yes, I meant that cpus_allowed is empty.
>
> We might get away with it on cpus, because we don't change the task's
> cpus_allowed to match the cpuset's cpus_allowed (we don't call
> set_cpus_allowed from kernel/cpuset.c) _except_ when someone rebinds
> that task to its cpuset by writing its pid into the cpuset's tasks file.
> So for as long as no one tries to rebind the per-cpu or per-node
> kernel threads, no one will notice that they are in a cpuset with an
> empty cpus_allowed.
True.
> 4) There are some tasks that _do_ need to run on the same cpus as
> the tasks you would assign to isolated cpusets. These kernel threads,
> such as the migration and ksoftirqd threads, must be set up
> well before any user code runs that could configure job specific isolated
> cpusets, so these tasks need a cpuset to run in that can be created
> during system boot, before init (pid == 1) starts up. This cpuset
> is the top cpuset.
And those processes (kernel threads) will continue to run on their cpus.
>
> I don't understand why what's there now isn't sufficient. I don't see
> that this patch provides any capability that you can't get just by
> properly placing tasks in cpusets that have the desired cpus and nodes.
> This patch leaves the per-cpu kernel threads with no cpuset that allows
> what they need, and it complicates the semantics of things, in ways that
> I still don't entirely understand.
You are forgetting the fact that the scheduler is still load balancing
across all CPUs, trying to pull tasks only to find that a task's
cpus_allowed mask prevents it from being moved.
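For reference, the check I mean looks roughly like this in the 2.6
load balancer (paraphrased from can_migrate_task() in kernel/sched.c):
	/*
	 * A pull attempt aborts here when the destination cpu is
	 * not in the task's allowed mask, so all the balancing
	 * work done up to this point was wasted effort.
	 */
	if (!cpu_isset(this_cpu, p->cpus_allowed))
		return 0;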
>
> You don't need to isolate a set of cpus; you need to isolate a set of
> processes. So long as you can create non-overlapping cpusets, and
> assign processes to them, I don't see where it matters that you cannot
> prohibit the creation of overlapping cpusets, or in the case of the top
> cpuset, why it matters that you cannot _disallow_ allowed cpus
> or memory nodes in existing cpusets.
>
I am working on a minimalistic design right now and will get back in
a day or two
-Dinakar
Dinakar, replying to pj:
> > I don't understand why what's there now isn't sufficient. I don't see
> > that this patch provides any capability that you can't get just by
> > properly placing tasks in cpusets that have the desired cpus and nodes.
> > This patch leaves the per-cpu kernel threads with no cpuset that allows
> > what they need, and it complicates the semantics of things, in ways that
> > I still don't entirely understand.
>
> You are forgetting the fact that the scheduler is still load balancing
> across all CPUs, trying to pull tasks only to find that a task's
> cpus_allowed mask prevents it from being moved.
Well, I haven't forgotten, but I am having a difficult time figuring out
what your real (most likely just one or two) essential requirements are.
A few days ago, you provided a six-step list, under the introduction:
> Ok, Let me begin at the beginning and attempt to define what I am
> doing here
I suspect those six steps were not really your essential requirements,
but one possible procedure that accomplishes them.
So far I am guessing that your requirement(s) are one or both of the
following two items:
(1) avoid having the scheduler waste too much time trying to
load balance tasks that turn out not to be allowed on
the cpus the scheduler is considering, and/or
(2) provide improved administrative control of a system by being
able to construct a cpuset that is guaranteed to have no
overlap of allowed cpus with its parent or with any other cpuset
not descended from it.
If (1) is one of your essential requirements, then I have described a
couple of implementations that mark some existing cpusets to form a
partition (in the mathematical sense of a disjoint covering of subsets)
of the system to define isolated scheduler domains. I did this without
adding any additional bitmasks to each cpuset.
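To illustrate the partition property I mean, here is a sketch of the
check involved (for_each_marked_cpuset() is a made-up iterator over
the cpusets carrying the mark; the cpumask operations are the real
ones from include/linux/cpumask.h):
	cpumask_t covered = CPU_MASK_NONE;
	struct cpuset *cs;

	for_each_marked_cpuset(cs) {
		/* members of a partition may not overlap ... */
		if (cpus_intersects(cs->cpus_allowed, covered))
			return -EINVAL;
		cpus_or(covered, covered, cs->cpus_allowed);
	}
	/* ... and together they must cover every online cpu */
	if (!cpus_equal(covered, cpu_online_map))
		return -EINVAL;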
If (2) is one of your essential requirements, then I believe this can be
done with the current cpuset kernel code, entirely with additional user
level code.
> I am working on a minimalistic design right now and will get back in
> a day or two
Good.
Hopefully, you will also be able to get through my thick head what your
essential requirement(s) is or are.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
Paul Jackson wrote:
> A few days ago, Nick wrote:
>
>>Well the scheduler simply can't handle it, so it is not so much a
>>matter of pushing - you simply can't use partitioned domains and
>>meaningfully have a cpuset above them.
>
>
> And I (pj) replied:
>
>>Translating that into cpuset-speak, I think what you mean is ...
>
>
> I then went on to ask some questions. I haven't seen a reply.
> I probably wrote too many words, and you had more pressing matters
> to deal with. Which is fine.
>
> Let's make this simpler.
>
> Ignore cpusets -- let's just talk about a task's cpus_allowed value,
> and scheduler domains. Think of cpusets as just a strange way of
> setting a task's cpus_allowed value.
>
> Question:
>
> What happens if we have say two isolated scheduler domains
> on a system, covering say two halves of the system, and
> some task has its cpus_allowed set to allow _all_ CPUs?
>
> What kind of pain does that cause? I'm hoping you will say that
> the only pain it causes is that the task will only run on one
> half of the system, even if the other half is idle. And that
> so long as I don't mind that, it's no problem to do this.
I'm not the sched_domains guru that Nick is, but as your question has gone
unanswered for a few days I'll chime in and see if I can't help you provoke
a more definitive response.
Your assumptions above are correct to the best of my knowledge. The only
pain it causes is that the scheduler will not be able to "see" outside of
the span of the highest sched_domain attached to a particular CPU.
        A                 B
      /   \             /   \
     X     Y           Z     W
    / \   / \         / \   / \
   0   1 2   3       4   5 6   7
In this setup, with your "Alpha" & "Beta" domains splitting the system in
half, a process in a cpuset spanning cpus 0..7 will get "stuck" in
whichever domain it happens to be in when the Alpha & Beta domains get
created. Explicit sys_sched_setaffinity() calls will still move it between
domains, but it will not move between Alpha & Beta on its own.
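For example, to hop a task from Alpha to Beta by hand, userspace could
do something like this (a sketch, assuming the three-argument glibc
sched_setaffinity() wrapper and some valid pid):
	/* needs _GNU_SOURCE defined and <sched.h> included */
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(4, &mask);	/* CPU 4 lives under domain B (Beta) */
	if (sched_setaffinity(pid, sizeof(mask), &mask) < 0)
		perror("sched_setaffinity");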
Load balancing from CPU 0's perspective (in Alpha) sees only CPUs 0..3.
Right Nick? ;)
-Matt
Matthew wrote:
> and see if I can't help you provoke
> a more definitive response.
Thanks ;).
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401