2006-08-16 01:08:20

by Suresh Siddha

Subject: [patch] sched: group CPU power setup cleanup

Resending the new patch. The earlier version had some issues reported by Andy
and was hence dropped.

Andrew, please add this to -mm. This patch is against 2.6.18-rc4.
There might be a small conflict while applying to -mm. Let me know if you
want a patch on top of -mm.

thanks,
suresh

--

Clean up the sched group cpu_power setup code by introducing a
child field and a new domain flag in sched_domain.

Signed-off-by: Suresh Siddha <[email protected]>

diff -pNru linux-2.6.18-rc4/include/asm-i386/topology.h linux/include/asm-i386/topology.h
--- linux-2.6.18-rc4/include/asm-i386/topology.h 2006-08-06 11:20:11.000000000 -0700
+++ linux/include/asm-i386/topology.h 2006-08-14 12:31:51.880613208 -0700
@@ -74,6 +74,7 @@ static inline int node_to_first_cpu(int
#define SD_NODE_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
+ .child = NULL, \
.groups = NULL, \
.min_interval = 8, \
.max_interval = 32, \
diff -pNru linux-2.6.18-rc4/include/asm-ia64/topology.h linux/include/asm-ia64/topology.h
--- linux-2.6.18-rc4/include/asm-ia64/topology.h 2006-08-06 11:20:11.000000000 -0700
+++ linux/include/asm-ia64/topology.h 2006-08-14 12:31:51.881613056 -0700
@@ -59,6 +59,7 @@ void build_cpu_to_node_map(void);
#define SD_CPU_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
+ .child = NULL, \
.groups = NULL, \
.min_interval = 1, \
.max_interval = 4, \
@@ -84,6 +85,7 @@ void build_cpu_to_node_map(void);
#define SD_NODE_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
+ .child = NULL, \
.groups = NULL, \
.min_interval = 8, \
.max_interval = 8*(min(num_online_cpus(), 32)), \
diff -pNru linux-2.6.18-rc4/include/asm-mips/mach-ip27/topology.h linux/include/asm-mips/mach-ip27/topology.h
--- linux-2.6.18-rc4/include/asm-mips/mach-ip27/topology.h 2006-08-06 11:20:11.000000000 -0700
+++ linux/include/asm-mips/mach-ip27/topology.h 2006-08-14 12:31:51.881613056 -0700
@@ -22,6 +22,7 @@ extern unsigned char __node_distances[MA
#define SD_NODE_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
+ .child = NULL, \
.groups = NULL, \
.min_interval = 8, \
.max_interval = 32, \
diff -pNru linux-2.6.18-rc4/include/asm-powerpc/topology.h linux/include/asm-powerpc/topology.h
--- linux-2.6.18-rc4/include/asm-powerpc/topology.h 2006-08-06 11:20:11.000000000 -0700
+++ linux/include/asm-powerpc/topology.h 2006-08-14 12:31:51.881613056 -0700
@@ -43,6 +43,7 @@ extern int pcibus_to_node(struct pci_bus
#define SD_NODE_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
+ .child = NULL, \
.groups = NULL, \
.min_interval = 8, \
.max_interval = 32, \
diff -pNru linux-2.6.18-rc4/include/asm-x86_64/topology.h linux/include/asm-x86_64/topology.h
--- linux-2.6.18-rc4/include/asm-x86_64/topology.h 2006-08-06 11:20:11.000000000 -0700
+++ linux/include/asm-x86_64/topology.h 2006-08-14 12:31:51.882612904 -0700
@@ -31,6 +31,7 @@ extern int __node_distance(int, int);
#define SD_NODE_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
+ .child = NULL, \
.groups = NULL, \
.min_interval = 8, \
.max_interval = 32, \
diff -pNru linux-2.6.18-rc4/include/linux/sched.h linux/include/linux/sched.h
--- linux-2.6.18-rc4/include/linux/sched.h 2006-08-06 11:20:11.000000000 -0700
+++ linux/include/linux/sched.h 2006-08-14 14:11:28.974956752 -0700
@@ -623,9 +623,16 @@ enum idle_type
#define SD_WAKE_BALANCE 64 /* Perform balancing at task wakeup */
#define SD_SHARE_CPUPOWER 128 /* Domain members share cpu power */
#define SD_POWERSAVINGS_BALANCE 256 /* Balance for power savings */
+#define SD_SHARE_PKG_RESOURCES 512 /* Domain members share cpu pkg resources */

-#define BALANCE_FOR_POWER ((sched_mc_power_savings || sched_smt_power_savings) \
- ? SD_POWERSAVINGS_BALANCE : 0)
+#define BALANCE_FOR_MC_POWER \
+ (sched_smt_power_savings ? SD_POWERSAVINGS_BALANCE : 0)
+
+#define BALANCE_FOR_PKG_POWER \
+ ((sched_mc_power_savings || sched_smt_power_savings) ? \
+ SD_POWERSAVINGS_BALANCE : 0)
+
+#define test_sd_flag(sd, flag) ((sd && (sd->flags & flag)) ? 1 : 0)


struct sched_group {
@@ -642,6 +649,7 @@ struct sched_group {
struct sched_domain {
/* These fields must be setup */
struct sched_domain *parent; /* top domain must be null terminated */
+ struct sched_domain *child; /* bottom domain must be null terminated */
struct sched_group *groups; /* the balancing groups of the domain */
cpumask_t span; /* span of all CPUs in this domain */
unsigned long min_interval; /* Minimum balance interval ms */
diff -pNru linux-2.6.18-rc4/include/linux/topology.h linux/include/linux/topology.h
--- linux-2.6.18-rc4/include/linux/topology.h 2006-08-06 11:20:11.000000000 -0700
+++ linux/include/linux/topology.h 2006-08-14 12:31:51.883612752 -0700
@@ -89,6 +89,7 @@
#define SD_SIBLING_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
+ .child = NULL, \
.groups = NULL, \
.min_interval = 1, \
.max_interval = 2, \
@@ -114,11 +115,44 @@
#endif
#endif /* CONFIG_SCHED_SMT */

+#ifdef CONFIG_SCHED_MC
+/* Common values for MC siblings. for now mostly derived from SD_CPU_INIT */
+#ifndef SD_MC_INIT
+#define SD_MC_INIT (struct sched_domain) { \
+ .span = CPU_MASK_NONE, \
+ .parent = NULL, \
+ .child = NULL, \
+ .groups = NULL, \
+ .min_interval = 1, \
+ .max_interval = 4, \
+ .busy_factor = 64, \
+ .imbalance_pct = 125, \
+ .cache_nice_tries = 1, \
+ .per_cpu_gain = 100, \
+ .busy_idx = 2, \
+ .idle_idx = 1, \
+ .newidle_idx = 2, \
+ .wake_idx = 1, \
+ .forkexec_idx = 1, \
+ .flags = SD_LOAD_BALANCE \
+ | SD_BALANCE_NEWIDLE \
+ | SD_BALANCE_EXEC \
+ | SD_WAKE_AFFINE \
+ | SD_SHARE_PKG_RESOURCES\
+ | BALANCE_FOR_MC_POWER, \
+ .last_balance = jiffies, \
+ .balance_interval = 1, \
+ .nr_balance_failed = 0, \
+}
+#endif
+#endif /* CONFIG_SCHED_MC */
+
/* Common values for CPUs */
#ifndef SD_CPU_INIT
#define SD_CPU_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
+ .child = NULL, \
.groups = NULL, \
.min_interval = 1, \
.max_interval = 4, \
@@ -135,7 +169,7 @@
| SD_BALANCE_NEWIDLE \
| SD_BALANCE_EXEC \
| SD_WAKE_AFFINE \
- | BALANCE_FOR_POWER, \
+ | BALANCE_FOR_PKG_POWER,\
.last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
@@ -146,6 +180,7 @@
#define SD_ALLNODES_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
+ .child = NULL, \
.groups = NULL, \
.min_interval = 64, \
.max_interval = 64*num_online_cpus(), \
@@ -165,15 +200,6 @@
.nr_balance_failed = 0, \
}

-#ifdef CONFIG_SCHED_MC
-#ifndef SD_MC_INIT
-/* for now its same as SD_CPU_INIT.
- * TBD: Tune Domain parameters!
- */
-#define SD_MC_INIT SD_CPU_INIT
-#endif
-#endif
-
#ifdef CONFIG_NUMA
#ifndef SD_NODE_INIT
#error Please define an appropriate SD_NODE_INIT in include/asm/topology.h!!!
diff -pNru linux-2.6.18-rc4/kernel/sched.c linux/kernel/sched.c
--- linux-2.6.18-rc4/kernel/sched.c 2006-08-06 11:20:11.000000000 -0700
+++ linux/kernel/sched.c 2006-08-15 15:08:46.000652936 -0700
@@ -1277,7 +1277,6 @@ static int sched_balance_self(int cpu, i
cpumask_t span;
struct sched_group *group;
int new_cpu;
- int weight;

span = sd->span;
group = find_idlest_group(sd, t, cpu);
@@ -1291,14 +1290,9 @@ static int sched_balance_self(int cpu, i
/* Now try balancing at a lower domain level */
cpu = new_cpu;
nextlevel:
- sd = NULL;
- weight = cpus_weight(span);
- for_each_domain(cpu, tmp) {
- if (weight <= cpus_weight(tmp->span))
- break;
- if (tmp->flags & flag)
- sd = tmp;
- }
+ sd = sd->child;
+ if (sd && !(sd->flags & flag))
+ goto nextlevel;
/* while loop will break here if sd == NULL */
}

@@ -2512,8 +2506,14 @@ static int load_balance(int this_cpu, st
unsigned long imbalance;
struct rq *busiest;

+ /*
+ * When power savings policy is enabled for the parent domain, idle
+ * sibling can pick up load irrespective of busy siblings. In this case,
+ * let the state of idle sibling percolate up as IDLE, instead of
+ * portraying it as NOT_IDLE.
+ */
if (idle != NOT_IDLE && sd->flags & SD_SHARE_CPUPOWER &&
- !sched_smt_power_savings)
+ !test_sd_flag(sd->parent, SD_POWERSAVINGS_BALANCE))
sd_idle = 1;

schedstat_inc(sd, lb_cnt[idle]);
@@ -2603,7 +2603,7 @@ static int load_balance(int this_cpu, st
}

if (!nr_moved && !sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
- !sched_smt_power_savings)
+ !test_sd_flag(sd->parent, SD_POWERSAVINGS_BALANCE))
return -1;
return nr_moved;

@@ -2619,7 +2619,7 @@ out_one_pinned:
sd->balance_interval *= 2;

if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
- !sched_smt_power_savings)
+ !test_sd_flag(sd->parent, SD_POWERSAVINGS_BALANCE))
return -1;
return 0;
}
@@ -2640,7 +2640,14 @@ load_balance_newidle(int this_cpu, struc
int nr_moved = 0;
int sd_idle = 0;

- if (sd->flags & SD_SHARE_CPUPOWER && !sched_smt_power_savings)
+ /*
+ * When power savings policy is enabled for the parent domain, idle
+ * sibling can pick up load irrespective of busy siblings. In this case,
+ * let the state of idle sibling percolate up as IDLE, instead of
+ * portraying it as NOT_IDLE.
+ */
+ if (sd->flags & SD_SHARE_CPUPOWER &&
+ !test_sd_flag(sd->parent, SD_POWERSAVINGS_BALANCE))
sd_idle = 1;

schedstat_inc(sd, lb_cnt[NEWLY_IDLE]);
@@ -2672,7 +2679,8 @@ load_balance_newidle(int this_cpu, struc

if (!nr_moved) {
schedstat_inc(sd, lb_failed[NEWLY_IDLE]);
- if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER)
+ if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
+ !test_sd_flag(sd->parent, SD_POWERSAVINGS_BALANCE))
return -1;
} else
sd->nr_balance_failed = 0;
@@ -2682,7 +2690,7 @@ load_balance_newidle(int this_cpu, struc
out_balanced:
schedstat_inc(sd, lb_balanced[NEWLY_IDLE]);
if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
- !sched_smt_power_savings)
+ !test_sd_flag(sd->parent, SD_POWERSAVINGS_BALANCE))
return -1;
sd->nr_balance_failed = 0;

@@ -5350,7 +5358,9 @@ static int sd_degenerate(struct sched_do
if (sd->flags & (SD_LOAD_BALANCE |
SD_BALANCE_NEWIDLE |
SD_BALANCE_FORK |
- SD_BALANCE_EXEC)) {
+ SD_BALANCE_EXEC |
+ SD_SHARE_CPUPOWER |
+ SD_SHARE_PKG_RESOURCES)) {
if (sd->groups != sd->groups->next)
return 0;
}
@@ -5384,7 +5394,9 @@ sd_parent_degenerate(struct sched_domain
pflags &= ~(SD_LOAD_BALANCE |
SD_BALANCE_NEWIDLE |
SD_BALANCE_FORK |
- SD_BALANCE_EXEC);
+ SD_BALANCE_EXEC |
+ SD_SHARE_CPUPOWER |
+ SD_SHARE_PKG_RESOURCES);
}
if (~cflags & pflags)
return 0;
@@ -5406,12 +5418,18 @@ static void cpu_attach_domain(struct sch
struct sched_domain *parent = tmp->parent;
if (!parent)
break;
- if (sd_parent_degenerate(tmp, parent))
+ if (sd_parent_degenerate(tmp, parent)) {
tmp->parent = parent->parent;
+ if (parent->parent)
+ parent->parent->child = tmp;
+ }
}

- if (sd && sd_degenerate(sd))
+ if (sd && sd_degenerate(sd)) {
sd = sd->parent;
+ if (sd)
+ sd->child = NULL;
+ }

sched_domain_debug(sd, cpu);

@@ -6185,6 +6203,58 @@ next_sg:
}

/*
+ * Initialize sched groups cpu_power.
+ *
+ * cpu_power indicates the computing power of each sched group. This is
+ * used for distributing the load between different sched groups
+ * in a sched domain. Typically cpu_power for all the groups in a
+ * sched domain will be same unless there are asymmetries in the topology.
+ * If there are asymmetries, during the load balance, the group having more
+ * cpu_power will pickup more load compared to the group having less cpu_power.
+ *
+ * cpu_power will be a multiple of SCHED_LOAD_SCALE. This multiple represents
+ * the maximum number of tasks a group can handle in the presence of other idle
+ * or lightly loaded groups in the same sched domain.
+ */
+static void init_sched_groups_power(int cpu, struct sched_domain *sd)
+{
+ struct sched_domain *child;
+ struct sched_group *group;
+
+ WARN_ON(!sd || !sd->groups);
+
+ if (cpu != first_cpu(sd->groups->cpumask))
+ return;
+
+ child = sd->child;
+
+ /*
+ * For perf policy, if the groups in child domain share resources
+ * (for example cores sharing some portions of the cache hierarchy
+ * or SMT), then set this domain groups cpu_power such that each group
+ * can handle only one task, when there are other idle groups in the
+ * same sched domain.
+ */
+ if (!child || (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
+ (child->flags & SD_SHARE_CPUPOWER ||
+ child->flags & SD_SHARE_PKG_RESOURCES))) {
+ sd->groups->cpu_power = SCHED_LOAD_SCALE;
+ return;
+ }
+
+ sd->groups->cpu_power = 0;
+
+ /*
+ * add cpu_power of each child group to this groups cpu_power
+ */
+ group = child->groups;
+ do {
+ sd->groups->cpu_power += group->cpu_power;
+ group = group->next;
+ } while (group != child->groups);
+}
+
+/*
* Build sched domains for a given set of cpus and attach the sched domains
* to the individual cpus
*/
@@ -6192,6 +6262,7 @@ static int build_sched_domains(const cpu
{
int i;
struct sched_group *sched_group_phys = NULL;
+ struct sched_domain *sd;
#ifdef CONFIG_SCHED_MC
struct sched_group *sched_group_core = NULL;
#endif
@@ -6250,6 +6321,8 @@ static int build_sched_domains(const cpu
*sd = SD_NODE_INIT;
sd->span = sched_domain_node_span(cpu_to_node(i));
sd->parent = p;
+ if (p)
+ p->child = sd;
cpus_and(sd->span, sd->span, *cpu_map);
#endif

@@ -6271,6 +6344,8 @@ static int build_sched_domains(const cpu
*sd = SD_CPU_INIT;
sd->span = nodemask;
sd->parent = p;
+ if (p)
+ p->child = sd;
sd->groups = &sched_group_phys[group];

#ifdef CONFIG_SCHED_MC
@@ -6293,6 +6368,7 @@ static int build_sched_domains(const cpu
sd->span = cpu_coregroup_map(i);
cpus_and(sd->span, sd->span, *cpu_map);
sd->parent = p;
+ p->child = sd;
sd->groups = &sched_group_core[group];
#endif

@@ -6304,6 +6380,7 @@ static int build_sched_domains(const cpu
sd->span = cpu_sibling_map[i];
cpus_and(sd->span, sd->span, *cpu_map);
sd->parent = p;
+ p->child = sd;
sd->groups = &sched_group_cpus[group];
#endif
}
@@ -6422,72 +6499,20 @@ static int build_sched_domains(const cpu
/* Calculate CPU power for physical packages and nodes */
#ifdef CONFIG_SCHED_SMT
for_each_cpu_mask(i, *cpu_map) {
- struct sched_domain *sd;
sd = &per_cpu(cpu_domains, i);
- sd->groups->cpu_power = SCHED_LOAD_SCALE;
+ init_sched_groups_power(i, sd);
}
#endif
#ifdef CONFIG_SCHED_MC
for_each_cpu_mask(i, *cpu_map) {
- int power;
- struct sched_domain *sd;
sd = &per_cpu(core_domains, i);
- if (sched_smt_power_savings)
- power = SCHED_LOAD_SCALE * cpus_weight(sd->groups->cpumask);
- else
- power = SCHED_LOAD_SCALE + (cpus_weight(sd->groups->cpumask)-1)
- * SCHED_LOAD_SCALE / 10;
- sd->groups->cpu_power = power;
+ init_sched_groups_power(i, sd);
}
#endif

for_each_cpu_mask(i, *cpu_map) {
- struct sched_domain *sd;
-#ifdef CONFIG_SCHED_MC
sd = &per_cpu(phys_domains, i);
- if (i != first_cpu(sd->groups->cpumask))
- continue;
-
- sd->groups->cpu_power = 0;
- if (sched_mc_power_savings || sched_smt_power_savings) {
- int j;
-
- for_each_cpu_mask(j, sd->groups->cpumask) {
- struct sched_domain *sd1;
- sd1 = &per_cpu(core_domains, j);
- /*
- * for each core we will add once
- * to the group in physical domain
- */
- if (j != first_cpu(sd1->groups->cpumask))
- continue;
-
- if (sched_smt_power_savings)
- sd->groups->cpu_power += sd1->groups->cpu_power;
- else
- sd->groups->cpu_power += SCHED_LOAD_SCALE;
- }
- } else
- /*
- * This has to be < 2 * SCHED_LOAD_SCALE
- * Lets keep it SCHED_LOAD_SCALE, so that
- * while calculating NUMA group's cpu_power
- * we can simply do
- * numa_group->cpu_power += phys_group->cpu_power;
- *
- * See "only add power once for each physical pkg"
- * comment below
- */
- sd->groups->cpu_power = SCHED_LOAD_SCALE;
-#else
- int power;
- sd = &per_cpu(phys_domains, i);
- if (sched_smt_power_savings)
- power = SCHED_LOAD_SCALE * cpus_weight(sd->groups->cpumask);
- else
- power = SCHED_LOAD_SCALE;
- sd->groups->cpu_power = power;
-#endif
+ init_sched_groups_power(i, sd);
}

#ifdef CONFIG_NUMA


2006-08-16 04:25:16

by Paul Jackson

Subject: Re: [patch] sched: group CPU power setup cleanup

Thanks for the cleanup patch resend, Suresh.

> Resending the new patch. Before patch had some issues for Andy and hence
> dropped.
>
> Andrew, Please add this to -mm. This patch is against 2.6.18-rc4.
> There might be a small conflict while applying to -mm. Let me know if you
> want a patch on top of -mm.
>
> thanks,
> suresh
>
> --

I found the above patch commentary frustrating to read, as it told me
very little, and teased me with reference to details that are left
unsaid.

Can we work on this patch's opening text a bit more?

Take a look at Andrew's "The perfect patch"
http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt

The above quoted material doesn't follow the recommendations
in the (4) Changelog section of Andrew's recommendations, especially
(4a), (4b), (4c) and (4d).

The above quoted material is stuff that will mean little to someone
looking at this changelog in the source history a year from now. Heck,
it means little to me, now, and I supposedly have been trying to track
your sched domain cleanup work more closely than most people.

Just now I ran a grep over my lkml email archives for the last few
months, and apparently the "issue for Andy" refers to a thread that
includes a message dated "Mon, 03 Jul 2006" from Andy Whitcroft under
the generic Subject of "Re: 2.6.17-mm5", noticing a problem with a
panic on NUMA-Q due to a zero cpu_power. Anyone besides Andy and
Suresh who knew that without poking around old lkml messages has way
too good a memory ;).


> Cleanup the sched group cpu_power setup code, by introducing
> child field and new domain flag in sched_domain.

I guess you meant the above two lines to be the actual Changelog
text, to be read by all who examine this change to Linux in the
source control history hereafter.

But as explained in Andrew's recommendations, the recommended
changelog text should come first (after a possible "From:" line)
in the patch email, followed by "---" a line with just three
dashes, followed by the transient text that will mean little
in the future.

That is, you had:


transient text

--

permanent changelog text (just the above 2 lines)

Signed-off-by: ...

diff ...


Andrew's recommendations would instead have:


permanent changelog text (more than 2 lines, I hope,
in this case)

Signed-off-by: ...

---

transient text

diff ...


> + * cpu_power indicates the computing power of each sched group. This is
> + * used for distributing the load between different sched groups
> + * in a sched domain.

Thanks for explaining what cpu_power means.


> ... Typically cpu_power for all the groups in a
> + * sched domain will be same unless there are asymmetries in the topology.

Does the above mean that all groups in a domain have the same
number of CPUs? Or perhaps my newbie ignorance of what 'group'
means is showing through, and my question makes no sense - sorry
if so.


+static void init_sched_groups_power(int cpu, struct sched_domain *sd)
+{
+ ...
+
+ if (cpu != first_cpu(sd->groups->cpumask))
+ return;

I am a tad surprised that the above always works. Is it ever possible
that init_sched_groups_power() is never called for the first cpu in a
group, and that hence the cpu_power of that group is left uninitialized?

If there is some explanation as to how this is not possible, and it is
guaranteed that init_sched_groups_power() is always called for the
first cpu in a group, then that might be worthy of a comment.

Is it possible to get the partition1 or partition2 in the calls:

int partition_sched_domains(cpumask_t *partition1, cpumask_t *partition2)
{
...
if (!cpus_empty(*partition1))
err = build_sched_domains(partition1);
if (!err && !cpus_empty(*partition2))
err = build_sched_domains(partition2);

so some group had some CPUs, but not the first CPU of groups->cpumask
in one of these partitions?


+ /*
+ * For perf policy, if the groups in child domain share resources
+ * (for example cores sharing some portions of the cache hierarchy
+ * or SMT), then set this domain groups cpu_power such that each group
+ * can handle only one task, when there are other idle groups in the
+ * same sched domain.
+ */

I am clearly still missing proper understanding here. How is it that
the cpu_power of a group can be set so that it "can handle only one task?"


> + if (!child || (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
> + (child->flags & SD_SHARE_CPUPOWER ||
> + child->flags & SD_SHARE_PKG_RESOURCES))) {

Would this be equivalent to the following, which saves a few
machine instructions and a conditional jump as well:

if (!child || (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
(child->flags &
(SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)
))) {

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-16 04:50:46

by Andrew Morton

Subject: Re: [patch] sched: group CPU power setup cleanup

On Tue, 15 Aug 2006 21:24:55 -0700
Paul Jackson <[email protected]> wrote:

> Thanks for the cleanup patch resend, Suresh.
>
> > Resending the new patch. Before patch had some issues for Andy and hence
> > dropped.
> >
> > Andrew, Please add this to -mm. This patch is against 2.6.18-rc4.
> > There might be a small conflict while applying to -mm. Let me know if you
> > want a patch on top of -mm.
> >
> > thanks,
> > suresh
> >
> > --
>
> I found the above patch commentary frustrating to read, as it told me
> very little, and teased me with reference to details that are left
> unsaid.
>
> Can we work on this patch's opening text a bit more?

Believe it or not, I usually suffer in silence.

>
> ..
>
> > + * cpu_power indicates the computing power of each sched group. This is
> > + * used for distributing the load between different sched groups
> > + * in a sched domain.
>
> Thanks for explaining what cpu_power means.
>

Hope not. To me, "computing power" means megaflops/sec, or Dhrystones
(don't ask) or whatever. If that's what "cpu_power" is referring to then
the name is hopelessly ambiguous with peak joules/sec and a big renaming is
due.

2006-08-16 05:53:53

by Paul Jackson

Subject: Re: [patch] sched: group CPU power setup cleanup

> Believe it or not, I usually suffer in silence.

I do too (though I look at 0.001% of the patches you do ;).

This time, I was motivated to actually figure this code out a bit,
as a common way to get to this code is via a cpuset cpu_exclusive
setting, and of late a couple of bugs in earlier versions of this
code had caused me grief.

So I was hoping for more useful patch changelog comments, in my
quest to rise above my newbie ranking in kernel/sched.c code.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-16 17:58:26

by Suresh Siddha

Subject: Re: [patch] sched: group CPU power setup cleanup

On Tue, Aug 15, 2006 at 09:24:55PM -0700, Paul Jackson wrote:
> That is, you had:
>
>
> transient text
>
> --
>
> permanent changelog text (just the above 2 lines)
>
> Signed-off-by: ...
>
> diff ...
>
>
> Andrew's recommendations would instead have:
>
>
> permanent changelog text (more than 2 lines, I hope,
> in this case)
>
> Signed-off-by: ...
>
> ---
>
> transient text
>
> diff ...

Andrew, is it possible to change your tpp format so that the transient
text comes at the top, followed by the changelog and Signed-off-by?
The transient text will have more info about the patch in the context of
an ongoing lkml thread conversation, hence it makes sense for it to be on
top rather than somewhere in between the changelog and the patch.

I will definitely add more appropriate changelog text.

> > ... Typically cpu_power for all the groups in a
> > + * sched domain will be same unless there are asymmetries in the topology.
>
> Does the above mean that all groups in a domain have the same
> number of CPUs?

Typically, yes. CPU hotplug or exclusive cpusets can change it.

>
>
> +static void init_sched_groups_power(int cpu, struct sched_domain *sd)
> +{
> + ...
> +
> + if (cpu != first_cpu(sd->groups->cpumask))
> + return;
>
> I am a tad surprised that the above always works. Is it ever possible
> that init_sched_groups_power() is never called for the first cpu in a
> group, and that hence the cpu_power of that group is left uninitialized?

No. This is not possible.

> If there is some explanation as to how this is not possible, and it is
> guaranteed that init_sched_groups_power() is always called for the
> first cpu in a group, then that might be worthy of a comment.

init_sched_groups_power() is called for each cpu in the cpu_map, and hence
for all the cpus in a group, including each group's first cpu.

>
> Is it possible to get the partition1 or partition2 in the calls:
>
> int partition_sched_domains(cpumask_t *partition1, cpumask_t *partition2)
> {
> ...
> if (!cpus_empty(*partition1))
> err = build_sched_domains(partition1);
> if (!err && !cpus_empty(*partition2))
> err = build_sched_domains(partition2);
>
> so some group had some CPUs, but not the first CPU of groups->cpumask
> in one of these partitions?

The question doesn't make sense... each domain has its own specific groups.

> + /*
> + * For perf policy, if the groups in child domain share resources
> + * (for example cores sharing some portions of the cache hierarchy
> + * or SMT), then set this domain groups cpu_power such that each group
> + * can handle only one task, when there are other idle groups in the
> + * same sched domain.
> + */
>
> I am clearly still missing proper understanding here. How is it that
> the cpu_power of a group can be set so that it "can handle only one task?"

Please see the find_busiest_group() code and how it uses cpu_power.
For example, if a group has two tasks and its cpu_power is
1 * SCHED_LOAD_SCALE, then any other idle group in the domain will pick up
the extra task. I have explained the significance of the 'multiple' in the
comments.

>
>
> > + if (!child || (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
> > + (child->flags & SD_SHARE_CPUPOWER ||
> > + child->flags & SD_SHARE_PKG_RESOURCES))) {
>
> Would this be equivalent to the following, which saves a few
> machine instructions and a conditional jump as well:
>
> if (!child || (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
> (child->flags &
> (SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)
> ))) {

The compiler will be doing it anyway. I will include this change regardless, as it is cleaner.

thanks,
suresh

2006-08-16 18:21:25

by Suresh Siddha

Subject: Re: [patch] sched: group CPU power setup cleanup

On Tue, Aug 15, 2006 at 09:47:18PM -0700, Andrew Morton wrote:
> > > + * cpu_power indicates the computing power of each sched group. This is
> > > + * used for distributing the load between different sched groups
> > > + * in a sched domain.
> >
> > Thanks for explaining what cpu_power means.
> >
>
> Hope not. To me, "computing power" means megaflops/sec, or Dhrystones
> (don't ask) or whatever. If that's what "cpu_power" is referring to then
> the name is hopelessly ambiguous with peak joules/sec and a big renaming is
> due.

It refers to the group's processing power. Perhaps "horsepower" is a better term.

thanks,
suresh

2006-08-17 17:20:47

by Paul Jackson

Subject: Re: [patch] sched: group CPU power setup cleanup

> > Hope not. To me, "computing power" means megaflops/sec, or Dhrystones
> > (don't ask) or whatever. If that's what "cpu_power" is referring to then
> > the name is hopelessly ambiguous with peak joules/sec and a big renaming is
> > due.
>
> It refers to group's processing power. Perhaps "horsepower" is better term.

Well ... I don't think "horsepower" is a step in the right direction.

Andrew's point was over the word "power", not "cpu". The term
"cpu_power" suggested to him we were concerned with the power supply
watts consumed by a group of CPUs. Indeed, both those concerned with
laptop battery lifetimes, and the air conditioning tonnage needed
for big honkin NUMA iron might have reason to be concerned with the
power consumed by CPUs.

Changing the word "cpu" to "horse", but keeping the word "power",
does nothing to address Andrew's point. Rather it just adds more
confusion. We are obviously dealing with CPUs here, not horses.

My understanding is that the "cpu_power" of the cpus in a sched group
is roughly proportional to the BogoMIPS of the CPUs in that group.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-17 18:16:38

by Suresh Siddha

Subject: Re: [patch] sched: group CPU power setup cleanup

On Thu, Aug 17, 2006 at 10:20:30AM -0700, Paul Jackson wrote:
> > It refers to group's processing power. Perhaps "horsepower" is better term.
>
> Well ... I don't think "horsepower" is a step in the right direction.
>
> Andrew's point was over the word "power", not "cpu". The term
> "cpu_power" suggested to him we were concerned with the power supply
> watts consumed by a group of CPUs. Indeed, both those concerned with
> laptop battery lifetimes, and the air conditioning tonnage needed
> for big honkin NUMA iron might have reason to be concerned with the
> power consumed by CPUs.
>
> Changing the word "cpu" to "horse", but keeping the word "power",
> does nothing to address Andrew's point. Rather it just adds more
> confusion. We are obviously dealing with CPUs here, not horses.

Let me resist the temptation and not go into the definition of horsepower
here. You can refer to any dictionary.

> My understanding is that the "cpu_power" of the cpus in a sched group
> is roughly proportional to the BogoMIPS of the CPUs in that group.

This variable represents how many tasks (multiplied by the scaling factor
SCHED_LOAD_SCALE) the group can handle before it starts distributing the load
to other idle or more lightly loaded groups. For example, a group with
two HT threads will have it as < 2 * SCHED_LOAD_SCALE, and a group with
N physical cpus in a NUMA node will have it as N * SCHED_LOAD_SCALE.
When the power savings policy is enabled, the values for some domains' groups
will increase, making each group pick up more load and save some power.
"group_capacity" or "load_capacity" might be a good term, considering all this.

thanks,
suresh

2006-08-17 19:18:16

by Paul Jackson

Subject: Re: [patch] sched: group CPU power setup cleanup

Suresh wrote:
> Let me resist the temptation and not go into the definition of horsepower
> here. You can refer any dictionary.

Good point <grin>.

Horsepower is a measure of power, of energy over time, such as the
rate of providing or using electrical or mechanical energy.

So, with your suggestion of 'horsepower', are you saying that cpu_power
is a metric of such electrical or mechanical energy -- the peak or
average watts of the electricity consumed by the CPUs in a group?

Or is 'cpu_power' a metric of the computational capacity, such as
BogoMIPS provided by the CPUs in a group, such as I had presumed?

Hmmm ... apparently from your latest explanation, it's neither
of these.

Rather it's a metric of how many tasks to place in a group, due to
various capacities and constraints, such as computational power
(BogoMIPS) and electrical power (watts or horsepower).

Should 'cpu_power' be renamed to 'task_load' ?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-17 23:29:34

by Ian Stirling

Subject: Re: [patch] sched: group CPU power setup cleanup

Paul Jackson wrote:
> Suresh wrote:
>> Let me resist the temptation and not go into the definition of horsepower
>> here. You can refer any dictionary.
>
> Good point <grin>.
>
> Horsepower is a measure of power, of energy over time, such as the
> rate of providing or using electrical or mechanical energy.
>
> So, with your suggestion of 'horsepower', are you saying that cpu_power
> is a metric of such electrical or mechanical energy -- the peak or
> average watts of the electricity consumed by the CPUs in a group?

And is white_power, the energy output of a redneck on a step machine?

Power has assorted meanings over and above the J/s one.

I'd suggest 'computing_power' might be less ambiguous.

2006-08-17 23:36:09

by Paul Jackson

[permalink] [raw]
Subject: Re: [patch] sched: group CPU power setup cleanup

Ian wrote:
> I'd suggest 'computing_power' might be less ambiguous.

Eh ... I should quit suggesting names, until I figure
out more accurately what it means.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-17 23:56:05

by Peter Williams

[permalink] [raw]
Subject: Re: [patch] sched: group CPU power setup cleanup

Paul Jackson wrote:
>>> Hope not. To me, "computing power" means megaflops/sec, or Dhrystones
>>> (don't ask) or whatever. If that's what "cpu_power" is referring to then
>>> the name is hopelessly ambiguous with peak joules/sec and a big renaming is
>>> due.
>> It refers to group's processing power. Perhaps "horsepower" is better term.
>
> Well ... I don't think "horsepower" is a step in the right direction.
>
> Andrew's point was over the word "power", not "cpu". The term
> "cpu_power" suggested to him we were concerned with the power supply
> watts consumed by a group of CPUs. Indeed, both those concerned with
> laptop battery lifetimes, and the air conditioning tonnage needed
> for big honkin NUMA iron might have reason to be concerned with the
> power consumed by CPUs.
>
> Changing the word "cpu" to "horse", but keeping the word "power",
> does nothing to address Andrew's point. Rather it just adds more
> confusion. We are obviously dealing with CPUs here, not horses.
>
> My understanding is that the "cpu_power" of the cpus in a sched group
> is roughly proportional to the BogoMIPS of the CPUs in that group.
>

How about energy instead of power? I.e. the CPU's capacity to do work.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-08-18 04:56:27

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch] sched: group CPU power setup cleanup

Paul Jackson writes:

> My understanding is that the "cpu_power" of the cpus in a sched group
> is roughly proportional to the BogoMIPS of the CPUs in that group.

Uh oh, on powerpc the bogomips has nothing at all to do with the speed
of the cpu, but relates instead to the frequency at which the timebase
register counts...

Paul.

2006-08-18 21:36:24

by Suresh Siddha

[permalink] [raw]
Subject: [patch] sched: generic sched_group cpu power setup

Andrew, Please apply this patch to -mm. I will post a 'cpu_power' renaming
patch shortly.

thanks,
suresh
---

Up to now, each sched domain's sched group cpu_power has been initialized
independently. This made the setup code increasingly ugly as new sched
domains were added.

Make the sched group cpu_power setup code generic by introducing a child field
and a new domain flag in sched_domain. For most sched domains (except NUMA),
a sched group's cpu_power is now computed generically from the properties of
the domain itself and of its child domain.

Sched groups in NUMA domains are set up a little differently and hence don't
use this generic mechanism.
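For readers skimming the thread, the core of the generic setup can be sketched as a small standalone model (this is not the kernel code — the struct layouts, flag values, and function name below are simplified assumptions, mirroring the logic of init_sched_groups_power in the patch):

```c
#include <assert.h>
#include <stddef.h>

#define SCHED_LOAD_SCALE        128UL   /* "1.0" in fixed point */
#define SD_SHARE_CPUPOWER         1     /* e.g. SMT siblings */
#define SD_SHARE_PKG_RESOURCES    2     /* e.g. cores sharing a cache */
#define SD_POWERSAVINGS_BALANCE   4

/* Simplified stand-ins for struct sched_group / struct sched_domain. */
struct group  { unsigned long cpu_power; struct group *next; /* circular */ };
struct domain { int flags; struct domain *child; struct group *groups; };

static void init_group_power(struct domain *sd)
{
        struct domain *child = sd->child;
        struct group *g;

        /*
         * No child, or performance policy with a resource-sharing child
         * (SMT/MC): the group can handle only one task's worth of load.
         */
        if (!child || (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
            (child->flags & (SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)))) {
                sd->groups->cpu_power = SCHED_LOAD_SCALE;
                return;
        }

        /* Otherwise the group's capacity is the sum of its child's groups. */
        sd->groups->cpu_power = 0;
        g = child->groups;
        do {
                sd->groups->cpu_power += g->cpu_power;
                g = g->next;
        } while (g != child->groups);
}
```

With an SMT child domain under the default performance policy the parent group is capped at SCHED_LOAD_SCALE; with SD_POWERSAVINGS_BALANCE set, the child groups' capacities are summed instead.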

Signed-off-by: Suresh Siddha <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Nick Piggin <[email protected]>
---

diff -pNru linux-2.6.18-rc4/include/asm-i386/topology.h linux/include/asm-i386/topology.h
--- linux-2.6.18-rc4/include/asm-i386/topology.h 2006-08-06 11:20:11.000000000 -0700
+++ linux/include/asm-i386/topology.h 2006-08-18 11:28:12.252099048 -0700
@@ -74,6 +74,7 @@ static inline int node_to_first_cpu(int
#define SD_NODE_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
+ .child = NULL, \
.groups = NULL, \
.min_interval = 8, \
.max_interval = 32, \
diff -pNru linux-2.6.18-rc4/include/asm-ia64/topology.h linux/include/asm-ia64/topology.h
--- linux-2.6.18-rc4/include/asm-ia64/topology.h 2006-08-06 11:20:11.000000000 -0700
+++ linux/include/asm-ia64/topology.h 2006-08-18 11:28:12.252099048 -0700
@@ -59,6 +59,7 @@ void build_cpu_to_node_map(void);
#define SD_CPU_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
+ .child = NULL, \
.groups = NULL, \
.min_interval = 1, \
.max_interval = 4, \
@@ -84,6 +85,7 @@ void build_cpu_to_node_map(void);
#define SD_NODE_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
+ .child = NULL, \
.groups = NULL, \
.min_interval = 8, \
.max_interval = 8*(min(num_online_cpus(), 32)), \
diff -pNru linux-2.6.18-rc4/include/asm-mips/mach-ip27/topology.h linux/include/asm-mips/mach-ip27/topology.h
--- linux-2.6.18-rc4/include/asm-mips/mach-ip27/topology.h 2006-08-06 11:20:11.000000000 -0700
+++ linux/include/asm-mips/mach-ip27/topology.h 2006-08-18 11:28:12.253098896 -0700
@@ -22,6 +22,7 @@ extern unsigned char __node_distances[MA
#define SD_NODE_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
+ .child = NULL, \
.groups = NULL, \
.min_interval = 8, \
.max_interval = 32, \
diff -pNru linux-2.6.18-rc4/include/asm-powerpc/topology.h linux/include/asm-powerpc/topology.h
--- linux-2.6.18-rc4/include/asm-powerpc/topology.h 2006-08-06 11:20:11.000000000 -0700
+++ linux/include/asm-powerpc/topology.h 2006-08-18 11:28:12.253098896 -0700
@@ -43,6 +43,7 @@ extern int pcibus_to_node(struct pci_bus
#define SD_NODE_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
+ .child = NULL, \
.groups = NULL, \
.min_interval = 8, \
.max_interval = 32, \
diff -pNru linux-2.6.18-rc4/include/asm-x86_64/topology.h linux/include/asm-x86_64/topology.h
--- linux-2.6.18-rc4/include/asm-x86_64/topology.h 2006-08-06 11:20:11.000000000 -0700
+++ linux/include/asm-x86_64/topology.h 2006-08-18 11:28:12.253098896 -0700
@@ -31,6 +31,7 @@ extern int __node_distance(int, int);
#define SD_NODE_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
+ .child = NULL, \
.groups = NULL, \
.min_interval = 8, \
.max_interval = 32, \
diff -pNru linux-2.6.18-rc4/include/linux/sched.h linux/include/linux/sched.h
--- linux-2.6.18-rc4/include/linux/sched.h 2006-08-06 11:20:11.000000000 -0700
+++ linux/include/linux/sched.h 2006-08-18 11:28:12.254098744 -0700
@@ -623,9 +623,16 @@ enum idle_type
#define SD_WAKE_BALANCE 64 /* Perform balancing at task wakeup */
#define SD_SHARE_CPUPOWER 128 /* Domain members share cpu power */
#define SD_POWERSAVINGS_BALANCE 256 /* Balance for power savings */
+#define SD_SHARE_PKG_RESOURCES 512 /* Domain members share cpu pkg resources */

-#define BALANCE_FOR_POWER ((sched_mc_power_savings || sched_smt_power_savings) \
- ? SD_POWERSAVINGS_BALANCE : 0)
+#define BALANCE_FOR_MC_POWER \
+ (sched_smt_power_savings ? SD_POWERSAVINGS_BALANCE : 0)
+
+#define BALANCE_FOR_PKG_POWER \
+ ((sched_mc_power_savings || sched_smt_power_savings) ? \
+ SD_POWERSAVINGS_BALANCE : 0)
+
+#define test_sd_flag(sd, flag) ((sd && (sd->flags & flag)) ? 1 : 0)


struct sched_group {
@@ -642,6 +649,7 @@ struct sched_group {
struct sched_domain {
/* These fields must be setup */
struct sched_domain *parent; /* top domain must be null terminated */
+ struct sched_domain *child; /* bottom domain must be null terminated */
struct sched_group *groups; /* the balancing groups of the domain */
cpumask_t span; /* span of all CPUs in this domain */
unsigned long min_interval; /* Minimum balance interval ms */
diff -pNru linux-2.6.18-rc4/include/linux/topology.h linux/include/linux/topology.h
--- linux-2.6.18-rc4/include/linux/topology.h 2006-08-06 11:20:11.000000000 -0700
+++ linux/include/linux/topology.h 2006-08-18 11:28:12.255098592 -0700
@@ -89,6 +89,7 @@
#define SD_SIBLING_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
+ .child = NULL, \
.groups = NULL, \
.min_interval = 1, \
.max_interval = 2, \
@@ -114,11 +115,44 @@
#endif
#endif /* CONFIG_SCHED_SMT */

+#ifdef CONFIG_SCHED_MC
+/* Common values for MC siblings. for now mostly derived from SD_CPU_INIT */
+#ifndef SD_MC_INIT
+#define SD_MC_INIT (struct sched_domain) { \
+ .span = CPU_MASK_NONE, \
+ .parent = NULL, \
+ .child = NULL, \
+ .groups = NULL, \
+ .min_interval = 1, \
+ .max_interval = 4, \
+ .busy_factor = 64, \
+ .imbalance_pct = 125, \
+ .cache_nice_tries = 1, \
+ .per_cpu_gain = 100, \
+ .busy_idx = 2, \
+ .idle_idx = 1, \
+ .newidle_idx = 2, \
+ .wake_idx = 1, \
+ .forkexec_idx = 1, \
+ .flags = SD_LOAD_BALANCE \
+ | SD_BALANCE_NEWIDLE \
+ | SD_BALANCE_EXEC \
+ | SD_WAKE_AFFINE \
+ | SD_SHARE_PKG_RESOURCES\
+ | BALANCE_FOR_MC_POWER, \
+ .last_balance = jiffies, \
+ .balance_interval = 1, \
+ .nr_balance_failed = 0, \
+}
+#endif
+#endif /* CONFIG_SCHED_MC */
+
/* Common values for CPUs */
#ifndef SD_CPU_INIT
#define SD_CPU_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
+ .child = NULL, \
.groups = NULL, \
.min_interval = 1, \
.max_interval = 4, \
@@ -135,7 +169,7 @@
| SD_BALANCE_NEWIDLE \
| SD_BALANCE_EXEC \
| SD_WAKE_AFFINE \
- | BALANCE_FOR_POWER, \
+ | BALANCE_FOR_PKG_POWER,\
.last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
@@ -146,6 +180,7 @@
#define SD_ALLNODES_INIT (struct sched_domain) { \
.span = CPU_MASK_NONE, \
.parent = NULL, \
+ .child = NULL, \
.groups = NULL, \
.min_interval = 64, \
.max_interval = 64*num_online_cpus(), \
@@ -165,15 +200,6 @@
.nr_balance_failed = 0, \
}

-#ifdef CONFIG_SCHED_MC
-#ifndef SD_MC_INIT
-/* for now its same as SD_CPU_INIT.
- * TBD: Tune Domain parameters!
- */
-#define SD_MC_INIT SD_CPU_INIT
-#endif
-#endif
-
#ifdef CONFIG_NUMA
#ifndef SD_NODE_INIT
#error Please define an appropriate SD_NODE_INIT in include/asm/topology.h!!!
diff -pNru linux-2.6.18-rc4/kernel/sched.c linux/kernel/sched.c
--- linux-2.6.18-rc4/kernel/sched.c 2006-08-06 11:20:11.000000000 -0700
+++ linux/kernel/sched.c 2006-08-18 11:40:19.139595352 -0700
@@ -1277,7 +1277,6 @@ static int sched_balance_self(int cpu, i
cpumask_t span;
struct sched_group *group;
int new_cpu;
- int weight;

span = sd->span;
group = find_idlest_group(sd, t, cpu);
@@ -1291,14 +1290,9 @@ static int sched_balance_self(int cpu, i
/* Now try balancing at a lower domain level */
cpu = new_cpu;
nextlevel:
- sd = NULL;
- weight = cpus_weight(span);
- for_each_domain(cpu, tmp) {
- if (weight <= cpus_weight(tmp->span))
- break;
- if (tmp->flags & flag)
- sd = tmp;
- }
+ sd = sd->child;
+ if (sd && !(sd->flags & flag))
+ goto nextlevel;
/* while loop will break here if sd == NULL */
}

@@ -2512,8 +2506,14 @@ static int load_balance(int this_cpu, st
unsigned long imbalance;
struct rq *busiest;

+ /*
+ * When power savings policy is enabled for the parent domain, idle
+ * sibling can pick up load irrespective of busy siblings. In this case,
+ * let the state of idle sibling percolate up as IDLE, instead of
+ * portraying it as NOT_IDLE.
+ */
if (idle != NOT_IDLE && sd->flags & SD_SHARE_CPUPOWER &&
- !sched_smt_power_savings)
+ !test_sd_flag(sd->parent, SD_POWERSAVINGS_BALANCE))
sd_idle = 1;

schedstat_inc(sd, lb_cnt[idle]);
@@ -2603,7 +2603,7 @@ static int load_balance(int this_cpu, st
}

if (!nr_moved && !sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
- !sched_smt_power_savings)
+ !test_sd_flag(sd->parent, SD_POWERSAVINGS_BALANCE))
return -1;
return nr_moved;

@@ -2619,7 +2619,7 @@ out_one_pinned:
sd->balance_interval *= 2;

if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
- !sched_smt_power_savings)
+ !test_sd_flag(sd->parent, SD_POWERSAVINGS_BALANCE))
return -1;
return 0;
}
@@ -2640,7 +2640,14 @@ load_balance_newidle(int this_cpu, struc
int nr_moved = 0;
int sd_idle = 0;

- if (sd->flags & SD_SHARE_CPUPOWER && !sched_smt_power_savings)
+ /*
+ * When power savings policy is enabled for the parent domain, idle
+ * sibling can pick up load irrespective of busy siblings. In this case,
+ * let the state of idle sibling percolate up as IDLE, instead of
+ * portraying it as NOT_IDLE.
+ */
+ if (sd->flags & SD_SHARE_CPUPOWER &&
+ !test_sd_flag(sd->parent, SD_POWERSAVINGS_BALANCE))
sd_idle = 1;

schedstat_inc(sd, lb_cnt[NEWLY_IDLE]);
@@ -2672,7 +2679,8 @@ load_balance_newidle(int this_cpu, struc

if (!nr_moved) {
schedstat_inc(sd, lb_failed[NEWLY_IDLE]);
- if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER)
+ if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
+ !test_sd_flag(sd->parent, SD_POWERSAVINGS_BALANCE))
return -1;
} else
sd->nr_balance_failed = 0;
@@ -2682,7 +2690,7 @@ load_balance_newidle(int this_cpu, struc
out_balanced:
schedstat_inc(sd, lb_balanced[NEWLY_IDLE]);
if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
- !sched_smt_power_savings)
+ !test_sd_flag(sd->parent, SD_POWERSAVINGS_BALANCE))
return -1;
sd->nr_balance_failed = 0;

@@ -5350,7 +5358,9 @@ static int sd_degenerate(struct sched_do
if (sd->flags & (SD_LOAD_BALANCE |
SD_BALANCE_NEWIDLE |
SD_BALANCE_FORK |
- SD_BALANCE_EXEC)) {
+ SD_BALANCE_EXEC |
+ SD_SHARE_CPUPOWER |
+ SD_SHARE_PKG_RESOURCES)) {
if (sd->groups != sd->groups->next)
return 0;
}
@@ -5384,7 +5394,9 @@ sd_parent_degenerate(struct sched_domain
pflags &= ~(SD_LOAD_BALANCE |
SD_BALANCE_NEWIDLE |
SD_BALANCE_FORK |
- SD_BALANCE_EXEC);
+ SD_BALANCE_EXEC |
+ SD_SHARE_CPUPOWER |
+ SD_SHARE_PKG_RESOURCES);
}
if (~cflags & pflags)
return 0;
@@ -5406,12 +5418,18 @@ static void cpu_attach_domain(struct sch
struct sched_domain *parent = tmp->parent;
if (!parent)
break;
- if (sd_parent_degenerate(tmp, parent))
+ if (sd_parent_degenerate(tmp, parent)) {
tmp->parent = parent->parent;
+ if (parent->parent)
+ parent->parent->child = tmp;
+ }
}

- if (sd && sd_degenerate(sd))
+ if (sd && sd_degenerate(sd)) {
sd = sd->parent;
+ if (sd)
+ sd->child = NULL;
+ }

sched_domain_debug(sd, cpu);

@@ -6185,6 +6203,58 @@ next_sg:
}

/*
+ * Initialize sched groups cpu_power.
+ *
+ * cpu_power indicates the capacity of sched group, which is used while
+ * distributing the load between different sched groups in a sched domain.
+ * Typically cpu_power for all the groups in a sched domain will be same unless
+ * there are asymmetries in the topology. If there are asymmetries, group
+ * having more cpu_power will pickup more load compared to the group having
+ * less cpu_power.
+ *
+ * cpu_power will be a multiple of SCHED_LOAD_SCALE. This multiple represents
+ * the maximum number of tasks a group can handle in the presence of other idle
+ * or lightly loaded groups in the same sched domain.
+ */
+static void init_sched_groups_power(int cpu, struct sched_domain *sd)
+{
+ struct sched_domain *child;
+ struct sched_group *group;
+
+ WARN_ON(!sd || !sd->groups);
+
+ if (cpu != first_cpu(sd->groups->cpumask))
+ return;
+
+ child = sd->child;
+
+ /*
+ * For perf policy, if the groups in child domain share resources
+ * (for example cores sharing some portions of the cache hierarchy
+ * or SMT), then set this domain groups cpu_power such that each group
+ * can handle only one task, when there are other idle groups in the
+ * same sched domain.
+ */
+ if (!child || (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
+ (child->flags &
+ (SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)))) {
+ sd->groups->cpu_power = SCHED_LOAD_SCALE;
+ return;
+ }
+
+ sd->groups->cpu_power = 0;
+
+ /*
+ * add cpu_power of each child group to this groups cpu_power
+ */
+ group = child->groups;
+ do {
+ sd->groups->cpu_power += group->cpu_power;
+ group = group->next;
+ } while (group != child->groups);
+}
+
+/*
* Build sched domains for a given set of cpus and attach the sched domains
* to the individual cpus
*/
@@ -6192,6 +6262,7 @@ static int build_sched_domains(const cpu
{
int i;
struct sched_group *sched_group_phys = NULL;
+ struct sched_domain *sd;
#ifdef CONFIG_SCHED_MC
struct sched_group *sched_group_core = NULL;
#endif
@@ -6250,6 +6321,8 @@ static int build_sched_domains(const cpu
*sd = SD_NODE_INIT;
sd->span = sched_domain_node_span(cpu_to_node(i));
sd->parent = p;
+ if (p)
+ p->child = sd;
cpus_and(sd->span, sd->span, *cpu_map);
#endif

@@ -6271,6 +6344,8 @@ static int build_sched_domains(const cpu
*sd = SD_CPU_INIT;
sd->span = nodemask;
sd->parent = p;
+ if (p)
+ p->child = sd;
sd->groups = &sched_group_phys[group];

#ifdef CONFIG_SCHED_MC
@@ -6293,6 +6368,7 @@ static int build_sched_domains(const cpu
sd->span = cpu_coregroup_map(i);
cpus_and(sd->span, sd->span, *cpu_map);
sd->parent = p;
+ p->child = sd;
sd->groups = &sched_group_core[group];
#endif

@@ -6304,6 +6380,7 @@ static int build_sched_domains(const cpu
sd->span = cpu_sibling_map[i];
cpus_and(sd->span, sd->span, *cpu_map);
sd->parent = p;
+ p->child = sd;
sd->groups = &sched_group_cpus[group];
#endif
}
@@ -6422,72 +6499,20 @@ static int build_sched_domains(const cpu
/* Calculate CPU power for physical packages and nodes */
#ifdef CONFIG_SCHED_SMT
for_each_cpu_mask(i, *cpu_map) {
- struct sched_domain *sd;
sd = &per_cpu(cpu_domains, i);
- sd->groups->cpu_power = SCHED_LOAD_SCALE;
+ init_sched_groups_power(i, sd);
}
#endif
#ifdef CONFIG_SCHED_MC
for_each_cpu_mask(i, *cpu_map) {
- int power;
- struct sched_domain *sd;
sd = &per_cpu(core_domains, i);
- if (sched_smt_power_savings)
- power = SCHED_LOAD_SCALE * cpus_weight(sd->groups->cpumask);
- else
- power = SCHED_LOAD_SCALE + (cpus_weight(sd->groups->cpumask)-1)
- * SCHED_LOAD_SCALE / 10;
- sd->groups->cpu_power = power;
+ init_sched_groups_power(i, sd);
}
#endif

for_each_cpu_mask(i, *cpu_map) {
- struct sched_domain *sd;
-#ifdef CONFIG_SCHED_MC
sd = &per_cpu(phys_domains, i);
- if (i != first_cpu(sd->groups->cpumask))
- continue;
-
- sd->groups->cpu_power = 0;
- if (sched_mc_power_savings || sched_smt_power_savings) {
- int j;
-
- for_each_cpu_mask(j, sd->groups->cpumask) {
- struct sched_domain *sd1;
- sd1 = &per_cpu(core_domains, j);
- /*
- * for each core we will add once
- * to the group in physical domain
- */
- if (j != first_cpu(sd1->groups->cpumask))
- continue;
-
- if (sched_smt_power_savings)
- sd->groups->cpu_power += sd1->groups->cpu_power;
- else
- sd->groups->cpu_power += SCHED_LOAD_SCALE;
- }
- } else
- /*
- * This has to be < 2 * SCHED_LOAD_SCALE
- * Lets keep it SCHED_LOAD_SCALE, so that
- * while calculating NUMA group's cpu_power
- * we can simply do
- * numa_group->cpu_power += phys_group->cpu_power;
- *
- * See "only add power once for each physical pkg"
- * comment below
- */
- sd->groups->cpu_power = SCHED_LOAD_SCALE;
-#else
- int power;
- sd = &per_cpu(phys_domains, i);
- if (sched_smt_power_savings)
- power = SCHED_LOAD_SCALE * cpus_weight(sd->groups->cpumask);
- else
- power = SCHED_LOAD_SCALE;
- sd->groups->cpu_power = power;
-#endif
+ init_sched_groups_power(i, sd);
}

#ifdef CONFIG_NUMA

2006-08-18 22:30:28

by Paul Jackson

[permalink] [raw]
Subject: Re: [patch] sched: generic sched_group cpu power setup

Suresh wrote:
> I will post a 'cpu_power' renaming patch shortly.

Rename to what?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-18 22:55:09

by Suresh Siddha

[permalink] [raw]
Subject: Re: [patch] sched: generic sched_group cpu power setup

On Fri, Aug 18, 2006 at 03:29:54PM -0700, Paul Jackson wrote:
> Suresh wrote:
> > I will post a 'cpu_power' renaming patch shortly.
>
> Rename to what?

Not yet decided. But both the words "cpu" and "power" will disappear :)
A sched group can have one or more cpus.. so having 'cpu' is confusing,
and 'power'/'energy' seems to be confusing too..

I don't like 'task_load' as it kind of refers to the current task load in a
sched_group. Currently I favor 'load_capacity'. Anyone reading
sched code knows what 'load' is.

If anyone has a better suggestion, I am open.

thanks,
suresh

2006-08-19 00:10:01

by Paul Jackson

[permalink] [raw]
Subject: Re: [patch] sched: generic sched_group cpu power setup

> If anyone has a better suggestion, I am open.

I'm still trying to figure out what the hell it is ;).

Sorry ... as my teenage son would gladly tell you, I can be dense
at times.

If all the CPUs in a system have the same computational capacity,
then is it just the number of CPUs in a group (times a scale factor
of 128, to simulate fixed point arithmetic with integers)?

I presume "yes", from such code lines as:

power = SCHED_LOAD_SCALE * cpus_weight(sd->groups->cpumask);

If two CPUs, side by side, have the same computational capacity,
but one consumes more electrical power (watts) than the other, do they
have different cpu_power?

I presume "no" - electrical power consumption does not affect this value
(though some effort might be made to minimize electrical power consumption
in these calculations, by letting some CPUs go idle if the job mix allows
for that.)

If I presumed correctly, then apparently what we're talking about here
is computational capacity, as is typically measured in such units as
MIPS, megaflops/sec or Dhrystones. In other words, what Andrew termed
"computing power" when he fired the starter's pistol on this scrum.

Is that what this is -- computational capacity, aka computing power
(appropriately scaled for the convenience of the arithmetic)?

And is the unit of measure just the number of CPUs in the group
(times 128)?

If the above is accurate, then the group structure member could almost
be called "ncpus" (number of cpus in group), unscaled. Perhaps you
only need to scale the value for fixed point arithmetic while calculating
what balancing to attempt.
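Paul's reading of the scale factor can be illustrated with a toy calculation (assuming SCHED_LOAD_SCALE is 128, as in kernels of this era; the helper names are made up for illustration, not kernel functions):

```c
#include <assert.h>

#define SCHED_LOAD_SCALE 128UL  /* represents "1.0", i.e. one CPU's worth */

/* Under the "ncpus times 128" reading, a group of identical CPUs gets: */
static unsigned long group_capacity(unsigned long ncpus)
{
        return ncpus * SCHED_LOAD_SCALE;
}

/*
 * The scale factor lets the balancer compare load per unit of capacity
 * using only integer arithmetic: the result is in SCHED_LOAD_SCALE units
 * per "one CPU's worth" of cpu_power.
 */
static unsigned long load_per_capacity(unsigned long raw_load,
                                       unsigned long cpu_power)
{
        return raw_load * SCHED_LOAD_SCALE / cpu_power;
}
```

So a fully loaded 4-CPU group (raw load 512 against capacity 512) normalizes to exactly SCHED_LOAD_SCALE, i.e. "1.0", without any floating point.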

One more detail, as you likely already noticed, if you rename
"cpu_power" to not mention "power", then consider also the routine
init_numa_sched_groups_power(), the variables pwr_now, pwr_move and
power, and the mentions of "power" in the comments.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-21 22:32:12

by Suresh Siddha

[permalink] [raw]
Subject: Re: [patch] sched: generic sched_group cpu power setup

On Fri, Aug 18, 2006 at 05:09:45PM -0700, Paul Jackson wrote:
> I'm still trying to figure out what the hell it is ;).

Please refer to the code comments in my recent patch, and also to the
recent conversations between us about this. My OLS 2005 paper (on a CMP-aware
kernel scheduler) might also help, and finally the source code, kernel/sched.c.

> If all the CPUs in a system have the same computational capacity,
> then is it just the number of CPUs in a group (times a scale factor

No. It depends on the sched domain and its characteristics, and on the
performance/power-savings policy.

> If two CPUs, side by side, have the same computational capacity,
> but one consumes more electrical power (watts) than the other, do they
> have different cpu_power?

No.

> If I presumed correctly, then apparently what we're talking about here
> is computational capacity, as is typically measured in such units as
> MIPS, megaflops/sec or Dhrystones. In other words, what Andrew termed
> "computing power" when he fired the starter's pistol on this scrum.
>
> Is that what this is -- computational capacity, aka computing power
> (appropriately scaled for the convenience of the arithmetic)?
>
> And is the unit of measure just the number of CPUs in the group
> (times 128)?

No. That is the case only in some scenarios (for example, when power
savings is enabled). In the default performance policy mode, the domain
characteristics determine this value.

thanks,
suresh