2012-11-06 13:12:02

by Alex Shi

[permalink] [raw]
Subject: [RFC PATCH 0/3] power aware scheduling

This patchset implements my previous power aware scheduling proposal:
https://lkml.org/lkml/2012/8/13/139

It is based on the tip/master tree.
It also reuses much of Suresh's old code in the 2nd patch. I don't know
how to give credit to Suresh for this.
Suresh, would you like to give a 'Signed-off-by' or anything else you
think is needed?

Any comments are appreciated!

Regards!
Alex


2012-11-06 13:12:09

by Alex Shi

[permalink] [raw]
Subject: [RFC PATCH 2/3] sched: power aware load balance,

This patch enables power aware consideration in load balancing.

As mentioned in the power aware scheduler proposal, power aware
scheduling has 2 assumptions:
1, racing to idle is helpful for power saving
2, shrinking tasks onto fewer sched_groups will reduce power consumption

The first assumption makes the performance policy take over scheduling
when the system is busy.
The second assumption makes power aware scheduling try to move
dispersed tasks into fewer groups until those groups are full of tasks.

This patch reuses much of Suresh's power saving load balance code.
Now the general enabling logic is (a plain C sketch follows below):
1, Collect power aware scheduler statistics along with the performance
load balance statistics collection.
2, If the domain is eligible for power load balancing, do it and skip
performance load balancing; otherwise, do performance load balancing.
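
As an illustration only, here is a minimal userspace C sketch of that
decision flow (hypothetical, simplified types; group_stats stands in
for the real sg_lb_stats, and nothing here is the kernel code):

    #include <stdio.h>

    struct group_stats {
        unsigned long nr_running;   /* tasks running in the group */
        unsigned long weight;       /* LCPUs in the group */
    };

    /* Power balance stays eligible only while no group is overloaded;
     * one overloaded group means falling back to performance balance. */
    static int power_lb_eligible(const struct group_stats *g, int n)
    {
        int i;

        for (i = 0; i < n; i++)
            if (g[i].nr_running > g[i].weight)
                return 0;   /* overloaded: do performance lb */
        return 1;           /* underloaded: power lb may pack tasks */
    }

    int main(void)
    {
        struct group_stats sd[2] = { { 3, 8 }, { 1, 8 } };

        printf("power lb eligible: %d\n", power_lb_eligible(sd, 2));
        return 0;
    }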

Tried on my 2 sockets * 4 cores * HT NHM EP machine
and a 2 sockets * 8 cores * HT SNB EP machine.
In the following check, when I is 2/4/8/16, all tasks are
shrunk to run on a single core or a single socket.

$for ((i=0; i < I; i++)) ; do while true; do : ; done & done
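
(For reference, a hedged C equivalent of the one-liner: it forks I pure
userspace spinners once at startup and then does no further fork/exec;
the shell version likewise only burns CPU in 'bash' loops.)

    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int i, n = argc > 1 ? atoi(argv[1]) : 2;  /* n plays the role of I */

        for (i = 0; i < n; i++)
            if (fork() == 0)
                for (;;)
                    ;         /* each child burns one CPU */
        pause();              /* the parent just sleeps */
        return 0;
    }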

Checked the power consumption with a power meter on the NHM EP:
powersaving performance
I = 2 148w 160w
I = 4 175w 181w
I = 8 207w 224w
I = 16 324w 324w

On a SNB laptop (4 cores * HT):
powersaving performance
I = 2 28w 35w
I = 4 38w 52w
I = 6 44w 54w
I = 8 56w 56w

On the SNB EP machine, when I = 16, power saved more than 100 Watts.

Also tested specjbb2005 with JRockit, and kbuild; their peak performance
shows no clear change with the powersaving policy on any machine. Only
specjbb2005 with OpenJDK drops about 2% on the NHM EP machine with the
powersaving policy.

This patch is a bit long, but it seems hard to split into smaller pieces.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 155 +++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 153 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dedc576..acc8b41 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3930,6 +3930,8 @@ struct lb_env {
unsigned int loop;
unsigned int loop_break;
unsigned int loop_max;
+ int power_lb; /* if powersaving lb needed */
+ int perf_lb; /* if performance lb needed */

struct rq * (*find_busiest_queue)(struct lb_env *,
struct sched_group *);
@@ -4356,6 +4358,16 @@ struct sd_lb_stats {
unsigned int busiest_group_weight;

int group_imb; /* Is there imbalance in this sd */
+
+ /* Variables of power aware scheduling */
+ unsigned long sd_capacity; /* capacity of this domain */
+ unsigned long sd_nr_running; /* Nr running of this domain */
+ struct sched_group *group_min; /* Least loaded group in sd */
+ struct sched_group *group_leader; /* Group which relieves group_min */
+ unsigned long min_load_per_task; /* load_per_task in group_min */
+ unsigned long leader_nr_running; /* Nr running of group_leader */
+ unsigned long min_nr_running; /* Nr running of group_min */
+
#ifdef CONFIG_SCHED_NUMA
struct sched_group *numa_group; /* group which has offnode_tasks */
unsigned long numa_group_weight;
@@ -4387,6 +4399,123 @@ struct sg_lb_stats {
};

/**
+ * init_sd_lb_power_stats - Initialize power savings statistics for
+ * the given sched_domain, during load balancing.
+ *
+ * @env: The load balancing environment.
+ * @sds: Variable containing the statistics for sd.
+ */
+static inline void init_sd_lb_power_stats(struct lb_env *env,
+ struct sd_lb_stats *sds)
+{
+ if (sched_policy == SCHED_POLICY_PERFORMANCE ||
+ env->idle == CPU_NOT_IDLE) {
+ env->power_lb = 0;
+ env->perf_lb = 1;
+ return;
+ }
+ env->perf_lb = 0;
+ env->power_lb = 1;
+ sds->min_nr_running = ULONG_MAX;
+ sds->leader_nr_running = 0;
+}
+
+/**
+ * update_sd_lb_power_stats - Update the power saving stats for a
+ * sched_domain while performing load balancing.
+ *
+ * @env: The load balancing environment.
+ * @group: sched_group belonging to the sched_domain under consideration.
+ * @sds: Variable containing the statistics of the sched_domain
+ * @local_group: Does group contain the CPU for which we're performing
+ * load balancing?
+ * @sgs: Variable containing the statistics of the group.
+ */
+static inline void update_sd_lb_power_stats(struct lb_env *env,
+ struct sched_group *group, struct sd_lb_stats *sds,
+ int local_group, struct sg_lb_stats *sgs)
+{
+ unsigned long threshold;
+
+ if (!env->power_lb)
+ return;
+
+ threshold = sgs->group_weight;
+
+ /*
+ * If the local group is idle or fully loaded,
+ * no need to do power savings balance at this domain
+ */
+ if (local_group && (sds->this_nr_running == threshold ||
+ !sds->this_nr_running))
+ env->power_lb = 0;
+
+ /* Do performance load balance if any group overload */
+ if (sgs->sum_nr_running > threshold) {
+ env->perf_lb = 1;
+ env->power_lb = 0;
+ }
+
+ /*
+ * If a group is idle,
+ * don't include that group in power savings calculations
+ */
+ if (!env->power_lb || !sgs->sum_nr_running)
+ return;
+
+ sds->sd_nr_running += sgs->sum_nr_running;
+ /*
+ * Calculate the group which has the least non-idle load.
+ * This is the group from where we need to pick up the load
+ * for saving power
+ */
+ if ((sgs->sum_nr_running < sds->min_nr_running) ||
+ (sgs->sum_nr_running == sds->min_nr_running &&
+ group_first_cpu(group) > group_first_cpu(sds->group_min))) {
+ sds->group_min = group;
+ sds->min_nr_running = sgs->sum_nr_running;
+ sds->min_load_per_task = sgs->sum_weighted_load /
+ sgs->sum_nr_running;
+ }
+
+ /*
+ * Calculate the group which is almost near its
+ * capacity but still has some space to pick up some load
+ * from other group and save more power
+ */
+ if (sgs->sum_nr_running + 1 > threshold)
+ return;
+
+ if (sgs->sum_nr_running > sds->leader_nr_running ||
+ (sgs->sum_nr_running == sds->leader_nr_running &&
+ group_first_cpu(group) < group_first_cpu(sds->group_leader))) {
+ sds->group_leader = group;
+ sds->leader_nr_running = sgs->sum_nr_running;
+ }
+}
+
+/**
+ * check_sd_power_lb_needed - Check if power aware load balancing is
+ * needed in the sched_domain.
+ *
+ * @env: The load balancing environment.
+ * @sds: Variable containing the statistics of the sched_domain
+ */
+
+static inline void check_sd_power_lb_needed(struct lb_env *env,
+ struct sd_lb_stats *sds)
+{
+ unsigned long threshold = env->sd->span_weight;
+ if (!env->power_lb)
+ return;
+
+ if (sds->sd_nr_running > threshold) {
+ env->power_lb = 0;
+ env->perf_lb = 1;
+ }
+}
+
+/**
* get_sd_load_idx - Obtain the load index for a given sched domain.
* @sd: The sched_domain whose load_idx is to be obtained.
* @idle: The Idle status of the CPU for whose sd load_icx is obtained.
@@ -4850,6 +4979,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
if (child && child->flags & SD_PREFER_SIBLING)
prefer_sibling = 1;

+ init_sd_lb_power_stats(env, sds);
load_idx = get_sd_load_idx(env->sd, env->idle);

do {
@@ -4899,8 +5029,11 @@ static inline void update_sd_lb_stats(struct lb_env *env,

update_sd_numa_stats(env->sd, sg, sds, local_group, &sgs);

+ update_sd_lb_power_stats(env, sg, sds, local_group, &sgs);
sg = sg->next;
} while (sg != env->sd->groups);
+
+ check_sd_power_lb_needed(env, sds);
}

/**
@@ -5116,6 +5249,19 @@ find_busiest_group(struct lb_env *env, int *balance)
*/
update_sd_lb_stats(env, balance, &sds);

+ if (!env->perf_lb && !env->power_lb)
+ return NULL;
+
+ if (env->power_lb) {
+ if (sds.this == sds.group_leader &&
+ sds.group_leader != sds.group_min) {
+ env->imbalance = sds.min_load_per_task;
+ return sds.group_min;
+ }
+ env->power_lb = 0;
+ return NULL;
+ }
+
/*
* this_cpu is not the appropriate cpu to perform load balancing at
* this level.
@@ -5222,7 +5368,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
* When comparing with imbalance, use weighted_cpuload()
* which is not scaled with the cpu power.
*/
- if (capacity && rq->nr_running == 1 && wl > env->imbalance)
+ if (rq->nr_running == 0 ||
+ (!env->power_lb && capacity &&
+ rq->nr_running == 1 && wl > env->imbalance))
continue;

/*
@@ -5298,6 +5446,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.loop_break = sched_nr_migrate_break,
.cpus = cpus,
.find_busiest_queue = find_busiest_queue,
+ .power_lb = 1,
+ .perf_lb = 0,
};

cpumask_copy(cpus, cpu_active_mask);
@@ -5330,7 +5480,8 @@ redo:

ld_moved = 0;
lb_iterations = 1;
- if (busiest->nr_running > 1) {
+ if (busiest->nr_running > 1 ||
+ (busiest->nr_running == 1 && env.power_lb)) {
/*
* Attempt to move tasks. If find_busiest_group has found
* an imbalance but busiest->nr_running <= 1, the group is
--
1.7.12

2012-11-06 13:12:16

by Alex Shi

[permalink] [raw]
Subject: [RFC PATCH 3/3] sched: add power aware scheduling in fork/exec/wake

This patch adds power aware scheduling in fork/exec/wake. It tries to
select a cpu from the busiest group that still has capacity (the group
seeking is sketched below). The trade off is adding power aware
statistics collection for the group seeking. But since the collection
only happens when the power scheduling eligible condition holds, there
is not much performance impact.
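
For illustration only, a standalone C sketch of that group seeking,
mirroring the g_delta idea in get_sd_power_stats() below (the struct
and names here are simplified stand-ins, not the kernel code):

    #include <stdio.h>
    #include <limits.h>

    struct group {
        const char *name;
        long threshold;      /* group_weight under the powersaving policy */
        long nr_running;
    };

    /* Pick the group with the smallest positive free capacity, i.e.
     * the busiest group that still has room for one more task. */
    static const char *pick_group_leader(const struct group *g, int n)
    {
        long min_delta = LONG_MAX;
        const char *leader = NULL;
        int i;

        for (i = 0; i < n; i++) {
            long delta = g[i].threshold - g[i].nr_running;

            if (delta > 0 && delta < min_delta) {
                min_delta = delta;
                leader = g[i].name;
            }
        }
        return leader;       /* NULL: all groups full, fall back */
    }

    int main(void)
    {
        struct group sd[3] = {
            { "group0", 8, 7 }, { "group1", 8, 2 }, { "group2", 8, 0 },
        };

        printf("leader: %s\n", pick_group_leader(sd, 3));
        return 0;
    }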

hackbench testing results show no clear drop even with the powersaving
policy.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 233 +++++++++++++++++++++++++++++++++++-----------------
1 file changed, 159 insertions(+), 74 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index acc8b41..902ef5a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3370,12 +3370,149 @@ static int numa_select_node_cpu(struct task_struct *p, int node)
#endif /* CONFIG_SCHED_NUMA */

/*
- * sched_balance_self: balance the current task (running on cpu) in domains
+ * sd_lb_stats - Structure to store the statistics of a sched_domain
+ * during load balancing.
+ */
+struct sd_lb_stats {
+ struct sched_group *busiest; /* Busiest group in this sd */
+ struct sched_group *this; /* Local group in this sd */
+ unsigned long total_load; /* Total load of all groups in sd */
+ unsigned long total_pwr; /* Total power of all groups in sd */
+ unsigned long avg_load; /* Average load across all groups in sd */
+
+ /** Statistics of this group */
+ unsigned long this_load;
+ unsigned long this_load_per_task;
+ unsigned long this_nr_running;
+ unsigned long this_has_capacity;
+ unsigned int this_idle_cpus;
+
+ /* Statistics of the busiest group */
+ unsigned int busiest_idle_cpus;
+ unsigned long max_load;
+ unsigned long busiest_load_per_task;
+ unsigned long busiest_nr_running;
+ unsigned long busiest_group_capacity;
+ unsigned long busiest_has_capacity;
+ unsigned int busiest_group_weight;
+
+ int group_imb; /* Is there imbalance in this sd */
+
+ /* Variables of power aware scheduling */
+ unsigned long sd_capacity; /* capacity of this domain */
+ unsigned long sd_nr_running; /* Nr running of this domain */
+ struct sched_group *group_min; /* Least loaded group in sd */
+ struct sched_group *group_leader; /* Group which relieves group_min */
+ unsigned long min_load_per_task; /* load_per_task in group_min */
+ unsigned long leader_nr_running; /* Nr running of group_leader */
+ unsigned long min_nr_running; /* Nr running of group_min */
+#ifdef CONFIG_SCHED_NUMA
+ struct sched_group *numa_group; /* group which has offnode_tasks */
+ unsigned long numa_group_weight;
+ unsigned long numa_group_running;
+
+ unsigned long this_offnode_running;
+ unsigned long this_onnode_running;
+#endif
+};
+
+/*
+ * sg_lb_stats - stats of a sched_group required for load_balancing
+ * and task rq selection
+ */
+struct sg_lb_stats {
+ unsigned long avg_load; /*Avg load across the CPUs of the group */
+ unsigned long group_load; /* Total load over the CPUs of the group */
+ unsigned long sum_nr_running; /* Nr tasks running in the group */
+ unsigned long sum_weighted_load; /* Weighted load of group's tasks */
+ unsigned long group_capacity;
+ unsigned long idle_cpus;
+ unsigned long group_weight;
+ int group_imb; /* Is there an imbalance in the group ? */
+ int group_has_capacity; /* Is there extra capacity in the group? */
+#ifdef CONFIG_SCHED_NUMA
+ unsigned long numa_offnode_weight;
+ unsigned long numa_offnode_running;
+ unsigned long numa_onnode_running;
+#endif
+};
+
+static inline int
+fix_small_capacity(struct sched_domain *sd, struct sched_group *group);
+
+static void get_sg_power_stats(struct sched_group *group,
+ struct sched_domain *sd, struct sg_lb_stats *sgs)
+{
+ int i;
+
+
+ for_each_cpu(i, sched_group_cpus(group)) {
+ struct rq *rq = cpu_rq(i);
+
+ sgs->sum_nr_running += rq->nr_running;
+ }
+
+ sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
+ SCHED_POWER_SCALE);
+ if (!sgs->group_capacity)
+ sgs->group_capacity = fix_small_capacity(sd, group);
+ sgs->group_weight = group->group_weight;
+}
+
+static void get_sd_power_stats(struct sched_domain *sd,
+ struct sd_lb_stats *sds)
+{
+ struct sched_group *group;
+ struct sg_lb_stats sgs;
+ long sd_min_delta = LONG_MAX;
+
+ group = sd->groups;
+ do {
+ long g_delta;
+ unsigned long threshold;
+
+ memset(&sgs, 0, sizeof(sgs));
+ get_sg_power_stats(group, sd, &sgs);
+
+ if (sched_policy == SCHED_POLICY_POWERSAVING)
+ threshold = sgs.group_weight;
+ else
+ threshold = sgs.group_capacity;
+ g_delta = threshold - sgs.sum_nr_running;
+
+ if (g_delta > 0 && g_delta < sd_min_delta) {
+ sd_min_delta = g_delta;
+ sds->group_leader = group;
+ }
+
+ sds->sd_nr_running += sgs.sum_nr_running;
+ sds->total_pwr += group->sgp->power;
+ } while (group = group->next, group != sd->groups);
+
+ sds->sd_capacity = DIV_ROUND_CLOSEST(sds->total_pwr,
+ SCHED_POWER_SCALE);
+}
+
+static inline int get_sd_sched_policy(struct sched_domain *sd,
+ struct sd_lb_stats *sds)
+{
+ int policy = SCHED_POLICY_PERFORMANCE;
+
+ if (sched_policy != SCHED_POLICY_PERFORMANCE) {
+ memset(sds, 0, sizeof(*sds));
+ get_sd_power_stats(sd, sds);
+
+ if (sd->span_weight > sds->sd_nr_running)
+ policy = SCHED_POLICY_POWERSAVING;
+ }
+ return policy;
+}
+
+/*
+ * select_task_rq_fair: balance the current task (running on cpu) in domains
* that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
* SD_BALANCE_EXEC.
*
- * Balance, ie. select the least loaded group.
- *
* Returns the target CPU number, or the same CPU if no balancing is needed.
*
* preempt must be disabled.
@@ -3384,12 +3521,14 @@ static int
select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
+ struct sd_lb_stats sds;
int cpu = smp_processor_id();
int prev_cpu = task_cpu(p);
int new_cpu = cpu;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;
int node = tsk_home_node(p);
+ int policy = sched_policy;

if (p->nr_cpus_allowed == 1)
return prev_cpu;
@@ -3412,6 +3551,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)

new_cpu = cpu = node_cpu;
sd = per_cpu(sd_node, cpu);
+ policy = get_sd_sched_policy(sd, &sds);
goto pick_idlest;
}

@@ -3445,8 +3585,12 @@ find_sd:
break;
}

- if (tmp->flags & sd_flag)
+ if (tmp->flags & sd_flag) {
sd = tmp;
+ policy = get_sd_sched_policy(sd, &sds);
+ if (policy != SCHED_POLICY_PERFORMANCE)
+ break;
+ }
}

if (affine_sd) {
@@ -3460,7 +3604,7 @@ find_sd:
pick_idlest:
while (sd) {
int load_idx = sd->forkexec_idx;
- struct sched_group *group;
+ struct sched_group *group = NULL;
int weight;

if (!(sd->flags & sd_flag)) {
@@ -3471,7 +3615,12 @@ pick_idlest:
if (sd_flag & SD_BALANCE_WAKE)
load_idx = sd->wake_idx;

- group = find_idlest_group(sd, p, cpu, load_idx);
+ if (policy != SCHED_POLICY_PERFORMANCE)
+ group = sds.group_leader;
+
+ if (!group)
+ group = find_idlest_group(sd, p, cpu, load_idx);
+
if (!group) {
sd = sd->child;
continue;
@@ -3491,8 +3640,11 @@ pick_idlest:
for_each_domain(cpu, tmp) {
if (weight <= tmp->span_weight)
break;
- if (tmp->flags & sd_flag)
+ if (tmp->flags & sd_flag) {
sd = tmp;
+ if (policy != SCHED_POLICY_PERFORMANCE)
+ policy = get_sd_sched_policy(sd, &sds);
+ }
}
/* while loop will break here if sd == NULL */
}
@@ -4330,73 +4482,6 @@ static unsigned long task_h_load(struct task_struct *p)
#endif

/********** Helpers for find_busiest_group ************************/
-/*
- * sd_lb_stats - Structure to store the statistics of a sched_domain
- * during load balancing.
- */
-struct sd_lb_stats {
- struct sched_group *busiest; /* Busiest group in this sd */
- struct sched_group *this; /* Local group in this sd */
- unsigned long total_load; /* Total load of all groups in sd */
- unsigned long total_pwr; /* Total power of all groups in sd */
- unsigned long avg_load; /* Average load across all groups in sd */
-
- /** Statistics of this group */
- unsigned long this_load;
- unsigned long this_load_per_task;
- unsigned long this_nr_running;
- unsigned long this_has_capacity;
- unsigned int this_idle_cpus;
-
- /* Statistics of the busiest group */
- unsigned int busiest_idle_cpus;
- unsigned long max_load;
- unsigned long busiest_load_per_task;
- unsigned long busiest_nr_running;
- unsigned long busiest_group_capacity;
- unsigned long busiest_has_capacity;
- unsigned int busiest_group_weight;
-
- int group_imb; /* Is there imbalance in this sd */
-
- /* Variables of power aware scheduling */
- unsigned long sd_capacity; /* capacity of this domain */
- unsigned long sd_nr_running; /* Nr running of this domain */
- struct sched_group *group_min; /* Least loaded group in sd */
- struct sched_group *group_leader; /* Group which relieves group_min */
- unsigned long min_load_per_task; /* load_per_task in group_min */
- unsigned long leader_nr_running; /* Nr running of group_leader */
- unsigned long min_nr_running; /* Nr running of group_min */
-
-#ifdef CONFIG_SCHED_NUMA
- struct sched_group *numa_group; /* group which has offnode_tasks */
- unsigned long numa_group_weight;
- unsigned long numa_group_running;
-
- unsigned long this_offnode_running;
- unsigned long this_onnode_running;
-#endif
-};
-
-/*
- * sg_lb_stats - stats of a sched_group required for load_balancing
- */
-struct sg_lb_stats {
- unsigned long avg_load; /*Avg load across the CPUs of the group */
- unsigned long group_load; /* Total load over the CPUs of the group */
- unsigned long sum_nr_running; /* Nr tasks running in the group */
- unsigned long sum_weighted_load; /* Weighted load of group's tasks */
- unsigned long group_capacity;
- unsigned long idle_cpus;
- unsigned long group_weight;
- int group_imb; /* Is there an imbalance in the group ? */
- int group_has_capacity; /* Is there extra capacity in the group? */
-#ifdef CONFIG_SCHED_NUMA
- unsigned long numa_offnode_weight;
- unsigned long numa_offnode_running;
- unsigned long numa_onnode_running;
-#endif
-};

/**
* init_sd_lb_power_stats - Initialize power savings statistics for
--
1.7.12

2012-11-06 13:12:50

by Alex Shi

[permalink] [raw]
Subject: [RFC PATCH 1/3] sched: add sched_policy and it's sysfs interface

This patch adds the power aware scheduler knob into sysfs:

$cat /sys/devices/system/cpu/sched_policy/available_sched_policy
performance powersaving

$cat /sys/devices/system/cpu/sched_policy/current_sched_policy
powersaving

The sched policy in use is 'powersaving'. Users can change the policy
with the 'echo' command:
echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy

Power aware scheduling behaves differently according to the
policy:

performance: the current scheduling behaviour; try to spread tasks
onto more CPU sockets or cores.
powersaving: shrink tasks into a sched group until the group's
nr_running reaches its group_weight.
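
For illustration, a small userspace C program that exercises the knob,
assuming the sysfs paths above (writing requires root):

    #include <stdio.h>

    #define POLICY "/sys/devices/system/cpu/sched_policy/current_sched_policy"

    int main(void)
    {
        char cur[32];
        FILE *f = fopen(POLICY, "r");

        if (f) {
            if (fgets(cur, sizeof(cur), f))
                printf("current: %s", cur);   /* e.g. "powersaving" */
            fclose(f);
        }

        f = fopen(POLICY, "w");               /* switch the policy */
        if (!f) {
            perror(POLICY);
            return 1;
        }
        fputs("powersaving\n", f);            /* or "performance" */
        return fclose(f) ? 1 : 0;
    }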

Signed-off-by: Alex Shi <[email protected]>
---
Documentation/ABI/testing/sysfs-devices-system-cpu | 21 +++++++
drivers/base/cpu.c | 2 +
include/linux/cpu.h | 2 +
kernel/sched/fair.c | 68 +++++++++++++++++++++-
kernel/sched/sched.h | 5 ++
5 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 6943133..1909d3e 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -53,6 +53,27 @@ Description: Dynamic addition and removal of CPU's. This is not hotplug
the system. Information written to the file to remove CPU's
is architecture specific.

+What: /sys/devices/system/cpu/sched_policy/current_sched_policy
+ /sys/devices/system/cpu/sched_policy/available_sched_policy
+Date: Oct 2012
+Contact: Linux kernel mailing list <[email protected]>
+Description: CFS scheduler policy showing and setting interface.
+
+ available_sched_policy shows that there are 2 kinds of policy now:
+ performance and powersaving.
+ current_sched_policy shows the current scheduler policy, and users
+ can change the policy by writing to it.
+
+ The policy decides how the CFS scheduler distributes tasks onto CPU
+ units when the number of tasks is less than the LCPU number in the
+ system.
+
+ performance: try to spread tasks onto more CPU sockets and
+ more CPU cores.
+
+ powersaving: try to shrink tasks onto the same core or the same CPU
+ until the number of running tasks reaches the LCPU number in the
+ core or socket.
+
What: /sys/devices/system/cpu/cpu#/node
Date: October 2009
Contact: Linux memory management mailing list <[email protected]>
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 6345294..5f6a573 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -330,4 +330,6 @@ void __init cpu_dev_init(void)
panic("Failed to register CPU subsystem");

cpu_dev_register_generic();
+
+ create_sysfs_sched_policy_group(cpu_subsys.dev_root);
}
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index ce7a074..b2e9265 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -36,6 +36,8 @@ extern void cpu_remove_dev_attr(struct device_attribute *attr);
extern int cpu_add_dev_attr_group(struct attribute_group *attrs);
extern void cpu_remove_dev_attr_group(struct attribute_group *attrs);

+extern int create_sysfs_sched_policy_group(struct device *dev);
+
#ifdef CONFIG_HOTPLUG_CPU
extern void unregister_cpu(struct cpu *cpu);
extern ssize_t arch_cpu_probe(const char *, size_t);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2cebc81..dedc576 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6383,7 +6383,6 @@ void unregister_fair_sched_group(struct task_group *tg, int cpu) { }

#endif /* CONFIG_FAIR_GROUP_SCHED */

-
static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task)
{
struct sched_entity *se = &task->se;
@@ -6399,6 +6398,73 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
return rr_interval;
}

+/* The default scheduler policy is 'performance'. */
+int __read_mostly sched_policy = SCHED_POLICY_PERFORMANCE;
+
+#ifdef CONFIG_SYSFS
+static ssize_t show_available_sched_policy(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "performance powersaving\n");
+}
+
+static ssize_t show_current_sched_policy(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ if (sched_policy == SCHED_POLICY_PERFORMANCE)
+ return sprintf(buf, "performance\n");
+ else if (sched_policy == SCHED_POLICY_POWERSAVING)
+ return sprintf(buf, "powersaving\n");
+ return 0;
+}
+
+static ssize_t set_sched_policy(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t count)
+{
+ unsigned int ret = -EINVAL;
+ char str_policy[16];
+
+ ret = sscanf(buf, "%15s", str_policy);
+ if (ret != 1)
+ return -EINVAL;
+
+ if (!strcmp(str_policy, "performance"))
+ sched_policy = SCHED_POLICY_PERFORMANCE;
+ else if (!strcmp(str_policy, "powersaving"))
+ sched_policy = SCHED_POLICY_POWERSAVING;
+ else
+ return -EINVAL;
+
+ return count;
+}
+
+/*
+ * Sysfs setup bits:
+ */
+static DEVICE_ATTR(current_sched_policy, 0644, show_current_sched_policy,
+ set_sched_policy);
+
+static DEVICE_ATTR(available_sched_policy, 0444,
+ show_available_sched_policy, NULL);
+
+static struct attribute *sched_policy_default_attrs[] = {
+ &dev_attr_current_sched_policy.attr,
+ &dev_attr_available_sched_policy.attr,
+ NULL
+};
+static struct attribute_group sched_policy_attr_group = {
+ .attrs = sched_policy_default_attrs,
+ .name = "sched_policy",
+};
+
+int __init create_sysfs_sched_policy_group(struct device *dev)
+{
+ return sysfs_create_group(&dev->kobj, &sched_policy_attr_group);
+}
+#endif /* CONFIG_SYSFS */
+
/*
* All the scheduling class methods:
*/
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 508e77e..9a6e06c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -9,6 +9,11 @@

extern __read_mostly int scheduler_running;

+#define SCHED_POLICY_PERFORMANCE (0x1)
+#define SCHED_POLICY_POWERSAVING (0x2)
+
+extern int __read_mostly sched_policy;
+
/*
* Convert user-nice values [ -20 ... 0 ... 19 ]
* to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
--
1.7.12

2012-11-06 13:48:30

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] sched: add sched_policy and it's sysfs interface

On Tue, Nov 06, 2012 at 09:09:57PM +0800, Alex Shi wrote:
> This patch adds the power aware scheduler knob into sysfs:
>
> $cat /sys/devices/system/cpu/sched_policy/available_sched_policy
> performance powersaving
>
> $cat /sys/devices/system/cpu/sched_policy/current_sched_policy
> powersaving
>
> The sched policy in use is 'powersaving'. Users can change the policy
> with the 'echo' command:
> echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy
>
> Power aware scheduling behaves differently according to the
> policy:
>
> performance: the current scheduling behaviour; try to spread tasks
> onto more CPU sockets or cores.
> powersaving: shrink tasks into a sched group until the group's
> nr_running reaches its group_weight.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> Documentation/ABI/testing/sysfs-devices-system-cpu | 21 +++++++
> drivers/base/cpu.c | 2 +
> include/linux/cpu.h | 2 +
> kernel/sched/fair.c | 68 +++++++++++++++++++++-
> kernel/sched/sched.h | 5 ++
> 5 files changed, 97 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
> index 6943133..1909d3e 100644
> --- a/Documentation/ABI/testing/sysfs-devices-system-cpu
> +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
> @@ -53,6 +53,27 @@ Description: Dynamic addition and removal of CPU's. This is not hotplug
> the system. Information written to the file to remove CPU's
> is architecture specific.
>
> +What: /sys/devices/system/cpu/sched_policy/current_sched_policy
> + /sys/devices/system/cpu/sched_policy/available_sched_policy
> +Date: Oct 2012
> +Contact: Linux kernel mailing list <[email protected]>
> +Description: CFS scheduler policy showing and setting interface.
> +
> + available_sched_policy shows that there are 2 kinds of policy now:
> + performance and powersaving.
> + current_sched_policy shows the current scheduler policy, and users
> + can change the policy by writing to it.
> +
> + The policy decides how the CFS scheduler distributes tasks onto CPU
> + units when the number of tasks is less than the LCPU number in the
> + system.
> +
> + performance: try to spread tasks onto more CPU sockets and
> + more CPU cores.
> +
> + powersaving: try to shrink tasks onto the same core or the same CPU
> + until the number of running tasks reaches the LCPU number in the
> + core or socket.
> +
> What: /sys/devices/system/cpu/cpu#/node
> Date: October 2009
> Contact: Linux memory management mailing list <[email protected]>
> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> index 6345294..5f6a573 100644
> --- a/drivers/base/cpu.c
> +++ b/drivers/base/cpu.c
> @@ -330,4 +330,6 @@ void __init cpu_dev_init(void)
> panic("Failed to register CPU subsystem");
>
> cpu_dev_register_generic();
> +
> + create_sysfs_sched_policy_group(cpu_subsys.dev_root);

Are you sure you didn't just race with userspace, creating the sysfs
files after the device was created and announced to userspace?

If so, you need to fix this :)

thanks,

greg k-h

2012-11-06 15:20:28

by Luming Yu

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] sched: add sched_policy and it's sysfs interface

On Tue, Nov 6, 2012 at 8:09 AM, Alex Shi <[email protected]> wrote:
> This patch adds the power aware scheduler knob into sysfs:

The problem is that users don't know how to use this knob.

Based on what data could people select one policy that is surely
better than another?

"Packing small tasks" approach could be better and more intelligent.
http://thread.gmane.org/gmane.linux.kernel/1348522

Just some random thoughts, as I didn't have a chance to look into the
details of that patch set yet. But to me, we need to exploit the fact
that we could automatically bind a group of tasks on a minimal set of
CPUs that can provide sufficient CPU cycles, comparable to the
"cpu-run-average" that the task group can get in a pure CFS situation
in a given period, until we see more CPU is needed. Then we probably
can maintain the required CPU power available to the corresponding
workload, while leaving all other CPUs in power saving mode. The
problem is that the pattern suggested by historical data could become
invalid in the future, and then we would need more CPUs. I think this
is the point we need to know before the spread or not-spread
decision... if spreading would not help the CPU-run-average, we don't
need to waste CPU power. But I don't know how hard it could be. I'm
pretty sure a sysfs knob is harder. :-) /l

2012-11-06 19:51:09

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] sched: power aware load balance,

On Tue, 6 Nov 2012 21:09:58 +0800
Alex Shi <[email protected]> wrote:

> $for ((i=0; i < I; i++)) ; do while true; do : ; done & done
>
> Checked the power consumption with a power meter on the NHM EP:
> powersaving performance
> I = 2 148w 160w
> I = 4 175w 181w
> I = 8 207w 224w
> I = 16 324w 324w
>
> On a SNB laptop (4 cores * HT):
> powersaving performance
> I = 2 28w 35w
> I = 4 38w 52w
> I = 6 44w 54w
> I = 8 56w 56w
>
> On the SNB EP machine, when I = 16, power saved more than 100 Watts.

Confused. According to the above table, at I=16 the EP machine saved 0
watts. Typo in the data?


Also, that's a pretty narrow test - it's doing fork and exec at very
high frequency and things such as task placement decisions at process
startup might be affecting the results. Also, the load will be quite
kernel-intensive, as opposed to the more typical userspace-intensive
loads.

So, please run a broader set of tests so we can see the effects?

2012-11-07 04:37:35

by Preeti Murthy

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] sched: power aware load balance,

Hi Alex,

What I am concerned about in this patchset as Peter also
mentioned in the previous discussion of your approach
(https://lkml.org/lkml/2012/8/13/139)
is that:

1. Using nr_running of two different sched groups to decide which one
can be group_leader or group_min might not be the right approach,
as this might mislead us to think that a group running one task is less
loaded than the group running three tasks, although the former's task
is a cpu hogger.

2. Comparing the number of cpus with the number of tasks running in a
sched group to decide if the group is underloaded or overloaded again
faces the same issue. The tasks might be short running, not utilizing
the cpu much.

I also feel that before we introduce another side to the scheduler
called 'power aware', why not try and see if the current scheduler
itself can perform better? We have an opportunity in terms of PJT's
patches, which can help the scheduler make more realistic decisions in
load balancing. Also, since PJT's metric is a statistical one, I
believe we could vary it to allow the scheduler to do more rigorous or
less rigorous power savings.

It is true however that this approach will not try and evacuate nearly
idle cpus over to nearly full cpus. That is definitely one of the
benefits of your patch, in terms of power savings, but I believe your
patch is not making use of the right metric to decide that.

IMHO, the approach towards a power aware scheduler should take the
following steps:

1. Make use of PJT's per-entity load tracking metric to allow the
scheduler to make more intelligent decisions in load balancing. Test
the performance and power saving numbers.

2. If the above shows some characteristic change in behaviour over the
earlier scheduler, it should be either towards power saving or towards
performance. If found positive towards one of them, try varying the
calculation of per-entity load to see if it can lean towards the other
behaviour. If it can, then there you go, you have a knob to change
between policies right there!

3. If you don't get enough power savings with the above approach, then
add your patchset to evacuate nearly idle cpus towards nearly busy
groups, but use PJT's metric to make the decision.

What do you think?

Regards
Preeti U Murthy
On Tue, Nov 6, 2012 at 6:39 PM, Alex Shi <[email protected]> wrote:
> This patch enables power aware consideration in load balancing.
>
> As mentioned in the power aware scheduler proposal, power aware
> scheduling has 2 assumptions:
> 1, racing to idle is helpful for power saving
> 2, shrinking tasks onto fewer sched_groups will reduce power consumption
>
> The first assumption makes the performance policy take over scheduling
> when the system is busy.
> The second assumption makes power aware scheduling try to move
> dispersed tasks into fewer groups until those groups are full of tasks.
>
> This patch reuses much of Suresh's power saving load balance code.
> Now the general enabling logic is:
> 1, Collect power aware scheduler statistics along with the performance
> load balance statistics collection.
> 2, If the domain is eligible for power load balancing, do it and skip
> performance load balancing; otherwise, do performance load balancing.
>
> Tried on my 2 sockets * 4 cores * HT NHM EP machine
> and a 2 sockets * 8 cores * HT SNB EP machine.
> In the following check, when I is 2/4/8/16, all tasks are
> shrunk to run on a single core or a single socket.
>
> $for ((i=0; i < I; i++)) ; do while true; do : ; done & done
>
> Checked the power consumption with a power meter on the NHM EP:
> powersaving performance
> I = 2 148w 160w
> I = 4 175w 181w
> I = 8 207w 224w
> I = 16 324w 324w
>
> On a SNB laptop (4 cores * HT):
> powersaving performance
> I = 2 28w 35w
> I = 4 38w 52w
> I = 6 44w 54w
> I = 8 56w 56w
>
> On the SNB EP machine, when I = 16, power saved more than 100 Watts.
>
> Also tested specjbb2005 with JRockit, and kbuild; their peak performance
> shows no clear change with the powersaving policy on any machine. Only
> specjbb2005 with OpenJDK drops about 2% on the NHM EP machine with the
> powersaving policy.
>
> This patch is a bit long, but it seems hard to split into smaller pieces.
>
> Signed-off-by: Alex Shi <[email protected]>

2012-11-07 12:27:45

by Alex Shi

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] sched: add sched_policy and it's sysfs interface

>> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
>> index 6345294..5f6a573 100644
>> --- a/drivers/base/cpu.c
>> +++ b/drivers/base/cpu.c
>> @@ -330,4 +330,6 @@ void __init cpu_dev_init(void)
>> panic("Failed to register CPU subsystem");
>>
>> cpu_dev_register_generic();
>> +
>> + create_sysfs_sched_policy_group(cpu_subsys.dev_root);
>
> Are you sure you didn't just race with userspace, creating the sysfs
> files after the device was created and announced to userspace?

Sorry, I don't fully get you. Is the sysfs announced to userspace
just by 'mount -t sysfs sysfs /sys'?

The old powersaving interface, sched_smt_power_savings, was also
created here, and cpu_dev_init is called early, before the
do_initcalls where the cpuidle/cpufreq sysfs entries are initialized.

Do you mean this line needs to be initialized as a core_initcall?

Thanks for comments! :)
>
> If so, you need to fix this :)
>
> thanks,
>
> greg k-h
>


--
Thanks
Alex

2012-11-07 12:42:28

by Alex Shi

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] sched: power aware load balance,

On 11/07/2012 03:51 AM, Andrew Morton wrote:
> On Tue, 6 Nov 2012 21:09:58 +0800
> Alex Shi <[email protected]> wrote:
>
>> $for ((i=0; i < I; i++)) ; do while true; do : ; done & done
>>
>> Checked the power consumption with a power meter on the NHM EP:
>> powersaving performance
>> I = 2 148w 160w
>> I = 4 175w 181w
>> I = 8 207w 224w
>> I = 16 324w 324w
>>
>> On a SNB laptop (4 cores * HT):
>> powersaving performance
>> I = 2 28w 35w
>> I = 4 38w 52w
>> I = 6 44w 54w
>> I = 8 56w 56w
>>
>> On the SNB EP machine, when I = 16, power saved more than 100 Watts.
>
> Confused. According to the above table, at I=16 the EP machine saved 0
> watts. Typo in the data?

Not a typo. Since the LCPU number of the EP machine is 16, when I = 16
the powersaving policy actually does nothing. That is by the patch's
design, for the race-to-idle assumption.

The results look the same with the third patch (for fork/exec/wake)
applied. The results are put here because they come from this patch.

>
>
> Also, that's a pretty narrow test - it's doing fork and exec at very
> high frequency and things such as task placement decisions at process
> startup might be affecting the results. Also, the load will be quite
> kernel-intensive, as opposed to the more typical userspace-intensive
> loads.

Sorry, why do you think it keeps doing fork/exec? It just generates
several 'bash' tasks to burn CPU, without fork/exec.

With I = 8, on my 32 LCPU SNB EP machine,
no do_fork calls happen in 5 seconds:

$ sudo perf stat -e probe:* -a sleep 5
 Performance counter stats for 'sleep 5':
                 3 probe:do_execve                 [100.00%]
                 0 probe:do_fork                   [100.00%]

And it is not kernel-intensive; it runs nearly all in user level.

'top' output: 25.0%us vs 0.0%sy

Tasks: 319 total, 9 running, 310 sleeping, 0 stopped, 0 zombie
Cpu(s): 25.0%us, 0.0%sy, 0.0%ni, 74.5%id, 0.4%wa, 0.1%hi, 0.0%si, 0.0%st
...

> So, please run a broader set of tests so we can see the effects?
>

Really, I have no more ideas for suitable benchmarks.

Just tried kbuild -j 16 on the 32 LCPU SNB EP; power is saved by just
10%, but compile time increases by about 15%.
It seems that if the task number varies around the powersaving criteria
number, that just causes trouble.




--
Thanks
Alex

2012-11-07 13:03:33

by Alex Shi

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] sched: add sched_policy and it's sysfs interface

On 11/06/2012 11:20 PM, Luming Yu wrote:
> On Tue, Nov 6, 2012 at 8:09 AM, Alex Shi <[email protected]> wrote:
>> This patch adds the power aware scheduler knob into sysfs:
>
> The problem is that users don't know how to use this knob.
>
> Based on what data could people select one policy that is surely
> better than another?
>
> "Packing small tasks" approach could be better and more intelligent.
> http://thread.gmane.org/gmane.linux.kernel/1348522

It does not conflict with this patchset. :)
>
> Just some random thoughts, as I didn't have a chance to look into the
> details of that patch set yet. But to me, we need to exploit the fact
> that we could automatically bind a group of tasks on a minimal set of
> CPUs that can provide sufficient CPU cycles, comparable to the
> "cpu-run-average" that the task group can get in a pure CFS situation
> in a given period, until we see more CPU is needed. Then we probably
> can maintain the required CPU power available to the corresponding
> workload, while leaving all other CPUs in power saving mode. The
> problem is that the pattern suggested by historical data could become
> invalid in the future, and then we would need more CPUs. I think this
> is the point we need to know before the spread or not-spread
> decision... if spreading would not help the CPU-run-average, we don't
> need to waste CPU power. But I don't know how hard it could be. I'm
> pretty sure a sysfs knob is harder. :-) /l
>


--
Thanks
Alex

2012-11-07 13:27:21

by Alex Shi

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] sched: power aware load balance,

On 11/07/2012 12:37 PM, Preeti Murthy wrote:
> Hi Alex,
>
> What I am concerned about in this patchset as Peter also
> mentioned in the previous discussion of your approach
> (https://lkml.org/lkml/2012/8/13/139)
> is that:
>
> 1. Using nr_running of two different sched groups to decide which one
> can be group_leader or group_min might not be the right approach,
> as this might mislead us to think that a group running one task is less
> loaded than the group running three tasks, although the former's task
> is a cpu hogger.
>
> 2. Comparing the number of cpus with the number of tasks running in a
> sched group to decide if the group is underloaded or overloaded again
> faces the same issue. The tasks might be short running, not utilizing
> the cpu much.

Yes, maybe nr task is not the best indicator. But as a first step, it
can prove the proposal is a correct path and is worth trying more.
Considering that the old powersaving implementation also judged on nr
tasks, and my testing results of this, it may still be an option.
>
> I also feel that before we introduce another side to the scheduler
> called 'power aware', why not try and see if the current scheduler
> itself can perform better? We have an opportunity in terms of PJT's
> patches, which can help the scheduler make more realistic decisions in
> load balancing. Also, since PJT's metric is a statistical one, I
> believe we could vary it to allow the scheduler to do more rigorous or
> less rigorous power savings.

Will study PJT's approach.
Actually, the current patch set is also a kind of load balance
modification, right? :)
>
> It is true however that this approach will not try and evacuate nearly
> idle cpus over to nearly full cpus. That is definitely one of the
> benefits of your patch, in terms of power savings, but I believe your
> patch is not making use of the right metric to decide that.

If one sched group has just one task, and another group has just one
LCPU idle, my patch definitely will pull the task to the nearly full
sched group. So I don't understand what you mean by 'will not try and
evacuate nearly idle cpus over to nearly full cpus'.


>
> IMHO, the approach towards a power aware scheduler should take the
> following steps:
>
> 1. Make use of PJT's per-entity load tracking metric to allow the
> scheduler to make more intelligent decisions in load balancing. Test
> the performance and power saving numbers.
>
> 2. If the above shows some characteristic change in behaviour over the
> earlier scheduler, it should be either towards power saving or towards
> performance. If found positive towards one of them, try varying the
> calculation of per-entity load to see if it can lean towards the other
> behaviour. If it can, then there you go, you have a knob to change
> between policies right there!
>
> 3. If you don't get enough power savings with the above approach, then
> add your patchset to evacuate nearly idle cpus towards nearly busy
> groups, but use PJT's metric to make the decision.
>
> What do you think?

Will consider this. Thanks!
>
> Regards
> Preeti U Murthy
> On Tue, Nov 6, 2012 at 6:39 PM, Alex Shi <[email protected]> wrote:
>> This patch enables power aware consideration in load balancing.
>>
>> As mentioned in the power aware scheduler proposal, power aware
>> scheduling has 2 assumptions:
>> 1, racing to idle is helpful for power saving
>> 2, shrinking tasks onto fewer sched_groups will reduce power consumption
>>
>> The first assumption makes the performance policy take over scheduling
>> when the system is busy.
>> The second assumption makes power aware scheduling try to move
>> dispersed tasks into fewer groups until those groups are full of tasks.
>>
>> This patch reuses much of Suresh's power saving load balance code.
>> Now the general enabling logic is:
>> 1, Collect power aware scheduler statistics along with the performance
>> load balance statistics collection.
>> 2, If the domain is eligible for power load balancing, do it and skip
>> performance load balancing; otherwise, do performance load balancing.
>>
>> Tried on my 2 sockets * 4 cores * HT NHM EP machine
>> and a 2 sockets * 8 cores * HT SNB EP machine.
>> In the following check, when I is 2/4/8/16, all tasks are
>> shrunk to run on a single core or a single socket.
>>
>> $for ((i=0; i < I; i++)) ; do while true; do : ; done & done
>>
>> Checked the power consumption with a power meter on the NHM EP:
>> powersaving performance
>> I = 2 148w 160w
>> I = 4 175w 181w
>> I = 8 207w 224w
>> I = 16 324w 324w
>>
>> On a SNB laptop (4 cores * HT):
>> powersaving performance
>> I = 2 28w 35w
>> I = 4 38w 52w
>> I = 6 44w 54w
>> I = 8 56w 56w
>>
>> On the SNB EP machine, when I = 16, power saved more than 100 Watts.
>>
>> Also tested specjbb2005 with JRockit, and kbuild; their peak performance
>> shows no clear change with the powersaving policy on any machine. Only
>> specjbb2005 with OpenJDK drops about 2% on the NHM EP machine with the
>> powersaving policy.
>>
>> This patch is a bit long, but it seems hard to split into smaller pieces.
>>
>> Signed-off-by: Alex Shi <[email protected]>


--
Thanks
Alex

2012-11-07 14:41:27

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] sched: add sched_policy and it's sysfs interface

On Wed, Nov 07, 2012 at 08:27:17PM +0800, Alex Shi wrote:
> >> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> >> index 6345294..5f6a573 100644
> >> --- a/drivers/base/cpu.c
> >> +++ b/drivers/base/cpu.c
> >> @@ -330,4 +330,6 @@ void __init cpu_dev_init(void)
> >> panic("Failed to register CPU subsystem");
> >>
> >> cpu_dev_register_generic();
> >> +
> >> + create_sysfs_sched_policy_group(cpu_subsys.dev_root);
> >
> > Are you sure you didn't just race with userspace, creating the sysfs
> > files after the device was created and announced to userspace?
>
> Sorry, I don't fully get you. Is the sysfs announced to userspace
> just by 'mount -t sysfs sysfs /sys'?

No, when the struct device is registered with the driver core.

> The old powersaving interface, sched_smt_power_savings, was also
> created here, and cpu_dev_init is called early, before the
> do_initcalls where the cpuidle/cpufreq sysfs entries are initialized.
>
> Do you mean this line needs to be initialized as a core_initcall?

No, you need to make this as an attribute group for the device, so the
driver core will create it automatically before it tells userspace that
the device is now present.

Use the default attribute groups and you should be fine.

Hope this helps,

greg k-h

2012-11-08 14:40:36

by Alex Shi

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] sched: add sched_policy and it's sysfs interface

On 11/07/2012 10:41 PM, Greg KH wrote:
> On Wed, Nov 07, 2012 at 08:27:17PM +0800, Alex Shi wrote:
>>>> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
>>>> index 6345294..5f6a573 100644
>>>> --- a/drivers/base/cpu.c
>>>> +++ b/drivers/base/cpu.c
>>>> @@ -330,4 +330,6 @@ void __init cpu_dev_init(void)
>>>> panic("Failed to register CPU subsystem");
>>>>
>>>> cpu_dev_register_generic();
>>>> +
>>>> + create_sysfs_sched_policy_group(cpu_subsys.dev_root);
>>>
>>> Are you sure you didn't just race with userspace, creating the sysfs
>>> files after the device was created and announced to userspace?
>>
>> Sorry, I don't fully get you. Is the sysfs announced to userspace
>> just by 'mount -t sysfs sysfs /sys'?
>
> No, when the struct device is registered with the driver core.
>
>> The old powersaving interface, sched_smt_power_savings, was also
>> created here, and cpu_dev_init is called early, before the
>> do_initcalls where the cpuidle/cpufreq sysfs entries are initialized.
>>
>> Do you mean this line needs to be initialized as a core_initcall?
>
> No, you need to make this as an attribute group for the device, so the
> driver core will create it automatically before it tells userspace that
> the device is now present.
>
> Use the default attribute groups and you should be fine.

Thanks a lot for the explanation! :)

It seems there is a misunderstanding here. I just create a sysfs group;
no device is registered.
The code follows cpuidle's implementation:

$ ls /sys/devices/system/cpu/cpuidle/
current_driver current_governor_ro

It seems still better to move the group creation into sched/fair.c,
not here.
>
> Hope this helps,
>
> greg k-h
>


--
Thanks
Alex

2012-11-11 18:49:40

by Preeti Murthy

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] sched: power aware load balance,

Hi Alex
I apologise for the delay in replying.

On Wed, Nov 7, 2012 at 6:57 PM, Alex Shi <[email protected]> wrote:
> On 11/07/2012 12:37 PM, Preeti Murthy wrote:
>> Hi Alex,
>>
>> What I am concerned about in this patchset as Peter also
>> mentioned in the previous discussion of your approach
>> (https://lkml.org/lkml/2012/8/13/139)
>> is that:
>>
>> 1. Using nr_running of two different sched groups to decide which one
>> can be group_leader or group_min might not be the right approach,
>> as this might mislead us to think that a group running one task is less
>> loaded than the group running three tasks, although the former's task
>> is a cpu hogger.
>>
>> 2. Comparing the number of cpus with the number of tasks running in a
>> sched group to decide if the group is underloaded or overloaded again
>> faces the same issue. The tasks might be short running, not utilizing
>> the cpu much.
>
>> Yes, maybe nr task is not the best indicator. But as a first step, it
>> can prove the proposal is a correct path and is worth trying more.
>> Considering that the old powersaving implementation also judged on nr
>> tasks, and my testing results of this, it may still be an option.
Hmm.. will think about this and get back.
>>
>> I also feel that before we introduce another side to the scheduler
>> called 'power aware', why not try and see if the current scheduler
>> itself can perform better? We have an opportunity in terms of PJT's
>> patches, which can help the scheduler make more realistic decisions in
>> load balancing. Also, since PJT's metric is a statistical one, I
>> believe we could vary it to allow the scheduler to do more rigorous or
>> less rigorous power savings.
>
>> Will study PJT's approach.
>> Actually, the current patch set is also a kind of load balance
>> modification, right? :)
It is true that this is a different approach; in fact we will require
this approach to do power savings, because PJT's patches introduce a
new 'metric' and not a new 'approach', in my opinion, to do smarter
load balancing, not power aware load balancing per se. So your patch
is surely a step towards power aware lb. I am just worried about the
metric used in it.
>>
>> It is true however that this approach will not try and evacuate nearly
>> idle cpus over to nearly full cpus. That is definitely one of the
>> benefits of your patch, in terms of power savings, but I believe your
>> patch is not making use of the right metric to decide that.
>
> If one sched group has just one task, and another group has just one
> LCPU idle, my patch definitely will pull the task to the nearly full
> sched group. So I don't understand what you mean by 'will not try and
> evacuate nearly idle cpus over to nearly full cpus'.
No, by 'this approach' I meant the current load balancer integrated
with PJT's metric. Your approach does 'evacuate' the nearly idle cpus
over to the nearly full cpus.

Regards
Preeti U Murthy

2012-11-12 03:07:11

by Alex Shi

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] sched: power aware load balance,

On 11/12/2012 02:49 AM, Preeti Murthy wrote:
> Hi Alex
> I apologise for the delay in replying .

That's all right. I am often busy on other Intel tasks too and have no
time to look at LKML. :)
>
> On Wed, Nov 7, 2012 at 6:57 PM, Alex Shi <[email protected]> wrote:
>> On 11/07/2012 12:37 PM, Preeti Murthy wrote:
>>> Hi Alex,
>>>
>>> What I am concerned about in this patchset as Peter also
>>> mentioned in the previous discussion of your approach
>>> (https://lkml.org/lkml/2012/8/13/139)
>>> is that:
>>>
>>> 1. Using nr_running of two different sched groups to decide which one
>>> can be group_leader or group_min might not be the right approach,
>>> as this might mislead us to think that a group running one task is less
>>> loaded than the group running three tasks, although the former's task
>>> is a cpu hogger.
>>>
>>> 2. Comparing the number of cpus with the number of tasks running in a
>>> sched group to decide if the group is underloaded or overloaded again
>>> faces the same issue. The tasks might be short running, not utilizing
>>> the cpu much.
>>
>> Yes, maybe nr task is not the best indicator. But as a first step, it
>> can prove the proposal is a correct path and is worth trying more.
>> Considering that the old powersaving implementation also judged on nr
>> tasks, and my testing results of this, it may still be an option.
> Hmm.. will think about this and get back.
>>>
>>> I also feel that before we introduce another side to the scheduler
>>> called 'power aware', why not try and see if the current scheduler
>>> itself can perform better? We have an opportunity in terms of PJT's
>>> patches, which can help the scheduler make more realistic decisions in
>>> load balancing. Also, since PJT's metric is a statistical one, I
>>> believe we could vary it to allow the scheduler to do more rigorous or
>>> less rigorous power savings.
>>
>> Will study PJT's approach.
>> Actually, the current patch set is also a kind of load balance
>> modification, right? :)
> It is true that this is a different approach; in fact we will require
> this approach to do power savings, because PJT's patches introduce a
> new 'metric' and not a new 'approach', in my opinion, to do smarter
> load balancing, not power aware load balancing per se. So your patch
> is surely a step towards power aware lb. I am just worried about the
> metric used in it.
>>>
>>> It is true however that this approach will not try and evacuate nearly
>>> idle cpus over to nearly full cpus. That is definitely one of the
>>> benefits of your patch, in terms of power savings, but I believe your
>>> patch is not making use of the right metric to decide that.
>>
>> If one sched group has just one task, and another group has just one
>> LCPU idle, my patch definitely will pull the task to the nearly full
>> sched group. So I don't understand what you mean by 'will not try and
>> evacuate nearly idle cpus over to nearly full cpus'.
> No, by 'this approach' I meant the current load balancer integrated
> with PJT's metric. Your approach does 'evacuate' the nearly idle cpus
> over to the nearly full cpus.

Oh, a misunderstanding of 'this approach'. :) Anyway, we are all clear
about this now.