2013-04-25 17:24:50

by Vincent Guittot

Subject: [RFC PATCH v4 00/14] sched: packing small tasks

Hi,

This patchset takes advantage of the new per-task load tracking that is
available in the kernel to pack tasks on as few CPUs/Clusters/Cores as
possible. It has 2 packing modes:
-The 1st mode packs the small tasks when the system is not too busy. The main
goal is to reduce the power consumption in low system load use cases by
minimizing the number of power domains that are enabled, while keeping the
default behavior, which is performance oriented.
-The 2nd mode packs all tasks in as few power domains as possible in order to
improve the power consumption of the system, but at the cost of a possible
performance decrease because of the higher rate of resource sharing compared
to the default mode.

The packing is done in 3 steps (the last step is only applicable for the
aggressive packing mode):

The 1st step looks for the best place to pack tasks in a system according to
its topology and defines a 1st pack buddy CPU for each CPU if one is
available. The policy for defining a buddy CPU is that we want to pack at the
levels where a group of CPUs can be power gated independently from the
others. To describe this capability, a new flag SD_SHARE_POWERDOMAIN has been
introduced, which indicates whether the groups of CPUs of a scheduling domain
share their power state. By default, this flag is set in all sched_domains in
order to keep the current behavior of the scheduler unchanged, and only the
ARM platform clears the SD_SHARE_POWERDOMAIN flag at the MC and CPU levels.

In a 2nd step, the scheduler checks the load average of a task that wakes up
as well as the load average of the buddy CPU, and it can decide to migrate
light tasks to a buddy that is not busy. This check is done during the wake
up because small tasks tend to wake up between periodic load balances and
asynchronously to each other, which prevents the default mechanism from
catching and migrating them efficiently. A light task is defined by a
runnable_avg_sum that is less than 20% of the runnable_avg_period. In fact,
this condition encloses 2 others: the average CPU load of the task must be
less than 20%, and the task must have been runnable for less than 10ms when
it last woke up, in order to be eligible for the packing migration. So, a
task that runs 1ms every 5ms will be considered a small task, but a task that
runs 50ms with a period of 500ms will not.
Then, the busyness of the buddy CPU depends on the load average of its rq and
the number of running tasks. A CPU with a load average greater than 50% is
considered busy whatever the number of running tasks, and this threshold is
reduced as the number of running tasks grows, in order not to increase the
wake up latency of a task too much. When the buddy CPU is busy, the scheduler
falls back to the default CFS policy.
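
To make the two checks concrete, here is an illustrative sketch only; it
mirrors the is_light_task() and is_buddy_busy() helpers introduced in patch
03/14 (the sketch_* names are just for illustration):

static bool sketch_is_light_task(struct task_struct *p)
{
	/* the task's average runnable share is below 20% of its period */
	return (p->se.avg.runnable_avg_sum * 5) <
			p->se.avg.runnable_avg_period;
}

static bool sketch_is_buddy_busy(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	u32 sum = rq->avg.runnable_avg_sum;
	u32 period = rq->avg.runnable_avg_period;

	sum = min(sum, period);

	/*
	 * The busy threshold is 50% of the period for an rq with no
	 * running task and shrinks as nr_running grows.
	 */
	return sum > (period / (rq->nr_running + 2));
}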

The 3rd step is only used when the aggressive packing mode is enabled. In
this case, the CPUs pack their tasks on their buddy until it becomes full.
Unlike the previous step, we can't keep the same buddy, so we update it
during load balance. During the periodic load balance, the scheduler computes
the activity of the system using the runnable_avg_sum and the cpu_power of
all CPUs, and then it defines the CPUs that will be used to handle the
current activity. The selected CPUs will be their own buddy and will
participate in the default load balancing mechanism in order to share the
tasks in a fair way, whereas the non-selected CPUs will not, and their buddy
will be the last selected CPU.
The behavior can be summarized as: the scheduler defines how many CPUs are
required to handle the current activity, keeps the tasks on these CPUs and
performs normal load balancing (or any evolution of the current load
balancer, like the use of runnable load avg from Alex
https://lkml.org/lkml/2013/4/1/580) on this limited number of CPUs. Like the
other steps, the CPUs are selected to minimize the number of power domains
that must stay on.
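
As an illustration only, the per-CPU activity used in this step can be
sketched as below; it mirrors get_cpu_activity() from patch 09/14. The
selection itself (patch 10/14) then walks the groups in buddy order until the
cumulated cpu_power of the selected CPUs covers the total activity:

static int sketch_cpu_activity(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	u32 sum = min(rq->avg.runnable_avg_sum, rq->avg.runnable_avg_period);

	/* a fully busy CPU reports its whole cpu_power (+1 if overloaded) */
	if (sum == rq->avg.runnable_avg_period)
		return power_of(cpu) + (rq->nr_running > 1 ? 1 : 0);

	/* otherwise, scale cpu_power by the runnable ratio of the rq */
	return (sum * power_of(cpu)) / rq->avg.runnable_avg_period;
}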

Change since V3:

- Take into account comments on previous version.
- Add an aggressive packing mode and a knob to select between the various modes

Change since V2:

- Migrate only a task that wakes up
- Change the light tasks threshold to 20%
- Change the loaded CPU threshold to not pull tasks if the current number of
running tasks is null but the load average is already greater than 50%
- Fix the algorithm for selecting the buddy CPU.

Change since V1:

Patch 2/6
- Change the flag name which was not clear. The new name is
SD_SHARE_POWERDOMAIN.
- Create an architecture dependent function to tune the sched_domain flags
Patch 3/6
- Fix issues in the algorithm that looks for the best buddy CPU
- Use pr_debug instead of pr_info
- Fix for uniprocessor
Patch 4/6
- Remove the use of usage_avg_sum which has not been merged
Patch 5/6
- Change the way the coherency of runnable_avg_sum and runnable_avg_period is
ensured
Patch 6/6
- Use the arch dependent function to set/clear SD_SHARE_POWERDOMAIN for ARM
platform

Previous results for v3:

This series has been tested with hackbench on ARM platform and the results
don't show any performance regression

Hackbench               3.9-rc2    +patches
Mean Time (10 tests):   2.048      2.015
stdev:                  0.047      0.068

Previous results for V2:

This series has been tested with MP3 playback on an ARM platform:
TC2 HMP (dual CA-15 and 3xCA-7 cluster).

The measurements have been done on an Ubuntu image during 60 seconds of
playback and the result has been normalized to 100.

| CA15 | CA7 | total |
-------------------------------------
default | 81 | 97 | 178 |
pack | 13 | 100 | 113 |
-------------------------------------

Previous results for V1:

The patch-set has been tested on ARM platforms: quad CA-9 SMP and TC2 HMP
(dual CA-15 and 3xCA-7 cluster). For the ARM platforms, the results have
demonstrated that it's worth packing small tasks at all topology levels.

The performance tests have been done on both platforms with sysbench. The
results don't show any performance regressions. This is consistent with the
policy, which keeps the default behavior for heavy use cases.

test: sysbench --test=cpu --num-threads=N --max-requests=R run

The results below are the average duration of 3 tests on the quad CA-9.
default is the current scheduler behavior (pack buddy CPU is -1)
pack is the scheduler with the pack mechanism

| default | pack |
-----------------------------------
N=8; R=200 | 3.1999 | 3.1921 |
N=8; R=2000 | 31.4939 | 31.4844 |
N=12; R=200 | 3.2043 | 3.2084 |
N=12; R=2000 | 31.4897 | 31.4831 |
N=16; R=200 | 3.1774 | 3.1824 |
N=16; R=2000 | 31.4899 | 31.4897 |
-----------------------------------

The power consumption tests have been done only on the TC2 platform, which
has accessible power lines, and I have used cyclictest to simulate small
tasks. The tests show some power consumption improvements.

test: cyclictest -t 8 -q -e 1000000 -D 20 & cyclictest -t 8 -q -e 1000000 -D 20

The measurements have been done over 16 seconds and the results have been
normalized to 100

| CA15 | CA7 | total |
-------------------------------------
default | 100 | 40 | 140 |
pack | <1 | 45 | <46 |
-------------------------------------

The A15 cluster is less power efficient than the A7 cluster, but if we assume
that the tasks are well spread over both clusters, we can roughly estimate
that the power consumption on a dual-cluster CA7 system would have been, for
a default kernel:

| CA7 | CA7 | total |
-------------------------------------
default | 40 | 40 | 80 |
-------------------------------------

Vincent Guittot (14):
Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for
load-tracking"
sched: add a new SD_SHARE_POWERDOMAIN flag for sched_domain
sched: pack small tasks
sched: pack the idle load balance
ARM: sched: clear SD_SHARE_POWERDOMAIN
sched: add a knob to choose the packing level
sched: agressively pack at wake/fork/exec
sched: trig ILB on an idle buddy
sched: evaluate the activity level of the system
sched: update the buddy CPU
sched: filter task pull request
sched: create a new field with available capacity
sched: update the cpu_power
sched: force migration on buddy CPU

arch/arm/kernel/topology.c | 9 +
arch/ia64/include/asm/topology.h | 1 +
arch/tile/include/asm/topology.h | 1 +
include/linux/sched.h | 11 +-
include/linux/sched/sysctl.h | 8 +
include/linux/topology.h | 4 +
kernel/sched/core.c | 14 +-
kernel/sched/fair.c | 393 +++++++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 15 +-
kernel/sysctl.c | 13 ++
10 files changed, 423 insertions(+), 46 deletions(-)

--
1.7.9.5


2013-04-25 17:24:54

by Vincent Guittot

Subject: [PATCH 01/14] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"

This reverts commit f4e26b120b9de84cb627bc7361ba43cfdc51341f.

Signed-off-by: Vincent Guittot <[email protected]>
---
include/linux/sched.h | 8 +-------
kernel/sched/core.c | 7 +------
kernel/sched/fair.c | 13 ++-----------
kernel/sched/sched.h | 9 +--------
4 files changed, 5 insertions(+), 32 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d35d2b6..16b0d18 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1159,13 +1159,7 @@ struct sched_entity {
/* rq "owned" by this entity/group: */
struct cfs_rq *my_q;
#endif
-
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/* Per-entity load-tracking */
struct sched_avg avg;
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67d0465..c8db984 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1563,12 +1563,7 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..9c2f726 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1109,8 +1109,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_FAIR_GROUP_SCHED */

-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/*
* We choose a half-life close to 1 scheduling period.
* Note: The tables below are dependent on this value.
@@ -3394,12 +3393,6 @@ unlock:
}

/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
* Called immediately before a task is migrated to a new cpu; task_cpu(p) and
* cfs_rq_of(p) references at time of call are still valid and identify the
* previous cpu. However, the caller only guarantees p->pi_lock is held; no
@@ -3422,7 +3415,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
}
}
-#endif
#endif /* CONFIG_SMP */

static unsigned long
@@ -6114,9 +6106,8 @@ const struct sched_class fair_sched_class = {

#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
.migrate_task_rq = migrate_task_rq_fair,
-#endif
+
.rq_online = rq_online_fair,
.rq_offline = rq_offline_fair,

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..7f36024f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -227,12 +227,6 @@ struct cfs_rq {
#endif

#ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* CFS Load tracking
* Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -242,8 +236,7 @@ struct cfs_rq {
u64 runnable_load_avg, blocked_load_avg;
atomic64_t decay_counter, removed_load;
u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
u32 tg_runnable_contrib;
u64 tg_load_contrib;
--
1.7.9.5

2013-04-25 17:24:59

by Vincent Guittot

Subject: [PATCH 03/14] sched: pack small tasks

During the creation of a sched_domain, we define a pack buddy CPU for each
CPU when one is available. We want to pack at all levels where a group of
CPUs can be power gated independently from the others.
On a system that can't power gate a group of CPUs independently, the flag is
set at all sched_domain levels and the buddy is set to -1. This is the
default behavior.

On a dual-cluster / dual-core system that can power gate each core and
cluster independently, the buddy configuration will be:

| Cluster 0 | Cluster 1 |
| CPU0 | CPU1 | CPU2 | CPU3 |
-----------------------------------
buddy | CPU0 | CPU0 | CPU0 | CPU2 |

If the cores in a cluster can't be power gated independently, the buddy
configuration becomes:

| Cluster 0 | Cluster 1 |
| CPU0 | CPU1 | CPU2 | CPU3 |
-----------------------------------
buddy | CPU0 | CPU1 | CPU0 | CPU0 |

Small tasks tend to slip out of the periodic load balance, so the best place
to migrate them is during their wake up. The decision is O(1) as we only
check against one buddy CPU.

Signed-off-by: Vincent Guittot <[email protected]>
Reviewed-by: Morten Rasmussen <[email protected]>
---
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 132 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 5 ++
3 files changed, 138 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3b9861f..c5ef170 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5664,6 +5664,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
rcu_assign_pointer(rq->sd, sd);
destroy_sched_domains(tmp, cpu);

+ update_packing_domain(cpu);
update_top_cache_domain(cpu);
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c2f726..6adc57c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -160,6 +160,76 @@ void sched_init_granularity(void)
update_sysctl();
}

+
+#ifdef CONFIG_SMP
+/*
+ * Save the id of the optimal CPU that should be used to pack small tasks
+ * The value -1 is used when no buddy has been found
+ */
+DEFINE_PER_CPU(int, sd_pack_buddy);
+
+/*
+ * Look for the best buddy CPU that can be used to pack small tasks
+ * We make the assumption that it isn't worth packing on CPUs that share the
+ * same powerline. We look for the 1st sched_domain without the
+ * SD_SHARE_POWERDOMAIN flag. Then we look for the sched_group with the lowest
+ * power per core based on the assumption that their power efficiency is
+ * better
+ */
+void update_packing_domain(int cpu)
+{
+ struct sched_domain *sd;
+ int id = -1;
+
+ sd = highest_flag_domain(cpu, SD_SHARE_POWERDOMAIN);
+ if (!sd)
+ sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+ else
+ sd = sd->parent;
+
+ while (sd && (sd->flags & SD_LOAD_BALANCE)
+ && !(sd->flags & SD_SHARE_POWERDOMAIN)) {
+ struct sched_group *sg = sd->groups;
+ struct sched_group *pack = sg;
+ struct sched_group *tmp;
+
+ /*
+ * The sched_domain of a CPU points on the local sched_group
+ * and this CPU of this local group is a good candidate
+ */
+ id = cpu;
+
+ /* loop the sched groups to find the best one */
+ for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
+ if (tmp->sgp->power * pack->group_weight >
+ pack->sgp->power * tmp->group_weight)
+ continue;
+
+ if ((tmp->sgp->power * pack->group_weight ==
+ pack->sgp->power * tmp->group_weight)
+ && (cpumask_first(sched_group_cpus(tmp)) >= id))
+ continue;
+
+ /* we have found a better group */
+ pack = tmp;
+
+ /* Take the 1st CPU of the new group */
+ id = cpumask_first(sched_group_cpus(pack));
+ }
+
+ /* Look for another CPU than itself */
+ if (id != cpu)
+ break;
+
+ sd = sd->parent;
+ }
+
+ pr_debug("CPU%d packing on CPU%d\n", cpu, id);
+ per_cpu(sd_pack_buddy, cpu) = id;
+}
+
+#endif /* CONFIG_SMP */
+
#if BITS_PER_LONG == 32
# define WMULT_CONST (~0UL)
#else
@@ -3291,6 +3361,64 @@ done:
return target;
}

+static bool is_buddy_busy(int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+ u32 sum = rq->avg.runnable_avg_sum;
+ u32 period = rq->avg.runnable_avg_period;
+
+ /*
+ * If a CPU accesses the runnable_avg_sum and runnable_avg_period
+ * fields of its buddy CPU while the latter updates it, it can get the
+ * new version of a field and the old version of the other one. This
+ * can generate erroneous decisions. We don't want to use a lock
+ * mechanism for ensuring the coherency because of the overhead in
+ * this critical path.
+ * The runnable_avg_period of a runqueue tends to the max value in
+ * less than 345ms after plugging a CPU, which implies that we could
+ * use the max value instead of reading runnable_avg_period after
+ * 345ms. During the starting phase, we must ensure a minimum of
+ * coherency between the fields. A simple rule is runnable_avg_sum <=
+ * runnable_avg_period.
+ */
+ sum = min(sum, period);
+
+ /*
+ * A busy buddy is a CPU with a high load or a small load with a lot of
+ * running tasks.
+ */
+ return (sum > (period / (rq->nr_running + 2)));
+}
+
+static bool is_light_task(struct task_struct *p)
+{
+ /* A light task runs less than 20% in average */
+ return ((p->se.avg.runnable_avg_sum * 5) <
+ (p->se.avg.runnable_avg_period));
+}
+
+static int check_pack_buddy(int cpu, struct task_struct *p)
+{
+ int buddy = per_cpu(sd_pack_buddy, cpu);
+
+ /* No pack buddy for this CPU */
+ if (buddy == -1)
+ return false;
+
+ /* buddy is not an allowed CPU */
+ if (!cpumask_test_cpu(buddy, tsk_cpus_allowed(p)))
+ return false;
+
+ /*
+ * If the task is a small one and the buddy is not overloaded,
+ * we use buddy cpu
+ */
+ if (!is_light_task(p) || is_buddy_busy(buddy))
+ return false;
+
+ return true;
+}
+
/*
* sched_balance_self: balance the current task (running on cpu) in domains
* that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
@@ -3319,6 +3447,10 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
want_affine = 1;
new_cpu = prev_cpu;
+
+ /* We pack only at wake up and not new task */
+ if (check_pack_buddy(new_cpu, p))
+ return per_cpu(sd_pack_buddy, new_cpu);
}

rcu_read_lock();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7f36024f..96b164d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -872,6 +872,7 @@ extern const struct sched_class idle_sched_class;

extern void trigger_load_balance(struct rq *rq, int cpu);
extern void idle_balance(int this_cpu, struct rq *this_rq);
+extern void update_packing_domain(int cpu);

#else /* CONFIG_SMP */

@@ -879,6 +880,10 @@ static inline void idle_balance(int cpu, struct rq *rq)
{
}

+static inline void update_packing_domain(int cpu)
+{
+}
+
#endif

extern void sysrq_sched_debug_show(void);
--
1.7.9.5

2013-04-25 17:25:11

by Vincent Guittot

Subject: [PATCH 08/14] sched: trig ILB on an idle buddy

If the buddy CPU is not full and currently idle, we trigger an Idle Load
Balance to give it the opportunity to pull more tasks.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 874f330..954adfd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5776,6 +5776,26 @@ end:
clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
}

+static int check_nohz_buddy(int cpu)
+{
+ int buddy = per_cpu(sd_pack_buddy, cpu);
+
+ if (sysctl_sched_packing_mode != SCHED_PACKING_FULL)
+ return false;
+
+ /* No pack buddy for this CPU */
+ if (buddy == -1)
+ return false;
+
+ if (is_buddy_full(buddy))
+ return false;
+
+ if (cpumask_test_cpu(buddy, nohz.idle_cpus_mask))
+ return true;
+
+ return false;
+}
+
/*
* Current heuristic for kicking the idle load balancer in the presence
* of an idle cpu is the system.
@@ -5813,6 +5833,10 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
if (rq->nr_running >= 2)
goto need_kick;

+ /* the buddy is idle and not busy so we can pack */
+ if (check_nohz_buddy(cpu))
+ goto need_kick;
+
rcu_read_lock();
for_each_domain(cpu, sd) {
struct sched_group *sg = sd->groups;
--
1.7.9.5

2013-04-25 17:25:20

by Vincent Guittot

Subject: [PATCH 10/14] sched: update the buddy CPU

Periodically updates the buddy of a CPU according to the current activity of
the system. A CPU is its own buddy if it participates in the packing effort.
Otherwise, it points to a CPU that participates in the packing effort.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 86 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 234ecdd..28f8ea7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -174,11 +174,17 @@ void sched_init_granularity(void)


#ifdef CONFIG_SMP
+static unsigned long power_of(int cpu)
+{
+ return cpu_rq(cpu)->cpu_power;
+}
+
/*
* Save the id of the optimal CPU that should be used to pack small tasks
* The value -1 is used when no buddy has been found
*/
DEFINE_PER_CPU(int, sd_pack_buddy);
+DEFINE_PER_CPU(struct sched_domain *, sd_pack_domain);

/*
* Look for the best buddy CPU that can be used to pack small tasks
@@ -237,6 +243,68 @@ void update_packing_domain(int cpu)
}

pr_debug("CPU%d packing on CPU%d\n", cpu, id);
+ per_cpu(sd_pack_domain, cpu) = sd;
+ per_cpu(sd_pack_buddy, cpu) = id;
+}
+
+void update_packing_buddy(int cpu, int activity)
+{
+ struct sched_domain *sd = per_cpu(sd_pack_domain, cpu);
+ struct sched_group *sg, *pack, *tmp;
+ int id = cpu;
+
+ if (!sd)
+ return;
+
+ /*
+ * The sched_domain of a CPU points on the local sched_group
+ * and this CPU of this local group is a good candidate
+ */
+ pack = sg = sd->groups;
+
+ /* loop the sched groups to find the best one */
+ for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
+ if ((tmp->sgp->power * pack->group_weight) >
+ (pack->sgp->power_available * tmp->group_weight))
+ continue;
+
+ if (((tmp->sgp->power * pack->group_weight) ==
+ (pack->sgp->power * tmp->group_weight))
+ && (cpumask_first(sched_group_cpus(tmp)) >= id))
+ continue;
+
+ /* we have found a better group */
+ pack = tmp;
+
+ /* Take the 1st CPU of the new group */
+ id = cpumask_first(sched_group_cpus(pack));
+ }
+
+ if ((cpu == id) || (activity <= power_of(id))) {
+ per_cpu(sd_pack_buddy, cpu) = id;
+ return;
+ }
+
+ for (tmp = pack; activity > 0; tmp = tmp->next) {
+ if (tmp->sgp->power > activity) {
+ id = cpumask_first(sched_group_cpus(tmp));
+ activity -= power_of(id);
+ if (cpu == id)
+ activity = 0;
+ while ((activity > 0) && (id < nr_cpu_ids)) {
+ id = cpumask_next(id, sched_group_cpus(tmp));
+ activity -= power_of(id);
+ if (cpu == id)
+ activity = 0;
+ }
+ } else if (cpumask_test_cpu(cpu, sched_group_cpus(tmp))) {
+ id = cpu;
+ activity = 0;
+ } else {
+ activity -= tmp->sgp->power;
+ }
+ }
+
per_cpu(sd_pack_buddy, cpu) = id;
}

@@ -3014,11 +3082,6 @@ static unsigned long target_load(int cpu, int type)
return max(rq->cpu_load[type-1], total);
}

-static unsigned long power_of(int cpu)
-{
- return cpu_rq(cpu)->cpu_power;
-}
-
static unsigned long cpu_avg_load_per_task(int cpu)
{
struct rq *rq = cpu_rq(cpu);
@@ -4740,6 +4803,22 @@ static bool update_sd_pick_busiest(struct lb_env *env,
return false;
}

+static void update_plb_buddy(int cpu, int *balance, struct sd_lb_stats *sds,
+ struct sched_domain *sd)
+{
+ int buddy;
+
+ if (sysctl_sched_packing_mode != SCHED_PACKING_FULL)
+ return;
+
+ /* Update my buddy */
+ if (sd == per_cpu(sd_pack_domain, cpu))
+ update_packing_buddy(cpu, sds->total_activity);
+
+ /* Get my new buddy */
+ buddy = per_cpu(sd_pack_buddy, cpu);
+}
+
/**
* update_sd_lb_stats - Update sched_domain's statistics for load balancing.
* @env: The load balancing environment.
@@ -4807,6 +4886,8 @@ static inline void update_sd_lb_stats(struct lb_env *env,

sg = sg->next;
} while (sg != env->sd->groups);
+
+ update_plb_buddy(env->dst_cpu, balance, sds, env->sd);
}

/**
--
1.7.9.5

2013-04-25 17:25:32

by Vincent Guittot

Subject: [PATCH 14/14] sched: force migration on buddy CPU

If a CPU that doesn't participate in the packing effort has at least one
running task, it means that its group is imbalanced and the other CPUs can
pull this task.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 54c1541..f87aed2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4729,6 +4729,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
max_nr_running = nr_running;
if (min_nr_running > nr_running)
min_nr_running = nr_running;
+
+ if (!is_my_buddy(i, i) && nr_running > 0)
+ sgs->group_imb = 1;
}

sgs->group_load += load;
--
1.7.9.5

2013-04-25 17:25:16

by Vincent Guittot

Subject: [PATCH 11/14] sched: filter task pull request

Only CPUs that participate in the packing effort can pull tasks from the
busiest group.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 28f8ea7..6f63fb9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3494,6 +3494,12 @@ static bool is_buddy_full(int cpu)
return (sum * 1024 >= period * 1000);
}

+static bool is_my_buddy(int cpu, int buddy)
+{
+ int my_buddy = per_cpu(sd_pack_buddy, cpu);
+ return (my_buddy == -1) || (buddy == my_buddy);
+}
+
static bool is_light_task(struct task_struct *p)
{
/* A light task runs less than 20% in average */
@@ -4688,8 +4694,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,

/* Bias balancing toward cpus of our domain */
if (local_group) {
- if (idle_cpu(i) && !first_idle_cpu &&
- cpumask_test_cpu(i, sched_group_mask(group))) {
+ if (is_my_buddy(i, i) && idle_cpu(i) && !first_idle_cpu
+ && cpumask_test_cpu(i, sched_group_mask(group))) {
first_idle_cpu = 1;
balance_cpu = i;
}
@@ -4817,6 +4823,10 @@ static void update_plb_buddy(int cpu, int *balance, struct sd_lb_stats *sds,

/* Get my new buddy */
buddy = per_cpu(sd_pack_buddy, cpu);
+
+ /* This CPU doesn't act for agressive packing */
+ if (buddy != cpu)
+ sds->busiest = 0;
}

/**
--
1.7.9.5

2013-04-25 17:25:56

by Vincent Guittot

Subject: [PATCH 13/14] sched: update the cpu_power

The cpu_power is updated for CPUs that don't participate in the packing
effort. We consider that their cpu_power is allocated to idleness, as it
could be allocated by rt, so the cpu_power that remains available for cfs is
set to the minimum value (i.e. 1).

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 35 ++++++++++++++++++++---------------
1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 756b1e3..54c1541 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -224,12 +224,12 @@ void update_packing_domain(int cpu)

/* loop the sched groups to find the best one */
for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
- if (tmp->sgp->power * pack->group_weight >
- pack->sgp->power * tmp->group_weight)
+ if (tmp->sgp->power_available * pack->group_weight >
+ pack->sgp->power_available * tmp->group_weight)
continue;

- if ((tmp->sgp->power * pack->group_weight ==
- pack->sgp->power * tmp->group_weight)
+ if ((tmp->sgp->power_available * pack->group_weight ==
+ pack->sgp->power_available * tmp->group_weight)
&& (cpumask_first(sched_group_cpus(tmp)) >= id))
continue;

@@ -269,12 +269,12 @@ void update_packing_buddy(int cpu, int activity)

/* loop the sched groups to find the best one */
for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
- if ((tmp->sgp->power * pack->group_weight) >
+ if ((tmp->sgp->power_available * pack->group_weight) >
(pack->sgp->power_available * tmp->group_weight))
continue;

- if (((tmp->sgp->power * pack->group_weight) ==
- (pack->sgp->power * tmp->group_weight))
+ if (((tmp->sgp->power_available * pack->group_weight) ==
+ (pack->sgp->power_available * tmp->group_weight))
&& (cpumask_first(sched_group_cpus(tmp)) >= id))
continue;

@@ -285,20 +285,20 @@ void update_packing_buddy(int cpu, int activity)
id = cpumask_first(sched_group_cpus(pack));
}

- if ((cpu == id) || (activity <= power_of(id))) {
+ if ((cpu == id) || (activity <= available_of(id))) {
per_cpu(sd_pack_buddy, cpu) = id;
return;
}

for (tmp = pack; activity > 0; tmp = tmp->next) {
- if (tmp->sgp->power > activity) {
+ if (tmp->sgp->power_available > activity) {
id = cpumask_first(sched_group_cpus(tmp));
- activity -= power_of(id);
+ activity -= available_of(id);
if (cpu == id)
activity = 0;
while ((activity > 0) && (id < nr_cpu_ids)) {
id = cpumask_next(id, sched_group_cpus(tmp));
- activity -= power_of(id);
+ activity -= available_of(id);
if (cpu == id)
activity = 0;
}
@@ -306,7 +306,7 @@ void update_packing_buddy(int cpu, int activity)
id = cpu;
activity = 0;
} else {
- activity -= tmp->sgp->power;
+ activity -= tmp->sgp->power_available;
}
}

@@ -3369,7 +3369,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
}

/* Adjust by relative CPU power of the group */
- avg_load = (avg_load * SCHED_POWER_SCALE) / group->sgp->power;
+ avg_load = (avg_load * SCHED_POWER_SCALE)
+ / group->sgp->power_available;

if (local_group) {
this_load = avg_load;
@@ -3551,10 +3552,10 @@ static int get_cpu_activity(int cpu)

if (sum == period) {
u32 overload = rq->nr_running > 1 ? 1 : 0;
- return power_of(cpu) + overload;
+ return available_of(cpu) + overload;
}

- return (sum * power_of(cpu)) / period;
+ return (sum * available_of(cpu)) / period;
}

/*
@@ -4596,8 +4597,12 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
cpu_rq(cpu)->cpu_available = power;
sdg->sgp->power_available = power;

+ if (!is_my_buddy(cpu, cpu))
+ power = 1;
+
cpu_rq(cpu)->cpu_power = power;
sdg->sgp->power = power;
+
}

void update_group_power(struct sched_domain *sd, int cpu)
--
1.7.9.5

2013-04-25 17:26:24

by Vincent Guittot

Subject: [PATCH 12/14] sched: create a new field with available capacity

The new field power_available reflects the available capacity of a CPU,
unlike cpu_power, which reflects the current capacity.

Signed-off-by: Vincent Guittot <[email protected]>
---
include/linux/sched.h | 2 +-
kernel/sched/fair.c | 18 +++++++++++++++---
kernel/sched/sched.h | 1 +
3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a82cdeb..03a5143 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -824,7 +824,7 @@ struct sched_group_power {
* CPU power of this group, SCHED_LOAD_SCALE being max power for a
* single CPU.
*/
- unsigned int power, power_orig;
+ unsigned int power, power_orig, power_available;
unsigned long next_update;
/*
* Number of busy cpus in this group.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6f63fb9..756b1e3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -179,6 +179,11 @@ static unsigned long power_of(int cpu)
return cpu_rq(cpu)->cpu_power;
}

+static unsigned long available_of(int cpu)
+{
+ return cpu_rq(cpu)->cpu_available;
+}
+
/*
* Save the id of the optimal CPU that should be used to pack small tasks
* The value -1 is used when no buddy has been found
@@ -4588,6 +4593,9 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
if (!power)
power = 1;

+ cpu_rq(cpu)->cpu_available = power;
+ sdg->sgp->power_available = power;
+
cpu_rq(cpu)->cpu_power = power;
sdg->sgp->power = power;
}
@@ -4596,7 +4604,7 @@ void update_group_power(struct sched_domain *sd, int cpu)
{
struct sched_domain *child = sd->child;
struct sched_group *group, *sdg = sd->groups;
- unsigned long power;
+ unsigned long power, available;
unsigned long interval;

interval = msecs_to_jiffies(sd->balance_interval);
@@ -4608,7 +4616,7 @@ void update_group_power(struct sched_domain *sd, int cpu)
return;
}

- power = 0;
+ power = available = 0;

if (child->flags & SD_OVERLAP) {
/*
@@ -4618,6 +4626,8 @@ void update_group_power(struct sched_domain *sd, int cpu)

for_each_cpu(cpu, sched_group_cpus(sdg))
power += power_of(cpu);
+ available += available_of(cpu);
+
} else {
/*
* !SD_OVERLAP domains can assume that child groups
@@ -4627,11 +4637,13 @@ void update_group_power(struct sched_domain *sd, int cpu)
group = child->groups;
do {
power += group->sgp->power;
+ available += group->sgp->power_available;
group = group->next;
} while (group != child->groups);
}

- sdg->sgp->power_orig = sdg->sgp->power = power;
+ sdg->sgp->power_orig = sdg->sgp->power_available = available;
+ sdg->sgp->power = power;
}

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 96b164d..2b6eaf9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -413,6 +413,7 @@ struct rq {
struct sched_domain *sd;

unsigned long cpu_power;
+ unsigned long cpu_available;

unsigned char idle_balance;
/* For active balancing */
--
1.7.9.5

2013-04-25 17:26:46

by Vincent Guittot

Subject: [PATCH 09/14] sched: evaluate the activity level of the system

We evaluate the activity level of CPUs, groups and domains in order to define
how many CPUs are required to handle the current activity.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 954adfd..234ecdd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3465,7 +3465,22 @@ static int check_pack_buddy(int cpu, struct task_struct *p)
return true;

return false;
+}
+
+static int get_cpu_activity(int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+ u32 sum = rq->avg.runnable_avg_sum;
+ u32 period = rq->avg.runnable_avg_period;
+
+ sum = min(sum, period);
+
+ if (sum == period) {
+ u32 overload = rq->nr_running > 1 ? 1 : 0;
+ return power_of(cpu) + overload;
+ }

+ return (sum * power_of(cpu)) / period;
}

/*
@@ -4355,6 +4370,7 @@ struct sd_lb_stats {
struct sched_group *busiest; /* Busiest group in this sd */
struct sched_group *this; /* Local group in this sd */
unsigned long total_load; /* Total load of all groups in sd */
+ unsigned long total_activity; /* Total activity of all groups in sd */
unsigned long total_pwr; /* Total power of all groups in sd */
unsigned long avg_load; /* Average load across all groups in sd */

@@ -4383,6 +4399,7 @@ struct sd_lb_stats {
struct sg_lb_stats {
unsigned long avg_load; /*Avg load across the CPUs of the group */
unsigned long group_load; /* Total load over the CPUs of the group */
+ unsigned long group_activity; /* Total activity of the group */
unsigned long sum_nr_running; /* Nr tasks running in the group */
unsigned long sum_weighted_load; /* Weighted load of group's tasks */
unsigned long group_capacity;
@@ -4629,6 +4646,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
}

sgs->group_load += load;
+ sgs->group_activity += get_cpu_activity(i);
sgs->sum_nr_running += nr_running;
sgs->sum_weighted_load += weighted_cpuload(i);
if (idle_cpu(i))
@@ -4752,6 +4770,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
return;

sds->total_load += sgs.group_load;
+ sds->total_activity += sgs.group_activity;
sds->total_pwr += sg->sgp->power;

/*
--
1.7.9.5

2013-04-25 17:25:08

by Vincent Guittot

Subject: [PATCH 05/14] ARM: sched: clear SD_SHARE_POWERDOMAIN

The ARM platforms take advantage of packing small tasks on a few cores.
This is true even when the cores of a cluster can't be power gated
independently. So we clear SD_SHARE_POWERDOMAIN at the MC and CPU levels.

Signed-off-by: Vincent Guittot <[email protected]>
Reviewed-by: Morten Rasmussen <[email protected]>
---
arch/arm/kernel/topology.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 79282eb..f89a4a2 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -201,6 +201,15 @@ static inline void update_cpu_power(unsigned int cpuid, unsigned int mpidr) {}
*/
struct cputopo_arm cpu_topology[NR_CPUS];

+int arch_sd_local_flags(int level)
+{
+ /* Powergate at threading level doesn't make sense */
+ if (level & SD_SHARE_CPUPOWER)
+ return 1*SD_SHARE_POWERDOMAIN;
+
+ return 0*SD_SHARE_POWERDOMAIN;
+}
+
const struct cpumask *cpu_coregroup_mask(int cpu)
{
return &cpu_topology[cpu].core_sibling;
--
1.7.9.5

2013-04-25 17:25:06

by Vincent Guittot

Subject: [PATCH 06/14] sched: add a knob to choose the packing level

There are 3 packing levels:
- the default one only packs the small tasks when the system is not busy
- the none level doesn't pack anything
- the full level uses as few CPUs as possible, based on the current activity
of the system
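
A usage sketch, assuming the sysctl entry added below is exposed unchanged
under /proc/sys/kernel (enum values: 0 = none, 1 = default, 2 = full):

  echo 2 > /proc/sys/kernel/sched_packing_mode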

Signed-off-by: Vincent Guittot <[email protected]>
---
include/linux/sched/sysctl.h | 8 ++++++++
kernel/sched/fair.c | 12 ++++++++++++
kernel/sysctl.c | 13 +++++++++++++
3 files changed, 33 insertions(+)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..b72a8b8 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -44,6 +44,14 @@ enum sched_tunable_scaling {
};
extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;

+enum sched_tunable_packing {
+ SCHED_PACKING_NONE,
+ SCHED_PACKING_DEFAULT,
+ SCHED_PACKING_FULL,
+};
+
+extern enum sched_tunable_packing __read_mostly sysctl_sched_packing_mode;
+
extern unsigned int sysctl_numa_balancing_scan_delay;
extern unsigned int sysctl_numa_balancing_scan_period_min;
extern unsigned int sysctl_numa_balancing_scan_period_max;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a985c98..98166aa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -113,6 +113,18 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
#endif

+#ifdef CONFIG_SMP
+/*
+ * The packing policy of the scheduler
+ *
+ * Options are:
+ * SCHED_PACKING_NONE - No buddy is used for packing some tasks
+ * SCHED_PACKING_DEFAULT - The small tasks are packed on a not busy CPUs
+ * SCHED_PACKING_FULL - All Tasks are packed in a minimum number of CPUs
+ */
+enum sched_tunable_packing sysctl_sched_packing_mode = SCHED_PACKING_DEFAULT;
+
+#endif
/*
* Increase the granularity value when there are more CPUs,
* because with more CPUs the 'effective latency' as visible
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afc1dc6..ca22f59 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -265,6 +265,8 @@ static int max_wakeup_granularity_ns = NSEC_PER_SEC; /* 1 second */
#ifdef CONFIG_SMP
static int min_sched_tunable_scaling = SCHED_TUNABLESCALING_NONE;
static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
+static int min_sched_packing_mode = SCHED_PACKING_NONE;
+static int max_sched_packing_mode = SCHED_PACKING_FULL-1;
#endif /* CONFIG_SMP */
#endif /* CONFIG_SCHED_DEBUG */

@@ -281,6 +283,17 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+#ifdef CONFIG_SMP
+ {
+ .procname = "sched_packing_mode",
+ .data = &sysctl_sched_packing_mode,
+ .maxlen = sizeof(enum sched_tunable_packing),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ .extra1 = &min_sched_packing_mode,
+ .extra2 = &max_sched_packing_mode,
+ },
+#endif
#ifdef CONFIG_SCHED_DEBUG
{
.procname = "sched_min_granularity_ns",
--
1.7.9.5

2013-04-25 17:27:30

by Vincent Guittot

Subject: [PATCH 07/14] sched: agressively pack at wake/fork/exec

According to the packing policy, the scheduler can pack tasks at different
steps:
-SCHED_PACKING_NONE: we don't pack any task.
-SCHED_PACKING_DEFAULT: we only pack small tasks at wake up, when the system
is not busy.
-SCHED_PACKING_FULL: we pack tasks at wake up until a CPU becomes full. During
a fork or an exec, we assume that the new task is a fully running one and we
look for an idle CPU close to the buddy CPU.

Signed-off-by: Vincent Guittot <[email protected]>
---
kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 42 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98166aa..874f330 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3259,13 +3259,16 @@ static struct sched_group *
find_idlest_group(struct sched_domain *sd, struct task_struct *p,
int this_cpu, int load_idx)
{
- struct sched_group *idlest = NULL, *group = sd->groups;
+ struct sched_group *idlest = NULL, *group = sd->groups, *buddy = NULL;
unsigned long min_load = ULONG_MAX, this_load = 0;
int imbalance = 100 + (sd->imbalance_pct-100)/2;
+ int buddy_cpu = per_cpu(sd_pack_buddy, this_cpu);
+ int get_buddy = ((sysctl_sched_packing_mode == SCHED_PACKING_FULL) &&
+ !(sd->flags & SD_SHARE_POWERDOMAIN) && (buddy_cpu != -1));

do {
unsigned long load, avg_load;
- int local_group;
+ int local_group, buddy_group = 0;
int i;

/* Skip over this group if it has no CPUs allowed */
@@ -3276,6 +3279,11 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
local_group = cpumask_test_cpu(this_cpu,
sched_group_cpus(group));

+ if (get_buddy) {
+ buddy_group = cpumask_test_cpu(buddy_cpu,
+ sched_group_cpus(group));
+ }
+
/* Tally up the load of all CPUs in the group */
avg_load = 0;

@@ -3287,6 +3295,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
load = target_load(i, load_idx);

avg_load += load;
+
+ if ((buddy_group) && idle_cpu(i))
+ buddy = group;
}

/* Adjust by relative CPU power of the group */
@@ -3300,6 +3311,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
}
} while (group = group->next, group != sd->groups);

+ if (buddy)
+ return buddy;
+
if (!idlest || 100*this_load < imbalance*min_load)
return NULL;
return idlest;
@@ -3402,6 +3416,21 @@ static bool is_buddy_busy(int cpu)
return (sum > (period / (rq->nr_running + 2)));
}

+static bool is_buddy_full(int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+ u32 sum = rq->avg.runnable_avg_sum;
+ u32 period = rq->avg.runnable_avg_period;
+
+ sum = min(sum, period);
+
+ /*
+ * A full buddy is a CPU with a sum greater or equal to period
+ * We keep a margin of 2.4%
+ */
+ return (sum * 1024 >= period * 1000);
+}
+
static bool is_light_task(struct task_struct *p)
{
/* A light task runs less than 20% in average */
@@ -3413,6 +3442,9 @@ static int check_pack_buddy(int cpu, struct task_struct *p)
{
int buddy = per_cpu(sd_pack_buddy, cpu);

+ if (sysctl_sched_packing_mode == SCHED_PACKING_NONE)
+ return false;
+
/* No pack buddy for this CPU */
if (buddy == -1)
return false;
@@ -3421,14 +3453,19 @@ static int check_pack_buddy(int cpu, struct task_struct *p)
if (!cpumask_test_cpu(buddy, tsk_cpus_allowed(p)))
return false;

+ /* We agressively pack at wake up */
+ if ((sysctl_sched_packing_mode == SCHED_PACKING_FULL)
+ && !is_buddy_full(buddy))
+ return true;
/*
* If the task is a small one and the buddy is not overloaded,
* we use buddy cpu
*/
- if (!is_light_task(p) || is_buddy_busy(buddy))
- return false;
+ if (is_light_task(p) && !is_buddy_busy(buddy))
+ return true;
+
+ return false;

- return true;
}

/*
--
1.7.9.5

2013-04-25 17:27:48

by Vincent Guittot

Subject: [PATCH 04/14] sched: pack the idle load balance

Look for an idle CPU close to the pack buddy CPU whenever possible.
The goal is to prevent the wake up of a CPU which doesn't share the power
domain of the pack buddy CPU.

Signed-off-by: Vincent Guittot <[email protected]>
Reviewed-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6adc57c..a985c98 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5469,7 +5469,26 @@ static struct {

static inline int find_new_ilb(int call_cpu)
{
+ struct sched_domain *sd;
int ilb = cpumask_first(nohz.idle_cpus_mask);
+ int buddy = per_cpu(sd_pack_buddy, call_cpu);
+
+ /*
+ * If we have a pack buddy CPU, we try to run load balance on a CPU
+ * that is close to the buddy.
+ */
+ if (buddy != -1) {
+ for_each_domain(buddy, sd) {
+ if (sd->flags & SD_SHARE_CPUPOWER)
+ continue;
+
+ ilb = cpumask_first_and(sched_domain_span(sd),
+ nohz.idle_cpus_mask);
+
+ if (ilb < nr_cpu_ids)
+ break;
+ }
+ }

if (ilb < nr_cpu_ids && idle_cpu(ilb))
return ilb;
--
1.7.9.5

2013-04-25 17:28:11

by Vincent Guittot

Subject: [PATCH 02/14] sched: add a new SD_SHARE_POWERDOMAIN flag for sched_domain

The new flag SD_SHARE_POWERDOMAIN is used to reflect whether groups of CPUs
at a sched_domain level can reach a different power state or not. If groups
of cores can be power gated independently, for example, the flag should be
cleared at the CPU level. This information is used to decide if it's worth
packing some tasks in a group of CPUs in order to power gate the other groups
instead of spreading the tasks. The default behavior of the scheduler is to
spread tasks, so the flag is set in all sched_domains.

Signed-off-by: Vincent Guittot <[email protected]>
Reviewed-by: Morten Rasmussen <[email protected]>
---
arch/ia64/include/asm/topology.h | 1 +
arch/tile/include/asm/topology.h | 1 +
include/linux/sched.h | 1 +
include/linux/topology.h | 4 ++++
kernel/sched/core.c | 6 ++++++
5 files changed, 13 insertions(+)

diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index a2496e4..6d0b617 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -65,6 +65,7 @@ void build_cpu_to_node_map(void);
| SD_BALANCE_EXEC \
| SD_BALANCE_FORK \
| SD_WAKE_AFFINE, \
+ | arch_sd_local_flags(0)\
.last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h
index d5e86c9..adc87102 100644
--- a/arch/tile/include/asm/topology.h
+++ b/arch/tile/include/asm/topology.h
@@ -71,6 +71,7 @@ static inline const struct cpumask *cpumask_of_node(int node)
| 0*SD_WAKE_AFFINE \
| 0*SD_SHARE_CPUPOWER \
| 0*SD_SHARE_PKG_RESOURCES \
+ | arch_sd_local_flags(0) \
| 0*SD_SERIALIZE \
, \
.last_balance = jiffies, \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 16b0d18..a82cdeb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -809,6 +809,7 @@ enum cpu_idle_type {
#define SD_BALANCE_WAKE 0x0010 /* Balance on wakeup */
#define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */
#define SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */
+#define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */
#define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */
#define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */
#define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */
diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..3eab293 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -99,6 +99,8 @@ int arch_update_cpu_topology(void);
| 1*SD_WAKE_AFFINE \
| 1*SD_SHARE_CPUPOWER \
| 1*SD_SHARE_PKG_RESOURCES \
+ | arch_sd_local_flags(SD_SHARE_CPUPOWER|\
+ SD_SHARE_PKG_RESOURCES) \
| 0*SD_SERIALIZE \
| 0*SD_PREFER_SIBLING \
| arch_sd_sibling_asym_packing() \
@@ -131,6 +133,7 @@ int arch_update_cpu_topology(void);
| 1*SD_WAKE_AFFINE \
| 0*SD_SHARE_CPUPOWER \
| 1*SD_SHARE_PKG_RESOURCES \
+ | arch_sd_local_flags(SD_SHARE_PKG_RESOURCES)\
| 0*SD_SERIALIZE \
, \
.last_balance = jiffies, \
@@ -161,6 +164,7 @@ int arch_update_cpu_topology(void);
| 1*SD_WAKE_AFFINE \
| 0*SD_SHARE_CPUPOWER \
| 0*SD_SHARE_PKG_RESOURCES \
+ | arch_sd_local_flags(0) \
| 0*SD_SERIALIZE \
| 1*SD_PREFER_SIBLING \
, \
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c8db984..3b9861f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5940,6 +5940,11 @@ int __weak arch_sd_sibling_asym_packing(void)
return 0*SD_ASYM_PACKING;
}

+int __weak arch_sd_local_flags(int level)
+{
+ return 1*SD_SHARE_POWERDOMAIN;
+}
+
/*
* Initializers for schedule domains
* Non-inlined to reduce accumulated stack pressure in build_sched_domains()
@@ -6129,6 +6134,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
| 0*SD_WAKE_AFFINE
| 0*SD_SHARE_CPUPOWER
| 0*SD_SHARE_PKG_RESOURCES
+ | 1*SD_SHARE_POWERDOMAIN
| 1*SD_SERIALIZE
| 0*SD_PREFER_SIBLING
| sd_local_flags(level)
--
1.7.9.5

2013-04-26 10:01:01

by Vincent Guittot

Subject: Re: [PATCH 11/14] sched: filter task pull request

Part of this patch is missing; the fix below is needed:

@@ -3497,7 +3497,9 @@ static bool is_buddy_full(int cpu)
static bool is_my_buddy(int cpu, int buddy)
{
int my_buddy = per_cpu(sd_pack_buddy, cpu);
- return (my_buddy == -1) || (buddy == my_buddy);
+
+ return ((sysctl_sched_packing_mode == SCHED_PACKING_FULL) &&
+ ((my_buddy == -1) || (buddy == my_buddy)));
}

static bool is_light_task(struct task_struct *p)


2013-04-26 12:08:30

by Vincent Guittot

Subject: Re: [RFC PATCH v4 00/14] sched: packing small tasks

Hi,

The patches are available in this git tree:
git://git.linaro.org/people/vingu/kernel.git sched-pack-small-tasks-v4-fixed

Vincent

On 25 April 2013 19:23, Vincent Guittot <[email protected]> wrote:
> Hi,
>
> This patchset takes advantage of the new per-task load tracking that is
> available in the kernel for packing the tasks in as few as possible
> CPU/Cluster/Core. It has got 2 packing modes:
> -The 1st mode packs the small tasks when the system is not too busy. The main
> goal is to reduce the power consumption in the low system load use cases by
> minimizing the number of power domain that are enabled but it also keeps the
> default behavior which is performance oriented.
> -The 2nd mode packs all tasks in as few as possible power domains in order to
> improve the power consumption of the system but at the cost of possible
> performance decrease because of the increase of the rate of ressources sharing
> compared to the default mode.
>
> The packing is done in 3 steps (the last step is only applicable for the
> agressive packing mode):
>
> The 1st step looks for the best place to pack tasks in a system according to
> its topology and it defines a 1st pack buddy CPU for each CPU if there is one
> available. The policy for defining a buddy CPU is that we want to pack at
> levels where a group of CPU can be power gated independently from others. To
> describe this capability, a new flag SD_SHARE_POWERDOMAIN has been introduced,
> that is used to indicate whether the groups of CPUs of a scheduling domain
> share their power state. By default, this flag is set in all sched_domain in
> order to keep unchanged the current behavior of the scheduler and only ARM
> platform clears the SD_SHARE_POWERDOMAIN flag for MC and CPU level.
>
> In a 2nd step, the scheduler checks the load average of a task which wakes up
> as well as the load average of the buddy CPU and it can decide to migrate the
> light tasks on a not busy buddy. This check is done during the wake up because
> small tasks tend to wake up between periodic load balance and asynchronously
> to each other which prevents the default mechanism to catch and migrate them
> efficiently. A light task is defined by a runnable_avg_sum that is less than
> 20% of the runnable_avg_period. In fact, the former condition encloses 2 ones:
> The average CPU load of the task must be less than 20% and the task must have
> been runnable less than 10ms when it woke up last time in order to be
> electable for the packing migration. So, a task than runs 1 ms each 5ms will
> be considered as a small task but a task that runs 50 ms with a period of
> 500ms, will not.
> Then, the business of the buddy CPU depends of the load average for the rq and
> the number of running tasks. A CPU with a load average greater than 50% will
> be considered as busy CPU whatever the number of running tasks is and this
> threshold will be reduced by the number of running tasks in order to not
> increase too much the wake up latency of a task. When the buddy CPU is busy,
> the scheduler falls back to default CFS policy.
>
> The 3rd step is only used when the agressive packing mode is enable. In this
> case, the CPUs pack their tasks in their buddy until they becomes full. Unlike
> the previous step, we can't keep the same buddy so we update it during load
> balance. During the periodic load balance, the scheduler computes the activity
> of the system thanks the runnable_avg_sum and the cpu_power of all CPUs and
> then it defines the CPUs that will be used to handle the current activity. The
> selected CPUs will be their own buddy and will participate to the default
> load balancing mecanism in order to share the tasks in a fair way, whereas the
> not selected CPUs will not, and their buddy will be the last selected CPU.
> The behavior can be summarized as: The scheduler defines how many CPUs are
> required to handle the current activity, keeps the tasks on these CPUS and
> perform normal load balancing (or any evolution of the current load balancer
> like the use of runnable load avg from Alex https://lkml.org/lkml/2013/4/1/580)
> on this limited number of CPUs . Like the other steps, the CPUs are selected to
> minimize the number of power domain that must stay on.
>
> Change since V3:
>
> - Take into account comments on previous version.
> - Add an agressive packing mode and a knob to select between the various mode
>
> Change since V2:
>
> - Migrate only a task that wakes up
> - Change the light tasks threshold to 20%
> - Change the loaded CPU threshold to not pull tasks if the current number of
> running tasks is null but the load average is already greater than 50%
> - Fix the algorithm for selecting the buddy CPU.
>
> Change since V1:
>
> Patch 2/6
> - Change the flag name which was not clear. The new name is
> SD_SHARE_POWERDOMAIN.
> - Create an architecture dependent function to tune the sched_domain flags
> Patch 3/6
> - Fix issues in the algorithm that looks for the best buddy CPU
> - Use pr_debug instead of pr_info
> - Fix for uniprocessor
> Patch 4/6
> - Remove the use of usage_avg_sum which has not been merged
> Patch 5/6
> - Change the way the coherency of runnable_avg_sum and runnable_avg_period is
> ensured
> Patch 6/6
> - Use the arch dependent function to set/clear SD_SHARE_POWERDOMAIN for ARM
> platform
>
> Previous results for v3:
>
> This series has been tested with hackbench on ARM platform and the results
> don't show any performance regression
>
> Hackbench 3.9-rc2 +patches
> Mean Time (10 tests): 2.048 2.015
> stdev : 0.047 0.068
>
> Previous results for V2:
>
> This series has been tested with MP3 play back on ARM platform:
> TC2 HMP (dual CA-15 and 3xCA-7 cluster).
>
> The measurements have been done on an Ubuntu image during 60 seconds of
> playback and the result has been normalized to 100.
>
> | CA15 | CA7 | total |
> -------------------------------------
> default | 81 | 97 | 178 |
> pack | 13 | 100 | 113 |
> -------------------------------------
>
> Previous results for V1:
>
> The patch-set has been tested on ARM platforms: quad CA-9 SMP and TC2 HMP
> (dual CA-15 and 3xCA-7 cluster). For ARM platform, the results have
> demonstrated that it's worth packing small tasks at all topology levels.
>
> The performance tests have been done on both platforms with sysbench. The
> results don't show any performance regressions. These results are in line with
> the policy, which keeps the normal (performance oriented) behavior for heavy
> use cases.
>
> test: sysbench --test=cpu --num-threads=N --max-requests=R run
>
> Results below are the average duration of 3 tests on the quad CA-9.
> default is the current scheduler behavior (pack buddy CPU is -1)
> pack is the scheduler with the pack mechanism
>
> | default | pack |
> -----------------------------------
> N=8; R=200 | 3.1999 | 3.1921 |
> N=8; R=2000 | 31.4939 | 31.4844 |
> N=12; R=200 | 3.2043 | 3.2084 |
> N=12; R=2000 | 31.4897 | 31.4831 |
> N=16; R=200 | 3.1774 | 3.1824 |
> N=16; R=2000 | 31.4899 | 31.4897 |
> -----------------------------------
>
> The power consumption tests have been done only on TC2 platform which has got
> accessible power lines and I have used cyclictest to simulate small tasks. The
> tests show some power consumption improvements.
>
> test: cyclictest -t 8 -q -e 1000000 -D 20 & cyclictest -t 8 -q -e 1000000 -D 20
>
> The measurements have been done during 16 seconds and the result has been
> normalized to 100
>
> | CA15 | CA7 | total |
> -------------------------------------
> default | 100 | 40 | 140 |
> pack | <1 | 45 | <46 |
> -------------------------------------
>
> The A15 cluster is less power efficient than the A7 cluster, but if we assume
> that the tasks are evenly spread over both clusters, we can roughly estimate
> what the power consumption on a dual cluster of CA7 would have been for a
> default kernel:
>
> | CA7 | CA7 | total |
> -------------------------------------
> default | 40 | 40 | 80 |
> -------------------------------------
>
> Vincent Guittot (14):
> Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for
> load-tracking"
> sched: add a new SD_SHARE_POWERDOMAIN flag for sched_domain
> sched: pack small tasks
> sched: pack the idle load balance
> ARM: sched: clear SD_SHARE_POWERDOMAIN
> sched: add a knob to choose the packing level
> sched: agressively pack at wake/fork/exec
> sched: trig ILB on an idle buddy
> sched: evaluate the activity level of the system
> sched: update the buddy CPU
> sched: filter task pull request
> sched: create a new field with available capacity
> sched: update the cpu_power
> sched: force migration on buddy CPU
>
> arch/arm/kernel/topology.c | 9 +
> arch/ia64/include/asm/topology.h | 1 +
> arch/tile/include/asm/topology.h | 1 +
> include/linux/sched.h | 11 +-
> include/linux/sched/sysctl.h | 8 +
> include/linux/topology.h | 4 +
> kernel/sched/core.c | 14 +-
> kernel/sched/fair.c | 393 +++++++++++++++++++++++++++++++++++---
> kernel/sched/sched.h | 15 +-
> kernel/sysctl.c | 13 ++
> 10 files changed, 423 insertions(+), 46 deletions(-)
>
> --
> 1.7.9.5
>

2013-04-26 12:31:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 03/14] sched: pack small tasks

On Thu, Apr 25, 2013 at 07:23:19PM +0200, Vincent Guittot wrote:
> During the creation of sched_domain, we define a pack buddy CPU for each CPU
> when one is available. We want to pack at all levels where a group of CPUs can
> be power gated independently from others.
> On a system that can't power gate a group of CPUs independently, the flag is
> set at all sched_domain level and the buddy is set to -1. This is the default
> behavior.
>
> On a dual clusters / dual cores system which can power gate each core and
> cluster independently, the buddy configuration will be :
>
> | Cluster 0 | Cluster 1 |
> | CPU0 | CPU1 | CPU2 | CPU3 |
> -----------------------------------
> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>
> If the cores in a cluster can't be power gated independently, the buddy
> configuration becomes:
>
> | Cluster 0 | Cluster 1 |
> | CPU0 | CPU1 | CPU2 | CPU3 |
> -----------------------------------
> buddy | CPU0 | CPU1 | CPU0 | CPU0 |
>
> Small tasks tend to slip out of the periodic load balance so the best place
> to choose to migrate them is during their wake up. The decision is in O(1) as
> we only check against one buddy CPU


So I really don't get the point of this buddy stuff, even for light load non
performance impact stuff you want to do.

The moment you judge cpu0 busy you'll bail, even though its perfectly doable
(and desirable afaict) to continue stacking light tasks on cpu1 instead of
waking up cpu2/3.

So what's wrong with keeping a single light-wake target cpu selection and
updating it appropriately?

Also where/how does the nohz balance cpu criteria not match the light-wake
target criteria?

2013-04-26 12:41:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 03/14] sched: pack small tasks

On Thu, Apr 25, 2013 at 07:23:19PM +0200, Vincent Guittot wrote:

> +static bool is_buddy_busy(int cpu)
> +{
> + struct rq *rq = cpu_rq(cpu);
> + u32 sum = rq->avg.runnable_avg_sum;
> + u32 period = rq->avg.runnable_avg_period;
> +
> + /*
> + * If a CPU accesses the runnable_avg_sum and runnable_avg_period
> + * fields of its buddy CPU while the latter updates it, it can get the
> + * new version of a field and the old version of the other one. This
> + * can generate erroneous decisions. We don't want to use a lock
> + * mechanism for ensuring the coherency because of the overhead in
> + * this critical path.
> + * The runnable_avg_period of a runqueue tends to the max value in
> + * less than 345ms after plugging a CPU, which implies that we could
> + * use the max value instead of reading runnable_avg_period after
> + * 345ms. During the starting phase, we must ensure a minimum of
> + * coherency between the fields. A simple rule is runnable_avg_sum <=
> + * runnable_avg_period.
> + */
> + sum = min(sum, period);
> +
> + /*
> + * A busy buddy is a CPU with a high load or a small load with a lot of
> + * running tasks.
> + */
> + return (sum > (period / (rq->nr_running + 2)));
> +}


I'm still not sold on the entire nr_running thing and the comment doesn't say
why you're doing it.

I'm fairly sure there's software out there that wakes a gazillion threads at a
time only for a gazillion-1 to go back to sleep immediately. Patterns like that
completely defeat your heuristic.

Does that matter... I don't know.

What happens if you keep this thing simple and only put a cap on utilization
(say 80%) and drop the entire nr_running magic? Have you seen it make an actual
difference or did it just seem like a good (TM) thing to do?

> +static bool is_light_task(struct task_struct *p)
> +{
> + /* A light task runs less than 20% in average */
> + return ((p->se.avg.runnable_avg_sum * 5) <
> + (p->se.avg.runnable_avg_period));
> +}

There are superfluous () and ' ' in there. Also why 20%.. seemed like a good
number? Do we want a SCHED_DEBUG sysctl for that?

2013-04-26 12:51:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 04/14] sched: pack the idle load balance

On Thu, Apr 25, 2013 at 07:23:20PM +0200, Vincent Guittot wrote:
> Look for an idle CPU close to the pack buddy CPU whenever possible.
> The goal is to prevent the wake up of a CPU which doesn't share the power
> domain of the pack buddy CPU.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> Reviewed-by: Morten Rasmussen <[email protected]>
> ---
> kernel/sched/fair.c | 19 +++++++++++++++++++
> 1 file changed, 19 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6adc57c..a985c98 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5469,7 +5469,26 @@ static struct {
>
> static inline int find_new_ilb(int call_cpu)
> {
> + struct sched_domain *sd;
> int ilb = cpumask_first(nohz.idle_cpus_mask);
> + int buddy = per_cpu(sd_pack_buddy, call_cpu);
> +
> + /*
> + * If we have a pack buddy CPU, we try to run load balance on a CPU
> + * that is close to the buddy.
> + */
> + if (buddy != -1) {
> + for_each_domain(buddy, sd) {
> + if (sd->flags & SD_SHARE_CPUPOWER)
> + continue;
> +
> + ilb = cpumask_first_and(sched_domain_span(sd),
> + nohz.idle_cpus_mask);
> +
> + if (ilb < nr_cpu_ids)
> + break;
> + }
> + }
>
> if (ilb < nr_cpu_ids && idle_cpu(ilb))
> return ilb;

Ha! and here you hope people won't put multiple big-little clusters in a single
machine? :-)


2013-04-26 13:10:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 07/14] sched: agressively pack at wake/fork/exec

On Thu, Apr 25, 2013 at 07:23:23PM +0200, Vincent Guittot wrote:
> According to the packing policy, the scheduler can pack tasks at different
> step:
> -SCHED_PACKING_NONE level: we don't pack any task.
> -SCHED_PACKING_DEFAULT: we only pack small tasks at wake up when system is not
> busy.
> -SCHED_PACKING_FULL: we pack tasks at wake up until a CPU becomes full. During
> a fork or a exec, we assume that the new task is a full running one and we
> look for an idle CPU close to the buddy CPU.

This changelog is very short on explaining how it will go about achieving these
goals.

> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 42 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 98166aa..874f330 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3259,13 +3259,16 @@ static struct sched_group *
> find_idlest_group(struct sched_domain *sd, struct task_struct *p,


So for packing into power domains, wouldn't you typically pick the busiest non-
full domain to fill from other non-full?

Picking the idlest non-full seems like it would generate a ping-pong or not
actually pack anything.

2013-04-26 13:16:36

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 03/14] sched: pack small tasks

On 26 April 2013 14:30, Peter Zijlstra <[email protected]> wrote:
> On Thu, Apr 25, 2013 at 07:23:19PM +0200, Vincent Guittot wrote:
>> During the creation of sched_domain, we define a pack buddy CPU for each CPU
>> when one is available. We want to pack at all levels where a group of CPUs can
>> be power gated independently from others.
>> On a system that can't power gate a group of CPUs independently, the flag is
>> set at all sched_domain level and the buddy is set to -1. This is the default
>> behavior.
>>
>> On a dual clusters / dual cores system which can power gate each core and
>> cluster independently, the buddy configuration will be :
>>
>> | Cluster 0 | Cluster 1 |
>> | CPU0 | CPU1 | CPU2 | CPU3 |
>> -----------------------------------
>> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>>
>> If the cores in a cluster can't be power gated independently, the buddy
>> configuration becomes:
>>
>> | Cluster 0 | Cluster 1 |
>> | CPU0 | CPU1 | CPU2 | CPU3 |
>> -----------------------------------
>> buddy | CPU0 | CPU1 | CPU0 | CPU0 |
>>
>> Small tasks tend to slip out of the periodic load balance so the best place
>> to choose to migrate them is during their wake up. The decision is in O(1) as
>> we only check against one buddy CPU
>
>
> So I really don't get the point of this buddy stuff, even for light load non
> performance impact stuff you want to do.
>
> The moment you judge cpu0 busy you'll bail, even though its perfectly doable
> (and desirable afaict) to continue stacking light tasks on cpu1 instead of
> waking up cpu2/3.
>
> So what's wrong with keeping a single light-wake target cpu selection and
> updating it appropriately?

I have tried to follow the same kind of task migration as during load
balance: one CPU in a power domain group migrates tasks with the other
groups.

>
> Also where/how does the nohz balance cpu criteria not match the light-wake
> target criteria?

The nohz balance cpu is an idle cpu, but that doesn't mean it's the
target cpu, which will sometimes be busy with the light tasks.

2013-04-26 13:17:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 08/14] sched: trig ILB on an idle buddy

On Thu, Apr 25, 2013 at 07:23:24PM +0200, Vincent Guittot wrote:
> If the buddy CPU is not full and currently idle, we trigger an Idle Load
> Balance to give it the opportunity to pull more tasks.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 24 ++++++++++++++++++++++++
> 1 file changed, 24 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 874f330..954adfd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5776,6 +5776,26 @@ end:
> clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
> }
>
> +static int check_nohz_buddy(int cpu)
> +{
> + int buddy = per_cpu(sd_pack_buddy, cpu);
> +
> + if (sysctl_sched_packing_mode != SCHED_PACKING_FULL)
> + return false;
> +
> + /* No pack buddy for this CPU */
> + if (buddy == -1)
> + return false;
> +
> + if (is_buddy_full(buddy))
> + return false;
> +
> + if (cpumask_test_cpu(buddy, nohz.idle_cpus_mask))
> + return true;
> +
> + return false;

return cpumask_test_cpu(..); ?

> +}
> +
> /*
> * Current heuristic for kicking the idle load balancer in the presence
> * of an idle cpu is the system.
> @@ -5813,6 +5833,10 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
> if (rq->nr_running >= 2)
> goto need_kick;
>
> + /* the buddy is idle and not busy so we can pack */
> + if (check_nohz_buddy(cpu))
> + goto need_kick;
> +
> rcu_read_lock();
> for_each_domain(cpu, sd) {
> struct sched_group *sg = sd->groups;

Bah.. this code is so confusing.. nohz_balance_kick(cpu) won't actually kick
@cpu.

Again.. suppose we're a big cpu (2) and our little buddy cpu (0) is busy, we
miss an opportunity to kick the other little cpu into gear (1) and maybe take
the big core out.

2013-04-26 13:38:22

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 03/14] sched: pack small tasks

On 26 April 2013 14:38, Peter Zijlstra <[email protected]> wrote:
> On Thu, Apr 25, 2013 at 07:23:19PM +0200, Vincent Guittot wrote:
>
>> +static bool is_buddy_busy(int cpu)
>> +{
>> + struct rq *rq = cpu_rq(cpu);
>> + u32 sum = rq->avg.runnable_avg_sum;
>> + u32 period = rq->avg.runnable_avg_period;
>> +
>> + /*
>> + * If a CPU accesses the runnable_avg_sum and runnable_avg_period
>> + * fields of its buddy CPU while the latter updates it, it can get the
>> + * new version of a field and the old version of the other one. This
>> + * can generate erroneous decisions. We don't want to use a lock
>> + * mechanism for ensuring the coherency because of the overhead in
>> + * this critical path.
>> + * The runnable_avg_period of a runqueue tends to the max value in
>> + * less than 345ms after plugging a CPU, which implies that we could
>> + * use the max value instead of reading runnable_avg_period after
>> + * 345ms. During the starting phase, we must ensure a minimum of
>> + * coherency between the fields. A simple rule is runnable_avg_sum <=
>> + * runnable_avg_period.
>> + */
>> + sum = min(sum, period);
>> +
>> + /*
>> + * A busy buddy is a CPU with a high load or a small load with a lot of
>> + * running tasks.
>> + */
>> + return (sum > (period / (rq->nr_running + 2)));
>> +}
>
>
> I'm still not sold on the entire nr_running thing and the comment doesn't say
> why you're doing it.

The main goal is really to minimize the scheduling latency of tasks.
If the cpu already has dozens of tasks in its runqueue, the scheduling
latency can become quite large. In this case it's worth using another
core.
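
To illustrate the effect of the nr_running term (illustrative helper
only, it is not part of the patches): the load threshold above which
is_buddy_busy() reports the buddy as busy shrinks as tasks pile up on
it.

/*
 * Illustrative only, not part of the patch set: the load threshold
 * (as a percentage of runnable_avg_period) above which is_buddy_busy()
 * considers the buddy CPU busy, for a given nr_running.
 */
static unsigned int busy_threshold_pct(unsigned int nr_running)
{
        /* 0 tasks -> 50%, 1 -> 33%, 2 -> 25%, 8 -> 10% ... */
        return 100 / (nr_running + 2);
}

So with an empty runqueue the buddy is considered busy above 50% load,
but with 8 tasks already queued it is considered busy above only 10%,
which keeps the wake up latency of the packed tasks bounded.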

>
> I'm fairly sure there's software out there that wakes a gazillion threads at a
> time only for a gazillion-1 to go back to sleep immediately. Patterns like that
> completely defeat your heuristic.
>
> Does that matter... I don't know.
>
> What happens if you keep this thing simple and only put a cap on utilization
> (say 80%) and drop the entire nr_running magic? Have you seen it make an actual
> difference or did it just seem like a good (TM) thing to do?

It was an efficient way.

>
>> +static bool is_light_task(struct task_struct *p)
>> +{
>> + /* A light task runs less than 20% in average */
>> + return ((p->se.avg.runnable_avg_sum * 5) <
>> + (p->se.avg.runnable_avg_period));
>> +}
>
> There superfluous () and ' ' in there. Also why 20%.. seemed like a good
> number? Do we want a SCHED_DEBUG sysctl for that?

I have chosen 20% because it ensures that the packing mechanism
will only apply to tasks that match the 2 following conditions:
- less than 10 ms of run time during the last single run
- and, for those tasks, less than 20% of the time runnable on average

These tasks are nearly never caught by the periodic load balance and
they are so small that the overall scheduling latency is nearly not
impacted while the cpu is not too busy.
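
Written out (purely illustrative, not a proposed change), the quoted
helper boils down to a sum/period ratio below 20%:

/*
 * Illustrative restatement of the quoted 20% check: a task is "light"
 * when its tracked runnable time is below one fifth of its tracked
 * period.
 */
static bool sketch_is_light(u32 runnable_avg_sum, u32 runnable_avg_period)
{
        return (runnable_avg_sum * 5) < runnable_avg_period;
}

Evaluated at wake up, the "1 ms every 5 ms" task of the cover letter
stays below that bound, whereas the "50 ms every 500 ms" one does not,
because a single 50 ms run already leaves a decayed sum above 20% of
the period.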

2013-04-26 13:47:37

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 04/14] sched: pack the idle load balance

On 26 April 2013 14:49, Peter Zijlstra <[email protected]> wrote:
> On Thu, Apr 25, 2013 at 07:23:20PM +0200, Vincent Guittot wrote:
>> Look for an idle CPU close to the pack buddy CPU whenever possible.
>> The goal is to prevent the wake up of a CPU which doesn't share the power
>> domain of the pack buddy CPU.
>>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> Reviewed-by: Morten Rasmussen <[email protected]>
>> ---
>> kernel/sched/fair.c | 19 +++++++++++++++++++
>> 1 file changed, 19 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 6adc57c..a985c98 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5469,7 +5469,26 @@ static struct {
>>
>> static inline int find_new_ilb(int call_cpu)
>> {
>> + struct sched_domain *sd;
>> int ilb = cpumask_first(nohz.idle_cpus_mask);
>> + int buddy = per_cpu(sd_pack_buddy, call_cpu);
>> +
>> + /*
>> + * If we have a pack buddy CPU, we try to run load balance on a CPU
>> + * that is close to the buddy.
>> + */
>> + if (buddy != -1) {
>> + for_each_domain(buddy, sd) {
>> + if (sd->flags & SD_SHARE_CPUPOWER)
>> + continue;
>> +
>> + ilb = cpumask_first_and(sched_domain_span(sd),
>> + nohz.idle_cpus_mask);
>> +
>> + if (ilb < nr_cpu_ids)
>> + break;
>> + }
>> + }
>>
>> if (ilb < nr_cpu_ids && idle_cpu(ilb))
>> return ilb;
>
> Ha! and here you hope people won't put multiple big-little clusters in a single
> machine? :-)

yes, we will probably face this situation sooner or later, but the
other little clusters will probably not be any closer, from a power
domain point of view, than the local big cluster.
That's why I look from the smallest sched_domain level up to the largest one.

>
>
>

2013-04-26 14:23:15

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 07/14] sched: agressively pack at wake/fork/exec

On 26 April 2013 15:08, Peter Zijlstra <[email protected]> wrote:
> On Thu, Apr 25, 2013 at 07:23:23PM +0200, Vincent Guittot wrote:
>> According to the packing policy, the scheduler can pack tasks at different
>> step:
>> -SCHED_PACKING_NONE level: we don't pack any task.
>> -SCHED_PACKING_DEFAULT: we only pack small tasks at wake up when system is not
>> busy.
>> -SCHED_PACKING_FULL: we pack tasks at wake up until a CPU becomes full. During
>> a fork or a exec, we assume that the new task is a full running one and we
>> look for an idle CPU close to the buddy CPU.
>
> This changelog is very short on explaining how it will go about achieving these
> goals.

I could move some explanation of the cover letter inside the commit:
In this case, the CPUs pack their tasks onto their buddy until it
becomes full. Unlike the previous step, we can't keep the same buddy,
so we update it during load balance. During the periodic load balance,
the scheduler computes the activity of the system thanks to the
runnable_avg_sum and the cpu_power of all CPUs and then it defines the
CPUs that will be used to handle the current activity. The selected
CPUs will be their own buddy and will participate in the default load
balancing mechanism in order to share the tasks in a fair way, whereas
the not selected CPUs will not, and their buddy will be the last
selected CPU.
The behavior can be summarized as: the scheduler defines how many CPUs
are required to handle the current activity, keeps the tasks on these
CPUs and performs normal load balancing.
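
For reference, here is a rough sketch of how such an activity figure
can be derived from the existing load tracking signals (this is not
the code of patch 09/14, which is not quoted in this thread; the names
are illustrative):

/*
 * Illustrative sketch only: estimate the activity of a group of CPUs
 * from the per-rq load tracking, scaled by cpu_power so that it can be
 * compared against the group capacity when deciding how many CPUs are
 * needed.
 */
static unsigned long sketch_group_activity(struct sched_group *sg)
{
        unsigned long activity = 0;
        int i;

        for_each_cpu(i, sched_group_cpus(sg)) {
                struct rq *rq = cpu_rq(i);
                u32 sum = rq->avg.runnable_avg_sum;
                u32 period = rq->avg.runnable_avg_period;

                /* same coherency rule as in is_buddy_busy() */
                sum = min(sum, period);

                activity += div_u64((u64)sum * power_of(i), period + 1);
        }
        return activity;
}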

>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> ---
>> kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++++++++++-----
>> 1 file changed, 42 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 98166aa..874f330 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3259,13 +3259,16 @@ static struct sched_group *
>> find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>
>

A task that wakes up will be caught by the function check_pack_buddy
in order to stay on the CPUs that participate in the packing effort.
We use find_idlest_group only for fork/exec tasks, which are
considered as full running tasks, so we look for the idlest CPU close
to the buddy.
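
For completeness, the wake up side decision looks roughly like this
(assumed sketch built from the helpers quoted earlier in the thread;
the real check_pack_buddy() may differ in detail):

/*
 * Assumed sketch: should the task waking up on @cpu be packed onto the
 * buddy CPU instead of going through the default CFS selection?
 */
static bool sketch_check_pack_buddy(int cpu, struct task_struct *p)
{
        int buddy = per_cpu(sd_pack_buddy, cpu);

        /* No pack buddy has been defined for this CPU */
        if (buddy == -1)
                return false;

        /* The buddy is already too loaded to welcome another task */
        if (is_buddy_busy(buddy))
                return false;

        /* Only small tasks are worth migrating at wake up */
        if (!is_light_task(p))
                return false;

        return true;
}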

> So for packing into power domains, wouldn't you typically pick the busiest non-
> full domain to fill from other non-full?
>
> Picking the idlest non-full seems like it would generate a ping-pong or not
> actually pack anything.

2013-04-26 14:52:54

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 08/14] sched: trig ILB on an idle buddy

On 26 April 2013 15:15, Peter Zijlstra <[email protected]> wrote:
> On Thu, Apr 25, 2013 at 07:23:24PM +0200, Vincent Guittot wrote:
>> If the buddy CPU is not full and currently idle, we trigger an Idle Load
>> Balance to give it the opportunity to pull more tasks.
>>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> ---
>> kernel/sched/fair.c | 24 ++++++++++++++++++++++++
>> 1 file changed, 24 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 874f330..954adfd 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5776,6 +5776,26 @@ end:
>> clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
>> }
>>
>> +static int check_nohz_buddy(int cpu)
>> +{
>> + int buddy = per_cpu(sd_pack_buddy, cpu);
>> +
>> + if (sysctl_sched_packing_mode != SCHED_PACKING_FULL)
>> + return false;
>> +
>> + /* No pack buddy for this CPU */
>> + if (buddy == -1)
>> + return false;
>> +
>> + if (is_buddy_full(buddy))
>> + return false;
>> +
>> + if (cpumask_test_cpu(buddy, nohz.idle_cpus_mask))
>> + return true;
>> +
>> + return false;
>
> return cpumask_test_cpu(..); ?

right

>
>> +}
>> +
>> /*
>> * Current heuristic for kicking the idle load balancer in the presence
>> * of an idle cpu is the system.
>> @@ -5813,6 +5833,10 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
>> if (rq->nr_running >= 2)
>> goto need_kick;
>>
>> + /* the buddy is idle and not busy so we can pack */
>> + if (check_nohz_buddy(cpu))
>> + goto need_kick;
>> +
>> rcu_read_lock();
>> for_each_domain(cpu, sd) {
>> struct sched_group *sg = sd->groups;
>
> Bah.. this code is so confusing.. nohz_balance_kick(cpu) won't actually kick
> @cpu.
>
> Again.. suppose we're a big cpu (2) and our little buddy cpu (0) is busy, we
> miss an opportunity to kick the other little cpu into gear (1) and maybe take
> the big core out.

You're right that we can miss an opportunity, but it means that the cpus
allocated to the packing effort are full, so a new buddy cpu is going to
be allocated and will pull the task a bit later.


>
>

2013-04-26 15:00:49

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/14] sched: packing small tasks

On 4/25/2013 10:23 AM, Vincent Guittot wrote:
> Hi,
>
> This patchset takes advantage of the new per-task load tracking that is
> available in the kernel for packing the tasks in as few as possible
> CPU/Cluster/Core. It has got 2 packing modes:
> -The 1st mode packs the small tasks when the system is not too busy. The main
> goal is to reduce the power consumption in the low system load use cases by
> minimizing the number of power domain that are enabled but it also keeps the
> default behavior which is performance oriented.
> -The 2nd mode packs all tasks in as few as possible power domains in order to
> improve the power consumption of the system but at the cost of possible
> performance decrease because of the increase of the rate of ressources sharing
> compared to the default mode.


so I got to ask the hard question; what percentage of system level (not just cpu level)
power consumption gain can you measure (pick your favorite workload)...

on x86 (even on the low power stuff) I expect this to be very far into the noise
(since we have per core power gates, and power transitions are pretty fast)

you have some numbers in the back of your mail, but it's hard for me to get a conclusion out of
that (they either measure only cpu power, or are just vague in general)

2013-04-26 15:40:34

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/14] sched: packing small tasks

On 26 April 2013 17:00, Arjan van de Ven <[email protected]> wrote:
> On 4/25/2013 10:23 AM, Vincent Guittot wrote:
>>
>> Hi,
>>
>> This patchset takes advantage of the new per-task load tracking that is
>> available in the kernel for packing the tasks in as few as possible
>> CPU/Cluster/Core. It has got 2 packing modes:
>> -The 1st mode packs the small tasks when the system is not too busy. The
>> main
>> goal is to reduce the power consumption in the low system load use cases
>> by
>> minimizing the number of power domain that are enabled but it also keeps
>> the
>> default behavior which is performance oriented.
>> -The 2nd mode packs all tasks in as few as possible power domains in order
>> to
>> improve the power consumption of the system but at the cost of possible
>> performance decrease because of the increase of the rate of ressources
>> sharing
>> compared to the default mode.
>
>
>
> so I got to ask the hard question; what percentage of system level (not just
> cpu level)
> power consumption gain can you measure (pick your favorite workload)...
>

I don't have system level figures for my patches but only for the cpu
subsystem. If we use the MP3 results in the back of my mail, they show
an improvement of 37 % (113/178) for the CPU subsystem of the
platform. If we assume that the CPU subsystem contributes 25% of an
embedded system power consumption (this can vary across platforms
depending on the use of HW accelerators but it should be an almost fair
percentage), the patch can impact the power consumption by up to 9%
(37% of 25%).

> on x86 (even on the low power stuff) I expect this to be very far into the
> noise
> (since we have per core power gates, and power transitions are pretty fast)
>
> you have some numbers in the back of your mail, but it's hard for me to get
> a conclusion out of
> that (they either measure only cpu power, or are just vague in general)
>
>

2013-04-26 15:46:31

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/14] sched: packing small tasks

>>
>>
>> so I got to ask the hard question; what percentage of system level (not just
>> cpu level)
>> power consumption gain can you measure (pick your favorite workload)...
>>
>
> I haven't system level figures for my patches but only for the cpu
> subsystem. If we use the MP3 results in the back of my mail, they show
> an improvement of 37 % (113/178) for the CPU subsystem of the
> platform. If we assume that the CPU subsystem contributes 25% of an
> embedded system power consumption (this can vary across platform
> depending of the use of HW accelerator but it should be a almost fair
> percentage), the patch can impact the power consumption on up to 9%.
>

sadly the math tends to not work quite that easily;
memory takes significantly more power when the system is not idle than when it is idle for example. [*]
so while reducing cpu power by making it run a bit longer (at lower frequency or
slower core or whatever) is a pure win if you only look at the cpu, it may
(or may not) be a loss when looking at a whole system level.

I've learned the hard way that you cannot just look at the cpu numbers; you must look
at the whole-system power when playing with such tradeoffs.

That does not mean that your patch is not useful; it very well can be, but
without having looked at whole-system power that's a very dangerous conclusion to make.
So.. if you get a chance, I'd love to see data on a whole-system level... even for just one workload
and one system
(playing mp3 sounds like a quite reasonable workload for such things indeed)


[*] I assume that on your ARM systems, memory goes into self refresh during system idle just as it does on x86

2013-04-26 15:56:35

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/14] sched: packing small tasks

On 26 April 2013 17:46, Arjan van de Ven <[email protected]> wrote:
>>>
>>>
>>> so I got to ask the hard question; what percentage of system level (not
>>> just
>>> cpu level)
>>> power consumption gain can you measure (pick your favorite workload)...
>>>
>>
>> I haven't system level figures for my patches but only for the cpu
>> subsystem. If we use the MP3 results in the back of my mail, they show
>> an improvement of 37 % (113/178) for the CPU subsystem of the
>> platform. If we assume that the CPU subsystem contributes 25% of an
>> embedded system power consumption (this can vary across platform
>> depending of the use of HW accelerator but it should be a almost fair
>> percentage), the patch can impact the power consumption on up to 9%.
>>
>
> sadly the math tends to not work quite that easy;
> memory takes significantly more power when the system is not idle than when
> it is idle for example. [*]
> so while reducing cpu power by making it run a bit longer (at lower
> frequency or
> slower core or whatever) is a pure win if you only look at the cpu, but it
> may
> (or may not) be a loss when looking at a whole system level.

I agree. That's why the default packing mode of the patches tries
not to increase (too much) the scheduling latency of tasks but only
forces tasks to use the same cpu whenever possible in order to
minimize the number of CPUs/clusters that are used.

>
> I've learned the hard way that you cannot just look at the cpu numbers; you
> must look
> at the whole-system power when playing with such tradeoffs.
>
> That does not mean that your patch is not useful; it very well can be, but
> without having looked at whole-system power that's a very dangerous
> conclusion to make.
> So.. if you get a chance, I'd love to see data on a whole-system level...
> even for just one workload
> and one system
> (playing mp3 sounds like a quite reasonable workload for such things indeed)

I will try to get such figures

Vincent
>
>
> [*] I assume that on your ARM systems, memory goes into self refresh during
> system idle just as it does on x86
>

2013-04-28 08:19:59

by Francesco Lavra

[permalink] [raw]
Subject: Re: [PATCH 10/14] sched: update the buddy CPU

Hi,

On 04/25/2013 07:23 PM, Vincent Guittot wrote:
> Periodically updates the buddy of a CPU according to the current activity of
> the system. A CPU is its own buddy if it participates in the packing effort.
> Otherwise, it points to a CPU that participates in the packing effort.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
> kernel/sched/fair.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 86 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 234ecdd..28f8ea7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -174,11 +174,17 @@ void sched_init_granularity(void)
>
>
> #ifdef CONFIG_SMP
> +static unsigned long power_of(int cpu)
> +{
> + return cpu_rq(cpu)->cpu_power;
> +}
> +
> /*
> * Save the id of the optimal CPU that should be used to pack small tasks
> * The value -1 is used when no buddy has been found
> */
> DEFINE_PER_CPU(int, sd_pack_buddy);
> +DEFINE_PER_CPU(struct sched_domain *, sd_pack_domain);
>
> /*
> * Look for the best buddy CPU that can be used to pack small tasks
> @@ -237,6 +243,68 @@ void update_packing_domain(int cpu)
> }
>
> pr_debug("CPU%d packing on CPU%d\n", cpu, id);
> + per_cpu(sd_pack_domain, cpu) = sd;
> + per_cpu(sd_pack_buddy, cpu) = id;
> +}
> +
> +void update_packing_buddy(int cpu, int activity)
> +{
> + struct sched_domain *sd = per_cpu(sd_pack_domain, cpu);
> + struct sched_group *sg, *pack, *tmp;
> + int id = cpu;
> +
> + if (!sd)
> + return;
> +
> + /*
> + * The sched_domain of a CPU points on the local sched_group
> + * and this CPU of this local group is a good candidate
> + */
> + pack = sg = sd->groups;
> +
> + /* loop the sched groups to find the best one */
> + for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
> + if ((tmp->sgp->power * pack->group_weight) >
> + (pack->sgp->power_available * tmp->group_weight))

The power_available struct member is defined in a subsequent patch
(12/14), so this patch series would break git bisect.

--
Francesco

2013-04-29 07:32:34

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 10/14] sched: update the buddy CPU

On 28 April 2013 10:20, Francesco Lavra <[email protected]> wrote:
> Hi,
>
> On 04/25/2013 07:23 PM, Vincent Guittot wrote:
>> Periodically updates the buddy of a CPU according to the current activity of
>> the system. A CPU is its own buddy if it participates in the packing effort.
>> Otherwise, it points to a CPU that participates in the packing effort.
>>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> ---
>> kernel/sched/fair.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++++---
>> 1 file changed, 86 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 234ecdd..28f8ea7 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -174,11 +174,17 @@ void sched_init_granularity(void)
>>
>>
>> #ifdef CONFIG_SMP
>> +static unsigned long power_of(int cpu)
>> +{
>> + return cpu_rq(cpu)->cpu_power;
>> +}
>> +
>> /*
>> * Save the id of the optimal CPU that should be used to pack small tasks
>> * The value -1 is used when no buddy has been found
>> */
>> DEFINE_PER_CPU(int, sd_pack_buddy);
>> +DEFINE_PER_CPU(struct sched_domain *, sd_pack_domain);
>>
>> /*
>> * Look for the best buddy CPU that can be used to pack small tasks
>> @@ -237,6 +243,68 @@ void update_packing_domain(int cpu)
>> }
>>
>> pr_debug("CPU%d packing on CPU%d\n", cpu, id);
>> + per_cpu(sd_pack_domain, cpu) = sd;
>> + per_cpu(sd_pack_buddy, cpu) = id;
>> +}
>> +
>> +void update_packing_buddy(int cpu, int activity)
>> +{
>> + struct sched_domain *sd = per_cpu(sd_pack_domain, cpu);
>> + struct sched_group *sg, *pack, *tmp;
>> + int id = cpu;
>> +
>> + if (!sd)
>> + return;
>> +
>> + /*
>> + * The sched_domain of a CPU points on the local sched_group
>> + * and this CPU of this local group is a good candidate
>> + */
>> + pack = sg = sd->groups;
>> +
>> + /* loop the sched groups to find the best one */
>> + for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
>> + if ((tmp->sgp->power * pack->group_weight) >
>> + (pack->sgp->power_available * tmp->group_weight))
>
> The power_available struct member is defined in a subsequent patch
> (12/14), so this patch series would break git bisect.

Hi Francesco,

yes, power_available should have been added in patch 13/14.
I will correct that

Thanks
Vincent

>
> --
> Francesco

2013-05-02 09:12:36

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/14] sched: packing small tasks

On 26 April 2013 17:46, Arjan van de Ven <[email protected]> wrote:
>>>
>>>
>>> so I got to ask the hard question; what percentage of system level (not
>>> just
>>> cpu level)
>>> power consumption gain can you measure (pick your favorite workload)...
>>>
>>
>> I haven't system level figures for my patches but only for the cpu
>> subsystem. If we use the MP3 results in the back of my mail, they show
>> an improvement of 37 % (113/178) for the CPU subsystem of the
>> platform. If we assume that the CPU subsystem contributes 25% of an
>> embedded system power consumption (this can vary across platform
>> depending of the use of HW accelerator but it should be a almost fair
>> percentage), the patch can impact the power consumption on up to 9%.
>>
>
> sadly the math tends to not work quite that easy;
> memory takes significantly more power when the system is not idle than when
> it is idle for example. [*]
> so while reducing cpu power by making it run a bit longer (at lower
> frequency or
> slower core or whatever) is a pure win if you only look at the cpu, but it
> may
> (or may not) be a loss when looking at a whole system level.
>
> I've learned the hard way that you cannot just look at the cpu numbers; you
> must look
> at the whole-system power when playing with such tradeoffs.
>
> That does not mean that your patch is not useful; it very well can be, but
> without having looked at whole-system power that's a very dangerous
> conclusion to make.
> So.. if you get a chance, I'd love to see data on a whole-system level...
> even for just one workload
> and one system
> (playing mp3 sounds like a quite reasonable workload for such things indeed)

Hi Arjan,

I don't have full-system power consumption figures but I have gathered
some additional figures for the mp3 use case.

The power consumption percentages stay similar:
| CA15 | CA7 | total |
-------------------------------------
default | 81% | 95% | 178% |
pack | 10% | 100% | 110% |
aggressive | 3% | 100% | 103% |
-------------------------------------

I have also gathered the time when all CPUs are in low power state so
the entire system can be put in low power mode (DDR in self refresh)
| Time | nb / sec | average |
--------------------------------------------
default | 85.5% | 296 | 2.9ms |
pack | 82.5% | 253 | 3.2ms |
aggressive | 81.8% | 236 | 3.4ms |
--------------------------------------------

We can see a decrease in the time spent in low power state, but the
number of transitions to low power state also decreases, which results
in an increase of the average duration of each low power period (the
"average" column is the residency divided by the number of transitions,
e.g. 85.5% of 1s / 296 ~= 2.9ms).

Regards,
Vincent

>
>
> [*] I assume that on your ARM systems, memory goes into self refresh during
> system idle just as it does on x86
>

2013-05-22 15:57:06

by Morten Rasmussen

[permalink] [raw]
Subject: Re: [PATCH 11/14] sched: filter task pull request

On Fri, Apr 26, 2013 at 11:00:58AM +0100, Vincent Guittot wrote:
> Part of this patch is missing, the fix below is needed
>
> @@ -3497,7 +3497,9 @@ static bool is_buddy_full(int cpu)
> static bool is_my_buddy(int cpu, int buddy)
> {
> int my_buddy = per_cpu(sd_pack_buddy, cpu);
> - return (my_buddy == -1) || (buddy == my_buddy);
> +
> + return ((sysctl_sched_packing_mode == SCHED_PACKING_FULL) &&
> + ((my_buddy == -1) || (buddy == my_buddy)));
> }
>
> static bool is_light_task(struct task_struct *p)
>
>
> On 25 April 2013 19:23, Vincent Guittot <[email protected]> wrote:
> > Only CPUs that participates to the packing effort can pull tasks on a busiest
> > group.
> >
> > Signed-off-by: Vincent Guittot <[email protected]>
> > ---
> > kernel/sched/fair.c | 14 ++++++++++++--
> > 1 file changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 28f8ea7..6f63fb9 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3494,6 +3494,12 @@ static bool is_buddy_full(int cpu)
> > return (sum * 1024 >= period * 1000);
> > }
> >
> > +static bool is_my_buddy(int cpu, int buddy)
> > +{
> > + int my_buddy = per_cpu(sd_pack_buddy, cpu);
> > + return (my_buddy == -1) || (buddy == my_buddy);
> > +}

Would it make sense to change the function name to something like
is_packing_target() and only have one argument?

is_my_buddy() is only used with the same variable for both arguments
like below.

> > +
> > static bool is_light_task(struct task_struct *p)
> > {
> > /* A light task runs less than 20% in average */
> > @@ -4688,8 +4694,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >
> > /* Bias balancing toward cpus of our domain */
> > if (local_group) {
> > - if (idle_cpu(i) && !first_idle_cpu &&
> > - cpumask_test_cpu(i, sched_group_mask(group))) {
> > + if (is_my_buddy(i, i) && idle_cpu(i) && !first_idle_cpu
> > + && cpumask_test_cpu(i, sched_group_mask(group))) {
> > first_idle_cpu = 1;
> > balance_cpu = i;
> > }
> > @@ -4817,6 +4823,10 @@ static void update_plb_buddy(int cpu, int *balance, struct sd_lb_stats *sds,
> >
> > /* Get my new buddy */
> > buddy = per_cpu(sd_pack_buddy, cpu);
> > +
> > + /* This CPU doesn't act for agressive packing */
> > + if (buddy != cpu)
> > + sds->busiest = 0;

sds->busiest is a pointer, so I think it should be assigned to NULL
instead of 0.

Morten

> > }
> >
> > /**
> > --
> > 1.7.9.5
> >
>

2013-05-22 16:03:46

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 11/14] sched: filter task pull request

On 22 May 2013 17:56, Morten Rasmussen <[email protected]> wrote:
> On Fri, Apr 26, 2013 at 11:00:58AM +0100, Vincent Guittot wrote:
>> Part of this patch is missing, the fix below is needed
>>
>> @@ -3497,7 +3497,9 @@ static bool is_buddy_full(int cpu)
>> static bool is_my_buddy(int cpu, int buddy)
>> {
>> int my_buddy = per_cpu(sd_pack_buddy, cpu);
>> - return (my_buddy == -1) || (buddy == my_buddy);
>> +
>> + return ((sysctl_sched_packing_mode == SCHED_PACKING_FULL) &&
>> + ((my_buddy == -1) || (buddy == my_buddy)));
>> }
>>
>> static bool is_light_task(struct task_struct *p)
>>
>>
>> On 25 April 2013 19:23, Vincent Guittot <[email protected]> wrote:
>> > Only CPUs that participates to the packing effort can pull tasks on a busiest
>> > group.
>> >
>> > Signed-off-by: Vincent Guittot <[email protected]>
>> > ---
>> > kernel/sched/fair.c | 14 ++++++++++++--
>> > 1 file changed, 12 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 28f8ea7..6f63fb9 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -3494,6 +3494,12 @@ static bool is_buddy_full(int cpu)
>> > return (sum * 1024 >= period * 1000);
>> > }
>> >
>> > +static bool is_my_buddy(int cpu, int buddy)
>> > +{
>> > + int my_buddy = per_cpu(sd_pack_buddy, cpu);
>> > + return (my_buddy == -1) || (buddy == my_buddy);
>> > +}
>
> Would it make sense to change the function name to something like
> is_packing_target() and only have one argument?

I have replaced it with is_packing_cpu(int cpu) in my next version.
This function returns true if the cpu is part of the packing effort
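
Something along these lines, folding in the fix posted at the top of
this mail (assumed shape, the next version is not part of this thread):

/*
 * Assumed sketch of the renamed helper: the original is_my_buddy(),
 * with the fix above folded in and specialized to a single argument.
 */
static bool is_packing_cpu(int cpu)
{
        int buddy = per_cpu(sd_pack_buddy, cpu);

        return (sysctl_sched_packing_mode == SCHED_PACKING_FULL) &&
               ((buddy == -1) || (buddy == cpu));
}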

>
> is_my_buddy() is only used with the same variable for both arguments
> like below.
>
>> > +
>> > static bool is_light_task(struct task_struct *p)
>> > {
>> > /* A light task runs less than 20% in average */
>> > @@ -4688,8 +4694,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> >
>> > /* Bias balancing toward cpus of our domain */
>> > if (local_group) {
>> > - if (idle_cpu(i) && !first_idle_cpu &&
>> > - cpumask_test_cpu(i, sched_group_mask(group))) {
>> > + if (is_my_buddy(i, i) && idle_cpu(i) && !first_idle_cpu
>> > + && cpumask_test_cpu(i, sched_group_mask(group))) {
>> > first_idle_cpu = 1;
>> > balance_cpu = i;
>> > }
>> > @@ -4817,6 +4823,10 @@ static void update_plb_buddy(int cpu, int *balance, struct sd_lb_stats *sds,
>> >
>> > /* Get my new buddy */
>> > buddy = per_cpu(sd_pack_buddy, cpu);
>> > +
>> > + /* This CPU doesn't act for agressive packing */
>> > + if (buddy != cpu)
>> > + sds->busiest = 0;
>
> sds->busiest is a pointer, so I think it should be assigned to NULL
> instead of 0.

yes

Thanks
Vincent

>
> Morten
>
>> > }
>> >
>> > /**
>> > --
>> > 1.7.9.5
>> >
>>
>