2013-03-22 12:29:33

by Vincent Guittot

Subject: [RFC PATCH v3 0/6] sched: packing small tasks

Hi,

This patchset takes advantage of the new per-task load tracking that is
available in the kernel to pack small tasks onto as few CPUs/clusters/cores as
possible. The main goal of packing small tasks is to reduce power consumption
in low-load use cases by minimizing the number of power domains that are
enabled. The packing is done in 2 steps:

The 1st step looks for the best place to pack tasks in a system according to
its topology and defines a pack buddy CPU for each CPU, if one is available.
We define the best CPU during the build of the sched_domain instead of
evaluating it at runtime because it can be difficult to define a stable
buddy CPU in a low CPU load situation. The policy for defining a buddy CPU is
that we pack at all levels inside a node where a group of CPUs can be power
gated independently from the others. To describe this capability, a new flag,
SD_SHARE_POWERDOMAIN, has been introduced; it indicates whether the groups of
CPUs of a scheduling domain share their power state. By default, this flag is
set in all sched_domains in order to keep the current behavior of the
scheduler unchanged, and only the ARM platform clears the
SD_SHARE_POWERDOMAIN flag at the MC and CPU levels.

In a 2nd step, the scheduler checks the load average of a task that wakes up
as well as the load average of the buddy CPU, and it can decide to migrate
light tasks to a buddy that is not busy. This check is done during wake up
because small tasks tend to wake up between periodic load balances and
asynchronously to each other, which prevents the default mechanism from
catching and migrating them efficiently. A light task is defined by a
runnable_avg_sum that is less than 20% of the runnable_avg_period. In fact,
this condition encloses 2 others: the average CPU load of the task must be
less than 20%, and the task must have been runnable for less than 10ms when
it last woke up, in order to be eligible for the packing migration. So, a
task that runs 1 ms every 5 ms will be considered a small task, but a task
that runs 50 ms with a period of 500 ms will not.
Then, the busyness of the buddy CPU depends on the load average of its rq and
on the number of running tasks. A CPU with a load average greater than 50%
will be considered busy whatever the number of running tasks is, and this
threshold is reduced by the number of running tasks so as not to increase
the wake-up latency of a task too much. When the buddy CPU is busy, the
scheduler falls back to the default CFS policy.

Changes since V2:

- Migrate only a task that wakes up
- Change the light task threshold to 20%
- Change the loaded CPU threshold so as to not pull tasks if the current
number of running tasks is zero but the load average is already greater
than 50%
- Fix the algorithm for selecting the buddy CPU.

Changes since V1:

Patch 2/6
- Change the flag name which was not clear. The new name is
SD_SHARE_POWERDOMAIN.
- Create an architecture dependent function to tune the sched_domain flags
Patch 3/6
- Fix issues in the algorithm that looks for the best buddy CPU
- Use pr_debug instead of pr_info
- Fix for uniprocessor
Patch 4/6
- Remove the use of usage_avg_sum which has not been merged
Patch 5/6
- Change the way the coherency of runnable_avg_sum and runnable_avg_period is
ensured
Patch 6/6
- Use the arch dependent function to set/clear SD_SHARE_POWERDOMAIN for ARM
platform


New results for v3:

This series has been tested with hackbench on an ARM platform and the results
don't show any performance regression.

Hackbench               3.9-rc2   +patches
Mean Time (10 tests):     2.048      2.015
stdev:                    0.047      0.068

Previous results for V2:

This series has been tested with MP3 playback on an ARM platform:
TC2 HMP (dual CA-15 and 3xCA-7 cluster).

The measurements have been done on an Ubuntu image during 60 seconds of
playback and the result has been normalized to 100.

| CA15 | CA7 | total |
-------------------------------------
default | 81 | 97 | 178 |
pack | 13 | 100 | 113 |
-------------------------------------

Previous results for V1:

The patch-set has been tested on ARM platforms: a quad CA-9 SMP and a TC2 HMP
(dual CA-15 and 3xCA-7 cluster). For the ARM platforms, the results have
demonstrated that it's worth packing small tasks at all topology levels.

The performance tests have been done on both platforms with sysbench. The
results don't show any performance regressions. These results are consistent
with the policy, which keeps the normal behavior for heavy use cases.

test: sysbench --test=cpu --num-threads=N --max-requests=R run

The results below are the average duration of 3 tests on the quad CA-9.
"default" is the current scheduler behavior (pack buddy CPU is -1);
"pack" is the scheduler with the pack mechanism.

| default | pack |
-----------------------------------
N=8; R=200 | 3.1999 | 3.1921 |
N=8; R=2000 | 31.4939 | 31.4844 |
N=12; R=200 | 3.2043 | 3.2084 |
N=12; R=2000 | 31.4897 | 31.4831 |
N=16; R=200 | 3.1774 | 3.1824 |
N=16; R=2000 | 31.4899 | 31.4897 |
-----------------------------------

The power consumption tests have been done only on the TC2 platform, which
has accessible power lines, and I have used cyclictest to simulate small
tasks. The tests show some power consumption improvements.

test: cyclictest -t 8 -q -e 1000000 -D 20 & cyclictest -t 8 -q -e 1000000 -D 20

The measurements have been done during 16 seconds and the result has been
normalized to 100.

| CA15 | CA7 | total |
-------------------------------------
default | 100 | 40 | 140 |
pack | <1 | 45 | <46 |
-------------------------------------

The A15 cluster is less power efficient than the A7 cluster, but if we assume
that the tasks are evenly spread over both clusters, we can roughly estimate
that the power consumption on a dual-cluster CA7 system would have been, for
a default kernel:

| CA7 | CA7 | total |
-------------------------------------
default | 40 | 40 | 80 |
-------------------------------------

Vincent Guittot (6):
Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for
load-tracking"
sched: add a new SD_SHARE_POWERDOMAIN flag for sched_domain
sched: pack small tasks
sched: secure access to other CPU statistics
sched: pack the idle load balance
ARM: sched: clear SD_SHARE_POWERDOMAIN

arch/arm/kernel/topology.c | 9 +++
arch/ia64/include/asm/topology.h | 1 +
arch/tile/include/asm/topology.h | 1 +
include/linux/sched.h | 9 +--
include/linux/topology.h | 4 +
kernel/sched/core.c | 14 ++--
kernel/sched/fair.c | 149 +++++++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 14 ++--
8 files changed, 169 insertions(+), 32 deletions(-)

--
1.7.9.5


2013-03-22 12:29:40

by Vincent Guittot

Subject: [RFC PATCH v3 1/6] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"

This reverts commit f4e26b120b9de84cb627bc7361ba43cfdc51341f.

Signed-off-by: Vincent Guittot <[email protected]>
---
include/linux/sched.h | 8 +-------
kernel/sched/core.c | 7 +------
kernel/sched/fair.c | 13 ++-----------
kernel/sched/sched.h | 9 +--------
4 files changed, 5 insertions(+), 32 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d35d2b6..16b0d18 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1159,13 +1159,7 @@ struct sched_entity {
/* rq "owned" by this entity/group: */
struct cfs_rq *my_q;
#endif
-
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/* Per-entity load-tracking */
struct sched_avg avg;
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f12624..54eaaa2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1561,12 +1561,7 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..9c2f726 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1109,8 +1109,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_FAIR_GROUP_SCHED */

-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/*
* We choose a half-life close to 1 scheduling period.
* Note: The tables below are dependent on this value.
@@ -3394,12 +3393,6 @@ unlock:
}

/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
* Called immediately before a task is migrated to a new cpu; task_cpu(p) and
* cfs_rq_of(p) references at time of call are still valid and identify the
* previous cpu. However, the caller only guarantees p->pi_lock is held; no
@@ -3422,7 +3415,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
}
}
-#endif
#endif /* CONFIG_SMP */

static unsigned long
@@ -6114,9 +6106,8 @@ const struct sched_class fair_sched_class = {

#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
.migrate_task_rq = migrate_task_rq_fair,
-#endif
+
.rq_online = rq_online_fair,
.rq_offline = rq_offline_fair,

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..7f36024f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -227,12 +227,6 @@ struct cfs_rq {
#endif

#ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* CFS Load tracking
* Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -242,8 +236,7 @@ struct cfs_rq {
u64 runnable_load_avg, blocked_load_avg;
atomic64_t decay_counter, removed_load;
u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
u32 tg_runnable_contrib;
u64 tg_load_contrib;
--
1.7.9.5

2013-03-22 12:29:44

by Vincent Guittot

Subject: [RFC PATCH v3 3/6] sched: pack small tasks

During the creation of a sched_domain, we define a pack buddy CPU for each
CPU when one is available. We want to pack at all levels where a group of
CPUs can be power gated independently from the others.
On a system that can't power gate a group of CPUs independently, the flag is
set at all sched_domain levels and the buddy is set to -1. This is the
default behavior.
On a dual-cluster / dual-core system that can power gate each core and
cluster independently, the buddy configuration will be:

| Cluster 0 | Cluster 1 |
| CPU0 | CPU1 | CPU2 | CPU3 |
-----------------------------------
buddy | CPU0 | CPU0 | CPU0 | CPU2 |

Small tasks tend to slip out of the periodic load balance, so the best place
to migrate them is during their wake up. The decision is in O(1) as we only
check one buddy CPU.

Signed-off-by: Vincent Guittot <[email protected]>
Reviewed-by: Morten Rasmussen <[email protected]>
---
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 115 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 5 +++
3 files changed, 121 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b827e0c..21c35ce 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5662,6 +5662,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
rcu_assign_pointer(rq->sd, sd);
destroy_sched_domains(tmp, cpu);

+ update_packing_domain(cpu);
update_top_cache_domain(cpu);
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c2f726..021c7b7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -160,6 +160,76 @@ void sched_init_granularity(void)
update_sysctl();
}

+
+#ifdef CONFIG_SMP
+/*
+ * Save the id of the optimal CPU that should be used to pack small tasks
+ * The value -1 is used when no buddy has been found
+ */
+DEFINE_PER_CPU(int, sd_pack_buddy);
+
+/*
+ * Look for the best buddy CPU that can be used to pack small tasks
+ * We make the assumption that it isn't worth packing on CPUs that share the
+ * same power line. We look for the 1st sched_domain without the
+ * SD_SHARE_POWERDOMAIN flag. Then we look for the sched_group with the lowest
+ * power per core, based on the assumption that its power efficiency is
+ * better
+ */
+void update_packing_domain(int cpu)
+{
+ struct sched_domain *sd;
+ int id = -1;
+
+ sd = highest_flag_domain(cpu, SD_SHARE_POWERDOMAIN);
+ if (!sd)
+ sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+ else
+ sd = sd->parent;
+
+ while (sd && (sd->flags & SD_LOAD_BALANCE)
+ && !(sd->flags & SD_SHARE_POWERDOMAIN)) {
+ struct sched_group *sg = sd->groups;
+ struct sched_group *pack = sg;
+ struct sched_group *tmp;
+
+ /*
+ * The sched_domain of a CPU points to the local sched_group
+ * and the 1st CPU of this local group is a good candidate
+ */
+ id = cpumask_first(sched_group_cpus(pack));
+
+ /* loop the sched groups to find the best one */
+ for (tmp = sg->next; tmp != sg; tmp = tmp->next) {
+ if (tmp->sgp->power * pack->group_weight >
+ pack->sgp->power * tmp->group_weight)
+ continue;
+
+ if ((tmp->sgp->power * pack->group_weight ==
+ pack->sgp->power * tmp->group_weight)
+ && (cpumask_first(sched_group_cpus(tmp)) >= id))
+ continue;
+
+ /* we have found a better group */
+ pack = tmp;
+
+ /* Take the 1st CPU of the new group */
+ id = cpumask_first(sched_group_cpus(pack));
+ }
+
+ /* Look for another CPU than itself */
+ if (id != cpu)
+ break;
+
+ sd = sd->parent;
+ }
+
+ pr_debug("CPU%d packing on CPU%d\n", cpu, id);
+ per_cpu(sd_pack_buddy, cpu) = id;
+}
+
+#endif /* CONFIG_SMP */
+
#if BITS_PER_LONG == 32
# define WMULT_CONST (~0UL)
#else
@@ -3291,6 +3361,47 @@ done:
return target;
}

+static bool is_buddy_busy(int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+
+ /*
+ * A busy buddy is a CPU with a high load or a small load with a lot of
+ * running tasks.
+ */
+ return (rq->avg.runnable_avg_sum >
+ (rq->avg.runnable_avg_period / (rq->nr_running + 2)));
+}
+
+static bool is_light_task(struct task_struct *p)
+{
+ /* A light task runs less than 20% in average */
+ return ((p->se.avg.runnable_avg_sum * 5) <
+ (p->se.avg.runnable_avg_period));
+}
+
+static int check_pack_buddy(int cpu, struct task_struct *p)
+{
+ int buddy = per_cpu(sd_pack_buddy, cpu);
+
+ /* No pack buddy for this CPU */
+ if (buddy == -1)
+ return false;
+
+ /* buddy is not an allowed CPU */
+ if (!cpumask_test_cpu(buddy, tsk_cpus_allowed(p)))
+ return false;
+
+ /*
+ * If the task is a small one and the buddy is not overloaded,
+ * we use buddy cpu
+ */
+ if (!is_light_task(p) || is_buddy_busy(buddy))
+ return false;
+
+ return true;
+}
+
/*
* sched_balance_self: balance the current task (running on cpu) in domains
* that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
@@ -3319,6 +3430,10 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
want_affine = 1;
new_cpu = prev_cpu;
+
+ /* We pack only at wake up and not new task */
+ if (check_pack_buddy(new_cpu, p))
+ return per_cpu(sd_pack_buddy, new_cpu);
}

rcu_read_lock();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7f36024f..96b164d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -872,6 +872,7 @@ extern const struct sched_class idle_sched_class;

extern void trigger_load_balance(struct rq *rq, int cpu);
extern void idle_balance(int this_cpu, struct rq *this_rq);
+extern void update_packing_domain(int cpu);

#else /* CONFIG_SMP */

@@ -879,6 +880,10 @@ static inline void idle_balance(int cpu, struct rq *rq)
{
}

+static inline void update_packing_domain(int cpu)
+{
+}
+
#endif

extern void sysrq_sched_debug_show(void);
--
1.7.9.5

2013-03-22 12:29:50

by Vincent Guittot

Subject: [RFC PATCH v3 6/6] ARM: sched: clear SD_SHARE_POWERDOMAIN

The ARM platforms take advantage of packing small tasks on a few cores.
This is true even when the cores of a cluster can't be power gated
independently. So we clear SD_SHARE_POWERDOMAIN at the MC and CPU levels.

Signed-off-by: Vincent Guittot <[email protected]>
Reviewed-by: Morten Rasmussen <[email protected]>
---
arch/arm/kernel/topology.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 79282eb..f89a4a2 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -201,6 +201,15 @@ static inline void update_cpu_power(unsigned int cpuid, unsigned int mpidr) {}
*/
struct cputopo_arm cpu_topology[NR_CPUS];

+int arch_sd_local_flags(int level)
+{
+ /* Powergate at threading level doesn't make sense */
+ if (level & SD_SHARE_CPUPOWER)
+ return 1*SD_SHARE_POWERDOMAIN;
+
+ return 0*SD_SHARE_POWERDOMAIN;
+}
+
const struct cpumask *cpu_coregroup_mask(int cpu)
{
return &cpu_topology[cpu].core_sibling;
--
1.7.9.5

2013-03-22 12:29:48

by Vincent Guittot

Subject: [RFC PATCH v3 5/6] sched: pack the idle load balance

Look for an idle CPU close to the pack buddy CPU whenever possible.
The goal is to prevent the wake up of a CPU which doesn't share the power
domain of the pack buddy CPU.

Signed-off-by: Vincent Guittot <[email protected]>
Reviewed-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b636199..52a7736 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5455,7 +5455,25 @@ static struct {

static inline int find_new_ilb(int call_cpu)
{
+ struct sched_domain *sd;
int ilb = cpumask_first(nohz.idle_cpus_mask);
+ int buddy = per_cpu(sd_pack_buddy, call_cpu);
+
+ /*
+ * If we have a pack buddy CPU, we try to run load balance on a CPU
+ * that is close to the buddy.
+ */
+ if (buddy != -1)
+ for_each_domain(buddy, sd) {
+ if (sd->flags & SD_SHARE_CPUPOWER)
+ continue;
+
+ ilb = cpumask_first_and(sched_domain_span(sd),
+ nohz.idle_cpus_mask);
+
+ if (ilb < nr_cpu_ids)
+ break;
+ }

if (ilb < nr_cpu_ids && idle_cpu(ilb))
return ilb;
--
1.7.9.5

2013-03-22 12:30:34

by Vincent Guittot

Subject: [RFC PATCH v3 4/6] sched: secure access to other CPU statistics

If a CPU accesses the runnable_avg_sum and runnable_avg_period fields of its
buddy CPU while the latter updates them, it can get the new version of one
field and the old version of the other. This can lead to erroneous decisions.
We don't want to use a lock mechanism to ensure coherency because of the
overhead in this critical path. The previous attempt couldn't ensure
coherency of both fields for 100% of platforms and use cases, as it depends
on the toolchain and the platform architecture.
The runnable_avg_period of a runqueue tends to its max value in less than
345ms after plugging in a CPU, which implies that we could use the max value
instead of reading runnable_avg_period after 345ms. During the starting
phase, we must ensure a minimum of coherency between the fields. A simple
rule is runnable_avg_sum <= runnable_avg_period.

Signed-off-by: Vincent Guittot <[email protected]>
Reviewed-by: Morten Rasmussen <[email protected]>
---
kernel/sched/fair.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 021c7b7..b636199 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3364,13 +3364,16 @@ done:
static bool is_buddy_busy(int cpu)
{
struct rq *rq = cpu_rq(cpu);
+ u32 sum = rq->avg.runnable_avg_sum;
+ u32 period = rq->avg.runnable_avg_period;
+
+ sum = min(sum, period);

/*
* A busy buddy is a CPU with a high load or a small load with a lot of
* running tasks.
*/
- return (rq->avg.runnable_avg_sum >
- (rq->avg.runnable_avg_period / (rq->nr_running + 2)));
+ return (sum > (period / (rq->nr_running + 2)));
}

static bool is_light_task(struct task_struct *p)
--
1.7.9.5

2013-03-22 12:31:14

by Vincent Guittot

Subject: [RFC PATCH v3 2/6] sched: add a new SD_SHARE_POWERDOMAIN flag for sched_domain

This new flag SD_SHARE_POWERDOMAIN is used to reflect whether groups of CPUs
in a sched_domain level can reach a different power state or not. If groups
of cores can be power gated independently, as an example, the flag should be
cleared at the CPU level. This information is used to decide if it's worth
packing some tasks into a group of CPUs in order to power gate the other
groups instead of spreading the tasks. The default behavior of the scheduler
is to spread tasks, so the flag is set in all sched_domains.

Signed-off-by: Vincent Guittot <[email protected]>
Reviewed-by: Morten Rasmussen <[email protected]>
---
arch/ia64/include/asm/topology.h | 1 +
arch/tile/include/asm/topology.h | 1 +
include/linux/sched.h | 1 +
include/linux/topology.h | 4 ++++
kernel/sched/core.c | 6 ++++++
5 files changed, 13 insertions(+)

diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index a2496e4..6d0b617 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -65,6 +65,7 @@ void build_cpu_to_node_map(void);
| SD_BALANCE_EXEC \
| SD_BALANCE_FORK \
| SD_WAKE_AFFINE, \
+ | arch_sd_local_flags(0)\
.last_balance = jiffies, \
.balance_interval = 1, \
.nr_balance_failed = 0, \
diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h
index d5e86c9..adc87102 100644
--- a/arch/tile/include/asm/topology.h
+++ b/arch/tile/include/asm/topology.h
@@ -71,6 +71,7 @@ static inline const struct cpumask *cpumask_of_node(int node)
| 0*SD_WAKE_AFFINE \
| 0*SD_SHARE_CPUPOWER \
| 0*SD_SHARE_PKG_RESOURCES \
+ | arch_sd_local_flags(0) \
| 0*SD_SERIALIZE \
, \
.last_balance = jiffies, \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 16b0d18..a82cdeb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -809,6 +809,7 @@ enum cpu_idle_type {
#define SD_BALANCE_WAKE 0x0010 /* Balance on wakeup */
#define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */
#define SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */
+#define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */
#define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */
#define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */
#define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */
diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..3eab293 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -99,6 +99,8 @@ int arch_update_cpu_topology(void);
| 1*SD_WAKE_AFFINE \
| 1*SD_SHARE_CPUPOWER \
| 1*SD_SHARE_PKG_RESOURCES \
+ | arch_sd_local_flags(SD_SHARE_CPUPOWER|\
+ SD_SHARE_PKG_RESOURCES) \
| 0*SD_SERIALIZE \
| 0*SD_PREFER_SIBLING \
| arch_sd_sibling_asym_packing() \
@@ -131,6 +133,7 @@ int arch_update_cpu_topology(void);
| 1*SD_WAKE_AFFINE \
| 0*SD_SHARE_CPUPOWER \
| 1*SD_SHARE_PKG_RESOURCES \
+ | arch_sd_local_flags(SD_SHARE_PKG_RESOURCES)\
| 0*SD_SERIALIZE \
, \
.last_balance = jiffies, \
@@ -161,6 +164,7 @@ int arch_update_cpu_topology(void);
| 1*SD_WAKE_AFFINE \
| 0*SD_SHARE_CPUPOWER \
| 0*SD_SHARE_PKG_RESOURCES \
+ | arch_sd_local_flags(0) \
| 0*SD_SERIALIZE \
| 1*SD_PREFER_SIBLING \
, \
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 54eaaa2..b827e0c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5938,6 +5938,11 @@ int __weak arch_sd_sibling_asym_packing(void)
return 0*SD_ASYM_PACKING;
}

+int __weak arch_sd_local_flags(int level)
+{
+ return 1*SD_SHARE_POWERDOMAIN;
+}
+
/*
* Initializers for schedule domains
* Non-inlined to reduce accumulated stack pressure in build_sched_domains()
@@ -6127,6 +6132,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
| 0*SD_WAKE_AFFINE
| 0*SD_SHARE_CPUPOWER
| 0*SD_SHARE_PKG_RESOURCES
+ | 1*SD_SHARE_POWERDOMAIN
| 1*SD_SERIALIZE
| 0*SD_PREFER_SIBLING
| sd_local_flags(level)
--
1.7.9.5

2013-03-23 11:57:50

by Preeti U Murthy

Subject: Re: [RFC PATCH v3 0/6] sched: packing small tasks

Hi Vincent,

The power aware scheduler patchset that has been released by Alex
recently [https://lkml.org/lkml/2013/2/18/4] also has the idea of
packing small tasks. In that patchset, the waking of small tasks is
done on a leader cpu, which of course varies dynamically.
https://lkml.org/lkml/2013/2/18/6

I will refer henceforth to the above patchset as Patch A and your
current patchset in this mail as Patch B.

The difference in Patch A is that it packs small tasks onto a leader
cpu, i.e. a cpu which has just enough capacity to take a light weight
task on itself: one essentially belonging to a group which has a
favourable grp_power, with the leader cpu having a suitable rq->util.

In Patch B, you are choosing a buddy cpu belonging to a group with the
least cpu power, and deciding to pack tasks onto the buddy cpu again
boils down to verifying that the grp_power is favourable and that the
buddy cpu has a suitable rq->util.

In my opinion, this goes to say that both patches use similar metrics
to decide if a cpu is capable of handling the light weight task. The
difference lies in how far the search for the appropriate cpu proceeds:
while Patch A continues to search at all levels of sched domains, down
to the last, to find the appropriate cpu, Patch B queries only the
buddy CPU.

The point I am trying to make is this: considering the above
similarities and differences, is it possible for us to see if the ideas
that both these patches bring in can be merged into one, since both of
them have the common goal of packing small tasks?

Thanks

Regards
Preeti U Murthy

On 03/22/2013 05:55 PM, Vincent Guittot wrote:
> Hi,
>
> This patchset takes advantage of the new per-task load tracking that is
> available in the kernel for packing the small tasks in as few as possible
> CPU/Cluster/Core. The main goal of packing small tasks is to reduce the power
> consumption in the low load use cases by minimizing the number of power domain
> that are enabled. The packing is done in 2 steps:
>
> The 1st step looks for the best place to pack tasks in a system according to
> its topology and it defines a pack buddy CPU for each CPU if there is one
> available. We define the best CPU during the build of the sched_domain instead
> of evaluating it at runtime because it can be difficult to define a stable
> buddy CPU in a low CPU load situation. The policy for defining a buddy CPU is
> that we pack at all levels inside a node where a group of CPU can be power
> gated independently from others. For describing this capability, a new flag
> has been introduced SD_SHARE_POWERDOMAIN that is used to indicate whether the
> groups of CPUs of a scheduling domain are sharing their power state. By
> default, this flag has been set in all sched_domain in order to keep unchanged
> the current behavior of the scheduler and only ARM platform clears the
> SD_SHARE_POWERDOMAIN flag for MC and CPU level.
>
> In a 2nd step, the scheduler checks the load average of a task which wakes up
> as well as the load average of the buddy CPU and it can decide to migrate the
> light tasks on a not busy buddy. This check is done during the wake up because
> small tasks tend to wake up between periodic load balance and asynchronously
> to each other which prevents the default mechanism to catch and migrate them
> efficiently. A light task is defined by a runnable_avg_sum that is less than
> 20% of the runnable_avg_period. In fact, the former condition encloses 2 ones:
> The average CPU load of the task must be less than 20% and the task must have
> been runnable less than 10ms when it woke up last time in order to be
> electable for the packing migration. So, a task than runs 1 ms each 5ms will
> be considered as a small task but a task that runs 50 ms with a period of
> 500ms, will not.
> Then, the business of the buddy CPU depends of the load average for the rq and
> the number of running tasks. A CPU with a load average greater than 50% will
> be considered as busy CPU whatever the number of running tasks is and this
> threshold will be reduced by the number of running tasks in order to not
> increase too much the wake up latency of a task. When the buddy CPU is busy,
> the scheduler falls back to default CFS policy.
>
> Change since V2:
>
> - Migrate only a task that wakes up
> - Change the light tasks threshold to 20%
> - Change the loaded CPU threshold to not pull tasks if the current number of
> running tasks is null but the load average is already greater than 50%
> - Fix the algorithm for selecting the buddy CPU.
>
> Change since V1:
>
> Patch 2/6
> - Change the flag name which was not clear. The new name is
> SD_SHARE_POWERDOMAIN.
> - Create an architecture dependent function to tune the sched_domain flags
> Patch 3/6
> - Fix issues in the algorithm that looks for the best buddy CPU
> - Use pr_debug instead of pr_info
> - Fix for uniprocessor
> Patch 4/6
> - Remove the use of usage_avg_sum which has not been merged
> Patch 5/6
> - Change the way the coherency of runnable_avg_sum and runnable_avg_period is
> ensured
> Patch 6/6
> - Use the arch dependent function to set/clear SD_SHARE_POWERDOMAIN for ARM
> platform
>
>
> New results for v3:
>
> This series has been tested with hackbench on ARM platform and the results
> don't show any performance regression
>
> Hackbench 3.9-rc2 +patches
> Mean Time (10 tests): 2.048 2.015
> stdev : 0.047 0.068
>
> Previous results for V2:
>
> This series has been tested with MP3 play back on ARM platform:
> TC2 HMP (dual CA-15 and 3xCA-7 cluster).
>
> The measurements have been done on an Ubuntu image during 60 seconds of
> playback and the result has been normalized to 100.
>
> | CA15 | CA7 | total |
> -------------------------------------
> default | 81 | 97 | 178 |
> pack | 13 | 100 | 113 |
> -------------------------------------
>
> Previous results for V1:
>
> The patch-set has been tested on ARM platforms: quad CA-9 SMP and TC2 HMP
> (dual CA-15 and 3xCA-7 cluster). For ARM platform, the results have
> demonstrated that it's worth packing small tasks at all topology levels.
>
> The performance tests have been done on both platforms with sysbench. The
> results don't show any performance regressions. These results are aligned with
> the policy which uses the normal behavior with heavy use cases.
>
> test: sysbench --test=cpu --num-threads=N --max-requests=R run
>
> Results below is the average duration of 3 tests on the quad CA-9.
> default is the current scheduler behavior (pack buddy CPU is -1)
> pack is the scheduler with the pack mechanism
>
> | default | pack |
> -----------------------------------
> N=8; R=200 | 3.1999 | 3.1921 |
> N=8; R=2000 | 31.4939 | 31.4844 |
> N=12; R=200 | 3.2043 | 3.2084 |
> N=12; R=2000 | 31.4897 | 31.4831 |
> N=16; R=200 | 3.1774 | 3.1824 |
> N=16; R=2000 | 31.4899 | 31.4897 |
> -----------------------------------
>
> The power consumption tests have been done only on TC2 platform which has got
> accessible power lines and I have used cyclictest to simulate small tasks. The
> tests show some power consumption improvements.
>
> test: cyclictest -t 8 -q -e 1000000 -D 20 & cyclictest -t 8 -q -e 1000000 -D 20
>
> The measurements have been done during 16 seconds and the result has been
> normalized to 100
>
>         | CA15 | CA7 | total |
> -------------------------------------
> default |  100 |  40 |  140  |
> pack    |   <1 |  45 |  <46  |
> -------------------------------------
>
> The A15 cluster is less power efficient than the A7 cluster, but if we
> assume that the tasks are well spread on both clusters, we can roughly
> estimate that the power consumption on a dual cluster of CA7 would have
> been, for a default kernel:
>
>         | CA7 | CA7 | total |
> -------------------------------------
> default |  40 |  40 |   80  |
> -------------------------------------
>
> Vincent Guittot (6):
> Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for
> load-tracking"
> sched: add a new SD_SHARE_POWERDOMAIN flag for sched_domain
> sched: pack small tasks
> sched: secure access to other CPU statistics
> sched: pack the idle load balance
> ARM: sched: clear SD_SHARE_POWERDOMAIN
>
> arch/arm/kernel/topology.c | 9 +++
> arch/ia64/include/asm/topology.h | 1 +
> arch/tile/include/asm/topology.h | 1 +
> include/linux/sched.h | 9 +--
> include/linux/topology.h | 4 +
> kernel/sched/core.c | 14 ++--
> kernel/sched/fair.c | 149 +++++++++++++++++++++++++++++++++++---
> kernel/sched/sched.h | 14 ++--
> 8 files changed, 169 insertions(+), 32 deletions(-)
>

2013-03-25 09:58:29

by Vincent Guittot

Subject: Re: [RFC PATCH v3 0/6] sched: packing small tasks

On 23 March 2013 12:55, Preeti U Murthy <[email protected]> wrote:
> Hi Vincent,
>
> The power aware scheduler patchset that has been released by Alex
> recently [https://lkml.org/lkml/2013/2/18/4], also has the idea of
> packing of small tasks. In that patchset, the waking of small tasks is
> done on a leader cpu, which of course varies dynamically.
> https://lkml.org/lkml/2013/2/18/6
>
> I will refer henceforth to the above patchset as Patch A and your
> current patchset in this mail as Patch B.
>
> The difference in Patch A is that it is packing small tasks onto a
> leader cpu, i.e. a cpu which has just enough capacity to take on itself
> a light weight task. Essentially belonging to a group which has a
> favourable grp_power and the leader cpu having a suitable rq->util.
>
> In Patch B, you are choosing a buddy cpu, belonging to a group with the
> least cpu power, and while deciding to pack tasks onto the buddy cpu,
> this again boils down to, verifying if the
> grp_power is favourable and the buddy cpu has a suitable rq->util.
>
> In my opinion, this goes to say that both the patches are using similar
> metrics to decide if a cpu is capable of handling the light weight task.
> The difference lies in how far the search for the appropriate cpu can
> proceed. While Patch A continues to search at all levels of sched
> domains till the last to find the appropriate cpu, Patch B queries only
> the buddy CPU.
>
> The point I am trying to make is that, considering the above
> similarities and differences, is it possible for us to see if the ideas
> that both these patches bring in can be merged into one, since both of
> them are having the common goal of packing small tasks.

Hi Preeti,

I agree that the final goal is similar, but we also have important
differences to keep in mind: whether or not a power domain is shared,
which is used to decide where it's worth packing; the selection of the
leader CPU, which is the most power-efficient core; and the use of a
knob to select a policy. So, you're right that we should find a way to
have both working together while still keeping these specificities.

Regards,
Vincent

>
> Thanks
>
> Regards
> Preeti U Murthy
>
> On 03/22/2013 05:55 PM, Vincent Guittot wrote:
>> Hi,
>>
>> This patchset takes advantage of the new per-task load tracking that is
>> available in the kernel for packing the small tasks in as few as possible
>> CPU/Cluster/Core. The main goal of packing small tasks is to reduce the power
>> consumption in the low load use cases by minimizing the number of power domain
>> that are enabled. The packing is done in 2 steps:
>>
>> The 1st step looks for the best place to pack tasks in a system according to
>> its topology and it defines a pack buddy CPU for each CPU if there is one
>> available. We define the best CPU during the build of the sched_domain instead
>> of evaluating it at runtime because it can be difficult to define a stable
>> buddy CPU in a low CPU load situation. The policy for defining a buddy CPU is
>> that we pack at all levels inside a node where a group of CPU can be power
>> gated independently from others. For describing this capability, a new flag
>> has been introduced SD_SHARE_POWERDOMAIN that is used to indicate whether the
>> groups of CPUs of a scheduling domain are sharing their power state. By
>> default, this flag has been set in all sched_domain in order to keep unchanged
>> the current behavior of the scheduler and only ARM platform clears the
>> SD_SHARE_POWERDOMAIN flag for MC and CPU level.
>>
>> In a 2nd step, the scheduler checks the load average of a task which wakes up
>> as well as the load average of the buddy CPU and it can decide to migrate the
>> light tasks on a not busy buddy. This check is done during the wake up because
>> small tasks tend to wake up between periodic load balance and asynchronously
>> to each other which prevents the default mechanism to catch and migrate them
>> efficiently. A light task is defined by a runnable_avg_sum that is less than
>> 20% of the runnable_avg_period. In fact, the former condition encloses 2 ones:
>> The average CPU load of the task must be less than 20% and the task must have
>> been runnable less than 10ms when it woke up last time in order to be
>> electable for the packing migration. So, a task that runs 1 ms every 5 ms
>> will be considered as a small task, but a task that runs 50 ms with a
>> period of 500 ms will not.
>> Then, the busyness of the buddy CPU depends on the load average of the rq
>> and the number of running tasks. A CPU with a load average greater than 50%
>> will be considered as a busy CPU whatever the number of running tasks is,
>> and this threshold will be reduced by the number of running tasks in order
>> to not increase too much the wake up latency of a task. When the buddy CPU
>> is busy, the scheduler falls back to the default CFS policy.
>>
>> Change since V2:
>>
>> - Migrate only a task that wakes up
>> - Change the light tasks threshold to 20%
>> - Change the loaded CPU threshold to not pull tasks if the current number of
>> running tasks is null but the load average is already greater than 50%
>> - Fix the algorithm for selecting the buddy CPU.
>>
>> Change since V1:
>>
>> Patch 2/6
>> - Change the flag name which was not clear. The new name is
>> SD_SHARE_POWERDOMAIN.
>> - Create an architecture dependent function to tune the sched_domain flags
>> Patch 3/6
>> - Fix issues in the algorithm that looks for the best buddy CPU
>> - Use pr_debug instead of pr_info
>> - Fix for uniprocessor
>> Patch 4/6
>> - Remove the use of usage_avg_sum which has not been merged
>> Patch 5/6
>> - Change the way the coherency of runnable_avg_sum and runnable_avg_period is
>> ensured
>> Patch 6/6
>> - Use the arch dependent function to set/clear SD_SHARE_POWERDOMAIN for ARM
>> platform
>>
>>
>> [...]
>

2013-03-26 12:26:32

by Peter Zijlstra

Subject: Re: [RFC PATCH v3 3/6] sched: pack small tasks

On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> +static bool is_buddy_busy(int cpu)
> +{
> +	struct rq *rq = cpu_rq(cpu);
> +
> +	/*
> +	 * A busy buddy is a CPU with a high load or a small load with a lot of
> +	 * running tasks.
> +	 */
> +	return (rq->avg.runnable_avg_sum >
> +			(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
> +}

Why does the comment talk about load but we don't see it in the
equation. Also, why does nr_running matter at all? I thought we'd
simply bother with utilization, if fully utilized we're done etc..

2013-03-26 12:37:38

by Peter Zijlstra

Subject: Re: [RFC PATCH v3 3/6] sched: pack small tasks

On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> +static bool is_light_task(struct task_struct *p)
> +{
> +	/* A light task runs less than 20% in average */
> +	return ((p->se.avg.runnable_avg_sum * 5) <
> +		(p->se.avg.runnable_avg_period));
> +}

OK, so we have a 'problem' here, we initialize runnable_avg_* to 0, but
we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into
this as well.

PJT, do you have any sane solution for this, I forgot what the result
of the last discussion was -- was there any?

2013-03-26 12:46:43

by Peter Zijlstra

Subject: Re: [RFC PATCH v3 3/6] sched: pack small tasks

On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> During the creation of sched_domain, we define a pack buddy CPU for
> each CPU when one is available. We want to pack at all levels where a
> group of CPU can be power gated independently from others.
> On a system that can't power gate a group of CPUs independently, the
> flag is set at all sched_domain level and the buddy is set to -1. This
> is the default behavior.
> On a dual clusters / dual cores system which can power gate each core
> and cluster independently, the buddy configuration will be :
>
>       |  Cluster 0  |  Cluster 1  |
>       | CPU0 | CPU1 | CPU2 | CPU3 |
> -----------------------------------
> buddy | CPU0 | CPU0 | CPU0 | CPU2 |

I suppose this is adequate for the 'small' systems you currently have;
but given that Samsung is already bragging with its 'octo'-core Exynos
5 (4+4 big-little thing) does this solution scale?

Isn't this basically related to picking the NO_HZ cpu; if the system
isn't fully symmetric with its power gates you want the NO_HZ cpu to be
the 'special' cpu. If it is symmetric we really don't care which core
is left 'running' and we can even select a new pack cpu from the idle
cores once the old one is fully utilized.

Re-using (or integrating) with NO_HZ has the dual advantage that you'll
make NO_HZ do the right thing for big-little (you typically want a
little core to be the one staying 'awake' and once someone makes NO_HZ
scale this all gets to scale along with it.

2013-03-26 12:50:48

by Peter Zijlstra

Subject: Re: [RFC PATCH v3 4/6] sched: secure access to other CPU statistics

On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> @@ -3364,13 +3364,16 @@ done:
> static bool is_buddy_busy(int cpu)
> {
> 	struct rq *rq = cpu_rq(cpu);
> +	u32 sum = rq->avg.runnable_avg_sum;
> +	u32 period = rq->avg.runnable_avg_period;
> +
> +	sum = min(sum, period);

OK this makes sense; use a simple sanity constraint instead of going
overboard on serialization -- however, why is this a separate patch?

That is, this could easily be part of the patch that introduces
is_buddy_busy(); also you likely want part of this patch's changelog
to become a comment that goes right above this min() :-)

2013-03-26 12:52:45

by Peter Zijlstra

Subject: Re: [RFC PATCH v3 5/6] sched: pack the idle load balance

On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> Look for an idle CPU close to the pack buddy CPU whenever possible.
> The goal is to prevent the wake up of a CPU which doesn't share the
> power domain of the pack buddy CPU.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> Reviewed-by: Morten Rasmussen <[email protected]>
> ---
> kernel/sched/fair.c | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b636199..52a7736 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5455,7 +5455,25 @@ static struct {
>
> static inline int find_new_ilb(int call_cpu)
> {
> +	struct sched_domain *sd;
> 	int ilb = cpumask_first(nohz.idle_cpus_mask);
> +	int buddy = per_cpu(sd_pack_buddy, call_cpu);
> +
> +	/*
> +	 * If we have a pack buddy CPU, we try to run load balance on a CPU
> +	 * that is close to the buddy.
> +	 */
> +	if (buddy != -1)
> +		for_each_domain(buddy, sd) {
> +			if (sd->flags & SD_SHARE_CPUPOWER)
> +				continue;
> +
> +			ilb = cpumask_first_and(sched_domain_span(sd),
> +						nohz.idle_cpus_mask);
> +
> +			if (ilb < nr_cpu_ids)
> +				break;
> +		}

/me hands you a fresh bucket of curlies, no reason to skimp on them.

But ha! here's your NO_HZ link.. but does the above DTRT and ensure
that the ILB is a little core when possible?

2013-03-26 13:00:53

by Vincent Guittot

Subject: Re: [RFC PATCH v3 3/6] sched: pack small tasks

On 26 March 2013 13:37, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>> +static bool is_light_task(struct task_struct *p)
>> +{
>> + /* A light task runs less than 20% in average */
>> + return ((p->se.avg.runnable_avg_sum * 5) <
>> + (p->se.avg.runnable_avg_period));
>> +}
>
> OK, so we have a 'problem' here, we initialize runnable_avg_* to 0, but
> we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into
> this as well.
Hi Peter,

The packing of small tasks is only applied at wake up and not during
fork or exec, so the runnable_avg_* fields should have been
initialized. As you mentioned, we assume that a fresh task is fully
loaded and let the default scheduler behavior select a target CPU.

Vincent

>
> PJT, do you have any sane solution for this, I forgot what the result
> of the last discussion was -- was there any?
>

2013-03-26 13:06:40

by Vincent Guittot

Subject: Re: [RFC PATCH v3 4/6] sched: secure access to other CPU statistics

On 26 March 2013 13:50, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>> @@ -3364,13 +3364,16 @@ done:
>> static bool is_buddy_busy(int cpu)
>> {
>> 	struct rq *rq = cpu_rq(cpu);
>> +	u32 sum = rq->avg.runnable_avg_sum;
>> +	u32 period = rq->avg.runnable_avg_period;
>> +
>> +	sum = min(sum, period);
>
> OK this makes sense; use a simple sanity constraint instead of going
> overboard on serialization -- however, why is this a separate patch?

There is no real reason other than explaining why I have added this
additional check.

>
> That is, this could easily be part of the patch that introduces
> is_buddy_busy(); also you likely want part of this patch's changelog
> to become a comment that goes right above this min() :-)

Yes, I'm going to do that.

>

2013-03-26 13:53:33

by Vincent Guittot

Subject: Re: [RFC PATCH v3 3/6] sched: pack small tasks

On 26 March 2013 13:46, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>> During the creation of sched_domain, we define a pack buddy CPU for
>> each CPU when one is available. We want to pack at all levels where a
>> group of CPU can be power gated independently from others.
>> On a system that can't power gate a group of CPUs independently, the
>> flag is set at all sched_domain level and the buddy is set to -1. This
>> is the default behavior.
>> On a dual clusters / dual cores system which can power gate each core
>> and cluster independently, the buddy configuration will be :
>>
>>       |  Cluster 0  |  Cluster 1  |
>>       | CPU0 | CPU1 | CPU2 | CPU3 |
>> -----------------------------------
>> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>
> I suppose this is adequate for the 'small' systems you currently have;
> but given that Samsung is already bragging with its 'octo'-core Exynos
> 5 (4+4 big-little thing) does this solution scale?

The packing is only done at MC and CPU level to minimize the number of
transitions.

>
> Isn't this basically related to picking the NO_HZ cpu; if the system
> isn't fully symmetric with its power gates you want the NO_HZ cpu to be
> the 'special' cpu. If it is symmetric we really don't care which core
> is left 'running' and we can even select a new pack cpu from the idle
> cores once the old one is fully utilized.

I agree that on a symmetric system we don't really care which core is
selected, but we want to use the same one whenever possible to prevent
a ping-pong between several cores or groups of cores, which is power
consuming. By forcing a NOHZ cpu, your background activity will
smoothly pack on this CPU and will not be spread over your system.
When a CPU is fully loaded, we are no longer in a low CPU load use case
and the periodic load balance can handle the situation and select a new
target CPU which is close to the buddy CPU.

>
> Re-using (or integrating) with NO_HZ has the dual advantage that you'll
> make NO_HZ do the right thing for big-little (you typically want a
> little core to be the one staying 'awake' and once someone makes NO_HZ
> scale this all gets to scale along with it.
>

I think that you have answered this question in your comment on patch
5, haven't you?

Vincent

2013-03-26 14:03:45

by Vincent Guittot

Subject: Re: [RFC PATCH v3 5/6] sched: pack the idle load balance

On 26 March 2013 13:52, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>> Look for an idle CPU close to the pack buddy CPU whenever possible.
>> The goal is to prevent the wake up of a CPU which doesn't share the
>> power domain of the pack buddy CPU.
>>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> Reviewed-by: Morten Rasmussen <[email protected]>
>> ---
>> kernel/sched/fair.c | 18 ++++++++++++++++++
>> 1 file changed, 18 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b636199..52a7736 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5455,7 +5455,25 @@ static struct {
>>
>> static inline int find_new_ilb(int call_cpu)
>> {
>> +	struct sched_domain *sd;
>> 	int ilb = cpumask_first(nohz.idle_cpus_mask);
>> +	int buddy = per_cpu(sd_pack_buddy, call_cpu);
>> +
>> +	/*
>> +	 * If we have a pack buddy CPU, we try to run load balance on a CPU
>> +	 * that is close to the buddy.
>> +	 */
>> +	if (buddy != -1)
>> +		for_each_domain(buddy, sd) {
>> +			if (sd->flags & SD_SHARE_CPUPOWER)
>> +				continue;
>> +
>> +			ilb = cpumask_first_and(sched_domain_span(sd),
>> +						nohz.idle_cpus_mask);
>> +
>> +			if (ilb < nr_cpu_ids)
>> +				break;
>> +		}
>
> /me hands you a fresh bucket of curlies, no reason to skimp on them.

ok

>
> But ha! here's your NO_HZ link.. but does the above DTRT and ensure
> that the ILB is a little core when possible?

The loop looks for an idle CPU as close as possible to the buddy CPU,
and the buddy CPU is the 1st CPU that has been chosen. So if your buddy
is a little and there is an idle little, the ILB will be this idle
little.

>
>

2013-03-26 14:42:21

by Peter Zijlstra

Subject: Re: [RFC PATCH v3 5/6] sched: pack the idle load balance

On Tue, 2013-03-26 at 15:03 +0100, Vincent Guittot wrote:
> > But ha! here's your NO_HZ link.. but does the above DTRT and ensure
> > that the ILB is a little core when possible?
>
> The loop looks for an idle CPU as close as possible to the buddy CPU
> and the buddy CPU is the 1st CPU has been chosen. So if your buddy is
> a little and there is an idle little, the ILB will be this idle
> little.

Earlier you wrote:

> | Cluster 0 | Cluster 1 |
> | CPU0 | CPU1 | CPU2 | CPU3 |
> -----------------------------------
> buddy | CPU0 | CPU0 | CPU0 | CPU2 |

So extrapolating that to a 4+4 big-little you'd get something like:

>       |   little A9   ||    big A15    |
>       | 0 | 1 | 2 | 3 || 4 | 5 | 6 | 7 |
> ------+---+---+---+---++---+---+---+---+
> buddy | 0 | 0 | 0 | 0 || 0 | 4 | 4 | 4 |

Right?

So supposing the current ILB is 6, we'll only check 4, not 0-3, even
though there might be a perfectly idle cpu in there.

Also, your scheme fails to pack when cpus 0,4 are filled, even when
there's idle cores around.

If we'd use the ILB as packing cpu, we would simply select a next pack
target once the old one fills up.

2013-03-26 15:29:59

by Arjan van de Ven

Subject: Re: [RFC PATCH v3 3/6] sched: pack small tasks


> Isn't this basically related to picking the NO_HZ cpu; if the system
> isn't fully symmetric with its power gates you want the NO_HZ cpu to be
> the 'special' cpu. If it is symmetric we really don't care which core
> is left 'running' and we can even select a new pack cpu from the idle
> cores once the old one is fully utilized.

you don't really care much, sure, but there are some advantages to sorting "all the way left",
e.g. to linux cpu 0.
Some tasks only run there, and interrupts tend to be favored to that cpu as well on x86.

2013-03-26 15:55:39

by Vincent Guittot

Subject: Re: [RFC PATCH v3 5/6] sched: pack the idle load balance

On 26 March 2013 15:42, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2013-03-26 at 15:03 +0100, Vincent Guittot wrote:
>> > But ha! here's your NO_HZ link.. but does the above DTRT and ensure
>> > that the ILB is a little core when possible?
>>
>> The loop looks for an idle CPU as close as possible to the buddy CPU
>> and the buddy CPU is the 1st CPU has been chosen. So if your buddy is
>> a little and there is an idle little, the ILB will be this idle
>> little.
>
> Earlier you wrote:
>
>> | Cluster 0 | Cluster 1 |
>> | CPU0 | CPU1 | CPU2 | CPU3 |
>> -----------------------------------
>> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>
> So extrapolating that to a 4+4 big-little you'd get something like:
>
> | little A9 || big A15 |
> | 0 | 1 | 2 | 3 || 4 | 5 | 6 | 7 |
> ------+---+---+---+---++---+---+---+---+
> buddy | 0 | 0 | 0 | 0 || 0 | 4 | 4 | 4 |
>
> Right?

yes

>
> So supposing the current ILB is 6, we'll only check 4, not 0-3, even
> though there might be a perfectly idle cpu in there.

We will check 4,5,7 at MC level in order to pack in the group of A15
(because they are not sharing the same power domain). If none of them
are idle, we will look at CPU level and will check CPUs 0-3.

>
> Also, your scheme fails to pack when cpus 0,4 are filled, even when
> there's idle cores around.

The primary target is to pack the tasks only when the system is not
busy, so you will get a power improvement without a performance
decrease. The is_light_task function returns false and the
is_buddy_busy function returns true before the buddy is fully loaded,
and the scheduler then falls back into the default behavior, which
spreads tasks and races to idle.

We can extend the buddy CPU and the packing mechanism to fill one CPU
before filling another buddy, but it's not always the best choice for
performance and/or power, and it would thus imply having a knob to
select this full-packing mode.

>
> If we'd use the ILB as packing cpu, we would simply select a next pack
> target once the old one fills up.

Yes, we will be able to pack the long running tasks and the wake up
will take care of the short tasks

>

2013-03-27 04:34:47

by Preeti U Murthy

Subject: Re: [RFC PATCH v3 3/6] sched: pack small tasks

Hi Peter,

On 03/26/2013 06:07 PM, Peter Zijlstra wrote:
> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>> +static bool is_light_task(struct task_struct *p)
>> +{
>> + /* A light task runs less than 20% in average */
>> + return ((p->se.avg.runnable_avg_sum * 5) <
>> + (p->se.avg.runnable_avg_period));
>> +}
>
> OK, so we have a 'problem' here, we initialize runnable_avg_* to 0, but
> we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into
> this as well.
>
> PJT, do you have any sane solution for this, I forgot what the result
> of the last discussion was -- was there any?

The conclusion after the last discussion between PJT and Alex was that
the load contribution of a fresh task be set to "full" during
__sched_fork().

Setting task->se.avg.load_avg_contrib = task->se.load.weight during
__sched_fork() is reflected in the latest power aware scheduler patchset
by Alex.

Thanks

Regards
Preeti U Murthy
>

2013-03-27 04:48:25

by Alex Shi

Subject: Re: [RFC PATCH v3 3/6] sched: pack small tasks

On 03/27/2013 12:33 PM, Preeti U Murthy wrote:
> Hi Peter,
>
> On 03/26/2013 06:07 PM, Peter Zijlstra wrote:
>> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>>> +static bool is_light_task(struct task_struct *p)
>>> +{
>>> + /* A light task runs less than 20% in average */
>>> + return ((p->se.avg.runnable_avg_sum * 5) <
>>> + (p->se.avg.runnable_avg_period));
>>> +}
>>
>> OK, so we have a 'problem' here, we initialize runnable_avg_* to 0, but
>> we want to 'assume' a fresh task is fully 'loaded'. IIRC Alex ran into
>> this as well.
>>
>> PJT, do you have any sane solution for this, I forgot what the result
>> of the last discussion was -- was there any?
>
> The conclusion after last discussion between PJT and Alex was that the
> load contribution of a fresh task be set to "full" during "__sched_fork()".
>
> task->se.avg.load_avg_contrib = task->se.load.weight during
> __sched_fork() is reflected in the latest power aware scheduler patchset
> by Alex.

Yes, the newly forked task's runnable load is set to full utilisation
in the V5 power aware scheduling patchset. PJT, Mike and I all agree on
this. PJT just discussed how to give the full load to a newly forked
task, and we reached agreement for my coming V6 power aware scheduling
patchset.

>
> Thanks
>
> Regards
> Preeti U Murthy
>>
>


--
Thanks Alex

2013-03-27 04:57:12

by Alex Shi

Subject: Re: [RFC PATCH v3 5/6] sched: pack the idle load balance

On 03/26/2013 11:55 PM, Vincent Guittot wrote:
>> > So extrapolating that to a 4+4 big-little you'd get something like:
>> >
>> > | little A9 || big A15 |
>> > | 0 | 1 | 2 | 3 || 4 | 5 | 6 | 7 |
>> > ------+---+---+---+---++---+---+---+---+
>> > buddy | 0 | 0 | 0 | 0 || 0 | 4 | 4 | 4 |
>> >
>> > Right?
> yes
>
>> >
>> > So supposing the current ILB is 6, we'll only check 4, not 0-3, even
>> > though there might be a perfectly idle cpu in there.
> We will check 4,5,7 at MC level in order to pack in the group of A15
> (because they are not sharing the same power domain). If none of them
> are idle, we will look at CPU level and will check CPUs 0-3.

So you increase a fixed step here.
>
>> >
>> > Also, your scheme fails to pack when cpus 0,4 are filled, even when
>> > there's idle cores around.
> The primary target is to pack the tasks only when we are in a not busy
> system so you will have a power improvement without performance
> decrease. is_light_task function returns false and is_buddy_busy
> function true before the buddy is fully loaded and the scheduler will
> fall back into the default behavior which spreads tasks and races to
> idle.
>
> We can extend the buddy CPU and the packing mechanism to fill one CPU
> before filling another buddy but it's not always the best choice for
> performance and/or power and thus it will imply to have a knob to
> select this full packing mode.

Using just one buddy to pack tasks for a whole level of cpus definitely
has a scalability problem. That is not good for powersaving in most
scenarios.


--
Thanks Alex

2013-03-27 08:05:27

by Vincent Guittot

Subject: Re: [RFC PATCH v3 5/6] sched: pack the idle load balance

On 27 March 2013 05:56, Alex Shi <[email protected]> wrote:
> On 03/26/2013 11:55 PM, Vincent Guittot wrote:
>>> > So extrapolating that to a 4+4 big-little you'd get something like:
>>> >
>>> > | little A9 || big A15 |
>>> > | 0 | 1 | 2 | 3 || 4 | 5 | 6 | 7 |
>>> > ------+---+---+---+---++---+---+---+---+
>>> > buddy | 0 | 0 | 0 | 0 || 0 | 4 | 4 | 4 |
>>> >
>>> > Right?
>> yes
>>
>>> >
>>> > So supposing the current ILB is 6, we'll only check 4, not 0-3, even
>>> > though there might be a perfectly idle cpu in there.
>> We will check 4,5,7 at MC level in order to pack in the group of A15
>> (because they are not sharing the same power domain). If none of them
>> are idle, we will look at CPU level and will check CPUs 0-3.
>
> So you increase a fixed step here.

I have modified the find_new_ilb function to look for the best idle
CPU instead of just picking the first CPU of idle_cpus_mask.

>>
>>> >
>>> > Also, your scheme fails to pack when cpus 0,4 are filled, even when
>>> > there's idle cores around.
>> The primary target is to pack the tasks only when we are in a not busy
>> system so you will have a power improvement without performance
>> decrease. is_light_task function returns false and is_buddy_busy
>> function true before the buddy is fully loaded and the scheduler will
>> fall back into the default behavior which spreads tasks and races to
>> idle.
>>
>> We can extend the buddy CPU and the packing mechanism to fill one CPU
>> before filling another buddy but it's not always the best choice for
>> performance and/or power and thus it will imply to have a knob to
>> select this full packing mode.
>
> Just one buddy CPU to pack tasks for a whole level of CPUs definitely
> has a scalability problem. That is not good for power saving in most scenarios.
>

This patch doesn't aim to pack all kinds of tasks in every scenario,
but only the small tasks that run for less than 10ms, and only when the
CPU is not already too busy with other tasks, so you don't have to cope
with long wake-up latency or a performance regression, and only one CPU
will be powered up for these background activities. Nevertheless, I can
extend the small-task packing to pack all tasks in any scenario into as
few CPUs as possible. This would imply choosing a new buddy CPU when the
previous one is full during the ILB selection, for example, and adding
a knob to select this mode, which would modify the performance of the
system. But the primary target is not to have a knob and not to reduce
performance in most scenarios.
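For readers following along, the light-task criterion from the cover letter (a runnable_avg_sum below 20% of the runnable_avg_period) can be sketched in plain C. The struct below is a hypothetical stand-in for the kernel's per-entity load-tracking fields, not the patch's actual code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-entity load-tracking fields; the
 * real kernel keeps these in struct sched_avg on the sched entity. */
struct task_stats {
	unsigned int runnable_avg_sum;    /* time the task was runnable */
	unsigned int runnable_avg_period; /* total tracked period */
};

/* A task is "light" when its runnable_avg_sum is below 20% of its
 * runnable_avg_period, i.e. sum * 5 < period. */
static bool is_light_task(const struct task_stats *p)
{
	return p->runnable_avg_sum * 5 < p->runnable_avg_period;
}
```

This 20% bound is what the cover letter ties to tasks that were runnable for less than about 10ms when they last woke up, limiting packing to short-running background activity.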

Regards,
Vincent

>
> --
> Thanks Alex

2013-03-27 08:47:38

by Alex Shi

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/6] sched: pack the idle load balance


>>>>> So supposing the current ILB is 6, we'll only check 4, not 0-3, even
>>>>> though there might be a perfectly idle cpu in there.
>>> We will check 4,5,7 at MC level in order to pack in the group of A15
>>> (because they are not sharing the same power domain). If none of them
>>> are idle, we will look at CPU level and will check CPUs 0-3.
>>
>> So you increase a fixed step here.
>
> I have modified the find_new_ilb function to look for the best idle
> CPU instead of just picking the first CPU of idle_cpus_mask.

That's better.
But using a fixed buddy is still not flexible, and it involves more
checking in this time-critical balancing path.
Consider that in most SMP systems the CPUs are equal, so any other CPU
can play the role of buddy in your design. That suggests that having no
fixed buddy CPU is better, as in my version of packing.

>
>>>
>>>>>
>>>>> Also, your scheme fails to pack when cpus 0,4 are filled, even when
>>>>> there's idle cores around.
>>> The primary target is to pack the tasks only when we are in a not busy
>>> system so you will have a power improvement without performance
>>> decrease. is_light_task function returns false and is_buddy_busy
>>> function true before the buddy is fully loaded and the scheduler will
>>> fall back into the default behavior which spreads tasks and races to
>>> idle.
>>>
>>> We can extend the buddy CPU and the packing mechanism to fill one CPU
>>> before filling another buddy but it's not always the best choice for
>>> performance and/or power and thus it will imply to have a knob to
>>> select this full packing mode.
>>
>> Just one buddy CPU to pack tasks for a whole level of CPUs definitely
>> has a scalability problem. That is not good for power saving in most scenarios.
>>
>
> This patch doesn't aim to pack all kinds of tasks in every scenario,
> but only the small tasks that run for less than 10ms, and only when the
> CPU is not already too busy with other tasks, so you don't have to cope
> with long wake-up latency or a performance regression, and only one CPU
> will be powered up for these background activities. Nevertheless, I can
> extend the small-task packing to pack all tasks in any scenario into as
> few CPUs as possible. This would imply choosing a new buddy CPU when the
> previous one is full during the ILB selection, for example, and adding
> a knob to select this mode, which would modify the performance of the
> system. But the primary target is not to have a knob and not to reduce
> performance in most scenarios.

Arguing about the performance/power balance doesn't make much sense
without a detailed scenario. We just want to seek a flexible compromise.
But a fixed buddy CPU is not flexible, and it may miss many possible
power-saving scenarios on x86 systems. For example, if 2 SMT CPUs can
handle all the tasks, we don't need to wake another core; or if 2 cores
in one socket can handle the tasks, we also don't need to wake up
another socket.
>
> Regards,
> Vincent
>
>>
>> --
>> Thanks Alex


--
Thanks Alex

2013-03-27 08:49:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/6] sched: pack the idle load balance

On Wed, 2013-03-27 at 12:56 +0800, Alex Shi wrote:

> Just one buddy CPU to pack tasks for a whole level of CPUs definitely
> has a scalability problem.

Right, but note we already have this scalability problem in the form of
the ILB. Some people were working on sorting that but then someone
changed jobs and I think it fell in some deep and dark crack.

Venki, Suresh?

2013-03-27 08:51:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/6] sched: pack small tasks

On Wed, 2013-03-27 at 12:48 +0800, Alex Shi wrote:
>
> Yes, the new forked runnable load was set as full utilisation in V5
> power aware scheduling. PJT, Mike and I agree on this. PJT just
> discussed how to give the full load to a newly forked task, and we got
> agreement in my coming V6 power aware scheduling patchset.

Great! That means I can mark the v5 thread read! :-)

2013-03-27 08:54:17

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/6] sched: pack small tasks

On 27 March 2013 09:46, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2013-03-26 at 08:29 -0700, Arjan van de Ven wrote:
>> > Isn't this basically related to picking the NO_HZ cpu; if the system
>> > isn't fully symmetric with its power gates you want the NO_HZ cpu to be
>> > the 'special' cpu. If it is symmetric we really don't care which core
>> > is left 'running' and we can even select a new pack cpu from the idle
>> > cores once the old one is fully utilized.
>>
>> you don't really care much sure, but there's some advantages for sorting "all the way left",
>> e.g. to linux cpu 0.
>> Some tasks only run there, and interrupts tend to be favored to that cpu as well on x86.
>
> Right, and I suspect all the big-little nonsense will have the little
> cores on low numbers as well (is this architected or can a creative
> licensee screw us over?)

It's not mandatory to have little cores on low numbers, even if it's advised.

>
> So find_new_ilb() already does cpumask_first(), so it has a strong
> leftmost preference. We just need to make sure it indeed does the right
> thing and doesn't have some unintended side effect.
>

2013-03-27 08:54:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/6] sched: pack small tasks

On Tue, 2013-03-26 at 08:29 -0700, Arjan van de Ven wrote:
> > Isn't this basically related to picking the NO_HZ cpu; if the system
> > isn't fully symmetric with its power gates you want the NO_HZ cpu to be
> > the 'special' cpu. If it is symmetric we really don't care which core
> > is left 'running' and we can even select a new pack cpu from the idle
> > cores once the old one is fully utilized.
>
> you don't really care much sure, but there's some advantages for sorting "all the way left",
> e.g. to linux cpu 0.
> Some tasks only run there, and interrupts tend to be favored to that cpu as well on x86.

Right, and I suspect all the big-little nonsense will have the little
cores on low numbers as well (is this architected or can a creative
licensee screw us over?)

So find_new_ilb() already does cpumask_first(), so it has a strong
leftmost preference. We just need to make sure it indeed does the right
thing and doesn't have some unintended side effect.
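A toy model of that leftmost preference, assuming a simple bitmask of idle CPUs (the kernel uses struct cpumask and cpumask_first(), so this is only an illustration):

```c
#include <assert.h>

/* Model of cpumask_first() over an idle-CPU bitmask: return the lowest
 * set bit, i.e. the lowest-numbered idle CPU, or -1 if none is idle. */
static int first_idle_cpu(unsigned int idle_mask)
{
	for (int cpu = 0; cpu < 32; cpu++)
		if (idle_mask & (1u << cpu))
			return cpu;
	return -1; /* no idle CPU */
}
```

Because the lowest-numbered idle CPU always wins, the ILB selection naturally sorts "all the way left" toward CPU 0, as Arjan suggests.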

2013-03-27 09:00:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/6] sched: pack small tasks

On Wed, 2013-03-27 at 09:54 +0100, Vincent Guittot wrote:
> It's not mandatory to have little cores on low numbers even if it's
> advised

ARGH!

2013-03-27 10:23:33

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/6] sched: pack small tasks

Hi,

On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>> +static bool is_buddy_busy(int cpu)
>> +{
>> +	struct rq *rq = cpu_rq(cpu);
>> +
>> +	/*
>> +	 * A busy buddy is a CPU with a high load or a small load with a
>> +	 * lot of running tasks.
>> +	 */
>> +	return (rq->avg.runnable_avg_sum >
>> +		(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
>> +}
>
> Why does the comment talk about load but we don't see it in the
> equation. Also, why does nr_running matter at all? I thought we'd
> simply bother with utilization, if fully utilized we're done etc..
>

Peter, let's say the run-queue has 50% utilization and is running 2
tasks, and we wish to find out if it is busy. We would compare this
metric with the cpu power, which let's say is 100.

rq->util * 100 < cpu_of(rq)->power.

In the above scenario would we declare the cpu _not_busy? Or would we do
the following:

(rq->util * 100) * #nr_running < cpu_of(rq)->power and conclude that it
is just enough _busy_ to not take on more processes?


@Vincent: Yes the comment above needs to be fixed. A busy buddy is a CPU
with *high rq utilization*, as far as the equation goes.

Regards
Preeti U Murthy

2013-03-27 10:30:08

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/6] sched: pack the idle load balance

On 27 March 2013 09:47, Alex Shi <[email protected]> wrote:
>
>>>>>> So supposing the current ILB is 6, we'll only check 4, not 0-3, even
>>>>>> though there might be a perfectly idle cpu in there.
>>>> We will check 4,5,7 at MC level in order to pack in the group of A15
>>>> (because they are not sharing the same power domain). If none of them
>>>> are idle, we will look at CPU level and will check CPUs 0-3.
>>>
>>> So you increase a fixed step here.
>>
>> I have modified the find_new_ilb function to look for the best idle
>> CPU instead of just picking the first CPU of idle_cpus_mask.
>
> That's better.
> But using a fixed buddy is still not flexible, and it involves more
> checking in this time-critical balancing path.
> Consider that in most SMP systems the CPUs are equal, so any other CPU
> can play the role of buddy in your design. That suggests that having no
> fixed buddy CPU is better, as in my version of packing.
>
>>
>>>>
>>>>>>
>>>>>> Also, your scheme fails to pack when cpus 0,4 are filled, even when
>>>>>> there's idle cores around.
>>>> The primary target is to pack the tasks only when we are in a not busy
>>>> system so you will have a power improvement without performance
>>>> decrease. is_light_task function returns false and is_buddy_busy
>>>> function true before the buddy is fully loaded and the scheduler will
>>>> fall back into the default behavior which spreads tasks and races to
>>>> idle.
>>>>
>>>> We can extend the buddy CPU and the packing mechanism to fill one CPU
>>>> before filling another buddy but it's not always the best choice for
>>>> performance and/or power and thus it will imply to have a knob to
>>>> select this full packing mode.
>>>
>>> Just one buddy CPU to pack tasks for a whole level of CPUs definitely
>>> has a scalability problem. That is not good for power saving in most scenarios.
>>>
>>
>> This patch doesn't aim to pack all kinds of tasks in every scenario,
>> but only the small tasks that run for less than 10ms, and only when the
>> CPU is not already too busy with other tasks, so you don't have to cope
>> with long wake-up latency or a performance regression, and only one CPU
>> will be powered up for these background activities. Nevertheless, I can
>> extend the small-task packing to pack all tasks in any scenario into as
>> few CPUs as possible. This would imply choosing a new buddy CPU when the
>> previous one is full during the ILB selection, for example, and adding
>> a knob to select this mode, which would modify the performance of the
>> system. But the primary target is not to have a knob and not to reduce
>> performance in most scenarios.
>
> Arguing about the performance/power balance doesn't make much sense
> without a detailed scenario. We just want to seek a flexible compromise.
> But a fixed buddy CPU is not flexible, and it may miss many possible
> power-saving scenarios on x86 systems. For example, if 2 SMT CPUs can
> handle all the tasks, we don't need to wake another core; or if 2 cores
> in one socket can handle the tasks, we also don't need to wake up
> another socket.

Using 2 SMT CPUs for all tasks implies accepting latency and sharing
resources like cache and memory bandwidth, so it means that you also
accept some potential performance decrease, which implies that someone
must select this mode with a knob.
The primary goal of the patchset is not to select between powersaving
and performance but to stay in performance mode. We pack the small tasks
on one CPU so the performance will not decrease, but the low-load
scenarios will consume less power. Then, I can add another step which is
more aggressively power saving, with a potential cost in performance,
and in this case the buddy CPU will be updated dynamically according to
the system load.

>>
>> Regards,
>> Vincent
>>
>>>
>>> --
>>> Thanks Alex
>
>
> --
> Thanks Alex

2013-03-27 11:00:43

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/6] sched: pack small tasks

On 27 March 2013 11:21, Preeti U Murthy <[email protected]> wrote:
> Hi,
>
> On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
>> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>>> +static bool is_buddy_busy(int cpu)
>>> +{
>>> +	struct rq *rq = cpu_rq(cpu);
>>> +
>>> +	/*
>>> +	 * A busy buddy is a CPU with a high load or a small load with a
>>> +	 * lot of running tasks.
>>> +	 */
>>> +	return (rq->avg.runnable_avg_sum >
>>> +		(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
>>> +}
>>
>> Why does the comment talk about load but we don't see it in the
>> equation. Also, why does nr_running matter at all? I thought we'd
>> simply bother with utilization, if fully utilized we're done etc..
>>
>
> Peter, let's say the run-queue has 50% utilization and is running 2
> tasks, and we wish to find out if it is busy. We would compare this
> metric with the cpu power, which let's say is 100.
>
> rq->util * 100 < cpu_of(rq)->power.

I don't use cpu_of(rq)->power in the definition of the busyness.

>
> In the above scenario would we declare the cpu _not_busy? Or would we do
> the following:

In the above scenario, the CPU is busy.

By load, I mean: 100 * avg.runnable_avg_sum / avg.runnable_avg_period.
In addition, I take into account the number of tasks already in the
runqueue in order to define the busyness of a CPU. A CPU with a load of
50% and no task in its runqueue is not busy at this time and we can
migrate tasks onto it, but if the CPU already has 2 tasks in its
runqueue, it means that a newly woken task will have to share the CPU
with the other tasks, so we consider that the CPU is already busy and we
fall back to the default behavior. The equation considers that a CPU is
not busy if
100 * avg.runnable_avg_sum / avg.runnable_avg_period < 100 / (nr_running + 2)
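For clarity, that inequality can be written out as a small C sketch. The struct below is a hypothetical stand-in for the runqueue fields the patch uses (the real code reads rq->avg and rq->nr_running directly):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the runqueue fields used by the patch's
 * is_buddy_busy(); the real kernel reads rq->avg and rq->nr_running. */
struct rq_stats {
	unsigned int runnable_avg_sum;    /* time the rq was runnable */
	unsigned int runnable_avg_period; /* total tracked period */
	unsigned int nr_running;          /* tasks currently on the rq */
};

/* The buddy is busy when its utilization exceeds 100 / (nr_running + 2)
 * percent, i.e. runnable_avg_sum > runnable_avg_period / (nr_running + 2). */
static bool is_buddy_busy(const struct rq_stats *rq)
{
	return rq->runnable_avg_sum >
	       rq->runnable_avg_period / (rq->nr_running + 2);
}
```

So a CPU at 50% load with 2 queued tasks counts as busy (the threshold is 25%), while the same 50% load with an empty runqueue does not (threshold 50%), matching the scenario above.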

>
> (rq->util * 100) * #nr_running < cpu_of(rq)->power and conclude that it
> is just enough _busy_ to not take on more processes?
>
>
> @Vincent: Yes the comment above needs to be fixed. A busy buddy is a CPU
> with *high rq utilization*, as far as the equation goes.

I can update the comment. Is the comment below clearer?

/*
 * A busy buddy is a CPU with a high average running time, or a small
 * average running time but a lot of running tasks in its runqueue
 * which are already sharing this CPU time.
 */

Vincent

>
> Regards
> Preeti U Murthy
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2013-03-27 13:32:43

by Alex Shi

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/6] sched: pack the idle load balance

On 03/27/2013 06:30 PM, Vincent Guittot wrote:
>> Arguing about the performance/power balance doesn't make much sense
>> without a detailed scenario. We just want to seek a flexible compromise.
>> But a fixed buddy CPU is not flexible, and it may miss many possible
>> power-saving scenarios on x86 systems. For example, if 2 SMT CPUs can
>> handle all the tasks, we don't need to wake another core; or if 2 cores
>> in one socket can handle the tasks, we also don't need to wake up
>> another socket.
> Using 2 SMT CPUs for all tasks implies accepting latency and sharing
> resources like cache and memory bandwidth, so it means that you also
> accept some potential performance decrease, which implies that someone
> must select this mode with a knob.
> The primary goal of the patchset is not to select between powersaving
> and performance but to stay in performance mode. We pack the small tasks
> on one CPU so the performance will not decrease, but the low-load
> scenarios will consume less power. Then, I can add another step which is
> more aggressively power saving, with a potential cost in performance,
> and in this case the buddy CPU will be updated dynamically according to
> the system load.
>

Prediction of small-task behavior is often wrong, so for performance
purposes, packing tasks is a bad idea.

--
Thanks
Alex

2013-03-27 15:37:09

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/6] sched: pack small tasks

On Wed, 27 Mar 2013, Vincent Guittot wrote:

> On 27 March 2013 09:46, Peter Zijlstra <[email protected]> wrote:
> > On Tue, 2013-03-26 at 08:29 -0700, Arjan van de Ven wrote:
> >> > Isn't this basically related to picking the NO_HZ cpu; if the system
> >> > isn't fully symmetric with its power gates you want the NO_HZ cpu to be
> >> > the 'special' cpu. If it is symmetric we really don't care which core
> >> > is left 'running' and we can even select a new pack cpu from the idle
> >> > cores once the old one is fully utilized.
> >>
> >> you don't really care much sure, but there's some advantages for sorting "all the way left",
> >> e.g. to linux cpu 0.
> >> Some tasks only run there, and interrupts tend to be favored to that cpu as well on x86.
> >
> > Right, and I suspect all the big-little nonsense will have the little
> > cores on low numbers as well (is this architected or can a creative
> > licensee screw us over?)
>
> It's not mandatory to have little cores on low numbers even if it's advised

We can trivially move things around in the logical CPU mapping if that
simplifies things. However, the boot CPU might not be CPU0 in that
case, which might be less trivial.


Nicolas

2013-03-27 17:18:58

by Nicolas Pitre

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/6] sched: pack small tasks

On Wed, 27 Mar 2013, Catalin Marinas wrote:

> So if the above works, the scheduler guys can mandate that little CPUs
> are always first and for ARM it would be a matter of getting the right
> CPU topology in the DT (independent of what hw vendors think of CPU
> topology) and booting Linux on CPU 4 etc.

Just a note about that: if the scheduler mandates little CPUs first,
that should _not_ have any implications on the DT content. DT is not
about encoding Linux specific implementation details. It is simple
enough to tweak the CPU logical map at run time when enumerating CPUs.


Nicolas

2013-03-27 17:41:51

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/6] sched: pack small tasks

On Wed, Mar 27, 2013 at 05:18:53PM +0000, Nicolas Pitre wrote:
> On Wed, 27 Mar 2013, Catalin Marinas wrote:
>
> > So if the above works, the scheduler guys can mandate that little CPUs
> > are always first and for ARM it would be a matter of getting the right
> > CPU topology in the DT (independent of what hw vendors think of CPU
> > topology) and booting Linux on CPU 4 etc.
>
> Just a note about that: if the scheduler mandates little CPUs first,
> that should _not_ have any implications on the DT content. DT is not
> about encoding Linux specific implementation details. It is simple
> enough to tweak the CPU logical map at run time when enumerating CPUs.

You are right, though a simpler way (hack) to tweak the cpu_logical_map
is to change the DT ;).

But the problem is that the kernel doesn't know which CPU is big and
which is little, unless you specify this in some way via the DT. It can
be the cpu nodes order or some other means.

--
Catalin

2013-04-05 11:08:06

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/6] sched: pack the idle load balance

Peter,

After some thoughts about your comments, I can update the buddy CPU
during the ILB or periodic LB to a new idle core and extend the packing
mechanism. Does this additional mechanism sound better to you?

Vincent

On 26 March 2013 15:42, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2013-03-26 at 15:03 +0100, Vincent Guittot wrote:
>> > But ha! here's your NO_HZ link.. but does the above DTRT and ensure
>> > that the ILB is a little core when possible?
>>
>> The loop looks for an idle CPU as close as possible to the buddy CPU
>> and the buddy CPU is the 1st CPU has been chosen. So if your buddy is
>> a little and there is an idle little, the ILB will be this idle
>> little.
>
> Earlier you wrote:
>
>> | Cluster 0 | Cluster 1 |
>> | CPU0 | CPU1 | CPU2 | CPU3 |
>> -----------------------------------
>> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>
> So extrapolating that to a 4+4 big-little you'd get something like:
>
> | little A9 || big A15 |
> | 0 | 1 | 2 | 3 || 4 | 5 | 6 | 7 |
> ------+---+---+---+---++---+---+---+---+
> buddy | 0 | 0 | 0 | 0 || 0 | 4 | 4 | 4 |
>
> Right?
>
> So supposing the current ILB is 6, we'll only check 4, not 0-3, even
> though there might be a perfectly idle cpu in there.
>
> Also, your scheme fails to pack when cpus 0,4 are filled, even when
> there's idle cores around.
>
> If we'd use the ILB as packing cpu, we would simply select a next pack
> target once the old one fills up.
>
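The buddy tables in the quoted exchange can be reproduced with a small sketch. The cascade rule (each CPU points at the first CPU of its own cluster, and the first CPU of a later cluster points at CPU 0) is inferred from the two examples in the thread, not taken from the patch itself:

```c
#include <assert.h>

/* Build a per-CPU buddy map from cluster start indices, following the
 * pattern of the tables in this thread: CPUs point at the first CPU of
 * their own cluster, and the first CPU of each later cluster points at
 * the first CPU of cluster 0, so packing can cascade left. */
static void build_buddy_map(const int *cluster_first, int nr_clusters,
			    int nr_cpus, int *buddy)
{
	for (int c = 0; c < nr_clusters; c++) {
		int first = cluster_first[c];
		int end = (c + 1 < nr_clusters) ? cluster_first[c + 1]
						: nr_cpus;

		/* first CPU of a later cluster cascades toward CPU 0 */
		buddy[first] = cluster_first[0];
		for (int cpu = first + 1; cpu < end; cpu++)
			buddy[cpu] = first;
	}
}
```

For a 2+2 system (clusters starting at CPUs 0 and 2) this yields {0, 0, 0, 2}, and for the 4+4 big.LITTLE example it yields {0, 0, 0, 0, 0, 4, 4, 4}, matching both tables.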

2013-04-22 05:49:10

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/6] sched: pack the idle load balance

Hi Vincent,

On 04/05/2013 04:38 PM, Vincent Guittot wrote:
> Peter,
>
> After some thoughts about your comments, I can update the buddy CPU
> during the ILB or periodic LB to a new idle core and extend the packing
> mechanism. Does this additional mechanism sound better to you?

If the primary goal of this patchset is to pack small tasks in fewer
power domains, then why not see if the power aware scheduler patchset by
Alex does the same for you? The reasons being:

a. The power aware scheduler also checks if a task is small enough to be
packed on a cpu which has just enough capacity to take on that task (the
leader cpu). This cpu belongs to a scheduler group which is nearly full
(the group_leader), so we end up packing tasks.

b. The overhead of assigning a buddy cpu gets eliminated because the
best cpu for packing is decided during wake-up.

c. This is a scalable solution because if the leader cpu is busy, any
other idle cpu from that group_leader is chosen. Eventually you end up
packing anyway.

The reason I am suggesting this is that we could unify the power
awareness of the scheduler under one umbrella. And I believe that the
current power aware scheduler patchset is flexible enough to do this and
that we should cash in on it.

Thanks

Regards
Preeti U Murthy
>
> Vincent
>
> On 26 March 2013 15:42, Peter Zijlstra <[email protected]> wrote:
>> On Tue, 2013-03-26 at 15:03 +0100, Vincent Guittot wrote:
>>>> But ha! here's your NO_HZ link.. but does the above DTRT and ensure
>>>> that the ILB is a little core when possible?
>>>
>>> The loop looks for an idle CPU as close as possible to the buddy CPU
>>> and the buddy CPU is the 1st CPU has been chosen. So if your buddy is
>>> a little and there is an idle little, the ILB will be this idle
>>> little.
>>
>> Earlier you wrote:
>>
>>> | Cluster 0 | Cluster 1 |
>>> | CPU0 | CPU1 | CPU2 | CPU3 |
>>> -----------------------------------
>>> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>>
>> So extrapolating that to a 4+4 big-little you'd get something like:
>>
>> | little A9 || big A15 |
>> | 0 | 1 | 2 | 3 || 4 | 5 | 6 | 7 |
>> ------+---+---+---+---++---+---+---+---+
>> buddy | 0 | 0 | 0 | 0 || 0 | 4 | 4 | 4 |
>>
>> Right?
>>
>> So supposing the current ILB is 6, we'll only check 4, not 0-3, even
>> though there might be a perfectly idle cpu in there.
>>
>> Also, your scheme fails to pack when cpus 0,4 are filled, even when
>> there's idle cores around.
>>
>> If we'd use the ILB as packing cpu, we would simply select a next pack
>> target once the old one fills up.
>>
>

2013-04-23 02:24:13

by Alex Shi

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/6] sched: pack the idle load balance

Thank you, Preeti and Vincent, for discussing the power aware scheduler
in such detail! I believe this open discussion will help us reach a more
comprehensive solution. :)

> Hi Preeti,
>
> I have had a look at Alex's patches but I have some concerns with them:
> -There is no notion of a power domain, which is quite important when we
> speak about power saving IMHO. Packing tasks is interesting if the idle
> CPUs can reach a useful low power state independently from the busy
> CPUs. Architectures have different low power state capabilities which
> must be taken into account. In addition, you can have systems whose
> CPUs have better power efficiency, and this kind of system is not taken
> into account.

I agree with you on this point, and I like what you have done to add a
new flag in the sched domain. It also makes it easy for the scheduler to
pick up new ideas in balancing.
BTW, currently my balancing tries to pack tasks per SMT; maybe packing
tasks per CPU horsepower is more compatible with other archs?

> -There is some computation of statistics on a potentially large number
> of cpus and groups at each task wake-up. This overhead concerns me, and
> such an amount of computation should only be done when we have more
> time, like during the periodic load balance.

Usually, some computation is far cheaper than a task migration. If the
computation helps to reduce possible future migrations, it saves a lot.
With the current code, I observed that fork balancing can distribute
tasks well under the powersaving policy. That means the computation is
worth it.

> -There are some heuristics that will be hard to tune:
> *powersaving balance period set as 8*max_interval
> *power saving can do some performance load balance if there was no
> performance load balance in the last 32 balances with no more than 4
> perf balance in the last 64 balance

Do you have other tuning opinions on the numbers? I am glad to hear any
input.
> *sched_burst_threshold

I found it useful on the 3.8 kernel, when aim7 caused very imbalanced
wake-ups. But now aim7 has calmed down after the lock-stealing rwsem was
merged in the kernel, so this may need to be re-evaluated on future
versions.
>
> I'm going to send a proposal for a more aggressive and scalable mode of
> my patches which will take care of my concerns. Let's see how this new
> patchset can fit with Alex's.


--
Thanks Alex

2013-04-23 04:38:41

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/6] sched: pack the idle load balance

Hi Vincent,

Thank you very much for bringing out the differences between your goals
and the working of the power aware scheduler patchset. This was
essential for us to understand the various requirements of a power aware
scheduler. After you post the patchset, we can try to evaluate the
following points again.

Thanks

Regards
Preeti U Murthy

On 04/23/2013 01:27 AM, Vincent Guittot wrote:
> On Monday, 22 April 2013, Preeti U Murthy <[email protected]> wrote:
>> Hi Vincent,
>>
>> On 04/05/2013 04:38 PM, Vincent Guittot wrote:
>>> Peter,
>>>
>>> After some thoughts about your comments, I can update the buddy CPU
>>> during the ILB or periodic LB to a new idle core and extend the packing
>>> mechanism. Does this additional mechanism sound better to you?
>>
>
> Hi Preeti,
>
> I have had a look at Alex's patches but I have some concerns with them:
> -There is no notion of a power domain, which is quite important when we
> speak about power saving IMHO. Packing tasks is interesting if the idle
> CPUs can reach a useful low power state independently from the busy CPUs.
> Architectures have different low power state capabilities which must be
> taken into account. In addition, you can have systems whose CPUs have
> better power efficiency, and this kind of system is not taken into account.
> -There is some computation of statistics on a potentially large number of
> cpus and groups at each task wake-up. This overhead concerns me, and such
> an amount of computation should only be done when we have more time, like
> during the periodic load balance.
> -There are some heuristics that will be hard to tune:
> *powersaving balance period set as 8*max_interval
> *power saving can do some performance load balance if there was no
> performance load balance in the last 32 balances with no more than 4 perf
> balance in the last 64 balance
> *sched_burst_threshold
>
> I'm going to send a proposal for a more aggressive and scalable mode of my
> patches which will take care of my concerns. Let's see how this new patchset
> can fit with Alex's.
>
> Regards,
> Vincent
>
>> If the primary goal of this patchset is to pack small tasks in fewer
>> power domains, then why not see if the power aware scheduler patchset by
>> Alex does the same for you? The reasons being:
>>
>> a. The power aware scheduler also checks if a task is small enough to be
>> packed on a cpu which has just enough capacity to take on that task (the
>> leader cpu). This cpu belongs to a scheduler group which is nearly full
>> (the group_leader), so we end up packing tasks.
>>
>> b. The overhead of assigning a buddy cpu gets eliminated because the
>> best cpu for packing is decided during wake-up.
>>
>> c. This is a scalable solution because if the leader cpu is busy, any
>> other idle cpu from that group_leader is chosen. Eventually you end up
>> packing anyway.
>>
>> The reason I am suggesting this is that we could unify the power
>> awareness of the scheduler under one umbrella. And I believe that the
>> current power aware scheduler patchset is flexible enough to do this and
>> that we should cash in on it.
>>
>> Thanks
>>
>> Regards
>> Preeti U Murthy
>>>
>>> Vincent
>>>
>>> On 26 March 2013 15:42, Peter Zijlstra <[email protected]> wrote:
>>>> On Tue, 2013-03-26 at 15:03 +0100, Vincent Guittot wrote:
>>>>>> But ha! here's your NO_HZ link.. but does the above DTRT and ensure
>>>>>> that the ILB is a little core when possible?
>>>>>
>>>>> The loop looks for an idle CPU as close as possible to the buddy CPU
>>>>> and the buddy CPU is the 1st CPU has been chosen. So if your buddy is
>>>>> a little and there is an idle little, the ILB will be this idle
>>>>> little.
>>>>
>>>> Earlier you wrote:
>>>>
>>>>> | Cluster 0 | Cluster 1 |
>>>>> | CPU0 | CPU1 | CPU2 | CPU3 |
>>>>> -----------------------------------
>>>>> buddy | CPU0 | CPU0 | CPU0 | CPU2 |
>>>>
>>>> So extrapolating that to a 4+4 big-little you'd get something like:
>>>>
>>>> | little A9 || big A15 |
>>>> | 0 | 1 | 2 | 3 || 4 | 5 | 6 | 7 |
>>>> ------+---+---+---+---++---+---+---+---+
>>>> buddy | 0 | 0 | 0 | 0 || 0 | 4 | 4 | 4 |
>>>>
>>>> Right?
>>>>
>>>> So supposing the current ILB is 6, we'll only check 4, not 0-3, even
>>>> though there might be a perfectly idle cpu in there.
>>>>
>>>> Also, your scheme fails to pack when cpus 0,4 are filled, even when
>>>> there's idle cores around.
>>>>
>>>> If we'd use the ILB as packing cpu, we would simply select a next pack
>>>> target once the old one fills up.
>>>>
>>>
>>
>>
>

2013-04-23 04:59:05

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/6] sched: pack the idle load balance

Hi Alex,

I have one point below.

On 04/23/2013 07:53 AM, Alex Shi wrote:
> Thank you, Preeti and Vincent, for discussing the power aware scheduler
> in such detail! I believe this open discussion will help us reach a more
> comprehensive solution. :)
>
>> Hi Preeti,
>>
>> I have had a look at Alex's patches but I have some concerns with them:
>> -There is no notion of a power domain, which is quite important when we
>> speak about power saving IMHO. Packing tasks is interesting if the idle
>> CPUs can reach a useful low power state independently from the busy
>> CPUs. Architectures have different low power state capabilities which
>> must be taken into account. In addition, you can have systems whose
>> CPUs have better power efficiency, and this kind of system is not taken
>> into account.
>
> I agree with you on this point, and I like what you have done by adding a
> new flag in the sched domain. It also makes it easy for the scheduler to
> pick up new balancing ideas.
> BTW, currently my balancer tries to pack tasks per SMT; maybe packing
> tasks per cpu horsepower would be more compatible with other archs?

Correct me if I am wrong, but the scheduler today does not compare the
task load to the destination cpu power before moving the task to the
destination cpu. This could be during:

1. Load balancing: In move_tasks(), only the imbalance is verified
against the task load before moving tasks; it does not necessarily check
whether the destination cpu has enough cpu power to handle these tasks.

2. select_task_rq_fair(): For a forked task, the idlest cpu in the group
leader is found during power save balance (I am focussing only on the
power save policy) and is returned as the destination cpu for the forked
task. But I feel we need to check if the idle cpu has the cpu power to
handle the task load.
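The check Preeti is asking for could be sketched roughly as follows. This is a stand-alone illustration, not kernel code: the structure, the fixed-point scale, and the helper name are all assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>

#define POWER_SCALE 1024 /* capacity of a reference cpu, illustrative */

struct cpu_stat {
	unsigned int power; /* relative compute capacity, POWER_SCALE = reference */
	unsigned int load;  /* load already on the cpu, same fixed-point scale */
};

/* Before migrating a task, also check that the destination cpu's
 * relative compute capacity can absorb the task's load on top of what
 * the cpu already carries; a LITTLE cpu (power < POWER_SCALE) would
 * thus refuse a long-running task that would exceed its capacity. */
static bool can_take_task(const struct cpu_stat *dst, unsigned int task_load)
{
	return dst->load + task_load <= dst->power;
}
```

A LITTLE cpu at half capacity with moderate load would reject a heavy task that a big cpu with the same load accepts, which is exactly the big.LITTLE case described above.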

Why I am bringing up this point is a use case which we might need to
handle in the power aware scheduler going ahead: the big.LITTLE cpus. We
would ideally want the short running tasks on the LITTLE cpus and the
long running tasks on the big cpus.

While the power aware scheduler strives to pack tasks, it should not end
up packing long running tasks on LITTLE cpus. Not having the big cpus
handle short running tasks is the next step of course, but at least we
should not throttle the long running tasks by scheduling them on LITTLE
cpus.

Thanks

Regards
Preeti U Murthy

2013-04-23 15:31:01

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/6] sched: pack the idle load balance

On 4/22/2013 7:23 PM, Alex Shi wrote:
> Thank you, Preeti and Vincent, for discussing the power aware scheduler in
> detail! I believe this open discussion is helpful for converging on a more
> comprehensive solution. :)
>
>> Hi Preeti,
>>
>> I have had a look at Alex's patches but I have some concerns with them:
>> - There is no notion of power domain, which is quite important when we
>> speak about power saving IMHO. Packing tasks is only of interest if the
>> idle CPUs can reach a useful low power state independently from busy CPUs.
>> Architectures have different low power state capabilities which must be
>> taken into account. In addition, you can have systems whose CPUs differ
>> in power efficiency, and this kind of system is not taken into account.
>
> I agree with you on this point, and I like what you have done by adding a
> new flag in the sched domain.

For x86 we should not be setting such a flag then; we don't have a way for
some cpu packages to go to an extra deep power state if they're completely
idle. (This afaik is true for both Intel and AMD.)


2013-04-26 10:20:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/6] sched: pack small tasks

On Wed, Mar 27, 2013 at 03:51:51PM +0530, Preeti U Murthy wrote:
> Hi,
>
> On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
> > On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> >> +static bool is_buddy_busy(int cpu)
> >> +{
> >> +	struct rq *rq = cpu_rq(cpu);
> >> +
> >> +	/*
> >> +	 * A busy buddy is a CPU with a high load or a small load with a
> >> +	 * lot of running tasks.
> >> +	 */
> >> +	return (rq->avg.runnable_avg_sum >
> >> +		(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
> >> +}
> >
> > Why does the comment talk about load but we don't see it in the
> > equation. Also, why does nr_running matter at all? I thought we'd
> > simply bother with utilization, if fully utilized we're done etc..
> >
>
> Peter, let's say the run-queue has 50% utilization and is running 2
> tasks. And we wish to find out if it is busy. We would compare this
> metric with the cpu power, which let's say is 100.
>
> rq->util * 100 < cpu_of(rq)->power.
>
> In the above scenario would we declare the cpu _not_busy? Or would we do
> the following:
>
> (rq->util * 100) * #nr_running < cpu_of(rq)->power and conclude that it
> is just enough _busy_ to not take on more processes?

That is just confused... ->power doesn't have anything to do with a per-cpu
measure. ->power is an inter-cpu measure of relative compute capacity.

Mixing in nr_running confuses things even more; it doesn't matter how many
tasks it takes to push utilization up to 100%; once it's there the cpu simply
cannot run more.

So colour me properly confused..

2013-04-26 10:31:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/6] sched: pack small tasks

On Wed, Mar 27, 2013 at 12:00:40PM +0100, Vincent Guittot wrote:
> On 27 March 2013 11:21, Preeti U Murthy <[email protected]> wrote:
> > Hi,
> >
> > On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
> >> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
> >>> +static bool is_buddy_busy(int cpu)
> >>> +{
> >>> +	struct rq *rq = cpu_rq(cpu);
> >>> +
> >>> +	/*
> >>> +	 * A busy buddy is a CPU with a high load or a small load with a
> >>> +	 * lot of running tasks.
> >>> +	 */
> >>> +	return (rq->avg.runnable_avg_sum >
> >>> +		(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
> >>> +}
> >>
> >> Why does the comment talk about load but we don't see it in the
> >> equation. Also, why does nr_running matter at all? I thought we'd
> >> simply bother with utilization, if fully utilized we're done etc..
>
> By load, I mean: 100 * avg.runnable_avg_sum / avg.runnable_avg_period.
> In addition, I take into account the number of tasks already in the
> runqueue in order to define the busyness of a CPU. A CPU with a load
> of 50% without any task in the runqueue is not busy at this time and
> we can migrate tasks onto it, but if the CPU already has 2 tasks in its
> runqueue, it means that a newly woken-up task will have to share the CPU
> with other tasks, so we consider that the CPU is already busy and we
> will fall back to the default behavior. The equation considers that a CPU
> is not busy if
> 100 * avg.runnable_avg_sum / avg.runnable_avg_period < 100 / (nr_running + 2)

I'm still somewhat confused by all this. So raising nr_running will lower the
required utilization to be considered busy. Suppose we have 50 tasks running,
all at 1% utilization (bulk wakeup): we'd consider the cpu busy, even though
it's picking its nose for half the time.

I'm assuming it's meant to limit process latency or so? Why are you concerned
with that? This seems like an arbitrary power vs performance tweak without
solid evidence it's needed or even wanted.
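The condition being debated is compact enough to model directly. The sketch below restates the quoted inequality in stand-alone C (the function name is illustrative, not the kernel's), which makes Peter's bulk-wakeup objection easy to check numerically:

```c
#include <assert.h>
#include <stdbool.h>

/* A cpu is considered busy when
 *   runnable_avg_sum > runnable_avg_period / (nr_running + 2)
 * i.e. when utilization exceeds 100 / (nr_running + 2) percent.
 * With nr_running = 0 the threshold is 50%, with 2 tasks it is 25%,
 * and with 50 tasks it drops to about 2%. */
static bool is_busy(unsigned long runnable_avg_sum,
		    unsigned long runnable_avg_period,
		    unsigned int nr_running)
{
	return runnable_avg_sum > runnable_avg_period / (nr_running + 2);
}
```

Plugging in Peter's example, a cpu at 50% utilization with 50 runnable tasks is reported busy even though it is idle half the time, which is exactly his point.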

2013-04-26 10:34:22

by Preeti U Murthy

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/6] sched: pack small tasks

Hi Peter,

On 04/26/2013 03:48 PM, Peter Zijlstra wrote:
> On Wed, Mar 27, 2013 at 03:51:51PM +0530, Preeti U Murthy wrote:
>> Hi,
>>
>> On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
>>> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>>>> +static bool is_buddy_busy(int cpu)
>>>> +{
>>>> +	struct rq *rq = cpu_rq(cpu);
>>>> +
>>>> +	/*
>>>> +	 * A busy buddy is a CPU with a high load or a small load with a
>>>> +	 * lot of running tasks.
>>>> +	 */
>>>> +	return (rq->avg.runnable_avg_sum >
>>>> +		(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
>>>> +}
>>>
>>> Why does the comment talk about load but we don't see it in the
>>> equation. Also, why does nr_running matter at all? I thought we'd
>>> simply bother with utilization, if fully utilized we're done etc..
>>>
>>
>> Peter, let's say the run-queue has 50% utilization and is running 2
>> tasks. And we wish to find out if it is busy. We would compare this
>> metric with the cpu power, which let's say is 100.
>>
>> rq->util * 100 < cpu_of(rq)->power.
>>
>> In the above scenario would we declare the cpu _not_busy? Or would we do
>> the following:
>>
>> (rq->util * 100) * #nr_running < cpu_of(rq)->power and conclude that it
>> is just enough _busy_ to not take on more processes?
>
> That is just confused... ->power doesn't have anything to do with a per-cpu
> measure. ->power is an inter-cpu measure of relative compute capacity.

Ok.

>
> Mixing in nr_running confuses things even more; it doesn't matter how many
> tasks it takes to push utilization up to 100%; once it's there the cpu simply
> cannot run more.

True, this is from the perspective of the CPU. But will the tasks on
this CPU not get throttled if you find the utilization of this CPU < 100%
and decide to put more tasks on it?

Regards
Preeti U Murthy

2013-04-26 10:56:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH v3 5/6] sched: pack the idle load balance

On Tue, Apr 23, 2013 at 08:30:58AM -0700, Arjan van de Ven wrote:
> For x86 we should not be setting such a flag then; we don't have a way for
> some cpu packages to go to an extra deep power state if they're completely
> idle. (This afaik is true for both Intel and AMD.)

You say 'some'; this implies we do for others, right?

We can dynamically set the required flags in the arch setup when it makes sense.
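The idea of setting the flag only where it makes sense could look roughly like this. Only SD_SHARE_POWERDOMAIN comes from the patch set; the flag value, the default-flags macro, and the query function are assumptions made for this sketch:

```c
#include <assert.h>

#define SD_SHARE_POWERDOMAIN 0x0100 /* illustrative value, not the kernel's */
#define SD_DEFAULT_FLAGS     (SD_SHARE_POWERDOMAIN)

/* Keep SD_SHARE_POWERDOMAIN set by default (preserving current scheduler
 * behavior) and clear it only on levels whose cpu groups the arch reports
 * as independently power-gateable. */
static int arch_sd_flags(int independent_power_gating)
{
	int flags = SD_DEFAULT_FLAGS;

	if (independent_power_gating)
		flags &= ~SD_SHARE_POWERDOMAIN;
	return flags;
}
```

This matches the split in the thread: ARM clears the flag at the MC and CPU levels, while an x86 setup without per-package deep idle states would leave it set.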

2013-04-26 11:34:36

by Vincent Guittot

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/6] sched: pack small tasks

On 26 April 2013 12:30, Peter Zijlstra <[email protected]> wrote:
> On Wed, Mar 27, 2013 at 12:00:40PM +0100, Vincent Guittot wrote:
>> On 27 March 2013 11:21, Preeti U Murthy <[email protected]> wrote:
>> > Hi,
>> >
>> > On 03/26/2013 05:56 PM, Peter Zijlstra wrote:
>> >> On Fri, 2013-03-22 at 13:25 +0100, Vincent Guittot wrote:
>> >>> +static bool is_buddy_busy(int cpu)
>> >>> +{
>> >>> +	struct rq *rq = cpu_rq(cpu);
>> >>> +
>> >>> +	/*
>> >>> +	 * A busy buddy is a CPU with a high load or a small load with a
>> >>> +	 * lot of running tasks.
>> >>> +	 */
>> >>> +	return (rq->avg.runnable_avg_sum >
>> >>> +		(rq->avg.runnable_avg_period / (rq->nr_running + 2)));
>> >>> +}
>> >>
>> >> Why does the comment talk about load but we don't see it in the
>> >> equation. Also, why does nr_running matter at all? I thought we'd
>> >> simply bother with utilization, if fully utilized we're done etc..
>>
>> By load, I mean: 100 * avg.runnable_avg_sum / avg.runnable_avg_period.
>> In addition, I take into account the number of tasks already in the
>> runqueue in order to define the busyness of a CPU. A CPU with a load
>> of 50% without any task in the runqueue is not busy at this time and
>> we can migrate tasks onto it, but if the CPU already has 2 tasks in its
>> runqueue, it means that a newly woken-up task will have to share the CPU
>> with other tasks, so we consider that the CPU is already busy and we
>> will fall back to the default behavior. The equation considers that a CPU
>> is not busy if
>> 100 * avg.runnable_avg_sum / avg.runnable_avg_period < 100 / (nr_running + 2)
>
> I'm still somewhat confused by all this. So raising nr_running will lower the
> required utilization to be considered busy. Suppose we have 50 tasks running,
> all at 1% utilization (bulk wakeup): we'd consider the cpu busy, even though
> it's picking its nose for half the time.
>
> I'm assuming it's meant to limit process latency or so? Why are you concerned
> with that? This seems like an arbitrary power vs performance tweak without
> solid evidence it's needed or even wanted.

Yes, the goal was to limit the wake-up latency. This version only tried
to modify the scheduler behavior when the system was not busy, in order
to pack the small tasks like background activities without decreasing
performance, so we were concerned about wake-up latency.

The new version proposes a more aggressive mode that packs all tasks
until the CPUs become full.

Vincent