2013-04-27 05:26:55

by Alex Shi

[permalink] [raw]
Subject: [PATCH v4 0/6] sched: use runnable load based balance

This patchset is based on tip/sched/core.

This patchset removes the burst wakeup detection, which had worked fine on the 3.8
kernel since aim7 was very imbalanced there. But with rwsem write lock stealing
enabled in the 3.9 kernel, the aim7 imbalance disappeared, so the burst wakeup
handling is no longer needed.

It was tested on Intel Core2, NHM, SNB and IVB machines with 2 and 4 sockets,
using the benchmarks kbuild, aim7, dbench, tbench, hackbench and fileio-cfq (sysbench).

On a 4-socket SNB EP machine, hackbench performance increased by about 50% and the
results became stable. On other machines, hackbench increased by about 2~5%.
There was no clear performance change on the other benchmarks.

Michael Wang also tested pgbench on his box:
https://lkml.org/lkml/2013/4/2/1022
---
Done, here are the results of pgbench without the last patch on my box:

| db_size | clients |  tps  | |  tps  |
+---------+---------+-------+ +-------+
| 22 MB   |       1 | 10662 | | 10679 |
| 22 MB   |       2 | 21483 | | 21471 |
| 22 MB   |       4 | 42046 | | 41957 |
| 22 MB   |       8 | 55807 | | 55684 |
| 22 MB   |      12 | 50768 | | 52074 |
| 22 MB   |      16 | 49880 | | 52879 |
| 22 MB   |      24 | 45904 | | 53406 |
| 22 MB   |      32 | 43420 | | 54088 | +24.57%
| 7484 MB |       1 |  7965 | |  7725 |
| 7484 MB |       2 | 19354 | | 19405 |
| 7484 MB |       4 | 37552 | | 37246 |
| 7484 MB |       8 | 48655 | | 50613 |
| 7484 MB |      12 | 45778 | | 47639 |
| 7484 MB |      16 | 45659 | | 48707 |
| 7484 MB |      24 | 42192 | | 46469 |
| 7484 MB |      32 | 36385 | | 46346 | +27.38%
| 15 GB   |       1 |  7677 | |  7727 |
| 15 GB   |       2 | 19227 | | 19199 |
| 15 GB   |       4 | 37335 | | 37372 |
| 15 GB   |       8 | 48130 | | 50333 |
| 15 GB   |      12 | 45393 | | 47590 |
| 15 GB   |      16 | 45110 | | 48091 |
| 15 GB   |      24 | 41415 | | 47415 |
| 15 GB   |      32 | 35988 | | 45749 | +27.12%
---

It was also tested by [email protected]:
http://comments.gmane.org/gmane.linux.kernel/1463371
---
The patches are based on 3.9-rc2 and have been tested on an ARM vexpress TC2
big.LITTLE testchip containing five cpus: 2xCortex-A15 + 3xCortex-A7.
Additional testing and refinements might be needed later as more sophisticated
platforms become available.

cpu_power A15: 1441
cpu_power A7: 606

Benchmarks:
cyclictest: cyclictest -a -t 2 -n -D 10
hackbench: hackbench (default settings)
sysbench_1t: sysbench --test=cpu --num-threads=1 --max-requests=1000 run
sysbench_2t: sysbench --test=cpu --num-threads=2 --max-requests=1000 run
sysbench_5t: sysbench --test=cpu --num-threads=5 --max-requests=1000 run


Mixed cpu_power:
Average times over 20 runs normalized to 3.9-rc2 (lower is better):
            3.9-rc2    +shi    +shi+patches    Improvement
cyclictest
AVG           74.9     74.5        75.75         -1.13%
MIN           69       69          69
MAX           88       88          94
hackbench
AVG            2.17     2.09        2.09           3.90%
MIN            2.10     1.95        2.02
MAX            2.25     2.48        2.17
sysbench_1t
AVG           25.13*   16.47'      16.48          34.43%
MIN           16.47    16.47       16.47
MAX           33.78    16.48       16.54
sysbench_2t
AVG           19.32    18.19       16.51          14.55%
MIN           16.48    16.47       16.47
MAX           22.15    22.19       16.61
sysbench_5t
AVG           27.22    27.71       24.14          11.31%
MIN           25.42    27.66       24.04
MAX           27.75    27.86       24.31

* The unpatched 3.9-rc2 scheduler gives inconsistent performance as tasks may
randomly be placed on either A7 or A15 cores. The max/min values reflect this
behaviour. A15 and A7 times are ~16.5 and ~33.5 respectively.

' While Alex Shi's patches appear to solve the performance inconsistency for
sysbench_1t, it is not the true picture for all workloads. This can be seen for
sysbench_2t.

To ensure that the proposed changes do not affect normal SMP systems, the
same benchmarks have been run on a 2xCortex-A15 configuration as well:

SMP:
Average times over 20 runs normalized to 3.9-rc2 (lower is better):
            3.9-rc2    +shi    +shi+patches    Improvement
cyclictest
AVG           78.6     75.3        77.6           1.34%
MIN           69       69          69
MAX          135       98         125
hackbench
AVG            3.55     3.54        3.55          0.06%
MIN            3.51     3.48        3.49
MAX            3.66     3.65        3.67
sysbench_1t
AVG           16.48    16.48       16.48         -0.03%
MIN           16.47    16.48       16.48
MAX           16.49    16.48       16.48
sysbench_2t
AVG           16.53    16.53       16.54         -0.05%
MIN           16.47    16.47       16.48
MAX           16.59    16.57       16.59
sysbench_5t
AVG           41.16    41.15       41.15          0.04%
MIN           41.14    41.13       41.11
MAX           41.35    41.19       41.17
---

Peter,
Would you like to consider picking up the patchset, or give some comments? :)

Best regards
Alex

[PATCH v4 1/6] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
[PATCH v4 2/6] sched: set initial value of runnable avg for new forked task
[PATCH v4 3/6] sched: update cpu load after task_tick.
[PATCH v4 4/6] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
[PATCH v4 5/6] sched: consider runnable load average in move_tasks
[PATCH v4 6/6] sched: consider runnable load average in effective_load


2013-04-27 05:27:24

by Alex Shi

[permalink] [raw]
Subject: [PATCH v4 2/6] sched: set initial value of runnable avg for new forked task

We need to initialize se.avg.{decay_count, load_avg_contrib} for a
newly forked task.
Otherwise, random values in those variables cause a mess when the new
task is enqueued:
    enqueue_task_fair
        enqueue_entity
            enqueue_entity_load_avg

and make fork balancing imbalanced due to the incorrect load_avg_contrib.

Set avg.decay_count = 0 and avg.load_avg_contrib = se->load.weight to
resolve these issues.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/core.c | 6 ++++++
kernel/sched/fair.c | 4 ++++
2 files changed, 10 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 86e87f1..e0c003a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1564,6 +1564,7 @@ static void __sched_fork(struct task_struct *p)
#ifdef CONFIG_SMP
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
+ p->se.avg.decay_count = 0;
#endif
#ifdef CONFIG_SCHEDSTATS
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
@@ -1651,6 +1652,11 @@ void sched_fork(struct task_struct *p)
p->sched_reset_on_fork = 0;
}

+ /* New forked task assumed with full utilization */
+#if defined(CONFIG_SMP)
+ p->se.avg.load_avg_contrib = p->se.load.weight;
+#endif
+
if (!rt_prio(p->prio))
p->sched_class = &fair_sched_class;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9bcb7f3..85ff367 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1509,6 +1509,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
* We track migrations using entity decay_count <= 0, on a wake-up
* migration we use a negative decay count to track the remote decays
* accumulated while sleeping.
+ *
+ * When enqueueing a newly forked task, se->avg.decay_count == 0, so
+ * we bypass update_entity_load_avg() and use the initial value of
+ * avg.load_avg_contrib: se->load.weight.
*/
if (unlikely(se->avg.decay_count <= 0)) {
se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
--
1.7.12

2013-04-27 05:27:06

by Alex Shi

[permalink] [raw]
Subject: [PATCH v4 1/6] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"

Remove the CONFIG_FAIR_GROUP_SCHED dependency that guards the runnable
load tracking info, so that we can use the runnable load variables.

Signed-off-by: Alex Shi <[email protected]>
---
include/linux/sched.h | 7 +------
kernel/sched/core.c | 7 +------
kernel/sched/fair.c | 13 ++-----------
kernel/sched/sched.h | 9 +--------
4 files changed, 5 insertions(+), 31 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6bdaa73..1203ce6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -991,12 +991,7 @@ struct sched_entity {
struct cfs_rq *my_q;
#endif

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/* Per-entity load-tracking */
struct sched_avg avg;
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cb49b2a..86e87f1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1561,12 +1561,7 @@ static void __sched_fork(struct task_struct *p)
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
p->se.avg.runnable_avg_period = 0;
p->se.avg.runnable_avg_sum = 0;
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index acaf567..9bcb7f3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1110,8 +1110,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_FAIR_GROUP_SCHED */

-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
/*
* We choose a half-life close to 1 scheduling period.
* Note: The tables below are dependent on this value.
@@ -3416,12 +3415,6 @@ unlock:
}

/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
* Called immediately before a task is migrated to a new cpu; task_cpu(p) and
* cfs_rq_of(p) references at time of call are still valid and identify the
* previous cpu. However, the caller only guarantees p->pi_lock is held; no
@@ -3444,7 +3437,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
}
}
-#endif
#endif /* CONFIG_SMP */

static unsigned long
@@ -6140,9 +6132,8 @@ const struct sched_class fair_sched_class = {

#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
.migrate_task_rq = migrate_task_rq_fair,
-#endif
+
.rq_online = rq_online_fair,
.rq_offline = rq_offline_fair,

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 605426a..33ab9e4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -260,12 +260,6 @@ struct cfs_rq {
#endif

#ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
/*
* CFS Load tracking
* Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -275,8 +269,7 @@ struct cfs_rq {
u64 runnable_load_avg, blocked_load_avg;
atomic64_t decay_counter, removed_load;
u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
#ifdef CONFIG_FAIR_GROUP_SCHED
u32 tg_runnable_contrib;
u64 tg_load_contrib;
--
1.7.12

2013-04-27 05:27:31

by Alex Shi

[permalink] [raw]
Subject: [PATCH v4 6/6] sched: consider runnable load average in effective_load

effective_load() calculates the load change as seen from the
root_task_group. It needs to take the runnable average of the
changed task into account.

Thanks to Morten Rasmussen for the reminder about this.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 27 ++++++++++++++++++++-------
1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 995dfc7..27c7ebb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2998,7 +2998,8 @@ static void task_waking_fair(struct task_struct *p)

#ifdef CONFIG_FAIR_GROUP_SCHED
/*
- * effective_load() calculates the load change as seen from the root_task_group
+ * effective_load() calculates the runnable load average change as seen from
+ * the root_task_group
*
* Adding load to a group doesn't make a group heavier, but can cause movement
* of group shares between cpus. Assuming the shares were perfectly aligned one
@@ -3046,6 +3047,9 @@ static void task_waking_fair(struct task_struct *p)
* Therefore the effective change in loads on CPU 0 would be 5/56 (3/8 - 2/7)
* times the weight of the group. The effect on CPU 1 would be -4/56 (4/8 -
* 4/7) times the weight of the group.
+ *
+ * After getting the effective_load of the load being moved, the sched
+ * entity's runnable avg is then applied to it.
*/
static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
{
@@ -3120,6 +3124,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
struct task_group *tg;
unsigned long weight;
int balanced;
+ int runnable_avg;

idx = sd->wake_idx;
this_cpu = smp_processor_id();
@@ -3135,13 +3140,19 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
if (sync) {
tg = task_group(current);
weight = current->se.load.weight;
+ runnable_avg = current->se.avg.runnable_avg_sum * NICE_0_LOAD
+ / (current->se.avg.runnable_avg_period + 1);

- this_load += effective_load(tg, this_cpu, -weight, -weight);
- load += effective_load(tg, prev_cpu, 0, -weight);
+ this_load += effective_load(tg, this_cpu, -weight, -weight)
+ * runnable_avg >> NICE_0_SHIFT;
+ load += effective_load(tg, prev_cpu, 0, -weight)
+ * runnable_avg >> NICE_0_SHIFT;
}

tg = task_group(p);
weight = p->se.load.weight;
+ runnable_avg = p->se.avg.runnable_avg_sum * NICE_0_LOAD
+ / (p->se.avg.runnable_avg_period + 1);

/*
* In low-load situations, where prev_cpu is idle and this_cpu is idle
@@ -3153,16 +3164,18 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
* task to be woken on this_cpu.
*/
if (this_load > 0) {
- s64 this_eff_load, prev_eff_load;
+ s64 this_eff_load, prev_eff_load, tmp_eff_load;

this_eff_load = 100;
this_eff_load *= power_of(prev_cpu);
- this_eff_load *= this_load +
- effective_load(tg, this_cpu, weight, weight);
+ tmp_eff_load = effective_load(tg, this_cpu, weight, weight)
+ * runnable_avg >> NICE_0_SHIFT;
+ this_eff_load *= this_load + tmp_eff_load;

prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
prev_eff_load *= power_of(this_cpu);
- prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);
+ prev_eff_load *= load + (effective_load(tg, prev_cpu, 0, weight)
+ * runnable_avg >> NICE_0_SHIFT);

balanced = this_eff_load <= prev_eff_load;
} else
--
1.7.12

2013-04-27 05:27:33

by Alex Shi

[permalink] [raw]
Subject: [PATCH v4 3/6] sched: update cpu load after task_tick.

To get the latest runnable load info, we need to do this cpu load
update after task_tick.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e0c003a..0fedeed 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2690,8 +2690,8 @@ void scheduler_tick(void)

raw_spin_lock(&rq->lock);
update_rq_clock(rq);
- update_cpu_load_active(rq);
curr->sched_class->task_tick(rq, curr, 0);
+ update_cpu_load_active(rq);
raw_spin_unlock(&rq->lock);

perf_event_task_tick();
--
1.7.12

2013-04-27 05:27:29

by Alex Shi

[permalink] [raw]
Subject: [PATCH v4 5/6] sched: consider runnable load average in move_tasks

Besides the places where the runnable load average is already used,
move_tasks() is also a key function in load balancing. We need to
consider the runnable load average in it too, in order to make an
apples-to-apples load comparison.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ca63c9f..995dfc7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3989,6 +3989,15 @@ static unsigned long task_h_load(struct task_struct *p);

static const unsigned int sched_nr_migrate_break = 32;

+static unsigned long task_h_load_avg(struct task_struct *p)
+{
+ u32 period = p->se.avg.runnable_avg_period;
+ if (!period)
+ return 0;
+
+ return task_h_load(p) * p->se.avg.runnable_avg_sum / period;
+}
+
/*
* move_tasks tries to move up to imbalance weighted load from busiest to
* this_rq, as part of a balancing operation within domain "sd".
@@ -4024,7 +4033,7 @@ static int move_tasks(struct lb_env *env)
if (!can_migrate_task(p, env))
goto next;

- load = task_h_load(p);
+ load = task_h_load_avg(p);

if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
goto next;
--
1.7.12

2013-04-27 05:27:27

by Alex Shi

[permalink] [raw]
Subject: [PATCH v4 4/6] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task

These are the base values used in load balancing. Update them with the
rq's runnable load average, so that load balancing naturally takes the
runnable load avg into account.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/core.c | 4 ++--
kernel/sched/fair.c | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0fedeed..ae32dfb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2534,7 +2534,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
void update_idle_cpu_load(struct rq *this_rq)
{
unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
- unsigned long load = this_rq->load.weight;
+ unsigned long load = (unsigned long)this_rq->cfs.runnable_load_avg;
unsigned long pending_updates;

/*
@@ -2584,7 +2584,7 @@ static void update_cpu_load_active(struct rq *this_rq)
* See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
*/
this_rq->last_load_update_tick = jiffies;
- __update_cpu_load(this_rq, this_rq->load.weight, 1);
+ __update_cpu_load(this_rq, this_rq->cfs.runnable_load_avg, 1);

calc_load_account_active(this_rq);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 85ff367..ca63c9f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2922,7 +2922,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* Used instead of source_load when we know the type == 0 */
static unsigned long weighted_cpuload(const int cpu)
{
- return cpu_rq(cpu)->load.weight;
+ return (unsigned long)cpu_rq(cpu)->cfs.runnable_load_avg;
}

/*
@@ -2969,7 +2969,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
unsigned long nr_running = ACCESS_ONCE(rq->nr_running);

if (nr_running)
- return rq->load.weight / nr_running;
+ return (unsigned long)rq->cfs.runnable_load_avg / nr_running;

return 0;
}
--
1.7.12

2013-05-01 12:16:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 0/6] sched: use runnable load based balance

On Sat, Apr 27, 2013 at 01:25:38PM +0800, Alex Shi wrote:
> This patchset is based on tip/sched/core.
>
> This patchset removes the burst wakeup detection, which had worked fine on the 3.8

Was this part of the original series from PJT or some patches afterwards?
I missed a few months worth of patches so any extra information to help me
catch up is greatly appreciated :-)

> kernel since aim7 was very imbalanced there. But with rwsem write lock stealing
> enabled in the 3.9 kernel, the aim7 imbalance disappeared, so the burst wakeup
> handling is no longer needed.

How does this follow.. surely something else can cause burst wakeups just fine
too?

> Peter,
> Would you like to consider picking up the patchset, or give some comments? :)

I'll go read them.. :-)

2013-05-02 00:39:14

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v4 0/6] sched: use runnable load based balance

On 05/01/2013 08:14 PM, Peter Zijlstra wrote:
> On Sat, Apr 27, 2013 at 01:25:38PM +0800, Alex Shi wrote:
>> This patchset is based on tip/sched/core.
>>
>> This patchset removes the burst wakeup detection, which had worked fine on the 3.8
>
> Was this part of the original series from PJT or some patches afterwards?
> I missed a few months worth of patches so any extra information to help me
> catch up is greatly appreciated :-)

On the 3.8 kernel, the first problem commit is 5a505085f043 ("mm/rmap:
Convert the struct anon_vma::mutex to an rwsem"). It causes a lot of
imbalance among cpus on the aim7 benchmark. The reason is explained here:
https://lkml.org/lkml/2013/1/29/84.

With PJT's patches, we know that in the first few seconds after wakeup a
task's runnable load may be nearly zero. That nearly-zero runnable load
increases the imbalance among cpus, so there is roughly an extra 5%
performance drop when using the runnable load for balancing on aim7 (in
the test, aim7 forks 2000 threads, waits for a trigger, then runs as fast as possible).
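
For illustration only, here is a rough user-space sketch of this effect (not
kernel code; it just assumes the usual per-entity tracking parameters of ~1ms
periods and a decay factor y with y^32 = 1/2, and arbitrary run/sleep lengths):

/* Build with: gcc decay.c -lm (file name is only an example). */
#include <math.h>
#include <stdio.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32);	/* decay factor: y^32 == 1/2 */
	double sum = 0.0, period = 0.0;
	int ms;

	/* Task runs for 100ms, then sleeps 200ms before being woken. */
	for (ms = 0; ms < 300; ms++) {
		sum = sum * y + (ms < 100 ? 1.0 : 0.0);	/* runnable time only */
		period = period * y + 1.0;		/* all elapsed time */
	}

	/* Prints roughly 0.01: the runnable average is almost zero at wakeup. */
	printf("runnable avg at wakeup: %.3f\n", sum / period);
	return 0;
}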

But after rwsem lock stealing went into the 3.9 kernel, commit
ce6711f3d196f09ca0e, the aim7 wakeup no longer shows a clear imbalance
issue, so aim7 no longer needs this extra burst wakeup detection.
>
>> kernel since aim7 was very imbalanced there. But with rwsem write lock stealing
>> enabled in the 3.9 kernel, the aim7 imbalance disappeared, so the burst wakeup
>> handling is no longer needed.
>
> How does this follow.. surely something else can cause burst wakeups just fine
> too?
>
>> Peter,
>> Would you like to consider picking up the patchset, or give some comments? :)
>
> I'll go read them.. :-)
>

I appreciate this. :)

--
Thanks Alex

2013-05-02 10:37:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 0/6] sched: use runnable load based balance

On Thu, May 02, 2013 at 08:38:17AM +0800, Alex Shi wrote:

> On the 3.8 kernel, the first problem commit is 5a505085f043 ("mm/rmap:
> Convert the struct anon_vma::mutex to an rwsem"). It causes a lot of
> imbalance among cpus on the aim7 benchmark. The reason is explained here:
> https://lkml.org/lkml/2013/1/29/84.

Hehe. yeah that one was going to be obvious.. rwsems were known to be totally
bad performers. Not only were they lacking lock stealing but also the spinning.

> But after rwsem lock stealing went into the 3.9 kernel, commit
> ce6711f3d196f09ca0e, the aim7 wakeup no longer shows a clear imbalance
> issue, so aim7 no longer needs this extra burst wakeup detection.

OK, that seems like a nice fix for rwsems.. one nit:

+ raw_spin_lock_irq(&sem->wait_lock);
+ /* Try to get the writer sem, may steal from the head writer: */
+ if (flags == RWSEM_WAITING_FOR_WRITE)
+ if (try_get_writer_sem(sem, &waiter)) {
+ raw_spin_unlock_irq(&sem->wait_lock);
+ return sem;
+ }
+ raw_spin_unlock_irq(&sem->wait_lock);
schedule();

That should probably look like:

preempt_disable();
raw_spin_unlock_irq();
preempt_enable_no_resched();
schedule();

Otherwise you might find a performance regression on PREEMPT=y kernels.

> With PJT's patches, we know that in the first few seconds after wakeup a
> task's runnable load may be nearly zero. That nearly-zero runnable load
> increases the imbalance among cpus, so there is roughly an extra 5%
> performance drop when using the runnable load for balancing on aim7 (in
> the test, aim7 forks 2000 threads, waits for a trigger, then runs as fast as possible).

OK, so what I was asking after is if you changed the scheduler after PJT's
patches landed to deal with this bulk wakeup. Also, while aim7 might no longer
trigger the bad pattern, what is to say nothing ever will? In particular,
anything using pthread_cond_broadcast() is known to be suspect of bulk wakeups.

Anyway, I'll go try and make sense of some of the actual patches.. :-)

2013-05-02 11:02:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 2/6] sched: set initial value of runnable avg for new forked task

On Sat, Apr 27, 2013 at 01:25:40PM +0800, Alex Shi wrote:
> We need to initialize se.avg.{decay_count, load_avg_contrib} for a
> newly forked task.
> Otherwise, random values in those variables cause a mess when the new
> task is enqueued:
>     enqueue_task_fair
>         enqueue_entity
>             enqueue_entity_load_avg
>
> and make fork balancing imbalanced due to the incorrect load_avg_contrib.
>
> Set avg.decay_count = 0 and avg.load_avg_contrib = se->load.weight to
> resolve these issues.
>
> Signed-off-by: Alex Shi <[email protected]>

This patch seems sensible enough..

Acked-by: Peter Zijlstra <[email protected]>

> ---
> kernel/sched/core.c | 6 ++++++
> kernel/sched/fair.c | 4 ++++
> 2 files changed, 10 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 86e87f1..e0c003a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c

> + /* New forked task assumed with full utilization */
> +#if defined(CONFIG_SMP)
> + p->se.avg.load_avg_contrib = p->se.load.weight;
> +#endif

We'd typically write: #ifdef CONFIG_SMP
we only use the #if defined() form for multiple CONFIG tests.
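
For example, the two forms used in this series:

	#ifdef CONFIG_SMP						/* single symbol */
	#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)	/* multiple symbols */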

2013-05-02 13:21:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] sched: consider runnable load average in effective_load

> @@ -3120,6 +3124,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> struct task_group *tg;
> unsigned long weight;
> int balanced;
> + int runnable_avg;
>
> idx = sd->wake_idx;
> this_cpu = smp_processor_id();
> @@ -3135,13 +3140,19 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> if (sync) {
> tg = task_group(current);
> weight = current->se.load.weight;
> + runnable_avg = current->se.avg.runnable_avg_sum * NICE_0_LOAD
> + / (current->se.avg.runnable_avg_period + 1);
>
> - this_load += effective_load(tg, this_cpu, -weight, -weight);
> - load += effective_load(tg, prev_cpu, 0, -weight);
> + this_load += effective_load(tg, this_cpu, -weight, -weight)
> + * runnable_avg >> NICE_0_SHIFT;
> + load += effective_load(tg, prev_cpu, 0, -weight)
> + * runnable_avg >> NICE_0_SHIFT;
> }


I'm fairly sure this is wrong; but I haven't bothered to take pencil to paper.

I think you'll need to insert the runnable avg load and make sure
effective_load() uses the right sums itself.

2013-05-03 07:40:30

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v4 6/6] sched: consider runnable load average in effective_load

On 05/02/2013 09:19 PM, Peter Zijlstra wrote:
>
> I'm fairly sure this is wrong; but I haven't bothered to take pencil to paper.
>
> I think you'll need to insert the runnable avg load and make sure
> effective_load() uses the right sums itself.

Thanks for the comment!
I changed it to the following patch for further review.
A quick test shows hackbench still keeps the improvement.

Will resend a new version to include all the changes. :)

---

>From 2f6491807226a30e01b9afa21d80e42e7cbb5678 Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Fri, 3 May 2013 13:29:04 +0800
Subject: [PATCH 7/8] sched: consider runnable load average in effective_load

effective_load() calculates the load change as seen from the
root_task_group. It needs to take the runnable average of the
changed task into account.

Thanks to Morten Rasmussen and PeterZ for the reminder about this.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/fair.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 790e23d..6f4f14b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2976,15 +2976,15 @@ static void task_waking_fair(struct task_struct *p)

#ifdef CONFIG_FAIR_GROUP_SCHED
/*
- * effective_load() calculates the load change as seen from the root_task_group
+ * effective_load() calculates load avg change as seen from the root_task_group
*
* Adding load to a group doesn't make a group heavier, but can cause movement
* of group shares between cpus. Assuming the shares were perfectly aligned one
* can calculate the shift in shares.
*
- * Calculate the effective load difference if @wl is added (subtracted) to @tg
- * on this @cpu and results in a total addition (subtraction) of @wg to the
- * total group weight.
+ * Calculate the effective load avg difference if @wl is added (subtracted) to
+ * @tg on this @cpu and results in a total addition (subtraction) of @wg to the
+ * total group load avg.
*
* Given a runqueue weight distribution (rw_i) we can compute a shares
* distribution (s_i) using:
@@ -2998,7 +2998,7 @@ static void task_waking_fair(struct task_struct *p)
* rw_i = { 2, 4, 1, 0 }
* s_i = { 2/7, 4/7, 1/7, 0 }
*
- * As per wake_affine() we're interested in the load of two CPUs (the CPU the
+ * As per wake_affine() we're interested in load avg of two CPUs (the CPU the
* task used to run on and the CPU the waker is running on), we need to
* compute the effect of waking a task on either CPU and, in case of a sync
* wakeup, compute the effect of the current task going to sleep.
@@ -3008,20 +3008,20 @@ static void task_waking_fair(struct task_struct *p)
*
* s'_i = (rw_i + @wl) / (@wg + \Sum rw_j) (2)
*
- * Suppose we're interested in CPUs 0 and 1, and want to compute the load
+ * Suppose we're interested in CPUs 0 and 1, and want to compute the load avg
* differences in waking a task to CPU 0. The additional task changes the
* weight and shares distributions like:
*
* rw'_i = { 3, 4, 1, 0 }
* s'_i = { 3/8, 4/8, 1/8, 0 }
*
- * We can then compute the difference in effective weight by using:
+ * We can then compute the difference in effective load avg by using:
*
* dw_i = S * (s'_i - s_i) (3)
*
* Where 'S' is the group weight as seen by its parent.
*
- * Therefore the effective change in loads on CPU 0 would be 5/56 (3/8 - 2/7)
+ * Therefore the effective change in load avg on CPU 0 would be 5/56 (3/8 - 2/7)
* times the weight of the group. The effect on CPU 1 would be -4/56 (4/8 -
* 4/7) times the weight of the group.
*/
@@ -3045,7 +3045,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
/*
* w = rw_i + @wl
*/
- w = se->my_q->load.weight + wl;
+ w = se->my_q->tg_load_contrib + wl;

/*
* wl = S * s'_i; see (2)
@@ -3066,7 +3066,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
/*
* wl = dw_i = S * (s'_i - s_i); see (3)
*/
- wl -= se->load.weight;
+ wl -= se->avg.load_avg_contrib;

/*
* Recursively apply this logic to all parent groups to compute
@@ -3112,14 +3112,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
*/
if (sync) {
tg = task_group(current);
- weight = current->se.load.weight;
+ weight = current->se.avg.load_avg_contrib;

this_load += effective_load(tg, this_cpu, -weight, -weight);
load += effective_load(tg, prev_cpu, 0, -weight);
}

tg = task_group(p);
- weight = p->se.load.weight;
+ weight = p->se.avg.load_avg_contrib;

/*
* In low-load situations, where prev_cpu is idle and this_cpu is idle
--
1.7.12


--
Thanks
Alex

2013-05-03 07:56:33

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v4 0/6] sched: use runnable load based balance


> That should probably look like:
>
> preempt_disable();
> raw_spin_unlock_irq();
> preempt_enable_no_resched();
> schedule();
>
> Otherwise you might find a performance regression on PREEMPT=y kernels.

Yes, right!
Thanks a lot for the reminder. The following patch will fix it.
>
> OK, so what I was asking after is if you changed the scheduler after PJT's
> patches landed to deal with this bulk wakeup. Also, while aim7 might no longer
> trigger the bad pattern, what is to say nothing ever will? In particular,
> anything using pthread_cond_broadcast() is known to be suspect of bulk wakeups.

Just found a benchmark named pthread_cond_broadcast:
http://kristiannielsen.livejournal.com/13577.html. Will play with it. :)
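
For reference, the bulk-wakeup pattern being discussed is roughly the
following minimal sketch (my own illustration, not the benchmark above;
the thread count, busy loop and file name are arbitrary):

/* Build with: gcc -O2 -pthread burst.c */
#include <pthread.h>

#define NTHREADS 64

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int go;

static void *worker(void *arg)
{
	volatile unsigned long i;

	(void)arg;
	pthread_mutex_lock(&lock);
	while (!go)
		pthread_cond_wait(&cond, &lock);	/* all waiters sleep here */
	pthread_mutex_unlock(&lock);

	for (i = 0; i < 50000000UL; i++)
		;	/* burn some CPU so the load balancer has work to spread */
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);

	pthread_mutex_lock(&lock);
	go = 1;
	pthread_cond_broadcast(&cond);	/* burst wakeup: all threads at once */
	pthread_mutex_unlock(&lock);

	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}
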
>
> Anyway, I'll go try and make sense of some of the actual patches.. :-)
>

---

>From 4c9b4b8a9b92bcbe6934637fd33c617e73dbda97 Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Fri, 3 May 2013 14:51:25 +0800
Subject: [PATCH 8/8] rwsem: small optimizing rwsem_down_failed_common

Peter Zijlstra suggested adding a preempt_enable_no_resched() to prevent
an unnecessary reschedule in raw_spin_unlock.
We can also combine two raw_spin_lock sections to save one. Hence this patch.

Thanks Peter!

Signed-off-by: Alex Shi <[email protected]>
---
lib/rwsem.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/lib/rwsem.c b/lib/rwsem.c
index ad5e0df..9aacf81 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -212,23 +212,25 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
adjustment == -RWSEM_ACTIVE_WRITE_BIAS)
sem = __rwsem_do_wake(sem, RWSEM_WAKE_READ_OWNED);

- raw_spin_unlock_irq(&sem->wait_lock);
-
/* wait to be given the lock */
for (;;) {
- if (!waiter.task)
+ if (!waiter.task) {
+ raw_spin_unlock_irq(&sem->wait_lock);
break;
+ }

- raw_spin_lock_irq(&sem->wait_lock);
- /* Try to get the writer sem, may steal from the head writer: */
+ /* Try to get the writer sem, may steal from the head writer */
if (flags == RWSEM_WAITING_FOR_WRITE)
if (try_get_writer_sem(sem, &waiter)) {
raw_spin_unlock_irq(&sem->wait_lock);
return sem;
}
+ preempt_disable();
raw_spin_unlock_irq(&sem->wait_lock);
+ preempt_enable_no_resched();
schedule();
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+ raw_spin_lock_irq(&sem->wait_lock);
}

tsk->state = TASK_RUNNING;
--
1.7.12

2013-05-03 08:55:26

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v4 0/6] sched: use runnable load based balance

On 05/03/2013 03:55 PM, Alex Shi wrote:
> Just found a benchmark named pthread_cond_broadcast:
> http://kristiannielsen.livejournal.com/13577.html. Will play with it. :)
>> >

I tried the pthread_cond_broadcast benchmark with and without my latest
patchset; there seems to be no clear performance change.
Without the patchset, finishing 30000 threads needs 0.39 ~ 0.49 seconds
across ten test runs; with the patchset, finishing 30000 threads needs
0.38 ~ 0.47 seconds across ten test runs.

I tried to use more threads in the test, but when the thread count
increased to 40000, the test became extremely slow and gdb found it busy
creating threads:


[New Thread 0x7fe470411700 (LWP 24629)]
[New Thread 0x7fe470512700 (LWP 24628)]
[New Thread 0x7fe470613700 (LWP 24627)]
[New Thread 0x7fe470714700 (LWP 24626)]



--
Thanks
Alex

2013-05-07 07:51:25

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v4 0/6] sched: use runnable load based balance

On 05/03/2013 03:55 PM, Alex Shi wrote:
>
>> That should probably look like:
>>
>> preempt_disable();
>> raw_spin_unlock_irq();
>> preempt_enable_no_resched();
>> schedule();
>>
>> Otherwise you might find a performance regression on PREEMPT=y kernels.
>
> Yes, right!
> Thanks a lot for the reminder. The following patch will fix it.
>>

Peter, would you like to pick up this patch?
>
> ---
>
> From 4c9b4b8a9b92bcbe6934637fd33c617e73dbda97 Mon Sep 17 00:00:00 2001
> From: Alex Shi <[email protected]>
> Date: Fri, 3 May 2013 14:51:25 +0800
> Subject: [PATCH 8/8] rwsem: small optimizing rwsem_down_failed_common
>
> Peter Zijlstra suggested adding a preempt_enable_no_resched() to prevent
> an unnecessary reschedule in raw_spin_unlock.
> We can also combine two raw_spin_lock sections to save one. Hence this patch.
>
> Thanks Peter!
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> lib/rwsem.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/lib/rwsem.c b/lib/rwsem.c
> index ad5e0df..9aacf81 100644
> --- a/lib/rwsem.c
> +++ b/lib/rwsem.c
> @@ -212,23 +212,25 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
> adjustment == -RWSEM_ACTIVE_WRITE_BIAS)
> sem = __rwsem_do_wake(sem, RWSEM_WAKE_READ_OWNED);
>
> - raw_spin_unlock_irq(&sem->wait_lock);
> -
> /* wait to be given the lock */
> for (;;) {
> - if (!waiter.task)
> + if (!waiter.task) {
> + raw_spin_unlock_irq(&sem->wait_lock);
> break;
> + }
>
> - raw_spin_lock_irq(&sem->wait_lock);
> - /* Try to get the writer sem, may steal from the head writer: */
> + /* Try to get the writer sem, may steal from the head writer */
> if (flags == RWSEM_WAITING_FOR_WRITE)
> if (try_get_writer_sem(sem, &waiter)) {
> raw_spin_unlock_irq(&sem->wait_lock);
> return sem;
> }
> + preempt_disable();
> raw_spin_unlock_irq(&sem->wait_lock);
> + preempt_enable_no_resched();
> schedule();
> set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> + raw_spin_lock_irq(&sem->wait_lock);
> }
>
> tsk->state = TASK_RUNNING;
>


--
Thanks
Alex

2013-05-08 16:12:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v4 0/6] sched: use runnable load based balance

On Tue, May 07, 2013 at 03:51:14PM +0800, Alex Shi wrote:
> On 05/03/2013 03:55 PM, Alex Shi wrote:
> >
> >> That should probably look like:
> >>
> >> preempt_disable();
> >> raw_spin_unlock_irq();
> >> preempt_enable_no_resched();
> >> schedule();
> >>
> >> Otherwise you might find a performance regression on PREEMPT=y kernels.
> >
> > Yes, right!
> > Thanks a lot for reminder. The following patch will fix it.
> >>
>
> Peter, would you like to pick up this patch?

There's this huge series from walken that touches the same code.. I'll wait to
see what happens with that.