This new version stays quite close to the previous one and should
seamlessly replace the previous one that is part of Mel's patchset:
https://lkml.org/lkml/2020/2/14/156
NUMA load balancing is the last remaining piece of code that uses the
runnable_load_avg of PELT to balance tasks between nodes. The normal
load_balance has replaced it with a better description of the current
state of the group of CPUs. The same policy can be applied to NUMA
balancing.
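As a rough illustration, here is a minimal standalone sketch (the types
and the helper are hypothetical, it is not the kernel's actual
update_numa_stats()) of the kind of per-node description that
load_balance already relies on and that NUMA balancing can reuse, built
from load_avg, utilization and capacity rather than runnable_load_avg:

struct cpu_sample {
	unsigned long load_avg;		/* PELT load of the CPU's cfs_rq */
	unsigned long util_avg;		/* PELT utilization of the CPU */
	unsigned long capacity;		/* compute capacity of the CPU */
	unsigned int  nr_running;	/* runnable tasks on the CPU */
};

struct node_stats {
	unsigned long load;
	unsigned long util;
	unsigned long compute_capacity;
	unsigned int  nr_running;
};

/* Aggregate the state of a node from per-CPU samples */
static void update_node_stats(struct node_stats *ns,
			      const struct cpu_sample *cpus, int nr_cpus)
{
	int i;

	ns->load = ns->util = ns->compute_capacity = 0;
	ns->nr_running = 0;

	for (i = 0; i < nr_cpus; i++) {
		/* load_avg instead of the removed runnable_load_avg */
		ns->load += cpus[i].load_avg;
		ns->util += cpus[i].util_avg;
		ns->compute_capacity += cpus[i].capacity;
		ns->nr_running += cpus[i].nr_running;
	}
}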
Once unused, runnable_load_avg can be replaced by a simpler runnable_avg
signal that tracks the time tasks spend waiting on the rq. Currently, the
state of a group of CPUs is defined by the number of running tasks and the
utilization level of the rq. But the utilization can be temporarily low
after the migration of a task while the rq is still overloaded with tasks.
In such a case, where tasks were competing for the rq, runnable_avg will
stay high after the migration.
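The difference between the two signals can be shown with a small
standalone sketch (a simplified model, not the kernel's PELT code, which
applies a geometric decay to both sums): utilization only accounts the
time tasks actually run, while runnable also accounts the time they spend
waiting on the rq, so it keeps growing with the number of competing
tasks:

#include <stdio.h>

struct signals {
	unsigned long util;	/* time the CPU actually executed tasks */
	unsigned long runnable;	/* time tasks were running or waiting on the rq */
};

static void account(struct signals *s, unsigned long delta,
		    unsigned int nr_running)
{
	if (nr_running)
		s->util += delta;		/* capped by wall-clock time */
	s->runnable += delta * nr_running;	/* grows with competition */
}

int main(void)
{
	struct signals s = { 0, 0 };

	/* 4 tasks competing on one CPU for 1000 time units */
	account(&s, 1000, 4);

	/* util says "100% busy"; runnable says "4x oversubscribed" */
	printf("util=%lu runnable=%lu\n", s.util, s.runnable);
	return 0;
}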
Some hackbench results:
- small arm64 dual quad cores system
hackbench -l (2560/#grp) -g #grp
grp tip/sched/core +patchset improvement
1 1,327(+/-10,06 %) 1,247(+/-5,45 %) 5,97 %
4 1,250(+/- 2,55 %) 1,207(+/-2,12 %) 3,42 %
8 1,189(+/- 1,47 %) 1,179(+/-1,93 %) 0,90 %
16 1,221(+/- 3,25 %) 1,219(+/-2,44 %) 0,16 %
- large arm64 2 nodes / 224 cores system
hackbench -l (256000/#grp) -g #grp
grp tip/sched/core +patchset improvement
1 14,197(+/- 2,73 %) 13,917(+/- 2,19 %) 1,98 %
4 6,817(+/- 1,27 %) 6,523(+/-11,96 %) 4,31 %
16 2,930(+/- 1,07 %) 2,911(+/- 1,08 %) 0,66 %
32 2,735(+/- 1,71 %) 2,725(+/- 1,53 %) 0,37 %
64 2,702(+/- 0,32 %) 2,717(+/- 1,07 %) -0,53 %
128 3,533(+/-14,66 %) 3,123(+/-12,47 %) 11,59 %
256 3,918(+/-19,93 %) 3,390(+/- 5,93 %) 13,47 %
The significant improvement for 128 and 256 groups should be taken with
care because of some instability across iterations without the patchset.
The table below shows how sched groups were classified during load
balance (idle, newly idle or busy lb), with the distribution according to
the number of running tasks, for:
 hackbench -l 640 -g 4 on octo cores
                          tip/sched/core    +patchset
state
has spare                           3973         1934
        nr_running
                0                   1965         1858
                1                    518           56
                2                    369           18
                3                    270            2
                4+                   851            0
fully busy                           546         1018
        nr_running
                0                      0            0
                1                    546         1018
                2                      0            0
                3                      0            0
                4+                     0            0
overloaded                          2109         3056
        nr_running
                0                      0            0
                1                      0            0
                2                    241          483
                3                    170          348
                4+                  1698         2225
total                               6628         6008
Without the patchset, there is a significant number of times that a CPU
is classified as having spare capacity although more than 1 task is
running on it. Although this is a valid case, it is not a state that
should happen often when 160 tasks are competing on 8 cores, as in this
test. The patchset fixes the situation by taking runnable_avg into
account, which stays high after the migration of a task to another CPU.
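As an illustration of the idea behind the last patch, here is a
simplified standalone sketch (own types, only roughly the shape of the
group_has_capacity()/group_is_overloaded() checks, not the exact kernel
code) of how a high runnable sum can prevent a group from being
classified as having spare capacity even when its utilization has
temporarily dropped after a migration:

struct group_stats {
	unsigned long group_capacity;	/* sum of the CPUs' capacities */
	unsigned long group_util;	/* sum of the CPUs' util_avg */
	unsigned long group_runnable;	/* sum of the CPUs' runnable_avg */
	unsigned int  sum_nr_running;	/* runnable tasks in the group */
	unsigned int  group_weight;	/* number of CPUs in the group */
};

/* imbalance_pct is a margin: e.g. 117 requires ~17% headroom */
static int group_has_spare_capacity(const struct group_stats *sgs,
				    unsigned int imbalance_pct)
{
	if (sgs->sum_nr_running < sgs->group_weight)
		return 1;

	/* Runnable time well above capacity: the CPUs are oversubscribed */
	if (sgs->group_capacity * imbalance_pct < sgs->group_runnable * 100)
		return 0;

	/* Otherwise fall back to the utilization check */
	if (sgs->group_capacity * 100 > sgs->group_util * imbalance_pct)
		return 1;

	return 0;
}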
Changes since v3:
- fix some comments and typos
- collect runnable_avg in update_sg_wakeup_stats()
- use cpu capacity instead of SCHED_CAPACITY_SCALE as init value for
runnable_avg
I haven't rerun all the tests; the results above come from v2. I have
only run a subset on the octo-core system with the latest tip/sched/core:
grp v3
1 1,191(+/-0.77 %)
4 1,147(+/-1.14 %)
8 1,112(+/-1,52 %)
16 1,163(+/-1.72 %)
Vincent Guittot (5):
sched/fair: Reorder enqueue/dequeue_task_fair path
sched/numa: Replace runnable_load_avg by load_avg
sched/pelt: Remove unused runnable load average
sched/pelt: Add a new runnable average signal
sched/fair: Take into account runnable_avg to classify group
include/linux/sched.h | 31 ++--
kernel/sched/core.c | 2 -
kernel/sched/debug.c | 17 +-
kernel/sched/fair.c | 358 +++++++++++++++++++++++-------------------
kernel/sched/pelt.c | 59 +++----
kernel/sched/sched.h | 29 +++-
6 files changed, 272 insertions(+), 224 deletions(-)
--
2.17.1
On Fri, Feb 21, 2020 at 02:27:10PM +0100, Vincent Guittot wrote:
> This new version stays quite close to the previous one and should
> seamlessly replace the previous one that is part of Mel's patchset:
> https://lkml.org/lkml/2020/2/14/156
>
Thanks Vincent, just in time for tests to run over the weekend!
I can confirm the patches slotted in easily into a yet-to-be-released v6
of my series that still has my fix inserted after patch 2. After looking
at your series, I see only patches 3-5 need to be retested, as well as
my own patches on top. This should take less time as I can reuse some of
the old results. I'll post v6 if the tests complete successfully.
The overall diff is as follows in case you want to double check it is
what you expect.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3060ba94e813..3f51586365f3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -740,10 +740,8 @@ void init_entity_runnable_average(struct sched_entity *se)
* Group entities are initialized with zero load to reflect the fact that
* nothing has been attached to the task group yet.
*/
- if (entity_is_task(se)) {
- sa->runnable_avg = SCHED_CAPACITY_SCALE;
+ if (entity_is_task(se))
sa->load_avg = scale_load_down(se->load.weight);
- }
/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
}
@@ -796,6 +794,8 @@ void post_init_entity_util_avg(struct task_struct *p)
}
}
+ sa->runnable_avg = cpu_scale;
+
if (p->sched_class != &fair_sched_class) {
/*
* For !fair tasks do:
@@ -3083,9 +3083,9 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
#endif
enqueue_load_avg(cfs_rq, se);
- if (se->on_rq) {
+ if (se->on_rq)
account_entity_enqueue(cfs_rq, se);
- }
+
}
void reweight_task(struct task_struct *p, int prio)
@@ -5613,6 +5613,24 @@ static unsigned long cpu_runnable(struct rq *rq)
return cfs_rq_runnable_avg(&rq->cfs);
}
+static unsigned long cpu_runnable_without(struct rq *rq, struct task_struct *p)
+{
+ struct cfs_rq *cfs_rq;
+ unsigned int runnable;
+
+ /* Task has no contribution or is new */
+ if (cpu_of(rq) != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
+ return cpu_runnable(rq);
+
+ cfs_rq = &rq->cfs;
+ runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
+
+ /* Discount task's runnable from CPU's runnable */
+ lsub_positive(&runnable, p->se.avg.runnable_avg);
+
+ return runnable;
+}
+
static unsigned long capacity_of(int cpu)
{
return cpu_rq(cpu)->cpu_capacity;
@@ -8521,6 +8539,7 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
sgs->group_load += cpu_load_without(rq, p);
sgs->group_util += cpu_util_without(i, p);
+ sgs->group_runnable += cpu_runnable_without(rq, p);
local = task_running_on_cpu(i, p);
sgs->sum_h_nr_running += rq->cfs.h_nr_running - local;
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 2cc88d9e3b38..c40d57a2a248 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -267,8 +267,6 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
* load_sum := runnable
* load_avg = se_weight(se) * load_sum
*
- * XXX collapse load_sum and runnable_load_sum
- *
* cfq_rq:
*
* runnable_sum = \Sum se->avg.runnable_sum
@@ -325,7 +323,7 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
* util_sum = cpu_scale * load_sum
* runnable_sum = util_sum
*
- * load_avg and runnable_load_avg are not supported and meaningless.
+ * load_avg and runnable_avg are not supported and meaningless.
*
*/
@@ -351,7 +349,7 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
* util_sum = cpu_scale * load_sum
* runnable_sum = util_sum
*
- * load_avg and runnable_load_avg are not supported and meaningless.
+ * load_avg and runnable_avg are not supported and meaningless.
*
*/
@@ -378,7 +376,7 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
* util_sum = cpu_scale * load_sum
* runnable_sum = util_sum
*
- * load_avg and runnable_load_avg are not supported and meaningless.
+ * load_avg and runnable_avg are not supported and meaningless.
*
*/
Hi,
On 2/21/20 6:57 PM, Vincent Guittot wrote:
> This new version stays quite close to the previous one and should
> seamlessly replace the previous one that is part of Mel's patchset:
> https://lkml.org/lkml/2020/2/14/156
>
> NUMA load balancing is the last remaining piece of code that uses the
> runnable_load_avg of PELT to balance tasks between nodes. The normal
> load_balance has replaced it with a better description of the current
> state of the group of CPUs. The same policy can be applied to NUMA
> balancing.
>
> Once unused, runnable_load_avg can be replaced by a simpler runnable_avg
> signal that tracks the time tasks spend waiting on the rq. Currently, the
> state of a group of CPUs is defined by the number of running tasks and the
> utilization level of the rq. But the utilization can be temporarily low
> after the migration of a task while the rq is still overloaded with tasks.
> In such a case, where tasks were competing for the rq, runnable_avg will
> stay high after the migration.
>
> Some hackbench results:
>
> - small arm64 dual quad cores system
> hackbench -l (2560/#grp) -g #grp
>
> grp tip/sched/core +patchset improvement
> 1 1,327(+/-10,06 %) 1,247(+/-5,45 %) 5,97 %
> 4 1,250(+/- 2,55 %) 1,207(+/-2,12 %) 3,42 %
> 8 1,189(+/- 1,47 %) 1,179(+/-1,93 %) 0,90 %
> 16 1,221(+/- 3,25 %) 1,219(+/-2,44 %) 0,16 %
>
> - large arm64 2 nodes / 224 cores system
> hackbench -l (256000/#grp) -g #grp
>
> grp tip/sched/core +patchset improvement
> 1 14,197(+/- 2,73 %) 13,917(+/- 2,19 %) 1,98 %
> 4 6,817(+/- 1,27 %) 6,523(+/-11,96 %) 4,31 %
> 16 2,930(+/- 1,07 %) 2,911(+/- 1,08 %) 0,66 %
> 32 2,735(+/- 1,71 %) 2,725(+/- 1,53 %) 0,37 %
> 64 2,702(+/- 0,32 %) 2,717(+/- 1,07 %) -0,53 %
> 128 3,533(+/-14,66 %) 3,123(+/-12,47 %) 11,59 %
> 256 3,918(+/-19,93 %) 3,390(+/- 5,93 %) 13,47 %
[...]
I performed a similar experiment on an IBM POWER9 system with 2 nodes and
44 cores (22 per node).
- hackbench -l (256000/#grp) -g #grp
+-----+----------------+-------+
| grp | tip/sched/core | v4 |
+-----+----------------+-------+
| 1 | 76.97 | 76.31 |
| 4 | 56.56 | 56.86 |
| 8 | 54.23 | 54.25 |
| 16 | 53.94 | 53.24 |
| 32 | 54.10 | 54.01 |
| 64 | 54.38 | 54.35 |
| 128 | 55.11 | 55.08 |
| 256 | 55.97 | 56.04 |
| 512 | 54.81 | 55.5 |
+-----+----------------+-------+
- deviation in the results is very marginal (< 1%)
The results show no change with respect to hackbench. I will do further
benchmarking to see if any observable changes occur.
- Parth