The current implementation of the CPU controller uses hierarchical
runqueues, where on wakeup a task is enqueued on its group's runqueue,
the group is enqueued on the runqueue of the group above it, etc.
This adds a fairly large amount of overhead for workloads that
do a lot of wakeups per second, especially given that the default systemd
hierarchy is 2 or 3 levels deep.
This patch series is an attempt at reducing that overhead, by placing
all the tasks on the same runqueue, and scaling the task priority by
the priority of the group, which is calculated periodically.
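To illustrate the idea, here is a toy model (not code from this series; the
struct and helper names are made up) of how a task's effective weight on the
flat runqueue can be thought of: its own weight scaled by each ancestor
group's share of its parent's load. The series itself derives the real value
from the group PELT averages (task_se_h_weight), recalculated periodically,
rather than walking the hierarchy on every enqueue.

/*
 * Toy model only -- not code from this series.
 */
#include <stdio.h>

struct toy_group {
	struct toy_group *parent;
	unsigned long weight;		/* this group's weight on its parent */
	unsigned long parent_load;	/* total load at the parent level */
};

static unsigned long task_h_weight(unsigned long task_weight,
				   struct toy_group *g)
{
	unsigned long w = task_weight;

	for (; g && g->parent; g = g->parent)
		w = w * g->weight / g->parent_load;

	return w ? w : 1;	/* never let the weight collapse to zero */
}

int main(void)
{
	struct toy_group root    = { NULL,    1024, 1024 };
	struct toy_group system  = { &root,    512, 2048 };	/* half of a busy root */
	struct toy_group service = { &system, 1024, 4096 };	/* a quarter of "system" */

	/* A nice-0 task (weight 1024) two cgroup levels deep. */
	printf("h_weight = %lu\n", task_h_weight(1024, &service));
	return 0;
}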
My main TODO items for the next while are going to be testing,
testing, and testing. I hope to flush out every corner case I can
find, make sure performance does not regress with any workload,
and hopefully improve it for some.
Other TODO items:
- More code cleanups.
- Remove some more now unused code.
- Reimplement CONFIG_CFS_BANDWIDTH.
Plan for the CONFIG_CFS_BANDWIDTH reimplementation (see the sketch after this list):
- When a cgroup gets throttled, mark the cgroup and its children
as throttled.
- When pick_next_entity finds a task that is on a throttled cgroup,
stash it on the cgroup runqueue (which is not used for runnable
tasks any more). Leave the vruntime unchanged, and adjust that
runqueue's vruntime to be that of the left-most task.
- When a cgroup gets unthrottled, and has tasks on it, place it on
a vruntime ordered heap separate from the main runqueue.
- Have pick_next_task_fair grab one task off that heap every time it
is called, and the min vruntime of that heap is lower than the
vruntime of the CPU's cfs_rq (or the CPU has no other runnable tasks).
- Place that selected task on the CPU's cfs_rq, renormalizing its
vruntime with the GENTLE_FAIR_SLEEPERS logic. That should help
interleave the already runnable tasks with the recently unthrottled
group, and prevent thundering herd issues.
- If the group gets throttled again before all of its tasks have had a
  chance to run, vruntime sorting ensures all the tasks in the throttled
  cgroup get a chance to run over time.
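As promised above, here is a rough standalone model of the unthrottle pick
rule. Nothing below exists in the series yet; the types and the helper name
are invented purely for illustration.

/*
 * Toy model only -- the rule is from the plan above, the code is not.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_rq {
	unsigned long long min_vruntime;
	unsigned int nr_running;
};

/* Take one task from the unthrottled group's stash whenever the stash is
 * behind the CPU's cfs_rq in vruntime, or the cfs_rq has nothing else. */
static bool pick_from_unthrottled(const struct toy_rq *cfs_rq,
				  const struct toy_rq *stash)
{
	if (!stash->nr_running)
		return false;
	if (!cfs_rq->nr_running)
		return true;
	return stash->min_vruntime <= cfs_rq->min_vruntime;
}

int main(void)
{
	struct toy_rq cfs_rq = { .min_vruntime = 1000000, .nr_running = 4 };
	struct toy_rq stash  = { .min_vruntime =  400000, .nr_running = 2 };

	/* The stashed tasks fell behind while throttled, so they get picked
	 * first, one per pick_next_task_fair() call. */
	printf("pick from stash: %d\n", pick_from_unthrottled(&cfs_rq, &stash));
	return 0;
}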
Changes from v2:
- fixed the web server performance regression, in a way vaguely similar
to what Josef Bacik suggested (blame me for the implementation)
- removed some code duplication so the diffstat is redder than before
- propagate sum_exec_runtime up the tree, in preparation for CFS_BANDWIDTH
- small cleanups left and right
Changes from v1:
- use task_se_h_weight instead of task_se_h_load in calc_delta_fair
  and sched_slice; this seems to improve performance a little, but
  I still have a remaining regression to chase with our web server
  workload
- implement a number of the changes suggested by Dietmar Eggemann
(still holding out for a better name for group_cfs_rq_of_parent)
This series applies on top of 5.2
include/linux/sched.h | 7
kernel/sched/core.c | 3
kernel/sched/debug.c | 17 -
kernel/sched/fair.c | 780 ++++++++++++++++++++------------------------------
kernel/sched/pelt.c | 69 ++--
kernel/sched/pelt.h | 3
kernel/sched/sched.h | 11
7 files changed, 372 insertions(+), 518 deletions(-)
The way the time slice length is currently calculated, not only do high
priority tasks get longer time slices than low priority tasks, but due
to fixed point math, low priority tasks could end up with a zero-length
time slice. This can lead to cache thrashing and other inefficiencies.
Cap the minimum time slice length to sysctl_sched_min_granularity.
Tasks that get a time slice longer than their relative priority warrants
will simply have their vruntime advanced much faster than other tasks,
and will receive time slices less frequently as a result.
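A back-of-the-envelope illustration (the numbers are made up, and the real
code uses __calc_delta()'s fixed point math rather than plain division, but
the truncation effect is the same):

/* Standalone arithmetic only; illustrative numbers, not measurements. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t period = 15000000ULL;	  /* ~15ms period with ~20 runnable entities */
	uint64_t nice19 = 15;		  /* load.weight of a nice +19 task */
	uint64_t queue  = 19 * 1024 + 15; /* 19 nice-0 siblings on the same cfs_rq */

	/* slice ~= period * se->load.weight / cfs_rq->load.weight */
	uint64_t slice = period * nice19 / queue;
	printf("leaf slice: %llu ns\n", (unsigned long long)slice);

	/* Two more hierarchy levels where the group entity has a tiny weight
	 * (2) on a busy parent (~2048) truncate the slice to nothing: */
	slice = slice * 2 / 2050;
	slice = slice * 2 / 2050;
	printf("after group scaling: %llu ns\n", (unsigned long long)slice);
	return 0;
}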
Signed-off-by: Rik van Riel <[email protected]>
---
kernel/sched/fair.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 39f7a2d810e1..9ff69b927a3c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -732,6 +732,13 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
}
slice = __calc_delta(slice, se->load.weight, load);
}
+
+ /*
+ * To avoid cache thrashing, run at least sysctl_sched_min_granularity.
+ * The vruntime of a low priority task advances faster; those tasks
+ * will simply get time slices less frequently.
+ */
+ slice = max_t(u64, slice, sysctl_sched_min_granularity);
return slice;
}
--
2.20.1
Refactor enqueue_entity, dequeue_entity, and update_load_avg, in order
to split out the things we still want to happen at every level in the
cgroup hierarchy with a flat runqueue from the things we only need to
happen once.
No functional changes.
Signed-off-by: Rik van Riel <[email protected]>
---
kernel/sched/fair.c | 64 +++++++++++++++++++++++++++++----------------
1 file changed, 42 insertions(+), 22 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b4fc328032e6..c944729423bd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3500,7 +3500,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
#define DO_ATTACH 0x4
/* Update task and its cfs_rq load average */
-static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+static inline bool update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
u64 now = cfs_rq_clock_pelt(cfs_rq);
int decayed;
@@ -3529,6 +3529,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
} else if (decayed && (flags & UPDATE_TG))
update_tg_load_avg(cfs_rq, 0);
+
+ return decayed;
}
#ifndef CONFIG_64BIT
@@ -3745,9 +3747,10 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
#define SKIP_AGE_LOAD 0x0
#define DO_ATTACH 0x0
-static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
+static inline bool update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
{
cfs_rq_util_change(cfs_rq, 0);
+ return false;
}
static inline void remove_entity_load_avg(struct sched_entity *se) {}
@@ -3870,6 +3873,24 @@ static inline void check_schedstat_required(void)
* CPU and an up-to-date min_vruntime on the destination CPU.
*/
+static bool
+enqueue_entity_groups(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+{
+ /*
+ * When enqueuing a sched_entity, we must:
+ * - Update loads to have both entity and cfs_rq synced with now.
+ * - Add its load to cfs_rq->runnable_avg
+ * - For group_entity, update its weight to reflect the new share of
+ * its group cfs_rq
+ * - Add its new weight to cfs_rq->load.weight
+ */
+ if (!update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH))
+ return false;
+
+ update_cfs_group(se);
+ return true;
+}
+
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
@@ -3894,16 +3915,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (renorm && !curr)
se->vruntime += cfs_rq->min_vruntime;
- /*
- * When enqueuing a sched_entity, we must:
- * - Update loads to have both entity and cfs_rq synced with now.
- * - Add its load to cfs_rq->runnable_avg
- * - For group_entity, update its weight to reflect the new share of
- * its group cfs_rq
- * - Add its new weight to cfs_rq->load.weight
- */
- update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
- update_cfs_group(se);
enqueue_runnable_load_avg(cfs_rq, se);
account_entity_enqueue(cfs_rq, se);
@@ -3970,14 +3981,9 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
-static void
-dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+static bool
+dequeue_entity_groups(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
- /*
- * Update run-time statistics of the 'current'.
- */
- update_curr(cfs_rq);
-
/*
* When dequeuing a sched_entity, we must:
* - Update loads to have both entity and cfs_rq synced with now.
@@ -3986,7 +3992,21 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* - For group entity, update its weight to reflect the new share
* of its group cfs_rq.
*/
- update_load_avg(cfs_rq, se, UPDATE_TG);
+ if (!update_load_avg(cfs_rq, se, UPDATE_TG))
+ return false;
+ update_cfs_group(se);
+
+ return true;
+}
+
+static void
+dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+{
+ /*
+ * Update run-time statistics of the 'current'.
+ */
+ update_curr(cfs_rq);
+
dequeue_runnable_load_avg(cfs_rq, se);
update_stats_dequeue(cfs_rq, se, flags);
@@ -4010,8 +4030,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
/* return excess runtime on last dequeue */
return_cfs_rq_runtime(cfs_rq);
- update_cfs_group(se);
-
/*
* Now advance min_vruntime if @se was the entity holding it back,
* except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
@@ -5176,6 +5194,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (se->on_rq)
break;
cfs_rq = cfs_rq_of(se);
+ enqueue_entity_groups(cfs_rq, se, flags);
enqueue_entity(cfs_rq, se, flags);
/*
@@ -5258,6 +5277,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
+ dequeue_entity_groups(cfs_rq, se, flags);
dequeue_entity(cfs_rq, se, flags);
/*
--
2.20.1
On Mon, Jul 22, 2019 at 01:33:43PM -0400, Rik van Riel wrote:
> Refactor enqueue_entity, dequeue_entity, and update_load_avg, in order
> to split out the things we still want to happen at every level in the
> cgroup hierarchy with a flat runqueue from the things we only need to
> happen once.
>
> No functional changes.
> @@ -3500,7 +3500,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> #define DO_ATTACH 0x4
>
> /* Update task and its cfs_rq load average */
> -static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> +static inline bool update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> {
> u64 now = cfs_rq_clock_pelt(cfs_rq);
> int decayed;
> @@ -3529,6 +3529,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
> } else if (decayed && (flags & UPDATE_TG))
> update_tg_load_avg(cfs_rq, 0);
> +
> + return decayed;
> }
>
> #ifndef CONFIG_64BIT
> @@ -3745,9 +3747,10 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> #define SKIP_AGE_LOAD 0x0
> #define DO_ATTACH 0x0
>
> -static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
> +static inline bool update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
> {
> cfs_rq_util_change(cfs_rq, 0);
> + return false;
> }
>
> static inline void remove_entity_load_avg(struct sched_entity *se) {}
> @@ -3870,6 +3873,24 @@ static inline void check_schedstat_required(void)
> * CPU and an up-to-date min_vruntime on the destination CPU.
> */
>
> +static bool
> +enqueue_entity_groups(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> +{
> + /*
> + * When enqueuing a sched_entity, we must:
> + * - Update loads to have both entity and cfs_rq synced with now.
> + * - Add its load to cfs_rq->runnable_avg
> + * - For group_entity, update its weight to reflect the new share of
> + * its group cfs_rq
> + * - Add its new weight to cfs_rq->load.weight
> + */
> + if (!update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH))
> + return false;
> +
> + update_cfs_group(se);
> + return true;
> +}
> +
> static void
> enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> {
> @@ -3894,16 +3915,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> if (renorm && !curr)
> se->vruntime += cfs_rq->min_vruntime;
>
> - /*
> - * When enqueuing a sched_entity, we must:
> - * - Update loads to have both entity and cfs_rq synced with now.
> - * - Add its load to cfs_rq->runnable_avg
> - * - For group_entity, update its weight to reflect the new share of
> - * its group cfs_rq
> - * - Add its new weight to cfs_rq->load.weight
> - */
> - update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
> - update_cfs_group(se);
> enqueue_runnable_load_avg(cfs_rq, se);
> account_entity_enqueue(cfs_rq, se);
>
No functional, but you did make update_cfs_group() conditional. Now that
looks OK, but maybe you can do that part in a separate patch with a
little justification of its own.
On Tue, 2019-07-30 at 11:36 +0200, Peter Zijlstra wrote:
> On Mon, Jul 22, 2019 at 01:33:43PM -0400, Rik van Riel wrote:
> > @@ -3870,6 +3873,24 @@ static inline void check_schedstat_required(void)
> > * CPU and an up-to-date min_vruntime on the destination CPU.
> > */
> >
> > +static bool
> > +enqueue_entity_groups(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > +{
> > + /*
> > + * When enqueuing a sched_entity, we must:
> > + * - Update loads to have both entity and cfs_rq synced with now.
> > + * - Add its load to cfs_rq->runnable_avg
> > + * - For group_entity, update its weight to reflect the new share of
> > + *   its group cfs_rq
> > + * - Add its new weight to cfs_rq->load.weight
> > + */
> > + if (!update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH))
> > + return false;
> > +
> > + update_cfs_group(se);
> > + return true;
> > +}
> >
>
> No functional, but you did make update_cfs_group() conditional. Now that
> looks OK, but maybe you can do that part in a separate patch with a
> little justification of its own.
Good idea, I will split that out.
--
All Rights Reversed.
On Mon, Jul 22, 2019 at 01:33:34PM -0400, Rik van Riel wrote:
> Plan for the CONFIG_CFS_BANDWIDTH reimplementation:
> - When a cgroup gets throttled, mark the cgroup and its children
> as throttled.
> - When pick_next_entity finds a task that is on a throttled cgroup,
> stash it on the cgroup runqueue (which is not used for runnable
> tasks any more). Leave the vruntime unchanged, and adjust that
> runqueue's vruntime to be that of the left-most task.
and ignore such tasks for the load-balancer; I suppose
> - When a cgroup gets unthrottled, and has tasks on it, place it on
> a vruntime ordered heap separate from the main runqueue.
and expose said heap to the load-balancer..
Now, I suppose you do this because merging heaps is easier than merging
RB trees? (not in complexity iirc, but probably in code)
> - Have pick_next_task_fair grab one task off that heap every time it
> is called, and the min vruntime of that heap is lower than the
> vruntime of the CPU's cfs_rq (or the CPU has no other runnable tasks).
That's like a smeared out merge :-) But since the other tasks kept on
running, this CPU's vruntime will be (much) advanced vs those throttled
tasks and we'll most likely end up picking them all before we pick a
'normal' task.
> - Place that selected task on the CPU's cfs_rq, renormalizing its
> vruntime with the GENTLE_FAIR_SLEEPERS logic. That should help
> interleave the already runnable tasks with the recently unthrottled
> group, and prevent thundering herd issues.
> - If the group gets throttled again before all of its tasks have had a
>   chance to run, vruntime sorting ensures all the tasks in the throttled
>   cgroup get a chance to run over time.
On Tue, 2019-07-30 at 18:29 +0200, Peter Zijlstra wrote:
> On Mon, Jul 22, 2019 at 01:33:34PM -0400, Rik van Riel wrote:
> > Plan for the CONFIG_CFS_BANDWIDTH reimplementation:
> > - When a cgroup gets throttled, mark the cgroup and its children
> > as throttled.
> > - When pick_next_entity finds a task that is on a throttled cgroup,
> > stash it on the cgroup runqueue (which is not used for runnable
> > tasks any more). Leave the vruntime unchanged, and adjust that
> > runqueue's vruntime to be that of the left-most task.
>
> and ignore such tasks for the load-balancer; I suppose
Good point. I suppose we need to find a lazy way of doing
this, too...
> > - When a cgroup gets unthrottled, and has tasks on it, place it on
> > a vruntime ordered heap separate from the main runqueue.
>
> and expose said heap to the load-balancer..
>
> Now, I suppose you do this because merging heaps is easier than merging
> RB trees? (not in complexity iirc, but probably in code)
>
> > - Have pick_next_task_fair grab one task off that heap every time it
> >   is called, and the min vruntime of that heap is lower than the
> >   vruntime of the CPU's cfs_rq (or the CPU has no other runnable tasks).
>
> That's like a smeared out merge :-) But since the other tasks kept on
> running, this CPU's vruntime will be (much) advanced vs those throttled
> tasks and we'll most likely end up picking them all before we pick a
> 'normal' task.
The GENTLE_FAIR_SLEEPERS code should place the vruntime
of newly unthrottled tasks to the right of that of the
current top task on the runqueue, to prevent thundering
herd effects (and the throttled group immediately going
over its quota again, while causing bad latency for others).
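For reference, a simplified paraphrase of the wakeup placement that
GENTLE_FAIR_SLEEPERS controls (not a verbatim copy of place_entity(); the
initial-placement/START_DEBIT half is omitted and the comments are mine):

/*
 * Simplified paraphrase of the existing wakeup placement (the !initial
 * path of place_entity()).
 */
static void place_entity_sketch(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	u64 vruntime = cfs_rq->min_vruntime;
	unsigned long thresh = sysctl_sched_latency;

	/* Sleeper credit: at most one latency period, halved when
	 * GENTLE_FAIR_SLEEPERS is enabled. */
	if (sched_feat(GENTLE_FAIR_SLEEPERS))
		thresh >>= 1;
	vruntime -= thresh;

	/* Never gain time by being placed backwards: a task whose vruntime
	 * fell far behind while throttled gets pulled up to roughly
	 * min_vruntime instead of monopolizing the CPU. */
	se->vruntime = max_vruntime(se->vruntime, vruntime);
}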
> > - Place that selected task on the CPU's cfs_rq, renormalizing its
> >   vruntime with the GENTLE_FAIR_SLEEPERS logic. That should help
> >   interleave the already runnable tasks with the recently unthrottled
> >   group, and prevent thundering herd issues.
> > - If the group gets throttled again before all of its tasks have had a
> >   chance to run, vruntime sorting ensures all the tasks in the throttled
> >   cgroup get a chance to run over time.
>
>
--
All Rights Reversed.
On Tue, Jul 30, 2019 at 11:36:17AM +0200, Peter Zijlstra wrote:
> On Mon, Jul 22, 2019 at 01:33:43PM -0400, Rik van Riel wrote:
> > +static bool
> > +enqueue_entity_groups(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > +{
> > + /*
> > + * When enqueuing a sched_entity, we must:
> > + * - Update loads to have both entity and cfs_rq synced with now.
> > + * - Add its load to cfs_rq->runnable_avg
> > + * - For group_entity, update its weight to reflect the new share of
> > + * its group cfs_rq
> > + * - Add its new weight to cfs_rq->load.weight
> > + */
> > + if (!update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH))
> > + return false;
> > +
> > + update_cfs_group(se);
> > + return true;
> > +}
> No functional, but you did make update_cfs_group() conditional. Now that
> looks OK, but maybe you can do that part in a separate patch with a
> little justification of its own.
To record (and extend) our discussion from IRC yesterday; I now do think
the above is in fact a problem.
The thing is that update_cfs_group() does not solely rely on the tg state;
it also contains magic to deal with ramp up; for which you later
introduce that max_h_load thing.
Specifically (re)read the second part of the comment describing
calc_group_shares() where it explains the ramp up:
* The problem with it is that because the average is slow -- it was designed
* to be exactly that of course -- this leads to transients in boundary
* conditions. In specific, the case where the group was idle and we start the
* one task. It takes time for our CPU's grq->avg.load_avg to build up,
* yielding bad latency etc..
(and further)
So by not always calling this (and not updating h_load) you fail to take
advantage of this.
So I would suggest keeping update_cfs_group() unconditional, and
recomputing the h_load for the entire hierarchy on enqueue.
On Wed, 2019-07-31 at 11:35 +0200, Peter Zijlstra wrote:
> On Tue, Jul 30, 2019 at 11:36:17AM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 22, 2019 at 01:33:43PM -0400, Rik van Riel wrote:
> > > +static bool
> > > +enqueue_entity_groups(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > > +{
> > > + /*
> > > + * When enqueuing a sched_entity, we must:
> > > + * - Update loads to have both entity and cfs_rq synced with now.
> > > + * - Add its load to cfs_rq->runnable_avg
> > > + * - For group_entity, update its weight to reflect the new share of
> > > + *   its group cfs_rq
> > > + * - Add its new weight to cfs_rq->load.weight
> > > + */
> > > + if (!update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH))
> > > + return false;
> > > +
> > > + update_cfs_group(se);
> > > + return true;
> > > +}
> > No functional, but you did make update_cfs_group() conditional. Now that
> > looks OK, but maybe you can do that part in a separate patch with a
> > little justification of its own.
>
> To record (and extend) our discussion from IRC yesterday; I now do think
> the above is in fact a problem.
>
> The thing is that update_cfs_group() does not solely rely on the tg state;
> it also contains magic to deal with ramp up; for which you later
> introduce that max_h_load thing.
>
> Specifically (re)read the second part of the comment describing
> calc_group_shares() where it explains the ramp up:
>
> * The problem with it is that because the average is slow -- it was designed
> * to be exactly that of course -- this leads to transients in boundary
> * conditions. In specific, the case where the group was idle and we start the
> * one task. It takes time for our CPU's grq->avg.load_avg to build up,
> * yielding bad latency etc..
>
> (and further)
>
> So by not always calling this (and not updating h_load) you fail to take
> advantage of this.
>
> So I would suggest keeping update_cfs_group() unconditional, and
> recomputing the h_load for the entire hierarchy on enqueue.
I think I understand the problem you are pointing
out, but if update_load_avg() keeps the load average
for the runqueue unchanged (because that is rate limited
to once a jiffy, and has been like that for a while),
why would calc_group_shares() result in a different value
than what it returned the last time?
What am I overlooking?
--
All Rights Reversed.
On Wed, Jul 31, 2019 at 11:03:01AM -0400, Rik van Riel wrote:
> I think I understand the problem you are pointing out, but if
> update_load_avg() keeps the load average for the runqueue unchanged
> (because that is rate limited to once a jiffy, and has been like that
> for a while), why would calc_group_shares() result in a different
> value than what it returned the last time?
>
> What am I overlooking?
I'm thinking you're thinking (3):
tg->weight * grq->avg.load_avg
shares = ------------------------------
tg->load_avg
Where: tg->load_avg ~= \Sum grq->avg.load_avg
Which is the straightforward shares calculation, and purely depends on
the load averages (which haven't been changed etc..)
But what we actually do is (6):
tg->weight * grq->avg.load_avg
shares = ---------------------------------------------------------------------------
tg->load_avg - grq->avg.load_avg + max(grq->load.weight, grq->avg.load_avg)
And even if tg->load_avg and grq->avg.load_avg haven't changed,
grq->load.weight most certainly has.
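For reference, (6) corresponds roughly to what calc_group_shares() does
(slightly simplified here, with comments added to point out where
grq->load.weight enters):

static long calc_group_shares(struct cfs_rq *cfs_rq)
{
	long tg_weight, tg_shares, load, shares;
	struct task_group *tg = cfs_rq->tg;

	tg_shares = READ_ONCE(tg->shares);

	/* This max() is where the instantaneous grq->load.weight sneaks in,
	 * so the shares react to an enqueue long before grq->avg.load_avg
	 * has ramped up. */
	load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);

	tg_weight = atomic_long_read(&tg->load_avg);

	/* Swap this CPU's averaged contribution for the boosted 'load',
	 * giving the tg->load_avg - grq->avg.load_avg + max(...) denominator
	 * of (6). */
	tg_weight -= cfs_rq->tg_load_avg_contrib;
	tg_weight += load;

	shares = (tg_shares * load);
	if (tg_weight)
		shares /= tg_weight;

	return clamp_t(long, shares, MIN_SHARES, tg_shares);
}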