Argh, regenerated mbox and forgot to fix the subject.
This should have been: sched: per-entity load-tracking.
I won't bother re-posting, but if you find yourself reading this
email, look to the much more interesting "Series short description".
Thanks,
- Paul
On Wed, Jun 27, 2012 at 7:24 PM, Paul Turner <[email protected]> wrote:
> Hi all,
>
> Please find attached the latest version for CFS load-tracking.
>
> It implements load-tracking on a per-sched_entity (currently SCHED_NORMAL, but
> could be extended to RT as well) basis. This results in a bottom-up
> load-computation in which entities contribute to their parents' load, as
> opposed to the current top-down approach where the parent averages its
> children.  In particular this allows us to correctly migrate load along with
> its accompanying entities and provides the necessary inputs for intelligent
> load-balancing and power-management.
>
> Modulo review I think this is now close to a mergeable state.
>
> Notable updates:
> ----------------
> - A few stability issues fixed; in particular on systems with tasks having
>   idle periods spanning several days.  Minor code cleanups.
> - 32-bit performance improved (now reported faster[*] on 32-bit ARM than
>   without, presumably due to better load-balance or reduced overhead.  Thanks
>   to Vincent Guittot and others for looking at this.)
> - All configs delinted.
> - Config dependencies separated so that load-tracking can be used without
>   FAIR_GROUP_SCHED, so that we may generally use it for power governors and
>   per-entity load-balance (in progress).
>
> Thanks to Peter, Vincent, Morten, Nikunj, and others for testing and comments.
>
> Rewinding some context since I've been buried with internal work and it's been
> a while since my last posting:
>
> Background:
> ----------
> We currently track the load average at the parenting cfs_rq level. We divide
> execution into a series of consecutive "windows", w_i.  Within each window we
> track:
>   \Sum load_i * t_i   where \Sum t_i = w and each load_i is a disjoint load level.
>
> The load average is then \Sum_j w_j / 2^j, i.e. each window's contribution is
> halved for every subsequent window boundary it has crossed.
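> For instance (illustrative numbers only): if the three most recently completed
> windows contributed w_0=600, w_1=400 and w_2=200 (w_0 newest), the tracked
> average is 600 + 400/2 + 200/4 = 850.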
>
> This works reasonably well but there are a few problems:
>
> 1) Since decay occurs at window boundaries there are 'skews':
> At a window boundary the 'foreground' time has a bias against the
> time immediately preceding it (as a result of the folding division),
> e.g.  __xx_|_yyyy___  vs  __xx_yyyy_|___  (where '|' is a window boundary).
>
> The accounting here is 2x + 4y/2 or 2x + 4y, depending on which side of the
> window your load lands on.
>
> 2) Everything within a window 'w' is accounted equally; we only fold at
> the boundaries.  This actually means you can't set 'w' large enough to
> really have meaningful coverage of the sched period without throwing
> decay out the window.  But then with w/2 << sched_period (currently
> w/2=5ms) the average ends up having fairly drastic swings even with
> stable loads.
>
> (Note: before we even looked at migrating to per-entity tracking we evaluated
> not folding the foreground window into the considered average until it was
> "complete"; this represented a measurable improvement in stability under
> predictable load.)
>
> 3) Since the load sum average is per-cfs_rq and not per-entity, when a task
> entity migrates we lose its contribution to the load-sum and effectively
> double count it while its former sum decays.
>
> New approach:
> -------------
> Instead of tracking load on a per-cfs_rq basis we do it on a per-sched_entity
> basis.  The load sum for a cfs_rq is then just the sum of its children's load
> averages.  The immediately obvious nice property is that when we migrate thread
> A from cpu1-->cpu2, we carry its load with it, leaving the global load sum
> unmodified.
>
> The 'windows' above are replaced with more fine-grained tracking, that is (for
> an entity j):
>
>   runnable_j = \Sum_i u_i * y^i ,   load_avg_j = runnable_j * weight_j   [*]
>
> Where u_i is the usage in the i-th most recent ~1ms period and y is chosen
> such that y^k = 1/2, with k approximating the sched_period in these units (we
> currently choose k=32).  This means that load tracked one sched_period ago
> contributes about 50% as much as current execution.
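>
> As a rough illustration of the recurrence this describes (a floating-point
> sketch only, with made-up names; the kernel uses 32.32 fixed-point
> arithmetic):
>
> #include <math.h>
> #include <stdio.h>
>
> #define PERIODS_PER_HALF_LIFE 32        /* i.e. LOAD_AVG_PERIOD */
>
> int main(void)
> {
>         double y = pow(0.5, 1.0 / PERIODS_PER_HALF_LIFE);  /* y^32 == 1/2 */
>         double runnable_sum = 0.0;
>         int i;
>
>         /* accumulate 100 ~1ms periods of a task runnable 50% of the time */
>         for (i = 0; i < 100; i++) {
>                 double u = 0.5;  /* fraction of this period spent runnable */
>
>                 /* decay the old sum by one period, then add this period's usage */
>                 runnable_sum = runnable_sum * y + u;
>         }
>
>         /* normalize against the maximum attainable sum, 1/(1 - y) */
>         printf("runnable_avg ~= %.3f\n", runnable_sum * (1.0 - y));
>         return 0;
> }
>
> After enough periods the normalized average converges on the task's runnable
> fraction; load_avg_j is then just this value scaled by the entity's weight.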
>
> Now, the real challenge in tracking on an entity basis is handling blocked
> entities.  Obviously runnable entities will be updated naturally as we iterate
> through them, but there could be O(many) blocked entities so we can't afford to
> iterate through them and update their tracking.
>
> That our decay for a unit period is exponential introduces a particularly nice
> property here:
> We can separate the contributed load on a cfs_rq into blocked and runnable.
> Runnable load is just the sum of all load_avg_j above, maintained by the
> entities themselves as they run and self update (when they update their
> load_avg they update the cumulative sum also).
>
> Blocked load then looks like:
>   load_avg_j = weight_j * \Sum_n u[j]_n * y^n
>
> To account for an entirely idle period we then only need to multiply the
> blocked sum by y.
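>
> A minimal sketch of why this is cheap (the helper name and the use of
> floating point are illustrative only; the kernel uses fixed-point inverse
> multiplies):
>
> #include <math.h>
>
> /*
>  * Catching up the aggregate blocked load after 'periods' entirely idle
>  * ~1ms periods is a single scaling by y^periods, independent of how many
>  * blocked entities contributed to the sum.
>  */
> static double catch_up_blocked_load(double blocked_load_sum, unsigned int periods)
> {
>         const double y = 0.97857206;  /* ~0.5^(1/32), so y^32 ~= 1/2 */
>
>         return blocked_load_sum * pow(y, (double)periods);
> }
>
> The per-entity breakdown is only needed again when a blocked entity wakes up
> and its (decayed) contribution is removed from the blocked sum.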
>
> This ends up being orders of magnitude more accurate than the current
> tracking scheme, even with the old shares_window massively dilated to
> better capture a stable load-state.  It's also obviously stable under
> migration.
>
> As referenced above this also allows us to potentially improve decisions within
> the load-balancer, for both distribution and power-management.
>
> Example: consider 1x80% task and 2x40% tasks on a 2-core machine.  It's
> currently a bit of a gamble as to whether you get an {AB, B} or {A, BB} split
> since they have equal weight (assume 1024).  With per-task tracking we can
> instead consider them at their contributed weight and see a stable
> ~{800, {400, 400}} load-split.  Likewise within balance_tasks we can consider
> the load migrated to be that actually contributed.  We have some patches
> looking at this and will post them as they mature.
>
> Thanks,
>
> - Paul
>
> ---
>
> Ben Segall (1):
>       sched: maintain per-rq runnable averages
>
> Paul Turner (15):
>       sched: track the runnable average on a per-task entity basis
>       sched: aggregate load contributed by task entities on parenting cfs_rq
>       sched: maintain the load contribution of blocked entities
>       sched: add an rq migration call-back to sched_class
>       sched: account for blocked load waking back up
>       sched: aggregate total task_group load
>       sched: compute load contribution by a group entity
>       sched: normalize tg load contributions against runnable time
>       sched: maintain runnable averages across throttled periods
>       sched: replace update_shares weight distribution with per-entity computation
>       sched: refactor update_shares_cpu() -> update_blocked_avgs()
>       sched: update_cfs_shares at period edge
>       sched: make __update_entity_runnable_avg() fast
>       sched: implement usage tracking
>       sched: introduce temporary FAIR_GROUP_SCHED dependency for load-tracking
>
>
>  include/linux/sched.h |   18 +
>  kernel/sched/core.c   |    5
>  kernel/sched/debug.c  |   39 ++
>  kernel/sched/fair.c   |  793 +++++++++++++++++++++++++++++++++++++++----------
>  kernel/sched/sched.h  |   60 ++--
>  5 files changed, 720 insertions(+), 195 deletions(-)
>
>
I've rebased this to tip/master and pushed some minor updates (most
notably: rounding down in decay_load).
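
To make the "rounding down" concrete: decay_load() applies y^n as a 32.32
fixed-point multiply and now truncates the result instead of using SRR
(shift right and round). A stand-alone sketch of the single-period case
(the constant and helper name below are illustrative approximations, not
the kernel's actual table):

#include <stdint.h>

/* Decay a load sum by one ~1ms period, i.e. multiply by y where y^32 == 1/2. */
static uint64_t decay_load_once(uint64_t val)
{
        /* ~= 2^32 * 0.5^(1/32), a 32.32 fixed-point representation of y */
        const uint64_t y_fixed = 0xfa83b2daULL;

        /* assumes val < 2^32 so the 64-bit multiply cannot overflow */
        val *= y_fixed;

        /* plain shift rather than shift-and-round: always round down */
        return val >> 32;
}
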
Tree is at:
http://git.kernel.org/?p=linux/kernel/git/pjt/sched.git;a=summary
(branch: load-tracking)
If you prefer raw patches there's a quilt series at:
http://rs5.risingnet.net/~pjt/patches/load_tracking/
And a tar-ball at:
http://rs5.risingnet.net/~pjt/patches/load_tracking/series.tar.gz
Peter:
The rebase to tip/master made interdiff angry enough that it wasn't
producing the right relative diffs for your stack. Sorry :(
- Paul
On Fri, Sep 21, 2012 at 1:52 PM, Paul Turner <[email protected]> wrote:
> Hi all,
>
> I've refreshed the load-tracking branch at:
> http://git.kernel.org/?p=linux/kernel/git/pjt/sched.git;a=summary
> with the latest rebased version.  Similarly, I've got a quilt export
> below to make it easier for people to patch it in.
>
> If you prefer raw patches there's a quilt series at:
> http://rs5.risingnet.net/~pjt/patches/load_tracking/
>
> And a tar-ball at:
> http://rs5.risingnet.net/~pjt/patches/load_tracking/series.tar.gz
>
> There are no functional differences from the version posted at:
> https://lkml.org/lkml/2012/8/23/267
> So I won't bother posting the series again here. If you find errors,
> please post them there and I'll do another round.
>
> Similarly, if you previously pulled from my git tree above, the only
> differences are that it's been rebased onto the current tip head and that
> the generator for the fixed-point math constants has been added to
> the changelogs.
>
> Thanks!
>
> - Paul
* Paul Turner <[email protected]> wrote:
> Peter:
> The rebase to tip/master made interdiff angry enough that it wasn't
> producing the right relative diffs for your stack. Sorry :(
Find below the diff between the two series, using 'quilt
snapshot' and 'quilt diff --snapshot'.
One quick stylistic note: instead of putting the
update_cfs_rq_blocked_load() prototype in the middle of the file, order
the functions naturally so that no prototypes are needed.
Thanks,
Ingo
--
tip/kernel/sched/fair.c | 28 ++++++++++++++++++----------
1 file changed, 18 insertions(+), 10 deletions(-)
Index: tip/kernel/sched/fair.c
===================================================================
--- tip.orig/kernel/sched/fair.c
+++ tip/kernel/sched/fair.c
@@ -262,6 +262,9 @@ static inline struct cfs_rq *group_cfs_r
return grp->my_q;
}
+static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
+ int force_update);
+
static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
{
if (!cfs_rq->on_list) {
@@ -281,6 +284,8 @@ static inline void list_add_leaf_cfs_rq(
}
cfs_rq->on_list = 1;
+ /* We should have no load, but we need to update last_decay. */
+ update_cfs_rq_blocked_load(cfs_rq, 0);
}
}
@@ -1151,7 +1156,7 @@ static inline void update_cfs_shares(str
* Note: The tables below are dependent on this value.
*/
#define LOAD_AVG_PERIOD 32
-#define LOAD_AVG_MAX 47765 /* maximum possible load avg */
+#define LOAD_AVG_MAX 47742 /* maximum possible load avg */
#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_MAX_AVG */
/* Precomputed fixed inverse multiplies for multiplication by y^n */
@@ -1203,7 +1208,8 @@ static __always_inline u64 decay_load(u6
}
val *= runnable_avg_yN_inv[local_n];
- return SRR(val, 32);
+ /* We don't use SRR here since we always want to round down. */
+ return val >> 32;
}
/*
@@ -4236,13 +4242,15 @@ static void __update_blocked_averages_cp
if (se) {
update_entity_load_avg(se, 1);
/*
- * We can pivot on the runnable average decaying to zero for
- * list removal since the parent average will always be >=
- * child.
+ * We pivot on our runnable average having decayed to zero for
+ * list removal. This generally implies that all our children
+ * have also been removed (modulo rounding error or bandwidth
+ * control); however, such cases are rare and we can fix these
+ * at enqueue.
+ *
+ * TODO: fix up out-of-order children on enqueue.
*/
- if (se->avg.runnable_avg_sum)
- update_cfs_shares(cfs_rq);
- else
+ if (!se->avg.runnable_avg_sum && !cfs_rq->nr_running)
list_del_leaf_cfs_rq(cfs_rq);
} else {
struct rq *rq = rq_of(cfs_rq);
@@ -6013,10 +6021,10 @@ static void task_tick_fair(struct rq *rq
entity_tick(cfs_rq, se, queued);
}
- update_rq_runnable_avg(rq, 1);
-
if (sched_feat_numa(NUMA))
task_tick_numa(rq, curr);
+
+ update_rq_runnable_avg(rq, 1);
}
/*
On Sat, 2012-10-06 at 09:39 +0200, Ingo Molnar wrote:
Thanks Ingo! Paul,
> tip/kernel/sched/fair.c | 28 ++++++++++++++++++----------
> 1 file changed, 18 insertions(+), 10 deletions(-)
>
> Index: tip/kernel/sched/fair.c
> ===================================================================
> --- tip.orig/kernel/sched/fair.c
> +++ tip/kernel/sched/fair.c
>
> @@ -1151,7 +1156,7 @@ static inline void update_cfs_shares(str
> * Note: The tables below are dependent on this value.
> */
> #define LOAD_AVG_PERIOD 32
> -#define LOAD_AVG_MAX 47765 /* maximum possible load avg */
> +#define LOAD_AVG_MAX 47742 /* maximum possible load avg */
> #define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_MAX_AVG */
>
> /* Precomputed fixed inverse multiplies for multiplication by y^n */
> @@ -1203,7 +1208,8 @@ static __always_inline u64 decay_load(u6
> }
>
> val *= runnable_avg_yN_inv[local_n];
> - return SRR(val, 32);
> + /* We don't use SRR here since we always want to round down. */
> + return val >> 32;
> }
>
> /*
This is from patches/sched-fast_decay.patch and I changed both the
patch and the changelog to reflect your changes.
>@@ -262,6 +262,9 @@ static inline struct cfs_rq *group_cfs_r
> return grp->my_q;
> }
>
> +static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
> + int force_update);
> +
> static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> {
> if (!cfs_rq->on_list) {
> @@ -281,6 +284,8 @@ static inline void list_add_leaf_cfs_rq(
> }
>
> cfs_rq->on_list = 1;
> + /* We should have no load, but we need to update last_decay. */
> + update_cfs_rq_blocked_load(cfs_rq, 0);
> }
> }
This seems to have come from patches/sched-cfs_rq_blocked_load.patch, and I
merged it in there, with subsequent updates in patches/sched-wakeup_load.patch.
I didn't find any changes to the changelogs for either.
> @@ -4236,13 +4242,15 @@ static void __update_blocked_averages_cp
> if (se) {
> update_entity_load_avg(se, 1);
> /*
> - * We can pivot on the runnable average decaying to zero for
> - * list removal since the parent average will always be >=
> - * child.
> + * We pivot on our runnable average having decayed to zero for
> + * list removal. This generally implies that all our children
> + * have also been removed (modulo rounding error or bandwidth
> + * control); however, such cases are rare and we can fix these
> + * at enqueue.
> + *
> + * TODO: fix up out-of-order children on enqueue.
> */
> - if (se->avg.runnable_avg_sum)
> - update_cfs_shares(cfs_rq);
> - else
> + if (!se->avg.runnable_avg_sum && !cfs_rq->nr_running)
> list_del_leaf_cfs_rq(cfs_rq);
> } else {
> struct rq *rq = rq_of(cfs_rq);
This hunk seems to have come from
patches/sched-swap_update_shares.patch, merged in there, no Changelog
changes afaict.
I've updated my queue (at the usual location:
http://programming.kicks-ass.net/sekrit/patches.tar.bz2); Paul, please
verify I didn't miss anything.
On Mon, Oct 8, 2012 at 5:14 AM, Peter Zijlstra <[email protected]> wrote:
> On Sat, 2012-10-06 at 09:39 +0200, Ingo Molnar wrote:
>
> Thanks Ingo! Paul,
>
>> tip/kernel/sched/fair.c | 28 ++++++++++++++++++----------
>> 1 file changed, 18 insertions(+), 10 deletions(-)
>>
>> Index: tip/kernel/sched/fair.c
>> ===================================================================
>> --- tip.orig/kernel/sched/fair.c
>> +++ tip/kernel/sched/fair.c
>>
>> @@ -1151,7 +1156,7 @@ static inline void update_cfs_shares(str
>> * Note: The tables below are dependent on this value.
>> */
>> #define LOAD_AVG_PERIOD 32
>> -#define LOAD_AVG_MAX 47765 /* maximum possible load avg */
>> +#define LOAD_AVG_MAX 47742 /* maximum possible load avg */
>> #define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_MAX_AVG */
>>
>> /* Precomputed fixed inverse multiplies for multiplication by y^n */
>> @@ -1203,7 +1208,8 @@ static __always_inline u64 decay_load(u6
>> }
>>
>> val *= runnable_avg_yN_inv[local_n];
>> - return SRR(val, 32);
>> + /* We don't use SRR here since we always want to round down. */
>> + return val >> 32;
>> }
>>
>> /*
>
> This is from patches/sched-fast_decay.patch and I changed both the
> patch and the changelog to reflect your changes.
>
Yes. Just to be sure: there's a similar correction to the table generator
in the changelog, so it's particularly important to make sure that one is
included as well.
I checked your stack and this looks correctly updated.
>
>>@@ -262,6 +262,9 @@ static inline struct cfs_rq *group_cfs_r
>> return grp->my_q;
>> }
>>
>> +static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq,
>> + int force_update);
>> +
>> static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
>> {
>> if (!cfs_rq->on_list) {
>> @@ -281,6 +284,8 @@ static inline void list_add_leaf_cfs_rq(
>> }
>>
>> cfs_rq->on_list = 1;
>> + /* We should have no load, but we need to update last_decay. */
>> + update_cfs_rq_blocked_load(cfs_rq, 0);
>> }
>> }
>
>
> This seems to have come from patches/sched-cfs_rq_blocked_load.patch, and I
> merged it in there, with subsequent updates in patches/sched-wakeup_load.patch.
> I didn't find any changes to the changelogs for either.
>
>
>> @@ -4236,13 +4242,15 @@ static void __update_blocked_averages_cp
>> if (se) {
>> update_entity_load_avg(se, 1);
>> /*
>> - * We can pivot on the runnable average decaying to zero for
>> - * list removal since the parent average will always be >=
>> - * child.
>> + * We pivot on our runnable average having decayed to zero for
>> + * list removal. This generally implies that all our children
>> + * have also been removed (modulo rounding error or bandwidth
>> + * control); however, such cases are rare and we can fix these
>> + * at enqueue.
>> + *
>> + * TODO: fix up out-of-order children on enqueue.
>> */
>> - if (se->avg.runnable_avg_sum)
>> - update_cfs_shares(cfs_rq);
>> - else
>> + if (!se->avg.runnable_avg_sum && !cfs_rq->nr_running)
>> list_del_leaf_cfs_rq(cfs_rq);
>> } else {
>> struct rq *rq = rq_of(cfs_rq);
>
>
> This hunk seems to have come from
> patches/sched-swap_update_shares.patch, merged in there, no Changelog
> changes afaict.
>
>
> I've updated my queue (at the usual location:
> http://programming.kicks-ass.net/sekrit/patches.tar.bz2); Paul, please
> verify I didn't miss anything.
>
>
Looks good; the only difference I see is:
sched-root_avg.patch
diff -u b/kernel/sched/fair.c b/kernel/sched/fair.c
--- b/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5594,10 +5594,10 @@
entity_tick(cfs_rq, se, queued);
}
+ update_rq_runnable_avg(rq, 1);
+
if (sched_feat_numa(NUMA))
task_tick_numa(rq, curr);
-
- update_rq_runnable_avg(rq, 1);
}
This is a change I agree with; we want task_tick_numa to have
up-to-date stats available to it.
Thanks!
- Paul
On Sat, Oct 6, 2012 at 12:39 AM, Ingo Molnar <[email protected]> wrote:
>
> * Paul Turner <[email protected]> wrote:
>
>> Peter:
>> The rebase to tip/master made interdiff angry enough that it wasn't
>> producing the right relative diffs for your stack. Sorry :(
>
> Find below the diff between the two series, using 'quilt
> snapshot' and 'quilt diff --snapshot'.
So what I was looking to generate here was per-patch differences that
Peter could squash into his quilt stack directly. I typically find
interdiff works quite well for this, with the exception of the
changelog.
>
> One quick stylistic note: instead of putting the
> update_cfs_rq_blocked_load() prototype in the middle of the file, order
> the functions naturally so that no prototypes are needed.
This is certainly a principle I agree with, and I'm not gun-shy about
moving things around to reduce this burden :)
However, in this case I think the prototype is preferable, as this code
sits right in the middle of the 'core' helper functions for fair.c, which
naturally live at the top of the file. Promoting the group-scheduling bits
to remove this dependency would push them several hundred lines down and
be counter-productive to that structure.
That said, if you still feel strongly, it can certainly be done.
Thanks!
- Paul