From: Paul Turner
Date: Wed, 27 Jun 2012 20:36:24 -0700
Subject: sched: per-entity load-tracking
To: linux-kernel@vger.kernel.org

Argh, regenerated mbox and forgot to fix the subject. This should have been:
"sched: per-entity load-tracking". I won't bother re-posting, but if you find
yourself reading this email, look to the much more interesting "Series short
description".

Thanks,

- Paul

On Wed, Jun 27, 2012 at 7:24 PM, Paul Turner wrote:
> Hi all,
>
> Please find attached the latest version of CFS load-tracking.
>
> It implements load-tracking on a per-sched_entity basis (currently
> SCHED_NORMAL, but it could be extended to RT as well). This results in a
> bottom-up load computation in which entities contribute to their parents'
> load, as opposed to the current top-down approach in which the parent
> averages its children. In particular this allows us to correctly migrate
> load along with its owning entities, and provides the necessary inputs for
> intelligent load-balancing and power-management.
>
> Modulo review I think this is now close to a mergeable state.
>
> Notable updates:
> ----------------
> - A few stability issues fixed, in particular on systems with tasks having
>   idle periods spanning several days. Minor code cleanups.
> - 32-bit performance improved (now reported faster[*] on 32-bit ARM than
>   without the series, presumably due to better load-balance or reduced
>   overhead. Thanks to Vincent Guittot and others for looking at this.)
> - All configs delinted.
> - Config dependencies separated so that load-tracking can be used without
>   FAIR_GROUP_SCHED, so that we may generally use it for power governors
>   and per-entity load-balance (in progress).
>
> Thanks to Peter, Vincent, Morten, Nikunj, and others for testing and
> comments.
>
> Rewinding some context since I've been buried with internal work and it's
> been a while since my last posting:
>
> Background:
> -----------
> We currently track the load average at the parenting cfs_rq level. We
> divide execution into a series of consecutive windows, w_i. Within each
> we track:
>
>   \Sum load_i * t_i, where \Sum t_i = w and each load_i is a disjoint
>   load level.
>
> The load average is then \Sum w_j / 2^j.
>
> This works reasonably well, but there are a few problems:
>
> 1) Since decay occurs at window boundaries there are 'skews':
>
> At a window boundary the 'foreground' time has a bias against the time
> immediately preceding it (as a result of the folding division), e.g.:
>
>   __xx_|_yyyy___  vs  __xx_yyyy_|___   (where '|' is a window boundary)
>
> The accounting here is 2x + 4y/2 or 2x + 4y, depending on which side of
> the window boundary your load lands on. (A toy version of this arithmetic
> appears in the sketch after this list.)
>
> 2) Everything within a window 'w' is accounted equally; we only fold at
> the boundaries. This actually means you can't set 'w' large enough to
> really have meaningful coverage of the sched period without throwing
> decay out the window. But then with w/2 << sched_period (currently
> w/2 = 5ms) the average ends up having fairly drastic swings, even with
> stable loads.
>
> (Note: before we even looked at migrating to per-entity tracking we
> evaluated not folding the foreground window into the considered average
> until it was "complete"; this represented a measurable improvement in
> stability under predictable load.)
>
> 3) Since the load-sum average is per-cfs_rq and not per-entity, when a
> task entity migrates we lose its contribution to the load-sum and
> effectively double-count it while its former sum decays.
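> As a toy illustration of the skew arithmetic above (this is not kernel
> code, and the numbers are hypothetical, matching the example):
>
>   #include <stdio.h>
>
>   int main(void)
>   {
>           unsigned long x = 100, y = 100;
>           unsigned long folded, unfolded;
>
>           /* __xx_yyyy_|___ : y lands just before the boundary, so
>            * the fold has already halved it */
>           folded = 2 * x + (4 * y) / 2;
>
>           /* __xx_|_yyyy___ : y lands just after the boundary and
>            * has not been folded yet */
>           unfolded = 2 * x + 4 * y;
>
>           /* identical load, different accounting: 400 vs 600 */
>           printf("%lu vs %lu\n", folded, unfolded);
>           return 0;
>   }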
> New approach:
> -------------
> Instead of tracking load on a per-cfs_rq basis we do it on a
> per-sched_entity basis. The load sum for a cfs_rq is then just the sum of
> its children's load averages. The immediately obvious nice property is
> that when we migrate thread A from cpu1 --> cpu2, we carry its load with
> it, leaving the global load sum unmodified.
>
> The 'windows' above are replaced with more fine-grained tracking, that is
> (for an entity j):
>
>   runnable_j = \Sum u_i * y^i ,  load_avg_j = runnable_j * weight_j [*]
>
> where u_i is the usage in the i-th most recent ~1ms period, and y is
> chosen such that y^k = 1/2 with k ~ sched_period (we currently choose
> k = 32). This means that load tracked one sched_period ago contributes
> about ~50% as much as current execution.
>
> Now, the real challenge in tracking on a per-entity basis is handling
> blocked entities. Obviously runnable entities will be updated naturally
> as we iterate through them, but there could be O(many) blocked entities,
> so we can't afford to iterate through them and update their tracking.
>
> That our decay for a unit period is exponential introduces a particularly
> nice property here:
>
> We can separate the load contributed to a cfs_rq into blocked and
> runnable parts. Runnable load is just the sum of all load_avg_j above,
> maintained by the entities themselves as they run and self-update (when
> they update their load_avg they update the cumulative sum also).
>
> Blocked load then looks like (for each blocked entity k):
>
>   load_avg_k = weight_k * \Sum u[k]_n * y^n
>
> Since y factors out of the aggregate blocked sum, accounting an entirely
> idle period then only requires multiplying that sum by y.
>
> This ends up being orders of magnitude more accurate than the current
> tracking scheme, even with the old shares_window massively dilated to
> better capture a stable load state. It's also obviously stable under
> migration.
>
> As referenced above, this also allows us to potentially improve decisions
> within the load-balancer, for both distribution and power-management.
>
> Example: consider 1x80% task and 2x40% tasks on a 2-core machine. It's
> currently a bit of a gamble as to whether you get an {AB, B} or {A, BB}
> split, since they have equal weight (assume 1024). With per-task tracking
> we can actually consider them at their contributed weight and see a
> stable ~{800, {400, 400}} load split. Likewise, within balance_tasks we
> can consider the load migrated to be that actually contributed. We have
> some patches looking at this and will post them as they mature.
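> To make the arithmetic concrete, here is a minimal user-space sketch of
> the decay above (illustrative only: the actual series uses 32-bit
> fixed-point arithmetic rather than floating point, and these names are
> made up):
>
>   #include <math.h>
>   #include <stdio.h>
>
>   #define Y 0.978572  /* 2^(-1/32), so that Y^32 = 1/2 */
>
>   /* fold one ~1ms period of usage u (0.0..1.0) into a runnable sum */
>   static double update_runnable(double sum, double u)
>   {
>           return sum * Y + u;
>   }
>
>   int main(void)
>   {
>           double max_sum = 1.0 / (1.0 - Y);  /* limit of \Sum y^i */
>           double a = 0.0, b = 0.0;
>           int i;
>
>           /* an 80% task and a 40% task under ~1s of stable load */
>           for (i = 0; i < 1000; i++) {
>                   a = update_runnable(a, 0.8);
>                   b = update_runnable(b, 0.4);
>           }
>
>           /* contributed load at weight 1024: ~819 and ~410, i.e. the
>            * stable ~{800, {400, 400}} split described above */
>           printf("80%%: ~%.0f\n", 1024.0 * a / max_sum);
>           printf("40%%: ~%.0f\n", 1024.0 * b / max_sum);
>
>           /* a blocked entity's sum over n idle periods decays as y^n:
>            * one multiply per wholly idle period, no iteration needed */
>           a *= pow(Y, 32);  /* ~32ms idle halves the contribution */
>           printf("80%% after ~32ms idle: ~%.0f\n", 1024.0 * a / max_sum);
>           return 0;
>   }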
> Thanks,
>
> - Paul
>
> ---
>
> Ben Segall (1):
>       sched: maintain per-rq runnable averages
>
> Paul Turner (15):
>       sched: track the runnable average on a per-task entity basis
>       sched: aggregate load contributed by task entities on parenting cfs_rq
>       sched: maintain the load contribution of blocked entities
>       sched: add an rq migration call-back to sched_class
>       sched: account for blocked load waking back up
>       sched: aggregate total task_group load
>       sched: compute load contribution by a group entity
>       sched: normalize tg load contributions against runnable time
>       sched: maintain runnable averages across throttled periods
>       sched: replace update_shares weight distribution with per-entity computation
>       sched: refactor update_shares_cpu() -> update_blocked_avgs()
>       sched: update_cfs_shares at period edge
>       sched: make __update_entity_runnable_avg() fast
>       sched: implement usage tracking
>       sched: introduce temporary FAIR_GROUP_SCHED dependency for load-tracking
>
>
>  include/linux/sched.h |   18 +
>  kernel/sched/core.c   |    5
>  kernel/sched/debug.c  |   39 ++
>  kernel/sched/fair.c   |  793 +++++++++++++++++++++++++++++++++++++++----------
>  kernel/sched/sched.h  |   60 ++--
>  5 files changed, 720 insertions(+), 195 deletions(-)
>
> --
> Signature