Date: Fri, 18 Dec 2015 10:33:09 +0800
From: Yuyang Du
To: Steve Muckle
Cc: Peter Zijlstra, Morten Rasmussen, Andrey Ryabinin, mingo@redhat.com,
	linux-kernel@vger.kernel.org, Paul Turner, Ben Segall
Subject: Re: [PATCH] sched/fair: fix mul overflow on 32-bit systems
Message-ID: <20151218023309.GL28098@intel.com>
In-Reply-To: <56708C71.90701@linaro.org>

Hi Steve,

On Tue, Dec 15, 2015 at 01:56:01PM -0800, Steve Muckle wrote:
> On 12/14/2015 06:22 PM, Yuyang Du wrote:
> > what if we just init the util_avg to 0?
>
> With scheduler-guided frequency [1] we will rely on the initial util_avg
> value to determine the CPU frequency response when new tasks are created.
>
> There is sure to be a lot of debate on what sort of default policy is
> used (and how much that can be changed with tunables). IMO at least for
> now, new tasks should cause the target CPU to go to fmax, especially
> given how slow PELT is to respond to changes in task behavior. This
> would imply leaving the initial task util_avg at 100% (or the equivalent
> necessary to take care of the overflow issues).
>
> [1] https://lkml.org/lkml/2015/12/9/35

As far as the initial util_avg is concerned, I think the proposed
algorithm gives a sensible util to a new task: it is a trade-off between
zero and full utilization.

I haven't tested it with meaningful workloads and methods yet, but it
compiles, thanks to LKP :) Hope you can give it a try and see what
happens. A small standalone sketch at the end of this mail walks through
the cap series the patch produces.

---
Subject: [PATCH] sched/fair: Initialize a new task's util_avg to a bounded value

A new task's util_avg is currently set to the full utilization of a CPU
(100% of the time running). This accelerates the new task's utilization
ramp-up, which can be used to boost its execution early on. However, it
may result in (insanely) high utilization for a transient period when a
flood of tasks is spawned. More importantly, it violates the
"fundamentally bounded" nature of CPU utilization, and the side effect
is negative if we take no measure to bound it.

This patch proposes an algorithm to address this issue.
It has two methods to approach a reasonable initial util_avg:

(1) An expected (or average) util_avg based on its cfs_rq's util_avg:

    util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight

(2) A trajectory of how successive new tasks' utils develop, which gives
    1/2 of the remaining utilization budget to a new task, such that the
    additional util is noticeably large (when overall util is low) or
    unnoticeably small (when overall util is already high). Meanwhile,
    the aggregate utilization stays well bounded:

    util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n

    where n denotes the nth task.

If util_avg is larger than util_avg_cap, the effective util is clamped
to util_avg_cap.

Reported-by: Andrey Ryabinin
Signed-off-by: Yuyang Du
---
 kernel/sched/core.c  |  2 ++
 kernel/sched/fair.c  | 56 ++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |  1 +
 3 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a2ecefd..cd314de 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2485,6 +2485,8 @@ void wake_up_new_task(struct task_struct *p)
 	 */
 	set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
+	/* Post-initialize new task's util average when its cfs_rq is set */
+	post_init_entity_util_avg(&p->se);
 
 	rq = __task_rq_lock(p);
 	activate_task(rq, p, 0);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e3266eb..9ac00b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -682,17 +682,69 @@ void init_entity_runnable_average(struct sched_entity *se)
 	sa->period_contrib = 1023;
 	sa->load_avg = scale_load_down(se->load.weight);
 	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
-	sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
-	sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
+	/*
+	 * At this point, util_avg won't be used in select_task_rq_fair anyway
+	 */
+	sa->util_avg = 0;
+	sa->util_sum = 0;
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
 
+/*
+ * With new tasks being created, their initial util_avgs are extrapolated
+ * based on the cfs_rq's current util_avg:
+ *
+ *   util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight
+ *
+ * However, in many cases, the above util_avg does not give a desired
+ * value. Moreover, the sum of the util_avgs may be divergent, such
+ * as when the series is a harmonic series.
+ *
+ * To solve this problem, we also cap the util_avg of successive tasks to
+ * only 1/2 of the left utilization budget:
+ *
+ *   util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n
+ *
+ * where n denotes the nth task.
+ *
+ * For example, a simplest series from the beginning would be like:
+ *
+ *  task  util_avg: 512, 256, 128,  64,  32,   16,    8, ...
+ * cfs_rq util_avg: 512, 768, 896, 960, 992, 1008, 1016, ...
+ *
+ * Finally, that extrapolated util_avg is clamped to the cap (util_avg_cap)
+ * if util_avg > util_avg_cap.
+ */
+void post_init_entity_util_avg(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	struct sched_avg *sa = &se->avg;
+	long cap = (scale_load_down(SCHED_LOAD_SCALE) - cfs_rq->avg.util_avg) / 2;
+
+	if (cap > 0) {
+		if (cfs_rq->avg.util_avg != 0) {
+			sa->util_avg = cfs_rq->avg.util_avg * se->load.weight;
+			sa->util_avg /= (cfs_rq->avg.load_avg + 1);
+
+			if (sa->util_avg > cap)
+				sa->util_avg = cap;
+		}
+		else {
+			sa->util_avg = cap;
+		}
+		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
+	}
+}
+
 static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);
 static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq);
 #else
 void init_entity_runnable_average(struct sched_entity *se)
 {
 }
+void post_init_entity_util_avg(struct sched_entity *se)
+{
+}
 #endif
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10f1637..7049ee9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1277,6 +1277,7 @@ extern void init_dl_task_timer(struct sched_dl_entity *dl_se);
 unsigned long to_ratio(u64 period, u64 runtime);
 
 extern void init_entity_runnable_average(struct sched_entity *se);
+extern void post_init_entity_util_avg(struct sched_entity *se);
 
 static inline void add_nr_running(struct rq *rq, unsigned count)
 {
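P.S. Here is the standalone user-space sketch mentioned above, for anyone
who wants to see how the cap series behaves without applying the patch.
It is not kernel code and not part of the patch: struct fake_cfs_rq,
init_task_util() and the main() loop are made-up stand-ins for the real
cfs_rq/sched_entity plumbing, and it assumes each new task immediately
runs flat out so its util is fully absorbed into the cfs_rq, which is how
the 512, 768, 896, ... example series in the comment block is produced.

/*
 * Standalone sketch (user space): simulate the initial util_avg given to
 * successive new tasks by the proposed post_init_entity_util_avg(),
 * assuming an initially idle cfs_rq and identical NICE_0 tasks.
 * SCHED_LOAD_SCALE is taken as 1024.
 */
#include <stdio.h>

#define SCHED_LOAD_SCALE	1024L

struct fake_cfs_rq {		/* made-up stand-in for struct cfs_rq */
	long util_avg;
	long load_avg;
};

/* Mirror of the patch's logic: extrapolate, then clamp to half the budget */
static long init_task_util(struct fake_cfs_rq *cfs_rq, long se_weight)
{
	long util_avg;
	long cap = (SCHED_LOAD_SCALE - cfs_rq->util_avg) / 2;

	if (cap <= 0)
		return 0;	/* no utilization budget left */

	if (cfs_rq->util_avg != 0) {
		/* (1) expected util based on the cfs_rq's current util */
		util_avg = cfs_rq->util_avg * se_weight /
			   (cfs_rq->load_avg + 1);
		if (util_avg > cap)
			util_avg = cap;	/* (2) cap to 1/2 of the left budget */
	} else {
		util_avg = cap;		/* first task on an idle cfs_rq */
	}
	return util_avg;
}

int main(void)
{
	struct fake_cfs_rq cfs_rq = { .util_avg = 0, .load_avg = 0 };
	long weight = SCHED_LOAD_SCALE;	/* NICE_0 weight, already scaled down */
	int n;

	for (n = 1; n <= 7; n++) {
		long util = init_task_util(&cfs_rq, weight);

		/* assume the task runs flat out, so the cfs_rq absorbs its util */
		cfs_rq.util_avg += util;
		cfs_rq.load_avg += weight;
		printf("task %d: util_avg = %4ld, cfs_rq util_avg = %4ld\n",
		       n, util, cfs_rq.util_avg);
	}
	return 0;
}

Built with plain gcc, this prints task util_avg 512, 256, 128, 64, 32,
16, 8 and cfs_rq util_avg 512, 768, 896, 960, 992, 1008, 1016, matching
the example series in the comment block and showing that the aggregate
stays bounded below 1024.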