Date: Tue, 5 Jun 2018 16:21:56 +0100
From: Patrick Bellasi
To: Joel Fernandes
Cc: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
    Ingo Molnar, Peter Zijlstra, "Rafael J. Wysocki",
Wysocki" , Viresh Kumar , Vincent Guittot , Dietmar Eggemann , Morten Rasmussen , Juri Lelli , Joel Fernandes , Steve Muckle , Todd Kjos Subject: Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking Message-ID: <20180605152156.GD32302@e110439-lin> References: <20180604160600.22052-1-patrick.bellasi@arm.com> <20180604160600.22052-3-patrick.bellasi@arm.com> <20180604174618.GA222053@joelaf.mtv.corp.google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180604174618.GA222053@joelaf.mtv.corp.google.com> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04-Jun 10:46, Joel Fernandes wrote: > Hi Patrick, > > On Mon, Jun 04, 2018 at 05:06:00PM +0100, Patrick Bellasi wrote: > > The estimated utilization of a task is affected by the task being > > preempted, either by another FAIR task of by a task of an higher > > priority class (i.e. RT or DL). Indeed, when a preemption happens, the > > PELT utilization of the preempted task is going to be decayed a bit. > > That's actually correct for utilization, which goal is to measure the > > actual CPU bandwidth consumed by a task. > > > > However, the above behavior does not allow to know exactly what is the > > utilization a task "would have used" if it was running without > > being preempted. Thus, this reduces the effectiveness of util_est for a > > task because it does not always allow to predict how much CPU a task is > > likely to require. > > > > Let's improve the estimated utilization by adding a new "sort-of" PELT > > signal, explicitly only for SE which has the following behavior: > > a) at each enqueue time of a task, its value is the (already decayed) > > util_avg of the task being enqueued > > b) it's updated at each update_load_avg > > c) it can just increase, whenever the task is actually RUNNING on a > > CPU, while it's kept stable while the task is RUNNANBLE but not > > actively consuming CPU bandwidth > > > > Such a defined signal is exactly equivalent to the util_avg for a task > > running alone on a CPU while, in case the task is preempted, it allows > > to know at dequeue time how much would have been the task utilization if > > it was running alone on that CPU. > > > > This new signal is named "running_avg", since it tracks the actual > > RUNNING time of a task by ignoring any form of preemption. > > > > From an implementation standpoint, since the sched_avg should fit into a > > single cache line, we save space by tracking only a new runnable sum: > > p->se.avg.running_sum > > while the conversion into a running_avg is done on demand whenever we > > need it, which is at task dequeue time when a new util_est sample has to > > be collected. > > > > The conversion from "running_sum" to "running_avg" is done by performing > > a single division by LOAD_AVG_MAX, which introduces a small error since > > in the division we do not consider the (sa->period_contrib - 1024) > > compensation factor used in ___update_load_avg(). However: > > a) this error is expected to be limited (~2-3%) > > b) it can be safely ignored since the estimated utilization is the only > > consumer which is already subject to small estimation errors > > > > The additional corresponding benefit is that, at run-time, we pay the > > cost for a additional sum and multiply, while the more expensive > > division is required only at dequeue time. 
> >
> > Signed-off-by: Patrick Bellasi
> > Cc: Ingo Molnar
> > Cc: Peter Zijlstra
> > Cc: Vincent Guittot
> > Cc: Juri Lelli
> > Cc: Todd Kjos
> > Cc: Joel Fernandes
> > Cc: Steve Muckle
> > Cc: Dietmar Eggemann
> > Cc: Morten Rasmussen
> > Cc: linux-kernel@vger.kernel.org
> > Cc: linux-pm@vger.kernel.org
> > ---
> >  include/linux/sched.h |  1 +
> >  kernel/sched/fair.c   | 16 ++++++++++++++--
> >  2 files changed, 15 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 9d8732dab264..2bd5f1c68da9 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -399,6 +399,7 @@ struct sched_avg {
> >  	u64			load_sum;
> >  	u64			runnable_load_sum;
> >  	u32			util_sum;
> > +	u32			running_sum;
> >  	u32			period_contrib;
> >  	unsigned long		load_avg;
> >  	unsigned long		runnable_load_avg;
>
> Should we update the documentation comments above the struct too?
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index f74441be3f44..5d54d6a4c31f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3161,6 +3161,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> >  		sa->runnable_load_sum =
> >  			decay_load(sa->runnable_load_sum, periods);
> >  		sa->util_sum = decay_load((u64)(sa->util_sum), periods);
> > +		if (running)
> > +			sa->running_sum = decay_load(sa->running_sum, periods);
> >
> >  		/*
> >  		 * Step 2
> > @@ -3176,8 +3178,10 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> >  	sa->load_sum += load * contrib;
> >  	if (runnable)
> >  		sa->runnable_load_sum += runnable * contrib;
> > -	if (running)
> > +	if (running) {
> >  		sa->util_sum += contrib * scale_cpu;
> > +		sa->running_sum += contrib * scale_cpu;
> > +	}
> >
> >  	return periods;
> >  }
> > @@ -3963,6 +3967,12 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> >  	WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> >  }
>
> PELT changes look nice and make sense :)

That's not strictly speaking a PELT change... the idea is rather to work
"on top of PELT" to make it more effective at measuring a task's
expected CPU bandwidth requirement.

> > +static inline void util_est_enqueue_running(struct task_struct *p)
> > +{
> > +	/* Initialize the (non-preempted) utilization */
> > +	p->se.avg.running_sum = p->se.avg.util_sum;
> > +}
> > +
> >  /*
> >   * Check if a (signed) value is within a specified (unsigned) margin,
> >   * based on the observation that:
> > @@ -4018,7 +4028,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
> >  	 * Skip update of task's estimated utilization when its EWMA is
> >  	 * already ~1% close to its last activation value.
> >  	 */
> > -	ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
> > +	ue.enqueued = p->se.avg.running_sum / LOAD_AVG_MAX;
>
> I guess we are doing an extra division here, which adds some cost. Does
> performance look OK with the change?

The extra division is there, but it's done only at dequeue time instead
of at each update_load_avg. To be more precise, at each
___update_load_avg() we should really update running_avg with:

	u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
	sa->running_avg = sa->running_sum / divider;

but this would imply tracking an additional signal in sched_avg and
doing an additional division at ___update_load_avg() time.
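For a rough idea of how much that correction matters (back-of-the-envelope
only, assuming LOAD_AVG_MAX = 47742 and sa->period_contrib always below
1024, as in the current PELT code): the exact divider lies in
[LOAD_AVG_MAX - 1024, LOAD_AVG_MAX - 1], so dividing by LOAD_AVG_MAX
instead underestimates the result by at most 1024 / 47742 ~= 2.1%,
consistent with the ~2-3% figure in the changelog.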
Morten suggested that, if we accept the rounding error introduced by
considering divider ~= LOAD_AVG_MAX, thus discarding the
(sa->period_contrib - 1024) correction, then we can skip tracking
running_avg entirely (saving space in sched_avg) and approximate it at
dequeue time, as in the hunk above, with a single division by
LOAD_AVG_MAX, just to compute the new util_est sample to accumulate.

Does that make sense now?

-- 
#include 

Patrick Bellasi