From: Vincent Guittot
Date: Tue, 5 Jun 2018 08:57:05 +0200
Subject: Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking
To: Patrick Bellasi
Cc: linux-kernel, "open list:THERMAL", Ingo Molnar, Peter Zijlstra, "Rafael J . Wysocki", Viresh Kumar, Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Joel Fernandes, Steve Muckle, Todd Kjos
In-Reply-To: <20180604160600.22052-3-patrick.bellasi@arm.com>
References: <20180604160600.22052-1-patrick.bellasi@arm.com> <20180604160600.22052-3-patrick.bellasi@arm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 4 June 2018 at 18:06, Patrick Bellasi wrote:
> The estimated utilization of a task is affected by the task being
> preempted, either by another FAIR task or by a task of a higher
> priority class (i.e. RT or DL). Indeed, when a preemption happens,
> the PELT utilization of the preempted task is going to be decayed a
> bit. That's correct for utilization, whose goal is to measure the
> actual CPU bandwidth consumed by a task.
>
> However, the above behavior does not let us know exactly what
> utilization a task "would have used" if it was running without being
> preempted. This reduces the effectiveness of util_est for a task,
> because it does not always allow us to predict how much CPU the task
> is likely to require.
>
> Let's improve the estimated utilization by adding a new "sort-of"
> PELT signal, explicitly only for SEs, which has the following
> behavior:
>  a) at each enqueue time of a task, its value is the (already
>     decayed) util_avg of the task being enqueued
>  b) it's updated at each update_load_avg
>  c) it can only increase, whenever the task is actually RUNNING on a
>     CPU, while it's kept stable while the task is RUNNABLE but not
>     actively consuming CPU bandwidth
>
> Such a signal is exactly equivalent to the util_avg for a task
> running alone on a CPU while, in case the task is preempted, it
> allows us to know at dequeue time how much the task's utilization
> would have been if it had been running alone on that CPU.

I don't agree with the statement above. Let's say that you have two
periodic tasks which want to run 4ms in every period of 10ms and wake
up at the same time. One task will execute first and the other will
wait, but in the end they have the same period and running time and,
as a result, the same util_avg, which is exactly what they need.

> This new signal is named "running_avg", since it tracks the actual
> RUNNING time of a task by ignoring any form of preemption.

In fact, you still take this preemption into account, as you remove
some time, whereas it's only a shift of the execution.

> From an implementation standpoint, since the sched_avg should fit
> into a single cache line, we save space by tracking only a new
> runnable sum:
>
>    p->se.avg.running_sum
>
> while the conversion into a running_avg is done on demand whenever
> we need it, which is at task dequeue time when a new util_est sample
> has to be collected.
>
> The conversion from "running_sum" to "running_avg" is done by
> performing a single division by LOAD_AVG_MAX, which introduces a
> small error since in the division we do not consider the
> (sa->period_contrib - 1024) compensation factor used in
> ___update_load_avg().
However: > a) this error is expected to be limited (~2-3%) > b) it can be safely ignored since the estimated utilization is the only > consumer which is already subject to small estimation errors > > The additional corresponding benefit is that, at run-time, we pay the > cost for a additional sum and multiply, while the more expensive > division is required only at dequeue time. > > Signed-off-by: Patrick Bellasi > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Vincent Guittot > Cc: Juri Lelli > Cc: Todd Kjos > Cc: Joel Fernandes > Cc: Steve Muckle > Cc: Dietmar Eggemann > Cc: Morten Rasmussen > Cc: linux-kernel@vger.kernel.org > Cc: linux-pm@vger.kernel.org > --- > include/linux/sched.h | 1 + > kernel/sched/fair.c | 16 ++++++++++++++-- > 2 files changed, 15 insertions(+), 2 deletions(-) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 9d8732dab264..2bd5f1c68da9 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -399,6 +399,7 @@ struct sched_avg { > u64 load_sum; > u64 runnable_load_sum; > u32 util_sum; > + u32 running_sum; > u32 period_contrib; > unsigned long load_avg; > unsigned long runnable_load_avg; > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index f74441be3f44..5d54d6a4c31f 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -3161,6 +3161,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa, > sa->runnable_load_sum = > decay_load(sa->runnable_load_sum, periods); > sa->util_sum = decay_load((u64)(sa->util_sum), periods); > + if (running) > + sa->running_sum = decay_load(sa->running_sum, periods); so you make some time disappearing from the equation as the signal will not be decayed and make the period looking shorter than reality. 
With the example I mentioned above, the second task will be seen as if
its period becomes 6ms instead of 10ms, which is not true, and the
utilization of the task is overestimated without any good reason.
Furthermore, this new signal doesn't have a clear meaning: it's not
utilization, but it's also not the waiting time of the task.

>
>  /*
>   * Step 2
> @@ -3176,8 +3178,10 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
>         sa->load_sum += load * contrib;
>         if (runnable)
>                 sa->runnable_load_sum += runnable * contrib;
> -       if (running)
> +       if (running) {
>                 sa->util_sum += contrib * scale_cpu;
> +               sa->running_sum += contrib * scale_cpu;
> +       }
>
>         return periods;
>  }
> @@ -3963,6 +3967,12 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
>         WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
>  }
>
> +static inline void util_est_enqueue_running(struct task_struct *p)
> +{
> +       /* Initialize the (non-preempted) utilization */
> +       p->se.avg.running_sum = p->se.avg.util_sum;
> +}
> +
>  /*
>   * Check if a (signed) value is within a specified (unsigned) margin,
>   * based on the observation that:
> @@ -4018,7 +4028,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
>          * Skip update of task's estimated utilization when its EWMA is
>          * already ~1% close to its last activation value.
>          */
> -       ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
> +       ue.enqueued = p->se.avg.running_sum / LOAD_AVG_MAX;
>         last_ewma_diff = ue.enqueued - ue.ewma;
>         if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
>                 return;
> @@ -5437,6 +5447,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>         if (!se)
>                 add_nr_running(rq, 1);
>
> +       util_est_enqueue_running(p);
> +
>         hrtick_update(rq);
>  }
>
> --
> 2.15.1