From: Vincent Guittot <vincent.guittot@linaro.org>
Date: Wed, 28 Nov 2018 16:42:18 +0100
Subject: Re: [PATCH v7 2/2] sched/fair: update scale invariance of PELT
To: Patrick Bellasi
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, "Rafael J. Wysocki",
    Dietmar Eggemann, Morten Rasmussen, Paul Turner, Ben Segall,
    Thara Gopinath, pkondeti@codeaurora.org, Quentin Perret,
    Srinivas Pandruvada
References: <1542711308-25256-1-git-send-email-vincent.guittot@linaro.org>
 <1542711308-25256-3-git-send-email-vincent.guittot@linaro.org>
 <20181128100241.GA2131@hirez.programming.kicks-ass.net>
 <20181128115336.GB23094@e110439-lin>
 <20181128144039.GC23094@e110439-lin>
 <20181128152133.GD23094@e110439-lin>
In-Reply-To: <20181128152133.GD23094@e110439-lin>
List-ID: <linux-kernel.vger.kernel.org>

Wysocki" , Dietmar Eggemann , Morten Rasmussen , Paul Turner , Ben Segall , Thara Gopinath , pkondeti@codeaurora.org, Quentin Perret , Srinivas Pandruvada Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 28 Nov 2018 at 16:21, Patrick Bellasi wrote: > > On 28-Nov 15:55, Vincent Guittot wrote: > > On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi wrote: > > > > > > On 28-Nov 14:33, Vincent Guittot wrote: > > > > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi wrote: > > > > > > > > > > On 28-Nov 11:02, Peter Zijlstra wrote: > > > > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote: > > > > > > > > > > > > > Is there anything else that I should do for these patches ? > > > > > > > > > > > > IIRC, Morten mention they break util_est; Patrick was going to explain. > > > > > > > > > > I guess the problem is that, once we cross the current capacity, > > > > > strictly speaking util_avg does not represent anymore a utilization. > > > > > > > > > > With the new signal this could happen and we end up storing estimated > > > > > utilization samples which will overestimate the task requirements. > > > > > > > > > > We will have a spike in estimated utilization at next wakeup, since we > > > > > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in > > > > > case we collect multiple samples above the current capacity. > > > > > > > > TBH I don't see how it's different from current implementation with a > > > > task that was scheduled on big core and now wakes up on little core. > > > > The util_est is overestimated as well. > > > > > > While running below the capacity of a CPU, either big or LITTLE, we > > > can still measure the actual used bandwidth as long as we have idle > > > time. If the task is then moved into a lower capacity core, I think > > > it's still safe to assume that, likely, it would need more capacity. > > > > > > Why do you say it's the same ? > > > > In the example of a task that runs 39ms in period of 80ms that we used > > during previous version, > > the utilization on the big core will reach 709 so will util_est too > > When the task migrates on little core (512), util_est is higher than > > current cpu capacity > > Right, and what's the problem ? you worry about an util_est being higher than capacity which is the case there > > 1) We know that PELT is calibrated to 32ms period task and in your > example, since the runtime is higher then the half-life, it's > correct to estimate a utilization higher then 50%. > > PELT utilization is defined _based on the half-life_: thus > your task having a 50% duty cycle does not mean we are not correct > if report a utilization != 50%. > It would be as broken as reporting 10% utilization for a task > running 100ms every 1s. > > 2) If it was a 70% task on a previous activation, once it's moved into > a lower capacity CPU it's still correct to assume that it's likely > going to require the same bandwidth and thus will be > under-provisioned. > > I still don't see where we are wrong in this case :/ > > To me it looks different then the problem I described. > > > > With your new signal instead, once we cross the current capacity, > > > utilization is just not anymore utilization. Thus, IMHO it make sense > > > avoid to accumulate a sample for what we call "estimated utilization". This is not true. 
> > > I would also say that, with the current implementation, which caps
> > > utilization to the current capacity, we get a better estimate in
> > > general. At least we can say with absolute precision:
> > >
> > >    "the task needs _at least_ that amount of capacity".
> > >
> > > Potentially we can also flag the task as being under-provisioned,
> > > in case there was no idle time, and _let a policy_ decide what to
> > > do with it and the information we have been granted.
> > >
> > > While, with your new signal, once we are over the current capacity,
> > > the "utilization" is just a sort of "random" number, at best useful
> > > to draw some conclusions about how long the task has been delayed.

See my comment above.

> > > IOW, I fear that we are embedding a policy within a signal which
> > > currently represents something very well defined: how much cpu
> > > bandwidth a task used. Latency/under-provisioning policies should
> > > perhaps be placed somewhere else.
> > >
> > > Perhaps I've missed it in some of the previous discussions:
> > > have we considered/discussed this signal-vs-policy aspect ?
>
> What's your opinion on the above instead ?

It's not a policy, but it gives better knowledge about the amount of
work done. I have put below the discussion on the subject from the
previous version:

> > With contribution scaling the PELT utilization of a task is a
> > _minimum_ utilization. Regardless of where the task is currently/was
> > running (and provided that it doesn't change behaviour), its PELT
> > utilization will approximate its _minimum_ utilization on an idle
> > 1024-capacity CPU.
>
> The main drawback is that the _minimum_ utilization depends on the
> CPU capacity on which the task runs. Two 25% tasks on a 256-capacity
> CPU will have a utilization of 128, as an example.
>
> > With time scaling the PELT utilization doesn't really have a meaning
> > on its own. It has to be compared to the capacity of the CPU where
> > it is/was running to know what its current PELT utilization means.
>
> I would have said the opposite. The utilization of the task will
> always reflect the same amount of work that has already been done,
> whatever the CPU capacity. In fact, the new scaling mechanism uses
> the real amount of work that has already been done to compute the
> utilization signal, which is not the case currently. This gives more
> information about the real amount of work done in the
> over-utilization case.
>
> > When the utilization over-shoots the capacity, its value no longer
> > represents utilization; it just means that the task has a higher
> > compute demand than is offered on its current CPU, and a high value
> > means that it has been suffering longer. It can't be used to predict
> > the actual utilization on an idle 1024-capacity CPU any better than
> > contribution-scaled PELT utilization.
>
> I think that it provides earlier detection of over-utilization and a
> more accurate signal for a longer time duration, which can help the
> load balance.
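Roughly, the two flavours differ in where the capacity scaling is
applied during one accumulation step. A per-millisecond sketch (a
simplified standalone model, not the actual pelt.c code or the patch
itself):

        #include <math.h>

        #define HALFLIFE_MS     32.0

        /* geometric decay factor for "ms" milliseconds */
        static double decay(double ms)
        {
                return pow(0.5, ms / HALFLIFE_MS);
        }

        /*
         * Contribution scaling: the clock runs at wall speed and the
         * running contribution is scaled by capacity, so a task running
         * flat out converges towards "cap" rather than 1024.
         */
        static double step_contrib(double util, double cap)
        {
                return util * decay(1.0) + (1.0 - decay(1.0)) * cap;
        }

        /*
         * Time scaling: one wall millisecond of running at capacity
         * "cap" advances the PELT running clock by cap/1024 ms, each
         * accounted at full scale, so the signal keeps tracking the
         * work actually done and can still reach 1024 on a small core.
         */
        static double step_time(double util, double cap)
        {
                double t = cap / 1024.0;

                return util * decay(t) + (1.0 - decay(t)) * 1024.0;
        }

In both flavours idle time decays at wall-clock rate; the real patch
also has to account for the "lost" idle time, which this sketch
ignores.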
> Coming back to the 50% task example, I will use a 50ms running time
> during a 100ms period for the example below to make it easier.
>
> Starting from 0, the evolution of the utilization is:
>
> With contribution scaling:
>
>   time       0ms   50ms   100ms   150ms   200ms
>   capacity
>   1024         0    666
>    512         0    333     453
>
> When the CPU starts to be over-utilized (@100ms), the utilization is
> already too low (453 instead of 666) and the scheduler doesn't yet
> detect that we are over-utilized.
>
>    256         0    169     226     246     252
>
> That's even worse with this lower capacity.
>
> With time scaling:
>
>   time       0ms   50ms   100ms   150ms   200ms
>   capacity
>   1024         0    666
>    512         0    428     677
>
> We know that the current capacity is not enough, and the utilization
> reflects the correct utilization level compared to the 1024 capacity
> (the 666 vs 677 difference comes from the 1024us window: the last
> window is not full in the max capacity case).
>
>    256         0    234     468     564     677
>
> At 100ms, we know that there is not enough capacity (in fact, we know
> it at 56ms). And even at time 200ms, the amount of work is exactly
> what would have been executed on a CPU 4x faster.
>
> > This change might not be a showstopper, but it is something to be
> > aware of and to take into account wherever PELT utilization is used.
>
> The point above is clearly a big difference between the 2 approaches
> in the no-spare-cycle case, but I think it will help by giving more
> information in the over-utilization case.
>
> Vincent
>
> > Morten

> --
> #include <best/regards.h>
>
> Patrick Bellasi
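As a cross-check of the tables above, the ideal continuous-time values
follow from the closed form util(t) = max * (1 - 2^(-t/32ms)). A
standalone sketch (an assumed model: it reproduces the ideal values,
while the figures in the mail additionally include the 1024us window
effect Vincent mentions):

        #include <math.h>
        #include <stdio.h>

        #define HALFLIFE_MS     32.0
        #define MAX_CAP         1024.0

        /* contribution scaling: converges towards "cap" */
        static double util_contrib(double run_ms, double cap)
        {
                return cap * (1.0 - pow(2.0, -run_ms / HALFLIFE_MS));
        }

        /* time scaling: running clock advances at cap/1024 of wall time */
        static double util_time(double run_ms, double cap)
        {
                double scaled_ms = run_ms * cap / MAX_CAP;

                return MAX_CAP * (1.0 - pow(2.0, -scaled_ms / HALFLIFE_MS));
        }

        int main(void)
        {
                /* 50ms of work: 50ms at cap 1024, 100ms at 512, 200ms at 256 */
                printf("cap 1024,  50ms: contrib %3.0f  time %3.0f\n",
                       util_contrib(50.0, 1024.0), util_time(50.0, 1024.0));
                printf("cap  512, 100ms: contrib %3.0f  time %3.0f\n",
                       util_contrib(100.0, 512.0), util_time(100.0, 512.0));
                printf("cap  256, 200ms: contrib %3.0f  time %3.0f\n",
                       util_contrib(200.0, 256.0), util_time(200.0, 256.0));
                return 0;
        }

Once the same 50ms of work has been done, this prints contrib values of
677, 453 and 253 against a time-scaled value of 677 at every capacity:
time scaling converges to the same number whatever the CPU speed (the
"4x faster CPU" point), while contribution scaling saturates towards
the local capacity.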