From: Vincent Guittot
Date: Mon, 4 Jun 2018 20:08:58 +0200
Subject: Re: [PATCH v5 00/10] track CPU utilization
To: Peter Zijlstra
Cc: Ingo Molnar, linux-kernel, "Rafael J. Wysocki", Juri Lelli, Dietmar Eggemann, Morten Rasmussen, viresh kumar, Valentin Schneider, Quentin Perret
In-Reply-To: <20180604165047.GU12180@hirez.programming.kicks-ass.net>
X-Mailing-List: linux-kernel@vger.kernel.org

On 4 June 2018 at 18:50, Peter Zijlstra wrote:
> On Fri, May 25, 2018 at 03:12:21PM +0200, Vincent Guittot wrote:
>> When both cfs and rt tasks compete to run on a CPU, we can see some
>> frequency drops with the schedutil governor. In such a case, the
>> cfs_rq's utilization no longer reflects the utilization of cfs tasks
>> but only the remaining part that is not used by rt tasks.
>> We should monitor the stolen utilization and take it into account
>> when selecting the OPP. This patchset doesn't change the OPP
>> selection policy for RT tasks, only for CFS tasks.
>
> So the problem is that when RT/DL/stop/IRQ happens and preempts CFS
> tasks, time continues and the CFS load tracking will see !running and
> decay things.
>
> Then, when we get back to CFS, we'll have lower load/util than we
> expected.
>
> In particular, your focus is on OPP selection, and where we would have
> say: u=1 (always running task), after being preempted by our RT task
> for a while, it will now have u=.5. With the effect that when the RT
> task goes to sleep we'll drop our OPP to .5 max -- which is 'wrong',
> right?

Yes, that's the typical example.

> Your solution is to track RT/DL/stop/IRQ with the same PELT averages
> we use to track cfs util. Such that we can then add the various
> averages to reconstruct the actual utilisation signal.

Yes, and thereby get the whole CPU utilization.

> This should work for the case of the utilization signal on UP. It gets
> trickier when we consider that PELT migrates the signal around on SMP,
> while we don't do that to the per-rq signals we have for
> RT/DL/stop/IRQ.
>
> There is also the 'complaint' that this ends up with 2 util signals
> for DL, complicating things.

Yes, that's the main point of discussion: how to balance dl bandwidth
and dl utilization.

> So this patch-set tracks the !cfs occupation using the same function,
> which is all good. But what if, instead of using that to compensate
> the OPP selection, we employ that to renormalize the util signal?
>
> If we normalize util against the dynamic (rt_avg affected)
> cpu_capacity, then I think your initial problem goes away. Because
> while the RT task will push the util to .5, it will at the same time
> push the CPU capacity to .5, and renormalized that gives 1.
>
> NOTE: the renorm would then become something like:
>   scale_cpu = arch_scale_cpu_capacity() / rt_frac();
>
> On IRC I mentioned stopping the CFS clock when preempted, and while
> that would result in fixed numbers, Vincent was right in pointing out
> that the numbers would be difficult to interpret, since the meaning
> would be purely CPU local, and I'm not sure you can actually fix it
> again with normalization.
>
> Imagine running a .3 RT task: that would push the (always running) CFS
> task down to .7, but because we discard all !cfs time, it actually
> reads 1. If we try to normalize that we'll end up with ~1.43, which is
> of course completely broken.
>
> _However_, all that happens for util also happens for load. So the
> above scenario will also make the CPU appear less loaded than it
> actually is.

The load will continue to increase, because for load we track the
runnable state and not just running.

> Now, we actually try and compensate for that by decreasing the
> capacity of the CPU. But because the existing rt_avg and PELT signals
> are so out-of-tune, this is likely to be less than ideal. With that
> fixed, however, the best this appears to do is, as per the above,
> preserve the actual load. But what we really wanted is to actually
> inflate the load, such that someone will take load from us -- we're
> doing less actual work after all.
>
> Possibly, we can do something like:
>
>   scale_cpu_capacity / (rt_frac^2)
>
> for load; then we inflate the load and could maybe get rid of all this
> capacity_of() sprinkling, but that needs more thinking.
>
> But I really feel we need to consider both util and load, as this
> issue affects both.
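To make the renormalization arithmetic above concrete, here is a tiny
standalone demo (a sketch only; values are illustrative, plain C rather
than kernel code):

    /* Demo of the renorm discussed above: a .3 RT task preempting an
     * always-running CFS task. All values are fractions of capacity. */
    #include <stdio.h>

    int main(void)
    {
            double rt = 0.3;             /* time stolen by the RT task */
            double cfs_util = 1.0 - rt;  /* what PELT sees for CFS     */
            double capacity = 1.0 - rt;  /* rt_avg-adjusted capacity   */

            /* Renormalized util: .7 / .7 == 1, the real CFS demand.  */
            printf("renormalized util = %.2f\n", cfs_util / capacity);

            /* If instead we stopped the CFS clock while preempted,
             * CFS would still read 1.0 and the same renorm would
             * overshoot: 1.0 / 0.7 ~= 1.43, the broken case above.   */
            printf("broken renorm     = %.2f\n", 1.0 / capacity);
            return 0;
    }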
My initial idea was to take the max between dl bandwidth and dl
util_avg, but util_avg can be higher than the bandwidth, and using it
would make schedutil select a higher OPP for no good reason when
nothing else is running and needs the compute capacity.

As you mentioned, scale_rt_capacity gives the remaining capacity for
cfs, and it will behave like cfs util_avg now that it uses PELT. So as
long as cfs util_avg < scale_rt_capacity (we probably need a margin),
we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting
the OPP, because we have remaining spare capacity; but if cfs util_avg
== scale_rt_capacity, we make sure to use the max OPP. I will run some
tests to make sure that they all behave correctly with such a policy.
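For reference, a rough standalone sketch of that policy (the helper
name, signature and margin value are illustrative assumptions, not the
actual kernel API):

    /* Pick a frequency, as a fraction of fmax, from the tracked
     * signals. All inputs are fractions of the CPU capacity.        */
    #include <stdio.h>

    static double select_opp(double cfs_util, double rt_util,
                             double dl_bw, double spare)
    {                                   /* spare = scale_rt_capacity */
            const double margin = 0.1;  /* head-room, value a guess  */

            /* cfs util_avg is clamped by what rt/dl leave over; once
             * it nears that ceiling we can't tell how much CFS
             * really wants, so request the max OPP.                 */
            if (cfs_util >= spare - margin)
                    return 1.0;

            /* Otherwise the sum of the tracked signals is the whole
             * CPU demand.                                           */
            double util = dl_bw + cfs_util + rt_util;
            return util > 1.0 ? 1.0 : util;
    }

    int main(void)
    {
            /* Spare capacity left: OPP proportional to util (0.60). */
            printf("%.2f\n", select_opp(0.30, 0.20, 0.10, 0.50));
            /* CFS at its ceiling: request the max OPP (1.00).       */
            printf("%.2f\n", select_opp(0.45, 0.40, 0.10, 0.50));
            return 0;
    }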