Date: Wed, 12 Apr 2017 14:10:09 +0200
From: Peter Zijlstra
To: Patrick Bellasi
Cc: Tejun Heo, linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
    Ingo Molnar, "Rafael J. Wysocki", Paul Turner, Vincent Guittot,
    John Stultz, Todd Kjos, Tim Murray, Andres Oportus, Joel Fernandes,
    Juri Lelli, Chris Redpath, Morten Rasmussen, Dietmar Eggemann
Subject: Re: [RFC v3 0/5] Add capacity capping support to the CPU controller

Let me reply in parts as I read this.. easy things first :-)

On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> On 10-Apr 09:36, Peter Zijlstra wrote:
> > 4) they have muddled semantics, because while it's presented as a
> >    task property, it very much is not.
>
> Actually we always presented it as a task-group property, while other
> people suggested it should be a per-task API.
>
> Still, IMO it could also make sense as a per-task API, for example
> considering a specific RT task which we know in our system is
> perfectly fine to always run below a certain OPP.
>
> Do you think it should be a per-task API?
> Should we focus (at least initially) on providing a per task-group API?

Even for the cgroup interface, I think they should set a per-task
property, not a group property.

> > 3) they have absolutely unacceptable overhead in implementation. Two
> >    more RB tree operations per enqueue/dequeue is just not going to
> >    happen.
>
> This last point is about "implementation details"; I'm pretty
> confident that if we find an agreement on the previous point then
> this last one will be simple to solve.
>
> Just to be clear, the rb-trees are per CPU and used to track just the
> RUNNABLE tasks on each CPU and, as I described in the previous
> example, for the OPP biasing to work I think we really need an
> aggregation mechanism.

I know it's runnable, which is exactly what the regular RB tree in
fair already tracks. That gets us 3 RB tree operations per scheduling
operation, which is entirely ridiculous.

And while I disagree with the need to track this at all, see below, it
can be done _much_ cheaper using a double augmented RB-tree, where you
keep the min/max as heaps inside the regular RB-tree.

> Ideally, every time a task is enqueued/dequeued from a CPU we want to
> know what the currently required capacity clamping is. This requires
> maintaining an ordered list of values, and rb-trees are the most
> effective solution.
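To illustrate the augmented variant from above: strawman userspace
sketch, with a plain unbalanced BST standing in for the cfs_rq
rb-tree and all names made up. Each node caches its subtree's min/max
clamp and refreshes it on the way back up an insert, so the rq-wide
bounds sit at the root in O(1) without a second tree:

#include <stdio.h>

struct clamp_node {
        unsigned long key;      /* existing sort key, e.g. vruntime */
        unsigned int clamp;     /* this entity's clamp value */
        unsigned int sub_min;   /* cached min clamp of the subtree */
        unsigned int sub_max;   /* cached max clamp of the subtree */
        struct clamp_node *left, *right;
};

/* Recompute a node's cached subtree min/max from itself and its children. */
static void update_augment(struct clamp_node *n)
{
        n->sub_min = n->sub_max = n->clamp;
        if (n->left) {
                if (n->left->sub_min < n->sub_min)
                        n->sub_min = n->left->sub_min;
                if (n->left->sub_max > n->sub_max)
                        n->sub_max = n->left->sub_max;
        }
        if (n->right) {
                if (n->right->sub_min < n->sub_min)
                        n->sub_min = n->right->sub_min;
                if (n->right->sub_max > n->sub_max)
                        n->sub_max = n->right->sub_max;
        }
}

/* Insert and refresh the cached bounds along the path back up. */
static struct clamp_node *clamp_insert(struct clamp_node *root,
                                       struct clamp_node *n)
{
        if (!root) {
                update_augment(n);
                return n;
        }
        if (n->key < root->key)
                root->left = clamp_insert(root->left, n);
        else
                root->right = clamp_insert(root->right, n);
        update_augment(root);
        return root;
}

int main(void)
{
        struct clamp_node a = { .key = 10, .clamp = 300 };
        struct clamp_node b = { .key = 20, .clamp = 800 };
        struct clamp_node c = { .key =  5, .clamp = 512 };
        struct clamp_node *root = NULL;

        root = clamp_insert(root, &a);
        root = clamp_insert(root, &b);
        root = clamp_insert(root, &c);

        /* rq-wide clamp bounds, O(1) at the root: min=300 max=800 */
        printf("min=%u max=%u\n", root->sub_min, root->sub_max);
        return 0;
}

A real implementation would of course keep the tree balanced and hook
the min/max maintenance into the rebalance callbacks.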
> Perhaps, if we accept a certain level of approximation, we can
> potentially reduce the set of tracked values to a finite set (maybe
> corresponding to the EM capacity values) and use just a per-cpu
> ref-counting array.
> Still, the min/max aggregation will have a worst-case O(N) search
> complexity, where N is the number of finite values we want to use.

So the bigger point is that if the min/max is a per-task property
(even if set through a cgroup interface), the min(max) / max(min)
thing is wrong.

If the min/max were to apply to each individual task's util, you'd
end up with something like:

  Dom(\Sum util) = [min(1, \Sum min), min(1, \Sum max)]

where a possible approximation is scaling the aggregate util into
that domain.
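FWIW, one possible (made-up) reading of that approximation as code,
with SCHED_CAPACITY_SCALE standing in for "1", a linear rescale as
the approximation, and all names illustrative:

#include <stdio.h>

#define CAP_SCALE 1024  /* fixed-point "1", a la SCHED_CAPACITY_SCALE */

struct task_util {
        unsigned int util;      /* the task's utilization contribution */
        unsigned int min;       /* per-task min clamp */
        unsigned int max;       /* per-task max clamp */
};

static unsigned int cap_one(unsigned int v)
{
        return v > CAP_SCALE ? CAP_SCALE : v;
}

/*
 * Dom(\Sum util) = [min(1, \Sum min), min(1, \Sum max)]
 *
 * then, as one possible approximation, linearly scale the aggregate
 * util from [0, 1] into that domain.
 */
static unsigned int scaled_aggregate_util(const struct task_util *t, int n)
{
        unsigned int sum_util = 0, sum_min = 0, sum_max = 0;
        unsigned int lo, hi;
        int i;

        for (i = 0; i < n; i++) {
                sum_util += t[i].util;
                sum_min  += t[i].min;
                sum_max  += t[i].max;
        }

        lo = cap_one(sum_min);
        hi = cap_one(sum_max);

        return lo + (unsigned long long)cap_one(sum_util) * (hi - lo)
                        / CAP_SCALE;
}

int main(void)
{
        struct task_util rq[] = {
                { .util = 200, .min = 128, .max = 512 },
                { .util = 300, .min = 256, .max = 768 },
        };

        /* lo = 384, hi = min(1024, 1280) = 1024 -> 384 + 500*640/1024 */
        printf("%u\n", scaled_aggregate_util(rq, 2));
        return 0;
}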