Date: Mon, 10 Oct 2016 12:15:58 +0200
From: Peter Zijlstra
To: Luca Abeni
Cc: linux-kernel@vger.kernel.org, Tommaso Cucinotta, Juri Lelli, Thomas Gleixner, Andrea Parri
Subject: Re: About group scheduling for SCHED_DEADLINE

On Sun, Oct 09, 2016 at 09:39:38PM +0200, Luca Abeni wrote:
> So, I started to think about this, and here are some ideas to start a
> discussion:
> 1) First of all, we need to decide the software interface. If I
> understand correctly (please correct me if I am wrong), cgroups let
> you specify a runtime and a period, and this means that the cgroup is
> reserved the specified runtime every period on all the cgroup's
> CPUs... In other words, it is not possible to reserve different
> runtimes/periods on different CPUs. Is this correct?

That is the current state for RR/FIFO, but given that that is a
complete trainwreck, I think we can deprecate that and change the
interface.

My primary concern is getting something that actually works and makes
theoretical sense; we can worry about the interface after that.

> Is this what we
> want for hierarchical SCHED_DEADLINE? Or do we want to allow the
> possibility to schedule a cgroup with multiple "deadline servers"
> having different runtime/period parameters? (the first solution is
> easier to implement, the second one offers more degrees of freedom
> that might be used to improve the real-time schedulability)

Right, I'm not sure what makes the most sense, nor am I entirely sure
what you mean by multiple deadline servers; is that different
parameters per CPU?

> 2) Is it ok to have only two levels in the scheduling hierarchy (at
> least in the first implementation)?

Yeah, I think so. It doesn't really make sense to stack this stuff too
deep anyway, and it only makes sense to stack downwards (e.g., have DL
host RR/FIFO/CFS). Stacking upwards (have CFS host DL, for example)
doesn't make any sense to me.

> 4) From a more theoretical point of view, it would be good to define
> the scheduling model that needs to be implemented (based on something
> previously described in some paper, or defining a new model from
> scratch).
>
> Well, I hope this can be a good starting point for a discussion :)

Right, so the problem we have is unspecified SCHED_FIFO on SMP and
historical behaviour.

As you know, we've extended FIFO to SMP by G-FIFO (run the m
highest-priority tasks on m CPUs). But along with that, we allow
arbitrary affinity masks for RR/FIFO tasks. (Note that RR is broken in
the G-FIFO model, but that's a different discussion for a different
day.)

Now, the proposed model has identical CBS parameters for every (v)cpu
of the cgroup. This means that a cgroup must be overprovisioned in the
general case where nr_tasks < nr_cpus (and worse, the parameters must
match the max task).
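To put rough numbers on that overprovisioning, here is a toy userspace
calculation (the 4-CPU / 2-task figures are made up, and this is in no
way the actual kernel code):

#include <stdio.h>

int main(void)
{
	int nr_cpus = 4, nr_tasks = 2;		/* nr_tasks < nr_cpus */
	double max_task_bw = 3.0 / 10.0;	/* heaviest task: 3ms every 10ms */

	/* identical CBS parameters on every (v)cpu, sized to the max task: */
	double reserved = nr_cpus * max_task_bw;	/* 1.2 CPUs reserved */
	double usable = nr_tasks * max_task_bw;		/* 0.6 CPUs usable */

	printf("reserved %.1f CPUs, usable %.1f, wasted %.1f\n",
	       reserved, usable, reserved - usable);
	return 0;
}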
This leads to vast amounts of wasted resources.

The alternative is different but fixed parameters per CPU, but that is
somewhat unwieldy in that it increases the configuration burden. It
does, however, let you minimize the wasted resources and deal with the
affinity problem (AFAICT).

However, I think there's a third alternative. I have memories of a
paper from UNC (I'd have to dig through the site to see if I can still
find it) where they argue that for a hierarchical (G-)FIFO you should
use minimal concurrency, that is, run the minimal number of (v)cpu
servers.

This would mean we give a single CBS parameter and carve out the
minimal number of (max CBS) (v)cpu servers that fit in it (a toy
sketch follows below).

I'm just not sure how the random affinity crap works out for that. If
we have the (v)cpu servers migratable in the G-EDF, and they migrate
to whatever is demanded by the tasks at runtime, it might work, but
who knows... Analysis would be needed, I think.

Any other opinions / options?
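For concreteness, the carving mentioned above could look something
like this toy calculation (invented numbers again, not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned long period = 10000;	/* us */
	unsigned long runtime = 25000;	/* us per period; 2.5 CPUs worth */

	/* minimal concurrency: as few (v)cpu servers as possible, each
	 * at the per-CPU maximum, plus one partial for the remainder */
	unsigned long nr_full = runtime / period;	/* 2 full servers */
	unsigned long rest = runtime % period;		/* 5000us left over */

	printf("%lu full servers (%lu/%lu)", nr_full, period, period);
	if (rest)
		printf(" + 1 partial server (%lu/%lu)", rest, period);
	printf("\n");
	return 0;
}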