Date: Mon, 10 Oct 2016 12:15:58 +0200
From: Peter Zijlstra
To: Luca Abeni
Cc: linux-kernel@vger.kernel.org, Tommaso Cucinotta, Juri Lelli, Thomas Gleixner, Andrea Parri
Subject: Re: About group scheduling for SCHED_DEADLINE

On Sun, Oct 09, 2016 at 09:39:38PM +0200, Luca Abeni wrote:
> So, I started to think about this, and here are some ideas to start a
> discussion:
> 1) First of all, we need to decide the software interface. If I
> understand correctly (please correct me if I am wrong), cgroups let
> you specify a runtime and a period, and this means that the cgroup is
> reserved the specified runtime every period on all the cgroup's
> CPUs... In other words, it is not possible to reserve different
> runtimes/periods on different CPUs. Is this correct?

That is the current state for RR/FIFO, but given that that is a
complete trainwreck, I think we can deprecate that and change the
interface.

My primary concern is getting something that actually works and makes
theoretical sense; we can worry about the interface after that.

> Is this what we
> want for hierarchical SCHED_DEADLINE? Or do we want to allow the
> possibility to schedule a cgroup with multiple "deadline servers"
> having different runtime/period parameters? (the first solution is
> easier to implement, the second one offers more degrees of freedom
> that might be used to improve the real-time schedulability)

Right, I'm not sure what makes the most sense, nor am I entirely sure
what you mean by multiple deadline servers; is that different
parameters per CPU?

> 2) Is it ok to have only two levels in the scheduling hierarchy (at
> least in the first implementation)?

Yeah, I think so. It doesn't really make sense to stack this stuff too
deep anyway, and it only makes sense to stack downwards (e.g., have DL
host RR/FIFO/CFS). Stacking upwards (have CFS host DL, for example)
doesn't make any sense to me.

> 4) From a more theoretical point of view, it would be good to define
> the scheduling model that needs to be implemented (based on something
> previously described in some paper, or defining a new model from
> scratch).
>
> Well, I hope this can be a good starting point for a discussion :)

Right, so the problem we have is unspecified SCHED_FIFO on SMP and
historical behaviour.

As you know, we've extended FIFO to SMP by G-FIFO (run the m
highest-priority tasks on m CPUs). But along with that, we allow
arbitrary affinity masks for RR/FIFO tasks. (Note that RR is broken in
the G-FIFO model, but that's a different discussion for a different
day.)

Now, the proposed model has identical CBS parameters for every (v)cpu
of the cgroup. This means that a cgroup must be overprovisioned in the
general case where nr_tasks < nr_cpus (and worse, the parameters must
match the max task).
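To put rough numbers on that overprovisioning, here is a toy userspace
calculation (the 4-CPU / 2-task figures are made up, and this is in no
way the actual kernel code):

#include <stdio.h>

int main(void)
{
	int nr_cpus = 4, nr_tasks = 2;		/* nr_tasks < nr_cpus */
	double max_task_bw = 3.0 / 10.0;	/* heaviest task: 3ms every 10ms */

	/* identical CBS parameters on every (v)cpu, sized to the max task: */
	double reserved = nr_cpus * max_task_bw;	/* 1.2 CPUs reserved */
	double usable = nr_tasks * max_task_bw;		/* 0.6 CPUs usable */

	printf("reserved %.1f CPUs, usable %.1f, wasted %.1f\n",
	       reserved, usable, reserved - usable);
	return 0;
}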
This leads to vast amounts of wasted resources.

The alternative is different but fixed parameters per CPU, but that is
somewhat unwieldy in that it increases the configuration burden. It
does, however, let you minimize the wasted resources and deal with the
affinity problem (AFAICT).

However, I think there's a third alternative. I have memories of a
paper from UNC (I'd have to dig through the site to see if I can still
find it) where they argue that for a hierarchical (G-)FIFO you should
use minimal concurrency, that is, run the minimal number of (v)cpu
servers.

This would mean we give a single CBS parameter and carve out the
minimal number of (max CBS) (v)cpu servers that fit in it (a toy
sketch follows below).

I'm just not sure how the random affinity crap works out for that. If
we have the (v)cpu servers migratable in the G-EDF, and they migrate
to whatever is demanded by the tasks at runtime, it might work, but
who knows... Analysis would be needed, I think.

Any other opinions / options?
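For concreteness, the carving mentioned above could look something
like this toy calculation (invented numbers again, not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned long period = 10000;	/* us */
	unsigned long runtime = 25000;	/* us per period; 2.5 CPUs worth */

	/* minimal concurrency: as few (v)cpu servers as possible, each
	 * at the per-CPU maximum, plus one partial for the remainder */
	unsigned long nr_full = runtime / period;	/* 2 full servers */
	unsigned long rest = runtime % period;		/* 5000us left over */

	printf("%lu full servers (%lu/%lu)", nr_full, period, period);
	if (rest)
		printf(" + 1 partial server (%lu/%lu)", rest, period);
	printf("\n");
	return 0;
}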