Hi all,
after the SCHED_DEADLINE TODO page
(https://github.com/jlelli/sched-deadline/wiki/TODOs) was published,
there was a private email exchange about the "group scheduling
(cgroups)" / "hierarchical DEADLINE server for FIFO/RR" item.
I'd like to start a discussion about this topic, so that the TODO item
can be implemented in a way that everyone agrees on.
I'm adding in cc all the people involved in the previous email exchange
about this topic, plus Andrea, who originally developed a patch
implementing hierarchical SCHED_DEADLINE (see
http://retis.sssup.it/~nino/publication/rtlws14bdm.pdf and the papers
cited there); I do not know who else to cc, so feel free to forward this
email to the relevant people or to tell me who to add in future emails.
So, I started to think about this, and here are some ideas to start a
discussion:
1) First of all, we need to decide the software interface. If I
understand correctly (please correct me if I am wrong), cgroups let
you specify a runtime and a period, and this means that the cgroup is
reserved the specified runtime every period on all the cgroup's
CPUs... In other words, it is not possible to reserve different
runtimes/periods on different CPUs. Is this correct? Is this what we
want for hierarchical SCHED_DEADLINE? Or do we want to allow the
possibility to schedule a cgroup with multiple "deadline servers"
having different runtime/period parameters? (the first solution is
easier to implement, the second one offers more degrees of freedom
that might be used to improve the real-time schedulability; see the
sketch after this list)
2) Is it ok to have only two levels in the scheduling hierarchy (at
least in the first implementation)?
3) If this "hierarchical SCHED_DEADLINE" is implemented using multiple
"deadline servers" (one per cgroup's CPU) to schedule the cgroup's
tasks, should these servers be bound to CPUs, or should they be free
to migrate between the cgroup's CPUs? In the first case, each one of
these deadline servers can be implemented as a sched_dl_entity
structure that can be scheduled only on a specific runqueue. The
second case is (in my understanding) more complex to implement,
because the dl push/pull code uses task structures, so a dl
scheduling entity per server is not enough (unless we modify the
migration code). At least, this is what I understood when looking at
the code.
4) From a more theoretical point of view, it would be good to define
the scheduling model that needs to be implemented (based on something
previously described on some paper, or defining a new model from
scratch).
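To make the interface question in point 1 more concrete, here is a
minimal sketch of the two alternatives (struct names and layout are
mine, purely for illustration; nothing like this exists in the kernel
today):

#include <stdint.h>

#define MAX_CPUS 8	/* arbitrary value, only for the sketch */

/* Alternative A: a single runtime/period pair, replicated on every
 * CPU of the cgroup (as far as I understand, this is the shape of
 * the current RR/FIFO group interface). */
struct dl_group_params_uniform {
	uint64_t runtime_ns;	/* budget granted every period, per CPU */
	uint64_t period_ns;
};

/* Alternative B: one "deadline server" per CPU, each with its own
 * parameters; more degrees of freedom, but a bigger configuration
 * and implementation burden. */
struct dl_group_params_per_cpu {
	uint64_t runtime_ns[MAX_CPUS];
	uint64_t period_ns[MAX_CPUS];
};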
Well, I hope this can be a good starting point for a discussion :)
Luca
On Sun, Oct 09, 2016 at 09:39:38PM +0200, Luca Abeni wrote:
> So, I started to think about this, and here are some ideas to start a
> discussion:
> 1) First of all, we need to decide the software interface. If I
> understand correctly (please correct me if I am wrong), cgroups let
> you specify a runtime and a period, and this means that the cgroup is
> reserved the specified runtime every period on all the cgroup's
> CPUs... In other words, it is not possible to reserve different
> runtimes/periods on different CPUs. Is this correct?
That is the current state for RR/FIFO, but given that that is a complete
trainwreck, I think we can deprecate that and change the interface.
My primary concern is getting something that actually works and makes
theoretical sense, and then we can worry about the interface.
> Is this what we
> want for hierarchical SCHED_DEADLINE? Or do we want to allow the
> possibility to schedule a cgroup with multiple "deadline servers"
> having different runtime/period parameters? (the first solution is
> easier to implement, the second one offers more degrees of freedom
> that might be used to improve the real-time schedulability)
Right, I'm not sure what makes most sense, nor am I entirely sure
what you mean by multiple deadline servers; is that different
variables per CPU?
> 2) Is it ok to have only two levels in the scheduling hierarchy (at
> least in the first implementation)?
Yeah, I think so. It doesn't really make sense to stack this stuff too
deep anyway, and it only makes sense to stack downwards (e.g., have DL
host RR/FIFO/CFS). Stacking upwards (have CFS host DL, for example)
doesn't make any sense to me.
> 4) From a more theoretical point of view, it would be good to define
> the scheduling model that needs to be implemented (based on something
> previously described on some paper, or defining a new model from
> scratch).
>
> Well, I hope this can be a good starting point for a discussion :)
Right, so the problem we have is unspecified SCHED_FIFO on SMP and
historical behaviour.
As you know we've extended FIFO to SMP by G-FIFO (run the m highest prio
tasks on m CPUs). But along with that, we allow arbitrary affinity masks
for RR/FIFO tasks.
(Note that RR is broken in the G-FIFO model, but that's a different
discussion for a different day).
Now, the proposed model has identical CBS parameters for every (v)cpu of
the cgroup. This means that a cgroup must be overprovisioned in the
general case where nr_tasks < nr_cpus (and worse, the parameters must
match the max task).
This leads to vast amounts of wasted resources.
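(As a made-up example: a cgroup spanning 4 CPUs that only contains two
tasks with utilisations 0.3 and 0.1 would need 4 identical servers
dimensioned for the 0.3 task, reserving 4 * 0.3 = 1.2 CPUs worth of
bandwidth where 0.3 + 0.1 = 0.4 would suffice.)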
The alternative is different but fixed parameters per cpu, but that is
somewhat unwieldy in that it increases the configuration burden. But you
can indeed minimize the wasted resources and deal with the affinity
problem (AFAICT).
However, I think there's a third alternative. I have memories of a paper
from UNC (I'd have to dig through the site to see if I can still find
it) where they argue that for a hierarchical (G-)FIFO you should use
minimal concurrency, that is run the minimal number of (v)cpu servers.
This would mean we give a single CBS parameter and carve out the minimal
number (of max CBS) (v)cpu that fit in that.
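Roughly this, as a sketch with made-up numbers (assuming each (v)cpu
server can consume at most one full CPU; this is not the paper's exact
formulation):

#include <math.h>
#include <stdio.h>

/* Minimal concurrency: serve a total bandwidth demand of utot CPUs
 * with ceil(utot) servers: floor(utot) "full" (max CBS) ones plus,
 * if utot is not integral, one partial server with the leftover. */
int main(void)
{
	double utot = 2.4;		/* example: the group needs 2.4 CPUs */
	int full = (int)floor(utot);	/* number of full servers */
	double leftover = utot - full;	/* bandwidth of the last server */

	printf("%d full servers + 1 server with bandwidth %.1f\n",
	       full, leftover);
	return 0;
}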
I'm just not sure how the random affinity crap works out for that, if we
have the (v)cpu servers migratable in the G-EDF and migrate to whatever
is demanded by the task at runtime it might work, but who knows..
Analysis would be needed I think.
Any other opinions / options?
On Mon, Oct 10, 2016 at 12:15:58PM +0200, Peter Zijlstra wrote:
> However, I think there's a third alternative. I have memories of a paper
> from UNC (I'd have to dig through the site to see if I can still find
> it) where they argue that for a hierarchical (G-)FIFO you should use
> minimal concurrency, that is run the minimal number of (v)cpu servers.
>
> This would mean we give a single CBS parameter and carve out the minimal
> number (of max CBS) (v)cpu that fit in that.
>
> I'm just not sure how the random affinity crap works out for that, if we
> have the (v)cpu servers migratable in the G-EDF and migrate to whatever
> is demanded by the task at runtime it might work, but who knows..
> Analysis would be needed I think.
Hurm,.. thinking slightly more on this, this ends up being a DL task
with random affinity, which is problematic IIRC.
On Mon, Oct 10, 2016 at 12:15:58PM +0200, Peter Zijlstra wrote:
> Right, so the problem we have is unspecified SCHED_FIFO on SMP and
> historical behaviour.
>
> As you know we've extended FIFO to SMP by G-FIFO (run the m highest prio
> tasks on m CPUs). But along with that, we allow arbitrary affinity masks
> for RR/FIFO tasks.
>
> (Note that RR is broken in the G-FIFO model, but that's a different
> discussion for a different day).
>
> Now, the proposed model has identical CBS parameters for every (v)cpu of
> the cgroup. This means that a cgroup must be overprovisioned in the
> general case where nr_tasks < nr_cpus (and worse, the parameters must
> match the max task).
>
> This leads to vast amounts of wasted resources.
>
> The alternative is different but fixed parameters per cpu, but that is
> somewhat unwieldy in that it increases the configuration burden. But you
> can indeed minimize the wasted resources and deal with the affinity
> problem (AFAICT).
>
Hurm, so I think both proposals above rely on deadline servers with
single-CPU affinity, something which we currently do not support at
all, and which I think should be defined/fixed first.
Hi Peter,
first of all, sorry for the delay in my answer
On Mon, 10 Oct 2016 12:15:58 +0200
Peter Zijlstra <[email protected]> wrote:
> On Sun, Oct 09, 2016 at 09:39:38PM +0200, Luca Abeni wrote:
>
> > So, I started to think about this, and here are some ideas to start
> > a discussion:
> > 1) First of all, we need to decide the software interface. If I
> > understand correctly (please correct me if I am wrong), cgroups
> > let you specify a runtime and a period, and this means that the
> > cgroup is reserved the specified runtime every period on all the
> > cgroup's CPUs... In other words, it is not possible to reserve
> > different runtimes/periods on different CPUs. Is this correct?
>
> That is the current state for RR/FIFO, but given that that is a
> complete trainwreck, I think we can deprecate that and change the
> interface.
Ok
>
> My primary concern is getting something that actually works and makes
> theoretical sense, and then we can worry about the interface.
Well, my question was mainly about the kind of functionalities we want
(there are various hierarchical scheduling models that make
theoretical sense and can work in practice...)
I guess the main decision is about having identical CBS parameters for
every (v)cpu of the cgroup, or allowing different parameters on the
various (v)cpus.
> > Is this what we
> > want for hierarchical SCHED_DEADLINE? Or do we want to allow the
> > possibility to schedule a cgroup with multiple "deadline servers"
> > having different runtime/period parameters? (the first solution
> > is easier to implement, the second one offers more degrees of
> > freedom that might be used to improve the real-time schedulability)
>
> Right, I'm not sure what makes most sense, nor am I entirely sure
> what you mean by multiple deadline servers; is that different
> variables per CPU?
Sorry, my description was not clear. I assumed we can model a cgroup
with multiple (v)cpus as multiple "deadline servers" (one dl scheduling
entity per (v)cpu, serving an RT runqueue). And the question was about
the parameters (runtime and period/deadline) of these scheduling
entities (basically, the same question as above: should all these
entities in the cgroup have the same parameters?)
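In code terms, the model I have in mind is roughly the following
(struct and field names are made up; in the real kernel the server
would be a sched_dl_entity and the served queue an RT runqueue):

#include <stdint.h>

struct rt_queue;		/* the group's per-(v)cpu RT runqueue */

/* One "deadline server" per (v)cpu of the cgroup: a deadline
 * scheduling entity that, when selected by EDF, runs the FIFO/RR
 * tasks queued on the (v)cpu it serves. */
struct dl_server {
	uint64_t runtime_ns;		/* CBS budget */
	uint64_t period_ns;		/* CBS period (and relative deadline) */
	int cpu;			/* CPU the server is pinned to, or -1
					 * if it is free to migrate */
	struct rt_queue *served;	/* tasks served on this (v)cpu */
};

And the open question is whether all the dl_servers of a cgroup must
share the same runtime_ns/period_ns, or whether each one can have its
own pair.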
[...]
> > 4) From a more theoretical point of view, it would be good to define
> > the scheduling model that needs to be implemented (based on
> > something previously described on some paper, or defining a new
> > model from scratch).
> >
> > Well, I hope this can be a good starting point for a discussion :)
>
> Right, so the problem we have is unspecified SCHED_FIFO on SMP and
> historical behaviour.
>
> As you know we've extended FIFO to SMP by G-FIFO (run the m highest
> prio tasks on m CPUs). But along with that, we allow arbitrary
> affinity masks for RR/FIFO tasks.
Ok; I think I've seen a paper about multi-processor scheduling of fixed
priority tasks with arbitrary affinities, but I do not remember the
details... You mean that we should perform a similar analysis for
(g-)EDF too, right?
> Now, the proposed model has identical CBS parameters for every (v)cpu
> of the cgroup. This means that a cgroup must be overprovisioned in the
> general case where nr_tasks < nr_cpus (and worse, the parameters must
> match the max task).
>
> This leads to vast amounts of wasted resources.
Right... Simplicity (in the interface and in the implementation) comes
at a cost (it forces resource over-provisioning).
> The alternative is different but fixed parameters per cpu, but that is
> somewhat unwieldy in that it increases the configuration burden. But
> you can indeed minimize the wasted resources and deal with the
> affinity problem (AFAICT).
Yes, I think so... But I suspect this solution introduces a small
drawback: the fixed priority scheduler running below SCHED_DEADLINE (at
the lower level of the hierarchy) must know what the SCHED_DEADLINE
scheduler running above it is doing (so, the RR/FIFO scheduler must
make a distinction between the various (v)cpus on which it can
schedule tasks).
> However, I think there's a third alternative. I have memories of a
> paper from UNC (I'd have to dig through the site to see if I can
> still find it) where they argue that for a hierarchical (G-)FIFO you
> should use minimal concurrency, that is run the minimal number of
> (v)cpu servers.
Ok, I need to find and read that paper
> This would mean we give a single CBS parameter and carve out the
> minimal number (of max CBS) (v)cpu that fit in that.
>
> I'm just not sure how the random affinity crap works out for that, if
> we have the (v)cpu servers migratable in the G-EDF and migrate to
> whatever is demanded by the task at runtime it might work, but who
> knows.. Analysis would be needed I think.
I think Giuseppe Lipari (added in cc) can help with this analysis.
Luca
>
>
>
> Any other opinions / options?
On Mon, 10 Oct 2016 13:08:18 +0200
Peter Zijlstra <[email protected]> wrote:
> On Mon, Oct 10, 2016 at 12:15:58PM +0200, Peter Zijlstra wrote:
> > However, I think there's a third alternative. I have memories of a
> > paper from UNC (I'd have to dig through the site to see if I can
> > still find it) where they argue that for a hierarchical (G-)FIFO
> > you should use minimal concurrency, that is run the minimal number
> > of (v)cpu servers.
> >
> > This would mean we give a single CBS parameter and carve out the
> > minimal number (of max CBS) (v)cpu that fit in that.
> >
> > I'm just not sure how the random affinity crap works out for that,
> > if we have the (v)cpu servers migratable in the G-EDF and migrate
> > to whatever is demanded by the task at runtime it might work, but
> > who knows.. Analysis would be needed I think.
>
> Hurm,.. thinking slightly more on this, this ends up being a DL task
> with random affinity, which is problematic IIRC.
Yes, there is currently no schedulability analysis for
multi-processor EDF with random affinities (as far as I know), but I
think we can at least have a look at developing this kind of analysis.
Giuseppe, what do you think?
Luca
On Sun, 16 Oct 2016 21:40:59 +0200
Luca Abeni <[email protected]> wrote:
> On Mon, 10 Oct 2016 13:08:18 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > On Mon, Oct 10, 2016 at 12:15:58PM +0200, Peter Zijlstra wrote:
> > > However, I think there's a third alternative. I have memories of a
> > > paper from UNC (I'd have to dig through the site to see if I can
> > > still find it) where they argue that for a hierarchical (G-)FIFO
> > > you should use minimal concurrency, that is run the minimal number
> > > of (v)cpu servers.
> > >
> > > This would mean we give a single CBS parameter and carve out the
> > > minimal number (of max CBS) (v)cpu that fit in that.
> > >
> > > I'm just not sure how the random affinity crap works out for that,
> > > if we have the (v)cpu servers migratable in the G-EDF and migrate
> > > to whatever is demanded by the task at runtime it might work, but
> > > who knows.. Analysis would be needed I think.
> >
> > Hurm,.. thinking slightly more on this, this ends up being a DL task
> > with random affinity, which is problematic IIRC.
> Yes, there is currently no schedulability analysis for
> multi-processor EDF with random affinities (as far as I know)
Correction: it looks like I was wrong, and the schedulability of
multi-processor EDF with arbitrary affinities has already been analysed
in
A. Gujarati, F. Cerqueira, and B. Brandenburg, “Multiprocessor
Real-Time Scheduling with Arbitrary Processor Affinities: From Practice
to Theory”, Real-Time Systems, Volume 51, Issue 4, pp. 440–483,
Springer Verlag, 2015
(see https://www.mpi-sws.org/~bbb/papers/).
Thanks to Giuseppe Lipari for pointing me to this paper.
So, having DL tasks with arbitrary affinities is not a big problem from
the theoretical point of view... The only issue is that the
utilisation-based admission test that is currently implemented in the
kernel does not work (and given the complexity of the analysis I think
it is better not to perform it in the kernel :)
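(For reference, the kernel test I am talking about boils down to
something like the following check; a simplification, not the actual
kernel code:)

#include <stdint.h>

/* Utilisation-based admission test: admit a new reservation only if
 * the total reserved bandwidth stays within the capacity of the CPUs.
 * With arbitrary affinities, passing this simple sum is no longer
 * enough to guarantee schedulability. */
int dl_admit(uint64_t new_runtime, uint64_t new_period,
	     double reserved_bw, int nr_cpus)
{
	double new_bw = (double)new_runtime / (double)new_period;

	return reserved_bw + new_bw <= (double)nr_cpus;
}

For example, with 4 CPUs and a total of 3.8 CPUs already reserved, a
new 30ms/100ms reservation (bandwidth 0.3) would be rejected.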
Luca
> but I
> think we can at least have a look at developing this kind of analysis.
> Giuseppe, what do you think?
On Mon, Oct 17, 2016 at 08:38:57AM +0200, luca abeni wrote:
> > Yes, there is currently no schedulability analysis for
> > multi-processor EDF with random affinities (as far as I know)
> Correction: it looks like I was wrong, and the schedulability of
> multi-processor EDF with arbitrary affinities has already been analysed
> in
> A. Gujarati, F. Cerqueira, and B. Brandenburg, “Multiprocessor
> Real-Time Scheduling with Arbitrary Processor Affinities: From Practice
> to Theory”, Real-Time Systems, Volume 51, Issue 4, pp. 440–483,
> Springer Verlag, 2015
> (see https://www.mpi-sws.org/~bbb/papers/).
>
> Thanks to Giuseppe Lipari for pointing me to this paper.
Ooh, shiny. Let me go read that.
> So, having DL tasks with arbitrary affinities is not a big problem from
> the theoretical point of view... The only issue is that the
> utilisation-based admission test that is currently implemented in the
> kernel does not work (and given the complexity of the analysis I think
> it is better not to perform it in the kernel :)
So the problem with doing admission control out of the kernel is that
then all the guarantees and resource control the kernel should provide
are out the window, and we're back to all the reasons why SCHED_FIFO is
a horrible horrible scheduling policy.
On Sun, Oct 16, 2016 at 09:34:20PM +0200, Luca Abeni wrote:
> Hi Peter,
>
> first of all, sorry for the delay in my answer
Not a problem, we're all busy. As it happens I had this thing in Berlin
which delayed me reading email in any case ;-)
> > However, I think there's a third alternative. I have memories of a
> > paper from UNC (I'd have to dig through the site to see if I can
> > still find it) where they argue that for a hierarchical (G-)FIFO you
> > should use minimal concurrency, that is run the minimal number of
> > (v)cpu servers.
> Ok, I need to find and read that paper
I had a real quick look and it could be this one, but memories are
vague...
https://cs.unc.edu/~anderson/papers/ecrts08c.pdf
In particular 3.P.
Hi,
On 17/10/16 10:25, Peter Zijlstra wrote:
> On Mon, Oct 17, 2016 at 08:38:57AM +0200, luca abeni wrote:
>
> > > Yes, there is currently no schedulability analysis for
> > > multi-processor EDF with random affinities (as far as I know)
> > Correction: it looks like I was wrong, and the schedulability of
> > multi-processor EDF with arbitrary affinities has already been analysed
> > in
> > A. Gujarati, F. Cerqueira, and B. Brandenburg, “Multiprocessor
> > Real-Time Scheduling with Arbitrary Processor Affinities: From Practice
> > to Theory”, Real-Time Systems, Volume 51, Issue 4, pp. 440–483,
> > Springer Verlag, 2015
> > (see https://www.mpi-sws.org/~bbb/papers/).
> >
> > Thanks to Giuseppe Lipari for pointing me to this paper.
>
> Ooh, shiny. Let me go read that.
>
Yeah, I should read that again (last time I remember headaches :).
> > So, having DL tasks with arbitrary affinities is not a big problem from
> > the theoretical point of view... The only issue is that the
> > utilisation-based admission test that is currently implemented in the
> > kernel does not work (and given the complexity of the analysis I think
> > it is better not to perform it in the kernel :)
>
> So the problem with doing admission control out of the kernel is that
> then all the guarantees and resource control the kernel should provide
> are out the window, and we're back to all the reasons why SCHED_FIFO is
> a horrible horrible scheduling policy.
>
I agree that we must have admission control in the kernel, but I don't
see why disabling such AC and letting userspace perform more complex
calculations would make scheduler policies bogus. What we have today
(modulo fixes and possible small improvements) seems a sane and simple
default admission test to me. If userspace (and here I think of some
middleware controller type of thing more than a simple user) has the
knowledge (and root privileges) and needs to perform more complex tests,
it can disable the kernel AC, but still rely on the resource control
mechanisms provided by the kernel to get guarantees (something that is
still not achievable with FIFO).
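For instance, such a controller could disable the kernel AC by writing
-1 to /proc/sys/kernel/sched_rt_runtime_us, run whatever offline
analysis it wants, and still create kernel-enforced reservations with
sched_setattr(). A minimal sketch (error handling omitted; glibc has
no wrapper for this syscall, so the attr struct is defined locally):

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* all times in nanoseconds */
	uint64_t sched_deadline;
	uint64_t sched_period;
};

#define SCHED_DEADLINE	6

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy   = SCHED_DEADLINE;
	attr.sched_runtime  = 10 * 1000 * 1000;		/* 10ms budget...  */
	attr.sched_period   = 100 * 1000 * 1000;	/* ...every 100ms  */
	attr.sched_deadline = attr.sched_period;

	/* the (userspace) admission test already happened; the kernel
	 * still enforces the budget at runtime */
	return syscall(SYS_sched_setattr, 0, &attr, 0);
}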
Best,
- Juri