From: Dario Faggioli <[email protected]>
Add in Documentation/scheduler/ some hints about the design
choices, the usage and the future possible developments of the
sched_dl scheduling class and of the SCHED_DEADLINE policy.
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Dario Faggioli <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
Documentation/scheduler/00-INDEX | 2 +
Documentation/scheduler/sched-deadline.txt | 189 ++++++++++++++++++++++++++++
kernel/sched/deadline.c | 3 +-
3 files changed, 193 insertions(+), 1 deletion(-)
create mode 100644 Documentation/scheduler/sched-deadline.txt
diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX
index d2651c4..46702e4 100644
--- a/Documentation/scheduler/00-INDEX
+++ b/Documentation/scheduler/00-INDEX
@@ -10,5 +10,7 @@ sched-nice-design.txt
- How and why the scheduler's nice levels are implemented.
sched-rt-group.txt
- real-time group scheduling.
+sched-deadline.txt
+ - deadline scheduling.
sched-stats.txt
- information on schedstats (Linux Scheduler Statistics).
diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
new file mode 100644
index 0000000..8980de1
--- /dev/null
+++ b/Documentation/scheduler/sched-deadline.txt
@@ -0,0 +1,189 @@
+ Deadline Task Scheduling
+ ------------------------
+
+CONTENTS
+========
+
+ 0. WARNING
+ 1. Overview
+ 2. Task scheduling
+ 2. The Interface
+ 3. Bandwidth management
+ 3.1 System-wide settings
+ 3.2 Task interface
+ 3.4 Default behavior
+ 4. Tasks CPU affinity
+ 4.1 SCHED_DEADLINE and cpusets HOWTO
+ 5. Future plans
+
+
+0. WARNING
+==========
+
+ Fiddling with these settings can result in an unpredictable or even unstable
+ system behavior. As for -rt (group) scheduling, it is assumed that root users
+ know what they're doing.
+
+
+1. Overview
+===========
+
+ The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
+ basically an implementation of the Earliest Deadline First (EDF) scheduling
+ algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
+ that makes it possible to isolate the behavior of tasks between each other.
+
+
+2. Task scheduling
+==================
+
+ The typical -deadline task is composed of a computation phase (instance)
+ which is activated on a periodic or sporadic fashion. The expected (maximum)
+ duration of such computation is called the task's runtime; the time interval
+ by which each instance needs to be completed is called the task's relative
+ deadline. The task's absolute deadline is dynamically calculated as the
+ time instant a task (or, more properly) activates plus the relative
+ deadline.
+
+ The EDF[1] algorithm selects the task with the smallest absolute deadline as
+ the one to be executed first, while the CBS[2,3] ensures that each task runs
+ for at most its runtime every period, avoiding any interference between
+ different tasks (bandwidth isolation).
+ Thanks to this feature, also tasks that do not strictly comply with the
+ computational model described above can effectively use the new policy.
+ IOW, there are no limitations on what kind of task can exploit this new
+ scheduling discipline, even if it must be said that it is particularly
+ suited for periodic or sporadic tasks that need guarantees on their
+ timing behavior, e.g., multimedia, streaming, control applications, etc.
+
+ References:
+ 1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
+ ming in a hard-real-time environment. Journal of the Association for
+ Computing Machinery, 20(1), 1973.
+ 2 - L. Abeni , G. Buttazzo. Integrating Multimedia Applications in Hard
+ Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
+ Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
+ 3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
+ Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps
+
+3. Bandwidth management
+=======================
+
+ In order for the -deadline scheduling to be effective and useful, it is
+ important to have some method to keep the allocation of the available CPU
+ bandwidth to the tasks under control.
+ This is usually called "admission control" and if it is not performed at all,
+ no guarantee can be given on the actual scheduling of the -deadline tasks.
+
+ Since when RT-throttling has been introduced each task group has a bandwidth
+ associated, calculated as a certain amount of runtime over a period.
+ Moreover, to make it possible to manipulate such bandwidth, readable/writable
+ controls have been added to both procfs (for system wide settings) and cgroupfs
+ (for per-group settings).
+ Therefore, the same interface is being used for controlling the bandwidth
+ distrubution to -deadline tasks.
+
+ However, more discussion is needed in order to figure out how we want to manage
+ SCHED_DEADLINE bandwidth at the task group level. Therefore, SCHED_DEADLINE
+ uses (for now) a less sophisticated, but actually very sensible, mechanism to
+ ensure that a certain utilization cap is not overcome per each root_domain.
+
+ Another main difference between deadline bandwidth management and RT-throttling
+ is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
+ and thus we don't need an higher level throttling mechanism to enforce the
+ desired bandwidth.
+
+3.1 System wide settings
+------------------------
+
+ The system wide settings are configured under the /proc virtual file system.
+
+ For now the -rt knobs are used for dl admission control and the -deadline
+ runtime is accounted against the -rt runtime. We realise that this isn't
+ entirely desirable; however, it is better to have a small interface for now,
+ and be able to change it easily later. The ideal situation (see 5.) is to run
+ -rt tasks from a -deadline server; in which case the -rt bandwidth is a direct
+ subset of dl_bw.
+
+ This means that, for a root_domain comprising M CPUs, -deadline tasks
+ can be created while the sum of their bandwidths stays below:
+
+ M * (sched_rt_runtime_us / sched_rt_period_us)
+
+ It is also possible to disable this bandwidth management logic, and
+ be thus free of oversubscribing the system up to any arbitrary level.
+ This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.
+
+
+3.2 Task interface
+------------------
+
+ Specifying a periodic/sporadic task that executes for a given amount of
+ runtime at each instance, and that is scheduled according to the urgency of
+ its own timing constraints needs, in general, a way of declaring:
+ - a (maximum/typical) instance execution time,
+ - a minimum interval between consecutive instances,
+ - a time constraint by which each instance must be completed.
+
+ Therefore:
+ * a new struct sched_attr, containing all the necessary fields is
+ provided;
+ * the new scheduling related syscalls that manipulate it, i.e.,
+ sched_setattr() and sched_getattr() are implemented.
+
+
+3.3 Default behavior
+---------------------
+
+ The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to
+ 95000. With rt_period equal to 1000000, by default, it means that -deadline
+ tasks can use at most 95%, multiplied by the number of CPUs that compose the
+ root_domain, for each root_domain.
+
+ A -deadline task cannot fork.
+
+4. Tasks CPU affinity
+=====================
+
+ -deadline tasks cannot have an affinity mask smaller that the entire
+ root_domain they are created on. However, affinities can be specified
+ through the cpuset facility (Documentation/cgroups/cpusets.txt).
+
+4.1 SCHED_DEADLINE and cpusets HOWTO
+------------------------------------
+
+ An example of a simple configuration (pin a -deadline task to CPU0)
+ follows (rt-app is used to create a -deadline task).
+
+ mkdir /dev/cpuset
+ mount -t cgroup -o cpuset cpuset /dev/cpuset
+ cd /dev/cpuset
+ mkdir cpu0
+ echo 0 > cpu0/cpuset.cpus
+ echo 0 > cpu0/cpuset.mems
+ echo 1 > cpuset.cpu_exclusive
+ echo 0 > cpuset.sched_load_balance
+ echo 1 > cpu0/cpuset.cpu_exclusive
+ echo 1 > cpu0/cpuset.mem_exclusive
+ echo $$ > cpu0/tasks
+ rt-app -t 100000:10000:d:0 -D5 (it is now actually superfluous to specify
+ task affinity)
+
+5. Future plans
+===============
+
+ Still missing:
+
+ - refinements to deadline inheritance, especially regarding the possibility
+ of retaining bandwidth isolation among non-interacting tasks. This is
+ being studied from both theoretical and practical points of view, and
+ hopefully we should be able to produce some demonstrative code soon;
+ - (c)group based bandwidth management, and maybe scheduling;
+ - access control for non-root users (and related security concerns to
+ address), which is the best way to allow unprivileged use of the mechanisms
+ and how to prevent non-root users "cheat" the system?
+
+ As already discussed, we are planning also to merge this work with the EDF
+ throttling patches [https://lkml.org/lkml/2010/2/23/239] but we still are in
+ the preliminary phases of the merge and we really seek feedback that would
+ help us decide on the direction it should take.
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0de2482..0dd5e09 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -351,7 +351,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
* disrupting the schedulability of the system. Otherwise, we should
* refill the runtime and set the deadline a period in the future,
* because keeping the current (absolute) deadline of the task would
- * result in breaking guarantees promised to other tasks.
+ * result in breaking guarantees promised to other tasks (refer to
+ * Documentation/scheduler/sched-deadline.txt for more informations).
*
* This function returns true if:
*
--
1.7.9.5
On Mon, Jan 20, 2014 at 11:40:40AM +0100, Juri Lelli wrote:
> From: Dario Faggioli <[email protected]>
>
> Add in Documentation/scheduler/ some hints about the design
> choices, the usage and the future possible developments of the
> sched_dl scheduling class and of the SCHED_DEADLINE policy.
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Dario Faggioli <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> Documentation/scheduler/00-INDEX | 2 +
> Documentation/scheduler/sched-deadline.txt | 189 ++++++++++++++++++++++++++++
> kernel/sched/deadline.c | 3 +-
> 3 files changed, 193 insertions(+), 1 deletion(-)
> create mode 100644 Documentation/scheduler/sched-deadline.txt
>
> diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX
> index d2651c4..46702e4 100644
> --- a/Documentation/scheduler/00-INDEX
> +++ b/Documentation/scheduler/00-INDEX
> @@ -10,5 +10,7 @@ sched-nice-design.txt
> - How and why the scheduler's nice levels are implemented.
> sched-rt-group.txt
> - real-time group scheduling.
> +sched-deadline.txt
> + - deadline scheduling.
> sched-stats.txt
> - information on schedstats (Linux Scheduler Statistics).
> diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
> new file mode 100644
> index 0000000..8980de1
> --- /dev/null
> +++ b/Documentation/scheduler/sched-deadline.txt
> @@ -0,0 +1,189 @@
> + Deadline Task Scheduling
> + ------------------------
> +
> +CONTENTS
> +========
> +
> + 0. WARNING
> + 1. Overview
> + 2. Task scheduling
> + 2. The Interface
> + 3. Bandwidth management
> + 3.1 System-wide settings
> + 3.2 Task interface
> + 3.4 Default behavior
> + 4. Tasks CPU affinity
> + 4.1 SCHED_DEADLINE and cpusets HOWTO
> + 5. Future plans
> +
> +
> +0. WARNING
> +==========
> +
> + Fiddling with these settings can result in an unpredictable or even unstable
> + system behavior. As for -rt (group) scheduling, it is assumed that root users
> + know what they're doing.
> +
> +
> +1. Overview
> +===========
> +
> + The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
> + basically an implementation of the Earliest Deadline First (EDF) scheduling
> + algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
> + that makes it possible to isolate the behavior of tasks between each other.
Why not something along the lines of giving a task a guaranteed slice of
the CPU as well as making sure that a task takes no more than a given
slice? I.e. making the point of a lower as well as an upper limit of CPU
usage.
> +2. Task scheduling
> +==================
> +
> + The typical -deadline task is composed of a computation phase (instance)
> + which is activated on a periodic or sporadic fashion. The expected (maximum)
> + duration of such computation is called the task's runtime; the time interval
> + by which each instance needs to be completed is called the task's relative
> + deadline. The task's absolute deadline is dynamically calculated as the
> + time instant a task (or, more properly) activates plus the relative
> + deadline.
activates - released?
Since real-time papers from different rt-campus around the academia insist
on using *slightly* different terminology, perhaps add a short dictionary
for some of the more common terms?
D: relative deadline, typically N ms after release
d: absolute deadline, the physical time when a given instance of a job
needs to be completed
R: relative release time, for periodic tasks, this is typically 'every N
ms'
r: absolute release time
C: Worst-case execution time
...you get the idea.
Perhaps too academic?
> + The EDF[1] algorithm selects the task with the smallest absolute deadline as
> + the one to be executed first, while the CBS[2,3] ensures that each task runs
> + for at most its runtime every period, avoiding any interference between
> + different tasks (bandwidth isolation).
> + Thanks to this feature, also tasks that do not strictly comply with the
> + computational model described above can effectively use the new policy.
> + IOW, there are no limitations on what kind of task can exploit this new
> + scheduling discipline, even if it must be said that it is particularly
> + suited for periodic or sporadic tasks that need guarantees on their
> + timing behavior, e.g., multimedia, streaming, control applications, etc.
I assume that ties are broken arbitrarily and that a running task is not
preempted for another task with equal deadline. Correct?
This would be a nice point to include in this doc methinks.
> + References:
> + 1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
> + ming in a hard-real-time environment. Journal of the Association for
> + Computing Machinery, 20(1), 1973.
> + 2 - L. Abeni , G. Buttazzo. Integrating Multimedia Applications in Hard
> + Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
> + Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
> + 3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
> + Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps
> +
> +3. Bandwidth management
> +=======================
> +
> + In order for the -deadline scheduling to be effective and useful, it is
> + important to have some method to keep the allocation of the available CPU
> + bandwidth to the tasks under control.
> + This is usually called "admission control" and if it is not performed at all,
> + no guarantee can be given on the actual scheduling of the -deadline tasks.
> +
> + Since when RT-throttling has been introduced each task group has a bandwidth
> + associated, calculated as a certain amount of runtime over a period.
> + Moreover, to make it possible to manipulate such bandwidth, readable/writable
> + controls have been added to both procfs (for system wide settings) and cgroupfs
> + (for per-group settings).
> + Therefore, the same interface is being used for controlling the bandwidth
> + distrubution to -deadline tasks.
> +
> + However, more discussion is needed in order to figure out how we want to manage
> + SCHED_DEADLINE bandwidth at the task group level. Therefore, SCHED_DEADLINE
> + uses (for now) a less sophisticated, but actually very sensible, mechanism to
> + ensure that a certain utilization cap is not overcome per each root_domain.
> +
> + Another main difference between deadline bandwidth management and RT-throttling
> + is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
> + and thus we don't need an higher level throttling mechanism to enforce the
> + desired bandwidth.
> +
> +3.1 System wide settings
> +------------------------
> +
> + The system wide settings are configured under the /proc virtual file system.
> +
> + For now the -rt knobs are used for dl admission control and the -deadline
> + runtime is accounted against the -rt runtime. We realise that this isn't
> + entirely desirable; however, it is better to have a small interface for now,
> + and be able to change it easily later. The ideal situation (see 5.) is to run
> + -rt tasks from a -deadline server; in which case the -rt bandwidth is a direct
> + subset of dl_bw.
> +
> + This means that, for a root_domain comprising M CPUs, -deadline tasks
> + can be created while the sum of their bandwidths stays below:
> +
> + M * (sched_rt_runtime_us / sched_rt_period_us)
> +
> + It is also possible to disable this bandwidth management logic, and
> + be thus free of oversubscribing the system up to any arbitrary level.
> + This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.
> +
> +
> +3.2 Task interface
> +------------------
> +
> + Specifying a periodic/sporadic task that executes for a given amount of
> + runtime at each instance, and that is scheduled according to the urgency of
> + its own timing constraints needs, in general, a way of declaring:
> + - a (maximum/typical) instance execution time,
> + - a minimum interval between consecutive instances,
> + - a time constraint by which each instance must be completed.
> +
> + Therefore:
> + * a new struct sched_attr, containing all the necessary fields is
> + provided;
> + * the new scheduling related syscalls that manipulate it, i.e.,
> + sched_setattr() and sched_getattr() are implemented.
> +
> +
> +3.3 Default behavior
> +---------------------
> +
> + The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to
> + 95000. With rt_period equal to 1000000, by default, it means that -deadline
^^^^
This seems to be 9.5% to me ;)
> + tasks can use at most 95%, multiplied by the number of CPUs that compose the
> + root_domain, for each root_domain.
> +
> + A -deadline task cannot fork.
> +
> +4. Tasks CPU affinity
> +=====================
> +
> + -deadline tasks cannot have an affinity mask smaller that the entire
> + root_domain they are created on. However, affinities can be specified
> + through the cpuset facility (Documentation/cgroups/cpusets.txt).
Does this mean that sched_deadline is a somewhat global implementation? Or
rather, at what point in time will sched_deadline take all cpus in a set
into consideration and when will it only look at the current CPU? Where is
the line drawn between global and fully partitioned?
Also, how do you account the budget when a resource holder is boosted in
order to release a resource? (IIRC, you use BWI, right?)
> +4.1 SCHED_DEADLINE and cpusets HOWTO
> +------------------------------------
> +
> + An example of a simple configuration (pin a -deadline task to CPU0)
> + follows (rt-app is used to create a -deadline task).
> +
> + mkdir /dev/cpuset
> + mount -t cgroup -o cpuset cpuset /dev/cpuset
> + cd /dev/cpuset
> + mkdir cpu0
> + echo 0 > cpu0/cpuset.cpus
> + echo 0 > cpu0/cpuset.mems
> + echo 1 > cpuset.cpu_exclusive
> + echo 0 > cpuset.sched_load_balance
> + echo 1 > cpu0/cpuset.cpu_exclusive
> + echo 1 > cpu0/cpuset.mem_exclusive
> + echo $$ > cpu0/tasks
> + rt-app -t 100000:10000:d:0 -D5 (it is now actually superfluous to specify
> + task affinity)
> +
> +5. Future plans
> +===============
> +
> + Still missing:
> +
> + - refinements to deadline inheritance, especially regarding the possibility
> + of retaining bandwidth isolation among non-interacting tasks. This is
> + being studied from both theoretical and practical points of view, and
> + hopefully we should be able to produce some demonstrative code soon;
> + - (c)group based bandwidth management, and maybe scheduling;
> + - access control for non-root users (and related security concerns to
> + address), which is the best way to allow unprivileged use of the mechanisms
> + and how to prevent non-root users "cheat" the system?
> +
> + As already discussed, we are planning also to merge this work with the EDF
> + throttling patches [https://lkml.org/lkml/2010/2/23/239] but we still are in
> + the preliminary phases of the merge and we really seek feedback that would
> + help us decide on the direction it should take.
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 0de2482..0dd5e09 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -351,7 +351,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
> * disrupting the schedulability of the system. Otherwise, we should
> * refill the runtime and set the deadline a period in the future,
> * because keeping the current (absolute) deadline of the task would
> - * result in breaking guarantees promised to other tasks.
> + * result in breaking guarantees promised to other tasks (refer to
> + * Documentation/scheduler/sched-deadline.txt for more informations).
> *
> * This function returns true if:
> *
> --
> 1.7.9.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
Henrik Austad
On Mon, Jan 20, 2014 at 12:24:42PM +0100, Henrik Austad wrote:
> > +2. Task scheduling
> > +==================
> > +
> > + The typical -deadline task is composed of a computation phase (instance)
> > + which is activated on a periodic or sporadic fashion. The expected (maximum)
> > + duration of such computation is called the task's runtime; the time interval
> > + by which each instance needs to be completed is called the task's relative
> > + deadline. The task's absolute deadline is dynamically calculated as the
> > + time instant a task (or, more properly) activates plus the relative
> > + deadline.
>
> activates - released?
>
> Since real-time papers from different rt-campus around the academia insist
> on using *slightly* different terminology, perhaps add a short dictionary
> for some of the more common terms?
Oh gawd, they really don't conform with their definitions? I'd not
noticed that.
> D: relative deadline, typically N ms after release
You failed to define release :-) Its the 'wakeup' event, right? Where
the activation would be the moment we actually schedule the
job/instance?
> d: absolute deadline, the physical time when a given instance of a job
> needs to be completed
> R: relative release time, for periodic tasks, this is typically 'every N
> ms'
> r: absolute release time
> C: Worst-case execution time
>
> ...you get the idea.
>
> Perhaps too academic?
I think not, one can never be too clear about these things.
> > +4. Tasks CPU affinity
> > +=====================
> > +
> > + -deadline tasks cannot have an affinity mask smaller that the entire
> > + root_domain they are created on. However, affinities can be specified
> > + through the cpuset facility (Documentation/cgroups/cpusets.txt).
>
> Does this mean that sched_deadline is a somewhat global implementation?
Yes, its a GEDF like thing.
> Or
> rather, at what point in time will sched_deadline take all cpus in a set
> into consideration and when will it only look at the current CPU? Where is
> the line drawn between global and fully partitioned?
Its drawn >< that close to global.
So I think adding a SCHED_FLAG_DL_HARD option which would reduce to
strict per-cpu affinity and deliver 0 tardiness is a future work.
Its slightly complicated in that you cannot share the DL tree between
the GEDF and EDF jobs, because while a GEDF job might have an earlier
deadline an EDF job might have a lesser laxity. Not running the EDF job
in that case would result in a deadline miss (although, assuming we'd
still have function GEDF admission control, still have bounded
tardiness).
I'm not entirely sure we want to do anything in between the fully global
and per-cpu 'hard' mode -- random affinity masks seems like a terribly
hard problem.
NOTE: the 'global' nature is per root_domain, so cpusets can be used to
carve the thing into smaller balance sets.
> Also, how do you account the budget when a resource holder is boosted in
> order to release a resource? (IIRC, you use BWI, right?)
Boosting is still work in progress, but yes, it does a BWI like thing.
On 01/20/2014 12:24 PM, Henrik Austad wrote:
> On Mon, Jan 20, 2014 at 11:40:40AM +0100, Juri Lelli wrote:
>> From: Dario Faggioli <[email protected]>
>>
>> Add in Documentation/scheduler/ some hints about the design
>> choices, the usage and the future possible developments of the
>> sched_dl scheduling class and of the SCHED_DEADLINE policy.
>>
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: [email protected]
>> Signed-off-by: Dario Faggioli <[email protected]>
>> Signed-off-by: Juri Lelli <[email protected]>
>> Signed-off-by: Peter Zijlstra <[email protected]>
>> ---
>> Documentation/scheduler/00-INDEX | 2 +
>> Documentation/scheduler/sched-deadline.txt | 189 ++++++++++++++++++++++++++++
>> kernel/sched/deadline.c | 3 +-
>> 3 files changed, 193 insertions(+), 1 deletion(-)
>> create mode 100644 Documentation/scheduler/sched-deadline.txt
>>
>> diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX
>> index d2651c4..46702e4 100644
>> --- a/Documentation/scheduler/00-INDEX
>> +++ b/Documentation/scheduler/00-INDEX
>> @@ -10,5 +10,7 @@ sched-nice-design.txt
>> - How and why the scheduler's nice levels are implemented.
>> sched-rt-group.txt
>> - real-time group scheduling.
>> +sched-deadline.txt
>> + - deadline scheduling.
>> sched-stats.txt
>> - information on schedstats (Linux Scheduler Statistics).
>> diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
>> new file mode 100644
>> index 0000000..8980de1
>> --- /dev/null
>> +++ b/Documentation/scheduler/sched-deadline.txt
>> @@ -0,0 +1,189 @@
>> + Deadline Task Scheduling
>> + ------------------------
>> +
>> +CONTENTS
>> +========
>> +
>> + 0. WARNING
>> + 1. Overview
>> + 2. Task scheduling
>> + 2. The Interface
>> + 3. Bandwidth management
>> + 3.1 System-wide settings
>> + 3.2 Task interface
>> + 3.4 Default behavior
>> + 4. Tasks CPU affinity
>> + 4.1 SCHED_DEADLINE and cpusets HOWTO
>> + 5. Future plans
>> +
>> +
>> +0. WARNING
>> +==========
>> +
>> + Fiddling with these settings can result in an unpredictable or even unstable
>> + system behavior. As for -rt (group) scheduling, it is assumed that root users
>> + know what they're doing.
>> +
>> +
>> +1. Overview
>> +===========
>> +
>> + The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
>> + basically an implementation of the Earliest Deadline First (EDF) scheduling
>> + algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
>> + that makes it possible to isolate the behavior of tasks between each other.
>
>
> Why not something along the lines of giving a task a guaranteed slice of
> the CPU as well as making sure that a task takes no more than a given
> slice? I.e. making the point of a lower as well as an upper limit of CPU
> usage.
>
I'd keep the term "isolate" in, as is one of the strong points on having all
this merged in. But, we could add something along the lines you suggested:
"that makes it possible to isolate the behavior of tasks between each other.
IOW, isolation means that we can reserve a task a guaranteed percentage of the
CPU and, at the same time, we ensure that the task takes no more than the
percentage reserved."
>> +2. Task scheduling
>> +==================
>> +
>> + The typical -deadline task is composed of a computation phase (instance)
>> + which is activated on a periodic or sporadic fashion. The expected (maximum)
>> + duration of such computation is called the task's runtime; the time interval
>> + by which each instance needs to be completed is called the task's relative
>> + deadline. The task's absolute deadline is dynamically calculated as the
>> + time instant a task (or, more properly) activates plus the relative
>> + deadline.
>
> activates - released?
>
I'd keep (modifying a bit):
"time instant a task activates plus the relative deadline."
This is probably the nearest thing to what is implemented that we can say
(without entering into the theory too much), a task that "activates" can mean
that it is first released, enqueued, woken-up, etc.
> Since real-time papers from different rt-campus around the academia insist
> on using *slightly* different terminology, perhaps add a short dictionary
> for some of the more common terms?
>
> D: relative deadline, typically N ms after release
> d: absolute deadline, the physical time when a given instance of a job
> needs to be completed
> R: relative release time, for periodic tasks, this is typically 'every N
> ms'
> r: absolute release time
> C: Worst-case execution time
>
> ...you get the idea.
>
> Perhaps too academic?
>
Mmm.. we don't go too deep in theory (we just refer to papers below), could
adding a dictionary only complicate things? I mean, if you add a term you have
to explain its meaning related to the task-model you are using. And this means
you have to also define the task model and so on. Who wants more details
already finds them in the papers below.
>> + The EDF[1] algorithm selects the task with the smallest absolute deadline as
>> + the one to be executed first, while the CBS[2,3] ensures that each task runs
>> + for at most its runtime every period, avoiding any interference between
>> + different tasks (bandwidth isolation).
>> + Thanks to this feature, also tasks that do not strictly comply with the
>> + computational model described above can effectively use the new policy.
>> + IOW, there are no limitations on what kind of task can exploit this new
>> + scheduling discipline, even if it must be said that it is particularly
>> + suited for periodic or sporadic tasks that need guarantees on their
>> + timing behavior, e.g., multimedia, streaming, control applications, etc.
>
> I assume that ties are broken arbitrarily and that a running task is not
> preempted for another task with equal deadline. Correct?
>
Yes.
> This would be a nice point to include in this doc methinks.
>
>> + References:
>> + 1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
>> + ming in a hard-real-time environment. Journal of the Association for
>> + Computing Machinery, 20(1), 1973.
>> + 2 - L. Abeni , G. Buttazzo. Integrating Multimedia Applications in Hard
>> + Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
>> + Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
>> + 3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
>> + Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps
>> +
>> +3. Bandwidth management
>> +=======================
>> +
>> + In order for the -deadline scheduling to be effective and useful, it is
>> + important to have some method to keep the allocation of the available CPU
>> + bandwidth to the tasks under control.
>> + This is usually called "admission control" and if it is not performed at all,
>> + no guarantee can be given on the actual scheduling of the -deadline tasks.
>> +
>> + Since when RT-throttling has been introduced each task group has a bandwidth
>> + associated, calculated as a certain amount of runtime over a period.
>> + Moreover, to make it possible to manipulate such bandwidth, readable/writable
>> + controls have been added to both procfs (for system wide settings) and cgroupfs
>> + (for per-group settings).
>> + Therefore, the same interface is being used for controlling the bandwidth
>> + distrubution to -deadline tasks.
>> +
>> + However, more discussion is needed in order to figure out how we want to manage
>> + SCHED_DEADLINE bandwidth at the task group level. Therefore, SCHED_DEADLINE
>> + uses (for now) a less sophisticated, but actually very sensible, mechanism to
>> + ensure that a certain utilization cap is not overcome per each root_domain.
>> +
>> + Another main difference between deadline bandwidth management and RT-throttling
>> + is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
>> + and thus we don't need an higher level throttling mechanism to enforce the
>> + desired bandwidth.
>> +
>> +3.1 System wide settings
>> +------------------------
>> +
>> + The system wide settings are configured under the /proc virtual file system.
>> +
>> + For now the -rt knobs are used for dl admission control and the -deadline
>> + runtime is accounted against the -rt runtime. We realise that this isn't
>> + entirely desirable; however, it is better to have a small interface for now,
>> + and be able to change it easily later. The ideal situation (see 5.) is to run
>> + -rt tasks from a -deadline server; in which case the -rt bandwidth is a direct
>> + subset of dl_bw.
>> +
>> + This means that, for a root_domain comprising M CPUs, -deadline tasks
>> + can be created while the sum of their bandwidths stays below:
>> +
>> + M * (sched_rt_runtime_us / sched_rt_period_us)
>> +
>> + It is also possible to disable this bandwidth management logic, and
>> + be thus free of oversubscribing the system up to any arbitrary level.
>> + This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.
>> +
>> +
>> +3.2 Task interface
>> +------------------
>> +
>> + Specifying a periodic/sporadic task that executes for a given amount of
>> + runtime at each instance, and that is scheduled according to the urgency of
>> + its own timing constraints needs, in general, a way of declaring:
>> + - a (maximum/typical) instance execution time,
>> + - a minimum interval between consecutive instances,
>> + - a time constraint by which each instance must be completed.
>> +
>> + Therefore:
>> + * a new struct sched_attr, containing all the necessary fields is
>> + provided;
>> + * the new scheduling related syscalls that manipulate it, i.e.,
>> + sched_setattr() and sched_getattr() are implemented.
>> +
>> +
>> +3.3 Default behavior
>> +---------------------
>> +
>> + The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to
>> + 95000. With rt_period equal to 1000000, by default, it means that -deadline
> ^^^^
> This seems to be 9.5% to me ;)
>
Argh! s/95000/950000/
>> + tasks can use at most 95%, multiplied by the number of CPUs that compose the
>> + root_domain, for each root_domain.
>> +
>> + A -deadline task cannot fork.
>> +
>> +4. Tasks CPU affinity
>> +=====================
>> +
>> + -deadline tasks cannot have an affinity mask smaller that the entire
>> + root_domain they are created on. However, affinities can be specified
>> + through the cpuset facility (Documentation/cgroups/cpusets.txt).
>
> Does this mean that sched_deadline is a somewhat global implementation? Or
> rather, at what point in time will sched_deadline take all cpus in a set
> into consideration and when will it only look at the current CPU? Where is
> the line drawn between global and fully partitioned?
>
> Also, how do you account the budget when a resource holder is boosted in
> order to release a resource? (IIRC, you use BWI, right?)
>
Peter already replied about this.
Thanks,
- Juri
>> +4.1 SCHED_DEADLINE and cpusets HOWTO
>> +------------------------------------
>> +
>> + An example of a simple configuration (pin a -deadline task to CPU0)
>> + follows (rt-app is used to create a -deadline task).
>> +
>> + mkdir /dev/cpuset
>> + mount -t cgroup -o cpuset cpuset /dev/cpuset
>> + cd /dev/cpuset
>> + mkdir cpu0
>> + echo 0 > cpu0/cpuset.cpus
>> + echo 0 > cpu0/cpuset.mems
>> + echo 1 > cpuset.cpu_exclusive
>> + echo 0 > cpuset.sched_load_balance
>> + echo 1 > cpu0/cpuset.cpu_exclusive
>> + echo 1 > cpu0/cpuset.mem_exclusive
>> + echo $$ > cpu0/tasks
>> + rt-app -t 100000:10000:d:0 -D5 (it is now actually superfluous to specify
>> + task affinity)
>> +
>> +5. Future plans
>> +===============
>> +
>> + Still missing:
>> +
>> + - refinements to deadline inheritance, especially regarding the possibility
>> + of retaining bandwidth isolation among non-interacting tasks. This is
>> + being studied from both theoretical and practical points of view, and
>> + hopefully we should be able to produce some demonstrative code soon;
>> + - (c)group based bandwidth management, and maybe scheduling;
>> + - access control for non-root users (and related security concerns to
>> + address), which is the best way to allow unprivileged use of the mechanisms
>> + and how to prevent non-root users "cheat" the system?
>> +
>> + As already discussed, we are planning also to merge this work with the EDF
>> + throttling patches [https://lkml.org/lkml/2010/2/23/239] but we still are in
>> + the preliminary phases of the merge and we really seek feedback that would
>> + help us decide on the direction it should take.
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 0de2482..0dd5e09 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -351,7 +351,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
>> * disrupting the schedulability of the system. Otherwise, we should
>> * refill the runtime and set the deadline a period in the future,
>> * because keeping the current (absolute) deadline of the task would
>> - * result in breaking guarantees promised to other tasks.
>> + * result in breaking guarantees promised to other tasks (refer to
>> + * Documentation/scheduler/sched-deadline.txt for more informations).
>> *
>> * This function returns true if:
>> *
>> --
>> 1.7.9.5
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>
Hi Juri,
On 01/20/2014 11:40 AM, Juri Lelli wrote:
> From: Dario Faggioli <[email protected]>
>
> Add in Documentation/scheduler/ some hints about the design
> choices, the usage and the future possible developments of the
> sched_dl scheduling class and of the SCHED_DEADLINE policy.
[...]
> + References:
> + 1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
> + ming in a hard-real-time environment. Journal of the Association for
> + Computing Machinery, 20(1), 1973.
> + 2 - L. Abeni , G. Buttazzo. Integrating Multimedia Applications in Hard
> + Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
> + Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
> + 3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
> + Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps
I do not know how stable the "xoomer.virtgilio.it" URL is (and I lost the password
for this web site :).
I can host the technical report on unitn.it; I think it's more reliable. If you think
this is a good idea, I'll provide you with an updated URL (also, this is an old
document, written in "not so good" English... If there is interest, I can search for
the latex sources and update the document).
Luca
On Mon, Jan 20, 2014 at 01:15:51PM +0100, Juri Lelli wrote:
> On 01/20/2014 12:24 PM, Henrik Austad wrote:
> > On Mon, Jan 20, 2014 at 11:40:40AM +0100, Juri Lelli wrote:
> >> From: Dario Faggioli <[email protected]>
> >>
> >> Add in Documentation/scheduler/ some hints about the design
> >> choices, the usage and the future possible developments of the
> >> sched_dl scheduling class and of the SCHED_DEADLINE policy.
> >>
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Signed-off-by: Dario Faggioli <[email protected]>
> >> Signed-off-by: Juri Lelli <[email protected]>
> >> Signed-off-by: Peter Zijlstra <[email protected]>
> >> ---
> >> Documentation/scheduler/00-INDEX | 2 +
> >> Documentation/scheduler/sched-deadline.txt | 189 ++++++++++++++++++++++++++++
> >> kernel/sched/deadline.c | 3 +-
> >> 3 files changed, 193 insertions(+), 1 deletion(-)
> >> create mode 100644 Documentation/scheduler/sched-deadline.txt
> >>
> >> diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX
> >> index d2651c4..46702e4 100644
> >> --- a/Documentation/scheduler/00-INDEX
> >> +++ b/Documentation/scheduler/00-INDEX
> >> @@ -10,5 +10,7 @@ sched-nice-design.txt
> >> - How and why the scheduler's nice levels are implemented.
> >> sched-rt-group.txt
> >> - real-time group scheduling.
> >> +sched-deadline.txt
> >> + - deadline scheduling.
> >> sched-stats.txt
> >> - information on schedstats (Linux Scheduler Statistics).
> >> diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
> >> new file mode 100644
> >> index 0000000..8980de1
> >> --- /dev/null
> >> +++ b/Documentation/scheduler/sched-deadline.txt
> >> @@ -0,0 +1,189 @@
> >> + Deadline Task Scheduling
> >> + ------------------------
> >> +
> >> +CONTENTS
> >> +========
> >> +
> >> + 0. WARNING
> >> + 1. Overview
> >> + 2. Task scheduling
> >> + 2. The Interface
> >> + 3. Bandwidth management
> >> + 3.1 System-wide settings
> >> + 3.2 Task interface
> >> + 3.4 Default behavior
> >> + 4. Tasks CPU affinity
> >> + 4.1 SCHED_DEADLINE and cpusets HOWTO
> >> + 5. Future plans
> >> +
> >> +
> >> +0. WARNING
> >> +==========
> >> +
> >> + Fiddling with these settings can result in an unpredictable or even unstable
> >> + system behavior. As for -rt (group) scheduling, it is assumed that root users
> >> + know what they're doing.
> >> +
> >> +
> >> +1. Overview
> >> +===========
> >> +
> >> + The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
> >> + basically an implementation of the Earliest Deadline First (EDF) scheduling
> >> + algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
> >> + that makes it possible to isolate the behavior of tasks between each other.
> >
> >
> > Why not something along the lines of giving a task a guaranteed slice of
> > the CPU as well as making sure that a task takes no more than a given
> > slice? I.e. making the point of a lower as well as an upper limit of CPU
> > usage.
> >
>
> I'd keep the term "isolate" in, as is one of the strong points on having all
> this merged in. But, we could add something along the lines you suggested:
>
> "that makes it possible to isolate the behavior of tasks between each other.
> IOW, isolation means that we can reserve a task a guaranteed percentage of the
> CPU and, at the same time, we ensure that the task takes no more than the
> percentage reserved."
+1 from me!
> >> +2. Task scheduling
> >> +==================
> >> +
> >> + The typical -deadline task is composed of a computation phase (instance)
> >> + which is activated on a periodic or sporadic fashion. The expected (maximum)
> >> + duration of such computation is called the task's runtime; the time interval
> >> + by which each instance needs to be completed is called the task's relative
> >> + deadline. The task's absolute deadline is dynamically calculated as the
> >> + time instant a task (or, more properly) activates plus the relative
> >> + deadline.
> >
> > activates - released?
> >
>
> I'd keep (modifying a bit):
>
> "time instant a task activates plus the relative deadline."
>
> This is probably the nearest thing to what is implemented that we can say
> (without entering into the theory too much), a task that "activates" can mean
> that it is first released, enqueued, woken-up, etc.
So, if we look at release (yes, I'm avoiding activates for a little while)
as the time at the *beginning* of a new period, then, and only then should
the *absolute* deadline be computed.
If you den move on to use 'activate' as a term for when a task becomes
eligble to run, then 'release' becomes a subset of 'activate', and you
should only compute the absolute deadline at that time. Did that make
sense?
> > Since real-time papers from different rt-campus around the academia insist
> > on using *slightly* different terminology, perhaps add a short dictionary
> > for some of the more common terms?
> >
> > D: relative deadline, typically N ms after release
> > d: absolute deadline, the physical time when a given instance of a job
> > needs to be completed
> > R: relative release time, for periodic tasks, this is typically 'every N
> > ms'
> > r: absolute release time
> > C: Worst-case execution time
> >
> > ...you get the idea.
> >
> > Perhaps too academic?
> >
>
> Mmm.. we don't go too deep in theory (we just refer to papers below), could
> adding a dictionary only complicate things? I mean, if you add a term you have
> to explain its meaning related to the task-model you are using. And this means
> you have to also define the task model and so on. Who wants more details
> already finds them in the papers below.
Honestly, how many readers of sched_deadline.txt are going to read those
papers to have the terms defined?
Either way, having a *short* list of terms we use _in_ sched_deadline would
help avoiding confusion. For instance, what 'activate' actually means.
> >> + The EDF[1] algorithm selects the task with the smallest absolute deadline as
> >> + the one to be executed first, while the CBS[2,3] ensures that each task runs
> >> + for at most its runtime every period, avoiding any interference between
> >> + different tasks (bandwidth isolation).
> >> + Thanks to this feature, also tasks that do not strictly comply with the
> >> + computational model described above can effectively use the new policy.
> >> + IOW, there are no limitations on what kind of task can exploit this new
> >> + scheduling discipline, even if it must be said that it is particularly
> >> + suited for periodic or sporadic tasks that need guarantees on their
> >> + timing behavior, e.g., multimedia, streaming, control applications, etc.
> >
> > I assume that ties are broken arbitrarily and that a running task is not
> > preempted for another task with equal deadline. Correct?
> >
>
> Yes.
>
> > This would be a nice point to include in this doc methinks.
how about adding something like that?
"""
Whenever 2 (or more) tasks happen to have the exact same absolute deadine,
ties are broken arbitrarily. One notable exception occur when one of the
tasks are already running, in which case no preemption occurs.
"""
> >> + References:
> >> + 1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
> >> + ming in a hard-real-time environment. Journal of the Association for
> >> + Computing Machinery, 20(1), 1973.
> >> + 2 - L. Abeni , G. Buttazzo. Integrating Multimedia Applications in Hard
> >> + Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
> >> + Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
> >> + 3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
> >> + Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps
> >> +
> >> +3. Bandwidth management
> >> +=======================
> >> +
> >> + In order for the -deadline scheduling to be effective and useful, it is
> >> + important to have some method to keep the allocation of the available CPU
> >> + bandwidth to the tasks under control.
> >> + This is usually called "admission control" and if it is not performed at all,
> >> + no guarantee can be given on the actual scheduling of the -deadline tasks.
> >> +
> >> + Since when RT-throttling has been introduced each task group has a bandwidth
> >> + associated, calculated as a certain amount of runtime over a period.
> >> + Moreover, to make it possible to manipulate such bandwidth, readable/writable
> >> + controls have been added to both procfs (for system wide settings) and cgroupfs
> >> + (for per-group settings).
> >> + Therefore, the same interface is being used for controlling the bandwidth
> >> + distrubution to -deadline tasks.
> >> +
> >> + However, more discussion is needed in order to figure out how we want to manage
> >> + SCHED_DEADLINE bandwidth at the task group level. Therefore, SCHED_DEADLINE
> >> + uses (for now) a less sophisticated, but actually very sensible, mechanism to
> >> + ensure that a certain utilization cap is not overcome per each root_domain.
> >> +
> >> + Another main difference between deadline bandwidth management and RT-throttling
> >> + is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
> >> + and thus we don't need an higher level throttling mechanism to enforce the
> >> + desired bandwidth.
> >> +
> >> +3.1 System wide settings
> >> +------------------------
> >> +
> >> + The system wide settings are configured under the /proc virtual file system.
> >> +
> >> + For now the -rt knobs are used for dl admission control and the -deadline
> >> + runtime is accounted against the -rt runtime. We realise that this isn't
> >> + entirely desirable; however, it is better to have a small interface for now,
> >> + and be able to change it easily later. The ideal situation (see 5.) is to run
> >> + -rt tasks from a -deadline server; in which case the -rt bandwidth is a direct
> >> + subset of dl_bw.
> >> +
> >> + This means that, for a root_domain comprising M CPUs, -deadline tasks
> >> + can be created while the sum of their bandwidths stays below:
> >> +
> >> + M * (sched_rt_runtime_us / sched_rt_period_us)
> >> +
> >> + It is also possible to disable this bandwidth management logic, and
> >> + be thus free of oversubscribing the system up to any arbitrary level.
> >> + This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.
> >> +
> >> +
> >> +3.2 Task interface
> >> +------------------
> >> +
> >> + Specifying a periodic/sporadic task that executes for a given amount of
> >> + runtime at each instance, and that is scheduled according to the urgency of
> >> + its own timing constraints needs, in general, a way of declaring:
> >> + - a (maximum/typical) instance execution time,
> >> + - a minimum interval between consecutive instances,
> >> + - a time constraint by which each instance must be completed.
> >> +
> >> + Therefore:
> >> + * a new struct sched_attr, containing all the necessary fields is
> >> + provided;
> >> + * the new scheduling related syscalls that manipulate it, i.e.,
> >> + sched_setattr() and sched_getattr() are implemented.
> >> +
> >> +
> >> +3.3 Default behavior
> >> +---------------------
> >> +
> >> + The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to
> >> + 95000. With rt_period equal to 1000000, by default, it means that -deadline
> > ^^^^
> > This seems to be 9.5% to me ;)
> >
>
> Argh! s/95000/950000/
>
> >> + tasks can use at most 95%, multiplied by the number of CPUs that compose the
> >> + root_domain, for each root_domain.
> >> +
> >> + A -deadline task cannot fork.
> >> +
> >> +4. Tasks CPU affinity
> >> +=====================
> >> +
> >> + -deadline tasks cannot have an affinity mask smaller that the entire
> >> + root_domain they are created on. However, affinities can be specified
> >> + through the cpuset facility (Documentation/cgroups/cpusets.txt).
> >
> > Does this mean that sched_deadline is a somewhat global implementation? Or
> > rather, at what point in time will sched_deadline take all cpus in a set
> > into consideration and when will it only look at the current CPU? Where is
> > the line drawn between global and fully partitioned?
> >
> > Also, how do you account the budget when a resource holder is boosted in
> > order to release a resource? (IIRC, you use BWI, right?)
> >
>
> Peter already replied about this.
>
> Thanks,
>
> - Juri
Feel free to add a reviewed-by from me on this one :)
--
Henrik Austad
Hi all,
On 01/20/2014 02:16 PM, Henrik Austad wrote:
[...]
>>>> + The typical -deadline task is composed of a computation phase (instance)
>>>> + which is activated on a periodic or sporadic fashion. The expected (maximum)
>>>> + duration of such computation is called the task's runtime; the time interval
>>>> + by which each instance needs to be completed is called the task's relative
>>>> + deadline. The task's absolute deadline is dynamically calculated as the
>>>> + time instant a task (or, more properly) activates plus the relative
>>>> + deadline.
>>>
>>> activates - released?
>>>
>>
>> I'd keep (modifying a bit):
>>
>> "time instant a task activates plus the relative deadline."
>>
>> This is probably the nearest thing to what is implemented that we can say
>> (without entering into the theory too much), a task that "activates" can mean
>> that it is first released, enqueued, woken-up, etc.
>
> So, if we look at release (yes, I'm avoiding activates for a little while)
> as the time at the *beginning* of a new period, then, and only then should
> the *absolute* deadline be computed.
>
> If you den move on to use 'activate' as a term for when a task becomes
> eligble to run, then 'release' becomes a subset of 'activate', and you
> should only compute the absolute deadline at that time. Did that make
> sense?
I think things are a little bit complex here, because there are 2 different
"deadlines" we can think about:
- the "jobs deadlines" (the absolute job deadline can be computed at job
arrival, as the arrival time + the relative deadline). These are generally
used for performance metrics, to see if a job is respecting its timing
constraints or not
- the "scheduling deadlines", which are the ones used by the scheduler to
schedule the tasks. These are computed at tasks' wake-up time - notice that
for self-suspending jobs there can be wake-up times that are not job arrival
times. And are assigned according to the rules described in the CBS paper.
I think this can easily become very confusing, so I currently have no concrete
proposals for improving the documentation... But I wanted to point out that
things here are more complex than in the "traditional" real-time task model.
Maybe a solution could be to simply describe scheduling deadlines (which are
what sched_deadline uses) without going into the details of jobs' deadlines.
So, the document could avoid talking about instances (or jobs), and can say
that a task is guaranteed to receive "runtime" time units every "period" time
units (and these "runtime" time units are available within "deadline" time
units from the beginning of the period). Every time the task wakes up, the
scheduler computes a scheduling deadline consistent with this constraint,
and tasks are scheduled using EDF on these scheduling deadlines.
Every time "runtime" time units are consumed in a period, the scheduling
deadline is postponed by a period.
This is of course an approximative description, but I think it can give an
intuitive idea of how the scheduler works, and about the meaning of the three
scheduling parameters.
Luca
On Mon, Jan 20, 2014 at 02:39:29PM +0100, Luca Abeni wrote:
> Hi all,
>
> On 01/20/2014 02:16 PM, Henrik Austad wrote:
> [...]
> >>>>+ The typical -deadline task is composed of a computation phase (instance)
> >>>>+ which is activated on a periodic or sporadic fashion. The expected (maximum)
> >>>>+ duration of such computation is called the task's runtime; the time interval
> >>>>+ by which each instance needs to be completed is called the task's relative
> >>>>+ deadline. The task's absolute deadline is dynamically calculated as the
> >>>>+ time instant a task (or, more properly) activates plus the relative
> >>>>+ deadline.
> >>>
> >>>activates - released?
> >>>
> >>
> >>I'd keep (modifying a bit):
> >>
> >>"time instant a task activates plus the relative deadline."
> >>
> >>This is probably the nearest thing to what is implemented that we can say
> >>(without entering into the theory too much), a task that "activates" can mean
> >>that it is first released, enqueued, woken-up, etc.
> >
> >So, if we look at release (yes, I'm avoiding activates for a little while)
> >as the time at the *beginning* of a new period, then, and only then should
> >the *absolute* deadline be computed.
> >
> >If you den move on to use 'activate' as a term for when a task becomes
> >eligble to run, then 'release' becomes a subset of 'activate', and you
> >should only compute the absolute deadline at that time. Did that make
> >sense?
> I think things are a little bit complex here, because there are 2 different
> "deadlines" we can think about:
> - the "jobs deadlines" (the absolute job deadline can be computed at job
> arrival, as the arrival time + the relative deadline). These are generally
> used for performance metrics, to see if a job is respecting its timing
> constraints or not
>
> - the "scheduling deadlines", which are the ones used by the scheduler to
> schedule the tasks. These are computed at tasks' wake-up time - notice that
> for self-suspending jobs there can be wake-up times that are not job arrival
> times. And are assigned according to the rules described in the CBS paper.
>
> I think this can easily become very confusing, so I currently have no concrete
> proposals for improving the documentation... But I wanted to point out that
> things here are more complex than in the "traditional" real-time task model.
Traditional real-time as in the current real-time model used in Linux, or
traditional as in EDF terminology used by Liu & Layland?
> Maybe a solution could be to simply describe scheduling deadlines (which are
> what sched_deadline uses) without going into the details of jobs' deadlines.
Huh?
We definately need a short dictionary. In fact, I'd like to have a
paragraph describing what deadline driven scheduling is.
For instance, I'm getting *Really* confused wrt to arrival time - you seem
to wrap several types of arrival into the same name, yet treat it
differently.
- arrival: when a job gets ready to run for the first time
- arrival: when a job unblocks on some resource
Or did I misunderstand?
So, the terminology I'm used to, an attempt to write up something to
clear up the terminology and establish common grounds. Please
edit/elaborate or shoot down as appropriate:
"""
N. Crashcourse in deadline-terminology:
In a system, we typically look at a set of tasks. In Linux-kernel
terminology, a particular task is normally a thread. When a thread is
ready to run, we say that a *job* of that task is running. It is
perhaps easiest to grasp this if one think only of periodic tasks, i.e. a
thread that need to run for 2ms every 10ms. Normally, we want one job to
finish before a new (of the same task) start, which implies that the
deadline for this task is also 10ms. Once this is clear, expanding one's
mind to aperiodic and/or sporadic tasks is easier.
* Periodic task: a task that needs to run for a while every N us.
* Sporadic task: a tasks that needs tor un for a while at most every N us
(jobs start no closer than N us apart)
* Aperiodic task: a task that have no particular period, but once
released, needs to complete before a given deadline.
* Set of all deadline-tasks in the system: \tau
* One particluar task: \tau_i
* The j'th job of task i: \tau_{i,j}
* The (relative) deadline of task i: D_i
* The (periodic, relative) release time of task i: R_i
* Required execution time a tasks's job needs to complete. C_i
* Absolute release-time, the time when a new job is ready (when a thread is
woken up for a new period).
* The absolute deadline of a job, the actual point in time where a job
needs to be finished. This is what the scheduler looks at when it picks
the next thread to run.
We can now construct a 3-tuple describing a perioic and sporadic tasks:
(C_i, R_i, D_i).
These 3 items is what you can use to describe your task to the scheduler.
"""
> So, the document could avoid talking about instances (or jobs), and can say
> that a task is guaranteed to receive "runtime" time units every "period" time
> units (and these "runtime" time units are available within "deadline" time
> units from the beginning of the period). Every time the task wakes up, the
> scheduler computes a scheduling deadline consistent with this constraint,
> and tasks are scheduled using EDF on these scheduling deadlines.
> Every time "runtime" time units are consumed in a period, the scheduling
> deadline is postponed by a period.
What is wrong with using the CORRECT TERMINOLOGY? People looking at using
sched_deadline _need_ to understand what a _deadline_ scheduler is.
If the only goal of sched_deadline is to act as a bandwidth-gauge, then
fine, but *I* want to use sched_deadline for systems that HAVE DEADLINES. I
do NOT want to mess around with mapping deadlines to priorities in order to
meet my deadlines. I suspect others would like to use sched_deadline for
the same.
> This is of course an approximative description, but I think it can give an
> intuitive idea of how the scheduler works, and about the meaning of the three
> scheduling parameters.
Look, I'm not in favour of turning sched_deadline.txt into an academic
paper, but it is clear that we need to establish a baseline here, and then
we need to include tasks, instances/jobs, deadline, arrival/release and so
on.
--
Henrik Austad
On Mon, Jan 20, 2014 at 12:24:42PM +0100, Henrik Austad wrote:
> On Mon, Jan 20, 2014 at 11:40:40AM +0100, Juri Lelli wrote:
> > From: Dario Faggioli <[email protected]>
> >
> > Add in Documentation/scheduler/ some hints about the design
> > choices, the usage and the future possible developments of the
> > sched_dl scheduling class and of the SCHED_DEADLINE policy.
> > [...]
> > +++ b/Documentation/scheduler/sched-deadline.txt
> > @@ -0,0 +1,189 @@
> > + Deadline Task Scheduling
> > + ------------------------
> > +
> > +CONTENTS
> > +========
> > +
> > + 0. WARNING
> > + 1. Overview
> > + 2. Task scheduling
> > + 2. The Interface
^^^^^
I just noticed, where did this one go?
--
H
> > + 3. Bandwidth management
> > + 3.1 System-wide settings
> > + 3.2 Task interface
> > + 3.4 Default behavior
> > + 4. Tasks CPU affinity
> > + 4.1 SCHED_DEADLINE and cpusets HOWTO
> > + 5. Future plans
> > +
> > +
> > +0. WARNING
> > +==========
> > +
> > + Fiddling with these settings can result in an unpredictable or even unstable
> > + system behavior. As for -rt (group) scheduling, it is assumed that root users
> > + know what they're doing.
> > +
> > +
> > +1. Overview
> > +===========
> > +
> > + The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
> > + basically an implementation of the Earliest Deadline First (EDF) scheduling
> > + algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
> > + that makes it possible to isolate the behavior of tasks between each other.
>
>
> Why not something along the lines of giving a task a guaranteed slice of
> the CPU as well as making sure that a task takes no more than a given
> slice? I.e. making the point of a lower as well as an upper limit of CPU
> usage.
>
> > +2. Task scheduling
> > +==================
> > +
> > + The typical -deadline task is composed of a computation phase (instance)
> > + which is activated on a periodic or sporadic fashion. The expected (maximum)
> > + duration of such computation is called the task's runtime; the time interval
> > + by which each instance needs to be completed is called the task's relative
> > + deadline. The task's absolute deadline is dynamically calculated as the
> > + time instant a task (or, more properly) activates plus the relative
> > + deadline.
>
> activates - released?
>
> Since real-time papers from different rt-campus around the academia insist
> on using *slightly* different terminology, perhaps add a short dictionary
> for some of the more common terms?
>
> D: relative deadline, typically N ms after release
> d: absolute deadline, the physical time when a given instance of a job
> needs to be completed
> R: relative release time, for periodic tasks, this is typically 'every N
> ms'
> r: absolute release time
> C: Worst-case execution time
>
> ...you get the idea.
>
> Perhaps too academic?
>
> > + The EDF[1] algorithm selects the task with the smallest absolute deadline as
> > + the one to be executed first, while the CBS[2,3] ensures that each task runs
> > + for at most its runtime every period, avoiding any interference between
> > + different tasks (bandwidth isolation).
> > + Thanks to this feature, also tasks that do not strictly comply with the
> > + computational model described above can effectively use the new policy.
> > + IOW, there are no limitations on what kind of task can exploit this new
> > + scheduling discipline, even if it must be said that it is particularly
> > + suited for periodic or sporadic tasks that need guarantees on their
> > + timing behavior, e.g., multimedia, streaming, control applications, etc.
>
> I assume that ties are broken arbitrarily and that a running task is not
> preempted for another task with equal deadline. Correct?
>
> This would be a nice point to include in this doc methinks.
>
> > + References:
> > + 1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
> > + ming in a hard-real-time environment. Journal of the Association for
> > + Computing Machinery, 20(1), 1973.
> > + 2 - L. Abeni , G. Buttazzo. Integrating Multimedia Applications in Hard
> > + Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
> > + Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
> > + 3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
> > + Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps
> > +
> > +3. Bandwidth management
> > +=======================
> > +
> > + In order for the -deadline scheduling to be effective and useful, it is
> > + important to have some method to keep the allocation of the available CPU
> > + bandwidth to the tasks under control.
> > + This is usually called "admission control" and if it is not performed at all,
> > + no guarantee can be given on the actual scheduling of the -deadline tasks.
> > +
> > + Since when RT-throttling has been introduced each task group has a bandwidth
> > + associated, calculated as a certain amount of runtime over a period.
> > + Moreover, to make it possible to manipulate such bandwidth, readable/writable
> > + controls have been added to both procfs (for system wide settings) and cgroupfs
> > + (for per-group settings).
> > + Therefore, the same interface is being used for controlling the bandwidth
> > + distrubution to -deadline tasks.
> > +
> > + However, more discussion is needed in order to figure out how we want to manage
> > + SCHED_DEADLINE bandwidth at the task group level. Therefore, SCHED_DEADLINE
> > + uses (for now) a less sophisticated, but actually very sensible, mechanism to
> > + ensure that a certain utilization cap is not overcome per each root_domain.
> > +
> > + Another main difference between deadline bandwidth management and RT-throttling
> > + is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
> > + and thus we don't need an higher level throttling mechanism to enforce the
> > + desired bandwidth.
> > +
> > +3.1 System wide settings
> > +------------------------
> > +
> > + The system wide settings are configured under the /proc virtual file system.
> > +
> > + For now the -rt knobs are used for dl admission control and the -deadline
> > + runtime is accounted against the -rt runtime. We realise that this isn't
> > + entirely desirable; however, it is better to have a small interface for now,
> > + and be able to change it easily later. The ideal situation (see 5.) is to run
> > + -rt tasks from a -deadline server; in which case the -rt bandwidth is a direct
> > + subset of dl_bw.
> > +
> > + This means that, for a root_domain comprising M CPUs, -deadline tasks
> > + can be created while the sum of their bandwidths stays below:
> > +
> > + M * (sched_rt_runtime_us / sched_rt_period_us)
> > +
> > + It is also possible to disable this bandwidth management logic, and
> > + be thus free of oversubscribing the system up to any arbitrary level.
> > + This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.
> > +
> > +
> > +3.2 Task interface
> > +------------------
> > +
> > + Specifying a periodic/sporadic task that executes for a given amount of
> > + runtime at each instance, and that is scheduled according to the urgency of
> > + its own timing constraints needs, in general, a way of declaring:
> > + - a (maximum/typical) instance execution time,
> > + - a minimum interval between consecutive instances,
> > + - a time constraint by which each instance must be completed.
> > +
> > + Therefore:
> > + * a new struct sched_attr, containing all the necessary fields is
> > + provided;
> > + * the new scheduling related syscalls that manipulate it, i.e.,
> > + sched_setattr() and sched_getattr() are implemented.
> > +
> > +
> > +3.3 Default behavior
> > +---------------------
> > +
> > + The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to
> > + 95000. With rt_period equal to 1000000, by default, it means that -deadline
> ^^^^
> This seems to be 9.5% to me ;)
>
> > + tasks can use at most 95%, multiplied by the number of CPUs that compose the
> > + root_domain, for each root_domain.
> > +
> > + A -deadline task cannot fork.
> > +
> > +4. Tasks CPU affinity
> > +=====================
> > +
> > + -deadline tasks cannot have an affinity mask smaller that the entire
> > + root_domain they are created on. However, affinities can be specified
> > + through the cpuset facility (Documentation/cgroups/cpusets.txt).
>
> Does this mean that sched_deadline is a somewhat global implementation? Or
> rather, at what point in time will sched_deadline take all cpus in a set
> into consideration and when will it only look at the current CPU? Where is
> the line drawn between global and fully partitioned?
>
> Also, how do you account the budget when a resource holder is boosted in
> order to release a resource? (IIRC, you use BWI, right?)
>
> > +4.1 SCHED_DEADLINE and cpusets HOWTO
> > +------------------------------------
> > +
> > + An example of a simple configuration (pin a -deadline task to CPU0)
> > + follows (rt-app is used to create a -deadline task).
> > +
> > + mkdir /dev/cpuset
> > + mount -t cgroup -o cpuset cpuset /dev/cpuset
> > + cd /dev/cpuset
> > + mkdir cpu0
> > + echo 0 > cpu0/cpuset.cpus
> > + echo 0 > cpu0/cpuset.mems
> > + echo 1 > cpuset.cpu_exclusive
> > + echo 0 > cpuset.sched_load_balance
> > + echo 1 > cpu0/cpuset.cpu_exclusive
> > + echo 1 > cpu0/cpuset.mem_exclusive
> > + echo $$ > cpu0/tasks
> > + rt-app -t 100000:10000:d:0 -D5 (it is now actually superfluous to specify
> > + task affinity)
> > +
> > +5. Future plans
> > +===============
> > +
> > + Still missing:
> > +
> > + - refinements to deadline inheritance, especially regarding the possibility
> > + of retaining bandwidth isolation among non-interacting tasks. This is
> > + being studied from both theoretical and practical points of view, and
> > + hopefully we should be able to produce some demonstrative code soon;
> > + - (c)group based bandwidth management, and maybe scheduling;
> > + - access control for non-root users (and related security concerns to
> > + address), which is the best way to allow unprivileged use of the mechanisms
> > + and how to prevent non-root users "cheat" the system?
> > +
> > + As already discussed, we are planning also to merge this work with the EDF
> > + throttling patches [https://lkml.org/lkml/2010/2/23/239] but we still are in
> > + the preliminary phases of the merge and we really seek feedback that would
> > + help us decide on the direction it should take.
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 0de2482..0dd5e09 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -351,7 +351,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
> > * disrupting the schedulability of the system. Otherwise, we should
> > * refill the runtime and set the deadline a period in the future,
> > * because keeping the current (absolute) deadline of the task would
> > - * result in breaking guarantees promised to other tasks.
> > + * result in breaking guarantees promised to other tasks (refer to
> > + * Documentation/scheduler/sched-deadline.txt for more informations).
> > *
> > * This function returns true if:
> > *
> > --
> > 1.7.9.5
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
>
> --
> Henrik Austad
--
Henrik Austad
On 01/21/2014 11:20 AM, Henrik Austad wrote:
> On Mon, Jan 20, 2014 at 02:39:29PM +0100, Luca Abeni wrote:
>> Hi all,
>>
>> On 01/20/2014 02:16 PM, Henrik Austad wrote:
>> [...]
>>>>>> + The typical -deadline task is composed of a computation phase (instance)
>>>>>> + which is activated on a periodic or sporadic fashion. The expected (maximum)
>>>>>> + duration of such computation is called the task's runtime; the time interval
>>>>>> + by which each instance needs to be completed is called the task's relative
>>>>>> + deadline. The task's absolute deadline is dynamically calculated as the
>>>>>> + time instant a task (or, more properly) activates plus the relative
>>>>>> + deadline.
>>>>>
>>>>> activates - released?
>>>>>
>>>>
>>>> I'd keep (modifying a bit):
>>>>
>>>> "time instant a task activates plus the relative deadline."
>>>>
>>>> This is probably the nearest thing to what is implemented that we can say
>>>> (without entering into the theory too much), a task that "activates" can mean
>>>> that it is first released, enqueued, woken-up, etc.
>>>
>>> So, if we look at release (yes, I'm avoiding activates for a little while)
>>> as the time at the *beginning* of a new period, then, and only then should
>>> the *absolute* deadline be computed.
>>>
>>> If you den move on to use 'activate' as a term for when a task becomes
>>> eligble to run, then 'release' becomes a subset of 'activate', and you
>>> should only compute the absolute deadline at that time. Did that make
>>> sense?
>
>> I think things are a little bit complex here, because there are 2 different
>> "deadlines" we can think about:
>> - the "jobs deadlines" (the absolute job deadline can be computed at job
>> arrival, as the arrival time + the relative deadline). These are generally
>> used for performance metrics, to see if a job is respecting its timing
>> constraints or not
>>
>> - the "scheduling deadlines", which are the ones used by the scheduler to
>> schedule the tasks. These are computed at tasks' wake-up time - notice that
>> for self-suspending jobs there can be wake-up times that are not job arrival
>> times. And are assigned according to the rules described in the CBS paper.
>>
>> I think this can easily become very confusing, so I currently have no concrete
>> proposals for improving the documentation... But I wanted to point out that
>> things here are more complex than in the "traditional" real-time task model.
>
> Traditional real-time as in the current real-time model used in Linux, or
> traditional as in EDF terminology used by Liu & Layland?
Sorry, by traditional real-time task model I meant the Liu & Layland one
(which is similar to what is describe above "... a task is composed of
a computation phase (instance) which is activated on a periodic or sporadic
fashion"...). I just wanted to say that things in reality are more complex
than this.
>> Maybe a solution could be to simply describe scheduling deadlines (which are
>> what sched_deadline uses) without going into the details of jobs' deadlines.
>
> Huh?
>
> We definately need a short dictionary. In fact, I'd like to have a
> paragraph describing what deadline driven scheduling is.
>
> For instance, I'm getting *Really* confused wrt to arrival time - you seem
> to wrap several types of arrival into the same name, yet treat it
> differently.
>
> - arrival: when a job gets ready to run for the first time
> - arrival: when a job unblocks on some resource
>
> Or did I misunderstand?
Well, my point was (but I might be wrong) that describing all of these details
in this document might be confusing, so it might make sense to describe only
the relevant things (leaving all of the theoretical details to other documents,
like the papers and technical reports cited in this document).
You are right: there are different concepts of "arrival" and "arrival time": the
"job arrival" (when some activity - for example decoding a video frame - starts)
and the "task wake up". But I think in this document it is not important to mention
jobs (or task instances) and job arrivals: the only relevant things are task
wake-ups.
> So, the terminology I'm used to, an attempt to write up something to
> clear up the terminology and establish common grounds. Please
> edit/elaborate or shoot down as appropriate:
>
> """
> N. Crashcourse in deadline-terminology:
>
> In a system, we typically look at a set of tasks. In Linux-kernel
> terminology, a particular task is normally a thread. When a thread is
> ready to run, we say that a *job* of that task is running.
This would be true in the original Liu&Layland model (where a task blocks
only when a job finishes), but I do not think it is correct in a real system...
For example: (notice: this discussion might be slightly off-topic, and I do not
think this should go in the document... I am writing just to clarify my point
of view)
- Let's consider a (over simplified) video decoder as an example of task
- The task periodically read a video frame (from disk or network), decodes it,
and displays it
- So, each job starts when the frame is read, and finishes when the frame is
displayed. And jobs are (in this case) activated periodically
- During the execution of a job, the task might invoke a blocking system call,
and block... When it wakes up, it is still in the same job (decoding the same
video frame), and not in a different one.
This is (IMHO) where all the confusion comes from.
> It is
> perhaps easiest to grasp this if one think only of periodic tasks, i.e. a
> thread that need to run for 2ms every 10ms. Normally, we want one job to
> finish before a new (of the same task) start, which implies that the
> deadline for this task is also 10ms. Once this is clear, expanding one's
> mind to aperiodic and/or sporadic tasks is easier.
>
> * Periodic task: a task that needs to run for a while every N us.
> * Sporadic task: a tasks that needs tor un for a while at most every N us
> (jobs start no closer than N us apart)
> * Aperiodic task: a task that have no particular period, but once
> released, needs to complete before a given deadline.
This is all correct, but I think it is not needed to understand how sched_deadline
works
> * Set of all deadline-tasks in the system: \tau
> * One particluar task: \tau_i
> * The j'th job of task i: \tau_{i,j}
> * The (relative) deadline of task i: D_i
> * The (periodic, relative) release time of task i: R_i
> * Required execution time a tasks's job needs to complete. C_i
> * Absolute release-time, the time when a new job is ready (when a thread is
> woken up for a new period).
> * The absolute deadline of a job, the actual point in time where a job
> needs to be finished. This is what the scheduler looks at when it picks
> the next thread to run.
This is not completely correct... The scheduler does not use this deadline, but
uses a "scheduling deadline". The difference is, for example, that the scheduling
deadline is postponed when the task tries to execute for more than the runtime...
And this is why I think (but I might be wrong) that the concepts above do not need
to be introduced in this document.
Notice that if runtime >= C_i and if the sched_deadline period is >= "R_i", then
the scheduling deadlines and the absolute deadlines you mention above coincide, so
the task is guaranteed to respect the jobs' absolute deadlines (this is what is
called "hard schedulability property" in the CBS paper).
> We can now construct a 3-tuple describing a perioic and sporadic tasks:
>
> (C_i, R_i, D_i).
>
> These 3 items is what you can use to describe your task to the scheduler.
> """
>
>> So, the document could avoid talking about instances (or jobs), and can say
>> that a task is guaranteed to receive "runtime" time units every "period" time
>> units (and these "runtime" time units are available within "deadline" time
>> units from the beginning of the period). Every time the task wakes up, the
>> scheduler computes a scheduling deadline consistent with this constraint,
>> and tasks are scheduled using EDF on these scheduling deadlines.
>> Every time "runtime" time units are consumed in a period, the scheduling
>> deadline is postponed by a period.
>
> What is wrong with using the CORRECT TERMINOLOGY?
Well, what I mentioned above could be "correct terminology" :).
I just avoided talking about the two different kinds of deadlines
(jobs' deadlines and scheduling deadlines), because this could be confusing.
And about jobs, in order to avoid talking about self-suspending jobs, and
having to make a distinction about job arrivals and other task wake ups...
> People looking at using
> sched_deadline _need_ to understand what a _deadline_ scheduler is.
I think the sentence I wrote above gives an idea about what sched_deadline does,
and how to use the three parameters. I can be extended by clarifying that "using
EDF on these scheduling deadlines" means "scheduling the task with the smallest
scheduling deadline".
> If the only goal of sched_deadline is to act as a bandwidth-gauge, then
> fine, but *I* want to use sched_deadline for systems that HAVE DEADLINES.
Ok, so this is a more advanced usage. A section about the "hard schedulability
property" mentioned above can be added (later in the document, I think) to explain
how sched_deadline can be used to provide guarantees to real-time tasks (so,
the real-time task model you mentioned can be described later, and the condition
I mentioned above can be explained).
My point is: first explain what sched_deadline is and how it works (something about
my paragraph above), then, later explain what a real-time task is and how sched_deadline
can be used to properly schedule it.
> I do NOT want to mess around with mapping deadlines to priorities in order to
> meet my deadlines.
I agree, this is not needed.
> I suspect others would like to use sched_deadline for the same.
Again, I might be wrong... But once the "runtime time units every period, served
within a specified deadline" idea is clear, it is easier to understand how to use
sched_deadline.
>> This is of course an approximative description, but I think it can give an
>> intuitive idea of how the scheduler works, and about the meaning of the three
>> scheduling parameters.
>
> Look, I'm not in favour of turning sched_deadline.txt into an academic
> paper, but it is clear that we need to establish a baseline here, and then
> we need to include tasks, instances/jobs, deadline, arrival/release and so
> on.
See above... IMHO, this is not needed to understand how sched_deadline works.
But once the sched_deadline behaviour is defined, these concepts can be introduced
and it can be explained how to use sched_deadline to respect the jobs' deadlines...
Luca
Ok,
I'd say that we need to re-write parts of the document, as we need more
consensus about terms and better explain to users what SCHED_DEADLINE can be
used for, and the type of guarantees it can provide. I'll try to do this with
Luca, taking into account Henrik's points.
Thanks,
- Juri
On 01/21/2014 12:35 PM, Luca Abeni wrote:
> On 01/21/2014 11:20 AM, Henrik Austad wrote:
>> On Mon, Jan 20, 2014 at 02:39:29PM +0100, Luca Abeni wrote:
>>> Hi all,
>>>
>>> On 01/20/2014 02:16 PM, Henrik Austad wrote:
>>> [...]
>>>>>>> + The typical -deadline task is composed of a computation phase (instance)
>>>>>>> + which is activated on a periodic or sporadic fashion. The expected (maximum)
>>>>>>> + duration of such computation is called the task's runtime; the time interval
>>>>>>> + by which each instance needs to be completed is called the task's relative
>>>>>>> + deadline. The task's absolute deadline is dynamically calculated as the
>>>>>>> + time instant a task (or, more properly) activates plus the relative
>>>>>>> + deadline.
>>>>>>
>>>>>> activates - released?
>>>>>>
>>>>>
>>>>> I'd keep (modifying a bit):
>>>>>
>>>>> "time instant a task activates plus the relative deadline."
>>>>>
>>>>> This is probably the nearest thing to what is implemented that we can say
>>>>> (without entering into the theory too much), a task that "activates" can mean
>>>>> that it is first released, enqueued, woken-up, etc.
>>>>
>>>> So, if we look at release (yes, I'm avoiding activates for a little while)
>>>> as the time at the *beginning* of a new period, then, and only then should
>>>> the *absolute* deadline be computed.
>>>>
>>>> If you den move on to use 'activate' as a term for when a task becomes
>>>> eligble to run, then 'release' becomes a subset of 'activate', and you
>>>> should only compute the absolute deadline at that time. Did that make
>>>> sense?
>>
>>> I think things are a little bit complex here, because there are 2 different
>>> "deadlines" we can think about:
>>> - the "jobs deadlines" (the absolute job deadline can be computed at job
>>> arrival, as the arrival time + the relative deadline). These are generally
>>> used for performance metrics, to see if a job is respecting its timing
>>> constraints or not
>>>
>>> - the "scheduling deadlines", which are the ones used by the scheduler to
>>> schedule the tasks. These are computed at tasks' wake-up time - notice that
>>> for self-suspending jobs there can be wake-up times that are not job arrival
>>> times. And are assigned according to the rules described in the CBS paper.
>>>
>>> I think this can easily become very confusing, so I currently have no concrete
>>> proposals for improving the documentation... But I wanted to point out that
>>> things here are more complex than in the "traditional" real-time task model.
>>
>> Traditional real-time as in the current real-time model used in Linux, or
>> traditional as in EDF terminology used by Liu & Layland?
> Sorry, by traditional real-time task model I meant the Liu & Layland one
> (which is similar to what is describe above "... a task is composed of
> a computation phase (instance) which is activated on a periodic or sporadic
> fashion"...). I just wanted to say that things in reality are more complex
> than this.
>
>
>>> Maybe a solution could be to simply describe scheduling deadlines (which are
>>> what sched_deadline uses) without going into the details of jobs' deadlines.
>>
>> Huh?
>>
>> We definately need a short dictionary. In fact, I'd like to have a
>> paragraph describing what deadline driven scheduling is.
>>
>> For instance, I'm getting *Really* confused wrt to arrival time - you seem
>> to wrap several types of arrival into the same name, yet treat it
>> differently.
>>
>> - arrival: when a job gets ready to run for the first time
>> - arrival: when a job unblocks on some resource
>>
>> Or did I misunderstand?
> Well, my point was (but I might be wrong) that describing all of these details
> in this document might be confusing, so it might make sense to describe only
> the relevant things (leaving all of the theoretical details to other documents,
> like the papers and technical reports cited in this document).
> You are right: there are different concepts of "arrival" and "arrival time": the
> "job arrival" (when some activity - for example decoding a video frame - starts)
> and the "task wake up". But I think in this document it is not important to mention
> jobs (or task instances) and job arrivals: the only relevant things are task
> wake-ups.
>
>> So, the terminology I'm used to, an attempt to write up something to
>> clear up the terminology and establish common grounds. Please
>> edit/elaborate or shoot down as appropriate:
>>
>> """
>> N. Crashcourse in deadline-terminology:
>>
>> In a system, we typically look at a set of tasks. In Linux-kernel
>> terminology, a particular task is normally a thread. When a thread is
>> ready to run, we say that a *job* of that task is running.
> This would be true in the original Liu&Layland model (where a task blocks
> only when a job finishes), but I do not think it is correct in a real system...
> For example: (notice: this discussion might be slightly off-topic, and I do not
> think this should go in the document... I am writing just to clarify my point
> of view)
> - Let's consider a (over simplified) video decoder as an example of task
> - The task periodically read a video frame (from disk or network), decodes it,
> and displays it
> - So, each job starts when the frame is read, and finishes when the frame is
> displayed. And jobs are (in this case) activated periodically
> - During the execution of a job, the task might invoke a blocking system call,
> and block... When it wakes up, it is still in the same job (decoding the same
> video frame), and not in a different one.
> This is (IMHO) where all the confusion comes from.
>
>> It is
>> perhaps easiest to grasp this if one think only of periodic tasks, i.e. a
>> thread that need to run for 2ms every 10ms. Normally, we want one job to
>> finish before a new (of the same task) start, which implies that the
>> deadline for this task is also 10ms. Once this is clear, expanding one's
>> mind to aperiodic and/or sporadic tasks is easier.
>>
>> * Periodic task: a task that needs to run for a while every N us.
>> * Sporadic task: a tasks that needs tor un for a while at most every N us
>> (jobs start no closer than N us apart)
>> * Aperiodic task: a task that have no particular period, but once
>> released, needs to complete before a given deadline.
> This is all correct, but I think it is not needed to understand how sched_deadline
> works
>
>
>> * Set of all deadline-tasks in the system: \tau
>> * One particluar task: \tau_i
>> * The j'th job of task i: \tau_{i,j}
>> * The (relative) deadline of task i: D_i
>> * The (periodic, relative) release time of task i: R_i
>> * Required execution time a tasks's job needs to complete. C_i
>> * Absolute release-time, the time when a new job is ready (when a thread is
>> woken up for a new period).
>> * The absolute deadline of a job, the actual point in time where a job
>> needs to be finished. This is what the scheduler looks at when it picks
>> the next thread to run.
> This is not completely correct... The scheduler does not use this deadline, but
> uses a "scheduling deadline". The difference is, for example, that the scheduling
> deadline is postponed when the task tries to execute for more than the runtime...
> And this is why I think (but I might be wrong) that the concepts above do not need
> to be introduced in this document.
> Notice that if runtime >= C_i and if the sched_deadline period is >= "R_i", then
> the scheduling deadlines and the absolute deadlines you mention above coincide, so
> the task is guaranteed to respect the jobs' absolute deadlines (this is what is
> called "hard schedulability property" in the CBS paper).
>
>> We can now construct a 3-tuple describing a perioic and sporadic tasks:
>>
>> (C_i, R_i, D_i).
>>
>> These 3 items is what you can use to describe your task to the scheduler.
>> """
>>
>>> So, the document could avoid talking about instances (or jobs), and can say
>>> that a task is guaranteed to receive "runtime" time units every "period" time
>>> units (and these "runtime" time units are available within "deadline" time
>>> units from the beginning of the period). Every time the task wakes up, the
>>> scheduler computes a scheduling deadline consistent with this constraint,
>>> and tasks are scheduled using EDF on these scheduling deadlines.
>>> Every time "runtime" time units are consumed in a period, the scheduling
>>> deadline is postponed by a period.
>>
>> What is wrong with using the CORRECT TERMINOLOGY?
> Well, what I mentioned above could be "correct terminology" :).
> I just avoided talking about the two different kinds of deadlines
> (jobs' deadlines and scheduling deadlines), because this could be confusing.
> And about jobs, in order to avoid talking about self-suspending jobs, and
> having to make a distinction about job arrivals and other task wake ups...
>
>> People looking at using
>> sched_deadline _need_ to understand what a _deadline_ scheduler is.
> I think the sentence I wrote above gives an idea about what sched_deadline does,
> and how to use the three parameters. I can be extended by clarifying that "using
> EDF on these scheduling deadlines" means "scheduling the task with the smallest
> scheduling deadline".
>
>
>> If the only goal of sched_deadline is to act as a bandwidth-gauge, then
>> fine, but *I* want to use sched_deadline for systems that HAVE DEADLINES.
> Ok, so this is a more advanced usage. A section about the "hard schedulability
> property" mentioned above can be added (later in the document, I think) to explain
> how sched_deadline can be used to provide guarantees to real-time tasks (so,
> the real-time task model you mentioned can be described later, and the condition
> I mentioned above can be explained).
> My point is: first explain what sched_deadline is and how it works (something about
> my paragraph above), then, later explain what a real-time task is and how sched_deadline
> can be used to properly schedule it.
>
>> I do NOT want to mess around with mapping deadlines to priorities in order to
>> meet my deadlines.
> I agree, this is not needed.
>
>> I suspect others would like to use sched_deadline for the same.
> Again, I might be wrong... But once the "runtime time units every period, served
> within a specified deadline" idea is clear, it is easier to understand how to use
> sched_deadline.
>
>
>>> This is of course an approximative description, but I think it can give an
>>> intuitive idea of how the scheduler works, and about the meaning of the three
>>> scheduling parameters.
>>
>> Look, I'm not in favour of turning sched_deadline.txt into an academic
>> paper, but it is clear that we need to establish a baseline here, and then
>> we need to include tasks, instances/jobs, deadline, arrival/release and so
>> on.
> See above... IMHO, this is not needed to understand how sched_deadline works.
> But once the sched_deadline behaviour is defined, these concepts can be introduced
> and it can be explained how to use sched_deadline to respect the jobs' deadlines...
>
>
>
>
> Luca
>
On Tue, Jan 21, 2014 at 12:35:27PM +0100, Luca Abeni wrote:
> >In a system, we typically look at a set of tasks. In Linux-kernel
> >terminology, a particular task is normally a thread. When a thread is
> >ready to run, we say that a *job* of that task is running.
> This would be true in the original Liu&Layland model (where a task blocks
> only when a job finishes), but I do not think it is correct in a real system...
> For example: (notice: this discussion might be slightly off-topic, and I do not
> think this should go in the document... I am writing just to clarify my point
> of view)
> - Let's consider a (over simplified) video decoder as an example of task
> - The task periodically read a video frame (from disk or network), decodes it,
> and displays it
> - So, each job starts when the frame is read, and finishes when the frame is
> displayed. And jobs are (in this case) activated periodically
> - During the execution of a job, the task might invoke a blocking system call,
> and block... When it wakes up, it is still in the same job (decoding the same
> video frame), and not in a different one.
> This is (IMHO) where all the confusion comes from.
I would strongly urge you not to use that as an example, because its
dead wrong design. An RT thread (be it RR,FIFO or DL) should _NEVER_ do
blocking IO.
Have !RT tasks read the stuff from disk into a buffer, then let the RT
task read data from the buffer and flip frames and such.
If you want to mention blocking, then please use the most common one:
blocking on a (hopefully PI) mutex.
On the other subject; I wouldn't actually mind if it grew into a proper
(academic or not) summary of deadline scheduling theory and how it
applies.
Sure, refer to actual papers for all the proofs and such, but it would
be very good to go over all the bits and pieces that make up the system.
So cover the periodic, sporadic and aperiodic model like henr_k
suggested, please do cover the job/instance idiom as it is used all over
the place.
Then also treat schedulability tests and their ramification, explain
what laxity is, what tardiness is, that GEDF doesn't have 0 tardiness
but does have bounded tardiness.
Maybe even mention the actual bounds -- but refer to papers for their
proofs.
Mention CBS and the ramification etc..
Yes this is all a bit much, but I feel it is important, after all how
can you properly use something you don't understand? (and yes I know its
a very popular thing to not want to understand how things work but still
use them :-/).
I mean, I'm the kind of idiot that actually goes out and read a bunch of
papers, but many people simply cannot read those things, or are not
given the time to, even if they wanted and could (arguably they have
bigger problems).
so think of this as the idiot guide to deadline scheduling :-)
On 01/21/2014 01:33 PM, Peter Zijlstra wrote:
> On Tue, Jan 21, 2014 at 12:35:27PM +0100, Luca Abeni wrote:
>>> In a system, we typically look at a set of tasks. In Linux-kernel
>>> terminology, a particular task is normally a thread. When a thread is
>>> ready to run, we say that a *job* of that task is running.
>> This would be true in the original Liu&Layland model (where a task blocks
>> only when a job finishes), but I do not think it is correct in a real system...
>> For example: (notice: this discussion might be slightly off-topic, and I do not
>> think this should go in the document... I am writing just to clarify my point
>> of view)
>> - Let's consider a (over simplified) video decoder as an example of task
>> - The task periodically read a video frame (from disk or network), decodes it,
>> and displays it
>> - So, each job starts when the frame is read, and finishes when the frame is
>> displayed. And jobs are (in this case) activated periodically
>> - During the execution of a job, the task might invoke a blocking system call,
>> and block... When it wakes up, it is still in the same job (decoding the same
>> video frame), and not in a different one.
>> This is (IMHO) where all the confusion comes from.
>
> I would strongly urge you not to use that as an example, because its
> dead wrong design. An RT thread (be it RR,FIFO or DL) should _NEVER_ do
> blocking IO.
Well, but it does happen in reality :)
I mean: people might want to use SCHED_DEADLINE to schedule mplayer (or similar).
There are even scientific papers showing the advantage of doing so...
And if you try to use ftrace/kernelshark to check the wake-up times and similar
you will notice that even a single-threaded player like mplayer blocks and wakes-up
many times inside a job.
On the other hand, I agree with you that a hard real-time task should be designed
not to do things like this. But SCHED_DEADLINE is flexible enough to be used on
many different kinds of tasks (hard real-time, soft real-time, etc...).
> Have !RT tasks read the stuff from disk into a buffer, then let the RT
> task read data from the buffer and flip frames and such.
>
> If you want to mention blocking, then please use the most common one:
> blocking on a (hopefully PI) mutex.
Ok.
> On the other subject; I wouldn't actually mind if it grew into a proper
> (academic or not) summary of deadline scheduling theory and how it
> applies.
>
> Sure, refer to actual papers for all the proofs and such, but it would
> be very good to go over all the bits and pieces that make up the system.
>
> So cover the periodic, sporadic and aperiodic model like henr_k
> suggested, please do cover the job/instance idiom as it is used all over
> the place.
Ok... My point was that it would be better (IMHO) to first explain how
sched_deadline works (and no notion of job/instance, etc is needed for this),
and then explain how this applies to the real-time task model (and here, of
course all the formal notation can be introduced).
Do you think this can be reasonable?
> Then also treat schedulability tests and their ramification, explain
> what laxity is, what tardiness is, that GEDF doesn't have 0 tardiness
> but does have bounded tardiness.
>
> Maybe even mention the actual bounds -- but refer to papers for their
> proofs.
>
> Mention CBS and the ramification etc..
Ok.
I guess some of these details can be added incrementally, with additional
patches?
> Yes this is all a bit much, but I feel it is important, after all how
> can you properly use something you don't understand? (and yes I know its
> a very popular thing to not want to understand how things work but still
> use them :-/).
>
> I mean, I'm the kind of idiot that actually goes out and read a bunch of
> papers, but many people simply cannot read those things, or are not
> given the time to, even if they wanted and could (arguably they have
> bigger problems).
Ok.
Thanks,
Luca
On Tue, Jan 21, 2014 at 01:50:41PM +0100, Luca Abeni wrote:
> On 01/21/2014 01:33 PM, Peter Zijlstra wrote:
> >>- During the execution of a job, the task might invoke a blocking system call,
> >> and block... When it wakes up, it is still in the same job (decoding the same
> >> video frame), and not in a different one.
> >>This is (IMHO) where all the confusion comes from.
> >
> >I would strongly urge you not to use that as an example, because its
> >dead wrong design. An RT thread (be it RR,FIFO or DL) should _NEVER_ do
> >blocking IO.
> Well, but it does happen in reality :)
Yeah, I know, my point was more about not encouraging people to do this
by explicitly mentioning it.
> On the other hand, I agree with you that a hard real-time task should be designed
> not to do things like this. But SCHED_DEADLINE is flexible enough to be used on
> many different kinds of tasks (hard real-time, soft real-time, etc...).
At which point I feel obliged to mention the work Jim did on statistical
bounded tardiness and a potential future option:
SCHED_FLAG_DL_AVG_RUNTIME, where we would allow tasks to somewhat exceed
their runtime budget provided that they meet their budget on average.
A possible implementation could be to track the unused budget of
previous instances and keep a decaying sum (such that we're guaranteed
this unused budget < 2*runtime). And then allow runtime overruns upto
this limit.
Another possibly extension; one proposed by Ingo; is to demote tasks to
SCHED_OTHER once they exceed their budget instead of the full block they
get now -- we could possibly call this SCHED_FLAG_DL_CBS_SOFT or such.
And of course SCHED_FLAG_DL_CBS_SIGNAL, where the task gets a signal
delivered if it exceeded the runtime -- I think some of the earlier
patches had things like this, no?
> >On the other subject; I wouldn't actually mind if it grew into a proper
> >(academic or not) summary of deadline scheduling theory and how it
> >applies.
> >
> >Sure, refer to actual papers for all the proofs and such, but it would
> >be very good to go over all the bits and pieces that make up the system.
> >
> >So cover the periodic, sporadic and aperiodic model like henr_k
> >suggested, please do cover the job/instance idiom as it is used all over
> >the place.
> Ok... My point was that it would be better (IMHO) to first explain how
> sched_deadline works (and no notion of job/instance, etc is needed for this),
> and then explain how this applies to the real-time task model (and here, of
> course all the formal notation can be introduced).
>
> Do you think this can be reasonable?
Sure, I think that's reasonable.
> >Then also treat schedulability tests and their ramification, explain
> >what laxity is, what tardiness is, that GEDF doesn't have 0 tardiness
> >but does have bounded tardiness.
> >
> >Maybe even mention the actual bounds -- but refer to papers for their
> >proofs.
> >
> >Mention CBS and the ramification etc..
> Ok.
> I guess some of these details can be added incrementally, with additional
> patches?
Oh sure, all of this will always be a work in progress anyway ;-)
On 01/21/2014 02:55 PM, Peter Zijlstra wrote:
> On Tue, Jan 21, 2014 at 01:50:41PM +0100, Luca Abeni wrote:
>> On 01/21/2014 01:33 PM, Peter Zijlstra wrote:
>
>>>> - During the execution of a job, the task might invoke a blocking system call,
>>>> and block... When it wakes up, it is still in the same job (decoding the same
>>>> video frame), and not in a different one.
>>>> This is (IMHO) where all the confusion comes from.
>>>
>>> I would strongly urge you not to use that as an example, because its
>>> dead wrong design. An RT thread (be it RR,FIFO or DL) should _NEVER_ do
>>> blocking IO.
>> Well, but it does happen in reality :)
>
> Yeah, I know, my point was more about not encouraging people to do this
> by explicitly mentioning it.
>
>> On the other hand, I agree with you that a hard real-time task should be designed
>> not to do things like this. But SCHED_DEADLINE is flexible enough to be used on
>> many different kinds of tasks (hard real-time, soft real-time, etc...).
>
> At which point I feel obliged to mention the work Jim did on statistical
> bounded tardiness and a potential future option:
> SCHED_FLAG_DL_AVG_RUNTIME, where we would allow tasks to somewhat exceed
> their runtime budget provided that they meet their budget on average.
>
> A possible implementation could be to track the unused budget of
> previous instances and keep a decaying sum (such that we're guaranteed
> this unused budget < 2*runtime). And then allow runtime overruns upto
> this limit.
>
> Another possibly extension; one proposed by Ingo; is to demote tasks to
> SCHED_OTHER once they exceed their budget instead of the full block they
> get now -- we could possibly call this SCHED_FLAG_DL_CBS_SOFT or such.
>
> And of course SCHED_FLAG_DL_CBS_SIGNAL, where the task gets a signal
> delivered if it exceeded the runtime -- I think some of the earlier
> patches had things like this, no?
>
Yes, they both got removed along the way. But we can pick them back if we
realize that they are needed for some scenario.
>>> On the other subject; I wouldn't actually mind if it grew into a proper
>>> (academic or not) summary of deadline scheduling theory and how it
>>> applies.
>>>
>>> Sure, refer to actual papers for all the proofs and such, but it would
>>> be very good to go over all the bits and pieces that make up the system.
>>>
>>> So cover the periodic, sporadic and aperiodic model like henr_k
>>> suggested, please do cover the job/instance idiom as it is used all over
>>> the place.
>> Ok... My point was that it would be better (IMHO) to first explain how
>> sched_deadline works (and no notion of job/instance, etc is needed for this),
>> and then explain how this applies to the real-time task model (and here, of
>> course all the formal notation can be introduced).
>>
>> Do you think this can be reasonable?
>
> Sure, I think that's reasonable.
>
>>> Then also treat schedulability tests and their ramification, explain
>>> what laxity is, what tardiness is, that GEDF doesn't have 0 tardiness
>>> but does have bounded tardiness.
>>>
>>> Maybe even mention the actual bounds -- but refer to papers for their
>>> proofs.
>>>
>>> Mention CBS and the ramification etc..
>> Ok.
>> I guess some of these details can be added incrementally, with additional
>> patches?
>
> Oh sure, all of this will always be a work in progress anyway ;-)
>
Ok, we are working on a first update.
Thanks,
- Juri
On Mon, 20 Jan 2014 12:46:06 +0100
Peter Zijlstra <[email protected]> wrote:
> > activates - released?
> >
> > Since real-time papers from different rt-campus around the academia insist
> > on using *slightly* different terminology, perhaps add a short dictionary
> > for some of the more common terms?
>
> Oh gawd, they really don't conform with their definitions? I'd not
> noticed that.
>
> > D: relative deadline, typically N ms after release
>
> You failed to define release :-) Its the 'wakeup' event, right? Where
> the activation would be the moment we actually schedule the
> job/instance?
It took me about 20 seconds to figure out what a "release" was too. ;-)
-- Steve
On Tue, 21 Jan 2014 14:55:59 +0100
Peter Zijlstra <[email protected]> wrote:
> > >I would strongly urge you not to use that as an example, because its
> > >dead wrong design. An RT thread (be it RR,FIFO or DL) should _NEVER_ do
> > >blocking IO.
> > Well, but it does happen in reality :)
>
> Yeah, I know, my point was more about not encouraging people to do this
> by explicitly mentioning it.
Unless it is mentioned that this is fine for soft real-time uses, such
as what Luca described (mplayer and possibly Jack). Tools where missing
a deadline does not crash the system, but just reduces quality.
-- Steve
Hi,
On 01/21/2014 02:55 PM, Peter Zijlstra wrote:
> On Tue, Jan 21, 2014 at 01:50:41PM +0100, Luca Abeni wrote:
>> On 01/21/2014 01:33 PM, Peter Zijlstra wrote:
>
>>>> - During the execution of a job, the task might invoke a blocking system call,
>>>> and block... When it wakes up, it is still in the same job (decoding the same
>>>> video frame), and not in a different one.
>>>> This is (IMHO) where all the confusion comes from.
>>>
>>> I would strongly urge you not to use that as an example, because its
>>> dead wrong design. An RT thread (be it RR,FIFO or DL) should _NEVER_ do
>>> blocking IO.
>> Well, but it does happen in reality :)
>
> Yeah, I know, my point was more about not encouraging people to do this
> by explicitly mentioning it.
Ok, got the point. I was not planning to add this example to the documentation
anyway, so I think there is no problem :)
>> On the other hand, I agree with you that a hard real-time task should be designed
>> not to do things like this. But SCHED_DEADLINE is flexible enough to be used on
>> many different kinds of tasks (hard real-time, soft real-time, etc...).
>
> At which point I feel obliged to mention the work Jim did on statistical
> bounded tardiness and a potential future option:
> SCHED_FLAG_DL_AVG_RUNTIME, where we would allow tasks to somewhat exceed
> their runtime budget provided that they meet their budget on average.
I think I read the paper your refer to, and if I remember well it was about
an analysis technique (using some math from queuing theory to get estimations
of the average response time, and then using Tchebysheff to transform this result
in a probabilistic real-time guarantee)... If I understand well, it does not
require modifications to the scheduling algorithm (the paper presents a multiprocessor
reservation-based scheduling algorithm, but the presented analysis applies to every
reservation-based algorithm, including SCHED_DEADLINE without modifications).
Am I misunderstanding something?
Anyway, I'll propose a documentation patch adding this paper to the references
(and if you agree I can also add some other references to probabilistic guarantees).
The SCHED_FLAG_DL_AVG_RUNTIME idea also looks interesting.
> Another possibly extension; one proposed by Ingo; is to demote tasks to
> SCHED_OTHER once they exceed their budget instead of the full block they
> get now -- we could possibly call this SCHED_FLAG_DL_CBS_SOFT or such.
I think something similar to this was mentioned in the original "resource kernels"
paper by Rajkumar and others... It is in general very useful.
Another extension I implemented "locally" (but I never submitted patches because
it is "dangerous" and potentially controversial) is the original CBS behaviour:
when a task is depleted, do not make it unschedulable, but just postpone its
scheduling deadline (decreasing its priority) and immediately recharge the
runtime. This still preserves temporal isolation between SCHED_DEADLINE tasks,
but can cause starvation of non-SCHED_DEADLINE tasks (and this is why I say this
is dangerous and can be controversial), but can be useful in some situations.
> And of course SCHED_FLAG_DL_CBS_SIGNAL, where the task gets a signal
> delivered if it exceeded the runtime -- I think some of the earlier
> patches had things like this, no?
I've seen this in some patchset, but I do not remember when. I think some of
the "advanced features" have been removed from the first submission.
>>> On the other subject; I wouldn't actually mind if it grew into a proper
>>> (academic or not) summary of deadline scheduling theory and how it
>>> applies.
>>>
>>> Sure, refer to actual papers for all the proofs and such, but it would
>>> be very good to go over all the bits and pieces that make up the system.
>>>
>>> So cover the periodic, sporadic and aperiodic model like henr_k
>>> suggested, please do cover the job/instance idiom as it is used all over
>>> the place.
>> Ok... My point was that it would be better (IMHO) to first explain how
>> sched_deadline works (and no notion of job/instance, etc is needed for this),
>> and then explain how this applies to the real-time task model (and here, of
>> course all the formal notation can be introduced).
>>
>> Do you think this can be reasonable?
>
> Sure, I think that's reasonable.
Ok, good. I am already working on this together with Juri.
Thanks,
Luca
On Wed, Jan 22, 2014 at 02:03:18PM +0100, Luca Abeni wrote:
> >At which point I feel obliged to mention the work Jim did on statistical
> >bounded tardiness and a potential future option:
> >SCHED_FLAG_DL_AVG_RUNTIME, where we would allow tasks to somewhat exceed
> >their runtime budget provided that they meet their budget on average.
> I think I read the paper your refer to, and if I remember well it was about
> an analysis technique (using some math from queuing theory to get estimations
> of the average response time, and then using Tchebysheff to transform this result
> in a probabilistic real-time guarantee)... If I understand well, it does not
> require modifications to the scheduling algorithm (the paper presents a multiprocessor
> reservation-based scheduling algorithm, but the presented analysis applies to every
> reservation-based algorithm, including SCHED_DEADLINE without modifications).
> Am I misunderstanding something?
I must admit to not actually having read that paper; Jim talked me
through the result at some time, but yes that sounds about right.
The only thing we need to do is enforce some 'avg' on the input
(scheduling parameters) such that we are guaranteed an avg on the output
(tardiness).
And we must enforce the input because we cannot trust userspace to
actually do what it says; if it were to go overboard we'd loose all
bounds.
> Anyway, I'll propose a documentation patch adding this paper to the references
> (and if you agree I can also add some other references to probabilistic guarantees).
Right, something for the future though.
> The SCHED_FLAG_DL_AVG_RUNTIME idea also looks interesting.
Right, like said above, because we cannot trust userspace we must
enforce that the input is indeed a bounded avg, otherwise the guarantees
do not hold.
> >Another possibly extension; one proposed by Ingo; is to demote tasks to
> >SCHED_OTHER once they exceed their budget instead of the full block they
> >get now -- we could possibly call this SCHED_FLAG_DL_CBS_SOFT or such.
> I think something similar to this was mentioned in the original "resource kernels"
> paper by Rajkumar and others... It is in general very useful.
Right, esp. so when we allow unpriv. access to SCHED_DEADLINE, more on
that below.
> Another extension I implemented "locally" (but I never submitted patches because
> it is "dangerous" and potentially controversial) is the original CBS behaviour:
> when a task is depleted, do not make it unschedulable, but just postpone its
> scheduling deadline (decreasing its priority) and immediately recharge the
> runtime. This still preserves temporal isolation between SCHED_DEADLINE tasks,
> but can cause starvation of non-SCHED_DEADLINE tasks (and this is why I say this
> is dangerous and can be controversial), but can be useful in some situations.
Right, we could actually implement that but always require CAP_ADMIN if
set.
We should also again talk about what it would take to allow
unprivileged access to SCHED_DEADLINE. The things I can remember are the
obvious cap on utilization and a minimum period -- the latter so that we
don't DoS the system with a metric ton of tiny tasks.
But I seem to remember there were a few other issues.
>
> >And of course SCHED_FLAG_DL_CBS_SIGNAL, where the task gets a signal
> >delivered if it exceeded the runtime -- I think some of the earlier
> >patches had things like this, no?
> I've seen this in some patchset, but I do not remember when. I think some of
> the "advanced features" have been removed from the first submission.
Exactly, so. Now that we have the simple stuff settled, we can again
look at the more advanced features if there's potential users.
On 22/01/14 13:51, Peter Zijlstra wrote:
>>> Another possibly extension; one proposed by Ingo; is to demote tasks to
>>> SCHED_OTHER once they exceed their budget instead of the full block they
>>> get now -- we could possibly call this SCHED_FLAG_DL_CBS_SOFT or such.
Soft reservations are also very useful, as in multimedia and adaptive RT apps
we expect these budgets to be roughly estimated, rather than being carefully
computed WCETs.
For example, in the experimentation with adaptive budget tuning made with Jack,
http://retis.sssup.it/~tommaso/papers/lac11.php
the best results were achieved just setting the QOS_F_SOFT flag of the old
AQuoSA (apologies for mentioning that old crappy code), that was implementing
exactly downgrade of the task to the regular CFS scheduler for the time in
which the budget was exhausted.
That Jack-related scenario is quite challenging as due to the change in the
required budget due to new Jack clients joining, or varying workloads of
existing clients (e.g., timidity). At least, as far as the intention of being
seamless to applications is kept (apps required no change at all -- only Jack
server and client library were changed). FYI, I wish we could replay those
modifications on top of SCHED_DEADLINE one day :-)...
http://repo.or.cz/w/jack2.git/shortlog/refs/heads/aquosa
> We should also again talk about what it would take to allow
> unprivileged access to SCHED_DEADLINE. The things I can remember are the
> obvious cap on utilization and a minimum period -- the latter so that we
> don't DoS the system with a metric ton of tiny tasks.
>
> But I seem to remember there were a few other issues.
Exactly. I can remember also a huge period might be harmful, because with it
even a tiny utilization may lead to a big runtime that starves the whole non
RT tasks for a significant time. Just in case, I had this paper
http://retis.sssup.it/~tommaso/papers/rtas08.php
discussing the issues I had found, and how they were tackled in the AQuoSA
supervisor. See Section 3.1. Note that the AQuoSA model was probably more
complex due to the possibility of hosting multiple tasks in the same
deadline-driven reservation, and the possibility for parts of the tasks
in a user reservation being capable of being re-wrapped within another
app-specific reservation etc.... I'd expect the situation to be quite
simpler with SCHED_DEADLINE.
My2c,
Tommaso
On Fri, Jan 24, 2014 at 10:08:35AM +0000, Tommaso Cucinotta wrote:
> > We should also again talk about what it would take to allow
> > unprivileged access to SCHED_DEADLINE. The things I can remember are the
> > obvious cap on utilization and a minimum period -- the latter so that we
> > don't DoS the system with a metric ton of tiny tasks.
> >
> > But I seem to remember there were a few other issues.
>
> Exactly. I can remember also a huge period might be harmful, because with it
> even a tiny utilization may lead to a big runtime that starves the whole non
> RT tasks for a significant time.
Another way to solve that is with explicit slack time scheduling, where
slack here means the time/utilization not allocated to dl tasks (a quick
google shows various different usages of slack, including laxity).
So suppose we have 50% utilization but with a huge period, we could
schedule the other 50% at a fixed but smaller period and have that run
the rt/fair/etc tasks.
So effectively we'll always fill up the utilization to 100% and use the
'slack' time as a server for the other classes.
> Just in case, I had this paper
>
> http://retis.sssup.it/~tommaso/papers/rtas08.php
>
> discussing the issues I had found, and how they were tackled in the AQuoSA
> supervisor. See Section 3.1.
/me *click*
On 28/01/14 09:31, Peter Zijlstra wrote:
>> Exactly. I can remember also a huge period might be harmful, because with it
>> even a tiny utilization may lead to a big runtime that starves the whole non
>> RT tasks for a significant time.
>
> Another way to solve that is with explicit slack time scheduling, where
> slack here means the time/utilization not allocated to dl tasks (a quick
> google shows various different usages of slack, including laxity).
>
> So suppose we have 50% utilization but with a huge period, we could
> schedule the other 50% at a fixed but smaller period and have that run
> the rt/fair/etc tasks.
>
> So effectively we'll always fill up the utilization to 100% and use the
> 'slack' time as a server for the other classes.
Yesss :-)! Indeed, AFAICR in the AQuoSA API there was the QOS_F_DEFAULT flag
http://aquosa.sourceforge.net/aquosa-docs/aquosa-qosres/html/qos__types_8h_source.html
(that could only be set by a privileged process) just there to allow
for creating a server serving "default" tasks (namely, every non-aquosa task).
This way, you could create for example at system boot time a default
reservation with a precise (runtime, period) for Linux, so that non-aquosa
tasks could have a chance to run in that reservation, even in presence of
a RT reservation with a significantly long runtime.
T.