2009-07-10 21:51:20

by Henrik Austad

Subject: RFC for a new Scheduling policy/class in the Linux-kernel

Hi all!

This is a proposal for a global [1], deadline driven scheduler for
real-time tasks in the Linux kernel. I thought I should send out an RFC to
gather some feedback instead of wildly hacking away at it.

This proposed scheduler is a modified MLLF (Modified Least Laxity First)
called Earliest Failure First (EFF), as it orders tasks according to when
they will miss their deadlines rather than by the deadlines themselves.

== Motivation and background ==

Deadlines will give the developer greater flexibility and expressiveness
when creating real-time applications. Compared to a priority scheme,
this simplifies the process considerably as it removes the need for
calculating priorities off-line in order to find the priority-map
that orders the tasks correctly. Another important argument for
deadlines instead of priorities is systems that are too dynamic to
analyze (a large application with 100s of processes/threads, or a system
running more than one rt-application).

* In very large systems, it becomes very difficult to find the correct
set of priorities, even with sophisticated tools, and a slight change
will require a re-calculation of all priorities.

* In very dynamic systems, it can be impossible to analyze the system
off-line, reducing the calculated priorities to best-effort only.

* If more than one application runs at the same time, this becomes even
worse.


As a final point, a schedule of tasks with their priorities is, in
almost all scenarios, a result of the deadlines of all tasks. This also
goes for non-rt tasks, even though the concept of a deadline is a bit
fuzzier here. The problem is that this mapping is a one-way function:
you cannot determine the deadlines from a set of priorities.

The problem is how to implement this efficiently in a priority-oriented
kernel and, more importantly, how to extend it to multi-core
systems. For single-core systems, EDF is optimal and can accept tasks up
to 100% utilization and still guarantee that all deadlines are
met. However, on multi-core, this breaks down and the utilization bound
must be set very low if any guarantee is to be given (known as "Dhall's
effect").

== Related Work ==

Recently, I've been working away on a pfair-based scheduler (PD^2 to be
exact), but this failed for several reasons [2], the most prominent being
the sensitivity to timer inaccuracies and the very frequent task
preemptions. pfair has several good qualities, as it reduces jitter,
scales to many cores and achieves high sustained utilization. However,
the benefits do not outweigh the added overhead and strict requirements
placed on the timer infrastructure.

It is this last quality, sustained utilization, that haunts EDF on
multi-core platforms. A global EDF-US[1/2] cannot exceed a utilization
bound of (m+1)/2; standard global EDF is much worse. Partitioned EDF can
reach higher limits, but is very susceptible to the bin-packing
heuristics. Going fully partitioned also introduces other issues, like
the need for load balancing and more complex deadline-inheritance
logic. One clear benefit of EDF, however, is that it minimizes the
number of task switches, clearly something desirable.

== Proposed algorithm ==

So, with this in mind, my motivation was to find a way to extend a
deadline-driven scheduler to battle Dhall's effect. This can
be done if you look at time in a more general sense than just
deadlines. What you must do is look at how the computation time a task
still needs, relative to its deadline, compares to that of the other
tasks.

=== Notation ===

- Take a set of tasks with corresponding attributes. This set and their
attributes are called the schedule, 'S' and contains *all* tasks for
the given scheduling class (i.e. all EFF-tasks).

- Consider a multi-core system with 'm' processors.

- Let the i'th task in the schedule be denoted tau_i. [3]

- Each task will run in intervals; each 'round' is called a job. A task
consists of an infinite sequence of jobs. The k'th job of tau_i is
called tau_{i,k}.

- Each task has a set of (relative) attributes supplied when the task is
inserted into the scheduler (passed via syscall)
* Period T_i
* Deadline D_i
* WCET C_i

- Each job (tau_{i,k}) has absolute attributes (computed from the relative
task-attributes coupled with physical time).
* Release-time r_{i,k}
* Deadline d_{i,k}
* Allocated time so far for a job, C_a(t, tau_{i,k})
When C_a equals the WCET, the job's budget is exhausted and it should
start a new cycle. This is tested (see below) by the scheduler.
* Remaining time for the job, C_r(t, tau_{i,k}) = C_i - C_a(t, tau_{i,k})

- The acceptance function for EFF screens new tasks on their expected
utilization. Depending on the mode and implementation, it can be based
on the period or on the deadline. The latter gives firmer
constraints, but may lead to wasted resources.

U = C_i / T_i for SRT (bounded deadline tardiness)
U = C_i / D_i for HRT

- A relative measure, time to failure, ttf, indicates how much time is
left before a job must be scheduled to run in order to avoid a
deadline-miss. This will decrease as time progresses and the job is
not granted CPU time. For tasks currently running on a CPU, this value
will be constant.

Take a job with a WCET of 10 ms that has been allowed to run for
4 ms so far, so 6 ms of work remain. The deadline is 8 ms away. Then
the job must be scheduled to run within the next 2 ms, otherwise it
will not be able to finish in time.

- An absolute value, time of failure (tof), can also be computed in a
static manner. For tasks not running on a CPU, the allocated time is
static. That means you can take the absolute deadline, subtract the
remaining computation time (C_r), and you have the absolute point in
time at which a given job must be running to still meet its deadline
(a small worked sketch of these values follows below).
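
To make the numbers concrete, here is a minimal userspace C sketch of how
these quantities relate (only an illustration, not kernel code; the eff_*
names and the 20 ms period are made up for the example):

    #include <stdio.h>

    /* All times in milliseconds; a simplified, single-job view of a task. */
    struct eff_job {
        double wcet;      /* C_i: worst-case execution time   */
        double period;    /* T_i: period (assumed 20 ms here) */
        double deadline;  /* d:   absolute deadline           */
        double alloc;     /* C_a: CPU time consumed so far    */
    };

    static double eff_util(const struct eff_job *j)      /* acceptance test, SRT mode */
    {
        return j->wcet / j->period;
    }

    static double eff_remaining(const struct eff_job *j) /* C_r = C_i - C_a */
    {
        return j->wcet - j->alloc;
    }

    static double eff_tof(const struct eff_job *j)       /* absolute time of failure */
    {
        return j->deadline - eff_remaining(j);
    }

    static double eff_ttf(const struct eff_job *j, double now) /* relative time to failure */
    {
        return eff_tof(j) - now;
    }

    int main(void)
    {
        /* The worked example: WCET 10 ms, 4 ms consumed, deadline 8 ms from now. */
        struct eff_job j = { .wcet = 10, .period = 20, .deadline = 8, .alloc = 4 };

        printf("U   = %.2f\n", eff_util(&j));          /* 0.50                            */
        printf("C_r = %.1f ms\n", eff_remaining(&j));  /* 6.0 ms of work left             */
        printf("tof = %.1f ms\n", eff_tof(&j));        /* 2.0 ms (absolute, with now = 0) */
        printf("ttf = %.1f ms\n", eff_ttf(&j, 0.0));   /* must be running within 2.0 ms   */
        return 0;
    }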

=== Outline of scheduler ===

Store tasks in 2 queues. One of size m, containing all the tasks
currently running on the CPUs (queue R). The other will hold all
currently active tasks waiting to execute (queue W).

queue R is sorted based on ttf (time to failure, the relative time left
until a task must be running to avoid missing its deadline). As a
running task approaches the absolute time of failure at the same rate as
C_a increases, its ttf is constant. R is only a 'map' of tasks to the
CPUs. Position 0 in R (i.e. smallest ttf) does not correspond to CPU#0,
as the position->CPU mapping will be quite fluid.

queue W is sorted based on absolute time of failure (tof). Since this is
a fixed point in time, and the tasks in W are not running (C_a is
unchanged), this value is constant.

When a task is scheduled to run, a timer is set at the point in time
where it has exhausted its budget (t_now + WCET - C_a). This is to
ensure that a runaway task does not grab the CPU.

When a new task arrives, it is handled according to the following rules
(a rough sketch of this decision logic follows below):
- The system has one or more CPUs not running EFF-tasks. Pick any of the
free CPUs and assign the new job there. Set a timer to fire when its
budget is exhausted, as described above.

- All CPUs are busy, the new task has greater time to failure than the
head of W. The task is inserted into W at the appropriate place.

- All CPUs are busy and the new task has smaller time to failure than
the head of W. The new task is compared to the last task in R (the
running task with the largest ttf). If the new task's time to failure is
larger than that of the task at the tail, it is added to the head of W.

- If all CPUs are busy, and time to failure is smaller than the tail of
R, the new task is a candidate for insertion. At this point the two tasks
must be compared to see if picking one or the other will cause a
deadline-miss. If both will miss the deadline if the other is
scheduled, keep the existing task running and place the new one at the
head of W (as you will have a deadline-miss anyway unless the task is
picked up by another CPU soon).

- A task running on a CPU with ttf=0 should *never* be preempted by
another task. If all tasks in R have ttf=0, and a newly arrived task
has ttf=0, a deadline-miss is inevitable and switching tasks will only
waste resources.

When a task in R finishes (or is stopped due to the timer-limit), it is
removed from R, and the head of W is added to R, inserted at the
appropriate place.
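
To make the rules above concrete, a rough userspace C sketch of the
arrival logic (purely illustrative; the names are made up and the
deadline-miss comparison in the last rule is simplified to a single
preempt decision):

    #include <stdio.h>

    struct task {
        const char *name;
        double ttf;   /* relative time to failure */
    };

    enum eff_action {
        EFF_RUN_ON_IDLE_CPU,    /* a CPU is free: run there, arm the budget timer */
        EFF_INSERT_W,           /* queue in W, sorted by tof                      */
        EFF_INSERT_W_HEAD,      /* wait at the head of W                          */
        EFF_PREEMPT_TAIL_OF_R,  /* preempt the running task with the largest ttf  */
    };

    /*
     * 'r_tail' is the running task with the largest ttf, 'w_head' the waiting
     * task with the smallest tof; 'w_head' may be NULL if W is empty.
     */
    static enum eff_action eff_on_arrival(const struct task *incoming,
                                          const struct task *r_tail,
                                          const struct task *w_head,
                                          int busy_cpus, int ncpus)
    {
        if (busy_cpus < ncpus)
            return EFF_RUN_ON_IDLE_CPU;

        if (w_head && incoming->ttf >= w_head->ttf)
            return EFF_INSERT_W;

        /* More urgent than the head of W: compare with the tail of R. */
        if (incoming->ttf >= r_tail->ttf)
            return EFF_INSERT_W_HEAD;

        /* Never preempt a task that already has ttf == 0; a miss is
         * unavoidable and switching would only waste resources. */
        if (r_tail->ttf <= 0.0)
            return EFF_INSERT_W_HEAD;

        /* The full rule compares the two tasks' prospective deadline
         * misses; this sketch simply preempts the tail of R. */
        return EFF_PREEMPT_TAIL_OF_R;
    }

    int main(void)
    {
        struct task r_tail   = { "r_tail", 5.0 };
        struct task w_head   = { "w_head", 3.0 };
        struct task incoming = { "incoming", 1.0 };

        printf("action = %d\n",
               eff_on_arrival(&incoming, &r_tail, &w_head, 4, 4));
        return 0;
    }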

There has been some discussion lately (in particular on #linux-rt) about
bandwidth inheritance (BWI) and the proxy execution protocol (PEP). It
should be possible to extend EFF to handle both. As a side note, if
anyone has some good information about PEP, I'd like a copy :)

Based on this, I think the total utilization can be set as high as m
(i.e. full utilization of all CPUs), but the jitter can probably be
quite bad, so for jitter-sensitive tasks, a short period/deadline should
be used.

There are still some issues left to solve, for instance how to best
handle sporadic tasks, and whether deadline-misses should be allowed
at all, or just 'bounded deadline tardiness'. Either way, EFF should be
able to handle it. Then, there are problems concerning blocking of
tasks. One solution would be BWI or PEP, but I have not had the time to
read properly through those; from what I've gathered, a combination of
BWI and PEP looks promising (anyone with good info about BWI and PEP -
feel free to share! (-: ).



Henrik


1) Before you freeze at 'global' and get all caught up on "This won't
ever scale to X", or "He will be haunted by Y" - I do not want to
extend a global algorithm to 2000 cores. I would like to scale to a
single *chip* and then we can worry about 2-way and 4-way systems
later. For the record, I've donned my asbestos suit anyway.

2) http://austad.us/kernel/thesis_henrikau.pdf

3) Anyone want to include LaTeX-notation into an email-rfc?



--
-> henrik


2009-07-11 18:29:18

by Peter Zijlstra

Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Fri, 2009-07-10 at 23:50 +0200, Henrik Austad wrote:
> Hi all!
>
> This is a proposal for a global [1], deadline driven scheduler for
> real-time tasks in the Linux kernel. I thought I should send out an RFC to
> gather some feedback instead of wildy hack away at it.
>
> This proposed scheduler is a modified MLLF (modified Least Laxity First)
> called Earliest Failure First (EFF) as it orders tasks according to when
> they will miss their deadlines, not when the actual deadline is.

<snip>

Everybody agrees we want a deadline scheduler, we'll probably put a user
interface into -rt shortly which should work for all the involved
research groups so that we can share tests and have better comparisons.

The only thing (aside from an utter lack of time to work on things
recently) that has been holding us back is a proper solution to the
priority inversion issue.

I haven't fully read through the proposed algorithm below, and left it
in place for the new people on CC.

As already mentioned on IRC, the fact that you push the work to the last
possible moment indicates that this algorithm will utterly fall apart on
overload and would thus be unsuited for soft-rt loads, but I guess we
could implement things like EDF-fm and keep this as a hard-rt class.

> === Notation ===
>
> - Take a set of tasks with corresponding attributes. This set and their
> attributes are called the schedule, 'S' and contains *all* tasks for
> the given scheduling class (i.e. all EFF-tasks).
>
> - Consider a multi-core system with 'm' processors.
>
> - Let the i'th task in the schedule be denoted tau_i. [3]
>
> - Each task will run in intervals, each 'round' is called a job. A task
> consists of an infinite sequence of jobs. The k'th job of tau_i is
> called tau_{i,k}
>
> - Each task has a set of (relative) attributes supplied when the task is
> inserted into the scheduler (passed via syscall)
> * Period T_i
> * Deadline D_i
> * WCET C_i
>
> - Each job (tau_{i,k}) has absolute attributes (computed from the relative
> tasks-attributes coupled with physical time).
> * Release-time r_{i,k}
> * Deadline d_{i,k}
> * Allocated time so for a job, C_a(t, tau_{i,k})
> When C_a equals WCET, the jobs budget is exhausted and it should
> start a new cycle. This is tested (see below) by the scheduler.
> * Remaining time for the job, C_r(t, tau_{i,nk})
>
> - The acceptance function for EFF screens new tasks on their expected
> utilization. Depending on the mode and implementation, it can be based
> on the period, or on the deadline. The latter will cause firmer
> restraints, but may lead to wasted resources.
>
> U = C_i / T_i For SRT (bounded deadline tardiness)
> U = C_i / D_i For HRT
>
> - A relative measure, time to failure, ttf, indicates how much time is
> left before a job must be scheduled to run in order to avoid a
> deadline-miss. This will decrease as time progresses and the job is
> not granted CPU time. For tasks currently running on a CPU, this value
> will be constant.
>
> Take a job with a WCET of 10ms, it has been allowed to run for 4
> ms so far. The deadline is 8 ms away. Then the job must be
> scheduled to run within the next 4 ms, otherwise it will not be
> able to finish in time.
>
> - An absolute value, time of failure (tof) can also be computed in a
> static manner. For tasks not running on a CPU, the allocated time is
> static. That means you can take the absolute deadline, subtract the
> allocated time and you have the absolute point in time when a given
> job will fail to meet its deadline.
>
> === Outline of scheduler ===
>
> Store tasks in 2 queues. One of size m, containing all the tasks
> currently running on the CPUs (queue R). The other will hold all
> currently active tasks waiting to execute (queue W).
>
> queue R is sorted based on ttf (time to failure, the relative time left
> until a task will miss it's deadline). As the tasks approaches the
> absolute time of failure at the same rate C_a increases, ttf is
> constant. R is only a 'map' of tasks to the CPUs. Position 0 in R
> (i.e. smallest ttf) does not result in CPU#0, as the position->CPU will
> be quite fluent.
>
> queue W is sorted based on absolute time of failure (tof). Since this is
> a fixed point in time, and the tasks in W are not running (C_a is
> unchanged), this value is constant.
>
> When a task is scheduled to run, a timer is set at the point in time
> where it has exhausted it's budget (t_now + WCET - C_a). This is to
> ensure that a runaway task does not grab the CPU.
>
> When a new task arrives, it is handled according the following rules:
> - The system has one or more CPUs not running EFF-tasks. Pick any of the
> free CPUs and assign the new job there. Set a timer to
>
> - All CPUs are busy, the new task has greater time to failure than the
> head of W. The task is inserted into W at the appropriate place.
>
> - All CPUs are busy and the new task has smaller time to failure than
> the head of W. The new task is compared to the last task in Q. If time
> to failure is larger than the task at the tail, it is added to the
> head of W.
>
> - If all CPUs are busy, and time to failure is smaller than the tail of
> Q, the new task is a candidate for insertion. At this point the tasks
> must be compared to see if picking one or the other will cause a
> deadline-miss. If both will miss the deadline if the other is
> scheduled, keep the existing running and place the new at the head of
> W (as you will have a deadline-miss anyway unless the the task is
> picked up by another CPU soon).
>
> - A task running on a CPU with ttf=0 should *never* be preempted with
> another task. If all tasks in R have ttf=0, and a newly arrived task
> has ttf=0, a deadline-miss is inevitable and switching tasks will only
> waste resources.
>
> When a task in R finish (or is stopped due to the timer-limit), it is
> removed from R, and the head of W is added to R, inserted at the
> appropriate place.
>
> It has been some discussion lately (in particular on #linux-rt) about
> the bandwidth inheritance (BWI) and proxy execution protocol (PEP). It
> should be possible to extend EFF to handle both. As a side note, if
> anyone has some good information about PEP, I'd like a copy :)
>
> Based on this, I think the utilization can be set as high as M
> (i.e. full utilization of all CPUs), but the jitter can probably be
> quite bad, so for jitter-sensitive tasks, a short period/deadline should
> be used.
>
> There are still some issues left to solve, for instance how to best
> handle sporadic tasks, and whether or not deadline-miss should be allow,
> or just 'bounded deadline tardiness'. Either way, EFF should be able to
> handle it. Then, there are problems concerning blocking of tasks. One
> solution would be BWI or PEP, but I have not had the time to read
> properly through those, but from what I've gathered a combination of BWI
> and PEP looks promising (anyone with good info about BWI and PEP - feel
> free to share! (-: ).

Our SSSUP friends have a BWI paper here:

http://retis.sssup.it/~tommaso/publications/OSPERT-2008.pdf

The thing we call PEP was christened so by Douglas Niehaus (on CC), I'm
not sure if he has any papers on it.

Also, when talking about it at OSPERT last week Ted Baker (also on CC)
said it reminded him of something else of which I seem to have forgotten
the name.

Thing is, both BWI and PEP seem to work brilliantly on Uni-Processor
but SMP leaves things to be desired. Dhaval is currently working on a
PEP implementation that will migrate all the blocked tasks to the
owner's cpu, basically reducing it to the UP problem.

> 1) Before you freeze at 'global' and get all caught up on "This won't
> ever scale to X", or "He will be haunted by Y" - I do not want to
> extend a global algorithm to 2000 cores. I would like to scale to a
> single *chip* and then we can worry about 2-way and 4-way systems
> later. For the record, I've donned my asbestos suit anyway.

My preferred approach here is to find a distributed algorithm that
converges to the global one.

> 2) http://austad.us/kernel/thesis_henrikau.pdf
>
> 3) Anyone want to include LaTeX-notation into an email-rfc?

Not unheard of ;-)

2009-07-12 03:02:46

by Doug Niehaus

Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

Peter:
Perhaps you could expand on what you meant when you said:

Thing is, both BWI and PEP seems to work brilliantly on Uni-Processor
but SMP leaves things to be desired. Dhaval is currently working on a
PEP implementation that will migrate all the blocked tasks to the
owner's cpu, basically reducing it to the UP problem.

What is left to be desired with PEP on SMP? I am not saying it is
perfect, as I can think of a few things I would like to improve or
understand better, but I am curious what you have in mind.

Absent a clearer idea of what you had in mind, I can certainly discuss
the tradeoffs Noah and I have considered over time, and which we think
motivate our approach.

In the quite extended time, several years, that Noah and I have been
working on this topic, there have always seemed to be two choices:

1) Move the proxy (the resource owner) to the CPU with the blocked task
2) Move the "scheduling profile" of the blocked task to the CPU where
the proxy is.

For Proxy Execution under Group Scheduling we have considered both over
time. Consider the situation where thread A on CPU0 is blocked on a
resource held by thread B on CPU1. When we considered (1), it has the
advantage of ensuring that B will run on CPU0, unblocking A, if A (or B)
is still the best choice at the time it has been successfully moved from
CPU1 -> CPU0. That might not be true after the delay of moving the process.

We decided to emphasize (2) because, in our view, it was more interesting:
it was cheaper and seemed no more complicated, although its complications
are different from those of (1). Its complication is, of course,
that while we have worked out how to add the "avatar" of A to the set
considered by the GS hierarchy on CPU1, it depends on the scheduling
semantics as configured whether the avatar of A is chosen as "best" and
thus how long it will be until B runs long enough to release the
resource and unblock A on CPU1.

We have always viewed that as complicated, but appropriately so to the
problem. It depends inherently on the semantics of threads A, B, and all
other threads on CPU1 that are ready to run, among whom the "best" is
chosen by the GS hierarchy. We think it is inherently the problem of the
scheduling configuration to take this trade-off into account.

We have also thought that being able to do both (1) and (2) is best, but
which to use in a given situation depends on the comparative cost of (X)
running B on CPU1 long enough to unblock A, and (Y) moving B from
CPU1->CPU0 to run long enough to unblock A, and then moving it back from
CPU0->CPU1 since its designated CPU assignment is CPU1. Our decision,
after many hours of discussion over many months, has been that the cost
of (X) seems a lot more attractive than (Y).

Part of our preference is that we are still working with semaphores as
resources. Since most critical sections are supposed to be "short",
scheduling semantics that take the proxy-execution periods into
account on the "foreign" CPUs would actually be easier/better than the
double thread movement.

Both problems seem quite hard and I do not think I have yet "completely
figured it out". While the "mass migration" you say Dhaval is working on
would "reduce" the problem to the UP case, I think it would create more
complexity for analysis than it eliminates. A form of thrashing seems a
real danger here: threads would be moving from CPU to CPU so much that
it would be a real drain on resources and a constraint on system
performance. However, Dhaval may well understand the cost implications of
thread migration better than I do.

The real core of the problem, it seems to me, is that the proxy
relationship among threads depends on what resources can be held by
them. I think that problem is "relatively easy" in the set of locks
associated with a multi-threaded application.

When the resources causing blocking can be *any* lock in the kernel
associated with *any* system service that might be used by *any* thread,
it is complicated enough to make my brain hurt. However, we think the GS
framework makes it relatively easy (perhaps better said as "as easy as
it can be") to implement any combination of thread migration and avatars
desired by a given scheduling semantics.

In that sense Noah and I feel that GS is a "complete" framework in that
it is possible to configure any semantics desired as easily as it can be
done by any framework. Obviously that does not resolve the question of
what semantics it is best to desire for a given system, which remains
quite complicated and highly dependent on the specific application
semantics.

Noah and I thought the relatively low cost of creating the avatar was
quite attractive, so we decided to experiment with a GS configuration
that uses it when specifying the scheduling semantics. The first two
approaches we want to experiment with are (*) to view the composite
scheduling hierarchy for all CPUs as a whole, and let the avatar of A
take its chances on CPU1, and (**) to view resolution of blocking as
most important system wide, so we have the avatar viewed as "best" long
enough for its proxy to release the resource.

The bottom line, in our view, is that no single semantics will be viewed
as either "best" or even acceptable for all applications, as is the case
with schedulers, so we wanted to emphasize configurability.

We have performed basic tests showing the avatars can be chosen and
resolve the blocking relationship. More complex tests await the
completion of our port of GS and the other KUSP subsystems to 2.6.29.

Doug


2009-07-12 06:18:15

by Henrik Austad

Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Saturday 11 July 2009 20:28:11 Peter Zijlstra wrote:
> On Fri, 2009-07-10 at 23:50 +0200, Henrik Austad wrote:
> > Hi all!
> >
> > This is a proposal for a global [1], deadline driven scheduler for
> > real-time tasks in the Linux kernel. I thought I should send out an RFC
> > to gather some feedback instead of wildy hack away at it.
> >
> > This proposed scheduler is a modified MLLF (modified Least Laxity First)
> > called Earliest Failure First (EFF) as it orders tasks according to when
> > they will miss their deadlines, not when the actual deadline is.
>
> <snip>
>
> Everybody agrees we want a deadline scheduler, we'll probably put a user
> interface into -rt shortly which should work for all the involved
> research groups so that we can share tests and have better comparisons.

My plan is to start working on this with Bill's framework once that is stable
enough and it has the expressiveness I need.

> The only thing (aside from an utter lack of time to work on things
> recently) that has been holding us back is a proper solution to the
> priority inversion issue.
>
> I haven't fully read through the proposed algorithm below, and left it
> in place for the new people on CC.
>
> As already mentioned on IRC, the fact that you push the work to the last
> possible moment indicates that this algorithm will utterly fall apart on
> overload and would thus be unsuited for soft-rt loads, but I guess we
> could implement things like EDF-fm and keep this as a hard-rt class.

I'm not pushing the work to the last instant per se, but rather
scheduling out a running thread only when it is absolutely necessary, in
order to reduce the number of task-preemptions.

When a new task (A) arrives you compare it to the head of the waiting queue
(B), and if A has a 'graver' need for running than B, you move on to look at
the tail of the running queue (C). If A has a higher need than C, you let A
run, but only if not running A would cause a dl-miss and not running B would
not.

> > [...]
> > There are still some issues left to solve, for instance how to best
> > handle sporadic tasks, and whether or not deadline-miss should be allow,
> > or just 'bounded deadline tardiness'. Either way, EFF should be able to
> > handle it. Then, there are problems concerning blocking of tasks. One
> > solution would be BWI or PEP, but I have not had the time to read
> > properly through those, but from what I've gathered a combination of BWI
> > and PEP looks promising (anyone with good info about BWI and PEP - feel
> > free to share! (-: ).
>
> Our SSSUP friends have a BWI paper here:
>
> http://retis.sssup.it/~tommaso/publications/OSPERT-2008.pdf

thanks!

> The thing we call PEP was christened so by Douglas Niehaus (on CC), I'm
> not sure if he has any papers on it.

A general outline would also be great as I treat PEP the way I *think* it is
after some discussions on IRC etc.

> Also, when talking about it at OSPERT last week Ted Baker (also on CC)
> said it reminded him of something else of which I seem to have forgotten
> the name.
>
> Thing is, both BWI and PEP seems to work brilliantly on Uni-Processor
> but SMP leaves things to be desired. Dhaval is currently working on a
> PEP implementation that will migrate all the blocked tasks to the
> owner's cpu, basically reducing it to the UP problem.

I think that PEP will work better on MP, at least for the entire system, but
not for a single thread.

If you move the task holding the resource to the CPU of the blocked task, and
let the holder consume part of the requestor's budget, it will be easier
to schedule the actual thread for running as you have complete control over
the current CPU. More importantly, you can ensure that the total utilization
is bounded, not only within the group (if you use EFF in a clustered setup),
but also when the holder belongs to another group.

If several threads want to run the holder through the proxy, things will of
course get complicated, but the same goes for BWI.

The downside of PEP is that you introduce unpredictability in resource
consumption, as other threads can consume a thread's resources. This makes
it harder to analyze a single thread.


> > 1) Before you freeze at 'global' and get all caught up on "This won't
> > ever scale to X", or "He will be haunted by Y" - I do not want to
> > extend a global algorithm to 2000 cores. I would like to scale to a
> > single *chip* and then we can worry about 2-way and 4-way systems
> > later. For the record, I've donned my asbestos suit anyway.
>
> My preferred approach here is to find a distributed algorithm that
> converges to the global one.
>
> > 2) http://austad.us/kernel/thesis_henrikau.pdf
> >
> > 3) Anyone want to include LaTeX-notation into an email-rfc?
>
> Not unheard of ;-)

perhaps this 'Web 2.0' will come up with a solution :)

--
henrik

2009-07-12 15:32:31

by Peter Zijlstra

Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Sat, 2009-07-11 at 21:40 -0500, Douglas Niehaus wrote:
> Peter:
> Perhaps you could expand on what you meant when you said:
>
> Thing is, both BWI and PEP seems to work brilliantly on Uni-Processor
> but SMP leaves things to be desired. Dhaval is currently working on a
> PEP implementation that will migrate all the blocked tasks to the
> owner's cpu, basically reducing it to the UP problem.
>
> What is left to be desired with PEP on SMP? I am not saying it is
> perfect, as I can think of a few things I would like to improve or
> understand better, but I am curious what you have in mind.

Right, please don't take this as criticism of PEP; any scheme I
know of has enormous complications on SMP ;-)

But the thing that concerns me most is that there seem to be a few O(n)
consequences. Suppose that for each resource (or lock) R_i, there is a
block graph G_i, which consists of n nodes and would be m deep.

Functionally (generalized) PIP and PEP are identical, their big
difference is that PIP uses waitqueues to encode the block graph G,
whereas PEP leaves everybody on the runqueue and uses the proxy field to
encode the block graph G.

The downside of PIP is that the waitqueue needs to re-implement the full
schedule function in order to evaluate the highest prio task on the
waitqueue. Traditionally this was rather easy, since you'd only
consider the limited SCHED_FIFO static prio range, leaving you with an
O(1) evaluation; when you add more complex scheduling functions things
get considerably more involved. Let's call this cost S.

So for PIP you get O(m*S) evaluations whenever you get a change to the
block graph.

Now for PEP, you get an increased O(m) cost on schedule, which can be
compared to the PIP cost.
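
(Purely as illustration for the new people on CC: a toy userspace C sketch
of what that proxy-field walk amounts to; the names are made up and this
is not the actual -rt code. A real implementation must also detect cycles,
i.e. deadlock.)

    #include <stdio.h>

    struct task {
        const char *name;
        struct task *proxy;   /* task we are blocked on; NULL when runnable */
    };

    /*
     * schedule() picks the "best" task as usual; if that task is blocked we
     * follow the proxy links (the O(m) walk) and run the task at the end of
     * the chain on the blocked task's behalf.
     */
    static struct task *proxy_resolve(struct task *best)
    {
        while (best->proxy)
            best = best->proxy;
        return best;
    }

    int main(void)
    {
        struct task c = { "C", NULL };   /* owner of the contended resource */
        struct task b = { "B", &c };     /* blocked on C                    */
        struct task a = { "A", &b };     /* scheduler's first choice        */

        printf("picked %s, proxy-executing %s\n",
               a.name, proxy_resolve(&a)->name);
        return 0;
    }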

However PEP on SMP needs to ensure all n tasks in G_i are on the same
cpu, because otherwise we can end up wanting to execute the resource
owner on multiple cpus at the same time, which is bad.

This can of course be amortized, but you end up having to migrate the
task (or an avatar thereof) to the owner's cpu (if you'd want to migrate
the owner to the blocker's cpu, then you're quickly into trouble when
there are multiple blockers), but any way around this ends up being O(n).

Also, when the owner gets blocked on something that doesn't have an
owner (io completion, or a traditional semaphore), you have to take all
n tasks from the runqueue (and back again when they do become runnable).

PIP doesn't suffer this, but does suffer the pain from having to
reimplement the full schedule function on the waitqueues, which when you
have hierarchical scheduling means you have to replicate the full
hierarchy per waitqueue.

Furthermore we cannot assume locked sections are short, and we must
indeed assume that it's any resource in the kernel associated with any
service which can be used by any thread; worse, it can be any odd
userspace resource/thread too, since we expose the block graph to
userspace processes through PI-futexes.

2009-07-13 09:56:04

by Dario Faggioli

Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Sat, 2009-07-11 at 20:28 +0200, Peter Zijlstra wrote:
> > There are still some issues left to solve, for instance how to best
> > handle sporadic tasks, and whether or not deadline-miss should be allow,
> > or just 'bounded deadline tardiness'. Either way, EFF should be able to
> > handle it. Then, there are problems concerning blocking of tasks. One
> > solution would be BWI or PEP, but I have not had the time to read
> > properly through those, but from what I've gathered a combination of BWI
> > and PEP looks promising (anyone with good info about BWI and PEP - feel
> > free to share! (-: ).
>
> Our SSSUP friends have a BWI paper here:
>
> http://retis.sssup.it/~tommaso/publications/OSPERT-2008.pdf
>
And here we are! :-)

The paper Peter pointed you to mainly describes the work I did some
months ago to implement BandWidth Inheritance inside one of our real-time
Linux variants (ReTiS Lab, Pisa, Italy)... Feel free to ask me anything
related to it directly.

It is implemented exactly as a "proxy execution" protocol and things
were easy there, since --for now-- the framework I was talking about is
UP-only! :-(

Now we are back at work on it, especially thinking about how to extend
the protocol to SMP architectures...

> Thing is, both BWI and PEP seems to work brilliantly on Uni-Processor
> but SMP leaves things to be desired. Dhaval is currently working on a
> PEP implementation that will migrate all the blocked tasks to the
> owner's cpu, basically reducing it to the UP problem.
>
Nice... Only one question, doesn't this impact with task affinity
related issues?

regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]



2009-07-13 10:14:49

by Peter Zijlstra

Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Mon, 2009-07-13 at 11:55 +0200, Raistlin wrote:
> > Thing is, both BWI and PEP seems to work brilliantly on Uni-Processor
> > but SMP leaves things to be desired. Dhaval is currently working on a
> > PEP implementation that will migrate all the blocked tasks to the
> > owner's cpu, basically reducing it to the UP problem.
> >
> Nice... Only one question, doesn't this impact with task affinity
> related issues?

No :-), well maybe.

Since the task is blocked and will thus not actually run, we can place
it on any CPU; only once it becomes runnable again should we take care
to place it on a CPU that's part of its affinity mask.

Of course, underlying this is the question of what to do for
'partitioned' tasks sharing a resource. I think we've talked about this
a few times.

Since these 'partitioned' tasks do share a resource, they're not really
all that partitioned, and I think the best way to keep the system going
is to simply Inherit/Donate/Proxy right through the partition.

~ Peter

2009-07-13 15:44:11

by Dario Faggioli

Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Sun, 2009-07-12 at 17:31 +0200, Peter Zijlstra wrote:
> But the thing that concerns me most, there seem to be a few O(n)
> consequences. Suppose that for each resource (or lock) R_i, there is a
> block graph G_i, which consists of n nodes and would be m deep.
>
> [...]
>
> However PEP on SMP needs to ensure all n tasks in G_i are on the same
> cpu, because otherwise we can end up wanting to execute the resource
> owner on multiple cpus at the same time, which is bad.
>
That's where we are trying to start from in evaluating the various
possibilities for extending the idea to SMP.

What we would like to achieve is some set of rules that, extending the
UP ones, yield a situation which is both:
- analyzable from the real-time theorist's point of view... Which is
(un?)fortunately what we are :-)
- possible to implement... Which is not always (un!)fortunately obvious
here :-)

I hope you don't mind if I share some thoughts with you about the
solution I was wondering about... In case you do, just ignore and excuse
me.

Very basically: from the analysis point of view one easy and effective
solution would be to have the blocked-running tasks --i.e., the tasks
blocked on some lock that have been left on the rq to proxy-execute the
lock owner-- busy waiting while the lock owner is running. This allows
for retaining a lot of nice properties BWI already has, as far as
analyzability is concerned.

On the other hand, from the practical and efficiency point of view, it
would not be that difficult to block these otherwise-spinning tasks, in
order to avoid wasting too much CPU time... The only important thing is
to properly account the budget to the correct server/group (which
probably must be the otherwise-spinning task's one), or the analysis is
gone again! :-O
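
Just to illustrate what I mean by proper accounting, a toy userspace C
sketch (made-up names, not ReTiS or -rt code): while the owner runs as a
proxy, its CPU time is charged to the blocked task's server rather than
to its own:

    #include <stdio.h>

    struct server {
        const char *name;
        double budget_left;    /* remaining budget in this period, ms */
    };

    struct task {
        const char *name;
        struct server *srv;    /* the server/group this task belongs to */
        struct task *proxying; /* task we currently proxy-execute for,
                                * NULL when running on our own budget   */
    };

    /* Account 'ran' ms of CPU time just consumed by 't'. */
    static void account(struct task *t, double ran)
    {
        /* While proxy-executing, charge the blocked (otherwise-spinning)
         * task's server, so the per-server analysis still holds. */
        struct server *s = t->proxying ? t->proxying->srv : t->srv;

        s->budget_left -= ran;
        printf("%s ran %.1f ms, charged to %s (%.1f ms left)\n",
               t->name, ran, s->name, s->budget_left);
    }

    int main(void)
    {
        struct server sa = { "A's server", 5.0 };
        struct server sb = { "B's server", 5.0 };
        struct task a = { "A (blocked)", &sa, NULL };
        struct task b = { "B (lock owner)", &sb, &a };

        account(&b, 2.0);     /* B runs as A's proxy: A's server pays */
        b.proxying = NULL;
        account(&b, 1.0);     /* B runs for itself: B's server pays   */
        return 0;
    }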

Also, this will probably imply some amount of block/wakeup overhead
which could be of some impact especially in involved, maybe rare, but
possible, locking schemes... which we would like to evaluate
thoroughly...

> This can of course be amortized, but you end up having to migrate the
> task (or an avatar thereof) to the owner's cpu (if you'd want to migrate
> the owner to the blocker's cpu, then you're quickly into trouble when
> there's multiple blockers), but any way around this ends up being O(n).
>
I agree... Computational complexity is also an issue, and something
we have to validate against.

> Also, when the owner gets blocked on something that doesn't have an
> owner (io completion, or a traditional semaphore), you have to take all
> n tasks from the runqueue (and back again when they do become runnable).
>

> PIP doesn't suffer this, but does suffer the pain from having to
> reimplement the full schedule function on the waitqueues, which when you
> have hierarchical scheduling means you have to replicate the full
> hierarchy per waitqueue.
>
And, further than this, at least from my point of view, if you have
server/group based scheduling, and in general some kind of budgeting or
bandwidth enforcing mechanism in place, PIP is far from being a
solution...

> Furthermore we cannot assume locked sections are short, and we must
> indeed assume that its any resource in the kernel associated with any
> service which can be used by any thread, worse, it can be any odd
> userspace resource/thread too, since we expose the block graph to
> userspace processes through PI-futexes.
>
Agree, 100%. :-)

Again, sorry for bothering you if some of you are not interested...
If instead you are, any comment is more than welcome.

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]



2009-07-13 16:06:27

by Dario Faggioli

Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Mon, 2009-07-13 at 12:14 +0200, Peter Zijlstra wrote:
> > Nice... Only one question, doesn't this impact with task affinity
> > related issues?
>
> No :-), well maybe.
>
:-D

> Since the task is blocked and will this not actually run, we can place
> it on any CPU, only once it becomes runnable again should we take care
> to place it on a CPU that's part of its affinity mask.
>
Well, it's difficult to find an argument against this! :-)

Anyway, maybe if, on some architecture, for some kind of application,
the affinity has been set to preserve some actual cache or memory
locality for the task's access pattern, this could be an issue, couldn't
it? :-)
I mean, in some case where being sure of having a task running on a
particular CPU is somehow of paramount importance...

Anyway, I know, I'm just wandering around, I'm not that "affinity
expert" and I'm far from having a real example of the scenario I tried
to describe! :-(

> Of course, underlying this is the question of what to do for
> 'partitioned' tasks sharing a resource. I think we've talked about this
> a few times.
>
Yes, there is a lot of chatter in the literature about partitioned task
sets where resource sharing crosses the partitions...

> Since these 'partitioned' tasks do share a resource, they're not really
> all that partitioned,
... and I definitely agree with you on that: where's the partitioning?
Anyway, I also think that, even if that is true for user-space
locks/resources, there are resources that two tasks _have_ to share even
while partitioned, e.g., when they come to compete, in kernel space, for
a lock on the virtual terminal... And that's the only situation where
such a problem makes sense.

This is just to point out one more reason why we are trying something
different (as explained in the other message), and it is light years
away from saying that the approach you proposed is not a good one!

On the contrary, I think all this makes the problem even more
interesting! :-)

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]



2009-07-13 16:33:49

by Chris Friesen

Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

Raistlin wrote:

> Very basically: from the analysis point of view one easy and effective
> solution would be to have the blocked-running tasks --i.e., the tasks
> blocked on some lock that have been left on the rq to proxy-execute the
> lock owner-- busy waiting while the lock owner is running. This allows
> for retaining a lot of nice properties BWI already has, as far as
> analyzability is concerned.
>
> On the other hand, from the practical end efficiency point of view, it
> would be not that difficult to block these otherwise-spinning tasks, in
> order to avoid wasting CPU time too much... The only important thing is
> to properly account the budget of the correct server/group (which
> probably must be the otherwise-spinning task's one), or the analysis is
> gone again! :-O

Could you elaborate on this "proper accounting"?

If task A is blocked waiting for a lock (held by a task B on another
cpu) and we run task C instead, how would you propose that the
accounting be handled?

Chris

2009-07-13 17:25:51

by Peter Zijlstra

Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Mon, 2009-07-13 at 17:44 +0200, Raistlin wrote:
> > PIP doesn't suffer this, but does suffer the pain from having to
> > reimplement the full schedule function on the waitqueues, which when you
> > have hierarchical scheduling means you have to replicate the full
> > hierarchy per waitqueue.
> >
> And, further than this, at least from my point of view, if you have
> server/group based scheduling, and in general some kind of budgeting or
> bandwidth enforcing mechanism in place, PIP is far from being a
> solution...

I think you can extend PIP to include things like bandwidth inheritance
too. Instead of simply propagating the priority through the waitqueue
hierarchy, you can pass the actual task around, and having this task you
can indeed consume its bandwidth etc.

But sure, hierarchical scheduling and things really complicate the
waitqueue implementation.

2009-07-13 17:28:32

by Peter Zijlstra

Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Mon, 2009-07-13 at 17:44 +0200, Raistlin wrote:
> What we would like to achieve is some set of rules that, extending the
> UP ones, yield a situation which is both:
> - analyzable from the real-time theorist's point of view... Which is
> (un?)fortunately what we are :-)
> - possible to implement... Which is not always (un!)fortunately obvious
> here :-)

I would very much like a proper theoretical foundation for whatever we
end up with ;-)

> Very basically: from the analysis point of view one easy and effective
> solution would be to have the blocked-running tasks --i.e., the tasks
> blocked on some lock that have been left on the rq to proxy-execute the
> lock owner-- busy waiting while the lock owner is running. This allows
> for retaining a lot of nice properties BWI already has, as far as
> analyzability is concerned.

Right, practically we cannot do this, since we expose the block graph to
userspace and you could in userspace construct a program that would
exploit this spinning to DoS the system.

2009-07-13 18:20:52

by Noah Watkins

Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

> I think you can extend PIP to include things like bandwidth inheritance
> too. Instead of simply propagating the priority through the waitqueue
> hierarchy, you can pass the actual task around, and having this task you
> can indeed consume its bandwidth etc..

I think I understand what you are suggesting by "pass the actual task
around". If not, this message might seem out of place.

Consider this locking chain/graph:

A-->L1-->B-->L2-->C
D-->L3-->E-->L2

The way I understand what you are suggesting is that tasks A,B,D,E
(or some representation of them) can be passed around such that when
C executes it consumes some resource associated with the tasks it
is blocking (A,B,D,E). Obviously which tasks and what quantities are
dependent on scheduling semantics and configuration.

The biggest implementation hurdle we have encountered is that as
tasks are passed forward along a locking chain, knowledge about the
structure of the locking chain is lost. For example, as C releases
L2, C must be able to distinguish between A and D as having been
passed to C's location through B or E, respectively.

Keeping a representation of the locking chain as a full graph is the
solution we have used. Another solution is to allow A and D to re-
block and walk the locking chain again, but obviously some thundering-
herd issues arise.
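
To illustrate (a toy userspace C sketch with invented names, not our
actual GS code): if each lock remembers its waiters and each task
remembers what it blocks on, C can walk the graph behind L2 and tell
that A arrived through B while D arrived through E:

    #include <stdio.h>

    #define MAX_WAITERS 4
    #define MAX_LOCKS   4

    struct lock;

    struct task {
        const char *name;
        struct lock *blocked_on;           /* NULL when runnable */
    };

    struct lock {
        const char *name;
        struct task *owner;
        struct task *waiters[MAX_WAITERS];
        int nwaiters;
    };

    static struct lock *all_locks[MAX_LOCKS];
    static int nlocks;

    static void block_on(struct task *t, struct lock *l)
    {
        t->blocked_on = l;
        l->waiters[l->nwaiters++] = t;
    }

    /* Walk the full block graph behind lock 'l', remembering the path. */
    static void dump_chain(struct lock *l, struct task *root, const char *via)
    {
        for (int i = 0; i < l->nwaiters; i++) {
            struct task *w = l->waiters[i];

            printf("  %s waits for %s via %s\n", w->name, root->name, via);

            /* Recurse into locks this waiter itself owns. */
            for (int k = 0; k < nlocks; k++)
                if (all_locks[k]->owner == w)
                    dump_chain(all_locks[k], root, w->name);
        }
    }

    int main(void)
    {
        static struct task a = {"A"}, b = {"B"}, c = {"C"}, d = {"D"}, e = {"E"};
        static struct lock l1 = {"L1", &b}, l2 = {"L2", &c}, l3 = {"L3", &e};

        all_locks[nlocks++] = &l1;
        all_locks[nlocks++] = &l2;
        all_locks[nlocks++] = &l3;

        block_on(&a, &l1);            /* A -> L1 -> B */
        block_on(&b, &l2);            /* B -> L2 -> C */
        block_on(&d, &l3);            /* D -> L3 -> E */
        block_on(&e, &l2);            /* E -> L2 -> C */

        /* When C releases L2 it can tell A came through B, D through E. */
        dump_chain(&l2, l2.owner, "L2 directly");
        return 0;
    }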

noah



2009-07-13 20:32:58

by Ted Baker

Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

I am happy that Peter has included me in this discussion. I will
have more to say after I have seen a few more messages, absorbed
the context better, and had some time to think things over. I
guess that, being a newcomer to this discussion, I may re-open
some issues that have already been discussed. However, I would
like to comment on a few things right off.

I would hope that all of the people working on the kernel could
agree on some basic goals, principles, and policies, which could
be kept stable, and would guide detailed design decisions.

In the real-time domain, I would hope that a goal would be to
support application-level analysis of schedulability, in a
modular, decomposable way. That is, it should be possible to
write the time-critical portions of any application in a way that
if the system calls the application makes all succeed, the
application gets a guaranteed level of service that enables it to
meet its timing constraints, regardless of whatever else is going
on in the system.

One part of this is support for a scheduling policy that enables
the application to request some guaranteed minimum level of CPU
bandwidth, and at the same time imposes an upper limit on how much
it can interfere with other tasks. One way this model has been
abstracted is in terms of a pair of parameters for each of these
two bounds: a level of service, expressed as a fraction of CPU(s),
and a lag time.

The lag time is affected by a number of specific factors,
including the maximum time that a thread may be blocked due to
waiting for a lock. The priority inheritance protocol (PIP) has
been proposed as a way of bounding the duration of such blocking
in the context of very simple task models (such as fixed-priority
preemptive scheduling of periodic tasks), where the blocking is
viewed as "priority inversion".

As several message mentioned, and as I observed in the kernel code
(to my disgust), the Linux PIP implementation is a nightmare:
extremely heavy weight, involving maintenance of a full wait-for
graph, and requiring updates for a range of events, including
priority changes and interruptions of wait operations.

I recognize that this complexity is a product of the desire to
provide an implementation that does the right thing in all cases,
but one needs to keep a sense of proportion. When one ends up having
to solve a more complex mutual exclusion problem (on the wait-for
graph and task priorities) in order to implement a mutual
exclusion primitive, you have a case of abstraction inversion--
something is out of whack.

It is sad enough that we have to implement PTHREAD_PRIO_INHERIT
for POSIX portable applications, but to use this internally within
the OS, or to build more complexity on top of it, seems totally
wrong.

For schedulability analysis, one just needs a way to bound the
duration of priority inversion. Simple non-preemption (Linux
spinlock_t) is sufficient for that, and it is easy to implement.
You just have to be careful not to voluntarily suspend (give up
the processor) while holding a lock. The SRP (Ada Ceiling_Locking
and POSIX PTHREAD_PRIO_PROTECT) is a slight refinement, also
extremely lightweight to implement with reasonable restrictions
(e.g., no blocking while holding a lock, priority changes deferred
while a lock is held).
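
For POSIX-level code, the ceiling flavour is about as small as a
locking protocol gets; a minimal user-level sketch (the ceiling value
50 is an assumed number that would have to come from analysis of the
mutex's users):

#include <pthread.h>

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_PROTECT);
    pthread_mutexattr_setprioceiling(&attr, 50);  /* assumed ceiling */
    pthread_mutex_init(&m, &attr);

    pthread_mutex_lock(&m);
    /* critical section: runs at priority >= 50, so no other
     * potential user of this mutex can preempt it here */
    pthread_mutex_unlock(&m);

    pthread_mutex_destroy(&m);
    pthread_mutexattr_destroy(&attr);
    return 0;
}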

The priority inheritance protocol (PIP) has always been a problem
for schedulability analysis, because priority inversion is not
bounded to a single critical section; one must account for the
worst-case chain of blockings -- even before one considers the
distributed overhead (meaning overhead paid by non-users of
the feature) and the implementation complexity.

The only selling point for PIP has been the ability of a thread to
suspend itself while holding a lock, such as to wait for
completion of an I/O operation. I would argue that this practice
is generally a sign of poor design, and it certainly throws out
the notion of bounding the priority inversion due to blocking on a
lock for schedulability analysis -- since now the lock-holding
time can depend on I/O completion time, timers, etc.

I do have sympathy for what you seem to be calling the "proxy"
implementation model of priority inheritance. That is, the
scheduler chooses a process, and if it is blocked, walks down the
wait-for chain until it finds a process that is not blocked. A
beauty of this approach is that you only need to maintain the one
wait-for link, and you only modify it when a task blocks. There
is a small overhead on the lock and unlock operations, but no other
overhead, since one never traverses these links unless blocking
has occurred and one is about to schedule a blocked process. In
particular, one gets around a case that I found very nasty if one
tries to do all the inheritance-caused priority changes
explicitly; that is, when a mutex-wait operation times out you have
to recompute all the priorities, and this involves traversal
of the reverse-wait-for relation, which is many-one.
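
In rough pseudo-C (with invented field names, not any particular
system's interface), the whole mechanism is little more than:

struct task {
    struct task *waits_for;  /* owner of the lock we block on, or NULL */
    /* ... whatever else the policy needs ... */
};

/*
 * Pick whatever task the scheduling policy wants; if it is blocked,
 * follow the single wait-for link until an unblocked task is found
 * and run that one in its stead (charging/prioritizing per policy).
 */
static struct task *resolve_proxy(struct task *chosen)
{
    struct task *t = chosen;

    while (t->waits_for)
        t = t->waits_for;
    return t;
}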

The first case of its use that I saw published was for the
real-time Unix variant developed for the Harris SMP computers,
which were sold under the name Nighthawk. I don't recall how they
handled some of the details I've seen mentioned in these recent
messages. I have not been able to find any on-line publications
for that version of Unix (which I think may be what I see cited a
few places on the web as CX/RT). I once had a copy of the book
Real-time Unix Systems, which I think might have touched on this
(http://www.flipkart.com/real-time-unix-systems-borko/0792390997-y6w3fq6ipb),
but a student walked away with my copy. I just sent an e-mail to
Borko Furht, one of the authors, and will let you know if I can get
any information from him.

This model does appear to extend to models of scheduling
appropriate for virtual processors, which enforce both upper and
lower bounds on processor bandwidth (e.g., sporadic server,
constant-bandwidth server), where we do want to allow tasks that have
reached their current budget allocation to continue if they are
holding a lock, long enough to release the lock. However, that
can also be accomplished using non-preemption, or the SRP.

Regarding the notion of charging proxy execution to the budget of
the client task, I have grave concerns. It is already hard enough
to estimate the amount of budget that a real-time task requires,
without this additional complication.

Ted Baker

2009-07-13 21:45:26

by Chris Friesen

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

Ted Baker wrote:

> I recognize that this complexity is a product of the desire to
> provide an implementation that does the right thing in all cases,
> but one needs keep a sense of proportion. When one ends up having
> to solve a more complex mutual exclusion problem (on the wait-for
> graph and task priorities) in order to implement a mutual
> exclusion primitive, you have a case of abstraction inversion--
> something is out of whack.

Given that the semantics of POSIX PI locking assumes certain scheduler
behaviours, is it actually abstraction inversion to have that same
dependency expressed in the kernel code that implements it?

> For schedulability analysis, one just needs a way to bound the
> duration of priority inversion. Simple non-preemption (Linux
> spinlock_t) is sufficient for that, and it is easy to implement.
> You just have to be careful not to voluntarily suspend (give up
> the processor) while holding a lock.

The whole point of mutexes (and semaphores) within the linux kernel is
that it is possible to block while holding them. I suspect you're going
to find it fairly difficult to convince people to switch to spinlocks just
to make it possible to provide latency guarantees.

> The only selling point for PIP has been the ability of a thread to
> suspend itself while holding a lock, such as to wait for
> completion of an I/O operation.

You're comparing a full-featured PI implementation with a stripped-down
PP (priority protection, aka priority ceiling) approach. In an
apples-to-apples comparison, the selling point for PI vs PP is that
under PIP the priority of the lock holder is automatically boosted only
if necessary, and only as high as necessary. On the other hand, PP
requires code analysis to properly set the ceilings for each individual
mutex.
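
Concretely, with PI the per-mutex setup carries no analysis-derived
value at all; a minimal user-level sketch:

#include <pthread.h>

pthread_mutex_t m;

void init_pi_mutex(void)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    /* no ceiling to compute: the holder is boosted only when, and
     * only as high as, a blocked waiter requires */
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);
}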

> I would argue that this practice
> is generally a sign of poor design, and it certainly throws out
> the notion of bounding the priority inversion due to blocking on a
> lock for schedulability analysis -- since now the lock-holding
> time can depend on I/O completion time, timers, etc.

Certainly if you block waiting for I/O while holding a lock then it
impacts the ability to provide latency guarantees for others waiting for
that lock. But this has nothing to do with PI vs PP or spinlocks, and
everything to do with how the lock is actually used.

> Regarding the notion of charging proxy execution to the budget of
> the client task, I have grave concerns. It is already hard enough
> to estimate the amount of budget that a real-time task requires,
> without this additional complication.

Agreed.

Chris

2009-07-13 23:48:00

by Doug Niehaus

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

Ted has written a lot that I should think about for some time before
dealing with too many of the details, but one issue is worth clarifying,
I think.

The Group Scheduling framework is intended to permit flexible
configuration of *any* scheduling semantics, RT or otherwise that is
desired. We have done CPU share, application progress, and others.
Further, it is intended to permit applications running under multiple
programming models to coexist, although this requires higher levels of
the system Group Scheduling hierarchy to describe how conflicts among
applications are resolved.

In that context, I coined the term "proxy execution" (I have no idea if
anyone else had coined it before, but it seems an obvious idea, so they
may have) as a *generalization* of Priority Inheritance, because it was
a way to think about the issue, and implement it correctly for a mixture
of scheduling semantics, if not as cheaply as I might like.

Thus, the idea is that if A -blocked-on-> B, then A is still considered
by the scheduling hierarchy, under whatever semantics it may be using,
priority or otherwise. If everyone is using priority semantics, then the
choice of the proxy calculation is the *same as* it would be under
Priority Inheritance, but arrived at by a different mechanism. If A is
under explicit plan scheduling, and B is under CFS or CPU share, then if
A would be chosen by the system if it were not blocked, then B is run as
its proxy until it gets out of A's way. In this case no priorities exist,
so Proxy Execution is not a "way of implementing Priority Inheritance".

Ted is correct that for any specific configuration of system semantics
supporting RT applications using any of several RT programming models,
schedulability is a key feature. I think that it is possible, in many
cases, and that GS will support schedulability in many cases when
configured correctly.

However, some of the many open questions include:

(1) Which system services, *if any*, can RT threads use and still support
"acceptable" schedulability analysis under a given RT programming model?

(1.1) Will the use of system services (system calls) by RT threads
need to be explicitly supported by, perhaps, explicitly making the
schedulers of *other* threads also using those system calls aware of
that and take it into account to make those other tasks non-preemptible
while holding system call related locks.

(1.2) Can Linux SMP ever support this in an acceptable manner since
locks associated with system services are shared across CPU boundaries.
Thus, I can isolate a set of RT tasks on a CPU easily, and they are well
isolated and can run under strict predictability, *until they use a
system call that uses a lock*. Then, the system call is an interaction
channel with every other thread on the system using the same system call.

(1.2.1) The only alternative to the obvious properties I can
think of here is to make each CPU run a separate copy of the OS and use
disjoint sets of devices. At which point we have a distributed system
in a box. Fundamentally, I think solutions to this problem are simply
more expensive than people are yet willing to accept. Alternately, of
course, I may just be thinking that to salve my disappointment at not
thinking of a cheap solution.

(2) Is the overhead of various mechanisms for tackling RT scheduling and
schedulability analysis going to be a barrier for SMP solutions, and if
so for how long, before people may just have to accept that "it really
does cost that much".

Of more personal concern:

(3) Is the generality of GS going to cost more than is feasible for some
applications. I can confidently say "yes" to the simplest form of this
question. The more subtle form is, "For what percentage of all possible
applications will the generality cost of GS be acceptable in return for
its configurability?" I think the percentage will be high, or I would
not have kept working on it for so long. Whether any of those qualifying
applications are RT is less clear.

One final observation as I must go for now:

The traditional levels of overhead viewed as acceptable for making some
of these decisions may not be attainable in SMP systems seeking to
support multiple threads running under different programming models.

Unfortunately if a consensus on a new "minimum feasible overhead" is
required, it could take quite a while, because if I think my approach is
at the "new minimum" but others do not, I am faced with proving a
negative - that no other approach can do it more cheaply, and that is,
of course, notoriously difficult.

-Doug

Ted & Syauchen Baker wrote:
> Before committing to a locking protocol, such as the (1) or (2) below, I
> would like to see explanation of how the schedulability analysis would be
> done with whatever model is being proposed. Since we are talking about SMP,
> this should include (at least) an analysis for the partitioned scheduling
> model and the global scheduling model, and preferably also for hybrids
> (tasks restricted to various subsets of processors), at least for
> fixed-priority scheduling, but preferably also for deadline scheduling, and
> for aperiodic server scheduling algorithms of both classes. To do less,
> would be making a huge commitment on behalf of the kernel support group and
> future application programmers, without any assurance that the mechanism
> will be workable in the long term.
>
> Witness the difficulties caused by premature standardization of POSIX
> PTHREAD_PRIO_INHERIT, and the bug that is standardized SCHED_SPORADIC, to
> see what we want to avoid.
>
> Ted
>
> -----Original Message-----
> From: Douglas Niehaus [mailto:[email protected]]
> Sent: Saturday, July 11, 2009 10:41 PM
> To: Peter Zijlstra
> Cc: Henrik Austad; LKML; Ingo Molnar; Bill Huey; Linux RT; Fabio Checconi;
> James H. Anderson; Thomas Gleixner; Ted Baker; Dhaval Giani; Noah Watkins;
> KUSP Google Group
> Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel
>
> Peter:
> Perhaps you could expand on what you meant when you said:
>
> Thing is, both BWI and PEP seems to work brilliantly on
> Uni-Processor
> but SMP leaves things to be desired. Dhaval is currently working on
> a
> PEP implementation that will migrate all the blocked tasks to the
> owner's cpu, basically reducing it to the UP problem.
>
> What is left to be desired with PEP on SMP? I am not saying it is
> perfect, as I can think of a few things I would like to improve or
> understand better, but I am curious what you have in mind.
>
> Absent a clearer idea of what you had in mind, I can certainly discuss
> the tradeoffs Noah and I have considered over time, and which we think
> motivates our approach.
>
> When Noah and I have talked about this topic over the quite extended
> time, several years, we have been working on it, there have always
> seemed two choices:
>
> 1) Move the proxy (the resource owner) to the CPU with the blocked task
> 2) Move the "scheduling profile" of the blocked task to the CPU where
> the proxy is.
>
> For Proxy Execution under Group Scheduling we have considered both over
> time. Consider the situation where thread A on CPU0 is blocked on a
> resource held by thread B on CPU1. When we considered (1), it has the
> advantage of ensuring that B will run on CPU0, unblocking A, if A (or B)
> is still the best choice at the time it has been successfully moved from
> CPU1 -> CPU0. That might not be true after the delay of moving the process.
>
> We decided to emphasize (2) because it was more interesting in our view
> because it was cheaper and seemed no more complicated although its
> complications are different than (1). Its complication is, of course,
> that while we have worked out how to add the "avatar" of A to the set
> considered by the GS hierarchy on CPU1, it depends on the scheduling
> semantics as configured whether the avatar of A is chosen as "best" and
> thus how long it will be until B runs long enough to release the
> resource and unblock A on CPU1.
>
> We have always viewed that as complicated, but appropriately so to the
> problem. It depends inherently on the semantics of threads A, B, and all
> other threads on CPU1 that are ready to run, among whom the "best" is
> chosen by the GS hierarchy. We think it is inherently the problem of the
> scheduling configuration to take this trade-off into account.
>
> We have also thought being able to do both (1) and (2) is best, but
> which is best to use in a given situation depends on the comparative
> cost of (X) running B on CPU1 long enough to unblock A and (Y) the cost
> of moving B from CPU1->CPU0 to run long enough to unblock A, and then
> move it back from CPU0->CPU1 since its designated CPU assignment is CPU1.
> Our decision after many hours of discussion over many months has been
> that the cost of (X) seems a lot more attractive than (Y).
>
> Part of our preference is that we are still working with semaphores as
> resources. Since most critical sections are supposed to be "short",
> then scheduling semantics taking the proxy execution periods into
> account on the "foreign" CPUs would actually be easier/better than the
> double thread movement.
>
> Both problems seem quite hard and I do not think I have yet "completely
> figured it out". While the "mass migration" you say Dhaval is working on
> would "reduce" the problem to the UP case, I think it would create more
> complexity for analysis than it eliminates. A form of thrashing seems a
> real danger. In this case, that threads would be moving from CPU to CPU
> so much it would be a real drain on resources and constraint on system
> performance. However, Dhaval may well understand the cost implications of
> thread migration better than I do.
>
> The real core of the problem, it seems to me, is that the proxy
> relationship among threads depends on what resources can be held by
> them. I think that problem is "relatively easy" in the set of locks
> associated with a multi-threaded application.
>
> When the resources causing blocking can be *any* lock in the kernel
> associated with *any* system service that might be used by *any* thread,
> it is complicated enough to make my brain hurt. However, we think the GS
> framework makes it relatively easy, perhaps that would be better said as
> "as easy as it can be", to implement any combination of thread migration
> and avatars desired by a given scheduling semantics.
>
> In that sense Noah and I feel that GS is a "complete" framework in that
> it is possible to configure any semantics desired as easily as it can be
> done by any framework. Obviously that does not resolve the question of
> what semantics it is best to desire for a given system which remains
> quite complicated and highly dependent on the specific application
> semantics.
>
> Noah and I thought the relatively low cost of creating the avatar was
> quite attractive, and so we decided on a GS configuration using it to
> experiment with in specifying the scheduling semantics. The first two
> approaches we want to experiment with are (*) to view the composite
> scheduling hierarchy for all CPUs as a whole, and let the avatar of A
> take its chances on CPU1, and (**) to view resolution of blocking as
> most important system wide, so we have the avatar viewed as "best" long
> enough for its proxy to release the resource.
>
> The bottom line, in our view, is that no single semantics will be viewed
> as either "best" or even acceptable for all applications as is the case
> with schedulers, so we wanted to emphasize configurability.
>
> We have performed basic tests showing the avatars can be chosen and
> resolve the blocking relationship. More complex tests await the
> completion of our port of GS and the other KUSP subsystems to 2.6.29.
>
> Doug
>
> Peter Zijlstra wrote:
>
>> On Fri, 2009-07-10 at 23:50 +0200, Henrik Austad wrote:
>>
>>
>>> Hi all!
>>>
>>> This is a proposal for a global [1], deadline driven scheduler for
>>> real-time tasks in the Linux kernel. I thought I should send out an RFC to
>>> gather some feedback instead of wildy hack away at it.
>>>
>>> This proposed scheduler is a modified MLLF (modified Least Laxity First)
>>> called Earliest Failure First (EFF) as it orders tasks according to when
>>> they will miss their deadlines, not when the actual deadline is.
>>>
>>>
>> <snip>
>>
>> Everybody agrees we want a deadline scheduler, we'll probably put a user
>> interface into -rt shortly which should work for all the involved
>> research groups so that we can share tests and have better comparisons.
>>
>> The only thing (aside from an utter lack of time to work on things
>> recently) that has been holding us back is a proper solution to the
>> priority inversion issue.
>>
>> I haven't fully read through the proposed algorithm below, and left it
>> in place for the new people on CC.
>>
>> As already mentioned on IRC, the fact that you push the work to the last
>> possible moment indicates that this algorithm will utterly fall apart on
>> overload and would thus be unsuited for soft-rt loads, but I guess we
>> could implement things like EDF-fm and keep this as a hard-rt class.
>>
>>
>>
>>> === Notation ===
>>>
>>> - Take a set of tasks with corresponding attributes. This set and their
>>> attributes are called the schedule, 'S' and contains *all* tasks for
>>> the given scheduling class (i.e. all EFF-tasks).
>>>
>>> - Consider a multi-core system with 'm' processors.
>>>
>>> - Let the i'th task in the schedule be denoted tau_i. [3]
>>>
>>> - Each task will run in intervals, each 'round' is called a job. A task
>>> consists of an infinite sequence of jobs. The k'th job of tau_i is
>>> called tau_{i,k}
>>>
>>> - Each task has a set of (relative) attributes supplied when the task is
>>> inserted into the scheduler (passed via syscall)
>>> * Period T_i
>>> * Deadline D_i
>>> * WCET C_i
>>>
>>> - Each job (tau_{i,k}) has absolute attributes (computed from the relative
>>> tasks-attributes coupled with physical time).
>>> * Release-time r_{i,k}
>>> * Deadline d_{i,k}
>>> * Allocated time so far for a job, C_a(t, tau_{i,k})
>>> When C_a equals WCET, the jobs budget is exhausted and it should
>>> start a new cycle. This is tested (see below) by the scheduler.
>>> * Remaining time for the job, C_r(t, tau_{i,k})
>>>
>>> - The acceptance function for EFF screens new tasks on their expected
>>> utilization. Depending on the mode and implementation, it can be based
>>> on the period, or on the deadline. The latter will cause firmer
>>> restraints, but may lead to wasted resources.
>>>
>>> U = C_i / T_i For SRT (bounded deadline tardiness)
>>> U = C_i / D_i For HRT
>>>
>>> - A relative measure, time to failure, ttf, indicates how much time is
>>> left before a job must be scheduled to run in order to avoid a
>>> deadline-miss. This will decrease as time progresses and the job is
>>> not granted CPU time. For tasks currently running on a CPU, this value
>>> will be constant.
>>>
>>> Take a job with a WCET of 10ms, it has been allowed to run for 4
>>> ms so far. The deadline is 8 ms away. Then the job must be
>>> scheduled to run within the next 4 ms, otherwise it will not be
>>> able to finish in time.
>>>
>>> - An absolute value, time of failure (tof) can also be computed in a
>>> static manner. For tasks not running on a CPU, the allocated time is
>>> static. That means you can take the absolute deadline, subtract the
>>> allocated time and you have the absolute point in time when a given
>>> job will fail to meet its deadline.
>>>
>>> === Outline of scheduler ===
>>>
>>> Store tasks in 2 queues. One of size m, containing all the tasks
>>> currently running on the CPUs (queue R). The other will hold all
>>> currently active tasks waiting to execute (queue W).
>>>
>>> queue R is sorted based on ttf (time to failure, the relative time left
>>> until a task will miss its deadline). As the task approaches the
>>> absolute time of failure at the same rate C_a increases, ttf is
>>> constant. R is only a 'map' of tasks to the CPUs. Position 0 in R
>>> (i.e. smallest ttf) does not result in CPU#0, as the position->CPU will
>>> be quite fluent.
>>>
>>> queue W is sorted based on absolute time of failure (tof). Since this is
>>> a fixed point in time, and the tasks in W are not running (C_a is
>>> unchanged), this value is constant.
>>>
>>> When a task is scheduled to run, a timer is set at the point in time
>>> where it has exhausted its budget (t_now + WCET - C_a). This is to
>>> ensure that a runaway task does not grab the CPU.
>>>
>>> When a new task arrives, it is handled according to the following rules:
>>> - The system has one or more CPUs not running EFF-tasks. Pick any of the
>>> free CPUs and assign the new job there. Set a timer to
>>>
>>> - All CPUs are busy, the new task has greater time to failure than the
>>> head of W. The task is inserted into W at the appropriate place.
>>>
>>> - All CPUs are busy and the new task has smaller time to failure than
>>> the head of W. The new task is compared to the last task in R. If time
>>> to failure is larger than the task at the tail, it is added to the
>>> head of W.
>>>
>>> - If all CPUs are busy, and time to failure is smaller than the tail of
>>> R, the new task is a candidate for insertion. At this point the tasks
>>> must be compared to see if picking one or the other will cause a
>>> deadline-miss. If both will miss the deadline if the other is
>>> scheduled, keep the existing running and place the new at the head of
>>> W (as you will have a deadline-miss anyway unless the task is
>>> picked up by another CPU soon).
>>>
>>> - A task running on a CPU with ttf=0 should *never* be preempted with
>>> another task. If all tasks in R have ttf=0, and a newly arrived task
>>> has ttf=0, a deadline-miss is inevitable and switching tasks will only
>>> waste resources.
>>>
>>> When a task in R finishes (or is stopped due to the timer-limit), it is
>>> removed from R, and the head of W is added to R, inserted at the
>>> appropriate place.
>>>
>>> There has been some discussion lately (in particular on #linux-rt) about
>>> the bandwidth inheritance (BWI) and proxy execution protocol (PEP). It
>>> should be possible to extend EFF to handle both. As a side note, if
>>> anyone has some good information about PEP, I'd like a copy :)
>>>
>>> Based on this, I think the utilization can be set as high as M
>>> (i.e. full utilization of all CPUs), but the jitter can probably be
>>> quite bad, so for jitter-sensitive tasks, a short period/deadline should
>>> be used.
>>>
>>> There are still some issues left to solve, for instance how to best
>>> handle sporadic tasks, and whether or not deadline-misses should be allowed,
>>> or just 'bounded deadline tardiness'. Either way, EFF should be able to
>>> handle it. Then, there are problems concerning blocking of tasks. One
>>> solution would be BWI or PEP, but I have not had the time to read
>>> properly through those, but from what I've gathered a combination of BWI
>>> and PEP looks promising (anyone with good info about BWI and PEP - feel
>>> free to share! (-: ).
>>>
>>>
>> Our SSSUP friends have a BWI paper here:
>>
>> http://retis.sssup.it/~tommaso/publications/OSPERT-2008.pdf
>>
>> The thing we call PEP was christened so by Douglas Niehaus (on CC), I'm
>> not sure if he has any papers on it.
>>
>> Also, when talking about it at OSPERT last week Ted Baker (also on CC)
>> said it reminded him of something else of which I seem to have forgotten
>> the name.
>>
>> Thing is, both BWI and PEP seems to work brilliantly on Uni-Processor
>> but SMP leaves things to be desired. Dhaval is currently working on a
>> PEP implementation that will migrate all the blocked tasks to the
>> owner's cpu, basically reducing it to the UP problem.
>>
>>
>>
>>> 1) Before you freeze at 'global' and get all caught up on "This won't
>>> ever scale to X", or "He will be haunted by Y" - I do not want to
>>> extend a global algorithm to 2000 cores. I would like to scale to a
>>> single *chip* and then we can worry about 2-way and 4-way systems
>>> later. For the record, I've donned my asbestos suit anyway.
>>>
>>>
>> My preferred approach here is to find a distributed algorithm that
>> converges to the global one.
>>
>>
>>
>>> 2) http://austad.us/kernel/thesis_henrikau.pdf
>>>
>>> 3) Anyone want to include LaTeX-notation into an email-rfc?
>>>
>>>
>> Not unheard of ;-)
>>
>>
>>
>
>

2009-07-14 07:27:46

by Chris Friesen

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

Douglas Niehaus wrote:

> (1.1) Will the use of system services (system calls) by RT threads
> need to be explicitly supported by, perhaps, explicitly making the
> schedulers of *other* threads also using those system calls aware of
> that and take it into account to make those other tasks non-preemptible
> while holding system call related locks.
>
> (1.2) Can Linux SMP ever support this in an acceptable manner since
> locks associated with system services are shared across CPU boundaries.
> Thus, I can isolate a set of RT tasks on a CPU easily, and they are well
> isolated and can run under strict predictability, *until they use a
> system call that uses a lock*. Then, the system call is an interaction
> channel with every other thread on the system using the same system call.


This may be a terminology issue, but I think it would be more correct to
say that the lock is the interaction channel, not the system call, and
it is an interaction channel with every other entity on the system using
the same lock. Depending on the specific lock, this could be other
userspace tasks, kernel threads, or even hardware interrupt handlers.

This goes back to your first question of which system services an RT
task can use while maintaining schedulability analysis. Unfortunately
this may be a moving target, since the exact answer depends on what
locks are taken in the underlying kernel implementation.

Chris

2009-07-14 07:44:56

by Doug Niehaus

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel


Chris Friesen wrote:
> Douglas Niehaus wrote:
>
>
>> (1.1) Will the use of system services (system calls) by RT threads
>> need to be explicitly supported by, perhaps, explicitly making the
>> schedulers of *other* threads also using those system calls aware of
>> that and take it into account to make those other tasks non-preemptible
>> while holding system call related locks.
>>
>> (1.2) Can Linux SMP ever support this in an acceptable manner since
>> locks associated with system services are shared across CPU boundaries.
>> Thus, I can isolate a set of RT tasks on a CPU easily, and they are well
>> isolated and can run under strict predictability, *until they use a
>> system call that uses a lock*. Then, the system call is an interaction
>> channel with every other thread on the system using the same system call.
>>
>
>
> This may be a terminology issue, but I think it would be more correct to
> say that the lock is the interaction channel, not the system call, and
> it is an interaction channel with every other entity on the system using
> the same lock. Depending on the specific lock, this could be other
> userspace tasks, kernel threads, or even hardware interrupt handlers.
>
>
Sorry, sloppiness on my part while typing quickly, resulting in the
terminology problem of my using the wrong terminology.

Yes, the lock is the interaction channel.

Admittedly, which locks are the interaction channels is correlated with
which system services are used by threads, but sometimes more and
sometimes less strongly correlated.

When I explain it to less expert audiences than this, I tend to talk in
terms of the system services because they at least know what some of
them are, while many have no idea what the concurrency control in the
kernel is. They can fairly easily see that if RT tasks use a range of
services used by generic Linux tasks, then some interaction between RT
and non-RT tasks is a result.

Still, no excuses, only explanation. Sorry to have over-simplified to
this audience.

Thanks for clarifying.
> This goes back to your first question of which system services an RT
> task can use while maintaining schedulability analysis. Unfortunately
> this may be a moving target, since the exact answer depends on what
> locks are taken in the underlying kernel implementation.
>
> Chris
>
Yes. This is true and is also the point I was trying to make. When
talking to people about RT over the years I have observed that it is
often hard to communicate the full cost of the predictability required
for RT tasks.

Extremely detailed information is required, and getting it can be
expensive. This is one reason why supporting RT in Linux proper is even
harder than it first appears.

However, I think that while completely arbitrary use of Linux system
services by RT tasks is extremely complicated, many RT applications can
be happy using only a small subset of services, and so various classes
of applications can be supported successfully with merely extreme
effort, as opposed to completely insane effort.

It is a really hard problem, no doubt, though.

Doug

2009-07-14 08:42:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Mon, 2009-07-13 at 18:06 +0200, Raistlin wrote:
> Anyway, maybe if, on some architecture, for some kind of application,
> the affinity may have been set to preserve some kind of actual cache or
> memory locality for the task access pattern, maybe this could be an
> issue, couldn't it? :-)
> I mean, in some case where being sure of having a task running on a
> particular CPU is somehow of paramount importance...

Right, and you answered your own question :-), it's _running_ that is
important, so as long as it's blocked (not running), you're free to place
the task on another cpu if that helps out with something.


2009-07-14 09:15:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Mon, 2009-07-13 at 11:14 -0700, Noah Watkins wrote:
> > I think you can extend PIP to include things like bandwidth
> > inheritance
> > too. Instead of simply propagating the priority through the waitqueue
> > hierarchy, you can pass the actual task around, and having this
> > task you
> > can indeed consume its bandwidth etc..
>
> I think I understand what you are suggesting by "pass the actual task
> around". If not, this message might seem out of place.
>
> Consider this locking chain/graph:
>
> A-->L1-->B-->L2-->C
> D-->L3-->E-->L2

C
|
L2
/ \
B E
| |
L1 L3
| |
A D

> The way I understand what you are suggesting is that tasks A,B,D,E
> (or some representation of them) can be passed around such that when
> C executes it consumes some resource associated with the tasks it
> is blocking (A,B,D,E). Obviously which tasks and what quantities are
> dependent on scheduling semantics and configuration.
>
> The biggest implementation hurdle we have encountered is that as
> tasks are passed forward along a locking chain knowledge about the
> structure of the locking chain is lost. For example, as C releases
> L2, C must be able to distinguish between A and D as having been
> passed to C's location through B or E, respectively.
>
> Keeping a representation of the locking chain as a full graph is the
> solution we have used. Another solution is to allow A and D to re-
> block and walk the locking chain again, but obviously some thundering-
> herd issues arise.

Right, we too keep the full graph (as Ted has noted with disgust).

Douglas mentions that since there is no priority in things like
proportional bandwidth schedulers (CFS), priority inheritance cannot
work.

I would beg to differ in that a straightforward extension of the
protocol can indeed work, even for such cases.

One needs to change the sorting function used in the waitqueues (Li) to
reflect the full scheduler function (which in itself can be expressed as
a (partial?) sorting function).

One needs to pass along the 'highest' task, not simply the priority.

And, one needs to cancel and re-acquire this task's contention whenever
the leading task (C) gets preempted.

We let the scheduler use and consume the task that is passed up as being
the 'highest' in C's stead (it might actually be C).


For example, take the above blocking graph and add a single unrelated
task F, all tasks of equal and unit (1) weight. Now traditionally that'd
result in 2 runnable tasks of equal weight, such that they both get 50%
cpu time (assuming single cpu).

However, PEP and the above modified PIP will in fact result in C
receiving an effective weight of 5 versus 1 for F.

The sorting function for proportional bandwidth schedulers is the
typical least service received one -- such that the task with least
service will be deemed the 'highest' one.

Now suppose that task is found to be A; we'll run C with A's values
until its quantum is consumed and we'd force preempt. Normally
that would lead to the selection of F, however if we first cancel and
retry A, leading to a re-sorting of the graph, we might well find that E
is the 'highest' task in the graph, and also more eligible than F.
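
In rough pseudo-C (invented names and fields, not the existing rtmutex
code), the per-lock step could look something like this:

struct task {
    unsigned long long service;   /* least service == 'highest' here */
    struct task *boosted_by;      /* whose parameters we run with */
};

struct lock {
    struct task *owner;
    struct task *waiters[16];     /* transitively blocked: A, B, D, E */
    int nr_waiters;
};

/*
 * (Re)sort on every evaluation: pick the waiter with least service
 * and let the lock owner run in its stead, consuming that budget.
 */
static void charge_owner_to_highest_waiter(struct lock *l)
{
    struct task *best = l->owner;  /* the owner itself may be 'highest' */
    int i;

    for (i = 0; i < l->nr_waiters; i++)
        if (l->waiters[i]->service < best->service)
            best = l->waiters[i];

    l->owner->boosted_by = best;
}

/*
 * When that quantum is consumed or the owner is preempted, cancel the
 * donation and call charge_owner_to_highest_waiter() again; the re-sort
 * may now select a different waiter (E instead of A, say).
 */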


Furthermore Ted raises the point:

> Regarding the notion of charging proxy execution to the budget of
> the client task, I have grave concerns. It is already hard enough
> to estimate the amount of budget that a real-time task requires,
> without this additional complication.

I hope the above illustrates the use of this; furthermore, I think Dario
and team did some actual analysis on it wrt deadline schedulers. Dario,
could you expand?

~ Peter

2009-07-14 09:36:20

by Dario Faggioli

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Tue, 2009-07-14 at 10:42 +0200, Peter Zijlstra wrote:
> On Mon, 2009-07-13 at 18:06 +0200, Raistlin wrote:
> > Anyway, maybe if, on some architecture, for some kind of application,
> > the affinity may have been set to preserve some kind of actual cache or
> > memory locality for the task access pattern, maybe this could be an
> > issue, couldn't it? :-)
> > I mean, in some case where being sure of having a task running on a
> > particular CPU is somehow of paramount importance...
>
> Right, and you answered your own question :-), it's _running_ that is
> important, so as long as it's blocked (not running), you're free to place
> the task on another cpu if that helps out with something.
>
Yep! Re-reading both your and my comments I saw I misunderstood your
point! :-(

I agree that you have to move some task and, moving the "blocked" ones,
would allow the lock-owner to continue running in its place, which
sounds good to me too. :-)

Sorry!
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]


2009-07-14 10:48:03

by Dario Faggioli

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Mon, 2009-07-13 at 10:33 -0600, Chris Friesen wrote:
> > On the other hand, from the practical and efficiency point of view, it
> > would be not that difficult to block these otherwise-spinning tasks, in
> > order to avoid wasting CPU time too much... The only important thing is
> > to properly account the budget of the correct server/group (which
> > probably must be the otherwise-spinning task's one), or the analysis is
> > gone again! :-O
>
> Could you elaborate on this "proper accounting"?
>
Well, I can try, but that's exactly the trickiest part! :-O

> If task A is blocked waiting for a lock (held by a task B on another
> cpu) and we run task C instead, how would you propose that the
> accounting be handled?
>
Ok, here we are.

From the point of view of real-time theory, all is about having the (m,
on an m-CPU system) highest priority task(s) always running. That's
something a real-time theorist could kill to achieve! :-D

That is basically --well, as far as I've understood it so far! :-)--
because if you are sure you always have the highest priority task(s) up
and running, the schedulability analysis is "simple", and there are
zillions of schedulability tests that can be applied and that will work
more or less out of the box!

The capability of a task to block complicates things in the sense that,
if (one of) your highest priority task(s) blocks, you end up
executing someone else, invalidating our previous assumption. Moreover,
if priority inversions can also happen, well, guys, we are screwed! :-(

I think it is quite easy to figure out that blocking protocols, e.g. PI,
P(C)P, SRP, etc., act in the direction of mitigating exactly that kind
of issue, agreed?
Now, if you come to server/group based scheduling[*] this is even nicer.
In fact, if you quickly go through the paper Peter pointed out (which is
our implementation of BWI --an incarnation of PEP, actually--) or the
more theoretical one that describes the protocol in detail[1], you will
find out that having proxying in place completely removes the blocking!
How is that possible? Well, when a task A blocks on B, then B executes
when and *where* A should: i.e., B is temporarily added to A's group, uses
its priority/deadline and *consumes* its budget. This means that A's
group is always ready to run whenever it becomes the highest priority
group, even if A is blocked, which is why we say there is no more room
for blocking.
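
In (very) sketchy C, under assumed CBS-like server structures with
invented field names, the accounting is basically:

struct server {
    long long budget_ns;     /* remaining budget of the group */
    long long deadline_ns;   /* absolute deadline used by EDF */
};

struct task {
    struct server *own_srv;  /* server the task natively belongs to */
    struct server *run_srv;  /* server it is currently charged to */
};

/* A blocks on a lock held by B: B is temporarily attached to A's
 * server, so it runs when and where A would have, on A's budget. */
void bwi_block(struct task *a, struct task *b)
{
    b->run_srv = a->own_srv;
}

/* Execution is always charged to the server the task runs under, so
 * A's group keeps consuming budget as if A were running. */
void account_exec(struct task *t, long long ran_ns)
{
    t->run_srv->budget_ns -= ran_ns;
}

/* When B releases the lock, it falls back to its own server. */
void bwi_unblock(struct task *b)
{
    b->run_srv = b->own_srv;
}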

Now, all the above is true on UP setups. Extending to SMP (m CPUs) and
considering the first part of my draft proposal, i.e., having the
proxying tasks busy waiting (would say "spinning", but that busy wait is
interruptible, preemptable, etc.) on some CPU while their proxy is being
scheduled, we are back to the case of having the m highest entity
running... And thus we are happy!
This is because, again, given the holding of that assumption, all the
existent schedulability analysis automagically start working again,
without the need of accounting for any blocking term or blocking time
duration.

What we like most with this is that it means you can decide the
bandwidth (e.g., budget/period) you want to assign to each task group,
run one of the available scheduling tests --with and only with those
values!!-- to see if it fits, turn the light on and
_have_it_properly_enforced_,
without the need of clairvoyance on task blocking patterns and
durations.
Moreover, if you are a hard real-time guy, and you have knowledge
of the length of your critical sections, you can use it to properly
dimension the budget of your server/group... But I don't think this is
the Linux case, is it? :-P

And so we are done!

Well, more or less. In fact, if you want (and we want! :-D) to go a step
further, and consider how to remove the --possibly quite long on Linux, as
Peter said-- waste of CPU time due to busy waiting, what you can do is
to actually block a proxying task, instead of having it "spinning", so
that some other task in some other group, which may not be one of the m
highest prio ones, can reclaim that bandwidth... But... Ladies and
gentlemen, here it is: BLOCKING. Again!! :-(
That's where we are right now, trying to find some possible solution to
avoid reintroducing from the window what we are trying to kick out from
the door, striving for a good compromise between pure theory and
practical viability.

A draft solution could be to act as said right above, but also pretend
that the correct groups/tasks have executed, such that the effects of
the blocking would be, let's say, absorbed... So, yes, something similar
to what you are arguing: decrease the budget of B (B's group?) when C is
being run.
I agree, this raises issues as well, e.g., there is another kind of
"inheritance" to take care of, also, how many and which task/group's
budgets are we going to affect?, and so on... But it's just a first shot
in the dark.

Ok, I see I've definitely written too much... Sorry about that... I hope
I at least managed to make my point a little bit clearer... If not,
feel free to ask again.
As I said, we are still quite near to the starting blocks with this
thing, but I still am glad to share thoughts with all of you! :-)

Regards,
Dario

[*] but a more real-timish group scheduling than the one present in
Linux right at the moment, i.e., something where groups have (fixed or
dynamic) priorities.

[1] http://feanor.sssup.it/%7Elipari/papers/rtss2001-bwi.ps.gz

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]


2009-07-14 11:03:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Tue, 2009-07-14 at 12:47 +0200, Raistlin wrote:
> [1] http://feanor.sssup.it/%7Elipari/papers/rtss2001-bwi.ps.gz

That link seems to require a user/pass combination.

2009-07-14 11:16:24

by Dario Faggioli

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Mon, 2009-07-13 at 15:45 -0600, Chris Friesen wrote:
> > The only selling point for PIP has been the ability of a thread to
> > suspend itself while holding a lock, such as to wait for
> > completion of an I/O operation.
>
> You're comparing a full-featured PI implementation with a stripped-down
> PP (priority protection, aka priority ceiling) approach. In an
> apples-to-apples comparison, the selling point for PI vs PP is that
> under PIP the priority of the lock holder is automatically boosted only
> if necessary, and only as high as necessary. On the other hand, PP
> requires code analysis to properly set the ceilings for each individual
> mutex.
>
I think I agree with this.

Moreover, talking about server/group based scheduling and considering
BWI/PEP, which are natural extensions of PI, you get the very nice
property of being independent of the actual knowledge of the blocking
time(s): you can run your scheduling test considering only the bandwidth
assigned to each component... And this is, at least for now and as far
as I know, not possible if you go for blocking-prevention approaches
like P(C)P or SRP, nor, I think, for BROE or SIRAP.

I mean, if you only want to be sure to isolate applications and/or
components among themselves, without any knowledge of the blocking times
and critical section access patterns of the tasks running inside such
components, you can do it! Just pick the bandwidths and see if they
fit with a scheduling test unmodified by any --very difficult to find
out, actually-- blocking term.
I know, this does not cover all the possible situations, and that it is
biased toward _soft_ real-time workloads, but I think it's a meaningful
use-case, especially considering Linux...

Anyway, it is more than possible that this belief is due to lack of
knowledge of mine about some other already existing solution... If so,
please, any pointer to it/them is more than welcome. :-)

> > Regarding the notion of charging proxy execution to the budget of
> > the client task, I have grave concerns. It is already hard enough
> > to estimate the amount of budget that a real-time task requires,
> > without this additional complication.
>
> Agreed.
>
Well, indeed, I agree with this as well... But it yields the, to me, very
nice property described above (and in the other e-mail).

However, I'm light years away from proposing it as the "solution for all
evils"! I know that, for instance, overhead and code twisting are severe
issues. All I hope is to be able to find the time to give it a try,
and to figure out whether it is worth it. :-D

Regards again,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]


2009-07-14 14:48:40

by Chris Friesen

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

Raistlin wrote:

> Now, all the above is true on UP setups. Extending to SMP (m CPUs) and
> considering the first part of my draft proposal, i.e., having the
> proxying tasks busy waiting (would say "spinning", but that busy wait is
> interruptible, preemptable, etc.) on some CPU while their proxy is being
> scheduled, we are back to the case of having the m highest priority
> entities running... And thus we are happy!


> Well, more or less. In fact, if you want (and we want! :-D) to go a step
> further, and consider how to remove the --possibly quite long on Linux, as
> Peter said-- waste of CPU time due to busy waiting, what you can do is
> to actually block a proxying task, instead of having it "spinning", so
> that some other task in some other group, which may not be one of the m
> highest prio ones, can reclaim that bandwidth... But... Ladies and
> gentlemen, here it is: BLOCKING. Again!! :-(

Let's call the highest priority task A, while the task holding the lock
(that A wants) is called B. Suppose we're on a dual-cpu system.

According to your proposal above we would have A busy-waiting while B
runs with A's priority. When B gives up the lock it gets downgraded and
A acquires it and continues to run.

Alternately, we could have A blocked waiting for the lock, a separate
task C running, and B runs with A's priority on the other cpu. When B
gives up the lock it gets downgraded and A acquires it and continues to run.

From an overall system perspective we allowed C to make additional
forward progress, increasing the throughput. On the other hand, there
is a small window where A isn't running and it theoretically should be.
If we can bound it, would this window really cause so much difficulty
to the schedulability analysis?

Chris

2009-07-14 16:27:03

by James H. Anderson

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel


Hi all,

I've been reading these email messages for a few days
now and wanted to make some comments... maybe what I
have to say will be useful to you... maybe not.

First, I thought I should introduce myself because I
think some of you may not know me. I lead the LITMUS^RT
project at UNC (http://www.cs.unc.edu/~anderson/litmus-rt/).
I think it is important to mention this because the goals
of our LITMUS^RT-related research mean that I'm coming at
all of this from a rather different place than (I think)
most of you. The main goal of our research has been to
identify those real-time scheduling and synchronization
algorithms that work best on multicore platforms. We
define "work best" in terms *schedulability* with *real
overheads* from *real implementations* considered. There
are two important aspects of this that will influence my
comments (so take them with a grain of salt):

1. By emphasizing schedulability, we are clearly defining
"real-time" to mean predictable, and not real-fast.

2. We use Linux as a platform for implementing the algorithms
we test and getting overheads. It has never been among our
goals to actually change mainline Linux (although this may become
a goal in the future). This means that we haven't focused too
much on legacy-related hindrances. I'm using the word
"hindrances" with intention. I think in particular anything that
is POSIX-based is going to be fraught with problems in a multicore
world. Such legacy issues seem to be at the heart of many of
the problematic scenarios you are discussing. Anyway, maybe
these things are a fact of life.

The first email in this thread that I was cc'ed on talked about
implementing global EDF, but the discussion since has been
entirely about synchronization issues. To the best of my
knowledge, the only published real-time locking protocol that
can be used under global EDF (and many other algorithms) without
restricting critical sections (like not allowing certain resources
to be accessed in a nested manner) is a protocol we designed called
the flexible multiprocessor locking protocol (FMLP). Please see
the following papers at http://www.cs.unc.edu/~anderson/papers.html:

B. Brandenburg and J. Anderson, "Reader-Writer Synchronization for
Shared-Memory Multiprocessor Real-Time Systems", Proceedings of the
21st Euromicro Conference on Real-Time Systems, pp. 184-193, July 2009.

B. Brandenburg and J. Anderson, "A Comparison of the M-PCP, D-PCP, and
FMLP on LITMUSRT", Proceedings of the 12th International Conference
on Principles of Distributed Systems, pp. 105-124, December 2008.

B. Brandenburg and J. Anderson, "An Implementation of the PCP, SRP,
D-PCP, M-PCP, and FMLP Real-Time Synchronization Protocols in LITMUSRT",
Proceedings of the 14th IEEE International Conference on Embedded and
Real-Time Computing Systems and Applications, pp. 185-194, August 2008.

B. Brandenburg, J. Calandrino, A. Block, H. Leontyev, and J. Anderson,
"Real-Time Synchronization on Multiprocessors: To Block or Not to
Block, to Suspend or Spin?", Proceedings of the 14th IEEE Real-Time
and Embedded Technology and Applications Symposium, pp. 342-353, April 2008.

A. Block, H. Leontyev, B. Brandenburg, and J. Anderson, "A Flexible
Real-Time Locking Protocol for Multiprocessors", Proceedings of the
13th IEEE International Conference on Embedded and Real-Time Computing
Systems and Applications, pp. 47-57, August 2007.

(BTW, most papers on this page with "Bjoern Brandenburg" as a co-author
are LITMUS^RT-related. I've added Bjoern to the cc list.)

The underlying philosophy of our synchronization-related work is
"simplicity wins". Simplicity is obviously a good thing from an
implementation perspective, but it is also good from an *analysis*
perspective. Correctly computing task blocking terms for multiprocessor
real-time locking protocols is *really* hard. There are many
mistakes in analysis presented in the literature (in fact, I think a
new one was pointed out at ECRTS two weeks ago). It is really hard
to nail down all the corner cases, and simple mechanisms have far
fewer corner cases. In our work, we first carefully write up all
blocking analysis and then have a multi-hour group meeting where
6 to 8 people review and discuss the analysis.

The FMLP is really embarrassingly simple. Its main mechanisms are:

1. Require critical sections to execute non-preemptively (I know,
non-preemption violates the "real-fast" religion -- I've been in
these discussion before :-)).

2. Execute critical sections in FIFO order (I know, FIFO violates
the "real-time" religion).

3. Deal with nested lock accesses by requiring a higher-level,
coarser lock to be acquired. While this may seem "too dumb", traces
we have taken of real systems (including Linux) indicate that
two-level nesting is not that common and deep nesting is quite rare.
Why design complicated mechanisms that are both hard to implement
and analyze to deal with something that is not so common?

In the FMLP, waiting can be implemented via either spinning or
suspension (technically, critical sections are categorized as
"short" or "long" and spinning is used for short CSs and suspension
for long ones -- this categorization is entirely up to the user).
Spinning is done non-preemptively (suspending obviously is not).

The beauty of non-preemptive, FIFO is that on an m-processor system,
blocking behavior is easy to analyze. For example, with spin-based
waiting, the blocking per critical-section access is (m-1) times the
cost of one critical section. Someone remarked in an earlier email
that we're really thinking of m being somewhat small (4, maybe 8).
As such, this is a blocking term that is really hard to beat. As I
said, simplicity wins. With suspension-based waiting, things are a
bit more complicated, but not much.
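
For the spin-based (short) case, a user-level sketch of the rule is
tiny; the preempt_disable()/preempt_enable() pair below is assumed to
behave like the Linux primitives of the same name, and the ticket lock
is just one way of getting FIFO ordering:

#include <stdatomic.h>

extern void preempt_disable(void);   /* assumed Linux-like behaviour */
extern void preempt_enable(void);

struct fifo_lock {
    atomic_uint next_ticket;
    atomic_uint now_serving;
};

void fmlp_short_lock(struct fifo_lock *l)
{
    unsigned int me;

    preempt_disable();               /* non-preemptive section */
    me = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != me)
        ;                            /* FIFO spin: at most m-1 ahead */
}

void fmlp_short_unlock(struct fifo_lock *l)
{
    atomic_fetch_add(&l->now_serving, 1);
    preempt_enable();
}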

In our experience, the blocking analysis of any multiprocessor protocol
that uses priority-inheritance-related mechanisms is a nightmare. The
problem is that, to handle all the corner cases, pessimistic assumptions
have to be made. This pessimism can really add up and lead to huge
blocking terms.

This email thread has also touched on group-based scheduling. We
proposed a very simple scheme several years ago called "skipping"
in the context of Pfair scheduling that makes this easy:

P. Holman and J. Anderson, "Locking under Pfair Scheduling", ACM
Transactions on Computer Systems , Volume 24, Number 2, pp. 140-174,
May 2006.

(An earlier version appeared in RTSS 2002.)

In this approach, critical-section lengths must be known, and any
lock request that occurs when a task doesn't have sufficient
budget is simply denied -- the request is done later when that task
receives additional budget. This prevents a task in one group from
holding a lock while it is preempted (which could severely hurt
lock-requesting tasks in other groups). This scheme is really easy
to implement in conjunction with the FMLP and it doesn't require
complicated budget tracking. Its effects on blocking terms are
also easy to analyze. Thomas Nolte and colleagues (in Sweden) have
written some papers where they've used skip-based locks in
hierarchically-scheduled systems.
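
A sketch of the rule itself (with an assumed known critical-section
length) is just a comparison at lock-request time:

struct budgeted_task {
    long long budget_left_ns;
};

/*
 * "Skipping": a request made without enough remaining budget to cover
 * the (known) critical-section length is denied and simply retried
 * after the task's next budget replenishment, so the lock is never
 * held across a preemption at a budget boundary.
 */
int may_enter_cs(struct budgeted_task *t, long long cs_len_ns)
{
    return t->budget_left_ns >= cs_len_ns;
}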

I'll stop rambling now. I guess my overall point is that these
legacy issues like POSIX seem to be forcing you into complex
solutions. Unless there is a way to shed this baggage, I think
it'll be a matter of where you put the complexity -- you'll never
be able to eliminate it (and again, I think complexity is *bad*).

I hope you find my comments worthwhile (and if not, sorry for
sending such a long message).

-Jim Anderson

2009-07-14 16:31:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel


Hi Jim,

On Tue, 2009-07-14 at 11:19 -0400, James H. Anderson wrote:
> The first email in this thread that I was cc'ed on talked about
> implementing global EDF,

Henrik says that it's a modified Modified-Least-Laxity-First, so
something like M^2-LLF, and I cc'ed you (and I meant to add Bjorn too,
but now see I failed to actually do so) in the hope that you would have
some thoughts on his scheduling algorithm, since that is what you do a
lot of ;-)

It could well be he modified M-LLF into G-EDF, but I'm behind in my
reading here, so I'll have to leave that up to others.

> but the discussion since has been entirely about synchronization issues.

Right, this seems to be a very hot topic.

> To the best of my
> knowledge, the only published real-time locking protocol that
> can be used under global EDF (and many other algorithms) without
> restricting critical sections (like not allowing certain resources
> to be accessed in a nested manner) is a protocol we designed called
> the flexible multiprocessor locking protocol (FMLP).

> The FMLP is really embarrassingly simple. It's main mechanisms are:
>
> 1. Require critical sections to execute non-preemptively (I know,
> non-preemption violates the "real-fast" religion -- I've been in
> these discussion before :-)).

Let's simply state that industry would like to be able to deal with very
short deadlines :-)

> 2. Execute critical sections in FIFO order (I know, FIFO violates
> the "real-time" religion).

For those locks that are indeed non-preemptible busy-wait (raw_spinlock_t)
we generally use a FIFO-fair implementation (x86 at least, but I thought
several other platforms were looking at, or had already implemented, a
similar spinlock).

> 3. Deal with nested lock accesses by requiring a higher-level,
> coarser lock to be acquired. While this many seem "too dumb", traces
> we have taken of real systems (including Linux) indicate that
> two-level nesting is not that common and deep nesting is quite rare.
> Why design complicated mechanisms that are both hard to implement
> and analyze to deal with something that is not so common?
>
> In the FMLP, waiting can be implemented via either spinning or
> suspension (technically, critical sections are categorized as
> "short" or "long" and spinning is used for short CSs and suspension
> for long ones -- this categorization is entirely up to the user).
> Spinning is done non-preemptively (suspending obviously is not).
>
> The beauty of non-preemptive, FIFO is that on an m-processor system,
> blocking behavior is easy to analyze. For example, with spin-based
> waiting, the blocking per critical-section access is (m-1) times the
> cost of one critical section. Someone remarked in an earlier email
> that we're really thinking of m being somewhat small (4, maybe 8).
> As such, this is a blocking term that is really hard to beat. As I
> said, simplicity wins. With suspension-based waiting, things are a
> bit more complicated, but not much.

Our problems are many (and I think you know about most if not all).

- Linux is _huge_:
o and while we seem to do a reasonable job of
getting people to cope with basic concurrency, asking them to
take RT constraints into account when doing so will surely be too
much -- worse, we might not ever convince people it's a worthy
goal to begin with.

o auditing each and every lock in the kernel simply isn't possible.

- Linux needs to provide isolation:
o we must assume userspace is hostile and try to minimize the
impact of such tasks on others.

- POSIX:
o we're stuck with the legacy here, but we're looking fwd outside
of POSIX.

(I'm probably forgetting half)

> In our experience, the blocking analysis of any multiprocessor protocol
> that uses priority-inheritance-related mechanisms is a nightmare. The
> problem is that, to handle all the corner cases, pessimistic assumptions
> have to be made. This pessimism can really add up and lead to huge
> blocking terms.

Right, Ted holds similar opinions.

Practically though, active Priority Inheritance has enormous benefits
for us simple people that need to get things done :-)

It has allowed us to convert this huge mass of code into something that
is real-time enough for a lot of practical uses, including industrial
robots and the like without the need to analyze each and every lock out
there.

Now I fully appreciate the theoretical trouble full PI and its ilk
gives, so maybe we can do something with lockdep/stat to track and
qualify the lock nesting depths and critical section lengths and use
those to improve upon the situation.

At worst we could use the data to qualify the usability of certain
system calls/traps wrt RT applications.

Also, can't we employ the PI framework to act as a debug/help-guide in
validating/figuring out the PCP levels?

On that, how does the priority ceiling/protection protocol extend to
deadline schedulers?

> This email thread has also touched on group-based scheduling. We
> proposed a very simple scheme several years ago called "skipping"
> in the context of Pfair scheduling that makes this easy:

> In this approach, critical-section lengths must be known, and any
> lock request that occurs when a task doesn't have sufficient
> budget is simply denied -- the request is done later when that task
> receives additional budget. This avoids a task in one group from
> holding a lock while it is preempted (which could severely hurt
> lock-requesting tasks in other groups). This scheme is really easy
> to implement in conjunction with the FMLP and it doesn't require
> complicated budget tracking. Its effects on blocking terms are
> also easy to analyze. Thomas Nolte and colleagues (in Sweden) have
> written some papers where they've used skip-based locks in
> hierarchically-scheduled systems.

Not granting locks when the contender doesn't have enough budget to
finish them seems like a very attractive solution; however, the cost of
having to analyze the critical section seems prohibitive.

That said, we could maybe experiment with this for a few key lock sites.

Anyway, our goal (the linux-rt folks) is to make Linux a better RT OS.
I'm sure we'll never get hard enough for some people, but I'm sure we
can do better than we do today.

Furthermore we're limited by the existing legacy (both spec and code
wise), but I think we have been able to make good progress through tools
that help us, such as lockdep and the latency tracer (now ftrace), which
help us find trouble in an active manner.

And I think discussion such as these help us find new directions to
explore, so carry on!

2009-07-14 16:48:38

by Dario Faggioli

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Tue, 2009-07-14 at 08:48 -0600, Chris Friesen wrote:
> Let's call the highest priority task A, while the task holding the lock
> (that A wants) is called B. Suppose we're on an dual-cpu system.
>
Ok.

> According to your proposal above we would have A busy-waiting while B
> runs with A's priority. When B gives up the lock it gets downgraded and
> A acquires it and continues to run.
>
Yes. The first part of my proposal --the "theoretically safe" one.

> Alternately, we could have A blocked waiting for the lock, a separate
> task C running, and B runs with A's priority on the other cpu. When B
> gives up the lock it gets downgraded and A acquires it and continues to run.
>
Right. And this is in my proposal as well.

I'm just saying that the analysis gets more complicated, that some of
the nice properties may be lost, and suggested a first draft of idea to
turn it back to be easy again... With, unfortunately, a lot of flaws in
there yet. :-(

In fact, I'm suggesting that, in the scenario you described, C should be
able to run, but B's budget has to be affected somehow.

> From an overall system perspective we allowed C to make additional
> forward progress, increasing the throughput.
As said, I have agreed with this since the very beginning of the whole
thing! :-)

> On the other hand, there
> is a small window where A isn't running and it theoretically should be.
> If we can bound it, would this window really cause so much difficulty
> to the schedulability analysis?
>
Well, I'm not sure right now... I would need to do some math to be!

Remember that all my points are concerned with budgets, i.e., a scenario
where you have some means to limit the capability of a task to ask for
CPU time over some kind of period.
And here is where the problem comes in, since running C instead of having
A busy-waiting means:
- that A is actually blocked, as said before;
- that A's budget is not diminished.
And this certainly improves system throughput but may affect its
predictability and, in this particular case, the tasks' isolation from each
other...

Anyway, if you are saying that this could be a possible theory-practice
compromise, well, could be... And I'm open and ready to (try to)
contribute even in a discussion of such kind. :-)

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]



2009-07-14 16:55:13

by James H. Anderson

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel



On Tue, 14 Jul 2009, Peter Zijlstra wrote:

>
> Hi Jim,
>
> On Tue, 2009-07-14 at 11:19 -0400, James H. Anderson wrote:
>> The first email in this thread that I was cc'ed on talked about
>> implementing global EDF,
>
> Hendrik says that its a modified Modified-Least-Laxity-First, so
> something like M^2-LLF, and I cc'ed you (and I meant to add Bjorn too,
> but now see I failed to actually do so) in the hope that you would have
> some thoughts on his scheduling algorithm, since that is what you do a
> lot of ;-)
>
> It could well be he modified M-LLF into G-EDF, but I'm behind in my
> reading here, so I'll have to leave that up to others.

I meant to say something about this algorithm earlier that may be
worthwhile. If I understand Henrik's algorithm correctly, it falls
within a class of global schedulers for which bounded deadline
tardiness is guaranteed (as long as total utilization doesn't exceed
the system's capacity). This follows from results in:

H. Leontyev and J. Anderson, "Generalized Tardiness Bounds for Global
Multiprocessor Scheduling", Proceedings of the 28th IEEE Real-Time
Systems Symposium, pp. 413-422, December 2007.

This is a positive thing w.r.t. wanting to support soft real-time
workloads. However, global EDF also has this property and (I would
think) is easier to implement (I'm just guessing here -- we haven't
actually implemented any algorithm that involves laxities in
LITMUS^RT). Ted Baker did some work on an algorithm called EDZL
(EDF until zero laxity) that is essentially a hybrid of these two
algorithms. In simulations we've done (not based on real implementations),
EDZL exhibits really low tardiness (almost non-existent). EDZL may
give you the benefits of using laxities where useful and be easier
to implement (but that's just my guess).

Regarding all these "legacy" issues, I think it would be helpful to
write them all down carefully and formally, and then think about
what are the *simplest* mechanisms that will do the job in the common
case. It is not worthwhile (in my opinion) to introduce tons of
complexity to nip at corner cases that aren't that common. Of course,
this begs the question of what is "common" and what is not: something
that may also be worth trying to find evidence for and writing down.

Or better yet, someone needs to define a new standard for multicore...
but I'm not sure how that gets done.

Gotta go....

-Jim

2009-07-14 17:17:40

by James H. Anderson

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel



On Tue, 14 Jul 2009, Peter Zijlstra wrote:

> On that, how does the priority ceiling/protection protocol extend to
> deadline schedulers?

Sorry -- I should have responded to this.

These things are pretty easy to extend to deadline scheduling if
partitioning is assumed, and this has been done. However, it is not
easy to do under global scheduling. Having tasks pinned to processors
(partitioning) makes it much easier to compute blocking terms (it's much
easier to predict what is happening on each processor at any time).
Without this, it is *really* hard to compute reasonable blocking
terms. Even doing something as mundane as avoiding deadlocks without a lot
of overhead may not be trivial. At the time we developed the FMLP
(a couple of years ago), there was *no* published locking protocol (to my
knowledge) that worked under GEDF. To my knowledge, the FMLP is still the
only one that works under GEDF. (BTW, I should say that I am not
familiar with the PEP protocol that has been discussed in this thread.
I assume it doesn't work under GEDF, or you wouldn't have asked the
question.)

BTW, FIFO waiting makes blocking terms in the global case much easier
to compute: once a lock request is initiated, the set of blocking lock
requests (ones initiated earlier) is fixed. (This is actually a bit
of an over-simplification if waiting is by suspending.) With priority-based
waiting, higher-priority requests can come later. Determining a
reasonable bound on the number of such requests is hard.

You can find references to papers by others in our FMLP papers.

-Jim

2009-07-14 18:20:02

by Dario Faggioli

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Tue, 2009-07-14 at 13:03 +0200, Peter Zijlstra wrote:
> On Tue, 2009-07-14 at 12:47 +0200, Raistlin wrote:
> > [1] http://feanor.sssup.it/%7Elipari/papers/rtss2001-bwi.ps.gz
>
> That link seems to require a user/pass combination.
>
Oh, damn, sorry about that!

I thought I downloaded the paper once from Peppe's website, but I
suppose I'm wrong. Now I see there's also a disclaimer about that IEEE
copyright stuff there, but I did not notice it! :-(

Anyway, here are two working links to all the BWI details for UPs:
http://feanor.sssup.it/~faggioli/lipari_lamastra_abeni_bwi.pdf
http://feanor.sssup.it/~faggioli/rtss2001-bwi.ps.gz

Sorry again and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]



2009-07-14 18:24:38

by Chris Friesen

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

Raistlin wrote:

> Remember that all my points are concerned with budgets, i.e., a scenario
> where you have some mean to limit the capability of a task to ask for
> CPU time over some kind of period.
> And here it is where the problem comes since running C instead of having
> A busy waiting means:
> - that A is actually blocked, as said before;

Why does it make any difference that A is blocked rather than busy
waiting? In either case A cannot make forward progress.

> - that A's budget is not diminished.

If we're running B with A's priority, presumably it will get some amount
of cpu time above and beyond what it would normally have gotten during a
particular scheduling interval. Perhaps it would make sense to charge B
what it would normally have gotten, and charge the excess amount to A?
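
A minimal sketch of that split-charging idea, with invented field names and
assuming execution time is sampled in nanoseconds as B runs as A's proxy:

struct budget {
	long long remaining_ns;
};

/* Charge B up to its own remaining budget, and bill any excess to A
 * (the task whose priority B inherited). */
static void charge_proxy(struct budget *b_owner, struct budget *a_blocked,
			 long long used_ns)
{
	long long from_b = used_ns < b_owner->remaining_ns
			 ? used_ns : b_owner->remaining_ns;

	b_owner->remaining_ns -= from_b;
	a_blocked->remaining_ns -= used_ns - from_b;
}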

Chris

2009-07-14 19:08:02

by Dario Faggioli

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Tue, 2009-07-14 at 11:15 +0200, Peter Zijlstra wrote:
> > Regarding the notion of charging proxy execution to the budget of
> > the client task, I have grave concerns. It is already hard enough
> > to estimate the amount of budget that a real-time task requires,
> > without this additional complication.
>
> I hope the above illustrates the use of this, furthermore I think Dario
> and team did some actual analysis on it wrt deadline schedulers. Dario
> could you expand?
>
Sure, I can try, again! :-)

I think what I said trying to answer Chris is of some help here as well,
and that it already clarified things at least a little bit. Does it?

Anyway, I think we are looking at the problem from slightly different
standpoints.
I definitely agree with Prof. Baker that estimating
execution times is very difficult, and I think this extends to the
estimation of the durations of critical sections too.

Moreover, there are scenarios where you don't really need strong
knowledge of these time intervals because, e.g., you are mainly
interested in:
- providing applications/components with some kind of quality-of-service
--i.e., not hard real-time-- guarantees;
- isolating the applications/components from one another.
This is, I think, quite typical for soft real-time workloads that could
run on Linux, better still on preempt-rt, without many problems.

With this in mind, what we think is interesting about BWI-like approaches
is, indeed, that they require virtually no knowledge about tasks'
behaviour, at least to provide timing isolation... No tasks' wcet, no
tasks' accessed mutexes, no tasks' blocking times: just a budget and a
period for the server/group each application is enclosed within.
This is of some relevance, we think, not only in the real-time world,
but also when robustness, availability and dependability start being
considered.

I think I already said how the thing basically works, and I don't want
to bother you all by explaining it again, unless you ask me for some
clarification... All I can add is that going this way is, actually,
moving toward trying to _remove_ the need for exact predictions of
task execution and blocking times.

Finally, but I already said this as well, if you need hard behaviour and
you have the data about tasks' wcet and blocking times, the paper I
pointed to (the last mail, with working links! :-D) contains all that is
needed to compute the group's budget properly, taking into account
interferences... It is, brutally simplifying, something like using as
the budget for task A's group the sum of A's wcet plus all the blocking
times of the tasks that block on it.

And Peter is right, such analysis is done for EDF on UP configurations.
Finally, I unfortunately am not able to provide any clue on how this
applies to fair schedulers (e.g., SFQ/CFS), but I don't think it's
impossible... :-)

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]



2009-07-14 19:14:41

by Dario Faggioli

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Tue, 2009-07-14 at 12:24 -0600, Chris Friesen wrote:
> > - that A is actually blocked, as said before;
>
> Why does it make any difference that A is blocked rather than busy
> waiting? In either case A cannot make forward progress.
>
I think it's not a problem of A, but of the overall schedule, from a
system predictability perspective.

Anyway, we are still evaluating what, if any, the issues could be.

> > - that A's budget is not diminished.
>
> If we're running B with A's priority, presumably it will get some amount
> of cpu time above and beyond what it would normally have gotten during a
> particular scheduling interval.
>
Right...

> Perhaps it would make sense to charge B
> what it would normally have gotten, and charge the excess amount to A?
>
Mmm.. That's right, but I'm not sure I get what happens while executing
C... Anyway, it seems to me that we are getting closer to each other's
point of view... let's keep in touch! :-D

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]



2009-07-14 19:29:17

by Henrik Austad

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Tuesday 14 July 2009 18:54:13 James H. Anderson wrote:
> On Tue, 14 Jul 2009, Peter Zijlstra wrote:
> > Hi Jim,
> >
> > On Tue, 2009-07-14 at 11:19 -0400, James H. Anderson wrote:
> >> The first email in this thread that I was cc'ed on talked about
> >> implementing global EDF,
> >
> > Hendrik says that its a modified Modified-Least-Laxity-First, so
> > something like M^2-LLF, and I cc'ed you (and I meant to add Bjorn too,
> > but now see I failed to actually do so) in the hope that you would have
> > some thoughts on his scheduling algorithm, since that is what you do a
> > lot of ;-)
> >
> > It could well be he modified M-LLF into G-EDF, but I'm behind in my
> > reading here, so I'll have to leave that up to others.

Using deadlines as the only measure on multi-core will fail as you must
consider time (of deadline-miss) in a different way.

In UP, this is given implicitly as you only have a single processor. In MC you
need to do this the hard way, namely compute the point in time not when the
task misses the deadline, but when it will *eventually* fail a deadline. By
doing that, you combine deadline, wcet and granted time in one variable, and
you have a *single* variable to compare.
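
For illustration, a minimal sketch of that single comparison value (names are
illustrative only, not taken from an actual EFF patch):

struct eff_task {
	long long deadline_ns;	/* absolute deadline */
	long long wcet_ns;	/* worst-case execution time */
	long long granted_ns;	/* CPU time consumed so far */
};

/* The latest point at which the task must be running (and stay running)
 * to still meet its deadline; if it is not running by then, it will
 * eventually miss no matter what the scheduler does afterwards. */
static long long failure_time(const struct eff_task *t)
{
	return t->deadline_ns - (t->wcet_ns - t->granted_ns);
}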

By doing this in a global sense, you can avoid bin-packing, load-balancing and
a lot of the issues that seem to be keeping the PI-people busy. But I think
I'm going to stay away from that topic now (as I haven't had the time to give
those emails their deserved attention).

I think M^2-LLF would be as good a name as EFF, but I wanted to emphasize the
usage of the two failure-variables (the static and the dynamic).

> I meant to say something about this algorithm earlier that may be
> worthwhile. If I understand Hendrik's algorithm correctly, it falls
> within a class of global schedulers for which bounded deadline
> tardiness is guaranteed (as long as total utilization doesn't exceed
> the system's capacity).

The tardiness is either 0 or bounded, depending on what you want (and what you
will do with tasks that do miss their deadlines when they block on locks).

> This follows from results in:
>
> H. Leontyev and J. Anderson, "Generalized Tardiness Bounds for Global
> Multiprocessor Scheduling", Proceedings of the 28th IEEE Real-Time
> Systems Symposium, pp. 413-422, December 2007.

Is this available outside locked portals?

> This is a positive thing w.r.t. wanting to support soft real-time
> workloads. However, global EDF also has this property and (I would
> think) is easier to implement (I'm just guessing here -- we haven't
> actually implemented any algorithm that involves laxities in
> LITMUS^RT). Ted Baker did some work on an algorithm called EDZL
> (EDF until zero laxity) that is essentially a hybrid of these two
> algorithms. In simulations we've done (not based on real implementations),
> EDZL exhibits really low tardiness (almost non-existent). EDZL may
> give you the benefits of using laxities where useful and be easier
> to implement (but that's just my guess)



--
henrik

2009-07-14 19:34:42

by James H. Anderson

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel



>> This follows from results in:
>>
>> H. Leontyev and J. Anderson, "Generalized Tardiness Bounds for Global
>> Multiprocessor Scheduling", Proceedings of the 28th IEEE Real-Time
>> Systems Symposium, pp. 413-422, December 2007.
>
> Is this available outside locked portals?

Yes, all my stuff is freely available at:

http://www.cs.unc.edu/~anderson/papers.html

-Jim

2009-07-14 19:48:12

by Dario Faggioli

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Mon, 2009-07-13 at 19:28 +0200, Peter Zijlstra wrote:
> > - analyzable from the real-time theorist's point of view... Which is
> > (un?)fortunately what we are :-)
> > - possible to implement... Which is not always (un!)fortunately obvious
> > here :-)
>
> I would very much like a proper theoretical foundation for whatever we
> end up with ;-)
>
Well, let's hope we manage it, then! :-D

> > Very basically: from the analysis point of view one easy and effective
> > solution would be to have the blocked-running tasks --i.e., the tasks
> > blocked on some lock that have been left on the rq to proxy-execute the
> > lock owner-- busy waiting while the lock owner is running. This allows
> > for retaining a lot of nice properties BWI already has, as far as
> > analyzability is concerned.
>
> Right, practically we cannot do this, since we expose the block graph to
> userspace and you could in userspace construct a program that would
> exploit this spinning to DoS the system.
>
I know and I share this belief with you, as I wrote in my original
mail... Even if I must say that it would not be _real_ spinning, such
as raw_spinlock_t, with irq and preemption disabled (like in FMLP), but
only a means to --preemptively-- consume some budget to avoid
losing schedulability!
Thus, I'm not sure it could be the basis for a DoS, especially if tasks
are, individually, as a group or as a whole class, enclosed within
groups (we used to call them reservations as well).

Anyway I'm aware that CPU bandwidth would be wasted, and that this is an
issue in systems like Linux... That's why we are striving for something
better... :-P

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]



2009-07-14 19:55:21

by Dario Faggioli

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Tue, 2009-07-14 at 18:31 +0200, Peter Zijlstra wrote:
> > but the discussion since has been entirely about synchronization issues.
>
> Right, this seems to be a very hot topic.
>
Indeed! :-D

> Right, Ted holds similar opinions.
>
> Practically though, active Priority Inheritance has enormous benefits
> for us simple people that need to get things done :-)
>
> It has allowed us to convert this huge mass of code into something that
> is real-time enough for a lot of practical uses, including industrial
> robots and the like without the need to analyze each and every lock out
> there.
>
As said to you personally, I put quite a lot of effort into trying to find
some code using PI-futexes directly or PTHREAD_PRIO_INHERIT POSIX
mutexes, for personal curiosity and research interest... But I have not
managed so far. :-(

Do you think it would be possible to have some pointers?

> > In this approach, critical-section lengths must be known, and any
> > lock request that occurs when a task doesn't have sufficient
> > budget is simply denied -- the request is done later when that task
> > receives additional budget. This avoids a task in one group from
> > holding a lock while it is preempted (which could severely hurt
> > lock-requesting tasks in other groups). This scheme is really easy
> > to implement in conjunction with the FMLP and it doesn't require
> > complicated budget tracking. Its effects on blocking terms are
> > also easy to analyze. Thomas Nolte and colleagues (in Sweden) have
> > written some papers where they've used skip-based locks in
> > hierarchically-scheduled systems.
>
> Not granting locks when the contender doesn't have enough budget to
> finish them seems like a very attractive solution, however the cost of
> having to analyze the critical section seems prohibitive.
>
I've always thought so... We have some work related to this as well (can
give pointers if interested), but I personally like PI/BWI-PEP etc.
because such knowledge is not required at all... Anyway...

> That said, we could maybe experiment with this for a few key lock sites.
>
... This would be nice!

> Furthermore we're limited by the existing legacy (both spec and code
> wise), but I think we have been able to make good progress through tools
> that help us, such as lockdep and the latency tracer (now ftrace), which
> help us find trouble in an active manner.
>
Well, in my very humble opinion you're definitely doing a great job! :-D

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]



2009-07-15 05:46:57

by Bjoern Brandenburg

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

Quoting "James H. Anderson" <[email protected]>:
> Ted Baker did some work on an algorithm called EDZL
> (EDF until zero laxity) that is essentially a hybrid of these two
> algorithms. In simulations we've done (not based on real implementations),
> EDZL exhibits really low tardiness (almost non-existent). EDZL may
> give you the benefits of using laxities where useful and be easier
> to implement (but that's just my guess)

EDZL has nice properties but may be hard to implement efficiently
because it requires (a lot of) fine-grained timers. However, Shinpei
Kato and Nobuyuki Yamasaki (CC'ed) presented an EDZL-derived scheduler
called EDCL at RTCSA'08 that has similar performance guarantees but
only does priority promotions at the release and completion times of
jobs (hence, no additional timers are required).

Global EDF-based Scheduling with Efficient Priority Promotion
Shinpei Kato and Nobuyuki Yamasaki
Proceedings of the 14th International Conference on Embedded and
Real-Time Computing Systems and Applications, pages 197-206. IEEE, 2008.
http://www.ny.ics.keio.ac.jp/members/shinpei/papers/rtcsa08.pdf

I know that Shinpei and friends implemented EDCL on top of LITMUS^RT,
but I never got around to incorporating the patches into the "official"
LITMUS^RT patch. (Sorry, Shinpei!)

Regarding the comments about the availability of academic papers (which
were also voiced repeatedly at OSPERT'09), I'd like to point out that
you can find almost any recent paper by searching for the title with
scholar.google.com and clicking on the "All XYZ versions" link... most
researchers make PDFs available on their homepages.

- Björn

2009-07-15 20:55:13

by Ted Baker

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Tue, Jul 14, 2009 at 12:54:13PM -0400, James H. Anderson wrote:
>
> ... Ted Baker did some work on an algorithm called EDZL
> (EDF until zero laxity) that is essentially a hybrid of these two
> algorithms. In simulations we've done (not based on real implementations),
> EDZL exhibits really low tardiness (almost non-existent). EDZL may
> give you the benefits of using laxities where useful and be easier
> to implement (but that's just my guess)

Yes, EDZL is a simple extension of EDF. If you have a WCET
estimate for a task, you just apply EDF until/unless the task
reaches zero laxity. If it reaches zero laxity you bump it to a
fixed priority level that is above all EDF tasks. It seems very
good in simulations, in terms of schedulability, number of context
switches, and number of task migrations. It never deviates from
EDF except in cases where a deadline would otherwise be missed.
Therefore all EDF schedulability tests are valid for EDZL. In
addition, we have some tighter tests. EDZL never causes more than
one context switch more than EDF, when a task goes from positive
laxity to zero laxity. This does not happen often. It is much
lower overhead than least-laxity-first, since there is no
possibility of getting into a time-slicing situation among tasks
with equal laxity. It is also nice because in many cases the
scheduler will not know the WCET of a task, and so it is
impossible to compute laxity. In those cases, you just use EDF.
The two policies (EDF and EDZL) are completely compatible and
interoperable.
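
A toy sketch of that rule (illustrative names only; a task with no known
WCET is simply never promoted and stays under plain EDF):

#include <stdbool.h>

struct edzl_task {
	long long deadline_ns;		/* absolute deadline */
	long long remaining_wcet_ns;	/* < 0 means WCET unknown: plain EDF */
	bool zero_laxity;		/* promoted above all EDF tasks */
};

static void edzl_update(struct edzl_task *t, long long now_ns)
{
	long long laxity;

	if (t->remaining_wcet_ns < 0)
		return;			/* no WCET estimate: never promoted */
	laxity = t->deadline_ns - now_ns - t->remaining_wcet_ns;
	if (laxity <= 0)
		t->zero_laxity = true;	/* bump above every non-promoted EDF task */
}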

I hope that Linux will not pursue EDF or EDZL in only a narrow
sense, as simple priority scheduling algorithms, without providing
for bandwidth guarantees and limits.

By bandwidth "guarantee" I mean that the task/scheduling entity is
guaranteed to be able to at least *compete* at a certain level for
a certain percentage of the CPU, if we cannot (better) provide an
admission control mechanism that guarantees it will actually get a
certain percentage of the CPU.

By bandwidth "limit" I mean that there is an enforced bound
on the percentage of processor time a task/scheduling entity
may consume.

For example, in the fixed-priority domain, we have Sporadic
Server. This guarantees the server a chance to compete at its top
priority for at least sched_ss_init_budget time in every
sched_ss_repl_period time, at sched_priority, within some
tolerance. It also should not be permitted to use more than
sched_ss_init_budget time in every sched_ss_repl_period time, at
sched_priority, within some tolerance.
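
As a deliberately over-simplified, single-replenishment sketch of that
budget/replenishment bookkeeping (not the full POSIX SCHED_SPORADIC rules;
the field names only echo the sched_ss_* parameters):

struct sporadic_server {
	long long init_budget_ns;	/* cf. sched_ss_init_budget */
	long long repl_period_ns;	/* cf. sched_ss_repl_period */
	long long remaining_ns;		/* budget left at the server's priority */
	long long repl_at_ns;		/* when the consumed budget comes back */
};

/* Charge execution at the server's priority; returns 0 while budget
 * remains, -1 once it is exhausted (drop to background priority). */
static int ss_charge(struct sporadic_server *s, long long now_ns, long long used_ns)
{
	if (s->remaining_ns == s->init_budget_ns)
		s->repl_at_ns = now_ns + s->repl_period_ns;	/* consumption started */
	s->remaining_ns -= used_ns;
	return s->remaining_ns > 0 ? 0 : -1;
}

static void ss_replenish(struct sporadic_server *s, long long now_ns)
{
	if (now_ns >= s->repl_at_ns)
		s->remaining_ns = s->init_budget_ns;	/* compete at top priority again */
}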

In the deadline scheduling domain, we have algorithms like
Constant Bandwidth Server (and some improved variants) that
provide similar guarantees and limits, in terms of deadlines. (I
recall seeing one very good version in a paper I recently saw,
which I could seek out and provide to the list if there is
interest.)

These so-called "aperiodic server scheduling" algorithms are
critically needed in a system like Linux that is open (in the
sense of being able to host a dynamic mix of uncoordinated
applications), and based on a thread programming model (as copared
to the job/task models used in scheduling theory). With the
thread abstraction one generally cannot give the scheduler WCET
bounds, and sometimes cannot give deadlines, since the threads
will wake up and go to sleep according to their own internal logic
(all the OS sees is the system calls). However, if the
application is guaranteed a chance to at least compete at a high
enough priority for a certain bandwidth (and, preferably, actually
be guaranteed that bandwidth, by admission control), it is
possible to write real-time applications. In order to be able to
make such a guarantee, one must be able to limit the CPU usage of
all other threads. This means, in effect, that *all*
tasks/scheduling entities above a certain priority level must be
scheduled according to a bandwidth-limiting algorithm.

The present Linux "RT throttling" mechanism seems to be an attempt
to achieve something similar (preventing RT tasks from shutting
out other work, by reserving some bandwidth for non-RT tasks).
However, it is done so grossly as to be unsatisfactory for
real-time applications. The better way to do this would be to
enforce a bandwidth limit on each task with real-time priority --
using the budget amount and replenishment interval required by
that task -- and enforce by admission control that the real-time
tasks do not exceed some configurable percent of the total system
utilization.
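
A back-of-envelope sketch of such an admission test (purely illustrative,
not an existing kernel interface; the utilization cap would be the
configurable percentage mentioned above):

#include <stdbool.h>
#include <stddef.h>

struct rt_reservation {
	long long budget_ns;	/* budget per replenishment period */
	long long period_ns;	/* replenishment period */
};

/* Admit the new reservation only if total utilization stays below the
 * configured cap (e.g. 0.8 to keep 20% of a CPU for everything else). */
static bool rt_admit(const struct rt_reservation *admitted, size_t n,
		     const struct rt_reservation *new_res, double max_util)
{
	double u = (double)new_res->budget_ns / (double)new_res->period_ns;
	size_t i;

	for (i = 0; i < n; i++)
		u += (double)admitted[i].budget_ns / (double)admitted[i].period_ns;
	return u <= max_util;
}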

Of course, the POSIX standard does not recognize any budget and
replenishment interval for policies other than SCHED_SPORADIC, but
the same problem exists with respect to the present "RT
throttling" mechanism. It would be possible to provide legal
implementation-defined extensions for the operations that take a
struct sched_param to look at these additional sched_param fields
for other policies such as SCHED_FIFO and SCHED_OTHER, and to
provide default values when they are not specified. The amount of
capacity to be reserved for system internal use could be
configured, using some combination of kernel compile-time macros,
boot parameters, and the /proc interface. User defaults could be
configured using extensions to the setrlimit() interface.

Ted

2009-07-15 21:19:57

by Ted Baker

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Tue, Jul 14, 2009 at 01:16:52PM -0400, James H. Anderson wrote:
> ... BTW, I should say that I am not
> familiar with the PEP protocol that has been discussed in this thread.
> I assume it doesn't work under GEDF, or you wouldn't have asked the
> question...

I have not seen the definition of PEP, but from the context of
this discussion I infer that it refers to an implementation of
priority inheritance. As such, with pretty much any global
scheduling policy, the set of other tasks whose critical sections
could stack up is bounded only by the number of tasks in the
system.

In case I have misunderstood what PEP is, let me attempt
to summarize what I have inferred:

A high priority running task that would otherwise become blocked
waiting for a lower-priority lock-holding task to release the lock
can give up its priority/shot at execution to the lower-priority
task that is blocking it. That is, when a task A is "blocked" for
a lock it can stay in the run-queue so long as the task B that is
(ultimately, transitively) blocking it is in (the same?)
run-queue. At any point where the scheduler would choose to
execute A it instead finds B, by traversing wait-for links between
tasks, and executes B. The priority order of the run-queue can be
based on any (partial) ordering relation, including deadlines.
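
A minimal sketch of that proxy step (field names invented; as noted below,
a real implementation would also have to detect cycles and handle owners
that are suspended or sitting on another CPU's run-queue):

#include <stddef.h>

struct task {
	struct task *blocked_on_owner;	/* NULL if runnable, else the lock holder */
};

/* When the scheduler would pick A but A is "blocked" on a lock, follow
 * the wait-for links and run the (ultimate) lock owner instead. */
static struct task *resolve_proxy(struct task *picked)
{
	while (picked->blocked_on_owner != NULL)
		picked = picked->blocked_on_owner;
	return picked;
}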

A slight complexity is that if B is allowed to suspend itself
while holding a lock, and does so, one must run back and also
remove the tasks like A from the run-queue, and when B wakes up,
one must do the reverse. However, the frequency of deep nesting
of wait-for relationships seems small.

For GEDF on SMP, a question is how to handle the case where A is
blocked on one processor and B may be running on a different one.
This seems to require removing A from the run-queue when it is
detected.

Of course, the current Linux model appears not to fully support
GEDF, since run-queues are per-processor, subject to explicit
migration. So, as I infer from the preceding messages, the question
above transforms into whether to migrate A to B's processor
run-queue or to migrate B to A's processor run-queue?

Ted

2009-07-15 21:46:02

by Ted Baker

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Tue, Jul 14, 2009 at 08:48:26AM -0600, Chris Friesen wrote:

> Let's call the highest priority task A, while the task holding the lock
> (that A wants) is called B. Suppose we're on an dual-cpu system.
>
> According to your proposal above we would have A busy-waiting while B
> runs with A's priority. When B gives up the lock it gets downgraded and
> A acquires it and continues to run.
>
> Alternately, we could have A blocked waiting for the lock, a separate
> task C running, and B runs with A's priority on the other cpu. When B
> gives up the lock it gets downgraded and A acquires it and continues to run.
>
> From an overall system perspective we allowed C to make additional
> forward progress, increasing the throughput. On the other hand, there
> is a small window where A isn't running and it theoretically should be.
> If we can bound it, would this window really cause so much difficulty
> to the schedulability analysis?

I have two questions:

(1) As Jim Anderson points out in a later comment,
is priority inheritance (of any form) what you really want
on an SMP system? It does not give a good worst-case blocking
time bound.

(2) If you do want priority inheritance, how do I implement the
mechanics of the unlock operation of B on one
processor causing C to be preempted on the other processor, simply
and promptly?

A general solution needs to account for having multiple tasks in
the role of A for any given B, and possibly chains of such tasks
(and, of course, potential deadlock cycles).
For a relatively simple example,

A1 (on CPU1) -blocked_by-> B (on CPU2)
C (lower priority on CPU1)
A2 (preempts C on CPU1) -blocked_by-> B (CPU2)
A3 (on CPU3) -blocked_by-> B (on CPU2)
B releases the lock that is blocking A1, A2, and A3.

Do I need to wake up A1, A2, and A3?
Maybe I should only wake up A2 and A3?
Can I pick just one to wake up?
Then, how do I implement the wake-ups?

I and a student of mine implemented something like this on a
VME-bus based SMP system around 1990. We decided to only wake up
the highest (global) priority task. (In the case above, either A2
or A3, depending on priority.) We did this using compare-and-swap
operatoins, in a way that I recall ended up using (for each lock)
something like one global spin-lock variable, one "contending"
variable per CPU, and one additional local spinlock variable per
CPU to avoid bus contention on the global spin-lock variable. We
used a VME-bus interrupt for the lock-releasing CPU to invoke the
scheduler of the CPU of the task selected to next receive the
lock. The interrupt handler could then do the job of waking up
the corresponding contending task on that CPU.

It worked, but such global locks had a lot more overhead than
other locks, mostly due to the inter-processor interrupt.
So, we ended up distinguishing global locks from per-CPU
locks to lower the overhead when we did not need it.

We were using a partitioned scheduling model, or else this
would be a bit more complicated, and I would be talking about the
CPU selected to run the task selected to next receive the lock.

Is there a more efficient way to do this? Or are you all
ready to pay this kind of overhead on every lock in your SMP
system?

Ted

2009-07-15 21:53:08

by Ted Baker

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Tue, Jul 14, 2009 at 09:28:47PM +0200, Henrik Austad wrote:

> ... In MC you need to do this the hard way, namely compute the
> point in time not when the task misses the deadline, but when it
> will *eventually* fail a deadline. By doing that, you combine
> deadline, wcet and granted time in one variable, and you have a
> *single* variable to compare.

This is true in a theoretical sense, and is the basis of some
"optimal" scheduling algorithms, including the "throwforward
scheduling" algorithm. It makes sense in some environments, where
you actually know the WCET of the task in advance. However, I
don't believe a Linux system can expect all applications to
provide this kind of information.

In a system programmed using process and threads, the decision to
sleep or wake is embedded in the internal logic of the thread, and
implemented by system calls. The existing system calls do not
convey how long the thread needs to execute before it reaches its
next suspension point. Therefore, without a new API you cannot
use WCET. If you create a new API for this, you are limiting this
form of scheduling to threads that choose to use that API, and are
able to provide the needed WCET information. This seems like a
small number of cases among the full range of real-time Linux
applications.

Ted

2009-07-15 21:53:47

by Chris Friesen

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

Ted Baker wrote:

> The present Linux "RT throttling" mechanism seems to be an attempt
> to achieve something similar (preventing RT tasks from shutting
> out other work, by reserving some bandwidth for non-RT tasks).
> However, it is done so grossly as to be unsatisfactory for
> real-time applications. The better way to do this would be to
> enforce a bandwidth limit on each task with real-time priority --
> using the budget amount and replenishment interval required by
> that task -- and enforce by admission control that the real-time
> tasks do not exceed some configurable percent of the total system
> utilization.

From an API standpoint, the "group scheduling" functionality in linux
allows for the creation of an arbitrary hierarchy of groups, each of
which may contain zero or more tasks. (Each task is associated with
exactly one group.)

There is a distinction between groups containing realtime tasks, and
groups containing non-realtime tasks. For realtime groups, each group
is allocated a specific amount of cpu time. For non-realtime groups,
each group is allocated a specific weight.

A realtime group may use up to its specified amount of cpu time. Any
cpu time not used by a realtime group is distributed to the non-realtime
groups according to their relative weights.

This does add a whole different API to the mix, but allows for controls
to be set by the administrator on existing POSIX apps without needing to
recompile them.

Chris

2009-07-15 22:12:14

by Chris Friesen

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

Ted Baker wrote:

> I have two questions:
>
> (1) As Jim Anderson points out in a later comment,
> is priority inheritance (of any form) what you really want
> on an SMP system? It does not give a good worst-case blocking
> time bound.

As Peter mentioned, it's not so much a matter of whether it's desired or
not. It's required, at least in terms of supporting priority
inheritance for pthread mutexes. I haven't been involved with the
linux-rt side of things, but I believe Peter implied that they used PI
fairly heavily there to try to minimize priority inversion issues
because it was infeasible to analyze all the various locks. The kernel
is over 10 million lines of code, so any locking changes pretty much
have to fit into the existing framework with minimal changes.

> (2) If you do want priority inheritance, how do I implement the
> mechanics of the mechanics of the unlock operation of B on one
> processor causing C to be preempted on the other processor, simply
> and promptly?

<snip>

> I and a student of mine implemented something like this on a
> VME-bus based SMP system around 1990. We decided to only wake up
> the highest (global) priority task. (In the case above, either A2
> or A3, depending on priority.) We did this using compare-and-swap
> operatoins, in a way that I recall ended up using (for each lock)
> something like one global spin-lock variable, one "contending"
> variable per CPU, and one additional local spinlock variable per
> CPU to avoid bus contention on the global spin-lock variable. We
> used a VME-bus interrupt for the lock-releasing CPU to invoke the
> scheduler of the CPU of the task selected to next receive the
> lock. The interrupt handler could then do the job of waking up
> the corresponding contending task on that CPU.
>
> It worked, but such global locks had a lot more overhead than
> other locks, mostly due to the inter-processor interrupt.
> So, we ended up distinguishing global locks from per-CPU
> locks to lower the overhead when we did not need it.
>
> We were using a partitioned scheduling model, or else this
> would be a bit more complicated, and I would be talking about the
> CPU selected to run the task selected to next receive the lock.
>
> Is there a more efficient way to do this? or are you all
> ready to pay this kind of overhead on every lock in your SMP
> system?

Peter would have a better idea of the low-level implementation than I,
but I suspect that the general idea would be similar, unless we were to
consider migrating the task chosen to run to the current cpu. That
would save the IPI, at a cost of task migration. (Which of course
wouldn't be possible in the case of cpu affinity.)

As to the additional cost, POSIX distinguishes between PI mutexes and
regular mutexes. Any additional penalty due to PI should be limited to
the mutexes which actually have PI enabled. If an application can limit
itself to "safe" syscalls and has enough knowledge of its own locking,
then it could presumably use regular mutexes and avoid the overhead of PI.

I'm not sure the same could apply to the kernel in general...Peter would
have to address that as I'm not familiar with the linux-rt changes.

Chris

2009-07-15 22:14:18

by Ted Baker

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Tue, Jul 14, 2009 at 12:24:26PM -0600, Chris Friesen wrote:

> > - that A's budget is not diminished.
>
> If we're running B with A's priority, presumably it will get some amount
> of cpu time above and beyond what it would normally have gotten during a
> particular scheduling interval. Perhaps it would make sense to charge B
> what it would normally have gotten, and charge the excess amount to A?

First, why will B get any excess time, if it is charged? There will
certainly be excess time used in any context switch, including
preemptions and blocking/unblocking for locks, but that will come
out of some task's budget. Given the realities of the scheduler,
the front-end portion of the context-switch will be charged to the
preempted or blocking task, and the back-end portion of the
context-switch cost will be charged to the task to which the CPU
is switched. In a cross-processor proxy situation like the one
above we have four switches: (1) from A to C on processor #1; (2)
from whatever else (call it D) that was running on processor #2 to
B, when B receives A's priority; (3) from B back to D when B
releasse the lock; (4) from C to A when A gets the lock. A will
naturally be charged for the front-end cost of (1) and the
back-end cost of (4), and B will naturally be charged for the
back-end cost of (2) and the front-end cost of (3).

The budget of each task must be over-provisioned enough to
allow for these additional costs. This is messy, but seems
unavoidable, and is an important reason for using scheduling
policies that minimize context switches.

Back to the original question, of who should be charged for
the actual critical section.

From the schedulability analysis point of view, B is getting
higher priority time than it normally would be allowed to execute,
potentially causing priority inversion (a.k.a. "interference" or
"blocking") to a higher priority task D (which does not even share
a need for the lock that B is holding) that would otherwise run on
the same processor as B. Without priority inheritance this kind
of interference would not happen. So, we are benefiting A at the
expense of D. In the analysis, we can either allow for all such
interference in a "blocking term" in the analysis for D, or we
might call it "preemption" in the analysis of D and charge it to A
(if A has higher priority than D). Is the latter any better? I
think not, since we now have to inflate the nominal WCET of A to
include all of the critical sections that block it.

So, it seems most logical and simplest to leave the charges where
they naturally occur, on B. That is, if you allow priority
inheritance, you allow tasks to sometimes run at higher priority
than they originally were allocated, but not to execute more
than originally budgeted.

Ted

2009-07-15 22:34:06

by Ted Baker

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Wed, Jul 15, 2009 at 03:53:33PM -0600, Chris Friesen wrote:

> From an API standpoint, the "group scheduling" functionality in linux
> allows for the creation of an arbitrary hierarchy of groups, each of
> which may contain zero or more tasks. (Each task is associated with
> exactly one group.)
>
> There is a distinction between groups containing realtime tasks, and
> groups containing non-realtime tasks. For realtime groups, each group
> is allocated a specific amount of cpu time. For non-realtime groups,
> each group is allocated a specific weight.
>
> A realtime group may use up to its specified amount of cpu time. Any
> cpu time not used by a realtime group is distributed to the non-realtime
> groups according to their relative weights.
>
> This does add a whole different API to the mix, but allows for controls
> to be set by the administrator on existing POSIX apps without needing to
> recompile them.

This is in the right direction, but there is a lot about Linux groups
that I either do not understand or which falls short of what is needed.
Perhaps you can point me to an up to date detailed explanation of
how they work?

From what I've been able to infer from my brief foray into that
part of the kernel code (a year ago), there seemed to be several
aspects of the group scheduling that did not seem to admit
schedulability analysis. (I admit that I may have read it wrong,
and that these statements are false.)

1) The priority of a group seemed to be defined by the priority of
the highest-priority thread in the group's run-queue, which means
it varies dynamically according to which threads in the group are
contending.

2) Budget enforcement seemed to only occur at system tick
boundaries, which means precision can only be achieved at the cost
of frequent clock interrupts.

3) It seemed that a thread could belong to more than one group,
and so could distribute charges arbitrarily between groups.
If so, budget allocation would seem very difficult.

4) On an SMP, more than one thread could be running against
the same budget at the same time, resulting in budget over-charges.

I am particularly concerned about the latter. The published
analyses of hierarchical generalizations of bandwidth
limiting/guaranteeing aperiodic server scheduling algorithms I
have seen so far all seem to require allocating bandwidth/budget
to groups on a per-processor basis, so as to void concurrent
charges to the same budget.

Ted

2009-07-15 22:39:22

by Dhaval Giani

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Thu, Jul 16, 2009 at 4:04 AM, Ted Baker<[email protected]> wrote:
> On Wed, Jul 15, 2009 at 03:53:33PM -0600, Chris Friesen wrote:
>
>> From an API standpoint, the "group scheduling" functionality in linux
>> allows for the creation of an arbitrary hierarchy of groups, each of
>> which may contain zero or more tasks. (Each task is associated with
>> exactly one group.)
>>
>> There is a distinction between groups containing realtime tasks, and
>> groups containing non-realtime tasks. For realtime groups, each group
>> is allocated a specific amount of cpu time. For non-realtime groups,
>> each group is allocated a specific weight.
>>
>> A realtime group may use up to its specified amount of cpu time. Any
>> cpu time not used by a realtime group is distributed to the non-realtime
>> groups according to their relative weights.
>>
>> This does add a whole different API to the mix, but allows for controls
>> to be set by the administrator on existing POSIX apps without needing to
>> recompile them.
>
> This is in the right direction, but there is a lot about Linux groups
> that I either do not understand or which falls short of what is needed.
> Perhaps you can point me to an up to date detailed explanation of
> how they work?
>
> From what I've been able to infer from my brief foray into that
> part of the kernel code (a year ago), there seemed to be several
> aspects of the group scheduling that did not seem to admit
> schedulability analysis. (I admit that I may have read it wrong,
> and these statements are false.)
>
> 1) The priority of a group seemed to be defined by the priority of
> the highest-priority thread in the group's run-queue, which means
> it varies dynamically according to which threads in the group are
> contending.
>

This is true, but it also ensures that the time allocated to the group
is also consumed by the group if it wants to.

> 2) Budget enforcement seemed to only occur at system tick
> boundaries, which means precision can only be achieved at the cost
> of frequent clock interrupts.
>
> 3) It seemed that a thread could belong to more than one group,
> and so distributed charges arbitrarily between groups.
> If so, budget allocation would seem very difficult.
>

No. A thread can only be a part of one group at a time.

> 4) On an SMP, more than one thread could be running against
> the same budget at the same time, resulting in budget over-charges.
>

The rt group scheduler does split the budget per cpu. On expiring the
budget, it tries to borrow from other CPUs if possible.

Thanks,
Dhaval
--

Spike Milligan - "All I ask is the chance to prove that money can't
make me happy." -
http://www.brainyquote.com/quotes/authors/s/spike_milligan.html

2009-07-15 22:52:27

by Ted Baker

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Wed, Jul 15, 2009 at 04:12:00PM -0600, Chris Friesen wrote:

> As Peter mentioned, it's not so much a matter of whether it's desired or
> not. It's required, at least in terms of supporting priority
> inheritance for pthread mutexes. I haven't been involved with the
> linux-rt side of things, but I believe Peter implied that they used PI
> fairly heavily there to try to minimize priority inversion issues
> because it was infeasible to analyze all the various locks. The kernel
> is over 10 million lines of code, so any locking changes pretty much
> have to fit into the existing framework with minimal changes.
...
> Any additional penalty due to PI should be limited to
> the mutexes which actually have PI enabled. If an application can limit
> itself to "safe" syscalls and has enough knowledge of its own locking,
> then it could presumably use regular mutexes and avoid the overhead of PI.
>
> I'm not sure the same could apply to the kernel in general...Peter would
> have to address that as I'm not familiar with the linux-rt changes.

I understand the need to support the (largely broken for SMP)
POSIX real-time API, but from what I read of the scheduler it
seems that support for priority inheritance on mutexes has greatly
complicated the kernel, and that the cost extends to applications
that do not use priority inheritance mutexes. I especially don't
see why priority inheritance mutexes should ever be used by
the kernel itself. Is there a way to provide in the kernel
whatever minimal hooks are needed to support the API for users who
want it, with less impact?

My experience on IEEE PASC and the Austin Group, including work
with POSIX validation test suites, has shown that many of the
POSIX specifications leave a huge amount of room for
implementation freedom -- much larger than I thought when I
balloted them. I found this out when I wrote test programs that
implementors argued went beyond the specs, when I complained as
user about POSIX implementations that I thought did not meet the
specs, and when I read interpretation requests from other users
and implementors.

So, I think it would be worth the effort to read the POSIX
specification for PTHREAD_PRIO_INHERIT carefully, with a lawyerly
attitude, to see if it really requires going to the lengths the
Linux kernel currently goes. That is, look for loopholes that
permit Linux to do the right thing. I can't do this today, but
maybe I will another day.

Ted

2009-07-15 23:16:50

by Ted Baker

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

> > 1) The priority of a group seemed to be defined by the priority of
> > the highest-priority thread in the group's run-queue, which means
> > it varies dynamically according to which threads in the group are
> > contending.
> >
>
> This is true, but it also ensures that the time allocated to the group
> is also consumed by group if it wants to.

I don't see how schedulability analysis can be done with this model,
since a single budget is being expended at varying priorities/deadlines.

> > 4) On an SMP, more than one thread could be running against
> > the same budget at the same time, resulting in budget over-charges.
> >
>
> The rt group scheduler does split the budget per cpu. On expiring the
> budget, it tries to borrow from other CPUs if possible.

First, how is the splitting of the budget between CPU's controlled
by the application?

Second, I don't see how schedulability analysis could be done if
CPU's can "borrow" budget from other CPUs, unless there is some
mechanism in place to "pay it back". How do you do the analysis?

Ted

2009-07-15 23:39:44

by Ted Baker

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Mon, Jul 13, 2009 at 03:45:11PM -0600, Chris Friesen wrote:

> Given that the semantics of POSIX PI locking assumes certain scheduler
> behaviours, is it actually abstraction inversion to have that same
> dependency expressed in the kernel code that implements it?
...>
> The whole point of mutexes (and semaphores) within the linux kernel is
> that it is possible to block while holding them. I suspect you're going
> to find it fairly difficult to convince people to spinlocks just to make
> it possible to provide latency guarantees.

The abstraction inversion is when the kernel uses (internally)
something as complex as a POSIX PI mutex. So, I'm not arguing
that the kernel does not need internal mutexes/semaphores that
can be held while a task is suspended/blocked. I'm just arguing
that those internal mutexes/semaphores should not be PI ones.

> ... the selling point for PI vs PP is that under PIP the
> priority of the lock holder is automatically boosted only if
> necessary, and only as high as necessary.

The putative benefit of this is disputed, as shown by Jim and
Bjorn's work with LITMUS-RT and others. For a difference to be
noted, there must be a lot of contention and long critical
sections. The benefit of less frequent priority boosting and
lower priorities can be offset by an increased worst-case
number of context switches.

> On the other hand, PP requires code analysis to properly set the
> ceilings for each individual mutex.

Indeed, this is difficult, but no more difficult than estimating
worst-case blocking times, which requires more extensive code
analysis and requires consideration of more cases with PI than PP.

If determining the exact ceiling is too difficult, one can simply
set the ceiling to the maximum priority used by the application.

Again, I don't think that either PP or PI is appropriate for use
in a (SMP) kernel. For non-blocking locks, the current
no-preemption spinlock mechanism works. For higher-level
(blocking) locks, I'm attracted to Jim Anderson's model of
non-preemptable critical sections, combined with FIFO queue
service.

> Certainly if you block waiting for I/O while holding a lock then it
> impacts the ability to provide latency guarantees for others waiting for
> that lock. But this has nothing to do with PI vs PP or spinlocks, and
> everything to do with how the lock is actually used.

My only point there was with respect to application-level use of
POSIX mutexes, that if an application needs to suspend while
holding a mutex (e.g., for I/O) then the application will have
potentially unbounded priority inversion, and so is losing the
benefit from priority inheritance. So, if the only benefit of
PRIO_INHERIT over PRIO_PROTECT is being able to suspend while
holding a lock, there is no real benefit.

Ted

2009-07-16 07:17:30

by Henrik Austad

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Thursday 16 July 2009 00:14:11 Ted Baker wrote:
> On Tue, Jul 14, 2009 at 12:24:26PM -0600, Chris Friesen wrote:
> > > - that A's budget is not diminished.
> >
> > If we're running B with A's priority, presumably it will get some amount
> > of cpu time above and beyond what it would normally have gotten during a
> > particular scheduling interval. Perhaps it would make sense to charge B
> > what it would normally have gotten, and charge the excess amount to A?
>
> First, why will B get any excess time, if is charged?

My understanding of PEP is that when B executes through the A-proxy, B will
consume parts of A's resources until the lock is freed. This makes sense when
A and B run on different CPUs and B is moved (temporarily) to CPU#A. If B
were to use its own budget when running here, then once A resumes execution and
exhausts its entire budget, you can have over-utilization on that CPU (and
under-utilization on CPU#B).

> There will
> certainly be excess time used in any context switch, including
> premptions and blocking/unblocking for locks, but that will come
> out of some task's budget.

AFAIK, there is no such thing as charging preemption overhead to a task's
budget in the kernel today. This time simply vanishes and must be compensated
for when running a task through the acceptance stage (say, by allowing only
95% utilization per CPU or some such).

> Given the realities of the scheduler,
> the front-end portion of the context-switch will be charged to the
> preempted or blocking task, and the back-end portion of the
> context-switch cost will be charged to the task to which the CPU
> is switched.

> In a cross-processor proxy situation like the one
> above we have four switches: (1) from A to C on processor #1; (2)
> from whatever else (call it D) that was running on processor #2 to
> B, when B receives A's priority; (3) from B back to D when B
> releasse the lock; (4) from C to A when A gets the lock. A will
> naturally be charged for the front-end cost of (1) and the
> back-end cost of (4), and B will naturally be charged for the
> back-end cost of (2) and the front-end cost of (3).
>
> The budget of each task must be over-provisioned enough to
> allow for these additional costs. This is messy, but seems
> unavoidable, and is an important reason for using scheduling
> policies that minimize context switches.
>
> Back to the original question, of who should be charged for
> the actual critical section.

That depends on where you want to run the tasks. If you want to migrate B to
CPU#A, A should be charged. If you run B on CPU#B, then B should be charged
(for the exact same reasoning A should be charged in the first case).

The beauty of PEP is that enabling B to run is very easy. In the case where B
runs on CPU#B, B must be updated statically so that the scheduler will
trigger on the new priority. In PEP, this is done automatically when A is
picked. One alternative would be to migrate A to CPU#B and insert A
into the runqueue there. However, then you add more overhead by moving the
task around instead of just 'borrowing' the task_struct.
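
As a rough illustration of that proxy idea (not actual kernel code;
pick_eligible(), blocked_on, owner and charge_to are all invented names, and
deadlock/cycle handling is ignored), a minimal sketch:

struct task;

struct lock {
    struct task *owner;             /* current holder of the lock */
};

struct task {
    struct lock *blocked_on;        /* lock this task waits for, if any */
    struct task *charge_to;         /* whose budget to charge while running */
};

struct rq;
static struct task *pick_eligible(struct rq *rq);   /* the scheduler's normal pick */

static struct task *pick_next_with_proxy(struct rq *rq)
{
    struct task *picked = pick_eligible(rq);
    struct task *run = picked;

    /* Follow the chain of lock owners until we reach a runnable task. */
    while (run && run->blocked_on)
        run = run->blocked_on->owner;

    /* Run the owner in the blocked task's place, on its CPU and budget. */
    if (run && run != picked)
        run->charge_to = picked;

    return run;
}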

> From the schedulability analysis point of view, B is getting
> higher priority time than it normally would be allowed to execute,
> potentially causing priority inversion (a.k.a. "interference" or
> "blocking") to a higher priority task D (which does not even share
> a need for the lock that B is holding) that would otherwise run on
> the same processor as B. Without priority inheritance this kind
> of interferfence would not happen. So, we are benefiting A at the
> expense of D. In the analysis, we can either allow for all such
> interference in a "blocking term" in the analysis for D, or we
> might call it "preemption" in the analysis of D and charge it to A
> (if A has higher priority than D). Is the latter any better?

If D has higher priority than A, then neither A nor B (with the locks held)
should be allowed to run before D.

> I
> think not, since we now have to inflate the nominal WCET of A to
> include all of the critical sections that block it.
>
> So, it seems most logical and simplest to leave the charges where
> they naturally occur, on B. That is, if you allow priority
> inheritance, you allow tasks to sometimes run at higher priority
> than they originally were allocated, but not to execute more
> than originally budgeted.

Yes, no task should be allowed to run for more than its budget, but that requires
B to execute *only* on CPU#B.

On the other hand, one could say that if you run PEP and B is executed on
CPU#A, and A then exhausts its budget, you could blame A as well, as
lock-contention is a common problem and it's not only the kernel's fault. Do
we need perfect or best-effort lock-resolving?

> Ted

--
henrik

2009-07-16 07:58:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Wed, 2009-07-15 at 19:11 -0400, Ted Baker wrote:
> On Mon, Jul 13, 2009 at 03:45:11PM -0600, Chris Friesen wrote:
>
> > Given that the semantics of POSIX PI locking assumes certain scheduler
> > behaviours, is it actually abstraction inversion to have that same
> > dependency expressed in the kernel code that implements it?
> ....>
> > The whole point of mutexes (and semaphores) within the linux kernel is
> > that it is possible to block while holding them. I suspect you're going
> > to find it fairly difficult to convince people to spinlocks just to make
> > it possible to provide latency guarantees.
>
> The abstraction inversion is when the kernel uses (internally)
> something as complex as a POSIX PI mutex. So, I'm not arguing
> that the kernel does not need internal mutexes/semaphores that
> can be held while a task is suspended/blocked. I'm just arguing
> that those internal mutexes/semaphores should not be PI ones.
>
> > ... the selling point for PI vs PP is that under PIP the
> > priority of the lock holder is automatically boosted only if
> > necessary, and only as high as necessary.
>
> The putative benefit of this is disputed, as shown by Jim and
> Bjorn's work with LITMUS-RT and others. For difference to be
> noted, there must be a lot of contention, and long critical
> sections. The benefit of less frequent priority boosting and
> lower priorities can be balanced by more increased worst-case
> number of context switches.
>
> > On the other hand, PP requires code analysis to properly set the
> > ceilings for each individual mutex.
>
> Indeed, this is difficult, but no more difficult than estimating
> worst-case blocking times, which requires more extensive code
> analysis and requires consideration of more cases with PI than PP.
>
> If determining the exact ceiling is too difficult. one can simply
> set the ceiling to the maximum priority used by the application.
>
> Again, I don't think that either PP or PI is appropriate for use
> in a (SMP) kernel. For non-blocking locks, the current
> no-preeemption spinlock mechanism works. For higher-level
> (blocking) locks, I'm attracted to Jim Anderson's model of
> non-preemptable critical sections, combined with FIFO queue
> service.

Right, so there's two points here I think:

A) making most locks preemptible
B) adding PI to all preemptible locks

I think that we can all agree that if you do A, B makes heaps of sense,
right?

I just asked Thomas if he could remember any numbers on this, and he
said that keeping all the locks non-preemptible had at least an order
difference in max latencies [ so a 60us (A+B) would turn into 600us (!
A) ], this means a proportional decrease for the max freq of periodic
tasks.

This led to the conviction that the PI overheads are worth it, since
people actually want high freq tasks.

Of course, when the decreased period is still sufficient for the
application at hand, the non-preemptible case allows for better
analysis.

2009-07-16 08:55:47

by Thomas Gleixner

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Thu, 16 Jul 2009, Peter Zijlstra wrote:
> On Wed, 2009-07-15 at 19:11 -0400, Ted Baker wrote:
> > Again, I don't think that either PP or PI is appropriate for use
> > in a (SMP) kernel. For non-blocking locks, the current
> > no-preeemption spinlock mechanism works. For higher-level
> > (blocking) locks, I'm attracted to Jim Anderson's model of
> > non-preemptable critical sections, combined with FIFO queue
> > service.
>
> Right, so there's two points here I think:
>
> A) making most locks preemptible
> B) adding PI to all preemptible locks
>
> I think that we can all agree that if you do A, B makes heaps of sense,
> right?
>
> I just asked Thomas if he could remember any numbers on this, and he
> said that keeping all the locks non-preemptible had at least an order
> difference in max latencies [ so a 60us (A+B) would turn into 600us (!
> A) ], this means a proportional decrease for the max freq of periodic
> tasks.

Those are numbers from about 3 years ago. I need to redo the tests as
we have done a lot of lock breaking and shortening of preempt/irq disabled
sections since then, but when we started preempt-rt there was no real
choice, as the limited number of developers simply did not allow us to
analyse and fix all the long-lasting critical sections. We were kept busy
enough fixing all the locking problems which were unearthed. :) Also
we did not have the tools to analyse the length of critical sections
back then.

It's definitely worthwhile to revisit this, but that's going to be a
multi-man-year effort to disentangle complex code paths like the
network stack, disk i/o and other known sources of trouble. The next
challenge will be how to monitor the code base for regressions and
keep thousands of developers aware of these requirements.

> This led to the conviction that the PI overheads are worth it, since
> people actually want high freq tasks.
>
> Of course, when the decreased period is still sufficient for the
> application at hand, the non-preemptible case allows for better
> analysis.

Agreed, but the real challenge of providing real time services in
Linux is the wide variety of use cases. We simply need to accept that
people want to use it from

high frequency industrial control tasks while running a GUI, a
webserver and a database on the same machine

to

heavily threaded enterprise class Java applications which do not
care that much about latencies but need the correctness guarantees

and of course everything in between.

I'm well aware that we are trying to create something which does everything
except play the National Anthem, but simply restricting the use
cases is not an option and would be exceptionally boring as well :)

Thanks,

tglx

2009-07-16 08:58:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Wed, 2009-07-15 at 19:16 -0400, Ted Baker wrote:
> > > 1) The priority of a group seemed to be defined by the priority of
> > > the highest-priority thread in the group's run-queue, which means
> > > it varies dynamically according to which threads in the group are
> > > contending.
> > >
> >
> > This is true, but it also ensures that the time allocated to the group
> > is also consumed by group if it wants to.
>
> I don't see how schedulability analysis can be done with this model,
> since a single budget is being expended at varying priorities/deadlines.
>
> > > 4) On an SMP, more than one thread could be running against
> > > the same budget at the same time, resulting in budget over-charges.
> > >
> >
> > The rt group scheduler does split the budget per cpu. On expiring the
> > budget, it tries to borrow from other CPUs if possible.
>
> First, how is the splitting of the budget between CPU's controlled
> by the application?
>
> Second, I don't see how schedulabiliyt analysis could be done if
> CPU's can "borrow" budget from other CPUs, unless there is some
> mechanism in place to "pay it back". How do you do the analysis?

Right so control-groups (cgroups for short) are a form of
virtualization. Each controller is specific to a resource. We have
memory controllers, namespace controllers and also a scheduler
controller.

If you would apply all controllers to the same task groups you get a
result like chroot on steroids, or jails etc. But you can also use them
individually to control resources in creative ways.

In order to manage RT resources you want:

- a minimum bandwidth guarantee
- isolation

So ideally you want a CBS server that schedules your RT (FIFO/RR) tasks.

Now, let me first state that the current code is a hack, and I know it's
nowhere near proper. But it was the best I could come up with on short
notice -- and Fabio is now looking at doing better :-)

Furthermore the whole feature is still marked EXPERIMENTAL, basically
because I do recognize it for the hack it is -- that said, some people
might find it useful.


So a task can only belong to 1 group of any 1 controller (that is, it
can belong to multiple groups but only of different controllers) and
since there is only 1 scheduler controller, we can, for the purpose of
this discussion say it can only belong to a single group.

These groups get assigned a bandwidth through the use of a period and
runtime group parameter -- the documentation states that using different
periods is currently broken and would require a deadline server.

Therefore we can assume all periods are equal, for the rest of this
description -- and set it to 1s.
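
For concreteness, a minimal sketch of setting those period/runtime parameters
from user space, via the cpu controller's cpu.rt_period_us and cpu.rt_runtime_us
files (the /cgroups mount point and the group name are assumptions of the
sketch; error handling is omitted):

#include <stdio.h>

/* Give 'group' a runtime budget of runtime_us out of every period_us. */
static void set_rt_bandwidth(const char *group, long period_us, long runtime_us)
{
    char path[256];
    FILE *f;

    snprintf(path, sizeof(path), "/cgroups/%s/cpu.rt_period_us", group);
    if ((f = fopen(path, "w"))) {
        fprintf(f, "%ld\n", period_us);
        fclose(f);
    }

    snprintf(path, sizeof(path), "/cgroups/%s/cpu.rt_runtime_us", group);
    if ((f = fopen(path, "w"))) {
        fprintf(f, "%ld\n", runtime_us);
        fclose(f);
    }
}

int main(void)
{
    set_rt_bandwidth("mygroup", 1000000, 400000);   /* 40% of each 1s period */
    return 0;
}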


So what does it do? It's a hierarchical FIFO scheduler, with the prio of
a group being that of the max prio of its children. If a group runs out
of quota it will be dequeued. When the period timer comes along and
refreshes the quota it will be requeued.

      R
     / \
    A   B
   / \  |\
  1   4 C 3
        |
        2

groups in letters, tasks in digits.

Assume the tasks are hogs with descending priority relative to their
numbering, and that A has 40% bandwidth, B has 30% and C has 20%.

Let's first look at UP.

Then since 1 would be the highest priority task, A would have the
priority of 1, and we'd select A->1 to run.

This would continue for 400ms; after that the whole of A will be
dequeued. Next in line would be B->C->2, which can run for 200ms before
C gets dequeued, leaving B->3 as the only option, with the remaining
100ms of B's budget left to run.

Once the refresh timer comes along things get replenished and can run
again.

SMP

The cgroup interface specifies bandwidth per single cpu; when more cpus
are present in the load-balance domain (see cpusets), the total bandwidth
available to the group is the specified amount multiplied by the number of
available cpus.

The initial distribution is such that each cpu gets equal bandwidth.

Now on 2 cpus, we'd want A->1 and B->C->2 running, since those are the
highest prio tasks on the system.

Since we have 2 cpus the budget for C is 200ms per cpu, 400ms total.

For the first 200ms we'll run 1 on cpu0 and 2 on cpu1. At that point
we'll find that cpu1's budget for C is depleted.

It will then look at other cpus in the load-balance domain (cpu0) and
transfer half of C's unused budget over to itself (with rounding up so
we can indeed leave the other cpus with 0).

This way C->2 can get to run for up to 400ms on cpu1. After that C gets
dequeued and B->3 will take over as the next highest task.

Now, after 600ms total B will have depleted its quota and the only tasks
left are A->{1,4}. A will have consumed 600 of its 800ms, and will now
have to spread this over 2 cpus. [ basically cpu0 gets to transfer less
off of cpu1 because 4 will be consuming A's quota on it ]
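
A rough sketch of that borrowing step (illustrative only -- the types and
names below are made up; the real logic lives in the rt group scheduler):

struct rt_cpu_budget {
    long quota;     /* per-cpu quota per period, in us */
    long runtime;   /* remaining budget on this cpu, in us */
};

/* Called when this_cpu has run out of runtime for a group: pull half of
 * every other cpu's unused runtime over to this_cpu (rounded up, so a
 * remote cpu can eventually be left with 0). */
static void borrow_runtime(struct rt_cpu_budget *cpu_budget, int nr_cpus, int this_cpu)
{
    int cpu;

    for (cpu = 0; cpu < nr_cpus; cpu++) {
        long take;

        if (cpu == this_cpu || cpu_budget[cpu].runtime <= 0)
            continue;

        take = (cpu_budget[cpu].runtime + 1) / 2;   /* half, rounded up */
        cpu_budget[cpu].runtime -= take;
        cpu_budget[this_cpu].runtime += take;
    }
}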

Locks.

Suppose they're not hogs and behave like proper tasks and complete their
work within budget. In this case nobody will get throttled and we all
live happily together.

Now suppose one application, say 3, has a bug and runs out of quota
whilst holding a lock.

Further suppose that 1 contends on that lock. The implemented behaviour
is that we PI boost 3. The group is temporarily placed back on the
runqueue and we allow 3 to overcommit on its budget in order to release
the lock.

This overcommit is accounted, and only once the budget turns positive
again (due to sufficient refresh) will the group be requeued.
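
Roughly, that throttle/refresh behaviour could be sketched like this (the
names are illustrative, not the actual code):

struct rt_group {
    long runtime;   /* remaining budget; may go negative while boosted */
    long quota;     /* amount added back at every period refresh */
    int  boosted;   /* a contending task has PI-boosted this group */
    int  throttled;
};

static int group_throttled(struct rt_group *grp)
{
    if (grp->runtime > 0)
        return 0;               /* budget left, run normally */
    if (grp->boosted)
        return 0;               /* overcommit: runtime is allowed to go negative */
    grp->throttled = 1;         /* dequeued until a refresh */
    return 1;
}

static void period_refresh(struct rt_group *grp)
{
    grp->runtime += grp->quota;         /* any overcommit is paid back here */
    if (grp->runtime > 0)
        grp->throttled = 0;             /* positive again: requeue the group */
}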


Now, why I did things like this.

Because doing a deadline CBS server
- is non-trivial.
- would mean we'd have to deal with deadline inheritance.
- would significantly complicate the load-balancing.

Is any of that an excuse? No, not really, but it is something useful for
some people, especially in the case of normal usage where you'd not hit the
limits, and in case you do, you only penalize the one who does.

So as a first approximation it seems to work in practice.

I'm very glad Fabio is working on improving things.

2009-07-16 09:12:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Thu, 2009-07-16 at 10:58 +0200, Peter Zijlstra wrote:
> On Wed, 2009-07-15 at 19:16 -0400, Ted Baker wrote:
> > > > 1) The priority of a group seemed to be defined by the priority of
> > > > the highest-priority thread in the group's run-queue, which means
> > > > it varies dynamically according to which threads in the group are
> > > > contending.
> > > >
> > >
> > > This is true, but it also ensures that the time allocated to the group
> > > is also consumed by group if it wants to.
> >
> > I don't see how schedulability analysis can be done with this model,
> > since a single budget is being expended at varying priorities/deadlines.
> >
> > > > 4) On an SMP, more than one thread could be running against
> > > > the same budget at the same time, resulting in budget over-charges.
> > > >
> > >
> > > The rt group scheduler does split the budget per cpu. On expiring the
> > > budget, it tries to borrow from other CPUs if possible.
> >
> > First, how is the splitting of the budget between CPU's controlled
> > by the application?
> >
> > Second, I don't see how schedulabiliyt analysis could be done if
> > CPU's can "borrow" budget from other CPUs, unless there is some
> > mechanism in place to "pay it back". How do you do the analysis?
>
> Right so control-groups (cgroups for short) are a form of
> virtualization. Each controller is specific to a resource. We have
> memory controllers, namespace controllers and also a scheduler
> controller.
>
> If you would apply all controllers to the same task groups you get a
> result like chroot on steroids, or jails etc. But you can also use them
> individually to control resources in creative ways.

To add to this, it is by no means a way of introducing deadline
scheduling for tasks; for that, you're quite right that we should
extend sched_setscheduler() and struct sched_param.

Somewhere before RTLWS we'll add:

struct sched_param2;

sys_sched_setscheduler2(pid_t pid, int policy, struct sched_param2 *param);
sys_sched_setparam2(pid_t pid, struct sched_param2 *param);
sys_sched_getparam2(pid_t pid, struct sched_param2 *param);

SCHED_SOFT_DEADLINE (bounded tardiness)
SCHED_DEADLINE (no tardiness)

and hopefully enough hooks to make people happy :-)

The intention is to add these to -rt only for now, so that we can still
poke at the interfaces and only move them to mainline once the dust
settles down.
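
The layout of sched_param2 is not spelled out above, so purely as a guess at
what using such an interface could look like (everything in this sketch is
hypothetical):

#include <time.h>

struct sched_param2 {
    int             sched_priority;     /* kept for the existing policies */
    struct timespec sched_runtime;      /* budget per period */
    struct timespec sched_deadline;
    struct timespec sched_period;
};

int main(void)
{
    struct sched_param2 p = {
        .sched_runtime  = { 0, 10 * 1000 * 1000 },  /* 10 ms of budget */
        .sched_deadline = { 0, 50 * 1000 * 1000 },  /* every 50 ms */
        .sched_period   = { 0, 50 * 1000 * 1000 },
    };

    /* Once the syscall exists, something like:
     *     sched_setscheduler2(getpid(), SCHED_DEADLINE, &p);
     */
    (void)p;
    return 0;
}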

2009-07-16 12:17:57

by Dario Faggioli

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Thu, 2009-07-16 at 09:58 +0200, Peter Zijlstra wrote:
> > Again, I don't think that either PP or PI is appropriate for use
> > in a (SMP) kernel. For non-blocking locks, the current
> > no-preeemption spinlock mechanism works. For higher-level
> > (blocking) locks, I'm attracted to Jim Anderson's model of
> > non-preemptable critical sections, combined with FIFO queue
> > service.
>
But, if I remember well how FMLP works, there is not that much
difference between the two... I mean, if you, at any (kernel|user)
level, define a short critical section, this is protected by a
non-preemptive FIFO "queue lock", which is how Linux implements --at
least on x86-- spinlocks! :-O

Also, I'm not sure I can find in the FMLP paper any information about the
possibility of a task suspending itself (e.g., for I/O completion) while
holding a short lock... I assume this is not recommended, but I may be
wrong, and, in that case, I hope Prof. Anderson and Bjorn will excuse
and correct me. :-)

On the other hand, if by "blocking locks" you meant the long ones,
I think they are dealt with correctly there, with both suspension and
priority inheritance.

> Right, so there's two points here I think:
>
> A) making most locks preemptible
> B) adding PI to all preemptible locks
>
> I think that we can all agree that if you do A, B makes heaps of sense,
> right?
>
I don't know about all, but I do... I hope I'm not offending anyone
saying that I like priority inheritance!! :-P

> Of course, when the decreased period is still sufficient for the
> application at hand, the non-preemptible case allows for better
> analysis.
>
Sure! I am impressed as well by the amazing results approaches like the
FMLP give in terms of schedulability analysis, and think a little bit of
spinning would not hurt that much, but not to the point of moving all
the locking to spinlocks. :-)

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]



2009-07-16 12:18:05

by Dario Faggioli

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Wed, 2009-07-15 at 19:16 -0400, Ted Baker wrote:
> > > 1) The priority of a group seemed to be defined by the priority of
> > > the highest-priority thread in the group's run-queue, which means
> > > it varies dynamically according to which threads in the group are
> > > contending.
> > >
> >
> > This is true, but it also ensures that the time allocated to the group
> > is also consumed by group if it wants to.
>
> I don't see how schedulability analysis can be done with this model,
> since a single budget is being expended at varying priorities/deadlines.
>
Yes, I agree... These days I'm looking at exactly this, and I have some
stub code to provide rt groups with a priority the user can, if he
wants, set through the cgroupfs.

The main problem is dealing with the "distributed" scheduling with push
and pull based migration mechanism.

Equally interesting, to me, is trying to figure out what kind of
analysis (if any!) could be derived for the current implementation,
which may be a hack --as Peter says-- but has some features I like...
Working on it as well, but I progress slowly; I'm not that good at
theoretical stuff yet! :-D

But that's another thread... I'll let all you know, if interested, soon
I hope. :-)

Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]



2009-07-16 13:00:44

by James H. Anderson

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel



Raistlin wrote:
> Also, I'm not sure I can find in the FMLP paper information about the
> possibility of a task to suspend itself (e.g., I/O completion) while
> holding a short lock... I assume this is not recommended, but may be
> wrong, and, in that case, I hope Prof. Anderson and Bjorn will excuse
> and correct me. :-)
>
>
This is a really excellent point and something I probably should have
mentioned. We developed the FMLP strictly for real-time (only)
workloads. We were specifically looking at protecting memory-resident
resources (not I/O). The FMLP would have to be significantly extended
to work in settings where these assumptions don't hold.

Thanks for pointing this out.

-Jim

2009-07-16 13:37:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Thu, 2009-07-16 at 08:59 -0400, James H. Anderson wrote:
>
> Raistlin wrote:
> > Also, I'm not sure I can find in the FMLP paper information about the
> > possibility of a task to suspend itself (e.g., I/O completion) while
> > holding a short lock... I assume this is not recommended, but may be
> > wrong, and, in that case, I hope Prof. Anderson and Bjorn will excuse
> > and correct me. :-)
> >
> >
> This is a really excellent point and something I probably should have
> mentioned. We developed the FMLP strictly for real-time (only)
> workloads. We were specifically looking at protecting memory-resident
> resources (not I/O). The FMLP would have to be significantly extended
> to work in settings where these assumptions don't hold.

One thing I've been thinking about is extending lockdep to help verify
things like this.

If we were to annotate a syscall/trap with something like:

lockdep_assume_rt()

And teach lockdep about non-RT blocking objects, we could validate not
only that the callchain down from lockdep_assume_rt() does not itself
contain a non-RT resource, but also that we don't take locks which might
depend on such a resource in another code path.

That is, suppose:

sys_foo()
  lockdep_assume_rt()
  mutex_lock(&my_lock)

vs

sys_bar()
  mutex_lock(&my_lock)
  down_read(&mm->mmap_sem)

vs

page-fault
  down_read(&mm->mmap_sem)
  lock_page(page)

This would indeed generate a warning, because mmap_sem is known to block on
IO and there is a dependency (through sys_bar()) between my_lock and
mmap_sem; therefore sys_foo()'s assumption is invalid.


2009-07-16 14:47:12

by Chris Friesen

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

Ted Baker wrote:
> On Tue, Jul 14, 2009 at 12:24:26PM -0600, Chris Friesen wrote:
>
>>> - that A's budget is not diminished.
>> If we're running B with A's priority, presumably it will get some amount
>> of cpu time above and beyond what it would normally have gotten during a
>> particular scheduling interval. Perhaps it would make sense to charge B
>> what it would normally have gotten, and charge the excess amount to A?

> So, it seems most logical and simplest to leave the charges where
> they naturally occur, on B. That is, if you allow priority
> inheritance, you allow tasks to sometimes run at higher priority
> than they originally were allocated, but not to execute more
> than originally budgeted.

I had considered this myself, as the simplicity is appealing.

In this scenario, we're going to disrupt B's scheduling pattern by
forcing it to borrow from its future cpu allocation in order to free up
the lock. It will then be forced to block for a while to make up for
the over-use of its cpu budget.

I was worried that this "bursty" behaviour could cause problems, but on
reflection it seems like if the application is well-behaved to start
with, maybe we can assume that lock-hold times will be small enough that
this shouldn't cause significant problems.

Chris

2009-07-16 15:17:46

by Chris Friesen

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

Ted Baker wrote:
> On Mon, Jul 13, 2009 at 03:45:11PM -0600, Chris Friesen wrote:
>
>> Given that the semantics of POSIX PI locking assumes certain scheduler
>> behaviours, is it actually abstraction inversion to have that same
>> dependency expressed in the kernel code that implements it?
> ...>
>> The whole point of mutexes (and semaphores) within the linux kernel is
>> that it is possible to block while holding them. I suspect you're going
>> to find it fairly difficult to convince people to spinlocks just to make
>> it possible to provide latency guarantees.
>
> The abstraction inversion is when the kernel uses (internally)
> something as complex as a POSIX PI mutex. So, I'm not arguing
> that the kernel does not need internal mutexes/semaphores that
> can be held while a task is suspended/blocked. I'm just arguing
> that those internal mutexes/semaphores should not be PI ones.

This ties back to your other message with the comment about implementing
userspace PI behaviour via some simpler "loopholes".

If the application is already explicitly relying on PI pthread mutexes
(possibly because it hasn't got enough knowledge of itself to do PP or
to design the priorities in such a way that inversion isn't a problem)
then presumably priority inversion in the kernel itself will also be an
issue.

If a high-priority task makes a syscall that requires a lock currently
held by a sleeping low-priority task, and there is a medium priority
task that wants to run, the classic scenario for priority inversion has
been achieved.

>> On the other hand, PP requires code analysis to properly set the
>> ceilings for each individual mutex.
>
> Indeed, this is difficult, but no more difficult than estimating
> worst-case blocking times, which requires more extensive code
> analysis and requires consideration of more cases with PI than PP.

I know of at least one example with millions of lines of code being
ported to linux from another OS. The scheduling requirements are fairly
lax but deadlock due to priority inversion is highly likely. They
compare PI and PP, see that PP requires up-front analysis, so they
enable PI.

I suspect there are other similar cases where deadlock is the real
issue, and hard realtime isn't a concern (but low latency may be
desirable). PI is simple to enable and doesn't require any thought on
the part of the app writer.


>> Certainly if you block waiting for I/O while holding a lock then it
>> impacts the ability to provide latency guarantees for others waiting for
>> that lock. But this has nothing to do with PI vs PP or spinlocks, and
>> everything to do with how the lock is actually used.
>
> My only point there was with respect to application-level use of
> POSIX mutexes, that if an application needs to suspend while
> holding a mutex (e.g., for I/O) then the application will have
> potentially unbounded priority inversion, and so is losing the
> benefit from priority inheritance. So, if the only benefit of
> PRIO_INHERIT over PRIO_PROTECT is being able to suspend while
> holding a lock, there is no real benefit.

At least for POSIX, both PI and PP mutexes can suspend while the lock is
held. From the user's point of view, the only difference between the
two is that PP bumps the lock holder's priority always, while PI bumps
the priority only if/when necessary.

Chris

2009-07-16 17:54:23

by Raj Rajkumar

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

Dear Jim, Ted and others:

Let me first introduce myself. I work on real-time systems at Carnegie
Mellon Univ and am one of the authors on the papers that introduced the
priority inheritance and priority ceiling protocols, and some of their
follow-ons. I was also a co-founder of TimeSys in 1996. My PhD
student, Karthik Lakshmanan, and an SEI Sr. Staff Member, Dr. Dionisio
de Niz, and I have begun to revisit multiprocessor scheduling and
synchronization thanks in part to the surging interest in multicore
processors.

I just came to know about this message thread. I hope, as Jim did, that
I can add something to the ongoing discussion - if not, please bear with me.

First of all, let me agree completely with Jim that simplicity goes a
very long way. Two relevant examples follow. First, fixed-priority
scheduling has dominated most real-time systems (over dynamic-priority
scheduling) since (a) the practical performance differences are not
significant, (b) overheads are lower, and (c) perhaps above all, less
information is required. Secondly, priority ceiling/ceiling emulation
protocols have much better properties than basic priority inheritance
protocols but the latter are more widely used. The reason appears to be
that PCP requires additional information which can also change at
run-time. Overall, simplicity does win.

Secondly, let me fully agree with Ted in that Linux-RT extensions should
support bandwidth control & isolation. My group's work on "reserves"
and "resource kernels" looked at isolating both temporal and spatial
resources (please see
http://www.ece.cmu.edu/~raj/publications.html#ResourceKernels) in the
context of Linux. Karthik's work extended Linux/RK to distributed
systems (Distributed Linux/RK).

Thirdly, I also agree with Jim that non-preemptive locking and FIFO
queueing are fine for mutexes in the kernel. The primary reason is that
critical sections in the kernel are written by experts and tend to be
short. And as it turns out, this is exactly how Linux SMP support has
been for quite a few years.

It looks to me like Jim and Bjoern name the kernel-mutex locking scheme
(of non-preemption and FIFO queueing) as FMLP and advocate it for
user-level mutexes. Jim: Please correct me if my interpretation is
incorrect.

I would like to propose a solution with 2 components. First, priority
inheritance protocols not only prevent (potentially) unbounded priority
inversion but also offer less blocking for higher-priority tasks with
typically tighter timing constraints. It is also well known that
non-preemptive execution is an extreme/simplified form of the priority
ceiling protocol, where every task is assumed to access every shared
resource/mutex and hence every priority ceiling is the highest priority
in the system. There are cases when this is fine (e.g. when all
critical sections are *relatively* small as in the Linux kernel) and
there are cases when this is not (e.g. when some user-level critical
sections accessed only by lower-criticality tasks are *relatively* long
compared to the timing constraints of higher priority tasks).

Component 1: Let a priority ceiling be associated with each user-level
mutex. When the mutex is locked, the corresponding critical section is
executed at the priority ceiling value. The developer can choose this
to be the highest priority in the system in which case it becomes a
non-preemptive critical section. In addition, we could allow mutexes
to either pick basic priority inheritance (desirable for local mutexes?)
or the priority ceiling version (desirable for global mutexes shared
across processors/cores).

Component 2: The queueing discipline for tasks pending on locked mutexes
is the second policy under discussion. Jim argues that it should be
FIFO. Imagine a bunch of lower-priority tasks stuck in front of a
higher-priority task, and the "priority inversion" time can be
considerable. (Some including me would argue that it goes against the
grain of using priority-based scheduling for real-time systems in the
first place. Why not just use FIFO scheduling?). However, a
practical compromise is easy to reach as well. Let the queueing policy
(FIFO or priority-based) on mutexes be an option. FIFO for a "local"
mutex would not be very helpful. And priority-queueing for a "global"
mutex would help tasks with tight timing constraints.

Proposal Summary

1. Associate a priority ceiling (priority at which critical sections
are executed) with each (global) mutex. A macro HIGHEST_CEILING could
represent a value that is higher than any other application-level priority.
2. Allow the developer to specify the queueing discipline on a
(global) mutex. MUTEX_FIFO_QUEUEING and MUTEX_PRIORITY_QUEUEING are
allowed options.
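
For what it's worth, item 1 maps fairly directly onto the existing POSIX
mutex attributes, while item 2 would need a new attribute. A sketch (the
queueing attribute below is hypothetical; the rest is standard pthreads):

#include <pthread.h>
#include <sched.h>

static pthread_mutex_t m;

static void init_global_mutex(void)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);

    /* Item 1: a per-mutex ceiling.  Using the highest priority in the
     * system as the ceiling gives non-preemptive critical sections. */
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_PROTECT);
    pthread_mutexattr_setprioceiling(&attr, sched_get_priority_max(SCHED_FIFO));

    /* Item 2 has no POSIX equivalent today; a hypothetical extension:
     *     pthread_mutexattr_setqueueing_np(&attr, MUTEX_PRIORITY_QUEUEING);
     */

    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);
}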

I would appreciate any comments. Thanks for reading through a lot (if
you reached this far :-)

---
Raj
http://www.ece.cmu.edu/~raj


P.S. Some relevant references from a couple of decades back.

[1] Rajkumar, R., "Real-Time Synchronization Protocols for Shared Memory
Multiprocessors". The Tenth International Conference on Distributed
Computing Systems, 1990.
[2] Rajkumar, R., Sha, L., and Lehoczky J.P. "Real-Time Synchronization
Protocols for Multiprocessors". Proceedings of the
IEEE Real-Time Systems Symposium, December 1988, pp. 259-269.

Global locking is costly - global critical sections could be moved to a
"global synchronization processor" (and is described in one of the
articles above).

2009-07-16 21:26:10

by Ted Baker

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Thu, Jul 16, 2009 at 09:17:32AM -0600, Chris Friesen wrote:

> If a high-priority task A makes a syscall that requires a lock currently
> held by a sleeping low-priority task C, and there is a medium priority B
> task that wants to run, the classic scenario for priority inversion has
> been achieved.

I think you don't really mean "sleeping" low-priority task C,
since then the priority inheritance would do no good. I guess you
mean that C has been/is preempted by B (and for global SMP, there
is some other medium priority task B' that is eligible to run on
A's processor). That could be a priority inversion scenario.

BTW, if migration is allowed the probability of this kind of thing
(and hence the payoff for PIP) goes down rapidly with the number
of processors.

> I know of at least one example with millions of lines of code being
> ported to linux from another OS. The scheduling requirements are fairly
> lax but deadlock due to priority inversion is a highly likely. They
> compare PI and PP, see that PP requires up-front analysis, so they
> enable PI.
>
> I suspect there are other similar cases where deadlock is the real
> issue, and hard realtime isn't a concern (but low latency may be
> desirable). PI is simple to enable and doesn't require any thought on
> the part of the app writer.

I'm confused by your reference to deadlock. Priority inheritance
does not prevent deadlock, even on a single processor.

> At least for POSIX, both PI and PP mutexes can suspend while the lock is
> held. From the user's point of view, the only difference between the
> two is that PP bumps the lock holder's priority always, while PI bumps
> the priority only if/when necessary.

You are right that POSIX missed the point of priority ceilings,
by allowing suspension.

However, there is still a difference in context-switching
overhead. Worst-case, you have twice as many context switches
per critical section with PIP as with PP.

In any case, for a multiprocessor, PP is not enough.




2009-07-16 22:09:34

by Chris Friesen

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

Ted Baker wrote:
> On Thu, Jul 16, 2009 at 09:17:32AM -0600, Chris Friesen wrote:
>
>> If a high-priority task A makes a syscall that requires a lock currently
>> held by a sleeping low-priority task C, and there is a medium priority B
>> task that wants to run, the classic scenario for priority inversion has
>> been achieved.
>
> I think you don't really mean "sleeping" low-priority task C,
> since then the priority inheritance would do no good. I guess you
> mean that C has been/is preempted by B (and for global SMP, there
> is some other medicum priority task B' that is eligible to run on
> A's processor). That could be a priority inversion scenario.

My terminology is getting sloppy. Yes, I meant preempted.

>> I suspect there are other similar cases where deadlock is the real
>> issue, and hard realtime isn't a concern (but low latency may be
>> desirable). PI is simple to enable and doesn't require any thought on
>> the part of the app writer.
>
> I'm confused by your reference to deadlock. Priority inheritance
> does not prevent deadlock, even on a single processor.

Sloppy terminology again. Priority inversion. If all apps are
soft-realtime and B is a pure cpu hog (which can effectively happen on
heavily loaded server systems) then A will never get to run.

>> At least for POSIX, both PI and PP mutexes can suspend while the lock is
>> held. From the user's point of view, the only difference between the
>> two is that PP bumps the lock holder's priority always, while PI bumps
>> the priority only if/when necessary.
>
> You are right that POSIX missed the point of priority ceilings,
> by allowing suspension.

The vast majority of apps written for Linux are POSIX apps, so for this
discussion we need to bear that behaviour in mind.

> However, there is still a difference in context-switching
> overhead. Worst-case, you have twice as many context switches
> per critical section with PIP as with PP.

On the other hand, with PI the uncontended case can be implemented as
atomic operations in userspace. With PP we need to issue at least two
syscalls per lock/unlock cycle even in the uncontended case (to handle
the priority manipulations).
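
For reference, a much simplified sketch of that fast path with PI futexes
(waiter bits, timeouts, robustness and error handling all omitted):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Uncontended case: pure userspace atomics, no syscall.  Only on
 * contention does the kernel get involved and apply PI. */
static int lock_pi(int *futex_word, int my_tid)
{
    if (__sync_bool_compare_and_swap(futex_word, 0, my_tid))
        return 0;                               /* fast path: 0 -> TID */
    return syscall(SYS_futex, futex_word, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}

static int unlock_pi(int *futex_word, int my_tid)
{
    if (__sync_bool_compare_and_swap(futex_word, my_tid, 0))
        return 0;                               /* fast path: TID -> 0 */
    return syscall(SYS_futex, futex_word, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}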

Chris

2009-07-16 22:15:19

by Ted Baker

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Thu, Jul 16, 2009 at 09:58:32AM +0200, Peter Zijlstra wrote:

> > Again, I don't think that either PP or PI is appropriate for use
> > in a (SMP) kernel. For non-blocking locks, the current
> > no-preeemption spinlock mechanism works. For higher-level
> > (blocking) locks, I'm attracted to Jim Anderson's model of
> > non-preemptable critical sections, combined with FIFO queue
> > service.
>
> Right, so there's two points here I think:
>
> A) making most locks preemptible
> B) adding PI to all preemptible locks
>
> I think that we can all agree that if you do A, B makes heaps of sense,
> right?

Maybe. That depends on how much it costs to provide PI for those
locks, and assumes that everyone (all application and OS tasks)
can agree on a global meaning for priority in the system.

I have always liked the simplicity of a global notion of priority,
which is the core idea of global preemptive SMP
scheduling. However, some authors have pointed out scenarios for
*partitioned* multiprocessor scheduling where expediting the
highest priority task on one processor may result in idling other
processors unnecessarily, ultimately resulting in missed
deadlines. For argument's sake, I'll assume that those scenarios
are pathological, and that a good designer who want to partition
can arrange the task-to-cpu assignments and priorities in a way
that prevents them.

There are two elements in this discussion that will require
resolution in such a global priority scheme: (1) how to rank EDF
deadlines vs. fixed priorities; (2) what to do about tasks that
are scheduled by dynamic priority schemes.

In the latter context, I'm thinking first of aperiodic server
scheduling schemes like SCHED_SPORADIC, but the problem occurs
with any scheme where a task's priority varies routinely. (You
already have a bunch of implementation code complexity from
user-initiated priority changes, like pthread_sched_setpolicy(),
but those kinds of changes don't happen often.)

I think both (1) and (2) are covered by what I think has been
termed here the "PEP" scheme or "proxy" scheduling, i.e.,
implementing priority inheritance not by explicitly comparing and
adjusting priorities, but by allowing the system scheduler to use
whatever algorithms it may to choose a task A to execute on each
processor, and then if that task A is blocked by a lock held by
another task B, instead execute B on A's processor if B is not
already executing.

However, this still leaves the question of how to choose which of
several blocked tasks to grant a lock to, when the lock is
released. If you want to do that according to priority (which
it seems one ought to, for consistency) it seems you now have to
maintain a priority-ordered lock queue. That means you are back
into looking at explicit representation of inherited priorities
again, unless you can find a way to use the scheduler to choose
who to grant the lock to.

Some proponents of the original POSIX mutex scheme intended to
solve this by unblocking all the tasks contending for the mutex,
and letting them re-try locking it. This does get around the
explicit priority problem. Whichever task the scheduler chooses
first will get the lock. The problem is that you have the
overhead of unblocking all those tasks (itself a priority
inversion, since you are unblocking some "low priority" tasks). On
a single processor (or, on an SMP if you are lucky), the lock will
be released again soon, and all these unblocked tasks will get
into their critical sections without having to block again.
However, with bad luck, on an SMP, all but one will bang up
against the spin-lock, and have to be blocked again. This will
generate extra context switches on every CPU.... not a good thing.
This scheme also does not seem to work well for partitioned
scheduling, or any scheme with per-cpu run queues, since the
scheduling is being done in an uncoordinated way on multiple
processors.

So, maybe Jim's model of FIFO service in queues is the right one?
It's predictable. Even if it can cause some unnecessary priority
inversion, the priority inversion should be bounded.

I still conceptually prefer the idea of granting locks to
contending tasks in priority order, of course. It is just a
question of whether you want to have to agree (1) that all
scheduling is based on priority, and (2) pay the price for either
(2a) dynamically re-ordering all those queues every time a task
gains or loses priority (due to inheritance, or whatever), or (2b)
pay the O(n) price of scanning the queue for the currently
highest-priority task when you grant the lock. If you go this
way, I would favor the latter. In any system that does not
already have poor performance due to excessive lock contention,
the queues should rarely have more than one member. Assuming
integrity of the queue is maintained by the corresponding lock
itself, it is much easier to do this scanning at the point the
lock is released than to support (asynchronous) queue reordering
for every potential priority change.
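
For illustration, option (2b) amounts to something like this at unlock time
(the types and the priority convention are made up for the sketch):

struct task { int prio; };      /* larger number = higher priority, here */

struct waiter {
    struct task   *task;
    struct waiter *next;
};

/* Called while the lock's own wait-queue is still protected by the lock. */
static struct task *pick_next_owner(struct waiter *head)
{
    struct waiter *w;
    struct task *best = NULL;

    for (w = head; w; w = w->next)
        if (!best || w->task->prio > best->prio)
            best = w->task;

    return best;        /* grant the lock to the highest-priority waiter */
}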

> I just asked Thomas if he could remember any numbers on this, and he
> said that keeping all the locks non-preemptible had at least an order
> difference in max latencies [ so a 60us (A+B) would turn into 600us (!
> A) ], this means a proportional decrease for the max freq of periodic
> tasks.

Those numbers are convincing for the value of preemptable locks.

> This led to the conviction that the PI overheads are worth it, since
> people actually want high freq tasks.

As argued above, I don't see that they necessarily argue for
PI on those preemptable locks.

> Of course, when the decreased period is still sufficient for the
> application at hand, the non-preemptible case allows for better
> analysis.

Ted

2009-07-16 22:34:43

by Ted Baker

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Thu, Jul 16, 2009 at 08:46:52AM -0600, Chris Friesen wrote:

> > So, it seems most logical and simplest to leave the charges where
> > they naturally occur, on B. That is, if you allow priority
> > inheritance, you allow tasks to sometimes run at higher priority
> > than they originally were allocated, but not to execute more
> > than originally budgeted.

> In this scenario, we're going to disrupt B's scheduling pattern by
> forcing it to borrow from its future cpu allocation in order to free up
> the lock. It will then be forced to block for a while to make up for
> the over-use of it's cpu budget.

I guess by "disrupting" B's scheduling pattern, you mean only in
the sense of delayed budget enforcement. That is, finishing the
critical sections is something that B would do next in any case,
and something that B would need to consume CPU time to do in any
case. It is B doing its own normal work, but just getting
a chance to complete that work sooner than it might otherwise.

In this situation, you do want to allow B's budget to go negative
(borrowing against B's future CPU budget to do B's work) rather
than charging the task A who loaned B priority, since otherwise B
ends up getting a still bigger windfall (charging B's work to A's
budget).

Ted

BTW. I believe I understood what you are saying here, but I
noticed that we are using words in different ways, especially
"block" and "sleep", and I'm not sure about some of the other messages.

It would be helpful if we had some common terminology. In
particular, it might be useful to agree on distinct names for the
following different reasons for a task not to be
executing.

1) actively looping on spin-lock, using CPU cycles
2) waiting for a condition or event,
like timer or I/O completion, not using CPU cycles,
not allowed to execute until the event or condition occurs
3) waiting for a CPU budget replenishment,
not using CPU cycles,
4) conceptually allowed to execute, but preempted on all available
processors (by higher priority or non-preemptable tasks)
5) conceptually allowed to execute, but prevented from executing
by lower-priority tasks that are not currently preemptable
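
Purely for the sake of discussion, one possible set of names for these five
cases:

enum nonexec_reason {
    NONEXEC_SPINNING,   /* 1) busy-waiting on a spin-lock, consuming CPU    */
    NONEXEC_WAITING,    /* 2) waiting for an event/condition (timer, I/O)   */
    NONEXEC_THROTTLED,  /* 3) waiting for a CPU budget replenishment        */
    NONEXEC_PREEMPTED,  /* 4) runnable, but preempted on all available CPUs */
    NONEXEC_HELD_OFF,   /* 5) runnable, but held off by non-preemptable     */
                        /*    lower-priority tasks                          */
};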

Ted

Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

> I still conceptually prefer the idea of granting locks to
> contending tasks in priority order, of course. It is just a
> question of whether you want to have to agree (1) that all
> scheduling is based on priority, and (2) pay the price for either
> (2a) dynamically re-ordering all those queues every time a task
> gains or loses priority (due to inheritance, or whatever), or (2b)
> pay the O(n) price of scanning the queue for the currently
> highest-priority task when you grant the lock. If you go this
> way, I would favor the latter. In any system that does not
> already have poor performance due to excessive lock contention,
> the queues should rarely have more than one member. Assuming
> integrity of the queue is maintained by the corresponding lock
> itself, it is much easier to do this scanning at the point the
> lock is released than to support (asynchronous) queue reordering
> for every potential priority change.
>
Just chiming in that from an implementation perspective, we could use
a priority bitmap of active tasks contending for the lock. An
implementation similar to the one used by the O(1) scheduler can be of
great use here. Hardware support like "find_first_bit" can drastically
reduce the time taken to search for the highest-priority task pending
on the lock. Given realistic values for the number of distinct
priority values required by most practical systems, such an
implementation could prove effective.

Thanks,
Karthik

2009-07-16 23:07:53

by Dario Faggioli

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Thu, 2009-07-16 at 18:34 -0400, Ted Baker wrote:
> > In this scenario, we're going to disrupt B's scheduling pattern by
> > forcing it to borrow from its future cpu allocation in order to free up
> > the lock. It will then be forced to block for a while to make up for
> > the over-use of its cpu budget.
>
> I guess by "disrupting" B's scheduling pattern, you mean only in
> the sense of delayed budget enforcement. That is, finishing the
> critical sections is something that B would do next in any case,
> and something that B would need to consume CPU time to do in any
> case. It is B doing its own normal work, but just getting
> a chance to complete that work sooner than it might otherwise.
>
> In this situation, you do want to allow B's budget to go negative
> (borrowing against B's future CPU budget to do B's work) rather
> than charging the task A who loaned B priority, since otherwise B
> ends up getting a still bigger windfall (charging B's work to A's
> budget).
>
When bandwidth and budget are in place, allowing a task to cross these
boundaries while executing inside a critical section is another approach I
have always liked... Provided you have a means to make it pay back for
its borrowed bandwidth, as you rightly say.

However, I also think this raises an issue: what if something weird
happens inside the critical section, such that the task exceeds its
declared/estimated duration? If no enforcement is in place, aren't we, at
least temporarily, breaking the isolation?
And isn't bandwidth isolation between a possibly "failing" task and
the rest of the system exactly what we are trying to achieve when we
use bandwidth reservation mechanisms?

Thus, in scenarios where component isolation is what we care about most,
and especially when good estimates of execution time, critical section
duration and blocking times are hardly possible, proxy execution/bandwidth
inheritance approaches ensure the following:
1. the task executing the critical section gets some "bonus bandwidth"
to speed up the completion of that code, but not in an uncontrolled
manner: i.e., it can use _only_ the bandwidths of the tasks blocked on
that cs;
2. whatever goes on inside a task, including critical sections, will
_only_ affect the ability to meet their deadlines of:
- the task executing the cs;
- the tasks interacting with it, i.e., accessing the same cs.
Any other component/task in the system will be completely unaware of
anything!

Obviously, this comes at the price of overhead, tricky implementation,
etc. ... I don't know if it is always worth it, probably it is not, but I
still think it is something remarkable! :-)

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]



2009-07-16 23:13:28

by Ted Baker

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Thu, Jul 16, 2009 at 09:17:09AM +0200, Henrik Austad wrote:

> My understanding of PEP is that when B executes through the
> A-proxy, B will consume parts of A's resources until the lock is
> freed. This makes sense when A and B run on different CPUs and
> B is moved (temporarily) to CPU#A. If B were to use its own
> budget when running here, once A resumes execution and exhausts
> its entire budget, you can have over-utilization on that CPU
> (and under-util on CPU#B).

To be sure we are using A and B the same way here:
B is holding a lock.
A wants that lock.
A grants its priority to B until B releases the lock.

How to look at the charges for usage seems not to have a perfect
solution. That is, you can't get around the fact that either:

(1) B is using some of its time at the wrong priority (priority inversion)

or

(2) B is using some of A's time to do B's work (the right
priority, but the wrong task)

The right way to resolve this conflict seems to depend a lot on
where B runs, as well as whether you are managing budget per-CPU
(partitioned model) or managing it globally (free migration
model).

1) In a global scheduling model, it does not matter where B runs.
We want to charge B's critical section to B, since our
schedulability analysis is based on the processor usage of each
task. It would be broken if A could be charged random bits of
time for the execution of other tasks. So, we must charge B.

2) In a partitioned scheduling model, we worry about the
CPU utilization of each processor. We have several cases, depending
where B runs when it runs with A's priority:

2.1) If B gets to run on A's processor, we have a problem
with processor overload unless we charge the usage to A. This
is still a bad thing, though, since we then need to over-provision
A to allow for this.

2.2) If B runs on its own processor, we want to use B's budget.

2.3) A third variation is that all the critical sections for A and
B run on a separate synchronization processor. In that case, the
synchronization processor needs its own budget, and in effect the
critical sections of tasks A and B become aperiodic tasks in
their own right.

> AFAIK, there are no such things as preemption-overhead charging
> to a task's budget in the kernel today. This time simply
> vanishes and must be compensated for when running a task through
> the acceptance-stage (say, only 95% util pr CPU or some such).

I don't see that anything can or does "vanish". Consider this
sequence of events on one processor:

1. task A is running, and calls scheduler (synchronously or maybe
via IRQ, say from timer)

2. scheduler computes how much time A has used recently, and charges
it to A's budget

The overhead of the IRQ handler and scheduler are therefore
charged to A.

3. scheduler switches to B

4. B calls scheduler

5. scheduler computes how much time B has used recently,
and charges it to B's budget

The rest of the overhead of the context switch from A to B, including cache
misses and page faults immediately following 3 are now
effectively charged to B.
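
In pseudo-C (a sketch of the pattern only, not the actual Linux
accounting code), the rule is simply that whoever was current when the
scheduler runs absorbs everything since the last accounting point:

struct acct_task {
	unsigned long long budget_used;
};

static unsigned long long last_stamp;	/* time of the last accounting point */

static void charge_current(struct acct_task *curr, unsigned long long now)
{
	/* Includes the IRQ/scheduler overhead accrued since last_stamp. */
	curr->budget_used += now - last_stamp;
	last_stamp = now;
}

static void schedule_once(struct acct_task **currp, struct acct_task *next,
			  unsigned long long now)
{
	charge_current(*currp, now);	/* steps 1-2: the outgoing task pays */
	*currp = next;			/* step 3: switch */
	/* Steps 4-5: the next charge_current() call bills the new task,
	 * including the post-switch cache misses and page faults. */
}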

> > Back to the original question, of who should be charged for
> > the actual critical section.

> That depends on where you want to run the tasks. If you want to
> migrate B to CPU#A, A should be charged. If you run B on CPU#B,
> then B should be charged (for the exact same reasoning A should
> be charged in the first case).

As I mentioned above, it also depends on whether you are using
a partitioned or global scheduling policy for your analysis.

> The beauty of PEP, is that enabling B to run is very easy. In
> the case where B runs on CPU#B, B must be updated statically so
> that the scheduler will trigger on the new priority. In PEP,
> this is done automatically when A is picked. One solution to
> this, would be to migrate A to CPU#B and insert A into the
> runqueue there. However, then you add more overhead by moving
> the task around instead of just 'borrowing' the task_struct.

> > From the schedulability analysis point of view, B is getting
> > higher priority time than it normally would be allowed to execute,
> > potentially causing priority inversion (a.k.a. "interference" or
> > "blocking") to a higher priority task D (which does not even share
> > a need for the lock that B is holding) that would otherwise run on
> > the same processor as B. Without priority inheritance this kind
> of interference would not happen. So, we are benefiting A at the
> > expense of D. In the analysis, we can either allow for all such
> > interference in a "blocking term" in the analysis for D, or we
> > might call it "preemption" in the analysis of D and charge it to A
> > (if A has higher priority than D). Is the latter any better?

> If D has higher priority than A, then neither A nor B (with the
> locks held) should be allowed to run before D.

Yes, but I only meant D has higher priority than B. A may have
higher priority than D, but with multiple processors we can't just
focus on the one highest priority task. We need to look at the
(number of CPUs) highest priority tasks, or we may end
up with our system behaving as if we have only one processor.

This is a key difference with multiprocessor scheduling,
and also with processor "share" scheduling.

Expediting the highest priority task is *not* always the
best strategy. Assuming that each task has some amount of slack,
there is no great benefit for completing a high-priority task
early, if that means causing other tasks to miss their deadlines.

That is, if by delaying the highest-priority task a little bit
on one processor you can keep more processors busy,
you may be able to successfully schedule a set of tasks that
could not be scheduled to meet deadlines otherwise.

(I can give you an example of how this happens if you want, but it
would be tedious if you can get my point without the example.)

> Yes, no task should be allowed to run more than the budget, but
> that requires B to execute *only* on CPU#B.

As mentioned above, that depends on whether you are
applying a global or partitioned model. With a global
scheduling model the budget is not per-CPU.

> On the other hand, one could say that if you run PEP and B is
> executed on CPU#A, and A then exhausts its budget, you could
> blame A as well, as lock-contention is a common problem and it's
> not only the kernel's fault. Do we need perfect or best-effort
> lock-resolving?

Yes. For application locks, with application-managed partitioned
scheduling, you can blame the application designer for building
in cross-CPU contention.

For kernel (hidden from the application) locks, it is a harder
problem.

I don't think there is a perfect solution for the partitioned
scheduling model. As mentioned above, you get to choose between B
causing short-term priority inversion for other tasks on B's
processor (but not using more than its budget on the average), or
B causing budget over-run for A on A's processor.

Ted

2009-07-16 23:29:27

by Dario Faggioli

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Wed, 2009-07-15 at 16:55 -0400, Ted Baker wrote:
> I hope that Linux will not pursue EDF or EDZL in only a narrow
> sense, as simple priority scheduling algorithms, without providing
> for bandwidth guaranteees and limits.
>
Yes, me too...

> By bandwith "guarantee" I mean that the task/scheduling entity is
> guaranteed to be able to at least *compete* at a certain level for
> a certain percentage of the CPU, if we cannot (better) provide an
> admission control mechanism that guarantees it will actually get a
> certain percentage of the CPU.
>
Again, I agree... But giving a group an explicit priority assignment,
although very simple conceptually, raises a lot of issues when the
current implementation of global task scheduling in Linux, with
"distributed" (one per CPU) runqueue, is concerned.

> For example, in the fixed-priority domain, we have Sporadic
> Server. This guarantees the server a chance to compete at its top
> priority for at least sched_ss_init_budget time in every
> sched_ss_repl_period time, at sched_priority, within some
> tolerance. It also should not be permitted to use more than
> sched_ss_init_budget time in every sched_ss_repl_period time, at
> sched_priority, within some tolerance.
>
And that's why I'm trying what I said before... For, e.g., the sporadic
server policy to be extended and applied to group scheduling, groups need
to have priorities for the standard, already existing, analysis to work.

> In the deadline scheduling domain, we have algorithms like
> Constant Bandwidth Server (and some improved variants) that
> provide similar guarantees and limits, in terms of deadlines. (I
> recall seeing one very good version in a paper I recently saw,
> which I could seek out and provide to the list if there is
> interest.)
>
Well, if it does not bother you too much, I'm curious about that
solution... When you find some time, even via private mail, if I'm the
only one, it would be nice if you can point that paper out to me.

Thanks in advance.

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]



2009-07-16 23:38:40

by Ted Baker

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Thu, Jul 16, 2009 at 05:34:25PM -0500, Karthik Singaram Lakshmanan wrote:

> Just chiming in that from an implementation perspective, we could use
> a priority bitmap of active tasks contending for the lock. An
> implementation similar to the one used by the O(1) scheduler can be of
> great use here. Hardware support like "find_first_bit" can drastically
> reduce the time taken to search for the highest-priority task pending
> on the lock. Given realistic values for the number of distinct
> priority values required by most practical systems, such an
> implementation could prove effective.

This does not solve the problem of avoiding queue reordering in
response to dynamic priority changes, since where you insert the
task in the queue (including the bit setting for the priority and
the linking/unlinking) depends on the current priority.

This queue reordering is a huge pain, since you need to do it not
only whenever a priority is explicitly changed by the user; you
need to do it whenever a task gains or loses priority through
inheritance. The latter can happen asynchronously, in response to
a time-out or signal, for example.

By the way, the use of bit-maps is very appealing for the O(1)
operations, including bit-scan, especially if your machine has
atomic set/test bit instructions. We used this structure for some
single-processor RT kernels in the 1980's. The task scheduling
and sleep/wakeup operations were just a few instructions. However
the use of bit-maps forced us to set a fixed limit on the number
of tasks in the system. We also could not change priorities
without doing an ugly (non-atomic) re-shuffling of the structures.

You are proposing one bit per priority level, of course, rather
than one bit per task. This allows you to use linked lists within
priorities, but you lose the one-instruction suspend/unsuspend.

It is not immediately obvious how to extend this technique to
deadline-based scheduling.

There can only be a finite number of distinct deadlines in a
system at any one time. So at any point in time a finite number
of bits is sufficient. The problem is that the deadlines are
advancing, so a fixed mapping of bit positions to deadlines does
not work.

There is one way this can be used with EDF, at least for a single
processor, which is related to the way I came up with the SRP. If
you keep a stack of preempting tasks (earliest deadline on top),
the positions in the stack do map nicely to a bit vector.

You then need some other structure for the tasks with deadlines
later than the bottom of your stack.
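
A sketch of that idea for one CPU (everything below is invented purely
for illustration): each preempting task is pushed on the stack, its
stack depth doubles as its bit position, and picking the next task is a
find-highest-set-bit over the occupied levels:

#define MAX_PREEMPT_DEPTH 32

struct task;	/* opaque here */

struct edf_stack {
	unsigned int occupied;			/* bit i set => level i in use */
	struct task *level[MAX_PREEMPT_DEPTH];	/* level 0 = latest deadline   */
	int top;
};

static void push_preemptor(struct edf_stack *s, struct task *t)
{
	s->top++;				/* earlier deadline => deeper level */
	s->level[s->top] = t;
	s->occupied |= 1u << s->top;
}

static void clear_level(struct edf_stack *s, int i)	/* level i done or blocked */
{
	s->occupied &= ~(1u << i);
}

static struct task *pick_next(struct edf_stack *s)
{
	if (!s->occupied)
		return NULL;	/* fall back to the structure holding later deadlines */
	return s->level[31 - __builtin_clz(s->occupied)];	/* highest set bit (GCC builtin) */
}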

I don't see how this generalizes to SMP.

Ted


2009-07-16 23:54:33

by Ted Baker

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Thu, Jul 16, 2009 at 04:08:47PM -0600, Chris Friesen wrote:

> > However, there is still a difference in context-switching
> > overhead. Worst-case, you have twice as many context switches
> > per critical section with PIP as with PP.
>
> On the other hand, with PI the uncontended case can be implemented as
> atomic operations in userspace. With PP we need to issue at least two
> syscalls per lock/unlock cycle even in the uncontended case (to handle
> the priority manipulations).

Needing syscalls to change the priority of a thread may be an
artifact of system design, that might be correctable.

Suppose you put the effective priority of each thread in a
per-thread page that is mapped into a fixed location in the
thread's address space (and different locations in the kernel
memory). It is nice to have such a page for each thread
in any case. I don't recall whether Linux already does this,
but it is a well proven technique.

Taking a PP lock then involves:

1) push old priority on the thread's stack
2) overwrite thread's priority with max of the lock priority
and the thread priority
3) try to grab the lock (test-and-set, etc.)
... conditionally queue, etc.

Releasing the PP lock then involves:

1) conditionally find another thread to grant the lock to,
call scheduler, etc., otherwise
2) give up the actual lock (set bit, etc.)
3) pop the old priority from the stack, and
write it back into the per-thread location
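
A user-space sketch of that fast path (assuming a hypothetical,
kernel-visible per-thread word my_prio mapped at a fixed address; the
queueing/wakeup slow path and lock nesting are glossed over):

#include <stdatomic.h>

extern volatile int my_prio;	/* effective priority, visible to the kernel */

struct pp_lock {
	atomic_flag held;	/* initialise with ATOMIC_FLAG_INIT */
	int ceiling;		/* priority ceiling of this lock */
	int saved_prio;		/* owner's previous priority (non-nested sketch;
				   the text above pushes it on the thread's stack) */
};

static inline void pp_lock_acquire(struct pp_lock *l)
{
	int old = my_prio;

	/* 1) remember the old priority, 2) raise to the ceiling */
	my_prio = l->ceiling > old ? l->ceiling : old;

	/* 3) grab the lock; a real implementation would queue and block
	 *    in the kernel instead of spinning here */
	while (atomic_flag_test_and_set(&l->held))
		;

	l->saved_prio = old;
}

static inline void pp_lock_release(struct pp_lock *l)
{
	int old = l->saved_prio;

	/* 1) conditionally hand the lock to a waiter (slow path, omitted) */
	atomic_flag_clear(&l->held);	/* 2) give up the actual lock */
	my_prio = old;			/* 3) pop the old priority back */
}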

Of course you also have explicit priority changes. The way we
handled those was to defer the effect until the lock release
point. This means keeping two priority values (the nominal one,
and the effective one). Just as you need conditional
code to do the ugly stuff that requires a kernel trap
in the case that the lock release requires unblocking
a task, you need conditional code to copy the
new nominal priority to the effective priority, if that
is called for. We were able to combine these two conditions
into a single bit test, which then branched out to handle
each of the cases, if necessary.

I can't swear there are no complexities in Linux that might break
this scheme, since we were not trying to support all the
functionality now in Linux.

Ted

2009-07-17 00:19:45

by Dario Faggioli

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Thu, 2009-07-16 at 19:13 -0400, Ted Baker wrote:
> To be sure we are using A and B the same way here:
> B is holding a lock.
> A wants that lock.
> A grants its priority to B until B releases the lock.
>
> How to look at the charges for usage seems not to have a perfect
> solution. That is, you can't get around the fact that either:
>
> [...]
>
> The right way to resolve this conflict seems to depend a lot on
> where B runs, as well as whether you are managing budget per-CPU
> (partitioned model) or managing it globally (free migration
> model).
>
> 1) In a global scheduling model, it does not matter where B runs.
> We want to charge B's critical section to B, since our
> schedulability analysis is based on the processor usage of each
> task. It would be broken if A could be charged random bits of
> time for the execution of other tasks. So, we must charge B.
>
Mmm... Why can't we make B able to exploit A's bandwidth to speed up
critical section completion, to the benefit of A as well? I mean, it
depends on how the analysis is being carried out, and it is probably good or
bad depending on what you care about most, e.g., making all tasks' deadlines or
isolating the various components from each other... But I wouldn't
say it is "broken".

Especially because I implemented it once... Very rough, very
experimental, far from being ready for anything different than some
experiments... But it worked! :-)

> 2) In a partitioned scheduling model, we worry about the
> CPU utilization of each processor. We have several cases, depending
> where B runs when it runs with A's priority:
>
Ok, I'm not going into this, since I need a little bit more time to
figure out the details... I'm concentrating on global scheduling for
now! :-D

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]



2009-07-17 00:32:29

by Dario Faggioli

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Thu, 2009-07-16 at 10:58 +0200, Peter Zijlstra wrote:
> Right so control-groups (cgroups for short) are a form of
> virtualization. Each controller is specific to a resource. We have
> memory controllers, namespace controllers and also a scheduler
> controller.
>
> If you would apply all controllers to the same task groups you get a
> result like chroot on steroids, or jails etc. But you can also use them
> individually to control resources in creative ways.
>
> In order to manage RT resources you want:
>
> - a minimum bandwidth guarantee
> - isolation
>
> So ideally you want a CBS server that schedules your RT (FIFO/RR) tasks.
>
Or, maybe, a fixed-priority RT scheduling class (for FIFO/RR tasks)
with some kind of bandwidth-limiting fixed-priority algorithm for groups,
such as deferrable server (which is basically what you have now) or
sporadic server.

What do you think about this?

The only thing you need to have this working is adding the capability of
explicitly assigning priorities to groups!
I'm sending some code for this in the next few days... It has some (even
serious) issues, or at least some features that would need discussion,
but it's a start.

Obviously, you can also have EDF in place, maybe as a different
scheduling class than sched-rt, able to accommodate EDF tasks, if wanted,
or again FIFO and RR task sets ("ghosts"), as in Fabio's proposal.

To me, this seems the very best solution both for real-time people
(who will get FP and EDF!) and for other users... And it also makes it
easier to retain POSIX compatibility (if the system is properly
configured) than if we add deadlines to sched-rt groups.

But that's only my very humble opinion. :-D

> Furthermore the whole feature is still marked EXPERIMENTAL, basically
> because I do recognize it for the hack it is -- that said, some people
> might find it useful.
>
Hey, it is very useful for me actually!! Without it I would be sleeping
right now... And not here coding and answering mails at this late time
in the night!! :-P

> These groups get assigned a bandwidth through the use of a period and
> runtime group parameter -- the documentation states that using different
> periods is currently broken and would require a deadline server.
>
Or a priority, since we are in the fixed priority scheduling class?

> Therefore we can assume all periods are equal, for the rest of this
> description -- and set it to 1s.
>
>
> So what does it do, its a hierarchical FIFO scheduler, with the prio of
> a group being that of the max prio of its children. If a group runs out
> of quota it will be dequeued. When the period timer comes along and
> refreshes the quota it will be requeued.
>
> R
> / \
> A B
> / \ |\
> 1 4 C 3
> |
> 2
>
> groups in letters, tasks in digits.
>
> [...]
>
WoW!! Very nice, thorough and clear explanation... Consider adding it to
Documentation/! :-D

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]



2009-07-17 00:43:25

by Dario Faggioli

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Thu, 2009-07-16 at 10:58 +0200, Peter Zijlstra wrote:
> Now, let me first state that the current code is a hack, and I know its
> nowhere near proper. But it was the best I could come up with on a short
> notice -- and Fabio is now looking at doing better :-)
>
If I can say something (more!), there is quite a lot of work in our lab
on this stuff, starting from different perspectives and aiming at
different goals.

I've been involved, and continue to be, in the sporadic server
implementation and the EDF implementation in a separate scheduling class
(with Michael), while Fabio is doing a lot of (and a lot better!) work on
deadlines in sched-rt.

Independently of whether we deal with priorities or deadlines, the way
global scheduling is implemented, i.e., with distributed ready queues,
raises a lot of issues for the implementation of _global_
_hierarchical_ scheduling policies of both kinds, EDF and FP.
It is proving very hard to figure out, and even harder to implement, how
things should go when you have push, pull, affinity, and so on! :-(

I'll be more precise when I have the code ready, but here the question
is, do you think the push and pull approach "is here to stay", or is
there room, maybe after trials, errors, experiments and exhaustive
search for the correct data structure, to migrate to something that
would make global scheduling easier?

Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)

http://blog.linux.it/raistlin / [email protected] /
[email protected]


Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

> This does not solve the problem of avoiding queue reordering in
> response to dynamic priority changes, since where you insert the
> task in the queue (including the bit setting for the priority and
> the linking/unlinking) depends on the current priority.
>

I completely agree with this. The bit map needs to be modified for
dynamic priority changes. However, atomic operations (like
Compare-And-Swap) can be used to modify the bitmap. Barring any
(highly unlikely but bounded) contention scenarios from other
processors, this implementation would still be more efficient
than maintaining a priority-ordered queue.

> By the way, the use of bit-maps is very appealing for the O(1)
> operations, including bit-scan, especially if your machine has
> atomic set/test bit instructions. We used this structure for some
> single-processor RT kernels in the 1980's. The task scheduling
> and sleep/wakeup operations were just a few instructions. However
> the use of bit-maps forced us to set a fixed limit on the number
> of tasks in the system. We also could not change priorities
> without doing an ugly (non-atomic) re-shuffling of the structures.
>
> You are proposing one bit per priority level, of course, rather
> than one bit per task. This allows you to use linked lists within
> priorities, but you lose the one-instruction suspend/unsuspend.
>

We can still maintain O(1) suspend/unsuspend if we
maintain a counter for each priority level.

(a) When a task suspends on a lock, we can do an atomic increment of
the counter for its priority level and set the bit on the priority map
of tasks waiting for the lock. Attaching the task to a
per-priority-level linked list queue should take O(1) assuming that
there is a tail pointer.

(b) When the lock is released, find the highest priority bit set on
the lock's priority map, index into the appropriate per-priority-level
linked list, and wake up the task at the head of the queue (delete
operation with O(1)). Do an atomic decrement of the counter, if it
reaches zero unset the bit on the priority map.

While there are still contention issues that arise with updating the
linked list and counters, these are restricted to tasks within the
same priority level (highly unlikely that multiple tasks from the same
priority level decide to acquire the same lock within a window of a
couple of instructions), and should be much less than the contention
arising from all the tasks in the system.
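
For concreteness, a sketch of that layout (names made up; the
per-priority-level list locking and the atomic operations are only
indicated in comments):

#include <strings.h>	/* ffs(), standing in for a find-first-bit instruction */

#define NPRIO 32	/* bit/index 0 = highest priority in this sketch */

struct waiter {
	struct waiter *next;	/* ...plus a reference to the blocked task... */
};

struct lock_waitq {
	unsigned int bitmap;	/* bit p set => waiters queued at priority p */
	int count[NPRIO];	/* atomic_inc()/atomic_dec_and_test() in practice */
	struct waiter *head[NPRIO], *tail[NPRIO];
};

/* (a) a task of priority 'prio' blocks on the lock */
static void waitq_add(struct lock_waitq *q, struct waiter *w, int prio)
{
	w->next = NULL;
	if (q->tail[prio])
		q->tail[prio]->next = w;
	else
		q->head[prio] = w;
	q->tail[prio] = w;
	q->count[prio]++;
	q->bitmap |= 1u << prio;	/* atomic set_bit() in practice */
}

/* (b) lock release: take the highest-priority waiter, if any */
static struct waiter *waitq_take_highest(struct lock_waitq *q)
{
	int prio;
	struct waiter *w;

	if (!q->bitmap)
		return NULL;
	prio = ffs(q->bitmap) - 1;	/* lowest set bit = highest priority */
	w = q->head[prio];
	q->head[prio] = w->next;
	if (!q->head[prio])
		q->tail[prio] = NULL;
	if (--q->count[prio] == 0)
		q->bitmap &= ~(1u << prio);	/* atomic clear_bit() in practice */
	return w;
}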

> It is not immediately obvious how to extend this technique to
> deadline-based scheduling.
>
> There can only be a finite number of distinct deadlines in a
> system at any one time. So at any point in time a finite number
> of bits is sufficient. The problem is that the deadlines are
> advancing, so a fixed mapping of bit positions to deadlines does
> not work.
>
> There is one way this can be used with EDF, at least for a single
> processor, which is related to the way I came up with the SRP. If
> you keep a stack of preempting tasks (earliest deadline on top),
> the positions in the stack do map nicely to a bit vector.
>
> You then need some other structure for the tasks with deadlines
> later than the bottom of your stack.
>
Yes. I did not think about EDF based schedulers when I proposed the
implementation mechanism. I believe we can work on the SRP idea to get
a corresponding version for EDF.

> I don't see how this generalizes to SMP.
>

I agree that there needs to be more work to generalize the idea to EDF
based schedulers on SMP, however, the idea still applies to
fixed-priority scheduling in the SMP context. Many SMP processors
support atomic instructions (for example:
http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/).
We can utilize these instructions to efficiently implement such locks
at least in fixed-priority schedulers (like Deadline Monotonic
Schedulers) for SMP systems.

Thanks
- Karthik

2009-07-17 07:32:39

by Henrik Austad

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Friday 17 July 2009 01:13:23 Ted Baker wrote:
> On Thu, Jul 16, 2009 at 09:17:09AM +0200, Henrik Austad wrote:
> > My understanding of PEP is that when B executes through the
> > A-proxy, B will consume parts of A's resources until the lock is
> > freed. This makes sense when A and B run on different CPUs and
> > B is moved (temporarily) to CPU#A. If B were to use its own
> > budget when running here, once A resumes execution and exhausts
> > its entire budget, you can have over-utilization on that CPU
> > (and under-util on CPU#B).
>
> To be sure we are using A and B the same way here:
> B is holding a lock.
> A wants that lock.
> A grants its priority to B until B releases the lock.

I tried to be consistent, and I still can't see where I messed up, but this is
what I meant.

With the mixed-in CPU-notation: I got the feeling that partitioned scheduling
was the preferred (or easiest to analyze), so I assumed that we had shifted
from my initial global motivation to a more mature partitioned view :-)

By CPU#A I meant the CPU where A is currently scheduled (in a partitioned
setup).

> How to look at the charges for usage seems not to have a perfect
> solution. That is, you can't get around the fact that either:
>
> (1) B is using some its time at the wrong priority (priority inversion)
>
> or
>
> (2) B is using some of A's time to do B's work (the right
> priority, but the wrong task)
>
> The right way to resolve this conflict seems to depend a lot on
> where B runs, as well as whether you are managing budget per-CPU
> (partitioned model) or managing it globally (free migration
> model).

Yes. This has been one of my points all along, but I realize I've been rather
bad at actually making that point.

> 1) In a global scheduling model, it does not matter where B runs.
> We want to charge B's critical section to B, since our
> schedulability analysis is based on the processor usage of each
> task. It would be broken if A could be charged random bits of
> time for the execution of other tasks. So, we must charge B.

Yes, I agree completely.

> 2) In a partitioned scheduling model, we worry about the
> CPU utilization of each processor. We have several cases, depending
> where B runs when it runs with A's priority:
>
> 2.1) If B gets to run on A's processor, we have a problem
> with processor overload unless we charge the usage to A. This
> is still a bad thing, though, since we then need to over-provision
> A to allow for this.
>
> 2.2) If B runs on its own processor, we want to use B's budget.
>
> 2.3) A third variation is that all the critical sections for A and
> B run on a separate synchronization processor. In that case, the
> synchronization processor needs its own budget, and in effect the
> critical sections of tasks A and B become aperiodic tasks in
> their own right.

An interesting solution. However, this means that the kernel needs to analyze
all tasks before they can run, as it not only needs to keep track of the total "raw
utilization" (the computational time required), but also of how much time is
spent while holding locks. Throw in a few conditional statements, and
finding this becomes somewhat of a challenge.

> > AFAIK, there are no such things as preemption-overhead charging
> > to a task's budget in the kernel today. This time simply
> > vanishes and must be compensated for when running a task through
> > the acceptance-stage (say, only 95% util pr CPU or some such).
>
> I don't see that anything can or does "vanish". Consider this
> sequence of events on one processor:
>
> 1. task A is running, and calls scheduler (synchronously or maybe
> via IRQ, say from timer)
>
> 2. scheduler computes how much time A has used recently, and charges
> it to A's budget
>
> The overhead of the IRQ handler and scheduler are therefore
> charged to A.
>
> 3. scheduler switches to B
>
> 4. B calls scheduler
>
> 5. scheduler computes how much time B has used recently,
> and charges it to B's budget
>
> The rest of the overhead of the context switch from A to B, including cache
> misses and page faults immediately following 3 are now
> effectively charged to B.

Good point, it has to be this way. I see a lot of code in kernel/sched_stats.h
that I should have been aware of. My apologies.

> > > Back to the original question, of who should be charged for
> > > the actual critical section.
> >
> > That depends on where you want to run the tasks. If you want to
> > migrate B to CPU#A, A should be charged. If you run B on CPU#B,
> > then B should be charged (for the exact same reasoning A should
> > be charged in the first case).
>
> As I mentioned above, it also depends on whether you are using
> a partitioned or global scheduling policy for your analysis.

Yes. I guess we should split the thread in 2, or shift the topic as we are
discussing locking and how to resolve these.

> > The beauty of PEP, is that enabling B to run is very easy. In
> > the case where B runs on CPU#B, B must be updated statically so
> > that the scheduler will trigger on the new priority. In PEP,
> > this is done automatically when A is picked. One solution to
> > this, would be to migrate A to CPU#B and insert A into the
> > runqueue there. However, then you add more overhead by moving
> > the task around instead of just 'borrowing' the task_struct.
> >
> > > From the schedulability analysis point of view, B is getting
> > > higher priority time than it normally would be allowed to execute,
> > > potentially causing priority inversion (a.k.a. "interference" or
> > > "blocking") to a higher priority task D (which does not even share
> > > a need for the lock that B is holding) that would otherwise run on
> > > the same processor as B. Without priority inheritance this kind
> > > of interference would not happen. So, we are benefiting A at the
> > > expense of D. In the analysis, we can either allow for all such
> > > interference in a "blocking term" in the analysis for D, or we
> > > might call it "preemption" in the analysis of D and charge it to A
> > > (if A has higher priority than D). Is the latter any better?
> >
> > If D has higher priority than A, then neither A nor B (with the
> > locks held) should be allowed to run before D.
>
> Yes, but I only meant D has higher priority than B. A may have
> higher priority than D, but with multiple processors we can't just
> focus on the one highest priority task. We need to look at the
> (number of CPUs) highest priority tasks, or we may end
> up with our system behaving as if we have only one processor.

Yes, well, that will always be a problem. But on the other hand, B *has*
higher priority than D, since B is blocking A, and A has higher priority than
D.

This becomes, as you say, messy in a partitioned system as a priority does not
mean *anything* outside a CPU. The same goes for a deadline, as you must see
the deadline with respect to the other tasks in the same domain (I specifically chose
domain here as we can apply that to all types, partitioned, clustered and
global).

This is what makes me believe that a global algorithm will make several of
these problems smaller, or even vanish completely, even if you have problems
with caches going cold, high memory-bus traffic at frequent task preemption,
strong lock-contention on the rq-locks etc. These are all implementation
problems, not small problems, but they can be avoided and mended. A
partitioned scheduler is a bit like a broken hammer, you cannot really expect
to build a nice house with it :)

> This is a key difference with multiprocessor scheduling,
> and also with processor "share" scheduling.

Yes, you can get into quite a mess if you're not careful.

> Expediting the highest priority task is *not* always the
> best strategy. Assuming that each task has some amount of slack,
> there is no great benefit for completing a high-priority task
> early, if that means causing other tasks to miss their deadlines.

which was the whole point of the initial email, to present EFF (or M^2LLF, or
whatever is the most appropriate name for it) and get some feedback :-)

> That is, if by delaying the highest-priority task a little bit
> on one processor you can keep more processors busy,
> you may be able to successfully schedule a set of tasks that
> could not be scheduled to meet deadlines otherwise.
>
> (I can give you an example of how this happens if you want, but it
> would tedious if you can get my point without the example.)

No, no, I think I understand what you're getting at. Dhall's effect would be
one such example.

> > Yes, no task should be allowed to run more than the budget, but
> > that requires B to execute *only* on CPU#B.
>
> As mentioned above, that depends on whether you are
> applying a global or partitioned model. With a global
> scheduling model the budget is not per-CPU.
>
> > On the other hand, one could say that if you run PEP and B is
> > executed on CPU#A, and A then exhausts its budget, you could
> > blame A as well, as lock-contention is a common problem and it's
> > not only the kernel's fault. Do we need perfect or best-effort
> > lock-resolving?
>
> Yes. For application locks, with application-managed partitioned
> scheduling, you can blame the application designer for building
> in cross-CPU contention.
>
> For kernel (hidden from the application) locks, it is a harder
> problem.

Which locks would that be? If a syscall results in lock-contention, shouldn't
the app be aware of those locks?

> I don't think there is a perfect solution for the partitioned
> scheduling model.

I know! :-)

> As mentioned above, you get to choose between B
> causing short-term priority inversion for other tasks on B's
> processor (but not using more than its budget on the average), or
> B causing budget over-run for A on A's processor.
>
> Ted


--
henrik



2009-07-17 07:41:13

by Henrik Austad

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Wednesday 15 July 2009 23:53:05 Ted Baker wrote:
> On Tue, Jul 14, 2009 at 09:28:47PM +0200, Henrik Austad wrote:
> > ... In MC you need to do this the hard way, namely compute the
> > point in time not when the task misses the deadline, but when it
> > will *eventually* fail a deadline. By doing that, you combine
> > deadline, wcet and granted time in one variable, and you have a
> > *single* variable to compare.
>
> This is true in a theoretical sense, and is the basis of some
> "optimal" scheduling algorithms, including the "throwforward
> scheduling" algorithm. It makes sense in some environments, where
> you actually know the WCET of the task in advance. However, I
> don't believe a Linux system can expect all applications to
> provide this kind of information.

Why cannot you expect real-time tasks using a deadline scheduler to provide
some estimate of the execution cost? How can you ever hope to run a deadline
scheduler without this?

> In a system programmed using process and threads, the decision to
> sleep or wake is embedded in the internal logic of the thread, and
> implemented by system calls. The existing system calls do not
> convey how long the thread needs to execute before it reaches its
> next suspension point. Therefore, without a new API you cannot
> use WCET.

Yes, you would need to introduce a new set of syscalls, two in fact. When
working with PD^2, I added three (as reweighting was a special case), but:

sched_dl_update(pid, wcet, period, deadline)
sched_dl_release(pid, abs_release_time)
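
To make the intended usage concrete, a periodic task could look roughly
like this (the two calls are of course only the proposed, not-yet-existing
interface, stubbed out here so the sketch compiles; the numbers are
placeholders):

#include <stdint.h>
#include <time.h>
#include <sys/types.h>

/* Stubs standing in for the proposed syscalls. */
static int sched_dl_update(pid_t pid, uint64_t wcet_ns, uint64_t period_ns,
			   uint64_t deadline_ns) { return 0; }
static int sched_dl_release(pid_t pid, uint64_t abs_release_ns) { return 0; }

int main(void)
{
	struct timespec now;
	uint64_t t;

	clock_gettime(CLOCK_MONOTONIC, &now);
	t = (uint64_t)now.tv_sec * 1000000000ULL + now.tv_nsec;

	/* 2 ms of worst-case work every 10 ms, deadline == period. */
	sched_dl_update(0, 2000000, 10000000, 10000000);

	/* First release 100 ms from now, given as an absolute time. */
	sched_dl_release(0, t + 100000000);

	/* ... periodic work loop ... */
	return 0;
}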

How can you use deadlines based on priorities? A priority is a one-way mapping
of deadlines for a set of tasks.

> If you create a new API for this, you are limiting this
> form of scheduling to threads that choose to use that API, and are
> able to provide the needed WCET information. This seems like a
> small number of cases among the full range of real-time Linux
> applications.

Are we going to turn all tasks in the kernel into rt-deadline tasks? I had
the impression that we wanted a class for a special set of tasks.

> Ted

--
henrik



2009-07-17 13:35:15

by Giuseppe Lipari

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

Dear all,

a few consideration on the BWI scheme, that was mentioned by
Peter and by Raistlin a few messages ago. I do not know the PEP
scheme very well; from the discussion so far the basic
idea looks pretty similar to our protocol. The BWI is described
in this paper:

[1] http://feanor.sssup.it/~lipari/papers/rtss2001-bwi.ps.gz

(I just removed the password, feel free to download it, I hope IEEE
lawyers will not chase me!).

In essence, when a task is blocked on a lock, its budget may be
consumed by the task holding the lock. A task can be assigned one or
more pairs (budget,deadline), one is its original one, the others are
"inherited" when holding a lock on which other tasks are blocked.

This scheme has advantages and disadvantages. The advantages are:

1) Isolation. It is easy to see that the budget of a task can be
stolen only by tasks that share common locks, directly or
indirectly. For the sake simplicity, consider only non nested
locks. If two tasks A and B do not share any lock, then they cannot
influence each other. (In the paper, this property is generalized to
nested locks). This property is useful if we want to isolate the
behavior of one (group of) task(s) from the others. For example, a
"hard real-time" task (group) must be isolated from soft real-time
tasks. Under EDF, other classical protocols (like PI, SRP) do not have
the same property, because letting a task use a critical section for
longer than expected can jeopardize the schedulability of any
other task.


2) Simpler admission test. With other schemes (PI, SRP), it is
necessary to compute blocking times for all tasks, which depend on the
length of the critical sections. The blocking times are then used in
the admission test formula. However, if one blocking time is wrongly
calculated, at run-time any task can miss its deadline. With BWI,
instead, the admission test is the simpler \sum U_i \leq 1, and
does not depend on the length of the critical section.

3) Under the assumption that tasks do not block inside a critical
section, BWI can be implemented in a fairly simple way.

4) It is indeed possible (under certain conditions) to verify the
schedulability of a hard real-time task H by knowing the length of all
critical sections of all tasks that share some lock with H. The very
complex procedure is explained in the paper.

However, BWI has some disadvantages

1) It is only for single processors.

2) From a schedulability point of view, if we want to guarantee that
every task always respects its deadline, when computing its budget we
must calculate the "interference" of other tasks. Since, we have to do
this for every "hard" task, we may end up counting the same critical
section several times, thus wasting some bandwidth. This is better
explained in the paper (section 6.4.1).

3) If a task is allowed to block in the middle of a critical section
(for example, due to a page fault), the implementation becomes much
more complex.

Concerning point 1, as Raistlin pointed out (see his e-mails), he is
working on extending BWI to multiprocessors, also from a theoretical
point of view. It is possible that the result will be similar to the
one obtained by Douglas Niehaus with PEP. We hope he will be able
to add some "theoretical spice"!

Concerning point 2, this cannot be avoided. We believe that BWI is
useful for open systems (where tasks are created/killed dynamically)
and for soft real-time systems. However, under certain conditions, it
allows one to schedule and analyse hard real-time tasks, even when mixed
with soft real-time tasks.

Concerning point 3, this is the most difficult point. In fact,
achieving an efficient implementation in this case seems very
unlikely. However, we believe that blocking inside a critical section
is probably a rare event, and so it may be possible to afford
some extra overhead in such unfortunate cases.

I hope I clarified some obscure issues with BWI!

Regards,

Giuseppe Lipari




2009-07-17 13:37:27

by Ted Baker

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Fri, Jul 17, 2009 at 09:40:59AM +0200, Henrik Austad wrote:

> Why cannot you expect real-time tasks using a deadline scheduler to provide
> some estimate of the execution cost? How can you ever hope to run a deadline
> scheduler without this?

This depends on what you mean by a "deadline scheduler".
Personally, I don't think it makes much sense for an OS to support
scheduling of aperiodic "jobs" on job deadlines. If you have
chunks of application work that logically should be viewed as
jobs, with approximately known WCET and deadlines, the right way
to handle that is to have server thread(s), with associated job
queue(s) that order their own queues by job deadline. The servers
are provisioned with enough CPU capacity to provide the desired
level of service for the anticipated load. If load-shedding is
required at times, that is the responsibility of the server, which
is part of the application and has the application-specific
knowledge needed to make such decisions. (Alternatively,
the server could negotiate for a higher CPU share if it starts
missing deadlines.)

What it makes sense for the OS to provide is what I'll loosely
call "CPU shares". Deadline scheduling is a good way to implement
this, since the schedulability analysis is relatively clean
(including max. tardiness). That is, each server thread has a
"period" and is entitled to a certain budgeted amount of time each
period, and the period is the deadline for getting that much CPU
time. Constant Bandwidth and Total Bandwidth are two such
policies. (I recently reviewed a paper that worked the kinks out
of the published Constant Bandwidth in a very nice way. If I can
find that it has been since published, or if there is a public
pre-print technical report, I'll try to send the reference in
another e-mail.)
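
For reference, the Constant Bandwidth Server bookkeeping is small; a
textbook-style sketch (not tied to any particular kernel implementation):

struct cbs {
	long long Q;		/* budget per period */
	long long T;		/* period = relative deadline */
	long long budget;	/* remaining budget */
	long long deadline;	/* current absolute deadline */
};

/* Budget exhausted: postpone the deadline by one period and recharge. */
static void cbs_recharge(struct cbs *s)
{
	s->deadline += s->T;
	s->budget = s->Q;
}

/* Task wakes at time 'now': keep the current (budget, deadline) pair only
 * if it does not exceed the reserved bandwidth Q/T; otherwise reset. */
static void cbs_wakeup(struct cbs *s, long long now)
{
	if (s->budget * s->T > (s->deadline - now) * s->Q) {
		s->deadline = now + s->T;
		s->budget = s->Q;
	}
}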

With CPU shares, we do have something like a WCET, but it is
really a maximum allowed execution time. In this case, I'm not
sure it makes much (any?) sense to worry about laxity, though.
This should not be a hard-deadline situation (indeed, I don't
think it makes sense for an OS as complicated as Linux to talk
about truly hard deadlines). It may be enough to know the maximum
lag or tardiness.

Of course, if you want to put in special support for periodic
tasks (say for sensor polling or periodic actuator output), you
can do that, but to me the right model is probably not a thread.
It would make more sense to me for such tasks to implemented as
event handlers.

> How can you use deadlines based on priorities? A priority is a
> one-way mapping of deadlines for a set of tasks.

I had in mind several different ways.

1) You can preserve a range of nominal "deadlines" that are
above and below the range of real deadlines used in scheduling.
For example:

A) Reserve the maximum representable time value for
tasks that should never run (suspended).
This value is useful for bandwidth-limiting
aperiodic server scheduling, if you really want to
keep a temporarily out-of-budget server from competing with
other tasks. Note that the pure constant-bandwidth server is not enough
in this respect, since it would still have a deadline earlier
than the non-realtime tasks in the B range.

B) Reserve a few values below that for tasks that are
fixed-priority and lower in priority than all true deadline tasks.
Some of these "deadlines" can be used for SCHED_FIFO and SCHED_RR
that we want to be below the deadline-scheduler and also for
SCHED_OTHER.

C) The main band of deadlines is used for real deadline
scheduling. (I don't believe it would technically violate the
POSIX standard to have a hidden "hole" between SCHED_FIFO and
SCHED_RR values, but if there are serious objections, one could
pick a priority in the middle of the RT range and say that these
deadline scheduled tasks are executing at that priority.)

D) Reserve a few values close to zero for tasks that
are fixed-priority and higher in priority than all true
deadline tasks. This is useful for SCHED_FIFO and SCHED_RR
that we want to be above the deadline-scheduler, and
also for hybrid EDZL and EDF scheduling algorithms.
EDZL would use such a value when a task reaches zero laxity.
Hybrid EDF uses such a value for tasks that have high
enough processor utilizations (> 50%), to achieve a higher
schedulable system utilization than pure EDF.

You have lost nothing in deadline representation, since the values
used for these two fixed-priority ranges are useless for real
deadlines.

2) Within the range (C) ("real" deadline scheduling), you can also
implement something close enough to priority to serve the purposes
for which priority is typically used, by using an aperiodic-server
type scheduling algorithm and giving "high priority" tasks short
replenishment periods (and so shorter relative deadlines).
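
One way to picture the whole scheme is as a single ordering key
(smaller = scheduled sooner), with the fixed-priority bands folded in
above and below the real deadline range; the band widths and names
below are arbitrary illustration values, not a proposed interface:

#include <stdint.h>

#define BAND_D_LEVELS	16	/* fixed priorities above all deadlines */
#define BAND_B_LEVELS	16	/* fixed priorities below all deadlines */

#define KEY_SUSPENDED	UINT64_MAX			/* band A: never runs */
#define KEY_B_BASE	(UINT64_MAX - 1 - BAND_B_LEVELS)	/* band B */

/* Band D: fixed-priority tasks that beat every real deadline. */
static inline uint64_t key_high_fixed(unsigned prio)	/* 0 .. BAND_D_LEVELS-1 */
{
	return prio;
}

/* Band C: genuine absolute deadlines, shifted past band D. */
static inline uint64_t key_deadline(uint64_t abs_deadline_ns)
{
	return BAND_D_LEVELS + abs_deadline_ns;
}

/* Band B: fixed-priority tasks below every real deadline. */
static inline uint64_t key_low_fixed(unsigned prio)	/* 0 .. BAND_B_LEVELS-1 */
{
	return KEY_B_BASE + prio;
}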

> Are we going to place all tasks in the kernel into rt-deadline
> tasks? I had the impression that we wanted a class for a special
> set of tasks.

I think it could be done.
See my scheme above for translating everything into deadlines.

Ted

2009-07-18 20:20:37

by Michal Sojka

[permalink] [raw]
Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel

On Friday 17 of July 2009 01:29:11 Raistlin wrote:
> On Wed, 2009-07-15 at 16:55 -0400, Ted Baker wrote:
> > In the deadline scheduling domain, we have algorithms like
> > Constant Bandwidth Server (and some improved variants) that
> > provide similar guarantees and limits, in terms of deadlines. (I
> > recall seeing one very good version in a paper I recently saw,
> > which I could seek out and provide to the list if there is
> > interest.)
>
> Well, if it does not bother you too much, I'm curious about that
> solution... When you find some time, even via private mail, if I'm the
> only one, it would be nice if you can point that paper out to me.

Hi,
I'm also interested in this, so please, send the pointers publicly,

Michal