2004-11-21 20:29:44

by Esben Nielsen

Subject: Priority Inheritance Test (Real-Time Preemption)

Hi,
Starting from realfeel I wrote a small, simple test of how well the priority
inheritance mechanism works.

Basically it samples how long a real-time task has to wait to get into a
protected region (a character device) while non-real-time tasks also try to
get into the region. Their "job" inside the region is to busy-loop for
1 ms. This ought to mimic how drivers and other parts of the kernel would
behave in a real real-time application: real-time tasks using a driver
while non-real-time tasks use the same driver.

With an ideal PI mutex the time the real-time task has to wait for the
lock should be between 0 and 1 ms: 0 when the mutex is uncongested and 1
ms when one of the non-real-time tasks has just taken the mutex.
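To give an idea of what the real-time side does, here is a heavily
stripped-down sketch of the measurement loop (the real code is in the
tarball below; the device node name, the RT priority and the calibrated
loop count are placeholders here):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/ioctl.h>
#include <sys/time.h>

#define BLOCK_IOCTL 4245                /* same value as in the blocker driver */

int main(void)
{
        struct sched_param p = { .sched_priority = 98 };  /* placeholder prio */
        unsigned long loops_per_ms = 100000;  /* would be calibrated, not hardcoded */
        struct timeval t1, t2;
        long us;
        int fd, i;

        sched_setscheduler(0, SCHED_FIFO, &p);  /* make this task real-time */
        fd = open("/dev/blocker", O_RDONLY);    /* device node name assumed */

        for (i = 0; i < 1000; i++) {
                gettimeofday(&t1, NULL);
                ioctl(fd, BLOCK_IOCTL, loops_per_ms);  /* lock, busy-loop ~1 ms, unlock */
                gettimeofday(&t2, NULL);

                /* total time spent in the ioctl; subtracting the ~1 ms job gives
                   the time spent waiting for the lock, ideally 0..1 ms */
                us = (t2.tv_sec - t1.tv_sec) * 1000000 + (t2.tv_usec - t1.tv_usec);
                printf("%ld\n", us - 1000);

                usleep(1000 + rand() % 3000);   /* random gap between samples */
        }
        close(fd);
        return 0;
}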

I tested it on V0.7.26-0 and my own U9.2-priom. Both implementations fail
when the mutex is congested by more than one non-real-time task. It works
well enough when there is only one non-real-time task trying to get the
mutex, but as soon as there are more, it looks as if the real-time task is
not always first in the wait queue. I.e. sometimes it has to wait 2 ms!
With 4 non-real-time tasks the most common wait is 1-2 ms!

Code, detailed description and data can be found at
http://www.phys.au.dk/~simlo/Linux/pi_test.tgz

Esben




2004-11-21 23:25:53

by Ingo Molnar

Subject: Re: Priority Inheritance Test (Real-Time Preemption)


* Esben Nielsen <[email protected]> wrote:

> Hi,
> From realfeel I wrote a small, simple test to test how well priority
> inheritance mechanism works.

cool - this is a really useful testsuite.

> I tested it on V0.7.26-0 and my own U9.2-priom. Both implementations
> fails when the mutex is congested by more than 1 non-real-time task.
> [...]

i can confirm your measurements.

I have fixed all the PI bugs that your suite uncovered (there were quite
a number of them!), the fixes are included in the V0.7.30-0 patch
available at the usual place:

http://redhat.com/~mingo/realtime-preempt/

with this patch i get the expected histogram with entries only in the
0-1 msec buckets.

Ingo

2004-11-22 09:24:39

by Bill Huey

Subject: Re: Priority Inheritance Test (Real-Time Preemption)

On Sun, Nov 21, 2004 at 09:29:23PM +0100, Esben Nielsen wrote:
> Hi,
> From realfeel I wrote a small, simple test to test how well priority
> inheritance mechanism works.
>
> Basicly it samples how long a real-time task have to wait to get into a
> protected region while non-real-time tasks also try to get into the
> region (a character device). Their "job" in the region is to busy-loop for
> 1 ms. This ought to mimic how drivers and other parts of the kernel would
> work in a real real-time application: Real time tasks using the driver
> while non-real-time tasks also use the same driver.
>
> With an ideal PI mutex the time the real-time task has to wait to get the
> lock should be between 0 and 1 ms. 0 when the mutex is uncongested and 1
> ms when one of the non-real-time tasks just got the mutex.

[private reply cut and pasted to the list as requested]

IMO, there needs to be statistical code in the mutex itself so that it can
measure the frequency of PI events as well as the depth of the inheritance
chains and all data structure traversals. The problem with writing that
stuff now is that there isn't proper priority propagation through the entire
dependency chain in any mutex code that I've publicly seen yet. Patching
this instrumentation into a mutex requires a mutex with this functionality
built in. IMO, PI should be considered a kind of contention overload
condition and really a kind of fallback device to deal with these kinds
of exceptional circumstances.
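Something as simple as a counter block embedded in the mutex would already
tell you a lot. A rough sketch of what I mean (the struct and function
names here are made up, this is not against any existing mutex code):

/* made-up sketch of PI statistics embedded in the mutex */
struct pi_stats {
        unsigned long   boosts;         /* how often PI kicked in at all */
        unsigned long   max_depth;      /* deepest inheritance chain seen */
        unsigned long   depth_hist[8];  /* histogram of chain depths */
        unsigned long   walk_steps;     /* data structure traversals, summed */
};

/* called from the slow path whenever a waiter boosts an owner */
static inline void pi_stats_account(struct pi_stats *st, unsigned int depth,
                                    unsigned int steps)
{
        st->boosts++;
        st->walk_steps += steps;
        if (depth > st->max_depth)
                st->max_depth = depth;
        st->depth_hist[depth < 8 ? depth : 7]++;
}

Export that via /proc and you can see at a glance whether PI is a rare
fallback or a hot path on a given workload.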

Turning this into a "priority inheritance world" is just going to turn
this project into the FreeBSD SMP project, where nobody is working on fixing
the actual contention problems in the kernel and everybody is dependent on
overloading an already half-overloaded scheduler with priority calculations.
This eventually leads to the kernel doing little effective work and is the main
technical reason why DragonFly BSD exists, a vastly superior BSD development
group (for many reasons).

http://citeseer.nj.nec.com/yodaiken01dangers.html

IMO, it is a pretty fair assessment of some of the failures of using a "late" PI.
It's not the whole story, but a pretty solid view of how it's been misunderstood,
overused and then flat out abused. There might be places where, if algorithmically
bounded somehow, reverting some of the heavily hammered sleeping locks back to
spinlocks would make the system faster and more controlled. rtc_lock could
possibly be one of those places, along with other locks that are as heavily
used as that.

That's my take on it.

bill

2004-11-22 11:38:02

by Ingo Molnar

Subject: Re: Priority Inheritance Test (Real-Time Preemption)


* Bill Huey <[email protected]> wrote:

> [...] There might be places where, if algorithmically bounded somehow,
> reverting some of the heavy hammered sleeping locks back to spinlocks
> would make the system faster and more controlled. rtc_lock possibly
> could be one of those places and other places that are as heavily as
> used as that.

in the -RT patchset one of the reasons why i've gone for the completely
preemptible variant is to trigger all priority inversion problems
outright. In the first variant they didn't really trigger - but they were
present. Once the locks were almost all preemptible, PI problems
surfaced in a big way - causing people to report them and forcing me to
fix them :-)

There are lots of critical sections in Linux and we cannot design around
them - so if the goal is hard-RT properties and latencies then priority
inversion is a problem that has to be solved. Later on we could easily
revert some of the hw-related spinlocks to raw spinlocks, and/or the
known-O(1) critical sections as well.

the paper cited is not very persuasive to me though. It lists problems
of an incomplete/incorrect PI implementation, and comes to the IMO false
(and unrelated) conclusion that somehow PI-handling is not desired.
Obviously PI only makes sense if it's implemented correctly. I think I
managed to fix the problems Esben's testsuite uncovered in the current
-RT patch. Anyway, this implementation is also special in that it relies
on correct SMP locking of Linux:

> Turning this into a "priority inheritance world" is just going to turn
> this project into the FreeBSD SMP project [...]

i don't have any intention of turning Linux into a 'priority inheritance
world'. PI handling is only a property of the PREEMPT_RT feature
intended for the most latency-sensitive applications - the main and
primary critical-section model of Linux is and should still be a healthy
mix of spinlocks and mutexes. Having only mutexes (or only spinlocks) is
an extreme that _does_ hurt the common case. PREEMPT_RT 'only' lives on
the back of SMP-Linux.

Ingo

2004-11-22 14:36:18

by john cooper

Subject: Re: Priority Inheritance Test (Real-Time Preemption)

Bill Huey (hui) wrote:

> IMO, their needs to be statistical code in the mutex itself so that it can
> measure the frequency of PI events as well as depth of the inheritance
> chains and all data structure traversals. The problem with writing that
> stuff now is that there isn't proper priority propagation through the entire
> dependency chain in any mutex code that I've publically seen yet. Patching
> this instrumentation in a mutex require a mutex with this built in
> functionality. IMO, PI should be considered a kind of contention overload
> condition and really a kind of fallback device to deal with these kind
> of exceptional circumstances.

I'd hazard a guess that the reason existing implementations do not
do this type of dependency-chain closure is the complexity of a
general approach. Getting correct behavior and scaling on SMP
requires some restrictions on how lock ownership is maintained,
otherwise fine-grained locking is not possible. Another likely
reason is the fact that more mechanism is getting put in place for
less likely inversion scenarios. And when those scenarios do
exist, the cost of effecting promotion closure may well be
greater than allowing the priority inversion to subside.
However, this point of diminishing returns is application
dependent, so there is no single, simple solution.

That said I don't see anything in the current work which precludes
doing any of the above. To my eyes, the groundwork is already
in place.

-john


--
[email protected]

2004-11-22 14:40:58

by Ingo Molnar

Subject: Re: Priority Inheritance Test (Real-Time Preemption)


* john cooper <[email protected]> wrote:

> I'd hazard a guess the reason existing implementations do not do this
> type of dependency-chain closure is the complexity of a general
> approach. [...]

please take a look at the latest patch - it is, I believe, handling all the
cases correctly. It certainly appears to solve the cases uncovered by
pi_test.

Ingo

2004-11-22 21:48:18

by Bill Huey

Subject: Re: Priority Inheritance Test (Real-Time Preemption)

On Mon, Nov 22, 2004 at 09:16:18AM -0500, john cooper wrote:
> I'd hazard a guess the reason existing implementations do not
> do this type of dependency-chain closure is the complexity of a
> general approach. Getting correct behavior and scaling on SMP
> require some restrictions of how lock ownership is maintained,
> otherwise fine grained locking is not possible. Another likely

What do you mean by that? Are you talking about strict priority
obedience by the system?

> reason is the fact more mechanism is getting put in place for
> less likely inversion scenarios. And when those scenarios do
> exist the cost of effecting promotion closure may well be
> greater than allowing the priority inversions to subside.
> However this point of diminishing returns is application
> dependent so there is no single, simple solution.

Yes, this is my point.

> That said I don't see anything in the current work which precludes
> doing any of the above. To my eyes, the groundwork is already
> in place.

Yes it is. :)

bill

2004-11-23 01:37:45

by john cooper

Subject: Re: Priority Inheritance Test (Real-Time Preemption)

Bill Huey (hui) wrote:
> On Mon, Nov 22, 2004 at 09:16:18AM -0500, john cooper wrote:
>
>>I'd hazard a guess the reason existing implementations do not
>>do this type of dependency-chain closure is the complexity of a
>>general approach. Getting correct behavior and scaling on SMP
>>require some restrictions of how lock ownership is maintained,
>>otherwise fine grained locking is not possible. Another likely
>
>
> What do you mean by that ? Are you talking about strict priority
> obedience by the system ?

Not quite if I understand your question. I was referring to
avoiding having a global lock to synchronize the conglomerate
data structure when doing a PI dependency walk. The problem
is the lock must be acquired not only in PI scenarios but in
any case which may possibly lead to one or affect a concurrent
PI in progress.

True this global lock is mostly an issue for large count
SMP systems. But as witnessed by such voodoo[1] mechanisms
as rcu, scalability problems are real at that end of the
spectrum.

-john


[1] in a 'nice' way.


--
[email protected]

2004-11-23 01:28:42

by john cooper

Subject: Re: Priority Inheritance Test (Real-Time Preemption)

Ingo Molnar wrote:
> * john cooper <[email protected]> wrote:
>
>
>>I'd hazard a guess the reason existing implementations do not do this
>>type of dependency-chain closure is the complexity of a general
>>approach. [...]
>
>
> please take a look at the latest patch, it is i believe handling all the
> cases correctly. It certainly appears to solve the cases uncovered by
> pi_test.

Yes I see where you are walking the dependency chain
in pi_setprio(). But this is under the global spinlock
'pi_lock'.

My earlier comment was about the difficulty of establishing
fine-grained locking, i.e. per-mutex locks to synchronize mutex
ownership/waiter lists and per-task locks to synchronize the list
of mutexes owned by a task. While this arguably scales better in
an SMP environment, there are issues of mutex traversal order
during PI which can lead to deadlock scenarios. Though I believe
there are reasonable constraints that can be placed on application
mutex acquisition order which side-step this problem.

In the current design pi_lock has scope protecting all
mutex waiter lists (rooted either in mutex or owning task),
as well as per-mutex owner(s). The result of this is
pi_lock must be speculatively acquired before altering
any of the above lists/data regardless of whether a PI
scenario is encountered. The behavior looks correct to
my reading. However I'd offer there is more concurrency
possible in this design.

-john




--
[email protected]

2004-11-22 21:30:17

by Bill Huey

Subject: Re: Priority Inheritance Test (Real-Time Preemption)

On Mon, Nov 22, 2004 at 01:37:41PM +0100, Ingo Molnar wrote:
> in the -RT patchset one of the reasons why i've gone for the completely
> preemptible variant is to trigger all priority inversion problems
> outright. In the first variant they didnt really trigger - but they were
> present. Once the locks were almost all preemptible, PI problems
> surfaced in a big way - causing people to report them and forcing me to
> fix them :-)
>
> There are lots of critical sections in Linux and we cannot design around
> them - so if the goal is hard-RT properties and latencies then priority
> inversion is a problem that has to be solved. Later on we could easily
> revert some of the hw-related spinlocks to raw spinlocks, and/or the
> known-O(1) critical sections as well.

Good. Yeah, the only point I was making is not to cover up contention
problems, which are kernel performance problems, with PI. That's all.

> the paper cited is not very persuasive to me though. It lists problems
> of an incomplete/incorrect PI implementation, and comes to the IMO false
> (and unrelated) conclusion that somehow PI-handling is not desired.

Yeah, I agree. PI has its place, but the main point of the paper is
that getting a really good PI protocol is a very difficult thing and
might not be worth it if you can use other techniques. What you do with
that assertion is situational.

A number of those complaints, as I see it, don't apply to Linux because
of how things like deadlock avoidance, fine-grainedness, etc. are
done. But he does, IMO, outline the difficulty of getting a decent PI
implementation.

> Obviously PI makes only sense if it's implemented correctly. I think i
> managed to fix the problems Esben's testsuite uncovered, in the current
> -RT patch. Anyway, this implementation is also special in that it relies
> on correct SMP locking of Linux:

I'll check it out.

> i dont have any intentions to turn Linux into a 'priority inheritance
> world'. PI handling is only a property of the PREEMPT_RT feature
> intended for the most latency-sensitive applications - the main and
> primary critical-section model of Linux is and should still be a healthy
> mix of spinlocks and mutexes. Having only mutexes (or only spinlocks) is
> an extreme that _does_ hurt the common case. PREEMPT_RT 'only' lives on
> the back of SMP-Linux.

Yeah, that's my point. The reason why/if/when Linux will be strong at RT
is because of the SMP work.

bill

2004-11-23 04:32:11

by Esben Nielsen

Subject: Re: Priority Inheritance Test (Real-Time Preemption)

On Mon, 22 Nov 2004, Bill Huey wrote:

> On Sun, Nov 21, 2004 at 09:29:23PM +0100, Esben Nielsen wrote:
> > [...]
>
> [private reply cut and pastes to the list as requested]
[I'll try to respond to the whole list]

> IMO, their needs to be statistical code in the mutex itself so that it can
> measure the frequency of PI events as well as depth of the inheritance
> chains and all data structure traversals. The problem with writing that
> stuff now is that there isn't proper priority propagation through the entire
> dependency chain in any mutex code that I've publically seen yet. Patching
> this instrumentation in a mutex require a mutex with this built in
> functionality. IMO, PI should be considered a kind of contention overload
> condition and really a kind of fallback device to deal with these kind
> of exceptional circumstances.

Well, I am trying to make a test which does _not_ require instrumentation:
a black-box test. I.e. I test for what you want to know as a real-time
developer: what are the worst-case timings. The test isn't complete: it
needs to be extended to have deeper nestings of mutexes and tasks
entering at different levels in the hierarchy. But with sufficient
randomness in the sampling I believe that we will catch all the real-life
situations, and it will tell us whether the mutex implementation gives us
"deterministic" locking times, which is the ultimate goal.

>
> Turning this into a "priority inheritance world" is just going to turn
> this project into the FreeBSD SMP project where nobody is working on fixing
> the actual contention problems in the kernel and everybody is dependent on
> overloading an already half-overloaded scheduler with priority calculations.
> This eventually leads the kernel doing little effective work and is the main
> technical reason why DragonFly BSD exists, a vastly superior BSD development
> group (for many reasons).
>
> http://citeseer.nj.nec.com/yodaiken01dangers.html
>

From what I read, DragonFly uses a kind of "weak mutex" called
"serializing tokens" instead of mutexes. They prevent deadlocks by
automatically releasing the critical section whenever the holder of the
token goes to sleep - for instance when waiting for another token. This
means you really have to take care - you might have lost your lock
somewhere in the middle of your code. If it is worth anything, that ought
to happen only at blocking calls - i.e. you can't be preempted. So is
preemption turned off while you hold the token? In that case we are back
to a non-real-time system! If preemption is on, we are back to the
problems of priority inversion!

Another thing they seem to use is SPLs, which look a lot like the
priority ceiling protocol. Priority ceiling is a really good locking
mechanism which doesn't give you problems with priority inversion and is
very cheap at run-time. See it as a priority-level-grained spin-lock: you
give a priority to your critical section and everything below that
priority is locked out. The problem is that you have to give every
critical section a priority! But you can't do that in general; different
applications need different priority settings. Thus, priority ceiling is
no good on a general-purpose system like Linux.
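To illustrate what I mean (a UP-only sketch with made-up names - nothing
like this exists in the -RT patch):

/* sketch of an immediate priority-ceiling lock on UP - names made up */
struct ceiling_lock {
        int ceiling;            /* priority assigned to this critical section */
        int saved_prio;
};

static void ceiling_lock(struct ceiling_lock *l)
{
        /* raise ourselves to the ceiling: nothing with a lower priority
           can preempt us, so nothing below can even try to take the lock */
        l->saved_prio = current->prio;
        if (l->ceiling < current->prio)                 /* lower number = higher prio */
                set_task_prio(current, l->ceiling);     /* made-up helper */
}

static void ceiling_unlock(struct ceiling_lock *l)
{
        set_task_prio(current, l->saved_prio);          /* drop back, may reschedule */
}

All the cleverness is in the statically assigned ceiling value - and that
is exactly the number you cannot pick correctly for every critical section
on a general-purpose system.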

I haven't looked that deeply into DragonFly, but as I see it, it is at least
as far from being a real-time system as Linux is. They even proclaim they
don't run SMP: tasks are locked to specific CPUs. They might have some
optimizing techniques worth looking at, though.

As regards the scheduler: PI doesn't add jobs to the scheduler. You
have to lock it out to reset the priority, but that is all. The job of
re-sorting the mutex wait list etc. should be done in the context of one
of the tasks accessing the mutex - preferably releasing the spinlocks in
between to reduce the task latency. The only thing hammering on the
scheduler is that you get two "extra" task switches when you try to take a
congested region held by a lower priority task: switch that task
in, and then out again when it is done. Priority ceiling would be better,
as the high priority task wouldn't even be running (on UP, that is),
but again, it would in practice be very, very hard to configure
correctly.

> IMO, is a pretty fair assessment some of the failures of using a "late" PI.
> It's not the whole story, but a pretty solid view of how it's been misunderstood,
> overused and then flat out abused. There might be places where, if algorithmically
> bounded somehow, reverting some of the heavy hammered sleeping locks back to
> spinlocks would make the system faster and more controlled. rtc_lock possibly
> could be one of those places and other places that are as heavily as used as
> that.
>

Now, I don't know what you mean by "late" PI. And I do agree spinlocks
perform a lot better than a mutex and that most locks ought to be kept
as spinlocks to be "effective". Also, many interrupt handlers shouldn't be
deferred to tasks. But it all depends on your target system! If you
require a task latency of 1 ms, you can keep many more things the
"effective" way, with spinlocks and interrupt handlers in interrupt
context, but when you want task latencies on the order of, say, 100 us you
have to be much more careful.
I say: let it be something you can choose when you are compiling your
kernel. Let there be a special interface in "make (/x/menu)config" for
choosing, say, between "driver X runs in interrupt context and locks with a
spinlock" or "driver X runs in a task and locks with a mutex". For
servers: use the "effective" way. When you want a real-time system, switch
everything to preemptible and only choose the few drivers you "trust"
and/or need in your real-time subsystem to be "effective". On a desktop:
well, I assume someone (the distributors) can pick a nice setting which
suits most users - probably most drivers running in interrupt
context but more high-level stuff deferred to task context.

> That's my take on it.
>
> bill
>

By the way: I have thought of a special use (hack?) of a mutex. In the
current implementations every mutex has a spinlock embedded in it. What
we do in the uncongested situation is

lock the spinlock
check for congestion
mark the mutex as taken by this task
unlock the spinlock
do the job
lock the spinlock
mark the mutex as unlocked
unlock the spinlock

Why not allow the following workflow when the job is small (10
instructions or so):

lock the spinlock
check for congestion
do the job
unlock the spinlock

I.e. the mutex works like a spinlock in the uncongested case; the same
mutex can also be used to lock larger areas, but when small areas are
locked it is nearly as effective as a spinlock.
The drawbacks: 1) You need high discipline from the coders not to abuse
this, giving us high latencies again. 2) You rely on the current
implementation using an embedded spinlock and might not be able to do
general platform-specific optimizations.
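In pseudo-C the idea is roughly this (a sketch only - the field and
function names are invented, this is not the actual -RT mutex code):

/* sketch: use the mutex's embedded spinlock directly for a tiny job */
static inline int mutex_lock_short(struct rt_mutex *m)
{
        spin_lock(&m->wait_lock);
        if (likely(!m->owner))
                return 1;       /* uncongested fast path: keep the embedded
                                   spinlock held across the (tiny) job */
        spin_unlock(&m->wait_lock);
        rt_mutex_lock(m);       /* congested: normal blocking mutex path */
        return 0;
}

static inline void mutex_unlock_short(struct rt_mutex *m, int fast)
{
        if (fast)
                spin_unlock(&m->wait_lock);
        else
                rt_mutex_unlock(m);
}

Note that on the fast path there is no recorded owner, so nobody can boost
you while you hold it - which is exactly why the job has to stay in the
10-instruction range.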

Esben



2004-11-23 08:17:26

by Esben Nielsen

Subject: Re: Priority Inheritance Test (Real-Time Preemption)

On Mon, 22 Nov 2004, john cooper wrote:

> Ingo Molnar wrote:
> > [...]
>
> Yes I see where you are walking the dependency chain
> in pi_setprio(). But this is under the global spinlock
> 'pi_lock'.
>
> My earlier comment was of the difficulty to establish fine
> grained locking, ie: per-mutex to synchronize mutex
> ownership/waiter lists and per task to synchronize
> the list maintaining mutexes owned by task. While this
> arguably scales better in an SMP environment, there are
> issues of mutex traversal sequence during PI which lead
> to deadlock scenarios. Though I believe there are
> reasonable constraints placed on application mutex
> acquisition order which side step this problem.
>

In the implementation I sent out earlier (I called it -U9.2-priom) I
solved this by releasing all locks before going to the next task in the
dependency chain. This leaves me open to rescheduling in the middle, and
some of the tasks could exit before I got to them. To solve this I used
task_get/put() to make sure the task struct doesn't disappear before the
next lock is taken. All this locking/unlocking and task_get/task_put is
of course an overhead, but besides avoiding one big lock it gives one more
advantage: it decreases the amount of time a spinlock is held, thus
decreasing the penalty to the overall latency.
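Roughly, the walk looks like this (a sketch from memory, not the actual
-U9.2-priom code; I use the stock get_task_struct()/put_task_struct() here
for what I called task_get/put() above, and the per-task lock and the
blocked_on field are approximations):

/* sketch: boost a chain of owners, dropping all spinlocks between steps */
static void boost_chain(struct task_struct *owner, int prio)
{
        struct task_struct *task = owner, *next_owner;
        struct rt_mutex *next;

        get_task_struct(task);                  /* it may not exit under us */
        while (task) {
                spin_lock(&task->pi_lock);      /* per-task lock, no global one */
                if (prio < task->prio)
                        task->prio = prio;      /* really via mutex_setprio() */
                next = task->blocked_on;        /* mutex this task itself waits on */
                next_owner = next ? next->owner : NULL;
                if (next_owner)
                        get_task_struct(next_owner);
                spin_unlock(&task->pi_lock);    /* nothing held across the hop,
                                                   so we can be preempted here */
                put_task_struct(task);
                task = next_owner;              /* deadlock detection left out */
        }
}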


> In the current design pi_lock has scope protecting all
> mutex waiter lists (rooted either in mutex or owning task),
> as well as per-mutex owner(s). The result of this is
> pi_lock must be speculatively acquired before altering
> any of the above lists/data regardless whether a PI
> scenario is encountered. The behavior looks correct to
> my reading. However I'd offer there is more concurrency
> possible in this design.
>
I think that is a good approach on UP: disable preemption, do the sorting
etc., enable preemption. On SMP, however, a global lock is not so nice...

However, I am thinking: Fix the bugs in the current implementation, then
try to optimize it.


> -john
>
Esben

2004-11-23 08:20:25

by Ingo Molnar

Subject: Re: Priority Inheritance Test (Real-Time Preemption)


* john cooper <[email protected]> wrote:

> >>I'd hazard a guess the reason existing implementations do not do this
> >>type of dependency-chain closure is the complexity of a general
> >>approach. [...]
> >
> >
> >please take a look at the latest patch, it is i believe handling all the
> >cases correctly. It certainly appears to solve the cases uncovered by
> >pi_test.
>
> Yes I see where you are walking the dependency chain
> in pi_setprio(). But this is under the global spinlock
> 'pi_lock'.
>
> My earlier comment was of the difficulty to establish fine
> grained locking, [...]

the issues raised in the paper and in this thread were much more
fundamental than SMP-scalability. Considering the costs of a hard-RT
mutex approach itself i don't think SMP-scalability is a primary issue
right now.

> [...] However I'd offer there is more concurrency possible in this
> design.

yeah, most likely - but correctness comes first. SMP scalability is
something that can be done later.

Ingo

2004-11-23 12:32:58

by Ingo Molnar

Subject: Re: Priority Inheritance Test (Real-Time Preemption)


* Ingo Molnar <[email protected]> wrote:

>
> > From realfeel I wrote a small, simple test to test how well priority
> > inheritance mechanism works.
>
> cool - this is a really useful testsuite.

FYI, i've put the 'blocker device' kernel code into the current -RT
patch (-30-7). This makes it possible to build it on SMP (which didn't
work when it was a module), and generally makes it easier to do testing
via pi_test.

The only change needed on the userspace pi_test side was to add -O2 to
the CFLAGS in the Makefile to make the loop() timings equivalent, and to
remove the module compilations. I've added a .config option for it too
and cleaned up the code.

Ingo

2004-11-23 15:57:34

by Esben Nielsen

Subject: Re: Priority Inheritance Test (Real-Time Preemption)

Ok, I'll try to grab -30-7 and work on from there. I am working on
variable dependency chains (locking tree depth). I'll try to send it as a
patch against -30-7. But I must admit it is very hard for me to keep up
with all your patches, downloading them, compiling etc.: once I am up and
running on the newest version you have already sent out the next! :-)

Esben


On Tue, 23 Nov 2004, Ingo Molnar wrote:

>
> * Ingo Molnar <[email protected]> wrote:
>
> >
> > > From realfeel I wrote a small, simple test to test how well priority
> > > inheritance mechanism works.
> >
> > cool - this is a really useful testsuite.
>
> FYI, i've put the 'blocker device' kernel code into the current -RT
> patch (-30-7). This makes it possible to build it on SMP (which didnt
> work when it was a module), and generally makes it easier to do testing
> via pi_test.
>
> The only change needed on the userspace pi_test side was to add -O2 to
> the CFLAGS in the Makefile to make the loop() timings equivalent, and to
> remove the module compilations. I've added a .config option for it too
> and cleaned up the code.
>
> Ingo

2004-11-23 23:07:24

by Esben Nielsen

Subject: Re: Priority Inheritance Test (Real-Time Preemption)

Hi,
I have updated the test to take nested locking into account, where
the dependency chain has to be traversed when priorities are
boosted. Basically, I made a test which makes pi_walk in /proc/stat become
non-zero!

I changed both the user-space code and the blocker device in the
kernel - the patch to the kernel is below and the full code is at
http://www.phys.au.dk/~simlo/Linux/pi_test2.tgz
along with detailed explanations.

Results (in short):
-30-9 doesn't resolve nested locking well. The expected max locking time
in my test would be depth * 1 ms - it is much higher even at a locking
depth of two.

I have an idea about what the error(s) is (are): in rt.c, policy ==
SCHED_NORMAL tasks are treated specially. A task boosted into the
real-time realm by mutex_setprio() is _still_ SCHED_NORMAL and does not
gain all the privileges of a real-time task. I suggest that the tests on
SCHED_NORMAL are replaced by the rt_task() test, which just looks at
the current priority and thus is also true for tasks temporarily
boosted into the real-time realm. Another thing: a SCHED_NORMAL task will
not be added to the pi_waiters list, but it ought to be when it is later
boosted into the real-time realm. Also, you ignore all SCHED_NORMAL
tasks at the tail of the wait list when you try to find the next
owner: it could be that one of those is boosted.
Esben


--- linux-2.6.10-rc2-mm2-V0.7.30-9/drivers/char/blocker.c.orig 2004-11-23 20:18:28.000000000 +0100
+++ linux-2.6.10-rc2-mm2-V0.7.30-9/drivers/char/blocker.c 2004-11-23 20:41:57.742899751 +0100
@@ -24,11 +24,41 @@
 	get_cpu_tick();
 }
 
-spinlock_t lock = SPIN_LOCK_UNLOCKED;
-
 #define BLOCK_IOCTL 4245
+#define BLOCK_SET_DEPTH 4246
 #define BLOCKER_MINOR 221
 
+
+#define MAX_LOCK_DEPTH 10
+
+static spinlock_t blocker_lock[MAX_LOCK_DEPTH];
+
+static unsigned int lock_depth = 1;
+
+void do_the_lock_and_loop(unsigned int args)
+{
+	int i,max;
+
+	if(rt_task(current)) {
+		max = lock_depth;
+	}
+	else if(lock_depth>1) {
+		max = (current->pid % lock_depth)+1;
+	}
+	else {
+		max = 1;
+	}
+
+	/* Always lock from the top down */
+	for(i=max-1;i>=0; i--) {
+		spin_lock(&blocker_lock[i]);
+	}
+	loop(args);
+	for(i=0;i<max; i++) {
+		spin_unlock(&blocker_lock[i]);
+	}
+}
+
 static int blocker_open(struct inode *in, struct file *file)
 {
 	printk(KERN_INFO "blocker_open called\n");
@@ -40,9 +70,13 @@
 {
 	switch(cmd) {
 	case BLOCK_IOCTL:
-		spin_lock(&lock);
-		loop(args);
-		spin_unlock(&lock);
+		do_the_lock_and_loop(args);
+		return 0;
+	case BLOCK_SET_DEPTH:
+		if(args>=MAX_LOCK_DEPTH) {
+			return -EINVAL;
+		}
+		lock_depth = args;
 		return 0;
 	default:
 		return -EINVAL;
@@ -66,11 +100,17 @@
 
 static int __init blocker_init(void)
 {
+	int i;
+
 	printk(KERN_INFO "blocker device installed\n");
 
 	if (misc_register(&blocker_dev))
 		return -ENODEV;
 
+	for(i=0;i<MAX_LOCK_DEPTH;i++) {
+		blocker_lock[i] = SPIN_LOCK_UNLOCKED;
+	}
+
 	return 0;
 }


On Tue, 23 Nov 2004, Ingo Molnar wrote:

>
> * Ingo Molnar <[email protected]> wrote:
>
> >
> > > From realfeel I wrote a small, simple test to test how well priority
> > > inheritance mechanism works.
> >
> > cool - this is a really useful testsuite.
>
> FYI, i've put the 'blocker device' kernel code into the current -RT
> patch (-30-7). This makes it possible to build it on SMP (which didnt
> work when it was a module), and generally makes it easier to do testing
> via pi_test.
>
> The only change needed on the userspace pi_test side was to add -O2 to
> the CFLAGS in the Makefile to make the loop() timings equivalent, and to
> remove the module compilations. I've added a .config option for it too
> and cleaned up the code.
>
> Ingo

2004-11-24 02:41:51

by Ingo Molnar

Subject: Re: Priority Inheritance Test (Real-Time Preemption)


* Esben Nielsen <[email protected]> wrote:

> I have an idea about what the error(s) is(are): In rt.c policy ==
> SCHED_NORMAL tasks are threaded specially. A task boosted into the
> real-time realm by mutex_setprio() is _still_ SCHED_NORMAL and do not
> gain all the privileges of a real-time task. I suggest that the tests
> on SCHED_NORMAL are replaced by using the rt_task() test which just
> looks at the current priority and thus also would be true on tasks
> temporarely boosted in the real-time realm. [...]

ah, indeed. I thought i fixed all these places but you are right that
there's one more instance remaining, in pick_new_owner():

+	if (w->task->policy == SCHED_NORMAL)
+		break;

> [...] Another thing: A SCHED_NORMAL task will not be added to the
> pi_waiters list, but it ought to be when it is later boosted into the
> real-time realm. [...]

ok, this is an important one too. Fixing this bug will complicate
pi_setprio() some more, but should be pretty straightforward: the task
that is boosted can be blocked on a single rt-mutex at most.

i'll fix these bugs in the next release. Your systematic testing of PI
is very useful, it already uncovered _tons_ of PI bugs that would be
quite hard to find via normal testing.

> [...] Also, you ignore all tasks being SCHED_NORMAL in the tail of the
> wait list when you try to find the next owner: It could be that one of
> those is boosted.

correct.

> --- linux-2.6.10-rc2-mm2-V0.7.30-9/drivers/char/blocker.c.orig 2004-11-23 20:18:28.000000000 +0100
> +++ linux-2.6.10-rc2-mm2-V0.7.30-9/drivers/char/blocker.c 2004-11-23 20:41:57.742899751 +0100

thx, i'll add this too. Have you uploaded the matching pi_test
userspace changes too? (url?)

Ingo

2004-11-24 06:48:39

by Ingo Molnar

Subject: Re: Priority Inheritance Test (Real-Time Preemption)


> thx, i'll add this too. Have you uploaded the matching pi_test
> userspace changes too? (url?)

found it ... the url was in your mail :-|

Ingo

2004-11-24 07:05:01

by Ingo Molnar

Subject: Re: Priority Inheritance Test (Real-Time Preemption)


* Esben Nielsen <[email protected]> wrote:

> --- linux-2.6.10-rc2-mm2-V0.7.30-9/drivers/char/blocker.c.orig 2004-11-23 20:18:28.000000000 +0100
> +++ linux-2.6.10-rc2-mm2-V0.7.30-9/drivers/char/blocker.c 2004-11-23 20:41:57.742899751 +0100

fyi, i've manually applied this patch to my tree - it didn't apply
cleanly because apparently your mailer turned tabs into spaces. I also
fixed up the coding style to match the kernel's.

Ingo

2004-11-24 08:33:15

by Esben Nielsen

Subject: Re: Priority Inheritance Test (Real-Time Preemption)

Sorry. Is it allowed to send the patch as an attachment instead? That is
easier.

Esben


On Wed, 24 Nov 2004, Ingo Molnar wrote:

>
> * Esben Nielsen <[email protected]> wrote:
>
> > --- linux-2.6.10-rc2-mm2-V0.7.30-9/drivers/char/blocker.c.orig 2004-11-23 20:18:28.000000000 +0100
> > +++ linux-2.6.10-rc2-mm2-V0.7.30-9/drivers/char/blocker.c 2004-11-23 20:41:57.742899751 +0100
>
> fyi, i've manually applied this patch to my tree - it didnt apply
> cleanly because apparently your mailer turned tabs into spaces. I also
> fixed up the coding style to match that of the kernel's.
>
> Ingo

2004-11-24 08:52:38

by Ingo Molnar

Subject: Re: Priority Inheritance Test (Real-Time Preemption)


* Esben Nielsen <[email protected]> wrote:

> Sorry. Is it allowed to send the patch as an attachment instead? That
> is easier.

sure, that's perfectly fine to me - i'll apply any format that applies.

generally, upstream submission is much stricter, the rules can be found
at:

http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt

but for -RT the number of patches isn't that high - just send anything
that applies cleanly.

Ingo

2004-11-24 09:16:40

by Ingo Molnar

Subject: Re: Priority Inheritance Test (Real-Time Preemption)


* Esben Nielsen <[email protected]> wrote:

> Results (in short):
> -30-9 doesn't resolved nested locking well. The expected max locking time
> in my test would be depth * 1ms - it is much higher just at a locking
> depth at two.

could you check the -30-10 kernel i just uploaded? I fixed the PI bugs
you reported (and found/fixed other ones as well).

Ingo

2004-11-26 19:47:14

by Esben Nielsen

Subject: Re: Priority Inheritance Test (Real-Time Preemption)

On Thu, 25 Nov 2004, Ingo Molnar wrote:

> [...]
>
> yeah, i agree that this has to be further investigated. What type of box
> did you test it on - UP or SMP? (SMP scheduling of RT tasks only got
> fully correct in the very latest -31-7 kernel.)
>

I am running on the -31-7 kernel now - it takes quite some time to run
the runall.sh script with 100000 samples per point, so I don't have full
data yet. But the bounds look like
depth   observed bound   theoretical bound
  1         1 ms             1 ms
  2         3 ms             2 ms   :-(

the rest of the list will follow tomorrow...

> Ingo
>

Esben

2004-11-26 20:48:07

by Ingo Molnar

Subject: Re: Priority Inheritance Test (Real-Time Preemption)


* Esben Nielsen <[email protected]> wrote:

> > task-A task-B task-RT
> >
> > spin_lock(&lock2);
> > [ gets lock2 ]
> > spin_lock(&lock1);
> > [ gets lock1 ]
> > spin_lock(&lock2);
> > [ boosts task-A ]
> > [ waits ]
> > [ gets RT prio ] .
> > spin_lock(&lock1); .
> > [ boosts task-B ] .
> > [ waits ] .
> > . [ gets RT prio ] .
> > . [ 1 msec loop ] .
> > . spin_unlock(&lock1); .
> > [ gets lock 1 ] .
> > spin_lock(&lock1); .
>
> point of disagreement ----^

> No :-)

> Why should task B get lock1 the 2. time before the rt-task? That would
> be an error!

then make it task-C, which tried to take the lock before the RT task
came into the picture. Btw., the above scenario can still happen on SMP.

when task-A unlocks lock1, it can very well give it to task-C - there's
no reason not to do it, task-RT has not expressed any interest in lock1
yet.

so my example and analysis still stands.

> I can't see how it can produce a flow like the one you describe above!

it can produce such a flow on SMP, or if you add in a third non-RT task
(task-C). Agreed?

In the test where you got 3 msecs you had more than 2 non-RT tasks,
correct?

Ingo

2004-11-26 21:10:05

by Ingo Molnar

Subject: Re: Priority Inheritance Test (Real-Time Preemption)


* Ingo Molnar <[email protected]> wrote:

> it can produce such a flow on SMP, or if you add in a third non-RT
> task (task-C). Agreed?

here's the 4-task flow. I've simplified the operations to make it easier
to overview: "L1-success" means 'task locked lock1 and got it'.
"L1-wait" means it tried to get lock1 but has to wait. "RT-prio" means a
non-RT task got boosted to RT priority. "old-prio" means the priority got
restored to the original non-RT value. "UL-1" means 'unlocked lock1'.

task-A          task-B          task-C          task-RT
--------------------------------------------------------
L2-success
                L1-success
                                L1-wait
                                .               L2-wait
                                .               boost-A
RT-prio                         .               .
L1-wait                         .               .
boost-B                         .               .
.               RT-prio         .               .
.               [ 1 ms ]        .               .
.               UL-1            .               .
.               !RT-prio        .               .
get-L1                          .               .
[ 1 ms ]                        .               .
UL-1                            .               .
                                get-L1          .
UL-2                                            .
old-prio                                        .
                                                get-L2
                                                L1-wait
                                                boost-C
                                RT-prio         .
                                [ 1 msec ]      .
                                UL-1            .
                                old-prio        .
                                                get-L1
                                                [ success ]

this is a 3 milliseconds worst-case. Ok?

Ingo

2004-11-26 23:15:24

by Ingo Molnar

Subject: Re: Priority Inheritance Test (Real-Time Preemption)


* Esben Nielsen <[email protected]> wrote:

> I am running on -31-7 kernel now - it takes quite some time to run with
> the runall.sh script with 100000 samples per point so I don't have full
> data yet. [...]

btw., do you really need 100,000 samples to get statistically stable
results? I've been running with 1000 samples and it was already more
than usable - i'd say 3000-5000 samples ought to be more than enough.

> But the bounds look like
> depth observed bound theoretical
> 1 1 ms 1 ms
> 2 3 ms 2 ms :-(

are you sure the theoretical limit is 2 msec? I think it's 3 msec, for
the following reason:

there are two types of nonprivileged-task lock sequences allowed in the
2-deep case:

spin_lock(&lock2);
spin_lock(&lock1);
... loop for 1 msec ...
spin_unlock(&lock1);
spin_unlock(&lock2);

or:
spin_lock(&lock1);
... loop for 1 msec ...
spin_unlock(&lock1);

now, with the above locking, the worst case scenario is the following
one, in chronological order [task A and B are unprivileged, RT is the
highprio task]:

task-A                   task-B                   task-RT

spin_lock(&lock2);
[ gets lock2 ]
                         spin_lock(&lock1);
                         [ gets lock1 ]
                                                  spin_lock(&lock2);
                                                  [ boosts task-A ]
                                                  [ waits ]
[ gets RT prio ]                                  .
spin_lock(&lock1);                                .
[ boosts task-B ]                                 .
[ waits ]                                         .
.                        [ gets RT prio ]         .
.                        [ 1 msec loop ]          .
.                        spin_unlock(&lock1);     .
[ gets lock 1 ]                                   .
                         spin_lock(&lock1);       .
                         [ waits ]                .
[ 1 msec loop ]          .                        .
spin_unlock(&lock1);     .                        .
                         [ gets lock1 ]           .
spin_unlock(&lock2);                              .
                                                  [ gets lock2 ]
                                                  spin_lock(&lock1);
                                                  [ boosts task-B ]
                                                  [ waits ]
                         [ 1 msec loop ]          .
                         spin_unlock(&lock1);     .
                                                  [ gets lock1 ]


the additional 1 msec comes in because the RT task might be blocking on
a task that _itself_ has to wait 1 msec to get its second lock. So we
have 3 msec of maximum waiting altogether.

the additional +1 msec comes from the fact that 1-deep lock/unlock of
lock1 is an allowed operation too - 2 msec would be the limit if the
only sequence is the 2-deep one.

so i think the numbers, at least in the 2-deep case, are quite close to
the theoretical boundary.

agreed?

Ingo

2004-11-26 23:57:48

by Ingo Molnar

Subject: Re: Priority Inheritance Test (Real-Time Preemption)


* Esben Nielsen <[email protected]> wrote:

>
> On Thu, 25 Nov 2004, Ingo Molnar wrote:
>
> > [...]
> > there's one thing i noticed, now that the blocker device is in the
> > kernel, you have to be really careful to compile the userspace loop()
> > code via the same gcc flags as the kernel did. Minor differences in
> > compiler options can skew the timing calibration.
> >
> > but any such bug should at most cause a linear deviation via a constant
> > factor multiplication, while the data shows a systematic nonlinear
> > transformation.
> >
> -g -Wall -O2 was on in userspace.

you can check the gcc options the kernel used via the
drivers/char/.blocker.o.cmd file. Mine has (only the
performance-relevant flags):

-fno-strict-aliasing -fno-common -O2 -fomit-frame-pointer -msoft-float
-mpreferred-stack-boundary=2 -fno-unit-at-a-time -march=pentium3
-mregparm=3

> > [...]
> > yeah, i agree that this has to be further investigated. What type of box
> > did you test it on - UP or SMP? (SMP scheduling of RT tasks only got
> > fully correct in the very latest -31-7 kernel.)
> >
> UP, PIII 697.143 Mhz

ok - some of the fixes affect UP too, but with less likelihood. Might be
worth a try though.

Ingo

2004-11-27 00:20:16

by Esben Nielsen

Subject: Re: Priority Inheritance Test (Real-Time Preemption)


On Thu, 25 Nov 2004, Ingo Molnar wrote:

> [...]
> there's one thing i noticed, now that the blocker device is in the
> kernel, you have to be really careful to compile the userspace loop()
> code via the same gcc flags as the kernel did. Minor differences in
> compiler options can skew the timing calibration.
>
> but any such bug should at most cause a linear deviation via a constant
> factor multiplication, while the data shows a systematic nonlinear
> transformation.
>
-g -Wall -O2 was on in userspace.

> [...]
> yeah, i agree that this has to be further investigated. What type of box
> did you test it on - UP or SMP? (SMP scheduling of RT tasks only got
> fully correct in the very latest -31-7 kernel.)
>
UP, PIII 697.143 MHz

> Ingo
>

Esben

2004-11-27 01:28:16

by Esben Nielsen

Subject: Re: Priority Inheritance Test (Real-Time Preemption)

I finally got time to run the test on 2.6.10-rc2-mm2-V0.7.30-10.
Conclusion: the locking time does not seem to be bounded by
depth * 1 ms as predicted: the more blocking tasks there are, the higher
the bound is.
There _is_ some kind of bound, in that I don't see wild locking times. The
distributions stop nicely at some N * 1 ms, where N is higher than the depth.

I have attached the data. The files are named
<kernel ver>-<depth>.<blocking tasks>.hist
Note especially
2.6.10-rc2-mm2-V0.7.30-10-3.20.hist
which is depth 3 and 20 blocking tasks. There is a clean upper bound of
7 ms (7.1 to be exact) - but where does the 7 come from??? A formula like
N = 2*depth + 1 might fit the results.

If we understand what is going on, this might end up being "good enough" in
the sense that it is deterministic. The end result doesn't have to match the
best-case formula. But the maximum locking time has to be independent of the
number of lower priority tasks banging on the protected region, as that is
something uncontrollable. We have to be able to predict a bound based
solely on the depth before we can say it is "good enough".

Esben


On Wed, 24 Nov 2004, Ingo Molnar wrote:

>
> * Esben Nielsen <[email protected]> wrote:
>
> > Results (in short):
> > -30-9 doesn't resolved nested locking well. The expected max locking time
> > in my test would be depth * 1ms - it is much higher just at a locking
> > depth at two.
>
> could you check the -30-10 kernel i just uploaded? I fixed the PI bugs
> you reported (and found/fixed other ones as well).
>
> Ingo
>


Attachments:
2.6.10-rc2-mm2-V0.7.30-10.tgz (5.80 kB)

2004-11-27 01:36:01

by Ingo Molnar

Subject: Re: Priority Inheritance Test (Real-Time Preemption)


* Ingo Molnar <[email protected]> wrote:

> the additional +1 msec comes from the fact that 1-deep lock/unlock of
> lock1 is an allowed operation too - 2 msec would be the limit if the
> only sequence is the 2-deep one.
>
> so i think the numbers, at least in the 2-deep case, are quite close
> to the theoretical boundary.

in the generic case i think the theoretical boundary should be something
like:

sum(i=1...n)(i) == (n+1) * n / 2

n=1 limit=1
n=2 limit=3
n=3 limit=6
n=4 limit=10

this is quite close to what you have measured for n=1,2,3, and i think
it's becoming exponentially harder to trigger the worst-case with higher
N, so the measured results will likely be lower than that.

Ingo

2004-11-27 02:19:41

by Ingo Molnar

Subject: Re: Priority Inheritance Test (Real-Time Preemption)


* Esben Nielsen <[email protected]> wrote:

> I finally got time to run the test on 2.6.10-rc2-mm2-V0.7.30-10.
> Conclusion: The bound on the locking time seems not to be bounded by
> depth*1ms as predicted: The more blocking tasks there is the higher
> the bound is. There _is_ some kind of bound in that I don't see wild
> locking times. The distributions stops nicely at N *1ms N in is higher
> than depth.

i've fixed a couple of minor, priority-related bugs in kernels post
-30-10, the latest being -31-7 - there's some chance that one of them
might fix this final anomaly.

> which is depth 3 and 20 blocking tasks. There is a clean upper bound
> of 7ms (7.1 to be exact) but where does the 7 come from??? A formula
> like N=2*depth+1 might fit the results.

there's one thing i noticed, now that the blocker device is in the
kernel, you have to be really careful to compile the userspace loop()
code via the same gcc flags as the kernel did. Minor differences in
compiler options can skew the timing calibration.

but any such bug should at most cause a linear deviation via a constant
factor multiplication, while the data shows a systematic nonlinear
transformation.

> If we understand what is going on this might end up being "good
> enough" in the sense it is deterministic. The end result doesn't have
> to be the best case formula. But the maximum locking time have to
> independent of the number of lower priority tasks banging on the
> protected region as that is something uncontrolable. We have to be
> able to predict a bound based solely on the depth before we can say it
> is "good enough".

yeah, i agree that this has to be further investigated. What type of box
did you test it on - UP or SMP? (SMP scheduling of RT tasks only got
fully correct in the very latest -31-7 kernel.)

Ingo

2004-11-27 03:35:26

by Esben Nielsen

Subject: Re: Priority Inheritance Test (Real-Time Preemption)


On Fri, 26 Nov 2004, Ingo Molnar wrote:

>
> * Esben Nielsen <[email protected]> wrote:
>
> > I am running on -31-7 kernel now - it takes quite some time to run with
> > the runall.sh script with 100000 samples per point so I don't have full
> > data yet. [...]
>
> btw., do you really need 100,000 samples to get statistically stable
> results? I've been running with 1000 samples and it was already more
> than usable - i'd say 3000-5000 samples ought to be more than enough.
>
> > But the bounds look like
> > depth observed bound theoretical
> > 1 1 ms 1 ms
> > 2 3 ms 2 ms :-(
>
> are you sure the theoretical limit is 2 msec? I think it's 3 msec, for
> the following reason:
>
> there are two types of nonprivileged-task lock sequences allowed in the
> 2-deep case:
>
> spin_lock(&lock2);
> spin_lock(&lock1);
> ... loop for 1 msec ...
> spin_unlock(&lock1);
> spin_unlock(&lock2);
>
> or:
> spin_lock(&lock1);
> ... loop for 1 msec ...
> spin_unlock(&lock1);
>
> now, with the above locking, the worst case scenario is the following
> one, in chronological order [task A and B are unprivileged, RT is the
> highprio task]:
>
> task-A task-B task-RT
>
> spin_lock(&lock2);
> [ gets lock2 ]
> spin_lock(&lock1);
> [ gets lock1 ]
> spin_lock(&lock2);
> [ boosts task-A ]
> [ waits ]
> [ gets RT prio ] .
> spin_lock(&lock1); .
> [ boosts task-B ] .
> [ waits ] .
> . [ gets RT prio ] .
> . [ 1 msec loop ] .
> . spin_unlock(&lock1); .
> [ gets lock 1 ] .
> spin_lock(&lock1); .

point of disagreement ----^

> [ waits ] .
> [ 1 msec loop ] . .
> spin_unlock(&lock1); . .
> [ gets lock1 ] .
> spin_unlock(&lock2); .
> [ gets lock2 ]
> spin_lock(&lock1);
> [ boosts task-B ]
> [ waits ]
> [ 1 msec loop ] .
> spin_unlock(&lock1); .
> [ gets lock1 ]
>
>
> the additional 1 msec comes in because the RT task might be blocking on
> a task that _itself_ has to wait 1 msec to get its second lock. So we
> have 3 msec of maximum waiting altogether.
>
> the additional +1 msec comes from the fact that 1-deep lock/unlock of
> lock1 is an allowed operation too - 2 msec would be the limit if the
> only sequence is the 2-deep one.
>
> so i think the numbers, at least in the 2-deep case, are quite close to
> the theoretical boundary.
>
> agreed?
>

No :-)
Why should task B get lock1 the second time before the RT task? That
would be an error!
As soon as task B releases lock1 it has to go back to non-RT priority and
thus go to sleep. It should be asleep until the RT task is done and thus
shouldn't get a chance to call spin_lock(&lock1) again while it can "do
damage".

I just briefly checked the code in -31-7 and it should work like I
described: you explicitly put in a schedule() after resetting the
priority when releasing the lock in __up_mutex(). The priority should be
the one you get from mutex_getprio(), since B's pi_waiters is emptied in
change_owner(), called from set_new_owner(), called from pick_new_owner(),
called before pi_setprio().

I can't see how it can produce a flow like the one you describe above!


> Ingo
>

Esben


2004-11-27 07:34:35

by Esben Nielsen

Subject: Re: Priority Inheritance Test (Real-Time Preemption)

I will look through your arguments when I get time (probably not
before Sunday :-(). It does fit with the measurements I have so far.

The reason for running 100000 samples is that for high depth and many
tasks the statistics are quite poor at the tail end. So I can just as well
set it high and do all the daily-life stuff.

Esben

On Fri, 26 Nov 2004, Ingo Molnar wrote:

>
> * Ingo Molnar <[email protected]> wrote:
>
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> > > the additional +1 msec comes from the fact that 1-deep lock/unlock of
> > > lock1 is an allowed operation too - 2 msec would be the limit if the
> > > only sequence is the 2-deep one.
> > >
> > > so i think the numbers, at least in the 2-deep case, are quite close
> > > to the theoretical boundary.
> >
> > in the generic case i think the theoretical boundary should be something
> > like:
> >
> > sum(i=1...n)(i) == (n+1) * n / 2
> >
> > n=1 limit=1
> > n=2 limit=3
> > n=3 limit=6
> > n=4 limit=10
> >
> > this is quite close to what you have measured for n=1,2,3, and i think
> > it's becoming exponentially harder to trigger the worst-case with higher
> > N, so the measured results will likely be lower than that.
>
> also, you might want to try the simpler N-deep-locking-only variant,
> where the maximum latency should be 'n'. This likely needs some changes
> to the blocker.c code though - i.e. set 'max' always to lock_depth.
>
> Ingo
>

2004-11-27 07:34:25

by Ingo Molnar

Subject: Re: Priority Inheritance Test (Real-Time Preemption)


* Ingo Molnar <[email protected]> wrote:

>
> * Ingo Molnar <[email protected]> wrote:
>
> > the additional +1 msec comes from the fact that 1-deep lock/unlock of
> > lock1 is an allowed operation too - 2 msec would be the limit if the
> > only sequence is the 2-deep one.
> >
> > so i think the numbers, at least in the 2-deep case, are quite close
> > to the theoretical boundary.
>
> in the generic case i think the theoretical boundary should be something
> like:
>
> sum(i=1...n)(i) == (n+1) * n / 2
>
> n=1 limit=1
> n=2 limit=3
> n=3 limit=6
> n=4 limit=10
>
> this is quite close to what you have measured for n=1,2,3, and i think
> it's becoming exponentially harder to trigger the worst-case with higher
> N, so the measured results will likely be lower than that.

also, you might want to try the simpler N-deep-locking-only variant,
where the maximum latency should be 'n'. This likely needs some changes
to the blocker.c code though - i.e. set 'max' always to lock_depth.
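i.e. something like this (untested sketch) in blocker.c:

	void do_the_lock_and_loop(unsigned int args)
	{
		int i, max;

		/* N-deep-locking-only variant: every task, RT or not,
		   goes through the full depth */
		max = lock_depth;

		for (i = max - 1; i >= 0; i--)
			spin_lock(&blocker_lock[i]);
		loop(args);
		for (i = 0; i < max; i++)
			spin_unlock(&blocker_lock[i]);
	}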

Ingo

2004-11-27 23:06:39

by Esben Nielsen

Subject: Re: Priority Inheritance Test (Real-Time Preemption)

On Fri, 26 Nov 2004, Ingo Molnar wrote:

>
> * Ingo Molnar <[email protected]> wrote:
>
> > it can produce such a flow on SMP, or if you add in a third non-RT
> > task (task-C). Agreed?
>
> here's the 4-task flow. I've simplified the operations to make it easier
> to overview: "L1-success means" 'task locked lock1 and got it'.
> "L1-wait" means it tried to get lock1 but has to wait. "RT-prio" means a
> non-RT task got boosted to RT priority. "old-prio" means priority got
> restored to the original non-RT value. "UL-1" means 'unlocked lock1'.
>
> task-A task-B task-C task-RT
> -------------------------------------------------------
> L2-success
> L1-success
> L1-wait
> . L2-wait
> . boost-A
> RT-prio . .
> L1-wait . .
> boost-B . .
> . RT-prio . .
> . [ 1 ms ] . .
> . UL-1 . .
> . !RT-prio . .
> get-L1 . .
> [ 1 ms ] . .
> UL-1 . .
> get-L1 .
> UL-2 .
> old-prio .
> get-L2
> L1-wait
> boost-C
> RT-prio .
> [ 1 msec ] .
> UL-1 .
> old-prio .
> get-L1
> [ success ]
>
> this is a 3 milliseconds worst-case. Ok?
>
> Ingo
>
Now I see it! The reason is that task-C is _given_ the lock by task A.
The action of transferring the lock from A to C is done in A's context, at
RT prio.
My line of thinking was that although task-C is in the wait queue of lock1,
it still actively has to go in and _take_ the lock. In that kind of setup
task-RT would get it first on UP. I have a detailed description of such
a design below, but first I have to point out that the formula for the
worst-case delay is not
worst case delay is not
N*(N+1)/2
but
2^N-1
! Why? Let us say lock1 is replaced by a locking structure which is N
deep in the above examble. The worst case delay on locking on that
structure is f(N). When the extra lockN+1 is added we thus get
f(N+1) = 2*f(N) +1
because both task B and task C have to lock on the N depth structure
giving 2*f(N) and we have to wait for task A to finish giving the +1.
The solution to that recursive equation when f(0) = 0 is
f(N) = 2^N -1
I.e.
N f(N)
0 0
1 1
2 3
3 7
4 15

My tests don't show that - but it requires 2^N tasks to reproduce. I am
trying to do that now.

Ok, back to a design making
f(N) = N
on UP:

When I made the mutex in U9.2-priom I originally made it like
that: in mutex_unlock() the owner was set to NULL and the first task woken.
But a high priority task could at that point get in and snap up the mutex in
front of the newly awoken task. That setup would make things work better on
UP.

I have come to think about it in terms of state machines. In the current
implementation a mutex can have the following states: unlocked, owned by
task A, owned by task B, etc. All transitions are possible, i.e.
   unlocked -> task B -> task A -> unlocked.
In the one I started out coding, the only allowed transitions were between
unlocked and task X. Task B -> task A was not allowed. Task B simply set
the state to unlocked and woke up task A, which would then _most likely_ get
in and take the mutex:
   unlocked -> task B -> unlocked -> task A -> unlocked
Especially on an SMP system, however, task C can get in and "steal" it before
task A.

We would fix the discussed problem on UP, as well as the problem of the
above implementation, by introducing a new state called "released". Allowed
transitions would be
unlocked -> task B
task B -> unlocked
task B -> released
released -> task C
So what does "released" mean? It means that task B is giving the mutex to
another task - but it will _not_ pick the new owner. It will simply wake up
the highest-priority task on the wait queue and put the mutex into the
"released" state. When another task A, which could be the newly woken task,
then tries to lock a mutex in the "released" state, task A will only take
the mutex if it has higher priority than any task on the wait-list.
Otherwise task A will block on the mutex as usual. In more detailed
pseudocode:

__mutex_down:
try_again:
        switch (mutex->state) {
        case unlocked:          /* the most common case, optimize for this */
                mutex->state = owned by current;
                return;
        case released:
                if (current->prio <= first_on_wait_queue->prio) {
                        mutex->state = owned by current;
                        return;
                }
                /* not the best waiter: fall through and block */
        case locked by A:
                add current to wait-queue;
                schedule();
                remove current from wait-queue;
                goto try_again;
        }

__mutex_up:
        if (likely(list_empty(mutex->wait_queue))) {
                mutex->state = unlocked;
                return;
        }

        wake_first_waiter(mutex->wait_queue);
        mutex->state = released;
        reset priority;
        return;

Invariant: When the mutex is "released" there is always one task which
is in the wait queue in state "running", coming back to _try_ to take the
mutex.
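
To make the "released" check concrete, here is a small standalone C model
of the decision (hypothetical code, not kernel code - all names are made
up, and it only models the "may I take it now?" question, not the wakeup
or priority-boosting side):

    #include <stdio.h>

    enum mutex_state { UNLOCKED, OWNED, RELEASED };

    struct mutex_model {
            enum mutex_state state;
            int waiter_prio[8];   /* priorities of queued waiters, best first */
            int nwaiters;
    };

    /* kernel convention: a lower ->prio value means a higher priority */
    static int may_take(const struct mutex_model *m, int prio)
    {
            switch (m->state) {
            case UNLOCKED:
                    return 1;     /* free: just take it */
            case RELEASED:
                    /* only the best queued waiter, or someone better, may take it */
                    return m->nwaiters == 0 || prio <= m->waiter_prio[0];
            case OWNED:
            default:
                    return 0;     /* owned by someone else: always block */
            }
    }

    int main(void)
    {
            struct mutex_model m = { RELEASED, { 50 }, 1 };

            printf("prio 10 (RT)     may take: %d\n", may_take(&m, 10)); /* 1 */
            printf("prio 90 (non-RT) may take: %d\n", may_take(&m, 90)); /* 0 */
            return 0;
    }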

I think this algorithm would guarantee a latency of depth * 1 ms on UP.
Let us try to repeat the discussed case on _UP_:

task-A task-B task-C task-RT
-------------------------------------------------------
L2-success
L1-success
L1-wait
. L2-wait
. boost-A
RT-prio . .
L1-wait . .
boost-B . .
. RT-prio . .
. [ 1 ms ] . .
. L1 -> released . .
[woken up] !RT-prio . .
get-L1 . .
[ 1 ms ] . .
L1 -> released [set running] .
L2 -> released [set running]
old-prio .
get-L2
get-l1
L1->released
L2->unlocked
done!
get-L1 .


The reason task RT gets L1 before task C is that C is starved and won't
be able to change the state of the mutex from "released" to "owned by C"
before task RT has done it. On the other hand, task C won't notice that
task RT got there first, because task RT gets in, does its stuff
and puts the mutex back into the released state while C is sitting quietly
on the run-queue.
On an SMP system task C will not be starved like that and will get L1
immediately.

Esben

2004-11-28 08:43:07

by Ingo Molnar

[permalink] [raw]
Subject: Re: Priority Inheritance Test (Real-Time Preemption)


* Esben Nielsen <[email protected]> wrote:

> > task-A task-B task-C task-RT
> > -------------------------------------------------------

> > [ 1 ms ] . .
> > UL-1 . .
> > get-L1 .
> > UL-2 .

> Now I see it! The reason is that task-C is _given_ the lock by task A.
> The action of transfering the lock from A to C is done in A's context
> with RT-prio.

correct.

> My line of thinking was that although task-C is in the wait queue of
> lock1 it still actively have to go in and _take_ the lock. In that
> kind of setup task-RT would get i first on UP.

Trying to exclude all locking activity by 'lesser' tasks (unrelated to
the higher-prio tasks) while a higher-prio task is running is a quite
fragile concept i believe, and it's also pointless: just think about
SMP, or an RT task sleeping for short periods of time for whatever
reason.

This effect can be very well avoided by taking both locks, and this also
better matches the real locking scenarios that happen in the Linux
kernel.

> [...] I have a detailed description of such
> a design below but first I have to point out that the formula for the
> worst case delay is not
> N*(N+1)/2
> but
> 2^N-1

ok.

> I.e.
> N f(N)
> 0 0
> 1 1
> 2 3
> 3 7
> 4 15
>
> My tests doesn't show that - but it requires 2^N tasks to reproduce. I
> am trying to do that now.

note that it's (probably worse than-) exponentially more unlikely for a
test to be able to generate the worst-case.

> Ok, back to a design making
> f(N) = N
> on UP:

why? It is quite rare for there to be even 4-deep locking structures in
the kernel, and even then, it's even rarer that they get accessed in
such an arbitrary-deep manner. Note that task-B can only steal the lock
from task-A because the following locking pattern is allowed:

L1,UL1 [X]

besides:

L1,L2,UL1,UL2 [Y]

In Linux, if there's 2-deep locking, it is much more common for only [Y]
to be allowed, for which the formula is still "f(N) = N". Or even if
both [X] and [Y] are allowed, 3-deep or 4-deep locking with [X] and [Y]
allowed as well is quite rare. Not only is this rare, it's _even_ rarer
that no code within that 4-lock-protected region may end up blocking on
an external interrupt. So what you are trying to solve is a very narrow
case i believe.
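
For concreteness, the two patterns written out with plain pthread mutexes
(an illustration only - l1/l2 are stand-ins for the test's lock1/lock2):

    #include <pthread.h>

    static pthread_mutex_t l1 = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t l2 = PTHREAD_MUTEX_INITIALIZER;

    static void pattern_X(void)     /* [X]: 1-deep, lock1 taken on its own */
    {
            pthread_mutex_lock(&l1);
            /* ... ~1 ms of work ... */
            pthread_mutex_unlock(&l1);
    }

    static void pattern_Y(void)     /* [Y]: 2-deep, L1,L2,UL1,UL2 */
    {
            pthread_mutex_lock(&l1);
            pthread_mutex_lock(&l2);
            /* ... ~1 ms of work ... */
            pthread_mutex_unlock(&l1);
            pthread_mutex_unlock(&l2);
    }

    int main(void)
    {
            pattern_X();
            pattern_Y();
            return 0;
    }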

> When I made the mutex in U9.2-priom I originally made it like that: In
> mutex_unlock() owner was set to NULL and the first task awaken. But a
> high priority task could at that point get in and snap the mutex in
> front of the newly awoken task. That setup would make stuff work
> better on UP.

we could try wakeup-retry instead of pass-lock, although pass-lock still
feels more robust. Relying on UP-starvation to reduce your latencies is
quite ... fragile.

Also, pass-lock performs better because the woken up task does not have
to touch the lock structure anymore, up until it has to release it.
(which gives the CPU(s) more time to optimize things, especially on
SMP.) But if you send a separate patch for that alone (to implement the
'released' state) then we'll be able to see the costs in isolation.

Ingo

2004-11-28 15:55:37

by Esben Nielsen

[permalink] [raw]
Subject: Re: Priority Inheritance Test (Real-Time Preemption)

I agree with your analysis:
There is no need to change the way the mutex is passed to the new task as
the current implementation does give an upper bound. Also it works the
same way on SMP and UP. It also performs better. The situations where the
bound really is 2^N-1, N>2, are very rare if they exist at all.

There is a tiny "however" I want to mention, though. Who will use a Linux
kernel with real-time performance? People who want an RT application and at
the same time want to deploy normal Linux applications. The criterion for
the RT system to work is that even if you put heavy loads of normal
applications on it, the RT application still schedules fine.
It is very unlikely that anyone will try to calculate whether it schedules
or not. It is much more common that people just test it in the lab and
then trust that the real-time properties of the system will not change
when you go into deployment.

The problem now is that in a test you won't see many lower-priority
threads trying to get a locked region -- as we have seen, it
is hard to make such a test even for an "academic", bad driver. I.e. you
don't see the 2^N-1 behaviour in the test. In deployment, however, you
might in some circumstances see it and thus degrade the real-time
performance. On the other hand, even with a locking depth of 4 we only go
up to a factor of 15 for the very worst-case behaviour. Hopefully,
people leave enough room in their real-time applications to be able to
tolerate that!

Esben


2004-11-29 10:00:16

by Ingo Molnar

[permalink] [raw]
Subject: Re: Priority Inheritance Test (Real-Time Preemption)


* Esben Nielsen <[email protected]> wrote:

> I agree with your analysis: There is no need to change the way the
> mutex is passed to the new task as the current implementation does
> give an upper bound. Also it works the same way on SMP and UP. It also
> performs better. The situations where the bound really is 2^N-1, N>2,
> are very rare if they exist at all.
>
> There is a tiny "however" I want to mention, though. Who will use a
> Linux kernel with real-time performance? People who want a RT
> application and at the same time want to deploy normal Linux
> applications. The criteria for the RT system to work is that even if
> you put on heavy loads of normal applications the RT application still
> schedules fine. It is very unlikely that anyone will try to calculate
> wether it schedules or not. It is much more common that people just
> test it in the lab and then thrust that the real-time properties of
> the the system not to change when you go into deployment.

the locking interactions have to be well understood. You really have to
know whether the RT app takes lock1 once or 10 times in a critical path
- in the latter case people might never see the 10x2msec latency in
testing either, so i dont think there's a big difference. I.e. depending
on what kernel subsystems the RT task uses (what kernel subsystems it
uses that shares locks with nonprivileged tasks), it might see higher
latencies.

To give you an extreme example: you cannot run OpenOffice.org with
SCHED_FIFO prio 99 and expect it to have any sane deterministic latency
bounds. The simpler the app, the easier it is to control latencies.

Careful planning and design of hard-RT apps is still vital, and further
RT-friendly locking of kernel subsystems has to be done too. Also, the
actual latencies and characteristics of kernel subsystems has to be
understood too. So e.g.: "if your hard-RT task uses file ops then you
might see latencies up to ... N*open_files" - things like that.

PREEMPT_RT doesnt magically give all kernel subsystems bounded execution
times - it only guarantees bounded scheduling latencies. So it in
essence integrates hard-RT into Linux, but it doesnt by itself make the
kernel itself O(1). But, fortunately, a good portion of the core
functionality of Linux is quite close to O(1) already, and most hard-RT
apps try to stay as simple as possible. But no doubt there's still tons
of work left - PREEMPT_RT is only the core infrastructure.

Ingo

2004-11-29 15:07:59

by Esben Nielsen

[permalink] [raw]
Subject: Re: Priority Inheritance Test (Real-Time Preemption)

On Mon, 29 Nov 2004, Ingo Molnar wrote:

>
> * Esben Nielsen <[email protected]> wrote:
>
> [...]
> To give you an extreme example: you cannot run OpenOffice.org with
> SCHED_FIFO prio 99 and expect it to have any sane deterministic latency
> bounds. The simpler the app, the easier it is to control latencies.
>

No, but I want my RT subsystem to be unaffected even if the
users choose to run 1000 soffice.bin at SCHED_NORMAL!! Not that I suspect
a cracker would be able to lock on to my RT system and do it on purpose,
but I know developers make errors. Very often I have seen servers
becoming over-burdened because some application keeps spawning new
threads/processes - often as a solution to being overloaded!

That would indeed demand that the kernel be O(1) in a lot of places
where it isn't right now:
1) identify those places,
2) make the core system purely O(1) such that RT applications are possible
   at all,
3) stop using spinlocks in all non-O(1) places such that RT applications
   are not disturbed by non-O(1) behaviour even though non-real-time
   threads hit them all the time, and
4) change the most useful, but non-critical, places to be O(1) too, such
   that these subsystems can be used in real-time applications as well.

I think you have come very far with 1)-3). There is a lot of work in 4)
but I think that can be left to those people who actually need to use
those subsystems in a real-time application:

The filesystems will most likely never be O(1), but simple device drivers
can easily be made O(1). I.e. you can make real-time applications using a
character device once you have opened the device, but not the filesystem
in general. The TCP/IP stack probably won't be O(1) for a long time, i.e.
no real-time application can use TCP/IP directly for a long time :-(

Now, I started to look at priority inheritance because I saw that as the
way to fulfill 3), with the twist that real-time applications can share
resources with non-real-time applications. I think the current
implementation is good, if the current mechanism is kept inside the
kernel. The problem is that a raw spinlock is used to lock an iteration
over a list which can be O(number of waiters * locking depth) long. As
long as we are in the kernel both are "controlled", i.e. one can see the
worst-case number in a stress test and know it can't get worse. *

However, if the same mechanism is used for a userspace mutex there will be
a problem, since a very long list of waiters can be built up on a mutex by
a faulty application. (That could be one where starvation is solved by
spawning new threads to handle the work, which then will block on the
critical region, such that the lists get longer and longer...) The solution
I see is to lift the level of abstraction: keep the same mechanism, but
this time the lock around the wait queues is a kernel mutex. I.e.
the kernel mutex is implemented with a raw spinlock, and the userspace
mutex is implemented with a kernel mutex! That will be more expensive, of
course, but it avoids non-real-time userspace threads degrading the
system latency through bad usage of mutexes.

There are probably still a lot of other places where non-real-time
userspace threads can increase the overall system latency in an
unpredictable way, but I think you have a good grasp on those now :-)


> [...]
> Ingo
>

Esben

*If somebody is afraid that the number of waiters on a mutex can be made
arbitrarily long by, for instance, having a lot of non-real-time processes
calling into the same kernel region, it can be fixed by moving and
keeping processes at priority 100 and SCHED_FIFO while they own a mutex,
such that they will run to completion before other non-real-time tasks can
be scheduled on the same CPU. In fact that would make the system behave
like the old non-preemptive kernel when there are only non-real-time
threads! -- which also saves the system from rescheduling and thus gives it
better throughput.


2004-11-29 15:58:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: Priority Inheritance Test (Real-Time Preemption)


* Esben Nielsen <[email protected]> wrote:

> On Mon, 29 Nov 2004, Ingo Molnar wrote:
>
> > [...]
> > To give you an extreme example: you cannot run OpenOffice.org with
> > SCHED_FIFO prio 99 and expect it to have any sane deterministic latency
> > bounds. The simpler the app, the easier it is to control latencies.
>
> No, but I want to have my RT-subsystem to be not affected even if the
> users choose to run 1000 soffice.bin at SCHED_NORMAL!! [...]

that is still the case, as long as it doesnt 'interact' with
soffice.bin. Note that the same qualification holds for every hard-RT
OS as well, with varying levels of 'interaction'. Interaction depends on
what kind of synchronization the hard-RT task does, and depends on how
the kernel itself implements various kernel features.

> [...] The problem is that a raw spinlock is used to lock for a
> iteration over a list which can be O(number of waiters * locking
> depth) long. As long as we are in the kernel both is "controlled",
> i.e. one can see the worst-case number in stress test and know it
> can't get worse. *

which list do you mean? Note that the pi_list depends on the number of
_RT-tasks_, not on the number of SCHED_NORMAL tasks. So you can create
an arbitrary number of SCHED_NORMAL tasks, they wont impact the overhead
of mutexes!

i very intentionally made it independent of nr-of-non-RT-tasks.

Ingo

2004-11-29 15:59:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: Priority Inheritance Test (Real-Time Preemption)


* Ingo Molnar <[email protected]> wrote:

> > iteration over a list which can be O(number of waiters * locking
> > depth) long. As long as we are in the kernel both is "controlled",
> > i.e. one can see the worst-case number in stress test and know it
> > can't get worse. *
>
> which list do you mean? Note that the pi_list depends on the number of
> _RT-tasks_, not on the number of SCHED_NORMAL tasks. So you can create
> an arbitrary number of SCHED_NORMAL tasks, they wont impact the
> overhead of mutexes!
>
> i very intentionally made it independent of nr-of-non-RT-tasks.

and i'm regularly testing this property with 'hackbench 50', which
creates over a 1000 wildly scheduling non-RT tasks. Latency is not
affected by such workloads.

Ingo

2004-11-29 16:50:47

by Esben Nielsen

[permalink] [raw]
Subject: Re: Priority Inheritance Test (Real-Time Preemption)

On Mon, 29 Nov 2004, Ingo Molnar wrote:

>
> * Ingo Molnar <[email protected]> wrote:
>
> > > iteration over a list which can be O(number of waiters * locking
> > > depth) long. As long as we are in the kernel both is "controlled",
> > > i.e. one can see the worst-case number in stress test and know it
> > > can't get worse. *
> >
> > which list do you mean? Note that the pi_list depends on the number of
> > _RT-tasks_, not on the number of SCHED_NORMAL tasks. So you can create
> > an arbitrary number of SCHED_NORMAL tasks, they wont impact the
> > overhead of mutexes!
> >
> > i very intentionally made it independent of nr-of-non-RT-tasks.
>
Sorry, I overlooked that detail. Also, the wait_list on each mutex is
"sorted" (the RT threads at the beginning and the normal threads at the
end). Then there should be no problems.

However, a full sorting among the RT tasks, along with only holding the
first waiter for each mutex on the pi_waiters list, would give better
performance since the list traversals could be avoided. The worst case
still has a full list traversal at insertion of a waiter, but that would
only happen if the RT thread sleeps on something other than a mutex while
holding one - which is really bad coding style! Otherwise, only tasks of
increasingly high priority can possibly get the CPU(s) and get access to
locking the mutex. Thus all insertions will be within the first
<number of CPUs> elements of the list.

> and i'm regularly testing this property with 'hackbench 50', which
> creates over a 1000 wildly scheduling non-RT tasks. Latency is not
> affected by such workloads.
>

Probably not. Even while doing that you most likely won't build up wait
lists of more than 10, maybe 100 tasks? Doing full traversals with irqs
disabled probably won't be measurable(!?) compared to the other stuff
responsible for the measured latency.

> Ingo

Esben

2004-11-30 08:50:09

by Ingo Molnar

[permalink] [raw]
Subject: Re: Priority Inheritance Test (Real-Time Preemption)


* Esben Nielsen <[email protected]> wrote:

> > and i'm regularly testing this property with 'hackbench 50', which
> > creates over a 1000 wildly scheduling non-RT tasks. Latency is not
> > affected by such workloads.
> >
>
> Probably not. Even while doing that you most likely wont build up wait
> lists of more than 10, maybe 100 tasks? Doing full traversals with irq
> disabled probably wont be meassureable!(?) compared to much other
> stuff increasing responsible for the meassured latency.

there is no full list traversal of SCHED_NORMAL tasks, ever.

but the best way is to test this yourself, download Florian's rtc_wakeup
from:

http://www.affenbande.org/~tapas/wiki/index.php?rtc_wakeup

and run it with the highest possible resolution, 8192 Hz:

chrt -f 98 -p `pidof 'IRQ 8'`
chrt -f 99 -p `pidof 'IRQ 0'`

./rtc_wakeup -f 8192 -t 100000

in this mode rtc_wakeup will report the worst irq-delivery latency it
measures. It will thus measure the combined effect of any type of
scheduling or irqs-off latency to RT-tasks.

then download hackbench from:

http://developer.osdl.org/craiger/hackbench/

and try e.g.:

./hackbench 50

this will start 2x20x50 == 2000 SCHED_NORMAL threads, all performing a
nice pattern of scheduling simulating a busy chat server workload with
tons of messages going back and forth.

Ingo