I have done 2 things which might be of interest:
I) An rt_mutex unittest suite. It might also be useful against the generic
mutexes.
II) I changed the priority inheritance mechanism in rt.c,
obtaining the following goals:
1) rt_mutex deadlocks don't become raw_spinlock deadlocks. And more
importantly: futex deadlocks don't become raw_spinlock deadlocks.
2) Time-predictable code. No matter how deep you nest your locks
(kernel or futex) the time spent with irqs or preemption off should be
limited.
3) Simpler code. rt.c was kind of messy. Maybe it still is....:-)
I have lost:
1) Some speed in the slow slow path. I _might_ have gained some in the
normal slow path, though, without measuring it.
Idea:
When a task blocks on a lock it adds itself to the wait list and calls
schedule(). When it is unblocked it has the lock. Or rather, due to
grab-locking it has to check again. Therefore the schedule() call is
wrapped in a loop.
Now when a task is PI boosted, it is at the same time checked whether it is
blocked on an rt_mutex. If it is, it is unblocked ( wake_up_process_mutex()
). It will now go around in the loop mentioned above. Within this loop
it will boost the owner of the lock it is blocked on, maybe unblocking the
owner, which in turn can boost and unblock the next in the lock chain...
At all points there is at least one task, boosted to the highest required
priority, that is unblocked and working on boosting the next task in the
lock chain, and there is therefore no priority inversion.
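For illustration, here is a minimal sketch of that loop in C. All names in
it (rt_lock, boost_prio(), add_to_wait_list(), ...) are made up for the
example and are not the actual rt.c identifiers:

struct task;

struct rt_lock {
        struct task *owner;
        /* wait list, wait_lock etc. omitted */
};

struct task {
        int prio;
        struct rt_lock *blocked_on;
};

/* provided elsewhere in this sketch */
void add_to_wait_list(struct rt_lock *lock, struct task *self);
void boost_prio(struct task *owner, int prio);
void schedule(void);

static void down_slowpath(struct rt_lock *lock, struct task *self)
{
        add_to_wait_list(lock, self);
        self->blocked_on = lock;

        for (;;) {
                /* Boost the current owner to at least our priority.  If
                 * the owner is itself blocked on another rt_mutex it is
                 * woken up and runs this same loop, so the boost
                 * propagates one hop per task switch. */
                if (lock->owner)
                        boost_prio(lock->owner, self->prio);

                schedule();             /* sleep until woken */

                /* Grab-locking means we may be woken without actually
                 * holding the lock, so check again. */
                if (lock->owner == self)
                        break;
        }
        self->blocked_on = NULL;
}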
The boosting of a long list of blocked tasks will clearly take longer than
the previous version as there will be task switches. But remember, it is
in the slow slow path! And it only occurs when PI boosting is happening on
_nested_ locks.
What is gained is that the amount of time where irqs and preemption are off
is limited: One task does its work with preemption disabled, wakes up the
next, enables preemption and schedules. The amount of time spent with
preemption disabled has a clear upper limit, untouched by how
complicated and deep the lock structure is.
So how many locks do we have to worry about? Two.
One for locking the lock. One for locking various PI related data on the
task structure, such as the pi_waiters list, blocked_on, pending_owner - and
also prio.
Therefore only lock->wait_lock and sometask->pi_lock will be locked at the
same time. And in that order. There are therefore no spinlock deadlocks.
And the code is simpler.
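As a schematic example of the ordering rule (the types and names are
illustrative, not the exact rt.c ones; spinlock_t, spin_lock() and
spin_unlock() stand for the kernel's raw spinlock primitives here):

struct task {
        spinlock_t pi_lock;     /* protects pi_waiters, blocked_on,
                                   pending_owner and prio */
        int prio;
};

struct rt_lock {
        spinlock_t wait_lock;   /* protects the lock's wait list */
        struct task *owner;
};

static void block_and_boost(struct rt_lock *lock, struct task *self)
{
        spin_lock(&lock->wait_lock);            /* always taken first */
        spin_lock(&lock->owner->pi_lock);       /* then exactly one pi_lock */

        /* enqueue self, update owner->pi_waiters and owner->prio here */

        spin_unlock(&lock->owner->pi_lock);
        spin_unlock(&lock->wait_lock);
}

Since every path takes the two locks in that order and never holds two
pi_locks at once, no ABBA deadlock can occur between them.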
Because of the simpler code I was able to implement an optimization:
Only the first waiter on each lock is a member of owner->pi_waiters.
Therefore no list traversals are needed on either owner->pi_waiters or
lock->wait_list. Every operation only has to remove and/or add one element
to these lists.
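A sketch of that invariant when a new waiter is queued (again with made-up
helper names; the actual list handling in the patch may differ):

struct waiter;
struct rt_lock;
struct task;

struct waiter *top_waiter(struct rt_lock *lock);  /* highest prio, or NULL */
void enqueue_on_wait_list(struct rt_lock *lock, struct waiter *w);
void enqueue_on_pi_waiters(struct task *owner, struct waiter *w);
void dequeue_from_pi_waiters(struct task *owner, struct waiter *w);

static void queue_waiter(struct rt_lock *lock, struct task *owner,
                         struct waiter *w)
{
        struct waiter *old_top = top_waiter(lock);

        enqueue_on_wait_list(lock, w);

        if (top_waiter(lock) == w) {
                /* w displaced the old top waiter, so swap the single
                 * entry that represents this lock on owner->pi_waiters */
                if (old_top)
                        dequeue_from_pi_waiters(owner, old_top);
                enqueue_on_pi_waiters(owner, w);
        }
}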
As for robust futexes: They ought to work out of the box now, blocking in
deadlock situations. I have added an entry to /proc/<pid>/status
"BlckOn: <pid>". This can be used to do "post mortem" deadlock detection
from userspace.
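For example, a small user-space tool could follow that chain after the fact.
This assumes a kernel with the patch applied; the field name is taken from
the description above:

#include <stdio.h>
#include <stdlib.h>

/* Return the pid this task is blocked on, or 0 if none/unknown. */
static int blocked_on(int pid)
{
        char path[64], line[256];
        int target = 0;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/status", pid);
        f = fopen(path, "r");
        if (!f)
                return 0;
        while (fgets(line, sizeof(line), f))
                if (sscanf(line, "BlckOn: %d", &target) == 1)
                        break;
        fclose(f);
        return target;
}

int main(int argc, char **argv)
{
        int pid = argc > 1 ? atoi(argv[1]) : 0;
        int hops;

        /* Follow the chain; a chain longer than, say, 64 hops almost
         * certainly means we are going around a deadlock cycle. */
        for (hops = 0; pid > 0 && hops < 64; hops++) {
                printf("%d\n", pid);
                pid = blocked_on(pid);
        }
        return 0;
}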
What am I missing:
Testing on SMP. I have no SMP machine. The unittest can mimic SMP somewhat,
but no unittest can catch _all_ errors.
Testing with futexes.
ALL_PI_TASKS is always switched on now. This is for making the code
simpler.
My machine fails to run with CONFIG_DEBUG_DEADLOCKS and CONFIG_DEBUG_PREEMPT
on at the same time. I need a serial cable and a console over serial to
debug it. My screen is too small to see enough there.
Figure out more tests to run in my unittester.
So why am I not doing those things before sending the patch? 1) Well, my
girlfriend comes back tomorrow with our child, and I know I will have no time
to code anything substantial then. 2) I want to make sure Ingo sees this
approach before he starts merging preempt_rt and rt_mutex with his now
mainstream mutex.
Esben
On Wed, 11 Jan 2006, Esben Nielsen wrote:
> I have done 2 things which might be of interest:
>
> I) An rt_mutex unittest suite. It might also be useful against the generic
> mutexes.
>
> II) I changed the priority inheritance mechanism in rt.c,
> obtaining the following goals:
>
Interesting. I'll take a closer look at this after I finish dealing with
some deadlocks that I found in posix-timers.
[snip]
>
> What am I missing:
> Testing on SMP. I have no SMP machine. The unittest can mimic SMP somewhat,
> but no unittest can catch _all_ errors.
I have an SMP machine that just freed up. It would be interesting to see
how this works on an 8x machine. I'll test it first on my 2x, and when
Ingo gets some time he can test it on his big boxes.
>
> Testing with futexes.
>
> ALL_PI_TASKS is always switched on now. This is for making the code
> simpler.
>
> My machine fails to run with CONFIG_DEBUG_DEADLOCKS and CONFIG_DEBUG_PREEMPT
> on at the same time. I need a serial cable and a console over serial to
> debug it. My screen is too small to see enough there.
>
> Figure out more tests to run in my unittester.
>
> So why am I not doing those things before sending the patch? 1) Well, my
> girlfriend comes back tomorrow with our child, and I know I will have no time
> to code anything substantial then. 2) I want to make sure Ingo sees this
> approach before he starts merging preempt_rt and rt_mutex with his now
> mainstream mutex.
If I get time, I might be able to finish this up, if the changes look
decent, and don't cause too much overhead.
-- Steve
On Wed, 11 Jan 2006, Steven Rostedt wrote:
>
> On Wed, 11 Jan 2006, Esben Nielsen wrote:
> [snip]
>
> If I get time, I might be able to finish this up, if the changes look
> decent, and don't cause too much overhead.
This was the answer I was hoping for! I'll try to get time to test and
improve it myself, of course.
As for optimization: I take and release the current->pi_lock and
owner->pi_lock a lot, because it isn't allowed to hold both locks at once.
Some code restructuring could probably improve it such that it first finishes
what it has to finish under current->pi_lock and then does what it has to do
under owner->pi_lock - or the other way around.
In a few places pi_lock is taken just to be sure. It might be removable.
But first we have to establish the principle. Then optimization can begin.
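Schematically, the current pattern is something like the following (the
field names are as above, the rest is just illustration, not the actual
patch code):

        spin_lock(&current->pi_lock);
        /* record that current is blocked on this lock, read current->prio */
        spin_unlock(&current->pi_lock);

        spin_lock(&owner->pi_lock);
        /* update owner->pi_waiters and possibly raise owner->prio */
        spin_unlock(&owner->pi_lock);

The restructuring mentioned above would mainly be about batching as much
work as possible into each of those two critical sections.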
Esben
>
> -- Steve
>
On Wed, Jan 11, 2006 at 06:25:36PM +0100, Esben Nielsen wrote:
> I have done 2 things which might be of interest:
>
> II) I changed the priority inheritance mechanism in rt.c,
> obtaining the following goals:
> 3) Simpler code. rt.c was kind of messy. Maybe it still is....:-)
Awesome. The code was done in what seems like a hurry and mixes up a
bunch of things that should be separated out into individual sub-sections.
The allocation of the waiter object on the thread's stack should undergo
some consideration of whether this should be moved into a more permanent
store. I haven't looked at an implementation of turnstiles recently, but
I suspect that this is what it actually is and it would eliminate the
moving of waiters to the thread that is actively running with the lock
path terminal mutex. It works, but it's sloppy stuff.
[loop trickery, priority lending operations handed off to mutex owner]
> What is gained is that the amount of time where irqs and preemption are off
> is limited: One task does its work with preemption disabled, wakes up the
> next, enables preemption and schedules. The amount of time spent with
> preemption disabled has a clear upper limit, untouched by how
> complicated and deep the lock structure is.
task_blocks_on_lock() is another place that one might consider separating
out some bundled functionality into different places in the down()
implementation. I'll look at the preemption stuff next.
Just some ideas. Looks like a decent start.
bill
On Thu, 12 Jan 2006, Bill Huey wrote:
> On Wed, Jan 11, 2006 at 06:25:36PM +0100, Esben Nielsen wrote:
> > I have done 2 things which might be of interest:
> >
> > II) I changed the priority inheritance mechanism in rt.c,
> > obtaining the following goals:
> > 3) Simpler code. rt.c was kind of messy. Maybe it still is....:-)
>
> Awesome. The code was done in what seems like a hurry and mixes up a
> bunch of things that should be separated out into individual sub-sections.
*nod*
I worked on the tester before Christmas and only had a few evenings for
myself to look at the kernel after Christmas. With a full-time job and a
family I don't get many spare hours with no disturbance to code (close to
none, really), so I had to ship it while my girlfriend and child were away
for a few days.
>
> The allocation of the waiter object on the thread's stack should undergo
> some consideration of whether this should be moved into a more permanent
> store.
Hmm, why?
When I first saw it in the Linux kernel I thought: Damn, this is elegant.
You only need a waiter when you block, and while you block your stack is
untouched. When you don't block you don't need the waiter, so why have it
allocated somewhere else, say in task_t?
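Roughly, the pattern being defended here looks like this (schematic only;
the struct, current_task() and block_on_lock() are stand-ins, not the
literal rt.c code):

struct rt_lock;
struct task;

struct rt_waiter {
        struct task *task;
        /* list entries etc. omitted */
};

struct task *current_task(void);
void block_on_lock(struct rt_lock *lock, struct rt_waiter *w);

static void down_slow(struct rt_lock *lock)
{
        struct rt_waiter waiter;        /* lives on this task's stack */

        waiter.task = current_task();

        /* enqueue &waiter and sleep until we own the lock; the stack
         * frame - and with it the waiter - cannot go away while we are
         * blocked here */
        block_on_lock(lock, &waiter);

        /* we own the lock; the waiter has been dequeued and simply
         * goes out of scope */
}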
> I haven't looked at an implementation of turnstiles recently,
turnstiles? What is that?
> but
> I suspect that this is what it actually is and it would eliminate the
> moving of waiters to the thread that is actively running with the lock
> path terminal mutex. It works, but it's sloppy stuff.
>
> [loop trickery, priority lending operations handed off to mutex owner]
>
> > What is gained is that the amount of time where irqs and preemption are off
> > is limited: One task does its work with preemption disabled, wakes up the
> > next, enables preemption and schedules. The amount of time spent with
> > preemption disabled has a clear upper limit, untouched by how
> > complicated and deep the lock structure is.
>
> task_blocks_on_lock() is another place that one might consider separating
> out some bundled functionality into different places in the down()
> implementation.
What is done now with my patch is "minimal", but you have to add yourself
to the wait list. You also have to boost the owner of the lock.
You might be able to split it up by releasing and reacquiring all the
spinlocks; but I am pretty sure this is not the place in the
whole system giving you the longest preemption-off section, so it doesn't
make much sense to improve it.
> I'll look at the preemption stuff next.
>
> Just some ideas. Looks like a decent start.
>
Thanks for the positive response!
Esben
> bill
>
On Thu, Jan 12, 2006 at 01:54:23PM +0100, Esben Nielsen wrote:
> turnstiles? What is that?
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/subr_turnstile.c
Please, read. Now tell me whether or not that looks familiar? :)
Moving closer to an implementation is arguable, but it is something that
should be considered somewhat since folks in both the Solaris (and
FreeBSD) communities have given a lot more consideration to these issues.
The stack allocated objects are fine for now. Priority inheritance
chains should never get long with a fine grained kernel, so the use
of a stack allocated object and migrating pi-ed waiters should not
be a major real world issue in Linux yet.
Folks should also consider using an adaptive spin in the __grab_lock() (sp?)
related loops as a possible way of optimizing away the immediate blocks.
FreeBSD actually checks the owner of a lock across another processor
to see if it's actively running, "current", and will spin or block if
it's running or not, respectively. It's pretty trivial code, so it's
not a big issue to implement. This is ignoring the CPU local storage
issues.
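A rough sketch of that adaptive-spin idea (the helper names are made up; a
real implementation needs the per-CPU "is the owner running" check mentioned
above):

struct task;

struct rt_lock {
        struct task *owner;
};

int task_running_on_other_cpu(struct task *t);  /* hypothetical */
int try_grab_lock(struct rt_lock *lock, struct task *self);
void cpu_relax(void);           /* stands in for the kernel's busy-wait hint */

/* Spin as long as the lock owner is executing on another CPU, since it is
 * then likely to release the lock soon; otherwise give up and block. */
static int adaptive_spin(struct rt_lock *lock, struct task *self)
{
        struct task *owner;

        while ((owner = lock->owner) != NULL && owner != self) {
                if (!task_running_on_other_cpu(owner))
                        return 0;       /* owner not running: go block */
                cpu_relax();            /* polite busy-wait */
                if (try_grab_lock(lock, self))
                        return 1;       /* got it without sleeping */
        }
        return try_grab_lock(lock, self);
}

The caller would only enter the normal blocking slow path when this
returns 0.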
bill
On Fri, 13 Jan 2006, Bill Huey wrote:
> On Thu, Jan 12, 2006 at 01:54:23PM +0100, Esben Nielsen wrote:
> > turnstiles? What is that?
>
> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/subr_turnstile.c
>
> Please, read. Now tell me whether or not that looks familiar? :)
Yes, it reminds me of Ingo's first approach to pi locking:
Everything is done under a global spin lock. In Ingo's approach it was the
pi_lock. In turnstiles it is sched_lock, which (without looking at other
code in FreeBSD) locks the whole scheduler.
Although making the code a lot simpler, scalability is of course the main
issue here. But apparently FreeBSD does have a global lock protecting the
scheduler anyway.
Otherwise it looks a lot like the rt_mutex.
Esben
>
> Moving closer to an implementation is arguable, but it is something that
> should be considered somewhat since folks in both the Solaris (and
> FreeBSD) communities have given a lot more consideration to these issues.
>
> The stack allocated objects are fine for now. Priority inheritance
> chains should never get long with a fine grained kernel, so the use
> of a stack allocated object and migrating pi-ed waiters should not
> be a major real world issue in Linux yet.
>
> Folks should also consider using an adaptive spin in the __grab_lock() (sp?)
> related loops as a possible way of optimizing away the immediate blocks.
> FreeBSD actually checks the owner of a lock across another processor
> to see if it's actively running, "current", and will spin or block if
> it's running or not, respectively. It's pretty trivial code, so it's
> not a big issue to implement. This is ignoring the CPU local storage
> issues.
>
> bill
>
On Fri, Jan 13, 2006 at 09:47:39AM +0100, Esben Nielsen wrote:
> On Fri, 13 Jan 2006, Bill Huey wrote:
>
> > On Thu, Jan 12, 2006 at 01:54:23PM +0100, Esben Nielsen wrote:
> > > turnstiles? What is that?
> >
> > http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/subr_turnstile.c
> >
> > Please, read. Now tell me whether or not that looks familiar? :)
>
> Yes, it reminds me of Ingo's first approach to pi locking:
> Everything is done under a global spin lock. In Ingo's approach it was the
> pi_lock. In turnstiles it is sched_lock, which (without looking at other
> code in FreeBSD) locks the whole scheduler.
>
> Although making the code a lot simpler, scalability is of course the main
> issue here. But apparently FreeBSD does have a global lock protecting the
> scheduler anyway.
FreeBSD hasn't really addressed their scalability issues yet with their
locking. The valuable thing about that file is how they manipulate
threading priorities under priority inheritance. Some ideas might be
stealable from it. That's all.
bill
On Wed, Jan 11, 2006 at 06:25:36PM +0100, Esben Nielsen wrote:
> So how many locks do we have to worry about? Two.
> One for locking the lock. One for locking various PI related data on the
> task structure, such as the pi_waiters list, blocked_on, pending_owner - and
> also prio.
> Therefore only lock->wait_lock and sometask->pi_lock will be locked at the
> same time. And in that order. There are therefore no spinlock deadlocks.
> And the code is simpler.
Ok, got a question. How do you deal with the false reporting and handling of
a lock circularity window involving the handoff of task A's BKL to another
task B? Task A is blocked trying to get a mutex owned by task B, and task A
is blocking B since it owns the BKL which task B is contending on. It's not a
deadlock since it's a hand-off situation.
I didn't see any handling of this case in the code and I was wondering
if the traversal logic you wrote avoids this case as an inherent property
and I missed that stuff?
bill
On Sat, 14 Jan 2006, Bill Huey wrote:
> On Wed, Jan 11, 2006 at 06:25:36PM +0100, Esben Nielsen wrote:
> > So how many locks do we have to worry about? Two.
> > One for locking the lock. One for locking various PI related data on the
> > task structure, such as the pi_waiters list, blocked_on, pending_owner - and
> > also prio.
> > Therefore only lock->wait_lock and sometask->pi_lock will be locked at the
> > same time. And in that order. There are therefore no spinlock deadlocks.
> > And the code is simpler.
>
> Ok, got a question. How do you deal with the false reporting and handling of
> a lock circularity window involving the handoff of task A's BKL to another
> task B? Task A is blocked trying to get a mutex owned by task B, and task A
> is blocking B since it owns the BKL which task B is contending on. It's not a
> deadlock since it's a hand-off situation.
>
I am not precisely sure what you mean by "false reporting".
Handing off BKL is done in schedule() in sched.c. I.e. if B owns a normal
mutex, A will give BKL to B when A calls schedule() in the down-operation
of that mutex.
> I didn't see any handling of this case in the code and I was wondering
> if the traversal logic you wrote avoids this case as an inherent property
> and I missed that stuff?
The stuff is in kernel/sched.c and lib/kernel_lock.c.
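In outline it works something like this (heavily simplified; the helper
names are placeholders for the lock_depth bookkeeping that the real code in
kernel/sched.c and lib/kernel_lock.c does):

struct task {
        int lock_depth;         /* >= 0 while the BKL is held */
};

void drop_bkl(struct task *t);          /* placeholders for the real  */
void take_bkl(struct task *t);          /* BKL release/reacquire      */
void context_switch_to_next(void);

/* Sketch of the BKL hand-off around a context switch: a task that blocks
 * while holding the BKL drops it across schedule() and takes it again
 * when it runs next. */
static void schedule_sketch(struct task *prev)
{
        int held_bkl = (prev->lock_depth >= 0);

        if (held_bkl)
                drop_bkl(prev);         /* release BKL while we sleep */

        context_switch_to_next();       /* the actual task switch */

        if (held_bkl)
                take_bkl(prev);         /* reacquire before returning; the
                                           task may block here until the
                                           BKL is free again */
}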
>
> bill
>
On Mon, Jan 16, 2006 at 09:35:42AM +0100, Esben Nielsen wrote:
> On Sat, 14 Jan 2006, Bill Huey wrote:
> I am not precisely sure what you mean by "false reporting".
>
> Handing off BKL is done in schedule() in sched.c. I.e. if B owns a normal
> mutex, A will give BKL to B when A calls schedule() in the down-operation
> of that mutex.
Task A holding BKL would have to drop BKL when it blocks against a mutex held
by task B in my example and therefore must hit schedule() before any pi boost
operation happens. I'll take another look at your code just to see if this is
clear.
bill
On Mon, Jan 16, 2006 at 02:22:55AM -0800, Bill Huey wrote:
> On Mon, Jan 16, 2006 at 09:35:42AM +0100, Esben Nielsen wrote:
> > On Sat, 14 Jan 2006, Bill Huey wrote:
> > I am not precisely sure what you mean by "false reporting".
> >
> > Handing off BKL is done in schedule() in sched.c. I.e. if B owns a normal
> > mutex, A will give BKL to B when A calls schedule() in the down-operation
> > of that mutex.
>
> Task A holding BKL would have to drop BKL when it blocks against a mutex held
> by task B in my example and therefore must hit schedule() before any pi boost
> operation happens. I'll take another look at your code just to see if this is
> clear.
Esben,
Ok, I see what you did. Looking through the raw patch instead of the applied
sources made it not so obvious to me. Looks like the logic to deal with that
case is there, good. I like the patch, but it seems to context switch twice
as much, which might be a killer.
bill
On Mon, 16 Jan 2006, Bill Huey wrote:
> On Mon, Jan 16, 2006 at 02:22:55AM -0800, Bill Huey wrote:
> > On Mon, Jan 16, 2006 at 09:35:42AM +0100, Esben Nielsen wrote:
> > > On Sat, 14 Jan 2006, Bill Huey wrote:
> > > I am not precisely sure what you mean by "false reporting".
> > >
> > > Handing off BKL is done in schedule() in sched.c. I.e. if B owns a normal
> > > mutex, A will give BKL to B when A calls schedule() in the down-operation
> > > of that mutex.
> >
> > Task A holding BKL would have to drop BKL when it blocks against a mutex held
> > by task B in my example and therefore must hit schedule() before any pi boost
> > operation happens. I'll take another look at your code just to see if this is
> > clear.
>
> Esben,
>
> Ok, I see what you did. Looking through the raw patch instead of the applied
> sources made it not so obvious to me.
Only small patches can be read directly....
> Looks like the logic to deal with that
> case is there, good. I like the patch,
good :-)
> but it seems to context switch twice
> as much, which might be a killer.
Twice? It depends on the lock nesting depth. The number of task
switches is the lock nesting depth in any blocking down() operation.
In all other implementations the number of task switches is just 1.
I.e. in the usual case (task A blocks on lock 1 owned by B, which is
unblocked), where the lock nesting depth is 1, there is no penalty with my
approach. The penalty comes in the case (task A blocks on lock 1 owned by B,
which is blocked on lock 2 owned by C). There B is scheduled as an agent to
boost C, such that A never touches lock 2 or task C. Precisely this makes the
spinlocks a lot easier to handle.
On the other hand the maximum time spent with preemption off is
O(1) in my implementation whereas it is at least O(lock nesting depth) in
other implementations. I think the current implementation in rt_preempt
is max(O(total number of PI waiters),O(lock nesting depth)).
Esben
>
> bill
>
* Esben Nielsen <[email protected]> wrote:
> Hi,
> I have updated it:
>
> 1) Now ALL_TASKS_PI is 0 again. Only RT tasks will be added to
> task->pi_waiters. Therefore we avoid taking the owner->pi_lock when the
> waiter isn't RT.
> 2) Merged into 2.6.15-rt6.
> 3) Updated the tester to test the hand over of BKL, which was mentioned
> as a potential issue by Bill Huey. Also added/adjusted the tests for the
> ALL_TASKS_PI==0 setup.
> (I really like unittesting: If someone points out an issue or finds a bug,
> make a test first demonstrating the problem. Then fixing the code is a lot
> easier - especially in this case where I run rt.c in userspace where I can
> easily use gdb.)
looks really nice to me. In particular i like the cleanup effect:
5 files changed, 490 insertions(+), 744 deletions(-)
right now Thomas is merging hrtimer-latest to -rt, which temporarily
blocks merging of intrusive patches - but i'll try to merge your patch
after Thomas is done. (hopefully later today)
Ingo
On Wed, 2006-01-18 at 11:31 +0100, Esben Nielsen wrote:
> Hi,
> I have updated it:
>
> 1) Now ALL_TASKS_PI is 0 again. Only RT tasks will be added to
> task->pi_waiters. Therefore we avoid taking the owner->pi_lock when the
> waiter isn't RT.
> 2) Merged into 2.6.15-rt6.
> 3) Updated the tester to test the hand over of BKL, which was mentioned
> as a potential issue by Bill Huey. Also added/adjusted the tests for the
> ALL_TASKS_PI==0 setup.
> (I really like unittesting: If someone points out an issue or finds a bug,
> make a test first demonstrating the problem. Then fixing the code is a lot
> easier - especially in this case where I run rt.c in userspace where I can
> easily use gdb.)
Hmm, maybe I'll actually get a chance to finally play with this. I
discovered issues with the hrtimers earlier, and was too busy helping
Thomas with them. That had to take precedence.
;)
-- Steve
On Wed, 18 Jan 2006, Steven Rostedt wrote:
> On Wed, 2006-01-18 at 11:31 +0100, Esben Nielsen wrote:
> > Hi,
> > I have updated it:
> >
> > 1) Now ALL_TASKS_PI is 0 again. Only RT tasks will be added to
> > task->pi_waiters. Therefore we avoid taking the owner->pi_lock when the
> > waiter isn't RT.
> > 2) Merged into 2.6.15-rt6.
> > 3) Updated the tester to test the hand over of BKL, which was mentioned
> > as a potential issue by Bill Huey. Also added/adjusted the tests for the
> > ALL_TASKS_PI==0 setup.
> > (I really like unittesting: If someone points out an issue or finds a bug,
> > make a test first demonstrating the problem. Then fixing the code is a lot
> > easier - especially in this case where I run rt.c in userspace where I can
> > easily use gdb.)
>
> Hmm, maybe I'll actually get a chance to finally play with this. I've
> discovered issues with the hrtimers earlier, and was too busy helping
> Thomas with them. That had to take precedence.
>
I just keep making small improvements and testing as I get time (which isn't
much).
Esben
> ;)
>
> -- Steve
>
>
On Jan 22, 2006, at 4:20 PM, Esben Nielsen wrote:
> Ok this time around I have a patch against 2.6.15-rt12.
>
> Updated since the last time:
> 1) Fixed a bug wrt. BKL. The problem involved the grab-lock mechanism.
> 2) Updated the tester to detect the bug. See
> TestRTMutex/reaquire_bkl_while_waiting.tst :-)
>
> I still need testing on SMP and with robust futexes.
> I test with the old priority inheritance test of mine. Do you guys test
> with anything newer?
I'll try your patch with rt12 and robustness. I have an SMP test I'm
working on.
David
>
> When looking at BKL I found that the business about reacquiring BKL is
> really bad for real-time. The problem is that, without knowing it, you can
> suddenly block on the BKL!
>
> Here is the problem:
>
> Task B (non-RT) takes BKL. It then takes mutex 1. Then B
> tries to lock mutex 2, which is owned by task C. B blocks and releases the
> BKL. Our RT task A comes along and tries to get 1. It boosts task B,
> which boosts task C, which releases mutex 2. Now B can continue? No, it has
> to reacquire BKL! The net effect is that our RT task A waits for BKL to
> be released without ever calling into a module using BKL. But just because
> somebody in some non-RT code called into a module otherwise considered
> safe for RT usage with BKL held, A must wait on BKL!
>
> Esben
>
> On Wed, 18 Jan 2006, Esben Nielsen wrote:
>
>> [snip]
> <pi_lock.patch-rt12><TestRTMutex.tgz>
On Mon, Jan 23, 2006 at 01:20:12AM +0100, Esben Nielsen wrote:
> Here is the problem:
>
> Task B (non-RT) takes BKL. It then takes mutex 1. Then B
> tries to lock mutex 2, which is owned by task C. B blocks and releases the
> BKL. Our RT task A comes along and tries to get 1. It boosts task B,
> which boosts task C, which releases mutex 2. Now B can continue? No, it has
> to reacquire BKL! The net effect is that our RT task A waits for BKL to
> be released without ever calling into a module using BKL. But just because
> somebody in some non-RT code called into a module otherwise considered
> safe for RT usage with BKL held, A must wait on BKL!
True, that's major suckage, but I can't name a single place in the kernel that
does that. Remember, BKL is now preemptible, so the place where it might sleep
similarly to the above would be in spinlock_t definitions. But BKL is held
across schedule()s so that the BKL semantics are preserved. Contending under a
priority inheritance operation isn't too much of a problem anyway, since the
use of it already makes that path indeterminate. Even under contention, a
higher priority task above A can still run since the kernel is preemptive now
even when manipulating BKL.
bill
On Sun, 22 Jan 2006, Bill Huey wrote:
> On Mon, Jan 23, 2006 at 01:20:12AM +0100, Esben Nielsen wrote:
> > Here is the problem:
> >
> > Task B (non-RT) takes BKL. It then takes mutex 1. Then B
> > tries to lock mutex 2, which is owned by task C. B blocks and releases the
> > BKL. Our RT task A comes along and tries to get 1. It boosts task B,
> > which boosts task C, which releases mutex 2. Now B can continue? No, it has
> > to reacquire BKL! The net effect is that our RT task A waits for BKL to
> > be released without ever calling into a module using BKL. But just because
> > somebody in some non-RT code called into a module otherwise considered
> > safe for RT usage with BKL held, A must wait on BKL!
>
> True, that's major suckage, but I can't name a single place in the kernel that
> does that.
Sounds good. But someone might put it in...
> Remember, BKL is now preemptible, so the place where it might sleep
> similarly
> to the above would be in spinlock_t definitions.
I can't see that from how it works. It is explicitly made such that you
are allowed to use semaphores with BKL held - and such that the BKL is
released if you do.
> But BKL is held across schedule()s
> so that the BKL semantics are preserved.
Only for spinlock_t (now rt_mutex) operations, not for semaphore/mutex
operations.
> Contending under a priority inheritance
> operation isn't too much of a problem anyway, since the use of it already
> makes that
> path indeterminate.
The problem is that you might hit BKL because of what some other low
priority task does, thus making your RT code non-deterministic.
> Even under contention, a higher priority task above A can still
> run since the kernel is preemptive now even when manipulating BKL.
No, A waits for BKL because it waits for B which waits for the BKL.
Esben
>
> bill
>
On Mon, 2006-01-23 at 10:33 +0100, Esben Nielsen wrote:
> On Sun, 22 Jan 2006, Bill Huey wrote:
>
> > On Mon, Jan 23, 2006 at 01:20:12AM +0100, Esben Nielsen wrote:
> > > Here is the problem:
> > >
> > > Task B (non-RT) takes BKL. It then takes mutex 1. Then B
> > > tries to lock mutex 2, which is owned by task C. B blocks and releases the
> > > BKL. Our RT task A comes along and tries to get 1. It boosts task B,
> > > which boosts task C, which releases mutex 2. Now B can continue? No, it has
> > > to reacquire BKL! The net effect is that our RT task A waits for BKL to
> > > be released without ever calling into a module using BKL. But just because
> > > somebody in some non-RT code called into a module otherwise considered
> > > safe for RT usage with BKL held, A must wait on BKL!
> >
> > True, that's major suckage, but I can't name a single place in the kernel that
> > does that.
>
> Sounds good. But someone might put it in...
Hmm, I wouldn't be surprised if this is done somewhere in the VFS layer.
>
> > Remember, BKL is now preemptible, so the place where it might sleep
> > similarly
> > to the above would be in spinlock_t definitions.
> I can't see that from how it works. It is explicitly made such that you
> are allowed to use semaphores with BKL held - and such that the BKL is
> released if you do.
Correct. I hope you didn't remove my comment in the rt.c about BKL
being a PITA :) (Ingo was nice enough to change my original patch to use
the acronym.)
>
> > But BKL is held across schedule()s
> > so that the BKL semantics are preserved.
> Only for spinlock_t (now rt_mutex) operations, not for semaphore/mutex
> operations.
> > Contending under a priority inheritance
> > operation isn't too much of a problem anyway, since the use of it already
> > makes that
> > path indeterminate.
> The problem is that you might hit BKL because of what some other low
> priority task does, thus making your RT code non-deterministic.
I disagree here. The fact that you grab a semaphore that may also be
grabbed by a path while holding the BKL means that grabbing that
semaphore may be blocked on the BKL too. So the length of grabbing a
semaphore that can be grabbed while also holding the BKL is the length
of the critical section of the semaphore + the length of the longest BKL
hold.
Just don't let your RT tasks grab semaphores that can be grabbed while
also holding the BKL :)
But the main point is that it is still deterministic. Just that it may
be longer than one thinks.
>
> > Even under contention, a higher priority task above A can still
> > run since the kernel is preemptive now even when manipulating BKL.
>
> No, A waits for BKL because it waits for B which waits for the BKL.
Right.
-- Steve
PS. I might actually get around to testing your patch today :) That is,
if -rt12 passes all my tests.
On Mon, 23 Jan 2006, Steven Rostedt wrote:
> On Mon, 2006-01-23 at 10:33 +0100, Esben Nielsen wrote:
> > On Sun, 22 Jan 2006, Bill Huey wrote:
> >
> > > On Mon, Jan 23, 2006 at 01:20:12AM +0100, Esben Nielsen wrote:
> > > > Here is the problem:
> > > >
> > > > Task B (non-RT) takes BKL. It then takes mutex 1. Then B
> > > > tries to lock mutex 2, which is owned by task C. B blocks and releases the
> > > > BKL. Our RT task A comes along and tries to get 1. It boosts task B,
> > > > which boosts task C, which releases mutex 2. Now B can continue? No, it has
> > > > to reacquire BKL! The net effect is that our RT task A waits for BKL to
> > > > be released without ever calling into a module using BKL. But just because
> > > > somebody in some non-RT code called into a module otherwise considered
> > > > safe for RT usage with BKL held, A must wait on BKL!
> > >
> > > True, that's major suckage, but I can't name a single place in the kernel that
> > > does that.
> >
> > Sounds good. But someone might put it in...
>
> Hmm, I wouldn't be surprised if this is done somewhere in the VFS layer.
>
> >
> > > Remember, BKL is now preemptible, so the place where it might sleep
> > > similarly
> > > to the above would be in spinlock_t definitions.
> > I can't see that from how it works. It is explicitly made such that you
> > are allowed to use semaphores with BKL held - and such that the BKL is
> > released if you do.
>
> Correct. I hope you didn't remove my comment in the rt.c about BKL
> being a PITA :) (Ingo was nice enough to change my original patch to use
> the acronym.)
I left it there it seems :-)
>
> >
> > > But BKL is held across schedule()s
> > > so that the BKL semantics are preserved.
> > Only for spinlock_t (now rt_mutex) operations, not for semaphore/mutex
> > operations.
> > > Contending under a priority inheritance
> > > operation isn't too much of a problem anyway, since the use of it already
> > > makes that
> > > path indeterminate.
> > The problem is that you might hit BKL because of what some other low
> > priority task does, thus making your RT code non-deterministic.
>
> I disagree here. The fact that you grab a semaphore that may also be
> grabbed by a path while holding the BKL means that grabbing that
> semaphore may be blocked on the BKL too. So the length of grabbing a
> semaphore that can be grabbed while also holding the BKL is the length
> of the critical section of the semaphore + the length of the longest BKL
> hold.
Exactly. What is "the length of the longest BKL hold" ? (see below).
>
> Just don't let your RT tasks grab semaphores that can be grabbed while
> also holding the BKL :)
How are you to _know_ that? Even if your code, any code you call, or any
code called from code you call hasn't changed, this situation can still
arise!
>
> But the main point is that it is still deterministic. Just that it may
> be longer than one thinks.
>
I don't consider "the length of the longest BKL hold" deterministic.
People might traverse all kinds of weird lists and datastructures while
holding BKL.
> >
> > > Even under contention, a higher priority task above A can still
> > > run since the kernel is preemptive now even when manipulating BKL.
> >
> > No, A waits for BKL because it waits for B which waits for the BKL.
>
> Right.
>
> -- Steve
>
> PS. I might actually get around to testing your patch today :) That is,
> if -rt12 passes all my tests.
>
Sounds nice :-) I cross my fingers...
Esben
>
I have patched against 2.6.15-rt15 and I have found a hyperthreaded P4
machine. It works fine on that one.
Esben
On Mon, 23 Jan 2006, Esben Nielsen wrote:
> [snip]