2006-03-28 07:37:10

by Pierre Peiffer

Subject: [PATCH] 2.6.16 - futex: small optimization (?)

Hi,


I found an (optimization?) problem in the futexes during a futex_wake,
when the waiter has a higher priority than the waker.

In fact, in this case, the waiter is immediately scheduled and tries to
take a lock still held by the waker. This is especially expensive on UP,
or if both threads are on the same CPU, due to the two task switches.
This produces extra latency during a wakeup in pthread_cond_broadcast
or pthread_cond_signal, for example.

See below my detailed explanation.

A fix is given by the patch at the end of this mail. It works for me on
kernel 2.6.16, but the kernel hangs if I use it with the -rt patch from
Ingo Molnar. So I have doubts about the correctness of the patch.

The idea is simple: in unqueue_me, I first check
"if (list_empty(&q->list))"

If yes => we were woken (the list is reinitialized in wake_futex).
Then it immediately returns and lets the waker drop the key references
(instead of the waiter).


====================================================================
Here is the detailed explanation:

Imagine two threads, on UP or on the same CPU: a futex-waker (thread A)
and a futex-waiter (thread B); thread B having a higher priority,
blocked and sleeping in futex_wait. Here is the scenario:



Thread A                                  Thread B
(waker)                                   (waiter with higher priority)

                                          /* sleeping in futex_wait */
/* wake the futex */
/* (from futex_wake or futex_requeue) */
\_ wake_futex
   \_ list_del_init(&q->list)
   \_ wake_up_all (thread B)

      => Thread B, due to its higher priority, is immediately
         woken and scheduled
      => task-switch to thread B

/* sleeps */                              /* wakes up */

                                          in futex_wait:
                                          \_ ...
                                          \_ unqueue_me
                                             \_ lock_ptr = q->lock_ptr;
                                             \_ if (lock_ptr != 0) {
                                                /* TRUE */
                                                \_ spin_lock(lock_ptr);
                                                /* the lock is still held
                                                   by the waker, thread A,
                                                   in either futex_wake or
                                                   futex_requeue */

      => task-switch to the lock owner, thread A

/* wakes up */                            /* sleeps */

\_ q->lock_ptr = NULL;
/* back in futex_wake or futex_requeue */
\_ ...
\_ spin_unlock(&bh->lock);
/* this is q->lock_ptr in thread B */
/* => the waiters are woken */

      => task-switch to the lock waiter, thread B

/* sleeps */                              /* wakes up */

                                          \_ if (lock_ptr != q->lock_ptr) {
                                             /* unfortunately, this is true:
                                                q->lock_ptr is now NULL */
                                             \_ spin_unlock(lock_ptr);
                                             \_ goto retry;
                                          \_ }
                                          \_ lock_ptr = q->lock_ptr;
                                          \_ if (lock_ptr != 0) {
                                             /* FALSE now */
                                             /* thread B can go on */

=== end of scenario ===

So there are two useless and expensive task switches, because the
"if (lock_ptr != 0)" test in unqueue_me is still true the first time
through, where it should not be.
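
The retry logic in unqueue_me that causes this can be sketched in
userspace (a simulation only: a pthread mutex stands in for the
hash-bucket spinlock, a flag stands in for the list, and the sim_*
names are made up for illustration):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's struct futex_q:
 * lock_ptr plays the role of q->lock_ptr, and queued stands in for
 * membership in the hash-bucket list. */
struct sim_futex_q {
    pthread_mutex_t *lock_ptr;
    int queued;
};

/* Simplified model of unqueue_me(): take the lock the queue entry
 * currently points at, re-check that it did not change underneath us
 * (the waker sets lock_ptr to NULL in wake_futex), and retry if so.
 * Returns 1 if we removed ourselves, 0 if we were already woken. */
static int sim_unqueue_me(struct sim_futex_q *q)
{
    pthread_mutex_t *lock_ptr;
    int ret = 0;

retry:
    lock_ptr = q->lock_ptr;
    if (lock_ptr != NULL) {
        pthread_mutex_lock(lock_ptr);
        if (lock_ptr != q->lock_ptr) {
            /* A waker changed lock_ptr while we were blocking on
             * the old lock: drop it and start over. */
            pthread_mutex_unlock(lock_ptr);
            goto retry;
        }
        q->queued = 0;   /* stand-in for list_del(&q->list) */
        ret = 1;
        pthread_mutex_unlock(lock_ptr);
    }
    return ret;
}
```

The proposed patch amounts to returning before this loop when the entry
is already off the list, so an already-woken waiter never blocks on the
waker's lock at all.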

Here is the patch I would like to propose, for comments.


Signed-off-by: Pierre Peiffer <[email protected]>

---
diff -uprN linux-2.6.16.ori/kernel/futex.c linux-2.6.16/kernel/futex.c
--- linux-2.6.16.ori/kernel/futex.c 2006-03-27 16:52:11.000000000 +0200
+++ linux-2.6.16/kernel/futex.c 2006-03-27 16:56:35.000000000 +0200
@@ -290,7 +290,7 @@ static int futex_wake(unsigned long uadd
struct futex_hash_bucket *bh;
struct list_head *head;
struct futex_q *this, *next;
- int ret;
+ int ret, drop_count=0;

down_read(&current->mm->mmap_sem);

@@ -305,12 +305,15 @@ static int futex_wake(unsigned long uadd
list_for_each_entry_safe(this, next, head, list) {
if (match_futex (&this->key, &key)) {
wake_futex(this);
+ drop_count++;
if (++ret >= nr_wake)
break;
}
}

spin_unlock(&bh->lock);
+ while (--drop_count >= 0)
+ drop_key_refs(&key);
out:
up_read(&current->mm->mmap_sem);
return ret;
@@ -327,6 +330,7 @@ static int futex_wake_op(unsigned long u
struct list_head *head;
struct futex_q *this, *next;
int ret, op_ret, attempt = 0;
+ int drop_count1 = 0, drop_count2 = 0;

retryfull:
down_read(&current->mm->mmap_sem);
@@ -413,6 +417,7 @@ retry:
list_for_each_entry_safe(this, next, head, list) {
if (match_futex (&this->key, &key1)) {
wake_futex(this);
+ drop_count1++;
if (++ret >= nr_wake)
break;
}
@@ -425,6 +430,7 @@ retry:
list_for_each_entry_safe(this, next, head, list) {
if (match_futex (&this->key, &key2)) {
wake_futex(this);
+ drop_count2++;
if (++op_ret >= nr_wake2)
break;
}
@@ -435,6 +441,12 @@ retry:
spin_unlock(&bh1->lock);
if (bh1 != bh2)
spin_unlock(&bh2->lock);
+
+ /* drop_key_refs() must be called outside the spinlocks. */
+ while (--drop_count1 >= 0)
+ drop_key_refs(&key1);
+ while (--drop_count2 >= 0)
+ drop_key_refs(&key2);
out:
up_read(&current->mm->mmap_sem);
return ret;
@@ -506,6 +518,7 @@ static int futex_requeue(unsigned long u
continue;
if (++ret <= nr_wake) {
wake_futex(this);
+ drop_count++;
} else {
list_move_tail(&this->list, &bh2->chain);
this->lock_ptr = &bh2->lock;
@@ -586,6 +599,9 @@ static int unqueue_me(struct futex_q *q)
int ret = 0;
spinlock_t *lock_ptr;

+ if (list_empty(&q->list))
+ return ret;
+
/* In the common case we don't take the spinlock, which is nice. */
retry:
lock_ptr = q->lock_ptr;


--
Pierre Peiffer


2006-03-28 10:06:43

by Eric Dumazet

Subject: Re: [PATCH] 2.6.16 - futex: small optimization (?)

Pierre PEIFFER wrote:
> Hi,
>
>
> I found a (optimization ?) problem in the futexes, during a futex_wake,
> if the waiter has a higher priority than the waker.
>
> In fact, in this case, the waiter is immediately scheduled and tries to
> take a lock still held by the waker. This is specially expensive on UP
> or if both threads are on the same CPU, due to the two task-switchings.
> This produces an extra latency during a wakeup in pthread_cond_broadcast
> or pthread_cond_signal, for example.
>
> See below my detailed explanation.
>
> I found a solution given by the patch, at the end of this mail. It works
> for me on kernel 2.6.16, but the kernel hangs if I use it with -rt patch
> from Ingo Molnar. So, I have a doubt on the correctness of the patch.
>
> The idea is simple: in unqueue_me, I first check
> "if (list_empty(&q->list))"
>
> If yes => we were woken (the list is initialized in wake_futex).
> Then, it immediately returns and let the waker drop the key_refs
> (instead of the waiter).
>
>

It's true that the futex code implies a lot of context switches (on the
kernel side but also on the user side).

Even if you change the kernel behavior in futex_wake(), you won't change
the fact that a typical pthread_cond_signal does:

1) lock cond var
lll_lock(cv->lock);
2) wake one waiter if necessary
FUTEX_WAKE(cv->wakeup_seq, 1);
3) unlock cond var
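
Step 2 above boils down to a raw futex syscall, roughly as follows (a
sketch only: glibc exposes no futex() wrapper, and the variable name
cv_wakeup_seq is made up to mirror the field glibc keeps inside its
pthread_cond_t):

```c
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Illustrative futex word; glibc operates on a field inside its
 * pthread_cond_t rather than a standalone variable like this. */
static uint32_t cv_wakeup_seq;

/* Ask the kernel to wake at most nr_wake waiters blocked on uaddr.
 * Returns the number of waiters actually woken, or -1 on error. */
static long futex_wake(uint32_t *uaddr, int nr_wake)
{
    return syscall(SYS_futex, uaddr, FUTEX_WAKE, nr_wake,
                   NULL, NULL, 0);
}
```

With no waiters queued on the word, the call simply returns 0, but it
still pays the full cost of entering the kernel.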

If a waiter process B has a higher priority than the waker process A,
then most probably B is scheduled before A has had a chance to unlock the
cond var (step 3).

So B will re-enter the kernel (because of the contended cond var lock),
and A will re-enter the kernel too, to futex_wake() process B again, but
on the cond var lock this time, not on the condvar wakeup_seq futex.

Each time a thread enters the futex kernel code, an expensive
find_extend_vma() lookup is done (expensive because of the read_lock, but
also because of the possibly large number of vm_area_structs in the
mm_struct).

I wish the futex code had a special implementation for
PTHREAD_SCOPE_PROCESS futexes, where no vma lookups would be necessary at
all. Most mutexes and condvars have process-private scope (they are not
shared between different processes).
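
That scope is visible at the pthreads API level: a condvar attribute
defaults to PTHREAD_PROCESS_PRIVATE unless the application explicitly
asks for PTHREAD_PROCESS_SHARED. A sketch (the helper name is made up):

```c
#include <pthread.h>

/* Returns 1 if a freshly initialized condvar attribute requests
 * process-shared scope, 0 if it is process-private (the default).
 * A kernel fast path could, in principle, skip the vma lookup for
 * the private case, since the futex word cannot be aliased through
 * another process's mapping. */
static int condvar_is_process_shared(void)
{
    pthread_condattr_t attr;
    int pshared;

    pthread_condattr_init(&attr);
    /* No pthread_condattr_setpshared() call: keep the default. */
    pthread_condattr_getpshared(&attr, &pshared);
    pthread_condattr_destroy(&attr);
    return pshared == PTHREAD_PROCESS_SHARED;
}
```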

Eric



2006-03-28 15:02:44

by Ulrich Drepper

Subject: Re: [PATCH] 2.6.16 - futex: small optimization (?)

On 3/27/06, Pierre PEIFFER <[email protected]> wrote:
> I found a (optimization ?) problem in the futexes, during a futex_wake,
> if the waiter has a higher priority than the waker.

There are no such situations anymore in an optimal userlevel
implementation. The last problem (in pthread_cond_signal) was fixed
by the addition of FUTEX_WAKE_OP. The userlevel code you're looking
at is simply not optimized for the modern kernels.
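
For context, FUTEX_WAKE_OP lets userlevel code modify a second futex word
and wake waiters on one or both words in a single kernel entry, which is
what removes the extra wake/switch round-trip in pthread_cond_signal. A
minimal sketch via the raw syscall (the wrapper and the futex words here
are made up for illustration):

```c
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

static uint32_t futex1, futex2;   /* illustrative futex words */

/* One kernel entry: wake up to nr_wake waiters on uaddr1, atomically
 * apply the encoded operation `op` to *uaddr2, and, if the encoded
 * comparison against the old value of *uaddr2 holds, also wake up to
 * nr_wake2 waiters on uaddr2. Returns the total number woken. */
static long futex_wake_op(uint32_t *uaddr1, int nr_wake,
                          uint32_t *uaddr2, int nr_wake2, int op)
{
    /* nr_wake2 travels in the timeout argument slot, per futex(2). */
    return syscall(SYS_futex, uaddr1, FUTEX_WAKE_OP, nr_wake,
                   (void *)(long)nr_wake2, uaddr2, op);
}
```

Even with no waiters queued, the encoded operation on the second word is
still performed; only the conditional second wake depends on the compare.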

2006-03-28 23:09:50

by Bill Davidsen

Subject: Re: [PATCH] 2.6.16 - futex: small optimization (?)

Ulrich Drepper wrote:
> On 3/27/06, Pierre PEIFFER <[email protected]> wrote:
>> I found a (optimization ?) problem in the futexes, during a futex_wake,
>> if the waiter has a higher priority than the waker.
>
> There are no such situations anymore in an optimal userlevel
> implementation. The last problem (in pthread_cond_signal) was fixed
> by the addition of FUTEX_WAKE_OP. The userlevel code you're looking
> at is simply not optimized for the modern kernels.

What are you suggesting here, that the kernel can be inefficient as long
as the user has a way to program around it?

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2006-03-29 13:18:18

by Pierre Peiffer

Subject: Re: [PATCH] 2.6.16 - futex: small optimization (?)

Ulrich Drepper wrote:
>
> There are no such situations anymore in an optimal userlevel
> implementation. The last problem (in pthread_cond_signal) was fixed
> by the addition of FUTEX_WAKE_OP. The userlevel code you're looking
> at is simply not optimized for the modern kernels.
>

I think there is a misunderstanding here.

FUTEX_WAKE_OP was implemented to handle more than one futex at a time in
some specific situations (such as pthread_cond_signal).

The scenario I've described occurs in futex_wake, futex_wake_op and
futex_requeue, and is _independent_ of the userlevel code.

All these functions call wake_futex, and then wake_up_all, with the
futex_hash_bucket lock still held.

If the woken thread is immediately scheduled (in wake_up_all), and only
in this case (because of a higher priority, etc), it will try to take
this lock too (because of the "if (lock_ptr != 0)" statement in
unqueue_me), causing two task-switches to take this lock for nothing.

Otherwise, it will not: lock_ptr is set to NULL just after the
wake_up_all call.

This scenario happens at least in pthread_cond_signal,
pthread_cond_broadcast and probably all pthread_*_unlock functions.

The patch I've proposed should, at least in theory, solve this. But I'm
not sure of the correctness...

--
Pierre P.

2006-03-29 15:26:48

by Eric Dumazet

Subject: Re: [PATCH] 2.6.16 - futex: small optimization (?)

Pierre PEIFFER wrote:
> Ulrich Drepper wrote:
>>
>> There are no such situations anymore in an optimal userlevel
>> implementation. The last problem (in pthread_cond_signal) was fixed
>> by the addition of FUTEX_WAKE_OP. The userlevel code you're looking
>> at is simply not optimized for the modern kernels.
>>
>
> I think there is a misunderstanding here.
>

Hum... maybe Ulrich was answering my own message (where I stated that
most existing multithreaded applications pay the price of context
switches).

(To Ulrich: most existing applications use glibc <= 2.3.6, where
FUTEX_WAKE_OP is not used yet, AFAIK.)

I think your analysis is correct, Pierre, but you speak of
'task-switches' where there is only a spinlock involved:

In the UP case: a wake_up_all() won't preempt the current thread: it will
task-switch only when the current thread exits kernel mode.

In the PREEMPT case: wake_up_all() won't preempt the current thread
(because the current thread is holding bh->lock).

On SMP: the awakened thread will spin some time on bh->lock, but will not
task-switch again.

On an RT kernel, this might be different, of course...

> FUTEX_WAKE_OP is implemented to handle simultaneously more than one
> futex in some specific situations (such as pthread_cond_signal).
>
> The scenario I've described occurred in futex_wake, futex_wake_op and
> futex_requeue and is _independent_ of the userlevel code.
>
> All these functions call wake_futex, and then wake_up_all, with the
> futex_hash_bucket lock still held.
>
> If the woken thread is immediately scheduled (in wake_up_all), and only
> in this case (because of a higher priority, etc), it will try to take
> this lock too (because of the "if (lock_ptr != 0)" statement in
> unqueue_me), causing two task-switches to take this lock for nothing.
>
> Otherwise, it will not: lock_ptr is set to NULL just after the
> wake_up_all call)
>
> This scenario happens at least in pthread_cond_signal,
> pthread_cond_broadcast and probably all pthread_*_unlock functions.
>
> The patch I've proposed should, at least in theory, solve this. But I'm
> not sure of the correctness...
>

2006-03-29 15:29:15

by Ingo Molnar

Subject: Re: [PATCH] 2.6.16 - futex: small optimization (?)


* Bill Davidsen <[email protected]> wrote:

> Ulrich Drepper wrote:
> >On 3/27/06, Pierre PEIFFER <[email protected]> wrote:
> >>I found a (optimization ?) problem in the futexes, during a futex_wake,
> >> if the waiter has a higher priority than the waker.
> >
> >There are no such situations anymore in an optimal userlevel
> >implementation. The last problem (in pthread_cond_signal) was fixed
> >by the addition of FUTEX_WAKE_OP. The userlevel code you're looking
> >at is simply not optimized for the modern kernels.
>
> What are you suggesting here, that the kernel can be inefficient as
> long as the user has a way to program around it?

What are you suggesting here, that FUTEX_WAKE_OP is a "user way to
program around" an inefficiency? If yes then please explain to me why
and what you would do differently.

Ingo

2006-03-30 14:51:23

by Pierre Peiffer

Subject: Re: [PATCH] 2.6.16 - futex: small optimization (?)

Eric Dumazet wrote:
> I think your analysis is correct, Pierre, but you speak of
> 'task-switches' where there is only a spinlock involved:
>
> In the UP case: a wake_up_all() won't preempt the current thread: it
> will task-switch only when the current thread exits kernel mode.
>
> In the PREEMPT case: wake_up_all() won't preempt the current thread
> (because the current thread is holding bh->lock).
>
> On SMP: the awakened thread will spin some time on bh->lock, but will
> not task-switch again.

Ok, yes, you're right, thanks.

>
> On RT kernel, this might be different of course...

So, I confirm it only happens on an RT kernel...
... but my patch does not work on those kernels :-/

In fact, I don't really see what's wrong.

--
Pierre

2006-03-30 20:25:27

by Bill Davidsen

Subject: Re: [PATCH] 2.6.16 - futex: small optimization (?)

Ingo Molnar wrote:
> * Bill Davidsen <[email protected]> wrote:
>
>> Ulrich Drepper wrote:
>>> On 3/27/06, Pierre PEIFFER <[email protected]> wrote:
>>>> I found a (optimization ?) problem in the futexes, during a futex_wake,
>>>> if the waiter has a higher priority than the waker.
>>> There are no such situations anymore in an optimal userlevel
>>> implementation. The last problem (in pthread_cond_signal) was fixed
>>> by the addition of FUTEX_WAKE_OP. The userlevel code you're looking
>>> at is simply not optimized for the modern kernels.
>> What are you suggesting here, that the kernel can be inefficient as
>> long as the user has a way to program around it?
>
> What are you suggesting here, that FUTEX_WAKE_OP is a "user way to
> program around" an inefficiency? If yes then please explain to me why
> and what you would do differently.

The point I'm making is that even if an application is "not optimized
for modern kernels" or whatever, there's no reason to ignore
inefficiencies. As Pierre Peiffer noted, this happens independently of
user code. If a change can eliminate some CPU cycles and possible cache
activity, it would seem to be worth investigating.

The suggestion that the user code was inefficient was not mine...

Did I clarify it this time?

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2006-03-31 06:04:27

by Ingo Molnar

Subject: Re: [PATCH] 2.6.16 - futex: small optimization (?)


* Bill Davidsen <[email protected]> wrote:

> >>>There are no such situations anymore in an optimal userlevel
> >>>implementation. The last problem (in pthread_cond_signal) was fixed
> >>>by the addition of FUTEX_WAKE_OP. The userlevel code you're looking
> >>>at is simply not optimized for the modern kernels.
> >>What are you suggesting here, that the kernel can be inefficient as
> >>long as the user has a way to program around it?
> >
> >What are you suggesting here, that FUTEX_WAKE_OP is a "user way to
> >program around" an inefficiency? If yes then please explain to me why
> >and what you would do differently.
>
> The point I'm making is that even if an application is "not optimized
> for modern kernels" or whatever, there's no reason to ignore
> inefficiencies. [...]

What are you suggesting here, that the implementation of FUTEX_WAKE_OP
is "ignoring inefficiencies"? Please explain why and what you would do
differently to solve that inefficiency.

Ingo

2006-03-31 14:45:48

by Bill Davidsen

Subject: Re: [PATCH] 2.6.16 - futex: small optimization (?)

Ingo Molnar wrote:

>* Bill Davidsen <[email protected]> wrote:
>
>>>>>There are no such situations anymore in an optimal userlevel
>>>>>implementation. The last problem (in pthread_cond_signal) was fixed
>>>>>by the addition of FUTEX_WAKE_OP. The userlevel code you're looking
>>>>>at is simply not optimized for the modern kernels.
>>>>
>>>>What are you suggesting here, that the kernel can be inefficient as
>>>>long as the user has a way to program around it?
>>>
>>>What are you suggesting here, that FUTEX_WAKE_OP is a "user way to
>>>program around" an inefficiency? If yes then please explain to me why
>>>and what you would do differently.
>>
>>The point I'm making is that even if an application is "not optimized
>>for modern kernels" or whatever, there's no reason to ignore
>>inefficiencies. [...]
>
>What are you suggesting here, that the implementation of FUTEX_WAKE_OP
>is "ignoring inefficiencies"? Please explain why and what you would do
>differently to solve that inefficiency.

I am suggesting that "There are no such situations anymore in an optimal
userlevel implementation" is not the right approach to the original post.
What I would do differently is evaluate the original suggestion to see
whether it would in fact be more efficient. Please read the original post
and take it on its merits: it's either an optimization or not, and the
code path is obviously exercised even when the user code is not "optimal".

--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979

2006-03-31 18:18:18

by Ingo Molnar

Subject: Re: [PATCH] 2.6.16 - futex: small optimization (?)


* Bill Davidsen <[email protected]> wrote:

> >What are you suggesting here, that the implementation of
> >FUTEX_WAKE_OP is "ignoring inefficiencies"? Please explain why and
> >what you would do differently to solve that inefficiency.
>
> I am suggesting that "There are no such situations anymore in an
> optimal userlevel implementation" is not the right approach to the
> original post. What I would do differently is to evaluate the original
> suggestion to see if it would in fact be more efficient. [...]

That's what we did a year ago for a similar futex rescheduling
inefficiency, and came up with the FUTEX_WAKE_OP solution, which is now
part of the kernel and is used by glibc. But we are all ears for
alternative suggestions.

Ingo