I'm working on a futex commentary cleanup patch series. While reading
through all the remaining comments, I've come across a couple I'd like
your thoughts on:
The futex woken state is defined as:
* A futex_q has a woken state, just like tasks have TASK_RUNNING.
* It is considered woken when plist_node_empty(&q->list) || q->lock_ptr == 0.
* The order of wakeup is always to make the first condition true, then
* wake up q->waiter, then make the second condition true.
1) wake_futex() actually wakes the task (q->task, not q->waiter) after
the lock_ptr has been set to NULL. I believe this is fine, and I can
correct the comments accordingly.
2) futex_wait_queue_me() (recently refactored from futex_wait())
performs the following test:
        /*
         * !plist_node_empty() is safe here without any lock.
         * q.lock_ptr != 0 is not safe, because of ordering against wakeup.
         */
        if (likely(!plist_node_empty(&q->list))) {
                /*
                 * If the timer has already expired, current will already be
                 * flagged for rescheduling. Only call schedule if there
                 * is no timeout, or if it has yet to expire.
                 */
                if (!timeout || timeout->task)
                        schedule();
        }
As I understand it, this is to avoid a missed wakeup when a FUTEX_WAKE
call occurs after the queue_me() but before the futex_wait() call has
had a chance to call schedule() (via futex_wait_queue_me()). However,
as no locks are taken, I don't see what prevents the futex_q from being
removed from the hash list after the plist_node_empty() test and before
the call to schedule(). In this scenario, the futex_q will not be found
on the hash list by subsequent wakers, and it will remain in schedule()
until a timeout or signal occurs.
This leads me to question the comment: "!plist_node_empty() is safe
here without any lock." Why is that safe?
Secondly, why is the q.lock_ptr test not safe? "q.lock_ptr != 0 is not
safe, because of ordering against wakeup."
I understand the definition of the woken state to be
"plist_node_empty(&q->list) || q->lock_ptr == 0". So testing the plist
will detect a woken futex sooner than testing for a null lock_ptr, but I
don't see how one is more "safe" than the other when no locks are held
to prevent the futex_q from vanishing off the list before the call to
schedule().
Thanks,
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
Darren,
On Wed, 16 Sep 2009, Darren Hart wrote:
> 2) futex_wait_queue_me() (recently refactored from futex_wait())
> performs the following test:
>
> /*
> * !plist_node_empty() is safe here without any lock.
> * q.lock_ptr != 0 is not safe, because of ordering against wakeup.
> */
> if (likely(!plist_node_empty(&q->list))) {
> /*
> * If the timer has already expired, current will already be
> * flagged for rescheduling. Only call schedule if there
> * is no timeout, or if it has yet to expire.
> */
> if (!timeout || timeout->task)
> schedule();
> }
>
> As I understand it, this is to avoid a missed wakeup when a FUTEX_WAKE
> call occurs after the queue_me() but before the futex_wait() call has
> had a chance to call schedule() (via futex_wait_queue_me()). However,
> as no locks are taken, I don't see what prevents the futex_q from being
> removed from the hash list after the plist_node_empty() test and before
> the call to schedule(). In this scenario, the futex_q will not be found
> on the hash list by subsequent wakers, and it will remain in schedule()
> until a timeout or signal occurs.
>
> This leads me to the question on the comment: "!plist_node_empty() is
> safe here without any lock." - Why is that safe?
>
> Secondly, why is the q.lock_ptr test not safe? "q.lock_ptr != 0 is not
> safe, because of ordering against wakeup."
>
> I understand the definition of the woken state to be
> "plist_node_empty(&q->list) || q->lock_ptr == 0". So testing the plist
> will detect a woken futex sooner than testing for a null lock_ptr, but I
> don't see how one is more "safe" than the other when no locks are held
> to prevent the futex_q from vanishing off the list before the call to
> schedule().
Much of this is historic leftovers.
futex_wait_queue_me()
{
        queue_me(q, hb);

        /*
         * There might have been scheduling since the queue_me(), as we
         * cannot hold a spinlock across the get_user() in case it
         * faults, and we cannot just set TASK_INTERRUPTIBLE state when
         * queueing ourselves into the futex hash. This code thus has to
         * rely on the futex_wake() code removing us from hash when it
         * wakes us up.
         */
This comment is stale and patently wrong today. There is no get_user()
while we hold a spinlock, so we should move set_current_state() before
queue_me():
        set_current_state(TASK_INTERRUPTIBLE);

        /*
         * !plist_node_empty() is safe here without any lock.
         * q.lock_ptr != 0 is not safe, because of ordering against wakeup.
         */
        if (likely(!plist_node_empty(&q->list))) {
If we move set_current_state() before the queue_me() this check is
still an optimization to avoid the schedule call in case we have been
woken up already. But the comment is still wrong as the wakeup code
has changed:
The old version did:
        plist_del(&q->list);
        wake_up_all(&q->waiters);
        q->lock_ptr = NULL;
Today we do:
        p = q->task;
        get_task_struct(p);
        plist_del(&q->list);
        q->lock_ptr = NULL;
        wake_up_state(p);
        put_task_struct(p);
We changed this because it makes no sense to use a waitqueue for a
single task.
So the whole commentary is stale and needs to be updated.
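Taken together, the reordered wait path being suggested would look roughly like the following kernel-style pseudocode (a sketch assembled from the fragments above, not a tested patch; the trailing __set_current_state(TASK_RUNNING) is my addition to restore the task state on the early-exit path):

```
futex_wait_queue_me()
{
	set_current_state(TASK_INTERRUPTIBLE);
	queue_me(q, hb);

	/*
	 * If a wakeup already removed us from the list, we skip
	 * schedule(); if it races in after this check, wake_up_state()
	 * has set us back to TASK_RUNNING and schedule() returns
	 * immediately.
	 */
	if (likely(!plist_node_empty(&q->list))) {
		if (!timeout || timeout->task)
			schedule();
	}
	__set_current_state(TASK_RUNNING);
}
```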
Thanks,
tglx
Thomas Gleixner wrote:
> Darren,
>
> On Wed, 16 Sep 2009, Darren Hart wrote:
>
>> 2) futex_wait_queue_me() (recently refactored from futex_wait())
>> performs the following test:
>>
>> /*
>> * !plist_node_empty() is safe here without any lock.
>> * q.lock_ptr != 0 is not safe, because of ordering against wakeup.
>> */
>> if (likely(!plist_node_empty(&q->list))) {
>> /*
>> * If the timer has already expired, current will already be
>> * flagged for rescheduling. Only call schedule if there
>> * is no timeout, or if it has yet to expire.
>> */
>> if (!timeout || timeout->task)
>> schedule();
>> }
>>
>> As I understand it, this is to avoid a missed wakeup when a FUTEX_WAKE
>> call occurs after the queue_me() but before the futex_wait() call has
>> had a chance to call schedule() (via futex_wait_queue_me()). However,
>> as no locks are taken, I don't see what prevents the futex_q from being
>> removed from the hash list after the plist_node_empty() test and before
>> the call to schedule(). In this scenario, the futex_q will not be found
>> on the hash list by subsequent wakers, and it will remain in schedule()
>> until a timeout or signal occurs.
>>
>> This leads me to the question on the comment: "!plist_node_empty() is
>> safe here without any lock." - Why is that safe?
>>
>> Secondly, why is the q.lock_ptr test not safe? "q.lock_ptr != 0 is not
>> safe, because of ordering against wakeup."
>>
>> I understand the definition of the woken state to be
>> "plist_node_empty(&q->list) || q->lock_ptr == 0". So testing the plist
>> will detect a woken futex sooner than testing for a null lock_ptr, but I
>> don't see how one is more "safe" than the other when no locks are held
>> to prevent the futex_q from vanishing off the list before the call to
>> schedule().
>
Hi Thomas,
Thanks for the review.
> Much of this is historic leftovers.
Yes, I figured, thus the exercise.
>
> futex_wait_queue_me()
> {
> queue_me(q, hb);
>
> /*
> * There might have been scheduling since the queue_me(), as we
> * cannot hold a spinlock across the get_user() in case it
> * faults, and we cannot just set TASK_INTERRUPTIBLE state when
> * queueing ourselves into the futex hash. This code thus has to
> * rely on the futex_wake() code removing us from hash when it
> * wakes us up.
> */
>
> This comment is stale and patently wrong today. There is no get_user()
> while we hold a spinlock. So we should move set_current_state() before
> queue_me()
OK, I'll include that in the patch series.
>
> set_current_state(TASK_INTERRUPTIBLE);
>
> /*
> * !plist_node_empty() is safe here without any lock.
> * q.lock_ptr != 0 is not safe, because of ordering against wakeup.
> */
> if (likely(!plist_node_empty(&q->list))) {
>
> If we move set_current_state() before the queue_me() this check is
> still an optimization to avoid the schedule call in case we have been
> woken up already. But the comment is still wrong as the wakeup code
> has changed:
>
> The old version did:
>
> plist_del(&q->list);
> wake_up_all(&q->waiters);
> q->lock_ptr = NULL;
>
> Today we do:
>
> p = q->task;
> get_task_struct(p);
> plist_del(&q->list);
> q->lock_ptr = NULL;
> wake_up_state(p);
> put_task_struct(p);
>
> We changed this because it makes no sense to use a waitqueue for a
> single task.
Right.
However, my bigger concern still remains. If the above is only an
optimization, we appear to have a race with wakeup: we can see a
non-empty list here, decide to schedule, and then have the wakeup code
remove us from the list, hiding it from all future futex-related
wakeups (signal and timeout would still work).
We have also been seeing a race with the requeue_pi code in a JVM
benchmark, where the apparent owner of the pi mutex remains blocked on
the condvar; the race I'm suspecting would explain this. Also,
futex_requeue_pi() is using futex_wait_queue_me(), which expects the
waker to remove the futex_q from the list, which isn't how things work
for PI mutexes. In an experiment, I moved the spin_unlock() out of
queue_me() to right before the call to schedule() to narrow the race
window, and the hang we were experiencing appears to have gone away.
I'll propose a patch series early next week to address the commentary
and functional issues (will try to get to it sooner - have family in
town for a couple days).
Thanks,
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
On Thu, 17 Sep 2009, Darren Hart wrote:
> > /*
> > * !plist_node_empty() is safe here without any lock.
> > * q.lock_ptr != 0 is not safe, because of ordering against wakeup.
> > */
> > if (likely(!plist_node_empty(&q->list))) {
> >
> > If we move set_current_state() before the queue_me() this check is
> > still an optimization to avoid the schedule call in case we have been
> > woken up already. But the comment is still wrong as the wakeup code
> > has changed:
> >
> > The old version did:
> >
> > plist_del(&q->list);
> > wake_up_all(&q->waiters);
> > q->lock_ptr = NULL;
> >
> > Today we do:
> >
> > p = q->task;
> > get_task_struct(p);
> > plist_del(&q->list);
> > q->lock_ptr = NULL;
> > wake_up_state(p);
> > put_task_struct(p);
> >
> > We changed this because it makes no sense to use a waitqueue for a
> > single task.
>
> Right.
>
>
> However, my bigger concern still remains. If the above is only an
> optimization, we appear to have a race with wakeup where we can see a
> non-empty list here and decide to schedule and have the wakeup code remove us
> from the list, hiding it from all future futex related wakeups (signal and
> timeout would still work).
No.
Sleeper does:

        set_current_state(TASK_INTERRUPTIBLE);

        if (!plist_empty())
                schedule();
So when the list removal happened before set_current_state() we don't
schedule. If the wakeup happens _after_ set_current_state() then the
wake_up_state() call will bring us back to running.
> We have also been seeing a race with the requeue_pi code with a JVM benchmark
> where the apparent owner of the pi mutex remains blocked on the condvar - this
> can be explained by the race I'm suspecting. Also, futex_requeue_pi() is
> using futex_wait_queue_me() which expects the waker to remove the futex_q from
> the list, which isn't how things work for PI mutexes. In an experiment, I
> moved the spin_unlock() out of queue_me() to right before the call to
> schedule() to narrow the race window, and the hang we were experiencing
> appears to have gone away.
The correct thing to do is to move set_current_state() before queue_me().
Thanks,
tglx
Thomas Gleixner wrote:
> On Thu, 17 Sep 2009, Darren Hart wrote:
>>> /*
>>> * !plist_node_empty() is safe here without any lock.
>>> * q.lock_ptr != 0 is not safe, because of ordering against wakeup.
>>> */
>>> if (likely(!plist_node_empty(&q->list))) {
>>>
>>> If we move set_current_state() before the queue_me() this check is
>>> still an optimization to avoid the schedule call in case we have been
>>> woken up already. But the comment is still wrong as the wakeup code
>>> has changed:
>>>
>>> The old version did:
>>>
>>> plist_del(&q->list);
>>> wake_up_all(&q->waiters);
>>> q->lock_ptr = NULL;
>>>
>>> Today we do:
>>>
>>> p = q->task;
>>> get_task_struct(p);
>>> plist_del(&q->list);
>>> q->lock_ptr = NULL;
>>> wake_up_state(p);
>>> put_task_struct(p);
>>>
>>> We changed this because it makes no sense to use a waitqueue for a
>>> single task.
>> Right.
>>
>>
>> However, my bigger concern still remains. If the above is only an
>> optimization, we appear to have a race with wakeup where we can see a
>> non-empty list here and decide to schedule and have the wakeup code remove us
>> from the list, hiding it from all future futex related wakeups (signal and
>> timeout would still work).
>
> No.
>
> Sleeper does:
>
> set_current_state(TASK_INTERRUPTIBLE);
>
> if (!plist_empty())
> schedule();
>
> So when the list removal happened before set_current_state() we don't
> schedule. If the wakeup happens _after_ set_current_state() then the
> wake_up_state() call will bring us back to running.
>
>> We have also been seeing a race with the requeue_pi code with a JVM benchmark
>> where the apparent owner of the pi mutex remains blocked on the condvar - this
>> can be explained by the race I'm suspecting. Also, futex_requeue_pi() is
>> using futex_wait_queue_me() which expects the waker to remove the futex_q from
>> the list, which isn't how things work for PI mutexes. In an experiment, I
>> moved the spin_unlock() out of queue_me() to right before the call to
>> schedule() to narrow the race window, and the hang we were experiencing
>> appears to have gone away.
>
> The correct thing to do is to move set_current_state() before queue_me().
>
Ah yes, you are correct of course. Since PI futexes do not use
plist_node_empty() to test for wakeup, setting TASK_INTERRUPTIBLE
after the queue_me() sets the stage to call schedule() with the wrong
task state and lose the task forever. I have included this in my
current patch queue. We are running our tests to confirm the fix and
I'll submit the series for inclusion by tomorrow.
Thanks,
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team