by Thomas Gleixner

[permalink] [raw]

Subject: Re: [TIP][RFC 6/7] futex: add requeue_pi calls

On Mon, 9 Mar 2009, Darren Hart wrote:
> Thomas Gleixner wrote:
> > > + int has_timeout;
> >
> > Hmm. time == 0 perhaps or just add another flag bit to the flags field ?
>
> I've created FLAGS_HAS_TIMEOUT and use that instead. I opted for this
> approach instead of using time==0 as that seemed like it could be confused for
> an expired timer.

Yup. Preferred solution.

> > > + /*
> > > + * The pifutex has an owner, make sure it's us, if not complain
> > > + * to userspace.
> > > + * FIXME_LATER: handle this gracefully
> >
> > We should do this right now. It's not that hard.
>
> Hrm. So if another task owns the futex then POSIX says undefined scheduling be
> should be expected. So failing here seems like a good way to encourage proper
> usage of the cond vars.

Non predicatable != invalid usage

> Or can you think of a sane scenario in which the
> outer mutex should not be held prior to the call to cond_signal() or
> cond_broadcast()? If you are thinking outside of glibc cond vars, then there
> are likely other gaps with this patch, such as only being able to requeue from
> non-pi to pi futexes.

No, I think about glibc and the madness of existing code. Code just
relies on that and it works with the current glibc implementation so
making the enhanced requeue_pi behave different will break existing
applications and has to be considered as a regression.

> That being said... now that this is more or less implemented, I'm not sure
> there is really anything I'd need to do different to support this. I'll pour
> over this for a while and see if there are any gotchas that I'm missing right
> now.

I think you have everything what you need already except of some minor
tweaks.

> >
> > > + */
> > > + pid = curval & FUTEX_TID_MASK;
> > > + if (pid && pid != task_pid_vnr(current))
> > > + return -EMORON;
> >
> > Though it's a pity that we lose EMORON :)
>
>
> Does that mean I'll have to forfeit the nickname bestowed upon me by LWN?
> I've received numerous comments on this line - all of them in favor.
> Apparently lots of kernel developers are eager for a way to slyly mock users
> from within the context of the kernel. ;-)

Send a separate patch then. You have my Acked-by :)

> > > +/*
> > > + * Requeue all waiters hashed on one physical page to another physical
> > > page.
> > > + * In the requeue_pi case, either takeover uaddr2 or set FUTEX_WAITERS
> > > and
> > > + * setup the pistate. FUTEX_REQUEUE_PI only supports requeueing from a
> > > non-pi
> > > + * futex to a pi futex.
> > > */
> > > static int futex_requeue(u32 __user *uaddr1, int fshared, u32 __user
> > > *uaddr2,
> > > - int nr_wake, int nr_requeue, u32 *cmpval)
> > > + int nr_wake, int nr_requeue, u32 *cmpval,
> > > + int requeue_pi)
> > > {
> > > union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
> > > struct futex_hash_bucket *hb1, *hb2;
> > > struct plist_head *head1;
> > > struct futex_q *this, *next;
> > > - int ret, drop_count = 0;
> > > + u32 curval2;
> > > + struct futex_pi_state *pi_state = NULL;
> > > + int drop_count = 0, attempt = 0, task_count = 0, ret;
> > > +
> > > + if (requeue_pi && refill_pi_state_cache())
> > > + return -ENOMEM;
> >
> > Why do you want to refill the pi_state_cache of current ? current is
> > not going to wait for the pi_futex.
>
>
> I may need to allocate a pi_state during
> futex_requeue_pi_init()->futex_lock_pi_atomic()->lookup_pi_state(). Making use
> of this current->pi_cache seemed like the best way to do it. Is there a more
> appropriate place to store a preallocated pi_state?

Hmm. You are right. That's the best way to have one handy.

> > > + if (requeue_pi &&
> > > + ((curval2 & FUTEX_TID_MASK) != task_pid_vnr(this->task)))
> > > {
> >
> > Why don't you check the owner of the rtmutex, which we installed
> > already ? Then we can drop this curval2 business altogether.
>
>
> Because we can own the futex without an rt_mutex existing. The first time
> through this loop we consider the top_waiter of uaddr1. If uaddr2 had no
> owner, then it is now owned by the top_waiter, but since there are still no
> waiters on uaddr2, the rt_mutex has not been initialized. Once we know we are
> not the owner, then we also know the pi_state->pi_mutex exists.
>
> Hrm.... I suppose a test for "!this->pi_state" is equivalent to
> "(curval2 & FUTEX_TID_MASK) != task_pid_vnr(this->task)" then isn't it...?
> Seems like a rather fragile test though doesn't it?
>
> We wouldn't save much by eliminating the curval2 here though since
> futex_requeue_pi_init needs it for the atomic_lock, so we still need the fault
> logic... unfortunatley.
>
> Thoughts?

In futex_requeue_pi_init() we figure already out who owns the
futex, right ? So why don't we capture that information ?

> > > + if (abs_time) {
> > > + unsigned long slack;
> >
> > Missing new line. Also this is nonsense. A rt task should have set
> > its timer_slack_ns to 0, so we can just use current->timer_slack_ns
> > in the hrtimer_set_expires_range_ns(). If the timer_slack_ns is
> > random for rt_tasks then we need to fix this in general.
> >
>
>
> Added a WARN_ON(!current->timer_slack_ns) to my debug patch and used that
> value here. I'll create a patch to futex-fixes to address the existing calls
> that I copied this approach from.

Thanks.

> > Also the logic is completely backwards here. It has to be the same
> > as in futex_wait()
>
>
> Is it broken? I believe the reason I wrote it as I did was because the ret ==
> 0 case should be very rare here. Infact, it is an error. The only reason we
> should wake on the first futex (uaddr) is on a signal or a timeout, otherwise
> the user most likely paired a FUTEX_WAKE_REQUEUE_PI with FUTEX_WAKE, instead
> of FUTEX_REQUEUE_PI.
>
> I can change it around if you have a strong preference though (or if it's
> broken in some way I'm missing of course).

The 0 case might be rare, but it is definitely not an error.

futex_wait_queue_me(hb, &q, to);
queue_me();
drops hb->lock;

unqueue_me() is called w/o the hb->lock held, so we can race here
against a requeue_pi wakeup.

So we need to check the unqueue_me() == 0 case first and check
whether we got requeued.

The requeue check is done unlocked, so the requeue and wakeup
might happen between the check and unqueue_me().

But this needs more thoughts about locking and possible races. The
whole code sequence here screams "can you debug the nasty races here ?"

> > > + /* FIXME: this test is REALLY scary... gotta be a better way...
> > > + * If the pi futex was uncontended, futex_requeue_pi_init() gave us
> > > + * the lock.
> > > + */
> >
> > Didn't we take the rtmutex as well ??
>
>
> Not necessarily. If the pi futex was ownerless and there was only one task
> requeued, then there is no rtmutex as it is created by the first waiter.
>
> It is possible that other tasks have since tried to take this mutex and now
> there is an rtmutex, but we have to catch the case where there might not be
> one.

Fair enough.

> >
> > > + if (!q.pi_state) {

Again. This check might be racy.

> > > + ret = 0;
> > > + goto out;
> > > + }
> > > + pi_mutex = &q.pi_state->pi_mutex;
> > > +
> > > + ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1);
> >
> >
> > Eeek. We got a wakeup _after_ we have been requeued. So now you go
> > back to sleep on the pi_mutex ?
>
>
> Yes. The futex_requeue() calls wake_up() which allows the waiter here to wake
> and continue processing it's half of the rt_mutex acquisition loop that I
> split out of rt_mutex_slowlock() in rt_mutex_finish_proxy_lock().

Ah, ok. We try to get the rt_mutex there as we are only the designated
owner. If we fail, then we go back to sleep.

Thanks,

tglx