On Wed, Nov 29, 2017 at 11:56:05AM -0600, Julia Cartwright wrote:
> Hey Thomas, Peter-
>
> Gratian and I have been debugging into a nasty and difficult race w/
> futexes seemingly the culprit. The original symptom we were seeing
> was a seemingly spurious -EDEADLK from a futex(LOCK_PI) operation.
>
> On further analysis, however, it appears the thread which gets the
> spurious -EDEADLK has observed a weird futex state: a prior
> futex(WAIT_REQUEUE_PI) operation has returned -ETIMEDOUT, but the uaddr2
> futex word owner field indicates that it's the owner.
>
Do you have a reproducer you can share?
> Here's an attempt to boil down this situation into a pseudo trace; I'm
> happy to forward along the full traces as well, if that would be
> helpful:
Please do forward the full trace
>
> waiter waker stealer (prio > waiter)
>
> futex(WAIT_REQUEUE_PI, uaddr, uaddr2,
> timeout=[N ms])
> futex_wait_requeue_pi()
> futex_wait_queue_me()
> freezable_schedule()
> <scheduled out>
> futex(LOCK_PI, uaddr2)
> futex(CMP_REQUEUE_PI, uaddr,
> uaddr2, 1, 0)
> /* requeues waiter to uaddr2 */
> futex(UNLOCK_PI, uaddr2)
> wake_futex_pi()
> cmp_futex_value_locked(uaddr, waiter)
> wake_up_q()
> <woken by waker>
> <hrtimer_wakeup() fires,
> clears sleeper->task>
> futex(LOCK_PI, uaddr2)
> __rt_mutex_start_proxy_lock()
> try_to_take_rt_mutex() /* steals lock */
> rt_mutex_set_owner(lock, stealer)
> <preempted>
> <scheduled in>
> rt_mutex_wait_proxy_lock()
> __rt_mutex_slowlock()
> try_to_take_rt_mutex() /* fails, lock held by stealer */
> if (timeout && !timeout->task)
> return -ETIMEDOUT;
> fixup_owner()
> /* lock wasn't acquired, so,
> fixup_pi_state_owner skipped */
> return -ETIMEDOUT;
>
> /* At this point, we've returned -ETIMEDOUT to userspace, but the
> * futex word shows waiter to be the owner, and the pi_mutex has
> * stealer as the owner */
>
eeeeeeewwwweeee
> futex_lock(LOCK_PI, uaddr2)
> -> bails with EDEADLK, futex word says we're owner.
>
> At some later point in execution, the stealer gets scheduled back in and
> will do fixup_owner() which fixes up the futex word, but at that point
> it's too late: the waiter has already observed the wonky state.
>
> fixup_owner() used to have additional seemingly relevant checks in place
> that were removed 73d786bd043eb ("futex: Rework inconsistent
> rt_mutex/futex_q state").
This and the subsequent changes moving some of this out from under the hb->lock
are interesting - and were quite fun to review at the time. Hrm.
I'll continue paging this stuff in, although I suspect Peter will likely beat me
to it. In the meantime, if you can share the reproducer and/or the trace you
collected, that will be helpful.
>
> The actual kernel we've been testing is 4.9.33-rt23, w/ 153fbd1226fb3
> ("futex: Fix more put_pi_state() vs. exit_pi_state_list() races")
And this does not exhibit the behavior above, correct?
> cherry-picked w/ PREEMPT_RT_FULL. However, it appears that this issue
> may affect v4.15-rc1?
And this does?
>
> Thoughts on how to move forward?
>
> Nasty.
>
> Julia
>
--
Darren Hart
VMware Open Source Technology Center
On Fri, Dec 01, 2017 at 12:11:15PM -0800, Darren Hart wrote:
> On Wed, Nov 29, 2017 at 11:56:05AM -0600, Julia Cartwright wrote:
> > Hey Thomas, Peter-
> >
> > Gratian and I have been debugging into a nasty and difficult race w/
> > futexes seemingly the culprit. The original symptom we were seeing
> > was a seemingly spurious -EDEADLK from a futex(LOCK_PI) operation.
> >
> > On further analysis, however, it appears the thread which gets the
> > spurious -EDEADLK has observed a weird futex state: a prior
> > futex(WAIT_REQUEUE_PI) operation has returned -ETIMEDOUT, but the uaddr2
> > futex word owner field indicates that it's the owner.
> >
>
> Do you have a reproducer you can share?
We have a massive application which seems to reproduce it in 8 hours or
so, but it's not in a state to be shared :(. So far, every attempt at
creating a simple, smaller reproducing case has failed. We're still
trying, though :(.
One debugging technique we're trying to employ as well now that we think
we have a handle on the race is to pry the race window open with some
strategically placed spinning (or fixed-period sleeping). Hopefully
that will make it easier to reproduce ...
> > Here's an attempt to boil down this situation into a pseudo trace; I'm
> > happy to forward along the full traces as well, if that would be
> > helpful:
>
> Please do forward the full trace
Will do. Chances are they are large enough to bounce from LKML, but
I'll send them along privately.
> >
> > waiter waker stealer (prio > waiter)
> >
> > futex(WAIT_REQUEUE_PI, uaddr, uaddr2,
> > timeout=[N ms])
> > futex_wait_requeue_pi()
> > futex_wait_queue_me()
> > freezable_schedule()
> > <scheduled out>
> > futex(LOCK_PI, uaddr2)
> > futex(CMP_REQUEUE_PI, uaddr,
> > uaddr2, 1, 0)
> > /* requeues waiter to uaddr2 */
> > futex(UNLOCK_PI, uaddr2)
> > wake_futex_pi()
> > cmp_futex_value_locked(uaddr, waiter)
minor fix: the above should have been: cmp_futex_value_locked(uaddr2, waiter)
> > wake_up_q()
> > <woken by waker>
> > <hrtimer_wakeup() fires,
> > clears sleeper->task>
> > futex(LOCK_PI, uaddr2)
> > __rt_mutex_start_proxy_lock()
> > try_to_take_rt_mutex() /* steals lock */
> > rt_mutex_set_owner(lock, stealer)
> > <preempted>
> > <scheduled in>
> > rt_mutex_wait_proxy_lock()
> > __rt_mutex_slowlock()
> > try_to_take_rt_mutex() /* fails, lock held by stealer */
> > if (timeout && !timeout->task)
> > return -ETIMEDOUT;
> > fixup_owner()
> > /* lock wasn't acquired, so,
> > fixup_pi_state_owner skipped */
> > return -ETIMEDOUT;
> >
> > /* At this point, we've returned -ETIMEDOUT to userspace, but the
> > * futex word shows waiter to be the owner, and the pi_mutex has
> > * stealer as the owner */
>
> eeeeeeewwwweeee
Indeed. :(
> > futex_lock(LOCK_PI, uaddr2)
> > -> bails with EDEADLK, futex word says we're owner.
> >
> > At some later point in execution, the stealer gets scheduled back in and
> > will do fixup_owner() which fixes up the futex word, but at that point
> > it's too late: the waiter has already observed the wonky state.
> >
> > fixup_owner() used to have additional seemingly relevant checks in place
> > that were removed 73d786bd043eb ("futex: Rework inconsistent
> > rt_mutex/futex_q state").
>
> This and the subsequent changes moving some of this out from under the hb->lock
> are interesting - and were quite fun to review at the time. Hrm.
>
> I'll continue paging this stuff in, although I suspect Peter will likely beat me
> to it. In the meantime, if you can share the reproducer and/or the trace you
> collected, that will be helpful.
>
> > The actual kernel we've been testing is 4.9.33-rt23, w/ 153fbd1226fb3
> > ("futex: Fix more put_pi_state() vs. exit_pi_state_list() races")
>
> And this does not exhibit the behavior above, correct?
Sorry if I was unclear. This combination _does_ exhibit this incorrect
behavior.
> > cherry-picked w/ PREEMPT_RT_FULL. However, it appears that this issue
> > may affect v4.15-rc1?
>
> And this does?
I only meant that: as far as I can tell the affected codepaths are
mostly the same between v4.9.33-rt23 and v4.15-rc1, as the futex
reworking stuff was cherry-picked back.
We haven't yet tried reproducing on v4.15-rc1, and aren't really at a
place where we can do so quickly. It's unclear whether or not
PREEMPT_RT is required to reproduce.
Thanks!
Julia
On Fri, Dec 01, 2017 at 03:49:01PM -0600, Julia Cartwright wrote:
> On Fri, Dec 01, 2017 at 12:11:15PM -0800, Darren Hart wrote:
> > On Wed, Nov 29, 2017 at 11:56:05AM -0600, Julia Cartwright wrote:
> >
> > > The actual kernel we've been testing is 4.9.33-rt23, w/ 153fbd1226fb3
> > > ("futex: Fix more put_pi_state() vs. exit_pi_state_list() races")
> >
> > And this does not exhibit the behavior above, correct?
>
> Sorry if I was unclear. This combination _does_ exhibit this incorrect
> behavior.
>
> > > cherry-picked w/ PREEMPT_RT_FULL. However, it appears that this issue
> > > may affect v4.15-rc1?
> >
> > And this does?
>
> I only meant that: as far as I can tell the affected codepaths are
> mostly the same between v4.9.33-rt23 and v4.15-rc1, as the futex
> reworking stuff was cherry-picked back.
>
Can you compare the futex and rtmutex related histories and see if there
is possibly something missing from the backport? A diff from current
version of the relevant files would be worth doing as well. There is
enough subtly here, that I'd want to be as confident as possible that we
aren't missing something here.
> We haven't yet tried reproducing on v4.15-rc1, and aren't really at a
> place where we can do so quickly. It's unclear whether or not
> PREEMPT_RT is required to reproduce.
Obviously reproducing on an unmodified upstream (RT or not) would be a
very valuable data point.
>
> Thanks!
> Julia
>
--
Darren Hart
VMware Open Source Technology Center