Date: Fri, 1 Dec 2017 15:49:01 -0600
From: Julia Cartwright
To: Darren Hart
CC: Thomas Gleixner, Peter Zijlstra, Gratian Crisan, Ingo Molnar
Subject: Re: PI futexes + lock stealing woes
Message-ID: <20171201214901.GB32696@jcartwri.amer.corp.natinst.com>
References: <20171129175605.GA863@jcartwri.amer.corp.natinst.com> <20171201201115.GB18881@fury>
In-Reply-To: <20171201201115.GB18881@fury>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Dec 01, 2017 at 12:11:15PM -0800, Darren Hart wrote:
> On Wed, Nov 29, 2017 at 11:56:05AM -0600, Julia Cartwright wrote:
> > Hey Thomas, Peter-
> >
> > Gratian and I have been debugging into a nasty and difficult race w/
> > futexes seemingly the culprit.  The original symptom we were seeing
> > was a seemingly spurious -EDEADLK from a futex(LOCK_PI) operation.
> >
> > On further analysis, however, it appears the thread which gets the
> > spurious -EDEADLK has observed a weird futex state: a prior
> > futex(WAIT_REQUEUE_PI) operation has returned -ETIMEDOUT, but the uaddr2
> > futex word owner field indicates that it's the owner.
> >
>
> Do you have a reproducer you can share?

We have a massive application which seems to reproduce it in 8 hours or
so, but it's not in a state to be shared :(.  So far, every attempt at
creating a simple, smaller reproducing case has failed.  We're still
trying, though :(.

One debugging technique we're trying to employ as well now that we think
we have a handle on the race is to pry the race window open with some
strategically placed spinning (or fixed-period sleeping).  Hopefully
that will make it easier to reproduce ...

> > Here's an attempt to boil down this situation into a pseudo trace; I'm
> > happy to forward along the full traces as well, if that would be
> > helpful:
>
> Please do forward the full trace

Will do.  Chances are they are large enough to bounce from LKML, but
I'll send them along privately.
> >
> > waiter                          waker                           stealer (prio > waiter)
> >
> > futex(WAIT_REQUEUE_PI, uaddr, uaddr2,
> >       timeout=[N ms])
> >    futex_wait_requeue_pi()
> >       futex_wait_queue_me()
> >          freezable_schedule()
> >
> >                                 futex(LOCK_PI, uaddr2)
> >                                 futex(CMP_REQUEUE_PI, uaddr,
> >                                       uaddr2, 1, 0)
> >                                    /* requeues waiter to uaddr2 */
> >                                 futex(UNLOCK_PI, uaddr2)
> >                                    wake_futex_pi()
> >                                       cmp_futex_value_locked(uaddr, waiter)

minor fix: the above should have been:

	cmp_futex_value_locked(uaddr2, waiter)

> >                                       wake_up_q()
> >
> >          clears sleeper->task>
> >                                                                 futex(LOCK_PI, uaddr2)
> >                                                                    __rt_mutex_start_proxy_lock()
> >                                                                       try_to_take_rt_mutex() /* steals lock */
> >                                                                          rt_mutex_set_owner(lock, stealer)
> >
> >
> >    rt_mutex_wait_proxy_lock()
> >       __rt_mutex_slowlock()
> >          try_to_take_rt_mutex() /* fails, lock held by stealer */
> >          if (timeout && !timeout->task)
> >             return -ETIMEDOUT;
> >       fixup_owner()
> >          /* lock wasn't acquired, so,
> >             fixup_pi_state_owner skipped */
> >    return -ETIMEDOUT;
> >
> > /* At this point, we've returned -ETIMEDOUT to userspace, but the
> >  * futex word shows waiter to be the owner, and the pi_mutex has
> >  * stealer as the owner */
>
> eeeeeeewwwweeee

Indeed. :(

> > futex_lock(LOCK_PI, uaddr2)
> >   -> bails with EDEADLK, futex word says we're owner.
> >
> > At some later point in execution, the stealer gets scheduled back in and
> > will do fixup_owner() which fixes up the futex word, but at that point
> > it's too late: the waiter has already observed the wonky state.
> >
> > fixup_owner() used to have additional seemingly relevant checks in place
> > that were removed 73d786bd043eb ("futex: Rework inconsistent
> > rt_mutex/futex_q state").
>
> This and the subsequent changes moving some of this out from under the hb->lock
> are interesting - and were quite fun to review at the time. Hrm.
>
> I'll continue paging this stuff in, although I suspect Peter will likely beat me
> to it.
> In the meantime, if you can share the reproducer and/or the trace you
> collected, that will be helpful.
>
> > The actual kernel we've been testing is 4.9.33-rt23, w/ 153fbd1226fb3
> > ("futex: Fix more put_pi_state() vs. exit_pi_state_list() races")
>
> And this does not exhibit the behavior above, correct?

Sorry if I was unclear.  This combination _does_ exhibit this incorrect
behavior.

> > cherry-picked w/ PREEMPT_RT_FULL.  However, it appears that this issue
> > may affect v4.15-rc1?
>
> And this does?

I only meant that: as far as I can tell the affected codepaths are
mostly the same between v4.9.33-rt23 and v4.15-rc1, as the futex
reworking stuff was cherry-picked back.

We haven't yet tried reproducing on v4.15-rc1, and aren't really at a
place where we can do so quickly.

It's unclear whether or not PREEMPT_RT is required to reproduce.

Thanks!

   Julia
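
For reference, the inconsistent state described in this thread (a failed
futex(WAIT_REQUEUE_PI) while the uaddr2 futex word still names the caller as
owner) could be checked for from userspace with something like the sketch
below. This is only an illustration under stated assumptions: the helper
names pi_futex_owner() and pi_futex_owned_by() are hypothetical, and the bit
masks are defined locally to mirror the values in <linux/futex.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit layout of a PI futex word; values mirror <linux/futex.h>. */
#ifndef FUTEX_TID_MASK
#define FUTEX_WAITERS    0x80000000u
#define FUTEX_OWNER_DIED 0x40000000u
#define FUTEX_TID_MASK   0x3fffffffu
#endif

/* Hypothetical helper: extract the owner TID from a PI futex word. */
static inline uint32_t pi_futex_owner(uint32_t word)
{
	return word & FUTEX_TID_MASK;
}

/*
 * Hypothetical helper: does the futex word name self_tid as the owner?
 * A TID of 0 means "unowned", so it is rejected as a caller identity.
 */
static inline bool pi_futex_owned_by(uint32_t word, uint32_t self_tid)
{
	assert(self_tid != 0);
	return pi_futex_owner(word) == self_tid;
}
```

If pi_futex_owned_by() is called with the current value of the uaddr2 word and
the caller's own TID right after futex(WAIT_REQUEUE_PI) has failed with
ETIMEDOUT, a true result would correspond to the wonky state in the pseudo
trace: the wait failed, yet the futex word says the caller is the owner.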