Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753831AbcC1OPk (ORCPT ); Mon, 28 Mar 2016 10:15:40 -0400 Received: from mail.efficios.com ([78.47.125.74]:53362 "EHLO mail.efficios.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752202AbcC1OPh (ORCPT ); Mon, 28 Mar 2016 10:15:37 -0400 Date: Mon, 28 Mar 2016 14:15:29 +0000 (UTC) From: Mathieu Desnoyers To: Peter Zijlstra Cc: "Paul E. McKenney" , "Chatre, Reinette" , Jacob Pan , Josh Triplett , Ross Green , John Stultz , Thomas Gleixner , lkml , Ingo Molnar , Lai Jiangshan , dipankar@in.ibm.com, Andrew Morton , rostedt , David Howells , Eric Dumazet , Darren Hart , =?utf-8?B?RnLDqWTDqXJpYw==?= Weisbecker , Oleg Nesterov , pranith kumar Message-ID: <1085262213.37823.1459174529625.JavaMail.zimbra@efficios.com> In-Reply-To: <20160328061350.GC6344@twins.programming.kicks-ass.net> References: <20160318235641.GH4287@linux.vnet.ibm.com> <20160327013456.GX4287@linux.vnet.ibm.com> <702204510.37291.1459086535844.JavaMail.zimbra@efficios.com> <20160327154018.GA4287@linux.vnet.ibm.com> <20160327204559.GV6356@twins.programming.kicks-ass.net> <683720290.37511.1459129458781.JavaMail.zimbra@efficios.com> <1124128277.37541.1459131825120.JavaMail.zimbra@efficios.com> <20160328061350.GC6344@twins.programming.kicks-ass.net> Subject: Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Originating-IP: [78.47.125.74] X-Mailer: Zimbra 8.6.0_GA_1178 (ZimbraWebClient - FF45 (Linux)/8.6.0_GA_1178) Thread-Topic: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17 Thread-Index: PIcXpOLrY5NmeNGLpEw1WmFnkXWWfw== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5154 Lines: 147 ----- On Mar 28, 2016, at 2:13 AM, Peter Zijlstra peterz@infradead.org wrote: > On Mon, Mar 28, 2016 at 02:23:45AM +0000, Mathieu Desnoyers wrote: > >> >> But, you need hotplug for this to happen, right? >> > >> > My understanding is that this seems to be detection of failures to be >> > awakened for a long time on idle CPUs. It therefore seems to be more >> > idle-related than cpu hotplug-related. I'm not saying that there is >> > no issue with hotplug, just that the investigation so far seems to >> > target mostly idle systems, AFAIK without stressing hotplug. > > Paul has stated that without hotplug he cannot trigger this. > >> > set_nr_if_polling() returns true if the ti->flags read has the >> > _TIF_NEED_RESCHED bit set, which will skip the IPI. > > POLLING_NR, as per your later comment > >> > But it seems weird. The side that calls set_nr_if_polling() >> > does the following: >> > 1) llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list) >> > 2) set_nr_if_polling(rq->idle) >> > 3) (don't do smp_send_reschedule(cpu) since set_nr_if_polling() returned >> > true) >> > >> > The idle loop does: >> > 1) __current_set_polling() >> > 2) __current_clr_polling() >> > 3) smp_mb__after_atomic() >> > 4) sched_ttwu_pending() >> > 5) schedule_preempt_disabled() >> > -> This will clear the TIF_NEED_RESCHED flag >> > >> > While the idle loop is in sched_ttwu_pending(), after >> > it has done the llist_del_all() (thus has grabbed all the >> > list entries), TIF_NEED_RESCHED is still set. > >> > If both list_all and > > llist_add() ? Yes, indeed. > >> > set_nr_if_polling() are called right after the llist_del_all(), we >> > will end up in a situation where we have an entry in the list, but >> > there won't be any reschedule sent on the idle CPU until something >> > else awakens it. On a _very_ idle CPU, this could take some time. > > Can't happen, as per clearing of POLLING_NR before doing llist_del_all() > and the latter being a full memory barrier. > >> > set_nr_and_not_polling() don't seem to have the same issue, because >> > it does not return true if TIF_NEED_RESCHED is observed as being >> > already set: it really just depends on the state of the TIF_POLLING_NRFLAG >> > bit. >> > >> > Am I missing something important ? >> >> Well, it seems that the test for _TIF_POLLING_NRFLAG in set_nr_if_polling() >> just before the test for _TIF_NEED_RESCHED should take care of it: while in >> sched_ttwu_pending within the idle loop, the TIF_POLLING_NRFLAG should be >> cleared, thus causing set_nr_if_polling to return false. > > Right, clue in the name: Set NEED_RESCHED _IF_ POLLING_NR (is set). > >> I'm slightly concerned about the lack of smp_mb__after_atomic() >> between the TIF_NEED_RESCHED flag being cleared within schedule_preempt_disabled >> and the TIF_POLLING_NRFLAG being set in the following loop. Indeed, clear_bit() >> does not have a compiler barrier, > > Urgh, it really should, as all atomic ops. set_bit() very much has a > memory clobber in, see below. Yes, I'd be more comfortable with the memory clobber in the clear_bit too, but theoretically it *should* not matter, because we have a clobber in set_bit, and clear_bit has a +m memory operand. > >> nor processor-level memory barriers >> (of course, the processor memory barrier should not really matter on >> x86-64 due to lock prefix). > > Right. > >> Moreover, TIF_NEED_RESCHED is bit 3 on x86-64, >> whereas TIF_POLLING_NRFLAG is bit 21. Those are in two different bytes of >> the thread flags, and thus set/cleared as different addresses by clear_bit() >> acting on an immediate "nr" argument. >> >> If we have any state where TIF_POLLING_NRFLAG is set before TIF_NEED_RESCHED >> is cleared within the idle thread, we could end up missing a needed resched IPI. > > Yes, that would be bad. No objection to adding smp_mb__before_atomic() > before the initial __current_set_polling(). Although that's not going to > make a difference for x86_64 as you already noted. Yep. > >> Another question: why are set_nr_if_polling and set_nr_and_not_polling two >> different implementations ? > > Because they're fundamentally two different things. The one > conditionally sets NEED_RESCHED, the other unconditionally sets it. Got it, makes sense. Thanks! Mathieu > >> Could they be combined ? > > Can, yes, will not be pretty nor clear code though. > > > --- > arch/x86/include/asm/bitops.h | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h > index 7766d1cf096e..5345784d5e41 100644 > --- a/arch/x86/include/asm/bitops.h > +++ b/arch/x86/include/asm/bitops.h > @@ -112,11 +112,13 @@ clear_bit(long nr, volatile unsigned long *addr) > if (IS_IMMEDIATE(nr)) { > asm volatile(LOCK_PREFIX "andb %1,%0" > : CONST_MASK_ADDR(nr, addr) > - : "iq" ((u8)~CONST_MASK(nr))); > + : "iq" ((u8)~CONST_MASK(nr)) > + : "memory"); > } else { > asm volatile(LOCK_PREFIX "btr %1,%0" > : BITOP_ADDR(addr) > - : "Ir" (nr)); > + : "Ir" (nr) > + : "memory"); > } > } -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com