Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752620Ab3IINYL (ORCPT ); Mon, 9 Sep 2013 09:24:11 -0400 Received: from e31.co.us.ibm.com ([32.97.110.149]:43707 "EHLO e31.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751232Ab3IINYJ (ORCPT ); Mon, 9 Sep 2013 09:24:09 -0400 Date: Mon, 9 Sep 2013 06:23:43 -0700 From: "Paul E. McKenney" To: Peter Zijlstra Cc: Frederic Weisbecker , Eric Dumazet , linux-kernel@vger.kernel.org, mingo@elte.hu, laijs@cn.fujitsu.com, dipankar@in.ibm.com, akpm@linux-foundation.org, mathieu.desnoyers@efficios.com, josh@joshtriplett.org, niv@us.ibm.com, tglx@linutronix.de, rostedt@goodmis.org, dhowells@redhat.com, edumazet@google.com, darren@dvhart.com, sbw@mit.edu, cl@linux.com Subject: Re: [PATCH] rcu: Is it safe to enter an RCU read-side critical section? Message-ID: <20130909132343.GN3966@linux.vnet.ibm.com> Reply-To: paulmck@linux.vnet.ibm.com References: <20130905195234.GA20555@linux.vnet.ibm.com> <20130906105934.GF20519@somewhere> <20130906151851.GQ3966@linux.vnet.ibm.com> <1378488088.31445.39.camel@edumazet-glaptop> <20130906174117.GU3966@linux.vnet.ibm.com> <20130906185927.GE2706@somewhere> <20130909105347.GK31370@twins.programming.kicks-ass.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130909105347.GK31370@twins.programming.kicks-ass.net> User-Agent: Mutt/1.5.21 (2010-09-15) X-TM-AS-MML: No X-Content-Scanned: Fidelis XPS MAILER x-cbid: 13090913-8236-0000-0000-00000195C984 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4892 Lines: 101 On Mon, Sep 09, 2013 at 12:53:47PM +0200, Peter Zijlstra wrote: > On Fri, Sep 06, 2013 at 08:59:29PM +0200, Frederic Weisbecker wrote: > > Imagine that you're running on an rcu read side critical section on CPU 0, which > > is not in extended quiescent state. Now you get preempted in the middle of your > > RCU read side critical section (you called rcu_read_lock() but not yet rcu_read_unlock()). > > > > Later on, the task is woken up to be scheduled in CPU 1. If CPU 1 is in extended > > quiescent state because it runs is userspace, it receives a scheduler IPI, > > then schedule_user() is called by the end of the interrupt and in turns calls rcu_user_exit() > > before the task is resumed to the code it was running on CPU 0, in the middle of > > the rcu read side extended quiescent state. > > > > See, the key here is the rcu_user_exit() that restore the CPU on RCU's state machine. > > There are other possible scheduler entrypoints when a CPU runs in user extended quiescent > > state: exception and syscall entries or even preempt_schedule_irq() in case we receive an irq > > in the kernel while we haven't yet reached the call to rcu_user_exit()... All of these should > > be covered, otherwise you bet RCU would be prompt to warn. > > > > That's why when we call rcu_is_cpu_idle() from an RCU read side critical section, it's legit even > > if we can be preempted anytime around it. > > And preempt_disable() is probably not even necessary, except perhaps if __get_cpu_var() itself > > relies on non-preemptibility for its own correctness on the address calculation. > > I've tried reading that trice now, still not making much sense. Sorry, Frederic really is describing what is going on here. And it really does work. > In any case rcu_is_cpu_idle() is complete bollocks, either use > __raw_get_cpu_var() and add a _coherent_ explanation for why its right, > or its broken. Hmmmm... Adding Christoph Lameter on CC, since he was the one pushing for the current formulation of that line of rcu_is_cpu_idle(). And guys, I have to say that the advice on which per-CPU primitive to use varies wildly and randomly. For all I know, each of you individually might well be sticking to the same story, but taken together, your collective advice is strongly resembling white noise. It is not that the primitives themselves are changing that quickly: __raw_get_cpu_var() has been around for three years. > In any case the preempt_disable/enable pair there is just plain wrong as > Eric pointed out. Peter, in the general case, you are quite correct. But this is a special case where it really does work. The key point here is that preemption and migration cannot move a task from a CPU to which RCU is paying attention to a CPU that RCU is ignoring. So yes, by the time the task sees the return value from rcu_is_cpu_idle(), that task might be running on some other CPU. But that is OK, because if RCU was paying attention to the old CPU, then RCU must also be paying attention to the new CPU. Frederic's description gives the details of how this is enforced. Here is an example of how this works: 1. Some task running on a CPU 0 (which RCU is paying attention to) calls rcu_is_cpu_idle(), which disables preemption, checks the per-CPU variable, sets ret to zero, then enables preemption. At this point, the task is preempted by some high-priority task. 2. CPU 1 is currently idle, so RCU is -not- paying attention to it. However, it is decided that our low-priority task should migrate to CPU 1. 3. CPU 1 is sent an IPI, which forces this CPU out of idle. This causes rcu_idle_exit() to be called, which causes RCU to start paying attention to CPU 1. 4. CPU 1 switches to the low-priority task, which now sees the return value of rcu_is_cpu_idle(). Now, this return value did in fact reflect the old state of CPU 0, and the state of CPU 0 might have changed. (For example, the high-priority task might have blocked, so that CPU 0 is now idle, which in turn would mean that RCU is no longer paying attention to it, so that if rcu_is_cpu_idle() was called right now, it would return true rather than the false return computed in step 1 above.) 5. But that is OK. Because of the way RCU and idle interact, if a call from a given task to rcu_is_cpu_idle() returned false some time in the past, a call from that same task will also return false right now. So yes, in general it is wrong to disable preemption, grab the value of a per-CPU variable, re-enable preemption, and then return the result. But there are a number of special cases where it is OK, and this is one of them. Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/