Date: Sun, 14 Jun 2015 17:40:30 -0700
From: "Paul E. McKenney"
To: Oleg Nesterov
Cc: Ingo Molnar, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Andy Lutomirski, Andrew Morton, Denys Vlasenko, Brian Gerst,
    Peter Zijlstra, Borislav Petkov, "H. Peter Anvin", Linus Torvalds,
    Thomas Gleixner, Waiman Long
Subject: Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
Message-ID: <20150615004030.GK3913@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
In-Reply-To: <20150614193825.GA19582@redhat.com>

On Sun, Jun 14, 2015 at 09:38:25PM +0200, Oleg Nesterov wrote:
> On 06/14, Oleg Nesterov wrote:
> >
> > On 06/14, Ingo Molnar wrote:
> > >
> > > * Oleg Nesterov wrote:
> > >
> > > > > +        spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
> > > >                                     ^^^^^^^^^^^^^^^^^^^^^^^
> > > >
> > > > Hmm, but it doesn't if PREEMPT_RCU?
> > > > No, no, I do not pretend I understand how it
> > > > actually works ;) But, say, rcu_check_callbacks() can be called from irq and
> > > > since spin_lock() doesn't increment current->rcu_read_lock_nesting this can lead
> > > > to rcu_preempt_qs()?
> > >
> > > No, RCU grace periods are still defined by 'heavy' context boundaries such as
> > > context switches, entering idle or user-space mode.
> > >
> > > PREEMPT_RCU is like traditional RCU, except that blocking is allowed within the
> > > RCU read critical section - that is why it uses a separate nesting counter
> > > (current->rcu_read_lock_nesting), not the preempt count.
> >
> > Yes.
> >
> > > But if a piece of kernel code is non-preemptible, such as a spinlocked region or
> > > an irqs-off region, then those are still natural RCU read lock regions, regardless
> > > of the RCU model, and need no additional RCU locking.
> >
> > I do not think so. Yes I understand that rcu_preempt_qs() itself doesn't
> > finish the gp, but if there are no other rcu-read-lock holders then it
> > seems synchronize_rcu() on another CPU can return _before_ spin_unlock(),
> > this CPU no longer needs rcu_preempt_note_context_switch().
> >
> > OK, I can be easily wrong, I do not really understand the implementation
> > of PREEMPT_RCU. Perhaps preempt_disable() can actually act as rcu_read_lock()
> > with the _current_ implementation. Still this doesn't look right even if
> > happens to work, and Documentation/RCU/checklist.txt says:
> >
> > 	11.	Note that synchronize_rcu() -only- guarantees to wait until
> > 		all currently executing rcu_read_lock()-protected RCU read-side
> > 		critical sections complete.  It does -not- necessarily guarantee
> > 		that all currently running interrupts, NMIs, preempt_disable()
> > 		code, or idle loops will complete.  Therefore, if your
> > 		read-side critical sections are protected by something other
> > 		than rcu_read_lock(), do -not- use synchronize_rcu().
> >
> I've even checked this ;) I applied the stupid patch below and then
>
> 	$ taskset 2 perl -e 'syscall 157, 666, 5000' &
> 	[1] 565
>
> 	$ taskset 1 perl -e 'syscall 157, 777'
>
> 	$
> 	[1]+  Done                    taskset 2 perl -e 'syscall 157, 666, 5000'
>
> 	$ dmesg -c
> 	SPIN start
> 	SYNC start
> 	SYNC done!
> 	SPIN done!

Please accept my apologies for my late entry to this thread.
Youngest kid graduated from university this weekend, so my
attention has been elsewhere.

If you were to disable interrupts instead of preemption, I would expect
that the preemptible-RCU grace period would be blocked -- though I am
not particularly comfortable with people relying on disabled interrupts
blocking a preemptible-RCU grace period.

Here is what can happen if you try to block a preemptible-RCU grace
period by disabling preemption, assuming that there are at least two
online CPUs in the system:

1.	CPU 0 does spin_lock(), which disables preemption.

2.	CPU 1 starts a grace period.

3.	CPU 0 takes a scheduling-clock interrupt.  It raises softirq,
	and the RCU_SOFTIRQ handler notes that there is a new grace
	period and sets state so that a subsequent quiescent state on
	this CPU will be noted.

4.	CPU 0 takes another scheduling-clock interrupt, which checks
	current->rcu_read_lock_nesting, and notes that there is no
	preemptible-RCU read-side critical section in progress.  It
	again raises softirq, and the RCU_SOFTIRQ handler reports
	the quiescent state to core RCU.

5.	Once each of the other CPUs reports a quiescent state, the
	grace period can end, despite CPU 0 having preemption
	disabled the whole time.

So Oleg's test is correct: disabling preemption is not sufficient
to block a preemptible-RCU grace period.

The usual suggestion would be to add rcu_read_lock() just after the
lock is acquired and rcu_read_unlock() just before each release of that
same lock.  Putting the entire RCU read-side critical section under
the lock prevents RCU from having to invoke rcu_read_unlock_special()
due to preemption.
(It might still invoke it if the RCU read-side critical section
was overly long, but that is much cheaper than the preemption-handling
case.)

							Thanx, Paul

> Oleg.
>
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2049,6 +2049,9 @@ static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
>  }
>  #endif
>  
> +#include
> +
> +
>  SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		unsigned long, arg4, unsigned long, arg5)
>  {
> @@ -2062,6 +2065,19 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  
>  	error = 0;
>  	switch (option) {
> +	case 666:
> +		preempt_disable();
> +		pr_crit("SPIN start\n");
> +		while (arg2--)
> +			mdelay(1);
> +		pr_crit("SPIN done!\n");
> +		preempt_enable();
> +		break;
> +	case 777:
> +		pr_crit("SYNC start\n");
> +		synchronize_rcu();
> +		pr_crit("SYNC done!\n");
> +		break;
>  	case PR_SET_PDEATHSIG:
>  		if (!valid_signal(arg2)) {
>  			error = -EINVAL;
>
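For anyone who wants to poke at the five-step sequence in the reply above
without building a kernel, here is a rough userspace toy model of it.
This is only a sketch under loud assumptions: every toy_* name, the
struct toy_cpu layout, and gp_ends_under_lock() are made up for
illustration, none of this is kernel code, and the real preemptible-RCU
state machine is far more involved.  The one thing the model tries to
capture is that the tick path looks at rcu_read_lock_nesting and never
at the preempt count.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names that mirror kernel concepts. */
struct toy_cpu {
	int preempt_count;		/* > 0 while preemption is disabled */
	int rcu_read_lock_nesting;	/* > 0 inside an RCU read-side section */
	bool qs_pending;		/* step 3: new grace period was noted */
	bool qs_reported;		/* step 4: quiescent state was reported */
};

/* Stand-ins for the lock primitives: they only adjust counters. */
static void toy_spin_lock(struct toy_cpu *c)
{
	c->preempt_count++;
}

static void toy_spin_unlock(struct toy_cpu *c)
{
	assert(c->preempt_count > 0);
	c->preempt_count--;
}

static void toy_rcu_read_lock(struct toy_cpu *c)
{
	c->rcu_read_lock_nesting++;
}

static void toy_rcu_read_unlock(struct toy_cpu *c)
{
	assert(c->rcu_read_lock_nesting > 0);
	c->rcu_read_lock_nesting--;
}

/*
 * Scheduling-clock interrupt on one CPU.  The first tick after a grace
 * period starts only notes it (step 3).  A later tick reports a
 * quiescent state if no RCU read-side critical section is in progress
 * (step 4).  Note that preempt_count is never consulted here.
 */
static void toy_sched_tick(struct toy_cpu *c, bool gp_in_progress)
{
	if (!gp_in_progress || c->qs_reported)
		return;
	if (!c->qs_pending) {
		c->qs_pending = true;
		return;
	}
	if (c->rcu_read_lock_nesting == 0)
		c->qs_reported = true;
}

/*
 * Returns true if the grace period could end while CPU 0 still holds
 * the lock.  Assumes every other CPU has already reported (step 5).
 */
bool gp_ends_under_lock(bool use_rcu_read_lock)
{
	struct toy_cpu cpu0 = { 0 };
	bool ended;

	toy_spin_lock(&cpu0);			/* step 1 */
	if (use_rcu_read_lock)
		toy_rcu_read_lock(&cpu0);	/* the usual suggestion */

	/* Step 2: CPU 1 starts a grace period; CPU 0 then takes two ticks. */
	toy_sched_tick(&cpu0, true);		/* step 3 */
	toy_sched_tick(&cpu0, true);		/* step 4 */
	ended = cpu0.qs_reported;		/* step 5 */

	if (use_rcu_read_lock)
		toy_rcu_read_unlock(&cpu0);
	toy_spin_unlock(&cpu0);
	return ended;
}
```

In this model gp_ends_under_lock(false) is true, which matches the
"SYNC done!" before "SPIN done!" ordering in the dmesg output, while
gp_ends_under_lock(true) is false because the nonzero nesting count
keeps the second tick from reporting a quiescent state.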