Date: Fri, 9 Aug 2013 10:58:48 -0700
From: "Paul E. McKenney"
To: Lai Jiangshan
Cc: Steven Rostedt, Peter Zijlstra, linux-kernel@vger.kernel.org, Dipankar Sarma
Subject: Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site
Message-ID: <20130809175848.GC29406@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References: <1375871104-10688-1-git-send-email-laijs@cn.fujitsu.com> <1375871104-10688-6-git-send-email-laijs@cn.fujitsu.com> <20130808204020.GA31127@linux.vnet.ibm.com> <5204B6EF.6000905@cn.fujitsu.com>
In-Reply-To: <5204B6EF.6000905@cn.fujitsu.com>

On Fri, Aug 09, 2013 at 05:31:27PM +0800, Lai Jiangshan wrote:
> On 08/09/2013 04:40 AM, Paul E. McKenney wrote:
> > On Wed, Aug 07, 2013 at 06:25:01PM +0800, Lai Jiangshan wrote:
> >> Background)
> >>
> >> Although all the articles declare that the RCU read site is
> >> deadlock-immune, this is not true for rcu-preempt: it can deadlock if
> >> the RCU read site overlaps with a scheduler lock.
> >>
> >> Commits ec433f0c, 10f39bb1 and 016a8d5b only partially solve it; the
> >> RCU read site is still not deadlock-immune, and the problem described
> >> in 016a8d5b still exists (rcu_read_unlock_special() calls wake_up()).
> >>
> >> Aim)
> >>
> >> We want to fix the problem for good: keep the RCU read site
> >> deadlock-immune, as the books say.
> >>
> >> How)
> >>
> >> The problem is solved by the rule "if rcu_read_unlock_special() is
> >> called inside any lock which can be (chained) nested in
> >> rcu_read_unlock_special(), we defer rcu_read_unlock_special()".
> >> Such locks include rnp->lock, scheduler locks, perf ctx->lock, locks
> >> taken in printk()/WARN_ON(), and all locks nested (or chained-nested)
> >> in these locks.
> >>
> >> The problem then reduces to "how do we recognize all these locks
> >> (contexts)?".  We do not try to distinguish them individually; we only
> >> know that all of them must be nested inside local_irq_disable().
> >>
> >> So if rcu_read_unlock_special() is called in an irqs-disabled context,
> >> it may be running under one of these suspect locks, and we defer
> >> rcu_read_unlock_special().
> >>
> >> This enlarges the probability of deferring, but that probability is
> >> still very low.
> >>
> >> Deferring does add a small overhead, but it offers us:
> >>	1) true deadlock immunity for the RCU read site
> >>	2) removal of the irq-work overhead (about 250 invocations per
> >>	   second on average)
> >
> > One problem here -- it may take quite some time for a set_need_resched()
> > to take effect.  This is especially a problem for RCU priority boosting,
> > but can also needlessly delay preemptible-RCU grace periods because
> > local_irq_restore() and friends don't check the TIF_NEED_RESCHED bit.
>
> The final effect of deboosting (rt_mutex_unlock()) is also accomplished
> via set_need_resched()/set_tsk_need_resched().
> set_need_resched() is enough for the RCU priority-boosting issue here.

Eventually, yes.  But all that set_need_resched() does is set the
TIF_NEED_RESCHED flag.  That flag is checked by the outermost
preempt_enable(), return from interrupt, return to userspace, and things
like cond_resched().  So it might well be quite some time until the
boosted reader actually gets around to deboosting itself.

> Since rcu_read_unlock_special() is deferred, it does take quite some time
> for the QS report to take effect.

Agreed.

> > OK, alternatives...
> >
> > o	Keep the current rule saying that if the scheduler is going
> >	to exit an RCU read-side critical section while holding
> >	one of its spinlocks, preemption has to have been disabled
>
> Since rtmutex's lock->wait_lock is neither irqs-disabled nor bh-disabled.

Yep, because this rule prevents the call to rt_mutex_unlock() from
happening whenever one of the scheduler's irq-disabled locks is held.

> This kind of spinlock includes the scheduler locks, rtmutex's
> lock->wait_lock, and all locks that can be acquired in irq/softirq
> context.
>
> So this rule does not apply only to scheduler locks; it would have to be
> applied to almost all spinlocks in the kernel.

No, only those locks acquired by the scheduler in the wakeup path.  Or am
I missing something here?

> It was hard for me to accept that the RCU read site is not deadlock-immune.

So did I, see http://lwn.net/Articles/453002/.  RCU was a victim of its
own success.  ;-)

And it would be really cool to restore full deadlock immunity to
rcu_read_unlock(), no two ways about it!

Hmmm...  Any way that a self-IPI could be made safe for all
architectures?
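
For anyone following along in the archives, the recursion we keep talking
about is easy to model outside the kernel.  The following is only a
user-space toy, definitely not kernel code: an error-checking pthread
mutex stands in for the scheduler/rnp lock, and the wake_up() done by
rcu_read_unlock_special() is modeled as re-acquiring that same lock, so
the recursion shows up as EDEADLK instead of a hang:

/* Toy model of the rcu_read_unlock() vs. scheduler-lock recursion.
 * Build with:  gcc -pthread toy.c
 */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

static pthread_mutex_t sched_lock;	/* stands in for an rq/rnp lock */

static void fake_wake_up(void)		/* wake_up() needs the "scheduler lock" */
{
	int err = pthread_mutex_lock(&sched_lock);

	if (err)
		printf("wake_up: %s -- this is the deadlock\n", strerror(err));
	else
		pthread_mutex_unlock(&sched_lock);
}

static void fake_rcu_read_unlock(void)	/* unlock -> _special() -> wake_up() */
{
	fake_wake_up();
}

int main(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutex_init(&sched_lock, &attr);

	pthread_mutex_lock(&sched_lock);	/* the "scheduler" holds its lock... */
	fake_rcu_read_unlock();			/* ...and ends a read-side section */
	pthread_mutex_unlock(&sched_lock);
	return 0;
}

With a normal (non-ERRORCHECK) mutex the second acquisition simply hangs,
which is the in-kernel situation this patch is defending against.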
							Thanx, Paul

> Thanks,
> Lai
>
> >	throughout the full duration of that critical section.
> >	Well, we can certainly do this, but it would be nice to get
> >	rid of this rule.
> >
> > o	Use per-CPU variables, possibly injecting delay.  This has ugly
> >	disadvantages as noted above.
> >
> > o	irq_work_queue() can wait a jiffy (or on some architectures,
> >	quite a bit longer) before actually doing anything.
> >
> > o	raise_softirq() is more immediate and is an easy change, but
> >	adds a softirq vector -- which people are really trying to
> >	get rid of.  Also, wakeup_softirqd() calls things that acquire
> >	the scheduler locks, which is exactly what we were trying to
> >	avoid doing.
> >
> > o	invoke_rcu_core() can invoke raise_softirq() as above.
> >
> > o	IPI to self.  From what I can see, not all architectures
> >	support this.  Easy to fake if you have at least two CPUs,
> >	but not so good from an OS jitter viewpoint...
> >
> > o	Add a check to local_irq_disable() and friends.  I would guess
> >	that this suggestion would not make architecture maintainers
> >	happy.
> >
> > Other thoughts?
> >
> >						Thanx, Paul
> >
> >> Signed-off-by: Lai Jiangshan
> >> ---
> >>  include/linux/rcupdate.h |    2 +-
> >>  kernel/rcupdate.c        |    2 +-
> >>  kernel/rcutree_plugin.h  |   47 +++++++++++++++++++++++++++++++++++++++++----
> >>  3 files changed, 44 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> >> index 4b14bdc..00b4220 100644
> >> --- a/include/linux/rcupdate.h
> >> +++ b/include/linux/rcupdate.h
> >> @@ -180,7 +180,7 @@ extern void synchronize_sched(void);
> >>
> >>  extern void __rcu_read_lock(void);
> >>  extern void __rcu_read_unlock(void);
> >> -extern void rcu_read_unlock_special(struct task_struct *t);
> >> +extern void rcu_read_unlock_special(struct task_struct *t, bool unlock);
> >>  void synchronize_rcu(void);
> >>
> >>  /*
> >> diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
> >> index cce6ba8..33b89a3 100644
> >> --- a/kernel/rcupdate.c
> >> +++ b/kernel/rcupdate.c
> >> @@ -90,7 +90,7 @@ void __rcu_read_unlock(void)
> >>  #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
> >>  	barrier();  /* assign before ->rcu_read_unlock_special load */
> >>  	if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
> >> -		rcu_read_unlock_special(t);
> >> +		rcu_read_unlock_special(t, true);
> >>  	barrier();  /* ->rcu_read_unlock_special load before assign */
> >>  	t->rcu_read_lock_nesting = 0;
> >>  }
> >> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> >> index fc8b36f..997b424 100644
> >> --- a/kernel/rcutree_plugin.h
> >> +++ b/kernel/rcutree_plugin.h
> >> @@ -242,15 +242,16 @@ static void rcu_preempt_note_context_switch(int cpu)
> >>  					       ? rnp->gpnum
> >>  					       : rnp->gpnum + 1);
> >>  		raw_spin_unlock_irqrestore(&rnp->lock, flags);
> >> -	} else if (t->rcu_read_lock_nesting < 0 &&
> >> -		   !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN) &&
> >> -		   t->rcu_read_unlock_special) {
> >> +	} else if (t->rcu_read_lock_nesting == 0 ||
> >> +		   (t->rcu_read_lock_nesting < 0 &&
> >> +		    !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN))) {
> >>
> >>  		/*
> >>  		 * Complete exit from RCU read-side critical section on
> >>  		 * behalf of preempted instance of __rcu_read_unlock().
> >>  		 */
> >> -		rcu_read_unlock_special(t);
> >> +		if (t->rcu_read_unlock_special)
> >> +			rcu_read_unlock_special(t, false);
> >>  	}
> >>
> >>  	/*
> >> @@ -333,7 +334,7 @@ static struct list_head *rcu_next_node_entry(struct task_struct *t,
> >>   * notify RCU core processing or task having blocked during the RCU
> >>   * read-side critical section.
> >>   */
> >> -void rcu_read_unlock_special(struct task_struct *t)
> >> +void rcu_read_unlock_special(struct task_struct *t, bool unlock)
> >>  {
> >>  	int empty;
> >>  	int empty_exp;
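
The short version of how the new "unlock" flag is used (this is only my
compile-checkable restatement of the hunks above and the hunk below, not
code from the patch): deferral is only ever considered on the
__rcu_read_unlock() path, while the context-switch path always finishes
the job, since it is one of the places where the deferred work is picked
up.

/* Stand-alone sketch of the decision made in rcu_read_unlock_special().
 * "irqs_disabled" models irqs_disabled_flags(flags) at call time.
 */
#include <stdbool.h>
#include <stdio.h>

static bool irqs_disabled;

static void read_unlock_special(bool unlock)
{
	if (unlock && irqs_disabled) {
		/* Real patch: set_need_resched(); local_irq_restore(flags); */
		printf("defer (might be under a suspect irq-disabled lock)\n");
		return;
	}
	printf("complete the RCU_READ_UNLOCK_BLOCKED work now\n");
}

int main(void)
{
	irqs_disabled = true;
	read_unlock_special(true);	/* __rcu_read_unlock() path: defers */
	read_unlock_special(false);	/* context-switch path: completes   */
	return 0;
}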
> >> @@ -364,6 +365,42 @@ void rcu_read_unlock_special(struct task_struct *t)
> >>
> >>  	/* Clean up if blocked during RCU read-side critical section. */
> >>  	if (special & RCU_READ_UNLOCK_BLOCKED) {
> >> +		/*
> >> +		 * If the rcu read lock overlaps with a scheduler lock,
> >> +		 * rcu_read_unlock_special() may lead to deadlock:
> >> +		 *
> >> +		 * rcu_read_lock();
> >> +		 * preempt_schedule[_irq]() (on preemption)
> >> +		 * scheduler lock; (or some other lock that can be (chained)
> >> +		 *		    nested in rcu_read_unlock_special()/rnp->lock)
> >> +		 * access and check rcu data
> >> +		 * rcu_read_unlock();
> >> +		 *   rcu_read_unlock_special();
> >> +		 *     wake_up();		DEADLOCK
> >> +		 *
> >> +		 * To avoid all these kinds of deadlock, we should quit
> >> +		 * rcu_read_unlock_special() here and defer it to
> >> +		 * rcu_preempt_note_context_switch() or to the next outermost
> >> +		 * rcu_read_unlock() whenever we think this case might be
> >> +		 * happening.
> >> +		 *
> >> +		 * We cannot know whether the current _special() is nested
> >> +		 * in a scheduler lock or not, but we do know that irqs are
> >> +		 * always disabled in that case.  So we simply quit and defer
> >> +		 * to rcu_preempt_note_context_switch() whenever irqs are
> >> +		 * disabled.
> >> +		 *
> >> +		 * This means we always defer _special() when it is nested
> >> +		 * in an irqs-disabled context, but
> >> +		 *	(special & RCU_READ_UNLOCK_BLOCKED) &&
> >> +		 *	irqs_disabled_flags(flags)
> >> +		 * is still unlikely to be true.
> >> +		 */
> >> +		if (unlikely(unlock && irqs_disabled_flags(flags))) {
> >> +			set_need_resched();
> >> +			local_irq_restore(flags);
> >> +			return;
> >> +		}
> >> +
> >>  		t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BLOCKED;
> >>
> >>  		/*
> >> --
> >> 1.7.4.4
> >>
> >
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/