Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755585Ab2JCJY0 (ORCPT ); Wed, 3 Oct 2012 05:24:26 -0400 Received: from cantor2.suse.de ([195.135.220.15]:37791 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753718Ab2JCJYY (ORCPT ); Wed, 3 Oct 2012 05:24:24 -0400 Date: Wed, 3 Oct 2012 11:24:13 +0200 (CEST) From: Jiri Kosina To: "Srivatsa S. Bhat" Cc: paulmck@linux.vnet.ibm.com, "Paul E. McKenney" , Josh Triplett , linux-kernel@vger.kernel.org Subject: Re: Lockdep complains about commit 1331e7a1bb ("rcu: Remove _rcu_barrier() dependency on __stop_machine()") In-Reply-To: <506BF795.7000604@linux.vnet.ibm.com> Message-ID: References: <506B50F1.8070907@linux.vnet.ibm.com> <506BB283.4010800@linux.vnet.ibm.com> <20121003034405.GB13192@linux.vnet.ibm.com> <506BB950.3000102@linux.vnet.ibm.com> <506BF339.6020201@linux.vnet.ibm.com> <506BF795.7000604@linux.vnet.ibm.com> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3725 Lines: 88 On Wed, 3 Oct 2012, Srivatsa S. Bhat wrote: > >> static void cpu_hotplug_begin(void) > >> { > >> cpu_hotplug.active_writer = current; > >> > >> for (;;) { > >> mutex_lock(&cpu_hotplug.lock); > >> if (likely(!cpu_hotplug.refcount)) <================ This one! > >> break; > >> __set_current_state(TASK_UNINTERRUPTIBLE); > >> mutex_unlock(&cpu_hotplug.lock); > >> schedule(); > >> } > >> } > > > > I acutally just came to the same conclusion (7 hours of sleep later, the > > mind indeed seems to be brighter ... what a poet I am). > > > > Lockdep doesn't know about this semantics of cpu_hotplug_begin(), and > > therefore gets confused by the fact that mutual exclusion is actually > > achieved through the refcount instead of mutex (and the same apparently > > happened to me). > > No, that's not the problem. Lockdep is fine. The calltrace clearly shows that > our refcounting has messed up somewhere. As a result, we really *are* running > a hotplug-reader and a hotplug-writer at the same time! We really need to fix > *that*! So please try the second debug patch I sent just now (with the BUG_ON() > in put_online_cpus()). We need to know who is calling put_online_cpus() twice > and fix that culprit! I don't think so. Lockdep is complaining, because (a) during system bootup, the smp_init() -> cpu_up() -> cpuup_callback() teaches him about hotplug.lock -> slab_mutex dependency (b) many many jiffies later, nf_conntrack_cleanup_net() calls kmem_cache_destroy(), which introduces slab_mutex -> hotplug.lock dependency Lockdep rightfully (from his POV) sees this as potential ABBA, and reports it, it's as simple as that. It has no way of knowing the fact that the ABBA can actually never happen, because of special semantics of cpu_hotplug.refcount and it's handling in cpu_hotplug_begin(). The "neither cpu_up() nor cpu_down() will proceed past cpu_hotplug_begin() until everyone who called get_online_cpus() will call put_online_cpus()" is totally invisible to lockdep. > > So right, now I agree that the deadlock scenario I have come up with is > > indeed bogus (*), and we just have to annotate this fact to lockdep > > somehow. > > Yes, the deadlock scenario is bogus, but the refcounting leak is for real > and needs fixing. With your patch applied, the BUG_ON() in put_online_cpus() didn't trigger for me at all. Which is what I expected. > I'm fine with this, but the real problem is elsewhere, like I mentioned above. > This one is only a good-to-have, not a fix. > > > (*) I have seen machine locking hard reproducibly, but that was only with > > additional Paul's patch, so I guess the lock order there actually was > > wrong > > If refcounting was working fine, Paul's patch wouldn't have caused *any* issues. > With that patch in place, the 2 places where rcu_barrier() get invoked (ie., > kmem_cache_destroy() and deactivate_locked_super()) both start waiting on > get_online_cpus() until the slab cpu hotplug notifier as well as the entire > cpu_up operation completes. Absolutely no problem in that! So the fact that > you are seeing lock-ups here is another indication that the problem is really > elsewhere! I don't agree. The reason why Paul's patch (1331e7a1bb) started to trigger this, is that (b) above doesn't exist in pre-1331e7a1bb kernels. -- Jiri Kosina SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/