Date: Thu, 12 Feb 2015 20:32:10 +0100
From: Nicholas Mc Guire
To: Oleg Nesterov
Cc: Davidlohr Bueso, paulmck@linux.vnet.ibm.com, linux-kernel@vger.kernel.org, waiman.long@hp.com, peterz@infradead.org, raghavendra.kt@linux.vnet.ibm.com
Subject: Re: BUG: spinlock bad magic on CPU#0, migration/0/9
Message-ID: <20150212193210.GA7244@opentech.at>
In-Reply-To: <20150212174144.GA21714@redhat.com>

On Thu, 12 Feb 2015, Oleg Nesterov wrote:

> On 02/12, Oleg Nesterov wrote:
> > On 02/11, Davidlohr Bueso wrote:
> > >
> > > On Wed, 2015-02-11 at 16:34 -0800, Paul E. McKenney wrote:
> > > > Hello!
> > > >
> > > > Did an earlier-than-usual port of v3.21 patches to post-v3.19, and
> > > > hit the following on x86_64. This happened after about 15 minutes of
> > > > rcutorture. In contrast, I have been doing successful 15-hour runs
> > > > on v3.19. I will check reproducibility and try to narrow it down.
> > > > Might this be a duplicate of the bug that Raghavendra posted a fix for?
> > > >
> > > > Anyway, this was on 3e8c04eb1174 (Merge branch 'for-3.20' of
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata).
> > > >
> > > > [ 837.287011] BUG: spinlock bad magic on CPU#0, migration/0/9
> > > > [ 837.287013] lock: 0xffff88001ea0fe80, .magic: ffffffff, .owner: g?<81>????/0, .owner_cpu: -42
> > > > [ 837.287013] CPU: 0 PID: 9 Comm: migration/0 Not tainted 3.19.0+ #1
> > > > [ 837.287013] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> > > > [ 837.287013] ffff88001ea0fe80 ffff88001ea0bc78 ffffffff818f6f4b ffffffff810a5a51
> > > > [ 837.287013] ffffffff81e500e0 ffff88001ea0bc98 ffffffff818f3755 ffff88001ea0fe80
> > > > [ 837.287013] ffffffff81ca4396 ffff88001ea0bcb8 ffffffff818f377b ffff88001ea0fe80
> > > > [ 837.287013] Call Trace:
> > > > [ 837.287013] [] dump_stack+0x45/0x57
> > > > [ 837.287013] [] ? console_unlock+0x1f1/0x4c0
> > > > [ 837.287013] [] spin_dump+0x8b/0x90
> > > > [ 837.287013] [] spin_bug+0x21/0x26
> > > > [ 837.287013] [] do_raw_spin_unlock+0x5c/0xa0
> > > > [ 837.287013] [] _raw_spin_unlock_irqrestore+0x27/0x50
> > > > [ 837.287013] [] complete+0x41/0x50
> > >
> > > We did have some recent changes in completions:
> > >
> > > 7c34e318 (sched/completion: Add lock-free checking of the blocking case)
> > > de30ec47 (sched/completion: Remove unnecessary ->wait.lock serialization when reading completion state)
> > >
> > > The second one being more related (although both appear to make sense).
> > > Perhaps some subtle implication in the completion_done side that
> > > disappeared with the spinlock?
> >
> > At first glance both changes look suspicious.
>
> No, sorry, only the 2nd one.
>
> > Unless at least document how
> > you can use these helpers.
> >
> > Consider this code:
> >
> >     void xxx(void)
> >     {
> >         struct completion c;
> >
> >         init_completion(&c);
> >
> >         expose_this_completion(&c);
> >
> >         while (!completion_done(&c))
> >             schedule_timeout_uninterruptible(1);
> >     }
> >
> > Before that change this code was correct, now it is not.
> > Hmm and note that
> > this is what stop_machine_from_inactive_cpu() does although I do not know
> > if this is related or not.
> >
> > Because completion_done() can now race with complete(), the final
> > spin_unlock() can write to the memory after it was freed/reused. In this
> > case it can write to the stack after return.
> >
> > Add CC's.
>
> Nicholas, don't we need something like below?
>
> Oleg.
>
>
> --- x/kernel/sched/completion.c
> +++ x/kernel/sched/completion.c
> @@ -274,7 +274,7 @@ bool try_wait_for_completion(struct comp
>       * first without taking the lock so we can
>       * return early in the blocking case.
>       */
> -    if (!ACCESS_ONCE(x->done))
> +    if (!READ_ONCE(x->done))
>          return 0;
>

Looking at compiler.h, I don't think there would be a difference between
ACCESS_ONCE() and READ_ONCE() in this case: done is an unsigned int here,
so it would be a single read instruction on the PPC440 as well and would
not fall back to the barrier-protected memcpy.

Is the oops reproducible? If it is, could you drop me a few lines on
how to trigger it?

thx!
hofrat
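
For reference, the size-based dispatch in compiler.h that the reply above relies on can be approximated in userspace roughly as follows. This is only a minimal sketch, not the kernel code: sketch_barrier(), sketch_read_once_size(), struct sketch_completion and sketch_done_is_single_load() are hypothetical names introduced here, the return value reporting which path was taken is added purely for illustration, and the 8-byte case assumes a 64-bit build with GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Compiler-only barrier (GCC/Clang syntax); hypothetical stand-in for barrier(). */
#define sketch_barrier() __asm__ __volatile__("" ::: "memory")

/*
 * Hypothetical userspace approximation of a READ_ONCE()-style size dispatch.
 * Returns 1 if the value was read with a single volatile scalar load,
 * 0 if it fell back to the barrier-protected memcpy path.
 */
static inline int sketch_read_once_size(const volatile void *p, void *res,
                                        size_t size)
{
    assert(p != NULL && res != NULL);

    switch (size) {
    case 1:
        *(uint8_t *)res = *(const volatile uint8_t *)p;
        return 1;
    case 2:
        *(uint16_t *)res = *(const volatile uint16_t *)p;
        return 1;
    case 4:
        *(uint32_t *)res = *(const volatile uint32_t *)p;
        return 1;
    case 8: /* assumption: 64-bit build, so this is one load */
        *(uint64_t *)res = *(const volatile uint64_t *)p;
        return 1;
    default:
        /* Anything wider than a machine word is copied between barriers. */
        sketch_barrier();
        memcpy(res, (const void *)p, size);
        sketch_barrier();
        return 0;
    }
}

/* Hypothetical stand-in for the relevant field of struct completion. */
struct sketch_completion {
    unsigned int done;
};

/* Lockless read of ->done; reports whether the scalar path was used. */
static inline int sketch_done_is_single_load(const struct sketch_completion *x,
                                             unsigned int *out)
{
    return sketch_read_once_size(&x->done, out, sizeof(x->done));
}
```

With a 4-byte unsigned int, sketch_done_is_single_load() takes the scalar path, which matches the point that the ACCESS_ONCE() to READ_ONCE() change alone should not alter the generated load for ->done. It says nothing about the completion_done() vs. complete() race itself.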