Message-ID: <1423771164.2046.73.camel@stgolabs.net>
Subject: Re: BUG: spinlock bad magic on CPU#0, migration/0/9
From: Davidlohr Bueso
To: Nicholas Mc Guire
Cc: Oleg Nesterov, paulmck@linux.vnet.ibm.com, linux-kernel@vger.kernel.org,
	waiman.long@hp.com, peterz@infradead.org, raghavendra.kt@linux.vnet.ibm.com
Date: Thu, 12 Feb 2015 11:59:24 -0800
In-Reply-To: <20150212191009.GA26275@opentech.at>
References: <20150212003430.GA28656@linux.vnet.ibm.com>
	<1423710911.2046.50.camel@stgolabs.net>
	<20150212172805.GA20850@redhat.com>
	<20150212174144.GA21714@redhat.com>
	<20150212191009.GA26275@opentech.at>

On Thu, 2015-02-12 at 20:10 +0100, Nicholas Mc Guire wrote:
> On Thu, 12 Feb 2015, Oleg Nesterov wrote:
>
> > On 02/12, Oleg Nesterov wrote:
> > > On 02/11, Davidlohr Bueso wrote:
> > > >
> > > > On Wed, 2015-02-11 at 16:34 -0800, Paul E. McKenney wrote:
> > > > > Hello!
> > > > >
> > > > > Did an earlier-than-usual port of v3.21 patches to post-v3.19, and
> > > > > hit the following on x86_64. This happened after about 15 minutes of
> > > > > rcutorture. In contrast, I have been doing successful 15-hour runs
> > > > > on v3.19. I will check reproducibility and try to narrow it down.
> > > > > Might this be a duplicate of the bug that Raghavendra posted a fix for?
> > > > >
> > > > > Anyway, this was on 3e8c04eb1174 (Merge branch 'for-3.20' of
> > > > > git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata).
> > > > >
> > > > > [ 837.287011] BUG: spinlock bad magic on CPU#0, migration/0/9
> > > > > [ 837.287013] lock: 0xffff88001ea0fe80, .magic: ffffffff, .owner: g?<81>????/0, .owner_cpu: -42
> > > > > [ 837.287013] CPU: 0 PID: 9 Comm: migration/0 Not tainted 3.19.0+ #1
> > > > > [ 837.287013] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> > > > > [ 837.287013]  ffff88001ea0fe80 ffff88001ea0bc78 ffffffff818f6f4b ffffffff810a5a51
> > > > > [ 837.287013]  ffffffff81e500e0 ffff88001ea0bc98 ffffffff818f3755 ffff88001ea0fe80
> > > > > [ 837.287013]  ffffffff81ca4396 ffff88001ea0bcb8 ffffffff818f377b ffff88001ea0fe80
> > > > > [ 837.287013] Call Trace:
> > > > > [ 837.287013]  [] dump_stack+0x45/0x57
> > > > > [ 837.287013]  [] ? console_unlock+0x1f1/0x4c0
> > > > > [ 837.287013]  [] spin_dump+0x8b/0x90
> > > > > [ 837.287013]  [] spin_bug+0x21/0x26
> > > > > [ 837.287013]  [] do_raw_spin_unlock+0x5c/0xa0
> > > > > [ 837.287013]  [] _raw_spin_unlock_irqrestore+0x27/0x50
> > > > > [ 837.287013]  [] complete+0x41/0x50
> > > >
> > > > We did have some recent changes in completions:
> > > >
> > > > 7c34e318 (sched/completion: Add lock-free checking of the blocking case)
> > > > de30ec47 (sched/completion: Remove unnecessary ->wait.lock serialization when reading completion state)
> > > >
> > > > The second one being more related (although both appear to make sense).
> > > > Perhaps some subtle implication in the completion_done side that
> > > > disappeared with the spinlock?
> > >
> > > At first glance both changes look suspicious.
> >
> > No, sorry, only the 2nd one.
> >
> > > Unless at least document how
> > > you can use these helpers.
> > >
> > > Consider this code:
> > >
> > > 	void xxx(void)
> > > 	{
> > > 		struct completion c;
> > >
> > > 		init_completion(&c);
> > >
> > > 		expose_this_completion(&c);
> > >
> > > 		while (!completion_done(&c))
> > > 			schedule_timeout_uninterruptible(1);
>
> But that would not break due to the change - even if completion_done() had a
> problem - completion_done() is not consuming x->done, it is only checking it?

There is, for instance, contention to consider. When this happens:

CPU0				CPU1
complete(x)
   spin_lock_irqsave()
				completion_done(x)
				   load x->done;
   x->done++;
   wakeup

Before the change in completion_done() we would respect the order of the
completion checks with the spinlock ticket, so you wouldn't run into this
issue. The impact would obviously depend on how the waiting for the
completion is done, for instance issuing more sleeps when the lock becomes
very contended.
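
To make that window concrete, here is a quick single-threaded userspace
sketch of the interleaving (plain C, build with -pthread). It is only a
model, not the kernel code: struct fake_completion and the two check
functions are made-up names, the lock-free variant standing in for the
post-de30ec47 completion_done() and the locked one for the old behavior.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_completion {
	pthread_mutex_t lock;	/* plays the role of x->wait.lock */
	unsigned int done;	/* plays the role of x->done */
};

/* Old-style check: serializes behind a complete() already holding the lock. */
static bool fake_completion_done_locked(struct fake_completion *x)
{
	bool ret;

	pthread_mutex_lock(&x->lock);
	ret = x->done != 0;
	pthread_mutex_unlock(&x->lock);
	return ret;
}

/* Lock-free check: can observe ->done from before an in-flight complete(). */
static bool fake_completion_done_lockless(struct fake_completion *x)
{
	return __atomic_load_n(&x->done, __ATOMIC_RELAXED) != 0; /* GCC/Clang builtin */
}

int main(void)
{
	struct fake_completion x = { PTHREAD_MUTEX_INITIALIZER, 0 };

	/*
	 * "CPU0" is inside complete(): lock held, ->done not yet bumped.
	 * This is the window in the diagram above.
	 */
	pthread_mutex_lock(&x.lock);

	/* "CPU1" does the lock-free check in that window and sleeps again. */
	printf("lockless check inside the window: %d\n",
	       (int)fake_completion_done_lockless(&x));	/* prints 0 */

	x.done++;			/* complete() records the completion ... */
	pthread_mutex_unlock(&x.lock);	/* ... wakes waiters and drops the lock  */

	/*
	 * The locked check could not have run inside the window at all: it
	 * queues on the lock and only returns here, where it sees done != 0.
	 */
	printf("locked check after the window:    %d\n",
	       (int)fake_completion_done_locked(&x));	/* prints 1 */
	return 0;
}

With the locked variant, the check in the xxx() loop above cannot slip into
that window; that ordering is what the spinlock ticket was buying us.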