Date: Mon, 7 Apr 2014 17:03:37 +0200
From: Peter Zijlstra
To: Michele Ballabio
Cc: linux-kernel@vger.kernel.org, toralf.foerster@gmx.de, fweisbec@gmail.com, mingo@kernel.org, Steven Rostedt
Subject: Re: Bisected KVM hang on x86-32 between v3.12 and v3.13
Message-ID: <20140407150337.GO10526@twins.programming.kicks-ass.net>
In-Reply-To: <5341707F.5000406@katamail.com>

On Sun, Apr 06, 2014 at 05:19:27PM +0200, Michele Ballabio wrote:
> Toralf Förster reported this in
> http://article.gmane.org/gmane.linux.kernel/1662567
> http://article.gmane.org/gmane.linux.kernel/1658422
> http://article.gmane.org/gmane.linux.kernel/1657962
>
> "The issue happens here at a 32 bit stable Gentoo Linux if
> I try to start a KVM image. Kernels 3.12.X works fine,
> kernel >= v3.13 will hang shortly after I started the image
> with the virtual-manager. The last syslog messages are
> something like:
> Feb 28 16:22:00 n22 kernel: INFO: rcu_sched detected stalls
> on CPUs/tasks: {} (detected by 2, t=60002 jiffies,
> g=14689, c=14688, q=21051)
> Feb 28 16:22:00 n22 kernel: INFO: Stall ended before state
> dump start"
>
> He correctly pointed out that the bisection blamed the merge
> commit 37bf06375c90a42fe07b9bebdb07bc316ae5a0ce
> "Merge tag 'v3.12-rc4' into sched/core".
>
> This bug is obviously caused by at least two patches, one
> on each side of the merge, that only when combined together
> (at that merge point) cause the bug in kvm. By rebasing
> the "sched/core" branch on "master" before the merge and
> going on with the bisection, I found commit
> 3e8e42c69bb7d9fc12ebc23ff308e8523a2a59a0
> "sched: Revert need_resched() to look at TIF_NEED_RESCHED"
> as one of the causes. The other patch that contributes to the
> bug is commit ded797547548a5b8e7b92383a41e4c0e6b0ecb7f
> "irq: Force hardirq exit's softirq processing on its own stack".
>
> Reverting either one of them solves the problem reported with kvm,
> but revert is probably not the correct answer.
>
> I wonder if the solution is as simple as this:
>
> --->8---
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 0af5250..f3b985d 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -126,6 +126,7 @@ config X86
>  	select RTC_LIB
>  	select HAVE_DEBUG_STACKOVERFLOW
>  	select HAVE_IRQ_EXIT_ON_IRQ_STACK if X86_64
> +	select HAVE_IRQ_EXIT_ON_IRQ_STACK if X86_32
>  	select HAVE_CC_STACKPROTECTOR

Ohh ahh.. shiny!

So what I suspect at this point is that because i386 and x86_64 have a
difference in current_thread_info() (i386 is stack based), we end up
setting the TIF_NEED_RESCHED bit on the wrong stack.

Now I have some vague memories of propagating the TIF flags on stack
switch, but I cannot remember what arch we did that for. Let me stare at
this a little more.

Also, IFF this is the case, then the fingered patch above (and your
suggested 'fix') aren't the real culprit/cure but simply make it more/less
likely to happen.

Now, Steve had a patch somewhere that would make i386 use per-cpu
variables for current_thread_info() just like x86_64 already does I
think. Let me go find it too.
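To make the suspicion above concrete, here is a rough user-space sketch
of how a stack-based thread_info lookup can put a flag on the wrong
stack. This is not kernel code. The helper names (thread_info_from_sp,
alloc_stack, flag_lost_across_stack_switch) are made up for
illustration, and the 8 KiB THREAD_SIZE and the bit number are
assumptions. The only point is that masking the stack pointer finds
whichever stack you happen to be running on.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define THREAD_SIZE      8192u  /* assumption: 8 KiB, size-aligned stacks */
#define TIF_NEED_RESCHED 3      /* illustrative bit number */

/* Toy stand-in for the structure kept at the base of each stack. */
struct thread_info {
	unsigned long flags;
};

/* Stack-based lookup: round the stack pointer down to the stack base. */
static struct thread_info *thread_info_from_sp(const void *sp)
{
	return (struct thread_info *)((uintptr_t)sp &
				      ~((uintptr_t)THREAD_SIZE - 1));
}

/* Allocate a zeroed, THREAD_SIZE-aligned block to play the role of a stack. */
static char *alloc_stack(void)
{
	char *stack = aligned_alloc(THREAD_SIZE, THREAD_SIZE);

	assert(stack != NULL);
	memset(stack, 0, THREAD_SIZE);
	return stack;
}

static void set_need_resched(const void *sp)
{
	thread_info_from_sp(sp)->flags |= 1UL << TIF_NEED_RESCHED;
}

static int need_resched(const void *sp)
{
	return (thread_info_from_sp(sp)->flags >> TIF_NEED_RESCHED) & 1;
}

/*
 * Set the flag while "running" on a separate IRQ stack, then test it
 * from the task stack. Returns 1 when the flag is visible on the IRQ
 * stack but not on the task stack, i.e. it landed on the wrong stack.
 */
static int flag_lost_across_stack_switch(void)
{
	char *task_stack = alloc_stack();
	char *irq_stack = alloc_stack();
	int lost;

	/* Pretend sp sits somewhere near the top of each stack. */
	set_need_resched(irq_stack + THREAD_SIZE - 64);
	lost = need_resched(irq_stack + THREAD_SIZE - 64) &&
	       !need_resched(task_stack + THREAD_SIZE - 64);

	free(irq_stack);
	free(task_stack);
	return lost;
}
```

With a per-cpu lookup instead of the mask, both calls would resolve to
the same thread_info regardless of which stack is active, which is why
the per-cpu approach mentioned above would sidestep this.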