Date: Sat, 13 Dec 2014 09:19:53 +0100
From: Ingo Molnar
To: Linus Torvalds
Cc: Dave Jones, Chris Mason, Mike Galbraith, Peter Zijlstra,
	Dâniel Fraga, Sasha Levin, "Paul E. McKenney",
	Linux Kernel Mailing List
Subject: Re: frequent lockups in 3.18rc4
Message-ID: <20141213081953.GG32572@gmail.com>
References: <20141203184111.GA32005@redhat.com>
	<20141205171501.GA1320@redhat.com>
	<1417806247.4845.1@mail.thefacebook.com>
	<20141211145408.GB16800@redhat.com>
	<20141212185454.GB4716@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

* Linus Torvalds wrote:

> On Fri, Dec 12, 2014 at 10:54 AM, Dave Jones wrote:
> >
> > Something that's still making me wonder if it's some kind of
> > hardware problem is the non-deterministic nature of this bug.
>
> I'd expect it to be a race condition, though. Which can easily
> cause these kinds of issues, and the timing will be pretty
> random even if the load is very regular.
>
> And we know that the scheduler has an integer overflow under
> Sasha's loads, although I didn't hear anything from Ingo and
> friends about it. Ingo/Peter, you were cc'd on that report,
> where at least one of the multiplications in wake_affine() ended
> up overflowing..

Just to make sure, is there any wake_affine report other than
the one in this thread? (I tried a wake_affine full text search
on my inbox and didn't find anything that appeared relevant.)

> Some scheduler thing that overflows only under heavy load, and
> screws up scheduling could easily account for the RCU thread
> thing. I see it *less* easily accounting for DaveJ's case,
> though, because the watchdog is running at RT priority, and the
> scheduler would have to screw up much more to then not schedule
> an RT task, but..

Yeah, the RT scheduler is harder (but not impossible) to confuse
due to its simplicity, but scheduler counts overflowing could
definitely cause all sorts of trouble and make debugging harder,
so we want to fix it regardless of its likelihood of causing
lockups.

Thanks,

	Ingo