Date: Sun, 21 Dec 2014 13:22:21 -0800
Subject: Re: frequent lockups in 3.18rc4
From: Linus Torvalds
To: Thomas Gleixner
Cc: Dave Jones, Chris Mason, Mike Galbraith, Ingo Molnar,
    Peter Zijlstra, Dâniel Fraga, Sasha Levin, "Paul E. McKenney",
    Linux Kernel Mailing List, Suresh Siddha, Oleg Nesterov, Peter Anvin
References: <20141218051327.GA31988@redhat.com>
    <1418918059.17358.6@mail.thefacebook.com>
    <20141218161230.GA6042@redhat.com>
    <20141219024549.GB1671@redhat.com>
    <20141219035859.GA20022@redhat.com>
    <20141219040308.GB20022@redhat.com>
    <20141219145528.GC13404@redhat.com>

On Sat, Dec 20, 2014 at 1:16 PM, Linus Torvalds wrote:
>
> Hmm, ok, I've re-acquainted myself with it. And I have to admit that I
> can't see anything wrong. The whole "update_wall_clock" and the shadow
> timekeeping state is confusing as hell, but seems fine. We'd have to
> avoid update_wall_clock for a *long* time for overflows to occur.
>
> And the overflow in 32 bits isn't that special, since the only thing
> that really matters is the overflow of "cycle_now - tkr->cycle_last"
> within the mask.
>
> So I'm not seeing anything even halfway suspicious.

.. of course, this reminds me of the "clocksource TSC unstable" issue.

The *simple* solution may actually be that the HPET itself is buggered.
That would explain both the "clocksource TSC unstable" messages _and_
the "time went backwards, so now we're re-arming the scheduler tick
'forever' until time has gone forwards again".

And googling for this actually shows other people seeing similar
issues, including hangs after switching to another clocksource. See
for example

  http://stackoverflow.com/questions/13796944/system-hang-with-possible-relevance-to-clocksource-tsc-unstable

which switches to acpi_pm (not HPET) and then hangs afterwards.

Of course, it may be the switching itself that causes some issue.

Btw, there's another reason to think that it's the HPET, I just
realized.

DaveJ posted all his odd TSC unstable things, and the delta was pretty
damn random. But it did have a range: it was in the 1-251 second
range.

With a 14.318MHz clock (which is, I think, the normal HPET frequency),
a 32-bit overflow happens in about 300 seconds. So the range of 1-251
seconds is not entirely random. It's all in that "32-bit HPET range".

In contrast, wrt the TSC frequency, that kind of range makes no sense
at all.

                     Linus
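
The "about 300 seconds" figure above can be checked with a few lines of
arithmetic. What follows is only a rough sketch, not kernel code: the
names HPET_HZ, TSC_HZ_EXAMPLE, wrap_seconds and masked_delta are made up
for illustration, 14318180 Hz is assumed as the nominal value behind
"14.318MHz", and the 3 GHz TSC rate is a purely hypothetical figure for
comparison. masked_delta mirrors the "cycle_now - tkr->cycle_last within
the mask" idea from the quoted text.

```python
# Assumed nominal HPET rate (14.31818 MHz); the mail only says "14.318MHz".
HPET_HZ = 14_318_180
# Hypothetical TSC rate, for comparison only; not taken from the thread.
TSC_HZ_EXAMPLE = 3_000_000_000
COUNTER_BITS = 32


def wrap_seconds(freq_hz, bits=COUNTER_BITS):
    """Seconds until a free-running counter of the given width wraps."""
    return (1 << bits) / freq_hz


def masked_delta(now, last, bits=COUNTER_BITS):
    """Elapsed cycles between two counter reads, modulo the counter width.

    The result stays correct across a single wraparound, which is why
    only the difference within the mask matters.
    """
    mask = (1 << bits) - 1
    return (now - last) & mask


if __name__ == "__main__":
    print(f"HPET 32-bit wrap: {wrap_seconds(HPET_HZ):.1f} s")
    print(f"3 GHz counter, 32-bit wrap: {wrap_seconds(TSC_HZ_EXAMPLE):.2f} s")
    # A read taken just after the counter wrapped still gives a small delta.
    print(masked_delta(5, 0xFFFFFFFB))
```

Under those assumptions the HPET wrap period comes out just under 300
seconds, so every delta in the reported 1-251 second range fits inside
one 32-bit HPET period, while a 32-bit quantity counted at a GHz-class
rate would wrap in a second or two.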