From: Linus Torvalds
To: Andy Lutomirski
Cc: Thomas Gleixner, "linux-kernel@vger.kernel.org", Arnaldo Carvalho de Melo, Peter Zijlstra, Frederic Weisbecker, Don Zickus, Dave Jones, "the arch/x86 maintainers"
Subject: Re: frequent lockups in 3.18rc4
Date: Wed, 19 Nov 2014 18:42:24 -0800

On Wed, Nov 19, 2014 at 5:16 PM, Andy Lutomirski wrote:
>
> And you were calling me crazy? :)

Hey, I'm crazy like a fox.

> We could be restarting just about anything if that happens. Except
> that if we double-faulted on a trap gate entry instead of an interrupt
> gate entry, then we can't restart, and, unless we can somehow decode
> the error code usefully (it's woefully undocumented), int 0x80 and
> int3 might be impossible to handle correctly if it double-faults. And
> please don't suggest moving int 0x80 to an IST stack :)

No, no. So tell me if this won't work:

 - when forking a new process, make sure we allocate the vmalloc stack
   *before* we copy the vm

 - this should guarantee that all new processes will at least have its
   *own* stack always in its page tables, since vmalloc always fills in
   the page table of the current page tables of the thread doing the
   vmalloc.


HOWEVER, that leaves the task switch *to* that process, and making
sure that the stack pointer is ok in between the "switch %rsp" and
"switch %cr3".

So then we make the rule be: switch %cr3 *before* switching %rsp, and
only in between those places can we get in trouble. Yes/no?

And that small section is all with interrupts disabled, and nothing
should take an exception. The C code might take a double fault on a
regular access to the old stack (the *new* stack is guaranteed to be
mapped, but the old stack is not), but that should be very similar to
what we already do with "iret". So we can just fill in the page tables
and return.

For safety, add a percpu counter that is cleared before the %cr3
setting, to make sure that we only do a *single* double-fault, but it
really sounds pretty safe. No?

The only deadly thing would be NMI, but that's an IST anyway, so not
an issue. No other traps should be able to happen except the double
page table miss.

But hey, maybe I'm not crazy like a fox. Maybe I'm just plain crazy,
and I missed something else.

And no, I don't think the above is necessarily a *good* idea. But it
doesn't seem really overly complicated either.

                Linus
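
The ordering argument above can be played through in a toy user-space
model. This is only a sketch under loud assumptions: it is not kernel
code, and every name in it (ToyCPU, Task, vmalloc_stack, touch_stack,
switch_to) is hypothetical. Page tables are modelled as plain sets of
stack slots, and the "double fault" is just a Python branch that fills
in the missing slot from a master table. What it tries to show is that
allocating the stack before copying the vm puts each task's own stack
in its own tables, and that switching %cr3 before %rsp leaves the old
stack as the only thing that can be unmapped, with the counter capping
it at one fault per switch.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    stack: int        # slot of this task's stack in the toy vmalloc area
    page_table: set   # slots mapped in this task's own page tables


class ToyCPU:
    """Hypothetical model: one CPU, page tables as sets of stack slots."""

    def __init__(self):
        self.master = set()   # reference mappings for the whole vmalloc area
        self.next_slot = 0
        self.cr3 = None       # page tables currently in use
        self.rsp = None       # stack slot currently in use
        self.df_count = 0     # stand-in for the percpu double-fault counter

    def vmalloc_stack(self, current):
        # Map the new slot in the master table and in the caller's own tables.
        slot = self.next_slot
        self.next_slot += 1
        self.master.add(slot)
        if current is not None:
            current.page_table.add(slot)
        return slot

    def boot(self):
        slot = self.vmalloc_stack(None)
        init = Task("init", slot, {slot})
        self.cr3, self.rsp = init.page_table, init.stack
        return init

    def fork(self, parent, name):
        # parent is assumed to be the running task.
        # Allocate the child's stack first, so the parent's tables contain it...
        slot = self.vmalloc_stack(parent)
        # ...and only then copy the vm, so the copy includes the child's stack.
        return Task(name, slot, set(parent.page_table))

    def touch_stack(self):
        # Model an access to the current stack under the current page tables.
        if self.rsp in self.cr3:
            return
        self.df_count += 1
        if self.df_count > 1 or self.rsp not in self.master:
            raise RuntimeError("unrecoverable double fault")
        self.cr3.add(self.rsp)   # fill in the page tables and return

    def switch_to(self, nxt):
        self.df_count = 0             # counter cleared before the %cr3 setting
        self.cr3 = nxt.page_table     # switch %cr3 first
        self.touch_stack()            # still on the old stack; may fault once
        self.rsp = nxt.stack          # then switch %rsp
        self.touch_stack()            # new stack is mapped by construction
        return self.df_count          # double faults taken in this switch


cpu = ToyCPU()
init = cpu.boot()
a = cpu.fork(init, "a")
b = cpu.fork(init, "b")     # b's stack is allocated after a's vm was copied
print(cpu.switch_to(b))     # 0: init's stack is in b's tables
print(cpu.switch_to(a))     # 1: b's stack was missing from a's tables
```

In this model the only fault ever taken is on the old stack right
after the %cr3 switch, and it happens at most once per pair of tasks
because the fill-in sticks. The model says nothing about NMI, IST
stacks, or how the hardware actually delivers the double fault.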