From: Linus Torvalds
Date: Fri, 29 Dec 2017 14:53:59 -0800
Subject: Re: 4.14.9 doesn't boot (regression)
To: Toralf Förster
Cc: Alexander Tsoy, Andy Lutomirski, stable, Linux Kernel, "the arch/x86 maintainers"
References: <33249a35-7d6a-f0f3-5a98-e6474f9366e3@gmx.de>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Dec 29, 2017 at 2:30 PM, Toralf Förster wrote:
>
> The bad news - the issue is not solved with the changed cflags.
> The good news - I could compile eventually a working config for my
> desktop (works fine with 4.14.10 with generic CPU) having a higher
> screen resolution during boot.
>
> So I made a "make distclean", followed by a
> "sudo zcat /proc/config.gz > .config", changed the .config to use
> MCORE2 instead of GENERIC and defined the string "-local" to ensure
> that the modules directory is really unique.
> Then I run "time make -j4 && sudo make modules_install && sudo cp
> arch/x86_64/boot/bzImage /boot/vmlinuz-0 && sudo grub-mkconfig -o
> /boot/grub/grub.cfg", booted and made 3 fotos which were uploaded
> to [1], look for IMG_*

Ok, so what does seem to be consistent for everybody is that
double-fault in the NMI backtrace.

So the fact that the NMI always hits on a double-fault does make me
suspect that it's an infinite stream of double-faults, and that is
presumably also what causes the RCU timeout.

And as I pointed out elsewhere (damn two threads), I think that it
would help to simply catch the *first* double-fault.

And I *think* that the only thing that can make a double-fault
silently be re-tried is the CONFIG_X86_ESPFIX64 case, so if you can
build a failing kernel with the CONFIG_X86_ESPFIX64 case disabled in
arch/x86/kernel/traps.c do_double_fault(), that would be interesting.

So just change the

    #ifdef CONFIG_X86_ESPFIX64

into a

    #if 0

and see if, instead of the RCU stall after 20 seconds, you get an
immediate double fault error report?

I'm still entirely confused about why that MCORE2 would make _any_
difference whatsoever, so this is all fishing for random clues in the
dark.

              Linus
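
For anyone who would rather script the one-line change than edit it by hand, here is a minimal sketch of the "#ifdef CONFIG_X86_ESPFIX64" to "#if 0" substitution. The helper name disable_guard and the SAMPLE text are hypothetical: SAMPLE is a made-up stand-in, not the real contents of arch/x86/kernel/traps.c. The helper rewrites every matching guard in the text it is given, so if the symbol appears more than once in a file, pass in only the do_double_fault() part or check the returned count.

```python
import re

GUARD = "CONFIG_X86_ESPFIX64"

# Hypothetical stand-in text, NOT the actual kernel source.
SAMPLE = """\
void handler(void)
{
#ifdef CONFIG_X86_ESPFIX64
    /* placeholder for the guarded block */
#endif
    /* placeholder for the error report */
}
"""


def disable_guard(source, symbol=GUARD):
    """Turn each '#ifdef <symbol>' line into '#if 0'.

    Returns (new_source, number_of_lines_changed). Lines that carry
    anything after the symbol (e.g. a trailing comment) are left alone.
    """
    pattern = re.compile(
        r"^([ \t]*)#[ \t]*ifdef[ \t]+" + re.escape(symbol) + r"[ \t]*$",
        re.MULTILINE,
    )

    def replacement(match):
        # Keep the original indentation and record which symbol was disabled.
        return match.group(1) + "#if 0 /* was " + symbol + " */"

    return pattern.subn(replacement, source)


if __name__ == "__main__":
    patched, changed = disable_guard(SAMPLE)
    print(patched)
    print("guards disabled:", changed)
```

The matching #endif is left untouched, so the guarded block is simply compiled out. A count of 0 means the guard was not found in the text as written and nothing was changed.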