From: Andy Lutomirski
To: Linus Torvalds
Cc: Toralf Förster, Alexander Tsoy, stable, Linux Kernel, the arch/x86 maintainers, jpoimboe@redhat.com
Subject: Re: 4.14.9 doesn't boot (regression)
Date: Fri, 29 Dec 2017 17:10:35 -0700
Message-Id: <7A0A9B37-20FF-4B17-B4F5-D8B999269FC4@amacapital.net>
References: <33249a35-7d6a-f0f3-5a98-e6474f9366e3@gmx.de>
X-Mailing-List: linux-kernel@vger.kernel.org

> On Dec 29, 2017, at 3:53 PM, Linus Torvalds wrote:
>
>> On Fri, Dec 29, 2017 at 2:30 PM, Toralf Förster wrote:
>>
>> The bad news - the issue is not solved with the changed cflags.
>> The good news - I could compile eventually a working config for my desktop (works fine with 4.14.10 with generic CPU) having a higher screen resolution during boot.
>>
>> So I made a "make distclean", followed by a "sudo zcat /proc/config.gz > .config", changed the .config to use MCORE2 instead of GENERIC and defined the string "-local" to ensure that the modules directory is really unique.
>> Then I run "time make -j4 && sudo make modules_install && sudo cp arch/x86_64/boot/bzImage /boot/vmlinuz-0 && sudo grub-mkconfig -o /boot/grub/grub.cfg", booted and made 3 fotos which were uploaded to [1], look for IMG_*
>
> Ok, so what does seem to be consistent for everybody is that
> double-fault in the NMI backtrace.
>
> So the fact that the NMI always hits on a double-fault does make me
> suspect that it's a infinite stream of double-faults, and that is
> presumably also what causes the RCU timeout.
>
> And as I pointed out elsewhere (damn two threads), I think that it
> would help to simply catch the *first* double-fault.
>
> And I *think* that the only thing that can make a double-fault
> silently be re-tried is the CONFIG_X86_ESPFIX64 case, so if you can
> build a failing kernel with the CONFIG_X86_ESPFIX64 case disabled in
> arch/x86/kernel/traps.c do_double_fault(), that would be interesting.

Double faults use IST, so a double fault that double faults will effectively just start over rather than eventually running out of stack and triple faulting.

But check out the registers. We have RSP = ...28fd8 and CR2 = ...27f08. IOW the double fault stack is ...28000 - ...28fff and we're somehow getting a failed page fault a couple hundred bytes below the bottom of the IST stack. IOW, I think we're just stuck in a neverending loop of stack overflows.

(Also, Josh, the oops code should have printed the contents of the struct pt_regs at the top of the DF stack. Any idea why it didn't?)

Toralf, can you send the complete output of:

    objdump -dr arch/x86/kernel/traps.o

from the build tree of a nonworking kernel?

Also, you wouldn't happen to be using Gentoo perchance? I already have two reports of a Gentoo system miscompiling the vDSO due to Gentoo enabling -fstack-check and GCC generating stack check code that is highly suboptimal, actively incorrect, and doesn't even manage to check the stack in a particularly helpful way.

If this is indeed what's going on, I'm going to try to come up with a patch to outright fail the build on these buggy systems. We could probably fudge the build options to avoid the problem, but Gentoo really just needs to fix its toolchain.
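
For reference, the register arithmetic behind the stack-overflow reading can be checked with a short sketch. It is only illustrative: the addresses are the low bits quoted from the oops (the high bits are elided in the report), and the helper names `in_stack` and `bytes_below_stack` are made up for this example rather than taken from the kernel.

```python
# Low address bits as quoted in the mail; the high bits are elided there
# ("...28fd8"), so these are offsets for illustration, not full addresses.
RSP = 0x28FD8          # stack pointer reported in the oops
CR2 = 0x27F08          # faulting address reported in the oops
STACK_LOW = 0x28000    # lowest byte of the double-fault IST stack
STACK_HIGH = 0x28FFF   # highest byte of the double-fault IST stack


def in_stack(addr, low=STACK_LOW, high=STACK_HIGH):
    """Return True if addr lies inside the inclusive range [low, high]."""
    return low <= addr <= high


def bytes_below_stack(addr, low=STACK_LOW):
    """Return how far addr is below the stack's lowest byte (0 if not below)."""
    return max(0, low - addr)


if __name__ == "__main__":
    print(f"RSP inside IST stack: {in_stack(RSP)}")
    print(f"CR2 inside IST stack: {in_stack(CR2)}")
    print(f"CR2 is {bytes_below_stack(CR2)} bytes below the stack")
```

With these values RSP sits inside the IST stack while CR2 lands 248 bytes (0xf8) below its lowest address, which matches the "couple hundred bytes below the bottom" estimate.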