MIME-Version: 1.0
In-Reply-To: <a591188f-021b-6d0a-6261-28fa83d161e1@gmx.de>
References: <b8b49443-e43b-7253-78c9-23a0b1632a3b@gmx.de> <CA+55aFz+moL9zem-gGnRtGSzYV0yL+GoA=3+XrUZy9t=cmhBhw@mail.gmail.com>
 <33249a35-7d6a-f0f3-5a98-e6474f9366e3@gmx.de> <CA+55aFyJiCT=Mnn5RUMiYRXddi_eSdVTpobEKNoHzYXsos_6pw@mail.gmail.com>
 <a591188f-021b-6d0a-6261-28fa83d161e1@gmx.de>
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 29 Dec 2017 14:53:59 -0800
Message-ID: <CA+55aFwfHEr1mSmHyvmSeFJvxMnpN=NPgJ21zEOkDPvZ2y4wXg@mail.gmail.com>
Subject: Re: 4.14.9 doesn't boot (regression)
To: =?UTF-8?Q?Toralf_F=C3=B6rster?= <toralf.foerster@gmx.de>
Cc: Alexander Tsoy <alexander@tsoy.me>,
        Andy Lutomirski <luto@amacapital.net>, stable <stable@vger.kernel.org>,
        Linux Kernel <linux-kernel@vger.kernel.org>,
        "the arch/x86 maintainers" <x86@kernel.org>
Content-Type: text/plain; charset="UTF-8"
Sender: linux-kernel-owner@vger.kernel.org
Content-Transfer-Encoding: 8bit
Content-Length: 1789
Lines: 39

On Fri, Dec 29, 2017 at 2:30 PM, Toralf Förster <toralf.foerster@gmx.de> wrote:
>
> The bad news - the issue is not solved with the changed cflags.
> The good news - I was eventually able to compile a working config for my desktop (works fine with 4.14.10 with generic CPU) having a higher screen resolution during boot.
>
> So I ran "make distclean", followed by a "sudo zcat /proc/config.gz > .config", changed the .config to use MCORE2 instead of GENERIC and defined the string "-local" to ensure that the modules directory is really unique.
> Then I ran "time make -j4 && sudo make modules_install && sudo cp arch/x86_64/boot/bzImage /boot/vmlinuz-0 && sudo grub-mkconfig -o /boot/grub/grub.cfg", booted and took 3 photos, which I uploaded to [1]; look for IMG_*

Ok, so what does seem to be consistent for everybody is that
double-fault in the NMI backtrace.

So the fact that the NMI always hits on a double-fault does make me
suspect that it's an infinite stream of double-faults, and that is
presumably also what causes the RCU timeout.

And as I pointed out elsewhere (damn two threads), I think that it
would help to simply catch the *first* double-fault.

And I *think* that the only thing that can make a double-fault
silently be re-tried is the CONFIG_X86_ESPFIX64 case, so if you can
build a failing kernel with the CONFIG_X86_ESPFIX64 case disabled in
arch/x86/kernel/traps.c do_double_fault(), that would be interesting.

So just change the

  #ifdef CONFIG_X86_ESPFIX64

into a

  #if 0

and see whether, instead of the RCU stall after 20 seconds, you get an
immediate double-fault error report?
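
To illustrate why that one-line change could matter, here is a minimal
userspace sketch, not the actual kernel code. It assumes that the
ESPFIX64 block is the only path that can leave the handler without
printing anything. DEMO_ESPFIX64, demo_double_fault() and the enum are
hypothetical names made up for this illustration.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for CONFIG_X86_ESPFIX64 (not a real kernel symbol). */
#define DEMO_ESPFIX64

enum demo_df_result {
	DEMO_DF_RETRIED,	/* handler returned silently, fault gets retried */
	DEMO_DF_REPORTED	/* handler printed an error report */
};

/*
 * Hypothetical model of a double-fault handler. The flag stands in for
 * whatever check the real handler uses to decide that the fault came
 * from the espfix path; that check is deliberately not reproduced here.
 */
static enum demo_df_result demo_double_fault(bool looks_like_espfix)
{
#ifdef DEMO_ESPFIX64	/* change this line to "#if 0" to drop the silent path */
	if (looks_like_espfix)
		return DEMO_DF_RETRIED;	/* nothing is printed on this path */
#else
	(void)looks_like_espfix;	/* unused once the block is compiled out */
#endif
	fprintf(stderr, "demo: double fault\n");
	return DEMO_DF_REPORTED;
}
```

With the block compiled in, a fault that is (rightly or wrongly) taken
for the espfix case returns silently every time, which is how an
endless stream could go unnoticed. With "#if 0" the same call always
reaches the report.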

I'm still entirely confused about why MCORE2 would make _any_
difference whatsoever, so this is all fishing for random clues in
the dark.

                      Linus