From: Linus Torvalds
Date: Tue, 26 Dec 2017 18:16:37 -0800
Subject: Re: PROBLEM: consolidated IDT invalidation causes kexec to reboot
To: Alexandru Chirvasitu
Cc: Andy Lutomirski, Thomas Gleixner, kernel list, Borislav Petkov, Brian Gerst, Denys Vlasenko, "H. Peter Anvin", Josh Poimboeuf, Peter Zijlstra, Steven Rostedt, Ingo Molnar
In-Reply-To: <20171226231900.GB1410@arch-chirva.localdomain>

On Tue, Dec 26, 2017 at 3:19 PM, Alexandru Chirvasitu wrote:
>
> I went back to the initial problematic commit e802a51 and modified it as you suggest:

Thank you.

> This did not work out for me, but now it fails differently. Both
> (kexec -l + kexec -e) and (kexec -p + echo c > /proc/sysrq-trigger)
> end in call traces and freezes.
>
> It does seem to be tied to idt_invalidate. One of the last things I
> see on the screen (which ends up frozen with the computer inactive)
> is
>
> EIP: idt_invalidate+0x6/0x40 SS:ESP: 0068:f6c47cd0

Yes, interesting, it's the stack canary load access there:

    mov %gs:0x14,%edx

that traps.
And that actually makes a lot of sense: the load_segments() call just above has reloaded all segments with __KERNEL_DS. So while the stack canary access *intends* to load the canary from the magic stack canary segment (offset 0x14), we've just reset all segments to the standard zero-based, full-sized ones, and obviously that access will take a page fault at address 0x14.

And the reason you now actually *see* the page fault is that we haven't completely buggered the CPU state this time, so the trap handler actually works. With the GDT reset before, it used to take that same trap, but then the trap handler itself would fault and cause a triple fault, which resets the machine.

So it wasn't actually tracing; it was the stack canary all along. At least it's truly root-caused now.

But the fix is the same: we just can't afford to do any function calls at that point. Alternatively, we should just fix that insane "load_segments()". I'm not sure why the code insists on reloading the segments in the first place, so you could try just removing the "load_segments()" line entirely.

Thanks for spending the time testing things out,

           Linus