From: Andy Lutomirski
Date: Thu, 23 Apr 2015 09:27:03 -0700
Subject: Re: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary
To: Linus Torvalds
Cc: Brian Gerst, Denys Vlasenko, Ingo Molnar, Steven Rostedt, Borislav Petkov, "H. Peter Anvin", Oleg Nesterov, Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook, the arch/x86 maintainers, Linux Kernel Mailing List
References: <1429792491-5978-1-git-send-email-dvlasenk@redhat.com>

On Thu, Apr 23, 2015 at 9:13 AM, Linus Torvalds wrote:
> On Thu, Apr 23, 2015 at 9:06 AM, Brian Gerst wrote:
>>
>> So you are saying we should save and conditionally restore the
>> kernel's %ss during context switch? That shouldn't be too bad. Half
>> of the time you would be loading the null selector, which is fast (no
>> GDT access, no validation).
>
> I'd almost prefer something along those lines, yes. Who knows *what*
> leaks? If the present bit state leaks, then likely so does the limit
> value, etc.

I'll go out on a limb and guess that the present bit doesn't leak. If I were implementing an x86 CPU, I wouldn't have a present bit in the descriptor cache at all, since you aren't supposed to be able to load a non-present descriptor in the first place. I bet it's the limit we're seeing.

But I think I prefer something closer to Denys' approach with alternatives instead.
I think the only case that matters (if my hare-brained explanation of the actual crash is right) is when we sysret (q or l) while SS is 0. That only happens if we scheduled inside a syscall, and I'm guessing that testing whether SS is zero and reloading it on syscall return will be a smaller performance hit than reloading on every context switch. The latter could happen more than once per syscall, and it would also affect tasks that aren't doing syscalls at all and are therefore unaffected.

I'll try to send out a patch and a test case later today, but no promises -- the test case will be a bit tedious, and I'm already overcommitted for today :(

A sketch of the reproducer: two threads. Thread 1 sets SS to some very-low-limit value and loops doing mov $-1, %eax; int $0x80. Thread 2 is ordinary 32-bit code doing while (true) usleep(1);

--Andy