From: Andy Lutomirski
Date: Thu, 23 Apr 2015 09:27:03 -0700
Subject: Re: [PATCH] x86/asm/entry/32: Restore %ss before SYSRETL if necessary
To: Linus Torvalds
Cc: Brian Gerst, Denys Vlasenko, Ingo Molnar, Steven Rostedt, Borislav Petkov, "H. Peter Anvin", Oleg Nesterov, Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook, the arch/x86 maintainers, Linux Kernel Mailing List
References: <1429792491-5978-1-git-send-email-dvlasenk@redhat.com>

On Thu, Apr 23, 2015 at 9:13 AM, Linus Torvalds wrote:
> On Thu, Apr 23, 2015 at 9:06 AM, Brian Gerst wrote:
>>
>> So you are saying we should save and conditionally restore the
>> kernel's %ss during context switch? That shouldn't be too bad. Half
>> of the time you would be loading the null selector, which is fast (no
>> GDT access, no validation).
>
> I'd almost prefer something along those lines, yes. Who knows *what*
> leaks? If the present bit state leaks, then likely so does the limit
> value, etc.

I'll go out on a limb and guess that the present bit doesn't leak. If I were implementing an x86 CPU, I wouldn't have a present bit in the descriptor cache at all, since you aren't supposed to be able to load a non-present descriptor in the first place. I bet it's the limit we're seeing.

But I think I prefer something closer to Denys' approach with alternatives instead.
I think the only case that matters (if my hare-brained explanation of the actual crash is right) is when we sysret (q or l) while SS is 0. That only happens if we scheduled inside a syscall, and I'm guessing that testing whether SS is zero and reloading it on syscall return will be a smaller performance hit than reloading on every context switch. The latter could happen more than once per syscall, and it would also affect tasks that aren't doing syscalls at all and are therefore unaffected.

I'll try to send out a patch and a test case later today, but no promises -- the test case will be a bit tedious, and I'm already overcommitted for today :(

A sketch of the reproducer: two threads. Thread 1 sets SS to some very-low-limit value and loops doing mov $-1, %eax; int $0x80. Thread 2 is ordinary 32-bit code doing while (true) usleep(1);

--Andy