Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67;
MIME-Version: 1.0
Message-ID: <6c16fc37-bdf2-4925-8114-14f5a08c07e3@default>
Date:   Tue, 23 Jan 2018 03:13:05 -0800 (PST)
From:   Liran Alon <liran.alon@oracle.com>
To:     <dwmw2@infradead.org>
Cc:     <labbott@redhat.com>, <luto@kernel.org>,
        <Janakarajan.Natarajan@amd.com>, <torvalds@linux-foundation.org>,
        <bp@suse.de>, <asit.k.mallick@intel.com>, <rkrcmar@redhat.com>,
        <dave.hansen@intel.com>, <karahmed@amazon.de>, <hpa@zytor.com>,
        <mingo@redhat.com>, <jun.nakajima@intel.com>, <x86@kernel.org>,
        <ashok.raj@intel.com>, <arjan.van.de.ven@intel.com>,
        <tim.c.chen@linux.intel.com>, <pbonzini@redhat.com>,
        <ak@linux.intel.com>, <linux-kernel@vger.kernel.org>,
        <peterz@infradead.org>, <tglx@linutronix.de>,
        <gregkh@linuxfoundation.org>, <mhiramat@kernel.org>,
        <arjan@linux.intel.com>, <thomas.lendacky@amd.com>,
        <dan.j.williams@intel.com>, <joro@8bytes.org>,
        <kvm@vger.kernel.org>, <aarcange@redhat.com>
Subject: Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict
 Indirect Branch Speculation
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk


----- dwmw2@infradead.org wrote:

> On Sun, 2018-01-21 at 14:27 -0800, Linus Torvalds wrote:
> > On Sun, Jan 21, 2018 at 2:00 PM, David Woodhouse
> <dwmw2@infradead.org> wrote:
> > >>
> > >> The patches do things like add the garbage MSR writes to the
> kernel
> > >> entry/exit points. That's insane. That says "we're trying to
> protect
> > >> the kernel".=C2=A0 We already have retpoline there, with less
> overhead.
> > >
> > > You're looking at IBRS usage, not IBPB. They are different
> things.
> >=20
> > Ehh. Odd intel naming detail.
> >=20
> > If you look at this series, it very much does that kernel
> entry/exit
> > stuff. It was patch 10/10, iirc. In fact, the patch I was replying
> to
> > was explicitly setting that garbage up.
> >=20
> > And I really don't want to see these garbage patches just
> mindlessly
> > sent around.
>=20
> I think we've covered the technical part of this now, not that you
> like
> it =E2=80=94 not that any of us *like* it. But since the peanut gallery i=
s
> paying lots of attention it's probably worth explaining it a little
> more for their benefit.
>=20
> This is all about Spectre variant 2, where the CPU can be tricked
> into
> mispredicting the target of an indirect branch. And I'm specifically
> looking at what we can do on *current* hardware, where we're limited
> to
> the hacks they can manage to add in the microcode.
>=20
> The new microcode from Intel and AMD adds three new features.
>=20
> One new feature (IBPB) is a complete barrier for branch prediction.
> After frobbing this, no branch targets learned earlier are going to
> be
> used. It's kind of expensive (order of magnitude ~4000 cycles).
>=20
> The second (STIBP) protects a hyperthread sibling from following
> branch
> predictions which were learned on another sibling. You *might* want
> this when running unrelated processes in userspace, for example. Or
> different VM guests running on HT siblings.
>=20
> The third feature (IBRS) is more complicated. It's designed to be
> set when you enter a more privileged execution mode (i.e. the
> kernel).
> It prevents branch targets learned in a less-privileged execution
> mode,
> BEFORE IT WAS MOST RECENTLY SET, from taking effect. But it's not
> just
> a 'set-and-forget' feature, it also has barrier-like semantics and
> needs to be set on *each* entry into the kernel (from userspace or a
> VM
> guest). It's *also* expensive. And a vile hack, but for a while it
> was
> the only option we had.
>=20
> Even with IBRS, the CPU cannot tell the difference between different
> userspace processes, and between different VM guests. So in addition
> to
> IBRS to protect the kernel, we need the full IBPB barrier on context
> switch and vmexit. And maybe STIBP while they're running.
>=20
> Then along came Paul with the cunning plan of "oh, indirect branches
> can be exploited? Screw it, let's not have any of *those* then",
> which
> is retpoline. And it's a *lot* faster than frobbing IBRS on every
> entry
> into the kernel. It's a massive performance win.
>=20
> So now we *mostly* don't need IBRS. We build with retpoline, use IBPB
> on context switches/vmexit (which is in the first part of this patch
> series before IBRS is added), and we're safe. We even refactored the
> patch series to put retpoline first.
>=20
> But wait, why did I say "mostly"? Well, not everyone has a retpoline
> compiler yet... but OK, screw them; they need to update.
>=20
> Then there's Skylake, and that generation of CPU cores. For
> complicated
> reasons they actually end up being vulnerable not just on indirect
> branches, but also on a 'ret' in some circumstances (such as 16+
> CALLs
> in a deep chain).
>=20
> The IBRS solution, ugly though it is, did address that. Retpoline
> doesn't. There are patches being floated to detect and prevent deep
> stacks, and deal with some of the other special cases that bite on
> SKL,
> but those are icky too. And in fact IBRS performance isn't anywhere
> near as bad on this generation of CPUs as it is on earlier CPUs
> *anyway*, which makes it not quite so insane to *contemplate* using
> it
> as Intel proposed.
>=20
> That's why my initial idea, as implemented in this RFC patchset, was
> to
> stick with IBRS on Skylake, and use retpoline everywhere else. I'll
> give you "garbage patches", but they weren't being "just mindlessly
> sent around". If we're going to drop IBRS support and accept the
> caveats, then let's do it as a conscious decision having seen what it
> would look like, not just drop it quietly because poor Davey is too
> scared that Linus might shout at him again. :)
>=20
> I have seen *hand-wavy* analyses of the Skylake thing that mean I'm
> not
> actually lying awake at night fretting about it, but nothing concrete
> that really says it's OK.
>=20
> If you view retpoline as a performance optimisation, which is how it
> first arrived, then it's rather unconventional to say "well, it only
> opens a *little* bit of a security hole but it does go nice and fast
> so
> let's do it".
>=20
> But fine, I'm content with ditching the use of IBRS to protect the
> kernel, and I'm not even surprised. There's a *reason* we put it last
> in the series, as both the most contentious and most dispensable
> part.
> I'd be *happier* with a coherent analysis showing Skylake is still
> OK,
> but hey-ho, screw Skylake.
>=20
> The early part of the series adds the new feature bits and detects
> when
> it can turn KPTI off on non-Meltdown-vulnerable Intel CPUs, and also
> supports the IBPB barrier that we need to make retpoline complete.
> That
> much I think we definitely *do* want. There have been a bunch of us
> working on this behind the scenes; one of us will probably post that
> bit in the next day or so.
>=20
> I think we also want to expose IBRS to VM guests, even if we don't
> use
> it ourselves. Because Windows guests (and RHEL guests; yay!) do use
> it.
>=20
> If we can be done with the shouty part, I'd actually quite like to
> have
> a sensible discussion about when, if ever, we do IBPB on context
> switch
> (ptraceability and dumpable have both been suggested) and when, if
> ever, we set STIPB in userspace.

It is also important to note that current solutions, as I understand it, st=
ill have info-leak issues.

If retpoline is being used, user-mode code can leak RSB entries created whi=
le CPU was in kernel-mode.
Therefore, breaking KASLR. In order to handle this, every exit from kernel-=
mode to user-mode should stuff RSB. In addition, this stuffing of RSB may n=
eed to be done from a fixed address to avoid leaking the address of the RSB=
 stuffing itself. Same concept applies for VMEntry into guests. Hypervisor =
should stuff RSB just before VMEntry, otherwise guest will be able to leak =
RSB entries which reveals hypervisor addresses.

If IBRS is being used, things seems to be even worse.
IBRS prevents BTB entries created at lower prediction-mode from being used =
by higher prediction-mode code.
However, nothing seems to prevent lower prediction-mode code from using BTB=
 entries of higher prediction-mode code. This means that user-mode code cou=
ld leak BTB entries in order to break KASLR and guests could leaks host's B=
TB entries to reveal hypervisor addresses. This seems to be an issue even w=
ith future CPUs that will have "IBRS all-the-time" feature.
Note that this issue is not theoretical. This is exactly what Google's Proj=
ect-Zero KVM PoC did. They leaked host's BTB entries to reveal kvm-intel.ko=
, kvm.ko & vmlinux addresses.
It seems that the correct way to really handle this scenario should be to t=
ag every BTB entry with prediction-mode and make CPU only use BTB entries t=
agged with current prediction-mode. Therefore, entirely separating the BTB =
entries between prediction-modes. That, in my opinion, should replace the I=
BRS-feature.

-Liran