Date: Fri, 2 Nov 2018 11:27:12 -0700
From: Sean Christopherson
To: Andy Lutomirski
Cc: Dave Hansen, Linus Torvalds, Rich Felker, Jann Horn, Jethro Beekman, Jarkko Sakkinen, Florian Weimer, Linux API, X86 ML, linux-arch, LKML, Peter Zijlstra, nhorman@redhat.com, npmccallum@redhat.com, "Ayoun, Serge", shay.katz-zamir@intel.com, linux-sgx@vger.kernel.org, Andy Shevchenko, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Carlos O'Donell, adhemerval.zanella@linaro.org
Subject: Re: RFC: userspace exception fixups
Message-ID: <20181102182712.GG7393@linux.intel.com>
References:
 <20181101193107.GE5150@brightrain.aerifal.cx>
 <20181102163034.GB7393@linux.intel.com>
 <7050972d-a874-dc08-3214-93e81181da60@intel.com>
 <20181102170627.GD7393@linux.intel.com>
 <20181102173350.GF7393@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Nov 02, 2018 at 10:48:38AM -0700, Andy Lutomirski wrote:
> On Fri, Nov 2, 2018 at 10:33 AM Sean Christopherson wrote:
> >
> > On Fri, Nov 02, 2018 at 10:13:23AM -0700, Dave Hansen wrote:
> > > On 11/2/18 10:06 AM, Sean Christopherson wrote:
> > > > On Fri, Nov 02, 2018 at 09:56:44AM -0700, Dave Hansen wrote:
> > > >> On 11/2/18 9:30 AM, Sean Christopherson wrote:
> > > >>> What if rather than having userspace register an address for fixup, the
> > > >>> kernel instead unconditionally does fixup on the ENCLU opcode?
> > > >>
> > > >> The problem is knowing what to do for the fixup. If we have a simple
> > > >> action to take that's universal, like backing up %RIP, or setting some
> > > >> other register state, it's not bad.
> > > >
> > > > Isn't the EENTER/RESUME behavior universal? Or am I missing something?
> > >
> > > Could someone write down all the ways we get in and out of the enclave?
> > >
> > > I think we always get in from userspace calling EENTER or ERESUME. We
> > > can't ever enter directly from the kernel, like via an IRET from what I
> > > understand.
> >
> > Correct, the only way to get into the enclave is EENTER or ERESUME.
> > My understanding is that even SMIs bounce through the AEX target
> > before transitioning to SMM.
> >
> > > We get *out* from exceptions, hardware interrupts, or enclave-explicit
> > > EEXITs. Did I miss any? Remind me where the hardware lands the control
> > > flow in each of those exit cases.
> >
> > And VMExits.
> > There are basically two cases: EEXIT and everything else.
> > EEXIT is a glorified indirect jump, e.g. %RBX holds the target %RIP.
> > Everything else is an Asynchronous Enclave Exit (AEX). On an AEX, %RIP
> > is set to a value specified by EENTER/ERESUME, %RBP and %RSP are
> > restored to pre-enclave values and all other registers are loaded with
> > synthetic state. The actual interrupt/exception/VMExit then triggers,
> > e.g. the %RIP on the stack for an exception is always the AEX target,
> > not the %RIP inside the enclave that actually faulted.
>
> So what exactly happens when an enclave accesses non-enclave memory
> and takes a page fault, for example? The SDM says that the #PF vector
> and error code are stored in the SSA frame where the kernel can't see
> them. Is a real #PF then delivered?

Yes. From the kernel's perspective a #PF occurred on the %RIP of the AEX
target. This holds true for all AEX types, e.g. GUEST_RIP on VMExit also
points at the AEX target.

On an AEX, %RAX, %RBX and %RCX are set to match the ERESUME parameter.
The idea is for userspace to have an ENCLU at the AEX target so that it
automatically ERESUMEs the enclave after the kernel handles the fault.

And the trampoline approach means the ucode flows for exceptions,
interrupts, VMExit, VMEnter, IRET, RSM, etc... generally don't need to be
SGX-aware. The events themselves just need to be redirected to the AEX
target and then redone.

> I guess that, if the memory in question gets faulted in, then the
> kernel resumes execution at the AEP address, which does ERESUME, and
> the enclave resumes. But if the access is bad, then the kernel
> delivers a signal (or uses some other new mechanism), and then what
> happens? Is the enclave just considered dead? Is user code supposed
> to EENTER back into the enclave to tell it that it got an error?

Completely depends on the enclave and its runtime.
A simple enclave may never expect to encounter a bad access or #UD and so
its runtime would probably just kill it. A test/development enclave might
have its runtime call back into the enclave to dump state on a fatal fault.
Complex runtimes, e.g. libraries that wrap unmodified applications, will
call back into the enclave so that the library's in-enclave fault handler
can decode what went wrong and take action accordingly, e.g. request CPUID
information if unmodified code tried to do CPUID.

> This whole mechanism seems very complicated, and it's not clear
> exactly what behavior user code wants.

No argument there. That's why I like the approach of dumping the
exception to userspace without trying to do anything intelligent in the
kernel. Userspace can then do whatever it wants AND we don't have to
worry about mucking with stacks.

One of the hiccups with the VDSO approach is that the enclave may want to
use the untrusted stack, i.e. the stack that has the VDSO's stack frame.
For example, Intel's SDK uses the untrusted stack to pass parameters for
EEXIT, which means an AEX might occur with what is effectively a bad stack
from the VDSO's perspective.
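
For readers who want the AEX flow discussed in this thread in concrete form, here is a rough Python model of it. It is an illustrative sketch only, not kernel or SDK code: the `Regs` and `Thread` classes, the function names and all addresses are hypothetical. The ENCLU leaf number for ERESUME (3) and the choice of %RBX = TCS and %RCX = AEP after an AEX reflect my reading of the SDM; the thread itself only says those registers "match the ERESUME parameter".

```python
from dataclasses import dataclass, field

ERESUME = 3      # ENCLU leaf number for ERESUME (assumption: per the Intel SDM)
PF_VECTOR = 14   # x86 page-fault vector


@dataclass
class Regs:
    rip: int = 0
    rsp: int = 0
    rbp: int = 0
    rax: int = 0
    rbx: int = 0
    rcx: int = 0


@dataclass
class Thread:
    """Hypothetical stand-in for per-thread enclave state (TCS plus SSA)."""
    tcs: int
    aep: int = 0          # AEX target supplied at EENTER/ERESUME
    outside_rsp: int = 0  # pre-enclave stack pointer
    outside_rbp: int = 0  # pre-enclave frame pointer
    ssa: list = field(default_factory=list)  # saved in-enclave frames


def eenter(regs, thread, aep, entry_rip):
    """Enter the enclave, remembering the AEX target and outside stack."""
    thread.aep = aep
    thread.outside_rsp, thread.outside_rbp = regs.rsp, regs.rbp
    regs.rip = entry_rip


def aex(regs, thread, vector):
    """Asynchronous exit: stash enclave state, expose only the AEX target."""
    thread.ssa.append({"rip": regs.rip, "rsp": regs.rsp,
                       "rbp": regs.rbp, "vector": vector})
    regs.rip = thread.aep
    regs.rsp, regs.rbp = thread.outside_rsp, thread.outside_rbp
    # Synthetic state: registers are preloaded so that an ENCLU placed at
    # the AEX target performs ERESUME without any further setup.
    regs.rax, regs.rbx, regs.rcx = ERESUME, thread.tcs, thread.aep
    return regs.rip  # the %RIP the kernel sees for the event


def eresume(regs, thread):
    """Resume the enclave from the most recently saved SSA frame."""
    thread.outside_rsp, thread.outside_rbp = regs.rsp, regs.rbp
    frame = thread.ssa.pop()
    regs.rip, regs.rsp, regs.rbp = frame["rip"], frame["rsp"], frame["rbp"]


# Usage: enter, fault inside the enclave, and see what the kernel observes.
regs = Regs(rip=0x1000, rsp=0x7FF0, rbp=0x7FF8)
thread = Thread(tcs=0x5000)
eenter(regs, thread, aep=0x2000, entry_rip=0x9000)
regs.rip, regs.rsp, regs.rbp = 0x9ABC, 0xE000, 0xE008  # running in-enclave
kernel_rip = aex(regs, thread, PF_VECTOR)
```

In this model the kernel only ever sees `kernel_rip == thread.aep`, never the faulting in-enclave address, and the fault vector sits in the SSA frame where only the enclave would be able to read it.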