Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67;
From:   "Huang, Kai" <kai.huang@intel.com>
To:     "Christopherson, Sean J" <sean.j.christopherson@intel.com>
CC:     Andy Lutomirski <luto@kernel.org>,
        Jethro Beekman <jethro@fortanix.com>,
        Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>,
        Thomas Gleixner <tglx@linutronix.de>,
        Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
        "x86@kernel.org" <x86@kernel.org>,
        Dave Hansen <dave.hansen@linux.intel.com>,
        Peter Zijlstra <peterz@infradead.org>,
        "H. Peter Anvin" <hpa@zytor.com>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        "linux-sgx@vger.kernel.org" <linux-sgx@vger.kernel.org>,
        Josh Triplett <josh@joshtriplett.org>,
        "Haitao Huang" <haitao.huang@linux.intel.com>,
        "Dr . Greg Wettstein" <greg@enjellic.com>
Subject: RE: x86/sgx: uapi change proposal
Thread-Topic: x86/sgx: uapi change proposal
Thread-Index: AQHUl3CouNmOMJQZHU2YX8Dh54+gQaWFODMAgAAIjACAAAbWAIAAVlWAgADM8wCAAnSJgIAc/6rw//+proCAAOdxMA==
Date:   Wed, 9 Jan 2019 05:24:38 +0000
Message-ID: <105F7BF4D0229846AF094488D65A0989355A58F1@PGSMSX112.gar.corp.intel.com>
References: <20181214215729.4221-1-sean.j.christopherson@intel.com>
 <7706b2aa71312e1f0009958bcab24e1e9d8d1237.camel@linux.intel.com>
 <598cd050-f0b5-d18c-96a0-915f02525e3e@fortanix.com>
 <20181219091148.GA5121@linux.intel.com>
 <613c6814-4e71-38e5-444a-545f0e286df8@fortanix.com>
 <20181219144515.GA30909@linux.intel.com>
 <CALCETrW_8LnitFamYUU7POHFOJFxLO7ESFUdS=aeG_SOvHVPkw@mail.gmail.com>
 <20181221162825.GB26865@linux.intel.com>
 <105F7BF4D0229846AF094488D65A0989355A45B6@PGSMSX112.gar.corp.intel.com>
 <20190108220946.GA30462@linux.intel.com>
In-Reply-To: <20190108220946.GA30462@linux.intel.com>
Accept-Language: en-US
Content-Language: en-US
dlp-product: dlpe-windows
dlp-version: 11.0.400.15
dlp-reaction: no-action
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8BIT
MIME-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk

> On Tue, Jan 08, 2019 at 11:27:11AM -0800, Huang, Kai wrote:
> > > >
> > > > Can one of you explain why SGX_ENCLAVE_CREATE is better than just
> > > > opening a new instance of /dev/sgx for each encalve?
> > >
> > > Directly associating /dev/sgx with an enclave means /dev/sgx can't
> > > be used to provide ioctl()'s for other SGX-related needs, e.g. to
> > > mmap() raw EPC and expose it a VM.  Proposed layout in the link
> > > below.  I'll also respond to Jarkko's question about exposing EPC
> > > through /dev/sgx instead of having KVM allocate it on behalf of the VM.
> > >
> > > https://lkml.kernel.org/r/20181218185349.GC30082@linux.intel.com
> >
> > Hi Sean,
> >
> > Sorry for replying to old email. But IMHO it is not a must that Qemu
> > needs to open some /dev/sgx and allocate/mmap EPC for guest's virtual
> > EPC slot, instead, KVM could create private slot, which is not visible
> > to Qemu, for virtual EPC, and KVM could call core-SGX EPC allocation
> > API directly.
> 
> That's possible, but it has several downsides.
> 
>   - Duplicates a lot of code in KVM for managing memory regions.

I don't see why there will be duplicated code. you can simply call __x86_set_memory_region to create private slot. It is KVM x86 equivalent to KVM_SET_USER_MEMORY_REGION from userspace. The only difference is Qemu is not aware of the private slot.

>   - Artificially restricts userspace to a single EPC region, unless
>     even more code is duplicated to handle multiple private regions.

You can have multiple private slots, by calling __x86_set_memory_region for each EPC section. KVM receives EPC section/sections info from Qemu, via CPUID, or dedicated IOCTL (is this you are going to add?), and simply creates private EPC slot/slots.

>   - Requires additional ioctls() or capabilities to probe EPC support

No. EPC info is from Qemu at the beginning (size is given by parameter, base is calculated by Qemu), and actually it is Qemu notifies KVM EPC info, so I don't think we require additional ioctls or capabilities here.

>   - Does not fit with Qemu/KVM's memory model, e.g. all other types of
>     memory are exposed to a guest through
> KVM_SET_USER_MEMORY_REGION.

EPC is different. I am not sure whether EPC needs to fit such model. There are already examples in KVM which uses private slot w/o using KVM_SET_USER_MEMORY_REGION, for example, APIC access page.

>   - Prevents userspace from debugging a guest's enclave.  I'm not saying
>     this is a likely scenario, but I also don't think we should preclude
>     it without good reason.

I am not sure how important it is, so don't know whether this can be a justification. To me we don't need to consider this. Qemu normally doesn't access guest memory unless it has to (ie, for device model). 

>   - KVM is now responsible for managing the lifecycle of EPC, e.g. what
>     happens if an EPC cgroup limit is lowered on a running VM and
>     KVM can't gracefully reclaim EPC?  The userspace hypervisor should
>     ultimately decide how to handle such an event.

Even using KVM_SET_USER_MEMORY_REGION, KVM is also responsible for managing the lifecycle of EPC, no? And "managing the lifecycle of EPC" you mean doesn't need to be "managing EPC itself" I suppose, since EPC should always be managed by core SGX code.

I don't see the difference between private slot and KVM_SET_USER_MEMORY_REGION, in terms of how does KVM reclaim EPC, or how does KVM do when it fails to reclaim EPC. 

>   - SGX logic is split between SGX and KVM, e.g. VA page management for
>     oversubscription will likely be common to SGX and KVM.  From a long
>     term maintenance perspective, this means that changes to the EPC
>     management could potentially need to be Acked by KVM, and vice versa.

I think most of the code should be in core SGX code under x86. KVM should only have the code that is specifically related to virtualization, ie, ENCLV.

VA page allocation should be under code SGX code. KVM might need to call function such as alloc_va_page, etc, but this is not a problem. There are many other cases now.

And this is not related to private slot vs KVM_SET_USER_MEMORY_REGION.

> 
> > I am not sure what's the good of allowing userspace to alloc/mmap a
> > raw EPC region? Userspace is not allowed to touch EPC anyway, expect
> > enclave code.
> >
> > To me KVM creates private EPC slot is cleaner than exposing
> > /dev/sgx/epc and allowing userspace to map some raw EPC region.
> 
> Cleaner in the sense that it's faster to get basic support up and running since
> there are fewer touchpoints, but there are long term ramifications to
> cramming EPC management in KVM.
> 
> And at this point I'm not stating any absolutes, e.g. how EPC will be handled
> by KVM.  What I'm pushing for is to not eliminate the possibility of having
> the SGX subsystem own all EPC management, e.g. don't tie /dev/sgx to a
> single enclave.

I suppose "SGX subsystem" you mean here is core SGX code under x86. IMHO EPC should always be managed by such SGX subsystem, and KVM and SGX driver are just consumers (ie, calling EPC allocation function, etc).

IMHO I think /dev/sgx (or whatever /dev/sgx/xxx) should be in SGX driver, but not SGX core code. For example, if we don't consider KVM EPC oversubscription here, theoretically we only need below code in core SGX code to make KVM SGX work:

1) SGX feature detection. There should be some core structure (such as boot_cpu_data, etc) where KVM can query SGX capabilities, ie, whether FLC is available.
2) EPC management. KVM simply calls EPC management APIs when it needs. For example, when EPC slot is created, we should populate all EPC pages, and fail to create VM if running out of EPC.
3) Other code to deal with ENCLS, SGX data structure, etc, since we have agreed even with EPC static allocation, we should trap EINIT. 

We probably even don't need code to deal with enclave, if only for KVM. Of course, in order to support KVM EPC oversubscription, we should also include enclave management code in core SGX code, but this doesn't necessary mean we should include /dev/sgx/xxx in core SGX code.

It seems there still are lots of design issues we haven't got consensus in terms of how SGX  driver should be (or how SGX stack is supposed to work), but if we can focus on what core SGX code should really have (w/o involving SGX driver logic), we can at least get KVM SGX code working. After all, we also have window SGX support. 

Thanks,
-Kai