This is an "early" RFC series for adding SGX virtualization to KVM. SGX
virtualization (more specifically, EPC virtualization) is dependent on the
not-yet-merged SGX enabling series and so cannot be considered for
inclusion any time soon.
The primary goal of this RFC is to get feedback on the overall approach,
e.g. code location, uAPI changes, functionality, etc... My hope is to
sort out any major issues sooner rather than later, so that if/when the
base SGX enabling is merged, virtualization support can quickly follow
suit.
That being said, nitpicking and bikeshedding are more than welcome :-)
This code applies on top of a slightly modified version of v21 of the SGX
enabling series[1]. The modifications on top of the SGX series are a few
minor bug fixes that are not related to SGX virtualization, but affect
code that is moved/modified by this series. The full source for the
modified version of v21 can be found at:
https://github.com/sean-jc/linux.git
under the tag:
sgx-v21-ish
A corresponding Qemu RFC will (hopefully) follow next week; the Qemu
patches need a bit more cleanup...
[1] https://lkml.kernel.org/r/[email protected]
Sean Christopherson (21):
x86/sgx: Add defines for SGX device minor numbers
x86/sgx: Move bus registration and device init to common code
x86/sgx: Move provisioning device to common code
x86/sgx: Add /dev/sgx/virt_epc device to allocate "raw" EPC for VMs
x86/sgx: Expose SGX architectural definitions to the kernel
KVM: x86: Add SGX sub-features leaf to reverse CPUID table
KVM: x86: Add WARN_ON_ONCE(index != 0) in __do_cpuid_ent
KVM: x86: Add kvm_x86_ops hook to short circuit emulation
KVM: VMX: Add basic handling of VM-Exit from SGX enclave
KVM: x86: Export kvm_mmu_gva_to_gpa_{read,write}() for VMX/SGX
KVM: x86: Export kvm_propagate_fault (as kvm_propagate_page_fault)
KVM: x86: Define new #PF SGX error code bit
x86/sgx: Move the intermediate EINIT helper into the driver
x86/sgx: Add helpers to expose ECREATE and EINIT to KVM
KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions
KVM: VMX: Add emulation of SGX Launch Control LE hash MSRs
KVM: VMX: Add handler for ENCLS[EINIT] to support SGX Launch Control
KVM: x86: Invoke kvm_x86_ops->cpuid_update() after kvm_update_cpuid()
KVM: VMX: Enable SGX virtualization for SGX1, SGX2 and LC
x86/sgx: Export sgx_set_attribute() for use by KVM
KVM: x86: Add capability to grant VM access to privileged SGX
attribute
Documentation/virtual/kvm/api.txt | 20 ++
arch/x86/Kconfig | 13 +
arch/x86/include/asm/kvm_host.h | 8 +-
arch/x86/include/asm/sgx.h | 17 +
.../cpu/sgx/arch.h => include/asm/sgx_arch.h} | 1 +
arch/x86/include/asm/vmx.h | 1 +
arch/x86/include/uapi/asm/vmx.h | 1 +
arch/x86/kernel/cpu/sgx/Makefile | 1 +
arch/x86/kernel/cpu/sgx/driver/driver.h | 3 +-
arch/x86/kernel/cpu/sgx/driver/ioctl.c | 40 ++-
arch/x86/kernel/cpu/sgx/driver/main.c | 73 +----
arch/x86/kernel/cpu/sgx/encl.c | 2 +-
arch/x86/kernel/cpu/sgx/encls.h | 2 +-
arch/x86/kernel/cpu/sgx/main.c | 141 ++++++--
arch/x86/kernel/cpu/sgx/sgx.h | 16 +-
arch/x86/kernel/cpu/sgx/virt.c | 308 ++++++++++++++++++
arch/x86/kernel/cpu/sgx/virt.h | 14 +
arch/x86/kvm/Makefile | 2 +
arch/x86/kvm/cpuid.c | 135 ++++++--
arch/x86/kvm/cpuid.h | 20 ++
arch/x86/kvm/emulate.c | 1 +
arch/x86/kvm/mmu.c | 12 -
arch/x86/kvm/svm.c | 19 +-
arch/x86/kvm/vmx/nested.c | 21 +-
arch/x86/kvm/vmx/nested.h | 5 +
arch/x86/kvm/vmx/sgx.c | 247 ++++++++++++++
arch/x86/kvm/vmx/sgx.h | 11 +
arch/x86/kvm/vmx/vmcs12.c | 1 +
arch/x86/kvm/vmx/vmcs12.h | 4 +-
arch/x86/kvm/vmx/vmx.c | 251 +++++++++++++-
arch/x86/kvm/vmx/vmx.h | 6 +
arch/x86/kvm/x86.c | 40 ++-
arch/x86/kvm/x86.h | 5 -
include/uapi/linux/kvm.h | 1 +
tools/testing/selftests/x86/sgx/defines.h | 2 +-
35 files changed, 1234 insertions(+), 210 deletions(-)
create mode 100644 arch/x86/include/asm/sgx.h
rename arch/x86/{kernel/cpu/sgx/arch.h => include/asm/sgx_arch.h} (99%)
create mode 100644 arch/x86/kernel/cpu/sgx/virt.c
create mode 100644 arch/x86/kernel/cpu/sgx/virt.h
create mode 100644 arch/x86/kvm/vmx/sgx.c
create mode 100644 arch/x86/kvm/vmx/sgx.h
--
2.22.0
Similar to the existing AMD #NPF case where emulation of the current
instruction is not possible due to lack of information, virtualization
of Intel SGX will introduce a scenario where emulation is not possible
due to the VMExit occurring in an SGX enclave. And again similar to
the AMD case, emulation can be initiated by kvm_mmu_page_fault(), i.e.
outside of the control of the vendor-specific code.
While the cause and architecturally visible behavior of the two cases
is different, e.g. Intel SGX will inject a #UD whereas AMD #NPF is a
clean resume or complete shutdown, the impact on the common emulation
code is identical: KVM must stop emulation immediately and resume the
guest.
Replace the existing need_emulation_on_page_fault() with a more generic
is_emulatable() kvm_x86_ops callback, which is called unconditionally
by x86_emulate_instruction().
Query is_emulatable() in handle_ud() as well so that the
force_emulation_prefix code doesn't incorrectly modify RIP before
calling emulate_instruction() in the absurdly unlikely scenario that
we encounter forced emulation in conjunction with "do not emulate".
Do this for both Intel and AMD so that any future changes to AMD's
emulation logic take effect as expected for handle_ud().
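To illustrate, a hedged sketch of what the VMX implementation might
look like once SGX lands (the enclave-mode check is hypothetical here;
the real plumbing arrives with the "VM-Exit from SGX enclave" patch):

static bool vmx_is_emulatable(struct kvm_vcpu *vcpu, void *insn, int insn_len)
{
	/*
	 * Hypothetical: bit 27 of the exit reason (the SDM's "enclave
	 * mode" bit) is set if the VM-Exit occurred from within an
	 * enclave.  Enclave state is inaccessible, so emulation is
	 * impossible; queue a #UD and resume the guest.
	 */
	if (unlikely(to_vmx(vcpu)->exit_reason & (1u << 27))) {
		kvm_queue_exception(vcpu, UD_VECTOR);
		return false;
	}

	return true;
}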
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/mmu.c | 12 ------------
arch/x86/kvm/svm.c | 19 +++++++++++++++++--
arch/x86/kvm/vmx/vmx.c | 11 +++++------
arch/x86/kvm/x86.c | 9 ++++++++-
5 files changed, 31 insertions(+), 22 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 26d1eb83f72a..1341d8390ebe 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1198,7 +1198,7 @@ struct kvm_x86_ops {
uint16_t *vmcs_version);
uint16_t (*nested_get_evmcs_version)(struct kvm_vcpu *vcpu);
- bool (*need_emulation_on_page_fault)(struct kvm_vcpu *vcpu);
+ bool (*is_emulatable)(struct kvm_vcpu *vcpu, void *insn, int insn_len);
};
struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 98f6e4f88b04..bf6952f8f330 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5412,18 +5412,6 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
if (!mmio_info_in_cache(vcpu, cr2, direct) && !is_guest_mode(vcpu))
emulation_type = EMULTYPE_ALLOW_RETRY;
emulate:
- /*
- * On AMD platforms, under certain conditions insn_len may be zero on #NPF.
- * This can happen if a guest gets a page-fault on data access but the HW
- * table walker is not able to read the instruction page (e.g instruction
- * page is not present in memory). In those cases we simply restart the
- * guest, with the exception of AMD Erratum 1096 which is unrecoverable.
- */
- if (unlikely(insn && !insn_len)) {
- if (!kvm_x86_ops->need_emulation_on_page_fault(vcpu))
- return 1;
- }
-
er = x86_emulate_instruction(vcpu, cr2, emulation_type, insn, insn_len);
switch (er) {
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 48c865a4e5dd..0fb8b60eb136 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -7115,10 +7115,25 @@ static int nested_enable_evmcs(struct kvm_vcpu *vcpu,
return -ENODEV;
}
-static bool svm_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
+static bool svm_is_emulatable(struct kvm_vcpu *vcpu, void *insn, int insn_len)
{
bool is_user, smap;
+ if (likely(!insn || insn_len))
+ return true;
+
+ /*
+ * Under certain conditions insn_len may be zero on #NPF. This can
+ * happen if a guest gets a page-fault on data access but the HW table
+ * walker is not able to read the instruction page (e.g instruction
+ * page is not present in memory). In those cases we simply restart the
+ * guest, with the exception of AMD Erratum 1096 which is unrecoverable.
+ */
+ if (unlikely(insn && !insn_len)) {
+ if (!kvm_x86_ops->need_emulation_on_page_fault(vcpu))
+ return 1;
+ }
+
is_user = svm_get_cpl(vcpu) == 3;
smap = !kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
@@ -7279,7 +7294,7 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
.nested_enable_evmcs = nested_enable_evmcs,
.nested_get_evmcs_version = nested_get_evmcs_version,
- .need_emulation_on_page_fault = svm_need_emulation_on_page_fault,
+ .is_emulatable = svm_is_emulatable,
};
static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d98eac371c0a..f48fc990ca6d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1458,6 +1458,10 @@ static int vmx_rtit_ctl_check(struct kvm_vcpu *vcpu, u64 data)
return 0;
}
+static bool vmx_is_emulatable(struct kvm_vcpu *vcpu, void *insn, int insn_len)
+{
+ return true;
+}
static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
{
@@ -7416,11 +7420,6 @@ static int enable_smi_window(struct kvm_vcpu *vcpu)
return 0;
}
-static bool vmx_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
-{
- return 0;
-}
-
static __init int hardware_setup(void)
{
unsigned long host_bndcfgs;
@@ -7723,7 +7722,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
.set_nested_state = NULL,
.get_vmcs12_pages = NULL,
.nested_enable_evmcs = NULL,
- .need_emulation_on_page_fault = vmx_need_emulation_on_page_fault,
+ .is_emulatable = vmx_is_emulatable,
};
static void vmx_cleanup_l1d_flush(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 63bb1ee8258e..afcc01a59421 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5274,6 +5274,9 @@ int handle_ud(struct kvm_vcpu *vcpu)
char sig[5]; /* ud2; .ascii "kvm" */
struct x86_exception e;
+ if (unlikely(!kvm_x86_ops->is_emulatable(vcpu, NULL, 0)))
+ return 1;
+
if (force_emulation_prefix &&
kvm_read_guest_virt(vcpu, kvm_get_linear_rip(vcpu),
sig, sizeof(sig), &e) == 0 &&
@@ -6430,7 +6433,10 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
int r;
struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;
bool writeback = true;
- bool write_fault_to_spt = vcpu->arch.write_fault_to_shadow_pgtable;
+ bool write_fault_to_spt;
+
+ if (unlikely(!kvm_x86_ops->is_emulatable(vcpu, insn, insn_len)))
+ return EMULATE_DONE;
vcpu->arch.l1tf_flush_l1d = true;
@@ -6438,6 +6444,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
* Clear write_fault_to_shadow_pgtable here to ensure it is
* never reused.
*/
+ write_fault_to_spt = vcpu->arch.write_fault_to_shadow_pgtable;
vcpu->arch.write_fault_to_shadow_pgtable = false;
kvm_clear_exception_queue(vcpu);
--
2.22.0
Add an SGX device to enable userspace to allocate EPC without an
associated enclave. The intended and only known use case for direct EPC
allocation is to expose EPC to a KVM guest, hence the virt_epc moniker,
virt.{c,h} files and INTEL_SGX_VIRTUALIZATION Kconfig.
Although KVM is the end consumer of EPC, and will need hooks into the
virtual EPC management if oversubscription of EPC for guests is ever
supported (see below), implement direct access to EPC in the SGX
subsystem instead of in KVM. Doing so has two major advantages:
- Does not require changes to KVM's uAPI, e.g. EPC gets handled as
just another memory backend for guests.
- EPC management is wholly contained in the SGX subsystem, e.g. SGX
does not have to export any symbols, changes to reclaim flows don't
need to be routed through KVM, SGX's dirty laundry doesn't have to
get aired out for the world to see, and so on and so forth.
Oversubscription of EPC for KVM guests is not currently supported. Due
to the complications of handling reclaim conflicts between guest and
host, KVM EPC oversubscription is expected to be at least an order of
magnitude more complex than basic support for SGX virtualization.
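To make the "just another memory backend" point concrete, a hedged
sketch of the expected userspace flow (the device path comes from this
patch; the memslot plumbing is standard KVM uAPI; slot number and error
handling are elided):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/* Sketch: map "raw" EPC and hand it to KVM like any other memslot. */
static int map_epc_into_guest(int vm_fd, __u64 guest_epc_base, size_t size)
{
	struct kvm_userspace_memory_region region = { 0 };
	void *epc;
	int fd;

	fd = open("/dev/sgx/virt_epc", O_RDWR);
	if (fd < 0)
		return -1;

	/* Must be MAP_SHARED; EPC pages are populated lazily on fault. */
	epc = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (epc == MAP_FAILED)
		return -1;

	region.slot = 1;	/* arbitrary unused slot */
	region.guest_phys_addr = guest_epc_base;
	region.memory_size = size;
	region.userspace_addr = (__u64)(unsigned long)epc;

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}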
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/Kconfig | 10 ++
arch/x86/kernel/cpu/sgx/Makefile | 1 +
arch/x86/kernel/cpu/sgx/main.c | 3 +
arch/x86/kernel/cpu/sgx/sgx.h | 3 +-
arch/x86/kernel/cpu/sgx/virt.c | 253 +++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/sgx/virt.h | 14 ++
6 files changed, 283 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kernel/cpu/sgx/virt.c
create mode 100644 arch/x86/kernel/cpu/sgx/virt.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 74ccb1bdea16..c1bdb9f85928 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1961,6 +1961,16 @@ config INTEL_SGX_DRIVER
If unsure, say N.
+config INTEL_SGX_VIRTUALIZATION
+ bool "Intel SGX Virtualization"
+ depends on INTEL_SGX && KVM_INTEL
+ help
+ Enabling support for SGX virtualization enables userspace to allocate
+ "raw" EPC for the purpose of exposing EPC to a KVM guest, i.e. a
+ virtual machine, via a device node (/dev/sgx/virt_epc by default).
+
+ If unsure, say N.
+
config EFI
bool "EFI runtime service support"
depends on ACPI
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index e5d1e862969c..559fd0f9be50 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -1,2 +1,3 @@
obj-y += encl.o encls.o main.o reclaim.o
obj-$(CONFIG_INTEL_SGX_DRIVER) += driver/
+obj-$(CONFIG_INTEL_SGX_VIRTUALIZATION) += virt.o
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 9f4473597620..ead827371139 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -14,6 +14,7 @@
#include "arch.h"
#include "encls.h"
#include "sgx.h"
+#include "virt.h"
struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
int sgx_nr_epc_sections;
@@ -422,7 +423,9 @@ static __init int sgx_init(void)
if (ret)
goto err_provision_dev;
+ /* Success if the native *or* virtual driver initialized cleanly. */
ret = sgx_drv_init();
+ ret = sgx_virt_epc_init() ? ret : 0;
if (ret)
goto err_provision_cdev;
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index a0af8849c7c3..16cdb935aaa7 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -91,7 +91,8 @@ int sgx_einit(struct sgx_sigstruct *sigstruct, struct sgx_einittoken *token,
#define SGX_ENCL_DEV_MINOR 0
#define SGX_PROV_DEV_MINOR 1
-#define SGX_MAX_NR_DEVICES 2
+#define SGX_VIRT_DEV_MINOR 2
+#define SGX_MAX_NR_DEVICES 3
__init int sgx_dev_init(const char *name, struct device *dev,
struct cdev *cdev, const struct file_operations *fops,
diff --git a/arch/x86/kernel/cpu/sgx/virt.c b/arch/x86/kernel/cpu/sgx/virt.c
new file mode 100644
index 000000000000..79ee5917a4fc
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/virt.c
@@ -0,0 +1,253 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/cdev.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/sched/signal.h>
+#include <linux/slab.h>
+#include <uapi/asm/sgx.h>
+
+#include "encls.h"
+#include "sgx.h"
+#include "virt.h"
+
+struct sgx_virt_epc_page {
+ struct sgx_epc_page *epc_page;
+};
+
+struct sgx_virt_epc {
+ struct radix_tree_root page_tree;
+ struct rw_semaphore lock;
+};
+
+static inline unsigned long sgx_virt_epc_calc_index(struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ return vma->vm_pgoff + PFN_DOWN(addr - vma->vm_start);
+}
+
+static struct sgx_virt_epc_page *__sgx_virt_epc_fault(struct sgx_virt_epc *epc,
+ struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ struct sgx_virt_epc_page *page;
+ struct sgx_epc_page *epc_page;
+ unsigned long index;
+ int ret;
+
+ index = sgx_virt_epc_calc_index(vma, addr);
+
+ page = radix_tree_lookup(&epc->page_tree, index);
+ if (page) {
+ if (page->epc_page)
+ return page;
+ } else {
+ page = kzalloc(sizeof(*page), GFP_KERNEL);
+ if (!page)
+ return ERR_PTR(-ENOMEM);
+
+ ret = radix_tree_insert(&epc->page_tree, index, page);
+ if (unlikely(ret)) {
+ kfree(page);
+ return ERR_PTR(ret);
+ }
+ }
+
+ epc_page = sgx_alloc_page(&epc, false);
+ if (IS_ERR(epc_page))
+ return ERR_CAST(epc_page);
+
+ ret = vmf_insert_pfn(vma, addr, PFN_DOWN(epc_page->desc));
+ if (unlikely(ret != VM_FAULT_NOPAGE)) {
+ sgx_free_page(epc_page);
+ return ERR_PTR(-EFAULT);
+ }
+
+ page->epc_page = epc_page;
+
+ return page;
+}
+
+static vm_fault_t sgx_virt_epc_fault(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct sgx_virt_epc *epc = (struct sgx_virt_epc *)vma->vm_private_data;
+ struct sgx_virt_epc_page *page;
+
+ down_write(&epc->lock);
+ page = __sgx_virt_epc_fault(epc, vma, vmf->address);
+ up_write(&epc->lock);
+
+ if (!IS_ERR(page) || signal_pending(current))
+ return VM_FAULT_NOPAGE;
+
+ if (PTR_ERR(page) == -EBUSY && (vmf->flags & FAULT_FLAG_ALLOW_RETRY)) {
+ up_read(&vma->vm_mm->mmap_sem);
+ return VM_FAULT_RETRY;
+ }
+
+ return VM_FAULT_SIGBUS;
+}
+
+static struct sgx_virt_epc_page *sgx_virt_epc_get_page(struct sgx_virt_epc *epc,
+ unsigned long index)
+{
+ struct sgx_virt_epc_page *page;
+
+ down_read(&epc->lock);
+ page = radix_tree_lookup(&epc->page_tree, index);
+ if (!page || !page->epc_page)
+ page = ERR_PTR(-EFAULT);
+ up_read(&epc->lock);
+
+ return page;
+}
+
+static int sgx_virt_epc_access(struct vm_area_struct *vma, unsigned long start,
+ void *buf, int len, int write)
+{
+ /* EDBG{RD,WR} are naturally sized, i.e. always 8-byte on 64-bit. */
+ unsigned char data[sizeof(unsigned long)];
+ struct sgx_virt_epc_page *page;
+ struct sgx_virt_epc *epc;
+ unsigned long addr, index;
+ int offset, cnt, i;
+ int ret = 0;
+ void *p;
+
+ epc = vma->vm_private_data;
+
+ for (i = 0; i < len && !ret; i += cnt) {
+ addr = start + i;
+ if (i == 0 || PFN_DOWN(addr) != PFN_DOWN(addr - cnt))
+ index = sgx_virt_epc_calc_index(vma, addr);
+
+ page = sgx_virt_epc_get_page(epc, index);
+
+ /*
+ * EDBG{RD,WR} require an active enclave, and given that VMM
+ * EPC oversubscription isn't supported, a not-present EPC page
+ * means the guest hasn't accessed the page and therefore can't
+ * possibly have added the page to an enclave.
+ */
+ if (IS_ERR(page))
+ return PTR_ERR(page);
+
+ offset = addr & (sizeof(unsigned long) - 1);
+ addr = ALIGN_DOWN(addr, sizeof(unsigned long));
+ cnt = min((int)sizeof(unsigned long) - offset, len - i);
+
+ p = sgx_epc_addr(page->epc_page) + (addr & ~PAGE_MASK);
+
+ /* EDBGRD for read, or to do RMW for a partial write. */
+ if (!write || cnt != sizeof(unsigned long))
+ ret = __edbgrd(p, (void *)data);
+
+ if (!ret) {
+ if (write) {
+ memcpy(data + offset, buf + i, cnt);
+ ret = __edbgwr(p, (void *)data);
+ } else {
+ memcpy(buf + i, data + offset, cnt);
+ }
+ }
+ }
+
+ if (ret)
+ return -EIO;
+ return i;
+}
+
+const struct vm_operations_struct sgx_virt_epc_vm_ops = {
+ .fault = sgx_virt_epc_fault,
+ .access = sgx_virt_epc_access,
+};
+
+static int sgx_virt_epc_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ if (!(vma->vm_flags & VM_SHARED))
+ return -EINVAL;
+
+ vma->vm_ops = &sgx_virt_epc_vm_ops;
+ vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTDUMP;
+ vma->vm_private_data = file->private_data;
+
+ return 0;
+}
+
+static int sgx_virt_epc_release(struct inode *inode, struct file *file)
+{
+ struct sgx_virt_epc *epc = file->private_data;
+ struct radix_tree_iter iter;
+ struct sgx_virt_epc_page *page;
+ void **slot;
+
+ LIST_HEAD(secs_pages);
+
+ radix_tree_for_each_slot(slot, &epc->page_tree, &iter, 0) {
+ page = *slot;
+ if (page->epc_page && __sgx_free_page(page->epc_page))
+ continue;
+ kfree(page);
+ radix_tree_delete(&epc->page_tree, iter.index);
+ }
+
+ /*
+ * Because we don't track which pages are SECS pages, it's possible
+ * for EREMOVE to fail, e.g. a SECS page can have children if a VM
+ * shutdown unexpectedly. Retry all failed pages after iterating
+ * through the entire tree, at which point all children should be
+ * removed and the SECS pages can be nuked as well.
+ */
+ radix_tree_for_each_slot(slot, &epc->page_tree, &iter, 0) {
+ page = *slot;
+ if (!(WARN_ON(!page->epc_page)))
+ sgx_free_page(page->epc_page);
+ radix_tree_delete(&epc->page_tree, iter.index);
+ }
+
+ kfree(epc);
+
+ return 0;
+}
+
+static int sgx_virt_epc_open(struct inode *inode, struct file *file)
+{
+ struct sgx_virt_epc *epc;
+
+ epc = kzalloc(sizeof(struct sgx_virt_epc), GFP_KERNEL);
+ if (!epc)
+ return -ENOMEM;
+
+ init_rwsem(&epc->lock);
+ INIT_RADIX_TREE(&epc->page_tree, GFP_KERNEL);
+
+ file->private_data = epc;
+
+ return 0;
+}
+
+static const struct file_operations sgx_virt_epc_fops = {
+ .owner = THIS_MODULE,
+ .open = sgx_virt_epc_open,
+ .release = sgx_virt_epc_release,
+ .mmap = sgx_virt_epc_mmap,
+};
+
+static struct device sgx_virt_epc_dev;
+static struct cdev sgx_virt_epc_cdev;
+
+int __init sgx_virt_epc_init(void)
+{
+ int ret = sgx_dev_init("sgx/virt_epc", &sgx_virt_epc_dev,
+ &sgx_virt_epc_cdev, &sgx_virt_epc_fops,
+ SGX_VIRT_DEV_MINOR);
+ if (ret)
+ return ret;
+
+ ret = cdev_device_add(&sgx_virt_epc_cdev, &sgx_virt_epc_dev);
+ if (ret)
+ put_device(&sgx_virt_epc_dev);
+
+ return ret;
+}
diff --git a/arch/x86/kernel/cpu/sgx/virt.h b/arch/x86/kernel/cpu/sgx/virt.h
new file mode 100644
index 000000000000..436170412b98
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/virt.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+#ifndef _ASM_X86_SGX_VIRT_H
+#define _ASM_X86_SGX_VIRT_H
+
+#ifdef CONFIG_INTEL_SGX_VIRTUALIZATION
+int __init sgx_virt_epc_init(void);
+#else
+static inline int __init sgx_virt_epc_init(void)
+{
+ return -ENODEV;
+}
+#endif
+
+#endif /* _ASM_X86_SGX_VIRT_H */
--
2.22.0
Except for one outlier, function 7, all cases in __do_cpuid_ent and
its children assume that the index passed in is zero. Furthermore,
the index is fully under KVM's control and all callers pass an index
of zero. In other words, a non-zero index would indicate either a
bug in the caller or a new case that is expected to be handled. WARN
and return an error on a non-zero index and remove the now unreachable
code in function 7 for handling a non-zero index.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/cpuid.c | 57 ++++++++++++++++++++++----------------------
1 file changed, 29 insertions(+), 28 deletions(-)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 4992e7c99588..70e488951f25 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -410,6 +410,14 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
F(MD_CLEAR);
+ /*
+ * The code below assumes index == 0, which simplifies handling leafs
+ * with a dynamic number of sub-leafs. The index is fully under KVM's
+ * control, i.e. a non-zero value is a bug.
+ */
+ if (WARN_ON_ONCE(index != 0))
+ return -EINVAL;
+
/* all calls to cpuid_count() should be made on the same cpu */
get_cpu();
@@ -480,38 +488,31 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
entry->ecx = 0;
entry->edx = 0;
break;
- case 7: {
+ case 7:
entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
/* Mask ebx against host capability word 9 */
- if (index == 0) {
- entry->ebx &= kvm_cpuid_7_0_ebx_x86_features;
- cpuid_mask(&entry->ebx, CPUID_7_0_EBX);
- // TSC_ADJUST is emulated
- entry->ebx |= F(TSC_ADJUST);
- entry->ecx &= kvm_cpuid_7_0_ecx_x86_features;
- f_la57 = entry->ecx & F(LA57);
- cpuid_mask(&entry->ecx, CPUID_7_ECX);
- /* Set LA57 based on hardware capability. */
- entry->ecx |= f_la57;
- entry->ecx |= f_umip;
- /* PKU is not yet implemented for shadow paging. */
- if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE))
- entry->ecx &= ~F(PKU);
- entry->edx &= kvm_cpuid_7_0_edx_x86_features;
- cpuid_mask(&entry->edx, CPUID_7_EDX);
- /*
- * We emulate ARCH_CAPABILITIES in software even
- * if the host doesn't support it.
- */
- entry->edx |= F(ARCH_CAPABILITIES);
- } else {
- entry->ebx = 0;
- entry->ecx = 0;
- entry->edx = 0;
- }
+ entry->ebx &= kvm_cpuid_7_0_ebx_x86_features;
+ cpuid_mask(&entry->ebx, CPUID_7_0_EBX);
+ // TSC_ADJUST is emulated
+ entry->ebx |= F(TSC_ADJUST);
+ entry->ecx &= kvm_cpuid_7_0_ecx_x86_features;
+ f_la57 = entry->ecx & F(LA57);
+ cpuid_mask(&entry->ecx, CPUID_7_ECX);
+ /* Set LA57 based on hardware capability. */
+ entry->ecx |= f_la57;
+ entry->ecx |= f_umip;
+ /* PKU is not yet implemented for shadow paging. */
+ if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE))
+ entry->ecx &= ~F(PKU);
+ entry->edx &= kvm_cpuid_7_0_edx_x86_features;
+ cpuid_mask(&entry->edx, CPUID_7_EDX);
+ /*
+ * We emulate ARCH_CAPABILITIES in software even
+ * if the host doesn't support it.
+ */
+ entry->edx |= F(ARCH_CAPABILITIES);
entry->eax = 0;
break;
- }
case 9:
break;
case 0xa: { /* Architectural Performance Monitoring */
--
2.22.0
SGX adds a basic support bit to CPUID(7, 0), and enumerates SGX
capabilities, e.g. EPC info, ENCLS leafs, etc..., in CPUID(0x12, *).
All SGX1 and SGX2 ENCLS leafs (supported in hardware) can be exposed
to the guest unconditionally. All other ENCLS leafs (currently the
ENCLS_C leafs) and all ENCLV leafs currently cannot be exposed to the
guest.
Flexible Launch Control, a.k.a. SGX LC, allows software to control the
key that is used to verify the signer of an enclave. Because SGX LC
impacts guest operation even if it's not exposed to the guest, i.e.
EINIT is affected by hardware's LE hash MSRs, SGX cannot be exposed to
the guest if the host supports LC without explicit LC support in KVM.
In other words, LC support is required to run on platforms with LC
enabled in the host, thus making exposure of SGX LC to the guest a
formality.
Access to the provision key is not supported in this patch. Provision
key access is controlled via securityfs; a future patch will plumb in
the ability for the userspace hypervisor to grant a VM access to the
provision key by passing in an appropriate file descriptor.
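A plausible shape for that future flow, sketched here purely for
context (the capability name and device path are assumptions based on
the last patch in this series, not final uAPI):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Hypothetical: the VMM proves it holds the provisioning privilege by
 * passing a file descriptor for the provisioning device, after which
 * KVM allows the guest to set ATTRIBUTES.PROVISIONKEY.
 */
static int allow_provision_key(int vm_fd)
{
	struct kvm_enable_cap cap = { 0 };
	int fd = open("/dev/sgx/provision", O_RDONLY);

	if (fd < 0)
		return -1;

	cap.cap = KVM_CAP_SGX_ATTRIBUTE;	/* assumed name */
	cap.args[0] = fd;

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}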
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/cpuid.c | 72 +++++++++++++++++-
arch/x86/kvm/vmx/nested.c | 19 ++++-
arch/x86/kvm/vmx/nested.h | 5 ++
arch/x86/kvm/vmx/sgx.h | 11 +++
arch/x86/kvm/vmx/vmcs12.c | 1 +
arch/x86/kvm/vmx/vmcs12.h | 4 +-
arch/x86/kvm/vmx/vmx.c | 156 ++++++++++++++++++++++++++++++++++++--
arch/x86/kvm/vmx/vmx.h | 1 +
8 files changed, 254 insertions(+), 15 deletions(-)
create mode 100644 arch/x86/kvm/vmx/sgx.h
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 4c235af5318c..73a0326a1968 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -18,6 +18,7 @@
#include <asm/processor.h>
#include <asm/user.h>
#include <asm/fpu/xstate.h>
+#include <asm/sgx_arch.h>
#include "cpuid.h"
#include "lapic.h"
#include "mmu.h"
@@ -117,6 +118,21 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu)
if (best && (best->eax & (F(XSAVES) | F(XSAVEC))))
best->ebx = xstate_required_size(vcpu->arch.xcr0, true);
+ /*
+ * Bits 127:0 of the allowed SECS.ATTRIBUTES (CPUID.0x12.0x1) enumerate
+ * the supported XSAVE Feature Request Mask (XFRM), i.e. the enclave's
+ * requested XCR0 value. The enclave's XFRM must be a subset of XCR0
+ * at the time of EENTER, thus adjust the allowed XFRM by the guest's
+ * supported XCR0. Similar to XCR0 handling, FP and SSE are forced to
+ * '1' even on CPUs that don't support XSAVE.
+ */
+ best = kvm_find_cpuid_entry(vcpu, 0x12, 0x1);
+ if (best) {
+ best->ecx &= vcpu->arch.guest_supported_xcr0 & 0xffffffff;
+ best->edx &= vcpu->arch.guest_supported_xcr0 >> 32;
+ best->ecx |= XFEATURE_MASK_FPSSE;
+ }
+
/*
* The existing code assumes virtual address is 48-bit or 57-bit in the
* canonical address checks; exit if it is ever changed.
@@ -393,7 +409,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
F(BMI2) | F(ERMS) | f_invpcid | F(RTM) | f_mpx | F(RDSEED) |
F(ADX) | F(SMAP) | F(AVX512IFMA) | F(AVX512F) | F(AVX512PF) |
F(AVX512ER) | F(AVX512CD) | F(CLFLUSHOPT) | F(CLWB) | F(AVX512DQ) |
- F(SHA_NI) | F(AVX512BW) | F(AVX512VL) | f_intel_pt;
+ F(SHA_NI) | F(AVX512BW) | F(AVX512VL) | f_intel_pt | F(SGX);
/* cpuid 0xD.1.eax */
const u32 kvm_cpuid_D_1_eax_x86_features =
@@ -404,7 +420,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
F(AVX512VBMI) | F(LA57) | F(PKU) | 0 /*OSPKE*/ |
F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
- F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B);
+ F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | F(SGX_LC);
/* cpuid 7.0.edx*/
const u32 kvm_cpuid_7_0_edx_x86_features =
@@ -412,6 +428,23 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
F(MD_CLEAR);
+ /* cpuid 12.0.eax*/
+ const u32 kvm_cpuid_12_0_eax_x86_features =
+ F(SGX1) | F(SGX2) | 0 /* Reserved */ | 0 /* Reserved */ |
+ 0 /* Reserved */ | 0 /* ENCLV */ | 0 /* ENCLS_C */;
+
+ /* cpuid 12.0.ebx*/
+ const u32 kvm_cpuid_12_0_ebx_sgx_features =
+ SGX_MISC_EXINFO;
+
+ /* cpuid 12.1.eax*/
+ const u32 kvm_cpuid_12_1_eax_sgx_features =
+ SGX_ATTR_DEBUG | SGX_ATTR_MODE64BIT | 0 /* PROVISIONKEY */ |
+ SGX_ATTR_EINITTOKENKEY | SGX_ATTR_KSS;
+
+ /* cpuid 12.1.ebx*/
+ const u32 kvm_cpuid_12_1_ebx_sgx_features = 0;
+
/*
* The code below assumes index == 0, which simplifies handling leafs
* with a dynamic number of sub-leafs. The index is fully under KVM's
@@ -433,7 +466,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
switch (function) {
case 0:
- entry->eax = min(entry->eax, (u32)(f_intel_pt ? 0x14 : 0xd));
+ entry->eax = min(entry->eax, (u32)(f_intel_pt ? 0x14 : 0x12));
break;
case 1:
entry->edx &= kvm_cpuid_1_edx_x86_features;
@@ -607,6 +640,39 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
}
break;
}
+ case 0x12:
+ /* Intel SGX */
+ if (!boot_cpu_has(X86_FEATURE_SGX)) {
+ entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
+ break;
+ }
+
+ /*
+ * Index 0: Sub-features, MISCSELECT (a.k.a extended features)
+ * and max enclave sizes. The SGX sub-features and MISCSELECT
+ * are restricted by KVM and x86_capabilities (like most
+ * feature flags), while enclave size is unrestricted.
+ */
+ entry->eax &= kvm_cpuid_12_0_eax_x86_features;
+ entry->ebx &= kvm_cpuid_12_0_ebx_sgx_features;
+ cpuid_mask(&entry->eax, CPUID_LNX_3);
+ entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+
+ /*
+ * Index 1: SECS.ATTRIBUTES. ATTRIBUTES are restricted a la
+ * feature flags. Unconditionally advertise all supported
+ * flags, even privileged attributes that require explicit
+ * opt-in from userspace. ATTRIBUTES.XFRM is not adjusted
+ * as userspace is expected to derive it from supported XCR0.
+ */
+ if (*nent >= maxnent)
+ goto out;
+ do_cpuid_1_ent(++entry, 0x12, 0x1);
+ entry->eax &= kvm_cpuid_12_1_eax_sgx_features;
+ entry->ebx &= kvm_cpuid_12_1_ebx_sgx_features;
+ entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+ ++*nent;
+ break;
/* Intel PT */
case 0x14: {
int t, times = entry->eax;
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index fef4fb3e1aaa..6d0e46584133 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2074,7 +2074,7 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
vmcs_write64(APIC_ACCESS_ADDR, -1ull);
if (exec_control & SECONDARY_EXEC_ENCLS_EXITING)
- vmcs_write64(ENCLS_EXITING_BITMAP, -1ull);
+ vmx_write_encls_bitmap(&vmx->vcpu, vmcs12);
vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
}
@@ -5014,6 +5014,20 @@ static bool nested_vmx_exit_handled_cr(struct kvm_vcpu *vcpu,
return false;
}
+static bool nested_vmx_exit_handled_encls(struct kvm_vcpu *vcpu,
+ struct vmcs12 *vmcs12)
+{
+ u32 encls_leaf;
+
+ if (!nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENCLS_EXITING))
+ return false;
+
+ encls_leaf = vcpu->arch.regs[VCPU_REGS_RAX];
+ /* Bit 63 is consulted for leaf 63 and all out-of-range leafs. */
+ if (encls_leaf > 62)
+ encls_leaf = 63;
+ return (vmcs12->encls_exiting_bitmap & (1ULL << encls_leaf));
+}
+
static bool nested_vmx_exit_handled_vmcs_access(struct kvm_vcpu *vcpu,
struct vmcs12 *vmcs12, gpa_t bitmap)
{
@@ -5212,8 +5226,7 @@ bool nested_vmx_exit_reflected(struct kvm_vcpu *vcpu, u32 exit_reason)
/* VM functions are emulated through L2->L0 vmexits. */
return false;
case EXIT_REASON_ENCLS:
- /* SGX is never exposed to L1 */
- return false;
+ return nested_vmx_exit_handled_encls(vcpu, vmcs12);
default:
return true;
}
diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
index e847ff1019a2..527bab90703d 100644
--- a/arch/x86/kvm/vmx/nested.h
+++ b/arch/x86/kvm/vmx/nested.h
@@ -226,6 +226,11 @@ static inline bool nested_cpu_has_save_preemption_timer(struct vmcs12 *vmcs12)
VM_EXIT_SAVE_VMX_PREEMPTION_TIMER;
}
+static inline bool nested_cpu_has_encls_exit(struct vmcs12 *vmcs12)
+{
+ return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENCLS_EXITING);
+}
+
/*
* In nested virtualization, check if L1 asked to exit on external interrupts.
* For most existing hypervisors, this will always return true.
diff --git a/arch/x86/kvm/vmx/sgx.h b/arch/x86/kvm/vmx/sgx.h
new file mode 100644
index 000000000000..2087d5f70883
--- /dev/null
+++ b/arch/x86/kvm/vmx/sgx.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_SGX_H
+#define __KVM_X86_SGX_H
+
+#ifdef CONFIG_INTEL_SGX_VIRTUALIZATION
+int handle_encls_ecreate(struct kvm_vcpu *vcpu);
+int handle_encls_einit(struct kvm_vcpu *vcpu);
+#endif
+
+#endif /* __KVM_X86_SGX_H */
diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
index 53dfb401316d..355020944f6c 100644
--- a/arch/x86/kvm/vmx/vmcs12.c
+++ b/arch/x86/kvm/vmx/vmcs12.c
@@ -50,6 +50,7 @@ const unsigned short vmcs_field_to_offset_table[] = {
FIELD64(VMREAD_BITMAP, vmread_bitmap),
FIELD64(VMWRITE_BITMAP, vmwrite_bitmap),
FIELD64(XSS_EXIT_BITMAP, xss_exit_bitmap),
+ FIELD64(ENCLS_EXITING_BITMAP, encls_exiting_bitmap),
FIELD64(GUEST_PHYSICAL_ADDRESS, guest_physical_address),
FIELD64(VMCS_LINK_POINTER, vmcs_link_pointer),
FIELD64(GUEST_IA32_DEBUGCTL, guest_ia32_debugctl),
diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
index 337718fc8a36..82a6ccd91ce5 100644
--- a/arch/x86/kvm/vmx/vmcs12.h
+++ b/arch/x86/kvm/vmx/vmcs12.h
@@ -69,7 +69,8 @@ struct __packed vmcs12 {
u64 vm_function_control;
u64 eptp_list_address;
u64 pml_address;
- u64 padding64[3]; /* room for future expansion */
+ u64 encls_exiting_bitmap;
+ u64 padding64[2]; /* room for future expansion */
/*
* To allow migration of L1 (complete with its L2 guests) between
* machines of different natural widths (32 or 64 bit), we cannot have
@@ -259,6 +260,7 @@ static inline void vmx_check_vmcs12_offsets(void)
CHECK_OFFSET(vm_function_control, 296);
CHECK_OFFSET(eptp_list_address, 304);
CHECK_OFFSET(pml_address, 312);
+ CHECK_OFFSET(encls_exiting_bitmap, 320);
CHECK_OFFSET(cr0_guest_host_mask, 344);
CHECK_OFFSET(cr4_guest_host_mask, 352);
CHECK_OFFSET(cr0_read_shadow, 360);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 819c47fee157..ed0366641587 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -41,6 +41,7 @@
#include <asm/mce.h>
#include <asm/mmu_context.h>
#include <asm/mshyperv.h>
+#include <asm/sgx_arch.h>
#include <asm/spec-ctrl.h>
#include <asm/virtext.h>
#include <asm/vmx.h>
@@ -55,6 +56,7 @@
#include "nested.h"
#include "ops.h"
#include "pmu.h"
+#include "sgx.h"
#include "trace.h"
#include "vmcs.h"
#include "vmcs12.h"
@@ -1961,6 +1963,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vmx->msr_ia32_feature_control = data;
if (msr_info->host_initiated && data == 0)
vmx_leave_nested(vcpu);
+
+ /* SGX may be enabled/disabled by guest's firmware */
+ vmx_write_encls_bitmap(vcpu, NULL);
break;
case MSR_IA32_SGXLEPUBKEYHASH0 ... MSR_IA32_SGXLEPUBKEYHASH3:
if (!msr_info->host_initiated &&
@@ -4030,6 +4035,15 @@ static void vmx_compute_secondary_exec_control(struct vcpu_vmx *vmx)
}
}
+ if (cpu_has_vmx_encls_vmexit() && nested) {
+ if (guest_cpuid_has(vcpu, X86_FEATURE_SGX))
+ vmx->nested.msrs.secondary_ctls_high |=
+ SECONDARY_EXEC_ENCLS_EXITING;
+ else
+ vmx->nested.msrs.secondary_ctls_high &=
+ ~SECONDARY_EXEC_ENCLS_EXITING;
+ }
+
vmx->secondary_exec_control = exec_control;
}
@@ -4145,8 +4159,7 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx)
vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
}
- if (cpu_has_vmx_encls_vmexit())
- vmcs_write64(ENCLS_EXITING_BITMAP, -1ull);
+ vmx_write_encls_bitmap(&vmx->vcpu, NULL);
if (pt_mode == PT_MODE_HOST_GUEST) {
memset(&vmx->pt_desc, 0, sizeof(vmx->pt_desc));
@@ -5501,14 +5514,48 @@ static int handle_vmx_instruction(struct kvm_vcpu *vcpu)
return 1;
}
+static inline bool sgx_enabled_in_guest_bios(struct kvm_vcpu *vcpu)
+{
+ const u64 bits = FEATURE_CONTROL_SGX_ENABLE | FEATURE_CONTROL_LOCKED;
+
+ return (to_vmx(vcpu)->msr_ia32_feature_control & bits) == bits;
+}
+
+static inline bool encls_leaf_enabled_in_guest(struct kvm_vcpu *vcpu, u32 leaf)
+{
+ if (!guest_cpuid_has(vcpu, X86_FEATURE_SGX))
+ return false;
+
+ if (leaf >= SGX_ECREATE && leaf <= SGX_ETRACK)
+ return guest_cpuid_has(vcpu, X86_FEATURE_SGX1);
+
+ if (leaf >= SGX_EAUG && leaf <= SGX_EMODT)
+ return guest_cpuid_has(vcpu, X86_FEATURE_SGX2);
+
+ return false;
+}
+
static int handle_encls(struct kvm_vcpu *vcpu)
{
- /*
- * SGX virtualization is not yet supported. There is no software
- * enable bit for SGX, so we have to trap ENCLS and inject a #UD
- * to prevent the guest from executing ENCLS.
- */
- kvm_queue_exception(vcpu, UD_VECTOR);
+ u32 leaf = (u32)vcpu->arch.regs[VCPU_REGS_RAX];
+
+ if (!encls_leaf_enabled_in_guest(vcpu, leaf) ||
+ !IS_ENABLED(CONFIG_INTEL_SGX_VIRTUALIZATION))
+ kvm_queue_exception(vcpu, UD_VECTOR);
+ else if (!sgx_enabled_in_guest_bios(vcpu))
+ kvm_inject_gp(vcpu, 0);
+ else {
+#ifdef CONFIG_INTEL_SGX_VIRTUALIZATION
+ if (leaf == SGX_ECREATE)
+ return handle_encls_ecreate(vcpu);
+ if (leaf == SGX_EINIT)
+ return handle_encls_einit(vcpu);
+#endif
+ WARN(1, "KVM: unexpected exit on ENCLS[%u]", leaf);
+ vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
+ vcpu->run->hw.hardware_exit_reason = EXIT_REASON_ENCLS;
+ return 0;
+ }
return 1;
}
@@ -7005,6 +7052,85 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
}
+/*
+ * ECREATE must be intercepted to enforce MISCSELECT, ATTRIBUTES and XFRM
+ * restrictions if the guest's allowed-1 settings diverge from hardware.
+ */
+static inline bool vmx_intercept_encls_ecreate(struct kvm_vcpu *vcpu)
+{
+#ifdef CONFIG_INTEL_SGX_VIRTUALIZATION
+ struct kvm_cpuid_entry2 *guest_cpuid;
+ u32 eax, ebx, ecx, edx;
+
+ if (!vcpu->kvm->arch.sgx_provisioning_allowed)
+ return true;
+
+ guest_cpuid = kvm_find_cpuid_entry(vcpu, 0x12, 0);
+ if (!guest_cpuid)
+ return true;
+
+ cpuid_count(0x12, 0, &eax, &ebx, &ecx, &edx);
+ if (guest_cpuid->ebx != ebx)
+ return true;
+
+ guest_cpuid = kvm_find_cpuid_entry(vcpu, 0x12, 1);
+ if (!guest_cpuid)
+ return true;
+
+ cpuid_count(0x12, 1, &eax, &ebx, &ecx, &edx);
+ if (guest_cpuid->eax != eax || guest_cpuid->ebx != ebx ||
+ guest_cpuid->ecx != ecx || guest_cpuid->edx != edx)
+ return true;
+#endif
+ return false;
+}
+
+void vmx_write_encls_bitmap(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+{
+ /*
+ * There is no software enable bit for SGX that is virtualized by
+ * hardware, e.g. there's no CR4.SGXE, so when SGX is disabled in the
+ * guest (either by the host or by the guest's BIOS) but enabled in the
+ * host, trap all ENCLS leafs and inject #UD/#GP as needed to emulate
+ * the expected system behavior for ENCLS.
+ */
+ u64 bitmap = -1ull;
+
+ /* Nothing to do if hardware doesn't support SGX */
+ if (!cpu_has_vmx_encls_vmexit())
+ return;
+
+#ifdef CONFIG_INTEL_SGX_VIRTUALIZATION
+ if (guest_cpuid_has(vcpu, X86_FEATURE_SGX) &&
+ sgx_enabled_in_guest_bios(vcpu)) {
+ if (guest_cpuid_has(vcpu, X86_FEATURE_SGX1)) {
+ bitmap &= ~GENMASK_ULL(SGX_ETRACK, SGX_ECREATE);
+ if (vmx_intercept_encls_ecreate(vcpu))
+ bitmap |= (1 << SGX_ECREATE);
+ }
+
+ if (guest_cpuid_has(vcpu, X86_FEATURE_SGX2))
+ bitmap &= ~GENMASK_ULL(SGX_EMODT, SGX_EAUG);
+
+ /*
+ * Trap and execute EINIT if launch control is enabled in the
+ * host using the guest's values for launch control MSRs, even
+ * if the guest's values are fixed to hardware default values.
+ * The MSRs are not loaded/saved on VM-Enter/VM-Exit as writing
+ * the MSRs is extraordinarily expensive.
+ */
+ if (boot_cpu_has(X86_FEATURE_SGX_LC))
+ bitmap |= (1 << SGX_EINIT);
+
+ if (!vmcs12 && nested && is_guest_mode(vcpu))
+ vmcs12 = get_vmcs12(vcpu);
+ if (vmcs12 && nested_cpu_has_encls_exit(vmcs12))
+ bitmap |= vmcs12->encls_exiting_bitmap;
+ }
+#endif
+ vmcs_write64(ENCLS_EXITING_BITMAP, bitmap);
+}
+
static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7029,6 +7155,20 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
if (boot_cpu_has(X86_FEATURE_INTEL_PT) &&
guest_cpuid_has(vcpu, X86_FEATURE_INTEL_PT))
update_intel_pt_cfg(vcpu);
+
+ vmx_write_encls_bitmap(vcpu, NULL);
+ if (guest_cpuid_has(vcpu, X86_FEATURE_SGX))
+ vmx->msr_ia32_feature_control_valid_bits |=
+ FEATURE_CONTROL_SGX_ENABLE;
+ else
+ vmx->msr_ia32_feature_control_valid_bits &=
+ ~FEATURE_CONTROL_SGX_ENABLE;
+ if (guest_cpuid_has(vcpu, X86_FEATURE_SGX_LC))
+ vmx->msr_ia32_feature_control_valid_bits |=
+ FEATURE_CONTROL_SGX_LE_WR;
+ else
+ vmx->msr_ia32_feature_control_valid_bits &=
+ ~FEATURE_CONTROL_SGX_LE_WR;
}
static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 1519c6918190..78c81efe43ba 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -325,6 +325,7 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
struct shared_msr_entry *find_msr_entry(struct vcpu_vmx *vmx, u32 msr);
void pt_update_intercept_for_msr(struct vcpu_vmx *vmx);
void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp);
+void vmx_write_encls_bitmap(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12);
#define POSTED_INTR_ON 0
#define POSTED_INTR_SN 1
--
2.22.0
On Fri, Jul 26, 2019 at 10:52 PM Sean Christopherson
<[email protected]> wrote:
>
> Similar to the existing AMD #NPF case where emulation of the current
> instruction is not possible due to lack of information, virtualization
> of Intel SGX will introduce a scenario where emulation is not possible
> due to the VMExit occurring in an SGX enclave. And again similar to
> the AMD case, emulation can be initiated by kvm_mmu_page_fault(), i.e.
> outside of the control of the vendor-specific code.
>
> While the cause and architecturally visible behavior of the two cases
> is different, e.g. Intel SGX will inject a #UD whereas AMD #NPF is a
> clean resume or complete shutdown, the impact on the common emulation
> code is identical: KVM must stop emulation immediately and resume the
> guest.
>
> Replace the existing need_emulation_on_page_fault() with a more generic
> is_emulatable() kvm_x86_ops callback, which is called unconditionally
> by x86_emulate_instruction().
>
Having recently noticed that emulate_ud() is broken when the guest's
TF is set, I suppose I should ask: does your new code function
sensibly when TF is set?
Also, anyone want to fix that emulate_ud() bug? The test case is merged now:
# ./tools/testing/selftests/x86/syscall_arg_fault_32
[RUN] SYSENTER with invalid state
[OK] Seems okay
[RUN] SYSCALL with invalid state
[SKIP] Illegal instruction
[RUN] SYSENTER with TF and invalid state
[OK] Seems okay
[RUN] SYSCALL with TF and invalid state
[WARN] Got stuck single-stepping -- you probably have a KVM bug
On Fri, Jul 26, 2019 at 10:52 PM Sean Christopherson
<[email protected]> wrote:
>
> Add an SGX device to enable userspace to allocate EPC without an
> associated enclave. The intended and only known use case for direct EPC
> allocation is to expose EPC to a KVM guest, hence the virt_epc moniker,
> virt.{c,h} files and INTEL_SGX_VIRTUALIZATION Kconfig.
>
> Although KVM is the end consumer of EPC, and will need hooks into the
> virtual EPC management if oversubscription of EPC for guest is ever
> supported (see below), implement direct access to EPC in the SGX
> subsystem instead of in KVM. Doing so has two major advantages:
>
> - Does not require changes to KVM's uAPI, e.g. EPC gets handled as
> just another memory backend for guests.
This is general grumbling more than useful feedback, but I wish there
was a way for KVM's userspace to add a memory region that is *not*
backed by a memory mapping. For SGX, this would avoid the slightly
awkward situation where useless EPC pages are mapped by QEMU. For
SEV, it would solve the really fairly awful situation where the SEV
pages are mapped *incoherently* for QEMU. And even in the absence of
fancy hardware features, it would allow the guest to have secrets in
memory that are not exposed to wild reads, speculation attacks, etc
coming from QEMU.
I realize the implementation would be extremely intrusive, but it just
might make it a lot easier to do things like making SEV pages properly
movable. Similarly, I could see EPC oversubscription being less nasty
in this model. For one thing, it would make it more straightforward
to keep track of exactly which VMs have a given EPC page mapped,
whereas right now this driver only really knows which host userspace
mm has the EPC page mapped.
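For context, this is the coupling I'm grumbling about: every slot in
the current uAPI is anchored to a host mapping via userspace_addr, so
QEMU ends up holding mappings (EPC, SEV) it can never sensibly touch.
From include/uapi/linux/kvm.h:

struct kvm_userspace_memory_region {
	__u32 slot;
	__u32 flags;
	__u64 guest_phys_addr;
	__u64 memory_size;	/* bytes */
	__u64 userspace_addr;	/* start of the userspace allocated memory */
};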
On Sat, Jul 27, 2019 at 10:44:24AM -0700, Andy Lutomirski wrote:
> On Fri, Jul 26, 2019 at 10:52 PM Sean Christopherson
> <[email protected]> wrote:
> >
> > Add an SGX device to enable userspace to allocate EPC without an
> > associated enclave. The intended and only known use case for direct EPC
> > allocation is to expose EPC to a KVM guest, hence the virt_epc moniker,
> > virt.{c,h} files and INTEL_SGX_VIRTUALIZATION Kconfig.
> >
> > Although KVM is the end consumer of EPC, and will need hooks into the
> > virtual EPC management if oversubscription of EPC for guest is ever
> > supported (see below), implement direct access to EPC in the SGX
> > subsystem instead of in KVM. Doing so has two major advantages:
> >
> > - Does not require changes to KVM's uAPI, e.g. EPC gets handled as
> > just another memory backend for guests.
>
> This is general grumbling more than useful feedback, but I wish there
> was a way for KVM's userspace to add a memory region that is *not*
> backed by a memory mapping. For SGX, this would avoid the slightly
> awkward situation where useless EPC pages are mapped by QEMU. For
> SEV, it would solve the really fairly awful situation where the SEV
> pages are mapped *incoherently* for QEMU. And even in the absence of
> fancy hardware features, it would allow the guest to have secrets in
> memory that are not exposed to wild reads, speculation attacks, etc
> coming from QEMU.
>
> I realize the implementation would be extremely intrusive, but it just
> might make it a lot easier to do things like making SEV pages properly
> movable. Similarly, I could see EPC oversubscription being less nasty
> in this model. For one thing, it would make it more straightforward
> to keep track of exactly which VMs have a given EPC page mapped,
> whereas right now this driver only really knows which host userspace
> mm has the EPC page mapped.
Host userspace vs VM doesn't add much, if any, complexity to EPC
oversubscription, especially since the problem of supporting multiple mm
structs needs to be solved for the native driver anyways.
The nastiness of oversubscribing a VM is primarily in dealing with
EBLOCK/ETRACK/EWB conflicts between guest and host. The other nasty bit
is that it all but requires fancier VA page management, e.g. allocating a
dedicated VA slot for every EPC page doesn't scale when presenting a
multi-{GB,TB} EPC to a guest, especially since there's no guarantee the
guest will ever access EPC.
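Back-of-the-envelope on the VA page problem: a VA page holds 512 8-byte
slots, so statically reserving a slot for every page of a 1TB guest EPC
requires (1TB / 4k) / 512 = 512k VA pages, i.e. 2GB of EPC consumed by
VA pages alone, per guest, regardless of whether the guest ever touches
its EPC.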
On Sat, Jul 27, 2019 at 10:38:03AM -0700, Andy Lutomirski wrote:
> On Fri, Jul 26, 2019 at 10:52 PM Sean Christopherson
> <[email protected]> wrote:
> >
> > Similar to the existing AMD #NPF case where emulation of the current
> > instruction is not possible due to lack of information, virtualization
> > of Intel SGX will introduce a scenario where emulation is not possible
> > due to the VMExit occurring in an SGX enclave. And again similar to
> > the AMD case, emulation can be initiated by kvm_mmu_page_fault(), i.e.
> > outside of the control of the vendor-specific code.
> >
> > While the cause and architecturally visible behavior of the two cases
> > is different, e.g. Intel SGX will inject a #UD whereas AMD #NPF is a
> > clean resume or complete shutdown, the impact on the common emulation
> > code is identical: KVM must stop emulation immediately and resume the
> > guest.
> >
> > Replace the existing need_emulation_on_page_fault() with a more generic
> > is_emulatable() kvm_x86_ops callback, which is called unconditionally
> > by x86_emulate_instruction().
> >
>
> Having recently noticed that emulate_ud() is broken when the guest's
> TF is set, I suppose I should ask: does your new code function
> sensibly when TF is set?
Barring a VMX fault injection interaction I'm not thinking of, yes. The
SGX reaction to the #UD VM-Exit is to inject a #UD and resume the guest,
pending breakpoints shouldn't be affected in any way (unless some other
part of KVM mucks with them, e.g. when guest single-stepping is enabled).
On Fri, Jul 26, 2019 at 10:52:01PM -0700, Sean Christopherson wrote:
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 48c865a4e5dd..0fb8b60eb136 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -7115,10 +7115,25 @@ static int nested_enable_evmcs(struct kvm_vcpu *vcpu,
> return -ENODEV;
> }
>
> -static bool svm_need_emulation_on_page_fault(struct kvm_vcpu *vcpu)
> +static bool svm_is_emulatable(struct kvm_vcpu *vcpu, void *insn, int insn_len)
> {
> bool is_user, smap;
>
> + if (likely(!insn || insn_len))
> + return true;
> +
> + /*
> + * Under certain conditions insn_len may be zero on #NPF. This can
> + * happen if a guest gets a page-fault on data access but the HW table
> + * walker is not able to read the instruction page (e.g instruction
> + * page is not present in memory). In those cases we simply restart the
> + * guest, with the exception of AMD Erratum 1096 which is unrecoverable.
> + */
> + if (unlikely(insn && !insn_len)) {
> + if (!kvm_x86_ops->need_emulation_on_page_fault(vcpu))
> + return 1;
> + }
Doh, obviously forgot to compile for SVM when rebasing.
>> On Jul 29, 2019, at 7:49 PM, Sean Christopherson <[email protected]> wrote:
>>
>> On Sat, Jul 27, 2019 at 10:38:03AM -0700, Andy Lutomirski wrote:
>> On Fri, Jul 26, 2019 at 10:52 PM Sean Christopherson
>> <[email protected]> wrote:
>>>
>>> Similar to the existing AMD #NPF case where emulation of the current
>>> instruction is not possible due to lack of information, virtualization
>>> of Intel SGX will introduce a scenario where emulation is not possible
>>> due to the VMExit occurring in an SGX enclave. And again similar to
>>> the AMD case, emulation can be initiated by kvm_mmu_page_fault(), i.e.
>>> outside of the control of the vendor-specific code.
>>>
>>> While the cause and architecturally visible behavior of the two cases
>>> is different, e.g. Intel SGX will inject a #UD whereas AMD #NPF is a
>>> clean resume or complete shutdown, the impact on the common emulation
>>> code is identical: KVM must stop emulation immediately and resume the
>>> guest.
>>>
>>> Replace the existing need_emulation_on_page_fault() with a more generic
>>> is_emulatable() kvm_x86_ops callback, which is called unconditionally
>>> by x86_emulate_instruction().
>>
>> Having recently noticed that emulate_ud() is broken when the guest's
>> TF is set, I suppose I should ask: does your new code function
>> sensibly when TF is set?
>
> Barring a VMX fault injection interaction I'm not thinking of, yes. The
> SGX reaction to the #UD VM-Exit is to inject a #UD and resume the guest,
> pending breakpoints shouldn't be affected in any way (unless some other
> part of KVM mucks with them, e.g. when guest single-stepping is enabled).
What I mean is: does the code actually do what you think it does if TF is set? Right now, as I understand it, the KVM emulation code has a bug in which some emulated faults also inject #DB despite the fact that the instruction faulted, and the #DB seems to take precedence over the original fault. This confuses the guest.
On Thu, Aug 15, 2019 at 05:47:12PM -0700, Andy Lutomirski wrote:
>
>
> >> On Jul 29, 2019, at 7:49 PM, Sean Christopherson <[email protected]> wrote:
> >>
> >> On Sat, Jul 27, 2019 at 10:38:03AM -0700, Andy Lutomirski wrote:
> >> On Fri, Jul 26, 2019 at 10:52 PM Sean Christopherson
> >> <[email protected]> wrote:
> >>>
> >>> Similar to the existing AMD #NPF case where emulation of the current
> >>> instruction is not possible due to lack of information, virtualization
> >>> of Intel SGX will introduce a scenario where emulation is not possible
> >>> due to the VMExit occurring in an SGX enclave. And again similar to
> >>> the AMD case, emulation can be initiated by kvm_mmu_page_fault(), i.e.
> >>> outside of the control of the vendor-specific code.
> >>>
> >>> While the cause and architecturally visible behavior of the two cases
> >>> is different, e.g. Intel SGX will inject a #UD whereas AMD #NPF is a
> >>> clean resume or complete shutdown, the impact on the common emulation
> >>> code is identical: KVM must stop emulation immediately and resume the
> >>> guest.
> >>>
> >>> Replace the existing need_emulation_on_page_fault() with a more generic
> >>> is_emulatable() kvm_x86_ops callback, which is called unconditionally
> >>> by x86_emulate_instruction().
> >>
> >> Having recently noticed that emulate_ud() is broken when the guest's
> >> TF is set, I suppose I should ask: does your new code function
> >> sensibly when TF is set?
> >
> > Barring a VMX fault injection interaction I'm not thinking of, yes. The
> > SGX reaction to the #UD VM-Exit is to inject a #UD and resume the guest,
> > pending breakpoints shouldn't be affected in any way (unless some other
> > part of KVM mucks with them, e.g. when guest single-stepping is enabled).
>
> What I mean is: does the code actually do what you think it does if TF is
> set? Right now, as I understand it, the KVM emulation code has a bug in
> which some emulated faults also inject #DB despite the fact that the
> instruction faulted, and the #DB seems to take precedence over the original
> fault. This confuses the guest.
Yes. The proposed change is to inject the #UD instead of calling into the
emulator, and by inspection I've verified that all code that injects a #DB
is either contained within the emulator or is mutually exclusive with an
intercepted #UD. It's a qualified yes because I don't have an actual
testcase to verify my literacy. I'll look into adding a test, either to
the selftest/x86/sgx or to kvm-unit-tests.
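Something along these lines, i.e. a minimal userspace sketch in the
spirit of the existing selftests (entirely hypothetical until it's
actually written and run against KVM):

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

static sigjmp_buf jmpbuf;
static volatile int got_sig;

static void handler(int sig)
{
	got_sig = sig;
	siglongjmp(jmpbuf, 1);
}

int main(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = handler;
	sigaction(SIGILL, &sa, NULL);
	sigaction(SIGTRAP, &sa, NULL);

	if (!sigsetjmp(jmpbuf, 1)) {
		/* Set EFLAGS.TF, then immediately execute UD2. */
		asm volatile("pushf\n\t"
			     "orl $0x100, (%%rsp)\n\t"
			     "popf\n\t"
			     "ud2" : : : "memory", "cc");
	}

	/*
	 * The #UD must win, i.e. SIGILL; getting SIGTRAP means a spurious
	 * #DB took precedence over the fault.
	 */
	printf("[%s]\tud2 with TF set -> signal %d\n",
	       got_sig == SIGILL ? "OK" : "FAIL", got_sig);
	return got_sig == SIGILL ? 0 : 1;
}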
On Mon, Aug 19, 2019 at 3:01 PM Sean Christopherson
<[email protected]> wrote:
>
> On Thu, Aug 15, 2019 at 05:47:12PM -0700, Andy Lutomirski wrote:
> >
> >
> > >> On Jul 29, 2019, at 7:49 PM, Sean Christopherson <[email protected]> wrote:
> > >>
> > >> On Sat, Jul 27, 2019 at 10:38:03AM -0700, Andy Lutomirski wrote:
> > >> On Fri, Jul 26, 2019 at 10:52 PM Sean Christopherson
> > >> <[email protected]> wrote:
> > >>>
> > >>> Similar to the existing AMD #NPF case where emulation of the current
> > >>> instruction is not possible due to lack of information, virtualization
> > >>> of Intel SGX will introduce a scenario where emulation is not possible
> > >>> due to the VMExit occurring in an SGX enclave. And again similar to
> > >>> the AMD case, emulation can be initiated by kvm_mmu_page_fault(), i.e.
> > >>> outside of the control of the vendor-specific code.
> > >>>
> > >>> While the cause and architecturally visible behavior of the two cases
> > >>> is different, e.g. Intel SGX will inject a #UD whereas AMD #NPF is a
> > >>> clean resume or complete shutdown, the impact on the common emulation
> > >>> code is identical: KVM must stop emulation immediately and resume the
> > >>> guest.
> > >>>
> > >>> Replace the existing need_emulation_on_page_fault() with a more generic
> > >>> is_emulatable() kvm_x86_ops callback, which is called unconditionally
> > >>> by x86_emulate_instruction().
> > >>
> > >> Having recently noticed that emulate_ud() is broken when the guest's
> > >> TF is set, I suppose I should ask: does your new code function
> > >> sensibly when TF is set?
> > >
> > > Barring a VMX fault injection interaction I'm not thinking of, yes. The
> > > SGX reaction to the #UD VM-Exit is to inject a #UD and resume the guest,
> > > pending breakpoints shouldn't be affected in any way (unless some other
> > > part of KVM mucks with them, e.g. when guest single-stepping is enabled).
> >
> > What I mean is: does the code actually do what you think it does if TF is
> > set? Right now, as I understand it, the KVM emulation code has a bug in
> > which some emulated faults also inject #DB despite the fact that the
> > instruction faulted, and the #DB seems to take precedence over the original
> > fault. This confuses the guest.
>
> Yes. The proposed change is to inject the #UD instead of calling into the
> emulator, and by inspection I've verified that all code that injects a #DB
> is either contained within the emulator or is mutually exclusive with an
> intercepted #UD. It's a qualified yes because I don't have an actual
> testcase to verify my literacy. I'll look into adding a test, either to
> the selftest/x86/sgx or to kvm-unit-tests.
I wrote one, and it fails:
# ./tools/testing/selftests/x86/syscall_arg_fault_32
[RUN] SYSENTER with invalid state
[OK] Seems okay
[RUN] SYSCALL with invalid state
[SKIP] Illegal instruction
[RUN] SYSENTER with TF and invalid state
[OK] Seems okay
[RUN] SYSCALL with TF and invalid state
[WARN] Got stuck single-stepping -- you probably have a KVM bug
emulate_ud() is buggy.
On Mon, Aug 19, 2019 at 06:34:07PM -0700, Andy Lutomirski wrote:
> On Mon, Aug 19, 2019 at 3:01 PM Sean Christopherson
> <[email protected]> wrote:
> >
> > On Thu, Aug 15, 2019 at 05:47:12PM -0700, Andy Lutomirski wrote:
> > >
> > >
> > > >> On Jul 29, 2019, at 7:49 PM, Sean Christopherson <[email protected]> wrote:
> > > >>
> > > >> On Sat, Jul 27, 2019 at 10:38:03AM -0700, Andy Lutomirski wrote:
> > > >> On Fri, Jul 26, 2019 at 10:52 PM Sean Christopherson
> > > >> <[email protected]> wrote:
> > > >>>
> > > >>> Similar to the existing AMD #NPF case where emulation of the current
> > > >>> instruction is not possible due to lack of information, virtualization
> > > >>> of Intel SGX will introduce a scenario where emulation is not possible
> > > >>> due to the VMExit occurring in an SGX enclave. And again similar to
> > > >>> the AMD case, emulation can be initiated by kvm_mmu_page_fault(), i.e.
> > > >>> outside of the control of the vendor-specific code.
> > > >>>
> > > >>> While the cause and architecturally visible behavior of the two cases
> > > >>> is different, e.g. Intel SGX will inject a #UD whereas AMD #NPF is a
> > > >>> clean resume or complete shutdown, the impact on the common emulation
> > > >>> code is identical: KVM must stop emulation immediately and resume the
> > > >>> guest.
> > > >>>
> > > >>> Replace the existing need_emulation_on_page_fault() with a more generic
> > > >>> is_emulatable() kvm_x86_ops callback, which is called unconditionally
> > > >>> by x86_emulate_instruction().
> > > >>
> > > >> Having recently noticed that emulate_ud() is broken when the guest's
> > > >> TF is set, I suppose I should ask: does your new code function
> > > >> sensibly when TF is set?
> > > >
> > > > Barring a VMX fault injection interaction I'm not thinking of, yes. The
> > > > SGX reaction to the #UD VM-Exit is to inject a #UD and resume the guest,
> > > > pending breakpoints shouldn't be affected in any way (unless some other
> > > > part of KVM mucks with them, e.g. when guest single-stepping is enabled).
> > >
> > > What I mean is: does the code actually do what you think it does if TF is
> > > set? Right now, as I understand it, the KVM emulation code has a bug in
> > > which some emulated faults also inject #DB despite the fact that the
> > > instruction faulted, and the #DB seems to take precedence over the original
> > > fault. This confuses the guest.
> >
> > Yes. The proposed change is to inject the #UD instead of calling into the
> > emulator, and by inspection I've verified that all code that injects a #DB
> > is either contained within the emulator or is mutually exclusive with an
> > intercepted #UD. It's a qualified yes because I don't have an actual
> > testcase to verify my literacy. I'll look into adding a test, either to
> > the selftest/x86/sgx or to kvm-unit-tests.
>
> I wrote one, and it fails:
>
> # ./tools/testing/selftests/x86/syscall_arg_fault_32
> [RUN] SYSENTER with invalid state
> [OK] Seems okay
> [RUN] SYSCALL with invalid state
> [SKIP] Illegal instruction
> [RUN] SYSENTER with TF and invalid state
> [OK] Seems okay
> [RUN] SYSCALL with TF and invalid state
> [WARN] Got stuck single-stepping -- you probably have a KVM bug
>
> emulate_ud() is buggy.
Heh, yeah, I meant for the SGX case, e.g. SYSCALL from inside an enclave
with RFLAGS.TF=1.