Hi,
This series aims to introduce the concept of KVM address space isolation.
This is done as part of the upstream community effort to have exploit
mitigations for CPU info-leak vulnerabilities such as L1TF.
These patches are based on original patches from Liran Alon, completed
with additional patches to effectively create a KVM address space
distinct from the full kernel address space.
The current code is just an early POC, and it is not fully stable at the
moment (unfortunately you can expect crashes/hangs, see the "Issues"
section below). However I would like to start a discussion to get
feedback and opinions about this approach.
Context
=======
The most naive approach to handle the L1TF SMT variant is to just disable
hyper-threading. But that is not practical for public cloud providers. As
the next best alternative, there is an approach which combines coscheduling
with flushing the L1D cache on every VMEntry. By coscheduling I refer to
some mechanism which, on every VMExit from the guest, kicks all sibling
hyperthreads out of the guest as well.
However, this approach has some open issues:
1. Kicking all sibling hyperthreads on every VMExit has a significant
performance hit for some compute shapes (e.g. Emulated and PV).
2. It assumes that the only CPU core resource which could be leaked by
some vulnerability is the L1D cache. But future vulnerabilities may also
be able to leak other CPU core resources. Therefore, we would prefer a
mechanism which prevents these resources from being loaded with sensitive
data to begin with.
To better address (2), the upstream community has discussed some mechanisms
related to reducing the data that is mapped in the kernel virtual address
space. Specifically:
a. XPFO: Removes from the physmap pages that should currently only be
accessed by userspace.
b. Process-local memory allocations: Allows having a memory area in the
kernel virtual address space that maps different content per process.
Allocations made in this memory area can then be hidden from other tasks
in the system running in kernel space. The most obvious use is to allocate
per-vCPU and per-VM KVM structures there.
However, both (a)+(b) follow a black-list approach (where we decide which
data is considered dangerous and remove it from the kernel virtual address
space) and don't address the performance hit described in (1).
Proposal
========
To handle both these points, this series introduces the mechanism of KVM
address space isolation. Note that this mechanism complements (a)+(b) and
doesn't contradict them. If this mechanism is also applied, (a)+(b) should
still be applied to the full virtual address space as defence in depth.
The idea is that most of the KVM #VMExit handler code will run in a special
KVM isolated address space which maps only KVM required code and per-VM
information. Only when KVM needs to architecturally access other (sensitive)
data will it switch from the KVM isolated address space to the full standard
host address space. At this point, KVM will also need to kick all sibling
hyperthreads to get out of the guest (note that kicking all sibling
hyperthreads is not implemented in this series).
Basically, we will have the following flow:
- qemu issues the KVM_RUN ioctl
- KVM handles the ioctl and calls vcpu_run():
  . KVM switches from the kernel address space to the KVM address space
  . KVM transfers control to the VM (VMLAUNCH/VMRESUME)
  . the VM returns to KVM
  . KVM handles the VM-Exit:
    . if handling needs the full kernel then switch to the kernel address space
    . else continue with the KVM address space
  . KVM loops in vcpu_run() or returns
- the KVM_RUN ioctl returns
So, the KVM_RUN core function will mainly be executed using the KVM address
space. The handling of a VM-Exit can require access to the kernel space
and, in that case, we will switch back to the kernel address space.
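To make the flow above more concrete, here is a minimal C sketch of a
vcpu_run() loop with the address space switches added. This is
illustrative only: exit_needs_full_kernel() is a placeholder, and in the
actual series the switches are hooked into vcpu_enter_guest() and the
existing exit handlers rather than into a restructured loop like this:

  /* Illustrative sketch only, not the actual patch code. */
  static int vcpu_run(struct kvm_vcpu *vcpu)
  {
          int r;

          for (;;) {
                  /* switch from the kernel address space to the KVM address space */
                  kvm_isolation_enter();

                  /* VMLAUNCH/VMRESUME, then handle the VM-Exit */
                  r = vcpu_enter_guest(vcpu);

                  /* if handling needs the full kernel, switch back to the kernel mm */
                  if (exit_needs_full_kernel(r))
                          kvm_may_access_sensitive_data(vcpu);

                  if (r <= 0)
                          break;
          }

          /* always return to userspace using the kernel address space */
          kvm_isolation_exit();
          return r;
  }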
The high-level idea of how this is implemented is to create a separate
mm_struct for KVM (kvm_mm) such that a vCPU thread will switch active_mm
between its original active_mm and kvm_mm when needed, as described above.
The idea is very similar to how the kernel switches between a task's
active_mm and efi_mm when calling EFI Runtime Services.
Note that because we use the kernel TLB manager to switch between kvm_mm
and host_mm, these switches effectively use the TLB with PCID, if enabled,
which makes them fast; all of this is managed internally by the TLB
manager's switch_mm().
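For reference, the switch itself boils down to the following helper,
condensed from the kvm_switch_mm() function introduced in patch 25 (the
vCPU-thread sanity check and the sibling-hyperthread kick TODO are
omitted here):

  static void kvm_switch_mm(struct mm_struct *mm)
  {
          unsigned long flags;

          /*
           * Disable interrupts while updating active_mm so that an
           * interrupt handler cannot be misled about the mm in use.
           */
          local_irq_save(flags);
          current->kvm_prev_mm = current->active_mm;
          current->active_mm = mm;
          switch_mm_irqs_off(current->kvm_prev_mm, mm, NULL);
          local_irq_restore(flags);
  }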
Patches
=======
The proposed patches implement the necessary framework for creating kvm_mm
and switching between host_mm and kvm_mm at appropriate times. They also
provide functions for populating the KVM address space, and implement an
actual KVM address space much smaller than the full kernel address space.
- 01-08: add a framework for switching between the kernel address space
and the KVM address space at appropriate times. Note that these patches
do not create or switch the address space yet. Address space switching is
implemented in patch 25.
- 09-18: add a framework for populating and managing the KVM page table
(a usage sketch is shown after this list); this also includes mechanisms
to ensure changes are effectively limited to the KVM page table and that
no change is mistakenly propagated to the kernel page table.
- 19-23: populate the KVM page table.
- 24: add a page fault handler to handle and report missing mappings when
running with the KVM address space. This is based on an original idea
from Paul Turner.
- 25: implement the actual switch between the kernel address space and
the KVM address space.
- 26-27: populate the KVM page table with more data.
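As a rough usage sketch of the mapping API added by patches 09-18 (and
used by patches 19-23 and 26-27), mapping a buffer into the KVM address
space looks like this. The calls are simplified from the vmx vCPU mapping
code; map_vcpu_data() is a made-up helper name:

  /* Simplified example; map_vcpu_data() is a made-up helper name. */
  static int map_vcpu_data(struct vcpu_vmx *vmx)
  {
          int rv;

          /* map the vcpu_vmx structure itself into the KVM page table */
          rv = kvm_copy_ptes(vmx, sizeof(struct vcpu_vmx));
          if (rv)
                  return rv;

          /* map the MSR bitmap page accessed around VM-Entry/VM-Exit */
          rv = kvm_copy_ptes(vmx->vmcs01.msr_bitmap, PAGE_SIZE);
          if (rv)
                  kvm_clear_range_mapping(vmx);   /* undo the first mapping */

          return rv;
  }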
If a fault occurs while running with the KVM address space, it will be
reported on the console like this:
[ 4840.727476] KVM isolation: page fault #0 (0) at fast_page_fault+0x13e/0x3e0 [kvm] on ffffea00005331f0 (0xffffea00005331f0)
If the KVM page_fault_stack module parameter is set to non-zero (that's
the default) then the stack of the fault will also be reported:
[ 5025.630374] KVM isolation: page fault #0 (0) at fast_page_fault+0x100/0x3e0 [kvm] on ffff88003c718000 (0xffff88003c718000)
[ 5025.631918] Call Trace:
[ 5025.632782] tdp_page_fault+0xec/0x260 [kvm]
[ 5025.633395] kvm_mmu_page_fault+0x74/0x5f0 [kvm]
[ 5025.644467] handle_ept_violation+0xc3/0x1a0 [kvm_intel]
[ 5025.645218] vmx_handle_exit+0xb9/0x600 [kvm_intel]
[ 5025.645917] vcpu_enter_guest+0xb88/0x1580 [kvm]
[ 5025.646577] kvm_arch_vcpu_ioctl_run+0x403/0x610 [kvm]
[ 5025.647313] kvm_vcpu_ioctl+0x3d5/0x650 [kvm]
[ 5025.648538] do_vfs_ioctl+0xaa/0x602
[ 5025.650502] SyS_ioctl+0x79/0x84
[ 5025.650966] do_syscall_64+0x79/0x1ae
[ 5025.651487] entry_SYSCALL_64_after_hwframe+0x151/0x0
[ 5025.652200] RIP: 0033:0x7f74a2f1d997
[ 5025.652710] RSP: 002b:00007f749f3ec998 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 5025.653769] RAX: ffffffffffffffda RBX: 0000562caa83e110 RCX: 00007f74a2f1d997
[ 5025.654769] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000000c
[ 5025.655769] RBP: 0000562caa83e1b3 R08: 0000562ca9b6fa50 R09: 0000000000000006
[ 5025.656766] R10: 0000000000000000 R11: 0000000000000246 R12: 0000562ca9b552c0
[ 5025.657764] R13: 0000000000801000 R14: 00007f74a59d4000 R15: 0000562caa83e110
This makes it possible to find out what is missing in the KVM address space.
Issues
======
Limited tests have been done so far, and mostly with an empty single-vcpu
VM (i.e. qemu-system-i386 -enable-kvm -smp 1). A single-vcpu VM is able to
start and run a full OS but the system will eventually crash/hang at some
point. With multiple vcpus, the system crashes/hangs much faster.
Performance Impact
==================
As this is an RFC, the effective performance impact hasn't been measured
yet. The current patches introduce two additional context switches (kernel
to KVM, and KVM to kernel) on each KVM_RUN ioctl. Additional context
switches are also added if a VM-Exit has to be handled using the full
kernel address space.
I expect that the KVM address space can eventually be expanded to include
the ioctl syscall entries. By doing so, and also adding the KVM page table
to the process userland page table (which should be safe to do because the
KVM address space doesn't contain any secrets), we could potentially handle
the KVM ioctl without having to switch to the kernel page table (thus
effectively eliminating KPTI for KVM). Then the only overhead would be when
a VM-Exit has to be handled using the full kernel address space.
Thanks,
alex.
---
Alexandre Chartre (18):
kvm/isolation: function to track buffers allocated for the KVM page
table
kvm/isolation: add KVM page table entry free functions
kvm/isolation: add KVM page table entry offset functions
kvm/isolation: add KVM page table entry allocation functions
kvm/isolation: add KVM page table entry set functions
kvm/isolation: functions to copy page table entries for a VA range
kvm/isolation: keep track of VA range mapped in KVM address space
kvm/isolation: functions to clear page table entries for a VA range
kvm/isolation: improve mapping copy when mapping is already present
kvm/isolation: function to copy page table entries for percpu buffer
kvm/isolation: initialize the KVM page table with core mappings
kvm/isolation: initialize the KVM page table with vmx specific data
kvm/isolation: initialize the KVM page table with vmx VM data
kvm/isolation: initialize the KVM page table with vmx cpu data
kvm/isolation: initialize the KVM page table with the vcpu tasks
kvm/isolation: KVM page fault handler
kvm/isolation: initialize the KVM page table with KVM memslots
kvm/isolation: initialize the KVM page table with KVM buses
Liran Alon (9):
kernel: Export memory-management symbols required for KVM address
space isolation
KVM: x86: Introduce address_space_isolation module parameter
KVM: x86: Introduce KVM separate virtual address space
KVM: x86: Switch to KVM address space on entry to guest
KVM: x86: Add handler to exit kvm isolation
KVM: x86: Exit KVM isolation on IRQ entry
KVM: x86: Switch to host address space when may access sensitive data
KVM: x86: Optimize branches which checks if address space isolation
enabled
kvm/isolation: implement actual KVM isolation enter/exit
arch/x86/include/asm/apic.h | 4 +-
arch/x86/include/asm/hardirq.h | 10 +
arch/x86/include/asm/irq.h | 1 +
arch/x86/kernel/cpu/common.c | 2 +
arch/x86/kernel/dumpstack.c | 1 +
arch/x86/kernel/irq.c | 11 +
arch/x86/kernel/ldt.c | 1 +
arch/x86/kernel/smp.c | 2 +-
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/isolation.c | 1773 ++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/isolation.h | 40 +
arch/x86/kvm/mmu.c | 3 +-
arch/x86/kvm/vmx/vmx.c | 123 +++-
arch/x86/kvm/x86.c | 44 +-
arch/x86/mm/fault.c | 12 +
arch/x86/mm/tlb.c | 4 +-
arch/x86/platform/uv/tlb_uv.c | 2 +-
include/linux/kvm_host.h | 2 +
include/linux/percpu.h | 2 +
include/linux/sched.h | 6 +
mm/memory.c | 5 +
mm/percpu.c | 6 +-
virt/kvm/arm/arm.c | 4 +
virt/kvm/kvm_main.c | 4 +-
24 files changed, 2051 insertions(+), 13 deletions(-)
create mode 100644 arch/x86/kvm/isolation.c
create mode 100644 arch/x86/kvm/isolation.h
From: Liran Alon <[email protected]>
As every entry to the guest checks whether it should switch from host_mm
to kvm_mm, these branches are on a very hot path. Optimize them by using
a static_branch.
Signed-off-by: Liran Alon <[email protected]>
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kvm/isolation.c | 11 ++++++++---
arch/x86/kvm/isolation.h | 7 +++++++
2 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
index eeb60c4..43fd924 100644
--- a/arch/x86/kvm/isolation.c
+++ b/arch/x86/kvm/isolation.c
@@ -23,6 +23,9 @@ struct mm_struct kvm_mm = {
.mmlist = LIST_HEAD_INIT(kvm_mm.mmlist),
};
+DEFINE_STATIC_KEY_FALSE(kvm_isolation_enabled);
+EXPORT_SYMBOL(kvm_isolation_enabled);
+
/*
* When set to true, KVM #VMExit handlers run in isolated address space
* which maps only KVM required code and per-VM information instead of
@@ -118,15 +121,17 @@ int kvm_isolation_init(void)
kvm_isolation_set_handlers();
pr_info("KVM: x86: Running with isolated address space\n");
+ static_branch_enable(&kvm_isolation_enabled);
return 0;
}
void kvm_isolation_uninit(void)
{
- if (!address_space_isolation)
+ if (!kvm_isolation())
return;
+ static_branch_disable(&kvm_isolation_enabled);
kvm_isolation_clear_handlers();
kvm_isolation_uninit_mm();
pr_info("KVM: x86: End of isolated address space\n");
@@ -140,7 +145,7 @@ void kvm_may_access_sensitive_data(struct kvm_vcpu *vcpu)
void kvm_isolation_enter(void)
{
- if (address_space_isolation) {
+ if (kvm_isolation()) {
/*
* Switches to kvm_mm should happen from vCPU thread,
* which should not be a kernel thread with no mm
@@ -152,7 +157,7 @@ void kvm_isolation_enter(void)
void kvm_isolation_exit(void)
{
- if (address_space_isolation) {
+ if (kvm_isolation()) {
/* TODO: Kick sibling hyperthread before switch to host mm */
/* TODO: switch back to original mm */
}
diff --git a/arch/x86/kvm/isolation.h b/arch/x86/kvm/isolation.h
index 1290d32..aa5e979 100644
--- a/arch/x86/kvm/isolation.h
+++ b/arch/x86/kvm/isolation.h
@@ -4,6 +4,13 @@
#include <linux/kvm_host.h>
+DECLARE_STATIC_KEY_FALSE(kvm_isolation_enabled);
+
+static inline bool kvm_isolation(void)
+{
+ return static_branch_likely(&kvm_isolation_enabled);
+}
+
extern int kvm_isolation_init(void);
extern void kvm_isolation_uninit(void);
extern void kvm_isolation_enter(void);
--
1.7.1
These functions are wrappers around the p4d/pud/pmd/pte offset functions;
they ensure that page table pointers are in the KVM page table.
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kvm/isolation.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 61 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
index 61df750..b29a09b 100644
--- a/arch/x86/kvm/isolation.c
+++ b/arch/x86/kvm/isolation.c
@@ -162,6 +162,67 @@ static bool kvm_valid_pgt_entry(void *ptr)
}
/*
+ * kvm_pXX_offset() functions are equivalent to kernel pXX_offset()
+ * functions but, in addition, they ensure that page table pointers
+ * are in the KVM page table. Otherwise an error is returned.
+ */
+
+static pte_t *kvm_pte_offset(pmd_t *pmd, unsigned long addr)
+{
+ pte_t *pte;
+
+ pte = pte_offset_map(pmd, addr);
+ if (!kvm_valid_pgt_entry(pte)) {
+ pr_err("PTE %px is not in KVM page table\n", pte);
+ return ERR_PTR(-EINVAL);
+ }
+
+ return pte;
+}
+
+static pmd_t *kvm_pmd_offset(pud_t *pud, unsigned long addr)
+{
+ pmd_t *pmd;
+
+ pmd = pmd_offset(pud, addr);
+ if (!kvm_valid_pgt_entry(pmd)) {
+ pr_err("PMD %px is not in KVM page table\n", pmd);
+ return ERR_PTR(-EINVAL);
+ }
+
+ return pmd;
+}
+
+static pud_t *kvm_pud_offset(p4d_t *p4d, unsigned long addr)
+{
+ pud_t *pud;
+
+ pud = pud_offset(p4d, addr);
+ if (!kvm_valid_pgt_entry(pud)) {
+ pr_err("PUD %px is not in KVM page table\n", pud);
+ return ERR_PTR(-EINVAL);
+ }
+
+ return pud;
+}
+
+static p4d_t *kvm_p4d_offset(pgd_t *pgd, unsigned long addr)
+{
+ p4d_t *p4d;
+
+ p4d = p4d_offset(pgd, addr);
+ /*
+ * p4d is the same as pgd if we don't have a 5-level page table.
+ */
+ if ((p4d != (p4d_t *)pgd) && !kvm_valid_pgt_entry(p4d)) {
+ pr_err("P4D %px is not in KVM page table\n", p4d);
+ return ERR_PTR(-EINVAL);
+ }
+
+ return p4d;
+}
+
+/*
* kvm_pXX_free() functions are equivalent to kernel pXX_free()
* functions but they can be used with any PXX pointer in the
* directory.
--
1.7.1
These functions are wrappers around the p4d/pud/pmd/pte free functions;
they can be used with any pointer in the directory.
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kvm/isolation.c | 26 ++++++++++++++++++++++++++
1 files changed, 26 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
index 1efdab1..61df750 100644
--- a/arch/x86/kvm/isolation.c
+++ b/arch/x86/kvm/isolation.c
@@ -161,6 +161,32 @@ static bool kvm_valid_pgt_entry(void *ptr)
}
+/*
+ * kvm_pXX_free() functions are equivalent to kernel pXX_free()
+ * functions but they can be used with any PXX pointer in the
+ * directory.
+ */
+
+static inline void kvm_pte_free(struct mm_struct *mm, pte_t *pte)
+{
+ pte_free_kernel(mm, PGTD_ALIGN(pte));
+}
+
+static inline void kvm_pmd_free(struct mm_struct *mm, pmd_t *pmd)
+{
+ pmd_free(mm, PGTD_ALIGN(pmd));
+}
+
+static inline void kvm_pud_free(struct mm_struct *mm, pud_t *pud)
+{
+ pud_free(mm, PGTD_ALIGN(pud));
+}
+
+static inline void kvm_p4d_free(struct mm_struct *mm, p4d_t *p4d)
+{
+ p4d_free(mm, PGTD_ALIGN(p4d));
+}
+
static int kvm_isolation_init_mm(void)
{
--
1.7.1
This will be used when we have to clear mappings, to ensure that a range
is cleared at the same page table level at which it was copied.
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kvm/isolation.c | 86 ++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 84 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
index 4f1b511..c8358a9 100644
--- a/arch/x86/kvm/isolation.c
+++ b/arch/x86/kvm/isolation.c
@@ -61,6 +61,20 @@ struct pgt_directory_group {
#define PGTD_ALIGN(entry) \
((typeof(entry))(((unsigned long)(entry)) & PAGE_MASK))
+/*
+ * Variables to keep track of address ranges mapped into the KVM
+ * address space.
+ */
+struct kvm_range_mapping {
+ struct list_head list;
+ void *ptr;
+ size_t size;
+ enum page_table_level level;
+};
+
+static LIST_HEAD(kvm_range_mapping_list);
+static DEFINE_MUTEX(kvm_range_mapping_lock);
+
struct mm_struct kvm_mm = {
.mm_rb = RB_ROOT,
@@ -91,6 +105,52 @@ struct mm_struct kvm_mm = {
static bool __read_mostly address_space_isolation;
module_param(address_space_isolation, bool, 0444);
+static struct kvm_range_mapping *kvm_get_range_mapping_locked(void *ptr,
+ bool *subset)
+{
+ struct kvm_range_mapping *range;
+
+ list_for_each_entry(range, &kvm_range_mapping_list, list) {
+ if (range->ptr == ptr) {
+ if (subset)
+ *subset = false;
+ return range;
+ }
+ if (ptr > range->ptr && ptr < range->ptr + range->size) {
+ if (subset)
+ *subset = true;
+ return range;
+ }
+ }
+
+ return NULL;
+}
+
+static struct kvm_range_mapping *kvm_get_range_mapping(void *ptr, bool *subset)
+{
+ struct kvm_range_mapping *range;
+
+ mutex_lock(&kvm_range_mapping_lock);
+ range = kvm_get_range_mapping_locked(ptr, subset);
+ mutex_unlock(&kvm_range_mapping_lock);
+
+ return range;
+}
+
+static void kvm_free_all_range_mapping(void)
+{
+ struct kvm_range_mapping *range, *range_next;
+
+ mutex_lock(&kvm_range_mapping_lock);
+
+ list_for_each_entry_safe(range, range_next,
+ &kvm_range_mapping_list, list) {
+ list_del(&range->list);
+ kfree(range);
+ }
+
+ mutex_unlock(&kvm_range_mapping_lock);
+}
static struct pgt_directory_group *pgt_directory_group_create(void)
{
@@ -661,10 +721,30 @@ static int kvm_copy_mapping(void *ptr, size_t size, enum page_table_level level)
{
unsigned long addr = (unsigned long)ptr;
unsigned long end = addr + ((unsigned long)size);
+ struct kvm_range_mapping *range_mapping;
+ bool subset;
+ int err;
BUG_ON(current->mm == &kvm_mm);
- pr_debug("KERNMAP COPY addr=%px size=%lx\n", ptr, size);
- return kvm_copy_pgd_range(&kvm_mm, current->mm, addr, end, level);
+ pr_debug("KERNMAP COPY addr=%px size=%lx level=%d\n", ptr, size, level);
+
+ range_mapping = kmalloc(sizeof(struct kvm_range_mapping), GFP_KERNEL);
+ if (!range_mapping)
+ return -ENOMEM;
+
+ err = kvm_copy_pgd_range(&kvm_mm, current->mm, addr, end, level);
+ if (err) {
+ kfree(range_mapping);
+ return err;
+ }
+
+ INIT_LIST_HEAD(&range_mapping->list);
+ range_mapping->ptr = ptr;
+ range_mapping->size = size;
+ range_mapping->level = level;
+ list_add(&range_mapping->list, &kvm_range_mapping_list);
+
+ return 0;
}
@@ -720,6 +800,8 @@ static void kvm_isolation_uninit_mm(void)
destroy_context(&kvm_mm);
+ kvm_free_all_range_mapping();
+
#ifdef CONFIG_PAGE_TABLE_ISOLATION
/*
* With PTI, the KVM address space is defined in the user
--
1.7.1
These functions allocate p4d/pud/pmd/pte pages and ensure that the
allocated pages are in the KVM page table.
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kvm/isolation.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 94 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
index b29a09b..6ec86df 100644
--- a/arch/x86/kvm/isolation.c
+++ b/arch/x86/kvm/isolation.c
@@ -248,6 +248,100 @@ static inline void kvm_p4d_free(struct mm_struct *mm, p4d_t *p4d)
p4d_free(mm, PGTD_ALIGN(p4d));
}
+/*
+ * kvm_pXX_alloc() functions are equivalent to kernel pXX_alloc()
+ * functions but, in addition, they ensure that page table pointers
+ * are in the KVM page table. Otherwise an error is returned.
+ */
+
+static pte_t *kvm_pte_alloc(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr)
+{
+ pte_t *pte;
+
+ if (pmd_none(*pmd)) {
+ pte = pte_alloc_kernel(pmd, addr);
+ if (!pte) {
+ pr_debug("PTE: ERR ALLOC\n");
+ return ERR_PTR(-ENOMEM);
+ }
+ if (!kvm_add_pgt_directory(pte, PGT_LEVEL_PTE)) {
+ kvm_pte_free(mm, pte);
+ return ERR_PTR(-EINVAL);
+ }
+ } else {
+ pte = kvm_pte_offset(pmd, addr);
+ }
+
+ return pte;
+}
+
+static pmd_t *kvm_pmd_alloc(struct mm_struct *mm, pud_t *pud,
+ unsigned long addr)
+{
+ pmd_t *pmd;
+
+ if (pud_none(*pud)) {
+ pmd = pmd_alloc(mm, pud, addr);
+ if (!pmd) {
+ pr_debug("PMD: ERR ALLOC\n");
+ return ERR_PTR(-ENOMEM);
+ }
+ if (!kvm_add_pgt_directory(pmd, PGT_LEVEL_PMD)) {
+ kvm_pmd_free(mm, pmd);
+ return ERR_PTR(-EINVAL);
+ }
+ } else {
+ pmd = kvm_pmd_offset(pud, addr);
+ }
+
+ return pmd;
+}
+
+static pud_t *kvm_pud_alloc(struct mm_struct *mm, p4d_t *p4d,
+ unsigned long addr)
+{
+ pud_t *pud;
+
+ if (p4d_none(*p4d)) {
+ pud = pud_alloc(mm, p4d, addr);
+ if (!pud) {
+ pr_debug("PUD: ERR ALLOC\n");
+ return ERR_PTR(-ENOMEM);
+ }
+ if (!kvm_add_pgt_directory(pud, PGT_LEVEL_PUD)) {
+ kvm_pud_free(mm, pud);
+ return ERR_PTR(-EINVAL);
+ }
+ } else {
+ pud = kvm_pud_offset(p4d, addr);
+ }
+
+ return pud;
+}
+
+static p4d_t *kvm_p4d_alloc(struct mm_struct *mm, pgd_t *pgd,
+ unsigned long addr)
+{
+ p4d_t *p4d;
+
+ if (pgd_none(*pgd)) {
+ p4d = p4d_alloc(mm, pgd, addr);
+ if (!p4d) {
+ pr_debug("P4D: ERR ALLOC\n");
+ return ERR_PTR(-ENOMEM);
+ }
+ if (!kvm_add_pgt_directory(p4d, PGT_LEVEL_P4D)) {
+ kvm_p4d_free(mm, p4d);
+ return ERR_PTR(-EINVAL);
+ }
+ } else {
+ p4d = kvm_p4d_offset(pgd, addr);
+ }
+
+ return p4d;
+}
+
static int kvm_isolation_init_mm(void)
{
--
1.7.1
Add wrappers around the page table entry (pgd/p4d/pud/pmd) set functions
to check that an existing entry is not being overwritten.
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kvm/isolation.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 107 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
index 6ec86df..b681e4f 100644
--- a/arch/x86/kvm/isolation.c
+++ b/arch/x86/kvm/isolation.c
@@ -342,6 +342,113 @@ static inline void kvm_p4d_free(struct mm_struct *mm, p4d_t *p4d)
return p4d;
}
+/*
+ * kvm_set_pXX() functions are equivalent to kernel set_pXX() functions
+ * but, in addition, they ensure that they are not overwriting an already
+ * existing reference in the page table. Otherwise an error is returned.
+ *
+ * Note that this is not used for PTE because a PTE entry points to page
+ * frames containing the actual user data, and not to another entry in the
+ * page table. However this is used for PGD.
+ */
+
+static int kvm_set_pmd(pmd_t *pmd, pmd_t pmd_value)
+{
+#ifdef DEBUG
+ /*
+ * The pmd pointer should come from kvm_pmd_alloc() or kvm_pmd_offset()
+ * both of which check if the pointer is in the KVM page table. So this
+ * is a paranoid check to ensure the pointer is really in the KVM page
+ * table.
+ */
+ if (!kvm_valid_pgt_entry(pmd)) {
+ pr_err("PMD %px is not in KVM page table\n", pmd);
+ return -EINVAL;
+ }
+#endif
+ if (pmd_val(*pmd) == pmd_val(pmd_value))
+ return 0;
+
+ if (!pmd_none(*pmd)) {
+ pr_err("PMD %px: overwriting %lx with %lx\n",
+ pmd, pmd_val(*pmd), pmd_val(pmd_value));
+ return -EBUSY;
+ }
+
+ set_pmd(pmd, pmd_value);
+
+ return 0;
+}
+
+static int kvm_set_pud(pud_t *pud, pud_t pud_value)
+{
+#ifdef DEBUG
+ /*
+ * The pud pointer should come from kvm_pud_alloc() or kvm_pud_offset()
+ * both of which check if the pointer is in the KVM page table. So this
+ * is a paranoid check to ensure the pointer is really in the KVM page
+ * table.
+ */
+ if (!kvm_valid_pgt_entry(pud)) {
+ pr_err("PUD %px is not in KVM page table\n", pud);
+ return -EINVAL;
+ }
+#endif
+ if (pud_val(*pud) == pud_val(pud_value))
+ return 0;
+
+ if (!pud_none(*pud)) {
+ pr_err("PUD %px: overwriting %lx\n", pud, pud_val(*pud));
+ return -EBUSY;
+ }
+
+ set_pud(pud, pud_value);
+
+ return 0;
+}
+
+static int kvm_set_p4d(p4d_t *p4d, p4d_t p4d_value)
+{
+#ifdef DEBUG
+ /*
+ * The p4d pointer should come from kvm_p4d_alloc() or kvm_p4d_offset()
+ * both of which check if the pointer is in the KVM page table. So this
+ * is a paranoid check to ensure the pointer is really in the KVM page
+ * table.
+ */
+ if (!kvm_valid_pgt_entry(p4d)) {
+ pr_err("P4D %px is not in KVM page table\n", p4d);
+ return -EINVAL;
+ }
+#endif
+ if (p4d_val(*p4d) == p4d_val(p4d_value))
+ return 0;
+
+ if (!p4d_none(*p4d)) {
+ pr_err("P4D %px: overwriting %lx\n", p4d, p4d_val(*p4d));
+ return -EBUSY;
+ }
+
+ set_p4d(p4d, p4d_value);
+
+ return 0;
+}
+
+static int kvm_set_pgd(pgd_t *pgd, pgd_t pgd_value)
+{
+ if (pgd_val(*pgd) == pgd_val(pgd_value))
+ return 0;
+
+ if (!pgd_none(*pgd)) {
+ pr_err("PGD %px: overwriting %lx\n", pgd, pgd_val(*pgd));
+ return -EBUSY;
+ }
+
+ set_pgd(pgd, pgd_value);
+
+ return 0;
+}
+
static int kvm_isolation_init_mm(void)
{
--
1.7.1
A mapping can already exist if a buffer was mapped in the KVM address
space, and then the buffer was freed but there was no request to unmap
it from the KVM address space. In that case, clear the existing mapping
before mapping the new buffer.
Also, if the new mapping is a subset of an already mapped larger range,
then remap the entire larger range.
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kvm/isolation.c | 67 +++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 63 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
index e494a15..539e287 100644
--- a/arch/x86/kvm/isolation.c
+++ b/arch/x86/kvm/isolation.c
@@ -88,6 +88,9 @@ struct mm_struct kvm_mm = {
DEFINE_STATIC_KEY_FALSE(kvm_isolation_enabled);
EXPORT_SYMBOL(kvm_isolation_enabled);
+static void kvm_clear_mapping(void *ptr, size_t size,
+ enum page_table_level level);
+
/*
* When set to true, KVM #VMExit handlers run in isolated address space
* which maps only KVM required code and per-VM information instead of
@@ -721,6 +724,7 @@ static int kvm_copy_mapping(void *ptr, size_t size, enum page_table_level level)
{
unsigned long addr = (unsigned long)ptr;
unsigned long end = addr + ((unsigned long)size);
+ unsigned long range_addr, range_end;
struct kvm_range_mapping *range_mapping;
bool subset;
int err;
@@ -728,22 +732,77 @@ static int kvm_copy_mapping(void *ptr, size_t size, enum page_table_level level)
BUG_ON(current->mm == &kvm_mm);
pr_debug("KERNMAP COPY addr=%px size=%lx level=%d\n", ptr, size, level);
- range_mapping = kmalloc(sizeof(struct kvm_range_mapping), GFP_KERNEL);
- if (!range_mapping)
- return -ENOMEM;
+ mutex_lock(&kvm_range_mapping_lock);
+
+ /*
+ * A mapping can already exist if the buffer was mapped and then
+ * freed but there was no request to unmap it. We might also be
+ * trying to map a subset of an already mapped buffer.
+ */
+ range_mapping = kvm_get_range_mapping_locked(ptr, &subset);
+ if (range_mapping) {
+ if (subset) {
+ pr_debug("range %px/%lx/%d is a subset of %px/%lx/%d already mapped, remapping\n",
+ ptr, size, level, range_mapping->ptr,
+ range_mapping->size, range_mapping->level);
+ range_addr = (unsigned long)range_mapping->ptr;
+ range_end = range_addr +
+ ((unsigned long)range_mapping->size);
+ err = kvm_copy_pgd_range(&kvm_mm, current->mm,
+ range_addr, range_end,
+ range_mapping->level);
+ if (end <= range_end) {
+ /*
+ * We effectively have a subset, fully contained
+ * in the superset. So we are done.
+ */
+ mutex_unlock(&kvm_range_mapping_lock);
+ return err;
+ }
+ /*
+ * The new range is larger than the existing mapped
+ * range. So we need an extra mapping to map the end
+ * of the range.
+ */
+ addr = range_end;
+ range_mapping = NULL;
+ pr_debug("adding extra range %lx-%lx (%d)\n", addr,
+ end, level);
+ } else {
+ pr_debug("range %px size=%lx level=%d already mapped, clearing\n",
+ range_mapping->ptr, range_mapping->size,
+ range_mapping->level);
+ kvm_clear_mapping(range_mapping->ptr,
+ range_mapping->size,
+ range_mapping->level);
+ list_del(&range_mapping->list);
+ }
+ }
+
+ if (!range_mapping) {
+ range_mapping = kmalloc(sizeof(struct kvm_range_mapping),
+ GFP_KERNEL);
+ if (!range_mapping) {
+ mutex_unlock(&kvm_range_mapping_lock);
+ return -ENOMEM;
+ }
+ INIT_LIST_HEAD(&range_mapping->list);
+ }
err = kvm_copy_pgd_range(&kvm_mm, current->mm, addr, end, level);
if (err) {
+ mutex_unlock(&kvm_range_mapping_lock);
kfree(range_mapping);
return err;
}
- INIT_LIST_HEAD(&range_mapping->list);
range_mapping->ptr = ptr;
range_mapping->size = size;
range_mapping->level = level;
list_add(&range_mapping->list, &kvm_range_mapping_list);
+ mutex_unlock(&kvm_range_mapping_lock);
+
return 0;
}
--
1.7.1
These functions will be used to unmap memory from the KVM address space.
When clearing a mapping in the KVM page table, check that the clearing
effectively happens in the KVM page table and that there is no crossing
of the KVM page table boundary (with references to the kernel page
table), so that the kernel page table isn't mistakenly modified.
Information (address, size, page table level) about address ranges mapped
into the KVM page table is tracked, so that a mapping can be cleared by
specifying just the start address of the range.
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kvm/isolation.c | 172 ++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/isolation.h | 1 +
2 files changed, 173 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
index c8358a9..e494a15 100644
--- a/arch/x86/kvm/isolation.c
+++ b/arch/x86/kvm/isolation.c
@@ -758,6 +758,178 @@ int kvm_copy_ptes(void *ptr, unsigned long size)
}
EXPORT_SYMBOL(kvm_copy_ptes);
+static void kvm_clear_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end)
+{
+ pte_t *pte;
+
+ pte = kvm_pte_offset(pmd, addr);
+ if (IS_ERR(pte)) {
+ pr_debug("PTE not found, skip clearing\n");
+ return;
+ }
+ do {
+ pr_debug("PTE: %lx/%lx clear[%lx]\n", addr, end, (long)pte);
+ pte_clear(NULL, addr, pte);
+ } while (pte++, addr += PAGE_SIZE, addr < end);
+}
+
+static void kvm_clear_pmd_range(pud_t *pud, unsigned long addr,
+ unsigned long end, enum page_table_level level)
+{
+ pmd_t *pmd;
+ unsigned long next;
+
+ pmd = kvm_pmd_offset(pud, addr);
+ if (IS_ERR(pmd)) {
+ pr_debug("PMD not found, skip clearing\n");
+ return;
+ }
+ do {
+ next = pmd_addr_end(addr, end);
+ if (pmd_none(*pmd))
+ continue;
+ BUG_ON(!pmd_present(*pmd));
+ if (level == PGT_LEVEL_PMD || pmd_trans_huge(*pmd) ||
+ pmd_devmap(*pmd)) {
+ pr_debug("PMD: %lx/%lx clear[%lx]\n",
+ addr, end, (long)pmd);
+ pmd_clear(pmd);
+ continue;
+ }
+ kvm_clear_pte_range(pmd, addr, next);
+ } while (pmd++, addr = next, addr < end);
+}
+
+static void kvm_clear_pud_range(p4d_t *p4d, unsigned long addr,
+ unsigned long end, enum page_table_level level)
+{
+ pud_t *pud;
+ unsigned long next;
+
+ pud = kvm_pud_offset(p4d, addr);
+ if (IS_ERR(pud)) {
+ pr_debug("PUD not found, skip clearing\n");
+ return;
+ }
+ do {
+ next = pud_addr_end(addr, end);
+ if (pud_none(*pud))
+ continue;
+ if (level == PGT_LEVEL_PUD || pud_trans_huge(*pud) ||
+ pud_devmap(*pud)) {
+ pr_debug("PUD: %lx/%lx clear[%lx]\n",
+ addr, end, (long)pud);
+ pud_clear(pud);
+ continue;
+ }
+ kvm_clear_pmd_range(pud, addr, next, level);
+ } while (pud++, addr = next, addr < end);
+}
+
+static void kvm_clear_p4d_range(pgd_t *pgd, unsigned long addr,
+ unsigned long end, enum page_table_level level)
+{
+ p4d_t *p4d;
+ unsigned long next;
+
+ p4d = kvm_p4d_offset(pgd, addr);
+ if (IS_ERR(p4d)) {
+ pr_debug("P4D not found, skip clearing\n");
+ return;
+ }
+
+ do {
+ next = p4d_addr_end(addr, end);
+ if (p4d_none(*p4d))
+ continue;
+ if (level == PGT_LEVEL_P4D) {
+ pr_debug("P4D: %lx/%lx clear[%lx]\n",
+ addr, end, (long)p4d);
+ p4d_clear(p4d);
+ continue;
+ }
+ kvm_clear_pud_range(p4d, addr, next, level);
+ } while (p4d++, addr = next, addr < end);
+}
+
+static void kvm_clear_pgd_range(struct mm_struct *mm, unsigned long addr,
+ unsigned long end, enum page_table_level level)
+{
+ pgd_t *pgd;
+ unsigned long next;
+
+ pgd = pgd_offset(mm, addr);
+ do {
+ next = pgd_addr_end(addr, end);
+ if (pgd_none(*pgd))
+ continue;
+ if (level == PGT_LEVEL_PGD) {
+ pr_debug("PGD: %lx/%lx clear[%lx]\n",
+ addr, end, (long)pgd);
+ pgd_clear(pgd);
+ continue;
+ }
+ kvm_clear_p4d_range(pgd, addr, next, level);
+ } while (pgd++, addr = next, addr < end);
+}
+
+/*
+ * Clear page table entries in the KVM page table. The level parameter
+ * specifies the page table level (PGD, P4D, PUD, PMD, PTE) at which the
+ * clear should be done.
+ *
+ * WARNING: The KVM page table can have direct references to the kernel
+ * page table, at different levels (PGD, P4D, PUD, PMD). When clearing
+ * such references, if the level is incorrect (for example, clear at the
+ * PTE level while the mapping was done at PMD level), then the clearing
+ * will occur in the kernel page table and the system will likely crash
+ * on an unhandled page fault.
+ */
+static void kvm_clear_mapping(void *ptr, size_t size,
+ enum page_table_level level)
+{
+ unsigned long start = (unsigned long)ptr;
+ unsigned long end = start + ((unsigned long)size);
+
+ pr_debug("CLEAR %px, %lx [%lx,%lx], level=%d\n",
+ ptr, size, start, end, level);
+ kvm_clear_pgd_range(&kvm_mm, start, end, level);
+}
+
+/*
+ * Clear a range mapping in the KVM page table.
+ */
+void kvm_clear_range_mapping(void *ptr)
+{
+ struct kvm_range_mapping *range_mapping;
+ bool subset;
+
+ mutex_lock(&kvm_range_mapping_lock);
+
+ range_mapping = kvm_get_range_mapping_locked(ptr, &subset);
+ if (!range_mapping) {
+ mutex_unlock(&kvm_range_mapping_lock);
+ pr_debug("CLEAR %px - range not found\n", ptr);
+ return;
+ }
+ if (subset) {
+ mutex_unlock(&kvm_range_mapping_lock);
+ pr_debug("CLEAR %px - ignored, subset of %px/%lx/%d\n",
+ ptr, range_mapping->ptr, range_mapping->size,
+ range_mapping->level);
+ return;
+ }
+
+ kvm_clear_mapping(range_mapping->ptr, range_mapping->size,
+ range_mapping->level);
+ list_del(&range_mapping->list);
+ mutex_unlock(&kvm_range_mapping_lock);
+
+ kfree(range_mapping);
+}
+EXPORT_SYMBOL(kvm_clear_range_mapping);
+
static int kvm_isolation_init_mm(void)
{
diff --git a/arch/x86/kvm/isolation.h b/arch/x86/kvm/isolation.h
index e8c018a..7d3c985 100644
--- a/arch/x86/kvm/isolation.h
+++ b/arch/x86/kvm/isolation.h
@@ -17,5 +17,6 @@ static inline bool kvm_isolation(void)
extern void kvm_isolation_exit(void);
extern void kvm_may_access_sensitive_data(struct kvm_vcpu *vcpu);
extern int kvm_copy_ptes(void *ptr, unsigned long size);
+extern void kvm_clear_range_mapping(void *ptr);
#endif
--
1.7.1
The KVM page table is initialized by adding core memory mappings: the
kernel text, the per-cpu memory, the kvm module, the cpu_entry_area, the
%esp fixup stacks and the IRQ stacks.
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kernel/cpu/common.c | 2 +
arch/x86/kvm/isolation.c | 131 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/isolation.h | 10 +++
include/linux/percpu.h | 2 +
mm/percpu.c | 6 +-
5 files changed, 149 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 3764054..0fa44b1 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1511,6 +1511,8 @@ static __init int setup_clearcpuid(char *arg)
EXPORT_PER_CPU_SYMBOL(current_task);
DEFINE_PER_CPU(struct irq_stack *, hardirq_stack_ptr);
+EXPORT_PER_CPU_SYMBOL_GPL(hardirq_stack_ptr);
+
DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
index 2052abf..cf5ee0d 100644
--- a/arch/x86/kvm/isolation.c
+++ b/arch/x86/kvm/isolation.c
@@ -10,6 +10,8 @@
#include <linux/printk.h>
#include <linux/slab.h>
+#include <asm/cpu_entry_area.h>
+#include <asm/processor.h>
#include <asm/mmu_context.h>
#include <asm/pgalloc.h>
@@ -88,6 +90,8 @@ struct mm_struct kvm_mm = {
DEFINE_STATIC_KEY_FALSE(kvm_isolation_enabled);
EXPORT_SYMBOL(kvm_isolation_enabled);
+static void kvm_isolation_uninit_page_table(void);
+static void kvm_isolation_uninit_mm(void);
static void kvm_clear_mapping(void *ptr, size_t size,
enum page_table_level level);
@@ -1024,10 +1028,130 @@ int kvm_copy_percpu_mapping(void *percpu_ptr, size_t size)
EXPORT_SYMBOL(kvm_copy_percpu_mapping);
+static int kvm_isolation_init_page_table(void)
+{
+ void *stack;
+ int cpu, rv;
+
+ /*
+ * Copy the mapping for all the kernel text. We copy at the PMD
+ * level since the PUD is shared with the module mapping space.
+ */
+ rv = kvm_copy_mapping((void *)__START_KERNEL_map, KERNEL_IMAGE_SIZE,
+ PGT_LEVEL_PMD);
+ if (rv)
+ goto out_uninit_page_table;
+
+ /* copy the mapping of per cpu memory */
+ rv = kvm_copy_mapping(pcpu_base_addr, pcpu_unit_size * pcpu_nr_units,
+ PGT_LEVEL_PMD);
+ if (rv)
+ goto out_uninit_page_table;
+
+ /*
+ * Copy the mapping for cpu_entry_area and %esp fixup stacks
+ * (this is based on the PTI userland address space, but probably
+ * not needed because the KVM address space is not directly
+ * entered from userspace). They can both be copied at the P4D
+ * level since they each have a dedicated P4D entry.
+ */
+ rv = kvm_copy_mapping((void *)CPU_ENTRY_AREA_PER_CPU, P4D_SIZE,
+ PGT_LEVEL_P4D);
+ if (rv)
+ goto out_uninit_page_table;
+
+#ifdef CONFIG_X86_ESPFIX64
+ rv = kvm_copy_mapping((void *)ESPFIX_BASE_ADDR, P4D_SIZE,
+ PGT_LEVEL_P4D);
+ if (rv)
+ goto out_uninit_page_table;
+#endif
+
+#ifdef CONFIG_VMAP_STACK
+ /*
+ * Interrupt stacks are vmap'ed with guard pages, so we need to
+ * copy mappings.
+ */
+ for_each_possible_cpu(cpu) {
+ stack = per_cpu(hardirq_stack_ptr, cpu);
+ pr_debug("IRQ Stack %px\n", stack);
+ if (!stack)
+ continue;
+ rv = kvm_copy_ptes(stack - IRQ_STACK_SIZE, IRQ_STACK_SIZE);
+ if (rv)
+ goto out_uninit_page_table;
+ }
+
+#endif
+
+ /* copy mapping of the current module (kvm) */
+ rv = kvm_copy_module_mapping();
+ if (rv)
+ goto out_uninit_page_table;
+
+ return 0;
+
+out_uninit_page_table:
+ kvm_isolation_uninit_page_table();
+ return rv;
+}
+
+/*
+ * Free all buffers used by the kvm page table. These buffers are stored
+ * in the kvm_pgt_dgroup_list.
+ */
+static void kvm_isolation_uninit_page_table(void)
+{
+ struct pgt_directory_group *dgroup, *dgroup_next;
+ enum page_table_level level;
+ void *ptr;
+ int i;
+
+ mutex_lock(&kvm_pgt_dgroup_lock);
+
+ list_for_each_entry_safe(dgroup, dgroup_next,
+ &kvm_pgt_dgroup_list, list) {
+
+ for (i = 0; i < dgroup->count; i++) {
+ ptr = dgroup->directory[i].ptr;
+ level = dgroup->directory[i].level;
+
+ switch (dgroup->directory[i].level) {
+
+ case PGT_LEVEL_PTE:
+ kvm_pte_free(NULL, ptr);
+ break;
+
+ case PGT_LEVEL_PMD:
+ kvm_pmd_free(NULL, ptr);
+ break;
+
+ case PGT_LEVEL_PUD:
+ kvm_pud_free(NULL, ptr);
+ break;
+
+ case PGT_LEVEL_P4D:
+ kvm_p4d_free(NULL, ptr);
+ break;
+
+ default:
+ pr_err("unexpected page directory %d for %px\n",
+ level, ptr);
+ }
+ }
+
+ list_del(&dgroup->list);
+ kfree(dgroup);
+ }
+
+ mutex_unlock(&kvm_pgt_dgroup_lock);
+}
+
static int kvm_isolation_init_mm(void)
{
pgd_t *kvm_pgd;
gfp_t gfp_mask;
+ int rv;
gfp_mask = GFP_KERNEL | __GFP_ZERO;
kvm_pgd = (pgd_t *)__get_free_pages(gfp_mask, PGD_ALLOCATION_ORDER);
@@ -1054,6 +1178,12 @@ static int kvm_isolation_init_mm(void)
mm_init_cpumask(&kvm_mm);
init_new_context(NULL, &kvm_mm);
+ rv = kvm_isolation_init_page_table();
+ if (rv) {
+ kvm_isolation_uninit_mm();
+ return rv;
+ }
+
return 0;
}
@@ -1065,6 +1195,7 @@ static void kvm_isolation_uninit_mm(void)
destroy_context(&kvm_mm);
+ kvm_isolation_uninit_page_table();
kvm_free_all_range_mapping();
#ifdef CONFIG_PAGE_TABLE_ISOLATION
diff --git a/arch/x86/kvm/isolation.h b/arch/x86/kvm/isolation.h
index 3ef2060..1f79e28 100644
--- a/arch/x86/kvm/isolation.h
+++ b/arch/x86/kvm/isolation.h
@@ -3,6 +3,16 @@
#define ARCH_X86_KVM_ISOLATION_H
#include <linux/kvm_host.h>
+#include <linux/export.h>
+
+/*
+ * Copy the memory mapping for the current module. This is defined as a
+ * macro to ensure it is expanded in the module making the call so that
+ * THIS_MODULE has the correct value.
+ */
+#define kvm_copy_module_mapping() \
+ (kvm_copy_ptes(THIS_MODULE->core_layout.base, \
+ THIS_MODULE->core_layout.size))
DECLARE_STATIC_KEY_FALSE(kvm_isolation_enabled);
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 70b7123..fb0ab9a 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -70,6 +70,8 @@
extern void *pcpu_base_addr;
extern const unsigned long *pcpu_unit_offsets;
+extern int pcpu_unit_size;
+extern int pcpu_nr_units;
struct pcpu_group_info {
int nr_units; /* aligned # of units */
diff --git a/mm/percpu.c b/mm/percpu.c
index 68dd2e7..b68b3d8 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -119,8 +119,10 @@
#endif /* CONFIG_SMP */
static int pcpu_unit_pages __ro_after_init;
-static int pcpu_unit_size __ro_after_init;
-static int pcpu_nr_units __ro_after_init;
+int pcpu_unit_size __ro_after_init;
+EXPORT_SYMBOL(pcpu_unit_size);
+int pcpu_nr_units __ro_after_init;
+EXPORT_SYMBOL(pcpu_nr_units);
static int pcpu_atom_size __ro_after_init;
int pcpu_nr_slots __ro_after_init;
static size_t pcpu_chunk_struct_size __ro_after_init;
--
1.7.1
Map the vmx vcpu data into the KVM address space when a vmx vcpu is
created, and unmap it when the vcpu is freed.
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 65 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5b52e8c..cbbaf58 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6564,10 +6564,69 @@ static void vmx_vm_free(struct kvm *kvm)
vfree(to_kvm_vmx(kvm));
}
+static void vmx_unmap_vcpu(struct vcpu_vmx *vmx)
+{
+ pr_debug("unmapping vmx %p", vmx);
+
+ kvm_clear_range_mapping(vmx);
+ if (enable_pml)
+ kvm_clear_range_mapping(vmx->pml_pg);
+ kvm_clear_range_mapping(vmx->guest_msrs);
+ kvm_clear_range_mapping(vmx->vmcs01.vmcs);
+ kvm_clear_range_mapping(vmx->vmcs01.msr_bitmap);
+ kvm_clear_range_mapping(vmx->vcpu.arch.pio_data);
+ kvm_clear_range_mapping(vmx->vcpu.arch.apic);
+}
+
+static int vmx_map_vcpu(struct vcpu_vmx *vmx)
+{
+ int rv;
+
+ pr_debug("mapping vmx %p", vmx);
+
+ rv = kvm_copy_ptes(vmx, sizeof(struct vcpu_vmx));
+ if (rv)
+ goto out_unmap_vcpu;
+
+ if (enable_pml) {
+ rv = kvm_copy_ptes(vmx->pml_pg, PAGE_SIZE);
+ if (rv)
+ goto out_unmap_vcpu;
+ }
+
+ rv = kvm_copy_ptes(vmx->guest_msrs, PAGE_SIZE);
+ if (rv)
+ goto out_unmap_vcpu;
+
+ rv = kvm_copy_ptes(vmx->vmcs01.vmcs, PAGE_SIZE << vmcs_config.order);
+ if (rv)
+ goto out_unmap_vcpu;
+
+ rv = kvm_copy_ptes(vmx->vmcs01.msr_bitmap, PAGE_SIZE);
+ if (rv)
+ goto out_unmap_vcpu;
+
+ rv = kvm_copy_ptes(vmx->vcpu.arch.pio_data, PAGE_SIZE);
+ if (rv)
+ goto out_unmap_vcpu;
+
+ rv = kvm_copy_ptes(vmx->vcpu.arch.apic, sizeof(struct kvm_lapic));
+ if (rv)
+ goto out_unmap_vcpu;
+
+ return 0;
+
+out_unmap_vcpu:
+ vmx_unmap_vcpu(vmx);
+ return rv;
+}
+
static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
+ if (kvm_isolation())
+ vmx_unmap_vcpu(vmx);
if (enable_pml)
vmx_destroy_pml_buffer(vmx);
free_vpid(vmx->vpid);
@@ -6679,6 +6738,12 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
vmx->ept_pointer = INVALID_PAGE;
+ if (kvm_isolation()) {
+ err = vmx_map_vcpu(vmx);
+ if (err)
+ goto free_vmcs;
+ }
+
return &vmx->vcpu;
free_vmcs:
--
1.7.1
From: Liran Alon <[email protected]>
KVM isolation enter/exit is done by switching between the KVM address
space and the kernel address space.
Signed-off-by: Liran Alon <[email protected]>
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kvm/isolation.c | 30 ++++++++++++++++++++++++------
arch/x86/mm/tlb.c | 1 +
include/linux/sched.h | 1 +
3 files changed, 26 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
index db0a7ce..b0c789f 100644
--- a/arch/x86/kvm/isolation.c
+++ b/arch/x86/kvm/isolation.c
@@ -1383,11 +1383,13 @@ static bool kvm_page_fault(struct pt_regs *regs, unsigned long error_code,
printk(KERN_DEFAULT "KVM isolation: page fault %ld at %pS on %lx (%pS) while switching mm\n"
" cr3=%lx\n"
" kvm_mm=%px pgd=%px\n"
- " active_mm=%px pgd=%px\n",
+ " active_mm=%px pgd=%px\n"
+ " kvm_prev_mm=%px pgd=%px\n",
error_code, (void *)regs->ip, address, (void *)address,
cr3,
&kvm_mm, kvm_mm.pgd,
- active_mm, active_mm->pgd);
+ active_mm, active_mm->pgd,
+ current->kvm_prev_mm, current->kvm_prev_mm->pgd);
dump_stack();
return false;
@@ -1649,11 +1651,27 @@ void kvm_may_access_sensitive_data(struct kvm_vcpu *vcpu)
kvm_isolation_exit();
}
+static void kvm_switch_mm(struct mm_struct *mm)
+{
+ unsigned long flags;
+
+ /*
+ * Disable interrupt before updating active_mm, otherwise if an
+ * interrupt occurs during the switch then the interrupt handler
+ * can be mislead about the mm effectively in use.
+ */
+ local_irq_save(flags);
+ current->kvm_prev_mm = current->active_mm;
+ current->active_mm = mm;
+ switch_mm_irqs_off(current->kvm_prev_mm, mm, NULL);
+ local_irq_restore(flags);
+}
+
void kvm_isolation_enter(void)
{
int err;
- if (kvm_isolation()) {
+ if (kvm_isolation() && current->active_mm != &kvm_mm) {
/*
* Switches to kvm_mm should happen from vCPU thread,
* which should not be a kernel thread with no mm
@@ -1666,14 +1684,14 @@ void kvm_isolation_enter(void)
current);
return;
}
- /* TODO: switch to kvm_mm */
+ kvm_switch_mm(&kvm_mm);
}
}
void kvm_isolation_exit(void)
{
- if (kvm_isolation()) {
+ if (kvm_isolation() && current->active_mm == &kvm_mm) {
/* TODO: Kick sibling hyperthread before switch to host mm */
- /* TODO: switch back to original mm */
+ kvm_switch_mm(current->kvm_prev_mm);
}
}
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index a4db7f5..7ad5ad1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -444,6 +444,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
switch_ldt(real_prev, next);
}
}
+EXPORT_SYMBOL_GPL(switch_mm_irqs_off);
/*
* Please ignore the name of this function. It should be called
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 80e1d75..b03680d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1202,6 +1202,7 @@ struct task_struct {
#ifdef CONFIG_HAVE_KVM
/* Is the task mapped into the KVM address space? */
bool kvm_mapped;
+ struct mm_struct *kvm_prev_mm;
#endif
/*
--
1.7.1
Map VM data, in particular the kvm structure data.
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kvm/isolation.c | 17 +++++++++++++++++
arch/x86/kvm/isolation.h | 2 ++
arch/x86/kvm/vmx/vmx.c | 31 ++++++++++++++++++++++++++++++-
arch/x86/kvm/x86.c | 12 ++++++++++++
include/linux/kvm_host.h | 1 +
virt/kvm/arm/arm.c | 4 ++++
virt/kvm/kvm_main.c | 2 +-
7 files changed, 67 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
index cf5ee0d..d3ac014 100644
--- a/arch/x86/kvm/isolation.c
+++ b/arch/x86/kvm/isolation.c
@@ -1222,6 +1222,23 @@ static void kvm_isolation_clear_handlers(void)
kvm_set_isolation_exit_handler(NULL);
}
+int kvm_isolation_init_vm(struct kvm *kvm)
+{
+ if (!kvm_isolation())
+ return 0;
+
+ return (kvm_copy_percpu_mapping(kvm->srcu.sda,
+ sizeof(struct srcu_data)));
+}
+
+void kvm_isolation_destroy_vm(struct kvm *kvm)
+{
+ if (!kvm_isolation())
+ return;
+
+ kvm_clear_percpu_mapping(kvm->srcu.sda);
+}
+
int kvm_isolation_init(void)
{
int r;
diff --git a/arch/x86/kvm/isolation.h b/arch/x86/kvm/isolation.h
index 1f79e28..33e9a87 100644
--- a/arch/x86/kvm/isolation.h
+++ b/arch/x86/kvm/isolation.h
@@ -23,6 +23,8 @@ static inline bool kvm_isolation(void)
extern int kvm_isolation_init(void);
extern void kvm_isolation_uninit(void);
+extern int kvm_isolation_init_vm(struct kvm *kvm);
+extern void kvm_isolation_destroy_vm(struct kvm *kvm);
extern void kvm_isolation_enter(void);
extern void kvm_isolation_exit(void);
extern void kvm_may_access_sensitive_data(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f181b3c..5b52e8c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6523,6 +6523,33 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
vmx_complete_interrupts(vmx);
}
+static void vmx_unmap_vm(struct kvm *kvm)
+{
+ struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
+
+ if (!kvm_isolation())
+ return;
+
+ pr_debug("unmapping kvm %p", kvm_vmx);
+ kvm_clear_range_mapping(kvm_vmx);
+}
+
+static int vmx_map_vm(struct kvm *kvm)
+{
+ struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
+
+ if (!kvm_isolation())
+ return 0;
+
+ pr_debug("mapping kvm %p", kvm_vmx);
+ /*
+ * Only copy kvm_vmx struct mapping because other
+ * attributes (like kvm->srcu) are not initialized
+ * yet.
+ */
+ return kvm_copy_ptes(kvm_vmx, sizeof(struct kvm_vmx));
+}
+
static struct kvm *vmx_vm_alloc(void)
{
struct kvm_vmx *kvm_vmx = __vmalloc(sizeof(struct kvm_vmx),
@@ -6533,6 +6560,7 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
static void vmx_vm_free(struct kvm *kvm)
{
+ vmx_unmap_vm(kvm);
vfree(to_kvm_vmx(kvm));
}
@@ -6702,7 +6730,8 @@ static int vmx_vm_init(struct kvm *kvm)
break;
}
}
- return 0;
+
+ return (vmx_map_vm(kvm));
}
static void __init vmx_check_processor_compat(void *rtn)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1db72c3..e1cc3a6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9207,6 +9207,17 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
return 0;
}
+void kvm_arch_vm_postcreate(struct kvm *kvm)
+{
+ /*
+ * The kvm structure is mapped in vmx.c so that the full kvm_vmx
+ * structure can be mapped. Attributes allocated in the kvm
+ * structure (like kvm->srcu) are mapped by kvm_isolation_init_vm()
+ * because they are not initialized when vmx.c maps the kvm structure.
+ */
+ kvm_isolation_init_vm(kvm);
+}
+
static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
{
vcpu_load(vcpu);
@@ -9320,6 +9331,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
x86_set_memory_region(kvm, IDENTITY_PAGETABLE_PRIVATE_MEMSLOT, 0, 0);
x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, 0, 0);
}
+ kvm_isolation_destroy_vm(kvm);
if (kvm_x86_ops->vm_destroy)
kvm_x86_ops->vm_destroy(kvm);
kvm_pic_destroy(kvm);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 640a036..ad24d9e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -932,6 +932,7 @@ static inline bool kvm_arch_intc_initialized(struct kvm *kvm)
int kvm_arch_init_vm(struct kvm *kvm, unsigned long type);
void kvm_arch_destroy_vm(struct kvm *kvm);
+void kvm_arch_vm_postcreate(struct kvm *kvm);
void kvm_arch_sync_events(struct kvm *kvm);
int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index f412ebc..0921cb3 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -156,6 +156,10 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
return ret;
}
+void kvm_arch_vm_postcreate(struct kvm *kvm)
+{
+}
+
bool kvm_arch_has_vcpu_debugfs(void)
{
return false;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a704d1f..3c0c3db 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3366,7 +3366,7 @@ static int kvm_dev_ioctl_create_vm(unsigned long type)
return -ENOMEM;
}
kvm_uevent_notify_change(KVM_EVENT_CREATE_VM, kvm);
-
+ kvm_arch_vm_postcreate(kvm);
fd_install(r, file);
return r;
--
1.7.1
From: Liran Alon <[email protected]>
Next commits will change most KVM #VMExit handlers to run in the KVM
isolated address space. Any interrupt handler raised during execution in
the KVM address space needs to switch back to the host address space.
This patch makes sure that IRQ handlers will run in the full host address
space instead of the KVM isolated address space.
Signed-off-by: Liran Alon <[email protected]>
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/include/asm/apic.h | 4 ++--
arch/x86/include/asm/hardirq.h | 10 ++++++++++
arch/x86/kernel/smp.c | 2 +-
arch/x86/platform/uv/tlb_uv.c | 2 +-
4 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 130e81e..606da8f 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -515,7 +515,7 @@ static inline unsigned int read_apic_id(void)
static inline void entering_irq(void)
{
irq_enter();
- kvm_set_cpu_l1tf_flush_l1d();
+ kvm_cpu_may_access_sensitive_data();
}
static inline void entering_ack_irq(void)
@@ -528,7 +528,7 @@ static inline void ipi_entering_ack_irq(void)
{
irq_enter();
ack_APIC_irq();
- kvm_set_cpu_l1tf_flush_l1d();
+ kvm_cpu_may_access_sensitive_data();
}
static inline void exiting_irq(void)
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index d9069bb..e082ecb 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -80,4 +80,14 @@ static inline bool kvm_get_cpu_l1tf_flush_l1d(void)
static inline void kvm_set_cpu_l1tf_flush_l1d(void) { }
#endif /* IS_ENABLED(CONFIG_KVM_INTEL) */
+#ifdef CONFIG_HAVE_KVM
+extern void (*kvm_isolation_exit_handler)(void);
+
+static inline void kvm_cpu_may_access_sensitive_data(void)
+{
+ kvm_set_cpu_l1tf_flush_l1d();
+ kvm_isolation_exit_handler();
+}
+#endif
+
#endif /* _ASM_X86_HARDIRQ_H */
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 04adc8d..b99fda0 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -261,7 +261,7 @@ __visible void __irq_entry smp_reschedule_interrupt(struct pt_regs *regs)
{
ack_APIC_irq();
inc_irq_stat(irq_resched_count);
- kvm_set_cpu_l1tf_flush_l1d();
+ kvm_cpu_may_access_sensitive_data();
if (trace_resched_ipi_enabled()) {
/*
diff --git a/arch/x86/platform/uv/tlb_uv.c b/arch/x86/platform/uv/tlb_uv.c
index 1297e18..83a17ca 100644
--- a/arch/x86/platform/uv/tlb_uv.c
+++ b/arch/x86/platform/uv/tlb_uv.c
@@ -1285,7 +1285,7 @@ void uv_bau_message_interrupt(struct pt_regs *regs)
struct msg_desc msgdesc;
ack_APIC_irq();
- kvm_set_cpu_l1tf_flush_l1d();
+ kvm_cpu_may_access_sensitive_data();
time_start = get_cycles();
bcp = &per_cpu(bau_control, smp_processor_id());
--
1.7.1
The KVM page table will have direct references to the kernel page table,
at different levels (PGD, P4D, PUD, PMD). When freeing the KVM page table,
we should make sure that we free parts actually allocated for the KVM
page table, and not parts of the kernel page table referenced from the
KVM page table. To do so, we will keep track of buffers when building
the KVM page table.
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kvm/isolation.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 119 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
index 43fd924..1efdab1 100644
--- a/arch/x86/kvm/isolation.c
+++ b/arch/x86/kvm/isolation.c
@@ -8,12 +8,60 @@
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/printk.h>
+#include <linux/slab.h>
#include <asm/mmu_context.h>
#include <asm/pgalloc.h>
#include "isolation.h"
+
+enum page_table_level {
+ PGT_LEVEL_PTE,
+ PGT_LEVEL_PMD,
+ PGT_LEVEL_PUD,
+ PGT_LEVEL_P4D,
+ PGT_LEVEL_PGD
+};
+
+/*
+ * The KVM page table can have direct references to the kernel page table,
+ * at different levels (PGD, P4D, PUD, PMD). When freeing the KVM page
+ * table, we should make sure that we free parts actually allocated for
+ * the KVM page table, and not parts of the kernel page table referenced
+ * from the KVM page table.
+ *
+ * To do so, page table directories (struct pgt_directory) are used to keep
+ * track of buffers allocated when building the KVM page table. Also, as
+ * a page table can have many buffers, page table directory groups (struct
+ * (pgt_directory_group) are used to group page table directories and save
+ * some space (instead of allocating each directory individually).
+ */
+
+#define PGT_DIRECTORY_GROUP_SIZE 64
+
+struct pgt_directory {
+ enum page_table_level level;
+ void *ptr;
+};
+
+struct pgt_directory_group {
+ struct list_head list;
+ int count;
+ struct pgt_directory directory[PGT_DIRECTORY_GROUP_SIZE];
+};
+
+static LIST_HEAD(kvm_pgt_dgroup_list);
+static DEFINE_MUTEX(kvm_pgt_dgroup_lock);
+
+/*
+ * Get the pointer to the beginning of a page table directory from a page
+ * table directory entry.
+ */
+#define PGTD_ALIGN(entry) \
+ ((typeof(entry))(((unsigned long)(entry)) & PAGE_MASK))
+
+
struct mm_struct kvm_mm = {
.mm_rb = RB_ROOT,
.mm_users = ATOMIC_INIT(2),
@@ -43,6 +91,77 @@ struct mm_struct kvm_mm = {
static bool __read_mostly address_space_isolation;
module_param(address_space_isolation, bool, 0444);
+
+static struct pgt_directory_group *pgt_directory_group_create(void)
+{
+ struct pgt_directory_group *dgroup;
+
+ dgroup = kzalloc(sizeof(struct pgt_directory_group), GFP_KERNEL);
+ if (!dgroup)
+ return NULL;
+
+ INIT_LIST_HEAD(&dgroup->list);
+ dgroup->count = 0;
+
+ return dgroup;
+}
+
+static bool kvm_add_pgt_directory(void *ptr, enum page_table_level level)
+{
+ struct pgt_directory_group *dgroup;
+ int index;
+
+ mutex_lock(&kvm_pgt_dgroup_lock);
+
+ if (list_empty(&kvm_pgt_dgroup_list))
+ dgroup = NULL;
+ else
+ dgroup = list_entry(kvm_pgt_dgroup_list.next,
+ struct pgt_directory_group, list);
+
+ if (!dgroup || dgroup->count >= PGT_DIRECTORY_GROUP_SIZE) {
+ dgroup = pgt_directory_group_create();
+ if (!dgroup) {
+ mutex_unlock(&kvm_pgt_dgroup_lock);
+ return false;
+ }
+ list_add_tail(&dgroup->list, &kvm_pgt_dgroup_list);
+ }
+
+ index = dgroup->count;
+ dgroup->directory[index].level = level;
+ dgroup->directory[index].ptr = PGTD_ALIGN(ptr);
+ dgroup->count = index + 1;
+
+ mutex_unlock(&kvm_pgt_dgroup_lock);
+
+ return true;
+}
+
+static bool kvm_valid_pgt_entry(void *ptr)
+{
+ struct pgt_directory_group *dgroup;
+ int i;
+
+ mutex_lock(&kvm_pgt_dgroup_lock);
+
+ ptr = PGTD_ALIGN(ptr);
+ list_for_each_entry(dgroup, &kvm_pgt_dgroup_list, list) {
+ for (i = 0; i < dgroup->count; i++) {
+ if (dgroup->directory[i].ptr == ptr) {
+ mutex_unlock(&kvm_pgt_dgroup_lock);
+ return true;
+ }
+ }
+ }
+
+ mutex_unlock(&kvm_pgt_dgroup_lock);
+
+ return false;
+
+}
+
+
static int kvm_isolation_init_mm(void)
{
pgd_t *kvm_pgd;
--
1.7.1
The KVM page fault handler handles page faults that occur while using
the KVM address space by switching to the kernel address space and
retrying the access (except if the fault occurs while switching
to the kernel address space). Processing of page faults that occur
while using the kernel address space is unchanged.
The page fault log is cleared when creating a VM so that page fault
information doesn't persist when QEMU is stopped and restarted.
The KVM module parameter page_fault_stack can be used to disable
dumping the stack trace when a page fault occurs while using the KVM
address space. The fault will still be reported, but without the
stack trace.
Signed-off-by: Alexandre Chartre <[email protected]>
---
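Note (sketch, not part of the patch): the log line gives the faulting
instruction and the target symbol; the intended workflow is then to map
the missing data with the copy helpers introduced elsewhere in the
series. For example, if a fault were traced back to an unmapped vCPU
buffer or per-cpu structure, the fix would typically be a call like the
following from the relevant init path (kvm_map_missing_data() is only an
illustrative name):

static int kvm_map_missing_data(struct kvm_vcpu *vcpu)
{
	int err;

	/* a regular buffer reported by the fault log */
	err = kvm_copy_ptes(vcpu->arch.pio_data, PAGE_SIZE);
	if (err)
		return err;

	/* or a per-cpu buffer, mapped for every possible CPU */
	return kvm_copy_percpu_mapping(vcpu->kvm->srcu.sda,
				       sizeof(struct srcu_data));
}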
arch/x86/kernel/dumpstack.c | 1 +
arch/x86/kvm/isolation.c | 202 +++++++++++++++++++++++++++++++++++++++++++
arch/x86/mm/fault.c | 12 +++
3 files changed, 215 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 2b58864..aa28763 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -292,6 +292,7 @@ void show_stack(struct task_struct *task, unsigned long *sp)
show_trace_log_lvl(task, NULL, sp, KERN_DEFAULT);
}
+EXPORT_SYMBOL(show_stack);
void show_stack_regs(struct pt_regs *regs)
{
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
index e7979b3..db0a7ce 100644
--- a/arch/x86/kvm/isolation.c
+++ b/arch/x86/kvm/isolation.c
@@ -8,6 +8,7 @@
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/printk.h>
+#include <linux/sched/debug.h>
#include <linux/slab.h>
#include <asm/cpu_entry_area.h>
@@ -17,6 +18,9 @@
#include "isolation.h"
+extern bool (*kvm_page_fault_handler)(struct pt_regs *regs,
+ unsigned long error_code,
+ unsigned long address);
enum page_table_level {
PGT_LEVEL_PTE,
@@ -91,6 +95,25 @@ struct kvm_range_mapping {
static LIST_HEAD(kvm_range_mapping_list);
static DEFINE_MUTEX(kvm_range_mapping_lock);
+/*
+ * When a page fault occurs, while running with the KVM address space,
+ * the KVM page fault handler prints information about the fault (in
+ * particular the stack trace), and it switches back to the kernel
+ * address space.
+ *
+ * Information printed by the KVM page fault handler can be used to find
+ * out data not mapped in the KVM address space. Then the KVM address
+ * space can be augmented to include the missing mapping so that we don't
+ * fault at that same place anymore.
+ *
+ * The following variables keep track of page faults occurring while running
+ * with the KVM address space, to avoid reporting the same fault more
+ * than once.
+ */
+
+#define KVM_LAST_FAULT_COUNT 128
+
+static unsigned long kvm_last_fault[KVM_LAST_FAULT_COUNT];
+
struct mm_struct kvm_mm = {
.mm_rb = RB_ROOT,
@@ -126,6 +149,14 @@ static void kvm_clear_mapping(void *ptr, size_t size,
static bool __read_mostly address_space_isolation;
module_param(address_space_isolation, bool, 0444);
+/*
+ * When set to true, KVM dumps the stack when a page fault occurs while
+ * running with the KVM address space. Otherwise the page fault is still
+ * reported but without the stack trace.
+ */
+static bool __read_mostly page_fault_stack = true;
+module_param(page_fault_stack, bool, 0444);
+
static struct kvm_range_mapping *kvm_get_range_mapping_locked(void *ptr,
bool *subset)
{
@@ -1195,6 +1226,173 @@ static void kvm_reset_all_task_mapping(void)
mutex_unlock(&kvm_task_mapping_lock);
}
+static int bad_address(void *p)
+{
+ unsigned long dummy;
+
+ return probe_kernel_address((unsigned long *)p, dummy);
+}
+
+static void kvm_dump_pagetable(pgd_t *base, unsigned long address)
+{
+ pgd_t *pgd = base + pgd_index(address);
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ pr_info("BASE %px ", base);
+
+ if (bad_address(pgd))
+ goto bad;
+
+ pr_cont("PGD %lx ", pgd_val(*pgd));
+
+ if (!pgd_present(*pgd))
+ goto out;
+
+ p4d = p4d_offset(pgd, address);
+ if (bad_address(p4d))
+ goto bad;
+
+ pr_cont("P4D %lx ", p4d_val(*p4d));
+ if (!p4d_present(*p4d) || p4d_large(*p4d))
+ goto out;
+
+ pud = pud_offset(p4d, address);
+ if (bad_address(pud))
+ goto bad;
+
+ pr_cont("PUD %lx ", pud_val(*pud));
+ if (!pud_present(*pud) || pud_large(*pud))
+ goto out;
+
+ pmd = pmd_offset(pud, address);
+ if (bad_address(pmd))
+ goto bad;
+
+ pr_cont("PMD %lx ", pmd_val(*pmd));
+ if (!pmd_present(*pmd) || pmd_large(*pmd))
+ goto out;
+
+ pte = pte_offset_kernel(pmd, address);
+ if (bad_address(pte))
+ goto bad;
+
+ pr_cont("PTE %lx", pte_val(*pte));
+out:
+ pr_cont("\n");
+ return;
+bad:
+ pr_info("BAD\n");
+}
+
+static void kvm_clear_page_fault(void)
+{
+ int i;
+
+ for (i = 0; i < KVM_LAST_FAULT_COUNT; i++)
+ kvm_last_fault[i] = 0;
+}
+
+static void kvm_log_page_fault(struct pt_regs *regs, unsigned long error_code,
+ unsigned long address)
+{
+ int i;
+
+ /*
+ * Log information about the fault only if this is a fault
+ * we don't know about yet (or if the fault tracking buffer
+ * is full).
+ */
+ for (i = 0; i < KVM_LAST_FAULT_COUNT; i++) {
+ if (!kvm_last_fault[i]) {
+ kvm_last_fault[i] = regs->ip;
+ break;
+ }
+ if (kvm_last_fault[i] == regs->ip)
+ return;
+ }
+
+ if (i >= KVM_LAST_FAULT_COUNT)
+ pr_warn("KVM isolation: fault tracking buffer is full [%d]\n",
+ i);
+
+ pr_info("KVM isolation: page fault #%d (%ld) at %pS on %px (%pS)\n",
+ i, error_code, (void *)regs->ip,
+ (void *)address, (void *)address);
+ if (page_fault_stack)
+ show_stack(NULL, (unsigned long *)regs->sp);
+}
+
+/*
+ * KVM Page Fault Handler. The handler handles two simple cases:
+ *
+ * - If the fault occurs while using the kernel address space, then let
+ * the kernel handle the fault normally.
+ *
+ * - If the fault occurs while using the KVM address space, then switch
+ * to the kernel address space, and retry.
+ *
+ * It also handles a tricky case: if the fault occurs while using the KVM
+ * address space but during the switch to the kernel address space, then
+ * the switch has failed and we can't recover. In that case, we force a
+ * switch to the kernel address space, print information and let the
+ * kernel handle the fault.
+ */
+static bool kvm_page_fault(struct pt_regs *regs, unsigned long error_code,
+ unsigned long address)
+{
+ struct mm_struct *active_mm = current->active_mm;
+ unsigned long cr3;
+
+ /*
+ * First, do a quick and simple test to see if we are using
+ * the KVM address space. If we do then exit KVM isolation,
+ * log the fault and report that we have handled the fault.
+ */
+ if (likely(active_mm == &kvm_mm)) {
+ kvm_isolation_exit();
+ kvm_log_page_fault(regs, error_code, address);
+ return true;
+ }
+
+ /*
+ * Verify that we are effectively using the kernel address space.
+ * When switching address space, active_mm is not necessarily up
+ * to date as it can already be set with the next mm while %cr3
+ * has not been updated yet. So check loaded_mm which is updated
+ * after %cr3.
+ *
+ * If we are effectively using the kernel address space then report
+ * that we haven't handled the fault.
+ */
+ if (this_cpu_read(cpu_tlbstate.loaded_mm) != &kvm_mm)
+ return false;
+
+ /*
+ * We are actually using the KVM address space and faulting while
+ * switching address space. Force switching to the kernel address
+ * space, log information and report that we haven't handled
+ * the fault.
+ */
+ cr3 = __read_cr3();
+ write_cr3(build_cr3(active_mm->pgd, 0));
+ kvm_dump_pagetable(kvm_mm.pgd, address);
+ kvm_dump_pagetable(active_mm->pgd, address);
+ printk(KERN_DEFAULT "KVM isolation: page fault %ld at %pS on %lx (%pS) while switching mm\n"
+ " cr3=%lx\n"
+ " kvm_mm=%px pgd=%px\n"
+ " active_mm=%px pgd=%px\n",
+ error_code, (void *)regs->ip, address, (void *)address,
+ cr3,
+ &kvm_mm, kvm_mm.pgd,
+ active_mm, active_mm->pgd);
+ dump_stack();
+
+ return false;
+}
+
static int kvm_isolation_init_page_table(void)
{
@@ -1384,11 +1582,13 @@ static void kvm_isolation_uninit_mm(void)
static void kvm_isolation_set_handlers(void)
{
kvm_set_isolation_exit_handler(kvm_isolation_exit);
+ kvm_page_fault_handler = kvm_page_fault;
}
static void kvm_isolation_clear_handlers(void)
{
kvm_set_isolation_exit_handler(NULL);
+ kvm_page_fault_handler = NULL;
}
int kvm_isolation_init_vm(struct kvm *kvm)
@@ -1396,6 +1596,8 @@ int kvm_isolation_init_vm(struct kvm *kvm)
if (!kvm_isolation())
return 0;
+ kvm_clear_page_fault();
+
pr_debug("mapping kvm srcu sda\n");
return (kvm_copy_percpu_mapping(kvm->srcu.sda,
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 46df4c6..317e105 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -33,6 +33,10 @@
#define CREATE_TRACE_POINTS
#include <asm/trace/exceptions.h>
+bool (*kvm_page_fault_handler)(struct pt_regs *regs, unsigned long error_code,
+ unsigned long address);
+EXPORT_SYMBOL(kvm_page_fault_handler);
+
/*
* Returns 0 if mmiotrace is disabled, or if the fault is not
* handled by mmiotrace:
@@ -1253,6 +1257,14 @@ static int fault_in_kernel_space(unsigned long address)
WARN_ON_ONCE(hw_error_code & X86_PF_PK);
/*
+ * KVM might be able to handle the fault when running with the
+ * KVM address space.
+ */
+ if (kvm_page_fault_handler &&
+ kvm_page_fault_handler(regs, hw_error_code, address))
+ return;
+
+ /*
* We can fault-in kernel-space virtual memory on-demand. The
* 'reference' page table is init_mm.pgd.
*
--
1.7.1
KVM buses can change after they have been created, so new buses
have to be mapped when they are created.
Signed-off-by: Alexandre Chartre <[email protected]>
---
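Note: kvm_arch_buses_updated() is declared in the generic header and
called from generic code, but only x86 gets a definition in this patch.
Other architectures would presumably need one as well; one possible
shape (an assumption, not something this patch provides) is a weak
no-op default in generic code:

/* Possible generic fallback: architectures without address space
 * isolation simply inherit a no-op.
 */
void __weak kvm_arch_buses_updated(struct kvm *kvm, struct kvm_io_bus *bus)
{
}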
arch/x86/kvm/isolation.c | 37 +++++++++++++++++++++++++++++++++++++
arch/x86/kvm/isolation.h | 1 +
arch/x86/kvm/x86.c | 13 ++++++++++++-
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 2 ++
5 files changed, 53 insertions(+), 1 deletions(-)
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
index 255b2da..329e769 100644
--- a/arch/x86/kvm/isolation.c
+++ b/arch/x86/kvm/isolation.c
@@ -1614,6 +1614,29 @@ void kvm_isolation_check_memslots(struct kvm *kvm)
}
+void kvm_isolation_check_buses(struct kvm *kvm)
+{
+ struct kvm_range_mapping *rmapping;
+ struct kvm_io_bus *bus;
+ int i, err;
+
+ if (!kvm_isolation())
+ return;
+
+ for (i = 0; i < KVM_NR_BUSES; i++) {
+ bus = kvm->buses[i];
+ rmapping = kvm_get_range_mapping(bus, NULL);
+ if (rmapping)
+ continue;
+ pr_debug("remapping kvm buses[%d]\n", i);
+ err = kvm_copy_ptes(bus, sizeof(*bus) + bus->dev_count *
+ sizeof(struct kvm_io_range));
+ if (err)
+ pr_debug("failed to map kvm buses[%d]\n", i);
+ }
+
+}
+
int kvm_isolation_init_vm(struct kvm *kvm)
{
int err, i;
@@ -1632,6 +1655,15 @@ int kvm_isolation_init_vm(struct kvm *kvm)
return err;
}
+ pr_debug("mapping kvm buses\n");
+
+ for (i = 0; i < KVM_NR_BUSES; i++) {
+ err = kvm_copy_ptes(kvm->buses[i],
+ sizeof(struct kvm_io_bus));
+ if (err)
+ return err;
+ }
+
pr_debug("mapping kvm srcu sda\n");
return (kvm_copy_percpu_mapping(kvm->srcu.sda,
@@ -1650,6 +1682,11 @@ void kvm_isolation_destroy_vm(struct kvm *kvm)
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
kvm_clear_range_mapping(kvm->memslots[i]);
+ pr_debug("unmapping kvm buses\n");
+
+ for (i = 0; i < KVM_NR_BUSES; i++)
+ kvm_clear_range_mapping(kvm->buses[i]);
+
pr_debug("unmapping kvm srcu sda\n");
kvm_clear_percpu_mapping(kvm->srcu.sda);
diff --git a/arch/x86/kvm/isolation.h b/arch/x86/kvm/isolation.h
index 1e55799..b048946 100644
--- a/arch/x86/kvm/isolation.h
+++ b/arch/x86/kvm/isolation.h
@@ -33,6 +33,7 @@ static inline bool kvm_isolation(void)
extern int kvm_copy_percpu_mapping(void *percpu_ptr, size_t size);
extern void kvm_clear_percpu_mapping(void *percpu_ptr);
extern void kvm_isolation_check_memslots(struct kvm *kvm);
+extern void kvm_isolation_check_buses(struct kvm *kvm);
extern int kvm_add_task_mapping(struct task_struct *tsk);
extern void kvm_cleanup_task_mapping(struct task_struct *tsk);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7d98e9f..3ba1996 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9253,6 +9253,13 @@ void kvm_arch_sync_events(struct kvm *kvm)
cancel_delayed_work_sync(&kvm->arch.kvmclock_sync_work);
cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work);
kvm_free_pit(kvm);
+ /*
+ * Note that kvm_isolation_destroy_vm() has to be called from
+ * here, and not from kvm_arch_destroy_vm() because it will unmap
+ * buses which are already destroyed when kvm_arch_destroy_vm()
+ * is invoked.
+ */
+ kvm_isolation_destroy_vm(kvm);
}
int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
@@ -9331,7 +9338,6 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
x86_set_memory_region(kvm, IDENTITY_PAGETABLE_PRIVATE_MEMSLOT, 0, 0);
x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, 0, 0);
}
- kvm_isolation_destroy_vm(kvm);
if (kvm_x86_ops->vm_destroy)
kvm_x86_ops->vm_destroy(kvm);
kvm_pic_destroy(kvm);
@@ -9909,6 +9915,11 @@ bool kvm_vector_hashing_enabled(void)
}
EXPORT_SYMBOL_GPL(kvm_vector_hashing_enabled);
+void kvm_arch_buses_updated(struct kvm *kvm, struct kvm_io_bus *bus)
+{
+ kvm_isolation_check_buses(kvm);
+}
+
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ad24d9e..1291d8d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -199,6 +199,7 @@ void kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
struct kvm_io_device *dev);
struct kvm_io_device *kvm_io_bus_get_dev(struct kvm *kvm, enum kvm_bus bus_idx,
gpa_t addr);
+void kvm_arch_buses_updated(struct kvm *kvm, struct kvm_io_bus *bus);
#ifdef CONFIG_KVM_ASYNC_PF
struct kvm_async_pf {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3c0c3db..374e79f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3749,6 +3749,8 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
synchronize_srcu_expedited(&kvm->srcu);
kfree(bus);
+ kvm_arch_buses_updated(kvm, new_bus);
+
return 0;
}
--
1.7.1
KVM memslots can change after they have been created, so new memslots
have to be mapped when they are created.
TODO: we currently don't unmap old memslots; they should be unmapped
when they are freed (a possible approach is sketched below).
Signed-off-by: Alexandre Chartre <[email protected]>
---
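Note (sketch for the TODO above, not part of the patch): one possible
approach is to clear the mapping of the old memslots structure right
before it is freed by the generic code, e.g. through a hypothetical
arch hook:

static void kvm_arch_memslots_freed(struct kvm *kvm,
				    struct kvm_memslots *old)
{
	if (!kvm_isolation())
		return;

	/* remove the old structure from the KVM page table */
	kvm_clear_range_mapping(old);
}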
arch/x86/kvm/isolation.c | 39 +++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/isolation.h | 1 +
arch/x86/kvm/x86.c | 3 +++
3 files changed, 43 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
index b0c789f..255b2da 100644
--- a/arch/x86/kvm/isolation.c
+++ b/arch/x86/kvm/isolation.c
@@ -1593,13 +1593,45 @@ static void kvm_isolation_clear_handlers(void)
kvm_page_fault_handler = NULL;
}
+void kvm_isolation_check_memslots(struct kvm *kvm)
+{
+ struct kvm_range_mapping *rmapping;
+ int i, err;
+
+ if (!kvm_isolation())
+ return;
+
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ rmapping = kvm_get_range_mapping(kvm->memslots[i], NULL);
+ if (rmapping)
+ continue;
+ pr_debug("remapping kvm memslots[%d]\n", i);
+ err = kvm_copy_ptes(kvm->memslots[i],
+ sizeof(struct kvm_memslots));
+ if (err)
+ pr_debug("failed to map kvm memslots[%d]\n", i);
+ }
+
+}
+
int kvm_isolation_init_vm(struct kvm *kvm)
{
+ int err, i;
+
if (!kvm_isolation())
return 0;
kvm_clear_page_fault();
+ pr_debug("mapping kvm memslots\n");
+
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ err = kvm_copy_ptes(kvm->memslots[i],
+ sizeof(struct kvm_memslots));
+ if (err)
+ return err;
+ }
+
pr_debug("mapping kvm srcu sda\n");
return (kvm_copy_percpu_mapping(kvm->srcu.sda,
@@ -1608,9 +1640,16 @@ int kvm_isolation_init_vm(struct kvm *kvm)
void kvm_isolation_destroy_vm(struct kvm *kvm)
{
+ int i;
+
if (!kvm_isolation())
return;
+ pr_debug("unmapping kvm memslots\n");
+
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
+ kvm_clear_range_mapping(kvm->memslots[i]);
+
pr_debug("unmapping kvm srcu sda\n");
kvm_clear_percpu_mapping(kvm->srcu.sda);
diff --git a/arch/x86/kvm/isolation.h b/arch/x86/kvm/isolation.h
index 2d7d016..1e55799 100644
--- a/arch/x86/kvm/isolation.h
+++ b/arch/x86/kvm/isolation.h
@@ -32,6 +32,7 @@ static inline bool kvm_isolation(void)
extern void kvm_clear_range_mapping(void *ptr);
extern int kvm_copy_percpu_mapping(void *percpu_ptr, size_t size);
extern void kvm_clear_percpu_mapping(void *percpu_ptr);
+extern void kvm_isolation_check_memslots(struct kvm *kvm);
extern int kvm_add_task_mapping(struct task_struct *tsk);
extern void kvm_cleanup_task_mapping(struct task_struct *tsk);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e1cc3a6..7d98e9f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9438,6 +9438,7 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
* mmio generation may have reached its maximum value.
*/
kvm_mmu_invalidate_mmio_sptes(kvm, gen);
+ kvm_isolation_check_memslots(kvm);
}
int kvm_arch_prepare_memory_region(struct kvm *kvm,
@@ -9537,6 +9538,8 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
*/
if (change != KVM_MR_DELETE)
kvm_mmu_slot_apply_flags(kvm, (struct kvm_memory_slot *) new);
+
+ kvm_isolation_check_memslots(kvm);
}
void kvm_arch_flush_shadow_all(struct kvm *kvm)
--
1.7.1
These functions are based on the copy_pxx_range() functions defined in
mm/memory.c. The main difference is that a level parameter is specified
to indicate the page table level (PGD, P4D, PUD, PMD, PTE) at which the
copy should be done. Also, the functions don't take a vma parameter,
and they don't alter the source page table even if an entry is bad.
In addition, kvm_copy_pte_range() can be called with a non-page-aligned
buffer, so the buffer is aligned down to the start of its page so that
the entire buffer is mapped even if its end crosses into another page.
These functions will be used to populate the KVM page table.
Signed-off-by: Alexandre Chartre <[email protected]>
---
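Note (usage sketch, not part of the patch): kvm_copy_ptes() can be
handed any kernel pointer; kvm_copy_pte_range() aligns the start down
to a page boundary, so an object that straddles a page boundary gets
both pages mapped. The struct below is made up purely for illustration:

struct example_state {
	u64 data[64];		/* object start is not necessarily page-aligned */
};

static int example_map_state(struct example_state *st)
{
	/* st does not need to be page-aligned */
	return kvm_copy_ptes(st, sizeof(*st));
}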
arch/x86/kvm/isolation.c | 229 ++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/isolation.h | 1 +
2 files changed, 230 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
index b681e4f..4f1b511 100644
--- a/arch/x86/kvm/isolation.c
+++ b/arch/x86/kvm/isolation.c
@@ -450,6 +450,235 @@ static int kvm_set_pgd(pgd_t *pgd, pgd_t pgd_value)
}
+static int kvm_copy_pte_range(struct mm_struct *dst_mm,
+ struct mm_struct *src_mm, pmd_t *dst_pmd,
+ pmd_t *src_pmd, unsigned long addr,
+ unsigned long end)
+{
+ pte_t *src_pte, *dst_pte;
+
+ dst_pte = kvm_pte_alloc(dst_mm, dst_pmd, addr);
+ if (IS_ERR(dst_pte))
+ return PTR_ERR(dst_pte);
+
+ addr &= PAGE_MASK;
+ src_pte = pte_offset_map(src_pmd, addr);
+
+ do {
+ pr_debug("PTE: %lx/%lx set[%lx] = %lx\n",
+ addr, addr + PAGE_SIZE, (long)dst_pte, pte_val(*src_pte));
+ set_pte_at(dst_mm, addr, dst_pte, *src_pte);
+
+ } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr < end);
+
+ return 0;
+}
+
+static int kvm_copy_pmd_range(struct mm_struct *dst_mm,
+ struct mm_struct *src_mm,
+ pud_t *dst_pud, pud_t *src_pud,
+ unsigned long addr, unsigned long end,
+ enum page_table_level level)
+{
+ pmd_t *src_pmd, *dst_pmd;
+ unsigned long next;
+ int err;
+
+ dst_pmd = kvm_pmd_alloc(dst_mm, dst_pud, addr);
+ if (IS_ERR(dst_pmd))
+ return PTR_ERR(dst_pmd);
+
+ src_pmd = pmd_offset(src_pud, addr);
+
+ do {
+ next = pmd_addr_end(addr, end);
+ if (level == PGT_LEVEL_PMD || pmd_none(*src_pmd)) {
+ pr_debug("PMD: %lx/%lx set[%lx] = %lx\n",
+ addr, next, (long)dst_pmd, pmd_val(*src_pmd));
+ err = kvm_set_pmd(dst_pmd, *src_pmd);
+ if (err)
+ return err;
+ continue;
+ }
+
+ if (!pmd_present(*src_pmd)) {
+ pr_warn("PMD: not present for [%lx,%lx]\n",
+ addr, next - 1);
+ pmd_clear(dst_pmd);
+ continue;
+ }
+
+ if (pmd_trans_huge(*src_pmd) || pmd_devmap(*src_pmd)) {
+ pr_debug("PMD: %lx/%lx set[%lx] = %lx (huge/devmap)\n",
+ addr, next, (long)dst_pmd, pmd_val(*src_pmd));
+ err = kvm_set_pmd(dst_pmd, *src_pmd);
+ if (err)
+ return err;
+ continue;
+ }
+
+ err = kvm_copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
+ addr, next);
+ if (err) {
+ pr_err("PMD: ERR PTE addr=%lx next=%lx\n", addr, next);
+ return err;
+ }
+
+ } while (dst_pmd++, src_pmd++, addr = next, addr < end);
+
+ return 0;
+}
+
+static int kvm_copy_pud_range(struct mm_struct *dst_mm,
+ struct mm_struct *src_mm,
+ p4d_t *dst_p4d, p4d_t *src_p4d,
+ unsigned long addr, unsigned long end,
+ enum page_table_level level)
+{
+ pud_t *src_pud, *dst_pud;
+ unsigned long next;
+ int err;
+
+ dst_pud = kvm_pud_alloc(dst_mm, dst_p4d, addr);
+ if (IS_ERR(dst_pud))
+ return PTR_ERR(dst_pud);
+
+ src_pud = pud_offset(src_p4d, addr);
+
+ do {
+ next = pud_addr_end(addr, end);
+ if (level == PGT_LEVEL_PUD || pud_none(*src_pud)) {
+ pr_debug("PUD: %lx/%lx set[%lx] = %lx\n",
+ addr, next, (long)dst_pud, pud_val(*src_pud));
+ err = kvm_set_pud(dst_pud, *src_pud);
+ if (err)
+ return err;
+ continue;
+ }
+
+ if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
+ pr_debug("PUD: %lx/%lx set[%lx] = %lx (huge/devmap)\n",
+ addr, next, (long)dst_pud, pud_val(*src_pud));
+ err = kvm_set_pud(dst_pud, *src_pud);
+ if (err)
+ return err;
+ continue;
+ }
+
+ err = kvm_copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud,
+ addr, next, level);
+ if (err) {
+ pr_err("PUD: ERR PMD addr=%lx next=%lx\n", addr, next);
+ return err;
+ }
+
+ } while (dst_pud++, src_pud++, addr = next, addr < end);
+
+ return 0;
+}
+
+static int kvm_copy_p4d_range(struct mm_struct *dst_mm,
+ struct mm_struct *src_mm,
+ pgd_t *dst_pgd, pgd_t *src_pgd,
+ unsigned long addr, unsigned long end,
+ enum page_table_level level)
+{
+ p4d_t *src_p4d, *dst_p4d;
+ unsigned long next;
+ int err;
+
+ dst_p4d = kvm_p4d_alloc(dst_mm, dst_pgd, addr);
+ if (IS_ERR(dst_p4d))
+ return PTR_ERR(dst_p4d);
+
+ src_p4d = p4d_offset(src_pgd, addr);
+
+ do {
+ next = p4d_addr_end(addr, end);
+ if (level == PGT_LEVEL_P4D || p4d_none(*src_p4d)) {
+ pr_debug("P4D: %lx/%lx set[%lx] = %lx\n",
+ addr, next, (long)dst_p4d, p4d_val(*src_p4d));
+
+ err = kvm_set_p4d(dst_p4d, *src_p4d);
+ if (err)
+ return err;
+ continue;
+ }
+
+ err = kvm_copy_pud_range(dst_mm, src_mm, dst_p4d, src_p4d,
+ addr, next, level);
+ if (err) {
+ pr_err("P4D: ERR PUD addr=%lx next=%lx\n", addr, next);
+ return err;
+ }
+
+ } while (dst_p4d++, src_p4d++, addr = next, addr < end);
+
+ return 0;
+}
+
+static int kvm_copy_pgd_range(struct mm_struct *dst_mm,
+ struct mm_struct *src_mm, unsigned long addr,
+ unsigned long end, enum page_table_level level)
+{
+ pgd_t *src_pgd, *dst_pgd;
+ unsigned long next;
+ int err;
+
+ dst_pgd = pgd_offset(dst_mm, addr);
+ src_pgd = pgd_offset(src_mm, addr);
+
+ do {
+ next = pgd_addr_end(addr, end);
+ if (level == PGT_LEVEL_PGD || pgd_none(*src_pgd)) {
+ pr_debug("PGD: %lx/%lx set[%lx] = %lx\n",
+ addr, next, (long)dst_pgd, pgd_val(*src_pgd));
+ err = kvm_set_pgd(dst_pgd, *src_pgd);
+ if (err)
+ return err;
+ continue;
+ }
+
+ err = kvm_copy_p4d_range(dst_mm, src_mm, dst_pgd, src_pgd,
+ addr, next, level);
+ if (err) {
+ pr_err("PGD: ERR P4D addr=%lx next=%lx\n", addr, next);
+ return err;
+ }
+
+ } while (dst_pgd++, src_pgd++, addr = next, addr < end);
+
+ return 0;
+}
+
+/*
+ * Copy page table entries from the current page table (i.e. from the
+ * kernel page table) to the KVM page table. The level parameter specifies
+ * the page table level (PGD, P4D, PUD, PMD, PTE) at which the copy should
+ * be done.
+ */
+static int kvm_copy_mapping(void *ptr, size_t size, enum page_table_level level)
+{
+ unsigned long addr = (unsigned long)ptr;
+ unsigned long end = addr + ((unsigned long)size);
+
+ BUG_ON(current->mm == &kvm_mm);
+ pr_debug("KERNMAP COPY addr=%px size=%lx\n", ptr, size);
+ return kvm_copy_pgd_range(&kvm_mm, current->mm, addr, end, level);
+}
+
+
+/*
+ * Copy page table PTE entries from the current page table to the KVM
+ * page table.
+ */
+int kvm_copy_ptes(void *ptr, unsigned long size)
+{
+ return kvm_copy_mapping(ptr, size, PGT_LEVEL_PTE);
+}
+EXPORT_SYMBOL(kvm_copy_ptes);
+
+
static int kvm_isolation_init_mm(void)
{
pgd_t *kvm_pgd;
diff --git a/arch/x86/kvm/isolation.h b/arch/x86/kvm/isolation.h
index aa5e979..e8c018a 100644
--- a/arch/x86/kvm/isolation.h
+++ b/arch/x86/kvm/isolation.h
@@ -16,5 +16,6 @@ static inline bool kvm_isolation(void)
extern void kvm_isolation_enter(void);
extern void kvm_isolation_exit(void);
extern void kvm_may_access_sensitive_data(struct kvm_vcpu *vcpu);
+extern int kvm_copy_ptes(void *ptr, unsigned long size);
#endif
--
1.7.1
From: Liran Alon <[email protected]>
Interrupt handlers will need this handler to switch from
the KVM address space back to the kernel address space
in their prologue.
Signed-off-by: Liran Alon <[email protected]>
Signed-off-by: Alexandre Chartre <[email protected]>
---
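Note (not part of the patch): the handler registered here is what the
interrupt-entry side (see the hardirq.h hunk earlier in this series)
ends up calling. The intended shape on an interrupt prologue looks
roughly like the sketch below; example_vector_interrupt() is a made-up
name, while smp_reschedule_interrupt() is a real caller converted by
this series:

__visible void __irq_entry example_vector_interrupt(struct pt_regs *regs)
{
	ack_APIC_irq();
	/*
	 * Leave the KVM isolated address space (and request an L1D
	 * flush) before touching data that may not be mapped there.
	 * The handler pointer stays a no-op until KVM registers
	 * kvm_isolation_exit() via kvm_set_isolation_exit_handler().
	 */
	kvm_cpu_may_access_sensitive_data();

	/* ... the rest of the handler runs with the full kernel mapping ... */
}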
arch/x86/include/asm/irq.h | 1 +
arch/x86/kernel/irq.c | 11 +++++++++++
arch/x86/kvm/isolation.c | 13 +++++++++++++
3 files changed, 25 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index 8f95686..eb32abc 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -29,6 +29,7 @@ static inline int irq_canonicalize(int irq)
extern __visible void smp_kvm_posted_intr_ipi(struct pt_regs *regs);
extern __visible void smp_kvm_posted_intr_wakeup_ipi(struct pt_regs *regs);
extern __visible void smp_kvm_posted_intr_nested_ipi(struct pt_regs *regs);
+extern void kvm_set_isolation_exit_handler(void (*handler)(void));
#endif
extern void (*x86_platform_ipi_callback)(void);
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 59b5f2e..e68483b 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -295,6 +295,17 @@ void kvm_set_posted_intr_wakeup_handler(void (*handler)(void))
}
EXPORT_SYMBOL_GPL(kvm_set_posted_intr_wakeup_handler);
+void (*kvm_isolation_exit_handler)(void) = dummy_handler;
+
+void kvm_set_isolation_exit_handler(void (*handler)(void))
+{
+ if (handler)
+ kvm_isolation_exit_handler = handler;
+ else
+ kvm_isolation_exit_handler = dummy_handler;
+}
+EXPORT_SYMBOL_GPL(kvm_set_isolation_exit_handler);
+
/*
* Handler for POSTED_INTERRUPT_VECTOR.
*/
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
index 35aa659..22ff9c2 100644
--- a/arch/x86/kvm/isolation.c
+++ b/arch/x86/kvm/isolation.c
@@ -5,6 +5,7 @@
* KVM Address Space Isolation
*/
+#include <linux/kvm_host.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/printk.h>
@@ -95,6 +96,16 @@ static void kvm_isolation_uninit_mm(void)
free_pages((unsigned long)kvm_pgd, PGD_ALLOCATION_ORDER);
}
+static void kvm_isolation_set_handlers(void)
+{
+ kvm_set_isolation_exit_handler(kvm_isolation_exit);
+}
+
+static void kvm_isolation_clear_handlers(void)
+{
+ kvm_set_isolation_exit_handler(NULL);
+}
+
int kvm_isolation_init(void)
{
int r;
@@ -106,6 +117,7 @@ int kvm_isolation_init(void)
if (r)
return r;
+ kvm_isolation_set_handlers();
pr_info("KVM: x86: Running with isolated address space\n");
return 0;
@@ -116,6 +128,7 @@ void kvm_isolation_uninit(void)
if (!address_space_isolation)
return;
+ kvm_isolation_clear_handlers();
kvm_isolation_uninit_mm();
pr_info("KVM: x86: End of isolated address space\n");
}
--
1.7.1
On Mon, May 13, 2019 at 04:38:32PM +0200, Alexandre Chartre wrote:
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 46df4c6..317e105 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -33,6 +33,10 @@
> #define CREATE_TRACE_POINTS
> #include <asm/trace/exceptions.h>
>
> +bool (*kvm_page_fault_handler)(struct pt_regs *regs, unsigned long error_code,
> + unsigned long address);
> +EXPORT_SYMBOL(kvm_page_fault_handler);
NAK NAK NAK NAK
This is one of the biggest anti-patterns around.
On Mon, May 13, 2019 at 7:40 AM Alexandre Chartre
<[email protected]> wrote:
>
> From: Liran Alon <[email protected]>
>
> KVM isolation enter/exit is done by switching between the KVM address
> space and the kernel address space.
>
> Signed-off-by: Liran Alon <[email protected]>
> Signed-off-by: Alexandre Chartre <[email protected]>
> ---
> arch/x86/kvm/isolation.c | 30 ++++++++++++++++++++++++------
> arch/x86/mm/tlb.c | 1 +
> include/linux/sched.h | 1 +
> 3 files changed, 26 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
> index db0a7ce..b0c789f 100644
> --- a/arch/x86/kvm/isolation.c
> +++ b/arch/x86/kvm/isolation.c
> @@ -1383,11 +1383,13 @@ static bool kvm_page_fault(struct pt_regs *regs, unsigned long error_code,
> printk(KERN_DEFAULT "KVM isolation: page fault %ld at %pS on %lx (%pS) while switching mm\n"
> " cr3=%lx\n"
> " kvm_mm=%px pgd=%px\n"
> - " active_mm=%px pgd=%px\n",
> + " active_mm=%px pgd=%px\n"
> + " kvm_prev_mm=%px pgd=%px\n",
> error_code, (void *)regs->ip, address, (void *)address,
> cr3,
> &kvm_mm, kvm_mm.pgd,
> - active_mm, active_mm->pgd);
> + active_mm, active_mm->pgd,
> + current->kvm_prev_mm, current->kvm_prev_mm->pgd);
> dump_stack();
>
> return false;
> @@ -1649,11 +1651,27 @@ void kvm_may_access_sensitive_data(struct kvm_vcpu *vcpu)
> kvm_isolation_exit();
> }
>
> +static void kvm_switch_mm(struct mm_struct *mm)
> +{
> + unsigned long flags;
> +
> + /*
> + * Disable interrupt before updating active_mm, otherwise if an
> + * interrupt occurs during the switch then the interrupt handler
> + * can be mislead about the mm effectively in use.
> + */
> + local_irq_save(flags);
> + current->kvm_prev_mm = current->active_mm;
Peter's NAK aside, why on Earth is this in task_struct? You cannot
possibly context switch while in isolation mode.
--Andy
pcpu_base_addr is already mapped into the KVM address space, but it
only covers the first percpu chunk. To access a per-cpu buffer not
allocated in the first chunk, add a function which maps all the CPU
buffers corresponding to that per-cpu buffer.
Also add a function to clear the page table entries for a per-cpu buffer.
Signed-off-by: Alexandre Chartre <[email protected]>
---
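Note (usage sketch, not part of the patch): the two helpers are meant
to be used in pairs, mirroring how the series maps kvm->srcu.sda; the
wrapper names below are only for illustration:

static int example_map_srcu(struct kvm *kvm)
{
	/* map the per-cpu srcu_data of every possible CPU */
	return kvm_copy_percpu_mapping(kvm->srcu.sda,
				       sizeof(struct srcu_data));
}

static void example_unmap_srcu(struct kvm *kvm)
{
	kvm_clear_percpu_mapping(kvm->srcu.sda);
}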
arch/x86/kvm/isolation.c | 34 ++++++++++++++++++++++++++++++++++
arch/x86/kvm/isolation.h | 2 ++
2 files changed, 36 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
index 539e287..2052abf 100644
--- a/arch/x86/kvm/isolation.c
+++ b/arch/x86/kvm/isolation.c
@@ -990,6 +990,40 @@ void kvm_clear_range_mapping(void *ptr)
EXPORT_SYMBOL(kvm_clear_range_mapping);
+void kvm_clear_percpu_mapping(void *percpu_ptr)
+{
+ void *ptr;
+ int cpu;
+
+ pr_debug("PERCPU CLEAR percpu=%px\n", percpu_ptr);
+ for_each_possible_cpu(cpu) {
+ ptr = per_cpu_ptr(percpu_ptr, cpu);
+ kvm_clear_range_mapping(ptr);
+ }
+}
+EXPORT_SYMBOL(kvm_clear_percpu_mapping);
+
+int kvm_copy_percpu_mapping(void *percpu_ptr, size_t size)
+{
+ void *ptr;
+ int cpu, err;
+
+ pr_debug("PERCPU COPY percpu=%px size=%lx\n", percpu_ptr, size);
+ for_each_possible_cpu(cpu) {
+ ptr = per_cpu_ptr(percpu_ptr, cpu);
+ pr_debug("PERCPU COPY cpu%d addr=%px\n", cpu, ptr);
+ err = kvm_copy_ptes(ptr, size);
+ if (err) {
+ kvm_clear_percpu_mapping(percpu_ptr);
+ return err;
+ }
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(kvm_copy_percpu_mapping);
+
+
static int kvm_isolation_init_mm(void)
{
pgd_t *kvm_pgd;
diff --git a/arch/x86/kvm/isolation.h b/arch/x86/kvm/isolation.h
index 7d3c985..3ef2060 100644
--- a/arch/x86/kvm/isolation.h
+++ b/arch/x86/kvm/isolation.h
@@ -18,5 +18,7 @@ static inline bool kvm_isolation(void)
extern void kvm_may_access_sensitive_data(struct kvm_vcpu *vcpu);
extern int kvm_copy_ptes(void *ptr, unsigned long size);
extern void kvm_clear_range_mapping(void *ptr);
+extern int kvm_copy_percpu_mapping(void *percpu_ptr, size_t size);
+extern void kvm_clear_percpu_mapping(void *percpu_ptr);
#endif
--
1.7.1
In addition to the core memory mappings, the KVM page table has to be
initialized with VMX-specific data.
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 19 +++++++++++++++++++
1 files changed, 19 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0c955bb..f181b3c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -63,6 +63,7 @@
#include "vmcs12.h"
#include "vmx.h"
#include "x86.h"
+#include "isolation.h"
MODULE_AUTHOR("Qumranet");
MODULE_LICENSE("GPL");
@@ -7830,6 +7831,24 @@ static int __init vmx_init(void)
}
}
+ if (kvm_isolation()) {
+ pr_debug("mapping vmx init");
+ /* copy mapping of the current module (kvm_intel) */
+ r = kvm_copy_module_mapping();
+ if (r) {
+ vmx_exit();
+ return r;
+ }
+ if (vmx_l1d_flush_pages) {
+ r = kvm_copy_ptes(vmx_l1d_flush_pages,
+ PAGE_SIZE << L1D_CACHE_ORDER);
+ if (r) {
+ vmx_exit();
+ return r;
+ }
+ }
+ }
+
#ifdef CONFIG_KEXEC_CORE
rcu_assign_pointer(crash_vmclear_loaded_vmcss,
crash_vmclear_local_loaded_vmcss);
--
1.7.1
From: Liran Alon <[email protected]>
Create a separate mm for KVM that will be active when KVM #VMExit
handlers run, up until the point at which we architecturally need to
access host (or other VM) sensitive data.
This patch just creates kvm_mm but does not make it active yet.
This will be done by the next commits.
Signed-off-by: Liran Alon <[email protected]>
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kvm/isolation.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/isolation.h | 8 ++++
arch/x86/kvm/x86.c | 10 ++++-
3 files changed, 112 insertions(+), 1 deletions(-)
create mode 100644 arch/x86/kvm/isolation.h
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
index e25f663..74bc0cd 100644
--- a/arch/x86/kvm/isolation.c
+++ b/arch/x86/kvm/isolation.c
@@ -7,6 +7,21 @@
#include <linux/module.h>
#include <linux/moduleparam.h>
+#include <linux/printk.h>
+
+#include <asm/mmu_context.h>
+#include <asm/pgalloc.h>
+
+#include "isolation.h"
+
+struct mm_struct kvm_mm = {
+ .mm_rb = RB_ROOT,
+ .mm_users = ATOMIC_INIT(2),
+ .mm_count = ATOMIC_INIT(1),
+ .mmap_sem = __RWSEM_INITIALIZER(kvm_mm.mmap_sem),
+ .page_table_lock = __SPIN_LOCK_UNLOCKED(kvm_mm.page_table_lock),
+ .mmlist = LIST_HEAD_INIT(kvm_mm.mmlist),
+};
/*
* When set to true, KVM #VMExit handlers run in isolated address space
@@ -24,3 +39,83 @@
*/
static bool __read_mostly address_space_isolation;
module_param(address_space_isolation, bool, 0444);
+
+static int kvm_isolation_init_mm(void)
+{
+ pgd_t *kvm_pgd;
+ gfp_t gfp_mask;
+
+ gfp_mask = GFP_KERNEL | __GFP_ZERO;
+ kvm_pgd = (pgd_t *)__get_free_pages(gfp_mask, PGD_ALLOCATION_ORDER);
+ if (!kvm_pgd)
+ return -ENOMEM;
+
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+ /*
+ * With PTI, we have two PGDs: one the kernel page table, and one
+ * for the user page table. The PGD with the kernel page table has
+ * to be the entire kernel address space because paranoid faults
+ * will unconditionally use it. So we define the KVM address space
+ * in the user table space, although it will be used in the kernel.
+ */
+
+ /* initialize the kernel page table */
+ memcpy(kvm_pgd, current->active_mm->pgd, sizeof(pgd_t) * PTRS_PER_PGD);
+
+ /* define kvm_mm with the user page table */
+ kvm_mm.pgd = kernel_to_user_pgdp(kvm_pgd);
+#else /* CONFIG_PAGE_TABLE_ISOLATION */
+ kvm_mm.pgd = kvm_pgd;
+#endif /* CONFIG_PAGE_TABLE_ISOLATION */
+ mm_init_cpumask(&kvm_mm);
+ init_new_context(NULL, &kvm_mm);
+
+ return 0;
+}
+
+static void kvm_isolation_uninit_mm(void)
+{
+ pgd_t *kvm_pgd;
+
+ BUG_ON(current->active_mm == &kvm_mm);
+
+ destroy_context(&kvm_mm);
+
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+ /*
+ * With PTI, the KVM address space is defined in the user
+ * page table space, but the full PGD starts with the kernel
+ * page table space.
+ */
+ kvm_pgd = user_to_kernel_pgdp(kvm_mm.pgd);
+#else /* CONFIG_PAGE_TABLE_ISOLATION */
+ kvm_pgd = kvm_mm.pgd;
+#endif /* CONFIG_PAGE_TABLE_ISOLATION */
+ kvm_mm.pgd = NULL;
+ free_pages((unsigned long)kvm_pgd, PGD_ALLOCATION_ORDER);
+}
+
+int kvm_isolation_init(void)
+{
+ int r;
+
+ if (!address_space_isolation)
+ return 0;
+
+ r = kvm_isolation_init_mm();
+ if (r)
+ return r;
+
+ pr_info("KVM: x86: Running with isolated address space\n");
+
+ return 0;
+}
+
+void kvm_isolation_uninit(void)
+{
+ if (!address_space_isolation)
+ return;
+
+ kvm_isolation_uninit_mm();
+ pr_info("KVM: x86: End of isolated address space\n");
+}
diff --git a/arch/x86/kvm/isolation.h b/arch/x86/kvm/isolation.h
new file mode 100644
index 0000000..cf8c7d4
--- /dev/null
+++ b/arch/x86/kvm/isolation.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef ARCH_X86_KVM_ISOLATION_H
+#define ARCH_X86_KVM_ISOLATION_H
+
+extern int kvm_isolation_init(void);
+extern void kvm_isolation_uninit(void);
+
+#endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b5edc8e..4b7cec2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -29,6 +29,7 @@
#include "cpuid.h"
#include "pmu.h"
#include "hyperv.h"
+#include "isolation.h"
#include <linux/clocksource.h>
#include <linux/interrupt.h>
@@ -6972,10 +6973,14 @@ int kvm_arch_init(void *opaque)
goto out_free_x86_fpu_cache;
}
- r = kvm_mmu_module_init();
+ r = kvm_isolation_init();
if (r)
goto out_free_percpu;
+ r = kvm_mmu_module_init();
+ if (r)
+ goto out_uninit_isolation;
+
kvm_set_mmio_spte_mask();
kvm_x86_ops = ops;
@@ -7000,6 +7005,8 @@ int kvm_arch_init(void *opaque)
return 0;
+out_uninit_isolation:
+ kvm_isolation_uninit();
out_free_percpu:
free_percpu(shared_msrs);
out_free_x86_fpu_cache:
@@ -7024,6 +7031,7 @@ void kvm_arch_exit(void)
#ifdef CONFIG_X86_64
pvclock_gtod_unregister_notifier(&pvclock_gtod_notifier);
#endif
+ kvm_isolation_uninit();
kvm_x86_ops = NULL;
kvm_mmu_module_exit();
free_percpu(shared_msrs);
--
1.7.1
From: Liran Alon <[email protected]>
Add the address_space_isolation parameter to the kvm module.
When set to true, KVM #VMExit handlers run in an isolated address space
which maps only KVM-required code and per-VM information instead of the
entire kernel address space.
This mechanism is meant to mitigate memory-leak side-channel CPU
vulnerabilities (e.g. Spectre, L1TF, etc.) but can also be viewed
as defence in depth, as it also helps generically against info-leak
vulnerabilities in KVM #VMExit handlers and reduces the available
gadgets for ROP attacks.
This is set to false by default because it incurs a performance hit
which some users will not want to take for security gain.
Signed-off-by: Liran Alon <[email protected]>
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/isolation.c | 26 ++++++++++++++++++++++++++
2 files changed, 27 insertions(+), 1 deletions(-)
create mode 100644 arch/x86/kvm/isolation.c
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 31ecf7a..9f404e9 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -10,7 +10,7 @@ kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
kvm-y += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
- hyperv.o page_track.o debugfs.o
+ hyperv.o page_track.o debugfs.o isolation.o
kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o vmx/evmcs.o vmx/nested.o
kvm-amd-y += svm.o pmu_amd.o
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
new file mode 100644
index 0000000..e25f663
--- /dev/null
+++ b/arch/x86/kvm/isolation.c
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ *
+ * KVM Address Space Isolation
+ */
+
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+
+/*
+ * When set to true, KVM #VMExit handlers run in an isolated address
+ * space which maps only KVM-required code and per-VM information
+ * instead of the entire kernel address space.
+ *
+ * This mechanism is meant to mitigate memory-leak side-channel CPU
+ * vulnerabilities (e.g. Spectre, L1TF, etc.) but can also be viewed
+ * as defence in depth, as it also helps generically against info-leak
+ * vulnerabilities in KVM #VMExit handlers and reduces the available
+ * gadgets for ROP attacks.
+ *
+ * This is set to false by default because it incurs a performance hit
+ * which some users will not want to take for security gain.
+ */
+static bool __read_mostly address_space_isolation;
+module_param(address_space_isolation, bool, 0444);
--
1.7.1
From: Liran Alon <[email protected]>
Switch to the KVM address space on entry to the guest and switch out
immediately at exit (before enabling host interrupts).
For now, this does not effectively switch anything: we just remain on
the kernel address space. In addition, we switch back as soon as we
exit the guest, which means KVM #VMExit handlers still run with the
full host address space.
However, this introduces the entry points and places for switching.
The next commits will change the switch to happen only when necessary.
Signed-off-by: Liran Alon <[email protected]>
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kvm/isolation.c | 20 ++++++++++++++++++++
arch/x86/kvm/isolation.h | 2 ++
arch/x86/kvm/x86.c | 8 ++++++++
3 files changed, 30 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
index 74bc0cd..35aa659 100644
--- a/arch/x86/kvm/isolation.c
+++ b/arch/x86/kvm/isolation.c
@@ -119,3 +119,23 @@ void kvm_isolation_uninit(void)
kvm_isolation_uninit_mm();
pr_info("KVM: x86: End of isolated address space\n");
}
+
+void kvm_isolation_enter(void)
+{
+ if (address_space_isolation) {
+ /*
+ * Switches to kvm_mm should happen from vCPU thread,
+ * which should not be a kernel thread with no mm
+ */
+ BUG_ON(current->active_mm == NULL);
+ /* TODO: switch to kvm_mm */
+ }
+}
+
+void kvm_isolation_exit(void)
+{
+ if (address_space_isolation) {
+ /* TODO: Kick sibling hyperthread before switch to host mm */
+ /* TODO: switch back to original mm */
+ }
+}
diff --git a/arch/x86/kvm/isolation.h b/arch/x86/kvm/isolation.h
index cf8c7d4..595f62c 100644
--- a/arch/x86/kvm/isolation.h
+++ b/arch/x86/kvm/isolation.h
@@ -4,5 +4,7 @@
extern int kvm_isolation_init(void);
extern void kvm_isolation_uninit(void);
+extern void kvm_isolation_enter(void);
+extern void kvm_isolation_exit(void);
#endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4b7cec2..85700e0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7896,6 +7896,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
goto cancel_injection;
}
+ kvm_isolation_enter();
+
if (req_immediate_exit) {
kvm_make_request(KVM_REQ_EVENT, vcpu);
kvm_x86_ops->request_immediate_exit(vcpu);
@@ -7946,6 +7948,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
vcpu->arch.last_guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+ /*
+ * TODO: Move this to where we architectually need to access
+ * host (or other VM) sensitive data
+ */
+ kvm_isolation_exit();
+
vcpu->mode = OUTSIDE_GUEST_MODE;
smp_wmb();
--
1.7.1
From: Liran Alon <[email protected]>
Before this patch, we exited from the KVM isolated address space to
the host address space as soon as we exited the guest.
Change the code such that most KVM #VMExit handlers run in the KVM
isolated address space and switch back to the host address space
only before accessing sensitive data. Sensitive data is defined
as either host data or other VM data.
Currently, we switch from kvm_mm to host_mm in the following scenarios:
1) When handling guest page-faults:
As this will access SPTs which contain host PFNs.
2) On schedule-out of a vCPU thread
3) On write to guest virtual memory
(kvm_write_guest_virt_system() can pull in tons of pages)
4) On return to userspace (e.g. QEMU)
5) On the prologue of IRQ handlers
Signed-off-by: Liran Alon <[email protected]>
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kvm/isolation.c | 7 ++++++-
arch/x86/kvm/isolation.h | 3 +++
arch/x86/kvm/mmu.c | 3 ++-
arch/x86/kvm/x86.c | 12 +++++-------
4 files changed, 16 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
index 22ff9c2..eeb60c4 100644
--- a/arch/x86/kvm/isolation.c
+++ b/arch/x86/kvm/isolation.c
@@ -5,7 +5,6 @@
* KVM Address Space Isolation
*/
-#include <linux/kvm_host.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/printk.h>
@@ -133,6 +132,12 @@ void kvm_isolation_uninit(void)
pr_info("KVM: x86: End of isolated address space\n");
}
+void kvm_may_access_sensitive_data(struct kvm_vcpu *vcpu)
+{
+ vcpu->arch.l1tf_flush_l1d = true;
+ kvm_isolation_exit();
+}
+
void kvm_isolation_enter(void)
{
if (address_space_isolation) {
diff --git a/arch/x86/kvm/isolation.h b/arch/x86/kvm/isolation.h
index 595f62c..1290d32 100644
--- a/arch/x86/kvm/isolation.h
+++ b/arch/x86/kvm/isolation.h
@@ -2,9 +2,12 @@
#ifndef ARCH_X86_KVM_ISOLATION_H
#define ARCH_X86_KVM_ISOLATION_H
+#include <linux/kvm_host.h>
+
extern int kvm_isolation_init(void);
extern void kvm_isolation_uninit(void);
extern void kvm_isolation_enter(void);
extern void kvm_isolation_exit(void);
+extern void kvm_may_access_sensitive_data(struct kvm_vcpu *vcpu);
#endif
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d9c7b45..a2b38de 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -23,6 +23,7 @@
#include "x86.h"
#include "kvm_cache_regs.h"
#include "cpuid.h"
+#include "isolation.h"
#include <linux/kvm_host.h>
#include <linux/types.h>
@@ -4059,7 +4060,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
{
int r = 1;
- vcpu->arch.l1tf_flush_l1d = true;
+ kvm_may_access_sensitive_data(vcpu);
switch (vcpu->arch.apf.host_apf_reason) {
default:
trace_kvm_page_fault(fault_address, error_code);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 85700e0..1db72c3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3307,6 +3307,8 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
* guest. do_debug expects dr6 to be cleared after it runs, do the same.
*/
set_debugreg(0, 6);
+
+ kvm_may_access_sensitive_data(vcpu);
}
static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu,
@@ -5220,7 +5222,7 @@ int kvm_write_guest_virt_system(struct kvm_vcpu *vcpu, gva_t addr, void *val,
unsigned int bytes, struct x86_exception *exception)
{
/* kvm_write_guest_virt_system can pull in tons of pages. */
- vcpu->arch.l1tf_flush_l1d = true;
+ kvm_may_access_sensitive_data(vcpu);
return kvm_write_guest_virt_helper(addr, val, bytes, vcpu,
PFERR_WRITE_MASK, exception);
@@ -7948,12 +7950,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
vcpu->arch.last_guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
- /*
- * TODO: Move this to where we architectually need to access
- * host (or other VM) sensitive data
- */
- kvm_isolation_exit();
-
vcpu->mode = OUTSIDE_GUEST_MODE;
smp_wmb();
@@ -8086,6 +8082,8 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
+ kvm_may_access_sensitive_data(vcpu);
+
return r;
}
--
1.7.1
Tasks which are going to run with the KVM address space have to be
mapped with their core data (stack, mm, pgd...) so that they can (at
least) switch back to the kernel address space.
For now, assume that these tasks are the ones running a vCPU, and that
there is a 1:1 mapping between a task and a vCPU. This should eventually
be improved to be independent of any task/vCPU mapping.
Also check that the task effectively entering the KVM address space
is mapped.
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kvm/isolation.c | 182 ++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/isolation.h | 2 +
arch/x86/kvm/vmx/vmx.c | 8 ++
include/linux/sched.h | 5 +
4 files changed, 197 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kvm/isolation.c b/arch/x86/kvm/isolation.c
index d3ac014..e7979b3 100644
--- a/arch/x86/kvm/isolation.c
+++ b/arch/x86/kvm/isolation.c
@@ -64,6 +64,20 @@ struct pgt_directory_group {
((typeof(entry))(((unsigned long)(entry)) & PAGE_MASK))
/*
+ * Variables to keep track of tasks mapped into the KVM address space.
+ */
+struct kvm_task_mapping {
+ struct list_head list;
+ struct task_struct *task;
+ void *stack;
+ struct mm_struct *mm;
+ pgd_t *pgd;
+};
+
+static LIST_HEAD(kvm_task_mapping_list);
+static DEFINE_MUTEX(kvm_task_mapping_lock);
+
+/*
* Variables to keep track of address ranges mapped into the KVM
* address space.
*/
@@ -1027,6 +1041,160 @@ int kvm_copy_percpu_mapping(void *percpu_ptr, size_t size)
}
EXPORT_SYMBOL(kvm_copy_percpu_mapping);
+static void kvm_clear_task_mapping(struct kvm_task_mapping *task_mapping)
+{
+ if (task_mapping->task) {
+ kvm_clear_range_mapping(task_mapping->task);
+ task_mapping->task = NULL;
+ }
+ if (task_mapping->stack) {
+ kvm_clear_range_mapping(task_mapping->stack);
+ task_mapping->stack = NULL;
+ }
+ if (task_mapping->mm) {
+ kvm_clear_range_mapping(task_mapping->mm);
+ task_mapping->mm = NULL;
+ }
+ if (task_mapping->pgd) {
+ kvm_clear_range_mapping(task_mapping->pgd);
+ task_mapping->pgd = NULL;
+ }
+}
+
+static int kvm_copy_task_mapping(struct task_struct *tsk,
+ struct kvm_task_mapping *task_mapping)
+{
+ int err;
+
+ err = kvm_copy_ptes(tsk, sizeof(struct task_struct));
+ if (err)
+ goto out_clear_task_mapping;
+ task_mapping->task = tsk;
+
+ err = kvm_copy_ptes(tsk->stack, THREAD_SIZE);
+ if (err)
+ goto out_clear_task_mapping;
+ task_mapping->stack = tsk->stack;
+
+ err = kvm_copy_ptes(tsk->active_mm, sizeof(struct mm_struct));
+ if (err)
+ goto out_clear_task_mapping;
+ task_mapping->mm = tsk->active_mm;
+
+ err = kvm_copy_ptes(tsk->active_mm->pgd,
+ PAGE_SIZE << PGD_ALLOCATION_ORDER);
+ if (err)
+ goto out_clear_task_mapping;
+ task_mapping->pgd = tsk->active_mm->pgd;
+
+ return 0;
+
+out_clear_task_mapping:
+ kvm_clear_task_mapping(task_mapping);
+ return err;
+}
+
+int kvm_add_task_mapping(struct task_struct *tsk)
+{
+ struct kvm_task_mapping *task_mapping;
+ int err;
+
+ mutex_lock(&kvm_task_mapping_lock);
+
+ if (tsk->kvm_mapped) {
+ mutex_unlock(&kvm_task_mapping_lock);
+ return 0;
+ }
+
+ task_mapping = kzalloc(sizeof(struct kvm_task_mapping), GFP_KERNEL);
+ if (!task_mapping) {
+ mutex_unlock(&kvm_task_mapping_lock);
+ return -ENOMEM;
+ }
+ INIT_LIST_HEAD(&task_mapping->list);
+
+ /*
+ * Ensure that the task and its stack are mapped into the KVM
+ * address space. Also map the task mm to be able to switch back
+ * to the original mm, and its PGD directory.
+ */
+ pr_debug("mapping task %px\n", tsk);
+ err = kvm_copy_task_mapping(tsk, task_mapping);
+ if (err) {
+ kfree(task_mapping);
+ mutex_unlock(&kvm_task_mapping_lock);
+ return err;
+ }
+
+ get_task_struct(tsk);
+ list_add(&task_mapping->list, &kvm_task_mapping_list);
+ tsk->kvm_mapped = true;
+
+ mutex_unlock(&kvm_task_mapping_lock);
+
+ return 0;
+}
+EXPORT_SYMBOL(kvm_add_task_mapping);
+
+static struct kvm_task_mapping *kvm_find_task_mapping(struct task_struct *tsk)
+{
+ struct kvm_task_mapping *task_mapping;
+
+ list_for_each_entry(task_mapping, &kvm_task_mapping_list, list) {
+ if (task_mapping->task == tsk)
+ return task_mapping;
+ }
+ return NULL;
+}
+
+void kvm_cleanup_task_mapping(struct task_struct *tsk)
+{
+ struct kvm_task_mapping *task_mapping;
+
+ if (!tsk->kvm_mapped)
+ return;
+
+ mutex_lock(&kvm_task_mapping_lock);
+
+ task_mapping = kvm_find_task_mapping(tsk);
+ if (!task_mapping) {
+ pr_debug("KVM isolation: mapping not found for mapped task %px\n",
+ tsk);
+ tsk->kvm_mapped = false;
+ mutex_unlock(&kvm_task_mapping_lock);
+ return;
+ }
+
+ pr_debug("unmapping task %px\n", tsk);
+
+ list_del(&task_mapping->list);
+ kvm_clear_task_mapping(task_mapping);
+ kfree(task_mapping);
+ tsk->kvm_mapped = false;
+ put_task_struct(tsk);
+ mutex_unlock(&kvm_task_mapping_lock);
+}
+EXPORT_SYMBOL(kvm_cleanup_task_mapping);
+
+/*
+ * Mark all tasks which have been mapped into the KVM address space
+ * as not mapped. This only clears the mapping attribute in the task
+ * structure, but page table mappings remain in the KVM page table.
+ * They will be effectively removed when deleting the KVM page table.
+ */
+static void kvm_reset_all_task_mapping(void)
+{
+ struct kvm_task_mapping *task_mapping;
+ struct task_struct *tsk;
+
+ mutex_lock(&kvm_task_mapping_lock);
+ list_for_each_entry(task_mapping, &kvm_task_mapping_list, list) {
+ tsk = task_mapping->task;
+ pr_debug("clear mapping for task %px\n", tsk);
+ tsk->kvm_mapped = false;
+ put_task_struct(tsk);
+ }
+ mutex_unlock(&kvm_task_mapping_lock);
+}
+
static int kvm_isolation_init_page_table(void)
{
@@ -1195,6 +1363,7 @@ static void kvm_isolation_uninit_mm(void)
destroy_context(&kvm_mm);
+ kvm_reset_all_task_mapping();
kvm_isolation_uninit_page_table();
kvm_free_all_range_mapping();
@@ -1227,6 +1396,8 @@ int kvm_isolation_init_vm(struct kvm *kvm)
if (!kvm_isolation())
return 0;
+ pr_debug("mapping kvm srcu sda\n");
+
return (kvm_copy_percpu_mapping(kvm->srcu.sda,
sizeof(struct srcu_data)));
}
@@ -1236,6 +1407,8 @@ void kvm_isolation_destroy_vm(struct kvm *kvm)
if (!kvm_isolation())
return;
+ pr_debug("unmapping kvm srcu sda\n");
+
kvm_clear_percpu_mapping(kvm->srcu.sda);
}
@@ -1276,12 +1449,21 @@ void kvm_may_access_sensitive_data(struct kvm_vcpu *vcpu)
void kvm_isolation_enter(void)
{
+ int err;
+
if (kvm_isolation()) {
/*
* Switches to kvm_mm should happen from vCPU thread,
* which should not be a kernel thread with no mm
*/
BUG_ON(current->active_mm == NULL);
+
+ err = kvm_add_task_mapping(current);
+ if (err) {
+ pr_err("KVM isolation cancelled (failed to map task %px)",
+ current);
+ return;
+ }
/* TODO: switch to kvm_mm */
}
}
diff --git a/arch/x86/kvm/isolation.h b/arch/x86/kvm/isolation.h
index 33e9a87..2d7d016 100644
--- a/arch/x86/kvm/isolation.h
+++ b/arch/x86/kvm/isolation.h
@@ -32,5 +32,7 @@ static inline bool kvm_isolation(void)
extern void kvm_clear_range_mapping(void *ptr);
extern int kvm_copy_percpu_mapping(void *percpu_ptr, size_t size);
extern void kvm_clear_percpu_mapping(void *percpu_ptr);
+extern int kvm_add_task_mapping(struct task_struct *tsk);
+extern void kvm_cleanup_task_mapping(struct task_struct *tsk);
#endif
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index cbbaf58..9ed31c2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6576,6 +6576,9 @@ static void vmx_unmap_vcpu(struct vcpu_vmx *vmx)
kvm_clear_range_mapping(vmx->vmcs01.msr_bitmap);
kvm_clear_range_mapping(vmx->vcpu.arch.pio_data);
kvm_clear_range_mapping(vmx->vcpu.arch.apic);
+
+ /* XXX assume there's a 1:1 mapping between a task and a vcpu */
+ kvm_cleanup_task_mapping(current);
}
static int vmx_map_vcpu(struct vcpu_vmx *vmx)
@@ -6614,6 +6617,11 @@ static int vmx_map_vcpu(struct vcpu_vmx *vmx)
if (rv)
goto out_unmap_vcpu;
+ /* XXX assume there's a 1:1 mapping between a task and a vcpu */
+ rv = kvm_add_task_mapping(current);
+ if (rv)
+ goto out_unmap_vcpu;
+
return 0;
out_unmap_vcpu:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 50606a6..80e1d75 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1199,6 +1199,11 @@ struct task_struct {
unsigned long prev_lowest_stack;
#endif
+#ifdef CONFIG_HAVE_KVM
+ /* Is the task mapped into the KVM address space? */
+ bool kvm_mapped;
+#endif
+
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
--
1.7.1
From: Liran Alon <[email protected]>
Export the symbols needed to create, manage, populate and switch
an mm from a kernel module (kvm in this case).
This is a hacky way to get started; it should be replaced with a
suitable memory-management API.
Signed-off-by: Liran Alon <[email protected]>
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kernel/ldt.c | 1 +
arch/x86/mm/tlb.c | 3 ++-
mm/memory.c | 5 +++++
3 files changed, 8 insertions(+), 1 deletions(-)
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index b2463fc..19a86e0 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -401,6 +401,7 @@ void destroy_context_ldt(struct mm_struct *mm)
free_ldt_struct(mm->context.ldt);
mm->context.ldt = NULL;
}
+EXPORT_SYMBOL_GPL(destroy_context_ldt);
void ldt_arch_exit_mmap(struct mm_struct *mm)
{
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 7f61431..a4db7f5 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -70,7 +70,7 @@ static void clear_asid_other(void)
}
atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
-
+EXPORT_SYMBOL_GPL(last_mm_ctx_id);
static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
u16 *new_asid, bool *need_flush)
@@ -159,6 +159,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
switch_mm_irqs_off(prev, next, tsk);
local_irq_restore(flags);
}
+EXPORT_SYMBOL_GPL(switch_mm);
static void sync_current_stack_to_mm(struct mm_struct *mm)
{
diff --git a/mm/memory.c b/mm/memory.c
index 36aac68..ede9335 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -434,6 +434,7 @@ int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
pte_free(mm, new);
return 0;
}
+EXPORT_SYMBOL_GPL(__pte_alloc);
int __pte_alloc_kernel(pmd_t *pmd)
{
@@ -453,6 +454,7 @@ int __pte_alloc_kernel(pmd_t *pmd)
pte_free_kernel(&init_mm, new);
return 0;
}
+EXPORT_SYMBOL_GPL(__pte_alloc_kernel);
static inline void init_rss_vec(int *rss)
{
@@ -4007,6 +4009,7 @@ int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
spin_unlock(&mm->page_table_lock);
return 0;
}
+EXPORT_SYMBOL_GPL(__p4d_alloc);
#endif /* __PAGETABLE_P4D_FOLDED */
#ifndef __PAGETABLE_PUD_FOLDED
@@ -4039,6 +4042,7 @@ int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address)
spin_unlock(&mm->page_table_lock);
return 0;
}
+EXPORT_SYMBOL_GPL(__pud_alloc);
#endif /* __PAGETABLE_PUD_FOLDED */
#ifndef __PAGETABLE_PMD_FOLDED
@@ -4072,6 +4076,7 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
spin_unlock(ptl);
return 0;
}
+EXPORT_SYMBOL_GPL(__pmd_alloc);
#endif /* __PAGETABLE_PMD_FOLDED */
static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address,
--
1.7.1
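For context, a rough sketch of how a module such as kvm can use the exported
switch_mm() to enter and leave a separate mm (the wrapper name is
illustrative, it is not the one used by this series):

#include <linux/sched.h>
#include <asm/mmu_context.h>

/*
 * Switch the current vCPU thread between the host mm and the isolated
 * kvm_mm. switch_mm() takes care of loading CR3 and, when PCID is
 * enabled, of the ASID bookkeeping.
 */
static void kvm_mm_switch(struct mm_struct *from, struct mm_struct *to)
{
	switch_mm(from, to, current);
}

The page-table allocation helpers exported by this patch (__pte_alloc(),
__pmd_alloc(), __pud_alloc(), __p4d_alloc()) are what allow the module to
populate the kvm_mm page table in the first place.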
On Mon, May 13, 2019 at 7:39 AM Alexandre Chartre
<[email protected]> wrote:
>
> From: Liran Alon <[email protected]>
>
> Interrupt handlers will need this handler to switch from
> the KVM address space back to the kernel address space
> in their prologue.
This patch doesn't appear to do anything at all. What am I missing?
On Mon, May 13, 2019 at 7:39 AM Alexandre Chartre
<[email protected]> wrote:
>
> From: Liran Alon <[email protected]>
>
> Next commits will change most of KVM #VMExit handlers to run
> in KVM isolated address space. Any interrupt handler raised
> during execution in KVM address space needs to switch back
> to host address space.
>
> This patch makes sure that IRQ handlers will run in full
> host address space instead of KVM isolated address space.
IMO this needs to be somewhere a lot more central. What about NMI and
MCE? Or async page faults? Or any other entry?
--Andy
> + /*
> + * Copy the mapping for all the kernel text. We copy at the PMD
> + * level since the PUD is shared with the module mapping space.
> + */
> + rv = kvm_copy_mapping((void *)__START_KERNEL_map, KERNEL_IMAGE_SIZE,
> + PGT_LEVEL_PMD);
> + if (rv)
> + goto out_uninit_page_table;
Could you double-check this? We (I) have had some repeated confusion
with the PTI code and kernel text vs. kernel data vs. __init.
KERNEL_IMAGE_SIZE looks to be 512MB which is quite a bit bigger than
kernel text.
> + /*
> + * Copy the mapping for cpu_entry_area and %esp fixup stacks
> + * (this is based on the PTI userland address space, but probably
> + * not needed because the KVM address space is not directly
> + * enterered from userspace). They can both be copied at the P4D
> + * level since they each have a dedicated P4D entry.
> + */
> + rv = kvm_copy_mapping((void *)CPU_ENTRY_AREA_PER_CPU, P4D_SIZE,
> + PGT_LEVEL_P4D);
> + if (rv)
> + goto out_uninit_page_table;
cpu_entry_area is used for more than just entry from userspace. The gdt
mapping, for instance, is needed everywhere. You might want to go look
at 'struct cpu_entry_area' in some more detail.
> +#ifdef CONFIG_X86_ESPFIX64
> + rv = kvm_copy_mapping((void *)ESPFIX_BASE_ADDR, P4D_SIZE,
> + PGT_LEVEL_P4D);
> + if (rv)
> + goto out_uninit_page_table;
> +#endif
Why are these mappings *needed*? I thought we only actually used these
fixup stacks for some crazy iret-to-userspace handling. We're certainly
not doing that from KVM context.
Am I forgetting something?
> +#ifdef CONFIG_VMAP_STACK
> + /*
> + * Interrupt stacks are vmap'ed with guard pages, so we need to
> + * copy mappings.
> + */
> + for_each_possible_cpu(cpu) {
> + stack = per_cpu(hardirq_stack_ptr, cpu);
> + pr_debug("IRQ Stack %px\n", stack);
> + if (!stack)
> + continue;
> + rv = kvm_copy_ptes(stack - IRQ_STACK_SIZE, IRQ_STACK_SIZE);
> + if (rv)
> + goto out_uninit_page_table;
> + }
> +
> +#endif
I seem to remember that the KVM VMENTRY/VMEXIT context is very special.
Interrupts (and even NMIs?) are disabled. Would it be feasible to do
the switching in there so that we never even *get* interrupts in the KVM
context?
I also share Peter's concerns about letting modules do this. If we ever
go down this road, we're going to have to think very carefully how we
let KVM do this without giving all the not-so-nice out-of-tree modules
the keys to the castle.
A high-level comment: it looks like this is "working", but has probably
erred on the side of mapping too much. The hard part is paring this
back to a truly minimal set of mappings.
On 5/13/19 6:02 PM, Andy Lutomirski wrote:
> On Mon, May 13, 2019 at 7:39 AM Alexandre Chartre
> <[email protected]> wrote:
>>
>> The KVM page fault handler handles page fault occurring while using
>> the KVM address space by switching to the kernel address space and
>> retrying the access (except if the fault occurs while switching
>> to the kernel address space). Processing of page faults occurring
>> while using the kernel address space is unchanged.
>>
>> Page fault log is cleared when creating a vm so that page fault
>> information doesn't persist when qemu is stopped and restarted.
>
> Are you saying that a page fault will just exit isolation? This
> completely defeats most of the security, right? Sure, it still helps
> with side channels, but not with actual software bugs.
>
Yes, a page fault exits isolation so that the faulting instruction can be retried
with the full kernel address space. When exiting isolation, we also want to
kick the sibling hyperthread and pin it so that it can't steal secrets while
we use the full kernel address space, but that's not implemented in this series
(see the TODO comment in kvm_isolation_exit() in patch 25 "kvm/isolation:
implement actual KVM isolation enter/exit").
alex.
On 5/13/19 5:49 PM, Andy Lutomirski wrote:
> On Mon, May 13, 2019 at 7:39 AM Alexandre Chartre
> <[email protected]> wrote:
>>
>> From: Liran Alon <[email protected]>
>>
>> Interrupt handlers will need this handler to switch from
>> the KVM address space back to the kernel address space
>> in their prologue.
>
> This patch doesn't appear to do anything at all. What am I missing?
>
Let me check. It looks like I trimmed the code invoking the handler from
IRQ (to exit isolation when there's an IRQ). Probably a bad merge at some
point. Sorry.
alex.
On Mon, May 13, 2019 at 04:38:33PM +0200, Alexandre Chartre wrote:
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index a4db7f5..7ad5ad1 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -444,6 +444,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> switch_ldt(real_prev, next);
> }
> }
> +EXPORT_SYMBOL_GPL(switch_mm_irqs_off);
>
> /*
> * Please ignore the name of this function. It should be called
NAK
On Mon, May 13, 2019 at 7:39 AM Alexandre Chartre
<[email protected]> wrote:
>
> From: Liran Alon <[email protected]>
>
> Create a separate mm for KVM that will be active when KVM #VMExit
> handlers run. Up until the point at which we architecturally need to
> access host (or other VM) sensitive data.
>
> This patch just creates kvm_mm but never makes it active yet.
> This will be done by next commits.
NAK to this whole pile of code. KVM is not so special that it can
duplicate core infrastructure like this. Use copy_init_mm() or
improve it as needed.
--Andy
On Mon, May 13, 2019 at 7:39 AM Alexandre Chartre
<[email protected]> wrote:
>
> From: Liran Alon <[email protected]>
>
> Add the address_space_isolation parameter to the kvm module.
>
> When set to true, KVM #VMExit handlers run in isolated address space
> which maps only KVM required code and per-VM information instead of
> entire kernel address space.
Does the *entry* also get isolated? If not, it seems less useful for
side-channel mitigation.
On Mon, May 13, 2019 at 7:39 AM Alexandre Chartre
<[email protected]> wrote:
>
> The KVM page fault handler handles page fault occurring while using
> the KVM address space by switching to the kernel address space and
> retrying the access (except if the fault occurs while switching
> to the kernel address space). Processing of page faults occurring
> while using the kernel address space is unchanged.
>
> Page fault log is cleared when creating a vm so that page fault
> information doesn't persist when qemu is stopped and restarted.
Are you saying that a page fault will just exit isolation? This
completely defeats most of the security, right? Sure, it still helps
with side channels, but not with actual software bugs.
On Mon, May 13, 2019 at 8:50 AM Dave Hansen <[email protected]> wrote:
>
> > + /*
> > + * Copy the mapping for all the kernel text. We copy at the PMD
> > + * level since the PUD is shared with the module mapping space.
> > + */
> > + rv = kvm_copy_mapping((void *)__START_KERNEL_map, KERNEL_IMAGE_SIZE,
> > + PGT_LEVEL_PMD);
> > + if (rv)
> > + goto out_uninit_page_table;
>
> Could you double-check this? We (I) have had some repeated confusion
> with the PTI code and kernel text vs. kernel data vs. __init.
> KERNEL_IMAGE_SIZE looks to be 512MB which is quite a bit bigger than
> kernel text.
>
> > + /*
> > + * Copy the mapping for cpu_entry_area and %esp fixup stacks
> > + * (this is based on the PTI userland address space, but probably
> > + * not needed because the KVM address space is not directly
> > + * enterered from userspace). They can both be copied at the P4D
> > + * level since they each have a dedicated P4D entry.
> > + */
> > + rv = kvm_copy_mapping((void *)CPU_ENTRY_AREA_PER_CPU, P4D_SIZE,
> > + PGT_LEVEL_P4D);
> > + if (rv)
> > + goto out_uninit_page_table;
>
> cpu_entry_area is used for more than just entry from userspace. The gdt
> mapping, for instance, is needed everywhere. You might want to go look
> at 'struct cpu_entry_area' in some more detail.
>
> > +#ifdef CONFIG_X86_ESPFIX64
> > + rv = kvm_copy_mapping((void *)ESPFIX_BASE_ADDR, P4D_SIZE,
> > + PGT_LEVEL_P4D);
> > + if (rv)
> > + goto out_uninit_page_table;
> > +#endif
>
> Why are these mappings *needed*? I thought we only actually used these
> fixup stacks for some crazy iret-to-userspace handling. We're certainly
> not doing that from KVM context.
>
> Am I forgetting something?
>
> > +#ifdef CONFIG_VMAP_STACK
> > + /*
> > + * Interrupt stacks are vmap'ed with guard pages, so we need to
> > + * copy mappings.
> > + */
> > + for_each_possible_cpu(cpu) {
> > + stack = per_cpu(hardirq_stack_ptr, cpu);
> > + pr_debug("IRQ Stack %px\n", stack);
> > + if (!stack)
> > + continue;
> > + rv = kvm_copy_ptes(stack - IRQ_STACK_SIZE, IRQ_STACK_SIZE);
> > + if (rv)
> > + goto out_uninit_page_table;
> > + }
> > +
> > +#endif
>
> I seem to remember that the KVM VMENTRY/VMEXIT context is very special.
> Interrupts (and even NMIs?) are disabled. Would it be feasible to do
> the switching in there so that we never even *get* interrupts in the KVM
> context?
That would be nicer.
Looking at this code, it occurs to me that mapping the IRQ stacks
seems questionable. As it stands, this series switches to a normal
CR3 in some C code somewhere moderately deep in the APIC IRQ code. By
that time, I think you may have executed traceable code, and, if that
happens, you lose. I hate to say this, but any shenanigans like this
patch does might need to happen in the entry code *before* even
switching to the IRQ stack. Or perhaps shortly thereafter.
We've talked about moving context tracking to C. If we go that route,
then this KVM context mess could go there, too -- we'd have a
low-level C wrapper for each entry that would deal with getting us
ready to run normal C code.
(We need to do something about terminology. This kvm_mm thing isn't
an mm in the normal sense. An mm has normal kernel mappings and
varying user mappings. For example, the PTI "userspace" page tables
aren't an mm. And we really don't want a situation where the vmalloc
fault code runs with the "kvm_mm" mm active -- it will totally
malfunction.)
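To make the suggestion concrete, here is a very rough sketch of such a
low-level C wrapper; every name in it is hypothetical, nothing like this
exists in the series yet:

/*
 * Hypothetical per-entry wrapper: leave the KVM address space before any
 * traceable C code runs, then hand over to the normal handler.
 */
__visible void kvm_isolated_entry(struct pt_regs *regs,
				  void (*handler)(struct pt_regs *))
{
	if (cr3_is_kvm_mm())		/* hypothetical check */
		switch_to_kernel_cr3();	/* hypothetical helper */

	handler(regs);
}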
On 5/13/19 5:51 PM, Andy Lutomirski wrote:
> On Mon, May 13, 2019 at 7:39 AM Alexandre Chartre
> <[email protected]> wrote:
>>
>> From: Liran Alon <[email protected]>
>>
>> Next commits will change most of KVM #VMExit handlers to run
>> in KVM isolated address space. Any interrupt handler raised
>> during execution in KVM address space needs to switch back
>> to host address space.
>>
>> This patch makes sure that IRQ handlers will run in full
>> host address space instead of KVM isolated address space.
>
> IMO this needs to be somewhere a lot more central. What about NMI and
> MCE? Or async page faults? Or any other entry?
>
Actually, I am not sure this is effectively useful because the IRQ
handler is probably faulting before it tries to exit isolation, so
the isolation exit will be done by the kvm page fault handler. I need
to check that.
alex.
> On 13 May 2019, at 17:38, Alexandre Chartre <[email protected]> wrote:
>
> Hi,
>
> This series aims to introduce the concept of KVM address space isolation.
> This is done as part of the upstream community effort to have exploit
> mitigations for CPU info-leaks vulnerabilities such as L1TF.
>
> These patches are based on an original patches from Liran Alon, completed
> with additional patches to effectively create KVM address space different
> from the full kernel address space.
Great job for pushing this forward! Thank you!
>
> The current code is just an early POC, and it is not fully stable at the
> moment (unfortunately you can expect crashes/hangs, see the "Issues"
> section below). However I would like to start a discussion get feedback
> and opinions about this approach.
>
> Context
> =======
>
> The most naive approach to handle L1TF SMT-variant exploit is to just disable
> hyper-threading. But that is not practical for public cloud providers. As a
> second next best alternative, there is an approach to combine coscheduling
> together with flushing L1D cache on every VMEntry. By coscheduling I refer
> to some mechanism which on every VMExit from guest, kicks all sibling
> hyperthreads from guest aswell.
>
> However, this approach have some open issues:
>
> 1. Kicking all sibling hyperthreads for every VMExit have significant
> performance hit for some compute shapes (e.g. Emulated and PV).
>
> 2. It assumes only CPU core resource which could be leaked by some
> vulnerability is L1D cache. But future vulnerabilities may also be able
> to leak other CPU core resources. Therefore, we would prefer to have a
> mechanism which prevents these resources to be able to be loaded with
> sensitive data to begin with.
>
> To better address (2), upstream community has discussed some mechanisms
> related to reducing data that is mapped on kernel virtual address space.
> Specifically:
>
> a. XPFO: Removes from physmap pages that currently should only be accessed
> by userspace.
>
> b. Process-local memory allocations: Allows having a memory area in kernel
> virtual address space that maps different content per-process. Then,
> allocations made on this memory area can be hidden from other tasks in
> the system running in kernel space. Most obvious use it to allocate
> there per-vCPU and per-VM KVM structures.
>
> However, both (a)+(b) work in a black-list approach (where we decide which
> data is considered dangerous and remove it from kernel virtual address
> space) and don't address performance hit described at (1).
+Cc Stefan from AWS and Kaya from Google.
(I have sent them my original patch series to review and discuss this subject with them.)
Stefan: Do you know Julian's current email address so that we can Cc him as well?
>
>
> Proposal
> ========
>
> To handle both these points, this series introduce the mechanism of KVM
> address space isolation. Note that this mechanism completes (a)+(b) and
> don't contradict. In case this mechanism is also applied, (a)+(b) should
> still be applied to the full virtual address space as a defence-in-depth).
>
> The idea is that most of KVM #VMExit handlers code will run in a special
> KVM isolated address space which maps only KVM required code and per-VM
> information. Only once KVM needs to architectually access other (sensitive)
> data, it will switch from KVM isolated address space to full standard
> host address space. At this point, KVM will also need to kick all sibling
> hyperthreads to get out of guest (note that kicking all sibling hyperthreads
> is not implemented in this serie).
>
> Basically, we will have the following flow:
>
> - qemu issues KVM_RUN ioctl
> - KVM handles the ioctl and calls vcpu_run():
> . KVM switches from the kernel address to the KVM address space
> . KVM transfers control to VM (VMLAUNCH/VMRESUME)
> . VM returns to KVM
> . KVM handles VM-Exit:
> . if handling need full kernel then switch to kernel address space
*AND* kick sibling hyperthreads before switching to that address space.
I think it's important to emphasise that one of the main points of this KVM address space isolation mechanism is to minimise the number of times we need to kick sibling hyperthreads out of the guest, hopefully by having the vast majority of VMExits handled in the KVM isolated address space.
> . else continues with KVM address space
> . KVM loops in vcpu_run() or return
> - KVM_RUN ioctl returns
>
> So, the KVM_RUN core function will mainly be executed using the KVM address
> space. The handling of a VM-Exit can require access to the kernel space
> and, in that case, we will switch back to the kernel address space.
>
> The high-level idea of how this is implemented is to create a separate
> struct_mm for KVM such that a vCPU thread will switch active_mm between
> it's original active_mm and kvm_mm when needed as described above. The
> idea is very similar to how kernel switches between task active_mm and
> efi_mm when calling EFI Runtime Services.
>
> Note that because we use the kernel TLB Manager to switch between kvm_mm
> and host_mm, we will effectively use TLB with PCID if enabled to make
> these switches fast. As all of this is managed internally in TLB Manager's
> switch_mm().
>
>
> Patches
> =======
>
> The proposed patches implement the necessary framework for creating kvm_mm
> and switching between host_mm and kvm_mm at appropriate times. They also
> provide functions for populating the KVM address space, and implement an
> actual KVM address space much smaller than the full kernel address space.
>
> - 01-08: add framework for switching between the kernel address space and
> the KVM address space at appropriate times. Note that these patches do
> not create or switch the address space yet. Address space switching is
> implemented in patch 25.
>
> - 09-18: add a framework for populating and managing the KVM page table;
> this also include mechanisms to ensure changes are effectively limited
> to the KVM page table and no change is mistakenly propagated to the
> kernel page table.
>
> - 19-23: populate the KVM page table.
>
> - 24: add page fault handler to handle and report missing mappings when
> running with the KVM address space. This is based on an original idea
> from Paul Turner.
>
> - 25: implement the actual switch between the kernel address space and
> the KVM address space.
>
> - 26-27: populate the KVM page table with more data.
>
>
> If a fault occurs while running with the KVM address space, it will be
> reported on the console like this:
>
> [ 4840.727476] KVM isolation: page fault #0 (0) at fast_page_fault+0x13e/0x3e0 [kvm] on ffffea00005331f0 (0xffffea00005331f0)
>
> If the KVM page_fault_stack module parameter is set to non-zero (that's
> the default) then the stack of the fault will also be reported:
>
> [ 5025.630374] KVM isolation: page fault #0 (0) at fast_page_fault+0x100/0x3e0 [kvm] on ffff88003c718000 (0xffff88003c718000)
> [ 5025.631918] Call Trace:
> [ 5025.632782] tdp_page_fault+0xec/0x260 [kvm]
> [ 5025.633395] kvm_mmu_page_fault+0x74/0x5f0 [kvm]
> [ 5025.644467] handle_ept_violation+0xc3/0x1a0 [kvm_intel]
> [ 5025.645218] vmx_handle_exit+0xb9/0x600 [kvm_intel]
> [ 5025.645917] vcpu_enter_guest+0xb88/0x1580 [kvm]
> [ 5025.646577] kvm_arch_vcpu_ioctl_run+0x403/0x610 [kvm]
> [ 5025.647313] kvm_vcpu_ioctl+0x3d5/0x650 [kvm]
> [ 5025.648538] do_vfs_ioctl+0xaa/0x602
> [ 5025.650502] SyS_ioctl+0x79/0x84
> [ 5025.650966] do_syscall_64+0x79/0x1ae
> [ 5025.651487] entry_SYSCALL_64_after_hwframe+0x151/0x0
> [ 5025.652200] RIP: 0033:0x7f74a2f1d997
> [ 5025.652710] RSP: 002b:00007f749f3ec998 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 5025.653769] RAX: ffffffffffffffda RBX: 0000562caa83e110 RCX: 00007f74a2f1d997
> [ 5025.654769] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000000c
> [ 5025.655769] RBP: 0000562caa83e1b3 R08: 0000562ca9b6fa50 R09: 0000000000000006
> [ 5025.656766] R10: 0000000000000000 R11: 0000000000000246 R12: 0000562ca9b552c0
> [ 5025.657764] R13: 0000000000801000 R14: 00007f74a59d4000 R15: 0000562caa83e110
>
> This allows to find out what is missing in the KVM address space.
>
>
> Issues
> ======
>
> Limited tests have been done so far, and mostly with an empty single-vcpu
> VM (i.e. qemu-system-i386 -enable-kvm -smp 1). Single-vcpu VM is able to
> start and run a full OS but the system will eventually crash/hang at some
> point. Multiple vcpus will crash/hang much faster.
>
>
> Performance Impact
> ==================
>
> As this is a RFC, the effective performance impact hasn't been measured
> yet.
> Current patches introduce two additional context switches (kernel to
> KVM, and KVM to kernel) on each KVM_RUN ioctl.
I have never considered this to be an issue.
By design of this patch series, I treated exits to the userspace VMM as a slow path that is not important to optimise.
> Also additional context
> switches are added if a VM-Exit has to be handled using the full kernel
> address space.
This is by design as well.
The KVM address space should, on the one hand, contain only data that is not sensitive if leaked to a guest and, on the other hand, contain enough to handle the vast majority of exits without leaving that address space. If we cannot achieve such a KVM isolated address space, the PoC of the series has failed.
>
> I expect that the KVM address space can eventually be expanded to include
> the ioctl syscall entries.
As mentioned above, I do not see a strong reason to do so. The ioctl syscalls are considered a slow path that is not important to optimise.
> By doing so, and also adding the KVM page table
> to the process userland page table (which should be safe to do because the
> KVM address space doesn't have any secret), we could potentially handle the
> KVM ioctl without having to switch to the kernel pagetable (thus effectively
> eliminating KPTI for KVM).
For the reasons above, I don't think this is important.
> Then the only overhead would be if a VM-Exit has
> to be handled using the full kernel address space.
This was always, by design, the only overhead, and a major one: not only does it switch address space, but in the future it will also require kicking all sibling hyperthreads out of the guest. The purpose of the series is to create an address space such that most VMExits won't require such a kick of the sibling hyperthreads.
-Liran
>
>
> Thanks,
>
> alex.
>
> ---
>
> Alexandre Chartre (18):
> kvm/isolation: function to track buffers allocated for the KVM page
> table
> kvm/isolation: add KVM page table entry free functions
> kvm/isolation: add KVM page table entry offset functions
> kvm/isolation: add KVM page table entry allocation functions
> kvm/isolation: add KVM page table entry set functions
> kvm/isolation: functions to copy page table entries for a VA range
> kvm/isolation: keep track of VA range mapped in KVM address space
> kvm/isolation: functions to clear page table entries for a VA range
> kvm/isolation: improve mapping copy when mapping is already present
> kvm/isolation: function to copy page table entries for percpu buffer
> kvm/isolation: initialize the KVM page table with core mappings
> kvm/isolation: initialize the KVM page table with vmx specific data
> kvm/isolation: initialize the KVM page table with vmx VM data
> kvm/isolation: initialize the KVM page table with vmx cpu data
> kvm/isolation: initialize the KVM page table with the vcpu tasks
> kvm/isolation: KVM page fault handler
> kvm/isolation: initialize the KVM page table with KVM memslots
> kvm/isolation: initialize the KVM page table with KVM buses
>
> Liran Alon (9):
> kernel: Export memory-management symbols required for KVM address
> space isolation
> KVM: x86: Introduce address_space_isolation module parameter
> KVM: x86: Introduce KVM separate virtual address space
> KVM: x86: Switch to KVM address space on entry to guest
> KVM: x86: Add handler to exit kvm isolation
> KVM: x86: Exit KVM isolation on IRQ entry
> KVM: x86: Switch to host address space when may access sensitive data
> KVM: x86: Optimize branches which checks if address space isolation
> enabled
> kvm/isolation: implement actual KVM isolation enter/exit
>
> arch/x86/include/asm/apic.h | 4 +-
> arch/x86/include/asm/hardirq.h | 10 +
> arch/x86/include/asm/irq.h | 1 +
> arch/x86/kernel/cpu/common.c | 2 +
> arch/x86/kernel/dumpstack.c | 1 +
> arch/x86/kernel/irq.c | 11 +
> arch/x86/kernel/ldt.c | 1 +
> arch/x86/kernel/smp.c | 2 +-
> arch/x86/kvm/Makefile | 2 +-
> arch/x86/kvm/isolation.c | 1773 ++++++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/isolation.h | 40 +
> arch/x86/kvm/mmu.c | 3 +-
> arch/x86/kvm/vmx/vmx.c | 123 +++-
> arch/x86/kvm/x86.c | 44 +-
> arch/x86/mm/fault.c | 12 +
> arch/x86/mm/tlb.c | 4 +-
> arch/x86/platform/uv/tlb_uv.c | 2 +-
> include/linux/kvm_host.h | 2 +
> include/linux/percpu.h | 2 +
> include/linux/sched.h | 6 +
> mm/memory.c | 5 +
> mm/percpu.c | 6 +-
> virt/kvm/arm/arm.c | 4 +
> virt/kvm/kvm_main.c | 4 +-
> 24 files changed, 2051 insertions(+), 13 deletions(-)
> create mode 100644 arch/x86/kvm/isolation.c
> create mode 100644 arch/x86/kvm/isolation.h
>
On Mon, May 13, 2019 at 08:50:19AM -0700, Dave Hansen wrote:
> I seem to remember that the KVM VMENTRY/VMEXIT context is very special.
> Interrupts (and even NMIs?) are disabled. Would it be feasible to do
> the switching in there so that we never even *get* interrupts in the KVM
> context?
NMIs are enabled on VMX's VM-Exit. On SVM, NMIs are blocked on exit until
STGI is executed.
On 5/13/19 5:50 PM, Dave Hansen wrote:
>> + /*
>> + * Copy the mapping for all the kernel text. We copy at the PMD
>> + * level since the PUD is shared with the module mapping space.
>> + */
>> + rv = kvm_copy_mapping((void *)__START_KERNEL_map, KERNEL_IMAGE_SIZE,
>> + PGT_LEVEL_PMD);
>> + if (rv)
>> + goto out_uninit_page_table;
>
> Could you double-check this? We (I) have had some repeated confusion
> with the PTI code and kernel text vs. kernel data vs. __init.
> KERNEL_IMAGE_SIZE looks to be 512MB which is quite a bit bigger than
> kernel text.
I probably have the same confusion :-) but I will try to check again.
>> + /*
>> + * Copy the mapping for cpu_entry_area and %esp fixup stacks
>> + * (this is based on the PTI userland address space, but probably
>> + * not needed because the KVM address space is not directly
>> + * enterered from userspace). They can both be copied at the P4D
>> + * level since they each have a dedicated P4D entry.
>> + */
>> + rv = kvm_copy_mapping((void *)CPU_ENTRY_AREA_PER_CPU, P4D_SIZE,
>> + PGT_LEVEL_P4D);
>> + if (rv)
>> + goto out_uninit_page_table;
>
> cpu_entry_area is used for more than just entry from userspace. The gdt
> mapping, for instance, is needed everywhere. You might want to go look
> at 'struct cpu_entry_area' in some more detail.
Ok. Thanks.
>> +#ifdef CONFIG_X86_ESPFIX64
>> + rv = kvm_copy_mapping((void *)ESPFIX_BASE_ADDR, P4D_SIZE,
>> + PGT_LEVEL_P4D);
>> + if (rv)
>> + goto out_uninit_page_table;
>> +#endif
>
> Why are these mappings *needed*? I thought we only actually used these
> fixup stacks for some crazy iret-to-userspace handling. We're certainly
> not doing that from KVM context.
Right. I initially looked at what was used for PTI, and I probably copied some
unneeded mappings.
> Am I forgetting something?
>
>> +#ifdef CONFIG_VMAP_STACK
>> + /*
>> + * Interrupt stacks are vmap'ed with guard pages, so we need to
>> + * copy mappings.
>> + */
>> + for_each_possible_cpu(cpu) {
>> + stack = per_cpu(hardirq_stack_ptr, cpu);
>> + pr_debug("IRQ Stack %px\n", stack);
>> + if (!stack)
>> + continue;
>> + rv = kvm_copy_ptes(stack - IRQ_STACK_SIZE, IRQ_STACK_SIZE);
>> + if (rv)
>> + goto out_uninit_page_table;
>> + }
>> +
>> +#endif
>
> I seem to remember that the KVM VMENTRY/VMEXIT context is very special.
> Interrupts (and even NMIs?) are disabled. Would it be feasible to do
> the switching in there so that we never even *get* interrupts in the KVM
> context?
Ideally we would like to run with the KVM address space when handling a VM-exit
(so between a VMEXIT and the next VMENTER) where interrupts are not disabled.
> I also share Peter's concerns about letting modules do this. If we ever
> go down this road, we're going to have to think very carefully how we
> let KVM do this without giving all the not-so-nice out-of-tree modules
> the keys to the castle.
Right, we probably need a more generic framework for creating a limited
kernel address space which kvm (or other modules?) can use. I think
kvm is a good place to start for having this kind of limited context, hence
this RFC and my request for advice on how best to do it.
> A high-level comment: it looks like this is "working", but has probably
> erred on the side of mapping too much. The hard part is paring this
> back to a truly minimal set of mappings.
>
Agree.
Thanks,
alex.
On 5/13/19 6:00 PM, Andy Lutomirski wrote:
> On Mon, May 13, 2019 at 8:50 AM Dave Hansen <[email protected]> wrote:
>>
>>> + /*
>>> + * Copy the mapping for all the kernel text. We copy at the PMD
>>> + * level since the PUD is shared with the module mapping space.
>>> + */
>>> + rv = kvm_copy_mapping((void *)__START_KERNEL_map, KERNEL_IMAGE_SIZE,
>>> + PGT_LEVEL_PMD);
>>> + if (rv)
>>> + goto out_uninit_page_table;
>>
>> Could you double-check this? We (I) have had some repeated confusion
>> with the PTI code and kernel text vs. kernel data vs. __init.
>> KERNEL_IMAGE_SIZE looks to be 512MB which is quite a bit bigger than
>> kernel text.
>>
>>> + /*
>>> + * Copy the mapping for cpu_entry_area and %esp fixup stacks
>>> + * (this is based on the PTI userland address space, but probably
>>> + * not needed because the KVM address space is not directly
>>> + * enterered from userspace). They can both be copied at the P4D
>>> + * level since they each have a dedicated P4D entry.
>>> + */
>>> + rv = kvm_copy_mapping((void *)CPU_ENTRY_AREA_PER_CPU, P4D_SIZE,
>>> + PGT_LEVEL_P4D);
>>> + if (rv)
>>> + goto out_uninit_page_table;
>>
>> cpu_entry_area is used for more than just entry from userspace. The gdt
>> mapping, for instance, is needed everywhere. You might want to go look
>> at 'struct cpu_entry_area' in some more detail.
>>
>>> +#ifdef CONFIG_X86_ESPFIX64
>>> + rv = kvm_copy_mapping((void *)ESPFIX_BASE_ADDR, P4D_SIZE,
>>> + PGT_LEVEL_P4D);
>>> + if (rv)
>>> + goto out_uninit_page_table;
>>> +#endif
>>
>> Why are these mappings *needed*? I thought we only actually used these
>> fixup stacks for some crazy iret-to-userspace handling. We're certainly
>> not doing that from KVM context.
>>
>> Am I forgetting something?
>>
>>> +#ifdef CONFIG_VMAP_STACK
>>> + /*
>>> + * Interrupt stacks are vmap'ed with guard pages, so we need to
>>> + * copy mappings.
>>> + */
>>> + for_each_possible_cpu(cpu) {
>>> + stack = per_cpu(hardirq_stack_ptr, cpu);
>>> + pr_debug("IRQ Stack %px\n", stack);
>>> + if (!stack)
>>> + continue;
>>> + rv = kvm_copy_ptes(stack - IRQ_STACK_SIZE, IRQ_STACK_SIZE);
>>> + if (rv)
>>> + goto out_uninit_page_table;
>>> + }
>>> +
>>> +#endif
>>
>> I seem to remember that the KVM VMENTRY/VMEXIT context is very special.
>> Interrupts (and even NMIs?) are disabled. Would it be feasible to do
>> the switching in there so that we never even *get* interrupts in the KVM
>> context?
>
> That would be nicer.
>
> Looking at this code, it occurs to me that mapping the IRQ stacks
> seems questionable. As it stands, this series switches to a normal
> CR3 in some C code somewhere moderately deep in the APIC IRQ code. By
> that time, I think you may have executed traceable code, and, if that
> happens, you lose. i hate to say this, but any shenanigans like this
> patch does might need to happen in the entry code *before* even
> switching to the IRQ stack. Or perhaps shortly thereafter.
>
> We've talked about moving context tracking to C. If we go that route,
> then this KVM context mess could go there, too -- we'd have a
> low-level C wrapper for each entry that would deal with getting us
> ready to run normal C code.
>
> (We need to do something about terminology. This kvm_mm thing isn't
> an mm in the normal sense. An mm has normal kernel mappings and
> varying user mappings. For example, the PTI "userspace" page tables
> aren't an mm. And we really don't want a situation where the vmalloc
> fault code runs with the "kvm_mm" mm active -- it will totally
> malfunction.)
>
One of my next steps is to try to put the KVM page table in the PTI userspace
page tables, and not switch CR3 on the KVM_RUN ioctl. That way, we will run with
a regular mm (but using the userspace page table). Then an interrupt would switch
CR3 to the kernel page table (like the paranoid idtentry path currently does).
alex.
On Mon, May 13, 2019 at 9:28 AM Alexandre Chartre
<[email protected]> wrote:
>
>
>
> On 5/13/19 5:51 PM, Andy Lutomirski wrote:
> > On Mon, May 13, 2019 at 7:39 AM Alexandre Chartre
> > <[email protected]> wrote:
> >>
> >> From: Liran Alon <[email protected]>
> >>
> >> Next commits will change most of KVM #VMExit handlers to run
> >> in KVM isolated address space. Any interrupt handler raised
> >> during execution in KVM address space needs to switch back
> >> to host address space.
> >>
> >> This patch makes sure that IRQ handlers will run in full
> >> host address space instead of KVM isolated address space.
> >
> > IMO this needs to be somewhere a lot more central. What about NMI and
> > MCE? Or async page faults? Or any other entry?
> >
>
> Actually, I am not sure this is effectively useful because the IRQ
> handler is probably faulting before it tries to exit isolation, so
> the isolation exit will be done by the kvm page fault handler. I need
> to check that.
>
The whole idea of having #PF exit with a different CR3 than was loaded
on entry seems questionable to me. I'd be a lot more comfortable with
the whole idea if a page fault due to accessing the wrong data was an
OOPS and the code instead just did the right thing directly.
--Andy
> I expect that the KVM address space can eventually be expanded to include
> the ioctl syscall entries. By doing so, and also adding the KVM page table
> to the process userland page table (which should be safe to do because the
> KVM address space doesn't have any secret), we could potentially handle the
> KVM ioctl without having to switch to the kernel pagetable (thus effectively
> eliminating KPTI for KVM). Then the only overhead would be if a VM-Exit has
> to be handled using the full kernel address space.
>
In the hopefully common case where a VM exits and then gets re-entered
without needing to load full page tables, what code actually runs?
I'm trying to understand when the optimization of not switching is
actually useful.
Allowing ioctl() without switching to kernel tables sounds...
extremely complicated. It also makes the dubious assumption that user
memory contains no secrets.
On Mon, May 13, 2019 at 7:39 AM Alexandre Chartre
<[email protected]> wrote:
>
> pcpu_base_addr is already mapped to the KVM address space, but this
> represents the first percpu chunk. To access a per-cpu buffer not
> allocated in the first chunk, add a function which maps all cpu
> buffers corresponding to that per-cpu buffer.
>
> Also add function to clear page table entries for a percpu buffer.
>
This needs some kind of clarification so that readers can tell whether
you're trying to map all percpu memory or just map a specific
variable. In either case, you're making a dubious assumption that
percpu memory contains no secrets.
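For what it's worth, here is a minimal sketch of the "map one specific
per-cpu variable for every CPU" reading, using the kvm_copy_ptes() helper
from this series (the loop is a reconstruction, not the actual patch code):

int kvm_copy_percpu_mapping(void *percpu_ptr, size_t size)
{
	void *ptr;
	int cpu, err;

	/* Map the per-CPU instance of this variable for every possible CPU. */
	for_each_possible_cpu(cpu) {
		ptr = per_cpu_ptr(percpu_ptr, cpu);
		err = kvm_copy_ptes(ptr, size);
		if (err)
			return err;
	}

	return 0;
}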
On 5/13/19, 7:43 AM, "[email protected] on behalf of Alexandre Chartre" wrote:
Proposal
========
To handle both these points, this series introduce the mechanism of KVM
address space isolation. Note that this mechanism completes (a)+(b) and
don't contradict. In case this mechanism is also applied, (a)+(b) should
still be applied to the full virtual address space as a defence-in-depth).
The idea is that most of KVM #VMExit handlers code will run in a special
KVM isolated address space which maps only KVM required code and per-VM
information. Only once KVM needs to architectually access other (sensitive)
data, it will switch from KVM isolated address space to full standard
host address space. At this point, KVM will also need to kick all sibling
hyperthreads to get out of guest (note that kicking all sibling hyperthreads
is not implemented in this serie).
Basically, we will have the following flow:
- qemu issues KVM_RUN ioctl
- KVM handles the ioctl and calls vcpu_run():
. KVM switches from the kernel address to the KVM address space
. KVM transfers control to VM (VMLAUNCH/VMRESUME)
. VM returns to KVM
. KVM handles VM-Exit:
. if handling need full kernel then switch to kernel address space
. else continues with KVM address space
. KVM loops in vcpu_run() or return
- KVM_RUN ioctl returns
So, the KVM_RUN core function will mainly be executed using the KVM address
space. The handling of a VM-Exit can require access to the kernel space
and, in that case, we will switch back to the kernel address space.
Once all sibling hyperthreads are in the host (either using the full kernel address space or user address space), what happens to the other sibling hyperthreads if one of them tries to do VM entry? That VCPU will switch to the KVM address space prior to VM entry, but others continue to run? Do you think (a) + (b) would be sufficient for that case?
---
Jun
Intel Open Source Technology Center
> On 13 May 2019, at 21:17, Andy Lutomirski <[email protected]> wrote:
>
>> I expect that the KVM address space can eventually be expanded to include
>> the ioctl syscall entries. By doing so, and also adding the KVM page table
>> to the process userland page table (which should be safe to do because the
>> KVM address space doesn't have any secret), we could potentially handle the
>> KVM ioctl without having to switch to the kernel pagetable (thus effectively
>> eliminating KPTI for KVM). Then the only overhead would be if a VM-Exit has
>> to be handled using the full kernel address space.
>>
>
> In the hopefully common case where a VM exits and then gets re-entered
> without needing to load full page tables, what code actually runs?
> I'm trying to understand when the optimization of not switching is
> actually useful.
>
> Allowing ioctl() without switching to kernel tables sounds...
> extremely complicated. It also makes the dubious assumption that user
> memory contains no secrets.
Let me attempt to clarify what we were thinking when creating this patch series:
1) It is never safe to execute one hyperthread inside the guest while its sibling hyperthread runs in a virtual address space which contains secrets of the host or other guests.
This is because we assume that, using some speculative gadget (such as a half-Spectre-v2 gadget), it will be possible to populate *some* CPU core resource which could then be *somehow* leaked by the hyperthread running inside the guest. In the case of L1TF, this would be data populated into the L1D cache.
2) Because of (1), every time a hyperthread runs inside the host kernel, we must make sure its sibling is not running inside the guest, i.e. we must kick the sibling hyperthread out of the guest using an IPI.
3) From (2), we should have theoretically deduced that for every #VMExit, there is a need to also kick the sibling hyperthread out of the guest until the #VMExit is completed. Such a patch series was implemented at some point but it had (obviously) a significant performance hit.
4) The main goal of this patch series is to preserve (2), but to avoid the overhead specified in (3).
The way this patch series achieves (4) is by observing that during the run of a VM, most #VMExits can be handled rather quickly and locally inside KVM and don't need to reference any data that is not relevant to this VM or KVM code. Therefore, if we run these #VMExits in an isolated virtual address space (i.e. the KVM isolated address space), there is no need to kick the sibling hyperthread out of the guest while these #VMExit handlers run.
The hope is that the very vast majority of #VMExit handlers will be able to run completely without requiring a switch to the full address space, therefore avoiding the performance hit of (2).
However, for the very few #VMExits that do require running in the full kernel address space, we must first kick the sibling hyperthreads out of the guest and only then switch to the full kernel address space; only once all hyperthreads have returned to the KVM address space do we allow them to enter the guest again.
For this reason, I think the above paragraph (that was added to my original cover letter) is incorrect.
I believe that we should, by design, treat all exits to the userspace VMM (e.g. QEMU) as a slow path that should not be optimised, and for which it is therefore ok to switch address space (and therefore also kick the sibling hyperthreads). Similarly, all ioctl handlers are also slow paths, so it should be ok for them not to run in the KVM isolated address space either.
-Liran
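To make the "kick" concrete, here is a minimal sketch of what kicking the
sibling hyperthreads could look like. This is not part of the series; the
use of the reschedule IPI is only an assumption, the point being that any
IPI delivered to a CPU running a guest forces a VM-exit:

static void kvm_kick_sibling_hyperthreads(void)
{
	int cpu = smp_processor_id();
	int sibling;

	/* Send an IPI to every SMT sibling of this CPU. */
	for_each_cpu(sibling, topology_sibling_cpumask(cpu)) {
		if (sibling != cpu)
			smp_send_reschedule(sibling);
	}
}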
> On 13 May 2019, at 22:31, Nakajima, Jun <[email protected]> wrote:
>
> On 5/13/19, 7:43 AM, "[email protected] on behalf of Alexandre Chartre" wrote:
>
> Proposal
> ========
>
> To handle both these points, this series introduce the mechanism of KVM
> address space isolation. Note that this mechanism completes (a)+(b) and
> don't contradict. In case this mechanism is also applied, (a)+(b) should
> still be applied to the full virtual address space as a defence-in-depth).
>
> The idea is that most of KVM #VMExit handlers code will run in a special
> KVM isolated address space which maps only KVM required code and per-VM
> information. Only once KVM needs to architectually access other (sensitive)
> data, it will switch from KVM isolated address space to full standard
> host address space. At this point, KVM will also need to kick all sibling
> hyperthreads to get out of guest (note that kicking all sibling hyperthreads
> is not implemented in this serie).
>
> Basically, we will have the following flow:
>
> - qemu issues KVM_RUN ioctl
> - KVM handles the ioctl and calls vcpu_run():
> . KVM switches from the kernel address to the KVM address space
> . KVM transfers control to VM (VMLAUNCH/VMRESUME)
> . VM returns to KVM
> . KVM handles VM-Exit:
> . if handling need full kernel then switch to kernel address space
> . else continues with KVM address space
> . KVM loops in vcpu_run() or return
> - KVM_RUN ioctl returns
>
> So, the KVM_RUN core function will mainly be executed using the KVM address
> space. The handling of a VM-Exit can require access to the kernel space
> and, in that case, we will switch back to the kernel address space.
>
> Once all sibling hyperthreads are in the host (either using the full kernel address space or user address space), what happens to the other sibling hyperthreads if one of them tries to do VM entry? That VCPU will switch to the KVM address space prior to VM entry, but others continue to run? Do you think (a) + (b) would be sufficient for that case?
The description here is missing an important part: when a hyperthread needs to switch from the KVM isolated address space to the full kernel address space, it should first kick all sibling hyperthreads out of the guest and only then safely switch to the full kernel address space. Only once all sibling hyperthreads are running with the KVM isolated address space is it safe to enter the guest.
The main point of this address space is to avoid kicking all sibling hyperthreads on *every* VMExit from the guest and instead to kick them only when switching address space. The assumption is that the vast majority of exits can be handled in the KVM isolated address space and therefore do not require kicking the sibling hyperthreads out of the guest.
-Liran
>
> ---
> Jun
> Intel Open Source Technology Center
>
>
> On 13 May 2019, at 18:15, Peter Zijlstra <[email protected]> wrote:
>
> On Mon, May 13, 2019 at 04:38:32PM +0200, Alexandre Chartre wrote:
>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>> index 46df4c6..317e105 100644
>> --- a/arch/x86/mm/fault.c
>> +++ b/arch/x86/mm/fault.c
>> @@ -33,6 +33,10 @@
>> #define CREATE_TRACE_POINTS
>> #include <asm/trace/exceptions.h>
>>
>> +bool (*kvm_page_fault_handler)(struct pt_regs *regs, unsigned long error_code,
>> + unsigned long address);
>> +EXPORT_SYMBOL(kvm_page_fault_handler);
>
> NAK NAK NAK NAK
>
> This is one of the biggest anti-patterns around.
I agree.
I think that mm should expose a mm_set_kvm_page_fault_handler() or something (give it a better name).
Similar to how arch/x86/kernel/irq.c has kvm_set_posted_intr_wakeup_handler().
-Liran
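Something along these lines, perhaps; the function name and the
synchronize_rcu() on unregister are assumptions, modeled on what
kvm_set_posted_intr_wakeup_handler() does:

/* In arch/x86/mm/fault.c (sketch): */
static bool (*kvm_page_fault_handler)(struct pt_regs *regs,
				      unsigned long error_code,
				      unsigned long address);

void mm_set_kvm_page_fault_handler(bool (*handler)(struct pt_regs *,
						   unsigned long,
						   unsigned long))
{
	if (handler) {
		kvm_page_fault_handler = handler;
	} else {
		kvm_page_fault_handler = NULL;
		/* Make sure no in-flight fault still uses the old handler. */
		synchronize_rcu();
	}
}
EXPORT_SYMBOL_GPL(mm_set_kvm_page_fault_handler);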
> On 14 May 2019, at 0:42, Nakajima, Jun <[email protected]> wrote:
>
>
>
>> On May 13, 2019, at 2:16 PM, Liran Alon <[email protected]> wrote:
>>
>>> On 13 May 2019, at 22:31, Nakajima, Jun <[email protected]> wrote:
>>>
>>> On 5/13/19, 7:43 AM, "[email protected] on behalf of Alexandre Chartre" wrote:
>>>
>>> Proposal
>>> ========
>>>
>>> To handle both these points, this series introduce the mechanism of KVM
>>> address space isolation. Note that this mechanism completes (a)+(b) and
>>> don't contradict. In case this mechanism is also applied, (a)+(b) should
>>> still be applied to the full virtual address space as a defence-in-depth).
>>>
>>> The idea is that most of KVM #VMExit handlers code will run in a special
>>> KVM isolated address space which maps only KVM required code and per-VM
>>> information. Only once KVM needs to architectually access other (sensitive)
>>> data, it will switch from KVM isolated address space to full standard
>>> host address space. At this point, KVM will also need to kick all sibling
>>> hyperthreads to get out of guest (note that kicking all sibling hyperthreads
>>> is not implemented in this serie).
>>>
>>> Basically, we will have the following flow:
>>>
>>> - qemu issues KVM_RUN ioctl
>>> - KVM handles the ioctl and calls vcpu_run():
>>> . KVM switches from the kernel address to the KVM address space
>>> . KVM transfers control to VM (VMLAUNCH/VMRESUME)
>>> . VM returns to KVM
>>> . KVM handles VM-Exit:
>>> . if handling need full kernel then switch to kernel address space
>>> . else continues with KVM address space
>>> . KVM loops in vcpu_run() or return
>>> - KVM_RUN ioctl returns
>>>
>>> So, the KVM_RUN core function will mainly be executed using the KVM address
>>> space. The handling of a VM-Exit can require access to the kernel space
>>> and, in that case, we will switch back to the kernel address space.
>>>
>>> Once all sibling hyperthreads are in the host (either using the full kernel address space or user address space), what happens to the other sibling hyperthreads if one of them tries to do VM entry? That VCPU will switch to the KVM address space prior to VM entry, but others continue to run? Do you think (a) + (b) would be sufficient for that case?
>>
>> The description here is missing and important part: When a hyperthread needs to switch from KVM isolated address space to kernel full address space, it should first kick all sibling hyperthreads outside of guest and only then safety switch to full kernel address space. Only once all sibling hyperthreads are running with KVM isolated address space, it is safe to enter guest.
>>
>
> Okay, it makes sense. So, it will require some synchronization among the siblings there.
Definitely.
Currently the kicking of sibling hyperthreads is not yet integrated with this patch series, but it should be at some point.
-Liran
>
>> The main point of this address space is to avoid kicking all sibling hyperthreads on *every* VMExit from guest but instead only kick them when switching address space. The assumption is that the vast majority of exits can be handled in KVM isolated address space and therefore do not require to kick the sibling hyperthreads outside of guest.
>
>
> ---
> Jun
> Intel Open Source Technology Center
On Mon, May 13, 2019 at 2:26 PM Liran Alon <[email protected]> wrote:
>
>
>
> > On 13 May 2019, at 18:15, Peter Zijlstra <[email protected]> wrote:
> >
> > On Mon, May 13, 2019 at 04:38:32PM +0200, Alexandre Chartre wrote:
> >> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> >> index 46df4c6..317e105 100644
> >> --- a/arch/x86/mm/fault.c
> >> +++ b/arch/x86/mm/fault.c
> >> @@ -33,6 +33,10 @@
> >> #define CREATE_TRACE_POINTS
> >> #include <asm/trace/exceptions.h>
> >>
> >> +bool (*kvm_page_fault_handler)(struct pt_regs *regs, unsigned long error_code,
> >> + unsigned long address);
> >> +EXPORT_SYMBOL(kvm_page_fault_handler);
> >
> > NAK NAK NAK NAK
> >
> > This is one of the biggest anti-patterns around.
>
> I agree.
> I think that mm should expose a mm_set_kvm_page_fault_handler() or something (give it a better name).
> Similar to how arch/x86/kernel/irq.c have kvm_set_posted_intr_wakeup_handler().
>
> -Liran
>
This sounds like a great use case for static_call(). PeterZ, do you
suppose we could wire up static_call() with the module infrastructure
to make it easy to do "static_call to such-and-such GPL module symbol
if that symbol is in a loaded module, else nop"?
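For reference, roughly what that could look like with the static_call()
API as it was being proposed at the time (the handler names are
illustrative):

/* Default handler that reports "fault not handled". */
static bool kvm_no_page_fault(struct pt_regs *regs, unsigned long error_code,
			      unsigned long address)
{
	return false;
}
DEFINE_STATIC_CALL(kvm_page_fault, kvm_no_page_fault);

/* In the #PF handler: */
	if (static_call(kvm_page_fault)(regs, error_code, address))
		return;

/* From the kvm module, on load and unload: */
	static_call_update(kvm_page_fault, kvm_isolation_page_fault);
	static_call_update(kvm_page_fault, kvm_no_page_fault);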
On Mon, May 13, 2019 at 2:09 PM Liran Alon <[email protected]> wrote:
>
>
>
> > On 13 May 2019, at 21:17, Andy Lutomirski <[email protected]> wrote:
> >
> >> I expect that the KVM address space can eventually be expanded to include
> >> the ioctl syscall entries. By doing so, and also adding the KVM page table
> >> to the process userland page table (which should be safe to do because the
> >> KVM address space doesn't have any secret), we could potentially handle the
> >> KVM ioctl without having to switch to the kernel pagetable (thus effectively
> >> eliminating KPTI for KVM). Then the only overhead would be if a VM-Exit has
> >> to be handled using the full kernel address space.
> >>
> >
> > In the hopefully common case where a VM exits and then gets re-entered
> > without needing to load full page tables, what code actually runs?
> > I'm trying to understand when the optimization of not switching is
> > actually useful.
> >
> > Allowing ioctl() without switching to kernel tables sounds...
> > extremely complicated. It also makes the dubious assumption that user
> > memory contains no secrets.
>
> Let me attempt to clarify what we were thinking when creating this patch series:
>
> 1) It is never safe to execute one hyperthread inside guest while it’s sibling hyperthread runs in a virtual address space which contains secrets of host or other guests.
> This is because we assume that using some speculative gadget (such as half-Spectrev2 gadget), it will be possible to populate *some* CPU core resource which could then be *somehow* leaked by the hyperthread running inside guest. In case of L1TF, this would be data populated to the L1D cache.
>
> 2) Because of (1), every time a hyperthread runs inside host kernel, we must make sure it’s sibling is not running inside guest. i.e. We must kick the sibling hyperthread outside of guest using IPI.
>
> 3) From (2), we should have theoretically deduced that for every #VMExit, there is a need to kick the sibling hyperthread also outside of guest until the #VMExit is completed. Such a patch series was implemented at some point but it had (obviously) significant performance hit.
>
>
> 4) The main goal of this patch series is to preserve (2), but to avoid
> the overhead specified in (3).
>
> The way this patch series achieves (4) is by observing that during the run of a VM, most #VMExits can be handled rather quickly and locally inside KVM and doesn’t need to reference any data that is not relevant to this VM or KVM code. Therefore, if we will run these #VMExits in an isolated virtual address space (i.e. KVM isolated address space), there is no need to kick the sibling hyperthread from guest while these #VMExits handlers run.
Thanks! This clarifies a lot of things.
> The hope is that the very vast majority of #VMExit handlers will be able to completely run without requiring to switch to full address space. Therefore, avoiding the performance hit of (2).
> However, for the very few #VMExits that does require to run in full kernel address space, we must first kick the sibling hyperthread outside of guest and only then switch to full kernel address space and only once all hyperthreads return to KVM address space, then allow then to enter into guest.
What exactly does "kick" mean in this context? It sounds like you're
going to need to be able to kick sibling VMs from extremely atomic
contexts like NMI and MCE.
On Mon, May 13, 2019 at 11:13:34AM -0700, Andy Lutomirski wrote:
> On Mon, May 13, 2019 at 9:28 AM Alexandre Chartre
> <[email protected]> wrote:
> > Actually, I am not sure this is effectively useful because the IRQ
> > handler is probably faulting before it tries to exit isolation, so
> > the isolation exit will be done by the kvm page fault handler. I need
> > to check that.
> >
>
> The whole idea of having #PF exit with a different CR3 than was loaded
> on entry seems questionable to me. I'd be a lot more comfortable with
> the whole idea if a page fault due to accessing the wrong data was an
> OOPS and the code instead just did the right thing directly.
So I've run into this idea before; it basically allows a lazy approach
to things.
I'm somewhat conflicted: on the one hand, changing CR3 from #PF is a natural
extension in that #PF already changes page-tables (for userspace / vmalloc
etc.); on the other hand, there's a thin line between being lazy and being
sloppy.
If we're going down this route, I think we need a very coherent design
and strong rules.
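For clarity, the lazy approach under discussion is essentially the following
(a sketch reconstructed from the cover letter's description of patches 24
and 25, not the actual code):

/* Called from the #PF handler when the fault happened on the KVM CR3. */
bool kvm_isolation_page_fault(struct pt_regs *regs, unsigned long error_code,
			      unsigned long address)
{
	/*
	 * Leave the isolated address space; the faulting access will then
	 * be retried with the full kernel page table.
	 */
	kvm_isolation_exit();
	return true;
}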
On Mon, May 13, 2019 at 11:18:41AM -0700, Andy Lutomirski wrote:
> On Mon, May 13, 2019 at 7:39 AM Alexandre Chartre
> <[email protected]> wrote:
> >
> > pcpu_base_addr is already mapped to the KVM address space, but this
> > represents the first percpu chunk. To access a per-cpu buffer not
> > allocated in the first chunk, add a function which maps all cpu
> > buffers corresponding to that per-cpu buffer.
> >
> > Also add function to clear page table entries for a percpu buffer.
> >
>
> This needs some kind of clarification so that readers can tell whether
> you're trying to map all percpu memory or just map a specific
> variable. In either case, you're making a dubious assumption that
> percpu memory contains no secrets.
I'm thinking the per-cpu random pool is a secret. IOW, it demonstrably
does contain secrets, invalidating that premise.
On Mon, May 13, 2019 at 07:02:30PM -0700, Andy Lutomirski wrote:
> This sounds like a great use case for static_call(). PeterZ, do you
> suppose we could wire up static_call() with the module infrastructure
> to make it easy to do "static_call to such-and-such GPL module symbol
> if that symbol is in a loaded module, else nop"?
You're basically asking it to do dynamic linking. And I suppose that is
technically possible.
However, I'm really starting to think kvm (or at least these parts of it
that want to play these games) had better not be a module anymore.
(please, wrap your emails at 78 chars)
On Tue, May 14, 2019 at 12:08:23AM +0300, Liran Alon wrote:
> 3) From (2), we should have theoretically deduced that for every
> #VMExit, there is a need to kick the sibling hyperthread also outside
> of guest until the #VMExit is completed.
That's not in fact quite true; all you have to do is send the IPI.
Having one sibling IPI the other sibling carries enough guarantees that
the receiving sibling will not execute any further guest instructions.
That is, you don't have to wait on the VMExit to complete; you can just
IPI and get on with things. Now, this is still expensive, but it is
heaps better than doing a full sync up between siblings.
On Mon, May 13, 2019 at 07:07:36PM -0700, Andy Lutomirski wrote:
> On Mon, May 13, 2019 at 2:09 PM Liran Alon <[email protected]> wrote:
> > The hope is that the very vast majority of #VMExit handlers will be
> > able to completely run without requiring to switch to full address
> > space. Therefore, avoiding the performance hit of (2).
> > However, for the very few #VMExits that does require to run in full
> > kernel address space, we must first kick the sibling hyperthread
> > outside of guest and only then switch to full kernel address space
> > and only once all hyperthreads return to KVM address space, then
> > allow then to enter into guest.
>
> What exactly does "kick" mean in this context? It sounds like you're
> going to need to be able to kick sibling VMs from extremely atomic
> contexts like NMI and MCE.
Yeah, doing the full synchronous thing from NMI/MCE context sounds
exceedingly dodgy, however..
Realistically they only need to send an IPI to the other sibling; they
don't need to wait for the VMExit to complete or anything else.
And that is something we can do from NMI context -- with a bit of care.
See also arch_irq_work_raise(); specifically we need to ensure we leave
the APIC in an idle state, such that if we interrupted an APIC sequence
it will not suddenly fail/violate the APIC write/state etc.
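Roughly something along these lines (just a sketch, not what the series
does; a real implementation would likely want a dedicated vector rather
than the reschedule IPI, and the NMI path needs the APIC care described
above):

static void kick_siblings(void)
{
	int cpu = smp_processor_id();
	int sibling;

	/* Send an IPI to each sibling HT; any external interrupt forces a
	 * VM-Exit (EXTERNAL_INTERRUPT), so there is no need to wait here. */
	for_each_cpu(sibling, topology_sibling_cpumask(cpu)) {
		if (sibling != cpu)
			smp_send_reschedule(sibling);
	}
}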
> On 14 May 2019, at 10:29, Peter Zijlstra <[email protected]> wrote:
>
>
> (please, wrap our emails at 78 chars)
>
> On Tue, May 14, 2019 at 12:08:23AM +0300, Liran Alon wrote:
>
>> 3) From (2), we should have theoretically deduced that for every
>> #VMExit, there is a need to kick the sibling hyperthread also outside
>> of guest until the #VMExit is completed.
>
> That's not in fact quite true; all you have to do is send the IPI.
> Having one sibling IPI the other sibling carries enough guarantees that
> the receiving sibling will not execute any further guest instructions.
>
> That is, you don't have to wait on the VMExit to complete; you can just
> IPI and get on with things. Now, this is still expensive, But it is
> heaps better than doing a full sync up between siblings.
>
I agree.
I didn’t say you need to do a full sync. You just need to IPI the sibling
hyperthreads before switching to the full kernel address space.
But you need to make sure these sibling hyperthreads don’t get back into
the guest until all hyperthreads are running with the KVM isolated address space.
It is still very expensive if done for every #VMExit, which, as I explained,
can be avoided if we use the KVM isolated address space technique.
-Liran
On 5/14/19 9:07 AM, Peter Zijlstra wrote:
> On Mon, May 13, 2019 at 11:13:34AM -0700, Andy Lutomirski wrote:
>> On Mon, May 13, 2019 at 9:28 AM Alexandre Chartre
>> <[email protected]> wrote:
>
>>> Actually, I am not sure this is effectively useful because the IRQ
>>> handler is probably faulting before it tries to exit isolation, so
>>> the isolation exit will be done by the kvm page fault handler. I need
>>> to check that.
>>>
>>
>> The whole idea of having #PF exit with a different CR3 than was loaded
>> on entry seems questionable to me. I'd be a lot more comfortable with
>> the whole idea if a page fault due to accessing the wrong data was an
>> OOPS and the code instead just did the right thing directly.
>
> So I've ran into this idea before; it basically allows a lazy approach
> to things.
>
> I'm somewhat conflicted on things, on the one hand, changing CR3 from
> #PF is a natural extention in that #PF already changes page-tables (for
> userspace / vmalloc etc..), on the other hand, there's a thin line
> between being lazy and being sloppy.
>
> If we're going down this route; I think we need a very coherent design
> and strong rules.
>
Right. We should particularly ensure that the KVM page-table remains a
subset of the kernel page-table; in particular, page-table changes (e.g.
for vmalloc etc.) should happen in the kernel page-table and not in
the kvm page-table.
So we should probably enforce switching to the kernel page-table when
doing operations like vmalloc. The current code doesn't enforce it, but
I can see it faulting when doing any allocation (because the kvm page
table doesn't map all the structures used during an allocation).
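For example, something like this (just a sketch; kvm_isolation() is
shorthand for "are we currently on the KVM page-table?", and the
enter/exit calls stand for the address-space switch helpers):

void *kvm_safe_vmalloc(unsigned long size)
{
	bool was_isolated = kvm_isolation();	/* shorthand, not necessarily the real helper */
	void *p;

	if (was_isolated)
		kvm_isolation_exit();		/* switch back to the full kernel page-table */
	p = vmalloc(size);
	if (was_isolated)
		kvm_isolation_enter();		/* return to the KVM page-table */
	return p;
}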
alex.
> On 14 May 2019, at 5:07, Andy Lutomirski <[email protected]> wrote:
>
> On Mon, May 13, 2019 at 2:09 PM Liran Alon <[email protected]> wrote:
>>
>>
>>
>>> On 13 May 2019, at 21:17, Andy Lutomirski <[email protected]> wrote:
>>>
>>>> I expect that the KVM address space can eventually be expanded to include
>>>> the ioctl syscall entries. By doing so, and also adding the KVM page table
>>>> to the process userland page table (which should be safe to do because the
>>>> KVM address space doesn't have any secret), we could potentially handle the
>>>> KVM ioctl without having to switch to the kernel pagetable (thus effectively
>>>> eliminating KPTI for KVM). Then the only overhead would be if a VM-Exit has
>>>> to be handled using the full kernel address space.
>>>>
>>>
>>> In the hopefully common case where a VM exits and then gets re-entered
>>> without needing to load full page tables, what code actually runs?
>>> I'm trying to understand when the optimization of not switching is
>>> actually useful.
>>>
>>> Allowing ioctl() without switching to kernel tables sounds...
>>> extremely complicated. It also makes the dubious assumption that user
>>> memory contains no secrets.
>>
>> Let me attempt to clarify what we were thinking when creating this patch series:
>>
>> 1) It is never safe to execute one hyperthread inside guest while it’s sibling hyperthread runs in a virtual address space which contains secrets of host or other guests.
>> This is because we assume that using some speculative gadget (such as half-Spectrev2 gadget), it will be possible to populate *some* CPU core resource which could then be *somehow* leaked by the hyperthread running inside guest. In case of L1TF, this would be data populated to the L1D cache.
>>
>> 2) Because of (1), every time a hyperthread runs inside host kernel, we must make sure it’s sibling is not running inside guest. i.e. We must kick the sibling hyperthread outside of guest using IPI.
>>
>> 3) From (2), we should have theoretically deduced that for every #VMExit, there is a need to kick the sibling hyperthread also outside of guest until the #VMExit is completed. Such a patch series was implemented at some point but it had (obviously) significant performance hit.
>>
>>
> 4) The main goal of this patch series is to preserve (2), but to avoid
> the overhead specified in (3).
>>
>> The way this patch series achieves (4) is by observing that during the run of a VM, most #VMExits can be handled rather quickly and locally inside KVM and doesn’t need to reference any data that is not relevant to this VM or KVM code. Therefore, if we will run these #VMExits in an isolated virtual address space (i.e. KVM isolated address space), there is no need to kick the sibling hyperthread from guest while these #VMExits handlers run.
>
> Thanks! This clarifies a lot of things.
>
>> The hope is that the very vast majority of #VMExit handlers will be able to completely run without requiring to switch to full address space. Therefore, avoiding the performance hit of (2).
>> However, for the very few #VMExits that does require to run in full kernel address space, we must first kick the sibling hyperthread outside of guest and only then switch to full kernel address space and only once all hyperthreads return to KVM address space, then allow then to enter into guest.
>
> What exactly does "kick" mean in this context? It sounds like you're
> going to need to be able to kick sibling VMs from extremely atomic
> contexts like NMI and MCE.
Yes, that’s true.
“kick” in this context will probably mean sending an IPI to all sibling hyperthreads.
This IPI will cause these sibling hyperthreads to exit from guest to host on EXTERNAL_INTERRUPT
and wait for a condition that again allows them to enter back into the guest.
This condition will be met once all hyperthreads of the CPU core are again running only within the KVM isolated address space of this VM.
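Conceptually, the condition a kicked sibling waits for is just something
like the following (illustrative pseudo-code only; the real bookkeeping
would have to be per-core and carefully synchronized):

/* one counter per CPU core, incremented when a hyperthread leaves the KVM
 * isolated address space and decremented when it returns to it */
static void wait_until_core_isolated(atomic_t *nr_siblings_in_full_kernel)
{
	while (atomic_read(nr_siblings_in_full_kernel) != 0)
		cpu_relax();
	/* now it is safe to VMENTER again */
}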
-Liran
On 5/14/19 9:09 AM, Peter Zijlstra wrote:
> On Mon, May 13, 2019 at 11:18:41AM -0700, Andy Lutomirski wrote:
>> On Mon, May 13, 2019 at 7:39 AM Alexandre Chartre
>> <[email protected]> wrote:
>>>
>>> pcpu_base_addr is already mapped to the KVM address space, but this
>>> represents the first percpu chunk. To access a per-cpu buffer not
>>> allocated in the first chunk, add a function which maps all cpu
>>> buffers corresponding to that per-cpu buffer.
>>>
>>> Also add function to clear page table entries for a percpu buffer.
>>>
>>
>> This needs some kind of clarification so that readers can tell whether
>> you're trying to map all percpu memory or just map a specific
>> variable. In either case, you're making a dubious assumption that
>> percpu memory contains no secrets.
>
> I'm thinking the per-cpu random pool is a secrit. IOW, it demonstrably
> does contain secrits, invalidating that premise.
>
The current code unconditionally maps the entire first percpu chunk
(pcpu_base_addr). So it assumes it doesn't contain any secret. That is
mainly a simplification for the POC because a lot of the core information
that we need, for example just to switch mm, is stored there (like
cpu_tlbstate, current_task...).
If the entire first percpu chunk effectively contains secrets then we will
need to individually map only the buffers we need. The kvm_copy_percpu_mapping()
function is added to copy the mapping for a specified percpu buffer; it is
used to map percpu buffers which are not in the first percpu chunk.
Also note that mapping is constrained by PTE granularity (4K), so mapped buffers
(percpu or not) which do not fill a whole set of pages can leak adjacent
data stored on the same pages.
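The logic is roughly the following (simplified sketch of
kvm_copy_percpu_mapping(), error handling aside; not necessarily the
exact code from the patch):

int kvm_copy_percpu_mapping(void *percpu_ptr, size_t size)
{
	int cpu, err;

	/* map each CPU's copy of the percpu buffer into the KVM page-table */
	for_each_possible_cpu(cpu) {
		err = kvm_copy_ptes(per_cpu_ptr(percpu_ptr, cpu), size);
		if (err)
			return err;
	}

	return 0;
}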
alex.
> On May 14, 2019, at 1:25 AM, Alexandre Chartre <[email protected]> wrote:
>
>
>> On 5/14/19 9:09 AM, Peter Zijlstra wrote:
>>> On Mon, May 13, 2019 at 11:18:41AM -0700, Andy Lutomirski wrote:
>>> On Mon, May 13, 2019 at 7:39 AM Alexandre Chartre
>>> <[email protected]> wrote:
>>>>
>>>> pcpu_base_addr is already mapped to the KVM address space, but this
>>>> represents the first percpu chunk. To access a per-cpu buffer not
>>>> allocated in the first chunk, add a function which maps all cpu
>>>> buffers corresponding to that per-cpu buffer.
>>>>
>>>> Also add function to clear page table entries for a percpu buffer.
>>>>
>>>
>>> This needs some kind of clarification so that readers can tell whether
>>> you're trying to map all percpu memory or just map a specific
>>> variable. In either case, you're making a dubious assumption that
>>> percpu memory contains no secrets.
>> I'm thinking the per-cpu random pool is a secrit. IOW, it demonstrably
>> does contain secrits, invalidating that premise.
>
> The current code unconditionally maps the entire first percpu chunk
> (pcpu_base_addr). So it assumes it doesn't contain any secret. That is
> mainly a simplification for the POC because a lot of core information
> that we need, for example just to switch mm, are stored there (like
> cpu_tlbstate, current_task...).
I don’t think you should need any of this.
>
> If the entire first percpu chunk effectively has secret then we will
> need to individually map only buffers we need. The kvm_copy_percpu_mapping()
> function is added to copy mapping for a specified percpu buffer, so
> this used to map percpu buffers which are not in the first percpu chunk.
>
> Also note that mapping is constrained by PTE (4K), so mapped buffers
> (percpu or not) which do not fill a whole set of pages can leak adjacent
> data store on the same pages.
>
>
I would take a different approach: figure out what you need and put it in its own dedicated area, kind of like cpu_entry_area.
One nasty issue you’ll have is vmalloc: the kernel stack is in the vmap range, and, if you allow access to vmap memory at all, you’ll need some way to ensure that *unmap* gets propagated. I suspect the right choice is to see if you can avoid using the kernel stack at all in isolated mode. Maybe you could run on the IRQ stack instead.
On 5/13/19 11:08 PM, Liran Alon wrote:
>
>
>> On 13 May 2019, at 21:17, Andy Lutomirski <[email protected]> wrote:
>>
>>> I expect that the KVM address space can eventually be expanded to include
>>> the ioctl syscall entries. By doing so, and also adding the KVM page table
>>> to the process userland page table (which should be safe to do because the
>>> KVM address space doesn't have any secret), we could potentially handle the
>>> KVM ioctl without having to switch to the kernel pagetable (thus effectively
>>> eliminating KPTI for KVM). Then the only overhead would be if a VM-Exit has
>>> to be handled using the full kernel address space.
>>>
>>
>> In the hopefully common case where a VM exits and then gets re-entered
>> without needing to load full page tables, what code actually runs?
>> I'm trying to understand when the optimization of not switching is
>> actually useful.
>>
>> Allowing ioctl() without switching to kernel tables sounds...
>> extremely complicated. It also makes the dubious assumption that user
>> memory contains no secrets.
>
> Let me attempt to clarify what we were thinking when creating this patch series:
>
> 1) It is never safe to execute one hyperthread inside guest while it’s sibling hyperthread runs in a virtual address space which contains secrets of host or other guests.
> This is because we assume that using some speculative gadget (such as half-Spectrev2 gadget), it will be possible to populate *some* CPU core resource which could then be *somehow* leaked by the hyperthread running inside guest. In case of L1TF, this would be data populated to the L1D cache.
>
> 2) Because of (1), every time a hyperthread runs inside host kernel, we must make sure it’s sibling is not running inside guest. i.e. We must kick the sibling hyperthread outside of guest using IPI.
>
> 3) From (2), we should have theoretically deduced that for every #VMExit, there is a need to kick the sibling hyperthread also outside of guest until the #VMExit is completed. Such a patch series was implemented at some point but it had (obviously) significant performance hit.
>
> 4) The main goal of this patch series is to preserve (2), but to avoid the overhead specified in (3).
>
> The way this patch series achieves (4) is by observing that during the run of a VM, most #VMExits can be handled rather quickly and locally inside KVM and doesn’t need to reference any data that is not relevant to this VM or KVM code. Therefore, if we will run these #VMExits in an isolated virtual address space (i.e. KVM isolated address space), there is no need to kick the sibling hyperthread from guest while these #VMExits handlers run.
> The hope is that the very vast majority of #VMExit handlers will be able to completely run without requiring to switch to full address space. Therefore, avoiding the performance hit of (2).
> However, for the very few #VMExits that does require to run in full kernel address space, we must first kick the sibling hyperthread outside of guest and only then switch to full kernel address space and only once all hyperthreads return to KVM address space, then allow then to enter into guest.
>
> From this reason, I think the above paragraph (that was added to my original cover letter) is incorrect.
Yes, I am wrong. The KVM page table can't be added to the process userland page
table because this can leak secrets from userland. I was only thinking about
performance, to reduce the number of context switches. So just forget that
paragraph :-)
alex.
> I believe that we should by design treat all exits to userspace VMM (e.g. QEMU) as slow-path that should not be optimised and therefore ok to switch address space (and therefore also kick sibling hyperthread). Similarly, all IOCTLs handlers are also slow-path and therefore it should be ok for them to also not run in KVM isolated address space.
>
> -Liran
>
On 5/14/19 10:34 AM, Andy Lutomirski wrote:
>
>
>> On May 14, 2019, at 1:25 AM, Alexandre Chartre <[email protected]> wrote:
>>
>>
>>> On 5/14/19 9:09 AM, Peter Zijlstra wrote:
>>>> On Mon, May 13, 2019 at 11:18:41AM -0700, Andy Lutomirski wrote:
>>>> On Mon, May 13, 2019 at 7:39 AM Alexandre Chartre
>>>> <[email protected]> wrote:
>>>>>
>>>>> pcpu_base_addr is already mapped to the KVM address space, but this
>>>>> represents the first percpu chunk. To access a per-cpu buffer not
>>>>> allocated in the first chunk, add a function which maps all cpu
>>>>> buffers corresponding to that per-cpu buffer.
>>>>>
>>>>> Also add function to clear page table entries for a percpu buffer.
>>>>>
>>>>
>>>> This needs some kind of clarification so that readers can tell whether
>>>> you're trying to map all percpu memory or just map a specific
>>>> variable. In either case, you're making a dubious assumption that
>>>> percpu memory contains no secrets.
>>> I'm thinking the per-cpu random pool is a secrit. IOW, it demonstrably
>>> does contain secrits, invalidating that premise.
>>
>> The current code unconditionally maps the entire first percpu chunk
>> (pcpu_base_addr). So it assumes it doesn't contain any secret. That is
>> mainly a simplification for the POC because a lot of core information
>> that we need, for example just to switch mm, are stored there (like
>> cpu_tlbstate, current_task...).
>
> I don’t think you should need any of this.
>
At the moment, the current code does need it. Otherwise it can't switch from
kvm mm to kernel mm: switch_mm_irqs_off() will fault accessing "cpu_tlbstate",
and then the page fault handler will fail accessing "current" before calling
the kvm page fault handler. So it will double fault or loop on page faults.
There are many different places where percpu variables are used, and I have
experienced many double fault/page fault loops because of that.
>>
>> If the entire first percpu chunk effectively has secret then we will
>> need to individually map only buffers we need. The kvm_copy_percpu_mapping()
>> function is added to copy mapping for a specified percpu buffer, so
>> this used to map percpu buffers which are not in the first percpu chunk.
>>
>> Also note that mapping is constrained by PTE (4K), so mapped buffers
>> (percpu or not) which do not fill a whole set of pages can leak adjacent
>> data store on the same pages.
>>
>>
>
> I would take a different approach: figure out what you need and put it in its
> own dedicated area, kind of like cpu_entry_area.
That's certainly something we can do, like Julian proposed with "Process-local
memory allocations": https://lkml.org/lkml/2018/11/22/1240
That's fine for buffers allocated from KVM; however, we will still need some
core kernel mappings so the thread can run and interrupts can be handled.
> One nasty issue you’ll have is vmalloc: the kernel stack is in the
> vmap range, and, if you allow access to vmap memory at all, you’ll
> need some way to ensure that *unmap* gets propagated. I suspect the
> right choice is to see if you can avoid using the kernel stack at all
> in isolated mode. Maybe you could run on the IRQ stack instead.
I am currently just copying the task stack mapping into the KVM page table
(patch 23) when a vcpu is created:
err = kvm_copy_ptes(tsk->stack, THREAD_SIZE);
And this seems to work. I am clearing the mapping when the VM vcpu is freed,
so I am making the assumption that the same task is used to create and free
a vcpu.
alex.
On 5/13/19 6:47 PM, Alexandre Chartre wrote:
>
>
> On 5/13/19 5:50 PM, Dave Hansen wrote:
>>> + /*
>>> + * Copy the mapping for all the kernel text. We copy at the PMD
>>> + * level since the PUD is shared with the module mapping space.
>>> + */
>>> + rv = kvm_copy_mapping((void *)__START_KERNEL_map, KERNEL_IMAGE_SIZE,
>>> + PGT_LEVEL_PMD);
>>> + if (rv)
>>> + goto out_uninit_page_table;
>>
>> Could you double-check this? We (I) have had some repeated confusion
>> with the PTI code and kernel text vs. kernel data vs. __init.
>> KERNEL_IMAGE_SIZE looks to be 512MB which is quite a bit bigger than
>> kernel text.
>
> I probably have the same confusion :-) but I will try to check again.
>
>
mm.txt says that kernel text is 512MB, and that's probably why I used
KERNEL_IMAGE_SIZE.
https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt
========================================================================================================================
Start addr | Offset | End addr | Size | VM area description
========================================================================================================================
[...]
ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0
[...]
However, vmlinux.lds.S does:
. = ASSERT((_end - _text <= KERNEL_IMAGE_SIZE),
"kernel image bigger than KERNEL_IMAGE_SIZE");
So KERNEL_IMAGE_SIZE covers everything between _text and _end, which includes
text, data, init and other sections.
The end of the text section is tagged with _etext. So the text section is
effectively (_etext - _text). This matches what efi_setup_page_tables()
uses to map the kernel text:
int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
{
	[...]
	npages = (_etext - _text) >> PAGE_SHIFT;
	text = __pa(_text);
	pfn = text >> PAGE_SHIFT;
	pf = _PAGE_RW | _PAGE_ENC;

	if (kernel_map_pages_in_pgd(pgd, pfn, text, npages, pf)) {
		pr_err("Failed to map kernel text 1:1\n");
		return 1;
	}
	[...]
}
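So presumably the kvm copy could be restricted to the actual text in the
same way, something like the line below (untested; it starts at _text
rather than __START_KERNEL_map, and copying at the PMD level would still
pull in any data sharing the same 2MB pages):

	rv = kvm_copy_mapping((void *)_text, _etext - _text, PGT_LEVEL_PMD);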
alex.
On Tue, May 14, 2019 at 2:42 AM Alexandre Chartre
<[email protected]> wrote:
>
>
> On 5/14/19 10:34 AM, Andy Lutomirski wrote:
> >
> >
> >> On May 14, 2019, at 1:25 AM, Alexandre Chartre <[email protected]> wrote:
> >>
> >>
> >>> On 5/14/19 9:09 AM, Peter Zijlstra wrote:
> >>>> On Mon, May 13, 2019 at 11:18:41AM -0700, Andy Lutomirski wrote:
> >>>> On Mon, May 13, 2019 at 7:39 AM Alexandre Chartre
> >>>> <[email protected]> wrote:
> >>>>>
> >>>>> pcpu_base_addr is already mapped to the KVM address space, but this
> >>>>> represents the first percpu chunk. To access a per-cpu buffer not
> >>>>> allocated in the first chunk, add a function which maps all cpu
> >>>>> buffers corresponding to that per-cpu buffer.
> >>>>>
> >>>>> Also add function to clear page table entries for a percpu buffer.
> >>>>>
> >>>>
> >>>> This needs some kind of clarification so that readers can tell whether
> >>>> you're trying to map all percpu memory or just map a specific
> >>>> variable. In either case, you're making a dubious assumption that
> >>>> percpu memory contains no secrets.
> >>> I'm thinking the per-cpu random pool is a secrit. IOW, it demonstrably
> >>> does contain secrits, invalidating that premise.
> >>
> >> The current code unconditionally maps the entire first percpu chunk
> >> (pcpu_base_addr). So it assumes it doesn't contain any secret. That is
> >> mainly a simplification for the POC because a lot of core information
> >> that we need, for example just to switch mm, are stored there (like
> >> cpu_tlbstate, current_task...).
> >
> > I don’t think you should need any of this.
> >
>
> At the moment, the current code does need it. Otherwise it can't switch from
> kvm mm to kernel mm: switch_mm_irqs_off() will fault accessing "cpu_tlbstate",
> and then the page fault handler will fail accessing "current" before calling
> the kvm page fault handler. So it will double fault or loop on page faults.
> There are many different places where percpu variables are used, and I have
> experienced many double fault/page fault loop because of that.
Now you're experiencing what working on the early PTI code was like :)
This is why I think you shouldn't touch current in any of this.
>
> >>
> >> If the entire first percpu chunk effectively has secret then we will
> >> need to individually map only buffers we need. The kvm_copy_percpu_mapping()
> >> function is added to copy mapping for a specified percpu buffer, so
> >> this used to map percpu buffers which are not in the first percpu chunk.
> >>
> >> Also note that mapping is constrained by PTE (4K), so mapped buffers
> >> (percpu or not) which do not fill a whole set of pages can leak adjacent
> >> data store on the same pages.
> >>
> >>
> >
> > I would take a different approach: figure out what you need and put it in its
> > own dedicated area, kind of like cpu_entry_area.
>
> That's certainly something we can do, like Julian proposed with "Process-local
> memory allocations": https://lkml.org/lkml/2018/11/22/1240
>
> That's fine for buffers allocated from KVM, however, we will still need some
> core kernel mappings so the thread can run and interrupts can be handled.
>
> > One nasty issue you’ll have is vmalloc: the kernel stack is in the
> > vmap range, and, if you allow access to vmap memory at all, you’ll
> > need some way to ensure that *unmap* gets propagated. I suspect the
> > right choice is to see if you can avoid using the kernel stack at all
> > in isolated mode. Maybe you could run on the IRQ stack instead.
>
> I am currently just copying the task stack mapping into the KVM page table
> (patch 23) when a vcpu is created:
>
> err = kvm_copy_ptes(tsk->stack, THREAD_SIZE);
>
> And this seems to work. I am clearing the mapping when the VM vcpu is freed,
> so I am making the assumption that the same task is used to create and free
> a vcpu.
>
vCPUs are bound to an mm but not a specific task, right? So I think
this is wrong in both directions.
Suppose a vCPU is created, then the task exits, the stack mapping gets
freed (the core code tries to avoid this, but it does happen), and a
new stack gets allocated at the same VA with different physical pages.
Now you're toast :) On the flip side, wouldn't you crash if a vCPU is
created and then run on a different thread?
How important is the ability to enable IRQs while running with the KVM
page tables?
On 5/14/19 9:21 AM, Peter Zijlstra wrote:
> On Mon, May 13, 2019 at 07:02:30PM -0700, Andy Lutomirski wrote:
>
>> This sounds like a great use case for static_call(). PeterZ, do you
>> suppose we could wire up static_call() with the module infrastructure
>> to make it easy to do "static_call to such-and-such GPL module symbol
>> if that symbol is in a loaded module, else nop"?
>
> You're basically asking it to do dynamic linking. And I suppose that is
> technically possible.
>
> However, I'm really starting to think kvm (or at least these parts of it
> that want to play these games) had better not be a module anymore.
>
Maybe we can use an atomic notifier (e.g. page_fault_notifier)?
alex.
> On May 14, 2019, at 8:36 AM, Alexandre Chartre <[email protected]> wrote:
>
>
>> On 5/14/19 9:21 AM, Peter Zijlstra wrote:
>>> On Mon, May 13, 2019 at 07:02:30PM -0700, Andy Lutomirski wrote:
>>> This sounds like a great use case for static_call(). PeterZ, do you
>>> suppose we could wire up static_call() with the module infrastructure
>>> to make it easy to do "static_call to such-and-such GPL module symbol
>>> if that symbol is in a loaded module, else nop"?
>> You're basically asking it to do dynamic linking. And I suppose that is
>> technically possible.
>> However, I'm really starting to think kvm (or at least these parts of it
>> that want to play these games) had better not be a module anymore.
>
> Maybe we can use an atomic notifier (e.g. page_fault_notifier)?
>
>
IMO that’s worse. I want to be able to read do_page_fault() and understand what happens and in what order.
Having do_page_fault run with the wrong CR3 is so fundamental to its operation that it needs to be very obvious what’s happening.
On 5/14/19 5:23 PM, Andy Lutomirski wrote:
> On Tue, May 14, 2019 at 2:42 AM Alexandre Chartre
> <[email protected]> wrote:
>>
>>
>> On 5/14/19 10:34 AM, Andy Lutomirski wrote:
>>>
>>>
>>>> On May 14, 2019, at 1:25 AM, Alexandre Chartre <[email protected]> wrote:
>>>>
>>>>
>>>>> On 5/14/19 9:09 AM, Peter Zijlstra wrote:
>>>>>> On Mon, May 13, 2019 at 11:18:41AM -0700, Andy Lutomirski wrote:
>>>>>> On Mon, May 13, 2019 at 7:39 AM Alexandre Chartre
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> pcpu_base_addr is already mapped to the KVM address space, but this
>>>>>>> represents the first percpu chunk. To access a per-cpu buffer not
>>>>>>> allocated in the first chunk, add a function which maps all cpu
>>>>>>> buffers corresponding to that per-cpu buffer.
>>>>>>>
>>>>>>> Also add function to clear page table entries for a percpu buffer.
>>>>>>>
>>>>>>
>>>>>> This needs some kind of clarification so that readers can tell whether
>>>>>> you're trying to map all percpu memory or just map a specific
>>>>>> variable. In either case, you're making a dubious assumption that
>>>>>> percpu memory contains no secrets.
>>>>> I'm thinking the per-cpu random pool is a secrit. IOW, it demonstrably
>>>>> does contain secrits, invalidating that premise.
>>>>
>>>> The current code unconditionally maps the entire first percpu chunk
>>>> (pcpu_base_addr). So it assumes it doesn't contain any secret. That is
>>>> mainly a simplification for the POC because a lot of core information
>>>> that we need, for example just to switch mm, are stored there (like
>>>> cpu_tlbstate, current_task...).
>>>
>>> I don’t think you should need any of this.
>>>
>>
>> At the moment, the current code does need it. Otherwise it can't switch from
>> kvm mm to kernel mm: switch_mm_irqs_off() will fault accessing "cpu_tlbstate",
>> and then the page fault handler will fail accessing "current" before calling
>> the kvm page fault handler. So it will double fault or loop on page faults.
>> There are many different places where percpu variables are used, and I have
>> experienced many double fault/page fault loop because of that.
>
> Now you're experiencing what working on the early PTI code was like :)
>
> This is why I think you shouldn't touch current in any of this.
>
>>
>>>>
>>>> If the entire first percpu chunk effectively has secret then we will
>>>> need to individually map only buffers we need. The kvm_copy_percpu_mapping()
>>>> function is added to copy mapping for a specified percpu buffer, so
>>>> this used to map percpu buffers which are not in the first percpu chunk.
>>>>
>>>> Also note that mapping is constrained by PTE (4K), so mapped buffers
>>>> (percpu or not) which do not fill a whole set of pages can leak adjacent
>>>> data store on the same pages.
>>>>
>>>>
>>>
>>> I would take a different approach: figure out what you need and put it in its
>>> own dedicated area, kind of like cpu_entry_area.
>>
>> That's certainly something we can do, like Julian proposed with "Process-local
>> memory allocations": https://lkml.org/lkml/2018/11/22/1240
>>
>> That's fine for buffers allocated from KVM, however, we will still need some
>> core kernel mappings so the thread can run and interrupts can be handled.
>>
>>> One nasty issue you’ll have is vmalloc: the kernel stack is in the
>>> vmap range, and, if you allow access to vmap memory at all, you’ll
>>> need some way to ensure that *unmap* gets propagated. I suspect the
>>> right choice is to see if you can avoid using the kernel stack at all
>>> in isolated mode. Maybe you could run on the IRQ stack instead.
>>
>> I am currently just copying the task stack mapping into the KVM page table
>> (patch 23) when a vcpu is created:
>>
>> err = kvm_copy_ptes(tsk->stack, THREAD_SIZE);
>>
>> And this seems to work. I am clearing the mapping when the VM vcpu is freed,
>> so I am making the assumption that the same task is used to create and free
>> a vcpu.
>>
>
> vCPUs are bound to an mm but not a specific task, right? So I think
> this is wrong in both directions.
>
I know, that was yet another shortcut for the POC: I assume there's a 1:1
mapping between a vCPU and a task, but I think that's fair with qemu.
> Suppose a vCPU is created, then the task exits, the stack mapping gets
> freed (the core code tries to avoid this, but it does happen), and a
> new stack gets allocated at the same VA with different physical pages.
> Now you're toast :) On the flip side, wouldn't you crash if a vCPU is
> created and then run on a different thread?
Yes, that's why I have a safety net: before entering KVM isolation I always
check that the current task is mapped in the KVM address space; if not, it
gets mapped.
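Roughly (sketch only; kvm_page_table_has() here is just shorthand for the
check, not a function from the series):

	/* safety net before switching to the KVM page-table */
	if (!kvm_page_table_has(current->stack, THREAD_SIZE))
		kvm_copy_ptes(current->stack, THREAD_SIZE);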
> How important is the ability to enable IRQs while running with the KVM
> page tables?
>
I can't say; I would need to check, but we probably need IRQs at least for
some timers. It sounds like you would really prefer IRQs to be disabled.
alex.
On Tue, May 14, 2019 at 06:24:48PM +0200, Alexandre Chartre wrote:
> On 5/14/19 5:23 PM, Andy Lutomirski wrote:
> > How important is the ability to enable IRQs while running with the KVM
> > page tables?
> >
>
> I can't say, I would need to check but we probably need IRQs at least for
> some timers. Sounds like you would really prefer IRQs to be disabled.
>
I think what amluto is getting at is:

again:
	local_irq_disable();
	switch_to_kvm_mm();
	/* do very little -- (A) */
	VMEnter()

	/* runs as guest */

	/* IRQ happens */
	VMExit()
	/* inspect exit raisin */
	if (/* IRQ pending */) {
		switch_from_kvm_mm();
		local_irq_restore();
		goto again;
	}

but I don't know anything about VMX/SVM at all, so the above might not
be feasible; specifically, I read something earlier in this thread about
how VMX allows NMIs around (A) where SVM did not -- or something like
that.
On Tue, May 14, 2019 at 07:05:22PM +0200, Peter Zijlstra wrote:
> On Tue, May 14, 2019 at 06:24:48PM +0200, Alexandre Chartre wrote:
> > On 5/14/19 5:23 PM, Andy Lutomirski wrote:
>
> > > How important is the ability to enable IRQs while running with the KVM
> > > page tables?
> > >
> >
> > I can't say, I would need to check but we probably need IRQs at least for
> > some timers. Sounds like you would really prefer IRQs to be disabled.
> >
>
> I think what amluto is getting at, is:
>
> again:
> local_irq_disable();
> switch_to_kvm_mm();
> /* do very little -- (A) */
> VMEnter()
>
> /* runs as guest */
>
> /* IRQ happens */
> WMExit()
> /* inspect exit raisin */
> if (/* IRQ pending */) {
> switch_from_kvm_mm();
> local_irq_restore();
> goto again;
> }
>
>
> but I don't know anything about VMX/SVM at all, so the above might not
> be feasible, specifically I read something about how VMX allows NMIs
> where SVM did not somewhere around (A) -- or something like that,
> earlier in this thread.
For IRQs it's somewhat feasible, but not for NMIs since NMIs are unblocked
on VMX immediately after VM-Exit, i.e. there's no way to prevent an NMI
from occurring while KVM's page tables are loaded.
Back to Andy's question about enabling IRQs, the answer is "it depends".
Exits due to INTR, NMI and #MC are considered high priority and are
serviced before re-enabling IRQs and preemption[1]. All other exits are
handled after IRQs and preemption are re-enabled.
A decent number of exit handlers are quite short, e.g. CPUID, most RDMSR
and WRMSR, any event-related exit, etc... But many exit handlers require
significantly longer flows, e.g. EPT violations (page faults) and anything
that requires extensive emulation, e.g. nested VMX. In short, leaving
IRQs disabled across all exits is not practical.
Before going down the path of figuring out how to handle the corner cases
regarding kvm_mm, I think it makes sense to pinpoint exactly what exits
are a) in the hot path for the use case (configuration) and b) can be
handled fast enough that they can run with IRQs disabled. Generating that
list might allow us to tightly bound the contents of kvm_mm and sidestep
many of the corner cases, i.e. select VM-Exits are handle with IRQs
disabled using KVM's mm, while "slow" VM-Exits go through the full context
switch.
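As a strawman, such a list might start out as small as the following
(purely illustrative, not a proposal; exit reason names are the VMX ones):

static bool vmx_exit_is_fast(u32 exit_reason)
{
	/* VM-Exits that might qualify for handling with IRQs disabled in
	 * the KVM mm; everything else falls back to the full context switch. */
	switch (exit_reason) {
	case EXIT_REASON_EXTERNAL_INTERRUPT:
	case EXIT_REASON_PENDING_INTERRUPT:
	case EXIT_REASON_CPUID:
	case EXIT_REASON_MSR_READ:
	case EXIT_REASON_MSR_WRITE:
		return true;
	default:
		return false;
	}
}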
[1] Technically, IRQs are actually enabled when SVM services INTR. SVM
hardware doesn't acknowledge the INTR/NMI on VM-Exit, but rather keeps
it pending until the event is unblocked, e.g. servicing a VM-Exit due
to an INTR is simply a matter of enabling IRQs.
On Tue, May 14, 2019 at 10:05 AM Peter Zijlstra <[email protected]> wrote:
>
> On Tue, May 14, 2019 at 06:24:48PM +0200, Alexandre Chartre wrote:
> > On 5/14/19 5:23 PM, Andy Lutomirski wrote:
>
> > > How important is the ability to enable IRQs while running with the KVM
> > > page tables?
> > >
> >
> > I can't say, I would need to check but we probably need IRQs at least for
> > some timers. Sounds like you would really prefer IRQs to be disabled.
> >
>
> I think what amluto is getting at, is:
>
> again:
> local_irq_disable();
> switch_to_kvm_mm();
> /* do very little -- (A) */
> VMEnter()
>
> /* runs as guest */
>
> /* IRQ happens */
> WMExit()
> /* inspect exit raisin */
> if (/* IRQ pending */) {
> switch_from_kvm_mm();
> local_irq_restore();
> goto again;
> }
>
What I'm getting at is that running the kernel without mapping the
whole kernel is a horrible, horrible thing to do. The less code we
can run like that, the better.
On Tue, May 14, 2019 at 11:09 AM Sean Christopherson
<[email protected]> wrote:
>
> On Tue, May 14, 2019 at 07:05:22PM +0200, Peter Zijlstra wrote:
> > On Tue, May 14, 2019 at 06:24:48PM +0200, Alexandre Chartre wrote:
> > > On 5/14/19 5:23 PM, Andy Lutomirski wrote:
> >
> > > > How important is the ability to enable IRQs while running with the KVM
> > > > page tables?
> > > >
> > >
> > > I can't say, I would need to check but we probably need IRQs at least for
> > > some timers. Sounds like you would really prefer IRQs to be disabled.
> > >
> >
> > I think what amluto is getting at, is:
> >
> > again:
> > local_irq_disable();
> > switch_to_kvm_mm();
> > /* do very little -- (A) */
> > VMEnter()
> >
> > /* runs as guest */
> >
> > /* IRQ happens */
> > WMExit()
> > /* inspect exit raisin */
> > if (/* IRQ pending */) {
> > switch_from_kvm_mm();
> > local_irq_restore();
> > goto again;
> > }
> >
> >
> > but I don't know anything about VMX/SVM at all, so the above might not
> > be feasible, specifically I read something about how VMX allows NMIs
> > where SVM did not somewhere around (A) -- or something like that,
> > earlier in this thread.
>
> For IRQs it's somewhat feasible, but not for NMIs since NMIs are unblocked
> on VMX immediately after VM-Exit, i.e. there's no way to prevent an NMI
> from occuring while KVM's page tables are loaded.
>
> Back to Andy's question about enabling IRQs, the answer is "it depends".
> Exits due to INTR, NMI and #MC are considered high priority and are
> serviced before re-enabling IRQs and preemption[1]. All other exits are
> handled after IRQs and preemption are re-enabled.
>
> A decent number of exit handlers are quite short, e.g. CPUID, most RDMSR
> and WRMSR, any event-related exit, etc... But many exit handlers require
> significantly longer flows, e.g. EPT violations (page faults) and anything
> that requires extensive emulation, e.g. nested VMX. In short, leaving
> IRQs disabled across all exits is not practical.
>
> Before going down the path of figuring out how to handle the corner cases
> regarding kvm_mm, I think it makes sense to pinpoint exactly what exits
> are a) in the hot path for the use case (configuration) and b) can be
> handled fast enough that they can run with IRQs disabled. Generating that
> list might allow us to tightly bound the contents of kvm_mm and sidestep
> many of the corner cases, i.e. select VM-Exits are handle with IRQs
> disabled using KVM's mm, while "slow" VM-Exits go through the full context
> switch.
I suspect that the context switch is a bit of a red herring. A
PCID-don't-flush CR3 write is IIRC under 300 cycles. Sure, it's slow,
but it's probably minor compared to the full cost of the vm exit. The
pain point is kicking the sibling thread.
When I worked on the PTI stuff, I went to great lengths to never have
a copy of the vmalloc page tables. The top-level entry is either
there or it isn't, so everything is always in sync. I'm sure it's
*possible* to populate just part of it for this KVM isolation, but
it's going to be ugly. It would be really nice if we could avoid it.
Unfortunately, this interacts unpleasantly with having the kernel
stack in there. We can freely use a different stack (the IRQ stack,
for example) as long as we don't schedule, but that means we can't run
preemptable code.
Another issue is tracing, kprobes, etc -- I don't think anyone will
like it if a kprobe in KVM either dramatically changes performance by
triggering isolation exits or by crashing. So you may need to
restrict the isolated code to a file that is compiled with tracing off
and has everything marked NOKPROBE. Yuck.
I hate to say this, but at what point do we declare that "if you have
SMT on, you get to keep both pieces, simultaneously!"?
On Tue, May 14, 2019 at 01:33:21PM -0700, Andy Lutomirski wrote:
> On Tue, May 14, 2019 at 11:09 AM Sean Christopherson
> <[email protected]> wrote:
> > For IRQs it's somewhat feasible, but not for NMIs since NMIs are unblocked
> > on VMX immediately after VM-Exit, i.e. there's no way to prevent an NMI
> > from occuring while KVM's page tables are loaded.
> >
> > Back to Andy's question about enabling IRQs, the answer is "it depends".
> > Exits due to INTR, NMI and #MC are considered high priority and are
> > serviced before re-enabling IRQs and preemption[1]. All other exits are
> > handled after IRQs and preemption are re-enabled.
> >
> > A decent number of exit handlers are quite short, e.g. CPUID, most RDMSR
> > and WRMSR, any event-related exit, etc... But many exit handlers require
> > significantly longer flows, e.g. EPT violations (page faults) and anything
> > that requires extensive emulation, e.g. nested VMX. In short, leaving
> > IRQs disabled across all exits is not practical.
> >
> > Before going down the path of figuring out how to handle the corner cases
> > regarding kvm_mm, I think it makes sense to pinpoint exactly what exits
> > are a) in the hot path for the use case (configuration) and b) can be
> > handled fast enough that they can run with IRQs disabled. Generating that
> > list might allow us to tightly bound the contents of kvm_mm and sidestep
> > many of the corner cases, i.e. select VM-Exits are handle with IRQs
> > disabled using KVM's mm, while "slow" VM-Exits go through the full context
> > switch.
>
> I suspect that the context switch is a bit of a red herring. A
> PCID-don't-flush CR3 write is IIRC under 300 cycles. Sure, it's slow,
> but it's probably minor compared to the full cost of the vm exit. The
> pain point is kicking the sibling thread.
Speaking of PCIDs, a separate mm for KVM would mean consuming another
ASID, which isn't good.
> When I worked on the PTI stuff, I went to great lengths to never have
> a copy of the vmalloc page tables. The top-level entry is either
> there or it isn't, so everything is always in sync. I'm sure it's
> *possible* to populate just part of it for this KVM isolation, but
> it's going to be ugly. It would be really nice if we could avoid it.
> Unfortunately, this interacts unpleasantly with having the kernel
> stack in there. We can freely use a different stack (the IRQ stack,
> for example) as long as we don't schedule, but that means we can't run
> preemptable code.
>
> Another issue is tracing, kprobes, etc -- I don't think anyone will
> like it if a kprobe in KVM either dramatically changes performance by
> triggering isolation exits or by crashing. So you may need to
> restrict the isolated code to a file that is compiled with tracing off
> and has everything marked NOKPROBE. Yuck.
Right, and all of the above is largely why I suggested compiling a list
of VM-Exits that "need" preferential treatment. If the cumulative amount
of code and data that needs to be accessed is tiny, then this might be
feasible. But if the goal is to be able to do things like handle IRQs
using the KVM mm, ouch.
> I hate to say this, but at what point do we declare that "if you have
> SMT on, you get to keep both pieces, simultaneously!"?
On 5/14/19 12:37 AM, Peter Zijlstra wrote:
> On Mon, May 13, 2019 at 07:07:36PM -0700, Andy Lutomirski wrote:
>> On Mon, May 13, 2019 at 2:09 PM Liran Alon <[email protected]> wrote:
>>> The hope is that the very vast majority of #VMExit handlers will be
>>> able to completely run without requiring to switch to full address
>>> space. Therefore, avoiding the performance hit of (2).
>>> However, for the very few #VMExits that does require to run in full
>>> kernel address space, we must first kick the sibling hyperthread
>>> outside of guest and only then switch to full kernel address space
>>> and only once all hyperthreads return to KVM address space, then
>>> allow then to enter into guest.
>> What exactly does "kick" mean in this context? It sounds like you're
>> going to need to be able to kick sibling VMs from extremely atomic
>> contexts like NMI and MCE.
> Yeah, doing the full synchronous thing from NMI/MCE context sounds
> exceedingly dodgy, howver..
>
> Realistically they only need to send an IPI to the other sibling; they
> don't need to wait for the VMExit to complete or anything else.
>
> And that is something we can do from NMI context -- with a bit of care.
> See also arch_irq_work_raise(); specifically we need to ensure we leave
> the APIC in an idle state, such that if we interrupted an APIC sequence
> it will not suddenly fail/violate the APIC write/state etc.
>
I've been experimenting with IPI'ing siblings on vmexit, primarily
because we know we'll need it if ASI turns out to be viable, but also
because I wanted to understand why previous experiments resulted in such
poor performance.
You're correct that you don't need to wait for the sibling to come out
once you send the IPI. That hardware thread will not do anything other
than process the IPI once it's sent. There is still some need for
synchronization, at least for the every vmexit case, since you always
want to make sure that one thread is actually doing work while the other
one is held. I have this working for some cases, but not enough to call
it a general solution. I'm not at all sure that the every vmexit case
can be made to perform for the general case. Even the non-general case
uses synchronization that I fear might be overly complex.
For the cases I do have working, simply not pinning the sibling when
we exit due to the guest idling is a big enough win to put performance
into a much more reasonable range.
Based on this, I believe that pinning a sibling HT in a subset of cases,
when we interact with the full kernel address space, is almost certainly
reasonable.
-jan
> On May 14, 2019, at 2:06 PM, Sean Christopherson <[email protected]> wrote:
>
>> On Tue, May 14, 2019 at 01:33:21PM -0700, Andy Lutomirski wrote:
>> On Tue, May 14, 2019 at 11:09 AM Sean Christopherson
>> <[email protected]> wrote:
>>> For IRQs it's somewhat feasible, but not for NMIs since NMIs are unblocked
>>> on VMX immediately after VM-Exit, i.e. there's no way to prevent an NMI
>>> from occuring while KVM's page tables are loaded.
>>>
>>> Back to Andy's question about enabling IRQs, the answer is "it depends".
>>> Exits due to INTR, NMI and #MC are considered high priority and are
>>> serviced before re-enabling IRQs and preemption[1]. All other exits are
>>> handled after IRQs and preemption are re-enabled.
>>>
>>> A decent number of exit handlers are quite short, e.g. CPUID, most RDMSR
>>> and WRMSR, any event-related exit, etc... But many exit handlers require
>>> significantly longer flows, e.g. EPT violations (page faults) and anything
>>> that requires extensive emulation, e.g. nested VMX. In short, leaving
>>> IRQs disabled across all exits is not practical.
>>>
>>> Before going down the path of figuring out how to handle the corner cases
>>> regarding kvm_mm, I think it makes sense to pinpoint exactly what exits
>>> are a) in the hot path for the use case (configuration) and b) can be
>>> handled fast enough that they can run with IRQs disabled. Generating that
>>> list might allow us to tightly bound the contents of kvm_mm and sidestep
>>> many of the corner cases, i.e. select VM-Exits are handle with IRQs
>>> disabled using KVM's mm, while "slow" VM-Exits go through the full context
>>> switch.
>>
>> I suspect that the context switch is a bit of a red herring. A
>> PCID-don't-flush CR3 write is IIRC under 300 cycles. Sure, it's slow,
>> but it's probably minor compared to the full cost of the vm exit. The
>> pain point is kicking the sibling thread.
>
> Speaking of PCIDs, a separate mm for KVM would mean consuming another
> ASID, which isn't good.
I’m not sure we care. We have many logical address spaces (two per mm plus a few more). We have 4096 PCIDs, but we only use ten or so. And we have some undocumented number of *physical* ASIDs with some undocumented mechanism by which PCID maps to a physical ASID.
I don’t suppose you know how many physical ASIDs we have? And how it interacts with the VPID stuff?
On Tue, May 14, 2019 at 02:55:18PM -0700, Andy Lutomirski wrote:
>
> > On May 14, 2019, at 2:06 PM, Sean Christopherson <[email protected]> wrote:
> >
> >> On Tue, May 14, 2019 at 01:33:21PM -0700, Andy Lutomirski wrote:
> >> I suspect that the context switch is a bit of a red herring. A
> >> PCID-don't-flush CR3 write is IIRC under 300 cycles. Sure, it's slow,
> >> but it's probably minor compared to the full cost of the vm exit. The
> >> pain point is kicking the sibling thread.
> >
> > Speaking of PCIDs, a separate mm for KVM would mean consuming another
> > ASID, which isn't good.
>
> I’m not sure we care. We have many logical address spaces (two per mm plus a
> few more). We have 4096 PCIDs, but we only use ten or so. And we have some
> undocumented number of *physical* ASIDs with some undocumented mechanism by
> which PCID maps to a physical ASID.
Yeah, I was referring to physical ASIDs.
> I don’t suppose you know how many physical ASIDs we have?
Limited number of physical ASIDs. I'll leave it at that so as not to
disclose something I shouldn't.
> And how it interacts with the VPID stuff?
VPID and PCID get factored into the final ASID, i.e. changing either one
results in a new ASID. The SDM's oblique way of saying that:
VPIDs and PCIDs (see Section 4.10.1) can be used concurrently. When this
is done, the processor associates cached information with both a VPID and
a PCID. Such information is used only if the current VPID and PCID both
match those associated with the cached information.
E.g. enabling PTI in both the host and guest consumes four ASIDs just to
run a single task in the guest:
- VPID=0, PCID=kernel
- VPID=0, PCID=user
- VPID=1, PCID=kernel
- VPID=1, PCID=user
The impact of consuming another ASID for KVM would likely depend on both
the guest and host configurations/workloads, e.g. if the guest is using a
lot of PCIDs then it's probably a moot point. It's something to keep in
mind though if we go down this path.
Thanks all for your replies and comments. I am trying to summarize main
feedback below, and define next steps.
But first, let me clarify what should happen when exiting the KVM isolated
address space (i.e. when we need to access the full kernel). There was
some confusion because this was not clearly described in the cover letter.
Thanks to Liran for this better explanation:
When a hyperthread needs to switch from the KVM isolated address space to
the full kernel address space, it should first kick all sibling hyperthreads
outside of the guest and only then safely switch to the full kernel address
space. Only once all sibling hyperthreads are running with the KVM isolated
address space is it safe to enter the guest again.
The main point of this address space is to avoid kicking all sibling
hyperthreads on *every* VMExit from the guest and instead only kick them when
switching address space. The assumption is that the vast majority of exits
can be handled in the KVM isolated address space and therefore do not require
kicking the sibling hyperthreads outside of the guest.
“kick” in this context means sending an IPI to all sibling hyperthreads.
This IPI will cause these sibling hyperthreads to exit from guest to host
on EXTERNAL_INTERRUPT and wait for a condition that again allows them to enter
back into the guest. This condition will be met once all hyperthreads of the
CPU core are again running only within the KVM isolated address space of this VM.
Feedback
========
Page-table Management
- Need to clean up the terminology: mm vs. page-table. It looks like we just need
a KVM page-table, not a KVM mm.
- Interfaces for creating and managing page-table should be provided by
the kernel, and not implemented in KVM. KVM shouldn't access kernel
low-level memory management functions.
KVM Isolation Enter/Exit
- Changing CR3 in #PF could be a natural extension as #PF can already
change page-tables, but we need a very coherent design and strong
rules.
- Reduce kernel code running without the whole kernel mapping to the
minimum.
- Avoid using current and task_struct while running with KVM page table.
- Ensure KVM page-table is not used with vmalloc.
- Try to avoid copying parts of the vmalloc page tables. This
interacts unpleasantly with having the kernel stack. We can freely
use a different stack (the IRQ stack, for example) as long as
we don't schedule, but that means we can't run preemptable code.
- Potential issues with tracing, kprobes... A solution would be to
compile the isolated code with tracing off.
- Better centralize KVM isolation exit on IRQ, NMI, MCE, faults...
Switch back to full kernel before switching to IRQ stack or
shortly after.
- Can we disable IRQ while running with KVM page-table?
For IRQs it's somewhat feasible, but not for NMIs since NMIs are
unblocked on VMX immediately after VM-Exit.
Exits due to INTR, NMI and #MC are considered high priority and are
serviced before re-enabling IRQs and preemption[1]. All other exits
are handled after IRQs and preemption are re-enabled.
A decent number of exit handlers are quite short, but many exit
handlers require significantly longer flows. In short, leaving
IRQs disabled across all exits is not practical.
It makes sense to pinpoint exactly what exits are:
a) in the hot path for the use case (configuration)
b) can be handled fast enough that they can run with IRQs disabled.
Generating that list might allow us to tightly bound the contents
of kvm_mm and sidestep many of the corner cases, i.e. select VM-Exits
are handled with IRQs disabled using KVM's mm, while "slow" VM-Exits
go through the full context switch.
KVM Page Table Content
- Check and reduce core mappings (kernel text size, cpu_entry_area,
espfix64, IRQ stack...)
- Check and reduce percpu mappings; percpu memory can contain secrets (e.g. the
percpu random pool)
Next Steps
==========
I will investigate Sean's suggestion to see which VM-Exits can be handled
fast enough so that they can run with IRQs disabled (fast VM-Exits),
and which slow VM-Exits are in the hot path.
So I will work on a new POC which just handles fast VM-Exits with IRQs
disabled. This should largely reduce the mappings required in the KVM page
table. I will also try to just have a KVM page-table and not a KVM mm.
After this new POC, we should be able to evaluate the need for handling
slow VM-Exits. And if there's an actual need, we can investigate how
to handle them with IRQs enabled.
Thanks,
alex.
On Tue, May 14, 2019 at 3:38 PM Sean Christopherson
<[email protected]> wrote:
> On Tue, May 14, 2019 at 02:55:18PM -0700, Andy Lutomirski wrote:
> > > On May 14, 2019, at 2:06 PM, Sean Christopherson <[email protected]> wrote:
> > >> On Tue, May 14, 2019 at 01:33:21PM -0700, Andy Lutomirski wrote:
> > >> I suspect that the context switch is a bit of a red herring. A
> > >> PCID-don't-flush CR3 write is IIRC under 300 cycles. Sure, it's slow,
> > >> but it's probably minor compared to the full cost of the vm exit. The
> > >> pain point is kicking the sibling thread.
> > >
> > > Speaking of PCIDs, a separate mm for KVM would mean consuming another
> > > ASID, which isn't good.
> >
> > I’m not sure we care. We have many logical address spaces (two per mm plus a
> > few more). We have 4096 PCIDs, but we only use ten or so. And we have some
> > undocumented number of *physical* ASIDs with some undocumented mechanism by
> > which PCID maps to a physical ASID.
>
> Yeah, I was referring to physical ASIDs.
>
> > I don’t suppose you know how many physical ASIDs we have?
>
> Limited number of physical ASIDs. I'll leave it at that so as not to
> disclose something I shouldn't.
>
> > And how it interacts with the VPID stuff?
>
> VPID and PCID get factored into the final ASID, i.e. changing either one
> results in a new ASID. The SDM's oblique way of saying that:
>
> VPIDs and PCIDs (see Section 4.10.1) can be used concurrently. When this
> is done, the processor associates cached information with both a VPID and
> a PCID. Such information is used only if the current VPID and PCID both
> match those associated with the cached information.
>
> E.g. enabling PTI in both the host and guest consumes four ASIDs just to
> run a single task in the guest:
>
> - VPID=0, PCID=kernel
> - VPID=0, PCID=user
> - VPID=1, PCID=kernel
> - VPID=1, PCID=user
>
> The impact of consuming another ASID for KVM would likely depend on both
> the guest and host configurations/worloads, e.g. if the guest is using a
> lot of PCIDs then it's probably a moot point. It's something to keep in
> mind though if we go down this path.
One answer to that would be to have the KVM page tables use the same
PCID as the normal user-mode PTI page tables. It's not ideal (since
the qemu/whatever process can see some kernel data via Meltdown that it
wouldn't normally be able to see), but might be an option to
investigate.
Cheers,
- jonathan