2019-07-11 14:31:01

by Alexandre Chartre

Subject: [RFC v2 00/27] Kernel Address Space Isolation

Hi,

This is version 2 of the "KVM Address Space Isolation" RFC. The code
has been completely changed compared to v1: it now provides a generic
kernel framework for Address Space Isolation, and KVM is now a simple
consumer of that framework. That's why the RFC title has been changed
from "KVM Address Space Isolation" to "Kernel Address Space Isolation".

Kernel Address Space Isolation aims to use address spaces to isolate some
parts of the kernel (for example KVM) to prevent leaking sensitive data
between hyper-threads under speculative execution attacks. You can refer
to the first version of this RFC for more context:

https://lkml.org/lkml/2019/5/13/515

The new code is still a proof of concept. It is much more stable than v1:
I am able to run a VM with a full OS (and also a nested VM) with multiple
vcpus. But it looks like there are still some corner cases which cause the
system to crash/hang.

I am looking for feedback about this new approach where address space
isolation is provided by the kernel, and KVM is just a consumer of this
new framework.


Changes
=======

- Address Space Isolation (ASI) is now provided as a kernel framework:
interfaces for creating and managing an ASI are provided by the kernel,
they are not implemented in KVM.

- An ASI is associated with a page-table; we don't use mm anymore. Entering
isolation is done by just updating CR3 to use the ASI page-table. Exiting
isolation restores CR3 with the CR3 value present before entering isolation.

- Isolation is exited at the beginning of any interrupt/exception handler,
and on context switch.

- Isolation doesn't disable interrupts, but if an interrupt occurs the
interrupt handler will exit isolation.

- The current stack is mapped when entering isolation and unmapped when
exiting isolation.

- The current task is not mapped by default, but there's an option to map it.
In such a case, the current task is mapped when entering isolation and
unmapped when exiting isolation.

- Kernel code mapped to the ASI page-table has been reduced to:
. the entire kernel (I still need to test with only the kernel text)
. the cpu entry area (because we need the GDT to be mapped)
. the cpu ASI session (for managing ASI)
. the current stack

- Optionally, an ASI can request the following kernel mappings to be added:
. the stack canary
. the cpu offsets (this_cpu_off)
. the current task
. RCU data (rcu_data)
. CPU HW events (cpu_hw_events).

All these optional mappings are used for KVM isolation.
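
To make the enter/exit behaviour described above concrete, here is a
minimal userspace sketch. It is only a model under stated assumptions: a
plain variable stands in for CR3, and every name prefixed with model_ is
hypothetical and not part of the series.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model: 'cr3' stands in for the CPU's CR3 register. */
struct model_cpu {
	unsigned long cr3;		/* currently loaded page-table */
	unsigned long original_cr3;	/* CR3 saved when entering isolation */
	bool isolated;			/* is isolation active on this CPU? */
};

/* Enter isolation: save the current CR3, then load the ASI page-table. */
static void model_asi_enter(struct model_cpu *cpu, unsigned long asi_cr3)
{
	if (cpu->isolated)
		return;
	cpu->original_cr3 = cpu->cr3;
	cpu->isolated = true;
	cpu->cr3 = asi_cr3;
}

/* Exit isolation: restore the CR3 value saved on entry. */
static void model_asi_exit(struct model_cpu *cpu)
{
	if (!cpu->isolated)
		return;
	cpu->cr3 = cpu->original_cr3;
	cpu->original_cr3 = 0;
	cpu->isolated = false;
}

/* An interrupt/exception handler leaves isolation before doing anything. */
static void model_interrupt(struct model_cpu *cpu)
{
	model_asi_exit(cpu);
	/* the handler body would now run on the kernel page-table */
}

/* Usage: enter isolation, take an interrupt, end up on the kernel CR3. */
static void model_cr3_demo(void)
{
	struct model_cpu cpu = { .cr3 = 0x1000 };

	model_asi_enter(&cpu, 0x2000);
	assert(cpu.isolated && cpu.cr3 == 0x2000);

	model_interrupt(&cpu);
	assert(!cpu.isolated && cpu.cr3 == 0x1000);
}
```

The model leaves out the stack and task mappings that the real code sets up
before switching CR3.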


Patches:
========

The proposed patches provide a framework for creating an Address Space
Isolation (ASI) (represented by a struct asi). The ASI has a page-table which
can be populated by copying mappings from the kernel page-table. The ASI can
then be entered/exited by switching between the kernel page-table and the
ASI page-table. In addition, any interrupt, exception or context switch
will automatically abort and exit the isolation. Finally, the last patches
use the ASI framework to implement KVM isolation.

- 01-03: Core of the ASI framework: create/destroy ASI, enter/exit/abort
isolation, ASI page-fault handler.

- 04-14: Functions to manage, populate and clear an ASI page-table.

- 15-20: ASI core mappings and optional mappings.

- 21: Make functions to read cr3/cr4 ASI aware

- 22-26: Use ASI in KVM to provide isolation for VMExit handlers.


API Overview:
=============
Here is a short description of the main ASI functions provided by the framework.

struct asi *asi_create(int map_flags)

Create an Address Space Isolation (ASI). map_flags can be used to specify
optional kernel mappings to be added to the ASI page-table (for example,
ASI_MAP_STACK_CANARY to map the stack canary).


void asi_destroy(struct asi *asi)

Destroy an ASI.


int asi_enter(struct asi *asi)

Enter isolation for the specified ASI. This switches from the kernel page-table
to the page-table associated with the ASI.


void asi_exit(struct asi *asi)

Exit isolation for the specified ASI. This switches back to the kernel
page-table.


int asi_map(struct asi *asi, void *ptr, unsigned long size);

Copy kernel mapping to the specified ASI page-table.


void asi_unmap(struct asi *asi, void *ptr);

Clear kernel mapping from the specified ASI page-table.
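
Note that asi_unmap() takes only a pointer, which suggests that mapped
ranges are tracked per ASI. The sketch below models that bookkeeping in
userspace. It is an assumption-laden illustration, not the series'
implementation: the model_ names, the fixed-size table and the
model_asi_is_mapped() helper are all hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_RANGES 8	/* arbitrary limit for this sketch */

struct model_range {
	void *ptr;
	unsigned long size;
};

/* Hypothetical stand-in for struct asi: it only tracks mapped VA ranges. */
struct model_asi {
	struct model_range ranges[MODEL_MAX_RANGES];
	size_t count;
};

/* Record a range, where asi_map() would copy its kernel mappings. */
static int model_asi_map(struct model_asi *asi, void *ptr, unsigned long size)
{
	if (!ptr || !size)
		return -EINVAL;
	if (asi->count == MODEL_MAX_RANGES)
		return -ENOMEM;
	asi->ranges[asi->count].ptr = ptr;
	asi->ranges[asi->count].size = size;
	asi->count++;
	return 0;
}

/* Forget the range starting at ptr; its size comes from the record. */
static void model_asi_unmap(struct model_asi *asi, void *ptr)
{
	size_t i;

	for (i = 0; i < asi->count; i++) {
		if (asi->ranges[i].ptr == ptr) {
			asi->ranges[i] = asi->ranges[asi->count - 1];
			asi->count--;
			return;
		}
	}
}

/* Is addr inside a range currently recorded in the model ASI? */
static int model_asi_is_mapped(const struct model_asi *asi, const void *addr)
{
	uintptr_t p = (uintptr_t)addr;
	size_t i;

	for (i = 0; i < asi->count; i++) {
		uintptr_t start = (uintptr_t)asi->ranges[i].ptr;

		if (p >= start && p - start < asi->ranges[i].size)
			return 1;
	}
	return 0;
}

/* Usage: map a buffer, check an address inside it, then unmap it. */
static void model_map_demo(void)
{
	static char buf[64];
	struct model_asi asi = { .count = 0 };

	assert(model_asi_map(&asi, buf, sizeof(buf)) == 0);
	assert(model_asi_is_mapped(&asi, buf + 10));

	model_asi_unmap(&asi, buf);
	assert(!model_asi_is_mapped(&asi, buf + 10));
}
```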


----
Alexandre Chartre (23):
mm/x86: Introduce kernel address space isolation
mm/asi: Abort isolation on interrupt, exception and context switch
mm/asi: Handle page fault due to address space isolation
mm/asi: Functions to track buffers allocated for an ASI page-table
mm/asi: Add ASI page-table entry offset functions
mm/asi: Add ASI page-table entry allocation functions
mm/asi: Add ASI page-table entry set functions
mm/asi: Functions to populate an ASI page-table from a VA range
mm/asi: Helper functions to map module into ASI
mm/asi: Keep track of VA ranges mapped in ASI page-table
mm/asi: Functions to clear ASI page-table entries for a VA range
mm/asi: Function to copy page-table entries for percpu buffer
mm/asi: Add asi_remap() function
mm/asi: Handle ASI mapped range leaks and overlaps
mm/asi: Initialize the ASI page-table with core mappings
mm/asi: Option to map current task into ASI
rcu: Move tree.h static forward declarations to tree.c
rcu: Make percpu rcu_data non-static
mm/asi: Add option to map RCU data
mm/asi: Add option to map cpu_hw_events
mm/asi: Make functions to read cr3/cr4 ASI aware
KVM: x86/asi: Populate the KVM ASI page-table
KVM: x86/asi: Map KVM memslots and IO buses into KVM ASI

Liran Alon (3):
KVM: x86/asi: Introduce address_space_isolation module parameter
KVM: x86/asi: Introduce KVM address space isolation
KVM: x86/asi: Switch to KVM address space on entry to guest

arch/x86/entry/entry_64.S | 42 ++-
arch/x86/include/asm/asi.h | 237 ++++++++
arch/x86/include/asm/mmu_context.h | 20 +-
arch/x86/include/asm/tlbflush.h | 10 +
arch/x86/kernel/asm-offsets.c | 4 +
arch/x86/kvm/Makefile | 3 +-
arch/x86/kvm/mmu.c | 2 +-
arch/x86/kvm/vmx/isolation.c | 231 ++++++++
arch/x86/kvm/vmx/vmx.c | 14 +-
arch/x86/kvm/vmx/vmx.h | 24 +
arch/x86/kvm/x86.c | 68 +++-
arch/x86/kvm/x86.h | 1 +
arch/x86/mm/Makefile | 2 +
arch/x86/mm/asi.c | 459 +++++++++++++++
arch/x86/mm/asi_pagetable.c | 1077 ++++++++++++++++++++++++++++++++++++
arch/x86/mm/fault.c | 7 +
include/linux/kvm_host.h | 7 +
kernel/rcu/tree.c | 56 ++-
kernel/rcu/tree.h | 56 +--
kernel/sched/core.c | 4 +
security/Kconfig | 10 +
21 files changed, 2269 insertions(+), 65 deletions(-)
create mode 100644 arch/x86/include/asm/asi.h
create mode 100644 arch/x86/kvm/vmx/isolation.c
create mode 100644 arch/x86/mm/asi.c
create mode 100644 arch/x86/mm/asi_pagetable.c


2019-07-11 14:31:11

by Alexandre Chartre

Subject: [RFC v2 23/26] KVM: x86/asi: Introduce KVM address space isolation

From: Liran Alon <[email protected]>

Create a separate address space for KVM that will be active when
KVM #VMExit handlers run, up until the point at which we architecturally
need to access host (or other VM) sensitive data.

This patch just creates the address space using address space
isolation (ASI) but does not make it active yet. That will be done
by subsequent commits.

Signed-off-by: Liran Alon <[email protected]>
Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kvm/vmx/isolation.c | 58 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 7 ++++-
arch/x86/kvm/vmx/vmx.h | 3 ++
include/linux/kvm_host.h | 5 +++
4 files changed, 72 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/vmx/isolation.c b/arch/x86/kvm/vmx/isolation.c
index e25f663..644d8d3 100644
--- a/arch/x86/kvm/vmx/isolation.c
+++ b/arch/x86/kvm/vmx/isolation.c
@@ -7,6 +7,15 @@

#include <linux/module.h>
#include <linux/moduleparam.h>
+#include <linux/printk.h>
+#include <asm/asi.h>
+#include <asm/vmx.h>
+
+#include "vmx.h"
+#include "x86.h"
+
+#define VMX_ASI_MAP_FLAGS \
+ (ASI_MAP_STACK_CANARY | ASI_MAP_CPU_PTR | ASI_MAP_CURRENT_TASK)

/*
* When set to true, KVM #VMExit handlers run in isolated address space
@@ -24,3 +33,52 @@
*/
static bool __read_mostly address_space_isolation;
module_param(address_space_isolation, bool, 0444);
+
+static int vmx_isolation_init_mapping(struct asi *asi, struct vcpu_vmx *vmx)
+{
+ /* TODO: Populate the KVM ASI page-table */
+
+ return 0;
+}
+
+int vmx_isolation_init(struct vcpu_vmx *vmx)
+{
+ struct kvm_vcpu *vcpu = &vmx->vcpu;
+ struct asi *asi;
+ int err;
+
+ if (!address_space_isolation) {
+ vcpu->asi = NULL;
+ return 0;
+ }
+
+ asi = asi_create(VMX_ASI_MAP_FLAGS);
+ if (!asi) {
+ pr_debug("KVM: x86: Failed to create address space isolation\n");
+ return -ENXIO;
+ }
+
+ err = vmx_isolation_init_mapping(asi, vmx);
+ if (err) {
+ asi_destroy(asi);
+ vcpu->asi = NULL;
+ return err;
+ }
+
+ vcpu->asi = asi;
+ pr_info("KVM: x86: Running with isolated address space\n");
+
+ return 0;
+}
+
+void vmx_isolation_uninit(struct vcpu_vmx *vmx)
+{
+ struct kvm_vcpu *vcpu = &vmx->vcpu;
+
+ if (!address_space_isolation || !vcpu->asi)
+ return;
+
+ asi_destroy(vcpu->asi);
+ vcpu->asi = NULL;
+ pr_info("KVM: x86: End of isolated address space\n");
+}
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d98eac3..9b92467 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -202,7 +202,7 @@
};

#define L1D_CACHE_ORDER 4
-static void *vmx_l1d_flush_pages;
+void *vmx_l1d_flush_pages;

static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
{
@@ -6561,6 +6561,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

+ vmx_isolation_uninit(vmx);
if (enable_pml)
vmx_destroy_pml_buffer(vmx);
free_vpid(vmx->vpid);
@@ -6672,6 +6673,10 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)

vmx->ept_pointer = INVALID_PAGE;

+ err = vmx_isolation_init(vmx);
+ if (err)
+ goto free_vmcs;
+
return &vmx->vcpu;

free_vmcs:
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 61128b4..09c1593 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -525,4 +525,7 @@ static inline void decache_tsc_multiplier(struct vcpu_vmx *vmx)

void dump_vmcs(void);

+int vmx_isolation_init(struct vcpu_vmx *vmx);
+void vmx_isolation_uninit(struct vcpu_vmx *vmx);
+
#endif /* __KVM_X86_VMX_H */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d1ad38a..2a9d073 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -34,6 +34,7 @@
#include <linux/kvm_types.h>

#include <asm/kvm_host.h>
+#include <asm/asi.h>

#ifndef KVM_MAX_VCPU_ID
#define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
@@ -320,6 +321,10 @@ struct kvm_vcpu {
bool preempted;
struct kvm_vcpu_arch arch;
struct dentry *debugfs_dentry;
+
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ struct asi *asi;
+#endif
};

static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
--
1.7.1
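
The VMX_ASI_MAP_FLAGS mask in this patch combines three optional mappings.
As a rough illustration of how such a mask can select entries from an
option table, here is a minimal userspace sketch. The flag values are the
ones defined in asm/asi.h by this series; the model_ table, its "what"
field and the counting helper are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values as defined in asm/asi.h by this series. */
#define ASI_MAP_STACK_CANARY	0x01
#define ASI_MAP_CPU_PTR		0x02
#define ASI_MAP_CURRENT_TASK	0x04

#define VMX_ASI_MAP_FLAGS \
	(ASI_MAP_STACK_CANARY | ASI_MAP_CPU_PTR | ASI_MAP_CURRENT_TASK)

/* Hypothetical option entry: a flag and a label for what it maps. */
struct model_map_option {
	int flag;
	const char *what;
};

static const struct model_map_option model_options[] = {
	{ ASI_MAP_STACK_CANARY, "fixed_percpu_data (stack canary)" },
	{ ASI_MAP_CPU_PTR,      "this_cpu_off" },
	{ ASI_MAP_CURRENT_TASK, "current_task" },
};

#define MODEL_NR_OPTIONS (sizeof(model_options) / sizeof(model_options[0]))

/* Count how many optional mappings a flag set selects. */
static int model_count_selected(int flags)
{
	size_t i;
	int n = 0;

	for (i = 0; i < MODEL_NR_OPTIONS; i++) {
		if (flags & model_options[i].flag)
			n++;
	}
	return n;
}

/* Usage: no flags select nothing, the VMX mask selects all three. */
static void model_flags_demo(void)
{
	assert(model_count_selected(0) == 0);
	assert(model_count_selected(ASI_MAP_CPU_PTR) == 1);
	assert(model_count_selected(VMX_ASI_MAP_FLAGS) == 3);
}
```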

2019-07-11 14:31:15

by Alexandre Chartre

Subject: [RFC v2 21/26] mm/asi: Make functions to read cr3/cr4 ASI aware

When address space isolation is active, cpu_tlbstate isn't necessarily
mapped in the ASI page-table, so accessing it would cause an ASI fault.
Instead of just mapping cpu_tlbstate, update __get_current_cr3_fast() and
cr4_read_shadow() to use cr3/cr4 values cached in the ASI session
when ASI is active.

Note that the cached cr3 value is the ASI cr3 value (i.e. the current
CR3 value when ASI is active). The cached cr4 value is the cr4 value
when isolation was entered (ASI doesn't change cr4).

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/include/asm/asi.h | 2 ++
arch/x86/include/asm/mmu_context.h | 20 ++++++++++++++++++--
arch/x86/include/asm/tlbflush.h | 10 ++++++++++
arch/x86/mm/asi.c | 3 +++
4 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index f489551..07c2b50 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -73,7 +73,9 @@ struct asi_session {
enum asi_session_state state; /* state of ASI session */
bool retry_abort; /* always retry abort */
unsigned int abort_depth; /* abort depth */
+ unsigned long isolation_cr3; /* cr3 when ASI is active */
unsigned long original_cr3; /* cr3 before entering ASI */
+ unsigned long original_cr4; /* cr4 before entering ASI */
struct task_struct *task; /* task during isolation */
} __aligned(PAGE_SIZE);

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 9024236..8cec983 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -14,6 +14,7 @@
#include <asm/paravirt.h>
#include <asm/mpx.h>
#include <asm/debugreg.h>
+#include <asm/asi.h>

extern atomic64_t last_mm_ctx_id;

@@ -347,8 +348,23 @@ static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
*/
static inline unsigned long __get_current_cr3_fast(void)
{
- unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
- this_cpu_read(cpu_tlbstate.loaded_mm_asid));
+ unsigned long cr3;
+
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ /*
+ * If isolation is active, cpu_tlbstate isn't necessarily mapped
+ * in the ASI page-table (and it doesn't have the current pgd anyway).
+ * The current CR3 is cached in the CPU ASI session.
+ */
+ if (this_cpu_read(cpu_asi_session.state) == ASI_SESSION_STATE_ACTIVE)
+ cr3 = this_cpu_read(cpu_asi_session.isolation_cr3);
+ else
+ cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
+ this_cpu_read(cpu_tlbstate.loaded_mm_asid));
+#else
+ cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
+ this_cpu_read(cpu_tlbstate.loaded_mm_asid));
+#endif

/* For now, be very restrictive about when this can be called. */
VM_WARN_ON(in_nmi() || preemptible());
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dee3758..917f9a5 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -12,6 +12,7 @@
#include <asm/invpcid.h>
#include <asm/pti.h>
#include <asm/processor-flags.h>
+#include <asm/asi.h>

/*
* The x86 feature is called PCID (Process Context IDentifier). It is similar
@@ -324,6 +325,15 @@ static inline void cr4_toggle_bits_irqsoff(unsigned long mask)
/* Read the CR4 shadow. */
static inline unsigned long cr4_read_shadow(void)
{
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ /*
+ * If isolation is active, cpu_tlbstate isn't necessarily mapped
+ * in the ASI page-table. The CR4 value is cached in the CPU
+ * ASI session.
+ */
+ if (this_cpu_read(cpu_asi_session.state) == ASI_SESSION_STATE_ACTIVE)
+ return this_cpu_read(cpu_asi_session.original_cr4);
+#endif
return this_cpu_read(cpu_tlbstate.cr4);
}

diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index d488704..4a5a4ba 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -23,6 +23,7 @@

/* ASI sessions, one per cpu */
DEFINE_PER_CPU_PAGE_ALIGNED(struct asi_session, cpu_asi_session);
+EXPORT_SYMBOL(cpu_asi_session);

struct asi_map_option {
int flag;
@@ -291,6 +292,8 @@ int asi_enter(struct asi *asi)
goto err_unmap_task;
}
asi_session->original_cr3 = original_cr3;
+ asi_session->original_cr4 = cr4_read_shadow();
+ asi_session->isolation_cr3 = __sme_pa(asi->pgd);

/*
* Use ASI barrier as we are setting CR3 with the ASI page-table.
--
1.7.1
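
The idea of this patch can be sketched in userspace as follows: while the
session is active, reads are served from values cached in the session and
the (possibly unmapped) TLB state is never touched. All crm_ names are
hypothetical stand-ins for cpu_asi_session and cpu_tlbstate, and the
"reads" counter only exists to make the behaviour observable.

```c
#include <assert.h>

enum crm_state { CRM_INACTIVE, CRM_ACTIVE };

/* Hypothetical stand-in for the per-cpu ASI session. */
struct crm_session {
	enum crm_state state;
	unsigned long isolation_cr3;	/* cr3 while ASI is active */
	unsigned long original_cr4;	/* cr4 saved on entry */
};

/* Hypothetical stand-in for cpu_tlbstate. */
struct crm_tlbstate {
	unsigned long cr3;	/* what build_cr3() would compute */
	unsigned long cr4;	/* cr4 shadow */
	int reads;		/* how often this state was accessed */
};

static unsigned long crm_read_cr3(const struct crm_session *s,
				  struct crm_tlbstate *t)
{
	if (s->state == CRM_ACTIVE)
		return s->isolation_cr3;	/* tlbstate is not touched */
	t->reads++;
	return t->cr3;
}

static unsigned long crm_read_cr4(const struct crm_session *s,
				  struct crm_tlbstate *t)
{
	if (s->state == CRM_ACTIVE)
		return s->original_cr4;		/* tlbstate is not touched */
	t->reads++;
	return t->cr4;
}

/* Usage: cache the values on entry, then read them while active. */
static void crm_demo(void)
{
	struct crm_tlbstate t = { .cr3 = 0x1000, .cr4 = 0x20 };
	struct crm_session s = { .state = CRM_INACTIVE };

	assert(crm_read_cr3(&s, &t) == 0x1000);
	assert(t.reads == 1);

	/* entering isolation caches both values in the session */
	s.original_cr4 = crm_read_cr4(&s, &t);
	s.isolation_cr3 = 0x2000;
	s.state = CRM_ACTIVE;

	assert(crm_read_cr3(&s, &t) == 0x2000);
	assert(crm_read_cr4(&s, &t) == 0x20);
	assert(t.reads == 2);	/* no tlbstate access while active */
}
```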

2019-07-11 14:32:20

by Alexandre Chartre

Subject: [RFC v2 04/26] mm/asi: Functions to track buffers allocated for an ASI page-table

Add functions to track buffers allocated for an ASI page-table. An ASI
page-table can have direct references to the kernel page table, at
different levels (PGD, P4D, PUD, PMD). When freeing an ASI page-table,
we should make sure that we free parts actually allocated for the ASI
page-table, and not parts of the kernel page table referenced from the
ASI page-table. To do so, we will keep track of buffers when building
the ASI page-table.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/include/asm/asi.h | 26 +++++++++++
arch/x86/mm/Makefile | 2 +-
arch/x86/mm/asi.c | 3 +
arch/x86/mm/asi_pagetable.c | 99 +++++++++++++++++++++++++++++++++++++++++++
4 files changed, 129 insertions(+), 1 deletions(-)
create mode 100644 arch/x86/mm/asi_pagetable.c

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index 013d77a..3d965e6 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -8,12 +8,35 @@

#include <linux/spinlock.h>
#include <asm/pgtable.h>
+#include <linux/xarray.h>
+
+enum page_table_level {
+ PGT_LEVEL_PTE,
+ PGT_LEVEL_PMD,
+ PGT_LEVEL_PUD,
+ PGT_LEVEL_P4D,
+ PGT_LEVEL_PGD
+};

#define ASI_FAULT_LOG_SIZE 128

struct asi {
spinlock_t lock; /* protect all attributes */
pgd_t *pgd; /* ASI page-table */
+
+ /*
+ * An ASI page-table can have direct references to the full kernel
+ * page-table, at different levels (PGD, P4D, PUD, PMD). When freeing
+ * an ASI page-table, we should make sure that we free parts actually
+ * allocated for the ASI page-table, and not part of the full kernel
+ * page-table referenced from the ASI page-table.
+ *
+ * To do so, the backend_pages XArray is used to keep track of pages
+ * used for the kernel isolation page-table.
+ */
+ struct xarray backend_pages; /* page-table pages */
+ unsigned long backend_pages_count; /* pages count */
+
spinlock_t fault_lock; /* protect fault_log */
unsigned long fault_log[ASI_FAULT_LOG_SIZE];
bool fault_stack; /* display stack of fault? */
@@ -43,6 +66,9 @@ struct asi_session {

DECLARE_PER_CPU_PAGE_ALIGNED(struct asi_session, cpu_asi_session);

+void asi_init_backend(struct asi *asi);
+void asi_fini_backend(struct asi *asi);
+
extern struct asi *asi_create(void);
extern void asi_destroy(struct asi *asi);
extern int asi_enter(struct asi *asi);
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index dae5c8a..b972f0f 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -49,7 +49,7 @@ obj-$(CONFIG_X86_INTEL_MPX) += mpx.o
obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o
-obj-$(CONFIG_ADDRESS_SPACE_ISOLATION) += asi.o
+obj-$(CONFIG_ADDRESS_SPACE_ISOLATION) += asi.o asi_pagetable.o

obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index 717160d..dfde245 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -111,6 +111,7 @@ struct asi *asi_create(void)
asi->pgd = page_address(page);
spin_lock_init(&asi->lock);
spin_lock_init(&asi->fault_lock);
+ asi_init_backend(asi);

err = asi_init_mapping(asi);
if (err)
@@ -132,6 +133,8 @@ void asi_destroy(struct asi *asi)
if (asi->pgd)
free_page((unsigned long)asi->pgd);

+ asi_fini_backend(asi);
+
kfree(asi);
}
EXPORT_SYMBOL(asi_destroy);
diff --git a/arch/x86/mm/asi_pagetable.c b/arch/x86/mm/asi_pagetable.c
new file mode 100644
index 0000000..7a8f791
--- /dev/null
+++ b/arch/x86/mm/asi_pagetable.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ *
+ */
+
+#include <asm/asi.h>
+
+/*
+ * Get the pointer to the beginning of a page table directory from a page
+ * table directory entry.
+ */
+#define ASI_BACKEND_PAGE_ALIGN(entry) \
+ ((typeof(entry))(((unsigned long)(entry)) & PAGE_MASK))
+
+/*
+ * Pages used to build the address space isolation page-table are stored
+ * in the backend_pages XArray. Each entry in the array is a logical OR
+ * of the page address and the page table level (PTE, PMD, PUD, P4D) this
+ * page is used for in the address space isolation page-table.
+ *
+ * As a page address is aligned with PAGE_SIZE, we have plenty of space
+ * for storing the page table level (which is a value between 0 and 4) in
+ * the low bits of the page address.
+ *
+ */
+
+#define ASI_BACKEND_PAGE_ENTRY(addr, level) \
+ ((typeof(addr))(((unsigned long)(addr)) | ((unsigned long)(level))))
+#define ASI_BACKEND_PAGE_ADDR(entry) \
+ ((void *)(((unsigned long)(entry)) & PAGE_MASK))
+#define ASI_BACKEND_PAGE_LEVEL(entry) \
+ ((enum page_table_level)(((unsigned long)(entry)) & ~PAGE_MASK))
+
+static int asi_add_backend_page(struct asi *asi, void *addr,
+ enum page_table_level level)
+{
+ unsigned long index;
+ void *old_entry;
+
+ if ((!addr) || ((unsigned long)addr) & ~PAGE_MASK)
+ return -EINVAL;
+
+ lockdep_assert_held(&asi->lock);
+ index = asi->backend_pages_count;
+
+ old_entry = xa_store(&asi->backend_pages, index,
+ ASI_BACKEND_PAGE_ENTRY(addr, level),
+ GFP_ATOMIC);
+ if (xa_is_err(old_entry))
+ return xa_err(old_entry);
+ if (old_entry)
+ return -EBUSY;
+
+ asi->backend_pages_count++;
+
+ return 0;
+}
+
+void asi_init_backend(struct asi *asi)
+{
+ xa_init(&asi->backend_pages);
+}
+
+void asi_fini_backend(struct asi *asi)
+{
+ unsigned long index;
+ void *entry;
+
+ if (asi->backend_pages_count) {
+ xa_for_each(&asi->backend_pages, index, entry)
+ free_page((unsigned long)ASI_BACKEND_PAGE_ADDR(entry));
+ }
+}
+
+/*
+ * Check if an offset in the address space isolation page-table is valid,
+ * i.e. check that the offset is on a page effectively belonging to the
+ * address space isolation page-table.
+ */
+static bool asi_valid_offset(struct asi *asi, void *offset)
+{
+ unsigned long index;
+ void *addr, *entry;
+ bool valid;
+
+ addr = ASI_BACKEND_PAGE_ALIGN(offset);
+ valid = false;
+
+ lockdep_assert_held(&asi->lock);
+ xa_for_each(&asi->backend_pages, index, entry) {
+ if (ASI_BACKEND_PAGE_ADDR(entry) == addr) {
+ valid = true;
+ break;
+ }
+ }
+
+ return valid;
+}
--
1.7.1
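
The encoding used for backend_pages entries (page address ORed with the
page-table level in the low bits) can be tried out in userspace. This is
only a sketch: it assumes a 4 KiB page size, uses aligned_alloc() in place
of a kernel page allocation, and the tag_ names are hypothetical
counterparts of the ASI_BACKEND_PAGE_* macros.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TAG_PAGE_SIZE	4096UL			/* assumed page size */
#define TAG_PAGE_MASK	(~(TAG_PAGE_SIZE - 1))

enum page_table_level {
	PGT_LEVEL_PTE,
	PGT_LEVEL_PMD,
	PGT_LEVEL_PUD,
	PGT_LEVEL_P4D,
	PGT_LEVEL_PGD
};

/* Pack a page-aligned address and a level into one pointer-sized value. */
static void *tag_entry(void *addr, enum page_table_level level)
{
	return (void *)((uintptr_t)addr | (uintptr_t)level);
}

/* Recover the page address by masking off the low bits. */
static void *tag_entry_addr(void *entry)
{
	return (void *)((uintptr_t)entry & TAG_PAGE_MASK);
}

/* Recover the level from the low bits. */
static enum page_table_level tag_entry_level(void *entry)
{
	return (enum page_table_level)((uintptr_t)entry & ~TAG_PAGE_MASK);
}

/* Usage: round-trip a page address and a level through one entry. */
static void tag_demo(void)
{
	void *page = aligned_alloc(TAG_PAGE_SIZE, TAG_PAGE_SIZE);
	void *entry;

	assert(page != NULL);

	entry = tag_entry(page, PGT_LEVEL_PUD);
	assert(tag_entry_addr(entry) == page);
	assert(tag_entry_level(entry) == PGT_LEVEL_PUD);

	free(page);
}
```

The trick only works because the address is page-aligned, which is why
asi_add_backend_page() rejects addresses with low bits set.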

2019-07-11 14:32:33

by Alexandre Chartre

Subject: [RFC v2 01/26] mm/x86: Introduce kernel address space isolation

Introduce core functions and structures for implementing Address Space
Isolation (ASI). Kernel address space isolation provides the ability to
run some kernel code with a reduced kernel address space.

An address space isolation is defined with a struct asi structure which
has its own page-table. While, for now, this page-table is empty, it
will eventually be possible to populate it so that it is much smaller
than the full kernel page-table.

Isolation is entered by calling asi_enter() which switches the kernel
page-table to the address space isolation page-table. Isolation is then
exited by calling asi_exit() which switches the page-table back to the
kernel page-table.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/include/asm/asi.h | 41 ++++++++++++
arch/x86/mm/Makefile | 2 +
arch/x86/mm/asi.c | 152 ++++++++++++++++++++++++++++++++++++++++++++
security/Kconfig | 10 +++
4 files changed, 205 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/include/asm/asi.h
create mode 100644 arch/x86/mm/asi.c

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
new file mode 100644
index 0000000..8a13f73
--- /dev/null
+++ b/arch/x86/include/asm/asi.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef ARCH_X86_MM_ASI_H
+#define ARCH_X86_MM_ASI_H
+
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+#include <linux/spinlock.h>
+#include <asm/pgtable.h>
+
+struct asi {
+ spinlock_t lock; /* protect all attributes */
+ pgd_t *pgd; /* ASI page-table */
+};
+
+/*
+ * An ASI session maintains the state of address space isolation on a
+ * cpu. There is one ASI session per cpu. There is no lock to protect
+ * members of the asi_session structure as each cpu is managing its
+ * own ASI session.
+ */
+
+enum asi_session_state {
+ ASI_SESSION_STATE_INACTIVE, /* no address space isolation */
+ ASI_SESSION_STATE_ACTIVE, /* address space isolation is active */
+};
+
+struct asi_session {
+ struct asi *asi; /* ASI for this session */
+ enum asi_session_state state; /* state of ASI session */
+ unsigned long original_cr3; /* cr3 before entering ASI */
+ struct task_struct *task; /* task during isolation */
+} __aligned(PAGE_SIZE);
+
+extern struct asi *asi_create(void);
+extern void asi_destroy(struct asi *asi);
+extern int asi_enter(struct asi *asi);
+extern void asi_exit(struct asi *asi);
+
+#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */
+
+#endif
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 84373dc..dae5c8a 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -49,7 +49,9 @@ obj-$(CONFIG_X86_INTEL_MPX) += mpx.o
obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o
+obj-$(CONFIG_ADDRESS_SPACE_ISOLATION) += asi.o

obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o
+
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
new file mode 100644
index 0000000..c3993b7
--- /dev/null
+++ b/arch/x86/mm/asi.c
@@ -0,0 +1,152 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ *
+ * Kernel Address Space Isolation (ASI)
+ */
+
+#include <linux/export.h>
+#include <linux/gfp.h>
+#include <linux/mm.h>
+#include <linux/printk.h>
+#include <linux/slab.h>
+
+#include <asm/asi.h>
+#include <asm/bug.h>
+#include <asm/mmu_context.h>
+
+/* ASI sessions, one per cpu */
+DEFINE_PER_CPU_PAGE_ALIGNED(struct asi_session, cpu_asi_session);
+
+static int asi_init_mapping(struct asi *asi)
+{
+ /*
+ * TODO: Populate the ASI page-table with minimal mappings so
+ * that we can at least enter isolation and abort.
+ */
+ return 0;
+}
+
+struct asi *asi_create(void)
+{
+ struct page *page;
+ struct asi *asi;
+ int err;
+
+ asi = kzalloc(sizeof(*asi), GFP_KERNEL);
+ if (!asi)
+ return NULL;
+
+ page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ if (!page)
+ goto error;
+
+ asi->pgd = page_address(page);
+ spin_lock_init(&asi->lock);
+
+ err = asi_init_mapping(asi);
+ if (err)
+ goto error;
+
+ return asi;
+
+error:
+ asi_destroy(asi);
+ return NULL;
+}
+EXPORT_SYMBOL(asi_create);
+
+void asi_destroy(struct asi *asi)
+{
+ if (!asi)
+ return;
+
+ if (asi->pgd)
+ free_page((unsigned long)asi->pgd);
+
+ kfree(asi);
+}
+EXPORT_SYMBOL(asi_destroy);
+
+/*
+ * When isolation is active, the address space doesn't necessarily map
+ * the percpu offset value (this_cpu_off) which is used to get pointers
+ * to percpu variables. So functions which can be invoked while isolation
+ * is active shouldn't be getting pointers to percpu variables (i.e. with
+ * get_cpu_var() or this_cpu_ptr()). Instead, percpu variables should be
+ * directly read or written to (i.e. with this_cpu_read() or
+ * this_cpu_write()).
+ */
+
+int asi_enter(struct asi *asi)
+{
+ enum asi_session_state state;
+ struct asi *current_asi;
+ struct asi_session *asi_session;
+ int err;
+
+ state = this_cpu_read(cpu_asi_session.state);
+ /*
+ * We can re-enter isolation, but only with the same ASI (we don't
+ * support nesting isolation). Also, if isolation is still active,
+ * then we should be re-entering with the same task.
+ */
+ if (state == ASI_SESSION_STATE_ACTIVE) {
+ current_asi = this_cpu_read(cpu_asi_session.asi);
+ if (current_asi != asi) {
+ WARN_ON(1);
+ return -EBUSY;
+ }
+ WARN_ON(this_cpu_read(cpu_asi_session.task) != current);
+ return 0;
+ }
+
+ /* isolation is not active so we can safely access the percpu pointer */
+ asi_session = &get_cpu_var(cpu_asi_session);
+ asi_session->asi = asi;
+ asi_session->task = current;
+ asi_session->original_cr3 = __get_current_cr3_fast();
+ if (!asi_session->original_cr3) {
+ WARN_ON(1);
+ err = -EINVAL;
+ goto err_clear_asi;
+ }
+ asi_session->state = ASI_SESSION_STATE_ACTIVE;
+
+ load_cr3(asi->pgd);
+
+ return 0;
+
+err_clear_asi:
+ asi_session->asi = NULL;
+ asi_session->task = NULL;
+
+ return err;
+
+}
+EXPORT_SYMBOL(asi_enter);
+
+void asi_exit(struct asi *asi)
+{
+ struct asi_session *asi_session;
+ enum asi_session_state asi_state;
+ unsigned long original_cr3;
+
+ asi_state = this_cpu_read(cpu_asi_session.state);
+ if (asi_state == ASI_SESSION_STATE_INACTIVE)
+ return;
+
+ /* TODO: Kick sibling hyperthread before switching to kernel cr3 */
+ original_cr3 = this_cpu_read(cpu_asi_session.original_cr3);
+ if (original_cr3)
+ write_cr3(original_cr3);
+
+ /* page-table was switched, we can now access the percpu pointer */
+ asi_session = &get_cpu_var(cpu_asi_session);
+ WARN_ON(asi_session->task != current);
+ asi_session->state = ASI_SESSION_STATE_INACTIVE;
+ asi_session->asi = NULL;
+ asi_session->task = NULL;
+ asi_session->original_cr3 = 0;
+}
+EXPORT_SYMBOL(asi_exit);
diff --git a/security/Kconfig b/security/Kconfig
index 466cc1f..241b9a7 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -65,6 +65,16 @@ config PAGE_TABLE_ISOLATION

See Documentation/x86/pti.txt for more details.

+config ADDRESS_SPACE_ISOLATION
+ bool "Allow code to run with a reduced kernel address space"
+ default y
+ depends on (X86_64 || X86_PAE) && !UML
+ help
+ This feature provides the ability to run some kernel code
+ with a reduced kernel address space. This can be used to
+ mitigate speculative execution attacks which are able to
+ leak data between sibling CPU hyper-threads.
+
config SECURITY_INFINIBAND
bool "Infiniband Security Hooks"
depends on SECURITY && INFINIBAND
--
1.7.1
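
The re-entry rule implemented by asi_enter() (re-entering with the same ASI
is fine, entering a different ASI while one is active is refused) can be
modelled in a few lines of userspace C. This is a sketch under
simplifying assumptions: a single session instead of one per cpu, no CR3
switch and no task check. All sess_ names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sess_asi { int id; };	/* hypothetical placeholder for struct asi */

enum sess_state { SESS_INACTIVE, SESS_ACTIVE };

/* The series keeps one session per cpu; a single one is enough here. */
struct sess_session {
	struct sess_asi *asi;
	enum sess_state state;
};

static int sess_enter(struct sess_session *s, struct sess_asi *asi)
{
	/* re-entering is allowed with the same ASI only (no nesting) */
	if (s->state == SESS_ACTIVE)
		return s->asi == asi ? 0 : -EBUSY;

	s->asi = asi;
	s->state = SESS_ACTIVE;
	return 0;
}

static void sess_exit(struct sess_session *s)
{
	if (s->state == SESS_INACTIVE)
		return;
	s->state = SESS_INACTIVE;
	s->asi = NULL;
}

/* Usage: same-ASI re-entry succeeds, a second ASI has to wait for exit. */
static void sess_demo(void)
{
	struct sess_asi a = { 1 }, b = { 2 };
	struct sess_session s = { NULL, SESS_INACTIVE };

	assert(sess_enter(&s, &a) == 0);
	assert(sess_enter(&s, &a) == 0);	/* same ASI: ok */
	assert(sess_enter(&s, &b) == -EBUSY);	/* different ASI: refused */

	sess_exit(&s);
	assert(sess_enter(&s, &b) == 0);	/* free again after exit */
}
```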

2019-07-11 14:47:51

by Alexandre Chartre

Subject: [RFC v2 05/26] mm/asi: Add ASI page-table entry offset functions

Add wrappers around the p4d/pud/pmd/pte offset kernel functions which
ensure that page-table pointers are in the specified ASI page-table.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/mm/asi_pagetable.c | 62 +++++++++++++++++++++++++++++++++++++++++++
1 files changed, 62 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/asi_pagetable.c b/arch/x86/mm/asi_pagetable.c
index 7a8f791..a89e02e 100644
--- a/arch/x86/mm/asi_pagetable.c
+++ b/arch/x86/mm/asi_pagetable.c
@@ -97,3 +97,65 @@ static bool asi_valid_offset(struct asi *asi, void *offset)

return valid;
}
+
+/*
+ * asi_pXX_offset() functions are equivalent to kernel pXX_offset()
+ * functions but, in addition, they ensure that page table pointers
+ * are in the kernel isolation page table. Otherwise an error is
+ * returned.
+ */
+
+static pte_t *asi_pte_offset(struct asi *asi, pmd_t *pmd, unsigned long addr)
+{
+ pte_t *pte;
+
+ pte = pte_offset_map(pmd, addr);
+ if (!asi_valid_offset(asi, pte)) {
+ pr_err("ASI %p: PTE %px not found\n", asi, pte);
+ return ERR_PTR(-EINVAL);
+ }
+
+ return pte;
+}
+
+static pmd_t *asi_pmd_offset(struct asi *asi, pud_t *pud, unsigned long addr)
+{
+ pmd_t *pmd;
+
+ pmd = pmd_offset(pud, addr);
+ if (!asi_valid_offset(asi, pmd)) {
+ pr_err("ASI %p: PMD %px not found\n", asi, pmd);
+ return ERR_PTR(-EINVAL);
+ }
+
+ return pmd;
+}
+
+static pud_t *asi_pud_offset(struct asi *asi, p4d_t *p4d, unsigned long addr)
+{
+ pud_t *pud;
+
+ pud = pud_offset(p4d, addr);
+ if (!asi_valid_offset(asi, pud)) {
+ pr_err("ASI %p: PUD %px not found\n", asi, pud);
+ return ERR_PTR(-EINVAL);
+ }
+
+ return pud;
+}
+
+static p4d_t *asi_p4d_offset(struct asi *asi, pgd_t *pgd, unsigned long addr)
+{
+ p4d_t *p4d;
+
+ p4d = p4d_offset(pgd, addr);
+ /*
+ * p4d is the same as pgd if we don't have a 5-level page table.
+ */
+ if ((p4d != (p4d_t *)pgd) && !asi_valid_offset(asi, p4d)) {
+ pr_err("ASI %p: P4D %px not found\n", asi, p4d);
+ return ERR_PTR(-EINVAL);
+ }
+
+ return p4d;
+}
--
1.7.1
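
The check performed by these wrappers boils down to: align the entry
pointer down to its page, then look that page up among the pages owned by
the ASI. A userspace sketch of that logic follows. It assumes a 4 KiB page
size, uses a plain array where the series uses the backend_pages XArray,
and returns NULL where the kernel code returns ERR_PTR(-EINVAL). The off_
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define OFF_PAGE_SIZE	4096UL			/* assumed page size */
#define OFF_PAGE_MASK	(~(OFF_PAGE_SIZE - 1))
#define OFF_MAX_PAGES	4			/* arbitrary for this sketch */

/* Plain array standing in for the backend_pages XArray. */
struct off_backend {
	void *pages[OFF_MAX_PAGES];
	size_t count;
};

/* Does 'offset' point into a page owned by the backend? */
static bool off_valid(const struct off_backend *b, const void *offset)
{
	void *page = (void *)((uintptr_t)offset & OFF_PAGE_MASK);
	size_t i;

	for (i = 0; i < b->count; i++) {
		if (b->pages[i] == page)
			return true;
	}
	return false;
}

/* In the spirit of asi_pXX_offset(): index a table, then validate. */
static void *off_checked(const struct off_backend *b, void *table,
			 size_t index)
{
	void **entry = (void **)table + index;

	return off_valid(b, entry) ? (void *)entry : NULL;
}

/* Usage: an entry in an owned page is accepted, a foreign one is not. */
static void off_demo(void)
{
	void *own = aligned_alloc(OFF_PAGE_SIZE, OFF_PAGE_SIZE);
	void *foreign = aligned_alloc(OFF_PAGE_SIZE, OFF_PAGE_SIZE);
	struct off_backend b = { .pages = { own }, .count = 1 };

	assert(own != NULL && foreign != NULL);

	assert(off_checked(&b, own, 5) == (void *)((void **)own + 5));
	assert(off_checked(&b, foreign, 5) == NULL);

	free(foreign);
	free(own);
}
```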

2019-07-11 14:48:35

by Alexandre Chartre

Subject: [RFC v2 16/26] mm/asi: Option to map current task into ASI

Add an option to map the current task into an ASI page-table.
The task is mapped when entering isolation and unmapped on
abort/exit.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/include/asm/asi.h | 2 ++
arch/x86/mm/asi.c | 25 +++++++++++++++++++++----
arch/x86/mm/asi_pagetable.c | 4 ++--
3 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index 1ac8fd3..a277e43 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -17,6 +17,7 @@
*/
#define ASI_MAP_STACK_CANARY 0x01 /* map stack canary */
#define ASI_MAP_CPU_PTR 0x02 /* for get_cpu_var()/this_cpu_ptr() */
+#define ASI_MAP_CURRENT_TASK 0x04 /* map the current task */

enum page_table_level {
PGT_LEVEL_PTE,
@@ -31,6 +32,7 @@ enum page_table_level {
struct asi {
spinlock_t lock; /* protect all attributes */
pgd_t *pgd; /* ASI page-table */
+ int mapping_flags; /* map flags */
struct list_head mapping_list; /* list of VA range mapping */

/*
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index f049438..acd1135 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -28,6 +28,7 @@ struct asi_map_option {
struct asi_map_option asi_map_percpu_options[] = {
{ ASI_MAP_STACK_CANARY, &fixed_percpu_data, sizeof(fixed_percpu_data) },
{ ASI_MAP_CPU_PTR, &this_cpu_off, sizeof(this_cpu_off) },
+ { ASI_MAP_CURRENT_TASK, &current_task, sizeof(current_task) },
};

static void asi_log_fault(struct asi *asi, struct pt_regs *regs,
@@ -96,8 +97,9 @@ bool asi_fault(struct pt_regs *regs, unsigned long error_code,
return true;
}

-static int asi_init_mapping(struct asi *asi, int flags)
+static int asi_init_mapping(struct asi *asi)
{
+ int flags = asi->mapping_flags;
struct asi_map_option *option;
int i, err;

@@ -164,8 +166,9 @@ struct asi *asi_create(int map_flags)
spin_lock_init(&asi->lock);
spin_lock_init(&asi->fault_lock);
asi_init_backend(asi);
+ asi->mapping_flags = map_flags;

- err = asi_init_mapping(asi, map_flags);
+ err = asi_init_mapping(asi);
if (err)
goto error;

@@ -248,6 +251,15 @@ int asi_enter(struct asi *asi)
goto err_clear_asi;

/*
+ * Optionally, also map the current task.
+ */
+ if (asi->mapping_flags & ASI_MAP_CURRENT_TASK) {
+ err = asi_map(asi, current, sizeof(struct task_struct));
+ if (err)
+ goto err_unmap_stack;
+ }
+
+ /*
* Instructions ordering is important here because we should be
* able to deal with any interrupt/exception which will abort
* the isolation and restore CR3 to its original value:
@@ -269,7 +281,7 @@ int asi_enter(struct asi *asi)
if (!original_cr3) {
WARN_ON(1);
err = -EINVAL;
- goto err_unmap_stack;
+ goto err_unmap_task;
}
asi_session->original_cr3 = original_cr3;

@@ -286,6 +298,9 @@ int asi_enter(struct asi *asi)

return 0;

+err_unmap_task:
+ if (asi->mapping_flags & ASI_MAP_CURRENT_TASK)
+ asi_unmap(asi, current);
err_unmap_stack:
asi_unmap(asi, current->stack);
err_clear_asi:
@@ -345,8 +360,10 @@ void asi_exit(struct asi *asi)
*/
asi_session->abort_depth = 0;

- /* unmap stack */
+ /* unmap stack and task */
asi_unmap(asi, current->stack);
+ if (asi->mapping_flags & ASI_MAP_CURRENT_TASK)
+ asi_unmap(asi, current);
}
EXPORT_SYMBOL(asi_exit);

diff --git a/arch/x86/mm/asi_pagetable.c b/arch/x86/mm/asi_pagetable.c
index bcc95f2..8076626 100644
--- a/arch/x86/mm/asi_pagetable.c
+++ b/arch/x86/mm/asi_pagetable.c
@@ -714,7 +714,7 @@ int asi_map_range(struct asi *asi, void *ptr, size_t size,
* Don't log info the current stack because it is mapped/unmapped
* everytime we enter/exit isolation.
*/
- if (ptr != current->stack) {
+ if (ptr != current->stack && ptr != current) {
pr_debug("ASI %p: MAP %px/%lx/%d -> %lx-%lx\n",
asi, ptr, size, level, map_addr, map_end);
if (map_addr < addr)
@@ -1001,7 +1001,7 @@ void asi_unmap(struct asi *asi, void *ptr)
* Don't log info the current stack because it is mapped/unmapped
* everytime we enter/exit isolation.
*/
- if (ptr != current->stack) {
+ if (ptr != current->stack && ptr != current) {
pr_debug("ASI %p: UNMAP %px/%lx/%d\n", asi, ptr,
range_mapping->size, range_mapping->level);
}
--
1.7.1
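
The asi_enter() hunks above add an err_unmap_task label ahead of err_unmap_stack, so a late failure undoes the task mapping first and then falls through to undo the stack mapping. The following is a minimal userspace sketch of that reverse-order unwinding. The struct, the flags and the failure switches are hypothetical stand-ins, not the patch's code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state asi_enter() sets up. */
struct sketch_state {
	bool map_task;      /* models ASI_MAP_CURRENT_TASK being set */
	bool fail_task;     /* force a failure while mapping the task */
	bool fail_cr3;      /* force the later "no original CR3" failure */
	bool stack_mapped;
	bool task_mapped;
};

int sketch_enter(struct sketch_state *s)
{
	int err;

	assert(s);
	s->stack_mapped = true;         /* models mapping the current stack */

	if (s->map_task) {
		if (s->fail_task) {
			err = -ENOMEM;
			goto err_unmap_stack;   /* only the stack is mapped so far */
		}
		s->task_mapped = true;  /* models mapping the current task */
	}

	if (s->fail_cr3) {
		err = -EINVAL;
		goto err_unmap_task;    /* both mappings must be undone */
	}

	return 0;

err_unmap_task:
	if (s->map_task)
		s->task_mapped = false;
	/* fall through: labels are ordered in reverse of acquisition */
err_unmap_stack:
	s->stack_mapped = false;
	return err;
}
```

The ordering matters: jumping to the later label from an early failure would skip nothing harmful here, but jumping to the earlier label from a late failure would leave the task mapped.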

2019-07-11 14:50:31

by Alexandre Chartre

Subject: [RFC v2 26/26] KVM: x86/asi: Map KVM memslots and IO buses into KVM ASI

Map KVM memslots and IO buses into KVM ASI. The mappings are checked on each
KVM ASI enter because they can change.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/kvm/x86.c | 36 +++++++++++++++++++++++++++++++++++-
include/linux/kvm_host.h | 2 ++
2 files changed, 37 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9458413..7c52827 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7748,11 +7748,45 @@ void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu)

static void vcpu_isolation_enter(struct kvm_vcpu *vcpu)
{
- int err;
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_io_bus *bus;
+ int i, err;

if (!vcpu->asi)
return;

+ /*
+ * Check memslots and buses mapping as they tend to change.
+ */
+ for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+ if (vcpu->asi_memslots[i] == kvm->memslots[i])
+ continue;
+ pr_debug("remapping kvm memslots[%d]: %px -> %px\n",
+ i, vcpu->asi_memslots[i], kvm->memslots[i]);
+ err = asi_remap(vcpu->asi, &vcpu->asi_memslots[i],
+ kvm->memslots[i], sizeof(struct kvm_memslots));
+ if (err) {
+ pr_debug("failed to map kvm memslots[%d]: error %d\n",
+ i, err);
+ }
+ }
+
+
+ for (i = 0; i < KVM_NR_BUSES; i++) {
+ bus = kvm->buses[i];
+ if (bus == vcpu->asi_buses[i])
+ continue;
+ pr_debug("remapped kvm buses[%d]: %px -> %px\n",
+ i, vcpu->asi_buses[i], bus);
+ err = asi_remap(vcpu->asi, &vcpu->asi_buses[i], bus,
+ sizeof(*bus) + bus->dev_count *
+ sizeof(struct kvm_io_range));
+ if (err) {
+ pr_debug("failed to map kvm buses[%d]: error %d\n",
+ i, err);
+ }
+ }
+
err = asi_enter(vcpu->asi);
if (err)
pr_debug("KVM isolation failed: error %d\n", err);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2a9d073..1f82de4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -324,6 +324,8 @@ struct kvm_vcpu {

#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
struct asi *asi;
+ void *asi_memslots[KVM_ADDRESS_SPACE_NUM];
+ void *asi_buses[KVM_NR_BUSES];
#endif
};

--
1.7.1
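
The patch above keeps a per-vcpu copy of each memslots/bus pointer and only remaps when the live pointer differs from the cached one. The following is a minimal userspace sketch of that check-then-remap pattern. sketch_remap() is a hypothetical stand-in for asi_remap(), and it is an assumption here that the helper updates the cached pointer on success.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NSLOTS 2

/* Hypothetical stand-in for asi_remap(): swap the mapping, update the cache. */
static int sketch_remap(void **cached, void *new_ptr)
{
	*cached = new_ptr;
	return 0;
}

/*
 * Compare every cached pointer with the live one and remap only the
 * entries that changed. Returns the number of entries remapped.
 */
int sketch_sync(void *cached[], void *live[], size_t n)
{
	int remapped = 0;
	size_t i;

	assert(cached && live);
	for (i = 0; i < n; i++) {
		if (cached[i] == live[i])
			continue;       /* mapping is still current */
		if (sketch_remap(&cached[i], live[i]) == 0)
			remapped++;
	}
	return remapped;
}
```

With this shape, an unchanged configuration costs one pointer comparison per slot on each entry.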

2019-07-11 14:50:31

by Alexandre Chartre

Subject: [RFC v2 17/26] rcu: Move tree.h static forward declarations to tree.c

tree.h has static forward declarations for inline functions defined
in tree_plugin.h and tree_stall.h. These forward declarations prevent
including tree.h in any file other than tree.c.

Signed-off-by: Alexandre Chartre <[email protected]>
---
kernel/rcu/tree.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/rcu/tree.h | 55 +----------------------------------------------------
2 files changed, 55 insertions(+), 54 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 980ca3c..44dd3b4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -55,6 +55,60 @@
#include "tree.h"
#include "rcu.h"

+/* Forward declarations for tree_plugin.h */
+static void rcu_bootup_announce(void);
+static void rcu_qs(void);
+static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp);
+#ifdef CONFIG_HOTPLUG_CPU
+static bool rcu_preempt_has_tasks(struct rcu_node *rnp);
+#endif /* #ifdef CONFIG_HOTPLUG_CPU */
+static int rcu_print_task_exp_stall(struct rcu_node *rnp);
+static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp);
+static void rcu_flavor_sched_clock_irq(int user);
+static void dump_blkd_tasks(struct rcu_node *rnp, int ncheck);
+static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags);
+static void rcu_preempt_boost_start_gp(struct rcu_node *rnp);
+static void invoke_rcu_callbacks_kthread(void);
+static bool rcu_is_callbacks_kthread(void);
+static void __init rcu_spawn_boost_kthreads(void);
+static void rcu_prepare_kthreads(int cpu);
+static void rcu_cleanup_after_idle(void);
+static void rcu_prepare_for_idle(void);
+static bool rcu_preempt_has_tasks(struct rcu_node *rnp);
+static bool rcu_preempt_need_deferred_qs(struct task_struct *t);
+static void rcu_preempt_deferred_qs(struct task_struct *t);
+static void zero_cpu_stall_ticks(struct rcu_data *rdp);
+static bool rcu_nocb_cpu_needs_barrier(int cpu);
+static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
+static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
+static void rcu_init_one_nocb(struct rcu_node *rnp);
+static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
+ bool lazy, unsigned long flags);
+static bool rcu_nocb_adopt_orphan_cbs(struct rcu_data *my_rdp,
+ struct rcu_data *rdp,
+ unsigned long flags);
+static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp);
+static void do_nocb_deferred_wakeup(struct rcu_data *rdp);
+static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
+static void rcu_spawn_cpu_nocb_kthread(int cpu);
+static void __init rcu_spawn_nocb_kthreads(void);
+#ifdef CONFIG_RCU_NOCB_CPU
+static void __init rcu_organize_nocb_kthreads(void);
+#endif /* #ifdef CONFIG_RCU_NOCB_CPU */
+static bool init_nocb_callback_list(struct rcu_data *rdp);
+static unsigned long rcu_get_n_cbs_nocb_cpu(struct rcu_data *rdp);
+static void rcu_bind_gp_kthread(void);
+static bool rcu_nohz_full_cpu(void);
+static void rcu_dynticks_task_enter(void);
+static void rcu_dynticks_task_exit(void);
+
+/* Forward declarations for tree_stall.h */
+static void record_gp_stall_check_time(void);
+static void rcu_iw_handler(struct irq_work *iwp);
+static void check_cpu_stall(struct rcu_data *rdp);
+static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp,
+ const unsigned long gpssdelay);
+
#ifdef MODULE_PARAM_PREFIX
#undef MODULE_PARAM_PREFIX
#endif
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index e253d11..9790b58 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -392,58 +392,5 @@ struct rcu_state {
#endif /* #else #ifdef CONFIG_TRACING */

int rcu_dynticks_snap(struct rcu_data *rdp);
-
-/* Forward declarations for tree_plugin.h */
-static void rcu_bootup_announce(void);
-static void rcu_qs(void);
-static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp);
-#ifdef CONFIG_HOTPLUG_CPU
-static bool rcu_preempt_has_tasks(struct rcu_node *rnp);
-#endif /* #ifdef CONFIG_HOTPLUG_CPU */
-static int rcu_print_task_exp_stall(struct rcu_node *rnp);
-static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp);
-static void rcu_flavor_sched_clock_irq(int user);
void call_rcu(struct rcu_head *head, rcu_callback_t func);
-static void dump_blkd_tasks(struct rcu_node *rnp, int ncheck);
-static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags);
-static void rcu_preempt_boost_start_gp(struct rcu_node *rnp);
-static void invoke_rcu_callbacks_kthread(void);
-static bool rcu_is_callbacks_kthread(void);
-static void __init rcu_spawn_boost_kthreads(void);
-static void rcu_prepare_kthreads(int cpu);
-static void rcu_cleanup_after_idle(void);
-static void rcu_prepare_for_idle(void);
-static bool rcu_preempt_has_tasks(struct rcu_node *rnp);
-static bool rcu_preempt_need_deferred_qs(struct task_struct *t);
-static void rcu_preempt_deferred_qs(struct task_struct *t);
-static void zero_cpu_stall_ticks(struct rcu_data *rdp);
-static bool rcu_nocb_cpu_needs_barrier(int cpu);
-static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
-static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
-static void rcu_init_one_nocb(struct rcu_node *rnp);
-static bool __call_rcu_nocb(struct rcu_data *rdp, struct rcu_head *rhp,
- bool lazy, unsigned long flags);
-static bool rcu_nocb_adopt_orphan_cbs(struct rcu_data *my_rdp,
- struct rcu_data *rdp,
- unsigned long flags);
-static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp);
-static void do_nocb_deferred_wakeup(struct rcu_data *rdp);
-static void rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp);
-static void rcu_spawn_cpu_nocb_kthread(int cpu);
-static void __init rcu_spawn_nocb_kthreads(void);
-#ifdef CONFIG_RCU_NOCB_CPU
-static void __init rcu_organize_nocb_kthreads(void);
-#endif /* #ifdef CONFIG_RCU_NOCB_CPU */
-static bool init_nocb_callback_list(struct rcu_data *rdp);
-static unsigned long rcu_get_n_cbs_nocb_cpu(struct rcu_data *rdp);
-static void rcu_bind_gp_kthread(void);
-static bool rcu_nohz_full_cpu(void);
-static void rcu_dynticks_task_enter(void);
-static void rcu_dynticks_task_exit(void);
-
-/* Forward declarations for tree_stall.h */
-static void record_gp_stall_check_time(void);
-static void rcu_iw_handler(struct irq_work *iwp);
-static void check_cpu_stall(struct rcu_data *rdp);
-static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp,
- const unsigned long gpssdelay);
+
--
1.7.1

2019-07-11 15:26:19

by Alexandre Chartre

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation


And I've just noticed that I've messed up the subject of the cover letter.
There are 26 patches, not 27. So it should have been 00/26 not 00/27.

Sorry about that.

alex.

On 7/11/19 4:25 PM, Alexandre Chartre wrote:
> Hi,
>
> This is version 2 of the "KVM Address Space Isolation" RFC. The code
> has been completely changed compared to v1 and it now provides a generic
> kernel framework which provides Address Space Isolation; and KVM is now
> a simple consumer of that framework. That's why the RFC title has been
> changed from "KVM Address Space Isolation" to "Kernel Address Space
> Isolation".
>
> Kernel Address Space Isolation aims to use address spaces to isolate some
> parts of the kernel (for example KVM) to prevent leaking sensitive data
> between hyper-threads under speculative execution attacks. You can refer
> to the first version of this RFC for more context:
>
> https://lkml.org/lkml/2019/5/13/515
>
> The new code is still a proof of concept. It is much more stable than v1:
> I am able to run a VM with a full OS (and also a nested VM) with multiple
> vcpus. But it looks like there are still some corner cases which cause the
> system to crash/hang.
>
> I am looking for feedback about this new approach where address space
> isolation is provided by the kernel, and KVM is just a consumer of this
> new framework.
>
>
> Changes
> =======
>
> - Address Space Isolation (ASI) is now provided as a kernel framework:
> interfaces for creating and managing an ASI are provided by the kernel,
> they are not implemented in KVM.
>
> - An ASI is associated with a page-table, we don't use mm anymore. Entering
> isolation is done by just updating CR3 to use the ASI page-table. Exiting
> isolation restores CR3 with the CR3 value present before entering isolation.
>
> - Isolation is exited at the beginning of any interrupt/exception handler,
> and on context switch.
>
> - Isolation doesn't disable interrupt, but if an interrupt occurs the
> interrupt handler will exit isolation.
>
> - The current stack is mapped when entering isolation and unmapped when
> exiting isolation.
>
> - The current task is not mapped by default, but there's an option to map it.
> In such a case, the current task is mapped when entering isolation and
> unmapped when exiting isolation.
>
> - Kernel code mapped to the ASI page-table has been reduced to:
> . the entire kernel (I still need to test with only the kernel text)
> . the cpu entry area (because we need the GDT to be mapped)
> . the cpu ASI session (for managing ASI)
> . the current stack
>
> - Optionally, an ASI can request the following kernel mapping to be added:
> . the stack canary
> . the cpu offsets (this_cpu_off)
> . the current task
> . RCU data (rcu_data)
> . CPU HW events (cpu_hw_events).
>
> All these optional mappings are used for KVM isolation.
>
>
> Patches:
> ========
>
> The proposed patches provide a framework for creating an Address Space
> Isolation (ASI) (represented by a struct asi). The ASI has a page-table which
> can be populated by copying mappings from the kernel page-table. The ASI can
> then be entered/exited by switching between the kernel page-table and the
> ASI page-table. In addition, any interrupt, exception or context switch
> will automatically abort and exit the isolation. Finally patches use the
> ASI framework to implement KVM isolation.
>
> - 01-03: Core of the ASI framework: create/destroy ASI, enter/exit/abort
> isolation, ASI page-fault handler.
>
> - 04-14: Functions to manage, populate and clear an ASI page-table.
>
> - 15-20: ASI core mappings and optional mappings.
>
> - 21: Make functions to read cr3/cr4 ASI aware
>
> - 22-26: Use ASI in KVM to provide isolation for VMExit handlers.
>
>
> API Overview:
> =============
> Here is a short description of the main ASI functions provided by the framework.
>
> struct asi *asi_create(int map_flags)
>
> Create an Address Space Isolation (ASI). map_flags can be used to specify
> optional kernel mapping to be added to the ASI page-table (for example,
> ASI_MAP_STACK_CANARY to map the stack canary).
>
>
> void asi_destroy(struct asi *asi)
>
> Destroy an ASI.
>
>
> int asi_enter(struct asi *asi)
>
> Enter isolation for the specified ASI. This switches from the kernel page-table
> to the page-table associated with the ASI.
>
>
> void asi_exit(struct asi *asi)
>
> Exit isolation for the specified ASI. This switches back to the kernel
> page-table
>
>
> int asi_map(struct asi *asi, void *ptr, unsigned long size);
>
> Copy kernel mapping to the specified ASI page-table.
>
>
> void asi_unmap(struct asi *asi, void *ptr);
>
> Clear kernel mapping from the specified ASI page-table.
>
>
> ----
> Alexandre Chartre (23):
> mm/x86: Introduce kernel address space isolation
> mm/asi: Abort isolation on interrupt, exception and context switch
> mm/asi: Handle page fault due to address space isolation
> mm/asi: Functions to track buffers allocated for an ASI page-table
> mm/asi: Add ASI page-table entry offset functions
> mm/asi: Add ASI page-table entry allocation functions
> mm/asi: Add ASI page-table entry set functions
> mm/asi: Functions to populate an ASI page-table from a VA range
> mm/asi: Helper functions to map module into ASI
> mm/asi: Keep track of VA ranges mapped in ASI page-table
> mm/asi: Functions to clear ASI page-table entries for a VA range
> mm/asi: Function to copy page-table entries for percpu buffer
> mm/asi: Add asi_remap() function
> mm/asi: Handle ASI mapped range leaks and overlaps
> mm/asi: Initialize the ASI page-table with core mappings
> mm/asi: Option to map current task into ASI
> rcu: Move tree.h static forward declarations to tree.c
> rcu: Make percpu rcu_data non-static
> mm/asi: Add option to map RCU data
> mm/asi: Add option to map cpu_hw_events
> mm/asi: Make functions to read cr3/cr4 ASI aware
> KVM: x86/asi: Populate the KVM ASI page-table
> KVM: x86/asi: Map KVM memslots and IO buses into KVM ASI
>
> Liran Alon (3):
> KVM: x86/asi: Introduce address_space_isolation module parameter
> KVM: x86/asi: Introduce KVM address space isolation
> KVM: x86/asi: Switch to KVM address space on entry to guest
>
> arch/x86/entry/entry_64.S | 42 ++-
> arch/x86/include/asm/asi.h | 237 ++++++++
> arch/x86/include/asm/mmu_context.h | 20 +-
> arch/x86/include/asm/tlbflush.h | 10 +
> arch/x86/kernel/asm-offsets.c | 4 +
> arch/x86/kvm/Makefile | 3 +-
> arch/x86/kvm/mmu.c | 2 +-
> arch/x86/kvm/vmx/isolation.c | 231 ++++++++
> arch/x86/kvm/vmx/vmx.c | 14 +-
> arch/x86/kvm/vmx/vmx.h | 24 +
> arch/x86/kvm/x86.c | 68 +++-
> arch/x86/kvm/x86.h | 1 +
> arch/x86/mm/Makefile | 2 +
> arch/x86/mm/asi.c | 459 +++++++++++++++
> arch/x86/mm/asi_pagetable.c | 1077 ++++++++++++++++++++++++++++++++++++
> arch/x86/mm/fault.c | 7 +
> include/linux/kvm_host.h | 7 +
> kernel/rcu/tree.c | 56 ++-
> kernel/rcu/tree.h | 56 +--
> kernel/sched/core.c | 4 +
> security/Kconfig | 10 +
> 21 files changed, 2269 insertions(+), 65 deletions(-)
> create mode 100644 arch/x86/include/asm/asi.h
> create mode 100644 arch/x86/kvm/vmx/isolation.c
> create mode 100644 arch/x86/mm/asi.c
> create mode 100644 arch/x86/mm/asi_pagetable.c
>

2019-07-11 21:40:02

by Thomas Gleixner

Subject: Re: [RFC v2 01/26] mm/x86: Introduce kernel address space isolation

On Thu, 11 Jul 2019, Alexandre Chartre wrote:
> +/*
> + * When isolation is active, the address space doesn't necessarily map
> + * the percpu offset value (this_cpu_off) which is used to get pointers
> + * to percpu variables. So functions which can be invoked while isolation
> + * is active shouldn't be getting pointers to percpu variables (i.e. with
> + * get_cpu_var() or this_cpu_ptr()). Instead percpu variable should be
> + * directly read or written to (i.e. with this_cpu_read() or
> + * this_cpu_write()).
> + */
> +
> +int asi_enter(struct asi *asi)
> +{
> + enum asi_session_state state;
> + struct asi *current_asi;
> + struct asi_session *asi_session;
> +
> + state = this_cpu_read(cpu_asi_session.state);
> + /*
> + * We can re-enter isolation, but only with the same ASI (we don't
> + * support nesting isolation). Also, if isolation is still active,
> + * then we should be re-entering with the same task.
> + */
> + if (state == ASI_SESSION_STATE_ACTIVE) {
> + current_asi = this_cpu_read(cpu_asi_session.asi);
> + if (current_asi != asi) {
> + WARN_ON(1);
> + return -EBUSY;
> + }
> + WARN_ON(this_cpu_read(cpu_asi_session.task) != current);
> + return 0;
> + }
> +
> + /* isolation is not active so we can safely access the percpu pointer */
> + asi_session = &get_cpu_var(cpu_asi_session);

get_cpu_var()?? Where is the matching put_cpu_var() ? get_cpu_var()
contains a preempt_disable ...

What's wrong with a simple this_cpu_ptr() here?
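
The imbalance can be illustrated with a small userspace model in which a counter stands in for the preempt count. This is only a sketch: the sketch_* helpers are hypothetical and merely mimic the documented behaviour that get_cpu_var() disables preemption until a matching put_cpu_var(), while this_cpu_ptr() leaves the preempt count alone.

```c
#include <assert.h>

/* Userspace model: the counter stands in for the kernel's preempt count. */
int sketch_preempt_count;
static int sketch_percpu_var;

/* Models get_cpu_var(): disables preemption, then yields the variable. */
int *sketch_get_cpu_var(void)
{
	sketch_preempt_count++;
	return &sketch_percpu_var;
}

/* Models put_cpu_var(): re-enables preemption. */
void sketch_put_cpu_var(void)
{
	assert(sketch_preempt_count > 0);
	sketch_preempt_count--;
}

/* Models this_cpu_ptr(): no effect on the preempt count. */
int *sketch_this_cpu_ptr(void)
{
	return &sketch_percpu_var;
}

/* As posted: takes the variable but never puts it back. */
void sketch_enter_leaky(void)
{
	int *session = sketch_get_cpu_var();

	*session = 1;
}

/* As suggested: plain pointer access, nothing to balance. */
void sketch_enter_fixed(void)
{
	int *session = sketch_this_cpu_ptr();

	*session = 1;
}
```

Each call to the leaky variant leaves the counter one higher, which in the kernel would mean preemption stays disabled after the function returns.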

> +void asi_exit(struct asi *asi)
> +{
> + struct asi_session *asi_session;
> + enum asi_session_state asi_state;
> + unsigned long original_cr3;
> +
> + asi_state = this_cpu_read(cpu_asi_session.state);
> + if (asi_state == ASI_SESSION_STATE_INACTIVE)
> + return;
> +
> + /* TODO: Kick sibling hyperthread before switching to kernel cr3 */
> + original_cr3 = this_cpu_read(cpu_asi_session.original_cr3);
> + if (original_cr3)

Why would this be 0 if the session is active?

> + write_cr3(original_cr3);
> +
> + /* page-table was switched, we can now access the percpu pointer */
> + asi_session = &get_cpu_var(cpu_asi_session);

See above.

> + WARN_ON(asi_session->task != current);
> + asi_session->state = ASI_SESSION_STATE_INACTIVE;
> + asi_session->asi = NULL;
> + asi_session->task = NULL;
> + asi_session->original_cr3 = 0;
> +}

Thanks,

tglx

2019-07-11 22:44:56

by Dave Hansen

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

On 7/11/19 7:25 AM, Alexandre Chartre wrote:
> - Kernel code mapped to the ASI page-table has been reduced to:
> . the entire kernel (I still need to test with only the kernel text)
> . the cpu entry area (because we need the GDT to be mapped)
> . the cpu ASI session (for managing ASI)
> . the current stack
>
> - Optionally, an ASI can request the following kernel mapping to be added:
> . the stack canary
> . the cpu offsets (this_cpu_off)
> . the current task
> . RCU data (rcu_data)
> . CPU HW events (cpu_hw_events).

I don't see the per-cpu areas in here. But, the ASI macros in
entry_64.S (and asi_start_abort()) use per-cpu data.

Also, this stuff seems to do naughty stuff (calling C code, touching
per-cpu data) before the PTI CR3 writes have been done. But, I don't
see anything excluding PTI and this code from coexisting.

2019-07-12 07:45:40

by Alexandre Chartre

Subject: Re: [RFC v2 01/26] mm/x86: Introduce kernel address space isolation


On 7/11/19 11:33 PM, Thomas Gleixner wrote:
> On Thu, 11 Jul 2019, Alexandre Chartre wrote:
>> +/*
>> + * When isolation is active, the address space doesn't necessarily map
>> + * the percpu offset value (this_cpu_off) which is used to get pointers
>> + * to percpu variables. So functions which can be invoked while isolation
>> + * is active shouldn't be getting pointers to percpu variables (i.e. with
>> + * get_cpu_var() or this_cpu_ptr()). Instead percpu variable should be
>> + * directly read or written to (i.e. with this_cpu_read() or
>> + * this_cpu_write()).
>> + */
>> +
>> +int asi_enter(struct asi *asi)
>> +{
>> + enum asi_session_state state;
>> + struct asi *current_asi;
>> + struct asi_session *asi_session;
>> +
>> + state = this_cpu_read(cpu_asi_session.state);
>> + /*
>> + * We can re-enter isolation, but only with the same ASI (we don't
>> + * support nesting isolation). Also, if isolation is still active,
>> + * then we should be re-entering with the same task.
>> + */
>> + if (state == ASI_SESSION_STATE_ACTIVE) {
>> + current_asi = this_cpu_read(cpu_asi_session.asi);
>> + if (current_asi != asi) {
>> + WARN_ON(1);
>> + return -EBUSY;
>> + }
>> + WARN_ON(this_cpu_read(cpu_asi_session.task) != current);
>> + return 0;
>> + }
>> +
>> + /* isolation is not active so we can safely access the percpu pointer */
>> + asi_session = &get_cpu_var(cpu_asi_session);
>
> get_cpu_var()?? Where is the matching put_cpu_var() ? get_cpu_var()
> contains a preempt_disable ...
>
> What's wrong with a simple this_cpu_ptr() here?
>

Oops, my mistake, I should be using this_cpu_ptr(). I will replace all get_cpu_var()
with this_cpu_ptr().


>> +void asi_exit(struct asi *asi)
>> +{
>> + struct asi_session *asi_session;
>> + enum asi_session_state asi_state;
>> + unsigned long original_cr3;
>> +
>> + asi_state = this_cpu_read(cpu_asi_session.state);
>> + if (asi_state == ASI_SESSION_STATE_INACTIVE)
>> + return;
>> +
>> + /* TODO: Kick sibling hyperthread before switching to kernel cr3 */
>> + original_cr3 = this_cpu_read(cpu_asi_session.original_cr3);
>> + if (original_cr3)
>
> Why would this be 0 if the session is active?
>

Correct, original_cr3 won't be 0. I think this is a leftover from a previous version
where original_cr3 was handled differently.


>> + write_cr3(original_cr3);
>> +
>> + /* page-table was switched, we can now access the percpu pointer */
>> + asi_session = &get_cpu_var(cpu_asi_session);
>
> See above.
>

Will fix that.


Thanks,

alex.

>> + WARN_ON(asi_session->task != current);
>> + asi_session->state = ASI_SESSION_STATE_INACTIVE;
>> + asi_session->asi = NULL;
>> + asi_session->task = NULL;
>> + asi_session->original_cr3 = 0;
>> +}
>
> Thanks,
>
> tglx
>

2019-07-12 08:11:53

by Alexandre Chartre

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation


On 7/12/19 12:38 AM, Dave Hansen wrote:
> On 7/11/19 7:25 AM, Alexandre Chartre wrote:
>> - Kernel code mapped to the ASI page-table has been reduced to:
>> . the entire kernel (I still need to test with only the kernel text)
>> . the cpu entry area (because we need the GDT to be mapped)
>> . the cpu ASI session (for managing ASI)
>> . the current stack
>>
>> - Optionally, an ASI can request the following kernel mapping to be added:
>> . the stack canary
>> . the cpu offsets (this_cpu_off)
>> . the current task
>> . RCU data (rcu_data)
>> . CPU HW events (cpu_hw_events).
>
> I don't see the per-cpu areas in here. But, the ASI macros in
> entry_64.S (and asi_start_abort()) use per-cpu data.

We don't map all per-cpu areas, but only the per-cpu variables we need. ASI
code uses the per-cpu cpu_asi_session variable which is mapped when an ASI
is created (see patch 15/26):

+ /*
+ * Map the percpu ASI sessions. This is used by interrupt handlers
+ * to figure out if we have entered isolation and switch back to
+ * the kernel address space.
+ */
+ err = ASI_MAP_CPUVAR(asi, cpu_asi_session);
+ if (err)
+ return err;


> Also, this stuff seems to do naughty stuff (calling C code, touching
> per-cpu data) before the PTI CR3 writes have been done. But, I don't
> see anything excluding PTI and this code from coexisting.

My understanding is that PTI CR3 writes only happen when switching to/from
userland, while ASI enter/exit/abort happens when we are already in the kernel.
So asi_start_abort() is not called when coming from userland, and it does not
interact with PTI.

For example, if ASI is used during a syscall (e.g. with KVM), we have:

-> syscall
- PTI CR3 write (kernel CR3)
- syscall handler:
...
asi_enter()-> write ASI CR3
.. code run with ASI ..
asi_exit() or asi abort -> restore original CR3
...
- PTI CR3 write (userland CR3)
<- syscall
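
The sequence above can be written down as a small state model. This is a userspace sketch only: the enum values and sketch_* functions are hypothetical, and the model just tracks which page-table CR3 is assumed to point to at each step, with ASI nested strictly inside the PTI kernel window.

```c
#include <assert.h>

enum sketch_cr3 {
	SKETCH_CR3_USER,
	SKETCH_CR3_KERNEL,
	SKETCH_CR3_ASI,
};

struct sketch_cpu {
	enum sketch_cr3 cr3;
	enum sketch_cr3 original_cr3;
	int asi_active;
};

/* PTI: entering the kernel switches to the kernel page-table. */
void sketch_syscall_entry(struct sketch_cpu *c)
{
	c->cr3 = SKETCH_CR3_KERNEL;
}

/* PTI: returning to userland switches back to the user page-table. */
void sketch_syscall_exit(struct sketch_cpu *c)
{
	assert(!c->asi_active);         /* isolation must be over by now */
	c->cr3 = SKETCH_CR3_USER;
}

/* ASI: remember the current CR3, then switch to the ASI page-table. */
void sketch_asi_enter(struct sketch_cpu *c)
{
	if (c->asi_active)
		return;
	c->original_cr3 = c->cr3;
	c->cr3 = SKETCH_CR3_ASI;
	c->asi_active = 1;
}

/* ASI exit or abort: restore the CR3 value saved on enter. */
void sketch_asi_exit(struct sketch_cpu *c)
{
	if (!c->asi_active)
		return;
	c->cr3 = c->original_cr3;
	c->asi_active = 0;
}
```

In this model ASI never sees the user CR3, because it is entered after the PTI write on syscall entry and exited before the PTI write on return.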


Thanks,

alex.

2019-07-12 10:46:59

by Thomas Gleixner

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

On Thu, 11 Jul 2019, Dave Hansen wrote:

> On 7/11/19 7:25 AM, Alexandre Chartre wrote:
> > - Kernel code mapped to the ASI page-table has been reduced to:
> > . the entire kernel (I still need to test with only the kernel text)
> > . the cpu entry area (because we need the GDT to be mapped)
> > . the cpu ASI session (for managing ASI)
> > . the current stack
> >
> > - Optionally, an ASI can request the following kernel mapping to be added:
> > . the stack canary
> > . the cpu offsets (this_cpu_off)
> > . the current task
> > . RCU data (rcu_data)
> > . CPU HW events (cpu_hw_events).
>
> I don't see the per-cpu areas in here. But, the ASI macros in
> entry_64.S (and asi_start_abort()) use per-cpu data.
>
> Also, this stuff seems to do naughty stuff (calling C code, touching
> per-cpu data) before the PTI CR3 writes have been done. But, I don't
> see anything excluding PTI and this code from coexisting.

That ASI thing is just PTI on steroids.

So why do we need two versions of the same thing? That's absolutely bonkers
and will just introduce subtle bugs and conflicting decisions all over the
place.

The need for ASI is very tightly coupled to the need for PTI and there is
absolutely no point in keeping them separate.

The only difference vs. interrupts and exceptions is that the PTI logic
cares whether they enter from user or from kernel space while ASI only
cares about the kernel entry.

But most exception/interrupt transitions do not need to be handled at
the entry code level because on VMEXIT the exit reason clearly tells
whether a switch to the kernel CR3 is necessary or not. So this has to be
handled at the VMM level already in a very clean and simple way.

I'm not a virt wizard, but according to code inspection and instrumentation
even the NMI on the host is actually reinjected manually into the host via
'int $2' after the VMEXIT and for MCE it looks like manual handling as
well. So why do we need to sprinkle that muck all over the entry code?

From a semantical perspective VMENTER/VMEXIT are very similar to the return
to user / enter to user mechanics. Just that the transition happens in the
VMM code and not at the regular user/kernel transition points.

So why do you want to treat that differently? There is absolutely zero
reason to do so. And there is no reason to create a pointlessly different
version of PTI which introduces yet another variant of a restricted page
table instead of just reusing and extending what's there already.

Thanks,

tglx

2019-07-12 11:45:48

by Peter Zijlstra

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

On Thu, Jul 11, 2019 at 04:25:12PM +0200, Alexandre Chartre wrote:
> Kernel Address Space Isolation aims to use address spaces to isolate some
> parts of the kernel (for example KVM) to prevent leaking sensitive data
> between hyper-threads under speculative execution attacks. You can refer
> to the first version of this RFC for more context:
>
> https://lkml.org/lkml/2019/5/13/515

No, no, no!

That is the crux of this entire series; you're not punting on explaining
exactly why we want to go dig through 26 patches of gunk.

You get to exactly explain what (your definition of) sensitive data is,
and which speculative scenarios and how this approach mitigates them.

And included in that is a high level overview of the whole thing.

On the one hand you've made this implementation for KVM, while on the
other hand you're saying it is generic but then fail to describe any
!KVM user.

AFAIK all speculative fails this is relevant to are now public, so
excruciating horrible details are fine and required.

AFAIK2 this is all because of MDS but it also helps with v1.

AFAIK3 this wants/needs to be combined with core-scheduling to be
useful, but not a single mention of that is anywhere.

2019-07-12 12:04:29

by Alexandre Chartre

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation


On 7/12/19 12:44 PM, Thomas Gleixner wrote:
> On Thu, 11 Jul 2019, Dave Hansen wrote:
>
>> On 7/11/19 7:25 AM, Alexandre Chartre wrote:
>>> - Kernel code mapped to the ASI page-table has been reduced to:
>>> . the entire kernel (I still need to test with only the kernel text)
>>> . the cpu entry area (because we need the GDT to be mapped)
>>> . the cpu ASI session (for managing ASI)
>>> . the current stack
>>>
>>> - Optionally, an ASI can request the following kernel mapping to be added:
>>> . the stack canary
>>> . the cpu offsets (this_cpu_off)
>>> . the current task
>>> . RCU data (rcu_data)
>>> . CPU HW events (cpu_hw_events).
>>
>> I don't see the per-cpu areas in here. But, the ASI macros in
>> entry_64.S (and asi_start_abort()) use per-cpu data.
>>
>> Also, this stuff seems to do naughty stuff (calling C code, touching
>> per-cpu data) before the PTI CR3 writes have been done. But, I don't
>> see anything excluding PTI and this code from coexisting.
>
> That ASI thing is just PTI on steroids.
>
> So why do we need two versions of the same thing? That's absolutely bonkers
> and will just introduce subtle bugs and conflicting decisions all over the
> place.
>
> The need for ASI is very tightly coupled to the need for PTI and there is
> absolutely no point in keeping them separate.
>
> The only difference vs. interrupts and exceptions is that the PTI logic
> cares whether they enter from user or from kernel space while ASI only
> cares about the kernel entry.

I think that's precisely what makes ASI and PTI different and independent.
PTI is just about switching between userland and kernel page-tables, while
ASI is about switching page-tables inside the kernel. You can have ASI without
having PTI. You can also use ASI for kernel threads, that is, for code that is
never triggered from userland and so never involves PTI.

> But most exceptions/interrupts transitions do not require to be handled at
> the entry code level because on VMEXIT the exit reason clearly tells
> whether a switch to the kernel CR3 is necessary or not. So this has to be
> handled at the VMM level already in a very clean and simple way.
>
> I'm not a virt wizard, but according to code inspection and instrumentation
> even the NMI on the host is actually reinjected manually into the host via
> 'int $2' after the VMEXIT and for MCE it looks like manual handling as
> well. So why do we need to sprinkle that muck all over the entry code?
>
> From a semantical perspective VMENTER/VMEXIT are very similar to the return
> to user / enter to user mechanics. Just that the transition happens in the
> VMM code and not at the regular user/kernel transition points.

VMExit returns to the kernel, and ASI is used to run the VMExit handler with
a limited kernel address space instead of the full kernel address space.
Changes to the entry code are required to handle any interrupt/exception that
can occur while running code with ASI (like the KVM VMExit handler).

Note that KVM is an example of an ASI consumer, but ASI is generic and can be
used to run (mostly) any kernel code if you want to run code with a reduced
kernel address space.
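
As a rough illustration of the enter/exit behaviour described here and in the
cover letter (entering isolation loads the ASI page-table into CR3, exiting
restores the previous CR3, and any interrupt/exception handler exits isolation
first), here is a minimal user-space model. It is not code from the patch
series: struct cpu_model and all model_* names are hypothetical, and CR3 is
just a plain integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names come from the patches. */
struct cpu_model {
    uint64_t cr3;        /* stands in for the CR3 register */
    uint64_t saved_cr3;  /* CR3 value present before entering isolation */
    bool     isolated;   /* true while running on the ASI page-table */
};

/* Enter isolation: remember the current CR3, then load the ASI page-table. */
static void model_asi_enter(struct cpu_model *cpu, uint64_t asi_pgd)
{
    assert(!cpu->isolated);  /* this simple model does not nest */
    cpu->saved_cr3 = cpu->cr3;
    cpu->cr3 = asi_pgd;
    cpu->isolated = true;
}

/* Exit isolation: restore the CR3 value saved on entry (no-op otherwise). */
static void model_asi_exit(struct cpu_model *cpu)
{
    if (!cpu->isolated)
        return;
    cpu->cr3 = cpu->saved_cr3;
    cpu->isolated = false;
}

/* Any interrupt/exception handler begins by leaving isolation. */
static void model_interrupt_entry(struct cpu_model *cpu)
{
    model_asi_exit(cpu);
    /* the handler body would run here with the full kernel page-table */
}
```

The point of the model is only that the entry path needs a way to tell whether
isolation is active and to undo it; where that check should live is exactly
what is being debated in this thread.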

> So why do you want to treat that differently? There is absolutely zero
> reason to do so. And there is no reason to create a pointlessly different
> version of PTI which introduces yet another variant of a restricted page
> table instead of just reusing and extending what's there already.
>

As I've tried to explain, to me PTI and ASI are different and independent.
PTI manages switching between userland and kernel page-table, and ASI manages
switching between kernel and a reduced-kernel page-table.


Thanks,

alex.

2019-07-12 12:23:49

by Alexandre Chartre

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation


On 7/12/19 1:44 PM, Peter Zijlstra wrote:
> On Thu, Jul 11, 2019 at 04:25:12PM +0200, Alexandre Chartre wrote:
>> Kernel Address Space Isolation aims to use address spaces to isolate some
>> parts of the kernel (for example KVM) to prevent leaking sensitive data
>> between hyper-threads under speculative execution attacks. You can refer
>> to the first version of this RFC for more context:
>>
>> https://lkml.org/lkml/2019/5/13/515
>
> No, no, no!
>
> That is the crux of this entire series; you're not punting on explaining
> exactly why we want to go dig through 26 patches of gunk.
>
> You get to exactly explain what (your definition of) sensitive data is,
> and which speculative scenarios and how this approach mitigates them.
>
> And included in that is a high level overview of the whole thing.
>

Ok, I will rework the explanation. Sorry about that.

> On the one hand you've made this implementation for KVM, while on the
> other hand you're saying it is generic but then fail to describe any
> !KVM user.
>
> AFAIK all speculative fails this is relevant to are now public, so
> excruciating horrible details are fine and required.

Ok.

> AFAIK2 this is all because of MDS but it also helps with v1.

Yes, mostly MDS and also L1TF.

> AFAIK3 this wants/needs to be combined with core-scheduling to be
> useful, but not a single mention of that is anywhere.

No. This is actually an alternative to core-scheduling. Eventually, ASI
will kick all sibling hyperthreads when it exits isolation and needs to
run with the full kernel page-table (note that this is currently not in
these patches).

So ASI can be seen as an optimization to disabling hyperthreading: instead
of just disabling hyperthreading you run with ASI, and when ASI can't preserve
isolation you will basically run with a single thread.

I will add all that to the explanation.

Thanks,

alex.

2019-07-12 12:38:42

by Peter Zijlstra

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

On Fri, Jul 12, 2019 at 02:17:20PM +0200, Alexandre Chartre wrote:
> On 7/12/19 1:44 PM, Peter Zijlstra wrote:

> > AFAIK3 this wants/needs to be combined with core-scheduling to be
> > useful, but not a single mention of that is anywhere.
>
> No. This is actually an alternative to core-scheduling. Eventually, ASI
> will kick all sibling hyperthreads when exiting isolation and it needs to
> run with the full kernel page-table (note that's currently not in these
> patches).
>
> So ASI can be seen as an optimization to disabling hyperthreading: instead
> of just disabling hyperthreading you run with ASI, and when ASI can't preserve
> isolation you will basically run with a single thread.

You can't do that without much of the scheduler changes present in the
core-scheduling patches.

2019-07-12 12:51:52

by Peter Zijlstra

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:

> I think that's precisely what makes ASI and PTI different and independent.
> PTI is just about switching between userland and kernel page-tables, while
> ASI is about switching page-table inside the kernel. You can have ASI without
> having PTI. You can also use ASI for kernel threads so for code that won't
> be triggered from userland and so which won't involve PTI.

PTI is not mapping kernel space to avoid speculation crap (meltdown).
ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).

See how very similar they are?

Furthermore, to recover SMT for userspace (under MDS) we not only need
core-scheduling but core-scheduling per address space. And ASI was
specifically designed to help mitigate the trainwreck just described.

By explicitly exposing (hopefully harmless) part of the kernel to MDS,
we reduce the part that needs core-scheduling and thus reduce the rate
the SMT siblings need to sync up/schedule.

But looking at it that way, it makes no sense to retain 3 address
spaces, namely:

user / kernel exposed / kernel private.

Specifically, it makes no sense to expose part of the kernel through MDS
but not through Meltdown. Therefore we can merge the user and kernel
exposed address spaces.

And then we've fully replaced PTI.

So no, they're not orthogonal.

2019-07-12 12:54:46

by Alexandre Chartre

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation


On 7/12/19 2:36 PM, Peter Zijlstra wrote:
> On Fri, Jul 12, 2019 at 02:17:20PM +0200, Alexandre Chartre wrote:
>> On 7/12/19 1:44 PM, Peter Zijlstra wrote:
>
>>> AFAIK3 this wants/needs to be combined with core-scheduling to be
>>> useful, but not a single mention of that is anywhere.
>>
>> No. This is actually an alternative to core-scheduling. Eventually, ASI
>> will kick all sibling hyperthreads when exiting isolation and it needs to
>> run with the full kernel page-table (note that's currently not in these
>> patches).
>>
>> So ASI can be seen as an optimization to disabling hyperthreading: instead
>> of just disabling hyperthreading you run with ASI, and when ASI can't preserve
>> isolation you will basically run with a single thread.
>
> You can't do that without much of the scheduler changes present in the
> core-scheduling patches.
>

We hope we can do that without the whole core-scheduling mechanism. The idea
is to send an IPI to all sibling hyperthreads. This IPI will interrupt these
sibling hyperthreads and have them wait for a condition that will allow them
to resume execution (for example when re-entering isolation). We are
investigating this in parallel to ASI.
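
A minimal sketch of that idea, as a single-threaded user-space model rather
than kernel code: the "IPI" is reduced to setting a per-sibling flag, and a
sibling is allowed to make progress only while its flag is clear. All names
(struct core_model, the model_* functions, MODEL_NR_SIBLINGS) are hypothetical,
and two hyperthreads per core is an assumption. The model deliberately ignores
where the stalled sibling actually waits, what happens when both siblings leave
isolation at the same time, and what happens when the task schedules out or
migrates.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_SIBLINGS 2  /* assumption: two hyperthreads per core */

/* Hypothetical model of one physical core; not taken from the patches. */
struct core_model {
    bool stall_requested[MODEL_NR_SIBLINGS]; /* set by the modelled IPI */
    bool isolated[MODEL_NR_SIBLINGS];        /* thread is running with ASI */
};

/* Thread `self` leaves isolation: flag every other sibling (the "IPI"). */
static void model_exit_isolation(struct core_model *core, int self)
{
    core->isolated[self] = false;
    for (int i = 0; i < MODEL_NR_SIBLINGS; i++)
        if (i != self)
            core->stall_requested[i] = true;
}

/* Thread `self` re-enters isolation: release the siblings it had stalled. */
static void model_enter_isolation(struct core_model *core, int self)
{
    core->isolated[self] = true;
    for (int i = 0; i < MODEL_NR_SIBLINGS; i++)
        if (i != self)
            core->stall_requested[i] = false;
}

/* A sibling may make progress only while no stall is pending for it. */
static bool model_sibling_may_run(const struct core_model *core, int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_SIBLINGS);
    return !core->stall_requested[cpu];
}
```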

alex.

2019-07-12 13:08:15

by Peter Zijlstra

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

On Fri, Jul 12, 2019 at 02:47:23PM +0200, Alexandre Chartre wrote:
> On 7/12/19 2:36 PM, Peter Zijlstra wrote:
> > On Fri, Jul 12, 2019 at 02:17:20PM +0200, Alexandre Chartre wrote:
> > > On 7/12/19 1:44 PM, Peter Zijlstra wrote:
> >
> > > > AFAIK3 this wants/needs to be combined with core-scheduling to be
> > > > useful, but not a single mention of that is anywhere.
> > >
> > > No. This is actually an alternative to core-scheduling. Eventually, ASI
> > > will kick all sibling hyperthreads when exiting isolation and it needs to
> > > run with the full kernel page-table (note that's currently not in these
> > > patches).
> > >
> > > So ASI can be seen as an optimization to disabling hyperthreading: instead
> > > of just disabling hyperthreading you run with ASI, and when ASI can't preserve
> > > isolation you will basically run with a single thread.
> >
> > You can't do that without much of the scheduler changes present in the
> > core-scheduling patches.
> >
>
> We hope we can do that without the whole core-scheduling mechanism. The idea
> is to send an IPI to all sibling hyperthreads. This IPI will interrupt these
> sibling hyperthreads and have them wait for a condition that will allow them
> to resume execution (for example when re-entering isolation). We are
> investigating this in parallel to ASI.

You cannot wait from IPI context, so you have to go somewhere else to
wait.

Also, consider what happens when the task that entered isolation decides
to schedule out / gets migrated.

I think you'll quickly find yourself back at core-scheduling.

2019-07-12 13:48:03

by Alexandre Chartre

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation


On 7/12/19 2:50 PM, Peter Zijlstra wrote:
> On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
>
>> I think that's precisely what makes ASI and PTI different and independent.
>> PTI is just about switching between userland and kernel page-tables, while
>> ASI is about switching page-table inside the kernel. You can have ASI without
>> having PTI. You can also use ASI for kernel threads so for code that won't
>> be triggered from userland and so which won't involve PTI.
>
> PTI is not mapping kernel space to avoid speculation crap (meltdown).
> ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
>
> See how very similar they are?
>
>
> Furthermore, to recover SMT for userspace (under MDS) we not only need
> core-scheduling but core-scheduling per address space. And ASI was
> specifically designed to help mitigate the trainwreck just described.
>
> By explicitly exposing (hopefully harmless) part of the kernel to MDS,
> we reduce the part that needs core-scheduling and thus reduce the rate
> the SMT siblings need to sync up/schedule.
>
> But looking at it that way, it makes no sense to retain 3 address
> spaces, namely:
>
> user / kernel exposed / kernel private.
>
> Specifically, it makes no sense to expose part of the kernel through MDS
> but not through Meltdown. Therefore we can merge the user and kernel
> exposed address spaces.

The goal of ASI is to provide a reduced address space which excludes sensitive
data. A user process (for example a database daemon, a web server, or a VMM
like qemu) will likely have sensitive data mapped in its user address space.
Such data shouldn't be mapped with ASI because it can potentially leak to the
sibling hyperthread. For example, if a hyperthread is running a VM then the
VM could potentially access sensitive user data if it is mapped on the
sibling hyperthread with ASI.

The current approach is assuming that anything in the user address space
can be sensitive, and so the user address space shouldn't be mapped in ASI.

It looks like what you are suggesting could be an optimization when creating
an ASI for a process which has no sensitive data (this could be an option to
specify when creating an ASI, for example).

alex.

>
> And then we've fully replaced PTI.
>
> So no, they're not orthogonal.
>

2019-07-12 13:51:57

by Dave Hansen

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

On 7/12/19 1:09 AM, Alexandre Chartre wrote:
> On 7/12/19 12:38 AM, Dave Hansen wrote:
>> I don't see the per-cpu areas in here.  But, the ASI macros in
>> entry_64.S (and asi_start_abort()) use per-cpu data.
>
> We don't map all per-cpu areas, but only the per-cpu variables we need. ASI
> code uses the per-cpu cpu_asi_session variable which is mapped when an ASI
> is created (see patch 15/26):

No fair! I had per-cpu variables just for PTI at some point and had to
give them up! ;)

> +    /*
> +     * Map the percpu ASI sessions. This is used by interrupt handlers
> +     * to figure out if we have entered isolation and switch back to
> +     * the kernel address space.
> +     */
> +    err = ASI_MAP_CPUVAR(asi, cpu_asi_session);
> +    if (err)
> +        return err;
>
>
>> Also, this stuff seems to do naughty stuff (calling C code, touching
>> per-cpu data) before the PTI CR3 writes have been done.  But, I don't
>> see anything excluding PTI and this code from coexisting.
>
> My understanding is that PTI CR3 writes only happen when switching to/from
> userland, while ASI enter/exit/abort happens while we are already in the
> kernel, so asi_start_abort() is not called when coming from userland and
> so does not interact with PTI.

OK, that makes sense. You only need to call C code when interrupted
from something in the kernel (deeper than the entry code), and those
were already running kernel C code anyway.

If this continues to live in the entry code, I think you have a good
clue where to start commenting.

BTW, the PTI CR3 writes are not *strictly* about the interrupt coming
from user vs. kernel. It's tricky because there's a window both in the
entry and exit code where you are in the kernel but have a userspace CR3
value. You end up needing a CR3 write when you have a userspace CR3
value when the interrupt occurred, not only when you interrupt userspace
itself.
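
To make the distinction concrete, here is a small user-space sketch comparing
the two rules: deciding by where the interrupt came from versus deciding by the
CR3 value that was live when it hit. The function names are hypothetical, and
the mask (bit 12 of CR3 selecting the user half of the PTI page-table pair)
reflects my understanding of the x86-64 convention and should be treated as an
assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: bit 12 of CR3 selects the user half of the PTI page-table pair. */
#define MODEL_USER_PGTABLE_MASK (UINT64_C(1) << 12)

/* Naive rule: write CR3 only if the interrupt hit user mode. */
static bool model_switch_by_origin(bool from_user_mode, uint64_t cr3)
{
    (void)cr3;
    return from_user_mode;
}

/* Rule described above: write CR3 whenever it still holds the user value. */
static bool model_switch_by_cr3(bool from_user_mode, uint64_t cr3)
{
    (void)from_user_mode;
    return (cr3 & MODEL_USER_PGTABLE_MASK) != 0;
}
```

The two rules disagree in exactly the window described above: kernel mode
(from_user_mode == false) with a user CR3 still loaded. The naive rule skips
the write there, the CR3-based rule does not.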

2019-07-12 13:53:41

by Alexandre Chartre

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation


On 7/12/19 3:07 PM, Peter Zijlstra wrote:
> On Fri, Jul 12, 2019 at 02:47:23PM +0200, Alexandre Chartre wrote:
>> On 7/12/19 2:36 PM, Peter Zijlstra wrote:
>>> On Fri, Jul 12, 2019 at 02:17:20PM +0200, Alexandre Chartre wrote:
>>>> On 7/12/19 1:44 PM, Peter Zijlstra wrote:
>>>
>>>>> AFAIK3 this wants/needs to be combined with core-scheduling to be
>>>>> useful, but not a single mention of that is anywhere.
>>>>
>>>> No. This is actually an alternative to core-scheduling. Eventually, ASI
>>>> will kick all sibling hyperthreads when exiting isolation and it needs to
>>>> run with the full kernel page-table (note that's currently not in these
>>>> patches).
>>>>
>>>> So ASI can be seen as an optimization to disabling hyperthreading: instead
>>>> of just disabling hyperthreading you run with ASI, and when ASI can't preserve
>>>> isolation you will basically run with a single thread.
>>>
>>> You can't do that without much of the scheduler changes present in the
>>> core-scheduling patches.
>>>
>>
>> We hope we can do that without the whole core-scheduling mechanism. The idea
>> is to send an IPI to all sibling hyperthreads. This IPI will interrupt these
>> sibling hyperthreads and have them wait for a condition that will allow them
>> to resume execution (for example when re-entering isolation). We are
>> investigating this in parallel to ASI.
>
> You cannot wait from IPI context, so you have to go somewhere else to
> wait.
>
> Also, consider what happens when the task that entered isolation decides
> to schedule out / gets migrated.
>
> I think you'll quickly find yourself back at core-scheduling.
>

I haven't looked at details about what has been done so far. Hopefully, we
can do something not too complex, or reuse a (small) part of co-scheduling.

Thanks for pointing this out.

alex.

2019-07-12 13:55:04

by Dave Hansen

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

On 7/12/19 5:50 AM, Peter Zijlstra wrote:
> PTI is not mapping kernel space to avoid speculation crap (meltdown).
> ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
>
> See how very similar they are?

That's an interesting point.

I'd add that PTI maps a part of kernel space that partially overlaps
with what ASI wants.

> But looking at it that way, it makes no sense to retain 3 address
> spaces, namely:
>
> user / kernel exposed / kernel private.
>
> Specifically, it makes no sense to expose part of the kernel through MDS
> but not through Meltdown. Therefore we can merge the user and kernel
> exposed address spaces.
>
> And then we've fully replaced PTI.

So, in one address space (PTI/user or ASI), we say, "screw it" and all
the data mapped is exposed to speculation attacks. We have to be very
careful about what we map and expose here.

The other (full kernel) address space we are more careful about what we
*do* instead of what we map. We map everything but have to add
mitigations to ensure that we don't leak anything back to the exposed
address space.

So, maybe we're not replacing PTI as much as we're growing PTI so that
we can run more kernel code with the (now inappropriately named) user
page tables.

2019-07-12 13:59:34

by Dave Hansen

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

On 7/12/19 6:43 AM, Alexandre Chartre wrote:
> The current approach is assuming that anything in the user address space
> can be sensitive, and so the user address space shouldn't be mapped in ASI.

Is this universally true?

There's certainly *some* mitigation provided by SMAP that would allow
userspace to remain mapped and still protected.

2019-07-12 14:11:26

by Alexandre Chartre

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation


On 7/12/19 3:51 PM, Dave Hansen wrote:
> On 7/12/19 1:09 AM, Alexandre Chartre wrote:
>> On 7/12/19 12:38 AM, Dave Hansen wrote:
>>> I don't see the per-cpu areas in here.  But, the ASI macros in
>>> entry_64.S (and asi_start_abort()) use per-cpu data.
>>
>> We don't map all per-cpu areas, but only the per-cpu variables we need. ASI
>> code uses the per-cpu cpu_asi_session variable which is mapped when an ASI
>> is created (see patch 15/26):
>
> No fair! I had per-cpu variables just for PTI at some point and had to
> give them up! ;)
>
>> +    /*
>> +     * Map the percpu ASI sessions. This is used by interrupt handlers
>> +     * to figure out if we have entered isolation and switch back to
>> +     * the kernel address space.
>> +     */
>> +    err = ASI_MAP_CPUVAR(asi, cpu_asi_session);
>> +    if (err)
>> +        return err;
>>
>>
>>> Also, this stuff seems to do naughty stuff (calling C code, touching
>>> per-cpu data) before the PTI CR3 writes have been done.  But, I don't
>>> see anything excluding PTI and this code from coexisting.
>>
>> My understanding is that PTI CR3 writes only happen when switching to/from
>> userland, while ASI enter/exit/abort happens while we are already in the
>> kernel, so asi_start_abort() is not called when coming from userland and
>> so does not interact with PTI.
>
> OK, that makes sense. You only need to call C code when interrupted
> from something in the kernel (deeper than the entry code), and those
> were already running kernel C code anyway.
>

Exactly.

> If this continues to live in the entry code, I think you have a good
> clue where to start commenting.

Yeah, lot of writing to do... :-)

> BTW, the PTI CR3 writes are not *strictly* about the interrupt coming
> from user vs. kernel. It's tricky because there's a window both in the
> entry and exit code where you are in the kernel but have a userspace CR3
> value. You end up needing a CR3 write when you have a userspace CR3
> value when the interrupt occurred, not only when you interrupt userspace
> itself.
>

Right. ASI is simpler because it comes from the kernel and returns to the
kernel. There's just a small window (on entry) where we have the ASI CR3
but we quickly switch to the full kernel CR3.

alex.

2019-07-12 14:37:54

by Andy Lutomirski

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

On Fri, Jul 12, 2019 at 6:45 AM Alexandre Chartre wrote:
>
>
> On 7/12/19 2:50 PM, Peter Zijlstra wrote:
> > On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
> >
> >> I think that's precisely what makes ASI and PTI different and independent.
> >> PTI is just about switching between userland and kernel page-tables, while
> >> ASI is about switching page-table inside the kernel. You can have ASI without
> >> having PTI. You can also use ASI for kernel threads so for code that won't
> >> be triggered from userland and so which won't involve PTI.
> >
> > PTI is not mapping kernel space to avoid speculation crap (meltdown).
> > ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
> >
> > See how very similar they are?
> >
> >
> > Furthermore, to recover SMT for userspace (under MDS) we not only need
> > core-scheduling but core-scheduling per address space. And ASI was
> > specifically designed to help mitigate the trainwreck just described.
> >
> > By explicitly exposing (hopefully harmless) part of the kernel to MDS,
> > we reduce the part that needs core-scheduling and thus reduce the rate
> > the SMT siblings need to sync up/schedule.
> >
> > But looking at it that way, it makes no sense to retain 3 address
> > spaces, namely:
> >
> > user / kernel exposed / kernel private.
> >
> > Specifically, it makes no sense to expose part of the kernel through MDS
> > but not through Meltdown. Therefore we can merge the user and kernel
> > exposed address spaces.
>
> The goal of ASI is to provide a reduced address space which exclude sensitive
> data. A user process (for example a database daemon, a web server, or a vmm
> like qemu) will likely have sensitive data mapped in its user address space.
> Such data shouldn't be mapped with ASI because it can potentially leak to the
> sibling hyperthread. For example, if an hyperthread is running a VM then the
> VM could potentially access user sensitive data if they are mapped on the
> sibling hyperthread with ASI.

So I've proposed the following slightly hackish thing:

Add a mechanism (call it /dev/xpfo). When you open /dev/xpfo and
fallocate it to some size, you allocate that amount of memory and kick
it out of the kernel direct map. (And pay the IPI cost unless there
were already cached non-direct-mapped pages ready.) Then you map
*that* into your VMs. Now, for a dedicated VM host, you map *all* the
VM private memory from /dev/xpfo. Pretend it's SEV if you want to
determine which pages can be set up like this.

Does this get enough of the benefit at a negligible fraction of the
code complexity cost? (This plus core scheduling, anyway.)

--Andy

2019-07-12 15:19:20

by Thomas Gleixner

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

On Fri, 12 Jul 2019, Peter Zijlstra wrote:
> On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
>
> > I think that's precisely what makes ASI and PTI different and independent.
> > PTI is just about switching between userland and kernel page-tables, while
> > ASI is about switching page-table inside the kernel. You can have ASI without
> > having PTI. You can also use ASI for kernel threads so for code that won't
> > be triggered from userland and so which won't involve PTI.
>
> PTI is not mapping kernel space to avoid speculation crap (meltdown).
> ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
>
> See how very similar they are?
>
> Furthermore, to recover SMT for userspace (under MDS) we not only need
> core-scheduling but core-scheduling per address space. And ASI was
> specifically designed to help mitigate the trainwreck just described.
>
> By explicitly exposing (hopefully harmless) part of the kernel to MDS,
> we reduce the part that needs core-scheduling and thus reduce the rate
> the SMT siblings need to sync up/schedule.
>
> But looking at it that way, it makes no sense to retain 3 address
> spaces, namely:
>
> user / kernel exposed / kernel private.
>
> Specifically, it makes no sense to expose part of the kernel through MDS
> but not through Meltdown. Therefore we can merge the user and kernel
> exposed address spaces.
>
> And then we've fully replaced PTI.
>
> So no, they're not orthogonal.

Right. If we decide to expose more parts of the kernel mappings then that's
just adding more stuff to the existing user (PTI) map mechanics.

As a consequence the CR3 switching points become different or can be
consolidated and that can be handled right at those switching points
depending on static keys or alternatives as we do today with PTI and other
mitigations.

All of that can do without that obscure "state machine" which is solely
there to duct-tape the complete lack of design. The same applies to that
mapping thing. Just mapping randomly selected parts by sticking them into
an array is a non-maintainable approach. This needs proper separation of
text and data sections, so violations of the mapping constraints can be
statically analyzed. Depending solely on the page fault at run time for
analysis is just bound to lead to hard to diagnose failures in the field.

TBH we all know already that this can be done and that this will solve some
of the issues caused by the speculation mess, so just writing some hastily
cobbled together POC code which explodes just by looking at it, does not
lead to anything else than time waste on all ends.

This first needs a clear definition of protection scope. That scope clearly
defines the required mappings and consequently the transition requirements
which provide the necessary transition points for flipping CR3.

If we have agreed on that, then we can think about the implementation
details.

Thanks,

tglx

2019-07-12 15:23:02

by Peter Zijlstra

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

On Fri, Jul 12, 2019 at 06:54:22AM -0700, Dave Hansen wrote:
> On 7/12/19 5:50 AM, Peter Zijlstra wrote:
> > PTI is not mapping kernel space to avoid speculation crap (meltdown).
> > ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
> >
> > See how very similar they are?
>
> That's an interesting point.
>
> I'd add that PTI maps a part of kernel space that partially overlaps
> with what ASI wants.

Right, wherever we put the boundary, we need whatever is required to
cross it.

> > But looking at it that way, it makes no sense to retain 3 address
> > spaces, namely:
> >
> > user / kernel exposed / kernel private.
> >
> > Specifically, it makes no sense to expose part of the kernel through MDS
> > but not through Meltdown. Therefore we can merge the user and kernel
> > exposed address spaces.
> >
> > And then we've fully replaced PTI.
>
> So, in one address space (PTI/user or ASI), we say, "screw it" and all
> the data mapped is exposed to speculation attacks. We have to be very
> careful about what we map and expose here.

Yes, which is why, in an earlier email, I've asked for a clear
definition of 'sensitive' :-)

> So, maybe we're not replacing PTI as much as we're growing PTI so that
> we can run more kernel code with the (now inappropriately named) user
> page tables.

Right.

2019-07-12 15:24:51

by Thomas Gleixner

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

On Fri, 12 Jul 2019, Alexandre Chartre wrote:
> On 7/12/19 3:51 PM, Dave Hansen wrote:
> > BTW, the PTI CR3 writes are not *strictly* about the interrupt coming
> > from user vs. kernel. It's tricky because there's a window both in the
> > entry and exit code where you are in the kernel but have a userspace CR3
> > value. You end up needing a CR3 write when you have a userspace CR3
> > value when the interrupt occurred, not only when you interrupt userspace
> > itself.
> >
>
> Right. ASI is simpler because it comes from the kernel and return to the
> kernel. There's just a small window (on entry) where we have the ASI CR3
> but we quickly switch to the full kernel CR3.

That's wrong in several aspects.

1) You are looking at it purely from the VMM perspective, which is bogus,
as you already said that this can/should be extended to other scenarios
(including the kvm ioctl or such).

So no, it's not just coming from kernel space and returning to it.

If that were true then the entry code could just stay as is because
you could handle _ALL_ of that very trivially in the atomic VMM
enter/exit code.

2) It does not matter how small that window is. If there is a window
then this needs to be covered, no matter what.

Thanks,

tglx

2019-07-12 16:02:14

by Thomas Gleixner

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

On Fri, 12 Jul 2019, Alexandre Chartre wrote:
> On 7/12/19 12:44 PM, Thomas Gleixner wrote:
> > That ASI thing is just PTI on steroids.
> >
> > So why do we need two versions of the same thing? That's absolutely bonkers
> > and will just introduce subtle bugs and conflicting decisions all over the
> > place.
> >
> > The need for ASI is very tightly coupled to the need for PTI and there is
> > absolutely no point in keeping them separate.
> >
> > The only difference vs. interrupts and exceptions is that the PTI logic
> > cares whether they enter from user or from kernel space while ASI only
> > cares about the kernel entry.
>
> I think that's precisely what makes ASI and PTI different and independent.
> PTI is just about switching between userland and kernel page-tables, while
> ASI is about switching page-table inside the kernel. You can have ASI without
> having PTI. You can also use ASI for kernel threads so for code that won't
> be triggered from userland and so which won't involve PTI.

It's still the same concept. And you can argue in circles, it does not
justify yet another mapping setup which is a different copy of some other
mapping setup. Whether PTI is replaced by ASI or PTI is extended to handle
ASI does not matter at all. Having two similar concepts side by side is a
guarantee for disaster.

> > So why do you want to treat that differently? There is absolutely zero
> > reason to do so. And there is no reason to create a pointlessly different
> > version of PTI which introduces yet another variant of a restricted page
> > table instead of just reusing and extending what's there already.
> >
>
> As I've tried to explain, to me PTI and ASI are different and independent.
> PTI manages switching between userland and kernel page-table, and ASI manages
> switching between kernel and a reduced-kernel page-table.

Again. It's the same concept and it does not matter what form of reduced
page tables you use. You always need transition points and in order to make
the transition points work you need reliably mapped bits and pieces.

Also Paul wants to use the same concept for user space so trivial system
calls can do w/o PTI. In some other thread you said yourself that this
could be extended to cover the kvm ioctl, which is clearly a return to user
space.

Are we then going to add another set of randomly sprinkled transition
points and yet another 'state machine' to duct-tape the fallout?

Definitely not going to happen.

Thanks,

tglx

2019-07-12 16:40:17

by Alexandre Chartre

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation



On 7/12/19 5:16 PM, Thomas Gleixner wrote:
> On Fri, 12 Jul 2019, Peter Zijlstra wrote:
>> On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
>>
>>> I think that's precisely what makes ASI and PTI different and independent.
>>> PTI is just about switching between userland and kernel page-tables, while
>>> ASI is about switching page-table inside the kernel. You can have ASI without
>>> having PTI. You can also use ASI for kernel threads so for code that won't
>>> be triggered from userland and so which won't involve PTI.
>>
>> PTI is not mapping kernel space to avoid speculation crap (meltdown).
>> ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
>>
>> See how very similar they are?
>>
>> Furthermore, to recover SMT for userspace (under MDS) we not only need
>> core-scheduling but core-scheduling per address space. And ASI was
>> specifically designed to help mitigate the trainwreck just described.
>>
>> By explicitly exposing (hopefully harmless) part of the kernel to MDS,
>> we reduce the part that needs core-scheduling and thus reduce the rate
>> the SMT siblings need to sync up/schedule.
>>
>> But looking at it that way, it makes no sense to retain 3 address
>> spaces, namely:
>>
>> user / kernel exposed / kernel private.
>>
>> Specifically, it makes no sense to expose part of the kernel through MDS
>> but not through Meltdown. Therefore we can merge the user and kernel
>> exposed address spaces.
>>
>> And then we've fully replaced PTI.
>>
>> So no, they're not orthogonal.
>
> Right. If we decide to expose more parts of the kernel mappings then that's
> just adding more stuff to the existing user (PTI) map mechanics.


If we expose more parts of the kernel mapping by adding them to the existing
user (PTI) map, then we only control the mapping of kernel sensitive data but
we don't control user mapping (with ASI, we exclude all user mappings).

How would you control the mapping of userland sensitive data and exclude them
from the user map? Would you have the application explicitly identify sensitive
data (like Andy suggested with a /dev/xpfo device)?

Thanks,

alex.


> As a consequence the CR3 switching points become different or can be
> consolidated and that can be handled right at those switching points
> depending on static keys or alternatives as we do today with PTI and other
> mitigations.
>
> All of that can do without that obscure "state machine" which is solely
> there to duct-tape the complete lack of design. The same applies to that
> mapping thing. Just mapping randomly selected parts by sticking them into
> an array is a non-maintainable approach. This needs proper separation of
> text and data sections, so violations of the mapping constraints can be
> statically analyzed. Depending solely on the page fault at run time for
> analysis is just bound to lead to hard to diagnose failures in the field.
>
> TBH we all know already that this can be done and that this will solve some
> of the issues caused by the speculation mess, so just writing some hastily
> cobbled together POC code which explodes just by looking at it, does not
> lead to anything else than time waste on all ends.
>
> This first needs a clear definition of protection scope. That scope clearly
> defines the required mappings and consequently the transition requirements
> which provide the necessary transition points for flipping CR3.
>
> If we have agreed on that, then we can think about the implementation
> details.
>
> Thanks,
>
> tglx
>

2019-07-12 16:46:24

by Andy Lutomirski

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation



> On Jul 12, 2019, at 10:37 AM, Alexandre Chartre wrote:
>
>
>
>> On 7/12/19 5:16 PM, Thomas Gleixner wrote:
>>> On Fri, 12 Jul 2019, Peter Zijlstra wrote:
>>>> On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
>>>>
>>>> I think that's precisely what makes ASI and PTI different and independent.
>>>> PTI is just about switching between userland and kernel page-tables, while
>>>> ASI is about switching page-table inside the kernel. You can have ASI without
>>>> having PTI. You can also use ASI for kernel threads so for code that won't
>>>> be triggered from userland and so which won't involve PTI.
>>>
>>> PTI is not mapping kernel space to avoid speculation crap (meltdown).
>>> ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
>>>
>>> See how very similar they are?
>>>
>>> Furthermore, to recover SMT for userspace (under MDS) we not only need
>>> core-scheduling but core-scheduling per address space. And ASI was
>>> specifically designed to help mitigate the trainwreck just described.
>>>
>>> By explicitly exposing (hopefully harmless) part of the kernel to MDS,
>>> we reduce the part that needs core-scheduling and thus reduce the rate
>>> the SMT siblings need to sync up/schedule.
>>>
>>> But looking at it that way, it makes no sense to retain 3 address
>>> spaces, namely:
>>>
>>> user / kernel exposed / kernel private.
>>>
>>> Specifically, it makes no sense to expose part of the kernel through MDS
>>> but not through Meltdown. Therefore we can merge the user and kernel
>>> exposed address spaces.
>>>
>>> And then we've fully replaced PTI.
>>>
>>> So no, they're not orthogonal.
>> Right. If we decide to expose more parts of the kernel mappings then that's
>> just adding more stuff to the existing user (PTI) map mechanics.
>
> If we expose more parts of the kernel mapping by adding them to the existing
> user (PTI) map, then we only control the mapping of kernel sensitive data but
> we don't control user mapping (with ASI, we exclude all user mappings).
>
> How would you control the mapping of userland sensitive data and exclude them
> from the user map?

As I see it, if we think part of the kernel is okay to leak to VM guests,
then we should think it’s okay to leak to userspace and vice versa. At the
end of the day, this may just have to come down to an administrator’s
choice of how careful the mitigations need to be.

> Would you have the application explicitly identify sensitive
> data (like Andy suggested with a /dev/xpfo device)?

That’s not really the intent of my suggestion. I was suggesting that
maybe we don’t need ASI at all if we allow VMs to exclude their memory
from the kernel mapping entirely. Heck, in a setup like this, we can
maybe even get away with turning PTI off under very, very controlled
circumstances. I’m not quite sure what to do about the kernel random
pools, though.

2019-07-12 19:08:10

by Peter Zijlstra

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

On Fri, Jul 12, 2019 at 06:37:47PM +0200, Alexandre Chartre wrote:
> On 7/12/19 5:16 PM, Thomas Gleixner wrote:

> > Right. If we decide to expose more parts of the kernel mappings then that's
> > just adding more stuff to the existing user (PTI) map mechanics.
>
> If we expose more parts of the kernel mapping by adding them to the existing
> user (PTI) map, then we only control the mapping of kernel sensitive data but
> we don't control user mapping (with ASI, we exclude all user mappings).
>
> How would you control the mapping of userland sensitive data and exclude them
> from the user map? Would you have the application explicitly identify sensitive
> data (like Andy suggested with a /dev/xpfo device)?

To what purpose do you want to exclude userspace from the kernel
mapping; that is, what are you mitigating against with that?

2019-07-12 19:49:23

by Thomas Gleixner

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

On Fri, 12 Jul 2019, Alexandre Chartre wrote:
> On 7/12/19 5:16 PM, Thomas Gleixner wrote:
> > On Fri, 12 Jul 2019, Peter Zijlstra wrote:
> > > On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
> > > And then we've fully replaced PTI.
> > >
> > > So no, they're not orthogonal.
> >
> > Right. If we decide to expose more parts of the kernel mappings then that's
> > just adding more stuff to the existing user (PTI) map mechanics.
>
> If we expose more parts of the kernel mapping by adding them to the existing
> user (PTI) map, then we only control the mapping of kernel sensitive data but
> we don't control user mapping (with ASI, we exclude all user mappings).

What prevents you from adding functionality to do so to the PTI
implementation? Nothing.

Again, the underlying concept is exactly the same:

1) Create a restricted mapping from an existing mapping

2) Switch to the restricted mapping when entering a particular execution
context

3) Switch to the unrestricted mapping when leaving that execution context

4) Keep track of the state

The restriction scope is different, but that's conceptually completely
irrelevant. It's a detail which needs to be handled at the implementation
level.

What matters here is the concept and because the concept is the same, this
needs to share the infrastructure for #1 - #4.
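
The four steps above can be sketched as a small user-space model in C. This
is only an illustration under stated assumptions, not kernel code: a
"mapping" is reduced to a bitmask of mapped regions, the `active` field
stands in for the value that would be loaded into CR3, and every name
(`restricted_ctx`, `ctx_init`, `ctx_enter`, `ctx_exit`, `ctx_is_mapped`) is
hypothetical.

```c
#include <stdbool.h>

/* Hypothetical model: one bit per mapped region. Not a kernel API. */
typedef unsigned long mapping_t;

struct restricted_ctx {
    mapping_t full;        /* the existing, unrestricted mapping */
    mapping_t restricted;  /* step 1: subset derived from `full` */
    mapping_t active;      /* stand-in for the page table in CR3 */
    bool isolated;         /* step 4: tracked state */
};

/* Step 1: create a restricted mapping from an existing mapping. */
void ctx_init(struct restricted_ctx *c, mapping_t full, mapping_t keep)
{
    c->full = full;
    c->restricted = full & keep;  /* can never map more than `full` */
    c->active = full;
    c->isolated = false;
}

/* Step 2: switch to the restricted mapping on entering the context. */
void ctx_enter(struct restricted_ctx *c)
{
    if (!c->isolated) {
        c->active = c->restricted;
        c->isolated = true;
    }
}

/* Step 3: switch back to the unrestricted mapping on leaving it. */
void ctx_exit(struct restricted_ctx *c)
{
    if (c->isolated) {
        c->active = c->full;
        c->isolated = false;
    }
}

/* True if every bit of `region` is present in the active mapping. */
bool ctx_is_mapped(const struct restricted_ctx *c, mapping_t region)
{
    return (c->active & region) == region;
}
```

In this model, a PTI-like user and an ASI-like user would differ only in
the `keep` mask passed to `ctx_init()` and in where `ctx_enter()` and
`ctx_exit()` are called, which is the sense in which the restriction scope
is an implementation detail.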

It's obvious that this requires changes to the way PTI works today, but
anything which creates a parallel implementation of any part of the above
#1 - #4 is not going anywhere.

This stuff is way too sensitive and has pretty well understood limitations
and corner cases. So it needs to be designed from the ground up to handle
these properly. Which also means that the possible use cases are going to
be limited.

As I said before, come up with a list of possible usage scenarios and
protection scopes first and please take all the ideas other people have
with this into account. This includes PTI of course.

Once we have that we need to figure out whether these things can actually
coexist and do not contradict each other at the semantical level and
whether the outcome justifies the resulting complexity.

After that we can talk about implementation details.

This problem is not going to be solved with handwaving and an ad hoc
implementation which creates more problems than it solves.

Thanks,

tglx

2019-07-14 15:07:50

by Andy Lutomirski

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

On Fri, Jul 12, 2019 at 12:06 PM Peter Zijlstra wrote:
>
> On Fri, Jul 12, 2019 at 06:37:47PM +0200, Alexandre Chartre wrote:
> > On 7/12/19 5:16 PM, Thomas Gleixner wrote:
>
> > > Right. If we decide to expose more parts of the kernel mappings then that's
> > > just adding more stuff to the existing user (PTI) map mechanics.
> >
> > If we expose more parts of the kernel mapping by adding them to the existing
> > user (PTI) map, then we only control the mapping of kernel sensitive data but
> > we don't control user mapping (with ASI, we exclude all user mappings).
> >
> > How would you control the mapping of userland sensitive data and exclude them
> > from the user map? Would you have the application explicitly identify sensitive
> > data (like Andy suggested with a /dev/xpfo device)?
>
> To what purpose do you want to exclude userspace from the kernel
> mapping; that is, what are you mitigating against with that?

Mutually distrusting user/guest tenants. Imagine an attack against a
VM hosting provider (GCE, for example). If the overall system is
well-designed, the host kernel won't possess secrets that are
important to the overall hosting network. The interesting secrets are
in the memory of other tenants running under the same host. So, if we
can mostly or completely avoid mapping one tenant's memory in the
host, we reduce the amount of valuable information that could leak via
a speculation (or wild read) attack to another tenant.

The practicality of such a scheme is obviously an open question.

2019-07-14 17:13:09

by Mike Rapoport

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

On Fri, Jul 12, 2019 at 10:45:06AM -0600, Andy Lutomirski wrote:
>
>
> > On Jul 12, 2019, at 10:37 AM, Alexandre Chartre wrote:
> >
> >
> >
> >> On 7/12/19 5:16 PM, Thomas Gleixner wrote:
> >>> On Fri, 12 Jul 2019, Peter Zijlstra wrote:
> >>>> On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
> >>>>
> >>>> I think that's precisely what makes ASI and PTI different and independent.
> >>>> PTI is just about switching between userland and kernel page-tables, while
> >>>> ASI is about switching page-table inside the kernel. You can have ASI without
> >>>> having PTI. You can also use ASI for kernel threads so for code that won't
> >>>> be triggered from userland and so which won't involve PTI.
> >>>
> >>> PTI is not mapping kernel space to avoid speculation crap (meltdown).
> >>> ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
> >>>
> >>> See how very similar they are?
> >>>
> >>> Furthermore, to recover SMT for userspace (under MDS) we not only need
> >>> core-scheduling but core-scheduling per address space. And ASI was
> >>> specifically designed to help mitigate the trainwreck just described.
> >>>
> >>> By explicitly exposing (hopefully harmless) part of the kernel to MDS,
> >>> we reduce the part that needs core-scheduling and thus reduce the rate
> >>> the SMT siblings need to sync up/schedule.
> >>>
> >>> But looking at it that way, it makes no sense to retain 3 address
> >>> spaces, namely:
> >>>
> >>> user / kernel exposed / kernel private.
> >>>
> >>> Specifically, it makes no sense to expose part of the kernel through MDS
> >>> but not through Meltdown. Therefore we can merge the user and kernel
> >>> exposed address spaces.
> >>>
> >>> And then we've fully replaced PTI.
> >>>
> >>> So no, they're not orthogonal.
> >> Right. If we decide to expose more parts of the kernel mappings then that's
> >> just adding more stuff to the existing user (PTI) map mechanics.
> >
> > If we expose more parts of the kernel mapping by adding them to the existing
> > user (PTI) map, then we only control the mapping of kernel sensitive data but
> > we don't control user mapping (with ASI, we exclude all user mappings).
> >
> > How would you control the mapping of userland sensitive data and exclude them
> > from the user map?
>
> As I see it, if we think part of the kernel is okay to leak to VM guests,
> then we should think it’s okay to leak to userspace and vice versa. At the end
> of the day, this may just have to come down to an administrator’s choice
> of how careful the mitigations need to be.
>
> > Would you have the application explicitly identify sensitive
> > data (like Andy suggested with a /dev/xpfo device)?
>
> That’s not really the intent of my suggestion. I was suggesting that
> maybe we don’t need ASI at all if we allow VMs to exclude their memory
> from the kernel mapping entirely. Heck, in a setup like this, we can
> maybe even get away with turning PTI off under very, very controlled
> circumstances. I’m not quite sure what to do about the kernel random
> pools, though.

I think KVM already allows excluding VM memory from the kernel mapping
with the "new guest mapping interface" [1]. The memory managed by the host
can be restricted with "mem=" and KVM maps/unmaps the guest memory pages
only when needed.

It would be interesting to see if /dev/xpfo or even
madvise(MAKE_MY_MEMORY_PRIVATE) can be made useful for multi-tenant
container hosts.

[1] https://lore.kernel.org/lkml/[email protected]/

--
Sincerely yours,
Mike.

2019-07-14 18:19:59

by Alexander Graf

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation



On 12.07.19 16:36, Andy Lutomirski wrote:
> On Fri, Jul 12, 2019 at 6:45 AM Alexandre Chartre wrote:
>>
>>
>> On 7/12/19 2:50 PM, Peter Zijlstra wrote:
>>> On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
>>>
>>>> I think that's precisely what makes ASI and PTI different and independent.
>>>> PTI is just about switching between userland and kernel page-tables, while
>>>> ASI is about switching page-table inside the kernel. You can have ASI without
>>>> having PTI. You can also use ASI for kernel threads so for code that won't
>>>> be triggered from userland and so which won't involve PTI.
>>>
>>> PTI is not mapping kernel space to avoid speculation crap (meltdown).
>>> ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).
>>>
>>> See how very similar they are?
>>>
>>>
>>> Furthermore, to recover SMT for userspace (under MDS) we not only need
>>> core-scheduling but core-scheduling per address space. And ASI was
>>> specifically designed to help mitigate the trainwreck just described.
>>>
>>> By explicitly exposing (hopefully harmless) part of the kernel to MDS,
>>> we reduce the part that needs core-scheduling and thus reduce the rate
>>> the SMT siblings need to sync up/schedule.
>>>
>>> But looking at it that way, it makes no sense to retain 3 address
>>> spaces, namely:
>>>
>>> user / kernel exposed / kernel private.
>>>
>>> Specifically, it makes no sense to expose part of the kernel through MDS
>>> but not through Meltdown. Therefore we can merge the user and kernel
>>> exposed address spaces.
>>
>> The goal of ASI is to provide a reduced address space which excludes sensitive
>> data. A user process (for example a database daemon, a web server, or a vmm
>> like qemu) will likely have sensitive data mapped in its user address space.
>> Such data shouldn't be mapped with ASI because it can potentially leak to the
>> sibling hyperthread. For example, if a hyperthread is running a VM then the
>> VM could potentially access user sensitive data if they are mapped on the
>> sibling hyperthread with ASI.
>
> So I've proposed the following slightly hackish thing:
>
> Add a mechanism (call it /dev/xpfo). When you open /dev/xpfo and
> fallocate it to some size, you allocate that amount of memory and kick
> it out of the kernel direct map. (And pay the IPI cost unless there
> were already cached non-direct-mapped pages ready.) Then you map
> *that* into your VMs. Now, for a dedicated VM host, you map *all* the
> VM private memory from /dev/xpfo. Pretend it's SEV if you want to
> determine which pages can be set up like this.
>
> Does this get enough of the benefit at a negligible fraction of the
> code complexity cost? (This plus core scheduling, anyway.)
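
From user space, the quoted proposal (open the device, size it, map it)
might look like the sketch below. It rests on assumptions: /dev/xpfo does
not exist, so the semantics are only the ones described in the quote;
`xpfo_alloc` is a hypothetical helper; and `posix_fallocate()` is used as a
stand-in for the "fallocate it to some size" step. The device path is a
parameter so the function fails cleanly where no such device is present.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: allocate `size` bytes of guest-private memory
 * from an xpfo-style device and map it. Returns NULL on any failure.
 * The removal from the kernel direct map is assumed to happen inside
 * the (non-existent) driver when the file is sized.
 */
void *xpfo_alloc(const char *dev_path, size_t size)
{
    int fd = open(dev_path, O_RDWR);
    if (fd < 0)
        return NULL;

    /* Size the file; posix_fallocate() returns 0 on success. */
    if (posix_fallocate(fd, 0, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }

    /* Map the memory so a VMM could hand it to the guest as RAM. */
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    close(fd);  /* an established mapping stays valid after close() */
    return mem == MAP_FAILED ? NULL : mem;
}
```

A VMM would call something like `xpfo_alloc("/dev/xpfo", guest_ram_size)`
and use the result as the backing store for guest memory. The objections
in the reply below apply to exactly this model.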

The problem with that approach is that you lose the ability to run
legacy workloads that do not support an SEV-like model of "guest owned"
and "host visible" pages, but instead assume you can DMA anywhere.

Without that, your host will have visibility into guest pages via user
space (QEMU) pages which again are mapped in the kernel direct map, so
can be exposed via a spectre gadget into a malicious guest.

Also, please keep in mind that even register state of other VMs may be a
secret that we do not want to leak into other guests.


Alex

2019-07-15 08:25:09

by Alexandre Chartre

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation



On 7/12/19 9:48 PM, Thomas Gleixner wrote:
> On Fri, 12 Jul 2019, Alexandre Chartre wrote:
>> On 7/12/19 5:16 PM, Thomas Gleixner wrote:
>>> On Fri, 12 Jul 2019, Peter Zijlstra wrote:
>>>> On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:
>>>> And then we've fully replaced PTI.
>>>>
>>>> So no, they're not orthogonal.
>>>
>>> Right. If we decide to expose more parts of the kernel mappings then that's
>>> just adding more stuff to the existing user (PTI) map mechanics.
>>
>> If we expose more parts of the kernel mapping by adding them to the existing
>> user (PTI) map, then we only control the mapping of kernel sensitive data but
>> we don't control user mapping (with ASI, we exclude all user mappings).
>
> What prevents you from adding functionality to do so to the PTI
> implementation? Nothing.
>
> Again, the underlying concept is exactly the same:
>
> 1) Create a restricted mapping from an existing mapping
>
> 2) Switch to the restricted mapping when entering a particular execution
> context
>
> 3) Switch to the unrestricted mapping when leaving that execution context
>
> 4) Keep track of the state
>
> The restriction scope is different, but that's conceptually completely
> irrelevant. It's a detail which needs to be handled at the implementation
> level.
>
> What matters here is the concept and because the concept is the same, this
> needs to share the infrastructure for #1 - #4.
>

You are totally right, that's the same concept (page-table creation and switching),
it is just used in different contexts. Sorry it took me that long to realize it,
I was too focused on the use case.


> It's obvious that this requires changes to the way PTI works today, but
> anything which creates a parallel implementation of any part of the above
> #1 - #4 is not going anywhere.
>
> This stuff is way too sensitive and has pretty well understood limitations
> and corner cases. So it needs to be designed from the ground up to handle
> these properly. Which also means that the possible use cases are going to
> be limited.
>
> As I said before, come up with a list of possible usage scenarios and
> protection scopes first and please take all the ideas other people have
> with this into account. This includes PTI of course.
>
> Once we have that we need to figure out whether these things can actually
> coexist and do not contradict each other at the semantical level and
> whether the outcome justifies the resulting complexity.
>
> After that we can talk about implementation details.

Right, that makes perfect sense. I think so far we have the following scenarios:

- PTI
- KVM (i.e. VMExit handler isolation)
- maybe some syscall isolation?

I will look at them in more detail, in particular which mappings they
need and when they need to switch mappings.


And thanks for putting me back on the right track.


alex.

> This problem is not going to be solved with handwaving and an ad hoc
> implementation which creates more problems than it solves.
>
> Thanks,
>
> tglx
>

2019-07-15 08:30:42

by Thomas Gleixner

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

Alexandre,

On Mon, 15 Jul 2019, Alexandre Chartre wrote:
> On 7/12/19 9:48 PM, Thomas Gleixner wrote:
> > As I said before, come up with a list of possible usage scenarios and
> > protection scopes first and please take all the ideas other people have
> > with this into account. This includes PTI of course.
> >
> > Once we have that we need to figure out whether these things can actually
> > coexist and do not contradict each other at the semantical level and
> > whether the outcome justifies the resulting complexity.
> >
> > After that we can talk about implementation details.
>
> Right, that makes perfect sense. I think so far we have the following
> scenarios:
>
> - PTI
> - KVM (i.e. VMExit handler isolation)
> - maybe some syscall isolation?

Vs. the latter you want to talk to Paul Turner. He had some ideas there.

> I will look at them in more detail, in particular which mappings they
> need and when they need to switch mappings.
>
> And thanks for putting me back on the right track.

That's what maintainers are for :)

Thanks,

tglx

2019-07-15 10:35:28

by Peter Zijlstra

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

On Sun, Jul 14, 2019 at 08:06:12AM -0700, Andy Lutomirski wrote:
> On Fri, Jul 12, 2019 at 12:06 PM Peter Zijlstra wrote:
> >
> > On Fri, Jul 12, 2019 at 06:37:47PM +0200, Alexandre Chartre wrote:
> > > On 7/12/19 5:16 PM, Thomas Gleixner wrote:
> >
> > > > Right. If we decide to expose more parts of the kernel mappings then that's
> > > > just adding more stuff to the existing user (PTI) map mechanics.
> > >
> > > If we expose more parts of the kernel mapping by adding them to the existing
> > > user (PTI) map, then we only control the mapping of kernel sensitive data but
> > > we don't control user mapping (with ASI, we exclude all user mappings).
> > >
> > > How would you control the mapping of userland sensitive data and exclude them
> > > from the user map? Would you have the application explicitly identify sensitive
> > > data (like Andy suggested with a /dev/xpfo device)?
> >
> > To what purpose do you want to exclude userspace from the kernel
> > mapping; that is, what are you mitigating against with that?
>
> Mutually distrusting user/guest tenants. Imagine an attack against a
> VM hosting provider (GCE, for example). If the overall system is
> well-designed, the host kernel won't possess secrets that are
> important to the overall hosting network. The interesting secrets are
> in the memory of other tenants running under the same host. So, if we
> can mostly or completely avoid mapping one tenant's memory in the
> host, we reduce the amount of valuable information that could leak via
> a speculation (or wild read) attack to another tenant.
>
> The practicality of such a scheme is obviously an open question.

Ah, ok. So it's some virt specific nonsense. I'll go on ignoring it then
;-)

2019-07-31 18:48:08

by Dario Faggioli

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

Hello all,

I know this is a bit of an old thread, so apologies for being late to
the party. :-)

I would have a question about this:

> > > On 7/12/19 2:36 PM, Peter Zijlstra wrote:
> > > > On Fri, Jul 12, 2019 at 02:17:20PM +0200, Alexandre Chartre
> > > > wrote:
> > > > > On 7/12/19 1:44 PM, Peter Zijlstra wrote:
> > > > > > AFAIK3 this wants/needs to be combined with core-scheduling
> > > > > > to be
> > > > > > useful, but not a single mention of that is anywhere.
> > > > >
> > > > > No. This is actually an alternative to core-scheduling.
> > > > > Eventually, ASI
> > > > > will kick all sibling hyperthreads when exiting isolation and
> > > > > it needs to
> > > > > run with the full kernel page-table (note that's currently
> > > > > not in these
> > > > > patches).
>
I.e., about the fact that ASI is presented as an alternative to
core-scheduling or, at least, as something that will only need to integrate
a small subset of the logic (and of the code) from core-scheduling, as said
here:

> I haven't looked at details about what has been done so far.
> Hopefully, we
> can do something not too complex, or reuse a (small) part of co-
> scheduling.
>
Now, sticking to virtualization examples, if you don't have core-
scheduling, it means that you can have two vcpus, one from VM A and the
other from VM B, running on the same core, one on thread 0 and the
other one on thread 1, at the same time.

And if VM A's vcpu, running on thread 0, exits, then VM B's vcpu
running in guest mode on thread 1 can read host memory, as it is
speculatively accessed (either "normally" or because of cache load
gadgets) and brought into L1D cache by thread 0. And indeed I do see how
ASI protects us from this attack scenario.

However, when the two VMs' vcpus are both running in guest mode, each
one on a thread of the same core, VM B's vcpu running on thread 1 can
exploit L1TF to peek at and steal secrets that VM A's vcpu, running on
thread 0, is accessing, as they're brought into L1D cache... can't it?

How can ASI, *without* core-scheduling, prevent this other attack
scenario?

Because I may very well be missing something, but it looks to me that
it can't. In which case, I'm not sure we can call it "alternative" to
core-scheduling.... Or is the second attack scenario that I tried to
describe above, not considered interesting?

Thanks and Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)



2019-08-22 16:58:59

by Alexandre Chartre

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation


On 7/31/19 6:31 PM, Dario Faggioli wrote:
> Hello all,
>
> I know this is a bit of an old thread, so apologies for being late to
> the party. :-)

And sorry for the late reply, I was away for a while.

> I would have a question about this:
>
>>>> On 7/12/19 2:36 PM, Peter Zijlstra wrote:
>>>>> On Fri, Jul 12, 2019 at 02:17:20PM +0200, Alexandre Chartre
>>>>> wrote:
>>>>>> On 7/12/19 1:44 PM, Peter Zijlstra wrote:
>>>>>>> AFAIK3 this wants/needs to be combined with core-scheduling
>>>>>>> to be
>>>>>>> useful, but not a single mention of that is anywhere.
>>>>>>
>>>>>> No. This is actually an alternative to core-scheduling.
>>>>>> Eventually, ASI
>>>>>> will kick all sibling hyperthreads when exiting isolation and
>>>>>> it needs to
>>>>>> run with the full kernel page-table (note that's currently
>>>>>> not in these
>>>>>> patches).
>>
> I.e., about the fact that ASI is presented as an alternative to
> core-scheduling or, at least, as something that will only need to integrate
> a small subset of the logic (and of the code) from core-scheduling, as said
> here:
>
>> I haven't looked at details about what has been done so far.
>> Hopefully, we
>> can do something not too complex, or reuse a (small) part of co-
>> scheduling.
>>
> Now, sticking to virtualization examples, if you don't have core-
> scheduling, it means that you can have two vcpus, one from VM A and the
> other from VM B, running on the same core, one on thread 0 and the
> other one on thread 1, at the same time.
>
> And if VM A's vcpu, running on thread 0, exits, then VM B's vcpu
> running in guest mode on thread 1 can read host memory, as it is
> speculatively accessed (either "normally" or because of cache load
> gadgets) and brought into L1D cache by thread 0. And indeed I do see how
> ASI protects us from this attack scenario.
>
>
> However, when the two VMs' vcpus are both running in guest mode, each
> one on a thread of the same core, VM B's vcpu running on thread 1 can
> exploit L1TF to peek at and steal secrets that VM A's vcpu, running on
> thread 0, is accessing, as they're brought into L1D cache... can't it?
>
> How can ASI, *without* core-scheduling, prevent this other attack
> scenario?
>
> Because I may very well be missing something, but it looks to me that
> it can't. In which case, I'm not sure we can call it "alternative" to
> core-scheduling.... Or is the second attack scenario that I tried to
> describe above, not considered interesting?
>

Correct, ASI doesn't prevent this attack scenario. However, this case can
be prevented by pinning each VM to different CPU cores (for example, using
cgroups) so that two different VMs never run on CPU threads of the same
CPU core. Of course, this limits the number of VMs you can run to the
number of CPU cores on the system, but we assume this is a reasonable
configuration when you want high-performing VMs.
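
The pinning rule can be sketched as a small placement function in C. It is
an illustration only: `vm_cpus` is a hypothetical helper, and it assumes
that thread t of core c has logical CPU number c + t * ncores. That
numbering is common but not guaranteed; a real tool would read each CPU's
siblings from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list.

```c
#include <stddef.h>

/*
 * Hypothetical helper: give VM number `vm` a whole physical core.
 * Fills cpus[] with the logical CPU ids of that core's hyperthreads,
 * assuming thread t of core c is logical CPU c + t * ncores.
 * Returns the number of CPUs written, or -1 if the VM cannot be
 * placed (for example, more VMs than cores).
 */
int vm_cpus(int vm, int ncores, int threads_per_core,
            int *cpus, size_t len)
{
    if (vm < 0 || vm >= ncores || threads_per_core <= 0 ||
        len < (size_t)threads_per_core)
        return -1;

    /* All sibling threads of one core belong to the same VM. */
    for (int t = 0; t < threads_per_core; t++)
        cpus[t] = vm + t * ncores;

    return threads_per_core;
}
```

With 4 cores and 2 threads per core, VM 0 gets CPUs 0 and 4 and VM 1 gets
CPUs 1 and 5, so no two VMs share a core, and a fifth VM is rejected. The
resulting list is what would be written to a cpuset cgroup or passed to
sched_setaffinity() for the VM's vcpu threads.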

Rgds,

alex.

2020-07-01 13:58:53

by hackapple

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

How is performance when running the KVM example, or when isolating other kernel data?

2020-07-01 14:03:24

by hackapple

Subject: Re: [RFC v2 00/27] Kernel Address Space Isolation

How about performance when running with ASI?
