2020-05-04 18:40:44

by Alexandre Chartre

Subject: [RFC v4][PATCH part-1 0/7] ASI - Part I (ASI Infrastructure and PTI)

This is version 4 of the kernel Address Space Isolation (ASI) RFC. I have
broken it down into three distinct parts:

- Part I: ASI Infrastructure and PTI (this part)
- Part II: Decorated Page-Table
- Part III: ASI Test Driver and CLI

Part I is similar to RFCv3 [3] with some small bug fixes. Parts II and III
extend the initial patchset: part II introduces decorated page-table in
order to provide convenient page-table management functions, and part III
provides a driver and CLI for testing ASI (using parts I and II).

KVM ASI will come later and will rely on the ASI infrastructure (part I)
and decorated page-table (part II).

Patches are based on v5.7-rc4.

Background
==========
Kernel Address Space Isolation aims to use address spaces to isolate some
parts of the kernel (for example KVM) to prevent leaking sensitive data
between CPU hyper-threads under speculative execution attacks.

Over the past years, various speculative attacks (like L1TF or MDS) have
highlighted that data can leak between CPU threads through the CPU (micro)
architecture. In particular, a malicious virtual machine running on a CPU
thread can target data used by a sibling CPU thread from the same CPU core.
Thus, a malicious VM can potentially access data from another VM or from
the host system if they are running on sibling CPU threads.

Core Scheduling [4] can prevent a malicious VM from attacking another VM
by running the same VM on all CPU threads of a CPU core. However, a
malicious VM can still target the host system when the sibling CPU thread
exits the VM and returns to the host.

Address Space Isolation can be applied to KVM to mitigate this VM-to-host
attack by removing secrets from the kernel address space used when running
KVM, thus preventing a malicious VM from collecting any sensitive data
from host.

Address Space Isolation can also be used to implement Page Table Isolation
(PTI [5]) which reduces kernel mappings present in user address spaces to
prevent the Meltdown attack.

Details
=======

ASI
---
An ASI is created by calling asi_create() with a specified ASI type. The
ASI type manages data common to all ASI of the same type. It is used, in
particular, to manage per-ASI type TLB/PCID information.
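
As an illustration of the per-type PCID handling, the asi.h comment in patch
1/7 gives the formula "ASI PCID = (ASI Type PCID Prefix << 4) | Kernel PCID".
The following user-space sketch just replays that arithmetic. The mask and
shift values are copied from the patch; the helper names asi_pcid() and
asi_pcid_prefix() are hypothetical and do not appear in the patchset.

```c
#include <assert.h>
#include <stdint.h>

/* Constants copied from arch/x86/include/asm/asi.h in patch 1/7. */
#define ASI_PCID_PREFIX_SHIFT	4
#define ASI_PCID_PREFIX_MASK	0xff0
#define ASI_KERNEL_PCID_MASK	0x00f

/* Hypothetical helper (not in the patchset): compose a 12-bit ASI PCID. */
static inline uint16_t asi_pcid(unsigned int pcid_prefix,
				unsigned int kernel_pcid)
{
	/* The patch documents the prefix as a value in the range [1, 255]. */
	assert(pcid_prefix >= 1 && pcid_prefix <= 255);
	/* The kernel PCID must fit in the low four bits. */
	assert((kernel_pcid & ~ASI_KERNEL_PCID_MASK) == 0);

	return (uint16_t)(((pcid_prefix << ASI_PCID_PREFIX_SHIFT) &
			   ASI_PCID_PREFIX_MASK) |
			  (kernel_pcid & ASI_KERNEL_PCID_MASK));
}

/* Hypothetical helper: recover the ASI type prefix from an ASI PCID. */
static inline unsigned int asi_pcid_prefix(uint16_t pcid)
{
	return (pcid & ASI_PCID_PREFIX_MASK) >> ASI_PCID_PREFIX_SHIFT;
}
```

For example, with the user ASI prefix from patch 7/7 (0x80) and kernel PCID 3,
asi_pcid(0x80, 3) evaluates to 0x803.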

Then the ASI can be entered with asi_enter() and exited with asi_exit().
When an ASI is in use, any interrupt/exception/NMI will cause the ASI to
be interrupted (ASI_INTERRUPT) and the ASI will be resumed (ASI_RESUME)
when the interrupt/exception/NMI returns.

asi_enter()/asi_exit() and ASI_INTERRUPT/ASI_RESUME switch between the
ASI and the full kernel page-table by updating the CR3 register.
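
The enter/exit and interrupt/resume behavior described above can be pictured
with a small user-space model. This is only a sketch: CR3 is a plain integer,
the structures are simplified stand-ins, and the assumption that nested
interrupts are tracked with a depth counter is inferred from the idepth field
of the ASI session in patch 5/7. All model_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (assumptions). */
struct asi {
	unsigned long asi_cr3;		/* CR3 value selecting the ASI page-table */
};

struct asi_session {
	struct asi *asi;		/* ASI in use on this CPU, or NULL */
	int idepth;			/* interrupt nesting depth */
};

struct cpu_model {
	unsigned long cr3;		/* modeled CR3 register */
	unsigned long kernel_cr3;	/* CR3 value of the full kernel page-table */
	struct asi_session session;
};

/* Model of asi_enter(): start a session and switch to the ASI page-table. */
static void model_asi_enter(struct cpu_model *cpu, struct asi *asi)
{
	cpu->session.asi = asi;
	cpu->session.idepth = 0;
	cpu->cr3 = asi->asi_cr3;
}

/* Model of asi_exit(): end the session and return to the kernel page-table. */
static void model_asi_exit(struct cpu_model *cpu)
{
	cpu->cr3 = cpu->kernel_cr3;
	cpu->session.asi = NULL;
	cpu->session.idepth = 0;
}

/* Model of ASI_INTERRUPT: only the outermost interrupt switches CR3. */
static void model_asi_interrupt(struct cpu_model *cpu)
{
	if (!cpu->session.asi)
		return;
	if (cpu->session.idepth++ == 0)
		cpu->cr3 = cpu->kernel_cr3;
}

/* Model of ASI_RESUME: only the outermost return switches back to the ASI. */
static void model_asi_resume(struct cpu_model *cpu)
{
	if (!cpu->session.asi || cpu->session.idepth == 0)
		return;
	if (--cpu->session.idepth == 0)
		cpu->cr3 = cpu->session.asi->asi_cr3;
}
```

In this model an NMI arriving inside an interrupt handler raises the depth to
2, and the ASI page-table is only restored once both have returned.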

If a task using ASI is scheduled out then its ASI state is saved and it
will be restored when the task is scheduled back.
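
The save/restore done around scheduling can be sketched in the same
user-space style. The logic loosely follows asi_schedule_out() and
asi_schedule_in() from patch 5/7, but the structures, the integer CR3 and the
model_* names are simplifications assumed for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (assumptions). */
struct asi {
	unsigned long asi_cr3;		/* CR3 value selecting the ASI page-table */
};

struct asi_session {
	struct asi *asi;		/* ASI in use, or NULL */
	int idepth;			/* interrupt nesting depth */
};

struct cpu_model {
	unsigned long cr3;		/* modeled CR3 register */
	unsigned long kernel_cr3;	/* CR3 value of the full kernel page-table */
	struct asi_session session;	/* per-CPU ASI session */
};

struct task_model {
	struct asi_session saved;	/* session saved while scheduled out */
};

/* Save the CPU session in the task and leave the CPU without any ASI. */
static void model_schedule_out(struct cpu_model *cpu, struct task_model *task)
{
	if (!cpu->session.asi)
		return;

	task->saved = cpu->session;
	/* An active (non-interrupted) session still runs on the ASI page-table. */
	if (cpu->session.idepth == 0)
		cpu->cr3 = cpu->kernel_cr3;
	cpu->session.asi = NULL;
	cpu->session.idepth = 0;
}

/* Restore the saved session; switch CR3 only if it was not interrupted. */
static void model_schedule_in(struct cpu_model *cpu, struct task_model *task)
{
	if (!task->saved.asi)
		return;

	cpu->session = task->saved;
	if (task->saved.idepth == 0)
		cpu->cr3 = task->saved.asi->asi_cr3;
	task->saved.asi = NULL;
}
```

The point of the sketch is that the scheduler itself always runs with the full
kernel page-table, whether the ASI was active or already interrupted when the
task was scheduled out.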

A page fault occurring while an ASI is in use will either cause the ASI to
be aborted (switch back to the full kernel page-table) or to be preserved.
The behavior depends on the ASI type. For example, for PTI the ASI is
preserved and the kernel page fault handler handles the fault on behalf
of the ASI. But for KVM ASI, the ASI will be aborted and the fault will
be retried with the full kernel page-table.
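
A minimal sketch of this per-type fault policy is shown below. It is a
user-space model, not the asi_fault() implementation from patch 6/7: the
abort_on_fault field, the structures and the model_asi_fault() name are all
assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-type policy flag; the real asi_type layout may differ. */
struct asi_type_model {
	bool abort_on_fault;
};

struct asi_model {
	const struct asi_type_model *type;
	unsigned long asi_cr3;		/* CR3 value selecting the ASI page-table */
};

struct cpu_model {
	unsigned long cr3;		/* modeled CR3 register */
	unsigned long kernel_cr3;	/* CR3 value of the full kernel page-table */
	const struct asi_model *active;	/* ASI in use, or NULL */
};

/*
 * Returns true if the ASI was aborted, in which case the faulting access
 * is expected to be retried with the full kernel page-table. Returns false
 * if the ASI is preserved and the regular fault handler should run.
 */
static bool model_asi_fault(struct cpu_model *cpu)
{
	const struct asi_model *asi = cpu->active;

	if (!asi || !asi->type->abort_on_fault)
		return false;

	cpu->cr3 = cpu->kernel_cr3;
	cpu->active = NULL;
	return true;
}
```

With this model a "user" type (PTI) would be defined with abort_on_fault set
to false, and a "kvm" type with it set to true.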

PTI
---
PTI is now implemented with ASI (user ASI) if both CONFIG_ADDRESS_SPACE_ISOLATION
and CONFIG_PAGE_TABLE_ISOLATION are set. The behavior of PTI is unchanged
but it is now using the ASI infrastructure.

For each user process, a user ASI is defined with the PTI pagetable. The
user ASI is used when running userland code, and it is exited when entering
a syscall. The user ASI is re-entered when the syscall returns to userland.

KVM
---
As already mentioned, KVM ASI is not present in this patchset. KVM ASI
will be implemented on top of this infrastructure. Basically, the KVM ASI
patchset will:
- define a KVM ASI type (DEFINE_ASI_TYPE)
- create and fill a page-table to be used by the KVM ASI
- create a KVM ASI (asi_create_kvm())
- enter the KVM ASI (asi_enter()) on KVM_RUN ioctl
- exit the KVM ASI (asi_exit())

A fault occurring while KVM ASI is in use will cause the ASI to be aborted,
and the code will continue running with the full kernel page-table,
until KVM ASI is explicitly reentered.
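
The planned KVM flow can be sketched as follows. Since KVM ASI is not part of
this patchset, everything here is hypothetical: the functions are user-space
stand-ins that only mimic the signatures declared in asi.h, the PCID prefix
0x01 is an arbitrary placeholder, the -1 error return of asi_enter() is an
assumption, and model_kvm_run() stands for whatever the KVM_RUN path will
eventually do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* User-space stand-ins mimicking the API from asm/asi.h (assumptions). */
typedef unsigned long pgd_t;

struct asi_type {
	int pcid_prefix;
};

struct asi {
	struct asi_type *type;
	pgd_t *pagetable;
	int entered;			/* model only: 1 while the ASI is in use */
};

static struct asi *asi_create(struct asi_type *type)
{
	struct asi *asi;

	if (!type)
		return NULL;
	asi = calloc(1, sizeof(*asi));
	if (asi)
		asi->type = type;
	return asi;
}

static void asi_destroy(struct asi *asi)
{
	free(asi);
}

static void asi_set_pagetable(struct asi *asi, pgd_t *pagetable)
{
	asi->pagetable = pagetable;
}

static int asi_enter(struct asi *asi)
{
	if (!asi->pagetable)
		return -1;		/* assumed error value */
	asi->entered = 1;
	return 0;
}

static void asi_exit(struct asi *asi)
{
	asi->entered = 0;
}

/* Hypothetical KVM ASI type; the prefix value is a placeholder. */
static struct asi_type asi_type_kvm = { .pcid_prefix = 0x01 };

/* Stand-in for running the guest: succeeds only while the ASI is entered. */
static int model_run_guest(const struct asi *asi)
{
	return asi->entered ? 0 : -1;
}

/* Hypothetical KVM_RUN path: enter the KVM ASI, run the guest, exit. */
static int model_kvm_run(struct asi *asi)
{
	int ret = asi_enter(asi);

	if (ret)
		return ret;
	ret = model_run_guest(asi);
	asi_exit(asi);
	return ret;
}
```

A caller would create the ASI once with asi_create(&asi_type_kvm), attach the
KVM page-table with asi_set_pagetable(), and then call model_kvm_run() for
each KVM_RUN ioctl.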

Status
======
The code looks stable and it supports running a full kernel build and
also ltp tests. Performance impact is expected to be limited as the new
code only adds a small number of assembly instructions on syscall and
interrupts. There's probably also room for reducing this number of
instructions.

Changes
=======
RFCv4:

- Fix crash when booting with PTI disabled
- Fix issue when task using ASI is scheduled-in

RFCv3:

- Add ASI Type

- Add generic TLB flushing mechanism for ASI. This mechanism is similar
to the context tracking done when switching mm.

- When ASI is in use, it is interrupted on interrupt/exception/NMI and
resumed when the interrupt/exception/NMI returns.

- If a task using ASI is scheduled in/out then save/restore the corresponding
ASI and update the cpu ASI session.

- Implement PTI with ASI.

- Remove KVM ASI from the patchset. KVM ASI will be provided in a separate
patchset on top of the ASI infrastructure.

- Remove functions to manage, populate and clear page-tables. These functions
were only used to build the KVM ASI page-table. Also, such functions should
be generic page-table functions and not specific to ASI. Mike Rapoport is also
looking at making these functions generic.


References
==========
[1] ASI RFCv1 - https://lkml.org/lkml/2019/5/13/515
[2] ASI RFCv2 - https://lore.kernel.org/lkml/[email protected]
[3] ASI RFCv3 - https://lore.kernel.org/lkml/[email protected]
[4] Core Scheduling - https://lwn.net/Articles/803652
[5] Page Table Isolation (PTI) - https://www.kernel.org/doc/html/latest/x86/pti.html


Thanks,

alex.

-----

Alexandre Chartre (7):
mm/x86: Introduce kernel Address Space Isolation (ASI)
mm/asi: ASI entry/exit interface
mm/asi: Improve TLB flushing when switching to an ASI pagetable
mm/asi: Interrupt ASI on interrupt/exception/NMI
mm/asi: Exit/enter ASI when task enters/exits scheduler
mm/asi: ASI fault handler
mm/asi: Implement PTI with ASI

arch/x86/entry/calling.h | 37 ++-
arch/x86/entry/common.c | 29 ++-
arch/x86/entry/entry_64.S | 28 ++
arch/x86/include/asm/asi.h | 289 +++++++++++++++++++++
arch/x86/include/asm/asi_session.h | 24 ++
arch/x86/include/asm/mmu_context.h | 20 +-
arch/x86/include/asm/tlbflush.h | 23 +-
arch/x86/kernel/asm-offsets.c | 5 +
arch/x86/mm/Makefile | 1 +
arch/x86/mm/asi.c | 402 +++++++++++++++++++++++++++++
arch/x86/mm/fault.c | 20 ++
arch/x86/mm/pti.c | 28 +-
include/linux/mm_types.h | 5 +
include/linux/sched.h | 9 +
kernel/fork.c | 17 ++
kernel/sched/core.c | 17 ++
security/Kconfig | 10 +
17 files changed, 946 insertions(+), 18 deletions(-)
create mode 100644 arch/x86/include/asm/asi.h
create mode 100644 arch/x86/include/asm/asi_session.h
create mode 100644 arch/x86/mm/asi.c

--
2.18.2


2020-05-04 18:40:58

by Alexandre Chartre

Subject: [RFC v4][PATCH part-1 1/7] mm/x86: Introduce kernel Address Space Isolation (ASI)

Introduce core functions and structures for implementing Address Space
Isolation (ASI). Kernel address space isolation provides the ability to
run some kernel code with a reduced kernel address space.

An address space isolation is defined with a struct asi structure and
associated with an ASI type and a pagetable.
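
The page-table association relies on the bit-12 convention documented in
asi.h below: within an 8K-aligned buffer, the full kernel page-table sits in
the first 4K and the ASI page-table in the second 4K. The following
user-space sketch shows that check on plain addresses. PAGE_SHIFT is assumed
to be 12 (4K pages), and the helper names is_asi_pagetable() and
kernel_to_asi_pgd() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT		12	/* assumed: 4K pages */
#define ASI_PGTABLE_BIT		PAGE_SHIFT
#define ASI_PGTABLE_MASK	(1UL << ASI_PGTABLE_BIT)

/* Hypothetical helper: true if the address is in the second 4K half. */
static inline bool is_asi_pagetable(uintptr_t pgd_addr)
{
	return (pgd_addr & ASI_PGTABLE_MASK) != 0;
}

/* Hypothetical helper: ASI half of the 8K buffer holding kernel_pgd. */
static inline uintptr_t kernel_to_asi_pgd(uintptr_t kernel_pgd)
{
	return kernel_pgd | ASI_PGTABLE_MASK;
}
```

asi_set_pagetable() applies the same test and warns when the bit is clear,
that is, when the pointer refers to the kernel half of the buffer.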

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/include/asm/asi.h | 88 ++++++++++++++++++++++++++++++++++++++
arch/x86/mm/Makefile | 1 +
arch/x86/mm/asi.c | 60 ++++++++++++++++++++++++++
security/Kconfig | 10 +++++
4 files changed, 159 insertions(+)
create mode 100644 arch/x86/include/asm/asi.h
create mode 100644 arch/x86/mm/asi.c

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
new file mode 100644
index 000000000000..844a81fb84d2
--- /dev/null
+++ b/arch/x86/include/asm/asi.h
@@ -0,0 +1,88 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef ARCH_X86_MM_ASI_H
+#define ARCH_X86_MM_ASI_H
+
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+
+/*
+ * An Address Space Isolation (ASI) is defined with a struct asi and
+ * associated with an ASI type (struct asi_type). All ASIs of the same
+ * type reference the same ASI type.
+ *
+ * An ASI type has a unique PCID prefix (a value in the range [1, 255])
+ * which is used to define the PCID used for the ASI CR3 value. The
+ * first four bits of the ASI PCID come from the kernel PCID (a value
+ * between 1 and 6, see TLB_NR_DYN_ASIDS). The remaining 8 bits are
+ * filled with the ASI PCID prefix.
+ *
+ * ASI PCID = (ASI Type PCID Prefix << 4) | Kernel PCID
+ *
+ * The ASI PCID is used to optimize TLB flushing when switching between
+ * the kernel and ASI pagetables. The optimization is valid only when
+ * a task switches between ASI of different types. If a task switches
+ * between different ASIs with the same type then the ASI TLB the task
+ * is switching to will always be flushed.
+ */
+
+#define ASI_PCID_PREFIX_SHIFT 4
+#define ASI_PCID_PREFIX_MASK 0xff0
+#define ASI_KERNEL_PCID_MASK 0x00f
+
+/*
+ * We use bit 12 of a pagetable pointer (and so of the CR3 value) as
+ * a way to know if a pointer/CR3 is referencing a full kernel page
+ * table or an ASI page table.
+ *
+ * A full kernel pagetable is always located on the first half of an
+ * 8K buffer, while an ASI pagetable is always located on the second
+ * half of an 8K buffer.
+ */
+#define ASI_PGTABLE_BIT PAGE_SHIFT
+#define ASI_PGTABLE_MASK (1 << ASI_PGTABLE_BIT)
+
+#ifndef __ASSEMBLY__
+
+#include <linux/export.h>
+
+struct asi_type {
+ int pcid_prefix; /* PCID prefix */
+};
+
+/*
+ * Macro to define and declare an ASI type.
+ *
+ * Declaring an ASI type will also define an inline function
+ * (asi_create_<typename>()) to easily create an ASI of the
+ * specified type.
+ */
+#define DEFINE_ASI_TYPE(name, pcid_prefix) \
+ struct asi_type asi_type_ ## name = { \
+ pcid_prefix, \
+ }; \
+ EXPORT_SYMBOL(asi_type_ ## name)
+
+#define DECLARE_ASI_TYPE(name) \
+ extern struct asi_type asi_type_ ## name; \
+ DECLARE_ASI_CREATE(name)
+
+#define DECLARE_ASI_CREATE(name) \
+static inline struct asi *asi_create_ ## name(void) \
+{ \
+ return asi_create(&asi_type_ ## name); \
+}
+
+struct asi {
+ struct asi_type *type; /* ASI type */
+ pgd_t *pagetable; /* ASI pagetable */
+ unsigned long base_cr3; /* base ASI CR3 */
+};
+
+extern struct asi *asi_create(struct asi_type *type);
+extern void asi_destroy(struct asi *asi);
+extern void asi_set_pagetable(struct asi *asi, pgd_t *pagetable);
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* CONFIG_ADDRESS_SPACE_ISOLATION */
+
+#endif
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 98f7c6fa2eaa..e57af263e870 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -48,6 +48,7 @@ obj-$(CONFIG_NUMA_EMU) += numa_emulation.o
obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o
+obj-$(CONFIG_ADDRESS_SPACE_ISOLATION) += asi.o

obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
new file mode 100644
index 000000000000..0a0ac9d6d078
--- /dev/null
+++ b/arch/x86/mm/asi.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2019, 2020, Oracle and/or its affiliates.
+ *
+ * Kernel Address Space Isolation (ASI)
+ */
+
+#include <linux/mm.h>
+#include <linux/slab.h>
+
+#include <asm/asi.h>
+#include <asm/bug.h>
+
+struct asi *asi_create(struct asi_type *type)
+{
+ struct asi *asi;
+
+ if (!type)
+ return NULL;
+
+ asi = kzalloc(sizeof(*asi), GFP_KERNEL);
+ if (!asi)
+ return NULL;
+
+ asi->type = type;
+
+ return asi;
+}
+EXPORT_SYMBOL(asi_create);
+
+void asi_destroy(struct asi *asi)
+{
+ kfree(asi);
+}
+EXPORT_SYMBOL(asi_destroy);
+
+void asi_set_pagetable(struct asi *asi, pgd_t *pagetable)
+{
+ /*
+ * Check that the specified pagetable is properly aligned to be
+ * used as an ASI pagetable. If not, the pagetable is ignored
+ * and entering/exiting ASI will do nothing.
+ */
+ if (!(((unsigned long)pagetable) & ASI_PGTABLE_MASK)) {
+ WARN(1, "ASI %p: invalid ASI pagetable", asi);
+ asi->pagetable = NULL;
+ return;
+ }
+ asi->pagetable = pagetable;
+
+ /*
+ * Initialize the invariant part of the ASI CR3 value. We will
+ * just have to complete the PCID with the kernel PCID before
+ * using it.
+ */
+ asi->base_cr3 = __sme_pa(asi->pagetable) |
+ (asi->type->pcid_prefix << ASI_PCID_PREFIX_SHIFT);
+
+}
+EXPORT_SYMBOL(asi_set_pagetable);
diff --git a/security/Kconfig b/security/Kconfig
index cd3cc7da3a55..d98197eb260c 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -65,6 +65,16 @@ config PAGE_TABLE_ISOLATION

See Documentation/x86/pti.rst for more details.

+config ADDRESS_SPACE_ISOLATION
+ bool "Allow code to run with a reduced kernel address space"
+ default y
+ depends on (X86_64 || X86_PAE) && !UML
+ help
+ This feature provides the ability to run some kernel code
+ with a reduced kernel address space. This can be used to
+ mitigate speculative execution attacks which are able to
+ leak data between sibling CPU hyper-threads.
+
config SECURITY_INFINIBAND
bool "Infiniband Security Hooks"
depends on SECURITY && INFINIBAND
--
2.18.2

2020-05-04 18:42:33

by Alexandre Chartre

Subject: [RFC v4][PATCH part-1 5/7] mm/asi: Exit/enter ASI when task enters/exits scheduler

Exit ASI as soon as a task is entering the scheduler (__schedule()),
otherwise ASI will likely quickly fault, for example when accessing
run queues. The task will return to ASI when it is scheduled again.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/include/asm/asi.h | 3 ++
arch/x86/mm/asi.c | 67 ++++++++++++++++++++++++++++++++++++++
include/linux/sched.h | 9 +++++
kernel/sched/core.c | 17 ++++++++++
4 files changed, 96 insertions(+)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index d240954b2f85..a0733f1e4a67 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -102,6 +102,9 @@ struct asi {
unsigned long base_cr3; /* base ASI CR3 */
};

+void asi_schedule_out(struct task_struct *task);
+void asi_schedule_in(struct task_struct *task);
+
extern struct asi *asi_create(struct asi_type *type);
extern void asi_destroy(struct asi *asi);
extern void asi_set_pagetable(struct asi *asi, pgd_t *pagetable);
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index c91ba82a095b..3795582c66d8 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -229,3 +229,70 @@ void asi_prepare_resume(void)

asi_switch_to_asi_cr3(asi_session->asi, ASI_SWITCH_ON_RESUME);
}
+
+void asi_schedule_out(struct task_struct *task)
+{
+ struct asi_session *asi_session;
+ unsigned long flags;
+ struct asi *asi;
+
+ asi = this_cpu_read(cpu_asi_session.asi);
+ if (!asi)
+ return;
+
+ /*
+ * Save the ASI session.
+ *
+ * Exit the session if it hasn't been interrupted, otherwise
+ * just save the session state.
+ */
+ local_irq_save(flags);
+ if (!this_cpu_read(cpu_asi_session.idepth)) {
+ asi_switch_to_kernel_cr3(asi);
+ task->asi_session.asi = asi;
+ task->asi_session.idepth = 0;
+ } else {
+ asi_session = &get_cpu_var(cpu_asi_session);
+ task->asi_session = *asi_session;
+ asi_session->asi = NULL;
+ asi_session->idepth = 0;
+ }
+ local_irq_restore(flags);
+}
+
+void asi_schedule_in(struct task_struct *task)
+{
+ struct asi_session *asi_session;
+ unsigned long flags;
+ struct asi *asi;
+
+ asi = task->asi_session.asi;
+ if (!asi)
+ return;
+
+ /*
+ * At this point, the CPU shouldn't be using ASI because the
+ * ASI session is expected to be cleared in asi_schedule_out().
+ */
+ WARN_ON(this_cpu_read(cpu_asi_session.asi));
+
+ /*
+ * Restore ASI.
+ *
+ * If the task was scheduled out while using ASI, then the ASI
+ * is already setup and we can immediately switch to ASI page
+ * table.
+ *
+ * Otherwise, if the task was scheduled out while ASI was
+ * interrupted, just restore the ASI session.
+ */
+ local_irq_save(flags);
+ if (!task->asi_session.idepth) {
+ asi_switch_to_asi_cr3(asi, ASI_SWITCH_NOW);
+ } else {
+ asi_session = &get_cpu_var(cpu_asi_session);
+ *asi_session = task->asi_session;
+ }
+ task->asi_session.asi = NULL;
+ local_irq_restore(flags);
+}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4418f5cb8324..ea86bda713ee 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -10,6 +10,7 @@
#include <uapi/linux/sched.h>

#include <asm/current.h>
+#include <asm/asi_session.h>

#include <linux/pid.h>
#include <linux/sem.h>
@@ -1289,6 +1290,14 @@ struct task_struct {
unsigned long prev_lowest_stack;
#endif

+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ /*
+ * ASI session is saved here when the task is scheduled out
+ * while an ASI session was active or interrupted.
+ */
+ struct asi_session asi_session;
+#endif
+
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9a2fbf98fd6f..140071cfa25d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -14,6 +14,7 @@

#include <asm/switch_to.h>
#include <asm/tlb.h>
+#include <asm/asi.h>

#include "../workqueue_internal.h"
#include "../../fs/io-wq.h"
@@ -3241,6 +3242,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
}

tick_nohz_task_switch();
+
return rq;
}

@@ -4006,6 +4008,14 @@ static void __sched notrace __schedule(bool preempt)
struct rq *rq;
int cpu;

+ /*
+ * If the task is using ASI then exit it right away otherwise the
+ * ASI will likely quickly fault, for example when accessing run
+ * queues.
+ */
+ if (IS_ENABLED(CONFIG_ADDRESS_SPACE_ISOLATION))
+ asi_schedule_out(current);
+
cpu = smp_processor_id();
rq = cpu_rq(cpu);
prev = rq->curr;
@@ -4087,6 +4097,13 @@ static void __sched notrace __schedule(bool preempt)
}

balance_callback(rq);
+
+ /*
+ * Now the task will resume execution, we can safely return to
+ * its ASI if one was in use.
+ */
+ if (IS_ENABLED(CONFIG_ADDRESS_SPACE_ISOLATION))
+ asi_schedule_in(current);
}

void __noreturn do_task_dead(void)
--
2.18.2

2020-05-04 18:42:35

by Alexandre Chartre

Subject: [RFC v4][PATCH part-1 7/7] mm/asi: Implement PTI with ASI

ASI supersedes PTI. If both CONFIG_ADDRESS_SPACE_ISOLATION and
CONFIG_PAGE_TABLE_ISOLATION are set then PTI is implemented using
ASI. For each user process, a "user" ASI is then defined with the
PTI pagetable. The user ASI is used when running userland code, and
it is exited when entering a syscall. The user ASI is re-entered
when the syscall returns to userland.

As with any ASI, interrupts/exceptions/NMIs will interrupt the
ASI; the ASI will resume when the interrupt/exception/NMI has
completed. Faults won't abort the user ASI as user faults are
handled by the kernel before returning to userland.

Signed-off-by: Alexandre Chartre <[email protected]>
---
arch/x86/entry/calling.h | 13 ++++++++++++-
arch/x86/entry/common.c | 29 ++++++++++++++++++++++++-----
arch/x86/entry/entry_64.S | 6 ++++++
arch/x86/include/asm/asi.h | 9 +++++++++
arch/x86/include/asm/tlbflush.h | 11 +++++++++--
arch/x86/mm/asi.c | 9 +++++++++
arch/x86/mm/pti.c | 28 ++++++++++++++++++++--------
include/linux/mm_types.h | 5 +++++
kernel/fork.c | 17 +++++++++++++++++
9 files changed, 111 insertions(+), 16 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index ca23b79adecf..e452fce1435f 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -176,16 +176,27 @@ For 32-bit we have the following conventions - kernel is built with
#if defined(CONFIG_ADDRESS_SPACE_ISOLATION)

/*
- * For now, ASI is not compatible with PTI.
+ * ASI supersedes the entry points used by PTI. If both
+ * CONFIG_ADDRESS_SPACE_ISOLATION and CONFIG_PAGE_TABLE_ISOLATION are
+ * set then PTI is implemented using ASI.
*/

.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+ ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+ ASI_INTERRUPT \scratch_reg
+.Lend_\@:
.endm

.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
+ ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+ ASI_RESUME \scratch_reg
+.Lend_\@:
.endm

.macro SWITCH_TO_USER_CR3_STACK scratch_reg:req
+ ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+ ASI_RESUME \scratch_reg
+.Lend_\@:
.endm

.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 76735ec813e6..752b6672d455 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -35,6 +35,7 @@
#include <asm/nospec-branch.h>
#include <asm/io_bitmap.h>
#include <asm/syscall.h>
+#include <asm/asi.h>

#define CREATE_TRACE_POINTS
#include <trace/events/syscalls.h>
@@ -50,6 +51,13 @@ __visible inline void enter_from_user_mode(void)
static inline void enter_from_user_mode(void) {}
#endif

+static inline void syscall_enter(void)
+{
+ /* syscall enter has interrupted ASI, now exit ASI */
+ asi_exit(current->mm->user_asi);
+ enter_from_user_mode();
+}
+
static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch)
{
#ifdef CONFIG_X86_64
@@ -225,6 +233,17 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
mds_user_clear_cpu_buffers();
}

+static inline void prepare_syscall_return(struct pt_regs *regs)
+{
+ prepare_exit_to_usermode(regs);
+
+ /*
+ * Syscall return will resume ASI, prepare resume to enter
+ * user ASI.
+ */
+ asi_deferred_enter(current->mm->user_asi);
+}
+
#define SYSCALL_EXIT_WORK_FLAGS \
(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
_TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT)
@@ -276,7 +295,7 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs)
syscall_slow_exit_work(regs, cached_flags);

local_irq_disable();
- prepare_exit_to_usermode(regs);
+ prepare_syscall_return(regs);
}

#ifdef CONFIG_X86_64
@@ -284,7 +303,7 @@ __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
struct thread_info *ti;

- enter_from_user_mode();
+ syscall_enter();
local_irq_enable();
ti = current_thread_info();
if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY)
@@ -343,7 +362,7 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
/* Handles int $0x80 */
__visible void do_int80_syscall_32(struct pt_regs *regs)
{
- enter_from_user_mode();
+ syscall_enter();
local_irq_enable();
do_syscall_32_irqs_on(regs);
}
@@ -366,7 +385,7 @@ __visible long do_fast_syscall_32(struct pt_regs *regs)
*/
regs->ip = landing_pad;

- enter_from_user_mode();
+ syscall_enter();

local_irq_enable();

@@ -388,7 +407,7 @@ __visible long do_fast_syscall_32(struct pt_regs *regs)
/* User code screwed up. */
local_irq_disable();
regs->ax = -EFAULT;
- prepare_exit_to_usermode(regs);
+ prepare_syscall_return(regs);
return 0; /* Keep it simple: use IRET. */
}

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ac47da63a29f..003c945dd6b0 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -627,6 +627,9 @@ ret_from_intr:
.Lretint_user:
mov %rsp,%rdi
call prepare_exit_to_usermode
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ ASI_PREPARE_RESUME
+#endif
TRACE_IRQS_ON

SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
@@ -1491,6 +1494,9 @@ SYM_CODE_START(nmi)
movq %rsp, %rdi
movq $-1, %rsi
call do_nmi
+#ifdef CONFIG_ADDRESS_SPACE_ISOLATION
+ ASI_PREPARE_RESUME
+#endif

/*
* Return back to user mode. We must *not* do the normal exit
diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
index b8d7b936cd19..ac0594d4f549 100644
--- a/arch/x86/include/asm/asi.h
+++ b/arch/x86/include/asm/asi.h
@@ -62,6 +62,10 @@ struct asi_tlb_state {
struct asi_tlb_pgtable tlb_pgtables[ASI_TLB_NR_DYN_ASIDS];
};

+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+#define ASI_PCID_PREFIX_USER 0x80 /* user ASI */
+#endif
+
struct asi_type {
int pcid_prefix; /* PCID prefix */
struct asi_tlb_state *tlb_state; /* percpu ASI TLB state */
@@ -139,6 +143,7 @@ void asi_schedule_out(struct task_struct *task);
void asi_schedule_in(struct task_struct *task);
bool asi_fault(struct pt_regs *regs, unsigned long error_code,
unsigned long address, enum asi_fault_origin fault_origin);
+void asi_deferred_enter(struct asi *asi);

extern struct asi *asi_create(struct asi_type *type);
extern void asi_destroy(struct asi *asi);
@@ -146,6 +151,10 @@ extern void asi_set_pagetable(struct asi *asi, pgd_t *pagetable);
extern int asi_enter(struct asi *asi);
extern void asi_exit(struct asi *asi);

+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+DECLARE_ASI_TYPE(user);
+#endif
+
static inline void asi_set_log_policy(struct asi *asi, int policy)
{
asi->fault_log_policy = policy;
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 241058ff63ba..db114deeb763 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -390,6 +390,8 @@ extern void initialize_tlbstate_and_flush(void);
*/
static inline void invalidate_user_asid(u16 asid)
{
+ struct asi_tlb_state *tlb_state;
+
/* There is no user ASID if address space separation is off */
if (!IS_ENABLED(CONFIG_PAGE_TABLE_ISOLATION))
return;
@@ -404,8 +406,13 @@ static inline void invalidate_user_asid(u16 asid)
if (!static_cpu_has(X86_FEATURE_PTI))
return;

- __set_bit(kern_pcid(asid),
- (unsigned long *)this_cpu_ptr(&cpu_tlbstate.user_pcid_flush_mask));
+ if (IS_ENABLED(CONFIG_ADDRESS_SPACE_ISOLATION)) {
+ tlb_state = get_cpu_ptr(asi_type_user.tlb_state);
+ tlb_state->tlb_pgtables[asid].id = 0;
+ } else {
+ __set_bit(kern_pcid(asid),
+ (unsigned long *)this_cpu_ptr(&cpu_tlbstate.user_pcid_flush_mask));
+ }
}

/*
diff --git a/arch/x86/mm/asi.c b/arch/x86/mm/asi.c
index a4a5d35fb779..b63a0a883293 100644
--- a/arch/x86/mm/asi.c
+++ b/arch/x86/mm/asi.c
@@ -14,6 +14,10 @@
#include <asm/mmu_context.h>
#include <asm/tlbflush.h>

+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+DEFINE_ASI_TYPE(user, ASI_PCID_PREFIX_USER, false);
+#endif
+
static void asi_log_fault(struct asi *asi, struct pt_regs *regs,
unsigned long error_code, unsigned long address,
enum asi_fault_origin fault_origin)
@@ -314,6 +318,11 @@ void asi_exit(struct asi *asi)
}
EXPORT_SYMBOL(asi_exit);

+void asi_deferred_enter(struct asi *asi)
+{
+ asi_switch_to_asi_cr3(asi, ASI_SWITCH_ON_RESUME);
+}
+
void asi_prepare_resume(void)
{
struct asi_session *asi_session;
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 843aa10a4cb6..a1d09c163709 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -430,6 +430,18 @@ static void __init pti_clone_p4d(unsigned long addr)
*user_p4d = *kernel_p4d;
}

+static void __init pti_map_va(unsigned long va)
+{
+ phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
+ pte_t *target_pte;
+
+ target_pte = pti_user_pagetable_walk_pte(va);
+ if (WARN_ON(!target_pte))
+ return;
+
+ *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL);
+}
+
/*
* Clone the CPU_ENTRY_AREA and associated data into the user space visible
* page table.
@@ -457,15 +469,15 @@ static void __init pti_clone_user_shared(void)
* is set up.
*/

- unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu);
- phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
- pte_t *target_pte;
-
- target_pte = pti_user_pagetable_walk_pte(va);
- if (WARN_ON(!target_pte))
- return;
+ pti_map_va((unsigned long)&per_cpu(cpu_tss_rw, cpu));

- *target_pte = pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL);
+ if (IS_ENABLED(CONFIG_ADDRESS_SPACE_ISOLATION)) {
+ /*
+ * Map the ASI session. We need to always be able
+ * to access the ASI session.
+ */
+ pti_map_va((unsigned long)&per_cpu(cpu_tlbstate, cpu));
+ }
}
}

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4aba6c0c2ba8..e2c6d63f39e5 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -25,6 +25,7 @@

struct address_space;
struct mem_cgroup;
+struct asi;

/*
* Each physical page in the system has a struct page associated with
@@ -534,6 +535,10 @@ struct mm_struct {
atomic_long_t hugetlb_usage;
#endif
struct work_struct async_put_work;
+#if defined(CONFIG_ADDRESS_SPACE_ISOLATION) && defined(CONFIG_PAGE_TABLE_ISOLATION)
+ /* ASI used for user address space */
+ struct asi *user_asi;
+#endif
} __randomize_layout;

/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 8c700f881d92..f245f9a4c55d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -101,6 +101,7 @@
#include <asm/mmu_context.h>
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
+#include <asm/asi.h>

#include <trace/events/sched.h>

@@ -698,6 +699,10 @@ void __mmdrop(struct mm_struct *mm)
mmu_notifier_subscriptions_destroy(mm);
check_mm(mm);
put_user_ns(mm->user_ns);
+ if (IS_ENABLED(CONFIG_ADDRESS_SPACE_ISOLATION) &&
+ IS_ENABLED(CONFIG_PAGE_TABLE_ISOLATION)) {
+ asi_destroy(mm->user_asi);
+ }
free_mm(mm);
}
EXPORT_SYMBOL_GPL(__mmdrop);
@@ -1049,6 +1054,18 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
if (init_new_context(p, mm))
goto fail_nocontext;

+ if (IS_ENABLED(CONFIG_ADDRESS_SPACE_ISOLATION) &&
+ IS_ENABLED(CONFIG_PAGE_TABLE_ISOLATION)) {
+ /*
+ * If we have PTI and ASI then use ASI to switch between
+ * user and kernel spaces, so create an ASI for this mm.
+ */
+ mm->user_asi = asi_create_user();
+ if (!mm->user_asi)
+ goto fail_nocontext;
+ asi_set_pagetable(mm->user_asi, kernel_to_user_pgdp(mm->pgd));
+ }
+
mm->user_ns = get_user_ns(user_ns);
return mm;

--
2.18.2

2020-05-12 17:47:28

by Dave Hansen

Subject: Re: [RFC v4][PATCH part-1 0/7] ASI - Part I (ASI Infrastructure and PTI)

On 5/4/20 7:49 AM, Alexandre Chartre wrote:
> This is version 4 of the kernel Address Space Isolation (ASI) RFC. I have
> broken it down into three distinct parts:
>
> - Part I: ASI Infrastructure and PTI (this part)
> - Part II: Decorated Page-Table
> - Part III: ASI Test Driver and CLI
>
> Part I is similar to RFCv3 [3] with some small bug fixes. Parts II and III
> extend the initial patchset: part II introduces decorated page-table in
> order to provide convenient page-table management functions, and part III
> provides a driver and CLI for testing ASI (using parts I and II).

These look interesting. I haven't found any holes in your methods,
although the interrupt depth tracking worries me a bit. I tried and
failed to do a similar thing with PTI in the NMI path, but you might
have just bested me there. :)

It's very interesting that you've been able to implement PTI underneath
all of this, and the "test driver" is really entertaining!

That said, this is working in some of the nastiest corners of the x86
code and this is going to take quite an investment to get reviewed. I'm
not *quite* sure it's all worth it.

So, this isn't being ignored, I'm just not quite sure what to do with
it, yet.

2020-05-12 19:29:10

by Alexandre Chartre

Subject: Re: [RFC v4][PATCH part-1 0/7] ASI - Part I (ASI Infrastructure and PTI)


Hi Dave,

On 5/12/20 7:45 PM, Dave Hansen wrote:
> On 5/4/20 7:49 AM, Alexandre Chartre wrote:
>> This is version 4 of the kernel Address Space Isolation (ASI) RFC. I have
>> broken it down into three distinct parts:
>>
>> - Part I: ASI Infrastructure and PTI (this part)
>> - Part II: Decorated Page-Table
>> - Part III: ASI Test Driver and CLI
>>
>> Part I is similar to RFCv3 [3] with some small bug fixes. Parts II and III
>> extend the initial patchset: part II introduces decorated page-table in
>> order to provide convenient page-table management functions, and part III
>> provides a driver and CLI for testing ASI (using parts I and II).
>
> These look interesting. I haven't found any holes in your methods,
> although the interrupt depth tracking worries me a bit. I tried and
> failed to do a similar thing with PTI in the NMI path, but you might
> have just bested me there. :)

Thanks for taking a look. I am glad it seems okay; I have run several tests
and was unable to make it fail (so far), while previous versions were easily
breakable.

> It's very interesting that you've been able to implement PTI underneath
> all of this, and the "test driver" is really entertaining!

Yeah, this is a kind of PTI on steroids, as part of the implementation was
based on the PTI implementation but made more generic. The test driver
has proven very useful for testing and debugging. I am currently using it
(with some extensions) for helping me define the KVM ASI: I can connect the
driver to a KVM ASI, dump the KVM ASI faults and dynamically add mappings;
this is very handy.

> That said, this is working in some of the nastiest corners of the x86
> code and this is going to take quite an investment to get reviewed. I'm
> not *quite* sure it's all worth it.

I am also concerned about making changes in all these nasty corners. I am a
bit more confident now that it is working to implement PTI because PTI provides
a good stress test for ASI. I am also waiting for (and reviewing) all x86/entry
changes from tglx; this greatly cleans up the entry code and will hopefully help
for the integration of ASI. I will rebase as soon as all these changes are
integrated and check the benefit for ASI.

> So, this isn't being ignored, I'm just not quite sure what to do with
> it, yet.
>

I am working on defining ASI for KVM. Hopefully this will provide a good
usage example, and make the changes more compelling.

Thanks.

alex.

2020-05-12 20:09:46

by Andy Lutomirski

Subject: Re: [RFC v4][PATCH part-1 0/7] ASI - Part I (ASI Infrastructure and PTI)


> On May 12, 2020, at 10:45 AM, Dave Hansen <[email protected]> wrote:
>
> On 5/4/20 7:49 AM, Alexandre Chartre wrote:
>> This is version 4 of the kernel Address Space Isolation (ASI) RFC. I have
>> broken it down into three distinct parts:
>>
>> - Part I: ASI Infrastructure and PTI (this part)
>> - Part II: Decorated Page-Table
>> - Part III: ASI Test Driver and CLI
>>
>> Part I is similar to RFCv3 [3] with some small bug fixes. Parts II and III
>> extend the initial patchset: part II introduces decorated page-table in
>> order to provide convenient page-table management functions, and part III
>> provides a driver and CLI for testing ASI (using parts I and II).
>
> These look interesting. I haven't found any holes in your methods,
> although the interrupt depth tracking worries me a bit. I tried and
> failed to do a similar thing with PTI in the NMI path, but you might
> have just bested me there. :)
>
> It's very interesting that you've been able to implement PTI underneath
> all of this, and the "test driver" is really entertaining!
>
> That said, this is working in some of the nastiest corners of the x86
> code and this is going to take quite an investment to get reviewed. I'm
> not *quite* sure it's all worth it.
>
> So, this isn't being ignored, I'm just not quite sure what to do with
> it, yet.

I’m going to wait until the dust settles on tglx’s big entry rework before I look at this.