2019-04-25 23:01:37

by Mike Rapoport

Subject: [RFC PATCH 0/7] x86: introduce system call address space isolation

Hi,

Address space isolation has been used to protect the kernel from
userspace and userspace programs from each other since the invention of
virtual memory.

Assuming that kernel bugs and therefore vulnerabilities are inevitable, it
might be worth isolating parts of the kernel to minimize the damage that
these vulnerabilities can cause.

The idea here is to allow an untrusted user access to a potentially
vulnerable kernel in such a way that any kernel vulnerability they find to
exploit is either prevented or the consequences confined to their isolated
address space such that the compromise attempt has minimal impact on other
tenants or the protected structures of the monolithic kernel. Although we
hope to prevent many classes of attack, the first target we're looking at
is ROP gadget protection.

These patches implement a "system call isolation (SCI)" mechanism that
allows running system calls in an isolated address space with reduced page
tables to prevent ROP attacks.

ROP attacks involve corrupting the stack return address to repoint it to a
segment of code you know exists in the kernel that can be used to perform
the action you need to exploit the system.

The idea behind the prevention is that if we fault in pages in the
execution path, we can compare the target address against the kernel symbol
table. So if we're in a function, we allow local jumps (and simply falling
off the end of a page) but if we're jumping to a new function it must be to
an external label in the symbol table. Since ROP attacks are all about
jumping to gadget code which is effectively in the middle of real
functions, the jumps they induce are to code that doesn't have an external
symbol, so it should mostly detect when they happen.
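
In rough C terms the check amounts to something like the sketch below
(simplified; the full logic, including the next-page allowance, lives in
sci_verify_and_map() in patch 2):

	/* sketch: is a code fetch at addr a legitimate jump target? */
	static bool code_access_allowed(unsigned long addr)
	{
		char namebuf[KSYM_NAME_LEN];
		unsigned long size, offset;
		char *modname;

		/* the target must resolve to a known kernel symbol ... */
		if (!kallsyms_lookup(addr, &size, &offset, &modname, namebuf))
			return false;

		/* ... and be an external label, i.e. a function entry */
		return offset == 0;
	}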

This is a very early POC: it's able to run the simple dummy system calls
and a little bit beyond that, but it's not yet stable and robust enough to
boot a system with system call isolation enabled for all system calls.
Still, we wanted to get some feedback about the concept in general as early
as possible.

At this time we are not suggesting any API to enable system call
isolation. Because of the overhead it requires, it should only be
activated for processes or containers we consider untrusted. We still
have no actual numbers, but forcing page faults during system call
execution will surely not come for free.

One possible way is to create a namespace and force system call isolation
on all the processes in that namespace. Another thing that came to mind was
to use a seccomp filter to allow fine-grained control of this feature.
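
For illustration only, a seccomp-based opt-in could look like the sketch
below; note that SECCOMP_RET_ISOLATE is entirely hypothetical, it does not
exist today and the value chosen for it here is made up:

	#include <stddef.h>
	#include <sys/prctl.h>
	#include <sys/syscall.h>
	#include <linux/filter.h>
	#include <linux/seccomp.h>

	#define SECCOMP_RET_ISOLATE 0x00060000U	/* hypothetical action */

	static int isolate_write_syscall(void)
	{
		struct sock_filter filter[] = {
			/* load the syscall number */
			BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
				 offsetof(struct seccomp_data, nr)),
			/* run write() isolated, allow everything else */
			BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
			BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ISOLATE),
			BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
		};
		struct sock_fprog prog = {
			.len = sizeof(filter) / sizeof(filter[0]),
			.filter = filter,
		};

		/* installing an unprivileged filter requires no_new_privs */
		if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
			return -1;
		return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
	}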

The current implementation is pretty much x86-centric, but the general idea
can be used on other architectures.

A brief TOC of the set:
* patch 1 adds definitions of X86_FEATURE_SCI
* patch 2 is the core implementation of system calls isolation (SCI)
* patches 3-5 add hooks to SCI at entry paths and in the page fault
handler
* patch 6 enables the SCI in Kconfig
* patch 7 includes example dummy system calls that are used to
demonstrate the SCI in action.

Mike Rapoport (7):
x86/cpufeatures: add X86_FEATURE_SCI
x86/sci: add core implementation for system call isolation
x86/entry/64: add infrastructure for switching to isolated syscall
context
x86/sci: hook up isolated system call entry and exit
x86/mm/fault: hook up SCI verification
security: enable system call isolation in kernel config
sci: add example system calls to exercise SCI

arch/x86/entry/calling.h | 65 ++++
arch/x86/entry/common.c | 65 ++++
arch/x86/entry/entry_64.S | 13 +-
arch/x86/entry/syscalls/syscall_64.tbl | 3 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/processor-flags.h | 8 +
arch/x86/include/asm/sci.h | 55 +++
arch/x86/include/asm/tlbflush.h | 8 +-
arch/x86/kernel/asm-offsets.c | 7 +
arch/x86/kernel/process_64.c | 5 +
arch/x86/mm/Makefile | 1 +
arch/x86/mm/fault.c | 28 ++
arch/x86/mm/init.c | 2 +
arch/x86/mm/sci.c | 608 +++++++++++++++++++++++++++++++
include/linux/sched.h | 5 +
include/linux/sci.h | 12 +
kernel/Makefile | 2 +-
kernel/exit.c | 3 +
kernel/sci-examples.c | 52 +++
security/Kconfig | 10 +
21 files changed, 956 insertions(+), 5 deletions(-)
create mode 100644 arch/x86/include/asm/sci.h
create mode 100644 arch/x86/mm/sci.c
create mode 100644 include/linux/sci.h
create mode 100644 kernel/sci-examples.c

--
2.7.4


2019-04-25 23:01:38

by Mike Rapoport

Subject: [RFC PATCH 1/7] x86/cpufeatures: add X86_FEATURE_SCI

The X86_FEATURE_SCI feature flag will be set when system call isolation is
enabled.
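
Later patches in the series test the flag in the usual way, e.g.:

	if (!static_cpu_has(X86_FEATURE_SCI))
		return 0;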

Signed-off-by: Mike Rapoport <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +++++++-
2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 6d61225..a01c6dd 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -221,6 +221,7 @@
#define X86_FEATURE_ZEN ( 7*32+28) /* "" CPU is AMD family 0x17 (Zen) */
#define X86_FEATURE_L1TF_PTEINV ( 7*32+29) /* "" L1TF workaround PTE inversion */
#define X86_FEATURE_IBRS_ENHANCED ( 7*32+30) /* Enhanced IBRS */
+#define X86_FEATURE_SCI ( 7*32+31) /* "" System call isolation */

/* Virtualization flags: Linux defined, word 8 */
#define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index a5ea841..79947f0 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -62,6 +62,12 @@
# define DISABLE_PTI (1 << (X86_FEATURE_PTI & 31))
#endif

+#ifdef CONFIG_SYSCALL_ISOLATION
+# define DISABLE_SCI 0
+#else
+# define DISABLE_SCI (1 << (X86_FEATURE_SCI & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -72,7 +78,7 @@
#define DISABLED_MASK4 (DISABLE_PCID)
#define DISABLED_MASK5 0
#define DISABLED_MASK6 0
-#define DISABLED_MASK7 (DISABLE_PTI)
+#define DISABLED_MASK7 (DISABLE_PTI|DISABLE_SCI)
#define DISABLED_MASK8 0
#define DISABLED_MASK9 (DISABLE_MPX|DISABLE_SMAP)
#define DISABLED_MASK10 0
--
2.7.4

2019-04-25 23:01:46

by Mike Rapoport

Subject: [RFC PATCH 4/7] x86/sci: hook up isolated system call entry and exit

When a system call is required to run in an isolated context, CR3 will be
switched to the SCI page table, and a per-cpu variable will contain an
offset from the original CR3. This offset is used to switch back to the
full kernel context when a trap occurs during an isolated system call.
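
In pseudo-code, the CR3 bookkeeping amounts to (a sketch of
sci_syscall_enter() below and the entry macros added in patch 3):

	sci_cr3 = build_cr3(sci_pgd, 0) | asid | (1 << X86_CR3_SCI_PCID_BIT);
	cr3_offset = kernel_cr3 - sci_cr3;	/* stashed per-cpu */

	/* later, in a trap entry path, to regain the full kernel view: */
	write_cr3(__read_cr3() + cr3_offset);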

Signed-off-by: Mike Rapoport <[email protected]>
---
arch/x86/entry/common.c | 61 ++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/process_64.c | 5 ++++
kernel/exit.c | 3 +++
3 files changed, 69 insertions(+)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 7bc105f..8f2a6fd 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -25,12 +25,14 @@
#include <linux/uprobes.h>
#include <linux/livepatch.h>
#include <linux/syscalls.h>
+#include <linux/sci.h>

#include <asm/desc.h>
#include <asm/traps.h>
#include <asm/vdso.h>
#include <linux/uaccess.h>
#include <asm/cpufeature.h>
+#include <asm/tlbflush.h>

#define CREATE_TRACE_POINTS
#include <trace/events/syscalls.h>
@@ -269,6 +271,50 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs)
}

#ifdef CONFIG_X86_64
+
+#ifdef CONFIG_SYSCALL_ISOLATION
+static inline bool sci_required(unsigned long nr)
+{
+ return false;
+}
+
+static inline unsigned long sci_syscall_enter(unsigned long nr)
+{
+ unsigned long sci_cr3, kernel_cr3;
+ unsigned long asid;
+
+ kernel_cr3 = __read_cr3();
+ asid = kernel_cr3 & ~PAGE_MASK;
+
+ sci_cr3 = build_cr3(current->sci->pgd, 0) & PAGE_MASK;
+ sci_cr3 |= (asid | (1 << X86_CR3_SCI_PCID_BIT));
+
+ current->in_isolated_syscall = 1;
+ current->sci->cr3_offset = kernel_cr3 - sci_cr3;
+
+ this_cpu_write(cpu_sci.sci_syscall, 1);
+ this_cpu_write(cpu_sci.sci_cr3_offset, current->sci->cr3_offset);
+
+ write_cr3(sci_cr3);
+
+ return kernel_cr3;
+}
+
+static inline void sci_syscall_exit(unsigned long cr3)
+{
+ if (cr3) {
+ write_cr3(cr3);
+ current->in_isolated_syscall = 0;
+ this_cpu_write(cpu_sci.sci_syscall, 0);
+ sci_clear_data();
+ }
+}
+#else
+static inline bool sci_required(unsigned long nr) { return false; }
+static inline unsigned long sci_syscall_enter(unsigned long nr) { return 0; }
+static inline void sci_syscall_exit(unsigned long cr3) {}
+#endif
+
__visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
struct thread_info *ti;
@@ -286,10 +332,25 @@ __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
*/
nr &= __SYSCALL_MASK;
if (likely(nr < NR_syscalls)) {
+ unsigned long sci_cr3 = 0;
+
nr = array_index_nospec(nr, NR_syscalls);
+
+ if (sci_required(nr)) {
+ int err = sci_init(current);
+
+ if (err) {
+ regs->ax = err;
+ goto err_return_from_syscall;
+ }
+ sci_cr3 = sci_syscall_enter(nr);
+ }
+
regs->ax = sys_call_table[nr](regs);
+ sci_syscall_exit(sci_cr3);
}

+err_return_from_syscall:
syscall_return_slowpath(regs);
}
#endif
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 6a62f4a..b8aa624 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -55,6 +55,8 @@
#include <asm/resctrl_sched.h>
#include <asm/unistd.h>
#include <asm/fsgsbase.h>
+#include <asm/sci.h>
+
#ifdef CONFIG_IA32_EMULATION
/* Not included via unistd.h */
#include <asm/unistd_32_ia32.h>
@@ -581,6 +583,9 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)

switch_to_extra(prev_p, next_p);

+ /* update syscall isolation per-cpu data */
+ sci_switch_to(next_p);
+
#ifdef CONFIG_XEN_PV
/*
* On Xen PV, IOPL bits in pt_regs->flags have no effect, and
diff --git a/kernel/exit.c b/kernel/exit.c
index 2639a30..8e81353 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -62,6 +62,7 @@
#include <linux/random.h>
#include <linux/rcuwait.h>
#include <linux/compat.h>
+#include <linux/sci.h>

#include <linux/uaccess.h>
#include <asm/unistd.h>
@@ -859,6 +860,8 @@ void __noreturn do_exit(long code)
tsk->exit_code = code;
taskstats_exit(tsk, group_dead);

+ sci_exit(tsk);
+
exit_mm();

if (group_dead)
--
2.7.4

2019-04-25 23:01:47

by Mike Rapoport

Subject: [RFC PATCH 3/7] x86/entry/64: add infrastructure for switching to isolated syscall context

The isolated system calls will use a separate page table that does not map
the entire kernel. Exception and interrupt entries should switch the
context to the full kernel page tables and then restore it to continue
the execution of the isolated system call.

Signed-off-by: Mike Rapoport <[email protected]>
---
arch/x86/entry/calling.h | 65 ++++++++++++++++++++++++++++++++++
arch/x86/entry/entry_64.S | 13 +++++--
arch/x86/include/asm/processor-flags.h | 8 +++++
arch/x86/include/asm/tlbflush.h | 8 ++++-
arch/x86/kernel/asm-offsets.c | 7 ++++
5 files changed, 98 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index efb0d1b..766e74e 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -187,6 +187,56 @@ For 32-bit we have the following conventions - kernel is built with
#endif
.endm

+#ifdef CONFIG_SYSCALL_ISOLATION
+
+#define SCI_PCID_BIT X86_CR3_SCI_PCID_BIT
+
+#define THIS_CPU_sci_syscall \
+ PER_CPU_VAR(cpu_sci) + SCI_SYSCALL
+
+#define THIS_CPU_sci_cr3_offset \
+ PER_CPU_VAR(cpu_sci) + SCI_CR3_OFFSET
+
+.macro SAVE_AND_SWITCH_SCI_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+ ALTERNATIVE "jmp .Ldone_\@", "", X86_FEATURE_SCI
+ movq THIS_CPU_sci_syscall, \scratch_reg
+ cmpq $0, \scratch_reg
+ je .Ldone_\@
+ movq %cr3, \scratch_reg
+ bt $SCI_PCID_BIT, \scratch_reg
+ jc .Lsci_context_\@
+ xorq \save_reg, \save_reg
+ jmp .Ldone_\@
+.Lsci_context_\@:
+ movq \scratch_reg, \save_reg
+ addq THIS_CPU_sci_cr3_offset, \scratch_reg
+ movq \scratch_reg, %cr3
+.Ldone_\@:
+.endm
+
+.macro RESTORE_SCI_CR3 scratch_reg:req save_reg:req
+ ALTERNATIVE "jmp .Ldone_\@", "", X86_FEATURE_SCI
+ movq THIS_CPU_sci_syscall, \scratch_reg
+ cmpq $0, \scratch_reg
+ je .Ldone_\@
+ movq \save_reg, \scratch_reg
+ cmpq $0, \scratch_reg
+ je .Ldone_\@
+ xorq \save_reg, \save_reg
+ movq \scratch_reg, %cr3
+.Ldone_\@:
+.endm
+
+#else /* CONFIG_SYSCALL_ISOLATION */
+
+.macro SAVE_AND_SWITCH_SCI_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+.endm
+
+.macro RESTORE_SCI_CR3 scratch_reg:req save_reg:req
+.endm
+
+#endif /* CONFIG_SYSCALL_ISOLATION */
+
#ifdef CONFIG_PAGE_TABLE_ISOLATION

/*
@@ -264,6 +314,21 @@ For 32-bit we have the following conventions - kernel is built with
ALTERNATIVE "jmp .Ldone_\@", "", X86_FEATURE_PTI
movq %cr3, \scratch_reg
movq \scratch_reg, \save_reg
+
+#ifdef CONFIG_SYSCALL_ISOLATION
+ /*
+ * Test the SCI PCID bit. If set, then the SCI page tables are
+ * active. If clear CR3 has either the kernel or user page
+ * table active.
+ */
+ ALTERNATIVE "jmp .Lcheck_user_pt_\@", "", X86_FEATURE_SCI
+ bt $SCI_PCID_BIT, \scratch_reg
+ jnc .Lcheck_user_pt_\@
+ addq THIS_CPU_sci_cr3_offset, \scratch_reg
+ movq \scratch_reg, %cr3
+ jmp .Ldone_\@
+.Lcheck_user_pt_\@:
+#endif
/*
* Test the user pagetable bit. If set, then the user page tables
* are active. If clear CR3 already has the kernel page table
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb..3cef67b 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -543,7 +543,7 @@ ENTRY(interrupt_entry)
ENCODE_FRAME_POINTER 8

testb $3, CS+8(%rsp)
- jz 1f
+ jz .Linterrupt_entry_kernel

/*
* IRQ from user mode.
@@ -559,12 +559,17 @@ ENTRY(interrupt_entry)

CALL_enter_from_user_mode

-1:
+.Linterrupt_entry_done:
ENTER_IRQ_STACK old_rsp=%rdi save_ret=1
/* We entered an interrupt context - irqs are off: */
TRACE_IRQS_OFF

ret
+
+.Linterrupt_entry_kernel:
+ SAVE_AND_SWITCH_SCI_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14
+ jmp .Linterrupt_entry_done
+
END(interrupt_entry)
_ASM_NOKPROBE(interrupt_entry)

@@ -656,6 +661,8 @@ retint_kernel:
*/
TRACE_IRQS_IRETQ

+ RESTORE_SCI_CR3 scratch_reg=%rax save_reg=%r14
+
GLOBAL(restore_regs_and_return_to_kernel)
#ifdef CONFIG_DEBUG_ENTRY
/* Assert that pt_regs indicates kernel mode. */
@@ -1263,6 +1270,8 @@ ENTRY(error_entry)
* for these here too.
*/
.Lerror_kernelspace:
+ SAVE_AND_SWITCH_SCI_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14
+
leaq native_irq_return_iret(%rip), %rcx
cmpq %rcx, RIP+8(%rsp)
je .Lerror_bad_iret
diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h
index 02c2cbd..eca9e17 100644
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -53,4 +53,12 @@
# define X86_CR3_PTI_PCID_USER_BIT 11
#endif

+#ifdef CONFIG_SYSCALL_ISOLATION
+# if defined(X86_CR3_PTI_PCID_USER_BIT)
+# define X86_CR3_SCI_PCID_BIT (X86_CR3_PTI_PCID_USER_BIT - 1)
+# else
+# define X86_CR3_SCI_PCID_BIT 11
+# endif
+#endif
+
#endif /* _ASM_X86_PROCESSOR_FLAGS_H */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index f4204bf..dc69cc4 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -54,7 +54,13 @@
# define PTI_CONSUMED_PCID_BITS 0
#endif

-#define CR3_AVAIL_PCID_BITS (X86_CR3_PCID_BITS - PTI_CONSUMED_PCID_BITS)
+#ifdef CONFIG_SYSCALL_ISOLATION
+# define SCI_CONSUMED_PCID_BITS 1
+#else
+# define SCI_CONSUMED_PCID_BITS 0
+#endif
+
+#define CR3_AVAIL_PCID_BITS (X86_CR3_PCID_BITS - PTI_CONSUMED_PCID_BITS - SCI_CONSUMED_PCID_BITS)

/*
* ASIDs are zero-based: 0->MAX_AVAIL_ASID are valid. -1 below to account
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 168543d..f2c9cd3f 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -18,6 +18,7 @@
#include <asm/bootparam.h>
#include <asm/suspend.h>
#include <asm/tlbflush.h>
+#include <asm/sci.h>

#ifdef CONFIG_XEN
#include <xen/interface/xen.h>
@@ -105,4 +106,10 @@ static void __used common(void)
OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
OFFSET(TSS_sp2, tss_struct, x86_tss.sp2);
+
+#ifdef CONFIG_SYSCALL_ISOLATION
+ /* system calls isolation */
+ OFFSET(SCI_SYSCALL, sci_percpu_data, sci_syscall);
+ OFFSET(SCI_CR3_OFFSET, sci_percpu_data, sci_cr3_offset);
+#endif
}
--
2.7.4

2019-04-25 23:03:00

by Mike Rapoport

Subject: [RFC PATCH 7/7] sci: add example system calls to exercise SCI
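
A minimal userspace sketch that exercises these system calls (using the
syscall numbers added to the table below) could be:

	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	#define __NR_get_answer		335
	#define __NR_sci_write_dmesg	336

	int main(void)
	{
		const char msg[] = "hello from an isolated syscall";

		/* runs in the isolated context and returns 42 */
		printf("get_answer() = %ld\n", syscall(__NR_get_answer));

		/* copies the buffer from userspace and prints it to dmesg */
		syscall(__NR_sci_write_dmesg, msg, strlen(msg));

		return 0;
	}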

Signed-off-by: Mike Rapoport <[email protected]>
---
arch/x86/entry/common.c | 6 +++-
arch/x86/entry/syscalls/syscall_64.tbl | 3 ++
kernel/Makefile | 2 +-
kernel/sci-examples.c | 52 ++++++++++++++++++++++++++++++++++
4 files changed, 61 insertions(+), 2 deletions(-)
create mode 100644 kernel/sci-examples.c

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 8f2a6fd..be0e1a7 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -275,7 +275,11 @@ __visible inline void syscall_return_slowpath(struct pt_regs *regs)
#ifdef CONFIG_SYSCALL_ISOLATION
static inline bool sci_required(unsigned long nr)
{
- return false;
+ if (!static_cpu_has(X86_FEATURE_SCI))
+ return false;
+ if (nr < __NR_get_answer)
+ return false;
+ return true;
}

static inline unsigned long sci_syscall_enter(unsigned long nr)
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f0b1709..a25e838 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -343,6 +343,9 @@
332 common statx __x64_sys_statx
333 common io_pgetevents __x64_sys_io_pgetevents
334 common rseq __x64_sys_rseq
+335 64 get_answer __x64_sys_get_answer
+336 64 sci_write_dmesg __x64_sys_sci_write_dmesg
+337 64 sci_write_dmesg_bad __x64_sys_sci_write_dmesg_bad

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/kernel/Makefile b/kernel/Makefile
index 6aa7543..d6441d0 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -10,7 +10,7 @@ obj-y = fork.o exec_domain.o panic.o \
extable.o params.o \
kthread.o sys_ni.o nsproxy.o \
notifier.o ksysfs.o cred.o reboot.o \
- async.o range.o smpboot.o ucount.o
+ async.o range.o smpboot.o ucount.o sci-examples.o

obj-$(CONFIG_MODULES) += kmod.o
obj-$(CONFIG_MULTIUSER) += groups.o
diff --git a/kernel/sci-examples.c b/kernel/sci-examples.c
new file mode 100644
index 0000000..9bfaad0
--- /dev/null
+++ b/kernel/sci-examples.c
@@ -0,0 +1,52 @@
+#include <linux/kernel.h>
+#include <linux/pid.h>
+#include <linux/syscalls.h>
+#include <linux/hugetlb.h>
+#include <asm/special_insns.h>
+
+SYSCALL_DEFINE0(get_answer)
+{
+ return 42;
+}
+
+#define BUF_SIZE 1024
+
+typedef void (*foo)(void);
+
+SYSCALL_DEFINE2(sci_write_dmesg, const char __user *, ubuf, size_t, count)
+{
+ char buf[BUF_SIZE];
+
+ if (!ubuf || count >= BUF_SIZE)
+ return -EINVAL;
+
+ buf[count] = '\0';
+ if (copy_from_user(buf, ubuf, count))
+ return -EFAULT;
+
+ printk("%s\n", buf);
+
+ return count;
+}
+
+SYSCALL_DEFINE2(sci_write_dmesg_bad, const char __user *, ubuf, size_t, count)
+{
+ unsigned long addr = (unsigned long)(void *)hugetlb_reserve_pages;
+ char buf[BUF_SIZE];
+ foo func1;
+
+ addr += 0xc5;
+ func1 = (foo)(void *)addr;
+ func1();
+
+ if (!ubuf || count >= BUF_SIZE)
+ return -EINVAL;
+
+ buf[count] = '\0';
+ if (copy_from_user(buf, ubuf, count))
+ return -EFAULT;
+
+ printk("%s\n", buf);
+
+ return count;
+}
--
2.7.4

2019-04-25 23:03:57

by Mike Rapoport

Subject: [RFC PATCH 5/7] x86/mm/fault: hook up SCI verification

If a system call runs in an isolated context, its accesses to kernel code
and data will be verified by the SCI subsystem.

Signed-off-by: Mike Rapoport <[email protected]>
---
arch/x86/mm/fault.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 9d5c75f..baa2a2f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -18,6 +18,7 @@
#include <linux/uaccess.h> /* faulthandler_disabled() */
#include <linux/efi.h> /* efi_recover_from_page_fault()*/
#include <linux/mm_types.h>
+#include <linux/sci.h> /* sci_verify_and_map() */

#include <asm/cpufeature.h> /* boot_cpu_has, ... */
#include <asm/traps.h> /* dotraplinkage, ... */
@@ -1254,6 +1255,30 @@ static int fault_in_kernel_space(unsigned long address)
return address >= TASK_SIZE_MAX;
}

+#ifdef CONFIG_SYSCALL_ISOLATION
+static int sci_fault(struct pt_regs *regs, unsigned long hw_error_code,
+ unsigned long address)
+{
+ struct task_struct *tsk = current;
+
+ if (!tsk->in_isolated_syscall)
+ return 0;
+
+ if (!sci_verify_and_map(regs, address, hw_error_code)) {
+ this_cpu_write(cpu_sci.sci_syscall, 0);
+ no_context(regs, hw_error_code, address, SIGKILL, 0);
+ }
+
+ return 1;
+}
+#else
+static inline int sci_fault(struct pt_regs *regs, unsigned long hw_error_code,
+ unsigned long address)
+{
+ return 0;
+}
+#endif
+
/*
* Called for all faults where 'address' is part of the kernel address
* space. Might get called for faults that originate from *code* that
@@ -1301,6 +1326,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
if (kprobes_fault(regs))
return;

+ if (sci_fault(regs, hw_error_code, address))
+ return;
+
/*
* Note, despite being a "bad area", there are quite a few
* acceptable reasons to get here, such as erratum fixups
--
2.7.4

2019-04-25 23:04:35

by Mike Rapoport

Subject: [RFC PATCH 6/7] security: enable system call isolation in kernel config

Add the SYSCALL_ISOLATION Kconfig option to enable building the SCI
infrastructure.

Signed-off-by: Mike Rapoport <[email protected]>
---
security/Kconfig | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/security/Kconfig b/security/Kconfig
index e4fe2f3..0c6929a 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -65,6 +65,16 @@ config PAGE_TABLE_ISOLATION

See Documentation/x86/pti.txt for more details.

+config SYSCALL_ISOLATION
+ bool "System call isolation"
+ default n
+ depends on PAGE_TABLE_ISOLATION && !X86_PAE
+ help
+ This is an experimental feature to allow executing system
+ calls in an isolated address space.
+
+ If you are unsure how to answer this question, answer N.
+
config SECURITY_INFINIBAND
bool "Infiniband Security Hooks"
depends on SECURITY && INFINIBAND
--
2.7.4

2019-04-26 02:19:40

by Mike Rapoport

Subject: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation

When enabled, the system call isolation (SCI) would allow execution of the
system calls with reduced page tables. These page tables are almost
identical to the user page tables in PTI. The only addition is the code
page containing the system call entry function that will continue execution
after the context switch.

Unlike PTI page tables, there is no sharing at higher levels and all the
hierarchy for SCI page tables is cloned.

The SCI page tables are created when a system call that requires isolation
is executed for the first time.

Whenever a system call should be executed in the isolated environment, the
context is switched to the SCI page tables. Any further access to the
kernel memory will generate a page fault. The page fault handler can verify
that the access is safe and grant it or kill the task otherwise.

The initial SCI implementation allows access to any kernel data, but it
limits access to the code in the following way:
* calls and jumps to known code symbols without offset are allowed
* calls and jumps into a known symbol with offset are allowed only if that
symbol was already accessed and the offset is in the next page
* all other code accesses are blocked

For example, a call to printk() is allowed; a jump to printk()+offset is
allowed only if a page of printk() was already executed and the target lies
in the page immediately following it; any other jump into the middle of a
function is blocked.

After the isolated system call finishes, the mappings created during its
execution are cleared.

The entire SCI page table is lazily freed at task exit() time.
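
In pseudo-code, the resulting per-syscall flow (wired up by the entry and
fault-handling patches of this series) is:

	unsigned long cr3 = 0;

	if (sci_required(nr)) {
		sci_init(current);		/* build the SCI page table on first use */
		cr3 = sci_syscall_enter(nr);	/* switch to the reduced page table */
	}
	regs->ax = sys_call_table[nr](regs);	/* faults go to sci_verify_and_map() */
	sci_syscall_exit(cr3);			/* no-op if cr3 == 0 */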

Signed-off-by: Mike Rapoport <[email protected]>
---
arch/x86/include/asm/sci.h | 55 ++++
arch/x86/mm/Makefile | 1 +
arch/x86/mm/init.c | 2 +
arch/x86/mm/sci.c | 608 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/sched.h | 5 +
include/linux/sci.h | 12 +
6 files changed, 683 insertions(+)
create mode 100644 arch/x86/include/asm/sci.h
create mode 100644 arch/x86/mm/sci.c
create mode 100644 include/linux/sci.h

diff --git a/arch/x86/include/asm/sci.h b/arch/x86/include/asm/sci.h
new file mode 100644
index 0000000..0b56200
--- /dev/null
+++ b/arch/x86/include/asm/sci.h
@@ -0,0 +1,55 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef _ASM_X86_SCI_H
+#define _ASM_X86_SCI_H
+
+#ifdef CONFIG_SYSCALL_ISOLATION
+
+struct sci_task_data {
+ pgd_t *pgd;
+ unsigned long cr3_offset;
+ unsigned long backtrace_size;
+ unsigned long *backtrace;
+ unsigned long ptes_count;
+ pte_t **ptes;
+};
+
+struct sci_percpu_data {
+ unsigned long sci_syscall;
+ unsigned long sci_cr3_offset;
+};
+
+DECLARE_PER_CPU_PAGE_ALIGNED(struct sci_percpu_data, cpu_sci);
+
+void sci_check_boottime_disable(void);
+
+int sci_init(struct task_struct *tsk);
+void sci_exit(struct task_struct *tsk);
+
+bool sci_verify_and_map(struct pt_regs *regs, unsigned long addr,
+ unsigned long hw_error_code);
+void sci_clear_data(void);
+
+static inline void sci_switch_to(struct task_struct *next)
+{
+ this_cpu_write(cpu_sci.sci_syscall, next->in_isolated_syscall);
+ if (next->sci)
+ this_cpu_write(cpu_sci.sci_cr3_offset, next->sci->cr3_offset);
+}
+
+#else /* CONFIG_SYSCALL_ISOLATION */
+
+static inline void sci_check_boottime_disable(void) {}
+
+static inline bool sci_verify_and_map(struct pt_regs *regs, unsigned long addr,
+ unsigned long hw_error_code)
+{
+ return true;
+}
+
+static inline void sci_clear_data(void) {}
+
+static inline void sci_switch_to(struct task_struct *next) {}
+
+#endif /* CONFIG_SYSCALL_ISOLATION */
+
+#endif /* _ASM_X86_SCI_H */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 4b101dd..9a728b7 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -49,6 +49,7 @@ obj-$(CONFIG_X86_INTEL_MPX) += mpx.o
obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
obj-$(CONFIG_PAGE_TABLE_ISOLATION) += pti.o
+obj-$(CONFIG_SYSCALL_ISOLATION) += sci.o

obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index f905a23..b6e2db4 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -22,6 +22,7 @@
#include <asm/hypervisor.h>
#include <asm/cpufeature.h>
#include <asm/pti.h>
+#include <asm/sci.h>

/*
* We need to define the tracepoints somewhere, and tlb.c
@@ -648,6 +649,7 @@ void __init init_mem_mapping(void)
unsigned long end;

pti_check_boottime_disable();
+ sci_check_boottime_disable();
probe_page_size_mask();
setup_pcid();

diff --git a/arch/x86/mm/sci.c b/arch/x86/mm/sci.c
new file mode 100644
index 0000000..e7ddec1
--- /dev/null
+++ b/arch/x86/mm/sci.c
@@ -0,0 +1,608 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright(c) 2019 IBM Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * Author: Mike Rapoport <[email protected]>
+ *
+ * This code is based on pti.c, see it for the original copyrights
+ */
+
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/string.h>
+#include <linux/types.h>
+#include <linux/bug.h>
+#include <linux/init.h>
+#include <linux/mm.h>
+#include <linux/kallsyms.h>
+#include <linux/slab.h>
+#include <linux/debugfs.h>
+#include <linux/sizes.h>
+#include <linux/sci.h>
+#include <linux/random.h>
+
+#include <asm/cpufeature.h>
+#include <asm/hypervisor.h>
+#include <asm/cmdline.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/desc.h>
+#include <asm/sections.h>
+#include <asm/traps.h>
+
+#undef pr_fmt
+#define pr_fmt(fmt) "SCI: " fmt
+
+#define SCI_MAX_PTES 256
+#define SCI_MAX_BACKTRACE 64
+
+__visible DEFINE_PER_CPU_PAGE_ALIGNED(struct sci_percpu_data, cpu_sci);
+
+/*
+ * Walk the shadow copy of the page tables to PMD level (optionally)
+ * trying to allocate page table pages on the way down.
+ *
+ * Allocation failures are not handled here because the entire page
+ * table will be freed in sci_free_pagetable.
+ *
+ * Returns a pointer to a PMD on success, or NULL on failure.
+ */
+static pmd_t *sci_pagetable_walk_pmd(struct mm_struct *mm,
+ pgd_t *pgd, unsigned long address)
+{
+ p4d_t *p4d;
+ pud_t *pud;
+
+ p4d = p4d_alloc(mm, pgd, address);
+ if (!p4d)
+ return NULL;
+
+ pud = pud_alloc(mm, p4d, address);
+ if (!pud)
+ return NULL;
+
+ return pmd_alloc(mm, pud, address);
+}
+
+/*
+ * Walk the shadow copy of the page tables to PTE level (optionally)
+ * trying to allocate page table pages on the way down.
+ *
+ * Returns a pointer to a PTE on success, or NULL on failure.
+ */
+static pte_t *sci_pagetable_walk_pte(struct mm_struct *mm,
+ pgd_t *pgd, unsigned long address)
+{
+ pmd_t *pmd = sci_pagetable_walk_pmd(mm, pgd, address);
+
+ if (!pmd)
+ return NULL;
+
+ if (__pte_alloc(mm, pmd))
+ return NULL;
+
+ return pte_offset_kernel(pmd, address);
+}
+
+/*
+ * Clone a single page mapping
+ *
+ * The new mapping in the @target_pgdp is always created for a base
+ * page. If the original page table has the page at @addr mapped at PMD
+ * level, we still create a PTE in the target page table and map
+ * only PAGE_SIZE.
+ */
+static pte_t *sci_clone_page(struct mm_struct *mm,
+ pgd_t *pgdp, pgd_t *target_pgdp,
+ unsigned long addr)
+{
+ pte_t *pte, *target_pte, ptev;
+ pgd_t *pgd, *target_pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+
+ pgd = pgd_offset_pgd(pgdp, addr);
+ if (pgd_none(*pgd))
+ return NULL;
+
+ p4d = p4d_offset(pgd, addr);
+ if (p4d_none(*p4d))
+ return NULL;
+
+ pud = pud_offset(p4d, addr);
+ if (pud_none(*pud))
+ return NULL;
+
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd))
+ return NULL;
+
+ target_pgd = pgd_offset_pgd(target_pgdp, addr);
+
+ if (pmd_large(*pmd)) {
+ pgprot_t flags;
+ unsigned long pfn;
+
+ /*
+ * We map only PAGE_SIZE rather than the entire huge page.
+ * The PTE will have the same pgprot bits as the original PMD
+ */
+ flags = pte_pgprot(pte_clrhuge(*(pte_t *)pmd));
+ pfn = pmd_pfn(*pmd) + pte_index(addr);
+ ptev = pfn_pte(pfn, flags);
+ } else {
+ pte = pte_offset_kernel(pmd, addr);
+ if (pte_none(*pte) || !(pte_flags(*pte) & _PAGE_PRESENT))
+ return NULL;
+
+ ptev = *pte;
+ }
+
+ target_pte = sci_pagetable_walk_pte(mm, target_pgd, addr);
+ if (!target_pte)
+ return NULL;
+
+ *target_pte = ptev;
+
+ return target_pte;
+}
+
+/*
+ * Clone a range keeping the same leaf mappings
+ *
+ * If the range has holes they are simply skipped
+ */
+static int sci_clone_range(struct mm_struct *mm,
+ pgd_t *pgdp, pgd_t *target_pgdp,
+ unsigned long start, unsigned long end)
+{
+ unsigned long addr;
+
+ /*
+ * Clone the populated PMDs which cover start to end. These PMD areas
+ * can have holes.
+ */
+ for (addr = start; addr < end;) {
+ pte_t *pte, *target_pte;
+ pgd_t *pgd, *target_pgd;
+ pmd_t *pmd, *target_pmd;
+ p4d_t *p4d;
+ pud_t *pud;
+
+ /* Overflow check */
+ if (addr < start)
+ break;
+
+ pgd = pgd_offset_pgd(pgdp, addr);
+ if (pgd_none(*pgd))
+ return 0;
+
+ p4d = p4d_offset(pgd, addr);
+ if (p4d_none(*p4d))
+ return 0;
+
+ pud = pud_offset(p4d, addr);
+ if (pud_none(*pud)) {
+ addr += PUD_SIZE;
+ continue;
+ }
+
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd)) {
+ addr += PMD_SIZE;
+ continue;
+ }
+
+ target_pgd = pgd_offset_pgd(target_pgdp, addr);
+
+ if (pmd_large(*pmd)) {
+ target_pmd = sci_pagetable_walk_pmd(mm, target_pgd,
+ addr);
+ if (!target_pmd)
+ return -ENOMEM;
+
+ *target_pmd = *pmd;
+
+ addr += PMD_SIZE;
+ continue;
+ } else {
+ pte = pte_offset_kernel(pmd, addr);
+ if (pte_none(*pte)) {
+ addr += PAGE_SIZE;
+ continue;
+ }
+
+ target_pte = sci_pagetable_walk_pte(mm, target_pgd,
+ addr);
+ if (!target_pte)
+ return -ENOMEM;
+
+ *target_pte = *pte;
+
+ addr += PAGE_SIZE;
+ }
+ }
+
+ return 0;
+}
+
+/*
+ * We have to map the syscall entry because we'll fault there after the
+ * CR3 switch, before the verifier is able to recognize this as a valid
+ * access.
+ */
+extern void do_syscall_64(unsigned long nr, struct pt_regs *regs);
+unsigned long syscall_entry_addr = (unsigned long)do_syscall_64;
+
+static void sci_reset_backtrace(struct sci_task_data *sci)
+{
+ memset(sci->backtrace, 0, sci->backtrace_size * sizeof(sci->backtrace[0]));
+ sci->backtrace[0] = syscall_entry_addr;
+ sci->backtrace_size = 1;
+}
+
+static inline void sci_sync_user_pagetable(struct task_struct *tsk)
+{
+ pgd_t *u_pgd = kernel_to_user_pgdp(tsk->mm->pgd);
+ pgd_t *sci_pgd = tsk->sci->pgd;
+
+ down_write(&tsk->mm->mmap_sem);
+ memcpy(sci_pgd, u_pgd, PGD_KERNEL_START * sizeof(pgd_t));
+ up_write(&tsk->mm->mmap_sem);
+}
+
+static int sci_free_pte_range(struct mm_struct *mm, pmd_t *pmd)
+{
+ pte_t *ptep = pte_offset_kernel(pmd, 0);
+
+ pmd_clear(pmd);
+ pte_free(mm, virt_to_page(ptep));
+ mm_dec_nr_ptes(mm);
+
+ return 0;
+}
+
+static int sci_free_pmd_range(struct mm_struct *mm, pud_t *pud)
+{
+ pmd_t *pmd, *pmdp;
+ int i;
+
+ pmdp = pmd_offset(pud, 0);
+
+ for (i = 0, pmd = pmdp; i < PTRS_PER_PMD; i++, pmd++)
+ if (!pmd_none(*pmd) && !pmd_large(*pmd))
+ sci_free_pte_range(mm, pmd);
+
+ pud_clear(pud);
+ pmd_free(mm, pmdp);
+ mm_dec_nr_pmds(mm);
+
+ return 0;
+}
+
+static int sci_free_pud_range(struct mm_struct *mm, p4d_t *p4d)
+{
+ pud_t *pud, *pudp;
+ int i;
+
+ pudp = pud_offset(p4d, 0);
+
+ for (i = 0, pud = pudp; i < PTRS_PER_PUD; i++, pud++)
+ if (!pud_none(*pud))
+ sci_free_pmd_range(mm, pud);
+
+ p4d_clear(p4d);
+ pud_free(mm, pudp);
+ mm_dec_nr_puds(mm);
+
+ return 0;
+}
+
+static int sci_free_p4d_range(struct mm_struct *mm, pgd_t *pgd)
+{
+ p4d_t *p4d, *p4dp;
+ int i;
+
+ p4dp = p4d_offset(pgd, 0);
+
+ for (i = 0, p4d = p4dp; i < PTRS_PER_P4D; i++, p4d++)
+ if (!p4d_none(*p4d))
+ sci_free_pud_range(mm, p4d);
+
+ pgd_clear(pgd);
+ p4d_free(mm, p4dp);
+
+ return 0;
+}
+
+static int sci_free_pagetable(struct task_struct *tsk, pgd_t *sci_pgd)
+{
+ struct mm_struct *mm = tsk->mm;
+ pgd_t *pgd, *pgdp = sci_pgd;
+
+#ifdef SCI_SHARED_PAGE_TABLES
+ int i;
+
+ for (i = KERNEL_PGD_BOUNDARY; i < PTRS_PER_PGD; i++) {
+ if (i >= pgd_index(VMALLOC_START) &&
+ i < pgd_index(__START_KERNEL_map))
+ continue;
+ pgd = pgdp + i;
+ sci_free_p4d_range(mm, pgd);
+ }
+#else
+ for (pgd = pgdp + KERNEL_PGD_BOUNDARY; pgd < pgdp + PTRS_PER_PGD; pgd++)
+ if (!pgd_none(*pgd))
+ sci_free_p4d_range(mm, pgd);
+#endif
+
+
+ return 0;
+}
+
+static int sci_pagetable_init(struct task_struct *tsk, pgd_t *sci_pgd)
+{
+ struct mm_struct *mm = tsk->mm;
+ pgd_t *k_pgd = mm->pgd;
+ pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
+ unsigned long stack = (unsigned long)tsk->stack;
+ unsigned long addr;
+ unsigned int cpu;
+ pte_t *pte;
+ int ret;
+
+ /* copy the kernel part of user visible page table */
+ ret = sci_clone_range(mm, u_pgd, sci_pgd, CPU_ENTRY_AREA_BASE,
+ CPU_ENTRY_AREA_BASE + CPU_ENTRY_AREA_MAP_SIZE);
+ if (ret)
+ goto err_free_pagetable;
+
+ ret = sci_clone_range(mm, u_pgd, sci_pgd,
+ (unsigned long) __entry_text_start,
+ (unsigned long) __irqentry_text_end);
+ if (ret)
+ goto err_free_pagetable;
+
+ ret = sci_clone_range(mm, mm->pgd, sci_pgd,
+ stack, stack + THREAD_SIZE);
+ if (ret)
+ goto err_free_pagetable;
+
+ ret = -ENOMEM;
+ for_each_possible_cpu(cpu) {
+ addr = (unsigned long)&per_cpu(cpu_sci, cpu);
+ pte = sci_clone_page(mm, k_pgd, sci_pgd, addr);
+ if (!pte)
+ goto err_free_pagetable;
+ }
+
+ /* plus do_syscall_64 */
+ pte = sci_clone_page(mm, k_pgd, sci_pgd, syscall_entry_addr);
+ if (!pte)
+ goto err_free_pagetable;
+
+ return 0;
+
+err_free_pagetable:
+ sci_free_pagetable(tsk, sci_pgd);
+ return ret;
+}
+
+static int sci_alloc(struct task_struct *tsk)
+{
+ struct sci_task_data *sci;
+ int err = -ENOMEM;
+
+ if (!static_cpu_has(X86_FEATURE_SCI))
+ return 0;
+
+ if (tsk->sci)
+ return 0;
+
+ sci = kzalloc(sizeof(*sci), GFP_KERNEL);
+ if (!sci)
+ return err;
+
+ sci->ptes = kcalloc(SCI_MAX_PTES, sizeof(*sci->ptes), GFP_KERNEL);
+ if (!sci->ptes)
+ goto free_sci;
+
+ sci->backtrace = kcalloc(SCI_MAX_BACKTRACE, sizeof(*sci->backtrace),
+ GFP_KERNEL);
+ if (!sci->backtrace)
+ goto free_ptes;
+
+ sci->pgd = (pgd_t *)get_zeroed_page(GFP_KERNEL);
+ if (!sci->pgd)
+ goto free_backtrace;
+
+ err = sci_pagetable_init(tsk, sci->pgd);
+ if (err)
+ goto free_pgd;
+
+ sci_reset_backtrace(sci);
+
+ tsk->sci = sci;
+
+ return 0;
+
+free_pgd:
+ free_page((unsigned long)sci->pgd);
+free_backtrace:
+ kfree(sci->backtrace);
+free_ptes:
+ kfree(sci->ptes);
+free_sci:
+ kfree(sci);
+ return err;
+}
+
+int sci_init(struct task_struct *tsk)
+{
+ if (!tsk->sci) {
+ int err = sci_alloc(tsk);
+
+ if (err)
+ return err;
+ }
+
+ sci_sync_user_pagetable(tsk);
+
+ return 0;
+}
+
+void sci_exit(struct task_struct *tsk)
+{
+ struct sci_task_data *sci = tsk->sci;
+
+ if (!static_cpu_has(X86_FEATURE_SCI))
+ return;
+
+ if (!sci)
+ return;
+
+ sci_free_pagetable(tsk, tsk->sci->pgd);
+ free_page((unsigned long)sci->pgd);
+ kfree(sci->backtrace);
+ kfree(sci->ptes);
+ kfree(sci);
+}
+
+void sci_clear_data(void)
+{
+ struct sci_task_data *sci = current->sci;
+ int i;
+
+ if (WARN_ON(!sci))
+ return;
+
+ for (i = 0; i < sci->ptes_count; i++)
+ pte_clear(NULL, 0, sci->ptes[i]);
+
+ memset(sci->ptes, 0, sci->ptes_count * sizeof(sci->ptes[0]));
+ sci->ptes_count = 0;
+
+ sci_reset_backtrace(sci);
+}
+
+static void sci_add_pte(struct sci_task_data *sci, pte_t *pte)
+{
+ int i;
+
+ for (i = sci->ptes_count - 1; i >= 0; i--)
+ if (pte == sci->ptes[i])
+ return;
+ sci->ptes[sci->ptes_count++] = pte;
+}
+
+static void sci_add_rip(struct sci_task_data *sci, unsigned long rip)
+{
+ int i;
+
+ for (i = sci->backtrace_size - 1; i >= 0; i--)
+ if (rip == sci->backtrace[i])
+ return;
+
+ sci->backtrace[sci->backtrace_size++] = rip;
+}
+
+static bool sci_verify_code_access(struct sci_task_data *sci,
+ struct pt_regs *regs, unsigned long addr)
+{
+ char namebuf[KSYM_NAME_LEN];
+ unsigned long offset, size;
+ const char *symbol;
+ char *modname;
+
+ /* instruction fetch outside kernel or module text */
+ if (!(is_kernel_text(addr) || is_module_text_address(addr)))
+ return false;
+
+ /* no symbol matches the address */
+ symbol = kallsyms_lookup(addr, &size, &offset, &modname, namebuf);
+ if (!symbol)
+ return false;
+
+ /* BPF or ftrace? */
+ if (symbol != namebuf)
+ return false;
+
+ /* access in the middle of a function */
+ if (offset) {
+ int i = 0;
+
+ for (i = sci->backtrace_size - 1; i >= 0; i--) {
+ unsigned long rip = sci->backtrace[i];
+
+ /* allow jumps to the next page of already mapped one */
+ if ((addr >> PAGE_SHIFT) == ((rip >> PAGE_SHIFT) + 1))
+ return true;
+ }
+
+ return false;
+ }
+
+ sci_add_rip(sci, regs->ip);
+
+ return true;
+}
+
+bool sci_verify_and_map(struct pt_regs *regs, unsigned long addr,
+ unsigned long hw_error_code)
+{
+ struct task_struct *tsk = current;
+ struct mm_struct *mm = tsk->mm;
+ struct sci_task_data *sci = tsk->sci;
+ pte_t *pte;
+
+ /* run out of room for metadata, can't grant access */
+ if (sci->ptes_count >= SCI_MAX_PTES ||
+ sci->backtrace_size >= SCI_MAX_BACKTRACE)
+ return false;
+
+ /* only code access is checked */
+ if (hw_error_code & X86_PF_INSTR &&
+ !sci_verify_code_access(sci, regs, addr))
+ return false;
+
+ pte = sci_clone_page(mm, mm->pgd, sci->pgd, addr);
+ if (!pte)
+ return false;
+
+ sci_add_pte(sci, pte);
+
+ return true;
+}
+
+void __init sci_check_boottime_disable(void)
+{
+ char arg[5];
+ int ret;
+
+ if (!cpu_feature_enabled(X86_FEATURE_PCID)) {
+ pr_info("System call isolation requires PCID\n");
+ return;
+ }
+
+ /* Assume SCI is disabled unless explicitly overridden. */
+ ret = cmdline_find_option(boot_command_line, "sci", arg, sizeof(arg));
+ if (ret == 2 && !strncmp(arg, "on", 2)) {
+ setup_force_cpu_cap(X86_FEATURE_SCI);
+ pr_info("System call isolation is enabled\n");
+ return;
+ }
+
+ pr_info("System call isolation is disabled\n");
+}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f9b43c9..cdcdb07 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1202,6 +1202,11 @@ struct task_struct {
unsigned long prev_lowest_stack;
#endif

+#ifdef CONFIG_SYSCALL_ISOLATION
+ unsigned long in_isolated_syscall;
+ struct sci_task_data *sci;
+#endif
+
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
diff --git a/include/linux/sci.h b/include/linux/sci.h
new file mode 100644
index 0000000..7a6beac
--- /dev/null
+++ b/include/linux/sci.h
@@ -0,0 +1,12 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef _LINUX_SCI_H
+#define _LINUX_SCI_H
+
+#ifdef CONFIG_SYSCALL_ISOLATION
+#include <asm/sci.h>
+#else
+static inline int sci_init(struct task_struct *tsk) { return 0; }
+static inline void sci_exit(struct task_struct *tsk) {}
+#endif
+
+#endif /* _LINUX_SCI_H */
--
2.7.4

2019-04-26 02:28:51

by Andy Lutomirski

Subject: Re: [RFC PATCH 0/7] x86: introduce system call address space isolation

On Thu, Apr 25, 2019 at 2:46 PM Mike Rapoport <[email protected]> wrote:
>
> Hi,
>
> Address space isolation has been used to protect the kernel from
> userspace and userspace programs from each other since the invention of
> virtual memory.
>
> Assuming that kernel bugs and therefore vulnerabilities are inevitable, it
> might be worth isolating parts of the kernel to minimize the damage that
> these vulnerabilities can cause.
>
> The idea here is to allow an untrusted user access to a potentially
> vulnerable kernel in such a way that any kernel vulnerability they find to
> exploit is either prevented or the consequences confined to their isolated
> address space such that the compromise attempt has minimal impact on other
> tenants or the protected structures of the monolithic kernel. Although we
> hope to prevent many classes of attack, the first target we're looking at
> is ROP gadget protection.
>
> These patches implement a "system call isolation (SCI)" mechanism that
> allows running system calls in an isolated address space with reduced page
> tables to prevent ROP attacks.
>
> ROP attacks involve corrupting the stack return address to repoint it to a
> segment of code you know exists in the kernel that can be used to perform
> the action you need to exploit the system.
>
> The idea behind the prevention is that if we fault in pages in the
> execution path, we can compare the target address against the kernel symbol
> table. So if we're in a function, we allow local jumps (and simply falling
> off the end of a page) but if we're jumping to a new function it must be to
> an external label in the symbol table.

That's quite an assumption. The entry code at least uses .L labels.
Do you get that right?

As far as I can see, most of what's going on here has very little to
do with jumps and calls. The benefit seems to come from making sure
that the RET instruction actually goes somewhere that's already been
faulted in. Am I understanding right?

--Andy

2019-04-26 07:43:44

by Peter Zijlstra

Subject: Re: [RFC PATCH 5/7] x86/mm/fault: hook up SCI verification

On Fri, Apr 26, 2019 at 12:45:52AM +0300, Mike Rapoport wrote:
> If a system call runs in an isolated context, its accesses to kernel code
> and data will be verified by the SCI subsystem.
>
> Signed-off-by: Mike Rapoport <[email protected]>
> ---
> arch/x86/mm/fault.c | 28 ++++++++++++++++++++++++++++
> 1 file changed, 28 insertions(+)

There's a distinct lack of touching do_double_fault(). It appears to me
that you'll instantly trigger #DF when you #PF, because the #PF handler
itself will not be able to run.

And then obviously you have to be very careful to make sure #DF can,
_at_all_times_ run, otherwise you'll triple-fault and we all know what
that does.

2019-04-26 07:52:20

by Peter Zijlstra

Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation

On Fri, Apr 26, 2019 at 12:45:49AM +0300, Mike Rapoport wrote:
> The initial SCI implementation allows access to any kernel data, but it
> limits access to the code in the following way:
> * calls and jumps to known code symbols without offset are allowed
> * calls and jumps into a known symbol with offset are allowed only if that
> symbol was already accessed and the offset is in the next page
> * all other code accesses are blocked

So if you have a large function and an in-function jump skips a page,
you're toast.

Why not employ the instruction decoder we have and unconditionally allow
all direct JMP/CALL but verify indirect JMP/CALL and RET?

Anyway, I'm fearing the overhead of this one; this cannot be fast.
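
For illustration, such a check could look roughly like the sketch below
(incomplete: built on the in-kernel decoder from arch/x86/lib/insn.c, with
the two-byte 0x0f Jcc forms and prefix handling omitted):

	#include <linux/types.h>
	#include <asm/insn.h>

	static bool insn_is_direct_branch(const void *kaddr)
	{
		struct insn insn;
		u8 op;

		insn_init(&insn, kaddr, MAX_INSN_SIZE, 1 /* x86-64 */);
		insn_get_opcode(&insn);
		op = insn.opcode.bytes[0];

		/* E8 = CALL rel32, E9/EB = JMP rel32/rel8, 70..7F = Jcc rel8 */
		return op == 0xe8 || op == 0xe9 || op == 0xeb ||
		       (op & 0xf0) == 0x70;
	}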

2019-04-26 08:33:11

by Ingo Molnar

Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation


* Mike Rapoport <[email protected]> wrote:

> When enabled, the system call isolation (SCI) would allow execution of
> the system calls with reduced page tables. These page tables are almost
> identical to the user page tables in PTI. The only addition is the code
> page containing the system call entry function that will continue
> execution after the context switch.
>
> Unlike PTI page tables, there is no sharing at higher levels and all
> the hierarchy for SCI page tables is cloned.
>
> The SCI page tables are created when a system call that requires
> isolation is executed for the first time.
>
> Whenever a system call should be executed in the isolated environment,
> the context is switched to the SCI page tables. Any further access to
> the kernel memory will generate a page fault. The page fault handler
> can verify that the access is safe and grant it or kill the task
> otherwise.
>
> The initial SCI implementation allows access to any kernel data, but it
> limits access to the code in the following way:
> * calls and jumps to known code symbols without offset are allowed
> * calls and jumps into a known symbol with offset are allowed only if that
> symbol was already accessed and the offset is in the next page
> * all other code accesses are blocked
>
> After the isolated system call finishes, the mappings created during its
> execution are cleared.
>
> The entire SCI page table is lazily freed at task exit() time.

So this basically uses a similar mechanism to the horrendous PTI CR3
switching overhead whenever a syscall seeks "protection", which overhead
is only somewhat mitigated by PCID.

This might work on PTI-encumbered CPUs.

AMD CPUs, meanwhile, don't need PTI, nor do they have PCID.

So this feature is hurting the CPU maker who didn't mess up, and is
hurting future CPUs that don't need PTI ..

I really don't like where this is going. In a couple of years I really
want to be able to think of PTI as a bad dream that is, fortunately,
mostly over.

I have the feeling that compiler level protection that avoids corrupting
the stack in the first place is going to be lower overhead, and would
work in a much broader range of environments. Do we have analysis of what
the compiler would have to do to prevent most ROP attacks, and what the
runtime cost of that is?

I mean, C# and Java programs aren't able to corrupt the stack as long as
the language runtime is correct. Has to be possible, right?

Thanks,

Ingo

2019-04-26 09:08:04

by Jiri Kosina

Subject: Re: [RFC PATCH 0/7] x86: introduce system call address space isolation

On Thu, 25 Apr 2019, Andy Lutomirski wrote:

> The benefit seems to come from making sure that the RET instruction
> actually goes somewhere that's already been faulted in.

Which doesn't seem to be really compatible with things like retpolines or
anyone using FTRACE_WITH_REGS to modify the stored instruction pointer.

--
Jiri Kosina
SUSE Labs

2019-04-26 09:59:18

by Ingo Molnar

Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation


* Ingo Molnar <[email protected]> wrote:

> I really don't like where this is going. In a couple of years I
> really want to be able to think of PTI as a bad dream that is,
> fortunately, mostly over.
>
> I have the feeling that compiler level protection that avoids
> corrupting the stack in the first place is going to be lower overhead,
> and would work in a much broader range of environments. Do we have
> analysis of what the compiler would have to do to prevent most ROP
> attacks, and what the runtime cost of that is?
>
> I mean, C# and Java programs aren't able to corrupt the stack as long
> as the language runtime is correct. Has to be possible, right?

So if such a security feature is offered, then I'm afraid distros would be
strongly inclined to enable it - saying 'yes' to a kernel feature that
can keep your product off CVE advisories is a strong force.

To phrase the argument in a bit more controversial form:

If the price of Linux using an insecure C runtime is to slow down
system calls with immense PTI-alike runtime costs, then wouldn't it be
the right technical decision to write the kernel in a language runtime
that doesn't allow stack overflows and such?

I.e. if having Linux in C ends up being slower than having it in Java,
then what's the performance argument in favor of using C to begin with?
;-)

And no, I'm not arguing for Java or C#, but I am arguing for a saner
version of C.

Thanks,

Ingo

2019-04-26 14:43:40

by Dave Hansen

Subject: Re: [RFC PATCH 0/7] x86: introduce system call address space isolation

On 4/25/19 2:45 PM, Mike Rapoport wrote:
> The idea behind the prevention is that if we fault in pages in the
> execution path, we can compare the target address against the kernel symbol
> table. So if we're in a function, we allow local jumps (and simply falling
> off the end of a page) but if we're jumping to a new function it must be to
> an external label in the symbol table. Since ROP attacks are all about
> jumping to gadget code which is effectively in the middle of real
> functions, the jumps they induce are to code that doesn't have an external
> symbol, so it should mostly detect when they happen.

This turns the problem from: "attackers can leverage any data/code that
the kernel has mapped (anything)" to "attackers can leverage any
code/data that the current syscall has faulted in".

That seems like a pretty restrictive change.

> At this time we are not suggesting any API to enable system call
> isolation. Because of the overhead it requires, it should only be
> activated for processes or containers we consider untrusted. We still
> have no actual numbers, but forcing page faults during system call
> execution will surely not come for free.

What's the minimum number of faults that have to occur to handle the
simplest dummy syscall?

2019-04-26 14:46:12

by James Bottomley

Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation

On Fri, 2019-04-26 at 10:31 +0200, Ingo Molnar wrote:
> * Mike Rapoport <[email protected]> wrote:
>
> > When enabled, the system call isolation (SCI) would allow execution
> > of the system calls with reduced page tables. These page tables are
> > almost identical to the user page tables in PTI. The only addition
> > is the code page containing the system call entry function that will
> > continue execution after the context switch.
> >
> > Unlike PTI page tables, there is no sharing at higher levels and
> > all the hierarchy for SCI page tables is cloned.
> >
> > The SCI page tables are created when a system call that requires
> > isolation is executed for the first time.
> >
> > Whenever a system call should be executed in the isolated
> > environment, the context is switched to the SCI page tables. Any
> > further access to the kernel memory will generate a page fault. The
> > page fault handler can verify that the access is safe and grant it
> > or kill the task otherwise.
> >
> > The initial SCI implementation allows access to any kernel data,
> > but it limits access to the code in the following way:
> > * calls and jumps to known code symbols without offset are allowed
> > * calls and jumps into a known symbol with offset are allowed only
> > if that symbol was already accessed and the offset is in the next
> > page
> > * all other code accesses are blocked
> >
> > After the isolated system call finishes, the mappings created
> > during its execution are cleared.
> >
> > The entire SCI page table is lazily freed at task exit() time.
>
> So this basically uses a similar mechanism to the horrendous PTI CR3
> switching overhead whenever a syscall seeks "protection", which
> overhead is only somewhat mitigated by PCID.
>
> This might work on PTI-encumbered CPUs.
>
> While AMD CPUs don't need PTI, nor do they have PCID.
>
> So this feature is hurting the CPU maker who didn't mess up, and is
> hurting future CPUs that don't need PTI ..
>
> I really don't like it where this is going. In a couple of years I
> really want to be able to think of PTI as a bad dream that is mostly
> over fortunately.

Perhaps ROP gadgets were a bad first example. The research object of
the current patch set is really to investigate eliminating sandboxing
for containers. As you know, current sandboxes like gVisor and Nabla
try to reduce the exposure to horizontal exploits (ability of an
untrusted tenant to exploit the shared kernel to attack another tenant)
by running significant chunks of kernel emulation code in userspace to
reduce exposure of the tenant to code in the shared kernel. The price
paid for this is pretty horrendous in performance terms, but the
benefit is multi-tenant safety.

The question we were looking into is if we used per-tenant in-kernel
address space isolation to improve the security of kernel system calls
such that either the exploit becomes detectable or its consequences
bounce back only on the tenant trying the exploit, we could eliminate
the emulation for that system call and instead pass it through to the
kernel, thus thinning out the sandbox layer without losing the security
benefits.

We are looking at other aspects as well, like can we simply run chunks
of the kernel in the user's address space as the sandbox emulation
currently does, or can we hide a tenant's data objects such that
they're not easily accessible from an exploited kernel.

James

2019-04-26 14:49:09

by Dave Hansen

Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation

On 4/25/19 2:45 PM, Mike Rapoport wrote:
> After the isolated system call finishes, the mappings created during its
> execution are cleared.

Yikes. I guess that stops someone from calling write() a bunch of times
on every filesystem using every block device driver and all the DM code
to get a lot of code/data faulted in. But, it also means not even
long-running processes will ever have a chance of behaving anything
close to normally.

Is this something you think can be rectified or is there something
fundamental that would keep SCI page tables from being cached across
different invocations of the same syscall?

2019-04-26 14:58:52

by James Bottomley

Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation

On Fri, 2019-04-26 at 07:46 -0700, Dave Hansen wrote:
> On 4/25/19 2:45 PM, Mike Rapoport wrote:
> > After the isolated system call finishes, the mappings created
> > during its execution are cleared.
>
> Yikes. I guess that stops someone from calling write() a bunch of
> times on every filesystem using every block device driver and all the
> DM code to get a lot of code/data faulted in. But, it also means not
> even long-running processes will ever have a chance of behaving
> anything close to normally.
>
> Is this something you think can be rectified or is there something
> fundamental that would keep SCI page tables from being cached across
> different invocations of the same syscall?

There is some work being done to look at pre-populating the isolated
address space with the expected execution footprint of the system call,
yes. It lessens the ROP gadget protection slightly because you might
find a gadget in the pre-populated code, but it solves a lot of the
overhead problem.

James

2019-04-26 15:08:55

by Andy Lutomirski

Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation



> On Apr 26, 2019, at 7:57 AM, James Bottomley <[email protected]> wrote:
>
>> On Fri, 2019-04-26 at 07:46 -0700, Dave Hansen wrote:
>>> On 4/25/19 2:45 PM, Mike Rapoport wrote:
>>> After the isolated system call finishes, the mappings created
>>> during its execution are cleared.
>>
>> Yikes. I guess that stops someone from calling write() a bunch of
>> times on every filesystem using every block device driver and all the
>> DM code to get a lot of code/data faulted in. But, it also means not
>> even long-running processes will ever have a chance of behaving
>> anything close to normally.
>>
>> Is this something you think can be rectified or is there something
>> fundamental that would keep SCI page tables from being cached across
>> different invocations of the same syscall?
>
> There is some work being done to look at pre-populating the isolated
> address space with the expected execution footprint of the system call,
> yes. It lessens the ROP gadget protection slightly because you might
> find a gadget in the pre-populated code, but it solves a lot of the
> overhead problem.
>

I’m not even remotely a ROP expert, but: what stops a ROP payload from using all the “fault-in” gadgets that exist — any function that can return on an error without doing too much will fault in the whole page containing the function.

To improve this, we would want something that would try to check whether the caller is actually supposed to call the callee, which is more or less the hard part of CFI. So can’t we just do CFI and call it a day?

On top of that, a robust, maintainable implementation of this thing seems very complicated — for example, what happens if vfree() gets called?

2019-04-26 15:20:57

by James Bottomley

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation

On Fri, 2019-04-26 at 08:07 -0700, Andy Lutomirski wrote:
> > On Apr 26, 2019, at 7:57 AM, James Bottomley <James.Bottomley@hanse
> > npartnership.com> wrote:
> >
> > > On Fri, 2019-04-26 at 07:46 -0700, Dave Hansen wrote:
> > > > On 4/25/19 2:45 PM, Mike Rapoport wrote:
> > > > After the isolated system call finishes, the mappings created
> > > > during its execution are cleared.
> > >
> > > Yikes. I guess that stops someone from calling write() a bunch
> > > of times on every filesystem using every block device driver and
> > > all the DM code to get a lot of code/data faulted in. But, it
> > > also means not even long-running processes will ever have a
> > > chance of behaving anything close to normally.
> > >
> > > Is this something you think can be rectified or is there
> > > something fundamental that would keep SCI page tables from being
> > > cached across different invocations of the same syscall?
> >
> > There is some work being done to look at pre-populating the
> > isolated address space with the expected execution footprint of the
> > system call, yes. It lessens the ROP gadget protection slightly
> > because you might find a gadget in the pre-populated code, but it
> > solves a lot of the overhead problem.
> >
>
> I’m not even remotely a ROP expert, but: what stops a ROP payload
> from using all the “fault-in” gadgets that exist — any function that
> can return on an error without doing too much will fault in the whole
> page containing the function.

The address space pre-population is still per syscall, so you don't get
access to the code footprint of a different syscall. So the isolated
address space is created anew for every system call; it's just pre-
populated with that system call's expected footprint.

> To improve this, we would want something that would try to check
> whether the caller is actually supposed to call the callee, which is
> more or less the hard part of CFI. So can’t we just do CFI and call
> it a day?

By CFI you mean control flow integrity? In theory I believe so, yes,
but in practice doesn't it require a lot of semantic object information,
which is easy to get from higher-level languages like Java but a bit
more difficult for plain C?

> On top of that, a robust, maintainable implementation of this thing
> seems very complicated — for example, what happens if vfree() gets
> called?

Address-space-local vs. global object tracking is another thing on our
list. What we'd probably do is verify the global object was allowed to
be freed and then hand it off safely to the main kernel address space.

James

2019-04-26 18:04:15

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation



> On Apr 26, 2019, at 8:19 AM, James Bottomley <[email protected]> wrote:
>
> On Fri, 2019-04-26 at 08:07 -0700, Andy Lutomirski wrote:
>>> On Apr 26, 2019, at 7:57 AM, James Bottomley <James.Bottomley@hanse
>>> npartnership.com> wrote:
>>>
>>>>> On Fri, 2019-04-26 at 07:46 -0700, Dave Hansen wrote:
>>>>> On 4/25/19 2:45 PM, Mike Rapoport wrote:
>>>>> After the isolated system call finishes, the mappings created
>>>>> during its execution are cleared.
>>>>
>>>> Yikes. I guess that stops someone from calling write() a bunch
>>>> of times on every filesystem using every block device driver and
>>>> all the DM code to get a lot of code/data faulted in. But, it
>>>> also means not even long-running processes will ever have a
>>>> chance of behaving anything close to normally.
>>>>
>>>> Is this something you think can be rectified or is there
>>>> something fundamental that would keep SCI page tables from being
>>>> cached across different invocations of the same syscall?
>>>
>>> There is some work being done to look at pre-populating the
>>> isolated address space with the expected execution footprint of the
>>> system call, yes. It lessens the ROP gadget protection slightly
>>> because you might find a gadget in the pre-populated code, but it
>>> solves a lot of the overhead problem.
>>
>> I’m not even remotely a ROP expert, but: what stops a ROP payload
>> from using all the “fault-in” gadgets that exist — any function that
>> can return on an error without doing too much will fault in the whole
>> page containing the function.
>
> The address space pre-population is still per syscall, so you don't get
> access to the code footprint of a different syscall. So the isolated
> address space is created anew for every system call, it's just pre-
> populated with that system call's expected footprint.

That’s not what I mean. Suppose I want to use a ROP gadget in vmalloc(), but vmalloc isn’t in the page tables. Then first push vmalloc itself onto the stack. As long as RDI contains a sufficiently ridiculous value, it should just return without doing anything. And it can return right back into the ROP gadget, which is now available.

>
>> To improve this, we would want something that would try to check
>> whether the caller is actually supposed to call the callee, which is
>> more or less the hard part of CFI. So can’t we just do CFI and call
>> it a day?
>
> By CFI you mean control flow integrity? In theory I believe so, yes,
> but in practice doesn't it require a lot of semantic object information
> which is easy to get from higher level languages like java but a bit
> more difficult for plain C.

Yes. As I understand it, grsecurity instruments gcc to create some kind of hash of all function signatures. Then any indirect call can effectively verify that it’s calling a function of the right type. And every return verifies a cookie.
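
A minimal userspace sketch of that signature-hash idea (the hash
constant and the cfi_fn structure are made up for illustration;
grsecurity's actual implementation is not public):

#include <stdint.h>
#include <stdio.h>

#define SIG_INT_INT 0x9e3779b97f4a7c15ull   /* made-up hash of "int (*)(int)" */

struct cfi_fn {
    uint64_t type_hash;      /* would be emitted by the instrumented compiler */
    int (*fn)(int);
};

static int double_it(int x) { return 2 * x; }

static struct cfi_fn double_it_cfi = { SIG_INT_INT, double_it };

/* every indirect call site checks the callee's hash against the hash
 * of the prototype the caller expects */
static int cfi_icall(const struct cfi_fn *target, int arg)
{
    if (target->type_hash != SIG_INT_INT) {
        fprintf(stderr, "CFI violation\n");
        return -1;
    }
    return target->fn(arg);
}

int main(void)
{
    printf("%d\n", cfi_icall(&double_it_cfi, 21));   /* prints 42 */
    return 0;
}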

On CET CPUs, RET gets checked directly, and I don’t see the benefit of SCI.

>
>> On top of that, a robust, maintainable implementation of this thing
>> seems very complicated — for example, what happens if vfree() gets
>> called?
>
> Address-space-local vs. global object tracking is another thing on our
> list. What we'd probably do is verify the global object was allowed to
> be freed and then hand it off safely to the main kernel address space.
>
>

This seems exceedingly complicated.

2019-04-26 18:50:57

by James Bottomley

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation

On Fri, 2019-04-26 at 10:40 -0700, Andy Lutomirski wrote:
> > On Apr 26, 2019, at 8:19 AM, James Bottomley <James.Bottomley@hanse
> > npartnership.com> wrote:
> >
> > On Fri, 2019-04-26 at 08:07 -0700, Andy Lutomirski wrote:
> > > > On Apr 26, 2019, at 7:57 AM, James Bottomley
> > > > <[email protected]> wrote:
> > > >
> > > > > > On Fri, 2019-04-26 at 07:46 -0700, Dave Hansen wrote:
> > > > > > On 4/25/19 2:45 PM, Mike Rapoport wrote:
> > > > > > After the isolated system call finishes, the mappings
> > > > > > created during its execution are cleared.
> > > > >
> > > > > Yikes. I guess that stops someone from calling write() a
> > > > > bunch of times on every filesystem using every block device
> > > > > driver and all the DM code to get a lot of code/data faulted
> > > > > in. But, it also means not even long-running processes will
> > > > > ever have a chance of behaving anything close to normally.
> > > > >
> > > > > Is this something you think can be rectified or is there
> > > > > something fundamental that would keep SCI page tables from
> > > > > being cached across different invocations of the same
> > > > > syscall?
> > > >
> > > > There is some work being done to look at pre-populating the
> > > > isolated address space with the expected execution footprint of
> > > > the system call, yes. It lessens the ROP gadget protection
> > > > slightly because you might find a gadget in the pre-populated
> > > > code, but it solves a lot of the overhead problem.
> > >
> > > I’m not even remotely a ROP expert, but: what stops a ROP payload
> > > from using all the “fault-in” gadgets that exist — any function
> > > that can return on an error without doing too much will fault in
> > > the whole page containing the function.
> >
> > The address space pre-population is still per syscall, so you don't
> > get access to the code footprint of a different syscall. So the
> > isolated address space is created anew for every system call, it's
> > just pre-populated with that system call's expected footprint.
>
> That’s not what I mean. Suppose I want to use a ROP gadget in
> vmalloc(), but vmalloc isn’t in the page tables. Then first push
> vmalloc itself onto the stack. As long as RDI contains a sufficiently
> ridiculous value, it should just return without doing anything. And
> it can return right back into the ROP gadget, which is now available.

Yes, it's not perfect, but stack space for a smashing attack is at a
premium, and now you need two stack frames for every gadget you chain
instead of one, so we've halved your ability to chain gadgets.

> > > To improve this, we would want something that would try to check
> > > whether the caller is actually supposed to call the callee, which
> > > is more or less the hard part of CFI. So can’t we just do CFI
> > > and call it a day?
> >
> > By CFI you mean control flow integrity? In theory I believe so,
> > yes, but in practice doesn't it require a lot of semantic object
> > information which is easy to get from higher level languages like
> > java but a bit more difficult for plain C.
>
> Yes. As I understand it, grsecurity instruments gcc to create some
> kind of hash of all function signatures. Then any indirect call can
> effectively verify that it’s calling a function of the right type.
> And every return verifies a cookie.
>
> On CET CPUs, RET gets checked directly, and I don’t see the benefit
> of SCI.

Presumably you know something I don't, but I thought CET CPUs had been
planned for release for ages and not actually released yet?

> > > On top of that, a robust, maintainable implementation of this
> > > thing seems very complicated — for example, what happens if
> > > vfree() gets called?
> >
> > Address-space-local vs. global object tracking is another thing on
> > our list. What we'd probably do is verify the global object was
> > allowed to be freed and then hand it off safely to the main kernel
> > address space.
>
> This seems exceedingly complicated.

It's a research project: we're exploring what's possible so we can
choose the techniques that give the best security improvement for the
additional overhead.

James

2019-04-26 19:23:49

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation



> On Apr 26, 2019, at 11:49 AM, James Bottomley <[email protected]> wrote:
>
> On Fri, 2019-04-26 at 10:40 -0700, Andy Lutomirski wrote:
>>> On Apr 26, 2019, at 8:19 AM, James Bottomley <James.Bottomley@hanse
>>> npartnership.com> wrote:
>>>
>>> On Fri, 2019-04-26 at 08:07 -0700, Andy Lutomirski wrote:
>>>>> On Apr 26, 2019, at 7:57 AM, James Bottomley
>>>>> <[email protected]> wrote:
>>>>>
>>>>>>> On Fri, 2019-04-26 at 07:46 -0700, Dave Hansen wrote:
>>>>>>> On 4/25/19 2:45 PM, Mike Rapoport wrote:
>>>>>>> After the isolated system call finishes, the mappings
>>>>>>> created during its execution are cleared.
>>>>>>
>>>>>> Yikes. I guess that stops someone from calling write() a
>>>>>> bunch of times on every filesystem using every block device
>>>>>> driver and all the DM code to get a lot of code/data faulted
>>>>>> in. But, it also means not even long-running processes will
>>>>>> ever have a chance of behaving anything close to normally.
>>>>>>
>>>>>> Is this something you think can be rectified or is there
>>>>>> something fundamental that would keep SCI page tables from
>>>>>> being cached across different invocations of the same
>>>>>> syscall?
>>>>>
>>>>> There is some work being done to look at pre-populating the
>>>>> isolated address space with the expected execution footprint of
>>>>> the system call, yes. It lessens the ROP gadget protection
>>>>> slightly because you might find a gadget in the pre-populated
>>>>> code, but it solves a lot of the overhead problem.
>>>>
>>>> I’m not even remotely a ROP expert, but: what stops a ROP payload
>>>> from using all the “fault-in” gadgets that exist — any function
>>>> that can return on an error without doing too much will fault in
>>>> the whole page containing the function.
>>>
>>> The address space pre-population is still per syscall, so you don't
>>> get access to the code footprint of a different syscall. So the
>>> isolated address space is created anew for every system call, it's
>>> just pre-populated with that system call's expected footprint.
>>
>> That’s not what I mean. Suppose I want to use a ROP gadget in
>> vmalloc(), but vmalloc isn’t in the page tables. Then first push
>> vmalloc itself onto the stack. As long as RDI contains a sufficiently
>> ridiculous value, it should just return without doing anything. And
>> it can return right back into the ROP gadget, which is now available.
>
> Yes, it's not perfect, but stack space for a smashing attack is at a
> premium and now you need two stack frames for every gadget you chain
> instead of one so we've halved your ability to chain gadgets.
>
>>>> To improve this, we would want something that would try to check
>>>> whether the caller is actually supposed to call the callee, which
>>>> is more or less the hard part of CFI. So can’t we just do CFI
>>>> and call it a day?
>>>
>>> By CFI you mean control flow integrity? In theory I believe so,
>>> yes, but in practice doesn't it require a lot of semantic object
>>> information which is easy to get from higher level languages like
>>> java but a bit more difficult for plain C.
>>
>> Yes. As I understand it, grsecurity instruments gcc to create some
>> kind of hash of all function signatures. Then any indirect call can
>> effectively verify that it’s calling a function of the right type.
>> And every return verifies a cookie.
>>
>> On CET CPUs, RET gets checked directly, and I don’t see the benefit
>> of SCI.
>
> Presumably you know something I don't but I thought CET CPUs had been
> planned for release for ages, but not actually released yet?

I don’t know any secrets about this, but I don’t think it’s released. Last I checked, it didn’t even have a final public spec.

>
>>>> On top of that, a robust, maintainable implementation of this
>>>> thing seems very complicated — for example, what happens if
>>>> vfree() gets called?
>>>
>>> Address-space-local vs. global object tracking is another thing on
>>> our list. What we'd probably do is verify the global object was
>>> allowed to be freed and then hand it off safely to the main kernel
>>> address space.
>>
>> This seems exceedingly complicated.
>
> It's a research project: we're exploring what's possible so we can
> choose the techniques that give the best security improvement for the
> additional overhead.
>

:)

2019-04-26 21:27:45

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation

> On Apr 26, 2019, at 2:58 AM, Ingo Molnar <[email protected]> wrote:
>
>
> * Ingo Molnar <[email protected]> wrote:
>
>> I really don't like it where this is going. In a couple of years I
>> really want to be able to think of PTI as a bad dream that is mostly
>> over fortunately.
>>
>> I have the feeling that compiler level protection that avoids
>> corrupting the stack in the first place is going to be lower overhead,
>> and would work in a much broader range of environments. Do we have
>> analysis of what the compiler would have to do to prevent most ROP
>> attacks, and what the runtime cost of that is?
>>
>> I mean, C# and Java programs aren't able to corrupt the stack as long
>> as the language runtime is correct. Has to be possible, right?
>
> So if such security feature is offered then I'm afraid distros would be
> strongly inclined to enable it - saying 'yes' to a kernel feature that
> can keep your product off CVE advisories is a strong force.
>
> To phrase the argument in a bit more controversial form:
>
> If the price of Linux using an insecure C runtime is to slow down
> system calls with immense PTI-alike runtime costs, then wouldn't it be
> the right technical decision to write the kernel in a language runtime
> that doesn't allow stack overflows and such?
>
> I.e. if having Linux in C ends up being slower than having it in Java,
> then what's the performance argument in favor of using C to begin with?
> ;-)
>
> And no, I'm not arguing for Java or C#, but I am arguing for a saner
> version of C.
>
>

IMO there are three credible choices:

1. C with fairly strong CFI protection. Grsecurity has this (supposedly
— there’s a distinct lack of source code available), and clang is
gradually working on it.

2. A safe language for parts of the kernel, e.g. drivers and maybe
eventually filesystems. Rust is probably the only credible candidate.
Actually creating a decent Rust wrapper around the core kernel
facilities would be quite a bit of work. Things like sysfs would be
interesting in Rust, since AFAIK few or even no drivers actually get
the locking fully correct. This means that naive users of the API
cannot port directly to safe Rust, because all the races won't compile
:)

3. A sandbox for parts of the kernel, e.g. drivers. The obvious
candidates are eBPF and WASM.

#2 will give very good performance. #3 gives potentially stronger
protection against a sandboxed component corrupting the kernel
overall, but it gives much weaker protection against a sandboxed
component corrupting itself.

In an ideal world, we could do #2 *and* #3. Drivers could, for
example, be written in a language like Rust, compiled to WASM, and run
in the kernel.

2019-04-27 08:50:26

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation


* Andy Lutomirski <[email protected]> wrote:

> > On Apr 26, 2019, at 2:58 AM, Ingo Molnar <[email protected]> wrote:
> >
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> >> I really don't like it where this is going. In a couple of years I
> >> really want to be able to think of PTI as a bad dream that is mostly
> >> over fortunately.
> >>
> >> I have the feeling that compiler level protection that avoids
> >> corrupting the stack in the first place is going to be lower overhead,
> >> and would work in a much broader range of environments. Do we have
> >> analysis of what the compiler would have to do to prevent most ROP
> >> attacks, and what the runtime cost of that is?
> >>
> >> I mean, C# and Java programs aren't able to corrupt the stack as long
> >> as the language runtime is correct. Has to be possible, right?
> >
> > So if such security feature is offered then I'm afraid distros would be
> > strongly inclined to enable it - saying 'yes' to a kernel feature that
> > can keep your product off CVE advisories is a strong force.
> >
> > To phrase the argument in a bit more controversial form:
> >
> > If the price of Linux using an insecure C runtime is to slow down
> > system calls with immense PTI-alike runtime costs, then wouldn't it be
> > the right technical decision to write the kernel in a language runtime
> > that doesn't allow stack overflows and such?
> >
> > I.e. if having Linux in C ends up being slower than having it in Java,
> > then what's the performance argument in favor of using C to begin with?
> > ;-)
> >
> > And no, I'm not arguing for Java or C#, but I am arguing for a saner
> > version of C.
> >
> >
>
> IMO there are three credible choices:
>
> 1. C with fairly strong CFI protection. Grsecurity has this (supposedly
> — there’s a distinct lack of source code available), and clang is
> gradually working on it.
>
> 2. A safe language for parts of the kernel, e.g. drivers and maybe
> eventually filesystems. Rust is probably the only credible candidate.
> Actually creating a decent Rust wrapper around the core kernel
> facilities would be quite a bit of work. Things like sysfs would be
> interesting in Rust, since AFAIK few or even no drivers actually get
> the locking fully correct. This means that naive users of the API
> cannot port directly to safe Rust, because all the races won't compile
> :)
>
> 3. A sandbox for parts of the kernel, e.g. drivers. The obvious
> candidates are eBPF and WASM.
>
> #2 will give very good performance. #3 gives potentially stronger
> protection against a sandboxed component corrupting the kernel overall,
> but it gives much weaker protection against a sandboxed component
> corrupting itself.
>
> In an ideal world, we could do #2 *and* #3. Drivers could, for
> example, be written in a language like Rust, compiled to WASM, and run
> in the kernel.

So why not go for #1, which would still outperform #2/#3, right? Do we
know what it would take, roughly, and what the runtime overhead looks
like?

Thanks,

Ingo

2019-04-27 10:48:40

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation


* Ingo Molnar <[email protected]> wrote:

> * Andy Lutomirski <[email protected]> wrote:
>
> > > And no, I'm not arguing for Java or C#, but I am arguing for a saner
> > > version of C.
> >
> > IMO there are three credible choices:
> >
> > 1. C with fairly strong CFI protection. Grsecurity has this (supposedly
> > — there’s a distinct lack of source code available), and clang is
> > gradually working on it.
> >
> > 2. A safe language for parts of the kernel, e.g. drivers and maybe
> > eventually filesystems. Rust is probably the only credible candidate.
> > Actually creating a decent Rust wrapper around the core kernel
> > facilities would be quite a bit of work. Things like sysfs would be
> > interesting in Rust, since AFAIK few or even no drivers actually get
> > the locking fully correct. This means that naive users of the API
> > cannot port directly to safe Rust, because all the races won't compile
> > :)
> >
> > 3. A sandbox for parts of the kernel, e.g. drivers. The obvious
> > candidates are eBPF and WASM.
> >
> > #2 will give very good performance. #3 gives potentially stronger
> > protection against a sandboxed component corrupting the kernel overall,
> > but it gives much weaker protection against a sandboxed component
> > corrupting itself.
> >
> > In an ideal world, we could do #2 *and* #3. Drivers could, for
> > example, be written in a language like Rust, compiled to WASM, and run
> > in the kernel.
>
> So why not go for #1, which would still outperform #2/#3, right? Do we
> know what it would take, roughly, and what the runtime overhead looks
> like?

BTW., CFI protection is in essence a compiler (or hardware) technique to
detect stack frame or function pointer corruption after the fact.

So I'm wondering whether there's a 4th choice as well, which avoids
control flow corruption *before* it happens:

- A C language runtime that is a subset of current C syntax and
semantics used in the kernel, and which doesn't allow access outside
of existing objects and thus creates a strictly enforced separation
between memory used for data, and memory used for code and control
flow.

- This would involve, at minimum:

- tracking every type and object and its inherent length and valid
access patterns, and never losing track of its type.

- being a lot more organized about initialization, i.e. no
uninitialized variables/fields.

- being a lot more strict about type conversions and pointers in
general.

- ... and a metric ton of other details.

- If such a runtime could co-exist with regular C kernel code without
big complications, then we could convert particular pieces of C code
into this safe-C runtime step by step, and also allow compiling a given
piece of code either as regular C or into the safe runtime.

- If a particular function can be formally proven to be safe, it can be
compiled as C - otherwise it would be compiled as safe-C.

- ... or something like this.

The advantage would be: data corruption could never be triggered by code
itself, if the compiler and runtime are correct. Return addresses and
stacks wouldn't have to be 'hardened' or 'checked', because they'd never
be corrupted in the first place. WX memory wouldn't be an issue as kernel
code could never jump into generated shellcode or ROP gadgets.

The disadvantage: the overhead of managing this, and any loss of
flexibility on the kernel programming side.
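
As a concrete picture of the "tracking every object and its inherent
length" requirement above, a minimal userspace illustration using
bounds-carrying fat pointers (illustrative only, not a proposal for
actual kernel syntax):

#include <stdio.h>
#include <stdlib.h>

/* a pointer that carries its object's bounds along with it */
struct fat_ptr {
    int *base;
    size_t len;    /* number of valid elements */
};

/* the check the compiler/runtime would emit on every access */
static int checked_load(struct fat_ptr p, size_t idx)
{
    if (idx >= p.len) {
        fprintf(stderr, "out-of-bounds access trapped\n");
        abort();
    }
    return p.base[idx];
}

int main(void)
{
    int buf[4] = { 1, 2, 3, 4 };
    struct fat_ptr p = { buf, 4 };

    printf("%d\n", checked_load(p, 3));    /* fine, prints 4 */
    checked_load(p, 7);                    /* aborts instead of corrupting */
    return 0;
}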

Does this make sense, and if yes, does such a project exist already?
(And no, I don't mean Java or C#.)

Or would we in essence end up with a Java runtime, with C syntax?

Thanks,

Ingo

2019-04-28 05:46:42

by Mike Rapoport

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation

On Fri, Apr 26, 2019 at 09:49:56AM +0200, Peter Zijlstra wrote:
> On Fri, Apr 26, 2019 at 12:45:49AM +0300, Mike Rapoport wrote:
> > The initial SCI implementation allows access to any kernel data, but it
> > limits access to the code in the following way:
> > * calls and jumps to known code symbols without offset are allowed
> > * calls and jumps into a known symbol with offset are allowed only if that
> > symbol was already accessed and the offset is in the next page
> > * all other code access are blocked
>
> So if you have a large function and an in-function jump skips a page
> you're toast.

Right :(

> Why not employ the instruction decoder we have and unconditionally allow
> all direct JMP/CALL but verify indirect JMP/CALL and RET ?

Apparently I didn't dig deep enough to find the instruction decoder :)
Surely I can use it.
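
For reference, a sketch of that suggestion on top of the decoder
(struct insn, insn_init() and insn_get_length() are the real API from
arch/x86/include/asm/insn.h; the sci_ naming and the exact policy are
assumptions):

#include <linux/types.h>
#include <asm/insn.h>

/*
 * Decode one instruction and decide whether it needs verification:
 * direct JMP/CALL pass unconditionally, indirect JMP/CALL and RET
 * are flagged for checking.
 */
static bool sci_insn_needs_check(const void *kaddr)
{
    struct insn insn;

    insn_init(&insn, kaddr, MAX_INSN_SIZE, 1);    /* 1 == x86-64 */
    insn_get_length(&insn);                       /* decodes all fields */

    switch (insn.opcode.bytes[0]) {
    case 0xe8:    /* direct CALL rel32 */
    case 0xe9:    /* direct JMP rel32 */
    case 0xeb:    /* direct JMP rel8 */
        return false;
    case 0xc2:
    case 0xc3:    /* near RET */
        return true;
    case 0xff:    /* ModRM /2 = indirect CALL, /4 = indirect JMP */
        return X86_MODRM_REG(insn.modrm.value) == 2 ||
               X86_MODRM_REG(insn.modrm.value) == 4;
    default:      /* everything else (incl. conditional jumps) passes */
        return false;
    }
}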

> Anyway, I'm fearing the overhead of this one, this cannot be fast.

Well, I think that the verification itself is not what will slow things
down the most. IMHO, the major overhead is coming from the cr3 switch.
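
To make that concrete: entering and leaving the isolated context means
at least two cr3 writes per syscall, and without PCID tricks every cr3
write flushes the TLB. A hedged sketch (the sci_ helpers are
hypothetical; write_cr3() and init_mm are the real interfaces):

#include <linux/mm_types.h>
#include <asm/special_insns.h>

/* enter the isolated context: load the reduced, per-syscall tables */
static inline void sci_enter_isolated(pgd_t *sci_pgd)    /* hypothetical */
{
    write_cr3(__pa(sci_pgd));    /* full TLB flush when PCID is not used */
}

/* leave it (interrupt entry, fault verification, syscall exit) */
static inline void sci_exit_isolated(void)
{
    write_cr3(__pa(init_mm.pgd));
}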

--
Sincerely yours,
Mike.

2019-04-28 05:49:06

by Mike Rapoport

[permalink] [raw]
Subject: Re: [RFC PATCH 5/7] x86/mm/fault: hook up SCI verification

On Fri, Apr 26, 2019 at 09:42:23AM +0200, Peter Zijlstra wrote:
> On Fri, Apr 26, 2019 at 12:45:52AM +0300, Mike Rapoport wrote:
> > If a system call runs in isolated context, its accesses to kernel code and
> > data will be verified by the SCI subsystem.
> >
> > Signed-off-by: Mike Rapoport <[email protected]>
> > ---
> > arch/x86/mm/fault.c | 28 ++++++++++++++++++++++++++++
> > 1 file changed, 28 insertions(+)
>
> There's a distinct lack of touching do_double_fault(). It appears to me
> that you'll instantly trigger #DF when you #PF, because the #PF handler
> itself will not be able to run.

The #PF handler is able to run. On interrupt/error entry the cr3 is
switched to the full kernel page tables, pretty much like PTI does for
user <-> kernel transitions. It's in patch 3.

> And then obviously you have to be very careful to make sure #DF can,
> _at_all_times_ run, otherwise you'll tripple-fault and we all know what
> that does.
>

--
Sincerely yours,
Mike.

2019-04-28 06:03:49

by Mike Rapoport

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] x86: introduce system calls addess space isolation

On Thu, Apr 25, 2019 at 05:30:13PM -0700, Andy Lutomirski wrote:
> On Thu, Apr 25, 2019 at 2:46 PM Mike Rapoport <[email protected]> wrote:
> >
> > Hi,
> >
> > Address space isolation has been used to protect the kernel from the
> > userspace and userspace programs from each other since the invention of the
> > virtual memory.
> >
> > Assuming that kernel bugs and therefore vulnerabilities are inevitable it
> > might be worth isolating parts of the kernel to minimize damage that these
> > vulnerabilities can cause.
> >
> > The idea here is to allow an untrusted user access to a potentially
> > vulnerable kernel in such a way that any kernel vulnerability they find to
> > exploit is either prevented or the consequences confined to their isolated
> > address space such that the compromise attempt has minimal impact on other
> > tenants or the protected structures of the monolithic kernel. Although we
> > hope to prevent many classes of attack, the first target we're looking at
> > is ROP gadget protection.
> >
> > These patches implement a "system call isolation (SCI)" mechanism that
> > allows running system calls in an isolated address space with reduced page
> > tables to prevent ROP attacks.
> >
> > ROP attacks involve corrupting the stack return address to repoint it to a
> > segment of code you know exists in the kernel that can be used to perform
> > the action you need to exploit the system.
> >
> > The idea behind the prevention is that if we fault in pages in the
> > execution path, we can compare target address against the kernel symbol
> > table. So if we're in a function, we allow local jumps (and simply falling
> > of the end of a page) but if we're jumping to a new function it must be to
> > an external label in the symbol table.
>
> That's quite an assumption. The entry code at least uses .L labels.
> Do you get that right?
>
> As far as I can see, most of what's going on here has very little to
> do with jumps and calls. The benefit seems to come from making sure
> that the RET instruction actually goes somewhere that's already been
> faulted in. Am I understanding right?

Well, RET indeed will go somewhere that's already been faulted in. But
before that, the first CALL to not-yet-mapped code will fault and bring in
the page containing the CALL target.

If the CALL is made into the middle of a function, SCI will refuse to
continue the syscall execution.

As for the local jumps, as long as they are inside a page that was already
mapped or in the next page, they are allowed.

This does not take care (yet) of larger functions where local jumps are
farther than PAGE_SIZE apart.
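
A sketch of the rule above as code (kallsyms_lookup_size_offset() is
the real symbol-table API; sci_page_mapped() is a hypothetical helper
over the SCI page tables):

#include <linux/kallsyms.h>
#include <linux/mm.h>

static bool sci_verify_code_access(unsigned long addr)
{
    unsigned long size, offset;

    if (!kallsyms_lookup_size_offset(addr, &size, &offset))
        return false;        /* not a known symbol at all */

    if (offset == 0)
        return true;         /* call/jump to the start of a function */

    /*
     * Into the middle of a function: allow only if the target page is
     * already mapped for this syscall, or directly follows one that is.
     */
    return sci_page_mapped(addr & PAGE_MASK) ||
           sci_page_mapped((addr & PAGE_MASK) - PAGE_SIZE);
}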

Here's an example trace of #PF's produced by a dummy get_answer system call
from patch 7:

[ 12.012906] #PF: DATA: do_syscall_64+0x26b/0x4c0 fault at 0xffffffff82000bb8
[ 12.012918] #PF: INSN: __x86_indirect_thunk_rax+0x0/0x20 fault at __x86_indirect_thunk_rax+0x0/0x20
[ 12.012929] #PF: INSN: __x64_sys_get_answer+0x0/0x10 fault at __x64_sys_get_answer+0x0/0x10

> --Andy
>

--
Sincerely yours,
Mike.

2019-04-28 06:12:32

by Mike Rapoport

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] x86: introduce system calls addess space isolation

On Fri, Apr 26, 2019 at 07:41:09AM -0700, Dave Hansen wrote:
> On 4/25/19 2:45 PM, Mike Rapoport wrote:
> > The idea behind the prevention is that if we fault in pages in the
> > execution path, we can compare target address against the kernel symbol
> > table. So if we're in a function, we allow local jumps (and simply falling
> > of the end of a page) but if we're jumping to a new function it must be to
> > an external label in the symbol table. Since ROP attacks are all about
> > jumping to gadget code which is effectively in the middle of real
> > functions, the jumps they induce are to code that doesn't have an external
> > symbol, so it should mostly detect when they happen.
>
> This turns the problem from: "attackers can leverage any data/code that
> the kernel has mapped (anything)" to "attackers can leverage any
> code/data that the current syscall has faulted in".
>
> That seems like a pretty restrictive change.
>
> > At this time we are not suggesting any API that will enable the system
> > calls isolation. Because of the overhead required for this, it should only
> > be activated for processes or containers we know should be untrusted. We
> > still have no actual numbers, but surely forcing page faults during system
> > call execution will not come for free.
>
> What's the minimum number of faults that have to occur to handle the
> simplest dummy fault?

For the current implementation it's 3.

Here is the example trace of #PF's produced by a dummy get_answer
system call from patch 7:

[ 12.012906] #PF: DATA: do_syscall_64+0x26b/0x4c0 fault at 0xffffffff82000bb8
[ 12.012918] #PF: INSN: __x86_indirect_thunk_rax+0x0/0x20 fault at __x86_indirect_thunk_rax+0x0/0x20
[ 12.012929] #PF: INSN: __x64_sys_get_answer+0x0/0x10 fault at __x64_sys_get_answer+0x0/0x10

For the sci_write_dmesg syscall that does copy_from_user() and printk() it's
between 35 and 60, depending on console and /proc/sys/kernel/printk values.

This includes both code and data accesses. The data page faults can be
avoided if we pre-populate SCI page tables with data.

--
Sincerely yours,
Mike.

2019-04-29 18:30:53

by James Morris

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation

On Sat, 27 Apr 2019, Ingo Molnar wrote:

> - A C language runtime that is a subset of current C syntax and
> semantics used in the kernel, and which doesn't allow access outside
> of existing objects and thus creates a strictly enforced separation
> between memory used for data, and memory used for code and control
> flow.

Might be better to start with Rust.


--
James Morris
<[email protected]>

2019-04-29 18:46:25

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation

On Mon, Apr 29, 2019 at 11:27 AM James Morris <[email protected]> wrote:
>
> On Sat, 27 Apr 2019, Ingo Molnar wrote:
>
> > - A C language runtime that is a subset of current C syntax and
> > semantics used in the kernel, and which doesn't allow access outside
> > of existing objects and thus creates a strictly enforced separation
> > between memory used for data, and memory used for code and control
> > flow.
>
> Might be better to start with Rust.
>

I think that Rust would be the clear winner as measured by how fun it sounds :)

2019-04-29 18:49:29

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation

On Sat, Apr 27, 2019 at 3:46 AM Ingo Molnar <[email protected]> wrote:
>
>
> * Ingo Molnar <[email protected]> wrote:
>
> > * Andy Lutomirski <[email protected]> wrote:
> >
> > > > And no, I'm not arguing for Java or C#, but I am arguing for a saner
> > > > version of C.
> > >
> > > IMO there are three credible choices:
> > >
> > > 1. C with fairly strong CFI protection. Grsecurity has this (supposedly
> > > — there’s a distinct lack of source code available), and clang is
> > > gradually working on it.
> > >
> > > 2. A safe language for parts of the kernel, e.g. drivers and maybe
> > > eventually filesystems. Rust is probably the only credible candidate.
> > > Actually creating a decent Rust wrapper around the core kernel
> > > facilities would be quite a bit of work. Things like sysfs would be
> > > interesting in Rust, since AFAIK few or even no drivers actually get
> > > the locking fully correct. This means that naive users of the API
> > > cannot port directly to safe Rust, because all the races won't compile
> > > :)
> > >
> > > 3. A sandbox for parts of the kernel, e.g. drivers. The obvious
> > > candidates are eBPF and WASM.
> > >
> > > #2 will give very good performance. #3 gives potentially stronger
> > > protection against a sandboxed component corrupting the kernel overall,
> > > but it gives much weaker protection against a sandboxed component
> > > corrupting itself.
> > >
> > > In an ideal world, we could do #2 *and* #3. Drivers could, for
> > > example, be written in a language like Rust, compiled to WASM, and run
> > > in the kernel.
> >
> > So why not go for #1, which would still outperform #2/#3, right? Do we
> > know what it would take, roughly, and what the runtime overhead looks
> > like?
>
> BTW., CFI protection is in essence a compiler (or hardware) technique to
> detect stack frame or function pointer corruption after the fact.
>
> So I'm wondering whether there's a 4th choice as well, which avoids
> control flow corruption *before* it happens:
>
> - A C language runtime that is a subset of current C syntax and
> semantics used in the kernel, and which doesn't allow access outside
> of existing objects and thus creates a strictly enforced separation
> between memory used for data, and memory used for code and control
> flow.
>
> - This would involve, at minimum:
>
> - tracking every type and object and its inherent length and valid
> access patterns, and never losing track of its type.
>
> - being a lot more organized about initialization, i.e. no
> uninitialized variables/fields.
>
> - being a lot more strict about type conversions and pointers in
> general.

You're not the only one to suggest this. There are at least a few
things that make this extremely difficult if not impossible. For
example, consider this code:

void maybe_buggy(void)
{
    int a, b;
    int *p = &a;
    int *q = (int *)some_function((unsigned long)p);
    *q = 1;
}

If some_function(&a) returns &a, then all is well. But if
some_function(&a) returns &b or even a valid address of some unrelated
kernel object, then the code might be entirely valid and correct C,
but I don't see how the runtime checks are supposed to tell whether
the resulting address is valid or is a bug. This type of code is, I
think, quite common in the kernel -- it happens in every data
structure where we have unions of pointers and integers or where we
steal some known-zero bits of a pointer to store something else.
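
For a concrete picture of the bit-stealing pattern, a userspace toy
(the kernel does this for real in e.g. the radix tree, which tags
entries in the low pointer bits):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define TAG_MASK 0x3ul    /* low two bits are zero for 4-byte-aligned objects */

static void *tag_ptr(void *p, uintptr_t tag)
{
    assert(((uintptr_t)p & TAG_MASK) == 0 && tag <= TAG_MASK);
    return (void *)((uintptr_t)p | tag);
}

static void *untag_ptr(void *p)
{
    return (void *)((uintptr_t)p & ~TAG_MASK);
}

int main(void)
{
    int *obj = malloc(sizeof(*obj));
    void *tagged = tag_ptr(obj, 1);    /* extra state smuggled into bit 0 */

    /* a checker sees an "invalid" pointer here unless it understands
     * the tagging convention */
    *(int *)untag_ptr(tagged) = 42;

    printf("%d\n", *obj);    /* prints 42 */
    free(obj);
    return 0;
}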

--Andy

2019-04-30 05:04:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation


* Andy Lutomirski <[email protected]> wrote:

> On Sat, Apr 27, 2019 at 3:46 AM Ingo Molnar <[email protected]> wrote:

> > So I'm wondering whether there's a 4th choice as well, which avoids
> > control flow corruption *before* it happens:
> >
> > - A C language runtime that is a subset of current C syntax and
> > semantics used in the kernel, and which doesn't allow access outside
> > of existing objects and thus creates a strictly enforced separation
> > between memory used for data, and memory used for code and control
> > flow.
> >
> > - This would involve, at minimum:
> >
> > - tracking every type and object and its inherent length and valid
> > access patterns, and never losing track of its type.
> >
> > - being a lot more organized about initialization, i.e. no
> > uninitialized variables/fields.
> >
> > - being a lot more strict about type conversions and pointers in
> > general.
>
> You're not the only one to suggest this. There are at least a few
> things that make this extremely difficult if not impossible. For
> example, consider this code:
>
> void maybe_buggy(void)
> {
>     int a, b;
>     int *p = &a;
>     int *q = (int *)some_function((unsigned long)p);
>     *q = 1;
> }
>
> If some_function(&a) returns &a, then all is well. But if
> some_function(&a) returns &b or even a valid address of some unrelated
> kernel object, then the code might be entirely valid and correct C,
> but I don't see how the runtime checks are supposed to tell whether
> the resulting address is valid or is a bug. This type of code is, I
> think, quite common in the kernel -- it happens in every data
> structure where we have unions of pointers and integers or where we
> steal some known-zero bits of a pointer to store something else.

So the thing is, for the infinitely large state space of "valid C code"
we already disallow infinitely many variants in the Linux kernel.

We have complicated rules that disallow certain C syntactic and
semantic constructs, both on the tooling (build failure/warning) and on
the review (style/taste) level.

So the question IMHO isn't whether it's "valid C", because we already
have the Linux kernel's own C syntax variant and are enforcing it with
varying degrees of success.

The question is whether the example you gave can be written in a strongly
typed fashion, whether it makes sense to do so, and what the costs are.

I think it's evident that it can be written with strongly typed
constructs, by separating pointers from embedded error codes - with
negative side effects on code generation: for example, it increases
structure sizes and lengthens error return paths.
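
For concreteness: the kernel's ERR_PTR() idiom packs an errno into the
pointer value itself, while a strongly typed variant keeps pointer and
error apart and therefore returns a wider value (the typed variant
below is illustrative only):

#include <linux/err.h>
#include <linux/slab.h>

static struct kmem_cache *obj_cache;

/* today: the error is smuggled inside the pointer value itself */
static void *get_obj_errptr(gfp_t gfp)
{
    void *obj = kmem_cache_alloc(obj_cache, gfp);

    return obj ? obj : ERR_PTR(-ENOMEM);
}

/* strongly typed alternative: pointer and error kept apart, at the
 * cost of a wider return value and a longer error path */
struct obj_result {
    void *obj;    /* valid only when err == 0 */
    int err;
};

static struct obj_result get_obj_typed(gfp_t gfp)
{
    struct obj_result r = { kmem_cache_alloc(obj_cache, gfp), 0 };

    if (!r.obj)
        r.err = -ENOMEM;
    return r;
}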

I think there are four main costs of converting such a pattern to strongly
typed constructs:

- memory/cache footprint: there's a nonzero cost there.
- performance: this will hurt too.
- code readability: this will probably improve.
- code robustness: this will improve too.

So I think the proper question to ask is not whether there's common C
syntax within the kernel that would have to be rewritten, but whether the
total sum of memory and runtime overhead of strongly typed C programming
(if it's possible/desirable) is larger than the total sum of a typical
Linux distro enabling the various current and proposed kernel hardening
features that have a runtime overhead:

- the SMAP/SMEP overhead of STAC/CLAC for every single user copy

- other usercopy hardening features

- stackprotector

- KASLR

- compiler plugins against information leaks

- proposed KASLR extension to implement module randomization and -PIE overhead

- proposed function call integrity checks

- proposed per system call kernel stack offset randomization

- ( and I'm sure I forgot about a few more, and it's all still only
reactive security, not proactive security. )

That's death by a thousand cuts and CR3 switching during system calls is
also throwing a hand grenade into the fight ;-)

So if people are also proposing to do CR3 switches in every system call,
I'm pretty sure the answer is "yes, even a managed C runtime is probably
faster than *THAT* sum of a performance mess" - at least with the current
CR3 switching x86-uarch cost structure...

Thanks,

Ingo

2019-04-30 09:41:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation

On Tue, Apr 30, 2019 at 07:03:37AM +0200, Ingo Molnar wrote:
> So the question IMHO isn't whether it's "valid C", because we already
> have the Linux kernel's own C syntax variant and are enforcing it with
> varying degrees of success.

I'm not getting into the whole 'safe' fight here; but you're
underselling things. We don't have a C syntax, we have a full blown C
language variant.

The 'Kernel C' that we write is very much not 'ANSI/ISO C' anymore in a
fair number of places. And if I can get my way, we'll only diverge
further from the standard.

And this is quite separate from us using every GCC extension under the
sun; which of course also doesn't help. It mostly has to do with us
treating C as a portable assembler and the C people not wanting to
commit to sensible things because they think C is a high-level language.

2019-04-30 11:07:09

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation


* Peter Zijlstra <[email protected]> wrote:

> On Tue, Apr 30, 2019 at 07:03:37AM +0200, Ingo Molnar wrote:
> > So the question IMHO isn't whether it's "valid C", because we already
> > have the Linux kernel's own C syntax variant and are enforcing it with
> > varying degrees of success.
>
> I'm not getting into the whole 'safe' fight here; but you're
> underselling things. We don't have a C syntax, we have a full blown C
> language variant.
>
> The 'Kernel C' that we write is very much not 'ANSI/ISO C' anymore in a
> fair number of places. And if I can get my way, we'll only diverge
> further from the standard.

Yeah, but I think it would be fair to say that random style variations
aside, in the kernel we still allow about 95%+ of 'sensible C'.

> And this is quite separate from us using every GCC extension under the
> sun; which of course also doesn't help. It mostly has to do with us
> treating C as a portable assembler and the C people not wanting to
> commit to sensible things because they think C is a high-level
> language.

Indeed, and also because there's arguably somewhat of a "if the spec
allows it then performance first, common-sense semantics second" mindset.
Which is an understandable social dynamic, as compiler developers tend to
distinguish themselves via the optimizations they've authored.

Anyway, the main point I tried to make is that I think we'd still be able
to allow 95%+ of "sensible C" even if executed in a "safe runtime", and
we'd still be able to build and run without such strong runtime type
enforcement, i.e. get kernel code close to what we have today, minus a
handful of optimizations and data structures. (But the performance costs
even in that case are nonzero - I'm not sugarcoating it.)

( Plus even that isn't a fully secure solution with deterministic
outcomes, due to parallelism and data races. )

Thanks,

Ingo

2019-04-30 16:46:16

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 5/7] x86/mm/fault: hook up SCI verification

On Sat, Apr 27, 2019 at 10:47 PM Mike Rapoport <[email protected]> wrote:
>
> On Fri, Apr 26, 2019 at 09:42:23AM +0200, Peter Zijlstra wrote:
> > On Fri, Apr 26, 2019 at 12:45:52AM +0300, Mike Rapoport wrote:
> > > If a system call runs in isolated context, its accesses to kernel code and
> > > data will be verified by the SCI subsystem.
> > >
> > > Signed-off-by: Mike Rapoport <[email protected]>
> > > ---
> > > arch/x86/mm/fault.c | 28 ++++++++++++++++++++++++++++
> > > 1 file changed, 28 insertions(+)
> >
> > There's a distinct lack of touching do_double_fault(). It appears to me
> > that you'll instantly trigger #DF when you #PF, because the #PF handler
> > itself will not be able to run.
>
> The #PF handler is able to run. On interrupt/error entry the cr3 is
> switched to the full kernel page tables, pretty much like PTI does for
> user <-> kernel transitions. It's in patch 3.
>
>

PeterZ meant page_fault, not do_page_fault. In your patch, page_fault
and some of error_entry run before that magic switchover happens. If
they're not in the page tables, you double-fault.

And don't even try to do SCI magic in the double-fault handler. As I
understand it, the SDM and APM aren't kidding when they say that #DF
is an abort, not a fault. There is a single case in the kernel where
we recover from #DF, and it was vetted by microcode people.

2019-05-01 05:43:18

by Mike Rapoport

[permalink] [raw]
Subject: Re: [RFC PATCH 5/7] x86/mm/fault: hook up SCI verification

On Tue, Apr 30, 2019 at 09:44:09AM -0700, Andy Lutomirski wrote:
> On Sat, Apr 27, 2019 at 10:47 PM Mike Rapoport <[email protected]> wrote:
> >
> > On Fri, Apr 26, 2019 at 09:42:23AM +0200, Peter Zijlstra wrote:
> > > On Fri, Apr 26, 2019 at 12:45:52AM +0300, Mike Rapoport wrote:
> > > > If a system call runs in isolated context, its accesses to kernel code and
> > > > data will be verified by the SCI subsystem.
> > > >
> > > > Signed-off-by: Mike Rapoport <[email protected]>
> > > > ---
> > > > arch/x86/mm/fault.c | 28 ++++++++++++++++++++++++++++
> > > > 1 file changed, 28 insertions(+)
> > >
> > > There's a distinct lack of touching do_double_fault(). It appears to me
> > > that you'll instantly trigger #DF when you #PF, because the #PF handler
> > > itself will not be able to run.
> >
> > The #PF handler is able to run. On interrupt/error entry the cr3 is
> > switched to the full kernel page tables, pretty much like PTI does for
> user <-> kernel transitions. It's in patch 3.
> >
> >
>
> PeterZ meant page_fault, not do_page_fault. In your patch, page_fault
> and some of error_entry run before that magic switchover happens. If
> they're not in the page tables, you double-fault.

The entry code is in the SCI page tables, just like it is in the
user-space page tables with PTI.

> And don't even try to do SCI magic in the double-fault handler. As I
> understand it, the SDM and APM aren't kidding when they say that #DF
> is an abort, not a fault. There is a single case in the kernel where
> we recover from #DF, and it was vetted by microcode people.
>

--
Sincerely yours,
Mike.

2019-05-02 11:37:07

by Robert O'Callahan

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation

On Sat, Apr 27, 2019 at 10:46 PM Ingo Molnar <[email protected]> wrote:
> - A C language runtime that is a subset of current C syntax and
> semantics used in the kernel, and which doesn't allow access outside
> of existing objects and thus creates a strictly enforced separation
> between memory used for data, and memory used for code and control
> flow.
>
> - This would involve, at minimum:
>
> - tracking every type and object and its inherent length and valid
> access patterns, and never losing track of its type.
>
> - being a lot more organized about initialization, i.e. no
> uninitialized variables/fields.
>
> - being a lot more strict about type conversions and pointers in
> general.
>
> - ... and a metric ton of other details.

Several research groups have tried to do this, and it is very
difficult to do. In particular this was almost exactly the goal of
C-Cured [1]. Much more recently, there's Microsoft's CheckedC [2] [3],
which is less ambitious. Check the references of the latter for lots
of relevant work. If anyone really pursues this they should talk
directly to researchers who've worked on this, e.g. George Necula; you
need to know what *didn't* work well, which is hard to glean from
papers. (Academic publishing is broken that way.)

One problem with adopting "safe C" or Rust in the kernel is that most
of your security mitigations (e.g. KASLR, CFI, other randomizations)
probably need to remain in place as long as there is a significant
amount of C in the kernel, which means the benefits from eliminating
them will be realized very far in the future, if ever, which makes the
whole exercise harder to justify.

Having said that, I think there's a good case to be made for writing
kernel code in Rust, e.g. sketchy drivers. The classes of bugs
prevented in Rust are significantly broader than your usual safe-C
dialect (e.g. data races).

[1] https://web.eecs.umich.edu/~weimerw/p/p477-necula.pdf
[2] https://www.microsoft.com/en-us/research/uploads/prod/2019/05/checkedc-post2019.pdf
[3] https://github.com/Microsoft/checkedc

Rob
--
Su ot deraeppa sah dna Rehtaf eht htiw saw hcihw, efil lanrete eht uoy
ot mialcorp ew dna, ti ot yfitset dna ti nees evah ew; deraeppa efil
eht. Efil fo Drow eht gninrecnoc mialcorp ew siht - dehcuot evah sdnah
ruo dna ta dekool evah ew hcihw, seye ruo htiw nees evah ew hcihw,
draeh evah ew hcihw, gninnigeb eht morf saw hcihw taht.

2019-05-02 15:23:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation


* Robert O'Callahan <[email protected]> wrote:

> On Sat, Apr 27, 2019 at 10:46 PM Ingo Molnar <[email protected]> wrote:
> > - A C language runtime that is a subset of current C syntax and
> > semantics used in the kernel, and which doesn't allow access outside
> > of existing objects and thus creates a strictly enforced separation
> > between memory used for data, and memory used for code and control
> > flow.
> >
> > - This would involve, at minimum:
> >
> > - tracking every type and object and its inherent length and valid
> > access patterns, and never losing track of its type.
> >
> > - being a lot more organized about initialization, i.e. no
> > uninitialized variables/fields.
> >
> > - being a lot more strict about type conversions and pointers in
> > general.
> >
> > - ... and a metric ton of other details.
>
> Several research groups have tried to do this, and it is very
> difficult to do. In particular this was almost exactly the goal of
> C-Cured [1]. Much more recently, there's Microsoft's CheckedC [2] [3],
> which is less ambitious. Check the references of the latter for lots
> of relevant work. If anyone really pursues this they should talk
> directly to researchers who've worked on this, e.g. George Necula; you
> need to know what *didn't* work well, which is hard to glean from
> papers. (Academic publishing is broken that way.)
>
> One problem with adopting "safe C" or Rust in the kernel is that most
> of your security mitigations (e.g. KASLR, CFI, other randomizations)
> probably need to remain in place as long as there is a significant
> amount of C in the kernel, which means the benefits from eliminating
> them will be realized very far in the future, if ever, which makes the
> whole exercise harder to justify.
>
> Having said that, I think there's a good case to be made for writing
> kernel code in Rust, e.g. sketchy drivers. The classes of bugs
> prevented in Rust are significantly broader than your usual safe-C
> dialect (e.g. data races).
>
> [1] https://web.eecs.umich.edu/~weimerw/p/p477-necula.pdf
> [2] https://www.microsoft.com/en-us/research/uploads/prod/2019/05/checkedc-post2019.pdf
> [3] https://github.com/Microsoft/checkedc

So what might work better is if we defined a Rust dialect that used C
syntax. I.e. the end result would be something like the 'c2rust' or
'citrus' projects, where code like this would be directly translatable to
Rust:

void gz_compress(FILE * in, gzFile out)
{
    char buf[BUFLEN];
    int len;
    int err;

    for (;;) {
        len = fread(buf, 1, sizeof(buf), in);
        if (ferror(in)) {
            perror("fread");
            exit(1);
        }
        if (len == 0)
            break;
        if (gzwrite(out, buf, (unsigned)len) != len)
            error(gzerror(out, &err));
    }
    fclose(in);

    if (gzclose(out) != Z_OK)
        error("failed gzclose");
}


#[no_mangle]
pub unsafe extern "C" fn gz_compress(mut in_: *mut FILE, mut out: gzFile) {
    let mut buf: [i8; 16384];
    let mut len;
    let mut err;
    loop {
        len = fread(buf, 1, std::mem::size_of_val(&buf), in_);
        if ferror(in_) != 0 { perror("fread"); exit(1); }
        if len == 0 { break ; }
        if gzwrite(out, buf, len as c_uint) != len {
            error(gzerror(out, &mut err));
        };
    }
    fclose(in_);
    if gzclose(out) != Z_OK { error("failed gzclose"); };
}

Example taken from:

https://gitlab.com/citrus-rs/citrus

Does this make sense?

Thanks,

Ingo

2019-05-02 21:09:17

by Robert O'Callahan

[permalink] [raw]
Subject: Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation

On Fri, May 3, 2019 at 3:20 AM Ingo Molnar <[email protected]> wrote:
> So what might work better is if we defined a Rust dialect that used C
> syntax. I.e. the end result would be something like the 'c2rust' or
> 'citrus' projects, where code like this would be directly translatable to
> Rust:
>
> void gz_compress(FILE * in, gzFile out)
> {
>     char buf[BUFLEN];
>     int len;
>     int err;
>
>     for (;;) {
>         len = fread(buf, 1, sizeof(buf), in);
>         if (ferror(in)) {
>             perror("fread");
>             exit(1);
>         }
>         if (len == 0)
>             break;
>         if (gzwrite(out, buf, (unsigned)len) != len)
>             error(gzerror(out, &err));
>     }
>     fclose(in);
>
>     if (gzclose(out) != Z_OK)
>         error("failed gzclose");
> }
>
>
> #[no_mangle]
> pub unsafe extern "C" fn gz_compress(mut in_: *mut FILE, mut out: gzFile) {
>     let mut buf: [i8; 16384];
>     let mut len;
>     let mut err;
>     loop {
>         len = fread(buf, 1, std::mem::size_of_val(&buf), in_);
>         if ferror(in_) != 0 { perror("fread"); exit(1); }
>         if len == 0 { break ; }
>         if gzwrite(out, buf, len as c_uint) != len {
>             error(gzerror(out, &mut err));
>         };
>     }
>     fclose(in_);
>     if gzclose(out) != Z_OK { error("failed gzclose"); };
> }
>
> Example taken from:
>
> https://gitlab.com/citrus-rs/citrus
>
> Does this make sense?

Are you saying you want a tool like c2rust/citrus that translates some
new "looks like C, but really Rust" language into actual Rust at build
time? I guess that might work, but I suspect your "looks like C"
language isn't going to end up being much like C (e.g. it's going to
need Rust-style enums-with-fields, Rust polymorphism, Rust traits, and
Rust lifetimes), so it may not be beneficial, because you've just
created a new language no-one knows, and that has some real downsides.

If you're inspired by the dream of transitioning to safer languages,
then I think the first practical step would be to identify some part
of the kernel where the payoff of converting code would be highest.
This is probably something small, relatively isolated, that's not well
tested, generally suspicious, but still in use. Then do an experiment,
converting it to Rust (or something else) using off-the-shelf tools
and manual labor, and see where the pain points are and what benefits
accrue, if any. (Work like https://github.com/tsgates/rust.ko might be
a helpful starting point.) Then you'd have some data to start thinking
about how to reduce the costs, increase the benefits, and sell it to
the kernel community. If you reached out to the Rust community you
might find some volunteers to help with this.

Rob
--
Su ot deraeppa sah dna Rehtaf eht htiw saw hcihw, efil lanrete eht uoy
ot mialcorp ew dna, ti ot yfitset dna ti nees evah ew; deraeppa efil
eht. Efil fo Drow eht gninrecnoc mialcorp ew siht - dehcuot evah sdnah
ruo dna ta dekool evah ew hcihw, seye ruo htiw nees evah ew hcihw,
draeh evah ew hcihw, gninnigeb eht morf saw hcihw taht.

2020-06-30 00:11:19

by hackapple

[permalink] [raw]
Subject: Re: [RFC PATCH 1/7] x86/cpufeatures: add X86_FEATURE_SCI

What’s the kernel version?

2020-06-30 12:17:45

by Mike Rapoport

[permalink] [raw]
Subject: Re: [RFC PATCH 1/7] x86/cpufeatures: add X86_FEATURE_SCI

On Tue, Jun 30, 2020 at 08:08:59AM +0800, hackapple wrote:
> What’s the version of kernel?

It was around the 5.2 time frame, I think.

--
Sincerely yours,
Mike.

2020-07-01 14:06:31

by hackapple

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] x86: introduce system calls addess space isolation

How about performance when running with ASI?