2011-05-31 14:16:05

by Andrew Lutomirski

Subject: [PATCH v4 00/10] Remove syscall instructions at fixed addresses

Patch 1 is just a bugfix from the last vdso series.
The bug should be harmless but it's pretty dumb. This is almost
certainly 3.0 material.

Patch 2 adds documentation for entry_64.S. A lot of the magic in there is far from obvious.

Patches 3, 4, and 5 remove a bunch of syscall instructions in kernel space
at fixed addresses that user code can execute. Several of them live in data
that isn't marked NX. Patch 3/10 makes the vvars NX and patch 5/10 makes
the HPET NX.

The time() vsyscall contains an explicit syscall fallback. Patch 4/10
removes it.

At this point, there is only one explicit syscall left in the vsyscall
page: the fallback case for vgettimeofday. The rest of the series is to
remove it and most of the remaining vsyscall code.

Patch 6 is pure cleanup. venosys (one of the four vsyscalls) has been
broken for years, so patch 6 removes it.

Patch 7 pads the empty parts of the vsyscall page with 0xcc. 0xcc is
the int3 opcode, an explicit trap.

Patch 8 removes the code implementing the vsyscalls and replaces it with
magic int 0xcc incantations. These incantations are specifically
designed so that jumping into them at funny offsets will either work
fine or generate some kind of fault. This is a significant performance
penalty (~220ns here) for all vsyscall users, but there aren't many
left. Because current glibc still uses the time vsyscall (although it's
fixed in glibc's git), the option CONFIG_UNSAFE_VSYSCALLS (default y)
will leave time() alone.
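
For reference, each emulated vsyscall becomes just three instructions
(the full source is in patch 8's vsyscall_emu_64.S); for vgettimeofday:

	movb $0xcc, %al
	int $VSYSCALL_EMU_VECTOR	/* vector 0xcc */
	ret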

This patch is also nice because it removes a bunch of duplicated code
from vsyscall_64.c.

Patch 9/10 randomizes the int 0xcc incantation at bootup. It is pretty
much worthless for security (there are only three choices for the random
number and it's easy to figure out which one is in use) but it prevents
overly clever userspace programs from thinking that the incantation is
ABI. One instrumentation tool author offered to hard-code special
handling for int 0xcc; I want to discourage this approach.

Patch 10/10 adds CONFIG_UNSAFE_VSYSCALLS to
feature-removal-schedule.txt. Removing it won't break anything but it
will slow some older code down.


Changes from v3:
- Rebased onto tip/master (1a0c84d)

Changes from v2:
- Reordered the patches.
- Removed the option to leave gettimeofday and getcpu as native code.
- Clean up the int 0xcc handler and registration.
- sched_getcpu works now (thanks, Borislav, for catching my blatant arithmetic error).
- Numerous small fixes from review comments.
- Abandon my plan to spread turbostat to the masses.

Changes from v1:
- Patches 6-10 are new.
- The int 0xcc code is much prettier and has lots of bugs fixed.
- I've decided to let everyone compile turbostat on their own :)

Andy Lutomirski (10):
x86-64: Fix alignment of jiffies variable
x86-64: Document some of entry_64.S
x86-64: Give vvars their own page
x86-64: Remove kernel.vsyscall64 sysctl
x86-64: Map the HPET NX
x86-64: Remove vsyscall number 3 (venosys)
x86-64: Fill unused parts of the vsyscall page with 0xcc
x86-64: Emulate legacy vsyscalls
x86-64: Randomize int 0xcc magic al values at boot
x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule

Documentation/feature-removal-schedule.txt | 9 +
Documentation/x86/entry_64.txt | 98 +++++++++
arch/x86/Kconfig | 17 ++
arch/x86/include/asm/fixmap.h | 1 +
arch/x86/include/asm/irq_vectors.h | 6 +-
arch/x86/include/asm/pgtable_types.h | 6 +-
arch/x86/include/asm/traps.h | 4 +
arch/x86/include/asm/vgtod.h | 1 -
arch/x86/include/asm/vsyscall.h | 6 +
arch/x86/include/asm/vvar.h | 24 +-
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/entry_64.S | 4 +
arch/x86/kernel/hpet.c | 2 +-
arch/x86/kernel/traps.c | 4 +
arch/x86/kernel/vmlinux.lds.S | 47 ++--
arch/x86/kernel/vsyscall_64.c | 327 +++++++++++++++++-----------
arch/x86/kernel/vsyscall_emu_64.S | 42 ++++
arch/x86/vdso/vclock_gettime.c | 55 ++---
18 files changed, 448 insertions(+), 206 deletions(-)
create mode 100644 Documentation/x86/entry_64.txt
create mode 100644 arch/x86/kernel/vsyscall_emu_64.S

--
1.7.5.1


2011-05-31 14:15:00

by Andrew Lutomirski

Subject: [PATCH v4 01/10] x86-64: Fix alignment of jiffies variable

It's declared __attribute__((aligned(16))) but it's explicitly not
aligned. This is probably harmless but it's a bit embarrassing.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/vvar.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
index 341b355..a4eaca4 100644
--- a/arch/x86/include/asm/vvar.h
+++ b/arch/x86/include/asm/vvar.h
@@ -45,7 +45,7 @@
/* DECLARE_VVAR(offset, type, name) */

DECLARE_VVAR(0, volatile unsigned long, jiffies)
-DECLARE_VVAR(8, int, vgetcpu_mode)
+DECLARE_VVAR(16, int, vgetcpu_mode)
DECLARE_VVAR(128, struct vsyscall_gtod_data, vsyscall_gtod_data)

#undef DECLARE_VVAR
--
1.7.5.1

2011-05-31 14:17:41

by Andrew Lutomirski

Subject: [PATCH v4 02/10] x86-64: Document some of entry_64.S

Signed-off-by: Andy Lutomirski <[email protected]>
---
Documentation/x86/entry_64.txt | 98 ++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/entry_64.S | 2 +
2 files changed, 100 insertions(+), 0 deletions(-)
create mode 100644 Documentation/x86/entry_64.txt

diff --git a/Documentation/x86/entry_64.txt b/Documentation/x86/entry_64.txt
new file mode 100644
index 0000000..7869f14
--- /dev/null
+++ b/Documentation/x86/entry_64.txt
@@ -0,0 +1,98 @@
+This file documents some of the kernel entries in
+arch/x86/kernel/entry_64.S. A lot of this explanation is adapted from
+an email from Ingo Molnar:
+
+http://lkml.kernel.org/r/<20110529191055.GC9835%40elte.hu>
+
+The x86 architecture has quite a few different ways to jump into
+kernel code. Most of these entry points are registered in
+arch/x86/kernel/traps.c and implemented in arch/x86/kernel/entry_64.S
+and arch/x86/ia32/ia32entry.S.
+
+The IDT vector assignments are listed in arch/x86/include/asm/irq_vectors.h.
+
+Some of these entries are:
+
+ - system_call: syscall instruction from 64-bit code.
+
+ - ia32_syscall: int 0x80 from 32-bit or 64-bit code; compat syscall
+ either way.
+
+ - ia32_syscall, ia32_sysenter: syscall and sysenter from 32-bit
+ code
+
+ - interrupt: An array of entries. Every IDT vector that doesn't
+ explicitly point somewhere else gets set to the corresponding
+ value in interrupts. These point to a whole array of
+ magically-generated functions that make their way to do_IRQ with
+ the interrupt number as a parameter.
+
+ - emulate_vsyscall: int 0xcc, a special non-ABI entry used by
+ vsyscall emulation.
+
+ - APIC interrupts: Various special-purpose interrupts for things
+ like TLB shootdown.
+
+ - Architecturally-defined exceptions like divide_error.
+
+There are a few complexities here. The different x86-64 entries
+have different calling conventions. The syscall and sysenter
+instructions have their own peculiar calling conventions. Some of
+the IDT entries push an error code onto the stack; others don't.
+IDT entries using the IST alternative stack mechanism need their own
+magic to get the stack frames right. (You can find some
+documentation in the AMD APM, Volume 2, Chapter 8 and the Intel SDM,
+Volume 3, Chapter 6.)
+
+Dealing with the swapgs instruction is especially tricky. Swapgs
+toggles whether gs is the kernel gs or the user gs. The swapgs
+instruction is rather fragile: it must nest perfectly and only in
+single depth, it should only be used if entering from user mode to
+kernel mode and then when returning to user-space, and precisely
+so. If we mess that up even slightly, we crash.
+
+So when we have a secondary entry, already in kernel mode, we *must
+not* use SWAPGS blindly - nor must we forget doing a SWAPGS when it's
+not switched/swapped yet.
+
+Now, there's a secondary complication: there's a cheap way to test
+which mode the CPU is in and an expensive way.
+
+The cheap way is to pick this info off the entry frame on the kernel
+stack, from the CS of the ptregs area of the kernel stack:
+
+ xorl %ebx,%ebx
+ testl $3,CS+8(%rsp)
+ je error_kernelspace
+ SWAPGS
+
+The expensive (paranoid) way is to read back the MSR_GS_BASE value
+(which is what SWAPGS modifies):
+
+ movl $1,%ebx
+ movl $MSR_GS_BASE,%ecx
+ rdmsr
+ testl %edx,%edx
+ js 1f /* negative -> in kernel */
+ SWAPGS
+ xorl %ebx,%ebx
+1: ret
+
+and the whole paranoid non-paranoid macro complexity is about whether
+to suffer that RDMSR cost.
+
+If we are at an interrupt or user-trap/gate-alike boundary then we can
+use the faster check: the stack will be a reliable indicator of
+whether SWAPGS was already done: if we see that we are a secondary
+entry interrupting kernel mode execution, then we know that the GS
+base has already been switched. If it says that we interrupted
+user-space execution then we must do the SWAPGS.
+
+But if we are in an NMI/MCE/DEBUG/whatever super-atomic entry context,
+which might have triggered right after a normal entry wrote CS to the
+stack but before we executed SWAPGS, then the only safe way to check
+for GS is the slower method: the RDMSR.
+
+So we try only to mark those entry methods 'paranoid' that absolutely
+need the more expensive check for the GS base - and we generate all
+'normal' entry points with the regular (faster) entry macros.
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 8a445a0..72c4a77 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -9,6 +9,8 @@
/*
* entry.S contains the system-call and fault low-level handling routines.
*
+ * Some of this is documented in Documentation/x86/entry_64.txt
+ *
* NOTE: This code handles signal-recognition, which happens every time
* after an interrupt and after each system call.
*
--
1.7.5.1

2011-05-31 14:15:08

by Andrew Lutomirski

Subject: [PATCH v4 03/10] x86-64: Give vvars their own page

Move vvars out of the vsyscall page into their own page and mark it
NX.

Without this patch, an attacker who can force a daemon to call some
fixed address could wait until the time contains, say, 0xCD80, and
then execute the current time.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/fixmap.h | 1 +
arch/x86/include/asm/pgtable_types.h | 2 ++
arch/x86/include/asm/vvar.h | 22 ++++++++++------------
arch/x86/kernel/vmlinux.lds.S | 27 ++++++++++++++++-----------
arch/x86/kernel/vsyscall_64.c | 5 +++++
5 files changed, 34 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 4729b2b..460c74e 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -78,6 +78,7 @@ enum fixed_addresses {
VSYSCALL_LAST_PAGE,
VSYSCALL_FIRST_PAGE = VSYSCALL_LAST_PAGE
+ ((VSYSCALL_END-VSYSCALL_START) >> PAGE_SHIFT) - 1,
+ VVAR_PAGE,
VSYSCALL_HPET,
#endif
FIX_DBGP_BASE,
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index d56187c..6a29aed6 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -108,6 +108,7 @@
#define __PAGE_KERNEL_UC_MINUS (__PAGE_KERNEL | _PAGE_PCD)
#define __PAGE_KERNEL_VSYSCALL (__PAGE_KERNEL_RX | _PAGE_USER)
#define __PAGE_KERNEL_VSYSCALL_NOCACHE (__PAGE_KERNEL_VSYSCALL | _PAGE_PCD | _PAGE_PWT)
+#define __PAGE_KERNEL_VVAR (__PAGE_KERNEL_RO | _PAGE_USER)
#define __PAGE_KERNEL_LARGE (__PAGE_KERNEL | _PAGE_PSE)
#define __PAGE_KERNEL_LARGE_NOCACHE (__PAGE_KERNEL | _PAGE_CACHE_UC | _PAGE_PSE)
#define __PAGE_KERNEL_LARGE_EXEC (__PAGE_KERNEL_EXEC | _PAGE_PSE)
@@ -130,6 +131,7 @@
#define PAGE_KERNEL_LARGE_EXEC __pgprot(__PAGE_KERNEL_LARGE_EXEC)
#define PAGE_KERNEL_VSYSCALL __pgprot(__PAGE_KERNEL_VSYSCALL)
#define PAGE_KERNEL_VSYSCALL_NOCACHE __pgprot(__PAGE_KERNEL_VSYSCALL_NOCACHE)
+#define PAGE_KERNEL_VVAR __pgprot(__PAGE_KERNEL_VVAR)

#define PAGE_KERNEL_IO __pgprot(__PAGE_KERNEL_IO)
#define PAGE_KERNEL_IO_NOCACHE __pgprot(__PAGE_KERNEL_IO_NOCACHE)
diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
index a4eaca4..de656ac 100644
--- a/arch/x86/include/asm/vvar.h
+++ b/arch/x86/include/asm/vvar.h
@@ -10,15 +10,14 @@
* In normal kernel code, they are used like any other variable.
* In user code, they are accessed through the VVAR macro.
*
- * Each of these variables lives in the vsyscall page, and each
- * one needs a unique offset within the little piece of the page
- * reserved for vvars. Specify that offset in DECLARE_VVAR.
- * (There are 896 bytes available. If you mess up, the linker will
- * catch it.)
+ * These variables live in a page of kernel data that has an extra RO
+ * mapping for userspace. Each variable needs a unique offset within
+ * that page; specify that offset with the DECLARE_VVAR macro. (If
+ * you mess up, the linker will catch it.)
*/

-/* Offset of vars within vsyscall page */
-#define VSYSCALL_VARS_OFFSET (3072 + 128)
+/* Base address of vvars. This is not ABI. */
+#define VVAR_ADDRESS (-10*1024*1024 - 4096)

#if defined(__VVAR_KERNEL_LDS)

@@ -26,17 +25,17 @@
* right place.
*/
#define DECLARE_VVAR(offset, type, name) \
- EMIT_VVAR(name, VSYSCALL_VARS_OFFSET + offset)
+ EMIT_VVAR(name, offset)

#else

#define DECLARE_VVAR(offset, type, name) \
static type const * const vvaraddr_ ## name = \
- (void *)(VSYSCALL_START + VSYSCALL_VARS_OFFSET + (offset));
+ (void *)(VVAR_ADDRESS + (offset));

#define DEFINE_VVAR(type, name) \
- type __vvar_ ## name \
- __attribute__((section(".vsyscall_var_" #name), aligned(16)))
+ type name \
+ __attribute__((section(".vvar_" #name), aligned(16)))

#define VVAR(name) (*vvaraddr_ ## name)

@@ -49,4 +48,3 @@ DECLARE_VVAR(16, int, vgetcpu_mode)
DECLARE_VVAR(128, struct vsyscall_gtod_data, vsyscall_gtod_data)

#undef DECLARE_VVAR
-#undef VSYSCALL_VARS_OFFSET
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 89aed99..3d89a00 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -161,12 +161,6 @@ SECTIONS

#define VVIRT_OFFSET (VSYSCALL_ADDR - __vsyscall_0)
#define VVIRT(x) (ADDR(x) - VVIRT_OFFSET)
-#define EMIT_VVAR(x, offset) .vsyscall_var_ ## x \
- ADDR(.vsyscall_0) + offset \
- : AT(VLOAD(.vsyscall_var_ ## x)) { \
- *(.vsyscall_var_ ## x) \
- } \
- x = VVIRT(.vsyscall_var_ ## x);

. = ALIGN(4096);
__vsyscall_0 = .;
@@ -192,17 +186,28 @@ SECTIONS
*(.vsyscall_3)
}

-#define __VVAR_KERNEL_LDS
-#include <asm/vvar.h>
-#undef __VVAR_KERNEL_LDS
-
- . = __vsyscall_0 + PAGE_SIZE;
+ . = ALIGN(__vsyscall_0 + PAGE_SIZE, PAGE_SIZE);

#undef VSYSCALL_ADDR
#undef VLOAD_OFFSET
#undef VLOAD
#undef VVIRT_OFFSET
#undef VVIRT
+
+ __vvar_page = .;
+
+#define EMIT_VVAR(name, offset) .vvar_ ## name \
+ (__vvar_page + offset) : \
+ AT(ADDR(.vvar_ ## name) - LOAD_OFFSET) { \
+ *(.vvar_ ## x) \
+ } :data
+
+#define __VVAR_KERNEL_LDS
+#include <asm/vvar.h>
+#undef __VVAR_KERNEL_LDS
+
+ . = ALIGN(__vvar_page + PAGE_SIZE, PAGE_SIZE);
+
#undef EMIT_VVAR

#endif /* CONFIG_X86_64 */
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 3e68218..3cf1cef 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -284,9 +284,14 @@ void __init map_vsyscall(void)
{
extern char __vsyscall_0;
unsigned long physaddr_page0 = __pa_symbol(&__vsyscall_0);
+ extern char __vvar_page;
+ unsigned long physaddr_vvar_page = __pa_symbol(&__vvar_page);

/* Note that VSYSCALL_MAPPED_PAGES must agree with the code below. */
__set_fixmap(VSYSCALL_FIRST_PAGE, physaddr_page0, PAGE_KERNEL_VSYSCALL);
+ __set_fixmap(VVAR_PAGE, physaddr_vvar_page, PAGE_KERNEL_VVAR);
+ BUILD_BUG_ON((unsigned long)__fix_to_virt(VVAR_PAGE) !=
+ (unsigned long)VVAR_ADDRESS);
}

static int __init vsyscall_init(void)
--
1.7.5.1

2011-05-31 14:15:12

by Andrew Lutomirski

Subject: [PATCH v4 04/10] x86-64: Remove kernel.vsyscall64 sysctl

It's unnecessary overhead in code that's supposed to be highly
optimized. Removing it allows us to remove one of the two syscall
instructions in the vsyscall page.

The only sensible use for it is for UML users, and it doesn't fully
address inconsistent vsyscall results on UML. The real fix for UML
is to stop using vsyscalls entirely.
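
For reference, the vDSO keeps its own syscall fallback
(vdso_fallback_gettime, called from __vdso_clock_gettime below), so only
the vsyscall-page copy goes away. A sketch of its shape, which may
differ in detail from the real vclock_gettime.c:

	notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
	{
		long ret;
		asm("syscall" : "=a" (ret) :
		    "0" (__NR_clock_gettime), "D" (clock), "S" (ts) : "memory");
		return ret;
	}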

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/vgtod.h | 1 -
arch/x86/kernel/vsyscall_64.c | 34 +------------------------
arch/x86/vdso/vclock_gettime.c | 55 +++++++++++++++------------------------
3 files changed, 22 insertions(+), 68 deletions(-)

diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index 646b4c1..aa5add8 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -11,7 +11,6 @@ struct vsyscall_gtod_data {
time_t wall_time_sec;
u32 wall_time_nsec;

- int sysctl_enabled;
struct timezone sys_tz;
struct { /* extract of a clocksource struct */
cycle_t (*vread)(void);
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 3cf1cef..9b2f3f5 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -53,7 +53,6 @@ DEFINE_VVAR(int, vgetcpu_mode);
DEFINE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data) =
{
.lock = __SEQLOCK_UNLOCKED(__vsyscall_gtod_data.lock),
- .sysctl_enabled = 1,
};

void update_vsyscall_tz(void)
@@ -103,15 +102,6 @@ static __always_inline int gettimeofday(struct timeval *tv, struct timezone *tz)
return ret;
}

-static __always_inline long time_syscall(long *t)
-{
- long secs;
- asm volatile("syscall"
- : "=a" (secs)
- : "0" (__NR_time),"D" (t) : __syscall_clobber);
- return secs;
-}
-
static __always_inline void do_vgettimeofday(struct timeval * tv)
{
cycle_t now, base, mask, cycle_delta;
@@ -122,8 +112,7 @@ static __always_inline void do_vgettimeofday(struct timeval * tv)
seq = read_seqbegin(&VVAR(vsyscall_gtod_data).lock);

vread = VVAR(vsyscall_gtod_data).clock.vread;
- if (unlikely(!VVAR(vsyscall_gtod_data).sysctl_enabled ||
- !vread)) {
+ if (unlikely(!vread)) {
gettimeofday(tv,NULL);
return;
}
@@ -165,8 +154,6 @@ time_t __vsyscall(1) vtime(time_t *t)
{
unsigned seq;
time_t result;
- if (unlikely(!VVAR(vsyscall_gtod_data).sysctl_enabled))
- return time_syscall(t);

do {
seq = read_seqbegin(&VVAR(vsyscall_gtod_data).lock);
@@ -227,22 +214,6 @@ static long __vsyscall(3) venosys_1(void)
return -ENOSYS;
}

-#ifdef CONFIG_SYSCTL
-static ctl_table kernel_table2[] = {
- { .procname = "vsyscall64",
- .data = &vsyscall_gtod_data.sysctl_enabled, .maxlen = sizeof(int),
- .mode = 0644,
- .proc_handler = proc_dointvec },
- {}
-};
-
-static ctl_table kernel_root_table2[] = {
- { .procname = "kernel", .mode = 0555,
- .child = kernel_table2 },
- {}
-};
-#endif
-
/* Assume __initcall executes before all user space. Hopefully kmod
doesn't violate that. We'll find out if it does. */
static void __cpuinit vsyscall_set_cpu(int cpu)
@@ -301,9 +272,6 @@ static int __init vsyscall_init(void)
BUG_ON((unsigned long) &vtime != VSYSCALL_ADDR(__NR_vtime));
BUG_ON((VSYSCALL_ADDR(0) != __fix_to_virt(VSYSCALL_FIRST_PAGE)));
BUG_ON((unsigned long) &vgetcpu != VSYSCALL_ADDR(__NR_vgetcpu));
-#ifdef CONFIG_SYSCTL
- register_sysctl_table(kernel_root_table2);
-#endif
on_each_cpu(cpu_vsyscall_init, NULL, 1);
/* notifier priority > KVM */
hotcpu_notifier(cpu_vsyscall_notifier, 30);
diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index a724905..cf54813 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -116,21 +116,21 @@ notrace static noinline int do_monotonic_coarse(struct timespec *ts)

notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
{
- if (likely(gtod->sysctl_enabled))
- switch (clock) {
- case CLOCK_REALTIME:
- if (likely(gtod->clock.vread))
- return do_realtime(ts);
- break;
- case CLOCK_MONOTONIC:
- if (likely(gtod->clock.vread))
- return do_monotonic(ts);
- break;
- case CLOCK_REALTIME_COARSE:
- return do_realtime_coarse(ts);
- case CLOCK_MONOTONIC_COARSE:
- return do_monotonic_coarse(ts);
- }
+ switch (clock) {
+ case CLOCK_REALTIME:
+ if (likely(gtod->clock.vread))
+ return do_realtime(ts);
+ break;
+ case CLOCK_MONOTONIC:
+ if (likely(gtod->clock.vread))
+ return do_monotonic(ts);
+ break;
+ case CLOCK_REALTIME_COARSE:
+ return do_realtime_coarse(ts);
+ case CLOCK_MONOTONIC_COARSE:
+ return do_monotonic_coarse(ts);
+ }
+
return vdso_fallback_gettime(clock, ts);
}
int clock_gettime(clockid_t, struct timespec *)
@@ -139,7 +139,7 @@ int clock_gettime(clockid_t, struct timespec *)
notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
{
long ret;
- if (likely(gtod->sysctl_enabled && gtod->clock.vread)) {
+ if (likely(gtod->clock.vread)) {
if (likely(tv != NULL)) {
BUILD_BUG_ON(offsetof(struct timeval, tv_usec) !=
offsetof(struct timespec, tv_nsec) ||
@@ -161,27 +161,14 @@ notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
int gettimeofday(struct timeval *, struct timezone *)
__attribute__((weak, alias("__vdso_gettimeofday")));

-/* This will break when the xtime seconds get inaccurate, but that is
- * unlikely */
-
-static __always_inline long time_syscall(long *t)
-{
- long secs;
- asm volatile("syscall"
- : "=a" (secs)
- : "0" (__NR_time), "D" (t) : "cc", "r11", "cx", "memory");
- return secs;
-}
-
+/*
+ * This will break when the xtime seconds get inaccurate, but that is
+ * unlikely
+ */
notrace time_t __vdso_time(time_t *t)
{
- time_t result;
-
- if (unlikely(!VVAR(vsyscall_gtod_data).sysctl_enabled))
- return time_syscall(t);
-
/* This is atomic on x86_64 so we don't need any locks. */
- result = ACCESS_ONCE(VVAR(vsyscall_gtod_data).wall_time_sec);
+ time_t result = ACCESS_ONCE(VVAR(vsyscall_gtod_data).wall_time_sec);

if (t)
*t = result;
--
1.7.5.1

2011-05-31 14:16:55

by Andrew Lutomirski

[permalink] [raw]
Subject: [PATCH v4 05/10] x86-64: Map the HPET NX

Currently the HPET mapping is a user-accessible syscall instruction
at a fixed address some of the time. A sufficiently determined
hacker might be able to guess when.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 4 ++--
arch/x86/kernel/hpet.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 6a29aed6..013286a 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -107,8 +107,8 @@
#define __PAGE_KERNEL_NOCACHE (__PAGE_KERNEL | _PAGE_PCD | _PAGE_PWT)
#define __PAGE_KERNEL_UC_MINUS (__PAGE_KERNEL | _PAGE_PCD)
#define __PAGE_KERNEL_VSYSCALL (__PAGE_KERNEL_RX | _PAGE_USER)
-#define __PAGE_KERNEL_VSYSCALL_NOCACHE (__PAGE_KERNEL_VSYSCALL | _PAGE_PCD | _PAGE_PWT)
#define __PAGE_KERNEL_VVAR (__PAGE_KERNEL_RO | _PAGE_USER)
+#define __PAGE_KERNEL_VVAR_NOCACHE (__PAGE_KERNEL_VVAR | _PAGE_PCD | _PAGE_PWT)
#define __PAGE_KERNEL_LARGE (__PAGE_KERNEL | _PAGE_PSE)
#define __PAGE_KERNEL_LARGE_NOCACHE (__PAGE_KERNEL | _PAGE_CACHE_UC | _PAGE_PSE)
#define __PAGE_KERNEL_LARGE_EXEC (__PAGE_KERNEL_EXEC | _PAGE_PSE)
@@ -130,8 +130,8 @@
#define PAGE_KERNEL_LARGE_NOCACHE __pgprot(__PAGE_KERNEL_LARGE_NOCACHE)
#define PAGE_KERNEL_LARGE_EXEC __pgprot(__PAGE_KERNEL_LARGE_EXEC)
#define PAGE_KERNEL_VSYSCALL __pgprot(__PAGE_KERNEL_VSYSCALL)
-#define PAGE_KERNEL_VSYSCALL_NOCACHE __pgprot(__PAGE_KERNEL_VSYSCALL_NOCACHE)
#define PAGE_KERNEL_VVAR __pgprot(__PAGE_KERNEL_VVAR)
+#define PAGE_KERNEL_VVAR_NOCACHE __pgprot(__PAGE_KERNEL_VVAR_NOCACHE)

#define PAGE_KERNEL_IO __pgprot(__PAGE_KERNEL_IO)
#define PAGE_KERNEL_IO_NOCACHE __pgprot(__PAGE_KERNEL_IO_NOCACHE)
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 6781765..e9f5605 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -71,7 +71,7 @@ static inline void hpet_set_mapping(void)
{
hpet_virt_address = ioremap_nocache(hpet_address, HPET_MMAP_SIZE);
#ifdef CONFIG_X86_64
- __set_fixmap(VSYSCALL_HPET, hpet_address, PAGE_KERNEL_VSYSCALL_NOCACHE);
+ __set_fixmap(VSYSCALL_HPET, hpet_address, PAGE_KERNEL_VVAR_NOCACHE);
#endif
}

--
1.7.5.1

2011-05-31 14:15:23

by Andrew Lutomirski

Subject: [PATCH v4 06/10] x86-64: Remove vsyscall number 3 (venosys)

It has just segfaulted since April 2008 (a4928cff), so I'm pretty sure
that nothing uses it. And having an empty section makes the linker
script a bit fragile.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/kernel/vmlinux.lds.S | 4 ----
arch/x86/kernel/vsyscall_64.c | 3 ---
2 files changed, 0 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 3d89a00..dc8ac70 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -182,10 +182,6 @@ SECTIONS
*(.vsyscall_2)
}

- .vsyscall_3 ADDR(.vsyscall_0) + 3072: AT(VLOAD(.vsyscall_3)) {
- *(.vsyscall_3)
- }
-
. = ALIGN(__vsyscall_0 + PAGE_SIZE, PAGE_SIZE);

#undef VSYSCALL_ADDR
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 9b2f3f5..c7fe325 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -209,10 +209,6 @@ vgetcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
return 0;
}

-static long __vsyscall(3) venosys_1(void)
-{
- return -ENOSYS;
-}

/* Assume __initcall executes before all user space. Hopefully kmod
--
1.7.5.1

2011-05-31 14:15:21

by Andrew Lutomirski

Subject: [PATCH v4 07/10] x86-64: Fill unused parts of the vsyscall page with 0xcc

Jumping to 0x00 might do something depending on the following bytes.
Jumping to 0xcc is a trap. So fill the unused parts of the vsyscall
page with 0xcc to make it useless for exploits to jump there.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/kernel/vmlinux.lds.S | 16 +++++++---------
1 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index dc8ac70..c3b37d6 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -166,22 +166,20 @@ SECTIONS
__vsyscall_0 = .;

. = VSYSCALL_ADDR;
- .vsyscall_0 : AT(VLOAD(.vsyscall_0)) {
+ .vsyscall : AT(VLOAD(.vsyscall)) {
*(.vsyscall_0)
- } :user

- . = ALIGN(L1_CACHE_BYTES);
- .vsyscall_fn : AT(VLOAD(.vsyscall_fn)) {
+ . = ALIGN(L1_CACHE_BYTES);
*(.vsyscall_fn)
- }

- .vsyscall_1 ADDR(.vsyscall_0) + 1024: AT(VLOAD(.vsyscall_1)) {
+ . = 1024;
*(.vsyscall_1)
- }
- .vsyscall_2 ADDR(.vsyscall_0) + 2048: AT(VLOAD(.vsyscall_2)) {
+
+ . = 2048;
*(.vsyscall_2)
- }

+ . = 4096; /* Pad the whole page. */
+ } :user =0xcc
. = ALIGN(__vsyscall_0 + PAGE_SIZE, PAGE_SIZE);

#undef VSYSCALL_ADDR
--
1.7.5.1

2011-05-31 14:16:37

by Andrew Lutomirski

Subject: [PATCH v4 08/10] x86-64: Emulate legacy vsyscalls

There's a fair amount of code in the vsyscall page. It contains a
syscall instruction (in the gettimeofday fallback) and who knows
what will happen if an exploit jumps into the middle of some other
code.

Reduce the risk by replacing the vsyscalls with short magic
incantations that cause the kernel to emulate the real vsyscalls.
These incantations are useless if entered in the middle.

This causes vsyscalls to be a little more expensive than real
syscalls. Fortunately sensible programs don't use them.

Less fortunately, current glibc uses the vsyscall for time() even in
dynamic binaries. So there's a CONFIG_UNSAFE_VSYSCALLS (default y)
option that leaves in the native code for time(). That should go
away in a while when glibc gets fixed.

Some care is taken to make sure that tools like valgrind and
ThreadSpotter still work.

This patch is not perfect: the vread_tsc and vread_hpet functions
are still at a fixed address. Fixing that might involve making
alternative patching work in the vDSO.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/Kconfig | 17 +++
arch/x86/include/asm/irq_vectors.h | 6 +-
arch/x86/include/asm/traps.h | 4 +
arch/x86/include/asm/vsyscall.h | 6 +
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/entry_64.S | 2 +
arch/x86/kernel/traps.c | 4 +
arch/x86/kernel/vsyscall_64.c | 253 +++++++++++++++++++++---------------
arch/x86/kernel/vsyscall_emu_64.S | 42 ++++++
9 files changed, 231 insertions(+), 104 deletions(-)
create mode 100644 arch/x86/kernel/vsyscall_emu_64.S

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index da34972..79e5d8a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1646,6 +1646,23 @@ config COMPAT_VDSO

If unsure, say Y.

+config UNSAFE_VSYSCALLS
+ def_bool y
+ prompt "Unsafe fast legacy vsyscalls"
+ depends on X86_64
+ ---help---
+ Legacy user code expects to be able to issue three syscalls
+ by calling fixed addresses in kernel space. If you say N,
+ then the kernel traps and emulates these calls. If you say
+ Y, then there is actual executable code at a fixed address
+ to implement time() efficiently.
+
+ On a system with recent enough glibc (probably 2.14 or
+ newer) and no static binaries, you can say N without a
+ performance penalty to improve security.
+
+ If unsure, say Y.
+
config CMDLINE_BOOL
bool "Built-in kernel command line"
---help---
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 6e976ee..a563c50 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -17,7 +17,8 @@
* Vectors 0 ... 31 : system traps and exceptions - hardcoded events
* Vectors 32 ... 127 : device interrupts
* Vector 128 : legacy int80 syscall interface
- * Vectors 129 ... INVALIDATE_TLB_VECTOR_START-1 : device interrupts
+ * Vector 204 : legacy x86_64 vsyscall emulation
+ * Vectors 129 ... INVALIDATE_TLB_VECTOR_START-1 except 204 : device interrupts
* Vectors INVALIDATE_TLB_VECTOR_START ... 255 : special interrupts
*
* 64-bit x86 has per CPU IDT tables, 32-bit has one shared IDT table.
@@ -50,6 +51,9 @@
#ifdef CONFIG_X86_32
# define SYSCALL_VECTOR 0x80
#endif
+#ifdef CONFIG_X86_64
+# define VSYSCALL_EMU_VECTOR 0xcc
+#endif

/*
* Vectors 0x30-0x3f are used for ISA interrupts.
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 0310da6..2bae0a5 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -1,6 +1,8 @@
#ifndef _ASM_X86_TRAPS_H
#define _ASM_X86_TRAPS_H

+#include <linux/kprobes.h>
+
#include <asm/debugreg.h>
#include <asm/siginfo.h> /* TRAP_TRACE, ... */

@@ -38,6 +40,7 @@ asmlinkage void alignment_check(void);
asmlinkage void machine_check(void);
#endif /* CONFIG_X86_MCE */
asmlinkage void simd_coprocessor_error(void);
+asmlinkage void emulate_vsyscall(void);

dotraplinkage void do_divide_error(struct pt_regs *, long);
dotraplinkage void do_debug(struct pt_regs *, long);
@@ -64,6 +67,7 @@ dotraplinkage void do_alignment_check(struct pt_regs *, long);
dotraplinkage void do_machine_check(struct pt_regs *, long);
#endif
dotraplinkage void do_simd_coprocessor_error(struct pt_regs *, long);
+dotraplinkage void do_emulate_vsyscall(struct pt_regs *, long);
#ifdef CONFIG_X86_32
dotraplinkage void do_iret_error(struct pt_regs *, long);
#endif
diff --git a/arch/x86/include/asm/vsyscall.h b/arch/x86/include/asm/vsyscall.h
index d555973..293ae08 100644
--- a/arch/x86/include/asm/vsyscall.h
+++ b/arch/x86/include/asm/vsyscall.h
@@ -31,6 +31,12 @@ extern struct timezone sys_tz;

extern void map_vsyscall(void);

+/* Emulation */
+static inline bool in_vsyscall_page(unsigned long addr)
+{
+ return (addr & ~(PAGE_SIZE - 1)) == VSYSCALL_START;
+}
+
#endif /* __KERNEL__ */

#endif /* _ASM_X86_VSYSCALL_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 90b06d4..cc0469a 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -44,6 +44,7 @@ obj-y += probe_roms.o
obj-$(CONFIG_X86_32) += sys_i386_32.o i386_ksyms_32.o
obj-$(CONFIG_X86_64) += sys_x86_64.o x8664_ksyms_64.o
obj-$(CONFIG_X86_64) += syscall_64.o vsyscall_64.o vread_tsc_64.o
+obj-$(CONFIG_X86_64) += vsyscall_emu_64.o
obj-y += bootflag.o e820.o
obj-y += pci-dma.o quirks.o topology.o kdebugfs.o
obj-y += alternative.o i8253.o pci-nommu.o hw_breakpoint.o
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 72c4a77..e949793 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1123,6 +1123,8 @@ zeroentry spurious_interrupt_bug do_spurious_interrupt_bug
zeroentry coprocessor_error do_coprocessor_error
errorentry alignment_check do_alignment_check
zeroentry simd_coprocessor_error do_simd_coprocessor_error
+zeroentry emulate_vsyscall do_emulate_vsyscall
+

/* Reload gs selector with exception handling */
/* edi: new selector */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index b9b6716..72f0f6a 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -872,6 +872,10 @@ void __init trap_init(void)
set_bit(SYSCALL_VECTOR, used_vectors);
#endif

+ BUG_ON(test_bit(VSYSCALL_EMU_VECTOR, used_vectors));
+ set_system_intr_gate(VSYSCALL_EMU_VECTOR, &emulate_vsyscall);
+ set_bit(VSYSCALL_EMU_VECTOR, used_vectors);
+
/*
* Should be a barrier for any external CPU state:
*/
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index c7fe325..52ba392 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -32,6 +32,8 @@
#include <linux/cpu.h>
#include <linux/smp.h>
#include <linux/notifier.h>
+#include <linux/syscalls.h>
+#include <linux/ratelimit.h>

#include <asm/vsyscall.h>
#include <asm/pgtable.h>
@@ -44,10 +46,7 @@
#include <asm/desc.h>
#include <asm/topology.h>
#include <asm/vgtod.h>
-
-#define __vsyscall(nr) \
- __attribute__ ((unused, __section__(".vsyscall_" #nr))) notrace
-#define __syscall_clobber "r11","cx","memory"
+#include <asm/traps.h>

DEFINE_VVAR(int, vgetcpu_mode);
DEFINE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data) =
@@ -84,73 +83,45 @@ void update_vsyscall(struct timespec *wall_time, struct timespec *wtm,
write_sequnlock_irqrestore(&vsyscall_gtod_data.lock, flags);
}

-/* RED-PEN may want to readd seq locking, but then the variable should be
- * write-once.
- */
-static __always_inline void do_get_tz(struct timezone * tz)
+static void warn_bad_vsyscall(const char *level, struct pt_regs *regs,
+ const char *message)
{
- *tz = VVAR(vsyscall_gtod_data).sys_tz;
+ struct task_struct *tsk;
+ static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+ if (!show_unhandled_signals || !__ratelimit(&rs))
+ return;
+
+ tsk = current;
+
+ printk("%s%s[%d] %s ip:%lx sp:%lx ax:%lx si:%lx di:%lx",
+ level, tsk->comm, task_pid_nr(tsk),
+ message,
+ regs->ip - 2, regs->sp, regs->ax, regs->si, regs->di);
+ if (!in_vsyscall_page(regs->ip - 2))
+ print_vma_addr(" in ", regs->ip - 2);
+ printk("\n");
}

-static __always_inline int gettimeofday(struct timeval *tv, struct timezone *tz)
-{
- int ret;
- asm volatile("syscall"
- : "=a" (ret)
- : "0" (__NR_gettimeofday),"D" (tv),"S" (tz)
- : __syscall_clobber );
- return ret;
-}
+/* al values for each vsyscall; see vsyscall_emu_64.S for why. */
+static u8 vsyscall_nr_to_al[] = {0xcc, 0xce, 0xf0};

-static __always_inline void do_vgettimeofday(struct timeval * tv)
+static int al_to_vsyscall_nr(u8 al)
{
- cycle_t now, base, mask, cycle_delta;
- unsigned seq;
- unsigned long mult, shift, nsec;
- cycle_t (*vread)(void);
- do {
- seq = read_seqbegin(&VVAR(vsyscall_gtod_data).lock);
-
- vread = VVAR(vsyscall_gtod_data).clock.vread;
- if (unlikely(!vread)) {
- gettimeofday(tv,NULL);
- return;
- }
-
- now = vread();
- base = VVAR(vsyscall_gtod_data).clock.cycle_last;
- mask = VVAR(vsyscall_gtod_data).clock.mask;
- mult = VVAR(vsyscall_gtod_data).clock.mult;
- shift = VVAR(vsyscall_gtod_data).clock.shift;
-
- tv->tv_sec = VVAR(vsyscall_gtod_data).wall_time_sec;
- nsec = VVAR(vsyscall_gtod_data).wall_time_nsec;
- } while (read_seqretry(&VVAR(vsyscall_gtod_data).lock, seq));
-
- /* calculate interval: */
- cycle_delta = (now - base) & mask;
- /* convert to nsecs: */
- nsec += (cycle_delta * mult) >> shift;
-
- while (nsec >= NSEC_PER_SEC) {
- tv->tv_sec += 1;
- nsec -= NSEC_PER_SEC;
- }
- tv->tv_usec = nsec / NSEC_PER_USEC;
+ int i;
+ for (i = 0; i < ARRAY_SIZE(vsyscall_nr_to_al); i++)
+ if (vsyscall_nr_to_al[i] == al)
+ return i;
+ return -1;
}

-int __vsyscall(0) vgettimeofday(struct timeval * tv, struct timezone * tz)
-{
- if (tv)
- do_vgettimeofday(tv);
- if (tz)
- do_get_tz(tz);
- return 0;
-}
+#ifdef CONFIG_UNSAFE_VSYSCALLS

/* This will break when the xtime seconds get inaccurate, but that is
* unlikely */
-time_t __vsyscall(1) vtime(time_t *t)
+time_t __attribute__ ((unused, __section__(".vsyscall_1"))) notrace
+vtime(time_t *t)
{
unsigned seq;
time_t result;
@@ -167,48 +138,127 @@ time_t __vsyscall(1) vtime(time_t *t)
return result;
}

-/* Fast way to get current CPU and node.
- This helps to do per node and per CPU caches in user space.
- The result is not guaranteed without CPU affinity, but usually
- works out because the scheduler tries to keep a thread on the same
- CPU.
+#endif /* CONFIG_UNSAFE_VSYSCALLS */
+
+/* If CONFIG_UNSAFE_VSYSCALLS=y, then this is incorrect for vsyscall_nr == 1. */
+static inline unsigned long vsyscall_intcc_addr(int vsyscall_nr)
+{
+ return VSYSCALL_START + 1024*vsyscall_nr + 2;
+}

- tcache must point to a two element sized long array.
- All arguments can be NULL. */
-long __vsyscall(2)
-vgetcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
+void dotraplinkage do_emulate_vsyscall(struct pt_regs *regs, long error_code)
{
- unsigned int p;
- unsigned long j = 0;
-
- /* Fast cache - only recompute value once per jiffies and avoid
- relatively costly rdtscp/cpuid otherwise.
- This works because the scheduler usually keeps the process
- on the same CPU and this syscall doesn't guarantee its
- results anyways.
- We do this here because otherwise user space would do it on
- its own in a likely inferior way (no access to jiffies).
- If you don't like it pass NULL. */
- if (tcache && tcache->blob[0] == (j = VVAR(jiffies))) {
- p = tcache->blob[1];
- } else if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) {
- /* Load per CPU data from RDTSCP */
- native_read_tscp(&p);
- } else {
- /* Load per CPU data from GDT */
- asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
+ static DEFINE_RATELIMIT_STATE(rs, 3600 * HZ, 3);
+ struct task_struct *tsk;
+ const char *vsyscall_name;
+ int vsyscall_nr;
+ long ret;
+
+ /* Kernel code must never get here. */
+ BUG_ON(!user_mode(regs));
+
+ local_irq_enable();
+
+ vsyscall_nr = al_to_vsyscall_nr(regs->ax & 0xff);
+ if (vsyscall_nr < 0) {
+ warn_bad_vsyscall(KERN_WARNING, regs, "illegal int 0xcc "
+ "(exploit attempt?)");
+ goto sigsegv;
}
- if (tcache) {
- tcache->blob[0] = j;
- tcache->blob[1] = p;
+
+ if (regs->ip - 2 != vsyscall_intcc_addr(vsyscall_nr)) {
+ if (in_vsyscall_page(regs->ip - 2)) {
+ /* This should not be possible. */
+ warn_bad_vsyscall(KERN_WARNING, regs,
+ "int 0xcc bogus magic "
+ "(exploit attempt?)");
+ goto sigsegv;
+ } else {
+ /*
+ * We allow the call because tools like ThreadSpotter
+ * might copy the int 0xcc instruction to user memory.
+ * We make it annoying, though, to try to persuade
+ * the authors to stop doing that...
+ */
+ warn_bad_vsyscall(KERN_WARNING, regs,
+ "int 0xcc in user code "
+ "(exploit attempt? legacy "
+ "instrumented code?)");
+ }
}
- if (cpu)
- *cpu = p & 0xfff;
- if (node)
- *node = p >> 12;
- return 0;
-}

+ tsk = current;
+ if (tsk->seccomp.mode) {
+ do_exit(SIGKILL);
+ goto out;
+ }
+
+ switch (vsyscall_nr) {
+ case 0:
+ vsyscall_name = "gettimeofday";
+ ret = sys_gettimeofday(
+ (struct timeval __user *)regs->di,
+ (struct timezone __user *)regs->si);
+ break;
+
+ case 1:
+#ifdef CONFIG_UNSAFE_VSYSCALLS
+ warn_bad_vsyscall(KERN_WARNING, regs, "bogus time() vsyscall "
+ "emulation (exploit attempt?)");
+ goto sigsegv;
+#else
+ vsyscall_name = "time";
+ ret = sys_time((time_t __user *)regs->di);
+ break;
+#endif
+
+ case 2:
+ vsyscall_name = "getcpu";
+ ret = sys_getcpu((unsigned __user *)regs->di,
+ (unsigned __user *)regs->si,
+ 0);
+ break;
+
+ default:
+ BUG();
+ }
+
+ if (ret == -EFAULT) {
+ /*
+ * Bad news -- userspace fed a bad pointer to a vsyscall.
+ *
+ * With a real vsyscall, that would have caused SIGSEGV.
+ * To make writing reliable exploits using the emulated
+ * vsyscalls harder, generate SIGSEGV here as well.
+ */
+ warn_bad_vsyscall(KERN_INFO, regs,
+ "vsyscall fault (exploit attempt?)");
+ goto sigsegv;
+ }
+
+ regs->ax = ret;
+
+ if (__ratelimit(&rs)) {
+ unsigned long caller;
+ if (get_user(caller, (unsigned long __user *)regs->sp))
+ caller = 0; /* no need to crash on this fault. */
+ printk(KERN_INFO "%s[%d] emulated legacy vsyscall %s(); "
+ "upgrade your code to avoid a performance hit. "
+ "ip:%lx sp:%lx caller:%lx",
+ tsk->comm, task_pid_nr(tsk), vsyscall_name,
+ regs->ip - 2, regs->sp, caller);
+ if (caller)
+ print_vma_addr(" in ", caller);
+ printk("\n");
+ }
+
+out:
+ local_irq_disable();
+ return;
+
+sigsegv:
+ regs->ip -= 2; /* The faulting instruction should be the int 0xcc. */
+ force_sig(SIGSEGV, current);
}

/* Assume __initcall executes before all user space. Hopefully kmod
@@ -264,11 +314,8 @@ void __init map_vsyscall(void)

static int __init vsyscall_init(void)
{
- BUG_ON(((unsigned long) &vgettimeofday !=
- VSYSCALL_ADDR(__NR_vgettimeofday)));
- BUG_ON((unsigned long) &vtime != VSYSCALL_ADDR(__NR_vtime));
- BUG_ON((VSYSCALL_ADDR(0) != __fix_to_virt(VSYSCALL_FIRST_PAGE)));
- BUG_ON((unsigned long) &vgetcpu != VSYSCALL_ADDR(__NR_vgetcpu));
+ BUG_ON(VSYSCALL_ADDR(0) != __fix_to_virt(VSYSCALL_FIRST_PAGE));
+
on_each_cpu(cpu_vsyscall_init, NULL, 1);
/* notifier priority > KVM */
hotcpu_notifier(cpu_vsyscall_notifier, 30);
diff --git a/arch/x86/kernel/vsyscall_emu_64.S b/arch/x86/kernel/vsyscall_emu_64.S
new file mode 100644
index 0000000..7ebde61
--- /dev/null
+++ b/arch/x86/kernel/vsyscall_emu_64.S
@@ -0,0 +1,42 @@
+/*
+ * vsyscall_emu_64.S: Vsyscall emulation page
+ * Copyright (c) 2011 Andy Lutomirski
+ * Subject to the GNU General Public License, version 2
+*/
+
+#include <linux/linkage.h>
+#include <asm/irq_vectors.h>
+
+/*
+ * These magic incantations are chosen so that they fault if entered anywhere
+ * other than an instruction boundary. The movb instruction is two bytes, and
+ * the int imm8 instruction is also two bytes, so the only misaligned places
+ * to enter are the immediate values for the two instructions. 0xcc is int3
+ * (always faults), 0xce is into (faults on x86-64, and 32-bit code can't get
+ * here), and 0xf0 is lock (lock int is invalid).
+ *
+ * The unused parts of the page are filled with 0xcc by the linker script.
+ */
+
+.section .vsyscall_0, "a"
+ENTRY(vsyscall_0)
+ movb $0xcc, %al
+ int $VSYSCALL_EMU_VECTOR
+ ret
+END(vsyscall_0)
+
+#ifndef CONFIG_UNSAFE_VSYSCALLS
+.section .vsyscall_1, "a"
+ENTRY(vsyscall_1)
+ movb $0xce, %al
+ int $VSYSCALL_EMU_VECTOR
+ ret
+END(vsyscall_1)
+#endif
+
+.section .vsyscall_2, "a"
+ENTRY(vsyscall_2)
+ movb $0xf0, %al
+ int $VSYSCALL_EMU_VECTOR
+ ret
+END(vsyscall_2)
--
1.7.5.1

2011-05-31 14:15:53

by Andrew Lutomirski

Subject: [PATCH v4 09/10] x86-64: Randomize int 0xcc magic al values at boot

This is not a security feature. It's to prevent the int 0xcc
sequence from becoming ABI.

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/kernel/vsyscall_64.c | 54 +++++++++++++++++++++++++++++++++++++++-
1 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 52ba392..277c47c 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -34,6 +34,8 @@
#include <linux/notifier.h>
#include <linux/syscalls.h>
#include <linux/ratelimit.h>
+#include <linux/random.h>
+#include <linux/highmem.h>

#include <asm/vsyscall.h>
#include <asm/pgtable.h>
@@ -54,6 +56,8 @@ DEFINE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data) =
.lock = __SEQLOCK_UNLOCKED(__vsyscall_gtod_data.lock),
};

+static u8 vsyscall_nr_offset; /* Cyclic permutation of al. */
+
void update_vsyscall_tz(void)
{
unsigned long flags;
@@ -95,10 +99,11 @@ static void warn_bad_vsyscall(const char *level, struct pt_regs *regs,

tsk = current;

- printk("%s%s[%d] %s ip:%lx sp:%lx ax:%lx si:%lx di:%lx",
+ printk("%s%s[%d] %s ip:%lx sp:%lx ax:%lx offset:%d si:%lx di:%lx",
level, tsk->comm, task_pid_nr(tsk),
message,
- regs->ip - 2, regs->sp, regs->ax, regs->si, regs->di);
+ regs->ip - 2, regs->sp, regs->ax, (int)vsyscall_nr_offset,
+ regs->si, regs->di);
if (!in_vsyscall_page(regs->ip - 2))
print_vma_addr(" in ", regs->ip - 2);
printk("\n");
@@ -116,6 +121,17 @@ static int al_to_vsyscall_nr(u8 al)
return -1;
}

+static inline u8 mangle_al(u8 al)
+{
+ int nr = al_to_vsyscall_nr(al);
+ return vsyscall_nr_to_al[(nr + vsyscall_nr_offset) % 3];
+}
+
+static inline int demangle_vsyscall_nr(int nr)
+{
+ return (nr + 3 - vsyscall_nr_offset) % 3;
+}
+
#ifdef CONFIG_UNSAFE_VSYSCALLS

/* This will break when the xtime seconds get inaccurate, but that is
@@ -165,6 +181,7 @@ void dotraplinkage do_emulate_vsyscall(struct pt_regs *regs, long error_code)
"(exploit attempt?)");
goto sigsegv;
}
+ vsyscall_nr = demangle_vsyscall_nr(vsyscall_nr);

if (regs->ip - 2 != vsyscall_intcc_addr(vsyscall_nr)) {
if (in_vsyscall_page(regs->ip - 2)) {
@@ -312,10 +329,43 @@ void __init map_vsyscall(void)
(unsigned long)VVAR_ADDRESS);
}

+static void __init mangle_vsyscall_movb(void *mapping,
+ unsigned long movb_addr, u8 initial)
+{
+ u8 *imm8;
+ BUG_ON(movb_addr >= 4095);
+
+ imm8 = (u8 *)(mapping) + movb_addr + 1;
+
+ BUG_ON(*imm8 != initial);
+ *imm8 = mangle_al(*imm8);
+}
+
static int __init vsyscall_init(void)
{
+ extern char __vsyscall_0;
+ void *mapping;
+ struct page *vsyscall_page;
+
BUG_ON(VSYSCALL_ADDR(0) != __fix_to_virt(VSYSCALL_FIRST_PAGE));

+ /*
+ * Randomize the magic al values for int 0xcc invocation. This
+ * isn't really a security feature; it's to make sure that
+ * dynamic binary instrumentation tools don't start to think
+ * that the int 0xcc magic incantation is ABI.
+ */
+ vsyscall_nr_offset = get_random_int() % 3;
+ vsyscall_page = pfn_to_page(__pa_symbol(&__vsyscall_0) >> PAGE_SHIFT);
+ mapping = kmap_atomic(vsyscall_page);
+ /* It's easier to hardcode the addresses -- they're ABI. */
+ mangle_vsyscall_movb(mapping, 0, 0xcc);
+#ifndef CONFIG_UNSAFE_VSYSCALLS
+ mangle_vsyscall_movb(mapping, 1024, 0xce);
+#endif
+ mangle_vsyscall_movb(mapping, 2048, 0xf0);
+ kunmap_atomic(mapping);
+
on_each_cpu(cpu_vsyscall_init, NULL, 1);
/* notifier priority > KVM */
hotcpu_notifier(cpu_vsyscall_notifier, 30);
--
1.7.5.1

2011-05-31 14:15:28

by Andrew Lutomirski

Subject: [PATCH v4 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule

Signed-off-by: Andy Lutomirski <[email protected]>
---
Documentation/feature-removal-schedule.txt | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 1a9446b..94b4470 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -600,3 +600,12 @@ Why: Superseded by the UVCIOC_CTRL_QUERY ioctl.
Who: Laurent Pinchart <[email protected]>

----------------------------
+
+What: CONFIG_UNSAFE_VSYSCALLS (x86_64)
+When: When glibc 2.14 or newer is ubiquitous. Perhaps mid-2012.
+Why: Having user-executable code at a fixed address is a security problem.
+ Turning off CONFIG_UNSAFE_VSYSCALLS mostly removes the risk but will
+ make the time() function slower on glibc versions 2.13 and below.
+Who: Andy Lutomirski <[email protected]>
+
+----------------------------
--
1.7.5.1

2011-05-31 15:36:19

by Ingo Molnar

Subject: Re: [PATCH v4 08/10] x86-64: Emulate legacy vsyscalls


* Andy Lutomirski <[email protected]> wrote:

> +#ifdef CONFIG_X86_64
> +# define VSYSCALL_EMU_VECTOR 0xcc
> +#endif

> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -872,6 +872,10 @@ void __init trap_init(void)
> set_bit(SYSCALL_VECTOR, used_vectors);
> #endif
>
> + BUG_ON(test_bit(VSYSCALL_EMU_VECTOR, used_vectors));
> + set_system_intr_gate(VSYSCALL_EMU_VECTOR, &emulate_vsyscall);
> + set_bit(VSYSCALL_EMU_VECTOR, used_vectors);

That is a generic section of traps.c so it won't build (and makes no
point) on 32-bit.

Ingo

2011-05-31 15:40:33

by Ingo Molnar

Subject: Re: [PATCH v4 09/10] x86-64: Randomize int 0xcc magic al values at boot


* Andy Lutomirski <[email protected]> wrote:

> This is not a security feature. It's to prevent the int 0xcc
> sequence from becoming ABI.

Hm, what kind of ABI reliance could be built on it?

> +static void __init mangle_vsyscall_movb(void *mapping,
> + unsigned long movb_addr, u8 initial)
> +{
> + u8 *imm8;
> + BUG_ON(movb_addr >= 4095);

Please put newlines after local variable definitions.

> static int __init vsyscall_init(void)
> {
> + extern char __vsyscall_0;

Please don't put extern definitions in the middle of a .c file - if
then it should be in a .h file. (even if only a single function uses
it)

> + /*
> + * Randomize the magic al values for int 0xcc invocation. This
> + * isn't really a security feature; it's to make sure that
> + * dynamic binary instrumentation tools don't start to think
> + * that the int 0xcc magic incantation is ABI.
> + */
> + vsyscall_nr_offset = get_random_int() % 3;
> + vsyscall_page = pfn_to_page(__pa_symbol(&__vsyscall_0) >> PAGE_SHIFT);
> + mapping = kmap_atomic(vsyscall_page);
> + /* It's easier to hardcode the addresses -- they're ABI. */
> + mangle_vsyscall_movb(mapping, 0, 0xcc);

what about filling it with zeroes?

> +#ifndef CONFIG_UNSAFE_VSYSCALLS
> + mangle_vsyscall_movb(mapping, 1024, 0xce);
> +#endif
> + mangle_vsyscall_movb(mapping, 2048, 0xf0);

Dunno, this all looks rather ugly.

Thanks,

Ingo

2011-05-31 15:56:56

by Andrew Lutomirski

Subject: Re: [PATCH v4 09/10] x86-64: Randomize int 0xcc magic al values at boot

On Tue, May 31, 2011 at 11:40 AM, Ingo Molnar <[email protected]> wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
>> This is not a security feature. It's to prevent the int 0xcc
>> sequence from becoming ABI.
>
> Hm, what kind of ABI reliance could be built on it?

See:

http://lkml.kernel.org/r/<BANLkTi=zxA3xzc0pR14rqwaOUXUAyF__MQ%40mail.gmail.com>

I was surprised, too.

>
>> +static void __init mangle_vsyscall_movb(void *mapping,
>> +                                    unsigned long movb_addr, u8 initial)
>> +{
>> +	u8 *imm8;
>> +	BUG_ON(movb_addr >= 4095);
>
> Please put newlines after local variable definitions.

Yep.

>
>>  static int __init vsyscall_init(void)
>>  {
>> +	extern char __vsyscall_0;
>
> Please don't put extern definitions in the middle of a .c file - if
> then it should be in a .h file. (even if only a single function uses
> it)

I thought the convention (and existing practice in vsyscall_64.c) was
that if the extern reference is to a magic linker symbol then it goes
in the function that uses it. But I can find it a header file.

>
>> +	/*
>> +	 * Randomize the magic al values for int 0xcc invocation. This
>> +	 * isn't really a security feature; it's to make sure that
>> +	 * dynamic binary instrumentation tools don't start to think
>> +	 * that the int 0xcc magic incantation is ABI.
>> +	 */
>> +	vsyscall_nr_offset = get_random_int() % 3;
>> +	vsyscall_page = pfn_to_page(__pa_symbol(&__vsyscall_0) >> PAGE_SHIFT);
>> +	mapping = kmap_atomic(vsyscall_page);
>> +	/* It's easier to hardcode the addresses -- they're ABI. */
>> +	mangle_vsyscall_movb(mapping, 0, 0xcc);
>
> what about filling it with zeroes?

Fill what with zeroes? I'm just patching one byte here.

>
>> +#ifndef CONFIG_UNSAFE_VSYSCALLS
>> +	mangle_vsyscall_movb(mapping, 1024, 0xce);
>> +#endif
>> +	mangle_vsyscall_movb(mapping, 2048, 0xf0);
>
> Dunno, this all looks rather ugly.

Agreed. Better ideas are welcome.

We could scrap int 0xcc entirely and emulate on page fault, but that
is slower and has other problems (like breaking anything that thinks
it can look at a call target in a binary and dereference that
address).

Here's a possibly dumb/evil idea:

Put real syscalls in the vsyscall page but mark the page NX. Then
emulate the vsyscalls on the PF_INSTR fault when userspace jumps to
the correct address but send SIGSEGV for the wrong address.

Down side: it's even more complexity for the same silly case.

--Andy

>
> Thanks,
>
>	Ingo
>

2011-05-31 16:11:03

by Andrew Lutomirski

Subject: Re: [PATCH v4 09/10] x86-64: Randomize int 0xcc magic al values at boot

On Tue, May 31, 2011 at 11:56 AM, Andrew Lutomirski <[email protected]> wrote:
> We could scrap int 0xcc entirely and emulate on page fault, but that
> is slower and has other problems (like breaking anything that thinks
> it can look at a call target in a binary and dereference that
> address).
>
> Here's a possibly dumb/evil idea:
>
> Put real syscalls in the vsyscall page but mark the page NX. Then
> emulate the vsyscalls on the PF_INSTR fault when userspace jumps to
> the correct address but send SIGSEGV for the wrong address.
>
> Down side: it's even more complexity for the same silly case.

Scratch that. It's incompatible with keeping time() fast for now.

>
> --Andy
>
>>
>> Thanks,
>>
>>	Ingo
>>
>

2011-05-31 16:42:51

by Ingo Molnar

Subject: Re: [PATCH v4 09/10] x86-64: Randomize int 0xcc magic al values at boot


* Andrew Lutomirski <[email protected]> wrote:

> >>  static int __init vsyscall_init(void)
> >>  {
> >> +	extern char __vsyscall_0;
> >
> > Please don't put extern definitions in the middle of a .c file - if
> > then it should be in a .h file. (even if only a single function uses
> > it)
>
> I thought the convention (and existing practice in vsyscall_64.c)
> was that if the extern reference is to a magic linker symbol then
> it goes in the function that uses it. But I can find it a header
> file.

i'd suggest collecting them into a vsyscall header. The problem with
externs in .c is that the moment two .c files start using it there's
the danger of type divergence.

> >> +	/*
> >> +	 * Randomize the magic al values for int 0xcc invocation. This
> >> +	 * isn't really a security feature; it's to make sure that
> >> +	 * dynamic binary instrumentation tools don't start to think
> >> +	 * that the int 0xcc magic incantation is ABI.
> >> +	 */
> >> +	vsyscall_nr_offset = get_random_int() % 3;
> >> +	vsyscall_page = pfn_to_page(__pa_symbol(&__vsyscall_0) >> PAGE_SHIFT);
> >> +	mapping = kmap_atomic(vsyscall_page);
> >> +	/* It's easier to hardcode the addresses -- they're ABI. */
> >> +	mangle_vsyscall_movb(mapping, 0, 0xcc);
> >
> > what about filling it with zeroes?
>
> Fill what with zeroes? I'm just patching one byte here.

Sigh, i suck at reading comprehension today!

> >> +#ifndef CONFIG_UNSAFE_VSYSCALLS
> >> +	mangle_vsyscall_movb(mapping, 1024, 0xce);
> >> +#endif
> >> +	mangle_vsyscall_movb(mapping, 2048, 0xf0);
> >
> > Dunno, this all looks rather ugly.
>
> Agreed. Better ideas are welcome.

None at the moment except "don't randomize it and see where the chips
may fall". I'd rather live with a somewhat sticky default-off compat
Kconfig switch than some permanently ugly randomization to make the
transition to no-vsyscall faster.

It's not like we'll be able to remove the vsyscall altogether from
the kernel - the best we can hope for is to be able to flip the
default - there's binaries out there today that rely on it and
binaries are sticky - a few months ago i saw someone test-running
1995 binaries ;-)

Btw., we could also make the vsyscall page vanish *runtime*, via a
sysctl. That way distros only need to update their /etc/sysctl.conf.
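
Something like this hypothetical /etc/sysctl.conf line (the knob name
is made up; nothing in this series implements it):

	# hypothetical knob -- not implemented by this series
	kernel.vsyscall_emulate = 1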

> We could scrap int 0xcc entirely and emulate on page fault, but
> that is slower and has other problems (like breaking anything that
> thinks it can look at a call target in a binary and dereference
> that address).
>
> Here's a possibly dumb/evil idea:
>
> Put real syscalls in the vsyscall page but mark the page NX. Then
> emulate the vsyscalls on the PF_INSTR fault when userspace jumps to
> the correct address but send SIGSEGV for the wrong address.
>
> Down side: it's even more complexity for the same silly case.

heh, you are good at coming up with sick ideas! ;-)

I don't think we want to add another branch to #PF, but could we turn
this into #GP or perhaps an illegal instruction fault?

Should be benchmarked:

- The advantage of INT 0xCC is that it's completely isolated: it
does not slow down anything else.

- doing this through #GP might be significantly slower cycle-wise.
Do we know by how much?

The advantage would be that we would not waste an extra vector, it
would be smaller, plus it would be rather simple to make it all a
runtime toggle via a sysctl.

Thanks,

Ingo
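
For readers following along: the randomization being debated amounts to
rotating the movb $N, %al immediates by a boot-time offset and inverting
that rotation in the int 0xcc handler. Below is a minimal user-space
model of that round trip; the encoding and the helper names are invented
here purely for illustration (the real patch writes the raw byte values
0xcc/0xce/0xf0 shown in the hunk above via mangle_vsyscall_movb()), and
only the "offset = get_random_int() % 3" rotation is taken from the
patch itself.

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static unsigned int vsyscall_nr_offset;	/* chosen once at "boot" */

/* Hypothetical al value the kernel would patch into vsyscall slot nr. */
static unsigned char al_for_nr(unsigned int nr)
{
	return (unsigned char)((nr + vsyscall_nr_offset) % 3);
}

/* What the int 0xcc handler would do to recover the vsyscall number. */
static int nr_from_al(unsigned char al)
{
	if (al > 2)
		return -1;	/* not one of ours: would be SIGSEGV */
	return (int)((al + 3 - vsyscall_nr_offset) % 3);
}

int main(void)
{
	srand((unsigned int)time(NULL));
	vsyscall_nr_offset = (unsigned int)rand() % 3;

	/* The rotation must round-trip for all three vsyscalls. */
	for (unsigned int nr = 0; nr < 3; nr++)
		assert(nr_from_al(al_for_nr(nr)) == (int)nr);

	printf("offset=%u round-trips for all three vsyscalls\n",
	       vsyscall_nr_offset);
	return 0;
}

The model captures only the rotate-and-invert bookkeeping; as Ingo notes
above, with just three possible offsets it buys no security, only a
deterrent against treating the incantation as ABI.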

2011-05-31 16:43:31

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4 09/10] x86-64: Randomize int 0xcc magic al values at boot


* Andrew Lutomirski <[email protected]> wrote:

> On Tue, May 31, 2011 at 11:56 AM, Andrew Lutomirski <[email protected]> wrote:
> > We could scrap int 0xcc entirely and emulate on page fault, but that
> > is slower and has other problems (like breaking anything that thinks
> > it can look at a call target in a binary and dereference that
> > address).
> >
> > Here's a possibly dumb/evil idea:
> >
> > Put real syscalls in the vsyscall page but mark the page NX.  Then
> > emulate the vsyscalls on the PF_INSTR fault when userspace jumps to
> > the correct address but send SIGSEGV for the wrong address.
> >
> > Down side: it's even more complexity for the same silly case.
>
> Scratch that. It's incompatible with keeping time() fast for now.

If we can find another fault than #PF then it will be similarly fast
to an INT $0xCC so please at least investigate this route.

Thanks,

Ingo

2011-05-31 17:23:06

by Louis Rilling

[permalink] [raw]
Subject: Re: [PATCH v4 03/10] x86-64: Give vvars their own page

On 31/05/11 10:14 -0400, Andy Lutomirski wrote:
> Move vvars out of the vsyscall page into their own page and mark it
> NX.
>
> Without this patch, an attacker who can force a daemon to call some
> fixed address could wait until the time contains, say, 0xCD80, and
> then execute the current time.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/include/asm/fixmap.h | 1 +
> arch/x86/include/asm/pgtable_types.h | 2 ++
> arch/x86/include/asm/vvar.h | 22 ++++++++++------------
> arch/x86/kernel/vmlinux.lds.S | 27 ++++++++++++++++-----------
> arch/x86/kernel/vsyscall_64.c | 5 +++++
> 5 files changed, 34 insertions(+), 23 deletions(-)
>

[...]

> diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
> index 89aed99..3d89a00 100644
> --- a/arch/x86/kernel/vmlinux.lds.S
> +++ b/arch/x86/kernel/vmlinux.lds.S
> @@ -161,12 +161,6 @@ SECTIONS
>
> #define VVIRT_OFFSET (VSYSCALL_ADDR - __vsyscall_0)
> #define VVIRT(x) (ADDR(x) - VVIRT_OFFSET)
> -#define EMIT_VVAR(x, offset) .vsyscall_var_ ## x \
> - ADDR(.vsyscall_0) + offset \
> - : AT(VLOAD(.vsyscall_var_ ## x)) { \
> - *(.vsyscall_var_ ## x) \
> - } \
> - x = VVIRT(.vsyscall_var_ ## x);
>
> . = ALIGN(4096);
> __vsyscall_0 = .;
> @@ -192,17 +186,28 @@ SECTIONS
> *(.vsyscall_3)
> }
>
> -#define __VVAR_KERNEL_LDS
> -#include <asm/vvar.h>
> -#undef __VVAR_KERNEL_LDS
> -
> - . = __vsyscall_0 + PAGE_SIZE;
> + . = ALIGN(__vsyscall_0 + PAGE_SIZE, PAGE_SIZE);
>
> #undef VSYSCALL_ADDR
> #undef VLOAD_OFFSET
> #undef VLOAD
> #undef VVIRT_OFFSET
> #undef VVIRT
> +
> + __vvar_page = .;
> +
> +#define EMIT_VVAR(name, offset) .vvar_ ## name \
> + (__vvar_page + offset) : \
> + AT(ADDR(.vvar_ ## name) - LOAD_OFFSET) { \
> + *(.vvar_ ## x) \
^
Maybe s/x/name/ ? -----------|

Thanks,

Louis

> + } :data
> +
> +#define __VVAR_KERNEL_LDS
> +#include <asm/vvar.h>
> +#undef __VVAR_KERNEL_LDS
> +
> + . = ALIGN(__vvar_page + PAGE_SIZE, PAGE_SIZE);
> +
> #undef EMIT_VVAR
>
> #endif /* CONFIG_X86_64 */
> diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
> index 3e68218..3cf1cef 100644
> --- a/arch/x86/kernel/vsyscall_64.c
> +++ b/arch/x86/kernel/vsyscall_64.c
> @@ -284,9 +284,14 @@ void __init map_vsyscall(void)
> {
> extern char __vsyscall_0;
> unsigned long physaddr_page0 = __pa_symbol(&__vsyscall_0);
> + extern char __vvar_page;
> + unsigned long physaddr_vvar_page = __pa_symbol(&__vvar_page);
>
> /* Note that VSYSCALL_MAPPED_PAGES must agree with the code below. */
> __set_fixmap(VSYSCALL_FIRST_PAGE, physaddr_page0, PAGE_KERNEL_VSYSCALL);
> + __set_fixmap(VVAR_PAGE, physaddr_vvar_page, PAGE_KERNEL_VVAR);
> + BUILD_BUG_ON((unsigned long)__fix_to_virt(VVAR_PAGE) !=
> + (unsigned long)VVAR_ADDRESS);
> }
>
> static int __init vsyscall_init(void)
> --
> 1.7.5.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Dr Louis Rilling                Kerlabs
Skype: louis.rilling            Batiment Germanium
Phone: (+33|0) 6 80 89 08 23    80 avenue des Buttes de Coesmes
http://www.kerlabs.com/         35700 Rennes
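
For clarity, this is what the macro looks like with Louis's s/x/name/
fix applied, together with a sample expansion; the jiffies example and
its zero offset are illustrative only:

/* EMIT_VVAR with the stray "x" replaced by "name": */
#define EMIT_VVAR(name, offset) .vvar_ ## name				\
	(__vvar_page + offset) :					\
	AT(ADDR(.vvar_ ## name) - LOAD_OFFSET) {			\
		*(.vvar_ ## name)					\
	} :data

/* Illustrative expansion for EMIT_VVAR(jiffies, 0): */
.vvar_jiffies (__vvar_page + 0) : AT(ADDR(.vvar_jiffies) - LOAD_OFFSET) {
	*(.vvar_jiffies)
} :data

Without the fix, every expansion would collect the nonexistent input
section .vvar_x, leaving the real .vvar_* input sections unplaced.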



2011-05-31 18:09:13

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 09/10] x86-64: Randomize int 0xcc magic al values at boot

On Tue, May 31, 2011 at 12:42 PM, Ingo Molnar <[email protected]> wrote:
>
> * Andrew Lutomirski <[email protected]> wrote:
>
>> >> static int __init vsyscall_init(void)
>> >> {
>> >> + extern char __vsyscall_0;
>> >
>> > Please don't put extern definitions in the middle of a .c file - if
>> > then it should be in a .h file. (even if only a single function uses
>> > it)
>>
>> I thought the convention (and existing practice in vsyscall_64.c)
>> was that if the extern reference is to a magic linker symbol then
>> it goes in the function that uses it. But I can find it a header
>> file.
>
> i'd suggest collecting them into a vsyscall header. The problem with
> externs in .c is that the moment two .c files start using it there's
> the danger of type divergence.
>
>> >> + /*
>> >> + * Randomize the magic al values for int 0xcc invocation. This
>> >> + * isn't really a security feature; it's to make sure that
>> >> + * dynamic binary instrumentation tools don't start to think
>> >> + * that the int 0xcc magic incantation is ABI.
>> >> + */
>> >> + vsyscall_nr_offset = get_random_int() % 3;
>> >> + vsyscall_page = pfn_to_page(__pa_symbol(&__vsyscall_0) >> PAGE_SHIFT);
>> >> + mapping = kmap_atomic(vsyscall_page);
>> >> + /* It's easier to hardcode the addresses -- they're ABI. */
>> >> + mangle_vsyscall_movb(mapping, 0, 0xcc);
>> >
>> > what about filling it with zeroes?
>>
>> Fill what with zeroes? I'm just patching one byte here.
>
> Sigh, i suck at reading comprehension today!
>
>> >> +#ifndef CONFIG_UNSAFE_VSYSCALLS
>> >> + mangle_vsyscall_movb(mapping, 1024, 0xce);
>> >> +#endif
>> >> + mangle_vsyscall_movb(mapping, 2048, 0xf0);
>> >
>> > Dunno, this all looks rather ugly.
>>
>> Agreed. Better ideas are welcome.
>
> None at the moment except "don't randomize it and see where the chips
> may fall". I'd rather live with a somewhat sticky default-off compat
> Kconfig switch than some permanently ugly randomization to make the
> transition to no-vsyscall faster.
>
> It's not like we'll be able to remove the vsyscall altogether from
> the kernel - the best we can hope for is to be able to flip the
> default - there's binaries out there today that rely on it and
> binaries are sticky - a few months ago i saw someone test-running
> 1995 binaries ;-)
>
> Btw., we could also make the vsyscall page vanish at *runtime*, via a
> sysctl. That way distros only need to update their /etc/sysctl.conf.
>
>> We could scrap int 0xcc entirely and emulate on page fault, but
>> that is slower and has other problems (like breaking anything that
>> thinks it can look at a call target in a binary and dereference
>> that address).
>>
>> Here's a possibly dumb/evil idea:
>>
>> Put real syscalls in the vsyscall page but mark the page NX. Then
>> emulate the vsyscalls on the PF_INSTR fault when userspace jumps to
>> the correct address but send SIGSEGV for the wrong address.
>>
>> Down side: it's even more complexity for the same silly case.
>
> heh, you are good at coming up with sick ideas! ;-)
>
> I don't think we want to add another branch to #PF, but could we turn
> this into #GP or perhaps an illegal instruction fault?

I'm not seeing one right now. But #PF can be done without any
fast-path impact by checking for a vsyscall after the normal
page-fault logic gives up. That takes about 400ns on my machine, I
think.

>
> Should be benchmarked:
>
> - The advantage of INT 0xCC is that it's completely isolated: it
> does not slow down anything else.

220ns or so.

>
> - doing this through #GP might be significantly slower cycle-wise.
> Do we know by how much?

Not sure how to do it. I imagine that #UD is almost the same as int 0xcc, though.

>
> The advantage would be that we would not waste an extra vector, it
> would be smaller, plus it would be rather simple to make it all a
> runtime toggle via a sysctl.

I think this is all moot right now, though -- whatever we do has to
keep time() fast for a while, and I think that means that we need a
real (even if illegal) instruction in gettimeofday.

So let's just drop the randomization patch and hope that the "int 0xcc
in user code (exploit attempt? legacy instrumented code?)" warning is
enough to keep people from doing dumb things.

--Andy
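
The slow-path idea Andy sketches -- consult the vsyscall logic only
after ordinary page-fault handling has given up -- hinges on
recognizing the three fixed entry points. A standalone model of that
address check follows; VSYSCALL_START is the real fixed-map ABI
address and the 1024-byte stride matches the offsets quoted earlier in
the thread, while the emulate/SIGSEGV outcomes are printed rather than
performed:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define VSYSCALL_START 0xffffffffff600000ULL

/* Recognize a jump to one of the three legacy entry points. */
static bool is_vsyscall_entry(uint64_t rip, int *nr)
{
	if (rip < VSYSCALL_START || rip >= VSYSCALL_START + 3 * 1024)
		return false;
	if ((rip - VSYSCALL_START) % 1024 != 0)
		return false;	/* mid-page jump: still a real fault */
	*nr = (int)((rip - VSYSCALL_START) / 1024);
	return true;
}

int main(void)
{
	uint64_t rips[] = {
		VSYSCALL_START,		/* gettimeofday entry */
		VSYSCALL_START + 1024,	/* time entry */
		VSYSCALL_START + 17,	/* bogus offset */
	};

	for (size_t i = 0; i < sizeof(rips) / sizeof(rips[0]); i++) {
		int nr;
		if (is_vsyscall_entry(rips[i], &nr))
			printf("%#llx -> emulate vsyscall %d\n",
			       (unsigned long long)rips[i], nr);
		else
			printf("%#llx -> genuine fault (SIGSEGV)\n",
			       (unsigned long long)rips[i]);
	}
	return 0;
}

In-kernel, this test would sit after the failure paths of the page
fault handler, so the ~400ns Andy measures applies only to the legacy
entry points, not to ordinary faults.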

2011-05-31 18:34:54

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule

> +What: CONFIG_UNSAFE_VSYSCALLS (x86_64)
> +When: When glibc 2.14 or newer is ubiquitous. Perhaps mid-2012.
> +Why: Having user-executable code at a fixed address is a security problem.
> + Turning off CONFIG_UNSAFE_VSYSCALLS mostly removes the risk but will
> + make the time() function slower on glibc versions 2.13 and below.

I disagree with this description (and the whole idea really)

First it's time+gettimeofday+vgetcpu, not just time.

A more accurate description is

"will make all x86-64 Linux programs written to the original pre
vDSO syscall ABI significantly slower"

And the assumption that all the world is using glibc is still as bad
as it was on the first post.

And it's still a bad idea. Especially since there's a much better
alternative anyways for the "security problem" which has none of
these drawbacks.

-Andi

2011-05-31 18:57:44

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule

On Tue, 31 May 2011, Andi Kleen wrote:
> > +What: CONFIG_UNSAFE_VSYSCALLS (x86_64)
> > +When: When glibc 2.14 or newer is ubiquitous. Perhaps mid-2012.
> > +Why: Having user-executable code at a fixed address is a security problem.
> > + Turning off CONFIG_UNSAFE_VSYSCALLS mostly removes the risk but will
> > + make the time() function slower on glibc versions 2.13 and below.
>
> I disagree with this description (and the whole idea really)
>
> First it's time+gettimeofday+vgetcpu, not just time.
>
> A more accurate description is
>
> "will make all x86-64 Linux programs written to the original pre
> vDSO syscall ABI significantly slower"
>
> And the assumption that all the world is using glibc is still as bad
> as it was on the first post.
>
> And it's still a bad idea. Especially since there's a much better
> alternative anyways for the "security problem" which has none of
> these drawbacks.

How about posting an alternative patch?

Thanks,

tglx

2011-05-31 18:59:31

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule

On Tue, May 31, 2011 at 2:34 PM, Andi Kleen <[email protected]> wrote:
>> +What:        CONFIG_UNSAFE_VSYSCALLS (x86_64)
>> +When:        When glibc 2.14 or newer is ubiquitous.  Perhaps mid-2012.
>> +Why: Having user-executable code at a fixed address is a security problem.
>> +     Turning off CONFIG_UNSAFE_VSYSCALLS mostly removes the risk but will
>> +     make the time() function slower on glibc versions 2.13 and below.
>
> I disagree with this description (and the whole idea really)
>
> First it's time+gettimeofday+vgetcpu, not just time.
>
> A more accurate description is
>
> "will make all x86-64 Linux programs written to the original pre
> vDSO syscall ABI significantly slower"

Well, if this series goes in, then gettimeofday and getcpu are already
slower. It's just time that would get even slower later on.

>
> And the assumption that all the world is using glibc is still as bad
> as it was on the first post.

As opposed to?

uclibc and klibc don't appear to use vsyscalls or the vdso.

dietlibc is hardcoded to use the vsyscall. Are there any
performance-critical programs using dietlibc?

I don't think that Bionic runs on any released x86-64 systems.

>
> And it's still a bad idea. Especially since there's a much better
> alternative anyways for the "security problem" which has none of
> these drawbacks.

What's the alternative?

--Andy

2011-05-31 19:29:02

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule


* Andrew Lutomirski <[email protected]> wrote:

> > And it's still a bad idea. Especially since there's a much better
> > alternative anyways for the "security problem" which has none of
> > these drawbacks.
>
> What's the alternative?

Well, Andi likes to draw out such answers and likes to keep any
answer as minimally helpful as possible (to demonstrate his
superiority), but my guess would be that he is thinking of the
(trivial) solution that filters the caller RIP at the generic syscall
entry point and checks RCX against the 'expected' SYSCALL instruction
address, which is the (per task) vdso-address + constant-offset.

That method has a big disadvantage:

- it slows down the syscall fastpath with two or three unnecessary
instructions.

It has two advantages:

- it's the simplest method of all

- it also *only* allows the vdso address to be used for system calls,
so if an attacker manages to find an executable SYSCALL
instruction somewhere in the executable space of an application,
that entry point will not be usable.

... so this method is not completely off the table.

If everyone agrees that the generic syscall overhead is acceptable we
could try this too.

Thoughts?

Thanks,

Ingo
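
A minimal model of the RCX check Ingo describes: the SYSCALL
instruction leaves the user return address in RCX, so the entry path
can insist it equals the one sanctioned syscall-instruction location in
the per-task vdso. The vdso base and SYSCALL_INSN_OFFSET below are
made-up values for illustration; the "+ 2" reflects that RCX points
just past the 2-byte SYSCALL opcode.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SYSCALL_INSN_OFFSET 0x420	/* hypothetical offset in the vdso */

struct task {
	uint64_t vdso_base;		/* randomized per process */
};

static bool syscall_rip_allowed(const struct task *t, uint64_t rcx)
{
	/* RCX points just past the 2-byte SYSCALL instruction. */
	return rcx == t->vdso_base + SYSCALL_INSN_OFFSET + 2;
}

int main(void)
{
	struct task t = { .vdso_base = 0x7fff12340000ULL };

	printf("from the vdso:  %s\n",
	       syscall_rip_allowed(&t, t.vdso_base + SYSCALL_INSN_OFFSET + 2)
		       ? "allowed" : "rejected");
	printf("from elsewhere: %s\n",
	       syscall_rip_allowed(&t, 0x400123) ? "allowed" : "rejected");
	return 0;
}

The "two or three unnecessary instructions" Ingo concedes are exactly
this compare-and-branch; the follow-ups below explain why the scheme is
fragile once legitimate SYSCALL instructions exist outside the vdso.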

2011-05-31 19:36:31

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule


* Ingo Molnar <[email protected]> wrote:

> [...] solution that filters the caller RIP at the generic syscall
> entry point and checks RCX against the 'expected' SYSCALL
> instruction address, which is the (per task) vdso-address +
> constant-offset.

Note that this solution would allow the vsyscall page to be
'filtered' to the 3 allowed system calls rather efficiently, via a
second level check.

This second check does not affect the fastpath, and it could be put
behind a CONFIG_COMPAT_VSYSCALL deprecation define once glibc does
not put vsyscall references anywhere - but we could even keep it
around forever, as this way it's defanged permanently.

Thanks,

Ingo
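
As a sketch of this second-level check, assuming the real x86-64
syscall numbers for the three legacy vsyscalls and the fixed vsyscall
page address (a real implementation would sit behind the RIP test, off
the ordinary syscall fastpath):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define VSYSCALL_START	0xffffffffff600000ULL
#define VSYSCALL_END	(VSYSCALL_START + 4096)

#define NR_gettimeofday	96
#define NR_time		201
#define NR_getcpu	309

static bool vsyscall_page_syscall_ok(uint64_t rip, long nr)
{
	if (rip < VSYSCALL_START || rip >= VSYSCALL_END)
		return true;	/* not from the vsyscall page: no filter */
	return nr == NR_gettimeofday || nr == NR_time || nr == NR_getcpu;
}

int main(void)
{
	printf("time() from the vsyscall page: %s\n",
	       vsyscall_page_syscall_ok(VSYSCALL_START + 1024, NR_time)
		       ? "ok" : "blocked");
	printf("execve from the vsyscall page: %s\n",
	       vsyscall_page_syscall_ok(VSYSCALL_START, 59) ? "ok" : "blocked");
	return 0;
}

This is what "defanged permanently" means above: even with the page
mapped forever, only the three harmless legacy calls remain reachable
through it.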

2011-05-31 20:05:09

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule

[Sorry, possible resend.]

On 5/31/11, Ingo Molnar <[email protected]> wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
>> [...] solution that filters the caller RIP at the generic syscall
>> entry point and checks RCX against the 'expected' SYSCALL
>> instruction address, which is the (per task) vdso-address +
>> constant-offset.
>
> Note that this solution would allow the vsyscall page to be
> 'filtered' to the 3 allowed system calls rather efficiently, via a
> second level check.
>
> This second check does not affect the fastpath, and it could be put
> behind a CONFIG_COMPAT_VSYSCALL deprecation define once glibc does
> not put vsyscall references anywhere - but we could even keep it
> around forever, as this way it's defanged permanently.
>

Are you thinking about the 32-bit vDSO? I think that 64-bit code puts
syscall instructions all over the place.

How is this better than v2 of my series, stopping after the int 0xcc fallback?

--Andy

> Thanks,
>
> Ingo
>

2011-05-31 20:25:15

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule


* Andrew Lutomirski <[email protected]> wrote:

> [Sorry, possible resend.]
>
> On 5/31/11, Ingo Molnar <[email protected]> wrote:
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> >> [...] solution that filters the caller RIP at the generic syscall
> >> entry point and checks RCX against the 'expected' SYSCALL
> >> instruction address, which is the (per task) vdso-address +
> >> constant-offset.
> >
> > Note that this solution would allow the vsyscall page to be
> > 'filtered' to the 3 allowed system calls rather efficiently, via a
> > second level check.
> >
> > This second check does not affect the fastpath, and it could be put
> > behind a CONFIG_COMPAT_VSYSCALL deprecation define once glibc does
> > not put vsyscall references anywhere - but we could even keep it
> > around forever, as this way it's defanged permanently.
> >
>
> Are you thinking about the 32-bit vDSO? I think that 64-bit code puts
> syscall instructions all over the place.

Yeah, it does in a few dozen places so RCX filtering would only work
if we 'knew' about glibc's syscall range (it's available from the
vma) and restricted syscalls to that boundary.

... which makes this solution rather fragile so i think we can
disregard it.

Thanks,

Ingo

2011-06-08 08:50:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule


* Andrew Lutomirski <[email protected]> wrote:

> On Tue, May 31, 2011 at 2:34 PM, Andi Kleen <[email protected]> wrote:
>
> > And it's still a bad idea. Especially since there's a much better
> > alternative anyways for the "security problem" which has none of
> > these drawbacks.
>
> What's the alternative?

Andi, you still have not replied to Andy's, Thomas's and my question
regarding your claim: exactly what 'alternative' did you mean here?

Do you plan to reply to our questions or will you continue to ignore
them forever, keeping your FUD in the air indefinitely?

Thanks,

Ingo