Patch 1 is just a bugfix from the last vdso series.
The bug should be harmless but it's pretty dumb. This is almost
certainly 3.0 material.
Patch 2 adds documentation for entry_64.S. A lot of the magic in there is far from obvious.
Patches 3, 4, and 5 remove a bunch of syscall instructions in kernel space
at fixed addresses that user code can execute.  Several of these live in
data that isn't marked NX: patch 3 makes the vvars NX and patch 5 makes
the HPET mapping NX.
The time() vsyscall contains an explicit syscall fallback; patch 4 removes it.
At this point, there is only one explicit syscall left in the vsyscall
page: the fallback case for vgettimeofday. The rest of the series is to
remove it and most of the remaining vsyscall code.
Patch 6 is pure cleanup. venosys (one of the four vsyscalls) has been
broken for years, so patch 6 removes it.
Patch 7 pads the empty parts of the vsyscall page with 0xcc. 0xcc is an
explicit trap.
Patch 8 removes the code implementing the vsyscalls and replaces it with
magic int 0xcc incantations. These incantations are specifically
designed so that jumping into them at funny offsets will either work
fine or generate some kind of fault. This is a significant performance
penalty (~220ns here) for all vsyscall users, but there aren't many
left. Because current glibc still uses the time vsyscall (although it's
fixed in glibc's git), the option CONFIG_UNSAFE_VSYSCALLS (default y)
will leave time() alone.
This patch is also nice because it removes a bunch of duplicated code
from vsyscall_64.c.
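For context, here is roughly what a legacy caller of these entry points
looks like: it simply calls a fixed address in the vsyscall page (the
address below is the long-standing slot-1 ABI for time(), offset 0x400
into the page at 0xffffffffff600000).  This is just an illustrative
sketch, not code from the series:
    #include <stdio.h>
    #include <time.h>

    /* The time() vsyscall has always lived at a fixed, ABI-frozen address. */
    typedef time_t (*vsyscall_time_t)(time_t *);

    int main(void)
    {
            vsyscall_time_t vtime = (vsyscall_time_t)0xffffffffff600400UL;

            /* With patch 8 applied (and CONFIG_UNSAFE_VSYSCALLS=n), this
             * call traps into the kernel and is emulated instead of
             * executing real code in the vsyscall page. */
            printf("time via vsyscall page: %ld\n", (long)vtime(NULL));
            return 0;
    }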
Patch 9/10 randomizes the int 0xcc incantation at bootup. It is pretty
much worthless for security (there are only three choices for the random
number and it's easy to figure out which one is in use) but it prevents
overly clever userspace programs from thinking that the incantation is
ABI. One instrumentation tool author offered to hard-code special
handling for int 0xcc; I want to discourage this approach.
Patch 10/10 adds CONFIG_UNSAFE_VSYSCALLS to
feature-removal-schedule.txt. Removing it won't break anything but it
will slow some older code down.
Changes from v2:
- Reordered the patches.
- Removed the option to leave gettimeofday and getcpu as native code.
- Cleaned up the int 0xcc handler and registration.
- sched_getcpu works now (thanks, Borislav, for catching my blatant arithmetic error).
- Numerous small fixes from review comments.
- Abandoned my plan to spread turbostat to the masses.
Changes from v1:
- Patches 6-10 are new.
- The int 0xcc code is much prettier and has lots of bugs fixed.
- I've decided to let everyone compile turbostat on their own :)
Andy Lutomirski (10):
x86-64: Fix alignment of jiffies variable
x86-64: Document some of entry_64.S
x86-64: Give vvars their own page
x86-64: Remove kernel.vsyscall64 sysctl
x86-64: Map the HPET NX
x86-64: Remove vsyscall number 3 (venosys)
x86-64: Fill unused parts of the vsyscall page with 0xcc
x86-64: Emulate legacy vsyscalls
x86-64: Randomize int 0xcc magic al values at boot
x86-64: Add CONFIG_UNSAFE_VSYSCALLS to feature-removal-schedule
Documentation/feature-removal-schedule.txt | 9 +
Documentation/x86/entry_64.txt | 98 +++++++++
arch/x86/Kconfig | 17 ++
arch/x86/include/asm/fixmap.h | 1 +
arch/x86/include/asm/irq_vectors.h | 6 +-
arch/x86/include/asm/pgtable_types.h | 6 +-
arch/x86/include/asm/traps.h | 4 +
arch/x86/include/asm/vgtod.h | 1 -
arch/x86/include/asm/vsyscall.h | 6 +
arch/x86/include/asm/vvar.h | 24 +-
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/entry_64.S | 4 +
arch/x86/kernel/hpet.c | 2 +-
arch/x86/kernel/traps.c | 4 +
arch/x86/kernel/vmlinux.lds.S | 47 ++--
arch/x86/kernel/vsyscall_64.c | 327 +++++++++++++++++-----------
arch/x86/kernel/vsyscall_emu_64.S | 42 ++++
arch/x86/vdso/vclock_gettime.c | 55 ++---
18 files changed, 448 insertions(+), 206 deletions(-)
create mode 100644 Documentation/x86/entry_64.txt
create mode 100644 arch/x86/kernel/vsyscall_emu_64.S
--
1.7.5.1
It's declared __attribute__((aligned(16))) but it's explicitly placed at
an offset that isn't 16-byte aligned.  This is probably harmless but it's
a bit embarrassing.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/vvar.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
index 341b355..a4eaca4 100644
--- a/arch/x86/include/asm/vvar.h
+++ b/arch/x86/include/asm/vvar.h
@@ -45,7 +45,7 @@
/* DECLARE_VVAR(offset, type, name) */
DECLARE_VVAR(0, volatile unsigned long, jiffies)
-DECLARE_VVAR(8, int, vgetcpu_mode)
+DECLARE_VVAR(16, int, vgetcpu_mode)
DECLARE_VVAR(128, struct vsyscall_gtod_data, vsyscall_gtod_data)
#undef DECLARE_VVAR
--
1.7.5.1
Signed-off-by: Andy Lutomirski <[email protected]>
---
Documentation/x86/entry_64.txt | 98 ++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/entry_64.S | 2 +
2 files changed, 100 insertions(+), 0 deletions(-)
create mode 100644 Documentation/x86/entry_64.txt
diff --git a/Documentation/x86/entry_64.txt b/Documentation/x86/entry_64.txt
new file mode 100644
index 0000000..7869f14
--- /dev/null
+++ b/Documentation/x86/entry_64.txt
@@ -0,0 +1,98 @@
+This file documents some of the kernel entries in
+arch/x86/kernel/entry_64.S. A lot of this explanation is adapted from
+an email from Ingo Molnar:
+
+http://lkml.kernel.org/r/<20110529191055.GC9835%40elte.hu>
+
+The x86 architecture has quite a few different ways to jump into
+kernel code. Most of these entry points are registered in
+arch/x86/kernel/traps.c and implemented in arch/x86/kernel/entry_64.S
+and arch/x86/ia32/ia32entry.S.
+
+The IDT vector assignments are listed in arch/x86/include/asm/irq_vectors.h.
+
+Some of these entries are:
+
+ - system_call: syscall instruction from 64-bit code.
+
+ - ia32_syscall: int 0x80 from 32-bit or 64-bit code; compat syscall
+ either way.
+
+ - ia32_syscall, ia32_sysenter: syscall and sysenter from 32-bit
+ code
+
+ - interrupt: An array of entries. Every IDT vector that doesn't
+ explicitly point somewhere else gets set to the corresponding
+ value in interrupts. These point to a whole array of
+ magically-generated functions that make their way to do_IRQ with
+ the interrupt number as a parameter.
+
+ - emulate_vsyscall: int 0xcc, a special non-ABI entry used by
+ vsyscall emulation.
+
+ - APIC interrupts: Various special-purpose interrupts for things
+ like TLB shootdown.
+
+ - Architecturally-defined exceptions like divide_error.
+
+There are a few complexities here. The different x86-64 entries
+have different calling conventions. The syscall and sysenter
+instructions have their own peculiar calling conventions. Some of
+the IDT entries push an error code onto the stack; others don't.
+IDT entries using the IST alternative stack mechanism need their own
+magic to get the stack frames right. (You can find some
+documentation in the AMD APM, Volume 2, Chapter 8 and the Intel SDM,
+Volume 3, Chapter 6.)
+
+Dealing with the swapgs instruction is especially tricky. Swapgs
+toggles whether gs is the kernel gs or the user gs. The swapgs
+instruction is rather fragile: it must nest perfectly and only to a
+single depth; it must be executed once when entering from user mode to
+kernel mode and once again when returning to user-space, and nowhere
+else.  If we mess that up even slightly, we crash.
+
+So when we have a secondary entry, already in kernel mode, we *must
+not* use SWAPGS blindly - nor must we forget doing a SWAPGS when it's
+not switched/swapped yet.
+
+Now, there's a secondary complication: there's a cheap way to test
+which mode the CPU is in and an expensive way.
+
+The cheap way is to pick this info off the entry frame on the kernel
+stack, from the CS of the ptregs area of the kernel stack:
+
+ xorl %ebx,%ebx
+ testl $3,CS+8(%rsp)
+ je error_kernelspace
+ SWAPGS
+
+The expensive (paranoid) way is to read back the MSR_GS_BASE value
+(which is what SWAPGS modifies):
+
+ movl $1,%ebx
+ movl $MSR_GS_BASE,%ecx
+ rdmsr
+ testl %edx,%edx
+ js 1f /* negative -> in kernel */
+ SWAPGS
+ xorl %ebx,%ebx
+1: ret
+
+and the whole paranoid non-paranoid macro complexity is about whether
+to suffer that RDMSR cost.
+
+If we are at an interrupt or user-trap/gate-alike boundary then we can
+use the faster check: the stack will be a reliable indicator of
+whether SWAPGS was already done: if we see that we are a secondary
+entry interrupting kernel mode execution, then we know that the GS
+base has already been switched. If it says that we interrupted
+user-space execution then we must do the SWAPGS.
+
+But if we are in an NMI/MCE/DEBUG/whatever super-atomic entry context,
+which might have triggered right after a normal entry wrote CS to the
+stack but before we executed SWAPGS, then the only safe way to check
+for GS is the slower method: the RDMSR.
+
+So we try only to mark those entry methods 'paranoid' that absolutely
+need the more expensive check for the GS base - and we generate all
+'normal' entry points with the regular (faster) entry macros.
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 8a445a0..72c4a77 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -9,6 +9,8 @@
/*
* entry.S contains the system-call and fault low-level handling routines.
*
+ * Some of this is documented in Documentation/x86/entry_64.txt
+ *
* NOTE: This code handles signal-recognition, which happens every time
* after an interrupt and after each system call.
*
--
1.7.5.1
Move vvars out of the vsyscall page into their own page and mark it
NX.
Without this patch, an attacker who can force a daemon to call some
fixed address could wait until the time contains, say, 0xCD80, and
then execute the current time.
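To make that concrete: x86-64 is little-endian, so a seconds value whose
low-order bytes happen to be 0xcd 0x80 puts the byte sequence "cd 80",
which decodes as "int $0x80" (the legacy 32-bit syscall gate), at a fixed
executable address.  A minimal sketch (not part of the patch; the value is
hypothetical) showing the byte layout:
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            uint64_t secs = 0x80cd;  /* hypothetical wall-clock seconds value */
            const uint8_t *p = (const uint8_t *)&secs;

            /* Little-endian: the first two bytes in memory are cd 80,
             * i.e. "int $0x80" if the CPU ever jumps here. */
            printf("%02x %02x\n", p[0], p[1]);
            return 0;
    }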
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/fixmap.h | 1 +
arch/x86/include/asm/pgtable_types.h | 2 ++
arch/x86/include/asm/vvar.h | 22 ++++++++++------------
arch/x86/kernel/vmlinux.lds.S | 27 ++++++++++++++++-----------
arch/x86/kernel/vsyscall_64.c | 5 +++++
5 files changed, 34 insertions(+), 23 deletions(-)
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 4729b2b..460c74e 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -78,6 +78,7 @@ enum fixed_addresses {
VSYSCALL_LAST_PAGE,
VSYSCALL_FIRST_PAGE = VSYSCALL_LAST_PAGE
+ ((VSYSCALL_END-VSYSCALL_START) >> PAGE_SHIFT) - 1,
+ VVAR_PAGE,
VSYSCALL_HPET,
#endif
FIX_DBGP_BASE,
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index d56187c..6a29aed6 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -108,6 +108,7 @@
#define __PAGE_KERNEL_UC_MINUS (__PAGE_KERNEL | _PAGE_PCD)
#define __PAGE_KERNEL_VSYSCALL (__PAGE_KERNEL_RX | _PAGE_USER)
#define __PAGE_KERNEL_VSYSCALL_NOCACHE (__PAGE_KERNEL_VSYSCALL | _PAGE_PCD | _PAGE_PWT)
+#define __PAGE_KERNEL_VVAR (__PAGE_KERNEL_RO | _PAGE_USER)
#define __PAGE_KERNEL_LARGE (__PAGE_KERNEL | _PAGE_PSE)
#define __PAGE_KERNEL_LARGE_NOCACHE (__PAGE_KERNEL | _PAGE_CACHE_UC | _PAGE_PSE)
#define __PAGE_KERNEL_LARGE_EXEC (__PAGE_KERNEL_EXEC | _PAGE_PSE)
@@ -130,6 +131,7 @@
#define PAGE_KERNEL_LARGE_EXEC __pgprot(__PAGE_KERNEL_LARGE_EXEC)
#define PAGE_KERNEL_VSYSCALL __pgprot(__PAGE_KERNEL_VSYSCALL)
#define PAGE_KERNEL_VSYSCALL_NOCACHE __pgprot(__PAGE_KERNEL_VSYSCALL_NOCACHE)
+#define PAGE_KERNEL_VVAR __pgprot(__PAGE_KERNEL_VVAR)
#define PAGE_KERNEL_IO __pgprot(__PAGE_KERNEL_IO)
#define PAGE_KERNEL_IO_NOCACHE __pgprot(__PAGE_KERNEL_IO_NOCACHE)
diff --git a/arch/x86/include/asm/vvar.h b/arch/x86/include/asm/vvar.h
index a4eaca4..de656ac 100644
--- a/arch/x86/include/asm/vvar.h
+++ b/arch/x86/include/asm/vvar.h
@@ -10,15 +10,14 @@
* In normal kernel code, they are used like any other variable.
* In user code, they are accessed through the VVAR macro.
*
- * Each of these variables lives in the vsyscall page, and each
- * one needs a unique offset within the little piece of the page
- * reserved for vvars. Specify that offset in DECLARE_VVAR.
- * (There are 896 bytes available. If you mess up, the linker will
- * catch it.)
+ * These variables live in a page of kernel data that has an extra RO
+ * mapping for userspace. Each variable needs a unique offset within
+ * that page; specify that offset with the DECLARE_VVAR macro. (If
+ * you mess up, the linker will catch it.)
*/
-/* Offset of vars within vsyscall page */
-#define VSYSCALL_VARS_OFFSET (3072 + 128)
+/* Base address of vvars. This is not ABI. */
+#define VVAR_ADDRESS (-10*1024*1024 - 4096)
#if defined(__VVAR_KERNEL_LDS)
@@ -26,17 +25,17 @@
* right place.
*/
#define DECLARE_VVAR(offset, type, name) \
- EMIT_VVAR(name, VSYSCALL_VARS_OFFSET + offset)
+ EMIT_VVAR(name, offset)
#else
#define DECLARE_VVAR(offset, type, name) \
static type const * const vvaraddr_ ## name = \
- (void *)(VSYSCALL_START + VSYSCALL_VARS_OFFSET + (offset));
+ (void *)(VVAR_ADDRESS + (offset));
#define DEFINE_VVAR(type, name) \
- type __vvar_ ## name \
- __attribute__((section(".vsyscall_var_" #name), aligned(16)))
+ type name \
+ __attribute__((section(".vvar_" #name), aligned(16)))
#define VVAR(name) (*vvaraddr_ ## name)
@@ -49,4 +48,3 @@ DECLARE_VVAR(16, int, vgetcpu_mode)
DECLARE_VVAR(128, struct vsyscall_gtod_data, vsyscall_gtod_data)
#undef DECLARE_VVAR
-#undef VSYSCALL_VARS_OFFSET
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 1432bd4..3c1ec1c 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -161,12 +161,6 @@ SECTIONS
#define VVIRT_OFFSET (VSYSCALL_ADDR - __vsyscall_0)
#define VVIRT(x) (ADDR(x) - VVIRT_OFFSET)
-#define EMIT_VVAR(x, offset) .vsyscall_var_ ## x \
- ADDR(.vsyscall_0) + offset \
- : AT(VLOAD(.vsyscall_var_ ## x)) { \
- *(.vsyscall_var_ ## x) \
- } \
- x = VVIRT(.vsyscall_var_ ## x);
. = ALIGN(4096);
__vsyscall_0 = .;
@@ -192,17 +186,28 @@ SECTIONS
*(.vsyscall_3)
}
-#define __VVAR_KERNEL_LDS
-#include <asm/vvar.h>
-#undef __VVAR_KERNEL_LDS
-
- . = __vsyscall_0 + PAGE_SIZE;
+ . = ALIGN(__vsyscall_0 + PAGE_SIZE, PAGE_SIZE);
#undef VSYSCALL_ADDR
#undef VLOAD_OFFSET
#undef VLOAD
#undef VVIRT_OFFSET
#undef VVIRT
+
+ __vvar_page = .;
+
+#define EMIT_VVAR(name, offset) .vvar_ ## name \
+ (__vvar_page + offset) : \
+ AT(ADDR(.vvar_ ## name) - LOAD_OFFSET) { \
+ *(.vvar_ ## name) \
+ } :data
+
+#define __VVAR_KERNEL_LDS
+#include <asm/vvar.h>
+#undef __VVAR_KERNEL_LDS
+
+ . = ALIGN(__vvar_page + PAGE_SIZE, PAGE_SIZE);
+
#undef EMIT_VVAR
#endif /* CONFIG_X86_64 */
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 5f6ad03..ee22180 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -284,9 +284,14 @@ void __init map_vsyscall(void)
{
extern char __vsyscall_0;
unsigned long physaddr_page0 = __pa_symbol(&__vsyscall_0);
+ extern char __vvar_page;
+ unsigned long physaddr_vvar_page = __pa_symbol(&__vvar_page);
/* Note that VSYSCALL_MAPPED_PAGES must agree with the code below. */
__set_fixmap(VSYSCALL_FIRST_PAGE, physaddr_page0, PAGE_KERNEL_VSYSCALL);
+ __set_fixmap(VVAR_PAGE, physaddr_vvar_page, PAGE_KERNEL_VVAR);
+ BUILD_BUG_ON((unsigned long)__fix_to_virt(VVAR_PAGE) !=
+ (unsigned long)VVAR_ADDRESS);
}
static int __init vsyscall_init(void)
--
1.7.5.1
The kernel.vsyscall64 sysctl is unnecessary overhead in code that's supposed to be highly
optimized. Removing it allows us to remove one of the two syscall
instructions in the vsyscall page.
The only sensible use for it is for UML users, and it doesn't fully
address inconsistent vsyscall results on UML. The real fix for UML
is to stop using vsyscalls entirely.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/vgtod.h | 1 -
arch/x86/kernel/vsyscall_64.c | 34 +------------------------
arch/x86/vdso/vclock_gettime.c | 55 +++++++++++++++------------------------
3 files changed, 22 insertions(+), 68 deletions(-)
diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index 646b4c1..aa5add8 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -11,7 +11,6 @@ struct vsyscall_gtod_data {
time_t wall_time_sec;
u32 wall_time_nsec;
- int sysctl_enabled;
struct timezone sys_tz;
struct { /* extract of a clocksource struct */
cycle_t (*vread)(void);
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index ee22180..3e8dac7 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -53,7 +53,6 @@ DEFINE_VVAR(int, vgetcpu_mode);
DEFINE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data) =
{
.lock = SEQLOCK_UNLOCKED,
- .sysctl_enabled = 1,
};
void update_vsyscall_tz(void)
@@ -103,15 +102,6 @@ static __always_inline int gettimeofday(struct timeval *tv, struct timezone *tz)
return ret;
}
-static __always_inline long time_syscall(long *t)
-{
- long secs;
- asm volatile("syscall"
- : "=a" (secs)
- : "0" (__NR_time),"D" (t) : __syscall_clobber);
- return secs;
-}
-
static __always_inline void do_vgettimeofday(struct timeval * tv)
{
cycle_t now, base, mask, cycle_delta;
@@ -122,8 +112,7 @@ static __always_inline void do_vgettimeofday(struct timeval * tv)
seq = read_seqbegin(&VVAR(vsyscall_gtod_data).lock);
vread = VVAR(vsyscall_gtod_data).clock.vread;
- if (unlikely(!VVAR(vsyscall_gtod_data).sysctl_enabled ||
- !vread)) {
+ if (unlikely(!vread)) {
gettimeofday(tv,NULL);
return;
}
@@ -165,8 +154,6 @@ time_t __vsyscall(1) vtime(time_t *t)
{
unsigned seq;
time_t result;
- if (unlikely(!VVAR(vsyscall_gtod_data).sysctl_enabled))
- return time_syscall(t);
do {
seq = read_seqbegin(&VVAR(vsyscall_gtod_data).lock);
@@ -227,22 +214,6 @@ static long __vsyscall(3) venosys_1(void)
return -ENOSYS;
}
-#ifdef CONFIG_SYSCTL
-static ctl_table kernel_table2[] = {
- { .procname = "vsyscall64",
- .data = &vsyscall_gtod_data.sysctl_enabled, .maxlen = sizeof(int),
- .mode = 0644,
- .proc_handler = proc_dointvec },
- {}
-};
-
-static ctl_table kernel_root_table2[] = {
- { .procname = "kernel", .mode = 0555,
- .child = kernel_table2 },
- {}
-};
-#endif
-
/* Assume __initcall executes before all user space. Hopefully kmod
doesn't violate that. We'll find out if it does. */
static void __cpuinit vsyscall_set_cpu(int cpu)
@@ -301,9 +272,6 @@ static int __init vsyscall_init(void)
BUG_ON((unsigned long) &vtime != VSYSCALL_ADDR(__NR_vtime));
BUG_ON((VSYSCALL_ADDR(0) != __fix_to_virt(VSYSCALL_FIRST_PAGE)));
BUG_ON((unsigned long) &vgetcpu != VSYSCALL_ADDR(__NR_vgetcpu));
-#ifdef CONFIG_SYSCTL
- register_sysctl_table(kernel_root_table2);
-#endif
on_each_cpu(cpu_vsyscall_init, NULL, 1);
/* notifier priority > KVM */
hotcpu_notifier(cpu_vsyscall_notifier, 30);
diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index a724905..cf54813 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -116,21 +116,21 @@ notrace static noinline int do_monotonic_coarse(struct timespec *ts)
notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
{
- if (likely(gtod->sysctl_enabled))
- switch (clock) {
- case CLOCK_REALTIME:
- if (likely(gtod->clock.vread))
- return do_realtime(ts);
- break;
- case CLOCK_MONOTONIC:
- if (likely(gtod->clock.vread))
- return do_monotonic(ts);
- break;
- case CLOCK_REALTIME_COARSE:
- return do_realtime_coarse(ts);
- case CLOCK_MONOTONIC_COARSE:
- return do_monotonic_coarse(ts);
- }
+ switch (clock) {
+ case CLOCK_REALTIME:
+ if (likely(gtod->clock.vread))
+ return do_realtime(ts);
+ break;
+ case CLOCK_MONOTONIC:
+ if (likely(gtod->clock.vread))
+ return do_monotonic(ts);
+ break;
+ case CLOCK_REALTIME_COARSE:
+ return do_realtime_coarse(ts);
+ case CLOCK_MONOTONIC_COARSE:
+ return do_monotonic_coarse(ts);
+ }
+
return vdso_fallback_gettime(clock, ts);
}
int clock_gettime(clockid_t, struct timespec *)
@@ -139,7 +139,7 @@ int clock_gettime(clockid_t, struct timespec *)
notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
{
long ret;
- if (likely(gtod->sysctl_enabled && gtod->clock.vread)) {
+ if (likely(gtod->clock.vread)) {
if (likely(tv != NULL)) {
BUILD_BUG_ON(offsetof(struct timeval, tv_usec) !=
offsetof(struct timespec, tv_nsec) ||
@@ -161,27 +161,14 @@ notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
int gettimeofday(struct timeval *, struct timezone *)
__attribute__((weak, alias("__vdso_gettimeofday")));
-/* This will break when the xtime seconds get inaccurate, but that is
- * unlikely */
-
-static __always_inline long time_syscall(long *t)
-{
- long secs;
- asm volatile("syscall"
- : "=a" (secs)
- : "0" (__NR_time), "D" (t) : "cc", "r11", "cx", "memory");
- return secs;
-}
-
+/*
+ * This will break when the xtime seconds get inaccurate, but that is
+ * unlikely
+ */
notrace time_t __vdso_time(time_t *t)
{
- time_t result;
-
- if (unlikely(!VVAR(vsyscall_gtod_data).sysctl_enabled))
- return time_syscall(t);
-
/* This is atomic on x86_64 so we don't need any locks. */
- result = ACCESS_ONCE(VVAR(vsyscall_gtod_data).wall_time_sec);
+ time_t result = ACCESS_ONCE(VVAR(vsyscall_gtod_data).wall_time_sec);
if (t)
*t = result;
--
1.7.5.1
Currently the HPET mapping is a user-accessible syscall instruction
at a fixed address some of the time. A sufficiently determined
hacker might be able to guess when.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 4 ++--
arch/x86/kernel/hpet.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 6a29aed6..013286a 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -107,8 +107,8 @@
#define __PAGE_KERNEL_NOCACHE (__PAGE_KERNEL | _PAGE_PCD | _PAGE_PWT)
#define __PAGE_KERNEL_UC_MINUS (__PAGE_KERNEL | _PAGE_PCD)
#define __PAGE_KERNEL_VSYSCALL (__PAGE_KERNEL_RX | _PAGE_USER)
-#define __PAGE_KERNEL_VSYSCALL_NOCACHE (__PAGE_KERNEL_VSYSCALL | _PAGE_PCD | _PAGE_PWT)
#define __PAGE_KERNEL_VVAR (__PAGE_KERNEL_RO | _PAGE_USER)
+#define __PAGE_KERNEL_VVAR_NOCACHE (__PAGE_KERNEL_VVAR | _PAGE_PCD | _PAGE_PWT)
#define __PAGE_KERNEL_LARGE (__PAGE_KERNEL | _PAGE_PSE)
#define __PAGE_KERNEL_LARGE_NOCACHE (__PAGE_KERNEL | _PAGE_CACHE_UC | _PAGE_PSE)
#define __PAGE_KERNEL_LARGE_EXEC (__PAGE_KERNEL_EXEC | _PAGE_PSE)
@@ -130,8 +130,8 @@
#define PAGE_KERNEL_LARGE_NOCACHE __pgprot(__PAGE_KERNEL_LARGE_NOCACHE)
#define PAGE_KERNEL_LARGE_EXEC __pgprot(__PAGE_KERNEL_LARGE_EXEC)
#define PAGE_KERNEL_VSYSCALL __pgprot(__PAGE_KERNEL_VSYSCALL)
-#define PAGE_KERNEL_VSYSCALL_NOCACHE __pgprot(__PAGE_KERNEL_VSYSCALL_NOCACHE)
#define PAGE_KERNEL_VVAR __pgprot(__PAGE_KERNEL_VVAR)
+#define PAGE_KERNEL_VVAR_NOCACHE __pgprot(__PAGE_KERNEL_VVAR_NOCACHE)
#define PAGE_KERNEL_IO __pgprot(__PAGE_KERNEL_IO)
#define PAGE_KERNEL_IO_NOCACHE __pgprot(__PAGE_KERNEL_IO_NOCACHE)
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index bfe8f72..bf71830 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -71,7 +71,7 @@ static inline void hpet_set_mapping(void)
{
hpet_virt_address = ioremap_nocache(hpet_address, HPET_MMAP_SIZE);
#ifdef CONFIG_X86_64
- __set_fixmap(VSYSCALL_HPET, hpet_address, PAGE_KERNEL_VSYSCALL_NOCACHE);
+ __set_fixmap(VSYSCALL_HPET, hpet_address, PAGE_KERNEL_VVAR_NOCACHE);
#endif
}
--
1.7.5.1
venosys (vsyscall number 3) has simply segfaulted since April 2008
(a4928cff), so I'm pretty sure that nothing uses it.  And having an
empty section makes the linker
script a bit fragile.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/kernel/vmlinux.lds.S | 4 ----
arch/x86/kernel/vsyscall_64.c | 3 ---
2 files changed, 0 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 3c1ec1c..1583238 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -182,10 +182,6 @@ SECTIONS
*(.vsyscall_2)
}
- .vsyscall_3 ADDR(.vsyscall_0) + 3072: AT(VLOAD(.vsyscall_3)) {
- *(.vsyscall_3)
- }
-
. = ALIGN(__vsyscall_0 + PAGE_SIZE, PAGE_SIZE);
#undef VSYSCALL_ADDR
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 3e8dac7..2907197 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -209,9 +209,6 @@ vgetcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
return 0;
}
-static long __vsyscall(3) venosys_1(void)
-{
- return -ENOSYS;
}
/* Assume __initcall executes before all user space. Hopefully kmod
--
1.7.5.1
Jumping to 0x00 might do something depending on the following bytes.
Jumping to 0xcc is a trap. So fill the unused parts of the vsyscall
page with 0xcc to make it useless for exploits to jump there.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/kernel/vmlinux.lds.S | 16 +++++++---------
1 files changed, 7 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 1583238..8d6a0b7 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -166,22 +166,20 @@ SECTIONS
__vsyscall_0 = .;
. = VSYSCALL_ADDR;
- .vsyscall_0 : AT(VLOAD(.vsyscall_0)) {
+ .vsyscall : AT(VLOAD(.vsyscall)) {
*(.vsyscall_0)
- } :user
- . = ALIGN(L1_CACHE_BYTES);
- .vsyscall_fn : AT(VLOAD(.vsyscall_fn)) {
+ . = ALIGN(L1_CACHE_BYTES);
*(.vsyscall_fn)
- }
- .vsyscall_1 ADDR(.vsyscall_0) + 1024: AT(VLOAD(.vsyscall_1)) {
+ . = 1024;
*(.vsyscall_1)
- }
- .vsyscall_2 ADDR(.vsyscall_0) + 2048: AT(VLOAD(.vsyscall_2)) {
+
+ . = 2048;
*(.vsyscall_2)
- }
+ . = 4096; /* Pad the whole page. */
+ } :user =0xcc
. = ALIGN(__vsyscall_0 + PAGE_SIZE, PAGE_SIZE);
#undef VSYSCALL_ADDR
--
1.7.5.1
There's a fair amount of code in the vsyscall page. It contains a
syscall instruction (in the gettimeofday fallback) and who knows
what will happen if an exploit jumps into the middle of some other
code.
Reduce the risk by replacing the vsyscalls with short magic
incantations that cause the kernel to emulate the real vsyscalls.
These incantations are useless if entered in the middle.
This causes vsyscalls to be a little more expensive than real
syscalls. Fortunately sensible programs don't use them.
Less fortunately, current glibc uses the vsyscall for time() even in
dynamic binaries. So there's a CONFIG_UNSAFE_VSYSCALLS (default y)
option that leaves in the native code for time(). That should go
away in a while when glibc gets fixed.
Some care is taken to make sure that tools like valgrind and
ThreadSpotter still work.
This patch is not perfect: the vread_tsc and vread_hpet functions
are still at a fixed address. Fixing that might involve making
alternative patching work in the vDSO.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/Kconfig | 17 +++
arch/x86/include/asm/irq_vectors.h | 6 +-
arch/x86/include/asm/traps.h | 4 +
arch/x86/include/asm/vsyscall.h | 6 +
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/entry_64.S | 2 +
arch/x86/kernel/traps.c | 4 +
arch/x86/kernel/vsyscall_64.c | 253 +++++++++++++++++++++---------------
arch/x86/kernel/vsyscall_emu_64.S | 42 ++++++
9 files changed, 231 insertions(+), 104 deletions(-)
create mode 100644 arch/x86/kernel/vsyscall_emu_64.S
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc6c53a..8606921 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1650,6 +1650,23 @@ config COMPAT_VDSO
If unsure, say Y.
+config UNSAFE_VSYSCALLS
+ def_bool y
+ prompt "Unsafe fast legacy vsyscalls"
+ depends on X86_64
+ ---help---
+ Legacy user code expects to be able to issue three syscalls
+ by calling fixed addresses in kernel space. If you say N,
+ then the kernel traps and emulates these calls. If you say
+ Y, then there is actual executable code at a fixed address
+ to implement time() efficiently.
+
+ On a system with recent enough glibc (probably 2.14 or
+ newer) and no static binaries, you can say N without a
+ performance penalty to improve security
+
+ If unsure, say Y.
+
config CMDLINE_BOOL
bool "Built-in kernel command line"
---help---
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 6e976ee..a563c50 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -17,7 +17,8 @@
* Vectors 0 ... 31 : system traps and exceptions - hardcoded events
* Vectors 32 ... 127 : device interrupts
* Vector 128 : legacy int80 syscall interface
- * Vectors 129 ... INVALIDATE_TLB_VECTOR_START-1 : device interrupts
+ * Vector 204 : legacy x86_64 vsyscall emulation
+ * Vectors 129 ... INVALIDATE_TLB_VECTOR_START-1 except 204 : device interrupts
* Vectors INVALIDATE_TLB_VECTOR_START ... 255 : special interrupts
*
* 64-bit x86 has per CPU IDT tables, 32-bit has one shared IDT table.
@@ -50,6 +51,9 @@
#ifdef CONFIG_X86_32
# define SYSCALL_VECTOR 0x80
#endif
+#ifdef CONFIG_X86_64
+# define VSYSCALL_EMU_VECTOR 0xcc
+#endif
/*
* Vectors 0x30-0x3f are used for ISA interrupts.
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 0310da6..2bae0a5 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -1,6 +1,8 @@
#ifndef _ASM_X86_TRAPS_H
#define _ASM_X86_TRAPS_H
+#include <linux/kprobes.h>
+
#include <asm/debugreg.h>
#include <asm/siginfo.h> /* TRAP_TRACE, ... */
@@ -38,6 +40,7 @@ asmlinkage void alignment_check(void);
asmlinkage void machine_check(void);
#endif /* CONFIG_X86_MCE */
asmlinkage void simd_coprocessor_error(void);
+asmlinkage void emulate_vsyscall(void);
dotraplinkage void do_divide_error(struct pt_regs *, long);
dotraplinkage void do_debug(struct pt_regs *, long);
@@ -64,6 +67,7 @@ dotraplinkage void do_alignment_check(struct pt_regs *, long);
dotraplinkage void do_machine_check(struct pt_regs *, long);
#endif
dotraplinkage void do_simd_coprocessor_error(struct pt_regs *, long);
+dotraplinkage void do_emulate_vsyscall(struct pt_regs *, long);
#ifdef CONFIG_X86_32
dotraplinkage void do_iret_error(struct pt_regs *, long);
#endif
diff --git a/arch/x86/include/asm/vsyscall.h b/arch/x86/include/asm/vsyscall.h
index d555973..293ae08 100644
--- a/arch/x86/include/asm/vsyscall.h
+++ b/arch/x86/include/asm/vsyscall.h
@@ -31,6 +31,12 @@ extern struct timezone sys_tz;
extern void map_vsyscall(void);
+/* Emulation */
+static inline bool in_vsyscall_page(unsigned long addr)
+{
+ return (addr & ~(PAGE_SIZE - 1)) == VSYSCALL_START;
+}
+
#endif /* __KERNEL__ */
#endif /* _ASM_X86_VSYSCALL_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index a24521b..863128b 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -42,6 +42,7 @@ obj-$(CONFIG_X86_32) += probe_roms_32.o
obj-$(CONFIG_X86_32) += sys_i386_32.o i386_ksyms_32.o
obj-$(CONFIG_X86_64) += sys_x86_64.o x8664_ksyms_64.o
obj-$(CONFIG_X86_64) += syscall_64.o vsyscall_64.o vread_tsc_64.o
+obj-$(CONFIG_X86_64) += vsyscall_emu_64.o
obj-y += bootflag.o e820.o
obj-y += pci-dma.o quirks.o topology.o kdebugfs.o
obj-y += alternative.o i8253.o pci-nommu.o hw_breakpoint.o
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 72c4a77..e949793 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1123,6 +1123,8 @@ zeroentry spurious_interrupt_bug do_spurious_interrupt_bug
zeroentry coprocessor_error do_coprocessor_error
errorentry alignment_check do_alignment_check
zeroentry simd_coprocessor_error do_simd_coprocessor_error
+zeroentry emulate_vsyscall do_emulate_vsyscall
+
/* Reload gs selector with exception handling */
/* edi: new selector */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index b9b6716..72f0f6a 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -872,6 +872,10 @@ void __init trap_init(void)
set_bit(SYSCALL_VECTOR, used_vectors);
#endif
+ BUG_ON(test_bit(VSYSCALL_EMU_VECTOR, used_vectors));
+ set_system_intr_gate(VSYSCALL_EMU_VECTOR, &emulate_vsyscall);
+ set_bit(VSYSCALL_EMU_VECTOR, used_vectors);
+
/*
* Should be a barrier for any external CPU state:
*/
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 2907197..7c699a9 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -32,6 +32,8 @@
#include <linux/cpu.h>
#include <linux/smp.h>
#include <linux/notifier.h>
+#include <linux/syscalls.h>
+#include <linux/ratelimit.h>
#include <asm/vsyscall.h>
#include <asm/pgtable.h>
@@ -44,10 +46,7 @@
#include <asm/desc.h>
#include <asm/topology.h>
#include <asm/vgtod.h>
-
-#define __vsyscall(nr) \
- __attribute__ ((unused, __section__(".vsyscall_" #nr))) notrace
-#define __syscall_clobber "r11","cx","memory"
+#include <asm/traps.h>
DEFINE_VVAR(int, vgetcpu_mode);
DEFINE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data) =
@@ -84,73 +83,45 @@ void update_vsyscall(struct timespec *wall_time, struct timespec *wtm,
write_sequnlock_irqrestore(&vsyscall_gtod_data.lock, flags);
}
-/* RED-PEN may want to readd seq locking, but then the variable should be
- * write-once.
- */
-static __always_inline void do_get_tz(struct timezone * tz)
+static void warn_bad_vsyscall(const char *level, struct pt_regs *regs,
+ const char *message)
{
- *tz = VVAR(vsyscall_gtod_data).sys_tz;
+ struct task_struct *tsk;
+ static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+ if (!show_unhandled_signals || !__ratelimit(&rs))
+ return;
+
+ tsk = current;
+
+ printk("%s%s[%d] %s ip:%lx sp:%lx ax:%lx si:%lx di:%lx",
+ level, tsk->comm, task_pid_nr(tsk),
+ message,
+ regs->ip - 2, regs->sp, regs->ax, regs->si, regs->di);
+ if (!in_vsyscall_page(regs->ip - 2))
+ print_vma_addr(" in ", regs->ip - 2);
+ printk("\n");
}
-static __always_inline int gettimeofday(struct timeval *tv, struct timezone *tz)
-{
- int ret;
- asm volatile("syscall"
- : "=a" (ret)
- : "0" (__NR_gettimeofday),"D" (tv),"S" (tz)
- : __syscall_clobber );
- return ret;
-}
+/* al values for each vsyscall; see vsyscall_emu_64.S for why. */
+static u8 vsyscall_nr_to_al[] = {0xcc, 0xce, 0xf0};
-static __always_inline void do_vgettimeofday(struct timeval * tv)
+static int al_to_vsyscall_nr(u8 al)
{
- cycle_t now, base, mask, cycle_delta;
- unsigned seq;
- unsigned long mult, shift, nsec;
- cycle_t (*vread)(void);
- do {
- seq = read_seqbegin(&VVAR(vsyscall_gtod_data).lock);
-
- vread = VVAR(vsyscall_gtod_data).clock.vread;
- if (unlikely(!vread)) {
- gettimeofday(tv,NULL);
- return;
- }
-
- now = vread();
- base = VVAR(vsyscall_gtod_data).clock.cycle_last;
- mask = VVAR(vsyscall_gtod_data).clock.mask;
- mult = VVAR(vsyscall_gtod_data).clock.mult;
- shift = VVAR(vsyscall_gtod_data).clock.shift;
-
- tv->tv_sec = VVAR(vsyscall_gtod_data).wall_time_sec;
- nsec = VVAR(vsyscall_gtod_data).wall_time_nsec;
- } while (read_seqretry(&VVAR(vsyscall_gtod_data).lock, seq));
-
- /* calculate interval: */
- cycle_delta = (now - base) & mask;
- /* convert to nsecs: */
- nsec += (cycle_delta * mult) >> shift;
-
- while (nsec >= NSEC_PER_SEC) {
- tv->tv_sec += 1;
- nsec -= NSEC_PER_SEC;
- }
- tv->tv_usec = nsec / NSEC_PER_USEC;
+ int i;
+ for (i = 0; i < ARRAY_SIZE(vsyscall_nr_to_al); i++)
+ if (vsyscall_nr_to_al[i] == al)
+ return i;
+ return -1;
}
-int __vsyscall(0) vgettimeofday(struct timeval * tv, struct timezone * tz)
-{
- if (tv)
- do_vgettimeofday(tv);
- if (tz)
- do_get_tz(tz);
- return 0;
-}
+#ifdef CONFIG_UNSAFE_VSYSCALLS
/* This will break when the xtime seconds get inaccurate, but that is
* unlikely */
-time_t __vsyscall(1) vtime(time_t *t)
+time_t __attribute__ ((unused, __section__(".vsyscall_1"))) notrace
+vtime(time_t *t)
{
unsigned seq;
time_t result;
@@ -167,48 +138,127 @@ time_t __vsyscall(1) vtime(time_t *t)
return result;
}
-/* Fast way to get current CPU and node.
- This helps to do per node and per CPU caches in user space.
- The result is not guaranteed without CPU affinity, but usually
- works out because the scheduler tries to keep a thread on the same
- CPU.
+#endif /* CONFIG_UNSAFE_VSYSCALLS */
+
+/* If CONFIG_UNSAFE_VSYSCALLS=y, then this is incorrect for vsyscall_nr == 1. */
+static inline unsigned long vsyscall_intcc_addr(int vsyscall_nr)
+{
+ return VSYSCALL_START + 1024*vsyscall_nr + 2;
+}
- tcache must point to a two element sized long array.
- All arguments can be NULL. */
-long __vsyscall(2)
-vgetcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
+void dotraplinkage do_emulate_vsyscall(struct pt_regs *regs, long error_code)
{
- unsigned int p;
- unsigned long j = 0;
-
- /* Fast cache - only recompute value once per jiffies and avoid
- relatively costly rdtscp/cpuid otherwise.
- This works because the scheduler usually keeps the process
- on the same CPU and this syscall doesn't guarantee its
- results anyways.
- We do this here because otherwise user space would do it on
- its own in a likely inferior way (no access to jiffies).
- If you don't like it pass NULL. */
- if (tcache && tcache->blob[0] == (j = VVAR(jiffies))) {
- p = tcache->blob[1];
- } else if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) {
- /* Load per CPU data from RDTSCP */
- native_read_tscp(&p);
- } else {
- /* Load per CPU data from GDT */
- asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
+ static DEFINE_RATELIMIT_STATE(rs, 3600 * HZ, 3);
+ struct task_struct *tsk;
+ const char *vsyscall_name;
+ int vsyscall_nr;
+ long ret;
+
+ /* Kernel code must never get here. */
+ BUG_ON(!user_mode(regs));
+
+ local_irq_enable();
+
+ vsyscall_nr = al_to_vsyscall_nr(regs->ax & 0xff);
+ if (vsyscall_nr < 0) {
+ warn_bad_vsyscall(KERN_WARNING, regs, "illegal int 0xcc "
+ "(exploit attempt?)");
+ goto sigsegv;
}
- if (tcache) {
- tcache->blob[0] = j;
- tcache->blob[1] = p;
+
+ if (regs->ip - 2 != vsyscall_intcc_addr(vsyscall_nr)) {
+ if (in_vsyscall_page(regs->ip - 2)) {
+ /* This should not be possible. */
+ warn_bad_vsyscall(KERN_WARNING, regs,
+ "int 0xcc bogus magic "
+ "(exploit attempt?)");
+ goto sigsegv;
+ } else {
+ /*
+ * We allow the call because tools like ThreadSpotter
+ * might copy the int 0xcc instruction to user memory.
+ * We make it annoying, though, to try to persuade
+ * the authors to stop doing that...
+ */
+ warn_bad_vsyscall(KERN_WARNING, regs,
+ "int 0xcc in user code "
+ "(exploit attempt? legacy "
+ "instrumented code?)");
+ }
}
- if (cpu)
- *cpu = p & 0xfff;
- if (node)
- *node = p >> 12;
- return 0;
-}
+ tsk = current;
+ if (tsk->seccomp.mode) {
+ do_exit(SIGKILL);
+ goto out;
+ }
+
+ switch (vsyscall_nr) {
+ case 0:
+ vsyscall_name = "gettimeofday";
+ ret = sys_gettimeofday(
+ (struct timeval __user *)regs->di,
+ (struct timezone __user *)regs->si);
+ break;
+
+ case 1:
+#ifdef CONFIG_UNSAFE_VSYSCALLS
+ warn_bad_vsyscall(KERN_WARNING, regs, "bogus time() vsyscall "
+ "emulation (exploit attempt?)");
+ goto sigsegv;
+#else
+ vsyscall_name = "time";
+ ret = sys_time((time_t __user *)regs->di);
+ break;
+#endif
+
+ case 2:
+ vsyscall_name = "getcpu";
+ ret = sys_getcpu((unsigned __user *)regs->di,
+ (unsigned __user *)regs->si,
+ 0);
+ break;
+
+ default:
+ BUG();
+ }
+
+ if (ret == -EFAULT) {
+ /*
+ * Bad news -- userspace fed a bad pointer to a vsyscall.
+ *
+ * With a real vsyscall, that would have caused SIGSEGV.
+ * To make writing reliable exploits using the emulated
+ * vsyscalls harder, generate SIGSEGV here as well.
+ */
+ warn_bad_vsyscall(KERN_INFO, regs,
+ "vsyscall fault (exploit attempt?)");
+ goto sigsegv;
+ }
+
+ regs->ax = ret;
+
+ if (__ratelimit(&rs)) {
+ unsigned long caller;
+ if (get_user(caller, (unsigned long __user *)regs->sp))
+ caller = 0; /* no need to crash on this fault. */
+ printk(KERN_INFO "%s[%d] emulated legacy vsyscall %s(); "
+ "upgrade your code to avoid a performance hit. "
+ "ip:%lx sp:%lx caller:%lx",
+ tsk->comm, task_pid_nr(tsk), vsyscall_name,
+ regs->ip - 2, regs->sp, caller);
+ if (caller)
+ print_vma_addr(" in ", caller);
+ printk("\n");
+ }
+
+out:
+ local_irq_disable();
+ return;
+
+sigsegv:
+ regs->ip -= 2; /* The faulting instruction should be the int 0xcc. */
+ force_sig(SIGSEGV, current);
}
/* Assume __initcall executes before all user space. Hopefully kmod
@@ -264,11 +314,8 @@ void __init map_vsyscall(void)
static int __init vsyscall_init(void)
{
- BUG_ON(((unsigned long) &vgettimeofday !=
- VSYSCALL_ADDR(__NR_vgettimeofday)));
- BUG_ON((unsigned long) &vtime != VSYSCALL_ADDR(__NR_vtime));
- BUG_ON((VSYSCALL_ADDR(0) != __fix_to_virt(VSYSCALL_FIRST_PAGE)));
- BUG_ON((unsigned long) &vgetcpu != VSYSCALL_ADDR(__NR_vgetcpu));
+ BUG_ON(VSYSCALL_ADDR(0) != __fix_to_virt(VSYSCALL_FIRST_PAGE));
+
on_each_cpu(cpu_vsyscall_init, NULL, 1);
/* notifier priority > KVM */
hotcpu_notifier(cpu_vsyscall_notifier, 30);
diff --git a/arch/x86/kernel/vsyscall_emu_64.S b/arch/x86/kernel/vsyscall_emu_64.S
new file mode 100644
index 0000000..7ebde61
--- /dev/null
+++ b/arch/x86/kernel/vsyscall_emu_64.S
@@ -0,0 +1,42 @@
+/*
+ * vsyscall_emu_64.S: Vsyscall emulation page
+ * Copyright (c) 2011 Andy Lutomirski
+ * Subject to the GNU General Public License, version 2
+*/
+
+#include <linux/linkage.h>
+#include <asm/irq_vectors.h>
+
+/*
+ * These magic incantations are chosen so that they fault if entered anywhere
+ * other than an instruction boundary. The movb instruction is two bytes, and
+ * the int imm8 instruction is also two bytes, so the only misaligned places
+ * to enter are the immediate values for the two instructions. 0xcc is int3
+ * (always faults), 0xce is into (faults on x86-64, and 32-bit code can't get
+ * here), and 0xf0 is lock (lock int is invalid).
+ *
+ * The unused parts of the page are filled with 0xcc by the linker script.
+ */
+
+.section .vsyscall_0, "a"
+ENTRY(vsyscall_0)
+ movb $0xcc, %al
+ int $VSYSCALL_EMU_VECTOR
+ ret
+END(vsyscall_0)
+
+#ifndef CONFIG_UNSAFE_VSYSCALLS
+.section .vsyscall_1, "a"
+ENTRY(vsyscall_1)
+ movb $0xce, %al
+ int $VSYSCALL_EMU_VECTOR
+ ret
+END(vsyscall_1)
+#endif
+
+.section .vsyscall_2, "a"
+ENTRY(vsyscall_2)
+ movb $0xf0, %al
+ int $VSYSCALL_EMU_VECTOR
+ ret
+END(vsyscall_2)
--
1.7.5.1
This is not a security feature. It's to prevent the int 0xcc
sequence from becoming ABI.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/kernel/vsyscall_64.c | 54 +++++++++++++++++++++++++++++++++++++++-
1 files changed, 52 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 7c699a9..8b8ad99 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -34,6 +34,8 @@
#include <linux/notifier.h>
#include <linux/syscalls.h>
#include <linux/ratelimit.h>
+#include <linux/random.h>
+#include <linux/highmem.h>
#include <asm/vsyscall.h>
#include <asm/pgtable.h>
@@ -54,6 +56,8 @@ DEFINE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data) =
.lock = SEQLOCK_UNLOCKED,
};
+static u8 vsyscall_nr_offset; /* Cyclic permutation of al. */
+
void update_vsyscall_tz(void)
{
unsigned long flags;
@@ -95,10 +99,11 @@ static void warn_bad_vsyscall(const char *level, struct pt_regs *regs,
tsk = current;
- printk("%s%s[%d] %s ip:%lx sp:%lx ax:%lx si:%lx di:%lx",
+ printk("%s%s[%d] %s ip:%lx sp:%lx ax:%lx offset:%d si:%lx di:%lx",
level, tsk->comm, task_pid_nr(tsk),
message,
- regs->ip - 2, regs->sp, regs->ax, regs->si, regs->di);
+ regs->ip - 2, regs->sp, regs->ax, (int)vsyscall_nr_offset,
+ regs->si, regs->di);
if (!in_vsyscall_page(regs->ip - 2))
print_vma_addr(" in ", regs->ip - 2);
printk("\n");
@@ -116,6 +121,17 @@ static int al_to_vsyscall_nr(u8 al)
return -1;
}
+static inline u8 mangle_al(u8 al)
+{
+ int nr = al_to_vsyscall_nr(al);
+ return vsyscall_nr_to_al[(nr + vsyscall_nr_offset) % 3];
+}
+
+static inline int demangle_vsyscall_nr(int nr)
+{
+ return (nr + 3 - vsyscall_nr_offset) % 3;
+}
+
#ifdef CONFIG_UNSAFE_VSYSCALLS
/* This will break when the xtime seconds get inaccurate, but that is
@@ -165,6 +181,7 @@ void dotraplinkage do_emulate_vsyscall(struct pt_regs *regs, long error_code)
"(exploit attempt?)");
goto sigsegv;
}
+ vsyscall_nr = demangle_vsyscall_nr(vsyscall_nr);
if (regs->ip - 2 != vsyscall_intcc_addr(vsyscall_nr)) {
if (in_vsyscall_page(regs->ip - 2)) {
@@ -312,10 +329,43 @@ void __init map_vsyscall(void)
(unsigned long)VVAR_ADDRESS);
}
+static void __init mangle_vsyscall_movb(void *mapping,
+ unsigned long movb_addr, u8 initial)
+{
+ u8 *imm8;
+ BUG_ON(movb_addr >= 4095);
+
+ imm8 = (char *)(mapping) + movb_addr + 1;
+
+ BUG_ON(*imm8 != initial);
+ *imm8 = mangle_al(*imm8);
+}
+
static int __init vsyscall_init(void)
{
+ extern char __vsyscall_0;
+ void *mapping;
+ struct page *vsyscall_page;
+
BUG_ON(VSYSCALL_ADDR(0) != __fix_to_virt(VSYSCALL_FIRST_PAGE));
+ /*
+ * Randomize the magic al values for int 0xcc invocation. This
+ * isn't really a security feature; it's to make sure that
+ * dynamic binary instrumentation tools don't start to think
+ * that the int 0xcc magic incantation is ABI.
+ */
+ vsyscall_nr_offset = get_random_int() % 3;
+ vsyscall_page = pfn_to_page(__pa_symbol(&__vsyscall_0) >> PAGE_SHIFT);
+ mapping = kmap_atomic(vsyscall_page);
+ /* It's easier to hardcode the addresses -- they're ABI. */
+ mangle_vsyscall_movb(mapping, 0, 0xcc);
+#ifndef CONFIG_UNSAFE_VSYSCALLS
+ mangle_vsyscall_movb(mapping, 1024, 0xce);
+#endif
+ mangle_vsyscall_movb(mapping, 2048, 0xf0);
+ kunmap_atomic(mapping);
+
on_each_cpu(cpu_vsyscall_init, NULL, 1);
/* notifier priority > KVM */
hotcpu_notifier(cpu_vsyscall_notifier, 30);
--
1.7.5.1
Signed-off-by: Andy Lutomirski <[email protected]>
---
Documentation/feature-removal-schedule.txt | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 492e81d..db7a122 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -580,3 +580,12 @@ Why: These legacy callbacks should no longer be used as i2c-core offers
Who: Jean Delvare <[email protected]>
----------------------------
+
+What: CONFIG_UNSAFE_VSYSCALLS (x86_64)
+When: When glibc 2.14 or newer is ubiquitous. Perhaps mid-2012.
+Why: Having user-executable code at a fixed address is a security problem.
+ Turning off CONFIG_UNSAFE_VSYSCALLS mostly removes the risk but will
+ make the time() function slower on glibc versions 2.13 and below.
+Who: Andy Lutomirski <[email protected]>
+
+----------------------------
--
1.7.5.1
* Andy Lutomirski <[email protected]> wrote:
> Patch 1 is just a bugfix from the last vdso series.
> The bug should be harmless but it's pretty dumb. This is almost
> certainly 3.0 material.
Please make the patches apply cleanly to the latest x86 tree:
http://people.redhat.com/mingo/tip.git/README
they currently don't - patch #4 does not apply.
Thanks,
Ingo
On Tue, May 31, 2011 at 9:45 AM, Ingo Molnar <[email protected]> wrote:
>
> * Andy Lutomirski <[email protected]> wrote:
>
>> Patch 1 is just a bugfix from the last vdso series.
>> The bug should be harmless but it's pretty dumb.  This is almost
>> certainly 3.0 material.
>
> Please make the patches apply cleanly to the latest x86 tree:
>
>  http://people.redhat.com/mingo/tip.git/README
>
> they currently don't - patch #4 does not apply.
The rebase applies with git mergetool's help without manual intervention.
I'll compile and test the rebased series before I send it out as v4
(should be about an hour).
--Andy
>
> Thanks,
>
>        Ingo
>
On Tue, May 31, 2011 at 9:16 AM, Andy Lutomirski <[email protected]> wrote:
> There's a fair amount of code in the vsyscall page. It contains a
> syscall instruction (in the gettimeofday fallback) and who knows
> what will happen if an exploit jumps into the middle of some other
> code.
>
> Reduce the risk by replacing the vsyscalls with short magic
> incantations that cause the kernel to emulate the real vsyscalls.
> These incantations are useless if entered in the middle.
How about remapping the vsyscall page into a random page in the
modules area, and make the fixed page simply have stubs that jump to
the code in that page. That would solve the fixed address syscall
problem without any more overhead.
--
Brian Gerst
On Wed, Jun 1, 2011 at 7:54 AM, Brian Gerst <[email protected]> wrote:
> On Tue, May 31, 2011 at 9:16 AM, Andy Lutomirski <[email protected]> wrote:
>> There's a fair amount of code in the vsyscall page.  It contains a
>> syscall instruction (in the gettimeofday fallback) and who knows
>> what will happen if an exploit jumps into the middle of some other
>> code.
>>
>> Reduce the risk by replacing the vsyscalls with short magic
>> incantations that cause the kernel to emulate the real vsyscalls.
>> These incantations are useless if entered in the middle.
>
> How about remapping the vsyscall page into a random page in the
> modules area, and make the fixed page simply have stubs that jump to
> the code in that page.  That would solve the fixed address syscall
> problem without any more overhead.
It wouldn't give any protection against local attacks, though.
--Andy
>
> --
> Brian Gerst
>
On Tue, 31 May 2011 09:16:04 EDT, Andy Lutomirski said:
> +What: CONFIG_UNSAFE_VSYSCALLS (x86_64)
Wow. I went nuts trying to find where this was because I couldn't find it in
Linus's tree I pulled a little while ago, before I realized you added it in patch 8
and deprecated it in patch 10.
Speaking of which:
+ On a system with recent enough glibc (probably 2.14 or
+ newer) and no static binaries, you can say N without a
+ performance penalty to improve security
So I checked my laptop (Fedora 16 Rawhide), and found a bunch of static binaries. The ones
that look like people may care:
# file /sbin/* | grep statically
/sbin/grub: ELF 32-bit LSB executable, Intel 80386, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.18, stripped
/sbin/insmod.static: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.32, stripped
/sbin/ldconfig: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.32, stripped
/sbin/sln: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.32, stripped
Might be a challenge to get rid of them though.  Unless we don't care anymore
about "use a statically linked ldconfig to fix a corrupted ld.so.cache" and
"reboot from a rescue disk" is the only choice.  I think the insmod.static ends up
getting used in initrds to get the root filesystem mounted, so we may care about
that as well.. ;)
On Wed, Jun 1, 2011 at 1:35 PM, <[email protected]> wrote:
> On Tue, 31 May 2011 09:16:04 EDT, Andy Lutomirski said:
>
>> +What:        CONFIG_UNSAFE_VSYSCALLS (x86_64)
>
> Wow. I went nuts trying to find where this was because I couldn't find it in
> Linus's tree I pulled a little while ago, before I realized you added it in patch 8
> and deprecated it in patch 10.
>
> Speaking of which:
>
> +         On a system with recent enough glibc (probably 2.14 or
> +         newer) and no static binaries, you can say N without a
> +         performance penalty to improve security
>
> So I checked my laptop (Fedora 16 Rawhide), and found a bunch of static binaries. The ones
> that look like people may care:
>
> # file /sbin/* | grep statically
> /sbin/grub:                ELF 32-bit LSB executable, Intel 80386, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.18, stripped
> /sbin/insmod.static:       ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.32, stripped
> /sbin/ldconfig:            ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.32, stripped
> /sbin/sln:                 ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.32, stripped
>
The binaries will still work -- they'll just take a small performance
hit (~220ns on Sandy Bridge) every time they call time().
Somehow I doubt that grub, ldconfig, and sln care much.
The only things I can see being a problem are big enterprise packages
like, say, Oracle's server, but I think that most of those things are
dynamically linked anyway.
Also, on systems that use a non-tsc clocksource, this hit should be
lost in the noise, except for time(). On my systems, HPET already
takes nearly a microsecond to read and the acpi clock takes over 3us.
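A rough way to measure that overhead yourself (a sketch, not from the
series; the fixed address is the vsyscall ABI slot for time(), and the
numbers will obviously vary by machine and clocksource):
    #include <stdio.h>
    #include <time.h>

    typedef time_t (*time_fn_t)(time_t *);

    static double ns_per_call(time_fn_t fn, long iters)
    {
            struct timespec a, b;
            long i;

            clock_gettime(CLOCK_MONOTONIC, &a);
            for (i = 0; i < iters; i++)
                    fn(NULL);
            clock_gettime(CLOCK_MONOTONIC, &b);
            return ((b.tv_sec - a.tv_sec) * 1e9 +
                    (b.tv_nsec - a.tv_nsec)) / iters;
    }

    int main(void)
    {
            time_fn_t vtime = (time_fn_t)0xffffffffff600400UL; /* vsyscall ABI */

            printf("vsyscall time(): %.0f ns/call\n", ns_per_call(vtime, 1000000));
            printf("libc time():     %.0f ns/call\n", ns_per_call(time, 1000000));
            return 0;
    }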
> Might be a challenge to get rid of them though.  Unless we don't care anymore
> about "use a statically linked ldconfig to fix a corrupted ld.so.cache" and
> "reboot from a rescue disk" is the only choice.  I think the insmod.static ends up
> getting used in initrds to get the root filesystem mounted, so we may care about
> that as well.. ;)
You'll just have to wait a few hundred extra ns to get your system fixed :)
--Andy
On Wed, 01 Jun 2011 13:41:56 EDT, Andrew Lutomirski said:
>> +         On a system with recent enough glibc (probably 2.14 or
>> +         newer) and no static binaries, you can say N without a
>> +         performance penalty to improve security
>>
>> So I checked my laptop (Fedora 16 Rawhide), and found a bunch of static binaries. The ones
>> that look like people may care:
> The binaries will still work -- they'll just take a small performance
> hit (~220ns on Sandy Bridge) every time they call time().
Ah. I misparsed the Kconfig help - I read it as "If you have no static binaries,
setting this to N doesn't introduce a performance hit" (with an implied "if you
have static binaries, this will hose you"). Adding "Static binaries will continue
to work at a very small performance penalty" would probably help.
* [email protected] <[email protected]> wrote:
> On Wed, 01 Jun 2011 13:41:56 EDT, Andrew Lutomirski said:
>
> >> +         On a system with recent enough glibc (probably 2.14 or
> >> +         newer) and no static binaries, you can say N without a
> >> +         performance penalty to improve security
> >>
> >> So I checked my laptop (Fedora 16 Rawhide), and found a bunch of static binaries. The ones
> >> that look like people may care:
>
> > The binaries will still work -- they'll just take a small performance
> > hit (~220ns on Sandy Bridge) every time they call time().
>
> Ah. I misparsed the Kconfig help - I read it as "If you have no
> static binaries, setting this to N doesn't introduce a performance
> hit" (with an implied "if you have static binaries, this will hose
> you"). Adding "Static binaries will continue to work at a very
> small performance penalty" would probably help.
Yeah, would be nice to add that clarification. (or better yet,
reformulate it in a way that makes it really obvious from the get
go.)
Thanks,
Ingo