2015-07-22 19:23:54

by Andy Lutomirski

Subject: [PATCH v3 0/3] x86: modify_ldt improvement, test, and config option

Here's v3. It fixes the "dazed and confused" issue, I hope. It's also
probably a good general attack surface reduction, and it replaces some
scary code with IMO less scary code.

Also, servers and embedded systems should probably turn off modify_ldt.
This makes that possible.

Xen people, can you take a look at this?

Changes from v2:
- Allocate ldt_struct and the LDT entries separately. This should fix Xen.
- Stop using write_ldt_entry, since I'm pretty sure it's unnecessary now
that we no longer mutate an in-use LDT. (Xen people, can you check?)

Changes from v1:
- The config option is new.
- The test case is new.
- Fixed a missing allocation failure check.
- Fixed a use-after-free on fork().

Andy Lutomirski (3):
x86/ldt: Make modify_ldt synchronous
x86/ldt: Make modify_ldt optional
selftests/x86, x86/ldt: Add a selftest for modify_ldt

arch/x86/Kconfig | 17 ++
arch/x86/include/asm/desc.h | 15 --
arch/x86/include/asm/mmu.h | 5 +-
arch/x86/include/asm/mmu_context.h | 68 ++++-
arch/x86/kernel/Makefile | 3 +-
arch/x86/kernel/cpu/common.c | 4 +-
arch/x86/kernel/cpu/perf_event.c | 16 +-
arch/x86/kernel/ldt.c | 258 ++++++++++--------
arch/x86/kernel/process_64.c | 6 +-
arch/x86/kernel/step.c | 8 +-
arch/x86/power/cpu.c | 3 +-
kernel/sys_ni.c | 1 +
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/ldt_gdt.c | 492 ++++++++++++++++++++++++++++++++++
14 files changed, 747 insertions(+), 151 deletions(-)
create mode 100644 tools/testing/selftests/x86/ldt_gdt.c

--
2.4.3


2015-07-22 19:24:00

by Andy Lutomirski

Subject: [PATCH v3 1/3] x86/ldt: Make modify_ldt synchronous

modify_ldt has questionable locking and does not synchronize
threads. Improve it: redesign the locking and synchronize all
threads' LDTs using an IPI on all modifications.

This will dramatically slow down modify_ldt in multithreaded
programs, but there shouldn't be any multithreaded programs that
care about modify_ldt's performance in the first place.
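
In outline, the new scheme never mutates a live LDT: modify_ldt
allocates a replacement, publishes it with a release store, and kicks
every CPU running the mm (condensed from the diff below):

	/* Reader, on context switch, with IRQs off: */
	ldt = lockless_dereference(mm->context.ldt);
	if (ldt)
		set_ldt(ldt->entries, ldt->size);
	else
		clear_LDT();

	/* Writer, in write_ldt() under mm->context.lock: */
	smp_store_release(&mm->context.ldt, new_ldt);
	flush_ldt(mm);				/* reload this CPU's LDT */
	smp_call_function(flush_ldt, mm, 1);	/* IPI the other CPUs */
	free_ldt_struct(old_ldt);		/* all CPUs have reloaded */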

Cc: [email protected]
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/desc.h | 15 ---
arch/x86/include/asm/mmu.h | 3 +-
arch/x86/include/asm/mmu_context.h | 53 +++++++-
arch/x86/kernel/cpu/common.c | 4 +-
arch/x86/kernel/cpu/perf_event.c | 12 +-
arch/x86/kernel/ldt.c | 258 ++++++++++++++++++++-----------------
arch/x86/kernel/process_64.c | 4 +-
arch/x86/kernel/step.c | 6 +-
arch/x86/power/cpu.c | 3 +-
9 files changed, 209 insertions(+), 149 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index a0bf89fd2647..4e10d73cf018 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -280,21 +280,6 @@ static inline void clear_LDT(void)
set_ldt(NULL, 0);
}

-/*
- * load one particular LDT into the current CPU
- */
-static inline void load_LDT_nolock(mm_context_t *pc)
-{
- set_ldt(pc->ldt, pc->size);
-}
-
-static inline void load_LDT(mm_context_t *pc)
-{
- preempt_disable();
- load_LDT_nolock(pc);
- preempt_enable();
-}
-
static inline unsigned long get_desc_base(const struct desc_struct *desc)
{
return (unsigned)(desc->base0 | ((desc->base1) << 16) | ((desc->base2) << 24));
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 09b9620a73b4..364d27481a52 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -9,8 +9,7 @@
* we put the segment information here.
*/
typedef struct {
- void *ldt;
- int size;
+ struct ldt_struct *ldt;

#ifdef CONFIG_X86_64
/* True if mm supports a task running in 32 bit compatibility mode. */
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 804a3a6030ca..3fcff70c398e 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -34,6 +34,49 @@ static inline void load_mm_cr4(struct mm_struct *mm) {}
#endif

/*
+ * ldt_structs can be allocated, used, and freed, but they are never
+ * modified while live.
+ */
+struct ldt_struct {
+ /*
+ * Xen requires page-aligned LDTs with special permissions. This is
+ * needed to prevent us from installing evil descriptors such as
+ * call gates. On native, we could merge the ldt_struct and LDT
+ * allocations, but it's not worth trying to optimize.
+ */
+ struct desc_struct *entries;
+ int size;
+};
+
+static inline void load_mm_ldt(struct mm_struct *mm)
+{
+ struct ldt_struct *ldt;
+ DEBUG_LOCKS_WARN_ON(!irqs_disabled());
+
+ /* lockless_dereference synchronizes with smp_store_release */
+ ldt = lockless_dereference(mm->context.ldt);
+
+ /*
+ * Any change to mm->context.ldt is followed by an IPI to all
+ * CPUs with the mm active. The LDT will not be freed until
+ * after the IPI is handled by all such CPUs. This means that,
+ * if the ldt_struct changes before we return, the values we see
+ * will be safe, and the new values will be loaded before we run
+ * any user code.
+ *
+ * NB: don't try to convert this to use RCU without extreme care.
+ * We would still need IRQs off, because we don't want to change
+ * the local LDT after an IPI loaded a newer value than the one
+ * that we can see.
+ */
+
+ if (unlikely(ldt))
+ set_ldt(ldt->entries, ldt->size);
+ else
+ clear_LDT();
+}
+
+/*
* Used for LDT copy/destruction.
*/
int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
@@ -78,12 +121,12 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
* was called and then modify_ldt changed
* prev->context.ldt but suppressed an IPI to this CPU.
* In this case, prev->context.ldt != NULL, because we
- * never free an LDT while the mm still exists. That
- * means that next->context.ldt != prev->context.ldt,
- * because mms never share an LDT.
+ * never set context.ldt to NULL while the mm still
+ * exists. That means that next->context.ldt !=
+ * prev->context.ldt, because mms never share an LDT.
*/
if (unlikely(prev->context.ldt != next->context.ldt))
- load_LDT_nolock(&next->context);
+ load_mm_ldt(next);
}
#ifdef CONFIG_SMP
else {
@@ -106,7 +149,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
load_cr3(next->pgd);
trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
load_mm_cr4(next);
- load_LDT_nolock(&next->context);
+ load_mm_ldt(next);
}
}
#endif
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 922c5e0cea4c..cb9e5df42dd2 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1410,7 +1410,7 @@ void cpu_init(void)
load_sp0(t, &current->thread);
set_tss_desc(cpu, t);
load_TR_desc();
- load_LDT(&init_mm.context);
+ load_mm_ldt(&init_mm);

clear_all_debug_regs();
dbg_restore_debug_regs();
@@ -1459,7 +1459,7 @@ void cpu_init(void)
load_sp0(t, thread);
set_tss_desc(cpu, t);
load_TR_desc();
- load_LDT(&init_mm.context);
+ load_mm_ldt(&init_mm);

t->x86_tss.io_bitmap_base = offsetof(struct tss_struct, io_bitmap);

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 3658de47900f..9469dfa55607 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -2179,21 +2179,25 @@ static unsigned long get_segment_base(unsigned int segment)
int idx = segment >> 3;

if ((segment & SEGMENT_TI_MASK) == SEGMENT_LDT) {
+ struct ldt_struct *ldt;
+
if (idx > LDT_ENTRIES)
return 0;

- if (idx > current->active_mm->context.size)
+ /* IRQs are off, so this synchronizes with smp_store_release */
+ ldt = lockless_dereference(current->active_mm->context.ldt);
+ if (!ldt || idx > ldt->size)
return 0;

- desc = current->active_mm->context.ldt;
+ desc = &ldt->entries[idx];
} else {
if (idx > GDT_ENTRIES)
return 0;

- desc = raw_cpu_ptr(gdt_page.gdt);
+ desc = raw_cpu_ptr(gdt_page.gdt) + idx;
}

- return get_desc_base(desc + idx);
+ return get_desc_base(desc);
}

#ifdef CONFIG_COMPAT
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index c37886d759cc..3ae308029dee 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -12,6 +12,7 @@
#include <linux/string.h>
#include <linux/mm.h>
#include <linux/smp.h>
+#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/uaccess.h>

@@ -20,82 +21,83 @@
#include <asm/mmu_context.h>
#include <asm/syscalls.h>

-#ifdef CONFIG_SMP
static void flush_ldt(void *current_mm)
{
- if (current->active_mm == current_mm)
- load_LDT(&current->active_mm->context);
+ if (current->active_mm == current_mm) {
+ /* context.lock is held for us, so we don't need any locking. */
+ mm_context_t *pc = &current->active_mm->context;
+ set_ldt(pc->ldt->entries, pc->ldt->size);
+ }
}
-#endif

-static int alloc_ldt(mm_context_t *pc, int mincount, int reload)
+/* The caller must call finalize_ldt_struct on the result. */
+static struct ldt_struct *alloc_ldt_struct(int size)
{
- void *oldldt, *newldt;
- int oldsize;
+ struct ldt_struct *new_ldt;
+ int alloc_size;

- if (mincount <= pc->size)
- return 0;
- oldsize = pc->size;
- mincount = (mincount + (PAGE_SIZE / LDT_ENTRY_SIZE - 1)) &
- (~(PAGE_SIZE / LDT_ENTRY_SIZE - 1));
- if (mincount * LDT_ENTRY_SIZE > PAGE_SIZE)
- newldt = vmalloc(mincount * LDT_ENTRY_SIZE);
+ if (size > LDT_ENTRIES)
+ return NULL;
+
+ new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL);
+ if (!new_ldt)
+ return NULL;
+
+ BUILD_BUG_ON(LDT_ENTRY_SIZE != sizeof(struct desc_struct));
+ alloc_size = size * LDT_ENTRY_SIZE;
+
+ if (alloc_size > PAGE_SIZE)
+ new_ldt->entries = vmalloc(alloc_size);
else
- newldt = (void *)__get_free_page(GFP_KERNEL);
+ new_ldt->entries = (void *)__get_free_page(GFP_KERNEL);
+ if (!new_ldt->entries) {
+ kfree(new_ldt);
+ return NULL;
+ }

- if (!newldt)
- return -ENOMEM;
+ new_ldt->size = size;
+ return new_ldt;
+}
+
+/* After calling this, the LDT is immutable. */
+static void finalize_ldt_struct(struct ldt_struct *ldt)
+{
+ paravirt_alloc_ldt(ldt->entries, ldt->size);
+}

- if (oldsize)
- memcpy(newldt, pc->ldt, oldsize * LDT_ENTRY_SIZE);
- oldldt = pc->ldt;
- memset(newldt + oldsize * LDT_ENTRY_SIZE, 0,
- (mincount - oldsize) * LDT_ENTRY_SIZE);
+static void install_ldt(struct mm_struct *current_mm,
+ struct ldt_struct *ldt)
+{
+ /* context.lock is held */
+ preempt_disable();

- paravirt_alloc_ldt(newldt, mincount);
+ /* Synchronizes with lockless_dereference in load_mm_ldt. */
+ smp_store_release(&current_mm->context.ldt, ldt);

-#ifdef CONFIG_X86_64
- /* CHECKME: Do we really need this ? */
- wmb();
-#endif
- pc->ldt = newldt;
- wmb();
- pc->size = mincount;
- wmb();
+ /* Activate for this CPU. */
+ flush_ldt(current->mm);

- if (reload) {
#ifdef CONFIG_SMP
- preempt_disable();
- load_LDT(pc);
- if (!cpumask_equal(mm_cpumask(current->mm),
- cpumask_of(smp_processor_id())))
- smp_call_function(flush_ldt, current->mm, 1);
- preempt_enable();
-#else
- load_LDT(pc);
+ /* Synchronize with other CPUs. */
+ if (!cpumask_equal(mm_cpumask(current_mm),
+ cpumask_of(smp_processor_id())))
+ smp_call_function(flush_ldt, current_mm, 1);
#endif
- }
- if (oldsize) {
- paravirt_free_ldt(oldldt, oldsize);
- if (oldsize * LDT_ENTRY_SIZE > PAGE_SIZE)
- vfree(oldldt);
- else
- put_page(virt_to_page(oldldt));
- }
- return 0;
+ preempt_enable();
}

-static inline int copy_ldt(mm_context_t *new, mm_context_t *old)
+static void free_ldt_struct(struct ldt_struct *ldt)
{
- int err = alloc_ldt(new, old->size, 0);
- int i;
-
- if (err < 0)
- return err;
-
- for (i = 0; i < old->size; i++)
- write_ldt_entry(new->ldt, i, old->ldt + i * LDT_ENTRY_SIZE);
- return 0;
+ if (unlikely(ldt)) {
+ int alloc_size = sizeof(struct ldt_struct) +
+ ldt->size * LDT_ENTRY_SIZE;
+ paravirt_free_ldt(ldt->entries, ldt->size);
+ if (alloc_size > PAGE_SIZE)
+ vfree(ldt->entries);
+ else
+ put_page(virt_to_page(ldt->entries));
+ kfree(ldt);
+ }
}

/*
@@ -108,13 +110,32 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
int retval = 0;

mutex_init(&mm->context.lock);
- mm->context.size = 0;
old_mm = current->mm;
- if (old_mm && old_mm->context.size > 0) {
- mutex_lock(&old_mm->context.lock);
- retval = copy_ldt(&mm->context, &old_mm->context);
- mutex_unlock(&old_mm->context.lock);
+ if (!old_mm) {
+ mm->context.ldt = NULL;
+ return 0;
+ }
+
+ mutex_lock(&old_mm->context.lock);
+ if (old_mm->context.ldt) {
+ struct ldt_struct *new_ldt =
+ alloc_ldt_struct(old_mm->context.ldt->size);
+ if (!new_ldt) {
+ retval = -ENOMEM;
+ goto out_unlock;
+ }
+
+ memcpy(new_ldt->entries, old_mm->context.ldt->entries,
+ new_ldt->size * LDT_ENTRY_SIZE);
+ finalize_ldt_struct(new_ldt);
+
+ mm->context.ldt = new_ldt;
+ } else {
+ mm->context.ldt = NULL;
}
+
+out_unlock:
+ mutex_unlock(&old_mm->context.lock);
return retval;
}

@@ -125,53 +146,47 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
*/
void destroy_context(struct mm_struct *mm)
{
- if (mm->context.size) {
-#ifdef CONFIG_X86_32
- /* CHECKME: Can this ever happen ? */
- if (mm == current->active_mm)
- clear_LDT();
-#endif
- paravirt_free_ldt(mm->context.ldt, mm->context.size);
- if (mm->context.size * LDT_ENTRY_SIZE > PAGE_SIZE)
- vfree(mm->context.ldt);
- else
- put_page(virt_to_page(mm->context.ldt));
- mm->context.size = 0;
- }
+ free_ldt_struct(mm->context.ldt);
+ mm->context.ldt = NULL;
}

static int read_ldt(void __user *ptr, unsigned long bytecount)
{
- int err;
+ int retval;
unsigned long size;
struct mm_struct *mm = current->mm;

- if (!mm->context.size)
- return 0;
+ mutex_lock(&mm->context.lock);
+
+ if (!mm->context.ldt) {
+ retval = 0;
+ goto out_unlock;
+ }
+
if (bytecount > LDT_ENTRY_SIZE * LDT_ENTRIES)
bytecount = LDT_ENTRY_SIZE * LDT_ENTRIES;

- mutex_lock(&mm->context.lock);
- size = mm->context.size * LDT_ENTRY_SIZE;
+ size = mm->context.ldt->size * LDT_ENTRY_SIZE;
if (size > bytecount)
size = bytecount;

- err = 0;
- if (copy_to_user(ptr, mm->context.ldt, size))
- err = -EFAULT;
- mutex_unlock(&mm->context.lock);
- if (err < 0)
- goto error_return;
+ if (copy_to_user(ptr, mm->context.ldt->entries, size)) {
+ retval = -EFAULT;
+ goto out_unlock;
+ }
+
if (size != bytecount) {
- /* zero-fill the rest */
+ /* Zero-fill the rest and pretend we read bytecount bytes. */
if (clear_user(ptr + size, bytecount - size) != 0) {
- err = -EFAULT;
- goto error_return;
+ retval = -EFAULT;
+ goto out_unlock;
}
}
- return bytecount;
-error_return:
- return err;
+ retval = bytecount;
+
+out_unlock:
+ mutex_unlock(&mm->context.lock);
+ return retval;
}

static int read_default_ldt(void __user *ptr, unsigned long bytecount)
@@ -195,6 +210,8 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
struct desc_struct ldt;
int error;
struct user_desc ldt_info;
+ int oldsize, newsize;
+ struct ldt_struct *new_ldt, *old_ldt;

error = -EINVAL;
if (bytecount != sizeof(ldt_info))
@@ -213,34 +230,43 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
goto out;
}

- mutex_lock(&mm->context.lock);
- if (ldt_info.entry_number >= mm->context.size) {
- error = alloc_ldt(&current->mm->context,
- ldt_info.entry_number + 1, 1);
- if (error < 0)
- goto out_unlock;
- }
-
- /* Allow LDTs to be cleared by the user. */
- if (ldt_info.base_addr == 0 && ldt_info.limit == 0) {
- if (oldmode || LDT_empty(&ldt_info)) {
- memset(&ldt, 0, sizeof(ldt));
- goto install;
+ if ((oldmode && ldt_info.base_addr == 0 && ldt_info.limit == 0) ||
+ LDT_empty(&ldt_info)) {
+ /* The user wants to clear the entry. */
+ memset(&ldt, 0, sizeof(ldt));
+ } else {
+ if (!IS_ENABLED(CONFIG_X86_16BIT) && !ldt_info.seg_32bit) {
+ error = -EINVAL;
+ goto out;
}
+
+ fill_ldt(&ldt, &ldt_info);
+ if (oldmode)
+ ldt.avl = 0;
}

- if (!IS_ENABLED(CONFIG_X86_16BIT) && !ldt_info.seg_32bit) {
- error = -EINVAL;
+ mutex_lock(&mm->context.lock);
+
+ old_ldt = mm->context.ldt;
+ oldsize = old_ldt ? old_ldt->size : 0;
+ newsize = max((int)(ldt_info.entry_number + 1), oldsize);
+
+ error = -ENOMEM;
+ new_ldt = alloc_ldt_struct(newsize);
+ if (!new_ldt)
goto out_unlock;
- }

- fill_ldt(&ldt, &ldt_info);
- if (oldmode)
- ldt.avl = 0;
+ if (old_ldt) {
+ memcpy(new_ldt->entries, old_ldt->entries,
+ oldsize * LDT_ENTRY_SIZE);
+ }
+ memset(new_ldt->entries + oldsize * LDT_ENTRY_SIZE, 0,
+ (newsize - oldsize) * LDT_ENTRY_SIZE);
+ new_ldt->entries[ldt_info.entry_number] = ldt;
+ finalize_ldt_struct(new_ldt);

- /* Install the new entry ... */
-install:
- write_ldt_entry(mm->context.ldt, ldt_info.entry_number, &ldt);
+ install_ldt(mm, new_ldt);
+ free_ldt_struct(old_ldt);
error = 0;

out_unlock:
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 71d7849a07f7..f6b916387590 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -121,11 +121,11 @@ void __show_regs(struct pt_regs *regs, int all)
void release_thread(struct task_struct *dead_task)
{
if (dead_task->mm) {
- if (dead_task->mm->context.size) {
+ if (dead_task->mm->context.ldt) {
pr_warn("WARNING: dead process %s still has LDT? <%p/%d>\n",
dead_task->comm,
dead_task->mm->context.ldt,
- dead_task->mm->context.size);
+ dead_task->mm->context.ldt->size);
BUG();
}
}
diff --git a/arch/x86/kernel/step.c b/arch/x86/kernel/step.c
index 9b4d51d0c0d0..6273324186ac 100644
--- a/arch/x86/kernel/step.c
+++ b/arch/x86/kernel/step.c
@@ -5,6 +5,7 @@
#include <linux/mm.h>
#include <linux/ptrace.h>
#include <asm/desc.h>
+#include <asm/mmu_context.h>

unsigned long convert_ip_to_linear(struct task_struct *child, struct pt_regs *regs)
{
@@ -30,10 +31,11 @@ unsigned long convert_ip_to_linear(struct task_struct *child, struct pt_regs *re
seg &= ~7UL;

mutex_lock(&child->mm->context.lock);
- if (unlikely((seg >> 3) >= child->mm->context.size))
+ if (unlikely(!child->mm->context.ldt ||
+ (seg >> 3) >= child->mm->context.ldt->size))
addr = -1L; /* bogus selector, access would fault */
else {
- desc = child->mm->context.ldt + seg;
+ desc = &child->mm->context.ldt->entries[seg];
base = get_desc_base(desc);

/* 16-bit code segment? */
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index 0d7dd1f5ac36..9ab52791fed5 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -22,6 +22,7 @@
#include <asm/fpu/internal.h>
#include <asm/debugreg.h>
#include <asm/cpu.h>
+#include <asm/mmu_context.h>

#ifdef CONFIG_X86_32
__visible unsigned long saved_context_ebx;
@@ -153,7 +154,7 @@ static void fix_processor_context(void)
syscall_init(); /* This sets MSR_*STAR and related */
#endif
load_TR_desc(); /* This does ltr */
- load_LDT(&current->active_mm->context); /* This does lldt */
+ load_mm_ldt(current->active_mm); /* This does lldt */

fpu__resume_cpu();
}
--
2.4.3

2015-07-22 19:24:35

by Andy Lutomirski

Subject: [PATCH v3 2/3] x86/ldt: Make modify_ldt optional

The modify_ldt syscall exposes a large attack surface and is
unnecessary for modern userspace. Make it optional.
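
Userspace that wants to work with either configuration can treat an
ENOSYS return as "no LDT"; the selftest in patch 3 does exactly that:

	int ret = syscall(SYS_modify_ldt, 0x11, &desc, sizeof(desc));
	if (ret < -1)
		errno = -ret;	/* raw syscall return; recover the error */
	if (ret != 0 && errno == ENOSYS)
		printf("[OK]\tmodify_ldt returned -ENOSYS\n");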

Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/Kconfig | 17 +++++++++++++++++
arch/x86/include/asm/mmu.h | 2 ++
arch/x86/include/asm/mmu_context.h | 31 +++++++++++++++++++++++--------
arch/x86/kernel/Makefile | 3 ++-
arch/x86/kernel/cpu/perf_event.c | 4 ++++
arch/x86/kernel/process_64.c | 2 ++
arch/x86/kernel/step.c | 2 ++
kernel/sys_ni.c | 1 +
8 files changed, 53 insertions(+), 9 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b3a1a5d77d92..ede52be845db 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1015,6 +1015,7 @@ config VM86
config X86_16BIT
bool "Enable support for 16-bit segments" if EXPERT
default y
+ depends on MODIFY_LDT_SYSCALL
---help---
This option is required by programs like Wine to run 16-bit
protected mode legacy code on x86 processors. Disabling
@@ -2053,6 +2054,22 @@ config CMDLINE_OVERRIDE
This is used to work around broken boot loaders. This should
be set to 'N' under normal conditions.

+config MODIFY_LDT_SYSCALL
+ bool "Enable the LDT (local descriptor table)" if EXPERT
+ default y
+ ---help---
+ Linux can allow user programs to install a per-process x86
+ Local Descriptor Table (LDT) using the modify_ldt(2) system
+ call. This is required to run 16-bit or segmented code such as
+ DOSEMU or some Wine programs. It is also used by some very old
+ threading libraries.
+
+ Enabling this feature adds a small amount of overhead to
+ context switches and increases the low-level kernel attack
+ surface. Disabling it removes the modify_ldt(2) system call.
+
+ Saying 'N' here may make sense for embedded or server kernels.
+
source "kernel/livepatch/Kconfig"

endmenu
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 364d27481a52..55234d5e7160 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -9,7 +9,9 @@
* we put the segment information here.
*/
typedef struct {
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
struct ldt_struct *ldt;
+#endif

#ifdef CONFIG_X86_64
/* True if mm supports a task running in 32 bit compatibility mode. */
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 3fcff70c398e..08094eded318 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -33,6 +33,7 @@ static inline void load_mm_cr4(struct mm_struct *mm)
static inline void load_mm_cr4(struct mm_struct *mm) {}
#endif

+#ifdef CONFIG_MODIFY_LDT_SYSCALL
/*
* ldt_structs can be allocated, used, and freed, but they are never
* modified while live.
@@ -48,10 +49,24 @@ struct ldt_struct {
int size;
};

+/*
+ * Used for LDT copy/destruction.
+ */
+int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
+void destroy_context(struct mm_struct *mm);
+#else /* CONFIG_MODIFY_LDT_SYSCALL */
+static inline int init_new_context(struct task_struct *tsk,
+ struct mm_struct *mm)
+{
+ return 0;
+}
+static inline void destroy_context(struct mm_struct *mm) {}
+#endif
+
static inline void load_mm_ldt(struct mm_struct *mm)
{
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
struct ldt_struct *ldt;
- DEBUG_LOCKS_WARN_ON(!irqs_disabled());

/* lockless_dereference synchronizes with smp_store_release */
ldt = lockless_dereference(mm->context.ldt);
@@ -74,14 +89,12 @@ static inline void load_mm_ldt(struct mm_struct *mm)
set_ldt(ldt->entries, ldt->size);
else
clear_LDT();
-}
-
-/*
- * Used for LDT copy/destruction.
- */
-int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
-void destroy_context(struct mm_struct *mm);
+#else
+ clear_LDT();
+#endif

+ DEBUG_LOCKS_WARN_ON(!irqs_disabled());
+}

static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{
@@ -113,6 +126,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
/* Load per-mm CR4 state */
load_mm_cr4(next);

+#ifdef CONFIG_MODIFY_LDT_SYSCALL
/*
* Load the LDT, if the LDT is different.
*
@@ -127,6 +141,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
*/
if (unlikely(prev->context.ldt != next->context.ldt))
load_mm_ldt(next);
+#endif
}
#ifdef CONFIG_SMP
else {
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 0f15af41bd80..2b507befcd3f 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -24,7 +24,8 @@ CFLAGS_irq.o := -I$(src)/../include/asm/trace

obj-y := process_$(BITS).o signal.o
obj-y += traps.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
-obj-y += time.o ioport.o ldt.o dumpstack.o nmi.o
+obj-y += time.o ioport.o dumpstack.o nmi.o
+obj-$(CONFIG_MODIFY_LDT_SYSCALL) += ldt.o
obj-y += setup.o x86_init.o i8259.o irqinit.o jump_label.o
obj-$(CONFIG_IRQ_WORK) += irq_work.o
obj-y += probe_roms.o
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 9469dfa55607..58b872ef2329 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -2179,6 +2179,7 @@ static unsigned long get_segment_base(unsigned int segment)
int idx = segment >> 3;

if ((segment & SEGMENT_TI_MASK) == SEGMENT_LDT) {
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
struct ldt_struct *ldt;

if (idx > LDT_ENTRIES)
@@ -2190,6 +2191,9 @@ static unsigned long get_segment_base(unsigned int segment)
return 0;

desc = &ldt->entries[idx];
+#else
+ return 0;
+#endif
} else {
if (idx > GDT_ENTRIES)
return 0;
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index f6b916387590..941295ddf802 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -121,6 +121,7 @@ void __show_regs(struct pt_regs *regs, int all)
void release_thread(struct task_struct *dead_task)
{
if (dead_task->mm) {
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
if (dead_task->mm->context.ldt) {
pr_warn("WARNING: dead process %s still has LDT? <%p/%d>\n",
dead_task->comm,
@@ -128,6 +129,7 @@ void release_thread(struct task_struct *dead_task)
dead_task->mm->context.ldt->size);
BUG();
}
+#endif
}
}

diff --git a/arch/x86/kernel/step.c b/arch/x86/kernel/step.c
index 6273324186ac..fd88e152d584 100644
--- a/arch/x86/kernel/step.c
+++ b/arch/x86/kernel/step.c
@@ -18,6 +18,7 @@ unsigned long convert_ip_to_linear(struct task_struct *child, struct pt_regs *re
return addr;
}

+#ifdef CONFIG_MODIFY_LDT_SYSCALL
/*
* We'll assume that the code segments in the GDT
* are all zero-based. That is largely true: the
@@ -45,6 +46,7 @@ unsigned long convert_ip_to_linear(struct task_struct *child, struct pt_regs *re
}
mutex_unlock(&child->mm->context.lock);
}
+#endif

return addr;
}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7995ef5868d8..ca7d84f438f1 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -140,6 +140,7 @@ cond_syscall(sys_sgetmask);
cond_syscall(sys_ssetmask);
cond_syscall(sys_vm86old);
cond_syscall(sys_vm86);
+cond_syscall(sys_modify_ldt);
cond_syscall(sys_ipc);
cond_syscall(compat_sys_ipc);
cond_syscall(compat_sys_sysctl);
--
2.4.3

2015-07-22 19:24:10

by Andy Lutomirski

Subject: [PATCH v3 3/3] selftests/x86, x86/ldt: Add a selftest for modify_ldt

This tests general modify_ldt behavior (only writes, so far) as
well as synchronous updates via IPI. It fails on old kernels.
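
The cross-CPU case works roughly like this (condensed from
do_multicpu_tests() in the test below):

	safe_modify_ldt(0x11, &desc, sizeof(desc));  /* install LDT entry 0 */
	asm volatile ("mov %0, %%ss" : : "r" (0x7)); /* 0x7 = entry 0, TI=LDT, RPL 3 */
	ftx = 2;  /* ask a thread pinned to another CPU to clear entry 0 */
	/*
	 * On a fixed kernel, the modify_ldt IPI invalidates this CPU's
	 * stale %ss synchronously and we take SIGSEGV right here; on old
	 * kernels we keep running on the stale descriptor and the
	 * iteration is counted as a failure.
	 */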

I called this ldt_gdt because I'll add set_thread_area tests to
it at some point.

Signed-off-by: Andy Lutomirski <[email protected]>
---
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/ldt_gdt.c | 492 ++++++++++++++++++++++++++++++++++
2 files changed, 493 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/x86/ldt_gdt.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index caa60d56d7d1..4138387b892c 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -4,7 +4,7 @@ include ../lib.mk

.PHONY: all all_32 all_64 warn_32bit_failure clean

-TARGETS_C_BOTHBITS := sigreturn single_step_syscall sysret_ss_attrs
+TARGETS_C_BOTHBITS := sigreturn single_step_syscall sysret_ss_attrs ldt_gdt
TARGETS_C_32BIT_ONLY := entry_from_vm86

TARGETS_C_32BIT_ALL := $(TARGETS_C_BOTHBITS) $(TARGETS_C_32BIT_ONLY)
diff --git a/tools/testing/selftests/x86/ldt_gdt.c b/tools/testing/selftests/x86/ldt_gdt.c
new file mode 100644
index 000000000000..7723a12d42e1
--- /dev/null
+++ b/tools/testing/selftests/x86/ldt_gdt.c
@@ -0,0 +1,492 @@
+/*
+ * ldt_gdt.c - Test cases for LDT and GDT access
+ * Copyright (c) 2015 Andrew Lutomirski
+ */
+
+#define _GNU_SOURCE
+#include <err.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <signal.h>
+#include <setjmp.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+#include <asm/ldt.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <stdbool.h>
+#include <pthread.h>
+#include <sched.h>
+#include <linux/futex.h>
+
+#define AR_ACCESSED (1<<8)
+
+#define AR_TYPE_RODATA (0 * (1<<9))
+#define AR_TYPE_RWDATA (1 * (1<<9))
+#define AR_TYPE_RODATA_EXPDOWN (2 * (1<<9))
+#define AR_TYPE_RWDATA_EXPDOWN (3 * (1<<9))
+#define AR_TYPE_XOCODE (4 * (1<<9))
+#define AR_TYPE_XRCODE (5 * (1<<9))
+#define AR_TYPE_XOCODE_CONF (6 * (1<<9))
+#define AR_TYPE_XRCODE_CONF (7 * (1<<9))
+
+#define AR_DPL3 (3 * (1<<13))
+
+#define AR_S (1 << 12)
+#define AR_P (1 << 15)
+#define AR_AVL (1 << 20)
+#define AR_L (1 << 21)
+#define AR_DB (1 << 22)
+#define AR_G (1 << 23)
+
+static int nerrs;
+
+static void check_invalid_segment(uint16_t index, int ldt)
+{
+ uint32_t has_limit = 0, has_ar = 0, limit, ar;
+ uint32_t selector = (index << 3) | (ldt << 2) | 3;
+
+ asm ("lsl %[selector], %[limit]\n\t"
+ "jnz 1f\n\t"
+ "movl $1, %[has_limit]\n\t"
+ "1:"
+ : [limit] "=r" (limit), [has_limit] "+rm" (has_limit)
+ : [selector] "r" (selector));
+ asm ("larl %[selector], %[ar]\n\t"
+ "jnz 1f\n\t"
+ "movl $1, %[has_ar]\n\t"
+ "1:"
+ : [ar] "=r" (ar), [has_ar] "+rm" (has_ar)
+ : [selector] "r" (selector));
+
+ if (has_limit || has_ar) {
+ printf("[FAIL]\t%s entry %hu is valid but should be invalid\n",
+ (ldt ? "LDT" : "GDT"), index);
+ nerrs++;
+ } else {
+ printf("[OK]\t%s entry %hu is invalid\n",
+ (ldt ? "LDT" : "GDT"), index);
+ }
+}
+
+static void check_valid_segment(uint16_t index, int ldt,
+ uint32_t expected_ar, uint32_t expected_limit)
+{
+ uint32_t has_limit = 0, has_ar = 0, limit, ar;
+ uint32_t selector = (index << 3) | (ldt << 2) | 3;
+
+ asm ("lsl %[selector], %[limit]\n\t"
+ "jnz 1f\n\t"
+ "movl $1, %[has_limit]\n\t"
+ "1:"
+ : [limit] "=r" (limit), [has_limit] "+rm" (has_limit)
+ : [selector] "r" (selector));
+ asm ("larl %[selector], %[ar]\n\t"
+ "jnz 1f\n\t"
+ "movl $1, %[has_ar]\n\t"
+ "1:"
+ : [ar] "=r" (ar), [has_ar] "+rm" (has_ar)
+ : [selector] "r" (selector));
+
+ if (!has_limit || !has_ar) {
+ printf("[FAIL]\t%s entry %hu is invalid but should be valid\n",
+ (ldt ? "LDT" : "GDT"), index);
+ nerrs++;
+ return;
+ }
+
+ if (ar != expected_ar) {
+ printf("[FAIL]\t%s entry %hu has AR 0x%08X but expected 0x%08X\n",
+ (ldt ? "LDT" : "GDT"), index, ar, expected_ar);
+ nerrs++;
+ } else if (limit != expected_limit) {
+ printf("[FAIL]\t%s entry %hu has limit 0x%08X but expected 0x%08X\n",
+ (ldt ? "LDT" : "GDT"), index, limit, expected_limit);
+ nerrs++;
+ } else {
+ printf("[OK]\t%s entry %hu has AR 0x%08X and limit 0x%08X\n",
+ (ldt ? "LDT" : "GDT"), index, ar, limit);
+ }
+}
+
+static bool install_valid_mode(const struct user_desc *desc, uint32_t ar,
+ bool oldmode)
+{
+ int ret = syscall(SYS_modify_ldt, oldmode ? 1 : 0x11,
+ desc, sizeof(*desc));
+ if (ret < -1)
+ errno = -ret;
+ if (ret == 0) {
+ uint32_t limit = desc->limit;
+ if (desc->limit_in_pages)
+ limit = (limit << 12) + 4095;
+ check_valid_segment(desc->entry_number, 1, ar, limit);
+ return true;
+ } else if (errno == ENOSYS) {
+ printf("[OK]\tmodify_ldt returned -ENOSYS\n");
+ return false;
+ } else {
+ if (desc->seg_32bit) {
+ printf("[FAIL]\tUnexpected modify_ldt failure %d\n",
+ errno);
+ nerrs++;
+ return false;
+ } else {
+ printf("[OK]\tmodify_ldt rejected 16 bit segment\n");
+ return false;
+ }
+ }
+}
+
+static bool install_valid(const struct user_desc *desc, uint32_t ar)
+{
+ return install_valid_mode(desc, ar, false);
+}
+
+static void install_invalid(const struct user_desc *desc, bool oldmode)
+{
+ int ret = syscall(SYS_modify_ldt, oldmode ? 1 : 0x11,
+ desc, sizeof(*desc));
+ if (ret < -1)
+ errno = -ret;
+ if (ret == 0) {
+ check_invalid_segment(desc->entry_number, 1);
+ } else if (errno == ENOSYS) {
+ printf("[OK]\tmodify_ldt returned -ENOSYS\n");
+ } else {
+ if (desc->seg_32bit) {
+ printf("[FAIL]\tUnexpected modify_ldt failure %d\n",
+ errno);
+ nerrs++;
+ } else {
+ printf("[OK]\tmodify_ldt rejected 16 bit segment\n");
+ }
+ }
+}
+
+static int safe_modify_ldt(int func, struct user_desc *ptr,
+ unsigned long bytecount)
+{
+ int ret = syscall(SYS_modify_ldt, 0x11, ptr, bytecount);
+ if (ret < -1)
+ errno = -ret;
+ return ret;
+}
+
+static void fail_install(struct user_desc *desc)
+{
+ if (safe_modify_ldt(0x11, desc, sizeof(*desc)) == 0) {
+ printf("[FAIL]\tmodify_ldt accepted a bad descriptor\n");
+ nerrs++;
+ } else if (errno == ENOSYS) {
+ printf("[OK]\tmodify_ldt returned -ENOSYS\n");
+ } else {
+ printf("[OK]\tmodify_ldt failure %d\n", errno);
+ }
+}
+
+static void do_simple_tests(void)
+{
+ struct user_desc desc = {
+ .entry_number = 0,
+ .base_addr = 0,
+ .limit = 10,
+ .seg_32bit = 1,
+ .contents = 2, /* Code, not conforming */
+ .read_exec_only = 0,
+ .limit_in_pages = 0,
+ .seg_not_present = 0,
+ .useable = 0
+ };
+ install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE | AR_S | AR_P | AR_DB);
+
+ desc.limit_in_pages = 1;
+ install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+ AR_S | AR_P | AR_DB | AR_G);
+
+ check_invalid_segment(1, 1);
+
+ desc.entry_number = 2;
+ install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+ AR_S | AR_P | AR_DB | AR_G);
+
+ check_invalid_segment(1, 1);
+
+ desc.base_addr = 0xf0000000;
+ install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+ AR_S | AR_P | AR_DB | AR_G);
+
+ desc.useable = 1;
+ install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+ AR_S | AR_P | AR_DB | AR_G | AR_AVL);
+
+ desc.seg_not_present = 1;
+ install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+ AR_S | AR_DB | AR_G | AR_AVL);
+
+ desc.seg_32bit = 0;
+ install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+ AR_S | AR_G | AR_AVL);
+
+ desc.seg_32bit = 1;
+ desc.contents = 0;
+ install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA |
+ AR_S | AR_DB | AR_G | AR_AVL);
+
+ desc.read_exec_only = 1;
+ install_valid(&desc, AR_DPL3 | AR_TYPE_RODATA |
+ AR_S | AR_DB | AR_G | AR_AVL);
+
+ desc.contents = 1;
+ install_valid(&desc, AR_DPL3 | AR_TYPE_RODATA_EXPDOWN |
+ AR_S | AR_DB | AR_G | AR_AVL);
+
+ desc.read_exec_only = 0;
+ desc.limit_in_pages = 0;
+ install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA_EXPDOWN |
+ AR_S | AR_DB | AR_AVL);
+
+ desc.contents = 3;
+ install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE_CONF |
+ AR_S | AR_DB | AR_AVL);
+
+ desc.read_exec_only = 1;
+ install_valid(&desc, AR_DPL3 | AR_TYPE_XOCODE_CONF |
+ AR_S | AR_DB | AR_AVL);
+
+ desc.read_exec_only = 0;
+ desc.contents = 2;
+ install_valid(&desc, AR_DPL3 | AR_TYPE_XRCODE |
+ AR_S | AR_DB | AR_AVL);
+
+ desc.read_exec_only = 1;
+
+#ifdef __x86_64__
+ desc.lm = 1;
+ install_valid(&desc, AR_DPL3 | AR_TYPE_XOCODE |
+ AR_S | AR_DB | AR_AVL);
+ desc.lm = 0;
+#endif
+
+ bool entry1_okay = install_valid(&desc, AR_DPL3 | AR_TYPE_XOCODE |
+ AR_S | AR_DB | AR_AVL);
+
+ if (entry1_okay) {
+ printf("[RUN]\tTest fork\n");
+ pid_t child = fork();
+ if (child == 0) {
+ nerrs = 0;
+ check_valid_segment(desc.entry_number, 1,
+ AR_DPL3 | AR_TYPE_XOCODE |
+ AR_S | AR_DB | AR_AVL, desc.limit);
+ check_invalid_segment(1, 1);
+ exit(nerrs ? 1 : 0);
+ } else {
+ int status;
+ if (waitpid(child, &status, 0) != child ||
+ !WIFEXITED(status)) {
+ printf("[FAIL]\tChild died\n");
+ nerrs++;
+ } else if (WEXITSTATUS(status) != 0) {
+ printf("[FAIL]\tChild failed\n");
+ nerrs++;
+ } else {
+ printf("[OK]\tChild succeeded\n");
+ }
+ }
+ } else {
+ printf("[SKIP]\tSkipping fork test because have no LDT\n");
+ }
+
+ /* Test entry_number too high. */
+ desc.entry_number = 100000;
+ fail_install(&desc);
+
+ /* Test deletion and actions mistakable for deletion. */
+ memset(&desc, 0, sizeof(desc));
+ install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA | AR_S | AR_P);
+
+ desc.seg_not_present = 1;
+ install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA | AR_S);
+
+ desc.seg_not_present = 0;
+ desc.read_exec_only = 1;
+ install_valid(&desc, AR_DPL3 | AR_TYPE_RODATA | AR_S | AR_P);
+
+ desc.read_exec_only = 0;
+ desc.seg_not_present = 1;
+ install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA | AR_S);
+
+ desc.read_exec_only = 1;
+ desc.limit = 1;
+ install_valid(&desc, AR_DPL3 | AR_TYPE_RODATA | AR_S);
+
+ desc.limit = 0;
+ desc.base_addr = 1;
+ install_valid(&desc, AR_DPL3 | AR_TYPE_RODATA | AR_S);
+
+ desc.base_addr = 0;
+ install_invalid(&desc, false);
+
+ desc.seg_not_present = 0;
+ desc.read_exec_only = 0;
+ desc.seg_32bit = 1;
+ install_valid(&desc, AR_DPL3 | AR_TYPE_RWDATA | AR_S | AR_P | AR_DB);
+ install_invalid(&desc, true);
+}
+
+/*
+ * 0: thread is idle
+ * 1: thread armed
+ * 2: thread should clear LDT entry 0
+ * 3: thread should exit
+ */
+static volatile unsigned int ftx;
+
+static void *threadproc(void *ctx)
+{
+ cpu_set_t cpuset;
+ CPU_ZERO(&cpuset);
+ CPU_SET(1, &cpuset);
+ if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0)
+ err(1, "sched_setaffinity to CPU 1"); /* should never fail */
+
+ while (1) {
+ syscall(SYS_futex, &ftx, FUTEX_WAIT, 0, NULL, NULL, 0);
+ while (ftx != 2) {
+ if (ftx == 3)
+ return NULL;
+ }
+
+ /* clear LDT entry 0 */
+ const struct user_desc desc = {};
+ if (syscall(SYS_modify_ldt, 1, &desc, sizeof(desc)) != 0)
+ err(1, "modify_ldt");
+
+ ftx = 0;
+ }
+}
+
+static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *),
+ int flags)
+{
+ struct sigaction sa;
+ memset(&sa, 0, sizeof(sa));
+ sa.sa_sigaction = handler;
+ sa.sa_flags = SA_SIGINFO | flags;
+ sigemptyset(&sa.sa_mask);
+ if (sigaction(sig, &sa, 0))
+ err(1, "sigaction");
+
+}
+
+static jmp_buf jmpbuf;
+
+static void sigsegv(int sig, siginfo_t *info, void *ctx_void)
+{
+ siglongjmp(jmpbuf, 1);
+}
+
+static void do_multicpu_tests(void)
+{
+ cpu_set_t cpuset;
+ pthread_t thread;
+ int failures = 0, iters = 5, i;
+ unsigned short orig_ss;
+
+ CPU_ZERO(&cpuset);
+ CPU_SET(1, &cpuset);
+ if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0) {
+ printf("[SKIP]\tCannot set affinity to CPU 1\n");
+ return;
+ }
+
+ CPU_ZERO(&cpuset);
+ CPU_SET(0, &cpuset);
+ if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0) {
+ printf("[SKIP]\tCannot set affinity to CPU 0\n");
+ return;
+ }
+
+ sethandler(SIGSEGV, sigsegv, 0);
+
+ printf("[RUN]\tCross-CPU LDT invalidation\n");
+
+ if (pthread_create(&thread, 0, threadproc, 0) != 0)
+ err(1, "pthread_create");
+
+ asm volatile ("mov %%ss, %0" : "=rm" (orig_ss));
+
+ for (i = 0; i < 5; i++) {
+ if (sigsetjmp(jmpbuf, 1) != 0)
+ continue;
+
+ /* Make sure the thread is ready after the last test. */
+ while (ftx != 0)
+ ;
+
+ struct user_desc desc = {
+ .entry_number = 0,
+ .base_addr = 0,
+ .limit = 0xfffff,
+ .seg_32bit = 1,
+ .contents = 0, /* Data */
+ .read_exec_only = 0,
+ .limit_in_pages = 1,
+ .seg_not_present = 0,
+ .useable = 0
+ };
+
+ if (safe_modify_ldt(0x11, &desc, sizeof(desc)) != 0) {
+ if (errno != ENOSYS)
+ err(1, "modify_ldt");
+ printf("[SKIP]\tmodify_ldt unavailable\n");
+ break;
+ }
+
+ /* Arm the thread. */
+ ftx = 1;
+ syscall(SYS_futex, &ftx, FUTEX_WAKE, 0, NULL, NULL, 0);
+
+ asm volatile ("mov %0, %%ss" : : "r" (0x7));
+
+ /* Go! */
+ ftx = 2;
+
+ while (ftx != 0)
+ ;
+
+ /*
+ * On success, modify_ldt will segfault us synchronously,
+ * and we'll escape via siglongjmp.
+ */
+
+ failures++;
+ asm volatile ("mov %0, %%ss" : : "rm" (orig_ss));
+ };
+
+ ftx = 3;
+ syscall(SYS_futex, &ftx, FUTEX_WAKE, 0, NULL, NULL, 0);
+
+ if (pthread_join(thread, NULL) != 0)
+ err(1, "pthread_join");
+
+ if (failures) {
+ printf("[FAIL]\t%d of %d iterations failed\n", failures, iters);
+ nerrs++;
+ } else {
+ printf("[OK]\tAll %d iterations succeeded\n", iters);
+ }
+}
+
+int main()
+{
+ do_simple_tests();
+
+ do_multicpu_tests();
+
+ return nerrs ? 1 : 0;
+}
--
2.4.3

2015-07-22 22:22:22

by Boris Ostrovsky

Subject: Re: [PATCH v3 1/3] x86/ldt: Make modify_ldt synchronous

On 07/22/2015 03:23 PM, Andy Lutomirski wrote:
> modify_ldt has questionable locking and does not synchronize
> threads. Improve it: redesign the locking and synchronize all
> threads' LDTs using an IPI on all modifications.
>
> This will dramatically slow down modify_ldt in multithreaded
> programs, but there shouldn't be any multithreaded programs that
> care about modify_ldt's performance in the first place.
>
> Cc: [email protected]
> Signed-off-by: Andy Lutomirski <[email protected]>
> [...]
> + if (old_ldt) {
> + memcpy(new_ldt->entries, old_ldt->entries,
> + oldsize * LDT_ENTRY_SIZE);
> + }
> + memset(new_ldt->entries + oldsize * LDT_ENTRY_SIZE, 0,
> + (newsize - oldsize) * LDT_ENTRY_SIZE);

We need to zero out the full page (probably better done in
alloc_ldt_struct(), with vzalloc()/__GFP_ZERO): Xen checks the whole
page that is assigned to the G/LDT and gets unhappy if an invalid
descriptor is found there.
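
E.g. (untested sketch):

	if (alloc_size > PAGE_SIZE)
		new_ldt->entries = vzalloc(alloc_size);
	else
		new_ldt->entries = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);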

This fixes one problem. There is something else that Xen gets upset
about; I haven't figured out what it is yet (and I am out tomorrow, so
it may need to wait until Friday).


-boris


2015-07-23 07:13:55

by Jan Beulich

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] x86/ldt: Make modify_ldt optional

>>> On 22.07.15 at 21:23, <[email protected]> wrote:
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1015,6 +1015,7 @@ config VM86
> config X86_16BIT
> bool "Enable support for 16-bit segments" if EXPERT
> default y
> + depends on MODIFY_LDT_SYSCALL
> ---help---
> This option is required by programs like Wine to run 16-bit
> protected mode legacy code on x86 processors. Disabling
> @@ -2053,6 +2054,22 @@ config CMDLINE_OVERRIDE
> This is used to work around broken boot loaders. This should
> be set to 'N' under normal conditions.
>
> +config MODIFY_LDT_SYSCALL
> + bool "Enable the LDT (local descriptor table)" if EXPERT
> + default y
> + ---help---
> + Linux can allow user programs to install a per-process x86
> + Local Descriptor Table (LDT) using the modify_ldt(2) system
> + call. This is required to run 16-bit or segmented code such as
> + DOSEMU or some Wine programs. It is also used by some very old
> + threading libraries.
> +
> + Enabling this feature adds a small amount of overhead to
> + context switches and increases the low-level kernel attack
> + surface. Disabling it removes the modify_ldt(2) system call.
> +
> + Saying 'N' here may make sense for embedded or server kernels.
> +

I think it would be better to place this ahead of the one being
made dependent on it, to avoid the user being prompted for
X86_16BIT despite it possibly becoming unavailable (when this
one gets set to n).

Jan

2015-07-23 10:25:53

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] x86/ldt: Make modify_ldt optional

Hi Andy,

On Wed, Jul 22, 2015 at 12:23:47PM -0700, Andy Lutomirski wrote:
> The modify_ldt syscall exposes a large attack surface and is
> unnecessary for modern userspace. Make it optional.

Wouldn't you prefer something like this, which makes it possible to re-enable
it at runtime, so that we can hope distros will ship with it disabled by default?

It's pretty effective on your ldt_gdt test case:

# echo 1 > /proc/sys/kernel/modify_ldt
# ./a.out
[OK] LDT entry 0 has AR 0x0040FA00 and limit 0x0000000A
[OK] LDT entry 0 has AR 0x00C0FA00 and limit 0x0000AFFF
[OK] LDT entry 1 is invalid
[OK] LDT entry 2 has AR 0x00C0FA00 and limit 0x0000AFFF
[OK] LDT entry 1 is invalid
[OK] LDT entry 2 has AR 0x00C0FA00 and limit 0x0000AFFF
[OK] LDT entry 2 has AR 0x00D0FA00 and limit 0x0000AFFF
[OK] LDT entry 2 has AR 0x00D07A00 and limit 0x0000AFFF
[OK] LDT entry 2 has AR 0x00907A00 and limit 0x0000AFFF
[OK] LDT entry 2 has AR 0x00D07200 and limit 0x0000AFFF
[OK] LDT entry 2 has AR 0x00D07000 and limit 0x0000AFFF
[OK] LDT entry 2 has AR 0x00D07400 and limit 0x0000AFFF
[OK] LDT entry 2 has AR 0x00507600 and limit 0x0000000A
[OK] LDT entry 2 has AR 0x00507E00 and limit 0x0000000A
[OK] LDT entry 2 has AR 0x00507C00 and limit 0x0000000A
[OK] LDT entry 2 has AR 0x00507A00 and limit 0x0000000A
[OK] LDT entry 2 has AR 0x00507800 and limit 0x0000000A
[OK] LDT entry 2 has AR 0x00507800 and limit 0x0000000A
[RUN] Test fork
[OK] LDT entry 2 has AR 0x00507800 and limit 0x0000000A
[OK] LDT entry 1 is invalid
[OK] LDT entry 0 has AR 0x0040FA00 and limit 0x0000000A
[OK] LDT entry 0 has AR 0x00C0FA00 and limit 0x0000AFFF
[OK] LDT entry 1 is invalid
[OK] LDT entry 2 has AR 0x00C0FA00 and limit 0x0000AFFF
[OK] LDT entry 1 is invalid
[OK] LDT entry 2 has AR 0x00C0FA00 and limit 0x0000AFFF
[OK] LDT entry 2 has AR 0x00D0FA00 and limit 0x0000AFFF
[OK] LDT entry 2 has AR 0x00D07A00 and limit 0x0000AFFF
[OK] LDT entry 2 has AR 0x00907A00 and limit 0x0000AFFF
[OK] LDT entry 2 has AR 0x00D07200 and limit 0x0000AFFF
[OK] LDT entry 2 has AR 0x00D07000 and limit 0x0000AFFF
[OK] LDT entry 2 has AR 0x00D07400 and limit 0x0000AFFF
[OK] LDT entry 2 has AR 0x00507600 and limit 0x0000000A
[OK] LDT entry 2 has AR 0x00507E00 and limit 0x0000000A
[OK] LDT entry 2 has AR 0x00507C00 and limit 0x0000000A
[OK] LDT entry 2 has AR 0x00507A00 and limit 0x0000000A
[OK] LDT entry 2 has AR 0x00507800 and limit 0x0000000A
[OK] LDT entry 2 has AR 0x00507800 and limit 0x0000000A
[RUN] Test fork
[OK] Child succeeded
[OK] modify_ldt failure 22
[OK] LDT entry 0 has AR 0x0000F200 and limit 0x00000000
[OK] LDT entry 0 has AR 0x00007200 and limit 0x00000000
[OK] LDT entry 0 has AR 0x0000F000 and limit 0x00000000
[OK] LDT entry 0 has AR 0x00007200 and limit 0x00000000
[OK] LDT entry 0 has AR 0x00007000 and limit 0x00000001
[OK] LDT entry 0 has AR 0x00007000 and limit 0x00000000
[OK] LDT entry 0 is invalid
[OK] LDT entry 0 has AR 0x0040F200 and limit 0x00000000
[OK] LDT entry 0 is invalid
[SKIP] Cannot set affinity to CPU 1


# echo 0 > /proc/sys/kernel/modify_ldt
# ./a.out
[OK] modify_ldt is returned -ENOSYS
[OK] modify_ldt is returned -ENOSYS
[OK] LDT entry 1 is invalid
[OK] modify_ldt is returned -ENOSYS
[OK] LDT entry 1 is invalid
[OK] modify_ldt is returned -ENOSYS
[OK] modify_ldt is returned -ENOSYS
[OK] modify_ldt is returned -ENOSYS
[OK] modify_ldt is returned -ENOSYS
[OK] modify_ldt is returned -ENOSYS
[OK] modify_ldt is returned -ENOSYS
[OK] modify_ldt is returned -ENOSYS
[OK] modify_ldt is returned -ENOSYS
[OK] modify_ldt is returned -ENOSYS
[OK] modify_ldt is returned -ENOSYS
[OK] modify_ldt is returned -ENOSYS
[OK] modify_ldt is returned -ENOSYS
[OK] modify_ldt is returned -ENOSYS
[SKIP] Skipping fork test because have no LDT
[OK] modify_ldt is returned -ENOSYS
[OK] modify_ldt is returned -ENOSYS
[OK] modify_ldt is returned -ENOSYS
[OK] modify_ldt is returned -ENOSYS
[OK] modify_ldt is returned -ENOSYS
[OK] modify_ldt is returned -ENOSYS
[OK] modify_ldt is returned -ENOSYS
[OK] modify_ldt is returned -ENOSYS
[OK] modify_ldt is returned -ENOSYS
[OK] modify_ldt is returned -ENOSYS
[SKIP] Cannot set affinity to CPU 1

The patch is quite small (I stole your comment for the config option).

Willy


diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 226d569..b926f65 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1012,6 +1012,23 @@ config X86_16BIT
this option saves about 300 bytes on i386, or around 6K text
plus 16K runtime memory on x86-64.

+config DEFAULT_MODIFY_LDT_SYSCALL
+ bool "Allow userspace to modify the LDT (local descriptor table)"
+ default y
+ ---help---
+ Linux can allow user programs to install a per-process x86
+ Local Descriptor Table (LDT) using the modify_ldt(2) system
+ call. This is required to run 16-bit or segmented code such as
+ DOSEMU or some Wine programs. It is also used by some very old
+ threading libraries.
+
+ Enabling this feature increases the low-level kernel attack
+ surface. Disabling it disables the modify_ldt(2) system call by
+ default. Note that even when disabled it remains possible to
+	  enable it at runtime by setting the kernel.modify_ldt sysctl.
+
+ Say 'N' here if you don't expect to use DOSEMU or Wine often.
+
config X86_ESPFIX32
def_bool y
depends on X86_16BIT && X86_32
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index c37886d..2f10b6c 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -20,6 +20,12 @@
#include <asm/mmu_context.h>
#include <asm/syscalls.h>

+#ifdef CONFIG_DEFAULT_MODIFY_LDT_SYSCALL
+int sysctl_modify_ldt __read_mostly = 1;
+#else
+int sysctl_modify_ldt __read_mostly = 0;
+#endif
+
#ifdef CONFIG_SMP
static void flush_ldt(void *current_mm)
{
@@ -254,6 +260,9 @@ asmlinkage int sys_modify_ldt(int func, void __user *ptr,
{
int ret = -ENOSYS;

+ if (!sysctl_modify_ldt)
+ return ret;
+
switch (func) {
case 0:
ret = read_ldt(ptr, bytecount);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2082b1a..60270c6 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -111,6 +111,9 @@ extern int sysctl_nr_open_min, sysctl_nr_open_max;
#ifndef CONFIG_MMU
extern int sysctl_nr_trim_pages;
#endif
+#ifdef CONFIG_X86
+extern int sysctl_modify_ldt;
+#endif

/* Constants used for minimum and maximum */
#ifdef CONFIG_LOCKUP_DETECTOR
@@ -962,6 +965,13 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+ {
+ .procname = "modify_ldt",
+ .data = &sysctl_modify_ldt,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
#endif
#if defined(CONFIG_MMU)
{

2015-07-23 23:36:43

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] x86/ldt: Make modify_ldt optional

On Thu, Jul 23, 2015 at 3:24 AM, Willy Tarreau <[email protected]> wrote:
> Hi Andy,
>
> On Wed, Jul 22, 2015 at 12:23:47PM -0700, Andy Lutomirski wrote:
>> The modify_ldt syscall exposes a large attack surface and is
>> unnecessary for modern userspace. Make it optional.
>
> Wouldn't you prefer something like this, which makes it possible to re-enable
> it at runtime, so that we can hope distros will ship with it disabled by default?
>
> It's pretty effective on your ldt_gdt test case:
>
> [... test output trimmed ...]
>
> The patch is quite small (I stole your comment for the config option).
>
> [... patch trimmed ...]

I've been pondering something like this that is even MORE generic, for
any syscall. Something like a "syscalls" directory under
/proc/sys/kernel, with one entry per syscall: "0" is "available", "1" is
"disabled", and "-1" is "disabled until next boot".
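
A minimal sketch of what one entry in such a directory might look like,
reusing the ctl_table machinery from Willy's patch above. All of the
names here (proc_syscall_state, modify_ldt_state, syscalls_table) and
the latch semantics are made up for illustration:

	/* Hypothetical states: 0 = available, 1 = disabled,
	 * -1 = disabled until next boot (further writes refused). */
	static int modify_ldt_state;	/* one such int per syscall */

	static int proc_syscall_state(struct ctl_table *table, int write,
				      void __user *buffer, size_t *lenp,
				      loff_t *ppos)
	{
		int *state = table->data;

		if (write && *state == -1)
			return -EPERM;	/* latched until reboot */

		return proc_dointvec(table, write, buffer, lenp, ppos);
	}

	/* Would be registered under /proc/sys/kernel/syscalls/ */
	static struct ctl_table syscalls_table[] = {
		{
			.procname	= "modify_ldt",
			.data		= &modify_ldt_state,
			.maxlen		= sizeof(int),
			.mode		= 0644,
			.proc_handler	= proc_syscall_state,
		},
		{ }
	};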

-Kees

--
Kees Cook
Chrome OS Security

2015-07-23 23:40:38

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] x86/ldt: Make modify_ldt optional

On Thu, Jul 23, 2015 at 4:36 PM, Kees Cook <[email protected]> wrote:
> On Thu, Jul 23, 2015 at 3:24 AM, Willy Tarreau <[email protected]> wrote:
>> #ifdef CONFIG_SMP
>> static void flush_ldt(void *current_mm)
>> {
>> @@ -254,6 +260,9 @@ asmlinkage int sys_modify_ldt(int func, void __user *ptr,
>> {
>> int ret = -ENOSYS;
>>
>> + if (!sysctl_modify_ldt)
>> + return ret;
>> +
>> switch (func) {
>> case 0:
>> ret = read_ldt(ptr, bytecount);
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index 2082b1a..60270c6 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -111,6 +111,9 @@ extern int sysctl_nr_open_min, sysctl_nr_open_max;
>> #ifndef CONFIG_MMU
>> extern int sysctl_nr_trim_pages;
>> #endif
>> +#ifdef CONFIG_X86
>> +extern int sysctl_modify_ldt;
>> +#endif
>>
>> /* Constants used for minimum and maximum */
>> #ifdef CONFIG_LOCKUP_DETECTOR
>> @@ -962,6 +965,13 @@ static struct ctl_table kern_table[] = {
>> .mode = 0644,
>> .proc_handler = proc_dointvec,
>> },
>> + {
>> + .procname = "modify_ldt",
>> + .data = &sysctl_modify_ldt,
>> + .maxlen = sizeof(int),
>> + .mode = 0644,
>> + .proc_handler = proc_dointvec,
>> + },
>> #endif
>> #if defined(CONFIG_MMU)
>> {
>
> I've been pondering something like this that is even MORE generic, for
> any syscall. Something like a "syscalls" directory under
> /proc/sys/kernel, with one entry per syscall: "0" is "available", "1" is
> "disabled", and "-1" is "disabled until next boot".
>

It might want to be /proc/sys/kernel/syscalls/[abi]/[name], possibly
with more than just those options. We might want "disabled, returns
ENOSYS", "disabled, returns EPERM", and a lock bit.

On x86 at least, the implementation's easy -- we can just poke the
syscall table.
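
For what it's worth, a conceptual sketch of the table-poking idea. Only
sys_ni_syscall (the stock stub that returns -ENOSYS) is a real symbol
here; everything else is made up, and a real implementation would have
to cope with sys_call_table being const and write-protected, plus the
proposed lock bit:

	typedef long (*sys_call_ptr_t)(void);

	extern long sys_ni_syscall(void);	/* returns -ENOSYS */

	static sys_call_ptr_t *table;		/* would point at sys_call_table */
	static sys_call_ptr_t saved[NR_syscalls];

	static void set_syscall_enabled(unsigned int nr, bool enable)
	{
		if (!saved[nr])
			saved[nr] = table[nr];	/* remember the real entry */

		table[nr] = enable ? saved[nr] : sys_ni_syscall;
	}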

--Andy

2015-07-23 23:59:06

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] x86/ldt: Make modify_ldt optional

On Thu, Jul 23, 2015 at 04:40:14PM -0700, Andy Lutomirski wrote:
> On Thu, Jul 23, 2015 at 4:36 PM, Kees Cook <[email protected]> wrote:
> > I've been pondering something like this that is even MORE generic, for
> > any syscall. Something like a "syscalls" directory under
> > /proc/sys/kernel, with one entry per syscall: "0" is "available", "1" is
> > "disabled", and "-1" is "disabled until next boot".
> >
>
> It might want to be /proc/sys/kernel/syscalls/[abi]/[name], possibly
> with more than just those options. We might want "disabled, returns
> ENOSYS", "disabled, returns EPERM", and a lock bit.
>
> On x86 at least, the implementation's easy -- we can just poke the
> syscall table.

I wouldn't do it these days. Around 2000-2001, a friend and I designed
a module, with a userland counterpart, called "overloader". The
principle was to intercept syscalls in order to enforce some forms of
policy, log values, remap paths, etc. The first use was to log all
file creations during a "make install" so as to build packages more
easily. That was back in the kernel 2.2 era, when it was easy to modify
the syscall table from a module.

We quickly found that, beyond logging/rewriting syscall arguments, it
had limited use as a "syscall firewall", because many syscalls are too
coarse-grained to decide whether you want to enable or disable them.
I remember that socketcall() and ioctl() were among the annoying ones:
you either enable them completely or disable them completely. In the
end, the only valid use cases we found for enabling/disabling a syscall
came down to a very small set for debugging purposes, to force some
application code to detect a missing implementation and switch to an
alternative (e.g., these days, if you suspect a bug in epoll, you could
disable it and force the app to use poll instead). It was still useful
to disable module loading and FS mounting, but that was about all by then.

All this to say that probably only a handful of tricky syscalls would
need an on/off switch, but certainly not all of them, so I'd rather
add a few entries just for the relevant ones, mainly to fix compatibility
issues and nothing more. E.g., what's the point of disabling exit(), wait(),
kill(), fork() or getpid()? It would only make bug reports harder to
sort out.

Just my opinion,
Willy

2015-07-24 00:09:24

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] x86/ldt: Make modify_ldt optional

On Thu, Jul 23, 2015 at 4:58 PM, Willy Tarreau <[email protected]> wrote:
> On Thu, Jul 23, 2015 at 04:40:14PM -0700, Andy Lutomirski wrote:
>> On Thu, Jul 23, 2015 at 4:36 PM, Kees Cook <[email protected]> wrote:
>> > I've been pondering something like this that is even MORE generic, for
>> > any syscall. Something like a "syscalls" directory under
>> > /proc/sys/kernel, with one entry per syscall: "0" is "available", "1" is
>> > "disabled", and "-1" is "disabled until next boot".
>> >
>>
>> It might want to be /proc/sys/kernel/syscalls/[abi]/[name], possibly
>> with more than just those options. We might want "disabled, returns
>> ENOSYS", "disabled, returns EPERM", and a lock bit.
>>
>> On x86 at least, the implementation's easy -- we can just poke the
>> syscall table.
>
> [...]
>
> All this to say that probably only a handful of tricky syscalls would
> need an on/off switch, but certainly not all of them, so I'd rather
> add a few entries just for the relevant ones, mainly to fix compatibility
> issues and nothing more. E.g., what's the point of disabling exit(), wait(),
> kill(), fork() or getpid()? It would only make bug reports harder to
> sort out.
>
> Just my opinion,

Well, I would really like to have something like this around so that I
can trivially globally disable syscalls when they have security risks.
My hack[1] to disable kexec_load, for example, was terrible while I
waited for a kernel that supported the disable_kexec_load sysctl.

-Kees

[1] https://outflux.net/blog/archives/2013/12/10/live-patching-the-kernel/

--
Kees Cook
Chrome OS Security

2015-07-24 06:37:19

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 1/3] x86/ldt: Make modify_ldt synchronous

On Wed, Jul 22, 2015 at 12:23:46PM -0700, Andy Lutomirski wrote:
> modify_ldt has questionable locking and does not synchronize
> threads. Improve it: redesign the locking and synchronize all
> threads' LDTs using an IPI on all modifications.
>
> This will dramatically slow down modify_ldt in multithreaded
> programs, but there shouldn't be any multithreaded programs that
> care about modify_ldt's performance in the first place.
>
> Cc: [email protected]
> Signed-off-by: Andy Lutomirski <[email protected]>

...

> +struct ldt_struct {
> + /*
> + * Xen requires page-aligned LDTs with special permissions. This is
> + * needed to prevent us from installing evil descriptors such as
> + * call gates. On native, we could merge the ldt_struct and LDT
> + * allocations, but it's not worth trying to optimize.

I don't think baremetal should care about Xen and, frankly, this is
slowly getting ridiculous - baremetal has to wait for a potentially
critical security fix just because it breaks Xen. Dammit, this level of
intrusiveness into x86 should never have been allowed.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-07-24 07:25:57

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] x86/ldt: Make modify_ldt optional

On Thu, Jul 23, 2015 at 05:09:21PM -0700, Kees Cook wrote:
> > All this to say that probably only a handful of tricky syscalls would
> > need an on/off switch, but certainly not all of them, so I'd rather
> > add a few entries just for the relevant ones, mainly to fix compatibility
> > issues and nothing more. E.g., what's the point of disabling exit(), wait(),
> > kill(), fork() or getpid()? It would only make bug reports harder to
> > sort out.
> >
> > Just my opinion,
>
> Well, I would really like to have something like this around so that I
> can trivially globally disable syscalls when they have security risks.

I understand, but while it could perhaps make sense to have the option
on any Linux-specific syscall, having it on the standard, portable ones
would be useless, as disabling them would break most applications.

> My hack[1] to disable kexec_load, for example, was terrible while I
> waited for a kernel that supported the disable_kexec_load sysctl.

This is typically the kind of Linux-specific syscall that no regular
application would rely on and that can come with side effects. I think
there are not *that* many, none of them is performance-critical, and
they're best dealt with one at a time.

> [1] https://outflux.net/blog/archives/2013/12/10/live-patching-the-kernel/

Thanks, that (and the linked articles) was an interesting read.

Willy

2015-07-24 07:48:57

by Willy Tarreau

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] x86/ldt: Make modify_ldt optional

On Fri, Jul 24, 2015 at 09:24:51AM +0200, Willy Tarreau wrote:
> On Thu, Jul 23, 2015 at 05:09:21PM -0700, Kees Cook wrote:
> > > [...]
> >
> > Well, I would really like to have something like this around so that I
> > can trivially globally disable syscalls when they have security risks.
>
> I understand, but while it could perhaps make sense to have the option
> on any Linux-specific syscall, having it on the standard, portable ones
> would be useless, as disabling them would break most applications.
>
> > My hack[1] to disable kexec_load, for example, was terrible while I
> > waited for a kernel that supported the disable_kexec_load sysctl.
>
> This is typically the kind of Linux-specific syscall that no regular
> application would rely on and that can come with side effects. I think
> there are not *that* many, none of them is performance-critical, and
> they're best dealt with one at a time.

Looking at syscall_64.tbl, I see that the first ~133 syscalls should
not be disabled if we don't want to break portable applications, with
the exception of ptrace, I guess. Past this, things like uselib,
personality, sysfs, prctl, etc. could be disabled. There are still some
exceptions in this area, but I don't see them as critical if someone
accidentally disabled them (e.g., getpriority, mlock, ...). Others like
chroot, setrlimit, adjtimex, settimeofday, mount, umount and time need
to be kept. A few, like sync or sethostname, would be nice to have
optional in order to lock down a system at boot. Many of the other ones
are the *at variants of the first ones and should rather not be made
optional either. I think maybe we can find between 10 and 30 that would
make sense to optionally disable.

Willy

2015-07-24 15:30:09

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 1/3] x86/ldt: Make modify_ldt synchronous

On Wed, Jul 22, 2015 at 12:23:46PM -0700, Andy Lutomirski wrote:
> modify_ldt has questionable locking and does not synchronize
> threads. Improve it: redesign the locking and synchronize all
> threads' LDTs using an IPI on all modifications.
>
> This will dramatically slow down modify_ldt in multithreaded
> programs, but there shouldn't be any multithreaded programs that
> care about modify_ldt's performance in the first place.
>
> Cc: [email protected]
> Signed-off-by: Andy Lutomirski <[email protected]>

Just minor stylistic nitpicks below. Otherwise looks ok to me.

...

> diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
> index c37886d759cc..3ae308029dee 100644
> --- a/arch/x86/kernel/ldt.c
> +++ b/arch/x86/kernel/ldt.c
> @@ -12,6 +12,7 @@
> #include <linux/string.h>
> #include <linux/mm.h>
> #include <linux/smp.h>
> +#include <linux/slab.h>
> #include <linux/vmalloc.h>
> #include <linux/uaccess.h>
>
> @@ -20,82 +21,83 @@
> #include <asm/mmu_context.h>
> #include <asm/syscalls.h>
>
> -#ifdef CONFIG_SMP
> static void flush_ldt(void *current_mm)
> {
> - if (current->active_mm == current_mm)
> - load_LDT(&current->active_mm->context);
> + if (current->active_mm == current_mm) {

Save indentation level:

if (current->active_mm != current_mm)
return;

> + /* context.lock is held for us, so we don't need any locking. */

Stick that comment above the function name.

> + mm_context_t *pc = &current->active_mm->context;
> + set_ldt(pc->ldt->entries, pc->ldt->size);
> + }
> }
> -#endif
>
> -static int alloc_ldt(mm_context_t *pc, int mincount, int reload)
> +/* The caller must call finalize_ldt_struct on the result. */
> +static struct ldt_struct *alloc_ldt_struct(int size)
> {
> - void *oldldt, *newldt;
> - int oldsize;
> + struct ldt_struct *new_ldt;
> + int alloc_size;
>
> - if (mincount <= pc->size)
> - return 0;
> - oldsize = pc->size;
> - mincount = (mincount + (PAGE_SIZE / LDT_ENTRY_SIZE - 1)) &
> - (~(PAGE_SIZE / LDT_ENTRY_SIZE - 1));
> - if (mincount * LDT_ENTRY_SIZE > PAGE_SIZE)
> - newldt = vmalloc(mincount * LDT_ENTRY_SIZE);
> + if (size > LDT_ENTRIES)
> + return NULL;
> +
> + new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL);
> + if (!new_ldt)
> + return NULL;
> +
> + BUILD_BUG_ON(LDT_ENTRY_SIZE != sizeof(struct desc_struct));
> + alloc_size = size * LDT_ENTRY_SIZE;
> +
> + if (alloc_size > PAGE_SIZE)
> + new_ldt->entries = vmalloc(alloc_size);
> else
> - newldt = (void *)__get_free_page(GFP_KERNEL);
> + new_ldt->entries = (void *)__get_free_page(GFP_KERNEL);

newline here.

> + if (!new_ldt->entries) {
> + kfree(new_ldt);
> + return NULL;
> + }
>
> - if (!newldt)
> - return -ENOMEM;
> + new_ldt->size = size;
> + return new_ldt;
> +}
> +
> +/* After calling this, the LDT is immutable. */
> +static void finalize_ldt_struct(struct ldt_struct *ldt)
> +{
> + paravirt_alloc_ldt(ldt->entries, ldt->size);
> +}
>
> - if (oldsize)
> - memcpy(newldt, pc->ldt, oldsize * LDT_ENTRY_SIZE);
> - oldldt = pc->ldt;
> - memset(newldt + oldsize * LDT_ENTRY_SIZE, 0,
> - (mincount - oldsize) * LDT_ENTRY_SIZE);
> +static void install_ldt(struct mm_struct *current_mm,
> + struct ldt_struct *ldt)
> +{
> + /* context.lock is held */
> + preempt_disable();
>
> - paravirt_alloc_ldt(newldt, mincount);
> + /* Synchronizes with lockless_dereference in load_mm_ldt. */

Good.
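
(For reference, the reader side of that pairing -- a simplified sketch
of what the patch's load_mm_ldt() in mmu_context.h does; the full
version also documents why the IPI keeps the dereferenced LDT safe:)

	static inline void load_mm_ldt(struct mm_struct *mm)
	{
		struct ldt_struct *ldt;

		/* Orders this load before any dereference of the
		 * ldt_struct; pairs with smp_store_release() in
		 * install_ldt(). */
		ldt = lockless_dereference(mm->context.ldt);

		if (unlikely(!ldt))
			clear_LDT();
		else
			set_ldt(ldt->entries, ldt->size);
	}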

> + smp_store_release(&current_mm->context.ldt, ldt);
>
> -#ifdef CONFIG_X86_64
> - /* CHECKME: Do we really need this ? */
> - wmb();
> -#endif
> - pc->ldt = newldt;
> - wmb();
> - pc->size = mincount;
> - wmb();
> + /* Activate for this CPU. */
> + flush_ldt(current->mm);
>
> - if (reload) {
> #ifdef CONFIG_SMP
> - preempt_disable();
> - load_LDT(pc);
> - if (!cpumask_equal(mm_cpumask(current->mm),
> - cpumask_of(smp_processor_id())))
> - smp_call_function(flush_ldt, current->mm, 1);
> - preempt_enable();
> -#else
> - load_LDT(pc);
> + /* Synchronize with other CPUs. */
> + if (!cpumask_equal(mm_cpumask(current_mm),
> + cpumask_of(smp_processor_id())))

Let it stick out:

if (!cpumask_equal(mm_cpumask(current_mm), cpumask_of(smp_processor_id())))
smp_call_function(flush_ldt, current_mm, 1);
> #endif
> - }
> - if (oldsize) {
> - paravirt_free_ldt(oldldt, oldsize);
> - if (oldsize * LDT_ENTRY_SIZE > PAGE_SIZE)
> - vfree(oldldt);
> - else
> - put_page(virt_to_page(oldldt));
> - }
> - return 0;
> + preempt_enable();
> }
>
> -static inline int copy_ldt(mm_context_t *new, mm_context_t *old)
> +static void free_ldt_struct(struct ldt_struct *ldt)
> {
> - int err = alloc_ldt(new, old->size, 0);
> - int i;
> -
> - if (err < 0)
> - return err;
> -
> - for (i = 0; i < old->size; i++)
> - write_ldt_entry(new->ldt, i, old->ldt + i * LDT_ENTRY_SIZE);
> - return 0;
> + if (unlikely(ldt)) {

Save an indentation level:

int alloc_size;

if (!ldt)
return;

alloc_size = sizeof(struct ldt_struct) + ldt->size * LDT_ENTRY_SIZE;

...

> + paravirt_free_ldt(ldt->entries, ldt->size);
> + if (alloc_size > PAGE_SIZE)
> + vfree(ldt->entries);
> + else
> + put_page(virt_to_page(ldt->entries));
> + kfree(ldt);
> + }
> }
>
> /*
> @@ -108,13 +110,32 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
> int retval = 0;
>
> mutex_init(&mm->context.lock);
> - mm->context.size = 0;
> old_mm = current->mm;
> - if (old_mm && old_mm->context.size > 0) {
> - mutex_lock(&old_mm->context.lock);
> - retval = copy_ldt(&mm->context, &old_mm->context);
> - mutex_unlock(&old_mm->context.lock);
> + if (!old_mm) {
> + mm->context.ldt = NULL;
> + return 0;
> + }
> +
> + mutex_lock(&old_mm->context.lock);
> + if (old_mm->context.ldt) {

Same here:

if (!old_mm->context.ldt) {
mm->context.ldt = NULL;
goto out_unlock;
}

new_ldt = ...

> + struct ldt_struct *new_ldt =
> + alloc_ldt_struct(old_mm->context.ldt->size);
> + if (!new_ldt) {
> + retval = -ENOMEM;
> + goto out_unlock;
> + }
> +
> + memcpy(new_ldt->entries, old_mm->context.ldt->entries,
> + new_ldt->size * LDT_ENTRY_SIZE);
> + finalize_ldt_struct(new_ldt);
> +
> + mm->context.ldt = new_ldt;
> + } else {
> + mm->context.ldt = NULL;
> }
> +
> +out_unlock:
> + mutex_unlock(&old_mm->context.lock);
> return retval;
> }
>
> @@ -125,53 +146,47 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
> */
> void destroy_context(struct mm_struct *mm)
> {
> - if (mm->context.size) {
> -#ifdef CONFIG_X86_32
> - /* CHECKME: Can this ever happen ? */
> - if (mm == current->active_mm)
> - clear_LDT();
> -#endif
> - paravirt_free_ldt(mm->context.ldt, mm->context.size);
> - if (mm->context.size * LDT_ENTRY_SIZE > PAGE_SIZE)
> - vfree(mm->context.ldt);
> - else
> - put_page(virt_to_page(mm->context.ldt));
> - mm->context.size = 0;
> - }
> + free_ldt_struct(mm->context.ldt);
> + mm->context.ldt = NULL;
> }
>
> static int read_ldt(void __user *ptr, unsigned long bytecount)
> {
> - int err;
> + int retval;
> unsigned long size;
> struct mm_struct *mm = current->mm;
>
> - if (!mm->context.size)
> - return 0;
> + mutex_lock(&mm->context.lock);
> +
> + if (!mm->context.ldt) {
> + retval = 0;
> + goto out_unlock;
> + }
> +
> if (bytecount > LDT_ENTRY_SIZE * LDT_ENTRIES)
> bytecount = LDT_ENTRY_SIZE * LDT_ENTRIES;
>
> - mutex_lock(&mm->context.lock);
> - size = mm->context.size * LDT_ENTRY_SIZE;
> + size = mm->context.ldt->size * LDT_ENTRY_SIZE;
> if (size > bytecount)
> size = bytecount;
>
> - err = 0;
> - if (copy_to_user(ptr, mm->context.ldt, size))
> - err = -EFAULT;
> - mutex_unlock(&mm->context.lock);
> - if (err < 0)
> - goto error_return;
> + if (copy_to_user(ptr, mm->context.ldt->entries, size)) {
> + retval = -EFAULT;
> + goto out_unlock;
> + }
> +
> if (size != bytecount) {
> - /* zero-fill the rest */
> + /* Zero-fill the rest and pretend we read bytecount bytes. */
> if (clear_user(ptr + size, bytecount - size) != 0) {

Make that:

if (clear_user(ptr + size, bytecount - size))

> - err = -EFAULT;
> - goto error_return;
> + retval = -EFAULT;
> + goto out_unlock;
> }
> }
> - return bytecount;
> -error_return:
> - return err;
> + retval = bytecount;
> +
> +out_unlock:
> + mutex_unlock(&mm->context.lock);
> + return retval;
> }
>
> static int read_default_ldt(void __user *ptr, unsigned long bytecount)
> @@ -195,6 +210,8 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
> struct desc_struct ldt;
> int error;
> struct user_desc ldt_info;
> + int oldsize, newsize;
> + struct ldt_struct *new_ldt, *old_ldt;
>
> error = -EINVAL;
> if (bytecount != sizeof(ldt_info))
> @@ -213,34 +230,43 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
> goto out;
> }
>
> - mutex_lock(&mm->context.lock);
> - if (ldt_info.entry_number >= mm->context.size) {
> - error = alloc_ldt(&current->mm->context,
> - ldt_info.entry_number + 1, 1);
> - if (error < 0)
> - goto out_unlock;
> - }
> -
> - /* Allow LDTs to be cleared by the user. */
> - if (ldt_info.base_addr == 0 && ldt_info.limit == 0) {
> - if (oldmode || LDT_empty(&ldt_info)) {
> - memset(&ldt, 0, sizeof(ldt));
> - goto install;
> + if ((oldmode && ldt_info.base_addr == 0 && ldt_info.limit == 0) ||

Shorten:

if ((oldmode && !ldt_info.base_addr && !ldt_info.limit) ||

> + LDT_empty(&ldt_info)) {
> + /* The user wants to clear the entry. */
> + memset(&ldt, 0, sizeof(ldt));
> + } else {
> + if (!IS_ENABLED(CONFIG_X86_16BIT) && !ldt_info.seg_32bit) {
> + error = -EINVAL;
> + goto out;
> }
> +
> + fill_ldt(&ldt, &ldt_info);
> + if (oldmode)
> + ldt.avl = 0;
> }
>
> - if (!IS_ENABLED(CONFIG_X86_16BIT) && !ldt_info.seg_32bit) {
> - error = -EINVAL;
> + mutex_lock(&mm->context.lock);
> +
> + old_ldt = mm->context.ldt;
> + oldsize = old_ldt ? old_ldt->size : 0;
> + newsize = max((int)(ldt_info.entry_number + 1), oldsize);
> +
> + error = -ENOMEM;
> + new_ldt = alloc_ldt_struct(newsize);
> + if (!new_ldt)
> goto out_unlock;
> - }
>
> - fill_ldt(&ldt, &ldt_info);
> - if (oldmode)
> - ldt.avl = 0;
> + if (old_ldt) {
> + memcpy(new_ldt->entries, old_ldt->entries,
> + oldsize * LDT_ENTRY_SIZE);
> + }

A single-statement if doesn't need {}, and you don't absolutely need to
keep to 80 cols. Just let it stick out.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-07-25 04:14:29

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH v3 1/3] x86/ldt: Make modify_ldt synchronous



On 07/22/2015 06:20 PM, Boris Ostrovsky wrote:
> On 07/22/2015 03:23 PM, Andy Lutomirski wrote:
>>
>> + error = -ENOMEM;
>> + new_ldt = alloc_ldt_struct(newsize);
>> + if (!new_ldt)
>> goto out_unlock;
>> - }
>> - fill_ldt(&ldt, &ldt_info);
>> - if (oldmode)
>> - ldt.avl = 0;
>> + if (old_ldt) {
>> + memcpy(new_ldt->entries, old_ldt->entries,
>> + oldsize * LDT_ENTRY_SIZE);
>> + }
>> + memset(new_ldt->entries + oldsize * LDT_ENTRY_SIZE, 0,
>> + (newsize - oldsize) * LDT_ENTRY_SIZE);
>
> We need to zero out the full page (probably better in alloc_ldt_struct()
> with vzalloc()/__GFP_ZERO) --- Xen checks the whole page that is assigned
> to the G/LDT and gets unhappy if an invalid descriptor is found there.
>
> This fixes one problem. There is something else that Xen gets upset
> about; I haven't figured out what it is yet (and I am out tomorrow, so it
> may need to wait until Friday).
>


What I thought was another problem turned out not to be one, so both the
64- and 32-bit tests passed on 64-bit PV (when the allocated LDT is
zeroed out).

However, on a 32-bit kernel the test fails the multicpu test; I don't
know yet what the cause is.

-boris

2015-07-25 04:52:24

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 1/3] x86/ldt: Make modify_ldt synchronous

On Fri, Jul 24, 2015 at 8:29 AM, Borislav Petkov <[email protected]> wrote:
> Let it stick out:
>
> if (!cpumask_equal(mm_cpumask(current_mm), cpumask_of(smp_processor_id())))
> smp_call_function(flush_ldt, current_mm, 1);

I see your wide terminal and raise you a complete rewrite of that
function. Sigh, why did I assume the old code was the right way to do
it?

>> #endif
>> - }
>> - if (oldsize) {
>> - paravirt_free_ldt(oldldt, oldsize);
>> - if (oldsize * LDT_ENTRY_SIZE > PAGE_SIZE)
>> - vfree(oldldt);
>> - else
>> - put_page(virt_to_page(oldldt));
>> - }
>> - return 0;
>> + preempt_enable();
>> }
>>
>> -static inline int copy_ldt(mm_context_t *new, mm_context_t *old)
>> +static void free_ldt_struct(struct ldt_struct *ldt)
>> {
>> - int err = alloc_ldt(new, old->size, 0);
>> - int i;
>> -
>> - if (err < 0)
>> - return err;
>> -
>> - for (i = 0; i < old->size; i++)
>> - write_ldt_entry(new->ldt, i, old->ldt + i * LDT_ENTRY_SIZE);
>> - return 0;
>> + if (unlikely(ldt)) {
>
> Save an indentation level:
>
> int alloc_size;
>
> if (!ldt)
> return;
>
> alloc_size = sizeof(struct ldt_struct) + ldt->size * LDT_ENTRY_SIZE;
>

Hah, we both missed it. This is wrong. (Fix your backport!)

> ...
>
>> + paravirt_free_ldt(ldt->entries, ldt->size);
>> + if (alloc_size > PAGE_SIZE)
>> + vfree(ldt->entries);
>> + else
>> + put_page(virt_to_page(ldt->entries));

I'm not sure this is correct, so I changed it to something obviously
correct (kmalloc/kfree).

>>
>> - fill_ldt(&ldt, &ldt_info);
>> - if (oldmode)
>> - ldt.avl = 0;
>> + if (old_ldt) {
>> + memcpy(new_ldt->entries, old_ldt->entries,
>> + oldsize * LDT_ENTRY_SIZE);
>> + }
>
> Single if-statement doesn't need {} and you don't absolutely need to
> keep 80cols. Just let it stick out.

You read too many of Linus' comments about using wider terminals :)

--Andy

2015-07-25 04:58:48

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 1/3] x86/ldt: Make modify_ldt synchronous

On Fri, Jul 24, 2015 at 9:13 PM, Boris Ostrovsky
<[email protected]> wrote:
>
>
> On 07/22/2015 06:20 PM, Boris Ostrovsky wrote:
>>
>> On 07/22/2015 03:23 PM, Andy Lutomirski wrote:
>>>
>>>
>>> + error = -ENOMEM;
>>> + new_ldt = alloc_ldt_struct(newsize);
>>> + if (!new_ldt)
>>> goto out_unlock;
>>> - }
>>> - fill_ldt(&ldt, &ldt_info);
>>> - if (oldmode)
>>> - ldt.avl = 0;
>>> + if (old_ldt) {
>>> + memcpy(new_ldt->entries, old_ldt->entries,
>>> + oldsize * LDT_ENTRY_SIZE);
>>> + }
>>> + memset(new_ldt->entries + oldsize * LDT_ENTRY_SIZE, 0,
>>> + (newsize - oldsize) * LDT_ENTRY_SIZE);
>>
>>
>> We need to zero out the full page (probably better in alloc_ldt_struct() with
>> vzalloc()/__GFP_ZERO) --- Xen checks the whole page that is assigned to the
>> G/LDT and gets unhappy if an invalid descriptor is found there.
>>
>> This fixes one problem. There is something else that Xen gets upset about;
>> I haven't figured out what it is yet (and I am out tomorrow, so it may need
>> to wait until Friday).
>>
>
>
> What I thought was another problem turned out not to be one, so both the 64-
> and 32-bit tests passed on 64-bit PV (when the allocated LDT is zeroed out).
>
> However, on a 32-bit kernel the test fails the multicpu test; I don't know
> yet what the cause is.

Test case bug or unrelated kernel bug, depending on your point of view.
I forgot that x86_32 and x86_64 have very different handling of IRET
faults. Wait for v2 :)

--Andy

2015-07-25 08:37:39

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v3 1/3] x86/ldt: Make modify_ldt synchronous

On Fri, Jul 24, 2015 at 09:52:01PM -0700, Andy Lutomirski wrote:
> I see your wide terminal and raise you a complete rewrite of that
> function. Sigh, why did I assume the old code was the right way to do
> it?

That's a mostly wrong assumption, as experience proves.

> Hah¸ we both missed it. This is wrong. (Fix your backport!)

Yikes:

alloc_size = size * LDT_ENTRY_SIZE;

But hey, I made you spot it, still! :-)

Done.

> I'm not sure this is correct, so I changed it to something obviously
> correct (kmalloc/kfree).

Someone thought she wouldn't get contiguous memory from kmalloc(). But how
big would alloc_size have to be for that to fail...

> You read too many of Linus' comments about using wider terminals :)

Nah, I'm just trying to put some sanity back into that 80-cols rule, which
even you think is a hard one. And I say: keep 80 cols, but sanity can
override it if what 80 cols produces is crap.

I trust you're sane enough to apply that and not think of C++ or Java
wankery. Woahahahah...

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--