2018-09-02 17:36:04

by Nadav Amit

Subject: [PATCH v2 0/6] x86/alternatives: text_poke() fixes

This patch-set addresses some issues that were raised in a recent
correspondence and might affect the security and the correctness of code
patching. (Note that patching performance is not addressed by this
patch-set).

The main issue that the patches deal with is the fact that the fixmap
PTEs that are used for patching are available for access from other
cores and might be exploited. They are not even flushed from the TLB in
remote cores, so the risk is even higher. Address this issue by
introducing a temporary mm that is only used during patching.
Unfortunately, due to init ordering, fixmap is still used during
boot-time patching. Future patches can eliminate the need for it.

The second issue is the missing lockdep assertion to ensure text_mutex
is taken. It is actually not always taken, so fix the instances that
were found not to take the lock (although they should be safe even
without taking the lock).

Finally, try to be more conservative and map a single page, instead
of two, when possible. This helps both security and performance.

In addition, there is some cleanup of the patching code to make it more
readable.

v1->v2:
- Partial revert of 9222f606506c added to 1/6 [masami]
- Added Masami's reviewed-by tag

RFC->v1:
- Added handling of error in get_locked_pte()
- Remove lockdep assertion, clarify text_mutex use instead [masami]
- Comment fix [peterz]
- Removed remainders of text_poke return value [masami]
- Use __weak for poking_init instead of macros [masami]
- Simplify error handling in poking_init [masami]

Cc: Jiri Kosina <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: https://lkml.org/lkml/2018/8/24/586

Andy Lutomirski (1):
x86/mm: temporary mm struct

Nadav Amit (5):
Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"
fork: provide a function for copying init_mm
x86/alternatives: initializing temporary mm for patching
x86/alternatives: use temporary mm for text poking
x86/alternatives: remove text_poke() return value

arch/x86/include/asm/mmu_context.h | 20 +++
arch/x86/include/asm/pgtable.h | 3 +
arch/x86/include/asm/text-patching.h | 4 +-
arch/x86/kernel/alternative.c | 175 +++++++++++++++++++++++----
arch/x86/mm/init_64.c | 29 +++++
include/linux/sched/task.h | 1 +
init/main.c | 3 +
kernel/fork.c | 24 +++-
8 files changed, 227 insertions(+), 32 deletions(-)

--
2.17.1



2018-09-02 17:36:04

by Nadav Amit

Subject: [PATCH v2 6/6] x86/alternatives: remove text_poke() return value

The return value of text_poke() is meaningless - it is one of the
function inputs. One day someone may allow the callers to deal with
text_poke() failures, if those actually happen.

In the meantime, remove the return value.

Cc: Andy Lutomirski <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Reviewed-by: Masami Hiramatsu <[email protected]>
Tested-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/text-patching.h | 2 +-
arch/x86/kernel/alternative.c | 3 +--
2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index ffe7902cc326..1f73f71b4de2 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -34,7 +34,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
* On the local CPU you need to be protected again NMI or MCE handlers seeing an
* inconsistent instruction while you patch.
*/
-extern void *text_poke(void *addr, const void *opcode, size_t len);
+extern void text_poke(void *addr, const void *opcode, size_t len);
extern int poke_int3_handler(struct pt_regs *regs);
extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
extern int after_bootmem;
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index edca599c4479..7a30c5a3ca37 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -809,7 +809,7 @@ static void text_poke_safe(void *addr, const void *opcode, size_t len,
* holding the mutex and ensures that none of them will acquire the
* mutex while the code runs.
*/
-void *text_poke(void *addr, const void *opcode, size_t len)
+void text_poke(void *addr, const void *opcode, size_t len)
{
bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
struct page *pages[2] = {0};
@@ -851,7 +851,6 @@ void *text_poke(void *addr, const void *opcode, size_t len)
BUG_ON(memcmp(addr, opcode, len));

local_irq_restore(flags);
- return addr;
}

static void do_sync_core(void *info)
--
2.17.1


2018-09-02 17:36:04

by Nadav Amit

Subject: [PATCH v2 3/6] fork: provide a function for copying init_mm

Provide a function for copying init_mm. This function will later be used
for setting up a temporary mm.

Cc: Andy Lutomirski <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Dave Hansen <[email protected]>
Reviewed-by: Masami Hiramatsu <[email protected]>
Tested-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
include/linux/sched/task.h | 1 +
kernel/fork.c | 24 ++++++++++++++++++------
2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 108ede99e533..ac0a675678f5 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -74,6 +74,7 @@ extern void exit_itimers(struct signal_struct *);
extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
struct task_struct *fork_idle(int);
+struct mm_struct *copy_init_mm(void);
extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
extern long kernel_wait4(pid_t, int __user *, int, struct rusage *);

diff --git a/kernel/fork.c b/kernel/fork.c
index d896e9ca38b0..a1c637b903c1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1254,13 +1254,20 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
complete_vfork_done(tsk);
}

-/*
- * Allocate a new mm structure and copy contents from the
- * mm structure of the passed in task structure.
+/**
+ * dup_mm() - duplicates an existing mm structure
+ * @tsk: the task_struct with which the new mm will be associated.
+ * @oldmm: the mm to duplicate.
+ *
+ * Allocates a new mm structure and copy contents from the provided
+ * @oldmm structure.
+ *
+ * Return: the duplicated mm or NULL on failure.
*/
-static struct mm_struct *dup_mm(struct task_struct *tsk)
+static struct mm_struct *dup_mm(struct task_struct *tsk,
+ struct mm_struct *oldmm)
{
- struct mm_struct *mm, *oldmm = current->mm;
+ struct mm_struct *mm;
int err;

mm = allocate_mm();
@@ -1327,7 +1334,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
}

retval = -ENOMEM;
- mm = dup_mm(tsk);
+ mm = dup_mm(tsk, current->mm);
if (!mm)
goto fail_nomem;

@@ -2127,6 +2134,11 @@ struct task_struct *fork_idle(int cpu)
return task;
}

+struct mm_struct *copy_init_mm(void)
+{
+ return dup_mm(NULL, &init_mm);
+}
+
/*
* Ok, this is the main fork-routine.
*
--
2.17.1


2018-09-02 17:36:04

by Nadav Amit

Subject: [PATCH v2 2/6] x86/mm: temporary mm struct

From: Andy Lutomirski <[email protected]>

Sometimes we want to set temporary page-table entries (PTEs) on one of
the cores, without allowing other cores to use - even speculatively -
these mappings. There are two benefits to doing so:

(1) Security: if sensitive PTEs are set, the temporary mm prevents their
use on other cores. This hardens security, as it prevents exploiting a
dangling pointer to overwrite sensitive data through the sensitive PTE.

(2) Avoiding TLB shootdowns: the PTEs do not need to be flushed from
remote TLBs.

To do so, a temporary mm_struct can be used. Mappings which are private
to this mm can be set in the userspace part of the address-space.
During the whole time in which the temporary mm is loaded, interrupts
must be disabled.

The first use-case for temporary PTEs, which will follow, is for poking
the kernel text.

[ Commit message was written by Nadav ]

Cc: Kees Cook <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Dave Hansen <[email protected]>
Reviewed-by: Masami Hiramatsu <[email protected]>
Tested-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Andy Lutomirski <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/mmu_context.h | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index eeeb9289c764..96afc8c0cf15 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -338,4 +338,24 @@ static inline unsigned long __get_current_cr3_fast(void)
return cr3;
}

+typedef struct {
+ struct mm_struct *prev;
+} temporary_mm_state_t;
+
+static inline temporary_mm_state_t use_temporary_mm(struct mm_struct *mm)
+{
+ temporary_mm_state_t state;
+
+ lockdep_assert_irqs_disabled();
+ state.prev = this_cpu_read(cpu_tlbstate.loaded_mm);
+ switch_mm_irqs_off(NULL, mm, current);
+ return state;
+}
+
+static inline void unuse_temporary_mm(temporary_mm_state_t prev)
+{
+ lockdep_assert_irqs_disabled();
+ switch_mm_irqs_off(NULL, prev.prev, current);
+}
+
#endif /* _ASM_X86_MMU_CONTEXT_H */
--
2.17.1


2018-09-02 17:36:04

by Nadav Amit

Subject: [PATCH v2 4/6] x86/alternatives: initializing temporary mm for patching

To prevent improper use of the PTEs that are used for text patching, we
want to use a temporary mm struct. We initialize it by copying the init
mm.

The address that will be used for patching is taken from the lower area
that is usually used for task memory. Doing so removes the need to
frequently synchronize the temporary mm (e.g., when BPF programs are
installed), since different PGDs are used for the task memory.

Finally, we randomize the address of the PTEs to harden against exploits
that use these PTEs.
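
The "same PMD" constraint described above can be sketched in plain C.
This is a userspace model, not the kernel code: PAGE_SIZE, PMD_SIZE and
the BASE/LIMIT bounds are illustrative constants standing in for the
kernel's definitions and for TASK_UNMAPPED_BASE/TASK_SIZE, and the
random value that kaslr_get_random_long() would provide is passed in by
the caller.

```c
#include <assert.h>

#define PAGE_SIZE	4096UL
#define PAGE_MASK	(~(PAGE_SIZE - 1))
#define PMD_SIZE	(512 * PAGE_SIZE)	/* 2 MiB with 4 KiB pages */
#define PMD_MASK	(~(PMD_SIZE - 1))

/* Illustrative stand-ins for TASK_UNMAPPED_BASE and TASK_SIZE. */
#define BASE	0x10000000UL
#define LIMIT	0x20000000UL

/*
 * Pick a page-aligned poking address such that two consecutive pages
 * starting at it never straddle a PMD boundary: reserve room for 3
 * pages, then bump by one page if the second page would otherwise
 * start exactly at a PMD boundary.
 */
static unsigned long pick_poking_addr(unsigned long rnd)
{
	unsigned long addr = BASE +
		(rnd & PAGE_MASK) % (LIMIT - BASE - 3 * PAGE_SIZE);

	if (((addr + PAGE_SIZE) & ~PMD_MASK) == 0)
		addr += PAGE_SIZE;
	return addr;
}
```

With BASE PMD-aligned, rnd = PMD_SIZE - PAGE_SIZE lands the second page
exactly on a PMD boundary, so the helper bumps the address by one page;
for any rnd, both pages end up under the same PMD.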

Cc: Kees Cook <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Dave Hansen <[email protected]>
Reviewed-by: Masami Hiramatsu <[email protected]>
Tested-by: Masami Hiramatsu <[email protected]>
Suggested-by: Andy Lutomirski <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/pgtable.h | 3 +++
arch/x86/include/asm/text-patching.h | 2 ++
arch/x86/mm/init_64.c | 29 ++++++++++++++++++++++++++++
init/main.c | 3 +++
4 files changed, 37 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e4ffa565a69f..3de9a1fb7a9a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1022,6 +1022,9 @@ static inline void __meminit init_trampoline_default(void)
/* Default trampoline pgd value */
trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
}
+
+void __init poking_init(void);
+
# ifdef CONFIG_RANDOMIZE_MEMORY
void __meminit init_trampoline(void);
# else
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..ffe7902cc326 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -38,5 +38,7 @@ extern void *text_poke(void *addr, const void *opcode, size_t len);
extern int poke_int3_handler(struct pt_regs *regs);
extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
extern int after_bootmem;
+extern __ro_after_init struct mm_struct *poking_mm;
+extern __ro_after_init unsigned long poking_addr;

#endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index dd519f372169..db33a724a054 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -54,6 +54,7 @@
#include <asm/init.h>
#include <asm/uv/uv.h>
#include <asm/setup.h>
+#include <asm/text-patching.h>

#include "mm_internal.h"

@@ -1389,6 +1390,34 @@ unsigned long memory_block_size_bytes(void)
return memory_block_size_probed;
}

+/*
+ * Initialize an mm_struct to be used during poking and a pointer to be used
+ * during patching. If anything fails during initialization, poking will be done
+ * using the fixmap, which is unsafe, so warn the user about it.
+ */
+void __init poking_init(void)
+{
+ unsigned long poking_addr;
+
+ poking_mm = copy_init_mm();
+ if (!poking_mm) {
+ pr_err("x86/mm: error setting a separate poking address space");
+ return;
+ }
+
+ /*
+ * Randomize the poking address, but make sure that the following page
+ * will be mapped at the same PMD. We need 2 pages, so find space for 3,
+ * and adjust the address if the PMD ends after the first one.
+ */
+ poking_addr = TASK_UNMAPPED_BASE +
+ (kaslr_get_random_long("Poking") & PAGE_MASK) %
+ (TASK_SIZE - TASK_UNMAPPED_BASE - 3 * PAGE_SIZE);
+
+ if (((poking_addr + PAGE_SIZE) & ~PMD_MASK) == 0)
+ poking_addr += PAGE_SIZE;
+}
+
#ifdef CONFIG_SPARSEMEM_VMEMMAP
/*
* Initialise the sparsemem vmemmap using huge-pages at the PMD level.
diff --git a/init/main.c b/init/main.c
index 18f8f0140fa0..8c6dd8d88fca 100644
--- a/init/main.c
+++ b/init/main.c
@@ -496,6 +496,8 @@ void __init __weak thread_stack_cache_init(void)

void __init __weak mem_encrypt_init(void) { }

+void __init __weak poking_init(void) { }
+
bool initcall_debug;
core_param(initcall_debug, initcall_debug, bool, 0644);

@@ -725,6 +727,7 @@ asmlinkage __visible void __init start_kernel(void)
taskstats_init_early();
delayacct_init();

+ poking_init();
check_bugs();

acpi_subsystem_init();
--
2.17.1


2018-09-02 17:36:04

by Nadav Amit

Subject: [PATCH v2 5/6] x86/alternatives: use temporary mm for text poking

text_poke() can potentially compromise security, as it sets temporary
PTEs in the fixmap. These PTEs might be used to rewrite the kernel code
from other cores, accidentally or maliciously, if an attacker gains the
ability to write into kernel memory.

Moreover, since remote TLBs are not flushed after the temporary PTEs are
removed, the time-window in which the code is writable is not limited if
the fixmap PTEs - maliciously or accidentally - are cached in the TLB.

To address these potential security hazards, we use a temporary mm for
patching the code. Unfortunately, the temporary mm cannot be initialized
early enough during boot, and as a result x86_late_time_init() needs to
use text_poke() before it is initialized. text_poke() therefore keeps
the two poking versions - using the fixmap and using the temporary mm -
and uses them accordingly.

More adventurous developers can try to reorder the init sequence or use
text_poke_early() instead of text_poke() to remove the use of fixmap for
patching completely.

Finally, text_poke() is also not conservative enough when mapping pages,
as it always tries to map 2 pages, even when a single one is sufficient.
So try to be more conservative, and do not map more than needed.
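
The single-page optimization boils down to a check on the write's page
offset. A minimal userspace sketch (PAGE_SIZE and offset_in_page() are
stand-ins for the kernel definitions):

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

static unsigned long offset_in_page(unsigned long addr)
{
	return addr & (PAGE_SIZE - 1);
}

/*
 * A patch of @len bytes at @addr needs a second page only when it
 * extends past the end of the page containing @addr.
 */
static int cross_page_boundary(unsigned long addr, size_t len)
{
	return offset_in_page(addr) + len > PAGE_SIZE;
}
```

Note the strict comparison: a write that ends exactly at the page
boundary still fits in a single page.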

Cc: Andy Lutomirski <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Dave Hansen <[email protected]>
Reviewed-by: Masami Hiramatsu <[email protected]>
Tested-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/kernel/alternative.c | 165 +++++++++++++++++++++++++++++-----
1 file changed, 144 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index e9be18245698..edca599c4479 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -11,6 +11,7 @@
#include <linux/stop_machine.h>
#include <linux/slab.h>
#include <linux/kdebug.h>
+#include <linux/mmu_context.h>
#include <asm/text-patching.h>
#include <asm/alternative.h>
#include <asm/sections.h>
@@ -674,6 +675,124 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
return addr;
}

+/**
+ * text_poke_fixmap - poke using the fixmap.
+ *
+ * Fallback function for poking the text using the fixmap. It is used during
+ * early boot and in the rare case in which initialization of safe poking fails.
+ *
+ * Poking in this manner should be avoided, since it allows other cores to use
+ * the fixmap entries, and can be exploited by an attacker to overwrite the code
+ * (assuming he gained the write access through another bug).
+ */
+static void text_poke_fixmap(void *addr, const void *opcode, size_t len,
+ struct page *pages[2])
+{
+ u8 *vaddr;
+
+ set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
+ if (pages[1])
+ set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
+ vaddr = (u8 *)fix_to_virt(FIX_TEXT_POKE0);
+ memcpy(vaddr + offset_in_page(addr), opcode, len);
+
+ /*
+ * clear_fixmap() performs a TLB flush, so no additional TLB
+ * flush is needed.
+ */
+ clear_fixmap(FIX_TEXT_POKE0);
+ if (pages[1])
+ clear_fixmap(FIX_TEXT_POKE1);
+ sync_core();
+
+ /*
+ * Could also do a CLFLUSH here to speed up CPU recovery; but
+ * that causes hangs on some VIA CPUs.
+ */
+}
+
+__ro_after_init struct mm_struct *poking_mm;
+__ro_after_init unsigned long poking_addr;
+
+/**
+ * text_poke_safe() - Pokes the text using a separate address space.
+ *
+ * This is the preferable way for patching the kernel after boot, as it does not
+ * allow other cores to accidentally or maliciously modify the code using the
+ * temporary PTEs.
+ */
+static void text_poke_safe(void *addr, const void *opcode, size_t len,
+ struct page *pages[2])
+{
+ temporary_mm_state_t prev;
+ pte_t pte, *ptep;
+ spinlock_t *ptl;
+
+ /*
+ * The lock is not really needed, but this allows to avoid open-coding.
+ */
+ ptep = get_locked_pte(poking_mm, poking_addr, &ptl);
+
+ /*
+ * If we failed to allocate a PTE, fail silently. The caller (text_poke)
+ * will detect that the write failed when it compares the memory with
+ * the new opcode.
+ */
+ if (unlikely(!ptep))
+ return;
+
+ pte = mk_pte(pages[0], PAGE_KERNEL);
+ set_pte_at(poking_mm, poking_addr, ptep, pte);
+
+ if (pages[1]) {
+ pte = mk_pte(pages[1], PAGE_KERNEL);
+ set_pte_at(poking_mm, poking_addr + PAGE_SIZE, ptep + 1, pte);
+ }
+
+ /*
+ * Loading the temporary mm behaves as a compiler barrier, which
+ * guarantees that the PTE will be set at the time memcpy() is done.
+ */
+ prev = use_temporary_mm(poking_mm);
+
+ memcpy((u8 *)poking_addr + offset_in_page(addr), opcode, len);
+
+ /*
+ * Ensure that the PTE is only cleared after the instructions of memcpy
+ * were issued by using a compiler barrier.
+ */
+ barrier();
+
+ pte_clear(poking_mm, poking_addr, ptep);
+
+ /*
+ * __flush_tlb_one_user() performs a redundant TLB flush when PTI is on,
+ * as it also flushes the corresponding "user" address spaces, which
+ * does not exist.
+ *
+ * Poking, however, is already very inefficient since it does not try to
+ * batch updates, so we ignore this problem for the time being.
+ *
+ * Since the PTEs do not exist in other kernel address-spaces, we do
+ * not use __flush_tlb_one_kernel(), which when PTI is on would cause
+ * more unwarranted TLB flushes.
+ */
+ __flush_tlb_one_user(poking_addr);
+ if (pages[1]) {
+ pte_clear(poking_mm, poking_addr + PAGE_SIZE, ptep + 1);
+ __flush_tlb_one_user(poking_addr + PAGE_SIZE);
+ }
+
+ /*
+ * Loading the previous page-table hierarchy requires a serializing
+ * instruction that already allows the core to see the updated version.
+ * Xen-PV is assumed to serialize execution in a similar manner.
+ */
+ unuse_temporary_mm(prev);
+
+ pte_unmap_unlock(ptep, ptl);
+}
+
/**
* text_poke - Update instructions on a live kernel
* @addr: address to modify
@@ -692,41 +811,45 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
*/
void *text_poke(void *addr, const void *opcode, size_t len)
{
+ bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
+ struct page *pages[2] = {0};
unsigned long flags;
- char *vaddr;
- struct page *pages[2];
- int i;

/*
- * While boot memory allocator is runnig we cannot use struct
- * pages as they are not yet initialized.
+ * While boot memory allocator is running we cannot use struct pages as
+ * they are not yet initialized.
*/
BUG_ON(!after_bootmem);

if (!core_kernel_text((unsigned long)addr)) {
pages[0] = vmalloc_to_page(addr);
- pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
+ if (cross_page_boundary)
+ pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
} else {
pages[0] = virt_to_page(addr);
WARN_ON(!PageReserved(pages[0]));
- pages[1] = virt_to_page(addr + PAGE_SIZE);
+ if (cross_page_boundary)
+ pages[1] = virt_to_page(addr + PAGE_SIZE);
}
BUG_ON(!pages[0]);
local_irq_save(flags);
- set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
- if (pages[1])
- set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
- vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
- memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
- clear_fixmap(FIX_TEXT_POKE0);
- if (pages[1])
- clear_fixmap(FIX_TEXT_POKE1);
- local_flush_tlb();
- sync_core();
- /* Could also do a CLFLUSH here to speed up CPU recovery; but
- that causes hangs on some VIA CPUs. */
- for (i = 0; i < len; i++)
- BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]);
+
+ /*
+ * During initial boot, it is hard to initialize poking_mm due to
+ * dependencies in boot order.
+ */
+ if (poking_mm)
+ text_poke_safe(addr, opcode, len, pages);
+ else
+ text_poke_fixmap(addr, opcode, len, pages);
+
+ /*
+ * To be on the safe side, do the comparison before enabling IRQs, as it
+ * was done before. However, it makes more sense to allow the callers to
+ * deal with potential failures and not to panic so easily.
+ */
+ BUG_ON(memcmp(addr, opcode, len));
+
local_irq_restore(flags);
return addr;
}
--
2.17.1


2018-09-02 17:37:44

by Nadav Amit

Subject: [PATCH v2 1/6] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"

text_mutex is expected to be held before text_poke() is called, but we
cannot add a lockdep assertion since kgdb does not take it, and instead
*supposedly* ensures the lock is not taken and will not be acquired by
any other core while text_poke() is running.

The reason for the "supposedly" comment is that it is not entirely clear
that this would be the case if kgdb_do_roundup is zero.

Add a comment to clarify this behavior, and restore the assertions as
they were before the recent commit.

This partially reverts commit 9222f606506c ("x86/alternatives:
Lockdep-enforce text_mutex in text_poke*()")

Cc: Jiri Kosina <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Dave Hansen <[email protected]>
Fixes: 9222f606506c ("x86/alternatives: Lockdep-enforce text_mutex in text_poke*()")
Reviewed-by: Masami Hiramatsu <[email protected]>
Tested-by: Masami Hiramatsu <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/kernel/alternative.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index b9d5e7c9ef43..e9be18245698 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -684,6 +684,11 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
* It means the size must be writable atomically and the address must be aligned
* in a way that permits an atomic write. It also makes sure we fit on a single
* page.
+ *
+ * Context: Must be called under text_mutex. kgdb is an exception: it does not
+ * hold the mutex, as it *supposedly* ensures that no other core is
+ * holding the mutex and ensures that none of them will acquire the
+ * mutex while the code runs.
*/
void *text_poke(void *addr, const void *opcode, size_t len)
{
@@ -698,8 +703,6 @@ void *text_poke(void *addr, const void *opcode, size_t len)
*/
BUG_ON(!after_bootmem);

- lockdep_assert_held(&text_mutex);
-
if (!core_kernel_text((unsigned long)addr)) {
pages[0] = vmalloc_to_page(addr);
pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
--
2.17.1


2018-09-05 18:58:48

by Peter Zijlstra

Subject: Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes

On Sun, Sep 02, 2018 at 10:32:18AM -0700, Nadav Amit wrote:
> This patch-set addresses some issues that were raised in a recent
> correspondence and might affect the security and the correctness of code
> patching. (Note that patching performance is not addressed by this
> patch-set).
>
> The main issue that the patches deal with is the fact that the fixmap
> PTEs that are used for patching are available for access from other
> cores and might be exploited. They are not even flushed from the TLB in
> remote cores, so the risk is even higher. Address this issue by
> introducing a temporary mm that is only used during patching.
> Unfortunately, due to init ordering, fixmap is still used during
> boot-time patching. Future patches can eliminate the need for it.
>

Remind me; why are we doing it like this instead of fixing fixmap?
Because while this fixes the text_poke crud, it does leave fixmap
broken.

2018-09-05 19:04:58

by Nadav Amit

Subject: Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes

at 11:56 AM, Peter Zijlstra <[email protected]> wrote:

> On Sun, Sep 02, 2018 at 10:32:18AM -0700, Nadav Amit wrote:
>> This patch-set addresses some issues that were raised in a recent
>> correspondence and might affect the security and the correctness of code
>> patching. (Note that patching performance is not addressed by this
>> patch-set).
>>
>> The main issue that the patches deal with is the fact that the fixmap
>> PTEs that are used for patching are available for access from other
>> cores and might be exploited. They are not even flushed from the TLB in
>> remote cores, so the risk is even higher. Address this issue by
>> introducing a temporary mm that is only used during patching.
>> Unfortunately, due to init ordering, fixmap is still used during
>> boot-time patching. Future patches can eliminate the need for it.
>
> Remind me; why are we doing it like this instead of fixing fixmap?
> Because while this fixes the text_poke crud, it does leave fixmap
> broken.

Do you have other fixmap mappings in mind that are modified after boot?


2018-09-05 19:12:53

by Nadav Amit

Subject: Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes

at 12:02 PM, Nadav Amit <[email protected]> wrote:

> at 11:56 AM, Peter Zijlstra <[email protected]> wrote:
>
>> On Sun, Sep 02, 2018 at 10:32:18AM -0700, Nadav Amit wrote:
>>> This patch-set addresses some issues that were raised in a recent
>>> correspondence and might affect the security and the correctness of code
>>> patching. (Note that patching performance is not addressed by this
>>> patch-set).
>>>
>>> The main issue that the patches deal with is the fact that the fixmap
>>> PTEs that are used for patching are available for access from other
>>> cores and might be exploited. They are not even flushed from the TLB in
>>> remote cores, so the risk is even higher. Address this issue by
>>> introducing a temporary mm that is only used during patching.
>>> Unfortunately, due to init ordering, fixmap is still used during
>>> boot-time patching. Future patches can eliminate the need for it.
>>
>> Remind me; why are we doing it like this instead of fixing fixmap?
>> Because while this fixes the text_poke crud, it does leave fixmap
>> broken.
>
> Do you have other fixmap mappings in mind that are modified after boot?

Oh.. I misunderstood you. You mean: why not to make the fixmap mappings that
are used for text_poke() as private ones.

Well, the main reason is that it can require synchronizations of the
different page-tables whenever a module is loaded/unloaded. The fixmap
region shares a PGD and PUD with the modules area in x86-64.

In contrast, the proposed solution uses a different PGD, so no
synchronization between page-tables is needed when modules are loaded.
Remember that module memory is allocated even when BPF programs are
installed, which can be rather common scenario.
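
The PGD-sharing point can be illustrated with the x86-64 4-level
address-bit arithmetic. The addresses below are illustrative examples
rather than the exact kernel layout; the point is that the fixmap and
the modules area fall under the same top-level (PGD) entry, while a
user-half address (as used by the temporary mm) does not:

```c
#include <assert.h>

/* x86-64 4-level paging: bits 47..39 index the PGD. */
#define PGDIR_SHIFT	39
#define PTRS_PER_PGD	512

static unsigned long pgd_index(unsigned long addr)
{
	return (addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1);
}
```

Any two addresses in the top 512 GiB of the address space (which holds
both the fixmap and the modules area) share PGD entry 511, so updating
one region's page-tables means the shared PGD must be kept in sync
across all page-table roots; a user-range poking address indexes a
different PGD entry entirely.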


2018-09-06 08:14:59

by Peter Zijlstra

Subject: Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes

On Wed, Sep 05, 2018 at 07:10:46PM +0000, Nadav Amit wrote:
> at 12:02 PM, Nadav Amit <[email protected]> wrote:
>
> > at 11:56 AM, Peter Zijlstra <[email protected]> wrote:
> >
> >> On Sun, Sep 02, 2018 at 10:32:18AM -0700, Nadav Amit wrote:
> >>> This patch-set addresses some issues that were raised in a recent
> >>> correspondence and might affect the security and the correctness of code
> >>> patching. (Note that patching performance is not addressed by this
> >>> patch-set).
> >>>
> >>> The main issue that the patches deal with is the fact that the fixmap
> >>> PTEs that are used for patching are available for access from other
> >>> cores and might be exploited. They are not even flushed from the TLB in
> >>> remote cores, so the risk is even higher. Address this issue by
> >>> introducing a temporary mm that is only used during patching.
> >>> Unfortunately, due to init ordering, fixmap is still used during
> >>> boot-time patching. Future patches can eliminate the need for it.
> >>
> >> Remind me; why are we doing it like this instead of fixing fixmap?
> >> Because while this fixes the text_poke crud, it does leave fixmap
> >> broken.
> >
> > Do you have other fixmap mappings in mind that are modified after boot?
>
> Oh.. I misunderstood you. You mean: why not to make the fixmap mappings that
> are used for text_poke() as private ones.

No, you got it the first time. There are in fact more fixmap abusers;
see drivers/acpi/apei/ghes.c. Also, as long as set_fixmap() allows
overwriting a _PAGE_PRESENT pte and has that dodgy
__flush_tlb_one_kernel() in it, the broken remains (and can return).

So we need to fix fixmap, to either disallow overwriting a _PAGE_PRESENT
pte, or to issue a full TLB invalidate.

Either fix will terminally break GHES, but that's OK, they've known
about this issue since 2015 and haven't cared, so I can't be bothered
about them.


2018-09-06 08:44:31

by Peter Zijlstra

Subject: Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes

On Thu, Sep 06, 2018 at 10:13:00AM +0200, Peter Zijlstra wrote:

> No, you got it the first time. There are in fact more fixmap abusers;
> see drivers/acpi/apei/ghes.c. Also, as long as set_fixmap() allows
> overwriting a _PAGE_PRESENT pte and has that dodgy
> __flush_tlb_one_kernel() in it, the broken remains (and can return).
>
> So we need to fix fixmap, to either disallow overwriting a _PAGE_PRESENT
> pte, or to issue a full TLB invalidate.
>
> Either fix will terminally break GHES, but that's OK, they've known
> about this issue since 2015 and haven't cared, so I can't be bothered
> about them.

Worse, when I boot with the below patch in, I get hits on text_poke (we
knew about this), but I also get hits from:

release_ds_buffer()

Because PEBS/BTS hardware write through the page-tables (I know right),
their buffers need to be in the PTI user map, and we do that by mapping
the pages in the cpu_entry_area, and that uses fixmap.

---
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index dd519f372169..20dcd6fa690e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -259,9 +259,16 @@ static pte_t *fill_pte(pmd_t *pmd, unsigned long vaddr)
static void __set_pte_vaddr(pud_t *pud, unsigned long vaddr, pte_t new_pte)
{
pmd_t *pmd = fill_pmd(pud, vaddr);
- pte_t *pte = fill_pte(pmd, vaddr);
+ pte_t *ptep = fill_pte(pmd, vaddr);

- set_pte(pte, new_pte);
+ pte_t pte = *ptep;
+
+ set_pte(ptep, new_pte);
+
+ WARN((pte_val(pte) & _PAGE_PRESENT) && (pte_val(pte) != pte_val(new_pte)),
+ "set_fixmap(%d, %lx), was: %lx",
+ (int)__virt_to_fix(vaddr),
+ pte_val(new_pte), pte_val(pte));

/*
* It's enough to flush this one mapping.

2018-09-06 09:20:06

by Peter Zijlstra

Subject: Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes

On Thu, Sep 06, 2018 at 10:13:00AM +0200, Peter Zijlstra wrote:
> No, you got it the first time. There are in fact more fixmap abusers;
> see drivers/acpi/apei/ghes.c. Also, as long as set_fixmap() allows
> overwriting a _PAGE_PRESENT pte and has that dodgy
> __flush_tlb_one_kernel() in it, the broken remains (and can return).
>
> So we need to fix fixmap, to either disallow overwriting a _PAGE_PRESENT
> pte, or to issue a full TLB invalidate.
>
> Either fix will terminally break GHES, but that's OK, they've known
> about this issue since 2015 and haven't cared, so I can't be bothered
> about them.

The below, combined with your text_poke() patches (text_poke disables
IRQs, so the below still complains, and I was lazy) makes my machine
(mostly) happy.

There's still some complaining before the poking_mm exists. Let me see
if I can't cure that.

---

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index dd519f372169..8df85fe66332 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -259,15 +260,27 @@ static pte_t *fill_pte(pmd_t *pmd, unsigned long vaddr)
static void __set_pte_vaddr(pud_t *pud, unsigned long vaddr, pte_t new_pte)
{
pmd_t *pmd = fill_pmd(pud, vaddr);
- pte_t *pte = fill_pte(pmd, vaddr);
+ pte_t *ptep = fill_pte(pmd, vaddr);

- set_pte(pte, new_pte);
+ pte_t pte = *ptep;

- /*
- * It's enough to flush this one mapping.
- * (PGE mappings get flushed as well)
- */
- __flush_tlb_one_kernel(vaddr);
+ set_pte(ptep, new_pte);
+
+ if (irqs_disabled()) {
+ WARN((pte_val(pte) & _PAGE_PRESENT) && (pte_val(pte) != pte_val(new_pte)),
+ "set_fixmap(%d, %lx), was: %lx",
+ (int)__virt_to_fix(vaddr),
+ pte_val(new_pte), pte_val(pte));
+
+ /*
+ * It is _NOT_ enough to flush just the local mapping;
+ * it might mostly work, but there is no guarantee a remote
+ * CPU didn't load this entry into its TLB.
+ */
+ __flush_tlb_one_kernel(vaddr);
+ } else {
+ flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
+ }
}

void set_pte_vaddr_p4d(p4d_t *p4d_page, unsigned long vaddr, pte_t new_pte)
diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
index 52ae5438edeb..e3c415bdbfbf 100644
--- a/drivers/acpi/apei/Kconfig
+++ b/drivers/acpi/apei/Kconfig
@@ -19,6 +19,7 @@ config ACPI_APEI

config ACPI_APEI_GHES
bool "APEI Generic Hardware Error Source"
+ depends on BROKEN
depends on ACPI_APEI
select ACPI_HED
select IRQ_WORK

2018-09-06 09:50:23

by Peter Zijlstra

Subject: Re: [PATCH v2 4/6] x86/alternatives: initializing temporary mm for patching

On Sun, Sep 02, 2018 at 10:32:22AM -0700, Nadav Amit wrote:
> +void __init poking_init(void)
> +{
> + unsigned long poking_addr;
> +
> + poking_mm = copy_init_mm();
> + if (!poking_mm) {
> + pr_err("x86/mm: error setting a separate poking address space");
> + return;
> + }
> +
> + /*
> + * Randomize the poking address, but make sure that the following page
> + * will be mapped at the same PMD. We need 2 pages, so find space for 3,
> + * and adjust the address if the PMD ends after the first one.
> + */
> + poking_addr = TASK_UNMAPPED_BASE +
> + (kaslr_get_random_long("Poking") & PAGE_MASK) %
> + (TASK_SIZE - TASK_UNMAPPED_BASE - 3 * PAGE_SIZE);
> +
> + if (((poking_addr + PAGE_SIZE) & ~PMD_MASK) == 0)
> + poking_addr += PAGE_SIZE;
> +}

This fails to compile for me.. I don't have kaslr_get_random_long().

2018-09-06 11:11:06

by Peter Zijlstra

Subject: Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes

On Thu, Sep 06, 2018 at 10:13:00AM +0200, Peter Zijlstra wrote:
> No, you got it the first time. There are in fact more fixmap abusers;
> see drivers/acpi/apei/ghes.c. Also, as long as set_fixmap() allows
> overwriting a _PAGE_PRESENT pte and has that dodgy
> __flush_tlb_one_kernel() in it, the broken remains (and can return).
>
> So we need to fix fixmap, to either disallow overwriting a _PAGE_PRESENT
> pte, or to issue a full TLB invalidate.
>
> Either fix will terminally break GHES, but that's OK, they've known
> about this issue since 2015 and haven't cared, so I can't be bothered
> about them.

In fact, the below seems to cure all woes..

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index b9d5e7c9ef43..00f1c9e4f0a3 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -709,22 +709,25 @@ void *text_poke(void *addr, const void *opcode, size_t len)
pages[1] = virt_to_page(addr + PAGE_SIZE);
}
BUG_ON(!pages[0]);
- local_irq_save(flags);
+
set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
if (pages[1])
set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
+
+ local_irq_save(flags);
vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
- clear_fixmap(FIX_TEXT_POKE0);
- if (pages[1])
- clear_fixmap(FIX_TEXT_POKE1);
- local_flush_tlb();
sync_core();
/* Could also do a CLFLUSH here to speed up CPU recovery; but
that causes hangs on some VIA CPUs. */
for (i = 0; i < len; i++)
BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]);
local_irq_restore(flags);
+
+ clear_fixmap(FIX_TEXT_POKE0);
+ if (pages[1])
+ clear_fixmap(FIX_TEXT_POKE1);
+
return addr;
}

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index dd519f372169..fd6af66bc400 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -260,14 +260,28 @@ static void __set_pte_vaddr(pud_t *pud, unsigned long vaddr, pte_t new_pte)
{
pmd_t *pmd = fill_pmd(pud, vaddr);
pte_t *pte = fill_pte(pmd, vaddr);
+ pte_t old_pte = *pte;

set_pte(pte, new_pte);

/*
- * It's enough to flush this one mapping.
- * (PGE mappings get flushed as well)
+ * If it was present and changes, we need to invalidate TLBs.
*/
- __flush_tlb_one_kernel(vaddr);
+ if (!(pte_present(old_pte) && !pte_same(old_pte, new_pte)))
+ return;
+
+ if (WARN(irqs_disabled(), "broken set_fixmap(%d, %lx), was: %lx",
+ (int)__virt_to_fix(vaddr),
+ pte_val(new_pte), pte_val(old_pte))) {
+ /*
+ * It is _NOT_ enough to flush just the local mapping;
+ * it might mostly work, but there is no guarantee a remote
+ * CPU didn't load this entry into its TLB.
+ */
+ __flush_tlb_one_kernel(vaddr);
+ } else {
+ flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
+ }
}

void set_pte_vaddr_p4d(p4d_t *p4d_page, unsigned long vaddr, pte_t new_pte)
diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
index 52ae5438edeb..e3c415bdbfbf 100644
--- a/drivers/acpi/apei/Kconfig
+++ b/drivers/acpi/apei/Kconfig
@@ -19,6 +19,7 @@ config ACPI_APEI

config ACPI_APEI_GHES
bool "APEI Generic Hardware Error Source"
+ depends on BROKEN if X86
depends on ACPI_APEI
select ACPI_HED
select IRQ_WORK

2018-09-06 17:04:11

by Nadav Amit

Subject: Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes

at 3:16 AM, Peter Zijlstra <[email protected]> wrote:

> On Thu, Sep 06, 2018 at 10:13:00AM +0200, Peter Zijlstra wrote:
>> No, you got it the first time. There are in fact more fixmap abusers;
>> see drivers/acpi/apei/ghes.c. Also, as long as set_fixmap() allows
>> overwriting a _PAGE_PRESENT pte and has that dodgy
>> __flush_tlb_one_kernel() in it, the broken remains (and can return).
>>
>> So we need to fix fixmap, to either disallow overwriting a _PAGE_PRESENT
>> pte, or to issue a full TLB invalidate.
>>
>> Either fix will terminally break GHES, but that's OK, they've known
>> about this issue since 2015 and haven't cared, so I can't be bothered
>> about them.
>
> In fact, the below seems to cure all woes..
>
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index b9d5e7c9ef43..00f1c9e4f0a3 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -709,22 +709,25 @@ void *text_poke(void *addr, const void *opcode, size_t len)
> pages[1] = virt_to_page(addr + PAGE_SIZE);
> }
> BUG_ON(!pages[0]);
> - local_irq_save(flags);
> +
> set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
> if (pages[1])
> set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
> +
> + local_irq_save(flags);
> vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
> memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
> - clear_fixmap(FIX_TEXT_POKE0);
> - if (pages[1])
> - clear_fixmap(FIX_TEXT_POKE1);
> - local_flush_tlb();
> sync_core();
> /* Could also do a CLFLUSH here to speed up CPU recovery; but
> that causes hangs on some VIA CPUs. */
> for (i = 0; i < len; i++)
> BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]);
> local_irq_restore(flags);
> +
> + clear_fixmap(FIX_TEXT_POKE0);
> + if (pages[1])
> + clear_fixmap(FIX_TEXT_POKE1);
> +
> return addr;
> }
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index dd519f372169..fd6af66bc400 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -260,14 +260,28 @@ static void __set_pte_vaddr(pud_t *pud, unsigned long vaddr, pte_t new_pte)
> {
> pmd_t *pmd = fill_pmd(pud, vaddr);
> pte_t *pte = fill_pte(pmd, vaddr);
> + pte_t old_pte = *pte;
>
> set_pte(pte, new_pte);
>
> /*
> - * It's enough to flush this one mapping.
> - * (PGE mappings get flushed as well)
> + * If it was present and changes, we need to invalidate TLBs.
> */
> - __flush_tlb_one_kernel(vaddr);
> + if (!(pte_present(old_pte) && !pte_same(old_pte, new_pte)))
> + return;
> +
> + if (WARN(irqs_disabled(), "broken set_fixmap(%d, %lx), was: %lx",
> + (int)__virt_to_fix(vaddr),
> + pte_val(new_pte), pte_val(old_pte))) {
> + /*
> + * It is _NOT_ enough to flush just the local mapping;
> + * it might mostly work, but there is no guarantee a remote
> + * CPU didn't load this entry into its TLB.
> + */
> + __flush_tlb_one_kernel(vaddr);
> + } else {
> + flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
> + }
> }
>
> void set_pte_vaddr_p4d(p4d_t *p4d_page, unsigned long vaddr, pte_t new_pte)
> diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
> index 52ae5438edeb..e3c415bdbfbf 100644
> --- a/drivers/acpi/apei/Kconfig
> +++ b/drivers/acpi/apei/Kconfig
> @@ -19,6 +19,7 @@ config ACPI_APEI
>
> config ACPI_APEI_GHES
> bool "APEI Generic Hardware Error Source"
> + depends on BROKEN if X86
> depends on ACPI_APEI
> select ACPI_HED
> select IRQ_WORK

Indeed, I missed these other instances that use the fixmap.

I’ll give your patch a try once my server goes back online. I was (and still
am) worried that interrupts would be disabled when __set_pte_vaddr() is
called, which would make the fix more complicated.

In addition, there might be a couple of issues with your fix:

1. __set_pte_vaddr() is not used exclusively by set_fixmap(). This means
the warning might be wrong, but it also means that these code paths (Xen’s
set_pte_mfn(), CPU-entry-area setup) need to be checked. And as you said
before, someone might use this function for other purposes as well.

2. Printing the virtual address can break KASLR.

3. The WARN() can introduce some overhead since checking if IRQs are
disabled takes a considerably long time. Perhaps VM_WARN() or something is
better. I realize this code-path is not on the hot-path though...

4. I guess flush_tlb_kernel_range() should also have something like
VM_WARN_ON(irqs_disabled()), just as an additional general sanity check.

Let me know if you want me to make these changes and include your patch in
the set.

Regards,
Nadav

2018-09-06 17:19:46

by Peter Zijlstra

Subject: Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes

On Thu, Sep 06, 2018 at 05:01:25PM +0000, Nadav Amit wrote:
> In addition, there might be a couple of issues with your fix:

It boots on my box ;-)

> 1. __set_pte_vaddr() is not used exclusively by set_fixmap(). This means
> the warning might be wrong, but it also means that these code paths (Xen’s
> set_pte_mfn(), CPU-entry-area setup) need to be checked. And as you said
> before, someone might use this function for other purposes as well.

CEA is fine, that actually needs it too.

The one thing I missed out on earlier is the below chunk, that is no
longer needed now that cea_set_pte() actually does the right thing.

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index b7b01d762d32..14ad97fa0749 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -293,12 +293,6 @@ static void ds_update_cea(void *cea, void *addr, size_t size, pgprot_t prot)
preempt_disable();
for (; msz < size; msz += PAGE_SIZE, pa += PAGE_SIZE, cea += PAGE_SIZE)
cea_set_pte(cea, pa, prot);
-
- /*
- * This is a cross-CPU update of the cpu_entry_area, we must shoot down
- * all TLB entries for it.
- */
- flush_tlb_kernel_range(start, start + size);
preempt_enable();
}

@@ -310,8 +304,6 @@ static void ds_clear_cea(void *cea, size_t size)
preempt_disable();
for (; msz < size; msz += PAGE_SIZE, cea += PAGE_SIZE)
cea_set_pte(cea, 0, PAGE_NONE);
-
- flush_tlb_kernel_range(start, start + size);
preempt_enable();
}


> 2. Printing the virtual address can break KASLR.

Local KASLR is a myth.. but sure, we can fix the print.

> 3. The WARN() can introduce some overhead since checking if IRQs are
> disabled takes a considerably long time. Perhaps VM_WARN() or something is
> better. I realize this code-path is not on the hot-path though...

Yeah, if it triggers you have bigger problems. We can make it
WARN_ONCE() I suppose.

> 4. I guess flush_tlb_kernel_range() should also have something like
> VM_WARN_ON(irqs_disabled()), just as an additional general sanity check.

It has, it's hidden in kernel/smp.c:smp_call_function_many().

> Let me know if you want me to make these changes and include your patch in
> the set.

The set is no longer needed. text_poke() is fine and correct with this
one patch.

2018-09-06 17:25:16

by Peter Zijlstra

Subject: Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes

On Thu, Sep 06, 2018 at 05:01:25PM +0000, Nadav Amit wrote:
> I’ll give your patch a try once my server goes back online. I was (and still
> am) worried that interrupts would be disabled when __set_pte_vaddr() is
> called, which would make the fix more complicated.

Thing is, we only need the TLB invalidate if the previous PTE was
present and the new PTE is different. If we write the 'first' PTE, all
is fine.

The code as presented WARNs if we do __set_pte_vaddr() that needs a TLB
invalidate and we have IRQs disabled. And aside from the GHES
trainwreck, the patch as given boots and runs fine on my machine.

And no, if there is a caller that has interrupts disabled and needs TLB
invalidate, the patch still is right. Just means that caller is
terminally broken and needs fixing (like GHES).

There is no way x86 can do what needs to be done with IRQs disabled.

2018-09-06 21:10:00

by Nadav Amit

Subject: Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes

at 10:17 AM, Peter Zijlstra <[email protected]> wrote:

> On Thu, Sep 06, 2018 at 05:01:25PM +0000, Nadav Amit wrote:
>> In addition, there might be a couple of issues with your fix:
>
> It boots on my box ;-)
>
>> 1. __set_pte_vaddr() is not used exclusively by set_fixmap(). This means
>> the warning might be wrong, but it also means that these code paths (Xen’s
>> set_pte_mfn(), CPU-entry-area setup) need to be checked. And as you said
>> before, someone might use this function for other purposes as well.
>
> CEA is fine, that actually needs it too.
>
> The one thing I missed out on earlier is the below chunk, that is no
> longer needed now that cea_set_pte() actually does the right thing.
>
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index b7b01d762d32..14ad97fa0749 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -293,12 +293,6 @@ static void ds_update_cea(void *cea, void *addr, size_t size, pgprot_t prot)
> preempt_disable();
> for (; msz < size; msz += PAGE_SIZE, pa += PAGE_SIZE, cea += PAGE_SIZE)
> cea_set_pte(cea, pa, prot);
> -
> - /*
> - * This is a cross-CPU update of the cpu_entry_area, we must shoot down
> - * all TLB entries for it.
> - */
> - flush_tlb_kernel_range(start, start + size);
> preempt_enable();
> }
>
> @@ -310,8 +304,6 @@ static void ds_clear_cea(void *cea, size_t size)
> preempt_disable();
> for (; msz < size; msz += PAGE_SIZE, cea += PAGE_SIZE)
> cea_set_pte(cea, 0, PAGE_NONE);
> -
> - flush_tlb_kernel_range(start, start + size);
> preempt_enable();
> }
>
>
>> 2. Printing the virtual address can break KASLR.
>
> Local KASLR is a myth.. but sure, we can fix the print.
>
>> 3. The WARN() can introduce some overhead since checking if IRQs are
>> disabled takes a considerably long time. Perhaps VM_WARN() or something is
>> better. I realize this code-path is not on the hot-path though...
>
> Yeah, if it triggers you have bigger problems. We can make it
> WARN_ONCE() I suppose.
>
>> 4. I guess flush_tlb_kernel_range() should also have something like
>> VM_WARN_ON(irqs_disabled()), just as an additional general sanity check.
>
> It has, it's hidden in kernel/smp.c:smp_call_function_many().

Right. Thanks.

>
>> Let me know if you want me to make these changes and include your patch in
>> the set.
>
> The set is no longer needed. text_poke() is fine and correct with this
> one patch.

It depends what security you want. Some may consider even the short
time-window in which the kernel code is writable from other cores as
insufficient for security.

In addition, the set removes the need for remote TLB shootdowns that
text_poke() - with this fix - requires.

2018-09-06 21:13:21

by Peter Zijlstra

Subject: Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes

On Thu, Sep 06, 2018 at 11:09:23AM -0700, Andy Lutomirski wrote:
> > On Sep 6, 2018, at 10:58 AM, Nadav Amit <[email protected]> wrote:
> > It depends what security you want. Some may consider even the short
> > time-window in which the kernel code is writable from other cores as
> > insufficient for security.
> >
> > In addition, the set removes the need for remote TLB shootdowns that
> > text_poke() - with this fix - requires.
> >
>
> I’m personally in favor of not needing a global broadcast flush to install kprobes.

That's fine. But at that point its an optimization, not a correctness
issue.

2018-09-06 21:14:29

by Nadav Amit

Subject: Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes

at 11:31 AM, Peter Zijlstra <[email protected]> wrote:

> On Thu, Sep 06, 2018 at 11:09:23AM -0700, Andy Lutomirski wrote:
>>> On Sep 6, 2018, at 10:58 AM, Nadav Amit <[email protected]> wrote:
>>> It depends what security you want. Some may consider even the short
>>> time-window in which the kernel code is writable from other cores as
>>> insufficient for security.
>>>
>>> In addition, the set removes the need for remote TLB shootdowns that
>>> text_poke() - with this fix - requires.
>>
>> I’m personally in favor of not needing a global broadcast flush to install kprobes.
>
> That's fine. But at that point its an optimization, not a correctness
> issue.

Note that patch 1/6 is still needed to fix false lockdep shoutouts due to a
recent patch.

2018-09-06 21:18:59

by Peter Zijlstra

Subject: Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes

On Thu, Sep 06, 2018 at 06:38:30PM +0000, Nadav Amit wrote:
> Note that patch 1/6 is still needed to fix false lockdep shoutouts due to a
> recent patch.

For some reason I do not appear to have 1/6 in my inbox. Let me dig
through lkml.

2018-09-06 21:19:59

by Peter Zijlstra

Subject: Re: [PATCH v2 1/6] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"

On Sun, Sep 02, 2018 at 10:32:19AM -0700, Nadav Amit wrote:
> text_mutex is expected to be held before text_poke() is called, but we
> cannot add a lockdep assertion since kgdb does not take it, and instead
> *supposedly* ensures the lock is not taken and will not be acquired by
> any other core while text_poke() is running.
>
> The reason for the "supposedly" comment is that it is not entirely clear
> that this would be the case if gdb_do_roundup is zero.

Argh, that's pretty shit code...

Not only is that text_mutex abuse ugly, so too is the fixmap usage from
IRQ context. I suppose this really does require your alternative mm
patches for text_poke().

2018-09-06 21:21:02

by Nadav Amit

Subject: Re: [PATCH v2 1/6] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"

at 12:40 PM, Peter Zijlstra <[email protected]> wrote:

> On Sun, Sep 02, 2018 at 10:32:19AM -0700, Nadav Amit wrote:
>> text_mutex is expected to be held before text_poke() is called, but we
>> cannot add a lockdep assertion since kgdb does not take it, and instead
>> *supposedly* ensures the lock is not taken and will not be acquired by
>> any other core while text_poke() is running.
>>
>> The reason for the "supposedly" comment is that it is not entirely clear
>> that this would be the case if gdb_do_roundup is zero.
>
> Argh, that's pretty shit code...
>
> Not only is that text_mutex abuse ugly, so too is the fixmap usage from
> IRQ context. I suppose this really does require your alternative mm
> patches for text_poke().

Right, I forgot about that…

2018-09-06 21:22:19

by Nadav Amit

Subject: Re: [PATCH v2 1/6] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"

at 12:53 PM, Peter Zijlstra <[email protected]> wrote:

> On Thu, Sep 06, 2018 at 07:42:14PM +0000, Nadav Amit wrote:
>> at 12:40 PM, Peter Zijlstra <[email protected]> wrote:
>>
>>> On Sun, Sep 02, 2018 at 10:32:19AM -0700, Nadav Amit wrote:
>>>> text_mutex is expected to be held before text_poke() is called, but we
>>>> cannot add a lockdep assertion since kgdb does not take it, and instead
>>>> *supposedly* ensures the lock is not taken and will not be acquired by
>>>> any other core while text_poke() is running.
>>>>
>>>> The reason for the "supposedly" comment is that it is not entirely clear
>>>> that this would be the case if gdb_do_roundup is zero.
>>>
>>> Argh, that's pretty shit code...
>>>
>>> Not only is that text_mutex abuse ugly, so too is the fixmap usage from
>>> IRQ context. I suppose this really does require your alternative mm
>>> patches for text_poke().
>>
>> Right, I forgot about that…
>
> With that CR3 trickery, we can rid ourselves of the text_mutex
> requirement, since concurrent text_poke is 'safe'. That would clean up
> the kgdb code quite a bit.

I don’t know. I’m somewhat worried about multiple mechanisms potentially
changing the same code at the same time - and maybe ending up with some
mess.

2018-09-06 21:22:29

by Peter Zijlstra

Subject: Re: [PATCH v2 1/6] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"

On Thu, Sep 06, 2018 at 07:42:14PM +0000, Nadav Amit wrote:
> at 12:40 PM, Peter Zijlstra <[email protected]> wrote:
>
> > On Sun, Sep 02, 2018 at 10:32:19AM -0700, Nadav Amit wrote:
> >> text_mutex is expected to be held before text_poke() is called, but we
> >> cannot add a lockdep assertion since kgdb does not take it, and instead
> >> *supposedly* ensures the lock is not taken and will not be acquired by
> >> any other core while text_poke() is running.
> >>
> >> The reason for the "supposedly" comment is that it is not entirely clear
> >> that this would be the case if gdb_do_roundup is zero.
> >
> > Argh, that's pretty shit code...
> >
> > Not only is that text_mutex abuse ugly, so too is the fixmap usage from
> > IRQ context. I suppose this really does require your alternative mm
> > patches for text_poke().
>
> Right, I forgot about that…

With that CR3 trickery, we can rid ourselves of the text_mutex
requirement, since concurrent text_poke is 'safe'. That would clean up
the kgdb code quite a bit.

2018-09-06 21:25:39

by Peter Zijlstra

Subject: Re: [PATCH v2 1/6] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"

On Thu, Sep 06, 2018 at 07:58:40PM +0000, Nadav Amit wrote:
> > With that CR3 trickery, we can rid ourselves of the text_mutex
> > requirement, since concurrent text_poke is 'safe'. That would clean up
> > the kgdb code quite a bit.
>
> I don’t know. I’m somewhat worried about multiple mechanisms potentially
> changing the same code at the same time - and maybe ending up with some
> mess.

kgdb only pokes INT3, that should be pretty safe.

2018-09-06 21:32:57

by Nadav Amit

Subject: Re: [PATCH v2 1/6] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"

at 1:25 PM, Peter Zijlstra <[email protected]> wrote:

> On Thu, Sep 06, 2018 at 07:58:40PM +0000, Nadav Amit wrote:
>>> With that CR3 trickery, we can rid ourselves of the text_mutex
>>> requirement, since concurrent text_poke is 'safe'. That would clean up
>>> the kgdb code quite a bit.
>>
>> I don’t know. I’m somewhat worried about multiple mechanisms potentially
>> changing the same code at the same time - and maybe ending up with some
>> mess.
>
> kgdb only pokes INT3, that should be pretty safe.

Maybe I misunderstand your point. If you want me to get rid of text_mutex
completely, I am afraid it could cause a mess by changing the same
piece of code through both kprobes and the static-keys mechanism.

I doubt it would work today without failing, but getting rid of text_mutex
is likely to make it even worse.

2018-09-06 21:46:40

by Peter Zijlstra

Subject: Re: [PATCH v2 1/6] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"

On Thu, Sep 06, 2018 at 08:57:38PM +0000, Nadav Amit wrote:
> at 1:25 PM, Peter Zijlstra <[email protected]> wrote:
>
> > On Thu, Sep 06, 2018 at 07:58:40PM +0000, Nadav Amit wrote:
> >>> With that CR3 trickery, we can rid ourselves of the text_mutex
> >>> requirement, since concurrent text_poke is 'safe'. That would clean up
> >>> the kgdb code quite a bit.
> >>
> >> I don’t know. I’m somewhat worried about multiple mechanisms potentially
> >> changing the same code at the same time - and maybe ending up with some
> >> mess.
> >
> > kgdb only pokes INT3, that should be pretty safe.
>
> Maybe I misunderstand your point. If you want me to get rid of text_mutex
> completely,

No, just the ugly things kgdb does with it.

2018-09-06 23:00:57

by Andy Lutomirski

Subject: Re: [PATCH v2 0/6] x86/alternatives: text_poke() fixes



> On Sep 6, 2018, at 10:58 AM, Nadav Amit <[email protected]> wrote:
>
> at 10:17 AM, Peter Zijlstra <[email protected]> wrote:
>
>>> On Thu, Sep 06, 2018 at 05:01:25PM +0000, Nadav Amit wrote:
>>> In addition, there might be a couple of issues with your fix:
>>
>> It boots on my box ;-)
>>
>>> 1. __set_pte_vaddr() is not used exclusively by set_fixmap(). This means
>>> the warning might be wrong, but it also means that these code paths (Xen’s
>>> set_pte_mfn(), CPU-entry-area setup) need to be checked. And as you said
>>> before, someone might use this function for other purposes as well.
>>
>> CEA is fine, that actually needs it too.
>>
>> The one thing I missed out on earlier is the below chunk, that is no
>> longer needed now that cea_set_pte() actually does the right thing.
>>
>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>> index b7b01d762d32..14ad97fa0749 100644
>> --- a/arch/x86/events/intel/ds.c
>> +++ b/arch/x86/events/intel/ds.c
>> @@ -293,12 +293,6 @@ static void ds_update_cea(void *cea, void *addr, size_t size, pgprot_t prot)
>> preempt_disable();
>> for (; msz < size; msz += PAGE_SIZE, pa += PAGE_SIZE, cea += PAGE_SIZE)
>> cea_set_pte(cea, pa, prot);
>> -
>> - /*
>> - * This is a cross-CPU update of the cpu_entry_area, we must shoot down
>> - * all TLB entries for it.
>> - */
>> - flush_tlb_kernel_range(start, start + size);
>> preempt_enable();
>> }
>>
>> @@ -310,8 +304,6 @@ static void ds_clear_cea(void *cea, size_t size)
>> preempt_disable();
>> for (; msz < size; msz += PAGE_SIZE, cea += PAGE_SIZE)
>> cea_set_pte(cea, 0, PAGE_NONE);
>> -
>> - flush_tlb_kernel_range(start, start + size);
>> preempt_enable();
>> }
>>
>>
>>> 2. Printing the virtual address can break KASLR.
>>
>> Local KASLR is a myth.. but sure, we can fix the print.
>>
>>> 3. The WARN() can introduce some overhead since checking if IRQs are
>>> disabled takes a considerably long time. Perhaps VM_WARN() or something is
>>> better. I realize this code-path is not on the hot-path though...
>>
>> Yeah, if it triggers you have bigger problems. We can make it
>> WARN_ONCE() I suppose.
>>
>>> 4. I guess flush_tlb_kernel_range() should also have something like
>>> VM_WARN_ON(irqs_disabled()), just as an additional general sanity check.
>>
>> It has, it's hidden in kernel/smp.c:smp_call_function_many().
>
> Right. Thanks.
>
>>
>>> Let me know if you want me to make these changes and include your patch in
>>> the set.
>>
>> The set is no longer needed. text_poke() is fine and correct with this
>> one patch.
>
> It depends what security you want. Some may consider even the short
> time-window in which the kernel code is writable from other cores as
> insufficient for security.
>
> In addition, the set removes the need for remote TLB shootdowns that
> text_poke() - with this fix - requires.
>

I’m personally in favor of not needing a global broadcast flush to install kprobes.

2018-09-07 20:53:37

by Nadav Amit

Subject: Re: [PATCH v2 4/6] x86/alternatives: initializing temporary mm for patching

at 2:01 AM, Peter Zijlstra <[email protected]> wrote:

> On Sun, Sep 02, 2018 at 10:32:22AM -0700, Nadav Amit wrote:
>> +void __init poking_init(void)
>> +{
>> + unsigned long poking_addr;
>> +
>> + poking_mm = copy_init_mm();
>> + if (!poking_mm) {
>> + pr_err("x86/mm: error setting a separate poking address space");
>> + return;
>> + }
>> +
>> + /*
>> + * Randomize the poking address, but make sure that the following page
>> + * will be mapped at the same PMD. We need 2 pages, so find space for 3,
>> + * and adjust the address if the PMD ends after the first one.
>> + */
>> + poking_addr = TASK_UNMAPPED_BASE +
>> + (kaslr_get_random_long("Poking") & PAGE_MASK) %
>> + (TASK_SIZE - TASK_UNMAPPED_BASE - 3 * PAGE_SIZE);
>> +
>> + if (((poking_addr + PAGE_SIZE) & ~PMD_MASK) == 0)
>> + poking_addr += PAGE_SIZE;
>> +}
>
> This fails to compile for me.. I don't have kaslr_get_random_long().

Will be fixed in v3. Thanks.