2010-07-14 16:00:20

by Mathieu Desnoyers

Subject: [patch 2/2] x86 NMI-safe INT3 and Page Fault

Implements an alternative to iret, using popf and a return instruction, so trap and
exception handlers can return to the NMI handler without issuing iret. iret would
cause NMIs to be re-enabled prematurely. x86_32 uses popf and a far return. x86_64
has to copy the return instruction pointer to the top of the previous stack, issue a
popf, load the previous rsp and issue a near return (ret).

It allows placing dynamically patched static jumps (asm gotos), which will be
used for optimized tracepoints, in NMI code, since returning from a breakpoint
becomes valid. Accessing vmalloc'd memory from NMI context also becomes valid,
which allows executing module code or touching vmapped or vmalloc'd areas from
NMI handlers. This is very useful to tracers like LTTng.

This patch makes all faults, traps and exceptions safe to be called from NMI
context *except* single-stepping, which requires iret to restore the TF (trap
flag) and jump to the return address in a single instruction. Sorry, no kprobes
support in NMI handlers because of this limitation. This cannot be emulated
with popf/lret, because lret would itself be single-stepped. It does not apply
to "immediate values" because they do not use single-stepping. This code detects
whether the TF flag is set and uses the iret path for single-stepping, even
though it reactivates NMIs prematurely.

The test detecting whether we are nested under an NMI handler is only done upon
return from a trap/exception to kernel space, which is not frequent. The other
return paths (return from trap/exception to userspace, return from interrupt)
keep exactly the same behavior (no slowdown).
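
As a C-level sketch of the return-path choice (illustrative only; names are
hypothetical, the authoritative version is the assembly in entry_32.S and
entry_64.S below):

#define NMI_MASK	0x04000000UL
#define X86_EFLAGS_TF	0x00000100UL

/* Sketch: decide between iret and the NMI-safe popf/(l)ret return. */
static int use_nmi_safe_return(unsigned long preempt_count,
			       unsigned long saved_flags,
			       int returning_to_kernel)
{
	if (!returning_to_kernel)
		return 0;	/* user-space return paths are unchanged */
	if (!(preempt_count & NMI_MASK))
		return 0;	/* not nested over an NMI handler */
	if (saved_flags & X86_EFLAGS_TF)
		return 0;	/* single-stepping: a real iret is required */
	return 1;		/* popf + (l)ret keeps NMIs masked */
}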

alpha and avr32 use bit 30 (0x40000000) for the PREEMPT_ACTIVE flag. This patch
moves it to bit 28 (0x10000000).

TODO : test alpha and avr32 active count modification
TODO : test with lguest, xen, kvm.

tested on x86_32 (tests implemented in a separate patch) :
- instrumented the return path to export the EIP, CS and EFLAGS values when
taken so we know the return path code has been executed.
- trace_mark, using immediate values, with 10ms delay with the breakpoint
activated. Runs well through the return path.
- tested vmalloc faults in NMI handler by placing a non-optimized marker in the
NMI handler (so no breakpoint is executed) and connecting a probe which
touches every page of a 20MB vmalloc'd buffer. It executes through the return
path without problem.
- Tested with and without preemption

tested on x86_64
- instrumented the return path to export the EIP, CS and EFLAGS values when
taken so we know the return path code has been executed.
- trace_mark, using immediate values, with 10ms delay with the breakpoint
activated. Runs well through the return path.

To test on x86_64 :
- Test without preemption
- Test vmalloc faults
- Test on Intel 64 bits CPUs. (AMD64 was fine)

Changelog since v1 :
- x86_64 fixes.
Changelog since v2 :
- fix paravirt build
Changelog since v3 :
- Include modifications suggested by Jeremy
Changelog since v4 :
- including hardirq.h in entry_32/64.S is a bad idea (non ifndef'd C code),
define NMI_MASK in the .S files directly.
Changelog since v5 :
- Add NMI_MASK to irq_count() and make die() more verbose for NMIs.
Changelog since v7 :
- Implement paravirtualized nmi_return.
Changelog since v8 :
- refreshed the patch for asm-offsets. Those were left out of v8.
- now depends on "Stringify support commas" patch.
Changelog since v9 :
- Only test the nmi nested preempt count flag upon return from exceptions, not
on return from interrupts. Only the kernel return path has this test.
- Add Xen, VMI, lguest support. Use their iret paravirt ops in lieu of
nmi_return.

- update for 2.6.30-rc1
Follow NMI_MASK bits merged in mainline.

- update for 2.6.35-rc4-tip

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: [email protected]
CC: [email protected]
CC: "H. Peter Anvin" <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: "Frank Ch. Eigler" <[email protected]>
---
 arch/alpha/include/asm/thread_info.h  |    2 -
 arch/avr32/include/asm/thread_info.h  |    2 -
 arch/x86/include/asm/irqflags.h       |   56 ++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/paravirt.h       |    4 ++
 arch/x86/include/asm/paravirt_types.h |    1
 arch/x86/kernel/asm-offsets_32.c      |    1
 arch/x86/kernel/asm-offsets_64.c      |    1
 arch/x86/kernel/dumpstack.c           |    2 +
 arch/x86/kernel/entry_32.S            |   30 ++++++++++++++++++
 arch/x86/kernel/entry_64.S            |   33 ++++++++++++++++++--
 arch/x86/kernel/paravirt.c            |    3 +
 arch/x86/kernel/paravirt_patch_32.c   |    6 +++
 arch/x86/kernel/paravirt_patch_64.c   |    6 +++
 arch/x86/kernel/vmi_32.c              |    2 +
 arch/x86/lguest/boot.c                |    1
 arch/x86/xen/enlighten.c              |    1
 16 files changed, 145 insertions(+), 6 deletions(-)

Index: linux.trees.git/arch/x86/kernel/entry_32.S
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/entry_32.S 2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/entry_32.S 2010-07-14 08:02:11.000000000 -0400
@@ -80,6 +80,8 @@

#define nr_syscalls ((syscall_table_size)/4)

+#define NMI_MASK 0x04000000
+
#ifdef CONFIG_PREEMPT
#define preempt_stop(clobbers) DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF
#else
@@ -348,8 +350,32 @@ END(ret_from_fork)
# userspace resumption stub bypassing syscall exit tracing
ALIGN
RING0_PTREGS_FRAME
+
ret_from_exception:
preempt_stop(CLBR_ANY)
+ GET_THREAD_INFO(%ebp)
+ movl PT_EFLAGS(%esp), %eax # mix EFLAGS and CS
+ movb PT_CS(%esp), %al
+ andl $(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %eax
+ cmpl $USER_RPL, %eax
+ jae resume_userspace # returning to v8086 or userspace
+ testl $NMI_MASK,TI_preempt_count(%ebp)
+ jz resume_kernel /* Not nested over NMI ? */
+ testw $X86_EFLAGS_TF, PT_EFLAGS(%esp)
+ jnz resume_kernel /*
+ * If single-stepping an NMI handler,
+ * use the normal iret path instead of
+ * the popf/lret because lret would be
+ * single-stepped. It should not
+ * happen : it will reactivate NMIs
+ * prematurely.
+ */
+ TRACE_IRQS_IRET
+ RESTORE_REGS
+ addl $4, %esp # skip orig_eax/error_code
+ CFI_ADJUST_CFA_OFFSET -4
+ INTERRUPT_RETURN_NMI_SAFE
+
ret_from_intr:
GET_THREAD_INFO(%ebp)
check_userspace:
@@ -949,6 +975,10 @@ ENTRY(native_iret)
.previous
END(native_iret)

+ENTRY(native_nmi_return)
+ NATIVE_INTERRUPT_RETURN_NMI_SAFE # Should we deal with popf exception ?
+END(native_nmi_return)
+
ENTRY(native_irq_enable_sysexit)
sti
sysexit
Index: linux.trees.git/arch/x86/kernel/entry_64.S
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/entry_64.S 2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/entry_64.S 2010-07-14 08:02:11.000000000 -0400
@@ -163,6 +163,8 @@ GLOBAL(return_to_handler)
#endif


+#define NMI_MASK 0x04000000
+
#ifndef CONFIG_PREEMPT
#define retint_kernel retint_restore_args
#endif
@@ -875,6 +877,9 @@ ENTRY(native_iret)
.section __ex_table,"a"
.quad native_iret, bad_iret
.previous
+
+ENTRY(native_nmi_return)
+ NATIVE_INTERRUPT_RETURN_NMI_SAFE
#endif

.section .fixup,"ax"
@@ -929,6 +934,24 @@ retint_signal:
GET_THREAD_INFO(%rcx)
jmp retint_with_reschedule

+ /* Returning to kernel space from exception. */
+ /* rcx: threadinfo. interrupts off. */
+ENTRY(retexc_kernel)
+ testl $NMI_MASK,TI_preempt_count(%rcx)
+ jz retint_kernel /* Not nested over NMI ? */
+ testw $X86_EFLAGS_TF,EFLAGS-ARGOFFSET(%rsp) /* trap flag? */
+ jnz retint_kernel /*
+ * If single-stepping an NMI handler,
+ * use the normal iret path instead of
+ * the popf/lret because lret would be
+ * single-stepped. It should not
+ * happen : it will reactivate NMIs
+ * prematurely.
+ */
+ RESTORE_ARGS 0,8,0
+ TRACE_IRQS_IRETQ
+ INTERRUPT_RETURN_NMI_SAFE
+
#ifdef CONFIG_PREEMPT
/* Returning to kernel space. Check if we need preemption */
/* rcx: threadinfo. interrupts off. */
@@ -1375,12 +1398,18 @@ ENTRY(paranoid_exit)
paranoid_swapgs:
TRACE_IRQS_IRETQ 0
SWAPGS_UNSAFE_STACK
+paranoid_restore_no_nmi:
RESTORE_ALL 8
jmp irq_return
paranoid_restore:
+ GET_THREAD_INFO(%rcx)
TRACE_IRQS_IRETQ 0
+ testl $NMI_MASK,TI_preempt_count(%rcx)
+ jz paranoid_restore_no_nmi /* Nested over NMI ? */
+ testw $X86_EFLAGS_TF,EFLAGS-0(%rsp) /* trap flag? */
+ jnz paranoid_restore_no_nmi
RESTORE_ALL 8
- jmp irq_return
+ INTERRUPT_RETURN_NMI_SAFE
paranoid_userspace:
GET_THREAD_INFO(%rcx)
movl TI_flags(%rcx),%ebx
@@ -1479,7 +1508,7 @@ ENTRY(error_exit)
TRACE_IRQS_OFF
GET_THREAD_INFO(%rcx)
testl %eax,%eax
- jne retint_kernel
+ jne retexc_kernel
LOCKDEP_SYS_EXIT_IRQ
movl TI_flags(%rcx),%edx
movl $_TIF_WORK_MASK,%edi
Index: linux.trees.git/arch/x86/include/asm/irqflags.h
===================================================================
--- linux.trees.git.orig/arch/x86/include/asm/irqflags.h 2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/include/asm/irqflags.h 2010-07-14 08:02:11.000000000 -0400
@@ -56,6 +56,61 @@ static inline void native_halt(void)

#endif

+#ifdef CONFIG_X86_64
+/*
+ * Only returns from a trap or exception to a NMI context (intra-privilege
+ * level near return) to the same SS and CS segments. Should be used
+ * upon trap or exception return when nested over a NMI context so no iret is
+ * issued. It takes care of modifying the eflags, rsp and returning to the
+ * previous function.
+ *
+ * The stack, at that point, looks like :
+ *
+ * 0(rsp) RIP
+ * 8(rsp) CS
+ * 16(rsp) EFLAGS
+ * 24(rsp) RSP
+ * 32(rsp) SS
+ *
+ * Upon execution :
+ * Copy EIP to the top of the return stack
+ * Update top of return stack address
+ * Pop eflags into the eflags register
+ * Make the return stack current
+ * Near return (popping the return address from the return stack)
+ */
+#define NATIVE_INTERRUPT_RETURN_NMI_SAFE pushq %rax; \
+ movq %rsp, %rax; \
+ movq 24+8(%rax), %rsp; \
+ pushq 0+8(%rax); \
+ pushq 16+8(%rax); \
+ movq (%rax), %rax; \
+ popfq; \
+ ret
+#else
+/*
+ * Protected mode only, no V8086. Implies that protected mode must
+ * be entered before NMIs or MCEs are enabled. Only returns from a trap or
+ * exception to a NMI context (intra-privilege level far return). Should be used
+ * upon trap or exception return when nested over a NMI context so no iret is
+ * issued.
+ *
+ * The stack, at that point, looks like :
+ *
+ * 0(esp) EIP
+ * 4(esp) CS
+ * 8(esp) EFLAGS
+ *
+ * Upon execution :
+ * Copy the stack eflags to top of stack
+ * Pop eflags into the eflags register
+ * Far return: pop EIP and CS into their register, and additionally pop EFLAGS.
+ */
+#define NATIVE_INTERRUPT_RETURN_NMI_SAFE pushl 8(%esp); \
+ popfl; \
+ lret $4
+#endif
+
#ifdef CONFIG_PARAVIRT
#include <asm/paravirt.h>
#else
@@ -114,6 +169,7 @@ static inline unsigned long __raw_local_

#define ENABLE_INTERRUPTS(x) sti
#define DISABLE_INTERRUPTS(x) cli
+#define INTERRUPT_RETURN_NMI_SAFE NATIVE_INTERRUPT_RETURN_NMI_SAFE

#ifdef CONFIG_X86_64
#define SWAPGS swapgs
Index: linux.trees.git/arch/alpha/include/asm/thread_info.h
===================================================================
--- linux.trees.git.orig/arch/alpha/include/asm/thread_info.h 2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/alpha/include/asm/thread_info.h 2010-07-14 08:02:11.000000000 -0400
@@ -56,7 +56,7 @@ register struct thread_info *__current_t
#define THREAD_SIZE_ORDER 1
#define THREAD_SIZE (2*PAGE_SIZE)

-#define PREEMPT_ACTIVE 0x40000000
+#define PREEMPT_ACTIVE 0x10000000

/*
* Thread information flags:
Index: linux.trees.git/arch/avr32/include/asm/thread_info.h
===================================================================
--- linux.trees.git.orig/arch/avr32/include/asm/thread_info.h 2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/avr32/include/asm/thread_info.h 2010-07-14 08:02:11.000000000 -0400
@@ -66,7 +66,7 @@ static inline struct thread_info *curren

#endif /* !__ASSEMBLY__ */

-#define PREEMPT_ACTIVE 0x40000000
+#define PREEMPT_ACTIVE 0x10000000

/*
* Thread information flags
Index: linux.trees.git/arch/x86/include/asm/paravirt.h
===================================================================
--- linux.trees.git.orig/arch/x86/include/asm/paravirt.h 2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/include/asm/paravirt.h 2010-07-14 08:02:11.000000000 -0400
@@ -943,6 +943,10 @@ extern void default_banner(void);
PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_iret), CLBR_NONE, \
jmp PARA_INDIRECT(pv_cpu_ops+PV_CPU_iret))

+#define INTERRUPT_RETURN_NMI_SAFE \
+ PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_nmi_return), CLBR_NONE, \
+ jmp *%cs:pv_cpu_ops+PV_CPU_nmi_return)
+
#define DISABLE_INTERRUPTS(clobbers) \
PARA_SITE(PARA_PATCH(pv_irq_ops, PV_IRQ_irq_disable), clobbers, \
PV_SAVE_REGS(clobbers | CLBR_CALLEE_SAVE); \
Index: linux.trees.git/arch/x86/kernel/paravirt.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/paravirt.c 2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/paravirt.c 2010-07-14 08:02:11.000000000 -0400
@@ -156,6 +156,7 @@ unsigned paravirt_patch_default(u8 type,
ret = paravirt_patch_ident_64(insnbuf, len);

else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) ||
+ type == PARAVIRT_PATCH(pv_cpu_ops.nmi_return) ||
type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit) ||
type == PARAVIRT_PATCH(pv_cpu_ops.usergs_sysret32) ||
type == PARAVIRT_PATCH(pv_cpu_ops.usergs_sysret64))
@@ -204,6 +205,7 @@ static void native_flush_tlb_single(unsi

/* These are in entry.S */
extern void native_iret(void);
+extern void native_nmi_return(void);
extern void native_irq_enable_sysexit(void);
extern void native_usergs_sysret32(void);
extern void native_usergs_sysret64(void);
@@ -373,6 +375,7 @@ struct pv_cpu_ops pv_cpu_ops = {
.usergs_sysret64 = native_usergs_sysret64,
#endif
.iret = native_iret,
+ .nmi_return = native_nmi_return,
.swapgs = native_swapgs,

.set_iopl_mask = native_set_iopl_mask,
Index: linux.trees.git/arch/x86/kernel/paravirt_patch_32.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/paravirt_patch_32.c 2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/paravirt_patch_32.c 2010-07-14 08:02:11.000000000 -0400
@@ -1,10 +1,13 @@
-#include <asm/paravirt.h>
+#include <linux/stringify.h>
+#include <linux/irqflags.h>

DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
DEF_NATIVE(pv_irq_ops, restore_fl, "push %eax; popf");
DEF_NATIVE(pv_irq_ops, save_fl, "pushf; pop %eax");
DEF_NATIVE(pv_cpu_ops, iret, "iret");
+DEF_NATIVE(pv_cpu_ops, nmi_return,
+ __stringify(NATIVE_INTERRUPT_RETURN_NMI_SAFE));
DEF_NATIVE(pv_cpu_ops, irq_enable_sysexit, "sti; sysexit");
DEF_NATIVE(pv_mmu_ops, read_cr2, "mov %cr2, %eax");
DEF_NATIVE(pv_mmu_ops, write_cr3, "mov %eax, %cr3");
@@ -41,6 +44,7 @@ unsigned native_patch(u8 type, u16 clobb
PATCH_SITE(pv_irq_ops, restore_fl);
PATCH_SITE(pv_irq_ops, save_fl);
PATCH_SITE(pv_cpu_ops, iret);
+ PATCH_SITE(pv_cpu_ops, nmi_return);
PATCH_SITE(pv_cpu_ops, irq_enable_sysexit);
PATCH_SITE(pv_mmu_ops, read_cr2);
PATCH_SITE(pv_mmu_ops, read_cr3);
Index: linux.trees.git/arch/x86/kernel/paravirt_patch_64.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/paravirt_patch_64.c 2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/paravirt_patch_64.c 2010-07-14 08:02:11.000000000 -0400
@@ -1,12 +1,15 @@
+#include <linux/irqflags.h>
+#include <linux/stringify.h>
#include <asm/paravirt.h>
#include <asm/asm-offsets.h>
-#include <linux/stringify.h>

DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
DEF_NATIVE(pv_irq_ops, restore_fl, "pushq %rdi; popfq");
DEF_NATIVE(pv_irq_ops, save_fl, "pushfq; popq %rax");
DEF_NATIVE(pv_cpu_ops, iret, "iretq");
+DEF_NATIVE(pv_cpu_ops, nmi_return,
+ __stringify(NATIVE_INTERRUPT_RETURN_NMI_SAFE));
DEF_NATIVE(pv_mmu_ops, read_cr2, "movq %cr2, %rax");
DEF_NATIVE(pv_mmu_ops, read_cr3, "movq %cr3, %rax");
DEF_NATIVE(pv_mmu_ops, write_cr3, "movq %rdi, %cr3");
@@ -51,6 +54,7 @@ unsigned native_patch(u8 type, u16 clobb
PATCH_SITE(pv_irq_ops, irq_enable);
PATCH_SITE(pv_irq_ops, irq_disable);
PATCH_SITE(pv_cpu_ops, iret);
+ PATCH_SITE(pv_cpu_ops, nmi_return);
PATCH_SITE(pv_cpu_ops, irq_enable_sysexit);
PATCH_SITE(pv_cpu_ops, usergs_sysret32);
PATCH_SITE(pv_cpu_ops, usergs_sysret64);
Index: linux.trees.git/arch/x86/kernel/asm-offsets_32.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/asm-offsets_32.c 2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/asm-offsets_32.c 2010-07-14 08:02:11.000000000 -0400
@@ -113,6 +113,7 @@ void foo(void)
OFFSET(PV_IRQ_irq_disable, pv_irq_ops, irq_disable);
OFFSET(PV_IRQ_irq_enable, pv_irq_ops, irq_enable);
OFFSET(PV_CPU_iret, pv_cpu_ops, iret);
+ OFFSET(PV_CPU_nmi_return, pv_cpu_ops, nmi_return);
OFFSET(PV_CPU_irq_enable_sysexit, pv_cpu_ops, irq_enable_sysexit);
OFFSET(PV_CPU_read_cr0, pv_cpu_ops, read_cr0);
#endif
Index: linux.trees.git/arch/x86/kernel/asm-offsets_64.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/asm-offsets_64.c 2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/asm-offsets_64.c 2010-07-14 08:02:11.000000000 -0400
@@ -58,6 +58,7 @@ int main(void)
OFFSET(PV_IRQ_irq_enable, pv_irq_ops, irq_enable);
OFFSET(PV_IRQ_adjust_exception_frame, pv_irq_ops, adjust_exception_frame);
OFFSET(PV_CPU_iret, pv_cpu_ops, iret);
+ OFFSET(PV_CPU_nmi_return, pv_cpu_ops, nmi_return);
OFFSET(PV_CPU_usergs_sysret32, pv_cpu_ops, usergs_sysret32);
OFFSET(PV_CPU_usergs_sysret64, pv_cpu_ops, usergs_sysret64);
OFFSET(PV_CPU_irq_enable_sysexit, pv_cpu_ops, irq_enable_sysexit);
Index: linux.trees.git/arch/x86/xen/enlighten.c
===================================================================
--- linux.trees.git.orig/arch/x86/xen/enlighten.c 2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/xen/enlighten.c 2010-07-14 08:02:12.000000000 -0400
@@ -953,6 +953,7 @@ static const struct pv_cpu_ops xen_cpu_o
.read_pmc = native_read_pmc,

.iret = xen_iret,
+ .nmi_return = xen_iret,
.irq_enable_sysexit = xen_sysexit,
#ifdef CONFIG_X86_64
.usergs_sysret32 = xen_sysret32,
Index: linux.trees.git/arch/x86/kernel/vmi_32.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/vmi_32.c 2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/vmi_32.c 2010-07-14 08:02:12.000000000 -0400
@@ -154,6 +154,8 @@ static unsigned vmi_patch(u8 type, u16 c
insns, ip);
case PARAVIRT_PATCH(pv_cpu_ops.iret):
return patch_internal(VMI_CALL_IRET, len, insns, ip);
+ case PARAVIRT_PATCH(pv_cpu_ops.nmi_return):
+ return patch_internal(VMI_CALL_IRET, len, insns, ip);
case PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit):
return patch_internal(VMI_CALL_SYSEXIT, len, insns, ip);
default:
Index: linux.trees.git/arch/x86/lguest/boot.c
===================================================================
--- linux.trees.git.orig/arch/x86/lguest/boot.c 2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/lguest/boot.c 2010-07-14 08:02:12.000000000 -0400
@@ -1270,6 +1270,7 @@ __init void lguest_init(void)
pv_cpu_ops.cpuid = lguest_cpuid;
pv_cpu_ops.load_idt = lguest_load_idt;
pv_cpu_ops.iret = lguest_iret;
+ pv_cpu_ops.nmi_return = lguest_iret;
pv_cpu_ops.load_sp0 = lguest_load_sp0;
pv_cpu_ops.load_tr_desc = lguest_load_tr_desc;
pv_cpu_ops.set_ldt = lguest_set_ldt;
Index: linux.trees.git/arch/x86/kernel/dumpstack.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/dumpstack.c 2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/dumpstack.c 2010-07-14 08:02:12.000000000 -0400
@@ -258,6 +258,8 @@ void __kprobes oops_end(unsigned long fl

if (!signr)
return;
+ if (in_nmi())
+ panic("Fatal exception in non-maskable interrupt");
if (in_interrupt())
panic("Fatal exception in interrupt");
if (panic_on_oops)
Index: linux.trees.git/arch/x86/include/asm/paravirt_types.h
===================================================================
--- linux.trees.git.orig/arch/x86/include/asm/paravirt_types.h 2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/include/asm/paravirt_types.h 2010-07-14 08:02:12.000000000 -0400
@@ -181,6 +181,7 @@ struct pv_cpu_ops {
/* Normal iret. Jump to this with the standard iret stack
frame set up. */
void (*iret)(void);
+ void (*nmi_return)(void);

void (*swapgs)(void);


2010-07-14 16:42:24

by Maciej W. Rozycki

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:

> This patch makes all faults, traps and exception safe to be called from NMI
> context *except* single-stepping, which requires iret to restore the TF (trap
> flag) and jump to the return address in a single instruction. Sorry, no kprobes

Watch out for the RF flag too, that is not set correctly by POPFD -- that
may be important for faulting instructions that also have a hardware
breakpoint set at their address.

> support in NMI handlers because of this limitation. This cannot be emulated
> with popf/lret, because lret would be single-stepped. It does not apply to
> "immediate values" because they do not use single-stepping. This code detects if
> the TF flag is set and uses the iret path for single-stepping, even if it
> reactivates NMIs prematurely.

What about the VM flag for VM86 tasks? It cannot be changed by POPFD
either.

How about only using the special return path when a nested exception is
about to return to the NMI handler? You'd avoid all the odd cases then
that do not happen in the NMI context.

Maciej

2010-07-14 18:12:25

by Mathieu Desnoyers

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

* Maciej W. Rozycki ([email protected]) wrote:
> On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:
>
> > This patch makes all faults, traps and exception safe to be called from NMI
> > context *except* single-stepping, which requires iret to restore the TF (trap
> > flag) and jump to the return address in a single instruction. Sorry, no kprobes
>
> Watch out for the RF flag too, that is not set correctly by POPFD -- that
> may be important for faulting instructions that also have a hardware
> breakpoint set at their address.
>
> > support in NMI handlers because of this limitation. This cannot be emulated
> > with popf/lret, because lret would be single-stepped. It does not apply to
> > "immediate values" because they do not use single-stepping. This code detects if
> > the TF flag is set and uses the iret path for single-stepping, even if it
> > reactivates NMIs prematurely.
>
> What about the VM flag for VM86 tasks? It cannot be changed by POPFD
> either.
>
> How about only using the special return path when a nested exception is
> about to return to the NMI handler? You'd avoid all the odd cases then
> that do not happen in the NMI context.

This is exactly what this patch does :-)

It selects the return path with

+ testl $NMI_MASK,TI_preempt_count(%ebp)
+ jz resume_kernel /* Not nested over NMI ? */

In addition, regarding the use of int3 breakpoints in the kernel: AFAIK the
handler does not explicitly set the RF flag, and the breakpoint instruction
(int3) appears not to set it (from my understanding of Intel's
Intel Architecture Software Developer’s Manual Volume 3: System Programming,
15.3.1.1 INSTRUCTION-BREAKPOINT EXCEPTION C).

So it should be safe to set an int3 breakpoint in an NMI handler with this patch.
It's just the "single-stepping" feature of kprobes which is problematic.
Luckily, only int3 is needed for code patching bypass.

Thanks,

Mathieu


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-14 19:21:33

by Maciej W. Rozycki

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:

> > How about only using the special return path when a nested exception is
> > about to return to the NMI handler? You'd avoid all the odd cases then
> > that do not happen in the NMI context.
>
> This is exactly what this patch does :-)

Ah, OK then -- I understood you actually tested the value of TF in the
image to be restored.

> It selects the return path with
>
> + testl $NMI_MASK,TI_preempt_count(%ebp)
> + jz resume_kernel /* Not nested over NMI ? */
>
> In addition, about int3 breakpoints use in the kernel, AFAIK the handler does
> not explicitly set the RF flag, and the breakpoint instruction (int3) appears
> not to set it. (from my understanding of Intel's
> Intel Architecture Software Developer’s Manual Volume 3: System Programming
> 15.3.1.1. INSTRUCTION-BREAKPOINT EXCEPTION C)

The CPU only sets RF itself in the image saved in certain cases -- you'd
see it set in the page fault handler for example, so that once the handler
has finished any instruction breakpoint does not hit (presumably again,
because the instruction breakpoint debug exception has the highest
priority). You mentioned the need to handle these faults.

> So it should be safe to set a int3 breakpoint in a NMI handler with this patch.
>
> It's just the "single-stepping" feature of kprobes which is problematic.
> Luckily, only int3 is needed for code patching bypass.

Actually the breakpoint exception handler should probably set RF
explicitly, but that depends on the exact debugging scenario, so I can't
comment on it further. I don't know how INT3 is used in this context, so
I'm just noting this may be a danger zone.

Maciej

2010-07-14 19:58:44

by Mathieu Desnoyers

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

* Maciej W. Rozycki ([email protected]) wrote:
> On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:
>
> > > How about only using the special return path when a nested exception is
> > > about to return to the NMI handler? You'd avoid all the odd cases then
> > > that do not happen in the NMI context.
> >
> > This is exactly what this patch does :-)
>
> Ah, OK then -- I understood you actually tested the value of TF in the
> image to be restored.

It tests it too. When it detects that the return path is about to return to an
NMI handler, it checks whether the TF flag is set. If it is set, then "iret" is
really needed, because TF can only single-step an instruction when set by
"iret". The popf/ret scheme would otherwise trap at the "ret" instruction that
follows popf. Anyway, single-stepping is really discouraged in NMI handlers,
because there is no way to get around the iret.

>
> > It selects the return path with
> >
> > + testl $NMI_MASK,TI_preempt_count(%ebp)
> > + jz resume_kernel /* Not nested over NMI ? */
> >
> > In addition, about int3 breakpoints use in the kernel, AFAIK the handler does
> > not explicitly set the RF flag, and the breakpoint instruction (int3) appears
> > not to set it. (from my understanding of Intel's
> > Intel Architecture Software Developer’s Manual Volume 3: System Programming
> > 15.3.1.1. INSTRUCTION-BREAKPOINT EXCEPTION C)
>
> The CPU only sets RF itself in the image saved in certain cases -- you'd
> see it set in the page fault handler for example, so that once the handler
> has finished any instruction breakpoint does not hit (presumably again,
> because the instruction breakpoint debug exception has the highest
> priority). You mentioned the need to handle these faults.

Well, the only case where I think it might make sense to allow a breakpoint in
NMI handler code would be to temporarily replace a static branch, which should
in no way be able to trigger any other fault.

>
> > So it should be safe to set a int3 breakpoint in a NMI handler with this patch.
> >
> > It's just the "single-stepping" feature of kprobes which is problematic.
> > Luckily, only int3 is needed for code patching bypass.
>
> Actually the breakpoint exception handler should actually probably set RF
> explicitly, but that depends on the exact debugging scenario, so I can't
> comment on it further. I don't know how INT3 is used in this context, so
> I'm just noting this may be a danger zone.

In the case of temporary bypass, the int3 is only there to divert the
instruction execution flow to somewhere else, and we come back to the original
code at the address following the instruction which has the breakpoint. So
basically, we never come back to the original instruction, ever. We might as
well just clear the RF flag from the EFLAGS image before popf.
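
A minimal sketch of that idea, assuming a pt_regs-style saved EFLAGS image
(the helper name is hypothetical):

#define X86_EFLAGS_RF	0x00010000UL

/* Sketch: drop RF from the saved EFLAGS image so the popf/ret return
 * path does not restore a stale resume flag. */
static inline void clear_rf_in_image(unsigned long *flags_image)
{
	*flags_image &= ~X86_EFLAGS_RF;
}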

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-14 20:36:35

by Maciej W. Rozycki

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:

> It tests it too. When it detects that the return path is about to return to a
> NMI handler, it checks if the TF flag is set. If it is set, then "iret" is
> really needed, because TF can only single-step an instruction when set by
> "iret". The popf/ret scheme would otherwise trap at the "ret" instruction that
> follows popf. Anyway, single-stepping is really discouraged in nmi handlers,
> because there is no way to go around the iret.

Hmm, with Pentium Pro and more recent processors there is actually a
nasty hack that will let you get away with POPF/RET and TF set. ;) You
can try it if you like and can arrange for an appropriate scenario.

> In the case of temporary bypass, the int3 is only there to divert the
> instruction execution flow to somewhere else, and we come back to the original
> code at the address following the instruction which has the breakpoint. So
> basically, we never come back to the original instruction, ever. We might as
> well just clear the RF flag from the EFLAGS image before popf.

Yes, if you return to elsewhere, then that's actually quite desirable
IMHO.

This RF flag is quite complicated to handle and there are some errata
involved too. If I understand it correctly, all fault-class exception
handlers are expected to set it manually in the image to be restored if
they return to the original faulting instruction (that includes the debug
exception handler if it was invoked as a fault, i.e. in response to an
instruction breakpoint). Then all trap-class exception handlers are
expected to clear the flag (and that includes the debug exception handler
if it was invoked as a trap, e.g. in response to a data breakpoint or a
single step). I haven't checked if Linux gets these bits right, but it
may be worth doing so.

For the record -- GDB hardly cares, because it removes any instruction
breakpoints before it is asked to resume execution of an instruction that
has a breakpoint set, single-steps the instruction with all the other
threads locked out, and then reinserts the breakpoints so that they can hit
again. Then it proceeds with whatever should be done next to fulfil the
execution request.

Maciej

2010-07-16 12:30:21

by Avi Kivity

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/14/2010 06:49 PM, Mathieu Desnoyers wrote:
> Implements an alternative iret with popf and return so trap and exception
> handlers can return to the NMI handler without issuing iret. iret would cause
> NMIs to be reenabled prematurely. x86_32 uses popf and far return. x86_64 has to
> copy the return instruction pointer to the top of the previous stack, issue a
> popf, loads the previous esp and issue a near return (ret).
>
> It allows placing dynamically patched static jumps in asm gotos, which will be
> used for optimized tracepoints, in NMI code since returning from a breakpoint
> would be valid. Accessing vmalloc'd memory, which allows executing module code
> or accessing vmapped or vmalloc'd areas from NMI context, would also be valid.
> This is very useful to tracers like LTTng.
>
> This patch makes all faults, traps and exception safe to be called from NMI
> context *except* single-stepping, which requires iret to restore the TF (trap
> flag) and jump to the return address in a single instruction. Sorry, no kprobes
> support in NMI handlers because of this limitation. This cannot be emulated
> with popf/lret, because lret would be single-stepped. It does not apply to
> "immediate values" because they do not use single-stepping. This code detects if
> the TF flag is set and uses the iret path for single-stepping, even if it
> reactivates NMIs prematurely.
>

You need to save/restore cr2 in addition, otherwise the following hits you

- page fault
- processor writes cr2, enters fault handler
- nmi
- page fault
- cr2 overwritten

I guess you would usually not notice the corruption since you'd just see
a spurious fault on the page the NMI handler touched, but if the first
fault happened in a kvm guest, then we'd corrupt the guest's cr2.
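
A sketch of the save/restore this implies, assuming the x86 read_cr2() and
write_cr2() accessors (the handler body is hypothetical):

/* Sketch: preserve CR2 across an NMI handler that may itself fault. */
static void nmi_handler_with_cr2_save(void)
{
	unsigned long cr2 = read_cr2();	/* CR2 of the interrupted fault, if any */

	/* ... NMI work that may take a page fault and clobber CR2 ... */

	write_cr2(cr2);			/* restore before returning from the NMI */
}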

But the whole thing strikes me as overkill. If it's 8k per-cpu, what's
wrong with using a per-cpu pointer to a kmalloc() area?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-07-16 14:49:32

by Mathieu Desnoyers

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

* Avi Kivity ([email protected]) wrote:
> On 07/14/2010 06:49 PM, Mathieu Desnoyers wrote:
>> Implements an alternative iret with popf and return so trap and exception
>> handlers can return to the NMI handler without issuing iret. iret would cause
>> NMIs to be reenabled prematurely. x86_32 uses popf and far return. x86_64 has to
>> copy the return instruction pointer to the top of the previous stack, issue a
>> popf, loads the previous esp and issue a near return (ret).
>>
>> It allows placing dynamically patched static jumps in asm gotos, which will be
>> used for optimized tracepoints, in NMI code since returning from a breakpoint
>> would be valid. Accessing vmalloc'd memory, which allows executing module code
>> or accessing vmapped or vmalloc'd areas from NMI context, would also be valid.
>> This is very useful to tracers like LTTng.
>>
>> This patch makes all faults, traps and exception safe to be called from NMI
>> context *except* single-stepping, which requires iret to restore the TF (trap
>> flag) and jump to the return address in a single instruction. Sorry, no kprobes
>> support in NMI handlers because of this limitation. This cannot be emulated
>> with popf/lret, because lret would be single-stepped. It does not apply to
>> "immediate values" because they do not use single-stepping. This code detects if
>> the TF flag is set and uses the iret path for single-stepping, even if it
>> reactivates NMIs prematurely.
>>
>
> You need to save/restore cr2 in addition, otherwise the following hits you
>
> - page fault
> - processor writes cr2, enters fault handler
> - nmi
> - page fault
> - cr2 overwritten
>
> I guess you would usually not notice the corruption since you'd just see
> a spurious fault on the page the NMI handler touched, but if the first
> fault happened in a kvm guest, then we'd corrupt the guest's cr2.

OK, just to make sure: you mean we'd have to save/restore the cr2 register
at the beginning/end of the NMI handler execution, right ? Then shouldn't we
save/restore cr3 too ?

> But the whole thing strikes me as overkill. If it's 8k per-cpu, what's
> wrong with using a per-cpu pointer to a kmalloc() area?

Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
much more than perf) can potentially cause large latencies, which could be
squashed by allowing page faults in NMI handlers. This looks like a stronger
argument to me.
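
(For context, a sketch of the workaround such callers use today, assuming a
module whose buffer is touched from NMI context; names are hypothetical,
vmalloc_sync_all() is the existing interface:)

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/vmalloc.h>

static void *nmi_buf;

static int __init my_tracer_init(void)
{
	nmi_buf = vmalloc(8192);
	if (!nmi_buf)
		return -ENOMEM;
	vmalloc_sync_all();	/* pre-fault: make every mm map the new area */
	return 0;
}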

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-16 15:34:41

by Andi Kleen

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

> Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
> much more than perf) can potentially cause large latencies, which could be

You need to fix all the other code that walks task lists too, to avoid all those.

% gid for_each_process | wc -l

In fact the mm-struct walk is cheaper than a task-list walk because there
are always fewer mm structs than tasks.

-Andi

2010-07-16 15:40:28

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

* Andi Kleen ([email protected]) wrote:
> > Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
> > much more than perf) can potentially cause large latencies, which could be
>
> You need to fix all other code too that walks tasks lists to avoid all those.
>
> % gid for_each_process | wc -l

This can very well be done incrementally. And I agree, these should eventually
be targeted too, especially those which hold locks. We've already started hearing
about tasklist lock live-locks in the past year, so I think we're pretty much at
the point where it should be looked at.

Thanks,

Mathieu

>
> In fact the mm-struct walk is cheaper than a task-list walk because there
> are always fewer mm structs than tasks.

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-16 16:48:08

by Avi Kivity

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 05:49 PM, Mathieu Desnoyers wrote:
>
>> You need to save/restore cr2 in addition, otherwise the following hits you
>>
>> - page fault
>> - processor writes cr2, enters fault handler
>> - nmi
>> - page fault
>> - cr2 overwritten
>>
>> I guess you would usually not notice the corruption since you'd just see
>> a spurious fault on the page the NMI handler touched, but if the first
>> fault happened in a kvm guest, then we'd corrupt the guest's cr2.
>>
> OK, just to make sure: you mean we'd have to save/restore the cr2 register
> at the beginning/end of the NMI handler execution, right ?

Yes.

> Then shouldn't we
> save/restore cr3 too ?
>
>

No, faults should not change cr3.

>> But the whole thing strikes me as overkill. If it's 8k per-cpu, what's
>> wrong with using a per-cpu pointer to a kmalloc() area?
>>
> Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
> much more than perf) can potentially cause large latencies, which could be
> squashed by allowing page faults in NMI handlers. This looks like a stronger
> argument to me.

Why is that kernel code calling vmalloc_sync_all()? If it is only NMI
which cannot take vmalloc faults, why bother? If not, why not?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-07-16 16:59:01

by Mathieu Desnoyers

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

* Avi Kivity ([email protected]) wrote:
> On 07/16/2010 05:49 PM, Mathieu Desnoyers wrote:
>>
>>> You need to save/restore cr2 in addition, otherwise the following hits you
>>>
>>> - page fault
>>> - processor writes cr2, enters fault handler
>>> - nmi
>>> - page fault
>>> - cr2 overwritten
>>>
>>> I guess you would usually not notice the corruption since you'd just see
>>> a spurious fault on the page the NMI handler touched, but if the first
>>> fault happened in a kvm guest, then we'd corrupt the guest's cr2.
>>>
>> OK, just to make sure: you mean we'd have to save/restore the cr2 register
>> at the beginning/end of the NMI handler execution, right ?
>
> Yes.

OK

>
>> Then shouldn't we
>> save/restore cr3 too ?
>>
>>
>
> No, faults should not change cr3.

Ah, right.

>
>>> But the whole thing strikes me as overkill. If it's 8k per-cpu, what's
>>> wrong with using a per-cpu pointer to a kmalloc() area?
>>>
>> Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
>> much more than perf) can potentially cause large latencies, which could be
>> squashed by allowing page faults in NMI handlers. This looks like a stronger
>> argument to me.
>
> Why is that kernel code calling vmalloc_sync_all()? If it is only NMI
> which cannot take vmalloc faults, why bother? If not, why not?

Modules come as yet another example of stuff that is loaded in vmalloc'd space
and can be accessed from NMI context. That would include oprofile, tracers, and
probably others I'm forgetting about.

Thanks,

Mathieu


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-16 17:54:59

by Avi Kivity

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 07:58 PM, Mathieu Desnoyers wrote:
>
>> Why is that kernel code calling vmalloc_sync_all()? If it is only NMI
>> which cannot take vmalloc faults, why bother? If not, why not?
>>
> Modules come as yet another example of stuff that is loaded in vmalloc'd space
> and can be accessed from NMI context. That would include oprofile, tracers, and
> probably others I'm forgetting about.
>

Module loading can certainly take a vmalloc_sync_all() (though I agree
it's unpleasant). Anything else?

Note perf is not modular at this time, but could be made so with
preempt/sched notifiers to hook the context switch.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-07-16 18:07:08

by H. Peter Anvin

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 10:54 AM, Avi Kivity wrote:
> On 07/16/2010 07:58 PM, Mathieu Desnoyers wrote:
>>
>>> Why is that kernel code calling vmalloc_sync_all()? If it is only NMI
>>> which cannot take vmalloc faults, why bother? If not, why not?
>>>
>> Modules come as yet another example of stuff that is loaded in vmalloc'd space
>> and can be accessed from NMI context. That would include oprofile, tracers, and
>> probably others I'm forgetting about.
>>
>
> Module loading can certainly take a vmalloc_sync_all() (though I agree
> it's unpleasant). Anything else?
>
> Note perf is not modular at this time, but could be made so with
> preempt/sched notifiers to hook the context switch.
>

Actually, module loading is already a performance problem; a lot of
distros load sometimes hundreds of modules on startup, and it's heavily
serialized, so I can see this being desirable to skip.

I really hope no one ever gets the idea of touching user space from an
NMI handler, though, and expecting it to work...

-hpa

2010-07-16 18:16:15

by Avi Kivity

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 09:05 PM, H. Peter Anvin wrote:
>
>> Module loading can certainly take a vmalloc_sync_all() (though I agree
>> it's unpleasant). Anything else?
>>
>> Note perf is not modular at this time, but could be made so with
>> preempt/sched notifiers to hook the context switch.
>>
>>
> Actually, module loading is already a performance problem; a lot of
> distros load sometimes hundreds of modules on startup, and it's heavily
> serialized, so I can see this being desirable to skip.
>

There aren't that many processes at this time (or there shouldn't be,
don't know how fork-happy udev is at this stage), so the sync should be
pretty fast. In any case, we can sync only modules that contain NMI
handlers.

> I really hope noone ever gets the idea of touching user space from an
> NMI handler, though, and expecting it to work...
>

I think the concern here is about an NMI handler's code running in
vmalloc space, or is it something else?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-07-16 18:19:21

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 11:15 AM, Avi Kivity wrote:
>
> There aren't that many processes at this time (or there shouldn't be,
> don't know how fork-happy udev is at this stage), so the sync should be
> pretty fast. In any case, we can sync only modules that contain NMI
> handlers.
>
>> I really hope noone ever gets the idea of touching user space from an
>> NMI handler, though, and expecting it to work...
>>
>
> I think the concern here is about an NMI handler's code running in
> vmalloc space, or is it something else?
>

Code or data, yes; including percpu data.

-hpa

2010-07-16 18:22:44

by Mathieu Desnoyers

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

* Avi Kivity ([email protected]) wrote:
> On 07/16/2010 09:05 PM, H. Peter Anvin wrote:
>>
>>> Module loading can certainly take a vmalloc_sync_all() (though I agree
>>> it's unpleasant). Anything else?
>>>
>>> Note perf is not modular at this time, but could be made so with
>>> preempt/sched notifiers to hook the context switch.
>>>
>>>
>> Actually, module loading is already a performance problem; a lot of
>> distros load sometimes hundreds of modules on startup, and it's heavily
>> serialized, so I can see this being desirable to skip.
>>
>
> There aren't that many processes at this time (or there shouldn't be,
> don't know how fork-happy udev is at this stage), so the sync should be
> pretty fast. In any case, we can sync only modules that contain NMI
> handlers.

USB hotplug is a use-case that happens randomly after the system is up and
running; I'm afraid this does not fit your module-loading expectations. It
triggers tons of events, many of which actually load modules.

Thanks,

Mathieu

>
>> I really hope noone ever gets the idea of touching user space from an
>> NMI handler, though, and expecting it to work...
>>
>
> I think the concern here is about an NMI handler's code running in
> vmalloc space, or is it something else?
>
> --
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.
>

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-16 18:30:26

by Avi Kivity

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 09:17 PM, H. Peter Anvin wrote:
>
>> I think the concern here is about an NMI handler's code running in
>> vmalloc space, or is it something else?
>>
>>
> Code or data, yes; including percpu data.
>

Use kmalloc and percpu pointers, it's not that onerous.

Oh, and you can access vmalloc space by switching cr3 temporarily to
init_mm's, no? Obviously not a very performant solution, at least
without PCIDs, but can be used if needed.
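
A rough sketch of that cr3 trick, assuming the x86 read_cr3()/write_cr3()
accessors and init_mm (TLB cost and preemption concerns ignored):

/* Sketch: temporarily run on the reference page tables, where vmalloc
 * mappings are always present, then switch back. */
static void access_vmalloc_via_init_mm(void (*touch)(void))
{
	unsigned long saved_cr3 = read_cr3();

	write_cr3(__pa(init_mm.pgd));	/* init_mm always maps vmalloc space */
	touch();			/* access the vmalloc'd area here */
	write_cr3(saved_cr3);		/* back to the interrupted mm */
}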

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-07-16 18:30:47

by Linus Torvalds

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, Jul 16, 2010 at 11:15 AM, Avi Kivity <[email protected]> wrote:
>
> I think the concern here is about an NMI handler's code running in vmalloc
> space, or is it something else?

I think the concern was also potentially doing things like backtraces
etc that may need access to the module data structures (I think the
ELF headers end up all being in vmalloc space too, for example).

The whole debugging thing is also an issue. Now, I obviously am not a
big fan of remote debuggers, but everybody tells me I'm wrong. And
putting a breakpoint on NMI is certainly not insane if you are doing
debugging in the first place. So it's not necessarily always about the
page faults.

Linus

2010-07-16 18:33:23

by Avi Kivity

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 09:22 PM, Mathieu Desnoyers wrote:
>
>> There aren't that many processes at this time (or there shouldn't be,
>> don't know how fork-happy udev is at this stage), so the sync should be
>> pretty fast. In any case, we can sync only modules that contain NMI
>> handlers.
>>
> USB hotplug is a use-case happening randomly after the system is well there and
> running; I'm afraid this does not fit in your module loading expectations. It
> triggers tons of events, many of these actually load modules.
>

How long would vmalloc_sync_all() take with a few thousand mm_structs?

We share the pmds, yes? So it's a few thousand memory accesses. The
direct impact is probably negligible, compared to actually loading the
module from disk. All we need is to make sure the locking doesn't slow
down unrelated stuff.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-07-16 18:44:40

by Linus Torvalds

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, Jul 16, 2010 at 11:28 AM, Avi Kivity <[email protected]> wrote:
>
> Use kmalloc and percpu pointers, it's not that onerous.

What people don't seem to understand is that WE SHOULD NOT MAKE NMI
FORCE US TO DO "STRANGE" CODE IN CODE-PATHS THAT HAVE NOTHING
WHAT-SO-EVER TO DO WITH NMI.

I'm shouting, because this point seems to have been continually
missed. It was missed in the original patches, and it's been missed in
the discussions.

Non-NMI code should simply never have to even _think_ about NMI's. Why
should it? It should just do whatever comes "natural" within its own
context.

This is why I've been pushing for the "let's just fix NMI" approach.
Not adding random hacks to other code sequences that have nothing
what-so-ever to do with NMI.

So don't add NMI code to the page fault code. Not to the debug code,
or to the module loading code. Don't say "use special allocations
because the NMI code may care about these particular data structures".
Because that way lies crap and unmaintainability.

Linus

2010-07-16 19:27:56

by Avi Kivity

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 09:37 PM, Linus Torvalds wrote:
> On Fri, Jul 16, 2010 at 11:28 AM, Avi Kivity<[email protected]> wrote:
>
>> Use kmalloc and percpu pointers, it's not that onerous.
>>
> What people don't seem to understand is that WE SHOULD NOT MAKE NMI
> FORCE US TO DO "STRANGE" CODE IN CODE-PATHS THAT HAVE NOTHING
> WHAT-SO-EVER TO DO WITH NMI.
>
> I'm shouting, because this point seems to have been continually
> missed. It was missed in the original patches, and it's been missed in
> the discussions.
>
> Non-NMI code should simply never have to even _think_ about NMI's. Why
> should it? It should just do whatever comes "natural" within its own
> context.
>
>

But we're not talking about non-NMI code. The 8k referred to in the
original patch are buffers used by NMI stack recording. The module-code
vmalloc_sync_all() is only needed by code that executes during NMI, and
hence must be NMI-aware.

> This is why I've been pushing for the "let's just fix NMI" approach.
> Not adding random hacks to other code sequences that have nothing
> what-so-ever to do with NMI.
>

"fixing NMI" will result in code that is understandable by maybe three
people after long and hard thinking. NMI can happen in too many
semi-defined contexts, so there will be plenty of edge cases. I'm not
sure we can ever trust such trickery.

> So don't add NMI code to the page fault code. Not to the debug code,
> or to the module loading code. Don't say "use special allocations
> because the NMI code may care about these particular data structures".
> Because that way lies crap and unmaintainability.
>

If NMI code can call random hooks and access random data, yes. But I
don't think we're at that point yet.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-07-16 19:28:54

by Andi Kleen

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

> Actually, module loading is already a performance problem; a lot of
> distros load sometimes hundreds of modules on startup, and it's heavily

On startup you don't have many processes. If there's a problem
it's surely not the fault of vmalloc_sync_all()

BTW in my experience one reason module loading was traditionally slow was
that it did a stop_machine(). I think(?) that has been fixed
at some point. But even with that, it's more an issue on larger
systems.

> I really hope noone ever gets the idea of touching user space from an
> NMI handler, though, and expecting it to work...

It can make sense for a backtrace in a profiler.

In fact perf is nearly doing it I believe, but moves
it to the self IPI handler in most cases.

-Andi
--
[email protected] -- Speaking for myself only.

2010-07-16 19:30:52

by H. Peter Anvin

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 11:32 AM, Avi Kivity wrote:
>
> How long would vmalloc_sync_all take with a few thousand mm_struct take?
>
> We share the pmds, yes? So it's a few thousand memory accesses. The
> direct impact is probably negligible, compared to actually loading the
> module from disk. All we need is to make sure the locking doesn't slow
> down unrelated stuff.
>

It's not the memory accesses, it's the need to synchronize all the CPUs.

-hpa

2010-07-16 19:30:57

by Andi Kleen

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, Jul 16, 2010 at 11:25:19AM -0700, Linus Torvalds wrote:
> On Fri, Jul 16, 2010 at 11:15 AM, Avi Kivity <[email protected]> wrote:
> >
> > I think the concern here is about an NMI handler's code running in vmalloc
> > space, or is it something else?
>
> I think the concern was also potentially doing things like backtraces
> etc that may need access to the module data structures (I think the
> ELF headers end up all being in vmalloc space too, for example).
>
> The whole debugging thing is also an issue. Now, I obviously am not a
> big fan of remote debuggers, but everybody tells me I'm wrong. And
> putting a breakpoint on NMI is certainly not insane if you are doing
> debugging in the first place. So it's not necessarily always about the
> page faults.

We already have infrastructure for kprobes to prevent breakpoints
on critical code (the __kprobes section). In principle kgdb/kdb
could be taught about honoring those too.
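
For illustration, the annotation looks like this (the function name is
hypothetical; __kprobes places the code in a section the kprobes core
refuses to instrument):

#include <linux/kprobes.h>

/* Sketch: code marked __kprobes can never receive a kprobes int3. */
static int __kprobes my_nmi_critical_func(void)
{
	return 0;
}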

That wouldn't help for truly external JTAG debuggers, but I would assume
those generally can (should) handle any contexts anyways.

-Andi

--
[email protected] -- Speaking for myself only.

2010-07-16 19:32:22

by Andi Kleen

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, Jul 16, 2010 at 09:32:00PM +0300, Avi Kivity wrote:
> On 07/16/2010 09:22 PM, Mathieu Desnoyers wrote:
> >
> >>There aren't that many processes at this time (or there shouldn't be,
> >>don't know how fork-happy udev is at this stage), so the sync should be
> >>pretty fast. In any case, we can sync only modules that contain NMI
> >>handlers.
> >USB hotplug is a use-case happening randomly after the system is well there and
> >running; I'm afraid this does not fit in your module loading expectations. It
> >triggers tons of events, many of these actually load modules.
>
> How long would vmalloc_sync_all take with a few thousand mm_struct take?
>
> We share the pmds, yes? So it's a few thousand memory accesses.
> The direct impact is probably negligible, compared to actually
> loading the module from disk. All we need is to make sure the
> locking doesn't slow down unrelated stuff.

Also you have to remember that vmalloc_sync_all() only does something
when a top-level page table entry is actually updated. That is very rare
(in many cases it should happen at most once per boot).
Most mapping changes update lower levels, and those are already
shared.

-Andi
--
[email protected] -- Speaking for myself only.

2010-07-16 19:32:55

by Avi Kivity

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 10:28 PM, Andi Kleen wrote:
>
>> I really hope noone ever gets the idea of touching user space from an
>> NMI handler, though, and expecting it to work...
>>
> It can make sense for a backtrace in a profiler.
>
> In fact perf is nearly doing it I believe, but moves
> it to the self IPI handler in most cases.
>

Interesting, is the self IPI guaranteed to execute synchronously after
the NMI's IRET? Or can the core IRET faster than the APIC delivers the
interrupt, so that we get the backtrace at the wrong place?

(and does it matter? the NMI itself is not always accurate)

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-07-16 19:34:11

by Andi Kleen

Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, Jul 16, 2010 at 10:32:13PM +0300, Avi Kivity wrote:
> On 07/16/2010 10:28 PM, Andi Kleen wrote:
> >
> >>I really hope noone ever gets the idea of touching user space from an
> >>NMI handler, though, and expecting it to work...
> >It can make sense for a backtrace in a profiler.
> >
> >In fact perf is nearly doing it I believe, but moves
> >it to the self IPI handler in most cases.
>
> Interesting, is the self IPI guaranteed to execute synchronously
> after the NMI's IRET? Or can the core IRET faster than the APIC and
> so we get the backtrace at the wrong place?
>
> (and does it matter? the NMI itself is not always accurate)

The self IPI runs after the next STI (or a POPF that re-enables interrupts).
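
In other words, the user-space walk is deferred out of NMI context. A rough
sketch of that pattern (all names here are invented; this is not perf's
actual code):

extern void send_self_ipi(void);       /* assumed APIC helper */
extern void walk_user_stack(void);     /* the deferred, may-fault work */

static volatile int backtrace_pending; /* per-cpu in reality */

void profiler_nmi_handler(void)
{
        backtrace_pending = 1;         /* just record the request... */
        send_self_ipi();               /* ...and queue the self IPI */
}                                      /* NMI returns via iret */

void self_ipi_handler(void)            /* delivered after the next STI/POPF */
{
        if (backtrace_pending) {
                backtrace_pending = 0;
                walk_user_stack();     /* safe: no longer in NMI context */
        }
}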

-Andi
--
[email protected] -- Speaking for myself only.

2010-07-16 19:40:56

by Avi Kivity

[permalink] [raw]
Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 10:29 PM, H. Peter Anvin wrote:
> On 07/16/2010 11:32 AM, Avi Kivity wrote:
>
>> How long would vmalloc_sync_all take with a few thousand mm_structs?
>>
>> We share the pmds, yes? So it's a few thousand memory accesses. The
>> direct impact is probably negligible, compared to actually loading the
>> module from disk. All we need is to make sure the locking doesn't slow
>> down unrelated stuff.
>>
>>
> It's not the memory accesses, it's the need to synchronize all the CPUs.
>

I'm missing something. Why do we need to sync all cpus? The
vmalloc_sync_all() I'm reading doesn't.

Even if we do an on_each_cpu() somewhere, it isn't the end of the world.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-07-16 21:46:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, Jul 16, 2010 at 12:26 PM, Avi Kivity <[email protected]> wrote:
> On 07/16/2010 09:37 PM, Linus Torvalds wrote:
>>
>> Non-NMI code should simply never have to even _think_ about NMI's. Why
>> should it? It should just do whatever comes "natural" within its own
>> context.
>
> But we're not talking about non-NMI code.

Yes, we are. We're talking about breakpoints (look at the subject
line), and you are very much talking about things like that _idiotic_
vmalloc_sync_all() by module loading code etc etc.

Every _single_ "solution" I have seen - apart from my suggestion - has
been about making code "special" because some other code might run in
an NMI. Module init sequences having to do idiotic things just because
they have data structures that might get accessed by NMI.

And the thing is, if we just do NMI's correctly, and allow nesting,
ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do
stupid things elsewhere.

In other words, why the hell are you arguing? Help Mathieu write the
low-level NMI handler right, and remove that idiotic
"vmalloc_sync_all()" that is fundamentally broken and should not
exist. Rather than talk about adding more of that kind of crap.

Linus

2010-07-16 22:02:53

by Jeffrey Merkey

[permalink] [raw]
Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

Date Fri, 16 Jul 2010 14:39:50 -0700
From Linus Torvalds

> Linus Torvalds wrote:
>> But we're not talking about non-NMI code.
>
> Yes, we are. We're talking about breakpoints (look at the subject
> line), and you are very much talking about things like that _idiotic_
> vmalloc_sync_all() by module loading code etc etc.
>
> Every _single_ "solution" I have seen - apart from my suggestion - has
> been about making code "special" because some other code might run in
> an NMI. Module init sequences having to do idiotic things just because
> they have data structures that might get accessed by NMI.
>
> And the thing is, if we just do NMI's correctly, and allow nesting,
> ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do
> stupid things elsewhere.
>
> In other words, why the hell are you arguing? Help Mathieu write the
> low-level NMI handler right, and remove that idiotic
> "vmalloc_sync_all()" that is fundamentally broken and should not
> exist. Rather than talk about adding more of that kind of crap.
>
> Linus

So Linus, my understanding of Intel's processor design is that the
processor will NEVER signal a nested NMI until it sees an iret from
the first NMI exception. At least that's how the processors were
working when I started this, unless this behavior has changed. Just
put a gate on the exception that uses its own stack (which I think we
do anyway).

Jeff

2010-07-16 22:07:36

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

> And the thing is, if we just do NMI's correctly, and allow nesting,
> ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do
> stupid things elsewhere.

One issue I have with nesting NMIs is that you need
a nesting limit, otherwise you'll overflow the NMI stack.

We just got rid of nesting for normal interrupts because
of this stack overflow problem which hit in real situations.

In some cases you can get quite high NMI frequencies, e.g. with
performance counters. Now the current performance counter handlers
do not nest by themselves of course, but they might nest
with other longer running NMI users.

I think none of the current handlers are likely to nest
for very long, but there's more and more NMI code added all the time,
so it's definitely a concern.

-Andi

--
[email protected] -- Speaking for myself only.

2010-07-16 22:22:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, Jul 16, 2010 at 3:02 PM, Jeffrey Merkey <[email protected]> wrote:
>
> So Linus, my understanding of Intel's processor design is that the
> processor will NEVER signal a nested NMI until it sees an iret from
> the first NMI exception.

Wrong.

I like x86, but it has warts. The NMI blocking is one of them.

The NMIs will be blocked until the _next_ "iret", but there is no
nesting. So if you take a fault during the NMI (debug, page table
fixup, whatever), the iret in the fault handler will re-enable NMIs
even though we're still busy with the original NMI. There is no
nesting count, or any way to say that "this is an NMI-releasing iret". They
could even do it still - make a new "iret that doesn't clear NMI" by
adding a segment override prefix to iret or whatever. But it's not
going to happen, and it's just one of those ugly special cases that
has various historical reasons (recursive faults during NMI sure as
hell didn't make sense back in the real-mode 8086 days).

So we have to handle it in software. Or not ever trap at all inside
the NMI handler.

The original patch - and the patch I detest - is to make the normal
fault paths use a "popf + ret" to emulate iret, but without the NMI
release.
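
For the 32-bit case the emulation amounts to something like this sketch (not
the actual patch; it assumes a same-privilege return with the saved
EIP/CS/EFLAGS still on top of the stack, and the label name is invented):

/* Sketch only: emulate iret without ending the NMI window. */
__asm__(
        ".text\n"
        "nmi_safe_return_sketch:\n\t"
        "pushl 8(%esp)\n\t"     /* copy the saved EFLAGS to the top */
        "popfl\n\t"             /* restore flags; since no iret runs,
                                   the CPU keeps NMIs blocked */
        "lret $4\n"             /* far return to CS:EIP, discarding
                                   the stale EFLAGS word */
);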

Now, I could live with that if it's the only solution, but it _is_
pretty damn ugly.

If somebody shows that it's actually faster to do "popf + ret" when
returning to kernel space (a poor man's special-case iret), maybe it
would be worth it, but the really critical code sequence is actually
not "return to kernel space", but the "return to user space" case that
really wants the iret. And I just think it's disgusting to add extra
tests to that path.

The other alternative would be to just make the rule be "NMI can never
take traps". It's possible to do that, but quite frankly, it's a pain.
It's a pain for page faults due to the whole vmalloc thing, and it's a
pain if you ever want to debug an NMI in any way (or put a breakpoint
on anything that is accessed from an NMI, which could potentially be
quite a lot of things).

If it was just the debug issue, I'd say "neener neener, debuggers are
for wimps", but it's clearly not just about debug. It's a whole lot of
other things. Random per-cpu data structures used for tracing, kernel
pointer verification code, yadda yadda.

Linus

2010-07-16 22:34:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, Jul 16, 2010 at 3:07 PM, Andi Kleen <[email protected]> wrote:
>
> One issue I have with nesting NMIs is that you need
> a nesting limit, otherwise you'll overflow the NMI stack.

Have you actually looked at the suggestion I (and now Mathieu)
suggested code for?

The nesting is very limited. NMI's would nest just once, and when that
happens, the nested NMI would never use more than something like a
hundred bytes of stack (most of which is what the CPU pushes
directly). And there would be no device interrupts that nest, and
practically the faults that nest obviously aren't going to be complex
faults either (ie the page fault would be the simple case that never
calls into 'handle_vm_fault()', but handles it all in
arch/x86/mm/fault.c).
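
That simple case is essentially a one-entry fixup. A sketch of its shape
(simplified; the real fixup in arch/x86/mm/fault.c walks more levels):

#include <linux/mm.h>
#include <linux/sched.h>

/*
 * Lazy vmalloc fixup, simplified: copy the missing top-level entry
 * from the reference page table and retry the access.  Nothing here
 * sleeps, takes mm locks, or calls handle_mm_fault().
 */
static int vmalloc_fault_sketch(unsigned long address)
{
        pgd_t *pgd = pgd_offset(current->active_mm, address);
        pgd_t *pgd_ref = pgd_offset_k(address);

        if (pgd_none(*pgd_ref))
                return -1;                /* genuinely bad access */
        if (pgd_none(*pgd))
                set_pgd(pgd, *pgd_ref);   /* link the shared lower levels */
        return 0;
}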

IOW, there is absolutely _no_ issues with nesting. It's two levels
deep, and a much smaller stack footprint than our regular exception
nesting for those two levels too.

And your argument that there would be more and more NMI usage only
makes it more important that we handle NMI's without going crazy. Just
handle them cleanly instead of making them something totally special.

Linus

2010-07-16 22:40:47

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

* Andi Kleen ([email protected]) wrote:
> > And the thing is, if we just do NMI's correctly, and allow nesting,
> > ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do
> > stupid things elsewhere.
>
> One issue I have with nesting NMIs is that you need
> a nesting limit, otherwise you'll overflow the NMI stack.
>
> We just got rid of nesting for normal interrupts because
> of this stack overflow problem which hit in real situations.
>
> In some cases you can get quite high NMI frequencies, e.g. with
> performance counters. Now the current performance counter handlers
> do not nest by themselves of course, but they might nest
> with other longer running NMI users.
>
> I think none of the current handlers are likely to nest
> for very long, but there's more and more NMI code added all the time,
> so it's definitely a concern.

We're not proposing to actually "nest" NMIs per se. We copy the stack at the
beginning of the NMI handler (and then use the copy) to permit nesting of faults
over NMI handlers. Following NMIs that would "try" to nest over the NMI handler
would see their regular execution postponed until the end of the currently
running NMI handler. It's OK for these "nested" NMI handlers to use the bottom
of the NMI stack, because the NMI handler on which they are trying to nest is only
using the stack copy. These "nested" handlers return to the original NMI handler
very early just after setting a "pending nmi" flag. There is more to it (e.g.
handling NMI handler exit atomically with respect to incoming NMIs); please
refer to the last assembly code snippet I sent to Linus a little earlier today
for details.
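
In C-like pseudocode, the scheme looks roughly like this (all names invented;
the assembly is the authoritative version and also closes the exit race):

extern void return_popf_ret(void);         /* leave with NMIs still blocked */
extern void copy_hw_frame(void);           /* copy the hardware NMI frame */
extern void return_from_copied_frame(void);
extern void handle_nmi(void);

static int nmi_in_progress, nmi_pending;   /* per-cpu in reality */

void nmi_entry(void)
{
        if (nmi_in_progress) {             /* we interrupted the handler */
                nmi_pending = 1;           /* ask it to run one more pass */
                return_popf_ret();         /* popf+ret: NMIs stay blocked */
                return;
        }
        nmi_in_progress = 1;
        copy_hw_frame();                   /* later NMIs may clobber the top
                                              of the NMI stack; we only use
                                              the copy, so that is harmless */
        do {
                nmi_pending = 0;
                handle_nmi();              /* may fault; the fault's iret
                                              re-enables NMIs, now safe */
        } while (nmi_pending);
        nmi_in_progress = 0;               /* real code does this atomically
                                              wrt an incoming NMI */
        return_from_copied_frame();
}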

Thanks,

Mathieu


>
> -Andi
>
> --
> [email protected] -- Speaking for myself only.

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-07-16 22:41:43

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, Jul 16, 2010 at 03:26:32PM -0700, Linus Torvalds wrote:
> On Fri, Jul 16, 2010 at 3:07 PM, Andi Kleen <[email protected]> wrote:
> >
> > One issue I have with nesting NMIs is that you need
> > a nesting limit, otherwise you'll overflow the NMI stack.
>
> Have you actually looked at the suggestion I (and now Mathieu)
> suggested code for?

Maybe I'm misunderstanding everything (there have been a lot of emails
in this thread), but the case I was thinking of would be if the second NMI
faults too, and then another one comes in after the IRET, etc.

-Andi

2010-07-16 22:48:06

by Jeffrey Merkey

[permalink] [raw]
Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, Jul 16, 2010 at 4:22 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Jul 16, 2010 at 3:02 PM, Jeffrey Merkey <[email protected]> wrote:
>>
>> So Linus, my understanding of Intel's processor design is that the
>> processor will NEVER signal a nested NMI until it sees an iret from
>> the first NMI exception.
>
> Wrong.
>
> I like x86, but it has warts. The NMI blocking is one of them.
>
> The NMIs will be blocked until the _next_ "iret", but there is no
> nesting. So if you take a fault during the NMI (debug, page table
> fixup, whatever), the iret in the fault handler will re-enable NMIs
> even though we're still busy with the original NMI. There is no
> nesting count, or any way to say that "this is an NMI-releasing iret". They
> could even do it still - make a new "iret that doesn't clear NMI" by
> adding a segment override prefix to iret or whatever. But it's not
> going to happen, and it's just one of those ugly special cases that
> has various historical reasons (recursive faults during NMI sure as
> hell didn't make sense back in the real-mode 8086 days).
>
> So we have to handle it in software. Or not ever trap at all inside
> the NMI handler.
>
> The original patch - and the patch I detest - is to make the normal
> fault paths use a "popf + ret" to emulate iret, but without the NMI
> release.
>
> Now, I could live with that if it's the only solution, but it _is_
> pretty damn ugly.
>
> If somebody shows that it's actually faster to do "popf + ret" when
> returning to kernel space (a poor man's special-case iret), maybe it
> would be worth it, but the really critical code sequence is actually
> not "return to kernel space", but the "return to user space" case that
> really wants the iret. And I just think it's disgusting to add extra
> tests to that path.
>
> The other alternative would be to just make the rule be "NMI can never
> take traps". It's possible to do that, but quite frankly, it's a pain.
> It's a pain for page faults due to the whole vmalloc thing, and it's a
> pain if you ever want to debug an NMI in any way (or put a breakpoint
> on anything that is accessed from an NMI, which could potentially be
> quite a lot of things).
>
> If it was just the debug issue, I'd say "neener neener, debuggers are
> for wimps", but it's clearly not just about debug. It's a whole lot of
> other things. Random per-cpu data structures used for tracing, kernel
> pointer verification code, yadda yadda.
>
> ? ? ? ? ? ? ? ? ?Linus
>

Well, the way I handled this problem on NetWare SMP and that other
kernel was to create a pool of TSS descriptors and reload each during
the exception to swap stacks before any handlers were called. Allowed
it to nest until I ran out of TSS descriptors (64 levels). Not sure
that's the way to go here, but it worked in that case.

Jeff

2010-07-16 22:50:05

by Jeffrey Merkey

[permalink] [raw]
Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

>
> If it was just the debug issue, I'd say "neener neener, debuggers are
> for wimps", but it's clearly not just about debug. It's a whole lot of
> other thigs. Random percpu datastructures used for tracing, kernel
> pointer verification code, yadda yadda.
>
> ? ? ? ? ? ? ? ? ?Linus
>

I guess I am a wimp then ... :-)

Jeff

2010-07-16 22:53:42

by Jeffrey Merkey

[permalink] [raw]
Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

>
> Well, the way I handled this problem on NetWare SMP and that other
> kernel was to create a pool of TSS descriptors and reload each during
> the exception to swap stacks before any handlers were called. Allowed
> it to nest until I ran out of TSS descriptors (64 levels). Not sure
> that's the way to go here, but it worked in that case.
>
> Jeff
>

Here is where that old dusty code lives these days - it deals with this problem.

http://open-source-netware.googlecode.com/files/manos-06-26-2010.tar.gz

file to look at is startup.386

;
; nmi entry code
;

nmi_entry macro
cli
push ebx
push ebp
mov ebp, esp
sub ebp, SIZE TaskStateSegment
mov ebx, ebp

mov [ebp].tSS, ss
mov [ebp].tGS, gs ; save segment registers
mov [ebp].tFS, fs
mov [ebp].tES, es
mov [ebp].tDS, ds
pop [ebp].tEBP
mov [ebp].tEDI, edi
mov [ebp].tESI, esi
mov [ebp].tEDX, edx
mov [ebp].tECX, ecx
pop [ebp].tEBX
mov [ebp].tEAX, eax

pop [ebp].tEIP ; remove return address
pop eax
mov [ebp].tCS, ax
pop [ebp].tSystemFlags ; get flags into TSS

mov [ebp].tESP, esp ; save true stack address
mov esp, ebx ; cover stack frame

mov eax, CR0
and eax, 0FFFFFFF7h ; clear task switch bit in CR0 to
mov CR0, eax ; avoid NPX exceptions

xor eax, eax
mov dr7, eax ; disable breakpoints

mov eax, CR3 ;
mov [ebp].tCR3, eax ;
mov eax, DebuggerPDE
mov CR3, eax

;
; if we do not clear the NESTED_TASK_FLAG, then the IRET
; at the end of this function will cause
; an invalid TSS exception to be generated because the
; task busy bit was cleared earlier
;

pushfd
and dword ptr [esp], NOT (NESTED_TASK_FLAG OR SINGLE_STEP_FLAG)
or dword ptr [esp], RESUME_FLAG
popfd

mov eax, 0FFFFFFFFh ; mark as a non-pooled TSS exception
push eax

push 0
push 0
push ebp

endm

;
; TSS entry code
;


task_entry macro
LOCAL @TSSNotNested, @NoLTR
LOCAL @UsedDefaultSegment
LOCAL @UsedPooledSegment
LOCAL @EnterTheDebugger

cli
xor eax, eax
str ax
mov esi, offset SystemGDTTable
mov esi, dword ptr [esi + 2]
lea ebx, [esi + eax]
mov al, [ebx].TSSBase2
mov ah, [ebx].TSSBase3
shl eax, 16
mov ax, [ebx].TSSBase1

;
; eax -> TSS Segment (Current)
; ebx -> TSS Descriptor (Current)
;

movzx ecx, word ptr [eax].tBackLink
or ecx, ecx
jz @TSSNotNested

mov esi, offset SystemGDTTable
mov esi, dword ptr [esi + 2]
lea edx, [esi + ecx]
mov cl, [edx].TSSBase2
mov ch, [edx].TSSBase3
shl ecx, 16
mov cx, [edx].TSSBase1

mov ebp, ecx

;
; edx -> TSS Descriptor (Previous)
; ebp -> TSS Segment (Previous)
;
; clear busy state and reset TSS
;

mov [edx].TSSType, 10001001b

@TSSNotNested:
mov [ebx].TSSType, 10001001b

lgdt ds: SystemGDTTable ; reset GDT TSS Busy bit

movzx eax, word ptr [eax].tBackLink
or eax, eax
jz @NoLTR

ltr ax

@NoLTR:

mov eax, CR0
and eax, 0FFFFFFF7h ; clear task switch bit in CR0 to
mov CR0, eax ; avoid NPX exceptions

xor eax, eax
mov dr7, eax ; disable breakpoints

pushfd
and dword ptr [esp], NOT (NESTED_TASK_FLAG OR SINGLE_STEP_FLAG)
or dword ptr [esp], RESUME_FLAG
popfd

push ebp
call AllocPooledResource
pop ebp

or eax, eax
jz @UsedDefaultSegment

lea ebp, [eax].TSSSegment
mov esp, [eax].StackTop

push eax ; push address of pooled resource
jmp @UsedPooledSegment

@UsedDefaultSegment:
mov eax, 0FFFFFFFFh ; push non-pooled marker onto the stack
push eax

@UsedPooledSegment:

push 0
mov eax, CR2 ; get fault address
push eax
push ebp ; pass the TSS

endm

;
; TSS exit code
;

Jeff

2010-07-17 01:16:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, Jul 16, 2010 at 3:41 PM, Andi Kleen <[email protected]> wrote:
>
> Maybe I'm misunderstanding everything (and it has been a lot of emails
> in the thread), but the case I was thinking of would be if the second NMI
> faults too, and then another one comes in after the IRET etc.

No, the nested NMI cannot fault, because it never even enters C code.
It literally just returns immediately after having noticed it is
nested (and corrupted the stack of the original one, so that the
original NMI will re-do itself at return).

So the nested NMI will use a few tens of bytes of stack. In fact,
it will use the stack "above" the stack that the original NMI handler
is using, because it will reset the stack pointer back to the top of
the NMI stack. So in a very real sense, it is not even extending the
stack, it is just re-using a small part of the same stack that the
original NMI used (and that we copied away so that it doesn't matter
that it gets re-used).

As to another small but important detail: the _nested_ NMI actually
returns using "popf+ret", leaving NMI's blocked again. Thus
guaranteeing forward progress and lack of NMI storms.

To summarize:

- the "original" (first-level) NMI can take faults (like the page
fault to fill in vmalloc pages lazily, or debug faults). That will
actually cause two stack frames (or three, if you debug a page fault
that happened while NMI was active). So there is certainly exception
nesting going on, but we're talking _much_ less stack than normal
stack usage where the nesting can be deep and in complex routines.

- any "nested" NMI's will not actually use any more stack at all than
a non-nested one, because we've pre-reserved space for them (and we
_had_ to reserve space for them due to IST)

- even if we get NMI's during the execution of the original NMI,
there can be only one such "spurious" NMI per nested exception. So if
we take a single page fault, that exception will re-enable NMI
(because it returns with "iret"), and as a result we may take a
_single_ new nested NMI until we disable NMI's again.

In other words, the approach is not all that different from doing
"lazy irq disable" like powerpc does for regular interrupts. For
NMI's, we do it because it's impossible (on x86) to disable NMI's
without actually taking one.

Linus

2010-07-18 09:25:22

by Avi Kivity

[permalink] [raw]
Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/17/2010 12:39 AM, Linus Torvalds wrote:
> On Fri, Jul 16, 2010 at 12:26 PM, Avi Kivity<[email protected]> wrote:
>
>> On 07/16/2010 09:37 PM, Linus Torvalds wrote:
>>
>>> Non-NMI code should simply never have to even _think_ about NMI's. Why
>>> should it? It should just do whatever comes "natural" within its own
>>> context.
>>>
>> But we're not talking about non-NMI code.
>>
> Yes, we are. We're talking about breakpoints (look at the subject
> line), and you are very much talking about things like that _idiotic_
> vmalloc_sync_all() by module loading code etc etc.
>

Well, I'd put it in the nmi handler registration code, but you're
right. A user placing breakpoints can't even tell whether the
breakpoint will be hit by NMI code, especially data breakpoints.

> Every _single_ "solution" I have seen - apart from my suggestion - has
> been about making code "special" because some other code might run in
> an NMI. Module init sequences having to do idiotic things just because
> they have data structures that might get accessed by NMI.
>
> And the thing is, if we just do NMI's correctly, and allow nesting,
> ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do
> stupid things elsewhere.
>
> In other words, why the hell are you arguing? Help Mathieu write the
> low-level NMI handler right, and remove that idiotic
> "vmalloc_sync_all()" that is fundamentally broken and should not
> exist. Rather than talk about adding more of that kind of crap.
>

Well, at least we'll get a good test case for kvm's nmi blocking
emulation (it's tricky since if we fault on an iret sometimes nmis get
unblocked even though the instruction did not complete).

--
error compiling committee.c: too many arguments to function

2010-07-18 09:27:50

by Avi Kivity

[permalink] [raw]
Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 07/16/2010 10:30 PM, Andi Kleen wrote:
> We already have infrastructure for kprobes to prevent breakpoints
> on critical code (the __kprobes section). In principle kgdb/kdb
> could be taught about honoring those too.
>
>

It doesn't help with NMI code calling other functions, or with data
breakpoints.

--
error compiling committee.c: too many arguments to function

2010-08-04 09:47:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On Fri, 2010-07-16 at 11:05 -0700, H. Peter Anvin wrote:
>
> I really hope no one ever gets the idea of touching user space from an
> NMI handler, though, and expecting it to work...

Perf actually already does that to unwind user-space stacks... ;-)

See arch/x86/kernel/cpu/perf_event.c:copy_from_user_nmi() and its users.

What we do is a manual page table walk (using __get_user_pages_fast) and
simply bail when the page is not available.
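
The shape of it, heavily simplified (this sketch assumes the copy stays
within one page and uses the slot-less kmap_atomic from the stack-based
series mentioned below; it is not the exact kernel function):

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/string.h>
#include <linux/uaccess.h>

/* Simplified sketch of the pattern, not the real copy_from_user_nmi(). */
static unsigned long
copy_from_user_nmi_sketch(void *to, const void __user *from, unsigned long n)
{
        unsigned long addr = (unsigned long)from;
        struct page *page;
        void *map;

        /* no faulting, no sleeping: fail instead of paging anything in */
        if (__get_user_pages_fast(addr, 1, 0, &page) != 1)
                return n;                 /* bail: n bytes left uncopied */

        map = kmap_atomic(page);
        memcpy(to, map + (addr & ~PAGE_MASK), n);
        kunmap_atomic(map);
        put_page(page);
        return 0;
}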

That said, I think that the thing that started the whole
per-cpu-per-context temp stack-frame storage story also means that that
function is now broken and can lead to kmap_atomic corruption.

I really should brush up that stack-based kmap_atomic thing; last time I
got stuck on FRV wanting things.

Linus, should I refresh that whole series, give FRV a slow but
working implementation, and then let David Howells sort things out if
he cares about that?

2010-08-04 20:27:08

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault

On 08/04/2010 02:46 AM, Peter Zijlstra wrote:
> On Fri, 2010-07-16 at 11:05 -0700, H. Peter Anvin wrote:
>>
>> I really hope no one ever gets the idea of touching user space from an
>> NMI handler, though, and expecting it to work...
>
> Perf actually already does that to unwind user-space stacks... ;-)
>
> See arch/x86/kernel/cpu/perf_event.c:copy_from_user_nmi() and its users.
>
> What we do is a manual page table walk (using __get_user_pages_fast) and
> simply bail when the page is not available.
>

That's not really "touching user space", though.

-hpa