Hi,
this series is the result of the discussion of the RFC patch found at [1].
The goal is to make x86' ftrace_int3_handler() not simply skip over
the trapping instruction, as this is problematic in the context of
the live patching consistency model. For details, cf. the commit message
of [3/4] ("x86/ftrace: make ftrace_int3_handler() not to skip fops
invocation").
Everything is based on v5.1-rc6; please let me know in case you want me to
rebase onto something else.
For x86_64, the live patching selftest added in [4/4] succeeds with this
series applied and fails without it. On 32 bit x86, I have only compile-tested it.
checkpatch reports warnings about
- an overlong line in assembly -- I chose to ignore that
- MAINTAINERS perhaps needing updates due to the new files
arch/x86/kernel/ftrace_int3_stubs.S and
tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh.
As the existing arch/x86/kernel/ftrace_{32,64}.S haven't got an
explicit entry either, this one is probably Ok? The selftest
definitely is.
Changes to the RFC patch:
- s/trampoline/stub/ to avoid confusion with the ftrace_ops' trampolines,
- use a fixed size stack kept in struct thread_info for passing the
(adjusted) ->ip values from ftrace_int3_handler() to the stubs,
- provide one stub for each of the two possible jump targets and hardcode
those,
- add the live patching selftest.
Thanks,
Nicolai
Nicolai Stange (4):
x86/thread_info: introduce ->ftrace_int3_stack member
ftrace: drop 'static' qualifier from ftrace_ops_list_func()
x86/ftrace: make ftrace_int3_handler() not to skip fops invocation
selftests/livepatch: add "ftrace a live patched function" test
arch/x86/include/asm/thread_info.h | 11 +++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/asm-offsets.c | 8 +++
arch/x86/kernel/ftrace.c | 79 +++++++++++++++++++---
arch/x86/kernel/ftrace_int3_stubs.S | 61 +++++++++++++++++
kernel/trace/ftrace.c | 8 +--
tools/testing/selftests/livepatch/Makefile | 3 +-
.../livepatch/test-livepatch-vs-ftrace.sh | 44 ++++++++++++
8 files changed, 199 insertions(+), 16 deletions(-)
create mode 100644 arch/x86/kernel/ftrace_int3_stubs.S
create mode 100755 tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh
--
2.13.7
There had been an issue with interactions between tracing and live patching,
caused by how x86' CONFIG_DYNAMIC_FTRACE used to handle the breakpoints at
the updated instructions from its ftrace_int3_handler().
More specifically, starting to trace a live patched function caused a short
period of time during which the live patching redirection became
ineffective. In particular, the guarantees of the consistency model couldn't
be upheld in this situation.
Implement a testcase for verifying that a function's live patch replacement
is kept effective when enabling tracing on it.
Reuse the existing 'test_klp_livepatch' live patch module which patches
cmdline_proc_show(), the handler for /proc/cmdline.
Make the testcase, in a loop,
- apply this live patch,
- launch a background shell job enabling function tracing on that function,
- while continuously verifying that the contents of /proc/cmdline still
match what is expected while the live patch is applied.
Signed-off-by: Nicolai Stange <[email protected]>
---
tools/testing/selftests/livepatch/Makefile | 3 +-
.../livepatch/test-livepatch-vs-ftrace.sh | 44 ++++++++++++++++++++++
2 files changed, 46 insertions(+), 1 deletion(-)
create mode 100755 tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh
diff --git a/tools/testing/selftests/livepatch/Makefile b/tools/testing/selftests/livepatch/Makefile
index af4aee79bebb..bfa5353f6d17 100644
--- a/tools/testing/selftests/livepatch/Makefile
+++ b/tools/testing/selftests/livepatch/Makefile
@@ -3,6 +3,7 @@
TEST_GEN_PROGS := \
test-livepatch.sh \
test-callbacks.sh \
- test-shadow-vars.sh
+ test-shadow-vars.sh \
+ test-livepatch-vs-ftrace.sh
include ../lib.mk
diff --git a/tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh b/tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh
new file mode 100755
index 000000000000..5c982ec56373
--- /dev/null
+++ b/tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh
@@ -0,0 +1,44 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2019 SUSE Linux GmbH
+
+. $(dirname $0)/functions.sh
+
+set -e
+
+MOD_LIVEPATCH=test_klp_livepatch
+
+# TEST: ftrace a live patched function
+# - load a livepatch that modifies the output from /proc/cmdline
+# - install a function tracer at the live patched function
+# - verify that the function is still patched by reading /proc/cmdline
+# - unload the livepatch and make sure the patch was removed
+
+echo -n "TEST: ftrace a live patched function ... "
+dmesg -C
+
+for i in $(seq 1 3); do
+ load_lp $MOD_LIVEPATCH
+
+ ( echo cmdline_proc_show > /sys/kernel/debug/tracing/set_ftrace_filter;
+ echo function > /sys/kernel/debug/tracing/current_tracer ) &
+
+ for j in $(seq 1 200); do
+ if [[ "$(cat /proc/cmdline)" != \
+ "$MOD_LIVEPATCH: this has been live patched" ]] ; then
+ echo -e "FAIL\n\n"
+ die "livepatch kselftest(s) failed"
+ fi
+ done
+
+ wait %1
+
+ echo nop > /sys/kernel/debug/tracing/current_tracer
+ echo > /sys/kernel/debug/tracing/set_ftrace_filter
+
+ disable_lp $MOD_LIVEPATCH
+ unload_lp $MOD_LIVEPATCH
+done
+
+echo "ok"
+exit 0
--
2.13.7
With an upcoming patch making x86' ftrace_int3_handler() no longer simply
skip over the insn being updated, ftrace_ops_list_func() will have to be
referenced from arch/x86 code. Drop its 'static' qualifier.
Signed-off-by: Nicolai Stange <[email protected]>
---
kernel/trace/ftrace.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index b920358dd8f7..ed3c20811d9a 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -125,8 +125,8 @@ ftrace_func_t ftrace_trace_function __read_mostly = ftrace_stub;
struct ftrace_ops global_ops;
#if ARCH_SUPPORTS_FTRACE_OPS
-static void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
- struct ftrace_ops *op, struct pt_regs *regs);
+void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
+ struct ftrace_ops *op, struct pt_regs *regs);
#else
/* See comment below, where ftrace_ops_list_func is defined */
static void ftrace_ops_no_ops(unsigned long ip, unsigned long parent_ip);
@@ -6302,8 +6302,8 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
* set the ARCH_SUPPORTS_FTRACE_OPS.
*/
#if ARCH_SUPPORTS_FTRACE_OPS
-static void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
- struct ftrace_ops *op, struct pt_regs *regs)
+void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
+ struct ftrace_ops *op, struct pt_regs *regs)
{
__ftrace_ops_list_func(ip, parent_ip, NULL, regs);
}
--
2.13.7
With dynamic ftrace, ftrace patches call sites on x86 in three steps:
1. put a breakpoint at the to-be-patched location,
2. update the call site and
3. finally remove the breakpoint again.
Note that the breakpoint handler, ftrace_int3_handler(), simply makes
execution skip over the to-be-patched instruction.
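For illustration, here is a minimal sketch of these three steps for a single
5-byte call/nop site. text_write() and sync_all_cores() are hypothetical
stand-ins for the kernel's actual text poking and cross-CPU serialization
primitives, and the real implementation batches this over many call sites:

#include <linux/types.h>

/* Hypothetical helpers standing in for the real text poking and
 * IPI-based serialization primitives. */
extern void text_write(void *ip, const void *src, size_t len);
extern void sync_all_cores(void);

#define INT3_OPCODE	0xcc
#define CALL_INSN_SIZE	5

static void patch_call_site_sketch(void *ip,
				   const unsigned char new_insn[CALL_INSN_SIZE])
{
	const unsigned char int3 = INT3_OPCODE;

	/* 1. Put a breakpoint on the first byte of the old instruction. */
	text_write(ip, &int3, 1);
	sync_all_cores();

	/* 2. Update the tail of the instruction behind the breakpoint. */
	text_write(ip + 1, new_insn + 1, CALL_INSN_SIZE - 1);
	sync_all_cores();

	/* 3. Finally remove the breakpoint by writing the new first byte. */
	text_write(ip, new_insn, 1);
	sync_all_cores();
}

While the breakpoint sits on the site, any context hitting it ends up in
ftrace_int3_handler() instead of executing the half-rewritten instruction.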
This patching happens in the following circumstances:
a.) the global ftrace_trace_function changes and the call sites at
ftrace_call and ftrace_regs_call get patched,
b.) an ftrace_ops' ->func changes and the call site in its trampoline gets
patched,
c.) graph tracing gets enabled/disabled and the jump site at
ftrace_graph_call gets patched,
d.) a mcount site gets converted from nop -> call, call -> nop, or
call -> call.
The latter case, i.e. a mcount call getting redirected, is possible in e.g.
a transition from trampolined to not trampolined upon a user enabling
function tracing on a live patched function.
The ftrace_int3_handler() simply skipping over the updated insn is quite
problematic in the context of live patching, because it means that a live
patched function gets temporarily reverted to its unpatched original and
this breaks the live patching consistency model. But even without live
patching, it is desirable to avoid missing traces when making changes to
the tracing setup.
Make ftrace_int3_handler() not skip over the fops invocation, but instead
modify the interrupted control flow to issue the call as needed.
Case c.) from the list above can be ignored, because a jmp instruction gets
changed to a nop or vice versa.
The remaining cases a.), b.) and d.) all involve call instructions. For a.)
and b.), the call always goes to some ftrace_func_t and the generic
ftrace_ops_list_func() implementation will be able to demultiplex and do the
right thing. For case d.), the call target is either of ftrace_caller(),
ftrace_regs_caller() or some ftrace_ops' trampoline. Because providing the
register state won't cause any harm for !FTRACE_OPS_FL_SAVE_REGS
ftrace_ops, ftrace_regs_caller() would be a suitable target capable of
handling any case.
ftrace_int3_handler()'s context is different from the interrupted call
instruction's one, obviously. In order to be able to emulate the call
within the original context, make ftrace_int3_handler() set its iret
frame's ->ip to some helper stub. Upon return from the trap, this stub will
then mimic the call by pushing the return address onto the stack and
issuing a jmp to the target address. As described above, the jmp target
will be either ftrace_ops_list_func() or ftrace_regs_caller(). Provide
one such stub implementation for each of the two cases.
Finally, the desired return address, which is derived from the trapping
->ip, must get passed from ftrace_int3_handler() to these stubs. Make
ftrace_int3_handler() push it onto the ftrace_int3_stack introduced by an
earlier patch and let the stubs consume it. Be careful to use proper
compiler barriers such that nested int3 handling from e.g. irqs won't
clobber entries owned by outer instances.
Suggested-by: Steven Rostedt <[email protected]>
Signed-off-by: Nicolai Stange <[email protected]>
---
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/ftrace.c | 79 +++++++++++++++++++++++++++++++------
arch/x86/kernel/ftrace_int3_stubs.S | 61 ++++++++++++++++++++++++++++
3 files changed, 130 insertions(+), 11 deletions(-)
create mode 100644 arch/x86/kernel/ftrace_int3_stubs.S
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 00b7e27bc2b7..0b63ae02b1f3 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -93,6 +93,7 @@ obj-$(CONFIG_LIVEPATCH) += livepatch.o
obj-$(CONFIG_FUNCTION_TRACER) += ftrace_$(BITS).o
obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o
obj-$(CONFIG_FTRACE_SYSCALLS) += ftrace.o
+obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace_int3_stubs.o
obj-$(CONFIG_X86_TSC) += trace_clock.o
obj-$(CONFIG_KEXEC_CORE) += machine_kexec_$(BITS).o
obj-$(CONFIG_KEXEC_CORE) += relocate_kernel_$(BITS).o crash.o
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index ef49517f6bb2..917494f35cba 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -280,25 +280,84 @@ static nokprobe_inline int is_ftrace_caller(unsigned long ip)
return 0;
}
-/*
- * A breakpoint was added to the code address we are about to
- * modify, and this is the handle that will just skip over it.
- * We are either changing a nop into a trace call, or a trace
- * call to a nop. While the change is taking place, we treat
- * it just like it was a nop.
- */
+extern void ftrace_graph_call(void);
+
+asmlinkage void ftrace_int3_stub_regs_caller(void);
+asmlinkage void ftrace_int3_stub_list_func(void);
+
+/* A breakpoint was added to the code address we are about to modify. */
int ftrace_int3_handler(struct pt_regs *regs)
{
unsigned long ip;
+ bool is_ftrace_location;
+ struct ftrace_int3_stack *stack;
+ int slot;
if (WARN_ON_ONCE(!regs))
return 0;
ip = regs->ip - 1;
- if (!ftrace_location(ip) && !is_ftrace_caller(ip))
+ is_ftrace_location = ftrace_location(ip);
+ if (!is_ftrace_location && !is_ftrace_caller(ip))
return 0;
- regs->ip += MCOUNT_INSN_SIZE - 1;
+ ip += MCOUNT_INSN_SIZE;
+
+ if (!is_ftrace_location &&
+ ftrace_update_func == (unsigned long)ftrace_graph_call) {
+ /*
+ * The insn at ftrace_graph_call is being changed from a
+ * nop to a jmp or vice versa. Treat it as a nop and
+ * skip over it.
+ */
+ regs->ip = ip;
+ return 1;
+ }
+
+ /*
+ * The insn having the breakpoint on it is either some mcount
+ * call site or one of ftrace_call, ftrace_regs_call and their
+ * equivalents within some trampoline. The currently pending
+ * transition is known to turn the insn from a nop to a call,
+ * from a call to a nop or to change the target address of an
+ * existing call. We're going to emulate a call to the most
+ * generic implementation capable of handling any possible
+ * configuration. For the mcount sites that would be
+ * ftrace_regs_caller() and for the remaining calls, which all
+ * have got some ftrace_func_t target, ftrace_ops_list_func()
+ * will do the right thing.
+ *
+ * However, the call insn can't get emulated from this trap
+ * handler here. Rewrite the iret frame's ->ip value to one of
+ * the ftrace_int3_stub instances which will then setup
+ * everything in the original context. The address following
+ * the current insn will be passed to the stub via the
+ * ftrace_int3_stack.
+ */
+ stack = &current_thread_info()->ftrace_int3_stack;
+ if (WARN_ON_ONCE(stack->depth >= 4)) {
+ /*
+ * This should not happen as at most one stack slot is
+ * required per the contexts "normal", "irq",
+ * "softirq" and "nmi" each. However, be conservative
+ * and treat it like a nop.
+ */
+ regs->ip = ip;
+ return 1;
+ }
+
+ /*
+ * Make sure interrupts will see the incremented ->depth value
+ * before writing the stack entry.
+ */
+ slot = stack->depth;
+ WRITE_ONCE(stack->depth, slot + 1);
+ WRITE_ONCE(stack->slots[slot], ip);
+
+ if (is_ftrace_location)
+ regs->ip = (unsigned long)ftrace_int3_stub_regs_caller;
+ else
+ regs->ip = (unsigned long)ftrace_int3_stub_list_func;
return 1;
}
@@ -949,8 +1008,6 @@ void arch_ftrace_trampoline_free(struct ftrace_ops *ops)
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
#ifdef CONFIG_DYNAMIC_FTRACE
-extern void ftrace_graph_call(void);
-
static unsigned char *ftrace_jmp_replace(unsigned long ip, unsigned long addr)
{
return ftrace_text_replace(0xe9, ip, addr);
diff --git a/arch/x86/kernel/ftrace_int3_stubs.S b/arch/x86/kernel/ftrace_int3_stubs.S
new file mode 100644
index 000000000000..ef5f580450bb
--- /dev/null
+++ b/arch/x86/kernel/ftrace_int3_stubs.S
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2019 SUSE Linux GmbH */
+
+#include <asm/asm.h>
+#include <asm/percpu.h>
+#include <asm/unwind_hints.h>
+#include <asm/asm-offsets.h>
+#include <linux/linkage.h>
+
+#ifdef CONFIG_X86_64
+#define WORD_SIZE 8
+#else
+#define WORD_SIZE 4
+#endif
+
+.macro ftrace_int3_stub call_target
+ /*
+ * We got here from ftrace_int3_handler() because a breakpoint
+ * on a call insn currently being modified has been
+ * hit. ftrace_int3_handler() can't emulate the function call
+ * directly, because it's running at a different position on
+ * the stack, obviously. Hence it sets the regs->ip to this
+ * stub so that we end up here upon the iret from the int3
+ * handler. The stack is now in its original state and we can
+ * emulate the function call insn by pushing the return
+ * address onto the stack and jumping to the call target. The
+ * desired return address has been put onto the ftrace_int3_stack
+ * kept within struct thread_info.
+ */
+ UNWIND_HINT_EMPTY
+ /* Reserve room for the emulated call's return address. */
+ sub $WORD_SIZE, %_ASM_SP
+ /*
+ * Pop the return address from the ftrace_int3_stack and write
+ * it to the location just reserved. Be careful to retrieve
+ * the address before decrementing ->depth in order to protect
+ * against nested contexts clobbering it.
+ */
+ push %_ASM_AX
+ push %_ASM_CX
+ push %_ASM_DX
+ mov PER_CPU_VAR(current_task), %_ASM_AX
+ mov TASK_TI_ftrace_int3_depth(%_ASM_AX), %_ASM_CX
+ dec %_ASM_CX
+ mov TASK_TI_ftrace_int3_slots(%_ASM_AX, %_ASM_CX, WORD_SIZE), %_ASM_DX
+ mov %_ASM_CX, TASK_TI_ftrace_int3_depth(%_ASM_AX)
+ mov %_ASM_DX, 3*WORD_SIZE(%_ASM_SP)
+ pop %_ASM_DX
+ pop %_ASM_CX
+ pop %_ASM_AX
+ /* Finally, transfer control to the target function. */
+ jmp \call_target
+.endm
+
+ENTRY(ftrace_int3_stub_regs_caller)
+ ftrace_int3_stub ftrace_regs_caller
+END(ftrace_int3_stub_regs_caller)
+
+ENTRY(ftrace_int3_stub_list_func)
+ ftrace_int3_stub ftrace_ops_list_func
+END(ftrace_int3_stub_list_func)
--
2.13.7
Before actually rewriting an insn, x86' DYNAMIC_FTRACE implementation
places an int3 breakpoint on it. Currently, ftrace_int3_handler() simply
treats the insn in question as nop and advances %rip past it. An upcoming
patch will improve this by making the int3 trap handler emulate the call
insn. To this end, ftrace_int3_handler() will be made to change its iret
frame's ->ip to some stub which will then mimic the function call in the
original context.
Somehow the trapping ->ip address will have to get communicated from
ftrace_int3_handler() to these stubs though. Note that at any given point
in time, there can be at most four such call insn emulations pending:
namely at most one per "process", "irq", "softirq" and "nmi" context.
Introduce struct ftrace_int3_stack providing four entries for storing
the instruction pointer.
In principle, it could be made per-cpu, but this would require making
ftrace_int3_handler() return with preemption disabled and re-enabling it
from those emulation stubs only after the stack's top entry has been
consumed. I've been told that this would "break a lot of norms" and that
making this stack part of struct thread_info instead would be less fragile.
Follow this advice and add a struct ftrace_int3_stack instance to x86's
struct thread_info. Note that these stacks will get only rarely accessed
(only during ftrace's code modifications) and thus, cache line dirtying
won't have any significant impact on the neighbouring fields.
Initialization will take place implicitly through INIT_THREAD_INFO as per
the rules for missing elements in initializers. The memcpy() in
arch_dup_task_struct() will propagate the initial state properly, because
it's always run in process context and won't ever see a non-zero ->depth
value.
Finally, add the necessary bits to asm-offsets for making struct
ftrace_int3_stack accessible from assembly.
Suggested-by: Steven Rostedt <[email protected]>
Signed-off-by: Nicolai Stange <[email protected]>
---
arch/x86/include/asm/thread_info.h | 11 +++++++++++
arch/x86/kernel/asm-offsets.c | 8 ++++++++
2 files changed, 19 insertions(+)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index e0eccbcb8447..83434a88cfbb 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -56,6 +56,17 @@ struct task_struct;
struct thread_info {
unsigned long flags; /* low level flags */
u32 status; /* thread synchronous flags */
+#ifdef CONFIG_DYNAMIC_FTRACE
+ struct ftrace_int3_stack {
+ int depth;
+ /*
+ * There can be at most one slot in use per context,
+ * i.e. at most one for "normal", "irq", "softirq" and
+ * "nmi" each.
+ */
+ unsigned long slots[4];
+ } ftrace_int3_stack;
+#endif
};
#define INIT_THREAD_INFO(tsk) \
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 168543d077d7..ca6ee24a0c6e 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -105,4 +105,12 @@ static void __used common(void)
OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
OFFSET(TSS_sp2, tss_struct, x86_tss.sp2);
+
+#ifdef CONFIG_DYNAMIC_FTRACE
+ BLANK();
+ OFFSET(TASK_TI_ftrace_int3_depth, task_struct,
+ thread_info.ftrace_int3_stack.depth);
+ OFFSET(TASK_TI_ftrace_int3_slots, task_struct,
+ thread_info.ftrace_int3_stack.slots);
+#endif
}
--
2.13.7
On Sat, Apr 27, 2019 at 12:06:38PM +0200, Nicolai Stange wrote:
> ftrace_int3_handler()'s context is different from the interrupted call
> instruction's one, obviously. In order to be able to emulate the call
> within the original context, make ftrace_int3_handler() set its iret
> frame's ->ip to some helper stub. Upon return from the trap, this stub will
> then mimic the call by pushing the the return address onto the stack and
> issuing a jmp to the target address. As describe above, the jmp target
> will be either of ftrace_ops_list_func() or ftrace_regs_caller(). Provide
> one such stub implementation for each of the two cases.
Yuck; I'd much rather we get that static_call() stuff sorted such that
text_poke() and poke_int3_handler() can do CALL emulation.
Given all the back and forth, I think the solution where we shift
pt_regs a bit to allow the emulated PUSH is a viable solution; eg. I
think we collectively hated it least.
[ Added Linus ]
On Sat, 27 Apr 2019 12:26:57 +0200
Peter Zijlstra <[email protected]> wrote:
> On Sat, Apr 27, 2019 at 12:06:38PM +0200, Nicolai Stange wrote:
> > ftrace_int3_handler()'s context is different from the interrupted call
> > instruction's one, obviously. In order to be able to emulate the call
> > within the original context, make ftrace_int3_handler() set its iret
> > frame's ->ip to some helper stub. Upon return from the trap, this stub will
> > then mimic the call by pushing the the return address onto the stack and
> > issuing a jmp to the target address. As describe above, the jmp target
> > will be either of ftrace_ops_list_func() or ftrace_regs_caller(). Provide
> > one such stub implementation for each of the two cases.
>
> Yuck; I'd much rather we get that static_call() stuff sorted such that
> text_poke() and poke_int3_handler() can do CALL emulation.
>
> Given all the back and forth, I think the solution where we shift
> pt_regs a bit to allow the emulated PUSH is a viable solution; eg. I
> think we collectively hated it least.
Note, this use case is slightly different than the static calls but has
the same issue. That issue is that we need a way to simulate a "call"
function from int3, and there's no good way to handle that case right
now.
The static call needs to modify the call site but make the call still
work from int3 context.
This use case is similar, but the issue is with a "bug" in the function
tracing implementation that has become an issue with live kernel
patching, which is built on top of it.
The issue is this:
For optimization reasons, if there's only a single user of a function
it gets its own trampoline that sets up the call to its callback and
calls that callback directly:
<function being traced>
call custom trampoline
<custom trampoline>
save regs to call C code
call custom_callback
restore regs
ret
If more than one callback is attached to that function, then we need to
call the iterator that loops through all registered callbacks of the
function tracer and checks their hashes to see if they want to trace
this function or not.
<function being traced>
call iterator_trampoline
<iterator_trampoline>
save regs to call C code
call iterator
restore regs
ret
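For reference, a simplified sketch of what such an iterator boils down to;
the real one is __ftrace_ops_list_func() in kernel/trace/ftrace.c, which
additionally handles RCU, recursion protection and per-ops disabling, and
the names used below are normally file-local to that file:

#include <linux/ftrace.h>

static void list_func_sketch(unsigned long ip, unsigned long parent_ip,
			     struct ftrace_ops *ignored, struct pt_regs *regs)
{
	struct ftrace_ops *op;

	/* Walk all registered ftrace_ops and invoke every callback
	 * whose filter hash matches this ip. */
	for (op = ftrace_ops_list; op != &ftrace_list_end; op = op->next) {
		if (ftrace_ops_test(op, ip, regs))
			op->func(ip, parent_ip, op, regs);
	}
}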
What happens in that transition? We add an "int3" at the function being
traced, and change:
call custom_trampoline
to
call iterator_trampoline
But if the function being traced is executed during this transition,
the int3 just makes it act like a nop and returns past the call.
For tracing this is a bug because we just missed a function that should
be traced. For live kernel patching, this could be more devastating
because the original "buggy" patched function is called, and this could
cause subtle problems.
If we can add a slot to the int3 handler that we can use to emulate a
call (perhaps add another slot to the pt_regs structure; call it a link
register), it would make it easier to solve both this and the static call
problems.
-- Steve
> On Apr 27, 2019, at 3:06 AM, Nicolai Stange <[email protected]> wrote:
>
> Before actually rewriting an insn, x86' DYNAMIC_FTRACE implementation
> places an int3 breakpoint on it. Currently, ftrace_int3_handler() simply
> treats the insn in question as nop and advances %rip past it.
How does this not crash all the time?
> An upcoming
> patch will improve this by making the int3 trap handler emulate the call
> insn. To this end, ftrace_int3_handler() will be made to change its iret
> frame's ->ip to some stub which will then mimic the function call in the
> original context.
>
> Somehow the trapping ->ip address will have to get communicated from
> ftrace_int3_handler() to these stubs though.
Cute. But this should get “ftrace” removed from the name, since it’s potentially more generally useful. Josh wanted something like this for static_call.
> Note that at any given point
> in time, there can be at most four such call insn emulations pending:
> namely at most one per "process", "irq", "softirq" and "nmi" context.
>
That’s quite an assumption. I think your list should also contain exception, exceptions nested inside that exception, and machine check, at the very least. I’m also wondering why irq and softirq are treated separately.
All this makes me think that one of the other solutions we came up with last time we discussed this might be better.
On Sun, 28 Apr 2019 10:41:10 -0700
Andy Lutomirski <[email protected]> wrote:
> > Note that at any given point
> > in time, there can be at most four such call insn emulations pending:
> > namely at most one per "process", "irq", "softirq" and "nmi" context.
> >
>
> That’s quite an assumption. I think your list should also contain
> exception, exceptions nested inside that exception, and machine
> check, at the very least. I’m also wondering why irq and softirq are
> treated separately.
4 has usually been the context count we choose. But I guess in theory,
if we get exceptions then it could potentially be more.
As for irq vs softirq, an interrupt can preempt a softirq. Interrupts
are enabled while softirqs are running. When softirqs run, softirqs are
disabled to prevent nested softirqs. But interrupts are enabled again,
and another interrupt may come in while a softirq is executing.
>
> All this makes me think that one of the other solutions we came up
> with last time we discussed this might be better.
+100
Perhaps adding another slot into pt_regs that gets used by int3 to
store a slot to emulate a call on return?
-- Steve
> On Apr 28, 2019, at 10:51 AM, Steven Rostedt <[email protected]> wrote:
>
> On Sun, 28 Apr 2019 10:41:10 -0700
> Andy Lutomirski <[email protected]> wrote:
>
>
>>> Note that at any given point
>>> in time, there can be at most four such call insn emulations pending:
>>> namely at most one per "process", "irq", "softirq" and "nmi" context.
>>>
>>
>> That’s quite an assumption. I think your list should also contain
>> exception, exceptions nested inside that exception, and machine
>> check, at the very least. I’m also wondering why irq and softirq are
>> treated separately.
>
> 4 has usually been the context count we choose. But I guess in theory,
> if we get exceptions then I could potentially be more.
>
> As for irq vs softirq, an interrupt can preempt a softirq. Interrupts
> are enabled while softirqs are running. When sofirqs run, softirqs are
> disabled to prevent nested softirqs. But interrupts are enabled again,
> and another interrupt may come in while a softirq is executing.
>
>>
>> All this makes me think that one of the other solutions we came up
>> with last time we discussed this might be better.
>
> +100
>
> Perhaps adding another slot into pt_regs that gets used by int3 to
> store a slot to emulate a call on return?
>
>
That’s not totally nuts, although finding pt_regs isn’t entirely trivial.
I still think I prefer an approach where we just emulate the call directly.
On Sun, 28 Apr 2019 11:08:34 -0700
Andy Lutomirski <[email protected]> wrote:
> >
> > Perhaps adding another slot into pt_regs that gets used by int3 to
> > store a slot to emulate a call on return?
> >
> >
>
> That’s not totally nuts, although finding pt_regs isn’t entirely trivial.
I meant on the int3 handler (which stores the pt_regs).
>
> I still think I prefer an approach where we just emulate the call directly.
Then, on the return of int3, if there's anything in that slot, then we
could possibly shift the exception handler frame (that was added by the
hardware), insert the slot data into the top of the stack, and then
call iret (which the int3 handler, would add the return ip to be the
function being called), which would in essence emulate the call directly.
I believe the complexity comes from the fact that the exception frame
added by the hardware is where we need to put the return address of the
call for the emulation.
-- Steve
> On Apr 28, 2019, at 12:43 PM, Steven Rostedt <[email protected]> wrote:
>
> On Sun, 28 Apr 2019 11:08:34 -0700
> Andy Lutomirski <[email protected]> wrote:
>
>>>
>>> Perhaps adding another slot into pt_regs that gets used by int3 to
>>> store a slot to emulate a call on return?
>>>
>>>
>>
>> That’s not totally nuts, although finding pt_regs isn’t entirely trivial.
>
> I meant on the int3 handler (which stores the pt_regs).
But that’s below the stub’s RSP, so it’s toast if another interrupt happens. Or am I misunderstanding you?
>
>>
>> I still think I prefer an approach where we just emulate the call directly.
>
> Then, on the return of int3, if there's anything in that slot, then we
> could possibly shift the exception handler frame (that was added by the
> hardware), insert the slot data into the top of the stack, and then
> call iret (which the int3 handler, would add the return ip to be the
> function being called), which would in essence emulate the call directly.
Oh, I get it.
I liked Josh’s old proposal of unconditionally shifting the #BP frame 8 bytes better. It will be interesting when kernel shadow stacks are thrown in the mix, but that’s a problem for another day.
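As a rough sketch, emulating the call directly would boil down to something
like the following, assuming the #BP entry code unconditionally leaves a gap
below the saved ->sp as in that proposal; names and helpers here are
illustrative only and not taken from any posted patch:

#include <asm/ptrace.h>

/*
 * Illustrative only: relies on the int3 entry code having shifted the
 * hardware exception frame so that there is room below the saved ->sp.
 */
static void int3_emulate_push(struct pt_regs *regs, unsigned long val)
{
	regs->sp -= sizeof(unsigned long);
	*(unsigned long *)regs->sp = val;
}

static void int3_emulate_call(struct pt_regs *regs, unsigned long func)
{
	/* Return address: the insn following the patched 5-byte call. */
	int3_emulate_push(regs, regs->ip - 1 + 5);
	regs->ip = func;
}

The iret from the int3 handler then resumes execution at the call target,
with the emulated return address already sitting on the interrupted stack.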
Steven Rostedt <[email protected]> writes:
> On Sun, 28 Apr 2019 10:41:10 -0700
> Andy Lutomirski <[email protected]> wrote:
>
>
>> > Note that at any given point
>> > in time, there can be at most four such call insn emulations pending:
>> > namely at most one per "process", "irq", "softirq" and "nmi" context.
>> >
>>
>> That’s quite an assumption. I think your list should also contain
>> exception, exceptions nested inside that exception, and machine
>> check, at the very least. I’m also wondering why irq and softirq are
>> treated separately.
You're right, I missed the machine check case.
> 4 has usually been the context count we choose. But I guess in theory,
> if we get exceptions then I could potentially be more.
After having seen the static_call discussion, I'm in no way defending
this limited approach here, but out of curiosity: can the code between
the push onto the stack from ftrace_int3_handler() and the subsequent
pop from the stub actually trigger a (non-#MC) exception? There's an
iret in between, but that can fault only when returning to user space,
correct?
> As for irq vs softirq, an interrupt can preempt a softirq. Interrupts
> are enabled while softirqs are running. When sofirqs run, softirqs are
> disabled to prevent nested softirqs. But interrupts are enabled again,
> and another interrupt may come in while a softirq is executing.
>
Thanks,
Nicolai
--
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)
> On Apr 28, 2019, at 2:22 PM, Nicolai Stange <[email protected]> wrote:
>
> Steven Rostedt <[email protected]> writes:
>
>> On Sun, 28 Apr 2019 10:41:10 -0700
>> Andy Lutomirski <[email protected]> wrote:
>>
>>
>>>> Note that at any given point
>>>> in time, there can be at most four such call insn emulations pending:
>>>> namely at most one per "process", "irq", "softirq" and "nmi" context.
>>>>
>>>
>>> That’s quite an assumption. I think your list should also contain
>>> exception, exceptions nested inside that exception, and machine
>>> check, at the very least. I’m also wondering why irq and softirq are
>>> treated separately.
>
> You're right, I missed the machine check case.
>
>
>> 4 has usually been the context count we choose. But I guess in theory,
>> if we get exceptions then I could potentially be more.
>
> After having seen the static_call discussion, I'm in no way defending
> this limited approach here, but out of curiosity: can the code between
> the push onto the stack from ftrace_int3_handler() and the subsequent
> pop from the stub actually trigger an (non-#MC) exception? There's an
> iret inbetween, but that can fault only when returning to user space,
> correct?
IRET doesn’t do any fancy masking, so #DB, NMI and regular IRQs should all be possible.
>
>
>> As for irq vs softirq, an interrupt can preempt a softirq. Interrupts
>> are enabled while softirqs are running. When sofirqs run, softirqs are
>> disabled to prevent nested softirqs. But interrupts are enabled again,
>> and another interrupt may come in while a softirq is executing.
>>
>
> Thanks,
>
> Nicolai
>
>
> --
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
> HRB 21284 (AG Nürnberg)
On Sun, Apr 28, 2019 at 10:38 AM Steven Rostedt <[email protected]> wrote:
>
> For optimization reasons, if there's only a single user of a function
> it gets its own trampoline that sets up the call to its callback and
> calls that callback directly:
So this is the same issue as the static calls, and it has exactly the
same solution.
Which I already outlined once, and nobody wrote the code for.
So here's a COMPLETELY UNTESTED patch that only works (_if_ it works) for
(a) 64-bit
(b) SMP
but that's just because I've hardcoded the percpu segment handling.
It does *not* emulate the "call" in the BP handler itself; instead it
replaces the %ip (the same way all the other BP handlers replace the
%ip) with a code sequence that just does
push %gs:bp_call_return
jmp *%gs:bp_call_target
after having filled in those per-cpu things.
The reason they are percpu is that after the %ip has been changed, the
target CPU goes its merry way, and doesn't wait for the text-poke
semaphore serialization. But since we have interrupts disabled on that
CPU, we know that *another* text poke won't be coming around and
changing the values.
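A rough sketch of the #BP handler side of this scheme, using the names from
the description above; the actual (untested) patch may differ in detail, and
the stubs themselves are just the push/jmp sequence quoted above, with the
irq-on variant doing an sti right before the jmp:

#include <linux/percpu.h>
#include <asm/ptrace.h>
#include <asm/processor-flags.h>

/* The asm stubs doing "push %gs:bp_call_return; jmp *%gs:bp_call_target". */
extern char emulate_call_irqoff[], emulate_call_irqon[];

static DEFINE_PER_CPU(void *, bp_call_return);
static DEFINE_PER_CPU(void *, bp_call_target);

static void emulate_bp_call(struct pt_regs *regs, void *target, void *ret_addr)
{
	/*
	 * Interrupts are off inside the #BP handler, so no other
	 * text_poke_bp() can reuse these per-cpu slots before the stub
	 * on this CPU has consumed them.
	 */
	this_cpu_write(bp_call_target, target);
	this_cpu_write(bp_call_return, ret_addr);

	if (regs->flags & X86_EFLAGS_IF) {
		/*
		 * Return with IF cleared; the irq-on stub re-enables
		 * interrupts via sti immediately before its jmp.
		 */
		regs->flags &= ~X86_EFLAGS_IF;
		regs->ip = (unsigned long)emulate_call_irqon;
	} else {
		regs->ip = (unsigned long)emulate_call_irqoff;
	}
}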
THIS IS ENTIRELY UNTESTED! I've built it, and it at least seems to
build, although with warnings
arch/x86/kernel/alternative.o: warning: objtool:
emulate_call_irqoff()+0x9: indirect jump found in RETPOLINE build
arch/x86/kernel/alternative.o: warning: objtool:
emulate_call_irqon()+0x8: indirect jump found in RETPOLINE build
arch/x86/kernel/alternative.o: warning: objtool:
emulate_call_irqoff()+0x9: sibling call from callable instruction with
modified stack frame
arch/x86/kernel/alternative.o: warning: objtool:
emulate_call_irqon()+0x8: sibling call from callable instruction with
modified stack frame
that will need the appropriate "ignore this case" annotations that I didn't do.
Do I expect it to work? No. I'm sure there's some silly mistake here,
but the point of the patch is to show it as an example, so that it can
actually be tested.
With this, it should be possible (under the text rewriting lock) to do
replace_call(callsite, newcallopcode, callsize, calltargettarget);
to do the static rewriting of the call at "callsite" to have the new
call target.
And again. Untested. But doesn't need any special code in the entry
path, and the concept is simple even if there are probably stupid bugs
just because it's entirely untested.
Oh, and did I mention that I didn't test this?
Linus
On Mon, Apr 29, 2019 at 11:06 AM Linus Torvalds
<[email protected]> wrote:
>
>
> It does *not* emulate the "call" in the BP handler itself, instead if
> replace the %ip (the same way all the other BP handlers replace the
> %ip) with a code sequence that just does
>
> push %gs:bp_call_return
> jmp *%gs:bp_call_target
>
> after having filled in those per-cpu things.
Note that if you read the patch, you'll see that my explanation
glossed over the "what if an interrupt happens" part. Which is handled
by having two handlers, one for "interrupts were already disabled" and
one for "interrupts were enabled, so I disabled them before entering
the handler".
The second handler does the same push/jmp sequence, but has a "sti"
before the jmp. Because of the one-instruction sti shadow, interrupts
won't actually be enabled until after the jmp instruction has
completed, and thus the "push/jmp" is atomic wrt regular interrupts.
It's not safe wrt NMI, of course, but since NMI won't be rescheduling,
and since any SMP IPI won't be punching through that sequence anyway,
it's still atomic wrt _another_ text_poke() attempt coming in and
re-using the bp_call_return/target slots.
So yeah, it's not "one-liner" trivial, but it's not like it's
complicated either, and it actually matches the existing "call this
code to emulate the replaced instruction". So I'd much rather have a
couple of tens of lines of code here that still acts pretty much
exactly like all the other rewriting does, rather than play subtle
games with the entry stack frame.
Finally: there might be other situations where you want to have this
kind of "pseudo-atomic" replacement sequence, so I think while it's a
hack specific to emulating a "call" instruction, I don't think it is
conceptually limited to just that case.
Linus
On Mon, Apr 29, 2019 at 11:29 AM Linus Torvalds
<[email protected]> wrote:
>
> On Mon, Apr 29, 2019 at 11:06 AM Linus Torvalds
> <[email protected]> wrote:
> >
> >
> > It does *not* emulate the "call" in the BP handler itself, instead if
> > replace the %ip (the same way all the other BP handlers replace the
> > %ip) with a code sequence that just does
> >
> > push %gs:bp_call_return
> > jmp *%gs:bp_call_target
> >
> > after having filled in those per-cpu things.
>
> Note that if you read the patch, you'll see that my explanation
> glossed over the "what if an interrupt happens" part. Which is handled
> by having two handlers, one for "interrupts were already disabled" and
> one for "interrupts were enabled, so I disabled them before entering
> the handler".
This is quite a cute solution.
>
> The second handler does the same push/jmp sequence, but has a "sti"
> before the jmp. Because of the one-instruction sti shadow, interrupts
> won't actually be enabled until after the jmp instruction has
> completed, and thus the "push/jmp" is atomic wrt regular interrupts.
>
> It's not safe wrt NMI, of course, but since NMI won't be rescheduling,
> and since any SMP IPI won't be punching through that sequence anyway,
> it's still atomic wrt _another_ text_poke() attempt coming in and
> re-using the bp_call_return/tyarget slots.
I'm less than 100% convinced about this argument. Sure, an NMI right
there won't cause a problem. But an NMI followed by an interrupt will
kill us if preemption is on. I can think of three solutions:
1. Assume that all CPUs (and all relevant hypervisors!) either mask
NMIs in the STI shadow or return from NMIs with interrupts masked for
one instruction. Ditto for MCE. This seems too optimistic.
2. Put a fixup in the NMI handler and MCE handler: if the return
address is one of these magic jumps, clear IF and back up IP by one
byte. This should *work*, but it's a bit ugly.
3. Use a different magic sequence:
push %gs:bp_call_return
int3
and have the int3 handler adjust regs->ip and regs->flags as appropriate.
I think I like #3 the best, even though it's twice as slow.
FWIW, kernel shadow stack patches will show up eventually, and none of
these approaches are compatible as is. Perhaps the actual sequence
should be this, instead:
bp_int3_fixup_irqsoff:
call 1f
1:
int3
bp_int3_fixup_irqson:
call 1f
1:
int3
and the int3 handler will update the conventional return address *and*
the shadow return address. Linus, what do you think about this
variant?
Finally, with my maintainer hat on: if anyone actually wants to merge
this thing, I want to see a test case, exercised regularly (every boot
if configured, perhaps) that actually *runs* this code. Hitting it in
practice will be rare, and I want bugs to be caught.
On Mon, 29 Apr 2019 11:06:58 -0700
Linus Torvalds <[email protected]> wrote:
> +void replace_call(void *addr, const void *opcode, size_t len, void *target)
> +{
> + bp_int3_call_target = target;
> + bp_int3_call_return = addr + len;
> + bp_int3_handler_irqoff = emulate_call_irqoff;
> + text_poke_bp(addr, opcode, len, emulate_call_irqon);
> +}
Note, the function tracer does not use text poke. It does it in batch
mode. It can update over 40,000 calls in one go:
add int3 breakpoint to all 40,000 call sites.
sync()
change the last four bytes of each of those call sites
sync()
remove int3 from the 40,000 call sites with the new call.
It's a bit more intrusive than the static call updates we were
discussing before.
-- Steve
On Mon, Apr 29, 2019 at 11:53 AM Linus Torvalds
<[email protected]> wrote:
>
>
>
> On Mon, Apr 29, 2019, 11:42 Andy Lutomirski <[email protected]> wrote:
>>
>>
>> I'm less than 100% convinced about this argument. Sure, an NMI right
>> there won't cause a problem. But an NMI followed by an interrupt will
>> kill us if preemption is on. I can think of three solutions:
>
>
> No, because either the sti shadow disables nmi too (that's the case on some CPUs at least) or the iret from nmi does.
>
> Otherwise you could never trust the whole sti shadow thing - and it very much is part of the architecture.
>
Is this documented somewhere? And do you actually believe that this
is true under KVM, Hyper-V, etc? As I recall, Andrew Cooper dug in to
the way that VMX dealt with this stuff and concluded that the SDM was
blatantly wrong in many cases, which leads me to believe that Xen
HVM/PVH is the *only* hypervisor that gets it right.
Steven's point about batched updates is quite valid, though. My
personal favorite solution to this whole mess is to rework the whole
thing so that the int3 handler simply returns and retries and to
replace the sync_core() broadcast with an SMI broadcast. I don't know
whether this will actually work on real CPUs and on VMs and whether
it's going to crash various BIOSes out there.
On Mon, 29 Apr 2019 11:59:04 -0700
Linus Torvalds <[email protected]> wrote:
> I really don't care. Just do what I suggested, and if you have numbers to
> show problems, then maybe I'll care.
>
Are you suggesting that I rewrite the code to do it one function at a
time? This has always been batch mode. This is not something new. The
function tracer has been around longer than the text poke code.
> Right now you're just making excuses for this. I described the solution
> months ago, now I've written a patch, if that's not good enough then we can
> just skip this all entirely.
>
> Honestly, if you need to rewrite tens of thousands of calls, maybe you're
> doing something wrong?
>
# cd /sys/kernel/debug/tracing
# cat available_filter_functions | wc -l
45856
# cat enabled_functions | wc -l
0
# echo function > current_tracer
# cat enabled_functions | wc -l
45856
There, I just enabled 45,856 function call sites in one shot!
How else do you want to update them? Every function in the kernel has a
nop that turns into a call to the ftrace_handler; if I add another
user of that code, it will change each one as well.
-- Steve
On Mon, Apr 29, 2019 at 12:13 PM Linus Torvalds
<[email protected]> wrote:
>
>
>
> On Mon, Apr 29, 2019, 12:02 Linus Torvalds <[email protected]> wrote:
>>
>>
>>
>> If nmi were to break it, it would be a cpu bug.
>
>
> Side note: we *already* depend on sti shadow working in other parts of the kernel, namely sti->iret.
>
Where? STI; IRET would be nuts.
Before:
commit 4214a16b02971c60960afd675d03544e109e0d75
Author: Andy Lutomirski <[email protected]>
Date: Thu Apr 2 17:12:12 2015 -0700
x86/asm/entry/64/compat: Use SYSRETL to return from compat mode SYSENTER
we did sti; sysexit, but, when we discussed this, I don't recall anyone
speaking up in favor of the safety of the old code.
Not to mention that the crash we'll get if we get an NMI and a
rescheduling interrupt in this path will be very, very hard to debug.
On Mon, Apr 29, 2019 at 12:07 PM Steven Rostedt <[email protected]> wrote:
>
> Are you suggesting that I rewrite the code to do it one function at a
> time? This has always been batch mode. This is not something new. The
> function tracer has been around longer than the text poke code.
Only do the 'call' instructions one at a time. Why would you change
_existing_ code?
Linus
On Mon, Apr 29, 2019 at 12:24 PM Andy Lutomirski <[email protected]> wrote:
> > Side note: we *already* depend on sti shadow working in other parts of the kernel, namely sti->iret.
>
> Where? STI; IRET would be nuts.
Sorry, not 'sti;iret' but 'sti;sysexit'
> before commit 4214a16b02971c60960afd675d03544e109e0d75
> x86/asm/entry/64/compat: Use SYSRETL to return from compat mode SYSENTER
>
> we did sti; sysxit, but, when we discussed this, I don't recall anyone
> speaking up in favor of the safely of the old code.
We still have that sti sysexit in the 32-bit code.
Linus
On Mon, Apr 29, 2019 at 12:02 PM Linus Torvalds
<[email protected]> wrote:
>
> If nmi were to break it, it would be a cpu bug. I'm pretty sure I've
> seen the "shadow stops even nmi" documented for some uarch, but as
> mentioned it's not necessarily the only way to guarantee the shadow.
In fact, the documentation is simply the official Intel instruction
docs for "STI":
The IF flag and the STI and CLI instructions do not prohibit the
generation of exceptions and NMI interrupts. NMI interrupts (and
SMIs) may be blocked for one macroinstruction following an STI.
note the "may be blocked". As mentioned, that's just one option for
not having NMI break the STI shadow guarantee, but it's clearly one
that Intel has done at times, and clearly even documents as having
done so.
There is absolutely no question that the sti shadow is real, and that
people have depended on it for _decades_. It would be a horrible
errata if the shadow can just be made to go away by randomly getting
an NMI or SMI.
Linus
On Mon, Apr 29, 2019 at 1:06 PM Linus Torvalds
<[email protected]> wrote:
>
> Only do the 'call' instructions one at a time. Why would you change
> _existing_ code?
Side note: if you want to, you can easily batch up rewriting 'call'
instructions to the same target using the exact same code. You just
need to change the int3 handler case to calculate the
bp_int3_call_return from the fixed one-time address to use something
like
this_cpu_write(bp_call_return, int3_address-1+bp_int3_call_size);
instead (and you'd need to also teach the function that there's not
just a single int3 live at a time)
Linus
On Mon, 29 Apr 2019 13:06:17 -0700
Linus Torvalds <[email protected]> wrote:
> On Mon, Apr 29, 2019 at 12:07 PM Steven Rostedt <[email protected]> wrote:
> >
> > Are you suggesting that I rewrite the code to do it one function at a
> > time? This has always been batch mode. This is not something new. The
> > function tracer has been around longer than the text poke code.
>
> Only do the 'call' instructions one at a time. Why would you change
> _existing_ code?
The function tracing is a call instruction.
On boot:
<function_X>:
nop
blah
blah
After a callback to function tracing is called:
<function_X>
call custom_trampoline
blah
blah
If we have two callbacks added to that function:
<function_X>
call iterator_trampoline
blah
blah
The update from "call custom_trampoline" to "call iterator_trampoline"
is where we have an issue.
We could make this a special case where we do this one at a time, but
currently the code is all the same, looking at tables to determine what
to do, which is one of three things:
1) change nop to call function
2) change call function to nop
3) update call function to another call function
#3 is where we have an issue. But if we want this to be different, we
would need to change the code significantly, and know that we are only
updating calls to calls. Which would take a bit of accounting to see if
that's the change that is being made.
This thread started about that #3 operation causing a call to be missed
because we turn it into a nop while we make the transition, where in
reality it needs to be a call to one of the two functions in the
transition.
-- Steve
On Mon, Apr 29, 2019 at 1:30 PM Steven Rostedt <[email protected]> wrote:
>
> The update from "call custom_trampoline" to "call iterator_trampoline"
> is where we have an issue.
So it has never worked. Just tell people that they have two choices:
- you do the careful rewriting, which takes more time
- you do it by rewriting as nop and then back, which is what
historically has been done, and that is fast and simple, because
there's no need to be careful.
Really. I find your complaints completely incomprehensible. You've
never rewritten call instructions atomically before, and now you
complain about it being slightly more expensive to do it when I give
you the code? Yes it damn well will be slightly more expensive. Deal
with it.
Btw, once again - I several months ago also gave a suggestion on how
it could be done batch-mode by having lots of those small stubs and
just generating them dynamically.
You never wrote that code *EITHER*. It's been *months*.
So now I've written the non-batch-mode code for you, and you just
*complain* about it?
I'm done with this discussion. I'm totally fed up.
Linus
On Mon, Apr 29, 2019 at 01:16:10PM -0700, Linus Torvalds wrote:
> On Mon, Apr 29, 2019 at 12:02 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > If nmi were to break it, it would be a cpu bug. I'm pretty sure I've
> > seen the "shadow stops even nmi" documented for some uarch, but as
> > mentioned it's not necessarily the only way to guarantee the shadow.
>
> In fact, the documentation is simply the official Intel instruction
> docs for "STI":
>
> The IF flag and the STI and CLI instructions do not prohibit the
> generation of exceptions and NMI interrupts. NMI interrupts (and
> SMIs) may be blocked for one macroinstruction following an STI.
>
> note the "may be blocked". As mentioned, that's just one option for
> not having NMI break the STI shadow guarantee, but it's clearly one
> that Intel has done at times, and clearly even documents as having
> done so.
>
> There is absolutely no question that the sti shadow is real, and that
> people have depended on it for _decades_. It would be a horrible
> errata if the shadow can just be made to go away by randomly getting
> an NMI or SMI.
FWIW, Lakemont (Quark) doesn't block NMI/SMI in the STI shadow, but I'm
not sure that counters the "horrible errata" statement ;-). SMI+RSM saves
and restores STI blocking in that case, but AFAICT NMI has no such
protection and will effectively break the shadow on its IRET.
All other (modern) Intel uArchs block NMI in the shadow and also enforce
STI_BLOCKING==0 when injecting an NMI via VM-Enter, i.e. prevent a VMM
from breaking the shadow so long as the VMM preserves the shadow info.
KVM is generally ok with respect to STI blocking, but ancient versions
didn't migrate STI blocking and there's currently a hole where
single-stepping a guest (from host userspace) could drop STI_BLOCKING
if a different VM-Exit occurs between the single-step #DB VM-Exit and the
instruction in the shadow. Though "don't do that" may be a reasonable
answer in that case.
On Mon, 29 Apr 2019 14:38:35 -0700
Linus Torvalds <[email protected]> wrote:
> On Mon, Apr 29, 2019 at 1:30 PM Steven Rostedt <[email protected]> wrote:
> >
> > The update from "call custom_trampoline" to "call iterator_trampoline"
> > is where we have an issue.
>
> So it has never worked. Just tell people that they have two chocies:
The custom call to iterator translation never worked. One option is to
always call the iterator, and just take the hit. There's another
solution to make permanent updates that would handle the live
patching case, but not for the tracing case. It is more critical for
live patching than tracing (missed traced function is not as critical
as running buggy code unexpectedly). I could look to work on that
instead.
>
> - you do the careful rewriting, which takes more time
Which would be the method I said making the call to call a special case.
>
> - you do it by rewriting as nop and then back, which is what
> historically has been done, and that is fast and simple, because
> there's no need to be careful.
Except in the live kernel patching case where you just put back the
buggy function.
>
> Really. I find your complaints completely incomprehensible. You've
> never rewritten call instructions atomically before, and now you
> complain about it being slightly more expensive to do it when I give
> you the code? Yes it damn well will be slightly more expensive. Deal
> with it.
I wasn't complaining about the expense of doing it that way. I was
stating that it would take more time to get it right, as it would
require a different implementation for the call to call case.
>
> Btw, once again - I several months ago also gave a suggestion on how
> it could be done batch-mode by having lots of those small stubs and
> just generating them dynamically.
>
> You never wrote that code *EITHER*. It's been *months*.
Note, I wasn't the one writing the code back then either. I posted a
proof of concept for a similar topic (trace events) for the purpose of
bringing back the performance lost due to the retpolines (something
like 8% hit). Josh took the initiative to do that work, but I don't
remember ever getting to a consensus on a solution to that. Yes you had
given us ideas, but I remember people (like Andy) having concerns with
it. But because it was just an optimization change, and Josh had other
things to work on, that work just stalled.
But I just found out that the function tracing code has the same issue,
but this can cause real problems with live kernel patching. This is why
this patch set started.
>
> So now I've written the non-batch-mode code for you, and you just
> *complain* about it?
Because this is a different story. The trace event code is updated one
at a time (there's patches out there to do it in batch). But this is
not trace events. This is function tracing which only does batching.
>
> I'm done with this discussion. I'm totally fed up.
>
Note, I wasn't writing this code anyway as I didn't have the time to. I
was helping others who took the initiative to do this work. But I guess
they didn't have time to move it forward either.
For the live kernel patching case, I'll start working on the permanent
changes, where once a function entry is changed, it can never be put
back. Then we won't have an issue with the live kernel patching case,
but only for tracing. But the worse thing there is you miss tracing
functions while an update is being made.
If Nicolai has time, perhaps he can try out the method you suggest and
see if we can move this forward.
-- Steve
On Mon, Apr 29, 2019 at 11:57 AM Andy Lutomirski <[email protected]> wrote:
> >
> > Otherwise you could never trust the whole sti shadow thing - and it very much is part of the architecture.
>
> Is this documented somewhere?
Btw, if you really don't trust the sti shadow despite it going all the
way back to the 8086, then you could instead make the irqoff code do
push %gs:bp_call_return
push %gs:bp_call_target
sti
ret
which just keeps interrupts explicitly disabled over the whole use of
the percpu data.
The actual "ret" instruction doesn't matter, it's not going to change
in this model (where the code isn't dynamically generated or changed).
So I claim that it will still be protected by the sti shadow, but when
written that way it doesn't actually matter, and you could reschedule
immediately after the sti (getting an interrupt there might make the
stack frame look odd, but it doesn't really affect anything else)
Linus
On Mon, Apr 29, 2019 at 3:08 PM Sean Christopherson
<[email protected]> wrote:
>
> FWIW, Lakemont (Quark) doesn't block NMI/SMI in the STI shadow, but I'm
> not sure that counters the "horrible errata" statement ;-). SMI+RSM saves
> and restores STI blocking in that case, but AFAICT NMI has no such
> protection and will effectively break the shadow on its IRET.
Ugh. I can't say I care deeply about Quark (ie never seemed to go
anywhere), but it's odd. I thought it was based on a Pentium core (or
i486+?). Are you saying those didn't do it either?
I have this dim memory about talking about this with some (AMD?)
engineer, and having an alternative approach for the sti shadow wrt
NMI - basically not checking interrupts in the instruction you return
to with 'iret'. I don't think it was even conditional on the "iret
from NMI", I think it was basically any iret also did the sti shadow
thing.
But I can find no actual paper to back that up, so this may be me just
making sh*t up.
> KVM is generally ok with respect to STI blocking, but ancient versions
> didn't migrate STI blocking and there's currently a hole where
> single-stepping a guest (from host userspace) could drop STI_BLOCKING
> if a different VM-Exit occurs between the single-step #DB VM-Exit and the
> instruction in the shadow. Though "don't do that" may be a reasonable
> answer in that case.
I thought the sti shadow blocked the single-step exception too? I know
"mov->ss" does block debug interrupts too.
Or are you saying that it's some "single step by emulation" that just
misses setting the STI_BLOCKING flag?
Linus
On Mon, Apr 29, 2019 at 03:22:09PM -0700, Linus Torvalds wrote:
> On Mon, Apr 29, 2019 at 3:08 PM Sean Christopherson
> <[email protected]> wrote:
> >
> > FWIW, Lakemont (Quark) doesn't block NMI/SMI in the STI shadow, but I'm
> > not sure that counters the "horrible errata" statement ;-). SMI+RSM saves
> > and restores STI blocking in that case, but AFAICT NMI has no such
> > protection and will effectively break the shadow on its IRET.
>
> Ugh. I can't say I care deeply about Quark (ie never seemed to go
> anywhere), but it's odd. I thought it was based on a Pentium core (or
> i486+?). Are you saying those didn't do it either?
It's 486 based, but either way I suspect the answer is "yes". IIRC,
Knights Corner, a.k.a. Larrabee, also had funkiness around SMM and that
was based on P54C, though I'm struggling to recall exactly what the
Larrabee weirdness was.
> I have this dim memory about talking about this with some (AMD?)
> engineer, and having an alternative approach for the sti shadow wrt
> NMI - basically not checking interrupts in the instruction you return
> to with 'iret'. I don't think it was even conditional on the "iret
> from NMI", I think it was basically any iret also did the sti shadow
> thing.
>
> But I can find no actual paper to back that up, so this may be me just
> making sh*t up.
If Intel CPUs ever did anything like that on IRET it's long gone.
> > KVM is generally ok with respect to STI blocking, but ancient versions
> > didn't migrate STI blocking and there's currently a hole where
> > single-stepping a guest (from host userspace) could drop STI_BLOCKING
> > if a different VM-Exit occurs between the single-step #DB VM-Exit and the
> > instruction in the shadow. Though "don't do that" may be a reasonable
> > answer in that case.
>
> I thought the sti shadow blocked the single-step exception too? I know
> "mov->ss" does block debug interrupts too.
{MOV,POP}SS blocks #DBs, STI does not.
> Or are you saying that it's some "single step by emulation" that just
> misses setting the STI_BLOCKING flag?
This is the case I was talking about for KVM. KVM supports single-stepping
the guest from userspace and uses EFLAGS.TF to do so (since it works on both
Intel and AMD). VMX has a consistency check that fails VM-Entry if
STI_BLOCKING=1, EFLAGS.TF==1, IA32_DEBUGCTL.BTF=0 and there isn't a pending
single-step #DB, and so KVM clears STI_BLOCKING immediately before entering
the guest when single-stepping the guest. If a VM-Exit occurs immediately
after VM-Entry, e.g. due to hardware interrupt, then KVM will see
STI_BLOCKING=0 when processing guest events in its run loop and will inject
any pending interrupts.
I *think* the KVM behavior can be fixed, e.g. I'm not entirely sure why KVM
takes this approach instead of setting PENDING_DBG.BS, but that's probably
a moot point.
On Mon, Apr 29, 2019 at 05:08:46PM -0700, Sean Christopherson wrote:
> On Mon, Apr 29, 2019 at 03:22:09PM -0700, Linus Torvalds wrote:
> > On Mon, Apr 29, 2019 at 3:08 PM Sean Christopherson
> > <[email protected]> wrote:
> > >
> > > FWIW, Lakemont (Quark) doesn't block NMI/SMI in the STI shadow, but I'm
> > > not sure that counters the "horrible errata" statement ;-). SMI+RSM saves
> > > and restores STI blocking in that case, but AFAICT NMI has no such
> > > protection and will effectively break the shadow on its IRET.
> >
> > Ugh. I can't say I care deeply about Quark (ie never seemed to go
> > anywhere), but it's odd. I thought it was based on a Pentium core (or
> > i486+?). Are you saying those didn't do it either?
>
> It's 486 based, but either way I suspect the answer is "yes". IIRC,
> Knights Corner, a.k.a. Larrabee, also had funkiness around SMM and that
> was based on P54C, though I'm struggling to recall exactly what the
> Larrabee weirdness was.
Aha! Found an ancient comment that explicitly states P5 does not block
NMI/SMI in the STI shadow, while P6 does block NMI/SMI.
On Mon, Apr 29, 2019 at 5:45 PM Sean Christopherson
<[email protected]> wrote:
>
> On Mon, Apr 29, 2019 at 05:08:46PM -0700, Sean Christopherson wrote:
> >
> > It's 486 based, but either way I suspect the answer is "yes". IIRC,
> > Knights Corner, a.k.a. Larrabee, also had funkiness around SMM and that
> > was based on P54C, though I'm struggling to recall exactly what the
> > Larrabee weirdness was.
>
> Aha! Found an ancient comment that explicitly states P5 does not block
> NMI/SMI in the STI shadow, while P6 does block NMI/SMI.
Ok, so the STI shadow really wouldn't be reliable on those machines. Scary.
Of course, the good news is that hopefully nobody has them any more,
and if they do, they presumably don't use fancy NMI profiling etc, so
any actual NMI's are probably relegated purely to largely rare and
effectively fatal errors anyway (ie memory parity errors).
Linus
Steven Rostedt <[email protected]> writes:
> On Mon, 29 Apr 2019 14:38:35 -0700
> Linus Torvalds <[email protected]> wrote:
>
>> On Mon, Apr 29, 2019 at 1:30 PM Steven Rostedt <[email protected]> wrote:
>> >
>> > The update from "call custom_trampoline" to "call iterator_trampoline"
>> > is where we have an issue.
>>
>> So it has never worked. Just tell people that they have two choices:
>
> The custom call to iterator translation never worked. One option is to
> always call the iterator, and just take the hit. There's another
> solution to make permanent updates that would handle the live
> patching case, but not the tracing case. It is more critical for
> live patching than tracing (a missed traced function is not as critical
> as running buggy code unexpectedly). I could look into working on that
> instead.
Making the live patching case permanent would disable tracing of live
patched functions though?
For unconditionally calling the iterator: I don't have numbers, but
would expect that always searching through something like 50-70 live
patching ftrace_ops from some hot path will be noticeable.
> If Nicolai has time, perhaps he can try out the method you suggest and
> see if we can move this forward.
You mean making ftrace handle call->call transitions one by one in
non-batch mode, right? Sure, I can do that.
Another question though: is there anything that prevents us from calling
ftrace_ops_list_func() directly from ftrace_int3_handler()? We don't
have parent_ip, but if that is used for reporting only, maybe setting it
to some ftrace_is_in_int3_and_doesnt_know_the_parent() will work?
Graph tracing entries would still be lost, but well...
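Just to make the question concrete, here is a rough sketch of what I
mean (illustration only: it assumes ftrace_ops_list_func() has been made
callable as in patch 2/4, and the "unknown parent" value is a made-up
placeholder):

#include <linux/ftrace.h>
#include <asm/ftrace.h>
#include <asm/ptrace.h>

/* not callable today; patch 2/4 drops the 'static' qualifier */
extern void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
				 struct ftrace_ops *op, struct pt_regs *regs);

#define FTRACE_INT3_UNKNOWN_PARENT	0UL	/* made-up placeholder value */

static int ftrace_int3_call_iterator(struct pt_regs *regs)
{
	unsigned long ip = regs->ip - 1;	/* start of the patched call site */

	if (!ftrace_location(ip))
		return 0;

	/* run all registered ftrace_ops right here, from the trap handler */
	ftrace_ops_list_func(ip, FTRACE_INT3_UNKNOWN_PARENT, NULL, regs);

	/* resume just past the call-sized instruction being patched */
	regs->ip = ip + MCOUNT_INSN_SIZE;
	return 1;
}

If parent_ip really is only ever used for reporting, something along
these lines might be enough.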
Thanks,
Nicolai
On Mon, Apr 29, 2019 at 07:26:02PM -0700, Linus Torvalds wrote:
> On Mon, Apr 29, 2019 at 5:45 PM Sean Christopherson
> <[email protected]> wrote:
> >
> > On Mon, Apr 29, 2019 at 05:08:46PM -0700, Sean Christopherson wrote:
> > >
> > > It's 486 based, but either way I suspect the answer is "yes". IIRC,
> > > Knights Corner, a.k.a. Larrabee, also had funkiness around SMM and that
> > > was based on P54C, though I'm struggling to recall exactly what the
> > > Larrabee weirdness was.
> >
> > Aha! Found an ancient comment that explicitly states P5 does not block
> > NMI/SMI in the STI shadow, while P6 does block NMI/SMI.
>
> Ok, so the STI shadow really wouldn't be reliable on those machines. Scary.
>
> Of course, the good news is that hopefully nobody has them any more,
> and if they do, they presumably don't use fancy NMI profiling etc, so
> any actual NMI's are probably relegated purely to largely rare and
> effectively fatal errors anyway (ie memory parity errors).
We do have KNC perf support, if that chip has 'issues'...
Outside of that, we only do perf from P6 onwards, with P4 support being
in dubious shape because it's massively weird and 'nobody' still has
those machines.
On Mon, Apr 29, 2019 at 02:52:50PM -0400, Steven Rostedt wrote:
> On Mon, 29 Apr 2019 11:06:58 -0700
> Linus Torvalds <[email protected]> wrote:
>
> > +void replace_call(void *addr, const void *opcode, size_t len, void *target)
> > +{
> > + bp_int3_call_target = target;
> > + bp_int3_call_return = addr + len;
> > + bp_int3_handler_irqoff = emulate_call_irqoff;
> > + text_poke_bp(addr, opcode, len, emulate_call_irqon);
> > +}
>
> Note, the function tracer does not use text poke. It does it in batch
> mode. It can update over 40,000 calls in one go:
Note that Daniel is adding batch stuff to text_poke(), because
jump_label/static_key stuffs also end up wanting to change more than one
site at a time and IPI spraying the machine for every single instance is
hurting.
So ideally ftrace would start to use the 'normal' code when that
happens.
On Mon, 29 Apr 2019, Linus Torvalds wrote:
> > > It's 486 based, but either way I suspect the answer is "yes". IIRC,
> > > Knights Corner, a.k.a. Larrabee, also had funkiness around SMM and that
> > > was based on P54C, though I'm struggling to recall exactly what the
> > > Larrabee weirdness was.
> >
> > Aha! Found an ancient comment that explicitly states P5 does not block
> > NMI/SMI in the STI shadow, while P6 does block NMI/SMI.
>
> Ok, so the STI shadow really wouldn't be reliable on those machines. Scary.
>
> Of course, the good news is that hopefully nobody has them any more, and
> if they do, they presumably don't use fancy NMI profiling etc, so any
> actual NMI's are probably relegated purely to largely rare and
> effectively fatal errors anyway (ie memory parity errors).
FWIW, if that thing has a local APIC (I have no idea, I've never seen Quark
myself), then NMIs are used to trigger all-cpu backtraces as well, which
can still be done in situations where the kernel is then expected to
continue running undisrupted (softlockup, sysrq, hung task detector, ...).
Nothing to really worry about in the particular case of this HW perhaps,
sure.
--
Jiri Kosina
SUSE Labs
On Mon, Apr 29, 2019 at 03:06:30PM -0700, Linus Torvalds wrote:
> On Mon, Apr 29, 2019 at 11:57 AM Andy Lutomirski <[email protected]> wrote:
> > >
> > > Otherwise you could never trust the whole sti shadow thing - and it very much is part of the architecture.
> >
> > Is this documented somewhere?
>
> Btw, if you really don't trust the sti shadow despite it going all the
> way back to the 8086, then you could instead make the irqoff code do
>
> push %gs:bp_call_return
> push %gs:bp_call_target
> sti
> ret
This variant cures the RETPOLINE complaint, due to there not actually
being an indirect jump anymore. And it cures the sibling-call complaint,
but trades it for "return with modified stack frame".
Something like so is clean:
+extern asmlinkage void emulate_call_irqon(void);
+extern asmlinkage void emulate_call_irqoff(void);
+
+asm(
+ ".text\n"
+ ".global emulate_call_irqoff\n"
+ ".type emulate_call_irqoff, @function\n"
+ "emulate_call_irqoff:\n\t"
+ "push %gs:bp_call_return\n\t"
+ "push %gs:bp_call_target\n\t"
+ "sti\n\t"
+ "ret\n"
+ ".size emulate_call_irqoff, .-emulate_call_irqoff\n"
+
+ ".global emulate_call_irqon\n"
+ ".type emulate_call_irqon, @function\n"
+ "emulate_call_irqon:\n\t"
+ "push %gs:bp_call_return\n\t"
+ "push %gs:bp_call_target\n\t"
+ "ret\n"
+ ".size emulate_call_irqon, .-emulate_call_irqon\n"
+ ".previous\n");
+
+STACK_FRAME_NON_STANDARD(emulate_call_irqoff);
+STACK_FRAME_NON_STANDARD(emulate_call_irqon);
On Tue, 30 Apr 2019 12:46:48 +0200
Peter Zijlstra <[email protected]> wrote:
> On Mon, Apr 29, 2019 at 02:52:50PM -0400, Steven Rostedt wrote:
> > On Mon, 29 Apr 2019 11:06:58 -0700
> > Linus Torvalds <[email protected]> wrote:
> >
> > > +void replace_call(void *addr, const void *opcode, size_t len, void *target)
> > > +{
> > > + bp_int3_call_target = target;
> > > + bp_int3_call_return = addr + len;
> > > + bp_int3_handler_irqoff = emulate_call_irqoff;
> > > + text_poke_bp(addr, opcode, len, emulate_call_irqon);
> > > +}
> >
> > Note, the function tracer does not use text poke. It does it in batch
> > mode. It can update over 40,000 calls in one go:
>
> Note that Daniel is adding batch stuff to text_poke(), because
> jump_label/static_key stuffs also end up wanting to change more than one
> site at a time and IPI spraying the machine for every single instance is
> hurting.
>
> So ideally ftrace would start to use the 'normal' code when that
> happens.
Sure, but that's a completely different issue. We would need to solve
the 'emulate' call for batch mode first.
-- Steve
On Mon, Apr 29, 2019 at 01:07:33PM -0700, Linus Torvalds wrote:
> On Mon, Apr 29, 2019 at 12:24 PM Andy Lutomirski <[email protected]> wrote:
> > > Side note: we *already* depend on sti shadow working in other parts of the kernel, namely sti->iret.
> >
> > Where? STI; IRET would be nuts.
>
> Sorry, not 'sti;iret' but 'sti;sysexit'
>
> > before commit 4214a16b02971c60960afd675d03544e109e0d75
> > x86/asm/entry/64/compat: Use SYSRETL to return from compat mode SYSENTER
> >
> > we did sti; sysxit, but, when we discussed this, I don't recall anyone
> > speaking up in favor of the safely of the old code.
>
> We still have that sti sysexit in the 32-bit code.
We also have both: "STI; HLT" and "STI; MWAIT" where we rely on the STI
shadow. Getting an NMI in between shouldn't hurt too much, but if that
in turn can lead to an actual interrupt happening, we're up some creek
without no paddle.
Most modern systems don't use either anymore though, as
mwait_idle_with_hints() relies on MWAIT ECX[0]=1 to allow MWAIT to wake
on pending interrupts.
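For reference, a minimal sketch of the two patterns (illustration only,
not the kernel's actual idle implementation; the helper names are made
up):

static inline void idle_sti_hlt(void)
{
	/* relies on the one-instruction STI shadow: no IRQ between sti and hlt */
	asm volatile("sti; hlt" : : : "memory");
}

static inline void idle_mwait(unsigned long hints, const void *monitor_addr)
{
	/* arm the monitor on some address (e.g. the thread flags word) */
	asm volatile("monitor" : : "a" (monitor_addr), "c" (0UL), "d" (0UL));
	/*
	 * ECX bit 0 ("interrupt break") lets MWAIT wake on a pending
	 * interrupt even though EFLAGS.IF is still clear, so no STI
	 * shadow is needed; interrupts are only re-enabled afterwards.
	 */
	asm volatile("mwait" : : "a" (hints), "c" (1UL) : "memory");
	asm volatile("sti" : : : "memory");
}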
On Tue, Apr 30, 2019 at 09:44:45AM -0400, Steven Rostedt wrote:
> On Tue, 30 Apr 2019 12:46:48 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > On Mon, Apr 29, 2019 at 02:52:50PM -0400, Steven Rostedt wrote:
> > > On Mon, 29 Apr 2019 11:06:58 -0700
> > > Linus Torvalds <[email protected]> wrote:
> > >
> > > > +void replace_call(void *addr, const void *opcode, size_t len, void *target)
> > > > +{
> > > > + bp_int3_call_target = target;
> > > > + bp_int3_call_return = addr + len;
> > > > + bp_int3_handler_irqoff = emulate_call_irqoff;
> > > > + text_poke_bp(addr, opcode, len, emulate_call_irqon);
> > > > +}
> > >
> > > Note, the function tracer does not use text poke. It does it in batch
> > > mode. It can update over 40,000 calls in one go:
> >
> > Note that Daniel is adding batch stuff to text_poke(), because
> > jump_label/static_key stuffs also end up wanting to change more than one
> > site at a time and IPI spraying the machine for every single instance is
> > hurting.
> >
> > So ideally ftrace would start to use the 'normal' code when that
> > happens.
>
> Sure, but that's a completely different issue. We would need to solve
> the 'emulate' call for batch mode first.
I don't see a problem there; when INT3 happens, you bsearch() the batch
array to find the one you hit, frob that into the percpu variables
and do the thing. Or am I losing the plot again?
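For the record, a minimal sketch of that lookup (the struct and helper
names here are illustrative, not the actual batching API):

#include <linux/bsearch.h>

/* illustrative only: one entry per patched site in the batch */
struct text_poke_loc {
	void *addr;		/* address of the patched instruction */
	void *handler;		/* where to send the CPU while patching */
};

static int patch_cmp(const void *key, const void *elt)
{
	const struct text_poke_loc *tp = elt;

	if (key < tp->addr)
		return -1;
	if (key > tp->addr)
		return 1;
	return 0;
}

/* in the INT3 handler: find the batch entry for the trapping address */
static struct text_poke_loc *find_bp_entry(struct text_poke_loc *tbl,
					   unsigned int nr, void *ip)
{
	/* tbl must be sorted by addr for bsearch() to work */
	return bsearch(ip, tbl, nr, sizeof(*tbl), patch_cmp);
}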
On Tue, 30 Apr 2019 16:20:47 +0200
Peter Zijlstra <[email protected]> wrote:
> >
> > Sure, but that's a completely different issue. We would need to solve
> > the 'emulate' call for batch mode first.
>
> I don't see a problem there; when INT3 happens; you bsearch() the batch
> array to find the one you hit, you frob that into the percpu variables
> and do the thing. Or am I loosing the plot again?
I may need to tweak Linus's patch slightly, but I may be able to get it
to work with the batch mode.
-- Steve
On Tue, Apr 30, 2019 at 6:56 AM Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Apr 29, 2019 at 01:07:33PM -0700, Linus Torvalds wrote:
> >
> > We still have that sti sysexit in the 32-bit code.
>
> We also have both: "STI; HLT" and "STI; MWAIT" where we rely on the STI
> shadow.
I guess the good news is that in all cases we really only ever protect
against a very unlikely race, and if the race happens it's not
actually fatal.
Yes, if we get an NMI and then an interrupt in between the "sti;hlt" we
might wait for the next interrupt and get a (potentially fairly
horrible) latency issue. I guess that with maximal luck it might be a
one-shot timer and not get re-armed, but it sounds very very very
unlikely.
Googling around, I actually find a patch from Avi Kivity from back in
2010 for this exact issue, apparently because kvm got this case wrong
and somebody hit it. The patch never made it upstream exactly because
kvm could be fixed and people decided that most real hardware didn't
have the issue in the first place.
In the discussion I found, Peter Anvin tried to get confirmation from
AMD engineers about this too, but I don't see any resolution.
Realistically, I don't think you can hit the problem in practice. The
only way to hit that incredibly small race of "one instruction, *both*
NMI and interrupts" is to have a lot of interrupts going all at the
same time, but that will also then solve the latency problem, so the
very act of triggering it will also fix it.
I don't see any case where it's really bad. The "sti sysexit" race is
similar, just about latency of user space signal reporting (and
perhaps any pending TIF_WORK_xyz flags).
So maybe we don't care deeply about the sti shadow. It's a potential
latency problem when broken, but not a huge issue. And for the
instruction rewriting hack, moving to "push+sti+ret" also makes a lost
sti shadow just a "possibly odd stack frame visibility" issue rather
than anything deeply fatal.
We can probably just write it off as "some old CPUs (and a smattering
of very rare and not relevant new ones) have potential but unlikely
latency issues because of a historical CPU mis-design - don't do perf
on them".
Linus
On Tue, Apr 30, 2019 at 9:06 AM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, Apr 30, 2019 at 6:56 AM Peter Zijlstra <[email protected]> wrote:
> >
>
> Realistically, I don't think you can hit the problem in practice. The
> only way to hit that incredibly small race of "one instruction, *both*
> NMI and interrupts" is to have a lot of interrupts going all at the
> same time, but that will also then solve the latency problem, so the
> very act of triggering it will also fix it.
>
> I don't see any case where it's really bad. The "sti sysexit" race is
> similar, just about latency of user space signal reporting (and
> perhaps any pending TIF_WORK_xyz flags).
In the worst case, it actually kills the machine. Last time I tracked
a bug like this down, I think the issue was that we got preempted
after the last TIF_ check, entered a VM, exited, context switched
back, and switched to user mode without noticing that there was a
pending KVM user return notifier. This left us with bogus CPU state
and the machine exploded.
Linus, can I ask you to reconsider your opposition to Josh's other
approach of just shifting the stack on int3 entry? I agree that it's
ugly, but the ugliness is easily manageable and fairly self-contained.
We add a little bit of complication to the entry asm (but it's not
like it's unprecedented -- the entry asm does all kinds of stack
rearrangement due to IST and PTI crap already), and we add an
int3_emulate_call(struct pt_regs *regs, unsigned long target) helper
that has appropriate assertions that the stack is okay and emulates
the call. And that's it.
In contrast, your approach involves multiple asm trampolines, hash
tables, batching complications, and sti shadows.
As an additional argument, with the stack-shifting approach, it runs
on *every int3 from kernel mode*. This means that we can do something
like this:
static bool int3_emulate_call_okay(struct pt_regs *regs)
{
	/*
	 * Space between the interrupted stack pointer and the end of
	 * pt_regs; the (regs + 1) operand is an assumption, the
	 * original mail text was garbled here.
	 */
	unsigned long available_stack = regs->sp - (unsigned long)(regs + 1);

	return available_stack >= sizeof(long);
}

void do_int3(...)
{
	WARN_ON_ONCE(!user_mode(regs) && !int3_emulate_call_okay(regs));
	...;
}

static void int3_emulate_call(struct pt_regs *regs, unsigned long target)
{
	BUG_ON(user_mode(regs) || !int3_emulate_call_okay(regs));
	/* shift the interrupted stack down and push onto it */
	regs->sp -= sizeof(unsigned long);
	*(unsigned long *)regs->sp = target;
	/* CET SHSTK fixup goes here */
}
Obviously the CET SHSTK fixup might be rather nasty, but I suspect
it's a solvable problem.
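To make that concrete, here is a slightly fleshed-out sketch of the
emulation and how the ftrace int3 path might use it (a sketch of mine,
not a patch: the instruction-size constants are defined locally and
using ftrace_regs_caller as the target is an assumption):

#include <linux/ftrace.h>
#include <asm/ptrace.h>

#define MY_INT3_INSN_SIZE	1	/* local definitions for this sketch */
#define MY_CALL_INSN_SIZE	5

extern void ftrace_regs_caller(void);

static void int3_emulate_call_to(struct pt_regs *regs, unsigned long func)
{
	/* return address: just past the 5-byte call being replaced */
	unsigned long ret = regs->ip - MY_INT3_INSN_SIZE + MY_CALL_INSN_SIZE;

	/* this push is what consumes the gap created by the entry asm */
	regs->sp -= sizeof(unsigned long);
	*(unsigned long *)regs->sp = ret;
	regs->ip = func;		/* "jump" to the called function */
}

static int ftrace_int3_emulate(struct pt_regs *regs)
{
	unsigned long ip = regs->ip - MY_INT3_INSN_SIZE;

	if (!ftrace_location(ip))
		return 0;

	int3_emulate_call_to(regs, (unsigned long)ftrace_regs_caller);
	return 1;
}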
A major benefit of this is that the entry asm nastiness will get
exercised all the time, and, if we screw it up, the warning will fire.
This is the basic principle behind why the entry stuff *works* these
days. I've put a lot of effort into making sure that running kernels
with CONFIG_DEBUG_ENTRY and running the selftests actually exercises
the nasty cases.
--Andy
On Tue, 30 Apr 2019 09:33:51 -0700
Andy Lutomirski <[email protected]> wrote:
> Linus, can I ask you to reconsider your opposition to Josh's other
> approach of just shifting the stack on int3 entry? I agree that it's
> ugly, but the ugliness is easily manageable and fairly self-contained.
> We add a little bit of complication to the entry asm (but it's not
> like it's unprecedented -- the entry asm does all kinds of stack
> rearrangement due to IST and PTI crap already), and we add an
> int3_emulate_call(struct pt_regs *regs, unsigned long target) helper
> that has appropriate assertions that the stack is okay and emulates
> the call. And that's it.
I also prefer Josh's stack shift solution, as I personally believe
that's a cleaner solution. But I went ahead and implemented Linus's
version to get it working for ftrace. Here's the code, and it survived
some preliminary tests.
There are three places that use the update code. One is the start of
every function call (yes, I counted that as one, and that case is
determined by ftrace_location(ip)). The second is the update of the
call within the trampoline itself. That could also be converted to a
text poke, but for now it's here as it was written before text poke
existed. The third place is actually a jump (to the function graph
code). But that one can be safely skipped if we are converting it, as
it only goes from jump to nop, or nop to jump.
The trampolines reflect this. Also, as NMI code is traced by ftrace, I
had to duplicate the trampolines for the NMI case (but only for the
interrupts-disabled case, as NMIs don't have interrupts enabled).
-- Steve
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index ef49517f6bb2..bf320bf791dd 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -17,6 +17,7 @@
#include <linux/uaccess.h>
#include <linux/ftrace.h>
#include <linux/percpu.h>
+#include <linux/frame.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/init.h>
@@ -232,6 +233,9 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
static unsigned long ftrace_update_func;
+/* Used within inline asm below */
+unsigned long ftrace_update_func_call;
+
static int update_ftrace_func(unsigned long ip, void *new)
{
unsigned char old[MCOUNT_INSN_SIZE];
@@ -259,6 +263,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
unsigned char *new;
int ret;
+ ftrace_update_func_call = (unsigned long)func;
+
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -280,6 +286,70 @@ static nokprobe_inline int is_ftrace_caller(unsigned long ip)
return 0;
}
+extern asmlinkage void ftrace_emulate_call_irqon(void);
+extern asmlinkage void ftrace_emulate_call_irqoff(void);
+extern asmlinkage void ftrace_emulate_call_nmi(void);
+extern asmlinkage void ftrace_emulate_call_update_irqoff(void);
+extern asmlinkage void ftrace_emulate_call_update_irqon(void);
+extern asmlinkage void ftrace_emulate_call_update_nmi(void);
+
+static DEFINE_PER_CPU(void *, ftrace_bp_call_return);
+static DEFINE_PER_CPU(void *, ftrace_bp_call_nmi_return);
+
+asm(
+ ".text\n"
+ ".global ftrace_emulate_call_irqoff\n"
+ ".type ftrace_emulate_call_irqoff, @function\n"
+ "ftrace_emulate_call_irqoff:\n\t"
+ "push %gs:ftrace_bp_call_return\n\t"
+ "sti\n\t"
+ "jmp ftrace_caller\n"
+ ".size ftrace_emulate_call_irqoff, .-ftrace_emulate_call_irqoff\n"
+
+ ".global ftrace_emulate_call_irqon\n"
+ ".type ftrace_emulate_call_irqon, @function\n"
+ "ftrace_emulate_call_irqon:\n\t"
+ "push %gs:ftrace_bp_call_return\n\t"
+ "jmp ftrace_caller\n"
+ ".size ftrace_emulate_call_irqon, .-ftrace_emulate_call_irqon\n"
+
+ ".global ftrace_emulate_call_nmi\n"
+ ".type ftrace_emulate_call_nmi, @function\n"
+ "ftrace_emulate_call_nmi:\n\t"
+ "push %gs:ftrace_bp_call_nmi_return\n\t"
+ "jmp ftrace_caller\n"
+ ".size ftrace_emulate_call_nmi, .-ftrace_emulate_call_nmi\n"
+
+ ".global ftrace_emulate_call_update_irqoff\n"
+ ".type ftrace_emulate_call_update_irqoff, @function\n"
+ "ftrace_emulate_call_update_irqoff:\n\t"
+ "push %gs:ftrace_bp_call_return\n\t"
+ "sti\n\t"
+ "jmp *ftrace_update_func_call\n"
+ ".size ftrace_emulate_call_update_irqoff, .-ftrace_emulate_call_update_irqoff\n"
+
+ ".global ftrace_emulate_call_update_irqon\n"
+ ".type ftrace_emulate_call_update_irqon, @function\n"
+ "ftrace_emulate_call_update_irqon:\n\t"
+ "push %gs:ftrace_bp_call_return\n\t"
+ "jmp *ftrace_update_func_call\n"
+ ".size ftrace_emulate_call_update_irqon, .-ftrace_emulate_call_update_irqon\n"
+
+ ".global ftrace_emulate_call_update_nmi\n"
+ ".type ftrace_emulate_call_update_nmi, @function\n"
+ "ftrace_emulate_call_update_nmi:\n\t"
+ "push %gs:ftrace_bp_call_nmi_return\n\t"
+ "jmp *ftrace_update_func_call\n"
+ ".size ftrace_emulate_call_update_nmi, .-ftrace_emulate_call_update_nmi\n"
+ ".previous\n");
+
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_irqoff);
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_irqon);
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_nmi);
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_irqoff);
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_irqon);
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_nmi);
+
/*
* A breakpoint was added to the code address we are about to
* modify, and this is the handle that will just skip over it.
@@ -295,10 +365,40 @@ int ftrace_int3_handler(struct pt_regs *regs)
return 0;
ip = regs->ip - 1;
- if (!ftrace_location(ip) && !is_ftrace_caller(ip))
+ if (ftrace_location(ip)) {
+ if (in_nmi()) {
+ this_cpu_write(ftrace_bp_call_nmi_return, (void *)ip + MCOUNT_INSN_SIZE);
+ regs->ip = (unsigned long) ftrace_emulate_call_nmi;
+ return 1;
+ }
+ this_cpu_write(ftrace_bp_call_return, (void *)ip + MCOUNT_INSN_SIZE);
+ if (regs->flags & X86_EFLAGS_IF) {
+ regs->flags &= ~X86_EFLAGS_IF;
+ regs->ip = (unsigned long) ftrace_emulate_call_irqoff;
+ } else {
+ regs->ip = (unsigned long) ftrace_emulate_call_irqon;
+ }
+ } else if (is_ftrace_caller(ip)) {
+ /* If it's a jump, just need to skip it */
+ if (!ftrace_update_func_call) {
+ regs->ip += MCOUNT_INSN_SIZE -1;
+ return 1;
+ }
+ if (in_nmi()) {
+ this_cpu_write(ftrace_bp_call_nmi_return, (void *)ip + MCOUNT_INSN_SIZE);
+ regs->ip = (unsigned long) ftrace_emulate_call_update_nmi;
+ return 1;
+ }
+ this_cpu_write(ftrace_bp_call_return, (void *)ip + MCOUNT_INSN_SIZE);
+ if (regs->flags & X86_EFLAGS_IF) {
+ regs->flags &= ~X86_EFLAGS_IF;
+ regs->ip = (unsigned long) ftrace_emulate_call_update_irqoff;
+ } else {
+ regs->ip = (unsigned long) ftrace_emulate_call_update_irqon;
+ }
+ } else {
return 0;
-
- regs->ip += MCOUNT_INSN_SIZE - 1;
+ }
return 1;
}
@@ -859,6 +959,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
func = ftrace_ops_get_func(ops);
+ ftrace_update_func_call = (unsigned long)func;
+
/* Do a safe modify in case the trampoline is executing */
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -960,6 +1062,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func)
{
unsigned char *new;
+ ftrace_update_func_call = 0;
new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new);
On Tue, 30 Apr 2019 13:03:59 -0400
Steven Rostedt <[email protected]> wrote:
> I also prefer Josh's stack shift solution, as I personally believe
> that's a cleaner solution. But I went ahead and implemented Linus's
> version to get it working for ftrace. Here's the code, and it survived
> some preliminary tests.
Well, it passed the second level of tests.
If this is the way we are going, I could add comments to the code,
apply it to my queue, run it through the rest of my test suite, and
make it ready for the merge window. I may add a stable tag so it can go
back to where live kernel patching was added, as it fixes a potential
bug there.
-- Steve
From: "Steven Rostedt (VMware)" <[email protected]>
Nicolai Stange discovered[1] that if live kernel patching is enabled, and the
function tracer started tracing the same function that was patched, the
conversion of the fentry call site from calling the live kernel patch
trampoline to calling the iterator trampoline would have a slight window
where it didn't call anything.
As live kernel patching depends on ftrace always calling its code (to
prevent the function being traced from being called, as it will redirect
it), this small window would allow the old buggy function to be called, and
this can cause undesirable results.
Nicolai submitted new patches[2] but these were controversial. As this is
similar to the static call emulation issues that came up a while ago[3],
Linus suggested using per CPU data along with special trampolines[4] to emulate
the calls.
Linus's solution was for text poke (which was mostly what the static_call
code did), but as ftrace has its own mechanism, it required doing its own
thing.
Having ftrace use its own per CPU data and having its own set of specialized
trampolines solves the issue of missed calls that live kernel patching
suffers.
[1] http://lkml.kernel.org/r/[email protected]
[2] http://lkml.kernel.org/r/[email protected]
[3] http://lkml.kernel.org/r/3cf04e113d71c9f8e4be95fb84a510f085aa4afa.1541711457.git.jpoimboe@redhat.com
[4] http://lkml.kernel.org/r/CAHk-=wh5OpheSU8Em_Q3Hg8qw_JtoijxOdPtHru6d+5K8TWM=A@mail.gmail.com
Inspired-by: Linus Torvalds <[email protected]>
Cc: [email protected]
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
---
arch/x86/kernel/ftrace.c | 146 ++++++++++++++++++++++++++++++++++++++-
1 file changed, 143 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index ef49517f6bb2..835277043348 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -17,6 +17,7 @@
#include <linux/uaccess.h>
#include <linux/ftrace.h>
#include <linux/percpu.h>
+#include <linux/frame.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/init.h>
@@ -232,6 +233,9 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
static unsigned long ftrace_update_func;
+/* Used within inline asm below */
+unsigned long ftrace_update_func_call;
+
static int update_ftrace_func(unsigned long ip, void *new)
{
unsigned char old[MCOUNT_INSN_SIZE];
@@ -259,6 +263,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
unsigned char *new;
int ret;
+ ftrace_update_func_call = (unsigned long)func;
+
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -280,6 +286,103 @@ static nokprobe_inline int is_ftrace_caller(unsigned long ip)
return 0;
}
+/*
+ * We need to handle the "call func1" -> "call func2" case.
+ * Just skipping the call is not sufficient as it will be like
+ * changing to "nop" first and then updating the call. But some
+ * users of ftrace require calls never to be missed.
+ *
+ * To emulate the call while converting the call site with a breakpoint,
+ * some trampolines are used along with per CPU buffers.
+ * There are three trampolines for the call sites and three trampolines
+ * for the updating of the call in ftrace trampoline. The three
+ * trampolines are:
+ *
+ * 1) Interrupts are enabled when the breakpoint is hit
+ * 2) Interrupts are disabled when the breakpoint is hit
+ * 3) The breakpoint was hit in an NMI
+ *
+ * As per CPU data is used, interrupts must be disabled to prevent them
+ * from corrupting the data. A separate NMI trampoline is used for the
+ * NMI case. If interrupts are already disabled, then the return path
+ * of where the breakpoint was hit (saved in the per CPU data) is pushed
+ * on the stack and then a jump to either the ftrace_caller (which will
+ * loop through all registered ftrace_ops handlers depending on the ip
+ * address), or if its a ftrace trampoline call update, it will call
+ * ftrace_update_func_call which will hold the call that should be
+ * called.
+ */
+extern asmlinkage void ftrace_emulate_call_irqon(void);
+extern asmlinkage void ftrace_emulate_call_irqoff(void);
+extern asmlinkage void ftrace_emulate_call_nmi(void);
+extern asmlinkage void ftrace_emulate_call_update_irqoff(void);
+extern asmlinkage void ftrace_emulate_call_update_irqon(void);
+extern asmlinkage void ftrace_emulate_call_update_nmi(void);
+
+static DEFINE_PER_CPU(void *, ftrace_bp_call_return);
+static DEFINE_PER_CPU(void *, ftrace_bp_call_nmi_return);
+
+asm(
+ ".text\n"
+
+ /* Trampoline for function update with interrupts enabled */
+ ".global ftrace_emulate_call_irqoff\n"
+ ".type ftrace_emulate_call_irqoff, @function\n"
+ "ftrace_emulate_call_irqoff:\n\t"
+ "push %gs:ftrace_bp_call_return\n\t"
+ "sti\n\t"
+ "jmp ftrace_caller\n"
+ ".size ftrace_emulate_call_irqoff, .-ftrace_emulate_call_irqoff\n"
+
+ /* Trampoline for function update with interrupts disabled*/
+ ".global ftrace_emulate_call_irqon\n"
+ ".type ftrace_emulate_call_irqon, @function\n"
+ "ftrace_emulate_call_irqon:\n\t"
+ "push %gs:ftrace_bp_call_return\n\t"
+ "jmp ftrace_caller\n"
+ ".size ftrace_emulate_call_irqon, .-ftrace_emulate_call_irqon\n"
+
+ /* Trampoline for function update in an NMI */
+ ".global ftrace_emulate_call_nmi\n"
+ ".type ftrace_emulate_call_nmi, @function\n"
+ "ftrace_emulate_call_nmi:\n\t"
+ "push %gs:ftrace_bp_call_nmi_return\n\t"
+ "jmp ftrace_caller\n"
+ ".size ftrace_emulate_call_nmi, .-ftrace_emulate_call_nmi\n"
+
+ /* Trampoline for ftrace trampoline call update with interrupts enabled */
+ ".global ftrace_emulate_call_update_irqoff\n"
+ ".type ftrace_emulate_call_update_irqoff, @function\n"
+ "ftrace_emulate_call_update_irqoff:\n\t"
+ "push %gs:ftrace_bp_call_return\n\t"
+ "sti\n\t"
+ "jmp *ftrace_update_func_call\n"
+ ".size ftrace_emulate_call_update_irqoff, .-ftrace_emulate_call_update_irqoff\n"
+
+ /* Trampoline for ftrace trampoline call update with interrupts disabled */
+ ".global ftrace_emulate_call_update_irqon\n"
+ ".type ftrace_emulate_call_update_irqon, @function\n"
+ "ftrace_emulate_call_update_irqon:\n\t"
+ "push %gs:ftrace_bp_call_return\n\t"
+ "jmp *ftrace_update_func_call\n"
+ ".size ftrace_emulate_call_update_irqon, .-ftrace_emulate_call_update_irqon\n"
+
+ /* Trampoline for ftrace trampoline call update in an NMI */
+ ".global ftrace_emulate_call_update_nmi\n"
+ ".type ftrace_emulate_call_update_nmi, @function\n"
+ "ftrace_emulate_call_update_nmi:\n\t"
+ "push %gs:ftrace_bp_call_nmi_return\n\t"
+ "jmp *ftrace_update_func_call\n"
+ ".size ftrace_emulate_call_update_nmi, .-ftrace_emulate_call_update_nmi\n"
+ ".previous\n");
+
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_irqoff);
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_irqon);
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_nmi);
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_irqoff);
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_irqon);
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_nmi);
+
/*
* A breakpoint was added to the code address we are about to
* modify, and this is the handle that will just skip over it.
@@ -295,10 +398,44 @@ int ftrace_int3_handler(struct pt_regs *regs)
return 0;
ip = regs->ip - 1;
- if (!ftrace_location(ip) && !is_ftrace_caller(ip))
+ if (ftrace_location(ip)) {
+ /* A breakpoint at the beginning of the function was hit */
+ if (in_nmi()) {
+ /* NMIs have their own trampoline */
+ this_cpu_write(ftrace_bp_call_nmi_return, (void *)ip + MCOUNT_INSN_SIZE);
+ regs->ip = (unsigned long) ftrace_emulate_call_nmi;
+ return 1;
+ }
+ this_cpu_write(ftrace_bp_call_return, (void *)ip + MCOUNT_INSN_SIZE);
+ if (regs->flags & X86_EFLAGS_IF) {
+ regs->flags &= ~X86_EFLAGS_IF;
+ regs->ip = (unsigned long) ftrace_emulate_call_irqoff;
+ } else {
+ regs->ip = (unsigned long) ftrace_emulate_call_irqon;
+ }
+ } else if (is_ftrace_caller(ip)) {
+ /* An ftrace trampoline is being updated */
+ if (!ftrace_update_func_call) {
+ /* If it's a jump, just need to skip it */
+ regs->ip += MCOUNT_INSN_SIZE -1;
+ return 1;
+ }
+ if (in_nmi()) {
+ /* NMIs have their own trampoline */
+ this_cpu_write(ftrace_bp_call_nmi_return, (void *)ip + MCOUNT_INSN_SIZE);
+ regs->ip = (unsigned long) ftrace_emulate_call_update_nmi;
+ return 1;
+ }
+ this_cpu_write(ftrace_bp_call_return, (void *)ip + MCOUNT_INSN_SIZE);
+ if (regs->flags & X86_EFLAGS_IF) {
+ regs->flags &= ~X86_EFLAGS_IF;
+ regs->ip = (unsigned long) ftrace_emulate_call_update_irqoff;
+ } else {
+ regs->ip = (unsigned long) ftrace_emulate_call_update_irqon;
+ }
+ } else {
return 0;
-
- regs->ip += MCOUNT_INSN_SIZE - 1;
+ }
return 1;
}
@@ -859,6 +996,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
func = ftrace_ops_get_func(ops);
+ ftrace_update_func_call = (unsigned long)func;
+
/* Do a safe modify in case the trampoline is executing */
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -960,6 +1099,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func)
{
unsigned char *new;
+ ftrace_update_func_call = 0;
new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new);
--
2.20.1
On Tue, Apr 30, 2019 at 10:49 AM Steven Rostedt <[email protected]> wrote:
>
> +
> +asm(
> + ".text\n"
> +
> + /* Trampoline for function update with interrupts enabled */
> + ".global ftrace_emulate_call_irqoff\n"
> + ".type ftrace_emulate_call_irqoff, @function\n"
> + "ftrace_emulate_call_irqoff:\n\t"
> + "push %gs:ftrace_bp_call_return\n\t"
Well, as mentioned in my original suggestion, this won't work on
32-bit, or on UP. They have different models for per-cpu data (32-bit
uses %fs, and UP doesn't use a segment override at all).
Maybe we just don't care about UP at all for this code, of course.
And maybe we can make the decision to also make 32-bit just not use
this either - so maybe the code is ok per se, just needs to make sure
it never triggers for the cases that it's not written for..
> + "ftrace_emulate_call_update_irqoff:\n\t"
> + "push %gs:ftrace_bp_call_return\n\t"
> + "sti\n\t"
> + "jmp *ftrace_update_func_call\n"
.. and this should then use the "push push sti ret" model instead.
Plus get updated for objtool complaints.
Anyway, since Andy really likes the entry code change, can we have
that patch in parallel and judge the difference that way? Iirc, that
was x86-64 specific too.
Linus
On Tue, 30 Apr 2019 11:33:21 -0700
Linus Torvalds <[email protected]> wrote:
> On Tue, Apr 30, 2019 at 10:49 AM Steven Rostedt <[email protected]> wrote:
> >
> > +
> > +asm(
> > + ".text\n"
> > +
> > + /* Trampoline for function update with interrupts enabled */
> > + ".global ftrace_emulate_call_irqoff\n"
> > + ".type ftrace_emulate_call_irqoff, @function\n"
> > + "ftrace_emulate_call_irqoff:\n\t"
> > + "push %gs:ftrace_bp_call_return\n\t"
>
> Well, as mentioned in my original suggestion, this won't work on
> 32-bit, or on UP. They have different models for per-cpu data (32-bit
> uses %fs, and UP doesn't use a segment override at all).
Ah, yeah, I forgot about 32-bit. I could easily make this use fs as
well, and for UP, just use a static variable.
>
> Maybe we just don't care about UP at all for this code, of course.
>
> And maybe we can make the decision to also make 32-bit just not use
> this either - so maybe the code is ok per se, just needs to make sure
> it never triggers for the cases that it's not written for..
>
> > + "ftrace_emulate_call_update_irqoff:\n\t"
> > + "push %gs:ftrace_bp_call_return\n\t"
> > + "sti\n\t"
> > + "jmp *ftrace_update_func_call\n"
>
> .. and this should then use the "push push sti ret" model instead.
>
> Plus get updated for objtool complaints.
Yeah, I see that now. Somehow it disappeared when I looked for it after
making some other changes. I can update it.
>
> Anyway, since Andy really likes the entry code change, can we have
> that patch in parallel and judge the difference that way? Iirc, that
> was x86-64 specific too.
Note, I don't think live kernel patching supports 32 bit anyway, so
that may not be an issue.
Josh,
When you come back to the office, can you look into that method?
-- Steve
On Tue, 30 Apr 2019 11:33:21 -0700
Linus Torvalds <[email protected]> wrote:
> > + "ftrace_emulate_call_update_irqoff:\n\t"
> > + "push %gs:ftrace_bp_call_return\n\t"
> > + "sti\n\t"
> > + "jmp *ftrace_update_func_call\n"
>
> .. and this should then use the "push push sti ret" model instead.
>
> Plus get updated for objtool complaints.
And unfortunately, this blows up on lockdep. Lockdep notices that the
return from the breakpoint handler has interrupts enabled, and will not
enable them in its shadow irqs-disabled variable. But then we enable
them in the trampoline without telling lockdep, and we trigger
something like this:
------------[ cut here ]------------
IRQs not enabled as expected
WARNING: CPU: 2 PID: 0 at kernel/time/tick-sched.c:979 tick_nohz_idle_enter+0x44/0x8c
Modules linked in:
CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.1.0-rc3-test+ #123
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
EIP: tick_nohz_idle_enter+0x44/0x8c
Code: f0 05 00 00 00 75 26 83 b8 c4 05 00 00 00 75 1d 80 3d 5f 0f 43 c1 00 75 14 68 72 74 16 c1 c6 05 5f 0f 43 c1 01 e8 33 d7 f8 ff <0f> 0b 58 fa e8 4e 2c 04 00 bb e0 36 6b c1 64 03 1d 28 81 56 c1 8b
EAX: 0000001c EBX: ee769f84 ECX: 00000000 EDX: 00000006
ESI: 00000000 EDI: 00000002 EBP: ee769f50 ESP: ee769f48
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210292
CR0: 80050033 CR2: 00000000 CR3: 016c4000 CR4: 001406f0
Call Trace:
do_idle+0x2a/0x1fc
cpu_startup_entry+0x1e/0x20
start_secondary+0x1d3/0x1ec
startup_32_smp+0x164/0x168
I have to fool lockdep with the following:
if (regs->flags & X86_EFLAGS_IF) {
regs->flags &= ~X86_EFLAGS_IF;
regs->ip = (unsigned long) ftrace_emulate_call_irqoff;
/* Tell lockdep here we are enabling interrupts */
trace_hardirqs_on();
} else {
regs->ip = (unsigned long) ftrace_emulate_call_irqon;
}
-- Steve
From: "Steven Rostedt (VMware)" <[email protected]>
Nicolai Stange discovered[1] that if live kernel patching is enabled, and the
function tracer started tracing the same function that was patched, the
conversion of the fentry call site from calling the live kernel patch
trampoline to calling the iterator trampoline would have a slight window
where it didn't call anything.
As live kernel patching depends on ftrace always calling its code (to
prevent the function being traced from being called, as it will redirect
it), this small window would allow the old buggy function to be called, and
this can cause undesirable results.
Nicolai submitted new patches[2] but these were controversial. As this is
similar to the static call emulation issues that came up a while ago[3],
Linus suggested using per CPU data along with special trampolines[4] to emulate
the calls.
Linus's solution was for text poke (which was mostly what the static_call
code did), but as ftrace has its own mechanism, it required doing its own
thing.
Having ftrace use its own per CPU data and having its own set of specialized
trampolines solves the issue of missed calls that live kernel patching
suffers.
[1] http://lkml.kernel.org/r/[email protected]
[2] http://lkml.kernel.org/r/[email protected]
[3] http://lkml.kernel.org/r/3cf04e113d71c9f8e4be95fb84a510f085aa4afa.1541711457.git.jpoimboe@redhat.com
[4] http://lkml.kernel.org/r/CAHk-=wh5OpheSU8Em_Q3Hg8qw_JtoijxOdPtHru6d+5K8TWM=A@mail.gmail.com
Inspired-by: Linus Torvalds <[email protected]>
Cc: [email protected]
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
---
Changes since v1:
- Use "push push ret" instead of indirect jumps (Linus)
- Handle 32 bit as well as non SMP
- Fool lockdep into thinking interrupts are enabled
arch/x86/kernel/ftrace.c | 175 +++++++++++++++++++++++++++++++++++++--
1 file changed, 170 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index ef49517f6bb2..9160f5cc3b6d 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -17,6 +17,7 @@
#include <linux/uaccess.h>
#include <linux/ftrace.h>
#include <linux/percpu.h>
+#include <linux/frame.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/init.h>
@@ -232,6 +233,9 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
static unsigned long ftrace_update_func;
+/* Used within inline asm below */
+unsigned long ftrace_update_func_call;
+
static int update_ftrace_func(unsigned long ip, void *new)
{
unsigned char old[MCOUNT_INSN_SIZE];
@@ -259,6 +263,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
unsigned char *new;
int ret;
+ ftrace_update_func_call = (unsigned long)func;
+
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -280,6 +286,125 @@ static nokprobe_inline int is_ftrace_caller(unsigned long ip)
return 0;
}
+/*
+ * We need to handle the "call func1" -> "call func2" case.
+ * Just skipping the call is not sufficient as it will be like
+ * changing to "nop" first and then updating the call. But some
+ * users of ftrace require calls never to be missed.
+ *
+ * To emulate the call while converting the call site with a breakpoint,
+ * some trampolines are used along with per CPU buffers.
+ * There are three trampolines for the call sites and three trampolines
+ * for the updating of the call in ftrace trampoline. The three
+ * trampolines are:
+ *
+ * 1) Interrupts are enabled when the breakpoint is hit
+ * 2) Interrupts are disabled when the breakpoint is hit
+ * 3) The breakpoint was hit in an NMI
+ *
+ * As per CPU data is used, interrupts must be disabled to prevent them
+ * from corrupting the data. A separate NMI trampoline is used for the
+ * NMI case. If interrupts are already disabled, then the return path
+ * of where the breakpoint was hit (saved in the per CPU data) is pushed
+ * on the stack and then a jump to either the ftrace_caller (which will
+ * loop through all registered ftrace_ops handlers depending on the ip
+ * address), or if its a ftrace trampoline call update, it will call
+ * ftrace_update_func_call which will hold the call that should be
+ * called.
+ */
+extern asmlinkage void ftrace_emulate_call_irqon(void);
+extern asmlinkage void ftrace_emulate_call_irqoff(void);
+extern asmlinkage void ftrace_emulate_call_nmi(void);
+extern asmlinkage void ftrace_emulate_call_update_irqoff(void);
+extern asmlinkage void ftrace_emulate_call_update_irqon(void);
+extern asmlinkage void ftrace_emulate_call_update_nmi(void);
+
+static DEFINE_PER_CPU(void *, ftrace_bp_call_return);
+static DEFINE_PER_CPU(void *, ftrace_bp_call_nmi_return);
+
+#ifdef CONFIG_SMP
+#ifdef CONFIG_X86_64
+# define BP_CALL_RETURN "%gs:ftrace_bp_call_return"
+# define BP_CALL_NMI_RETURN "%gs:ftrace_bp_call_nmi_return"
+#else
+# define BP_CALL_RETURN "%fs:ftrace_bp_call_return"
+# define BP_CALL_NMI_RETURN "%fs:ftrace_bp_call_nmi_return"
+#endif
+#else /* SMP */
+# define BP_CALL_RETURN "ftrace_bp_call_return"
+# define BP_CALL_NMI_RETURN "ftrace_bp_call_nmi_return"
+#endif
+
+/* To hold the ftrace_caller address to push on the stack */
+void *ftrace_caller_func = (void *)ftrace_caller;
+
+asm(
+ ".text\n"
+
+ /* Trampoline for function update with interrupts enabled */
+ ".global ftrace_emulate_call_irqoff\n"
+ ".type ftrace_emulate_call_irqoff, @function\n"
+ "ftrace_emulate_call_irqoff:\n\t"
+ "push "BP_CALL_RETURN"\n\t"
+ "push ftrace_caller_func\n"
+ "sti\n\t"
+ "ret\n\t"
+ ".size ftrace_emulate_call_irqoff, .-ftrace_emulate_call_irqoff\n"
+
+ /* Trampoline for function update with interrupts disabled*/
+ ".global ftrace_emulate_call_irqon\n"
+ ".type ftrace_emulate_call_irqon, @function\n"
+ "ftrace_emulate_call_irqon:\n\t"
+ "push "BP_CALL_RETURN"\n\t"
+ "push ftrace_caller_func\n"
+ "ret\n\t"
+ ".size ftrace_emulate_call_irqon, .-ftrace_emulate_call_irqon\n"
+
+ /* Trampoline for function update in an NMI */
+ ".global ftrace_emulate_call_nmi\n"
+ ".type ftrace_emulate_call_nmi, @function\n"
+ "ftrace_emulate_call_nmi:\n\t"
+ "push "BP_CALL_NMI_RETURN"\n\t"
+ "push ftrace_caller_func\n"
+ "ret\n\t"
+ ".size ftrace_emulate_call_nmi, .-ftrace_emulate_call_nmi\n"
+
+ /* Trampoline for ftrace trampoline call update with interrupts enabled */
+ ".global ftrace_emulate_call_update_irqoff\n"
+ ".type ftrace_emulate_call_update_irqoff, @function\n"
+ "ftrace_emulate_call_update_irqoff:\n\t"
+ "push "BP_CALL_RETURN"\n\t"
+ "push ftrace_update_func_call\n"
+ "sti\n\t"
+ "ret\n\t"
+ ".size ftrace_emulate_call_update_irqoff, .-ftrace_emulate_call_update_irqoff\n"
+
+ /* Trampoline for ftrace trampoline call update with interrupts disabled */
+ ".global ftrace_emulate_call_update_irqon\n"
+ ".type ftrace_emulate_call_update_irqon, @function\n"
+ "ftrace_emulate_call_update_irqon:\n\t"
+ "push "BP_CALL_RETURN"\n\t"
+ "push ftrace_update_func_call\n"
+ "ret\n\t"
+ ".size ftrace_emulate_call_update_irqon, .-ftrace_emulate_call_update_irqon\n"
+
+ /* Trampoline for ftrace trampoline call update in an NMI */
+ ".global ftrace_emulate_call_update_nmi\n"
+ ".type ftrace_emulate_call_update_nmi, @function\n"
+ "ftrace_emulate_call_update_nmi:\n\t"
+ "push "BP_CALL_NMI_RETURN"\n\t"
+ "push ftrace_update_func_call\n"
+ "ret\n\t"
+ ".size ftrace_emulate_call_update_nmi, .-ftrace_emulate_call_update_nmi\n"
+ ".previous\n");
+
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_irqoff);
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_irqon);
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_nmi);
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_irqoff);
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_irqon);
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_nmi);
+
/*
* A breakpoint was added to the code address we are about to
* modify, and this is the handle that will just skip over it.
@@ -295,12 +420,49 @@ int ftrace_int3_handler(struct pt_regs *regs)
return 0;
ip = regs->ip - 1;
- if (!ftrace_location(ip) && !is_ftrace_caller(ip))
- return 0;
-
- regs->ip += MCOUNT_INSN_SIZE - 1;
+ if (ftrace_location(ip)) {
+ /* A breakpoint at the beginning of the function was hit */
+ if (in_nmi()) {
+ /* NMIs have their own trampoline */
+ this_cpu_write(ftrace_bp_call_nmi_return, (void *)ip + MCOUNT_INSN_SIZE);
+ regs->ip = (unsigned long) ftrace_emulate_call_nmi;
+ return 1;
+ }
+ this_cpu_write(ftrace_bp_call_return, (void *)ip + MCOUNT_INSN_SIZE);
+ if (regs->flags & X86_EFLAGS_IF) {
+ regs->flags &= ~X86_EFLAGS_IF;
+ regs->ip = (unsigned long) ftrace_emulate_call_irqoff;
+ /* Tell lockdep here we are enabling interrupts */
+ trace_hardirqs_on();
+ } else {
+ regs->ip = (unsigned long) ftrace_emulate_call_irqon;
+ }
+ return 1;
+ } else if (is_ftrace_caller(ip)) {
+ /* An ftrace trampoline is being updated */
+ if (!ftrace_update_func_call) {
+ /* If it's a jump, just need to skip it */
+ regs->ip += MCOUNT_INSN_SIZE -1;
+ return 1;
+ }
+ if (in_nmi()) {
+ /* NMIs have their own trampoline */
+ this_cpu_write(ftrace_bp_call_nmi_return, (void *)ip + MCOUNT_INSN_SIZE);
+ regs->ip = (unsigned long) ftrace_emulate_call_update_nmi;
+ return 1;
+ }
+ this_cpu_write(ftrace_bp_call_return, (void *)ip + MCOUNT_INSN_SIZE);
+ if (regs->flags & X86_EFLAGS_IF) {
+ regs->flags &= ~X86_EFLAGS_IF;
+ regs->ip = (unsigned long) ftrace_emulate_call_update_irqoff;
+ trace_hardirqs_on();
+ } else {
+ regs->ip = (unsigned long) ftrace_emulate_call_update_irqon;
+ }
+ return 1;
+ }
- return 1;
+ return 0;
}
NOKPROBE_SYMBOL(ftrace_int3_handler);
@@ -859,6 +1021,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
func = ftrace_ops_get_func(ops);
+ ftrace_update_func_call = (unsigned long)func;
+
/* Do a safe modify in case the trampoline is executing */
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -960,6 +1124,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func)
{
unsigned char *new;
+ ftrace_update_func_call = 0;
new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new);
--
2.20.1
On Tue, 30 Apr 2019 17:53:34 -0400
Steven Rostedt <[email protected]> wrote:
> + if (ftrace_location(ip)) {
> + /* A breakpoint at the beginning of the function was hit */
> + if (in_nmi()) {
> + /* NMIs have their own trampoline */
> + this_cpu_write(ftrace_bp_call_nmi_return, (void *)ip + MCOUNT_INSN_SIZE);
> + regs->ip = (unsigned long) ftrace_emulate_call_nmi;
> + return 1;
> + }
> + this_cpu_write(ftrace_bp_call_return, (void *)ip + MCOUNT_INSN_SIZE);
> + if (regs->flags & X86_EFLAGS_IF) {
> + regs->flags &= ~X86_EFLAGS_IF;
> + regs->ip = (unsigned long) ftrace_emulate_call_irqoff;
> + /* Tell lockdep here we are enabling interrupts */
> + trace_hardirqs_on();
This isn't good enough. The return from interrupt does call lockdep
saying interrupts are disabled. Need to add the lockdep tracking in the
asm as well.
Probably easier to move it from inline asm to ftrace_X.S and use the
lockdep TRACE_ON/OFF macros.
-- Steve
> + } else {
> + regs->ip = (unsigned long) ftrace_emulate_call_irqon;
> + }
> + return 1;
> + } else if (is_ftrace_caller(ip)) {
> + /* An ftrace trampoline is being updated */
> + if (!ftrace_update_func_call) {
> + /* If it's a jump, just need to skip it */
> + regs->ip += MCOUNT_INSN_SIZE -1;
> + return 1;
> + }
> + if (in_nmi()) {
> + /* NMIs have their own trampoline */
> + this_cpu_write(ftrace_bp_call_nmi_return, (void *)ip + MCOUNT_INSN_SIZE);
> + regs->ip = (unsigned long) ftrace_emulate_call_update_nmi;
> + return 1;
> + }
> + this_cpu_write(ftrace_bp_call_return, (void *)ip + MCOUNT_INSN_SIZE);
> + if (regs->flags & X86_EFLAGS_IF) {
> + regs->flags &= ~X86_EFLAGS_IF;
> + regs->ip = (unsigned long) ftrace_emulate_call_update_irqoff;
> + trace_hardirqs_on();
> + } else {
> + regs->ip = (unsigned long) ftrace_emulate_call_update_irqon;
> + }
> + return 1;
> + }
>
> - return 1;
> + return 0;
> }
On Tue, Apr 30, 2019 at 6:35 PM Steven Rostedt <[email protected]> wrote:
>
>
> Probably easier to move it from inline asm to ftrace_X.S and use the
> lockdep TRACE_ON/OFF macros.
Yeah, that should clean up the percpu stuff too since we have helper
macros for it for *.S files anyway.
I only did the asm() in C because it made the "look, something like
this" patch simpler to test (and it made it easy to check the
generated asm file). Not because it was a good idea ;)
Linus
Hi Steve,
many thanks for moving this forward!
Steven Rostedt <[email protected]> writes:
>
> diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> index ef49517f6bb2..9160f5cc3b6d 100644
> --- a/arch/x86/kernel/ftrace.c
> +++ b/arch/x86/kernel/ftrace.c
> @@ -17,6 +17,7 @@
> #include <linux/uaccess.h>
> #include <linux/ftrace.h>
> #include <linux/percpu.h>
> +#include <linux/frame.h>
> #include <linux/sched.h>
> #include <linux/slab.h>
> #include <linux/init.h>
> @@ -232,6 +233,9 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
>
> static unsigned long ftrace_update_func;
>
> +/* Used within inline asm below */
> +unsigned long ftrace_update_func_call;
> +
> static int update_ftrace_func(unsigned long ip, void *new)
> {
> unsigned char old[MCOUNT_INSN_SIZE];
> @@ -259,6 +263,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
> unsigned char *new;
> int ret;
>
> + ftrace_update_func_call = (unsigned long)func;
> +
> new = ftrace_call_replace(ip, (unsigned long)func);
> ret = update_ftrace_func(ip, new);
>
> @@ -280,6 +286,125 @@ static nokprobe_inline int is_ftrace_caller(unsigned long ip)
> return 0;
> }
>
> +/*
> + * We need to handle the "call func1" -> "call func2" case.
> + * Just skipping the call is not sufficient as it will be like
> + * changing to "nop" first and then updating the call. But some
> + * users of ftrace require calls never to be missed.
> + *
> + * To emulate the call while converting the call site with a breakpoint,
> + * some trampolines are used along with per CPU buffers.
> + * There are three trampolines for the call sites and three trampolines
> + * for the updating of the call in ftrace trampoline. The three
> + * trampolines are:
> + *
> + * 1) Interrupts are enabled when the breakpoint is hit
> + * 2) Interrupts are disabled when the breakpoint is hit
> + * 3) The breakpoint was hit in an NMI
> + *
> + * As per CPU data is used, interrupts must be disabled to prevent them
> + * from corrupting the data. A separate NMI trampoline is used for the
> + * NMI case. If interrupts are already disabled, then the return path
> + * of where the breakpoint was hit (saved in the per CPU data) is pushed
> + * on the stack and then a jump to either the ftrace_caller (which will
> + * loop through all registered ftrace_ops handlers depending on the ip
> + * address), or if it's an ftrace trampoline call update, it will call
> + * ftrace_update_func_call which will hold the call that should be
> + * called.
> + */
> +extern asmlinkage void ftrace_emulate_call_irqon(void);
> +extern asmlinkage void ftrace_emulate_call_irqoff(void);
> +extern asmlinkage void ftrace_emulate_call_nmi(void);
> +extern asmlinkage void ftrace_emulate_call_update_irqoff(void);
> +extern asmlinkage void ftrace_emulate_call_update_irqon(void);
> +extern asmlinkage void ftrace_emulate_call_update_nmi(void);
> +
> +static DEFINE_PER_CPU(void *, ftrace_bp_call_return);
> +static DEFINE_PER_CPU(void *, ftrace_bp_call_nmi_return);
Andy mentioned #DB and #MC exceptions here:
https://lkml.kernel.org/r/[email protected]
I think that #DB won't be possible, provided the trampolines below get
tagged as NOKPROBE (do_int3() and ftrace_int3_handler() already have
it).
It's highly theoretic, but tracing do_machine_check() could clobber
ftrace_bp_call_return or ftrace_bp_call_nmi_return?
> +#ifdef CONFIG_SMP
> +#ifdef CONFIG_X86_64
> +# define BP_CALL_RETURN "%gs:ftrace_bp_call_return"
> +# define BP_CALL_NMI_RETURN "%gs:ftrace_bp_call_nmi_return"
> +#else
> +# define BP_CALL_RETURN "%fs:ftrace_bp_call_return"
> +# define BP_CALL_NMI_RETURN "%fs:ftrace_bp_call_nmi_return"
> +#endif
> +#else /* SMP */
> +# define BP_CALL_RETURN "ftrace_bp_call_return"
> +# define BP_CALL_NMI_RETURN "ftrace_bp_call_nmi_return"
> +#endif
> +
> +/* To hold the ftrace_caller address to push on the stack */
> +void *ftrace_caller_func = (void *)ftrace_caller;
The live patching ftrace_ops need ftrace_regs_caller.
> +
> +asm(
> + ".text\n"
> +
> + /* Trampoline for function update with interrupts enabled */
> + ".global ftrace_emulate_call_irqoff\n"
> + ".type ftrace_emulate_call_irqoff, @function\n"
> + "ftrace_emulate_call_irqoff:\n\t"
> + "push "BP_CALL_RETURN"\n\t"
> + "push ftrace_caller_func\n"
> + "sti\n\t"
> + "ret\n\t"
> + ".size ftrace_emulate_call_irqoff, .-ftrace_emulate_call_irqoff\n"
> +
> + /* Trampoline for function update with interrupts disabled*/
> + ".global ftrace_emulate_call_irqon\n"
The naming is perhaps a bit confusing, i.e. "update with interrupts
disabled" vs. "irqon"... How about swapping irqoff<->irqon?
Thanks,
Nicolai
> + ".type ftrace_emulate_call_irqon, @function\n"
> + "ftrace_emulate_call_irqon:\n\t"
> + "push "BP_CALL_RETURN"\n\t"
> + "push ftrace_caller_func\n"
> + "ret\n\t"
> + ".size ftrace_emulate_call_irqon, .-ftrace_emulate_call_irqon\n"
> +
> + /* Trampoline for function update in an NMI */
> + ".global ftrace_emulate_call_nmi\n"
> + ".type ftrace_emulate_call_nmi, @function\n"
> + "ftrace_emulate_call_nmi:\n\t"
> + "push "BP_CALL_NMI_RETURN"\n\t"
> + "push ftrace_caller_func\n"
> + "ret\n\t"
> + ".size ftrace_emulate_call_nmi, .-ftrace_emulate_call_nmi\n"
> +
--
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)
On Tue, Apr 30, 2019 at 11:33:21AM -0700, Linus Torvalds wrote:
> Anyway, since Andy really likes the entry code change, can we have
> that patch in parallel and judge the difference that way? Iirc, that
> was x86-64 specific too.
Here goes, compile tested only...
It obviously needs a self-test, but that shouldn't be too hard to
arrange.
---
arch/x86/entry/entry_32.S | 7 +++++++
arch/x86/entry/entry_64.S | 14 ++++++++++++--
arch/x86/include/asm/text-patching.h | 20 ++++++++++++++++++++
arch/x86/kernel/ftrace.c | 24 +++++++++++++++++++-----
4 files changed, 58 insertions(+), 7 deletions(-)
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 7b23431be5cb..d246302085a3 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1479,6 +1479,13 @@ ENTRY(int3)
ASM_CLAC
pushl $-1 # mark this as an int
+ testl $SEGMENT_RPL_MASK, PT_CS(%esp)
+ jnz .Lfrom_usermode_no_gap
+ .rept 6
+ pushl 5*4(%esp)
+ .endr
+.Lfrom_usermode_no_gap:
+
SAVE_ALL switch_stacks=1
ENCODE_FRAME_POINTER
TRACE_IRQS_OFF
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 20e45d9b4e15..268cd9affe04 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -878,7 +878,7 @@ apicinterrupt IRQ_WORK_VECTOR irq_work_interrupt smp_irq_work_interrupt
* @paranoid == 2 is special: the stub will never switch stacks. This is for
* #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
*/
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0 create_gap=0
ENTRY(\sym)
UNWIND_HINT_IRET_REGS offset=\has_error_code*8
@@ -898,6 +898,16 @@ ENTRY(\sym)
jnz .Lfrom_usermode_switch_stack_\@
.endif
+ .if \create_gap == 1
+ testb $3, CS-ORIG_RAX(%rsp)
+ jnz .Lfrom_usermode_no_gap_\@
+ .rept 6
+ pushq 5*8(%rsp)
+ .endr
+ UNWIND_HINT_IRET_REGS offset=8
+.Lfrom_usermode_no_gap_\@:
+ .endif
+
.if \paranoid
call paranoid_entry
.else
@@ -1129,7 +1139,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \
#endif /* CONFIG_HYPERV */
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=IST_INDEX_DB ist_offset=DB_STACK_OFFSET
-idtentry int3 do_int3 has_error_code=0
+idtentry int3 do_int3 has_error_code=0 create_gap=1
idtentry stack_segment do_stack_segment has_error_code=1
#ifdef CONFIG_XEN_PV
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..ba275b6292db 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -39,4 +39,24 @@ extern int poke_int3_handler(struct pt_regs *regs);
extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
extern int after_bootmem;
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
+{
+ regs->sp -= sizeof(unsigned long);
+ *(unsigned long *)regs->sp = val;
+}
+
+static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
+{
+ regs->ip = ip;
+}
+
+#define INT3_INSN_SIZE 1
+#define CALL_INSN_SIZE 5
+
+static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func)
+{
+ int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
+ int3_emulate_jmp(regs, func);
+}
+
#endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index ef49517f6bb2..90d319687d7e 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -29,6 +29,7 @@
#include <asm/kprobes.h>
#include <asm/ftrace.h>
#include <asm/nops.h>
+#include <asm/text-patching.h>
#ifdef CONFIG_DYNAMIC_FTRACE
@@ -231,6 +232,7 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
}
static unsigned long ftrace_update_func;
+static unsigned long ftrace_update_func_call;
static int update_ftrace_func(unsigned long ip, void *new)
{
@@ -259,6 +261,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
unsigned char *new;
int ret;
+ ftrace_update_func_call = (unsigned long)func;
+
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -295,12 +299,19 @@ int ftrace_int3_handler(struct pt_regs *regs)
return 0;
ip = regs->ip - 1;
- if (!ftrace_location(ip) && !is_ftrace_caller(ip))
- return 0;
-
- regs->ip += MCOUNT_INSN_SIZE - 1;
+ if (ftrace_location(ip)) {
+ int3_emulate_call(regs, ftrace_update_func_call);
+ return 1;
+ } else if (is_ftrace_caller(ip)) {
+ if (!ftrace_update_func_call) {
+ int3_emulate_jmp(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
+ return 1;
+ }
+ int3_emulate_call(regs, ftrace_update_func_call);
+ return 1;
+ }
- return 1;
+ return 0;
}
NOKPROBE_SYMBOL(ftrace_int3_handler);
@@ -859,6 +870,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
func = ftrace_ops_get_func(ops);
+ ftrace_update_func_call = (unsigned long)func;
+
/* Do a safe modify in case the trampoline is executing */
new = ftrace_call_replace(ip, (unsigned long)func);
ret = update_ftrace_func(ip, new);
@@ -960,6 +973,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func)
{
unsigned char *new;
+ ftrace_update_func_call = 0UL;
new = ftrace_jmp_replace(ip, (unsigned long)func);
return update_ftrace_func(ip, new);
On Wed, 01 May 2019 10:26:32 +0200
Nicolai Stange <[email protected]> wrote:
> > +extern asmlinkage void ftrace_emulate_call_irqon(void);
> > +extern asmlinkage void ftrace_emulate_call_irqoff(void);
> > +extern asmlinkage void ftrace_emulate_call_nmi(void);
> > +extern asmlinkage void ftrace_emulate_call_update_irqoff(void);
> > +extern asmlinkage void ftrace_emulate_call_update_irqon(void);
> > +extern asmlinkage void ftrace_emulate_call_update_nmi(void);
> > +
> > +static DEFINE_PER_CPU(void *, ftrace_bp_call_return);
> > +static DEFINE_PER_CPU(void *, ftrace_bp_call_nmi_return);
>
> Andy mentioned #DB and #MC exceptions here:
> https://lkml.kernel.org/r/[email protected]
>
> I think that #DB won't be possible, provided the trampolines below get
> tagged as NOKPROBE (do_int3() and ftrace_int3_handler() already have
> it).
>
> It's highly theoretic, but tracing do_machine_check() could clobber
> ftrace_bp_call_return or ftrace_bp_call_nmi_return?
Probably shouldn't trace do_machine_check() then ;-)
>
>
> > +#ifdef CONFIG_SMP
> > +#ifdef CONFIG_X86_64
> > +# define BP_CALL_RETURN "%gs:ftrace_bp_call_return"
> > +# define BP_CALL_NMI_RETURN "%gs:ftrace_bp_call_nmi_return"
> > +#else
> > +# define BP_CALL_RETURN "%fs:ftrace_bp_call_return"
> > +# define BP_CALL_NMI_RETURN "%fs:ftrace_bp_call_nmi_return"
> > +#endif
> > +#else /* SMP */
> > +# define BP_CALL_RETURN "ftrace_bp_call_return"
> > +# define BP_CALL_NMI_RETURN "ftrace_bp_call_nmi_return"
> > +#endif
> > +
> > +/* To hold the ftrace_caller address to push on the stack */
> > +void *ftrace_caller_func = (void *)ftrace_caller;
>
> The live patching ftrace_ops need ftrace_regs_caller.
Ah, you're right. Luckily ftrace_regs_caller is a superset of
ftrace_caller. That is, those only needing ftrace_caller can do fine
with ftrace_regs_caller (but not vice versa).
Easy enough to fix.
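A minimal sketch of that fix, assuming the variable keeps the name from the
patch above, would be to initialize it with ftrace_regs_caller instead:

/* ftrace_regs_caller saves/restores the full pt_regs, so it also
 * covers ops that would be fine with plain ftrace_caller. */
void *ftrace_caller_func = (void *)ftrace_regs_caller;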
>
>
> > +
> > +asm(
> > + ".text\n"
> > +
> > + /* Trampoline for function update with interrupts enabled */
> > + ".global ftrace_emulate_call_irqoff\n"
> > + ".type ftrace_emulate_call_irqoff, @function\n"
> > + "ftrace_emulate_call_irqoff:\n\t"
> > + "push "BP_CALL_RETURN"\n\t"
> > + "push ftrace_caller_func\n"
> > + "sti\n\t"
> > + "ret\n\t"
> > + ".size ftrace_emulate_call_irqoff, .-ftrace_emulate_call_irqoff\n"
> > +
> > + /* Trampoline for function update with interrupts disabled*/
> > + ".global ftrace_emulate_call_irqon\n"
>
> The naming is perhaps a bit confusing, i.e. "update with interrupts
> disabled" vs. "irqon"... How about swapping irqoff<->irqon?
I just used the terminology Linus used. It is confusing. Perhaps just
call it ftrace_emulate_call (for the non-sti case) and
ftrace_emulate_call_sti for the sti case. That should remove the
confusion.
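As a sketch only, the C-side declarations for the first three stubs might
then read as follows (these names are hypothetical, just following the
suggestion above, and the _update_ variants would be renamed analogously):

/* hypothetical names, not from any posted patch */
extern asmlinkage void ftrace_emulate_call(void);     /* IRQs were already off at the breakpoint */
extern asmlinkage void ftrace_emulate_call_sti(void); /* does sti before returning */
extern asmlinkage void ftrace_emulate_call_nmi(void); /* hit in NMI context */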
-- Steve
On Wed, 1 May 2019 15:11:17 +0200
Peter Zijlstra <[email protected]> wrote:
> On Tue, Apr 30, 2019 at 11:33:21AM -0700, Linus Torvalds wrote:
> > Anyway, since Andy really likes the entry code change, can we have
> > that patch in parallel and judge the difference that way? Iirc, that
> > was x86-64 specific too.
>
> Here goes, compile tested only...
>
> It obviously needs a self-test, but that shouldn't be too hard to
> arrange.
>
I was able to get it applied (with slight tweaking) but it then
crashed. But that was due to incorrect updates in the
ftrace_int3_handler().
> ---
> arch/x86/entry/entry_32.S | 7 +++++++
> arch/x86/entry/entry_64.S | 14 ++++++++++++--
> arch/x86/include/asm/text-patching.h | 20 ++++++++++++++++++++
> arch/x86/kernel/ftrace.c | 24 +++++++++++++++++++-----
> 4 files changed, 58 insertions(+), 7 deletions(-)
> #endif /* _ASM_X86_TEXT_PATCHING_H */
> diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> index ef49517f6bb2..90d319687d7e 100644
> --- a/arch/x86/kernel/ftrace.c
> +++ b/arch/x86/kernel/ftrace.c
> @@ -29,6 +29,7 @@
> #include <asm/kprobes.h>
> #include <asm/ftrace.h>
> #include <asm/nops.h>
> +#include <asm/text-patching.h>
>
> #ifdef CONFIG_DYNAMIC_FTRACE
>
> @@ -231,6 +232,7 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
> }
>
> static unsigned long ftrace_update_func;
> +static unsigned long ftrace_update_func_call;
>
> static int update_ftrace_func(unsigned long ip, void *new)
> {
> @@ -259,6 +261,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
> unsigned char *new;
> int ret;
>
> + ftrace_update_func_call = (unsigned long)func;
> +
> new = ftrace_call_replace(ip, (unsigned long)func);
> ret = update_ftrace_func(ip, new);
>
> @@ -295,12 +299,19 @@ int ftrace_int3_handler(struct pt_regs *regs)
> return 0;
>
> ip = regs->ip - 1;
> - if (!ftrace_location(ip) && !is_ftrace_caller(ip))
> - return 0;
> -
> - regs->ip += MCOUNT_INSN_SIZE - 1;
> + if (ftrace_location(ip)) {
> + int3_emulate_call(regs, ftrace_update_func_call);
Should be:
int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
> + return 1;
> + } else if (is_ftrace_caller(ip)) {
> + if (!ftrace_update_func_call) {
> + int3_emulate_jmp(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
I see what you did here, but I think:
int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
looks better. But that said, we could in the beginning do:
ip = regs->ip - INT3_INSN_SIZE;
instead of
ip = regs->ip - 1;
I made these updates and posted them to Linus.
-- Steve
> + return 1;
> + }
> + int3_emulate_call(regs, ftrace_update_func_call);
> + return 1;
> + }
>
> - return 1;
> + return 0;
> }
> NOKPROBE_SYMBOL(ftrace_int3_handler);
>
> @@ -859,6 +870,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
> func = ftrace_ops_get_func(ops);
>
> + ftrace_update_func_call = (unsigned long)func;
> +
> /* Do a safe modify in case the trampoline is executing */
> new = ftrace_call_replace(ip, (unsigned long)func);
> ret = update_ftrace_func(ip, new);
> @@ -960,6 +973,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func)
> {
> unsigned char *new;
>
> + ftrace_update_func_call = 0UL;
> new = ftrace_jmp_replace(ip, (unsigned long)func);
>
> return update_ftrace_func(ip, new);
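Folding those corrections in, a rough sketch of the resulting handler could
look like this (untested; it assumes the int3_emulate_*() helpers and
ftrace_update_func_call from the patch above and leaves the existing early
sanity checks at the top of the function untouched):

int ftrace_int3_handler(struct pt_regs *regs)
{
	unsigned long ip;

	/* existing early-return checks unchanged */

	ip = regs->ip - INT3_INSN_SIZE;

	if (ftrace_location(ip)) {
		/* breakpoint at an mcount call site: emulate the call */
		int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
		return 1;
	} else if (is_ftrace_caller(ip)) {
		if (!ftrace_update_func_call) {
			/* a jmp is being installed: just skip the old call */
			int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
			return 1;
		}
		/* the call inside the ftrace trampoline is being updated */
		int3_emulate_call(regs, ftrace_update_func_call);
		return 1;
	}

	return 0;
}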
On Wed, May 01, 2019 at 02:58:24PM -0400, Steven Rostedt wrote:
> > + if (ftrace_location(ip)) {
> > + int3_emulate_call(regs, ftrace_update_func_call);
>
> Should be:
>
> int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
Ah, I lost the plot a little there.
> > + return 1;
> > + } else if (is_ftrace_caller(ip)) {
> > + if (!ftrace_update_func_call) {
> > + int3_emulate_jmp(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
>
> I see what you did here, but I think:
>
> int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
>
> looks better. But that said, we could in the beginning do:
>
> ip = regs->ip - INT3_INSN_SIZE;
>
> instead of
>
> ip = regs->ip - 1;
>
> I made these updates and posted them to Linus.
I was actually considering:
static inline void int3_emulate_nop(struct pt_regs *regs, unsigned long size)
{
int3_emulate_jmp(regs, regs->ip - INT3_INSN_SIZE + size);
}
And then the above becomes:
int3_emulate_nop(regs, CALL_INSN_SIZE);
Hmm?
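For comparison, with such a helper the corresponding branch in
ftrace_int3_handler() would then read roughly (hypothetical, just applying
the proposal above):

	if (!ftrace_update_func_call) {
		/* a jmp is being installed: treat the old call like a nop */
		int3_emulate_nop(regs, CALL_INSN_SIZE);
		return 1;
	}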
On Wed, May 1, 2019 at 6:11 AM Peter Zijlstra <[email protected]> wrote:
>
> Here goes, compile tested only...
Ugh, two different threads. This has the same bug (same source) as the
one Steven posted:
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -1479,6 +1479,13 @@ ENTRY(int3)
> ASM_CLAC
> pushl $-1 # mark this as an int
>
> + testl $SEGMENT_RPL_MASK, PT_CS(%esp)
> + jnz .Lfrom_usermode_no_gap
> + .rept 6
> + pushl 5*4(%esp)
> + .endr
> +.Lfrom_usermode_no_gap:
This will corrupt things horribly if you still use vm86 mode. Checking
CS RPL is simply not correct.
Linus
On Wed, 1 May 2019 12:03:52 -0700
Linus Torvalds <[email protected]> wrote:
> On Wed, May 1, 2019 at 6:11 AM Peter Zijlstra <[email protected]> wrote:
> >
> > Here goes, compile tested only...
>
> Ugh, two different threads. This has the same bug (same source) as the
> one Steven posted:
>
> > --- a/arch/x86/entry/entry_32.S
> > +++ b/arch/x86/entry/entry_32.S
> > @@ -1479,6 +1479,13 @@ ENTRY(int3)
> > ASM_CLAC
> > pushl $-1 # mark this as an int
> >
> > + testl $SEGMENT_RPL_MASK, PT_CS(%esp)
> > + jnz .Lfrom_usermode_no_gap
> > + .rept 6
> > + pushl 5*4(%esp)
> > + .endr
> > +.Lfrom_usermode_no_gap:
>
> This will corrupt things horribly if you still use vm86 mode. Checking
> CS RPL is simply not correct.
I never tested the 32 bit version of this. And we could just not
implement it (I don't think there's live kernel patching for it
either).
But this doesn't make it any worse than my version, because under the
full testing of my patch with the trampolines, I would easily crash the
32 bit version. That was one reason I made my last patch only support 64
bit.
Under light load, 32 bit works, but when I stress it (running perf and
ftrace together) it blows up. Could be an NMI issue.
-- Steve
On Wed, May 01, 2019 at 12:03:52PM -0700, Linus Torvalds wrote:
> On Wed, May 1, 2019 at 6:11 AM Peter Zijlstra <[email protected]> wrote:
> >
> > Here goes, compile tested only...
>
> Ugh, two different threads. This has the same bug (same source) as the
> one Steven posted:
This is what Steve started from; let's continue in the other thread.
> > --- a/arch/x86/entry/entry_32.S
> > +++ b/arch/x86/entry/entry_32.S
> > @@ -1479,6 +1479,13 @@ ENTRY(int3)
> > ASM_CLAC
> > pushl $-1 # mark this as an int
> >
> > + testl $SEGMENT_RPL_MASK, PT_CS(%esp)
> > + jnz .Lfrom_usermode_no_gap
> > + .rept 6
> > + pushl 5*4(%esp)
> > + .endr
> > +.Lfrom_usermode_no_gap:
>
> This will corrupt things horribly if you still use vm86 mode. Checking
> CS RPL is simply not correct.
I'll go fix; I never really understood that vm86 crud and I cobbled this
32bit thing together based on the 64bit version (that Josh did a while
ago).
On Wed, 1 May 2019, Steven Rostedt wrote:
> I never tested the 32 bit version of this. And we could just not
> implement it (I don't think there's live kernel patching for it
> either).
That's correct, there is no livepatching on x86_32 (and no plans for
it). CONFIG_LIVEPATCH is not available for 32bit builds at all.
--
Jiri Kosina
SUSE Labs
On Wed, May 01, 2019 at 09:33:28PM +0200, Jiri Kosina wrote:
> On Wed, 1 May 2019, Steven Rostedt wrote:
>
> > I never tested the 32 bit version of this. And we could just not
> > implement it (I don't think there's live kernel patching for it
> > either).
>
> That's correct, there is no livepatching on x86_32 (and no plans for
> it). CONFIG_LIVEPATCH is not available for 32bit builds at all.
We still want this for static_call(), even on 32bit.