2008-10-04 06:02:23

by Steven Rostedt

Subject: [PATCH 0/3] ring-buffer: less locking and only disable preemption

Ingo,

These patches need to be put through the ringer. Could you add them
to your ring-buffer branch, so we can test them out before putting
them into your master branch.

The following patches bring the ring buffer closer to a lockless
solution. They confine the locking to the actual moving of the
tail/write pointer from one page to the next. Interrupts are now
enabled during most of the writes.
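
Conceptually, the write path then looks something like the sketch below;
the structure and function names are illustrative only, not the actual
kernel/trace/ring_buffer.c code.

#include <linux/preempt.h>
#include <linux/spinlock.h>

#define BUF_PAGE_SIZE 4096              /* illustrative */

/* Illustrative sketch only, not the real ring buffer structures. */
struct rb_sketch_cpu_buffer {
        spinlock_t      lock;           /* taken only on page transitions */
        void            *tail_page;     /* page currently written to */
        unsigned long   write;          /* write offset into tail_page */
};

/* Advances to the next page; elided in this sketch. */
static void *rb_sketch_next_page(struct rb_sketch_cpu_buffer *cpu_buffer);

/*
 * Fast path: preemption disabled only.  The spinlock (and the irq-off
 * region) is taken only when the event would cross into the next page.
 */
static void *rb_sketch_reserve(struct rb_sketch_cpu_buffer *cpu_buffer,
                               unsigned long length)
{
        void *event;

        preempt_disable();

        if (cpu_buffer->write + length > BUF_PAGE_SIZE) {
                unsigned long flags;

                spin_lock_irqsave(&cpu_buffer->lock, flags);
                cpu_buffer->tail_page = rb_sketch_next_page(cpu_buffer);
                cpu_buffer->write = 0;
                spin_unlock_irqrestore(&cpu_buffer->lock, flags);
        }

        event = cpu_buffer->tail_page + cpu_buffer->write;
        cpu_buffer->write += length;

        return event;   /* the matching commit does preempt_enable() */
}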

A lot of the locking protection is still within the ftrace infrastructure.
The last patch takes some of that away.

The function tracer cannot be reentrant, simply due to its nature:
it traces everything, and that can cause recursion issues.

-- Steve


2008-10-04 08:40:27

by Ingo Molnar

Subject: Re: [PATCH 0/3] ring-buffer: less locking and only disable preemption


* Steven Rostedt <[email protected]> wrote:

> Ingo,
>
> These patches need to be put through the ringer. Could you add them to
> your ring-buffer branch, so we can test them out before putting them
> into your master branch.

hey, in fact your latest iteration already tested out so well on a wide
range of boxes that i've merged it all into tip/tracing/core already.

I'll reuse tip/tracing/ring-buffer for these latest 3 patches (merge it
up to tip/tracing/core and add these three patches) but it's a delta,
i.e. the whole ring-buffer approach is ready for prime time i think.

Hm, do we do deallocation of the buffers already when we switch tracers?

> The following patches bring the ring buffer closer to a lockless
> solution. They move the locking only to the actual moving the
> tail/write pointer from one page to the next. Interrupts are now
> enabled during most of the writes.

very nice direction!

> A lot of the locking protection is still within the ftrace
> infrastructure. The last patch takes some of that away.
>
> The function tracer cannot be reentrant just due to the nature that it
> traces everything, and can cause recursion issues.

Correct, and that's by far the yuckiest aspect of it. And there's
another aspect: NMIs. We've still got the tip/tracing/nmisafe angle with
these commits:

d979781: ftrace: mark lapic_wd_event() notrace
c2c27ba: ftrace: ignore functions that cannot be kprobe-ed
431e946: ftrace: do not trace NMI contexts
1eda930: x86, tracing, nmisafe: fix threadinfo_ -> TI_ rename fallout
84c2ca9: sched: sched_clock() improvement: use in_nmi()
0d84b78: x86 NMI-safe INT3 and Page Fault
a04464b: x86_64 page fault NMI-safe
b335389: Change avr32 active count bit
a581cbd: Change Alpha active count bit
eca0999: Stringify support commas

but i'm not yet fully convinced about the NMI angle, the practical cross
section to random lowlevel x86 code is wider than any sched_clock()
impact for example. We might be best off avoiding it: force-disable the
NMI watchdog when we trace?

Ingo

2008-10-04 14:34:20

by Steven Rostedt

Subject: Re: [PATCH 0/3] ring-buffer: less locking and only disable preemption


[ Added Arjan to CC regarding the last statements ]

On Sat, 4 Oct 2008, Ingo Molnar wrote:
>
> * Steven Rostedt <[email protected]> wrote:
> > These patches need to be put through the ringer. Could you add them to
> > your ring-buffer branch, so we can test them out before putting them
> > into your master branch.
>
> hey, in fact your latest iteration already tested out so well on a wide
> range of boxes that I've merged it all into tip/tracing/core already.

Great to hear!

>
> I'll reuse tip/tracing/ring-buffer for these latest 3 patches (merge it
> up to tip/tracing/core and add these three patches) but it's a delta,
> i.e. the whole ring-buffer approach is ready for prime time i think.
>
> Hm, do we do deallocation of the buffers already when we switch tracers?

Not yet, but that is one of the trivial changes. I spent too much time
on these more complex changes to get around to it.

>
> > The following patches bring the ring buffer closer to a lockless
> > solution. They move the locking only to the actual moving the
> > tail/write pointer from one page to the next. Interrupts are now
> > enabled during most of the writes.
>
> very nice direction!

Thanks!

>
> > A lot of the locking protection is still within the ftrace
> > infrastructure. The last patch takes some of that away.
> >
> > The function tracer cannot be reentrant just due to the nature that it
> > traces everything, and can cause recursion issues.
>
> Correct, and that's by far the yuckiest aspect of it. And there's
> another aspect: NMIs. We've still got the tip/tracing/nmisafe angle with
> these commits:
>
> d979781: ftrace: mark lapic_wd_event() notrace
> c2c27ba: ftrace: ignore functions that cannot be kprobe-ed
> 431e946: ftrace: do not trace NMI contexts
> 1eda930: x86, tracing, nmisafe: fix threadinfo_ -> TI_ rename fallout
> 84c2ca9: sched: sched_clock() improvement: use in_nmi()
> 0d84b78: x86 NMI-safe INT3 and Page Fault
> a04464b: x86_64 page fault NMI-safe
> b335389: Change avr32 active count bit
> a581cbd: Change Alpha active count bit
> eca0999: Stringify support commas
>
> but I'm not yet fully convinced about the NMI angle, the practical cross
> section to random low level x86 code is wider than any sched_clock()
> impact for example. We might be best off avoiding it: force-disable the
> NMI watchdog when we trace?

Since we still have the locking in the ring buffer, it is still not NMI
safe. But once we remove all locking, then the tracer is fine.

BUT!

The dynamic function tracer is another issue. The problem with NMIs has
nothing to do with locking, or corrupting the buffers. It has to do with
the dynamic code modification. Whenever we modify code, we must guarantee
that it will not be executed on another CPU.

Kstop_machine serves this purpose rather well. We can modify code without
worrying that it will be executed on another CPU, except for NMIs. The
problem comes when an NMI can come in and execute the code being modified.
That's why I put in all the notrace lines. But it gets difficult because
the nmi_notifier can call all over the kernel. Perhaps we can simply
disable the nmi-notifier when we are doing the kstop_machine call?
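
As a rough sketch of the situation described above (illustrative names,
not the real ftrace code), the stop_machine-based update and its NMI hole
look like this:

#include <linux/stop_machine.h>
#include <linux/string.h>

/* Illustrative sketch, not the real ftrace implementation. */
struct sketch_dyn_site {
        unsigned long   ip;             /* address of the 5-byte call site */
        unsigned char   code[5];        /* new bytes: call <tracer> or nop */
};

static struct sketch_dyn_site *sketch_sites;    /* hypothetical site table */
static int sketch_nr_sites;

static int __sketch_modify_code(void *unused)
{
        int i;

        /*
         * Every other CPU is parked by stop_machine(), so none of them
         * can run this text, except from an NMI, which is not stopped
         * and may execute a half-modified instruction.
         */
        for (i = 0; i < sketch_nr_sites; i++)
                memcpy((void *)sketch_sites[i].ip, sketch_sites[i].code, 5);

        return 0;
}

static void sketch_update_call_sites(void)
{
        stop_machine(__sketch_modify_code, NULL, NULL);
}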

-- Steve

2008-10-04 14:44:46

by Ingo Molnar

Subject: Re: [PATCH 0/3] ring-buffer: less locking and only disable preemption


* Steven Rostedt <[email protected]> wrote:

> The dynamic function tracer is another issue. The problem with NMIs
> has nothing to do with locking, or corrupting the buffers. It has to
> do with the dynamic code modification. Whenever we modify code, we
> must guarantee that it will not be executed on another CPU.
>
> Kstop_machine serves this purpose rather well. We can modify code
> without worrying it will be executed on another CPU, except for NMIs.
> The problem now comes where an NMI can come in and execute the code
> being modified. That's why I put in all the notrace, lines. But it
> gets difficult because of nmi_notifier can call all over the kernel.
> Perhaps, we can simply disable the nmi-notifier when we are doing the
> kstop_machine call?

that would definitely be one way to reduce the cross section, but not
enough i'm afraid. For example in the nmi_watchdog=2 case we call into
various lapic functions and paravirt lapic handlers which makes it all
spread to 3-4 paravirtualization flavors ...

sched_clock()'s notrace aspects were pretty manageable, but this in its
current form is not.

Ingo

2008-10-04 16:34:17

by Mathieu Desnoyers

Subject: Re: [PATCH 0/3] ring-buffer: less locking and only disable preemption

* Steven Rostedt ([email protected]) wrote:
>
> [ Added Arjan to CC regarding the last statements ]
>
> On Sat, 4 Oct 2008, Ingo Molnar wrote:
> >
> > * Steven Rostedt <[email protected]> wrote:
> > > These patches need to be put through the ringer. Could you add them to
> > > your ring-buffer branch, so we can test them out before putting them
> > > into your master branch.
> >
> > hey, in fact your latest iteration already tested out so well on a wide
> > range of boxes that I've merged it all into tip/tracing/core already.
>
> Great to hear!
>
> >
> > I'll reuse tip/tracing/ring-buffer for these latest 3 patches (merge it
> > up to tip/tracing/core and add these three patches) but it's a delta,
> > i.e. the whole ring-buffer approach is ready for prime time i think.
> >
> > Hm, do we do deallocation of the buffers already when we switch tracers?
>
> Not yet, but that is one of the trivial changes. I spent too much time
> on these more complex changes to get around to it.
>
> >
> > > The following patches bring the ring buffer closer to a lockless
> > > solution. They move the locking only to the actual moving the
> > > tail/write pointer from one page to the next. Interrupts are now
> > > enabled during most of the writes.
> >
> > very nice direction!
>
> Thanks!
>
> >
> > > A lot of the locking protection is still within the ftrace
> > > infrastructure. The last patch takes some of that away.
> > >
> > > The function tracer cannot be reentrant just due to the nature that it
> > > traces everything, and can cause recursion issues.
> >
> > Correct, and that's by far the yuckiest aspect of it. And there's
> > another aspect: NMIs. We've still got the tip/tracing/nmisafe angle with
> > these commits:
> >
> > d979781: ftrace: mark lapic_wd_event() notrace
> > c2c27ba: ftrace: ignore functions that cannot be kprobe-ed
> > 431e946: ftrace: do not trace NMI contexts
> > 1eda930: x86, tracing, nmisafe: fix threadinfo_ -> TI_ rename fallout
> > 84c2ca9: sched: sched_clock() improvement: use in_nmi()
> > 0d84b78: x86 NMI-safe INT3 and Page Fault
> > a04464b: x86_64 page fault NMI-safe
> > b335389: Change avr32 active count bit
> > a581cbd: Change Alpha active count bit
> > eca0999: Stringify support commas
> >
> > but I'm not yet fully convinced about the NMI angle, the practical cross
> > section to random low level x86 code is wider than any sched_clock()
> > impact for example. We might be best off avoiding it: force-disable the
> > NMI watchdog when we trace?
>
> Since we still have the locking in the ring buffer, it is still not NMI
> safe. But once we remove all locking, then the tracer is fine.
>
> BUT!
>
> The dynamic function tracer is another issue. The problem with NMIs has
> nothing to do with locking, or corrupting the buffers. It has to do with
> the dynamic code modification. Whenever we modify code, we must guarantee
> that it will not be executed on another CPU.
>
> Kstop_machine serves this purpose rather well. We can modify code without
> worrying it will be executed on another CPU, except for NMIs. The problem
> now comes where an NMI can come in and execute the code being modified.
> That's why I put in all the notrace, lines. But it gets difficult because
> of nmi_notifier can call all over the kernel. Perhaps, we can simply
> disable the nmi-notifier when we are doing the kstop_machine call?
>

Or use this code, based on a temporary breakpoint, to do the code
patching (part of the -lttng tree). It does not require stop_machine at
all and is nmi safe.

Mathieu


Immediate Values - x86 Optimization NMI and MCE support

x86 optimization of the immediate values which uses a movl with code patching
to set/unset the value used to populate the register used as variable source.
It uses a breakpoint to bypass the instruction being changed, which lessens the
interrupt latency of the operation and protects against NMIs and MCE.

- More reentrant immediate value : uses a breakpoint. Needs to know the
instruction's first byte. This is why we keep the "instruction size"
variable, so we can support the REX prefixed instructions too.

Changelog:
- Change the immediate.c update code to support variable length opcodes.
- Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
non atomic writes to a code region only touched by us (nobody can execute it
since we are protected by the imv_mutex).
- Add x86_64 support, ready for i386+x86_64 -> x86 merge.
- Use asm-x86/asm.h.
- Change the immediate.c update code to support variable length opcodes.
- Use imv_* instead of immediate_*.
- Use kernel_wp_disable/enable instead of save/restore.
- Fix 1 byte immediate value so it declares its instruction size.

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Andi Kleen <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Chuck Ebbert <[email protected]>
CC: Christoph Hellwig <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Ingo Molnar <[email protected]>
---
arch/x86/kernel/Makefile | 1
arch/x86/kernel/immediate.c | 291 ++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/traps_32.c | 8 -
include/asm-x86/immediate.h | 48 ++++++-
4 files changed, 338 insertions(+), 10 deletions(-)

Index: linux-2.6-lttng/include/asm-x86/immediate.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-x86/immediate.h 2008-08-06 01:32:23.000000000 -0400
+++ linux-2.6-lttng/include/asm-x86/immediate.h 2008-08-06 02:35:56.000000000 -0400
@@ -12,6 +12,18 @@

#include <asm/asm.h>

+struct __imv {
+ unsigned long var; /* Pointer to the identifier variable of the
+ * immediate value
+ */
+ unsigned long imv; /*
+ * Pointer to the memory location of the
+ * immediate value within the instruction.
+ */
+ unsigned char size; /* Type size. */
+ unsigned char insn_size;/* Instruction size. */
+} __attribute__ ((packed));
+
/**
* imv_read - read immediate variable
* @name: immediate value name
@@ -26,6 +38,11 @@
* what will generate an instruction with 8 bytes immediate value (not the REX.W
* prefixed one that loads a sign extended 32 bits immediate value in a r64
* register).
+ *
+ * Create the instruction in a discarded section to calculate its size. This is
+ * how we can align the beginning of the instruction on an address that will
+ * permit atomic modification of the immediate value without knowing the size of
+ * the opcode used by the compiler. The operand size is known in advance.
*/
#define imv_read(name) \
({ \
@@ -33,9 +50,14 @@
BUILD_BUG_ON(sizeof(value) > 8); \
switch (sizeof(value)) { \
case 1: \
- asm(".section __imv,\"aw\",@progbits\n\t" \
+ asm(".section __discard,\"\",@progbits\n\t" \
+ "1:\n\t" \
+ "mov $0,%0\n\t" \
+ "2:\n\t" \
+ ".previous\n\t" \
+ ".section __imv,\"aw\",@progbits\n\t" \
_ASM_PTR "%c1, (3f)-%c2\n\t" \
- ".byte %c2\n\t" \
+ ".byte %c2, (2b-1b)\n\t" \
".previous\n\t" \
"mov $0,%0\n\t" \
"3:\n\t" \
@@ -45,10 +67,16 @@
break; \
case 2: \
case 4: \
- asm(".section __imv,\"aw\",@progbits\n\t" \
+ asm(".section __discard,\"\",@progbits\n\t" \
+ "1:\n\t" \
+ "mov $0,%0\n\t" \
+ "2:\n\t" \
+ ".previous\n\t" \
+ ".section __imv,\"aw\",@progbits\n\t" \
_ASM_PTR "%c1, (3f)-%c2\n\t" \
- ".byte %c2\n\t" \
+ ".byte %c2, (2b-1b)\n\t" \
".previous\n\t" \
+ ".org . + ((-.-(2b-1b)) & (%c2-1)), 0x90\n\t" \
"mov $0,%0\n\t" \
"3:\n\t" \
: "=r" (value) \
@@ -60,10 +88,16 @@
value = name##__imv; \
break; \
} \
- asm(".section __imv,\"aw\",@progbits\n\t" \
+ asm(".section __discard,\"\",@progbits\n\t" \
+ "1:\n\t" \
+ "mov $0xFEFEFEFE01010101,%0\n\t" \
+ "2:\n\t" \
+ ".previous\n\t" \
+ ".section __imv,\"aw\",@progbits\n\t" \
_ASM_PTR "%c1, (3f)-%c2\n\t" \
- ".byte %c2\n\t" \
+ ".byte %c2, (2b-1b)\n\t" \
".previous\n\t" \
+ ".org . + ((-.-(2b-1b)) & (%c2-1)), 0x90\n\t" \
"mov $0xFEFEFEFE01010101,%0\n\t" \
"3:\n\t" \
: "=r" (value) \
@@ -74,4 +108,6 @@
value; \
})

+extern int arch_imv_update(const struct __imv *imv, int early);
+
#endif /* _ASM_X86_IMMEDIATE_H */
Index: linux-2.6-lttng/arch/x86/kernel/traps_32.c
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/traps_32.c 2008-08-06 00:44:22.000000000 -0400
+++ linux-2.6-lttng/arch/x86/kernel/traps_32.c 2008-08-06 02:35:57.000000000 -0400
@@ -576,7 +576,7 @@ void do_##name(struct pt_regs *regs, lon
}

DO_VM86_ERROR_INFO(0, SIGFPE, "divide error", divide_error, FPE_INTDIV, regs->ip)
-#ifndef CONFIG_KPROBES
+#if !defined(CONFIG_KPROBES) && !defined(CONFIG_IMMEDIATE)
DO_VM86_ERROR(3, SIGTRAP, "int3", int3)
#endif
DO_VM86_ERROR(4, SIGSEGV, "overflow", overflow)
@@ -850,7 +850,7 @@ void restart_nmi(void)
acpi_nmi_enable();
}

-#ifdef CONFIG_KPROBES
+#if defined(CONFIG_KPROBES) || defined(CONFIG_IMMEDIATE)
void __kprobes do_int3(struct pt_regs *regs, long error_code)
{
trace_hardirqs_fixup();
@@ -859,8 +859,8 @@ void __kprobes do_int3(struct pt_regs *r
== NOTIFY_STOP)
return;
/*
- * This is an interrupt gate, because kprobes wants interrupts
- * disabled. Normal trap handlers don't.
+ * This is an interrupt gate, because kprobes and immediate values wants
+ * interrupts disabled. Normal trap handlers don't.
*/
restore_interrupts(regs);

Index: linux-2.6-lttng/arch/x86/kernel/Makefile
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/Makefile 2008-08-06 00:41:34.000000000 -0400
+++ linux-2.6-lttng/arch/x86/kernel/Makefile 2008-08-06 02:35:57.000000000 -0400
@@ -73,6 +73,7 @@ obj-$(CONFIG_X86_SUMMIT_NUMA) += summit_
obj-y += vsmp_64.o
obj-$(CONFIG_KPROBES) += kprobes.o
obj-$(CONFIG_MODULES) += module_$(BITS).o
+obj-$(CONFIG_IMMEDIATE) += immediate.o
obj-$(CONFIG_EFI) += efi.o efi_$(BITS).o efi_stub_$(BITS).o
obj-$(CONFIG_DOUBLEFAULT) += doublefault_32.o
obj-$(CONFIG_KGDB) += kgdb.o
Index: linux-2.6-lttng/arch/x86/kernel/immediate.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/arch/x86/kernel/immediate.c 2008-08-06 02:36:33.000000000 -0400
@@ -0,0 +1,291 @@
+/*
+ * Immediate Value - x86 architecture specific code.
+ *
+ * Rationale
+ *
+ * Required because of :
+ * - Erratum 49 fix for Intel PIII.
+ * - Still present on newer processors : Intel Core 2 Duo Processor for Intel
+ * Centrino Duo Processor Technology Specification Update, AH33.
+ * Unsynchronized Cross-Modifying Code Operations Can Cause Unexpected
+ * Instruction Execution Results.
+ *
+ * Permits immediate value modification by XMC with correct serialization.
+ *
+ * Reentrant for NMI and trap handler instrumentation. Permits XMC to a
+ * location that has preemption enabled because it involves no temporary or
+ * reused data structure.
+ *
+ * Quoting Richard J Moore, source of the information motivating this
+ * implementation which differs from the one proposed by Intel which is not
+ * suitable for kernel context (does not support NMI and would require disabling
+ * interrupts on every CPU for a long period) :
+ *
+ * "There is another issue to consider when looking into using probes other
+ * then int3:
+ *
+ * Intel erratum 54 - Unsynchronized Cross-modifying code - refers to the
+ * practice of modifying code on one processor where another has prefetched
+ * the unmodified version of the code. Intel states that unpredictable general
+ * protection faults may result if a synchronizing instruction (iret, int,
+ * int3, cpuid, etc ) is not executed on the second processor before it
+ * executes the pre-fetched out-of-date copy of the instruction.
+ *
+ * When we became aware of this I had a long discussion with Intel's
+ * microarchitecture guys. It turns out that the reason for this erratum
+ * (which incidentally Intel does not intend to fix) is because the trace
+ * cache - the stream of micro-ops resulting from instruction interpretation -
+ * cannot be guaranteed to be valid. Reading between the lines I assume this
+ * issue arises because of optimization done in the trace cache, where it is
+ * no longer possible to identify the original instruction boundaries. If the
+ * CPU discoverers that the trace cache has been invalidated because of
+ * unsynchronized cross-modification then instruction execution will be
+ * aborted with a GPF. Further discussion with Intel revealed that replacing
+ * the first opcode byte with an int3 would not be subject to this erratum.
+ *
+ * So, is cmpxchg reliable? One has to guarantee more than mere atomicity."
+ *
+ * Overall design
+ *
+ * The algorithm proposed by Intel applies not so well in kernel context: it
+ * would imply disabling interrupts and looping on every CPUs while modifying
+ * the code and would not support instrumentation of code called from interrupt
+ * sources that cannot be disabled.
+ *
+ * Therefore, we use a different algorithm to respect Intel's erratum (see the
+ * quoted discussion above). We make sure that no CPU sees an out-of-date copy
+ * of a pre-fetched instruction by 1 - using a breakpoint, which skips the
+ * instruction that is going to be modified, 2 - issuing an IPI to every CPU to
+ * execute a sync_core(), to make sure that even when the breakpoint is removed,
+ * no cpu could possibly still have the out-of-date copy of the instruction,
+ * modify the now unused 2nd byte of the instruction, and then put back the
+ * original 1st byte of the instruction.
+ *
+ * It has exactly the same intent as the algorithm proposed by Intel, but
+ * it has less side-effects, scales better and supports NMI, SMI and MCE.
+ *
+ * Mathieu Desnoyers <[email protected]>
+ */
+
+#include <linux/preempt.h>
+#include <linux/smp.h>
+#include <linux/notifier.h>
+#include <linux/module.h>
+#include <linux/immediate.h>
+#include <linux/kdebug.h>
+#include <linux/rcupdate.h>
+#include <linux/kprobes.h>
+#include <linux/io.h>
+
+#include <asm/cacheflush.h>
+
+#define BREAKPOINT_INSTRUCTION 0xcc
+#define BREAKPOINT_INS_LEN 1
+#define NR_NOPS 10
+
+static unsigned long target_after_int3; /* EIP of the target after the int3 */
+static unsigned long bypass_eip; /* EIP of the bypass. */
+static unsigned long bypass_after_int3; /* EIP after the end-of-bypass int3 */
+static unsigned long after_imv; /*
+ * EIP where to resume after the
+ * single-stepping.
+ */
+
+/*
+ * Internal bypass used during value update. The bypass is skipped by the
+ * function in which it is inserted.
+ * No need to be aligned because we exclude readers from the site during
+ * update.
+ * Layout is:
+ * (10x nop) int3
+ * (maximum size is 2 bytes opcode + 8 bytes immediate value for long on x86_64)
+ * The nops are the target replaced by the instruction to single-step.
+ * Align on 16 bytes to make sure the nops fit within a single page so remapping
+ * it can be done easily.
+ */
+static inline void _imv_bypass(unsigned long *bypassaddr,
+ unsigned long *breaknextaddr)
+{
+ asm volatile("jmp 2f;\n\t"
+ ".align 16;\n\t"
+ "0:\n\t"
+ ".space 10, 0x90;\n\t"
+ "1:\n\t"
+ "int3;\n\t"
+ "2:\n\t"
+ "mov $(0b),%0;\n\t"
+ "mov $((1b)+1),%1;\n\t"
+ : "=r" (*bypassaddr),
+ "=r" (*breaknextaddr));
+}
+
+static void imv_synchronize_core(void *info)
+{
+ sync_core(); /* use cpuid to stop speculative execution */
+}
+
+/*
+ * The eip value points right after the breakpoint instruction, in the second
+ * byte of the movl.
+ * Disable preemption in the bypass to make sure no thread will be preempted in
+ * it. We can then use synchronize_sched() to make sure every bypass users have
+ * ended.
+ */
+static int imv_notifier(struct notifier_block *nb,
+ unsigned long val, void *data)
+{
+ enum die_val die_val = (enum die_val) val;
+ struct die_args *args = data;
+
+ if (!args->regs || user_mode_vm(args->regs))
+ return NOTIFY_DONE;
+
+ if (die_val == DIE_INT3) {
+ if (args->regs->ip == target_after_int3) {
+ preempt_disable();
+ args->regs->ip = bypass_eip;
+ return NOTIFY_STOP;
+ } else if (args->regs->ip == bypass_after_int3) {
+ args->regs->ip = after_imv;
+ preempt_enable();
+ return NOTIFY_STOP;
+ }
+ }
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block imv_notify = {
+ .notifier_call = imv_notifier,
+ .priority = 0x7fffffff, /* we need to be notified first */
+};
+
+/**
+ * arch_imv_update - update one immediate value
+ * @imv: pointer of type const struct __imv to update
+ * @early: early boot (1) or normal (0)
+ *
+ * Update one immediate value. Must be called with imv_mutex held.
+ */
+__kprobes int arch_imv_update(const struct __imv *imv, int early)
+{
+ int ret;
+ unsigned char opcode_size = imv->insn_size - imv->size;
+ unsigned long insn = imv->imv - opcode_size;
+ unsigned long len;
+ char *vaddr;
+ struct page *pages[1];
+
+#ifdef CONFIG_KPROBES
+ /*
+ * Fail if a kprobe has been set on this instruction.
+ * (TODO: we could eventually do better and modify all the (possibly
+ * nested) kprobes for this site if kprobes had an API for this.
+ */
+ if (unlikely(!early
+ && *(unsigned char *)insn == BREAKPOINT_INSTRUCTION)) {
+ printk(KERN_WARNING "Immediate value in conflict with kprobe. "
+ "Variable at %p, "
+ "instruction at %p, size %hu\n",
+ (void *)imv->imv,
+ (void *)imv->var, imv->size);
+ return -EBUSY;
+ }
+#endif
+
+ /*
+ * If the variable and the instruction have the same value, there is
+ * nothing to do.
+ */
+ switch (imv->size) {
+ case 1: if (*(uint8_t *)imv->imv
+ == *(uint8_t *)imv->var)
+ return 0;
+ break;
+ case 2: if (*(uint16_t *)imv->imv
+ == *(uint16_t *)imv->var)
+ return 0;
+ break;
+ case 4: if (*(uint32_t *)imv->imv
+ == *(uint32_t *)imv->var)
+ return 0;
+ break;
+#ifdef CONFIG_X86_64
+ case 8: if (*(uint64_t *)imv->imv
+ == *(uint64_t *)imv->var)
+ return 0;
+ break;
+#endif
+ default:return -EINVAL;
+ }
+
+ if (!early) {
+ /* bypass is 10 bytes long for x86_64 long */
+ WARN_ON(imv->insn_size > 10);
+ _imv_bypass(&bypass_eip, &bypass_after_int3);
+
+ after_imv = imv->imv + imv->size;
+
+ /*
+ * Using the _early variants because nobody is executing the
+ * bypass code while we patch it. It is protected by the
+ * imv_mutex. Since we modify the instructions non atomically
+ * (for nops), we have to use the _early variant.
+ * We must however deal with RO pages.
+ * Use a single page : 10 bytes are aligned on 16 bytes
+ * boundaries.
+ */
+ pages[0] = virt_to_page((void *)bypass_eip);
+ vaddr = vmap(pages, 1, VM_MAP, PAGE_KERNEL);
+ BUG_ON(!vaddr);
+ text_poke_early(&vaddr[bypass_eip & ~PAGE_MASK],
+ (void *)insn, imv->insn_size);
+ /*
+ * Fill the rest with nops.
+ */
+ len = NR_NOPS - imv->insn_size;
+ add_nops((void *)
+ &vaddr[(bypass_eip & ~PAGE_MASK) + imv->insn_size],
+ len);
+ vunmap(vaddr);
+
+ target_after_int3 = insn + BREAKPOINT_INS_LEN;
+ /* register_die_notifier has memory barriers */
+ register_die_notifier(&imv_notify);
+ /* The breakpoint will single-step the bypass */
+ text_poke((void *)insn,
+ ((unsigned char[]){BREAKPOINT_INSTRUCTION}), 1);
+ /*
+ * Make sure the breakpoint is set before we continue (visible
+ * to other CPUs and interrupts).
+ */
+ wmb();
+ /*
+ * Execute serializing instruction on each CPU.
+ */
+ ret = on_each_cpu(imv_synchronize_core, NULL, 1);
+ BUG_ON(ret != 0);
+
+ text_poke((void *)(insn + opcode_size), (void *)imv->var,
+ imv->size);
+ /*
+ * Make sure the value can be seen from other CPUs and
+ * interrupts.
+ */
+ wmb();
+ text_poke((void *)insn, (unsigned char *)bypass_eip, 1);
+ /*
+ * Wait for all int3 handlers to end (interrupts are disabled in
+ * int3). This CPU is clearly not in a int3 handler, because
+ * int3 handler is not preemptible and there cannot be any more
+ * int3 handler called for this site, because we placed the
+ * original instruction back. synchronize_sched has memory
+ * barriers.
+ */
+ synchronize_sched();
+ unregister_die_notifier(&imv_notify);
+ /* unregister_die_notifier has memory barriers */
+ } else
+ text_poke_early((void *)imv->imv, (void *)imv->var,
+ imv->size);
+ return 0;
+}


--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-10-04 17:18:23

by Steven Rostedt

Subject: Re: [PATCH 0/3] ring-buffer: less locking and only disable preemption


On Sat, 4 Oct 2008, Mathieu Desnoyers wrote:
>
> Or use this code, based on a temporary breakpoint, to do the code
> patching (part of the -lttng tree). It does not require stop_machine at
> all and is nmi safe.
>

When this is supported for all archs, and can be done at all functions,
then I could use it.

I may just have the arch specific code use it, but we'll see.

Also, how good is it at patching 20,000 call sites?

-- Steve

2008-10-04 17:41:44

by Ingo Molnar

Subject: Re: [PATCH 0/3] ring-buffer: less locking and only disable preemption


* Ingo Molnar <[email protected]> wrote:

> * Steven Rostedt <[email protected]> wrote:
>
> > The dynamic function tracer is another issue. The problem with NMIs
> > has nothing to do with locking, or corrupting the buffers. It has to
> > do with the dynamic code modification. Whenever we modify code, we
> > must guarantee that it will not be executed on another CPU.
> >
> > Kstop_machine serves this purpose rather well. We can modify code
> > without worrying it will be executed on another CPU, except for NMIs.
> > The problem now comes where an NMI can come in and execute the code
> > being modified. That's why I put in all the notrace, lines. But it
> > gets difficult because of nmi_notifier can call all over the kernel.
> > Perhaps, we can simply disable the nmi-notifier when we are doing the
> > kstop_machine call?
>
> that would definitely be one way to reduce the cross section, but not
> enough i'm afraid. For example in the nmi_watchdog=2 case we call into
> various lapic functions and paravirt lapic handlers which makes it all
> spread to 3-4 paravirtualization flavors ...
>
> sched_clock()'s notrace aspects were pretty manageable, but this in
> its current form is not.

there's a relatively simple method that would solve all these
impact-size problems.

We cannot stop NMIs (and MCEs, etc.), but we can make kernel code
modifications atomic, by adding the following thin layer on top of it:

#define MAX_CODE_SIZE 10

int redo_len;
u8 *redo_vaddr;

u8 redo_buffer[MAX_CODE_SIZE];

atomic_t __read_mostly redo_pending;

and use it in do_nmi():

        if (unlikely(atomic_read(&redo_pending)))
                modify_code_redo();

i.e. when we modify code, we first fill in the redo_buffer[], redo_vaddr
and redo_len, then we set the redo_pending flag. Then we modify the kernel
code, and clear the redo_pending flag.

If an NMI (or MCE) handler intervenes, it will notice the pending
'transaction' and will copy redo_buffer[] to the (redo_vaddr,len)
location and will continue.

So as far as non-maskable contexts are concerned, kernel code patching
becomes an atomic operation. do_nmi() has to be marked notrace but
that's all and easy to maintain.
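
In sketch form (only the declarations above are from the pseudo-code;
modify_code_redo() and the patch sequence below are illustrative):

/* Sketch of the layer described above; builds on the declarations above. */
static void modify_code_redo(void)
{
        /*
         * Called early in do_nmi(): re-apply the full modification
         * locally, then serialize before running any further NMI code.
         */
        memcpy(redo_vaddr, redo_buffer, redo_len);
        sync_core();
}

static void redo_protected_patch(u8 *vaddr, const u8 *code, int len)
{
        BUG_ON(len > MAX_CODE_SIZE);

        /* Publish the redo record before touching the kernel text. */
        memcpy(redo_buffer, code, len);
        redo_vaddr = vaddr;
        redo_len = len;
        wmb();
        atomic_set(&redo_pending, 1);

        memcpy(vaddr, code, len);       /* the actual modification */

        atomic_set(&redo_pending, 0);
}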

Hm?

Ingo

2008-10-04 22:27:23

by Mathieu Desnoyers

Subject: Re: [PATCH 0/3] ring-buffer: less locking and only disable preemption

* Ingo Molnar ([email protected]) wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > * Steven Rostedt <[email protected]> wrote:
> >
> > > The dynamic function tracer is another issue. The problem with NMIs
> > > has nothing to do with locking, or corrupting the buffers. It has to
> > > do with the dynamic code modification. Whenever we modify code, we
> > > must guarantee that it will not be executed on another CPU.
> > >
> > > Kstop_machine serves this purpose rather well. We can modify code
> > > without worrying it will be executed on another CPU, except for NMIs.
> > > The problem now comes where an NMI can come in and execute the code
> > > being modified. That's why I put in all the notrace, lines. But it
> > > gets difficult because of nmi_notifier can call all over the kernel.
> > > Perhaps, we can simply disable the nmi-notifier when we are doing the
> > > kstop_machine call?
> >
> > that would definitely be one way to reduce the cross section, but not
> > enough i'm afraid. For example in the nmi_watchdog=2 case we call into
> > various lapic functions and paravirt lapic handlers which makes it all
> > spread to 3-4 paravirtualization flavors ...
> >
> > sched_clock()'s notrace aspects were pretty manageable, but this in
> > its current form is not.
>
> there's a relatively simple method that would solve all these
> impact-size problems.
>
> We cannot stop NMIs (and MCEs, etc.), but we can make kernel code
> modifications atomic, by adding the following thin layer ontop of it:
>
> #define MAX_CODE_SIZE 10
>
> int redo_len;
> u8 *redo_vaddr;
>
> u8 redo_buffer[MAX_CODE_SIZE];
>
> atomic_t __read_mostly redo_pending;
>
> and use it in do_nmi():
>
> if (unlikely(atomic_read(&redo_pending)))
> modify_code_redo();
>
> i.e. when we modify code, we first fill in the redo_buffer[], redo_vaddr
> and redo_len[], then we set redo_pending flag. Then we modify the kernel
> code, and clear the redo_pending flag.
>
> If an NMI (or MCE) handler intervenes, it will notice the pending
> 'transaction' and will copy redo_buffer[] to the (redo_vaddr,len)
> location and will continue.
>
> So as far as non-maskable contexts are concerned, kernel code patching
> becomes an atomic operation. do_nmi() has to be marked notrace but
> that's all and easy to maintain.
>
> Hm?
>

The comment at the beginning of
http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=blob;f=arch/x86/kernel/immediate.c;h=87a25db0efbd8f73d3d575e48541f2a179915da5;hb=b6148ea934f42e730571f41aa5a1a081a93995b5

explains that code modification on x86 SMP systems is not only a matter
of atomicity, but also a matter of not changing the code underneath a
running CPU which is making assumptions that it won't change underneath
without issuing a synchronizing instruction before the new code is used
by the CPU. The scheme you propose here takes care of atomicity, but
does not take care of the synchronization problem. A sync_core() would
probably be required when such modification is detected.

Also, speaking of plain atomicity, your scheme does not seem to protect
against NMIs running on a different CPU, because the non-atomic change
could race with such NMI.

Mathieu

> Ingo
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-10-04 23:21:17

by Steven Rostedt

Subject: Re: [PATCH 0/3] ring-buffer: less locking and only disable preemption


On Sat, 4 Oct 2008, Mathieu Desnoyers wrote:
> >
> > there's a relatively simple method that would solve all these
> > impact-size problems.
> >
> > We cannot stop NMIs (and MCEs, etc.), but we can make kernel code
> > modifications atomic, by adding the following thin layer ontop of it:
> >
> > #define MAX_CODE_SIZE 10
> >
> > int redo_len;
> > u8 *redo_vaddr;
> >
> > u8 redo_buffer[MAX_CODE_SIZE];
> >
> > atomic_t __read_mostly redo_pending;
> >
> > and use it in do_nmi():
> >
> > if (unlikely(atomic_read(&redo_pending)))
> > modify_code_redo();
> >
> > i.e. when we modify code, we first fill in the redo_buffer[], redo_vaddr
> > and redo_len[], then we set redo_pending flag. Then we modify the kernel
> > code, and clear the redo_pending flag.
> >
> > If an NMI (or MCE) handler intervenes, it will notice the pending
> > 'transaction' and will copy redo_buffer[] to the (redo_vaddr,len)
> > location and will continue.
> >
> > So as far as non-maskable contexts are concerned, kernel code patching
> > becomes an atomic operation. do_nmi() has to be marked notrace but
> > that's all and easy to maintain.
> >
> > Hm?
> >
>
> The comment at the beginning of
> http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=blob;f=arch/x86/kernel/immediate.c;h=87a25db0efbd8f73d3d575e48541f2a179915da5;hb=b6148ea934f42e730571f41aa5a1a081a93995b5

Mathieu, please stop pointing to git tree comments (especially those that
are not in mainline). If you have an actual technical PDF link, that
would be much better.

>
> explains that code modification on x86 SMP systems is not only a matter
> of atomicity, but also a matter of not changing the code underneath a
> running CPU which is making assumptions that it won't change underneath
> without issuing a synchronizing instruction before the new code is used
> by the CPU. The scheme you propose here takes care of atomicity, but
> does not take care of the synchronization problem. A sync_core() would
> probably be required when such modification is detected.
>
> Also, speaking of plain atomicity, you scheme does not seem to protect
> against NMIs running on a different CPU, because the non-atomic change
> could race with such NMI.

Ingo,

Mathieu is correct in this regard. We do not need to protect ourselves
from NMIs on the CPU that we execute the code on. We need to protect
ourselves from NMIs running on other CPUs.

-- Steve

2008-10-05 10:14:06

by Ingo Molnar

Subject: Re: [PATCH 0/3] ring-buffer: less locking and only disable preemption


* Mathieu Desnoyers <[email protected]> wrote:

> explains that code modification on x86 SMP systems is not only a
> matter of atomicity, but also a matter of not changing the code
> underneath a running CPU which is making assumptions that it won't
> change underneath without issuing a synchronizing instruction before
> the new code is used by the CPU. The scheme you propose here takes
> care of atomicity, but does not take care of the synchronization
> problem. A sync_core() would probably be required when such
> modification is detected.

that's wrong, my scheme protects against these cases: before _any_ code
is modified we set the redo_pending atomic flag, and make sure that
previous NMI handlers have stopped executing. (easy enough)

then the atomic update of redo_pending should be a sufficient barrier
for another CPU to notice the pending transaction.

Note that the cross-CPU modification can still be 'half done' when the
NMI hits, that's why we execute modify_code_redo() to 'redo' the full
modification before executing further NMI code. That is executed _on the
CPU_ that triggers an NMI, and the CPU itself is self-consistent.

( The modify_code_redo() will have to do a sync_cores() of course, like
all self-modifying code, to flush speculative execution. )

> Also, speaking of plain atomicity, you scheme does not seem to protect
> against NMIs running on a different CPU, because the non-atomic change
> could race with such NMI.

That's wrong too. Another CPU will notice that redo_pending is set and
will execute modify_code_redo() from its NMI handler _before_ calling
all the notifiers and other 'wide' code paths.

the only item that needs to be marked 'notrace' is the highlevel
do_nmi() handler itself. (as that executes before we have a chance to
execute modify_code_redo())

So we trade a large, fragile, and unmapped set of NMI-implicated
codepaths for a very tight, well controlled and easy to maintain
codepath.

Ingo

2008-10-06 13:53:35

by Mathieu Desnoyers

Subject: Re: [PATCH 0/3] ring-buffer: less locking and only disable preemption

* Ingo Molnar ([email protected]) wrote:
>
> * Mathieu Desnoyers <[email protected]> wrote:
>
> > explains that code modification on x86 SMP systems is not only a
> > matter of atomicity, but also a matter of not changing the code
> > underneath a running CPU which is making assumptions that it won't
> > change underneath without issuing a synchronizing instruction before
> > the new code is used by the CPU. The scheme you propose here takes
> > care of atomicity, but does not take care of the synchronization
> > problem. A sync_core() would probably be required when such
> > modification is detected.
>
> that's wrong, my scheme protects against these cases: before _any_ code
> is modified we set the redo_pending atomic flag, and make sure that
> previous NMI handlers have stopped executing. (easy enough)
>

Hi Ingo,

Hrm, how will this take care of the following race ?

CPU A                                    CPU B
- NMI fires
- NMI handler checks for
  redo_pending flag, == 0
- NMI handler runs code                  - set redo_pending
  about to be modified
                                         - NMI fires
                                         - NMI handler checks redo_pending,
                                           == 1, executes modify_code_redo()
        -- race : NMI on A executes code modified by B --
- NMI handler finished running
  code about to be modified

Mathieu

> then the atomic update of redo_pending should be a sufficient barrier
> for another CPU to notice the pending transaction.
>
> Note that the cross-CPU modification can still be 'half done' when the
> NMI hits, that's why we execute modify_code_redo() to 'redo' the full
> modification before executing further NMI code. That is executed _on the
> CPU_ that triggers an NMI, and the CPU itself is self-consistent.
>
> ( The modify_code_redo() will have to do a sync_cores() of course, like
> all self-modifying code, to flush speculative execution. )
>
> > Also, speaking of plain atomicity, you scheme does not seem to protect
> > against NMIs running on a different CPU, because the non-atomic change
> > could race with such NMI.
>
> That's wrong too. Another CPU will notice that redo_pending is set and
> will execute modify_code_redo() from its NMI handler _before_ calling
> all the notifiers and other 'wide' code paths.
>
> the only item that needs to be marked 'notrace' is only the highlevel
> do_nmi() handler itself. (as that executes before we have a chance to
> execute modify_code_redo())
>
> So we trade a large, fragile, and unmapped set of NMI-implicated
> codepaths for a very tight and well controlled an easy to maintain
> codepath.
>
> Ingo
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-10-06 17:10:42

by Mathieu Desnoyers

Subject: Re: [PATCH 0/3] ring-buffer: less locking and only disable preemption

* Steven Rostedt ([email protected]) wrote:
>
> On Sat, 4 Oct 2008, Mathieu Desnoyers wrote:
> > >
> > > there's a relatively simple method that would solve all these
> > > impact-size problems.
> > >
> > > We cannot stop NMIs (and MCEs, etc.), but we can make kernel code
> > > modifications atomic, by adding the following thin layer ontop of it:
> > >
> > > #define MAX_CODE_SIZE 10
> > >
> > > int redo_len;
> > > u8 *redo_vaddr;
> > >
> > > u8 redo_buffer[MAX_CODE_SIZE];
> > >
> > > atomic_t __read_mostly redo_pending;
> > >
> > > and use it in do_nmi():
> > >
> > > if (unlikely(atomic_read(&redo_pending)))
> > > modify_code_redo();
> > >
> > > i.e. when we modify code, we first fill in the redo_buffer[], redo_vaddr
> > > and redo_len[], then we set redo_pending flag. Then we modify the kernel
> > > code, and clear the redo_pending flag.
> > >
> > > If an NMI (or MCE) handler intervenes, it will notice the pending
> > > 'transaction' and will copy redo_buffer[] to the (redo_vaddr,len)
> > > location and will continue.
> > >
> > > So as far as non-maskable contexts are concerned, kernel code patching
> > > becomes an atomic operation. do_nmi() has to be marked notrace but
> > > that's all and easy to maintain.
> > >
> > > Hm?
> > >
> >
> > The comment at the beginning of
> > http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=blob;f=arch/x86/kernel/immediate.c;h=87a25db0efbd8f73d3d575e48541f2a179915da5;hb=b6148ea934f42e730571f41aa5a1a081a93995b5
>
> Mathieu, please stop pointing to git tree comments (especially those that
> are not in mainline). If you have an actual technical PDF link, that
> would be much better.
>

Hi Steven,

The top 10 lines of the comment the URL points to :

Intel Core 2 Duo Processor for Intel Centrino Duo Processor Technology
Specification Update, AH33

(direct link :
ftp://download.intel.com/design/mobile/SPECUPDT/31407918.pdf)

AH33 -> Page 48

<Quote>
Problem :

The act of one processor, or system bus master, writing data into a
currently executing code segment of a second processor with the intent
of having the second processor execute that data as code is called
cross-modifying code (XMC). XMC that does not force the second processor
to execute a synchronizing instruction, prior to execution of the new
code, is called unsynchronized XMC. Software using unsynchronized XMC to
modify the instruction byte stream of a processor can see unexpected or
unpredictable execution behavior from the processor that is executing
the modified code.
</Quote>

What my patch does is exactly this : it forces the second CPU to issue a
synchronizing instruction (either iret from the breakpoint or cpuid)
before the new instruction is reachable by any CPU. It therefore turns
what would otherwise be an unsynchronized XMC into a synchronized XMC.

And yes, patching 20,000 sites can be made incredibly fast for the
5-byte call/nop code-patching case, because all the breakpoint handler
has to do is increment the return IP by 4 bytes (1 byte for the
breakpoint, 4 bytes must be skipped). However, we would have to keep a
hash table of the modified instruction pointers around so the breakpoint
handler can know why the breakpoint happened. After the breakpoint is
removed, given that interrupts are disabled in the int3 gate, this hash
table has to be kept around until all the currently running IRQ handlers
have finished their execution.
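
As a sketch, such a breakpoint handler could look like the following;
patched_sites_contains() and the hash table behind it are hypothetical:

#include <linux/kdebug.h>
#include <linux/notifier.h>
#include <linux/ptrace.h>

#define CALL_INSN_SIZE  5       /* 5-byte call/nop: 1 byte of int3 + 4 to skip */

/* Hypothetical hash lookup of the instruction pointers being patched. */
static int patched_sites_contains(unsigned long ip);

static int mass_patch_notifier(struct notifier_block *nb,
                               unsigned long val, void *data)
{
        struct die_args *args = data;

        if (val != DIE_INT3 || !args->regs || user_mode_vm(args->regs))
                return NOTIFY_DONE;

        /*
         * regs->ip points just past the int3.  If the byte we replaced
         * belongs to one of our call sites, skip the remaining 4 bytes
         * of the original 5-byte instruction and resume.
         */
        if (patched_sites_contains(args->regs->ip - 1)) {
                args->regs->ip += CALL_INSN_SIZE - 1;
                return NOTIFY_STOP;
        }

        return NOTIFY_DONE;
}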

Mathieu

> >
> > explains that code modification on x86 SMP systems is not only a matter
> > of atomicity, but also a matter of not changing the code underneath a
> > running CPU which is making assumptions that it won't change underneath
> > without issuing a synchronizing instruction before the new code is used
> > by the CPU. The scheme you propose here takes care of atomicity, but
> > does not take care of the synchronization problem. A sync_core() would
> > probably be required when such modification is detected.
> >
> > Also, speaking of plain atomicity, you scheme does not seem to protect
> > against NMIs running on a different CPU, because the non-atomic change
> > could race with such NMI.
>
> Ingo,
>
> Mathieu is correct in this regard. We do not neet to protect ourselves
> from NMIs on the CPU that we execute the code on. We need to protect
> ourselves from NMIs running on other CPUS.
>
> -- Steve
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-10-06 17:14:01

by Mathieu Desnoyers

Subject: Re: [PATCH 0/3] ring-buffer: less locking and only disable preemption

* Steven Rostedt ([email protected]) wrote:
>
> On Sat, 4 Oct 2008, Mathieu Desnoyers wrote:
> >
> > Or use this code, based on a temporary breakpoint, to do the code
> > patching (part of the -lttng tree). It does not require stop_machine at
> > all and is nmi safe.
> >
>
> When this is supported for all archs, and can be done at all functions
> then I could use it.
>

How about incrementally using this piece of infrastructure when
available on a given architecture ? This way we keep a sub-optimal
fall-back for archs which do not support NMI-safe code patching and
incrementally get the optimal behavior. Otherwise, we would require any
new architecture to implement that up-front, which I doubt is a good
idea.
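
For example (a sketch with made-up function names), the generic code
could provide a stop_machine-based weak default that an architecture
overrides once it has an NMI-safe mechanism:

#include <linux/errno.h>
#include <linux/stop_machine.h>
#include <linux/string.h>

struct sketch_patch_args {
        void            *addr;
        const void      *code;
        int             len;
};

static int sketch_stop_machine_patch(void *data)
{
        struct sketch_patch_args *p = data;

        memcpy(p->addr, p->code, p->len);
        return 0;
}

/* Weak default: no NMI-safe mechanism on this architecture (yet). */
int __weak arch_nmi_safe_modify_code(void *addr, const void *code, int len)
{
        return -ENOSYS;
}

static int sketch_modify_code(void *addr, const void *code, int len)
{
        struct sketch_patch_args p = { .addr = addr, .code = code, .len = len };
        int ret = arch_nmi_safe_modify_code(addr, code, len);

        if (ret != -ENOSYS)
                return ret;

        /* Sub-optimal fallback: patch under stop_machine(). */
        return stop_machine(sketch_stop_machine_patch, &p, NULL);
}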

> I may just have the arch specific code use it, but we'll see.
>
> Also, how good is it at patching 20,000 call sites?
>

Can be done really fast using a hash table, see my previous mail.

Mathieu

> -- Steve
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68