2007-06-15 21:11:13

by Mathieu Desnoyers

Subject: [patch 4/8] Immediate Value - i386 Optimization

i386 optimization of the immediate values. It uses a movl, with code patching,
to set/unset the value loaded into the register used for the branch test.
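
For illustration only (not part of this patch), a call site is expected to look
roughly like the following; the flag name and the traced function are
hypothetical, and how the flag is declared and flipped is left to the generic
immediate values infrastructure:

        if (unlikely(immediate(trace_active)))  /* trace_active: hypothetical flag */
                trace_event();                  /* hypothetical rarely-taken path */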

Signed-off-by: Mathieu Desnoyers <[email protected]>
---
arch/i386/kernel/Makefile | 1
arch/i386/kernel/immediate.c | 177 +++++++++++++++++++++++++++++++++++++++++++
include/asm-i386/immediate.h | 72 +++++++++++++++++
3 files changed, 249 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/include/asm-i386/immediate.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-i386/immediate.h 2007-06-15 16:13:55.000000000 -0400
+++ linux-2.6-lttng/include/asm-i386/immediate.h 2007-06-15 16:14:04.000000000 -0400
@@ -1 +1,71 @@
-#include <asm-generic/immediate.h>
+#ifndef _ASM_I386_IMMEDIATE_H
+#define _ASM_I386_IMMEDIATE_H
+
+/*
+ * Immediate values. i386 architecture optimizations.
+ *
+ * (C) Copyright 2006 Mathieu Desnoyers <[email protected]>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#define IF_DEFAULT (IF_OPTIMIZED | IF_LOCKDEP)
+
+/*
+ * Optimized version of the immediate. We pass the flags as a pointer to
+ * the inline assembly to trick it into putting the flags value as the third
+ * parameter in the structure.
+ */
+#define immediate_optimized(flags, var) \
+ ({ \
+ int condition; \
+ asm ( ".section __immediate, \"a\", @progbits;\n\t" \
+ ".long %1, 0f, %2;\n\t" \
+ ".previous;\n\t" \
+ "0:\n\t" \
+ "movl %3,%0;\n\t" \
+ : "=r" (condition) \
+ : "m" (var), \
+ "m" (*(char*)flags), \
+ "i" (0)); \
+ (condition); \
+ })
+
+/*
+ * immediate macro selecting the generic or optimized version of immediate,
+ * depending on the flags specified. It is a macro because we need to pass the
+ * name to immediate_optimized() and immediate_generic() so they can declare a
+ * static variable with it.
+ */
+#define _immediate(flags, var) \
+({ \
+ (((flags) & IF_LOCKDEP) && ((flags) & IF_OPTIMIZED)) ? \
+ immediate_optimized(flags, var) : \
+ immediate_generic(flags, var); \
+})
+
+/* immediate with default behavior */
+#define immediate(var) _immediate(IF_DEFAULT, var)
+
+/*
+ * Architecture-dependent immediate information, used internally for immediate
+ * activation.
+ */
+
+/*
+ * Offset of the immediate value from the start of the movl instruction, in
+ * bytes. We point to the lowest byte of the 4-byte immediate value. Changing
+ * only one byte guarantees an atomic memory write, regardless of the alignment
+ * of the 4-byte immediate within the load instruction.
+ */
+#define IMMEDIATE_OPTIMIZED_ENABLE_IMMEDIATE_OFFSET 1
+#define IMMEDIATE_OPTIMIZED_ENABLE_TYPE unsigned char
+/* Dereference enable as lvalue from a pointer to its instruction */
+#define IMMEDIATE_OPTIMIZED_ENABLE(a) \
+ (*(IMMEDIATE_OPTIMIZED_ENABLE_TYPE*) \
+ ((char*)(a)+IMMEDIATE_OPTIMIZED_ENABLE_IMMEDIATE_OFFSET))
+
+extern int immediate_optimized_set_enable(void *address, char enable);
+
+#endif /* _ASM_I386_IMMEDIATE_H */
Index: linux-2.6-lttng/arch/i386/kernel/Makefile
===================================================================
--- linux-2.6-lttng.orig/arch/i386/kernel/Makefile 2007-06-15 16:13:51.000000000 -0400
+++ linux-2.6-lttng/arch/i386/kernel/Makefile 2007-06-15 16:14:04.000000000 -0400
@@ -35,6 +35,7 @@
obj-y += sysenter.o vsyscall.o
obj-$(CONFIG_ACPI_SRAT) += srat.o
obj-$(CONFIG_EFI) += efi.o efi_stub.o
+obj-$(CONFIG_IMMEDIATE) += immediate.o
obj-$(CONFIG_DOUBLEFAULT) += doublefault.o
obj-$(CONFIG_SERIAL_8250) += legacy_serial.o
obj-$(CONFIG_VM86) += vm86.o
Index: linux-2.6-lttng/arch/i386/kernel/immediate.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/arch/i386/kernel/immediate.c 2007-06-15 16:14:04.000000000 -0400
@@ -0,0 +1,177 @@
+/*
+ * Immediate Value - i386 architecture specific code.
+ *
+ * Rationale
+ *
+ * Required because of :
+ * - Erratum 49 fix for Intel PIII.
+ * - Still present on newer processors : Intel Core 2 Duo Processor for Intel
+ * Centrino Duo Processor Technology Specification Update, AH33.
+ * Unsynchronized Cross-Modifying Code Operations Can Cause Unexpected
+ * Instruction Execution Results.
+ *
+ * Permits immediate value modification by XMC with correct serialization.
+ *
+ * Reentrant for NMI and trap handler instrumentation. Permits XMC at a
+ * location where preemption is enabled, because it involves no temporary or
+ * reused data structure.
+ *
+ * Quoting Richard J Moore, the source of the information motivating this
+ * implementation. It differs from the one proposed by Intel, which is not
+ * suitable for kernel context (it does not support NMI and would require
+ * disabling interrupts on every CPU for a long period):
+ *
+ * "There is another issue to consider when looking into using probes other
+ * then int3:
+ *
+ * Intel erratum 54 - Unsynchronized Cross-modifying code - refers to the
+ * practice of modifying code on one processor where another has prefetched
+ * the unmodified version of the code. Intel states that unpredictable general
+ * protection faults may result if a synchronizing instruction (iret, int,
+ * int3, cpuid, etc ) is not executed on the second processor before it
+ * executes the pre-fetched out-of-date copy of the instruction.
+ *
+ * When we became aware of this I had a long discussion with Intel's
+ * microarchitecture guys. It turns out that the reason for this erratum
+ * (which incidentally Intel does not intend to fix) is because the trace
+ * cache - the stream of micro-ops resulting from instruction interpretation -
+ * cannot be guaranteed to be valid. Reading between the lines I assume this
+ * issue arises because of optimization done in the trace cache, where it is
+ * no longer possible to identify the original instruction boundaries. If the
+ * CPU discovers that the trace cache has been invalidated because of
+ * unsynchronized cross-modification then instruction execution will be
+ * aborted with a GPF. Further discussion with Intel revealed that replacing
+ * the first opcode byte with an int3 would not be subject to this erratum.
+ *
+ * So, is cmpxchg reliable? One has to guarantee more than mere atomicity."
+ *
+ * Overall design
+ *
+ * The algorithm proposed by Intel does not apply well in kernel context: it
+ * would imply disabling interrupts and looping on every CPU while modifying
+ * the code, and it would not support instrumentation of code called from
+ * interrupt sources that cannot be disabled.
+ *
+ * Therefore, we use a different algorithm to respect Intel's erratum (see the
+ * quoted discussion above). We make sure that no CPU sees an out-of-date copy
+ * of a pre-fetched instruction by 1 - using a breakpoint, which skips the
+ * instruction that is going to be modified, 2 - issuing an IPI to every CPU to
+ * execute a sync_core(), to make sure that even when the breakpoint is removed,
+ * no CPU could possibly still have the out-of-date copy of the instruction,
+ * 3 - modifying the now unused 2nd byte of the instruction, and 4 - putting
+ * back the original 1st byte of the instruction.
+ *
+ * It has exactly the same intent as the algorithm proposed by Intel, but
+ * it has less side-effects, scales better and supports NMI, SMI and MCE.
+ *
+ * Mathieu Desnoyers <[email protected]>
+ */
+
+#include <linux/notifier.h>
+#include <linux/preempt.h>
+#include <linux/smp.h>
+#include <linux/notifier.h>
+#include <linux/module.h>
+#include <linux/immediate.h>
+#include <linux/kdebug.h>
+
+#include <asm/cacheflush.h>
+
+#define BREAKPOINT_INSTRUCTION 0xcc
+#define BREAKPOINT_INS_LEN 1
+
+static long target_eip;
+
+static void immediate_synchronize_core(void *info)
+{
+ sync_core(); /* use cpuid to stop speculative execution */
+}
+
+/*
+ * The eip value points right after the breakpoint instruction, in the second
+ * byte of the movb. Incrementing it by 1 byte makes the code resume right after
+ * the movb instruction, effectively skipping this instruction.
+ *
+ * We simply skip the 2-byte load immediate here, leaving the register in an
+ * undefined state. We don't care about the content (0 or !0), because we are
+ * changing the value 0->1 or 1->0. This small window of undefined value
+ * doesn't matter.
+ */
+static int immediate_notifier(struct notifier_block *nb,
+ unsigned long val, void *data)
+{
+ enum die_val die_val = (enum die_val) val;
+ struct die_args *args = data;
+
+ if (!args->regs || user_mode_vm(args->regs))
+ return NOTIFY_DONE;
+
+ if (die_val == DIE_INT3 && args->regs->eip == target_eip) {
+ args->regs->eip += 1; /* Skip the next byte of load immediate */
+ return NOTIFY_STOP;
+ }
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block immediate_notify = {
+ .notifier_call = immediate_notifier,
+ .priority = 0x7fffffff, /* we need to be notified first */
+};
+
+/*
+ * The address is not aligned. We can only change 1 byte of the value
+ * atomically.
+ * Must be called with immediate_mutex held.
+ */
+int immediate_optimized_set_enable(void *address, char enable)
+{
+ char saved_byte;
+ int ret;
+ char *dest = address;
+
+ if (!(enable ^ dest[1])) /* Must be a state change 0<->1 to execute */
+ return 0;
+
+#if defined(CONFIG_DEBUG_RODATA) || defined(CONFIG_DEBUG_PAGEALLOC)
+ /* Make sure this page is writable */
+ change_page_attr(virt_to_page(address), 1, PAGE_KERNEL_EXEC);
+ global_flush_tlb();
+#endif
+ target_eip = (long)address + BREAKPOINT_INS_LEN;
+ /* register_die_notifier has memory barriers */
+ register_die_notifier(&immediate_notify);
+ saved_byte = *dest;
+ *dest = BREAKPOINT_INSTRUCTION;
+ wmb();
+ /*
+ * Execute serializing instruction on each CPU.
+ * Acts as a memory barrier.
+ */
+ ret = on_each_cpu(immediate_synchronize_core, NULL, 1, 1);
+ BUG_ON(ret != 0);
+
+ dest[1] = enable;
+ wmb();
+ *dest = saved_byte;
+ /*
+ * Wait for all int3 handlers to end
+ * (interrupts are disabled in int3).
+ * This CPU is clearly not in an int3 handler,
+ * because the int3 handler is not preemptible and
+ * no further int3 handlers can be called
+ * for this site, because we placed the original
+ * instruction back.
+ * synchronize_sched has memory barriers.
+ */
+ synchronize_sched();
+ unregister_die_notifier(&immediate_notify);
+ /* unregister_die_notifier has memory barriers */
+ target_eip = 0;
+#if defined(CONFIG_DEBUG_RODATA)
+ /* Set this page back to RX */
+ change_page_attr(virt_to_page(address), 1, PAGE_KERNEL_RX);
+ global_flush_tlb();
+#endif
+ return 0;
+}
+EXPORT_SYMBOL_GPL(immediate_optimized_set_enable);

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68


2007-06-15 22:02:22

by Chuck Ebbert

Subject: Re: [patch 4/8] Immediate Value - i386 Optimization

On 06/15/2007 04:23 PM, Mathieu Desnoyers wrote:
> i386 optimization of the immediate values which uses a movl with code patching
> to set/unset the value used to populate the register used for the branch test.
>
> Signed-off-by: Mathieu Desnoyers <[email protected]>
> ---
> arch/i386/kernel/Makefile | 1
> arch/i386/kernel/immediate.c | 177 +++++++++++++++++++++++++++++++++++++++++++
> include/asm-i386/immediate.h | 72 +++++++++++++++++
> 3 files changed, 249 insertions(+), 1 deletion(-)
>
> Index: linux-2.6-lttng/include/asm-i386/immediate.h
> ===================================================================
> --- linux-2.6-lttng.orig/include/asm-i386/immediate.h 2007-06-15 16:13:55.000000000 -0400
> +++ linux-2.6-lttng/include/asm-i386/immediate.h 2007-06-15 16:14:04.000000000 -0400
> @@ -1 +1,71 @@
> -#include <asm-generic/immediate.h>
> +#ifndef _ASM_I386_IMMEDIATE_H
> +#define _ASM_I386_IMMEDIATE_H
> +
> +/*
> + * Immediate values. i386 architecture optimizations.
> + *
> + * (C) Copyright 2006 Mathieu Desnoyers <[email protected]>
> + *
> + * This file is released under the GPLv2.
> + * See the file COPYING for more details.
> + */
> +
> +#define IF_DEFAULT (IF_OPTIMIZED | IF_LOCKDEP)
> +
> +/*
> + * Optimized version of the immediate. Passing the flags as a pointer to
> + * the inline assembly to trick it into putting the flags value as third
> + * parameter in the structure.
> + */
> +#define immediate_optimized(flags, var) \
> + ({ \
> + int condition; \
> + asm ( ".section __immediate, \"a\", @progbits;\n\t" \
> + ".long %1, 0f, %2;\n\t" \
> + ".previous;\n\t" \
> + "0:\n\t" \
> + "movl %3,%0;\n\t" \
> + : "=r" (condition) \
> + : "m" (var), \
> + "m" (*(char*)flags), \
> + "i" (0)); \
> + (condition); \

Unnecessary parens.

> + })
> +
> +/*
> + * immediate macro selecting the generic or optimized version of immediate,
> + * depending on the flags specified. It is a macro because we need to pass the
> + * name to immediate_optimized() and immediate_generic() so they can declare a
> + * static variable with it.
> + */
> +#define _immediate(flags, var) \
> +({ \
> + (((flags) & IF_LOCKDEP) && ((flags) & IF_OPTIMIZED)) ? \
> + immediate_optimized(flags, var) : \
> + immediate_generic(flags, var); \
> +})
> +
> +/* immediate with default behavior */
> +#define immediate(var) _immediate(IF_DEFAULT, var)
> +
> +/*
> + * Architecture dependant immediate information, used internally for immediate
> + * activation.
> + */
> +
> +/*
> + * Offset of the immediate value from the start of the movl instruction, in
> + * bytes. We point to the first lower byte of the 4 bytes immediate value. Only
> + * changing one byte makes sure we do an atomic memory write, independently of
> + * the alignment of the 4 bytes in the load immediate instruction.
> + */
> +#define IMMEDIATE_OPTIMIZED_ENABLE_IMMEDIATE_OFFSET 1
> +#define IMMEDIATE_OPTIMIZED_ENABLE_TYPE unsigned char
> +/* Dereference enable as lvalue from a pointer to its instruction */
> +#define IMMEDIATE_OPTIMIZED_ENABLE(a) \
> + (*(IMMEDIATE_OPTIMIZED_ENABLE_TYPE*) \
> + ((char*)(a)+IMMEDIATE_OPTIMIZED_ENABLE_IMMEDIATE_OFFSET))
> +
> +extern int immediate_optimized_set_enable(void *address, char enable);
> +
> +#endif /* _ASM_I386_IMMEDIATE_H */
> Index: linux-2.6-lttng/arch/i386/kernel/Makefile
> ===================================================================
> --- linux-2.6-lttng.orig/arch/i386/kernel/Makefile 2007-06-15 16:13:51.000000000 -0400
> +++ linux-2.6-lttng/arch/i386/kernel/Makefile 2007-06-15 16:14:04.000000000 -0400
> @@ -35,6 +35,7 @@
> obj-y += sysenter.o vsyscall.o
> obj-$(CONFIG_ACPI_SRAT) += srat.o
> obj-$(CONFIG_EFI) += efi.o efi_stub.o
> +obj-$(CONFIG_IMMEDIATE) += immediate.o
> obj-$(CONFIG_DOUBLEFAULT) += doublefault.o
> obj-$(CONFIG_SERIAL_8250) += legacy_serial.o
> obj-$(CONFIG_VM86) += vm86.o
> Index: linux-2.6-lttng/arch/i386/kernel/immediate.c
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6-lttng/arch/i386/kernel/immediate.c 2007-06-15 16:14:04.000000000 -0400
> @@ -0,0 +1,177 @@
> +/*
> + * Immediate Value - i386 architecture specific code.
> + *
> + * Rationale
> + *
> + * Required because of :
> + * - Erratum 49 fix for Intel PIII.
> + * - Still present on newer processors : Intel Core 2 Duo Processor for Intel
> + * Centrino Duo Processor Technology Specification Update, AH33.
> + * Unsynchronized Cross-Modifying Code Operations Can Cause Unexpected
> + * Instruction Execution Results.
> + *
> + * Permits immediate value modification by XMC with correct serialization.
> + *
> + * Reentrant for NMI and trap handler instrumentation. Permits XMC to a
> + * location that has preemption enabled because it involves no temporary or
> + * reused data structure.
> + *
> + * Quoting Richard J Moore, source of the information motivating this
> + * implementation which differs from the one proposed by Intel which is not
> + * suitable for kernel context (does not support NMI and would require disabling
> + * interrupts on every CPU for a long period) :
> + *
> + * "There is another issue to consider when looking into using probes other
> + * then int3:
> + *
> + * Intel erratum 54 - Unsynchronized Cross-modifying code - refers to the
> + * practice of modifying code on one processor where another has prefetched
> + * the unmodified version of the code. Intel states that unpredictable general
> + * protection faults may result if a synchronizing instruction (iret, int,
> + * int3, cpuid, etc ) is not executed on the second processor before it
> + * executes the pre-fetched out-of-date copy of the instruction.
> + *
> + * When we became aware of this I had a long discussion with Intel's
> + * microarchitecture guys. It turns out that the reason for this erratum
> + * (which incidentally Intel does not intend to fix) is because the trace
> + * cache - the stream of micorops resulting from instruction interpretation -
> + * cannot guaranteed to be valid. Reading between the lines I assume this
> + * issue arises because of optimization done in the trace cache, where it is
> + * no longer possible to identify the original instruction boundaries. If the
> + * CPU discoverers that the trace cache has been invalidated because of
> + * unsynchronized cross-modification then instruction execution will be
> + * aborted with a GPF. Further discussion with Intel revealed that replacing
> + * the first opcode byte with an int3 would not be subject to this erratum.
> + *
> + * So, is cmpxchg reliable? One has to guarantee more than mere atomicity."
> + *
> + * Overall design
> + *
> + * The algorithm proposed by Intel applies not so well in kernel context: it
> + * would imply disabling interrupts and looping on every CPUs while modifying
> + * the code and would not support instrumentation of code called from interrupt
> + * sources that cannot be disabled.
> + *
> + * Therefore, we use a different algorithm to respect Intel's erratum (see the
> + * quoted discussion above). We make sure that no CPU sees an out-of-date copy
> + * of a pre-fetched instruction by 1 - using a breakpoint, which skips the
> + * instruction that is going to be modified, 2 - issuing an IPI to every CPU to
> + * execute a sync_core(), to make sure that even when the breakpoint is removed,
> + * no cpu could possibly still have the out-of-date copy of the instruction,
> + * modify the now unused 2nd byte of the instruction, and then put back the
> + * original 1st byte of the instruction.
> + *
> + * It has exactly the same intent as the algorithm proposed by Intel, but
> + * it has less side-effects, scales better and supports NMI, SMI and MCE.

Algorithm looks plausible, was it really tested?

> + *
> + * Mathieu Desnoyers <[email protected]>
> + */
> +
> +#include <linux/notifier.h>
> +#include <linux/preempt.h>
> +#include <linux/smp.h>
> +#include <linux/notifier.h>
> +#include <linux/module.h>
> +#include <linux/immediate.h>
> +#include <linux/kdebug.h>
> +
> +#include <asm/cacheflush.h>
> +
> +#define BREAKPOINT_INSTRUCTION 0xcc
> +#define BREAKPOINT_INS_LEN 1
> +
> +static long target_eip;
> +
> +static void immediate_synchronize_core(void *info)
> +{
> + sync_core(); /* use cpuid to stop speculative execution */
> +}
> +
> +/*
> + * The eip value points right after the breakpoint instruction, in the second
> + * byte of the movb. Incrementing it of 1 byte makes the code resume right after
> + * the movb instruction, effectively skipping this instruction.
> + *
> + * We simply skip the 2 bytes load immediate here, leaving the register in an
> + * undefined state. We don't care about the content (0 or !0), because we are
> + * changing the value 0->1 or 1->0. This small window of undefined value
> + * doesn't matter.
> + */
> +static int immediate_notifier(struct notifier_block *nb,
> + unsigned long val, void *data)
> +{
> + enum die_val die_val = (enum die_val) val;
> + struct die_args *args = data;
> +
> + if (!args->regs || user_mode_vm(args->regs))
> + return NOTIFY_DONE;
> +
> + if (die_val == DIE_INT3 && args->regs->eip == target_eip) {
> + args->regs->eip += 1; /* Skip the next byte of load immediate */

<nitpick>
args->regs->eip++;

> + return NOTIFY_STOP;
> + }
> + return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block immediate_notify = {
> + .notifier_call = immediate_notifier,
> + .priority = 0x7fffffff, /* we need to be notified first */
> +};
> +
> +/*
> + * The address is not aligned. We can only change 1 byte of the value
> + * atomically.
> + * Must be called with immediate_mutex held.
> + */
> +int immediate_optimized_set_enable(void *address, char enable)
> +{
> + char saved_byte;
> + int ret;
> + char *dest = address;
> +
> + if (!(enable ^ dest[1])) /* Must be a state change 0<->1 to execute */
> + return 0;
> +
> +#if defined(CONFIG_DEBUG_RODATA) || defined(CONFIG_DEBUG_PAGEALLOC)
> + /* Make sure this page is writable */
> + change_page_attr(virt_to_page(address), 1, PAGE_KERNEL_EXEC);
> + global_flush_tlb();
> +#endif

Can't we have a macro or inline to do this, and the setting back
to read-only? kprobes also has the same ugly #if blocks...

Hmm, what happens if you race with kprobes setting a probe on
the same page? Couldn't it unexpectedly become read-only?

> + target_eip = (long)address + BREAKPOINT_INS_LEN;
> + /* register_die_notifier has memory barriers */
> + register_die_notifier(&immediate_notify);
> + saved_byte = *dest;
> + *dest = BREAKPOINT_INSTRUCTION;
> + wmb();
> + /*
> + * Execute serializing instruction on each CPU.
> + * Acts as a memory barrier.
> + */
> + ret = on_each_cpu(immediate_synchronize_core, NULL, 1, 1);
> + BUG_ON(ret != 0);
> +
> + dest[1] = enable;
> + wmb();
> + *dest = saved_byte;
> + /*
> + * Wait for all int3 handlers to end
> + * (interrupts are disabled in int3).
> + * This CPU is clearly not in a int3 handler,
> + * because int3 handler is not preemptible and
> + * there cannot be any more int3 handler called
> + * for this site, because we placed the original
> + * instruction back.
> + * synchronize_sched has memory barriers.
> + */
> + synchronize_sched();
> + unregister_die_notifier(&immediate_notify);
> + /* unregister_die_notifier has memory barriers */
> + target_eip = 0;
> +#if defined(CONFIG_DEBUG_RODATA)
> + /* Set this page back to RX */
> + change_page_attr(virt_to_page(address), 1, PAGE_KERNEL_RX);
> + global_flush_tlb();
> +#endif
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(immediate_optimized_set_enable);
>

2007-06-17 18:19:52

by Mathieu Desnoyers

Subject: Re: [patch 4/8] Immediate Value - i386 Optimization

* Chuck Ebbert ([email protected]) wrote:
> On 06/15/2007 04:23 PM, Mathieu Desnoyers wrote:
> > i386 optimization of the immediate values which uses a movl with code patching
> > to set/unset the value used to populate the register used for the branch test.
> >
> > Signed-off-by: Mathieu Desnoyers <[email protected]>
> > ---
> > arch/i386/kernel/Makefile | 1
> > arch/i386/kernel/immediate.c | 177 +++++++++++++++++++++++++++++++++++++++++++
> > include/asm-i386/immediate.h | 72 +++++++++++++++++
> > 3 files changed, 249 insertions(+), 1 deletion(-)
> >
> > Index: linux-2.6-lttng/include/asm-i386/immediate.h
> > ===================================================================
> > --- linux-2.6-lttng.orig/include/asm-i386/immediate.h 2007-06-15 16:13:55.000000000 -0400
> > +++ linux-2.6-lttng/include/asm-i386/immediate.h 2007-06-15 16:14:04.000000000 -0400
> > @@ -1 +1,71 @@
> > -#include <asm-generic/immediate.h>
> > +#ifndef _ASM_I386_IMMEDIATE_H
> > +#define _ASM_I386_IMMEDIATE_H
> > +
> > +/*
> > + * Immediate values. i386 architecture optimizations.
> > + *
> > + * (C) Copyright 2006 Mathieu Desnoyers <[email protected]>
> > + *
> > + * This file is released under the GPLv2.
> > + * See the file COPYING for more details.
> > + */
> > +
> > +#define IF_DEFAULT (IF_OPTIMIZED | IF_LOCKDEP)
> > +
> > +/*
> > + * Optimized version of the immediate. Passing the flags as a pointer to
> > + * the inline assembly to trick it into putting the flags value as third
> > + * parameter in the structure.
> > + */
> > +#define immediate_optimized(flags, var) \
> > + ({ \
> > + int condition; \
> > + asm ( ".section __immediate, \"a\", @progbits;\n\t" \
> > + ".long %1, 0f, %2;\n\t" \
> > + ".previous;\n\t" \
> > + "0:\n\t" \
> > + "movl %3,%0;\n\t" \
> > + : "=r" (condition) \
> > + : "m" (var), \
> > + "m" (*(char*)flags), \
> > + "i" (0)); \
> > + (condition); \
>
Hi Chuck,

> Unnecessary parens.
>

ok, fixed in each immediate.h.

> > + })
> > +
> > +/*
> > + * immediate macro selecting the generic or optimized version of immediate,
> > + * depending on the flags specified. It is a macro because we need to pass the
> > + * name to immediate_optimized() and immediate_generic() so they can declare a
> > + * static variable with it.
> > + */
> > +#define _immediate(flags, var) \
> > +({ \
> > + (((flags) & IF_LOCKDEP) && ((flags) & IF_OPTIMIZED)) ? \
> > + immediate_optimized(flags, var) : \
> > + immediate_generic(flags, var); \
> > +})
> > +
> > +/* immediate with default behavior */
> > +#define immediate(var) _immediate(IF_DEFAULT, var)
> > +
> > +/*
> > + * Architecture dependant immediate information, used internally for immediate
> > + * activation.
> > + */
> > +
> > +/*
> > + * Offset of the immediate value from the start of the movl instruction, in
> > + * bytes. We point to the first lower byte of the 4 bytes immediate value. Only
> > + * changing one byte makes sure we do an atomic memory write, independently of
> > + * the alignment of the 4 bytes in the load immediate instruction.
> > + */
> > +#define IMMEDIATE_OPTIMIZED_ENABLE_IMMEDIATE_OFFSET 1
> > +#define IMMEDIATE_OPTIMIZED_ENABLE_TYPE unsigned char
> > +/* Dereference enable as lvalue from a pointer to its instruction */
> > +#define IMMEDIATE_OPTIMIZED_ENABLE(a) \
> > + (*(IMMEDIATE_OPTIMIZED_ENABLE_TYPE*) \
> > + ((char*)(a)+IMMEDIATE_OPTIMIZED_ENABLE_IMMEDIATE_OFFSET))
> > +
> > +extern int immediate_optimized_set_enable(void *address, char enable);
> > +
> > +#endif /* _ASM_I386_IMMEDIATE_H */
> > Index: linux-2.6-lttng/arch/i386/kernel/Makefile
> > ===================================================================
> > --- linux-2.6-lttng.orig/arch/i386/kernel/Makefile 2007-06-15 16:13:51.000000000 -0400
> > +++ linux-2.6-lttng/arch/i386/kernel/Makefile 2007-06-15 16:14:04.000000000 -0400
> > @@ -35,6 +35,7 @@
> > obj-y += sysenter.o vsyscall.o
> > obj-$(CONFIG_ACPI_SRAT) += srat.o
> > obj-$(CONFIG_EFI) += efi.o efi_stub.o
> > +obj-$(CONFIG_IMMEDIATE) += immediate.o
> > obj-$(CONFIG_DOUBLEFAULT) += doublefault.o
> > obj-$(CONFIG_SERIAL_8250) += legacy_serial.o
> > obj-$(CONFIG_VM86) += vm86.o
> > Index: linux-2.6-lttng/arch/i386/kernel/immediate.c
> > ===================================================================
> > --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> > +++ linux-2.6-lttng/arch/i386/kernel/immediate.c 2007-06-15 16:14:04.000000000 -0400
> > @@ -0,0 +1,177 @@
> > +/*
> > + * Immediate Value - i386 architecture specific code.
> > + *
> > + * Rationale
> > + *
> > + * Required because of :
> > + * - Erratum 49 fix for Intel PIII.
> > + * - Still present on newer processors : Intel Core 2 Duo Processor for Intel
> > + * Centrino Duo Processor Technology Specification Update, AH33.
> > + * Unsynchronized Cross-Modifying Code Operations Can Cause Unexpected
> > + * Instruction Execution Results.
> > + *
> > + * Permits immediate value modification by XMC with correct serialization.
> > + *
> > + * Reentrant for NMI and trap handler instrumentation. Permits XMC to a
> > + * location that has preemption enabled because it involves no temporary or
> > + * reused data structure.
> > + *
> > + * Quoting Richard J Moore, source of the information motivating this
> > + * implementation which differs from the one proposed by Intel which is not
> > + * suitable for kernel context (does not support NMI and would require disabling
> > + * interrupts on every CPU for a long period) :
> > + *
> > + * "There is another issue to consider when looking into using probes other
> > + * then int3:
> > + *
> > + * Intel erratum 54 - Unsynchronized Cross-modifying code - refers to the
> > + * practice of modifying code on one processor where another has prefetched
> > + * the unmodified version of the code. Intel states that unpredictable general
> > + * protection faults may result if a synchronizing instruction (iret, int,
> > + * int3, cpuid, etc ) is not executed on the second processor before it
> > + * executes the pre-fetched out-of-date copy of the instruction.
> > + *
> > + * When we became aware of this I had a long discussion with Intel's
> > + * microarchitecture guys. It turns out that the reason for this erratum
> > + * (which incidentally Intel does not intend to fix) is because the trace
> > + * cache - the stream of micorops resulting from instruction interpretation -
> > + * cannot guaranteed to be valid. Reading between the lines I assume this
> > + * issue arises because of optimization done in the trace cache, where it is
> > + * no longer possible to identify the original instruction boundaries. If the
> > + * CPU discoverers that the trace cache has been invalidated because of
> > + * unsynchronized cross-modification then instruction execution will be
> > + * aborted with a GPF. Further discussion with Intel revealed that replacing
> > + * the first opcode byte with an int3 would not be subject to this erratum.
> > + *
> > + * So, is cmpxchg reliable? One has to guarantee more than mere atomicity."
> > + *
> > + * Overall design
> > + *
> > + * The algorithm proposed by Intel applies not so well in kernel context: it
> > + * would imply disabling interrupts and looping on every CPUs while modifying
> > + * the code and would not support instrumentation of code called from interrupt
> > + * sources that cannot be disabled.
> > + *
> > + * Therefore, we use a different algorithm to respect Intel's erratum (see the
> > + * quoted discussion above). We make sure that no CPU sees an out-of-date copy
> > + * of a pre-fetched instruction by 1 - using a breakpoint, which skips the
> > + * instruction that is going to be modified, 2 - issuing an IPI to every CPU to
> > + * execute a sync_core(), to make sure that even when the breakpoint is removed,
> > + * no cpu could possibly still have the out-of-date copy of the instruction,
> > + * modify the now unused 2nd byte of the instruction, and then put back the
> > + * original 1st byte of the instruction.
> > + *
> > + * It has exactly the same intent as the algorithm proposed by Intel, but
> > + * it has less side-effects, scales better and supports NMI, SMI and MCE.
>
> Algorithm looks plausible, was it really tested?
>

Yes, but I can't reproduce the erroneous condition on my setup. I just
realized that I did not test it after my last changes. Therefore, see
the following comments inline for the upcoming fixes :

> > + *
> > + * Mathieu Desnoyers <[email protected]>
> > + */
> > +
> > +#include <linux/notifier.h>
> > +#include <linux/preempt.h>
> > +#include <linux/smp.h>
> > +#include <linux/notifier.h>
> > +#include <linux/module.h>
> > +#include <linux/immediate.h>
> > +#include <linux/kdebug.h>
> > +
> > +#include <asm/cacheflush.h>
> > +
> > +#define BREAKPOINT_INSTRUCTION 0xcc
> > +#define BREAKPOINT_INS_LEN 1
> > +
> > +static long target_eip;
> > +
> > +static void immediate_synchronize_core(void *info)
> > +{
> > + sync_core(); /* use cpuid to stop speculative execution */
> > +}
> > +
> > +/*
> > + * The eip value points right after the breakpoint instruction, in the second
> > + * byte of the movb. Incrementing it of 1 byte makes the code resume right after
> > + * the movb instruction, effectively skipping this instruction.
> > + *
> > + * We simply skip the 2 bytes load immediate here, leaving the register in an
> > + * undefined state. We don't care about the content (0 or !0), because we are
> > + * changing the value 0->1 or 1->0. This small window of undefined value
> > + * doesn't matter.
> > + */
> > +static int immediate_notifier(struct notifier_block *nb,
> > + unsigned long val, void *data)
> > +{
> > + enum die_val die_val = (enum die_val) val;
> > + struct die_args *args = data;
> > +
> > + if (!args->regs || user_mode_vm(args->regs))
> > + return NOTIFY_DONE;
> > +
> > + if (die_val == DIE_INT3 && args->regs->eip == target_eip) {
> > + args->regs->eip += 1; /* Skip the next byte of load immediate */
>
> <nitpick>
> args->regs->eip++;
>

Should now be args->regs->eip += 4; because I jump over the last 4 bytes
of the 5-byte-long movl instead of jumping over the last byte of the 2-byte
movb I used previously.
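
A minimal sketch of the corrected notifier body, assuming the 5-byte movl
layout described above (illustrative only):

        if (die_val == DIE_INT3 && args->regs->eip == target_eip) {
                /* skip the 4-byte immediate of the 5-byte movl */
                args->regs->eip += 4;
                return NOTIFY_STOP;
        }
        return NOTIFY_DONE;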

I also have to use the int3 handler installed by CONFIG_KPROBES when
CONFIG_IMMEDIATE is used so I do not depend on KPROBES for i386. It will
also be fixed in the next release.

> > + return NOTIFY_STOP;
> > + }
> > + return NOTIFY_DONE;
> > +}
> > +
> > +static struct notifier_block immediate_notify = {
> > + .notifier_call = immediate_notifier,
> > + .priority = 0x7fffffff, /* we need to be notified first */
> > +};
> > +
> > +/*
> > + * The address is not aligned. We can only change 1 byte of the value
> > + * atomically.
> > + * Must be called with immediate_mutex held.
> > + */
> > +int immediate_optimized_set_enable(void *address, char enable)
> > +{
> > + char saved_byte;
> > + int ret;
> > + char *dest = address;
> > +
> > + if (!(enable ^ dest[1])) /* Must be a state change 0<->1 to execute */
> > + return 0;
> > +
> > +#if defined(CONFIG_DEBUG_RODATA) || defined(CONFIG_DEBUG_PAGEALLOC)
> > + /* Make sure this page is writable */
> > + change_page_attr(virt_to_page(address), 1, PAGE_KERNEL_EXEC);
> > + global_flush_tlb();
> > +#endif
>
> Can't we have a macro or inline to do this, and the setting back
> to read-only? kprobes also has the same ugly #if blocks...
>
I guess it would be better to share a common macro between kprobes and
immediate for this.

> Hmm, what happens if you race with kprobes setting a probe on
> the same page? Couldn't it unexpectedly become read-only?
>

Sure it can. Is there any spinlock already there that could be used by
different objects and would fit this purpose?
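
A rough sketch of what such a shared helper could look like (hypothetical
names; the bodies just mirror the #if blocks quoted above, and callers would
still need a shared lock to avoid the race you describe):

        /* hypothetical helpers shared by kprobes and immediate values */
        static inline void kernel_text_make_rw(void *addr)
        {
        #if defined(CONFIG_DEBUG_RODATA) || defined(CONFIG_DEBUG_PAGEALLOC)
                change_page_attr(virt_to_page(addr), 1, PAGE_KERNEL_EXEC);
                global_flush_tlb();
        #endif
        }

        static inline void kernel_text_make_ro(void *addr)
        {
        #if defined(CONFIG_DEBUG_RODATA)
                change_page_attr(virt_to_page(addr), 1, PAGE_KERNEL_RX);
                global_flush_tlb();
        #endif
        }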

> > + target_eip = (long)address + BREAKPOINT_INS_LEN;
> > + /* register_die_notifier has memory barriers */
> > + register_die_notifier(&immediate_notify);
> > + saved_byte = *dest;
> > + *dest = BREAKPOINT_INSTRUCTION;
> > + wmb();
> > + /*
> > + * Execute serializing instruction on each CPU.
> > + * Acts as a memory barrier.
> > + */
> > + ret = on_each_cpu(immediate_synchronize_core, NULL, 1, 1);
> > + BUG_ON(ret != 0);
> > +
> > + dest[1] = enable;
> > + wmb();
> > + *dest = saved_byte;
> > + /*
> > + * Wait for all int3 handlers to end
> > + * (interrupts are disabled in int3).
> > + * This CPU is clearly not in a int3 handler,
> > + * because int3 handler is not preemptible and
> > + * there cannot be any more int3 handler called
> > + * for this site, because we placed the original
> > + * instruction back.
> > + * synchronize_sched has memory barriers.
> > + */
> > + synchronize_sched();
> > + unregister_die_notifier(&immediate_notify);
> > + /* unregister_die_notifier has memory barriers */
> > + target_eip = 0;
> > +#if defined(CONFIG_DEBUG_RODATA)
> > + /* Set this page back to RX */
> > + change_page_attr(virt_to_page(address), 1, PAGE_KERNEL_RX);
> > + global_flush_tlb();
> > +#endif
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(immediate_optimized_set_enable);
> >
>

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-06-18 14:57:22

by Mathieu Desnoyers

Subject: Re: [patch 4/8] Immediate Value - i386 Optimization; kprobes

* Chuck Ebbert ([email protected]) wrote:
> > + return NOTIFY_STOP;
> > + }
> > + return NOTIFY_DONE;
> > +}
> > +
> > +static struct notifier_block immediate_notify = {
> > + .notifier_call = immediate_notifier,
> > + .priority = 0x7fffffff, /* we need to be notified first */
> > +};
> > +
> > +/*
> > + * The address is not aligned. We can only change 1 byte of the value
> > + * atomically.
> > + * Must be called with immediate_mutex held.
> > + */
> > +int immediate_optimized_set_enable(void *address, char enable)
> > +{
> > + char saved_byte;
> > + int ret;
> > + char *dest = address;
> > +
> > + if (!(enable ^ dest[1])) /* Must be a state change 0<->1 to execute */
> > + return 0;
> > +
> > +#if defined(CONFIG_DEBUG_RODATA) || defined(CONFIG_DEBUG_PAGEALLOC)
> > + /* Make sure this page is writable */
> > + change_page_attr(virt_to_page(address), 1, PAGE_KERNEL_EXEC);
> > + global_flush_tlb();
> > +#endif
>
> Can't we have a macro or inline to do this, and the setting back
> to read-only? kprobes also has the same ugly #if blocks...
>
> Hmm, what happens if you race with kprobes setting a probe on
> the same page? Couldn't it unexpectedly become read-only?
>

Hi Chuck,

I am looking more closely at kprobes; a few comments while we are here:

1 - Why is kprobe_count an atomic_t variable instead of a simple
integer? It is always protected by the kprobe_mutex anyway. All this
synchronization seems redundant.

2 - I wonder where is the equivalent of my snippet in kprobes code:
> > +#if defined(CONFIG_DEBUG_RODATA) || defined(CONFIG_DEBUG_PAGEALLOC)
> > + /* Make sure this page is writable */
> > + change_page_attr(virt_to_page(address), 1, PAGE_KERNEL_EXEC);
> > + global_flush_tlb();
> > +#endif

I fancy it's done by the kprobe_page_fault handler, but I do not see
clearly how writing the breakpoint from arch_arm_kprobe() in
non-writeable memory is done.

In any case, I would like not to use that kind of approach; I prefer to
set the memory page to RWX before doing the memory write so I do not
depend on the page fault handler (remember that I instrument the page
fault handler itself...).

Maybe we could use a shared "text_mutex" between kprobes and
immediate values to ensure mutual exclusion for .text modification.
However, we would still have the following coherency issue when an
immediate value and a kprobe share the same address:

1- enable immediate value
2- put a kprobe at the immediate value load instruction address
3- disable immediate value
4- remove kprobe

The kprobe removal would put back the load immediate instruction and
would not touch the loaded value (which is ok). However, the instruction
copy kept by kprobes would not be in sync with the immediate value
state:

Scenario 1: kprobes int3 handler first:

1- enable immediate value
2- put a kprobe at the immediate value load instruction address

-> int3 triggered
kprobe handler runs. Single-steps the "enabled" state.

3- disable immediate value

-> int3 triggered
kprobe handler runs. Single-steps the "enabled" state. This state is
wrong.

4- remove kprobe


Scenario 2: immediate value int3 handler first:

1- enable immediate value
2- put a kprobe at the immediate value load instruction address

-> int3 triggered
kprobe handler runs. Single-steps the "enabled" state.

3- disable immediate value
-> int3 triggered (while we disable)
While we disable, the immediate value int3 handler is executed first. It
would cause the kprobe handler not to be called, and no "missing"
counter would be incremented.

kprobe handler runs. Single-steps the "enabled" state. This state is
wrong.

4- remove kprobe


Since I don't want to depend on kprobes to put the breakpoint, because
of its reentrancy limitations (see all the __kprobes functions), it would
be good to figure out a mutual exclusion mechanism between immediate
values and kprobes. Maybe we could forbid kprobes from inserting probes on
addresses present in the immediate values tables? Or better: if we
detect this scenario, we could put the kprobe breakpoint at the
instruction following the "movl".
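
As a sketch of the shared-lock idea (hypothetical; no such text_mutex exists
in the tree today), both arch_arm_kprobe()/arch_disarm_kprobe() and
immediate_optimized_set_enable() would take the same mutex around the
page-attribute change and the instruction write:

        #include <linux/mutex.h>

        /* hypothetical shared lock serializing kernel .text modifications */
        DEFINE_MUTEX(text_mutex);

        int immediate_optimized_set_enable(void *address, char enable)
        {
                mutex_lock(&text_mutex);
                /* ... make the page RWX, patch the movl, restore RX ... */
                mutex_unlock(&text_mutex);
                return 0;
        }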

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-06-18 18:45:27

by Chuck Ebbert

Subject: Re: [patch 4/8] Immediate Value - i386 Optimization; kprobes

On 06/18/2007 10:57 AM, Mathieu Desnoyers wrote:
> * Chuck Ebbert ([email protected]) wrote:
>>> + return NOTIFY_STOP;
>>> + }
>>> + return NOTIFY_DONE;
>>> +}
>>> +
>>> +static struct notifier_block immediate_notify = {
>>> + .notifier_call = immediate_notifier,
>>> + .priority = 0x7fffffff, /* we need to be notified first */
>>> +};
>>> +
>>> +/*
>>> + * The address is not aligned. We can only change 1 byte of the value
>>> + * atomically.
>>> + * Must be called with immediate_mutex held.
>>> + */
>>> +int immediate_optimized_set_enable(void *address, char enable)
>>> +{
>>> + char saved_byte;
>>> + int ret;
>>> + char *dest = address;
>>> +
>>> + if (!(enable ^ dest[1])) /* Must be a state change 0<->1 to execute */
>>> + return 0;
>>> +
>>> +#if defined(CONFIG_DEBUG_RODATA) || defined(CONFIG_DEBUG_PAGEALLOC)
>>> + /* Make sure this page is writable */
>>> + change_page_attr(virt_to_page(address), 1, PAGE_KERNEL_EXEC);
>>> + global_flush_tlb();
>>> +#endif
>> Can't we have a macro or inline to do this, and the setting back
>> to read-only? kprobes also has the same ugly #if blocks...
>>
>> Hmm, what happens if you race with kprobes setting a probe on
>> the same page? Couldn't it unexpectedly become read-only?
>>
>
> Hi Chuck,
>
> I am looking more closely at kprobes; a few comments while we are here:
>
> 1 - Why is kprobe_count an atomic_t variable instead of a simple
> integer? It is always protected by the kprobe_mutex anyway. All this
> synchronization seems redundant.
>
> 2 - I wonder where is the equivalent of my snippet in kprobes code:
>>> +#if defined(CONFIG_DEBUG_RODATA) || defined(CONFIG_DEBUG_PAGEALLOC)
>>> + /* Make sure this page is writable */
>>> + change_page_attr(virt_to_page(address), 1, PAGE_KERNEL_EXEC);
>>> + global_flush_tlb();
>>> +#endif
>
> I fancy it's done by the kprobe_page_fault handler, but I do not see
> clearly how writing the breakpoint from arch_arm_kprobe() in
> non-writeable memory is done.

Looks like it's not merged yet:

http://lkml.org/lkml/2007/6/7/2

This needs to go in before 2.6.22-final

>
> In any case, I would like not to use that kind of approach; I prefer to
> set the memory page to RWX before doing the memory write so I do not
> depend on the page fault handler (remember that I instrument the page
> fault handler itself...).
>
> Maybe we could use a shared "text_mutex" between kprobes and
> immediate values to insure mutual exclusion for .text modification.
> However, we would still have the following coherency issue when an
> immediate value and a kprobe share the same address:
>
> 1- enable immediate value
> 2- put a kprobe at the immediate value load instruction address
> 3- disable immediate value
> 4- remove kprobe
>
> The kprobe removal would put back the load immediate instruction and
> would not touch the loaded value (which is ok). However, the instruction
> copy kept by kprobes would not be in sync with the immediate value
> state:
>
> Scenario 1: kprobes int3 handler first:
>
> 1- enable immediate value
> 2- put a kprobe at the immediate value load instruction address
>
> -> int3 triggered
> kprobe handler runs. Single-steps the "enabled" state.
>
> 3- disable immediate value
>
> -> int3 triggered
> kprobe handler runs. Single-steps the "enabled" state. This state is
> wrong.
>
> 4- remove kprobe
>
>
> Scenario 2: immediate value int3 handler first:
>
> 1- enable immediate value
> 2- put a kprobe at the immediate value load instruction address
>
> -> int3 triggered
> kprobe handler runs. Single-steps the "enabled" state.
>
> 3- disable immediate value
> -> int3 triggered (while we disable)
> While we disable, the immediate value int3 handler is executed first. It
> would cause the kprobe handler not to be called, and no "missing"
> counter would be incremented.
>
> kprobe handler runs. Single-steps the "enabled" state. This state is
> wrong.
>
> 4- remove kprobe
>
>
> Since I don't want to depend on kprobes to put the breakpoint, because
> of its reentrancy limitations (see all the __probes functions), It would
> be good to figure out a mutual exclusion mechanism between immediate
> values and kprobes. Maybe we could forbid kprobes to insert probes on
> addresses present in the immediate values tables ? Or better: if we
> detect this scenario, we could put the kprobe breakpoint at the
> instruction following the "movl".
>

That's up to you and the kprobes people, I guess...


2007-06-18 18:57:47

by Andrew Morton

Subject: Re: [patch 4/8] Immediate Value - i386 Optimization; kprobes

On Mon, 18 Jun 2007 14:44:57 -0400
Chuck Ebbert <[email protected]> wrote:

> > I fancy it's done by the kprobe_page_fault handler, but I do not see
> > clearly how writing the breakpoint from arch_arm_kprobe() in
> > non-writeable memory is done.
>
> Looks like it's not merged yet:
>
> http://lkml.org/lkml/2007/6/7/2
>
> This needs to go in before 2.6.22-final

Andi, I'll include the below two patches in the next batch, OK?



From: "S. P. Prasanna" <[email protected]>

Fix the problem of page protection introduced by CONFIG_DEBUG_RODATA for
x86_64 architecture. As per Andi Kleen's suggestion, the kernel text pages
are marked writeable only for a short duration to insert or remove the
breakpoints.

Signed-off-by: Prasanna S P<[email protected]>
Acked-by: Jim Keniston <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

arch/x86_64/kernel/kprobes.c | 26 ++++++++++++++++++++++++++
arch/x86_64/mm/init.c | 6 +++++-
include/asm-x86_64/kprobes.h | 10 ++++++++++
3 files changed, 41 insertions(+), 1 deletion(-)

diff -puN arch/x86_64/kernel/kprobes.c~kprobes-x86_64-fix-for-mark-ro-data arch/x86_64/kernel/kprobes.c
--- a/arch/x86_64/kernel/kprobes.c~kprobes-x86_64-fix-for-mark-ro-data
+++ a/arch/x86_64/kernel/kprobes.c
@@ -209,16 +209,42 @@ static void __kprobes arch_copy_kprobe(s

void __kprobes arch_arm_kprobe(struct kprobe *p)
{
+ unsigned long addr = (unsigned long)p->addr;
+ int page_readonly = 0;
+
+ if (kernel_readonly_text(addr)) {
+ change_page_attr_addr(addr, 1, PAGE_KERNEL_EXEC);
+ global_flush_tlb();
+ page_readonly = 1;
+ }
*p->addr = BREAKPOINT_INSTRUCTION;
flush_icache_range((unsigned long) p->addr,
(unsigned long) p->addr + sizeof(kprobe_opcode_t));
+ if (page_readonly) {
+ change_page_attr_addr(addr, 1, PAGE_KERNEL_RO);
+ global_flush_tlb();
+ }
}

void __kprobes arch_disarm_kprobe(struct kprobe *p)
{
+ unsigned long addr = (unsigned long)p->addr;
+ int page_readonly = 0;
+
+ if (kernel_readonly_text(addr)) {
+ change_page_attr_addr(addr, 1, PAGE_KERNEL_EXEC);
+ global_flush_tlb();
+ page_readonly = 1;
+ }
+
*p->addr = p->opcode;
flush_icache_range((unsigned long) p->addr,
(unsigned long) p->addr + sizeof(kprobe_opcode_t));
+
+ if (page_readonly) {
+ change_page_attr_addr(addr, 1, PAGE_KERNEL_RO);
+ global_flush_tlb();
+ }
}

void __kprobes arch_remove_kprobe(struct kprobe *p)
diff -puN arch/x86_64/mm/init.c~kprobes-x86_64-fix-for-mark-ro-data arch/x86_64/mm/init.c
--- a/arch/x86_64/mm/init.c~kprobes-x86_64-fix-for-mark-ro-data
+++ a/arch/x86_64/mm/init.c
@@ -48,6 +48,7 @@
#define Dprintk(x...)
#endif

+int kernel_text_is_ro;
const struct dma_mapping_ops* dma_ops;
EXPORT_SYMBOL(dma_ops);

@@ -600,10 +601,13 @@ void mark_rodata_ro(void)
{
unsigned long start = (unsigned long)_stext, end;

+ kernel_text_is_ro = 1;
#ifdef CONFIG_HOTPLUG_CPU
/* It must still be possible to apply SMP alternatives. */
- if (num_possible_cpus() > 1)
+ if (num_possible_cpus() > 1) {
start = (unsigned long)_etext;
+ kernel_text_is_ro = 0;
+ }
#endif
end = (unsigned long)__end_rodata;
start = (start + PAGE_SIZE - 1) & PAGE_MASK;
diff -puN include/asm-x86_64/kprobes.h~kprobes-x86_64-fix-for-mark-ro-data include/asm-x86_64/kprobes.h
--- a/include/asm-x86_64/kprobes.h~kprobes-x86_64-fix-for-mark-ro-data
+++ a/include/asm-x86_64/kprobes.h
@@ -26,6 +26,7 @@
#include <linux/types.h>
#include <linux/ptrace.h>
#include <linux/percpu.h>
+#include <asm-generic/sections.h>

#define __ARCH_WANT_KPROBES_INSN_SLOT

@@ -88,4 +89,13 @@ extern int kprobe_handler(struct pt_regs

extern int kprobe_exceptions_notify(struct notifier_block *self,
unsigned long val, void *data);
+extern int kernel_text_is_ro;
+static inline int kernel_readonly_text(unsigned long address)
+{
+ if (kernel_text_is_ro && ((address >= (unsigned long)_stext)
+ && (address < (unsigned long) _etext)))
+ return 1;
+
+ return 0;
+}
#endif /* _ASM_KPROBES_H */
_



From: "S. P. Prasanna" <[email protected]>

Fix the problem of page protection introduced by CONFIG_DEBUG_RODATA.
CONFIG_DEBUG_RODATA marks the text pages as read-only, hence kprobes is
unable to insert breakpoints in the kernel text. This patch overrides the
page protection when adding or removing a probe for the i386 architecture.

Signed-off-by: Prasanna S P<[email protected]>
Acked-by: Jim Keniston <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

arch/i386/kernel/kprobes.c | 26 ++++++++++++++++++++++++++
arch/i386/mm/init.c | 2 ++
include/asm-i386/kprobes.h | 12 ++++++++++++
include/asm-i386/pgtable.h | 2 ++
4 files changed, 42 insertions(+)

diff -puN arch/i386/kernel/kprobes.c~kprobes-i386-fix-for-mark-ro-data arch/i386/kernel/kprobes.c
--- a/arch/i386/kernel/kprobes.c~kprobes-i386-fix-for-mark-ro-data
+++ a/arch/i386/kernel/kprobes.c
@@ -169,16 +169,42 @@ int __kprobes arch_prepare_kprobe(struct

void __kprobes arch_arm_kprobe(struct kprobe *p)
{
+ unsigned long addr = (unsigned long) p->addr;
+ int page_readonly = 0;
+
+ if (kernel_readonly_text(addr)) {
+ page_readonly = 1;
+ change_page_attr(virt_to_page(addr), 1, PAGE_KERNEL_RWX);
+ global_flush_tlb();
+ }
+
*p->addr = BREAKPOINT_INSTRUCTION;
flush_icache_range((unsigned long) p->addr,
(unsigned long) p->addr + sizeof(kprobe_opcode_t));
+
+ if (page_readonly) {
+ change_page_attr(virt_to_page(addr), 1, PAGE_KERNEL_RX);
+ global_flush_tlb();
+ }
}

void __kprobes arch_disarm_kprobe(struct kprobe *p)
{
+ unsigned long addr = (unsigned long) p->addr;
+ int page_readonly = 0;
+
+ if (kernel_readonly_text(addr)) {
+ page_readonly = 1;
+ change_page_attr(virt_to_page(addr), 1, PAGE_KERNEL_RWX);
+ global_flush_tlb();
+ }
*p->addr = p->opcode;
flush_icache_range((unsigned long) p->addr,
(unsigned long) p->addr + sizeof(kprobe_opcode_t));
+ if (page_readonly) {
+ change_page_attr(virt_to_page(addr), 1, PAGE_KERNEL_RX);
+ global_flush_tlb();
+ }
}

void __kprobes arch_remove_kprobe(struct kprobe *p)
diff -puN arch/i386/mm/init.c~kprobes-i386-fix-for-mark-ro-data arch/i386/mm/init.c
--- a/arch/i386/mm/init.c~kprobes-i386-fix-for-mark-ro-data
+++ a/arch/i386/mm/init.c
@@ -45,6 +45,7 @@
#include <asm/sections.h>
#include <asm/paravirt.h>

+int kernel_text_is_ro;
unsigned int __VMALLOC_RESERVE = 128 << 20;

DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
@@ -808,6 +809,7 @@ void mark_rodata_ro(void)
change_page_attr(virt_to_page(start),
size >> PAGE_SHIFT, PAGE_KERNEL_RX);
printk("Write protecting the kernel text: %luk\n", size >> 10);
+ kernel_text_is_ro = 1;
}

start += size;
diff -puN include/asm-i386/kprobes.h~kprobes-i386-fix-for-mark-ro-data include/asm-i386/kprobes.h
--- a/include/asm-i386/kprobes.h~kprobes-i386-fix-for-mark-ro-data
+++ a/include/asm-i386/kprobes.h
@@ -26,6 +26,8 @@
*/
#include <linux/types.h>
#include <linux/ptrace.h>
+#include <linux/pfn.h>
+#include <asm-generic/sections.h>

#define __ARCH_WANT_KPROBES_INSN_SLOT

@@ -90,4 +92,14 @@ static inline void restore_interrupts(st

extern int kprobe_exceptions_notify(struct notifier_block *self,
unsigned long val, void *data);
+extern int kernel_text_is_ro;
+static inline int kernel_readonly_text(unsigned long address)
+{
+
+ if (kernel_text_is_ro && ((address >= PFN_ALIGN(_text))
+ && (address < PFN_ALIGN(_etext))))
+ return 1;
+
+ return 0;
+}
#endif /* _ASM_KPROBES_H */
diff -puN include/asm-i386/pgtable.h~kprobes-i386-fix-for-mark-ro-data include/asm-i386/pgtable.h
--- a/include/asm-i386/pgtable.h~kprobes-i386-fix-for-mark-ro-data
+++ a/include/asm-i386/pgtable.h
@@ -159,6 +159,7 @@ void paging_init(void);
extern unsigned long long __PAGE_KERNEL, __PAGE_KERNEL_EXEC;
#define __PAGE_KERNEL_RO (__PAGE_KERNEL & ~_PAGE_RW)
#define __PAGE_KERNEL_RX (__PAGE_KERNEL_EXEC & ~_PAGE_RW)
+#define __PAGE_KERNEL_RWX (__PAGE_KERNEL_EXEC | _PAGE_RW)
#define __PAGE_KERNEL_NOCACHE (__PAGE_KERNEL | _PAGE_PCD)
#define __PAGE_KERNEL_LARGE (__PAGE_KERNEL | _PAGE_PSE)
#define __PAGE_KERNEL_LARGE_EXEC (__PAGE_KERNEL_EXEC | _PAGE_PSE)
@@ -167,6 +168,7 @@ extern unsigned long long __PAGE_KERNEL,
#define PAGE_KERNEL_RO __pgprot(__PAGE_KERNEL_RO)
#define PAGE_KERNEL_EXEC __pgprot(__PAGE_KERNEL_EXEC)
#define PAGE_KERNEL_RX __pgprot(__PAGE_KERNEL_RX)
+#define PAGE_KERNEL_RWX __pgprot(__PAGE_KERNEL_RWX)
#define PAGE_KERNEL_NOCACHE __pgprot(__PAGE_KERNEL_NOCACHE)
#define PAGE_KERNEL_LARGE __pgprot(__PAGE_KERNEL_LARGE)
#define PAGE_KERNEL_LARGE_EXEC __pgprot(__PAGE_KERNEL_LARGE_EXEC)
_

2007-06-18 19:27:21

by Mathieu Desnoyers

Subject: Re: [patch 4/8] Immediate Value - i386 Optimization; kprobes

Hi Prasanna,

Please see comments below about the patch.

* Andrew Morton ([email protected]) wrote:
> On Mon, 18 Jun 2007 14:44:57 -0400
> Chuck Ebbert <[email protected]> wrote:
>
> > > I fancy it's done by the kprobe_page_fault handler, but I do not see
> > > clearly how writing the breakpoint from arch_arm_kprobe() in
> > > non-writeable memory is done.
> >
> > Looks like it's not merged yet:
> >
> > http://lkml.org/lkml/2007/6/7/2
> >
> > This needs to go in before 2.6.22-final
>
> Andi, I'll include the below two patches in the next batch, OK?
>
>
>
> From: "S. P. Prasanna" <[email protected]>
>
> Fix the problem of page protection introduced by CONFIG_DEBUG_RODATA for
> x86_64 architecture. As per Andi Kleen's suggestion, the kernel text pages
> are marked writeable only for a short duration to insert or remove the
> breakpoints.
>
> Signed-off-by: Prasanna S P<[email protected]>
> Acked-by: Jim Keniston <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> arch/x86_64/kernel/kprobes.c | 26 ++++++++++++++++++++++++++
> arch/x86_64/mm/init.c | 6 +++++-
> include/asm-x86_64/kprobes.h | 10 ++++++++++
> 3 files changed, 41 insertions(+), 1 deletion(-)
>
> diff -puN arch/x86_64/kernel/kprobes.c~kprobes-x86_64-fix-for-mark-ro-data arch/x86_64/kernel/kprobes.c
> --- a/arch/x86_64/kernel/kprobes.c~kprobes-x86_64-fix-for-mark-ro-data
> +++ a/arch/x86_64/kernel/kprobes.c
> @@ -209,16 +209,42 @@ static void __kprobes arch_copy_kprobe(s
>
> void __kprobes arch_arm_kprobe(struct kprobe *p)
> {
> + unsigned long addr = (unsigned long)p->addr;
> + int page_readonly = 0;
> +
> + if (kernel_readonly_text(addr)) {
> + change_page_attr_addr(addr, 1, PAGE_KERNEL_EXEC);
> + global_flush_tlb();
> + page_readonly = 1;
> + }
> *p->addr = BREAKPOINT_INSTRUCTION;
> flush_icache_range((unsigned long) p->addr,
> (unsigned long) p->addr + sizeof(kprobe_opcode_t));
> + if (page_readonly) {
> + change_page_attr_addr(addr, 1, PAGE_KERNEL_RO);
> + global_flush_tlb();
> + }
> }
>
> void __kprobes arch_disarm_kprobe(struct kprobe *p)
> {
> + unsigned long addr = (unsigned long)p->addr;
> + int page_readonly = 0;
> +
> + if (kernel_readonly_text(addr)) {
> + change_page_attr_addr(addr, 1, PAGE_KERNEL_EXEC);
> + global_flush_tlb();
> + page_readonly = 1;
> + }
> +
> *p->addr = p->opcode;
> flush_icache_range((unsigned long) p->addr,
> (unsigned long) p->addr + sizeof(kprobe_opcode_t));
> +
> + if (page_readonly) {
> + change_page_attr_addr(addr, 1, PAGE_KERNEL_RO);
> + global_flush_tlb();
> + }
> }
>
> void __kprobes arch_remove_kprobe(struct kprobe *p)
> diff -puN arch/x86_64/mm/init.c~kprobes-x86_64-fix-for-mark-ro-data arch/x86_64/mm/init.c
> --- a/arch/x86_64/mm/init.c~kprobes-x86_64-fix-for-mark-ro-data
> +++ a/arch/x86_64/mm/init.c
> @@ -48,6 +48,7 @@
> #define Dprintk(x...)
> #endif
>
> +int kernel_text_is_ro;
> const struct dma_mapping_ops* dma_ops;
> EXPORT_SYMBOL(dma_ops);
>
> @@ -600,10 +601,13 @@ void mark_rodata_ro(void)
> {
> unsigned long start = (unsigned long)_stext, end;
>
> + kernel_text_is_ro = 1;
> #ifdef CONFIG_HOTPLUG_CPU
> /* It must still be possible to apply SMP alternatives. */
> - if (num_possible_cpus() > 1)
> + if (num_possible_cpus() > 1) {
> start = (unsigned long)_etext;
> + kernel_text_is_ro = 0;
> + }
> #endif
> end = (unsigned long)__end_rodata;
> start = (start + PAGE_SIZE - 1) & PAGE_MASK;
> diff -puN include/asm-x86_64/kprobes.h~kprobes-x86_64-fix-for-mark-ro-data include/asm-x86_64/kprobes.h
> --- a/include/asm-x86_64/kprobes.h~kprobes-x86_64-fix-for-mark-ro-data
> +++ a/include/asm-x86_64/kprobes.h
> @@ -26,6 +26,7 @@
> #include <linux/types.h>
> #include <linux/ptrace.h>
> #include <linux/percpu.h>
> +#include <asm-generic/sections.h>
>
> #define __ARCH_WANT_KPROBES_INSN_SLOT
>
> @@ -88,4 +89,13 @@ extern int kprobe_handler(struct pt_regs
>
> extern int kprobe_exceptions_notify(struct notifier_block *self,
> unsigned long val, void *data);
> +extern int kernel_text_is_ro;
> +static inline int kernel_readonly_text(unsigned long address)
> +{
> + if (kernel_text_is_ro && ((address >= (unsigned long)_stext)
> + && (address < (unsigned long) _etext)))
> + return 1;
> +
> + return 0;
> +}
> #endif /* _ASM_KPROBES_H */
> _
>
>
>
> From: "S. P. Prasanna" <[email protected]>
>
> Fix the problem of page protection introduced by CONFIG_DEBUG_RODATA.
> CONFIG_DEBUG_RODATA marks the text pages as read-only, hence kprobes is
> unable to insert breakpoints in the kernel text. This patch overrides the
> page protection when adding or removing a probe for the i386 architecture.
>
> Signed-off-by: Prasanna S P<[email protected]>
> Acked-by: Jim Keniston <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> arch/i386/kernel/kprobes.c | 26 ++++++++++++++++++++++++++
> arch/i386/mm/init.c | 2 ++
> include/asm-i386/kprobes.h | 12 ++++++++++++
> include/asm-i386/pgtable.h | 2 ++
> 4 files changed, 42 insertions(+)
>
> diff -puN arch/i386/kernel/kprobes.c~kprobes-i386-fix-for-mark-ro-data arch/i386/kernel/kprobes.c
> --- a/arch/i386/kernel/kprobes.c~kprobes-i386-fix-for-mark-ro-data
> +++ a/arch/i386/kernel/kprobes.c
> @@ -169,16 +169,42 @@ int __kprobes arch_prepare_kprobe(struct
>
> void __kprobes arch_arm_kprobe(struct kprobe *p)
> {
> + unsigned long addr = (unsigned long) p->addr;
> + int page_readonly = 0;
> +
> + if (kernel_readonly_text(addr)) {
> + page_readonly = 1;
> + change_page_attr(virt_to_page(addr), 1, PAGE_KERNEL_RWX);
> + global_flush_tlb();
> + }
> +
> *p->addr = BREAKPOINT_INSTRUCTION;
> flush_icache_range((unsigned long) p->addr,
> (unsigned long) p->addr + sizeof(kprobe_opcode_t));

Something is weird here:

A- flush_icache_range() does nothing on i386. However, I guess it is there
so that the code is easier to port to other architectures.
B- since this code is meant to be eventually ported in some way, we have
to be aware that sizeof(kprobe_opcode_t) may be larger than one byte and
that the write can therefore cross a page boundary. However, I do not see
that case dealt with properly here.

So either we make this function architecture specific (by removing the
flush_icache_range() call), or we turn it into an architecture-independent
function, but then we have to make sure to change the flags of every page
potentially affected.
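
To make the page-boundary concern concrete, a rough, hypothetical sketch
(not part of the posted patch) of what the architecture-independent
unprotect step would have to do:

	/*
	 * Sketch only: unprotect every page touched by the opcode write,
	 * since sizeof(kprobe_opcode_t) may be larger than one byte and
	 * the write may therefore cross a page boundary.
	 */
	unsigned long first = addr & PAGE_MASK;
	unsigned long last = (addr + sizeof(kprobe_opcode_t) - 1) & PAGE_MASK;
	int pages = ((last - first) >> PAGE_SHIFT) + 1;

	change_page_attr(virt_to_page(first), pages, PAGE_KERNEL_RWX);
	global_flush_tlb();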

> +
> + if (page_readonly) {
> + change_page_attr(virt_to_page(addr), 1, PAGE_KERNEL_RX);
> + global_flush_tlb();
> + }
> }
>

Something like this:

	int ro_text = kernel_readonly_text(addr);

	if (ro_text) {
		...
	}
	...
	if (ro_text) {
		...
	}

Seems nicer than setting a boolean in the middle of the function,
doesn't it?
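
Concretely, arch_arm_kprobe() restructured along those lines might look
like the following (just a sketch reusing the calls from the patch above,
not a tested replacement):

void __kprobes arch_arm_kprobe(struct kprobe *p)
{
	unsigned long addr = (unsigned long)p->addr;
	int ro_text = kernel_readonly_text(addr);

	/* Temporarily make the text page writable if it is read-only. */
	if (ro_text) {
		change_page_attr(virt_to_page(addr), 1, PAGE_KERNEL_RWX);
		global_flush_tlb();
	}
	*p->addr = BREAKPOINT_INSTRUCTION;
	flush_icache_range((unsigned long)p->addr,
			(unsigned long)p->addr + sizeof(kprobe_opcode_t));
	/* Restore the read-only mapping. */
	if (ro_text) {
		change_page_attr(virt_to_page(addr), 1, PAGE_KERNEL_RX);
		global_flush_tlb();
	}
}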


> void __kprobes arch_disarm_kprobe(struct kprobe *p)
> {
> + unsigned long addr = (unsigned long) p->addr;
> + int page_readonly = 0;
> +
> + if (kernel_readonly_text(addr)) {
> + page_readonly = 1;
> + change_page_attr(virt_to_page(addr), 1, PAGE_KERNEL_RWX);
> + global_flush_tlb();
> + }
> *p->addr = p->opcode;
> flush_icache_range((unsigned long) p->addr,
> (unsigned long) p->addr + sizeof(kprobe_opcode_t));
> + if (page_readonly) {
> + change_page_attr(virt_to_page(addr), 1, PAGE_KERNEL_RX);
> + global_flush_tlb();
> + }
> }
>
> void __kprobes arch_remove_kprobe(struct kprobe *p)
> diff -puN arch/i386/mm/init.c~kprobes-i386-fix-for-mark-ro-data arch/i386/mm/init.c
> --- a/arch/i386/mm/init.c~kprobes-i386-fix-for-mark-ro-data
> +++ a/arch/i386/mm/init.c
> @@ -45,6 +45,7 @@
> #include <asm/sections.h>
> #include <asm/paravirt.h>
>
> +int kernel_text_is_ro;
> unsigned int __VMALLOC_RESERVE = 128 << 20;
>
> DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
> @@ -808,6 +809,7 @@ void mark_rodata_ro(void)
> change_page_attr(virt_to_page(start),
> size >> PAGE_SHIFT, PAGE_KERNEL_RX);
> printk("Write protecting the kernel text: %luk\n", size >> 10);
> + kernel_text_is_ro = 1;
> }
>
> start += size;
> diff -puN include/asm-i386/kprobes.h~kprobes-i386-fix-for-mark-ro-data include/asm-i386/kprobes.h
> --- a/include/asm-i386/kprobes.h~kprobes-i386-fix-for-mark-ro-data
> +++ a/include/asm-i386/kprobes.h
> @@ -26,6 +26,8 @@
> */
> #include <linux/types.h>
> #include <linux/ptrace.h>
> +#include <linux/pfn.h>
> +#include <asm-generic/sections.h>
>
> #define __ARCH_WANT_KPROBES_INSN_SLOT
>
> @@ -90,4 +92,14 @@ static inline void restore_interrupts(st
>
> extern int kprobe_exceptions_notify(struct notifier_block *self,
> unsigned long val, void *data);
> +extern int kernel_text_is_ro;
> +static inline int kernel_readonly_text(unsigned long address)
> +{
> +
> + if (kernel_text_is_ro && ((address >= PFN_ALIGN(_text))
> + && (address < PFN_ALIGN(_etext))))
> + return 1;
> +
> + return 0;
> +}

I see here that it does not check for module text addresses, which is
correct: module.c does not currently implement read-only pages for a
module's text section. Is that planned?

> #endif /* _ASM_KPROBES_H */
> diff -puN include/asm-i386/pgtable.h~kprobes-i386-fix-for-mark-ro-data include/asm-i386/pgtable.h
> --- a/include/asm-i386/pgtable.h~kprobes-i386-fix-for-mark-ro-data
> +++ a/include/asm-i386/pgtable.h
> @@ -159,6 +159,7 @@ void paging_init(void);
> extern unsigned long long __PAGE_KERNEL, __PAGE_KERNEL_EXEC;
> #define __PAGE_KERNEL_RO (__PAGE_KERNEL & ~_PAGE_RW)
> #define __PAGE_KERNEL_RX (__PAGE_KERNEL_EXEC & ~_PAGE_RW)
> +#define __PAGE_KERNEL_RWX (__PAGE_KERNEL_EXEC | _PAGE_RW)
> #define __PAGE_KERNEL_NOCACHE (__PAGE_KERNEL | _PAGE_PCD)
> #define __PAGE_KERNEL_LARGE (__PAGE_KERNEL | _PAGE_PSE)
> #define __PAGE_KERNEL_LARGE_EXEC (__PAGE_KERNEL_EXEC | _PAGE_PSE)
> @@ -167,6 +168,7 @@ extern unsigned long long __PAGE_KERNEL,
> #define PAGE_KERNEL_RO __pgprot(__PAGE_KERNEL_RO)
> #define PAGE_KERNEL_EXEC __pgprot(__PAGE_KERNEL_EXEC)
> #define PAGE_KERNEL_RX __pgprot(__PAGE_KERNEL_RX)
> +#define PAGE_KERNEL_RWX __pgprot(__PAGE_KERNEL_RWX)
> #define PAGE_KERNEL_NOCACHE __pgprot(__PAGE_KERNEL_NOCACHE)
> #define PAGE_KERNEL_LARGE __pgprot(__PAGE_KERNEL_LARGE)
> #define PAGE_KERNEL_LARGE_EXEC __pgprot(__PAGE_KERNEL_LARGE_EXEC)
> _
>
There is a new define for i386, but not for x86_64? I doubt it is
required anyway, because __PAGE_KERNEL_EXEC already amounts to
(__PAGE_KERNEL_RX | _PAGE_RW).
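
Spelling that out with the definitions from the hunk above (and assuming
the usual i386 __PAGE_KERNEL_EXEC value, which includes _PAGE_RW):

	/*
	 *   __PAGE_KERNEL_RX  == __PAGE_KERNEL_EXEC & ~_PAGE_RW
	 *   __PAGE_KERNEL_RWX == __PAGE_KERNEL_EXEC |  _PAGE_RW
	 *
	 * Since __PAGE_KERNEL_EXEC already has _PAGE_RW set, the new
	 * __PAGE_KERNEL_RWX evaluates to the same bits as
	 * __PAGE_KERNEL_EXEC, i.e. (__PAGE_KERNEL_RX | _PAGE_RW).
	 */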

Mathieu


--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-06-18 19:33:19

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 4/8] Immediate Value - i386 Optimization; kprobes

On Monday 18 June 2007 20:56:32 Andrew Morton wrote:
> On Mon, 18 Jun 2007 14:44:57 -0400
> Chuck Ebbert <[email protected]> wrote:
>
> > > I fancy it's done by the kprobe_page_fault handler, but I do not see
> > > clearly how writing the breakpoint from arch_arm_kprobe() in
> > > non-writeable memory is done.
> >
> > Looks like it's not merged yet:
> >
> > http://lkml.org/lkml/2007/6/7/2
> >
> > This needs to go in before 2.6.22-final
>
> Andi, I'll include the below two patches in the next batch, OK?

It won't work reliably unless some of the c_p_a() fixes get in first.

>
>
>
> +extern int kernel_text_is_ro;

No externs in .c files


I also don't like kernel_readonly_text() much; it would be better to just
lookup_address() the probe address and check the write flag.
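
For what it's worth, that idea might look roughly like this on i386 (a
hypothetical sketch, assuming the 2.6.22-era lookup_address() prototype
that takes only an address; the helper name is made up here):

	static int kprobe_text_readonly(unsigned long addr)
	{
		pte_t *pte = lookup_address(addr);

		/* An unmapped address cannot be probed anyway. */
		if (!pte)
			return 0;
		return !pte_write(*pte);
	}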

But for 2.6.22 as a quick fix it might be better to just
make KPROBES dependent on !DEBUG_RODATA. That would be a one liner.

-Andi

2007-06-18 20:17:22

by Chuck Ebbert

[permalink] [raw]
Subject: Re: [patch 4/8] Immediate Value - i386 Optimization; kprobes

On 06/18/2007 03:32 PM, Andi Kleen wrote:
>
> But for 2.6.22 as a quick fix it might be better to just
> make KPROBES dependent on !DEBUG_RODATA. That would be a one liner.

Yeah, that's probably best for now.

2007-06-19 10:06:33

by S. P. Prasanna

[permalink] [raw]
Subject: Re: [patch 1/2] kprobes i386 quick fix mark-ro-data

On Mon, Jun 18, 2007 at 09:32:56PM +0200, Andi Kleen wrote:
> On Monday 18 June 2007 20:56:32 Andrew Morton wrote:
> > On Mon, 18 Jun 2007 14:44:57 -0400
> > Chuck Ebbert <[email protected]> wrote:
> >
> > > > I fancy it's done by the kprobe_page_fault handler, but I do not see
> > > > clearly how writing the breakpoint from arch_arm_kprobe() in
> > > > non-writeable memory is done.
> > >
> > > Looks like it's not merged yet:
> > >
> > > http://lkml.org/lkml/2007/6/7/2
> > >
> > > This needs to go in before 2.6.22-final
> >
> > Andi, I'll include the below two patches in the next batch, OK?
>
> It won't work reliably unless some of the c_p_a() fixes get in first.
>
> >
> >
> >
> > +extern int kernel_text_is_ro;
>
> No externs in .c files
Yes.
>
>
> I also don't like kernel_readonly_text() much; it would be better to just
> lookup_address() the probe address and check the write flag.

Yes, I will look into this approach.
>
> But for 2.6.22 as a quick fix it might be better to just
> make KPROBES dependent on !DEBUG_RODATA. That would be a one liner.
>

Please find the quick fix as per your suggestion below.

Thanks
Prasanna

This patch is a quick fix to enable kprobes only if DEBUG_RODATA is not
configured, since DEBUG_RODATA marks the kernel text pages as read-only.

Signed-off-by: Prasanna S P. <[email protected]>


arch/i386/Kconfig | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff -puN arch/i386/Kconfig~kprobes-quick-fix-mark-ro-data-i386 arch/i386/Kconfig
--- linux-2.6.22-rc5/arch/i386/Kconfig~kprobes-quick-fix-mark-ro-data-i386 2007-06-19 14:55:31.000000000 +0530
+++ linux-2.6.22-rc5-prasanna/arch/i386/Kconfig 2007-06-19 14:55:31.000000000 +0530
@@ -1212,6 +1212,8 @@ source "drivers/Kconfig"

source "fs/Kconfig"

+source "arch/i386/Kconfig.debug"
+
menu "Instrumentation Support"
depends on EXPERIMENTAL

@@ -1219,7 +1221,7 @@ source "arch/i386/oprofile/Kconfig"

config KPROBES
bool "Kprobes (EXPERIMENTAL)"
- depends on KALLSYMS && EXPERIMENTAL && MODULES
+ depends on KALLSYMS && EXPERIMENTAL && MODULES && !DEBUG_RODATA
help
Kprobes allows you to trap at almost any kernel address and
execute a callback function. register_kprobe() establishes
@@ -1228,8 +1230,6 @@ config KPROBES
If in doubt, say "N".
endmenu

-source "arch/i386/Kconfig.debug"
-
source "security/Kconfig"

source "crypto/Kconfig"

_
--
Prasanna S.P.
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-41776329

2007-06-19 10:07:49

by S. P. Prasanna

[permalink] [raw]
Subject: Re: [patch 2/2] kprobes x86_64 quick fix mark-ro-data

This patch is a quick fix for x86_64 to enable kprobes only if DEBUG_RODATA
is not configured, since DEBUG_RODATA marks the kernel text pages as read-only.

Signed-off-by: Prasanna S P. <[email protected]>


arch/x86_64/Kconfig | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff -puN arch/x86_64/Kconfig~kprobes-quick-fix-mark-ro-data-x86_64 arch/x86_64/Kconfig
--- linux-2.6.22-rc5/arch/x86_64/Kconfig~kprobes-quick-fix-mark-ro-data-x86_64 2007-06-19 14:55:56.000000000 +0530
+++ linux-2.6.22-rc5-prasanna/arch/x86_64/Kconfig 2007-06-19 14:55:56.000000000 +0530
@@ -764,6 +764,8 @@ source "drivers/firmware/Kconfig"

source fs/Kconfig

+source "arch/x86_64/Kconfig.debug"
+
menu "Instrumentation Support"
depends on EXPERIMENTAL

@@ -771,7 +773,7 @@ source "arch/x86_64/oprofile/Kconfig"

config KPROBES
bool "Kprobes (EXPERIMENTAL)"
- depends on KALLSYMS && EXPERIMENTAL && MODULES
+ depends on KALLSYMS && EXPERIMENTAL && MODULES && !DEBUG_RODATA
help
Kprobes allows you to trap at almost any kernel address and
execute a callback function. register_kprobe() establishes
@@ -780,8 +782,6 @@ config KPROBES
If in doubt, say "N".
endmenu

-source "arch/x86_64/Kconfig.debug"
-
source "security/Kconfig"

source "crypto/Kconfig"

_
--
Prasanna S.P.
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-41776329

2007-06-19 13:25:32

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch 2/2] kprobes x86_64 quick fix mark-ro-data

On Tue, 2007-06-19 at 15:38 +0530, S. P. Prasanna wrote:
> This patch is a quick fix for x86_64 to enable kprobes only if DEBUG_RODATA
> is not configured, since DEBUG_RODATA marks the kernel text pages as read-only.


it does??

I don't seem to be able to find this in the source code.....


2007-06-19 13:34:53

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 2/2] kprobes x86_64 quick fix mark-ro-data

* Arjan van de Ven ([email protected]) wrote:
> On Tue, 2007-06-19 at 15:38 +0530, S. P. Prasanna wrote:
> > This patch is a quick fix for x86_64 to enable kprobes only if DEBUG_RODATA
> > is not configured, since DEBUG_RODATA marks the kernel text pages as read-only.
>
>
> it does??
>
> I don't seem to be able to find this in the source code.....
>

See arch/x86_64/mm/init.c:mark_rodata_ro().

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-06-19 13:47:57

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch 2/2] kprobes x86_64 quick fix mark-ro-data

On Tue, 2007-06-19 at 09:30 -0400, Mathieu Desnoyers wrote:
> * Arjan van de Ven ([email protected]) wrote:
> > On Tue, 2007-06-19 at 15:38 +0530, S. P. Prasanna wrote:
> > > This patch is a quick fix for x86_64 to enable kprobes only if DEBUG_RODATA
> > > is not configured, since DEBUG_RODATA marks the kernel text pages as read-only.
> >
> >
> > it does??
> >
> > I don't seem to be able to find this in the source code.....
> >

> See arch/x86_64/mm/init.c:mark_rodata_ro().

eh woops


[PATCH] x86: tighten kernel image page access rights
author:    Jan Beulich <[email protected]>   Wed, 2 May 2007 17:27:10 +0000 (19:27 +0200)
committer: Andi Kleen <[email protected]>           Wed, 2 May 2007 17:27:10 +0000 (19:27 +0200)
commit:    6fb14755a676282a4e6caa05a08c92db8e45cfff


changed it to include text (even though Andi vetoed that before, when I
asked for it, on the grounds of breaking kprobes)... sounds like this really
wants to be a second config option to handle code and data separately.

--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

2007-06-19 14:31:50

by S. P. Prasanna

[permalink] [raw]
Subject: Re: [patch 2/2] kprobes x86_64 quick fix mark-ro-data

On Tue, Jun 19, 2007 at 06:44:30AM -0700, Arjan van de Ven wrote:
> On Tue, 2007-06-19 at 09:30 -0400, Mathieu Desnoyers wrote:
> > * Arjan van de Ven ([email protected]) wrote:
> > > On Tue, 2007-06-19 at 15:38 +0530, S. P. Prasanna wrote:
> > > > This patch is a quick fix for x86_64 to enable kprobes only if DEBUG_RODATA
> > > > is not configured, since DEBUG_RODATA marks the kernel text pages as read-only.
> > >
> > >
> > > it does??
> > >
> > > I don't seem to be able to find this in the source code.....
> > >
>
> > See arch/x86_64/mm/init.c:mark_rodata_ro().
>
> eh woops
>
>
> [PATCH] x86: tighten kernel image page access rights
> author:    Jan Beulich <[email protected]>   Wed, 2 May 2007 17:27:10 +0000 (19:27 +0200)
> committer: Andi Kleen <[email protected]>           Wed, 2 May 2007 17:27:10 +0000 (19:27 +0200)
> commit:    6fb14755a676282a4e6caa05a08c92db8e45cfff
>
>
> changed it to include text (even though Andi vetoed that before, when I
> asked for it, on the grounds of breaking kprobes)... sounds like this really
> wants to be a second config option to handle code and data separately.

Something like having a separate config option and a routine to mark the
kernel text as read/execute only, and calling mark_rwtext_ro() from
mark_rodata_ro() when CONFIG_DEBUG_ROTEXT is enabled, as shown below.

/* This code is for the i386 architecture. */
static inline void mark_rwtext_ro(void)
{
	unsigned long start = PFN_ALIGN(_text);
	unsigned long size = PFN_ALIGN(_etext) - start;

#ifdef CONFIG_HOTPLUG_CPU
	/* It must still be possible to apply SMP alternatives. */
	if (num_possible_cpus() <= 1)
#endif
	{
		change_page_attr(virt_to_page(start),
				size >> PAGE_SHIFT, PAGE_KERNEL_RX);
		printk("Write protecting the kernel text: %luk\n", size >> 10);
	}
	/*
	 * global_flush_tlb() will be called after marking the data as read-only.
	 */
}

#ifdef CONFIG_DEBUG_RODATA

void mark_rodata_ro(void)
{
	....

#ifdef CONFIG_DEBUG_ROTEXT
	mark_rwtext_ro();
#endif
	....
}

Thanks
Prasanna
--
Prasanna S.P.
Linux Technology Center
India Software Labs, IBM Bangalore
Email: [email protected]
Ph: 91-80-41776329

2007-06-19 16:49:11

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 1/2] kprobes i386 quick fix mark-ro-data


> Please find the quick fix as per your suggestion below.

I've already done it myself. But I ended up doing it the other
way around -- making DEBUG_RODATA dependent on !KPROBES
because I figured KPROBES is more important than DEBUG_RODATA
and it is less confusing for the user this way.

-Andi