2009-11-18 22:44:59

by Jason Baron

Subject: [RFC PATCH 0/6] jump label v3

hi,

Refresh of the jump labeling patches. We introduce the following:

# ifdef CONFIG_X86_64
# define JUMP_LABEL_NOP P6_NOP5
# else
# define JUMP_LABEL_NOP ".byte 0xe9 \n\t .long 0\n\t"
# endif

# define JUMP_LABEL(tag, label, cond) \
do { \
static const char __jlstrtab_##tag[] \
__used __attribute__((section("__jump_strings"))) = #tag; \
asm goto("1:" \
JUMP_LABEL_NOP \
".pushsection __jump_table, \"a\" \n\t" \
_ASM_PTR "1b, %l[" #label "], %c0 \n\t" \
".popsection \n\t" \
: : "i" (__jlstrtab_##tag) : : label); \
} while (0)

-------------

I'm using an atomic 5 byte no-op for x86_64 and a long jump for 32-bit x86.
My understanding is that not all 32-bit processors have an atomic 5 byte no-op,
and thus using a long jump or jump 0, for the off case is the safest.
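
For concreteness, the two 'off' encodings in question (the byte values match
what the x86 patch later in this series emits; the array names below are only
for illustration):

static const unsigned char p6_nop5[]  = { 0x0f, 0x1f, 0x44, 0x00, 0x00 }; /* x86_64: atomic 5-byte nop */
static const unsigned char jmp_zero[] = { 0xe9, 0x00, 0x00, 0x00, 0x00 }; /* 32-bit: jmp +0, falls through */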

The JUMP_LABEL() macro can then be used by the tracepoint code as:

#define DECLARE_TRACE(name, proto, args) \
extern struct tracepoint __tracepoint_##name; \
static inline void trace_##name(proto) \
{ \
JUMP_LABEL(name, do_trace, __tracepoint_##name.state); \
return; \
do_trace: \
__DO_TRACE(&__tracepoint_##name, \
TP_PROTO(proto), TP_ARGS(args)); \
}


--------------

Thus, in the disabled tracing case we have a no-op followed by a jump around
the disabled code. When we enable the tracepoint, we simply patch the no-op
with a jump to the 'do_trace' label. This relies on the 'asm goto' construct
which is already merged into gcc 4.5. In subsequent gcc versions, we also hope
to make use of a 'cold' label for the 'do_trace' section, thus making the
disabled, straight-line codepath simply a no-op.
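
To make that concrete, here is a minimal sketch of what enabling a core kernel
tracepoint ends up doing, pieced together from patches 4-6 (locking and error
handling elided):

/* kernel/tracepoint.c: flip the site when the first probe is registered */
if (!elem->state && active)
	enable_jump_label(elem->name, data);

/* arch/x86/kernel/jump_label.c: rewrite the 5-byte site with a relative jump */
code.jump   = 0xe9;	/* jmp rel32 */
code.offset = entry->target - (entry->code + RELATIVEJUMP_SIZE);
text_poke_fixup((void *)entry->code, &code, RELATIVEJUMP_SIZE,
		(void *)entry->code + RELATIVEJUMP_SIZE);	/* fixup: the old fall-through */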

As discussed in previous mails, I've seen an average improvement of 30 cycles
per-tracepoint on x86_64 systems that I've tested.

The first 2 patches of the series are a repost of Masami's text_poke_fixup()
function, which allows for efficient instruction patching. The final 4 patches
implement the jump patching mechanism for x86 and x86_64.

The implementation is a 'low' level one, in the sense that it is geared
specifically for tracepoints. However, I believe this mechanism will be more
generally useful for other parts of the kernel. Thus, I will propose 'higher'
level interfaces into the jump label code (layered on these 'low' level ones),
as we go.

thanks,

-Jason

Masami Hiramatsu (2):
x86: Introduce generic jump patching without stop_machine
kprobes/x86: Cleanup RELATIVEJUMP_INSTRUCTION to RELATIVEJUMP_OPCODE

Jason Baron (4):
move opcode defs from asm/kprobes.h to asm/alternative.h
jump-label-basic
jump-module-support
jump-label-tracepoints

arch/x86/include/asm/alternative.h | 17 +++++
arch/x86/include/asm/jump_label.h | 35 +++++++++++
arch/x86/include/asm/kprobes.h | 3 -
arch/x86/kernel/Makefile | 2 +-
arch/x86/kernel/alternative.c | 120 ++++++++++++++++++++++++++++++++++++
arch/x86/kernel/jump_label.c | 66 ++++++++++++++++++++
arch/x86/kernel/kprobes.c | 2 +-
include/asm-generic/vmlinux.lds.h | 11 +++-
include/linux/jump_label.h | 47 ++++++++++++++
include/linux/module.h | 12 +++-
include/linux/tracepoint.h | 35 ++++++-----
kernel/kprobes.c | 2 +-
kernel/module.c | 27 ++++++++-
kernel/tracepoint.c | 25 ++++++--
14 files changed, 372 insertions(+), 32 deletions(-)
create mode 100644 arch/x86/include/asm/jump_label.h
create mode 100644 arch/x86/kernel/jump_label.c
create mode 100644 include/linux/jump_label.h


2009-11-18 22:43:53

by Jason Baron

Subject: [RFC PATCH 1/6] jump label v3 - kprobes/x86: Cleanup RELATIVEJUMP_INSTRUCTION to RELATIVEJUMP_OPCODE

Change RELATIVEJUMP_INSTRUCTION macro to RELATIVEJUMP_OPCODE since it
represents just the opcode byte.


Signed-off-by: Masami Hiramatsu <[email protected]>
---
arch/x86/include/asm/kprobes.h | 2 +-
arch/x86/kernel/kprobes.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index 4fe681d..eaec8ea 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -32,7 +32,7 @@ struct kprobe;

typedef u8 kprobe_opcode_t;
#define BREAKPOINT_INSTRUCTION 0xcc
-#define RELATIVEJUMP_INSTRUCTION 0xe9
+#define RELATIVEJUMP_OPCODE 0xe9
#define MAX_INSN_SIZE 16
#define MAX_STACK_SIZE 64
#define MIN_STACK_SIZE(ADDR) \
diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
index 1f3186c..b6b75f1 100644
--- a/arch/x86/kernel/kprobes.c
+++ b/arch/x86/kernel/kprobes.c
@@ -115,7 +115,7 @@ static void __kprobes set_jmp_op(void *from, void *to)
} __attribute__((packed)) * jop;
jop = (struct __arch_jmp_op *)from;
jop->raddr = (s32)((long)(to) - ((long)(from) + 5));
- jop->op = RELATIVEJUMP_INSTRUCTION;
+ jop->op = RELATIVEJUMP_OPCODE;
}

/*
--
1.6.5.1

2009-11-18 22:44:15

by Jason Baron

Subject: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine

Add text_poke_fixup() which takes a fixup address to where a processor
jumps if it hits the modifying address while code modifying.
text_poke_fixup() does following steps for this purpose.

1. Setup int3 handler for fixup.
2. Put a breakpoint (int3) on the first byte of modifying region,
and synchronize code on all CPUs.
3. Modify other bytes of modifying region, and synchronize code on all CPUs.
4. Modify the first byte of modifying region, and synchronize code
on all CPUs.
5. Clear int3 handler.

Thus, if some other processor execute modifying address when step2 to step4,
it will be jumped to fixup code.

This still has many limitations for modifying multi-instructions at once.
However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
because;
- Replaced instruction is just one instruction, which is executed atomically.
- Replacing instruction is a jump, so we can set fixup address where the jump
goes to.

Signed-off-by: Masami Hiramatsu <[email protected]>
---
arch/x86/include/asm/alternative.h | 12 ++++
arch/x86/include/asm/kprobes.h | 1 +
arch/x86/kernel/alternative.c | 120 ++++++++++++++++++++++++++++++++++++
kernel/kprobes.c | 2 +-
4 files changed, 134 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index c240efc..03843a8 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -160,4 +160,16 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
*/
extern void *text_poke(void *addr, const void *opcode, size_t len);

+/*
+ * Setup int3 trap and fixup execution for cross-modifying on SMP case.
+ * If the other cpus execute modifying instruction, it will hit int3
+ * and go to fixup code. This just provides a minimal safety check.
+ * Additional checks/restrictions are required for completely safe
+ * cross-modifying.
+ */
+extern void *text_poke_fixup(void *addr, const void *opcode, size_t len,
+ void *fixup);
+extern int text_patch_jump(void *addr, void *dest);
+extern void sync_core_all(void);
+
#endif /* _ASM_X86_ALTERNATIVE_H */
diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index eaec8ea..febab97 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -33,6 +33,7 @@ struct kprobe;
typedef u8 kprobe_opcode_t;
#define BREAKPOINT_INSTRUCTION 0xcc
#define RELATIVEJUMP_OPCODE 0xe9
+#define RELATIVEJUMP_SIZE 5
#define MAX_INSN_SIZE 16
#define MAX_STACK_SIZE 64
#define MIN_STACK_SIZE(ADDR) \
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index de7353c..af47f12 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -4,6 +4,7 @@
#include <linux/list.h>
#include <linux/stringify.h>
#include <linux/kprobes.h>
+#include <linux/kdebug.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/memory.h>
@@ -552,3 +553,122 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
local_irq_restore(flags);
return addr;
}
+
+/*
+ * On pentium series, Unsynchronized cross-modifying code
+ * operations can cause unexpected instruction execution results.
+ * So after code modified, we should synchronize it on each processor.
+ */
+static void __kprobes __local_sync_core(void *info)
+{
+ sync_core();
+}
+
+void __kprobes sync_core_all(void)
+{
+ on_each_cpu(__local_sync_core, NULL, 1);
+}
+
+/* Safely cross-code modifying with fixup address */
+static void *patch_fixup_from;
+static void *patch_fixup_addr;
+
+static int __kprobes patch_exceptions_notify(struct notifier_block *self,
+ unsigned long val, void *data)
+{
+ struct die_args *args = data;
+ struct pt_regs *regs = args->regs;
+
+ if (likely(!patch_fixup_from))
+ return NOTIFY_DONE;
+
+ if (val != DIE_INT3 || !regs || user_mode_vm(regs) ||
+ (unsigned long)patch_fixup_from != regs->ip)
+ return NOTIFY_DONE;
+
+ args->regs->ip = (unsigned long)patch_fixup_addr;
+ return NOTIFY_STOP;
+}
+
+/**
+ * text_poke_fixup() -- cross-modifying kernel text with fixup address.
+ * @addr: Modifying address.
+ * @opcode: New instruction.
+ * @len: length of modifying bytes.
+ * @fixup: Fixup address.
+ *
+ * Note: You must backup replaced instructions before calling this,
+ * if you need to recover it.
+ * Note: Must be called under text_mutex.
+ */
+void *__kprobes text_poke_fixup(void *addr, const void *opcode, size_t len,
+ void *fixup)
+{
+ static const unsigned char int3_insn = BREAKPOINT_INSTRUCTION;
+ static const int int3_size = sizeof(int3_insn);
+
+ /* Replacing 1 byte can be done atomically. */
+ if (unlikely(len <= 1))
+ return text_poke(addr, opcode, len);
+
+ /* Preparing */
+ patch_fixup_addr = fixup;
+ wmb();
+ patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */
+
+ /* Cap by an int3 */
+ text_poke(addr, &int3_insn, int3_size);
+ sync_core_all();
+
+ /* Replace tail bytes */
+ text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
+ len - int3_size);
+ sync_core_all();
+
+ /* Replace int3 with head byte */
+ text_poke(addr, opcode, int3_size);
+ sync_core_all();
+
+ /* Cleanup */
+ patch_fixup_from = NULL;
+ wmb();
+ return addr;
+}
+
+/**
+ * text_patch_jump() -- cross-modifying kernel text with a relative jump
+ * @addr: Address where jump from.
+ * @dest: Address where jump to.
+ *
+ * Return 0 if succeeded to embed a jump. Otherwise, there is an error.
+ * Note: You must backup replaced instructions before calling this,
+ * if you need to recover it.
+ * Note: Must be called under text_mutex.
+ */
+int __kprobes text_patch_jump(void *addr, void *dest)
+{
+ unsigned char jmp_code[RELATIVEJUMP_SIZE];
+ s32 rel = (s32)((long)dest - ((long)addr + RELATIVEJUMP_SIZE));
+
+ if (!addr || !dest ||
+ (long)rel != ((long)dest - ((long)addr + RELATIVEJUMP_SIZE)))
+ return -EINVAL;
+
+ jmp_code[0] = RELATIVEJUMP_OPCODE;
+ *(s32 *)(&jmp_code[1]) = rel;
+
+ text_poke_fixup(addr, jmp_code, RELATIVEJUMP_SIZE, dest);
+ return 0;
+}
+
+static struct notifier_block patch_exceptions_nb = {
+ .notifier_call = patch_exceptions_notify,
+ .priority = 0x7fffffff /* we need to be notified first */
+};
+
+static int __init patch_init(void)
+{
+ return register_die_notifier(&patch_exceptions_nb);
+}
+
+arch_initcall(patch_init);
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index e5342a3..43a30d8 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -898,7 +898,7 @@ EXPORT_SYMBOL_GPL(unregister_kprobes);

static struct notifier_block kprobe_exceptions_nb = {
.notifier_call = kprobe_exceptions_notify,
- .priority = 0x7fffffff /* we need to be notified first */
+ .priority = 0x7ffffff0 /* High priority, but not first. */
};

unsigned long __weak arch_deref_entry_point(void *entry)
--
1.6.5.1

2009-11-18 22:44:28

by Jason Baron

Subject: [RFC PATCH 3/6] jump label v3 - move opcode defs

Move x86 opcode macros from arch/x86/include/asm/kprobes.h to
arch/x86/include/asm/alternative.h so they are useful outside of kprobes.


Signed-off-by: Jason Baron <[email protected]>
---
arch/x86/include/asm/alternative.h | 5 +++++
arch/x86/include/asm/kprobes.h | 4 ----
2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 03843a8..6a22696 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -160,6 +160,11 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
*/
extern void *text_poke(void *addr, const void *opcode, size_t len);

+#define BREAKPOINT_INSTRUCTION 0xcc
+#define RELATIVEJUMP_OPCODE 0xe9
+#define RELATIVEJUMP_SIZE 5
+#define MAX_INSN_SIZE 16
+
/*
* Setup int3 trap and fixup execution for cross-modifying on SMP case.
* If the other cpus execute modifying instruction, it will hit int3
diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index febab97..15b1b12 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -31,10 +31,6 @@ struct pt_regs;
struct kprobe;

typedef u8 kprobe_opcode_t;
-#define BREAKPOINT_INSTRUCTION 0xcc
-#define RELATIVEJUMP_OPCODE 0xe9
-#define RELATIVEJUMP_SIZE 5
-#define MAX_INSN_SIZE 16
#define MAX_STACK_SIZE 64
#define MIN_STACK_SIZE(ADDR) \
(((MAX_STACK_SIZE) < (((unsigned long)current_thread_info()) + \
--
1.6.5.1

2009-11-18 22:44:07

by Jason Baron

Subject: [RFC PATCH 4/6] jump label v3 - base patch

base patch to implement 'jump labeling'. Based on a new 'asm goto' inline
assembly gcc mechanism, we can now branch to labels from an 'asm goto'
statement. This allows us to create a 'no-op' fastpath, which can subsequently
be patched with a jump to the slowpath code. This is useful for code which
might be rarely used, but which we'd like to be able to call, if needed.
Tracepoints are the current use case this is being implemented for.

Note that this is conditional on gcc >= 4.5. Also there is a !lockdep
condition. This is b/c the die_notifier takes an rcu_read_lock() on the
int3 trap, which then causes another one etc. Since, we aren't going to be
installing removing the handler, the rcu_read_lock() could be avoided for this
case with some code restructuring.

Also, the patch is dependent on CONFIG_KPROBES. This is simply due to a
reliance on the int3 trap handler code that is only enabled with kprobes. This
could also be modified.

Signed-off-by: Jason Baron <[email protected]>
---
arch/x86/include/asm/jump_label.h | 35 ++++++++++++++++++++
arch/x86/kernel/Makefile | 2 +-
arch/x86/kernel/jump_label.c | 63 +++++++++++++++++++++++++++++++++++++
include/asm-generic/vmlinux.lds.h | 11 ++++++-
include/linux/jump_label.h | 45 ++++++++++++++++++++++++++
5 files changed, 154 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/include/asm/jump_label.h
create mode 100644 arch/x86/kernel/jump_label.c
create mode 100644 include/linux/jump_label.h

diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
new file mode 100644
index 0000000..5817a86
--- /dev/null
+++ b/arch/x86/include/asm/jump_label.h
@@ -0,0 +1,35 @@
+#ifndef _ASM_X86_JUMP_LABEL_H
+#define _ASM_X86_JUMP_LABEL_H
+
+#include <asm/nops.h>
+
+#if (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 5)) && \
+ !defined(CONFIG_LOCKDEP) && defined(CONFIG_KPROBES)
+# define __HAVE_ARCH_JUMP_LABEL
+#endif
+
+
+#ifdef __HAVE_ARCH_JUMP_LABEL
+
+# ifdef CONFIG_X86_64
+# define JUMP_LABEL_NOP P6_NOP5
+# else
+# define JUMP_LABEL_NOP ".byte 0xe9 \n\t .long 0\n\t"
+# endif
+
+# define JUMP_LABEL(tag, label, cond) \
+ do { \
+ static const char __jlstrtab_##tag[] \
+ __used __attribute__((section("__jump_strings"))) = #tag; \
+ asm goto("1:" \
+ JUMP_LABEL_NOP \
+ ".pushsection __jump_table, \"a\" \n\t" \
+ _ASM_PTR "1b, %l[" #label "], %c0 \n\t" \
+ ".popsection \n\t" \
+ : : "i" (__jlstrtab_##tag) : : label); \
+ } while (0)
+
+# endif
+
+#endif
+
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 4f2e66e..df3b341 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -32,7 +32,7 @@ GCOV_PROFILE_paravirt.o := n
obj-y := process_$(BITS).o signal.o entry_$(BITS).o
obj-y += traps.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
obj-y += time.o ioport.o ldt.o dumpstack.o
-obj-y += setup.o x86_init.o i8259.o irqinit.o
+obj-y += setup.o x86_init.o i8259.o irqinit.o jump_label.o
obj-$(CONFIG_X86_VISWS) += visws_quirks.o
obj-$(CONFIG_X86_32) += probe_roms_32.o
obj-$(CONFIG_X86_32) += sys_i386_32.o i386_ksyms_32.o
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
new file mode 100644
index 0000000..4131bbd
--- /dev/null
+++ b/arch/x86/kernel/jump_label.c
@@ -0,0 +1,63 @@
+#include <linux/jump_label.h>
+#include <linux/memory.h>
+#include <linux/uaccess.h>
+#include <asm/alternative.h>
+
+#ifdef __HAVE_ARCH_JUMP_LABEL
+
+extern struct jump_entry __start___jump_table[];
+extern struct jump_entry __stop___jump_table[];
+
+union jump_code_union {
+ char code[RELATIVEJUMP_SIZE];
+ struct {
+ char jump;
+ int offset;
+ } __attribute__((packed));
+};
+
+void jump_label_transform(struct jump_entry *entry, enum jump_label_type type)
+{
+ union jump_code_union code;
+
+ if (type == JUMP_LABEL_ENABLE) {
+ code.jump = 0xe9;
+ code.offset = entry->target - (entry->code + RELATIVEJUMP_SIZE);
+ } else {
+#ifdef CONFIG_X86_64
+ /* opcode for P6_NOP5 */
+ code.code[0] = 0x0f;
+ code.code[1] = 0x1f;
+ code.code[2] = 0x44;
+ code.code[3] = 0x00;
+ code.code[4] = 0x00;
+#else
+ code.jump = 0xe9;
+ code.offset = 0;
+#endif
+ }
+
+ mutex_lock(&text_mutex);
+ text_poke_fixup((void *)entry->code, &code, RELATIVEJUMP_SIZE,
+ (void *)entry->code + RELATIVEJUMP_SIZE);
+ mutex_unlock(&text_mutex);
+}
+
+int jump_label_update(const char *name, enum jump_label_type type, void *mod)
+{
+ struct jump_entry *iter;
+
+ if (mod)
+ return 0;
+
+ for (iter = __start___jump_table; iter < __stop___jump_table; iter++) {
+ if (!strcmp(name, iter->name)) {
+ if (kernel_text_address(iter->code))
+ jump_label_transform(iter, type);
+ }
+ }
+ return 0;
+}
+
+#endif
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 5d5def0..8fcbe3b 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -171,7 +171,8 @@
BRANCH_PROFILE() \
TRACE_PRINTKS() \
FTRACE_EVENTS() \
- TRACE_SYSCALLS()
+ TRACE_SYSCALLS() \
+ JUMP_TABLE() \

/*
* Data section helpers
@@ -210,6 +211,7 @@
*(__vermagic) /* Kernel version magic */ \
*(__markers_strings) /* Markers: strings */ \
*(__tracepoints_strings)/* Tracepoints: strings */ \
+ *(__jump_strings)/* Jump: strings */ \
} \
\
.rodata1 : AT(ADDR(.rodata1) - LOAD_OFFSET) { \
@@ -218,6 +220,7 @@
\
BUG_TABLE \
\
+ \
/* PCI quirks */ \
.pci_fixup : AT(ADDR(.pci_fixup) - LOAD_OFFSET) { \
VMLINUX_SYMBOL(__start_pci_fixups_early) = .; \
@@ -561,6 +564,12 @@
#define BUG_TABLE
#endif

+#define JUMP_TABLE() \
+ . = ALIGN(64); \
+ VMLINUX_SYMBOL(__start___jump_table) = .; \
+ *(__jump_table) \
+ VMLINUX_SYMBOL(__stop___jump_table) = .; \
+
#ifdef CONFIG_PM_TRACE
#define TRACEDATA \
. = ALIGN(4); \
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
new file mode 100644
index 0000000..2719506
--- /dev/null
+++ b/include/linux/jump_label.h
@@ -0,0 +1,45 @@
+#ifndef _LINUX_JUMP_LABEL_H
+#define _LINUX_JUMP_LABEL_H
+
+#include <asm/jump_label.h>
+
+struct jump_entry {
+ unsigned long code;
+ unsigned long target;
+ char *name;
+};
+
+enum jump_label_type {
+ JUMP_LABEL_ENABLE,
+ JUMP_LABEL_DISABLE
+};
+
+#ifdef __HAVE_ARCH_JUMP_LABEL
+
+extern int jump_label_update(const char *name, enum jump_label_type type, void *mod);
+
+#define enable_jump_label(name, mod) \
+ jump_label_update(name, JUMP_LABEL_ENABLE, mod);
+
+#define disable_jump_label(name, mod) \
+ jump_label_update(name, JUMP_LABEL_DISABLE, mod);
+
+#else
+
+#define JUMP_LABEL(tag, label, cond) \
+ if (unlikely(cond)) \
+ goto label;
+
+static inline int enable_jump_label(const char *name, void *mod)
+{
+ return 0;
+}
+
+static inline int disable_jump_label(const char *name, void *mod)
+{
+ return 0;
+}
+
+#endif
+
+#endif
--
1.6.5.1

2009-11-18 22:44:09

by Jason Baron

Subject: [RFC PATCH 5/6] jump label v3 - add module support

Add support for 'jump label' for modules.

Signed-off-by: Jason Baron <[email protected]>
---
arch/x86/kernel/jump_label.c | 5 ++++-
include/linux/jump_label.h | 2 ++
include/linux/module.h | 12 +++++++++++-
kernel/module.c | 25 +++++++++++++++++++++++++
4 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index 4131bbd..080062d 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -1,6 +1,7 @@
#include <linux/jump_label.h>
#include <linux/memory.h>
#include <linux/uaccess.h>
+#include <linux/module.h>
#include <asm/alternative.h>

#ifdef __HAVE_ARCH_JUMP_LABEL
@@ -47,8 +48,10 @@ int jump_label_update(const char *name, enum jump_label_type type, void *mod)
{
struct jump_entry *iter;

- if (mod)
+ if (mod) {
+ jump_label_update_module(name, type, mod);
return 0;
+ }

for (iter = __start___jump_table; iter < __stop___jump_table; iter++) {
if (!strcmp(name, iter->name)) {
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 2719506..7935ff6 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -17,6 +17,8 @@ enum jump_label_type {
#ifdef __HAVE_ARCH_JUMP_LABEL

extern int jump_label_update(const char *name, enum jump_label_type type, void *mod);
+extern void jump_label_transform(struct jump_entry *entry,
+ enum jump_label_type type);

#define enable_jump_label(name, mod) \
jump_label_update(name, JUMP_LABEL_ENABLE, mod);
diff --git a/include/linux/module.h b/include/linux/module.h
index 819354a..307b64a 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -16,6 +16,7 @@
#include <linux/kobject.h>
#include <linux/moduleparam.h>
#include <linux/tracepoint.h>
+#include <linux/jump_label.h>

#include <asm/local.h>
#include <asm/module.h>
@@ -355,7 +356,10 @@ struct module
struct tracepoint *tracepoints;
unsigned int num_tracepoints;
#endif
-
+#ifdef __HAVE_ARCH_JUMP_LABEL
+ struct jump_entry *jump_entries;
+ unsigned int num_jump_entries;
+#endif
#ifdef CONFIG_TRACING
const char **trace_bprintk_fmt_start;
unsigned int num_trace_bprintk_fmt;
@@ -557,6 +561,7 @@ extern void print_modules(void);

extern void module_update_tracepoints(void);
extern int module_get_iter_tracepoints(struct tracepoint_iter *iter);
+extern void jump_label_update_module(const char *name, enum jump_label_type type, void *mod);

#else /* !CONFIG_MODULES... */
#define EXPORT_SYMBOL(sym)
@@ -678,6 +683,11 @@ static inline int module_get_iter_tracepoints(struct tracepoint_iter *iter)
return 0;
}

+static inline void jump_label_update_module(const char *name, enum jump_label_type type, void *mod)
+{
+ return;
+}
+
#endif /* CONFIG_MODULES */

struct device_driver;
diff --git a/kernel/module.c b/kernel/module.c
index 8b7d880..6b8a754 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -55,6 +55,7 @@
#include <linux/async.h>
#include <linux/percpu.h>
#include <linux/kmemleak.h>
+#include <linux/jump_label.h>

#define CREATE_TRACE_POINTS
#include <trace/events/module.h>
@@ -2378,6 +2379,12 @@ static noinline struct module *load_module(void __user *umod,
sizeof(*mod->tracepoints),
&mod->num_tracepoints);
#endif
+#ifdef __HAVE_ARCH_JUMP_LABEL
+ mod->jump_entries = section_objs(hdr, sechdrs, secstrings,
+ "__jump_table",
+ sizeof(*mod->jump_entries),
+ &mod->num_jump_entries);
+#endif
#ifdef CONFIG_EVENT_TRACING
mod->trace_events = section_objs(hdr, sechdrs, secstrings,
"_ftrace_events",
@@ -3143,4 +3150,22 @@ int module_get_iter_tracepoints(struct tracepoint_iter *iter)
mutex_unlock(&module_mutex);
return found;
}
+
+#endif
+
+#ifdef __HAVE_ARCH_JUMP_LABEL
+
+void jump_label_update_module(const char *name, enum jump_label_type type, void *data)
+{
+ struct jump_entry *entry;
+ struct module *mod = data;
+
+ for (entry = mod->jump_entries;
+ entry < mod->jump_entries + mod->num_jump_entries;
+ entry++) {
+ if (!strcmp(name, entry->name))
+ jump_label_transform(entry, type);
+ }
+}
+
#endif
--
1.6.5.1

2009-11-18 22:44:43

by Jason Baron

Subject: [RFC PATCH 6/6] jump label v3 - tracepoint support

Make use of the jump label infrastructure for tracepoints.

Signed-off-by: Jason Baron <[email protected]>
---
include/linux/tracepoint.h | 35 +++++++++++++++++++----------------
kernel/module.c | 2 +-
kernel/tracepoint.c | 25 ++++++++++++++++++-------
3 files changed, 38 insertions(+), 24 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 2aac8a8..7c4e9cb 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -16,6 +16,7 @@

#include <linux/types.h>
#include <linux/rcupdate.h>
+#include <linux/jump_label.h>

struct module;
struct tracepoint;
@@ -63,20 +64,22 @@ struct tracepoint {
* not add unwanted padding between the beginning of the section and the
* structure. Force alignment to the same alignment as the section start.
*/
-#define DECLARE_TRACE(name, proto, args) \
- extern struct tracepoint __tracepoint_##name; \
- static inline void trace_##name(proto) \
- { \
- if (unlikely(__tracepoint_##name.state)) \
- __DO_TRACE(&__tracepoint_##name, \
- TP_PROTO(proto), TP_ARGS(args)); \
- } \
- static inline int register_trace_##name(void (*probe)(proto)) \
- { \
- return tracepoint_probe_register(#name, (void *)probe); \
- } \
- static inline int unregister_trace_##name(void (*probe)(proto)) \
- { \
+#define DECLARE_TRACE(name, proto, args) \
+ extern struct tracepoint __tracepoint_##name; \
+ static inline void trace_##name(proto) \
+ { \
+ JUMP_LABEL(name, do_trace, __tracepoint_##name.state); \
+ return; \
+do_trace: \
+ __DO_TRACE(&__tracepoint_##name, \
+ TP_PROTO(proto), TP_ARGS(args)); \
+ } \
+ static inline int register_trace_##name(void (*probe)(proto)) \
+ { \
+ return tracepoint_probe_register(#name, (void *)probe); \
+ } \
+ static inline int unregister_trace_##name(void (*probe)(proto)) \
+ { \
return tracepoint_probe_unregister(#name, (void *)probe);\
}

@@ -97,7 +100,7 @@ struct tracepoint {
EXPORT_SYMBOL(__tracepoint_##name)

extern void tracepoint_update_probe_range(struct tracepoint *begin,
- struct tracepoint *end);
+ struct tracepoint *end, void *data);

#else /* !CONFIG_TRACEPOINTS */
#define DECLARE_TRACE(name, proto, args) \
@@ -120,7 +123,7 @@ extern void tracepoint_update_probe_range(struct tracepoint *begin,
#define EXPORT_TRACEPOINT_SYMBOL(name)

static inline void tracepoint_update_probe_range(struct tracepoint *begin,
- struct tracepoint *end)
+ struct tracepoint *end, void *data)
{ }
#endif /* CONFIG_TRACEPOINTS */
#endif /* DECLARE_TRACE */
diff --git a/kernel/module.c b/kernel/module.c
index 6b8a754..7ceba66 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3114,7 +3114,7 @@ void module_update_tracepoints(void)
list_for_each_entry(mod, &modules, list)
if (!mod->taints)
tracepoint_update_probe_range(mod->tracepoints,
- mod->tracepoints + mod->num_tracepoints);
+ mod->tracepoints + mod->num_tracepoints, mod);
mutex_unlock(&module_mutex);
}

diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index cc89be5..d78f4b0 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -25,6 +25,7 @@
#include <linux/err.h>
#include <linux/slab.h>
#include <linux/sched.h>
+#include <linux/jump_label.h>

extern struct tracepoint __start___tracepoints[];
extern struct tracepoint __stop___tracepoints[];
@@ -239,7 +240,7 @@ static inline void remove_tracepoint(struct tracepoint_entry *e)
* Sets the probe callback corresponding to one tracepoint.
*/
static void set_tracepoint(struct tracepoint_entry **entry,
- struct tracepoint *elem, int active)
+ struct tracepoint *elem, int active, void *data)
{
WARN_ON(strcmp((*entry)->name, elem->name) != 0);

@@ -256,6 +257,12 @@ static void set_tracepoint(struct tracepoint_entry **entry,
* is used.
*/
rcu_assign_pointer(elem->funcs, (*entry)->funcs);
+
+ if (!elem->state && active)
+ enable_jump_label(elem->name, data);
+ else if (elem->state && !active)
+ disable_jump_label(elem->name, data);
+
elem->state = active;
}

@@ -265,11 +272,14 @@ static void set_tracepoint(struct tracepoint_entry **entry,
* function insures that the original callback is not used anymore. This insured
* by preempt_disable around the call site.
*/
-static void disable_tracepoint(struct tracepoint *elem)
+static void disable_tracepoint(struct tracepoint *elem, void *data)
{
if (elem->unregfunc && elem->state)
elem->unregfunc();

+ if (elem->state)
+ disable_jump_label(elem->name, data);
+
elem->state = 0;
rcu_assign_pointer(elem->funcs, NULL);
}
@@ -282,7 +292,8 @@ static void disable_tracepoint(struct tracepoint *elem)
* Updates the probe callback corresponding to a range of tracepoints.
*/
void
-tracepoint_update_probe_range(struct tracepoint *begin, struct tracepoint *end)
+tracepoint_update_probe_range(struct tracepoint *begin, struct tracepoint *end,
+ void *data)
{
struct tracepoint *iter;
struct tracepoint_entry *mark_entry;
@@ -295,9 +306,9 @@ tracepoint_update_probe_range(struct tracepoint *begin, struct tracepoint *end)
mark_entry = get_tracepoint(iter->name);
if (mark_entry) {
set_tracepoint(&mark_entry, iter,
- !!mark_entry->refcount);
+ !!mark_entry->refcount, data);
} else {
- disable_tracepoint(iter);
+ disable_tracepoint(iter, data);
}
}
mutex_unlock(&tracepoints_mutex);
@@ -310,7 +321,7 @@ static void tracepoint_update_probes(void)
{
/* Core kernel tracepoints */
tracepoint_update_probe_range(__start___tracepoints,
- __stop___tracepoints);
+ __stop___tracepoints, NULL);
/* tracepoints in modules. */
module_update_tracepoints();
}
@@ -565,7 +576,7 @@ int tracepoint_module_notify(struct notifier_block *self,
case MODULE_STATE_COMING:
case MODULE_STATE_GOING:
tracepoint_update_probe_range(mod->tracepoints,
- mod->tracepoints + mod->num_tracepoints);
+ mod->tracepoints + mod->num_tracepoints, data);
break;
}
return 0;
--
1.6.5.1

2009-11-18 22:52:39

by H. Peter Anvin

Subject: Re: [RFC PATCH 0/6] jump label v3

On 11/18/2009 02:43 PM, Jason Baron wrote:
>
> I'm using an atomic 5 byte no-op for x86_64 and a long jump for 32-bit x86.
> My understanding is that not all 32-bit processors have an atomic 5 byte no-op,
> and thus using a long jump or jump 0, for the off case is the safest.
>

67 66 8D 74 00 (lea si,[si+0]) should work as a 32-bit atomic NOP. It's
not necessarily the fastest, though (I have no data on that.)
Similarly, 66 66 66 66 90 should also work.

-hpa

2009-11-18 23:08:19

by Roland McGrath

Subject: Re: [RFC PATCH 0/6] jump label v3

> 67 66 8D 74 00 (lea si,[si+0]) should work as a 32-bit atomic NOP. It's
> not necessarily the fastest, though (I have no data on that.)
> Similarly, 66 66 66 66 90 should also work.

We should get all the knowledge like that stored in places like the
Kconfig.cpu comments near X86_P6_NOP and/or asm/nops.h macros and comments.

Let's have an ASM_ATOMIC_NOP5 macro in asm/nops.h? I've lost track of the
variants, and I'll leave that to you all who are close to the chip people.
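
(Something like the following, say -- purely a sketch using the two encodings
above; neither name exists in asm/nops.h today:)

#define ASM_ATOMIC_NOP5		".byte 0x67, 0x66, 0x8d, 0x74, 0x00\n\t"	/* lea si,[si+0] */
#define ASM_ATOMIC_NOP5_ALT	".byte 0x66, 0x66, 0x66, 0x66, 0x90\n\t"	/* 66-prefixed nop */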

I can't tell if it's the case that there will be kernel configurations
where there is one known-safe choice but a different choice that's optimal
if CPU model checks pass. If so, we could compile in the safe choice and
then mass-change to the optimal choice at boot time. (i.e. like the
alternatives support, but we don't really need to use that too since we
already have another dedicated table of all the PC addresses to touch.)
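
(Roughly like this, perhaps -- a sketch only, reusing the __jump_table from
this series; whether X86_FEATURE_NOPL is the right runtime test is exactly the
open question:)

/* hypothetical: compiled-in safe choice vs. preferred choice, picked at boot */
static const char safe_nop5[5]  = { 0xe9, 0x00, 0x00, 0x00, 0x00 };	/* jmp +0  */
static const char ideal_nop5[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };	/* P6_NOP5 */

static void __init jump_label_boot_nops(void)
{
	struct jump_entry *iter;
	const char *nop = boot_cpu_has(X86_FEATURE_NOPL) ? ideal_nop5 : safe_nop5;

	for (iter = __start___jump_table; iter < __stop___jump_table; iter++)
		text_poke((void *)iter->code, nop, 5);	/* every site starts disabled */
}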


Thanks,
Roland

2009-11-18 23:19:07

by H. Peter Anvin

Subject: Re: [RFC PATCH 0/6] jump label v3

On 11/18/2009 03:07 PM, Roland McGrath wrote:
>> 67 66 8D 74 00 (lea si,[si+0]) should work as a 32-bit atomic NOP. It's
>> not necessarily the fastest, though (I have no data on that.)
>> Similarly, 66 66 66 66 90 should also work.
>
> We should get all the knowledge like that stored in places like the
> Kconfig.cpu comments near X86_P6_NOP and/or asm/nops.h macros and comments.
>
> Let's have an ASM_ATOMIC_NOP5 macro in asm/nops.h? I've lost track of the
> variants, and I'll leave that to you all who are close to the chip people.
>
> I can't tell if it's the case that there will be kernel configurations
> where there is one known-safe choice but a different choice that's optimal
> if CPU model checks pass. If so, we could compile in the safe choice and
> then mass-change to the optimal choice at boot time. (i.e. like the
> alternatives support, but we don't really need to use that too since we
> already have another dedicated table of all the PC addresses to touch.)

Yes, I believe we do something like that already.

-hpa

2009-11-18 23:38:13

by Mathieu Desnoyers

Subject: [PATCH] notifier atomic call chain notrace

* Jason Baron ([email protected]) wrote:
> Note that this is conditional on gcc >= 4.5. Also there is a !lockdep
> condition. This is b/c the die_notifier takes an rcu_read_lock() on the
> int3 trap, which then causes another one etc. Since, we aren't going to be
> installing removing the handler, the rcu_read_lock() could be avoided for this
> case with some code restructuring.
>
[snip]

Would the following patch help ? I use it in the LTTng tree to alleviate
this problem.


notifier atomic call chain notrace

In LTTng, being able to use the atomic notifier from cpu idle entry to
ensure the tracer flush the last events in the current subbuffer
requires the rcu read-side to be marked "notrace", otherwise it can end
up calling back into lockdep and the tracer.

Also apply to the die notifier.

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Jason Baron <[email protected]>
CC: [email protected]
---
kernel/notifier.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6-lttng/kernel/notifier.c
===================================================================
--- linux-2.6-lttng.orig/kernel/notifier.c 2009-11-12 17:58:56.000000000 -0500
+++ linux-2.6-lttng/kernel/notifier.c 2009-11-12 18:03:28.000000000 -0500
@@ -148,7 +148,7 @@ int atomic_notifier_chain_unregister(str
spin_lock_irqsave(&nh->lock, flags);
ret = notifier_chain_unregister(&nh->head, n);
spin_unlock_irqrestore(&nh->lock, flags);
- synchronize_rcu();
+ synchronize_sched();
return ret;
}
EXPORT_SYMBOL_GPL(atomic_notifier_chain_unregister);
@@ -178,9 +178,9 @@ int __kprobes __atomic_notifier_call_cha
{
int ret;

- rcu_read_lock();
+ rcu_read_lock_sched_notrace();
ret = notifier_call_chain(&nh->head, val, v, nr_to_call, nr_calls);
- rcu_read_unlock();
+ rcu_read_unlock_sched_notrace();
return ret;
}
EXPORT_SYMBOL_GPL(__atomic_notifier_call_chain);

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-11-19 00:02:09

by Paul E. McKenney

Subject: Re: [PATCH] notifier atomic call chain notrace

On Wed, Nov 18, 2009 at 06:38:15PM -0500, Mathieu Desnoyers wrote:
> * Jason Baron ([email protected]) wrote:
> > Note that this is conditional on gcc >= 4.5. Also there is a !lockdep
> > condition. This is b/c the die_notifier takes an rcu_read_lock() on the
> > int3 trap, which then causes another one etc. Since, we aren't going to be
> > installing removing the handler, the rcu_read_lock() could be avoided for this
> > case with some code restructuring.
> >
> [snip]
>
> Would the following patch help ? I use it in the LTTng tree to alleviate
> this problem.
>
>
> notifier atomic call chain notrace
>
> In LTTng, being able to use the atomic notifier from cpu idle entry to
> ensure the tracer flush the last events in the current subbuffer
> requires the rcu read-side to be marked "notrace", otherwise it can end
> up calling back into lockdep and the tracer.
>
> Also apply to the die notifier.

Reviewed-by: Paul E. McKenney <[email protected]>

> Signed-off-by: Mathieu Desnoyers <[email protected]>
> CC: "Paul E. McKenney" <[email protected]>
> CC: Jason Baron <[email protected]>
> CC: [email protected]
> ---
> kernel/notifier.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> Index: linux-2.6-lttng/kernel/notifier.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/notifier.c 2009-11-12 17:58:56.000000000 -0500
> +++ linux-2.6-lttng/kernel/notifier.c 2009-11-12 18:03:28.000000000 -0500
> @@ -148,7 +148,7 @@ int atomic_notifier_chain_unregister(str
> spin_lock_irqsave(&nh->lock, flags);
> ret = notifier_chain_unregister(&nh->head, n);
> spin_unlock_irqrestore(&nh->lock, flags);
> - synchronize_rcu();
> + synchronize_sched();
> return ret;
> }
> EXPORT_SYMBOL_GPL(atomic_notifier_chain_unregister);
> @@ -178,9 +178,9 @@ int __kprobes __atomic_notifier_call_cha
> {
> int ret;
>
> - rcu_read_lock();
> + rcu_read_lock_sched_notrace();
> ret = notifier_call_chain(&nh->head, val, v, nr_to_call, nr_calls);
> - rcu_read_unlock();
> + rcu_read_unlock_sched_notrace();
> return ret;
> }
> EXPORT_SYMBOL_GPL(__atomic_notifier_call_chain);
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-11-19 00:33:25

by Mathieu Desnoyers

Subject: Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine

* Jason Baron ([email protected]) wrote:
> Add text_poke_fixup() which takes a fixup address to where a processor
> jumps if it hits the modifying address while code modifying.
> text_poke_fixup() does following steps for this purpose.
>
> 1. Setup int3 handler for fixup.
> 2. Put a breakpoint (int3) on the first byte of modifying region,
> and synchronize code on all CPUs.
> 3. Modify other bytes of modifying region, and synchronize code on all CPUs.
> 4. Modify the first byte of modifying region, and synchronize code
> on all CPUs.
> 5. Clear int3 handler.
>

Hi Masami,

I like the approach and the API is clean. I have interspersed comments
below.

Ingo: I raise a question about text_mutex vs on_each_cpu hangs I
experienced recently in the message below. Might be worth having a look,
I suspect this might have caused the hangs Paul McKenney had with his
past TREE RCU callback migration. I think he did take a mutex in the cpu
hotplug callbacks and might have used IPIs within that same lock.

> Thus, if some other processor execute modifying address when step2 to step4,
> it will be jumped to fixup code.
>
> This still has many limitations for modifying multi-instructions at once.
> However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
> because;
> - Replaced instruction is just one instruction, which is executed atomically.
> - Replacing instruction is a jump, so we can set fixup address where the jump
> goes to.
>
> Signed-off-by: Masami Hiramatsu <[email protected]>
> ---
> arch/x86/include/asm/alternative.h | 12 ++++
> arch/x86/include/asm/kprobes.h | 1 +
> arch/x86/kernel/alternative.c | 120 ++++++++++++++++++++++++++++++++++++
> kernel/kprobes.c | 2 +-
> 4 files changed, 134 insertions(+), 1 deletions(-)
>

[snip snip]

> index de7353c..af47f12 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -4,6 +4,7 @@
> #include <linux/list.h>
> #include <linux/stringify.h>
> #include <linux/kprobes.h>
> +#include <linux/kdebug.h>
> #include <linux/mm.h>
> #include <linux/vmalloc.h>
> #include <linux/memory.h>
> @@ -552,3 +553,122 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
> local_irq_restore(flags);
> return addr;
> }
> +
> +/*
> + * On pentium series, Unsynchronized cross-modifying code
> + * operations can cause unexpected instruction execution results.
> + * So after code modified, we should synchronize it on each processor.
> + */
> +static void __kprobes __local_sync_core(void *info)
> +{
> + sync_core();
> +}
> +
> +void __kprobes sync_core_all(void)
> +{
> + on_each_cpu(__local_sync_core, NULL, 1);

OK, so you rely on the fact that on_each_cpu has memory barriers and
executes the remote "__local_sync_core" with the appropriate memory
barriers underneath, am I correct ?

> +}
> +
> +/* Safely cross-code modifying with fixup address */
> +static void *patch_fixup_from;
> +static void *patch_fixup_addr;
> +
> +static int __kprobes patch_exceptions_notify(struct notifier_block *self,
> + unsigned long val, void *data)
> +{
> + struct die_args *args = data;
> + struct pt_regs *regs = args->regs;
> +
> + if (likely(!patch_fixup_from))
> + return NOTIFY_DONE;
> +
> + if (val != DIE_INT3 || !regs || user_mode_vm(regs) ||
> + (unsigned long)patch_fixup_from != regs->ip)
> + return NOTIFY_DONE;
> +
> + args->regs->ip = (unsigned long)patch_fixup_addr;
> + return NOTIFY_STOP;
> +}
> +
> +/**
> + * text_poke_fixup() -- cross-modifying kernel text with fixup address.
> + * @addr: Modifying address.
> + * @opcode: New instruction.
> + * @len: length of modifying bytes.
> + * @fixup: Fixup address.
> + *
> + * Note: You must backup replaced instructions before calling this,
> + * if you need to recover it.
> + * Note: Must be called under text_mutex.
> + */
> +void *__kprobes text_poke_fixup(void *addr, const void *opcode, size_t len,
> + void *fixup)
> +{
> + static const unsigned char int3_insn = BREAKPOINT_INSTRUCTION;
> + static const int int3_size = sizeof(int3_insn);
> +
> + /* Replacing 1 byte can be done atomically. */

I'm sure it can be done atomically, but can it be done safely though ?

(disclaimer: we're still waiting for the official answer on this
statement): Assuming instruction trace cache effects of SMP cross-code
modification, and that only int3 would be safe to patch, then even
changing 1 single byte could only be done by going to an intermediate
int3 and synchronizing all other cores.

> + if (unlikely(len <= 1))
> + return text_poke(addr, opcode, len);
> +
> + /* Preparing */
> + patch_fixup_addr = fixup;
> + wmb();

hrm, missing comment ?

> + patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */
> +
> + /* Cap by an int3 */
> + text_poke(addr, &int3_insn, int3_size);
> + sync_core_all();
> +
> + /* Replace tail bytes */
> + text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
> + len - int3_size);
> + sync_core_all();
> +
> + /* Replace int3 with head byte */
> + text_poke(addr, opcode, int3_size);
> + sync_core_all();
> +
> + /* Cleanup */
> + patch_fixup_from = NULL;
> + wmb();

missing comment here too.

> + return addr;

Little quiz question:

When patch_fixup_from is set to NULL, what ensures that the int3
handlers have completed their execution ?

I think it's probably OK, because the int3 is an interrupt gate, which
therefore disables interrupts as soon as it runs, and executes the
notifier while irqs are off. When we run sync_core_all() after replacing
the int3 by the new 1st byte, we only return when all other cores have
executed an interrupt, which implies that all int3 handlers previously
running should have ended. Is it right ? It looks to me as if this 3rd
sync_core_all() is only needed because of that. Probably that adding a
comment would be good.


Another thing: I've recently noticed that the following locking seems to
hang the system with doing stress-testing concurrently with cpu
hotplug/hotunplug:

mutex_lock(&text_mutex);
on_each_cpu(something, NULL, 1);

The hang seems to be caused by the fact that alternative.c has:

within cpu hotplug (cpu hotplug lock held)
mutex_lock(&text_mutex);

It might also be caused by the interaction with the stop_machine()
performed within the cpu hotplug lock. I did not find the root cause of
the problem, but this probably calls for lockdep improvements.

In any case, given you are dealing with the exact same locking scenario
here, I would recommend the following stress-test:

in a loop, use text_poke_jump()
in a loop, hotunplug all cpus, then hotplug all cpus

I had to fix this temporarily by taking get/put_online_cpus() around the
text_mutex.
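
(i.e. something along these lines around the patching call -- just a sketch of
the workaround:)

	get_online_cpus();		/* keep cpu hotplug out while we IPI all cpus */
	mutex_lock(&text_mutex);
	text_poke_fixup(addr, opcode, len, fixup);
	mutex_unlock(&text_mutex);
	put_online_cpus();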

[snip snip]

> +static struct notifier_block patch_exceptions_nb = {
> + .notifier_call = patch_exceptions_notify,
> + .priority = 0x7fffffff /* we need to be notified first */
> +};
> +
> +static int __init patch_init(void)
> +{
> + return register_die_notifier(&patch_exceptions_nb);
> +}
> +
> +arch_initcall(patch_init);
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index e5342a3..43a30d8 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -898,7 +898,7 @@ EXPORT_SYMBOL_GPL(unregister_kprobes);
>
> static struct notifier_block kprobe_exceptions_nb = {
> .notifier_call = kprobe_exceptions_notify,
> - .priority = 0x7fffffff /* we need to be notified first */
> + .priority = 0x7ffffff0 /* High priority, but not first. */

It would be good to keep all these priorities in a centralized header.

Thanks,

Mathieu

> };
>
> unsigned long __weak arch_deref_entry_point(void *entry)
> --
> 1.6.5.1
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-11-19 00:58:25

by Paul E. McKenney

Subject: Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine

On Wed, Nov 18, 2009 at 07:28:26PM -0500, Mathieu Desnoyers wrote:
> * Jason Baron ([email protected]) wrote:
> > Add text_poke_fixup() which takes a fixup address to where a processor
> > jumps if it hits the modifying address while code modifying.
> > text_poke_fixup() does following steps for this purpose.
> >
> > 1. Setup int3 handler for fixup.
> > 2. Put a breakpoint (int3) on the first byte of modifying region,
> > and synchronize code on all CPUs.
> > 3. Modify other bytes of modifying region, and synchronize code on all CPUs.
> > 4. Modify the first byte of modifying region, and synchronize code
> > on all CPUs.
> > 5. Clear int3 handler.
> >
>
> Hi Masami,
>
> I like the approach and the API is clean. I have interspersed comments
> below.
>
> Ingo: I raise a question about text_mutex vs on_each_cpu hangs I
> experienced recently in the message below. Might be worth having a look,
> I suspect this might have caused the hangs Paul McKenney had with his
> past TREE RCU callback migration. I think he did take a mutex in the cpu
> hotplug callbacks and might have used IPIs within that same lock.

Hello, Mathieu,

By "mutex", do you mean "mutex_lock()"? If so, I don't do that from
within the CPU hotplug notifiers. I do spin_lock_irqsave().

People have been known to invoke synchronize_rcu() from CPU hotplug
notifiers, however -- and this does block. Is that what you are
getting at?

I do invoke smp_send_reschedule() with irqs disabled, which did arouse my
suspicions of late. But this seems to bypass the smp_call_function()
code that is most definitely illegal to invoke with irqs disabled,
so no smoking gun. All that aside, if invoking smp_send_reschedule()
with irqs disabled is in any way a bad idea, please let me know so I
can rearrange the code appropriately.

RCU is currently running reasonably well with the set of patches I have
submitted. The kernel is now withstanding a level of punishment that
would have reduced 2.6.28 (for example) to a steaming pile of bits, with
or without the help of rcutorture. And I am now hitting the occasional
non-RCU bug, so I am starting to feel like RCU is approaching stability.
Approaching, but not there yet -- a few suspicious areas remain. Time
to crank the testing up another notch or two. ;-)

Thanx, Paul

> > Thus, if some other processor execute modifying address when step2 to step4,
> > it will be jumped to fixup code.
> >
> > This still has many limitations for modifying multi-instructions at once.
> > However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
> > because;
> > - Replaced instruction is just one instruction, which is executed atomically.
> > - Replacing instruction is a jump, so we can set fixup address where the jump
> > goes to.
> >
> > Signed-off-by: Masami Hiramatsu <[email protected]>
> > ---
> > arch/x86/include/asm/alternative.h | 12 ++++
> > arch/x86/include/asm/kprobes.h | 1 +
> > arch/x86/kernel/alternative.c | 120 ++++++++++++++++++++++++++++++++++++
> > kernel/kprobes.c | 2 +-
> > 4 files changed, 134 insertions(+), 1 deletions(-)
> >
>
> [snip snip]
>
> > index de7353c..af47f12 100644
> > --- a/arch/x86/kernel/alternative.c
> > +++ b/arch/x86/kernel/alternative.c
> > @@ -4,6 +4,7 @@
> > #include <linux/list.h>
> > #include <linux/stringify.h>
> > #include <linux/kprobes.h>
> > +#include <linux/kdebug.h>
> > #include <linux/mm.h>
> > #include <linux/vmalloc.h>
> > #include <linux/memory.h>
> > @@ -552,3 +553,122 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
> > local_irq_restore(flags);
> > return addr;
> > }
> > +
> > +/*
> > + * On pentium series, Unsynchronized cross-modifying code
> > + * operations can cause unexpected instruction execution results.
> > + * So after code modified, we should synchronize it on each processor.
> > + */
> > +static void __kprobes __local_sync_core(void *info)
> > +{
> > + sync_core();
> > +}
> > +
> > +void __kprobes sync_core_all(void)
> > +{
> > + on_each_cpu(__local_sync_core, NULL, 1);
>
> OK, so you rely on the fact that on_each_cpu has memory barriers and
> executes the remote "__local_sync_core" with the appropriate memory
> barriers underneath, am I correct ?
>
> > +}
> > +
> > +/* Safely cross-code modifying with fixup address */
> > +static void *patch_fixup_from;
> > +static void *patch_fixup_addr;
> > +
> > +static int __kprobes patch_exceptions_notify(struct notifier_block *self,
> > + unsigned long val, void *data)
> > +{
> > + struct die_args *args = data;
> > + struct pt_regs *regs = args->regs;
> > +
> > + if (likely(!patch_fixup_from))
> > + return NOTIFY_DONE;
> > +
> > + if (val != DIE_INT3 || !regs || user_mode_vm(regs) ||
> > + (unsigned long)patch_fixup_from != regs->ip)
> > + return NOTIFY_DONE;
> > +
> > + args->regs->ip = (unsigned long)patch_fixup_addr;
> > + return NOTIFY_STOP;
> > +}
> > +
> > +/**
> > + * text_poke_fixup() -- cross-modifying kernel text with fixup address.
> > + * @addr: Modifying address.
> > + * @opcode: New instruction.
> > + * @len: length of modifying bytes.
> > + * @fixup: Fixup address.
> > + *
> > + * Note: You must backup replaced instructions before calling this,
> > + * if you need to recover it.
> > + * Note: Must be called under text_mutex.
> > + */
> > +void *__kprobes text_poke_fixup(void *addr, const void *opcode, size_t len,
> > + void *fixup)
> > +{
> > + static const unsigned char int3_insn = BREAKPOINT_INSTRUCTION;
> > + static const int int3_size = sizeof(int3_insn);
> > +
> > + /* Replacing 1 byte can be done atomically. */
>
> I'm sure it can be done atomically, but can it be done safely though ?
>
> (disclaimer: we're still waiting for the official answer on this
> statement): Assuming instruction trace cache effects of SMP cross-code
> modification, and that only int3 would be safe to patch, then even
> changing 1 single byte could only be done by going to an intermediate
> int3 and synchronizing all other cores.
>
> > + if (unlikely(len <= 1))
> > + return text_poke(addr, opcode, len);
> > +
> > + /* Preparing */
> > + patch_fixup_addr = fixup;
> > + wmb();
>
> hrm, missing comment ?
>
> > + patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */
> > +
> > + /* Cap by an int3 */
> > + text_poke(addr, &int3_insn, int3_size);
> > + sync_core_all();
> > +
> > + /* Replace tail bytes */
> > + text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
> > + len - int3_size);
> > + sync_core_all();
> > +
> > + /* Replace int3 with head byte */
> > + text_poke(addr, opcode, int3_size);
> > + sync_core_all();
> > +
> > + /* Cleanup */
> > + patch_fixup_from = NULL;
> > + wmb();
>
> missing comment here too.
>
> > + return addr;
>
> Little quiz question:
>
> When patch_fixup_from is set to NULL, what ensures that the int3
> handlers have completed their execution ?
>
> I think it's probably OK, because the int3 is an interrupt gate, which
> therefore disables interrupts as soon as it runs, and executes the
> notifier while irqs are off. When we run sync_core_all() after replacing
> the int3 by the new 1st byte, we only return when all other cores have
> executed an interrupt, which implies that all int3 handlers previously
> running should have ended. Is it right ? It looks to me as if this 3rd
> sync_core_all() is only needed because of that. Probably that adding a
> comment would be good.
>
>
> Another thing: I've recently noticed that the following locking seems to
> hang the system with doing stress-testing concurrently with cpu
> hotplug/hotunplug:
>
> mutex_lock(&text_mutex);
> on_each_cpu(something, NULL, 1);
>
> The hang seems to be caused by the fact that alternative.c has:
>
> within cpu hotplug (cpu hotplug lock held)
> mutex_lock(&text_mutex);
>
> It might also be caused by the interaction with the stop_machine()
> performed within the cpu hotplug lock. I did not find the root cause of
> the problem, but this probably calls for lockdep improvements.
>
> In any case, given you are dealing with the exact same locking scenario
> here, I would recommend the following stress-test:
>
> in a loop, use text_poke_jump()
> in a loop, hotunplug all cpus, then hotplug all cpus
>
> I had to fix this temporarily by taking get/put_online_cpus() around the
> text_mutex.
>
> [snip snip]
>
> > +static struct notifier_block patch_exceptions_nb = {
> > + .notifier_call = patch_exceptions_notify,
> > + .priority = 0x7fffffff /* we need to be notified first */
> > +};
> > +
> > +static int __init patch_init(void)
> > +{
> > + return register_die_notifier(&patch_exceptions_nb);
> > +}
> > +
> > +arch_initcall(patch_init);
> > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > index e5342a3..43a30d8 100644
> > --- a/kernel/kprobes.c
> > +++ b/kernel/kprobes.c
> > @@ -898,7 +898,7 @@ EXPORT_SYMBOL_GPL(unregister_kprobes);
> >
> > static struct notifier_block kprobe_exceptions_nb = {
> > .notifier_call = kprobe_exceptions_notify,
> > - .priority = 0x7fffffff /* we need to be notified first */
> > + .priority = 0x7ffffff0 /* High priority, but not first. */
>
> It would be good to keep all these priorities in a centralized header.
>
> Thanks,
>
> Mathieu
>
> > };
> >
> > unsigned long __weak arch_deref_entry_point(void *entry)
> > --
> > 1.6.5.1
> >
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-11-19 01:22:44

by Steven Rostedt

Subject: Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine

On Wed, 2009-11-18 at 16:58 -0800, Paul E. McKenney wrote:

> I do invoke smp_send_reschedule() with irqs disabled, which did arouse my
> suspicions of late. But this seems to bypass the smp_call_function()
> code that is most definitely illegal to invoke with irqs disabled,
> so no smoking gun. All that aside, if invoking smp_send_reschedule()
> with irqs disabled is in any way a bad idea, please let me know so I
> can rearrange the code appropriately.

I don't think you have anything to worry about here. If calling
smp_send_reschedule was bad with keeping interrupts disabled, then we
would have a lot of problems with the scheduler. That is called with the
rq lock held and with interrupts disabled.

-- Steve

2009-11-19 01:39:09

by Paul E. McKenney

Subject: Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine

On Wed, Nov 18, 2009 at 08:22:47PM -0500, Steven Rostedt wrote:
> On Wed, 2009-11-18 at 16:58 -0800, Paul E. McKenney wrote:
>
> > I do invoke smp_send_reschedule() with irqs disabled, which did arouse my
> > suspicions of late. But this seems to bypass the smp_call_function()
> > code that is most definitely illegal to invoke with irqs disabled,
> > so no smoking gun. All that aside, if invoking smp_send_reschedule()
> > with irqs disabled is in any way a bad idea, please let me know so I
> > can rearrange the code appropriately.
>
> I don't think you have anything to worry about here. If calling
> smp_send_reschedule was bad with keeping interrupts disabled, then we
> would have a lot of problems with the scheduler. That is called with the
> rq lock held and with interrupts disabled.

Whew!!! ;-)

Thanx, Paul

2009-11-19 01:57:58

by Mathieu Desnoyers

Subject: Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine

* Paul E. McKenney ([email protected]) wrote:
> On Wed, Nov 18, 2009 at 07:28:26PM -0500, Mathieu Desnoyers wrote:
> > * Jason Baron ([email protected]) wrote:
> > > Add text_poke_fixup() which takes a fixup address to where a processor
> > > jumps if it hits the modifying address while code modifying.
> > > text_poke_fixup() does following steps for this purpose.
> > >
> > > 1. Setup int3 handler for fixup.
> > > 2. Put a breakpoint (int3) on the first byte of modifying region,
> > > and synchronize code on all CPUs.
> > > 3. Modify other bytes of modifying region, and synchronize code on all CPUs.
> > > 4. Modify the first byte of modifying region, and synchronize code
> > > on all CPUs.
> > > 5. Clear int3 handler.
> > >
> >
> > Hi Masami,
> >
> > I like the approach and the API is clean. I have intersped comments
> > below.
> >
> > Ingo: I raise a question about text_mutex vs on_each_cpu hangs I
> > experienced recently in the message below. Might be worth having a look,
> > I suspect this might have caused the hangs Paul McKenney had with his
> > past TREE RCU callback migration. I think he did take a mutex in the cpu
> > hotplug callbacks and might have used IPIs within that same lock.
>
> Hello, Mathieu,
>
> By "mutex", do you mean "mutex_lock()"? If so, I don't do that from
> within the CPU hotplug notifiers. I do spin_lock_irqsave().
>
> People have been known to invoke synchronize_rcu() from CPU hotplug
> notifiers, however -- and this does block. Is that what you are
> getting at?

Hi Paul,

What I had in mind is more like an N-way deadlock involving the cpu
hotplug mutex, on_each_cpu, and possibly stop_machine interaction.
I did not push the lockup analysis far enough to know for sure, but I
feel it's good to let others know. I am not sure if blocking
primitives could be affected by this.

>
> I do invoke smp_send_reschedule() with irqs disabled, which did arouse my
> suspicions of late. But this seems to bypass the smp_call_function()
> code that is most definitely illegal to invoke with irqs disabled,
> so no smoking gun. All that aside, if invoking smp_send_reschedule()
> with irqs disabled is in any way a bad idea, please let me know so I
> can rearrange the code appropriately.
>
> RCU is currently running reasonably well with the set of patches I have
> submitted. It the kernel is now withstanding a level of punishment that
> would have reduced 2.6.28 (for example) to a steaming pile of bits, with
> or without the help of rcutorture. And I am now hitting the occasional
> non-RCU bug, so I am starting to feel like RCU is approaching stability.
> Approaching, but not there yet -- a few suspicious areas remain. Time
> to crank the testing up another notch or two. ;-)

Nice :) I've been going through the same "stabilization" phase in the
past weeks for LTTng. It's good to see things stabilize even under heavy
cpu hotplug stress-tests.

Thanks,

Mathieu

>
> Thanx, Paul
>
> > > Thus, if some other processor execute modifying address when step2 to step4,
> > > it will be jumped to fixup code.
> > >
> > > This still has many limitations for modifying multi-instructions at once.
> > > However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
> > > because;
> > > - Replaced instruction is just one instruction, which is executed atomically.
> > > - Replacing instruction is a jump, so we can set fixup address where the jump
> > > goes to.
> > >
> > > Signed-off-by: Masami Hiramatsu <[email protected]>
> > > ---
> > > arch/x86/include/asm/alternative.h | 12 ++++
> > > arch/x86/include/asm/kprobes.h | 1 +
> > > arch/x86/kernel/alternative.c | 120 ++++++++++++++++++++++++++++++++++++
> > > kernel/kprobes.c | 2 +-
> > > 4 files changed, 134 insertions(+), 1 deletions(-)
> > >
> >
> > [snip snip]
> >
> > > index de7353c..af47f12 100644
> > > --- a/arch/x86/kernel/alternative.c
> > > +++ b/arch/x86/kernel/alternative.c
> > > @@ -4,6 +4,7 @@
> > > #include <linux/list.h>
> > > #include <linux/stringify.h>
> > > #include <linux/kprobes.h>
> > > +#include <linux/kdebug.h>
> > > #include <linux/mm.h>
> > > #include <linux/vmalloc.h>
> > > #include <linux/memory.h>
> > > @@ -552,3 +553,122 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
> > > local_irq_restore(flags);
> > > return addr;
> > > }
> > > +
> > > +/*
> > > + * On pentium series, Unsynchronized cross-modifying code
> > > + * operations can cause unexpected instruction execution results.
> > > + * So after code modified, we should synchronize it on each processor.
> > > + */
> > > +static void __kprobes __local_sync_core(void *info)
> > > +{
> > > + sync_core();
> > > +}
> > > +
> > > +void __kprobes sync_core_all(void)
> > > +{
> > > + on_each_cpu(__local_sync_core, NULL, 1);
> >
> > OK, so you rely on the fact that on_each_cpu has memory barriers and
> > executes the remote "__local_sync_core" with the appropriate memory
> > barriers underneath, am I correct ?
> >
> > > +}
> > > +
> > > +/* Safely cross-code modifying with fixup address */
> > > +static void *patch_fixup_from;
> > > +static void *patch_fixup_addr;
> > > +
> > > +static int __kprobes patch_exceptions_notify(struct notifier_block *self,
> > > + unsigned long val, void *data)
> > > +{
> > > + struct die_args *args = data;
> > > + struct pt_regs *regs = args->regs;
> > > +
> > > + if (likely(!patch_fixup_from))
> > > + return NOTIFY_DONE;
> > > +
> > > + if (val != DIE_INT3 || !regs || user_mode_vm(regs) ||
> > > + (unsigned long)patch_fixup_from != regs->ip)
> > > + return NOTIFY_DONE;
> > > +
> > > + args->regs->ip = (unsigned long)patch_fixup_addr;
> > > + return NOTIFY_STOP;
> > > +}
> > > +
> > > +/**
> > > + * text_poke_fixup() -- cross-modifying kernel text with fixup address.
> > > + * @addr: Modifying address.
> > > + * @opcode: New instruction.
> > > + * @len: length of modifying bytes.
> > > + * @fixup: Fixup address.
> > > + *
> > > + * Note: You must backup replaced instructions before calling this,
> > > + * if you need to recover it.
> > > + * Note: Must be called under text_mutex.
> > > + */
> > > +void *__kprobes text_poke_fixup(void *addr, const void *opcode, size_t len,
> > > + void *fixup)
> > > +{
> > > + static const unsigned char int3_insn = BREAKPOINT_INSTRUCTION;
> > > + static const int int3_size = sizeof(int3_insn);
> > > +
> > > + /* Replacing 1 byte can be done atomically. */
> >
> > I'm sure it can be done atomically, but can it be done safely though ?
> >
> > (disclaimer: we're still waiting for the official answer on this
> > statement): Assuming instruction trace cache effects of SMP cross-code
> > modification, and that only int3 would be safe to patch, then even
> > changing 1 single byte could only be done by going to an intermediate
> > int3 and synchronizing all other cores.
> >
> > > + if (unlikely(len <= 1))
> > > + return text_poke(addr, opcode, len);
> > > +
> > > + /* Preparing */
> > > + patch_fixup_addr = fixup;
> > > + wmb();
> >
> > hrm, missing comment ?
> >
> > > + patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */
> > > +
> > > + /* Cap by an int3 */
> > > + text_poke(addr, &int3_insn, int3_size);
> > > + sync_core_all();
> > > +
> > > + /* Replace tail bytes */
> > > + text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
> > > + len - int3_size);
> > > + sync_core_all();
> > > +
> > > + /* Replace int3 with head byte */
> > > + text_poke(addr, opcode, int3_size);
> > > + sync_core_all();
> > > +
> > > + /* Cleanup */
> > > + patch_fixup_from = NULL;
> > > + wmb();
> >
> > missing comment here too.
> >
> > > + return addr;
> >
> > Little quiz question:
> >
> > When patch_fixup_from is set to NULL, what ensures that the int3
> > handlers have completed their execution ?
> >
> > I think it's probably OK, because the int3 is an interrupt gate, which
> > therefore disables interrupts as soon as it runs, and executes the
> > notifier while irqs are off. When we run sync_core_all() after replacing
> > the int3 by the new 1st byte, we only return when all other cores have
> > executed an interrupt, which implies that all int3 handlers previously
> > running should have ended. Is it right ? It looks to me as if this 3rd
> > sync_core_all() is only needed because of that. Probably that adding a
> > comment would be good.
> >
> >
> > Another thing: I've recently noticed that the following locking seems to
> > hang the system with doing stress-testing concurrently with cpu
> > hotplug/hotunplug:
> >
> > mutex_lock(&text_mutex);
> > on_each_cpu(something, NULL, 1);
> >
> > The hang seems to be caused by the fact that alternative.c has:
> >
> > within cpu hotplug (cpu hotplug lock held)
> > mutex_lock(&text_mutex);
> >
> > It might also be caused by the interaction with the stop_machine()
> > performed within the cpu hotplug lock. I did not find the root cause of
> > the problem, but this probably calls for lockdep improvements.
> >
> > In any case, given you are dealing with the exact same locking scenario
> > here, I would recomment the following stress-test:
> >
> > in a loop, use text_poke_jump()
> > in a loop, hotunplug all cpus, then hotplug all cpus
> >
> > I had to fix this temporarily by taking get/put_online_cpus() about the
> > text_mutex.
> >
> > [snip snip]
> >
> > > +static struct notifier_block patch_exceptions_nb = {
> > > + .notifier_call = patch_exceptions_notify,
> > > + .priority = 0x7fffffff /* we need to be notified first */
> > > +};
> > > +
> > > +static int __init patch_init(void)
> > > +{
> > > + return register_die_notifier(&patch_exceptions_nb);
> > > +}
> > > +
> > > +arch_initcall(patch_init);
> > > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > > index e5342a3..43a30d8 100644
> > > --- a/kernel/kprobes.c
> > > +++ b/kernel/kprobes.c
> > > @@ -898,7 +898,7 @@ EXPORT_SYMBOL_GPL(unregister_kprobes);
> > >
> > > static struct notifier_block kprobe_exceptions_nb = {
> > > .notifier_call = kprobe_exceptions_notify,
> > > - .priority = 0x7fffffff /* we need to be notified first */
> > > + .priority = 0x7ffffff0 /* High priority, but not first. */
> >
> > It would be good to keep all these priorities in a centralized header.
> >
> > Thanks,
> >
> > Mathieu
> >
> > > };
> > >
> > > unsigned long __weak arch_deref_entry_point(void *entry)
> > > --
> > > 1.6.5.1
> > >
> >
> > --
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-11-19 03:55:08

by Roland McGrath

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] jump label v3

This looks like another good iteration, still imperfect but getting there.

There are three aspects of this work that are largely separable.

1. the actual instruction-poking for runtime switching

Masami, Mathieu, et al are hashing this out and I have stopped trying to
keep track of the details.

2. optimal compiled hot path code

You and Richard have been working on this in gcc and we know the state
of it now. When we get the cold labels feature done, it will be ideal
for -O(2?). But people mostly use -Os and there no block reordering
gets done now (I think perhaps this even means likely/unlikely don't
really change which path is the straight line, just the source order
of the blocks still determines it). So we hope for more incremental
improvements here, and maybe even really optimal code for -O2 soon.
But at least for -Os it may not be better than "unconditional jump
around" as the "straight line" path in the foreseeable future. As
noted, that alone is still a nice savings over the status quo for the
disabled case. (You gave an "average cycles saved" for this vs a load
and test, but do you have any comparisons of how those two compare to
no tracepoint at all?)

3. bookkeeping magic to find all the jumps to enable for a given tracepoint

Here you have a working first draft, but it looks pretty clunky.
That strcmp just makes me gag. For a first version that's still
pretty simple, I think it should be trivial to use a pointer
comparison there. For tracepoints, it can be the address of the
struct tracepoint. For the general case, it can be the address of
the global that would be the flag variable in the case of no asm goto support.

For more incremental improvements, we could cut down on running
through the entire table for every switch. If there are many
different switches (as there are already for many different
tracepoints), then you really just want to run through the
insn-patch list for the particular switch when you toggle it.

It's possible to group this all statically at link time, but all
the linker magic hacking required to get that to go is probably
more trouble than it's worth.

A simple hack is to run through the big unsorted table at boot time
and turn it into a contiguous table for each switch. Then
e.g. hang each table off the per-switch global variable of the same
name that, in a no-asm-goto build, would be the simple global flag.
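
To make that concrete, here is a minimal sketch of the sort of
bookkeeping I mean (the entry layout and the arch_jump_label_transform()
hook are illustrative guesses, not anything from the posted patches):

/* Hypothetical sketch -- struct layout and helper names are illustrative. */
struct jump_entry {
	unsigned long code;	/* address of the patched nop/jmp site */
	unsigned long target;	/* address of the do_trace label */
	unsigned long key;	/* e.g. (unsigned long)&__tracepoint_foo */
};

/* provided by the linker script for the __jump_table section */
extern struct jump_entry __start___jump_table[], __stop___jump_table[];

static void jump_label_update(unsigned long key, int enable)
{
	struct jump_entry *e;

	for (e = __start___jump_table; e < __stop___jump_table; e++) {
		if (e->key != key)	/* pointer compare, no strcmp() */
			continue;
		/* hypothetical arch hook doing the actual nop<->jmp patch */
		arch_jump_label_transform(e, enable);
	}
}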


Finally, for using this for general purposes unrelated to tracepoints,
I envision something like:

DECLARE_MOSTLY_NOT(foobar);

foo(int x, int y)
{
	if (x > y && mostly_not(foobar))
		do_foobar(x - y);
}

... set_mostly_not(foobar, onoff);

where it's:

#define DECLARE_MOSTLY_NOT(name) ... __something_##name
#define mostly_not(name) ({ int _doit = 0; __label__ _yes; \
			    JUMP_LABEL(name, _yes, __something_##name); \
			    if (0) _yes: __cold _doit = 1; \
			    unlikely (_doit); })

I don't think we've tried to figure out how well this compiles yet.
But it shows the sort of thing that we can do to expose this feature
in a way that's simple and unrestrictive for kernel code to use casually.
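
For the no-asm-goto case, one possible shape for the fallback (again
just a sketch, treating __something_##name as the plain global flag):

/* Hypothetical fallback sketch for compilers without asm goto. */
#define DECLARE_MOSTLY_NOT(name) \
	int __something_##name __read_mostly

#define mostly_not(name)	unlikely(__something_##name)

#define set_mostly_not(name, onoff) \
	do { __something_##name = (onoff); } while (0)

That keeps the calling convention identical whether or not the compiler
can do the jump-label trick, and the flag doubles as the pointer key
mentioned in point 3.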


Thanks,
Roland

2009-11-19 03:57:57

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [PATCH] notifier atomic call chain notrace

Mathieu Desnoyers wrote:
> * Jason Baron ([email protected]) wrote:
>> Note that this is conditional on gcc >= 4.5. Also there is a !lockdep
>> condition. This is b/c the die_notifier takes an rcu_read_lock() on the
>> int3 trap, which then causes another one etc. Since, we aren't going to be
>> installing removing the handler, the rcu_read_lock() could be avoided for this
>> case with some code restructuring.
>>
> [snip]
>
> Would the following patch help ? I use it in the LTTng tree to alleviate
> this problem.

Reviewed-by: Masami Hiramatsu <[email protected]>

The code itself seems OK to me. :-)

I'd just like to hear Ingo's opinion, since this change will make
all of the atomic-notifier locks notrace.

> notifier atomic call chain notrace
>
> In LTTng, being able to use the atomic notifier from cpu idle entry to
> ensure the tracer flush the last events in the current subbuffer
> requires the rcu read-side to be marked "notrace", otherwise it can end
> up calling back into lockdep and the tracer.
>
> Also apply to the the die notifier.
>
> Signed-off-by: Mathieu Desnoyers <[email protected]>
> CC: "Paul E. McKenney" <[email protected]>
> CC: Jason Baron <[email protected]>
> CC: [email protected]
> ---
> kernel/notifier.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> Index: linux-2.6-lttng/kernel/notifier.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/notifier.c 2009-11-12 17:58:56.000000000 -0500
> +++ linux-2.6-lttng/kernel/notifier.c 2009-11-12 18:03:28.000000000 -0500
> @@ -148,7 +148,7 @@ int atomic_notifier_chain_unregister(str
> spin_lock_irqsave(&nh->lock, flags);
> ret = notifier_chain_unregister(&nh->head, n);
> spin_unlock_irqrestore(&nh->lock, flags);
> - synchronize_rcu();
> + synchronize_sched();
> return ret;
> }
> EXPORT_SYMBOL_GPL(atomic_notifier_chain_unregister);
> @@ -178,9 +178,9 @@ int __kprobes __atomic_notifier_call_cha
> {
> int ret;
>
> - rcu_read_lock();
> + rcu_read_lock_sched_notrace();
> ret = notifier_call_chain(&nh->head, val, v, nr_to_call, nr_calls);
> - rcu_read_unlock();
> + rcu_read_unlock_sched_notrace();
> return ret;
> }
> EXPORT_SYMBOL_GPL(__atomic_notifier_call_chain);
>

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2009-11-19 04:16:55

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine

On Wed, Nov 18, 2009 at 08:57:56PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney ([email protected]) wrote:
> > On Wed, Nov 18, 2009 at 07:28:26PM -0500, Mathieu Desnoyers wrote:
> > > * Jason Baron ([email protected]) wrote:
> > > > Add text_poke_fixup() which takes a fixup address to where a processor
> > > > jumps if it hits the modifying address while code modifying.
> > > > text_poke_fixup() does following steps for this purpose.
> > > >
> > > > 1. Setup int3 handler for fixup.
> > > > 2. Put a breakpoint (int3) on the first byte of modifying region,
> > > > and synchronize code on all CPUs.
> > > > 3. Modify other bytes of modifying region, and synchronize code on all CPUs.
> > > > 4. Modify the first byte of modifying region, and synchronize code
> > > > on all CPUs.
> > > > 5. Clear int3 handler.
> > > >
> > >
> > > Hi Masami,
> > >
> > > I like the approach and the API is clean. I have intersped comments
> > > below.
> > >
> > > Ingo: I raise a question about text_mutex vs on_each_cpu hangs I
> > > experienced recently in the message below. Might be worth having a look,
> > > I suspect this might have caused the hangs Paul McKenney had with his
> > > past TREE RCU callback migration. I think he did take a mutex in the cpu
> > > hotplug callbacks and might have used IPIs within that same lock.
> >
> > Hello, Mathieu,
> >
> > By "mutex", do you mean "mutex_lock()"? If so, I don't do that from
> > within the CPU hotplug notifiers. I do spin_lock_irqsave().
> >
> > People have been known to invoke synchronize_rcu() from CPU hotplug
> > notifiers, however -- and this does block. Is that what you are
> > getting at?
>
> Hi Paul,
>
> What I had in mind is more like a N-way deadlock involving the cpu
> hotplug mutex, on_each_cpu, and possibly stop_machine interaction.
> However I did not push the lockup analysis far enough to know for sure,
> but I feel like it's good to let others know. I am not sure if blocking
> primitives could be affected by this.

Then you might be interested to hear that my attempts to move
rcu_offline_cpu() to the CPU_DYING and rcu_online_cpu() to CPU_STARTING
did result in howls of pain from lockdep as well as deadlocks.
I abandoned such attempts. ;-)

(The reason behind my attempts was to reduce the number and complexity
of race conditions in the RCU implementations -- but no big deal, as I
have found other ways.)

> > I do invoke smp_send_reschedule() with irqs disabled, which did arouse my
> > suspicions of late. But this seems to bypass the smp_call_function()
> > code that is most definitely illegal to invoke with irqs disabled,
> > so no smoking gun. All that aside, if invoking smp_send_reschedule()
> > with irqs disabled is in any way a bad idea, please let me know so I
> > can rearrange the code appropriately.
> >
> > RCU is currently running reasonably well with the set of patches I have
> > submitted. It the kernel is now withstanding a level of punishment that
> > would have reduced 2.6.28 (for example) to a steaming pile of bits, with
> > or without the help of rcutorture. And I am now hitting the occasional
> > non-RCU bug, so I am starting to feel like RCU is approaching stability.
> > Approaching, but not there yet -- a few suspicious areas remain. Time
> > to crank the testing up another notch or two. ;-)
>
> Nice :) I've been going through the same "stabilization" phase in the
> past weeks for LTTng. It's good to see things stabilize even under heavy
> cpu hotplug stress-tests.

Cool!!! ;-)

Thanx, Paul

> Thanks,
>
> Mathieu
>
> >
> > Thanx, Paul
> >
> > > > Thus, if some other processor execute modifying address when step2 to step4,
> > > > it will be jumped to fixup code.
> > > >
> > > > This still has many limitations for modifying multi-instructions at once.
> > > > However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
> > > > because;
> > > > - Replaced instruction is just one instruction, which is executed atomically.
> > > > - Replacing instruction is a jump, so we can set fixup address where the jump
> > > > goes to.
> > > >
> > > > Signed-off-by: Masami Hiramatsu <[email protected]>
> > > > ---
> > > > arch/x86/include/asm/alternative.h | 12 ++++
> > > > arch/x86/include/asm/kprobes.h | 1 +
> > > > arch/x86/kernel/alternative.c | 120 ++++++++++++++++++++++++++++++++++++
> > > > kernel/kprobes.c | 2 +-
> > > > 4 files changed, 134 insertions(+), 1 deletions(-)
> > > >
> > >
> > > [snip snip]
> > >
> > > > index de7353c..af47f12 100644
> > > > --- a/arch/x86/kernel/alternative.c
> > > > +++ b/arch/x86/kernel/alternative.c
> > > > @@ -4,6 +4,7 @@
> > > > #include <linux/list.h>
> > > > #include <linux/stringify.h>
> > > > #include <linux/kprobes.h>
> > > > +#include <linux/kdebug.h>
> > > > #include <linux/mm.h>
> > > > #include <linux/vmalloc.h>
> > > > #include <linux/memory.h>
> > > > @@ -552,3 +553,122 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
> > > > local_irq_restore(flags);
> > > > return addr;
> > > > }
> > > > +
> > > > +/*
> > > > + * On pentium series, Unsynchronized cross-modifying code
> > > > + * operations can cause unexpected instruction execution results.
> > > > + * So after code modified, we should synchronize it on each processor.
> > > > + */
> > > > +static void __kprobes __local_sync_core(void *info)
> > > > +{
> > > > + sync_core();
> > > > +}
> > > > +
> > > > +void __kprobes sync_core_all(void)
> > > > +{
> > > > + on_each_cpu(__local_sync_core, NULL, 1);
> > >
> > > OK, so you rely on the fact that on_each_cpu has memory barriers and
> > > executes the remote "__local_sync_core" with the appropriate memory
> > > barriers underneath, am I correct ?
> > >
> > > > +}
> > > > +
> > > > +/* Safely cross-code modifying with fixup address */
> > > > +static void *patch_fixup_from;
> > > > +static void *patch_fixup_addr;
> > > > +
> > > > +static int __kprobes patch_exceptions_notify(struct notifier_block *self,
> > > > + unsigned long val, void *data)
> > > > +{
> > > > + struct die_args *args = data;
> > > > + struct pt_regs *regs = args->regs;
> > > > +
> > > > + if (likely(!patch_fixup_from))
> > > > + return NOTIFY_DONE;
> > > > +
> > > > + if (val != DIE_INT3 || !regs || user_mode_vm(regs) ||
> > > > + (unsigned long)patch_fixup_from != regs->ip)
> > > > + return NOTIFY_DONE;
> > > > +
> > > > + args->regs->ip = (unsigned long)patch_fixup_addr;
> > > > + return NOTIFY_STOP;
> > > > +}
> > > > +
> > > > +/**
> > > > + * text_poke_fixup() -- cross-modifying kernel text with fixup address.
> > > > + * @addr: Modifying address.
> > > > + * @opcode: New instruction.
> > > > + * @len: length of modifying bytes.
> > > > + * @fixup: Fixup address.
> > > > + *
> > > > + * Note: You must backup replaced instructions before calling this,
> > > > + * if you need to recover it.
> > > > + * Note: Must be called under text_mutex.
> > > > + */
> > > > +void *__kprobes text_poke_fixup(void *addr, const void *opcode, size_t len,
> > > > + void *fixup)
> > > > +{
> > > > + static const unsigned char int3_insn = BREAKPOINT_INSTRUCTION;
> > > > + static const int int3_size = sizeof(int3_insn);
> > > > +
> > > > + /* Replacing 1 byte can be done atomically. */
> > >
> > > I'm sure it can be done atomically, but can it be done safely though ?
> > >
> > > (disclaimer: we're still waiting for the official answer on this
> > > statement): Assuming instruction trace cache effects of SMP cross-code
> > > modification, and that only int3 would be safe to patch, then even
> > > changing 1 single byte could only be done by going to an intermediate
> > > int3 and synchronizing all other cores.
> > >
> > > > + if (unlikely(len <= 1))
> > > > + return text_poke(addr, opcode, len);
> > > > +
> > > > + /* Preparing */
> > > > + patch_fixup_addr = fixup;
> > > > + wmb();
> > >
> > > hrm, missing comment ?
> > >
> > > > + patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */
> > > > +
> > > > + /* Cap by an int3 */
> > > > + text_poke(addr, &int3_insn, int3_size);
> > > > + sync_core_all();
> > > > +
> > > > + /* Replace tail bytes */
> > > > + text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
> > > > + len - int3_size);
> > > > + sync_core_all();
> > > > +
> > > > + /* Replace int3 with head byte */
> > > > + text_poke(addr, opcode, int3_size);
> > > > + sync_core_all();
> > > > +
> > > > + /* Cleanup */
> > > > + patch_fixup_from = NULL;
> > > > + wmb();
> > >
> > > missing comment here too.
> > >
> > > > + return addr;
> > >
> > > Little quiz question:
> > >
> > > When patch_fixup_from is set to NULL, what ensures that the int3
> > > handlers have completed their execution ?
> > >
> > > I think it's probably OK, because the int3 is an interrupt gate, which
> > > therefore disables interrupts as soon as it runs, and executes the
> > > notifier while irqs are off. When we run sync_core_all() after replacing
> > > the int3 by the new 1st byte, we only return when all other cores have
> > > executed an interrupt, which implies that all int3 handlers previously
> > > running should have ended. Is it right ? It looks to me as if this 3rd
> > > sync_core_all() is only needed because of that. Probably that adding a
> > > comment would be good.
> > >
> > >
> > > Another thing: I've recently noticed that the following locking seems to
> > > hang the system with doing stress-testing concurrently with cpu
> > > hotplug/hotunplug:
> > >
> > > mutex_lock(&text_mutex);
> > > on_each_cpu(something, NULL, 1);
> > >
> > > The hang seems to be caused by the fact that alternative.c has:
> > >
> > > within cpu hotplug (cpu hotplug lock held)
> > > mutex_lock(&text_mutex);
> > >
> > > It might also be caused by the interaction with the stop_machine()
> > > performed within the cpu hotplug lock. I did not find the root cause of
> > > the problem, but this probably calls for lockdep improvements.
> > >
> > > In any case, given you are dealing with the exact same locking scenario
> > > here, I would recomment the following stress-test:
> > >
> > > in a loop, use text_poke_jump()
> > > in a loop, hotunplug all cpus, then hotplug all cpus
> > >
> > > I had to fix this temporarily by taking get/put_online_cpus() about the
> > > text_mutex.
> > >
> > > [snip snip]
> > >
> > > > +static struct notifier_block patch_exceptions_nb = {
> > > > + .notifier_call = patch_exceptions_notify,
> > > > + .priority = 0x7fffffff /* we need to be notified first */
> > > > +};
> > > > +
> > > > +static int __init patch_init(void)
> > > > +{
> > > > + return register_die_notifier(&patch_exceptions_nb);
> > > > +}
> > > > +
> > > > +arch_initcall(patch_init);
> > > > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > > > index e5342a3..43a30d8 100644
> > > > --- a/kernel/kprobes.c
> > > > +++ b/kernel/kprobes.c
> > > > @@ -898,7 +898,7 @@ EXPORT_SYMBOL_GPL(unregister_kprobes);
> > > >
> > > > static struct notifier_block kprobe_exceptions_nb = {
> > > > .notifier_call = kprobe_exceptions_notify,
> > > > - .priority = 0x7fffffff /* we need to be notified first */
> > > > + .priority = 0x7ffffff0 /* High priority, but not first. */
> > >
> > > It would be good to keep all these priorities in a centralized header.
> > >
> > > Thanks,
> > >
> > > Mathieu
> > >
> > > > };
> > > >
> > > > unsigned long __weak arch_deref_entry_point(void *entry)
> > > > --
> > > > 1.6.5.1
> > > >
> > >
> > > --
> > > Mathieu Desnoyers
> > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-11-19 14:02:50

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine

Mathieu Desnoyers wrote:
> * Jason Baron ([email protected]) wrote:
>> Add text_poke_fixup() which takes a fixup address to where a processor
>> jumps if it hits the modifying address while code modifying.
>> text_poke_fixup() does following steps for this purpose.
>>
>> 1. Setup int3 handler for fixup.
>> 2. Put a breakpoint (int3) on the first byte of modifying region,
>> and synchronize code on all CPUs.
>> 3. Modify other bytes of modifying region, and synchronize code on all CPUs.
>> 4. Modify the first byte of modifying region, and synchronize code
>> on all CPUs.
>> 5. Clear int3 handler.
>>
>
> Hi Masami,
>
> I like the approach and the API is clean. I have intersped comments
> below.

Hi Mathieu,

Thank you for reviewing :-)

> Ingo: I raise a question about text_mutex vs on_each_cpu hangs I
> experienced recently in the message below. Might be worth having a look,
> I suspect this might have caused the hangs Paul McKenney had with his
> past TREE RCU callback migration. I think he did take a mutex in the cpu
> hotplug callbacks and might have used IPIs within that same lock.
>
>> Thus, if some other processor execute modifying address when step2 to step4,
>> it will be jumped to fixup code.
>>
>> This still has many limitations for modifying multi-instructions at once.
>> However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
>> because;
>> - Replaced instruction is just one instruction, which is executed atomically.
>> - Replacing instruction is a jump, so we can set fixup address where the jump
>> goes to.
>>
>> Signed-off-by: Masami Hiramatsu <[email protected]>
>> ---
>> arch/x86/include/asm/alternative.h | 12 ++++
>> arch/x86/include/asm/kprobes.h | 1 +
>> arch/x86/kernel/alternative.c | 120 ++++++++++++++++++++++++++++++++++++
>> kernel/kprobes.c | 2 +-
>> 4 files changed, 134 insertions(+), 1 deletions(-)
>>
>
> [snip snip]
>
>> index de7353c..af47f12 100644
>> --- a/arch/x86/kernel/alternative.c
>> +++ b/arch/x86/kernel/alternative.c
>> @@ -4,6 +4,7 @@
>> #include <linux/list.h>
>> #include <linux/stringify.h>
>> #include <linux/kprobes.h>
>> +#include <linux/kdebug.h>
>> #include <linux/mm.h>
>> #include <linux/vmalloc.h>
>> #include <linux/memory.h>
>> @@ -552,3 +553,122 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
>> local_irq_restore(flags);
>> return addr;
>> }
>> +
>> +/*
>> + * On pentium series, Unsynchronized cross-modifying code
>> + * operations can cause unexpected instruction execution results.
>> + * So after code modified, we should synchronize it on each processor.
>> + */
>> +static void __kprobes __local_sync_core(void *info)
>> +{
>> + sync_core();
>> +}
>> +
>> +void __kprobes sync_core_all(void)
>> +{
>> + on_each_cpu(__local_sync_core, NULL, 1);
>
> OK, so you rely on the fact that on_each_cpu has memory barriers and
> executes the remote "__local_sync_core" with the appropriate memory
> barriers underneath, am I correct ?

Right, at least that's what I expected.

>> +}
>> +
>> +/* Safely cross-code modifying with fixup address */
>> +static void *patch_fixup_from;
>> +static void *patch_fixup_addr;
>> +
>> +static int __kprobes patch_exceptions_notify(struct notifier_block *self,
>> + unsigned long val, void *data)
>> +{
>> + struct die_args *args = data;
>> + struct pt_regs *regs = args->regs;
>> +
>> + if (likely(!patch_fixup_from))
>> + return NOTIFY_DONE;
>> +
>> + if (val != DIE_INT3 || !regs || user_mode_vm(regs) ||
>> + (unsigned long)patch_fixup_from != regs->ip)
>> + return NOTIFY_DONE;
>> +
>> + args->regs->ip = (unsigned long)patch_fixup_addr;
>> + return NOTIFY_STOP;
>> +}
>> +
>> +/**
>> + * text_poke_fixup() -- cross-modifying kernel text with fixup address.
>> + * @addr: Modifying address.
>> + * @opcode: New instruction.
>> + * @len: length of modifying bytes.
>> + * @fixup: Fixup address.
>> + *
>> + * Note: You must backup replaced instructions before calling this,
>> + * if you need to recover it.
>> + * Note: Must be called under text_mutex.
>> + */
>> +void *__kprobes text_poke_fixup(void *addr, const void *opcode, size_t len,
>> + void *fixup)
>> +{
>> + static const unsigned char int3_insn = BREAKPOINT_INSTRUCTION;
>> + static const int int3_size = sizeof(int3_insn);
>> +
>> + /* Replacing 1 byte can be done atomically. */
>
> I'm sure it can be done atomically, but can it be done safely though ?
>
> (disclaimer: we're still waiting for the official answer on this
> statement): Assuming instruction trace cache effects of SMP cross-code
> modification, and that only int3 would be safe to patch, then even
> changing 1 single byte could only be done by going to an intermediate
> int3 and synchronizing all other cores.

Ah, I see. Certainly, that's possible (int3 might be a special token for
cache coherency). So even for len == 1, we might be better off using the int3 cap.

>
>> + if (unlikely(len <= 1))
>> + return text_poke(addr, opcode, len);
>> +
>> + /* Preparing */
>> + patch_fixup_addr = fixup;
>> + wmb();
>
> hrm, missing comment ?

Ah, it's a barrier between patch_fixup_addr and patch_fixup_from.
The int3 trap handler checks patch_fixup_from first and then refers to patch_fixup_addr.

>
>> + patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */
>> +
>> + /* Cap by an int3 */
>> + text_poke(addr, &int3_insn, int3_size);
>> + sync_core_all();
>> +
>> + /* Replace tail bytes */
>> + text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
>> + len - int3_size);
>> + sync_core_all();
>> +
>> + /* Replace int3 with head byte */
>> + text_poke(addr, opcode, int3_size);
>> + sync_core_all();
>> +
>> + /* Cleanup */
>> + patch_fixup_from = NULL;
>> + wmb();
>
> missing comment here too.
>
>> + return addr;
>
> Little quiz question:
>
> When patch_fixup_from is set to NULL, what ensures that the int3
> handlers have completed their execution ?
>
> I think it's probably OK, because the int3 is an interrupt gate, which
> therefore disables interrupts as soon as it runs, and executes the
> notifier while irqs are off. When we run sync_core_all() after replacing
> the int3 by the new 1st byte, we only return when all other cores have
> executed an interrupt, which implies that all int3 handlers previously
> running should have ended. Is it right ? It looks to me as if this 3rd
> sync_core_all() is only needed because of that. Probably that adding a
> comment would be good.

Thanks, it's a good point and that's more or less what I thought.
As you said, it is probably safe. Even if it's not safe,
we can add some int3 fixup handler (with lowest priority)
which sets regs->ip back by one if there is no int3 there anymore, for safety.
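
Such a fallback could look roughly like this (purely a sketch, not part
of the posted series; it just backs the IP up and retries once the int3
is gone):

/* Hypothetical lowest-priority fallback, not part of the posted patch. */
static int __kprobes patch_retry_notify(struct notifier_block *self,
					unsigned long val, void *data)
{
	struct die_args *args = data;

	if (val != DIE_INT3 || !args->regs || user_mode_vm(args->regs))
		return NOTIFY_DONE;

	/* The int3 we trapped on has already been replaced: retry it. */
	if (*(unsigned char *)(args->regs->ip - 1) != BREAKPOINT_INSTRUCTION) {
		args->regs->ip -= 1;
		return NOTIFY_STOP;
	}
	return NOTIFY_DONE;
}

static struct notifier_block patch_retry_nb = {
	.notifier_call = patch_retry_notify,
	.priority = INT_MIN,	/* run after every other int3 user */
};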

> Another thing: I've recently noticed that the following locking seems to
> hang the system with doing stress-testing concurrently with cpu
> hotplug/hotunplug:
>
> mutex_lock(&text_mutex);
> on_each_cpu(something, NULL, 1);
>
> The hang seems to be caused by the fact that alternative.c has:
>
> within cpu hotplug (cpu hotplug lock held)
> mutex_lock(&text_mutex);
>
> It might also be caused by the interaction with the stop_machine()
> performed within the cpu hotplug lock. I did not find the root cause of
> the problem, but this probably calls for lockdep improvements.

Hmm, do you mean it will happen even if we use stop_machine()
under text_mutex locking?
That seems like a bigger problem between cpu-hotplug and on_each_cpu() etc.

> In any case, given you are dealing with the exact same locking scenario
> here, I would recomment the following stress-test:
>
> in a loop, use text_poke_jump()
> in a loop, hotunplug all cpus, then hotplug all cpus
>
> I had to fix this temporarily by taking get/put_online_cpus() about the
> text_mutex.
>
> [snip snip]
>
>> +static struct notifier_block patch_exceptions_nb = {
>> + .notifier_call = patch_exceptions_notify,
>> + .priority = 0x7fffffff /* we need to be notified first */
>> +};
>> +
>> +static int __init patch_init(void)
>> +{
>> + return register_die_notifier(&patch_exceptions_nb);
>> +}
>> +
>> +arch_initcall(patch_init);
>> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
>> index e5342a3..43a30d8 100644
>> --- a/kernel/kprobes.c
>> +++ b/kernel/kprobes.c
>> @@ -898,7 +898,7 @@ EXPORT_SYMBOL_GPL(unregister_kprobes);
>>
>> static struct notifier_block kprobe_exceptions_nb = {
>> .notifier_call = kprobe_exceptions_notify,
>> - .priority = 0x7fffffff /* we need to be notified first */
>> + .priority = 0x7ffffff0 /* High priority, but not first. */
>
> It would be good to keep all these priorities in a centralized header.

Sure, I'll put it in kprobes.h.

Thank you!
--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2009-11-19 16:03:34

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine

* Masami Hiramatsu ([email protected]) wrote:
> Mathieu Desnoyers wrote:
> > * Jason Baron ([email protected]) wrote:
> >> Add text_poke_fixup() which takes a fixup address to where a processor
> >> jumps if it hits the modifying address while code modifying.
> >> text_poke_fixup() does following steps for this purpose.
> >>
> >> 1. Setup int3 handler for fixup.
> >> 2. Put a breakpoint (int3) on the first byte of modifying region,
> >> and synchronize code on all CPUs.
> >> 3. Modify other bytes of modifying region, and synchronize code on all CPUs.
> >> 4. Modify the first byte of modifying region, and synchronize code
> >> on all CPUs.
> >> 5. Clear int3 handler.
> >>
> >
> > Hi Masami,
> >
> > I like the approach and the API is clean. I have intersped comments
> > below.
>
> Hi Mathieu,
>
> Thank you for reviewing :-)
>
> > Ingo: I raise a question about text_mutex vs on_each_cpu hangs I
> > experienced recently in the message below. Might be worth having a look,
> > I suspect this might have caused the hangs Paul McKenney had with his
> > past TREE RCU callback migration. I think he did take a mutex in the cpu
> > hotplug callbacks and might have used IPIs within that same lock.
> >
> >> Thus, if some other processor execute modifying address when step2 to step4,
> >> it will be jumped to fixup code.
> >>
> >> This still has many limitations for modifying multi-instructions at once.
> >> However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
> >> because;
> >> - Replaced instruction is just one instruction, which is executed atomically.
> >> - Replacing instruction is a jump, so we can set fixup address where the jump
> >> goes to.
> >>
> >> Signed-off-by: Masami Hiramatsu <[email protected]>
> >> ---
> >> arch/x86/include/asm/alternative.h | 12 ++++
> >> arch/x86/include/asm/kprobes.h | 1 +
> >> arch/x86/kernel/alternative.c | 120 ++++++++++++++++++++++++++++++++++++
> >> kernel/kprobes.c | 2 +-
> >> 4 files changed, 134 insertions(+), 1 deletions(-)
> >>
> >
> > [snip snip]
> >
> >> index de7353c..af47f12 100644
> >> --- a/arch/x86/kernel/alternative.c
> >> +++ b/arch/x86/kernel/alternative.c
> >> @@ -4,6 +4,7 @@
> >> #include <linux/list.h>
> >> #include <linux/stringify.h>
> >> #include <linux/kprobes.h>
> >> +#include <linux/kdebug.h>
> >> #include <linux/mm.h>
> >> #include <linux/vmalloc.h>
> >> #include <linux/memory.h>
> >> @@ -552,3 +553,122 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
> >> local_irq_restore(flags);
> >> return addr;
> >> }
> >> +
> >> +/*
> >> + * On pentium series, Unsynchronized cross-modifying code
> >> + * operations can cause unexpected instruction execution results.
> >> + * So after code modified, we should synchronize it on each processor.
> >> + */
> >> +static void __kprobes __local_sync_core(void *info)
> >> +{
> >> + sync_core();
> >> +}
> >> +
> >> +void __kprobes sync_core_all(void)
> >> +{
> >> + on_each_cpu(__local_sync_core, NULL, 1);
> >
> > OK, so you rely on the fact that on_each_cpu has memory barriers and
> > executes the remote "__local_sync_core" with the appropriate memory
> > barriers underneath, am I correct ?
>
> Right, at least I expected that.
>
> >> +}
> >> +
> >> +/* Safely cross-code modifying with fixup address */
> >> +static void *patch_fixup_from;
> >> +static void *patch_fixup_addr;
> >> +
> >> +static int __kprobes patch_exceptions_notify(struct notifier_block *self,
> >> + unsigned long val, void *data)
> >> +{
> >> + struct die_args *args = data;
> >> + struct pt_regs *regs = args->regs;
> >> +
> >> + if (likely(!patch_fixup_from))
> >> + return NOTIFY_DONE;
> >> +
> >> + if (val != DIE_INT3 || !regs || user_mode_vm(regs) ||
> >> + (unsigned long)patch_fixup_from != regs->ip)
> >> + return NOTIFY_DONE;
> >> +
> >> + args->regs->ip = (unsigned long)patch_fixup_addr;
> >> + return NOTIFY_STOP;
> >> +}
> >> +
> >> +/**
> >> + * text_poke_fixup() -- cross-modifying kernel text with fixup address.
> >> + * @addr: Modifying address.
> >> + * @opcode: New instruction.
> >> + * @len: length of modifying bytes.
> >> + * @fixup: Fixup address.
> >> + *
> >> + * Note: You must backup replaced instructions before calling this,
> >> + * if you need to recover it.
> >> + * Note: Must be called under text_mutex.
> >> + */
> >> +void *__kprobes text_poke_fixup(void *addr, const void *opcode, size_t len,
> >> + void *fixup)
> >> +{
> >> + static const unsigned char int3_insn = BREAKPOINT_INSTRUCTION;
> >> + static const int int3_size = sizeof(int3_insn);
> >> +
> >> + /* Replacing 1 byte can be done atomically. */
> >
> > I'm sure it can be done atomically, but can it be done safely though ?
> >
> > (disclaimer: we're still waiting for the official answer on this
> > statement): Assuming instruction trace cache effects of SMP cross-code
> > modification, and that only int3 would be safe to patch, then even
> > changing 1 single byte could only be done by going to an intermediate
> > int3 and synchronizing all other cores.
>
> Ah, I see. Certainly, that's possible (int3 might be a special token of
> cache coherency). So even len==1, we might better use int3 capping.
>
> >
> >> + if (unlikely(len <= 1))
> >> + return text_poke(addr, opcode, len);
> >> +
> >> + /* Preparing */
> >> + patch_fixup_addr = fixup;
> >> + wmb();
> >
> > hrm, missing comment ?
>
> Ah, it's a barrier between patch_fixup_addr and patch_fixup_from.
> int3 trap handler checks patch_fixup_from first and refers patch_fixup_addr.

Well, a smp_wmb() is probably enough, and the matching smp_rmb() is
missing in the int3 handler.

But why do you care about the order of these two? I agree that an
unrelated int3 handler (from kprobes ?) could be running concurrently at
that point, but it clearly cannot be called for this specific address
until the int3 is written by text_poke.

What I am pretty much certain is missing would be a smp_wmb()...

>
> >
> >> + patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */

..right here, between where you write to the data used by the int3
handler and where you write the actual breakpoint. On the read-side,
this might be a problem with architectures like alpha needing
smp_read_barrier_depends(), but not for Intel. However, in the spirit of
making this code solid, what I did in the immediate values code is:


target_after_int3 = insn + BREAKPOINT_INS_LEN;
/* register_die_notifier has memory barriers */
register_die_notifier(&imv_notify);
/* The breakpoint will single-step the bypass */
text_poke((void *)insn,
	  ((unsigned char[]){BREAKPOINT_INSTRUCTION}), 1);

And I unregister the die notifier at the end after having reached
quiescent state. At least we know that the die notifier chain read-side
has the proper memory barriers, which is not the case for the breakpoint
instruction itself.
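
Mapped onto the names used in this patch, the suggested ordering would
look roughly like this (a sketch only; the barrier placement is the
proposal, not the posted code):

/* Writer side, in text_poke_fixup(), before the breakpoint goes in: */
patch_fixup_addr = fixup;
patch_fixup_from = (u8 *)addr + int3_size;	/* IP right after the int3 */
smp_wmb();		/* publish the fixup data before the trap can fire */
text_poke(addr, &int3_insn, int3_size);

/* Reader side, at the top of patch_exceptions_notify(): */
if (likely(!patch_fixup_from))
	return NOTIFY_DONE;
smp_rmb();		/* pairs with the smp_wmb() above */
if (val != DIE_INT3 || !regs || user_mode_vm(regs) ||
    (unsigned long)patch_fixup_from != regs->ip)
	return NOTIFY_DONE;
args->regs->ip = (unsigned long)patch_fixup_addr;
return NOTIFY_STOP;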


> >> +
> >> + /* Cap by an int3 */
> >> + text_poke(addr, &int3_insn, int3_size);
> >> + sync_core_all();
> >> +
> >> + /* Replace tail bytes */
> >> + text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
> >> + len - int3_size);
> >> + sync_core_all();
> >> +
> >> + /* Replace int3 with head byte */
> >> + text_poke(addr, opcode, int3_size);
> >> + sync_core_all();
> >> +
> >> + /* Cleanup */
> >> + patch_fixup_from = NULL;
> >> + wmb();
> >
> > missing comment here too.
> >
> >> + return addr;
> >
> > Little quiz question:
> >
> > When patch_fixup_from is set to NULL, what ensures that the int3
> > handlers have completed their execution ?
> >
> > I think it's probably OK, because the int3 is an interrupt gate, which
> > therefore disables interrupts as soon as it runs, and executes the
> > notifier while irqs are off. When we run sync_core_all() after replacing
> > the int3 by the new 1st byte, we only return when all other cores have
> > executed an interrupt, which implies that all int3 handlers previously
> > running should have ended. Is it right ? It looks to me as if this 3rd
> > sync_core_all() is only needed because of that. Probably that adding a
> > comment would be good.
>
> Thanks, it's a good point and that's more what I've thought.
> As you said, it is probably safe. Even if it's not safe,
> we can add some int3 fixup handler (with lowest priority)
> which set regs->ip-1 if there is no int3 anymore, for safety.

Well, just ensuring that we reach a "disabled IRQ code quiescent
state" should be enough. Another way would be to use
synchronize_sched(), but it might take longer. Actively poking the other
CPUs with IPIs seems quicker. So I would be tempted to leave your code
as is in this respect, but to add a comment.

>
> > Another thing: I've recently noticed that the following locking seems to
> > hang the system with doing stress-testing concurrently with cpu
> > hotplug/hotunplug:
> >
> > mutex_lock(&text_mutex);
> > on_each_cpu(something, NULL, 1);
> >
> > The hang seems to be caused by the fact that alternative.c has:
> >
> > within cpu hotplug (cpu hotplug lock held)
> > mutex_lock(&text_mutex);
> >
> > It might also be caused by the interaction with the stop_machine()
> > performed within the cpu hotplug lock. I did not find the root cause of
> > the problem, but this probably calls for lockdep improvements.
>
> Hmm, would you mean it will happen even if we use stop_machine()
> under text_mutex locking?
> It seems that bigger problem of cpu-hotplug and on_each_cpu() etc.

Yes, but, again, this calls for more testing. Hopefully it's not
something else in my own code that I haven't seen. For now I can just say
that I've been noticing hangs involving cpu hotplug and text_mutex, and
taking the cpu hotplug mutex around text_mutex (in my immediate values
code) fixed the problem.
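
Concretely, the temporary fix amounts to wrapping every text_mutex +
IPI section like this (a sketch; addr/opcode/len/fixup stand for
whatever is being patched):

get_online_cpus();	/* exclude cpu hotplug before taking text_mutex */
mutex_lock(&text_mutex);

/* the IPIs done via on_each_cpu() can no longer race with hotplug */
text_poke_fixup(addr, opcode, len, fixup);

mutex_unlock(&text_mutex);
put_online_cpus();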

Thanks,

Mathieu

>
> > In any case, given you are dealing with the exact same locking scenario
> > here, I would recomment the following stress-test:
> >
> > in a loop, use text_poke_jump()
> > in a loop, hotunplug all cpus, then hotplug all cpus
> >
> > I had to fix this temporarily by taking get/put_online_cpus() about the
> > text_mutex.
> >
> > [snip snip]
> >
> >> +static struct notifier_block patch_exceptions_nb = {
> >> + .notifier_call = patch_exceptions_notify,
> >> + .priority = 0x7fffffff /* we need to be notified first */
> >> +};
> >> +
> >> +static int __init patch_init(void)
> >> +{
> >> + return register_die_notifier(&patch_exceptions_nb);
> >> +}
> >> +
> >> +arch_initcall(patch_init);
> >> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> >> index e5342a3..43a30d8 100644
> >> --- a/kernel/kprobes.c
> >> +++ b/kernel/kprobes.c
> >> @@ -898,7 +898,7 @@ EXPORT_SYMBOL_GPL(unregister_kprobes);
> >>
> >> static struct notifier_block kprobe_exceptions_nb = {
> >> .notifier_call = kprobe_exceptions_notify,
> >> - .priority = 0x7fffffff /* we need to be notified first */
> >> + .priority = 0x7ffffff0 /* High priority, but not first. */
> >
> > It would be good to keep all these priorities in a centralized header.
>
> Sure, I'll put it in kprobes.h.
>
> Thank you!
> --
> Masami Hiramatsu
>
> Software Engineer
> Hitachi Computer Products (America), Inc.
> Software Solutions Division
>
> e-mail: [email protected]
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-11-19 16:49:14

by Jason Baron

[permalink] [raw]
Subject: Re: [PATCH] notifier atomic call chain notrace

On Wed, Nov 18, 2009 at 06:38:15PM -0500, Mathieu Desnoyers wrote:
> Would the following patch help ? I use it in the LTTng tree to alleviate
> this problem.
>

Indeed. This patch solves the recursive lockups when lockdep is enabled
and we are doing the jump patching. Thanks!

-Jason

2009-11-19 21:56:54

by Jason Baron

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] jump label v3

On Wed, Nov 18, 2009 at 07:54:24PM -0800, Roland McGrath wrote:
> 2. optimal compiled hot path code
>
> You and Richard have been working on this in gcc and we know the state
> of it now. When we get the cold labels feature done, it will be ideal
> for -O(2?). But people mostly use -Os and there no block reordering
> gets done now (I think perhaps this even means likely/unlikely don't
> really change which path is the straight line, just the source order
> of the blocks still determines it). So we hope for more incremental
> improvements here, and maybe even really optimal code for -O2 soon.
> But at least for -Os it may not be better than "unconditional jump
> around" as the "straight line" path in the foreseeable future. As
> noted, that alone is still a nice savings over the status quo for the
> disabled case. (You gave an "average cycles saved" for this vs a load
> and test, but do you have any comparisons of how those two compare to
> no tracepoint at all?)
>

I've run that in the past, and for the nop + jump sequence it's between
2-4 cycles on average vs. no tracepoint.


> 3. bookkeeping magic to find all the jumps to enable for a given tracepoint
>
> Here you have a working first draft, but it looks pretty clunky.
> That strcmp just makes me gag. For a first version that's still
> pretty simple, I think it should be trivial to use a pointer
> comparison there. For tracepoints, it can be the address of the
> struct tracepoint. For the general case, it can be the address of
> the global that would be flag variable in case of no asm goto support.
>
> For more incremental improvements, we could cut down on running
> through the entire table for every switch. If there are many
> different switches (as there are already for many different
> tracepoints), then you really just want to run through the
> insn-patch list for the particular switch when you toggle it.
>
> It's possible to group this all statically at link time, but all
> the linker magic hacking required to get that to go is probably
> more trouble than it's worth.
>
> A simple hack is to run through the big unsorted table at boot time
> and turn it into a contiguous table for each switch. Then
> e.g. hang each table off the per-switch global variable by the same
> name that in a no-asm-goto build would be the simple global flag.
>

That probably makes the most sense: sort the jump table and then
store an offset/length pair with each switch. I was thinking of this as a
follow-on optimization (the tracepoint code is already O(N) per switch toggle,
where N = total number of all tracepoint site locations, and not O(n), where
n = number of sites per tracepoint). Certainly, if this is a gating issue for
this patchset, I can fix it now.
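
The follow-on could be as simple as this (a sketch only, reusing the
hypothetical jump_entry layout discussed earlier in the thread; helper
names are made up):

#include <linux/sort.h>

/* Hypothetical sketch of per-switch grouping; not part of this patchset. */
static int jump_entry_cmp(const void *a, const void *b)
{
	const struct jump_entry *ja = a, *jb = b;

	if (ja->key < jb->key)
		return -1;
	return ja->key > jb->key;
}

static void __init jump_label_sort(void)
{
	size_t n = __stop___jump_table - __start___jump_table;

	sort(__start___jump_table, n, sizeof(struct jump_entry),
	     jump_entry_cmp, NULL);

	/*
	 * One pass over the sorted table: for each distinct key, stash
	 * the first entry and the run length in the switch's bookkeeping
	 * (e.g. an offset/length pair hung off the per-switch variable).
	 */
}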

>
> Finally, for using this for general purposes unrelated to tracepoints,
> I envision something like:
>
> DECLARE_MOSTLY_NOT(foobar);
>
> foo(int x, int y)
> {
> if (x > y && mostly_not(foobar))
> do_foobar(x - y);
> }
>
> ... set_mostly_not(foobar, onoff);
>
> where it's:
>
> #define DECLARE_MOSTLY_NOT(name) ... __something_##name
> #define mostly_not(name) ({ int _doit = 0; __label__ _yes; \
> JUMP_LABEL(name, _yes, __something_##name); \
> if (0) _yes: __cold _doit = 1; \
> unlikely (_doit); })
>
> I don't think we've tried to figure out how well this compiles yet.
> But it shows the sort of thing that we can do to expose this feature
> in a way that's simple and unrestrictive for kernel code to use casually.
>
>

Cool. The assembly output would be interesting here...

thanks,

-Jason

2009-11-20 01:02:29

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine

Mathieu Desnoyers wrote:
[...]
>>>> + if (unlikely(len<= 1))
>>>> + return text_poke(addr, opcode, len);
>>>> +
>>>> + /* Preparing */
>>>> + patch_fixup_addr = fixup;
>>>> + wmb();
>>>
>>> hrm, missing comment ?
>>
>> Ah, it's a barrier between patch_fixup_addr and patch_fixup_from.
>> int3 trap handler checks patch_fixup_from first and refers patch_fixup_addr.
>
> When a smp_wmb() is probably enough, and the matching smp_rmb() is
> missing in the int3 handler.

OK, thank you.

> But why to you care about the order of these two ? I agree that an
> unrelated int3 handler (from kprobes ?) could be running concurrently at
> that point, but it clearly cannot be called for this specific address
> until the int3 is written by text_poke.

Ah, it's my fault. I fixed that a month ago, and forgot to push it...
Actually, we don't need to care about the order of those two. Instead,
we have to update the patch_fixup_* variables before embedding the int3.

>
> What I am pretty much certain is missing would be a smp_wmb()...

Agreed.

>
>>
>>>
>>>> + patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */
>
> ..right here, between where you write to the data used by the int3
> handler and where you write the actual breakpoint. On the read-side,
> this might be a problem with architectures like alpha needing
> smp_read_barrier_depends(), but not for Intel. However, in a spirit to
> make this code solid, what I did in the immed. val. is:
>
>
> target_after_int3 = insn + BREAKPOINT_INS_LEN;
> /* register_die_notifier has memory barriers */
> register_die_notifier(&imv_notify);
> /* The breakpoint will single-step the bypass */
> text_poke((void *)insn,
> ((unsigned char[]){BREAKPOINT_INSTRUCTION}), 1);

Hmm, it strongly depends on the arch. Is an smp_wmb() right after setting
patch_fixup_from enough on x86?

> And I unregister the die notifier at the end after having reached
> quiescent state. At least we know that the die notifier chain read-side
> has the proper memory barriers, which is not the case for the breakpoint
> instruction itself.
>
>
>>>> +
>>>> + /* Cap by an int3 */
>>>> + text_poke(addr,&int3_insn, int3_size);
>>>> + sync_core_all();
>>>> +
>>>> + /* Replace tail bytes */
>>>> + text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
>>>> + len - int3_size);
>>>> + sync_core_all();
>>>> +
>>>> + /* Replace int3 with head byte */
>>>> + text_poke(addr, opcode, int3_size);
>>>> + sync_core_all();
>>>> +
>>>> + /* Cleanup */
>>>> + patch_fixup_from = NULL;
>>>> + wmb();
>>>
>>> missing comment here too.
>>>
>>>> + return addr;
>>>
>>> Little quiz question:
>>>
>>> When patch_fixup_from is set to NULL, what ensures that the int3
>>> handlers have completed their execution ?
>>>
>>> I think it's probably OK, because the int3 is an interrupt gate, which
>>> therefore disables interrupts as soon as it runs, and executes the
>>> notifier while irqs are off. When we run sync_core_all() after replacing
>>> the int3 by the new 1st byte, we only return when all other cores have
>>> executed an interrupt, which implies that all int3 handlers previously
>>> running should have ended. Is it right ? It looks to me as if this 3rd
>>> sync_core_all() is only needed because of that. Probably that adding a
>>> comment would be good.
>>
>> Thanks, it's a good point and that's more what I've thought.
>> As you said, it is probably safe. Even if it's not safe,
>> we can add some int3 fixup handler (with lowest priority)
>> which set regs->ip-1 if there is no int3 anymore, for safety.
>
> Well, just ensuring that the we reaches a "disabled IRQ code quiescent
> state" should be enough. Another way would be to use
> synchronize_sched(), but it might take longer. Actively poking the other
> CPUs with IPIs seems quicker. So I would be tempted to leave your code
> as is in this respect, but to add a comment.

Agreed. synchronize_sched() waits too long for this purpose.
OK, I'll add a comment about that: "waiting for disabled IRQ code quiescent state" :-)
Thanks for the good advice!
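
So the tail of text_poke_fixup() would read roughly like this (the
comment wording is only a first guess):

	/* Replace int3 with head byte */
	text_poke(addr, opcode, int3_size);
	/*
	 * Wait for a disabled-IRQ code quiescent state: once every CPU
	 * has taken this IPI, any int3 handler still running for this
	 * site has returned, so the fixup data can be cleared safely.
	 */
	sync_core_all();

	/* Cleanup */
	patch_fixup_from = NULL;
	wmb();

	return addr;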

>>> Another thing: I've recently noticed that the following locking seems to
>>> hang the system with doing stress-testing concurrently with cpu
>>> hotplug/hotunplug:
>>>
>>> mutex_lock(&text_mutex);
>>> on_each_cpu(something, NULL, 1);
>>>
>>> The hang seems to be caused by the fact that alternative.c has:
>>>
>>> within cpu hotplug (cpu hotplug lock held)
>>> mutex_lock(&text_mutex);
>>>
>>> It might also be caused by the interaction with the stop_machine()
>>> performed within the cpu hotplug lock. I did not find the root cause of
>>> the problem, but this probably calls for lockdep improvements.
>>
>> Hmm, would you mean it will happen even if we use stop_machine()
>> under text_mutex locking?
>> It seems to be a bigger problem between cpu-hotplug and on_each_cpu(), etc.
>
> Yes, but, again.. this calls for more testing. Hopefully it's not
> something else in my own code I haven't seen. For now I can just say
> that I've been noticing hangs involving cpu hotplug and text mutex, and
> taking the cpu hotplug mutex around text mutex (in my immediate values
> code) fixed the problem.

Hmm, I guess that we'd better merge those two mutexes since
text modification always requires disabling cpu-hotplug...
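
e.g. wrapping them behind a single pair of hypothetical helpers (a sketch
only; assuming get_online_cpus()/put_online_cpus() are the cpu-hotplug
lock in question):

----
/* Hypothetical sketch: take the cpu-hotplug lock and text_mutex together. */
static void text_poke_lock(void)
{
        get_online_cpus();      /* block cpu hotplug while we patch text */
        mutex_lock(&text_mutex);
}

static void text_poke_unlock(void)
{
        mutex_unlock(&text_mutex);
        put_online_cpus();
}
----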

Thank you,


--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2009-11-20 21:54:54

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine

On 11/18/2009 02:43 PM, Jason Baron wrote:
> Add text_poke_fixup() which takes a fixup address to where a processor
> jumps if it hits the modifying address while code modifying.
> text_poke_fixup() does following steps for this purpose.
>
> 1. Setup int3 handler for fixup.
> 2. Put a breakpoint (int3) on the first byte of modifying region,
> and synchronize code on all CPUs.
> 3. Modify other bytes of modifying region, and synchronize code on all CPUs.
> 4. Modify the first byte of modifying region, and synchronize code
> on all CPUs.
> 5. Clear int3 handler.
>
> Thus, if some other processor execute modifying address when step2 to step4,
> it will be jumped to fixup code.
>
> This still has many limitations for modifying multi-instructions at once.
> However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
> because;
> - Replaced instruction is just one instruction, which is executed atomically.
> - Replacing instruction is a jump, so we can set fixup address where the jump
> goes to.
>

I just had a thought about this... regardless of if this is safe or not
(which still remains to be determined)... I have a bit more of a
fundamental question about it:

This code ends up taking *two* global IPIs for each instruction
modification. Each of those requires whole-system synchronization. How
is this better than taking one IPI and having the other CPUs wait until
the modification is complete before returning?

-hpa

2009-11-21 00:06:48

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine

Hi Peter,

H. Peter Anvin wrote:
> On 11/18/2009 02:43 PM, Jason Baron wrote:
>> Add text_poke_fixup() which takes a fixup address to where a processor
>> jumps if it hits the modifying address while code modifying.
>> text_poke_fixup() does following steps for this purpose.
>>
>> 1. Setup int3 handler for fixup.
>> 2. Put a breakpoint (int3) on the first byte of modifying region,
>> and synchronize code on all CPUs.
>> 3. Modify other bytes of modifying region, and synchronize code on all CPUs.
>> 4. Modify the first byte of modifying region, and synchronize code
>> on all CPUs.
>> 5. Clear int3 handler.
>>
>> Thus, if some other processor execute modifying address when step2 to step4,
>> it will be jumped to fixup code.
>>
>> This still has many limitations for modifying multi-instructions at once.
>> However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
>> because;
>> - Replaced instruction is just one instruction, which is executed atomically.
>> - Replacing instruction is a jump, so we can set fixup address where the jump
>> goes to.
>>
>
> I just had a thought about this... regardless of if this is safe or not
> (which still remains to be determined)... I have a bit more of a
> fundamental question about it:
>
> This code ends up taking *two* global IPIs for each instruction
> modification. Each of those requires whole-system synchronization.

As Mathieu and I discussed, the first IPI is for synchronizing code, and
the second is for waiting until all int3 handling is done.

> How
> is this better than taking one IPI and having the other CPUs wait until
> the modification is complete before returning?

Would you mean using stop_machine()? :-)

If we don't care about NMI, we can use stop_machine() (for
this reason, kprobe-jump-optimization can use stop_machine(),
because kprobes can't probe NMI code), but tracepoint has
to support NMI.

Actually, it might be possible, even if it will be complicated.
If one-byte modification (int3 injection/removal) is always
synchronized, I assume the timechart below can work
(and it can support NMI/SMI too).

----
<CPU0> <CPU1>
flag = 0
setup int3 handler
int3 injection[sync]
other-bytes modifying
smp_call_function(func) func()
wait_until(flag==1) irq_disable()
sync_core() for other-bytes modifying
flag = 1
first-byte modifying[sync] wait_until(flag==2)
flag = 2
wait_until(flag==3) irq_enable()
flag = 3
cleanup int3 handler return
return
----

I'm not so sure that this flag-based step-by-step code can
work faster than 2 IPIs :-(

Any comments are welcome! :-)

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2009-11-21 00:20:35

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine

On 11/20/2009 04:06 PM, Masami Hiramatsu wrote:
>
> If we don't care about NMI, we can use stop_machine() (for
> this reason, kprobe-jump-optimization can use stop_machine(),
> because kprobes can't probe NMI code), but tracepoint has
> to support NMI.
>

Ingo pointed out that the NMI issue can be dealt with by having the NMI
handler check, on entry, a flag indicating whether code modification is in
progress. So yes, an NMI can trap out of the IPI hold, but it will not go
any further.

-hpa

2009-11-21 01:12:30

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine

Mathieu Desnoyers wrote:
> * Jason Baron ([email protected]) wrote:
>> Add text_poke_fixup() which takes a fixup address to where a processor
>> jumps if it hits the modifying address while code modifying.
>> text_poke_fixup() does following steps for this purpose.
>>
>> 1. Setup int3 handler for fixup.
>> 2. Put a breakpoint (int3) on the first byte of modifying region,
>> and synchronize code on all CPUs.
>> 3. Modify other bytes of modifying region, and synchronize code on all CPUs.
>> 4. Modify the first byte of modifying region, and synchronize code
>> on all CPUs.
>> 5. Clear int3 handler.
>>
>
> Hi Masami,
>
> I like the approach and the API is clean. I have interspersed comments
> below.
>
> Ingo: I raise a question about text_mutex vs on_each_cpu hangs I
> experienced recently in the message below. Might be worth having a look,
> I suspect this might have caused the hangs Paul McKenney had with his
> past TREE RCU callback migration. I think he did take a mutex in the cpu
> hotplug callbacks and might have used IPIs within that same lock.

Hi Mathieu,

I guess that the hang might happen as below;

----
lock text_mutex
modify code
on_each_cpu(do_something)
cpu-hotplug (down)
lock cpu-hotplug mutex
online_cpus is changed
native_cpu_die()
->alternatives_smp_switch(0)
->lock text_mutex -> sleep
(wait for offlined cpu...)
----

If this is correct, I think we can fix it as below.

----
lock cpu-hotplug mutex
lock text_mutex
modify code
on_each_cpu(do_something)
unlock text_mutex
unlock cpu-hotplug mutex
cpu-hotplug (down)
lock cpu-hotplug mutex
online_cpus is changed
native_cpu_die()
->alternatives_smp_switch(0)
->lock text_mutex
modify code
unlock text_mutex
...
unlock cpu-hotplug mutex
----
Or,
----
lock text_mutex
modify code
unlock text_mutex
on_each_cpu(do_something)
cpu-hotplug (down)
lock cpu-hotplug mutex
online_cpus is changed
native_cpu_die()
->alternatives_smp_switch(0)
->lock text_mutex
modify code
unlock text_mutex
...
unlock cpu-hotplug mutex
----
The latter needs another mutex for the int3 handler and
frequent mutex_lock/unlock in this patch.

Hmm?
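
In code, the former would roughly be (a sketch only; assuming
get_online_cpus()/put_online_cpus() are the "cpu-hotplug mutex"
operations):

----
        get_online_cpus();              /* lock cpu-hotplug mutex */
        mutex_lock(&text_mutex);
        text_poke_fixup(addr, opcode, len, fixup);      /* modify code */
        on_each_cpu(do_something, NULL, 1);
        mutex_unlock(&text_mutex);
        put_online_cpus();              /* unlock cpu-hotplug mutex */
----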

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2009-11-21 15:32:50

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine

* Masami Hiramatsu ([email protected]) wrote:
> Mathieu Desnoyers wrote:
> [...]
>>>>> + if (unlikely(len <= 1))
>>>>> + return text_poke(addr, opcode, len);
>>>>> +
>>>>> + /* Preparing */
>>>>> + patch_fixup_addr = fixup;
>>>>> + wmb();
>>>>
>>>> hrm, missing comment ?
>>>
>>> Ah, it's a barrier between patch_fixup_addr and patch_fixup_from.
>>> int3 trap handler checks patch_fixup_from first and refers patch_fixup_addr.
>>
>> When a smp_wmb() is probably enough, and the matching smp_rmb() is
>> missing in the int3 handler.
>
> OK, thank you.
>
>> But why to you care about the order of these two ? I agree that an
>> unrelated int3 handler (from kprobes ?) could be running concurrently at
>> that point, but it clearly cannot be called for this specific address
>> until the int3 is written by text_poke.
>
> Ah, it's my fault. I fixed that a month ago, and forgot to push it...
> Actually, we don't need to care about the order of those two. Instead,
> we have to update the patch_fixup_* variables before embedding the int3.
>
>>
>> What I am pretty much certain is missing would be a smp_wmb()...
>
> Agreed.
>
>>
>>>
>>>>
>>>>> + patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */
>>
>> ..right here, between where you write to the data used by the int3
>> handler and where you write the actual breakpoint. On the read-side,
>> this might be a problem with architectures like alpha needing
>> smp_read_barrier_depends(), but not for Intel. However, in a spirit to
>> make this code solid, what I did in the immed. val. is:
>>
>>
>> target_after_int3 = insn + BREAKPOINT_INS_LEN;
>> /* register_die_notifier has memory barriers */
>> register_die_notifier(&imv_notify);
>> /* The breakpoint will single-step the bypass */
>> text_poke((void *)insn,
>> ((unsigned char[]){BREAKPOINT_INSTRUCTION}), 1);
>
> Hmm, it strongly depends on arch. Is smp_wmb() right after setting
> patch_fixup_from enough on x86?

What else do you have in mind ? wmb() ? Or adding a
smp_read_barrier_depends() at the beginning of the handler ?

Clearly, smp_read_barrier_depends() is a no-op on x86, but it might be
good to add it just for code clarity (it helps commenting which ordering
has to be done on the read-side).
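
For clarity, the read side I have in mind looks roughly like this (a
sketch only; the notifier plumbing is simplified, and patch_fixup_from /
patch_fixup_addr are the globals from your patch):

----
/* Sketch: int3 die-notifier redirecting a CPU that hit the breakpoint. */
static int patch_int3_notify(struct notifier_block *nb,
                             unsigned long val, void *data)
{
        struct die_args *args = data;
        u8 *from = patch_fixup_from;

        /* Pairs with the smp_wmb() on the write side; no-op on x86. */
        smp_read_barrier_depends();

        if (val != DIE_INT3 || !from)
                return NOTIFY_DONE;
        /* regs->ip already points just after the int3. */
        if ((u8 *)args->regs->ip != from)
                return NOTIFY_DONE;

        args->regs->ip = (unsigned long)patch_fixup_addr;
        return NOTIFY_STOP;
}
----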


>
>> And I unregister the die notifier at the end after having reached
>> quiescent state. At least we know that the die notifier chain read-side
>> has the proper memory barriers, which is not the case for the breakpoint
>> instruction itself.
>>
>>
>>>>> +
>>>>> + /* Cap by an int3 */
>>>>> + text_poke(addr, &int3_insn, int3_size);
>>>>> + sync_core_all();
>>>>> +
>>>>> + /* Replace tail bytes */
>>>>> + text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
>>>>> + len - int3_size);
>>>>> + sync_core_all();
>>>>> +
>>>>> + /* Replace int3 with head byte */
>>>>> + text_poke(addr, opcode, int3_size);
>>>>> + sync_core_all();
>>>>> +
>>>>> + /* Cleanup */
>>>>> + patch_fixup_from = NULL;
>>>>> + wmb();
>>>>
>>>> missing comment here too.
>>>>
>>>>> + return addr;
>>>>
>>>> Little quiz question:
>>>>
>>>> When patch_fixup_from is set to NULL, what ensures that the int3
>>>> handlers have completed their execution ?
>>>>
>>>> I think it's probably OK, because the int3 is an interrupt gate, which
>>>> therefore disables interrupts as soon as it runs, and executes the
>>>> notifier while irqs are off. When we run sync_core_all() after replacing
>>>> the int3 by the new 1st byte, we only return when all other cores have
>>>> executed an interrupt, which implies that all int3 handlers previously
>>>> running should have ended. Is it right ? It looks to me as if this 3rd
>>>> sync_core_all() is only needed because of that. Probably that adding a
>>>> comment would be good.
>>>
>>> Thanks, it's a good point and that's more what I've thought.
>>> As you said, it is probably safe. Even if it's not safe,
>>> we can add some int3 fixup handler (with lowest priority)
>>> which set regs->ip-1 if there is no int3 anymore, for safety.
>>
>> Well, just ensuring that we reach a "disabled IRQ code quiescent
>> state" should be enough. Another way would be to use
>> synchronize_sched(), but it might take longer. Actively poking the other
>> CPUs with IPIs seems quicker. So I would be tempted to leave your code
>> as is in this respect, but to add a comment.
>
> Agreed. synchronize_sched() waits too long for this purpose.
> OK, I'll add a comment for that "waiting for disabled IRQ code
> quiescent state" :-) Thanks for the good advice!
>
>>>> Another thing: I've recently noticed that the following locking seems to
>>>> hang the system when doing stress-testing concurrently with cpu
>>>> hotplug/hotunplug:
>>>>
>>>> mutex_lock(&text_mutex);
>>>> on_each_cpu(something, NULL, 1);
>>>>
>>>> The hang seems to be caused by the fact that alternative.c has:
>>>>
>>>> within cpu hotplug (cpu hotplug lock held)
>>>> mutex_lock(&text_mutex);
>>>>
>>>> It might also be caused by the interaction with the stop_machine()
>>>> performed within the cpu hotplug lock. I did not find the root cause of
>>>> the problem, but this probably calls for lockdep improvements.
>>>
>>> Hmm, would you mean it will happen even if we use stop_machine()
>>> under text_mutex locking?
>>> It seems to be a bigger problem between cpu-hotplug and on_each_cpu(), etc.
>>
>> Yes, but, again.. this calls for more testing. Hopefully it's not
>> something else in my own code I haven't seen. For now I can just say
>> that I've been noticing hangs involving cpu hotplug and text mutex, and
>> taking the cpu hotplug mutex around text mutex (in my immediate values
>> code) fixed the problem.
>
> Hmm, I guess that we'd better merge those two mutexes since
> text modification always requires disabling cpu-hotplug...

Maybe.. although it's not clear to me that CPU hotplug is required to be
disabled around on_each_cpu calls.

Mathieu

>
> Thank you,
>
>
> --
> Masami Hiramatsu
>
> Software Engineer
> Hitachi Computer Products (America), Inc.
> Software Solutions Division
>
> e-mail: [email protected]
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-11-21 15:43:37

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine

* Masami Hiramatsu ([email protected]) wrote:
> Mathieu Desnoyers wrote:
>> * Jason Baron ([email protected]) wrote:
>>> Add text_poke_fixup() which takes a fixup address to where a processor
>>> jumps if it hits the modifying address while code modifying.
>>> text_poke_fixup() does following steps for this purpose.
>>>
>>> 1. Setup int3 handler for fixup.
>>> 2. Put a breakpoint (int3) on the first byte of modifying region,
>>> and synchronize code on all CPUs.
>>> 3. Modify other bytes of modifying region, and synchronize code on all CPUs.
>>> 4. Modify the first byte of modifying region, and synchronize code
>>> on all CPUs.
>>> 5. Clear int3 handler.
>>>
>>
>> Hi Masami,
>>
>> I like the approach and the API is clean. I have interspersed comments
>> below.
>>
>> Ingo: I raise a question about text_mutex vs on_each_cpu hangs I
>> experienced recently in the message below. Might be worth having a look,
>> I suspect this might have caused the hangs Paul McKenney had with his
>> past TREE RCU callback migration. I think he did take a mutex in the cpu
>> hotplug callbacks and might have used IPIs within that same lock.
>
> Hi Mathieu,
>
> I guess that the hang might happen as below;
>
> ----
> lock text_mutex
> modify code
> on_each_cpu(do_something)
> cpu-hotplug (down)
> lock cpu-hotplug mutex
> online_cpus is changed
> native_cpu_die()
> ->alternatives_smp_switch(0)
> ->lock text_mutex -> sleep
> (wait for offlined cpu...)
> ----
>
> If this is correct, I think we can fix it as below.
>
> ----
> lock cpu-hotplug mutex

Yes, this is the solution I used in my own immediate values code too.

> lock text_mutex
> modify code
> on_each_cpu(do_something)
> unlock text_mutex
> unlock cpu-hotplug mutex
> cpu-hotplug (down)
> lock cpu-hotplug mutex
> online_cpus is changed
> native_cpu_die()
> ->alternatives_smp_switch(0)
> ->lock text_mutex
> modify code
> unlock text_mutex
> ...
> unlock cpu-hotplug mutex
> ----
> Or,
> ----
> lock text_mutex
> modify code
> unlock text_mutex
> on_each_cpu(do_something)
> cpu-hotplug (down)
> lock cpu-hotplug mutex
> online_cpus is changed
> native_cpu_die()
> ->alternatives_smp_switch(0)
> ->lock text_mutex
> modify code
> unlock text_mutex
> ...
> unlock cpu-hotplug mutex
> ----
> The latter needs another mutex for the int3 handler and
> frequent mutex_lock/unlock in this patch.
>
> Hmm?

The simplest solution seems to be the best one IMHO. But have you been
able to reproduce the lockup ?

Mathieu

>
> Thank you,
>
> --
> Masami Hiramatsu
>
> Software Engineer
> Hitachi Computer Products (America), Inc.
> Software Solutions Division
>
> e-mail: [email protected]
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-11-21 16:12:28

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine

* H. Peter Anvin ([email protected]) wrote:
> On 11/18/2009 02:43 PM, Jason Baron wrote:
> > Add text_poke_fixup() which takes a fixup address to where a processor
> > jumps if it hits the modifying address while code modifying.
> > text_poke_fixup() does following steps for this purpose.
> >
> > 1. Setup int3 handler for fixup.
> > 2. Put a breakpoint (int3) on the first byte of modifying region,
> > and synchronize code on all CPUs.
> > 3. Modify other bytes of modifying region, and synchronize code on all CPUs.
> > 4. Modify the first byte of modifying region, and synchronize code
> > on all CPUs.
> > 5. Clear int3 handler.
> >
> > Thus, if some other processor execute modifying address when step2 to step4,
> > it will be jumped to fixup code.
> >
> > This still has many limitations for modifying multi-instructions at once.
> > However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
> > because;
> > - Replaced instruction is just one instruction, which is executed atomically.
> > - Replacing instruction is a jump, so we can set fixup address where the jump
> > goes to.
> >
>
> I just had a thought about this... regardless of if this is safe or not
> (which still remains to be determined)... I have a bit more of a
> fundamental question about it:
>
> This code ends up taking *two* global IPIs for each instruction
> modification. Each of those requires whole-system synchronization. How
> is this better than taking one IPI and having the other CPUs wait until
> the modification is complete before returning?

Hi Peter,

There are two main benefit for int3 vs stop_machine() I can think of:

1. Diminish the "whole system interrupt off" latency.
2. Support NMIs, MCEs and traps without adding complexity to the NMI,
MCE, or trap handlers.

Also, code update is not performance-critical in terms of execution time;
the more important aspect is the effect on the irq latency characteristics
of the kernel, as explained below.


1. Latency

* Let's say we have a worst-case interrupt latency of 15ms in our
kernel on CPU A (e.g. induced by printk to a serial console) (yeah,
that's bad, but these exist).

* Concurrently, CPU B performs code modification with stop_machine().
stop_machine() spawns a worker thread on each CPU, waiting for all
CPUs to respond to the IPI.

* CPU A is still in its worst-case interrupt latency. All other CPUs
are running in stop_cpu() waiting for CPU A to join the game.

So, basically, we just transformed a single-cpu worst-case latency into an
even worse, longer, whole-system worst-case latency.


2. NMIs, MCEs and traps

Currently, the approach Ingo proposes targets NMIs only. Other
non-maskable interrupt sources should also be treated. I think everyone
assumes that stop_cpu never generates any trap. Is it a documented
requirement, or is it just wishful thinking ? Because if it is not, it
could be possible for an architecture to choose a design that would let
stop_cpu generate a trap, and therefore we would have to fiddle with the
trap handlers too. Should we add the fixup test to trap handlers too ?

NMIs can be dealt with using the scheme where the NMI does the
modification itself if it nests over the code modification code, as
proposed by Ingo. However, I'm concerned about adding any bit of
complexity to a code path like the nmi handler. Basically, it has always
been a mechanism meant to do a task with minimal delay and with
minimal interaction with the system. Adding fixup code to these goes
against this idea. If someone really needs to hook their device on an NMI
handler, I am certain that they will not see these extra checks and this
extra NMI-response latency as welcome.

MCEs are yet another code path that could be useful for system
diagnostics, yet cannot be disabled, just like NMIs. Should we start
auditing all handlers which are not dealt with and add the fixup test
there too ?


So, in comparison (+ is for int3, - for stop machine):

+ Lower system-wide interrupt latency.
+ Lower NMI response-time.
+ Lower complexity in the NMI handler.
+ Handles all nested execution in one spot, without adding complexity
elsewhere. Therefore deals with MCEs and traps.
+ Easier to test that _all_ nested executions are appropriately dealt
with, because the solution is not case-specific.

- Two IPIs instead of one. I'm not even sure it's slower than the worker
thread creation currently done in stop_machine.
- Execution of a breakpoint in the case a remote CPU (or local
NMI, MCE, trap) hits the breakpoint.

Mathieu


>
> -hpa

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-11-21 16:21:42

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine

* Masami Hiramatsu ([email protected]) wrote:
> Hi Peter,
>
> H. Peter Anvin wrote:
>> On 11/18/2009 02:43 PM, Jason Baron wrote:
>>> Add text_poke_fixup() which takes a fixup address to where a processor
>>> jumps if it hits the modifying address while code modifying.
>>> text_poke_fixup() does following steps for this purpose.
>>>
>>> 1. Setup int3 handler for fixup.
>>> 2. Put a breakpoint (int3) on the first byte of modifying region,
>>> and synchronize code on all CPUs.
>>> 3. Modify other bytes of modifying region, and synchronize code on all CPUs.
>>> 4. Modify the first byte of modifying region, and synchronize code
>>> on all CPUs.
>>> 5. Clear int3 handler.
>>>
>>> Thus, if some other processor execute modifying address when step2 to step4,
>>> it will be jumped to fixup code.
>>>
>>> This still has many limitations for modifying multi-instructions at once.
>>> However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
>>> because;
>>> - Replaced instruction is just one instruction, which is executed atomically.
>>> - Replacing instruction is a jump, so we can set fixup address where the jump
>>> goes to.
>>>
>>
>> I just had a thought about this... regardless of if this is safe or not
>> (which still remains to be determined)... I have a bit more of a
>> fundamental question about it:
>>
>> This code ends up taking *two* global IPIs for each instruction
>> modification. Each of those requires whole-system synchronization.
>
> As Mathieu and I discussed, the first IPI is for synchronizing code, and
> the second is for waiting until all int3 handling is done.
>
>> How
>> is this better than taking one IPI and having the other CPUs wait until
>> the modification is complete before returning?
>
> Would you mean using stop_machine()? :-)
>
> If we don't care about NMI, we can use stop_machine() (for
> this reason, kprobe-jump-optimization can use stop_machine(),
> because kprobes can't probe NMI code), but tracepoint has
> to support NMI.
>
> Actually, it might be possible, even if it will be complicated.
> If one-byte modification (int3 injection/removal) is always
> synchronized, I assume the timechart below can work
> (and it can support NMI/SMI too).
>
> ----
> <CPU0> <CPU1>
> flag = 0
> setup int3 handler
> int3 injection[sync]
> other-bytes modifying
> smp_call_function(func) func()
> wait_until(flag==1) irq_disable()
> sync_core() for other-bytes modifying
> flag = 1
> first-byte modifying[sync] wait_until(flag==2)

Hrm, I don't like this too much. In terms of latency, we can get:

CPU 0: CPU 1
interrupts off
* wait_until(flag == 2)
interrupted
softirq runs...
(we have a drink, network bh
processing, etc etc)
back to standard execution
flag = 2

So, as you see, we increase the interrupt latency on all other CPUs by
the duration of a softirq. This is, I think, an unwanted side-effect.

We should really do performance benchmarks comparing stop_machine() and
the int3-based approach rather than to try to come up with tricky
schemes. It's not a real problem until we prove there is indeed a
performance regression. I suspect that the combined effect of cache-line
bouncing, worker thread overhead and the IPI of stop_machine is probably
comparable to the two IPIs we propose for int3.
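
Even something as crude as the following would be a useful first data
point (a hypothetical module snippet; addr, jmp_code and fixup stand for
whatever site is being exercised, and it only measures cycles on the
patching CPU, not the disturbance caused to the other CPUs):

----
        u64 t0, t1;
        int i;

        t0 = get_cycles();
        for (i = 0; i < 1000; i++)
                text_poke_fixup(addr, jmp_code, 5, fixup);
        t1 = get_cycles();
        printk(KERN_INFO "avg cycles per text_poke_fixup: %llu\n",
               (unsigned long long)((t1 - t0) / 1000));
----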

Thanks,

Mathieu


> flag = 2
> wait_until(flag==3) irq_enable()
> flag = 3
> cleanup int3 handler return
> return
> ----
>
> I'm not so sure that this flag-based step-by-step code can
> work faster than 2 IPIs :-(
>
> Any comments are welcome! :-)
>
> Thank you,
>
> --
> Masami Hiramatsu
>
> Software Engineer
> Hitachi Computer Products (America), Inc.
> Software Solutions Division
>
> e-mail: [email protected]
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-11-21 21:53:41

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine

Mathieu Desnoyers wrote:
> * Masami Hiramatsu ([email protected]) wrote:
>> > Hi Peter,
>> >
>> > H. Peter Anvin wrote:
>>> >> On 11/18/2009 02:43 PM, Jason Baron wrote:
>>>> >>> Add text_poke_fixup() which takes a fixup address to where a processor
>>>> >>> jumps if it hits the modifying address while code modifying.
>>>> >>> text_poke_fixup() does following steps for this purpose.
>>>> >>>
>>>> >>> 1. Setup int3 handler for fixup.
>>>> >>> 2. Put a breakpoint (int3) on the first byte of modifying region,
>>>> >>> and synchronize code on all CPUs.
>>>> >>> 3. Modify other bytes of modifying region, and synchronize code on all CPUs.
>>>> >>> 4. Modify the first byte of modifying region, and synchronize code
>>>> >>> on all CPUs.
>>>> >>> 5. Clear int3 handler.
>>>> >>>
>>>> >>> Thus, if some other processor execute modifying address when step2 to step4,
>>>> >>> it will be jumped to fixup code.
>>>> >>>
>>>> >>> This still has many limitations for modifying multi-instructions at once.
>>>> >>> However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
>>>> >>> because;
>>>> >>> - Replaced instruction is just one instruction, which is executed atomically.
>>>> >>> - Replacing instruction is a jump, so we can set fixup address where the jump
>>>> >>> goes to.
>>>> >>>
>>> >>
>>> >> I just had a thought about this... regardless of if this is safe or not
>>> >> (which still remains to be determined)... I have a bit more of a
>>> >> fundamental question about it:
>>> >>
>>> >> This code ends up taking *two* global IPIs for each instruction
>>> >> modification. Each of those requires whole-system synchronization.
>> >
>> > As Mathieu and I discussed, the first IPI is for synchronizing code, and
>> > the second is for waiting until all int3 handling is done.
>> >
>>> >> How
>>> >> is this better than taking one IPI and having the other CPUs wait until
>>> >> the modification is complete before returning?
>> >
>> > Would you mean using stop_machine()? :-)
>> >
>> > If we don't care about NMI, we can use stop_machine() (for
>> > this reason, kprobe-jump-optimization can use stop_machine(),
>> > because kprobes can't probe NMI code), but tracepoint has
>> > to support NMI.
>> >
>> > Actually, it might be possible, even if it will be complicated.
>> > If one-byte modification (int3 injection/removal) is always
>> > synchronized, I assume the timechart below can work
>> > (and it can support NMI/SMI too).
>> >
>> > ----
>> > <CPU0> <CPU1>
>> > flag = 0
>> > setup int3 handler
>> > int3 injection[sync]
>> > other-bytes modifying
>> > smp_call_function(func) func()
>> > wait_until(flag==1) irq_disable()
>> > sync_core() for other-bytes modifying
>> > flag = 1
>> > first-byte modifying[sync] wait_until(flag==2)
> Hrm, I don't like this too much. In terms of latency, we can get:
>
> CPU 0: CPU 1
> interrupts off
> * wait_until(flag == 2)
> interrupted
> softirq runs...
> (we have a drink, network bh
> processing, etc etc)
> back to standard execution
> flag = 2
>
> So, as you see, we increase the interrupt latency on all other CPUs by
> the duration of a softirq. This is, I think, an unwanted side-effect.
>
> We should really do performance benchmarks comparing stop_machine() and
> the int3-based approach rather than to try to come up with tricky
> schemes. It's not a real problem until we prove there is indeed a
> performance regression. I suspect that the combined effect of cache-line
> bouncing, worker thread overhead and the IPI of stop_machine is probably
> comparable to the two IPIs we propose for int3.

I assume that the total latency of XMC is almost the same on a
normal-size SMP system. However,
- stop_machine() can't support NMI/SMI.
- stop_machine() stops all other processors during XMC.

Anyway, the int3-based approach still needs to have its safety confirmed
by processor architects. So, until then, the stop_machine() approach is
also useful for some cases.

Thank you,
--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2009-11-22 01:46:49

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] jump label v3 - x86: Introduce generic jump patching without stop_machine

* Masami Hiramatsu ([email protected]) wrote:
> Mathieu Desnoyers wrote:
> > We should really do performance benchmarks comparing stop_machine() and
> > the int3-based approach rather than to try to come up with tricky
> > schemes. It's not a real problem until we prove there is indeed a
> > performance regression. I suspect that the combined effect of cache-line
> > bouncing, worker thread overhead and the IPI of stop_machine is probably
> > comparable to the two IPIs we propose for int3.
>
> I assume that the total latency of XMC is almost the same on a
> normal-size SMP system. However,
> - stop_machine() can't support NMI/SMI.
> - stop_machine() stops all other processors during XMC.

I would also add that stop_machine() increases the system interrupt
latency by an amount O(num_online_cpus()), which I'd like to avoid given
the 90- to 128-core machines heading our way pretty quickly.

>
> Anyway, the int3-based approach still needs to have its safety confirmed
> by processor architects. So, until then, the stop_machine() approach is
> also useful for some cases.

True. This makes me think: If performance happens to be a problem, we
could do batched jump instruction modification. Using a hash table to
contain the pointers would allow us to only perform a single pair of IPIs
for a whole bunch of instruction modifications.
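
Roughly like this (a hypothetical sketch; the quoted text_poke_fixup()
actually has three sync points, so "a pair" is really a constant number of
IPI rounds, but the point is they become per-batch rather than per-site.
The int3 handler would look the hit address up in the table to find its
fixup target):

----
/* Hypothetical batching sketch. */
struct jump_patch {
        struct list_head list;  /* or an hlist node in a hash table */
        u8 *addr;
        u8 code[5];
        void *fixup;            /* what the int3 handler redirects to */
};

static LIST_HEAD(pending_patches);
static const u8 int3 = 0xcc;

static void jump_patch_flush(void)
{
        struct jump_patch *jp;

        /* Pass 1: cap every queued site with an int3, one sync for all. */
        list_for_each_entry(jp, &pending_patches, list)
                text_poke(jp->addr, &int3, 1);
        sync_core_all();

        /* Pass 2: replace the tail bytes everywhere, one sync for all. */
        list_for_each_entry(jp, &pending_patches, list)
                text_poke(jp->addr + 1, jp->code + 1, 4);
        sync_core_all();

        /* Pass 3: replace the int3s with the head bytes, final sync. */
        list_for_each_entry(jp, &pending_patches, list)
                text_poke(jp->addr, jp->code, 1);
        sync_core_all();
}
----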

Mathieu

>
> Thank you,


--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68