LinuxLists.cc - [RFC PATCH 0/8] jump label v4

[permalink] [raw]

Subject: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

Add text_poke_fixup() which takes a fixup address to where a processor
jumps if it hits the modifying address while code modifying.
text_poke_fixup() does following steps for this purpose.

1. Setup int3 handler for fixup.
2. Put a breakpoint (int3) on the first byte of modifying region,
and synchronize code on all CPUs.
3. Modify other bytes of modifying region, and synchronize code on all CPUs.
4. Modify the first byte of modifying region, and synchronize code
on all CPUs.
5. Clear int3 handler.

Thus, if some other processor execute modifying address when step2 to step4,
it will be jumped to fixup code.

This still has many limitations for modifying multi-instructions at once.
However, it is enough for 'a 5 bytes nop replacing with a jump' patching,
because;
- Replaced instruction is just one instruction, which is executed atomically.
- Replacing instruction is a jump, so we can set fixup address where the jump
goes to.

Signed-off-by: Masami Hiramatsu <[email protected]>
---
arch/x86/include/asm/alternative.h | 12 ++++
arch/x86/include/asm/kprobes.h | 1 +
arch/x86/kernel/alternative.c | 120 ++++++++++++++++++++++++++++++++++++
kernel/kprobes.c | 2 +-
4 files changed, 134 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 3b5b828..9b856b7 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -166,4 +166,16 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
*/
extern void *text_poke(void *addr, const void *opcode, size_t len);

+/*
+ * Setup int3 trap and fixup execution for cross-modifying on SMP case.
+ * If the other cpus execute modifying instruction, it will hit int3
+ * and go to fixup code. This just provides a minimal safety check.
+ * Additional checks/restrictions are required for completely safe
+ * cross-modifying.
+ */
+extern void *text_poke_fixup(void *addr, const void *opcode, size_t len,
+ void *fixup);
+extern int text_patch_jump(void *addr, void *dest);
+extern void sync_core_all(void);
+
#endif /* _ASM_X86_ALTERNATIVE_H */
diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index eaec8ea..febab97 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -33,6 +33,7 @@ struct kprobe;
typedef u8 kprobe_opcode_t;
#define BREAKPOINT_INSTRUCTION 0xcc
#define RELATIVEJUMP_OPCODE 0xe9
+#define RELATIVEJUMP_SIZE 5
#define MAX_INSN_SIZE 16
#define MAX_STACK_SIZE 64
#define MIN_STACK_SIZE(ADDR) \
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 2589ea4..cf42144 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -4,6 +4,7 @@
#include <linux/list.h>
#include <linux/stringify.h>
#include <linux/kprobes.h>
+#include <linux/kdebug.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/memory.h>
@@ -554,3 +555,122 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
local_irq_restore(flags);
return addr;
}
+
+/*
+ * On pentium series, Unsynchronized cross-modifying code
+ * operations can cause unexpected instruction execution results.
+ * So after code modified, we should synchronize it on each processor.
+ */
+static void __kprobes __local_sync_core(void *info)
+{
+ sync_core();
+}
+
+void __kprobes sync_core_all(void)
+{
+ on_each_cpu(__local_sync_core, NULL, 1);
+}
+
+/* Safely cross-code modifying with fixup address */
+static void *patch_fixup_from;
+static void *patch_fixup_addr;
+
+static int __kprobes patch_exceptions_notify(struct notifier_block *self,
+ unsigned long val, void *data)
+{
+ struct die_args *args = data;
+ struct pt_regs *regs = args->regs;
+
+ if (likely(!patch_fixup_from))
+ return NOTIFY_DONE;
+
+ if (val != DIE_INT3 || !regs || user_mode_vm(regs) ||
+ (unsigned long)patch_fixup_from != regs->ip)
+ return NOTIFY_DONE;
+
+ args->regs->ip = (unsigned long)patch_fixup_addr;
+ return NOTIFY_STOP;
+}
+
+/**
+ * text_poke_fixup() -- cross-modifying kernel text with fixup address.
+ * @addr: Modifying address.
+ * @opcode: New instruction.
+ * @len: length of modifying bytes.
+ * @fixup: Fixup address.
+ *
+ * Note: You must backup replaced instructions before calling this,
+ * if you need to recover it.
+ * Note: Must be called under text_mutex.
+ */
+void *__kprobes text_poke_fixup(void *addr, const void *opcode, size_t len,
+ void *fixup)
+{
+ static const unsigned char int3_insn = BREAKPOINT_INSTRUCTION;
+ static const int int3_size = sizeof(int3_insn);
+
+ /* Replacing 1 byte can be done atomically. */
+ if (unlikely(len <= 1))
+ return text_poke(addr, opcode, len);
+
+ /* Preparing */
+ patch_fixup_addr = fixup;
+ wmb();
+ patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */
+
+ /* Cap by an int3 */
+ text_poke(addr, &int3_insn, int3_size);
+ sync_core_all();
+
+ /* Replace tail bytes */
+ text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
+ len - int3_size);
+ sync_core_all();
+
+ /* Replace int3 with head byte */
+ text_poke(addr, opcode, int3_size);
+ sync_core_all();
+
+ /* Cleanup */
+ patch_fixup_from = NULL;
+ wmb();
+ return addr;
+}
+
+/**
+ * text_patch_jump() -- cross-modifying kernel text with a relative jump
+ * @addr: Address where jump from.
+ * @dest: Address where jump to.
+ *
+ * Return 0 if succeeded to embed a jump. Otherwise, there is an error.
+ * Note: You must backup replaced instructions before calling this,
+ * if you need to recover it.
+ * Note: Must be called under text_mutex.
+ */
+int __kprobes text_patch_jump(void *addr, void *dest)
+{
+ unsigned char jmp_code[RELATIVEJUMP_SIZE];
+ s32 rel = (s32)((long)dest - ((long)addr + RELATIVEJUMP_SIZE));
+
+ if (!addr || !dest ||
+ (long)rel != ((long)dest - ((long)addr + RELATIVEJUMP_SIZE)))
+ return -EINVAL;
+
+ jmp_code[0] = RELATIVEJUMP_OPCODE;
+ *(s32 *)(&jmp_code[1]) = rel;
+
+ text_poke_fixup(addr, jmp_code, RELATIVEJUMP_SIZE, dest);
+ return 0;
+}
+
+static struct notifier_block patch_exceptions_nb = {
+ .notifier_call = patch_exceptions_notify,
+ .priority = 0x7fffffff /* we need to be notified first */
+};
+
+static int __init patch_init(void)
+{
+ return register_die_notifier(&patch_exceptions_nb);
+}
+
+arch_initcall(patch_init);
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index b7df302..9c4697f 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -898,7 +898,7 @@ EXPORT_SYMBOL_GPL(unregister_kprobes);

static struct notifier_block kprobe_exceptions_nb = {
.notifier_call = kprobe_exceptions_notify,
- .priority = 0x7fffffff /* we need to be notified first */
+ .priority = 0x7ffffff0 /* High priority, but not first. */
};

unsigned long __weak arch_deref_entry_point(void *entry)
--
1.6.5.1

2010-01-12 16:27:12

[permalink] [raw]

Subject: [RFC PATCH 3/8] jump label v4 - move opcode definitions

Move x86 opcode macros from arch/x86/include/asm/kprobes.h to
arch/x86/include/asm/alternative.h so they are useful outside of kprobes.

Signed-off-by: Jason Baron <[email protected]>
---
arch/x86/include/asm/alternative.h | 5 +++++
arch/x86/include/asm/kprobes.h | 4 ----
2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 9b856b7..7a7d457 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -166,6 +166,11 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
*/
extern void *text_poke(void *addr, const void *opcode, size_t len);

+#define BREAKPOINT_INSTRUCTION 0xcc
+#define RELATIVEJUMP_OPCODE 0xe9
+#define RELATIVEJUMP_SIZE 5
+#define MAX_INSN_SIZE 16
+
/*
* Setup int3 trap and fixup execution for cross-modifying on SMP case.
* If the other cpus execute modifying instruction, it will hit int3
diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index febab97..15b1b12 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -31,10 +31,6 @@ struct pt_regs;
struct kprobe;

typedef u8 kprobe_opcode_t;
-#define BREAKPOINT_INSTRUCTION 0xcc
-#define RELATIVEJUMP_OPCODE 0xe9
-#define RELATIVEJUMP_SIZE 5
-#define MAX_INSN_SIZE 16
#define MAX_STACK_SIZE 64
#define MIN_STACK_SIZE(ADDR) \
(((MAX_STACK_SIZE) < (((unsigned long)current_thread_info()) + \
--
1.6.5.1

2010-01-12 16:26:55

[permalink] [raw]

Subject: [RFC PATCH 1/8] jump label v4 - kprobes/x86: Cleanup RELATIVEJUMP_INSTRUCTION to RELATIVEJUMP_OPCODE

Change RELATIVEJUMP_INSTRUCTION macro to RELATIVEJUMP_OPCODE since it
represents just the opcode byte.

Signed-off-by: Masami Hiramatsu <[email protected]>
---
arch/x86/include/asm/kprobes.h | 2 +-
arch/x86/kernel/kprobes.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index 4fe681d..eaec8ea 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -32,7 +32,7 @@ struct kprobe;

typedef u8 kprobe_opcode_t;
#define BREAKPOINT_INSTRUCTION 0xcc
-#define RELATIVEJUMP_INSTRUCTION 0xe9
+#define RELATIVEJUMP_OPCODE 0xe9
#define MAX_INSN_SIZE 16
#define MAX_STACK_SIZE 64
#define MIN_STACK_SIZE(ADDR) \
diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
index 5b8c750..7039e6e 100644
--- a/arch/x86/kernel/kprobes.c
+++ b/arch/x86/kernel/kprobes.c
@@ -115,7 +115,7 @@ static void __kprobes set_jmp_op(void *from, void *to)
} __attribute__((packed)) * jop;
jop = (struct __arch_jmp_op *)from;
jop->raddr = (s32)((long)(to) - ((long)(from) + 5));
- jop->op = RELATIVEJUMP_INSTRUCTION;
+ jop->op = RELATIVEJUMP_OPCODE;
}

/*
--
1.6.5.1

2010-01-12 16:27:06

[permalink] [raw]

Subject: [RFC PATCH 4/8] jump label v4 - notifier atomic call chain notrace

notifier atomic call chain notrace

In LTTng, being able to use the atomic notifier from cpu idle entry to
ensure the tracer flush the last events in the current subbuffer
requires the rcu read-side to be marked "notrace", otherwise it can end
up calling back into lockdep and the tracer.

Also apply to the the die notifier.

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Jason Baron <[email protected]>
CC: [email protected]
---
kernel/notifier.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/notifier.c b/kernel/notifier.c
index acd24e7..5bde20a 100644
--- a/kernel/notifier.c
+++ b/kernel/notifier.c
@@ -148,7 +148,7 @@ int atomic_notifier_chain_unregister(struct atomic_notifier_head *nh,
spin_lock_irqsave(&nh->lock, flags);
ret = notifier_chain_unregister(&nh->head, n);
spin_unlock_irqrestore(&nh->lock, flags);
- synchronize_rcu();
+ synchronize_sched();
return ret;
}
EXPORT_SYMBOL_GPL(atomic_notifier_chain_unregister);
@@ -178,9 +178,9 @@ int __kprobes __atomic_notifier_call_chain(struct atomic_notifier_head *nh,
{
int ret;

- rcu_read_lock();
+ rcu_read_lock_sched_notrace();
ret = notifier_call_chain(&nh->head, val, v, nr_to_call, nr_calls);
- rcu_read_unlock();
+ rcu_read_unlock_sched_notrace();
return ret;
}
EXPORT_SYMBOL_GPL(__atomic_notifier_call_chain);
--
1.6.5.1

2010-01-12 16:27:31

[permalink] [raw]

Subject: [RFC PATCH 8/8] jump label v4 - add module support

Add support for 'jump label' for modules.

Signed-off-by: Jason Baron <[email protected]>
---
arch/x86/kernel/ptrace.c | 1 +
include/linux/jump_label.h | 3 +-
include/linux/module.h | 5 +-
kernel/jump_label.c | 136 ++++++++++++++++++++++++++++++++++++++++++++
kernel/module.c | 7 ++
5 files changed, 150 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 017d937..94f5355 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -24,6 +24,7 @@
#include <linux/workqueue.h>
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>
+#include <linux/module.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 3d42e8c..fa81c0a 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -21,7 +21,8 @@ extern struct jump_entry __stop___jump_table[];

#define DEFINE_JUMP_LABEL(name) \
const char __jlstrtab_##name[] \
- __used __attribute__((section("__jump_strings"))) = #name;
+ __used __attribute__((section("__jump_strings"))) = #name; \
+ EXPORT_SYMBOL_GPL(__jlstrtab_##name);

extern void arch_jump_label_transform(struct jump_entry *entry,
enum jump_label_type type);
diff --git a/include/linux/module.h b/include/linux/module.h
index 6cb1a3c..df499cf 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -339,7 +339,10 @@ struct module
struct tracepoint *tracepoints;
unsigned int num_tracepoints;
#endif
-
+#ifdef __HAVE_ARCH_JUMP_LABEL
+ struct jump_entry *jump_entries;
+ unsigned int num_jump_entries;
+#endif
#ifdef CONFIG_TRACING
const char **trace_bprintk_fmt_start;
unsigned int num_trace_bprintk_fmt;
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 671fcbb..62b7467 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -30,6 +30,13 @@ struct jump_label_entry {
char name[0];
};

+struct jump_label_module_entry {
+ struct hlist_node hlist;
+ struct jump_entry *table;
+ int nr_entries;
+ struct module *mod;
+};
+
static void swap_jump_label_entries(struct jump_entry *previous, struct jump_entry *next)
{
struct jump_entry tmp;
@@ -145,6 +152,17 @@ int jump_label_update(const char *name, enum jump_label_type type)
arch_jump_label_transform(iter, type);
iter++;
}
+ /* eanble/disable jump labels in modules */
+ hlist_for_each_entry(e_module, module_node, &(entry->modules),
+ hlist) {
+ count = e_module->nr_entries;
+ iter = e_module->table;
+ while (count--) {
+ if (kernel_text_address(iter->code))
+ arch_jump_label_transform(iter, type);
+ iter++;
+ }
+ }
}
mutex_unlock(&jump_label_mutex);
return 0;
@@ -162,4 +180,122 @@ static int init_jump_label(void)
}
early_initcall(init_jump_label);

+#ifdef CONFIG_MODULES
+
+static struct jump_label_module_entry *add_jump_label_module_entry(struct jump_label_entry *entry, struct jump_entry *iter_begin, int count, struct module *mod)
+{
+ struct jump_label_module_entry *e;
+
+ e = kmalloc(sizeof(struct jump_label_module_entry), GFP_KERNEL);
+ if (!e)
+ return ERR_PTR(-ENOMEM);
+ e->mod = mod;
+ e->nr_entries = count;
+ e->table = iter_begin;
+ hlist_add_head(&e->hlist, &entry->modules);
+ return e;
+}
+
+static int add_jump_label_module(struct module *mod)
+{
+ struct jump_entry *iter, *iter_begin;
+ struct jump_label_entry *entry;
+ struct jump_label_module_entry *module_entry;
+ int count;
+
+ /* if the module doesn't have jump label entries, just return */
+ if (!mod->num_jump_entries)
+ return 0;
+
+ sort_jump_label_entries(mod->jump_entries,
+ mod->jump_entries + mod->num_jump_entries);
+ iter = mod->jump_entries;
+ while (iter < mod->jump_entries + mod->num_jump_entries) {
+ entry = get_jump_label_entry(iter->name);
+ iter_begin = iter;
+ count = 0;
+ while ((iter < mod->jump_entries + mod->num_jump_entries) &&
+ (strcmp(iter->name, iter_begin->name) == 0)) {
+ iter++;
+ count++;
+ }
+ if (!entry) {
+ entry = add_jump_label_entry(iter_begin->name, 0, NULL);
+ if (IS_ERR(entry))
+ return PTR_ERR(entry);
+ }
+ module_entry = add_jump_label_module_entry(entry, iter_begin,
+ count, mod);
+ if (IS_ERR(module_entry))
+ return PTR_ERR(module_entry);
+ }
+ return 0;
+}
+
+static void remove_jump_label_module(struct module *mod)
+{
+ struct hlist_head *head;
+ struct hlist_node *node, *node_next, *module_node, *module_node_next;
+ struct jump_label_entry *e;
+ struct jump_label_module_entry *e_module;
+ int i;
+
+ /* if the module doesn't have jump label entries, just return */
+ if (!mod->num_jump_entries)
+ return;
+
+ for (i = 0; i < JUMP_LABEL_TABLE_SIZE; i++) {
+ head = &jump_label_table[i];
+ hlist_for_each_entry_safe(e, node, node_next, head, hlist) {
+ hlist_for_each_entry_safe(e_module, module_node,
+ module_node_next,
+ &(e->modules), hlist) {
+ if (e_module->mod == mod) {
+ hlist_del(&e_module->hlist);
+ kfree(e_module);
+ }
+ }
+ if (hlist_empty(&e->modules) && (e->nr_entries == 0)) {
+ hlist_del(&e->hlist);
+ kfree(e);
+ }
+ }
+ }
+}
+
+static int jump_label_module_notify(struct notifier_block *self, unsigned long val, void *data)
+{
+ struct module *mod = data;
+ int ret = 0;
+
+ switch (val) {
+ case MODULE_STATE_COMING:
+ mutex_lock(&jump_label_mutex);
+ ret = add_jump_label_module(mod);
+ if (ret)
+ remove_jump_label_module(mod);
+ mutex_unlock(&jump_label_mutex);
+ break;
+ case MODULE_STATE_GOING:
+ mutex_lock(&jump_label_mutex);
+ remove_jump_label_module(mod);
+ mutex_unlock(&jump_label_mutex);
+ break;
+ }
+ return ret;
+}
+
+struct notifier_block jump_label_module_nb = {
+ .notifier_call = jump_label_module_notify,
+ .priority = 0,
+};
+
+static int init_jump_label_module(void)
+{
+ return register_module_notifier(&jump_label_module_nb);
+}
+__initcall(init_jump_label_module);
+
+#endif /* CONFIG_MODULES */
+
#endif
diff --git a/kernel/module.c b/kernel/module.c
index f82386b..4324bb1 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -55,6 +55,7 @@
#include <linux/async.h>
#include <linux/percpu.h>
#include <linux/kmemleak.h>
+#include <linux/jump_label.h>

#define CREATE_TRACE_POINTS
#include <trace/events/module.h>
@@ -2245,6 +2246,12 @@ static noinline struct module *load_module(void __user *umod,
sizeof(*mod->tracepoints),
&mod->num_tracepoints);
#endif
+#ifdef __HAVE_ARCH_JUMP_LABEL
+ mod->jump_entries = section_objs(hdr, sechdrs, secstrings,
+ "__jump_table",
+ sizeof(*mod->jump_entries),
+ &mod->num_jump_entries);
+#endif
#ifdef CONFIG_EVENT_TRACING
mod->trace_events = section_objs(hdr, sechdrs, secstrings,
"_ftrace_events",
--
1.6.5.1

2010-01-12 16:27:33

[permalink] [raw]

Subject: [RFC PATCH 7/8] jump label v4 - tracepoint support

Make use of the jump label infrastructure for tracepoints.

Signed-off-by: Jason Baron <[email protected]>
---
include/linux/tracepoint.h | 34 +++++++++++++++++++---------------
kernel/tracepoint.c | 8 ++++++++
2 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index f59604e..c18b9c0 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -16,6 +16,7 @@

#include <linux/types.h>
#include <linux/rcupdate.h>
+#include <linux/jump_label.h>

struct module;
struct tracepoint;
@@ -63,20 +64,22 @@ struct tracepoint {
* not add unwanted padding between the beginning of the section and the
* structure. Force alignment to the same alignment as the section start.
*/
-#define DECLARE_TRACE(name, proto, args) \
- extern struct tracepoint __tracepoint_##name; \
- static inline void trace_##name(proto) \
- { \
- if (unlikely(__tracepoint_##name.state)) \
- __DO_TRACE(&__tracepoint_##name, \
- TP_PROTO(proto), TP_ARGS(args)); \
- } \
- static inline int register_trace_##name(void (*probe)(proto)) \
- { \
- return tracepoint_probe_register(#name, (void *)probe); \
- } \
- static inline int unregister_trace_##name(void (*probe)(proto)) \
- { \
+#define DECLARE_TRACE(name, proto, args) \
+ extern struct tracepoint __tracepoint_##name; \
+ static inline void trace_##name(proto) \
+ { \
+ JUMP_LABEL(name, do_trace, __tracepoint_##name.state); \
+ return; \
+do_trace: \
+ __DO_TRACE(&__tracepoint_##name, \
+ TP_PROTO(proto), TP_ARGS(args)); \
+ } \
+ static inline int register_trace_##name(void (*probe)(proto)) \
+ { \
+ return tracepoint_probe_register(#name, (void *)probe); \
+ } \
+ static inline int unregister_trace_##name(void (*probe)(proto)) \
+ { \
return tracepoint_probe_unregister(#name, (void *)probe);\
}

@@ -86,7 +89,8 @@ struct tracepoint {
__attribute__((section("__tracepoints_strings"))) = #name; \
struct tracepoint __tracepoint_##name \
__attribute__((section("__tracepoints"), aligned(32))) = \
- { __tpstrtab_##name, 0, reg, unreg, NULL }
+ { __tpstrtab_##name, 0, reg, unreg, NULL }; \
+ DEFINE_JUMP_LABEL(name);

#define DEFINE_TRACE(name) \
DEFINE_TRACE_FN(name, NULL, NULL);
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index cc89be5..8acced8 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -25,6 +25,7 @@
#include <linux/err.h>
#include <linux/slab.h>
#include <linux/sched.h>
+#include <linux/jump_label.h>

extern struct tracepoint __start___tracepoints[];
extern struct tracepoint __stop___tracepoints[];
@@ -256,6 +257,10 @@ static void set_tracepoint(struct tracepoint_entry **entry,
* is used.
*/
rcu_assign_pointer(elem->funcs, (*entry)->funcs);
+ if (!elem->state && active)
+ enable_jump_label(elem->name);
+ if (elem->state && !active)
+ disable_jump_label(elem->name);
elem->state = active;
}

@@ -270,6 +275,9 @@ static void disable_tracepoint(struct tracepoint *elem)
if (elem->unregfunc && elem->state)
elem->unregfunc();

+ if (elem->state)
+ disable_jump_label(elem->name);
+
elem->state = 0;
rcu_assign_pointer(elem->funcs, NULL);
}
--
1.6.5.1

2010-01-12 16:28:54

[permalink] [raw]

Subject: [RFC PATCH 5/8] jump label v4 - base patch

base patch to implement 'jump labeling'. Based on a new 'asm goto' inline
assembly gcc mechanism, we can now branch to labels from an 'asm goto'
statment. This allows us to create a 'no-op' fastpath, which can subsequently
be patched with a jump to the slowpath code. This is useful for code which
might be rarely used, but which we'd like to be able to call, if needed.
Tracepoints are the current usecase that these are being implemented for.

the patch is dependent on CONFIG_KPROBES. This is simply due to a
reliance on the int3 trap handler code that is only enabled with kprobes. This,
could be modified.

Signed-off-by: Jason Baron <[email protected]>
---
include/asm-generic/vmlinux.lds.h | 10 ++-
include/linux/jump_label.h | 57 +++++++++++++
kernel/Makefile | 2 +-
kernel/jump_label.c | 165 +++++++++++++++++++++++++++++++++++++
4 files changed, 232 insertions(+), 2 deletions(-)
create mode 100644 include/linux/jump_label.h
create mode 100644 kernel/jump_label.c

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 67e6520..83a469d 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -167,7 +167,8 @@
BRANCH_PROFILE() \
TRACE_PRINTKS() \
FTRACE_EVENTS() \
- TRACE_SYSCALLS()
+ TRACE_SYSCALLS() \
+ JUMP_TABLE() \

/*
* Data section helpers
@@ -206,6 +207,7 @@
*(__vermagic) /* Kernel version magic */ \
*(__markers_strings) /* Markers: strings */ \
*(__tracepoints_strings)/* Tracepoints: strings */ \
+ *(__jump_strings)/* Jump: strings */ \
} \
\
.rodata1 : AT(ADDR(.rodata1) - LOAD_OFFSET) { \
@@ -557,6 +559,12 @@
#define BUG_TABLE
#endif

+#define JUMP_TABLE() \
+ . = ALIGN(64); \
+ VMLINUX_SYMBOL(__start___jump_table) = .; \
+ *(__jump_table) \
+ VMLINUX_SYMBOL(__stop___jump_table) = .; \
+
#ifdef CONFIG_PM_TRACE
#define TRACEDATA \
. = ALIGN(4); \
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
new file mode 100644
index 0000000..3d42e8c
--- /dev/null
+++ b/include/linux/jump_label.h
@@ -0,0 +1,57 @@
+#ifndef _LINUX_JUMP_LABEL_H
+#define _LINUX_JUMP_LABEL_H
+
+#include <asm/jump_label.h>
+
+struct jump_entry {
+ unsigned long code;
+ unsigned long target;
+ char *name;
+};
+
+enum jump_label_type {
+ JUMP_LABEL_ENABLE,
+ JUMP_LABEL_DISABLE
+};
+
+#ifdef __HAVE_ARCH_JUMP_LABEL
+
+extern struct jump_entry __start___jump_table[];
+extern struct jump_entry __stop___jump_table[];
+
+#define DEFINE_JUMP_LABEL(name) \
+ const char __jlstrtab_##name[] \
+ __used __attribute__((section("__jump_strings"))) = #name;
+
+extern void arch_jump_label_transform(struct jump_entry *entry,
+ enum jump_label_type type);
+
+extern int jump_label_update(const char *name, enum jump_label_type type);
+
+#define enable_jump_label(name) \
+ jump_label_update(name, JUMP_LABEL_ENABLE);
+
+#define disable_jump_label(name) \
+ jump_label_update(name, JUMP_LABEL_DISABLE);
+
+#else
+
+#define DEFINE_JUMP_LABEL(name)
+
+#define JUMP_LABEL(tag, label, cond) \
+ if (unlikely(cond)) \
+ goto label;
+
+static inline int enable_jump_label(const char *name)
+{
+ return 0;
+}
+
+static inline int disable_jump_label(const char *name)
+{
+ return 0;
+}
+
+#endif
+
+#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index 864ff75..4001389 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -10,7 +10,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o \
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
- async.o
+ async.o jump_label.o
obj-y += groups.o

ifdef CONFIG_FUNCTION_TRACER
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
new file mode 100644
index 0000000..671fcbb
--- /dev/null
+++ b/kernel/jump_label.c
@@ -0,0 +1,165 @@
+/*
+ * jump label support
+ *
+ * Copyright (C) 2009 Jason Baron <[email protected]>
+ *
+ */
+#include <linux/jump_label.h>
+#include <linux/memory.h>
+#include <linux/uaccess.h>
+#include <linux/module.h>
+#include <asm/alternative.h>
+#include <linux/list.h>
+#include <linux/jhash.h>
+
+#ifdef __HAVE_ARCH_JUMP_LABEL
+
+#define JUMP_LABEL_HASH_BITS 6
+#define JUMP_LABEL_TABLE_SIZE (1 << JUMP_LABEL_HASH_BITS)
+static struct hlist_head jump_label_table[JUMP_LABEL_TABLE_SIZE];
+
+/* mutex to protect coming/going of the the jump_label table */
+static DEFINE_MUTEX(jump_label_mutex);
+
+struct jump_label_entry {
+ struct hlist_node hlist;
+ struct jump_entry *table;
+ int nr_entries;
+ /* hang modules off here */
+ struct hlist_head modules;
+ char name[0];
+};
+
+static void swap_jump_label_entries(struct jump_entry *previous, struct jump_entry *next)
+{
+ struct jump_entry tmp;
+
+ tmp = *next;
+ *next = *previous;
+ *previous = tmp;
+}
+
+static void sort_jump_label_entries(struct jump_entry *start, struct jump_entry *stop)
+{
+ int swapped = 0;
+ struct jump_entry *iter;
+ struct jump_entry *iter_next;
+
+ do {
+ swapped = 0;
+ iter = start;
+ iter_next = start;
+ iter_next++;
+ for (; iter_next < stop; iter++, iter_next++) {
+ if (strcmp(iter->name, iter_next->name) > 0) {
+ swap_jump_label_entries(iter, iter_next);
+ swapped = 1;
+ }
+ }
+ } while (swapped == 1);
+}
+
+static struct jump_label_entry *get_jump_label_entry(const char *name)
+{
+ struct hlist_head *head;
+ struct hlist_node *node;
+ struct jump_label_entry *e;
+ u32 hash = jhash(name, strlen(name), 0);
+
+ head = &jump_label_table[hash & (JUMP_LABEL_TABLE_SIZE - 1)];
+ hlist_for_each_entry(e, node, head, hlist) {
+ if (!strcmp(name, e->name))
+ return e;
+ }
+ return NULL;
+}
+
+static struct jump_label_entry *add_jump_label_entry(const char *name, int nr_entries, struct jump_entry *table)
+{
+ struct hlist_head *head;
+ struct hlist_node *node;
+ struct jump_label_entry *e;
+ size_t name_len = strlen(name) + 1;
+ u32 hash = jhash(name, name_len-1, 0);
+
+ head = &jump_label_table[hash & (JUMP_LABEL_TABLE_SIZE - 1)];
+ hlist_for_each_entry(e, node, head, hlist) {
+ if (!strcmp(name, e->name))
+ return ERR_PTR(-EEXIST);
+ }
+ e = kmalloc(sizeof(struct jump_label_entry) + name_len, GFP_KERNEL);
+ if (!e)
+ return ERR_PTR(-ENOMEM);
+ memcpy(&e->name[0], name, name_len);
+ e->table = table;
+ e->nr_entries = nr_entries;
+ INIT_HLIST_HEAD(&(e->modules));
+ hlist_add_head(&e->hlist, head);
+ return e;
+}
+
+static int build_jump_label_hashtable(struct jump_entry *start, struct jump_entry *stop)
+{
+ struct jump_entry *iter, *iter_begin;
+ struct jump_label_entry *entry;
+ int count;
+
+ sort_jump_label_entries(start, stop);
+ iter = start;
+ while (iter < stop) {
+ entry = get_jump_label_entry(iter->name);
+ if (!entry) {
+ iter_begin = iter;
+ count = 0;
+ while ((iter < stop) &&
+ (strcmp(iter->name, iter_begin->name) == 0)) {
+ iter++;
+ count++;
+ }
+ entry = add_jump_label_entry(iter_begin->name, count,
+ iter_begin);
+ if (IS_ERR(entry))
+ return PTR_ERR(entry);
+ continue;
+ }
+ WARN(1, KERN_ERR "build_jump_hashtable: unexpected entry!\n");
+ }
+ return 0;
+}
+
+int jump_label_update(const char *name, enum jump_label_type type)
+{
+ struct jump_entry *iter;
+ struct jump_label_entry *entry;
+ struct hlist_node *module_node;
+ struct jump_label_module_entry *e_module;
+ int count;
+
+ mutex_lock(&jump_label_mutex);
+ entry = get_jump_label_entry(name);
+ if (entry) {
+ count = entry->nr_entries;
+ iter = entry->table;
+ while (count--) {
+ if (kernel_text_address(iter->code))
+ arch_jump_label_transform(iter, type);
+ iter++;
+ }
+ }
+ mutex_unlock(&jump_label_mutex);
+ return 0;
+}
+
+static int init_jump_label(void)
+{
+ int ret;
+
+ mutex_lock(&jump_label_mutex);
+ ret = build_jump_label_hashtable(__start___jump_table,
+ __stop___jump_table);
+ mutex_unlock(&jump_label_mutex);
+ return ret;
+}
+early_initcall(init_jump_label);
+
+#endif
--
1.6.5.1

2010-01-12 16:28:26

[permalink] [raw]

Subject: [RFC PATCH 6/8] jump label v4 - x86 support

add x86 support for jump label. I'm keeping this patch separate so its clear to arch maintainers what was required for x86 support this new feature. hopefully, it wouldn't be too painful for other arches.

Signed-off-by: Jason Baron <[email protected]>
---
arch/x86/include/asm/jump_label.h | 32 ++++++++++++++++++++++
arch/x86/kernel/Makefile | 2 +-
arch/x86/kernel/jump_label.c | 52 +++++++++++++++++++++++++++++++++++++
3 files changed, 85 insertions(+), 1 deletions(-)
create mode 100644 arch/x86/include/asm/jump_label.h
create mode 100644 arch/x86/kernel/jump_label.c

diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
new file mode 100644
index 0000000..ac09c1c
--- /dev/null
+++ b/arch/x86/include/asm/jump_label.h
@@ -0,0 +1,32 @@
+#ifndef _ASM_X86_JUMP_LABEL_H
+#define _ASM_X86_JUMP_LABEL_H
+
+#include <asm/nops.h>
+
+#if (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 4)) && \
+ defined(CONFIG_KPROBES)
+# define __HAVE_ARCH_JUMP_LABEL
+#endif
+
+#ifdef __HAVE_ARCH_JUMP_LABEL
+
+# ifdef CONFIG_X86_64
+# define JUMP_LABEL_NOP P6_NOP5
+# else
+# define JUMP_LABEL_NOP ".byte 0xe9 \n\t .long 0\n\t"
+# endif
+
+# define JUMP_LABEL(tag, label, cond) \
+ do { \
+ extern const char __jlstrtab_##tag[]; \
+ asm goto("1:" \
+ JUMP_LABEL_NOP \
+ ".pushsection __jump_table, \"a\" \n\t"\
+ _ASM_PTR "1b, %l[" #label "], %c0 \n\t" \
+ ".popsection \n\t" \
+ : : "i" (__jlstrtab_##tag) : : label);\
+ } while (0)
+
+# endif
+
+#endif
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index d87f09b..6c6fc98 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -32,7 +32,7 @@ GCOV_PROFILE_paravirt.o := n
obj-y := process_$(BITS).o signal.o entry_$(BITS).o
obj-y += traps.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
obj-y += time.o ioport.o ldt.o dumpstack.o
-obj-y += setup.o x86_init.o i8259.o irqinit.o
+obj-y += setup.o x86_init.o i8259.o irqinit.o jump_label.o
obj-$(CONFIG_X86_VISWS) += visws_quirks.o
obj-$(CONFIG_X86_32) += probe_roms_32.o
obj-$(CONFIG_X86_32) += sys_i386_32.o i386_ksyms_32.o
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
new file mode 100644
index 0000000..6cb5bea
--- /dev/null
+++ b/arch/x86/kernel/jump_label.c
@@ -0,0 +1,52 @@
+/*
+ * jump label x86 support
+ *
+ * Copyright (C) 2009 Jason Baron <[email protected]>
+ *
+ */
+#include <linux/jump_label.h>
+#include <linux/memory.h>
+#include <linux/uaccess.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/jhash.h>
+#include <asm/alternative.h>
+
+#ifdef __HAVE_ARCH_JUMP_LABEL
+
+union jump_code_union {
+ char code[RELATIVEJUMP_SIZE];
+ struct {
+ char jump;
+ int offset;
+ } __attribute__((packed));
+};
+
+void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type)
+{
+ union jump_code_union code;
+
+ if (type == JUMP_LABEL_ENABLE) {
+ code.jump = 0xe9;
+ code.offset = entry->target - (entry->code + RELATIVEJUMP_SIZE);
+ } else {
+#ifdef CONFIG_X86_64
+ /* opcode for P6_NOP5 */
+ code.code[0] = 0x0f;
+ code.code[1] = 0x1f;
+ code.code[2] = 0x44;
+ code.code[3] = 0x00;
+ code.code[4] = 0x00;
+#else
+ code.jump = 0xe9;
+ code.offset = 0;
+#endif
+ }
+
+ mutex_lock(&text_mutex);
+ text_poke_fixup((void *)entry->code, &code, RELATIVEJUMP_SIZE,
+ (void *)entry->code + RELATIVEJUMP_SIZE);
+ mutex_unlock(&text_mutex);
+}
+
+#endif
--
1.6.5.1

2010-01-12 23:17:01

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

On 01/12/2010 08:26 AM, Jason Baron wrote:
> Add text_poke_fixup() which takes a fixup address to where a processor
> jumps if it hits the modifying address while code modifying.
> text_poke_fixup() does following steps for this purpose.
>
> 1. Setup int3 handler for fixup.
> 2. Put a breakpoint (int3) on the first byte of modifying region,
> and synchronize code on all CPUs.
> 3. Modify other bytes of modifying region, and synchronize code on all CPUs.
> 4. Modify the first byte of modifying region, and synchronize code
> on all CPUs.
> 5. Clear int3 handler.
>

We (Intel OTC) have been able to get an *unofficial* answer as to the
validity of this procedure; specifically as it applies to Intel hardware
(obviously). We are working on getting an officially approved answer,
but as far as we currently know, the procedure as outlined above should
work on all Intel hardware. In fact, we believe the synchronization in
step 3 is in fact unnecessary (as the synchronization in step 4 provides
sufficient guard.)

In fact, if a suitable int3 handler is left permanently in place then
step 5 is unnecessary as well. This would slow down other uses of int3
slightly, but might be a worthwhile tradeoff.

Such a permanent int3 handler would need to keep track of two
potentially-spurious breakpoints: the current and the previous. The
reason for needing two is that one could get a #BP from either the
current or the previous modification site between the insertion of int3
and the synchronization in step 2. This, of course, assumes that the
actual code poking is forcibly single-threaded (running under a spinlock
or other mutex) -- if modifications are allowed to run in parallel you
need to consider all possible current or stale #BP sites.

-hpa

2010-01-13 02:06:13

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

* H. Peter Anvin ([email protected]) wrote:
> On 01/12/2010 08:26 AM, Jason Baron wrote:
> > Add text_poke_fixup() which takes a fixup address to where a processor
> > jumps if it hits the modifying address while code modifying.
> > text_poke_fixup() does following steps for this purpose.
> >
> > 1. Setup int3 handler for fixup.
> > 2. Put a breakpoint (int3) on the first byte of modifying region,
> > and synchronize code on all CPUs.
> > 3. Modify other bytes of modifying region, and synchronize code on all CPUs.
> > 4. Modify the first byte of modifying region, and synchronize code
> > on all CPUs.
> > 5. Clear int3 handler.
> >
>
> We (Intel OTC) have been able to get an *unofficial* answer as to the
> validity of this procedure; specifically as it applies to Intel hardware
> (obviously). We are working on getting an officially approved answer,
> but as far as we currently know, the procedure as outlined above should
> work on all Intel hardware. In fact, we believe the synchronization in
> step 3 is in fact unnecessary (as the synchronization in step 4 provides
> sufficient guard.)

Hi Peter,

This is great news! Thanks to Intel OTC and yourself for looking into
this. In the immediate values patches, I am doing the synchronization at
the end of step (3) to ensure that all remote CPUs issue read memory
barriers, so the stores to the instruction are done in this order:

spin lock
store int3 to 1st byte
smp_wmb()
sync all cores
store new instruction in all but 1st byte
smp_wmb()
issue smp_rmb() on all cores (a sync all cores has this effect)
store new instruction to 1st byte
send IPI to all cores (or call synchronize_sched()) to wait for all
breakpoint handlers to complete.
spin unlock

So the question is: are these wmb/rmb pairs actually needed ? As the
instruction fetch is not performed by instructions per se, I doubt a
rmb() will have any effect on them. I always prefer to stay on the safe
side, but it wouldn't hurt to know.

>
> In fact, if a suitable int3 handler is left permanently in place then
> step 5 is unnecessary as well. This would slow down other uses of int3
> slightly, but might be a worthwhile tradeoff.
>
> Such a permanent int3 handler would need to keep track of two
> potentially-spurious breakpoints: the current and the previous. The
> reason for needing two is that one could get a #BP from either the
> current or the previous modification site between the insertion of int3
> and the synchronization in step 2. This, of course, assumes that the
> actual code poking is forcibly single-threaded (running under a spinlock
> or other mutex) -- if modifications are allowed to run in parallel you
> need to consider all possible current or stale #BP sites.

Hrm. Assuming we have a spinlock protecting all this, given that we
synchronize all cores at step (4) _after_ removing the breakpoint, and
given that the breakpoint handler is an interrupt gate (thus executes
with interrupts off), I am inclined to think that sending the IPIs at
the end of step (4) (and waiting for them to complete) should be enough
to ensure that all in-flight breakpoint handlers for this site have
completed their execution. This would mean that we only have to keep
track of a single site at a time. Or am I missing something ?

Thanks,

Mathieu

>
> -hpa

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-13 04:56:48

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

On 01/12/2010 06:06 PM, Mathieu Desnoyers wrote:
> * H. Peter Anvin ([email protected]) wrote:
>> On 01/12/2010 08:26 AM, Jason Baron wrote:
>>> Add text_poke_fixup() which takes a fixup address to where a processor
>>> jumps if it hits the modifying address while code modifying.
>>> text_poke_fixup() does following steps for this purpose.
>>>
>>> 1. Setup int3 handler for fixup.
>>> 2. Put a breakpoint (int3) on the first byte of modifying region,
>>> and synchronize code on all CPUs.
>>> 3. Modify other bytes of modifying region, and synchronize code on all CPUs.
>>> 4. Modify the first byte of modifying region, and synchronize code
>>> on all CPUs.
>>> 5. Clear int3 handler.
>>>
>>
>> We (Intel OTC) have been able to get an *unofficial* answer as to the
>> validity of this procedure; specifically as it applies to Intel hardware
>> (obviously). We are working on getting an officially approved answer,
>> but as far as we currently know, the procedure as outlined above should
>> work on all Intel hardware. In fact, we believe the synchronization in
>> step 3 is in fact unnecessary (as the synchronization in step 4 provides
>> sufficient guard.)
>
> Hi Peter,
>
> This is great news! Thanks to Intel OTC and yourself for looking into
> this. In the immediate values patches, I am doing the synchronization at
> the end of step (3) to ensure that all remote CPUs issue read memory
> barriers, so the stores to the instruction are done in this order:
>
> spin lock
> store int3 to 1st byte
> smp_wmb()
> sync all cores
> store new instruction in all but 1st byte
> smp_wmb()
> issue smp_rmb() on all cores (a sync all cores has this effect)
> store new instruction to 1st byte
> send IPI to all cores (or call synchronize_sched()) to wait for all
> breakpoint handlers to complete.
> spin unlock
>
> So the question is: are these wmb/rmb pairs actually needed ? As the
> instruction fetch is not performed by instructions per se, I doubt a
> rmb() will have any effect on them. I always prefer to stay on the safe
> side, but it wouldn't hurt to know.
>

I don't think the smp_rmb() has any function.

However, you're being quite inconsistent in your terminology here. The
assumption above is that the "synchronize code on all CPU" step is
sending an IPI to all cores and waiting for it to return, so that each
core has executed IPI/IRET before continuation.

It is *not* necessary to wait for the breakpoint handlers to return, as
long as they will get to IRET eventually, since IRET is a jump and a
serializing instruction.

> Hrm. Assuming we have a spinlock protecting all this, given that we
> synchronize all cores at step (4) _after_ removing the breakpoint, and
> given that the breakpoint handler is an interrupt gate (thus executes
> with interrupts off), I am inclined to think that sending the IPIs at
> the end of step (4) (and waiting for them to complete) should be enough
> to ensure that all in-flight breakpoint handlers for this site have
> completed their execution. This would mean that we only have to keep
> track of a single site at a time. Or am I missing something ?

Yes: the whole point was that you can omit the synchronization in step 4
if you leave the breakpoint handler in place (I said "omit step 5", but
that wasn't really what I meant.)

That means that at the cost of two compares in the standard #BP handler,
we can get away with only one IPI per atomic instruction poke.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-01-13 05:39:42

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

H. Peter Anvin wrote:
> On 01/12/2010 08:26 AM, Jason Baron wrote:
>> Add text_poke_fixup() which takes a fixup address to where a processor
>> jumps if it hits the modifying address while code modifying.
>> text_poke_fixup() does following steps for this purpose.
>>
>> 1. Setup int3 handler for fixup.
>> 2. Put a breakpoint (int3) on the first byte of modifying region,
>> and synchronize code on all CPUs.
>> 3. Modify other bytes of modifying region, and synchronize code on all CPUs.
>> 4. Modify the first byte of modifying region, and synchronize code
>> on all CPUs.
>> 5. Clear int3 handler.
>>
>
> We (Intel OTC) have been able to get an *unofficial* answer as to the
> validity of this procedure; specifically as it applies to Intel hardware
> (obviously). We are working on getting an officially approved answer,
> but as far as we currently know, the procedure as outlined above should
> work on all Intel hardware. In fact, we believe the synchronization in
> step 3 is in fact unnecessary (as the synchronization in step 4 provides
> sufficient guard.)

Good news! Thank you very much, Peter!

And actually, this patch is a bit older than I previously posted on LKML.

http://lkml.org/lkml/2009/12/18/312

Oops, I've forgotten update comment on patch... anyway, patch implementation
itself is updated and removed second sync_core_all.
I'll post it again with updated comment.

> In fact, if a suitable int3 handler is left permanently in place then
> step 5 is unnecessary as well. This would slow down other uses of int3
> slightly, but might be a worthwhile tradeoff.

OK.

> Such a permanent int3 handler would need to keep track of two
> potentially-spurious breakpoints: the current and the previous. The
> reason for needing two is that one could get a #BP from either the
> current or the previous modification site between the insertion of int3
> and the synchronization in step 2. This, of course, assumes that the
> actual code poking is forcibly single-threaded (running under a spinlock
> or other mutex) -- if modifications are allowed to run in parallel you
> need to consider all possible current or stale #BP sites.

Sure, and since we are using fixmap for poking, we need to do this
under locking text_mutex.

Thank you!

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2010-01-13 14:30:55

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

* H. Peter Anvin ([email protected]) wrote:
> On 01/12/2010 06:06 PM, Mathieu Desnoyers wrote:
> > * H. Peter Anvin ([email protected]) wrote:
> >> On 01/12/2010 08:26 AM, Jason Baron wrote:
> >>> Add text_poke_fixup() which takes a fixup address to where a processor
> >>> jumps if it hits the modifying address while code modifying.
> >>> text_poke_fixup() does following steps for this purpose.
> >>>
> >>> 1. Setup int3 handler for fixup.
> >>> 2. Put a breakpoint (int3) on the first byte of modifying region,
> >>> and synchronize code on all CPUs.
> >>> 3. Modify other bytes of modifying region, and synchronize code on all CPUs.
> >>> 4. Modify the first byte of modifying region, and synchronize code
> >>> on all CPUs.
> >>> 5. Clear int3 handler.
> >>>
> >>
> >> We (Intel OTC) have been able to get an *unofficial* answer as to the
> >> validity of this procedure; specifically as it applies to Intel hardware
> >> (obviously). We are working on getting an officially approved answer,
> >> but as far as we currently know, the procedure as outlined above should
> >> work on all Intel hardware. In fact, we believe the synchronization in
> >> step 3 is in fact unnecessary (as the synchronization in step 4 provides
> >> sufficient guard.)
> >
> > Hi Peter,
> >
> > This is great news! Thanks to Intel OTC and yourself for looking into
> > this. In the immediate values patches, I am doing the synchronization at
> > the end of step (3) to ensure that all remote CPUs issue read memory
> > barriers, so the stores to the instruction are done in this order:
> >
> > spin lock
> > store int3 to 1st byte
> > smp_wmb()
> > sync all cores
> > store new instruction in all but 1st byte
> > smp_wmb()
> > issue smp_rmb() on all cores (a sync all cores has this effect)
> > store new instruction to 1st byte
> > send IPI to all cores (or call synchronize_sched()) to wait for all
> > breakpoint handlers to complete.
> > spin unlock
> >
> > So the question is: are these wmb/rmb pairs actually needed ? As the
> > instruction fetch is not performed by instructions per se, I doubt a
> > rmb() will have any effect on them. I always prefer to stay on the safe
> > side, but it wouldn't hurt to know.
> >
>
> I don't think the smp_rmb() has any function.

OK, that's good to know.

>
> However, you're being quite inconsistent in your terminology here. The
> assumption above is that the "synchronize code on all CPU" step is
> sending an IPI to all cores and waiting for it to return, so that each
> core has executed IPI/IRET before continuation.

To be strictly correct, we cannot assume that the IPI handler issues IRET
before signaling its completion. It's rather the other way around.
This is why I add a smp_mb() in the IPI handler for the "synchronize
code on all CPUs" step.

>
> It is *not* necessary to wait for the breakpoint handlers to return, as
> long as they will get to IRET eventually, since IRET is a jump and a
> serializing instruction.

Ah, I see. So the added smp_mb() would not be needed then, as long as we
know that the other CPUs either are currently running the IPI handler or
have executed it. IOW: they will execute IRET very soon or they just
executed it since the int3 have been written. I am a bit concerned about
NMIs coming in this race window, but as they need to have started after
we have put the breakpoint, that should be OK. (note: entry_*.S
modifications are needed to support nesting breakpoint handlers in NMIs)

>
> > Hrm. Assuming we have a spinlock protecting all this, given that we
> > synchronize all cores at step (4) _after_ removing the breakpoint, and
> > given that the breakpoint handler is an interrupt gate (thus executes
> > with interrupts off), I am inclined to think that sending the IPIs at
> > the end of step (4) (and waiting for them to complete) should be enough
> > to ensure that all in-flight breakpoint handlers for this site have
> > completed their execution. This would mean that we only have to keep
> > track of a single site at a time. Or am I missing something ?
>
> Yes: the whole point was that you can omit the synchronization in step 4
> if you leave the breakpoint handler in place (I said "omit step 5", but
> that wasn't really what I meant.)
>
> That means that at the cost of two compares in the standard #BP handler,
> we can get away with only one IPI per atomic instruction poke.

OK. That makes sense now.

Thanks,

Mathieu

>
> -hpa
>
>
>
> --
> H. Peter Anvin, Intel Open Source Technology Center
> I work for Intel. I don't speak on their behalf.
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-14 06:54:53

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

Mathieu Desnoyers wrote:
> * H. Peter Anvin ([email protected]) wrote:
>> On 01/12/2010 06:06 PM, Mathieu Desnoyers wrote:
>>> * H. Peter Anvin ([email protected]) wrote:
>>>> On 01/12/2010 08:26 AM, Jason Baron wrote:
>>>>> Add text_poke_fixup() which takes a fixup address to where a processor
>>>>> jumps if it hits the modifying address while code modifying.
>>>>> text_poke_fixup() does following steps for this purpose.
>>>>>
>>>>> 1. Setup int3 handler for fixup.
>>>>> 2. Put a breakpoint (int3) on the first byte of modifying region,
>>>>> and synchronize code on all CPUs.
>>>>> 3. Modify other bytes of modifying region, and synchronize code on all CPUs.
>>>>> 4. Modify the first byte of modifying region, and synchronize code
>>>>> on all CPUs.
>>>>> 5. Clear int3 handler.
>>>>>
>>>>
>>>> We (Intel OTC) have been able to get an *unofficial* answer as to the
>>>> validity of this procedure; specifically as it applies to Intel hardware
>>>> (obviously). We are working on getting an officially approved answer,
>>>> but as far as we currently know, the procedure as outlined above should
>>>> work on all Intel hardware. In fact, we believe the synchronization in
>>>> step 3 is in fact unnecessary (as the synchronization in step 4 provides
>>>> sufficient guard.)
>>>
>>> Hi Peter,
>>>
>>> This is great news! Thanks to Intel OTC and yourself for looking into
>>> this. In the immediate values patches, I am doing the synchronization at
>>> the end of step (3) to ensure that all remote CPUs issue read memory
>>> barriers, so the stores to the instruction are done in this order:
>>>
>>> spin lock
>>> store int3 to 1st byte
>>> smp_wmb()
>>> sync all cores
>>> store new instruction in all but 1st byte
>>> smp_wmb()
>>> issue smp_rmb() on all cores (a sync all cores has this effect)
>>> store new instruction to 1st byte
>>> send IPI to all cores (or call synchronize_sched()) to wait for all
>>> breakpoint handlers to complete.
>>> spin unlock
>>>
>>> So the question is: are these wmb/rmb pairs actually needed ? As the
>>> instruction fetch is not performed by instructions per se, I doubt a
>>> rmb() will have any effect on them. I always prefer to stay on the safe
>>> side, but it wouldn't hurt to know.
>>>
>>
>> I don't think the smp_rmb() has any function.
>
> OK, that's good to know.
>
>>
>> However, you're being quite inconsistent in your terminology here. The
>> assumption above is that the "synchronize code on all CPU" step is
>> sending an IPI to all cores and waiting for it to return, so that each
>> core has executed IPI/IRET before continuation.
>
> To be strictly correct, we cannot assume that the IPI handler issues IRET
> before signaling its completion. It's rather the other way around.
> This is why I add a smp_mb() in the IPI handler for the "synchronize
> code on all CPUs" step.
>
>>
>> It is *not* necessary to wait for the breakpoint handlers to return, as
>> long as they will get to IRET eventually, since IRET is a jump and a
>> serializing instruction.
>
> Ah, I see. So the added smp_mb() would not be needed then, as long as we
> know that the other CPUs either are currently running the IPI handler or
> have executed it. IOW: they will execute IRET very soon or they just
> executed it since the int3 have been written. I am a bit concerned about
> NMIs coming in this race window, but as they need to have started after
> we have put the breakpoint, that should be OK. (note: entry_*.S
> modifications are needed to support nesting breakpoint handlers in NMIs)
>
>>
>>> Hrm. Assuming we have a spinlock protecting all this, given that we
>>> synchronize all cores at step (4) _after_ removing the breakpoint, and
>>> given that the breakpoint handler is an interrupt gate (thus executes
>>> with interrupts off), I am inclined to think that sending the IPIs at
>>> the end of step (4) (and waiting for them to complete) should be enough
>>> to ensure that all in-flight breakpoint handlers for this site have
>>> completed their execution. This would mean that we only have to keep
>>> track of a single site at a time. Or am I missing something ?
>>
>> Yes: the whole point was that you can omit the synchronization in step 4
>> if you leave the breakpoint handler in place (I said "omit step 5", but
>> that wasn't really what I meant.)

Hmm, in that case, how can we reuse the breakpoint handler for another
text poke site? Even if we leave the handler, I think we need to clear
fixup information for next poking...

>>
>> That means that at the cost of two compares in the standard #BP handler,
>> we can get away with only one IPI per atomic instruction poke.
>
> OK. That makes sense now.

So, let me check the actual replacement steps.

(1) lock text_mutex
(2) setup breakpoint fixup addresses (source and destination)
(3) store int3 to 1st byte, and smp_wmb()
(4) send IPI and issue smp_mb() (or cpuid) with for sync all cores.
(5) store new instruction except 1st byte, and smp_wmb()
(6) store 1st byte of new instruction
(7) send IPI to all cores for waiting for all running breakpoint handlers.
(8) clear fixup addresses
(9) unlock text_mutex

Is this correct?

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2010-01-14 15:32:26

by Steven Rostedt

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

On Tue, 2010-01-12 at 11:26 -0500, Jason Baron wrote:

> +/**
> + * text_poke_fixup() -- cross-modifying kernel text with fixup address.
> + * @addr: Modifying address.
> + * @opcode: New instruction.
> + * @len: length of modifying bytes.
> + * @fixup: Fixup address.
> + *
> + * Note: You must backup replaced instructions before calling this,
> + * if you need to recover it.
> + * Note: Must be called under text_mutex.
> + */
> +void *__kprobes text_poke_fixup(void *addr, const void *opcode, size_t len,
> + void *fixup)
> +{
> + static const unsigned char int3_insn = BREAKPOINT_INSTRUCTION;
> + static const int int3_size = sizeof(int3_insn);
> +
> + /* Replacing 1 byte can be done atomically. */
> + if (unlikely(len <= 1))
> + return text_poke(addr, opcode, len);

This part bothers me. The text_poke just writes over the text directly
(using a separate mapping). But if that memory is in the pipeline of
another CPU, I think this could cause a GPF.

-- Steve

> +
> + /* Preparing */
> + patch_fixup_addr = fixup;
> + wmb();
> + patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */
> +
> + /* Cap by an int3 */
> + text_poke(addr, &int3_insn, int3_size);
> + sync_core_all();
> +
> + /* Replace tail bytes */
> + text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
> + len - int3_size);
> + sync_core_all();
> +
> + /* Replace int3 with head byte */
> + text_poke(addr, opcode, int3_size);
> + sync_core_all();
> +
> + /* Cleanup */
> + patch_fixup_from = NULL;
> + wmb();
> + return addr;
> +}
> +

2010-01-14 15:39:12

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

* Steven Rostedt ([email protected]) wrote:
> On Tue, 2010-01-12 at 11:26 -0500, Jason Baron wrote:
>
> > +/**
> > + * text_poke_fixup() -- cross-modifying kernel text with fixup address.
> > + * @addr: Modifying address.
> > + * @opcode: New instruction.
> > + * @len: length of modifying bytes.
> > + * @fixup: Fixup address.
> > + *
> > + * Note: You must backup replaced instructions before calling this,
> > + * if you need to recover it.
> > + * Note: Must be called under text_mutex.
> > + */
> > +void *__kprobes text_poke_fixup(void *addr, const void *opcode, size_t len,
> > + void *fixup)
> > +{
> > + static const unsigned char int3_insn = BREAKPOINT_INSTRUCTION;
> > + static const int int3_size = sizeof(int3_insn);
> > +
> > + /* Replacing 1 byte can be done atomically. */
> > + if (unlikely(len <= 1))
> > + return text_poke(addr, opcode, len);
>
> This part bothers me. The text_poke just writes over the text directly
> (using a separate mapping). But if that memory is in the pipeline of
> another CPU, I think this could cause a GPF.

It looks like we are thinking along the same lines.

I'm under the impression that I pointed out this exact same issue in the
previous round of review a few weeks ago. Does this submission reflect
the up-to-date state of this patch ?

Thanks,

Mathieu

>
> -- Steve
>
> > +
> > + /* Preparing */
> > + patch_fixup_addr = fixup;
> > + wmb();
> > + patch_fixup_from = (u8 *)addr + int3_size; /* IP address after int3 */
> > +
> > + /* Cap by an int3 */
> > + text_poke(addr, &int3_insn, int3_size);
> > + sync_core_all();
> > +
> > + /* Replace tail bytes */
> > + text_poke((char *)addr + int3_size, (const char *)opcode + int3_size,
> > + len - int3_size);
> > + sync_core_all();
> > +
> > + /* Replace int3 with head byte */
> > + text_poke(addr, opcode, int3_size);
> > + sync_core_all();
> > +
> > + /* Cleanup */
> > + patch_fixup_from = NULL;
> > + wmb();
> > + return addr;
> > +}
> > +
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-14 15:42:15

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

On 01/14/2010 07:32 AM, Steven Rostedt wrote:
>> +
>> + /* Replacing 1 byte can be done atomically. */
>> + if (unlikely(len <= 1))
>> + return text_poke(addr, opcode, len);
>
> This part bothers me. The text_poke just writes over the text directly
> (using a separate mapping). But if that memory is in the pipeline of
> another CPU, I think this could cause a GPF.
>

Could you clarify why you think that?

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-01-14 16:24:38

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

Mathieu Desnoyers wrote:
> * Steven Rostedt ([email protected]) wrote:
>> On Tue, 2010-01-12 at 11:26 -0500, Jason Baron wrote:
>>
>>> +/**
>>> + * text_poke_fixup() -- cross-modifying kernel text with fixup address.
>>> + * @addr: Modifying address.
>>> + * @opcode: New instruction.
>>> + * @len: length of modifying bytes.
>>> + * @fixup: Fixup address.
>>> + *
>>> + * Note: You must backup replaced instructions before calling this,
>>> + * if you need to recover it.
>>> + * Note: Must be called under text_mutex.
>>> + */
>>> +void *__kprobes text_poke_fixup(void *addr, const void *opcode, size_t len,
>>> + void *fixup)
>>> +{
>>> + static const unsigned char int3_insn = BREAKPOINT_INSTRUCTION;
>>> + static const int int3_size = sizeof(int3_insn);
>>> +
>>> + /* Replacing 1 byte can be done atomically. */
>>> + if (unlikely(len <= 1))
>>> + return text_poke(addr, opcode, len);
>>
>> This part bothers me. The text_poke just writes over the text directly
>> (using a separate mapping). But if that memory is in the pipeline of
>> another CPU, I think this could cause a GPF.
>
> It looks like we are thinking along the same lines.
>
> I'm under the impression that I pointed out this exact same issue in the
> previous round of review a few weeks ago. Does this submission reflect
> the up-to-date state of this patch ?

No, the latest patch just skips step 3 if len == 1.
(Jason, could you update your repository?)
I thought I sent it the end of the last year ... :)

http://lkml.org/lkml/2009/12/18/312

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2010-01-14 16:43:21

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

On Thu, Jan 14, 2010 at 11:23:35AM -0500, Masami Hiramatsu wrote:
> > * Steven Rostedt ([email protected]) wrote:
> >> On Tue, 2010-01-12 at 11:26 -0500, Jason Baron wrote:
> >>
> >>> +/**
> >>> + * text_poke_fixup() -- cross-modifying kernel text with fixup address.
> >>> + * @addr: Modifying address.
> >>> + * @opcode: New instruction.
> >>> + * @len: length of modifying bytes.
> >>> + * @fixup: Fixup address.
> >>> + *
> >>> + * Note: You must backup replaced instructions before calling this,
> >>> + * if you need to recover it.
> >>> + * Note: Must be called under text_mutex.
> >>> + */
> >>> +void *__kprobes text_poke_fixup(void *addr, const void *opcode, size_t len,
> >>> + void *fixup)
> >>> +{
> >>> + static const unsigned char int3_insn = BREAKPOINT_INSTRUCTION;
> >>> + static const int int3_size = sizeof(int3_insn);
> >>> +
> >>> + /* Replacing 1 byte can be done atomically. */
> >>> + if (unlikely(len <= 1))
> >>> + return text_poke(addr, opcode, len);
> >>
> >> This part bothers me. The text_poke just writes over the text directly
> >> (using a separate mapping). But if that memory is in the pipeline of
> >> another CPU, I think this could cause a GPF.
> >
> > It looks like we are thinking along the same lines.
> >
> > I'm under the impression that I pointed out this exact same issue in the
> > previous round of review a few weeks ago. Does this submission reflect
> > the up-to-date state of this patch ?
>
> No, the latest patch just skips step 3 if len == 1.
> (Jason, could you update your repository?)
> I thought I sent it the end of the last year ... :)
>
> http://lkml.org/lkml/2009/12/18/312
>
> Thank you,
>

sorry about that...i've updated to the latest.

thanks,

-Jason

2010-01-14 18:46:18

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

Mathieu Desnoyers wrote:
>> It is *not* necessary to wait for the breakpoint handlers to return, as
>> long as they will get to IRET eventually, since IRET is a jump and a
>> serializing instruction.
>
> Ah, I see. So the added smp_mb() would not be needed then, as long as we
> know that the other CPUs either are currently running the IPI handler or
> have executed it. IOW: they will execute IRET very soon or they just
> executed it since the int3 have been written. I am a bit concerned about
> NMIs coming in this race window, but as they need to have started after
> we have put the breakpoint, that should be OK. (note: entry_*.S
> modifications are needed to support nesting breakpoint handlers in NMIs)

Hmm, if we support this to modify NMI code, it seems that we need to
support not only nesting breakpoint handling but also nesting NMIs,
because nesting NMI is unblocked when next IRET (of breakpoint) is
issued.

>From Intel's Software Developer?s Manual Vol.3A 5.7.1 Handling Multiple NMIs
said below.
---
While an NMI interrupt handler is executing, the processor disables additional calls to
the NMI handler until the next IRET instruction is executed. This blocking of subse-
quent NMIs prevents stacking up calls to the NMI handler. [...]
---

I assume that your below patch tried to solve this issue, right?
http://lkml.indiana.edu/hypermail/linux/kernel/0804.1/0965.html

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2010-01-17 18:55:43

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

* H. Peter Anvin ([email protected]) wrote:
> On 01/14/2010 07:32 AM, Steven Rostedt wrote:
> >> +
> >> + /* Replacing 1 byte can be done atomically. */
> >> + if (unlikely(len <= 1))
> >> + return text_poke(addr, opcode, len);
> >
> > This part bothers me. The text_poke just writes over the text directly
> > (using a separate mapping). But if that memory is in the pipeline of
> > another CPU, I think this could cause a GPF.
> >
>
> Could you clarify why you think that?

Basically, what Steven and I were concerned about in this particular
patch version is the fact that this code took a "shortcut" for
single-byte text modification, thus bypassing the int3-bypass scheme
altogether.

As mere atomicity of the modification is not the only concern here
(because we also have to deal with instruction trace cache coherency and
so forth), then the int3 breakpoint scheme is, I think, also needed for
single-byte updates.

Thanks,

Mathieu

>
> -hpa
>
> --
> H. Peter Anvin, Intel Open Source Technology Center
> I work for Intel. I don't speak on their behalf.
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-17 19:16:29

by Arjan van de Ven

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

On Sun, 17 Jan 2010 13:55:39 -0500
Mathieu Desnoyers <[email protected]> wrote:

> * H. Peter Anvin ([email protected]) wrote:
> > On 01/14/2010 07:32 AM, Steven Rostedt wrote:
> > >> +
> > >> + /* Replacing 1 byte can be done atomically. */
> > >> + if (unlikely(len <= 1))
> > >> + return text_poke(addr, opcode, len);
> > >
> > > This part bothers me. The text_poke just writes over the text
> > > directly (using a separate mapping). But if that memory is in the
> > > pipeline of another CPU, I think this could cause a GPF.
> > >
> >
> > Could you clarify why you think that?
>
> Basically, what Steven and I were concerned about in this particular
> patch version is the fact that this code took a "shortcut" for
> single-byte text modification, thus bypassing the int3-bypass scheme
> altogether.

single byte instruction updates are likely 100x safer than any scheme
of multi-byte instruction scheme that I have seen, other than a full
stop_machine().

That does not mean it is safe, it just means it's an order of
complexity less to analyze ;-)

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2010-01-18 16:00:00

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

Arjan van de Ven wrote:
> On Sun, 17 Jan 2010 13:55:39 -0500
> Mathieu Desnoyers <[email protected]> wrote:
>
>> * H. Peter Anvin ([email protected]) wrote:
>>> On 01/14/2010 07:32 AM, Steven Rostedt wrote:
>>>>> +
>>>>> + /* Replacing 1 byte can be done atomically. */
>>>>> + if (unlikely(len <= 1))
>>>>> + return text_poke(addr, opcode, len);
>>>>
>>>> This part bothers me. The text_poke just writes over the text
>>>> directly (using a separate mapping). But if that memory is in the
>>>> pipeline of another CPU, I think this could cause a GPF.
>>>>
>>>
>>> Could you clarify why you think that?
>>
>> Basically, what Steven and I were concerned about in this particular
>> patch version is the fact that this code took a "shortcut" for
>> single-byte text modification, thus bypassing the int3-bypass scheme
>> altogether.
>
> single byte instruction updates are likely 100x safer than any scheme
> of multi-byte instruction scheme that I have seen, other than a full
> stop_machine().
>
> That does not mean it is safe, it just means it's an order of
> complexity less to analyze ;-)

Yeah, so in the latest patch, I updated it to use int3 even if
len == 1. :-)

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2010-01-18 16:27:41

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

On 01/18/2010 07:59 AM, Masami Hiramatsu wrote:
>>>>>
>>>>> This part bothers me. The text_poke just writes over the text
>>>>> directly (using a separate mapping). But if that memory is in the
>>>>> pipeline of another CPU, I think this could cause a GPF.
>>>>>
>>>>
>>>> Could you clarify why you think that?
>>>
>>> Basically, what Steven and I were concerned about in this particular
>>> patch version is the fact that this code took a "shortcut" for
>>> single-byte text modification, thus bypassing the int3-bypass scheme
>>> altogether.
>>
>> single byte instruction updates are likely 100x safer than any scheme
>> of multi-byte instruction scheme that I have seen, other than a full
>> stop_machine().
>>
>> That does not mean it is safe, it just means it's an order of
>> complexity less to analyze ;-)
>
> Yeah, so in the latest patch, I updated it to use int3 even if
> len == 1. :-)
>

This really doesn't make much sense to me. The whole basis for the int3
scheme itself is that single-byte updates are atomic, so if single-byte
updates can't work -- and as I stated, we at Intel OTC currently believe
it safe -- then int3 can't work either.

The one thing to watch out for is that unless you force an IPI/IRET
cycle afterwards, you can't know when any particular remote processor
will see the update.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-01-18 16:31:40

by Arjan van de Ven

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

On Mon, 18 Jan 2010 10:59:30 -0500
Masami Hiramatsu <[email protected]> wrote:

> Yeah, so in the latest patch, I updated it to use int3 even if
> len == 1. :-)
>

int3 is not making a difference for your case; there is no guarantee
that the other processor even sees the "int3 inbetween state" at all;
if it's not safe without int3 then it won't be safe with int3 either.

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2010-01-18 16:52:36

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

* H. Peter Anvin ([email protected]) wrote:
> On 01/18/2010 07:59 AM, Masami Hiramatsu wrote:
> >>>>>
> >>>>> This part bothers me. The text_poke just writes over the text
> >>>>> directly (using a separate mapping). But if that memory is in the
> >>>>> pipeline of another CPU, I think this could cause a GPF.
> >>>>>
> >>>>
> >>>> Could you clarify why you think that?
> >>>
> >>> Basically, what Steven and I were concerned about in this particular
> >>> patch version is the fact that this code took a "shortcut" for
> >>> single-byte text modification, thus bypassing the int3-bypass scheme
> >>> altogether.
> >>
> >> single byte instruction updates are likely 100x safer than any scheme
> >> of multi-byte instruction scheme that I have seen, other than a full
> >> stop_machine().
> >>
> >> That does not mean it is safe, it just means it's an order of
> >> complexity less to analyze ;-)
> >
> > Yeah, so in the latest patch, I updated it to use int3 even if
> > len == 1. :-)
> >
>
> This really doesn't make much sense to me. The whole basis for the int3
> scheme itself is that single-byte updates are atomic, so if single-byte
> updates can't work -- and as I stated, we at Intel OTC currently believe
> it safe -- then int3 can't work either.

The additional characteristic of the int3 instruction (compared to the
general case of a single-byte instruction) is that, when executed, it
will trigger a trap, run a trap handler and return to the original code,
typically with iret. This therefore implies that a serializing
instruction is executed before returning to the instructions following
the modification site when the breakpoint is hit.

So I hand out to Intel's expertise the question of whether single-byte
instruction modification is safe or not in the general case. I'm just
pointing out that I can very well imagine an aggressive superscalar
architecture for which pipeline structure would support single-byte int3
patching without any problem due to the implied serialization, but would
not support the general-case single-byte modification due to its lack of
serialization.

As we might have to port this algorithm to Itanium in a near future, I
prefer to stay on the safe side. Intel's "by the book" recommendation is
more or less that a serializing instruction must be executed on all CPUs
before new code is executed, without mention of single-vs-multi byte
instructions. The int3-based bypass follows this requirement, but the
single-byte code patching does not.

Unless there is a visible performance gain to special-case the
single-byte instruction, I would recommend to stick to the safest
solution, which follows Intel "official" guide-lines too.

Thanks,

Mathieu

>
> The one thing to watch out for is that unless you force an IPI/IRET
> cycle afterwards, you can't know when any particular remote processor
> will see the update.
>
> -hpa
>
> --
> H. Peter Anvin, Intel Open Source Technology Center
> I work for Intel. I don't speak on their behalf.
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-18 16:54:47

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

* Arjan van de Ven ([email protected]) wrote:
> On Mon, 18 Jan 2010 10:59:30 -0500
> Masami Hiramatsu <[email protected]> wrote:
>
> > Yeah, so in the latest patch, I updated it to use int3 even if
> > len == 1. :-)
> >
>
>
> int3 is not making a difference for your case; there is no guarantee
> that the other processor even sees the "int3 inbetween state" at all;
> if it's not safe without int3 then it won't be safe with int3 either.

What Masami means is that he updated his patch to use the int3+IPI
broadcast scheme.

Therefore, the CPUs not seeing the int3 inbetween state will be forced
to issue a serializing instruction while the int3 is in place anyway.

Thanks,

Mathieu

>
>
> --
> Arjan van de Ven Intel Open Source Technology Centre
> For development, discussion and tips for power savings,
> visit http://www.lesswatts.org

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-18 18:21:37

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

Mathieu Desnoyers wrote:
> * Arjan van de Ven ([email protected]) wrote:
>> On Mon, 18 Jan 2010 10:59:30 -0500
>> Masami Hiramatsu <[email protected]> wrote:
>>
>>> Yeah, so in the latest patch, I updated it to use int3 even if
>>> len == 1. :-)
>>>
>>
>>
>> int3 is not making a difference for your case; there is no guarantee
>> that the other processor even sees the "int3 inbetween state" at all;
>> if it's not safe without int3 then it won't be safe with int3 either.
>
> What Masami means is that he updated his patch to use the int3+IPI
> broadcast scheme.

Right.

>
> Therefore, the CPUs not seeing the int3 inbetween state will be forced
> to issue a serializing instruction while the int3 is in place anyway.

By the way, in kprobes, we just use a text_poke() to put int3.
I assume that we'd better send IPI afterward, wouldn't it?

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2010-01-18 18:33:13

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

* Masami Hiramatsu ([email protected]) wrote:
> Mathieu Desnoyers wrote:
> > * Arjan van de Ven ([email protected]) wrote:
> >> On Mon, 18 Jan 2010 10:59:30 -0500
> >> Masami Hiramatsu <[email protected]> wrote:
> >>
> >>> Yeah, so in the latest patch, I updated it to use int3 even if
> >>> len == 1. :-)
> >>>
> >>
> >>
> >> int3 is not making a difference for your case; there is no guarantee
> >> that the other processor even sees the "int3 inbetween state" at all;
> >> if it's not safe without int3 then it won't be safe with int3 either.
> >
> > What Masami means is that he updated his patch to use the int3+IPI
> > broadcast scheme.
>
> Right.
>
> >
> > Therefore, the CPUs not seeing the int3 inbetween state will be forced
> > to issue a serializing instruction while the int3 is in place anyway.
>
> By the way, in kprobes, we just use a text_poke() to put int3.
> I assume that we'd better send IPI afterward, wouldn't it?

Only if you need to ensure that you reached a state where all CPUs are
seeing the int3.

Note that kprobes already issues synchronize_sched() after breakpoint
removal, which should have a similar effect. However, AFAIK, it does not
synchronize after inserting the int3.

Thanks,

Mathieu

>
> Thank you,
>
> --
> Masami Hiramatsu
>
> Software Engineer
> Hitachi Computer Products (America), Inc.
> Software Solutions Division
>
> e-mail: [email protected]
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-01-18 18:54:28

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

On 01/18/2010 08:52 AM, Mathieu Desnoyers wrote:
>>
>> This really doesn't make much sense to me. The whole basis for the int3
>> scheme itself is that single-byte updates are atomic, so if single-byte
>> updates can't work -- and as I stated, we at Intel OTC currently believe
>> it safe -- then int3 can't work either.
>
> The additional characteristic of the int3 instruction (compared to the
> general case of a single-byte instruction) is that, when executed, it
> will trigger a trap, run a trap handler and return to the original code,
> typically with iret. This therefore implies that a serializing
> instruction is executed before returning to the instructions following
> the modification site when the breakpoint is hit.
>
> So I hand out to Intel's expertise the question of whether single-byte
> instruction modification is safe or not in the general case. I'm just
> pointing out that I can very well imagine an aggressive superscalar
> architecture for which pipeline structure would support single-byte int3
> patching without any problem due to the implied serialization, but would
> not support the general-case single-byte modification due to its lack of
> serialization.
>

This is utter and complete nonsense. You seem to think that everything
is guaranteed to hit the breakpoint, which is obviously false.
Furthermore, until you have done the serialization, you're not
guaranteed the *breakpoint* is seen, so you have the same condition.

> As we might have to port this algorithm to Itanium in a near future, I
> prefer to stay on the safe side. Intel's "by the book" recommendation is
> more or less that a serializing instruction must be executed on all CPUs
> before new code is executed, without mention of single-vs-multi byte
> instructions. The int3-based bypass follows this requirement, but the
> single-byte code patching does not.
>
> Unless there is a visible performance gain to special-case the
> single-byte instruction, I would recommend to stick to the safest
> solution, which follows Intel "official" guide-lines too.

No, it doesn't. The only thing that follows the "official" guidelines
is stop_machine.

As far as other architectures are concerned, other architectures can
have very different and much stricter rules for I/D coherence. Trying
to extrapolate from the x86 rules is aggravated insanity.

-hpa

2010-01-18 20:53:50

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

H. Peter Anvin wrote:
> On 01/18/2010 08:52 AM, Mathieu Desnoyers wrote:
>>>
>>> This really doesn't make much sense to me. The whole basis for the int3
>>> scheme itself is that single-byte updates are atomic, so if single-byte
>>> updates can't work -- and as I stated, we at Intel OTC currently believe
>>> it safe -- then int3 can't work either.
>>
>> The additional characteristic of the int3 instruction (compared to the
>> general case of a single-byte instruction) is that, when executed, it
>> will trigger a trap, run a trap handler and return to the original code,
>> typically with iret. This therefore implies that a serializing
>> instruction is executed before returning to the instructions following
>> the modification site when the breakpoint is hit.
>>
>> So I hand out to Intel's expertise the question of whether single-byte
>> instruction modification is safe or not in the general case. I'm just
>> pointing out that I can very well imagine an aggressive superscalar
>> architecture for which pipeline structure would support single-byte int3
>> patching without any problem due to the implied serialization, but would
>> not support the general-case single-byte modification due to its lack of
>> serialization.
>>
>
> This is utter and complete nonsense. You seem to think that everything
> is guaranteed to hit the breakpoint, which is obviously false.
> Furthermore, until you have done the serialization, you're not
> guaranteed the *breakpoint* is seen, so you have the same condition.

In that time frame, I guess that the processor sees non-modified
instruction and executes it. Since we'll wait until serializing on
each processor, I think it is OK for int3-bypass method.

(Of course, this can depend on chip, it is possible that there is a chip
which causes a fault when it has a cache-discarding signal on current-
instruction decoding slot. That's also why we are asking this method
is OK for x86 processors.)

Thank you,

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America), Inc.
Software Solutions Division

e-mail: [email protected]

2010-01-18 21:22:31

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

On 01/18/2010 12:53 PM, Masami Hiramatsu wrote:
>>
>> This is utter and complete nonsense. You seem to think that everything
>> is guaranteed to hit the breakpoint, which is obviously false.
>> Furthermore, until you have done the serialization, you're not
>> guaranteed the *breakpoint* is seen, so you have the same condition.
>
> In that time frame, I guess that the processor sees non-modified
> instruction and executes it. Since we'll wait until serializing on
> each processor, I think it is OK for int3-bypass method.
>
> (Of course, this can depend on chip, it is possible that there is a chip
> which causes a fault when it has a cache-discarding signal on current-
> instruction decoding slot. That's also why we are asking this method
> is OK for x86 processors.)
>

Yes, it is possible, however, if that was the case, then int3 wouldn't
work either. As I said, to the best of our knowledge, at least Intel
processors are okay for a single-byte update (I will wait to try to
state the full general rule until it has been officially approved or
killed.)

-hpa

2010-01-18 21:38:11

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] jump label v4 - x86: Introduce generic jump patching without stop_machine

* H. Peter Anvin ([email protected]) wrote:
> On 01/18/2010 08:52 AM, Mathieu Desnoyers wrote:
> >>
> >> This really doesn't make much sense to me. The whole basis for the int3
> >> scheme itself is that single-byte updates are atomic, so if single-byte
> >> updates can't work -- and as I stated, we at Intel OTC currently believe
> >> it safe -- then int3 can't work either.
> >
> > The additional characteristic of the int3 instruction (compared to the
> > general case of a single-byte instruction) is that, when executed, it
> > will trigger a trap, run a trap handler and return to the original code,
> > typically with iret. This therefore implies that a serializing
> > instruction is executed before returning to the instructions following
> > the modification site when the breakpoint is hit.
> >
> > So I hand out to Intel's expertise the question of whether single-byte
> > instruction modification is safe or not in the general case. I'm just
> > pointing out that I can very well imagine an aggressive superscalar
> > architecture for which pipeline structure would support single-byte int3
> > patching without any problem due to the implied serialization, but would
> > not support the general-case single-byte modification due to its lack of
> > serialization.
> >
>
> This is utter and complete nonsense. You seem to think that everything
> is guaranteed to hit the breakpoint, which is obviously false.

What I discuss above is: what actually happens when the breakpoint is
hit.

I'm doing no assumption about whether it is hit or not. In the int3+IPI
broadcast scheme, every cpu receive an IPI between seeing the old and
new instructions. Only *some* cpus *may* hit the breakpoint that is put
there temporarily.

> Furthermore, until you have done the serialization, you're not
> guaranteed the *breakpoint* is seen,

Agreed,

> so you have the same condition.

Hrm ? Same as what exactly ? We have either the old instruction in place
or the breakpoint (before the serialization). After the serialization,
we have either the breakpoint or the new instruction.

What I am pointing out is that specifically turning a 1-byte instruction
into a breakpoint can be safer than turning it into another 1-byte
instruction directly, because *if* cpus hit the breakpoint, they *will*
issue a synchronizing instruction at that point (implied by the
breakpoint). This is not the case if you just modify the 1-byte
instruction in place.

>
> > As we might have to port this algorithm to Itanium in a near future, I
> > prefer to stay on the safe side. Intel's "by the book" recommendation is
> > more or less that a serializing instruction must be executed on all CPUs
> > before new code is executed, without mention of single-vs-multi byte
> > instructions. The int3-based bypass follows this requirement, but the
> > single-byte code patching does not.
> >
> > Unless there is a visible performance gain to special-case the
> > single-byte instruction, I would recommend to stick to the safest
> > solution, which follows Intel "official" guide-lines too.
>
> No, it doesn't. The only thing that follows the "official" guidelines
> is stop_machine.
>
> As far as other architectures are concerned, other architectures can
> have very different and much stricter rules for I/D coherence. Trying
> to extrapolate from the x86 rules is aggravated insanity.

I agree that official Intel guidelines for XMC only discuss the
stop_machine() scheme. OK then, let's see how patching single-byte
instructions deals with the official _uniprocessor_ self-modifying code
guidelines.

(ref. http://www.intel.com/Assets/PDF/specupdate/318586.pdf
7.1.3 Handling Self- and Cross-Modifying Code)

(* OPTION 1 *)
Store modified code (as data) into code segment;
Jump to new code or an intermediate location;
Execute new code;

(* OPTION 2 *)
Store modified code (as data) into code segment;
Execute a serializing instruction; (* For example, CPUID instruction *)
Execute new code;

As you can see, if we self-modify the code on a single cpu machine with
text_poke directly, even for a single-byte instruction, we _have_ to
guarantee that either a jump or a serializing instruction is issued
before the new code is executed.

What I discussed above was that int3 is a special-case, because it
generates a trap, and therefore jumps to a different location.

So, back to the case where we could "simply patch-in any single-byte
instruction in a SMP system", I argue that this is against the
uniprocessor part of the errata, which clearly also applies to SMP.

By the way, I've looked at the Itanium documents a few years ago, and
I have not seen any reason at that time why the breakpoint+IPI scheme
would not work if we additionally perform the appropriate I and D cache
flushes. The rest of the requirements are _very_ similar.

Thanks,

Mathieu

>
> -hpa

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-04-13 17:16:13