2019-06-05 13:25:25

by Peter Zijlstra

Subject: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

In preparation for static_call support, teach text_poke_bp() to
emulate instructions, including CALL.

The current text_poke_bp() takes a @handler argument which is used as
a jump target when the temporary INT3 is hit by a different CPU.

When patching CALL instructions, this doesn't work because we'd miss
the PUSH of the return address. Instead, teach poke_int3_handler() to
emulate an instruction, typically the instruction we're patching in.

This fits almost all text_poke_bp() users, except
arch_unoptimize_kprobe() which restores random text, and for that site
we have to build an explicit emulate instruction.

Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Nadav Amit <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/include/asm/text-patching.h | 2 -
arch/x86/kernel/alternative.c | 47 ++++++++++++++++++++++++++---------
arch/x86/kernel/jump_label.c | 3 --
arch/x86/kernel/kprobes/opt.c | 11 +++++---
4 files changed, 46 insertions(+), 17 deletions(-)

--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -37,7 +37,7 @@ extern void text_poke_early(void *addr,
extern void *text_poke(void *addr, const void *opcode, size_t len);
extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
extern int poke_int3_handler(struct pt_regs *regs);
-extern void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate);
extern int after_bootmem;
extern __ro_after_init struct mm_struct *poking_mm;
extern __ro_after_init unsigned long poking_addr;
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -921,19 +921,25 @@ static void do_sync_core(void *info)
}

static bool bp_patching_in_progress;
-static void *bp_int3_handler, *bp_int3_addr;
+static const void *bp_int3_opcode, *bp_int3_addr;

int poke_int3_handler(struct pt_regs *regs)
{
+ long ip = regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE;
+ struct opcode {
+ u8 insn;
+ s32 rel;
+ } __packed opcode;
+
/*
* Having observed our INT3 instruction, we now must observe
* bp_patching_in_progress.
*
- * in_progress = TRUE INT3
- * WMB RMB
- * write INT3 if (in_progress)
+ * in_progress = TRUE INT3
+ * WMB RMB
+ * write INT3 if (in_progress)
*
- * Idem for bp_int3_handler.
+ * Idem for bp_int3_opcode.
*/
smp_rmb();

@@ -943,8 +949,21 @@ int poke_int3_handler(struct pt_regs *re
if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
return 0;

- /* set up the specified breakpoint handler */
- regs->ip = (unsigned long) bp_int3_handler;
+ opcode = *(struct opcode *)bp_int3_opcode;
+
+ switch (opcode.insn) {
+ case 0xE8: /* CALL */
+ int3_emulate_call(regs, ip + opcode.rel);
+ break;
+
+ case 0xE9: /* JMP */
+ int3_emulate_jmp(regs, ip + opcode.rel);
+ break;
+
+ default: /* assume NOP */
+ int3_emulate_jmp(regs, ip);
+ break;
+ }

return 1;
}
@@ -955,7 +974,7 @@ NOKPROBE_SYMBOL(poke_int3_handler);
* @addr: address to patch
* @opcode: opcode of new instruction
* @len: length to copy
- * @handler: address to jump to when the temporary breakpoint is hit
+ * @emulate: opcode to emulate, when NULL use @opcode
*
* Modify multi-byte instruction by using int3 breakpoint on SMP.
* We completely avoid stop_machine() here, and achieve the
@@ -970,19 +989,25 @@ NOKPROBE_SYMBOL(poke_int3_handler);
* replacing opcode
* - sync cores
*/
-void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
+void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate)
{
unsigned char int3 = 0xcc;

- bp_int3_handler = handler;
+ bp_int3_opcode = emulate ?: opcode;
bp_int3_addr = (u8 *)addr + sizeof(int3);
bp_patching_in_progress = true;

lockdep_assert_held(&text_mutex);

/*
+ * poke_int3_handler() relies on @opcode being a 5 byte instruction;
+ * notably a JMP, CALL or NOP5_ATOMIC.
+ */
+ BUG_ON(len != 5);
+
+ /*
* Corresponding read barrier in int3 notifier for making sure the
- * in_progress and handler are correctly ordered wrt. patching.
+ * in_progress and opcode are correctly ordered wrt. patching.
*/
smp_wmb();

--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -87,8 +87,7 @@ static void __ref __jump_label_transform
return;
}

- text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE,
- (void *)jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
+ text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE, NULL);
}

void arch_jump_label_transform(struct jump_entry *entry,
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -437,8 +437,7 @@ void arch_optimize_kprobes(struct list_h
insn_buff[0] = RELATIVEJUMP_OPCODE;
*(s32 *)(&insn_buff[1]) = rel;

- text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE,
- op->optinsn.insn);
+ text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE, NULL);

list_del_init(&op->list);
}
@@ -448,12 +447,18 @@ void arch_optimize_kprobes(struct list_h
void arch_unoptimize_kprobe(struct optimized_kprobe *op)
{
u8 insn_buff[RELATIVEJUMP_SIZE];
+ u8 emulate_buff[RELATIVEJUMP_SIZE];

/* Set int3 to first byte for kprobes */
insn_buff[0] = BREAKPOINT_INSTRUCTION;
memcpy(insn_buff + 1, op->optinsn.copied_insn, RELATIVE_ADDR_SIZE);
+
+ emulate_buff[0] = RELATIVEJUMP_OPCODE;
+ *(s32 *)(&emulate_buff[1]) = (s32)((long)op->optinsn.insn -
+ ((long)op->kp.addr + RELATIVEJUMP_SIZE));
+
text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE,
- op->optinsn.insn);
+ emulate_buff);
}

/*



2019-06-07 05:53:40

by Nadav Amit

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

> On Jun 5, 2019, at 6:08 AM, Peter Zijlstra <[email protected]> wrote:
>
> In preparation for static_call support, teach text_poke_bp() to
> emulate instructions, including CALL.
>
> The current text_poke_bp() takes a @handler argument which is used as
> a jump target when the temporary INT3 is hit by a different CPU.
>
> When patching CALL instructions, this doesn't work because we'd miss
> the PUSH of the return address. Instead, teach poke_int3_handler() to
> emulate an instruction, typically the instruction we're patching in.
>
> This fits almost all text_poke_bp() users, except
> arch_unoptimize_kprobe() which restores random text, and for that site
> we have to build an explicit emulate instruction.
>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Nadav Amit <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> arch/x86/include/asm/text-patching.h | 2 -
> arch/x86/kernel/alternative.c | 47 ++++++++++++++++++++++++++---------
> arch/x86/kernel/jump_label.c | 3 --
> arch/x86/kernel/kprobes/opt.c | 11 +++++---
> 4 files changed, 46 insertions(+), 17 deletions(-)
>
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -37,7 +37,7 @@ extern void text_poke_early(void *addr,
> extern void *text_poke(void *addr, const void *opcode, size_t len);
> extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
> extern int poke_int3_handler(struct pt_regs *regs);
> -extern void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> +extern void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate);
> extern int after_bootmem;
> extern __ro_after_init struct mm_struct *poking_mm;
> extern __ro_after_init unsigned long poking_addr;
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -921,19 +921,25 @@ static void do_sync_core(void *info)
> }
>
> static bool bp_patching_in_progress;
> -static void *bp_int3_handler, *bp_int3_addr;
> +static const void *bp_int3_opcode, *bp_int3_addr;
>
> int poke_int3_handler(struct pt_regs *regs)
> {
> + long ip = regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE;
> + struct opcode {
> + u8 insn;
> + s32 rel;
> + } __packed opcode;
> +
> /*
> * Having observed our INT3 instruction, we now must observe
> * bp_patching_in_progress.
> *
> - * in_progress = TRUE INT3
> - * WMB RMB
> - * write INT3 if (in_progress)
> + * in_progress = TRUE INT3
> + * WMB RMB
> + * write INT3 if (in_progress)

I don’t see what has changed in this chunk… Whitespaces?

> *
> - * Idem for bp_int3_handler.
> + * Idem for bp_int3_opcode.
> */
> smp_rmb();
>
> @@ -943,8 +949,21 @@ int poke_int3_handler(struct pt_regs *re
> if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> return 0;
>
> - /* set up the specified breakpoint handler */
> - regs->ip = (unsigned long) bp_int3_handler;
> + opcode = *(struct opcode *)bp_int3_opcode;
> +
> + switch (opcode.insn) {
> + case 0xE8: /* CALL */
> + int3_emulate_call(regs, ip + opcode.rel);
> + break;
> +
> + case 0xE9: /* JMP */
> + int3_emulate_jmp(regs, ip + opcode.rel);
> + break;

Consider using RELATIVECALL_OPCODE and RELATIVEJUMP_OPCODE instead of the
constants (0xE8, 0xE9), just as you do later in the patch.

2019-06-07 09:24:42

by Peter Zijlstra

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Fri, Jun 07, 2019 at 05:41:42AM +0000, Nadav Amit wrote:

> > int poke_int3_handler(struct pt_regs *regs)
> > {
> > + long ip = regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE;
> > + struct opcode {
> > + u8 insn;
> > + s32 rel;
> > + } __packed opcode;
> > +
> > /*
> > * Having observed our INT3 instruction, we now must observe
> > * bp_patching_in_progress.
> > *
> > - * in_progress = TRUE INT3
> > - * WMB RMB
> > - * write INT3 if (in_progress)
> > + * in_progress = TRUE INT3
> > + * WMB RMB
> > + * write INT3 if (in_progress)
>
> I don’t see what has changed in this chunk… Whitespaces?

Yep, my editor kept marking that stuff red (space before tab), which
annoyed me enough to fix it.

> > *
> > - * Idem for bp_int3_handler.
> > + * Idem for bp_int3_opcode.
> > */
> > smp_rmb();
> >
> > @@ -943,8 +949,21 @@ int poke_int3_handler(struct pt_regs *re
> > if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> > return 0;
> >
> > - /* set up the specified breakpoint handler */
> > - regs->ip = (unsigned long) bp_int3_handler;
> > + opcode = *(struct opcode *)bp_int3_opcode;
> > +
> > + switch (opcode.insn) {
> > + case 0xE8: /* CALL */
> > + int3_emulate_call(regs, ip + opcode.rel);
> > + break;
> > +
> > + case 0xE9: /* JMP */
> > + int3_emulate_jmp(regs, ip + opcode.rel);
> > + break;
>
> Consider using RELATIVECALL_OPCODE and RELATIVEJUMP_OPCODE instead of the
> constants (0xE8, 0xE9), just as you do later in the patch.

Those are private to kprobes..

but I can do something like so:

--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -48,8 +48,14 @@ static inline void int3_emulate_jmp(stru
regs->ip = ip;
}

-#define INT3_INSN_SIZE 1
-#define CALL_INSN_SIZE 5
+#define INT3_INSN_SIZE 1
+#define INT3_INSN_OPCODE 0xCC
+
+#define CALL_INSN_SIZE 5
+#define CALL_INSN_OPCODE 0xE8
+
+#define JMP_INSN_SIZE 5
+#define JMP_INSN_OPCODE 0xE9

static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
{
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -952,11 +952,11 @@ int poke_int3_handler(struct pt_regs *re
opcode = *(struct opcode *)bp_int3_opcode;

switch (opcode.insn) {
- case 0xE8: /* CALL */
+ case CALL_INSN_OPCODE:
int3_emulate_call(regs, ip + opcode.rel);
break;

- case 0xE9: /* JMP */
+ case JMP_INSN_OPCODE:
int3_emulate_jmp(regs, ip + opcode.rel);
break;

2019-06-07 14:29:43

by Masami Hiramatsu

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Fri, 7 Jun 2019 10:20:13 +0200
Peter Zijlstra <[email protected]> wrote:

> On Fri, Jun 07, 2019 at 05:41:42AM +0000, Nadav Amit wrote:
>
> > > int poke_int3_handler(struct pt_regs *regs)
> > > {
> > > + long ip = regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE;
> > > + struct opcode {
> > > + u8 insn;
> > > + s32 rel;
> > > + } __packed opcode;
> > > +
> > > /*
> > > * Having observed our INT3 instruction, we now must observe
> > > * bp_patching_in_progress.
> > > *
> > > - * in_progress = TRUE INT3
> > > - * WMB RMB
> > > - * write INT3 if (in_progress)
> > > + * in_progress = TRUE INT3
> > > + * WMB RMB
> > > + * write INT3 if (in_progress)
> >
> > I don’t see what has changed in this chunk… Whitespaces?
>
> Yep, my editor kept marking that stuff red (space before tab), which
> annoyed me enough to fix it.
>
> > > *
> > > - * Idem for bp_int3_handler.
> > > + * Idem for bp_int3_opcode.
> > > */
> > > smp_rmb();
> > >
> > > @@ -943,8 +949,21 @@ int poke_int3_handler(struct pt_regs *re
> > > if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> > > return 0;
> > >
> > > - /* set up the specified breakpoint handler */
> > > - regs->ip = (unsigned long) bp_int3_handler;
> > > + opcode = *(struct opcode *)bp_int3_opcode;
> > > +
> > > + switch (opcode.insn) {
> > > + case 0xE8: /* CALL */
> > > + int3_emulate_call(regs, ip + opcode.rel);
> > > + break;
> > > +
> > > + case 0xE9: /* JMP */
> > > + int3_emulate_jmp(regs, ip + opcode.rel);
> > > + break;
> >
> > Consider using RELATIVECALL_OPCODE and RELATIVEJUMP_OPCODE instead of the
> > constants (0xE8, 0xE9), just as you do later in the patch.
>
> Those are private to kprobes..
>
> but I can do something like so:
>
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -48,8 +48,14 @@ static inline void int3_emulate_jmp(stru
> regs->ip = ip;
> }
>
> -#define INT3_INSN_SIZE 1
> -#define CALL_INSN_SIZE 5
> +#define INT3_INSN_SIZE 1
> +#define INT3_INSN_OPCODE 0xCC
> +
> +#define CALL_INSN_SIZE 5
> +#define CALL_INSN_OPCODE 0xE8
> +
> +#define JMP_INSN_SIZE 5
> +#define JMP_INSN_OPCODE 0xE9
>
> static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
> {
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -952,11 +952,11 @@ int poke_int3_handler(struct pt_regs *re
> opcode = *(struct opcode *)bp_int3_opcode;
>
> switch (opcode.insn) {
> - case 0xE8: /* CALL */
> + case CALL_INSN_OPCODE:
> int3_emulate_call(regs, ip + opcode.rel);
> break;
>
> - case 0xE9: /* JMP */
> + case JMP_INSN_OPCODE:
> int3_emulate_jmp(regs, ip + opcode.rel);
> break;
>

This looks good. I don't want to keep those opcodes private;
I would like to share them.

Thank you,

--
Masami Hiramatsu <[email protected]>

2019-06-07 18:26:39

by Masami Hiramatsu

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Wed, 05 Jun 2019 15:08:01 +0200
Peter Zijlstra <[email protected]> wrote:

> In preparation for static_call support, teach text_poke_bp() to
> emulate instructions, including CALL.
>
> The current text_poke_bp() takes a @handler argument which is used as
> a jump target when the temporary INT3 is hit by a different CPU.
>
> When patching CALL instructions, this doesn't work because we'd miss
> the PUSH of the return address. Instead, teach poke_int3_handler() to
> emulate an instruction, typically the instruction we're patching in.
>
> This fits almost all text_poke_bp() users, except
> arch_unoptimize_kprobe() which restores random text, and for that site
> we have to build an explicit emulate instruction.

Hm, actually it doesn't restore random text, since the first byte
must always be int3. As the function name indicates, it just unoptimizes
(jump-based optprobe -> int3-based kprobe).
Anyway, that is not an issue. With this patch, optprobe must still work.

>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Nadav Amit <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> arch/x86/include/asm/text-patching.h | 2 -
> arch/x86/kernel/alternative.c | 47 ++++++++++++++++++++++++++---------
> arch/x86/kernel/jump_label.c | 3 --
> arch/x86/kernel/kprobes/opt.c | 11 +++++---
> 4 files changed, 46 insertions(+), 17 deletions(-)
>
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -37,7 +37,7 @@ extern void text_poke_early(void *addr,
> extern void *text_poke(void *addr, const void *opcode, size_t len);
> extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
> extern int poke_int3_handler(struct pt_regs *regs);
> -extern void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> +extern void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate);
> extern int after_bootmem;
> extern __ro_after_init struct mm_struct *poking_mm;
> extern __ro_after_init unsigned long poking_addr;
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -921,19 +921,25 @@ static void do_sync_core(void *info)
> }
>
> static bool bp_patching_in_progress;
> -static void *bp_int3_handler, *bp_int3_addr;
> +static const void *bp_int3_opcode, *bp_int3_addr;
>
> int poke_int3_handler(struct pt_regs *regs)
> {
> + long ip = regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE;
> + struct opcode {
> + u8 insn;
> + s32 rel;
> + } __packed opcode;
> +
> /*
> * Having observed our INT3 instruction, we now must observe
> * bp_patching_in_progress.
> *
> - * in_progress = TRUE INT3
> - * WMB RMB
> - * write INT3 if (in_progress)
> + * in_progress = TRUE INT3
> + * WMB RMB
> + * write INT3 if (in_progress)
> *
> - * Idem for bp_int3_handler.
> + * Idem for bp_int3_opcode.
> */
> smp_rmb();
>
> @@ -943,8 +949,21 @@ int poke_int3_handler(struct pt_regs *re
> if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> return 0;
>
> - /* set up the specified breakpoint handler */
> - regs->ip = (unsigned long) bp_int3_handler;
> + opcode = *(struct opcode *)bp_int3_opcode;
> +
> + switch (opcode.insn) {
> + case 0xE8: /* CALL */
> + int3_emulate_call(regs, ip + opcode.rel);
> + break;
> +
> + case 0xE9: /* JMP */
> + int3_emulate_jmp(regs, ip + opcode.rel);
> + break;
> +
> + default: /* assume NOP */

Shouldn't we check whether it is actually a NOP here?

> + int3_emulate_jmp(regs, ip);
> + break;
> + }

BTW, if we fix the patching length to always be 5 bytes and only allow
users to patch from/to jump/call/nop, it may be better to remove
"len" and rename the function, something like "text_poke_branch" etc.

Thank you,

>
> return 1;
> }
> @@ -955,7 +974,7 @@ NOKPROBE_SYMBOL(poke_int3_handler);
> * @addr: address to patch
> * @opcode: opcode of new instruction
> * @len: length to copy
> - * @handler: address to jump to when the temporary breakpoint is hit
> + * @emulate: opcode to emulate, when NULL use @opcode
> *
> * Modify multi-byte instruction by using int3 breakpoint on SMP.
> * We completely avoid stop_machine() here, and achieve the
> @@ -970,19 +989,25 @@ NOKPROBE_SYMBOL(poke_int3_handler);
> * replacing opcode
> * - sync cores
> */
> -void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> +void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate)
> {
> unsigned char int3 = 0xcc;
>
> - bp_int3_handler = handler;
> + bp_int3_opcode = emulate ?: opcode;
> bp_int3_addr = (u8 *)addr + sizeof(int3);
> bp_patching_in_progress = true;
>
> lockdep_assert_held(&text_mutex);
>
> /*
> + * poke_int3_handler() relies on @opcode being a 5 byte instruction;
> + * notably a JMP, CALL or NOP5_ATOMIC.
> + */
> + BUG_ON(len != 5);
> +
> + /*
> * Corresponding read barrier in int3 notifier for making sure the
> - * in_progress and handler are correctly ordered wrt. patching.
> + * in_progress and opcode are correctly ordered wrt. patching.
> */
> smp_wmb();
>
> --- a/arch/x86/kernel/jump_label.c
> +++ b/arch/x86/kernel/jump_label.c
> @@ -87,8 +87,7 @@ static void __ref __jump_label_transform
> return;
> }
>
> - text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE,
> - (void *)jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
> + text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE, NULL);
> }
>
> void arch_jump_label_transform(struct jump_entry *entry,
> --- a/arch/x86/kernel/kprobes/opt.c
> +++ b/arch/x86/kernel/kprobes/opt.c
> @@ -437,8 +437,7 @@ void arch_optimize_kprobes(struct list_h
> insn_buff[0] = RELATIVEJUMP_OPCODE;
> *(s32 *)(&insn_buff[1]) = rel;
>
> - text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE,
> - op->optinsn.insn);
> + text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE, NULL);
>
> list_del_init(&op->list);
> }
> @@ -448,12 +447,18 @@ void arch_optimize_kprobes(struct list_h
> void arch_unoptimize_kprobe(struct optimized_kprobe *op)
> {
> u8 insn_buff[RELATIVEJUMP_SIZE];
> + u8 emulate_buff[RELATIVEJUMP_SIZE];
>
> /* Set int3 to first byte for kprobes */
> insn_buff[0] = BREAKPOINT_INSTRUCTION;
> memcpy(insn_buff + 1, op->optinsn.copied_insn, RELATIVE_ADDR_SIZE);
> +
> + emulate_buff[0] = RELATIVEJUMP_OPCODE;
> + *(s32 *)(&emulate_buff[1]) = (s32)((long)op->optinsn.insn -
> + ((long)op->kp.addr + RELATIVEJUMP_SIZE));
> +
> text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE,
> - op->optinsn.insn);
> + emulate_buff);
> }
>
> /*
>
>


--
Masami Hiramatsu <[email protected]>

2019-06-07 18:57:58

by Peter Zijlstra

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Sat, Jun 08, 2019 at 12:47:08AM +0900, Masami Hiramatsu wrote:

> > This fits almost all text_poke_bp() users, except
> > arch_unoptimize_kprobe() which restores random text, and for that site
> > we have to build an explicit emulate instruction.
>
> > Hm, actually it doesn't restore random text, since the first byte
> > must always be int3. As the function name indicates, it just unoptimizes
> (jump based optprobe -> int3 based kprobe).
> Anyway, that is not an issue. With this patch, optprobe must still work.

I thought it basically restored 5 bytes of original text (with no
guarantee it is a single instruction, or even a complete instruction),
with the first byte replaced with INT3.

> > @@ -943,8 +949,21 @@ int poke_int3_handler(struct pt_regs *re
> > if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> > return 0;
> >
> > - /* set up the specified breakpoint handler */
> > - regs->ip = (unsigned long) bp_int3_handler;
> > + opcode = *(struct opcode *)bp_int3_opcode;
> > +
> > + switch (opcode.insn) {
> > + case 0xE8: /* CALL */
> > + int3_emulate_call(regs, ip + opcode.rel);
> > + break;
> > +
> > + case 0xE9: /* JMP */
> > + int3_emulate_jmp(regs, ip + opcode.rel);
> > + break;
> > +
> > + default: /* assume NOP */
>
> Shouldn't we check whether it is actually NOP here?

I was/am lazy and didn't want to deal with:

arch/x86/include/asm/nops.h:#define GENERIC_NOP5_ATOMIC NOP_DS_PREFIX,GENERIC_NOP4
arch/x86/include/asm/nops.h:#define K8_NOP5_ATOMIC 0x66,K8_NOP4
arch/x86/include/asm/nops.h:#define K7_NOP5_ATOMIC NOP_DS_PREFIX,K7_NOP4
arch/x86/include/asm/nops.h:#define P6_NOP5_ATOMIC P6_NOP5

But maybe we should check for all the various NOP5 variants and BUG() on
anything unexpected.

> > + int3_emulate_jmp(regs, ip);
> > + break;
> > + }
>
> BTW, if we fix the length of patching always 5 bytes and allow user
> to apply it only from/to jump/call/nop, we may be better to remove
> "len" and rename it, something like "text_poke_branch" etc.

I considered it; but was thinking we could still allow patching other
instructions, we'd just have to extend the emulation in
poke_int3_handler().

Then again, if/when we want to do that, we can also restore the @len
argument again.

2019-06-07 18:59:10

by Linus Torvalds

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Fri, Jun 7, 2019 at 10:34 AM Peter Zijlstra <[email protected]> wrote:
>
> I was/am lazy and didn't want to deal with:
>
> arch/x86/include/asm/nops.h:#define GENERIC_NOP5_ATOMIC NOP_DS_PREFIX,GENERIC_NOP4
> arch/x86/include/asm/nops.h:#define K8_NOP5_ATOMIC 0x66,K8_NOP4
> arch/x86/include/asm/nops.h:#define K7_NOP5_ATOMIC NOP_DS_PREFIX,K7_NOP4
> arch/x86/include/asm/nops.h:#define P6_NOP5_ATOMIC P6_NOP5

Ugh. Maybe we could just pick one atomic sequence, and not have the
magic atomic nops be dynamic.

It's essentially what STATIC_KEY_INIT_NOP #define seems to do anyway.

NOP5_ATOMIC is already special, and not used for the normal nop
rewriting, only for kprobe/jump_label/ftrace.

So I suspect we could just replace all cases of

ideal_nops[NOP_ATOMIC5]

with

STATIC_KEY_INIT_NOP

and get rid of the whole "let's optimize the atomic 5-byte nop" entirely.

Hmm?

By definition, NOP_ATOMIC5 is just a single nop anyway, it's not used
for the potentially more complex alternative rewriting cases.

Linus

2019-06-07 19:01:22

by Andy Lutomirski

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions



> On Jun 7, 2019, at 10:34 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Sat, Jun 08, 2019 at 12:47:08AM +0900, Masami Hiramatsu wrote:
>
>>> This fits almost all text_poke_bp() users, except
>>> arch_unoptimize_kprobe() which restores random text, and for that site
>>> we have to build an explicit emulate instruction.
>>
>> Hm, actually it doesn't restore random text, since the first byte
>> must always be int3. As the function name indicates, it just unoptimizes
>> (jump based optprobe -> int3 based kprobe).
>> Anyway, that is not an issue. With this patch, optprobe must still work.
>
> I thought it basically restored 5 bytes of original text (with no
> guarantee it is a single instruction, or even a complete instruction),
> with the first byte replaced with INT3.
>

I am surely missing some kprobe context, but is it really safe to use this mechanism to replace more than one instruction?

2019-06-07 20:32:29

by H. Peter Anvin

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On June 7, 2019 11:10:19 AM PDT, Andy Lutomirski <[email protected]> wrote:
>
>
>> On Jun 7, 2019, at 10:34 AM, Peter Zijlstra <[email protected]>
>wrote:
>>
>> On Sat, Jun 08, 2019 at 12:47:08AM +0900, Masami Hiramatsu wrote:
>>
>>>> This fits almost all text_poke_bp() users, except
>>>> arch_unoptimize_kprobe() which restores random text, and for that
>site
>>>> we have to build an explicit emulate instruction.
>>>
>>> Hm, actually it doesn't restore random text, since the first byte
>>> must always be int3. As the function name indicates, it just unoptimizes
>>> (jump based optprobe -> int3 based kprobe).
>>> Anyway, that is not an issue. With this patch, optprobe must still
>work.
>>
>> I thought it basically restored 5 bytes of original text (with no
>> guarantee it is a single instruction, or even a complete
>instruction),
>> with the first byte replaced with INT3.
>>
>
>I am surely missing some kprobe context, but is it really safe to use
>this mechanism to replace more than one instruction?

I don't see how it could be, except *perhaps* inside an NMI handler, because you could have a preempted or interrupted task now having an in-memory IP pointing into the middle of the region you are intending to patch.



2019-06-10 17:00:38

by Josh Poimboeuf

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Wed, Jun 05, 2019 at 03:08:01PM +0200, Peter Zijlstra wrote:
> In preparation for static_call support, teach text_poke_bp() to
> emulate instructions, including CALL.
>
> The current text_poke_bp() takes a @handler argument which is used as
> a jump target when the temporary INT3 is hit by a different CPU.
>
> When patching CALL instructions, this doesn't work because we'd miss
> the PUSH of the return address. Instead, teach poke_int3_handler() to
> emulate an instruction, typically the instruction we're patching in.
>
> This fits almost all text_poke_bp() users, except
> arch_unoptimize_kprobe() which restores random text, and for that site
> we have to build an explicit emulate instruction.
>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Nadav Amit <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>

Reviewed-by: Josh Poimboeuf <[email protected]>

--
Josh

2019-06-11 08:13:15

by Peter Zijlstra

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Fri, Jun 07, 2019 at 11:10:19AM -0700, Andy Lutomirski wrote:
>
>
> > On Jun 7, 2019, at 10:34 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > On Sat, Jun 08, 2019 at 12:47:08AM +0900, Masami Hiramatsu wrote:
> >
> >>> This fits almost all text_poke_bp() users, except
> >>> arch_unoptimize_kprobe() which restores random text, and for that site
> >>> we have to build an explicit emulate instruction.
> >>
> >> Hm, actually it doesn't restore random text, since the first byte
> >> must always be int3. As the function name indicates, it just unoptimizes
> >> (jump based optprobe -> int3 based kprobe).
> >> Anyway, that is not an issue. With this patch, optprobe must still work.
> >
> > I thought it basically restored 5 bytes of original text (with no
> > guarantee it is a single instruction, or even a complete instruction),
> > with the first byte replaced with INT3.
> >
>
> I am surely missing some kprobe context, but is it really safe to use
> this mechanism to replace more than one instruction?

I'm not entirely up-to-scratch here, so Masami, please correct me if I'm
wrong.

So what happens is that arch_prepare_optimized_kprobe() <-
copy_optimized_instructions() copies however much of the instruction
stream is required such that we can overwrite the instruction at @addr
with a 5 byte jump.

arch_optimize_kprobe() then does the text_poke_bp() that replaces the
instruction @addr with int3, copies the rel jump address and overwrites
the int3 with jmp.

And I'm thinking the problem is with something like:

@addr: nop nop nop nop nop

We copy out the nops into the trampoline, overwrite the first nop with
an INT3, overwrite the remaining nops with the rel addr, but oops,
another CPU can still be executing one of those NOPs, right?

I'm thinking we could fix this by first writing INT3 into all relevant
instructions, which is going to be messy, given the current code base.

2019-06-11 12:13:48

by Peter Zijlstra

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Fri, Jun 07, 2019 at 10:48:06AM -0700, Linus Torvalds wrote:
> On Fri, Jun 7, 2019 at 10:34 AM Peter Zijlstra <[email protected]> wrote:
> >
> > I was/am lazy and didn't want to deal with:
> >
> > arch/x86/include/asm/nops.h:#define GENERIC_NOP5_ATOMIC NOP_DS_PREFIX,GENERIC_NOP4
> > arch/x86/include/asm/nops.h:#define K8_NOP5_ATOMIC 0x66,K8_NOP4
> > arch/x86/include/asm/nops.h:#define K7_NOP5_ATOMIC NOP_DS_PREFIX,K7_NOP4
> > arch/x86/include/asm/nops.h:#define P6_NOP5_ATOMIC P6_NOP5
>
> Ugh. Maybe we could just pick one atomic sequence, and not have the
> magic atomic nops be dynamic.

That'd be nice..

> It's essentially what STATIC_KEY_INIT_NOP #define seems to do anyway.

Well, that picks something, we'll overwrite it with the ideal nop later,
once we've figured out what it should be.

> NOP5_ATOMIC is already special, and not used for the normal nop
> rewriting, only for kprobe/jump_label/ftrace.

Right, but esp ftrace means there's a _lot_ of them around.

> So I suspect we could just replace all cases of
>
> ideal_nops[NOP_ATOMIC5]
>
> with
>
> STATIC_KEY_INIT_NOP
>
> and get rid of the whole "let's optimize the atomic 5-byte nop" entirely.
>
> Hmm?

So we have:

GENERIC (x86_32): leal ds:0x00(,%esi,1),%esi
K8 (x86_64): osp osp osp osp nop
K7 (x86_32): leal ds:0x00(,%eax,1),%eax
P6 (x86_64): nopl 0x00(%eax,%eax,1)

And I guess the $64k question is if there's any actual performance
difference between the k8 and p6 variants on chips we still care about.

Most modern chips seem to end up selecting p6.
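Written out as raw bytes, those candidates look like this (a stand-alone sketch; the bytes are transcribed by hand from nops.h, so treat them as illustrative rather than authoritative):

```c
#include <assert.h>
#include <stddef.h>

/* 5-byte atomic NOP candidates, transcribed from arch/x86/include/asm/nops.h */
static const unsigned char generic_nop5_atomic[] =	/* ds leal 0x00(,%esi,1),%esi */
	{ 0x3e, 0x8d, 0x74, 0x26, 0x00 };
static const unsigned char k8_nop5_atomic[] =		/* osp osp osp osp nop */
	{ 0x66, 0x66, 0x66, 0x66, 0x90 };
static const unsigned char p6_nop5_atomic[] =		/* nopl 0x00(%eax,%eax,1) */
	{ 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/*
 * The property every variant must share: exactly 5 bytes, a single
 * instruction, so it can later be replaced in place by a 5-byte JMP or CALL.
 */
size_t generic_len(void) { return sizeof(generic_nop5_atomic); }
size_t k8_len(void)      { return sizeof(k8_nop5_atomic); }
size_t p6_len(void)      { return sizeof(p6_nop5_atomic); }
```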

Anyway, the proposed patch looks like so:

---
Subject: x86: Remove ideal_nops[NOP_ATOMIC5]

By picking a single/fixed NOP5_ATOMIC instruction things become simpler.

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/include/asm/jump_label.h | 12 +++---------
arch/x86/include/asm/nops.h | 17 ++++++++---------
arch/x86/kernel/alternative.c | 8 --------
arch/x86/kernel/ftrace.c | 3 ++-
arch/x86/kernel/jump_label.c | 37 ++++++-------------------------------
arch/x86/kernel/kprobes/core.c | 3 ++-
6 files changed, 21 insertions(+), 59 deletions(-)

diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
index 65191ce8e1cf..a3d45abcda95 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -4,12 +4,6 @@

#define JUMP_LABEL_NOP_SIZE 5

-#ifdef CONFIG_X86_64
-# define STATIC_KEY_INIT_NOP P6_NOP5_ATOMIC
-#else
-# define STATIC_KEY_INIT_NOP GENERIC_NOP5_ATOMIC
-#endif
-
#include <asm/asm.h>
#include <asm/nops.h>

@@ -21,7 +15,7 @@
static __always_inline bool arch_static_branch(struct static_key *key, bool branch)
{
asm_volatile_goto("1:"
- ".byte " __stringify(STATIC_KEY_INIT_NOP) "\n\t"
+ ".byte " __stringify(NOP5_ATOMIC) "\n\t"
".pushsection __jump_table, \"aw\" \n\t"
_ASM_ALIGN "\n\t"
".long 1b - ., %l[l_yes] - . \n\t"
@@ -61,7 +55,7 @@ static __always_inline bool arch_static_branch_jump(struct static_key *key, bool
.long \target - .Lstatic_jump_after_\@
.Lstatic_jump_after_\@:
.else
- .byte STATIC_KEY_INIT_NOP
+ .byte NOP5_ATOMIC
.endif
.pushsection __jump_table, "aw"
_ASM_ALIGN
@@ -73,7 +67,7 @@ static __always_inline bool arch_static_branch_jump(struct static_key *key, bool
.macro STATIC_JUMP_IF_FALSE target, key, def
.Lstatic_jump_\@:
.if \def
- .byte STATIC_KEY_INIT_NOP
+ .byte NOP5_ATOMIC
.else
/* Equivalent to "jmp.d32 \target" */
.byte 0xe9
diff --git a/arch/x86/include/asm/nops.h b/arch/x86/include/asm/nops.h
index 12f12b5cf2ca..14cf05e645f5 100644
--- a/arch/x86/include/asm/nops.h
+++ b/arch/x86/include/asm/nops.h
@@ -28,7 +28,6 @@
#define GENERIC_NOP6 0x8d,0xb6,0x00,0x00,0x00,0x00
#define GENERIC_NOP7 0x8d,0xb4,0x26,0x00,0x00,0x00,0x00
#define GENERIC_NOP8 GENERIC_NOP1,GENERIC_NOP7
-#define GENERIC_NOP5_ATOMIC NOP_DS_PREFIX,GENERIC_NOP4

/* Opteron 64bit nops
1: nop
@@ -44,7 +43,6 @@
#define K8_NOP6 K8_NOP3,K8_NOP3
#define K8_NOP7 K8_NOP4,K8_NOP3
#define K8_NOP8 K8_NOP4,K8_NOP4
-#define K8_NOP5_ATOMIC 0x66,K8_NOP4

/* K7 nops
uses eax dependencies (arbitrary choice)
@@ -63,7 +61,6 @@
#define K7_NOP6 0x8d,0x80,0,0,0,0
#define K7_NOP7 0x8D,0x04,0x05,0,0,0,0
#define K7_NOP8 K7_NOP7,K7_NOP1
-#define K7_NOP5_ATOMIC NOP_DS_PREFIX,K7_NOP4

/* P6 nops
uses eax dependencies (Intel-recommended choice)
@@ -86,7 +83,12 @@
#define P6_NOP6 0x66,0x0f,0x1f,0x44,0x00,0
#define P6_NOP7 0x0f,0x1f,0x80,0,0,0,0
#define P6_NOP8 0x0f,0x1f,0x84,0x00,0,0,0,0
-#define P6_NOP5_ATOMIC P6_NOP5
+
+#ifdef CONFIG_X86_64
+#define NOP5_ATOMIC P6_NOP5
+#else
+#define NOP5_ATOMIC NOP_DS_PREFIX,GENERIC_NOP4
+#endif

#ifdef __ASSEMBLY__
#define _ASM_MK_NOP(x) .byte x
@@ -103,7 +105,6 @@
#define ASM_NOP6 _ASM_MK_NOP(K7_NOP6)
#define ASM_NOP7 _ASM_MK_NOP(K7_NOP7)
#define ASM_NOP8 _ASM_MK_NOP(K7_NOP8)
-#define ASM_NOP5_ATOMIC _ASM_MK_NOP(K7_NOP5_ATOMIC)
#elif defined(CONFIG_X86_P6_NOP)
#define ASM_NOP1 _ASM_MK_NOP(P6_NOP1)
#define ASM_NOP2 _ASM_MK_NOP(P6_NOP2)
@@ -113,7 +114,6 @@
#define ASM_NOP6 _ASM_MK_NOP(P6_NOP6)
#define ASM_NOP7 _ASM_MK_NOP(P6_NOP7)
#define ASM_NOP8 _ASM_MK_NOP(P6_NOP8)
-#define ASM_NOP5_ATOMIC _ASM_MK_NOP(P6_NOP5_ATOMIC)
#elif defined(CONFIG_X86_64)
#define ASM_NOP1 _ASM_MK_NOP(K8_NOP1)
#define ASM_NOP2 _ASM_MK_NOP(K8_NOP2)
@@ -123,7 +123,6 @@
#define ASM_NOP6 _ASM_MK_NOP(K8_NOP6)
#define ASM_NOP7 _ASM_MK_NOP(K8_NOP7)
#define ASM_NOP8 _ASM_MK_NOP(K8_NOP8)
-#define ASM_NOP5_ATOMIC _ASM_MK_NOP(K8_NOP5_ATOMIC)
#else
#define ASM_NOP1 _ASM_MK_NOP(GENERIC_NOP1)
#define ASM_NOP2 _ASM_MK_NOP(GENERIC_NOP2)
@@ -133,11 +132,11 @@
#define ASM_NOP6 _ASM_MK_NOP(GENERIC_NOP6)
#define ASM_NOP7 _ASM_MK_NOP(GENERIC_NOP7)
#define ASM_NOP8 _ASM_MK_NOP(GENERIC_NOP8)
-#define ASM_NOP5_ATOMIC _ASM_MK_NOP(GENERIC_NOP5_ATOMIC)
#endif

+#define ASM_NOP5_ATOMIC _ASM_MK_NOP(NOP5_ATOMIC)
+
#define ASM_NOP_MAX 8
-#define NOP_ATOMIC5 (ASM_NOP_MAX+1) /* Entry for the 5-byte atomic NOP */

#ifndef __ASSEMBLY__
extern const unsigned char * const *ideal_nops;
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 0d57015114e7..4c0250049d4f 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -90,7 +90,6 @@ static const unsigned char intelnops[] =
GENERIC_NOP6,
GENERIC_NOP7,
GENERIC_NOP8,
- GENERIC_NOP5_ATOMIC
};
static const unsigned char * const intel_nops[ASM_NOP_MAX+2] =
{
@@ -103,7 +102,6 @@ static const unsigned char * const intel_nops[ASM_NOP_MAX+2] =
intelnops + 1 + 2 + 3 + 4 + 5,
intelnops + 1 + 2 + 3 + 4 + 5 + 6,
intelnops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
- intelnops + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8,
};
#endif

@@ -118,7 +116,6 @@ static const unsigned char k8nops[] =
K8_NOP6,
K8_NOP7,
K8_NOP8,
- K8_NOP5_ATOMIC
};
static const unsigned char * const k8_nops[ASM_NOP_MAX+2] =
{
@@ -131,7 +128,6 @@ static const unsigned char * const k8_nops[ASM_NOP_MAX+2] =
k8nops + 1 + 2 + 3 + 4 + 5,
k8nops + 1 + 2 + 3 + 4 + 5 + 6,
k8nops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
- k8nops + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8,
};
#endif

@@ -146,7 +142,6 @@ static const unsigned char k7nops[] =
K7_NOP6,
K7_NOP7,
K7_NOP8,
- K7_NOP5_ATOMIC
};
static const unsigned char * const k7_nops[ASM_NOP_MAX+2] =
{
@@ -159,7 +154,6 @@ static const unsigned char * const k7_nops[ASM_NOP_MAX+2] =
k7nops + 1 + 2 + 3 + 4 + 5,
k7nops + 1 + 2 + 3 + 4 + 5 + 6,
k7nops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
- k7nops + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8,
};
#endif

@@ -174,7 +168,6 @@ static const unsigned char p6nops[] =
P6_NOP6,
P6_NOP7,
P6_NOP8,
- P6_NOP5_ATOMIC
};
static const unsigned char * const p6_nops[ASM_NOP_MAX+2] =
{
@@ -187,7 +180,6 @@ static const unsigned char * const p6_nops[ASM_NOP_MAX+2] =
p6nops + 1 + 2 + 3 + 4 + 5,
p6nops + 1 + 2 + 3 + 4 + 5 + 6,
p6nops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
- p6nops + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8,
};
#endif

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 0927bb158ffc..6ea5ea506a5f 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -101,7 +101,8 @@ static unsigned long text_ip_addr(unsigned long ip)

static const unsigned char *ftrace_nop_replace(void)
{
- return ideal_nops[NOP_ATOMIC5];
+ static const unsigned char nop[] = { NOP5_ATOMIC };
+ return nop;
}

static int
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index e631c358f7f4..5a27cf6e1c73 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -40,8 +40,7 @@ static void __ref __jump_label_transform(struct jump_entry *entry,
int init)
{
union jump_code_union jmp;
- const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
- const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];
+ const unsigned char nop[] = { NOP5_ATOMIC };
const void *expect, *code;
int line;

@@ -50,21 +49,13 @@ static void __ref __jump_label_transform(struct jump_entry *entry,
(jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);

if (type == JUMP_LABEL_JMP) {
- if (init) {
- expect = default_nop; line = __LINE__;
- } else {
- expect = ideal_nop; line = __LINE__;
- }
-
+ expect = nop;
+ line = __LINE__;
code = &jmp.code;
} else {
- if (init) {
- expect = default_nop; line = __LINE__;
- } else {
- expect = &jmp.code; line = __LINE__;
- }
-
- code = ideal_nop;
+ expect = &jmp.code;
+ line = __LINE__;
+ code = nop;
}

if (memcmp((void *)jump_entry_code(entry), expect, JUMP_LABEL_NOP_SIZE))
@@ -108,22 +99,6 @@ static enum {
__init_or_module void arch_jump_label_transform_static(struct jump_entry *entry,
enum jump_label_type type)
{
- /*
- * This function is called at boot up and when modules are
- * first loaded. Check if the default nop, the one that is
- * inserted at compile time, is the ideal nop. If it is, then
- * we do not need to update the nop, and we can leave it as is.
- * If it is not, then we need to update the nop to the ideal nop.
- */
- if (jlstate == JL_STATE_START) {
- const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
- const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];
-
- if (memcmp(ideal_nop, default_nop, 5) != 0)
- jlstate = JL_STATE_UPDATE;
- else
- jlstate = JL_STATE_NO_UPDATE;
- }
if (jlstate == JL_STATE_UPDATE)
__jump_label_transform(entry, type, 1);
}
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 6afd8061dbae..5b9aa5608d0d 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -204,6 +204,7 @@ int can_boost(struct insn *insn, void *addr)
static unsigned long
__recover_probed_insn(kprobe_opcode_t *buf, unsigned long addr)
{
+ static const unsigned char nop[] = { NOP5_ATOMIC };
struct kprobe *kp;
unsigned long faddr;

@@ -247,7 +248,7 @@ __recover_probed_insn(kprobe_opcode_t *buf, unsigned long addr)
return 0UL;

if (faddr)
- memcpy(buf, ideal_nops[NOP_ATOMIC5], 5);
+ memcpy(buf, nop, 5);
else
buf[0] = kp->opcode;
return (unsigned long)buf;

2019-06-11 12:26:55

by Peter Zijlstra

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Tue, Jun 11, 2019 at 10:03:07AM +0200, Peter Zijlstra wrote:
> On Fri, Jun 07, 2019 at 11:10:19AM -0700, Andy Lutomirski wrote:

> > I am surely missing some kprobe context, but is it really safe to use
> > this mechanism to replace more than one instruction?
>
> I'm not entirely up-to-scratch here, so Masami, please correct me if I'm
> wrong.
>
> So what happens is that arch_prepare_optimized_kprobe() <-
> copy_optimized_instructions() copies however much of the instruction
> stream is required such that we can overwrite the instruction at @addr
> with a 5 byte jump.
>
> arch_optimize_kprobe() then does the text_poke_bp() that replaces the
> instruction @addr with int3, copies the rel jump address and overwrites
> the int3 with jmp.
>
> And I'm thinking the problem is with something like:
>
> @addr: nop nop nop nop nop
>
> We copy out the nops into the trampoline, overwrite the first nop with
> an INT3, overwrite the remaining nops with the rel addr, but oops,
> another CPU can still be executing one of those NOPs, right?
>
> I'm thinking we could fix this by first writing INT3 into all relevant
> instructions, which is going to be messy, given the current code base.

Maybe not that bad; how's something like this?

(completely untested)

---
arch/x86/kernel/alternative.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 0d57015114e7..8f643dabea72 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -24,6 +24,7 @@
#include <asm/tlbflush.h>
#include <asm/io.h>
#include <asm/fixmap.h>
+#include <asm/insn.h>

int __read_mostly alternatives_patched;

@@ -849,6 +850,7 @@ static void do_sync_core(void *info)

static bool bp_patching_in_progress;
static void *bp_int3_handler, *bp_int3_addr;
+static unsigned int bp_int3_length;

int poke_int3_handler(struct pt_regs *regs)
{
@@ -867,7 +869,11 @@ int poke_int3_handler(struct pt_regs *regs)
if (likely(!bp_patching_in_progress))
return 0;

- if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
+ if (user_mode(regs))
+ return 0;
+
+ if (regs->ip < (unsigned long)bp_int3_addr ||
+ regs->ip >= (unsigned long)bp_int3_addr + bp_int3_length)
return 0;

/* set up the specified breakpoint handler */
@@ -900,9 +906,12 @@ NOKPROBE_SYMBOL(poke_int3_handler);
void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
{
unsigned char int3 = 0xcc;
+ void *kaddr = addr;
+ struct insn insn;

bp_int3_handler = handler;
bp_int3_addr = (u8 *)addr + sizeof(int3);
+ bp_int3_length = len - sizeof(int3);
bp_patching_in_progress = true;

lockdep_assert_held(&text_mutex);
@@ -913,7 +922,14 @@ void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
*/
smp_wmb();

- text_poke(addr, &int3, sizeof(int3));
+ do {
+ kernel_insn_init(&insn, kaddr, MAX_INSN_SIZE);
+ insn_get_length(&insn);
+
+ text_poke(kaddr, &int3, sizeof(int3));
+
+ kaddr += insn.length;
+ } while (kaddr < addr + len);

on_each_cpu(do_sync_core, NULL, 1);

2019-06-11 12:35:11

by Peter Zijlstra

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Tue, Jun 11, 2019 at 02:08:34PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 11, 2019 at 10:03:07AM +0200, Peter Zijlstra wrote:
> > On Fri, Jun 07, 2019 at 11:10:19AM -0700, Andy Lutomirski wrote:
>
> > > I am surely missing some kprobe context, but is it really safe to use
> > > this mechanism to replace more than one instruction?
> >
> > I'm not entirely up-to-scratch here, so Masami, please correct me if I'm
> > wrong.
> >
> > So what happens is that arch_prepare_optimized_kprobe() <-
> > copy_optimized_instructions() copies however much of the instruction
> > stream is required such that we can overwrite the instruction at @addr
> > with a 5 byte jump.
> >
> > arch_optimize_kprobe() then does the text_poke_bp() that replaces the
> > instruction @addr with int3, copies the rel jump address and overwrites
> > the int3 with jmp.
> >
> > And I'm thinking the problem is with something like:
> >
> > @addr: nop nop nop nop nop
> >
> > We copy out the nops into the trampoline, overwrite the first nop with
> > an INT3, overwrite the remaining nops with the rel addr, but oops,
> > another CPU can still be executing one of those NOPs, right?
> >
> > I'm thinking we could fix this by first writing INT3 into all relevant
> > instructions, which is going to be messy, given the current code base.
>
> Maybe not that bad; how's something like this?
>
> (completely untested)
>
> ---
> arch/x86/kernel/alternative.c | 20 ++++++++++++++++++--
> 1 file changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 0d57015114e7..8f643dabea72 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -24,6 +24,7 @@
> #include <asm/tlbflush.h>
> #include <asm/io.h>
> #include <asm/fixmap.h>
> +#include <asm/insn.h>
>
> int __read_mostly alternatives_patched;
>
> @@ -849,6 +850,7 @@ static void do_sync_core(void *info)
>
> static bool bp_patching_in_progress;
> static void *bp_int3_handler, *bp_int3_addr;
> +static unsigned int bp_int3_length;
>
> int poke_int3_handler(struct pt_regs *regs)
> {
> @@ -867,7 +869,11 @@ int poke_int3_handler(struct pt_regs *regs)
> if (likely(!bp_patching_in_progress))
> return 0;
>
> - if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> + if (user_mode(regs))
> + return 0;
> +
> + if (regs->ip < (unsigned long)bp_int3_addr ||
> + regs->ip >= (unsigned long)bp_int3_addr + bp_int3_length)
> return 0;

Bugger, this isn't right. It'll jump to the beginning of the trampoline,
even if it is multiple instructions in, which would lead to executing
instructions twice, which would be BAD.

_maybe_, depending on what the slot looks like, we could do something
like:

offset = regs->ip - (unsigned long)bp_int3_addr;
regs->ip = bp_int3_handler + offset;

That is; jump into the slot at the same offset we hit the INT3, but this
is quickly getting yuck.

> /* set up the specified breakpoint handler */
> @@ -900,9 +906,12 @@ NOKPROBE_SYMBOL(poke_int3_handler);
> void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> {
> unsigned char int3 = 0xcc;
> + void *kaddr = addr;
> + struct insn insn;
>
> bp_int3_handler = handler;
> bp_int3_addr = (u8 *)addr + sizeof(int3);
> + bp_int3_length = len - sizeof(int3);
> bp_patching_in_progress = true;
>
> lockdep_assert_held(&text_mutex);
> @@ -913,7 +922,14 @@ void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> */
> smp_wmb();
>
> - text_poke(addr, &int3, sizeof(int3));
> + do {
> + kernel_insn_init(&insn, kaddr, MAX_INSN_SIZE);
> + insn_get_length(&insn);
> +
> + text_poke(kaddr, &int3, sizeof(int3));
> +
> + kaddr += insn.length;
> + } while (kaddr < addr + len);
>
> on_each_cpu(do_sync_core, NULL, 1);
>

2019-06-11 12:44:23

by Peter Zijlstra

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Tue, Jun 11, 2019 at 02:34:02PM +0200, Peter Zijlstra wrote:

> Bugger, this isn't right. It'll jump to the beginning of the trampoline,
> even if it is multiple instructions in, which would lead to executing
> instructions twice, which would be BAD.
>
> _maybe_, depending on what the slot looks like, we could do something
> like:
>
> offset = regs->ip - (unsigned long)bp_int3_addr;
> regs->ip = bp_int3_handler + offset;
>
> That is; jump into the slot at the same offset we hit the INT3, but this
> is quickly getting yuck.

Yeah, that won't work either... it needs something far more complex :/

2019-06-11 15:56:29

by Steven Rostedt

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Wed, 05 Jun 2019 15:08:01 +0200
Peter Zijlstra <[email protected]> wrote:

> -void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> +void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate)
> {
> unsigned char int3 = 0xcc;
>
> - bp_int3_handler = handler;
> + bp_int3_opcode = emulate ?: opcode;
> bp_int3_addr = (u8 *)addr + sizeof(int3);
> bp_patching_in_progress = true;
>
> lockdep_assert_held(&text_mutex);
>
> /*
> + * poke_int3_handler() relies on @opcode being a 5 byte instruction;
> + * notably a JMP, CALL or NOP5_ATOMIC.
> + */
> + BUG_ON(len != 5);

If we have a bug on here, why bother with passing in len at all? Just
force it to be 5.

We could make it a WARN_ON() and return without doing anything.

This also prevents us from ever changing two byte jmps.

-- Steve

> +
> + /*
> * Corresponding read barrier in int3 notifier for making sure the
> - * in_progress and handler are correctly ordered wrt. patching.
> + * in_progress and opcode are correctly ordered wrt. patching.
> */
> smp_wmb();
>
> -

2019-06-11 16:38:50

by Steven Rostedt

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Tue, 11 Jun 2019 10:03:07 +0200
Peter Zijlstra <[email protected]> wrote:


> So what happens is that arch_prepare_optimized_kprobe() <-
> copy_optimized_instructions() copies however much of the instruction
> stream is required such that we can overwrite the instruction at @addr
> with a 5 byte jump.
>
> arch_optimize_kprobe() then does the text_poke_bp() that replaces the
> instruction @addr with int3, copies the rel jump address and overwrites
> the int3 with jmp.
>
> And I'm thinking the problem is with something like:
>
> @addr: nop nop nop nop nop

What would work would be to:

add breakpoint to first opcode.

call synchronize_tasks();

/* All tasks now hitting breakpoint and jumping over affected
code */

update the rest of the instructions.

replace breakpoint with jmp.

One caveat is that the replaced instructions must not include a
function call. If the called function calls schedule, then it will
circumvent the synchronize_tasks(). It would be OK if that call is the
last of the instructions. But I doubt we modify anything more than a
call size anyway, so this should still work for all current instances.

-- Steve

>
> We copy out the nops into the trampoline, overwrite the first nop with
> an INT3, overwrite the remaining nops with the rel addr, but oops,
> another CPU can still be executing one of those NOPs, right?
>
> I'm thinking we could fix this by first writing INT3 into all relevant
> instructions, which is going to be messy, given the current code base.
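The breakpoint-first sequence could be sketched in user-space C like so (synchronize_rcu_tasks() and text_poke() replaced by hypothetical stand-ins; this only illustrates the ordering, not the real patching machinery):

```c
#include <assert.h>
#include <string.h>

#define INT3		0xCC
#define PATCH_SIZE	5

/*
 * Stand-in for synchronize_rcu_tasks(): after this returns, every task
 * has either hit the breakpoint (and jumped over the patched range) or
 * is no longer executing inside it.
 */
static void wait_for_tasks(void) { }

/*
 * Install the 5-byte instruction @insn at @site without ever exposing a
 * partially written instruction at the entry byte.
 */
void patch_site(unsigned char *site, const unsigned char *insn)
{
	site[0] = INT3;			/* 1) breakpoint on the first opcode */
	wait_for_tasks();		/* 2) everyone now traps or is past  */
	memcpy(site + 1, insn + 1,	/* 3) rewrite the remaining bytes    */
	       PATCH_SIZE - 1);
	site[0] = insn[0];		/* 4) replace the breakpoint         */
}
```

The caveat from the text stands: nothing in the patched range may call into the scheduler, or the stand-in for synchronize_rcu_tasks() gives no guarantee.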

2019-06-11 16:43:49

by Steven Rostedt

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Tue, 11 Jun 2019 11:22:54 -0400
Steven Rostedt <[email protected]> wrote:

> What would work would be to:
>
> add breakpoint to first opcode.
>
> call synchronize_tasks();

BTW, that should be "synchronize_rcu_tasks()"

-- Steve

>
> /* All tasks now hitting breakpoint and jumping over affected
> code */
>
> update the rest of the instructions.
>
> replace breakpoint with jmp.

2019-06-11 16:44:10

by Peter Zijlstra

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Tue, Jun 11, 2019 at 11:14:10AM -0400, Steven Rostedt wrote:
> On Wed, 05 Jun 2019 15:08:01 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > -void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> > +void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate)
> > {
> > unsigned char int3 = 0xcc;
> >
> > - bp_int3_handler = handler;
> > + bp_int3_opcode = emulate ?: opcode;
> > bp_int3_addr = (u8 *)addr + sizeof(int3);
> > bp_patching_in_progress = true;
> >
> > lockdep_assert_held(&text_mutex);
> >
> > /*
> > + * poke_int3_handler() relies on @opcode being a 5 byte instruction;
> > + * notably a JMP, CALL or NOP5_ATOMIC.
> > + */
> > + BUG_ON(len != 5);
>
> If we have a bug on here, why bother with passing in len at all? Just
> force it to be 5.

Masami said the same.

> We could make it a WARN_ON() and return without doing anything.
>
> This also prevents us from ever changing two byte jmps.

It doesn't; that is, we'd need to add emulation for the 3 byte jump, but
that'd be pretty trivial.

2019-06-11 16:44:13

by Andy Lutomirski

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions



> On Jun 11, 2019, at 1:03 AM, Peter Zijlstra <[email protected]> wrote:
>
>> On Fri, Jun 07, 2019 at 11:10:19AM -0700, Andy Lutomirski wrote:
>>
>>
>>> On Jun 7, 2019, at 10:34 AM, Peter Zijlstra <[email protected]> wrote:
>>>
>>> On Sat, Jun 08, 2019 at 12:47:08AM +0900, Masami Hiramatsu wrote:
>>>
>>>>> This fits almost all text_poke_bp() users, except
>>>>> arch_unoptimize_kprobe() which restores random text, and for that site
>>>>> we have to build an explicit emulate instruction.
>>>>
>>>> Hm, actually it doesn't restore random text, since the first byte
>>>> must always be int3. As the function name implies, it just unoptimizes
>>>> (jump based optprobe -> int3 based kprobe).
>>>> Anyway, that is not an issue. With this patch, optprobe must still work.
>>>
>>> I thought it basically restored 5 bytes of original text (with no
>>> guarantee it is a single instruction, or even a complete instruction),
>>> with the first byte replaced with INT3.
>>>
>>
>> I am surely missing some kprobe context, but is it really safe to use
>> this mechanism to replace more than one instruction?
>
> I'm not entirely up-to-scratch here, so Masami, please correct me if I'm
> wrong.
>
> So what happens is that arch_prepare_optimized_kprobe() <-
> copy_optimized_instructions() copies however much of the instruction
> stream is required such that we can overwrite the instruction at @addr
> with a 5 byte jump.
>
> arch_optimize_kprobe() then does the text_poke_bp() that replaces the
> instruction @addr with int3, copies the rel jump address and overwrites
> the int3 with jmp.
>
> And I'm thinking the problem is with something like:
>
> @addr: nop nop nop nop nop
>
> We copy out the nops into the trampoline, overwrite the first nop with
> an INT3, overwrite the remaining nops with the rel addr, but oops,
> another CPU can still be executing one of those NOPs, right?
>
> I'm thinking we could fix this by first writing INT3 into all relevant
> instructions, which is going to be messy, given the current code base.

How does that help? If RIP == x+2 and you want to put a 5-byte jump at address x, no amount of 0xcc is going to change the fact that RIP is in the middle of the jump.

Live patching can handle this by detecting this condition on each CPU, but performance won’t be great. Maybe some synchronize_sched trickery could help.

2019-06-11 16:44:55

by Peter Zijlstra

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Tue, Jun 11, 2019 at 11:22:54AM -0400, Steven Rostedt wrote:
> On Tue, 11 Jun 2019 10:03:07 +0200
> Peter Zijlstra <[email protected]> wrote:
>
>
> > So what happens is that arch_prepare_optimized_kprobe() <-
> > copy_optimized_instructions() copies however much of the instruction
> > stream is required such that we can overwrite the instruction at @addr
> > with a 5 byte jump.
> >
> > arch_optimize_kprobe() then does the text_poke_bp() that replaces the
> > instruction @addr with int3, copies the rel jump address and overwrites
> > the int3 with jmp.
> >
> > And I'm thinking the problem is with something like:
> >
> > @addr: nop nop nop nop nop
>
> What would work would be to:
>
> add breakpoint to first opcode.
>
> call synchronize_tasks();
>
> /* All tasks now hitting breakpoint and jumping over affected
> code */
>
> update the rest of the instructions.
>
> replace breakpoint with jmp.
>
> One caveat is that the replaced instructions must not include a
> function call. If the called function calls schedule, then it will
> circumvent the synchronize_tasks(). It would be OK if that call is the
> last of the instructions. But I doubt we modify anything more than a
> call size anyway, so this should still work for all current instances.

Right, something like this could work (although I cannot currently find
synchronize_tasks), but it would make the optprobe stuff fairly slow
(iirc this sync_tasks() thing could be pretty horrible).


2019-06-11 16:47:08

by Steven Rostedt

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Tue, 11 Jun 2019 08:54:23 -0700
Andy Lutomirski <[email protected]> wrote:


> How does that help? If RIP == x+2 and you want to put a 5-byte jump
> at address x, no amount of 0xcc is going to change the fact that RIP
> is in the middle of the jump.
>
> Live patching can handle this by detecting this condition on each
> CPU, but performance won’t be great. Maybe some synchronize_sched
> trickery could help.

We have synchronize_rcu_tasks() which return after all tasks have
either entered user space or did a voluntary schedule (was not
preempted). Or have not run (still in a sleeping state).

That way we guarantee that all tasks are no longer on any trampoline
or code paths that do not call schedule. I use this to free dynamically
allocated trampolines used by ftrace. And kprobes uses this too for its
own trampolines.

-- Steve

2019-06-11 16:49:32

by Peter Zijlstra

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Tue, Jun 11, 2019 at 05:52:48PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 11, 2019 at 11:14:10AM -0400, Steven Rostedt wrote:
> > On Wed, 05 Jun 2019 15:08:01 +0200
> > Peter Zijlstra <[email protected]> wrote:
> >
> > > -void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> > > +void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate)
> > > {
> > > unsigned char int3 = 0xcc;
> > >
> > > - bp_int3_handler = handler;
> > > + bp_int3_opcode = emulate ?: opcode;
> > > bp_int3_addr = (u8 *)addr + sizeof(int3);
> > > bp_patching_in_progress = true;
> > >
> > > lockdep_assert_held(&text_mutex);
> > >
> > > /*
> > > + * poke_int3_handler() relies on @opcode being a 5 byte instruction;
> > > + * notably a JMP, CALL or NOP5_ATOMIC.
> > > + */
> > > + BUG_ON(len != 5);
> >
> > If we have a bug on here, why bother with passing in len at all? Just
> > force it to be 5.
>
> Masami said the same.
>
> > We could make it a WARN_ON() and return without doing anything.
> >
> > This also prevents us from ever changing two byte jmps.
>
> It doesn't; that is, we'd need to add emulation for the 3 byte jump, but
> that'd be pretty trivial.

I can't find a 3 byte jump on x86_64; I could only find a 2 byte one.
But something like so should work I suppose, although at this point I'm
thinking we should just use the instruction decode we have instead of
playing iffy games with packed structures.

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e1a4bb42eb92..abb9615dcb1d 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -57,6 +57,9 @@ static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
#define JMP_INSN_SIZE 5
#define JMP_INSN_OPCODE 0xE9

+#define JMP8_INSN_SIZE 2
+#define JMP8_INSN_OPCODE 0xEB
+
static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
{
/*
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 5d0123a8183b..5df6c74a0b08 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -924,13 +924,18 @@ static void do_sync_core(void *info)
static bool bp_patching_in_progress;
static const void *bp_int3_opcode, *bp_int3_addr;

+struct poke_insn {
+ u8 opcode;
+ union {
+ s8 rel8;
+ s32 rel32;
+ };
+} __packed;
+
int poke_int3_handler(struct pt_regs *regs)
{
long ip = regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE;
- struct opcode {
- u8 insn;
- s32 rel;
- } __packed opcode;
+ struct poke_insn insn;

/*
* Having observed our INT3 instruction, we now must observe
@@ -950,15 +955,19 @@ int poke_int3_handler(struct pt_regs *regs)
if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
return 0;

- opcode = *(struct opcode *)bp_int3_opcode;
+ insn = *(struct poke_insn *)bp_int3_opcode;

- switch (opcode.insn) {
+ switch (insn.opcode) {
case CALL_INSN_OPCODE:
- int3_emulate_call(regs, ip + opcode.rel);
+ int3_emulate_call(regs, ip + insn.rel32);
break;

case JMP_INSN_OPCODE:
- int3_emulate_jmp(regs, ip + opcode.rel);
+ int3_emulate_jmp(regs, ip + insn.rel32);
+ break;
+
+ case JMP8_INSN_OPCODE:
+ int3_emulate_jmp(regs, ip - JMP_INSN_SIZE + JMP8_INSN_SIZE + insn.rel8);
break;

default: /* assume NOP */
@@ -992,7 +1001,8 @@ NOKPROBE_SYMBOL(poke_int3_handler);
*/
void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate)
{
- unsigned char int3 = 0xcc;
+ unsigned char int3 = INT3_INSN_OPCODE;
+ unsigned char op;

bp_int3_opcode = emulate ?: opcode;
bp_int3_addr = (u8 *)addr + sizeof(int3);
@@ -1001,10 +1011,26 @@ void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulat
lockdep_assert_held(&text_mutex);

/*
- * poke_int3_handler() relies on @opcode being a 5 byte instruction;
- * notably a JMP, CALL or NOP5_ATOMIC.
+ * Verify we support the actual instruction in poke_int3_handler().
*/
- BUG_ON(len != 5);
+ op = *(unsigned char *)bp_int3_opcode;
+ switch (op) {
+ case CALL_INSN_OPCODE:
+ BUG_ON(len != CALL_INSN_SIZE);
+ break;
+
+ case JMP_INSN_OPCODE:
+ BUG_ON(len != JMP_INSN_SIZE);
+ break;
+
+ case JMP8_INSN_OPCODE:
+ BUG_ON(len != JMP8_INSN_SIZE);
+ break;
+
+ default: /* assume NOP5_ATOMIC */
+ BUG_ON(len != 5);
+ break;
+ }

/*
* Corresponding read barrier in int3 notifier for making sure the

2019-06-12 18:06:06

by Peter Zijlstra

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Tue, Jun 11, 2019 at 06:21:28PM +0200, Peter Zijlstra wrote:
> although at this point I'm
> thinking we should just use the instruction decode we have instead of
> playing iffy games with packed structures.

How's something like this? It accepts jmp/32, jmp/8, call, and nop5_atomic.

---
Subject: x86/alternatives: Teach text_poke_bp() to emulate instructions
From: Peter Zijlstra <[email protected]>
Date: Wed Jun 5 10:48:37 CEST 2019

In preparation for static_call support, teach text_poke_bp() to
emulate instructions, including CALL.

The current text_poke_bp() takes a @handler argument which is used as
a jump target when the temporary INT3 is hit by a different CPU.

When patching CALL instructions, this doesn't work because we'd miss
the PUSH of the return address. Instead, teach poke_int3_handler() to
emulate an instruction, typically the instruction we're patching in.

This fits almost all text_poke_bp() users, except
arch_unoptimize_kprobe() which restores random text, and for that site
we have to build an explicit emulate instruction.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/include/asm/text-patching.h | 15 +++++--
arch/x86/kernel/alternative.c | 73 +++++++++++++++++++++++++----------
arch/x86/kernel/jump_label.c | 3 -
arch/x86/kernel/kprobes/opt.c | 11 +++--
4 files changed, 75 insertions(+), 27 deletions(-)

--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -37,7 +37,7 @@ extern void text_poke_early(void *addr,
extern void *text_poke(void *addr, const void *opcode, size_t len);
extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
extern int poke_int3_handler(struct pt_regs *regs);
-extern void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate);
extern int after_bootmem;
extern __ro_after_init struct mm_struct *poking_mm;
extern __ro_after_init unsigned long poking_addr;
@@ -48,8 +48,17 @@ static inline void int3_emulate_jmp(stru
regs->ip = ip;
}

-#define INT3_INSN_SIZE 1
-#define CALL_INSN_SIZE 5
+#define INT3_INSN_SIZE 1
+#define INT3_INSN_OPCODE 0xCC
+
+#define CALL_INSN_SIZE 5
+#define CALL_INSN_OPCODE 0xE8
+
+#define JMP_INSN_SIZE 5
+#define JMP_INSN_OPCODE 0xE9
+
+#define JMP8_INSN_SIZE 2
+#define JMP8_INSN_OPCODE 0xEB

static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
{
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -920,31 +920,45 @@ static void do_sync_core(void *info)
sync_core();
}

-static bool bp_patching_in_progress;
-static void *bp_int3_handler, *bp_int3_addr;
+static const void *bp_int3_addr;
+static const struct insn *bp_int3_insn;

int poke_int3_handler(struct pt_regs *regs)
{
+ long ip;
+
/*
* Having observed our INT3 instruction, we now must observe
- * bp_patching_in_progress.
- *
- * in_progress = TRUE INT3
- * WMB RMB
- * write INT3 if (in_progress)
+ * bp_int3_addr and bp_int3_insn:
*
- * Idem for bp_int3_handler.
+ * bp_int3_{addr,insn} = .. INT3
+ * WMB RMB
+ * write INT3 if (insn)
*/
smp_rmb();

- if (likely(!bp_patching_in_progress))
+ if (likely(!bp_int3_insn))
return 0;

if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
return 0;

- /* set up the specified breakpoint handler */
- regs->ip = (unsigned long) bp_int3_handler;
+ ip = regs->ip - INT3_INSN_SIZE + bp_int3_insn->length;
+
+ switch (bp_int3_insn->opcode.bytes[0]) {
+ case CALL_INSN_OPCODE:
+ int3_emulate_call(regs, ip + bp_int3_insn->immediate.value);
+ break;
+
+ case JMP_INSN_OPCODE:
+ case JMP8_INSN_OPCODE:
+ int3_emulate_jmp(regs, ip + bp_int3_insn->immediate.value);
+ break;
+
+ default: /* assume NOP */
+ int3_emulate_jmp(regs, ip);
+ break;
+ }

return 1;
}
@@ -955,7 +969,7 @@ NOKPROBE_SYMBOL(poke_int3_handler);
* @addr: address to patch
* @opcode: opcode of new instruction
* @len: length to copy
- * @handler: address to jump to when the temporary breakpoint is hit
+ * @emulate: opcode to emulate, when NULL use @opcode
*
* Modify multi-byte instruction by using int3 breakpoint on SMP.
* We completely avoid stop_machine() here, and achieve the
@@ -970,19 +984,40 @@ NOKPROBE_SYMBOL(poke_int3_handler);
* replacing opcode
* - sync cores
*/
-void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
+void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate)
{
- unsigned char int3 = 0xcc;
+ unsigned char int3 = INT3_INSN_OPCODE;
+ struct insn insn;

- bp_int3_handler = handler;
- bp_int3_addr = (u8 *)addr + sizeof(int3);
- bp_patching_in_progress = true;
+ bp_int3_addr = addr + INT3_INSN_SIZE;

lockdep_assert_held(&text_mutex);

+ if (!emulate)
+ emulate = opcode;
+
+ kernel_insn_init(&insn, emulate, MAX_INSN_SIZE);
+ insn_get_length(&insn);
+
+ BUG_ON(!insn_complete(&insn));
+ BUG_ON(insn.length != len);
+
+ switch (insn.opcode.bytes[0]) {
+ case CALL_INSN_OPCODE:
+ case JMP_INSN_OPCODE:
+ case JMP8_INSN_OPCODE:
+ break;
+
+ default:
+ BUG_ON(len != 5);
+ BUG_ON(memcmp(emulate, ideal_nops[NOP_ATOMIC5], 5));
+ }
+
+ bp_int3_insn = &insn;
+
/*
* Corresponding read barrier in int3 notifier for making sure the
- * in_progress and handler are correctly ordered wrt. patching.
+ * bp_int3_addr and bp_int3_insn are correctly ordered wrt. patching.
*/
smp_wmb();

@@ -1011,6 +1046,6 @@ void text_poke_bp(void *addr, const void
* sync_core() implies an smp_mb() and orders this store against
* the writing of the new instruction.
*/
- bp_patching_in_progress = false;
+ bp_int3_insn = NULL;
}

--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -87,8 +87,7 @@ static void __ref __jump_label_transform
return;
}

- text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE,
- (void *)jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
+ text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE, NULL);
}

void arch_jump_label_transform(struct jump_entry *entry,
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -437,8 +437,7 @@ void arch_optimize_kprobes(struct list_h
insn_buff[0] = RELATIVEJUMP_OPCODE;
*(s32 *)(&insn_buff[1]) = rel;

- text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE,
- op->optinsn.insn);
+ text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE, NULL);

list_del_init(&op->list);
}
@@ -448,12 +447,18 @@ void arch_optimize_kprobes(struct list_h
void arch_unoptimize_kprobe(struct optimized_kprobe *op)
{
u8 insn_buff[RELATIVEJUMP_SIZE];
+ u8 emulate_buff[RELATIVEJUMP_SIZE];

/* Set int3 to first byte for kprobes */
insn_buff[0] = BREAKPOINT_INSTRUCTION;
memcpy(insn_buff + 1, op->optinsn.copied_insn, RELATIVE_ADDR_SIZE);
+
+ emulate_buff[0] = RELATIVEJUMP_OPCODE;
+ *(s32 *)(&emulate_buff[1]) = (s32)((long)op->optinsn.insn -
+ ((long)op->kp.addr + RELATIVEJUMP_SIZE));
+
text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE,
- op->optinsn.insn);
+ emulate_buff);
}

/*

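The int3_emulate_*() helpers the patch relies on are conceptually tiny. As a rough user-space model (hypothetical struct and field names, not the kernel's struct pt_regs), emulating a CALL is just a push of the return address followed by a jump:

```c
#include <stdint.h>

/*
 * Simplified stand-in for the trapped register state; the names are
 * illustrative only, not the kernel's struct pt_regs.
 */
struct fake_regs {
	unsigned long ip;
	unsigned long sp;		/* index into stack[], grows down */
	unsigned long stack[16];
};

static void emulate_push(struct fake_regs *regs, unsigned long val)
{
	regs->stack[--regs->sp] = val;
}

static void emulate_jmp(struct fake_regs *regs, unsigned long ip)
{
	regs->ip = ip;
}

/*
 * A CALL is a push of the return address plus a jump to the target.
 * This assumes regs->ip has already been advanced past the patched
 * instruction, as poke_int3_handler() computes above.
 */
static void emulate_call(struct fake_regs *regs, unsigned long func)
{
	emulate_push(regs, regs->ip);
	emulate_jmp(regs, func);
}
```

In the handler, `ip = regs->ip - INT3_INSN_SIZE + insn->length` is exactly that "address after the patched instruction", so the CALL and JMP cases reduce to these two primitives.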
2019-06-12 18:10:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Fri, Jun 07, 2019 at 07:34:27PM +0200, Peter Zijlstra wrote:
> On Sat, Jun 08, 2019 at 12:47:08AM +0900, Masami Hiramatsu wrote:

> > > @@ -943,8 +949,21 @@ int poke_int3_handler(struct pt_regs *re
> > > if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> > > return 0;
> > >
> > > - /* set up the specified breakpoint handler */
> > > - regs->ip = (unsigned long) bp_int3_handler;
> > > + opcode = *(struct opcode *)bp_int3_opcode;
> > > +
> > > + switch (opcode.insn) {
> > > + case 0xE8: /* CALL */
> > > + int3_emulate_call(regs, ip + opcode.rel);
> > > + break;
> > > +
> > > + case 0xE9: /* JMP */
> > > + int3_emulate_jmp(regs, ip + opcode.rel);
> > > + break;
> > > +
> > > + default: /* assume NOP */
> >
> > Shouldn't we check whether it is actually NOP here?
>
> I was/am lazy and didn't want to deal with:
>
> arch/x86/include/asm/nops.h:#define GENERIC_NOP5_ATOMIC NOP_DS_PREFIX,GENERIC_NOP4
> arch/x86/include/asm/nops.h:#define K8_NOP5_ATOMIC 0x66,K8_NOP4
> arch/x86/include/asm/nops.h:#define K7_NOP5_ATOMIC NOP_DS_PREFIX,K7_NOP4
> arch/x86/include/asm/nops.h:#define P6_NOP5_ATOMIC P6_NOP5
>
> But maybe we should check for all the various NOP5 variants and BUG() on
> anything unexpected.

I realized we never actually poke a !ideal nop5_atomic, so I've added
that to the latest versions.

2019-06-12 19:45:31

by Nadav Amit

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

> On Jun 11, 2019, at 8:55 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Jun 11, 2019 at 11:22:54AM -0400, Steven Rostedt wrote:
>> On Tue, 11 Jun 2019 10:03:07 +0200
>> Peter Zijlstra <[email protected]> wrote:
>>
>>
>>> So what happens is that arch_prepare_optimized_kprobe() <-
>>> copy_optimized_instructions() copies however much of the instruction
>>> stream is required such that we can overwrite the instruction at @addr
>>> with a 5 byte jump.
>>>
>>> arch_optimize_kprobe() then does the text_poke_bp() that replaces the
>>> instruction @addr with int3, copies the rel jump address and overwrites
>>> the int3 with jmp.
>>>
>>> And I'm thinking the problem is with something like:
>>>
>>> @addr: nop nop nop nop nop
>>
>> What would work would be to:
>>
>> add breakpoint to first opcode.
>>
>> call synchronize_tasks();
>>
>> /* All tasks now hitting breakpoint and jumping over affected
>> code */
>>
>> update the rest of the instructions.
>>
>> replace breakpoint with jmp.
>>
>> One caveat is that the replaced instructions must not be a call
>> function. As if the call function calls schedule then it will
>> circumvent the synchronize_tasks(). It would be OK if that call is the
>> last of the instructions. But I doubt we modify anything more then a
>> call size anyway, so this should still work for all current instances.
>
> Right, something like this could work (although I cannot currently find
> synchronize_tasks), but it would make the optprobe stuff fairly slow
> (iirc this sync_tasks() thing could be pretty horrible).

I have run into similar problems before.

I had two problematic scenarios. In the first case, I had a “call” in the
middle of the patched code-block, but this call was always followed by a
“jump” to the end of the potentially patched code-block, so I did not have
the problem.

In the second case, I had an indirect call (which is shorter than a direct
call) being patched into a direct call. In this case, I preceded the
indirect call with NOPs so indeed the indirect call was at the end of the
patched block.

In certain cases, if a shorter instruction should be potentially patched
into a longer one, the shorter one can be preceded by some prefixes. If
there are multiple REX prefixes, for instance, the CPU only uses the last
one, IIRC. This can allow to avoid synchronize_sched() when patching a
single instruction into another instruction with a different length.

Not sure how helpful this information is, but sharing - just in case.
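As a sketch of the padding idea (the byte values are illustrative only, and whether redundant prefixes are architecturally safe here is exactly the IIRC caveat above): a 2-byte register-indirect call can be padded out to the 5 bytes of a direct call, so that both variants occupy the same instruction boundary and can be swapped without a length change:

```c
#include <stddef.h>

/* A 5-byte direct call: E8 <rel32> (relocation left as zeroes here). */
static const unsigned char call_rel32[] = {
	0xE8, 0x00, 0x00, 0x00, 0x00
};

/*
 * A 2-byte indirect call through a register (FF D0 == call *%rax),
 * preceded by redundant prefixes to pad it to the same 5 bytes.
 * 0x3E (DS segment override) is used here purely for illustration;
 * the REX trick mentioned above would use 0x40-0x4F bytes instead.
 */
static const unsigned char call_indirect_padded[] = {
	0x3E, 0x3E, 0x3E, 0xFF, 0xD0
};
```

Because both encodings are 5 bytes, replacing one with the other never leaves another CPU's RIP pointing into the middle of a differently-sized instruction, which is what the synchronization is otherwise needed for.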

2019-06-17 14:33:27

by Peter Zijlstra

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Tue, Jun 11, 2019 at 08:54:23AM -0700, Andy Lutomirski wrote:
> > On Jun 11, 2019, at 1:03 AM, Peter Zijlstra <[email protected]> wrote:

> > arch_optimize_kprobe() then does the text_poke_bp() that replaces the
> > instruction @addr with int3, copies the rel jump address and overwrites
> > the int3 with jmp.
> >
> > And I'm thinking the problem is with something like:
> >
> > @addr: nop nop nop nop nop
> >
> > We copy out the nops into the trampoline, overwrite the first nop with
> > an INT3, overwrite the remaining nops with the rel addr, but oops,
> > another CPU can still be executing one of those NOPs, right?
> >
> > I'm thinking we could fix this by first writing INT3 into all relevant
> > instructions, which is going to be messy, given the current code base.
>
> How does that help? If RIP == x+2 and you want to put a 5-byte jump
> at address x, no amount of 0xcc is going to change the fact that RIP
> is in the middle of the jump.

What I proposed was doing 0xCC on every instruction that's inside of
[x,x+5). After that we do machine wide IPI+SYNC.

So if RIP is at x+2, then it will observe 0xCC and trigger the exception
there.

Now, the problem is that my exception cannot recover from anything
except NOPs, so while it could work for some corner cases, it doesn't
work for the optkprobe case in general.

Only then do we write the JMP offset and again to IPI+SYNC; then we
write the 0xE8 and again IPI-SYNC.

But at that point we already have the guarantee nobody is inside
[x,x+5). That is, except if we can get to x+2 without first going
through x, IOW if x+2 is a branch target we're screwed any which way
around and the poke is never valid.

Is that something optkprobes checks? If so, where and how?

2019-06-17 14:43:21

by Peter Zijlstra

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Wed, Jun 12, 2019 at 07:44:12PM +0000, Nadav Amit wrote:

> I have run into similar problems before.
>
> I had two problematic scenarios. In the first case, I had a “call” in the
> middle of the patched code-block, but this call was always followed by a
> “jump” to the end of the potentially patched code-block, so I did not have
> the problem.
>
> In the second case, I had an indirect call (which is shorter than a direct

Longer, 6 bytes vs 5 if I'm not mistaken.

> call) being patched into a direct call. In this case, I preceded the
> indirect call with NOPs so indeed the indirect call was at the end of the
> patched block.
>
> In certain cases, if a shorter instruction should be potentially patched
> into a longer one, the shorter one can be preceded by some prefixes. If
> there are multiple REX prefixes, for instance, the CPU only uses the last
> one, IIRC. This can allow to avoid synchronize_sched() when patching a
> single instruction into another instruction with a different length.
>
> Not sure how helpful this information is, but sharing - just in case.

I think we can patch multiple instructions provided:

- all but one instruction are a NOP,
- there are no branch targets inside the range.

By poking INT3 at every instruction in the range and then doing the
machine wide IPI+SYNC, we'll trap every CPU that is inside the range.

Because all but one instruction are a NOP, we can emulate only the one
instruction (assuming the real instruction is always last), otherwise
NOP when we're behind the real instruction.

Then we can write new instructions, leaving the initial INT3 until last.

Something like this might be useful if we want to support immediate
instructions (like patch_data_* in paravirt_patch.c) for static_call().


2019-06-17 17:13:39

by Nadav Amit

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

> On Jun 17, 2019, at 7:42 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Jun 12, 2019 at 07:44:12PM +0000, Nadav Amit wrote:
>
>> I have run into similar problems before.
>>
>> I had two problematic scenarios. In the first case, I had a “call” in the
>> middle of the patched code-block, but this call was always followed by a
>> “jump” to the end of the potentially patched code-block, so I did not have
>> the problem.
>>
>> In the second case, I had an indirect call (which is shorter than a direct
>
> Longer, 6 bytes vs 5 if I'm not mistaken.

Shorter (2-3 bytes IIRC), since the target was held in a register.

>
>> call) being patched into a direct call. In this case, I preceded the
>> indirect call with NOPs so indeed the indirect call was at the end of the
>> patched block.
>>
>> In certain cases, if a shorter instruction should be potentially patched
>> into a longer one, the shorter one can be preceded by some prefixes. If
>> there are multiple REX prefixes, for instance, the CPU only uses the last
>> one, IIRC. This can allow to avoid synchronize_sched() when patching a
>> single instruction into another instruction with a different length.
>>
>> Not sure how helpful this information is, but sharing - just in case.
>
> I think we can patch multiple instructions provided:
>
> - all but one instruction are a NOP,
> - there are no branch targets inside the range.
>
> By poking INT3 at every instruction in the range and then doing the
> machine wide IPI+SYNC, we'll trap every CPU that is inside the range.
>
> Because all but one instruction are a NOP, we can emulate only the one
> instruction (assuming the real instruction is always last), otherwise
> NOP when we're behind the real instruction.
>
> Then we can write new instructions, leaving the initial INT3 until last.
>
> Something like this might be useful if we want to support immediate
> instructions (like patch_data_* in paravirt_patch.c) for static_call().

I don't know what you mean by SYNC, but if you mean something like
sync_core() (in contrast to something like synchronize_sched()), I am
not sure it is sufficient.

Using IPI+sync_core() would, I think, assume that IRQs are never
enabled inside IRQ and exception handlers, or that these handlers are
not invoked while the patched code is executing. Otherwise, the IPI
might be received inside an IRQ/exception handler, and the return from
that handler would land in the middle of a patched instruction.

2019-06-17 17:31:46

by Andy Lutomirski

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Mon, Jun 17, 2019 at 7:42 AM Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Jun 12, 2019 at 07:44:12PM +0000, Nadav Amit wrote:
>
> > I have run into similar problems before.
> >
> > I had two problematic scenarios. In the first case, I had a “call” in the
> > middle of the patched code-block, but this call was always followed by a
> > “jump” to the end of the potentially patched code-block, so I did not have
> > the problem.
> >
> > In the second case, I had an indirect call (which is shorter than a direct
>
> Longer, 6 bytes vs 5 if I'm not mistaken.
>
> > call) being patched into a direct call. In this case, I preceded the
> > indirect call with NOPs so indeed the indirect call was at the end of the
> > patched block.
> >
> > In certain cases, if a shorter instruction should be potentially patched
> > into a longer one, the shorter one can be preceded by some prefixes. If
> > there are multiple REX prefixes, for instance, the CPU only uses the last
> > one, IIRC. This can allow to avoid synchronize_sched() when patching a
> > single instruction into another instruction with a different length.
> >
> > Not sure how helpful this information is, but sharing - just in case.
>
> I think we can patch multiple instructions provided:
>
> - all but one instruction are a NOP,
> - there are no branch targets inside the range.
>
> By poking INT3 at every instruction in the range and then doing the
> machine wide IPI+SYNC, we'll trap every CPU that is inside the range.

How do you know you'll trap them? You need to IPI, serialize, and get
them to execute an instruction. If the CPU is in an interrupt and RIP
just happens to be pointed to the INT3, you need them to execute a
whole lot more than just one instruction.

2019-06-17 21:14:32

by Peter Zijlstra

Subject: Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

On Mon, Jun 17, 2019 at 10:25:27AM -0700, Andy Lutomirski wrote:
> On Mon, Jun 17, 2019 at 7:42 AM Peter Zijlstra <[email protected]> wrote:
> >
> > On Wed, Jun 12, 2019 at 07:44:12PM +0000, Nadav Amit wrote:
> >
> > > I have run into similar problems before.
> > >
> > > I had two problematic scenarios. In the first case, I had a “call” in the
> > > middle of the patched code-block, but this call was always followed by a
> > > “jump” to the end of the potentially patched code-block, so I did not have
> > > the problem.
> > >
> > > In the second case, I had an indirect call (which is shorter than a direct
> >
> > Longer, 6 bytes vs 5 if I'm not mistaken.
> >
> > > call) being patched into a direct call. In this case, I preceded the
> > > indirect call with NOPs so indeed the indirect call was at the end of the
> > > patched block.
> > >
> > > In certain cases, if a shorter instruction should be potentially patched
> > > into a longer one, the shorter one can be preceded by some prefixes. If
> > > there are multiple REX prefixes, for instance, the CPU only uses the last
> > > one, IIRC. This can allow to avoid synchronize_sched() when patching a
> > > single instruction into another instruction with a different length.
> > >
> > > Not sure how helpful this information is, but sharing - just in case.
> >
> > I think we can patch multiple instructions provided:
> >
> > - all but one instruction are a NOP,
> > - there are no branch targets inside the range.
> >
> > By poking INT3 at every instruction in the range and then doing the
> > machine wide IPI+SYNC, we'll trap every CPU that is inside the range.
>
> How do you know you'll trap them? You need to IPI, serialize, and get
> them to execute an instruction. If the CPU is in an interrupt and RIP
> just happens to be pointed to the INT3, you need them to execute a
> whole lot more than just one instruction.

Argh, yes, I'm an idiot.