2022-03-17 04:23:34

by Wangshaobo (bobo)

Subject: [RFC PATCH -next v2 0/4] arm64/ftrace: support dynamic trampoline

This implements dynamic trampolines on ARM64. Following the referenced
discussion [2], we complete the whole design for supporting long jumps in
dynamic trampolines:

.text section:
funcA: | funcA(): funcB():|
`-> +-----+ | | ... mov x9 |
| ... | | | adrp <- bl <> |
| nop | | | mov
| nop | | | br x16 ---+
funcB | nop | | | ftrace_(regs_)caller_tramp:
`-> +-----+ | `--> +---------------------+
| nop | | | ... |
| nop | | ftrace callsite +---------------------+
| ... | | `-----> | PLT entry: |
| nop | | | adrp |
| nop | | | add |
funcC: | nop | | ftrace graph callsite | br x16 |
`-> +-----+ | `-----> +---------------------+
| nop | | | ... |
| nop | | +---------------------+

But there is still a tricky problem waiting to be solved: how to adjust
the tracing ip.

On ARM64, in some cases the compiler may insert extra instructions at the
head of traceable functions (but not all of them), for instance BTI [1].

This is a dump of vmlinux built with CONFIG_BTI=y:

(1) Function gic_handle_irq has a bti at its head, so we adjust rec->ip += 5 instructions, landing on the last nop:
ffffffc0080100e0: d53cd042 mrs x2, tpidr_el2
...
ffffffc0080100f0: d503201f nop //__mcount_loc tells the rec->ip
ffffffc0080100f4: d503201f nop
ffffffc0080100f8: d503201f nop

ffffffc0080100fc <gic_handle_irq>:
ffffffc0080100fc: d503245f bti c
ffffffc008010100: d503201f nop
ffffffc008010104: d503201f nop //we adjust origin rec->ip+5 to here
ffffffc008010108: d503233f paciasp
(2) name_to_dev_t.part.0 does not have a bti at its head, so we should adjust rec->ip += 4 instructions, landing on the last nop:
ffff8000080137d4: d503201f nop
ffff8000080137d8: d503201f nop
ffff8000080137dc: d503201f nop

ffff8000080137e0 <name_to_dev_t.part.0>:
ffff8000080137e0: d503201f nop
ffff8000080137e4: d503201f nop
ffff8000080137e8: d503233f paciasp

So at this time we have no reliable way to identify rec->ip for each traceable function.

We are looking forward to follow-up discussions.

References:
[1] https://developer.arm.com/documentation/100076/0100/a64-instruction-set-reference/a64-general-instructions/bti
[2] https://lore.kernel.org/linux-arm-kernel/[email protected]/

Cheng Jian (4):
arm64: introduce aarch64_insn_gen_load_literal
arm64/ftrace: introduce ftrace dynamic trampoline entrances
arm64/ftrace: support dynamically allocated trampolines
arm64/ftrace: implement long jump for dynamic trampolines

arch/arm64/Makefile | 2 +-
arch/arm64/include/asm/ftrace.h | 10 +-
arch/arm64/include/asm/insn.h | 6 +
arch/arm64/include/asm/module.h | 9 +
arch/arm64/kernel/entry-ftrace.S | 88 ++++++--
arch/arm64/kernel/ftrace.c | 366 ++++++++++++++++++++++++++++---
arch/arm64/kernel/module-plts.c | 50 +++++
arch/arm64/lib/insn.c | 49 +++++
8 files changed, 532 insertions(+), 48 deletions(-)

--
2.25.1


2022-03-17 04:51:54

by Wangshaobo (bobo)

Subject: [RFC PATCH -next v2 4/4] arm64/ftrace: implement long jump for dynamic trampolines

From: Cheng Jian <[email protected]>

When KASLR is enabled, a long jump must be used to reach the corresponding
trampoline and to reach the callback from that trampoline.

First, we use -fpatchable-function-entry=5,3 to reserve 3 NOPs before
each function's entry point, in addition to the 2 NOPs after the entry
point that commit 3b23e4991fb6 ("arm64: implement ftrace with regs")
already reserves, so rec->ip now needs to be adjusted by +4 instructions
when no extra instruction (such as bti) is placed at the head.

Note that this will increase the size of the text section:

Using defconfig with -fpatchable-function-entry=5,3:
text data bss dec hex filename
27095761 14578580 533268 42207609 2840979 vmlinux

Using defconfig with -fpatchable-function-entry=2:
text data bss dec hex filename
26394597 14575988 533268 41503853 2794c6d vmlinux

When registering a new ftrace function, we attach the registered
ftrace_ops to its dynamically allocated trampoline rather than to the
ftrace_ops_list:

.text section:
funcA: | funcA(): funcB(): |
`-> +-----+ | | ... mov x9 x30 |
| ... | | | adrp <- bl <> |
| nop | | | add
| nop | | | br x16 ---+
funcB | nop | | | ftrace_(regs_)caller_tramp:
`-> +-----+ | `--> +---------------------+
| nop | | | ... |
| nop | | ftrace callsite +---------------------+
| ... | | `-----> | PLT entry: |
| nop | | | adrp |
| nop | | | add |
funcC: | nop | | ftrace graph callsite | br x16 |
`-> +-----+ | `-----> +---------------------+
| nop | | | ... |
| nop | | +---------------------+

There are three steps to create this connection when registering an
ftrace_ops for a function to be traced (take funcB as an example):

Step 1: use make_init_nop() to generate 'mov x9 x30' to preserve the LR.
funcB:
| mov x9 x30
| nop //rec->ip
| ...

Step 2: create the trampoline and use ftrace_make_call() to update the PLT entry:
funcA:
| ...
| adrp
| add
| br x16 //pointer to trampoline allocated

Step 3: fill the PLT entry in the trampoline (which holds the ftrace_ops whose address is loaded into x2) so that it branches to the callback handler:
trampoline:
| ...
| adrp
| add
| br x16 //pointer to callback handler

Finally, atomically replace the nop with a bl in funcB to enable profiling.
funcB:
| mov x9 x30
| bl <> //to the first adrp in funcA
| ...

If funcB is later bound to a different single ftrace_ops, Step 2 and Step 3
are repeated. If funcB is bound to multiple ftrace_ops at once, this
trampoline is not used; the callsite jumps to ftrace_(regs_)caller instead.
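
For reference, a registration that exercises the path above could look like
the following sketch (funcb_probe, funcb_ops and the initcall are hypothetical
names; "funcB" stands for any traced symbol). When this is the only ftrace_ops
attached to funcB, the callsite can be pointed at the ops' private trampoline;
otherwise it falls back to ftrace_(regs_)caller as described above:

#include <linux/ftrace.h>
#include <linux/init.h>
#include <linux/string.h>

/* Callback invoked from the dynamically allocated trampoline. */
static void funcb_probe(unsigned long ip, unsigned long parent_ip,
			struct ftrace_ops *ops, struct ftrace_regs *fregs)
{
	/* profiling work for funcB goes here */
}

static struct ftrace_ops funcb_ops = {
	.func	= funcb_probe,
	.flags	= FTRACE_OPS_FL_SAVE_REGS,
};

static int __init funcb_trace_init(void)
{
	int ret;

	/* Trace only funcB so this ops can get a trampoline of its own. */
	ret = ftrace_set_filter(&funcb_ops, "funcB", strlen("funcB"), 0);
	if (ret)
		return ret;

	return register_ftrace_function(&funcb_ops);
}
late_initcall(funcb_trace_init);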

Signed-off-by: Cheng Jian <[email protected]>
Signed-off-by: Wang ShaoBo <[email protected]>
---
arch/arm64/Makefile | 2 +-
arch/arm64/include/asm/ftrace.h | 2 +-
arch/arm64/include/asm/module.h | 9 +++
arch/arm64/kernel/entry-ftrace.S | 8 ++
arch/arm64/kernel/ftrace.c | 127 +++++++++++++++++--------------
arch/arm64/kernel/module-plts.c | 50 ++++++++++++
6 files changed, 140 insertions(+), 58 deletions(-)

diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
index 2f1de88651e6..3139d6d9ee52 100644
--- a/arch/arm64/Makefile
+++ b/arch/arm64/Makefile
@@ -130,7 +130,7 @@ CHECKFLAGS += -D__aarch64__

ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_REGS),y)
KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
- CC_FLAGS_FTRACE := -fpatchable-function-entry=2
+ CC_FLAGS_FTRACE := -fpatchable-function-entry=5,3
endif

# Default value
diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
index 1494cfa8639b..d0562c13b1dc 100644
--- a/arch/arm64/include/asm/ftrace.h
+++ b/arch/arm64/include/asm/ftrace.h
@@ -70,7 +70,7 @@ static inline unsigned long ftrace_call_adjust(unsigned long addr)
* See ftrace_init_nop() for the callsite sequence.
*/
if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_REGS))
- return addr + AARCH64_INSN_SIZE;
+ return addr + AARCH64_INSN_SIZE + sizeof(struct plt_entry);
/*
* addr is the address of the mcount call instruction.
* recordmcount does the necessary offset calculation.
diff --git a/arch/arm64/include/asm/module.h b/arch/arm64/include/asm/module.h
index 4e7fa2623896..30f5b1cd5c9c 100644
--- a/arch/arm64/include/asm/module.h
+++ b/arch/arm64/include/asm/module.h
@@ -36,6 +36,14 @@ extern u64 module_alloc_base;
#define module_alloc_base ((u64)_etext - MODULES_VSIZE)
#endif

+enum {
+ PLT_MAKE_NOP,
+ PLT_MAKE_CALL,
+ PLT_MAKE_CALL_LINK,
+ PLT_MAKE_LCALL,
+ PLT_MAKE_LCALL_LINK
+};
+
struct plt_entry {
/*
* A program that conforms to the AArch64 Procedure Call Standard
@@ -58,6 +66,7 @@ static inline bool is_forbidden_offset_for_adrp(void *place)
}

struct plt_entry get_plt_entry(u64 dst, void *pc);
+struct plt_entry update_plt_entry(u64 pc, u64 dst, int record);
bool plt_entries_equal(const struct plt_entry *a, const struct plt_entry *b);

static inline bool plt_entry_is_initialized(const struct plt_entry *e)
diff --git a/arch/arm64/kernel/entry-ftrace.S b/arch/arm64/kernel/entry-ftrace.S
index 88462d925446..71290f0e0740 100644
--- a/arch/arm64/kernel/entry-ftrace.S
+++ b/arch/arm64/kernel/entry-ftrace.S
@@ -154,8 +154,12 @@ SYM_INNER_LABEL(ftrace_regs_caller_tramp_ops, SYM_L_GLOBAL)
mov x3, sp // regs
SYM_INNER_LABEL(ftrace_regs_caller_tramp_call, SYM_L_GLOBAL)
nop
+ nop
+ nop
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
SYM_INNER_LABEL(ftrace_regs_caller_tramp_graph_call, SYM_L_GLOBAL)
+ nop
+ nop
nop // ftrace_graph_caller()
#endif
ftrace_regs_restore
@@ -175,8 +179,12 @@ SYM_INNER_LABEL(ftrace_caller_tramp_ops, SYM_L_GLOBAL)
mov x3, sp // regs
SYM_INNER_LABEL(ftrace_caller_tramp_call, SYM_L_GLOBAL)
nop
+ nop
+ nop
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
SYM_INNER_LABEL(ftrace_caller_tramp_graph_call, SYM_L_GLOBAL)
+ nop
+ nop
nop // ftrace_graph_caller()
#endif
ftrace_regs_restore
diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
index d35a042baf75..4011aceefe8c 100644
--- a/arch/arm64/kernel/ftrace.c
+++ b/arch/arm64/kernel/ftrace.c
@@ -234,27 +234,55 @@ ftrace_trampoline_get_func(struct ftrace_ops *ops, bool is_graph)
return func;
}

+static struct plt_entry *
+update_ftrace_tramp_plt(unsigned long pc, unsigned long addr, bool enable, bool link)
+{
+ int i;
+ struct plt_entry entry;
+ u32 *pentry;
+ long offset;
+ struct plt_entry *plt = (struct plt_entry *)pc;
+
+ offset = (long)pc - (long)addr;
+
+ if (!enable)
+ entry = update_plt_entry(pc, addr, PLT_MAKE_NOP);
+ else if (offset < -SZ_128M || offset >= SZ_128M) {
+ if (link)
+ entry = update_plt_entry(pc, addr, PLT_MAKE_LCALL_LINK);
+ else
+ entry = update_plt_entry(pc, addr, PLT_MAKE_LCALL);
+ } else {
+ if (link)
+ entry = update_plt_entry(pc, addr, PLT_MAKE_CALL_LINK);
+ else
+ entry = update_plt_entry(pc, addr, PLT_MAKE_CALL);
+ }
+
+ pentry = (u32 *)&entry;
+ for (i = 0; i < sizeof(struct plt_entry) / AARCH64_INSN_SIZE; i++)
+ aarch64_insn_patch_text_nosync((void *)((u32*)pc + i), *(pentry + i));
+
+ return plt;
+}
+
static int
ftrace_trampoline_modify_call(struct ftrace_ops *ops, bool is_graph, bool active)
{
unsigned long offset;
unsigned long pc;
ftrace_func_t func;
- u32 nop, branch;

offset = calc_trampoline_call_offset(ops->flags &
FTRACE_OPS_FL_SAVE_REGS, is_graph);
pc = ops->trampoline + offset;

func = ftrace_trampoline_get_func(ops, is_graph);
- nop = aarch64_insn_gen_nop();
- branch = aarch64_insn_gen_branch_imm(pc, (unsigned long)func,
- AARCH64_INSN_BRANCH_LINK);

- if (active)
- return ftrace_modify_code(pc, 0, branch, false);
- else
- return ftrace_modify_code(pc, 0, nop, false);
+ if (update_ftrace_tramp_plt(pc, (unsigned long)func, active, true))
+ return 0;
+
+ return -EINVAL;
}

extern int ftrace_graph_active;
@@ -262,13 +290,6 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
{
unsigned int size;

- /*
- * If kaslr is enabled, the address of tramp and ftrace call
- * may be far away, Therefore, long jump support is required.
- */
- if (IS_ENABLED(CONFIG_RANDOMIZE_BASE))
- return;
-
if (!ops->trampoline) {
ops->trampoline = create_trampoline(ops, &size);
if (!ops->trampoline)
@@ -349,17 +370,34 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
return ftrace_modify_code(pc, 0, new, false);
}

-static struct plt_entry *get_ftrace_plt(struct module *mod, unsigned long addr)
+extern void ftrace_kernel_plt(void);
+static struct plt_entry *
+get_ftrace_plt(struct module *mod, unsigned long pc, unsigned long addr)
{
-#ifdef CONFIG_ARM64_MODULE_PLTS
- struct plt_entry *plt = mod->arch.ftrace_trampolines;
+ struct plt_entry *plt;

- if (addr == FTRACE_ADDR)
- return &plt[FTRACE_PLT_IDX];
- if (addr == FTRACE_REGS_ADDR &&
- IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_REGS))
- return &plt[FTRACE_REGS_PLT_IDX];
+ if (mod) {
+#ifdef CONFIG_ARM64_MODULE_PLTS
+ plt = mod->arch.ftrace_trampolines;
+
+ if (addr == FTRACE_ADDR)
+ return &plt[FTRACE_PLT_IDX];
+ else if (addr == FTRACE_REGS_ADDR &&
+ IS_ENABLED(CONFIG_FTRACE_WITH_REGS))
+ return &plt[FTRACE_REGS_PLT_IDX];
+ else {
#endif
+ pc = pc - sizeof(struct plt_entry) - AARCH64_INSN_SIZE;
+ return update_ftrace_tramp_plt(pc, addr, true, false);
+#ifdef CONFIG_ARM64_MODULE_PLTS
+ }
+#endif
+ } else {
+ WARN_ON(addr == FTRACE_ADDR || addr == FTRACE_REGS_ADDR);
+ pc = pc - sizeof(struct plt_entry) - AARCH64_INSN_SIZE;
+ return update_ftrace_tramp_plt(pc, addr, true, false);
+ }
+
return NULL;
}

@@ -376,9 +414,6 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
struct module *mod;
struct plt_entry *plt;

- if (!IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
- return -EINVAL;
-
/*
* On kernels that support module PLTs, the offset between the
* branch instruction and its target may legally exceed the
@@ -393,12 +428,10 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
mod = __module_text_address(pc);
preempt_enable();

- if (WARN_ON(!mod))
- return -EINVAL;
-
- plt = get_ftrace_plt(mod, addr);
+ plt = get_ftrace_plt(mod, pc, addr);
if (!plt) {
- pr_err("ftrace: no module PLT for %ps\n", (void *)addr);
+ pr_err("ftrace: no %s PLT for %ps\n",
+ mod ? "module" : "kernel", (void *)addr);
return -EINVAL;
}

@@ -474,36 +507,18 @@ int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec,
if (offset < -SZ_128M || offset >= SZ_128M) {
u32 replaced;

- if (!IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
- return -EINVAL;
-
- /*
- * 'mod' is only set at module load time, but if we end up
- * dealing with an out-of-range condition, we can assume it
- * is due to a module being loaded far away from the kernel.
- */
- if (!mod) {
- preempt_disable();
- mod = __module_text_address(pc);
- preempt_enable();
-
- if (WARN_ON(!mod))
- return -EINVAL;
- }
-
/*
- * The instruction we are about to patch may be a branch and
- * link instruction that was redirected via a PLT entry. In
- * this case, the normal validation will fail, but we can at
- * least check that we are dealing with a branch and link
- * instruction that points into the right module.
+ * The instruction we are about to patch may be a branch
+ * and link instruction that was redirected via a PLT entry
+ * supported by a dynamic trampoline or module, in this case,
+ * the normal validation will fail, but we can at least check
+ * that we are dealing with a branch and link instruction
+ * that points into the right place.
*/
if (aarch64_insn_read((void *)pc, &replaced))
return -EFAULT;

- if (!aarch64_insn_is_bl(replaced) ||
- !within_module(pc + aarch64_get_branch_offset(replaced),
- mod))
+ if (!aarch64_insn_is_bl(replaced))
return -EINVAL;

validate = false;
diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
index e53493d8b208..b7a832dfaa69 100644
--- a/arch/arm64/kernel/module-plts.c
+++ b/arch/arm64/kernel/module-plts.c
@@ -37,6 +37,56 @@ struct plt_entry get_plt_entry(u64 dst, void *pc)
return plt;
}

+struct plt_entry update_plt_entry(u64 pc, u64 dst, int record)
+{
+ u32 adrp, add, br;
+
+ switch (record) {
+ case PLT_MAKE_NOP:
+ adrp = aarch64_insn_gen_nop();
+ add = aarch64_insn_gen_nop();
+ br = aarch64_insn_gen_nop();
+ break;
+
+ case PLT_MAKE_CALL_LINK:
+ adrp = aarch64_insn_gen_nop();
+ add = aarch64_insn_gen_nop();
+ br = aarch64_insn_gen_branch_imm(pc + 8, dst,
+ AARCH64_INSN_BRANCH_LINK);
+ break;
+
+ case PLT_MAKE_LCALL_LINK:
+ adrp = aarch64_insn_gen_adr(pc, dst, AARCH64_INSN_REG_16,
+ AARCH64_INSN_ADR_TYPE_ADRP);
+ add = aarch64_insn_gen_add_sub_imm(AARCH64_INSN_REG_16,
+ AARCH64_INSN_REG_16, dst % SZ_4K,
+ AARCH64_INSN_VARIANT_64BIT,
+ AARCH64_INSN_ADSB_ADD);
+ br = aarch64_insn_gen_branch_reg(AARCH64_INSN_REG_16,
+ AARCH64_INSN_BRANCH_LINK);
+ break;
+
+ case PLT_MAKE_LCALL:
+ adrp = aarch64_insn_gen_adr(pc, dst, AARCH64_INSN_REG_16,
+ AARCH64_INSN_ADR_TYPE_ADRP);
+ add = aarch64_insn_gen_add_sub_imm(AARCH64_INSN_REG_16,
+ AARCH64_INSN_REG_16,
+ dst % SZ_4K,
+ AARCH64_INSN_VARIANT_64BIT,
+ AARCH64_INSN_ADSB_ADD);
+ br = aarch64_insn_gen_branch_reg(AARCH64_INSN_REG_16,
+ AARCH64_INSN_BRANCH_NOLINK);
+ break;
+ default:
+ pr_err("[%s %d] error flag %d\n", __func__, __LINE__, record);
+ BUG();
+ break;
+ }
+
+ return (struct plt_entry){ cpu_to_le32(adrp), cpu_to_le32(add),
+ cpu_to_le32(br) };
+}
+
bool plt_entries_equal(const struct plt_entry *a, const struct plt_entry *b)
{
u64 p, q;
--
2.25.1

2022-03-17 05:14:52

by Wangshaobo (bobo)

Subject: [RFC PATCH -next v2 2/4] arm64/ftrace: introduce ftrace dynamic trampoline entrances

From: Cheng Jian <[email protected]>

We introduce two function entrances, ftrace_(regs_)caller_tramp, for
dynamic trampoline use. They are placed in a read-only section and are
copied in their entirety into the space allocated for each trampoline.
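
For context, a much simplified sketch of how such a template might be consumed
is shown below; the real create_trampoline() is added in patch 3/4 and is not
shown here, and it additionally has to fix up the per-ops literal and branches
and perform cache maintenance. The helper name is an assumption:

#include <linux/moduleloader.h>
#include <linux/string.h>
#include <asm/insn.h>

/* Boundary labels exported by entry-ftrace.S (see the diff below). */
extern void ftrace_caller_tramp(void);
extern void ftrace_caller_tramp_end(void);

/* Simplified sketch: copy the read-only template into executable memory. */
static unsigned long alloc_tramp_copy(void)
{
	unsigned long start = (unsigned long)ftrace_caller_tramp;
	/* include the final "ret x9" that follows the _end label */
	unsigned long size = (unsigned long)ftrace_caller_tramp_end - start +
			     AARCH64_INSN_SIZE;
	void *tramp = module_alloc(size);

	if (!tramp)
		return 0;

	/* cache maintenance and permission handling omitted for brevity */
	memcpy(tramp, (void *)start, size);
	return (unsigned long)tramp;
}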

Signed-off-by: Cheng Jian <[email protected]>
Signed-off-by: Wang ShaoBo <[email protected]>
---
arch/arm64/kernel/entry-ftrace.S | 80 +++++++++++++++++++++++++-------
1 file changed, 64 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/kernel/entry-ftrace.S b/arch/arm64/kernel/entry-ftrace.S
index e535480a4069..88462d925446 100644
--- a/arch/arm64/kernel/entry-ftrace.S
+++ b/arch/arm64/kernel/entry-ftrace.S
@@ -76,6 +76,23 @@
add x29, sp, #S_STACKFRAME
.endm

+ .macro ftrace_regs_restore
+ /* Restore function arguments */
+ ldp x0, x1, [sp]
+ ldp x2, x3, [sp, #S_X2]
+ ldp x4, x5, [sp, #S_X4]
+ ldp x6, x7, [sp, #S_X6]
+ ldr x8, [sp, #S_X8]
+
+ /* Restore the callsite's FP, LR, PC */
+ ldr x29, [sp, #S_FP]
+ ldr x30, [sp, #S_LR]
+ ldr x9, [sp, #S_PC]
+
+ /* Restore the callsite's SP */
+ add sp, sp, #PT_REGS_SIZE + 16
+ .endm
+
SYM_CODE_START(ftrace_regs_caller)
bti c
ftrace_regs_entry 1
@@ -108,22 +125,8 @@ SYM_INNER_LABEL(ftrace_graph_call, SYM_L_GLOBAL) // ftrace_graph_caller();
* x19-x29 per the AAPCS, and we created frame records upon entry, so we need
* to restore x0-x8, x29, and x30.
*/
-ftrace_common_return:
- /* Restore function arguments */
- ldp x0, x1, [sp]
- ldp x2, x3, [sp, #S_X2]
- ldp x4, x5, [sp, #S_X4]
- ldp x6, x7, [sp, #S_X6]
- ldr x8, [sp, #S_X8]
-
- /* Restore the callsite's FP, LR, PC */
- ldr x29, [sp, #S_FP]
- ldr x30, [sp, #S_LR]
- ldr x9, [sp, #S_PC]
-
- /* Restore the callsite's SP */
- add sp, sp, #PT_REGS_SIZE + 16
-
+SYM_INNER_LABEL(ftrace_common_return, SYM_L_GLOBAL)
+ ftrace_regs_restore
ret x9
SYM_CODE_END(ftrace_common)

@@ -138,6 +141,51 @@ SYM_CODE_START(ftrace_graph_caller)
SYM_CODE_END(ftrace_graph_caller)
#endif

+.pushsection ".rodata", "a"
+// ftrace trampoline for ftrace_regs_caller
+SYM_CODE_START(ftrace_regs_caller_tramp)
+ bti c
+ ftrace_regs_entry 1 // save all regs
+
+ sub x0, x30, #AARCH64_INSN_SIZE // ip (callsite's BL insn)
+ mov x1, x9 // parent_ip (callsite's LR)
+SYM_INNER_LABEL(ftrace_regs_caller_tramp_ops, SYM_L_GLOBAL)
+ ldr x2, 0 // ops
+ mov x3, sp // regs
+SYM_INNER_LABEL(ftrace_regs_caller_tramp_call, SYM_L_GLOBAL)
+ nop
+#ifdef CONFIG_FUNCTION_GRAPH_TRACER
+SYM_INNER_LABEL(ftrace_regs_caller_tramp_graph_call, SYM_L_GLOBAL)
+ nop // ftrace_graph_caller()
+#endif
+ ftrace_regs_restore
+SYM_INNER_LABEL(ftrace_regs_caller_tramp_end, SYM_L_GLOBAL)
+ ret x9
+SYM_CODE_END(ftrace_regs_caller_tramp)
+
+// ftrace trampoline for ftrace_caller
+SYM_CODE_START(ftrace_caller_tramp)
+ bti c
+ ftrace_regs_entry 0 // save all regs
+
+ sub x0, x30, #AARCH64_INSN_SIZE // ip (callsite's BL insn)
+ mov x1, x9 // parent_ip (callsite's LR)
+SYM_INNER_LABEL(ftrace_caller_tramp_ops, SYM_L_GLOBAL)
+ ldr x2, 0 // ops
+ mov x3, sp // regs
+SYM_INNER_LABEL(ftrace_caller_tramp_call, SYM_L_GLOBAL)
+ nop
+#ifdef CONFIG_FUNCTION_GRAPH_TRACER
+SYM_INNER_LABEL(ftrace_caller_tramp_graph_call, SYM_L_GLOBAL)
+ nop // ftrace_graph_caller()
+#endif
+ ftrace_regs_restore
+SYM_INNER_LABEL(ftrace_caller_tramp_end, SYM_L_GLOBAL)
+ ret x9
+SYM_CODE_END(ftrace_caller_tramp)
+.popsection // .rodata
+
+
#else /* CONFIG_DYNAMIC_FTRACE_WITH_REGS */

/*
--
2.25.1

2022-03-17 05:21:37

by Steven Rostedt

Subject: Re: [RFC PATCH -next v2 0/4] arm64/ftrace: support dynamic trampoline

On Wed, 16 Mar 2022 18:01:28 +0800
Wang ShaoBo <[email protected]> wrote:

> This implements dynamic trampoline in ARM64, as reference said, we
> complete whole design of supporting long jump in dynamic trampoline:
>
> .text section:
> funcA: | funcA(): funcB():|
> `-> +-----+ | | ... mov x9 |
> | ... | | | adrp <- bl <> |
> | nop | | | mov
> | nop | | | br x16 ---+
> funcB | nop | | | ftrace_(regs_)caller_tramp:
> `-> +-----+ | `--> +---------------------+
> | nop | | | ... |
> | nop | | ftrace callsite +---------------------+
> | ... | | `-----> | PLT entry: |
> | nop | | | adrp |
> | nop | | | add |
> funcC: | nop | | ftrace graph callsite | br x16 |
> `-> +-----+ | `-----> +---------------------+
> | nop | | | ... |
> | nop | | +---------------------+
>
> But there is still a tricky problem that is how to adjust tracing ip,
> waiting to be solved:
>
> For ARM64, somecases there may be extra instructions inserted into the
> head of tracable functions(but not all) by compiler, for instance BTI[1].
>
> This dump vmlinux with CONFIG_BTI=y:
>
> (1) function gic_handle_irq has bti in its head, so we adjust rec->ip+=5 to last nop
> ffffffc0080100e0: d53cd042 mrs x2, tpidr_el2
> ...
> ffffffc0080100f0: d503201f nop //__mcount_loc tells the rec->ip
> ffffffc0080100f4: d503201f nop
> ffffffc0080100f8: d503201f nop
>
> ffffffc0080100fc <gic_handle_irq>:
> ffffffc0080100fc: d503245f bti c
> ffffffc008010100: d503201f nop
> ffffffc008010104: d503201f nop //we adjust origin rec->ip+5 to here
> ffffffc008010108: d503233f paciasp
> (2) name_to_dev_t.part.0 do not has bti in its head, so we should adjust rec->ip+=4 to last nop
> ffff8000080137d4: d503201f nop
> ffff8000080137d8: d503201f nop
> ffff8000080137dc: d503201f nop
>
> ffff8000080137e0 <name_to_dev_t.part.0>:
> ffff8000080137e0: d503201f nop
> ffff8000080137e4: d503201f nop
> ffff8000080137e8: d503233f paciasp
>
> So at this time we have no idea to identify rec->ip for each tracable function.

This looks like the same issue that Peter Zijlstra is handling for IBT on
x86, which I think can be useful for you too.

https://lore.kernel.org/all/[email protected]/

Specifically this patch:

https://lore.kernel.org/all/[email protected]/

That patch modifies ftrace_location() to return rec->ip when the ip you
pass in is one that kallsyms reports as having an offset of zero within
its function.

-- Steve

>
> we are looking forward to follow-up discussions.
>
> References:
> [1] https://developer.arm.com/documentation/100076/0100/a64-instruction-set-reference/a64-general-instructions/bti
> [2] https://lore.kernel.org/linux-arm-kernel/[email protected]/
>
> Cheng Jian (4):
> arm64: introduce aarch64_insn_gen_load_literal
> arm64/ftrace: introduce ftrace dynamic trampoline entrances
> arm64/ftrace: support dynamically allocated trampolines
> arm64/ftrace: implement long jump for dynamic trampolines
>
> arch/arm64/Makefile | 2 +-
> arch/arm64/include/asm/ftrace.h | 10 +-
> arch/arm64/include/asm/insn.h | 6 +
> arch/arm64/include/asm/module.h | 9 +
> arch/arm64/kernel/entry-ftrace.S | 88 ++++++--
> arch/arm64/kernel/ftrace.c | 366 ++++++++++++++++++++++++++++---
> arch/arm64/kernel/module-plts.c | 50 +++++
> arch/arm64/lib/insn.c | 49 +++++
> 8 files changed, 532 insertions(+), 48 deletions(-)
>

2022-03-17 05:25:03

by Wangshaobo (bobo)

Subject: [RFC PATCH -next v2 1/4] arm64: introduce aarch64_insn_gen_load_literal

From: Cheng Jian <[email protected]>

This introduces a helper to generate LDR (literal) instructions.

LDR <Xt>, <label>
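
As a usage sketch (the helper name and offset below are illustrative, not part
of the patch), this is roughly how a trampoline's 'ldr x2, 0' placeholder
could be regenerated with a concrete literal offset and patched in:

#include <asm/insn.h>
#include <asm/patching.h>
#include <asm/debug-monitors.h>
#include <linux/errno.h>

/*
 * Illustrative only: encode "LDR X2, <label>" with the literal placed
 * literal_offset bytes away from the instruction (the offset is encoded
 * as imm19 words, so it must be 4-byte aligned and within +/-1MiB), then
 * patch it into the instruction slot at pc.
 */
static int patch_ops_load(void *pc, long literal_offset)
{
	u32 insn = aarch64_insn_gen_load_literal(AARCH64_INSN_REG_2,
						 AARCH64_INSN_VARIANT_64BIT,
						 literal_offset);

	if (insn == AARCH64_BREAK_FAULT)
		return -EINVAL;

	return aarch64_insn_patch_text_nosync(pc, insn);
}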

Signed-off-by: Cheng Jian <[email protected]>
Signed-off-by: Wang ShaoBo <[email protected]>
---
arch/arm64/include/asm/insn.h | 6 +++++
arch/arm64/lib/insn.c | 49 +++++++++++++++++++++++++++++++++++
2 files changed, 55 insertions(+)

diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
index 6b776c8667b2..95b3562843c2 100644
--- a/arch/arm64/include/asm/insn.h
+++ b/arch/arm64/include/asm/insn.h
@@ -458,6 +458,9 @@ u32 aarch64_insn_gen_hint(enum aarch64_insn_hint_cr_op op);
u32 aarch64_insn_gen_nop(void);
u32 aarch64_insn_gen_branch_reg(enum aarch64_insn_register reg,
enum aarch64_insn_branch_type type);
+u32 aarch64_insn_gen_load_literal(enum aarch64_insn_register reg,
+ enum aarch64_insn_variant variant,
+ long offset);
u32 aarch64_insn_gen_load_store_reg(enum aarch64_insn_register reg,
enum aarch64_insn_register base,
enum aarch64_insn_register offset,
@@ -544,6 +547,9 @@ u32 aarch64_insn_gen_prefetch(enum aarch64_insn_register base,
s32 aarch64_get_branch_offset(u32 insn);
u32 aarch64_set_branch_offset(u32 insn, s32 offset);

+s32 aarch64_insn_get_ldr_lit_offset(u32 insn);
+u32 aarch64_insn_set_ldr_lit_offset(u32 insn, u32 offset);
+
s32 aarch64_insn_adrp_get_offset(u32 insn);
u32 aarch64_insn_adrp_set_offset(u32 insn, s32 offset);

diff --git a/arch/arm64/lib/insn.c b/arch/arm64/lib/insn.c
index fccfe363e567..9ac5fb4e76e8 100644
--- a/arch/arm64/lib/insn.c
+++ b/arch/arm64/lib/insn.c
@@ -17,6 +17,7 @@
#include <asm/kprobes.h>

#define AARCH64_INSN_SF_BIT BIT(31)
+#define AARCH64_INSN_OPC_BIT BIT(30)
#define AARCH64_INSN_N_BIT BIT(22)
#define AARCH64_INSN_LSL_12 BIT(22)

@@ -473,6 +474,54 @@ u32 aarch64_insn_gen_branch_reg(enum aarch64_insn_register reg,
return aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, reg);
}

+s32 aarch64_insn_get_ldr_lit_offset(u32 insn)
+{
+ s32 imm;
+
+ if (aarch64_insn_is_ldr_lit(insn)) {
+ imm = aarch64_insn_decode_immediate(AARCH64_INSN_IMM_19, insn);
+ return (imm << 13) >> 11;
+ }
+
+ /* Unhandled instruction */
+ BUG();
+}
+
+u32 aarch64_insn_set_ldr_lit_offset(u32 insn, u32 offset)
+{
+ if (aarch64_insn_is_ldr_lit(insn))
+ return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_19, insn,
+ offset >> 2);
+ /* Unhandled instruction */
+ BUG();
+}
+
+u32 aarch64_insn_gen_load_literal(enum aarch64_insn_register reg,
+ enum aarch64_insn_variant variant,
+ long offset)
+{
+ u32 insn;
+
+ insn = aarch64_insn_get_ldr_lit_value();
+
+ switch (variant) {
+ case AARCH64_INSN_VARIANT_32BIT:
+ /* 32-bit ops == 00 */
+ break;
+ case AARCH64_INSN_VARIANT_64BIT:
+ /* 64-bit opc == 01 */
+ insn |= AARCH64_INSN_OPC_BIT;
+ break;
+ default:
+ pr_err("%s: unknown variant encoding %d\n", __func__, variant);
+ return AARCH64_BREAK_FAULT;
+ }
+
+ insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RT, insn, reg);
+
+ return aarch64_insn_set_ldr_lit_offset(insn, offset);
+}
+
u32 aarch64_insn_gen_load_store_reg(enum aarch64_insn_register reg,
enum aarch64_insn_register base,
enum aarch64_insn_register offset,
--
2.25.1

2022-04-21 15:30:47

by Mark Rutland

Subject: Re: [RFC PATCH -next v2 0/4] arm64/ftrace: support dynamic trampoline

On Wed, Mar 16, 2022 at 06:01:28PM +0800, Wang ShaoBo wrote:
> This implements dynamic trampoline in ARM64, as reference said, we
> complete whole design of supporting long jump in dynamic trampoline:
>
> .text section:
> funcA: | funcA(): funcB():|
> `-> +-----+ | | ... mov x9 |
> | ... | | | adrp <- bl <> |
> | nop | | | mov
> | nop | | | br x16 ---+
> funcB | nop | | | ftrace_(regs_)caller_tramp:
> `-> +-----+ | `--> +---------------------+
> | nop | | | ... |
> | nop | | ftrace callsite +---------------------+
> | ... | | `-----> | PLT entry: |
> | nop | | | adrp |
> | nop | | | add |
> funcC: | nop | | ftrace graph callsite | br x16 |
> `-> +-----+ | `-----> +---------------------+
> | nop | | | ... |
> | nop | | +---------------------+
>
> But there is still a tricky problem that is how to adjust tracing ip,
> waiting to be solved:
>
> For ARM64, somecases there may be extra instructions inserted into the
> head of tracable functions(but not all) by compiler, for instance BTI[1].
>
> This dump vmlinux with CONFIG_BTI=y:
>
> (1) function gic_handle_irq has bti in its head, so we adjust rec->ip+=5 to last nop
> ffffffc0080100e0: d53cd042 mrs x2, tpidr_el2
> ...
> ffffffc0080100f0: d503201f nop //__mcount_loc tells the rec->ip
> ffffffc0080100f4: d503201f nop
> ffffffc0080100f8: d503201f nop
>
> ffffffc0080100fc <gic_handle_irq>:
> ffffffc0080100fc: d503245f bti c
> ffffffc008010100: d503201f nop
> ffffffc008010104: d503201f nop //we adjust origin rec->ip+5 to here
> ffffffc008010108: d503233f paciasp
> (2) name_to_dev_t.part.0 do not has bti in its head, so we should adjust rec->ip+=4 to last nop
> ffff8000080137d4: d503201f nop
> ffff8000080137d8: d503201f nop
> ffff8000080137dc: d503201f nop
>
> ffff8000080137e0 <name_to_dev_t.part.0>:
> ffff8000080137e0: d503201f nop
> ffff8000080137e4: d503201f nop
> ffff8000080137e8: d503233f paciasp
>
> So at this time we have no idea to identify rec->ip for each tracable function.

When I had looked into this in the past, I had assumed we could figure
this out at ftrace_init_nop() time, and adjust rec->ip there.

However, placing code *before* the function entry point will also break
stacktracing and will require adjustment, and I'd been intending to
clean up the stacktrace code first, so I haven't looked at that in a
while.

I'll take a look at the rest of the series shortly.

Thanks,
Mark.

>
> we are looking forward to follow-up discussions.
>
> References:
> [1] https://developer.arm.com/documentation/100076/0100/a64-instruction-set-reference/a64-general-instructions/bti
> [2] https://lore.kernel.org/linux-arm-kernel/[email protected]/
>
> Cheng Jian (4):
> arm64: introduce aarch64_insn_gen_load_literal
> arm64/ftrace: introduce ftrace dynamic trampoline entrances
> arm64/ftrace: support dynamically allocated trampolines
> arm64/ftrace: implement long jump for dynamic trampolines
>
> arch/arm64/Makefile | 2 +-
> arch/arm64/include/asm/ftrace.h | 10 +-
> arch/arm64/include/asm/insn.h | 6 +
> arch/arm64/include/asm/module.h | 9 +
> arch/arm64/kernel/entry-ftrace.S | 88 ++++++--
> arch/arm64/kernel/ftrace.c | 366 ++++++++++++++++++++++++++++---
> arch/arm64/kernel/module-plts.c | 50 +++++
> arch/arm64/lib/insn.c | 49 +++++
> 8 files changed, 532 insertions(+), 48 deletions(-)
>
> --
> 2.25.1
>

2022-04-21 22:26:10

by Steven Rostedt

Subject: Re: [RFC PATCH -next v2 0/4] arm64/ftrace: support dynamic trampoline

Is this going anywhere?

-- Steve


On Wed, 16 Mar 2022 18:01:28 +0800
Wang ShaoBo <[email protected]> wrote:

> This implements dynamic trampoline in ARM64, as reference said, we
> complete whole design of supporting long jump in dynamic trampoline:
>
> .text section:
> funcA: | funcA(): funcB():|
> `-> +-----+ | | ... mov x9 |
> | ... | | | adrp <- bl <> |
> | nop | | | mov
> | nop | | | br x16 ---+
> funcB | nop | | | ftrace_(regs_)caller_tramp:
> `-> +-----+ | `--> +---------------------+
> | nop | | | ... |
> | nop | | ftrace callsite +---------------------+
> | ... | | `-----> | PLT entry: |
> | nop | | | adrp |
> | nop | | | add |
> funcC: | nop | | ftrace graph callsite | br x16 |
> `-> +-----+ | `-----> +---------------------+
> | nop | | | ... |
> | nop | | +---------------------+
>
> But there is still a tricky problem that is how to adjust tracing ip,
> waiting to be solved:
>
> For ARM64, somecases there may be extra instructions inserted into the
> head of tracable functions(but not all) by compiler, for instance BTI[1].
>
> This dump vmlinux with CONFIG_BTI=y:
>
> (1) function gic_handle_irq has bti in its head, so we adjust rec->ip+=5 to last nop
> ffffffc0080100e0: d53cd042 mrs x2, tpidr_el2
> ...
> ffffffc0080100f0: d503201f nop //__mcount_loc tells the rec->ip
> ffffffc0080100f4: d503201f nop
> ffffffc0080100f8: d503201f nop
>
> ffffffc0080100fc <gic_handle_irq>:
> ffffffc0080100fc: d503245f bti c
> ffffffc008010100: d503201f nop
> ffffffc008010104: d503201f nop //we adjust origin rec->ip+5 to here
> ffffffc008010108: d503233f paciasp
> (2) name_to_dev_t.part.0 do not has bti in its head, so we should adjust rec->ip+=4 to last nop
> ffff8000080137d4: d503201f nop
> ffff8000080137d8: d503201f nop
> ffff8000080137dc: d503201f nop
>
> ffff8000080137e0 <name_to_dev_t.part.0>:
> ffff8000080137e0: d503201f nop
> ffff8000080137e4: d503201f nop
> ffff8000080137e8: d503233f paciasp
>
> So at this time we have no idea to identify rec->ip for each tracable function.
>
> we are looking forward to follow-up discussions.
>
> References:
> [1] https://developer.arm.com/documentation/100076/0100/a64-instruction-set-reference/a64-general-instructions/bti
> [2] https://lore.kernel.org/linux-arm-kernel/[email protected]/
>
> Cheng Jian (4):
> arm64: introduce aarch64_insn_gen_load_literal
> arm64/ftrace: introduce ftrace dynamic trampoline entrances
> arm64/ftrace: support dynamically allocated trampolines
> arm64/ftrace: implement long jump for dynamic trampolines
>
> arch/arm64/Makefile | 2 +-
> arch/arm64/include/asm/ftrace.h | 10 +-
> arch/arm64/include/asm/insn.h | 6 +
> arch/arm64/include/asm/module.h | 9 +
> arch/arm64/kernel/entry-ftrace.S | 88 ++++++--
> arch/arm64/kernel/ftrace.c | 366 ++++++++++++++++++++++++++++---
> arch/arm64/kernel/module-plts.c | 50 +++++
> arch/arm64/lib/insn.c | 49 +++++
> 8 files changed, 532 insertions(+), 48 deletions(-)
>

2022-04-21 23:45:01

by Steven Rostedt

Subject: Re: [RFC PATCH -next v2 0/4] arm64/ftrace: support dynamic trampoline

On Thu, 21 Apr 2022 09:13:01 +0800
"Wangshaobo (bobo)" <[email protected]> wrote:

> Not yet, Steve, ftrace_location() looks has no help to find a right
> rec->ip in our case,
>
> ftrace_location() can find a right rec->ip when input ip is in the range
> between
>
> sym+0 and sym+$end, but our question is how to  identify rec->ip from
> __mcount_loc,

Are you saying that the "ftrace location" is not between sym+0 and sym+$end?

>
> this changed the patchable entry before bti to after in gcc:
>
>    [1] https://reviews.llvm.org/D73680
>
> gcc tells the place of first nop of the 5 NOPs when using
> -fpatchable-function-entry=5,3,
>
> but not tells the first nop after bti, so we don't know how to adjust
> our rec->ip for ftrace.

OK, so I do not understand how the compiler is injecting bti with mcount
calls, so I'll just walk away for now ;-)

-- Steve

2022-04-22 02:37:13

by Wangshaobo (bobo)

Subject: Re: [RFC PATCH -next v2 0/4] arm64/ftrace: support dynamic trampoline


On 2022/4/21 2:11, Steven Rostedt wrote:
> Is this going anywhere?
>
> -- Steve

Not yet, Steve. ftrace_location() does not look like it helps to find the
right rec->ip in our case.

ftrace_location() can find the right rec->ip when the input ip is in the
range between sym+0 and sym+$end, but our question is how to identify
rec->ip from what __mcount_loc records.

This change moved the patchable entry from before the bti to after it:

   [1] https://reviews.llvm.org/D73680

gcc tells us the place of the first nop of the 5 NOPs when using
-fpatchable-function-entry=5,3, but it does not tell us the first nop
after the bti, so we don't know how to adjust our rec->ip for ftrace.

>
>
> On Wed, 16 Mar 2022 18:01:28 +0800
> Wang ShaoBo <[email protected]> wrote:
>
>> This implements dynamic trampoline in ARM64, as reference said, we
>> complete whole design of supporting long jump in dynamic trampoline:
>>
>> .text section:
>> funcA: | funcA(): funcB():|
>> `-> +-----+ | | ... mov x9 |
>> | ... | | | adrp <- bl <> |
>> | nop | | | mov
>> | nop | | | br x16 ---+
>> funcB | nop | | | ftrace_(regs_)caller_tramp:
>> `-> +-----+ | `--> +---------------------+
>> | nop | | | ... |
>> | nop | | ftrace callsite +---------------------+
>> | ... | | `-----> | PLT entry: |
>> | nop | | | adrp |
>> | nop | | | add |
>> funcC: | nop | | ftrace graph callsite | br x16 |
>> `-> +-----+ | `-----> +---------------------+
>> | nop | | | ... |
>> | nop | | +---------------------+
>>
>> But there is still a tricky problem that is how to adjust tracing ip,
>> waiting to be solved:
>>
>> For ARM64, somecases there may be extra instructions inserted into the
>> head of tracable functions(but not all) by compiler, for instance BTI[1].
>>
>> This dump vmlinux with CONFIG_BTI=y:
>>
>> (1) function gic_handle_irq has bti in its head, so we adjust rec->ip+=5 to last nop
>> ffffffc0080100e0: d53cd042 mrs x2, tpidr_el2
>> ...
>> ffffffc0080100f0: d503201f nop //__mcount_loc tells the rec->ip
>> ffffffc0080100f4: d503201f nop
>> ffffffc0080100f8: d503201f nop
>>
>> ffffffc0080100fc <gic_handle_irq>:
>> ffffffc0080100fc: d503245f bti c
>> ffffffc008010100: d503201f nop
>> ffffffc008010104: d503201f nop //we adjust origin rec->ip+5 to here
>> ffffffc008010108: d503233f paciasp
>> (2) name_to_dev_t.part.0 do not has bti in its head, so we should adjust rec->ip+=4 to last nop
>> ffff8000080137d4: d503201f nop
>> ffff8000080137d8: d503201f nop
>> ffff8000080137dc: d503201f nop
>>
>> ffff8000080137e0 <name_to_dev_t.part.0>:
>> ffff8000080137e0: d503201f nop
>> ffff8000080137e4: d503201f nop
>> ffff8000080137e8: d503233f paciasp
>>
>> So at this time we have no idea to identify rec->ip for each tracable function.
>>
>> we are looking forward to follow-up discussions.
>>
>> References:
>> [1] https://developer.arm.com/documentation/100076/0100/a64-instruction-set-reference/a64-general-instructions/bti
>> [2] https://lore.kernel.org/linux-arm-kernel/[email protected]/
>>
>> Cheng Jian (4):
>> arm64: introduce aarch64_insn_gen_load_literal
>> arm64/ftrace: introduce ftrace dynamic trampoline entrances
>> arm64/ftrace: support dynamically allocated trampolines
>> arm64/ftrace: implement long jump for dynamic trampolines
>>
>> arch/arm64/Makefile | 2 +-
>> arch/arm64/include/asm/ftrace.h | 10 +-
>> arch/arm64/include/asm/insn.h | 6 +
>> arch/arm64/include/asm/module.h | 9 +
>> arch/arm64/kernel/entry-ftrace.S | 88 ++++++--
>> arch/arm64/kernel/ftrace.c | 366 ++++++++++++++++++++++++++++---
>> arch/arm64/kernel/module-plts.c | 50 +++++
>> arch/arm64/lib/insn.c | 49 +++++
>> 8 files changed, 532 insertions(+), 48 deletions(-)
>>
> .

2022-04-22 20:25:29

by Mark Rutland

Subject: Re: [RFC PATCH -next v2 4/4] arm64/ftrace: implement long jump for dynamic trampolines

On Wed, Mar 16, 2022 at 06:01:32PM +0800, Wang ShaoBo wrote:
> From: Cheng Jian <[email protected]>
>
> When kaslr is enabled, long jump should be implemented for jumping
> to according trampoline and to callback from this trampoline.

I believe this is also potentially a problem with some debug
configurations, even without KASLR (e.g. I have built a ~900MiB Image in
the past), so we *might* need this more generally, even if it is
unusual.

> Firstly, we use -fpatchable-function-entry=5,3 to reserve 3 NOPs at
> the end of each function, and also 2 NOPs at the head as 3b23e4991fb6
> ("arm64: implement ftrace with regs") does, so rec->ip should be
> rec->ip + 4 after change in case there is no extra instrument
> (such as bti ) be put the head.
>
> But should note this will increase the size of the text section:
> //Using defconfig with -fpatchable-function-entry=5,3://
> text data bss dec hex filename
> 27095761 14578580 533268 42207609 2840979 vmlinux
>
> //Using defconfig with -fpatchable-function-entry=2://
> text data bss dec hex filename
> 26394597 14575988 533268 41503853 2794c6d vmlinux

So that's ~687KiB for a ~40M kernel.

> When registering new ftrace function, we put this ftrace_ops registered
> into our trampoline allocated dynamically but not the ftrace_ops_list,
>
> .text section:
> funcA: | funcA(): funcB(): |
> `-> +-----+ | | ... mov x9 x30 |
> | ... | | | adrp <- bl <> |
> | nop | | | add
> | nop | | | br x16 ---+
> funcB | nop | | | ftrace_(regs_)caller_tramp:
> `-> +-----+ | `--> +---------------------+
> | nop | | | ... |
> | nop | | ftrace callsite +---------------------+
> | ... | | `-----> | PLT entry: |
> | nop | | | adrp |
> | nop | | | add |
> funcC: | nop | | ftrace graph callsite | br x16 |
> `-> +-----+ | `-----> +---------------------+
> | nop | | | ... |
> | nop | | +---------------------+
>
> There are three steps to concreate this conection when registering
> a ftrace_ops for a function to be traced (Take funcB as example):
>
> Step 1: use make_init_nop() to generate 'mov x9 x30' for LR reservation.
> funcB:
> | mov x9 x30
> | nop //rec->ip
> | ...

So this is the same as today.

> Step 2: create trampoline and use ftrace_make_call() to update plt entry:
> funcA:
> | ...
> | adrp
> | add
> | br x16 //pointer to trampoline allocated
>
> Step 3: fill plt entry to where ftrace_ops(addr stored in x2) contained:
> trampoline:
> | ...
> | adrp
> | add
> | br x16 //pointer to callback handler
>
> At the end automically replacing nop to bl in funB to enable profiling.
> funcB:
> | mov x9 x30
> | bl <> //to first adrp in funA
> | ...

... but these steps have a bunch of work which cannot be done
atomically, which worries me, because it means we can't safely modify a
patch-site without some *very* expensive global synchronization. I
believe we need some synchronization to safely cleanup the dynamic
trampolines themselves, but I don't know what form that takes today --
how are those freed safely?

When I last looked at this, I'd assumed we'd need a 6-instruction
patch-site (and force function alignment) so that we could have a 64-bit
value that can be patched atomically, e.g.

NOP
NOP
NOP
NOP
func:
(BTI)
NOP
NOP
...

... which we could split up as:

__trampoline_ptr:
.quad <trampoline_addr>
__pre_patch:
LDR X16, __trampoline_ptr
BL X17 /// or `B __post_patch` if disabled
func:
(BTI)
MOV X9, X30
B __pre_patch // or `NOP` if disabled
__post_patch:
...

... and the trampoline would have to adjust the X30 it receives as
that'd point to `func`, and it needs to point to `__post_patch` to
return back into the function.
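
For illustration, the atomic part of this proposal could reduce to a single
aligned 64-bit store (a hypothetical helper; cache maintenance and
synchronisation with concurrently executing CPUs are deliberately glossed
over):

#include <linux/compiler.h>
#include <linux/types.h>

/*
 * Hypothetical sketch: retarget (or effectively disable) the callsite by
 * rewriting the naturally aligned trampoline pointer that the LDR above
 * loads.  An aligned 64-bit store is single-copy atomic on arm64, so the
 * LDR always observes either the old or the new pointer, never a mix.
 */
static void set_trampoline_ptr(u64 *literal, unsigned long tramp)
{
	WRITE_ONCE(*literal, (u64)tramp);
	/* real code would follow up with any needed cache maintenance */
}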

Thanks,
Mark.

> If funcB was bound to another ftrace_ops, Step2 and Step3 will be repeated,
> If funcB was bound to multiple ftrace_ops, this trampoline will not be used
> but jump to ftrace_(regs)_caller.
>
> Signed-off-by: Cheng Jian <[email protected]>
> Signed-off-by: Wang ShaoBo <[email protected]>
> ---
> arch/arm64/Makefile | 2 +-
> arch/arm64/include/asm/ftrace.h | 2 +-
> arch/arm64/include/asm/module.h | 9 +++
> arch/arm64/kernel/entry-ftrace.S | 8 ++
> arch/arm64/kernel/ftrace.c | 127 +++++++++++++++++--------------
> arch/arm64/kernel/module-plts.c | 50 ++++++++++++
> 6 files changed, 140 insertions(+), 58 deletions(-)
>
> diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
> index 2f1de88651e6..3139d6d9ee52 100644
> --- a/arch/arm64/Makefile
> +++ b/arch/arm64/Makefile
> @@ -130,7 +130,7 @@ CHECKFLAGS += -D__aarch64__
>
> ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_REGS),y)
> KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
> - CC_FLAGS_FTRACE := -fpatchable-function-entry=2
> + CC_FLAGS_FTRACE := -fpatchable-function-entry=5,3
> endif
>
> # Default value
> diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
> index 1494cfa8639b..d0562c13b1dc 100644
> --- a/arch/arm64/include/asm/ftrace.h
> +++ b/arch/arm64/include/asm/ftrace.h
> @@ -70,7 +70,7 @@ static inline unsigned long ftrace_call_adjust(unsigned long addr)
> * See ftrace_init_nop() for the callsite sequence.
> */
> if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_REGS))
> - return addr + AARCH64_INSN_SIZE;
> + return addr + AARCH64_INSN_SIZE + sizeof(struct plt_entry);
> /*
> * addr is the address of the mcount call instruction.
> * recordmcount does the necessary offset calculation.
> diff --git a/arch/arm64/include/asm/module.h b/arch/arm64/include/asm/module.h
> index 4e7fa2623896..30f5b1cd5c9c 100644
> --- a/arch/arm64/include/asm/module.h
> +++ b/arch/arm64/include/asm/module.h
> @@ -36,6 +36,14 @@ extern u64 module_alloc_base;
> #define module_alloc_base ((u64)_etext - MODULES_VSIZE)
> #endif
>
> +enum {
> + PLT_MAKE_NOP,
> + PLT_MAKE_CALL,
> + PLT_MAKE_CALL_LINK,
> + PLT_MAKE_LCALL,
> + PLT_MAKE_LCALL_LINK
> +};
> +
> struct plt_entry {
> /*
> * A program that conforms to the AArch64 Procedure Call Standard
> @@ -58,6 +66,7 @@ static inline bool is_forbidden_offset_for_adrp(void *place)
> }
>
> struct plt_entry get_plt_entry(u64 dst, void *pc);
> +struct plt_entry update_plt_entry(u64 pc, u64 dst, int record);
> bool plt_entries_equal(const struct plt_entry *a, const struct plt_entry *b);
>
> static inline bool plt_entry_is_initialized(const struct plt_entry *e)
> diff --git a/arch/arm64/kernel/entry-ftrace.S b/arch/arm64/kernel/entry-ftrace.S
> index 88462d925446..71290f0e0740 100644
> --- a/arch/arm64/kernel/entry-ftrace.S
> +++ b/arch/arm64/kernel/entry-ftrace.S
> @@ -154,8 +154,12 @@ SYM_INNER_LABEL(ftrace_regs_caller_tramp_ops, SYM_L_GLOBAL)
> mov x3, sp // regs
> SYM_INNER_LABEL(ftrace_regs_caller_tramp_call, SYM_L_GLOBAL)
> nop
> + nop
> + nop
> #ifdef CONFIG_FUNCTION_GRAPH_TRACER
> SYM_INNER_LABEL(ftrace_regs_caller_tramp_graph_call, SYM_L_GLOBAL)
> + nop
> + nop
> nop // ftrace_graph_caller()
> #endif
> ftrace_regs_restore
> @@ -175,8 +179,12 @@ SYM_INNER_LABEL(ftrace_caller_tramp_ops, SYM_L_GLOBAL)
> mov x3, sp // regs
> SYM_INNER_LABEL(ftrace_caller_tramp_call, SYM_L_GLOBAL)
> nop
> + nop
> + nop
> #ifdef CONFIG_FUNCTION_GRAPH_TRACER
> SYM_INNER_LABEL(ftrace_caller_tramp_graph_call, SYM_L_GLOBAL)
> + nop
> + nop
> nop // ftrace_graph_caller()
> #endif
> ftrace_regs_restore
> diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
> index d35a042baf75..4011aceefe8c 100644
> --- a/arch/arm64/kernel/ftrace.c
> +++ b/arch/arm64/kernel/ftrace.c
> @@ -234,27 +234,55 @@ ftrace_trampoline_get_func(struct ftrace_ops *ops, bool is_graph)
> return func;
> }
>
> +static struct plt_entry *
> +update_ftrace_tramp_plt(unsigned long pc, unsigned long addr, bool enable, bool link)
> +{
> + int i;
> + struct plt_entry entry;
> + u32 *pentry;
> + long offset;
> + struct plt_entry *plt = (struct plt_entry *)pc;
> +
> + offset = (long)pc - (long)addr;
> +
> + if (!enable)
> + entry = update_plt_entry(pc, addr, PLT_MAKE_NOP);
> + else if (offset < -SZ_128M || offset >= SZ_128M) {
> + if (link)
> + entry = update_plt_entry(pc, addr, PLT_MAKE_LCALL_LINK);
> + else
> + entry = update_plt_entry(pc, addr, PLT_MAKE_LCALL);
> + } else {
> + if (link)
> + entry = update_plt_entry(pc, addr, PLT_MAKE_CALL_LINK);
> + else
> + entry = update_plt_entry(pc, addr, PLT_MAKE_CALL);
> + }
> +
> + pentry = (u32 *)&entry;
> + for (i = 0; i < sizeof(struct plt_entry) / AARCH64_INSN_SIZE; i++)
> + aarch64_insn_patch_text_nosync((void *)((u32*)pc + i), *(pentry + i));
> +
> + return plt;
> +}
> +
> static int
> ftrace_trampoline_modify_call(struct ftrace_ops *ops, bool is_graph, bool active)
> {
> unsigned long offset;
> unsigned long pc;
> ftrace_func_t func;
> - u32 nop, branch;
>
> offset = calc_trampoline_call_offset(ops->flags &
> FTRACE_OPS_FL_SAVE_REGS, is_graph);
> pc = ops->trampoline + offset;
>
> func = ftrace_trampoline_get_func(ops, is_graph);
> - nop = aarch64_insn_gen_nop();
> - branch = aarch64_insn_gen_branch_imm(pc, (unsigned long)func,
> - AARCH64_INSN_BRANCH_LINK);
>
> - if (active)
> - return ftrace_modify_code(pc, 0, branch, false);
> - else
> - return ftrace_modify_code(pc, 0, nop, false);
> + if (update_ftrace_tramp_plt(pc, (unsigned long)func, active, true))
> + return 0;
> +
> + return -EINVAL;
> }
>
> extern int ftrace_graph_active;
> @@ -262,13 +290,6 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
> {
> unsigned int size;
>
> - /*
> - * If kaslr is enabled, the address of tramp and ftrace call
> - * may be far away, Therefore, long jump support is required.
> - */
> - if (IS_ENABLED(CONFIG_RANDOMIZE_BASE))
> - return;
> -
> if (!ops->trampoline) {
> ops->trampoline = create_trampoline(ops, &size);
> if (!ops->trampoline)
> @@ -349,17 +370,34 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
> return ftrace_modify_code(pc, 0, new, false);
> }
>
> -static struct plt_entry *get_ftrace_plt(struct module *mod, unsigned long addr)
> +extern void ftrace_kernel_plt(void);
> +static struct plt_entry *
> +get_ftrace_plt(struct module *mod, unsigned long pc, unsigned long addr)
> {
> -#ifdef CONFIG_ARM64_MODULE_PLTS
> - struct plt_entry *plt = mod->arch.ftrace_trampolines;
> + struct plt_entry *plt;
>
> - if (addr == FTRACE_ADDR)
> - return &plt[FTRACE_PLT_IDX];
> - if (addr == FTRACE_REGS_ADDR &&
> - IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_REGS))
> - return &plt[FTRACE_REGS_PLT_IDX];
> + if (mod) {
> +#ifdef CONFIG_ARM64_MODULE_PLTS
> + plt = mod->arch.ftrace_trampolines;
> +
> + if (addr == FTRACE_ADDR)
> + return &plt[FTRACE_PLT_IDX];
> + else if (addr == FTRACE_REGS_ADDR &&
> + IS_ENABLED(CONFIG_FTRACE_WITH_REGS))
> + return &plt[FTRACE_REGS_PLT_IDX];
> + else {
> #endif
> + pc = pc - sizeof(struct plt_entry) - AARCH64_INSN_SIZE;
> + return update_ftrace_tramp_plt(pc, addr, true, false);
> +#ifdef CONFIG_ARM64_MODULE_PLTS
> + }
> +#endif
> + } else {
> + WARN_ON(addr == FTRACE_ADDR || addr == FTRACE_REGS_ADDR);
> + pc = pc - sizeof(struct plt_entry) - AARCH64_INSN_SIZE;
> + return update_ftrace_tramp_plt(pc, addr, true, false);
> + }
> +
> return NULL;
> }
>
> @@ -376,9 +414,6 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
> struct module *mod;
> struct plt_entry *plt;
>
> - if (!IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
> - return -EINVAL;
> -
> /*
> * On kernels that support module PLTs, the offset between the
> * branch instruction and its target may legally exceed the
> @@ -393,12 +428,10 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
> mod = __module_text_address(pc);
> preempt_enable();
>
> - if (WARN_ON(!mod))
> - return -EINVAL;
> -
> - plt = get_ftrace_plt(mod, addr);
> + plt = get_ftrace_plt(mod, pc, addr);
> if (!plt) {
> - pr_err("ftrace: no module PLT for %ps\n", (void *)addr);
> + pr_err("ftrace: no %s PLT for %ps\n",
> + mod ? "module" : "kernel", (void *)addr);
> return -EINVAL;
> }
>
> @@ -474,36 +507,18 @@ int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec,
> if (offset < -SZ_128M || offset >= SZ_128M) {
> u32 replaced;
>
> - if (!IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
> - return -EINVAL;
> -
> - /*
> - * 'mod' is only set at module load time, but if we end up
> - * dealing with an out-of-range condition, we can assume it
> - * is due to a module being loaded far away from the kernel.
> - */
> - if (!mod) {
> - preempt_disable();
> - mod = __module_text_address(pc);
> - preempt_enable();
> -
> - if (WARN_ON(!mod))
> - return -EINVAL;
> - }
> -
> /*
> - * The instruction we are about to patch may be a branch and
> - * link instruction that was redirected via a PLT entry. In
> - * this case, the normal validation will fail, but we can at
> - * least check that we are dealing with a branch and link
> - * instruction that points into the right module.
> + * The instruction we are about to patch may be a branch
> + * and link instruction that was redirected via a PLT entry
> + * supported by a dynamic trampoline or module, in this case,
> + * the normal validation will fail, but we can at least check
> + * that we are dealing with a branch and link instruction
> + * that points into the right place.
> */
> if (aarch64_insn_read((void *)pc, &replaced))
> return -EFAULT;
>
> - if (!aarch64_insn_is_bl(replaced) ||
> - !within_module(pc + aarch64_get_branch_offset(replaced),
> - mod))
> + if (!aarch64_insn_is_bl(replaced))
> return -EINVAL;
>
> validate = false;
> diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
> index e53493d8b208..b7a832dfaa69 100644
> --- a/arch/arm64/kernel/module-plts.c
> +++ b/arch/arm64/kernel/module-plts.c
> @@ -37,6 +37,56 @@ struct plt_entry get_plt_entry(u64 dst, void *pc)
> return plt;
> }
>
> +struct plt_entry update_plt_entry(u64 pc, u64 dst, int record)
> +{
> + u32 adrp, add, br;
> +
> + switch (record) {
> + case PLT_MAKE_NOP:
> + adrp = aarch64_insn_gen_nop();
> + add = aarch64_insn_gen_nop();
> + br = aarch64_insn_gen_nop();
> + break;
> +
> + case PLT_MAKE_CALL_LINK:
> + adrp = aarch64_insn_gen_nop();
> + add = aarch64_insn_gen_nop();
> + br = aarch64_insn_gen_branch_imm(pc + 8, dst,
> + AARCH64_INSN_BRANCH_LINK);
> + break;
> +
> + case PLT_MAKE_LCALL_LINK:
> + adrp = aarch64_insn_gen_adr(pc, dst, AARCH64_INSN_REG_16,
> + AARCH64_INSN_ADR_TYPE_ADRP);
> + add = aarch64_insn_gen_add_sub_imm(AARCH64_INSN_REG_16,
> + AARCH64_INSN_REG_16, dst % SZ_4K,
> + AARCH64_INSN_VARIANT_64BIT,
> + AARCH64_INSN_ADSB_ADD);
> + br = aarch64_insn_gen_branch_reg(AARCH64_INSN_REG_16,
> + AARCH64_INSN_BRANCH_LINK);
> + break;
> +
> + case PLT_MAKE_LCALL:
> + adrp = aarch64_insn_gen_adr(pc, dst, AARCH64_INSN_REG_16,
> + AARCH64_INSN_ADR_TYPE_ADRP);
> + add = aarch64_insn_gen_add_sub_imm(AARCH64_INSN_REG_16,
> + AARCH64_INSN_REG_16,
> + dst % SZ_4K,
> + AARCH64_INSN_VARIANT_64BIT,
> + AARCH64_INSN_ADSB_ADD);
> + br = aarch64_insn_gen_branch_reg(AARCH64_INSN_REG_16,
> + AARCH64_INSN_BRANCH_NOLINK);
> + break;
> + default:
> + pr_err("[%s %d] error flag %d\n", __func__, __LINE__, record);
> + BUG();
> + break;
> + }
> +
> + return (struct plt_entry){ cpu_to_le32(adrp), cpu_to_le32(add),
> + cpu_to_le32(br) };
> +}
> +
> bool plt_entries_equal(const struct plt_entry *a, const struct plt_entry *b)
> {
> u64 p, q;
> --
> 2.25.1
>

2022-05-25 19:46:09

by Mark Rutland

Subject: Re: [RFC PATCH -next v2 0/4] arm64/ftrace: support dynamic trampoline

On Thu, Apr 21, 2022 at 08:37:58AM -0400, Steven Rostedt wrote:
> On Thu, 21 Apr 2022 09:13:01 +0800
> "Wangshaobo (bobo)" <[email protected]> wrote:
>
> > Not yet, Steve, ftrace_location() looks has no help to find a right
> > rec->ip in our case,
> >
> > ftrace_location() can find a right rec->ip when input ip is in the range
> > between
> >
> > sym+0 and sym+$end, but our question is how to  identify rec->ip from
> > __mcount_loc,
>
> Are you saying that the "ftrace location" is not between sym+0 and sym+$end?

IIUC yes -- this series as-is moves the call to the trampoline *before* sym+0.

Among other things that completely wrecks backtracing, so I'd *really* like to
avoid that (hence my suggested alternative).

> > this changed the patchable entry before bti to after in gcc:
> >
> >    [1] https://reviews.llvm.org/D73680
> >
> > gcc tells the place of first nop of the 5 NOPs when using
> > -fpatchable-function-entry=5,3,
> >
> > but not tells the first nop after bti, so we don't know how to adjust
> > our rec->ip for ftrace.
>
> OK, so I do not understand how the compiler is injecting bti with mcount
> calls, so I'll just walk away for now ;-)

When using BTI, the compiler has to drop a BTI *at* the function entry point
(i.e. sym+0) for any function that can be called indirectly, but can omit this
when the function is only directly called (which is the case for most functions
created via interprocedural specialization, or for a number of static
functions).

Today, when we pass:

-fpatchable-function-entry=2

... the compiler places 2 NOPs *after* any BTI, and records the location of the
first NOP. So the two cases we get are:

__func_without_bti:
NOP <--- recorded location
NOP

__func_with_bti:
BTI
NOP <--- recorded location
NOP

... which works just fine, since either sym+0 or sym+4 are reasonable
locations for the patch-site to live.

However, if we were to pass:

-fpatchable-function-entry=5,3

... the compiler places 3 NOPs *before* any BTI, and 2 NOPs *after* any BTI,
still recording the location of the first NOP. So in the two cases we get:

NOP <--- recorded location
NOP
NOP
__func_without_bti:
NOP
NOP

NOP <--- recorded location
NOP
NOP
__func_with_bti:
BTI
NOP
NOP

... so where we want to patch one of the later nops to branch to a pre-function
NOP, we need to know whether or not the compiler generated a BTI. We can
discover that either by:

* Checking whether the recorded location is at sym+0 (no BTI) or sym+4 (BTI).

* Reading the instruction before the recorded location, and seeing if this is a
BTI.

... and depending on how we handle things the two cases *might* need different
trampolines.
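
For illustration only, a minimal sketch of the second check might look like
the following (insn_is_bti() and adjust_mcount_loc() are assumed names; the
+4/+5 offsets follow the examples in the cover letter, and in both layouts
above the entry point sits three instructions after the recorded location):

#include <linux/types.h>
#include <asm/insn.h>
#include <asm/patching.h>

/* BTI{,c,j,jc} are HINT instructions matching this encoding/mask. */
static bool insn_is_bti(u32 insn)
{
	return (insn & 0xffffff3fU) == 0xd503241fU;
}

/*
 * Sketch: advance the recorded __mcount_loc address to the NOP that will
 * become the BL, by checking for a BTI at the original entry point.
 */
static unsigned long adjust_mcount_loc(unsigned long loc)
{
	u32 insn;
	unsigned long entry = loc + 3 * AARCH64_INSN_SIZE;

	if (aarch64_insn_read((void *)entry, &insn))
		return loc;

	/* skip (BTI +) "mov x9, x30" to land on the patched NOP */
	return insn_is_bti(insn) ? loc + 5 * AARCH64_INSN_SIZE
				 : loc + 4 * AARCH64_INSN_SIZE;
}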

Thanks,
Mark.

2022-05-26 06:41:50

by Steven Rostedt

Subject: Re: [RFC PATCH -next v2 0/4] arm64/ftrace: support dynamic trampoline

On Wed, 25 May 2022 13:45:13 +0100
Mark Rutland <[email protected]> wrote:

> ... the compiler places 3 NOPs *before* any BTI, and 2 NOPs *after* any BTI,
> still recording the location of the first NOP. So in the two cases we get:
>
> NOP <--- recorded location
> NOP
> NOP
> __func_without_bti:
> NOP
> NOP
>
> NOP <--- recorded location
> NOP
> NOP
> __func_with_bti:
> BTI
> NOP
> NOP

Are you saying that the above "recorded location" is what we have in the
mcount_loc section? If that's the case, we will need to modify it to point
to something that kallsyms will recognize (i.e. sym+0 or greater), because
otherwise set_ftrace_filter will fail as well.

-- Steve


>
> ... so where we want to patch one of the later nops to banch to a pre-function
> NOP, we need to know whether or not the compiler generated a BTI. We can
> discover discover that either by:
>
> * Checking whether the recorded location is at sym+0 (no BTI) or sym+4 (BTI).
>
> * Reading the instruction before the recorded location, and seeing if this is a
> BTI.
>
> ... and depending on how we handle thigns the two cases *might* need different
> trampolines.


2022-05-26 11:44:56

by Mark Rutland

Subject: Re: [RFC PATCH -next v2 0/4] arm64/ftrace: support dynamic trampoline

On Wed, May 25, 2022 at 09:58:45AM -0400, Steven Rostedt wrote:
> On Wed, 25 May 2022 13:45:13 +0100
> Mark Rutland <[email protected]> wrote:
>
> > ... the compiler places 3 NOPs *before* any BTI, and 2 NOPs *after* any BTI,
> > still recording the location of the first NOP. So in the two cases we get:
> >
> > NOP <--- recorded location
> > NOP
> > NOP
> > __func_without_bti:
> > NOP
> > NOP
> >
> > NOP <--- recorded location
> > NOP
> > NOP
> > __func_with_bti:
> > BTI
> > NOP
> > NOP
>
> Are you saying that the above "recorded location" is what we have in
> mcount_loc section?

Yes; I'm saying that with this series, the compiler would record that into the
mcount_loc section.

Note that's not necessarily what goes into rec->ip, which we can adjust at
initialization time to be within the function. We'd need to record the
presence/absence of the BTI somewhere (I guess in dyn_arch_ftrace).

> If that's the case, we will need to modify it to point to something that
> kallsyms will recognize (ie. sym+0 or greater). Because that will cause
> set_ftrace_filter to fail as well.

Yup, understood. Like I mentioned it also wrecks the unwinder and would make it
really hard to implement RELIABLE_STACKTRACE.

Just to be clear, I don't think we should follow this specific approach. I just
wrote the examples to clarify what was being proposed.

Thanks,
Mark.