2023-01-23 13:46:17

by Mark Rutland

Subject: [PATCH v3 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS

Hi Catalin, Steve,

I'm not sure how we want to merge this, so I've moved the core ftrace
patch to the start of the series so that it can more easily be placed on
a stable branch if we want that to go via the ftrace tree and the rest
to go via arm64.

This is cleanly passing the ftrace selftests from v6.2-rc3 (results in
the final patch).

Aside from that, usual cover letter below.

This series adds a new DYNAMIC_FTRACE_WITH_CALL_OPS mechanism, and
enables support for this on arm64. This significantly reduces the
overhead of tracing when a callsite/tracee has a single associated
tracer, avoids a number of issues that make it undesirable and
infeasible to use dynamically-allocated trampolines (e.g. branch range
limitations), and makes it possible to implement support for
DYNAMIC_FTRACE_WITH_DIRECT_CALLS in future.

The main idea is to give each ftrace callsite an associated pointer to
an ftrace_ops. The architecture's ftrace_caller trampoline can recover
the ops pointer and invoke ops->func from this without needing to use
ftrace_ops_list_func, which has to iterate through all registered ops.

To make this work, we use -fpatchable-function-entry=M,N, where N NOPs
are placed before the function entry point. On arm64 NOPs are always 4
bytes, so by allocating 2 per-function NOPs, we have enough space to
place a 64-bit value. So that we can manipulate the pointer atomically,
we need to align instrumented functions to at least 8 bytes, which we
can ensure using -falign-functions=8.
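
(On arm64 this ends up being '-fpatchable-function-entry=4,2', i.e. 4 NOPs in
total with 2 of them placed before the function entry point, together with at
least 8-byte function alignment; see the final patch for the details.)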

Each callsite ends up looking like:

# Aligned to 8 bytes
func - 8:
< pointer to ops >
func:
BTI // optional
MOV X9, LR
NOP // patched to `BL ftrace_caller`
func_body:

When entering ftrace_caller, the LR points at func_body, and the
ftrace_ops can be recovered at a negative offset from the LR value:

BIC <tmp>, LR, 0x7 // Align down (skips BTI)
LDR <tmp>, [<tmp>, #-16] // load ops pointer

The ftrace_ops::func (and any other ftrace_ops fields) can then be
recovered from this pointer to the ops.

The first patch adds the core DYNAMIC_FTRACE_WITH_CALL_OPS support. The
next three patches enable the function alignment, working around cases
where GCC drops alignment for cold functions or when building with
'-Os'.

The final four patches implement support for
DYNAMIC_FTRACE_WITH_CALL_OPS on arm64. As noted in the final patch, this
results in a significant reduction in overhead:

Before this series:

 Number of tracers     || Total time  | Per-call average time (ns)
 Relevant | Irrelevant ||    (ns)     |    Total     |  Overhead
==========+============++=============+==============+============
        0 |          0 ||      94,583 |         0.95 |          -
        0 |          1 ||      93,709 |         0.94 |          -
        0 |          2 ||      93,666 |         0.94 |          -
        0 |         10 ||      93,709 |         0.94 |          -
        0 |        100 ||      93,792 |         0.94 |          -
----------+------------++-------------+--------------+------------
        1 |          1 ||   6,467,833 |        64.68 |      63.73
        1 |          2 ||   7,509,708 |        75.10 |      74.15
        1 |         10 ||  23,786,792 |       237.87 |     236.92
        1 |        100 || 106,432,500 |     1,064.43 |   1,063.38
----------+------------++-------------+--------------+------------
        1 |          0 ||   1,431,875 |        14.32 |      13.37
        2 |          0 ||   6,456,334 |        64.56 |      63.62
       10 |          0 ||  22,717,000 |       227.17 |     226.22
      100 |          0 || 103,293,667 |     1,032.94 |   1,031.99
----------+------------++-------------+--------------+------------

Note: per-call overhead is estimated relative to the baseline case
with 0 relevant tracers and 0 irrelevant tracers.

After this series:

 Number of tracers     || Total time  | Per-call average time (ns)
 Relevant | Irrelevant ||    (ns)     |    Total     |  Overhead
==========+============++=============+==============+============
        0 |          0 ||      94,541 |         0.95 |          -
        0 |          1 ||      93,666 |         0.94 |          -
        0 |          2 ||      93,709 |         0.94 |          -
        0 |         10 ||      93,667 |         0.94 |          -
        0 |        100 ||      93,792 |         0.94 |          -
----------+------------++-------------+--------------+------------
        1 |          1 ||     281,000 |         2.81 |       1.86
        1 |          2 ||     281,042 |         2.81 |       1.87
        1 |         10 ||     280,958 |         2.81 |       1.86
        1 |        100 ||     281,250 |         2.81 |       1.87
----------+------------++-------------+--------------+------------
        1 |          0 ||     280,959 |         2.81 |       1.86
        2 |          0 ||   6,502,708 |        65.03 |      64.08
       10 |          0 ||  18,681,209 |       186.81 |     185.87
      100 |          0 || 103,550,458 |     1,035.50 |   1,034.56
----------+------------++-------------+--------------+------------

Note: per-call overhead is estimated relative to the baseline case
with 0 relevant tracers and 0 irrelevant tracers.


This version of the series can be found in my kernel.org git repo:

git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git

Tagged as:

arm64-ftrace-per-callsite-ops-20230113

Since v1 [1]:
* Fold in Ack from Rafael
* Update comments/commits with description of the GCC issue
* Move the cold attribute changes to compiler_types.h
* Drop the unnecessary changes to the weak attribute
* Move declaration of ftrace_ops earlier
* Clean up and improve commit messages
* Regenerate statistics on misaligned text symbols

Since v2 [2]:
* Fold in Steve's Reviewed-by tag
* Move core ftrace patch to the start of the series
* Add ftrace selftest results to final patch
* Use FUNCTION_ALIGNMENT_4B by default
* Fix commit message typos

[1] https://lore.kernel.org/linux-arm-kernel/[email protected]/
[2] https://lore.kernel.org/linux-arm-kernel/[email protected]/

Thanks,
Mark.

Mark Rutland (8):
ftrace: Add DYNAMIC_FTRACE_WITH_CALL_OPS
Compiler attributes: GCC cold function alignment workarounds
ACPI: Don't build ACPICA with '-Os'
arm64: Extend support for CONFIG_FUNCTION_ALIGNMENT
arm64: insn: Add helpers for BTI
arm64: patching: Add aarch64_insn_write_literal_u64()
arm64: ftrace: Update stale comment
arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS

arch/arm64/Kconfig | 4 +
arch/arm64/Makefile | 5 +-
arch/arm64/include/asm/ftrace.h | 15 +--
arch/arm64/include/asm/insn.h | 1 +
arch/arm64/include/asm/linkage.h | 4 +-
arch/arm64/include/asm/patching.h | 2 +
arch/arm64/kernel/asm-offsets.c | 4 +
arch/arm64/kernel/entry-ftrace.S | 32 +++++-
arch/arm64/kernel/ftrace.c | 158 +++++++++++++++++++++++++++-
arch/arm64/kernel/patching.c | 17 +++
drivers/acpi/acpica/Makefile | 2 +-
include/linux/compiler_attributes.h | 6 --
include/linux/compiler_types.h | 27 +++++
include/linux/ftrace.h | 18 +++-
kernel/exit.c | 9 +-
kernel/trace/Kconfig | 7 ++
kernel/trace/ftrace.c | 109 ++++++++++++++++++-
17 files changed, 380 insertions(+), 40 deletions(-)

--
2.30.2



2023-01-23 13:46:23

by Mark Rutland

Subject: [PATCH v3 1/8] ftrace: Add DYNAMIC_FTRACE_WITH_CALL_OPS

Architectures without dynamic ftrace trampolines incur an overhead when
multiple ftrace_ops are enabled with distinct filters. In these cases,
each call site calls a common trampoline which uses
ftrace_ops_list_func() to iterate over all enabled ftrace functions, and
so incurs an overhead relative to the size of this list (including RCU
protection overhead).

Architectures with dynamic ftrace trampolines avoid this overhead for
call sites which have a single associated ftrace_ops. In these cases,
the dynamic trampoline is customized to branch directly to the relevant
ftrace function, avoiding the list overhead.

On some architectures it's impractical and/or undesirable to implement
dynamic ftrace trampolines. For example, arm64 has limited branch ranges
and cannot always directly branch from a call site to an arbitrary
address (e.g. from a kernel text address to an arbitrary module
address). Calls from modules to core kernel text can be indirected via
PLTs (allocated at module load time) to address this, but the same is
not possible for calls from core kernel text.

Using an indirect branch from a call site to an arbitrary trampoline is
possible, but requires several more instructions in the function
prologue (or immediately before it), and/or comes with far more complex
requirements for patching.

Instead, this patch adds a new option, where an architecture can
associate each call site with a pointer to an ftrace_ops, placed at a
fixed offset from the call site. A shared trampoline can recover this
pointer and call ftrace_ops::func() without needing to go via
ftrace_ops_list_func(), avoiding the associated overhead.

This avoids issues with branch range limitations, and avoids the need to
allocate and manipulate dynamic trampolines, making it far simpler to
implement and maintain, while having similar performance
characteristics.
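
As a rough illustration (the arm64 implementation is added by a subsequent
patch; the offset and names below are invented for the example), an
architecture's trampoline conceptually does the equivalent of:

  /*
   * Conceptual sketch only: the real trampoline is written in assembly,
   * and the offset of the ops literal is architecture-specific.
   */
  static void ftrace_caller_sketch(unsigned long ip, unsigned long parent_ip,
                                   struct ftrace_regs *fregs)
  {
          /* Atomically load the per-callsite ops pointer from its fixed offset */
          struct ftrace_ops *ops =
                  READ_ONCE(*(struct ftrace_ops **)(ip - ARCH_OPS_OFFSET));

          /* Call the single associated ops directly, without walking the list */
          ops->func(ip, parent_ip, ops, fregs);
  }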

Note that this allows for dynamic ftrace_ops to be invoked directly from
an architecture's ftrace_caller trampoline, whereas existing code forces
the use of ftrace_ops_list_func(), which is in part necessary to
permit the ftrace_ops to be freed once unregistered *and* to avoid
branch/address-generation range limitation on some architectures (e.g.
where ops->func is a module address, and may be outside of the direct
branch range for callsites within the main kernel image).

The CALL_OPS approach avoids these problems and is safe as:

* The existing synchronization in ftrace_shutdown() using
synchronize_rcu_tasks_rude() (and
synchronize_rcu_tasks()) ensures that no tasks hold a stale reference
to an ftrace_ops (e.g. in the middle of the ftrace_caller trampoline,
or while invoking ftrace_ops::func), when that ftrace_ops is
unregistered.

Arguably this could also be relied upon for the existing scheme,
permitting dynamic ftrace_ops to be invoked directly when ops->func is
in range, but this will require additional logic to handle branch
range limitations, and is not handled by this patch.

* Each callsite's ftrace_ops pointer literal can hold any valid kernel
address, and is updated atomically. As an architecture's ftrace_caller
trampoline will atomically load the ops pointer then dereference
ops->func, there is no risk of invoking ops->func with a mismatched
ops pointer, and updates to the ops pointer do not require special
care.

A subsequent patch will implement architecture support for arm64. There
should be no functional change as a result of this patch alone.

Signed-off-by: Mark Rutland <[email protected]>
Reviewed-by: Steven Rostedt (Google) <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Florent Revest <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Will Deacon <[email protected]>
---
include/linux/ftrace.h | 18 +++++--
kernel/trace/Kconfig | 7 +++
kernel/trace/ftrace.c | 109 +++++++++++++++++++++++++++++++++++++++--
3 files changed, 125 insertions(+), 9 deletions(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 99f1146614c0..366c730beaa3 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -39,6 +39,7 @@ static inline void ftrace_boot_snapshot(void) { }

struct ftrace_ops;
struct ftrace_regs;
+struct dyn_ftrace;

#ifdef CONFIG_FUNCTION_TRACER
/*
@@ -57,6 +58,9 @@ void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip);
void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
struct ftrace_ops *op, struct ftrace_regs *fregs);
#endif
+extern const struct ftrace_ops ftrace_nop_ops;
+extern const struct ftrace_ops ftrace_list_ops;
+struct ftrace_ops *ftrace_find_unique_ops(struct dyn_ftrace *rec);
#endif /* CONFIG_FUNCTION_TRACER */

/* Main tracing buffer and events set up */
@@ -391,8 +395,6 @@ struct ftrace_func_entry {
unsigned long direct; /* for direct lookup only */
};

-struct dyn_ftrace;
-
#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
extern int ftrace_direct_func_count;
int register_ftrace_direct(unsigned long ip, unsigned long addr);
@@ -563,6 +565,8 @@ bool is_ftrace_trampoline(unsigned long addr);
* IPMODIFY - the record allows for the IP address to be changed.
* DISABLED - the record is not ready to be touched yet
* DIRECT - there is a direct function to call
+ * CALL_OPS - the record can use callsite-specific ops
+ * CALL_OPS_EN - the function is set up to use callsite-specific ops
*
* When a new ftrace_ops is registered and wants a function to save
* pt_regs, the rec->flags REGS is set. When the function has been
@@ -580,9 +584,11 @@ enum {
FTRACE_FL_DISABLED = (1UL << 25),
FTRACE_FL_DIRECT = (1UL << 24),
FTRACE_FL_DIRECT_EN = (1UL << 23),
+ FTRACE_FL_CALL_OPS = (1UL << 22),
+ FTRACE_FL_CALL_OPS_EN = (1UL << 21),
};

-#define FTRACE_REF_MAX_SHIFT 23
+#define FTRACE_REF_MAX_SHIFT 21
#define FTRACE_REF_MAX ((1UL << FTRACE_REF_MAX_SHIFT) - 1)

#define ftrace_rec_count(rec) ((rec)->flags & FTRACE_REF_MAX)
@@ -820,7 +826,8 @@ static inline int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
*/
extern int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr);

-#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
+#if defined(CONFIG_DYNAMIC_FTRACE_WITH_REGS) || \
+ defined(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS)
/**
* ftrace_modify_call - convert from one addr to another (no nop)
* @rec: the call site record (e.g. mcount/fentry)
@@ -833,6 +840,9 @@ extern int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr);
* what we expect it to be, and then on success of the compare,
* it should write to the location.
*
+ * When using call ops, this is called when the associated ops change, even
+ * when (addr == old_addr).
+ *
* The code segment at @rec->ip should be a caller to @old_addr
*
* Return must be:
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 197545241ab8..5df427a2321d 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -42,6 +42,9 @@ config HAVE_DYNAMIC_FTRACE_WITH_REGS
config HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
bool

+config HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
+ bool
+
config HAVE_DYNAMIC_FTRACE_WITH_ARGS
bool
help
@@ -257,6 +260,10 @@ config DYNAMIC_FTRACE_WITH_DIRECT_CALLS
depends on DYNAMIC_FTRACE_WITH_REGS
depends on HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS

+config DYNAMIC_FTRACE_WITH_CALL_OPS
+ def_bool y
+ depends on HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
+
config DYNAMIC_FTRACE_WITH_ARGS
def_bool y
depends on DYNAMIC_FTRACE
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 442438b93fe9..e634b80f49d1 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -125,6 +125,33 @@ struct ftrace_ops global_ops;
void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
struct ftrace_ops *op, struct ftrace_regs *fregs);

+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
+/*
+ * Stub used to invoke the list ops without requiring a separate trampoline.
+ */
+const struct ftrace_ops ftrace_list_ops = {
+ .func = ftrace_ops_list_func,
+ .flags = FTRACE_OPS_FL_STUB,
+};
+
+static void ftrace_ops_nop_func(unsigned long ip, unsigned long parent_ip,
+ struct ftrace_ops *op,
+ struct ftrace_regs *fregs)
+{
+ /* do nothing */
+}
+
+/*
+ * Stub used when a call site is disabled. May be called transiently by threads
+ * which have made it into ftrace_caller but haven't yet recovered the ops at
+ * the point the call site is disabled.
+ */
+const struct ftrace_ops ftrace_nop_ops = {
+ .func = ftrace_ops_nop_func,
+ .flags = FTRACE_OPS_FL_STUB,
+};
+#endif
+
static inline void ftrace_ops_init(struct ftrace_ops *ops)
{
#ifdef CONFIG_DYNAMIC_FTRACE
@@ -1814,6 +1841,18 @@ static bool __ftrace_hash_rec_update(struct ftrace_ops *ops,
* if rec count is zero.
*/
}
+
+ /*
+ * If the rec has a single associated ops, and ops->func can be
+ * called directly, allow the call site to call via the ops.
+ */
+ if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS) &&
+ ftrace_rec_count(rec) == 1 &&
+ ftrace_ops_get_func(ops) == ops->func)
+ rec->flags |= FTRACE_FL_CALL_OPS;
+ else
+ rec->flags &= ~FTRACE_FL_CALL_OPS;
+
count++;

/* Must match FTRACE_UPDATE_CALLS in ftrace_modify_all_code() */
@@ -2108,8 +2147,9 @@ void ftrace_bug(int failed, struct dyn_ftrace *rec)
struct ftrace_ops *ops = NULL;

pr_info("ftrace record flags: %lx\n", rec->flags);
- pr_cont(" (%ld)%s", ftrace_rec_count(rec),
- rec->flags & FTRACE_FL_REGS ? " R" : " ");
+ pr_cont(" (%ld)%s%s", ftrace_rec_count(rec),
+ rec->flags & FTRACE_FL_REGS ? " R" : " ",
+ rec->flags & FTRACE_FL_CALL_OPS ? " O" : " ");
if (rec->flags & FTRACE_FL_TRAMP_EN) {
ops = ftrace_find_tramp_ops_any(rec);
if (ops) {
@@ -2177,6 +2217,7 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
* want the direct enabled (it will be done via the
* direct helper). But if DIRECT_EN is set, and
* the count is not one, we need to clear it.
+ *
*/
if (ftrace_rec_count(rec) == 1) {
if (!(rec->flags & FTRACE_FL_DIRECT) !=
@@ -2185,6 +2226,19 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
} else if (rec->flags & FTRACE_FL_DIRECT_EN) {
flag |= FTRACE_FL_DIRECT;
}
+
+ /*
+ * Ops calls are special, as count matters.
+ * As with direct calls, they must only be enabled when count
+ * is one, otherwise they'll be handled via the list ops.
+ */
+ if (ftrace_rec_count(rec) == 1) {
+ if (!(rec->flags & FTRACE_FL_CALL_OPS) !=
+ !(rec->flags & FTRACE_FL_CALL_OPS_EN))
+ flag |= FTRACE_FL_CALL_OPS;
+ } else if (rec->flags & FTRACE_FL_CALL_OPS_EN) {
+ flag |= FTRACE_FL_CALL_OPS;
+ }
}

/* If the state of this record hasn't changed, then do nothing */
@@ -2229,6 +2283,21 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
rec->flags &= ~FTRACE_FL_DIRECT_EN;
}
}
+
+ if (flag & FTRACE_FL_CALL_OPS) {
+ if (ftrace_rec_count(rec) == 1) {
+ if (rec->flags & FTRACE_FL_CALL_OPS)
+ rec->flags |= FTRACE_FL_CALL_OPS_EN;
+ else
+ rec->flags &= ~FTRACE_FL_CALL_OPS_EN;
+ } else {
+ /*
+ * Can only call directly if there's
+ * only one set of associated ops.
+ */
+ rec->flags &= ~FTRACE_FL_CALL_OPS_EN;
+ }
+ }
}

/*
@@ -2258,7 +2327,8 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
* and REGS states. The _EN flags must be disabled though.
*/
rec->flags &= ~(FTRACE_FL_ENABLED | FTRACE_FL_TRAMP_EN |
- FTRACE_FL_REGS_EN | FTRACE_FL_DIRECT_EN);
+ FTRACE_FL_REGS_EN | FTRACE_FL_DIRECT_EN |
+ FTRACE_FL_CALL_OPS_EN);
}

ftrace_bug_type = FTRACE_BUG_NOP;
@@ -2431,6 +2501,25 @@ ftrace_find_tramp_ops_new(struct dyn_ftrace *rec)
return NULL;
}

+struct ftrace_ops *
+ftrace_find_unique_ops(struct dyn_ftrace *rec)
+{
+ struct ftrace_ops *op, *found = NULL;
+ unsigned long ip = rec->ip;
+
+ do_for_each_ftrace_op(op, ftrace_ops_list) {
+
+ if (hash_contains_ip(ip, op->func_hash)) {
+ if (found)
+ return NULL;
+ found = op;
+ }
+
+ } while_for_each_ftrace_op(op);
+
+ return found;
+}
+
#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
/* Protected by rcu_tasks for reading, and direct_mutex for writing */
static struct ftrace_hash *direct_functions = EMPTY_HASH;
@@ -3780,11 +3869,12 @@ static int t_show(struct seq_file *m, void *v)
if (iter->flags & FTRACE_ITER_ENABLED) {
struct ftrace_ops *ops;

- seq_printf(m, " (%ld)%s%s%s",
+ seq_printf(m, " (%ld)%s%s%s%s",
ftrace_rec_count(rec),
rec->flags & FTRACE_FL_REGS ? " R" : " ",
rec->flags & FTRACE_FL_IPMODIFY ? " I" : " ",
- rec->flags & FTRACE_FL_DIRECT ? " D" : " ");
+ rec->flags & FTRACE_FL_DIRECT ? " D" : " ",
+ rec->flags & FTRACE_FL_CALL_OPS ? " O" : " ");
if (rec->flags & FTRACE_FL_TRAMP_EN) {
ops = ftrace_find_tramp_ops_any(rec);
if (ops) {
@@ -3800,6 +3890,15 @@ static int t_show(struct seq_file *m, void *v)
} else {
add_trampoline_func(m, NULL, rec);
}
+ if (rec->flags & FTRACE_FL_CALL_OPS_EN) {
+ ops = ftrace_find_unique_ops(rec);
+ if (ops) {
+ seq_printf(m, "\tops: %pS (%pS)",
+ ops, ops->func);
+ } else {
+ seq_puts(m, "\tops: ERROR!");
+ }
+ }
if (rec->flags & FTRACE_FL_DIRECT) {
unsigned long direct;

--
2.30.2


2023-01-23 13:46:26

by Mark Rutland

Subject: [PATCH v3 2/8] Compiler attributes: GCC cold function alignment workarounds

Contemporary versions of GCC (e.g. GCC 12.2.0) drop the alignment
specified by '-falign-functions=N' for functions marked with the
__cold__ attribute, and potentially for callees of __cold__ functions as
these may be implicitly marked as __cold__ by the compiler. LLVM appears
to respect '-falign-functions=N' in such cases.

This has been reported to GCC in bug 88345:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345

... which also covers alignment being dropped when '-Os' is used, which
will be addressed in a separate patch.
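
As a minimal illustration of the issue (this snippet is not part of the
patch, and the exact behaviour depends on the GCC version):

  /* cold.c: built with an affected GCC, e.g.
   *   gcc -O2 -falign-functions=64 -c cold.c
   * 'normal' gets the requested 64-byte alignment but 'chilly' typically
   * does not, as the cold attribute overrides the requested alignment.
   */
  __attribute__((__cold__)) void chilly(void) { }
  void normal(void) { }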

Currently, use of '-falign-functions=N' is limited to
CONFIG_FUNCTION_ALIGNMENT, which is largely used for performance and/or
analysis reasons (e.g. with CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B), but
isn't necessary for correct functionality. However, this dropped
alignment isn't great for the performance and/or analysis cases.

Subsequent patches will use CONFIG_FUNCTION_ALIGNMENT as part of arm64's
ftrace implementation, which will require all instrumented functions to
be aligned to at least 8-bytes.

This patch works around the dropped alignment by avoiding the use of the
__cold__ attribute when CONFIG_FUNCTION_ALIGNMENT is non-zero, and by
specifically aligning abort(), which GCC implicitly marks as __cold__.
As the __cold macro is now dependent upon config options (which is
against the policy described at the top of compiler_attributes.h), it is
moved into compiler_types.h.

I've tested this by building and booting a kernel configured with
defconfig + CONFIG_EXPERT=y + CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y,
and looking for misaligned text symbols in /proc/kallsyms:

* arm64:

Before:
# uname -rm
6.2.0-rc3 aarch64
# grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
5009

After:
# uname -rm
6.2.0-rc3-00001-g2a2bedf8bfa9 aarch64
# grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
919

* x86_64:

Before:
# uname -rm
6.2.0-rc3 x86_64
# grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
11537

After:
# uname -rm
6.2.0-rc3-00001-g2a2bedf8bfa9 x86_64
# grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
2805

There's clearly a substantial reduction in the number of misaligned
symbols. From manual inspection, the remaining unaligned text labels are
a combination of ACPICA functions (due to the use of '-Os'), static call
trampolines, and non-function labels in assembly, which will be dealt
with in subsequent patches.

Signed-off-by: Mark Rutland <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Florent Revest <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Nick Desaulniers <[email protected]>
---
include/linux/compiler_attributes.h | 6 ------
include/linux/compiler_types.h | 27 +++++++++++++++++++++++++++
kernel/exit.c | 9 ++++++++-
3 files changed, 35 insertions(+), 7 deletions(-)

diff --git a/include/linux/compiler_attributes.h b/include/linux/compiler_attributes.h
index 898b3458b24a..b83126452c65 100644
--- a/include/linux/compiler_attributes.h
+++ b/include/linux/compiler_attributes.h
@@ -75,12 +75,6 @@
# define __assume_aligned(a, ...)
#endif

-/*
- * gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-cold-function-attribute
- * gcc: https://gcc.gnu.org/onlinedocs/gcc/Label-Attributes.html#index-cold-label-attribute
- */
-#define __cold __attribute__((__cold__))
-
/*
* Note the long name.
*
diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index 7c1afe0f4129..aab34e30128e 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -79,6 +79,33 @@ static inline void __chk_io_ptr(const volatile void __iomem *ptr) { }
/* Attributes */
#include <linux/compiler_attributes.h>

+#if CONFIG_FUNCTION_ALIGNMENT > 0
+#define __function_aligned __aligned(CONFIG_FUNCTION_ALIGNMENT)
+#else
+#define __function_aligned
+#endif
+
+/*
+ * gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-cold-function-attribute
+ * gcc: https://gcc.gnu.org/onlinedocs/gcc/Label-Attributes.html#index-cold-label-attribute
+ *
+ * When -falign-functions=N is in use, we must avoid the cold attribute as
+ * contemporary versions of GCC drop the alignment for cold functions. Worse,
+ * GCC can implicitly mark callees of cold functions as cold themselves, so
+ * it's not sufficient to add __function_aligned here as that will not ensure
+ * that callees are correctly aligned.
+ *
+ * See:
+ *
+ * https://lore.kernel.org/lkml/Y77%2FqVgvaJidFpYt@FVFF77S0Q05N
+ * https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345#c9
+ */
+#if !defined(CONFIG_CC_IS_GCC) || (CONFIG_FUNCTION_ALIGNMENT == 0)
+#define __cold __attribute__((__cold__))
+#else
+#define __cold
+#endif
+
/* Builtins */

/*
diff --git a/kernel/exit.c b/kernel/exit.c
index 15dc2ec80c46..c8e0375705f4 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1898,7 +1898,14 @@ bool thread_group_exited(struct pid *pid)
}
EXPORT_SYMBOL(thread_group_exited);

-__weak void abort(void)
+/*
+ * This needs to be __function_aligned as GCC implicitly makes any
+ * implementation of abort() cold and drops alignment specified by
+ * -falign-functions=N.
+ *
+ * See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345#c11
+ */
+__weak __function_aligned void abort(void)
{
BUG();

--
2.30.2


2023-01-23 13:46:31

by Mark Rutland

Subject: [PATCH v3 3/8] ACPI: Don't build ACPICA with '-Os'

The ACPICA code has been built with '-Os' since the beginning of git
history, though there's no explanatory comment as to why.

This is unfortunate as GCC drops the alignment specified by
'-falign-functions=N' when '-Os' is used, as reported in GCC bug 88345:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345

This prevents CONFIG_FUNCTION_ALIGNMENT and
CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B from having their expected effect
on the ACPICA code. This is doubly unfortunate as in subsequent patches
arm64 will depend upon CONFIG_FUNCTION_ALIGNMENT for its ftrace
implementation.

Drop the '-Os' flag when building the ACPICA code. With this removed,
the code builds cleanly and works correctly in testing so far.

I've tested this by selecting CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y,
building and booting a kernel using ACPI, and looking for misaligned
text symbols:

* arm64:

Before, v6.2-rc3:
# uname -rm
6.2.0-rc3 aarch64
# grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
5009

Before, v6.2-rc3 + fixed __cold:
# uname -rm
6.2.0-rc3-00001-g2a2bedf8bfa9 aarch64
# grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
919

After:
# uname -rm
6.2.0-rc3-00002-g267bddc38572 aarch64
# grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
323
# grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | grep acpi | wc -l
0

* x86_64:

Before, v6.2-rc3:
# uname -rm
6.2.0-rc3 x86_64
# grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
11537

Before, v6.2-rc3 + fixed __cold:
# uname -rm
6.2.0-rc3-00001-g2a2bedf8bfa9 x86_64
# grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
2805

After:
# uname -rm
6.2.0-rc3-00002-g267bddc38572 x86_64
# grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
1357
# grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | grep acpi | wc -l
0

With the patch applied, the remaining unaligned text labels are a
combination of static call trampolines and labels in assembly, which can
be dealt with in subsequent patches.

Signed-off-by: Mark Rutland <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Florent Revest <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Robert Moore <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
---
drivers/acpi/acpica/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/acpi/acpica/Makefile b/drivers/acpi/acpica/Makefile
index 9e0d95d76fff..30f3fc13c29d 100644
--- a/drivers/acpi/acpica/Makefile
+++ b/drivers/acpi/acpica/Makefile
@@ -3,7 +3,7 @@
# Makefile for ACPICA Core interpreter
#

-ccflags-y := -Os -D_LINUX -DBUILDING_ACPICA
+ccflags-y := -D_LINUX -DBUILDING_ACPICA
ccflags-$(CONFIG_ACPI_DEBUG) += -DACPI_DEBUG_OUTPUT

# use acpi.o to put all files here into acpi.o modparam namespace
--
2.30.2


2023-01-23 13:46:40

by Mark Rutland

Subject: [PATCH v3 4/8] arm64: Extend support for CONFIG_FUNCTION_ALIGNMENT

On arm64 we don't align assembly functions in the same way as C
functions. This somewhat limits the utility of
CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B for testing, and adds noise when
testing that we're correctly aligning functions as will be necessary for
ftrace in subsequent patches.

Follow the example of x86, and align assembly functions in the same way
as C functions. Selecting FUNCTION_ALIGNMENT_4B ensures
CONFIG_FUNCTION_ALIGNMENT will be a minimum of 4 bytes, matching the
minimum alignment that __ALIGN and __ALIGN_STR provide prior to this
patch.

I've tested this by selecting CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y,
building and booting a kernel, and looking for misaligned text symbols:

Before, v6.2-rc3:
# uname -rm
6.2.0-rc3 aarch64
# grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
5009

Before, v6.2-rc3 + fixed __cold:
# uname -rm
6.2.0-rc3-00001-g2a2bedf8bfa9 aarch64
# grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
919

Before, v6.2-rc3 + fixed __cold + fixed ACPICA:
# uname -rm
6.2.0-rc3-00002-g267bddc38572 aarch64
# grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
323
# grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | grep acpi | wc -l
0

After:
# uname -rm
6.2.0-rc3-00003-g71db61ee3ea1 aarch64
# grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
112

Considering the remaining 112 unaligned text symbols:

* 20 are non-function KVM NVHE assembly symbols, which are never
instrumented by ftrace:

# grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | grep __kvm_nvhe | wc -l
20
# grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | grep __kvm_nvhe
ffffbe6483f73784 t __kvm_nvhe___invalid
ffffbe6483f73788 t __kvm_nvhe___do_hyp_init
ffffbe6483f73ab0 t __kvm_nvhe_reset
ffffbe6483f73b8c T __kvm_nvhe___hyp_idmap_text_end
ffffbe6483f73b8c T __kvm_nvhe___hyp_text_start
ffffbe6483f77864 t __kvm_nvhe___host_enter_restore_full
ffffbe6483f77874 t __kvm_nvhe___host_enter_for_panic
ffffbe6483f778a4 t __kvm_nvhe___host_enter_without_restoring
ffffbe6483f81178 T __kvm_nvhe___guest_exit_panic
ffffbe6483f811c8 T __kvm_nvhe___guest_exit
ffffbe6483f81354 t __kvm_nvhe_abort_guest_exit_start
ffffbe6483f81358 t __kvm_nvhe_abort_guest_exit_end
ffffbe6483f81830 t __kvm_nvhe_wa_epilogue
ffffbe6483f81844 t __kvm_nvhe_el1_trap
ffffbe6483f81864 t __kvm_nvhe_el1_fiq
ffffbe6483f81864 t __kvm_nvhe_el1_irq
ffffbe6483f81884 t __kvm_nvhe_el1_error
ffffbe6483f818a4 t __kvm_nvhe_el2_sync
ffffbe6483f81920 t __kvm_nvhe_el2_error
ffffbe6483f865c8 T __kvm_nvhe___start___kvm_ex_table

* 53 are position-independent functions only used during early boot, which are
built with '-Os', but are never instrumented by ftrace:

# grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | grep __pi | wc -l
53

We *could* drop '-Os' when building these for consistency, but that is
not necessary to ensure that ftrace works correctly.

* The remaining 39 are a combination of non-function symbols and 3 runtime
BPF functions, which are never instrumented by ftrace:

# grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | grep -v __kvm_nvhe | grep -v __pi | wc -l
39
# grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | grep -v __kvm_nvhe | grep -v __pi
ffffbe6482e1009c T __irqentry_text_end
ffffbe6482e10358 T __softirqentry_text_end
ffffbe6482e1435c T __entry_text_end
ffffbe6482e825f8 T __guest_exit_panic
ffffbe6482e82648 T __guest_exit
ffffbe6482e827d4 t abort_guest_exit_start
ffffbe6482e827d8 t abort_guest_exit_end
ffffbe6482e83030 t wa_epilogue
ffffbe6482e83044 t el1_trap
ffffbe6482e83064 t el1_fiq
ffffbe6482e83064 t el1_irq
ffffbe6482e83084 t el1_error
ffffbe6482e830a4 t el2_sync
ffffbe6482e83120 t el2_error
ffffbe6482e93550 T sha256_block_neon
ffffbe64830f3ae0 t e843419@01cc_00002a0c_3104
ffffbe648378bd90 t e843419@09b3_0000d7cb_bc4
ffffbe6483bdab20 t e843419@0c66_000116e2_34c8
ffffbe6483f62c94 T __noinstr_text_end
ffffbe6483f70a18 T __sched_text_end
ffffbe6483f70b2c T __cpuidle_text_end
ffffbe6483f722d4 T __lock_text_end
ffffbe6483f73b8c T __hyp_idmap_text_end
ffffbe6483f73b8c T __hyp_text_start
ffffbe6483f865c8 T __start___kvm_ex_table
ffffbe6483f870d0 t init_el1
ffffbe6483f870f8 t init_el2
ffffbe6483f87324 t pen
ffffbe6483f87b48 T __idmap_text_end
ffffbe64848eb010 T __hibernate_exit_text_start
ffffbe64848eb124 T __hibernate_exit_text_end
ffffbe64848eb124 T __relocate_new_kernel_start
ffffbe64848eb260 T __relocate_new_kernel_end
ffffbe648498a8e8 T _einittext
ffffbe648498a8e8 T __exittext_begin
ffffbe6484999d84 T __exittext_end
ffff8000080756b4 t bpf_prog_6deef7357e7b4530 [bpf]
ffff80000808dd78 t bpf_prog_6deef7357e7b4530 [bpf]
ffff80000809d684 t bpf_prog_6deef7357e7b4530 [bpf]

Signed-off-by: Mark Rutland <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Florent Revest <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Will Deacon <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/linkage.h | 4 ++--
2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 03934808b2ed..6914f6bf41e2 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -123,6 +123,7 @@ config ARM64
select DMA_DIRECT_REMAP
select EDAC_SUPPORT
select FRAME_POINTER
+ select FUNCTION_ALIGNMENT_4B
select GENERIC_ALLOCATOR
select GENERIC_ARCH_TOPOLOGY
select GENERIC_CLOCKEVENTS_BROADCAST
diff --git a/arch/arm64/include/asm/linkage.h b/arch/arm64/include/asm/linkage.h
index 1436fa1cde24..d3acd9c87509 100644
--- a/arch/arm64/include/asm/linkage.h
+++ b/arch/arm64/include/asm/linkage.h
@@ -5,8 +5,8 @@
#include <asm/assembler.h>
#endif

-#define __ALIGN .align 2
-#define __ALIGN_STR ".align 2"
+#define __ALIGN .balign CONFIG_FUNCTION_ALIGNMENT
+#define __ALIGN_STR ".balign " #CONFIG_FUNCTION_ALIGNMENT

/*
* When using in-kernel BTI we need to ensure that PCS-conformant
--
2.30.2


2023-01-23 13:46:47

by Mark Rutland

Subject: [PATCH v3 5/8] arm64: insn: Add helpers for BTI

In subsequent patches we'd like to check whether an instruction is a
BTI. In preparation for this, add basic instruction helpers for BTI
instructions.

Per ARM DDI 0487H.a section C6.2.41, BTI is encoded in binary as
follows, MSB to LSB:

1101 0101 0000 0011 0010 0100 xx01 1111

Where the `xx` bits encode J/C/JC:

00 : (omitted)
01 : C
10 : J
11 : JC
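
For reference, this is how the mask/value pair added below relates to the
encoding above (the 'xx' bits are bits [7:6] of the instruction, so the mask
clears them); the macro names here are purely illustrative:

  /* Values derived from the encoding above: */
  #define BTI_PLAIN 0xD503241F  /* BTI     (xx == 00) */
  #define BTI_C     0xD503245F  /* BTI C   (xx == 01) */
  #define BTI_J     0xD503249F  /* BTI J   (xx == 10) */
  #define BTI_JC    0xD50324DF  /* BTI JC  (xx == 11) */
  #define BTI_MASK  0xFFFFFF3F  /* ignore bits [7:6] (the xx field) */
  /* (insn & BTI_MASK) == BTI_PLAIN matches all four variants. */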

Signed-off-by: Mark Rutland <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Florent Revest <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Will Deacon <[email protected]>
---
arch/arm64/include/asm/insn.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
index aaf1f52fbf3e..139a88e4e852 100644
--- a/arch/arm64/include/asm/insn.h
+++ b/arch/arm64/include/asm/insn.h
@@ -420,6 +420,7 @@ __AARCH64_INSN_FUNCS(sb, 0xFFFFFFFF, 0xD50330FF)
__AARCH64_INSN_FUNCS(clrex, 0xFFFFF0FF, 0xD503305F)
__AARCH64_INSN_FUNCS(ssbb, 0xFFFFFFFF, 0xD503309F)
__AARCH64_INSN_FUNCS(pssbb, 0xFFFFFFFF, 0xD503349F)
+__AARCH64_INSN_FUNCS(bti, 0xFFFFFF3F, 0xD503241f)

#undef __AARCH64_INSN_FUNCS

--
2.30.2


2023-01-23 13:46:56

by Mark Rutland

Subject: [PATCH v3 6/8] arm64: patching: Add aarch64_insn_write_literal_u64()

In subsequent patches we'll need to atomically write to a
naturally-aligned 64-bit literal embedded within the kernel text.

Add a helper for this. For consistency with other text patching code we
use copy_to_kernel_nofault(), which is atomic for naturally-aligned
accesses up to 64-bits.
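
As a rough usage sketch (mirroring what the final patch in this series does
for the per-callsite ftrace_ops pointer; the function name and the offset
below are shown for illustration only):

  static int set_callsite_ops(struct dyn_ftrace *rec, const struct ftrace_ops *ops)
  {
          /*
           * The literal lives on an 8-byte boundary just before the patch
           * site, so the 64-bit write below is naturally aligned and
           * therefore atomic.
           */
          unsigned long literal = ALIGN_DOWN(rec->ip - 12, 8);

          return aarch64_insn_write_literal_u64((void *)literal,
                                                (unsigned long)ops);
  }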

Signed-off-by: Mark Rutland <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Florent Revest <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Will Deacon <[email protected]>
---
arch/arm64/include/asm/patching.h | 2 ++
arch/arm64/kernel/patching.c | 17 +++++++++++++++++
2 files changed, 19 insertions(+)

diff --git a/arch/arm64/include/asm/patching.h b/arch/arm64/include/asm/patching.h
index 6bf5adc56295..68908b82b168 100644
--- a/arch/arm64/include/asm/patching.h
+++ b/arch/arm64/include/asm/patching.h
@@ -7,6 +7,8 @@
int aarch64_insn_read(void *addr, u32 *insnp);
int aarch64_insn_write(void *addr, u32 insn);

+int aarch64_insn_write_literal_u64(void *addr, u64 val);
+
int aarch64_insn_patch_text_nosync(void *addr, u32 insn);
int aarch64_insn_patch_text(void *addrs[], u32 insns[], int cnt);

diff --git a/arch/arm64/kernel/patching.c b/arch/arm64/kernel/patching.c
index 33e0fabc0b79..b4835f6d594b 100644
--- a/arch/arm64/kernel/patching.c
+++ b/arch/arm64/kernel/patching.c
@@ -88,6 +88,23 @@ int __kprobes aarch64_insn_write(void *addr, u32 insn)
return __aarch64_insn_write(addr, cpu_to_le32(insn));
}

+noinstr int aarch64_insn_write_literal_u64(void *addr, u64 val)
+{
+ u64 *waddr;
+ unsigned long flags;
+ int ret;
+
+ raw_spin_lock_irqsave(&patch_lock, flags);
+ waddr = patch_map(addr, FIX_TEXT_POKE0);
+
+ ret = copy_to_kernel_nofault(waddr, &val, sizeof(val));
+
+ patch_unmap(FIX_TEXT_POKE0);
+ raw_spin_unlock_irqrestore(&patch_lock, flags);
+
+ return ret;
+}
+
int __kprobes aarch64_insn_patch_text_nosync(void *addr, u32 insn)
{
u32 *tp = addr;
--
2.30.2


2023-01-23 13:46:59

by Mark Rutland

Subject: [PATCH v3 7/8] arm64: ftrace: Update stale comment

In commit:

26299b3f6ba26bfc ("ftrace: arm64: move from REGS to ARGS")

... we folded ftrace_regs_entry into ftrace_caller, and
ftrace_regs_entry no longer exists.

Update the comment accordingly.

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Florent Revest <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Will Deacon <[email protected]>
---
arch/arm64/kernel/ftrace.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
index b30b955a8921..38ebdf063255 100644
--- a/arch/arm64/kernel/ftrace.c
+++ b/arch/arm64/kernel/ftrace.c
@@ -209,7 +209,7 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
* | NOP | MOV X9, LR | MOV X9, LR |
* | NOP | NOP | BL <entry> |
*
- * The LR value will be recovered by ftrace_regs_entry, and restored into LR
+ * The LR value will be recovered by ftrace_caller, and restored into LR
* before returning to the regular function prologue. When a function is not
* being traced, the MOV is not harmful given x9 is not live per the AAPCS.
*
--
2.30.2


2023-01-23 13:47:08

by Mark Rutland

Subject: [PATCH v3 8/8] arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS

This patch enables support for DYNAMIC_FTRACE_WITH_CALL_OPS on arm64.
This allows each ftrace callsite to provide an ftrace_ops to the common
ftrace trampoline, allowing each callsite to invoke distinct tracer
functions without the need to fall back to list processing or to
allocate custom trampolines for each callsite. This significantly speeds
up cases where multiple distinct trace functions are used and callsites
are mostly traced by a single tracer.

The main idea is to place a pointer to the ftrace_ops as a literal at a
fixed offset from the function entry point, which can be recovered by
the common ftrace trampoline. Using a 64-bit literal avoids branch range
limitations, and permits the ops to be swapped atomically without
special considerations that apply to code-patching. In future this will
also allow for the implementation of DYNAMIC_FTRACE_WITH_DIRECT_CALLS
without branch range limitations by using additional fields in struct
ftrace_ops.

As noted in the core patch adding support for
DYNAMIC_FTRACE_WITH_CALL_OPS, this approach allows for directly invoking
ftrace_ops::func even for ftrace_ops which are dynamically-allocated (or
part of a module), without going via ftrace_ops_list_func.

Currently, this approach is not compatible with CLANG_CFI, as the
presence/absence of pre-function NOPs changes the offset of the
pre-function type hash, and there's no existing mechanism to ensure a
consistent offset for instrumented and uninstrumented functions. When
CLANG_CFI is enabled, the existing scheme with a global ops->func
pointer is used, and there should be no functional change. I am
currently working with others to allow the two to work together in
future (though this will likely require updated compiler support).

I've benchmarked this with the ftrace_ops sample module [1], which is
not currently upstream, but available at:

https://lore.kernel.org/lkml/[email protected]
git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git ftrace-ops-sample-20230109

Using that module I measured the total time taken for 100,000 calls to a
trivial instrumented function, with a number of tracers enabled with
relevant filters (which would apply to the instrumented function) and a
number of tracers enabled with irrelevant filters (which would not apply
to the instrumented function). I tested on an M1 MacBook Pro, running
under a HVF-accelerated QEMU VM (i.e. on real hardware).

Before this patch:

 Number of tracers     || Total time  | Per-call average time (ns)
 Relevant | Irrelevant ||    (ns)     |    Total     |  Overhead
==========+============++=============+==============+============
        0 |          0 ||      94,583 |         0.95 |          -
        0 |          1 ||      93,709 |         0.94 |          -
        0 |          2 ||      93,666 |         0.94 |          -
        0 |         10 ||      93,709 |         0.94 |          -
        0 |        100 ||      93,792 |         0.94 |          -
----------+------------++-------------+--------------+------------
        1 |          1 ||   6,467,833 |        64.68 |      63.73
        1 |          2 ||   7,509,708 |        75.10 |      74.15
        1 |         10 ||  23,786,792 |       237.87 |     236.92
        1 |        100 || 106,432,500 |     1,064.43 |   1,063.38
----------+------------++-------------+--------------+------------
        1 |          0 ||   1,431,875 |        14.32 |      13.37
        2 |          0 ||   6,456,334 |        64.56 |      63.62
       10 |          0 ||  22,717,000 |       227.17 |     226.22
      100 |          0 || 103,293,667 |     1,032.94 |   1,031.99
----------+------------++-------------+--------------+------------

Note: per-call overhead is estimated relative to the baseline case
with 0 relevant tracers and 0 irrelevant tracers.
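
(As a worked example from the table above: with 1 relevant and 1 irrelevant
tracer, 6,467,833 ns / 100,000 calls = 64.68 ns per call, and 64.68 - 0.95
(the 0/0 baseline) = 63.73 ns of per-call overhead.)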

After this patch:

 Number of tracers     || Total time  | Per-call average time (ns)
 Relevant | Irrelevant ||    (ns)     |    Total     |  Overhead
==========+============++=============+==============+============
        0 |          0 ||      94,541 |         0.95 |          -
        0 |          1 ||      93,666 |         0.94 |          -
        0 |          2 ||      93,709 |         0.94 |          -
        0 |         10 ||      93,667 |         0.94 |          -
        0 |        100 ||      93,792 |         0.94 |          -
----------+------------++-------------+--------------+------------
        1 |          1 ||     281,000 |         2.81 |       1.86
        1 |          2 ||     281,042 |         2.81 |       1.87
        1 |         10 ||     280,958 |         2.81 |       1.86
        1 |        100 ||     281,250 |         2.81 |       1.87
----------+------------++-------------+--------------+------------
        1 |          0 ||     280,959 |         2.81 |       1.86
        2 |          0 ||   6,502,708 |        65.03 |      64.08
       10 |          0 ||  18,681,209 |       186.81 |     185.87
      100 |          0 || 103,550,458 |     1,035.50 |   1,034.56
----------+------------++-------------+--------------+------------

Note: per-call overhead is estimated relative to the baseline case
with 0 relevant tracers and 0 irrelevant tracers.

As can be seen from the above:

a) Whenever there is a single relevant tracer function associated with a
tracee, the overhead of invoking the tracer is constant, and does not
scale with the number of tracers which are *not* associated with that
tracee.

b) The overhead for a single relevant tracer has dropped to ~1/7 of the
overhead prior to this series (from 13.37ns to 1.86ns). This is
largely due to permitting calls to dynamically-allocated ftrace_ops
without going through ftrace_ops_list_func.

I've run the ftrace selftests from v6.2-rc3, which reports:

| # of passed: 110
| # of failed: 0
| # of unresolved: 3
| # of untested: 0
| # of unsupported: 0
| # of xfailed: 1
| # of undefined(test bug): 0

... where the unresolved entries were the tests for DIRECT functions
(which are not supported), and the checkbashisms selftest (which is
irrelevant here):

| [8] Test ftrace direct functions against tracers [UNRESOLVED]
| [9] Test ftrace direct functions against kprobes [UNRESOLVED]
| [62] Meta-selftest: Checkbashisms [UNRESOLVED]

... with all other tests passing (or failing as expected).

Signed-off-by: Mark Rutland <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Florent Revest <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Will Deacon <[email protected]>
---
arch/arm64/Kconfig | 3 +
arch/arm64/Makefile | 5 +-
arch/arm64/include/asm/ftrace.h | 15 +--
arch/arm64/kernel/asm-offsets.c | 4 +
arch/arm64/kernel/entry-ftrace.S | 32 ++++++-
arch/arm64/kernel/ftrace.c | 156 +++++++++++++++++++++++++++++++
6 files changed, 195 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6914f6bf41e2..6f6f37161cf6 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -124,6 +124,7 @@ config ARM64
select EDAC_SUPPORT
select FRAME_POINTER
select FUNCTION_ALIGNMENT_4B
+ select FUNCTION_ALIGNMENT_8B if DYNAMIC_FTRACE_WITH_CALL_OPS
select GENERIC_ALLOCATOR
select GENERIC_ARCH_TOPOLOGY
select GENERIC_CLOCKEVENTS_BROADCAST
@@ -187,6 +188,8 @@ config ARM64
select HAVE_DYNAMIC_FTRACE
select HAVE_DYNAMIC_FTRACE_WITH_ARGS \
if $(cc-option,-fpatchable-function-entry=2)
+ select HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS \
+ if (DYNAMIC_FTRACE_WITH_ARGS && !CFI_CLANG)
select FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY \
if DYNAMIC_FTRACE_WITH_ARGS
select HAVE_EFFICIENT_UNALIGNED_ACCESS
diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
index d62bd221828f..4c3be442fbb3 100644
--- a/arch/arm64/Makefile
+++ b/arch/arm64/Makefile
@@ -139,7 +139,10 @@ endif

CHECKFLAGS += -D__aarch64__

-ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_ARGS),y)
+ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS),y)
+ KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
+ CC_FLAGS_FTRACE := -fpatchable-function-entry=4,2
+else ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_ARGS),y)
KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
CC_FLAGS_FTRACE := -fpatchable-function-entry=2
endif
diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
index 5664729800ae..1c2672bbbf37 100644
--- a/arch/arm64/include/asm/ftrace.h
+++ b/arch/arm64/include/asm/ftrace.h
@@ -62,20 +62,7 @@ extern unsigned long ftrace_graph_call;

extern void return_to_handler(void);

-static inline unsigned long ftrace_call_adjust(unsigned long addr)
-{
- /*
- * Adjust addr to point at the BL in the callsite.
- * See ftrace_init_nop() for the callsite sequence.
- */
- if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_ARGS))
- return addr + AARCH64_INSN_SIZE;
- /*
- * addr is the address of the mcount call instruction.
- * recordmcount does the necessary offset calculation.
- */
- return addr;
-}
+unsigned long ftrace_call_adjust(unsigned long addr);

#ifdef CONFIG_DYNAMIC_FTRACE_WITH_ARGS
struct dyn_ftrace;
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 2234624536d9..ae345b06e9f7 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -9,6 +9,7 @@

#include <linux/arm_sdei.h>
#include <linux/sched.h>
+#include <linux/ftrace.h>
#include <linux/kexec.h>
#include <linux/mm.h>
#include <linux/dma-mapping.h>
@@ -193,6 +194,9 @@ int main(void)
DEFINE(KIMAGE_HEAD, offsetof(struct kimage, head));
DEFINE(KIMAGE_START, offsetof(struct kimage, start));
BLANK();
+#endif
+#ifdef CONFIG_FUNCTION_TRACER
+ DEFINE(FTRACE_OPS_FUNC, offsetof(struct ftrace_ops, func));
#endif
return 0;
}
diff --git a/arch/arm64/kernel/entry-ftrace.S b/arch/arm64/kernel/entry-ftrace.S
index 3b625f76ffba..350ed81324ac 100644
--- a/arch/arm64/kernel/entry-ftrace.S
+++ b/arch/arm64/kernel/entry-ftrace.S
@@ -65,13 +65,35 @@ SYM_CODE_START(ftrace_caller)
stp x29, x30, [sp, #FREGS_SIZE]
add x29, sp, #FREGS_SIZE

- sub x0, x30, #AARCH64_INSN_SIZE // ip (callsite's BL insn)
- mov x1, x9 // parent_ip (callsite's LR)
- ldr_l x2, function_trace_op // op
- mov x3, sp // regs
+ /* Prepare arguments for the the tracer func */
+ sub x0, x30, #AARCH64_INSN_SIZE // ip (callsite's BL insn)
+ mov x1, x9 // parent_ip (callsite's LR)
+ mov x3, sp // regs
+
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
+ /*
+ * The literal pointer to the ops is at an 8-byte aligned boundary
+ * which is either 12 or 16 bytes before the BL instruction in the call
+ * site. See ftrace_call_adjust() for details.
+ *
+ * Therefore here the LR points at `literal + 16` or `literal + 20`,
+ * and we can find the address of the literal in either case by
+ * aligning to an 8-byte boundary and subtracting 16. We do the
+ * alignment first as this allows us to fold the subtraction into the
+ * LDR.
+ */
+ bic x2, x30, 0x7
+ ldr x2, [x2, #-16] // op
+
+ ldr x4, [x2, #FTRACE_OPS_FUNC] // op->func
+ blr x4 // op->func(ip, parent_ip, op, regs)
+
+#else
+ ldr_l x2, function_trace_op // op

SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
- bl ftrace_stub
+ bl ftrace_stub // func(ip, parent_ip, op, regs)
+#endif

/*
* At the callsite x0-x8 and x19-x30 were live. Any C code will have preserved
diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
index 38ebdf063255..5545fe1a9012 100644
--- a/arch/arm64/kernel/ftrace.c
+++ b/arch/arm64/kernel/ftrace.c
@@ -60,6 +60,89 @@ int ftrace_regs_query_register_offset(const char *name)
}
#endif

+unsigned long ftrace_call_adjust(unsigned long addr)
+{
+ /*
+ * When using mcount, addr is the address of the mcount call
+ * instruction, and no adjustment is necessary.
+ */
+ if (!IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_ARGS))
+ return addr;
+
+ /*
+ * When using patchable-function-entry without pre-function NOPS, addr
+ * is the address of the first NOP after the function entry point.
+ *
+ * The compiler has either generated:
+ *
+ * addr+00: func: NOP // To be patched to MOV X9, LR
+ * addr+04: NOP // To be patched to BL <caller>
+ *
+ * Or:
+ *
+ * addr-04: BTI C
+ * addr+00: func: NOP // To be patched to MOV X9, LR
+ * addr+04: NOP // To be patched to BL <caller>
+ *
+ * We must adjust addr to the address of the NOP which will be patched
+ * to `BL <caller>`, which is at `addr + 4` bytes in either case.
+ *
+ */
+ if (!IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS))
+ return addr + AARCH64_INSN_SIZE;
+
+ /*
+ * When using patchable-function-entry with pre-function NOPs, addr is
+ * the address of the first pre-function NOP.
+ *
+ * Starting from an 8-byte aligned base, the compiler has either
+ * generated:
+ *
+ * addr+00: NOP // Literal (first 32 bits)
+ * addr+04: NOP // Literal (last 32 bits)
+ * addr+08: func: NOP // To be patched to MOV X9, LR
+ * addr+12: NOP // To be patched to BL <caller>
+ *
+ * Or:
+ *
+ * addr+00: NOP // Literal (first 32 bits)
+ * addr+04: NOP // Literal (last 32 bits)
+ * addr+08: func: BTI C
+ * addr+12: NOP // To be patched to MOV X9, LR
+ * addr+16: NOP // To be patched to BL <caller>
+ *
+ * We must adjust addr to the address of the NOP which will be patched
+ * to `BL <caller>`, which is at either addr+12 or addr+16 depending on
+ * whether there is a BTI.
+ */
+
+ if (!IS_ALIGNED(addr, sizeof(unsigned long))) {
+ WARN_RATELIMIT(1, "Misaligned patch-site %pS\n",
+ (void *)(addr + 8));
+ return 0;
+ }
+
+ /* Skip the NOPs placed before the function entry point */
+ addr += 2 * AARCH64_INSN_SIZE;
+
+ /* Skip any BTI */
+ if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)) {
+ u32 insn = le32_to_cpu(*(__le32 *)addr);
+
+ if (aarch64_insn_is_bti(insn)) {
+ addr += AARCH64_INSN_SIZE;
+ } else if (insn != aarch64_insn_gen_nop()) {
+ WARN_RATELIMIT(1, "unexpected insn in patch-site %pS: 0x%08x\n",
+ (void *)addr, insn);
+ }
+ }
+
+ /* Skip the first NOP after function entry */
+ addr += AARCH64_INSN_SIZE;
+
+ return addr;
+}
+
/*
* Replace a single instruction, which may be a branch or NOP.
* If @validate == true, a replaced instruction is checked against 'old'.
@@ -98,6 +181,13 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
unsigned long pc;
u32 new;

+ /*
+ * When using CALL_OPS, the function to call is associated with the
+ * call site, and we don't have a global function pointer to update.
+ */
+ if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS))
+ return 0;
+
pc = (unsigned long)ftrace_call;
new = aarch64_insn_gen_branch_imm(pc, (unsigned long)func,
AARCH64_INSN_BRANCH_LINK);
@@ -176,6 +266,44 @@ static bool ftrace_find_callable_addr(struct dyn_ftrace *rec,
return true;
}

+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
+static const struct ftrace_ops *arm64_rec_get_ops(struct dyn_ftrace *rec)
+{
+ const struct ftrace_ops *ops = NULL;
+
+ if (rec->flags & FTRACE_FL_CALL_OPS_EN) {
+ ops = ftrace_find_unique_ops(rec);
+ WARN_ON_ONCE(!ops);
+ }
+
+ if (!ops)
+ ops = &ftrace_list_ops;
+
+ return ops;
+}
+
+static int ftrace_rec_set_ops(const struct dyn_ftrace *rec,
+ const struct ftrace_ops *ops)
+{
+ unsigned long literal = ALIGN_DOWN(rec->ip - 12, 8);
+ return aarch64_insn_write_literal_u64((void *)literal,
+ (unsigned long)ops);
+}
+
+static int ftrace_rec_set_nop_ops(struct dyn_ftrace *rec)
+{
+ return ftrace_rec_set_ops(rec, &ftrace_nop_ops);
+}
+
+static int ftrace_rec_update_ops(struct dyn_ftrace *rec)
+{
+ return ftrace_rec_set_ops(rec, arm64_rec_get_ops(rec));
+}
+#else
+static int ftrace_rec_set_nop_ops(struct dyn_ftrace *rec) { return 0; }
+static int ftrace_rec_update_ops(struct dyn_ftrace *rec) { return 0; }
+#endif
+
/*
* Turn on the call to ftrace_caller() in instrumented function
*/
@@ -183,6 +311,11 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
{
unsigned long pc = rec->ip;
u32 old, new;
+ int ret;
+
+ ret = ftrace_rec_update_ops(rec);
+ if (ret)
+ return ret;

if (!ftrace_find_callable_addr(rec, NULL, &addr))
return -EINVAL;
@@ -193,6 +326,19 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
return ftrace_modify_code(pc, old, new, true);
}

+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
+int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
+ unsigned long addr)
+{
+ if (WARN_ON_ONCE(old_addr != (unsigned long)ftrace_caller))
+ return -EINVAL;
+ if (WARN_ON_ONCE(addr != (unsigned long)ftrace_caller))
+ return -EINVAL;
+
+ return ftrace_rec_update_ops(rec);
+}
+#endif
+
#ifdef CONFIG_DYNAMIC_FTRACE_WITH_ARGS
/*
* The compiler has inserted two NOPs before the regular function prologue.
@@ -220,6 +366,11 @@ int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
{
unsigned long pc = rec->ip - AARCH64_INSN_SIZE;
u32 old, new;
+ int ret;
+
+ ret = ftrace_rec_set_nop_ops(rec);
+ if (ret)
+ return ret;

old = aarch64_insn_gen_nop();
new = aarch64_insn_gen_move_reg(AARCH64_INSN_REG_9,
@@ -237,9 +388,14 @@ int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec,
{
unsigned long pc = rec->ip;
u32 old = 0, new;
+ int ret;

new = aarch64_insn_gen_nop();

+ ret = ftrace_rec_set_nop_ops(rec);
+ if (ret)
+ return ret;
+
/*
* When using mcount, callsites in modules may have been initalized to
* call an arbitrary module PLT (which redirects to the _mcount stub)
--
2.30.2


2023-01-23 14:40:46

by Catalin Marinas

Subject: Re: [PATCH v3 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS

Hi Mark,

On Mon, Jan 23, 2023 at 01:45:55PM +0000, Mark Rutland wrote:
> I'm not sure how we want to merge this, so I've moved the core ftrace
> patch to the start of the series so that it can more easily be placed on
> a stable branch if we want that to go via the ftrace tree and the rest
> to go via arm64.

Happy to queue the whole series since Steve acked the ftrace changes.
But I'll put it on a stable branch in case there are any conflicts with the
ftrace tree.

--
Catalin

2023-01-23 22:50:06

by Steven Rostedt

Subject: Re: [PATCH v3 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS

On Mon, 23 Jan 2023 14:40:29 +0000
Catalin Marinas <[email protected]> wrote:

> Hi Mark,
>
> On Mon, Jan 23, 2023 at 01:45:55PM +0000, Mark Rutland wrote:
> > I'm not sure how we want to merge this, so I've moved the core ftrace
> > patch to the start of the series so that it can more easily be placed on
> > a stable branch if we want that to go via the ftrace tree and the rest
> > to go via arm64.
>
> Happy to queue the whole series since Steve acked the ftrace changes.
> But I'll put it on a stable branch in case there are any conflicts with the
> ftrace tree.
>

I actually do have a conflict with this patch (I'll post it soon).

I could just cherry pick it from your tree. Linus seems to be OK with that
when it's just a single change.

-- Steve

2023-01-23 23:01:08

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v3 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS

On Mon, 23 Jan 2023 17:49:52 -0500
Steven Rostedt <[email protected]> wrote:

> On Mon, 23 Jan 2023 14:40:29 +0000
> Catalin Marinas <[email protected]> wrote:
>
> > Hi Mark,
> >
> > On Mon, Jan 23, 2023 at 01:45:55PM +0000, Mark Rutland wrote:
> > > I'm not sure how we want to merge this, so I've moved the core ftrace
> > > patch to the start of the series so that it can more easily be placed on
> > > a stable branch if we want that to go via the ftrace tree and the rest
> > > to go via arm64.
> >
> > Happy to queue the whole series since Steve acked the ftrace changes.
> > But I'll put on a stable branch in case there are any conflicts with the
> > ftrace tree.
> >
>
> I actually do have a conflict with this patch (I'll post it soon).
>
> I could just cherry pick it from your tree. Linus seems to be OK with that
> when it's just a single change.

Forget about it. I'll post the patch, but it's not important (we need it
mostly for something internal, but it could possibly be useful to others). It
can wait a merge window or two.

So feel free to take this through your tree, and then I'll add my patch in
the merge window after that.

-- Steve


2023-01-24 12:10:54

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v3 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS

On Mon, 23 Jan 2023 13:45:55 +0000, Mark Rutland wrote:
> I'm not sure how we want to merge this, so I've moved the core ftrace
> patch to the start of the series so that it can more easily be placed on
> a stable branch if we want that to go via the ftrace tree and the rest
> to go via arm64.
>
> This is cleanly pasing the ftrace selftests from v6.2-rc3 (results in
> the final patch).
>
> [...]

Applied to arm64 (for-next/ftrace), thanks!

[1/8] ftrace: Add DYNAMIC_FTRACE_WITH_CALL_OPS
https://git.kernel.org/arm64/c/cbad0fb2d8d9
[2/8] Compiler attributes: GCC cold function alignment workarounds
https://git.kernel.org/arm64/c/c27cd083cfb9
[3/8] ACPI: Don't build ACPICA with '-Os'
https://git.kernel.org/arm64/c/8f9e0a52810d
[4/8] arm64: Extend support for CONFIG_FUNCTION_ALIGNMENT
https://git.kernel.org/arm64/c/47a15aa54427
[5/8] arm64: insn: Add helpers for BTI
https://git.kernel.org/arm64/c/2bbbb4015aa1
[6/8] arm64: patching: Add aarch64_insn_write_literal_u64()
https://git.kernel.org/arm64/c/e4ecbe83fd1a
[7/8] arm64: ftrace: Update stale comment
https://git.kernel.org/arm64/c/90955d778ad7
[8/8] arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
https://git.kernel.org/arm64/c/baaf553d3bc3

--
Catalin


2023-01-28 08:47:08

by Wangshaobo (bobo)

[permalink] [raw]
Subject: Re: [PATCH v3 1/8] ftrace: Add DYNAMIC_FTRACE_WITH_CALL_OPS



On 2023/1/23 21:45, Mark Rutland wrote:
> Architectures without dynamic ftrace trampolines incur an overhead when
> multiple ftrace_ops are enabled with distinct filters. In these cases,
> each call site calls a common trampoline which uses
> ftrace_ops_list_func() to iterate over all enabled ftrace functions, and
> so incurs an overhead relative to the size of this list (including RCU
> protection overhead).
>
> Architectures with dynamic ftrace trampolines avoid this overhead for
> call sites which have a single associated ftrace_ops. In these cases,
> the dynamic trampoline is customized to branch directly to the relevant
> ftrace function, avoiding the list overhead.
>
> On some architectures it's impractical and/or undesirable to implement
> dynamic ftrace trampolines. For example, arm64 has limited branch ranges
> and cannot always directly branch from a call site to an arbitrary
> address (e.g. from a kernel text address to an arbitrary module
> address). Calls from modules to core kernel text can be indirected via
> PLTs (allocated at module load time) to address this, but the same is
> not possible for calls from core kernel text.
>
> Using an indirect branch from a call site to an arbitrary trampoline is
> possible, but requires several more instructions in the function
> prologue (or immediately before it), and/or comes with far more complex
> requirements for patching.
>
> Instead, this patch adds a new option, where an architecture can
> associate each call site with a pointer to an ftrace_ops, placed at a
> fixed offset from the call site. A shared trampoline can recover this
> pointer and call ftrace_ops::func() without needing to go via
> ftrace_ops_list_func(), avoiding the associated overhead.
>
> This avoids issues with branch range limitations, and avoids the need to
> allocate and manipulate dynamic trampolines, making it far simpler to
> implement and maintain, while having similar performance
> characteristics.
>
> Note that this allows for dynamic ftrace_ops to be invoked directly from
> an architecture's ftrace_caller trampoline, whereas existing code forces
> the use of ftrace_ops_get_list_func(), which is in part necessary to
> permit the ftrace_ops to be freed once unregistered *and* to avoid
> branch/address-generation range limitation on some architectures (e.g.
> where ops->func is a module address, and may be outside of the direct
> branch range for callsites within the main kernel image).
>
> The CALL_OPS approach avoids these problems and is safe as:
>
> * The existing synchronization in ftrace_shutdown() using
> synchronize_rcu_tasks_rude() (and
> synchronize_rcu_tasks()) ensures that no tasks hold a stale reference
> to an ftrace_ops (e.g. in the middle of the ftrace_caller trampoline,
> or while invoking ftrace_ops::func), when that ftrace_ops is
> unregistered.
>
> Arguably this could also be relied upon for the existing scheme,
> permitting dynamic ftrace_ops to be invoked directly when ops->func is
> in range, but this will require additional logic to handle branch
> range limitations, and is not handled by this patch.
>
> * Each callsite's ftrace_ops pointer literal can hold any valid kernel
> address, and is updated atomically. As an architecture's ftrace_caller
> trampoline will atomically load the ops pointer then dereference
> ops->func, there is no risk of invoking ops->func with a mismatched
> ops pointer, and updates to the ops pointer do not require special
> care.
>
> A subsequent patch will implement architecture support for arm64. There
> should be no functional change as a result of this patch alone.
>
> Signed-off-by: Mark Rutland <[email protected]>
> Reviewed-by: Steven Rostedt (Google) <[email protected]>
> Cc: Catalin Marinas <[email protected]>
> Cc: Florent Revest <[email protected]>
> Cc: Masami Hiramatsu <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Will Deacon <[email protected]>
> ---
> include/linux/ftrace.h | 18 +++++--
> kernel/trace/Kconfig | 7 +++
> kernel/trace/ftrace.c | 109 +++++++++++++++++++++++++++++++++++++++--
> 3 files changed, 125 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index 99f1146614c0..366c730beaa3 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -39,6 +39,7 @@ static inline void ftrace_boot_snapshot(void) { }
>
> struct ftrace_ops;
> struct ftrace_regs;
> +struct dyn_ftrace;
>
> #ifdef CONFIG_FUNCTION_TRACER
> /*
> @@ -57,6 +58,9 @@ void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip);
> void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
> struct ftrace_ops *op, struct ftrace_regs *fregs);
> #endif
> +extern const struct ftrace_ops ftrace_nop_ops;
> +extern const struct ftrace_ops ftrace_list_ops;
> +struct ftrace_ops *ftrace_find_unique_ops(struct dyn_ftrace *rec);
> #endif /* CONFIG_FUNCTION_TRACER */
>
> /* Main tracing buffer and events set up */
> @@ -391,8 +395,6 @@ struct ftrace_func_entry {
> unsigned long direct; /* for direct lookup only */
> };
>
> -struct dyn_ftrace;
> -
> #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
> extern int ftrace_direct_func_count;
> int register_ftrace_direct(unsigned long ip, unsigned long addr);
> @@ -563,6 +565,8 @@ bool is_ftrace_trampoline(unsigned long addr);
> * IPMODIFY - the record allows for the IP address to be changed.
> * DISABLED - the record is not ready to be touched yet
> * DIRECT - there is a direct function to call
> + * CALL_OPS - the record can use callsite-specific ops
> + * CALL_OPS_EN - the function is set up to use callsite-specific ops
> *
> * When a new ftrace_ops is registered and wants a function to save
> * pt_regs, the rec->flags REGS is set. When the function has been
> @@ -580,9 +584,11 @@ enum {
> FTRACE_FL_DISABLED = (1UL << 25),
> FTRACE_FL_DIRECT = (1UL << 24),
> FTRACE_FL_DIRECT_EN = (1UL << 23),
> + FTRACE_FL_CALL_OPS = (1UL << 22),
> + FTRACE_FL_CALL_OPS_EN = (1UL << 21),
> };
>
> -#define FTRACE_REF_MAX_SHIFT 23
> +#define FTRACE_REF_MAX_SHIFT 21
> #define FTRACE_REF_MAX ((1UL << FTRACE_REF_MAX_SHIFT) - 1)
>
> #define ftrace_rec_count(rec) ((rec)->flags & FTRACE_REF_MAX)
> @@ -820,7 +826,8 @@ static inline int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
> */
> extern int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr);
>
> -#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
> +#if defined(CONFIG_DYNAMIC_FTRACE_WITH_REGS) || \
> + defined(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS)
> /**
> * ftrace_modify_call - convert from one addr to another (no nop)
> * @rec: the call site record (e.g. mcount/fentry)
> @@ -833,6 +840,9 @@ extern int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr);
> * what we expect it to be, and then on success of the compare,
> * it should write to the location.
> *
> + * When using call ops, this is called when the associated ops change, even
> + * when (addr == old_addr).
> + *
> * The code segment at @rec->ip should be a caller to @old_addr
> *
> * Return must be:
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index 197545241ab8..5df427a2321d 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -42,6 +42,9 @@ config HAVE_DYNAMIC_FTRACE_WITH_REGS
> config HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
> bool
>
> +config HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
> + bool
> +
> config HAVE_DYNAMIC_FTRACE_WITH_ARGS
> bool
> help
> @@ -257,6 +260,10 @@ config DYNAMIC_FTRACE_WITH_DIRECT_CALLS
> depends on DYNAMIC_FTRACE_WITH_REGS
> depends on HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
>
> +config DYNAMIC_FTRACE_WITH_CALL_OPS
> + def_bool y
> + depends on HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
> +
Hi Mark,
I have tested your patches and they look fine with my sample module, but
unconditionally setting DYNAMIC_FTRACE_WITH_CALL_OPS to y here may increase
the .text section size by 5% or more; how about making this optional? ^^

-- Wang ShaoBo
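
For readers who want to reproduce this kind of test, a minimal single-ops
tracer module might look like the sketch below. This is a hypothetical
example (the names and the filtered symbol are illustrative, and it is not
the sample module referred to above); it registers one ftrace_ops with a
single filtered function, which is the case the CALL_OPS fast path targets:

#include <linux/ftrace.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/string.h>

static char trace_target[] = "do_one_initcall";	/* illustrative target */

static void callops_test_func(unsigned long ip, unsigned long parent_ip,
			      struct ftrace_ops *op, struct ftrace_regs *fregs)
{
	/* Empty on purpose; a real tracer would record something here. */
}

static struct ftrace_ops callops_test_ops = {
	.func = callops_test_func,
};

static int __init callops_test_init(void)
{
	int ret;

	/*
	 * Filter on a single function so that (absent other tracers) the
	 * callsite has a single associated ops.
	 */
	ret = ftrace_set_filter(&callops_test_ops, trace_target,
				strlen(trace_target), 1);
	if (ret)
		return ret;

	return register_ftrace_function(&callops_test_ops);
}

static void __exit callops_test_exit(void)
{
	unregister_ftrace_function(&callops_test_ops);
}

module_init(callops_test_init);
module_exit(callops_test_exit);
MODULE_DESCRIPTION("Minimal single-ops ftrace test (illustrative sketch)");
MODULE_LICENSE("GPL");

With such a module loaded, the filtered function should show up in tracefs'
enabled_functions with the new " O" marker and an "ops:" line (see the
t_show() hunk quoted further below).
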
> config DYNAMIC_FTRACE_WITH_ARGS
> def_bool y
> depends on DYNAMIC_FTRACE
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 442438b93fe9..e634b80f49d1 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -125,6 +125,33 @@ struct ftrace_ops global_ops;
> void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
> struct ftrace_ops *op, struct ftrace_regs *fregs);
>
> +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
> +/*
> + * Stub used to invoke the list ops without requiring a separate trampoline.
> + */
> +const struct ftrace_ops ftrace_list_ops = {
> + .func = ftrace_ops_list_func,
> + .flags = FTRACE_OPS_FL_STUB,
> +};
> +
> +static void ftrace_ops_nop_func(unsigned long ip, unsigned long parent_ip,
> + struct ftrace_ops *op,
> + struct ftrace_regs *fregs)
> +{
> + /* do nothing */
> +}
> +
> +/*
> + * Stub used when a call site is disabled. May be called transiently by threads
> + * which have made it into ftrace_caller but haven't yet recovered the ops at
> + * the point the call site is disabled.
> + */
> +const struct ftrace_ops ftrace_nop_ops = {
> + .func = ftrace_ops_nop_func,
> + .flags = FTRACE_OPS_FL_STUB,
> +};
> +#endif
> +
> static inline void ftrace_ops_init(struct ftrace_ops *ops)
> {
> #ifdef CONFIG_DYNAMIC_FTRACE
> @@ -1814,6 +1841,18 @@ static bool __ftrace_hash_rec_update(struct ftrace_ops *ops,
> * if rec count is zero.
> */
> }
> +
> + /*
> + * If the rec has a single associated ops, and ops->func can be
> + * called directly, allow the call site to call via the ops.
> + */
> + if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS) &&
> + ftrace_rec_count(rec) == 1 &&
> + ftrace_ops_get_func(ops) == ops->func)
> + rec->flags |= FTRACE_FL_CALL_OPS;
> + else
> + rec->flags &= ~FTRACE_FL_CALL_OPS;
> +
> count++;
>
> /* Must match FTRACE_UPDATE_CALLS in ftrace_modify_all_code() */
> @@ -2108,8 +2147,9 @@ void ftrace_bug(int failed, struct dyn_ftrace *rec)
> struct ftrace_ops *ops = NULL;
>
> pr_info("ftrace record flags: %lx\n", rec->flags);
> - pr_cont(" (%ld)%s", ftrace_rec_count(rec),
> - rec->flags & FTRACE_FL_REGS ? " R" : " ");
> + pr_cont(" (%ld)%s%s", ftrace_rec_count(rec),
> + rec->flags & FTRACE_FL_REGS ? " R" : " ",
> + rec->flags & FTRACE_FL_CALL_OPS ? " O" : " ");
> if (rec->flags & FTRACE_FL_TRAMP_EN) {
> ops = ftrace_find_tramp_ops_any(rec);
> if (ops) {
> @@ -2177,6 +2217,7 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
> * want the direct enabled (it will be done via the
> * direct helper). But if DIRECT_EN is set, and
> * the count is not one, we need to clear it.
> + *
> */
> if (ftrace_rec_count(rec) == 1) {
> if (!(rec->flags & FTRACE_FL_DIRECT) !=
> @@ -2185,6 +2226,19 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
> } else if (rec->flags & FTRACE_FL_DIRECT_EN) {
> flag |= FTRACE_FL_DIRECT;
> }
> +
> + /*
> + * Ops calls are special, as count matters.
> + * As with direct calls, they must only be enabled when count
> + * is one, otherwise they'll be handled via the list ops.
> + */
> + if (ftrace_rec_count(rec) == 1) {
> + if (!(rec->flags & FTRACE_FL_CALL_OPS) !=
> + !(rec->flags & FTRACE_FL_CALL_OPS_EN))
> + flag |= FTRACE_FL_CALL_OPS;
> + } else if (rec->flags & FTRACE_FL_CALL_OPS_EN) {
> + flag |= FTRACE_FL_CALL_OPS;
> + }
> }
>
> /* If the state of this record hasn't changed, then do nothing */
> @@ -2229,6 +2283,21 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
> rec->flags &= ~FTRACE_FL_DIRECT_EN;
> }
> }
> +
> + if (flag & FTRACE_FL_CALL_OPS) {
> + if (ftrace_rec_count(rec) == 1) {
> + if (rec->flags & FTRACE_FL_CALL_OPS)
> + rec->flags |= FTRACE_FL_CALL_OPS_EN;
> + else
> + rec->flags &= ~FTRACE_FL_CALL_OPS_EN;
> + } else {
> + /*
> + * Can only call directly if there's
> + * only one set of associated ops.
> + */
> + rec->flags &= ~FTRACE_FL_CALL_OPS_EN;
> + }
> + }
> }
>
> /*
> @@ -2258,7 +2327,8 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
> * and REGS states. The _EN flags must be disabled though.
> */
> rec->flags &= ~(FTRACE_FL_ENABLED | FTRACE_FL_TRAMP_EN |
> - FTRACE_FL_REGS_EN | FTRACE_FL_DIRECT_EN);
> + FTRACE_FL_REGS_EN | FTRACE_FL_DIRECT_EN |
> + FTRACE_FL_CALL_OPS_EN);
> }
>
> ftrace_bug_type = FTRACE_BUG_NOP;
> @@ -2431,6 +2501,25 @@ ftrace_find_tramp_ops_new(struct dyn_ftrace *rec)
> return NULL;
> }
>
> +struct ftrace_ops *
> +ftrace_find_unique_ops(struct dyn_ftrace *rec)
> +{
> + struct ftrace_ops *op, *found = NULL;
> + unsigned long ip = rec->ip;
> +
> + do_for_each_ftrace_op(op, ftrace_ops_list) {
> +
> + if (hash_contains_ip(ip, op->func_hash)) {
> + if (found)
> + return NULL;
> + found = op;
> + }
> +
> + } while_for_each_ftrace_op(op);
> +
> + return found;
> +}
> +
> #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
> /* Protected by rcu_tasks for reading, and direct_mutex for writing */
> static struct ftrace_hash *direct_functions = EMPTY_HASH;
> @@ -3780,11 +3869,12 @@ static int t_show(struct seq_file *m, void *v)
> if (iter->flags & FTRACE_ITER_ENABLED) {
> struct ftrace_ops *ops;
>
> - seq_printf(m, " (%ld)%s%s%s",
> + seq_printf(m, " (%ld)%s%s%s%s",
> ftrace_rec_count(rec),
> rec->flags & FTRACE_FL_REGS ? " R" : " ",
> rec->flags & FTRACE_FL_IPMODIFY ? " I" : " ",
> - rec->flags & FTRACE_FL_DIRECT ? " D" : " ");
> + rec->flags & FTRACE_FL_DIRECT ? " D" : " ",
> + rec->flags & FTRACE_FL_CALL_OPS ? " O" : " ");
> if (rec->flags & FTRACE_FL_TRAMP_EN) {
> ops = ftrace_find_tramp_ops_any(rec);
> if (ops) {
> @@ -3800,6 +3890,15 @@ static int t_show(struct seq_file *m, void *v)
> } else {
> add_trampoline_func(m, NULL, rec);
> }
> + if (rec->flags & FTRACE_FL_CALL_OPS_EN) {
> + ops = ftrace_find_unique_ops(rec);
> + if (ops) {
> + seq_printf(m, "\tops: %pS (%pS)",
> + ops, ops->func);
> + } else {
> + seq_puts(m, "\tops: ERROR!");
> + }
> + }
> if (rec->flags & FTRACE_FL_DIRECT) {
> unsigned long direct;
>
>

2023-01-30 10:25:48

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v3 1/8] ftrace: Add DYNAMIC_FTRACE_WITH_CALL_OPS

On Sat, Jan 28, 2023 at 04:46:48PM +0800, Wangshaobo (bobo) wrote:
> On 2023/1/23 21:45, Mark Rutland wrote:
> > +config DYNAMIC_FTRACE_WITH_CALL_OPS
> > + def_bool y
> > + depends on HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
> > +
> Hi Mark,

Hi,

> I have test your patches and it looks fine with my sample module,

Thanks for testing!

> but here setting DYNAMIC_FTRACE_WITH_CALL_OPS to y immutably may increase the
> .text section size by 5% or more, how about making this to optional^^

We could consider making this optional. I had not made this optional so far
because, in the future, I'd like to make this the only implementation of
ftrace on arm64 (once we can drop the old mcount version, and once we've
sorted out the incompatibility with CFI). In the meantime, it probably makes
sense to have the option, at least to enable testing of each of the two forms.

Is your concern that the overall kernel image size is larger, or do you care
specifically about the size of the .text section for some reason?

Thanks,
Mark.

>
> -- Wang ShaoBo
> > config DYNAMIC_FTRACE_WITH_ARGS
> > def_bool y
> > depends on DYNAMIC_FTRACE
> > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> > index 442438b93fe9..e634b80f49d1 100644
> > --- a/kernel/trace/ftrace.c
> > +++ b/kernel/trace/ftrace.c
> > @@ -125,6 +125,33 @@ struct ftrace_ops global_ops;
> > void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
> > struct ftrace_ops *op, struct ftrace_regs *fregs);
> > +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS
> > +/*
> > + * Stub used to invoke the list ops without requiring a separate trampoline.
> > + */
> > +const struct ftrace_ops ftrace_list_ops = {
> > + .func = ftrace_ops_list_func,
> > + .flags = FTRACE_OPS_FL_STUB,
> > +};
> > +
> > +static void ftrace_ops_nop_func(unsigned long ip, unsigned long parent_ip,
> > + struct ftrace_ops *op,
> > + struct ftrace_regs *fregs)
> > +{
> > + /* do nothing */
> > +}
> > +
> > +/*
> > + * Stub used when a call site is disabled. May be called transiently by threads
> > + * which have made it into ftrace_caller but haven't yet recovered the ops at
> > + * the point the call site is disabled.
> > + */
> > +const struct ftrace_ops ftrace_nop_ops = {
> > + .func = ftrace_ops_nop_func,
> > + .flags = FTRACE_OPS_FL_STUB,
> > +};
> > +#endif
> > +
> > static inline void ftrace_ops_init(struct ftrace_ops *ops)
> > {
> > #ifdef CONFIG_DYNAMIC_FTRACE
> > @@ -1814,6 +1841,18 @@ static bool __ftrace_hash_rec_update(struct ftrace_ops *ops,
> > * if rec count is zero.
> > */
> > }
> > +
> > + /*
> > + * If the rec has a single associated ops, and ops->func can be
> > + * called directly, allow the call site to call via the ops.
> > + */
> > + if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS) &&
> > + ftrace_rec_count(rec) == 1 &&
> > + ftrace_ops_get_func(ops) == ops->func)
> > + rec->flags |= FTRACE_FL_CALL_OPS;
> > + else
> > + rec->flags &= ~FTRACE_FL_CALL_OPS;
> > +
> > count++;
> > /* Must match FTRACE_UPDATE_CALLS in ftrace_modify_all_code() */
> > @@ -2108,8 +2147,9 @@ void ftrace_bug(int failed, struct dyn_ftrace *rec)
> > struct ftrace_ops *ops = NULL;
> > pr_info("ftrace record flags: %lx\n", rec->flags);
> > - pr_cont(" (%ld)%s", ftrace_rec_count(rec),
> > - rec->flags & FTRACE_FL_REGS ? " R" : " ");
> > + pr_cont(" (%ld)%s%s", ftrace_rec_count(rec),
> > + rec->flags & FTRACE_FL_REGS ? " R" : " ",
> > + rec->flags & FTRACE_FL_CALL_OPS ? " O" : " ");
> > if (rec->flags & FTRACE_FL_TRAMP_EN) {
> > ops = ftrace_find_tramp_ops_any(rec);
> > if (ops) {
> > @@ -2177,6 +2217,7 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
> > * want the direct enabled (it will be done via the
> > * direct helper). But if DIRECT_EN is set, and
> > * the count is not one, we need to clear it.
> > + *
> > */
> > if (ftrace_rec_count(rec) == 1) {
> > if (!(rec->flags & FTRACE_FL_DIRECT) !=
> > @@ -2185,6 +2226,19 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
> > } else if (rec->flags & FTRACE_FL_DIRECT_EN) {
> > flag |= FTRACE_FL_DIRECT;
> > }
> > +
> > + /*
> > + * Ops calls are special, as count matters.
> > + * As with direct calls, they must only be enabled when count
> > + * is one, otherwise they'll be handled via the list ops.
> > + */
> > + if (ftrace_rec_count(rec) == 1) {
> > + if (!(rec->flags & FTRACE_FL_CALL_OPS) !=
> > + !(rec->flags & FTRACE_FL_CALL_OPS_EN))
> > + flag |= FTRACE_FL_CALL_OPS;
> > + } else if (rec->flags & FTRACE_FL_CALL_OPS_EN) {
> > + flag |= FTRACE_FL_CALL_OPS;
> > + }
> > }
> > /* If the state of this record hasn't changed, then do nothing */
> > @@ -2229,6 +2283,21 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
> > rec->flags &= ~FTRACE_FL_DIRECT_EN;
> > }
> > }
> > +
> > + if (flag & FTRACE_FL_CALL_OPS) {
> > + if (ftrace_rec_count(rec) == 1) {
> > + if (rec->flags & FTRACE_FL_CALL_OPS)
> > + rec->flags |= FTRACE_FL_CALL_OPS_EN;
> > + else
> > + rec->flags &= ~FTRACE_FL_CALL_OPS_EN;
> > + } else {
> > + /*
> > + * Can only call directly if there's
> > + * only one set of associated ops.
> > + */
> > + rec->flags &= ~FTRACE_FL_CALL_OPS_EN;
> > + }
> > + }
> > }
> > /*
> > @@ -2258,7 +2327,8 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
> > * and REGS states. The _EN flags must be disabled though.
> > */
> > rec->flags &= ~(FTRACE_FL_ENABLED | FTRACE_FL_TRAMP_EN |
> > - FTRACE_FL_REGS_EN | FTRACE_FL_DIRECT_EN);
> > + FTRACE_FL_REGS_EN | FTRACE_FL_DIRECT_EN |
> > + FTRACE_FL_CALL_OPS_EN);
> > }
> > ftrace_bug_type = FTRACE_BUG_NOP;
> > @@ -2431,6 +2501,25 @@ ftrace_find_tramp_ops_new(struct dyn_ftrace *rec)
> > return NULL;
> > }
> > +struct ftrace_ops *
> > +ftrace_find_unique_ops(struct dyn_ftrace *rec)
> > +{
> > + struct ftrace_ops *op, *found = NULL;
> > + unsigned long ip = rec->ip;
> > +
> > + do_for_each_ftrace_op(op, ftrace_ops_list) {
> > +
> > + if (hash_contains_ip(ip, op->func_hash)) {
> > + if (found)
> > + return NULL;
> > + found = op;
> > + }
> > +
> > + } while_for_each_ftrace_op(op);
> > +
> > + return found;
> > +}
> > +
> > #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
> > /* Protected by rcu_tasks for reading, and direct_mutex for writing */
> > static struct ftrace_hash *direct_functions = EMPTY_HASH;
> > @@ -3780,11 +3869,12 @@ static int t_show(struct seq_file *m, void *v)
> > if (iter->flags & FTRACE_ITER_ENABLED) {
> > struct ftrace_ops *ops;
> > - seq_printf(m, " (%ld)%s%s%s",
> > + seq_printf(m, " (%ld)%s%s%s%s",
> > ftrace_rec_count(rec),
> > rec->flags & FTRACE_FL_REGS ? " R" : " ",
> > rec->flags & FTRACE_FL_IPMODIFY ? " I" : " ",
> > - rec->flags & FTRACE_FL_DIRECT ? " D" : " ");
> > + rec->flags & FTRACE_FL_DIRECT ? " D" : " ",
> > + rec->flags & FTRACE_FL_CALL_OPS ? " O" : " ");
> > if (rec->flags & FTRACE_FL_TRAMP_EN) {
> > ops = ftrace_find_tramp_ops_any(rec);
> > if (ops) {
> > @@ -3800,6 +3890,15 @@ static int t_show(struct seq_file *m, void *v)
> > } else {
> > add_trampoline_func(m, NULL, rec);
> > }
> > + if (rec->flags & FTRACE_FL_CALL_OPS_EN) {
> > + ops = ftrace_find_unique_ops(rec);
> > + if (ops) {
> > + seq_printf(m, "\tops: %pS (%pS)",
> > + ops, ops->func);
> > + } else {
> > + seq_puts(m, "\tops: ERROR!");
> > + }
> > + }
> > if (rec->flags & FTRACE_FL_DIRECT) {
> > unsigned long direct;
> >

2023-01-31 01:26:02

by Wangshaobo (bobo)

[permalink] [raw]
Subject: Re: [PATCH v3 1/8] ftrace: Add DYNAMIC_FTRACE_WITH_CALL_OPS



On 2023/1/30 18:25, Mark Rutland wrote:
> On Sat, Jan 28, 2023 at 04:46:48PM +0800, Wangshaobo (bobo) wrote:
> On 2023/1/23 21:45, Mark Rutland wrote:
>>> +config DYNAMIC_FTRACE_WITH_CALL_OPS
>>> + def_bool y
>>> + depends on HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
>>> +
>> Hi Mark,
>
> Hi,
>
>> I have test your patches and it looks fine with my sample module,
>
> Thanks for testing!
>
>> but here setting DYNAMIC_FTRACE_WITH_CALL_OPS to y immutably may increase the
>> .text section size by 5% or more, how about making this to optional^^
>
> We could consider making this optional. I had not made this optional so far as
> in the future I'd like to make this the only implementation of ftrace on arm64
> (once we can drop the old mcount version, and once we've sorted out the
> incompatibility with CFI). In the mean time, it probably makes sense to have
> the option at least to enable testing of each of the two forms.
>
> Is your concern that the overall kernel image size is larger, or do you care
> specifically about the size of the .text section for some reason?
>
> Thanks,
> Mark
Embedded devices may pay more attention to Image size, and that may also
indirectly affect performance. For those reasons, I think it makes even more
sense to have the option, at least for testing.

-- Wang ShaoBo
>
>>
>> -- Wang ShaoBo
>>> config DYNAMIC_FTRACE_WITH_ARGS
>>> def_bool y
>>> depends on DYNAMIC_FTRACE
>>> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
>>> index 442438b93fe9..e634b80f49d1 100644

2023-02-06 09:28:22

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v3 1/8] ftrace: Add DYNAMIC_FTRACE_WITH_CALL_OPS

On Tue, Jan 31, 2023 at 09:25:51AM +0800, Wangshaobo (bobo) wrote:
> On 2023/1/30 18:25, Mark Rutland wrote:
> > On Sat, Jan 28, 2023 at 04:46:48PM +0800, Wangshaobo (bobo) wrote:
> > On 2023/1/23 21:45, Mark Rutland wrote:
> > > > +config DYNAMIC_FTRACE_WITH_CALL_OPS
> > > > + def_bool y
> > > > + depends on HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
> > > > +
> > > Hi Mark,
> >
> > Hi,
> >
> > > I have test your patches and it looks fine with my sample module,
> >
> > Thanks for testing!
> >
> > > but here setting DYNAMIC_FTRACE_WITH_CALL_OPS to y immutably may increase the
> > > .text section size by 5% or more, how about making this to optional^^
> >
> > We could consider making this optional. I had not made this optional so far as
> > in the future I'd like to make this the only implementation of ftrace on arm64
> > (once we can drop the old mcount version, and once we've sorted out the
> > incompatibility with CFI). In the mean time, it probably makes sense to have
> > the option at least to enable testing of each of the two forms.
> >
> > Is your concern that the overall kernel image size is larger, or do you care
> > specifically about the size of the .text section for some reason?
> >
> > Thanks,
> > Mark
> Embedded devices may pay more attention to Image size, and which may also
> indirectly affects performance, for more reason,

I appreciate those concerns, however:

a) For the Image size, the mcount_loc table and associated relocations already
impose a much greater penalty (see the sketch further below). So I'd expect
that where the size truly matters, ftrace would be completely disabled anyway.

I'm currently looking at shrinking the mcount_loc table (and removing the
need for relocations), which should save much more space.

b) For performance, without data this is supposition. Everything so far
indicates that there is not a measurable performance difference, and from
other threads it's possible that the increased function alignment *aids*
performance.

If you have data to the contrary, I'm happy to investigate.
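
As a rough illustration of point (a): __mcount_loc boils down to an array of
call-site addresses bracketed by linker symbols, so it already grows by one
pointer per instrumented function before the relocations needed to fill it in
are counted. A minimal sketch (illustrative only; the helper is hypothetical,
though __start_mcount_loc/__stop_mcount_loc are the generic linker symbols):

extern unsigned long __start_mcount_loc[];
extern unsigned long __stop_mcount_loc[];

/* Rough size of the mcount_loc table: one address per patchable call site. */
static unsigned long mcount_loc_table_bytes(void)
{
	return (unsigned long)__stop_mcount_loc -
	       (unsigned long)__start_mcount_loc;
}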

> I think making sense to have the option for testing is more important.

As above, I'm happy to add an option for functional testing of the ftrace
implementation, but I don't think that it's a good idea to use that as a size
or performance tweak.

Thanks,
Mark.

2023-02-23 09:28:56

by Chen-Yu Tsai

[permalink] [raw]
Subject: Re: [PATCH v3 0/8] arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS

On Mon, Jan 23, 2023 at 01:45:55PM +0000, Mark Rutland wrote:
> Hi Catalin, Steve,
>
> I'm not sure how we want to merge this, so I've moved the core ftrace
> patch to the start of the series so that it can more easily be placed on
> a stable branch if we want that to go via the ftrace tree and the rest
> to go via arm64.
>
> This is cleanly pasing the ftrace selftests from v6.2-rc3 (results in
> the final patch).
>
> Aside from that, usual cover letter below.
>
> This series adds a new DYNAMIC_FTRACE_WITH_CALL_OPS mechanism, and
> enables support for this on arm64. This significantly reduces the
> overhead of tracing when a callsite/tracee has a single associated
> tracer, avoids a number of issues that make it undesireably and
> infeasible to use dynamically-allocated trampolines (e.g. branch range
> limitations), and makes it possible to implement support for
> DYNAMIC_FTRACE_WITH_DIRECT_CALLS in future.
>
> The main idea is to give each ftrace callsite an associated pointer to
> an ftrace_ops. The architecture's ftrace_caller trampoline can recover
> the ops pointer and invoke ops->func from this without needing to use
> ftrace_ops_list_func, which has to iterate through all registered ops.
>
> To make this work, we use -fpatchable-function-entry=M,N, there N NOPs
> are placed before the function entry point. On arm64 NOPs are always 4
> bytes, so by allocating 2 per-function NOPs, we have enough space to
> place a 64-bit value. So that we can manipulate the pointer atomically,
> we need to align instrumented functions to at least 8 bytes, which we
> can ensure using -falign-functions=8.

Does this work all the time? Or is it influenced by other Kconfig
options?

I'm getting random misaligned patch-site warnings like the following:

Misaligned patch-site gic_handle_irq+0x0/0x12c
Misaligned patch-site __traceiter_initcall_level+0x0/0x60
Misaligned patch-site __traceiter_initcall_start+0x0/0x60
Misaligned patch-site __traceiter_initcall_finish+0x0/0x68
Misaligned patch-site do_one_initcall+0x0/0x300
Misaligned patch-site do_one_initcall+0x2b0/0x300
Misaligned patch-site match_dev_by_label+0x0/0x50
Misaligned patch-site match_dev_by_uuid+0x0/0x48
Misaligned patch-site wait_for_initramfs+0x0/0x68
Misaligned patch-site panic_show_mem+0x0/0x88
Misaligned patch-site 0xffffffd3b4fef074
[...]
(I assume the unresolved symbol(s) are from modules.)

The warnings were seen on next-20230223 and many versions before, with
Debian's GCC 12.2 cross compile toolchain. I also tried next-20230223
with Linaro's toolchains gcc-linaro-12.2.1-2023.01-x86_64_aarch64-linux-gnu
and gcc-linaro-13.0.0-2022.11-x86_64_aarch64-linux-gnu and the warnings
appeared as well.

Checking panic_show_mem in various places from the Debian GCC's build:

$ aarch64-linux-gnu-nm init/initramfs.o | grep panic_show_mem
0000000000000070 t panic_show_mem
$ aarch64-linux-gnu-nm init/built-in.a | grep panic_show_mem
0000000000000070 t panic_show_mem
$ aarch64-linux-gnu-nm built-in.a | grep panic_show_mem
0000000000000070 t panic_show_mem
$ aarch64-linux-gnu-nm vmlinux.a | grep panic_show_mem
0000000000000070 t panic_show_mem
$ aarch64-linux-gnu-nm vmlinux.o | grep panic_show_mem
0000000000001534 t panic_show_mem
$ aarch64-linux-gnu-nm vmlinux | grep panic_show_mem
ffffffc0080158dc t panic_show_mem

Looks like individual object files do have functions aligned at 8-byte
boundaries, but when all the object files are collected and linked
together into vmlinux.o, the higher alignment gets dropped and some
functions end up on 4-byte boundaries.
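
To make the failure mode concrete: the ops literal sits in the 8-byte-aligned
slot just before the instrumented function (see the layout quoted below), so
it can only be patched atomically when the function itself is 8-byte aligned.
A hypothetical sketch of the kind of check that would emit the warning above
(the helper name and how the address is obtained are assumptions, not the
actual arm64 code):

static bool ftrace_patch_site_aligned(unsigned long func_addr)
{
	/* The pre-function ops literal needs func_addr to be 8-byte aligned. */
	if (IS_ALIGNED(func_addr, sizeof(u64)))
		return true;

	pr_warn("Misaligned patch-site %pS\n", (void *)func_addr);
	return false;
}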


Regards
ChenYu

>
> Each callsite ends up looking like:
>
> # Aligned to 8 bytes
> func - 8:
> < pointer to ops >
> func:
> BTI // optional
> MOV X9, LR
> NOP // patched to `BL ftrace_caller`
> func_body:
>
> When entering ftrace_caller, the LR points at func_body, and the
> ftrace_ops can be recovered at a negative offset from this the LR value:
>
> BIC <tmp>, LR, 0x7 // Align down (skips BTI)
> LDR <tmp>, [<tmp>, #-16] // load ops pointer
>
> The ftrace_ops::func (and any other ftrace_ops fields) can then be
> recovered from this pointer to the ops.
>
> The first three patches enable the function alignment, working around
> cases where GCC drops alignment for cold functions or when building with
> '-Os'.
>
> The final four patches implement support for
> DYNAMIC_FTRACE_WITH_CALL_OPS on arm64. As noted in the final patch, this
> results in a significant reduction in overhead:
>
> Before this series:
>
> Number of tracers || Total time | Per-call average time (ns)
> Relevant | Irrelevant || (ns) | Total | Overhead
> =========+============++=============+==============+============
> 0 | 0 || 94,583 | 0.95 | -
> 0 | 1 || 93,709 | 0.94 | -
> 0 | 2 || 93,666 | 0.94 | -
> 0 | 10 || 93,709 | 0.94 | -
> 0 | 100 || 93,792 | 0.94 | -
> ---------+------------++-------------+--------------+------------
> 1 | 1 || 6,467,833 | 64.68 | 63.73
> 1 | 2 || 7,509,708 | 75.10 | 74.15
> 1 | 10 || 23,786,792 | 237.87 | 236.92
> 1 | 100 || 106,432,500 | 1,064.43 | 1063.38
> ---------+------------++-------------+--------------+------------
> 1 | 0 || 1,431,875 | 14.32 | 13.37
> 2 | 0 || 6,456,334 | 64.56 | 63.62
> 10 | 0 || 22,717,000 | 227.17 | 226.22
> 100 | 0 || 103,293,667 | 1032.94 | 1031.99
> ---------+------------++-------------+--------------+--------------
>
> Note: per-call overhead is estiamated relative to the baseline case
> with 0 relevant tracers and 0 irrelevant tracers.
>
> After this series:
>
> Number of tracers || Total time | Per-call average time (ns)
> Relevant | Irrelevant || (ns) | Total | Overhead
> =========+============++=============+==============+============
> 0 | 0 || 94,541 | 0.95 | -
> 0 | 1 || 93,666 | 0.94 | -
> 0 | 2 || 93,709 | 0.94 | -
> 0 | 10 || 93,667 | 0.94 | -
> 0 | 100 || 93,792 | 0.94 | -
> ---------+------------++-------------+--------------+------------
> 1 | 1 || 281,000 | 2.81 | 1.86
> 1 | 2 || 281,042 | 2.81 | 1.87
> 1 | 10 || 280,958 | 2.81 | 1.86
> 1 | 100 || 281,250 | 2.81 | 1.87
> ---------+------------++-------------+--------------+------------
> 1 | 0 || 280,959 | 2.81 | 1.86
> 2 | 0 || 6,502,708 | 65.03 | 64.08
> 10 | 0 || 18,681,209 | 186.81 | 185.87
> 100 | 0 || 103,550,458 | 1,035.50 | 1034.56
> ---------+------------++-------------+--------------+------------
>
> Note: per-call overhead is estiamated relative to the baseline case
> with 0 relevant tracers and 0 irrelevant tracers.
>
>
> This version of the series can be found in my kernel.org git repo:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git
>
> Tagged as:
>
> arm64-ftrace-per-callsite-ops-20230113
>
> Since v1 [1]:
> * Fold in Ack from Rafael
> * Update comments/commits with description of the GCC issue
> * Move the cold attribute changes to compiler_types.h
> * Drop the unnecessary changes to the weak attribute
> * Move declaration of ftrace_ops earlier
> * Clean up and improve commit messages
> * Regenerate statistics on misaligned text symbols
>
> Since v2 [2]:
> * Fold in Steve's Reviewed-by tag
> * Move core ftrace patch to the start of the series
> * Add ftrace selftest reults to final patch
> * Use FUNCTION_ALIGNMENT_4B by default
> * Fix commit message typos
>
> [1] https://lore.kernel.org/linux-arm-kernel/[email protected]/
> [2] https://lore.kernel.org/linux-arm-kernel/[email protected]/
>
> Thanks,
> Mark.
>
> Mark Rutland (8):
> ftrace: Add DYNAMIC_FTRACE_WITH_CALL_OPS
> Compiler attributes: GCC cold function alignment workarounds
> ACPI: Don't build ACPICA with '-Os'
> arm64: Extend support for CONFIG_FUNCTION_ALIGNMENT
> arm64: insn: Add helpers for BTI
> arm64: patching: Add aarch64_insn_write_literal_u64()
> arm64: ftrace: Update stale comment
> arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
>
> arch/arm64/Kconfig | 4 +
> arch/arm64/Makefile | 5 +-
> arch/arm64/include/asm/ftrace.h | 15 +--
> arch/arm64/include/asm/insn.h | 1 +
> arch/arm64/include/asm/linkage.h | 4 +-
> arch/arm64/include/asm/patching.h | 2 +
> arch/arm64/kernel/asm-offsets.c | 4 +
> arch/arm64/kernel/entry-ftrace.S | 32 +++++-
> arch/arm64/kernel/ftrace.c | 158 +++++++++++++++++++++++++++-
> arch/arm64/kernel/patching.c | 17 +++
> drivers/acpi/acpica/Makefile | 2 +-
> include/linux/compiler_attributes.h | 6 --
> include/linux/compiler_types.h | 27 +++++
> include/linux/ftrace.h | 18 +++-
> kernel/exit.c | 9 +-
> kernel/trace/Kconfig | 7 ++
> kernel/trace/ftrace.c | 109 ++++++++++++++++++-
> 17 files changed, 380 insertions(+), 40 deletions(-)
>
> --
> 2.30.2
>
>